cs.CV - 2023-11-26

DISYRE: Diffusion-Inspired SYnthetic REstoration for Unsupervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.15453
  • repo_url: None
  • paper_authors: Sergio Naval Marimont, Matthew Baugh, Vasilis Siomos, Christos Tzelepis, Bernhard Kainz, Giacomo Tarroni
  • for: unsupervised anomaly detection in medical images, identifying and localizing anomalies without annotations
  • methods: a diffusion-like pipeline that replaces Gaussian noise corruption with a gradual, synthetic anomaly corruption, so the learned score function generalizes to naturally occurring medical anomalies
  • results: substantially outperforms other methods on two of three common brain MRI UAD benchmarks
    Abstract Unsupervised Anomaly Detection (UAD) techniques aim to identify and localize anomalies without relying on annotations, only leveraging a model trained on a dataset known to be free of anomalies. Diffusion models learn to modify inputs $x$ to increase the probability of it belonging to a desired distribution, i.e., they model the score function $\nabla_x \log p(x)$. Such a score function is potentially relevant for UAD, since $\nabla_x \log p(x)$ is itself a pixel-wise anomaly score. However, diffusion models are trained to invert a corruption process based on Gaussian noise and the learned score function is unlikely to generalize to medical anomalies. This work addresses the problem of how to learn a score function relevant for UAD and proposes DISYRE: Diffusion-Inspired SYnthetic REstoration. We retain the diffusion-like pipeline but replace the Gaussian noise corruption with a gradual, synthetic anomaly corruption so the learned score function generalizes to medical, naturally occurring anomalies. We evaluate DISYRE on three common Brain MRI UAD benchmarks and substantially outperform other methods in two out of the three tasks.
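A minimal sketch of how a restoration-based pixel-wise anomaly score could be computed at inference time, assuming a trained restoration network `restore(x, t)` that pushes an input toward the healthy training distribution at corruption level `t` (the interface is hypothetical, not the authors' implementation):

```python
import torch

def anomaly_map(x, restore, levels=(0.25, 0.5, 0.75)):
    """Pixel-wise anomaly score from restoration residuals.

    Where x already looks healthy, restore(x, t) changes little and
    the residual is small; anomalous regions get "healed" and score
    highly, mirroring the role of the score function grad_x log p(x).
    """
    score = torch.zeros_like(x)
    for t in levels:
        with torch.no_grad():
            x_hat = restore(x, t)      # partial restoration at level t
        score += (x - x_hat).abs()     # accumulate per-pixel residuals
    return score / len(levels)         # average over corruption levels
```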

FLAIR: A Conditional Diffusion Framework with Applications to Face Video Restoration

  • paper_url: http://arxiv.org/abs/2311.15445
  • repo_url: None
  • paper_authors: Zihao Zou, Jiaming Liu, Shirin Shoushtari, Yubo Wang, Weijie Gan, Ulugbek S. Kamilov
  • for: restore perceptually realistic face videos from low-quality inputs
  • methods: conditional diffusion framework called FLAIR, which ensures temporal consistency and uses a recurrent video refinement layer and temporal self-attention
  • results: superior performance compared to current state-of-the-art for video super-resolution, deblurring, JPEG restoration, and space-time frame interpolation on two high-quality face video datasets
    Abstract Face video restoration (FVR) is a challenging but important problem where one seeks to recover perceptually realistic face videos from a low-quality input. While diffusion probabilistic models (DPMs) have been shown to achieve remarkable performance for face image restoration, they often fail to preserve temporally coherent, high-quality videos, compromising the fidelity of reconstructed faces. We present a new conditional diffusion framework called FLAIR for FVR. FLAIR ensures temporal consistency across frames in a computationally efficient fashion by converting a traditional image DPM into a video DPM. The proposed conversion uses a recurrent video refinement layer and a temporal self-attention at different scales. FLAIR also uses a conditional iterative refinement process to balance the perceptual and distortion quality during inference. This process consists of two key components: a data-consistency module that analytically ensures that the generated video precisely matches its degraded observation and a coarse-to-fine image enhancement module specifically for facial regions. Our extensive experiments show the superiority of FLAIR over the current state-of-the-art (SOTA) for video super-resolution, deblurring, JPEG restoration, and space-time frame interpolation on two high-quality face video datasets.
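The data-consistency idea can be illustrated with a generic gradient-style update; the sketch below assumes a known degradation operator `A` with adjoint `At` and is not FLAIR's exact module:

```python
import torch

def data_consistency(x, y, A, At, step=1.0):
    """Pull the current estimate x toward consistency with the
    degraded observation y = A(x_true) + noise.

    A and At are callables for the degradation operator and its
    adjoint (e.g., blur/downsample and its transpose); each step
    analytically reduces the residual ||A(x) - y||.
    """
    return x - step * At(A(x) - y)
```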

Efficient Encoding of Graphics Primitives with Simplex-based Structures

  • paper_url: http://arxiv.org/abs/2311.15439
  • repo_url: None
  • paper_authors: Yibo Wen, Yunfan Yang
  • for: efficient encoding of graphics primitives such as images, signed distance functions, and neural radiance fields
  • methods: simplex-based structure encoding with coordinate transformation, simplicial subdivision, barycentric interpolation, and hash-table feature storage
  • results: 9.4% faster than the instant-ngp baseline on 2D gigapixel image fitting at the same quality and compression rate, with up to a 41.2% speedup in volumetric rendering when samples are dense
    Abstract Grid-based structures are commonly used to encode explicit features for graphics primitives such as images, signed distance functions (SDF), and neural radiance fields (NeRF) due to their simple implementation. However, in $n$-dimensional space, calculating the value of a sampled point requires interpolating the values of its $2^n$ neighboring vertices. The exponential scaling with dimension leads to significant computational overheads. To address this issue, we propose a simplex-based approach for encoding graphics primitives. The number of vertices in a simplex-based structure increases linearly with dimension, making it a more efficient and generalizable alternative to grid-based representations. Using the non-axis-aligned simplicial structure property, we derive and prove a coordinate transformation, simplicial subdivision, and barycentric interpolation scheme for efficient sampling, which resembles transformation procedures in the simplex noise algorithm. Finally, we use hash tables to store multiresolution features of all interest points in the simplicial grid, which are passed into a tiny fully connected neural network to parameterize graphics primitives. We implemented a detailed simplex-based structure encoding algorithm in C++ and CUDA using the methods outlined in our approach. In the 2D image fitting task, the proposed method is capable of fitting a giga-pixel image with 9.4% less time compared to the baseline method proposed by instant-ngp, while maintaining the same quality and compression rate. In the volumetric rendering setup, we observe a maximum 41.2% speedup when the samples are dense enough.
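The efficiency gain comes from needing only n + 1 vertex lookups per sample instead of 2^n. A small sketch of 2D barycentric interpolation on a simplicial grid, following the skew/unskew construction familiar from simplex noise (`feat` is a hypothetical vertex-to-feature lookup standing in for the paper's hash table):

```python
import numpy as np

def simplex_interp_2d(p, feat):
    """Barycentric feature interpolation on a 2D simplicial grid.

    The skew transform maps the simplicial lattice to the integer
    grid; because it is affine, barycentric weights computed in
    skewed space equal those in the original space.
    """
    F = 0.5 * (np.sqrt(3.0) - 1.0)       # skew factor for n = 2
    s = (p[0] + p[1]) * F
    q = np.asarray(p, dtype=float) + s   # skewed coordinates
    i, j = np.floor(q).astype(int)       # lattice cell origin
    u, v = q - (i, j)                    # position inside the cell
    if u > v:                            # lower triangle of the cell
        verts = [(i, j), (i + 1, j), (i + 1, j + 1)]
        w = (1.0 - u, u - v, v)          # barycentric weights, sum to 1
    else:                                # upper triangle
        verts = [(i, j), (i, j + 1), (i + 1, j + 1)]
        w = (1.0 - v, v - u, u)
    return sum(wk * feat[vk] for wk, vk in zip(w, verts))
```

In 2D this means 3 lookups per sample versus 4 for bilinear interpolation; the gap widens quickly with dimension (4 vs. 8 in 3D, 5 vs. 16 in 4D).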

Quality Modeling Under A Relaxed Natural Scene Statistics Model

  • paper_url: http://arxiv.org/abs/2311.15437
  • repo_url: None
  • paper_authors: Abhinau K. Venkataramanan, Alan C. Bovik
  • for: extend existing image quality assessment models to user-generated content on social media, which is typically distorted by multiple unknown impairments
  • methods: information-theoretic IQA models such as the Visual Information Fidelity (VIF) index and Spatio-temporal Reduced Reference Entropic Differences (ST-RRED), grounded in natural scene statistics (NSS) and information theory
  • results: deriving properties of the Multivariate Generalized Gaussian Distribution (MGGD) allows studying the behavior of VIF under a Generalized Gaussian Scale Mixture (GGSM) model, which better captures the distortions of user-generated content
    Abstract Information-theoretic image quality assessment (IQA) models such as Visual Information Fidelity (VIF) and Spatio-temporal Reduced Reference Entropic Differences (ST-RRED) have enjoyed great success by seamlessly integrating natural scene statistics (NSS) with information theory. The Gaussian Scale Mixture (GSM) model that governs the wavelet subband coefficients of natural images forms the foundation for these algorithms. However, the explosion of user-generated content on social media, which is typically distorted by one or more of many possible unknown impairments, has revealed the limitations of NSS-based IQA models that rely on the simple GSM model. Here, we seek to elaborate the VIF index by deriving useful properties of the Multivariate Generalized Gaussian Distribution (MGGD), and using them to study the behavior of VIF under a Generalized GSM (GGSM) model.
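For reference, the classical GSM model and VIF index that the paper generalizes can be stated as follows (standard formulation from the VIF literature, not this paper's notation). Wavelet subband coefficients are modeled as a Gaussian Scale Mixture,

$$\mathbf{C} = \sqrt{Z}\,\mathbf{U}, \qquad \mathbf{U} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_U),$$

with $Z$ a scalar mixing variable. The perceived reference and distorted signals are $\mathbf{E} = \mathbf{C} + \mathbf{N}$ and $\mathbf{F} = g\mathbf{C} + \mathbf{V} + \mathbf{N}'$, where $g$ is a gain, $\mathbf{V}$ models distortion noise, and $\mathbf{N}, \mathbf{N}'$ model neural (HVS) noise. The index is the ratio of the information the test image preserves about the source to the information the reference conveys about itself:

$$\mathrm{VIF} = \frac{\sum_k I\left(\mathbf{C}^k; \mathbf{F}^k \mid z^k\right)}{\sum_k I\left(\mathbf{C}^k; \mathbf{E}^k \mid z^k\right)}.$$

The paper's GGSM relaxation replaces the Gaussian vector $\mathbf{U}$ with a multivariate generalized Gaussian, whose properties (MGGD) are derived to study how VIF behaves under this broader model.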

Functional Diffusion

  • paper_url: http://arxiv.org/abs/2311.15435
  • repo_url: https://github.com/nipreps/niworkflows
  • paper_authors: Biao Zhang, Peter Wonka
  • for: propose a new class of generative diffusion models, called functional diffusion, that operates on samples represented as functions with a continuous domain
  • methods: derive the necessary foundations and provide a first implementation based on the transformer architecture
  • results: generative results on complicated signed distance functions and deformation functions defined on 3D surfaces
    Abstract We propose a new class of generative diffusion models, called functional diffusion. In contrast to previous work, functional diffusion works on samples that are represented by functions with a continuous domain. Functional diffusion can be seen as an extension of classical diffusion models to an infinite-dimensional domain. Functional diffusion is very versatile as images, videos, audio, 3D shapes, deformations, etc., can be handled by the same framework with minimal changes. In addition, functional diffusion is especially suited for irregular data or data defined in non-standard domains. In our work, we derive the necessary foundations for functional diffusion and propose a first implementation based on the transformer architecture. We show generative results on complicated signed distance functions and deformation functions defined on 3D surfaces.

Data-Driven Modelling for Harmonic Current Emission in Low-Voltage Grid Using MCReSANet with Interpretability Analysis

  • paper_url: http://arxiv.org/abs/2311.15420
  • repo_url: None
  • paper_authors: Jieyu Yao, Hao Yu, Paul Judge, Jiabin Jia, Sasa Djokic, Verner Püvi, Matti Lehtonen, Jan Meyer
  • for: model the relationship between harmonic voltage and current emission in low-voltage grids, where interactions among diverse loads make analytical models impractical and hinder power-quality optimization
  • methods: a data-driven model using MCReSANet to construct the nonlinear mapping between harmonic voltage and current, validated on PCC datasets from Finland and Germany and interpreted with SHAP value-based feature importance analysis
  • results: MCReSANet establishes accurate nonlinear mappings under differing network characteristics, improving MAE by 10% and 14% over a CNN and by 8% and 17% over an MLP on the Finnish and German datasets, with much lower model uncertainty, enabling more precise feature importance analysis
    Abstract Even though the use of power electronics (PE) loads offers enhanced electrical energy conversion efficiency and control, they remain the primary sources of harmonics in grids. When diverse loads are connected in the distribution system, their interactions complicate establishing analytical models for the relationship between harmonic voltages and currents. To solve this, our paper presents a data-driven model using MCReSANet to construct the highly nonlinear mapping between harmonic voltage and current. Two datasets from PCCs in Finland and Germany are utilized, demonstrating that MCReSANet is capable of establishing accurate nonlinear mappings, even in the presence of various network characteristics, for the selected Finnish and German datasets. The model built by MCReSANet can improve the MAE by 10% and 14% compared to the CNN, and by 8% and 17% compared to the MLP for the Finnish and German datasets respectively, also showing much lower model uncertainty than others. This is a crucial prerequisite for more precise SHAP value-based feature importance analysis, which is the method used for model interpretability analysis in this paper. The results of the feature importance analysis show the detailed relationships between each order of harmonic voltage and current in the distribution system. There is an interactive impact on each order of harmonic current, but some orders of harmonic voltages have a dominant influence on harmonic current emissions: positive sequence and zero sequence harmonics have the dominant importance in the Finnish and German networks, respectively, which conforms to the pattern of connected load types in the two selected datasets. This paper enhances the potential for understanding and predicting harmonic current emissions by diverse PE loads in distribution systems, which is beneficial to more effective management for optimizing power quality in diverse grid environments.
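A minimal sketch of the kind of SHAP-based feature importance analysis described, using the open-source `shap` package; the `predict` function is a placeholder for the trained MCReSANet, and all names and shapes are illustrative:

```python
import numpy as np
import shap  # pip install shap

def predict(V):
    """Placeholder for the trained model mapping per-order harmonic
    voltage features (one column per harmonic order) to a harmonic
    current magnitude."""
    return V @ np.linspace(1.0, 0.1, V.shape[1])

V_background = np.random.rand(50, 10)   # background sample of PCC data
explainer = shap.KernelExplainer(predict, V_background)
shap_values = explainer.shap_values(np.random.rand(5, 10))
# Mean |SHAP| per column ranks the harmonic-voltage orders by their
# influence on the predicted current emission.
importance = np.abs(shap_values).mean(axis=0)
```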

GAN-Based LiDAR Intensity Simulation

  • paper_url: http://arxiv.org/abs/2311.15415
  • repo_url: None
  • paper_authors: Richard Marcus, Felix Gabel, Niklas Knoop, Marc Stamminger
  • for: practical vehicle sensor simulation for autonomous driving development
  • methods: train a GAN on paired camera images and LiDAR scans from real test drives, using segmentation data and dense depth maps derived from the camera images as additional training input
  • results: realistic simulated LiDAR point clouds; an object detection network generalizes well between real and synthetic point clouds, enabling evaluation without ground-truth point clouds
    Abstract Realistic vehicle sensor simulation is an important element in developing autonomous driving. As physics-based implementations of visual sensors like LiDAR are complex in practice, data-based approaches promise solutions. Using pairs of camera images and LiDAR scans from real test drives, GANs can be trained to translate between them. For this process, we contribute two additions. First, we exploit the camera images, acquiring segmentation data and dense depth maps as additional input for training. Second, we test the performance of the LiDAR simulation by testing how well an object detection network generalizes between real and synthetic point clouds to enable evaluation without ground truth point clouds. Combining both, we simulate LiDAR point clouds and demonstrate their realism.

KOPPA: Improving Prompt-based Continual Learning with Key-Query Orthogonal Projection and Prototype-based One-Versus-All

  • paper_url: http://arxiv.org/abs/2311.15414
  • repo_url: None
  • paper_authors: Quyen Tran, Lam Tran, Khoat Than, Toan Tran, Dinh Phung, Trung Le
  • for: improve prompt-based continual learning by controlling correlations between old-task queries and future-task keys and mitigating feature shift
  • methods: builds on pre-trained ViT prompt methods with key-query matching, adding an orthogonal-projection key-query learning strategy and a One-Versus-All (OVA) prototype-based classification head
  • results: surpasses current state-of-the-art approaches by a margin of up to 20% on benchmark datasets
    Abstract Drawing inspiration from prompt tuning techniques applied to Large Language Models, recent methods based on pre-trained ViT networks have achieved remarkable results in the field of Continual Learning. Specifically, these approaches propose to maintain a set of prompts and allocate a subset of them to learn each task using a key-query matching strategy. However, they may encounter limitations when lacking control over the correlations between old task queries and keys of future tasks, the shift of features in the latent space, and the relative separation of latent vectors learned in independent tasks. In this work, we introduce a novel key-query learning strategy based on orthogonal projection, inspired by model-agnostic meta-learning, to enhance prompt matching efficiency and address the challenge of shifting features. Furthermore, we introduce a One-Versus-All (OVA) prototype-based component that enhances the classification head distinction. Experimental results on benchmark datasets demonstrate that our method empowers the model to achieve results surpassing those of current state-of-the-art approaches by a large margin of up to 20%.
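The orthogonal-projection idea can be illustrated as follows: if gradient updates to a new task's prompt keys are projected onto the subspace orthogonal to the queries of earlier tasks, the old key-query matching scores cannot drift. This is a generic sketch of the principle, not the authors' exact procedure:

```python
import torch

def project_out_old_queries(grad, Q_old):
    """Remove from a key gradient any component lying in the span of
    old-task queries (rows of Q_old, shape (n_old, d)).

    Updating a prompt key only along the returned direction leaves
    its inner product with every old query unchanged, so old-task
    key-query matching scores cannot drift.
    """
    Qb, _ = torch.linalg.qr(Q_old.T)       # (d, r) orthonormal basis
    return grad - Qb @ (Qb.T @ grad)       # keep orthogonal component
```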

ConstraintMatch for Semi-constrained Clustering

  • paper_url: http://arxiv.org/abs/2311.15395
  • repo_url: https://github.com/slds-lmu/constraintmatch
  • paper_authors: Jann Goschenhofer, Bernd Bischl, Zsolt Kira
  • for: train classification models from pairwise constraints, which are weak and relatively easy to mine, without requiring fully labeled data
  • methods: ConstraintMatch, which leverages a large amount of unconstrained data alongside a smaller constraint set, with a pseudo-constraining mechanism to counter confirmation bias
  • results: outperforms relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks
    Abstract Constrained clustering allows the training of classification models using pairwise constraints only, which are weak and relatively easy to mine, while still yielding full-supervision-level model performance. While they perform well even in the absence of the true underlying class labels, constrained clustering models still require large amounts of binary constraint annotations for training. In this paper, we propose a semi-supervised context whereby a large amount of \textit{unconstrained} data is available alongside a smaller set of constraints, and propose \textit{ConstraintMatch} to leverage such unconstrained data. While a great deal of progress has been made in semi-supervised learning using full labels, there are a number of challenges that prevent a naive application of the resulting methods in the constraint-based label setting. Therefore, we reason about and analyze these challenges, specifically 1) proposing a \textit{pseudo-constraining} mechanism to overcome the confirmation bias, a major weakness of pseudo-labeling, 2) developing new methods for pseudo-labeling towards the selection of \textit{informative} unconstrained samples, 3) showing that this also allows the use of pairwise loss functions for the initial and auxiliary losses which facilitates semi-constrained model training. In extensive experiments, we demonstrate the effectiveness of ConstraintMatch over relevant baselines in both the regular clustering and overclustering scenarios on five challenging benchmarks and provide analyses of its several components.
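Pairwise losses of the kind mentioned can be built directly on cluster assignment probabilities. The sketch below shows one common formulation from the constrained-clustering literature (the inner product of two softmax vectors as the same-cluster probability); it is illustrative, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def pairwise_constraint_loss(logits_i, logits_j, must_link):
    """Pairwise loss on cluster assignment probabilities.

    The inner product of two softmax assignment vectors is the
    probability that both samples fall in the same cluster; it is
    pushed up for must-link pairs and down for cannot-link pairs.
    must_link: boolean tensor, one entry per pair.
    """
    p_i = F.softmax(logits_i, dim=-1)
    p_j = F.softmax(logits_j, dim=-1)
    same = (p_i * p_j).sum(dim=-1).clamp(1e-8, 1 - 1e-8)
    return torch.where(must_link, -same.log(), -(1 - same).log()).mean()
```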

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

  • paper_url: http://arxiv.org/abs/2311.15383
  • repo_url: https://github.com/CurryYuan/ZSVG3D
  • paper_authors: Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
  • for: localize 3D objects from textual descriptions (3D visual grounding)
  • methods: a novel visual programming approach leveraging large language models (LLMs) for zero-shot, open-vocabulary 3DVG: a dialog-based method establishes a foundational understanding, followed by a visual program with three module types (view-independent, view-dependent, and functional) tailored to 3D scenes for complex reasoning and inference
  • results: the zero-shot approach outperforms some supervised baselines on open-vocabulary 3DVG
    Abstract 3D Visual Grounding (3DVG) aims at localizing 3D object based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.

Flow-Guided Diffusion for Video Inpainting

  • paper_url: http://arxiv.org/abs/2311.15368
  • repo_url: https://github.com/nevsnev/fgdvi
  • paper_authors: Bohai Gu, Yongsheng Yu, Heng Fan, Libo Zhang
  • for: improve video inpainting quality and efficiency in challenging scenarios such as large motion and low light
  • methods: a flow-guided diffusion model for video inpainting (FGDVI) built on an off-the-shelf image diffusion model, with precise one-step latent propagation via optical flow and a model-agnostic flow-guided latent interpolation technique that accelerates denoising and integrates with any video diffusion model (VDM) without additional training
  • results: a 10% improvement in flow warping error E_warp over existing state-of-the-art methods, with superior performance validated by extensive experiments
    Abstract Video inpainting has been challenged by complex scenarios like large movements and low-light conditions. Current methods, including emerging diffusion models, face limitations in quality and efficiency. This paper introduces the Flow-Guided Diffusion model for Video Inpainting (FGDVI), a novel approach that significantly enhances temporal consistency and inpainting quality via reusing an off-the-shelf image generation diffusion model. We employ optical flow for precise one-step latent propagation and introduces a model-agnostic flow-guided latent interpolation technique. This technique expedites denoising, seamlessly integrating with any Video Diffusion Model (VDM) without additional training. Our FGDVI demonstrates a remarkable 10% improvement in flow warping error E_warp over existing state-of-the-art methods. Our comprehensive experiments validate superior performance of FGDVI, offering a promising direction for advanced video inpainting. The code and detailed results will be publicly available in https://github.com/NevSNev/FGDVI.
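Flow-guided latent propagation amounts to backward-warping the previous frame's latent with optical flow. A minimal sketch using PyTorch's `grid_sample` (resolution handling and occlusion masking omitted; not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def warp_latent(z, flow):
    """Backward-warp a latent feature map with optical flow.

    z: (B, C, H, W) latent from the previous frame; flow: (B, 2, H, W)
    field mapping current-frame positions to previous-frame ones,
    both at latent resolution.
    """
    B, _, H, W = z.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=z.device, dtype=z.dtype),
        torch.arange(W, device=z.device, dtype=z.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]      # sample locations, x
    grid_y = ys.unsqueeze(0) + flow[:, 1]      # sample locations, y
    grid = torch.stack(                        # normalize to [-1, 1]
        (2 * grid_x / (W - 1) - 1, 2 * grid_y / (H - 1) - 1), dim=-1
    )
    return F.grid_sample(z, grid, align_corners=True)
```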

BatchNorm-based Weakly Supervised Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.15367
  • repo_url: https://github.com/cool-xuan/bn-wvad
  • paper_authors: Yixuan Zhou, Yi Qu, Xing Xu, Fumin Shen, Jingkuan Song, Hengtao Shen
  • for: improve weakly supervised video anomaly detection (WVAD), where only video-level labels indicating the presence or absence of abnormal events are available
  • methods: BN-WVAD, which incorporates BatchNorm into WVAD, using the Divergence of Feature from Mean vector (DFM) as a reliable abnormality criterion for discerning abnormal snippets, together with a batch-level selection strategy that filters more abnormal snippets from videos containing more abnormal events
  • results: state-of-the-art performance on UCF-Crime (87.24% AUC) and XD-Violence (up to 84.93% AP)
    Abstract In weakly supervised video anomaly detection (WVAD), where only video-level labels indicating the presence or absence of abnormal events are available, the primary challenge arises from the inherent ambiguity in temporal annotations of abnormal occurrences. Inspired by the statistical insight that temporal features of abnormal events often exhibit outlier characteristics, we propose a novel method, BN-WVAD, which incorporates BatchNorm into WVAD. In the proposed BN-WVAD, we leverage the Divergence of Feature from Mean vector (DFM) of BatchNorm as a reliable abnormality criterion to discern potential abnormal snippets in abnormal videos. The proposed DFM criterion is also discriminative for anomaly recognition and more resilient to label noise, serving as the additional anomaly score to amend the prediction of the anomaly classifier that is susceptible to noisy labels. Moreover, a batch-level selection strategy is devised to filter more abnormal snippets in videos where more abnormal events occur. The proposed BN-WVAD model demonstrates state-of-the-art performance on UCF-Crime with an AUC of 87.24%, and XD-Violence, where AP reaches up to 84.93%. Our code implementation is accessible at https://github.com/cool-xuan/BN-WVAD.
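The DFM criterion can be sketched as the distance of a snippet's features from the BatchNorm running statistics, which are dominated by normal data; the snippet-level interface below is an assumption for illustration:

```python
import torch

def dfm_score(features, bn):
    """Divergence-of-Feature-from-Mean (DFM) abnormality score.

    Snippets whose features lie far from the BatchNorm running
    statistics (fit mostly on normal data) are treated as outliers.
    features: (N, C) snippet features feeding the given nn.BatchNorm1d.
    """
    mu = bn.running_mean                         # (C,) running mean
    sigma = torch.sqrt(bn.running_var + bn.eps)  # (C,) running std
    return ((features - mu) / sigma).norm(dim=-1)  # per-snippet score
```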

Ultra-Range Gesture Recognition using an RGB Camera in Human-Robot Interaction

  • paper_url: http://arxiv.org/abs/2311.15361
  • repo_url: None
  • paper_authors: Eran Bamani, Eden Nissinman, Inbar Meir, Lisa Koenigsberg, Avishai Sintov
  • for: addresses the Ultra-Range Gesture Recognition (URGR) problem in Human-Robot Interaction (HRI), targeting a recognition distance of up to 25 meters with a simple RGB camera
  • methods: a novel super-resolution model (HQ-Net) enhances the low-resolution image of the user, followed by a novel URGR classifier, the Graph Vision Transformer (GViT), which combines a Graph Convolutional Network (GCN) with a modified Vision Transformer (ViT)
  • results: a 98.1% recognition rate over diverse test data, exceeding human recognition at ultra-range distances; demonstrated by directing an autonomous quadruped robot with human gestures in complex ultra-range indoor and outdoor environments
    Abstract Hand gestures play a significant role in human interactions where non-verbal intentions, thoughts and commands are conveyed. In Human-Robot Interaction (HRI), hand gestures offer a similar and efficient medium for conveying clear and rapid directives to a robotic agent. However, state-of-the-art vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters. Such a short distance range limits practical HRI with, for example, service robots, search and rescue robots and drones. In this work, we address the Ultra-Range Gesture Recognition (URGR) problem by aiming for a recognition distance of up to 25 meters and in the context of HRI. We propose a novel deep-learning framework for URGR using solely a simple RGB camera. First, a novel super-resolution model termed HQ-Net is used to enhance the low-resolution image of the user. Then, we propose a novel URGR classifier termed Graph Vision Transformer (GViT) which takes the enhanced image as input. GViT combines the benefits of a Graph Convolutional Network (GCN) and a modified Vision Transformer (ViT). Evaluation of the proposed framework over diverse test data yields a high recognition rate of 98.1%. The framework has also exhibited superior performance compared to human recognition in ultra-range distances. With the framework, we analyze and demonstrate the performance of an autonomous quadruped robot directed by human gestures in complex ultra-range indoor and outdoor environments.

Adversarial Purification of Information Masking

  • paper_url: http://arxiv.org/abs/2311.15339
  • repo_url: https://github.com/nowindbutrain/impure
  • paper_authors: Sitong Liu, Zhichao Lian, Shuangquan Zhang, Liang Xiao
  • for: defend against adversarial attacks by purifying input images to eliminate imperceptible perturbations and increase the robustness of neural networks
  • methods: Information Mask Purification (IMPure): mask part of the patches in the input image, reconstruct the patches to resist adversarial perturbations, and simulate potential similar regional perturbations to protect the purified samples
  • results: state-of-the-art results against nine adversarial attack methods on the ImageNet dataset with three classifier models
    Abstract Adversarial attacks meticulously generate minuscule, imperceptible perturbations to images to deceive neural networks. Counteracting these, adversarial purification methods seek to transform adversarial input samples into clean output images to defend against adversarial attacks. Nonetheless, extant generative models fail to effectively eliminate adversarial perturbations, yielding less-than-ideal purification results. We emphasize the potential threat of residual adversarial perturbations to target models, quantitatively establishing a relationship between perturbation scale and attack capability. Notably, the residual perturbations on the purified image primarily stem from the same-position patch and similar patches of the adversarial sample. We propose a novel adversarial purification approach named Information Mask Purification (IMPure), which aims to extensively eliminate adversarial perturbations. Given an adversarial sample, we first mask part of the patch information, then reconstruct the patches to resist adversarial perturbations. We reconstruct all patches in parallel to obtain a cohesive image. Then, in order to protect the purified samples against potential similar regional perturbations, we simulate this risk by randomly mixing the purified samples with the input samples before inputting them into the feature extraction network. Finally, we establish a combined constraint of pixel loss and perceptual loss to augment the model's reconstruction adaptability. Extensive experiments on the ImageNet dataset with three classifier models demonstrate that our approach achieves state-of-the-art results against nine adversarial attack methods. Implementation code and pre-trained weights can be accessed at https://github.com/NoWindButRain/IMPure.
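The masking step can be sketched as zeroing a random subset of non-overlapping patches, which are then handed to a reconstruction network; parameter names are illustrative:

```python
import torch

def mask_patches(x, patch=16, ratio=0.5):
    """Randomly zero out roughly `ratio` of non-overlapping patches.

    Masked patches are later reconstructed, discarding the adversarial
    perturbation they carried. x: (B, C, H, W) with H, W divisible by
    `patch`; also returns the binary keep-mask for the reconstructor.
    """
    B, C, H, W = x.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, gh, gw, device=x.device) > ratio).float()
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return x * mask, mask
```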

How much data do I need? A case study on medical data

  • paper_url: http://arxiv.org/abs/2311.15331
  • repo_url: None
  • paper_authors: Ayse Betul Cengiz, A. Stephen McGough
  • for: test two commonly held adages in deep learning: that more data gives better results, and that transfer learning helps when data is scarce
  • methods: train a ResNet18 on varying subsets of six medical and six general datasets to evaluate "more data gives better results"; use eleven of these datasets as transfer-learning sources for subsets of a twelfth (Chest), including multi-stage transfer learning
  • results: more data can bring diminishing returns, and a poor choice of source dataset for transfer learning can worsen performance, with datasets seemingly similar to Chest sometimes giving worse results than dissimilar ones; multi-stage transfer learning likewise reveals complex relationships between datasets
    Abstract The collection of data to train a Deep Learning network is costly in terms of effort and resources. In many cases, especially in a medical context, it may have detrimental impacts, such as requiring invasive medical procedures or processes which could in themselves cause medical harm. However, Deep Learning is seen as a data hungry method. Here, we look at two commonly held adages: i) more data gives better results and ii) transfer learning will aid you when you don't have enough data. These are widely assumed to be true and used as evidence for choosing how to solve a problem when Deep Learning is involved. We evaluate six medical datasets and six general datasets, training a ResNet18 network on varying subsets of these datasets to evaluate `more data gives better results'. We take eleven of these datasets as the sources for Transfer Learning on subsets of the twelfth dataset -- Chest -- in order to determine whether Transfer Learning is universally beneficial. We go further to see whether multi-stage Transfer Learning provides a consistent benefit. Our analysis shows that the real situation is more complex than these simple adages -- more data could lead to a case of diminishing returns and an incorrect choice of dataset for transfer learning can lead to worse performance, with datasets which we would consider highly similar to the Chest dataset giving worse results than datasets which are more dissimilar. Multi-stage transfer learning likewise reveals complex relationships between datasets.

BS-Diff: Effective Bone Suppression Using Conditional Diffusion Models from Chest X-Ray Images

  • paper_url: http://arxiv.org/abs/2311.15328
  • repo_url: None
  • paper_authors: Zhanghao Chen, Yifei Sun, Wenjian Qin, Ruiquan Ge, Cheng Pan, Wenming Deng, Zhou Liu, Wenwen Min, Ahmed Elazab, Xiang Wan, Changmiao Wang
  • for: improve the diagnostic value of chest X-rays (CXRs) for lung disease by suppressing overlapping bone structures
  • methods: a new bone suppression framework, BS-Diff, comprising a conditional diffusion model with a U-Net architecture and a simple enhancement module incorporating an autoencoder
  • results: outperforms previous bone-suppression models on multiple metrics while capturing fine image details
    Abstract Chest X-rays (CXRs) are commonly utilized as a low-dose modality for lung screening. Nonetheless, the efficacy of CXRs is somewhat impeded, given that approximately 75% of the lung area overlaps with bone, which in turn hampers the detection and diagnosis of diseases. As a remedial measure, bone suppression techniques have been introduced. The current dual-energy subtraction imaging technique in the clinic requires costly equipment and exposes subjects to high radiation. To circumvent these issues, deep learning-based image generation algorithms have been proposed. However, existing methods fall short in terms of producing high-quality images and capturing texture details, particularly with pulmonary vessels. To address these issues, this paper proposes a new bone suppression framework, termed BS-Diff, that comprises a conditional diffusion model equipped with a U-Net architecture and a simple enhancement module to incorporate an autoencoder. Our proposed network can not only generate soft tissue images with a high bone suppression rate but also possesses the capability to capture fine image details. Additionally, we compiled the largest dataset since 2010, including data from 120 patients with high-definition, high-resolution paired CXRs and soft tissue images collected by our affiliated hospital. Extensive experiments, comparative analyses, ablation studies, and clinical evaluations indicate that the proposed BS-Diff outperforms several bone-suppression models across multiple metrics.

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

  • paper_url: http://arxiv.org/abs/2311.16194
  • repo_url: None
  • paper_authors: Jiawang Bai, Kuofeng Gao, Shaobo Min, Shu-Tao Xia, Zhifeng Li, Wei Liu
  • for: implant backdoors into CLIP models used for downstream image recognition tasks
  • methods: inject the backdoor during the prompt learning stage, using a learnable image trigger and a trigger-aware context generator that influence both the image and text encoders
  • results: attack success rate above 99% in most cases across 11 datasets, generalizing to unseen classes and to cross-dataset and cross-domain settings
    Abstract Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, one victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method named BadCLIP is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to those of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes, and shows a strong generalization capability under cross-dataset and cross-domain settings.
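On the image side, a learnable trigger is typically blended into clean images during poisoned training; this generic sketch shows only the blending step (the paper's trigger-aware context generator on the text side is omitted, and the parameterization is an assumption):

```python
import torch

# Learnable trigger patch and a fixed blending mask confined to a
# corner of the image (shapes and placement are illustrative).
trigger = torch.zeros(3, 224, 224, requires_grad=True)
mask = torch.zeros(1, 224, 224)
mask[:, -32:, -32:] = 1.0              # bottom-right 32x32 region

def apply_trigger(images):
    """Blend the trigger into a (B, 3, 224, 224) batch of images."""
    return images * (1 - mask) + trigger * mask
```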

AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset

  • paper_url: http://arxiv.org/abs/2311.15308
  • repo_url: https://github.com/controlnet/av-deepfake1m
  • paper_authors: Zhixi Cai, Shreya Ghosh, Aman Pankaj Adatia, Munawar Hayat, Abhinav Dhall, Kalin Stefanov
  • for: provide a large-scale audio-visual deepfake dataset to advance the detection and localization of small manipulated segments embedded in real videos
  • methods: a content-driven generation strategy producing video, audio, and audio-visual manipulations for more than 2K subjects, totaling over 1M videos
  • results: state-of-the-art detection and localization methods show a significant performance drop on the proposed dataset, indicating its value for building next-generation deepfake localization methods
    Abstract The detection and localization of highly realistic deepfake audio-visual content are challenging even for the most advanced state-of-the-art methods. While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. The dataset contains content-driven (i) video manipulations, (ii) audio manipulations, and (iii) audio-visual manipulations for more than 2K subjects resulting in a total of more than 1M videos. The paper provides a thorough description of the proposed data generation pipeline accompanied by a rigorous analysis of the quality of the generated data. The comprehensive benchmark of the proposed dataset utilizing state-of-the-art deepfake detection and localization methods indicates a significant drop in performance compared to previous datasets. The proposed dataset will play a vital role in building the next-generation deepfake localization methods. The dataset and associated code are available at https://github.com/ControlNet/AV-Deepfake1M .

Sketch Video Synthesis

  • paper_url: http://arxiv.org/abs/2311.15306
  • repo_url: https://github.com/yudianzheng/sketchvideo
  • paper_authors: Yudian Zheng, Xiaodong Cun, Menghan Xia, Chi-Man Pun
  • for: generate sketch-style animations of videos with strong visual abstraction and temporal consistency
  • methods: a novel optimization-based framework representing each frame with frame-wise Bézier curves, using cross-frame stroke initialization, a semantic loss based on CLIP features, and a newly designed consistency loss from a self-decomposed 2D atlas network
  • results: sketch videos with impressive visual abstraction and temporal coherence; converting videos into SVG lines also enables sketch-based video editing and video doodling through video composition
    Abstract Understanding semantic intricacies and high-level concepts is essential in image sketch generation, and this challenge becomes even more formidable when applied to the domain of videos. To address this, we propose a novel optimization-based framework for sketching videos represented by the frame-wise B\'ezier curve. In detail, we first propose a cross-frame stroke initialization approach to warm up the location and the width of each curve. Then, we optimize the locations of these curves by utilizing a semantic loss based on CLIP features and a newly designed consistency loss using the self-decomposed 2D atlas network. Built upon these design elements, the resulting sketch video showcases impressive visual abstraction and temporal coherence. Furthermore, by transforming a video into SVG lines through the sketching process, our method unlocks applications in sketch-based video editing and video doodling, enabled through video composition, as exemplified in the teaser.
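Each stroke is a Bézier curve whose control points are the optimization variables. A minimal sampler for a cubic stroke (width handling and rasterization omitted):

```python
import numpy as np

def cubic_bezier(ctrl, n=32):
    """Sample a cubic Bezier stroke from its 4 control points.

    ctrl is a (4, 2) array of 2D control points; the returned (n, 2)
    polyline is what a differentiable rasterizer would draw, with the
    control points (and stroke widths) left free for optimization.
    """
    t = np.linspace(0.0, 1.0, n)[:, None]        # (n, 1) parameter
    return (
        (1 - t) ** 3 * ctrl[0]                   # Bernstein basis terms
        + 3 * (1 - t) ** 2 * t * ctrl[1]
        + 3 * (1 - t) * t**2 * ctrl[2]
        + t**3 * ctrl[3]
    )
```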

Eye Disease Prediction using Ensemble Learning and Attention on OCT Scans

  • paper_url: http://arxiv.org/abs/2311.15301
  • repo_url: None
  • paper_authors: Gauri Naik, Nandini Narvekar, Dimple Agarwal, Nishita Nandanwar, Himangi Pande
  • for: improve the efficiency of eye disease detection and diagnosis, supporting early detection and timely intervention
  • methods: an end-to-end web application combining machine learning and deep learning with Optical Coherence Tomography (OCT) imaging: a trained custom UNet segments raw OCT scans, and an ensemble of InceptionV3 and Xception networks enhanced with a self-attention layer performs classification
  • results: accurate prediction and classification of eye diseases, distinguishing normal eyes from choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen
    Abstract Eye diseases have posed significant challenges for decades, but advancements in technology have opened new avenues for their detection and treatment. Machine learning and deep learning algorithms have become instrumental in this domain, particularly when combined with Optical Coherence Tomography (OCT) imaging. We propose a novel method for efficient detection of eye diseases from OCT images. Our technique enables the classification of patients into disease free (normal eyes) or affected by specific conditions such as Choroidal Neovascularization (CNV), Diabetic Macular Edema (DME), or Drusen. In this work, we introduce an end-to-end web application that utilizes machine learning and deep learning techniques for efficient eye disease prediction. The application allows patients to submit their raw OCT scanned images, which undergo segmentation using a trained custom UNet model. The segmented images are then fed into an ensemble model, comprising InceptionV3 and Xception networks, enhanced with a self attention layer. This self attention approach leverages the feature maps of individual models to achieve improved classification accuracy. The ensemble model's output is aggregated to predict and classify various eye diseases. Extensive experimentation and optimization have been conducted to ensure the application's efficiency and optimal performance. Our results demonstrate the effectiveness of the proposed approach in accurate eye disease prediction. The developed web application holds significant potential for early detection and timely intervention, thereby contributing to improved eye healthcare outcomes.

Obj-NeRF: Extract Object NeRFs from Multi-view Images

  • paper_url: http://arxiv.org/abs/2311.15291
  • repo_url: None
  • paper_authors: Zhiyi Li, Lihe Ding, Tianfan Xue
  • for: extract the radiance field of a specific object from multi-view images, easing downstream applications such as NeRF editing and 3D mesh extraction
  • methods: a comprehensive pipeline combining the 2D segmentation of the Segment Anything Model (SAM) with the 3D reconstruction of NeRF: multi-view segmentations of the indicated object are obtained from a single prompt and then supervise NeRF construction, together with several effective techniques
  • results: a large object-level NeRF dataset containing diverse objects, useful for downstream tasks; Obj-NeRF is applied to object removal, rotation, replacement, and recoloring
    Abstract Neural Radiance Fields (NeRFs) have demonstrated remarkable effectiveness in novel view synthesis within 3D environments. However, extracting a radiance field of one specific object from multi-view images encounters substantial challenges due to occlusion and background complexity, thereby presenting difficulties in downstream applications such as NeRF editing and 3D mesh extraction. To solve this problem, in this paper, we propose Obj-NeRF, a comprehensive pipeline that recovers the 3D geometry of a specific object from multi-view images using a single prompt. This method combines the 2D segmentation capabilities of the Segment Anything Model (SAM) in conjunction with the 3D reconstruction ability of NeRF. Specifically, we first obtain multi-view segmentation for the indicated object using SAM with a single prompt. Then, we use the segmentation images to supervise NeRF construction, integrating several effective techniques. Additionally, we construct a large object-level NeRF dataset containing diverse objects, which can be useful in various downstream tasks. To demonstrate the practicality of our method, we also apply Obj-NeRF to various applications, including object removal, rotation, replacement, and recoloring.

Efficient Rehearsal Free Zero Forgetting Continual Learning using Adaptive Weight Modulation

  • paper_url: http://arxiv.org/abs/2311.15276
  • repo_url: None
  • paper_authors: Yonatan Sverdlov, Shimon Ullman
  • for: address continual learning, where a network must acquire multiple tasks over an extended period without catastrophic forgetting
  • methods: create task-specific modulation parameters for each task; only these are learnable while training on consecutive tasks, so previously learned weights stay untouched
  • results: maximizes new-task performance with zero forgetting, outperforming other multi-task models on novel tasks that are difficult for them
    Abstract Artificial neural networks encounter a notable challenge known as continual learning, which involves acquiring knowledge of multiple tasks over an extended period. This challenge arises due to the tendency of previously learned weights to be adjusted to suit the objectives of new tasks, resulting in a phenomenon called catastrophic forgetting. Most approaches to this problem seek a balance between maximizing performance on the new tasks and minimizing the forgetting of previous tasks. In contrast, our approach attempts to maximize the performance of the new task, while ensuring zero forgetting. This is accomplished by creating a task-specific modulation parameters for each task. Only these would be learnable parameters during learning of consecutive tasks. Through comprehensive experimental evaluations, our model demonstrates superior performance in acquiring and retaining novel tasks that pose difficulties for other multi-task models. This emphasizes the efficacy of our approach in preventing catastrophic forgetting while accommodating the acquisition of new tasks
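The core mechanism can be sketched as a layer whose shared weights are frozen while each task owns a multiplicative modulation mask, the only parameters trained on that task; sizes and initialization below are chosen for illustration:

```python
import torch
import torch.nn as nn

class ModulatedLinear(nn.Module):
    """Linear layer with frozen shared weights and per-task modulation.

    The shared weight is never updated, so earlier tasks cannot be
    disturbed; each new task only learns its own modulation mask.
    """
    def __init__(self, d_in, d_out, n_tasks):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(d_out, d_in) * d_in ** -0.5,
            requires_grad=False,               # frozen shared weights
        )
        self.mod = nn.ParameterList(
            [nn.Parameter(torch.ones(d_out, d_in)) for _ in range(n_tasks)]
        )

    def forward(self, x, task_id):
        w = self.weight * self.mod[task_id]    # task-specific weights
        return x @ w.T
```

Because the shared weight and every other task's mask are untouched when task t trains, earlier tasks are reproduced exactly, giving zero forgetting by construction.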

An Intelligent-Detection Network for Handwritten Mathematical Expression Recognition

  • paper_url: http://arxiv.org/abs/2311.15273
  • repo_url: None
  • paper_authors: Ziqi Ye
  • for: improve the accuracy of handwritten mathematical expression recognition (HMER)
  • methods: an Intelligent-Detection Network built on an enhanced YOLOv7 that accurately detects digit and symbol objects; the detections feed a bidirectional gated recurrent unit (BiGRU) and a baseline symbol relationship tree (BSRT) to determine relationships between symbols and numbers
  • results: outperforms encoder-decoder networks on complex handwritten mathematical expressions thanks to precise symbol and digit detection, with potential applications such as assignment grading and information entry from paper documents
    Abstract The use of artificial intelligence technology in education is growing rapidly, with increasing attention being paid to handwritten mathematical expression recognition (HMER) by researchers. However, many existing methods for HMER may fail to accurately read formulas with complex structures, as the attention results can be inaccurate due to illegible handwriting or large variations in writing styles. Our proposed Intelligent-Detection Network (IDN) for HMER differs from traditional encoder-decoder methods by utilizing object detection techniques. Specifically, we have developed an enhanced YOLOv7 network that can accurately detect both digital and symbolic objects. The detection results are then integrated into the bidirectional gated recurrent unit (BiGRU) and the baseline symbol relationship tree (BSRT) to determine the relationships between symbols and numbers. The experiments demonstrate that the proposed method outperforms those encoder-decoder networks in recognizing complex handwritten mathematical expressions. This is due to the precise detection of symbols and numbers. Our research has the potential to make valuable contributions to the field of HMER. This could be applied in various practical scenarios, such as assignment grading in schools and information entry of paper documents.

ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images

  • paper_url: http://arxiv.org/abs/2311.15264
  • repo_url: None
  • paper_authors: Nicolas Bourriez, Ihab Bendidi, Ethan Cohen, Gabriel Watkinson, Maxime Sanchez, Guillaume Bollot, Auguste Genovesio
  • for: joint representation learning for microscopy images whose channel count, order, and type vary from experiment to experiment
  • methods: ChAda-ViT, a Channel Adaptive Vision Transformer with an Inter-Channel Attention mechanism that handles an arbitrary number, order, and type of channels, trained in a self-supervised manner
  • results: outperforms existing approaches on several biologically relevant downstream tasks and, together with the new IDRCell100k dataset (79 experiments, 7 microscope modalities, 1 to 10 channels per experiment), can embed different assays into a unified biological image representation, bridging gaps between microscopes, channel counts, and channel types
    Abstract Unlike color photography images, which are consistently encoded into RGB channels, biological images encompass various modalities, where the type of microscopy and the meaning of each channel varies with each experiment. Importantly, the number of channels can range from one to a dozen and their correlation is often comparatively much lower than RGB, as each of them brings specific information content. This aspect is largely overlooked by methods designed out of the bioimage field, and current solutions mostly focus on intra-channel spatial attention, often ignoring the relationship between channels, yet crucial in most biological applications. Importantly, the variable channel type and count prevent the projection of several experiments to a unified representation for large scale pre-training. In this study, we propose ChAda-ViT, a novel Channel Adaptive Vision Transformer architecture employing an Inter-Channel Attention mechanism on images with an arbitrary number, order and type of channels. We also introduce IDRCell100k, a bioimage dataset with a rich set of 79 experiments covering 7 microscope modalities, with a multitude of channel types, and channel counts varying from 1 to 10 per experiment. Our proposed architecture, trained in a self-supervised manner, outperforms existing approaches in several biologically relevant downstream tasks. Additionally, it can be used to bridge the gap for the first time between assays with different microscopes, channel numbers or types by embedding various image and experimental modalities into a unified biological image representation. The latter should facilitate interdisciplinary studies and pave the way for better adoption of deep learning in biological image-based analyses. Code and Data to be released soon.
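To make the channel handling concrete, here is a minimal sketch of inter-channel attention in the spirit of the abstract (our reading, not the authors' released code; the tensor layout and dimensions are assumptions): each channel's patch embedding is treated as a token, and attention runs along the channel axis at every spatial location, so the layer accepts any channel count.

```python
import torch
import torch.nn as nn

class InterChannelAttention(nn.Module):
    """Sketch: self-attention across an arbitrary number of image
    channels at each spatial patch location (hypothetical layout)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, patches, dim); the channel count may vary per batch
        b, c, p, d = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(b * p, c, d)  # channels as a sequence
        out, _ = self.attn(tokens, tokens, tokens)           # inter-channel attention
        return out.reshape(b, p, c, d).permute(0, 2, 1, 3)

# toy usage: a 5-channel microscopy image split into 49 patches of width 64
x = torch.randn(2, 5, 49, 64)
print(InterChannelAttention(64)(x).shape)  # torch.Size([2, 5, 49, 64])
```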

Revealing Cortical Layers In Histological Brain Images With Self-Supervised Graph Convolutional Networks Applied To Cell-Graphs

  • paper_url: http://arxiv.org/abs/2311.15262
  • repo_url: None
  • paper_authors: Valentina Vadori, Antonella Peruffo, Jean-Marie Graïc, Giulia Vadori, Livio Finos, Enrico Grisan
  • for: Improving comparative studies of cerebral cortex cytoarchitecture, to shed light on the relations between brain structure and function across species.
  • methods: A self-supervised pipeline: individual cells are first segmented and an attributed cell-graph is created; a self-supervised graph convolutional network then generates cell embeddings, and a community detection algorithm produces the final layering.
  • results: Cortical layers are detected automatically, quickly, and without annotated data, accelerating cytoarchitecture analyses and facilitating cross-species studies.
    Abstract Identifying cerebral cortex layers is crucial for comparative studies of cytoarchitecture, aiming to provide insights into the relations between brain structure and function across species. The absence of extensive annotated datasets typically limits the adoption of machine learning approaches, leading to the manual delineation of cortical layers by neuroanatomists. We introduce a self-supervised approach to detect layers in 2D Nissl-stained histological slices of the cerebral cortex. It starts with the segmentation of individual cells and the creation of an attributed cell-graph. A self-supervised graph convolutional network generates cell embeddings that encode morphological and structural traits of the cellular environment and are exploited by a community detection algorithm for the final layering. Our method, the first self-supervised approach of its kind requiring no spatial transcriptomics data, holds the potential to accelerate cytoarchitecture analyses, sidestepping annotation needs and advancing cross-species investigation.
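As a rough illustration of the graph stage of this pipeline, the sketch below builds a k-nearest-neighbor cell-graph from cell centroids, propagates morphological features over it as a stand-in for the trained self-supervised GCN, and applies modularity-based community detection. Everything here is an assumption for illustration (including the use of networkx's greedy modularity communities), not the paper's implementation.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

def layer_cells(centroids, features, k=8):
    """Rough sketch: untrained one-hop feature propagation stands in for
    the trained self-supervised GCN, and modularity communities stand in
    for the paper's community detection step."""
    adj = kneighbors_graph(centroids, k, mode="connectivity")   # cell-graph
    g = nx.from_scipy_sparse_array(adj)
    a = adj.toarray() + np.eye(len(centroids))                  # add self-loops
    a = a / a.sum(axis=1, keepdims=True)                        # row-normalize
    embeddings = a @ (a @ features)                             # two propagation hops
    for u, v in g.edges:                                        # weight edges by
        g[u][v]["weight"] = float(embeddings[u] @ embeddings[v])  # embedding affinity
    return nx.community.greedy_modularity_communities(g, weight="weight")

cells = np.random.rand(200, 2)      # toy cell centroids in a 2D slice
feats = np.random.rand(200, 16)     # toy morphological descriptors
print([len(c) for c in layer_cells(cells, feats)])  # candidate layers as communities
```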

NeuRAD: Neural Rendering for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.15260
  • repo_url: https://github.com/georghess/neurad
  • paper_authors: Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, Christoffer Petersson
  • for: Addressing the limitations of neural radiance fields (NeRFs) in autonomous driving, improving their applicability and scalability.
  • methods: NeuRAD, a novel view synthesis method featuring a simple network design and extensive sensor modeling for both camera and lidar, applicable to multiple datasets out of the box.
  • results: State-of-the-art performance on five popular autonomous driving datasets; the NeuRAD source code is openly released to encourage further development.
    Abstract Neural radiance fields (NeRFs) have gained popularity in the autonomous driving (AD) community. Recent methods show NeRFs' potential for closed-loop simulation, enabling testing of AD systems, and as an advanced training data augmentation technique. However, existing methods often require long training times, dense semantic supervision, or lack generalizability. This, in turn, hinders the application of NeRFs for AD at scale. In this paper, we propose NeuRAD, a robust novel view synthesis method tailored to dynamic AD data. Our method features simple network design, extensive sensor modeling for both camera and lidar -- including rolling shutter, beam divergence and ray dropping -- and is applicable to multiple datasets out of the box. We verify its performance on five popular AD datasets, achieving state-of-the-art performance across the board. To encourage further development, we openly release the NeuRAD source code. See https://github.com/georghess/NeuRAD .

Generating Human-Centric Visual Cues for Human-Object Interaction Detection via Large Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.16475
  • repo_url: None
  • paper_authors: Yu-Wei Zhan, Fan Liu, Xin Luo, Liqiang Nie, Xin-Shun Xu, Mohan Kankanhalli
  • for: Improving the accuracy of human-object interaction (HOI) detection by generating human-centric visual cues from multiple perspectives.
  • methods: Prompting a vision-language model (VLM) to generate multi-perspective human-centric visual cues, then fusing the cue features into the instance and interaction decoders through a transformer-based multi-tower module.
  • results: The method outperforms existing state-of-the-art approaches on two widely used datasets.
    Abstract Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, the complexity of human behavior and the diverse contexts in which these interactions occur make it challenging. Intuitively, human-centric visual cues, such as the involved participants, the body language, and the surrounding environment, play crucial roles in shaping these interactions. These cues are particularly vital in interpreting unseen interactions. In this paper, we propose three prompts for a VLM to generate human-centric visual cues within an image from multiple human-focused perspectives. To capitalize on these rich human-centric visual cues, we propose a novel approach named HCVC for HOI detection. In particular, we develop a transformer-based multimodal fusion module with a multi-tower architecture to integrate visual cue features into the instance and interaction decoders. Our extensive experiments and analysis validate the efficacy of leveraging the generated human-centric visual cues for HOI detection. Notably, the experimental results indicate the superiority of the proposed model over existing state-of-the-art methods on two widely used datasets.

CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration Network

  • paper_url: http://arxiv.org/abs/2311.15241
  • repo_url: None
  • paper_authors: Yuxuan Xiao, Yao Li, Chengzhen Meng, Xingchen Li, Yanyong Zhang
  • for: An automatic LiDAR-camera calibration method addressing the accuracy and reliability issues of sensor calibration.
  • methods: CalibFormer, an end-to-end network that aggregates multi-layer camera and LiDAR features into high-resolution representations, identifies feature correlations with a multi-head correlation module, and regresses accurate calibration parameters with a transformer architecture.
  • results: A mean translation error of 0.8751 cm and a mean rotation error of 0.0562° on the KITTI dataset, surpassing state-of-the-art methods with strong robustness, accuracy, and generalization.
    Abstract The fusion of LiDARs and cameras has been increasingly adopted in autonomous driving for perception tasks. The performance of such fusion-based algorithms largely depends on the accuracy of sensor calibration, which is challenging due to the difficulty of identifying common features across different data modalities. Previously, many calibration methods involved specific targets and/or manual intervention, which has proven to be cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance is barely satisfactory in most cases. These methods usually suffer from issues such as sparse feature maps, unreliable cross-modality association, inaccurate calibration parameter regression, etc. In this paper, to address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to achieve high-resolution representations. A multi-head correlation module is utilized to identify correlations between features more accurately. Lastly, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieved a mean translation error of $0.8751 \mathrm{cm}$ and a mean rotation error of $0.0562 ^{\circ}$ on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization capabilities.

Double Reverse Regularization Network Based on Self-Knowledge Distillation for SAR Object Classification

  • paper_url: http://arxiv.org/abs/2311.15231
  • repo_url: https://github.com/consult98/DRRNet-SKD
  • paper_authors: Bo Xu, Hao Zheng, Zhigang Hu, Liu Yang, Meiguang Zheng
  • for: Addressing the severe overfitting in synthetic aperture radar (SAR) object classification caused by limited (few-shot) and noisy data.
  • methods: A Double Reverse Regularization Network based on Self-Knowledge Distillation (DRRNet-SKD). By exploring the effect of the distillation weight on the distillation process, a double-reverse design combines offline and online distillation in a complementary way; an Adaptive Weight Assignment (AWA) module then adaptively assigns two reverse-changing weights based on network performance, so the student network benefits from both teachers.
  • results: DRRNet-SKD delivers remarkable performance gains for classical CNNs on OpenSARShip and FUSAR-Ship, outperforming state-of-the-art self-knowledge distillation methods.
    Abstract In current synthetic aperture radar (SAR) object classification, one of the major challenges is the severe overfitting issue due to the limited dataset (few-shot) and noisy data. Considering the advantages of knowledge distillation as a learned label smoothing regularization, this paper proposes a novel Double Reverse Regularization Network based on Self-Knowledge Distillation (DRRNet-SKD). Specifically, through exploring the effect of distillation weight on the process of distillation, we are inspired to adopt the double reverse thought to implement an effective regularization network by combining offline and online distillation in a complementary way. Then, the Adaptive Weight Assignment (AWA) module is designed to adaptively assign two reverse-changing weights based on the network performance, allowing the student network to better benefit from both teachers. The experimental results on OpenSARShip and FUSAR-Ship demonstrate that DRRNet-SKD exhibits remarkable performance improvement on classical CNNs, outperforming state-of-the-art self-knowledge distillation methods.
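The double-reverse weighting can be pictured with a small loss sketch. This is a hypothetical stand-in, not the released DRRNet-SKD code: the paper's AWA module derives the weights from measured network performance, whereas here a simple training-progress schedule drives the two reverse-changing weights.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label distillation term: KL divergence between
    temperature-scaled teacher and student distributions."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def double_reverse_loss(student_logits, offline_logits, online_logits,
                        targets, epoch, total_epochs):
    """Two distillation weights that change in reverse (alpha falls as
    beta rises), here driven by progress rather than performance."""
    alpha = 1.0 - epoch / total_epochs   # offline-teacher weight, decreasing
    beta = epoch / total_epochs          # online-teacher weight, increasing
    ce = F.cross_entropy(student_logits, targets)
    return ce + alpha * kd_loss(student_logits, offline_logits) \
              + beta * kd_loss(student_logits, online_logits)
```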

GAIA: Zero-shot Talking Avatar Generation

  • paper_url: http://arxiv.org/abs/2311.15230
  • repo_url: None
  • paper_authors: Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian
  • for: Synthesizing natural talking-avatar videos from speech and a single portrait image.
  • methods: GAIA (Generative AI for Avatar), which eliminates domain priors in talking avatar generation by disentangling each frame into motion and appearance representations and generating motion sequences conditioned on speech and a reference portrait.
  • results: GAIA beats previous models in naturalness, diversity, lip-sync quality, and visual quality; the framework scales with model size (up to 2B parameters) and extends to applications such as controllable and text-instructed avatar generation.
    Abstract Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and reference portrait image. We collect a large-scale high-quality talking avatar dataset and train the model on it with different scales (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA as 1) the resulting model beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable since larger models yield better results; 3) it is general and enables different applications like controllable talking avatar generation and text-instructed avatar generation.

One-bit Supervision for Image Classification: Problem, Solution, and Beyond

  • paper_url: http://arxiv.org/abs/2311.15225
  • repo_url: None
  • paper_authors: Hengtong Hu, Lingxi Xie, Xinyue Hue, Richang Hong, Qi Tian
  • for: A new setting, "one-bit supervision", for learning image classification with fewer labels: instead of the exact label of each sample, the model predicts a class and learns only whether the guess is correct, receiving one bit (yes or no) of information.
  • methods: Two keys: improving guess accuracy and making good use of incorrect guesses, realized through a multi-stage training paradigm and negative label suppression built on a semi-supervised learning algorithm.
  • results: One-bit supervision is more efficient than full-bit supervision in most cases; initializing with self-supervised learning and applying hard example mining and class balancing further improves learning efficiency.
    Abstract This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification. Instead of training the model with the accurate label of each sample, our setting requires the model to interact with the system by predicting the class label of each sample and learning from the answer whether the guess is correct, which provides one bit (yes or no) of information. An intriguing property of the setting is that the burden of annotation is largely alleviated in comparison to offering the accurate label. There are two keys to one-bit supervision: (i) improving the guess accuracy and (ii) making good use of the incorrect guesses. To achieve these goals, we propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm. Theoretical analysis shows that one-bit annotation is more efficient than full-bit annotation in most cases and gives conditions for combining our approach with active learning. Inspired by this, we further integrate the one-bit supervision framework into the self-supervised learning algorithm, which yields an even more efficient training schedule. Different from training from scratch, when self-supervised learning is used for initialization, both hard example mining and class balance are verified effective in boosting the learning performance. However, these two frameworks still need full-bit labels in the initial stage. To cast off this burden, we utilize unsupervised domain adaptation to train the initial model and conduct pure one-bit annotations on the target dataset. In multiple benchmarks, the learning efficiency of the proposed approach surpasses that using full-bit, semi-supervised supervision.
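The supervision signal itself is easy to sketch. The snippet below is an illustrative reading of the setting, not the authors' pipeline: the learner guesses a class, receives the single yes/no bit, and converts a "no" into a suppressed negative label.

```python
import torch
import torch.nn.functional as F

def one_bit_feedback(logits, true_labels):
    """Sketch of the one-bit loop: a correct guess becomes a full label,
    a wrong guess suppresses probability on the guessed class."""
    guesses = logits.argmax(dim=1)
    correct = guesses.eq(true_labels)          # the single bit per sample

    loss = torch.tensor(0.0)
    if correct.any():                          # "yes": treat guess as a label
        loss = loss + F.cross_entropy(logits[correct], guesses[correct])
    if (~correct).any():                       # "no": push the guessed class
        wrong = ~correct                       # probability toward zero
        p_guess = F.softmax(logits[wrong], dim=1).gather(
            1, guesses[wrong].unsqueeze(1))
        loss = loss - torch.log(1.0 - p_guess + 1e-8).mean()
    return loss
```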

Leveraging Anatomical Constraints with Uncertainty for Pneumothorax Segmentation

  • paper_url: http://arxiv.org/abs/2311.15213
  • repo_url: None
  • paper_authors: Han Yuan, Chuan Hong, Nguyen Tuan Anh Tran, Xinxing Xu, Nan Liu
  • for: Enhancing deep learning-based pneumothorax segmentation by exploiting the lesion's location-specific nature within the lung+ space.
  • methods: A novel approach that uses the lung+ space as a constraint during model training. External datasets and an auxiliary lung segmentation task generate an image-specific lung+ space constraint for each chest radiograph, and a discriminator eliminates unreliable constraints caused by domain shift between the auxiliary and target datasets.
  • results: Average gains of 4.6%, 3.6%, and 3.3% in IoU, DSC, and HD over baseline methods, confirming that encoding the location prior improves segmentation accuracy.
    Abstract Pneumothorax is a medical emergency caused by abnormal accumulation of air in the pleural space - the potential space between the lungs and chest wall. On 2D chest radiographs, pneumothorax occurs within the thoracic cavity and outside of the mediastinum and we refer to this area as "lung+ space". While deep learning (DL) has increasingly been utilized to segment pneumothorax lesions in chest radiographs, many existing DL models employ an end-to-end approach. These models directly map chest radiographs to clinician-annotated lesion areas, often neglecting the vital domain knowledge that pneumothorax is inherently location-sensitive. We propose a novel approach that incorporates the lung+ space as a constraint during DL model training for pneumothorax segmentation on 2D chest radiographs. To circumvent the need for additional annotations and to prevent potential label leakage on the target task, our method utilizes external datasets and an auxiliary task of lung segmentation. This approach generates a specific constraint of lung+ space for each chest radiograph. Furthermore, we have incorporated a discriminator to eliminate unreliable constraints caused by the domain shift between the auxiliary and target datasets. Our results demonstrated significant improvements, with average performance gains of 4.6%, 3.6%, and 3.3% regarding Intersection over Union (IoU), Dice Similarity Coefficient (DSC), and Hausdorff Distance (HD). Our research underscores the significance of incorporating medical domain knowledge about the location-specific nature of pneumothorax to enhance DL-based lesion segmentation.
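A minimal sketch of how an anatomical constraint of this kind can enter the objective (an assumed form for illustration; the paper's exact loss and discriminator are not reproduced): standard BCE plus a penalty on lesion probability predicted outside the lung+ space mask.

```python
import torch

def constrained_seg_loss(pred, target, lung_space, outside_weight=1.0):
    """Sketch: penalize any lesion probability outside the lung+ mask."""
    bce = torch.nn.functional.binary_cross_entropy(pred, target)
    outside = pred * (1.0 - lung_space)    # probability mass outside lung+ space
    return bce + outside_weight * outside.mean()

# toy usage: pred/target/lung_space are (batch, 1, H, W) maps in [0, 1]
pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.9).float()
lung_space = (torch.rand(2, 1, 64, 64) > 0.3).float()
print(constrained_seg_loss(pred, target, lung_space))
```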

PISA: Point-cloud-based Instructed Scene Augmentation

  • paper_url: http://arxiv.org/abs/2311.16501
  • repo_url: None
  • paper_authors: Yiyang Luo, Ke Lin
  • for: An indoor scene augmentation method based on multi-modal deep learning that generates objects consistent with their surroundings from text descriptions.
  • methods: An end-to-end multi-modal network built on the Point-E generative model, with quantified position prediction and Top-K estimation to mitigate false negatives caused by ambiguous language descriptions.
  • results: The model generates realistic indoor objects, evaluated through object diversity, instruction effectiveness, quantitative metrics, and visual grounding as a measure of generated scene quality.
    Abstract Indoor scene augmentation has become an emerging topic in the field of computer vision, with applications in augmented and virtual reality. However, existing scene augmentation methods mostly require a pre-built object database with a given position as the desired location. In this paper, we propose the first end-to-end multi-modal deep neural network that can generate point cloud objects consistent with their surroundings, conditioned on text instructions. Our model generates a plausible object in the appropriate position based on a query and input point clouds, thereby enabling the creation of new scenarios involving previously unseen layouts of objects. A database of pre-stored CAD models is no longer needed. We use Point-E as our generative model and introduce methods including quantified position prediction and Top-K estimation to mitigate the false negative problems caused by ambiguous language descriptions. Moreover, we evaluate the ability of our model by demonstrating the diversity of generated objects, the effectiveness of instructions, and quantitative metric results, which collectively indicate that our model can generate realistic indoor objects. For a more thorough evaluation, we also incorporate visual grounding as a metric to assess the quality of the scenes generated by our model.

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding

  • paper_url: http://arxiv.org/abs/2311.15206
  • repo_url: None
  • paper_authors: Hoang-Quan Nguyen, Thanh-Dat Truong, Xuan Bac Nguyen, Ashley Dowling, Xin Li, Khoa Luu
  • for: Improving insect recognition and detection for precision agriculture, to support healthy crop growth and high-quality yield.
  • methods: A new "Insect-1M" dataset of 1 million insect images with dense labels covering the taxonomy hierarchy and insect descriptions, together with a micro-feature self-supervised learning method featuring a Patch-wise Relevant Attention mechanism and a Description Consistency loss to capture subtle differences among insect images.
  • results: State-of-the-art performance on standard insect-related benchmarks; the Insect Foundation Model and dataset provide a strong basis for the next generation of insect-related vision models, moving closer to the goal of precision agriculture.
    Abstract In precision agriculture, the detection and recognition of insects play an essential role in the ability of crops to grow healthy and produce a high-quality yield. The current machine vision model requires a large volume of data to achieve high performance. However, there are approximately 5.5 million different insect species in the world. None of the existing insect datasets can cover even a fraction of them due to varying geographic locations and acquisition costs. In this paper, we introduce a novel ``Insect-1M'' dataset, a game-changing resource poised to revolutionize insect-related foundation model training. Covering a vast spectrum of insect species, our dataset, including 1 million images with dense identification labels of taxonomy hierarchy and insect descriptions, offers a panoramic view of entomology, enabling foundation models to comprehend visual and semantic information about insects like never before. Then, to efficiently establish an Insect Foundation Model, we develop a micro-feature self-supervised learning method with a Patch-wise Relevant Attention mechanism capable of discerning the subtle differences among insect images. In addition, we introduce Description Consistency loss to improve micro-feature modeling via insect descriptions. Through our experiments, we illustrate the effectiveness of our proposed approach in insect modeling and achieve State-of-the-Art performance on standard benchmarks of insect-related tasks. Our Insect Foundation Model and Dataset promise to empower the next generation of insect-related vision models, bringing them closer to the ultimate goal of precision agriculture.

Dual-stream contrastive predictive network with joint handcrafted feature view for SAR ship classification

  • paper_url: http://arxiv.org/abs/2311.15202
  • repo_url: https://github.com/Haosheng-Chen/DCPNet
  • paper_authors: Xianting Feng, Hao zheng, Zhigang Hu, Liu Yang, Meiguang Zheng
  • for: Improving synthetic aperture radar (SAR) ship classification by exploiting unlabeled SAR ship images together with handcrafted features.
  • methods: A dual-stream contrastive predictive network (DCPNet) with two asymmetric task designs and a false negative sample elimination module. The first task builds positive sample pairs to guide the core encoder toward more general representations; the second encourages adaptive capture of the correspondence between deep and handcrafted features, transferring knowledge within the model and reducing the information redundancy caused by feature fusion.
  • results: DCPNet improves the classification accuracy of supervised models and learns effective representations of SAR ship images.
    Abstract Most existing synthetic aperture radar (SAR) ship classification technologies heavily rely on correctly labeled data, ignoring the discriminative features of unlabeled SAR ship images. Even though researchers try to enrich CNN-based features by introducing traditional handcrafted features, existing methods easily cause information redundancy and fail to capture the interaction between them. To address these issues, we propose a novel dual-stream contrastive predictive network (DCPNet), which consists of two asymmetric task designs and a false negative sample elimination module. The first task is to construct positive sample pairs, guiding the core encoder to learn more general representations. The second task is to encourage adaptive capture of the correspondence between deep features and handcrafted features, achieving knowledge transfer within the model and effectively reducing the redundancy caused by feature fusion. To increase the separability between clusters, we also design a cluster-level task. The experimental results on the OpenSARShip and FUSAR-Ship datasets demonstrate the improvement in classification accuracy of supervised models and confirm the capability of DCPNet to learn effective representations.

SpliceMix: A Cross-scale and Semantic Blending Augmentation Strategy for Multi-label Image Classification

  • paper_url: http://arxiv.org/abs/2311.15200
  • repo_url: None
  • paper_authors: Lei Wang, Yibing Zhan, Leilei Ma, Dapeng Tao, Liang Ding, Chen Gong
  • for: A simple yet effective data augmentation strategy for multi-label image classification (MLIC).
  • methods: SpliceMix, whose "splice" is two-fold: each mixed image is a grid splice of several downsampled images, blending their semantics without object deficiencies to alleviate co-occurrence bias; the mixed images are then spliced with the original mini-batch into a new SpliceMixed mini-batch, so that images at different scales contribute to training together. A simple, non-parametric consistency-learning extension (SpliceMix-CL) is also provided.
  • results: Extensive experiments show that SpliceMix with a baseline model (e.g., ResNet) already outperforms state-of-the-art methods, and current MLIC methods improve further when combined with SpliceMix. Code: https://github.com/zuiran/SpliceMix.
    Abstract Recently, Mix-style data augmentation methods (e.g., Mixup and CutMix) have shown promising performance in various visual tasks. However, these methods are primarily designed for single-label images, ignoring the considerable discrepancies between single- and multi-label images, i.e., a multi-label image involves multiple co-occurred categories and fickle object scales. On the other hand, previous multi-label image classification (MLIC) methods tend to design elaborate models, bringing expensive computation. In this paper, we introduce a simple but effective augmentation strategy for multi-label image classification, namely SpliceMix. The "splice" in our method is two-fold: 1) Each mixed image is a splice of several downsampled images in the form of a grid, where the semantics of images attending to mixing are blended without object deficiencies for alleviating co-occurred bias; 2) We splice mixed images and the original mini-batch to form a new SpliceMixed mini-batch, which allows an image with different scales to contribute to training together. Furthermore, such splice in our SpliceMixed mini-batch enables interactions between mixed images and original regular images. We also offer a simple and non-parametric extension based on consistency learning (SpliceMix-CL) to show the flexible extensibility of our SpliceMix. Extensive experiments on various tasks demonstrate that only using SpliceMix with a baseline model (e.g., ResNet) achieves better performance than state-of-the-art methods. Moreover, the generalizability of our SpliceMix is further validated by the improvements in current MLIC methods when married with our SpliceMix. The code is available at https://github.com/zuiran/SpliceMix.
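The grid-splice idea is simple enough to sketch directly. The following is our reading of the augmentation, not the official code; it assumes multi-hot labels, a batch of at least grid*grid images, and image sides divisible by the grid size.

```python
import torch
import torch.nn.functional as F

def splice_mix(images, labels, grid=2):
    """Sketch: downsample grid*grid images, tile them into one image of the
    original size, and take the union of their multi-hot labels."""
    b, c, h, w = images.shape
    n = grid * grid
    idx = torch.randperm(b)[:n]
    small = F.interpolate(images[idx], size=(h // grid, w // grid),
                          mode="bilinear", align_corners=False)
    rows = [torch.cat(list(small[r * grid:(r + 1) * grid]), dim=2)
            for r in range(grid)]
    spliced = torch.cat(rows, dim=1).unsqueeze(0)        # (1, c, h, w)
    mixed_label = labels[idx].amax(dim=0, keepdim=True)  # union of labels
    # the SpliceMixed mini-batch keeps regular images alongside the splice
    return torch.cat([images, spliced]), torch.cat([labels, mixed_label])

imgs = torch.randn(8, 3, 64, 64)
labs = (torch.rand(8, 20) > 0.8).float()       # multi-hot labels
out_imgs, out_labs = splice_mix(imgs, labs)
print(out_imgs.shape, out_labs.shape)          # (9, 3, 64, 64) (9, 20)
```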

Eye vs. AI: Human Gaze and Model Attention in Video Memorability

  • paper_url: http://arxiv.org/abs/2311.16484
  • repo_url: None
  • paper_authors: Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
  • for: Understanding the factors that determine video memorability, with applications in areas such as educational technology and advertising.
  • methods: A Transformer-based model with spatio-temporal attention that matches state-of-the-art performance on video memorability prediction over a large naturalistic video dataset.
  • results: The model's self-attention patterns follow the same patterns as human gaze fixation density maps collected through an eye-tracking experiment, and model and human attention concentrate on the more memorable parts of videos.
    Abstract Understanding the factors that determine video memorability has important applications in areas such as educational technology and advertising. Towards this goal, we investigate the semantic and temporal attention mechanisms underlying video memorability. We propose a Transformer-based model with spatio-temporal attention that matches SoTA performance on video memorability prediction on a large naturalistic video dataset. More importantly, the self-attention patterns show us where the model looks to predict memorability. We compare model attention against human gaze fixation density maps collected through a small-scale eye-tracking experiment where humans perform a video memory task. Quantitative saliency metrics show that the model attention and human gaze follow similar patterns. Furthermore, while panoptic segmentation confirms that the model and humans attend more to thing classes, stuff classes that receive increased/decreased attention tend to have higher memorability scores. We also observe that the model assigns greater importance to the initial frames, mimicking temporal attention patterns found in humans.

HumanRecon: Neural Reconstruction of Dynamic Human Using Geometric Cues and Physical Priors

  • paper_url: http://arxiv.org/abs/2311.15171
  • repo_url: https://github.com/pris-cv/humanrecon
  • paper_authors: Junhui Yin, Wei Yin, Hao Chen, Xuqian Ren, Zhanyu Ma, Jun Guo, Yifan Liu
  • for: Neural reconstruction of dynamic humans from multi-view video.
  • methods: Prior methods rely solely on RGB color supervision without explicit geometric constraints, making them prone to overfitting to color and to inherent geometric ambiguities in sparse multi-view setups. This work instead incorporates estimated depth and normals as geometric regularization when learning the neural implicit representation, providing reliable and explicit supervision, together with physical priors such as adding noise to view directions and maximizing density on the human surface.
  • results: Geometric cues predicted by human-specific monocular estimators provide effective supervision signals and render more accurate images, while the physical priors reduce overfitting and improve the overall quality of novel view synthesis.
    Abstract Recent methods for dynamic human reconstruction have attained promising reconstruction results. Most of these methods rely only on RGB color supervision without considering explicit geometric constraints. This leads to existing human reconstruction techniques being more prone to overfitting to color and causes geometrically inherent ambiguities, especially in the sparse multi-view setup. Motivated by recent advances in the field of monocular geometry prediction, we consider the geometric constraints of estimated depth and normals in the learning of neural implicit representation for dynamic human reconstruction. As a geometric regularization, this provides reliable yet explicit supervision information, and improves reconstruction quality. We also exploit several beneficial physical priors, such as adding noise into view direction and maximizing the density on the human surface. These priors ensure the color rendered along rays to be robust to view direction and reduce the inherent ambiguities of density estimated along rays. Experimental results demonstrate that depth and normal cues, predicted by human-specific monocular estimators, can provide effective supervision signals and render more accurate images. Finally, we also show that the proposed physical priors significantly reduce overfitting and improve the overall quality of novel view synthesis. Our code is available at:~\href{https://github.com/PRIS-CV/HumanRecon}{https://github.com/PRIS-CV/HumanRecon}.

Self-supervised OCT Image Denoising with Slice-to-Slice Registration and Reconstruction

  • paper_url: http://arxiv.org/abs/2311.15167
  • repo_url: https://github.com/cjlee94/slice2slice
  • paper_authors: Shijie Li, Palaiologos Alexopoulos, Anse Vellappally, Ronald Zambrano, Wollstein Gadi, Guido Gerig
  • for: Improving accurate quantitative analysis of retinal structures, which underpins clinical diagnosis and disease monitoring.
  • methods: A self-supervised, structure-preserving denoising framework tailored to OCT speckle noise, integrating slice-by-slice training and registration modules into a single end-to-end network.
  • results: The framework handles OCT noise better than previously published self-supervised denoising models, potentially serving as a preprocessing step toward better segmentation and quantitative analysis.
    Abstract Strong speckle noise is inherent to optical coherence tomography (OCT) imaging and represents a significant obstacle for accurate quantitative analysis of retinal structures which is key for advances in clinical diagnosis and monitoring of disease. Learning-based self-supervised methods for structure-preserving noise reduction have demonstrated superior performance over traditional methods but face unique challenges in OCT imaging. The high correlation of voxels generated by coherent A-scan beams undermines the efficacy of self-supervised learning methods as it violates the assumption of independent pixel noise. We conduct experiments demonstrating limitations of existing models due to this independence assumption. We then introduce a new end-to-end self-supervised learning framework specifically tailored for OCT image denoising, integrating slice-by-slice training and registration modules into one network. An extensive ablation study is conducted for the proposed approach. Comparison to previously published self-supervised denoising models demonstrates improved performance of the proposed framework, potentially serving as a preprocessing step towards superior segmentation performance and quantitative analysis.

GS-IR: 3D Gaussian Splatting for Inverse Rendering

  • paper_url: http://arxiv.org/abs/2311.16473
  • repo_url: https://github.com/lzhnb/gs-ir
  • paper_authors: Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia
  • for: Achieving photorealistic novel view synthesis and relighting results through inverse rendering.
  • methods: GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that delivers photorealistic results at lower computational cost than implicit neural representations, adding a depth-derivation-based regularization for normal estimation and a baking-based occlusion scheme to model indirect lighting.
  • results: High-quality geometry reconstruction, novel view synthesis, and physically-based rendering, validated qualitatively and quantitatively on a variety of challenging scenes.
    Abstract We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results. Unlike previous works that use implicit neural representations and volume rendering (e.g. NeRF), which suffer from low expressive power and high computational complexity, we extend GS, a top-performance representation for novel view synthesis, to estimate scene geometry, surface material, and environment illumination from multi-view images captured under unknown lighting conditions. There are two main problems when introducing GS to inverse rendering: 1) GS does not support producing plausible normal natively; 2) forward mapping (e.g. rasterization and splatting) cannot trace the occlusion like backward mapping (e.g. ray tracing). To address these challenges, our GS-IR proposes an efficient optimization scheme that incorporates a depth-derivation-based regularization for normal estimation and a baking-based occlusion to model indirect lighting. The flexible and expressive GS representation allows us to achieve fast and compact geometry reconstruction, photorealistic novel view synthesis, and effective physically-based rendering. We demonstrate the superiority of our method over baseline methods through qualitative and quantitative evaluations on various challenging scenes.

Mixing Classifiers to Alleviate the Accuracy-Robustness Trade-Off

  • paper_url: http://arxiv.org/abs/2311.15165
  • repo_url: None
  • paper_authors: Yatong Bai, Brendon G. Anderson, Somayeh Sojoudi
  • for: Improving the accuracy and robustness of machine learning models in data-driven control of safety-critical systems.
  • methods: The "locally biased smoothing" method is extended to the multi-class setting and generalized to "mix" the outputs of a standard neural network and a robust neural network, so the mixed classifier inherits the standard model's accuracy and the robust model's certified robustness within a closed-form ℓp radius.
  • results: Numerical experiments on CIFAR-10 show that the mixed model noticeably improves the accuracy-robustness trade-off.
    Abstract Machine learning models have recently found tremendous success in data-driven control systems. However, standard learning models often suffer from an accuracy-robustness trade-off, which is a limitation that must be overcome in the control of safety-critical systems that require both high performance and rigorous robustness guarantees. In this work, we build upon the recent "locally biased smoothing" method to develop classifiers that simultaneously inherit high accuracy from standard models and high robustness from robust models. Specifically, we extend locally biased smoothing to the multi-class setting, and then overcome its performance bottleneck by generalizing the formulation to "mix" the outputs of a standard neural network and a robust neural network. We prove that when the robustness of the robust base model is certifiable, within a closed-form $\ell_p$ radius, no alteration or attack on an input can result in misclassification of the mixed classifier; the proposed model inherits the certified robustness. Moreover, we use numerical experiments on the CIFAR-10 benchmark dataset to verify that the mixed model noticeably improves the accuracy-robustness trade-off.
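The core mixing idea reduces to a few lines. The sketch below shows only the basic convex combination of softmax outputs with a fixed mixing weight; the paper's formulation (locally biased smoothing extended to multiple classes, with a certifiable radius) is richer than this.

```python
import torch

class MixedClassifier(torch.nn.Module):
    """Sketch: convexly mix the softmax outputs of an accurate standard
    model and a robust model; alpha trades accuracy for robustness."""
    def __init__(self, std_model, robust_model, alpha=0.5):
        super().__init__()
        self.std_model = std_model
        self.robust_model = robust_model
        self.alpha = alpha

    def forward(self, x):
        p_std = torch.softmax(self.std_model(x), dim=1)
        p_rob = torch.softmax(self.robust_model(x), dim=1)
        return (1 - self.alpha) * p_std + self.alpha * p_rob  # mixed probabilities
```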

Deep Learning-Based Approaches for Contactless Fingerprints Segmentation and Extraction

  • paper_url: http://arxiv.org/abs/2311.15163
  • repo_url: None
  • paper_authors: M. G. Sarwar Murshed, Syed Konain Abbas, Sandip Purnapatra, Daqing Hou, Faraz Hussain
  • for: A deep learning-based segmentation and extraction tool for contactless fingerprints, improving the accuracy and reliability of contactless fingerprint authentication systems.
  • methods: Deep learning techniques for high-accuracy fingerprint localization, segmentation, and extraction from contactless fingerprint images.
  • results: A mean absolute error (MAE) of 30 pixels, an error in angle prediction (EAP) of 5.92 degrees, and 97.46% labeling accuracy, demonstrating the effectiveness of the tool.
    Abstract Fingerprints are widely recognized as one of the most unique and reliable characteristics of human identity. Most modern fingerprint authentication systems rely on contact-based fingerprints, which require fingerprint scanners or sensors to capture fingerprints during authentication. Various types of fingerprint sensors, such as optical, capacitive, and ultrasonic sensors, employ distinct techniques to gather and analyze fingerprint data. This dependency on specific hardware creates a barrier to the broader adoption of fingerprint-based biometric systems in applications such as border control, healthcare systems, educational institutions, financial transactions, and airport security, where fingerprint sensors are not universally available. To mitigate the dependence on additional hardware, the use of contactless fingerprints has emerged as an alternative. Developing precise fingerprint segmentation methods, accurate fingerprint extraction tools, and reliable fingerprint matchers is crucial for the successful implementation of a robust contactless fingerprint authentication system. This paper focuses on the development of a deep learning-based segmentation tool for contactless fingerprint localization and segmentation. Our system leverages deep learning techniques to achieve high segmentation accuracy and reliable extraction of fingerprints from contactless fingerprint images. In our evaluation, our segmentation method demonstrated an average mean absolute error (MAE) of 30 pixels, an error in angle prediction (EAP) of 5.92 degrees, and a labeling accuracy of 97.46%. These results demonstrate the effectiveness of our novel contactless fingerprint segmentation and extraction tools.

Advancing Vision Transformers with Group-Mix Attention

  • paper_url: http://arxiv.org/abs/2311.15157
  • repo_url: https://github.com/ailab-cvc/groupmixformer
  • paper_authors: Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo
  • for: Enhancing visual recognition with a more comprehensive self-attention mechanism, Group-Mix Attention (GMA), which captures correlations among tokens and groups of adjacent tokens for higher representational capacity.
  • methods: GMA splits the Query, Key, and Value into segments, performs different group aggregations to generate group proxies, and computes attention over mixtures of tokens and group proxies, simultaneously capturing token-to-token, token-to-group, and group-to-group correlations with various group sizes.
  • results: GroupMixFormer, built on GMA, achieves state-of-the-art image classification, object detection, and semantic segmentation with fewer parameters than existing models: GroupMixFormer-L (70.3M parameters, 384^2 input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, and GroupMixFormer-B (45.8M parameters) attains 51.2% mIoU on ADE20K.
    Abstract Vision Transformers (ViTs) have been shown to enhance visual recognition through modeling long-range dependencies with multi-head self-attention (MHSA), which is typically formulated as Query-Key-Value computation. However, the attention map generated from the Query and Key captures only token-to-token correlations at one single granularity. In this paper, we argue that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity. Thereby, we propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes. To this end, GMA splits the Query, Key, and Value into segments uniformly and performs different group aggregations to generate group proxies. The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value. Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which achieves state-of-the-art performance in image classification, object detection, and semantic segmentation with fewer parameters than existing models. For instance, GroupMixFormer-L (with 70.3M parameters and 384^2 input) attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
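A loose sketch of the group-mix idea follows, much simpler than the paper's GMA: group proxies are built by average-pooling windows of tokens, and attention runs from tokens to the mixture of tokens and proxies, covering token-to-token and token-to-group correlations. The pooling choice and layout are our assumptions, not the published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedGroupMixAttention(nn.Module):
    """Sketch: attend from tokens to a mixture of tokens and average-pooled
    group proxies (a simplified stand-in for GMA's group aggregations)."""
    def __init__(self, dim, heads=4, group_size=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.group_size = group_size

    def forward(self, x):                       # x: (batch, tokens, dim)
        proxies = F.avg_pool1d(x.transpose(1, 2), self.group_size,
                               stride=self.group_size).transpose(1, 2)
        mixed = torch.cat([x, proxies], dim=1)  # tokens + group proxies
        out, _ = self.attn(x, mixed, mixed)     # token-to-token and token-to-group
        return out

x = torch.randn(2, 16, 64)
print(SimplifiedGroupMixAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```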

Self-Supervised Learning for SAR ATR with a Knowledge-Guided Predictive Architecture

  • paper_url: http://arxiv.org/abs/2311.15153
  • repo_url: None
  • paper_authors: Weijie Li, Yang Wei, Tianpeng Liu, Yuenan Hou, Yongxiang Liu, Li Liu
  • for: A self-supervised foundation-model approach for generalizable representation learning in SAR target recognition.
  • methods: A knowledge-guided predictive architecture that uses local masked patches to predict multi-scale SAR feature representations of unseen context, combining traditional SAR domain feature extraction with state-of-the-art scalable self-supervised learning.
  • results: Consistent performance gains for SAR target recognition across diverse targets, scenes, and sensors, validated on MSTAR, FUSAR-Ship, SAR-ACD, and SSDD, including under low data quality and noise.
    Abstract Recently, the emergence of a large number of Synthetic Aperture Radar (SAR) sensors and target datasets has made it possible to unify downstream tasks with self-supervised learning techniques, which can pave the way for building a foundation model in the SAR target recognition field. The major challenge of self-supervised learning for SAR target recognition lies in generalizable representation learning under low data quality and noise. To address this problem, we propose a knowledge-guided predictive architecture that uses local masked patches to predict the multiscale SAR feature representations of unseen context. The core of the proposed architecture lies in combining traditional SAR domain feature extraction with state-of-the-art scalable self-supervised learning for accurate generalized feature representations. The proposed framework is validated on various downstream datasets (MSTAR, FUSAR-Ship, SAR-ACD and SSDD), and can bring consistent performance improvement for SAR target recognition. The experimental results strongly demonstrate the unified performance improvement of the self-supervised learning technique for SAR target recognition across diverse targets, scenes and sensors.

Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.15145
  • repo_url: None
  • paper_authors: Jixuan Leng, Yijiang Li, Haohan Wang
  • for: Domain generalization (DG): training models across multiple domains and testing them on unseen ones.
  • methods: Selective Cross-Modality Distillation (SCMD) leverages large vision-language models, specifically CLIP, to train a more efficient student model. A unique selection framework identifies hard-to-learn samples for distillation, and a cross-modality module combines the student's projected features with CLIP text embeddings to align their similarity distributions.
  • results: SCMD enables a ResNet50 to deliver state-of-the-art performance on multiple DG benchmarks, surpassing existing methods; a theoretical analysis of the selection strategy offers deeper insight into its effectiveness.
    Abstract Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically the CLIP model, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module. This module seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.
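The cross-modality alignment can be sketched as a cosine-similarity classifier against class-text embeddings. This is our simplification, not the paper's exact module; the random tensors standing in for CLIP text embeddings and the 512-wide student projection are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cross_modality_loss(student_feats, clip_text_embeds, labels, tau=0.07):
    """Sketch: score the student's projected features against frozen
    class-text embeddings by cosine similarity, trained to put mass on
    the correct class."""
    s = F.normalize(student_feats, dim=1)      # (batch, d), after a projection head
    t = F.normalize(clip_text_embeds, dim=1)   # (num_classes, d), frozen
    logits = s @ t.T / tau                     # similarity distribution
    return F.cross_entropy(logits, labels)

# toy usage; random stand-ins replace real CLIP text embeddings
feats, text = torch.randn(8, 512), torch.randn(10, 512)
labels = torch.randint(0, 10, (8,))
print(cross_modality_loss(feats, text, labels))
```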

cs.AI - 2023-11-26

Variational Exploration Module VEM: A Cloud-Native Optimization and Validation Tool for Geospatial Modeling and AI Workflows

  • paper_url: http://arxiv.org/abs/2311.16196
  • repo_url: None
  • paper_authors: Julian Kuehnert, Hiwot Tadesse, Chris Dearden, Rosie Lickorish, Paolo Fraccaro, Anne Jones, Blair Edwards, Sekou L. Remy, Peter Melling, Tim Culmer
  • for: Improving understanding of the physical systems of our environment and designing best practices to reduce societal harm.
  • methods: Geospatial observations combined with computational models, with cloud-based deployment to scale up modeling and AI workflows.
  • results: The Variational Exploration Module orchestrates workflow executions in the cloud and analyzes model behavior using Bayesian and machine learning-based methods; user configurations allow combining diverse sampling strategies in multi-agent environments.
    Abstract Geospatial observations combined with computational models have become key to understanding the physical systems of our environment and enable the design of best practices to reduce societal harm. Cloud-based deployments help to scale up these modeling and AI workflows. Yet, for practitioners to draw robust conclusions, model tuning and testing are crucial, a resource-intensive process that involves varying model input variables. We have developed the Variational Exploration Module, which facilitates the optimization and validation of modeling workflows deployed in the cloud by orchestrating workflow executions and using Bayesian and machine learning-based methods to analyze model behavior. User configurations allow the combination of diverse sampling strategies in multi-agent environments. The flexibility and robustness of the model-agnostic module are demonstrated using real-world applications.

GGNNs : Generalizing GNNs using Residual Connections and Weighted Message Passing

  • paper_url: http://arxiv.org/abs/2311.15448
  • repo_url: None
  • paper_authors: Abhinav Raghuvanshi, Kushal Sokke Malleshappa
  • for: Improving the learning effectiveness and generalization of graph neural networks (GNNs) by modifying the message-passing mechanism.
  • methods: GNNs built from multi-layer perceptrons (MLPs), with two modifications to message passing: weighting messages before accumulation at each node, and adding residual connections.
  • results: Both mechanisms yield significant improvements in learning and faster convergence, along with better generalization.
    Abstract Many real-world phenomena can be modeled as graphs, making graph representations extremely valuable given their ubiquity. GNNs excel at capturing the relationships and patterns within these graphs, enabling effective learning and prediction tasks. GNNs are constructed using Multi-Layer Perceptrons (MLPs) and incorporate additional layers for message passing to facilitate the flow of features among nodes. It is commonly believed that the generalizing power of GNNs is attributed to the message-passing mechanism between layers, where nodes exchange information with their neighbors, enabling them to effectively capture and propagate information across the nodes of a graph. Our technique builds on these results and modifies the message-passing mechanism in two ways: by weighting messages before accumulating them at each node, and by adding residual connections. Both mechanisms show significant improvements in learning and faster convergence.
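The two modifications are straightforward to sketch in a single message-passing layer (an assumed form for illustration, not the authors' code): a learned per-edge weight scales each message before accumulation, and a residual connection preserves the node's own features.

```python
import torch
import torch.nn as nn

class WeightedResidualMPLayer(nn.Module):
    """Sketch: weighted messages plus a residual node update."""
    def __init__(self, dim):
        super().__init__()
        self.message_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.edge_weight = nn.Linear(2 * dim, 1)   # learned per-edge weight
        self.update_mlp = nn.Linear(2 * dim, dim)

    def forward(self, h, edge_index):
        # h: (num_nodes, dim); edge_index: (2, num_edges) with rows (src, dst)
        src, dst = edge_index
        w = torch.sigmoid(self.edge_weight(torch.cat([h[src], h[dst]], dim=1)))
        msg = w * self.message_mlp(h[src])                  # weighted messages
        agg = torch.zeros_like(h).index_add_(0, dst, msg)   # sum at each node
        return h + self.update_mlp(torch.cat([h, agg], dim=1))  # residual update

h = torch.randn(5, 16)
edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
print(WeightedResidualMPLayer(16)(h, edges).shape)  # torch.Size([5, 16])
```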

ProtoArgNet: Interpretable Image Classification with Super-Prototypes and Argumentation [Technical Report]

  • paper_url: http://arxiv.org/abs/2311.15438
  • repo_url: None
  • paper_authors: Hamed Ayoobi, Nico Potyka, Francesca Toni
  • for: A novel interpretable deep neural architecture for image classification.
  • methods: ProtoArgNet, in the spirit of prototypical-part-learning (e.g., ProtoPNet), uses super-prototypes that combine prototypical-parts into single prototypical class representations. Multi-layer perceptrons improve accuracy while remaining interpretable through an argumentation-based reading, and the MLP/argumentation component can be sparsified to match user cognitive requirements.
  • results: ProtoArgNet achieves higher accuracy than ProtoPNet and can recognise spatial relations between prototypical-parts drawn from different regions of an image, similar to how CNNs capture relations between patterns recognized in earlier layers.
    Abstract We propose ProtoArgNet, a novel interpretable deep neural architecture for image classification in the spirit of prototypical-part-learning as found, e.g. in ProtoPNet. While earlier approaches associate every class with multiple prototypical-parts, ProtoArgNet uses super-prototypes that combine prototypical-parts into single prototypical class representations. Furthermore, while earlier approaches use interpretable classification layers, e.g. logistic regression in ProtoPNet, ProtoArgNet improves accuracy with multi-layer perceptrons while relying upon an interpretable reading thereof based on a form of argumentation. ProtoArgNet is customisable to user cognitive requirements by a process of sparsification of the multi-layer perceptron/argumentation component. Also, as opposed to other prototypical-part-learning approaches, ProtoArgNet can recognise spatial relations between different prototypical-parts that are from different regions in images, similar to how CNNs capture relations between patterns recognized in earlier layers.

Wired Perspectives: Multi-View Wire Art Embraces Generative AI

  • paper_url: http://arxiv.org/abs/2311.15421
  • repo_url: https://github.com/WinKawaks/DreamWire
  • paper_authors: Zhiyu Qu, Lan Yang, Honggang Zhang, Tao Xiang, Kaiyue Pang, Yi-Zhe Song
  • for: This paper presents DreamWire, an AI system that enables everyone to easily create multi-view wire art (MVWA).
  • methods: The system synergizes 3D Bézier curves, Prim's algorithm, and knowledge distillation from diffusion models or their variants (e.g., ControlNet). This blend lets it represent 3D wire art, ensure spatial continuity, and overcome data scarcity. (A connectivity sketch follows this entry.)
  • results: Extensive evaluation and analysis shed light on the inner workings of the system, including the trade-off between connectivity and visual aesthetics.
    Abstract Creating multi-view wire art (MVWA), a static 3D sculpture with diverse interpretations from different viewpoints, is a complex task even for skilled artists. In response, we present DreamWire, an AI system enabling everyone to craft MVWA easily. Users express their vision through text prompts or scribbles, freeing them from intricate 3D wire organisation. Our approach synergises 3D B\'ezier curves, Prim's algorithm, and knowledge distillation from diffusion models or their variants (e.g., ControlNet). This blend enables the system to represent 3D wire art, ensuring spatial continuity and overcoming data scarcity. Extensive evaluation and analysis are conducted to shed insight on the inner workings of the proposed system, including the trade-off between connectivity and visual aesthetics.
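
Prim's algorithm is the classical ingredient here: it grows a tree from one control point, always attaching the closest point outside the tree, which keeps the wire connected. A plain sketch over Euclidean distances; the pairing with Bézier curves and diffusion guidance is the paper's contribution and is not shown.

```python
import numpy as np

def prim_mst(points):
    """Prim's algorithm over pairwise Euclidean distances: returns the edges
    of a minimum spanning tree that keeps all control points connected."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

print(prim_mst([(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)]))
# [(0, 1), (1, 2), (2, 3)]
```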

A Framework for Realistic Simulation of Daily Human Activity

  • paper_url: http://arxiv.org/abs/2311.15400
  • repo_url: None
  • paper_authors: Ifrah Idrees, Siddharth Singh, Kerui Xu, Dylan F. Glas
  • for: This paper provides a method for simulating daily human activity in home environments at scale, to support feature development and testing for social robots like Astro.
  • methods: The framework generates daily activity patterns with manually configurable personas and activity patterns, supports variation of activity timings, and allows testing on multiple home layouts. It introduces a method for specifying day-to-day schedule variation and a bidirectional constraint propagation algorithm for generating schedules from templates. (A toy propagation pass follows this entry.)
  • results: A use-case scenario analysis validates the expressive power of the framework, and the method is shown to generate data closely resembling human behavior from three public datasets and a self-collected dataset.
    Abstract For social robots like Astro which interact with and adapt to the daily movements of users within the home, realistic simulation of human activity is needed for feature development and testing. This paper presents a framework for simulating daily human activity patterns in home environments at scale, supporting manual configurability of different personas or activity patterns, variation of activity timings, and testing on multiple home layouts. We introduce a method for specifying day-to-day variation in schedules and present a bidirectional constraint propagation algorithm for generating schedules from templates. We validate the expressive power of our framework through a use case scenario analysis and demonstrate that our method can be used to generate data closely resembling human behavior from three public datasets and a self-collected dataset. Our contribution supports systematic testing of social robot behaviors at scale, enables procedural generation of synthetic datasets of human movement in different households, and can help minimize bias in training data, leading to more robust and effective robots for home environments.
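
The bidirectional pass can be pictured on a chain of activities with durations and start-time windows: a forward sweep tightens earliest starts against predecessors, and a backward sweep tightens latest starts against successors. This toy version only captures the spirit of the paper's algorithm; the activity names, ordering assumption, and numbers are invented.

```python
def propagate_schedule(activities):
    """Toy bidirectional pass over (name, duration, earliest, latest) start
    times in hours, assuming activities run in the listed order."""
    acts = [list(a) for a in activities]
    for i in range(1, len(acts)):                 # forward: respect predecessors
        acts[i][2] = max(acts[i][2], acts[i - 1][2] + acts[i - 1][1])
    for i in range(len(acts) - 2, -1, -1):        # backward: respect successors
        acts[i][3] = min(acts[i][3], acts[i + 1][3] - acts[i][1])
    return {name: (lo, hi) for name, _, lo, hi in acts}

day = [("wake", 1, 6, 9), ("breakfast", 1, 6, 11), ("leave for work", 8, 8, 12)]
print(propagate_schedule(day))
# {'wake': (6, 9), 'breakfast': (7, 11), 'leave for work': (8, 12)}
```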

Optimally Teaching a Linear Behavior Cloning Agent

  • paper_url: http://arxiv.org/abs/2311.15399
  • repo_url: None
  • paper_authors: Shubham Kumar Bharti, Stephen Wright, Adish Singla, Xiaojin Zhu
  • for: This paper studies optimal teaching of Linear Behavior Cloning (LBC) learners.
  • methods: The teacher selects which states to demonstrate; the learner maintains a version space of infinitely many linear hypotheses consistent with the demonstrations, and the teacher's goal is to convey a realizable target policy with the minimum number of state demonstrations, a quantity known as the Teaching Dimension (TD).
  • results: The paper presents "Teach using Iterative Elimination" (TIE), a teaching algorithm that achieves instance-optimal TD, shows that computing an optimal teaching set is NP-hard, gives an approximation algorithm with a $\log(|A|-1)$ guarantee on the teaching dimension, and validates efficiency and effectiveness experimentally.
    Abstract We study optimal teaching of Linear Behavior Cloning (LBC) learners. In this setup, the teacher can select which states to demonstrate to an LBC learner. The learner maintains a version space of infinite linear hypotheses consistent with the demonstration. The goal of the teacher is to teach a realizable target policy to the learner using a minimum number of state demonstrations. This number is known as the Teaching Dimension (TD). We present a teaching algorithm called "Teach using Iterative Elimination" (TIE) that achieves instance optimal TD. However, we also show that finding an optimal teaching set is computationally NP-hard. We further provide an approximation algorithm that guarantees an approximation ratio of $\log(|A|-1)$ on the teaching dimension. Finally, we provide experimental results to validate the efficiency and effectiveness of our algorithm.

Confidence Is All You Need for MI Attacks

  • paper_url: http://arxiv.org/abs/2311.15373
  • repo_url: None
  • paper_authors: Abhishek Sinha, Himanshi Tibrewal, Mansi Gupta, Nikhar Waghela, Shivank Garg
  • for: This work proposes a new membership inference attack that threatens the confidentiality of data used to train machine learning models.
  • methods: Instead of correlating loss with membership, the attack uses the model's confidence values: training examples tend to receive higher confidence when classified into their actual class, because the model exploits the specific patterns and noise present in the training data.
  • results: The paper presents a confidence-based membership inference attack, including a variant that needs no knowledge of a data point's true class, which gives it an edge over existing label-dependent attacks. (A toy decision rule follows this entry.)
    Abstract In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In this attack, adversaries aim to determine whether a particular point was used during the training of a target model. This paper proposes a new method to gauge a data point's membership in a model's training set. Instead of correlating loss with membership, as is traditionally done, we have leveraged the fact that training examples generally exhibit higher confidence values when classified into their actual class. During training, the model is essentially being 'fit' to the training data and might face particular difficulties in generalization to unseen data. This asymmetry leads to the model achieving higher confidence on the training data as it exploits the specific patterns and noise present in the training data. Our proposed approach leverages the confidence values generated by the machine learning model. These confidence values provide a probabilistic measure of the model's certainty in its predictions and can further be used to infer the membership of a given data point. Additionally, we also introduce another variant of our method that allows us to carry out this attack without knowing the ground truth(true class) of a given data point, thus offering an edge over existing label-dependent attack methods.
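
The core decision rule is a one-liner: predict "member" when the model's top softmax confidence is high. A toy sketch; the fixed threshold is an assumption (in practice it would be calibrated, e.g., on held-out or shadow-model outputs), and the paper's full attack is more elaborate.

```python
import numpy as np

def confidence_mi_attack(probs, threshold=0.9):
    """Label-free membership guess from an (n, k) array of per-class
    probabilities: samples with top confidence above `threshold` are
    predicted to be training-set members."""
    return probs.max(axis=1) > threshold

probs = np.array([[0.97, 0.02, 0.01],    # sharp prediction: likely a member
                  [0.55, 0.30, 0.15]])   # diffuse prediction: likely not
print(confidence_mi_attack(probs))       # [ True False]
```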

TD-Net: A Tri-domain network for sparse-view CT reconstruction

  • paper_url: http://arxiv.org/abs/2311.15369
  • repo_url: None
  • paper_authors: Xinyuan Wang, Changqing Su, Bo Xiong
  • for: Reduce X-ray radiation risk in CT while preserving image quality.
  • methods: A tri-domain network (TD-Net) that unifies sinogram, image, and frequency-domain optimization, incorporating a Frequency Supervision Module (FSM) to preserve fine detail.
  • results: High-quality CT reconstruction from sparse views that effectively balances radiation safety and image fidelity.
    Abstract Sparse-view CT reconstruction, aimed at reducing X-ray radiation risks, frequently suffers from image quality degradation, manifested as noise and artifacts. Existing post-processing and dual-domain techniques, although effective in radiation reduction, often lead to over-smoothed results, compromising diagnostic clarity. Addressing this, we introduce TD-Net, a pioneering tri-domain approach that unifies sinogram, image, and frequency domain optimizations. By incorporating Frequency Supervision Module(FSM), TD-Net adeptly preserves intricate details, overcoming the prevalent over-smoothing issue. Extensive evaluations demonstrate TD-Net's superior performance in reconstructing high-quality CT images from sparse views, efficiently balancing radiation safety and image fidelity. The enhanced capabilities of TD-Net in varied noise scenarios highlight its potential as a breakthrough in medical imaging.

Having Second Thoughts? Let’s hear it

  • paper_url: http://arxiv.org/abs/2311.15356
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Jung H. Lee, Sujith Vijayan
  • for: This work aims to improve the robustness and safety of deep learning models.
  • methods: It proposes a new certification process that mimics selective (top-down) attention, motivated by the interplay of bottom-up and top-down processing in the human brain.
  • results: Empirical evaluations suggest the certification improves accuracy and helps build safety measures against both artificial and natural adversarial examples.
    Abstract Deep learning models loosely mimic bottom-up signal pathways from low-order sensory areas to high-order cognitive areas. After training, DL models can outperform humans on some domain-specific tasks, but their decision-making process has been known to be easily disrupted. Since the human brain consists of multiple functional areas highly connected to one another and relies on intricate interplays between bottom-up and top-down (from high-order to low-order areas) processing, we hypothesize that incorporating top-down signal processing may make DL models more robust. To address this hypothesis, we propose a certification process mimicking selective attention and test if it could make DL models more robust. Our empirical evaluations suggest that this newly proposed certification can improve DL models' accuracy and help us build safety measures to alleviate their vulnerabilities with both artificial and natural adversarial examples.

Token Recycling for Efficient Sequential Inference with Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.15335
  • repo_url: None
  • paper_authors: Jan Olszewski, Dawid Rymarczyk, Piotr Wójcik, Mateusz Pach, Bartosz Zieliński
  • for: This paper addresses efficient sequential processing of incomplete inputs with Vision Transformers (ViTs), which, unlike CNNs, do not require imputing missing values.
  • methods: The TOken REcycling (TORE) modification, usable with any architecture, splits a ViT into an iterator and an aggregator: the iterator processes each new piece of sequential information into midway tokens, which are cached, and the aggregator processes the cached midway tokens jointly to obtain the prediction, so earlier computations are reused. (A minimal caching sketch follows this entry.)
  • results: TORE substantially improves the efficiency of sequential ViT inference, and a complementary training policy significantly reduces the computational burden of sequential decision-making while achieving state-of-the-art accuracy.
    Abstract Vision Transformers (ViTs) surpass Convolutional Neural Networks in processing incomplete inputs because they do not require the imputation of missing values. Therefore, ViTs are well suited for sequential decision-making, e.g. in the Active Visual Exploration problem. However, they are computationally inefficient because they perform a full forward pass each time a piece of new sequential information arrives. To reduce this computational inefficiency, we introduce the TOken REcycling (TORE) modification for the ViT inference, which can be used with any architecture. TORE divides ViT into two parts, iterator and aggregator. An iterator processes sequential information separately into midway tokens, which are cached. The aggregator processes midway tokens jointly to obtain the prediction. This way, we can reuse the results of computations made by the iterator. Beyond efficient sequential inference, we propose a complementary training policy, which significantly reduces the computational burden associated with sequential decision-making while achieving state-of-the-art accuracy.
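
The iterator/aggregator split is easy to caricature with two linear maps standing in for the two halves of the ViT (an assumption; the paper splits actual transformer blocks). The point is the cache: each glimpse is embedded once, and only the cheap aggregation is repeated.

```python
import numpy as np

class ToreLikeInference:
    """Minimal caching sketch of the TORE idea."""

    def __init__(self, d_in, d_tok, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W_iter = rng.normal(size=(d_in, d_tok)) / np.sqrt(d_in)
        self.W_head = rng.normal(size=(d_tok, n_classes)) / np.sqrt(d_tok)
        self.cache = []                          # midway tokens, one per glimpse

    def observe(self, glimpse):
        # iterator: run once per new input, then recycle the cached token
        self.cache.append(glimpse @ self.W_iter)

    def predict(self):
        # aggregator: cheap joint pass over all cached midway tokens
        return np.mean(self.cache, axis=0) @ self.W_head

model = ToreLikeInference(d_in=16, d_tok=8, n_classes=3)
rng = np.random.default_rng(1)
for _ in range(5):                               # five sequential glimpses
    model.observe(rng.normal(size=16))
    logits = model.predict()                     # old glimpses are not recomputed
print(logits.shape)                              # (3,)
```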

ASI: Accuracy-Stability Index for Evaluating Deep Learning Models

  • paper_url: http://arxiv.org/abs/2311.15332
  • repo_url: None
  • paper_authors: Wei Dai, Daniel Berleant
  • for: This work provides a new quantitative metric for evaluating deep learning models on both accuracy and stability.
  • methods: The Accuracy-Stability Index (ASI) combines accuracy and stability into a single quantitative measure. (An illustrative combination rule follows this entry.)
  • results: Experiments demonstrate the use of ASI for evaluating accuracy and stability, and a 3D surface model is presented for visualizing ASI, mean accuracy, and coefficient of variation.
    Abstract In the context of deep learning research, where model introductions continually occur, the need for effective and efficient evaluation remains paramount. Existing methods often emphasize accuracy metrics, overlooking stability. To address this, the paper introduces the Accuracy-Stability Index (ASI), a quantitative measure incorporating both accuracy and stability for assessing deep learning models. Experimental results demonstrate the application of ASI, and a 3D surface model is presented for visualizing ASI, mean accuracy, and coefficient of variation. This paper addresses the important issue of quantitative benchmarking metrics for deep learning models, providing a new approach for accurately evaluating accuracy and stability of deep learning models. The paper concludes with discussions on potential weaknesses and outlines future research directions.
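
The abstract names the two ingredients, mean accuracy and coefficient of variation across repeated runs, without reproducing the exact formula; the penalized product below is only an assumed stand-in to show how such an index separates stable from unstable models.

```python
import numpy as np

def accuracy_stability_index(run_accuracies):
    """Illustrative accuracy-stability score: mean accuracy discounted by the
    coefficient of variation (std/mean) over repeated runs. The combination
    rule is an assumption, not the paper's exact ASI definition."""
    acc = np.asarray(run_accuracies, dtype=float)
    mean_acc = acc.mean()
    cv = acc.std(ddof=1) / mean_acc          # stability: lower is better
    return mean_acc * (1.0 - cv)

print(accuracy_stability_index([0.91, 0.90, 0.92]))   # stable: ~0.90
print(accuracy_stability_index([0.95, 0.78, 0.99]))   # unstable: ~0.80
```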

Lightweight Face Recognition: An Improved MobileFaceNet Model

  • paper_url: http://arxiv.org/abs/2311.15326
  • repo_url: None
  • paper_authors: Ahmad Hassanpour, Yasamin Kowsari
  • for: This paper explores lightweight face recognition (FR) models, specifically MobileFaceNet and its modified variant MMobileFaceNet; limited on-device compute motivates FR models with reduced memory footprints and computational demands that do not sacrifice accuracy.
  • methods: The study examines how dataset selection, model architecture, and optimization algorithms affect FR performance. The authors participated in the EFaR-2023 competition, where their models performed strongly, particularly in parameter-restricted categories, by training on a subset of the Webface42M dataset with sharpness-aware minimization (SAM). (A toy SAM step follows this entry.)
  • results: The approach yields FR models that are computationally efficient yet maintain high accuracy in diverse conditions, with improvements across benchmarks including cross-pose, cross-age, and cross-ethnicity tests.
    Abstract This paper presents an extensive exploration and comparative analysis of lightweight face recognition (FR) models, specifically focusing on MobileFaceNet and its modified variant, MMobileFaceNet. The need for efficient FR models on devices with limited computational resources has led to the development of models with reduced memory footprints and computational demands without sacrificing accuracy. Our research delves into the impact of dataset selection, model architecture, and optimization algorithms on the performance of FR models. We highlight our participation in the EFaR-2023 competition, where our models showcased exceptional performance, particularly in categories restricted by the number of parameters. By employing a subset of the Webface42M dataset and integrating sharpness-aware minimization (SAM) optimization, we achieved significant improvements in accuracy across various benchmarks, including those that test for cross-pose, cross-age, and cross-ethnicity performance. The results underscore the efficacy of our approach in crafting models that are not only computationally efficient but also maintain high accuracy in diverse conditions.
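
SAM's recipe is standard: take a gradient, ascend within an L2 ball of radius rho to a locally sharp point, then descend using the gradient evaluated there. A toy NumPy version on a quadratic loss; the learning rate and rho are arbitrary.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization (SAM) step on parameters w."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascend to the sharp point
    return w - lr * grad_fn(w + eps)              # descend from there

# toy loss f(w) = ||w||^2 with gradient 2w
w = np.array([1.0, -2.0])
for _ in range(25):
    w = sam_step(w, lambda v: 2.0 * v)
print(np.round(w, 4))   # approaches the flat minimum at the origin
```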

A Foundational Framework and Methodology for Personalized Early and Timely Diagnosis

  • paper_url: http://arxiv.org/abs/2311.16195
  • repo_url: None
  • paper_authors: Tim Schubert, Richard W Peck, Alexander Gimson, Camelia Davtyan, Mihaela van der Schaar
  • for: Improve patient outcomes and healthcare efficiency: early diagnosis can enable better treatment options, longer survival and better quality of life, and lower overall cost.
  • methods: A decision-theoretic account of the diagnosis process, integrating machine learning and statistical methodology to estimate the optimal personalized diagnostic path.
  • results: The paper proposes the first foundational framework for systematically identifying and estimating the time-dependent value of diagnostic tests for an individual patient, supporting the development of personalized decision-support tools and the assessment of technologies' impact on early diagnosis, health outcomes, and costs.
    Abstract Early diagnosis of diseases holds the potential for deep transformation in healthcare by enabling better treatment options, improving long-term survival and quality of life, and reducing overall cost. With the advent of medical big data, advances in diagnostic tests as well as in machine learning and statistics, early or timely diagnosis seems within reach. Early diagnosis research often neglects the potential for optimizing individual diagnostic paths. To enable personalized early diagnosis, a foundational framework is needed that delineates the diagnosis process and systematically identifies the time-dependent value of various diagnostic tests for an individual patient given their unique characteristics. Here, we propose the first foundational framework for early and timely diagnosis. It builds on decision-theoretic approaches to outline the diagnosis process and integrates machine learning and statistical methodology for estimating the optimal personalized diagnostic path. To describe the proposed framework as well as possibly other frameworks, we provide essential definitions. The development of a foundational framework is necessary for several reasons: 1) formalism provides clarity for the development of decision support tools; 2) observed information can be complemented with estimates of the future patient trajectory; 3) the net benefit of counterfactual diagnostic paths and associated uncertainties can be modeled for individuals 4) 'early' and 'timely' diagnosis can be clearly defined; 5) a mechanism emerges for assessing the value of technologies in terms of their impact on personalized early diagnosis, resulting health outcomes and incurred costs. Finally, we hope that this foundational framework will unlock the long-awaited potential of timely diagnosis and intervention, leading to improved outcomes for patients and higher cost-effectiveness for healthcare systems.

Perspective in Opinion Dynamics on Complex Convex Domains of Time Networks for Addiction, Forgetting

  • paper_url: http://arxiv.org/abs/2311.15318
  • repo_url: None
  • paper_authors: Yasuko Kawahata
  • for: This paper revises previous work and introduces different spatio-temporal scales. It presents a model with layers A and B that exhibit varying degrees of forgetting and dependence over time, and models changes in forgetting and dependence in layers A, A', B, and B' under certain conditions.
  • methods: New clusters C and D describe behaviors that recommend, obstruct, block, or incite forgetting and dependence over time and space; network analysis using dimer tiling is applied to gain deeper insight into network structure and dynamics.
  • results: The analysis highlights challenges in consensus building, emphasizing the dynamic nature of opinions and the need to account for dissent, distrust, and media influence; an extended framework incorporates trust, distrust, and media influence into the consensus-building model.
    Abstract This paper revises previous work and introduces changes in spatio-temporal scales. The paper presents a model that includes layers A and B with varying degrees of forgetting and dependence over time. We also model changes in dependence and forgetting in layers A, A', B, and B' under certain conditions. In addition, to discuss the formation of opinion clusters that have reinforcing or obstructive behaviors of forgetting and dependence and are conservative or brainwashing or detoxifying and less prone to filter bubbling, new clusters C and D that recommend, obstruct, block, or incite forgetting and dependence over time are introduced. This introduction allows us to test hypotheses regarding the expansion of opinions in two dimensions over time and space, the state of development of opinion space, and the expansion of public opinion. Challenges in consensus building will be highlighted, emphasizing the dynamic nature of opinions and the need to consider factors such as dissent, distrust, and media influence. The paper proposes an extended framework that incorporates trust, distrust, and media influence into the consensus building model. We introduce network analysis using dimer tiling as a method to gain deeper insights. In this context, we discuss network clustering, media influence, and consensus building. The location and distribution of dimers will be analyzed to gain insight into the structure and dynamics of the network. Dimer tiling has been applied in various fields other than network analysis, such as physics and sociology. The paper concludes by emphasizing the importance of diverse perspectives, network analysis, and influential entities in consensus building. It also introduces torus-based visualizations that aid in understanding complex network structures.

Students’ interest in knowledge acquisition in Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.16193
  • repo_url: None
  • paper_authors: Manuela-Andreea Petrescu, Emilia-Loredana Pop, Tudor-Dan Mihoc
  • for: This study explores Computer Science students' expectations of and views on an Artificial Intelligence course.
  • methods: Anonymous answers were collected from 58 of the 200 enrolled students and analyzed using thematic analysis.
  • results: Students' interest in AI stems from its trendiness, applicability, personal passion and interest, growth potential, and high salaries; expectations centered on reaching medium-level knowledge, with men more interested than women in acquiring high-level skills. The least-liked aspect was the mathematics involved, and a small group was aware of AI's potential for unethical use. By comparison, students were less passionate about the Databases course, caring mainly about DB usage and basic information.
    Abstract Some students' expectations and points of view related to the Artificial Intelligence course are explored and analyzed in this study. We anonymously collected answers from 58 undergraduate students out of 200 enrolled in the Computer Science specialization. The answers were analysed and interpreted using thematic analysis to find out their interests and the attractive and unattractive aspects of the Artificial Intelligence study topic. We concluded that students are interested in Artificial Intelligence due to its trendiness, applicability, their passion and interest in the subject, the potential for future growth, and high salaries. However, the students' expectations were mainly related to achieving medium knowledge in the Artificial Intelligence field, and men seem to be more interested in acquiring high-level skills than women. The most common part that wasn't enjoyed by the students was the mathematical aspect used in Artificial Intelligence. Some of them (a small group) were also aware of the potential of Artificial Intelligence to be used in an unethical manner for negative purposes. Our study also provides a short comparison to the Databases course, in which students were not as passionate or interested in achieving medium knowledge; their interest was related to DB usage and basic information.

Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement

  • paper_url: http://arxiv.org/abs/2311.15303
  • repo_url: https://github.com/avani17101/CD
  • paper_authors: Avani Gupta, Saurabh Saini, P J Narayanan
  • for: 这paper focuses on improving the interpretability of neural networks by using human-centered concept explanations to reduce model bias and improve understanding.
  • methods: The paper introduces Concept Activation Vectors (CAVs) for ante-hoc training and Concept Distillation to create richer concepts using a pre-trained knowledgeable model as the teacher. The method can sensitize or desensitize a model towards concepts.
  • results: The paper shows applications of concept-sensitive training to debias several classification problems and to induce prior knowledge into IID, a reconstruction problem. Concept-sensitive training can improve model interpretability, reduce biases, and induce prior knowledge. (A CAV sketch follows this entry.)
    Abstract Humans use abstract concepts for understanding instead of hard features. Recent interpretability research has focused on human-centered concept explanations of neural networks. Concept Activation Vectors (CAVs) estimate a model's sensitivity and possible biases to a given concept. In this paper, we extend CAVs from post-hoc analysis to ante-hoc training in order to reduce model bias through fine-tuning using an additional Concept Loss. Concepts were defined on the final layer of the network in the past. We generalize it to intermediate layers using class prototypes. This facilitates class learning in the last convolution layer, which is known to be most informative. We also introduce Concept Distillation to create richer concepts using a pre-trained knowledgeable model as the teacher. Our method can sensitize or desensitize a model towards concepts. We show applications of concept-sensitive training to debias several classification problems. We also use concepts to induce prior knowledge into IID, a reconstruction problem. Concept-sensitive training can improve model interpretability, reduce biases, and induce prior knowledge. Please visit https://avani17101.github.io/Concept-Distilllation/ for code and more details.
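
The CAV itself follows the standard recipe from the concept-based interpretability literature: fit a linear classifier that separates a layer's activations for concept examples from those for random examples, and take the unit normal of the hyperplane as the concept direction. A scikit-learn sketch; the paper's ante-hoc Concept Loss and distillation steps are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    """Return the unit normal of a hyperplane separating concept activations
    from random activations (the classic CAV construction)."""
    X = np.vstack([acts_concept, acts_random])
    y = np.r_[np.ones(len(acts_concept)), np.zeros(len(acts_random))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

rng = np.random.default_rng(0)
concept = rng.normal(loc=1.0, size=(40, 32))   # toy concept activations
randoms = rng.normal(loc=0.0, size=(40, 32))
v = concept_activation_vector(concept, randoms)
print(v.shape, concept.mean(0) @ v > randoms.mean(0) @ v)   # (32,) True
```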

A Data-driven and multi-agent decision support system for time slot management at container terminals: A case study for the Port of Rotterdam

  • paper_url: http://arxiv.org/abs/2311.15298
  • repo_url: None
  • paper_authors: Ali Nadi, Maaike Snelder, J. W. C. van Lint, Lóránt Tavasszy
  • for: This paper presents an intelligent decision-support system for controlling and managing truck arrival times at container-hub terminal gates.
  • methods: The model uses big historical data to predict system states and applies context-aware control policies to truck inflow and outflow, designed to satisfy multiple stakeholders including trucking companies, terminal operators, and road traffic agencies; five integrated modules steer truckers toward time slots expected to yield lower gate waiting times and more cost-effective schedules.
  • results: Simulations supported by real-world data show that significant gains can be obtained in the system.
    Abstract Controlling the departure time of the trucks from a container hub is important to both the traffic and the logistics systems. This, however, requires an intelligent decision support system that can control and manage truck arrival times at terminal gates. This paper introduces an integrated model that can be used to understand, predict, and control logistics and traffic interactions in the port-hinterland ecosystem. This approach is context-aware and makes use of big historical data to predict system states and apply control policies accordingly, on truck inflow and outflow. The control policies ensure multiple stakeholders satisfaction including those of trucking companies, terminal operators, and road traffic agencies. The proposed method consists of five integrated modules orchestrated to systematically steer truckers toward choosing those time slots that are expected to result in lower gate waiting times and more cost-effective schedules. The simulation is supported by real-world data and shows that significant gains can be obtained in the system.

Spatial and Temporal Characteristics of Freight Tours: A Data-Driven Exploratory Analysis

  • paper_url: http://arxiv.org/abs/2311.15287
  • repo_url: None
  • paper_authors: Ali Nadi, Lóránt Tavasszy, J. W. C. van Lint, Maaike Snelder
  • for: This study models digital freight transport activity data for different freight markets to infer scheduling and routing patterns.
  • methods: A new discrete-continuous decision tree approach for extracting rules from freight transport data, applied to collected tour data for the Netherlands.
  • results: Spatial and temporal characteristics are important for capturing tour types and time-of-day patterns of freight activity; carriers in most transport markets are sensitive to congestion, adjusting tour type, departure time, and the number of stops per tour when facing a congested zone.
    Abstract This paper presents a modeling approach to infer scheduling and routing patterns from digital freight transport activity data for different freight markets. We provide a complete modeling framework including a new discrete-continuous decision tree approach for extracting rules from the freight transport data. We apply these models to collected tour data for the Netherlands to understand departure time patterns and tour strategies, also allowing us to evaluate the effectiveness of the proposed algorithm. We find that spatial and temporal characteristics are important to capture the types of tours and time-of-day patterns of freight activities. Also, the empirical evidence indicates that carriers in most of the transport markets are sensitive to the level of congestion. Many of them adjust the type of tour, departure time, and the number of stops per tour when facing a congested zone. The results can be used by practitioners to get more grip on transport markets and develop freight and traffic management measures.

Bias-Variance Trade-off in Physics-Informed Neural Networks with Randomized Smoothing for High-Dimensional PDEs

  • paper_url: http://arxiv.org/abs/2311.15283
  • repo_url: None
  • paper_authors: Zheyuan Hu, Zhouhao Yang, Yezhen Wang, George Em Karniadakis, Kenji Kawaguchi
  • for: This work targets the computational cost of physics-informed neural networks (PINNs) in high dimensions, in particular the cost of computing high-order, high-dimensional derivatives in the physics-informed loss.
  • methods: Randomized Smoothing PINN (RS-PINN) adds Gaussian noise for stochastic smoothing of the original network, enabling Monte Carlo approximation of derivatives and eliminating costly auto-differentiation. (A sketch of the smoothed-gradient estimator follows this entry.)
  • results: RS-PINN is biased in both loss and gradients, with biases attributable to the PDE nonlinearity and the nonlinearity of the MSE loss; the paper derives tailored bias corrections by order of PDE nonlinearity and a hybrid method that trades the biased version's speed and low variance against the unbiased version's accuracy, validated on Fokker-Planck, HJB, viscous Burgers', Allen-Cahn, and Sine-Gordon equations.
    Abstract While physics-informed neural networks (PINNs) have been proven effective for low-dimensional partial differential equations (PDEs), the computational cost remains a hurdle in high-dimensional scenarios. This is particularly pronounced when computing high-order and high-dimensional derivatives in the physics-informed loss. Randomized Smoothing PINN (RS-PINN) introduces Gaussian noise for stochastic smoothing of the original neural net model, enabling Monte Carlo methods for derivative approximation, eliminating the need for costly auto-differentiation. Despite its computational efficiency in high dimensions, RS-PINN introduces biases in both loss and gradients, negatively impacting convergence, especially when coupled with stochastic gradient descent (SGD). We present a comprehensive analysis of biases in RS-PINN, attributing them to the nonlinearity of the Mean Squared Error (MSE) loss and the PDE nonlinearity. We propose tailored bias correction techniques based on the order of PDE nonlinearity. The unbiased RS-PINN allows for a detailed examination of its pros and cons compared to the biased version. Specifically, the biased version has a lower variance and runs faster than the unbiased version, but it is less accurate due to the bias. To optimize the bias-variance trade-off, we combine the two approaches in a hybrid method that balances the rapid convergence of the biased version with the high accuracy of the unbiased version. In addition, we present an enhanced implementation of RS-PINN. Extensive experiments on diverse high-dimensional PDEs, including Fokker-Planck, HJB, viscous Burgers', Allen-Cahn, and Sine-Gordon equations, illustrate the bias-variance trade-off and highlight the effectiveness of the hybrid RS-PINN. Empirical guidelines are provided for selecting biased, unbiased, or hybrid versions, depending on the dimensionality and nonlinearity of the specific PDE problem.
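
The Monte Carlo derivative trick rests on Stein's identity for a Gaussian-smoothed function: if $f_\sigma(x) = \mathbb{E}_z[f(x + \sigma z)]$ with $z \sim \mathcal{N}(0, I)$, then $\nabla f_\sigma(x) = \mathbb{E}_z[z\, f(x + \sigma z)] / \sigma$. The estimator is unbiased for the smoothed function but biased with respect to the original $f$, which is exactly the bias the paper analyzes and corrects. A minimal sketch:

```python
import numpy as np

def smoothed_grad(f, x, sigma=0.1, n_samples=100_000, seed=0):
    """Monte Carlo gradient of the Gaussian-smoothed f via Stein's identity,
    with no auto-differentiation; f maps a 1-D array to a scalar."""
    z = np.random.default_rng(seed).standard_normal((n_samples, x.size))
    vals = np.apply_along_axis(f, 1, x + sigma * z)
    return (z * vals[:, None]).mean(axis=0) / sigma

f = lambda v: float(np.sum(v ** 2))     # true gradient at x is 2x
x = np.array([1.0, -0.5])
print(smoothed_grad(f, x))              # approximately [2.0, -1.0]
```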

Tessel: Boosting Distributed Execution of Large DNN Models via Flexible Schedule Search Space

  • paper_url: http://arxiv.org/abs/2311.15269
  • repo_url: None
  • paper_authors: Zhiqi Lin, Youshan Miao, Guanbin Xu, Cheng Li, Olli Saarikivi, Saeed Maleki, Fan Yang
  • for: Improve the performance of DNN training and inference tasks whose execution is distributed across multiple devices, which requires carefully crafted execution schedules.
  • methods: Tessel, an automated system that searches for efficient execution schedules under diverse operator placement strategies; to cut search cost, it exploits the insight that the most efficient schedules often exhibit a repetitive pattern (repetend) across data inputs, leading to a two-phase approach of repetend construction and schedule completion.
  • results: Tessel improves both training and inference performance; experiments with representative DNN models show up to 5.5x training speedup and up to 38% inference latency reduction.
    Abstract Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference for diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit repetitive pattern (repetend) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5x training performance speedup and up to 38% inference latency reduction.

Unlearning via Sparse Representations

  • paper_url: http://arxiv.org/abs/2311.15268
  • repo_url: None
  • paper_authors: Vedant Shah, Frederik Träuble, Ashish Malik, Hugo Larochelle, Michael Mozer, Sanjeev Arora, Yoshua Bengio, Anirudh Goyal
  • for: Erase knowledge about a designated forget set from a trained model (class unlearning).
  • methods: A nearly compute-free, zero-shot unlearning technique based on a discrete representational bottleneck (sparse representations).
  • results: On CIFAR-10, CIFAR-100, and LACUNA-100, the technique matches or outperforms the state-of-the-art SCRUB approach while incurring almost no computational cost.
    Abstract Machine unlearning, which involves erasing knowledge about a forget set from a trained model, can prove to be costly and infeasible by existing techniques. We propose a nearly compute-free zero-shot unlearning technique based on a discrete representational bottleneck. We show that the proposed technique efficiently unlearns the forget set and incurs negligible damage to the model's performance on the rest of the data set. We evaluate the proposed technique on the problem of class unlearning using three datasets: CIFAR-10, CIFAR-100, and LACUNA-100. We compare the proposed technique to SCRUB, a state-of-the-art approach which uses knowledge distillation for unlearning. Across all three datasets, the proposed technique performs as well as, if not better than SCRUB while incurring almost no computational cost.

Utilizing Multiple Inputs Autoregressive Models for Bearing Remaining Useful Life Prediction

  • paper_url: http://arxiv.org/abs/2311.16192
  • repo_url: None
  • paper_authors: Junliang Wang, Qinghua Zhang, Guanhua Zhu, Guoxi Sun
  • for: The paper aims to improve the accuracy of remaining useful life (RUL) prediction for rolling bearings in industrial production.
  • methods: A multi-input autoregressive model integrates vibration signals with previously predicted health indicator (HI) values, using feature fusion and autoregressive iterations to improve generalization; a segmentation method and multiple training iterations mitigate error accumulation. (A loop sketch follows this entry.)
  • results: The model achieves significantly lower root mean square error (RMSE) and Score than other backbone networks using similar autoregressive approaches, and outperforms traditional autoregressive models that take label values as inputs as well as non-autoregressive networks.
    Abstract Accurate prediction of the Remaining Useful Life (RUL) of rolling bearings is crucial in industrial production, yet existing models often struggle with limited generalization capabilities due to their inability to fully process all vibration signal patterns. We introduce a novel multi-input autoregressive model to address this challenge in RUL prediction for bearings. Our approach uniquely integrates vibration signals with previously predicted Health Indicator (HI) values, employing feature fusion to output current window HI values. Through autoregressive iterations, the model attains a global receptive field, effectively overcoming the limitations in generalization. Furthermore, we innovatively incorporate a segmentation method and multiple training iterations to mitigate error accumulation in autoregressive models. Empirical evaluation on the PMH2012 dataset demonstrates that our model, compared to other backbone networks using similar autoregressive approaches, achieves significantly lower Root Mean Square Error (RMSE) and Score. Notably, it outperforms traditional autoregressive models that use label values as inputs and non-autoregressive networks, showing superior generalization abilities with a marked lead in RMSE and Score metrics.
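
The autoregressive feedback can be sketched as a loop that concatenates the current vibration-feature window with the last few predicted HI values before calling a regressor. The fusion-by-concatenation, the healthy-start HI of 0, and the dummy model below are assumptions, not the paper's architecture.

```python
import numpy as np

def autoregressive_hi(windows, hi_model, n_feedback=3):
    """Predict a health-indicator series; each prediction is fed back as an
    input to later steps (the multi-input autoregressive loop)."""
    hi_history = [0.0] * n_feedback
    preds = []
    for w in windows:
        fused = np.concatenate([w, hi_history[-n_feedback:]])
        hi = float(hi_model(fused))
        hi_history.append(hi)
        preds.append(hi)
    return preds

# dummy regressor: vibration energy plus momentum from recent HI values
dummy = lambda x: 0.1 * np.mean(x[:-3] ** 2) + 0.9 * np.mean(x[-3:])
windows = [np.full(8, level) for level in np.linspace(0.1, 2.0, 10)]
print(np.round(autoregressive_hi(windows, dummy), 3))   # rising HI trend
```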

Algorithm Evolution Using Large Language Model

  • paper_url: http://arxiv.org/abs/2311.15249
  • repo_url: https://github.com/Beta-Blaze/astral-LLM
  • paper_authors: Fei Liu, Xialiang Tong, Mingxuan Yuan, Qingfu Zhang
  • for: This work proposes a new approach for automatically generating optimization algorithms, reducing the effort and domain-knowledge requirements placed on human experts.
  • methods: Algorithm Evolution using Large Language Model (AEL) uses an LLM to generate optimization algorithms within an evolutionary framework, performing algorithm-level evolution without any model training. (A skeleton of such a loop follows this entry.)
  • results: On the traveling salesman problem, the constructive algorithm obtained by AEL outperforms simple hand-crafted and LLM-generated heuristics, and exhibits better scalability across problem sizes than domain-specific deep learning model-based algorithms.
    Abstract Optimization can be found in many real-life applications. Designing an effective algorithm for a specific optimization problem typically requires a tedious amount of effort from human experts with domain knowledge and algorithm design skills. In this paper, we propose a novel approach called Algorithm Evolution using Large Language Model (AEL). It utilizes a large language model (LLM) to automatically generate optimization algorithms via an evolutionary framework. AEL does algorithm-level evolution without model training. Human effort and requirements for domain knowledge can be significantly reduced. We take constructive methods for the salesman traveling problem as a test example, we show that the constructive algorithm obtained by AEL outperforms simple hand-crafted and LLM-generated heuristics. Compared with other domain deep learning model-based algorithms, these methods exhibit excellent scalability across different problem sizes. AEL is also very different from previous attempts that utilize LLMs as search operators in algorithms.
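
The abstract describes an evolutionary loop in which the LLM acts as the variation operator. A skeleton under that reading; `llm(prompt) -> code string` and `evaluate(code) -> score` are placeholders for any LLM API and benchmark harness, and the prompts are invented.

```python
import random

def algorithm_evolution(llm, evaluate, pop_size=8, generations=20):
    """AEL-style loop: the LLM proposes and recombines candidate heuristics
    (as code strings); `evaluate` scores each on problem instances."""
    population = [llm("Write a constructive TSP heuristic.") for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=evaluate, reverse=True)[: pop_size // 2]
        children = [
            llm("Combine and improve these two heuristics:\n"
                f"{random.choice(parents)}\n---\n{random.choice(parents)}")
            for _ in range(pop_size - len(parents))   # crossover via prompting
        ]
        population = parents + children
    return max(population, key=evaluate)
```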

ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2311.15243
  • repo_url: None
  • paper_authors: Yichen Bai, Zongbo Han, Changqing Zhang, Bing Cao, Xiaoheng Jiang, Qinghua Hu
  • for: Improve out-of-distribution (OOD) detection, especially for the most challenging outliers that closely resemble in-distribution (ID) data.
  • methods: Use CLIP to discover ID-like outliers from the vicinity space of ID samples, then apply a prompt-learning framework that exploits these outliers to further leverage CLIP for OOD detection; only a small number of ID samples are needed and no auxiliary outlier datasets are exposed. (A toy mining criterion follows this entry.)
  • results: Superior few-shot performance on real-world image datasets such as ImageNet-1k; in 4-shot OOD detection the method reduces average FPR95 by 12.16% and improves average AUROC by 2.76% relative to state-of-the-art methods.
    Abstract Out-of-distribution (OOD) detection methods often exploit auxiliary outliers to train model identifying OOD samples, especially discovering challenging outliers from auxiliary outliers dataset to improve OOD detection. However, they may still face limitations in effectively distinguishing between the most challenging OOD samples that are much like in-distribution (ID) data, i.e., ID-like samples. To this end, we propose a novel OOD detection framework that discovers ID-like outliers using CLIP from the vicinity space of the ID samples, thus helping to identify these most challenging OOD samples. Then a prompt learning framework is proposed that utilizes the identified ID-like outliers to further leverage the capabilities of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a small number of ID samples to learn the prompts of the model without exposing other auxiliary outlier datasets. By focusing on the most challenging ID-like OOD samples and elegantly exploiting the capabilities of CLIP, our method achieves superior few-shot learning performance on various real-world image datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method reduces the average FPR95 by 12.16% and improves the average AUROC by 2.76%, compared to state-of-the-art methods).
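
One plausible reading of "ID-like outlier mining" is a nearest-neighbor criterion in embedding space: from a pool of candidate embeddings (e.g., drawn from the vicinity of the ID samples), keep those most similar to some ID sample, since these are the hardest cases. The cosine criterion below is an assumption, not the paper's exact CLIP-based procedure.

```python
import numpy as np

def mine_id_like(id_feats, candidate_feats, top_k=10):
    """Return indices of the candidates whose embeddings are closest
    (cosine similarity) to their nearest ID sample."""
    id_n = id_feats / np.linalg.norm(id_feats, axis=1, keepdims=True)
    cand_n = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    nearest_id_sim = (cand_n @ id_n.T).max(axis=1)
    return np.argsort(nearest_id_sim)[-top_k:]

rng = np.random.default_rng(0)
id_feats = rng.normal(size=(50, 128))
candidates = rng.normal(size=(500, 128))
print(mine_id_like(id_feats, candidates, top_k=5))
```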

See and Think: Embodied Agent in Virtual Environment

  • paper_url: http://arxiv.org/abs/2311.15209
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Jenq-Neng Hwang, Gaoang Wang
  • for: This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment.
  • methods: STEVE consists of three key components: vision perception, language instruction, and code action. Vision perception involves interpreting visual information in the environment, while language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval in a skill database.
  • results: The paper achieves at most 1.5 times faster unlocking key tech trees and 2.5 times quicker in block search tasks compared to previous state-of-the-art methods. Additionally, the paper collects a new dataset called STEVE-21K, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs.
    Abstract Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. In this paper, we propose STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE consists of three key components: vision perception, language instruction, and code action. Vision perception involves the interpretation of visual information in the environment, which is then integrated into the LLMs component with agent state and task instruction. Language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval in skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect STEVE-21K dataset, which includes 600$+$ vision-environment pairs, 20K knowledge question-answering pairs, and 200$+$ skill-code pairs. We conduct continuous block search, knowledge question and answering, and tech tree mastery to evaluate the performance. Extensive experiments show that STEVE achieves at most $1.5 \times$ faster unlocking key tech trees and $2.5 \times$ quicker in block search tasks compared to previous state-of-the-art methods.

LongStory: Coherent, Complete and Length Controlled Long story Generation

  • paper_url: http://arxiv.org/abs/2311.15208
  • repo_url: None
  • paper_authors: Kyeongman Park, Nakyeong Yang, Kyomin Jung
  • for: This paper presents LongStory, a model for coherent, complete, and length-controlled long story generation.
  • methods: Two novel methodologies: a long- and short-term contexts weight calibrator (CWC) that adjusts the weights of long-term context memory and short-term context, acknowledging their distinct roles, and discourse tokens that convey a long story's structural positions (LSP).
  • results: Trained on three datasets with varied average story lengths, LongStory outperforms baselines, including the strong story generator Plotmachine, in coherence, completeness, relevance, and repetitiveness; zero-shot tests on each dataset assess prediction beyond the training data and validate the methodology against model variants.
    Abstract A human author can write any length of story without losing coherence. Also, they always bring the story to a proper ending, an ability that current language models lack. In this work, we present the LongStory for coherent, complete, and length-controlled long story generation. LongStory introduces two novel methodologies: (1) the long and short-term contexts weight calibrator (CWC) and (2) long story structural positions (LSP). The CWC adjusts weights for long-term context Memory and short-term context Cheating, acknowledging their distinct roles. The LSP employs discourse tokens to convey the structural positions of a long story. Trained on three datasets with varied average story lengths, LongStory outperforms other baselines, including the strong story generator Plotmachine, in coherence, completeness, relevance, and repetitiveness. We also perform zero-shot tests on each dataset to assess the model's ability to predict outcomes beyond its training data and validate our methodology by comparing its performance with variants of our model.

ChatGPT and Beyond: The Generative AI Revolution in Education

  • paper_url: http://arxiv.org/abs/2311.15198
  • repo_url: None
  • paper_authors: Mohammad AL-Smadi
  • for: This paper aims to explore the potential applications and implications of generative AI models, particularly ChatGPT, in the educational landscape.
  • methods: The paper conducts a comprehensive and rigorous evaluation of recent academic literature on generative AI models in education, specifically targeting high-impact research from Scopus-indexed Q1 and Q2 journals published between November 2022 and July 2023.
  • results: The survey identifies potential benefits, challenges, and emerging trends in the integration of AI technologies into learning environments, and seeks to contribute to the understanding of the nexus between artificial intelligence and education, empowering educators, researchers, and policymakers to make informed decisions about that integration.
    Abstract The wide adoption and usage of generative artificial intelligence (AI) models, particularly ChatGPT, has sparked a surge in research exploring their potential applications in the educational landscape. This survey examines academic literature published between November, 2022, and July, 2023, specifically targeting high-impact research from Scopus-indexed Q1 and Q2 journals. This survey delves into the practical applications and implications of generative AI models across a diverse range of educational contexts. Through a comprehensive and rigorous evaluation of recent academic literature, this survey seeks to illuminate the evolving role of generative AI models, particularly ChatGPT, in education. By shedding light on the potential benefits, challenges, and emerging trends in this dynamic field, the survey endeavors to contribute to the understanding of the nexus between artificial intelligence and education. The findings of this review will empower educators, researchers, and policymakers to make informed decisions about the integration of AI technologies into learning environments.

Neural Network Models of Becoming a Cardinal Principle Knower

  • paper_url: http://arxiv.org/abs/2311.15194
  • repo_url: None
  • paper_authors: Vima Gupta, Sashank Varma
  • for: This paper investigates how children entering elementary school move from a memorized count list of the first 50-100 numbers to knowing the successor function and understanding the countably infinite.
  • methods: Two neural network models learn the successor function on the pairs (N, N+1) for N in (0, 98): one uses a one-hot encoding of input and output values (corresponding to memorizing a count list), the other a place-value encoding (corresponding to learning the language rules for naming numbers). (A toy comparison of the two encodings follows this entry.)
  • results: The place-value model shows the predicted drop in representational similarity across tens boundaries; counting across a tens boundary can be understood as a vector operation in 2D space, with numbers sharing a tens place organized in a linearly separable manner and those sharing a ones place grouped together. A curriculum-learning simulation shows that, in the developing child's expanding numerical environment, representations of smaller numbers continue to be sharpened even as larger numbers begin to be learned.
    Abstract As children enter elementary school, their understanding of the ordinal structure of numbers transitions from a memorized count list of the first 50-100 numbers to knowing the successor function and understanding the countably infinite. We investigate this developmental change in two neural network models that learn the successor function on the pairs (N, N+1) for N in (0, 98). The first uses a one-hot encoding of the input and output values and corresponds to children memorizing a count list, while the second model uses a place-value encoding and corresponds to children learning the language rules for naming numbers. The place-value model showed a predicted drop in representational similarity across tens boundaries. Counting across a tens boundary can be understood as a vector operation in 2D space, where the numbers with the same tens place are organized in a linearly separable manner, whereas those with the same ones place are grouped together. A curriculum learning simulation shows that, in the expanding numerical environment of the developing child, representations of smaller numbers continue to be sharpened even as larger numbers begin to be learned. These models set the stage for future work using recurrent architectures to move beyond learning the successor function to simulating the counting process more generally, and point towards a deeper understanding of what it means to understand the countably infinite.
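
The two encodings can be compared directly: under a place-value code, successor pairs within a decade share their tens digit, while pairs straddling a tens boundary share nothing, which is exactly where the drop in representational similarity is predicted. A toy illustration:

```python
import numpy as np

def one_hot(n, max_n=99):
    """Rote-list coding: every number is its own symbol, no shared structure."""
    v = np.zeros(max_n + 1); v[n] = 1.0
    return v

def place_value(n):
    """Place-value coding: one-hot tens digit concatenated with one-hot ones digit."""
    tens, ones = divmod(n, 10)
    t = np.zeros(10); t[tens] = 1.0
    o = np.zeros(10); o[ones] = 1.0
    return np.concatenate([t, o])

print(one_hot(34) @ one_hot(35))           # 0.0: one-hot codes never overlap
print(place_value(34) @ place_value(35))   # 1.0: shared tens digit within a decade
print(place_value(39) @ place_value(40))   # 0.0: similarity drop at the boundary
```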

IA-LSTM: Interaction-Aware LSTM for Pedestrian Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2311.15193
  • repo_url: None
  • paper_authors: Yuehai Chen
  • for: Predicting pedestrian trajectories in crowd scenarios is indispensable for self-driving cars and autonomous mobile robots, since estimating the future locations of surrounding pedestrians informs collision-avoidance policy decisions.
  • methods: A novel correntropy-based mechanism measures the relative importance of human-human interactions and builds a personal space for each pedestrian; an interaction module built on this data-driven mechanism extracts feature representations of dynamic human-human interactions and computes the corresponding importance weights, within an interaction-aware LSTM architecture for trajectory prediction. (A correntropy sketch follows this entry.)
  • results: On two public datasets, the model achieves better trajectory-prediction performance than several recent methods.
    Abstract Predicting the trajectory of pedestrians in crowd scenarios is indispensable in the self-driving and autonomous mobile robot fields, because estimating the future locations of surrounding pedestrians is beneficial for policy decisions that avoid collisions. It is a challenging issue because humans have different walking motions, and the interactions between humans and objects in the current environment, especially between humans themselves, are complex. Previous research has focused on how to model human-human interactions but has neglected the relative importance of interactions. To address this issue, we introduce a novel mechanism based on correntropy, which can not only measure the relative importance of human-human interactions but also build a personal space for each pedestrian. We further propose an Interaction Module, including this data-driven mechanism, that can effectively extract feature representations of dynamic human-human interactions in the scene and calculate corresponding weights to represent the importance of different interactions. To share such social messages among pedestrians, we design an interaction-aware architecture based on the Long Short-Term Memory (LSTM) network for trajectory prediction. We demonstrate the performance of our model on two public datasets, and the experimental results show that it achieves better performance than several recent methods.
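
The abstract leaves the exact correntropy formulation unspecified; the following hedged sketch uses a Gaussian-kernel correntropy of relative displacements as an interaction-importance weight, which is one standard instantiation:

```python
import numpy as np

def correntropy_weight(rel_pos: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian-kernel correntropy of relative displacement: nearby neighbors
    get weights near 1 and distant ones decay toward 0, implicitly carving out
    a 'personal space' around the ego pedestrian."""
    sq_dist = np.sum(rel_pos ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Toy scene: ego pedestrian at the origin, three neighbors at varying distances.
ego = np.zeros(2)
neighbors = np.array([[0.5, 0.2], [2.0, 1.0], [6.0, -3.0]])
w = correntropy_weight(neighbors - ego)
w = w / w.sum()   # normalized importance weights for aggregating neighbor features
print(w)          # the nearest neighbor dominates the interaction
```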

MACE: A Multi-pattern Accommodated and Efficient Anomaly Detection Method in the Frequency Domain

  • paper_url: http://arxiv.org/abs/2311.16191
  • repo_url: None
  • paper_authors: Feiyi Chen, Yingying zhang, Zhen Qin, Lunting Fan, Renhe Jiang, Yuxuan Liang, Qingsong Wen, Shuiguang Deng
  • for: This paper proposes an efficient, multi-pattern anomaly detection method suited to anomaly detection problems in cloud systems.
  • methods: The method operates in the frequency domain and includes a novel pattern extraction mechanism that handles diverse normal patterns, plus a dualistic convolution mechanism that improves both sensitivity to short-term anomalies and efficiency.
  • results: Experiments show the method effectively handles diverse normal patterns with a unified model and achieves state-of-the-art performance at low computational cost.
    Abstract Anomaly detection significantly enhances the robustness of cloud systems. While neural network-based methods have recently demonstrated strong advantages, they encounter practical challenges in cloud environments: the contradiction between the impracticality of maintaining a unique model for each service and the limited ability of a unified model to deal with diverse normal patterns, as well as issues with handling heavy traffic in real time and sensitivity to short-term anomalies. Thus, we propose MACE, a Multi-pattern Accommodated and Efficient anomaly detection method in the frequency domain for time series anomaly detection. It has three novel characteristics: (i) a pattern extraction mechanism that excels at handling diverse normal patterns, enabling the model to identify anomalies by examining the correlation between a data sample and its service's normal pattern, instead of focusing solely on the data sample itself; (ii) a dualistic convolution mechanism that amplifies short-term anomalies in the time domain and hinders the reconstruction of anomalies in the frequency domain, which enlarges the reconstruction-error disparity between anomaly and normality and facilitates anomaly detection; (iii) leveraging the sparsity and parallelism of the frequency domain to enhance model efficiency. We theoretically and experimentally prove that using a strategically selected subset of Fourier bases not only reduces computational overhead but also helps distinguish anomalies, compared to using the complete spectrum. Moreover, extensive experiments demonstrate MACE's effectiveness in handling diverse normal patterns with a unified model, and it achieves state-of-the-art performance with high efficiency.
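
As an illustration of the Fourier-subset idea (MACE's actual selection strategy and its dualistic convolution are not reproduced here), the sketch below scores a window by its reconstruction error from a small set of bases chosen on normal data; the top-k energy selection is an assumption:

```python
import numpy as np

def topk_basis(normal_windows: np.ndarray, k: int) -> np.ndarray:
    """Pick the k Fourier bases with the highest average energy on normal data."""
    spec = np.abs(np.fft.rfft(normal_windows, axis=-1)).mean(axis=0)
    return np.argsort(spec)[-k:]

def recon_error(window: np.ndarray, keep: np.ndarray) -> float:
    """Reconstruct a window from the kept bases only; samples that deviate from
    the normal pattern's spectral support reconstruct poorly."""
    coeffs = np.fft.rfft(window)
    masked = np.zeros_like(coeffs)
    masked[keep] = coeffs[keep]
    recon = np.fft.irfft(masked, n=len(window))
    return float(np.mean((window - recon) ** 2))

t = np.linspace(0, 4 * np.pi, 128)
normal = np.stack([np.sin(t + p) for p in np.random.rand(32) * 2 * np.pi])
keep = topk_basis(normal, k=4)

clean = np.sin(t)
spike = clean.copy(); spike[60] += 3.0      # short-term anomaly
print(recon_error(clean, keep))             # small: matches the normal pattern
print(recon_error(spike, keep))             # noticeably larger
```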

Domain Knowledge Injection in Bayesian Search for New Materials

  • paper_url: http://arxiv.org/abs/2311.15162
  • repo_url: None
  • paper_authors: Zikai Xie, Xenophon Evangelopoulos, Joseph Thacker, Andrew Cooper
  • for: This work designs DKIBO, a Bayesian optimization algorithm that incorporates domain knowledge to tune exploration of the search space.
  • methods: Structural knowledge is injected into the acquisition function via an additional deterministic surrogate model that enriches the approximation power of the Gaussian process, improving search efficiency.
  • results: The practical utility of DKIBO is demonstrated by successfully injecting domain knowledge in a materials design task, with further validation across different experimental settings and ablation analyses.
    Abstract In this paper, we propose DKIBO, a Bayesian optimization (BO) algorithm that accommodates domain knowledge to tune exploration in the search space. Bayesian optimization has recently emerged as a sample-efficient optimizer for many intractable scientific problems. While various existing BO frameworks allow the input of prior beliefs to accelerate the search by narrowing down the space, incorporating such knowledge is not always straightforward and can often introduce bias and lead to poor performance. Here we propose a simple approach to incorporate structural knowledge in the acquisition function by utilizing an additional deterministic surrogate model to enrich the approximation power of the Gaussian process. This surrogate is suitably chosen according to structural information of the problem at hand and acts as a corrective term towards better-informed sampling. We empirically demonstrate the practical utility of the proposed method by successfully injecting domain knowledge in a materials design task. We further validate our method's performance on different experimental settings and ablation analyses.
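
A minimal sketch of the idea of correcting a GP-based acquisition with a deterministic, domain-informed surrogate; the UCB form, the mixing weight `lam`, and the toy `prior` model are illustrative assumptions rather than DKIBO's exact formulation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def dk_acquisition(x, gp, surrogate, beta=2.0, lam=0.5):
    """UCB-style acquisition whose mean is blended with a deterministic,
    domain-informed surrogate acting as a corrective term."""
    mu, std = gp.predict(x, return_std=True)
    return (1 - lam) * mu + lam * surrogate(x) + beta * std

X = np.random.rand(10, 1)
y = np.sin(6 * X).ravel()
gp = GaussianProcessRegressor().fit(X, y)

cand = np.linspace(0, 1, 200).reshape(-1, 1)
prior = lambda x: 0.8 * np.sin(6 * x).ravel()   # hypothetical domain-knowledge model
best = cand[np.argmax(dk_acquisition(cand, gp, prior))]
print(best)   # next point to evaluate, steered by both the GP and domain knowledge
```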

Hessian Aware Low-Rank Weight Perturbation for Continual Learning

  • paper_url: http://arxiv.org/abs/2311.15161
  • repo_url: https://github.com/lijiaqi/halrp
  • paper_authors: Jiaqi Li, Rui Wang, Yuanhao Lai, Changjian Shui, Sabyasachi Sahoo, Charles X. Ling, Shichun Yang, Boyu Wang, Christian Gagné, Fan Zhou
  • for: This work proposes a continual learning algorithm based on low-rank weight perturbation, so that a sequence of tasks can be learned without forgetting previously acquired knowledge.
  • methods: Modeling the parameter transitions across tasks as weight-matrix transformations, the method applies a low-rank approximation to the task-adaptive parameters in each layer. A quantitative relationship between the Hessian and the low-rank approximation determines the approximation ranks, and model growth is controlled by pruning less important parameters.
  • results: Extensive experiments on multiple benchmarks, including a large-scale task set, show the method handles different task orders and the forgetting issue better than several recent state-of-the-art methods while controlling resource growth.
    Abstract Continual learning aims to learn a series of tasks sequentially without forgetting the knowledge acquired from the previous ones. In this work, we propose the Hessian Aware Low-Rank Perturbation algorithm for continual learning. By modeling the parameter transitions along the sequential tasks with the weight matrix transformation, we propose to apply the low-rank approximation on the task-adaptive parameters in each layer of the neural networks. Specifically, we theoretically demonstrate the quantitative relationship between the Hessian and the proposed low-rank approximation. The approximation ranks are then globally determined according to the marginal increment of the empirical loss estimated by the layer-specific gradient and low-rank approximation error. Furthermore, we control the model capacity by pruning less important parameters to diminish the parameter growth. We conduct extensive experiments on various benchmarks, including a dataset with large-scale tasks, and compare our method against some recent state-of-the-art methods to demonstrate the effectiveness and scalability of our proposed method. Empirical results show that our method performs better on different benchmarks, especially in achieving task order robustness and handling the forgetting issue. Demo code can be found at https://github.com/lijiaqi/HALRP.
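
A hedged sketch of a per-task low-rank weight perturbation $W + UV^\top$ on a frozen base layer; HALRP's Hessian-based rank selection and pruning are not reproduced, and the fixed rank here is an assumption:

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Shared base weight plus a per-task low-rank perturbation W + U V^T.
    Only (U, V) grow with new tasks, keeping parameter growth controlled."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # protect knowledge from earlier tasks
        out_f, in_f = base.weight.shape
        self.U = nn.Parameter(torch.zeros(out_f, rank))
        self.V = nn.Parameter(torch.randn(in_f, rank) * 0.01)

    def forward(self, x):
        return self.base(x) + x @ self.V @ self.U.T

layer = LowRankAdaptedLinear(nn.Linear(64, 32), rank=4)
print(layer(torch.randn(8, 64)).shape)        # torch.Size([8, 32])
```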

xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data

  • paper_url: http://arxiv.org/abs/2311.15156
  • repo_url: None
  • paper_authors: Jing Gong, Minsheng Hao, Xingyi Cheng, Xin Zeng, Chiming Liu, Jianzhu Ma, Xuegong Zhang, Taifeng Wang, Le Song
  • for: Advances in high-throughput sequencing have pushed publicly available single-cell RNA-seq data past 50M records, but classical transformer architectures face computation and memory bottlenecks on data of this scale.
  • methods: The authors propose xTrimoGene$^\alpha$ (xTrimoGene for short), a novel asymmetric encoder-decoder transformer that exploits the sparsity of the data to scale up pre-training.
  • results: Experiments show performance improves as model size scales, reaching state-of-the-art accuracy on downstream tasks including cell type annotation, perturb-seq effect prediction, and drug combination prediction. The model is available as a service at https://api.biomap.com/xTrimoGene/apply.
    Abstract Advances in high-throughput sequencing technology have led to significant progress in measuring gene expressions at the single-cell level. The amount of publicly available single-cell RNA-seq (scRNA-seq) data is already surpassing 50M records for humans with each record measuring 20,000 genes. This highlights the need for unsupervised representation learning to fully ingest these data, yet classical transformer architectures are prohibitive to train on such data in terms of both computation and memory. To address this challenge, we propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^\alpha$ (or xTrimoGene for short), which leverages the sparse characteristic of the data to scale up the pre-training. This scalable design of xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy, enabling us to train the largest transformer models over the largest scRNA-seq dataset today. Our experiments also show that the performance of xTrimoGene improves as we scale up the model sizes, and it also leads to SOTA performance over various downstream tasks, such as cell type annotation, perturb-seq effect prediction, and drug combination prediction. xTrimoGene model is now available for use as a service via the following link: https://api.biomap.com/xTrimoGene/apply.
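
One way the sparsity of scRNA-seq data can shorten the encoder input is sketched below; this illustrates the asymmetric-design idea rather than xTrimoGene's actual tokenization, and `max_len` is an assumed cap:

```python
import torch

def nonzero_tokens(expr_row: torch.Tensor, max_len: int = 2048):
    """Keep only expressed genes: (gene index, value) pairs form the encoder
    input, so the sequence length tracks the nonzero count instead of all
    20,000 genes; a decoder can still reconstruct the full profile."""
    idx = expr_row.nonzero(as_tuple=True)[0]
    idx = idx[:max_len]
    return idx, expr_row[idx]

cell = torch.zeros(20000)
cell[torch.randint(0, 20000, (900,))] = torch.rand(900)
gene_ids, values = nonzero_tokens(cell)
print(gene_ids.numel())   # roughly 900 tokens instead of 20,000
```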

cs.CL - 2023-11-26

Uncertainty-aware Language Modeling for Selective Question Answering

  • paper_url: http://arxiv.org/abs/2311.15451
  • repo_url: None
  • paper_authors: Qi Yang, Shreya Ravikumar, Fynn Schmitt-Ulms, Satvik Lolla, Ege Demir, Iaroslav Elistratov, Alex Lavaee, Sadhana Lolla, Elaheh Ahmadi, Daniela Rus, Alexander Amini, Alejandro Perez
  • for: This work develops an approach for automatically converting large language models (LLMs) into uncertainty-aware models that produce an uncertainty estimate with every prediction.
  • methods: The approach requires no external models or systems, is model- and data-agnostic, and is computationally efficient.
  • results: Converted models are evaluated on selective question answering; using the uncertainty estimates to decide which questions to answer yields significantly higher accuracy than directly using model probabilities.
    Abstract We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs capable of estimating uncertainty with every prediction. Our approach is model- and data-agnostic, is computationally-efficient, and does not rely on external models or systems. We evaluate converted models on the selective question answering setting -- to answer as many questions as possible while maintaining a given accuracy, forgoing providing predictions when necessary. As part of our results, we test BERT and Llama 2 model variants on the SQuAD extractive QA task and the TruthfulQA generative QA task. We show that using the uncertainty estimates provided by our approach to selectively answer questions leads to significantly higher accuracy over directly using model probabilities.
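
A minimal sketch of the selective-answering loop, assuming each prediction already comes with an uncertainty score from the converted model; the threshold sweep and function names are illustrative:

```python
def selective_answer(preds, uncertainties, threshold):
    """Keep only predictions whose estimated uncertainty is at most the
    threshold; returns the answered (index, prediction) pairs and coverage."""
    kept = [(i, p) for i, (p, u) in enumerate(zip(preds, uncertainties))
            if u <= threshold]
    return kept, len(kept) / len(preds)

def accuracy_at_threshold(preds, uncertainties, labels, threshold):
    """Accuracy over the answered subset: sweeping the threshold traces out
    the coverage/accuracy trade-off used in selective QA."""
    kept, coverage = selective_answer(preds, uncertainties, threshold)
    acc = sum(labels[i] == p for i, p in kept) / len(kept) if kept else 0.0
    return acc, coverage

preds = ["a", "b", "c", "d"]
uncs = [0.1, 0.9, 0.2, 0.8]
labels = ["a", "x", "c", "d"]
print(accuracy_at_threshold(preds, uncs, labels, threshold=0.5))  # (1.0, 0.5)
```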

Learning to Skip for Language Modeling

  • paper_url: http://arxiv.org/abs/2311.15436
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui
  • for: Improve the generalization of language models in few-shot learning by allocating computation adaptively.
  • methods: Assign a variable amount of computation to different tokens via a simple binary routing mechanism that can skip the execution of a layer (or module) for any input token.
  • results: Extensive evaluation across 24 NLP tasks shows significantly improved 1-shot performance over competitive baselines at only a mild extra inference cost.
    Abstract Overparameterized large-scale language models show impressive generalization performance in in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the complexity or importance of the input data. We argue that in language model pretraining, a variable amount of computation should be assigned to different tokens, and this can be efficiently achieved via a simple routing mechanism. Different from conventional early-stopping techniques, where tokens can exit early only at early layers, we propose a more general method that dynamically skips the execution of a layer (or module) for any input token with a binary router. In our extensive evaluation across 24 NLP tasks, we demonstrate that the proposed method can significantly improve the 1-shot performance compared to other competitive baselines at only a mild extra cost for inference.
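
A hedged sketch of a per-token binary router wrapped around one layer. For clarity this version still computes the layer for all tokens and merely selects outputs; a real implementation would gather only the routed tokens to actually save compute, and training would need a soft or straight-through gate rather than the hard threshold shown:

```python
import torch
import torch.nn as nn

class SkipRoutedLayer(nn.Module):
    """Per-token binary router: tokens scored below 0 bypass the wrapped
    layer via the identity, so 'easy' tokens spend less computation."""
    def __init__(self, layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = layer
        self.router = nn.Linear(d_model, 1)   # one logit per token

    def forward(self, x):                     # x: (batch, seq, d_model)
        execute = (self.router(x).squeeze(-1) > 0).unsqueeze(-1)  # binary decision
        return torch.where(execute, self.layer(x), x)

block = SkipRoutedLayer(nn.Linear(16, 16), d_model=16)
print(block(torch.randn(2, 5, 16)).shape)     # torch.Size([2, 5, 16])
```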

Machine-Generated Text Detection using Deep Learning

  • paper_url: http://arxiv.org/abs/2311.15425
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Raghav Gaggar, Ashish Bhagchandani, Harsh Oza
  • for: This work studies distinguishing text generated by large language models (LLMs) from human-written text, which matters for a variety of applications.
  • methods: Several detectors are evaluated, including a Support Vector Machine (SVM), RoBERTa-base, and RoBERTa-large, against GPT-3.5-Turbo generations.
  • results: Detection results depend predominantly on the sequence length of the sentence.
    Abstract Our research focuses on the crucial challenge of discerning text produced by Large Language Models (LLMs) from human-generated text, which holds significance for various applications. Amid ongoing discussion about the attainability of a model with such functionality, we present supporting evidence for the feasibility of such models. We evaluated our models on multiple datasets, including Twitter Sentiment, Football Commentary, Project Gutenberg, PubMedQA, and SQuAD, confirming the efficacy of the enhanced detection approaches. These datasets were sampled under intricate constraints covering every possibility, laying the foundation for future research. We evaluate GPT-3.5-Turbo against various detectors such as SVM, RoBERTa-base, and RoBERTa-large. The results predominantly depended on the sequence length of the sentences.

Learning Section Weights for Multi-Label Document Classification

  • paper_url: http://arxiv.org/abs/2311.15402
  • repo_url: https://github.com/MaziarMF/Learning-Section-Weights-for-Multi-Label-Document-Classification
  • paper_authors: Maziar Moradi Fard, Paula Sorrolla Bayod, Kiomars Motarjem, Mohammad Alian Nejadi, Saber Akhondi, Camilo Thorne
  • for: This paper proposes a new multi-label document classification method that better handles documents carrying multiple labels.
  • methods: The method, Learning Section Weights (LSW), uses multiple feed-forward layers to learn a weight for each section of a document and incorporates those weights in the prediction.
  • results: Experiments on public (arXiv) and private (Elsevier) datasets show LSW outperforms state-of-the-art multi-label document classification methods, with a 1.3% improvement in macro-averaged F1-score and a 1.3% improvement in macro-averaged recall on the publicly available arXiv dataset.
    Abstract Multi-label document classification is a traditional task in NLP. Compared to single-label classification, each document can be assigned multiple classes. The problem is crucially important in various domains, such as tagging scientific articles. Documents are often structured into several sections such as abstract and title. Current approaches treat different sections equally for multi-label classification. We argue that this is not a realistic assumption, leading to sub-optimal results. Instead, we propose a new method called Learning Section Weights (LSW), leveraging the contribution of each distinct section for multi-label classification. Via multiple feed-forward layers, LSW learns to assign a weight to each section and to incorporate the weights in the prediction. We demonstrate our approach on scientific articles. Experimental results on public (arXiv) and private (Elsevier) datasets confirm the superiority of LSW compared to state-of-the-art multi-label document classification methods. In particular, LSW achieves a 1.3% improvement in macro-averaged F1-score and a 1.3% improvement in macro-averaged recall on the publicly available arXiv dataset.
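
A minimal sketch of the section-weighting idea as described: feed-forward layers score each section embedding, and the normalized scores weight the sections before a multi-label (sigmoid) classifier. Dimensions and the scorer depth are assumptions:

```python
import torch
import torch.nn as nn

class SectionWeightedClassifier(nn.Module):
    """Each section (title, abstract, ...) yields an embedding; a small
    feed-forward scorer produces per-section weights, and the softmax-
    normalized weights combine the sections before multi-label prediction."""
    def __init__(self, d_model: int, n_labels: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                    nn.Linear(d_model, 1))
        self.classifier = nn.Linear(d_model, n_labels)

    def forward(self, section_embs):                   # (batch, n_sections, d_model)
        w = torch.softmax(self.scorer(section_embs).squeeze(-1), dim=-1)
        doc = (w.unsqueeze(-1) * section_embs).sum(dim=1)
        return torch.sigmoid(self.classifier(doc))     # independent label probabilities

model = SectionWeightedClassifier(d_model=128, n_labels=10)
print(model(torch.randn(4, 3, 128)).shape)             # torch.Size([4, 10])
```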

Enhancing Empathetic and Emotion Support Dialogue Generation with Prophetic Commonsense Inference

  • paper_url: http://arxiv.org/abs/2311.15316
  • repo_url: None
  • paper_authors: Lanrui Wang, Jiangnan Li, Chenxu Yang, Zheng Lin, Weiping Wang
  • for: Improve the response quality of dialogue agents so they better understand people's emotions and situations.
  • methods: Use large language models to understand dialogue and perform commonsense inference, and train tunable models that bridge past and potential future dialogue topics.
  • results: On EmpatheticDialogues and Emotion Support Conversation, equipping dialogue agents with the proposed prophetic commonsense inference significantly enhances response quality.
    Abstract The interest in Empathetic and Emotional Support conversations among the public has significantly increased. To offer more sensitive and understanding responses, leveraging commonsense knowledge has become a common strategy to better understand psychological aspects and causality. However, such commonsense inferences can be out of context and unable to predict upcoming dialogue themes, resulting in responses that lack coherence and empathy. To remedy this issue, we present Prophetic Commonsense Inference, an innovative paradigm for inferring commonsense knowledge. By harnessing the capabilities of Large Language Models in understanding dialogue and making commonsense deductions, we train tunable models to bridge the gap between past and potential future dialogues. Extensive experiments conducted on EmpatheticDialogues and Emotion Support Conversation show that equipping dialogue agents with our proposed prophetic commonsense inference significantly enhances the quality of their responses.

UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation

  • paper_url: http://arxiv.org/abs/2311.15296
  • repo_url: https://github.com/IAAR-Shanghai/UHGEval
  • paper_authors: Xun Liang, Shichao Song, Simin Niu, Zhiyu Li, Feiyu Xiong, Bo Tang, Zhaohui Wy, Dawei He, Peng Cheng, Zhonghao Wang, Haiying Deng
  • for: This paper aims to assess the authentic reliability of large language models (LLMs) in text generation, specifically in the context of the Chinese language.
  • methods: The paper develops an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark to evaluate the ability of LLMs to generate text without restrictions, and establishes a comprehensive benchmark evaluation framework for scalable and reproducible experiments.
  • results: The paper conducts extensive experiments using prominent Chinese language models and the GPT series models, providing insights into hallucination challenges in text generation and the professional performance of these models.
    Abstract Large language models (LLMs) have emerged as pivotal contributors in contemporary natural language processing and are increasingly being applied across a diverse range of industries. However, these large-scale probabilistic statistical models cannot currently ensure the requisite quality in professional content generation. These models often produce hallucinated text, compromising their practical utility in professional contexts. To assess the authentic reliability of LLMs in text generation, numerous initiatives have developed benchmark evaluations for hallucination phenomena. Nevertheless, these benchmarks frequently utilize constrained generation techniques due to cost and temporal constraints. These techniques encompass the use of directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations in text generation is presently lacking. Consequently, we have developed an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, designed to compile outputs produced with minimal restrictions by LLMs. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also executed extensive experiments, evaluating prominent Chinese language models and the GPT series models to derive professional performance insights regarding hallucination challenges.

Dataset for Stock Market Forecasting Based on Quantitative Analysis and Qualitative Data

  • paper_url: http://arxiv.org/abs/2311.15218
  • repo_url: None
  • paper_authors: Sai Akash Bathini, Dagli Cihan
  • for: The paper is written for researchers and practitioners in finance and machine learning, with a focus on stock market forecasting.
  • methods: The paper combines numerical stock data with qualitative text data, including news articles, TV news captions, radio transcripts, and tweets, to extract sentiment and provide a holistic view of the stock market.
  • results: The paper provides an unprecedented, publicly available dataset of technical and fundamental data and sentiment, with daily entries from January 2018 to December 2022 for 8 different companies and the Dow Jones Index as a whole, which can be used to train and deploy deep learning models for stock market forecasting.
    Abstract The application of machine learning to finance has become a familiar approach, even more so in stock market forecasting. The stock market is highly volatile, and huge amounts of data are generated every minute globally. The extraction of effective intelligence from this data is of critical importance. However, combining numerical stock data with qualitative text data can be a challenging task. In this work, we accomplish this and provide an unprecedented, publicly available dataset with technical and fundamental data and sentiment that we gathered from news archives, TV news captions, radio transcripts, tweets, daily financial newspapers, and more. The text data entries used for sentiment extraction total more than 1.4 million. The dataset comprises daily entries from January 2018 to December 2022 for 8 different companies and the Dow Jones Index as a whole. Holistic fundamental and technical data are provided, training-ready, for model learning and deployment. The predictive power of deep learning models is strongly determined by the training data provided. This dataset would benefit research globally that incorporates qualitative intelligence into stock market forecasting. The dataset is made available at https://github.com/batking24/Huge-Stock-Dataset.

Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

  • paper_url: http://arxiv.org/abs/2311.15211
  • repo_url: https://github.com/whynlp/probabilistic-transformer
  • paper_authors: Haoyi Wu, Kewei Tu
  • for: This paper is written to propose a new model of contextual word representation that is based on syntactic and probabilistic principles, with the goal of bridging the gap between traditional and neural approaches to NLP.
  • methods: The proposed model uses a conditional random field to model discrete latent representations of all words in a sentence, as well as dependency arcs between them. The model also uses mean field variational inference for approximate inference.
  • results: The authors find that their model performs competitively with transformers on small to medium sized datasets, and that the computation graph of their model resembles transformers in terms of dependencies and self-attention.
    Abstract Syntactic structures used to play a vital role in natural language processing (NLP), but since the deep learning revolution, NLP has been gradually dominated by neural models that do not consider syntactic structures in their design. One vastly successful class of neural models is transformers. When used as an encoder, a transformer produces contextual representations of the words in the input sentence. In this work, we propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective. Specifically, we design a conditional random field that models discrete latent representations of all words in a sentence as well as dependency arcs between them; and we use mean field variational inference for approximate inference. Strikingly, we find that the computation graph of our model resembles transformers, with correspondences between dependencies and self-attention and between distributions over latent representations and contextual embeddings of words. Experiments show that our model performs competitively with transformers on small to medium sized datasets. We hope that our work could help bridge the gap between traditional syntactic and probabilistic approaches and cutting-edge neural approaches to NLP, and inspire more linguistically-principled neural approaches in the future.
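
For orientation, the generic mean-field fixed-point update for a CRF over discrete latent word representations $z_i$ takes the textbook form

$$q^{(t+1)}(z_i) \propto \exp\Big(\phi_i(z_i) + \sum_{j \neq i} \mathbb{E}_{q^{(t)}(z_j)}\big[\psi_{ij}(z_i, z_j)\big]\Big),$$

with unary potentials $\phi_i$ and pairwise potentials $\psi_{ij}$; the paper's model additionally couples representations through latent dependency arcs, which this generic form does not capture. Unrolling such updates step by step is what produces a computation graph resembling stacked self-attention layers.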

Benchmarking Large Language Model Volatility

  • paper_url: http://arxiv.org/abs/2311.15180
  • repo_url: None
  • paper_authors: Boyang Yu
  • for: This study examines how the non-deterministic outputs of large language models (LLMs) affect financial text understanding.
  • methods: A news sentiment classification task for investing in the US equity market is used as a case study.
  • results: Sentence-level sentiment classification results show substantial variability, and these uncertainties cascade into larger variations in portfolio construction and return. Lowering the decoder temperature mitigates the volatility at the expense of creativity, while ensembling multiple outputs mitigates it at a notable computational cost.
    Abstract The impact of non-deterministic outputs from Large Language Models (LLMs) is not well examined for financial text understanding tasks. Through a compelling case study on investing in the US equity market via news sentiment analysis, we uncover substantial variability in sentence-level sentiment classification results, underscoring the innate volatility of LLM outputs. These uncertainties cascade downstream, leading to more significant variations in portfolio construction and return. While tweaking the temperature parameter in the language model decoder presents a potential remedy, it comes at the expense of stifled creativity. Similarly, while ensembling multiple outputs mitigates the effect of volatile outputs, it demands a notable computational investment. This work furnishes practitioners with invaluable insights for adeptly navigating uncertainty in the integration of LLMs into financial decision-making, particularly in scenarios dictated by non-deterministic information.
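
A minimal sketch of the output-ensembling mitigation discussed in the abstract; `classify` is purely a placeholder standing in for a non-deterministic LLM sentiment call:

```python
import random
from collections import Counter

def majority_sentiment(classify, headline: str, k: int = 5) -> str:
    """Query a non-deterministic classifier k times and return the majority
    label, damping output volatility at a k-fold inference cost."""
    votes = Counter(classify(headline) for _ in range(k))
    return votes.most_common(1)[0][0]

# Volatile stub standing in for an LLM sentiment call.
stub = lambda text: random.choice(["positive", "positive", "negative"])
print(majority_sentiment(stub, "Shares jump on earnings beat"))
```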

cs.LG - 2023-11-26

Frobenius-Type Norms and Inner Products of Matrices and Linear Maps with Applications to Neural Network Training

  • paper_url: http://arxiv.org/abs/2311.15419
  • repo_url: None
  • paper_authors: Roland Herzog, Frederik Köhne, Leonie Kreis, Anton Schiela
  • for: This paper concerns the Frobenius norm and inner product for matrices and their application in neural network training.
  • methods: The Frobenius inner product, typically used to evaluate gradients with respect to matrix variables, is generalized to a family of Frobenius-type norms that depend on the inner products chosen in the domain and co-domain spaces.
  • results: The classical Frobenius norm is shown to be one special element of this broader family, and the extra freedom can be used, among other things, to precondition neural network training.
    Abstract The Frobenius norm is a frequent choice of norm for matrices. In particular, the underlying Frobenius inner product is typically used to evaluate the gradient of an objective with respect to a matrix variable, such as those occurring in the training of neural networks. We provide a broader view on the Frobenius norm and inner product for linear maps or matrices, and establish their dependence on inner products in the domain and co-domain spaces. This shows that the classical Frobenius norm is merely one special element of a family of more general Frobenius-type norms. The significant extra freedom furnished by this realization can be used, among other things, to precondition neural network training.
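
For readers who want the dependence on domain and co-domain inner products spelled out, one standard construction (which may differ in detail from the paper's) is the following: given symmetric positive definite matrices $M$ and $N$ defining inner products on $\mathbb{R}^n$ and $\mathbb{R}^m$, a linear map $A:\mathbb{R}^n\to\mathbb{R}^m$ has adjoint $A^{*} = M^{-1}A^\top N$, and the induced Frobenius-type inner product and norm are

$$\langle A, B\rangle_{M,N} = \operatorname{tr}(A^{*}B) = \operatorname{tr}\!\left(M^{-1}A^\top N B\right), \qquad \|A\|_{M,N} = \sqrt{\langle A, A\rangle_{M,N}},$$

which reduce to the classical Frobenius inner product and norm when $M = I_n$ and $N = I_m$.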

Applying statistical learning theory to deep learning

  • paper_url: http://arxiv.org/abs/2311.15404
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Cédric Gerbelot, Avetik Karagulyan, Stefani Karp, Kavya Ravichandran, Menachem Stern, Nathan Srebro
  • for: These lectures provide an overview of deep learning from a learning theory perspective, in particular how different architectures may lead to inductive bias when trained with gradient-based methods.
  • methods: After recalling statistical learning theory and stochastic optimization, the lectures use the notion of implicit bias, together with the mirror descent algorithm, to move back and forth between a parameter space and the corresponding function space, with the geometry of the learning problem represented by a metric tensor.
  • results: A detailed study of the implicit bias of gradient descent on linear diagonal networks shows how the loss function, the scale of parameters at initialization, and the depth of the network lead to different forms of implicit bias, in particular transitioning between kernel and feature learning.
    Abstract Although statistical learning theory provides a robust framework to understand supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep learning from a learning theory perspective. After a brief reminder on statistical learning theory and stochastic optimization, we discuss implicit bias in the context of benign overfitting. We then move to a general description of the mirror descent algorithm, showing how we may go back and forth between a parameter space and the corresponding function space for a given learning problem, as well as how the geometry of the learning problem may be represented by a metric tensor. Building on this framework, we provide a detailed study of the implicit bias of gradient descent on linear diagonal networks for various regression tasks, showing how the loss function, scale of parameters at initialization and depth of the network may lead to various forms of implicit bias, in particular transitioning between kernel or feature learning.
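
For reference, the standard mirror descent update the lectures build on (written here in its usual form, an assumption about the exact notation used in the lectures) is

$$x_{t+1} = \nabla\phi^{*}\big(\nabla\phi(x_t) - \eta\,\nabla f(x_t)\big),$$

where $\phi$ is a strictly convex mirror map and $\phi^{*}$ its convex conjugate: the update maps the iterate to the dual space via $\nabla\phi$, takes a gradient step there, and maps back via $\nabla\phi^{*}$. The metric tensor $\nabla^2\phi$ encodes the geometry in which descent is performed, and the choice $\phi(x) = \tfrac12\|x\|_2^2$ recovers plain gradient descent.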

Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression

  • paper_url: http://arxiv.org/abs/2311.15390
  • repo_url: None
  • paper_authors: Zhihang Li, Zhao Song, Zifan Wang, Junze Yin
  • for: This paper analyzes a two-layer regression problem in which, unlike prior work, the first layer is activated by a softmax unit, setting the stage for analyses of further activation functions built on the softmax.
  • methods: The model is trained with an approximate Newton method that minimizes the regularized training loss; the loss Hessian is shown to be positive definite and Lipschitz continuous under certain assumptions.
  • results: Local convergence guarantees are established: with appropriate initialization, the algorithm finds an $\epsilon$-approximate minimizer of the training loss in $O(\log(1/\epsilon))$ iterations with high probability.
    Abstract There have been significant advancements made by large language models (LLMs) in various aspects of our daily lives. LLMs serve as a transformative force in natural language processing, finding applications in text generation, translation, sentiment analysis, and question-answering. The accomplishments of LLMs have led to a substantial increase in research efforts in this domain. One specific two-layer regression problem has been well-studied in prior works, where the first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit. While previous works provide a solid analysis of building a two-layer regression, there is still a gap in the analysis of constructing regression problems with more than two layers. In this paper, we take a crucial step toward addressing this problem: we provide an analysis of a two-layer regression problem. In contrast to previous works, our first layer is activated by a softmax unit. This sets the stage for future analyses of creating more activation functions based on the softmax function. Rearranging the softmax function leads to significantly different analyses. Our main results involve analyzing the convergence properties of an approximate Newton method used to minimize the regularized training loss. We prove that the loss function for the Hessian matrix is positive definite and Lipschitz continuous under certain assumptions. This enables us to establish local convergence guarantees for the proposed training algorithm. Specifically, with an appropriate initialization and after $O(\log(1/\epsilon))$ iterations, our algorithm can find an $\epsilon$-approximate minimizer of the training loss with high probability. Each iteration requires approximately $O(\mathrm{nnz}(C) + d^\omega)$ time, where $d$ is the model size, $C$ is the input matrix, and $\omega < 2.374$ is the matrix multiplication exponent.
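
For orientation, a generic (approximate) Newton iteration on the regularized loss $L$ has the form

$$w_{k+1} = w_k - \widetilde{H}(w_k)^{-1}\,\nabla L(w_k), \qquad \widetilde{H}(w_k) \approx \nabla^2 L(w_k),$$

and the abstract's assumptions (a positive definite, Lipschitz-continuous Hessian near the minimizer) are the classical recipe that makes such an iteration locally convergent in $O(\log(1/\epsilon))$ steps from a suitable initialization. The paper's specific Hessian approximation, which yields the $O(\mathrm{nnz}(C) + d^\omega)$ per-iteration cost, is not reproduced here.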

Spectro-ViT: A Vision Transformer Model for GABA-edited MRS Reconstruction Using Spectrograms

  • paper_url: http://arxiv.org/abs/2311.15386
  • repo_url: None
  • paper_authors: Gabriel Dias, Rodrigo Pommot Berto, Mateus Oliveira, Lucas Ueda, Sergio Dertkigil, Paula D. P. Costa, Amirmohammad Shamaei, Roberto Souza, Ashley Harris, Leticia Rittner
  • for: To investigate the use of a Vision Transformer (ViT) to reconstruct/denoise GABA-edited magnetic resonance spectroscopy (MRS) data from a quarter of the typically acquired number of transients.
  • methods: The transients are pre-processed and converted to spectrogram images with the Short-Time Fourier Transform (STFT), allowing a pre-trained ViT to be fine-tuned for reconstructing GABA-edited MRS spectra (Spectro-ViT); the model is tested on \textit{in vivo} GABA-edited MRS data.
  • results: Spectro-ViT significantly outperformed all compared models on four of five quantitative metrics (mean squared error, shape score, GABA+/water fit error, and full width at half maximum), and the estimated metabolite concentrations (GABA+/water, GABA+/Cr, and Glx/water) were consistent with those from scans reconstructed with the full number of transients.
    Abstract Purpose: To investigate the use of a Vision Transformer (ViT) to reconstruct/denoise GABA-edited magnetic resonance spectroscopy (MRS) from a quarter of the typically acquired number of transients using spectrograms. Theory and Methods: A quarter of the typically acquired number of transients collected in GABA-edited MRS scans are pre-processed and converted to a spectrogram image representation using the Short-Time Fourier Transform (STFT). The image representation of the data allows the adaptation of a pre-trained ViT for reconstructing GABA-edited MRS spectra (Spectro-ViT). The Spectro-ViT is fine-tuned and then tested using \textit{in vivo} GABA-edited MRS data. The Spectro-ViT performance is compared against other models in the literature using spectral quality metrics and estimated metabolite concentration values. Results: The Spectro-ViT model significantly outperformed all other models in four out of five quantitative metrics (mean squared error, shape score, GABA+/water fit error, and full width at half maximum). The metabolite concentrations estimated (GABA+/water, GABA+/Cr, and Glx/water) were consistent with the metabolite concentrations estimated using typical GABA-edited MRS scans reconstructed with the full amount of typically collected transients. Conclusion: The proposed Spectro-ViT model achieved state-of-the-art results in reconstructing GABA-edited MRS, and the results indicate these scans could be up to four times faster.
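
A minimal sketch of the STFT-to-spectrogram step, using a synthetic complex free-induction decay; the window length, sampling rate, and magnitude-only output are illustrative choices, not the paper's pre-processing:

```python
import numpy as np
from scipy.signal import stft

def mrs_to_spectrogram(fid: np.ndarray, fs: float, nperseg: int = 64) -> np.ndarray:
    """Short-Time Fourier Transform of a complex MRS free-induction decay,
    returning a magnitude image that a ViT can consume as input."""
    _, _, Z = stft(fid, fs=fs, nperseg=nperseg, return_onesided=False)
    return np.abs(Z)

# Toy FID: exponentially decaying complex sinusoid.
n = np.arange(2048)
fid = np.exp(-n / 400.0) * np.exp(2j * np.pi * 0.05 * n)
img = mrs_to_spectrogram(fid, fs=2000.0)
print(img.shape)   # (frequency bins, time frames) image for a Spectro-ViT-style model
```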

Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means

  • paper_url: http://arxiv.org/abs/2311.15384
  • repo_url: None
  • paper_authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das
  • for: To propose an efficient and automatic clustering technique that remains robust across different scenarios.
  • methods: The method integrates principles of model-based and centroid-based clustering, using a Median-of-Means estimator to mitigate the effect of noise and outliers on clustering quality while removing the need to specify the number of clusters in advance.
  • results: Statistical guarantees on the upper bound of the clustering error are provided, and rigorous assessment on simulated and real-world data shows clear advantages over existing state-of-the-art clustering algorithms.
    Abstract Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When confronted with data containing noisy or outlier-laden observations, the Median-of-Means (MoM) estimator emerges as a stabilizing force for any centroid-based clustering framework. On a different note, a prevalent constraint among existing clustering methodologies resides in the prerequisite knowledge of the number of clusters prior to analysis. Utilizing model-based methodologies, such as Bayesian nonparametric models, offers the advantage of infinite mixture models, thereby circumventing the need for such requirements. Motivated by these facts, in this article, we present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies that mitigates the effect of noise on the quality of clustering while ensuring that the number of clusters need not be specified in advance. Statistical guarantees on the upper bound of clustering error, and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
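
The Median-of-Means estimator that stabilizes centroid updates can be sketched in a few lines (the block count and the coordinate-wise median are one common variant; the paper's Dirichlet-process component for choosing the number of clusters is not shown):

```python
import numpy as np

def median_of_means(x: np.ndarray, n_blocks: int = 5) -> np.ndarray:
    """Split samples into disjoint blocks, average each block, then take the
    coordinate-wise median of the block means: a centroid estimate that stays
    robust when outliers contaminate only a minority of blocks."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(x))
    blocks = np.array_split(idx, n_blocks)
    means = np.stack([x[b].mean(axis=0) for b in blocks])
    return np.median(means, axis=0)

pts = np.random.randn(100, 2)
pts[:5] += 50.0                   # gross outliers
print(pts.mean(axis=0))           # the plain mean is dragged toward the outliers
print(median_of_means(pts))       # MoM stays near the true center (0, 0)
```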

Evaluating Multi-Global Server Architecture for Federated Learning

  • paper_url: http://arxiv.org/abs/2311.15382
  • repo_url: None
  • paper_authors: Asfia Kawnine, Hung Cao, Atah Nuh Mih, Monica Wachowicz
  • for: Improve the reliability and efficiency of federated learning systems for collaborative learning and model training across many devices.
  • methods: A federated learning framework with multiple global servers is proposed, exploiting local collaborations and knowledge aggregation while handling the communication failures that a single-server architecture cannot tolerate.
  • results: Experiments on electric vehicle charging records show less than a 1% performance difference between the multi-server and single-server setups, while the multi-server design addresses the single server's communication-failure problem.
    Abstract Federated learning (FL) with a single global server framework is currently a popular approach for training machine learning models in decentralized environments, such as mobile devices and edge devices. However, the centralized server architecture poses a risk, as any failure of the central/global server would bring down the entire system. To minimize this risk, we propose a novel federated learning framework that leverages the deployment of multiple global servers. We posit that multiple global servers can enhance efficiency by capitalizing on local collaborations and aggregating knowledge, and that the error tolerance with regard to communication failure in the single-server framework can be handled. We conducted a series of experiments using a dataset containing the event history of electric vehicle (EV) charging at numerous stations. We deployed a federated learning setup with multiple global servers and client servers, where each client server strategically represented a different region and a global server was responsible for aggregating local updates from those devices. Our preliminary results for the global models demonstrate that the difference in performance attributable to multiple servers is less than 1%. While the hypothesized gain in model efficiency did not materialize as expected, the rule for handling communication challenges added to the algorithm resolves the error-tolerance issue. Future research can focus on identifying specific uses for the deployment of multiple global servers.

Untargeted Code Authorship Evasion with Seq2Seq Transformation

  • paper_url: http://arxiv.org/abs/2311.15366
  • repo_url: None
  • paper_authors: Soohyeon Choi, Rhongho Jang, DaeHun Nyang, David Mohaisen
  • for: This work targets code authorship attribution, the problem of identifying the authors of source code through its stylistic features, and how to evade it.
  • methods: SCAE, a code authorship obfuscation technique, customizes a Seq2Seq code transformer called StructCoder (originally designed for function-level code translation, e.g., Java to C#) using transfer learning.
  • results: SCAE improves efficiency at a slight accuracy cost relative to existing work, reduces processing time by about 68%, and maintains an 85% transformation success rate and up to a 95.77% evasion success rate in the untargeted setting.
    Abstract Code authorship attribution is the problem of identifying the authors of programming-language code through the stylistic features in their code, a topic that has recently attracted significant interest, with methods achieving outstanding performance. In this work, we present SCAE, a code authorship obfuscation technique that leverages a Seq2Seq code transformer called StructCoder. SCAE customizes StructCoder, a system designed initially for function-level code translation from one language to another (e.g., Java to C#), using transfer learning. SCAE improves efficiency at only a slight accuracy degradation compared to existing work. We also reduced the processing time by about 68% while maintaining an 85% transformation success rate and up to a 95.77% evasion success rate in the untargeted setting.

A Convergence result of a continuous model of deep learning via Łojasiewicz–Simon inequality

  • paper_url: http://arxiv.org/abs/2311.15365
  • repo_url: None
  • paper_authors: Noboru Isobe
  • for: This study concerns a Wasserstein-type gradient flow that represents the optimization process of a continuous model of a deep neural network (DNN).
  • methods: The existence of a minimizer of the averaged loss under $L^2$-regularization is established first, followed by the existence of a curve of maximal slope of the loss. The main result, convergence of the flow to a critical point of the loss as time goes to infinity, rests on a Łojasiewicz–Simon gradient inequality derived under analyticity assumptions on the networks and loss functions.
  • results: The proofs offer a new approach for analyzing the asymptotic behavior of Wasserstein-type gradient flows for nonconvex functionals.
    Abstract This study focuses on a Wasserstein-type gradient flow, which represents an optimization process of a continuous model of a Deep Neural Network (DNN). First, we establish the existence of a minimizer for an average loss of the model under $L^2$-regularization. Subsequently, we show the existence of a curve of maximal slope of the loss. Our main result is the convergence of flow to a critical point of the loss as time goes to infinity. An essential aspect of proving this result involves the establishment of the \L{}ojasiewicz--Simon gradient inequality for the loss. We derive this inequality by assuming the analyticity of NNs and loss functions. Our proofs offer a new approach for analyzing the asymptotic behavior of Wasserstein-type gradient flows for nonconvex functionals.
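
For reference, the Łojasiewicz–Simon gradient inequality invoked in the proof reads, in its standard form (the paper adapts it to the Wasserstein setting),

$$\big|E(u) - E(u_{*})\big|^{1-\theta} \le C\,\|E'(u)\|, \qquad \theta \in \big(0, \tfrac12\big],$$

for all $u$ in a neighborhood of a critical point $u_{*}$; it converts smallness of the gradient into control of the energy gap, which is what upgrades subsequential convergence to convergence of the whole trajectory as $t \to \infty$.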

Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.15341
  • repo_url: None
  • paper_authors: Changyu Chen, Ramesha Karunasena, Thanh Hong Nguyen, Arunesh Sinha, Pradeep Varakantham
  • for: This work addresses reinforcement learning problems with large discrete, multidimensional, unordered action spaces, such as randomized allocation of security resources or emergency response units, where valid allocations are hard to express in closed form and stochastic optimal policies are preferred.
  • methods: A state-conditional normalizing flow compactly represents the stochastic policy: the network produces one sampled action and its log-probability, which an actor-critic method consumes. An invalid-action rejection method (via a valid-action oracle), enabled by a modified policy gradient, updates the base policy.
  • results: Experiments show the approach scales to large discrete action spaces better than prior methods and can enforce arbitrary state-conditional constraints on the support of the action distribution in any state.
    Abstract Many problems in Reinforcement Learning (RL) seek an optimal policy with large discrete multidimensional yet unordered action spaces; these include problems in randomized allocation of resources such as placements of multiple security resources and emergency response units, etc. A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large, for which existing RL methods do not perform well. Moreover, these problems require validity of the realized action (allocation); this validity constraint is often difficult to express compactly in a closed mathematical form. The allocation nature of the problem also prefers stochastic optimal policies, if one exists. In this work, we address these challenges by (1) applying a (state) conditional normalizing flow to compactly represent the stochastic policy -- the compactness arises due to the network only producing one sampled action and the corresponding log probability of the action, which is then used by an actor-critic method; and (2) employing an invalid action rejection method (via a valid action oracle) to update the base policy. The action rejection is enabled by a modified policy gradient that we derive. Finally, we conduct extensive experiments to show the scalability of our approach compared to prior methods and the ability to enforce arbitrary state-conditional constraints on the support of the distribution of actions in any state.
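
A heavily hedged sketch of the sampling loop: `flow.sample_with_log_prob` and `is_valid` are hypothetical stand-ins for the state-conditional flow and the valid-action oracle, and the modified policy gradient that accounts for rejection during training is not reproduced:

```python
def sample_valid_action(flow, state, is_valid, max_tries: int = 100):
    """Sample from a state-conditional flow policy, rejecting invalid
    allocations via a validity oracle; returns the action and its
    log-probability for consumption by an actor-critic update."""
    for _ in range(max_tries):
        action, log_prob = flow.sample_with_log_prob(state)  # hypothetical flow API
        if is_valid(state, action):                           # oracle check
            return action, log_prob
    raise RuntimeError("no valid action found within the retry budget")
```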

FRAC-Q-Learning: A Reinforcement Learning with Boredom Avoidance Processes for Social Robots

  • paper_url: http://arxiv.org/abs/2311.15327
  • repo_url: None
  • paper_authors: Akinari Onishi
  • for: This study develops a reinforcement learning algorithm suited to social robots that avoids boring the user.
  • methods: The algorithm combines a forgetting process with randomizing and categorizing processes, and is compared against traditional Q-learning.
  • results: FRAC-Q-learning showed a significantly higher trend in interest scores and was significantly harder to bore users with than traditional Q-learning, suggesting it can support the development of social robots that do not bore their users.
    Abstract Reinforcement learning algorithms have often been applied to social robots. However, most of these algorithms were not optimized for use on social robots and consequently may bore users. We propose a new reinforcement learning method specialized for social robots, FRAC-Q-learning, that can avoid user boredom. The proposed algorithm consists of a forgetting process in addition to randomizing and categorizing processes. This study evaluated interest and boredom-hardness scores of FRAC-Q-learning by comparison with traditional Q-learning. FRAC-Q-learning showed a significantly higher trend in interest scores and proved significantly harder to bore users with than traditional Q-learning. Therefore, FRAC-Q-learning can contribute to the development of a social robot that will not bore users. The proposed algorithm can also find applications in Web-based communication and educational systems. This paper presents the entire process, a detailed implementation, and a detailed evaluation method of FRAC-Q-learning for the first time.

Generalized Graph Prompt: Toward a Unification of Pre-Training and Downstream Tasks on Graphs

  • paper_url: http://arxiv.org/abs/2311.15317
  • repo_url: https://github.com/gmcmt/graph_prompt_extension
  • paper_authors: Xingtong Yu, Zhenghao Liu, Yuan Fang, Zemin Liu, Sihong Chen, Xinming Zhang
  • for: This paper proposes GraphPrompt, a graph pre-training and prompting framework that unifies pre-training and downstream tasks into a common task template.
  • methods: GraphPrompt employs a learnable prompt to help a downstream task locate the most relevant knowledge from the pre-trained model in a task-specific manner; GraphPrompt+ further generalizes popular graph pre-training tasks beyond simple link prediction and places prompt vectors within every layer of the pre-trained graph encoder.
  • results: Extensive experiments on five public datasets evaluate and analyze GraphPrompt and GraphPrompt+.
    Abstract Graph neural networks have emerged as a powerful tool for graph representation learning, but their performance heavily relies on abundant task-specific supervision. To reduce labeling requirement, the "pre-train, prompt" paradigms have become increasingly common. However, existing study of prompting on graphs is limited, lacking a universal treatment to appeal to different downstream tasks. In this paper, we propose GraphPrompt, a novel pre-training and prompting framework on graphs. GraphPrompt not only unifies pre-training and downstream tasks into a common task template but also employs a learnable prompt to assist a downstream task in locating the most relevant knowledge from the pre-trained model in a task-specific manner. To further enhance GraphPrompt in these two stages, we extend it into GraphPrompt+ with two major enhancements. First, we generalize several popular graph pre-training tasks beyond simple link prediction to broaden the compatibility with our task template. Second, we propose a more generalized prompt design that incorporates a series of prompt vectors within every layer of the pre-trained graph encoder, in order to capitalize on the hierarchical information across different layers beyond just the readout layer. Finally, we conduct extensive experiments on five public datasets to evaluate and analyze GraphPrompt and GraphPrompt+.
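
A minimal sketch of a prompt at the readout level, the setting GraphPrompt starts from (GraphPrompt+ additionally places such vectors inside every encoder layer); the elementwise reweighting and sum readout are assumptions about the exact prompt form:

```python
import torch
import torch.nn as nn

class PromptedReadout(nn.Module):
    """A learnable prompt vector reweights node-feature dimensions before the
    graph-level readout, steering a frozen pre-trained encoder toward the
    knowledge a given downstream task needs."""
    def __init__(self, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.ones(d_model))  # per-task, tiny to tune

    def forward(self, node_embs):                        # (n_nodes, d_model)
        return (self.prompt * node_embs).sum(dim=0)      # prompted sum readout

readout = PromptedReadout(d_model=64)
print(readout(torch.randn(30, 64)).shape)                # torch.Size([64])
```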

Secure and Verifiable Data Collaboration with Low-Cost Zero-Knowledge Proofs

  • paper_url: http://arxiv.org/abs/2311.15310
  • repo_url: None
  • paper_authors: Yizheng Zhu, Yuncheng Wu, Zhaojing Luo, Beng Chin Ooi, Xiaokui Xiao
  • for: This work aims to provide a secure and verifiable federated learning (FL) solution for data collaboration, enabling data sharing and analytics without exposing raw data.
  • methods: The paper devises a probabilistic integrity check method and a hybrid commitment scheme to guarantee input privacy and integrity, together with a theoretical security guarantee.
  • results: Experiments show the proposed solution is efficient in both client computation and communication; for example, RiseFL is up to 28x, 53x, and 164x faster than the state-of-the-art baselines ACORN, RoFL, and EIFFeL in client computation.
    Abstract Organizations are increasingly recognizing the value of data collaboration for data analytics purposes. Yet, stringent data protection laws prohibit the direct exchange of raw data. To facilitate data collaboration, federated learning (FL) emerges as a viable solution, which enables multiple clients to collaboratively train a machine learning (ML) model under the supervision of a central server while ensuring the confidentiality of their raw data. However, existing studies have unveiled two main risks: (i) the potential for the server to infer sensitive information from the client's uploaded updates (i.e., model gradients), compromising client input privacy, and (ii) the risk of malicious clients uploading malformed updates to poison the FL model, compromising input integrity. Recent works utilize secure aggregation with zero-knowledge proofs (ZKP) to guarantee input privacy and integrity in FL. Nevertheless, they suffer from extremely low efficiency and, thus, are impractical for real deployment. In this paper, we propose a novel and highly efficient solution, RiseFL, for secure and verifiable data collaboration, ensuring input privacy and integrity simultaneously. First, we devise a probabilistic integrity check method that significantly reduces the cost of ZKP generation and verification. Second, we design a hybrid commitment scheme to satisfy Byzantine robustness with improved performance. Third, we theoretically prove the security guarantee of the proposed solution. Extensive experiments on synthetic and real-world datasets suggest that our solution is effective and is highly efficient in both client computation and communication. For instance, RiseFL is up to 28x, 53x and 164x faster than three state-of-the-art baselines ACORN, RoFL and EIFFeL for the client computation.
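To make the probabilistic-check idea concrete, here is an illustrative plaintext sketch of verifying an L2-norm bound on a client update from a few random projections; RiseFL's actual method wraps such checks in commitments and zero-knowledge proofs, which are omitted here.

```python
import numpy as np

def passes_norm_check(update, bound, num_probes=32, slack=1.5, seed=0):
    """Estimate ||update||^2 from random Gaussian projections.

    For g ~ N(0, I), E[<update, g>^2] = ||update||^2, so averaging a few
    squared projections gives an unbiased norm estimate at a fraction of
    the cost of checking every coordinate.
    """
    rng = np.random.default_rng(seed)  # in a real protocol: shared randomness
    d = update.shape[0]
    probes = rng.standard_normal((num_probes, d))
    estimate = np.mean((probes @ update) ** 2)
    return estimate <= slack * bound ** 2

honest = np.random.randn(1000) * 0.01
poisoned = honest.copy(); poisoned[0] = 100.0   # one malformed coordinate
print(passes_norm_check(honest, bound=1.0))     # True
print(passes_norm_check(poisoned, bound=1.0))   # False (with high probability)
```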

Controllable Expensive Multi-objective Optimization with Warm-starting Gaussian Processes

  • paper_url: http://arxiv.org/abs/2311.15297
  • repo_url: None
  • paper_authors: Quang-Huy Nguyen, Long P. Hoang, Hoang V. Viet, Dung D. Le
  • for: This paper proposes a controllable Pareto set learning method to address the instability and inefficiency of existing Pareto set learning approaches on expensive multi-objective optimization problems.
  • methods: The method has two stages: warm-starting Bayesian optimization to obtain quality Gaussian process priors, and controllable Pareto set learning to accurately acquire a parametric mapping from preferences to the corresponding Pareto solutions.
  • results: On synthetic and real-world multi-objective optimization problems, the method substantially improves the efficiency and stability of expensive multi-objective optimization tasks.
    Abstract Pareto Set Learning (PSL) is a promising approach for approximating the entire Pareto front in multi-objective optimization (MOO) problems. However, existing derivative-free PSL methods are often unstable and inefficient, especially for expensive black-box MOO problems where objective function evaluations are costly. In this work, we propose to address the instability and inefficiency of existing PSL methods with a novel controllable PSL method, called Co-PSL. Particularly, Co-PSL consists of two stages: (1) warm-starting Bayesian optimization to obtain quality Gaussian Processes priors and (2) controllable Pareto set learning to accurately acquire a parametric mapping from preferences to the corresponding Pareto solutions. The former helps stabilize the PSL process and reduce the number of expensive function evaluations. The latter supports real-time trade-off control between conflicting objectives. Performance across synthetic and real-world MOO problems showcases the effectiveness of our Co-PSL for expensive multi-objective optimization tasks.
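A minimal sketch of the second stage, assuming PyTorch: a small network maps preference vectors on the simplex to candidate solutions and is trained against a weighted-Tchebycheff scalarization of two cheap stand-in objectives; the Gaussian-process surrogates from the warm-starting stage are not reproduced.

```python
import torch
import torch.nn as nn

f1 = lambda x: (x ** 2).sum(dim=1)              # surrogate objective 1
f2 = lambda x: ((x - 1.0) ** 2).sum(dim=1)      # surrogate objective 2

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    w = torch.rand(128, 2)
    w = w / w.sum(dim=1, keepdim=True)          # random preferences on simplex
    x = net(w)                                  # preference -> solution
    F = torch.stack([f1(x), f2(x)], dim=1)
    loss = (w * F).max(dim=1).values.mean()     # weighted-Tchebycheff scalarization
    opt.zero_grad(); loss.backward(); opt.step()

# Real-time trade-off control: query any preference without re-optimizing.
print(net(torch.tensor([[0.9, 0.1]])))          # solution biased toward f1
```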

A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation

  • paper_url: http://arxiv.org/abs/2311.15238
  • repo_url: None
  • paper_authors: Heyang Zhao, Jiafan He, Quanquan Gu
  • for: This paper addresses the exploration-exploitation dilemma in reinforcement learning with complex model classes.
  • methods: It proposes Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning with general function approximation. The key designs are (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency.
  • results: MQL-UCB achieves minimax-optimal regret of $\tilde{O}(d\sqrt{HK})$ and near-optimal policy switching cost of $\tilde{O}(dH)$, where $d$ is the eluder dimension of the function class, $H$ is the planning horizon, and $K$ is the number of episodes. The work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
    Abstract The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
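A small numerical sketch of the variance-weighted regression idea in the linear special case: samples with lower noise variance get larger weights, and an elliptical-potential bonus yields the optimistic (UCB) value. MQL-UCB itself handles general function classes, which this toy omits.

```python
import numpy as np

d, n, lam, beta = 4, 200, 1.0, 1.0
rng = np.random.default_rng(0)
theta_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
sigma2 = rng.uniform(0.05, 2.0, size=n)            # per-sample noise variances
y = X @ theta_star + rng.standard_normal(n) * np.sqrt(sigma2)

W = np.diag(1.0 / sigma2)                          # variance weights
Sigma = X.T @ W @ X + lam * np.eye(d)              # weighted Gram matrix
theta_hat = np.linalg.solve(Sigma, X.T @ W @ y)    # weighted ridge estimate

def ucb(x):
    # Optimistic value: point estimate plus exploration bonus.
    bonus = beta * np.sqrt(x @ np.linalg.solve(Sigma, x))
    return x @ theta_hat + bonus

print(ucb(rng.standard_normal(d)))
```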

Decision Tree Psychological Risk Assessment in Currency Trading

  • paper_url: http://arxiv.org/abs/2311.15222
  • repo_url: https://github.com/jp9621/pricalc
  • paper_authors: Jai Pal
  • for: This paper explores the application of artificial intelligence (AI) to currency trading and proposes personalized AI models that function as intelligent personal assistants, giving individual traders a more accurate and insightful assessment of psychological risk.
  • methods: The paper builds a classifying decision tree whose tree structure provides clearer decision-making boundaries, and incorporates the user's chronological trade entries so the model can identify critical junctures at which psychological risks are heightened.
  • results: The real-time nature of the calculations makes the model a useful proactive tool, offering timely alerts to traders about impending moments of psychological risk.
    Abstract This research paper focuses on the integration of Artificial Intelligence (AI) into the currency trading landscape, positing the development of personalized AI models, essentially functioning as intelligent personal assistants tailored to the idiosyncrasies of individual traders. The paper posits that AI models are capable of identifying nuanced patterns within the trader's historical data, facilitating a more accurate and insightful assessment of psychological risk dynamics in currency trading. The Psychological Risk Index (PRI) is a dynamic metric that fluctuates in response to market conditions that foster psychological fragility among traders. By employing sophisticated techniques, a classifying decision tree is crafted, enabling clearer decision-making boundaries within the tree structure. By incorporating the user's chronological trade entries, the model becomes adept at identifying critical junctures when psychological risks are heightened. The real-time nature of the calculations enhances the model's utility as a proactive tool, offering timely alerts to traders about impending moments of psychological risk. The implications of this research extend beyond the confines of currency trading, reaching into the realms of other industries where the judicious application of personalized modeling emerges as an efficient and strategic approach. This paper positions itself at the intersection of cutting-edge technology and the intricate nuances of human psychology, offering a transformative paradigm for decision-making support in dynamic and high-pressure environments.
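A hypothetical sketch of the classifying-decision-tree step, assuming scikit-learn; the features (drawdown, losing streak, relative position size), labels, and thresholds are invented for illustration and do not reproduce the paper's PRI construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500
# Hypothetical per-trade features: recent drawdown, losing streak length,
# and position size relative to the trader's average.
X = np.column_stack([
    rng.uniform(0, 0.3, n),       # drawdown fraction
    rng.integers(0, 8, n),        # consecutive losses
    rng.uniform(0.5, 3.0, n),     # relative position size
])
# Synthetic label: "high risk" when drawdown and streak are jointly large.
y = ((X[:, 0] > 0.15) & (X[:, 1] >= 4)).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# The tree's explicit splits are what give the clearer decision boundaries.
print(export_text(tree, feature_names=["drawdown", "streak", "rel_size"]))
```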

The Local Landscape of Phase Retrieval Under Limited Samples

  • paper_url: http://arxiv.org/abs/2311.15221
  • repo_url: None
  • paper_authors: Kaizhao Liu, Zihao Wang, Lei Wu
  • for: This paper studies the local landscape of phase retrieval under limited samples, aiming to determine the minimal sample size that guarantees a benign local landscape around the global minima in high dimensions.
  • methods: The authors analyze local convexity and one-point strong convexity to characterize the non-convexity of the local landscape and the stability of gradient descent.
  • results: When the sample size is $n=o(d\log d)$, the local landscape is highly non-convex; when $n=\omega(d)$, the landscape is one-point strongly convex in a local annulus, so gradient descent initialized from any point in this domain converges to an $o_d(1)$-loss solution exponentially fast. Moreover, when $n=o(d\log d)$, one-point convexity breaks within a correspondingly smaller local ball of radius $\widetilde\Theta(\sqrt{1/d})$.
    Abstract In this paper, we provide a fine-grained analysis of the local landscape of phase retrieval under the regime with limited samples. Our aim is to ascertain the minimal sample size necessary to guarantee a benign local landscape surrounding global minima in high dimensions. Let $n$ and $d$ denote the sample size and input dimension, respectively. We first explore the local convexity and establish that when $n=o(d\log d)$, for almost every fixed point in the local ball, the Hessian matrix must have negative eigenvalues as long as $d$ is sufficiently large. Consequently, the local landscape is highly non-convex. We next consider the one-point strong convexity and show that as long as $n=\omega(d)$, with high probability, the landscape is one-point strongly convex in the local annulus: $\{w\in\mathbb{R}^d: o_d(1)\leqslant \|w-w^*\|\leqslant c\}$, where $w^*$ is the ground truth and $c$ is an absolute constant. This implies that gradient descent initialized from any point in this domain can converge to an $o_d(1)$-loss solution exponentially fast. Furthermore, we show that when $n=o(d\log d)$, there is a radius of $\widetilde\Theta\left(\sqrt{1/d}\right)$ such that one-point convexity breaks in the corresponding smaller local ball. This indicates an impossibility to establish a convergence to exact $w^*$ for gradient descent under limited samples by relying solely on one-point convexity.
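As a small numerical companion to the setting above, the sketch below runs gradient descent on the standard quartic phase-retrieval loss with Gaussian measurements; it illustrates the objects in the analysis (the loss, its gradient, the sign ambiguity), not the theorems themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 2000                         # here n is comfortably above d
A = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
y = (A @ w_star) ** 2                   # phaseless measurements

def grad(w):
    r = (A @ w) ** 2 - y                # residuals of squared measurements
    return (A.T @ (r * (A @ w))) / n    # gradient of (1/4n) * sum(r_i^2)

w = w_star + 0.3 * rng.standard_normal(d)   # start inside a local ball
for _ in range(500):
    w -= 0.05 * grad(w)

# Distance up to the global sign ambiguity w -> -w.
print(min(np.linalg.norm(w - w_star), np.linalg.norm(w + w_star)))
```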

Solve Large-scale Unit Commitment Problems by Physics-informed Graph Learning

  • paper_url: http://arxiv.org/abs/2311.15216
  • repo_url: None
  • paper_authors: Jingtao Qin, Nanpeng Yu
  • for: solves large-scale unit commitment (UC) problems with improved performance and scalability.
  • methods: leverages physics-informed hierarchical graph convolutional networks (PI-GCN) for neural diving and model-based graph convolutional networks (MB-GCN) for neural branching.
  • results: achieves better performance and scalability than the baseline MB-GCN on neural diving, and outperforms a modern MIP solver for all testing days after combining it with the proposed neural diving model and the baseline neural branching model.
    Abstract Unit commitment (UC) problems are typically formulated as mixed-integer programs (MIP) and solved by the branch-and-bound (B&B) scheme. The recent advances in graph neural networks (GNN) enable it to enhance the B&B algorithm in modern MIP solvers by learning to dive and branch. Existing GNN models that tackle MIP problems are mostly constructed from the mathematical formulation, which is computationally expensive when dealing with large-scale UC problems. In this paper, we propose a physics-informed hierarchical graph convolutional network (PI-GCN) for neural diving that leverages the underlying features of various components of power systems to find high-quality variable assignments. Furthermore, we adopt the MIP model-based graph convolutional network (MB-GCN) for neural branching to select the optimal variables for branching at each node of the B&B tree. Finally, we integrate neural diving and neural branching into a modern MIP solver to establish a novel neural MIP solver designed for large-scale UC problems. Numerical studies show that PI-GCN has better performance and scalability than the baseline MB-GCN on neural diving. Moreover, the neural MIP solver yields the lowest operational cost and outperforms a modern MIP solver for all testing days after combining it with our proposed neural diving model and the baseline neural branching model.

A Novel Normalized-Cut Solver with Nearest Neighbor Hierarchical Initialization

  • paper_url: http://arxiv.org/abs/2311.15214
  • repo_url: None
  • paper_authors: Feiping Nie, Jitao Lu, Danyang Wu, Rong Wang, Xuelong Li
  • for: This paper proposes a novel Normalized-Cut (N-Cut) solver based on the coordinate descent method, addressing two key problems of traditional N-Cut solvers: (1) two-stage methods cannot obtain good solutions to the original problem, and (2) solving the relaxed problem requires an eigenvalue decomposition with $\mathcal{O}(n^3)$ time complexity ($n$ is the number of nodes).
  • methods: Since vanilla coordinate descent also has $\mathcal{O}(n^3)$ time complexity, various accelerating strategies are designed to reduce it to $\mathcal{O}(|E|)$ ($|E|$ is the number of edges). An efficient initialization method with deterministic outputs is also proposed, avoiding the uncertainty that random initialization brings to clustering.
  • results: Experiments show that the proposed solver obtains larger N-Cut objective values while achieving better clustering performance than traditional solvers.
    Abstract Normalized-Cut (N-Cut) is a famous model of spectral clustering. The traditional N-Cut solvers are two-stage: 1) calculating the continuous spectral embedding of normalized Laplacian matrix; 2) discretization via $K$-means or spectral rotation. However, this paradigm brings two vital problems: 1) two-stage methods solve a relaxed version of the original problem, so they cannot obtain good solutions for the original N-Cut problem; 2) solving the relaxed problem requires eigenvalue decomposition, which has $\mathcal{O}(n^3)$ time complexity ($n$ is the number of nodes). To address the problems, we propose a novel N-Cut solver designed based on the famous coordinate descent method. Since the vanilla coordinate descent method also has $\mathcal{O}(n^3)$ time complexity, we design various accelerating strategies to reduce the time complexity to $\mathcal{O}(|E|)$ ($|E|$ is the number of edges). To avoid reliance on random initialization which brings uncertainties to clustering, we propose an efficient initialization method that gives deterministic outputs. Extensive experiments on several benchmark datasets demonstrate that the proposed solver can obtain larger objective values of N-Cut, meanwhile achieving better clustering performance compared to traditional solvers.
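For contrast, here is a compact version of the traditional two-stage pipeline the paper improves on: spectral embedding of the normalized Laplacian followed by k-means discretization (scikit-learn assumed). The $\mathcal{O}(n^3)$ eigendecomposition in stage one is exactly the bottleneck the proposed coordinate-descent solver avoids.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_ncut(W, k):
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(len(W)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    # Stage 1: continuous relaxation via the k smallest eigenvectors.
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    # Stage 2: discretization by k-means (the step that loses optimality).
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Two noisy blocks on a toy affinity matrix:
W = np.kron(np.eye(2), np.ones((5, 5))) + 0.01 * np.random.rand(10, 10)
W = (W + W.T) / 2
print(two_stage_ncut(W, 2))
```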

Topology combined machine learning for consonant recognition

  • paper_url: http://arxiv.org/abs/2311.15210
  • repo_url: https://github.com/AnnFeng233/TDA_Consonant_Recognition
  • paper_authors: Pingyao Feng, Siheng Yi, Qingrui Qu, Zhiwang Yu, Yifei Zhu
  • for: The paper is written for researchers and practitioners in artificial intelligence and signal processing, particularly those interested in using topological methods for machine learning.
  • methods: The paper proposes TopCap, which combines time-delay embedding and persistent homology to capture the most salient topological features of time series data. The method is designed to be transparent and broadly applicable, and can capture features that are not easily detected in datasets with low intrinsic dimensionality.
  • results: TopCap classifies voiced and voiceless consonants with an accuracy exceeding 96%, and the method is geared towards designing topological convolutional layers for deep learning of speech and audio signals.
    Abstract In artificial-intelligence-aided signal processing, existing deep learning models often exhibit a black-box structure, and their validity and comprehensibility remain elusive. The integration of topological methods, despite its relatively nascent application, serves a dual purpose of making models more interpretable as well as extracting structural information from time-dependent data for smarter learning. Here, we provide a transparent and broadly applicable methodology, TopCap, to capture the most salient topological features inherent in time series for machine learning. Rooted in high-dimensional ambient spaces, TopCap is capable of capturing features rarely detected in datasets with low intrinsic dimensionality. Applying time-delay embedding and persistent homology, we obtain descriptors which encapsulate information such as the vibration of a time series, in terms of its variability of frequency, amplitude, and average line, demonstrated with simulated data. This information is then vectorised and fed into multiple machine learning algorithms such as k-nearest neighbours and support vector machine. Notably, in classifying voiced and voiceless consonants, TopCap achieves an accuracy exceeding 96% and is geared towards designing topological convolutional layers for deep learning of speech and audio signals.
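A minimal sketch of the TopCap-style pipeline, assuming the ripser package for persistent homology: a 1-D signal is delay-embedded into a point cloud, and the dominant $H_1$ lifetime serves as one simple stand-in for the paper's richer feature set.

```python
import numpy as np
from ripser import ripser  # assumed dependency: pip install ripser

def delay_embed(x, dim=3, tau=5):
    # Point cloud of delayed copies: rows are (x_t, x_{t+tau}, ...).
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 4 * np.pi, 400)
signal = np.sin(4 * t) + 0.05 * np.random.randn(400)   # a voiced-like tone

cloud = delay_embed(signal)
dgms = ripser(cloud)["dgms"]            # persistence diagrams [H0, H1]
h1 = dgms[1]
lifetimes = h1[:, 1] - h1[:, 0]
# A periodic signal traces a loop in delay space, so its dominant H1
# lifetime is large; an aperiodic one yields only short-lived loops.
print(lifetimes.max() if len(lifetimes) else 0.0)
```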

Efficient interpolation of molecular properties across chemical compound space with low-dimensional descriptors

  • paper_url: http://arxiv.org/abs/2311.15207
  • repo_url: None
  • paper_authors: Yun-Wen Mao, Roman V. Krems
  • for: This paper proposes accurate, data-starved models of molecular properties for interpolation in chemical compound space using low-dimensional descriptors.
  • methods: The approach combines three-dimensional physical descriptors derived from the eigenvalue distributions of Coulomb matrices with six-dimensional features informed by the Gershgorin circle theorem, and uses the resulting nine-dimensional descriptors for Gaussian process regression with kernels of variable functional form.
  • results: Models trained with 100 molecules predict the product of entropy and temperature ($S \times T$) and the zero-point vibrational energy (ZPVE) with absolute errors under about 1 kcal mol$^{-1}$ for most molecules in the test data.
    Abstract We demonstrate accurate data-starved models of molecular properties for interpolation in chemical compound spaces with low-dimensional descriptors. Our starting point is based on three-dimensional, universal, physical descriptors derived from the properties of the distributions of the eigenvalues of Coulomb matrices. To account for the shape and composition of molecules, we combine these descriptors with six-dimensional features informed by the Gershgorin circle theorem. We use the nine-dimensional descriptors thus obtained for Gaussian process regression based on kernels with variable functional form, leading to extremely efficient, low-dimensional interpolation models. The resulting models trained with 100 molecules are able to predict the product of entropy and temperature ($S \times T$) and zero point vibrational energy (ZPVE) with the absolute error under 1 kcal mol$^{-1}$ for $> 78$ \% and under 1.3 kcal mol$^{-1}$ for $> 92$ \% of molecules in the test data. The test data comprises 20,000 molecules with complexity varying from three atoms to 29 atoms and the ranges of $S \times T$ and ZPVE covering 36 kcal mol$^{-1}$ and 161 kcal mol$^{-1}$, respectively. We also illustrate that the descriptors based on the Gershgorin circle theorem yield more accurate models of molecular entropy than those based on graph neural networks that explicitly account for the atomic connectivity of molecules.
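A minimal sketch of the two raw ingredients named above: the sorted eigenvalues of the Coulomb matrix and the Gershgorin disc centers and radii. The paper's exact nine-dimensional descriptor is not reproduced; these are the quantities it is built from.

```python
import numpy as np

def coulomb_matrix(Z, R):
    # Z: (n,) nuclear charges; R: (n, 3) Cartesian coordinates.
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4           # standard diagonal term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def descriptors(Z, R):
    M = coulomb_matrix(Z, R)
    eigvals = np.sort(np.linalg.eigvalsh(M))[::-1]     # spectral descriptors
    centers = np.diag(M)                               # Gershgorin disc centers
    radii = np.abs(M).sum(axis=1) - np.abs(centers)    # Gershgorin disc radii
    return eigvals, centers, radii

# Water: O at the origin, two H atoms.
Z = np.array([8.0, 1.0, 1.0])
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
print(descriptors(Z, R))
```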

A Data-Driven Approach for High-Impedance Fault Localization in Distribution Systems

  • paper_url: http://arxiv.org/abs/2311.15168
  • repo_url: None
  • paper_authors: Yuqi Zhou, Yuqing Dong, Rui Yang
  • for: This work proposes a data-driven method for localizing high-impedance faults (HIFs), supporting the reliable operation of distribution systems.
  • methods: The method first formulates optimization problems to approximate the voltage-current trajectory with piecewise functions, then collects the function features of all segments as inputs to a support vector machine that identifies HIFs at different locations.
  • results: Numerical studies show that the proposed method can accurately identify HIF events at different locations in real time.
    Abstract Accurate and quick identification of high-impedance faults (HIFs) is critical for the reliable operation of distribution systems. Unlike other faults in power grids, HIFs are very difficult to detect by conventional overcurrent relays due to the low fault current. Although HIFs can be affected by various factors, the voltage-current characteristics can substantially imply how the system responds to the disturbance and thus provide opportunities to effectively localize HIFs. In this work, we propose a data-driven approach for the identification of HIF events. To tackle the nonlinearity of the voltage-current trajectory, we first formulate optimization problems to approximate the trajectory with piecewise functions. Then we collect the function features of all segments as inputs and use the support vector machine approach to efficiently identify HIFs at different locations. Numerical studies on the IEEE 123-node test feeder demonstrate the validity and accuracy of the proposed approach for real-time HIF identification.
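A minimal sketch of the two steps, assuming scikit-learn: each voltage-current trajectory is approximated by piecewise-linear segments, and the per-segment coefficients feed a support vector machine. The signals, segment count, and location labels are synthetic placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def piecewise_features(v, i, n_segments=8):
    # Fit a line to each segment of the V-I trajectory; the slopes and
    # intercepts summarize the nonlinear fault signature.
    feats = []
    for seg_v, seg_i in zip(np.array_split(v, n_segments),
                            np.array_split(i, n_segments)):
        slope, intercept = np.polyfit(seg_v, seg_i, deg=1)
        feats += [slope, intercept]
    return np.array(feats)

rng = np.random.default_rng(0)
X, y = [], []
for label in range(3):                       # 3 hypothetical fault locations
    for _ in range(40):
        t = np.linspace(0, 2 * np.pi, 256)
        v = np.sin(t)
        i = np.sin(t - 0.1 * (label + 1)) + 0.05 * rng.standard_normal(256)
        X.append(piecewise_features(v, i))
        y.append(label)

clf = SVC(kernel="rbf").fit(np.array(X), np.array(y))
print(clf.score(np.array(X), np.array(y)))   # training accuracy on toy data
```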

eess.IV - 2023-11-26

HyperKon: A Self-Supervised Contrastive Network for Hyperspectral Image Analysis

  • paper_url: http://arxiv.org/abs/2311.15459
  • repo_url: None
  • paper_authors: Daniel L Ayuba, Belen Marti-Cardona, Jean-Yves Guillemaut, Oscar Mendez Maldonado
  • for: This paper addresses a gap in applying deep learning to hyperspectral image analysis: the scarcity of hyperspectral-native CNN backbones.
  • methods: The authors propose HyperKon, a self-supervised contrastive learning network that leverages the high spectral continuity, range, and resolution of EnMAP satellite hyperspectral data through a spectral attention mechanism and specialized convolutional layers.
  • results: HyperKon achieves 98% Top-1 retrieval accuracy, outperforms traditional RGB-trained backbones on hyperspectral pan-sharpening, and surpasses state-of-the-art methods on hyperspectral image classification, underscoring the importance of hyperspectral-native backbones.
    Abstract The exceptional spectral resolution of hyperspectral imagery enables material insights that are not possible with RGB or multispectral images. Yet, the full potential of this data is often underutilized by deep learning techniques due to the scarcity of hyperspectral-native CNN backbones. To bridge this gap, we introduce HyperKon, a self-supervised contrastive learning network designed and trained on hyperspectral data from the EnMAP Hyperspectral Satellite [Kaufmann et al., 2012]. HyperKon uniquely leverages the high spectral continuity, range, and resolution of hyperspectral data through a spectral attention mechanism and specialized convolutional layers. We also perform a thorough ablation study on different kinds of layers, showing their performance in understanding hyperspectral layers. It achieves an outstanding 98% Top-1 retrieval accuracy and outperforms traditional RGB-trained backbones in hyperspectral pan-sharpening tasks. Additionally, in hyperspectral image classification, HyperKon surpasses state-of-the-art methods, indicating a paradigm shift in hyperspectral image analysis and underscoring the importance of hyperspectral-native backbones.

Deep Refinement-Based Joint Source Channel Coding over Time-Varying Channels

  • paper_url: http://arxiv.org/abs/2311.15309
  • repo_url: None
  • paper_authors: Junyu Pan, Hanlei Li, Guangyi Zhang, Yunlong Cai, Guanding Yu
  • for: Improving wireless image transmission performance over time-varying channels.
  • methods: A deep learning (DL)-based joint source-channel coding (JSCC) scheme that adapts to time-varying channel conditions in real time by re-encoding channel symbols using instantaneous channel state information.
  • results: Comparable performance to mainstream DL-based JSCC methods under stable channel conditions, with strong robustness against time-varying channels.
    Abstract In recent developments, deep learning (DL)-based joint source-channel coding (JSCC) for wireless image transmission has made significant strides in performance enhancement. Nonetheless, the majority of existing DL-based JSCC methods are tailored for scenarios featuring stable channel conditions, notably a fixed signal-to-noise ratio (SNR). This specialization poses a limitation, as their performance tends to wane in practical scenarios marked by highly dynamic channels, given that a fixed SNR inadequately represents the dynamic nature of such channels. In response to this challenge, we introduce a novel solution, namely deep refinement-based JSCC (DRJSCC). This innovative method is designed to seamlessly adapt to channels exhibiting temporal variations. By leveraging instantaneous channel state information (CSI), we dynamically optimize the encoding strategy through re-encoding the channel symbols. This dynamic adjustment ensures that the encoding strategy consistently aligns with the varying channel conditions during the transmission process. Specifically, our approach begins with the division of encoded symbols into multiple blocks, which are transmitted progressively to the receiver. In the event of changing channel conditions, we propose a mechanism to re-encode the remaining blocks, allowing them to adapt to the current channel conditions. Experimental results show that the DRJSCC scheme achieves comparable performance to the other mainstream DL-based JSCC models in stable channel conditions, and also exhibits great robustness against time-varying channels.

Neural-Optic Co-Designed Polarization-Multiplexed Metalens for Compact Computational Spectral Imaging

  • paper_url: http://arxiv.org/abs/2311.15164
  • repo_url: None
  • paper_authors: Qiangbo Zhang, Peicheng Lin, Chang Wang, Yang Zhang, Zeqing Yu, Xinyu Liu, Ting Xu, Zhenrong Zheng
  • for: This paper proposes a computational spectral imaging framework based on a polarization-multiplexed metalens, enabling the miniaturization of high-fidelity spectral imaging systems.
  • methods: The framework simultaneously modulates orthogonal polarization channels with a metalens and pairs it with a neural network for high-fidelity spectral reconstruction; the fully differentiable design allows joint optimization of the metalens structure and the network parameters.
  • results: Experimental results demonstrate excellent spatial-spectral reconstruction performance, validating the feasibility and effectiveness of the system in practical, real-world scenarios.
    Abstract As the realm of spectral imaging applications extends its reach into the domains of mobile technology and augmented reality, the demands for compact yet high-fidelity systems become increasingly pronounced. Conventional methodologies, exemplified by coded aperture snapshot spectral imaging systems, are significantly limited by their cumbersome physical dimensions and form factors. To address this inherent challenge, diffractive optical elements (DOEs) have been repeatedly employed as a means to mitigate issues related to the bulky nature of these systems. Nonetheless, it's essential to note that the capabilities of DOEs primarily revolve around the modulation of the phase of light. Here, we introduce an end-to-end computational spectral imaging framework based on a polarization-multiplexed metalens. A distinguishing feature of this approach lies in its capacity to simultaneously modulate orthogonal polarization channels. When harnessed in conjunction with a neural network, it facilitates the attainment of high-fidelity spectral reconstruction. Importantly, the framework is intrinsically fully differentiable, a feature that permits the joint optimization of both the metalens structure and the parameters governing the neural network. The experimental results presented herein validate the exceptional spatial-spectral reconstruction performance, underscoring the efficacy of this system in practical, real-world scenarios. This innovative approach transcends the traditional boundaries separating hardware and software in the realm of computational imaging and holds the promise of substantially propelling the miniaturization of spectral imaging systems.

eess.SP - 2023-11-26

Joint Antenna Selection and Power Allocation in Massive MIMO Systems with Cell Division Technique for MRT and ZF Precoding Schemes

  • paper_url: http://arxiv.org/abs/2311.15412
  • repo_url: None
  • paper_authors: Abdolrasoul Sakhaei Gharagezlou, Nima Imani, Mahdi Nangir
  • For: 这个研究目的是实现5G通信系统中的能源和频率利用效率。* Methods: 这个研究使用大量多input多output(MIMO)系统,并考虑了该系统中的安全传输问题。* Results: 研究发现,透过选择最佳天线数量和细分方法,可以实现最大化系统的安全能源效率。另外,这个研究还提出了四个迭代算法来提供量化评估。
    Abstract One of the most important challenges in the fifth generation (5G) of telecommunication systems is the efficiency of energy and spectrum. Massive multiple-input multiple-output (MIMO) systems have been proposed by researchers to resolve existing challenges. In the proposed system model of this paper, there is a base station (BS) around which several users and an eavesdropper (EVA) are evenly distributed. The information transmitted between the BS and users is disrupted by the EVA, which highlights the importance of secure transfer. This paper analyzes the secure energy efficiency (EE) of a massive MIMO system, with the aim of maximizing it. Several scenarios are considered to evaluate how the desired goal can be achieved. To maximize the secure EE, optimal antenna selection and cell division methods are employed. Each of these two methods is applied in a system with the maximum ratio transmission (MRT) and the zero forcing (ZF) precodings, and then the problem is solved. Maximum transmission power and minimum secure rate for users impose constraints on the optimization problem. Channel state information (CSI) is generally imperfect for users in any method, while the EVA's CSI is assumed perfect as the worst case. Four iterative algorithms are designed to provide numerical assessments. The first algorithm calculates the optimal power of users without utilizing existing methods, the second one is related to the cell division method, the third one is based on the strategy of selecting the optimal number of antennas, and the fourth one is based on a hybrid strategy.

Low-Complexity Joint Beamforming for RIS-Assisted MU-MISO Systems Based on Model-Driven Deep Learning

  • paper_url: http://arxiv.org/abs/2311.15313
  • repo_url: None
  • paper_authors: Weijie Jin, Jing Zhang, Chao-Kai Wen, Shi Jin, Xiao Li, Shuangfeng Han
  • for: Improving the signal propagation environment and throughput of RIS-assisted downlink multi-user multiple-input single-output (MU-MISO) systems.
  • methods: Weighted minimum mean square error optimization combined with power iteration to maximize the weighted sum rate, accelerated by a model-driven deep learning approach.
  • results: The proposed algorithm outperforms state-of-the-art algorithms in complexity and weighted sum rate; in particular, the model-driven deep learning approach greatly reduces runtime.
    Abstract Reconfigurable intelligent surfaces (RIS) can improve signal propagation environments by adjusting the phase of the incident signal. However, optimizing the phase shifts jointly with the beamforming vector at the access point is challenging due to the non-convex objective function and constraints. In this study, we propose an algorithm based on weighted minimum mean square error optimization and power iteration to maximize the weighted sum rate (WSR) of a RIS-assisted downlink multi-user multiple-input single-output system. To further improve performance, a model-driven deep learning (DL) approach is designed, where trainable variables and graph neural networks are introduced to accelerate the convergence of the proposed algorithm. We also extend the proposed method to include beamforming with imperfect channel state information and derive a two-timescale stochastic optimization algorithm. Simulation results show that the proposed algorithm outperforms state-of-the-art algorithms in terms of complexity and WSR. Specifically, the model-driven DL approach has a runtime that is approximately 3% of the state-of-the-art algorithm to achieve the same performance. Additionally, the proposed algorithm with 2-bit phase shifters outperforms the compared algorithm with continuous phase shift.
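A minimal sketch of the power-iteration building block named above: when the rate-maximizing RIS phase update reduces to a leading-eigenvector problem, power iteration solves it with cheap matrix-vector products instead of a full eigendecomposition. The Hermitian matrix below is a random stand-in for the actual WMMSE-derived matrix.

```python
import numpy as np

def power_iteration(A, iters=100, tol=1e-9):
    v = np.random.randn(A.shape[0]) + 1j * np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            break
        v = w
    return v

N = 64
B = np.random.randn(N, N) + 1j * np.random.randn(N, N)
A = B @ B.conj().T                      # Hermitian PSD stand-in
v = power_iteration(A)
theta = np.exp(1j * np.angle(v))        # project onto unit-modulus RIS phases
print(np.abs(theta))                    # all entries have magnitude 1
```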

A Low-cost and Portable Active Noise Control Unit

  • paper_url: http://arxiv.org/abs/2311.15312
  • repo_url: None
  • paper_authors: Wang Zhaohan
  • for: This research paper aims to mitigate the noise emissions produced by electrical appliances, such as a coffee machine, using cutting-edge active noise control methodologies.
  • methods: The study employs a modified Filtered-X Least Mean Square (FXLMS) algorithm, which generates an anti-noise waveform by utilizing measurements from both the reference microphone and the error microphone.
  • results: The desired outcome of this approach is a residual noise level of zero, despite the challenge of conducting the experiment in an open-space setting. The study introduces different Active Noise Control systems and algorithms, followed by simulations and experimental execution.
    Abstract The objective of this research is to employ cutting-edge active noise control methodologies in order to mitigate the noise emissions produced by electrical appliances, such as a coffee machine. The algorithm utilized in this study is the modified Filtered-X Least Mean Square (FXLMS) algorithm. This algorithm aims to generate an anti-noise waveform by utilizing measurements from both the reference microphone and the error microphone. The desired outcome of this approach is to achieve a residual noise level of zero. The primary difficulty lies in conducting the experiment in an open space setting, as conventional active noise control systems are designed to function within enclosed environments, such as closed rooms or relatively confined spaces like the volume inside headphones. A validation test bench is established, employing the Sigma Studio software to oversee the entire system, with the ADAU1452 digital signal processor being chosen. This study presents an introduction to different Active Noise Control systems and algorithms, followed by the execution of simulations for representative techniques. Subsequently, this section provides a comprehensive account of the procedures involved in executing the experiments, followed by an exploration of potential avenues for further research.
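A compact sketch of the FXLMS loop described above, with toy FIR filters for the primary path (noise source to error microphone) and secondary path (loudspeaker to error microphone), and a perfect secondary-path estimate assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.0, 0.9, 0.4])          # primary path (toy FIR)
S = np.array([0.0, 0.7, 0.3])          # secondary path (toy FIR)
Shat = S.copy()                        # assume a perfect secondary-path model
L, mu, T = 16, 0.01, 20000             # controller taps, step size, samples

x = rng.standard_normal(T)             # reference noise
d = np.convolve(x, P)[:T]              # disturbance at the error microphone
xf = np.convolve(x, Shat)[:T]          # filtered-x reference

w = np.zeros(L)                        # adaptive controller
xbuf, xfbuf = np.zeros(L), np.zeros(L)
ybuf = np.zeros(len(S))                # recent controller outputs
errs = np.zeros(T)

for n in range(T):
    xbuf = np.roll(xbuf, 1); xbuf[0] = x[n]
    xfbuf = np.roll(xfbuf, 1); xfbuf[0] = xf[n]
    y = w @ xbuf                       # anti-noise sample
    ybuf = np.roll(ybuf, 1); ybuf[0] = y
    e = d[n] - S @ ybuf                # residual at the error microphone
    w += mu * e * xfbuf                # FXLMS weight update
    errs[n] = e

print(np.mean(errs[:1000] ** 2), np.mean(errs[-1000:] ** 2))  # residual drops
```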

Covariance-Based Activity Detection in Cooperative Multi-Cell Massive MIMO: Scaling Law and Efficient Algorithms

  • paper_url: http://arxiv.org/abs/2311.15299
  • repo_url: None
  • paper_authors: Ziyue Wang, Ya-Feng Liu, Zhaorui Wang, Wei Yu
  • for: This paper focuses on the covariance-based device activity detection problem in cooperative multi-cell massive multiple-input multiple-output (MIMO) systems, where active devices transmit signature sequences to multiple base stations (BSs) that cooperatively detect them.
  • methods: The paper analyzes the scaling law of covariance-based activity detection in the multi-cell setting, for signature sequences both uniformly distributed on a sphere and generated from a finite alphabet.
  • results: Under the assumption that the path-loss exponent exceeds 2, the maximum number of active devices that can be detected correctly in each cell grows quadratically with the signature sequence length and decreases logarithmically with the number of cells (as the number of antennas tends to infinity). Two efficient accelerated coordinate descent algorithms with convergence guarantees are also proposed.
    Abstract This paper focuses on the covariance-based activity detection problem in a multi-cell massive multiple-input multiple-output (MIMO) system. In this system, active devices transmit their signature sequences to multiple base stations (BSs), and the BSs cooperatively detect the active devices based on the received signals. While the scaling law for the covariance-based activity detection in the single-cell scenario has been extensively analyzed in the literature, this paper aims to analyze the scaling law for the covariance-based activity detection in the multi-cell massive MIMO system. Specifically, this paper demonstrates a quadratic scaling law in the multi-cell system, under the assumption that the exponent in the classical path-loss model is greater than 2. This finding shows that, in the multi-cell MIMO system, the maximum number of active devices that can be detected correctly in each cell increases quadratically with the length of the signature sequence and decreases logarithmically with the number of cells (as the number of antennas tends to infinity). Moreover, in addition to analyzing the scaling law for the signature sequences randomly and uniformly distributed on a sphere, the paper also establishes the scaling law for signature sequences generated from a finite alphabet, which are easier to generate and store. Moreover, this paper proposes two efficient accelerated coordinate descent (CD) algorithms with a convergence guarantee for solving the device activity detection problem. The first algorithm reduces the complexity of CD by using an inexact coordinate update strategy. The second algorithm avoids unnecessary computations of CD by using an active set selection strategy. Simulation results show that the proposed algorithms exhibit excellent performance in terms of computational efficiency and detection error probability.

Active-Sensing-Based Beam Alignment for Near Field MIMO Communications

  • paper_url: http://arxiv.org/abs/2311.15292
  • repo_url: None
  • paper_authors: Hao Jiang, Zhaolin Wang, Yuanwei Liu
  • for: Solving the near-field beam alignment problem.
  • methods: Wavenumber-domain transform matrices (WTMs) convert the antenna-domain channel into a sparse wavenumber-domain representation; their dimensions are reduced by exploiting the dominance of line-of-sight (LoS) links, and the lower-dimensional WTMs serve as mapping functions for an active-sensing-based learning algorithm.
  • results: The method finds the optimal beam pair in a ping-pong fashion, avoiding the high training overhead of beam sweeping in codebook-based methods; numerical results validate its effectiveness.
    Abstract An active-sensing-based learning algorithm is proposed to solve the near-field beam alignment problem with the aid of wavenumber-domain transform matrices (WTMs). Specifically, WTMs can transform the antenna-domain channel into a sparse representation in the wavenumber domain. The dimensions of WTMs can be further reduced by exploiting the dominance of line-of-sight (LoS) links. By employing these lower-dimensional WTMs as mapping functions, the active-sensing-based algorithm is executed in the wavenumber domain, resulting in an acceleration of convergence. Compared with the codebook-based beam alignment methods, the proposed method finds the optimal beam pair in a ping-pong fashion, thus avoiding high training overheads caused by beam sweeping. Finally, the numerical results validate the effectiveness of the proposed method.

From OTFS to DD-ISAC: Integrating Sensing and Communications in the Delay Doppler Domain

  • paper_url: http://arxiv.org/abs/2311.15215
  • repo_url: None
  • paper_authors: Weijie Yuan, Lin Zhou, Saeid K. Dehkordi, Shuangyang Li, Pingzhi Fan, Giuseppe Caire, H. Vincent Poor
  • for: This paper investigates the advantages of applying delay-Doppler (DD) communication waveforms to integrated sensing and communication (ISAC).
  • methods: It provides a comprehensive overview of implementing DD communications, covering orthogonal time frequency space (OTFS) modulation, comparisons with OFDM-based ISAC, transceiver designs, and the ambiguity function.
  • results: DD waveforms can interact directly with radar sensing parameters (delay and Doppler shifts) and improve ISAC performance, particularly in high-mobility scenarios; a detailed comparison studies target detection probability and mean squared error (MSE) performance.
    Abstract Next-generation vehicular networks are expected to provide the capability of robust environmental sensing in addition to reliable communications to meet intelligence requirements. A promising solution is the integrated sensing and communication (ISAC) technology, which performs both functionalities using the same spectrum and hardware resources. Most existing works on ISAC consider the Orthogonal Frequency Division Multiplexing (OFDM) waveform. Nevertheless, vehicle motion introduces Doppler shift, which breaks the subcarrier orthogonality and leads to performance degradation. The recently proposed Orthogonal Time Frequency Space (OTFS) modulation, which exploits various advantages of Delay Doppler (DD) channels, has been shown to support reliable communication in high-mobility scenarios. Moreover, the DD waveform can directly interact with radar sensing parameters, which are actually delay and Doppler shifts. This paper investigates the advantages of applying the DD communication waveform to ISAC. Specifically, we first provide a comprehensive overview of implementing DD communications, based on which several advantages of DD-ISAC over OFDM-based ISAC are revealed, including transceiver designs and the ambiguity function. Furthermore, a detailed performance comparison are presented, where the target detection probability and the mean squared error (MSE) performance are also studied. Finally, some challenges and opportunities of DD-ISAC are also provided.

Angular-Distance Based Channel Estimation for Holographic MIMO

  • paper_url: http://arxiv.org/abs/2311.15158
  • repo_url: None
  • paper_authors: Yuanbin Chen, Ying Wang, Zhaocheng Wang, Zhu Han
  • for: This paper investigates channel estimation for holographic MIMO systems, highlighting how it differs from conventional systems: with electromagnetically large antenna arrays, the estimator must discriminate not only a user's or scatterer's angles but also its distance, i.e., the three-dimensional azimuth-elevation-distance (AED) parameters.
  • methods: The tightly coupled 3D AED parameters are decomposed within a parametric decomposition and compressed deconstruction (DeRe) framework, and an efficient algorithm based on variational Bayesian inference and message passing (DeRe-VM) is proposed for sharp detection of the 3D AED parameters and robust recovery of sparse channels.
  • results: The proposed channel estimation scheme is robust across channel conditions, in both near-field and far-field contexts of a holographic MIMO system, and outperforms state-of-the-art benchmarks.
    Abstract This paper investigates the channel estimation for holographic MIMO systems by unmasking their distinctions from the conventional one. Specifically, we elucidate that the channel estimation, subject to holographic MIMO's electromagnetically large antenna arrays, has to discriminate not only the angles of a user/scatterer but also its distance information, namely the three-dimensional (3D) azimuth and elevation angles plus the distance (AED) parameters. As the angular-domain representation fails to characterize the sparsity inherent in holographic MIMO channels, the tightly coupled 3D AED parameters are firstly decomposed for independently constructing their own covariance matrices. Then, the recovery of each individual parameter can be structured as a compressive sensing (CS) problem by harnessing the covariance matrix constructed. This pair of techniques contribute to a parametric decomposition and compressed deconstruction (DeRe) framework, along with a formulation of the maximum likelihood estimation for each parameter. Then, an efficient algorithm, namely DeRe-based variational Bayesian inference and message passing (DeRe-VM), is proposed for the sharp detection of the 3D AED parameters and the robust recovery of sparse channels. Finally, the proposed channel estimation regime is confirmed to be of great robustness in accommodating different channel conditions, regardless of the near-field and far-field contexts of a holographic MIMO system, as well as an improved performance in comparison to the state-of-the-art benchmarks.

cs.SD - 2023-11-25

Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder

  • paper_url: http://arxiv.org/abs/2311.14957
  • repo_url: None
  • paper_authors: Yicheng Gu, Xueyao Zhang, Liumeng Xue, Zhizheng Wu
  • for: This study aims to improve the discriminator of Generative Adversarial Network (GAN) based vocoders to promote their inference speed and synthesis quality.
  • methods: The proposed method utilizes the Constant-Q Transform (CQT) instead of the traditional Short-Time Fourier Transform (STFT) to improve the time-frequency resolution and flexibility in modeling different frequency bands. The Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator is proposed to operate on the CQT spectrogram at multiple scales and perform sub-band processing according to different octaves.
  • results: Experimental results on both speech and singing voices confirm the effectiveness of the proposed method, with the MOS of HiFi-GAN boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers when combined with the existing MS-STFT Discriminator.
    Abstract Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it incompatible with signals like singing voices that require flexible attention for different frequency bands. Motivated by that, our study utilizes the Constant-Q Transform (CQT), which owns dynamic resolution among frequencies, contributing to a better modeling ability in pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed method. Moreover, we also verified that the CQT-based and the STFT-based discriminators could be complementary under joint training. Specifically, enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN can be boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
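A minimal sketch of the sub-band CQT front end, assuming librosa: a constant-Q spectrogram is split into per-octave sub-bands so each band can be handled by its own discriminator branch; the discriminator networks themselves are omitted.

```python
import numpy as np
import librosa  # assumed dependency

y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)   # any waveform works
bins_per_octave, n_octaves = 12, 7
C = librosa.cqt(y, sr=sr, hop_length=256,
                n_bins=bins_per_octave * n_octaves,
                bins_per_octave=bins_per_octave)
mag = np.abs(C)                                         # [84, frames]

# Sub-band processing per octave: low octaves get finer frequency
# resolution (and coarser time resolution) than high ones, which is the
# dynamic-resolution property the STFT lacks.
sub_bands = [mag[o * bins_per_octave:(o + 1) * bins_per_octave]
             for o in range(n_octaves)]
print([b.shape for b in sub_bands])
```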

cs.CV - 2023-11-25

Can SAM recognize crops? Quantifying the zero-shot performance of a semantic segmentation foundation model on generating crop-type maps using satellite imagery for precision agriculture

  • paper_url: http://arxiv.org/abs/2311.15138
  • repo_url: None
  • paper_authors: Rutuja Gurav, Het Patel, Zhuocheng Shang, Ahmed Eldawy, Jia Chen, Elia Scudiero, Evangelos Papalexakis
  • for: This paper highlights how modern management strategies such as precision agriculture can give farmers and decision-makers rich, actionable information to improve the efficiency and sustainability of farming practices.
  • methods: The authors use Meta AI's Segment Anything Model (SAM) to segment satellite imagery toward crop-type maps, and evaluate SAM's zero-shot performance using clustering consensus metrics.
  • results: Experiments show that SAM can swiftly and accurately outline fields in satellite images, serving as a foundation for subsequent crop classification.
    Abstract Climate change is increasingly disrupting worldwide agriculture, making global food production less reliable. To tackle the growing challenges in feeding the planet, cutting-edge management strategies, such as precision agriculture, empower farmers and decision-makers with rich and actionable information to increase the efficiency and sustainability of their farming practices. Crop-type maps are key information for decision-support tools but are challenging and costly to generate. We investigate the capabilities of Meta AI's Segment Anything Model (SAM) for the crop-map prediction task, acknowledging its recent successes at zero-shot image segmentation. However, SAM being limited to up to 3-channel inputs and its zero-shot usage being class-agnostic in nature pose unique challenges in using it directly for crop-type mapping. We propose using clustering consensus metrics to assess SAM's zero-shot performance in segmenting satellite imagery and producing crop-type maps. Although direct crop-type mapping is challenging using SAM in a zero-shot setting, experiments reveal SAM's potential for swiftly and accurately outlining fields in satellite images, serving as a foundation for subsequent crop classification. This paper attempts to highlight a use-case of state-of-the-art image segmentation models like SAM for crop-type mapping and the related specific needs of the agriculture industry, offering a potential avenue for automatic, efficient, and cost-effective data products for precision agriculture practices.
    摘要 气候变化正日益扰乱全球农业,使全球粮食生产变得更不可靠。为应对养活地球这一日益严峻的挑战,精准农业等前沿管理策略为农民和决策者提供了丰富且可操作的信息,以提高农业实践的效率和可持续性。作物类型图是决策支持工具的关键信息,但其生成复杂且成本高昂。鉴于 Meta AI 的 Segment Anything Model(SAM)近期在零样本图像分割上的成功,我们考察了其在作物图预测任务中的能力。然而,SAM 最多只能接受三通道输入,且其零样本使用本质上与类别无关,这给直接将其用于作物类型制图带来了独特挑战。我们提出使用聚类一致性指标来评估 SAM 在零样本设置下分割卫星影像并生成作物类型图的表现。尽管在零样本设置下直接进行作物类型制图仍有困难,但实验表明 SAM 有潜力快速而准确地勾勒出卫星影像中的田块,为后续的作物分类奠定基础。本文试图突出 SAM 等最先进图像分割模型在作物类型制图方面的用例及农业行业的相关特定需求,为精准农业实践提供一条自动、高效、低成本数据产品的潜在途径。
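
A minimal sketch of the clustering-consensus idea: score class-agnostic segment IDs (standing in for SAM's masks) against a crop-type label map with standard consensus metrics from scikit-learn. The random arrays and the choice of ARI/NMI are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Sketch: score class-agnostic segment IDs against a crop-type map with
# clustering-consensus metrics. Random data stands in for SAM masks and a
# ground-truth crop map; not the paper's exact protocol.
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
crop_labels = rng.integers(0, 5, size=(64, 64))    # hypothetical crop-type map
segment_ids = rng.integers(0, 40, size=(64, 64))   # hypothetical SAM segments

ari = adjusted_rand_score(crop_labels.ravel(), segment_ids.ravel())
nmi = normalized_mutual_info_score(crop_labels.ravel(), segment_ids.ravel())
print(f"ARI={ari:.3f}  NMI={nmi:.3f}")  # high consensus => segments track fields
```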

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

  • paper_url: http://arxiv.org/abs/2311.15127
  • repo_url: https://github.com/stability-ai/generative-models
  • paper_authors: Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach
  • for: 这篇论文旨在提出一种高分辨率、最先进的文本到视频和图像到视频生成模型。
  • methods: 该论文使用潜在视频扩散模型,训练分为文本到图像预训练、视频预训练和高质量视频微调三个阶段。
  • results: 论文通过一系列实验表明,该方法可以生成高质量的视频,并且基础模型可用于多视图三维先验、图像到视频生成等下游任务。
    Abstract We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models .
    摘要 我们提出 Stable Video Diffusion,一种用于高分辨率、最先进的文本到视频与图像到视频生成的潜在视频扩散模型。近期,用于二维图像合成的潜在扩散模型通过插入时间层并在小规模高质量视频数据集上微调,被转化为生成式视频模型。然而,文献中的训练方法差异很大,该领域尚未就视频数据的整理策略达成统一。本文中,我们识别并评估了成功训练视频潜在扩散模型的三个阶段:文本到图像预训练、视频预训练和高质量视频微调。此外,我们证明了精心整理的预训练数据集对生成高质量视频的必要性,并给出了一套系统的整理流程(包括字幕生成与过滤策略)来训练强大的基础模型。随后,我们研究了在高质量数据上微调基础模型的影响,训练出一个可与闭源视频生成方案竞争的文本到视频模型。我们还展示了基础模型为图像到视频生成等下游任务提供了强大的运动表示,并可适配针对相机运动的 LoRA 模块。最后,我们证明该模型提供了强大的多视图三维先验,可作为基础微调出一个以前馈方式联合生成物体多个视图的多视图扩散模型,以远低于图像方法的计算开销超越它们。我们在 https://github.com/Stability-AI/generative-models 发布了代码和模型权重。

SAMv2: A Unified Framework for Learning Appearance, Semantic and Cross-Modality Anatomical Embeddings

  • paper_url: http://arxiv.org/abs/2311.15111
  • repo_url: https://github.com/alibaba-damo-academy/self-supervised-anatomical-embedding-v2
  • paper_authors: Xiaoyu Bai, Fan Bai, Xiaofei Huo, Jia Ge, Jingjing Lu, Xianghua Ye, Ke Yan, Yong Xia
  • for: 这篇论文的目的是提出一个基于自我超级学习的医疗影像分析方法,以便实现医疗影像中的特征点(例如病变或特征点)的识别。
  • methods: 这篇论文使用了一种名为 Self-supervised Anatomical eMbedding(SAM)的方法,这是一种基于示例的地标检测方法,为医疗影像中的每个体素(voxel)学习一个判别性的嵌入表示,从而在不同影像之间建立解剖对应。
  • results: 这篇论文的结果显示,SAMv2 比 SAM 和其他现有方法更好地识别医疗影像中的解剖结构,并可在不同影像模态(例如 CT 和 MRI)之间进行跨模态匹配。具体来说,SAMv2 在单样本(one-shot)地标检测、纵向 CT 扫描中的病变追踪以及 CT-MRI 仿射/刚体配准等任务中均表现出色。
    Abstract Identifying anatomical structures (e.g., lesions or landmarks) in medical images plays a fundamental role in medical image analysis. As an exemplar-based landmark detection method, Self-supervised Anatomical eMbedding (SAM) learns a discriminative embedding for each voxel in the image and has shown promising results on various tasks. However, SAM still faces challenges in: (1) differentiating voxels with similar appearance but different semantic meanings (\textit{e.g.}, two adjacent structures without clear borders); (2) matching voxels with similar semantics but markedly different appearance (e.g., the same vessel before and after contrast injection); and (3) cross-modality matching (e.g., CT-MRI registration). To overcome these challenges, we propose SAMv2, which is a unified framework designed to learn appearance, semantic, and cross-modality anatomical embeddings. Specifically, SAMv2 incorporates three key innovations: (1) semantic embedding learning with prototypical contrastive loss; (2) a fixed-point-based matching strategy; and (3) an iterative approach for cross-modality embedding learning. We thoroughly evaluated SAMv2 across three tasks, including one-shot landmark detection, lesion tracking on longitudinal CT scans, and CT-MRI affine/rigid registration with varying field of view. Our results suggest that SAMv2 outperforms SAM and other state-of-the-art methods, offering a robust and versatile approach for landmark based medical image analysis tasks. Code and trained models are available at: https://github.com/alibaba-damo-academy/self-supervised-anatomical-embedding-v2
    摘要 在医学图像中识别解剖结构(例如病灶或地标)是医学图像分析的基础。作为一种基于示例的地标检测方法,自监督解剖嵌入(SAM)为图像中的每个体素学习一个判别性嵌入,并已在多种任务上展现出可喜的结果。然而,SAM 仍面临以下挑战:(1)区分外观相似但语义不同的体素(例如两个相邻、边界不清的结构);(2)匹配语义相似但外观差异显著的体素(例如注射造影剂前后的同一条血管);(3)跨模态匹配(例如 CT-MRI 配准)。为克服这些挑战,我们提出 SAMv2,一个旨在学习外观、语义与跨模态解剖嵌入的统一框架。具体而言,SAMv2 包含三项关键创新:(1)采用原型对比损失学习语义嵌入;(2)基于不动点的匹配策略;(3)跨模态嵌入学习的迭代方法。我们在三项任务上对 SAMv2 进行了全面评估,包括单样本地标检测、纵向 CT 扫描上的病变追踪,以及不同视野下的 CT-MRI 仿射/刚体配准。结果表明,SAMv2 优于 SAM 及其他最先进方法,为基于地标的医学图像分析任务提供了一种稳健而通用的途径。代码和训练模型见:https://github.com/alibaba-damo-academy/self-supervised-anatomical-embedding-v2
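
A minimal sketch of the prototypical contrastive loss named in innovation (1): each L2-normalized embedding is classified against class prototypes with a temperature-scaled softmax. Temperature, shapes, and data are illustrative assumptions; SAMv2's actual loss may differ in detail.

```python
# Sketch: embeddings are pulled toward their class prototype and pushed away
# from the others with a temperature-scaled softmax over cosine similarities.
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(embeddings, labels, prototypes, tau=0.07):
    """embeddings: (N, D), prototypes: (C, D); both L2-normalized."""
    logits = embeddings @ prototypes.t() / tau   # cosine similarity / temperature
    return F.cross_entropy(logits, labels)

emb = F.normalize(torch.randn(8, 128), dim=1)
protos = F.normalize(torch.randn(10, 128), dim=1)
labels = torch.randint(0, 10, (8,))
print(prototypical_contrastive_loss(emb, labels, protos))
```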

Fine-Grained Unsupervised Cross-Modality Domain Adaptation for Vestibular Schwannoma Segmentation

  • paper_url: http://arxiv.org/abs/2311.15090
  • repo_url: None
  • paper_authors: Luyi Han, Tao Tan, Ritse Mann
  • for: 本研究旨在解决多中心应用中的域自适应难题,特别是前庭神经鞘瘤(VS)与耳蜗分割场景。
  • methods: 本研究提出了一种细粒度的无监督域自适应框架,使用向量控制生成器合成具有给定特征的伪图像,再通过多样性增强提升分割模型的性能与鲁棒性。
  • results: 在 CrossMoDA 验证阶段 Leaderboard 上,我们的方法在 VS 和耳蜗上的平均 Dice 分数分别为 0.765 和 0.836。
    Abstract The domain adaptation approach has gained significant acceptance in transferring styles across various vendors and centers, along with filling the gaps in modalities. However, multi-center application faces the challenge of the difficulty of domain adaptation due to their intra-domain differences. We focus on introducing a fine-grained unsupervised framework for domain adaptation to facilitate cross-modality segmentation of vestibular schwannoma (VS) and cochlea. We propose to use a vector to control the generator to synthesize a fake image with given features. And then, we can apply various augmentations to the dataset by searching the feature dictionary. The diversity augmentation can increase the performance and robustness of the segmentation model. On the CrossMoDA validation phase Leaderboard, our method received a mean Dice score of 0.765 and 0.836 on VS and cochlea, respectively.
    摘要 域自适应方法在跨供应商与跨中心迁移风格、弥补模态差距方面已获广泛认可。然而,多中心应用因其域内差异而面临域自适应的困难。我们专注于提出一种细粒度的无监督域自适应框架,以促进前庭神经鞘瘤(VS)与耳蜗的跨模态分割。我们提出使用一个向量控制生成器,合成具有给定特征的伪图像;随后通过检索特征字典,对数据集施加多种增强。这种多样性增强可以提升分割模型的性能与鲁棒性。在 CrossMoDA 验证阶段 Leaderboard 上,我们的方法在 VS 和耳蜗上分别取得 0.765 与 0.836 的平均 Dice 分数。

RandMSAugment: A Mixed-Sample Augmentation for Limited-Data Scenarios

  • paper_url: http://arxiv.org/abs/2311.16508
  • repo_url: None
  • paper_authors: Swarna Kamlam Ravindran, Carlo Tomasi
  • for: 这篇论文研究如何在有限数据场景下利用数据增强有效训练深度模型,以降低标注大规模数据集的高成本。
  • methods: 论文比较了两类基础数据增强技术:混合样本数据增强(MSDA)与无参数的 RandAugment 变体 Preset-RandAugment,并在此基础上分析低级特征变换的作用,提出衡量增强多样性与真实性的新方法。
  • results: 实验表明 Preset-RandAugment 在有限数据场景下表现突出,而 MSDA 效果中等。基于这些发现,作者提出了整合两类方法互补优势的新增强技术 RandMSAugment,其在 CIFAR-100、STL-10 和 Tiny-ImageNet 上显著优于对比方法,且无需超参数调节。
    Abstract The high costs of annotating large datasets suggests a need for effectively training CNNs with limited data, and data augmentation is a promising direction. We study foundational augmentation techniques, including Mixed Sample Data Augmentations (MSDAs) and a no-parameter variant of RandAugment termed Preset-RandAugment, in the fully supervised scenario. We observe that Preset-RandAugment excels in limited-data contexts while MSDAs are moderately effective. We show that low-level feature transforms play a pivotal role in this performance difference, postulate a new property of augmentations related to their data efficiency, and propose new ways to measure the diversity and realism of augmentations. Building on these insights, we introduce a novel augmentation technique called RandMSAugment that integrates complementary strengths of existing methods. RandMSAugment significantly outperforms the competition on CIFAR-100, STL-10, and Tiny-Imagenet. With very small training sets (4, 25, 100 samples/class), RandMSAugment achieves compelling performance gains between 4.1% and 6.75%. Even with more training data (500 samples/class) we improve performance by 1.03% to 2.47%. RandMSAugment does not require hyperparameter tuning, extra validation data, or cumbersome optimizations.
    摘要 标注大型数据集的高成本表明需要在有限数据下有效训练 CNN,而数据增强是一个有前景的方向。我们在全监督场景下研究了基础增强技术,包括混合样本数据增强(MSDA)和无参数的 RandAugment 变体 Preset-RandAugment。我们观察到 Preset-RandAugment 在有限数据场景下表现出色,而 MSDA 效果中等。我们证明低级特征变换在这一性能差异中起关键作用,提出了一个与数据效率相关的增强新性质,并给出了衡量增强多样性与真实性的新方法。基于这些洞察,我们提出了一种整合现有方法互补优势的新增强技术 RandMSAugment。RandMSAugment 在 CIFAR-100、STL-10 和 Tiny-ImageNet 上显著超越对比方法:在非常小的训练集(每类 4、25、100 个样本)上取得 4.1% 至 6.75% 的显著性能提升;即使训练数据更多(每类 500 个样本),也能带来 1.03% 至 2.47% 的提升。RandMSAugment 不需要超参数调节、额外验证数据或繁琐的优化。
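
A hedged sketch of the two ingredients the paper integrates: a parameter-free pick from preset low-level transforms (RandAugment-style) followed by a mixed-sample step (MixUp here). This is not the authors' exact RandMSAugment recipe; the transform list and Beta mixing are illustrative assumptions.

```python
# Sketch: parameter-free low-level transform pick + MixUp-style sample mixing.
# Illustrative only; not the authors' exact RandMSAugment recipe.
import random
import torch
import torchvision.transforms as T

low_level = [T.ColorJitter(0.4, 0.4, 0.4), T.RandomGrayscale(p=1.0),
             T.GaussianBlur(3), T.RandomInvert(p=1.0)]

def rand_ms_augment(x1, y1, x2, y2, alpha=1.0):
    """x*: float image tensors (C,H,W) in [0,1]; y*: one-hot label vectors."""
    x1 = random.choice(low_level)(x1)          # Preset-RandAugment-style pick
    x2 = random.choice(low_level)(x2)
    lam = torch.distributions.Beta(torch.tensor(alpha),
                                   torch.tensor(alpha)).sample().item()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2  # MixUp

img = torch.rand(3, 32, 32)
x, y = rand_ms_augment(img, torch.eye(10)[3], img.flip(-1), torch.eye(10)[7])
print(x.shape, y)
```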

X-Ray to CT Rigid Registration Using Scene Coordinate Regression

  • paper_url: http://arxiv.org/abs/2311.15087
  • repo_url: https://github.com/pragyanstha/scr-registration
  • paper_authors: Pragyan Shrestha, Chun Xie, Hidehiko Shishido, Yuichi Yoshii, Itary Kitahara
  • for: 这篇论文旨在改进微创骨科手术中术中透视影像与术前 CT 三维模型的对齐,以减轻解剖结构在影像中相互重叠给外科医生带来的认知负担。
  • methods: 本论文提出了一种全自动的配准方法,训练时无需手动标注特征点,且对极端视角具有鲁棒性。该方法基于一个全卷积神经网络(CNN),对给定的 X 射线影像回归场景坐标;场景坐标定义为像素的回投影射线与三维模型的交点。
  • results: 实验结果显示,所提方法在模拟测试集的第 50 百分位取得 3.79 mm 的平均目标配准误差(mTRE),在真实透视影像的第 50 百分位取得 9.65 mm 的投影 mTRE。
    Abstract Intraoperative fluoroscopy is a frequently used modality in minimally invasive orthopedic surgeries. Aligning the intraoperatively acquired X-ray image with the preoperatively acquired 3D model of a computed tomography (CT) scan reduces the mental burden on surgeons induced by the overlapping anatomical structures in the acquired images. This paper proposes a fully automatic registration method that is robust to extreme viewpoints and does not require manual annotation of landmark points during training. It is based on a fully convolutional neural network (CNN) that regresses the scene coordinates for a given X-ray image. The scene coordinates are defined as the intersection of the back-projected rays from a pixel toward the 3D model. Training data for a patient-specific model were generated through a realistic simulation of a C-arm device using preoperative CT scans. In contrast, intraoperative registration was achieved by solving the perspective-n-point (PnP) problem with a random sample and consensus (RANSAC) algorithm. Experiments were conducted using a pelvic CT dataset that included several real fluoroscopic (X-ray) images with ground truth annotations. The proposed method achieved an average mean target registration error (mTRE) of 3.79 mm in the 50th percentile of the simulated test dataset and projected mTRE of 9.65 mm in the 50th percentile of real fluoroscopic images for pelvis registration.
    摘要 术中透视(X 射线)成像是微创骨科手术中常用的影像手段。将术中获取的 X 射线影像与术前 CT 扫描的三维模型对齐,可减轻影像中解剖结构相互重叠给外科医生带来的认知负担。本文提出一种全自动配准方法,对极端视角具有鲁棒性,且训练时无需手动标注地标点。该方法基于一个全卷积神经网络(CNN),对给定 X 射线影像回归场景坐标;场景坐标定义为从像素出发的回投影射线与三维模型的交点。我们利用术前 CT 扫描对 C 臂设备进行逼真模拟,生成患者特异的训练数据;而术中配准则通过带有随机抽样一致(RANSAC)算法的透视 n 点(PnP)问题求解实现。实验使用了包含若干带真值标注的真实透视(X 射线)影像的骨盆 CT 数据集。所提方法在模拟测试集的第 50 百分位取得 3.79 mm 的平均目标配准误差(mTRE),在真实透视影像的第 50 百分位取得 9.65 mm 的投影 mTRE。
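
The pose-recovery step can be illustrated with OpenCV: given per-pixel 3D scene coordinates (simulated below instead of regressed by the CNN), the camera pose is estimated by solving PnP with RANSAC. The intrinsics and synthetic correspondences are illustrative assumptions.

```python
# Sketch: recover the C-arm pose from scene coordinates via PnP + RANSAC.
# Synthetic correspondences stand in for network output; K is assumed.
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
pts_3d = rng.uniform(-50, 50, (200, 3))                  # scene coordinates (mm)
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([5.0, -3.0, 400.0])
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)
pts_2d = pts_2d.reshape(-1, 2) + rng.normal(0, 0.5, (200, 2))  # pixel noise

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None,
    reprojectionError=2.0)
print(ok, len(inliers), tvec.ravel())   # pose close to (rvec_gt, tvec_gt)
```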

Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding

  • paper_url: http://arxiv.org/abs/2311.15075
  • repo_url: https://github.com/farewellthree/stan
  • paper_authors: Ruyang Liu, Jingjia Huang, Wei Gao, Thomas H. Li, Ge Li
  • for: 这篇论文探讨了如何将图像-语言预训练模型扩展到通用视频理解,并提出了一个名为 Mug-STAN 的简单而有效的框架,帮助这类模型更好地适配视频数据。
  • methods: 这篇论文提出了带互导对齐模块的时空辅助网络(Mug-STAN):其分支结构与分解的时空模块用于实现可泛化的时间建模,互导对齐模块则用于抑制视频与文本数据之间的部分错位。
  • results: 实验结果表明,Mug-STAN 能显著提升图像-语言预训练模型在多个下游任务上的适配能力,包括 MSR-VTT、DiDeMo、LSMDC、Kinetics-400、Something-Something-2、HMDB-51、UCF-101 与 AVA 等数据集上的零样本与微调结果。此外,与新兴的多模态对话模型集成后,还可实现零样本视频对话。
    Abstract Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge through web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively expand them on general video understanding remains an area of ongoing exploration. In this paper, we investigate the image-to-video transferring from the perspective of the model and the data, unveiling two key obstacles impeding the adaptation of image-language models: non-generalizable temporal modeling and partially misaligned video-text data. To address these challenges, we propose Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN), a simple yet effective framework extending image-text model to diverse video tasks and video-text data. Specifically, STAN adopts a branch structure with decomposed spatial-temporal modules to enable generalizable temporal modeling, while Mug suppresses misalignment by introducing token-wise feature aggregation of either modality from the other. Extensive experimental results verify Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages. With our solution, state-of-the-art zero-shot and finetuning results on various downstream datasets, including MSR-VTT, DiDeMo, LSMDC, Kinetics-400, Something-Something-2, HMDB-51, UCF-101, and AVA, are achieved. Moreover, by integrating pretrained Mug-STAN with the emerging multimodal dialogue model, we can realize zero-shot video chatting. Codes are available at https://github.com/farewellthree/STAN
    摘要 以 CLIP 为代表的大规模图像-语言预训练模型,借助网络规模的图文数据展现出获取通用多模态知识的卓越能力。尽管此类模型在各种图像任务上表现亮眼,如何将其有效扩展到通用视频理解仍在持续探索之中。本文从模型与数据两个角度考察图像到视频的迁移,揭示了阻碍图像-语言模型适配的两个关键障碍:不可泛化的时间建模与部分错位的视频-文本数据。针对这些挑战,我们提出带互导对齐模块的时空辅助网络(Mug-STAN),一个简单而有效的框架,可将图文模型扩展到多样的视频任务与视频-文本数据。具体而言,STAN 采用带有分解时空模块的分支结构以实现可泛化的时间建模,而 Mug 通过引入两种模态间逐 token 的特征聚合来抑制错位。大量实验结果验证了 Mug-STAN 能在视频-文本后预训练与微调两个阶段显著提升 CLIP、CoCa 等图像-语言预训练模型的适配能力。借助我们的方案,在 MSR-VTT、DiDeMo、LSMDC、Kinetics-400、Something-Something-2、HMDB-51、UCF-101 与 AVA 等多个下游数据集上取得了最先进的零样本与微调结果。此外,将预训练的 Mug-STAN 与新兴的多模态对话模型集成,还可实现零样本视频聊天。代码见 https://github.com/farewellthree/STAN。

Task adaption by biologically inspired stochastic comodulation

  • paper_url: http://arxiv.org/abs/2311.15053
  • repo_url: None
  • paper_authors: Gauthier Boeshertz, Caroline Haimerl, Cristina Savin
  • for: 这篇论文探讨如何在多任务学习中利用随机增益协同调制(stochastic comodulation)来提高学习效率和性能。
  • methods: 论文以随机增益调制改进确定性增益调制,并通过微调卷积神经网络来构建最先进的系统。
  • results: 研究发现,使用随机增益调节可以提高多任务学习中的学习效率和性能,而无需添加可学习参数。这种方法可以提供一个有前途的新方向,用于开发更加灵活和可靠的人工智能系统。
    Abstract Brain representations must strike a balance between generalizability and adaptability. Neural codes capture general statistical regularities in the world, while dynamically adjusting to reflect current goals. One aspect of this adaptation is stochastically co-modulating neurons' gains based on their task relevance. These fluctuations then propagate downstream to guide decision-making. Here, we test the computational viability of such a scheme in the context of multi-task learning. We show that fine-tuning convolutional networks by stochastic gain modulation improves on deterministic gain modulation, achieving state-of-the-art results on the CelebA dataset. To better understand the mechanisms supporting this improvement, we explore how fine-tuning performance is affected by architecture using Cifar-100. Overall, our results suggest that stochastic comodulation can enhance learning efficiency and performance in multi-task learning, without additional learnable parameters. This offers a promising new direction for developing more flexible and robust intelligent systems.
    摘要 脑的表征须在泛化性与适应性之间取得平衡:神经编码既捕捉世界中的一般统计规律,又能动态调整以反映当前目标。这种适应的一个方面,是依据神经元与任务的相关性对其增益进行随机协同调制;这些波动随后向下游传播,以引导决策。本文在多任务学习的背景下检验了这一机制的计算可行性。我们表明,以随机增益调制微调卷积网络优于确定性增益调制,在 CelebA 数据集上取得了最先进的结果。为了更好地理解支撑这一改进的机制,我们在 CIFAR-100 上探究了网络架构对微调性能的影响。总体而言,我们的结果表明,随机协同调制可在不增加可学习参数的情况下,提升多任务学习的效率与性能,为开发更灵活、更鲁棒的智能系统提供了一个有前景的新方向。
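
A minimal PyTorch sketch of stochastic gain modulation: during training, channel gains are drawn around 1 with a spread set by a per-channel task-relevance vector. The module and the relevance values are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: channel gains fluctuate around 1 with spread set by task relevance;
# inference stays deterministic. Illustrative, not the paper's formulation.
import torch
import torch.nn as nn

class StochasticGain(nn.Module):
    def __init__(self, relevance):                 # relevance: (C,) weights
        super().__init__()
        self.register_buffer("relevance", relevance)

    def forward(self, x):                          # x: (B, C, H, W)
        if self.training:
            noise = torch.randn(x.size(0), x.size(1), 1, 1, device=x.device)
            gain = 1 + self.relevance.view(1, -1, 1, 1) * noise
            return x * gain                        # relevance-scaled comodulation
        return x

mod = StochasticGain(torch.tensor([0.5, 0.0, 0.1, 0.3]))
mod.train()
print(mod(torch.randn(2, 4, 8, 8)).shape)
```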

InstaStyle: Inversion Noise of a Stylized Image is Secretly a Style Adviser

  • paper_url: http://arxiv.org/abs/2311.15040
  • repo_url: None
  • paper_authors: Xing Cui, Zekun Li, Pei Pei Li, Huaibo Huang, Zhaofeng He
  • for: 本研究旨在提出一种可以从单一参考图像中生成高效精度的风格化图像的方法。
  • methods: 本方法基于参考图像的倒映噪声中的风格信号的发现,通过Diffusion Model进行生成新的风格化图像。此外,文本提示中的自然杂乱和偏见会妨碍风格的准确传递,因此我们引入了学习式风格标识符,以提高风格描述的准确性。
  • results: 实验表明,InstaStyle可以在高精度和创新任务中表现出色,并且可以在混合倒映噪声下进行风格组合。
    Abstract Stylized text-to-image generation focuses on creating images from textual descriptions while adhering to a style specified by a few reference images. However, subtle style variations within different reference images can hinder the model from accurately learning the target style. In this paper, we propose InstaStyle, a novel approach that excels in generating high-fidelity stylized images with only a single reference image. Our approach is based on the finding that the inversion noise from a stylized reference image inherently carries the style signal, as evidenced by their non-zero signal-to-noise ratio. We employ DDIM inversion to extract this noise from the reference image and leverage a diffusion model to generate new stylized images from the ``style" noise. Additionally, the inherent ambiguity and bias of textual prompts impede the precise conveying of style. To address this, we introduce a learnable style token via prompt refinement, which enhances the accuracy of the style description for the reference image. Qualitative and quantitative experimental results demonstrate that InstaStyle achieves superior performance compared to current benchmarks. Furthermore, our approach also showcases its capability in the creative task of style combination with mixed inversion noise.
    摘要 风格化文本到图像生成旨在依据少量参考图像指定的风格,从文本描述创建图像。然而,不同参考图像之间细微的风格差异可能妨碍模型准确学习目标风格。本文提出 InstaStyle,一种仅凭单张参考图像即可生成高保真风格化图像的新方法。我们的方法基于这一发现:风格化参考图像的反演噪声本身携带着风格信号,其非零信噪比即是证据。我们利用 DDIM 反演从参考图像中提取该噪声,并借助扩散模型从这一“风格”噪声生成新的风格化图像。此外,文本提示固有的模糊性与偏差会妨碍风格的精确传递。为此,我们通过提示精炼引入可学习的风格 token,提升对参考图像风格描述的准确性。定性与定量实验结果表明,InstaStyle 取得了优于现有基准的表现。此外,我们的方法还在混合反演噪声的风格组合这一创作任务中展示了能力。

Low-latency Visual Previews of Large Synchrotron Micro-CT Datasets

  • paper_url: http://arxiv.org/abs/2311.15038
  • repo_url: None
  • paper_authors: Nicholas Tan Jerome, Suren Chilingaryan, Thomas van de Kamp, Andreas Kopmann
  • for: 这个研究旨在解决同步辐射设施产出的显微计算机断层扫描(micro-CT)数据难以实时浏览与交互的问题。
  • methods: 这个研究使用了多种缩减数据大小的方法,包括多分辨率切片图、服务器端渲染和直方图范围筛选。
  • results: 这个研究成功地将数据大小从 GB 级降低到 MB 级,并且保留了节肢动物的几何信息。
    Abstract The unprecedented rate at which synchrotron radiation facilities are producing micro-computed (micro-CT) datasets has resulted in an overwhelming amount of data that scientists struggle to browse and interact with in real-time. Thousands of arthropods are scanned into micro-CT within the NOVA project, producing a large collection of gigabyte-sized datasets. In this work, we present methods to reduce the size of this data, scaling it from gigabytes to megabytes, enabling the micro-CT dataset to be delivered in real-time. In addition, arthropods can be identified by scientists even after implementing data reduction methodologies. Our initial step is to devise three distinct visual previews that comply with the best practices of data exploration. Subsequently, each visual preview warrants its own design consideration, thereby necessitating an individual data processing pipeline for each. We aim to present data reduction algorithms applied across the data processing pipelines. Particularly, we reduce size by using the multi-resolution slicemaps, the server-side rendering, and the histogram filtering approaches. In the evaluation, we examine the disparities of each method to identify the most favorable arrangement for our operation, which can then be adjusted for other experiments that have comparable necessities. Our demonstration proved that reducing the dataset size to the megabyte range is achievable without compromising the arthropod's geometry information.
    摘要 同步辐射设施以空前的速度产出显微计算机断层扫描(micro-CT)数据,海量数据令科学家难以实时浏览与交互。在 NOVA 项目中,数千只节肢动物被扫描为 micro-CT,形成了大量 GB 级数据集。本工作提出了一系列缩减数据规模的方法,将其从 GB 级降至 MB 级,使 micro-CT 数据集得以实时传输;并且在应用数据缩减方法之后,科学家仍能辨识节肢动物。我们首先设计了三种符合数据探索最佳实践的可视化预览;每种预览各有其设计考量,因而需要各自独立的数据处理流水线。我们给出了应用于各流水线的数据缩减算法,具体通过多分辨率切片图、服务器端渲染与直方图过滤三种途径缩减数据量。在评估中,我们比较了各方法的差异,以确定最适合我们业务的组合,该组合亦可为具有类似需求的其他实验所调整采用。演示结果证明,在不损失节肢动物几何信息的前提下,可将数据集规模缩减至 MB 级。
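
A minimal sketch of two of the size-reduction ideas: a multi-resolution pyramid of slices and a histogram-range filter that zeroes out uninformative intensities. Array sizes and thresholds are illustrative assumptions, not the NOVA pipeline's actual parameters.

```python
# Sketch: multi-resolution slice pyramid + histogram-range filtering.
# The random volume, strides, and thresholds are illustrative assumptions.
import numpy as np

volume = np.random.rand(64, 256, 256).astype(np.float32)   # stand-in micro-CT

def slice_pyramid(vol, levels=3):
    """Halve the in-plane resolution per level (simple striding for brevity)."""
    out = [vol]
    for _ in range(levels - 1):
        vol = vol[:, ::2, ::2]
        out.append(vol)
    return out            # coarse levels can stream first, fine ones on demand

def histogram_filter(vol, lo=0.2, hi=0.8):
    """Zero out voxels outside the selected intensity range."""
    return np.where((vol >= lo) & (vol <= hi), vol, 0.0).astype(np.float32)

pyr = slice_pyramid(volume)
print([p.shape for p in pyr], f"{histogram_filter(volume).nbytes / 1e6:.1f} MB")
```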

Double-Flow-based Steganography without Embedding for Image-to-Image Hiding

  • paper_url: http://arxiv.org/abs/2311.15027
  • repo_url: None
  • paper_authors: Bingbing Song, Derui Wang, Tianwei Zhang, Renyang Liu, Yu Lin, Wei Zhou
  • for: The paper proposes a novel steganography-without-embedding technique called DF-SWE, which aims to hide secret images without directly embedding them into a cover image.
  • methods: DF-SWE employs a reversible circulation of double flow to build a reversible bijective transformation between the secret image and the generated stego image. This technique leverages the invertible property and can invert a secret image from a generated stego image in a nearly lossless manner.
  • results: The proposed DF-SWE method achieves a payload capacity of 24-72 BPP, which is 8000-16000 times higher than its competitors. Additionally, DF-SWE produces diverse images to minimize the exposure risk and can be applied in various domains without requiring training data from the corresponding domains.
    Abstract As an emerging concept, steganography without embedding (SWE) hides a secret message without directly embedding it into a cover. Thus, SWE has the unique advantage of being immune to typical steganalysis methods and can better protect the secret message from being exposed. However, existing SWE methods are generally criticized for their poor payload capacity and low fidelity of recovered secret messages. In this paper, we propose a novel steganography-without-embedding technique, named DF-SWE, which addresses the aforementioned drawbacks and produces diverse and natural stego images. Specifically, DF-SWE employs a reversible circulation of double flow to build a reversible bijective transformation between the secret image and the generated stego image. Hence, it provides a way to directly generate stego images from secret images without a cover image. Besides leveraging the invertible property, DF-SWE can invert a secret image from a generated stego image in a nearly lossless manner and increases the fidelity of extracted secret images. To the best of our knowledge, DF-SWE is the first SWE method that can hide large images and multiple images into one image with the same size, significantly enhancing the payload capacity. According to the experimental results, the payload capacity of DF-SWE achieves 24-72 BPP is 8000-16000 times compared to its competitors while producing diverse images to minimize the exposure risk. Importantly, DF-SWE can be applied in the steganography of secret images in various domains without requiring training data from the corresponding domains. This domain-agnostic property suggests that DF-SWE can 1) be applied to hiding private data and 2) be deployed in resource-limited systems.
    摘要 作为一个新兴概念,无嵌入隐写(SWE)在不将秘密信息直接嵌入载体的情况下隐藏信息,因而天然免疫于典型的隐写分析方法,能更好地防止秘密信息暴露。然而,现有 SWE 方法普遍被诟病载荷容量小、恢复出的秘密信息保真度低。本文提出一种新的无嵌入隐写技术 DF-SWE,克服了上述缺陷并能生成多样而自然的隐写图像。具体而言,DF-SWE 采用双流可逆循环,在秘密图像与生成的隐写图像之间建立可逆的双射变换,从而无需载体图像即可直接由秘密图像生成隐写图像。除了利用可逆性质之外,DF-SWE 还能以近乎无损的方式从生成的隐写图像反演出秘密图像,提高了提取出的秘密图像的保真度。据我们所知,DF-SWE 是首个能将大图像及多张图像隐藏进一张同尺寸图像的 SWE 方法,显著提升了载荷容量。实验结果显示,DF-SWE 的载荷容量达到 24-72 BPP,是同类方法的 8000-16000 倍,同时生成多样的图像以降低暴露风险。重要的是,DF-SWE 可应用于各种领域秘密图像的隐写,而无需相应领域的训练数据。这种领域无关的性质表明 DF-SWE 1)可用于隐藏隐私数据,2)可部署于资源受限的系统。

Occlusion Sensitivity Analysis with Augmentation Subspace Perturbation in Deep Feature Space

  • paper_url: http://arxiv.org/abs/2311.15022
  • repo_url: None
  • paper_authors: Pedro Valois, Koichiro Niinuma, Kazuhiro Fukui
  • for: This paper aims to address the challenges of model transparency and biases in deep learning, specifically in computer vision applications.
  • methods: The proposed method, Occlusion Sensitivity Analysis with Deep Feature Augmentation Subspace (OSA-DAS), is a perturbation-based interpretability approach that integrates diverse image augmentations with standard occlusion sensitivity analysis.
  • results: The proposed method outperforms commonly used interpreters on the ImageNet-1k dataset, providing a more precise explanation of the model predictions and offering a class- and model-agnostic approach.
    Abstract Deep Learning of neural networks has gained prominence in multiple life-critical applications like medical diagnoses and autonomous vehicle accident investigations. However, concerns about model transparency and biases persist. Explainable methods are viewed as the solution to address these challenges. In this study, we introduce the Occlusion Sensitivity Analysis with Deep Feature Augmentation Subspace (OSA-DAS), a novel perturbation-based interpretability approach for computer vision. While traditional perturbation methods make only use of occlusions to explain the model predictions, OSA-DAS extends standard occlusion sensitivity analysis by enabling the integration with diverse image augmentations. Distinctly, our method utilizes the output vector of a DNN to build low-dimensional subspaces within the deep feature vector space, offering a more precise explanation of the model prediction. The structural similarity between these subspaces encompasses the influence of diverse augmentations and occlusions. We test extensively on the ImageNet-1k, and our class- and model-agnostic approach outperforms commonly used interpreters, setting it apart in the realm of explainable AI.
    摘要 神经网络深度学习已在医疗诊断、自动驾驶事故调查等多个性命攸关的应用中占据重要地位,但对模型透明度与偏差的担忧依然存在,可解释方法被视为应对这些挑战的方案。本研究提出深度特征增强子空间遮挡敏感度分析(OSA-DAS),一种面向计算机视觉的基于扰动的新型可解释方法。传统扰动方法仅利用遮挡来解释模型预测,而 OSA-DAS 扩展了标准遮挡敏感度分析,使其能够与多样的图像增强相结合。特别地,我们的方法利用深度神经网络的输出向量,在深度特征向量空间内构建低维子空间,从而更精确地解释模型预测;这些子空间之间的结构相似度涵盖了多样增强与遮挡的影响。我们在 ImageNet-1k 上进行了大量测试,这种类别无关、模型无关的方法超越了常用的解释器,在可解释 AI 领域独树一帜。
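
For orientation, the sketch below implements plain occlusion sensitivity: slide an occluding patch over the image and record the drop in the target-class score. OSA-DAS additionally perturbs along augmentation subspaces in deep feature space; that part is omitted here, and the model and patch size are illustrative assumptions.

```python
# Sketch of plain occlusion sensitivity: slide a patch over the image and
# measure the drop in the target-class probability. The augmentation-subspace
# part of OSA-DAS is omitted; model and patch size are illustrative.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()   # untrained stand-in classifier
img = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    target = model(img).argmax(1).item()
    base = model(img).softmax(1)[0, target]
    heat = torch.zeros(7, 7)                   # 7x7 grid of 32-pixel occluders
    for i in range(7):
        for j in range(7):
            occluded = img.clone()
            occluded[:, :, i*32:(i+1)*32, j*32:(j+1)*32] = 0
            heat[i, j] = base - model(occluded).softmax(1)[0, target]
print(heat)  # larger values = regions the prediction depends on more
```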

VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning

  • paper_url: http://arxiv.org/abs/2311.15011
  • repo_url: None
  • paper_authors: Ziyang Luo, Nian Liu, Wangbo Zhao, Xuguang Yang, Dingwen Zhang, Deng-Ping Fan, Fahad Khan, Junwei Han
  • for: 这篇论文旨在解决多模态掩蔽物检测和鲜明物检测这两个相关 yet distinct binary mapping任务。
  • methods: 该论文提出了一种通用模型VSCode,通过在encoder-decoder架构中引入2D提示来学习多模态和任务特定的知识。
  • results: VSCode 在 6 个任务、26 个数据集上表现出色,并能通过组合 2D 提示(如 RGB-D COD)对未见任务实现零样本泛化。
    Abstract Salient object detection (SOD) and camouflaged object detection (COD) are related yet distinct binary mapping tasks. These tasks involve multiple modalities, sharing commonalities and unique cues. Existing research often employs intricate task-specific specialist models, potentially leading to redundancy and suboptimal results. We introduce VSCode, a generalist model with novel 2D prompt learning, to jointly address four SOD tasks and three COD tasks. We utilize VST as the foundation model and introduce 2D prompts within the encoder-decoder architecture to learn domain and task-specific knowledge on two separate dimensions. A prompt discrimination loss helps disentangle peculiarities to benefit model optimization. VSCode outperforms state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot generalization to unseen tasks by combining 2D prompts, such as RGB-D COD.
    摘要 显著目标检测(SOD)与伪装目标检测(COD)是相关但又不同的二值映射任务。这些任务涉及多种模态,既有共性也有各自独特的线索。现有研究往往采用复杂的任务专用模型,可能导致冗余与次优结果。我们提出 VSCode,一种带有新颖 2D 提示学习的通用模型,可联合解决四个 SOD 任务与三个 COD 任务。我们以 VST 为基础模型,在编码器-解码器架构中引入 2D 提示,在两个独立维度上学习领域知识与任务特定知识;一个提示判别损失有助于解耦各自特性,从而促进模型优化。VSCode 在 26 个数据集的六个任务上超越了最先进方法,并能通过组合 2D 提示(如 RGB-D COD)对未见任务展现零样本泛化能力。

Adapter is All You Need for Tuning Visual Tasks

  • paper_url: http://arxiv.org/abs/2311.15010
  • repo_url: https://github.com/leiyi-hu/mona
  • paper_authors: Dongshuo Yin, Leiyi Hu, Bin Li, Youqun Zhang
  • for: 这项研究旨在寻找一种能够超越全量微调(full fine-tuning)的方法,以提升预训练模型在视觉任务中的迁移效率与性能。
  • methods: 研究提出了一种新的基于适配器的微调方法,称为多认知视觉适配器(Mona)微调。该方法在适配器中引入多个视觉友好的滤波器以增强其处理视觉信号的能力(以往方法主要依赖语言友好的线性滤波器),并在适配器中加入缩放归一化层以调节输入特征的分布。
  • results: 实验结果显示,Mona 能在多个视觉任务上超越全量微调,包括 COCO 实例分割、ADE20K 语义分割、Pascal VOC 目标检测以及多个常见数据集上的图像分类。例如,在 COCO 数据集上,Mona 相对全量微调取得 1% 的性能提升。整体结果表明,Mona 微调比全量微调更适合保留并利用预训练模型的能力。
    Abstract Pre-training & fine-tuning can enhance the transferring efficiency and performance in visual tasks. Recent delta-tuning methods provide more options for visual classification tasks. Despite their success, existing visual delta-tuning art fails to exceed the upper limit of full fine-tuning on challenging tasks like instance segmentation and semantic segmentation. To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. First, we introduce multiple vision-friendly filters into the adapter to enhance its ability to process visual signals, while previous methods mainly rely on language-friendly linear filters. Second, we add the scaled normalization layer in the adapter to regulate the distribution of input features for visual filters. To fully demonstrate the practicality and generality of Mona, we conduct experiments on multiple representative visual tasks, including instance segmentation on COCO, semantic segmentation on ADE20K, object detection on Pascal VOC, and image classification on several common datasets. Exciting results illustrate that Mona surpasses full fine-tuning on all these tasks and is the only delta-tuning method outperforming full fine-tuning on instance segmentation and semantic segmentation tasks. For example, Mona achieves a 1% performance gain on the COCO dataset compared to full fine-tuning. Comprehensive results suggest that Mona-tuning is more suitable for retaining and utilizing the capabilities of pre-trained models than full fine-tuning. The code will be released at https://github.com/Leiyi-Hu/mona.
    摘要 预训练与微调可以提升视觉任务中的迁移效率与性能,而近来的增量微调(delta-tuning)方法为视觉分类任务提供了更多选择。尽管已有成功,现有的视觉增量微调技术在实例分割、语义分割等高难度任务上仍无法超越全量微调的上限。为寻找一种可与全量微调竞争的替代方案,我们提出多认知视觉适配器(Mona)微调,一种新的基于适配器的微调方法。首先,我们在适配器中引入多个视觉友好的滤波器,以增强其处理视觉信号的能力,而以往方法主要依赖语言友好的线性滤波器。其次,我们在适配器中加入缩放归一化层,以调节输入特征的分布。为充分展示 Mona 的实用性与通用性,我们在多个代表性视觉任务上进行了实验,包括 COCO 实例分割、ADE20K 语义分割、Pascal VOC 目标检测以及多个常见数据集上的图像分类。令人振奋的结果表明,Mona 在所有这些任务上均超越全量微调,并且是唯一在实例分割与语义分割任务上胜过全量微调的增量微调方法。例如,Mona 在 COCO 数据集上相对全量微调取得 1% 的性能提升。综合结果表明,相比全量微调,Mona 微调更适合保留并利用预训练模型的能力。代码将在 https://github.com/Leiyi-Hu/mona 发布。
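
A hedged sketch of an adapter block in the spirit of Mona: a scaled normalization on the input, a bottleneck projection, several vision-friendly depthwise convolutions of different kernel sizes, and a residual up-projection. Dimensions and kernel sizes are illustrative assumptions; see the linked repository for the authors' actual code.

```python
# Sketch of a Mona-style adapter: scaled normalization, bottleneck projection,
# multi-kernel depthwise convs, residual up-projection. Sizes are illustrative.
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64, hw=14):
        super().__init__()
        self.hw = hw
        self.norm = nn.LayerNorm(dim)
        self.scale = nn.Parameter(torch.ones(1))   # scaled normalization factor
        self.down = nn.Linear(dim, bottleneck)
        self.convs = nn.ModuleList(                # vision-friendly filters
            nn.Conv2d(bottleneck, bottleneck, k, padding=k // 2, groups=bottleneck)
            for k in (3, 5, 7))
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                          # x: (B, N, dim), N = hw * hw
        h = self.down(self.scale * self.norm(x))
        b, n, c = h.shape
        h2d = h.transpose(1, 2).reshape(b, c, self.hw, self.hw)
        h2d = sum(conv(h2d) for conv in self.convs) / len(self.convs)
        h = h2d.flatten(2).transpose(1, 2)
        return x + self.up(h)                      # residual adapter output

print(VisionAdapter()(torch.randn(2, 196, 768)).shape)
```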

$Z^*$: Zero-shot Style Transfer via Attention Rearrangement

  • paper_url: http://arxiv.org/abs/2311.16491
  • repo_url: None
  • paper_authors: Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong
  • for: 这篇论文是关于图像风格传输的研究,旨在解决风格在艺术上是主观和困难的问题。
  • methods: 本研究使用了简单的扩散模型,直接从图像中提取风格信息,并将生成器的先验知识纳入图像中。
  • results: 研究表明,潜在扩散模型中的交叉注意机制倾向于混合内容与风格图像,导致风格化输出偏离原始内容;为克服这一局限,引入交叉注意重排策略后,无需再训练或调参即可实现有效的零样本风格迁移。
    Abstract Despite the remarkable progress in image style transfer, formulating style in the context of art is inherently subjective and challenging. In contrast to existing learning/tuning methods, this study shows that vanilla diffusion models can directly extract style information and seamlessly integrate the generative prior into the content image without retraining. Specifically, we adopt dual denoising paths to represent content/style references in latent space and then guide the content image denoising process with style latent codes. We further reveal that the cross-attention mechanism in latent diffusion models tends to blend the content and style images, resulting in stylized outputs that deviate from the original content image. To overcome this limitation, we introduce a cross-attention rearrangement strategy. Through theoretical analysis and experiments, we demonstrate the effectiveness and superiority of the diffusion-based $\underline{Z}$ero-shot $\underline{S}$tyle $\underline{T}$ransfer via $\underline{A}$ttention $\underline{R}$earrangement, Z-STAR.
    摘要 尽管图像风格迁移已取得显著进展,在艺术语境下定义风格本质上仍是主观且具挑战性的任务。与现有的学习/调参方法不同,本研究表明,原始扩散模型无需再训练即可直接提取风格信息,并将生成先验无缝融入内容图像。具体而言,我们采用双去噪路径在隐空间中表示内容/风格参考,再以风格隐编码引导内容图像的去噪过程。我们进一步揭示,潜在扩散模型中的交叉注意机制倾向于混合内容与风格图像,导致风格化输出偏离原始内容图像。为克服这一局限,我们提出交叉注意重排策略。通过理论分析与实验,我们证明了基于扩散、经注意重排实现的零样本风格迁移方法 Z-STAR 的有效性与优越性。

Coordinate-Aware Modulation for Neural Fields

  • paper_url: http://arxiv.org/abs/2311.14993
  • repo_url: None
  • paper_authors: Joo Chan Lee, Daniel Rho, Seungtae Nam, Jong Hwan Ko, Eunbyung Park
  • for: 本文旨在提出一种新的方法,使得 neural fields 可以更好地利用多层感知器(MLP)和网格表示法(grid representation)。
  • methods: 本文提出了一种叫做 Coordinate-Aware Modulation(CAM)的方法,它将 grid representation 注入到 MLP 中的中间特征中,以避免 MLB 中的可能的偏见。
  • results: 实验结果表明,CAM 可以提高 neural representation 的性能,并且在不同的信号场景下增强学习稳定性。特别是在新视野生成任务中,CAM 可以在最少的参数量和快速训练速度下达到状态开头的性能。
    Abstract Neural fields, mapping low-dimensional input coordinates to corresponding signals, have shown promising results in representing various signals. Numerous methodologies have been proposed, and techniques employing MLPs and grid representations have achieved substantial success. MLPs allow compact and high expressibility, yet often suffer from spectral bias and slow convergence speed. On the other hand, methods using grids are free from spectral bias and achieve fast training speed, however, at the expense of high spatial complexity. In this work, we propose a novel way for exploiting both MLPs and grid representations in neural fields. Unlike the prevalent methods that combine them sequentially (extract features from the grids first and feed them to the MLP), we inject spectral bias-free grid representations into the intermediate features in the MLP. More specifically, we suggest a Coordinate-Aware Modulation (CAM), which modulates the intermediate features using scale and shift parameters extracted from the grid representations. This can maintain the strengths of MLPs while mitigating any remaining potential biases, facilitating the rapid learning of high-frequency components. In addition, we empirically found that the feature normalizations, which have not been successful in neural filed literature, proved to be effective when applied in conjunction with the proposed CAM. Experimental results demonstrate that CAM enhances the performance of neural representation and improves learning stability across a range of signals. Especially in the novel view synthesis task, we achieved state-of-the-art performance with the least number of parameters and fast training speed for dynamic scenes and the best performance under 1MB memory for static scenes. CAM also outperforms the best-performing video compression methods using neural fields by a large margin.
    摘要 神经场将低维输入坐标映射为相应信号,已在各类信号的表示上展现出可喜的成果。已有众多方法被提出,其中采用 MLP 与网格表示的技术取得了可观的成功。MLP 紧凑且表达力强,但常受频谱偏置与收敛速度慢的困扰;基于网格的方法不受频谱偏置影响且训练速度快,代价则是较高的空间复杂度。本工作提出了一种在神经场中同时利用 MLP 与网格表示的新方式。与先从网格提取特征再输入 MLP 的主流串联做法不同,我们将不含频谱偏置的网格表示注入 MLP 的中间特征。具体而言,我们提出坐标感知调制(CAM),利用从网格表示中提取的缩放与平移参数对中间特征进行调制。这既保留了 MLP 的优势,又缓解了残余的潜在偏置,促进高频成分的快速学习。此外,我们通过实验发现,在神经场文献中一直未获成功的特征归一化,与所提 CAM 结合使用时被证明是有效的。实验结果表明,CAM 提升了神经表示的性能,并在一系列信号上改善了学习稳定性。尤其在新视角合成任务中,我们以最少的参数量与最快的训练速度在动态场景上取得了最先进的性能,并在 1MB 内存限制下于静态场景上取得最佳表现。CAM 还大幅超越了使用神经场的最佳视频压缩方法。
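
A minimal sketch of coordinate-aware modulation: scale and shift parameters are bilinearly sampled from a small learnable 2D grid at the input coordinates and used to modulate an MLP's hidden features. Grid resolution, widths, and the output head are illustrative assumptions.

```python
# Sketch: sample (scale, shift) from a learnable grid at the query coordinates
# and modulate the MLP's hidden features. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMField(nn.Module):
    def __init__(self, hidden=64, grid_res=16):
        super().__init__()
        self.fc_in = nn.Linear(2, hidden)
        # Two grid channels (scale, shift) per hidden unit.
        self.grid = nn.Parameter(0.1 * torch.randn(1, 2 * hidden, grid_res, grid_res))
        self.fc_out = nn.Linear(hidden, 3)         # e.g. an RGB output head
    def forward(self, coords):                     # coords: (N, 2) in [-1, 1]
        h = torch.relu(self.fc_in(coords))
        g = F.grid_sample(self.grid, coords.view(1, -1, 1, 2),
                          align_corners=True).view(2, h.size(1), -1)
        scale, shift = g[0].t(), g[1].t()          # (N, hidden) each
        return self.fc_out(h * (1 + scale) + shift)

print(CAMField()(torch.rand(100, 2) * 2 - 1).shape)   # -> torch.Size([100, 3])
```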

View it like a radiologist: Shifted windows for deep learning augmentation of CT images

  • paper_url: http://arxiv.org/abs/2311.14990
  • repo_url: https://github.com/agnalt/window-shifting
  • paper_authors: Eirik A. Østmo, Kristoffer K. Wickstrøm, Keyur Radiya, Michael C. Kampffmeyer, Robert Jenssen
  • for: 医学图像中的癌症检测和定位
  • methods: 使用窗口偏移法进行预处理和强度增强
  • results: 提高了肝脏病灶分割的性能与鲁棒性,并在造影剂时机不佳的图像上表现出色
    Abstract Deep learning has the potential to revolutionize medical practice by automating and performing important tasks like detecting and delineating the size and locations of cancers in medical images. However, most deep learning models rely on augmentation techniques that treat medical images as natural images. For contrast-enhanced Computed Tomography (CT) images in particular, the signals producing the voxel intensities have physical meaning, which is lost during preprocessing and augmentation when treating such images as natural images. To address this, we propose a novel preprocessing and intensity augmentation scheme inspired by how radiologists leverage multiple viewing windows when evaluating CT images. Our proposed method, window shifting, randomly places the viewing windows around the region of interest during training. This approach improves liver lesion segmentation performance and robustness on images with poorly timed contrast agent. Our method outperforms classical intensity augmentations as well as the intensity augmentation pipeline of the popular nn-UNet on multiple datasets.
    摘要 深度学习有望通过自动检测并勾画医学图像中病灶的大小与位置等重要任务,变革医疗实践。然而,大多数深度学习模型依赖将医学图像当作自然图像处理的增强技术。特别是对增强 CT 图像而言,产生体素强度的信号具有物理意义,而把此类图像当作自然图像做预处理与增强时,这一物理意义便会丢失。为解决这一问题,我们受放射科医生评估 CT 图像时使用多个观察窗(viewing window)的启发,提出了一种新的预处理与强度增强方案。我们提出的窗口偏移(window shifting)方法在训练过程中围绕感兴趣区域随机放置观察窗。该方法提升了肝脏病灶分割的性能,并增强了对造影剂时机不佳图像的鲁棒性。在多个数据集上,我们的方法优于经典强度增强以及流行的 nn-UNet 的强度增强流水线。
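
A minimal sketch of the window-shifting augmentation: intensity-window a CT slice in Hounsfield units, with a randomly shifted window center during training. The HU defaults and shift range are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: radiologist-style intensity windowing of a CT slice (HU), with a
# randomly shifted window center. Defaults are illustrative assumptions.
import numpy as np

def window_shift(slice_hu, center=60.0, width=400.0, max_shift=100.0, rng=None):
    rng = rng or np.random.default_rng()
    c = center + rng.uniform(-max_shift, max_shift)   # shifted window center
    lo, hi = c - width / 2, c + width / 2
    return np.clip((slice_hu - lo) / (hi - lo), 0.0, 1.0)  # map window to [0,1]

ct = np.random.randint(-1000, 1500, (512, 512)).astype(np.float32)  # fake HU
out = window_shift(ct)
print(out.min(), out.max())
```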

SAME++: A Self-supervised Anatomical eMbeddings Enhanced medical image registration framework using stable sampling and regularized transformation

  • paper_url: http://arxiv.org/abs/2311.14986
  • repo_url: https://github.com/alibaba-damo-academy/same
  • paper_authors: Lin Tian, Zi Li, Fengze Liu, Xiaoyu Bai, Jia Ge, Le Lu, Marc Niethammer, Xianghua Ye, Ke Yan, Daikai Jin
  • for: 这个研究是为了提高医疗影像注册的精度和效率,以便更好地处理各种医疗影像处理任务。
  • methods: 这篇研究建立在名为 SAM(Self-supervised Anatomical eMbedding)的算法之上,该算法能在体素级别计算医疗影像之间稠密的解剖对应,并为配准提供语义引导。
  • results: 这篇研究发现,SAM 增强配准方法(SAME++)的 Dice 分数较领先方法显著提升 4.2%-8.2%,且比基于数值优化的方法快若干个数量级。
    Abstract Image registration is a fundamental medical image analysis task. Ideally, registration should focus on aligning semantically corresponding voxels, i.e., the same anatomical locations. However, existing methods often optimize similarity measures computed directly on intensities or on hand-crafted features, which lack anatomical semantic information. These similarity measures may lead to sub-optimal solutions where large deformations, complex anatomical differences, or cross-modality imagery exist. In this work, we introduce a fast and accurate method for unsupervised 3D medical image registration building on top of a Self-supervised Anatomical eMbedding (SAM) algorithm, which is capable of computing dense anatomical correspondences between two images at the voxel level. We name our approach SAM-Enhanced registration (SAME++), which decomposes image registration into four steps: affine transformation, coarse deformation, deep non-parametric transformation, and instance optimization. Using SAM embeddings, we enhance these steps by finding more coherent correspondence and providing features with better semantic guidance. We extensively evaluated SAME++ using more than 50 labeled organs on three challenging inter-subject registration tasks of different body parts. As a complete registration framework, SAME++ markedly outperforms leading methods by $4.2\%$ - $8.2\%$ in terms of Dice score while being orders of magnitude faster than numerical optimization-based methods. Code is available at \url{https://github.com/alibaba-damo-academy/same}.
    摘要 图像配准是医学图像分析的一项基础任务。理想情况下,配准应聚焦于对齐语义上对应的体素,即相同的解剖位置。然而,现有方法往往直接在强度或手工特征上优化相似性度量,缺乏解剖语义信息;在存在大形变、复杂解剖差异或跨模态影像时,这类相似性度量可能导致次优解。本工作在自监督解剖嵌入(SAM)算法之上,提出一种快速而精确的无监督三维医学图像配准方法;SAM 能在体素级别计算两幅图像间的稠密解剖对应。我们将该方法命名为 SAM 增强配准(SAME++),它把图像配准分解为四个步骤:仿射变换、粗粒度形变、深度非参数化变换与实例优化。借助 SAM 嵌入,我们通过寻找更一致的对应并提供语义引导更强的特征来增强上述各步骤。我们在三个不同身体部位、具有挑战性的受试者间配准任务上,用 50 多个带标注的器官对 SAME++ 进行了广泛评估。作为完整的配准框架,SAME++ 的 Dice 分数较领先方法显著提升 4.2% - 8.2%,且比基于数值优化的方法快若干个数量级。代码见 https://github.com/alibaba-damo-academy/same。

Elucidating and Overcoming the Challenges of Label Noise in Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.16481
  • repo_url: None
  • paper_authors: Zijun Long, George Killick, Lipeng Zhuang, Richard McCreadie, Gerardo Aragon Camarasa, Paul Henderson
  • for: 本文旨在探讨训练样本中的标签噪声对监督对比学习(SCL)的影响,并提出一种能减轻这种影响的目标函数。
  • methods: 本文提出了一种新的去偏监督对比学习(D-SCL)目标,用以减轻标签错误引入的偏差。
  • results: 在多个视觉基准数据集上的实验表明,D-SCL 在表示学习上持续优于现有技术,并对标签错误具有更强的鲁棒性。
    Abstract Image classification datasets exhibit a non-negligible fraction of mislabeled examples, often due to human error when one class superficially resembles another. This issue poses challenges in supervised contrastive learning (SCL), where the goal is to cluster together data points of the same class in the embedding space while distancing those of disparate classes. While such methods outperform those based on cross-entropy, they are not immune to labeling errors. However, while the detrimental effects of noisy labels in supervised learning are well-researched, their influence on SCL remains largely unexplored. Hence, we analyse the effect of label errors and examine how they disrupt the SCL algorithm's ability to distinguish between positive and negative sample pairs. Our analysis reveals that human labeling errors manifest as easy positive samples in around 99% of cases. We, therefore, propose D-SCL, a novel Debiased Supervised Contrastive Learning objective designed to mitigate the bias introduced by labeling errors. We demonstrate that D-SCL consistently outperforms state-of-the-art techniques for representation learning across diverse vision benchmarks, offering improved robustness to label errors.
    摘要 图像分类数据集中存在不可忽视比例的误标样本,常因某一类在表面上与另一类相似而由人为错误造成。这一问题给监督对比学习(SCL)带来挑战:SCL 的目标是在嵌入空间中将同类数据点聚拢、将异类数据点推远。尽管此类方法优于基于交叉熵的方法,它们同样无法免疫标注错误。然而,噪声标签对监督学习的危害已被充分研究,其对 SCL 的影响却在很大程度上仍未被探索。因此,我们分析了标签错误的影响,并考察其如何破坏 SCL 算法区分正负样本对的能力。我们的分析显示,约 99% 的情形下,人为标注错误表现为“容易的正样本”。为此,我们提出 D-SCL,一种旨在减轻标注错误所引入偏差的新型去偏监督对比学习目标。我们证明 D-SCL 在多样的视觉基准上持续超越最先进的表示学习技术,并对标签错误具有更强的鲁棒性。
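
A hedged sketch of a supervised contrastive objective with a simple debiasing tweak: positives whose similarity is implausibly high (the "easy positives" the analysis points to) are down-weighted. The weighting rule and threshold are invented for illustration; D-SCL's actual objective differs in detail.

```python
# Sketch: supervised contrastive loss that down-weights suspiciously easy
# positives. The 0.95 threshold and 0.1 weight are invented for illustration.
import torch
import torch.nn.functional as F

def debiased_supcon(z, labels, tau=0.1, easy_thresh=0.95):
    z = F.normalize(z, dim=1)
    cos = z @ z.t()                                   # pairwise cosine similarity
    sim = cos / tau
    eye = torch.eye(z.size(0), dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye # positive-pair mask
    # Positives with implausibly high similarity get a reduced weight.
    w = torch.where(cos > easy_thresh, torch.tensor(0.1), torch.tensor(1.0)) * pos
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9), dim=1, keepdim=True)
    return -(w * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

z = torch.randn(16, 64)
labels = torch.randint(0, 4, (16,))
print(debiased_supcon(z, labels))
```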

Neural Network Based Approach to Recognition of Meteor Tracks in the Mini-EUSO Telescope Data

  • paper_url: http://arxiv.org/abs/2311.14983
  • repo_url: None
  • paper_authors: Mikhail Zotov, Dmitry Anzhiganov, Aleksandr Kryazhenkov, Dario Barghini, Matteo Battisti, Alexander Belov, Mario Bertaina, Marta Bianciotto, Francesca Bisconti, Carl Blaksley, Sylvie Blin, Giorgio Cambiè, Francesca Capel, Marco Casolino, Toshikazu Ebisuzaki, Johannes Eser, Francesco Fenu, Massimo Alberto Franceschi, Alessio Golzio, Philippe Gorodetzky, Fumiyoshi Kajino, Hiroshi Kasuga, Pavel Klimov, Massimiliano Manfrin, Laura Marcelli, Hiroko Miyamoto, Alexey Murashov, Tommaso Napolitano, Hiroshi Ohmori, Angela Olinto, Etienne Parizot, Piergiorgio Picozza, Lech Wiktor Piotrowski, Zbigniew Plebaniak, Guillaume Prévôt, Enzo Reali, Marco Ricci, Giulia Romoli, Naoto Sakaki, Kenji Shinozaki, Christophe De La Taille, Yoshiyuki Takizawa, Michal Vrábel, Lawrence Wiencke
  • for: 这项研究用于开发一种能够在MINI-EUSO数据中识别流星信号的人工神经网络模型。
  • methods: 该研究使用了两种简单的人工神经网络模型来识别流星信号,并达到了高精度水平。
  • results: 研究发现这两种神经网络模型可以有效地用于其他荧光望远镜中的信号识别,无论信号的性质如何。
    Abstract Mini-EUSO is a wide-angle fluorescence telescope that registers ultraviolet (UV) radiation in the nocturnal atmosphere of Earth from the International Space Station. Meteors are among multiple phenomena that manifest themselves not only in the visible range but also in the UV. We present two simple artificial neural networks that allow for recognizing meteor signals in the Mini-EUSO data with high accuracy in terms of a binary classification problem. We expect that similar architectures can be effectively used for signal recognition in other fluorescence telescopes, regardless of the nature of the signal. Due to their simplicity, the networks can be implemented in onboard electronics of future orbital or balloon experiments.
    摘要 Mini-EUSO 是一台宽角荧光望远镜,从国际空间站记录地球夜间大气中的紫外(UV)辐射。流星是既出现在可见波段、也出现在紫外波段的多种现象之一。我们提出两种简单的人工神经网络,能以高精度将 Mini-EUSO 数据中的流星信号识别为一个二分类问题。我们预计类似的结构可有效用于其他荧光望远镜中的信号识别,而与信号的性质无关。由于其结构简单,这些网络可在未来的轨道或气球实验的机载电子设备中实现。

Multi-task Planar Reconstruction with Feature Warping Guidance

  • paper_url: http://arxiv.org/abs/2311.14981
  • repo_url: None
  • paper_authors: Luan Wei, Anna Hilsmann, Peter Eisert
  • for: 该论文目的是提出一种实时的平面三角形重建模型,能同时对每个平面实例进行 semantic 预测和平面参数重建。
  • methods: 该模型基于修改的实例分割架构,使用多视图指导在特征空间中进行Feature sharing,以提高实例掩蔽分割精度。
  • results: 该模型在推理时仅用单张图像即可同时预测每个平面实例的语义,并以 43 FPS 的速度实现实时预测。
    Abstract Piece-wise planar 3D reconstruction simultaneously segments plane instances and recovers their 3D plane parameters from an image, which is particularly useful for indoor or man-made environments. Efficient reconstruction of 3D planes coupled with semantic predictions offers advantages for a wide range of applications requiring scene understanding and concurrent spatial mapping. However, most existing planar reconstruction models either neglect semantic predictions or do not run efficiently enough for real-time applications. We introduce SoloPlanes, a real-time planar reconstruction model based on a modified instance segmentation architecture which simultaneously predicts semantics for each plane instance, along with plane parameters and piece-wise plane instance masks. By providing multi-view guidance in feature space, we achieve an improvement in instance mask segmentation despite only warping plane features due to the nature of feature sharing in multi-task learning. Our model simultaneously predicts semantics using single images at inference time, while achieving real-time predictions at 43 FPS. The code will be released post-publication.
    摘要 分段平面三维重建同时分割平面实例并从图像恢复其三维平面参数,对室内或人造环境尤为有用。高效的三维平面重建与语义预测相结合,能为需要场景理解与并行空间建图的众多应用带来优势。然而,现有的平面重建模型大多忽略语义预测,或者运行效率不足以支撑实时应用。我们提出 SoloPlanes,一种基于改进实例分割架构的实时平面重建模型,可同时预测每个平面实例的语义、平面参数与分段平面实例掩码。通过在特征空间提供多视图引导,尽管由于多任务学习中特征共享的特性仅对平面特征进行变形(warping),我们仍实现了实例掩码分割的提升。我们的模型在推理时仅用单张图像即可同时预测语义,并以 43 FPS 实现实时预测。代码将在论文发表后发布。

Incorporating granularity bias as the margin into contrastive loss for video captioning

  • paper_url: http://arxiv.org/abs/2311.14977
  • repo_url: None
  • paper_authors: Jiayang Gu, Fengming Yao
  • for: The paper aims to mitigate the impact of granularity bias on video captioning models, which often generate vague sentences instead of accurate ones.
  • methods: The proposed method uses a statistical-based bias extractor to quantify the information content within sentences and videos, and incorporates a bidirectional triplet loss with a margin score to establish distinct training objectives for head and tail sentences.
  • results: The proposed model demonstrates state-of-the-art performance on two benchmark datasets, MSRVTT and MSVD, with CIDEr scores of 57.17 and 138.68, respectively.
    Abstract Video captioning models easily suffer from long-tail distribution of phrases, which makes captioning models prone to generate vague sentences instead of accurate ones. However, existing debiasing strategies tend to export external knowledge to build dependency trees of words or refine frequency distribution by complex losses and extra input features, which lack interpretability and are hard to train. To mitigate the impact of granularity bias on the model, we introduced a statistical-based bias extractor. This extractor quantifies the information content within sentences and videos, providing an estimate of the likelihood that a video-sentence pair is affected by granularity bias. Furthermore, with the growing trend of integrating contrastive learning methods into video captioning tasks, we use a bidirectional triplet loss to get more negative samples in a batch. Subsequently, we incorporate the margin score into the contrastive learning loss, establishing distinct training objectives for head and tail sentences. This approach facilitates the model's training effectiveness on tail samples. Our simple yet effective loss, incorporating Granularity bias, is referred to as the Margin-Contrastive Loss (GMC Loss). The proposed model demonstrates state-of-the-art performance on MSRVTT with a CIDEr of 57.17, and MSVD, where CIDEr reaches up to 138.68.
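
A sketch of the bidirectional triplet loss with a per-pair margin may clarify the idea. It assumes the statistical bias extractor has already produced one margin score per video-sentence pair; the cosine similarity and in-batch negatives are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gmc_loss(video_emb, text_emb, margin_scores):
    # video_emb, text_emb: (B, D); margin_scores: (B,) from the bias extractor
    sim = F.normalize(video_emb, dim=1) @ F.normalize(text_emb, dim=1).t()  # (B, B)
    pos = sim.diag().unsqueeze(1)             # similarity of matched pairs
    m = margin_scores.unsqueeze(1)            # per-pair margin from the extractor
    loss_v2t = F.relu(m + sim - pos)          # video -> text: in-batch negatives
    loss_t2v = F.relu(m + sim.t() - pos)      # text -> video: in-batch negatives
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (loss_v2t[off_diag].mean() + loss_t2v[off_diag].mean()) / 2
```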

Segmentation of diagnostic tissue compartments on whole slide images with renal thrombotic microangiopathies (TMAs)

  • paper_url: http://arxiv.org/abs/2311.14971
  • repo_url: None
  • paper_authors: Huy Q. Vo, Pietro A. Cicalese, Surya Seshan, Syed A. Rizvi, Aneesh Vathul, Gloria Bueno, Anibal Pedraza Dorado, Niels Grabe, Katharina Stolle, Francesco Pesce, Joris J. T. H. Roelofs, Jesper Kers, Vitoantonio Bevilacqua, Nicola Altini, Bernd Schröppel, Dario Roccatello, Antonella Barreca, Savino Sciascia, Chandra Mohan, Hien V. Nguyen, Jan U. Becker
  • for: The study develops a machine learning- and computer vision-based segmentation model that automatically identifies the decisive kidney tissue compartments in renal biopsy whole slide images, to improve the precision and efficiency of TMA diagnostics.
  • methods: The segmentation model combines U-Net-based tissue detection with a Shifted windows-transformer architecture, reaching highly accurate segmentation even for the most severely altered glomeruli, arterioles and arteries, and even on unseen staining domains from a different nephropathology lab.
  • results: The model accurately and automatically segments the decisive compartments (arteries, arterioles, glomeruli) across staining domains, laying the foundation for large-scale compartment-specific machine learning and computer vision analysis of renal biopsy repositories with TMAs.
    Abstract The thrombotic microangiopathies (TMAs) manifest in renal biopsy histology with a broad spectrum of acute and chronic findings. Precise diagnostic criteria for a renal biopsy diagnosis of TMA are missing. As a first step towards a machine learning- and computer vision-based analysis of whole slide images from renal biopsies, we trained a segmentation model for the decisive diagnostic kidney tissue compartments artery, arteriole, glomerulus on a set of whole slide images from renal biopsies with TMAs and Mimickers (distinct diseases with a similar nephropathological appearance to TMA, such as severe benign nephrosclerosis, various vasculitides, Bevacizumab-plug glomerulopathy, arteriolar light chain deposition disease). Our segmentation model combines a U-Net-based tissue detection with a Shifted windows-transformer architecture to reach excellent segmentation results for even the most severely altered glomeruli, arterioles and arteries, even on unseen staining domains from a different nephropathology lab. With accurate automatic segmentation of the decisive renal biopsy compartments in human renal vasculopathies, we have laid the foundation for large-scale compartment-specific machine learning and computer vision analysis of renal biopsy repositories with TMAs.

Point Cloud Pre-training with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.14960
  • repo_url: None
  • paper_authors: Xiao Zheng, Xiaoshui Huang, Guofeng Mei, Yuenan Hou, Zhaoyang Lyu, Bo Dai, Wanli Ouyang, Yongshun Gong
  • for: The study develops a pre-training method applicable to diverse point cloud backbones, improving performance on downstream point cloud tasks.
  • methods: The proposed Point cloud Diffusion pre-training (PointDif) treats pre-training as a conditional point-to-point generation problem and introduces a conditional point generator. The generator aggregates features extracted by the backbone and uses them as the condition to guide point-to-point recovery from the noisy point cloud, helping the backbone capture local and global geometric priors as well as the global point density distribution of the object. A recurrent uniform sampling optimization strategy further lets the model recover uniformly from various noise levels and learn from balanced supervision.
  • results: Substantial improvements across real-world datasets on classification, segmentation and detection: 70.0% mIoU on S3DIS Area 5 for segmentation and an average improvement of 2.4% on ScanObjectNN for classification compared to TAP; the framework also applies flexibly to diverse backbones with considerable gains.
    Abstract Pre-training a model and then fine-tuning it on downstream tasks has demonstrated significant success in the 2D image and NLP domains. However, due to the unordered and non-uniform density characteristics of point clouds, it is non-trivial to explore the prior knowledge of point clouds and pre-train a point cloud backbone. In this paper, we propose a novel pre-training method called Point cloud Diffusion pre-training (PointDif). We consider the point cloud pre-training task as a conditional point-to-point generation problem and introduce a conditional point generator. This generator aggregates the features extracted by the backbone and employs them as the condition to guide the point-to-point recovery from the noisy point cloud, thereby assisting the backbone in capturing both local and global geometric priors as well as the global point density distribution of the object. We also present a recurrent uniform sampling optimization strategy, which enables the model to uniformly recover from various noise levels and learn from balanced supervision. Our PointDif achieves substantial improvement across various real-world datasets for diverse downstream tasks such as classification, segmentation and detection. Specifically, PointDif attains 70.0% mIoU on S3DIS Area 5 for the segmentation task and achieves an average improvement of 2.4% on ScanObjectNN for the classification task compared to TAP. Furthermore, our pre-training framework can be flexibly applied to diverse point cloud backbones and bring considerable gains.
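
The conditional point-to-point generation objective can be sketched as one diffusion training step. `backbone` and `generator` below are placeholders (the abstract does not give their architectures), and the noise schedule is the standard DDPM one.

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # standard DDPM schedule

def pointdif_step(backbone, generator, points):
    """One conditional denoising pre-training step. `points` is a clean
    cloud (B, N, 3); `backbone` returns a global feature used as the
    condition; `generator` predicts the injected noise. Both modules
    are placeholders for the paper's (unspecified) architectures."""
    B = points.size(0)
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(points)
    noisy = a.sqrt() * points + (1 - a).sqrt() * noise   # forward diffusion
    cond = backbone(points)                              # condition from clean cloud
    pred = generator(noisy, t, cond)                     # predict the injected noise
    return F.mse_loss(pred, noise)
```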

OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss

  • paper_url: http://arxiv.org/abs/2311.14939
  • repo_url: None
  • paper_authors: Zezhou Wang, Guitao Cao, Xidong Xi, Jiangtao Wang
  • for: Improve the accuracy and robustness of object detection for autonomous driving in the face of environmental uncertainty and class imbalance.
  • methods: The proposed OpenNet moderates class imbalance with a Balanced Loss based on cross-entropy, adopts an inductive layer based on gradient reshaping to quickly learn new classes from limited samples during incremental learning, and employs normalized feature distillation against catastrophic forgetting; FPN and energy-based detection improve multi-scale robustness and unknown-class recognition, respectively.
  • results: Experiments on the CODA dataset show the proposed method outperforms existing methods.
    Abstract Automated driving object detection has always been a challenging task in computer vision due to environmental uncertainties. These uncertainties include significant differences in object sizes and encountering the class unseen. It may result in poor performance when traditional object detection models are directly applied to automated driving detection, because they usually presume fixed categories of common traffic participants, such as pedestrians and cars. Worse, the huge class imbalance between common and novel classes further exacerbates performance degradation. To address the issues stated, we propose OpenNet to moderate the class imbalance with the Balanced Loss, which is based on Cross Entropy Loss. Besides, we adopt an inductive layer based on gradient reshaping to fast learn new classes with limited samples during incremental learning. To guard against catastrophic forgetting, we employ normalized feature distillation. In addition, we improve multi-scale detection robustness and unknown class recognition through FPN and energy-based detection, respectively. The experimental results upon the CODA dataset show that the proposed method can obtain better performance than that of the existing methods.
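
The abstract states only that the Balanced Loss is based on cross-entropy. One common way to realize that idea is class-balanced weighting by effective number of samples (Cui et al., 2019), sketched below as a stand-in; the weighting scheme is an assumption, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def balanced_ce_loss(logits, targets, class_counts, beta=0.999):
    # Effective-number class weights: rare (novel) classes get larger weight.
    effective = 1.0 - beta ** class_counts.float()
    weights = (1.0 - beta) / effective
    weights = weights / weights.sum() * len(class_counts)
    return F.cross_entropy(logits, targets, weight=weights)

counts = torch.tensor([5000, 4000, 50, 10])   # common vs. novel classes
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
print(balanced_ce_loss(logits, targets, counts))
```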

View-Based Luminance Mapping in Open Workplace

  • paper_url: http://arxiv.org/abs/2311.14927
  • repo_url: None
  • paper_authors: Guanzhou Ji, Tingsong Ou, Azadeh O. Sawyer
  • for: Improve the daylight performance of open workplaces.
  • methods: A computational method maps indoor luminance values onto the facade: 180-degree fisheye renderings from different indoor locations, view positions, and times of the year are transformed from 2D images into 3D hemispheres, and high luminance values are filtered and projected from the hemisphere to the facade surface.
  • results: The framework efficiently highlights facade areas that admit too much light into the interior, and the flexible workflow supports occupant-centric lighting analysis that computes multiple design parameters and synthesizes results for localized facade optimization and daylight design.
    Abstract This paper introduces a novel computational method for mapping indoor luminance values on the facade of an open workplace to improve its daylight performance. 180-degree fisheye renderings from different indoor locations, view positions, and times of the year are created. These renderings are then transformed from two-dimensional (2D) images into three-dimensional (3D) hemispheres. High luminance values are filtered and projected from the hemisphere to the facade surface. This framework will highlight the areas of the facade that allow too much light penetration into the interior environment. The flexible workflow allows occupant centric lighting analysis that computes multiple design parameters and synthesizes results for localized facade optimization and daylight design.
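
The 2D-to-3D step can be sketched as mapping each fisheye pixel to a direction on the view hemisphere. An equidistant fisheye projection is assumed here; the paper does not state which lens model its renderings use.

```python
import numpy as np

def fisheye_to_hemisphere(px, py, cx, cy, radius):
    # Normalize pixel offsets by the fisheye circle radius.
    dx, dy = (px - cx) / radius, (py - cy) / radius
    r = np.hypot(dx, dy)
    if r > 1.0:
        return None                       # pixel outside the fisheye circle
    theta = r * np.pi / 2.0               # zenith angle: 0 at center, 90 deg at rim
    phi = np.arctan2(dy, dx)              # azimuth
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# Directions whose luminance exceeds a glare threshold would then be
# ray-cast onto the facade plane for projection.
d = fisheye_to_hemisphere(820, 410, 512, 512, 512)
```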

Coordinate-based Neural Network for Fourier Phase Retrieval

  • paper_url: http://arxiv.org/abs/2311.14925
  • repo_url: None
  • paper_authors: Tingyou Li, Zixin Xu, Yong S. Chu, Xiaojing Huang, Jizhou Li
  • for: The study aims to improve Fourier phase retrieval for high-definition imaging of nanoscale structures across diverse fields, notably coherent diffraction imaging.
  • methods: It proposes the Single impliCit neurAl Network (SCAN), a tool built on coordinate neural networks. Unlike conventional iterative methods, which face high computational loads and are prone to noise interference, SCAN connects object coordinates to their amplitude and phase within a unified network in an unsupervised manner, and its loss incorporates both the predicted magnitude and phase rather than the Fourier magnitude alone.
  • results: Comprehensive tests show SCAN surpasses traditional and other deep learning models in accuracy and noise robustness, and it also excels in the ptychography setting.
    Abstract Fourier phase retrieval is essential for high-definition imaging of nanoscale structures across diverse fields, notably coherent diffraction imaging. This study presents the Single impliCit neurAl Network (SCAN), a tool built upon coordinate neural networks meticulously designed for enhanced phase retrieval performance. Bypassing the pitfalls of conventional iterative methods, which frequently face high computational loads and are prone to noise interference, SCAN adeptly connects object coordinates to their amplitude and phase within a unified network in an unsupervised manner. While many existing methods primarily use Fourier magnitude in their loss function, our approach incorporates both the predicted magnitude and phase, enhancing retrieval accuracy. Comprehensive tests validate SCAN's superiority over traditional and other deep learning models regarding accuracy and noise robustness. We also demonstrate that SCAN excels in the ptychography setting.
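
A minimal coordinate-network sketch in the spirit of SCAN: map pixel coordinates to amplitude and phase, then supervise through the Fourier magnitude of the resulting complex field (the paper's loss also uses the predicted phase). Layer sizes and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordNet(nn.Module):
    """Map (x, y) coordinates in [-1, 1] to (amplitude, phase)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, coords):                 # coords: (H*W, 2)
        out = self.mlp(coords)
        amp = torch.relu(out[:, 0])            # non-negative amplitude
        phase = torch.pi * torch.tanh(out[:, 1])  # phase in (-pi, pi)
        return amp, phase

def magnitude_loss(amp, phase, measured_mag, H, W):
    field = (amp * torch.exp(1j * phase)).reshape(H, W)
    pred_mag = torch.fft.fft2(field).abs()     # predicted Fourier magnitude
    return nn.functional.mse_loss(pred_mag, measured_mag)
```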

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

  • paper_url: http://arxiv.org/abs/2311.16511
  • repo_url: https://github.com/gpt4video/GPT4Video
  • paper_authors: Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu
  • for: The paper presents GPT4Video, a unified multimodal large language model that covers both video understanding and generation, filling the gap left by existing models on the generation side.
  • methods: It uses an instruction-following-based approach integrated with the stable diffusion generative model, handling video generation scenarios safely and effectively.
  • results: GPT4Video performs strongly on both video question answering and text-to-video generation, outperforming Valley by 11.8% on the former and surpassing NExt-GPT by 2.3% on the latter.
    Abstract While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for multimodal content generation. To fill this gap, we present GPT4Video, a unified multi-model framework that empowers Large Language Models (LLMs) with the capability of both video understanding and generation. Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has been demonstrated to effectively and securely handle video generation scenarios. GPT4Video offers the following benefits: 1) It exhibits impressive capabilities in both video understanding and generation scenarios. For example, GPT4Video outperforms Valley by 11.8% on the Video Question Answering task, and surpasses NExt-GPT by 2.3% on the Text to Video generation task. 2) It endows the LLM/MLLM with video generation capabilities without requiring additional training parameters and can flexibly interface with a wide range of models to perform video generation. 3) It maintains a safe and healthy conversation not only on the output side but also on the input side in an end-to-end manner. Qualitative and quantitative experiments demonstrate that GPT4Video holds the potential to function as an effective, safe and humanoid-like video assistant that can handle both video understanding and generation scenarios.

GBD-TS: Goal-based Pedestrian Trajectory Prediction with Diffusion using Tree Sampling Algorithm

  • paper_url: http://arxiv.org/abs/2311.14922
  • repo_url: https://github.com/Winderting/GBD-TS
  • paper_authors: Ge Sun, Sheng Wang, Yang Xiao, Lei Zhu, Ming Liu
  • for: Predict pedestrian trajectories to improve the safety and effectiveness of autonomous driving and mobile robots.
  • methods: A scene-aware multi-modal pedestrian trajectory prediction framework (GBD) combines goal prediction with a denoising diffusion probabilistic model (DDPM): a goal predictor produces multiple goals, and the diffusion network generates multi-modal trajectories conditioned on them; a new diffusion sampling algorithm, tree sampling (TS), leverages common features to reduce inference time and improve accuracy.
  • results: The GBD-TS method achieves state-of-the-art performance with real-time inference speed.
    Abstract Predicting pedestrian trajectories is crucial for improving the safety and effectiveness of autonomous driving and mobile robots. However, this task is nontrivial due to the inherent stochasticity of human motion, which naturally requires the predictor to generate multi-model prediction. Previous works have used various generative methods, such as GAN and VAE, for pedestrian trajectory prediction. Nevertheless, these methods may suffer from problems, including mode collapse and relatively low-quality results. The denoising diffusion probabilistic model (DDPM) has recently been applied to trajectory prediction due to its simple training process and powerful reconstruction ability. However, current diffusion-based methods are straightforward without fully leveraging input information and usually require many denoising iterations leading to a long inference time or an additional network for initialization. To address these challenges and promote the application of diffusion models in trajectory prediction, we propose a novel scene-aware multi-modal pedestrian trajectory prediction framework called GBD. GBD combines goal prediction with the diffusion network. First, the goal predictor produces multiple goals, and then the diffusion network generates multi-modal trajectories conditioned on these goals. Furthermore, we introduce a new diffusion sampling algorithm named tree sampling (TS), which leverages common feature to reduce the inference time and improve accuracy for multi-modal prediction. Experimental results demonstrate that our GBD-TS method achieves state-of-the-art performance with real-time inference speed.

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

  • paper_url: http://arxiv.org/abs/2311.14920
  • repo_url: None
  • paper_authors: Zhen Wang, Jun Xiao, Tao Chen, Long Chen
  • for: Improve the generalization ability of Explicit Caption Editing (ECE) models and the quality of the captions they produce.
  • methods: The Diffusion-based Explicit Caption editing method (DECap) reformulates ECE as a denoising process under the diffusion mechanism, introducing edit-based noising (word-level noise on the reference caption) and denoising processes, and generates edit operations and content words simultaneously instead of using a multi-stage design.
  • results: Experiments show strong generalization ability across scenarios and good caption quality, with potential to improve the quality and controllability of caption generation.
    Abstract Explicit Caption Editing (ECE) -- refining reference image captions through a sequence of explicit edit operations (e.g., KEEP, DELETE) -- has raised significant attention due to its explainable and human-like nature. After training with carefully designed reference and ground-truth caption pairs, state-of-the-art ECE models exhibit limited generalization ability beyond the original training data distribution, i.e., they are tailored to refine content details only in in-domain samples but fail to correct errors in out-of-domain samples. To this end, we propose a new Diffusion-based Explicit Caption editing method: DECap. Specifically, we reformulate the ECE task as a denoising process under the diffusion mechanism, and introduce innovative edit-based noising and denoising processes. Thanks to this design, the noising process can help to eliminate the need for meticulous paired data selection by directly introducing word-level noises for training, learning diverse distributions over the input reference captions. The denoising process involves the explicit predictions of edit operations and corresponding content words, refining reference captions through iterative step-wise editing. To further efficiently implement our diffusion process and improve the inference speed, DECap discards the prevalent multi-stage design and directly generates edit operations and content words simultaneously. Extensive ablations have demonstrated the strong generalization ability of DECap in various scenarios. More interestingly, it even shows great potential in improving the quality and controllability of caption generation.
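
The edit-based noising process can be sketched at the word level: as the diffusion step grows, more words of the reference caption are replaced with random vocabulary words. The linear corruption schedule and random replacement below are assumptions for illustration; the paper's actual noising design is more elaborate.

```python
import random

def word_level_noising(caption, vocab, num_steps, step):
    """Corrupt a fraction of words proportional to the diffusion step;
    the denoiser is trained to undo this via explicit edit operations."""
    words = caption.split()
    n_corrupt = round(len(words) * step / num_steps)
    for i in random.sample(range(len(words)), n_corrupt):
        words[i] = random.choice(vocab)   # inject word-level noise
    return " ".join(words)

# Example: progressively noisier references for training the denoiser.
vocab = ["dog", "red", "running", "table", "sky"]
print(word_level_noising("a man riding a horse on the beach", vocab, 10, 5))
```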

Resolution- and Stimulus-agnostic Super-Resolution of Ultra-High-Field Functional MRI: Application to Visual Studies

  • paper_url: http://arxiv.org/abs/2311.14918
  • repo_url: None
  • paper_authors: Hongwei Bran Li, Matthew S. Rosen, Shahin Nasr, Juan Eugenio Iglesias
  • for: The paper aims to increase the effective spatial resolution of fMRI, reducing the long scan times that high resolution otherwise requires.
  • methods: It uses a deep learning-based 3D super-resolution technique; a resolution-agnostic image augmentation framework lets the model adapt to varying voxel sizes without retraining.
  • results: The method visualizes fine-scale motion-selective sites in early visual areas, which normally require better than 1 mm isotropic resolution, from lower-resolution (2-3 mm isotropic) fMRI data, recovering their high-frequency interdigitated organization even with training data from different subjects and experimental paradigms, including non-visual resting-state fMRI.
    Abstract High-resolution fMRI provides a window into the brain's mesoscale organization. Yet, higher spatial resolution increases scan times, to compensate for the low signal and contrast-to-noise ratio. This work introduces a deep learning-based 3D super-resolution (SR) method for fMRI. By incorporating a resolution-agnostic image augmentation framework, our method adapts to varying voxel sizes without retraining. We apply this innovative technique to localize fine-scale motion-selective sites in the early visual areas. Detection of these sites typically requires a resolution higher than 1 mm isotropic, whereas here, we visualize them based on lower resolution (2-3mm isotropic) fMRI data. Remarkably, the super-resolved fMRI is able to recover high-frequency detail of the interdigitated organization of these sites (relative to the color-selective sites), even with training data sourced from different subjects and experimental paradigms -- including non-visual resting-state fMRI, underscoring its robustness and versatility. Quantitative and qualitative results indicate that our method has the potential to enhance the spatial resolution of fMRI, leading to a drastic reduction in acquisition time.
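
The resolution-agnostic augmentation can be sketched as generating training pairs at random effective voxel sizes: downsample a volume by a random factor, then resample it back onto the original grid. The factor range below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def random_resolution_pair(hr_volume, min_factor=2.0, max_factor=3.0):
    """hr_volume: (B, C, D, H, W). Returns (network input, target) where
    the input mimics a lower, randomly chosen acquisition resolution."""
    f = torch.empty(1).uniform_(min_factor, max_factor).item()
    d, h, w = hr_volume.shape[-3:]
    lr = F.interpolate(hr_volume, scale_factor=1 / f, mode="trilinear",
                       align_corners=False)
    lr_on_grid = F.interpolate(lr, size=(d, h, w), mode="trilinear",
                               align_corners=False)
    return lr_on_grid, hr_volume

x = torch.randn(1, 1, 64, 64, 64)   # toy 3D fMRI-like volume
inp, target = random_resolution_pair(x)
```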

CUCL: Codebook for Unsupervised Continual Learning

  • paper_url: http://arxiv.org/abs/2311.14911
  • repo_url: None
  • paper_authors: Chen Cheng, Jingkuan Song, Xiaosu Zhu, Junchen Zhu, Lianli Gao, Hengtao Shen
  • for: The study focuses on Unsupervised Continual Learning (UCL), which removes the need for high-quality manually labeled data, and addresses the suboptimal results UCL models show on the first few tasks.
  • methods: The proposed Codebook for Unsupervised Continual Learning (CUCL) injects diversity into the representation via Product Quantization and applies a cross-quantized contrastive loss between the original and quantized representations, driving the model to learn discriminative features that complete the class boundary; a Codebook Rehearsal additionally addresses catastrophic forgetting.
  • results: Extensive experiments on CIFAR100, TinyImageNet and MiniImageNet show the method significantly boosts both supervised and unsupervised approaches; on TinyImageNet it yields relative improvements of 12.76% over Simsiam and 7% over BYOL.
    Abstract The focus of this study is on Unsupervised Continual Learning (UCL), as it presents an alternative to Supervised Continual Learning which needs high-quality manual labeled data. The experiments under the UCL paradigm indicate a phenomenon where the results on the first few tasks are suboptimal. This phenomenon can render the model inappropriate for practical applications. To address this issue, after analyzing the phenomenon and identifying the lack of diversity as a vital factor, we propose a method named Codebook for Unsupervised Continual Learning (CUCL) which promotes the model to learn discriminative features to complete the class boundary. Specifically, we first introduce a Product Quantization to inject diversity into the representation and apply a cross quantized contrastive loss between the original representation and the quantized one to capture discriminative information. Then, based on the quantizer, we propose an effective Codebook Rehearsal to address catastrophic forgetting. This study involves conducting extensive experiments on CIFAR100, TinyImageNet, and MiniImageNet benchmark datasets. Our method significantly boosts the performances of supervised and unsupervised methods. For instance, on TinyImageNet, our method led to a relative improvement of 12.76% and 7% when compared with Simsiam and BYOL, respectively.
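
Two core pieces can be sketched: nearest-codeword quantization with a straight-through estimator, and an InfoNCE-style cross-quantized contrastive loss between the original and quantized views. The paper's Product Quantization splits features into sub-vectors with one codebook each; a single codebook is used here for brevity.

```python
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    """Nearest-codeword quantization with a straight-through estimator.
    z: (B, D); codebook: (K, D)."""
    idx = torch.cdist(z, codebook).argmin(dim=1)
    zq = codebook[idx]
    return z + (zq - z).detach()    # gradients flow back to z

def cross_quantized_contrastive(z, zq, temperature=0.1):
    """InfoNCE between original and quantized views of the same batch."""
    z, zq = F.normalize(z, dim=1), F.normalize(zq, dim=1)
    logits = z @ zq.t() / temperature
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)
```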

Continual Referring Expression Comprehension via Dual Modular Memorization

  • paper_url: http://arxiv.org/abs/2311.14909
  • repo_url: https://github.com/zackschen/DMM
  • paper_authors: Heng Tao Shen, Cheng Chen, Peng Wang, Lianli Gao, Meng Wang, Jingkuan Song
  • for: The work aims to make Referring Expression Comprehension (REC) models practical for real-world scenarios by dropping the assumption that all training data are available upfront.
  • methods: It introduces the Continual Referring Expression Comprehension (CREC) setting, where an REC model learns on a stream of incoming tasks, and proposes Dual Modular Memorization (DMM) with two modules against catastrophic forgetting: Implicit-Memory, which constrains drastic changes to parameters important for old tasks, and Explicit-Memory, which maintains a buffer pool of representative samples of each seen task for rehearsal.
  • results: Extensive experiments on three newly constructed benchmarks (re-split from RefCOCO, RefCOCO+ and RefCOCOg) show DMM significantly outperforms alternatives on two popular REC backbones.
    Abstract Referring Expression Comprehension (REC) aims to localize an image region of a given object described by a natural-language expression. While promising performance has been demonstrated, existing REC algorithms make a strong assumption that training data feeding into a model are given upfront, which degrades its practicality for real-world scenarios. In this paper, we propose Continual Referring Expression Comprehension (CREC), a new setting for REC, where a model is learning on a stream of incoming tasks. In order to continuously improve the model on sequential tasks without forgetting prior learned knowledge and without repeatedly re-training from a scratch, we propose an effective baseline method named Dual Modular Memorization (DMM), which alleviates the problem of catastrophic forgetting by two memorization modules: Implicit-Memory and Explicit-Memory. Specifically, the former module aims to constrain drastic changes to important parameters learned on old tasks when learning a new task; while the latter module maintains a buffer pool to dynamically select and store representative samples of each seen task for future rehearsal. We create three benchmarks for the new CREC setting, by respectively re-splitting three widely-used REC datasets RefCOCO, RefCOCO+ and RefCOCOg into sequential tasks. Extensive experiments on the constructed benchmarks demonstrate that our DMM method significantly outperforms other alternatives, based on two popular REC backbones. We make the source code and benchmarks publicly available to foster future progress in this field: https://github.com/zackschen/DMM.
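
The Explicit-Memory module can be sketched as a per-task rehearsal buffer. Random selection below is a simplification; the paper selects and stores representative samples dynamically.

```python
import random

class ExplicitMemory:
    """Store a few samples per seen task and mix them into later batches."""
    def __init__(self, per_task=50):
        self.per_task = per_task
        self.buffer = {}                           # task_id -> samples

    def store(self, task_id, samples):
        k = min(self.per_task, len(samples))
        self.buffer[task_id] = random.sample(samples, k)

    def rehearse(self, batch_size):
        old = [s for samples in self.buffer.values() for s in samples]
        return random.sample(old, min(batch_size, len(old)))

mem = ExplicitMemory(per_task=2)
mem.store("task_0", ["a photo of a man", "the red car", "a dog"])
print(mem.rehearse(2))    # rehearsal samples drawn from old tasks
```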

AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering

  • paper_url: http://arxiv.org/abs/2311.14906
  • repo_url: https://github.com/xiuyuan-chen/autoeval-video
  • paper_authors: Xiuyuan Chen, Yuan Lin, Yuchen Zhang, Weiran Huang
  • for: The paper evaluates large vision-language models on open-ended video question answering.
  • methods: It introduces a new, challenging benchmark with open-ended video questions across 9 skill dimensions and over 40 themes, and an LLM-based evaluation approach that annotates unique evaluation rules for every video-question pair, made robust via a novel adversarial annotation mechanism.
  • results: With instance-specific rules as the prompt, GPT-4 as an automatic evaluator reaches a stable accuracy of about 97.0%, comparable to the 94.9%-97.5% accuracy of human evaluators. Among eight evaluated models, GPT-4V(ision) leads with 32.2% accuracy, still far below the human accuracy of 72.8%; a case study reveals drawbacks such as limited temporal and dynamic comprehension and overly general responses.
    Abstract We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video-questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as prompt, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9% - 97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to human accuracy of 72.8%. By conducting an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension, and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.

Class Gradient Projection For Continual Learning

  • paper_url: http://arxiv.org/abs/2311.14905
  • repo_url: https://github.com/zackschen/CGP
  • paper_authors: Cheng Chen, Ji Zhang, Jingkuan Song, Lianli Gao
  • for: The paper addresses catastrophic forgetting in Continual Learning (CL).
  • methods: The proposed Class Gradient Projection (CGP) computes the gradient subspace from individual classes rather than tasks, so that gradient updates orthogonal to the subspaces of existing classes minimize interference; a Base Refining (BR) algorithm combines similar classes and refines class bases dynamically, and a contrastive learning method improves the handling of unseen tasks.
  • results: The method improves on previous approaches by 2.0% on the CIFAR-100 dataset.
    Abstract Catastrophic forgetting is one of the most critical challenges in Continual Learning (CL). Recent approaches tackle this problem by projecting the gradient update orthogonal to the gradient subspace of existing tasks. While the results are remarkable, those approaches ignore the fact that these calculated gradients are not guaranteed to be orthogonal to the gradient subspace of each class due to the class deviation in tasks, e.g., distinguishing "Man" from "Sea" v.s. differentiating "Boy" from "Girl". Therefore, this strategy may still cause catastrophic forgetting for some classes. In this paper, we propose Class Gradient Projection (CGP), which calculates the gradient subspace from individual classes rather than tasks. Gradient update orthogonal to the gradient subspace of existing classes can be effectively utilized to minimize interference from other classes. To improve the generalization and efficiency, we further design a Base Refining (BR) algorithm to combine similar classes and refine class bases dynamically. Moreover, we leverage a contrastive learning method to improve the model's ability to handle unseen tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed approach. It improves the previous methods by 2.0% on the CIFAR-100 dataset.
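
The core projection step can be sketched directly: given an orthonormal basis spanning the gradient subspace of previously seen classes, subtract the gradient's component inside that subspace. How the class bases are estimated and refined (the BR algorithm) is omitted here.

```python
import torch

def project_gradient(grad, class_bases):
    """grad: flattened gradient (d,); class_bases: orthonormal rows (k, d)
    spanning the gradient subspace of previously learned classes."""
    coeffs = class_bases @ grad            # components inside the subspace
    return grad - class_bases.t() @ coeffs # orthogonal remainder

# Example with a random orthonormal basis from a QR decomposition.
d, k = 1000, 8
basis, _ = torch.linalg.qr(torch.randn(d, k))   # basis: (d, k), columns orthonormal
g = torch.randn(d)
g_proj = project_gradient(g, basis.t())
print(torch.allclose(basis.t() @ g_proj, torch.zeros(k), atol=1e-5))  # True
```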

Parkinson Disease classification Using Contrastive Graph Cross-View Learning with Multimodal Fusion of SPECT Images and Clinical Features

  • paper_url: http://arxiv.org/abs/2311.14902
  • repo_url: None
  • paper_authors: Jun-En Ding, Chien-Chin Hsu, Feng Liu
  • for: Classify Parkinson's Disease (PD) patients, using multimodal feature fusion to improve prediction accuracy.
  • methods: The approach uses both image and non-image features, with a multimodal co-attention module that integrates embeddings from two graph views (derived from low-dimensional representations of SPECT images and clinical features) to extract more stable and structured features, plus a simplified contrastive-loss fusion over positive and negative pairs to strengthen cross-view fusion learning.
  • results: The graph-view multimodal approach reaches 91% accuracy and 92.8% AUC in five-fold cross-validation, and shows better predictive capability on non-image data than methods relying solely on machine learning.
    Abstract Parkinson's Disease (PD) is a neurodegenerative neurological disorder that impacts movement and afflicts over 10 million people worldwide. Previous researches have come up with deep learning models for predicting Parkinson's disease primarily using medical images and didn't leverage the manifold structure in the dataset. Our study introduces a multimodal approach with both image and non-image features with a contrastive cross-view graph fusion for Parkinson's disease classification. Specifically, we designed a multimodal co-attention module to integrate embeddings from two distinct graph views derived from low dimensional representation of images and clinical features, enabling the extraction of more stable and structured features from the multiview data. Additionally, we have devised a simplified fusion method utilizing a contrastive loss for positive and negative pairs, to enhance the model's overall cross-view fusion learning capabilities. In our experiments, the graph-view multimodal approach can achieve an accuracy rate of 91% and an AUC of 92.8% in five-fold cross-validation, and it also demonstrates superior predictive capabilities on non-image data as compared to methods that rely solely on machine learning methods.

HyperDID: Hyperspectral Intrinsic Image Decomposition with Deep Feature Embedding

  • paper_url: http://arxiv.org/abs/2311.14899
  • repo_url: None
  • paper_authors: Zhiqiang Gong, Xian Zhou, Wen Yao, Xiaohu Zheng, Ping Zhong
  • for: This paper aims to improve the classification performance of hyperspectral image analysis by introducing a novel framework called HyperDID, which leverages deep feature embedding principles to enhance the interpretability of hyperspectral data.
  • methods: The proposed HyperDID framework consists of three modules: the Environmental Feature Module (EFM), Categorical Feature Module (CFM), and Feature Discrimination Module (FDM). These modules work together to extract intrinsic features and separate environment-related and category-related features, leading to improved classification performance.
  • results: The proposed HyperDID framework was validated on three commonly used hyperspectral image datasets, and the results showed significant improvements in classification performance compared to traditional methods. The HyperDID framework has the potential to advance the capabilities of hyperspectral image analysis by leveraging deep feature embedding principles.
    Abstract The dissection of hyperspectral images into intrinsic components through hyperspectral intrinsic image decomposition (HIID) enhances the interpretability of hyperspectral data, providing a foundation for more accurate classification outcomes. However, the classification performance of HIID is constrained by the model's representational ability. To address this limitation, this study rethinks hyperspectral intrinsic image decomposition for classification tasks by introducing deep feature embedding. The proposed framework, HyperDID, incorporates the Environmental Feature Module (EFM) and Categorical Feature Module (CFM) to extract intrinsic features. Additionally, a Feature Discrimination Module (FDM) is introduced to separate environment-related and category-related features. Experimental results across three commonly used datasets validate the effectiveness of HyperDID in improving hyperspectral image classification performance. This novel approach holds promise for advancing the capabilities of hyperspectral image analysis by leveraging deep feature embedding principles. The implementation of the proposed method could be accessed soon at https://github.com/shendu-sw/HyperDID for the sake of reproducibility.

Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network

  • paper_url: http://arxiv.org/abs/2311.14897
  • repo_url: None
  • paper_authors: Wenqiao Li, Xiaohao Xu, Yao Gu, Bozhong Zheng, Shenghua Gao, Yingna Wu
  • for: The study aims to provide a scalable 3D anomaly detection method for detecting 3D anomalies in real-world scenarios.
  • methods: It proposes a self-supervised method, the Iterative Mask Reconstruction Network (IMRNet), together with a synthetic dataset named Anomaly-ShapeNet built on ShapeNet (1600 point cloud samples across 40 categories); training preserves potentially anomalous local regions with a geometry-aware sampling module, then randomly masks point patches and reconstructs the visible ones with a transformer.
  • results: IMRNet achieves 66.1% I-AUC on the Anomaly-ShapeNet dataset and 72.5% I-AUC on the Real3D-AD dataset, both surpassing previous state-of-the-art methods.
    Abstract Recently, 3D anomaly detection, a crucial problem involving fine-grained geometry discrimination, is getting more attention. However, the lack of abundant real 3D anomaly data limits the scalability of current models. To enable scalable anomaly data collection, we propose a 3D anomaly synthesis pipeline to adapt existing large-scale 3D models for 3D anomaly detection. Specifically, we construct a synthetic dataset, i.e., Anomaly-ShapeNet, based on ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 categories, which provides a rich and varied collection of data, enabling efficient training and enhancing adaptability to industrial scenarios. Meanwhile, to enable scalable representation learning for 3D anomaly localization, we propose a self-supervised method, i.e., the Iterative Mask Reconstruction Network (IMRNet). During training, we propose a geometry-aware sample module to preserve potentially anomalous local regions during point cloud down-sampling. Then, we randomly mask out point patches and send the visible patches to a transformer for reconstruction-based self-supervision. During testing, the point cloud repeatedly goes through the Mask Reconstruction Network, with each iteration's output becoming the next input. By merging and contrasting the final reconstructed point cloud with the initial input, our method successfully locates anomalies. Experiments show that IMRNet outperforms previous state-of-the-art methods, achieving 66.1% in I-AUC on the Anomaly-ShapeNet dataset and 72.5% in I-AUC on the Real3D-AD dataset. Our dataset will be released at https://github.com/Chopper-233/Anomaly-ShapeNet
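
Test-time anomaly scoring can be sketched as the iterative loop the abstract describes: the cloud passes repeatedly through the mask-reconstruction network, and the final reconstruction is contrasted with the input. `net` is a placeholder for the trained network, and nearest-neighbor distance is used here as one simple choice of per-point score.

```python
import torch

def imr_inference(net, points, n_iters=3):
    """points: (N, 3) cloud. Returns a per-point anomaly score based on
    the distance between the input and the final reconstruction."""
    x = points
    for _ in range(n_iters):
        x = net(x)                        # each output becomes the next input
    d = torch.cdist(points, x)            # (N, N) pairwise distances
    return d.min(dim=1).values            # distance to nearest reconstructed point
```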

cs.AI - 2023-11-25

SwiftLearn: A Data-Efficient Training Method of Deep Learning Models using Importance Sampling

  • paper_url: http://arxiv.org/abs/2311.15134
  • repo_url: None
  • paper_authors: Habib Hajimolahoseini, Omar Mohamed Awad, Walid Ahmed, Austin Wen, Saina Asani, Mohammad Hassanpour, Farnoosh Javadi, Mehdi Ahmadi, Foozhan Ataiefard, Kangling Liu, Yang Liu
  • for: Accelerate the training of deep learning models by using a subset of data samples selected during the warm-up stages of training.
  • methods: The data selection strategy is based on an importance criterion measured over the entire dataset during warm-up; the importance measure can be updated periodically during training so that every sample has a chance to return to the training loop if it shows higher importance, preserving model performance with fewer examples.
  • results: Pretraining and finetuning experiments across a variety of CV and NLP tasks show that model performance can be preserved with significant training speed-ups; for BERT finetuning on the GLUE benchmark, almost 90% of the data can be dropped for an average end-to-end speedup of 3.36x while keeping the average accuracy drop under 0.92%.
    Abstract In this paper, we present SwiftLearn, a data-efficient approach to accelerate training of deep learning models using a subset of data samples selected during the warm-up stages of training. This subset is selected based on an importance criteria measured over the entire dataset during warm-up stages, aiming to preserve the model performance with fewer examples during the rest of training. The importance measure we propose could be updated during training every once in a while, to make sure that all of the data samples have a chance to return to the training loop if they show a higher importance. The model architecture is unchanged but since the number of data samples controls the number of forward and backward passes during training, we can reduce the training time by reducing the number of training samples used in each epoch of training. Experimental results on a variety of CV and NLP models during both pretraining and finetuning show that the model performance could be preserved while achieving a significant speed-up during training. More specifically, BERT finetuning on GLUE benchmark shows that almost 90% of the data can be dropped achieving an end-to-end average speedup of 3.36x while keeping the average accuracy drop less than 0.92%.
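
A sketch of the warm-up selection step: score every sample under the current model and keep the highest-scoring fraction for later epochs. Using per-sample loss as the importance measure is an assumption, since the abstract describes the criterion only abstractly; it also notes the measure can be re-run periodically so dropped samples may re-enter.

```python
import torch

def select_important(model, loss_fn, dataset, keep_frac=0.1):
    """Return indices of the samples to train on after warm-up."""
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in dataset:
            # Per-sample loss as a proxy for importance (an assumption).
            scores.append(loss_fn(model(x.unsqueeze(0)),
                                  y.unsqueeze(0)).item())
    k = max(1, int(keep_frac * len(scores)))
    return torch.tensor(scores).topk(k).indices.tolist()
```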

Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

  • paper_url: http://arxiv.org/abs/2311.15131
  • repo_url: None
  • paper_authors: James Campbell, Richard Ren, Phillip Guo
  • for: investigate instructed dishonesty in large language models (LLMs)
  • methods: prompt engineering and mechanistic interpretability approaches
  • results: localized five layers and 46 attention heads that enable the model to answer honestly, and demonstrated the interventions work robustly across many prompts and dataset splits.
    Abstract Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty. In this paper, we investigate instructed dishonesty, wherein we explicitly prompt LLaMA-2-70b-chat to lie. We perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs. Using linear probing and activation patching, we localize five layers that appear especially important for lying. We then find just 46 attention heads within these layers that enable us to causally intervene such that the lying model instead answers honestly. We show that these interventions work robustly across many prompts and dataset splits. Overall, our work contributes a greater understanding of dishonesty in LLMs so that we may hope to prevent it.
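
Activation patching of a single attention head can be sketched with a forward hook. The module path and tensor layout below are assumptions about a generic LLaMA-style implementation (real HuggingFace modules return tuples and concatenate heads before the output projection), so this is only a schematic of the intervention, not the paper's code.

```python
import torch

def patch_head(model, layer, head, honest_activation):
    """Overwrite one attention head's output with an activation recorded
    from an honest run. Assumes a hypothetical module exposing per-head
    output of shape (batch, seq, n_heads, head_dim)."""
    attn = model.layers[layer].self_attn   # hypothetical module path

    def hook(module, inputs, output):
        out = output.clone()
        out[:, :, head, :] = honest_activation
        return out                         # returned value replaces the output

    return attn.register_forward_hook(hook)

# handle = patch_head(model, layer=20, head=3, honest_activation=act)
# ... run the lying prompt and observe whether the answer flips to honest ...
# handle.remove()
```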

NCL-SM: A Fully Annotated Dataset of Images from Human Skeletal Muscle Biopsies

  • paper_url: http://arxiv.org/abs/2311.15113
  • repo_url: https://github.com/atifkhanncl/ncl-sm
  • paper_authors: Atif Khan, Conor Lawless, Amy Vincent, Charlotte Warren, Valeria Di Leo, Tiago Gomes, A. Stephen McGough
  • for: The study aims to enable automatic, precise, reproducible single-cell analysis of human skeletal muscle biopsies, a fundamental tool for understanding many neuromuscular disorders.
  • methods: It releases NCL-SM, a high-quality bioimaging dataset for training and testing deep learning models, replacing the current reliance on custom tools and labour-intensive manual interventions.
  • results: NCL-SM provides 46 human skeletal muscle (SM) tissue cross-sections from healthy controls and patients with genetically diagnosed muscle pathology, with over 50k manually segmented muscle fibres (myofibres), plus curated annotations of the reasons for rejecting low-quality myofibres and low-quality regions, ready for downstream analysis.
    Abstract Single cell analysis of human skeletal muscle (SM) tissue cross-sections is a fundamental tool for understanding many neuromuscular disorders. For this analysis to be reliable and reproducible, identification of individual fibres within microscopy images (segmentation) of SM tissue should be automatic and precise. Biomedical scientists in this field currently rely on custom tools and general machine learning (ML) models, both followed by labour intensive and subjective manual interventions to fine-tune segmentation. We believe that fully automated, precise, reproducible segmentation is possible by training ML models. However, in this important biomedical domain, there are currently no good quality, publicly available annotated imaging datasets available for ML model training. In this paper we release NCL-SM: a high quality bioimaging dataset of 46 human SM tissue cross-sections from both healthy control subjects and from patients with genetically diagnosed muscle pathology. These images include $>$ 50k manually segmented muscle fibres (myofibres). In addition we also curated high quality myofibre segmentations, annotating reasons for rejecting low quality myofibres and low quality regions in SM tissue images, making these annotations completely ready for downstream analysis. This, we believe, will pave the way for development of a fully automatic pipeline that identifies individual myofibres within images of tissue sections and, in particular, also classifies individual myofibres that are fit for further analysis.

Everybody Needs a Little HELP: Explaining Graphs via Hierarchical Concepts

  • paper_url: http://arxiv.org/abs/2311.15112
  • repo_url: https://github.com/jonasjuerss/help
  • paper_authors: Jonas Jürß, Lucie Charlotte Magister, Pietro Barbiero, Pietro Liò, Nikola Simidjievski
  • for: The paper aims to improve the interpretability of graph neural networks (GNNs) so they can be trusted and deployed in high-stakes decision settings.
  • methods: It proposes Hierarchical Explainable Latent Pooling (HELP), a novel, inherently interpretable, non-spectral, end-to-end-learnable graph pooling method that reveals how concepts from different GNN layers compose into new ones and can learn to pool a variable number of arbitrary connected components.
  • results: HELP performs on par with standard GCNs and popular pooling methods in accuracy while yielding explanations aligned with expert knowledge in chemistry and social networks; concept completeness scores and a new concept conformity metric quantitatively verify that the discovered concepts are significantly easier to fully understand than those from previous work.
    Abstract Graph neural networks (GNNs) have led to major breakthroughs in a variety of domains such as drug discovery, social network analysis, and travel time estimation. However, they lack interpretability which hinders human trust and thereby deployment to settings with high-stakes decisions. A line of interpretable methods approach this by discovering a small set of relevant concepts as subgraphs in the last GNN layer that together explain the prediction. This can yield oversimplified explanations, failing to explain the interaction between GNN layers. To address this oversight, we provide HELP (Hierarchical Explainable Latent Pooling), a novel, inherently interpretable graph pooling approach that reveals how concepts from different GNN layers compose to new ones in later steps. HELP is more than 1-WL expressive and is the first non-spectral, end-to-end-learnable, hierarchical graph pooling method that can learn to pool a variable number of arbitrary connected components. We empirically demonstrate that it performs on-par with standard GCNs and popular pooling methods in terms of accuracy while yielding explanations that are aligned with expert knowledge in the domains of chemistry and social networks. In addition to a qualitative analysis, we employ concept completeness scores as well as concept conformity, a novel metric to measure the noise in discovered concepts, quantitatively verifying that the discovered concepts are significantly easier to fully understand than those from previous work. Our work represents a first step towards an understanding of graph neural networks that goes beyond a set of concepts from the final layer and instead explains the complex interplay of concepts on different levels.

Leveraging Diffusion Perturbations for Measuring Fairness in Computer Vision

  • paper_url: http://arxiv.org/abs/2311.15108
  • repo_url: None
  • paper_authors: Nicholas Lui, Bryan Chia, William Berrios, Candace Ross, Douwe Kiela
  • for: This paper aims to evaluate the fairness of computer vision models by creating a dataset balanced along demographic traits and benchmarking several vision-language models on a multi-class occupation classification task.
  • methods: The paper uses diffusion models to generate a large set of images depicting various occupations, and inpainting to generate multiple variants of each image with different perceived races.
  • results: The paper finds that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. The paper also finds significant disparities between the evaluated vision-and-language models using a fairness metric that measures the standard deviation in the probability of predicting the true occupation label across different perceived identity groups.
    Abstract Computer vision models have been known to encode harmful biases, leading to the potentially unfair treatment of historically marginalized groups, such as people of color. However, there remains a lack of datasets balanced along demographic traits that can be used to evaluate the downstream fairness of these models. In this work, we demonstrate that diffusion models can be leveraged to create such a dataset. We first use a diffusion model to generate a large set of images depicting various occupations. Subsequently, each image is edited using inpainting to generate multiple variants, where each variant refers to a different perceived race. Using this dataset, we benchmark several vision-language models on a multi-class occupation classification task. We find that images generated with non-Caucasian labels have a significantly higher occupation misclassification rate than images generated with Caucasian labels, and that several misclassifications are suggestive of racial biases. We measure a model's downstream fairness by computing the standard deviation in the probability of predicting the true occupation label across the different perceived identity groups. Using this fairness metric, we find significant disparities between the evaluated vision-and-language models. We hope that our work demonstrates the potential value of diffusion methods for fairness evaluations.
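
The fairness metric described above is simple enough to state directly; this minimal sketch computes it from hypothetical per-group probabilities of predicting the true occupation label (the group names and numbers are made up).

```python
import numpy as np

def fairness_std(true_label_probs):
    """Std. dev., across perceived identity groups, of the mean probability
    assigned to the true occupation label; lower values indicate fairer
    treatment across groups."""
    return float(np.std(list(true_label_probs.values())))

# Hypothetical per-group probabilities for one occupation and one model:
probs = {"group_a": 0.81, "group_b": 0.62, "group_c": 0.74}
print(f"fairness score (lower is fairer): {fairness_std(probs):.3f}")
```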

Unbalancedness in Neural Monge Maps Improves Unpaired Domain Translation

  • paper_url: http://arxiv.org/abs/2311.15100
  • repo_url: None
  • paper_authors: Luca Eyring, Dominik Klein, Théo Uscidda, Giovanni Palla, Niki Kilbertus, Zeynep Akata, Fabian Theis
  • for: Addresses the outlier problem in unpaired domain translation tasks.
  • methods: Proposes a theoretically grounded way to incorporate unbalancedness into any neural Monge map estimator, and integrates it with the OT flow matching (OT-FM) framework.
  • results: Experiments show improved performance on image translation tasks, with relevant features better preserved.
    Abstract In optimal transport (OT), a Monge map is known as a mapping that transports a source distribution to a target distribution in the most cost-efficient way. Recently, multiple neural estimators for Monge maps have been developed and applied in diverse unpaired domain translation tasks, e.g. in single-cell biology and computer vision. However, the classic OT framework enforces mass conservation, which makes it prone to outliers and limits its applicability in real-world scenarios. The latter can be particularly harmful in OT domain translation tasks, where the relative position of a sample within a distribution is explicitly taken into account. While unbalanced OT tackles this challenge in the discrete setting, its integration into neural Monge map estimators has received limited attention. We propose a theoretically grounded method to incorporate unbalancedness into any Monge map estimator. We improve existing estimators to model cell trajectories over time and to predict cellular responses to perturbations. Moreover, our approach seamlessly integrates with the OT flow matching (OT-FM) framework. While we show that OT-FM performs competitively in image translation, we further improve performance by incorporating unbalancedness (UOT-FM), which better preserves relevant features. We hence establish UOT-FM as a principled method for unpaired image translation.
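
The paper's contribution is neural, but the effect of unbalancedness is easiest to see in the discrete entropic setting. The sketch below is a standard unbalanced Sinkhorn iteration with KL-relaxed marginals, not the authors' estimator; it shows how relaxing mass conservation lets an outlier shed mass instead of being forcibly matched.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.5, rho=0.5, n_iter=500):
    """Entropic unbalanced OT: marginal constraints become KL penalties of
    strength rho; rho -> inf recovers balanced Sinkhorn. (For small eps a
    log-domain implementation is needed to avoid underflow.)"""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = rho / (rho + eps)                 # exponent from the KL relaxation
    for _ in range(n_iter):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]     # transport plan

# Toy 1-D example with an outlier source point far from every target:
x = np.array([0.0, 0.1, 5.0])              # 5.0 is the outlier
y = np.array([0.0, 0.2])
C = (x[:, None] - y[None, :]) ** 2
a, b = np.full(3, 1 / 3), np.full(2, 1 / 2)
P = unbalanced_sinkhorn(a, b, C)
print(P.sum(axis=1))   # the outlier's row mass shrinks toward zero
```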

Enhancing Sentiment Analysis Results through Outlier Detection Optimization

  • paper_url: http://arxiv.org/abs/2311.16185
  • repo_url: https://github.com/Stry233/Enhancing-Sentiment-Analysis-Results-through-Outlier-Detection-Optimization
  • paper_authors: Yuetian Chen, Mei Si
  • for: Improves the classification of text data with subjective labels by identifying and handling outliers.
  • methods: Uses the Deep SVDD algorithm, a one-class classification method, to detect outliers in text data; decision trees, KNN, logistic regression, and LDA serve as classifiers alongside a small language model.
  • results: Removing outliers improves classification outcomes in most cases; a large language model (DeBERTa v3 large) captures more complex patterns in the data, and performance gains are observed across multiple datasets.
    Abstract When dealing with text data containing subjective labels like speaker emotions, inaccuracies or discrepancies among labelers are not uncommon. Such discrepancies can significantly affect the performance of machine learning algorithms. This study investigates the potential of identifying and addressing outliers in text data with subjective labels, aiming to enhance classification outcomes. We utilized the Deep SVDD algorithm, a one-class classification method, to detect outliers in nine text-based emotion and sentiment analysis datasets. By employing both a small-sized language model (DistilBERT base model with 66 million parameters) and non-deep learning machine learning algorithms (decision tree, KNN, Logistic Regression, and LDA) as the classifier, our findings suggest that the removal of outliers can lead to enhanced results in most cases. Additionally, as outliers in such datasets are not necessarily unlearnable, we experimented with utilizing a large language model -- DeBERTa v3 large with 131 million parameters, which can capture very complex patterns in data. We continued to observe performance enhancements across multiple datasets.
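
A minimal sketch of the pipeline's shape, with TF-IDF features standing in for the paper's language-model embeddings and a fixed centroid standing in for the center that Deep SVDD actually learns; the texts and the removal threshold are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["i love this", "great movie", "terrible plot", "awful acting",
         "asdkjh 404 lorem", "nice film", "bad film", "pretty good overall"]
labels = np.array([1, 1, 0, 0, 0, 1, 0, 1])

X = TfidfVectorizer().fit_transform(texts).toarray()
center = X.mean(axis=0)                       # stand-in for Deep SVDD's center c
scores = np.linalg.norm(X - center, axis=1)   # one-class anomaly score
keep = scores <= np.quantile(scores, 0.875)   # drop the most anomalous ~12.5%

clf = LogisticRegression().fit(X[keep], labels[keep])
print("dropped as outliers:", [t for t, k in zip(texts, keep) if not k])
```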

Weakly-Supervised Audio-Visual Segmentation

  • paper_url: http://arxiv.org/abs/2311.15080
  • repo_url: None
  • paper_authors: Shentong Mo, Bhiksha Raj
  • for: Studies audio-visual segmentation, i.e., predicting pixel-level masks for sound sources in a video.
  • methods: Proposes a weakly-supervised audio-visual segmentation framework, WS-AVS, which learns multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning.
  • results: Experiments on AVSBench demonstrate effective weakly-supervised audio-visual segmentation in both single-source and multi-source scenarios.
    Abstract Audio-visual segmentation is a challenging task that aims to predict pixel-level masks for sound sources in a video. Previous work applied a comprehensive manually designed architecture with countless pixel-wise accurate masks as supervision. However, these pixel-level masks are expensive and not available in all cases. In this work, we aim to simplify the supervision as the instance-level annotation, i.e., weakly-supervised audio-visual segmentation. We present a novel Weakly-Supervised Audio-Visual Segmentation framework, namely WS-AVS, that can learn multi-scale audio-visual alignment with multi-scale multiple-instance contrastive learning for audio-visual segmentation. Extensive experiments on AVSBench demonstrate the effectiveness of our WS-AVS in the weakly-supervised audio-visual segmentation of single-source and multi-source scenarios.
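
The multi-scale, multiple-instance details are in the paper; the single-scale symmetric InfoNCE loss below only sketches the audio-visual alignment term that such contrastive objectives build on (batch size, dimension, and temperature are illustrative).

```python
import torch
import torch.nn.functional as F

def audio_visual_info_nce(audio_emb, visual_emb, tau=0.07):
    """Symmetric InfoNCE: the audio/visual pair from the same clip is the
    positive; every other pairing in the batch is a negative."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

a_in = torch.randn(8, 256, requires_grad=True)   # clip-level audio embeddings
v_in = torch.randn(8, 256, requires_grad=True)   # matching visual embeddings
loss = audio_visual_info_nce(a_in, v_in)
loss.backward()                                   # gradients reach both encoders
```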

Introducing SSBD+ Dataset with a Convolutional Pipeline for detecting Self-Stimulatory Behaviours in Children using raw videos

  • paper_url: http://arxiv.org/abs/2311.15072
  • repo_url: https://github.com/sarl-iiitb/ssbdplus-dataset
  • paper_authors: Vaibhavi Lokegaonkar, Vijay Jaisankar, Pon Deepika, Madhav Rao, T K Srikanth, Sarbani Mallick, Manjit Sodhi
  • for: Proposes a machine learning approach that automatically recognizes self-stimulatory behaviours in children, supporting early diagnosis of autism spectrum disorder (ASD).
  • methods: A pipelined deep learning architecture detects specific self-stimulatory behaviours; the work also contributes an augmented version of the Self Stimulatory Behavior Dataset (SSBD) and a new action-detection label, no-class.
  • results: The proposed pipeline model, targeted at real-time and hands-free automated diagnosis, achieves an overall accuracy of around 81%.
    Abstract Conventionally, evaluation for the diagnosis of Autism spectrum disorder is done by a trained specialist through questionnaire-based formal assessments and by observation of behavioral cues under various settings to capture the early warning signs of autism. These evaluation techniques are highly subjective and their accuracy relies on the experience of the specialist. In this regard, machine learning-based methods for automated capturing of early signs of autism from the recorded videos of the children is a promising alternative. In this paper, the authors propose a novel pipelined deep learning architecture to detect certain self-stimulatory behaviors that help in the diagnosis of autism spectrum disorder (ASD). The authors also supplement their tool with an augmented version of the Self Stimulatory Behavior Dataset (SSBD) and also propose a new label in SSBD Action detection: no-class. The deep learning model with the new dataset is made freely available for easy adoption to the researchers and developers community. An overall accuracy of around 81% was achieved from the proposed pipeline model that is targeted for real-time and hands-free automated diagnosis. All of the source code, data, licenses of use, and other relevant material is made freely available in https://github.com/sarl-iiitb/

Accurate and interpretable drug-drug interaction prediction enabled by knowledge subgraph learning

  • paper_url: http://arxiv.org/abs/2311.15056
  • repo_url: https://github.com/lars-research/knowddi
  • paper_authors: Yaqing Wang, Zaifei Yang, Quanming Yao
  • for: Improves the accuracy and interpretability of drug-drug interaction (DDI) prediction by exploiting biomedical knowledge graphs.
  • methods: Proposes KnowDDI, a graph neural network-based method that enriches drug representations with neighborhood information from large biomedical knowledge graphs, then learns a knowledge subgraph for each drug pair to explain the predicted interaction.
  • results: On two benchmark datasets, KnowDDI achieves state-of-the-art prediction performance and tolerates sparse knowledge graphs better than existing methods.
    Abstract Background: Discovering potential drug-drug interactions (DDIs) is a long-standing challenge in clinical treatments and drug developments. Recently, deep learning techniques have been developed for DDI prediction. However, they generally require a huge number of samples, while known DDIs are rare. Methods: In this work, we present KnowDDI, a graph neural network-based method that addresses the above challenge. KnowDDI enhances drug representations by adaptively leveraging rich neighborhood information from large biomedical knowledge graphs. Then, it learns a knowledge subgraph for each drug-pair to interpret the predicted DDI, where each of the edges is associated with a connection strength indicating the importance of a known DDI or resembling strength between a drug-pair whose connection is unknown. Thus, the lack of DDIs is implicitly compensated by the enriched drug representations and propagated drug similarities. Results: We evaluate KnowDDI on two benchmark DDI datasets. Results show that KnowDDI obtains the state-of-the-art prediction performance with better interpretability. We also find that KnowDDI suffers less than existing works given a sparser knowledge graph. This indicates that the propagated drug similarities play a more important role in compensating for the lack of DDIs when the drug representations are less enriched. Conclusions: KnowDDI nicely combines the efficiency of deep learning techniques and the rich prior knowledge in biomedical knowledge graphs. As an original open-source tool, KnowDDI can help detect possible interactions in a broad range of relevant interaction prediction tasks, such as protein-protein interactions, drug-target interactions and disease-gene interactions, eventually promoting the development of biomedicine and healthcare.
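
KnowDDI's learned, connection-strength-weighted subgraphs are its contribution; the sketch below only illustrates the common preprocessing step of extracting an enclosing k-hop subgraph around a drug pair from a knowledge graph, on a tiny hypothetical graph.

```python
import networkx as nx

def drug_pair_subgraph(kg, drug_u, drug_v, k=2):
    """Enclosing subgraph: every node within k hops of either drug."""
    nodes = set(nx.single_source_shortest_path_length(kg, drug_u, cutoff=k))
    nodes |= set(nx.single_source_shortest_path_length(kg, drug_v, cutoff=k))
    return kg.subgraph(nodes).copy()

# Tiny illustrative knowledge graph (entities and edges are hypothetical):
kg = nx.Graph()
kg.add_edges_from([("drugA", "geneX"), ("geneX", "diseaseY"),
                   ("drugB", "geneX"), ("drugB", "pathwayZ"),
                   ("pathwayZ", "geneW"), ("geneW", "drugC")])
sub = drug_pair_subgraph(kg, "drugA", "drugB", k=1)
print(sorted(sub.nodes()))   # the context a GNN would score for this DDI
```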

MPCNN: A Novel Matrix Profile Approach for CNN-based Sleep Apnea Classification

  • paper_url: http://arxiv.org/abs/2311.15041
  • repo_url: https://github.com/vinuni-vishc/mpcnn-sleep-apnea
  • paper_authors: Hieu X. Nguyen, Duong V. Nguyen, Hieu H. Pham, Cuong D. Do
  • for: Improves sleep apnea diagnosis from electrocardiogram (ECG) signals by mining the information contained in complete PQRST segments.
  • methods: A new feature extraction approach inspired by Matrix Profile algorithms generates a Euclidean distance profile from fixed-length signal subsequences, from which three distance-based features are derived: MinDP, MaxDP, and MeanDP.
  • results: On the PhysioNet Apnea-ECG dataset, the new features achieve a per-segment accuracy of up to 92.11% and a per-recording accuracy of 100%, with the highest correlation coefficient (0.989) among compared methods; they also enhance certain lightweight models, showing potential for home sleep apnea testing.
    Abstract Sleep apnea (SA) is a significant respiratory condition that poses a major global health challenge. Previous studies have investigated several machine and deep learning models for electrocardiogram (ECG)-based SA diagnoses. Despite these advancements, conventional feature extractions derived from ECG signals, such as R-peaks and RR intervals, may fail to capture crucial information encompassed within the complete PQRST segments. In this study, we propose an innovative approach to address this diagnostic gap by delving deeper into the comprehensive segments of the ECG signal. The proposed methodology draws inspiration from Matrix Profile algorithms, which generate an Euclidean distance profile from fixed-length signal subsequences. From this, we derived the Min Distance Profile (MinDP), Max Distance Profile (MaxDP), and Mean Distance Profile (MeanDP) based on the minimum, maximum, and mean of the profile distances, respectively. To validate the effectiveness of our approach, we use the modified LeNet-5 architecture as the primary CNN model, along with two existing lightweight models, BAFNet and SE-MSCNN, for ECG classification tasks. Our extensive experimental results on the PhysioNet Apnea-ECG dataset revealed that with the new feature extraction method, we achieved a per-segment accuracy up to 92.11 \% and a per-recording accuracy of 100\%. Moreover, it yielded the highest correlation compared to state-of-the-art methods, with a correlation coefficient of 0.989. By introducing a new feature extraction method based on distance relationships, we enhanced the performance of certain lightweight models, showing potential for home sleep apnea test (HSAT) and SA detection in IoT devices. The source code for this work is made publicly available in GitHub: https://github.com/vinuni-vishc/MPCNN-Sleep-Apnea.
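
The three features named in the abstract follow directly from a distance profile. The brute-force sketch below computes the Euclidean distance profile of a query subsequence and reduces it to MinDP, MaxDP, and MeanDP; real Matrix Profile code would use a fast algorithm such as MASS, and the signal here is synthetic rather than ECG.

```python
import numpy as np

def distance_profile(signal, query):
    """Euclidean distance between `query` and every same-length sliding
    window of `signal` (O(n*m) brute force)."""
    windows = np.lib.stride_tricks.sliding_window_view(signal, len(query))
    return np.linalg.norm(windows - query, axis=1)

def min_max_mean_dp(signal, query):
    dp = distance_profile(signal, query)
    return dp.min(), dp.max(), dp.mean()     # MinDP, MaxDP, MeanDP

# Toy quasi-periodic trace; in the paper the subsequences would cover
# complete PQRST segments of an ECG recording.
rng = np.random.default_rng(0)
sig = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.standard_normal(2000)
print(min_max_mean_dp(sig, sig[100:200]))    # MinDP is 0: the query is in sig
```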

Word for Person: Zero-shot Composed Person Retrieval

  • paper_url: http://arxiv.org/abs/2311.16515
  • repo_url: https://github.com/Delong-liu-bupt/Word4Per
  • paper_authors: Delong Liu, Haiwen Li, Zhicheng Zhao, Fei Su, Hongying Meng
  • for: Improves person retrieval accuracy by jointly exploiting image and text information.
  • methods: Introduces the new Composed Person Retrieval (CPR) task and a Zero-shot Composed Person Retrieval (ZS-CPR) setting that leverages existing domain-related data to avoid expensive manual annotation.
  • results: A two-stage learning framework, Word4Per, combines a lightweight Textual Inversion Network with a fine-tuned CLIP-based retrieval model; together with the finely annotated ITCPR benchmark, experiments show Word4Per surpassing comparative methods on ZS-CPR by over 10%.
    Abstract Searching for a specific person has great security value and social benefits, and it often involves a combination of visual and textual information. Conventional person retrieval methods, whether image-based or text-based, usually fall short in effectively harnessing both types of information, leading to a loss of accuracy. In this paper, a whole new task called Composed Person Retrieval (CPR) is proposed to jointly utilize both image and text information for target person retrieval. However, supervised CPR depends on a very costly manually annotated dataset, and no such resources are currently available. To mitigate this issue, we first introduce Zero-shot Composed Person Retrieval (ZS-CPR), which leverages existing domain-related data to resolve the CPR problem without reliance on expensive annotations. Second, to learn the ZS-CPR model, we propose a two-stage learning framework, Word4Per, where a lightweight Textual Inversion Network (TINet) and a text-based person retrieval model based on a fine-tuned Contrastive Language-Image Pre-training (CLIP) network are learned without utilizing any CPR data. Third, a finely annotated Image-Text Composed Person Retrieval dataset (ITCPR) is built as the benchmark to assess the performance of the proposed Word4Per framework. Extensive experiments under both Rank-1 and mAP demonstrate the effectiveness of Word4Per for the ZS-CPR task, surpassing the comparative methods by over 10%. The code and ITCPR dataset will be publicly available at https://github.com/Delong-liu-bupt/Word4Per.
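
Word4Per's textual inversion stage is not reproduced here. As a heavily hedged baseline for the composed-query idea, the sketch below ranks a gallery with an off-the-shelf CLIP model by naively adding the reference-image and modifier-text embeddings; the checkpoint and the additive composition rule are assumptions of this sketch, not the paper's method.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def composed_query_scores(ref_image, modifier_text, gallery):
    """Cosine scores of gallery images against (reference image + text)."""
    with torch.no_grad():
        img = processor(images=[ref_image] + gallery, return_tensors="pt")
        feats = model.get_image_features(**img)
        txt = processor(text=[modifier_text], return_tensors="pt", padding=True)
        tfeat = model.get_text_features(**txt)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    tfeat = tfeat / tfeat.norm(dim=-1, keepdim=True)
    query = feats[0] + tfeat[0]              # naive image+text composition
    return feats[1:] @ (query / query.norm())

ref = Image.new("RGB", (224, 224), "gray")   # placeholder images
gallery = [Image.new("RGB", (224, 224), c) for c in ("red", "green", "blue")]
print(composed_query_scores(ref, "a person wearing a red jacket", gallery))
```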

On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor Data

  • paper_url: http://arxiv.org/abs/2311.15036
  • repo_url: None
  • paper_authors: Tianheng Ling, Chao Qian, Gregor Schiele
  • for: Advances the fusion of the physical and digital realms in autonomous systems by deploying soft sensors on-device rather than in the Cloud, improving perception and sensor fusion.
  • methods: Deploys Artificial Intelligence directly on devices within a wireless sensor network, and pairs a Microcontroller Unit with a Field-Programmable Gate Array (FPGA) to exploit the latter's rapid inference capabilities.
  • results: In a real-world use case, FPGA-based soft sensors achieve inference times from 1.04 to 12.04 microseconds, demonstrating efficient real-time inference and a feasible answer to the latency challenges of Cloud deployments.
    Abstract Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shifts towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve remarkably low inference times, ranging from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.

Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

  • paper_url: http://arxiv.org/abs/2311.15033
  • repo_url: None
  • paper_authors: Haoran Zhao, Fengxing Pan, Huqiuyue Ping, Yaoming Zhou
  • for: Develops an embodied agent for industrial drones based on an 'agent as cerebrum, controller as cerebellum' architecture.
  • methods: Harnesses Large Multimodal Models (LMMs) within an agent framework, AeroAgent, and introduces ROSchain, a linkage framework connecting LMM-based agents to the Robot Operating System (ROS).
  • results: In simulated experiments on Airgen and a real-world search-and-rescue case study, AeroAgent outperforms existing Deep Reinforcement Learning (DRL)-based agents.
    Abstract In this study, we present a novel paradigm for industrial robotic embodied agents, encapsulating an 'agent as cerebrum, controller as cerebellum' architecture. Our approach harnesses the power of Large Multimodal Models (LMMs) within an agent framework known as AeroAgent, tailored for drone technology in industrial settings. To facilitate seamless integration with robotic systems, we introduce ROSchain, a bespoke linkage framework connecting LMM-based agents to the Robot Operating System (ROS). We report findings from extensive empirical research, including simulated experiments on the Airgen and real-world case study, particularly in individual search and rescue operations. The results demonstrate AeroAgent's superior performance in comparison to existing Deep Reinforcement Learning (DRL)-based agents, highlighting the advantages of the embodied LMM in complex, real-world scenarios.

E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation

  • paper_url: http://arxiv.org/abs/2311.15016
  • repo_url: None
  • paper_authors: Fengyi Fu, Lei Zhang, Quan Wang, Zhendong Mao
  • for: Makes dialogue systems more empathetic and human-like.
  • methods: A multi-resolution emotion graph captures context-based emotion interactions and models emotion correlation; a correlation-enhanced decoder with correlation-aware aggregation and soft/hard strategies improves emotion perception and response generation.
  • results: Achieves stronger empathetic perception and expression on the benchmark dataset.
    Abstract Achieving empathy is a crucial step toward humanized dialogue systems. Current approaches for empathetic dialogue generation mainly perceive an emotional label to generate an empathetic response conditioned on it, which simply treat emotions independently, but ignore the intrinsic emotion correlation in dialogues, resulting in inaccurate emotion perception and unsuitable response generation. In this paper, we propose a novel emotion correlation enhanced empathetic dialogue generation framework, which comprehensively realizes emotion correlation learning, utilization, and supervising. Specifically, a multi-resolution emotion graph is devised to capture context-based emotion interactions from different resolutions, further modeling emotion correlation. Then we propose an emotion correlation enhanced decoder, with a novel correlation-aware aggregation and soft/hard strategy, respectively improving the emotion perception and response generation. Experimental results on the benchmark dataset demonstrate the superiority of our model in both empathetic perception and expression.

Exploring Causal Learning through Graph Neural Networks: An In-depth Review

  • paper_url: http://arxiv.org/abs/2311.14994
  • repo_url: None
  • paper_authors: Simi Job, Xiaohui Tao, Taotao Cai, Haoran Xie, Lin Li, Jianming Yong, Qing Li
  • for: Systematically reviews the use of graph neural networks (GNNs) for causal learning and introduces a novel taxonomy for categorizing the methods.
  • methods: Surveys state-of-the-art GNN approaches, including Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and Graph Autoencoders (GAEs), organized by their applications in the causality domain.
  • results: Provides an exhaustive compilation of datasets integral to causal learning with GNNs, reviews applications across diverse sectors, and outlines open challenges and promising directions for future work.
    Abstract In machine learning, exploring data correlations to predict outcomes is a fundamental task. Recognizing causal relationships embedded within data is pivotal for a comprehensive understanding of system dynamics, the significance of which is paramount in data-driven decision-making processes. Beyond traditional methods, there has been a surge in the use of graph neural networks (GNNs) for causal learning, given their capabilities as universal data approximators. Thus, a thorough review of the advancements in causal learning using GNNs is both relevant and timely. To structure this review, we introduce a novel taxonomy that encompasses various state-of-the-art GNN methods employed in studying causality. GNNs are further categorized based on their applications in the causality domain. We further provide an exhaustive compilation of datasets integral to causal learning with GNNs to serve as a resource for practical study. This review also touches upon the application of causal learning across diverse sectors. We conclude the review with insights into potential challenges and promising avenues for future exploration in this rapidly evolving field of machine learning.

Effective Backdoor Mitigation Depends on the Pre-training Objective

  • paper_url: http://arxiv.org/abs/2311.14948
  • repo_url: None
  • paper_authors: Sahil Verma, Gantavya Bhatt, Avi Schwarzschild, Soumye Singhal, Arnav Mohanty Das, Chirag Shah, John P Dickerson, Jeff Bilmes
  • for: This paper is written to investigate the effectiveness of CleanCLIP in mitigating backdoors in multimodal models, and to explore the relationship between pre-training objectives and backdoor removal.
  • methods: The paper uses two large datasets (CC3M and CC6M) and various pre-training objectives to train multimodal models, followed by poison removal using CleanCLIP. The authors also perform extensive hyperparameter tuning to evaluate the effectiveness of CleanCLIP under different conditions.
  • results: The paper finds that CleanCLIP is ineffective in removing backdoors when stronger pre-training objectives are used, and that simpler pre-training objectives are more amenable to effective backdoor removal. The findings highlight the importance of considering the trade-offs between pre-training objectives and security against backdoor attacks in ML deployments.
    Abstract Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for pre-training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in these models such as CleanCLIP which is the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives correlate with harder to remove backdoors behaviors. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints, under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP is ineffective when stronger pre-training objectives are used, even with extensive hyperparameter tuning. Our findings underscore critical considerations for ML practitioners who pre-train models using large-scale web-curated data and are concerned about potential backdoor threats. Notably, our results suggest that simpler pre-training objectives are more amenable to effective backdoor removal. This insight is pivotal for practitioners seeking to balance the trade-offs between using stronger pre-training objectives and security against backdoor attacks.

FreePIH: Training-Free Painterly Image Harmonization with Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.14926
  • repo_url: None
  • paper_authors: Ruibin Li, Jingcai Guo, Song Guo, Qihua Zhou, Jie Zhang
  • for: Provides an efficient, training-free painterly image harmonization (PIH) method, FreePIH, that achieves state-of-the-art harmonization using only a pre-trained diffusion model.
  • methods: Observing that the last few denoising steps carry an image's stylistic information, FreePIH augments the latent features of the foreground and background with Gaussians for direct denoising-based style transfer; multi-scale features enforce content consistency and foreground stability, and text prompts attend to the latent features to improve generation quality.
  • results: Quantitative and qualitative evaluations on the COCO and LAION 5B datasets show the method surpassing representative baselines by large margins.
    Abstract This paper provides an efficient training-free painterly image harmonization (PIH) method, dubbed FreePIH, that leverages only a pre-trained diffusion model to achieve state-of-the-art harmonization results. Unlike existing methods that require either training auxiliary networks or fine-tuning a large pre-trained backbone, or both, to harmonize a foreground object with a painterly-style background image, our FreePIH tames the denoising process as a plug-in module for foreground image style transfer. Specifically, we find that the very last few steps of the denoising (i.e., generation) process strongly correspond to the stylistic information of images, and based on this, we propose to augment the latent features of both the foreground and background images with Gaussians for a direct denoising-based harmonization. To guarantee the fidelity of the harmonized image, we make use of multi-scale features to enforce the consistency of the content and stability of the foreground objects in the latent space, and meanwhile, aligning both fore-/back-grounds with the same style. Moreover, to accommodate the generation with more structural and textural details, we further integrate text prompts to attend to the latent features, hence improving the generation quality. Quantitative and qualitative evaluations on COCO and LAION 5B datasets demonstrate that our method can surpass representative baselines by large margins.

LANS: A Layout-Aware Neural Solver for Plane Geometry Problem

  • paper_url: http://arxiv.org/abs/2311.16476
  • repo_url: None
  • paper_authors: Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin, Cheng-Lin Liu
  • for: addresses geometry problem solving (GPS), a task requiring multi-modal understanding, fusion, and reasoning.
  • methods: proposes a layout-aware neural solver (LANS) with two new modules: multimodal layout-aware pre-trained language model (MLA-PLM) and layout-aware fusion attention (LA-FA).
  • results: extensive experiments on Geometry3K and PGPS9K datasets show the effectiveness of the layout-aware modules and superior problem solving performance of LANS compared to existing symbolic solvers and neural solvers.
    Abstract Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion and reasoning. Existing neural solvers take GPS as a vision-language task but fall short in representing geometry diagrams, which carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: a multimodal layout-aware pre-trained language model (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural and semantic pre-training (SSP) to implement global relationship modeling, and point matching pre-training (PMP) to achieve alignment between visual points and textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion, further boosting the layout awareness of LANS. Extensive experiments on the Geometry3K and PGPS9K datasets validate the effectiveness of the layout-aware modules and the superior problem-solving performance of our LANS solver over existing symbolic and neural solvers. The code will be made publicly available soon.
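
LA-FA's exact design is in the paper; this sketch only shows the generic point-guided masking pattern the abstract describes: attention scores between text tokens and diagram points are suppressed wherever the layout marks the pair as unrelated (shapes and the random mask are illustrative).

```python
import torch
import torch.nn.functional as F

def layout_aware_attention(q, k, v, layout_mask):
    """Cross-modal attention where layout_mask[b, i, j] = 1 only if text
    token i and diagram point j are related by the layout."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    scores = scores.masked_fill(layout_mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

B, T, P, d = 2, 5, 7, 16            # batch, text tokens, diagram points, dim
q, k, v = torch.randn(B, T, d), torch.randn(B, P, d), torch.randn(B, P, d)
mask = (torch.rand(B, T, P) > 0.5).long()
mask[..., 0] = 1                    # keep >= 1 visible point per text token
out = layout_aware_attention(q, k, v, mask)
print(out.shape)                    # (B, T, d): point-guided fused features
```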

Resfusion: Prior Residual Noise embedded Denoising Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2311.14900
  • repo_url: None
  • paper_authors: Shi Zhenning, Dong Changsheng, Pan Bin, Xie Xueshuo, He Along, Qu Qiaoying, Li Tao
  • for: Extends denoising diffusion probabilistic models to image segmentation, generating segmentation masks conditioned on the input image.
  • methods: Proposes Resfusion, a novel resnoise-diffusion process that gradually generates segmentation masks, or any type of target image, while seamlessly integrating state-of-the-art end-to-end models with denoising diffusion models.
  • results: Experiments show that Resfusion combines the strengths of both model families, further enhancing performance, and generalizes readily to broader image generation tasks and larger datasets.
    Abstract Recently, Denoising Diffusion Probabilistic Models have been widely used in image segmentation, by generating segmentation masks conditioned on the input image. However, previous works can not seamlessly integrate existing end-to-end models with denoising diffusion models. Existing research can only select acceleration steps based on experience rather than calculating them specifically. Moreover, most methods are limited to small models and small-scale datasets, unable to generalize to general datasets and a wider range of tasks. Therefore, we propose Resfusion with a novel resnoise-diffusion process, which gradually generates segmentation masks or any type of target image, seamlessly integrating state-of-the-art end-to-end models and denoising diffusion models. Resfusion bridges the discrepancy between the likelihood output and the ground truth output through a Markov process. Through the novel smooth equivalence transformation in resnoise-diffusion process, we determine the optimal acceleration step. Experimental results demonstrate that Resfusion combines the capabilities of existing end-to-end models and denoising diffusion models, further enhancing performance and achieving outstanding results. Moreover, Resfusion is not limited to segmentation tasks, it can easily generalize to any general tasks of image generation and exhibit strong competitiveness.

Aiming to Minimize Alcohol-Impaired Road Fatalities: Utilizing Fairness-Aware and Domain Knowledge-Infused Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2311.16180
  • repo_url: None
  • paper_authors: Tejas Venkateswaran, Sheikh Rabiul Islam, Md Golam Moula Mehedi Hasan, Mohiuddin Ahmed
  • for: Aims to reduce alcohol-impaired driving fatalities in the United States and to allocate enforcement resources more fairly and effectively.
  • methods: Uses a fairness-aware, domain knowledge-infused AI predictor to analyze DUI-related fatalities across demographic groups such as race, age, and income.
  • results: Analyzing DUI-related fatalities across regions yields informative insights that support a fairer and more efficient allocation of policing resources, with the potential to improve road safety.
    Abstract Approximately 30% of all traffic fatalities in the United States are attributed to alcohol-impaired driving. This means that, despite stringent laws against this offense in every state, the frequency of drunk driving accidents is alarming, resulting in approximately one person being killed every 45 minutes. The process of charging individuals with Driving Under the Influence (DUI) is intricate and can sometimes be subjective, involving multiple stages such as observing the vehicle in motion, interacting with the driver, and conducting Standardized Field Sobriety Tests (SFSTs). Biases have been observed through racial profiling, leading to some groups and geographical areas facing fewer DUI tests, resulting in many actual DUI incidents going undetected, ultimately leading to a higher number of fatalities. To tackle this issue, our research introduces an Artificial Intelligence-based predictor that is both fairness-aware and incorporates domain knowledge to analyze DUI-related fatalities in different geographic locations. Through this model, we gain intriguing insights into the interplay between various demographic groups, including age, race, and income. By utilizing the provided information to allocate policing resources in a more equitable and efficient manner, there is potential to reduce DUI-related fatalities and have a significant impact on road safety.

cs.CL - 2023-11-25

Relevance feedback strategies for recall-oriented neural information retrieval

  • paper_url: http://arxiv.org/abs/2311.15110
  • repo_url: None
  • paper_authors: Timo Kats, Peter van der Putten, Jan Scholtes
  • for: In review-oriented tasks (e.g., patent search, literature review, due diligence, risk assessment), avoiding false negatives matters more than avoiding false positives.
  • methods: Proposes a more recall-oriented approach that reduces review effort by iteratively re-ranking the relevance ranking based on user feedback (relevance feedback).
  • results: Compared to a baseline with no feedback, the method reduces review effort by between 17.85% and 59.04%, depending on the fixed recall target.
    Abstract In a number of information retrieval applications (e.g., patent search, literature review, due diligence, etc.), preventing false negatives is more important than preventing false positives. However, approaches designed to reduce review effort (like "technology assisted review") can create false negatives, since they are often based on active learning systems that exclude documents automatically based on user feedback. Therefore, this research proposes a more recall-oriented approach to reducing review effort. More specifically, through iteratively re-ranking the relevance rankings based on user feedback, which is also referred to as relevance feedback. In our proposed method, the relevance rankings are produced by a BERT-based dense-vector search and the relevance feedback is based on cumulatively summing the queried and selected embeddings. Our results show that this method can reduce review effort between 17.85% and 59.04%, compared to a baseline approach (of no feedback), given a fixed recall target
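
A minimal sketch of the re-ranking loop as the abstract describes it: the query embedding and the embeddings of the documents a user marked relevant are cumulatively summed, and the corpus is re-scored against the running sum (random vectors stand in for the BERT dense vectors).

```python
import numpy as np

def rerank_with_feedback(doc_emb, query_emb, selected_ids):
    """Cumulative-sum relevance feedback: re-rank by cosine similarity to
    the sum of the query and all user-selected document embeddings."""
    profile = query_emb + doc_emb[selected_ids].sum(axis=0)
    scores = doc_emb @ profile / (
        np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(profile) + 1e-9)
    return np.argsort(-scores)           # best-first document ordering

rng = np.random.default_rng(1)
docs = rng.standard_normal((1000, 768))  # stand-ins for BERT dense vectors
query = rng.standard_normal(768)
ranking = rerank_with_feedback(docs, query, selected_ids=[3, 17, 42])
print(ranking[:10])                      # next batch shown to the reviewer
```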

Solving the Right Problem is Key for Translational NLP: A Case Study in UMLS Vocabulary Insertion

  • paper_url: http://arxiv.org/abs/2311.15106
  • repo_url: https://github.com/osu-nlp-group/umls-vocabulary-insertion
  • paper_authors: Bernal Jimenez Gutierrez, Yuqing Mao, Vinh Nguyen, Kin Wah Fung, Yu Su, Olivier Bodenreider
  • for: Improves automated systems for biomedical language understanding, specifically the insertion of hundreds of thousands of new terms into the UMLS, a large-scale biomedical knowledge base.
  • methods: Introduces a new problem formulation and datasets that mirror the real-world UMLS vocabulary insertion task, strong baselines re-purposed from existing solutions, and a rule-enhanced biomedical language model.
  • results: The new formulation and the rule-enhanced model outperform all strong baselines and deliver measurable qualitative improvements for the editors who carry out the insertion task.
    Abstract As the immense opportunities enabled by large language models become more apparent, NLP systems will be increasingly expected to excel in real-world settings. However, in many instances, powerful models alone will not yield translational NLP solutions, especially if the formulated problem is not well aligned with the real-world task. In this work, we study the case of UMLS vocabulary insertion, an important real-world task in which hundreds of thousands of new terms, referred to as atoms, are added to the UMLS, one of the most comprehensive open-source biomedical knowledge bases. Previous work aimed to develop an automated NLP system to make this time-consuming, costly, and error-prone task more efficient. Nevertheless, practical progress in this direction has been difficult to achieve due to a problem formulation and evaluation gap between research output and the real-world task. In order to address this gap, we introduce a new formulation for UMLS vocabulary insertion which mirrors the real-world task, datasets which faithfully represent it and several strong baselines we developed through re-purposing existing solutions. Additionally, we propose an effective rule-enhanced biomedical language model which enables important new model behavior, outperforms all strong baselines and provides measurable qualitative improvements to editors who carry out the UVI task. We hope this case study provides insight into the considerable importance of problem formulation for the success of translational NLP solutions.

Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching

  • paper_url: http://arxiv.org/abs/2311.15077
  • repo_url: None
  • paper_authors: Tolúlopé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky
  • for: Improves speech recognition for low-resource languages with codeswitching.
  • methods: Finetunes self-supervised wav2vec 2.0 XLSR representations and augments them with n-gram language models trained from transcripts.
  • results: Reduces absolute word error rates by up to 20% compared with hybrid models trained from scratch on code-switched data, indicating that finetuning self-supervised representations is the better-performing and viable option when training data is limited.
    Abstract While many speakers of low-resource languages regularly code-switch between their languages and other regional languages or English, datasets of codeswitched speech are too small to train bespoke acoustic models from scratch or do language model rescoring. Here we propose finetuning self-supervised speech representations such as wav2vec 2.0 XLSR to recognize code-switched data. We find that finetuning self-supervised multilingual representations and augmenting them with n-gram language models trained from transcripts reduces absolute word error rates by up to 20% compared to baselines of hybrid models trained from scratch on code-switched data. Our findings suggest that in circumstances with limited training data finetuning self-supervised representations is a better performing and viable solution.

Automatically Finding and Categorizing Replication Studies

  • paper_url: http://arxiv.org/abs/2311.15055
  • repo_url: https://github.com/bopjesvla/replication
  • paper_authors: Bob de Ruiter
  • for: A first step towards a system that automatically finds replication studies for a given paper.
  • methods: Collects 334 replication studies and 344 replicated studies and identifies replication studies from their text content.
  • results: Replication studies can be identified from text content at a rate better than chance (AUROC = 0.886), and successful replications can be distinguished from failed ones better than chance (AUROC = 0.664).
    Abstract In many fields of experimental science, papers that failed to replicate continue to be cited as a result of the poor discoverability of replication studies. As a first step to creating a system that automatically finds replication studies for a given paper, 334 replication studies and 344 replicated studies were collected. Replication studies could be identified in the dataset based on text content at a higher rate than chance (AUROC = 0.886). Additionally, successful replication studies could be distinguished from failed replication studies at a higher rate than chance (AUROC = 0.664).
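
A hedged sketch of the classification setup, with TF-IDF and logistic regression standing in for whatever models the paper used and a tiny synthetic corpus standing in for the 678 collected studies; the toy texts are trivially separable, so the score below is inflated relative to the paper's 0.886.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Hypothetical stand-in abstracts: label 1 = replication study, 0 = not.
texts = ["we attempt to replicate the findings of a prior study",
         "a direct replication of experiment one with a larger sample",
         "we propose a novel method for relation extraction",
         "this paper introduces a new dataset for question answering"] * 25
labels = np.array([1, 1, 0, 0] * 25)

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, labels,
                          cv=5, method="predict_proba")[:, 1]
print("AUROC:", roc_auc_score(labels, probs))
```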

Detection of developmental language disorder in Cypriot Greek children using a machine learning neural network algorithm

  • paper_url: http://arxiv.org/abs/2311.15054
  • repo_url: None
  • paper_authors: Georgios P. Georgiou, Elena Theodorou
  • for: Develops an automated method for identifying developmental language disorder (DLD) using artificial intelligence, specifically a neural network machine learning algorithm.
  • methods: A neural network is trained on perception and production data from children with DLD and healthy controls; k-fold cross-validation is used, and accuracy, precision, recall, F1 score, and the ROC/AUC curve assess predictions on unseen data.
  • results: The model classifies children with DLD with high accuracy (all metrics between 0.92 and 0.98); variable importance analysis shows that children's language production skills influence model performance more than their perception skills.
    Abstract Children with developmental language disorder (DLD) encounter difficulties in acquiring various language structures. Early identification and intervention are crucial to prevent negative long-term outcomes impacting the academic, social, and emotional development of children. The study aims to develop an automated method for the identification of DLD using artificial intelligence, specifically a neural network machine learning algorithm. This protocol is applied for the first time in Cypriot Greek children, which is generally considered underresearched in the context of DLD. The neural network model was trained using perceptual and production data elicited from children with DLD and healthy controls. The k-fold technique was used to crossvalidate the algorithm. The performance of the model was evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC/AUC curve to assess its ability to make accurate predictions on a set of unseen data. The results demonstrated high classification values for all metrics (between 0.92 and 0.98), indicating the high accuracy of the neural model in classifying children with DLD. Additionally, the variable importance analysis revealed that the language production skills of children had a more significant impact on the performance of the model compared to perception skills. Neural networks represent powerful tools for detecting DLD, providing early and quick assessments of the disorder, and having the potential to improve clinical outcomes.
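
A sketch of the evaluation protocol described in the abstract: k-fold cross-validation of a small neural network, scored with accuracy, precision, recall, F1, and ROC/AUC. The features are synthetic stand-ins; the study's actual perception and production measures are not reproduced in this digest.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_dld = rng.normal(-0.8, 1.0, size=(40, 6))   # children with DLD (synthetic)
X_td = rng.normal(0.8, 1.0, size=(40, 6))     # typically developing controls
X = np.vstack([X_dld, X_td])
y = np.array([1] * 40 + [0] * 40)

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(model, X, y, cv=cv,
                     scoring=["accuracy", "precision", "recall", "f1",
                              "roc_auc"])
for m in ("accuracy", "precision", "recall", "f1", "roc_auc"):
    print(m, res[f"test_{m}"].mean().round(3))
```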

nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2311.15032
  • repo_url: None
  • paper_authors: Dhiman Goswami, Md Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri
  • for: Describes the team's entry to the shared task on sentiment analysis of Bangla social media posts at the first workshop on Bangla Language Processing (BLP).
  • methods: A transfer learning approach with data augmentation.
  • results: The best system achieves a micro F1 score of 0.71, ranking 12th among 30 participating teams.
    Abstract In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The main objective of this task is to identify the polarity of social media content using a Bangla dataset annotated with positive, neutral, and negative labels provided by the shared task organizers. Our best system for this task is a transfer learning approach with data augmentation which achieved a micro F1 score of 0.71. Our best system ranked 12th among 30 teams that participated in the competition.

nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla

  • paper_url: http://arxiv.org/abs/2311.15029
  • repo_url: None
  • paper_authors: Md Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri
  • for: Describes the team's entry to the BLP shared task on Violence Inciting Text Detection (VITD), which aims to identify and classify violent threats that provoke further unlawful violent acts.
  • methods: A two-step classification approach using back translation and multilinguality.
  • results: The best system achieves a macro F1 score of 0.74, ranking 6th out of 27 teams.
    Abstract In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify the violent threats, that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using back translation and multilinguality which ranked 6th out of 27 teams with a macro F1 score of 0.74.
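
Back translation augments training data by round-tripping sentences through a pivot language. The sketch below uses English-French MarianMT checkpoints simply because they are widely available; the shared-task system's actual translation models and language pairs are not specified in this abstract.

```python
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences,
                   fwd="Helsinki-NLP/opus-mt-en-fr",
                   bwd="Helsinki-NLP/opus-mt-fr-en"):
    """Translate to a pivot language and back to obtain paraphrases that can
    be added to the training set as augmented examples."""
    def translate(texts, name):
        tok = MarianTokenizer.from_pretrained(name)
        mt = MarianMTModel.from_pretrained(name)
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out = mt.generate(**batch, max_new_tokens=128)
        return tok.batch_decode(out, skip_special_tokens=True)
    return translate(translate(sentences, fwd), bwd)

print(back_translate(["The match sparked protests across the city."]))
```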

Offensive Language Identification in Transliterated and Code-Mixed Bangla

  • paper_url: http://arxiv.org/abs/2311.15023
  • repo_url: None
  • paper_authors: Md Nishat Raihan, Umma Hani Tanmoy, Anika Binte Islam, Kai North, Tharindu Ranasinghe, Antonios Anastasopoulos, Marcos Zampieri
  • for: Addresses offensive content identification in social media, particularly in multilingual societies where transliteration and code-mixing are common.
  • methods: Introduces TB-OLID, a manually annotated transliterated Bangla offensive language dataset, and trains and fine-tunes machine learning models on it, including English pre-trained transformers (fBERT and HateBERT).
  • results: English pre-trained transformer-based models achieve the best performance on TB-OLID.
    Abstract Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT achieve the best performance on this dataset.

Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains

  • paper_url: http://arxiv.org/abs/2311.14966
  • repo_url: None
  • paper_authors: Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, Carolin Lawrence
  • for: Examines how large language models perform in high-risk domains so that their behaviour there can be properly evaluated and improved.
  • methods: Evaluates instruction-tuned LLMs on question answering and summarization tasks across six NLP datasets in two high-risk domains: legal and medical.
  • results: Current LLMs exhibit limitations and factual inaccuracies in high-risk domains, underscoring the need for better capabilities, refined domain-specific metrics, and a more human-centric approach to safety and factual reliability.
    Abstract High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study delves into an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when evaluating in high-risk domains. This underscores the essential nature of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field toward the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.

Vector-Quantized Prompt Learning for Paraphrase Generation

  • paper_url: http://arxiv.org/abs/2311.14949
  • repo_url: None
  • paper_authors: Haotian Luo, Yixin Liu, Peidong Liu, Xianggen Liu
  • for: Improving expression diversity and semantic preservation in paraphrase generation.
  • methods: Controlling the generation of pre-trained models with instance-dependent, vector-quantized prompts.
  • results: New state-of-the-art results on three benchmark datasets: Quora, Wikianswers, and MSCOCO.
    Abstract Deep generative modeling of natural languages has achieved many successes, such as producing fluent sentences and translating from one language into another. However, the development of generative modeling techniques for paraphrase generation still lags behind largely due to the challenges in addressing the complex conflicts between expression diversity and semantic preservation. This paper proposes to generate diverse and high-quality paraphrases by exploiting the pre-trained models with instance-dependent prompts. To learn generalizable prompts, we assume that the number of abstract transforming patterns of paraphrase generation (governed by prompts) is finite and usually not large. Therefore, we present vector-quantized prompts as the cues to control the generation of pre-trained models. Extensive experiments demonstrate that the proposed method achieves new state-of-art results on three benchmark datasets, including Quora, Wikianswers, and MSCOCO. We will release all the code upon acceptance.
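The key mechanism is the discrete codebook over prompts: an instance encoding is snapped to its nearest codebook entry, and that quantized vector steers the pre-trained model. Here is a minimal PyTorch sketch under standard vector-quantization assumptions (nearest-neighbor lookup plus a straight-through gradient); it is not the paper's exact architecture.

```python
# Vector-quantized prompt sketch: a finite codebook of "transforming
# patterns", indexed by nearest-neighbor lookup on the instance encoding.
import torch
import torch.nn as nn

class VQPrompt(nn.Module):
    def __init__(self, num_codes=64, dim=768, prompt_len=4):
        super().__init__()
        # One codebook row per abstract transforming pattern (assumed sizes).
        self.codebook = nn.Embedding(num_codes, prompt_len * dim)
        self.prompt_len, self.dim = prompt_len, dim

    def forward(self, instance_enc):            # (batch, prompt_len * dim)
        # Nearest codebook entry per instance (L2 distance).
        dists = torch.cdist(instance_enc, self.codebook.weight)
        codes = dists.argmin(dim=-1)             # (batch,)
        quantized = self.codebook(codes)
        # Straight-through estimator so gradients reach the encoder.
        quantized = instance_enc + (quantized - instance_enc).detach()
        return quantized.view(-1, self.prompt_len, self.dim)

# Usage: prompts = VQPrompt()(encoder(x)); prepend to the model's input
# embeddings to condition generation on the selected pattern.
```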

Faster Minimum Bayes Risk Decoding with Confidence-based Pruning

  • paper_url: http://arxiv.org/abs/2311.14919
  • repo_url: None
  • paper_authors: Julius Cheng, Andreas Vlachos
  • for: Improving the efficiency of Minimum Bayes risk (MBR) decoding in conditional language generation, particularly neural machine translation.
  • methods: An algorithm for MBR decoding that gradually grows the number of samples used to estimate the utility while pruning hypotheses unlikely to have the highest utility, according to confidence estimates obtained with bootstrap sampling.
  • results: Experiments on three language pairs, with chrF++ and COMET as utility/evaluation metrics, show the method requires fewer samples and drastically reduces calls to the utility function compared to standard MBR, while being statistically indistinguishable in accuracy.
    Abstract Minimum Bayes risk (MBR) decoding outputs the hypothesis with the highest expected utility over the model distribution for some utility function. It has been shown to improve accuracy over beam search in conditional language generation problems and especially neural machine translation, in both human and automatic evaluations. However, the standard sampling-based algorithm for MBR is substantially more computationally expensive than beam search, requiring a large number of samples as well as a quadratic number of calls to the utility function, limiting its applicability. We describe an algorithm for MBR which gradually grows the number of samples used to estimate the utility while pruning hypotheses that are unlikely to have the highest utility according to confidence estimates obtained with bootstrap sampling. Our method requires fewer samples and drastically reduces the number of calls to the utility function compared to standard MBR while being statistically indistinguishable in terms of accuracy. We demonstrate the effectiveness of our approach in experiments on three language pairs, using chrF++ and COMET as utility/evaluation metrics.
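The core loop is easy to state: estimate each hypothesis's expected utility on a growing prefix of the samples, bootstrap over those samples to estimate how often each hypothesis would come out on top, and prune the ones that essentially never win. The NumPy sketch below assumes a utility(h, y) callable (e.g., chrF++ between a hypothesis and a pseudo-reference) and an illustrative sample-size schedule; the paper's algorithm may differ in details.

```python
# Schematic confidence-based pruning for MBR decoding.
import numpy as np

def mbr_with_pruning(hyps, samples, utility, schedule=(8, 16, 32, 64),
                     n_boot=100, alpha=0.05, rng=np.random.default_rng(0)):
    """schedule values are assumed to be <= len(samples)."""
    alive = list(range(len(hyps)))
    for n in schedule:
        # Utility of each surviving hypothesis against the first n samples.
        U = np.array([[utility(hyps[i], y) for y in samples[:n]] for i in alive])
        # Bootstrap over samples: how often is each hypothesis the best?
        wins = np.zeros(len(alive))
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)       # resample with replacement
            wins[U[:, idx].mean(axis=1).argmax()] += 1
        # Prune hypotheses that almost never win a bootstrap replicate,
        # always keeping the current leader.
        keep = wins / n_boot >= alpha
        keep[wins.argmax()] = True
        alive = [i for i, k in zip(alive, keep) if k]
        if len(alive) == 1:
            break
    # Final choice among survivors using all drawn samples.
    U = np.array([[utility(hyps[i], y) for y in samples] for i in alive])
    return hyps[alive[int(U.mean(axis=1).argmax())]]
```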

Code Search Debiasing: Improve Search Results beyond Overall Ranking Performance

  • paper_url: http://arxiv.org/abs/2311.14901
  • repo_url: None
  • paper_authors: Sheng Zhang, Hui Li, Yanlin Wang, Zhao Wei, Yong Xu, Juhong Wang, Rongrong Ji
  • for: Studying the bias of code search models in order to improve the user experience of code search.
  • methods: A general debiasing framework that calibrates search results via reranking; it can be easily integrated with existing code search engines and extended to handle newly discovered biases.
  • results: Experiments show the framework effectively reduces biases while also improving the overall ranking performance of code search.
    Abstract Code search engine is an essential tool in software development. Many code search methods have sprung up, focusing on the overall ranking performance of code search. In this paper, we study code search from another perspective by analyzing the bias of code search models. Biased code search engines provide poor user experience, even though they show promising overall performance. Due to different development conventions (e.g., prefer long queries or abbreviations), some programmers will find the engine useful, while others may find it hard to get desirable search results. To mitigate biases, we develop a general debiasing framework that employs reranking to calibrate search results. It can be easily plugged into existing engines and handle new code search biases discovered in the future. Experiments show that our framework can effectively reduce biases. Meanwhile, the overall ranking performance of code search gets improved after debiasing.
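Because the framework works purely by reranking, it can sit downstream of any engine: fetch the top-k candidates, then recompute the ranking with a calibration term for a known bias. The sketch below uses a hypothetical query-dependent bias feature and weight purely for illustration; the paper's concrete debiasing components are not specified here.

```python
# Schematic debiasing-by-reranking sketch. `base_score` is the engine's
# relevance model; `bias_feature` is a hypothetical function measuring a
# known bias (e.g., sensitivity to query length or abbreviations).
def debias_rerank(query, results, base_score, bias_feature, weight=0.1):
    """results: top-k candidate snippets from the existing search engine."""
    calibrated = []
    for snippet in results:
        score = base_score(query, snippet) - weight * bias_feature(query, snippet)
        calibrated.append((score, snippet))
    # Return snippets in calibrated-score order, best first.
    return [s for _, s in sorted(calibrated, key=lambda t: t[0], reverse=True)]
```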