cs.CV - 2023-07-05

Detecting Images Generated by Deep Diffusion Models using their Local Intrinsic Dimensionality

  • paper_url: http://arxiv.org/abs/2307.02347
  • repo_url: https://github.com/deepfake-study/deepfake-multiLID
  • paper_authors: Peter Lorenz, Ricard Durall, Janis Keuper
  • for: This work aims to detect images generated by diffusion models and to identify the corresponding generator network.
  • methods: The method uses the Local Intrinsic Dimensionality (LID), specifically the multi Local Intrinsic Dimensionality (multiLID), which was originally developed for detecting adversarial examples (see the sketch below).
  • results: Experiments show that the proposed multiLID approach is superior in many realistic settings, accurately detecting diffusion-generated images and correctly identifying the corresponding generator networks.
    Abstract Diffusion models recently have been successfully applied for the visual synthesis of strikingly realistic appearing images. This raises strong concerns about their potential for malicious purposes. In this paper, we propose using the lightweight multi Local Intrinsic Dimensionality (multiLID), which has been originally developed in context of the detection of adversarial examples, for the automatic detection of synthetic images and the identification of the according generator networks. In contrast to many existing detection approaches, which often only work for GAN-generated images, the proposed method provides close to perfect detection results in many realistic use cases. Extensive experiments on known and newly created datasets demonstrate that the proposed multiLID approach exhibits superiority in diffusion detection and model identification. Since the empirical evaluations of recent publications on the detection of generated images are often mainly focused on the "LSUN-Bedroom" dataset, we further establish a comprehensive benchmark for the detection of diffusion-generated images, including samples from several diffusion models with different image sizes.
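
The multiLID detector builds on k-nearest-neighbor estimates of local intrinsic dimensionality computed on network activations. Below is a minimal sketch of the classical maximum-likelihood LID estimate that such a pipeline aggregates across layers; the feature extractor, layer choice, and downstream classifier are assumptions, not taken from the paper.

```python
# Minimal sketch: maximum-likelihood LID estimate from k-NN distances
# (assumes query points are not contained in the reference set).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lid_mle(queries: np.ndarray, reference: np.ndarray, k: int = 20) -> np.ndarray:
    """LID(x) ~= -1 / mean_i log(r_i(x) / r_k(x)) over the k nearest reference points."""
    nn = NearestNeighbors(n_neighbors=k).fit(reference)
    dists, _ = nn.kneighbors(queries)              # (n_queries, k), ascending
    dists = np.maximum(dists, 1e-12)               # numerical safety
    log_ratio = np.log(dists / dists[:, -1:])      # last column becomes log(1) = 0
    return -1.0 / log_ratio[:, :-1].mean(axis=1)

# Usage idea: stack lid_mle() values computed on activations of several layers
# (the "multi" in multiLID) for real vs. generated images, then fit a small
# classifier such as a random forest on these features.
```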

GAFAR: Graph-Attention Feature-Augmentation for Registration - A Fast and Light-weight Point Set Registration Algorithm

  • paper_url: http://arxiv.org/abs/2307.02339
  • repo_url: https://github.com/mordecaimalignatius/gafar
  • paper_authors: Ludwig Mohr, Ismail Geles, Friedrich Fraundorfer
  • for: Rigid registration of point clouds is a fundamental problem in computer vision with many applications, from 3D scene reconstruction to geometry capture and robotics. Given a suitable initial registration, conventional methods such as ICP and its many variants provide adequate solutions. Without a suitable initialization, however, and in the presence of high outlier rates or small overlap, conventional methods suffer from high computational cost and low accuracy.
  • methods: We propose a novel fast and light-weight network architecture that uses an attention mechanism to augment point descriptors at inference time so that they optimally suit the registration task of the specific point clouds at hand. Employing a fully-connected graph both within and between point clouds lets the network reason about the mutual relationships and importance of points, making the approach robust to outliers and able to generalize to unseen 3D scan data (see the sketch below).
  • results: We evaluate the registration algorithm on different registration and generalization tasks and report runtime and resource consumption. Code and trained weights are available at https://github.com/mordecaimalignatius/GAFAR/.
    Abstract Rigid registration of point clouds is a fundamental problem in computer vision with many applications from 3D scene reconstruction to geometry capture and robotics. If a suitable initial registration is available, conventional methods like ICP and its many variants can provide adequate solutions. In absence of a suitable initialization and in the presence of a high outlier rate or in the case of small overlap though the task of rigid registration still presents great challenges. The advent of deep learning in computer vision has brought new drive to research on this topic, since it provides the possibility to learn expressive feature-representations and provide one-shot estimates instead of depending on time-consuming iterations of conventional robust methods. Yet, the rotation and permutation invariant nature of point clouds poses its own challenges to deep learning, resulting in loss of performance and low generalization capability due to sensitivity to outliers and characteristics of 3D scans not present during network training. In this work, we present a novel fast and light-weight network architecture using the attention mechanism to augment point descriptors at inference time to optimally suit the registration task of the specific point clouds it is presented with. Employing a fully-connected graph both within and between point clouds lets the network reason about the importance and reliability of points for registration, making our approach robust to outliers, low overlap and unseen data. We test the performance of our registration algorithm on different registration and generalization tasks and provide information on runtime and resource consumption. The code and trained weights are available at https://github.com/mordecaimalignatius/GAFAR/.
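
A rough PyTorch sketch of the attention-based descriptor augmentation described above: self-attention within each point cloud and cross-attention between them. Layer sizes, normalization choices, and the descriptor backbone are assumptions, not GAFAR's exact architecture.

```python
# Hedged sketch: intra-/inter-cloud attention over point descriptors.
import torch
import torch.nn as nn

class CrossAugment(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, feats_src, feats_tgt):
        # intra-cloud reasoning (fully-connected graph within each cloud)
        s = self.norm1(feats_src + self.self_attn(feats_src, feats_src, feats_src)[0])
        t = self.norm1(feats_tgt + self.self_attn(feats_tgt, feats_tgt, feats_tgt)[0])
        # inter-cloud reasoning (fully-connected graph between the clouds)
        s = self.norm2(s + self.cross_attn(s, t, t)[0])
        t = self.norm2(t + self.cross_attn(t, s, s)[0])
        return s, t  # augmented descriptors, ready for matching / pose estimation
```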

Dual Arbitrary Scale Super-Resolution for Multi-Contrast MRI

  • paper_url: http://arxiv.org/abs/2307.02334
  • repo_url: https://github.com/jmzhang79/dual-arbnet
  • paper_authors: Jiamiao Zhang, Yichen Chi, Jun Lyu, Wenming Yang, Yapeng Tian
  • for: This work aims to improve the quality of medical image reconstruction, in particular the accuracy and reliability of multi-contrast MRI super-resolution under the constraints of the imaging system.
  • methods: The study proposes a neural-network based multi-contrast super-resolution method built on implicit neural representations, named Dual-ArbNet, which reconstructs MR images at arbitrary, non-fixed scales.
  • results: Experiments show that the method reconstructs MR images well across different scale factors and outperforms existing approaches on two public MRI datasets.
    Abstract Limited by imaging systems, the reconstruction of Magnetic Resonance Imaging (MRI) images from partial measurement is essential to medical imaging research. Benefiting from the diverse and complementary information of multi-contrast MR images in different imaging modalities, multi-contrast Super-Resolution (SR) reconstruction is promising to yield SR images with higher quality. In the medical scenario, to fully visualize the lesion, radiologists are accustomed to zooming the MR images at arbitrary scales rather than using a fixed scale, as used by most MRI SR methods. In addition, existing multi-contrast MRI SR methods often require a fixed resolution for the reference image, which makes acquiring reference images difficult and imposes limitations on arbitrary scale SR tasks. To address these issues, we proposed an implicit neural representations based dual-arbitrary multi-contrast MRI super-resolution method, called Dual-ArbNet. First, we decouple the resolution of the target and reference images by a feature encoder, enabling the network to input target and reference images at arbitrary scales. Then, an implicit fusion decoder fuses the multi-contrast features and uses an Implicit Decoding Function~(IDF) to obtain the final MRI SR results. Furthermore, we introduce a curriculum learning strategy to train our network, which improves the generalization and performance of our Dual-ArbNet. Extensive experiments in two public MRI datasets demonstrate that our method outperforms state-of-the-art approaches under different scale factors and has great potential in clinical practice.

MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.02321
  • repo_url: None
  • paper_authors: Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
  • for: To improve the accuracy and scalability of ViT models by adaptively selecting the optimal token scale, thereby reducing computation.
  • methods: The paper proposes MSViT, a dynamic mixed-scale tokenization scheme that uses a conditional gating mechanism to select the optimal token scale for every image region, so that the number of tokens is determined dynamically per input (see the sketch below).
  • results: On image classification and segmentation, MSViT achieves a better accuracy-complexity trade-off, with little overhead in training time and resources.
    Abstract The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not necessitate as much compute as dense, cluttered areas. To address this issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is dynamically determined per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch-level. We validate MSViT on the tasks of classification and segmentation where it leads to improved accuracy-complexity trade-off.
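
A toy sketch of a per-region token-scale gate in the spirit of the conditional gating described above; the MLP size, the 0.5 threshold, and the straight-through trick are assumptions rather than MSViT's exact design (the paper's generalized batch-shaping loss is omitted here).

```python
# Hedged sketch: decide per coarse region whether to keep it as one token or
# split it into finer tokens.
import torch
import torch.nn as nn

class TokenScaleGate(nn.Module):
    def __init__(self, patch_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(patch_dim, hidden), nn.GELU(), nn.Linear(hidden, 1)
        )

    def forward(self, coarse_patches: torch.Tensor) -> torch.Tensor:
        """coarse_patches: (B, N, patch_dim) flattened coarse regions.
        Returns a (B, N) gate in {0, 1}: 1 = split into fine tokens, 0 = keep coarse."""
        logits = self.mlp(coarse_patches).squeeze(-1)
        probs = torch.sigmoid(logits)
        hard = (probs > 0.5).float()
        # straight-through estimator keeps the discrete decision differentiable
        return hard + probs - probs.detach()
```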

Multi-Scale Prototypical Transformer for Whole Slide Image Classification

  • paper_url: http://arxiv.org/abs/2307.02308
  • repo_url: None
  • paper_authors: Saisai Ding, Jun Wang, Juncheng Li, Jun Shi
  • for: This paper proposes a multiple-instance-learning based method for digital histopathology classification, aiming to improve the accuracy of whole slide image (WSI) classification in computational pathology.
  • methods: The method consists of a prototypical Transformer (PT) module and a multi-scale feature fusion module (MFFM). The PT module integrates prototypical learning into the Transformer architecture, substituting all instances with cluster prototypes that are then re-calibrated through the self-attention mechanism. The MFFM uses an MLP-Mixer to enhance information exchange between the prototypes of different scales.
  • results: Experiments on two public WSI datasets show that the proposed MSPT outperforms all compared methods, indicating its potential for application in computational pathology.
    Abstract Whole slide image (WSI) classification is an essential task in computational pathology. Despite the recent advances in multiple instance learning (MIL) for WSI classification, accurate classification of WSIs remains challenging due to the extreme imbalance between the positive and negative instances in bags, and the complicated pre-processing to fuse multi-scale information of WSI. To this end, we propose a novel multi-scale prototypical Transformer (MSPT) for WSI classification, which includes a prototypical Transformer (PT) module and a multi-scale feature fusion module (MFFM). The PT is developed to reduce redundant instances in bags by integrating prototypical learning into the Transformer architecture. It substitutes all instances with cluster prototypes, which are then re-calibrated through the self-attention mechanism of the Trans-former. Thereafter, an MFFM is proposed to fuse the clustered prototypes of different scales, which employs MLP-Mixer to enhance the information communication between prototypes. The experimental results on two public WSI datasets demonstrate that the proposed MSPT outperforms all the compared algorithms, suggesting its potential applications.

Focusing on what to decode and what to train: Efficient Training with HOI Split Decoders and Specific Target Guided DeNoising

  • paper_url: http://arxiv.org/abs/2307.02291
  • repo_url: None
  • paper_authors: Junwen Chen, Yingcheng Wang, Keiji Yanai
  • for: This work aims to improve the accuracy and training efficiency of one-stage transformer-based methods for Human-Object Interaction (HOI) detection.
  • methods: We propose a new one-stage framework (SOV) consisting of a subject decoder, an object decoder, and a verb decoder. We further propose a Specific Target Guided (STG) denoising strategy that uses learnable object and verb label embeddings to guide training and accelerate convergence.
  • results: Our method (SOV-STG) reaches higher accuracy than the state of the art in one-third of the training epochs. The code is available at https://github.com/cjw2021/SOV-STG.
    Abstract Recent one-stage transformer-based methods achieve notable gains in the Human-object Interaction Detection (HOI) task by leveraging the detection of DETR. However, the current methods redirect the detection target of the object decoder, and the box target is not explicitly separated from the query embeddings, which leads to long and hard training. Furthermore, matching the predicted HOI instances with the ground-truth is more challenging than object detection, simply adapting training strategies from the object detection makes the training more difficult. To clear the ambiguity between human and object detection and share the prediction burden, we propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a verb decoder. Moreover, we propose a novel Specific Target Guided (STG) DeNoising strategy, which leverages learnable object and verb label embeddings to guide the training and accelerates the training convergence. In addition, for the inference part, the label-specific information is directly fed into the decoders by initializing the query embeddings from the learnable label embeddings. Without additional features or prior language knowledge, our method (SOV-STG) achieves higher accuracy than the state-of-the-art method in one-third of training epochs. The code is available at \url{https://github.com/cjw2021/SOV-STG}.

Interactive Image Segmentation with Cross-Modality Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.02280
  • repo_url: https://github.com/lik1996/icmformer
  • paper_authors: Kun Li, George Vosselman, Michael Ying Yang
  • for: This paper proposes an interactive image segmentation method based on vision transformers, aiming to improve segmentation accuracy and stability.
  • methods: The method uses cross-modality vision transformers that exploit the mutual information between modalities (images and user clicks) to better guide the learning process.
  • results: Experiments show that the proposed method outperforms previous state-of-the-art models on several benchmarks, with higher accuracy and stability.
    Abstract Interactive image segmentation aims to segment the target from the background with the manual guidance, which takes as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved a great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to interactive segmentation task. However, the previous works neglect the relations between two modalities and directly mock the way of processing purely visual information with self-attentions. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploits mutual information to better guide the learning process. The experiments on several benchmarks show that the proposed method achieves superior performance in comparison to the previous state-of-the-art models. The stability of our method in term of avoiding failure cases shows its potential to be a practical annotation tool. The code and pretrained models will be released under https://github.com/lik1996/iCMFormer.

Convolutions Through the Lens of Tensor Networks

  • paper_url: http://arxiv.org/abs/2307.02275
  • repo_url: None
  • paper_authors: Felix Dangel
  • for: This paper provides a new perspective for analyzing and reasoning about convolutions in neural networks.
  • methods: The paper uses tensor networks (TNs) to reason about the tensor multiplications underlying convolutions by drawing diagrams and manipulating them to perform function transformations, sub-tensor access, and fusion (see the sketch below).
  • results: The paper derives the diagrams of various autodiff operations and popular approximations of second-order information, provides convolution-specific transformations based on the connectivity pattern that simplify and speed up diagram evaluation, and presents an efficient TN implementation that accelerates a recently proposed KFAC variant by up to 4.5x and enables hardware-efficient tensor dropout.
    Abstract Despite their simple intuition, convolutions are more tedious to analyze than dense layers, which complicates the generalization of theoretical and algorithmic ideas. We provide a new perspective onto convolutions through tensor networks (TNs) which allow reasoning about the underlying tensor multiplications by drawing diagrams, and manipulating them to perform function transformations, sub-tensor access, and fusion. We demonstrate this expressive power by deriving the diagrams of various autodiff operations and popular approximations of second-order information with full hyper-parameter support, batching, channel groups, and generalization to arbitrary convolution dimensions. Further, we provide convolution-specific transformations based on the connectivity pattern which allow to re-wire and simplify diagrams before evaluation. Finally, we probe computational performance, relying on established machinery for efficient TN contraction. Our TN implementation speeds up a recently-proposed KFAC variant up to 4.5x and enables new hardware-efficient tensor dropout for approximate backpropagation.
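
To make the "convolution as tensor multiplication" viewpoint concrete, here is a small self-contained sketch that writes a stride-1, unpadded 2D convolution as an explicit index contraction via unfold + einsum. This only illustrates the index bookkeeping that TN diagrams capture; it is not the paper's TN implementation.

```python
import torch
import torch.nn.functional as F

def conv2d_as_contraction(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """x: (B, C_in, H, W), w: (C_out, C_in, K, K); stride 1, no padding."""
    B, C_in, H, W = x.shape
    C_out, _, K, _ = w.shape
    patches = F.unfold(x, kernel_size=K)           # im2col: (B, C_in*K*K, L)
    patches = patches.view(B, C_in, K, K, -1)
    # contract the shared (C_in, K, K) indices, as a TN diagram would depict
    out = torch.einsum("ocij,bcijl->bol", w, patches)
    return out.view(B, C_out, H - K + 1, W - K + 1)

# sanity check against the built-in convolution
x, w = torch.randn(2, 3, 8, 8), torch.randn(5, 3, 3, 3)
ref = F.conv2d(x, w)
assert torch.allclose(conv2d_as_contraction(x, w), ref, atol=1e-5)
```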

Joint Hierarchical Priors and Adaptive Spatial Resolution for Efficient Neural Image Compression

  • paper_url: http://arxiv.org/abs/2307.02273
  • repo_url: None
  • paper_authors: Ahmed Ghorbel, Wassim Hamidouche, Luce Morin
  • for: To improve the performance of neural image compression (NIC) so that it matches or surpasses conventional codecs.
  • methods: The method builds on a Transformer-based transform-coding framework and adds a Transformer-based channel-wise auto-regressive prior model to better capture global and local context, together with a learnable scaling module to extract more compact latent codes.
  • results: Compared with the VVC reference encoder (VTM-18.0) and the neural codec SwinT-ChARM, the proposed framework achieves a significantly better trade-off between coding efficiency and decoder complexity. Model scaling studies and objective and subjective analyses further demonstrate the computational efficiency and performance advantages of the approach.
    Abstract Recently, the performance of neural image compression (NIC) has steadily improved thanks to the last line of study, reaching or outperforming state-of-the-art conventional codecs. Despite significant progress, current NIC methods still rely on ConvNet-based entropy coding, limited in modeling long-range dependencies due to their local connectivity and the increasing number of architectural biases and priors, resulting in complex underperforming models with high decoding latency. Motivated by the efficiency investigation of the Tranformer-based transform coding framework, namely SwinT-ChARM, we propose to enhance the latter, as first, with a more straightforward yet effective Tranformer-based channel-wise auto-regressive prior model, resulting in an absolute image compression transformer (ICT). Through the proposed ICT, we can capture both global and local contexts from the latent representations and better parameterize the distribution of the quantized latents. Further, we leverage a learnable scaling module with a sandwich ConvNeXt-based pre-/post-processor to accurately extract more compact latent codes while reconstructing higher-quality images. Extensive experimental results on benchmark datasets showed that the proposed framework significantly improves the trade-off between coding efficiency and decoder complexity over the versatile video coding (VVC) reference encoder (VTM-18.0) and the neural codec SwinT-ChARM. Moreover, we provide model scaling studies to verify the computational efficiency of our approach and conduct several objective and subjective analyses to bring to the fore the performance gap between the adaptive image compression transformer (AICT) and the neural codec SwinT-ChARM.

RanPAC: Random Projections and Pre-trained Models for Continual Learning

  • paper_url: http://arxiv.org/abs/2307.02251
  • repo_url: None
  • paper_authors: Mark D. McDonnell, Dong Gong, Amin Parveneh, Ehsan Abbasnejad, Anton van den Hengel
  • for: This paper proposes a concise and effective continual learning method that lets pre-trained models incrementally learn different tasks from a non-stationary data stream without forgetting what was learned before.
  • methods: The method uses training-free random projectors and class-prototype accumulation to sidestep forgetting. Specifically, a frozen Random Projection layer with nonlinear activation is injected between the pre-trained model's feature representations and the output head, capturing interactions between features with expanded dimensionality and improving the linear separability of class-prototype-based continual learning. The authors also show that class prototypes should be decorrelated to reduce the distribution disparity when using pre-trained representations (see the sketch below).
  • results: Compared to previous methods applied to pre-trained ViT-B/16 models, final error rates are reduced by 10% to 62% on seven class-incremental benchmark datasets without any rehearsal memory, showing that pre-trained models enable simple, effective, and fast continual learning while avoiding forgetting.
    Abstract Continual learning (CL) aims to incrementally learn different tasks (such as classification) in a non-stationary data stream without forgetting old ones. Most CL works focus on tackling catastrophic forgetting under a learning-from-scratch paradigm. However, with the increasing prominence of foundation models, pre-trained models equipped with informative representations have become available for various downstream requirements. Several CL methods based on pre-trained models have been explored, either utilizing pre-extracted features directly (which makes bridging distribution gaps challenging) or incorporating adaptors (which may be subject to forgetting). In this paper, we propose a concise and effective approach for CL with pre-trained models. Given that forgetting occurs during parameter updating, we contemplate an alternative approach that exploits training-free random projectors and class-prototype accumulation, which thus bypasses the issue. Specifically, we inject a frozen Random Projection layer with nonlinear activation between the pre-trained model's feature representations and output head, which captures interactions between features with expanded dimensionality, providing enhanced linear separability for class-prototype-based CL. We also demonstrate the importance of decorrelating the class-prototypes to reduce the distribution disparity when using pre-trained representations. These techniques prove to be effective and circumvent the problem of forgetting for both class- and domain-incremental continual learning. Compared to previous methods applied to pre-trained ViT-B/16 models, we reduce final error rates by between 10\% and 62\% on seven class-incremental benchmark datasets, despite not using any rehearsal memory. We conclude that the full potential of pre-trained models for simple, effective, and fast continual learning has not hitherto been fully tapped.
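
A compact NumPy sketch of the frozen-random-projection plus class-prototype idea: features from a frozen backbone pass through a fixed nonlinear random projection, per-class prototype sums and a Gram matrix are accumulated across tasks, and a ridge-regularized (decorrelating) readout is formed at prediction time. The dimensions, the ReLU, and the ridge value are illustrative assumptions, not RanPAC's exact recipe.

```python
import numpy as np

class RanPACHead:
    """Frozen random projection + class-prototype accumulation (toy version)."""
    def __init__(self, d_feat=768, d_proj=2000, n_classes=100, seed=0):
        rng = np.random.default_rng(seed)
        self.W_rand = rng.standard_normal((d_feat, d_proj))  # frozen, never trained
        self.G = np.zeros((d_proj, d_proj))                   # Gram matrix of projections
        self.C = np.zeros((d_proj, n_classes))                # class-prototype sums
        self.n_classes = n_classes

    def _project(self, feats):
        return np.maximum(feats @ self.W_rand, 0.0)           # random projection + ReLU

    def update(self, feats, labels):
        """Accumulate statistics for a new task; no gradient steps, hence no forgetting."""
        h = self._project(feats)
        self.G += h.T @ h
        self.C += h.T @ np.eye(self.n_classes)[labels]

    def predict(self, feats, ridge=1e3):
        # ridge-regularized, decorrelating readout from the accumulated statistics
        W_out = np.linalg.solve(self.G + ridge * np.eye(self.G.shape[0]), self.C)
        return (self._project(feats) @ W_out).argmax(axis=1)
```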

Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Good Instance Classifier is All You Need

  • paper_url: http://arxiv.org/abs/2307.02249
  • repo_url: None
  • paper_authors: Linhao Qu, Yingfan Ma, Xiaoyuan Luo, Manning Wang, Zhijian Song
  • for: This paper focuses on developing an instance-level multiple instance learning (MIL) framework for weakly supervised whole slide image classification.
  • methods: The proposed method combines contrastive learning and prototype learning to effectively learn instance feature representation and generate accurate pseudo labels.
  • results: The proposed method achieves powerful performance on four datasets through extensive experiments and visualizations.
    Abstract Weakly supervised whole slide image classification is usually formulated as a multiple instance learning (MIL) problem, where each slide is treated as a bag, and the patches cut out of it are treated as instances. Existing methods either train an instance classifier through pseudo-labeling or aggregate instance features into a bag feature through attention mechanisms and then train a bag classifier, where the attention scores can be used for instance-level classification. However, the pseudo instance labels constructed by the former usually contain a lot of noise, and the attention scores constructed by the latter are not accurate enough, both of which affect their performance. In this paper, we propose an instance-level MIL framework based on contrastive learning and prototype learning to effectively accomplish both instance classification and bag classification tasks. To this end, we propose an instance-level weakly supervised contrastive learning algorithm for the first time under the MIL setting to effectively learn instance feature representation. We also propose an accurate pseudo label generation method through prototype learning. We then develop a joint training strategy for weakly supervised contrastive learning, prototype learning, and instance classifier training. Extensive experiments and visualizations on four datasets demonstrate the powerful performance of our method. Codes will be available.

S3C: Self-Supervised Stochastic Classifiers for Few-Shot Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2307.02246
  • repo_url: https://github.com/jayatejak/s3c
  • paper_authors: Jayateja Kalla, Soma Biswas
  • for: This work proposes a method that can incrementally learn new classes without forgetting previously learned ones.
  • methods: The method uses a self-supervised stochastic classifier (S3C), which both mitigates over-fitting on the scarce samples of the new classes and reduces forgetting of previously learned classes during the incremental stages (see the sketch below).
  • results: Extensive evaluation on three benchmark datasets with multiple metrics shows the effectiveness of the method. Two additional realistic scenarios, where the number of annotated samples differs across new classes and where far fewer base classes are available, are also evaluated, and the method performs significantly better than the state of the art in these challenging settings.
    Abstract Few-shot class-incremental learning (FSCIL) aims to learn progressively about new classes with very few labeled samples, without forgetting the knowledge of already learnt classes. FSCIL suffers from two major challenges: (i) over-fitting on the new classes due to limited amount of data, (ii) catastrophically forgetting about the old classes due to unavailability of data from these classes in the incremental stages. In this work, we propose a self-supervised stochastic classifier (S3C) to counter both these challenges in FSCIL. The stochasticity of the classifier weights (or class prototypes) not only mitigates the adverse effect of absence of large number of samples of the new classes, but also the absence of samples from previously learnt classes during the incremental steps. This is complemented by the self-supervision component, which helps to learn features from the base classes which generalize well to unseen classes that are encountered in future, thus reducing catastrophic forgetting. Extensive evaluation on three benchmark datasets using multiple evaluation metrics show the effectiveness of the proposed framework. We also experiment on two additional realistic scenarios of FSCIL, namely where the number of annotated data available for each of the new classes can be different, and also where the number of base classes is much lesser, and show that the proposed S3C performs significantly better than the state-of-the-art for all these challenging scenarios.
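
A small PyTorch sketch of a stochastic (sampled-weight) cosine classifier head, the core idea behind S3C's stochastic classifier; the initialization, the cosine scale, and the omission of the self-supervision branch are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticClassifier(nn.Module):
    def __init__(self, feat_dim: int, n_classes: int, scale: float = 16.0):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_classes, feat_dim) * 0.02)
        self.log_sigma = nn.Parameter(torch.full((n_classes, feat_dim), -4.0))
        self.scale = scale

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        if self.training:
            # re-sample the class weights on every forward pass
            weights = self.mu + torch.randn_like(self.mu) * self.log_sigma.exp()
        else:
            weights = self.mu
        feats = F.normalize(feats, dim=-1)
        weights = F.normalize(weights, dim=-1)
        return self.scale * feats @ weights.t()   # cosine logits
```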

Set Learning for Accurate and Calibrated Models

  • paper_url: http://arxiv.org/abs/2307.02245
  • repo_url: https://github.com/lukasmut/oko
  • paper_authors: Lukas Muttenthaler, Robert A. Vandermeulen, Qiuyi Zhang, Thomas Unterthiner, Klaus-Robert Müller
  • for: This paper addresses model overconfidence and poor calibration, which are common in machine learning and difficult to account for with standard empirical risk minimization.
  • methods: The paper proposes odd-$k$-out learning (OKO), which minimizes the cross-entropy error for sets of examples rather than for single examples, allowing the model to capture correlations across data examples and achieve better accuracy and calibration, especially with limited training data and class imbalance (see the sketch below).
  • results: Backed by theoretical justification and extensive experimental analyses, OKO is shown to yield better calibration even when training with hard labels and without any additional calibration parameter tuning such as temperature scaling.
    Abstract Model overconfidence and poor calibration are common in machine learning and difficult to account for when applying standard empirical risk minimization. In this work, we propose a novel method to alleviate these problems that we call odd-$k$-out learning (OKO), which minimizes the cross-entropy error for sets rather than for single examples. This naturally allows the model to capture correlations across data examples and achieves both better accuracy and calibration, especially in limited training data and class-imbalanced regimes. Perhaps surprisingly, OKO often yields better calibration even when training with hard labels and dropping any additional calibration parameter tuning, such as temperature scaling. We provide theoretical justification, establishing that OKO naturally yields better calibration, and provide extensive experimental analyses that corroborate our theoretical findings. We emphasize that OKO is a general framework that can be easily adapted to many settings and the trained model can be applied to single examples at inference time, without introducing significant run-time overhead or architecture changes.
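
One plausible reading of the set-level objective described in the abstract, sketched in PyTorch: logits are aggregated over a small set of examples and a single cross-entropy is taken against the set's majority class. The aggregation by summation and the set construction in the usage note are assumptions; the paper defines the exact scheme.

```python
import torch
import torch.nn.functional as F

def oko_loss(model, set_inputs: torch.Tensor, majority_label: torch.Tensor) -> torch.Tensor:
    """set_inputs: (S, ...) one set of S examples; majority_label: scalar tensor class id."""
    logits = model(set_inputs)                       # (S, n_classes)
    set_logits = logits.sum(dim=0, keepdim=True)     # aggregate logits over the set
    return F.cross_entropy(set_logits, majority_label.view(1))

# Usage idea: build each set from several examples of one majority class plus a few
# "odd-one-out" examples from other classes, then average oko_loss over a batch of sets.
```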

Source Identification: A Self-Supervision Task for Dense Prediction

  • paper_url: http://arxiv.org/abs/2307.02238
  • repo_url: None
  • paper_authors: Shuai Chen, Subhradeep Kayal, Marleen de Bruijne
  • for: This paper proposes a new self-supervision task, source identification (SI), to improve representation learning for data-driven methods without labor-intensive annotations.
  • methods: Synthetic images are generated by fusing multiple source images, and the network is trained to reconstruct the original images given the fused input, which requires a proper understanding of the image content (see the sketch below).
  • results: Experiments on two medical image segmentation tasks show that the proposed SI task outperforms traditional self-supervision tasks for dense prediction, including inpainting, pixel shuffling, intensity shift, and super-resolution. Among variations that fuse images of different types, fusing images from different patients performs best.
    Abstract The paradigm of self-supervision focuses on representation learning from raw data without the need of labor-consuming annotations, which is the main bottleneck of current data-driven methods. Self-supervision tasks are often used to pre-train a neural network with a large amount of unlabeled data and extract generic features of the dataset. The learned model is likely to contain useful information which can be transferred to the downstream main task and improve performance compared to random parameter initialization. In this paper, we propose a new self-supervision task called source identification (SI), which is inspired by the classic blind source separation problem. Synthetic images are generated by fusing multiple source images and the network's task is to reconstruct the original images, given the fused images. A proper understanding of the image content is required to successfully solve the task. We validate our method on two medical image segmentation tasks: brain tumor segmentation and white matter hyperintensities segmentation. The results show that the proposed SI task outperforms traditional self-supervision tasks for dense predictions including inpainting, pixel shuffling, intensity shift, and super-resolution. Among variations of the SI task fusing images of different types, fusing images from different patients performs best.
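
A minimal sketch of the fuse-and-reconstruct pretext described above: two source images are mixed and a network is trained to recover both sources. The simple averaging fusion, the two-headed output, and the MSE loss are assumptions made for illustration; the paper's fusion strategy may differ.

```python
import torch
import torch.nn.functional as F

def si_pretext_step(model, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """img_a, img_b: (B, C, H, W) images from different patients; the model is
    assumed to output 2*C channels (a reconstruction of each source)."""
    fused = 0.5 * (img_a + img_b)                        # synthetic fused input
    recon_a, recon_b = model(fused).chunk(2, dim=1)      # split the two predictions
    return F.mse_loss(recon_a, img_a) + F.mse_loss(recon_b, img_b)
```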

Direct segmentation of brain white matter tracts in diffusion MRI

  • paper_url: http://arxiv.org/abs/2307.02223
  • repo_url: None
  • paper_authors: Hamza Kebiri, Ali Gholipour, Meritxell Bach Cuadra, Davood Karimi
  • for: This paper presents a new deep learning method that directly segments white matter tracts from diffusion MRI data, non-invasively and without the errors introduced by intermediate computations.
  • methods: The method uses a deep learning model to segment the white matter tracts directly from the diffusion MRI data, without intermediate computations such as tractography or estimation of the fiber orientation density.
  • results: Experiments show that the method reaches segmentation accuracy on par with state-of-the-art methods (mean Dice Similarity Coefficient of 0.826), performs far better on undersampled data, and can serve many important clinical and scientific applications.
    Abstract The brain white matter consists of a set of tracts that connect distinct regions of the brain. Segmentation of these tracts is often needed for clinical and research studies. Diffusion-weighted MRI offers unique contrast to delineate these tracts. However, existing segmentation methods rely on intermediate computations such as tractography or estimation of fiber orientation density. These intermediate computations, in turn, entail complex computations that can result in unnecessary errors. Moreover, these intermediate computations often require dense multi-shell measurements that are unavailable in many clinical and research applications. As a result, current methods suffer from low accuracy and poor generalizability. Here, we propose a new deep learning method that segments these tracts directly from the diffusion MRI data, thereby sidestepping the intermediate computation errors. Our experiments show that this method can achieve segmentation accuracy that is on par with the state of the art methods (mean Dice Similarity Coefficient of 0.826). Compared with the state of the art, our method offers far superior generalizability to undersampled data that are typical of clinical studies and to data obtained with different acquisition protocols. Moreover, we propose a new method for detecting inaccurate segmentations and show that it is more accurate than standard methods that are based on estimation uncertainty quantification. The new methods can serve many critically important clinical and scientific applications that require accurate and reliable non-invasive segmentation of white matter tracts.

Object Recognition System on a Tactile Device for Visually Impaired

  • paper_url: http://arxiv.org/abs/2307.02211
  • repo_url: None
  • paper_authors: Souayah Abdelkader, Mokretar Kraroubi Abderrahmene, Slimane Larabi
  • for: developing a device that facilitates communication between individuals with visual impairments and their surroundings
  • methods: using an object detection model implemented on a Raspberry Pi, which is connected to a specifically designed tactile device to provide auditory feedback of object identification
  • results: successful demonstration of the device’s effectiveness in scene understanding, including static or dynamic objects and screen contents such as TVs, computers, and mobile phones.
    Abstract People with visual impairments face numerous challenges when interacting with their environment. Our objective is to develop a device that facilitates communication between individuals with visual impairments and their surroundings. The device will convert visual information into auditory feedback, enabling users to understand their environment in a way that suits their sensory needs. Initially, an object detection model is selected from existing machine learning models based on its accuracy and cost considerations, including time and power consumption. The chosen model is then implemented on a Raspberry Pi, which is connected to a specifically designed tactile device. When the device is touched at a specific position, it provides an audio signal that communicates the identification of the object present in the scene at that corresponding position to the visually impaired individual. Conducted tests have demonstrated the effectiveness of this device in scene understanding, encompassing static or dynamic objects, as well as screen contents such as TVs, computers, and mobile phones.

Neural Fields for Interactive Visualization of Statistical Dependencies in 3D Simulation Ensembles

  • paper_url: http://arxiv.org/abs/2307.02203
  • repo_url: None
  • paper_authors: Fatemeh Farokhmanesh, Kevin Höhlein, Christoph Neuhauser, Rüdiger Westermann
  • for: This paper develops a neural network that can compactly represent and efficiently reconstruct the statistical dependencies between the values of physical variables at different spatial locations in large 3D simulation ensembles.
  • methods: Going beyond linear dependencies, the method uses mutual information as a measure of non-linear dependence, which the neural network learns to represent and reconstruct.
  • results: By circumventing compute-intensive statistical estimators at runtime, the method significantly reduces memory and computation requirements, enabling the estimator to be embedded in a GPU-accelerated direct volume renderer and all mutual dependencies for a selected domain point to be visualized interactively.
    Abstract We present the first neural network that has learned to compactly represent and can efficiently reconstruct the statistical dependencies between the values of physical variables at different spatial locations in large 3D simulation ensembles. Going beyond linear dependencies, we consider mutual information as a measure of non-linear dependence. We demonstrate learning and reconstruction with a large weather forecast ensemble comprising 1000 members, each storing multiple physical variables at a 250 x 352 x 20 simulation grid. By circumventing compute-intensive statistical estimators at runtime, we demonstrate significantly reduced memory and computation requirements for reconstructing the major dependence structures. This enables embedding the estimator into a GPU-accelerated direct volume renderer and interactively visualizing all mutual dependencies for a selected domain point.

Evaluating AI systems under uncertain ground truth: a case study in dermatology

  • paper_url: http://arxiv.org/abs/2307.02191
  • repo_url: None
  • paper_authors: David Stutz, Ali Taylan Cemgil, Abhijit Guha Roy, Tatiana Matejovicova, Melih Barsbey, Patricia Strachan, Mike Schaekermann, Jan Freyberg, Rajeev Rikhye, Beverly Freeman, Javier Perez Matos, Umesh Telang, Dale R. Webster, Yuan Liu, Greg S. Corrado, Yossi Matias, Pushmeet Kohli, Yun Liu, Arnaud Doucet, Alan Karthikesalingam
  • for: This paper examines how AI systems in health are evaluated and how uncertainty in the ground truth affects the evaluation results.
  • methods: The paper proposes a statistical-model-based measure of annotation uncertainty, together with uncertainty-adjusted metrics for performance evaluation (see the sketch below).
  • results: The study finds that the existing deterministic adjudication approach can severely over-estimate the true performance, whereas the proposed statistical models better account for annotation uncertainty and provide uncertainty estimates.
    Abstract For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating the future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and standard IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.
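
A toy NumPy sketch of the general idea of aggregating annotations statistically instead of deterministically: per-annotator votes are turned into a posterior "plausibility" distribution using a symmetric Dirichlet prior and reliability weights, and an uncertainty-adjusted top-1 accuracy averages the plausibility assigned to the model's prediction. This is only a stand-in for the paper's models (probabilistic IRN and the Plackett-Luce model); the prior, weights, and metric form are assumptions.

```python
import numpy as np

def plausibilities(votes, reliabilities, n_classes, prior=0.5):
    """votes: one class id per annotator; reliabilities: one weight per annotator."""
    counts = np.full(n_classes, prior)                 # symmetric Dirichlet pseudo-counts
    for cls, w in zip(votes, reliabilities):
        counts[cls] += w                               # reliability-weighted evidence
    return counts / counts.sum()                       # posterior mean over classes

def uncertainty_adjusted_top1(preds, votes_per_case, reliabilities, n_classes):
    """Average plausibility assigned to the model's top-1 prediction."""
    return float(np.mean([
        plausibilities(v, reliabilities, n_classes)[p]
        for p, v in zip(preds, votes_per_case)
    ]))
```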

DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2307.02159
  • repo_url: None
  • paper_authors: Jingwei Zhang, Han Shi, Jincheng Yu, Enze Xie, Zhenguo Li
  • for: The goal of this paper is to propose a unified theoretical framework for score-based diffusion models (SDMs) and generative adversarial networks (GANs).
  • methods: The paper (1) introduces a new SDE, the Discriminator Denoising Diffusion Flow (DiffFlow), that describes the learning dynamics of both SDMs and GANs, with a drift determined by weighted combinations of the scores of the real and generated data; and (2) shows that adjusting the relative weights yields a smooth transition between SDMs and GANs while keeping the marginal distribution of the SDE invariant and preserving maximum-likelihood training (see the sketch below).
  • results: The results include (1) the DiffFlow dynamics, with proofs of asymptotic optimality and a maximum-likelihood training scheme, and (2) several instantiations of DiffFlow that go beyond GANs and SDMs, offer exact likelihood inference, and have the potential for a flexible trade-off between high sample quality and fast sampling.
    Abstract Generative models can be categorized into two types: explicit generative models that define explicit density forms and allow exact likelihood inference, such as score-based diffusion models (SDMs) and normalizing flows; implicit generative models that directly learn a transformation from the prior to the data distribution, such as generative adversarial nets (GANs). While these two types of models have shown great success, they suffer from respective limitations that hinder them from achieving fast sampling and high sample quality simultaneously. In this paper, we propose a unified theoretic framework for SDMs and GANs. We shown that: i) the learning dynamics of both SDMs and GANs can be described as a novel SDE named Discriminator Denoising Diffusion Flow (DiffFlow) where the drift can be determined by some weighted combinations of scores of the real data and the generated data; ii) By adjusting the relative weights between different score terms, we can obtain a smooth transition between SDMs and GANs while the marginal distribution of the SDE remains invariant to the change of the weights; iii) we prove the asymptotic optimality and maximal likelihood training scheme of the DiffFlow dynamics; iv) under our unified theoretic framework, we introduce several instantiations of the DiffFLow that provide new algorithms beyond GANs and SDMs with exact likelihood inference and have potential to achieve flexible trade-off between high sample quality and fast sampling speed.
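
To make the "drift from weighted score combinations" idea concrete, here is a toy Euler-Maruyama integrator for an SDE whose drift mixes a real-data score and a generated-data score with a weight lam. The specific weighting, sign convention, noise scale, and score models are placeholders, not the paper's DiffFlow equations.

```python
import numpy as np

def euler_maruyama(x0, score_real, score_gen, lam=0.7, sigma=0.5,
                   n_steps=500, dt=1e-2, seed=0):
    """x0: (N, d) particles; score_real / score_gen: callables mapping x to a score."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        drift = lam * score_real(x) - (1.0 - lam) * score_gen(x)   # weighted score mix
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Toy usage: with score_real the score of N(0, I) and score_gen an estimate of the
# current particle distribution's score, lam interpolates between diffusion-like
# and GAN-like dynamics.
```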

Wasserstein Auto-Encoders of Merge Trees (and Persistence Diagrams)

  • paper_url: http://arxiv.org/abs/2307.02509
  • repo_url: None
  • paper_authors: Mahieu Pont, Julien Tierny
  • for: This paper presents a computational framework for the Wasserstein auto-encoding of merge trees (MT-WAE) in the Wasserstein metric space of merge trees.
  • methods: The method uses a neural network architecture that explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability; it also extends trivially to persistence diagrams.
  • results: Extensive experiments on public ensembles demonstrate the efficiency of the algorithms, with MT-WAE computations typically in the order of minutes on average. The framework is further applied to data compression (by concisely representing merge trees with their coordinates in the final layer) and to dimensionality reduction for the visual analysis of ensemble data.
    Abstract This paper presents a computational framework for the Wasserstein auto-encoding of merge trees (MT-WAE), a novel extension of the classical auto-encoder neural network architecture to the Wasserstein metric space of merge trees. In contrast to traditional auto-encoders which operate on vectorized data, our formulation explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability. Our novel neural network approach can be interpreted as a non-linear generalization of previous linear attempts [65] at merge tree encoding. It also trivially extends to persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our algorithms, with MT-WAE computations in the orders of minutes on average. We show the utility of our contributions in two applications adapted from previous work on merge tree encoding [65]. First, we apply MT-WAE to data reduction and reliably compress merge trees by concisely representing them with their coordinates in the final layer of our auto-encoder. Second, we document an application to dimensionality reduction, by exploiting the latent space of our auto-encoder, for the visual analysis of ensemble data. We illustrate the versatility of our framework by introducing two penalty terms, to help preserve in the latent space both the Wasserstein distances between merge trees, as well as their clusters. In both applications, quantitative experiments assess the relevance of our framework. Finally, we provide a C++ implementation that can be used for reproducibility.

Harmonizing Feature Attributions Across Deep Learning Architectures: Enhancing Interpretability and Consistency

  • paper_url: http://arxiv.org/abs/2307.02150
  • repo_url: None
  • paper_authors: Md Abdul Kadir, Gowtham Krishna Addluri, Daniel Sonntag
  • for: This study examines how well feature attribution methods generalize across architectures and how they can be harmonized to provide consistent local explanations across deep learning models.
  • methods: The study evaluates feature attributions across several deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers, trained on the same data distribution.
  • results: The findings show that feature attributions from models with distinct architectures can be harmonized, and that harmonized feature attribution methods improve the interpretability and consistency of local explanations and foster trust in machine learning applications, regardless of the underlying architecture.
    Abstract Ensuring the trustworthiness and interpretability of machine learning models is critical to their deployment in real-world applications. Feature attribution methods have gained significant attention, which provide local explanations of model predictions by attributing importance to individual input features. This study examines the generalization of feature attributions across various deep learning architectures, such as convolutional neural networks (CNNs) and vision transformers. We aim to assess the feasibility of utilizing a feature attribution method as a future detector and examine how these features can be harmonized across multiple models employing distinct architectures but trained on the same data distribution. By exploring this harmonization, we aim to develop a more coherent and optimistic understanding of feature attributions, enhancing the consistency of local explanations across diverse deep-learning models. Our findings highlight the potential for harmonized feature attribution methods to improve interpretability and foster trust in machine learning applications, regardless of the underlying architecture.

Compound Attention and Neighbor Matching Network for Multi-contrast MRI Super-resolution

  • paper_url: http://arxiv.org/abs/2307.02148
  • repo_url: None
  • paper_authors: Wenxuan Chen, Sirui Wu, Shuai Wang, Zhongsen Li, Jia Yang, Huifeng Yao, Xiaomeng Li, Xiaolei Song
  • for: This paper proposes a new multi-contrast MRI super-resolution (SR) method to improve the image quality of multi-contrast MRI.
  • methods: The method, CANM-Net, uses a compound self-attention mechanism that captures dependencies in both the spatial and channel dimensions, together with neighborhood-based feature-matching modules that match degraded features with adjacent reference features and fuse them, yielding higher SR performance.
  • results: Experiments on the IXI, fastMRI, and real-world scanning datasets show that CANM-Net outperforms state-of-the-art methods on SR tasks. It also maintains good performance when the reference and degraded images are imperfectly registered, indicating good potential for clinical applications.
    Abstract Multi-contrast magnetic resonance imaging (MRI) reflects information about human tissue from different perspectives and has many clinical applications. By utilizing the complementary information among different modalities, multi-contrast super-resolution (SR) of MRI can achieve better results than single-image super-resolution. However, existing methods of multi-contrast MRI SR have the following shortcomings that may limit their performance: First, existing methods either simply concatenate the reference and degraded features or exploit global feature-matching between them, which are unsuitable for multi-contrast MRI SR. Second, although many recent methods employ transformers to capture long-range dependencies in the spatial dimension, they neglect that self-attention in the channel dimension is also important for low-level vision tasks. To address these shortcomings, we proposed a novel network architecture with compound-attention and neighbor matching (CANM-Net) for multi-contrast MRI SR: The compound self-attention mechanism effectively captures the dependencies in both spatial and channel dimension; the neighborhood-based feature-matching modules are exploited to match degraded features and adjacent reference features and then fuse them to obtain the high-quality images. We conduct experiments of SR tasks on the IXI, fastMRI, and real-world scanning datasets. The CANM-Net outperforms state-of-the-art approaches in both retrospective and prospective experiments. Moreover, the robustness study in our work shows that the CANM-Net still achieves good performance when the reference and degraded images are imperfectly registered, proving good potential in clinical applications.

Prompting Diffusion Representations for Cross-Domain Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.02138
  • repo_url: None
  • paper_authors: Rui Gong, Martin Danelljan, Han Sun, Julio Delgado Mangas, Luc Van Gool
  • for: This work investigates how well diffusion-pretrained representations generalize to new domains and how they can be used for semantic segmentation.
  • methods: The study uses diffusion-pretrained models and investigates how the model's ability to take an input prompt can be exploited, introducing a scene prompt and a prompt randomization strategy to further improve cross-domain performance.
  • results: Diffusion pretraining is found to generalize remarkably well to new domains for semantic segmentation, outperforming both supervised and self-supervised backbones; a simple yet effective test-time domain adaptation method that learns a scene prompt on the target domain in an unsupervised manner improves performance further, setting a new state of the art on four synthetic-to-real and clear-to-adverse-weather benchmarks.
    Abstract While originally designed for image generation, diffusion models have recently shown to provide excellent pretrained feature representations for semantic segmentation. Intrigued by this result, we set out to explore how well diffusion-pretrained representations generalize to new domains, a crucial ability for any representation. We find that diffusion-pretraining achieves extraordinary domain generalization results for semantic segmentation, outperforming both supervised and self-supervised backbone networks. Motivated by this, we investigate how to utilize the model's unique ability of taking an input prompt, in order to further enhance its cross-domain performance. We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head. Moreover, we propose a simple but highly effective approach for test-time domain adaptation, based on learning a scene prompt on the target domain in an unsupervised manner. Extensive experiments conducted on four synthetic-to-real and clear-to-adverse weather benchmarks demonstrate the effectiveness of our approaches. Without resorting to any complex techniques, such as image translation, augmentation, or rare-class sampling, we set a new state-of-the-art on all benchmarks. Our implementation will be publicly available at \url{https://github.com/ETHRuiGong/PTDiffSeg}.
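
The unsupervised test-time adaptation step described above learns a scene prompt on unlabeled target images. One common way to realize such prompt tuning is to freeze the whole network and optimize only the prompt against an entropy objective, as sketched below; the `model(images, prompt)` interface, the entropy loss, and the optimizer settings are illustrative assumptions rather than the authors' exact procedure.

```python
import torch

def tune_scene_prompt(model, target_loader, prompt_dim=768, steps=100, lr=1e-3, device="cuda"):
    """Learn a scene prompt on unlabeled target-domain images by entropy minimization.

    `model(images, prompt)` is assumed to return per-pixel class logits of shape
    (B, num_classes, H, W); the backbone stays frozen and only `prompt` is updated.
    """
    prompt = torch.zeros(1, prompt_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([prompt], lr=lr)
    model.eval()  # segmentation model stays frozen

    data_iter = iter(target_loader)
    for _ in range(steps):
        try:
            images = next(data_iter)        # loader yields unlabeled target images
        except StopIteration:
            data_iter = iter(target_loader)
            images = next(data_iter)
        images = images.to(device)

        logits = model(images, prompt)                      # (B, C, H, W)
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

        optimizer.zero_grad()
        entropy.backward()                                  # gradients flow only into the prompt
        optimizer.step()
    return prompt.detach()
```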

How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model

  • paper_url: http://arxiv.org/abs/2307.02129
  • repo_url: https://github.com/pcsl-epfl/hierarchy-learning
  • paper_authors: Leonardo Petrini, Francesco Cagnetta, Umberto M. Tomasini, Alessandro Favero, Matthieu Wyart
  • for: This paper studies the ability of deep convolutional neural networks (CNNs) to learn complex tasks on high-dimensional data.
  • methods: Deep convolutional neural networks (CNNs) are analyzed on a synthetic high-dimensional classification task, the Random Hierarchy Model, designed to capture relevant aspects of real data.
  • results: Deep CNNs are found to counter the curse of dimensionality by building invariant representations, and their performance improves predictably as the amount of training data grows.
    Abstract Learning generic high-dimensional tasks is notably hard, as it requires a number of training data exponential in the dimension. Yet, deep convolutional neural networks (CNNs) have shown remarkable success in overcoming this challenge. A popular hypothesis is that learnable tasks are highly structured and that CNNs leverage this structure to build a low-dimensional representation of the data. However, little is known about how much training data they require, and how this number depends on the data structure. This paper answers this question for a simple classification task that seeks to capture relevant aspects of real data: the Random Hierarchy Model. In this model, each of the $n_c$ classes corresponds to $m$ synonymic compositions of high-level features, which are in turn composed of sub-features through an iterative process repeated $L$ times. We find that the number of training data $P^*$ required by deep CNNs to learn this task (i) grows asymptotically as $n_c m^L$, which is only polynomial in the input dimensionality; (ii) coincides with the training set size such that the representation of a trained network becomes invariant to exchanges of synonyms; (iii) corresponds to the number of data at which the correlations between low-level features and classes become detectable. Overall, our results indicate how deep CNNs can overcome the curse of dimensionality by building invariant representations, and provide an estimate of the number of data required to learn a task based on its hierarchically compositional structure.
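
As a quick worked example of the claimed sample complexity $P^* \sim n_c m^L$, which is only polynomial in the input dimensionality, the snippet below evaluates it for arbitrary illustrative values of $n_c$, $m$, and $L$ (not taken from the paper).

```python
# Sample-complexity estimate for the Random Hierarchy Model: P* ~ n_c * m**L.
n_c, m, L = 10, 4, 3        # number of classes, synonyms per feature, hierarchy depth (illustrative)
p_star = n_c * m ** L       # training-set size needed by a deep CNN, up to constants
print(p_star)               # 640 samples for this toy configuration
```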

MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets

  • paper_url: http://arxiv.org/abs/2307.02100
  • repo_url: https://github.com/siyi-wind/mdvit
  • paper_authors: Siyi Du, Nourhan Bayasi, Ghassan Hamarneh, Rafeef Garbi
  • for: This paper aims to improve the accuracy and efficiency of medical image segmentation (MIS), coping with the inherent complexity and variability of medical images.
  • methods: The paper proposes a multi-domain vision transformer (MDViT) with domain adapters that mitigate data hunger and counteract negative knowledge transfer (NKT) across domains. It also integrates a mutual knowledge distillation paradigm that transfers knowledge between a universal network spanning all domains and auxiliary domain-specific branches.
  • results: On 4 skin lesion segmentation datasets, MDViT outperforms state-of-the-art algorithms, with superior segmentation performance and a fixed model size at inference time, even as more domains are added.
    Abstract Despite its clinical utility, medical image segmentation (MIS) remains a daunting task due to images' inherent complexity and variability. Vision transformers (ViTs) have recently emerged as a promising solution to improve MIS; however, they require larger training datasets than convolutional neural networks. To overcome this obstacle, data-efficient ViTs were proposed, but they are typically trained using a single source of data, which overlooks the valuable knowledge that could be leveraged from other available datasets. Naively combining datasets from different domains can result in negative knowledge transfer (NKT), i.e., a decrease in model performance on some domains with non-negligible inter-domain heterogeneity. In this paper, we propose MDViT, the first multi-domain ViT that includes domain adapters to mitigate data-hunger and combat NKT by adaptively exploiting knowledge in multiple small data resources (domains). Further, to enhance representation learning across domains, we integrate a mutual knowledge distillation paradigm that transfers knowledge between a universal network (spanning all the domains) and auxiliary domain-specific branches. Experiments on 4 skin lesion segmentation datasets show that MDViT outperforms state-of-the-art algorithms, with superior segmentation performance and a fixed model size, at inference time, even as more domains are added. Our code is available at https://github.com/siyi-wind/MDViT.
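
The mutual knowledge distillation paradigm mentioned above transfers knowledge in both directions between the universal network and a domain-specific branch. A minimal sketch of such a bidirectional term is given below as a symmetric KL divergence between the two sets of per-pixel logits; the temperature and weighting are assumptions and this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(universal_logits: torch.Tensor,
                             branch_logits: torch.Tensor,
                             temperature: float = 2.0) -> torch.Tensor:
    """Symmetric KL between universal-network and domain-branch predictions.

    Both tensors are per-pixel segmentation logits of shape (B, num_classes, H, W).
    """
    t = temperature
    log_p_uni = F.log_softmax(universal_logits / t, dim=1)
    log_p_branch = F.log_softmax(branch_logits / t, dim=1)
    # Each direction detaches the "teacher" side so the two networks teach each other.
    kl_uni_to_branch = F.kl_div(log_p_branch, log_p_uni.detach().exp(), reduction="batchmean")
    kl_branch_to_uni = F.kl_div(log_p_uni, log_p_branch.detach().exp(), reduction="batchmean")
    return (t ** 2) * 0.5 * (kl_uni_to_branch + kl_branch_to_uni)
```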

Make A Long Image Short: Adaptive Token Length for Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.02092
  • repo_url: None
  • paper_authors: Qiqi Zhou, Yichen Zhu
  • for: Accelerate ViT inference, making the model faster and more efficient.
  • methods: An adaptive token-length strategy built on a Resizable-ViT (ReViT) model and a Token-Length Assigner (TLA), which selects the optimal token length for each image at inference time.
  • results: Experiments on image classification and action recognition with several ViT models show that the approach effectively reduces computational cost and improves inference speed.
    Abstract The vision transformer is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in better performance, it also leads to a considerable increase in computational cost. Motivated by the saying "A picture is worth a thousand words," we propose an innovative approach to accelerate the ViT model by shortening long images. Specifically, we introduce a method for adaptively assigning token length for each image at test time to accelerate inference speed. First, we train a Resizable-ViT (ReViT) model capable of processing input with diverse token lengths. Next, we extract token-length labels from ReViT that indicate the minimum number of tokens required to achieve accurate predictions. We then use these labels to train a lightweight Token-Length Assigner (TLA) that allocates the optimal token length for each image during inference. The TLA enables ReViT to process images with the minimum sufficient number of tokens, reducing token numbers in the ViT model and improving inference speed. Our approach is general and compatible with modern vision transformer architectures, significantly reducing computational costs. We verified the effectiveness of our methods on multiple representative ViT models on image classification and action recognition.
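
The inference pipeline described above has two parts: a Resizable-ViT that accepts several token lengths and a lightweight Token-Length Assigner that picks the smallest sufficient length per image. The sketch below shows only that control flow; the candidate lengths, the assigner architecture, and the `revit(img, num_tokens=...)` interface are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

CANDIDATE_TOKEN_LENGTHS = [196, 100, 49]  # e.g. 14x14, 10x10, 7x7 patches (assumed)

class TokenLengthAssigner(nn.Module):
    """Tiny classifier that predicts which candidate token length suffices for an image."""
    def __init__(self, num_choices: int = len(CANDIDATE_TOKEN_LENGTHS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, num_choices),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.backbone(images)          # logits over candidate token lengths

@torch.no_grad()
def adaptive_inference(revit, tla, images):
    """Route each image through the Resizable-ViT at its predicted token length."""
    choice = tla(images).argmax(dim=1)        # (B,) index into CANDIDATE_TOKEN_LENGTHS
    outputs = []
    for img, c in zip(images, choice):
        num_tokens = CANDIDATE_TOKEN_LENGTHS[int(c)]
        outputs.append(revit(img.unsqueeze(0), num_tokens=num_tokens))
    return torch.cat(outputs, dim=0)
```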

Interactive Conversational Head Generation

  • paper_url: http://arxiv.org/abs/2307.02090
  • repo_url: None
  • paper_authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao
  • for: This work introduces a benchmark for synthesizing the behaviors of a single interlocutor in face-to-face conversation, to advance the automatic generation of conversational agents capable of multi-turn interaction.
  • methods: Two new datasets are constructed, ViCo for sentence-level independent talking- and listening-head generation and ViCo-X for multi-turn conversational scenarios, and three new tasks are defined: responsive listening head generation, expressive talking head generation, and conversational head generation.
  • results: The proposed baselines generate responsive and vivid agents that can complete a full conversation with a real person.
    Abstract We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors which can participate in long and multi-turn conversations is vital and offers benefits for various applications, including digital humans, virtual agents, and social robots. Existing research primarily focuses on talking head generation (one-way interaction), which hinders the ability to create a digital human for conversational (two-way) interaction due to the absence of listening and interaction parts. In this work, we construct two datasets to address this issue, ``ViCo'' for independent talking and listening head generation tasks at the sentence level, and ``ViCo-X'', for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting the interaction modeling during the face-to-face conversation: 1) responsive listening head generation, making listeners respond actively to the speaker with non-verbal signals, 2) expressive talking head generation, guiding speakers to be aware of listeners' behaviors, and 3) conversational head generation, integrating the talking/listening ability in one interlocutor. Along with the datasets, we also propose corresponding baseline solutions to the three aforementioned tasks. Experimental results show that our baseline method could generate responsive and vivid agents that can collaborate with a real person to carry out the whole conversation. Project page: https://vico.solutions/.

Adversarial Attacks on Image Classification Models: FGSM and Patch Attacks and their Impact

  • paper_url: http://arxiv.org/abs/2307.02055
  • repo_url: None
  • paper_authors: Jaydip Sen, Subhasis Dasgupta
  • for: This chapter introduces the problem of adversarial attacks on image classification models.
  • methods: Two well-known adversarial attacks are studied: the fast gradient sign method (FGSM) and the adversarial patch attack.
  • results: Attacks on three powerful pre-trained classifiers (ResNet-34, GoogleNet, and DenseNet-161) show that classification accuracy drops substantially under attack.
    Abstract This chapter introduces the concept of adversarial attacks on image classification models built on convolutional neural networks (CNN). CNNs are very popular deep-learning models which are used in image classification tasks. However, very powerful and pre-trained CNN models working very accurately on image datasets for image classification tasks may perform disastrously when the networks are under adversarial attacks. In this work, two very well-known adversarial attacks are discussed and their impact on the performance of image classifiers is analyzed. These two adversarial attacks are the fast gradient sign method (FGSM) and adversarial patch attack. These attacks are launched on three powerful pre-trained image classifier architectures, ResNet-34, GoogleNet, and DenseNet-161. The classification accuracy of the models in the absence and presence of the two attacks are computed on images from the publicly accessible ImageNet dataset. The results are analyzed to evaluate the impact of the attacks on the image classification task.
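
For reference, the fast gradient sign method discussed above perturbs the input along the sign of the loss gradient, $x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L}(x, y))$. A minimal PyTorch sketch (not the chapter's own code) follows; the $\epsilon$ value and the assumption that inputs lie in $[0, 1]$ are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=8 / 255):
    """One-step FGSM: perturb inputs along the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()   # move in the direction that increases the loss
    return adv.clamp(0, 1).detach()               # assumes inputs live in [0, 1]

# Usage with a pre-trained classifier, e.g. ResNet-34 from torchvision:
# model = torchvision.models.resnet34(weights="IMAGENET1K_V1").eval()
# adv_images = fgsm_attack(model, images, labels)
```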

Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing

  • paper_url: http://arxiv.org/abs/2307.02041
  • repo_url: None
  • paper_authors: Jie Fu, Junyu Gao, Changsheng Xu
  • for: This work addresses the imbalanced feature learning between different sensory modalities in weakly-supervised audio-visual video parsing.
  • methods: A dynamic gradient modulation (DGM) mechanism with a novel and effective metric function for measuring the imbalance between audio and visual feature learning, plus a modality-separated decision unit (MSDU) that measures this imbalance more precisely.
  • results: Experiments show that the proposed method effectively alleviates imbalanced feature learning in audio-visual video parsing.
    Abstract Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances as well as identify the corresponding event categories with only video-level category labels for training. Most previous methods pay much attention to refining the supervision for each modality or extracting fruitful cross-modality information for more reliable feature learning. None of them have noticed the imbalanced feature learning between different modalities in the task. In this paper, to balance the feature learning processes of different modalities, a dynamic gradient modulation (DGM) mechanism is explored, where a novel and effective metric function is designed to measure the imbalanced feature learning between audio and visual modalities. Furthermore, principled analysis indicates that the multimodal confusing calculation will hamper the precise measurement of multimodal imbalanced feature learning, which further weakens the effectiveness of our DGM mechanism. To cope with this issue, a modality-separated decision unit (MSDU) is designed for more precise measurement of imbalanced feature learning between audio and visual modalities. Comprehensive experiments are conducted on public benchmarks and the corresponding experimental results demonstrate the effectiveness of our proposed method.
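
The paper's exact metric function is not reproduced here, but the general idea behind dynamic gradient modulation is to estimate how strongly each modality is currently learning and damp the gradients of the dominant branch. The toy sketch below scales encoder gradients by a ratio of per-modality losses after the backward pass; this particular ratio and the clamping are assumptions made purely for illustration.

```python
import torch

def modulate_gradients(audio_encoder, visual_encoder, loss_audio, loss_visual):
    """Damp the gradients of whichever modality is currently learning faster.

    `loss_audio` and `loss_visual` are scalar tensors; a lower per-modality loss is
    taken as a sign of a dominant branch, so that branch's gradients are scaled down.
    Call this after loss.backward() and before optimizer.step().
    """
    ratio = (loss_audio / (loss_visual + 1e-8)).detach()
    coeff_audio = torch.clamp(ratio, max=1.0)         # < 1 when audio loss is smaller (audio dominates)
    coeff_visual = torch.clamp(1.0 / ratio, max=1.0)  # < 1 when visual loss is smaller (visual dominates)
    for p in audio_encoder.parameters():
        if p.grad is not None:
            p.grad.mul_(coeff_audio)
    for p in visual_encoder.parameters():
        if p.grad is not None:
            p.grad.mul_(coeff_visual)
```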

NMS Threshold matters for Ego4D Moment Queries – 2nd place solution to the Ego4D Moment Queries Challenge 2023

  • paper_url: http://arxiv.org/abs/2307.02025
  • repo_url: https://github.com/happyharrycn/actionformer_release
  • paper_authors: Lin Sui, Fangzhou Mu, Yin Li
  • for: This paper is a submission to the Ego4D Moment Queries Challenge 2023.
  • methods: The paper extends the latest ActionFormer method with an improved ground-truth assignment strategy during training and a refined version of SoftNMS at inference time.
  • results: The paper ranks 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from the 2023 challenge.
    Abstract This report describes our submission to the Ego4D Moment Queries Challenge 2023. Our submission extends ActionFormer, a latest method for temporal action localization. Our extension combines an improved ground-truth assignment strategy during training and a refined version of SoftNMS at inference time. Our solution is ranked 2nd on the public leaderboard with 26.62% average mAP and 45.69% Recall@1x at tIoU=0.5 on the test set, significantly outperforming the strong baseline from 2023 challenge. Our code is available at https://github.com/happyharrycn/actionformer_release.
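
Since the submission hinges on a refined SoftNMS at inference time, a plain Gaussian Soft-NMS routine over temporal segments is sketched below for context; the decay function, sigma, and score threshold follow the standard Soft-NMS recipe rather than the authors' refined variant.

```python
import torch

def soft_nms(segments: torch.Tensor, scores: torch.Tensor,
             sigma: float = 0.5, score_thresh: float = 1e-3):
    """Gaussian Soft-NMS over 1-D temporal segments (start, end), as used in action localization."""
    keep_segments, keep_scores = [], []
    segments, scores = segments.clone(), scores.clone()
    while scores.numel() > 0:
        top = scores.argmax()
        keep_segments.append(segments[top])
        keep_scores.append(scores[top])
        # Temporal IoU between the selected segment and all remaining segments.
        inter = (torch.min(segments[:, 1], segments[top, 1]) -
                 torch.max(segments[:, 0], segments[top, 0])).clamp(min=0)
        union = (segments[:, 1] - segments[:, 0]) + (segments[top, 1] - segments[top, 0]) - inter
        iou = inter / union.clamp(min=1e-8)
        scores = scores * torch.exp(-iou ** 2 / sigma)   # decay overlapping scores instead of removing them
        mask = torch.ones_like(scores, dtype=torch.bool)
        mask[top] = False
        mask &= scores > score_thresh
        segments, scores = segments[mask], scores[mask]
    return torch.stack(keep_segments), torch.stack(keep_scores)
```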

ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking

  • paper_url: http://arxiv.org/abs/2307.02508
  • repo_url: None
  • paper_authors: Yuanyou Xu, Jiahao Li, Zongxin Yang, Yi Yang, Yueting Zhuang
  • for: This work converts bounding boxes into masks in the reference frames, using the Segment Anything Model (SAM) and Alpha-Refine for mask generation and refinement, turning the tracking task into video object segmentation.
  • methods: The Associating Objects with Transformers (AOT) framework is adopted; bounding boxes are converted to masks with SAM and Alpha-Refine, and the masks are then propagated to the current frame.
  • results: The method achieved 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
    Abstract The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object tracking and segmentation. In this study, we convert the bounding boxes to masks in reference frames with the help of the Segment Anything Model (SAM) and Alpha-Refine, and then propagate the masks to the current frame, transforming the task from Video Object Tracking (VOT) to video object segmentation (VOS). Furthermore, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. MSDeAOT efficiently propagates object masks from previous frames to the current frame using two feature scales of 16 and 8. As a testament to the effectiveness of our design, we achieved the 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
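
Converting a tracker's bounding box into a mask with SAM, as described above, follows the promptable-segmentation interface of the segment-anything package. A minimal sketch is shown below; the checkpoint path and model size are placeholders, and the tracking and Alpha-Refine stages are omitted.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def box_to_mask(image_rgb: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    """Turn a (x1, y1, x2, y2) box prompt into a binary mask for the reference frame."""
    predictor.set_image(image_rgb)                       # image as HxWx3 uint8, RGB
    masks, _, _ = predictor.predict(
        box=box_xyxy[None, :],                           # a single box prompt
        multimask_output=False,
    )
    return masks[0]                                      # HxW boolean mask
```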

ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.02010
  • repo_url: None
  • paper_authors: Jiahao Li, Yuanyou Xu, Zongxin Yang, Yi Yang, Yueting Zhuang
  • for: This paper targets video object segmentation, aiming to improve segmentation accuracy and efficiency.
  • methods: MSDeAOT, a variant of AOT that incorporates transformers at multiple feature scales, uses the hierarchical Gated Propagation Module (GPM) to propagate object masks from previous frames to the current frame, and applies GPM at a finer feature scale to better detect and track small objects.
  • results: The approach achieved the top-ranking position in the EPIC-KITCHENS VISOR Semi-supervised Video Object Segmentation Challenge.
    Abstract The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object segmentation. In this study, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. Leveraging the hierarchical Gated Propagation Module (GPM), MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16. Additionally, we employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects. Through the implementation of test-time augmentations and model ensemble techniques, we achieve the top-ranking position in the EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge.

Remote Sensing Image Change Detection with Graph Interaction

  • paper_url: http://arxiv.org/abs/2307.02007
  • repo_url: https://github.com/jackliu-97/bginet
  • paper_authors: Chenglong Liu
  • for: This work proposes a graph-based method for remote sensing change detection, improving detection accuracy and efficiency.
  • methods: The method leverages non-local operations and maps backbone features into a graph structure space to enable information interaction between the two temporal images, improving change detection accuracy.
  • results: On the GZ CD dataset, the method achieves higher accuracy and better computational efficiency than existing approaches, improving overall effectiveness.
    Abstract Modern remote sensing image change detection has witnessed substantial advancements by harnessing the potent feature extraction capabilities of CNNs and Transformers. Yet, prevailing change detection techniques consistently prioritize extracting semantic features related to significant alterations, overlooking the viability of directly interacting with bitemporal image features. In this letter, we propose a bitemporal image graph interaction network for remote sensing change detection, namely BGINet-CD. More specifically, by leveraging the concept of non-local operations and mapping the features obtained from the backbone network to the graph structure space, we propose a unified self-focus mechanism for bitemporal images. This approach enhances the information coupling between the two temporal images while effectively suppressing task-irrelevant interference. Based on a streamlined backbone architecture, namely ResNet18, our model demonstrates superior performance compared to other state-of-the-art (SOTA) methods on the GZ CD dataset. Moreover, the model exhibits an enhanced trade-off between accuracy and computational efficiency, further improving its overall effectiveness.

Multi-Modal Prototypes for Open-Set Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.02003
  • repo_url: None
  • paper_authors: Yuhuan Yang, Chaofan Ma, Chen Ju, Ya Zhang, Yanfeng Wang
  • for: This work addresses generalization to novel categories in semantic segmentation, enabled at inference time by visual examples or textual descriptions.
  • methods: A unified setting termed open-set semantic segmentation (O3S) learns seen and unseen semantics from both visual and textual information; the pipeline performs single-modal self-enhancement and aggregation followed by multi-modal complementary fusion.
  • results: Extensive experiments on the PASCAL and COCO datasets evaluate the framework, achieving state-of-the-art results on Pascal-Animals while training only on coarse-grained data; thorough ablation studies analyze each component quantitatively and qualitatively.
    Abstract In semantic segmentation, adapting a visual system to novel object categories at inference time has always been both valuable and challenging. To enable such generalization, existing methods rely on either providing several support examples as visual cues or class names as textual cues. Though the development of each line is relatively promising, these two lines have been studied in isolation, neglecting the complementary nature of low-level visual and high-level language information. In this paper, we define a unified setting termed open-set semantic segmentation (O3S), which aims to learn seen and unseen semantics from both visual examples and textual names. Our pipeline extracts multi-modal prototypes for the segmentation task, by first single-modal self-enhancement and aggregation, then multi-modal complementary fusion. To be specific, we aggregate visual features into several tokens as visual prototypes, and enhance the class name with detailed descriptions for textual prototype generation. The two modalities are then fused to generate multi-modal prototypes for final segmentation. On both the PASCAL and COCO datasets, we conduct extensive experiments to evaluate the framework's effectiveness. State-of-the-art results are achieved even on more detailed part-segmentation, Pascal-Animals, by only training on coarse-grained datasets. Thorough ablation studies are performed to dissect each component, both quantitatively and qualitatively.

Distilling Missing Modality Knowledge from Ultrasound for Endometriosis Diagnosis with Magnetic Resonance Images

  • paper_url: http://arxiv.org/abs/2307.02000
  • repo_url: None
  • paper_authors: Yuan Zhang, Hu Wang, David Butler, Minh-Son To, Jodie Avery, M Louise Hull, Gustavo Carneiro
  • for: This paper aims to improve the accuracy of detecting pouch of Douglas (POD) obliteration from magnetic resonance imaging (MRI) scans, a common diagnostic challenge in endometriosis diagnosis.
  • methods: The proposed method uses a knowledge distillation training algorithm that leverages detection results from unpaired transvaginal gynecological ultrasound (TVUS) data to improve POD obliteration detection from MRI. The algorithm pre-trains a teacher model to detect POD obliteration from TVUS data and a student model with a 3D masked auto-encoder using unlabelled pelvic 3D MRI volumes.
  • results: Experimental results on an endometriosis dataset containing TVUS and MRI data demonstrate the effectiveness of the proposed method in improving POD detection accuracy from MRI.
    Abstract Endometriosis is a common chronic gynecological disorder that has many characteristics, including the pouch of Douglas (POD) obliteration, which can be diagnosed using Transvaginal gynecological ultrasound (TVUS) scans and magnetic resonance imaging (MRI). TVUS and MRI are complementary non-invasive endometriosis diagnosis imaging techniques, but patients are usually not scanned using both modalities and, it is generally more challenging to detect POD obliteration from MRI than TVUS. To mitigate this classification imbalance, we propose in this paper a knowledge distillation training algorithm to improve the POD obliteration detection from MRI by leveraging the detection results from unpaired TVUS data. More specifically, our algorithm pre-trains a teacher model to detect POD obliteration from TVUS data, and it also pre-trains a student model with 3D masked auto-encoder using a large amount of unlabelled pelvic 3D MRI volumes. Next, we distill the knowledge from the teacher TVUS POD obliteration detector to train the student MRI model by minimizing a regression loss that approximates the output of the student to the teacher using unpaired TVUS and MRI data. Experimental results on our endometriosis dataset containing TVUS and MRI data demonstrate the effectiveness of our method to improve the POD detection accuracy from MRI.
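
At a high level, the distillation described above regresses the student's MRI-based prediction toward the TVUS teacher's output even though the two modalities are unpaired. The sketch below shows one plausible form of such a training step; matching the sorted per-batch probabilities is an assumption made for illustration and is not claimed to be the authors' exact regression loss.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_mri, teacher_tvus, mri_volumes, tvus_images, optimizer):
    """One knowledge-distillation step: pull student (MRI) predictions toward teacher (TVUS) outputs.

    The teacher is frozen and pre-trained on TVUS. Because the two batches are unpaired,
    this toy version matches the sorted POD-obliteration probabilities of the two batches
    (assumes equal batch sizes); it is an illustrative stand-in for the paper's loss.
    """
    with torch.no_grad():
        teacher_prob = torch.sigmoid(teacher_tvus(tvus_images)).flatten()
    student_prob = torch.sigmoid(student_mri(mri_volumes)).flatten()

    loss = F.mse_loss(student_prob.sort().values, teacher_prob.sort().values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```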

Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities

  • paper_url: http://arxiv.org/abs/2307.01998
  • repo_url: https://github.com/sldgroup/survey-zero-shot-nas
  • paper_authors: Guihong Li, Duc Hoang, Kartikeya Bhardwaj, Ming Lin, Zhangyang Wang, Radu Marculescu
  • for: The purpose of this study is to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS methods, with an emphasis on their hardware awareness.
  • methods: The study reviews mainstream zero-shot proxies and discusses their theoretical underpinnings.
  • results: Through large-scale experiments, the study demonstrates the effectiveness of zero-shot proxies in both hardware-aware and hardware-oblivious NAS scenarios, and proposes several promising ideas for designing better proxies.
    Abstract Recently, zero-shot (or training-free) Neural Architecture Search (NAS) approaches have been proposed to liberate the NAS from training requirements. The key idea behind zero-shot NAS approaches is to design proxies that predict the accuracies of the given networks without training network parameters. The proxies proposed so far are usually inspired by recent progress in theoretical deep learning and have shown great potential on several NAS benchmark datasets. This paper aims to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an emphasis on their hardware awareness. To this end, we first review the mainstream zero-shot proxies and discuss their theoretical underpinnings. We then compare these zero-shot proxies through large-scale experiments and demonstrate their effectiveness in both hardware-aware and hardware-oblivious NAS scenarios. Finally, we point out several promising ideas to design better proxies. Our source code and the related paper list are available on https://github.com/SLDGroup/survey-zero-shot-nas.
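
As background for the proxies such a survey compares, one very simple zero-shot proxy scores an untrained network from a single minibatch by summing parameter-gradient norms at initialization (a GradNorm-style score). The sketch below is a generic illustration of this family, not any specific proxy from the paper.

```python
import torch
import torch.nn.functional as F

def gradnorm_proxy(model, images, labels) -> float:
    """Score an untrained architecture by the total gradient norm from one minibatch."""
    model.train()
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            score += p.grad.norm(p=2).item()
    return score  # larger scores are taken to indicate more trainable networks

# Candidate architectures can then be ranked without any training:
# best = max(candidates, key=lambda net: gradnorm_proxy(net, images, labels))
```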

Unsupervised Spectral Demosaicing with Lightweight Spectral Attention Networks

  • paper_url: http://arxiv.org/abs/2307.01990
  • repo_url: None
  • paper_authors: Kai Feng, Yongqiang Zhao, Seong G. Kong, Haijin Zeng
  • for: This paper presents an unsupervised deep learning-based spectral demosaicing technique for real-world hyperspectral images.
  • methods: The proposed method uses a mosaic loss function, a specific model structure, a transformation strategy, and an early stopping strategy to form a complete unsupervised spectral demosaicing framework. The spectral attention module is modified to reduce complexity and parameters.
  • results: The proposed method outperforms conventional unsupervised methods in terms of spatial distortion suppression, spectral fidelity, robustness, and computational cost, as demonstrated through extensive experiments on synthetic and real-world datasets.
    Abstract This paper presents a deep learning-based spectral demosaicing technique trained in an unsupervised manner. Many existing deep learning-based techniques relying on supervised learning with synthetic images, often underperform on real-world images especially when the number of spectral bands increases. According to the characteristics of the spectral mosaic image, this paper proposes a mosaic loss function, the corresponding model structure, a transformation strategy, and an early stopping strategy, which form a complete unsupervised spectral demosaicing framework. A challenge in real-world spectral demosaicing is inconsistency between the model parameters and the computational resources of the imager. We reduce the complexity and parameters of the spectral attention module by dividing the spectral attention tensor into spectral attention matrices in the spatial dimension and spectral attention vector in the channel dimension, which is more suitable for unsupervised framework. This paper also presents Mosaic25, a real 25-band hyperspectral mosaic image dataset of various objects, illuminations, and materials for benchmarking. Extensive experiments on synthetic and real-world datasets demonstrate that the proposed method outperforms conventional unsupervised methods in terms of spatial distortion suppression, spectral fidelity, robustness, and computational cost.

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

  • paper_url: http://arxiv.org/abs/2307.01985
  • repo_url: https://github.com/cofly2014/tsa-mlt
  • paper_authors: Fei Guo, Li Zhu, YiWang Wang
  • for: This paper proposes a Transformer-based few-shot action recognition method to improve recognition performance.
  • methods: The method uses a Multiple Level Transformer and Task-Specific Alignment (TSA) to extract multi-level features, together with a weighted fusion loss that combines two kinds of distances.
  • results: Experiments show state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 datasets.
    Abstract In the research field of few-shot learning, the main difference between image-based and video-based tasks is the additional temporal dimension of videos. In recent years, many approaches for few-shot action recognition have followed metric-based methods; in particular, some works use the Transformer to obtain cross-attention features of the videos or enhanced prototypes, and the results are competitive. However, they do not mine enough information from the Transformer because they only focus on features at a single level. In our paper, we address this problem. We propose an end-to-end method named "Task-Specific Alignment and Multiple Level Transformer Network (TSA-MLT)". In our model, the Multiple Level Transformer focuses on multiple-level features of the support video and query video. In particular, before the Multiple Level Transformer, we use the task-specific TSA to filter unimportant or misleading frames as a pre-processing step. Furthermore, we adopt a fusion loss using two kinds of distance: the first is an L2 sequence distance, which focuses on temporal order alignment; the second is an optimal transport distance, which focuses on measuring the gap between the appearance and semantics of the videos. Using a simple fusion network, we fuse the two distances element-wise, then use the cross-entropy loss as our fusion loss. Extensive experiments show our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and a competitive result on the benchmarks of the Kinetics and Something-Something V2 datasets. Our code will be available at the URL: https://github.com/cofly2014/tsa-mlt.git

A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis

  • paper_url: http://arxiv.org/abs/2307.01981
  • repo_url: None
  • paper_authors: Jiaxiang Liu, Tianxiang Hu, Yan Zhang, Xiaotang Gai, Yang Feng, Zuozhu Liu
  • for: This work proposes a zero-shot medical image classification framework to cope with the limited access to large-scale annotated data and to all possible diseases in real-world medical applications.
  • methods: The framework builds on the pretrained vision-language model CLIP and integrates ChatGPT for explainable diagnosis; disease category names are fed to large language models (LLMs) to automatically generate additional cues such as symptoms and descriptions, helping CLIP produce more accurate and explainable diagnoses.
  • results: Extensive experiments on one private and four public datasets demonstrate the effectiveness and explainability of the training-free zero-shot diagnosis pipeline, corroborating the great potential of VLMs and LLMs for medical applications.
    Abstract Zero-shot medical image classification is a critical process in real-world scenarios where we have limited access to all possible diseases or large-scale annotated data. It involves computing similarity scores between a query medical image and possible disease categories to determine the diagnostic result. Recent advances in pretrained vision-language models (VLMs) such as CLIP have shown great performance for zero-shot natural image recognition and exhibit benefits in medical applications. However, an explainable zero-shot medical image recognition framework with promising performance is yet under development. In this paper, we propose a novel CLIP-based zero-shot medical image classification framework supplemented with ChatGPT for explainable diagnosis, mimicking the diagnostic process performed by human experts. The key idea is to query large language models (LLMs) with category names to automatically generate additional cues and knowledge, such as disease symptoms or descriptions other than a single category name, to help provide more accurate and explainable diagnosis in CLIP. We further design specific prompts to enhance the quality of generated texts by ChatGPT that describe visual medical features. Extensive results on one private dataset and four public datasets along with detailed analysis demonstrate the effectiveness and explainability of our training-free zero-shot diagnosis pipeline, corroborating the great potential of VLMs and LLMs for medical applications.
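
The core mechanism, scoring an image against several LLM-generated descriptions per disease rather than the bare class name, can be sketched with the OpenAI CLIP package as below; the disease names and descriptions here are made up for illustration, whereas in the paper they would come from ChatGPT.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated cues; the paper queries ChatGPT for symptoms/descriptions.
class_descriptions = {
    "pneumonia": ["a chest x-ray with patchy lung opacities", "a chest x-ray showing consolidation"],
    "normal": ["a chest x-ray with clear lungs", "a chest x-ray with no abnormal findings"],
}

@torch.no_grad()
def zero_shot_diagnose(image_path: str) -> str:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    image_feat = model.encode_image(image)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for name, descriptions in class_descriptions.items():
        tokens = clip.tokenize(descriptions).to(device)
        text_feat = model.encode_text(tokens)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        # Average similarity over all descriptions of this class.
        scores[name] = (image_feat @ text_feat.T).mean().item()
    return max(scores, key=scores.get)
```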

ToothSegNet: Image Degradation meets Tooth Segmentation in CBCT Images

  • paper_url: http://arxiv.org/abs/2307.01979
  • repo_url: None
  • paper_authors: Jiaxiang Liu, Tianxiang Hu, Yang Feng, Wanghui Ding, Zuozhu Liu
  • for: Three-dimensional tooth models are required for many medical treatments in computer-assisted orthodontics, which calls for accurate tooth segmentation from CBCT images.
  • methods: A new framework, ToothSegNet, acquaints the segmentation model with generated degraded images during training so that it becomes robust to image-quality problems.
  • results: ToothSegNet produces more precise segmentations and outperforms state-of-the-art medical image segmentation methods.
    Abstract In computer-assisted orthodontics, three-dimensional tooth models are required for many medical treatments. Tooth segmentation from cone-beam computed tomography (CBCT) images is a crucial step in constructing the models. However, CBCT image quality problems such as metal artifacts and blurring caused by shooting equipment and patients' dental conditions make the segmentation difficult. In this paper, we propose ToothSegNet, a new framework which acquaints the segmentation model with generated degraded images during training. ToothSegNet merges the information of high and low quality images from the designed degradation simulation module using channel-wise cross fusion to reduce the semantic gap between encoder and decoder, and also refines the shape of tooth prediction through a structural constraint loss. Experimental results suggest that ToothSegNet produces more precise segmentation and outperforms the state-of-the-art medical image segmentation methods.

Multimodal Prompt Learning for Product Title Generation with Extremely Limited Labels

  • paper_url: http://arxiv.org/abs/2307.01969
  • repo_url: None
  • paper_authors: Bang Yang, Fenglin Liu, Zheng Li, Qingyu Yin, Chenyu You, Bing Yin, Yuexian Zou
  • for: This work proposes a prompt-based approach, the Multimodal Prompt Learning framework, to accurately and efficiently generate titles for novel products with extremely limited labels.
  • methods: Multimodal prompts built from different modalities preserve product characteristics and writing styles and are retrieved at inference time to generate titles; only 1% of downstream labelled data is needed for training.
  • results: Experiments on five novel product categories under both in-domain and out-of-domain settings show that, with only 1% of downstream labelled data, the method achieves the best few-shot results and is even competitive with fully-supervised methods trained on 100% of the data; with full labelled data, it achieves state-of-the-art results.
    Abstract Generating an informative and attractive title for a product is a crucial task for e-commerce. Most existing works follow standard multimodal natural language generation approaches, e.g., image captioning, and employ large-scale human-labelled datasets to train desirable models. However, for novel products, especially in a different domain, there is little existing labelled data. In this paper, we propose a prompt-based approach, i.e., the Multimodal Prompt Learning framework, to accurately and efficiently generate titles for novel products with limited labels. We observe that the core challenges of novel product title generation are the understanding of novel product characteristics and the generation of titles in a novel writing style. To this end, we build a set of multimodal prompts from different modalities to preserve the corresponding characteristics and writing styles of novel products. As a result, with extremely limited labels for training, the proposed method can retrieve the multimodal prompts to generate desirable titles for novel products. The experiments and analyses are conducted on five novel product categories under both the in-domain and out-of-domain experimental settings. The results show that, with only 1% of downstream labelled data for training, our proposed approach achieves the best few-shot results and even achieves competitive results with fully-supervised methods trained on 100% of the training data; with the full labelled data for training, our method achieves state-of-the-art results.

Muti-scale Graph Neural Network with Signed-attention for Social Bot Detection: A Frequency Perspective

  • paper_url: http://arxiv.org/abs/2307.01968
  • repo_url: None
  • paper_authors: Shuhao Shi, Kai Qiao, Zhengyan Wang, Jie Yang, Baojie Song, Jian Chen, Bin Yan
  • for: This paper proposes a method for detecting social bots using a multi-scale signed-attention graph filter (MSGS) to improve the representation ability of the model and alleviate the over-smoothing problem of deep GNNs.
  • methods: The proposed MSGS method utilizes a multi-scale structure to produce representation vectors at different scales, which are then combined using a signed-attention mechanism. The method also uses a polymerization process to produce the final result.
  • results: The experimental results on real-world datasets demonstrate that the proposed MSGS method achieves better performance compared with several state-of-the-art social bot detection methods.
    Abstract The presence of a large number of bots on social media has adverse effects. The graph neural network (GNN) can effectively leverage the social relationships between users and achieve excellent results in detecting bots. Recently, more and more GNN-based methods have been proposed for bot detection. However, the existing GNN-based bot detection methods only focus on low-frequency information and seldom consider high-frequency information, which limits the representation ability of the model. To address this issue, this paper proposes a Multi-scale with Signed-attention Graph Filter for social bot detection called MSGS. MSGS could effectively utilize both high and low-frequency information in the social graph. Specifically, MSGS utilizes a multi-scale structure to produce representation vectors at different scales. These representations are then combined using a signed-attention mechanism. Finally, multi-scale representations via MLP after polymerization to produce the final result. We analyze the frequency response and demonstrate that MSGS is a more flexible and expressive adaptive graph filter. MSGS can effectively utilize high-frequency information to alleviate the over-smoothing problem of deep GNNs. Experimental results on real-world datasets demonstrate that our method achieves better performance compared with several state-of-the-art social bot detection methods.

Hybrid Neural Diffeomorphic Flow for Shape Representation and Generation via Triplane

  • paper_url: http://arxiv.org/abs/2307.01957
  • repo_url: None
  • paper_authors: Kun Han, Shanlin Sun, Xiaohui Xie
  • for: Shape representation and generation in 3D computer vision.
  • methods: Hybrid Neural Diffeomorphic Flow (HNDF) implicitly learns the underlying representation and decomposes intricate dense correspondences into explicitly axis-aligned triplane features.
  • results: High-quality and diverse 3D diffeomorphic flows can be generated while preserving topological consistency with the template shape; experiments on medical image organ segmentation datasets demonstrate the effectiveness of HNDF.
    Abstract Deep Implicit Functions (DIFs) have gained popularity in 3D computer vision due to their compactness and continuous representation capabilities. However, addressing dense correspondences and semantic relationships across DIF-encoded shapes remains a critical challenge, limiting their applications in texture transfer and shape analysis. Moreover, recent endeavors in 3D shape generation using DIFs often neglect correspondence and topology preservation. This paper presents HNDF (Hybrid Neural Diffeomorphic Flow), a method that implicitly learns the underlying representation and decomposes intricate dense correspondences into explicitly axis-aligned triplane features. To avoid suboptimal representations trapped in local minima, we propose hybrid supervision that captures both local and global correspondences. Unlike conventional approaches that directly generate new 3D shapes, we further explore the idea of shape generation with deformed template shape via diffeomorphic flows, where the deformation is encoded by the generated triplane features. Leveraging a pre-existing 2D diffusion model, we produce high-quality and diverse 3D diffeomorphic flows through generated triplanes features, ensuring topological consistency with the template shape. Extensive experiments on medical image organ segmentation datasets evaluate the effectiveness of HNDF in 3D shape representation and generation.

Toward more frugal models for functional cerebral networks automatic recognition with resting-state fMRI

  • paper_url: http://arxiv.org/abs/2307.01953
  • repo_url: None
  • paper_authors: Lukman Ismaila, Pejman Rasti, Jean-Michel Lemée, David Rousseau
  • for: This paper investigates different encoding techniques to reduce the complexity of neural network models without sacrificing performance.
  • methods: Supervoxel and graph encodings of functional brain network images are used, with neural network models recognizing resting-state functional networks of patients with brain tumors.
  • results: Graph representations encoding supervoxels preserve the activation characteristics of functional brain networks while reducing model complexity, maintaining recognition performance.
    Abstract We refer to a machine learning situation where models based on classical convolutional neural networks have shown good performance. We are investigating different encoding techniques in the form of supervoxels, then graphs to reduce the complexity of the model while tracking the loss of performance. This approach is illustrated on a recognition task of resting-state functional networks for patients with brain tumors. Graphs encoding supervoxels preserve activation characteristics of functional brain networks from images, optimize model parameters by 26 times while maintaining CNN model performance.

A Synthetic Electrocardiogram (ECG) Image Generation Toolbox to Facilitate Deep Learning-Based Scanned ECG Digitization

  • paper_url: http://arxiv.org/abs/2307.01946
  • repo_url: None
  • paper_authors: Kshama Kodthalu Shivashankara, Afagh Mehri Shervedani, Reza Sameni
  • for: The paper is written for training machine learning models in algorithmic ECG diagnosis, and the authors aim to address the challenge of scarce ECG archives with reference time-series.
  • methods: The authors propose a novel method for generating synthetic ECG images on standard paper-like ECG backgrounds with realistic artifacts, using digital twins and data augmentation techniques.
  • results: The authors built a deep ECG image digitization model and trained it on the synthetic dataset; the results show an average signal recovery SNR of 27$\pm$2.8 dB, demonstrating the significance of the proposed synthetic ECG image dataset for training deep learning models.
    Abstract The electrocardiogram (ECG) is an accurate and widely available tool for diagnosing cardiovascular diseases. ECGs have been recorded in printed formats for decades and their digitization holds great potential for training machine learning (ML) models in algorithmic ECG diagnosis. Physical ECG archives are at risk of deterioration and scanning printed ECGs alone is insufficient, as ML models require ECG time-series data. Therefore, the digitization and conversion of paper ECG archives into time-series data is of utmost importance. Deep learning models for image processing show promise in this regard. However, the scarcity of ECG archives with reference time-series is a challenge. Data augmentation techniques utilizing \textit{digital twins} present a potential solution. We introduce a novel method for generating synthetic ECG images on standard paper-like ECG backgrounds with realistic artifacts. Distortions including handwritten text artifacts, wrinkles, creases and perspective transforms are applied to the generated images, without personally identifiable information. As a use case, we generated an ECG image dataset of 21,801 records from the 12-lead PhysioNet PTB-XL ECG time-series dataset. A deep ECG image digitization model was built and trained on the synthetic dataset, and was employed to convert the synthetic images to time-series data for evaluation. The signal-to-noise ratio (SNR) was calculated to assess the image digitization quality vs the ground truth ECG time-series. The results show an average signal recovery SNR of 27$\pm$2.8\,dB, demonstrating the significance of the proposed synthetic ECG image dataset for training deep learning models. The codebase is available as an open-access toolbox for ECG research.
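
The signal-recovery quality above is reported as an SNR between the digitized and ground-truth time-series; one common definition, assumed here, is $\mathrm{SNR} = 10\log_{10}\left(\sum_n x[n]^2 / \sum_n (x[n]-\hat{x}[n])^2\right)$, sketched below.

```python
import numpy as np

def recovery_snr_db(reference: np.ndarray, recovered: np.ndarray) -> float:
    """SNR (dB) of a digitized ECG lead against the reference time-series."""
    noise = reference - recovered
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))
```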

Text + Sketch: Image Compression at Ultra Low Rates

  • paper_url: http://arxiv.org/abs/2307.01944
  • repo_url: https://github.com/leieric/text-sketch
  • paper_authors: Eric Lei, Yiğit Berkay Uslu, Hamed Hassani, Shirin Saeedi Bidokhti
  • for: This work studies how text-to-image generative models can be used for image compression, enabling schemes that target novel ultra-low bit-rate regimes.
  • methods: Pre-trained text-to-image models are used for compression, combined with side information to generate high-fidelity reconstructions that preserve both the semantics and the spatial structure of the original image.
  • results: At very low bit-rates, the method significantly outperforms learned compressors in terms of perceptual and semantic fidelity.
    Abstract Recent advances in text-to-image generative models provide the ability to generate high-quality images from short text descriptions. These foundation models, when pre-trained on billion-scale datasets, are effective for various downstream tasks with little or no further training. A natural question to ask is how such models may be adapted for image compression. We investigate several techniques in which the pre-trained models can be directly used to implement compression schemes targeting novel low rate regimes. We show how text descriptions can be used in conjunction with side information to generate high-fidelity reconstructions that preserve both semantics and spatial structure of the original. We demonstrate that at very low bit-rates, our method can significantly improve upon learned compressors in terms of perceptual and semantic fidelity, despite no end-to-end training.

Physics-based Motion Retargeting from Sparse Inputs

  • paper_url: http://arxiv.org/abs/2307.01938
  • repo_url: None
  • paper_authors: Daniele Reda, Jungdam Won, Yuting Ye, Michiel van de Panne, Alexander Winkler
  • for: This paper targets the creation of interactive and engaging avatar animation in virtual worlds, addressing the problem of tracking a user's motion from very sparse sensor data.
  • methods: Reinforcement learning is used to train a policy that controls characters of various skeleton structures in a physics simulator; no artist-generated animation per avatar is required, only human motion capture data for training.
  • results: The method tracks users in real time from sparse sensor data, with avatar poses often matching the user surprisingly well; robustness is further explored in settings including unbalancing, dancing, and sports motions.
    Abstract Avatars are important to create interactive and immersive experiences in virtual worlds. One challenge in animating these characters to mimic a user's motion is that commercial AR/VR products consist only of a headset and controllers, providing very limited sensor data of the user's pose. Another challenge is that an avatar might have a different skeleton structure than a human and the mapping between them is unclear. In this work we address both of these challenges. We introduce a method to retarget motions in real-time from sparse human sensor data to characters of various morphologies. Our method uses reinforcement learning to train a policy to control characters in a physics simulator. We only require human motion capture data for training, without relying on artist-generated animations for each avatar. This allows us to use large motion capture datasets to train general policies that can track unseen users from real and sparse data in real-time. We demonstrate the feasibility of our approach on three characters with different skeleton structure: a dinosaur, a mouse-like creature and a human. We show that the avatar poses often match the user surprisingly well, despite having no sensor information of the lower body available. We discuss and ablate the important components in our framework, specifically the kinematic retargeting step, the imitation, contact and action reward as well as our asymmetric actor-critic observations. We further explore the robustness of our method in a variety of settings including unbalancing, dancing and sports motions.

ProtoDiffusion: Classifier-Free Diffusion Guidance with Prototype Learning

  • paper_url: http://arxiv.org/abs/2307.01924
  • repo_url: None
  • paper_authors: Gulcin Baykal, Halil Faruk Karagoz, Taha Binhuraib, Gozde Unal
  • for: Achieve higher generation quality and faster training for diffusion models.
  • methods: Incorporate prototype learning into diffusion models, using separately learned class prototypes as conditioning information.
  • results: Reaches better generation quality in shorter training time than the baseline method.
    Abstract Diffusion models are generative models that have shown significant advantages compared to other generative models in terms of higher generation quality and more stable training. However, the computational need for training diffusion models is considerably increased. In this work, we incorporate prototype learning into diffusion models to achieve high generation quality faster than the original diffusion model. Instead of randomly initialized class embeddings, we use separately learned class prototypes as the conditioning information to guide the diffusion process. We observe that our method, called ProtoDiffusion, achieves better performance in the early stages of training compared to the baseline method, signifying that using the learned prototypes shortens the training time. We demonstrate the performance of ProtoDiffusion using various datasets and experimental settings, achieving the best performance in shorter times across all settings.
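
As a rough illustration of the conditioning idea, the hypothetical sketch below swaps a randomly initialized class embedding for a lookup into separately learned, frozen class prototypes; names and shapes are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class PrototypeConditioner(nn.Module):
    """Illustrative sketch: class prototypes learned in a separate stage
    (e.g. with a classification or metric-learning objective) are frozen
    and used as the conditioning vectors of the denoising network."""

    def __init__(self, prototypes: torch.Tensor):
        super().__init__()
        # prototypes: (num_classes, embed_dim), learned before diffusion training
        self.register_buffer("prototypes", prototypes)
        self.proj = nn.Linear(prototypes.shape[1], prototypes.shape[1])

    def forward(self, class_ids: torch.Tensor) -> torch.Tensor:
        # Look up the prototype of each class instead of a randomly
        # initialized nn.Embedding, then project it into the denoiser's
        # conditioning space.
        return self.proj(self.prototypes[class_ids])

# Usage sketch: cond = PrototypeConditioner(protos)(labels)
# and `cond` is passed to the denoiser wherever a class embedding would go.
```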

EANet: Enhanced Attribute-based RGBT Tracker Network

  • paper_url: http://arxiv.org/abs/2307.01893
  • repo_url: None
  • paper_authors: Abbas Türkoğlu, Erdem Akagündüz
  • for: Improving object tracking in computer vision, especially under challenges such as occlusion, illumination changes and motion blur.
  • methods: A deep-learning tracker that fuses RGB and thermal infrared (TIR) images (RGBT), consisting of a feature extractor and a tracker built on an enhanced attribute-based architecture with attribute-specific feature selection and an aggregation module.
  • results: Evaluated on the RGBT234 and LasHeR datasets, the proposed system outperforms state-of-the-art RGBT object trackers with a relatively small number of parameters.
    Abstract Tracking objects can be a difficult task in computer vision, especially when faced with challenges such as occlusion, changes in lighting, and motion blur. Recent advances in deep learning have shown promise in challenging these conditions. However, most deep learning-based object trackers only use visible band (RGB) images. Thermal infrared electromagnetic waves (TIR) can provide additional information about an object, including its temperature, when faced with challenging conditions. We propose a deep learning-based image tracking approach that fuses RGB and thermal images (RGBT). The proposed model consists of two main components: a feature extractor and a tracker. The feature extractor encodes deep features from both the RGB and the TIR images. The tracker then uses these features to track the object using an enhanced attribute-based architecture. We propose a fusion of attribute-specific feature selection with an aggregation module. The proposed methods are evaluated on the RGBT234 \cite{LiCLiang2018} and LasHeR \cite{LiLasher2021} datasets, which are the most widely used RGBT object-tracking datasets in the literature. The results show that the proposed system outperforms state-of-the-art RGBT object trackers on these datasets, with a relatively smaller number of parameters.
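
The sketch below shows one generic way to fuse RGB and thermal (TIR) feature maps with a learned gate; it illustrates modality fusion in general and is not the paper's attribute-specific selection or aggregation module.

```python
import torch
import torch.nn as nn

class GatedRGBTFusion(nn.Module):
    """Generic RGB/TIR feature fusion: a gate predicted from both modalities
    weights each channel and pixel before aggregation. Illustrative only."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_tir: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_rgb, feat_tir], dim=1))
        # convex combination of the two modalities, per channel and location
        return g * feat_rgb + (1.0 - g) * feat_tir
```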

MaskBEV: Joint Object Detection and Footprint Completion for Bird’s-eye View 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2307.01864
  • repo_url: https://github.com/norlab-ulaval/mask_bev
  • paper_authors: William Guimont-Martin, Jean-Michel Fortin, François Pomerleau, Philippe Giguère
  • for: Proposing a bird's-eye view (BEV) mask-based object detection method that removes the explicit prior knowledge required by existing anchor-based and anchor-free detectors.
  • methods: The method predicts BEV instance masks representing object footprints, performing object detection and footprint completion in a single pass, and reformulates detection purely as classification, doing away with the regression usually used to predict bounding boxes.
  • results: MaskBEV achieves strong performance on the SemanticKITTI and KITTI datasets; the paper also analyzes the advantages and limitations of the architecture.
    Abstract Recent works in object detection in LiDAR point clouds mostly focus on predicting bounding boxes around objects. This prediction is commonly achieved using anchor-based or anchor-free detectors that predict bounding boxes, requiring significant explicit prior knowledge about the objects to work properly. To remedy these limitations, we propose MaskBEV, a bird's-eye view (BEV) mask-based object detector neural architecture. MaskBEV predicts a set of BEV instance masks that represent the footprints of detected objects. Moreover, our approach allows object detection and footprint completion in a single pass. MaskBEV also reformulates the detection problem purely in terms of classification, doing away with regression usually done to predict bounding boxes. We evaluate the performance of MaskBEV on both SemanticKITTI and KITTI datasets while analyzing the architecture advantages and limitations.
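
A toy sketch of the "detection as classification" formulation: a set of instance queries each predicts a class distribution and a BEV footprint mask, so no box regression is needed. The head below is a generic mask-classification design with assumed shapes, not the MaskBEV architecture.

```python
import torch
import torch.nn as nn

class BEVMaskHead(nn.Module):
    """Toy head: N instance queries each produce class logits and a BEV
    footprint mask over an H x W grid, with no box regression. Illustrative
    of the formulation only."""

    def __init__(self, embed_dim=256, num_queries=50, num_classes=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)   # +1 for "no object"
        self.mask_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, bev_features: torch.Tensor):
        # bev_features: (B, embed_dim, H, W) from a BEV backbone
        b = bev_features.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)          # (B, N, C)
        cls_logits = self.cls_head(q)                            # (B, N, num_classes+1)
        mask_embed = self.mask_proj(q)                           # (B, N, C)
        mask_logits = torch.einsum("bnc,bchw->bnhw", mask_embed, bev_features)
        return cls_logits, mask_logits                           # masks = footprints
```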

Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2307.01849
  • repo_url: None
  • paper_authors: Xiang Li, Varun Belagali, Jinghuan Shang, Michael S. Ryoo
  • for: Improving diffusion-based visuomotor policy learning by adding a self-supervised learning (SSL) objective.
  • methods: The policy generates action sequences through a diffusion process; a new decoder reconstructs raw image pixels and other state information from intermediate representations of the reverse diffusion process, and the model is trained jointly with the SSL loss.
  • results: Experiments on several simulated and real-world robot tasks show that Crossway Diffusion outperforms the standard diffusion-based policy, especially when demonstrations have varying proficiency.
    Abstract Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning, benefiting from their exceptional capabilities in modeling complex data distribution. In this work, we propose Crossway Diffusion, a method to enhance diffusion-based visuomotor policy learning by using an extra self-supervised learning (SSL) objective. The standard diffusion-based policy generates action sequences from random noise conditioned on visual observations and other low-dimensional states. We further extend this by introducing a new decoder that reconstructs raw image pixels (and other state information) from the intermediate representations of the reverse diffusion process, and train the model jointly using the SSL loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its advantages over the standard diffusion-based policy. We demonstrate that such self-supervised reconstruction enables better representation for policy learning, especially when the demonstrations have different proficiencies.
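
A hedged sketch of the joint objective described above: the standard denoising loss on the action sequence plus an auxiliary reconstruction (SSL) loss computed from intermediate features of the reverse-diffusion network. The `denoiser`, `decoder`, and noise-schedule structures are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def crossway_style_loss(denoiser, decoder, obs_image, state, actions, noise_sched, ssl_weight=1.0):
    """Illustrative joint objective: diffusion denoising loss on the action
    sequence plus an auxiliary pixel-reconstruction loss from intermediate
    features. `denoiser` is assumed to return (noise_prediction, features)."""
    b = actions.shape[0]
    t = torch.randint(0, noise_sched["num_steps"], (b,), device=actions.device)
    noise = torch.randn_like(actions)
    alpha_bar = noise_sched["alpha_bar"][t].view(-1, 1, 1)       # actions: (B, T, A)
    noisy_actions = alpha_bar.sqrt() * actions + (1 - alpha_bar).sqrt() * noise

    noise_pred, feats = denoiser(noisy_actions, t, obs_image, state)
    diffusion_loss = F.mse_loss(noise_pred, noise)

    recon = decoder(feats)                                        # reconstruct raw pixels
    ssl_loss = F.mse_loss(recon, obs_image)
    return diffusion_loss + ssl_weight * ssl_loss
```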

Grad-FEC: Unequal Loss Protection of Deep Features in Collaborative Intelligence

  • paper_url: http://arxiv.org/abs/2307.01846
  • repo_url: None
  • paper_authors: Korcan Uyanik, S. Faegheh Yeganli, Ivan V. Bajić
  • for: Improving the reliability and robustness of collaborative intelligence systems, in which an AI model is split between an edge device and the cloud, under packet loss.
  • methods: An Unequal Loss Protection (ULP) approach that combines a feature importance estimator with Forward Error Correction (FEC) codes to selectively protect the important feature packets produced by the front-end.
  • results: Experiments show that the proposed approach significantly improves the reliability and robustness of the collaborative intelligence system in the presence of packet loss.
    Abstract Collaborative intelligence (CI) involves dividing an artificial intelligence (AI) model into two parts: front-end, to be deployed on an edge device, and back-end, to be deployed in the cloud. The deep feature tensors produced by the front-end are transmitted to the cloud through a communication channel, which may be subject to packet loss. To address this issue, in this paper, we propose a novel approach to enhance the resilience of the CI system in the presence of packet loss through Unequal Loss Protection (ULP). The proposed ULP approach involves a feature importance estimator, which estimates the importance of feature packets produced by the front-end, and then selectively applies Forward Error Correction (FEC) codes to protect important packets. Experimental results demonstrate that the proposed approach can significantly improve the reliability and robustness of the CI system in the presence of packet loss.
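
A simplified sketch of the unequal loss protection idea: estimate the importance of each feature packet, then apply FEC only to the most important ones. The importance scores and the `fec_encode` stub are placeholders; a real system would use proper FEC codes such as Reed-Solomon.

```python
import numpy as np

def unequal_loss_protection(packets, importance_scores, protect_ratio=0.3):
    """Toy ULP sketch: sort feature packets by estimated importance and apply
    a (stand-in) FEC encoding only to the top fraction."""
    def fec_encode(packet):
        # Placeholder: real FEC would compute parity symbols; here we just attach zeros.
        return {"payload": packet, "parity": np.zeros(4, dtype=np.uint8)}

    order = np.argsort(importance_scores)[::-1]                  # most important first
    k = int(np.ceil(protect_ratio * len(packets)))
    protected_ids = set(order[:k].tolist())

    stream = []
    for i, p in enumerate(packets):
        stream.append(fec_encode(p) if i in protected_ids else {"payload": p})
    return stream
```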

Deep Features for Contactless Fingerprint Presentation Attack Detection: Can They Be Generalized?

  • paper_url: http://arxiv.org/abs/2307.01845
  • repo_url: None
  • paper_authors: Hailin Li, Raghavendra Ramachandra
  • for: Comparing how well different pre-trained convolutional neural networks (CNNs) and a vision transformer (ViT) generalize for contactless fingerprint presentation attack detection.
  • methods: Extensive experiments on publicly available smartphone-based presentation attack datasets covering four different presentation attack instruments (PAIs), evaluated with a leave-one-out protocol.
  • results: Features extracted with the ResNet50 CNN achieve the best generalization performance.
    Abstract The rapid evolution of high-end smartphones with advanced high-resolution cameras has resulted in contactless capture of fingerprint biometrics that are more reliable and suitable for verification. Similar to other biometric systems, contactless fingerprint-verification systems are vulnerable to presentation attacks. In this paper, we present a comparative study on the generalizability of seven different pre-trained Convolutional Neural Networks (CNN) and a Vision Transformer (ViT) to reliably detect presentation attacks. Extensive experiments were carried out on publicly available smartphone-based presentation attack datasets using four different Presentation Attack Instruments (PAI). The detection performance of the eight deep feature techniques was evaluated using the leave-one-out protocol to benchmark the generalization performance for unseen PAI. The obtained results indicated the best generalization performance with the ResNet50 CNN.
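
A schematic sketch of the leave-one-out protocol: for each presentation attack instrument (PAI), a classifier on deep features is trained with bona fide samples plus all other PAIs and tested on the held-out PAI. The feature inputs and the logistic-regression classifier are assumptions used only to illustrate the protocol, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def leave_one_pai_out(features_by_pai, labels_by_pai, bona_fide_feats, bona_fide_labels):
    """features_by_pai / labels_by_pai: dicts keyed by PAI name.
    For each PAI, train on bona fide + all other PAIs, test on the held-out PAI.
    Since held-out labels are all 'attack', the score is the detection rate
    on that unseen attack type."""
    results = {}
    for held_out in features_by_pai:
        train_x = [bona_fide_feats] + [f for k, f in features_by_pai.items() if k != held_out]
        train_y = [bona_fide_labels] + [l for k, l in labels_by_pai.items() if k != held_out]
        clf = LogisticRegression(max_iter=1000).fit(np.vstack(train_x), np.concatenate(train_y))
        results[held_out] = clf.score(features_by_pai[held_out], labels_by_pai[held_out])
    return results
```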

Advancing Wound Filling Extraction on 3D Faces: Auto-Segmentation and Wound Face Regeneration Approach

  • paper_url: http://arxiv.org/abs/2307.01844
  • repo_url: https://github.com/simogroup/woundfilling3d
  • paper_authors: Duong Q. Nguyen, Thinh D. Le, Phuong D. Nguyen, Nga T. K. Le, H. Nguyen-Xuan
  • for: Automating 3D facial wound segmentation to support precise preoperative planning and better patient outcomes in various medical applications.
  • methods: An efficient two-stream Graph Convolutional Network (GCN) trained on the Cir3D-FaIR dataset, with extensive experiments on different loss functions to address data imbalance.
  • results: The method achieves highly accurate segmentation of complex 3D facial wounds and outperforms the previous approach; the improved wound-filling extraction reaches an accuracy of 0.9999986% on the test suite.
    Abstract Facial wound segmentation plays a crucial role in preoperative planning and optimizing patient outcomes in various medical applications. In this paper, we propose an efficient approach for automating 3D facial wound segmentation using a two-stream graph convolutional network. Our method leverages the Cir3D-FaIR dataset and addresses the challenge of data imbalance through extensive experimentation with different loss functions. To achieve accurate segmentation, we conducted thorough experiments and selected a high-performing model from the trained models. The selected model demonstrates exceptional segmentation performance for complex 3D facial wounds. Furthermore, based on the segmentation model, we propose an improved approach for extracting 3D facial wound fillers and compare it to the results of the previous study. Our method achieved a remarkable accuracy of 0.9999986\% on the test suite, surpassing the performance of the previous method. From this result, we use 3D printing technology to illustrate the shape of the wound filling. The outcomes of this study have significant implications for physicians involved in preoperative planning and intervention design. By automating facial wound segmentation and improving the accuracy of wound-filling extraction, our approach can assist in carefully assessing and optimizing interventions, leading to enhanced patient outcomes. Additionally, it contributes to advancing facial reconstruction techniques by utilizing machine learning and 3D bioprinting for printing skin tissue implants. Our source code is available at \url{https://github.com/SIMOGroup/WoundFilling3D}.
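
For readers unfamiliar with graph convolutions on mesh data, a single-layer sketch is shown below; the paper's actual two-stream architecture, loss functions, and training details are not reproduced here.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer over mesh vertices: X' = ReLU(A_hat X W),
    where A_hat is the row-normalized adjacency with self-loops. A generic
    building-block illustration, not the paper's network."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) vertex features; adj: (N, N) binary adjacency matrix
        a_hat = adj + torch.eye(adj.shape[0], device=adj.device)   # add self-loops
        a_hat = a_hat / a_hat.sum(dim=1, keepdim=True)             # row-normalize
        return torch.relu(self.linear(a_hat @ x))
```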

Collaborative Score Distillation for Consistent Visual Synthesis

  • paper_url: http://arxiv.org/abs/2307.04787
  • repo_url: https://github.com/subin-kim-cv/CSD
  • paper_authors: Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin
  • for: Broadening the range of applications of text-to-image diffusion models by making their generative priors usable across sets of images.
  • methods: Collaborative Score Distillation (CSD), built on Stein Variational Gradient Descent (SVGD), which treats multiple samples as particles and combines their score functions to distill generative priors over a set of images synchronously.
  • results: CSD yields consistent visual synthesis across multiple samples and is demonstrated on tasks such as editing panorama images, videos and 3D scenes.
    Abstract Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, we propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
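
To unpack the SVGD mechanics that CSD builds on, here is a minimal NumPy sketch of one standard SVGD step, where multiple samples act as interacting particles; the score function is a placeholder, and this is the generic update rather than CSD itself.

```python
import numpy as np

def rbf_kernel(x, h=1.0):
    """x: (n, d). Returns the kernel matrix K and the gradients of k(x_i, x_j)
    with respect to x_i."""
    diff = x[:, None, :] - x[None, :, :]            # (n, n, d)
    sq = (diff ** 2).sum(-1)                        # (n, n)
    K = np.exp(-sq / (2 * h ** 2))
    grad_K = -diff / (h ** 2) * K[..., None]        # d/dx_i k(x_i, x_j)
    return K, grad_K

def svgd_step(particles, score_fn, step=1e-2, h=1.0):
    """One SVGD update:
    phi(x_j) = mean_i [ k(x_i, x_j) * score(x_i) + grad_{x_i} k(x_i, x_j) ]."""
    scores = score_fn(particles)                    # (n, d), i.e. grad log p(x_i)
    K, grad_K = rbf_kernel(particles, h)
    phi = (K[:, :, None] * scores[:, None, :]).mean(0) + grad_K.mean(0)
    return particles + step * phi
```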

EdgeFace: Efficient Face Recognition Model for Edge Devices

  • paper_url: http://arxiv.org/abs/2307.01838
  • repo_url: None
  • paper_authors: Anjith George, Christophe Ecabert, Hatef Otroshi Shahreza, Ketan Kotwal, Sebastien Marcel
  • for: Developing EdgeFace, a lightweight and efficient face recognition network suitable for edge devices.
  • methods: A hybrid architecture that combines the strengths of CNN and Transformer models with a low-rank linear layer.
  • results: EdgeFace keeps computational cost and storage low while achieving high face recognition accuracy on challenging datasets, outperforming previous efficient models.
    Abstract In this paper, we present EdgeFace, a lightweight and efficient face recognition network inspired by the hybrid architecture of EdgeNeXt. By effectively combining the strengths of both CNN and Transformer models, and a low rank linear layer, EdgeFace achieves excellent face recognition performance optimized for edge devices. The proposed EdgeFace network not only maintains low computational costs and compact storage, but also achieves high face recognition accuracy, making it suitable for deployment on edge devices. Extensive experiments on challenging benchmark face datasets demonstrate the effectiveness and efficiency of EdgeFace in comparison to state-of-the-art lightweight models and deep face recognition models. Our EdgeFace model with 1.77M parameters achieves state of the art results on LFW (99.73%), IJB-B (92.67%), and IJB-C (94.85%), outperforming other efficient models with larger computational complexities. The code to replicate the experiments will be made available publicly.
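
A small sketch of the low-rank linear layer idea from the abstract: factorizing a d_in × d_out weight matrix into two thin matrices reduces parameters and compute. The dimensions below are examples, not the model's actual configuration.

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """y = (x A) B with A: (d_in, r) and B: (r, d_out).
    Parameter count drops from roughly d_in*d_out to r*(d_in + d_out)
    when the rank r is small."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=True)

    def forward(self, x):
        return self.up(self.down(x))

# Example: LowRankLinear(512, 512, rank=64) uses about 66k parameters
# instead of about 262k for a full nn.Linear(512, 512).
```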

On the Matrix Form of the Quaternion Fourier Transform and Quaternion Convolution

  • paper_url: http://arxiv.org/abs/2307.01836
  • repo_url: None
  • paper_authors: Giorgos Sfikas, George Retsinas
  • for: Studying the matrix forms of the quaternionic versions of the Fourier transform and convolution operations.
  • methods: The paper analyzes quaternionic matrices, whose handling is complicated by the non-commutativity of quaternion multiplication and by the fact that $\mu^2 = -1$ has infinitely many solutions in the quaternion domain.
  • results: The work clarifies the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, the extent to which well-known complex-domain theorems carry over to quaternions, and the eigenstructure of quaternion circulant matrices (which represent quaternionic convolution).
    Abstract We study matrix forms of quaternionic versions of the Fourier Transform and Convolution operations. Quaternions offer a powerful representation unit, however their use comes with difficulties that stem foremost from the non-commutativity of quaternion multiplication and from the fact that $\mu^2 = -1$ possesses infinitely many solutions in the quaternion domain. Handling of quaternionic matrices is consequently complicated in several aspects (definition of eigenstructure, determinant, etc.). Our research findings clarify the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, and the extent to which well-known complex-domain theorems extend to quaternions. We focus especially on the relation of Quaternion Fourier Transform matrices to Quaternion Circulant matrices (representing quaternionic convolution), and the eigenstructure of the latter. A proof-of-concept application that makes direct use of our theoretical results is presented, where we produce a method to bound the spectral norm of a Quaternionic Convolution.
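
A small numeric illustration of two facts the abstract relies on: quaternion multiplication is non-commutative, and every unit pure quaternion satisfies $\mu^2 = -1$, so that equation has infinitely many solutions. The Hamilton product below is standard; the example values are arbitrary.

```python
import numpy as np

def qmul(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) arrays."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(qmul(i, j), qmul(j, i))          # i*j = k but j*i = -k: non-commutative

u = np.array([1.0, 2.0, 2.0]) / 3.0    # any unit 3-vector works here
mu = np.concatenate(([0.0], u))        # unit pure quaternion
print(qmul(mu, mu))                     # (-1, 0, 0, 0): mu^2 = -1
```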

SUIT: Learning Significance-guided Information for 3D Temporal Detection

  • paper_url: http://arxiv.org/abs/2307.01807
  • repo_url: None
  • paper_authors: Zheyuan Zhou, Jiachen Lu, Yihan Zeng, Hang Xu, Li Zhang
  • for: Improving 3D detection for autonomous driving and robotics by exploiting temporal information in sequential LiDAR point clouds.
  • methods: Significance-gUided Information for 3D Temporal detection (SUIT), which uses a significance-based sampling mechanism to extract information-rich yet sparse features around predicted object centroids, together with an explicit geometric transformation learning technique that learns object-centric transformations across frames.
  • results: Evaluated on the nuScenes and Waymo datasets, SUIT not only reduces the memory and computation cost of temporal fusion but also outperforms state-of-the-art baselines.
    Abstract 3D object detection from LiDAR point cloud is of critical importance for autonomous driving and robotics. While sequential point cloud has the potential to enhance 3D perception through temporal information, utilizing these temporal features effectively and efficiently remains a challenging problem. Based on the observation that the foreground information is sparsely distributed in LiDAR scenes, we believe sufficient knowledge can be provided by sparse format rather than dense maps. To this end, we propose to learn Significance-gUided Information for 3D Temporal detection (SUIT), which simplifies temporal information as sparse features for information fusion across frames. Specifically, we first introduce a significant sampling mechanism that extracts information-rich yet sparse features based on predicted object centroids. On top of that, we present an explicit geometric transformation learning technique, which learns the object-centric transformations among sparse features across frames. We evaluate our method on large-scale nuScenes and Waymo dataset, where our SUIT not only significantly reduces the memory and computation cost of temporal fusion, but also performs well over the state-of-the-art baselines.
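
A hedged sketch of the sparse-feature idea: keep only the top-k BEV feature vectors with the highest significance scores (e.g. derived from predicted object centroids) before fusing across frames. The function below is illustrative; the scoring and fusion details of SUIT are not reproduced.

```python
import torch

def significance_guided_sampling(bev_feats, significance, k=512):
    """bev_feats: (B, C, H, W) features; significance: (B, H, W) scores.
    Returns the k most significant feature vectors per sample plus their flat
    indices, as a sparse summary of the frame."""
    b, c, h, w = bev_feats.shape
    flat_feats = bev_feats.flatten(2).transpose(1, 2)      # (B, H*W, C)
    flat_sig = significance.flatten(1)                     # (B, H*W)
    topk_sig, topk_idx = flat_sig.topk(k, dim=1)           # (B, k)
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, c)         # (B, k, C)
    sparse_feats = flat_feats.gather(1, idx)               # (B, k, C)
    return sparse_feats, topk_idx
```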

Edge-aware Multi-task Network for Integrating Quantification Segmentation and Uncertainty Prediction of Liver Tumor on Multi-modality Non-contrast MRI

  • paper_url: http://arxiv.org/abs/2307.01798
  • repo_url: None
  • paper_authors: Xiaojiao Xiao, Qinmin Hu, Guanghui Wang
  • for: Providing a unified framework for simultaneous multi-index quantification, segmentation, and uncertainty estimation of liver tumors on multi-modality non-contrast magnetic resonance imaging (NCMRI).
  • methods: The proposed edge-aware multi-task network (EaMtNet) employs two parallel CNN encoders, Sobel filters, and a newly designed edge-aware feature aggregation module (EaFA) for feature fusion and selection; multi-task learning leverages prediction discrepancy for uncertainty estimation and improved segmentation and quantification.
  • results: EaMtNet outperforms the state-of-the-art by a large margin, achieving a Dice similarity coefficient of 90.01$\pm$1.23 and a mean absolute error of 2.72$\pm$0.58 mm for MD, demonstrating its potential as a reliable clinical-aided tool for medical image analysis.
    Abstract Simultaneous multi-index quantification, segmentation, and uncertainty estimation of liver tumors on multi-modality non-contrast magnetic resonance imaging (NCMRI) are crucial for accurate diagnosis. However, existing methods lack an effective mechanism for multi-modality NCMRI fusion and accurate boundary information capture, making these tasks challenging. To address these issues, this paper proposes a unified framework, namely edge-aware multi-task network (EaMtNet), to associate multi-index quantification, segmentation, and uncertainty of liver tumors on the multi-modality NCMRI. The EaMtNet employs two parallel CNN encoders and the Sobel filters to extract local features and edge maps, respectively. The newly designed edge-aware feature aggregation module (EaFA) is used for feature fusion and selection, making the network edge-aware by capturing long-range dependency between feature and edge maps. Multi-tasking leverages prediction discrepancy to estimate uncertainty and improve segmentation and quantification performance. Extensive experiments are performed on multi-modality NCMRI with 250 clinical subjects. The proposed model outperforms the state-of-the-art by a large margin, achieving a dice similarity coefficient of 90.01$\pm$1.23 and a mean absolute error of 2.72$\pm$0.58 mm for MD. The results demonstrate the potential of EaMtNet as a reliable clinical-aided tool for medical image analysis.
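
A minimal sketch of the Sobel edge extraction step mentioned in the abstract: fixed horizontal and vertical kernels applied with a convolution yield the edge map that is later fused with CNN features. This is a generic implementation, not the authors' code.

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(image):
    """image: (B, 1, H, W) grayscale tensor. Returns the gradient magnitude map."""
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).view(1, 1, 3, 3).to(image)
    ky = kx.transpose(2, 3)                          # vertical-gradient kernel
    gx = F.conv2d(image, kx, padding=1)
    gy = F.conv2d(image, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)      # edge magnitude per pixel
```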