cs.CV - 2023-10-05

Diffusion Models as Masked Audio-Video Learners

  • paper_url: http://arxiv.org/abs/2310.03937
  • repo_url: None
  • paper_authors: Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
  • for: This paper explores the synergy between diffusion models and the MAViL framework, seeking to learn stronger audio-video representations.
  • methods: It builds on Masked Audio-Video Learners (MAViL), which couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities, and studies incorporating diffusion into MAViL together with training-efficiency techniques such as a masking-ratio curriculum and adaptive batch sizing (a sketch of these two techniques follows the abstract).
  • results: Incorporating diffusion into MAViL reduces pre-training Floating-Point Operations (FLOPS) by 32% and pre-training wall-clock time by 18%, without compromising performance on downstream audio-classification tasks.
    Abstract Over the past several years, the synchronization between audio and visual signals has been leveraged to learn richer audio-visual representations. Aided by the large availability of unlabeled videos, many unsupervised training frameworks have demonstrated impressive results in various downstream audio and video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a state-of-the-art audio-video pre-training framework. MAViL couples contrastive learning with masked autoencoding to jointly reconstruct audio spectrograms and video frames by fusing information from both modalities. In this paper, we study the potential synergy between diffusion models and MAViL, seeking to derive mutual benefits from these two frameworks. The incorporation of diffusion into MAViL, combined with various training efficiency methodologies that include the utilization of a masking ratio curriculum and adaptive batch sizing, results in a notable 32% reduction in pre-training Floating-Point Operations (FLOPS) and an 18% decrease in pre-training wall clock time. Crucially, this enhanced efficiency does not compromise the model's performance in downstream audio-classification tasks when compared to MAViL's performance.
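As a rough illustration of the two training-efficiency ideas named in the abstract, the sketch below implements a linear masking-ratio curriculum and a batch size that adapts so the number of visible (unmasked) tokens per step stays roughly constant. The schedule shape, ratios, and the constant-token heuristic are assumptions for illustration, not the paper's exact settings.

```python
# Hedged sketch: masking-ratio curriculum + adaptive batch sizing (illustrative values).

def masking_ratio(step: int, total_steps: int,
                  start: float = 0.6, end: float = 0.9) -> float:
    """Linearly ramp the masking ratio from `start` to `end` over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)

def adaptive_batch_size(base_batch: int, base_ratio: float, ratio: float) -> int:
    """Scale the batch so the number of visible (unmasked) tokens per step
    stays roughly constant as the masking ratio grows."""
    visible_base = 1.0 - base_ratio
    visible_now = 1.0 - ratio
    return max(1, int(round(base_batch * visible_base / visible_now)))

if __name__ == "__main__":
    total = 100_000
    for step in (0, 50_000, 100_000):
        r = masking_ratio(step, total)
        b = adaptive_batch_size(base_batch=256, base_ratio=0.6, ratio=r)
        print(f"step {step:>6}: mask ratio {r:.2f}, batch size {b}")
```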

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

  • paper_url: http://arxiv.org/abs/2310.03923
  • repo_url: https://github.com/UARK-AICV/OpenFusion
  • paper_authors: Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Thang Pham, Minh Tran, Gianfranco Doretto, Anh Nguyen, Ngan Le
  • for: This paper proposes a real-time open-vocabulary 3D mapping and queryable scene representation method using RGB-D data.
  • methods: The method harnesses a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction; region-based embeddings and their associated confidence maps are integrated with 3D knowledge from the TSDF through an enhanced Hungarian-based feature-matching mechanism (see the sketch below).
  • results: On the ScanNet dataset, Open-Fusion clearly outperforms leading zero-shot methods and delivers annotation-free 3D segmentation for open vocabularies without additional 3D training; by combining the strengths of region-based VLFM and TSDF, it enables real-time 3D scene comprehension that includes object concepts and open-world semantics.
    Abstract Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
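The abstract mentions an enhanced Hungarian-based feature-matching mechanism for fusing VLFM region embeddings with TSDF-derived 3D knowledge. The sketch below shows plain Hungarian matching with a cosine cost via `scipy.optimize.linear_sum_assignment`; the variable names, feature sources, and cost function are illustrative assumptions rather than Open-Fusion's actual formulation.

```python
# Hedged sketch: Hungarian (linear-assignment) matching between 2D region embeddings
# and per-instance 3D features aggregated from a TSDF volume.
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_cost(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cost matrix = 1 - cosine similarity between rows of a (regions) and b (3D instances)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return 1.0 - a @ b.T

region_embeddings = np.random.randn(5, 512)   # stand-in for VLFM region features
instance_features = np.random.randn(7, 512)   # stand-in for features pooled per TSDF segment

cost = cosine_cost(region_embeddings, instance_features)
rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm
for r, c in zip(rows, cols):
    print(f"region {r} -> 3D instance {c} (cost {cost[r, c]:.3f})")
```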

Coloring Deep CNN Layers with Activation Hue Loss

  • paper_url: http://arxiv.org/abs/2310.03911
  • repo_url: None
  • paper_authors: Louis-François Bouchard, Mohsen Ben Lazreg, Matthew Toews
  • for: This paper proposes a new model of deep convolutional neural network (CNN) activation space, the "activation hue", used to regularize models for more effective learning.
  • methods: Based on nearest-neighbor indexing of activation vectors from pre-trained networks, the authors observe that class-informative activations concentrate about an angle $\theta$, both in the $(x,y)$ image plane and in multi-channel activation space; a regularization term in the form of hue-like angular $\theta$ labels is proposed to complement the standard one-hot loss (see the sketch below).
  • results: Training from scratch with the combined one-hot + activation-hue loss modestly improves classification performance on a wide variety of tasks, including ImageNet.
    Abstract This paper proposes a novel hue-like angular parameter to model the structure of deep convolutional neural network (CNN) activation space, referred to as the {\em activation hue}, for the purpose of regularizing models for more effective learning. The activation hue generalizes the notion of color hue angle in standard 3-channel RGB intensity space to $N$-channel activation space. A series of observations based on nearest neighbor indexing of activation vectors with pre-trained networks indicate that class-informative activations are concentrated about an angle $\theta$ in both the $(x,y)$ image plane and in multi-channel activation space. A regularization term in the form of hue-like angular $\theta$ labels is proposed to complement standard one-hot loss. Training from scratch using combined one-hot + activation hue loss improves classification performance modestly for a wide variety of classification tasks, including ImageNet.
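For readers unfamiliar with the hue-angle idea being generalized, the sketch below computes the standard RGB hue angle and one possible way to read an analogous angle off an N-channel activation vector by projecting onto two fixed reference directions. The N-channel projection is purely an illustrative assumption; the paper defines its own generalization of hue to activation space.

```python
# Hedged sketch: RGB hue angle, and an assumed N-channel analogue via two reference directions.
import numpy as np

def rgb_hue_angle(r: float, g: float, b: float) -> float:
    """Hue angle (radians) of an RGB triple -- the quantity the paper generalizes."""
    return float(np.arctan2(np.sqrt(3.0) * (g - b), 2.0 * r - g - b))

def activation_angle(v: np.ndarray, u1: np.ndarray, u2: np.ndarray) -> float:
    """Angle of an N-channel activation v w.r.t. two fixed orthonormal reference directions."""
    return float(np.arctan2(v @ u2, v @ u1))

rng = np.random.default_rng(0)
v = rng.standard_normal(256)                                   # an activation vector
u1, u2 = np.linalg.qr(rng.standard_normal((256, 2)))[0].T      # orthonormal reference pair
print(rgb_hue_angle(0.8, 0.2, 0.1), activation_angle(v, u1, u2))
```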

TWICE Dataset: Digital Twin of Test Scenarios in a Controlled Environment

  • paper_url: http://arxiv.org/abs/2310.03895
  • repo_url: None
  • paper_authors: Leonardo Novicki Neto, Fabio Reway, Yuri Poledna, Maikol Funk Drechsler, Eduardo Parente Ribeiro, Werner Huber, Christian Icking
  • for: This work provides a dataset for developing autonomous vehicles that operate safely and reliably under adverse weather, with sensor data acquired on a real test track and reproduced in the laboratory for the same test scenarios.
  • methods: The dataset includes camera, radar, LiDAR, IMU, and GPS data recorded under adverse weather conditions (rain, night-time, and snow), covering test scenarios with objects of interest such as cars, cyclists, trucks, and pedestrians, some inspired by EURONCAP; the laboratory data is generated by simulation-based tests in a hardware-in-the-loop environment with the digital twin of each real scenario.
  • results: The dataset contains more than 2 hours of recordings totaling over 280 GB, making it a valuable resource for testing and improving autonomous-driving algorithms in adverse weather and for exploring the simulation-to-reality gap. Download: https://twicedataset.github.io/site/
    Abstract Ensuring the safe and reliable operation of autonomous vehicles under adverse weather remains a significant challenge. To address this, we have developed a comprehensive dataset composed of sensor data acquired in a real test track and reproduced in the laboratory for the same test scenarios. The provided dataset includes camera, radar, LiDAR, inertial measurement unit (IMU), and GPS data recorded under adverse weather conditions (rainy, night-time, and snowy conditions). We recorded test scenarios using objects of interest such as car, cyclist, truck and pedestrian -- some of which are inspired by EURONCAP (European New Car Assessment Programme). The sensor data generated in the laboratory is acquired by the execution of simulation-based tests in hardware-in-the-loop environment with the digital twin of each real test scenario. The dataset contains more than 2 hours of recording, which totals more than 280GB of data. Therefore, it is a valuable resource for researchers in the field of autonomous vehicles to test and improve their algorithms in adverse weather conditions, as well as explore the simulation-to-reality gap. The dataset is available for download at: https://twicedataset.github.io/site/

Characterizing the Features of Mitotic Figures Using a Conditional Diffusion Probabilistic Model

  • paper_url: http://arxiv.org/abs/2310.03893
  • repo_url: https://github.com/cagladbahadir/dpm-for-mitotic-figures
  • paper_authors: Cagla Deniz Bahadir, Benjamin Liechty, David J. Pisapia, Mert R. Sabuncu
  • for: This study aims to characterize the inherent uncertainty of mitosis labels and to describe the mitotic-figure classification task in a human-interpretable way.
  • methods: A conditional diffusion probabilistic model is trained to synthesize patches of cell nuclei for a given mitosis label, producing sequences of synthetic images of the same nucleus transitioning into the mitotic state; these reveal image features associated with mitosis such as cytoplasm granularity, nuclear density, nuclear irregularity, and high contrast between the nucleus and the cell body.
  • results: The approach offers pathologists a new tool to interpret and communicate the features driving the decision to recognize a mitotic figure.
    Abstract Mitotic figure detection in histology images is a hard-to-define, yet clinically significant task, where labels are generated with pathologist interpretations and where there is no ``gold-standard'' independent ground-truth. However, it is well-established that these interpretation based labels are often unreliable, in part, due to differences in expertise levels and human subjectivity. In this paper, our goal is to shed light on the inherent uncertainty of mitosis labels and characterize the mitotic figure classification task in a human interpretable manner. We train a probabilistic diffusion model to synthesize patches of cell nuclei for a given mitosis label condition. Using this model, we can then generate a sequence of synthetic images that correspond to the same nucleus transitioning into the mitotic state. This allows us to identify different image features associated with mitosis, such as cytoplasm granularity, nuclear density, nuclear irregularity and high contrast between the nucleus and the cell body. Our approach offers a new tool for pathologists to interpret and communicate the features driving the decision to recognize a mitotic figure.

FNOSeg3D: Resolution-Robust 3D Image Segmentation with Fourier Neural Operator

  • paper_url: http://arxiv.org/abs/2310.03872
  • repo_url: https://github.com/ibm/multimodal-3d-image-segmentation
  • paper_authors: Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood
  • for: This paper addresses the computational cost of 3D medical image segmentation: training on downsampled images avoids out-of-memory errors, but standard spatial convolution is sensitive to resolution, so accuracy degrades when the trained model is applied at the original resolution.
  • methods: It proposes FNOSeg3D, a 3D segmentation model built on the Fourier neural operator (FNO), which offers zero-shot super-resolution and a global receptive field; the authors improve the FNO by reducing its parameter requirement and adding residual connections and deep supervision, yielding a parameter-efficient, resolution-robust model (a spectral-convolution sketch follows the abstract).
  • results: On the BraTS'19 dataset, FNOSeg3D is more robust to training image resolution than the other tested models while using less than 1% of their parameters.
    Abstract Due to the computational complexity of 3D medical image segmentation, training with downsampled images is a common remedy for out-of-memory errors in deep learning. Nevertheless, as standard spatial convolution is sensitive to variations in image resolution, the accuracy of a convolutional neural network trained with downsampled images can be suboptimal when applied on the original resolution. To address this limitation, we introduce FNOSeg3D, a 3D segmentation model robust to training image resolution based on the Fourier neural operator (FNO). The FNO is a deep learning framework for learning mappings between functions in partial differential equations, which has the appealing properties of zero-shot super-resolution and global receptive field. We improve the FNO by reducing its parameter requirement and enhancing its learning capability through residual connections and deep supervision, and these result in our FNOSeg3D model which is parameter efficient and resolution robust. When tested on the BraTS'19 dataset, it achieved superior robustness to training image resolution than other tested models with less than 1% of their model parameters.
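A minimal sketch of the 3D spectral convolution at the heart of FNO-style models such as FNOSeg3D is given below: the input is transformed with an FFT, a learnable complex weight mixes channels on a fixed set of low-frequency modes, and the result is transformed back. Because the weights act on frequency modes rather than pixels, the layer can be evaluated at resolutions other than the training one. Mode counts and channel sizes are illustrative, and the paper's residual connections and deep supervision are omitted.

```python
# Hedged sketch: a minimal 3D Fourier layer (spectral convolution), not the full FNOSeg3D model.
import torch
import torch.nn as nn

class SpectralConv3d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, modes: int = 8):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_ch * out_ch)
        # learnable complex weights for the retained low-frequency modes
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, modes, modes, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, D, H, W)
        x_ft = torch.fft.rfftn(x, dim=(-3, -2, -1))
        out_ft = torch.zeros(
            x.shape[0], self.weight.shape[1], *x_ft.shape[-3:],
            dtype=torch.cfloat, device=x.device
        )
        m = self.modes
        # mix channels on the lowest m modes along each spatial frequency axis
        out_ft[:, :, :m, :m, :m] = torch.einsum(
            "bixyz,ioxyz->boxyz", x_ft[:, :, :m, :m, :m], self.weight
        )
        return torch.fft.irfftn(out_ft, s=x.shape[-3:], dim=(-3, -2, -1))

layer = SpectralConv3d(in_ch=4, out_ch=16, modes=8)
print(layer(torch.randn(1, 4, 32, 32, 32)).shape)   # -> (1, 16, 32, 32, 32)
```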

Consistency Regularization Improves Placenta Segmentation in Fetal EPI MRI Time Series

  • paper_url: http://arxiv.org/abs/2310.03870
  • repo_url: https://github.com/firstmover/cr-seg
  • paper_authors: Yingcheng Liu, Neerav Karani, Neel Dey, S. Mazdak Abulnaga, Junshen Xu, P. Ellen Grant, Esra Abaci Turk, Polina Golland
  • for: This paper aims to improve automated 3D placenta segmentation from fetal EPI MRI time series, with the goal of advancing prenatal care.
  • methods: It proposes an effective semi-supervised learning method based on a consistency regularization loss, which enforces consistency under spatial transformations of the same image and temporal consistency across nearby frames in the time series (see the sketch below).
  • results: Experiments show improved overall segmentation accuracy, better performance on outliers and hard samples, and improved temporal coherency of the predictions, which could enable more accurate computation of temporal placental biomarkers.
    Abstract The placenta plays a crucial role in fetal development. Automated 3D placenta segmentation from fetal EPI MRI holds promise for advancing prenatal care. This paper proposes an effective semi-supervised learning method for improving placenta segmentation in fetal EPI MRI time series. We employ consistency regularization loss that promotes consistency under spatial transformation of the same image and temporal consistency across nearby images in a time series. The experimental results show that the method improves the overall segmentation accuracy and provides better performance for outliers and hard samples. The evaluation also indicates that our method improves the temporal coherency of the prediction, which could lead to more accurate computation of temporal placental biomarkers. This work contributes to the study of the placenta and prenatal clinical decision-making. Code is available at https://github.com/firstmover/cr-seg.
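The two consistency terms described in the abstract can be written down compactly. The sketch below uses a horizontal flip as the spatial transform and an MSE penalty for both the spatial and temporal terms; the choice of transform, loss, and weights are assumptions for illustration only.

```python
# Hedged sketch: spatial + temporal consistency penalties for a segmentation network.
import torch
import torch.nn.functional as F

def spatial_consistency(model, x):
    """Encourage model(T(x)) == T(model(x)) for a horizontal flip T."""
    flip = lambda t: torch.flip(t, dims=[-1])
    return F.mse_loss(model(flip(x)), flip(model(x)))

def temporal_consistency(model, x_t, x_tp1):
    """Encourage predictions of adjacent frames in the EPI time series to agree."""
    return F.mse_loss(model(x_t), model(x_tp1))

# tiny demo with an identity "model"; in practice these terms are added to the
# supervised loss with weights, e.g. loss = sup + w1*spatial + w2*temporal
net = lambda t: t
x_t, x_tp1 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
print(float(spatial_consistency(net, x_t)), float(temporal_consistency(net, x_t, x_tp1)))
```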

OpenIncrement: A Unified Framework for Open Set Recognition and Deep Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2310.03848
  • repo_url: https://github.com/gawainxu/openincremen
  • paper_authors: Jiawen Xu, Claas Grohnfeldt, Odej Kao
  • for: This paper addresses deep class-incremental learning when novel samples are assumed to be pre-identified for retraining; in practice, deep classifiers often misidentify such samples, leading to erroneous predictions and degraded performance.
  • methods: It proposes a deep class-incremental learning framework integrated with open set recognition, refining class-incrementally learned features to adapt them for distance-based open set recognition.
  • results: Experiments show the method outperforms state-of-the-art incremental learning techniques and achieves better open set recognition than baseline methods.
    Abstract In most works on deep incremental learning research, it is assumed that novel samples are pre-identified for neural network retraining. However, practical deep classifiers often misidentify these samples, leading to erroneous predictions. Such misclassifications can degrade model performance. Techniques like open set recognition offer a means to detect these novel samples, representing a significant area in the machine learning domain. In this paper, we introduce a deep class-incremental learning framework integrated with open set recognition. Our approach refines class-incrementally learned features to adapt them for distance-based open set recognition. Experimental results validate that our method outperforms state-of-the-art incremental learning techniques and exhibits superior performance in open set recognition compared to baseline methods.

Less is More: On the Feature Redundancy of Pretrained Models When Transferring to Few-shot Tasks

  • paper_url: http://arxiv.org/abs/2310.03843
  • repo_url: None
  • paper_authors: Xu Luo, Difan Zou, Lianli Gao, Zenglin Xu, Jingkuan Song
  • for: This paper studies how pretrained models transfer to downstream few-shot tasks, asking whether all dimensions of the pretrained features are useful when target data is scarce.
  • methods: Using linear probing (training a linear classifier on features frozen from the pretrained model), the authors analyze feature redundancy under few-shot conditions and propose adjusting feature magnitudes with a soft mask based on estimated feature importance (see the sketch below).
  • results: Pretrained features are extremely redundant in few-shot settings; for cases such as 5-way 1-shot tasks, using only 1% of the most important feature dimensions recovers the performance of the full representation. Most dimensions are redundant only under few-shot settings and gradually become useful as the number of shots increases; dimensions with high variance and small distance between class centroids act as confounders, and the proposed soft mask generally improves few-shot transfer across pretrained models and downstream datasets.
    Abstract Transferring a pretrained model to a downstream task can be as easy as conducting linear probing with target data, that is, training a linear classifier upon frozen features extracted from the pretrained model. As there may exist significant gaps between pretraining and downstream datasets, one may ask whether all dimensions of the pretrained features are useful for a given downstream task. We show that, for linear probing, the pretrained features can be extremely redundant when the downstream data is scarce, or few-shot. For some cases such as 5-way 1-shot tasks, using only 1\% of the most important feature dimensions is able to recover the performance achieved by using the full representation. Interestingly, most dimensions are redundant only under few-shot settings and gradually become useful when the number of shots increases, suggesting that feature redundancy may be the key to characterizing the "few-shot" nature of few-shot transfer problems. We give a theoretical understanding of this phenomenon and show how dimensions with high variance and small distance between class centroids can serve as confounding factors that severely disturb classification results under few-shot settings. As an attempt at solving this problem, we find that the redundant features are difficult to identify accurately with a small number of training samples, but we can instead adjust feature magnitude with a soft mask based on estimated feature importance. We show that this method can generally improve few-shot transfer performance across various pretrained models and downstream datasets.
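To make the feature-redundancy idea concrete, the sketch below scores each pretrained feature dimension, keeps the top 1% for hard selection, and alternatively rescales magnitudes with a soft mask. The between-class/within-class variance ratio used as the importance score is an illustrative stand-in for the paper's estimated feature importance, not its actual estimator.

```python
# Hedged sketch: feature importance scoring, hard top-1% selection, and soft masking.
import numpy as np

def importance_scores(feats: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-dimension between-class variance over within-class variance (illustrative)."""
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    between = centroids.var(axis=0)
    within = np.mean([feats[labels == c].var(axis=0) for c in classes], axis=0)
    return between / (within + 1e-8)

def soft_mask(feats: np.ndarray, scores: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Rescale feature magnitudes by a normalized importance weight (soft selection)."""
    w = np.exp(scores / temperature)
    return feats * (w / w.max())

rng = np.random.default_rng(0)
feats = rng.standard_normal((25, 512))        # e.g. 5-way 5-shot support features
labels = np.repeat(np.arange(5), 5)

s = importance_scores(feats, labels)
top = np.argsort(s)[::-1][: max(1, int(0.01 * feats.shape[1]))]   # hard top-1% selection
print("hard-selected dims:", top.shape[0], "| soft-masked shape:", soft_mask(feats, s).shape)
```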

HartleyMHA: Self-Attention in Frequency Domain for Resolution-Robust and Parameter-Efficient 3D Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.04466
  • repo_url: https://github.com/ibm/multimodal-3d-image-segmentation
  • paper_authors: Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood
  • for: This paper aims to make self-attention practical for 3D image segmentation, where its quadratic complexity in image size forces input downsampling during training and degrades accuracy when models are applied at the original resolution.
  • methods: Building on the FNO framework, the proposed HartleyMHA model uses the Hartley transform with shared parameters to reduce model size by orders of magnitude and applies self-attention in the frequency domain for more expressive and efficient high-order feature combination (the Hartley transform is sketched after the abstract).
  • results: On the BraTS'19 dataset, HartleyMHA achieves superior robustness to training image resolution compared with other tested models while using less than 1% of their parameters.
    Abstract With the introduction of Transformers, different attention-based models have been proposed for image segmentation with promising results. Although self-attention allows capturing of long-range dependencies, it suffers from a quadratic complexity in the image size especially in 3D. To avoid the out-of-memory error during training, input size reduction is usually required for 3D segmentation, but the accuracy can be suboptimal when the trained models are applied on the original image size. To address this limitation, inspired by the Fourier neural operator (FNO), we introduce the HartleyMHA model which is robust to training image resolution with efficient self-attention. FNO is a deep learning framework for learning mappings between functions in partial differential equations, which has the appealing properties of zero-shot super-resolution and global receptive field. We modify the FNO by using the Hartley transform with shared parameters to reduce the model size by orders of magnitude, and this allows us to further apply self-attention in the frequency domain for more expressive high-order feature combination with improved efficiency. When tested on the BraTS'19 dataset, it achieved superior robustness to training image resolution than other tested models with less than 1% of their model parameters.
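The Hartley transform that HartleyMHA operates in can be obtained directly from an FFT as H = Re(F) - Im(F); it is real-valued and, up to a 1/N factor, its own inverse, which makes parameter sharing in the frequency domain convenient. The sketch below only demonstrates this relation and is not the paper's architecture.

```python
# Hedged sketch: the discrete Hartley transform via an FFT, and its self-inverse property.
import numpy as np

def hartley(x: np.ndarray) -> np.ndarray:
    f = np.fft.fftn(x)
    return f.real - f.imag

x = np.random.randn(8, 8, 8)
h = hartley(x)
x_rec = hartley(h) / x.size      # applying the transform twice recovers N * x
print(np.allclose(x, x_rec))     # True
```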

Integrating Audio-Visual Features for Multimodal Deepfake Detection

  • paper_url: http://arxiv.org/abs/2310.03827
  • repo_url: None
  • paper_authors: Sneha Muppalla, Shan Jia, Siwei Lyu
  • for: This paper proposes an audio-visual deepfake detection method to improve multimodal detection accuracy.
  • methods: The method integrates fine-grained deepfake identification with binary classification, categorizing samples into four types by combining labels specific to each modality (see the sketch below).
  • results: Experiments show the method significantly improves detection under both intra-domain and cross-domain testing.
    Abstract Deepfakes are AI-generated media in which an image or video has been digitally modified. The advancements made in deepfake technology have led to privacy and security issues. Most deepfake detection techniques rely on the detection of a single modality. Existing methods for audio-visual detection do not always surpass that of the analysis based on single modalities. Therefore, this paper proposes an audio-visual-based method for deepfake detection, which integrates fine-grained deepfake identification with binary classification. We categorize the samples into four types by combining labels specific to each single modality. This method enhances the detection under intra-domain and cross-domain testing.
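The four-way categorization obtained by combining per-modality labels (real/fake audio crossed with real/fake video), together with the collapse back to a binary real/fake decision, might look like the sketch below; the specific class ordering is an assumption for illustration.

```python
# Hedged sketch: combining per-modality labels into four fine-grained classes.
from enum import IntEnum

class AVClass(IntEnum):
    REAL_AUDIO_REAL_VIDEO = 0
    REAL_AUDIO_FAKE_VIDEO = 1
    FAKE_AUDIO_REAL_VIDEO = 2
    FAKE_AUDIO_FAKE_VIDEO = 3

def combine_labels(audio_is_fake: bool, video_is_fake: bool) -> AVClass:
    return AVClass(2 * int(audio_is_fake) + int(video_is_fake))

def binary_label(c: AVClass) -> int:
    """Collapse the fine-grained class to the overall real(0)/fake(1) decision."""
    return int(c != AVClass.REAL_AUDIO_REAL_VIDEO)

c = combine_labels(audio_is_fake=True, video_is_fake=False)
print(c.name, "->", binary_label(c))
```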

WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.03821
  • repo_url: https://github.com/jacky121298/WLST
  • paper_authors: Tsung-Lin Tsou, Tsung-Han Wu, Winston H. Hsu
  • for: This paper targets domain adaptation for 3D object detection, particularly the underexplored but practical weakly-supervised setting that requires only limited labeling effort on the target domain.
  • methods: It proposes WLST, a general weak-labels-guided self-training framework that incorporates an autolabeler, which generates 3D pseudo labels from 2D bounding boxes, into the existing self-training pipeline to produce more robust and consistent pseudo labels for training on the target domain.
  • results: Extensive experiments demonstrate the effectiveness, robustness, and detector-agnosticism of WLST; it outperforms previous state-of-the-art methods on all evaluation tasks.
    Abstract In the field of domain adaptation (DA) on 3D object detection, most of the work is dedicated to unsupervised domain adaptation (UDA). Yet, without any target annotations, the performance gap between the UDA approaches and the fully-supervised approach is still noticeable, which is impractical for real-world applications. On the other hand, weakly-supervised domain adaptation (WDA) is an underexplored yet practical task that only requires few labeling effort on the target domain. To improve the DA performance in a cost-effective way, we propose a general weak labels guided self-training framework, WLST, designed for WDA on 3D object detection. By incorporating autolabeler, which can generate 3D pseudo labels from 2D bounding boxes, into the existing self-training pipeline, our method is able to generate more robust and consistent pseudo labels that would benefit the training process on the target domain. Extensive experiments demonstrate the effectiveness, robustness, and detector-agnosticism of our WLST framework. Notably, it outperforms previous state-of-the-art methods on all evaluation tasks.

ContactGen: Generative Contact Modeling for Grasp Generation

  • paper_url: http://arxiv.org/abs/2310.03740
  • repo_url: https://github.com/stevenlsw/contactgen
  • paper_authors: Shaowei Liu, Yang Zhou, Jimei Yang, Saurabh Gupta, Shenlong Wang
  • for: This paper proposes an object-centric contact representation for modeling hand-object interaction and grasp generation.
  • methods: The ContactGen representation comprises three components: a contact map indicating contact locations, a part map indicating the contacting hand part, and a direction map indicating the contact direction within each part; given an input object, a conditional generative model predicts ContactGen, and model-based optimization produces diverse, geometrically feasible grasps.
  • results: Experiments show the method generates high-fidelity and diverse human grasps for various objects. Project page: https://stevenlsw.github.io/contactgen/
    Abstract This paper presents a novel object-centric contact representation ContactGen for hand-object interaction. The ContactGen comprises three components: a contact map indicates the contact location, a part map represents the contact hand part, and a direction map tells the contact direction within each part. Given an input object, we propose a conditional generative model to predict ContactGen and adopt model-based optimization to predict diverse and geometrically feasible grasps. Experimental results demonstrate our method can generate high-fidelity and diverse human grasps for various objects. Project page: https://stevenlsw.github.io/contactgen/

Stylist: Style-Driven Feature Ranking for Robust Novelty Detection

  • paper_url: http://arxiv.org/abs/2310.03738
  • repo_url: None
  • paper_authors: Stefan Smeu, Elena Burceanu, Emanuela Haller, Andrei Liviu Nicolicioiu
  • for: Novelty detection aims to find samples that differ from the distribution of seen samples, but not all changes matter; the goal is to detect relevant semantic changes while remaining robust to irrelevant ones.
  • methods: The authors formalize the split into semantic (task-relevant) and style (irrelevant) changes and, leveraging pretrained large-scale model representations, propose Stylist, which scores each feature dimension by its distribution distance across environments and drops environment-biased features (see the sketch below).
  • results: Removing environment-biased features eliminates spurious correlations and improves novelty detection performance across multiple datasets containing both content and style shifts, including a large synthetic dataset with controlled degrees of spurious correlation.
    Abstract Novelty detection aims at finding samples that differ in some form from the distribution of seen samples. But not all changes are created equal. Data can suffer a multitude of distribution shifts, and we might want to detect only some types of relevant changes. Similar to works in out-of-distribution generalization, we propose to use the formalization of separating into semantic or content changes, that are relevant to our task, and style changes, that are irrelevant. Within this formalization, we define the robust novelty detection as the task of finding semantic changes while being robust to style distributional shifts. Leveraging pretrained, large-scale model representations, we introduce Stylist, a novel method that focuses on dropping environment-biased features. First, we compute a per-feature score based on the feature distribution distances between environments. Next, we show that our selection manages to remove features responsible for spurious correlations and improve novelty detection performance. For evaluation, we adapt domain generalization datasets to our task and analyze the methods behaviors. We additionally built a large synthetic dataset where we have control over the spurious correlations degree. We prove that our selection mechanism improves novelty detection algorithms across multiple datasets, containing both stylistic and content shifts.
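A minimal sketch of the per-feature environment score described in the abstract: each feature dimension is scored by how much its distribution shifts across training environments (here, a sum of pairwise 1D Wasserstein distances), and the highest-scoring, environment-biased dimensions are dropped. The exact per-feature distance and the drop fraction are illustrative assumptions.

```python
# Hedged sketch: ranking feature dimensions by distribution shift across environments.
import numpy as np
from itertools import combinations
from scipy.stats import wasserstein_distance

def feature_env_scores(features_per_env):
    """features_per_env: list of (n_i, d) arrays, one per training environment."""
    d = features_per_env[0].shape[1]
    scores = np.zeros(d)
    for a, b in combinations(features_per_env, 2):
        for j in range(d):
            scores[j] += wasserstein_distance(a[:, j], b[:, j])
    return scores

rng = np.random.default_rng(0)
envs = []
for i in range(3):
    e = rng.standard_normal((200, 16))
    e[:, :4] += 2.0 * i               # only the first 4 dimensions are environment-biased
    envs.append(e)

scores = feature_env_scores(envs)
drop = np.argsort(scores)[::-1][:4]   # most environment-biased ("style") dimensions
print("dropped dims:", sorted(drop.tolist()))   # expected: [0, 1, 2, 3]
```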

Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency

  • paper_url: http://arxiv.org/abs/2310.03734
  • repo_url: None
  • paper_authors: Tianhong Li, Sangnie Bhardwaj, Yonglong Tian, Han Zhang, Jarred Barber, Dina Katabi, Guillaume Lajoie, Huiwen Chang, Dilip Krishnan
  • for: This paper aims to improve the performance and generalization of vision-language generative models without relying on large corpora of paired image-text data for training.
  • methods: It introduces ITIT (InTegrating Image Text), a training paradigm grounded in cycle consistency that enables vision-language training on unpaired images and text, using a joint image-text encoder with disjoint image and text decoders for bidirectional image-to-text and text-to-image generation (see the sketch below).
  • results: Experiments show that ITIT with unpaired data exhibits scaling behavior similar to training on high-quality paired data, matching state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text examples.
    Abstract Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g. via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce $\textbf{ITIT}$ ($\textbf{I}$n$\textbf{T}$egrating $\textbf{I}$mage $\textbf{T}$ext): an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data. ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or texts. This is achieved by enforcing cycle consistency between the original unpaired samples and the cycle-generated counterparts. For instance, it generates a caption for a given input image and then uses the caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT with unpaired datasets exhibits similar scaling behavior as using high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models with orders of magnitude fewer (only 3M) paired image-text data.
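A sketch of the cycle-consistency idea: an unpaired image is captioned, the caption is used to regenerate an image, and the regenerated image is compared with the input (and symmetrically for unpaired text). The `encode`/`decode_*` interfaces are hypothetical, and gradient flow through the discrete caption step would need additional machinery that is omitted here; this is not the paper's actual implementation.

```python
# Hedged sketch: cycle losses over unpaired images and unpaired text (assumed model interface).
import torch
import torch.nn.functional as F

def image_cycle_loss(model, images: torch.Tensor) -> torch.Tensor:
    """Unpaired images: image -> caption -> image, compared with the input image."""
    captions = model.decode_text(model.encode(image=images))       # discrete tokens
    cycled = model.decode_image(model.encode(text=captions))
    return F.mse_loss(cycled, images)

def text_cycle_loss(model, text_tokens: torch.Tensor) -> torch.Tensor:
    """Unpaired text: text -> image -> caption logits, compared with the input tokens."""
    images = model.decode_image(model.encode(text=text_tokens))
    logits = model.decode_text_logits(model.encode(image=images))  # (B, L, vocab)
    return F.cross_entropy(logits.transpose(1, 2), text_tokens)

# One training step would combine a small paired loss with the two cycle terms:
# loss = paired_loss + lam_img * image_cycle_loss(model, unpaired_images) \
#                    + lam_txt * text_cycle_loss(model, unpaired_text)
```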

OMG-ATTACK: Self-Supervised On-Manifold Generation of Transferable Evasion Attacks

  • paper_url: http://arxiv.org/abs/2310.03707
  • repo_url: None
  • paper_authors: Ofir Bar Tal, Adi Haviv, Amit H. Bermano
  • for: Evasion attacks are used to test the robustness of trained neural networks by distorting inputs to induce misclassification.
  • methods: The paper introduces a self-supervised, computationally economical method that adapts representation-learning techniques to generate on-manifold adversarial examples encouraged to resemble the data distribution, designed for the unseen black-box setting.
  • results: Experiments across various models, unseen data categories, and defended models show the attacks are comparable to the state of the art against the model they were generated for and significantly more effective against unseen models, suggesting an important role for on-manifold evasion attacks when targeting unseen models.
    Abstract Evasion Attacks (EA) are used to test the robustness of trained neural networks by distorting input data to misguide the model into incorrect classifications. Creating these attacks is a challenging task, especially with the ever-increasing complexity of models and datasets. In this work, we introduce a self-supervised, computationally economical method for generating adversarial examples, designed for the unseen black-box setting. Adapting techniques from representation learning, our method generates on-manifold EAs that are encouraged to resemble the data distribution. These attacks are comparable in effectiveness compared to the state-of-the-art when attacking the model trained on, but are significantly more effective when attacking unseen models, as the attacks are more related to the data rather than the model itself. Our experiments consistently demonstrate the method is effective across various models, unseen data categories, and even defended models, suggesting a significant role for on-manifold EAs when targeting unseen models.

Drag View: Generalizable Novel View Synthesis with Unposed Imagery

  • paper_url: http://arxiv.org/abs/2310.03704
  • repo_url: None
  • paper_authors: Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Hanwen Jiang, Dejia Xu, Zehao Zhu, Dilin Wang, Zhangyang Wang
  • for: DragView is designed for generating novel views of unseen scenes from a single source image, with the ability to handle occlusion and flexible camera trajectories.
  • methods: DragView uses a sparse set of unposed multi-view images, a view-dependent modulation layer, and a transformer to decode ray features into final pixel intensities, all executed within a single feed-forward pass.
  • results: DragView showcases the capability to generalize to new scenes unseen during training, and consistently demonstrates superior performance in view synthesis quality compared to recent scene representation networks and generalizable NeRFs.
    Abstract We introduce DragView, a novel and interactive framework for generating novel views of unseen scenes. DragView initializes the new view from a single source image, and the rendering is supported by a sparse set of unposed multi-view images, all seamlessly executed within a single feed-forward pass. Our approach begins with users dragging a source view through a local relative coordinate system. Pixel-aligned features are obtained by projecting the sampled 3D points along the target ray onto the source view. We then incorporate a view-dependent modulation layer to effectively handle occlusion during the projection. Additionally, we broaden the epipolar attention mechanism to encompass all source pixels, facilitating the aggregation of initialized coordinate-aligned point features from other unposed views. Finally, we employ another transformer to decode ray features into final pixel intensities. Crucially, our framework does not rely on either 2D prior models or the explicit estimation of camera poses. During testing, DragView showcases the capability to generalize to new scenes unseen during training, also utilizing only unposed support images, enabling the generation of photo-realistic new views characterized by flexible camera trajectories. In our experiments, we conduct a comprehensive comparison of the performance of DragView with recent scene representation networks operating under pose-free conditions, as well as with generalizable NeRFs subject to noisy test camera poses. DragView consistently demonstrates its superior performance in view synthesis quality, while also being more user-friendly. Project page: https://zhiwenfan.github.io/DragView/.

LumiNet: The Bright Side of Perceptual Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2310.03669
  • repo_url: None
  • paper_authors: Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe, Ali Cheraghian, Fuad Rahman, Nabeel Mohammed, Shafin Rahman
  • for: This paper studies logit-based knowledge distillation and aims to strengthen its ability to transfer knowledge from teacher to student.
  • methods: It proposes LumiNet, a knowledge-transfer algorithm that introduces a perception matrix to recalibrate logits according to the model's representation capability; by analyzing intra-class dynamics, it reconstructs more granular inter-class relationships, and both teacher and student are mapped onto this refined matrix, with the student minimizing representational discrepancies.
  • results: On CIFAR-100, ImageNet, and MSCOCO, LumiNet is competitive with leading feature-based methods; in transfer learning, student features trained with this method adapt well to downstream tasks such as Tiny ImageNet, underscoring its versatility and robustness.
    Abstract In knowledge distillation research, feature-based methods have dominated due to their ability to effectively tap into extensive teacher models. In contrast, logit-based approaches are considered to be less adept at extracting hidden 'dark knowledge' from teachers. To bridge this gap, we present LumiNet, a novel knowledge-transfer algorithm designed to enhance logit-based distillation. We introduce a perception matrix that aims to recalibrate logits through adjustments based on the model's representation capability. By meticulously analyzing intra-class dynamics, LumiNet reconstructs more granular inter-class relationships, enabling the student model to learn a richer breadth of knowledge. Both teacher and student models are mapped onto this refined matrix, with the student's goal being to minimize representational discrepancies. Rigorous testing on benchmark datasets (CIFAR-100, ImageNet, and MSCOCO) attests to LumiNet's efficacy, revealing its competitive edge over leading feature-based methods. Moreover, in exploring the realm of transfer learning, we assess how effectively the student model, trained using our method, adapts to downstream tasks. Notably, when applied to Tiny ImageNet, the transferred features exhibit remarkable performance, further underscoring LumiNet's versatility and robustness in diverse settings. With LumiNet, we hope to steer the research discourse towards a renewed interest in the latent capabilities of logit-based knowledge distillation.

Certification of Deep Learning Models for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.03664
  • repo_url: None
  • paper_authors: Othmane Laousy, Alexandre Araujo, Guillaume Chassagnon, Nikos Paragios, Marie-Pierre Revel, Maria Vakalopoulou
  • for: This paper presents the first certification approach for deep learning models in medical image segmentation.
  • methods: The method is based on randomized smoothing combined with denoising diffusion probabilistic models (a certification sketch follows the abstract).
  • results: Extensive experiments on five public datasets of chest X-rays, skin lesions, and colonoscopies show that high certified Dice scores can be maintained even for highly perturbed images.
    Abstract In medical imaging, segmentation models have known a significant improvement in the past decade and are now used daily in clinical practice. However, similar to classification models, segmentation models are affected by adversarial attacks. In a safety-critical field like healthcare, certifying model predictions is of the utmost importance. Randomized smoothing has been introduced lately and provides a framework to certify models and obtain theoretical guarantees. In this paper, we present for the first time a certified segmentation baseline for medical imaging based on randomized smoothing and diffusion models. Our results show that leveraging the power of denoising diffusion probabilistic models helps us overcome the limits of randomized smoothing. We conduct extensive experiments on five public datasets of chest X-rays, skin lesions, and colonoscopies, and empirically show that we are able to maintain high certified Dice scores even for highly perturbed images. Our work represents the first attempt to certify medical image segmentation models, and we aspire for it to set a foundation for future benchmarks in this crucial and largely uncharted area.
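A hedged sketch of per-pixel randomized-smoothing certification in the style of Cohen et al., which is the general mechanism the paper builds on: predictions over Gaussian-noised copies are aggregated by a per-pixel majority vote, a Clopper-Pearson lower bound on the top-class probability is computed, and the certified radius is sigma times the Gaussian quantile of that bound. The denoising-diffusion component the paper adds is abstracted into `model`, and sample counts and parameters are illustrative.

```python
# Hedged sketch: per-pixel randomized smoothing certification for a segmentation model.
import numpy as np
from scipy.stats import norm, beta

def certify_pixelwise(model, image, num_classes, sigma, n=100, alpha=0.001):
    """Per-pixel majority vote over Gaussian-noised copies, with a certified radius."""
    votes = np.zeros((num_classes,) + image.shape[:2], dtype=np.int64)
    for _ in range(n):
        noisy = image + sigma * np.random.randn(*image.shape)
        pred = model(noisy)                                   # (H, W) integer label map
        for c in range(num_classes):
            votes[c] += (pred == c)
    top = votes.argmax(axis=0)
    count_top = votes.max(axis=0)
    # Clopper-Pearson lower confidence bound on the per-pixel top-class probability
    p_lower = beta.ppf(alpha, count_top, n - count_top + 1)
    radius = sigma * norm.ppf(np.clip(p_lower, 1e-6, 1 - 1e-6))
    abstain = p_lower <= 0.5                                  # cannot certify this pixel
    return top, np.where(abstain, 0.0, radius)

# toy "segmenter": thresholding; stands in for the diffusion-denoised segmentation model
toy_model = lambda x: (x > 0.5).astype(int)
labels, radii = certify_pixelwise(toy_model, np.random.rand(16, 16), num_classes=2, sigma=0.25)
print(labels.shape, float(radii.mean()))
```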

Robustness-Guided Image Synthesis for Data-Free Quantization

  • paper_url: http://arxiv.org/abs/2310.03661
  • repo_url: None
  • paper_authors: Jianhong Bai, Yuchen Yang, Huanpeng Chu, Hualiang Wang, Zuozhu Liu, Ruizhe Chen, Xiaoxuan He, Lianrui Mu, Chengfei Cai, Haoji Hu
  • for: This paper aims to improve data-free quantization by enriching the semantics and diversity of the synthesized images used in place of real training data.
  • methods: It proposes Robustness-Guided Image Synthesis (RIS): perturbations are introduced on the input and the model weights, inconsistency metrics are defined at the feature and prediction levels before and after perturbation, and a robustness optimization objective based on these inconsistencies enhances the semantics of synthetic images; diversity is encouraged by forcing the generator to synthesize images with small correlations in the label space (an inconsistency-metric sketch follows the abstract).
  • results: RIS achieves state-of-the-art performance for data-free quantization across various settings and can be extended to other data-free compression tasks.
    Abstract Quantization has emerged as a promising direction for model compression. Recently, data-free quantization has been widely studied as a promising method to avoid privacy concerns, which synthesizes images as an alternative to real training data. Existing methods use classification loss to ensure the reliability of the synthesized images. Unfortunately, even if these images are well-classified by the pre-trained model, they still suffer from low semantics and homogenization issues. Intuitively, these low-semantic images are sensitive to perturbations, and the pre-trained model tends to have inconsistent output when the generator synthesizes an image with poor semantics. To this end, we propose Robustness-Guided Image Synthesis (RIS), a simple but effective method to enrich the semantics of synthetic images and improve image diversity, further boosting the performance of downstream data-free compression tasks. Concretely, we first introduce perturbations on input and model weight, then define the inconsistency metrics at feature and prediction levels before and after perturbations. On the basis of inconsistency on two levels, we design a robustness optimization objective to enhance the semantics of synthetic images. Moreover, we also make our approach diversity-aware by forcing the generator to synthesize images with small correlations in the label space. With RIS, we achieve state-of-the-art performance for various settings on data-free quantization and can be extended to other data-free compression tasks.
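The inconsistency metric described in the methods can be sketched as a divergence between the model's predictions before and after small input and weight perturbations; in RIS-style synthesis the generator would be pushed to reduce such a score alongside the usual class losses. The KL form and perturbation scales below are assumptions, not the paper's exact definitions.

```python
# Hedged sketch: prediction-level inconsistency under input and weight perturbations.
import copy
import torch
import torch.nn.functional as F

def prediction_inconsistency(model, x, input_eps=0.01, weight_eps=0.001):
    """Divergence between predictions with clean inputs/weights and perturbed ones."""
    log_p_clean = F.log_softmax(model(x), dim=1)

    noisy_model = copy.deepcopy(model)
    with torch.no_grad():                     # perturb the copied weights in place
        for p in noisy_model.parameters():
            p.add_(weight_eps * torch.randn_like(p))
    p_pert = F.softmax(noisy_model(x + input_eps * torch.randn_like(x)), dim=1)

    # KL(perturbed || clean): large for fragile, low-semantic synthetic images
    return F.kl_div(log_p_clean, p_pert, reduction="batchmean")

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(float(prediction_inconsistency(model, torch.randn(8, 3, 32, 32))))
```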

Visual inspection for illicit items in X-ray images using Deep Learning

  • paper_url: http://arxiv.org/abs/2310.03658
  • repo_url: None
  • paper_authors: Ioannis Mademlis, Georgios Batsis, Adamantia Anna Rebolledo Chrysochoou, Georgios Th. Papadopoulos
  • for: This work targets automated detection of illicit items in X-ray security imagery, which can improve public safety by raising the productivity of security officers and reducing their mental load in airports, subways, customs/post offices, and similar settings, where high passenger and parcel throughput during rush hours makes this effectively a Big Data problem.
  • methods: Modern computer vision algorithms based on Deep Neural Networks (DNNs), including fast single-stage object detectors suited to resource-constrained and embedded execution, are compared under a common evaluation protocol on a public, relevant dataset.
  • results: The results indicate the superiority of Transformer detectors, the obsolescence of auxiliary neural modules developed in recent years for security applications, and the efficiency of the CSP-DarkNet backbone CNN.
    Abstract Automated detection of contraband items in X-ray images can significantly increase public safety, by enhancing the productivity and alleviating the mental load of security officers in airports, subways, customs/post offices, etc. The large volume and high throughput of passengers, mailed parcels, etc., during rush hours practically make it a Big Data problem. Modern computer vision algorithms relying on Deep Neural Networks (DNNs) have proven capable of undertaking this task even under resource-constrained and embedded execution scenarios, e.g., as is the case with fast, single-stage object detectors. However, no comparative experimental assessment of the various relevant DNN components/methods has been performed under a common evaluation protocol, which means that reliable cross-method comparisons are missing. This paper presents exactly such a comparative assessment, utilizing a public relevant dataset and a well-defined methodology for selecting the specific DNN components/modules that are being evaluated. The results indicate the superiority of Transformer detectors, the obsolete nature of auxiliary neural modules that have been developed in the past few years for security applications and the efficiency of the CSP-DarkNet backbone CNN.

Wasserstein Distortion: Unifying Fidelity and Realism

  • paper_url: http://arxiv.org/abs/2310.03629
  • repo_url: None
  • paper_authors: Yang Qiu, Aaron B. Wagner, Johannes Ballé, Lucas Theis
  • for: This paper proposes an image distortion measure, Wasserstein distortion, that simultaneously generalizes pixel-level fidelity and realism.
  • methods: The authors analyze Wasserstein distortion under different parameter choices, showing that it reduces mathematically to a pure fidelity constraint or a pure realism constraint, and note connections to models of the human visual system (an illustrative sketch follows the abstract).
  • results: Pairs of images that are close under Wasserstein distortion illustrate its utility; for example, the authors generate random textures that have high fidelity to a reference texture at one location and smoothly transition to an independent realization of the texture away from that point.
    Abstract We introduce a distortion measure for images, Wasserstein distortion, that simultaneously generalizes pixel-level fidelity on the one hand and realism on the other. We show how Wasserstein distortion reduces mathematically to a pure fidelity constraint or a pure realism constraint under different parameter choices. Pairs of images that are close under Wasserstein distortion illustrate its utility. In particular, we generate random textures that have high fidelity to a reference texture in one location of the image and smoothly transition to an independent realization of the texture as one moves away from this point. Connections between Wasserstein distortion and models of the human visual system are noted.
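To give intuition for a measure that interpolates between fidelity and realism, the sketch below compares local feature distributions around each position, with a pooling width acting as the knob: a tiny width behaves like a pointwise (fidelity) error, while a large width compares overall distributions (realism). The Gaussian pooling and 1D Wasserstein distance here are illustrative stand-ins, not the paper's definition of Wasserstein distortion.

```python
# Hedged sketch: a locality parameter that trades off pixel fidelity against realism.
import numpy as np
from scipy.stats import wasserstein_distance

def local_distortion(x: np.ndarray, y: np.ndarray, sigma: float) -> float:
    """x, y: 1D feature signals; sigma: pooling width (small ~ fidelity, large ~ realism)."""
    n = len(x)
    idx = np.arange(n)
    total = 0.0
    for i in range(n):
        w = np.exp(-0.5 * ((idx - i) / max(sigma, 1e-6)) ** 2)
        w = w / w.sum()
        total += wasserstein_distance(x, y, u_weights=w, v_weights=w)
    return total / n

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
y = np.roll(x, 16)                          # same "texture", shifted content
print(local_distortion(x, y, sigma=0.5))    # small width ~ pixel error: large
print(local_distortion(x, y, sigma=50.0))   # large width ~ distribution match: near zero
```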

High-Degrees-of-Freedom Dynamic Neural Fields for Robot Self-Modeling and Motion Planning

  • paper_url: http://arxiv.org/abs/2310.03624
  • repo_url: None
  • paper_authors: Lennart Schulze, Hod Lipson
  • for: The paper develops a robot self-model that can be used for motion planning in the absence of classical geometric kinematic models, reducing human involvement and enabling truly autonomous agents.
  • methods: Neural fields allow the robot to self-model its kinematics as a neural-implicit query model learned only from 2D images annotated with camera poses and configurations, using a curricular data sampling strategy and an encoder-based neural density field architecture for dynamic object-centric scenes conditioned on a high number of degrees of freedom (DOFs).
  • results: In a 7-DOF robot test setup, the learned self-model achieves a Chamfer-L2 distance of 2% of the robot's workspace dimension, and its usefulness is demonstrated on a motion planning task.
    Abstract A robot self-model is a task-agnostic representation of the robot's physical morphology that can be used for motion planning tasks in absence of classical geometric kinematic models. In particular, when the latter are hard to engineer or the robot's kinematics change unexpectedly, human-free self-modeling is a necessary feature of truly autonomous agents. In this work, we leverage neural fields to allow a robot to self-model its kinematics as a neural-implicit query model learned only from 2D images annotated with camera poses and configurations. This enables significantly greater applicability than existing approaches which have been dependent on depth images or geometry knowledge. To this end, alongside a curricular data sampling strategy, we propose a new encoder-based neural density field architecture for dynamic object-centric scenes conditioned on high numbers of degrees of freedom (DOFs). In a 7-DOF robot test setup, the learned self-model achieves a Chamfer-L2 distance of 2% of the robot's workspace dimension. We demonstrate the capabilities of this model on a motion planning task as an exemplary downstream application.

Animatable Virtual Humans: Learning pose-dependent human representations in UV space for interactive performance synthesis

  • paper_url: http://arxiv.org/abs/2310.03615
  • repo_url: None
  • paper_authors: Wieland Morgenstern, Milena T. Bagdasarian, Anna Hilsmann, Peter Eisert
  • for: This paper proposes a novel representation of virtual humans for highly realistic real-time animation and rendering in 3D applications.
  • methods: Pose-dependent appearance and geometry are learned from highly accurate dynamic mesh sequences obtained from state-of-the-art multi-view video reconstruction; instead of learning absolute pose-dependent geometry, the network learns the difference to a fitted SMPL model, encoding both appearance and geometry in the consistent UV space of SMPL.
  • results: The approach learns pose-dependent appearance and geometry effectively, ensures a high level of realism, and facilitates streamlined processing and rendering of virtual humans in real-time scenarios.
    Abstract We propose a novel representation of virtual humans for highly realistic real-time animation and rendering in 3D applications. We learn pose dependent appearance and geometry from highly accurate dynamic mesh sequences obtained from state-of-the-art multiview-video reconstruction. Learning pose-dependent appearance and geometry from mesh sequences poses significant challenges, as it requires the network to learn the intricate shape and articulated motion of a human body. However, statistical body models like SMPL provide valuable a-priori knowledge which we leverage in order to constrain the dimension of the search space enabling more efficient and targeted learning and define pose-dependency. Instead of directly learning absolute pose-dependent geometry, we learn the difference between the observed geometry and the fitted SMPL model. This allows us to encode both pose-dependent appearance and geometry in the consistent UV space of the SMPL model. This approach not only ensures a high level of realism but also facilitates streamlined processing and rendering of virtual humans in real-time scenarios.

How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound

  • paper_url: http://arxiv.org/abs/2310.03608
  • repo_url: https://github.com/global-health-labs/us-dcgan
  • paper_authors: Menghan Yu, Sourabh Kulhare, Courosh Mehanian, Charles B Delahunt, Daniel E Shea, Zohreh Laverriere, Ishan Shah, Matthew P Horning
  • for: Propose a comprehensive framework that fits into model development workflows for medical image analysis.
  • methods: Uses generative models as a data augmentation method and adversarial methods to protect patient privacy via data substitution.
  • results: Training on a mix of real and synthetic data outperforms training on real data alone, and models trained solely on synthetic data approach the performance of their real-only counterparts.
    Abstract Acquiring large quantities of data and annotations is known to be effective for developing high-performing deep learning models, but is difficult and expensive to do in the healthcare context. Adding synthetic training data using generative models offers a low-cost method to deal effectively with the data scarcity challenge, and can also address data imbalance and patient privacy issues. In this study, we propose a comprehensive framework that fits seamlessly into model development workflows for medical image analysis. We demonstrate, with datasets of varying size, (i) the benefits of generative models as a data augmentation method; (ii) how adversarial methods can protect patient privacy via data substitution; (iii) novel performance metrics for these use cases by testing models on real holdout data. We show that training with both synthetic and real data outperforms training with real data alone, and that models trained solely with synthetic data approach their real-only counterparts. Code is available at https://github.com/Global-Health-Labs/US-DCGAN.
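The core training recipe, mixing real and GAN-generated images while always evaluating on a real holdout set, can be sketched in a few lines. The dataset constructors below are placeholders for the real loaders; the released pipeline in the repository above is the authoritative version.

```python
# Minimal sketch of the data-mixing setup described above (placeholder data,
# not the released pipeline): real and synthetic images are concatenated for
# training, while metrics are always computed on held-out real images.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def make_dataset(n: int, label: int) -> TensorDataset:
    # Stand-in for an image folder; replace with real/synthetic image loaders.
    return TensorDataset(torch.randn(n, 1, 64, 64), torch.full((n,), label))

real_train = make_dataset(200, label=0)
synthetic = make_dataset(800, label=0)      # images produced by the generative model
real_holdout = make_dataset(100, label=0)   # never mixed with synthetic data

train_loader = DataLoader(ConcatDataset([real_train, synthetic]),
                          batch_size=32, shuffle=True)
eval_loader = DataLoader(real_holdout, batch_size=32)
```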

Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

  • paper_url: http://arxiv.org/abs/2310.03602
  • repo_url: None
  • paper_authors: Chuan Fang, Xiaotao Hu, Kunming Luo, Ping Tan
  • for: Generate high-quality 3D indoor scenes from text prompts while allowing users to interactively edit individual objects.
  • methods: Decouples layout from appearance: a text-conditional diffusion model first learns the scene-layout distribution, and a fine-tuned ControlNet then generates a high-quality panoramic image of the room guided by the layout and the text prompt.
  • results: Produces convincing, view-consistent 3D rooms; a mask-guided editing module supports interactive edits without expensive editing-specific training, and extensive experiments on the Structured3D dataset show the method outperforms existing approaches for generating 3D scenes from text prompts.
    Abstract Text-driven 3D indoor scene generation could be useful for gaming, film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which is able to generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. To this end, our proposed method consists of two stages, a `Layout Generation Stage' and an `Appearance Generation Stage'. The `Layout Generation Stage' trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the `Appearance Generation Stage' employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. In this way, we achieve a high-quality 3D room with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive editing-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.

BID-NeRF: RGB-D image pose estimation with inverted Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.03563
  • repo_url: None
  • paper_authors: Ágoston István Csehi, Csaba Máté Józsa
  • for: Improve the Inverted Neural Radiance Fields (iNeRF) algorithm, which formulates image pose estimation as a NeRF-based iterative optimization problem.
  • methods: Extends the localization objective with a depth-based loss function, introduces a multi-image loss that uses a sequence of images with known relative poses without increasing computational complexity, and omits hierarchical sampling during volumetric rendering so that only the coarse model is used for pose estimation.
  • results: The modifications significantly improve convergence speed and substantially extend the basin of convergence, allowing estimation from larger initial pose errors.
    Abstract We aim to improve the Inverted Neural Radiance Fields (iNeRF) algorithm which defines the image pose estimation problem as a NeRF based iterative linear optimization. NeRFs are novel neural space representation models that can synthesize photorealistic novel views of real-world scenes or objects. Our contributions are as follows: we extend the localization optimization objective with a depth-based loss function, we introduce a multi-image based loss function where a sequence of images with known relative poses are used without increasing the computational complexity, we omit hierarchical sampling during volumetric rendering, meaning only the coarse model is used for pose estimation, and we show that by extending the sampling interval, convergence can be achieved even for higher initial pose estimate errors. With the proposed modifications the convergence speed is significantly improved, and the basin of convergence is substantially extended.
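A rough sketch of the depth-augmented objective described above is shown below: the pose parameters are the only optimized variables, and the loss combines a photometric term with a depth term over valid pixels. The rendering function is a differentiable placeholder standing in for coarse NeRF rendering, and the weights and pose parameterization are assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of NeRF-inversion-style
# pose estimation with an added depth term against an observed RGB-D frame.
import torch

def render_rgbd(pose6: torch.Tensor):
    # Placeholder for coarse NeRF rendering at the given pose (axis-angle +
    # translation). A real implementation would ray-march the trained field.
    rgb = torch.sigmoid(pose6[:3]).expand(32, 32, 3)
    depth = torch.relu(pose6[3:].sum()).expand(32, 32)
    return rgb, depth

def pose_loss(pose6, rgb_obs, depth_obs, w_depth: float = 0.5):
    rgb_pred, depth_pred = render_rgbd(pose6)
    photometric = torch.mean((rgb_pred - rgb_obs) ** 2)
    valid = depth_obs > 0                       # ignore missing depth pixels
    depth_term = torch.mean((depth_pred[valid] - depth_obs[valid]) ** 2)
    return photometric + w_depth * depth_term

pose = torch.zeros(6, requires_grad=True)       # initial pose estimate
opt = torch.optim.Adam([pose], lr=1e-2)
rgb_obs, depth_obs = torch.rand(32, 32, 3), torch.rand(32, 32)
for _ in range(50):                             # iterative refinement
    opt.zero_grad()
    loss = pose_loss(pose, rgb_obs, depth_obs)
    loss.backward()
    opt.step()
```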

MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT Images

  • paper_url: http://arxiv.org/abs/2310.03559
  • repo_url: None
  • paper_authors: Yanwu Xu, Li Sun, Wei Peng, Shyam Visweswaran, Kayhan Batmanghelich
  • for: Present an innovative method for generating high-quality 3D lung CT images from textual information, which can enhance numerous downstream tasks.
  • methods: A hierarchical scheme with a modified UNet architecture synthesizes low-resolution images conditioned on text, and further generates vascular, airway, and lobular segmentation masks to ensure anatomical plausibility.
  • results: The approach outperforms state-of-the-art GAN- and diffusion-based models, especially in retaining crucial anatomical features such as fissure lines, airways, and vascular structures.
    Abstract This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information. While diffusion-based generative models are increasingly used in medical imaging, current state-of-the-art approaches are limited to low-resolution outputs and underutilize radiology reports' abundant information. The radiology reports can enhance the generation process by providing additional guidance and offering fine-grained control over the synthesis of images. Nevertheless, expanding text-guided generation to high-resolution 3D images poses significant memory and anatomical detail-preserving challenges. Addressing the memory issue, we introduce a hierarchical scheme that uses a modified UNet architecture. We start by synthesizing low-resolution images conditioned on the text, serving as a foundation for subsequent generators for complete volumetric data. To ensure the anatomical plausibility of the generated samples, we provide further guidance by generating vascular, airway, and lobular segmentation masks in conjunction with the CT images. The model demonstrates the capability to use textual input and segmentation tasks to generate synthesized images. The results of comparative assessments indicate that our approach exhibits superior performance compared to the most advanced models based on GAN and diffusion techniques, especially in accurately retaining crucial anatomical features such as fissure lines, airways, and vascular structures. This innovation introduces novel possibilities. This study focuses on two main objectives: (1) the development of a method for creating images based on textual prompts and anatomical components, and (2) the capability to generate new images conditioning on anatomical elements. The advancements in image generation can be applied to enhance numerous downstream tasks.

Towards Unified Deep Image Deraining: A Survey and A New Benchmark

  • paper_url: http://arxiv.org/abs/2310.03535
  • repo_url: None
  • paper_authors: Xiang Chen, Jinshan Pan, Jiangxin Dong, Jinhui Tang
  • for: Provide a unified evaluation setting for existing image deraining methods together with a new high-quality benchmark.
  • methods: Comprehensively reviews existing deraining approaches and evaluates them under a common protocol.
  • results: Introduces the HQ-RAIN benchmark of 5,000 paired high-resolution synthetic images and conducts extensive performance evaluation; an online platform tracks the latest deraining methods.
    Abstract Recent years have witnessed significant advances in image deraining due to effective image priors and deep learning models. As each deraining approach has individual settings (e.g., training and test datasets, evaluation criteria), how to fairly evaluate existing approaches comprehensively is not a trivial task. Although existing surveys aim to review image deraining approaches comprehensively, few of them focus on providing unified evaluation settings to examine deraining capability and practicality. In this paper, we provide a comprehensive review of existing image deraining methods and a unified evaluation setting to evaluate their performance. We construct a new high-quality benchmark named HQ-RAIN to further conduct extensive evaluation, consisting of 5,000 paired high-resolution synthetic images with higher harmony and realism. We also discuss the existing challenges and highlight several future research opportunities worth exploring. To facilitate the reproduction and tracking of the latest deraining technologies for general users, we build an online platform to provide the off-the-shelf toolkit, involving the large-scale performance evaluation. This online platform and the proposed new benchmark are publicly available and will be regularly updated at http://www.deraining.tech/.

3D-Aware Hypothesis & Verification for Generalizable Relative Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.03534
  • repo_url: None
  • paper_authors: Chen Zhao, Tong Zhang, Mathieu Salzmann
  • for: Estimate the relative pose of an object between a single reference view and a query image showing the object in a different pose.
  • methods: A hypothesis-and-verification framework generates and evaluates multiple pose hypotheses and selects the most reliable one; reliability is measured with a 3D-aware verification that explicitly applies 3D transformations to the 3D object representations learned from the two input images.
  • results: Extensive experiments on the Objaverse, LINEMOD, and CO3D datasets demonstrate superior relative-pose accuracy and robustness to large pose variations and unseen objects at test time.
    Abstract Prior methods that tackle the problem of generalizable object pose estimation highly rely on having dense views of the unseen object. By contrast, we address the scenario where only a single reference view of the object is available. Our goal then is to estimate the relative object pose between this reference view and a query image that depicts the object in a different pose. In this scenario, robust generalization is imperative due to the presence of unseen objects during testing and the large-scale object pose variation between the reference and the query. To this end, we present a new hypothesis-and-verification framework, in which we generate and evaluate multiple pose hypotheses, ultimately selecting the most reliable one as the relative object pose. To measure reliability, we introduce a 3D-aware verification that explicitly applies 3D transformations to the 3D object representations learned from the two input images. Our comprehensive experiments on the Objaverse, LINEMOD, and CO3D datasets evidence the superior accuracy of our approach in relative pose estimation and its robustness in large-scale pose variations, when dealing with unseen objects.

V2X Cooperative Perception for Autonomous Driving: Recent Advances and Challenges

  • paper_url: http://arxiv.org/abs/2310.03525
  • repo_url: None
  • paper_authors: Tao Huang, Jianan Liu, Xi Zhou, Dinh C. Nguyen, Mostafa Rahimi Azghadi, Yuxuan Xia, Qing-Long Han, Sumei Sun
  • for: Improve the safety and reliability of autonomous driving by advancing cooperative perception (CP) enabled by V2X communication.
  • methods: Reviews the evolution of CP technologies and V2X communication, proposes a contemporary generic framework for the V2X-based CP workflow, and categorizes prevailing V2X-based CP methods by the critical issues they address.
  • results: Provides an extensive literature review within this taxonomy, evaluates existing datasets and simulators, and discusses open challenges and future directions for CP in autonomous driving.
    Abstract Accurate perception is essential for advancing autonomous driving and addressing safety challenges in modern transportation systems. Despite significant advancements in computer vision for object recognition, current perception methods still face difficulties in complex real-world traffic environments. Challenges such as physical occlusion and limited sensor field of view persist for individual vehicle systems. Cooperative Perception (CP) with Vehicle-to-Everything (V2X) technologies has emerged as a solution to overcome these obstacles and enhance driving automation systems. While some research has explored CP's fundamental architecture and critical components, there remains a lack of comprehensive summaries of the latest innovations, particularly in the context of V2X communication technologies. To address this gap, this paper provides a comprehensive overview of the evolution of CP technologies, spanning from early explorations to recent developments, including advancements in V2X communication technologies. Additionally, a contemporary generic framework is proposed to illustrate the V2X-based CP workflow, aiding in the structured understanding of CP system components. Furthermore, this paper categorizes prevailing V2X-based CP methodologies based on the critical issues they address. An extensive literature review is conducted within this taxonomy, evaluating existing datasets and simulators. Finally, open challenges and future directions in CP for autonomous driving are discussed by considering both perception and V2X communication advancements.

PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification

  • paper_url: http://arxiv.org/abs/2310.03517
  • repo_url: None
  • paper_authors: Feihong He, Gang Li, Lingyu Si, Leilei Yan, Fanzhang Li, Fuchun Sun
  • for: To improve the performance of few-shot image classification, addressing the challenge of poor classification performance with limited samples in novel classes.
  • methods: Using a transformer architecture to build a prototype extraction module, extracting more discriminative class representations for few-shot classification, and employing a contrastive learning-based optimization approach to optimize prototype features in few-shot learning scenarios.
  • results: Experimented on several popular few-shot image classification benchmark datasets, showing that our method outperforms all current state-of-the-art methods, with remarkable performance. The code will be released later.
    Abstract Few-shot image classification has received considerable attention for addressing the challenge of poor classification performance with limited samples in novel classes. However, numerous studies have employed sophisticated learning strategies and diversified feature extraction methods to address this issue. In this paper, we propose our method called PrototypeFormer, which aims to significantly advance traditional few-shot image classification approaches by exploring prototype relationships. Specifically, we utilize a transformer architecture to build a prototype extraction module, aiming to extract class representations that are more discriminative for few-shot classification. Additionally, during the model training process, we propose a contrastive learning-based optimization approach to optimize prototype features in few-shot learning scenarios. Despite its simplicity, the method performs remarkably well, with no bells and whistles. We have experimented with our approach on several popular few-shot image classification benchmark datasets, which shows that our method outperforms all current state-of-the-art methods. In particular, our method achieves 97.07% and 90.88% on 5-way 5-shot and 5-way 1-shot tasks of miniImageNet, surpassing the previous state-of-the-art results by 7.27% and 8.72%, respectively. The code will be released later.
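A hedged sketch of the prototype-extraction idea follows: support embeddings are refined by a small transformer encoder, pooled into one prototype per class, and queries are scored by cosine similarity. Dimensions, the pooling step, and the classification rule are illustrative assumptions, not the released PrototypeFormer.

```python
# Minimal sketch (my assumptions, not the paper's model) of transformer-based
# prototype extraction and cosine-similarity classification for few-shot tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeExtractor(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, support: torch.Tensor) -> torch.Tensor:
        # support: (n_way, k_shot, dim) backbone features of the support set.
        refined = self.encoder(support)            # attention across the shots
        return refined.mean(dim=1)                 # (n_way, dim) prototypes

def classify(queries: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    # Cosine-similarity logits; higher similarity -> more likely class.
    q = F.normalize(queries, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    return q @ p.t()                                # (n_query, n_way)

extractor = PrototypeExtractor()
support = torch.randn(5, 5, 64)                     # 5-way 5-shot embeddings
queries = torch.randn(15, 64)
logits = classify(queries, extractor(support))      # (15, 5)
```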

Exploring DINO: Emergent Properties and Limitations for Synthetic Aperture Radar Imagery

  • paper_url: http://arxiv.org/abs/2310.03513
  • repo_url: None
  • paper_authors: Joseph A. Gallego-Mejia, Anna Jungbluth, Laura Martínez-Ferrer, Matt Allen, Francisco Dorr, Freddie Kalaitzis, Raúl Ramos-Pollán
  • for: Study the emergent properties and limitations of the Self-Distillation with No Labels (DINO) algorithm when applied to Synthetic Aperture Radar (SAR) imagery.
  • methods: Pre-trains a Vision Transformer (ViT)-based DINO model on unlabeled SAR data, fine-tunes it to predict high-resolution land cover maps, and rigorously evaluates the attention maps of the ViT backbone against the model's token embedding space.
  • results: Pre-training gives a small improvement over training from scratch; the ViT attention maps hold great intrinsic value for remote sensing and could provide useful inputs to other algorithms, laying the groundwork for bigger and better SSL models for Earth observation.
    Abstract Self-supervised learning (SSL) models have recently demonstrated remarkable performance across various tasks, including image segmentation. This study delves into the emergent characteristics of the Self-Distillation with No Labels (DINO) algorithm and its application to Synthetic Aperture Radar (SAR) imagery. We pre-train a vision transformer (ViT)-based DINO model using unlabeled SAR data, and later fine-tune the model to predict high-resolution land cover maps. We rigorously evaluate the utility of attention maps generated by the ViT backbone, and compare them with the model's token embedding space. We observe a small improvement in model performance with pre-training compared to training from scratch, and discuss the limitations and opportunities of SSL for remote sensing and land cover segmentation. Beyond small performance increases, we show that ViT attention maps hold great intrinsic value for remote sensing, and could provide useful inputs to other algorithms. With this, our work lays the ground-work for bigger and better SSL models for Earth Observation.
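For readers who want to reproduce the attention-map analysis, the snippet below pulls the CLS-token self-attention out of a pretrained DINO ViT-S/16 via torch.hub (this downloads weights, so it needs network access). The input here is a random placeholder; a real SAR band would be replicated to three channels and resized so that the side length is divisible by the patch size.

```python
# Minimal sketch of extracting CLS-token attention maps from a pretrained DINO
# ViT-S/16. The reshaping assumes a 224x224 input with 16x16 patches.
import torch

model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

img = torch.randn(1, 3, 224, 224)                  # placeholder SAR-like input
with torch.no_grad():
    attn = model.get_last_selfattention(img)       # (1, heads, tokens, tokens)

heads = attn.shape[1]
patches_per_side = 224 // 16
# Attention of the CLS token (index 0) over all image patches, one map per head.
cls_attn = attn[0, :, 0, 1:].reshape(heads, patches_per_side, patches_per_side)
print(cls_attn.shape)                              # torch.Size([6, 14, 14])
```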

RL-based Stateful Neural Adaptive Sampling and Denoising for Real-Time Path Tracing

  • paper_url: http://arxiv.org/abs/2310.03507
  • repo_url: https://github.com/ajsvb/rl_path_tracing
  • paper_authors: Antoine Scardigli, Lukas Cavigelli, Lorenz K. Müller
  • for: Improve the quality and speed of realistic image synthesis with Monte-Carlo path tracing at low sample counts.
  • methods: End-to-end training of a sampling importance network, a latent-space encoder network, and a denoiser network; reinforcement learning optimizes the sampling importance network, and all sampled values per pixel are kept and fed into the latent-space encoder instead of being averaged.
  • results: Improves visual quality on several challenging datasets and reduces rendering time for equal quality by a factor of 1.6x over the previous state of the art, making it a promising solution for real-time applications.
    Abstract Monte-Carlo path tracing is a powerful technique for realistic image synthesis but suffers from high levels of noise at low sample counts, limiting its use in real-time applications. To address this, we propose a framework with end-to-end training of a sampling importance network, a latent space encoder network, and a denoiser network. Our approach uses reinforcement learning to optimize the sampling importance network, thus avoiding explicit numerically approximated gradients. Our method does not aggregate the sampled values per pixel by averaging but keeps all sampled values which are then fed into the latent space encoder. The encoder replaces handcrafted spatiotemporal heuristics by learned representations in a latent space. Finally, a neural denoiser is trained to refine the output image. Our approach increases visual quality on several challenging datasets and reduces rendering times for equal quality by a factor of 1.6x compared to the previous state-of-the-art, making it a promising solution for real-time applications.

Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

  • paper_url: http://arxiv.org/abs/2310.03502
  • repo_url: https://github.com/ai-forever/movqgan
  • paper_authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
  • for: Propose a new latent diffusion architecture for higher-quality text-to-image synthesis.
  • methods: Combines an image prior model, trained separately to map text embeddings to CLIP image embeddings, with latent diffusion techniques and a modified MoVQ image autoencoder; the full model has 3.3B parameters.
  • results: Achieves a FID score of 8.03 on the COCO-30K dataset, the top result among open-source models in terms of measurable image generation quality; a user-friendly demo system, source code, and checkpoints are released.
    Abstract Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.

IceCloudNet: Cirrus and mixed-phase cloud prediction from SEVIRI input learned from sparse supervision

  • paper_url: http://arxiv.org/abs/2310.03499
  • repo_url: None
  • paper_authors: Kai Jeggle, Mikolaj Czerkawski, Federico Serva, Bertrand Le Saux, David Neubauer, Ulrike Lohmann
  • for: Provide regime-dependent observational constraints on ice microphysical properties at the spatio-temporal coverage of geostationary satellites and the quality of active satellite retrievals, to improve understanding of ice cloud processes in climate models and reduce uncertainty in climate projections.
  • methods: Trains a convolutional neural network on three years of SEVIRI and DARDAR data.
  • results: Creates a new observational constraint that enables research to improve ice cloud process understanding, reduce uncertainties in a changing climate, and help assess geoengineering methods for cirrus clouds.
    Abstract Clouds containing ice particles play a crucial role in the climate system. Yet they remain a source of great uncertainty in climate models and future climate projections. In this work, we create a new observational constraint of regime-dependent ice microphysical properties at the spatio-temporal coverage of geostationary satellite instruments and the quality of active satellite retrievals. We achieve this by training a convolutional neural network on three years of SEVIRI and DARDAR data sets. This work will enable novel research to improve ice cloud process understanding and hence, reduce uncertainties in a changing climate and help assess geoengineering methods for cirrus clouds.

BTDNet: a Multi-Modal Approach for Brain Tumor Radiogenomic Classification

  • paper_url: http://arxiv.org/abs/2310.03485
  • repo_url: None
  • paper_authors: Dimitrios Kollias, Karanjot Vendal, Priyanka Gadhavi, Solomon Russom
  • for: Predict the methylation status of the MGMT promoter in brain tumors.
  • methods: BTDNet fuses multi-parametric MRI scans (FLAIR, T1w, T1wCE, and T2 3D volumes), handling variable volume lengths and volume-level annotations with a data augmentation component, a CNN-RNN for global 3D analysis, a routing component with a mask layer, and a modality fusion component.
  • results: Outperforms state-of-the-art methods by large margins in the RSNA-ASNR-MICCAI BraTS 2021 Challenge, offering a promising avenue for brain tumor diagnosis and treatment.
    Abstract Brain tumors pose significant health challenges worldwide, with glioblastoma being one of the most aggressive forms. Accurate determination of the O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation status is crucial for personalized treatment strategies. However, traditional methods are labor-intensive and time-consuming. This paper proposes a novel multi-modal approach, BTDNet, leveraging multi-parametric MRI scans, including FLAIR, T1w, T1wCE, and T2 3D volumes, to predict MGMT promoter methylation status. BTDNet addresses two main challenges: the variable volume lengths (i.e., each volume consists of a different number of slices) and the volume-level annotations (i.e., the whole 3D volume is annotated and not the independent slices that it consists of). BTDNet consists of four components: i) the data augmentation one (that performs geometric transformations, convex combinations of data pairs and test-time data augmentation); ii) the 3D analysis one (that performs global analysis through a CNN-RNN); iii) the routing one (that contains a mask layer that handles variable input feature lengths), and iv) the modality fusion one (that effectively enhances data representation, reduces ambiguities and mitigates data scarcity). The proposed method outperforms by large margins the state-of-the-art methods in the RSNA-ASNR-MICCAI BraTS 2021 Challenge, offering a promising avenue for enhancing brain tumor diagnosis and treatment.

Ammonia-Net: A Multi-task Joint Learning Model for Multi-class Segmentation and Classification in Tooth-marked Tongue Diagnosis

  • paper_url: http://arxiv.org/abs/2310.03472
  • repo_url: None
  • paper_authors: Shunkai Shi, Yuqi Wang, Qihui Ye, Yanran Wang, Yiming Zhu, Muhammad Hassan, Aikaterini Melliou, Dongmei Yu
  • for: Address the challenges of manual diagnosis of tooth-marked tongue in traditional Chinese medicine by proposing a multi-task joint learning model named Ammonia-Net.
  • methods: A convolutional neural network-based architecture designed for multi-class segmentation and classification of tongue images: it performs semantic segmentation to identify the tongue and tooth marks, then classifies the images into the desired number of classes.
  • results: The model achieves 99.06% accuracy on the two-class tooth-marked tongue identification task and 80.02% accuracy, with mIoU of 71.65% for tongue and tooth marks in the segmentation task.
    Abstract In Traditional Chinese Medicine, the tooth marks on the tongue, stemming from prolonged dental pressure, serve as a crucial indicator for assessing qi (yang) deficiency, which is intrinsically linked to visceral health. Manual diagnosis of tooth-marked tongue solely relies on experience. Nonetheless, the diversity in shape, color, and type of tooth marks poses a challenge to diagnostic accuracy and consistency. To address these problems, herein we propose a multi-task joint learning model named Ammonia-Net. This model employs a convolutional neural network-based architecture, specifically designed for multi-class segmentation and classification of tongue images. Ammonia-Net performs semantic segmentation of tongue images to identify tongue and tooth marks. With the assistance of segmentation output, it classifies the images into the desired number of classes: healthy tongue, light tongue, moderate tongue, and severe tongue. As far as we know, this is the first attempt to apply the semantic segmentation results of tooth marks for tooth-marked tongue classification. To train Ammonia-Net, we collect 856 tongue images from 856 subjects. After a number of extensive experiments, the experimental results show that the proposed model achieves 99.06% accuracy in the two-class classification task of tooth-marked tongue identification and 80.02%. As for the segmentation task, mIoU for tongue and tooth marks amounts to 71.65%.

Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2310.03456
  • repo_url: None
  • paper_authors: Edward Fish, Jon Weinbren, Andrew Gilbert
  • for: Improve temporal action localization by integrating audio features into visual FPN-based detection frameworks.
  • methods: Proposes Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), which merges audio-visual data across temporal resolutions with a hierarchical gated cross-attention mechanism that weighs the importance of audio information at each scale.
  • results: Refines regression boundaries and boosts classification confidence when audio is available; MRAV-FF is compatible with existing FPN TAL architectures, offering a simple but powerful performance enhancement.
    Abstract Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in the integration of audio features into such frameworks. This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile, making it compatible with existing FPN TAL architectures and offering a significant enhancement in performance when audio data is available.
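One way to read the gated cross-attention fusion is sketched below: visual tokens at a given temporal resolution attend to audio tokens, and a learned sigmoid gate controls how much of the audio message is added back. Module names, dimensions, and the per-token gating are assumptions rather than the authors' exact design.

```python
# Minimal sketch (my reading of the idea, not the authors' code) of gated
# cross-attention fusion of audio features into one visual pyramid level.
import torch
import torch.nn as nn

class GatedAudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, T_v, dim) features at one temporal resolution
        # audio:  (B, T_a, dim) audio features
        message, _ = self.cross_attn(query=visual, key=audio, value=audio)
        g = self.gate(visual)                      # (B, T_v, 1) per-token weight
        return visual + g * message                # gated residual fusion

fuse = GatedAudioVisualFusion()
v = torch.randn(2, 128, 256)                       # e.g. one FPN level
a = torch.randn(2, 400, 256)
fused = fuse(v, a)                                 # (2, 128, 256)
```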

Mitigating the Influence of Domain Shift in Skin Lesion Classification: A Benchmark Study of Unsupervised Domain Adaptation Methods on Dermoscopic Images

  • paper_url: http://arxiv.org/abs/2310.03432
  • repo_url: None
  • paper_authors: Sireesha Chamarthi, Katharina Fogelberg, Roman C. Maron, Titus J. Brinker, Julia Niebling
  • for: Improve the performance of deep neural networks in skin lesion classification by addressing domain shift, which can degrade accuracy when models are tested on new data.
  • methods: Evaluates eight different unsupervised domain adaptation methods to determine their effectiveness in improving generalization on dermoscopic datasets.
  • results: All eight domain adaptation methods improve AUPRC for the majority of analyzed datasets, indicating that unsupervised domain adaptation generally helps the binary melanoma-nevus classification task; however, small or heavily imbalanced datasets reduce the consistency of the results.
    Abstract The potential of deep neural networks in skin lesion classification has already been demonstrated to be on-par if not superior to the dermatologists diagnosis. However, the performance of these models usually deteriorates when the test data differs significantly from the training data (i.e. domain shift). This concerning limitation for models intended to be used in real-world skin lesion classification tasks poses a risk to patients. For example, different image acquisition systems or previously unseen anatomical sites on the patient can suffice to cause such domain shifts. Mitigating the negative effect of such shifts is therefore crucial, but developing effective methods to address domain shift has proven to be challenging. In this study, we carry out an in-depth analysis of eight different unsupervised domain adaptation methods to analyze their effectiveness in improving generalization for dermoscopic datasets. To ensure robustness of our findings, we test each method on a total of ten distinct datasets, thereby covering a variety of possible domain shifts. In addition, we investigated which factors in the domain shifted datasets have an impact on the effectiveness of domain adaptation methods. Our findings show that all of the eight domain adaptation methods result in improved AUPRC for the majority of analyzed datasets. Altogether, these results indicate that unsupervised domain adaptations generally lead to performance improvements for the binary melanoma-nevus classification task regardless of the nature of the domain shift. However, small or heavily imbalanced datasets lead to a reduced conformity of the results due to the influence of these factors on the methods performance.

Robust Zero Level-Set Extraction from Unsigned Distance Fields Based on Double Covering

  • paper_url: http://arxiv.org/abs/2310.03431
  • repo_url: https://github.com/jjjkkyz/dcudf
  • paper_authors: Fei Hou, Xuhui Chen, Wencheng Wang, Hong Qin, Ying He
  • for: Propose DoubleCoverUDF, a new method for extracting the zero level-set from unsigned distance fields (UDFs).
  • methods: Takes a learned UDF and a user-specified small positive parameter $r$, extracts the iso-surface at iso-value $r$ with the conventional marching cubes algorithm, projects the resulting boundary mesh onto the target zero level-set $S$ with a covering map that preserves topology and avoids folding, and, when $S$ is an orientable manifold surface, separates the double-layered mesh into a single layer with a robust minimum-cut post-processing step.
  • results: Reconstructs 3D surfaces of open models on synthetic models and benchmark datasets, and is more robust and produces higher-quality meshes, both visually and quantitatively, than existing UDF-based methods.
    Abstract In this paper, we propose a new method, called DoubleCoverUDF, for extracting the zero level-set from unsigned distance fields (UDFs). DoubleCoverUDF takes a learned UDF and a user-specified parameter $r$ (a small positive real number) as input and extracts an iso-surface with an iso-value $r$ using the conventional marching cubes algorithm. We show that the computed iso-surface is the boundary of the $r$-offset volume of the target zero level-set $S$, which is an orientable manifold, regardless of the topology of $S$. Next, the algorithm computes a covering map to project the boundary mesh onto $S$, preserving the mesh's topology and avoiding folding. If $S$ is an orientable manifold surface, our algorithm separates the double-layered mesh into a single layer using a robust minimum-cut post-processing step. Otherwise, it keeps the double-layered mesh as the output. We validate our algorithm by reconstructing 3D surfaces of open models and demonstrate its efficacy and effectiveness on synthetic models and benchmark datasets. Our experimental results confirm that our method is robust and produces meshes with better quality in terms of both visual evaluation and quantitative measures than existing UDF-based methods. The source code is available at https://github.com/jjjkkyz/DCUDF.
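The first step, extracting the boundary of the r-offset volume by running standard marching cubes on the unsigned field at iso-value r, can be reproduced directly with scikit-image. The sketch below uses an analytic sphere UDF as a stand-in for a learned field and does not reproduce the covering-map projection or the minimum-cut separation.

```python
# Minimal sketch of extracting the r-offset surface of a UDF with marching
# cubes. The analytic sphere UDF is only a stand-in for a learned field.
import numpy as np
from skimage import measure

n, r = 64, 0.05
xs = np.linspace(-1.0, 1.0, n)
x, y, z = np.meshgrid(xs, xs, xs, indexing="ij")
udf = np.abs(np.sqrt(x**2 + y**2 + z**2) - 0.5)    # unsigned distance to a sphere

# Marching cubes at iso-value r gives the boundary of the r-offset volume:
# a closed, orientable double layer (shells of radius ~0.45 and ~0.55 here).
verts, faces, normals, _ = measure.marching_cubes(udf, level=r,
                                                  spacing=(xs[1] - xs[0],) * 3)
verts += np.array([-1.0, -1.0, -1.0])               # map voxel coords back to [-1, 1]
radii = np.linalg.norm(verts, axis=1)
print(radii.min(), radii.max())                      # approx. 0.45 and 0.55
```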

FreeReg: Image-to-Point Cloud Registration Leveraging Pretrained Diffusion Models and Monocular Depth Estimators

  • paper_url: http://arxiv.org/abs/2310.03420
  • repo_url: https://github.com/WHU-USI3DV/FreeReg
  • paper_authors: Haiping Wang, Yuan Liu, Bing Wang, Yujing Sun, Zhen Dong, Wenping Wang, Bisheng Yang
  • for: Matching cross-modality features between images and point clouds is the fundamental problem of image-to-point cloud registration, but the modality gap makes it difficult for existing metric learning methods to learn robust and discriminative cross-modality features.
  • methods: Unifies the modalities with large-scale pretrained models first and then establishes correspondences within the same modality: diffusion features extracted by depth-to-image diffusion models are semantically consistent between images and point clouds and yield coarse but robust cross-modality correspondences.
  • results: Matching geometric features extracted from depth maps produced by a monocular depth estimator further improves the accuracy of the coarse correspondences; experiments on three public indoor and outdoor benchmarks show that, without any task-specific training, directly using both features achieves accurate image-to-point cloud registration.
    Abstract Matching cross-modality features between images and point clouds is a fundamental problem for image-to-point cloud registration. However, due to the modality difference between images and points, it is difficult to learn robust and discriminative cross-modality features by existing metric learning methods for feature matching. Instead of applying metric learning on cross-modality data, we propose to unify the modality between images and point clouds by pretrained large-scale models first, and then establish robust correspondence within the same modality. We show that the intermediate features, called diffusion features, extracted by depth-to-image diffusion models are semantically consistent between images and point clouds, which enables the building of coarse but robust cross-modality correspondences. We further extract geometric features on depth maps produced by the monocular depth estimator. By matching such geometric features, we significantly improve the accuracy of the coarse correspondences produced by diffusion features. Extensive experiments demonstrate that without any task-specific training, direct utilization of both features produces accurate image-to-point cloud registration. On three public indoor and outdoor benchmarks, the proposed method averagely achieves a 20.6 percent improvement in Inlier Ratio, a three-fold higher Inlier Number, and a 48.6 percent improvement in Registration Recall than existing state-of-the-arts.
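A common way to turn such per-pixel features into coarse correspondences is mutual nearest-neighbour matching, sketched below with placeholder feature tensors. This illustrates only the matching step, under the assumption that both inputs have already been mapped into the same feature space by the frozen extractor; it is not the FreeReg pipeline itself.

```python
# Minimal sketch of mutual nearest-neighbour matching between two sets of
# descriptors (e.g., image features and features of the rendered point cloud).
import torch
import torch.nn.functional as F

def mutual_nearest_neighbors(feat_a: torch.Tensor, feat_b: torch.Tensor):
    # feat_a: (N, D), feat_b: (M, D) descriptors.
    sim = F.normalize(feat_a, dim=1) @ F.normalize(feat_b, dim=1).t()   # (N, M)
    nn_ab = sim.argmax(dim=1)             # best b for each a
    nn_ba = sim.argmax(dim=0)             # best a for each b
    idx_a = torch.arange(feat_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a        # a -> b -> a must return to itself
    return idx_a[mutual], nn_ab[mutual]   # matched index pairs

img_feats = torch.randn(1024, 256)        # e.g. diffusion features of image pixels
pcd_feats = torch.randn(900, 256)         # features of the rendered point cloud
ia, ib = mutual_nearest_neighbors(img_feats, pcd_feats)
print(ia.shape, ib.shape)                 # equal-length correspondence lists
```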

A Complementary Global and Local Knowledge Network for Ultrasound denoising with Fine-grained Refinement

  • paper_url: http://arxiv.org/abs/2310.03402
  • repo_url: None
  • paper_authors: Zhenyu Bu, Kai-Ni Wang, Fuxing Zhao, Shengxiao Li, Guang-Quan Zhou
  • for: Improve ultrasound image quality by reducing speckle noise, which degrades downstream tasks such as segmentation and classification.
  • methods: A complementary global and local knowledge network uses an L-CSwinTransformer encoder to capture global information and a CNN decoder to fuse local features, with a Fine-grained Refinement Block (FRB) in the skip connections to further enhance features.
  • results: Validated on the public HC18 and BUSI datasets, the model achieves competitive performance in both quantitative metrics and visual quality.
    Abstract Ultrasound imaging serves as an effective and non-invasive diagnostic tool commonly employed in clinical examinations. However, the presence of speckle noise in ultrasound images invariably degrades image quality, impeding the performance of subsequent tasks, such as segmentation and classification. Existing methods for speckle noise reduction frequently induce excessive image smoothing or fail to preserve detailed information adequately. In this paper, we propose a complementary global and local knowledge network for ultrasound denoising with fine-grained refinement. Initially, the proposed architecture employs the L-CSwinTransformer as encoder to capture global information, incorporating CNN as decoder to fuse local features. We expand the resolution of the feature at different stages to extract more global information compared to the original CSwinTransformer. Subsequently, we integrate Fine-grained Refinement Block (FRB) within the skip-connection stage to further augment features. We validate our model on two public datasets, HC18 and BUSI. Experimental results demonstrate that our model can achieve competitive performance in both quantitative metrics and visual performance. Our code will be available at https://github.com/AAlkaid/USDenoising.

Learning to Simplify Spatial-Temporal Graphs in Gait Analysis

  • paper_url: http://arxiv.org/abs/2310.03396
  • repo_url: None
  • paper_authors: Adrian Cosma, Emilian Radoi
  • for: Improve the interpretability and task-specific adaptability of skeleton-based gait analysis, demonstrated on gait-based gender estimation.
  • methods: Two models (an upstream and a downstream model) adjust the adjacency matrix for each walking instance, removing the fixed hand-crafted spatial-temporal graph; the Straight-Through Gumbel-Softmax trick keeps the whole pipeline trainable end-to-end (see the sketch after the abstract below).
  • results: Experiments on the CASIA-B dataset show improved interpretability without losing performance, with learned graphs that differ qualitatively from the fixed graphs used in existing models.
    Abstract Gait analysis leverages unique walking patterns for person identification and assessment across multiple domains. Among the methods used for gait analysis, skeleton-based approaches have shown promise due to their robust and interpretable features. However, these methods often rely on hand-crafted spatial-temporal graphs that are based on human anatomy disregarding the particularities of the dataset and task. This paper proposes a novel method to simplify the spatial-temporal graph representation for gait-based gender estimation, improving interpretability without losing performance. Our approach employs two models, an upstream and a downstream model, that can adjust the adjacency matrix for each walking instance, thereby removing the fixed nature of the graph. By employing the Straight-Through Gumbel-Softmax trick, our model is trainable end-to-end. We demonstrate the effectiveness of our approach on the CASIA-B dataset for gait-based gender estimation. The resulting graphs are interpretable and differ qualitatively from fixed graphs used in existing models. Our research contributes to enhancing the explainability and task-specific adaptability of gait recognition, promoting more efficient and reliable gait-based biometrics.
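The straight-through Gumbel-Softmax trick mentioned above can be illustrated with a tiny module that samples a discrete keep/drop mask per edge while keeping the logits trainable. For brevity the logits here are global parameters; in the paper an upstream model predicts them per walking instance.

```python
# Minimal sketch (an illustration under my assumptions, not the paper's model)
# of learnable edge selection with the straight-through Gumbel-Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableAdjacency(nn.Module):
    def __init__(self, num_joints: int = 17):
        super().__init__()
        # One keep/drop logit pair per potential edge between joints.
        self.edge_logits = nn.Parameter(torch.zeros(num_joints, num_joints, 2))

    def forward(self, tau: float = 1.0) -> torch.Tensor:
        # hard=True returns a discrete 0/1 sample in the forward pass while the
        # backward pass uses the soft sample (straight-through estimator).
        sample = F.gumbel_softmax(self.edge_logits, tau=tau, hard=True, dim=-1)
        return sample[..., 0]                      # (J, J) binary keep mask

adj_module = LearnableAdjacency()
adj = adj_module()                                 # resampled every forward pass
print(adj.shape, adj.requires_grad)                # torch.Size([17, 17]) True
```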

OpenPatch: a 3D patchwork for Out-Of-Distribution detection

  • paper_url: http://arxiv.org/abs/2310.03388
  • repo_url: None
  • paper_authors: Paolo Rabino, Antonio Alliegro, Francesco Cappio Borlino, Tatiana Tommasi
  • for: Prepare deep learning models for deployment in the open world, where effectively detecting novel semantic classes at test time is crucial, ideally without any additional training for each new task.
  • methods: OpenPatch builds on a large pre-trained model and simply extracts from its intermediate features a set of patch representations describing each known class; a new sample's novelty score reflects whether it can be recomposed mainly from patches of a single known class or only via contributions from multiple classes.
  • results: Extensive experiments on semantic novelty detection for real-world point clouds with synthetic reference data show that OpenPatch excels in both the full and few-shot known-sample scenarios and is robust across pre-training objectives and network backbones; being training-free, it can be applied immediately to a wide range of real-world tasks.
    Abstract Moving deep learning models from the laboratory setting to the open world entails preparing them to handle unforeseen conditions. In several applications the occurrence of novel classes during deployment poses a significant threat, thus it is crucial to effectively detect them. Ideally, this skill should be used when needed without requiring any further computational training effort at every new task. Out-of-distribution detection has attracted significant attention in the last years, however the majority of the studies deal with 2D images ignoring the inherent 3D nature of the real-world and often confusing between domain and semantic novelty. In this work, we focus on the latter, considering the objects geometric structure captured by 3D point clouds regardless of the specific domain. We advance the field by introducing OpenPatch that builds on a large pre-trained model and simply extracts from its intermediate features a set of patch representations that describe each known class. For any new sample, we obtain a novelty score by evaluating whether it can be recomposed mainly by patches of a single known class or rather via the contribution of multiple classes. We present an extensive experimental evaluation of our approach for the task of semantic novelty detection on real-world point cloud samples when the reference known data are synthetic. We demonstrate that OpenPatch excels in both the full and few-shot known sample scenarios, showcasing its robustness across varying pre-training objectives and network backbones. The inherent training-free nature of our method allows for its immediate application to a wide array of real-world tasks, offering a compelling advantage over approaches that need expensive retraining efforts.
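A simplified reading of the scoring rule is sketched below: each known class keeps a bank of patch features, every patch of a new sample votes for its nearest class, and the sample is scored as more novel when no single class dominates the votes. The banks, dimensions, and the exact score are assumptions, not the OpenPatch implementation.

```python
# Minimal sketch (a simplified reading, not the exact OpenPatch rule) of a
# patch-based novelty score over per-class feature banks.
import torch

def novelty_score(sample_patches: torch.Tensor, class_banks: dict) -> float:
    # sample_patches: (P, D); class_banks: {class_name: (N_c, D) tensor}
    best_dists = []
    for bank in class_banks.values():
        d = torch.cdist(sample_patches, bank)      # (P, N_c) pairwise distances
        best_dists.append(d.min(dim=1).values)     # best match per patch
    dist = torch.stack(best_dists)                 # (C, P)
    nearest_class = dist.argmin(dim=0)             # winning class per patch
    counts = nearest_class.bincount(minlength=dist.shape[0])
    top_share = counts.max() / len(nearest_class)  # share of the dominant class
    return 1.0 - top_share.item()                  # 0 = one class explains all patches

banks = {"chair": torch.randn(500, 128), "table": torch.randn(500, 128)}
patches = torch.randn(64, 128)                     # patch features of a new sample
print(novelty_score(patches, banks))
```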

ACT-Net: Anchor-context Action Detection in Surgery Videos

  • paper_url: http://arxiv.org/abs/2310.03377
  • repo_url: None
  • paper_authors: Luoying Hao, Yan Hu, Wenjun Lin, Qun Wang, Heng Li, Huazhu Fu, Jinming Duan, Jiang Liu
  • for: Accurately detect and localize fine-grained surgical actions throughout a procedure, to support context-aware decision support systems.
  • methods: ACTNet combines an anchor-context detection (ACD) module, which spatially and temporally highlights the regions interacting with the extracted anchor and outputs action locations and class distributions, with a class conditional diffusion (CCD) module, a denoising diffusion-based generative model conditioned on the ACD estimates; together they answer where actions happen, what they are, and how confident the predictions are.
  • results: The stochastic diffusion outputs provide per-prediction model confidence, and the method reports state-of-the-art performance with a 4.0% mAP improvement over the baseline on the surgical video dataset.
    Abstract Recognition and localization of surgical detailed actions is an essential component of developing a context-aware decision support system. However, most existing detection algorithms fail to provide high-accuracy action classes even having their locations, as they do not consider the surgery procedure's regularity in the whole video. This limitation hinders their application. Moreover, implementing the predictions in clinical applications seriously needs to convey model confidence to earn entrustment, which is unexplored in surgical action prediction. In this paper, to accurately detect fine-grained actions that happen at every moment, we propose an anchor-context action detection network (ACTNet), including an anchor-context detection (ACD) module and a class conditional diffusion (CCD) module, to answer the following questions: 1) where the actions happen; 2) what actions are; 3) how confidence predictions are. Specifically, the proposed ACD module spatially and temporally highlights the regions interacting with the extracted anchor in surgery video, which outputs action location and its class distribution based on anchor-context interactions. Considering the full distribution of action classes in videos, the CCD module adopts a denoising diffusion-based generative model conditioned on our ACD estimator to further reconstruct accurately the action predictions. Moreover, we utilize the stochastic nature of the diffusion model outputs to access model confidence for each prediction. Our method reports the state-of-the-art performance, with improvements of 4.0% mAP against baseline on the surgical video dataset.
    摘要 识别和定位手术中的细粒度动作是构建情境感知决策支持系统的关键组成部分。然而,大多数现有检测算法即便给出了位置,也无法提供高精度的动作类别,因为它们没有考虑整段视频中手术流程的规律性,这一局限限制了其应用。此外,在临床应用中落地预测结果,亟需传达模型置信度以赢得信任,而这一点在手术动作预测中尚未被探索。本文为准确检测每一时刻发生的细粒度动作,提出了锚点-上下文动作检测网络(ACTNet),包括锚点-上下文检测(ACD)模块和类别条件扩散(CCD)模块,以回答以下问题:1)动作发生在哪里;2)动作是什么;3)预测的置信度有多高。具体而言,ACD模块在手术视频中从空间和时间上突出与所提取锚点交互的区域,并基于锚点-上下文交互输出动作位置及其类别分布。考虑到视频中动作类别的完整分布,CCD模块采用以ACD估计器为条件的去噪扩散生成模型,进一步准确重建动作预测。此外,我们利用扩散模型输出的随机性来评估每个预测的模型置信度。我们的方法在手术视频数据集上取得了最先进的性能,mAP较基线提升4.0%。

Point-Based Radiance Fields for Controllable Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2310.03375
  • repo_url: https://github.com/dehezhang2/point_based_nerf_editing
  • paper_authors: Haitao Yu, Deheng Zhang, Peiyuan Xie, Tianyi Zhang
  • for: 本文提出了一种新的可控人体动作合成方法,用于细粒度变形。以前的可编辑神经辐射场方法可以生成出色的结果,但它们几乎无法实现复杂的3D人体编辑,如前向运动学。我们的方法利用显式点云来训练静态3D场景,并通过编码点云平移来施加变形。
  • methods: 我们的方法使用了静态点云来训练静态3D场景,并通过编码点云转移来应用变形。我们还使用了SVD来估计本地旋转,并通过插值来将每个点的旋转转化到查询视图方向上。
  • results: 我们的方法可以对细粒度复杂变形进行有效控制,并且可以泛化到人体以外的其他3D角色。大量实验表明,我们的方法在细粒度复杂变形上显著优于当前最先进的方法。
    Abstract This paper proposes a novel controllable human motion synthesis method for fine-level deformation based on static point-based radiance fields. Although previous editable neural radiance field methods can generate impressive results on novel-view synthesis and allow naive deformation, few algorithms can achieve complex 3D human editing such as forward kinematics. Our method exploits the explicit point cloud to train the static 3D scene and apply the deformation by encoding the point cloud translation using a deformation MLP. To make sure the rendering result is consistent with the canonical space training, we estimate the local rotation using SVD and interpolate the per-point rotation to the query view direction of the pre-trained radiance field. Extensive experiments show that our approach can significantly outperform the state-of-the-art on fine-level complex deformation which can be generalized to other 3D characters besides humans.
    摘要 本文提出了一种基于静态点云辐射场的、面向细粒度变形的新型可控人体动作合成方法。尽管以往的可编辑神经辐射场方法能够在新视角合成上取得令人印象深刻的结果,并支持简单的变形,但很少有算法能够实现如前向运动学这样复杂的3D人体编辑。我们的方法利用显式点云来训练静态3D场景,并通过一个变形MLP编码点云平移来施加变形。为保证渲染结果与规范空间的训练保持一致,我们使用SVD估计局部旋转,并将逐点旋转插值到预训练辐射场的查询视角方向上。大量实验表明,我们的方法在细粒度复杂变形上显著优于当前最先进的方法,并且可以泛化到人体以外的其他3D角色。
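
The abstract mentions estimating local rotations with SVD to keep rendering consistent after deformation. The snippet below is a minimal sketch of that step using the standard Kabsch procedure on a point's canonical and deformed neighbourhoods; the neighbourhood size and how the rotation is later interpolated toward the query view direction are assumptions for illustration.

```python
import numpy as np

def local_rotation(canonical_nbrs, deformed_nbrs):
    """Both inputs are (K, 3) neighborhoods of the same point before/after deformation."""
    a = canonical_nbrs - canonical_nbrs.mean(axis=0)
    b = deformed_nbrs - deformed_nbrs.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(vt.T @ u.T))       # avoid reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T   # rotation mapping canonical -> deformed

rng = np.random.default_rng(0)
nbrs = rng.normal(size=(16, 3))
angle = np.pi / 6
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0, 0.0, 1.0]])
print(np.allclose(local_rotation(nbrs, nbrs @ R.T), R, atol=1e-6))  # True
```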

Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior

  • paper_url: http://arxiv.org/abs/2310.03363
  • repo_url: None
  • paper_authors: Jinting Wang, Li Liu, Jun Wang, Hei Victor Cheng
  • for: 这个论文的目的是提出一种新的语音到脸图生成框架,以解决现有的语音到脸图生成方法中存在的不稳定性和无法生成真实的脸图问题。
  • methods: 该框架基于一种新型的语音条件潜在扩散模型(SCLDM),并对语音编码器和人脸编码器采用对比预训练,以保持语音与人脸之间共享的身份信息。此外,我们通过将统计人脸先验引入扩散过程,消除人脸间的共享成分,从而增强语音条件所捕捉的细微差异。
  • results: 我们的方法可以生成更加真实的脸图,同时保持说话人的身份特征。经验表明,我们的方法在AVSpeech和Voxceleb两个 dataset上具有显著的改进,特别是在cosine distance metric上的提高。例如,在AVSpeech dataset上,我们的方法提高了32.17和32.72的cosine distance metric,对比之前的最佳方法,提高了23.53%和25.37%。
    Abstract Speech-to-face generation is an intriguing area of research that focuses on generating realistic facial images based on a speaker's audio speech. However, state-of-the-art methods employing GAN-based architectures lack stability and cannot generate realistic face images. To fill this gap, we propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM. To the best of our knowledge, this is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation. Preserving the shared identity information between speech and face is crucial in generating realistic results. Therefore, we employ contrastive pre-training for both the speech encoder and the face encoder. This pre-training strategy facilitates effective alignment between the attributes of speech, such as age and gender, and the corresponding facial characteristics in the face images. Furthermore, we tackle the challenge posed by excessive diversity in the synthesis process caused by the diffusion model. To overcome this challenge, we introduce the concept of residuals by integrating a statistical face prior to the diffusion process. This addition helps to eliminate the shared component across the faces and enhances the subtle variations captured by the speech condition. Extensive quantitative, qualitative, and user study experiments demonstrate that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods. Highlighting the notable enhancements, our method demonstrates significant gains in all metrics on the AVSpeech dataset and Voxceleb dataset, particularly noteworthy are the improvements of 32.17 and 32.72 on the cosine distance metric for the two datasets, respectively.
    摘要 《speech-to-face》是一个吸引人的研究领域,它旨在基于说话人的音频speech生成真实的脸部图像。然而,当前的方法使用GAN结构,缺乏稳定性,无法生成真实的脸部图像。为了填补这一漏洞,我们提出了一种新的speech-to-face生成框架,它利用了一种叫做Speech-Conditioned Latent Diffusion Model(SCLDM)。据我们所知,这是首次利用扩散模型来进行speech-to-face生成。在生成真实结果的同时,保持说话人的身份信息与脸部图像之间的共同性是关键。因此,我们使用了对比预训练,使得说话人的年龄和性别特征与对应的脸部特征进行有效的对应。此外,我们解决了由扩散模型引起的生成过程中的过度多样性挑战。我们在扩散过程中添加了一个统计学面壳,以消除共同的部分在脸部图像中,并使得扩散过程中的微妙变化更加明显。经验证明,我们的方法可以生成更真实的脸部图像,同时保持说话人的身份。特别是,在AVSpeech和Voxceleb两个 dataset上,我们的方法表现出了显著的提升,cosine distance指标上的提升分别为32.17和32.72。

CSI: Enhancing the Robustness of 3D Point Cloud Recognition against Corruption

  • paper_url: http://arxiv.org/abs/2310.03360
  • repo_url: https://github.com/masterwu2115/csi
  • paper_authors: Zhuoyuan Wu, Jiachen Sun, Chaowei Xiao
  • for: 提高3D点云识别在真实世界数据损坏下的鲁棒性
  • methods: 利用点云数据固有的集合属性,提出一种新的关键子集识别(CSI)方法,包括密度感知采样(DAS)和自熵最小化(SEM)两部分
  • results: CSI方法在ModelNet40-C和PointCloud-C上分别取得18.4%和16.3%的错误率,较当前最先进方法分别降低5.2%和4.2%,显著提升了点云识别对数据损坏的鲁棒性。
    Abstract Despite recent advancements in deep neural networks for point cloud recognition, real-world safety-critical applications present challenges due to unavoidable data corruption. Current models often fall short in generalizing to unforeseen distribution shifts. In this study, we harness the inherent set property of point cloud data to introduce a novel critical subset identification (CSI) method, aiming to bolster recognition robustness in the face of data corruption. Our CSI framework integrates two pivotal components: density-aware sampling (DAS) and self-entropy minimization (SEM), which cater to static and dynamic CSI, respectively. DAS ensures efficient robust anchor point sampling by factoring in local density, while SEM is employed during training to accentuate the most salient point-to-point attention. Evaluations reveal that our CSI approach yields error rates of 18.4\% and 16.3\% on ModelNet40-C and PointCloud-C, respectively, marking a notable improvement over state-of-the-art methods by margins of 5.2\% and 4.2\% on the respective benchmarks. Code is available at \href{https://github.com/masterwu2115/CSI/tree/main}{https://github.com/masterwu2115/CSI/tree/main}
    摘要 尽管深度神经网络在点云识别方面近来取得了重要进展,但真实世界中的安全关键应用仍面临难以避免的数据损坏挑战,现有模型往往无法泛化到不可预见的分布偏移。本研究利用点云数据固有的集合属性,提出一种新的关键子集识别(CSI)方法,以增强识别对数据损坏的鲁棒性。我们的CSI框架包括两个关键组成部分:密度感知采样(DAS)和自熵最小化(SEM),分别对应静态和动态CSI。DAS通过考虑局部密度来高效采样鲁棒的锚点,SEM则在训练中用于强调最显著的点对点注意力。评估结果显示,我们的CSI方法在 ModelNet40-C 和 PointCloud-C 上分别取得18.4%和16.3%的错误率,较当前最佳方法分别提升5.2%和4.2%。代码可以在 \href{https://github.com/masterwu2115/CSI/tree/main}{https://github.com/masterwu2115/CSI/tree/main} 上获取。
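
As a rough illustration of the two CSI components, the sketch below draws anchor points with probability proportional to a kNN-based local density estimate (density-aware sampling) and computes a self-entropy term over point-to-point attention that could be minimised during training. The density estimator and the exact form of the attention are assumptions, not the paper's implementation.

```python
import numpy as np

def density_aware_sample(points, n_anchors=32, k=8):
    """points: (N, 3) point cloud; returns indices of n_anchors sampled anchor points."""
    rng = np.random.default_rng(0)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)  # (N, N)
    knn = np.sort(d, axis=1)[:, 1:k + 1]          # skip self-distance at column 0
    density = 1.0 / (knn.mean(axis=1) + 1e-8)     # denser regions -> larger weight
    p = density / density.sum()
    return rng.choice(len(points), size=n_anchors, replace=False, p=p)

def self_entropy(attention_logits):
    """Self-entropy of point-to-point attention; minimising it sharpens salient attention."""
    p = np.exp(attention_logits - attention_logits.max(axis=-1, keepdims=True))
    p = p / p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-8)).sum(axis=-1).mean()

pts = np.random.default_rng(1).normal(size=(256, 3))
print(density_aware_sample(pts).shape, self_entropy(np.random.randn(4, 256)))
```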

Combining Datasets with Different Label Sets for Improved Nucleus Segmentation and Classification

  • paper_url: http://arxiv.org/abs/2310.03346
  • repo_url: None
  • paper_authors: Amruta Parulekar, Utkarsh Kanwat, Ravi Kant Gupta, Medha Chippa, Thomas Jacob, Tripti Bameta, Swapnil Rane, Amit Sethi
  • for: automatic cell counting and morphometric assessments in histopathology images
  • methods: deep neural networks (DNNs) with a coarse-to-fine class hierarchy
  • results: improved segmentation and classification metrics on test splits, as well as generalization to previously unseen datasets
    Abstract Segmentation and classification of cell nuclei in histopathology images using deep neural networks (DNNs) can save pathologists' time for diagnosing various diseases, including cancers, by automating cell counting and morphometric assessments. It is now well-known that the accuracy of DNNs increases with the sizes of annotated datasets available for training. Although multiple datasets of histopathology images with nuclear annotations and class labels have been made publicly available, the set of class labels differ across these datasets. We propose a method to train DNNs for instance segmentation and classification on multiple datasets where the set of classes across the datasets are related but not the same. Specifically, our method is designed to utilize a coarse-to-fine class hierarchy, where the set of classes labeled and annotated in a dataset can be at any level of the hierarchy, as long as the classes are mutually exclusive. Within a dataset, the set of classes need not even be at the same level of the class hierarchy tree. Our results demonstrate that segmentation and classification metrics for the class set used by the test split of a dataset can improve by pre-training on another dataset that may even have a different set of classes due to the expansion of the training set enabled by our method. Furthermore, generalization to previously unseen datasets also improves by combining multiple other datasets with different sets of classes for training. The improvement is both qualitative and quantitative. The proposed method can be adapted for various loss functions, DNN architectures, and application domains.
    摘要 深度神经网络(DNNs)可以自动完成 Histopathology 图像中细胞核心的分割和分类,从而为诊断多种疾病,包括癌症,节省病理医生的时间。现在已经证明,DNNs 的准确性与用于训练的数据集的大小成正相关。虽然多个历史病理图像数据集已经公开发布,但这些数据集中的类别标签不同。我们提出了一种方法,可以在不同数据集上训练 DNNs 进行实例分割和分类,其中数据集中的类别标签可以是归一化的树结构中的任何一级,只要这些类别是互斥的。在一个数据集中,类别标签不必是同一级别的树结构中的。我们的结果表明,在使用另一个数据集进行预训练后,对测试分割的分割和分类指标可以得到改进,并且将多个不同数据集合并训练可以提高总体化和推广性。这种方法可以适应不同的损失函数、DNN 架构和应用领域。
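
One way to train across datasets whose labels sit at different levels of a class hierarchy, in the spirit of the method above, is to always predict the finest classes and marginalise their probabilities up to whatever level a given dataset provides. The hierarchy, class names, and loss form below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# hypothetical fine classes: 0 epithelial, 1 lymphocyte, 2 neutrophil, 3 connective
coarse_to_fine = {"inflammatory": [1, 2], "epithelial": [0], "connective": [3]}

def hierarchical_nll(logits, coarse_label):
    """logits: (B, n_fine); coarse_label: one coarse class shared by the batch."""
    probs = F.softmax(logits, dim=-1)
    mass = probs[:, coarse_to_fine[coarse_label]].sum(dim=-1)  # marginalise over children
    return -(mass + 1e-8).log().mean()

logits = torch.randn(4, 4, requires_grad=True)
loss = hierarchical_nll(logits, "inflammatory")   # dataset only says "inflammatory"
loss.backward()
print(float(loss))
```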

Denoising Diffusion Step-aware Models

  • paper_url: http://arxiv.org/abs/2310.03337
  • repo_url: None
  • paper_authors: Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, Yingcong Chen
  • for: 提高 diffusion model 的计算效率,使其适用于更广泛的数据生成任务。
  • methods: 使用一组大小不一的神经网络,其中每个网络的规模依据各生成步骤的重要性进行调整,并通过进化搜索确定。
  • results: 对 CIFAR-10、CelebA-HQ、LSUN-bedroom、AFHQ 和 ImageNet 等数据集进行了实验,显示 DDSM 可以提高计算效率,对应的计算时间减少了 49%、61%、59%、71% 和 76%,而不会 sacrificing 生成质量。
    Abstract Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. Our code and models will be publicly available.
    摘要 去噪扩散概率模型(DDPM)在多个领域的数据生成中广受欢迎,但一个显著瓶颈是生成过程的每一步都需要整个网络参与计算,带来高昂的计算开销。本文提出了一个新框架——步骤感知去噪扩散模型(Denoising Diffusion Step-aware Models, DDSM)来应对这一挑战。与传统方法不同,DDSM 使用一组大小不一的神经网络,其中每个网络的规模依据各生成步骤的重要性而定,并通过进化搜索确定。这种逐步变化的网络设计有效避免了冗余计算,尤其是在不那么关键的步骤上,从而提升了扩散模型的效率。此外,步骤感知设计可以与 DDIM、潜在扩散等其他面向效率的扩散模型无缝结合,进一步拓宽计算节省的范围。实验表明,DDSM 在 CIFAR-10、CelebA-HQ、LSUN-bedroom、AFHQ 和 ImageNet 上分别节省了49%、61%、59%、71% 和 76% 的计算量,且不牺牲生成质量。我们的代码和模型将公开发布。
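
A minimal sketch of the step-aware idea: two denoisers of different widths are kept, and each sampling step is routed to one of them by a per-step mask (found via evolutionary search in the paper, hard-coded here as an assumption). The toy 2-D denoiser and the simplified update rule stand in for a real DDPM.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + 1, width), nn.SiLU(), nn.Linear(width, 2))
    def forward(self, x, t):
        return self.net(torch.cat([x, t.float().unsqueeze(-1)], dim=-1))

small, large = Denoiser(32), Denoiser(256)
T = 100
use_large = torch.zeros(T, dtype=torch.bool)
use_large[:20] = True            # assumed: only the last, most critical steps need the big net

@torch.no_grad()
def sample(n=8):
    x = torch.randn(n, 2)
    for t in reversed(range(T)):
        model = large if use_large[t] else small   # step-aware dispatch
        eps = model(x, torch.full((n,), t))
        x = x - 0.01 * eps       # placeholder update standing in for the DDPM posterior step
    return x

print(sample().shape)
```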

Continual Test-time Domain Adaptation via Dynamic Sample Selection

  • paper_url: http://arxiv.org/abs/2310.03335
  • repo_url: None
  • paper_authors: Yanshuo Wang, Jie Hong, Ali Cheraghian, Shafin Rahman, David Ahmedt-Aristizabal, Lars Petersson, Mehrtash Harandi
  • for: 这篇论文的目的是提出一种持续测试时领域自适应(CTDA)方法,在不访问源数据的情况下逐步适应一系列目标领域。
  • methods: 本文提出了一种动态样本选择(DSS)方法,包括动态阈值、正向学习和负向学习三个过程。传统上,模型从无标注的未知环境数据中学习,并同等依赖所有样本的伪标签来通过自训练更新参数。但这些伪标签中存在噪声预测,因此并非所有样本都同等可信。为此,我们首先设计了一个动态阈值模组,从高质量样本中区分出可疑的低质量样本;被选出的低质量样本更可能被错误预测。因此,我们对高质量和低质量样本联合进行正向与负向学习,以降低使用错误信息的风险。
  • results: 我们进行了广泛的实验,证明所提方法在图像领域的 CTDA 任务上超越了当前最先进的结果。此外,我们的方法还在 3D 点云领域得到评估,展示了其多样性和更广泛应用的潜力。
    Abstract The objective of Continual Test-time Domain Adaptation (CTDA) is to gradually adapt a pre-trained model to a sequence of target domains without accessing the source data. This paper proposes a Dynamic Sample Selection (DSS) method for CTDA. DSS consists of dynamic thresholding, positive learning, and negative learning processes. Traditionally, models learn from unlabeled unknown environment data and equally rely on all samples' pseudo-labels to update their parameters through self-training. However, noisy predictions exist in these pseudo-labels, so all samples are not equally trustworthy. Therefore, in our method, a dynamic thresholding module is first designed to select suspected low-quality from high-quality samples. The selected low-quality samples are more likely to be wrongly predicted. Therefore, we apply joint positive and negative learning on both high- and low-quality samples to reduce the risk of using wrong information. We conduct extensive experiments that demonstrate the effectiveness of our proposed method for CTDA in the image domain, outperforming the state-of-the-art results. Furthermore, our approach is also evaluated in the 3D point cloud domain, showcasing its versatility and potential for broader applicability.
    摘要 持续测试时领域自适应(CTDA)的目标是在不访问源数据的情况下,使预训练模型逐步适应一系列目标领域。本文提出了一种动态样本选择(DSS)方法,包括动态阈值、正例学习和负例学习过程。传统上,模型从无标注的未知环境数据中学习,并同等依赖所有样本的伪标签通过自训练更新参数。然而,这些伪标签中存在噪声预测,因此并非所有样本都同等可信。为此,我们首先设计了动态阈值模块,从高质量样本中选出可疑的低质量样本;被选出的低质量样本更可能被错误预测。因此,我们对高质量和低质量样本联合进行正例与负例学习,以降低使用错误信息的风险。广泛实验证明,所提方法在图像领域的 CTDA 任务上超越了当前最先进的结果。此外,该方法还在3D点云领域得到评估,展示了其通用性和更广泛应用的潜力。
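
The sketch below illustrates one plausible reading of Dynamic Sample Selection during adaptation: a batch-dependent confidence threshold splits the batch, confident pseudo-labels receive ordinary (positive) cross-entropy, and suspect ones receive a negative-learning term that only pushes probability away from the likely-wrong pseudo-label. The quantile threshold and exact loss forms are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dss_loss(logits, quantile=0.5):
    probs = F.softmax(logits, dim=-1)
    conf, pseudo = probs.max(dim=-1)
    thr = torch.quantile(conf.detach(), quantile)        # dynamic, batch-dependent threshold
    high, low = conf >= thr, conf < thr
    loss = torch.zeros((), device=logits.device)
    if high.any():                                        # positive learning on trusted samples
        loss = loss + F.cross_entropy(logits[high], pseudo[high])
    if low.any():                                         # negative learning on suspect samples
        loss = loss - (1 - probs[low, pseudo[low]] + 1e-8).log().mean()
    return loss

print(float(dss_loss(torch.randn(16, 10))))
```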

Real-time Multi-modal Object Detection and Tracking on Edge for Regulatory Compliance Monitoring

  • paper_url: http://arxiv.org/abs/2310.03333
  • repo_url: None
  • paper_authors: Jia Syuen Lim, Ziwei Wang, Jiajun Liu, Abdelwahed Khamis, Reza Arablouei, Robert Barlow, Ryan McAllister
  • for: 为多个工业领域的监管合规审核提供更高的质量保证和可追溯性。
  • methods: 使用实时多模态感知系统,结合3D飞行时间(ToF)相机与RGB摄像头,并在边缘AI设备上配合无监督学习技术。
  • results: 实现持续目标跟踪,提高记录保存效率并减少人工干预;在农业食品设施的刀具消毒场景中完成了系统验证,并展示了其应对遮挡与低光照问题的能力。
    Abstract Regulatory compliance auditing across diverse industrial domains requires heightened quality assurance and traceability. Present manual and intermittent approaches to such auditing yield significant challenges, potentially leading to oversights in the monitoring process. To address these issues, we introduce a real-time, multi-modal sensing system employing 3D time-of-flight and RGB cameras, coupled with unsupervised learning techniques on edge AI devices. This enables continuous object tracking thereby enhancing efficiency in record-keeping and minimizing manual interventions. While we validate the system in a knife sanitization context within agrifood facilities, emphasizing its prowess against occlusion and low-light issues with RGB cameras, its potential spans various industrial monitoring settings.
    摘要 跨多个工业领域的监管合规审核需要更高的质量保证和可追溯性。现有的人工、间歇式审核方式面临重大挑战,可能导致监测过程出现疏漏。为解决这些问题,我们提出一种实时多模态感知系统,采用3D飞行时间相机与RGB相机,并在边缘AI设备上结合无监督学习技术。该系统能够对目标进行连续跟踪,从而提高记录保存效率并减少人工干预。我们在农业食品设施的刀具消毒场景中验证了该系统,重点展示其应对RGB相机遮挡与低光照问题的能力,其潜在适用范围也涵盖多种工业监测场景。

Investigating the Limitation of CLIP Models: The Worst-Performing Categories

  • paper_url: http://arxiv.org/abs/2310.03324
  • repo_url: None
  • paper_authors: Jie-Jing Shao, Jiang-Xin Shi, Xiao-Wen Yang, Lan-Zhe Guo, Yu-Feng Li
  • for: 提高 CLIP 模型在特定类别下的表现,尤其是在风险敏感应用中,其中一些类别具有重要性。
  • methods: 研究 CLIP 模型中两种模态之间的对齐,并提出类别匹配间隔(Class-wise Matching Margin, CMM)来衡量推理混淆程度。
  • results: 通过查询大型语言模型来丰富最差类别的描述并构建加权集成,将 ImageNet 上最差10类的准确率从0%提升到5.2%,且无需手动设计提示、繁琐的优化或访问带标注的验证数据。
    Abstract Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts, enabling zero-shot recognition on downstream tasks. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that their performance in the worst categories is significantly inferior to the overall performance. For example, on ImageNet, there are a total of 10 categories with class-wise accuracy as low as 0\%, even though the overall performance has achieved 64.1\%. This phenomenon reveals the potential risks associated with using CLIP models, particularly in risk-sensitive applications where specific categories hold significant importance. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise Matching Margin (CMM) to measure the inference confusion. CMM can effectively identify the worst-performing categories and estimate the potential performance of the candidate prompts. We further query large language models to enrich descriptions of worst-performing categories and build a weighted ensemble to highlight the efficient prompts. Experimental results clearly verify the effectiveness of our proposal, where the accuracy on the worst-10 categories on ImageNet is boosted to 5.2\%, without manual prompt engineering, laborious optimization, or access to labeled validation data.
    摘要 CLIP(对比语言图像预训练)提供了一个基本模型,将自然语言和视觉概念集成起来,以实现零shot认知任务。通常认为,通过Well-designed文本提示,可以在多个领域达到可接受的总体精度。然而,我们发现CLIP模型在最差类别表现不佳,比总体表现低至0%。例如,在ImageNet中有10个类别,其中每个类别的精度只有0%。这种现象表明CLIP模型在风险敏感应用中可能存在风险,特别是在特定类别具有重要性时。为解决这个问题,我们调查CLIP模型两个模式之间的对应关系,并提出了类别匹配margin(CMM)来度量推理冲击。CMM可以准确地确定最差表现的类别,并估算候选提示的可能性。我们进一步咨询大型自然语言模型,以描述最差表现的类别,并建立了权重 ensemble,以强调高效的提示。实验结果表明,我们的提议有效,ImageNet最差10类精度从0%提高到5.2%,无需手动提取工程、繁琐优化或访问标注验证数据。
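
The abstract does not spell out how the Class-wise Matching Margin is computed, so the sketch below shows one plausible reading: for the images zero-shot-assigned to each class, average the gap between the best and runner-up image-text similarities; classes with small margins are flagged as likely worst performers. Treat this as an assumption-laden illustration rather than the paper's definition.

```python
import numpy as np

def class_matching_margin(sim, n_classes):
    """sim: (N, C) cosine similarities between N image embeddings and C class prompts."""
    top2 = np.sort(sim, axis=1)[:, -2:]          # runner-up and best score per image
    margin = top2[:, 1] - top2[:, 0]
    pred = sim.argmax(axis=1)
    cmm = np.full(n_classes, np.nan)
    for c in range(n_classes):
        if (pred == c).any():
            cmm[c] = margin[pred == c].mean()
    return cmm                                    # small values -> likely worst categories

sim = np.random.default_rng(0).normal(size=(1000, 20))
cmm = class_matching_margin(sim, 20)
print(np.argsort(cmm)[:5])                        # the five most confusable classes
```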

Functional data learning using convolutional neural networks

  • paper_url: http://arxiv.org/abs/2310.03773
  • repo_url: https://github.com/jesusgl86/fdap01
  • paper_authors: Jose Galarza, Tamer Oraby
  • for: 这个论文目的是使用卷积神经网络(CNN)来解决功能数据中的回归和分类问题,特别是面临噪音和非噪音功能数据的情况。
  • methods: 这个方法是将功能数据转换为28x28图像,并使用特定的卷积神经网络来进行所有的回归运算和函数形式分类。
  • results: 这个方法可以实现高精度的回归和分类,并且可以应对含噪声与无噪声的功能数据。实验结果显示,这个方法可以成功地估计指数增长与衰减率、正弦与余弦函数的频宽,以及曲线峰值的幅度与宽度,还能分类功能数据的单调性与曲率、代数增长与指数增长以及峰值个数。此外,这个方法还可用于估计混沌数据的李雅普诺夫指数、从流行病曲线估计疾病传播率、比较药物溶解曲线的相似性,以及检测帕金森病患者。
    Abstract In this paper, we show how convolutional neural networks (CNN) can be used in regression and classification learning problems of noisy and non-noisy functional data. The main idea is to transform the functional data into a 28 by 28 image. We use a specific but typical architecture of a convolutional neural network to perform all the regression exercises of parameter estimation and functional form classification. First, we use some functional case studies of functional data with and without random noise to showcase the strength of the new method. In particular, we use it to estimate exponential growth and decay rates, the bandwidths of sine and cosine functions, and the magnitudes and widths of curve peaks. We also use it to classify the monotonicity and curvatures of functional data, algebraic versus exponential growth, and the number of peaks of functional data. Second, we apply the same convolutional neural networks to Lyapunov exponent estimation in noisy and non-noisy chaotic data, in estimating rates of disease transmission from epidemic curves, and in detecting the similarity of drug dissolution profiles. Finally, we apply the method to real-life data to detect Parkinson's disease patients in a classification problem. The method, although simple, shows high accuracy and is promising for future use in engineering and medical applications.
    摘要 在这篇论文中,我们展示了如何使用卷积神经网络(CNN)在有噪和无噪函数数据上进行回归和分类学习问题。主要思想是将函数数据转换成28x28图像。我们使用了一种特定 yet typical的卷积神经网络架构来实现所有的参数估计和函数形态分类问题。首先,我们使用了一些函数案例研究,包括带有噪声和无噪声的函数数据,以示新方法的强大性。我们使用它来估计指数增长和减速率、振荡函数的宽度和峰值强度、函数数据的 monotonicity 和曲线性、函数数据的 algebraic 增长和指数增长、函数数据的峰值数量等。其次,我们对噪声和无噪声杂化数据中的 Lyapunov 指数进行估计,从 epidemic 曲线中估计疾病传播率,并在药物溶解曲线上检测同义性。最后,我们应用这种方法到实际数据,以进行 Parkinson 病患分类问题。这种简单的方法具有高准确率,并在工程和医学应用中表示了承诺。
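
The core preprocessing step, turning a sampled function into a 28x28 image a small CNN can consume, can be sketched as below. The min-max normalisation and single-pixel curve rasterisation are assumptions; the paper may use a different rendering.

```python
import numpy as np

def curve_to_image(y, size=28):
    """y: 1-D samples of a function on a uniform grid; returns a (size, size) binary image."""
    x = np.linspace(0, size - 1, num=len(y))
    y = (y - y.min()) / (y.max() - y.min() + 1e-12)     # scale values into [0, 1]
    rows = np.clip(((1 - y) * (size - 1)).round().astype(int), 0, size - 1)
    cols = np.clip(x.round().astype(int), 0, size - 1)
    img = np.zeros((size, size), dtype=np.float32)
    img[rows, cols] = 1.0                               # draw the curve, top row = max value
    return img

t = np.linspace(0, 1, 200)
img = curve_to_image(np.exp(2.5 * t))                   # e.g. an exponential-growth curve
print(img.shape, img.sum())
```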

Can pre-trained models assist in dataset distillation?

  • paper_url: http://arxiv.org/abs/2310.03295
  • repo_url: https://github.com/yaolu-zjut/ddinterpreter
  • paper_authors: Yao Lu, Xuguang Chen, Yuchen Zhang, Jianyang Gu, Tianle Zhang, Yifan Zhang, Xiaoniu Yang, Qi Xuan, Kai Wang, Yang You
  • for: 本研究旨在探讨Pre-trained Models(PTMs)是否能够有效地传递知识到合成 dataset,以便DD可以准确地进行。
  • methods: 我们通过进行先验实验,并系统地研究不同的PTMs选项,包括初始化参数、模型架构、训练轮数和领域知识,以探讨PTMs对DD的贡献。
  • results: 我们发现:1)提高模型多样性可以提升合成数据集的性能;2)次优模型同样可以帮助DD,并且在某些情况下能超越训练充分的模型;3)领域特定的PTM并非必需,但合理的领域匹配至关重要。通过选择最佳选项,我们显著提升了相对于基线DD方法的跨架构泛化性能。
    Abstract Dataset Distillation (DD) is a prominent technique that encapsulates knowledge from a large-scale original dataset into a small synthetic dataset for efficient training. Meanwhile, Pre-trained Models (PTMs) function as knowledge repositories, containing extensive information from the original dataset. This naturally raises a question: Can PTMs effectively transfer knowledge to synthetic datasets, guiding DD accurately? To this end, we conduct preliminary experiments, confirming the contribution of PTMs to DD. Afterwards, we systematically study different options in PTMs, including initialization parameters, model architecture, training epoch and domain knowledge, revealing that: 1) Increasing model diversity enhances the performance of synthetic datasets; 2) Sub-optimal models can also assist in DD and outperform well-trained ones in certain cases; 3) Domain-specific PTMs are not mandatory for DD, but a reasonable domain match is crucial. Finally, by selecting optimal options, we significantly improve the cross-architecture generalization over baseline DD methods. We hope our work will facilitate researchers to develop better DD techniques. Our code is available at https://github.com/yaolu-zjut/DDInterpreter.
    摘要 数据集蒸馏(DD)是一种将大规模原始数据集中的知识浓缩到小型合成数据集中以实现高效训练的重要技术。与此同时,预训练模型(PTM)作为知识库,蕴含着来自原始数据集的丰富信息。这自然引出一个问题:PTM 能否有效地将知识迁移到合成数据集,从而准确地指导 DD?为此,我们进行了初步实验,确认了 PTM 对 DD 的贡献。随后,我们系统地研究了 PTM 的不同选项,包括初始化参数、模型架构、训练轮数和领域知识,发现:1)提高模型多样性可以提升合成数据集的性能;2)次优模型同样可以帮助 DD,并且在某些情况下能超越训练充分的模型;3)领域特定的 PTM 并非 DD 的必要条件,但合理的领域匹配至关重要。最后,通过选择最优选项,我们显著提升了相对于基线 DD 方法的跨架构泛化性能。我们希望这项工作能够帮助研究人员开发更好的 DD 技术。我们的代码可以在 https://github.com/yaolu-zjut/DDInterpreter 上找到。

SimVLG: Simple and Efficient Pretraining of Visual Language Generative Models

  • paper_url: http://arxiv.org/abs/2310.03291
  • repo_url: None
  • paper_authors: Yiren Jian, Tingkai Liu, Yunzhe Tao, Soroush Vosoughi, Hongxia Yang
  • for: 这篇论文目的是提出一种高效的视觉语言生成模型预训练方法,使用冻结的大型自然语言模型(LLM)。
  • methods: 该方法使用单阶段、单一损失的框架,通过在训练过程中逐渐合并相似的视觉token来压缩视觉信息,同时保留语义内容的丰富性,从而实现快速收敛。
  • results: 该方法可以将视觉语言模型的训练速度提高 $\times 5$ 而不会显著影响性能,并且仅用 $1/10$ 的数据即可达到与现有视觉语言模型相当的性能。此外,借助一种新的软注意力时序token合并模块,该方法还可以将图像-文本模型扩展到视频语言生成任务。
    Abstract In this paper, we propose ``SimVLG'', a streamlined framework for the pre-training of computationally intensive vision-language generative models, leveraging frozen pre-trained large language models (LLMs). The prevailing paradigm in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive phase dedicated to general-purpose vision-language representation learning, aimed at extracting and consolidating pertinent visual features, followed by a subsequent phase focusing on end-to-end alignment between visual and linguistic modalities. Our one-stage, single-loss framework circumvents the aforementioned computationally demanding first stage of training by gradually merging similar visual tokens during training. This gradual merging process effectively compacts the visual information while preserving the richness of semantic content, leading to fast convergence without sacrificing performance. Our experiments show that our approach can speed up the training of vision-language models by a factor $\times 5$ without noticeable impact on the overall performance. Additionally, we show that our models can achieve comparable performance to current vision-language models with only $1/10$ of the data. Finally, we demonstrate how our image-text models can be easily adapted to video-language generative tasks through a novel soft attentive temporal token merging modules.
    摘要 在这篇论文中,我们提出了``SimVLG''框架,用于预训练计算量庞大的视觉语言生成模型,并利用冻结的大型语言模型(LLM)。传统的视觉语言预训练(VLP)范式通常包括两个阶段的优化过程:第一个阶段是资源密集的通用视觉语言表示学习,旨在提取并整合相关的视觉特征;第二个阶段则关注视觉与语言模态之间的端到端对齐。我们的单阶段、单一损失框架通过在训练过程中逐渐合并相似的视觉符号,避开了上述计算量庞大的第一阶段训练。这种逐步合并过程在压缩视觉信息的同时保留了语义内容的丰富性,从而在不牺牲性能的情况下实现快速收敛。实验表明,我们的方法可以将视觉语言模型的训练速度提高五倍,而不会对整体性能造成明显影响。此外,我们还证明了我们的模型仅用 $1/10$ 的数据即可达到与当前视觉语言模型相当的性能。最后,我们展示了我们的图像-文本模型可以通过一种新的软注意力时序符号合并模块,轻松地适配到视频语言生成任务。
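
A rough sketch of gradually merging similar visual tokens, the mechanism SimVLG relies on to shrink the sequence fed to the frozen LLM: the most cosine-similar token pairs are averaged at each merge step. The greedy pairing rule and merge schedule below are assumptions for illustration, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def merge_tokens(tokens, n_merge):
    """tokens: (N, D) visual tokens; merges the n_merge most-similar pairs into their means."""
    n = tokens.shape[0]
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.T
    sim.fill_diagonal_(-float("inf"))
    order = sim.flatten().argsort(descending=True)
    merged, used = [], set()
    for flat in order.tolist():                  # greedily pick highest-similarity free pairs
        if len(merged) >= n_merge:
            break
        i, j = flat // n, flat % n
        if i in used or j in used:
            continue
        used.update((i, j))
        merged.append((tokens[i] + tokens[j]) / 2)
    keep = [tokens[k] for k in range(n) if k not in used]
    return torch.stack(keep + merged)

print(merge_tokens(torch.randn(16, 8), n_merge=4).shape)   # 16 tokens -> 12
```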

PoseAction: Action Recognition for Patients in the Ward using Deep Learning Approaches

  • paper_url: http://arxiv.org/abs/2310.03288
  • repo_url: None
  • paper_authors: Zherui Li, Raye Chen-Hua Yeow
  • for: 这篇论文是为了提出一个基于计算机视觉和深度学习的方法,用于在医院内部的人员行为识别和预测。
  • methods: 这篇论文使用了OpenPose来准确地检测人员的位置,并使用AlphAction的异步互动聚合网络来预测人员的动作。这两个模型结合使用,称为PoseAction。
  • results: PoseAction模型在识别12种常见的病房区域动作时取得了98.72%的分类mAP(IoU@0.5)。此外,这篇论文还开发了在线实时动作识别模式,有力支持PoseAction的临床转化。同时,借助OpenPose的人脸关键点检测功能,这篇论文还实现了人脸模糊处理,以保护病人和医疗工作者的隐私。
    Abstract Real-time intelligent detection and prediction of subjects' behavior particularly their movements or actions is critical in the ward. This approach offers the advantage of reducing in-hospital care costs and improving the efficiency of healthcare workers, which is especially true for scenarios at night or during peak admission periods. Therefore, in this work, we propose using computer vision (CV) and deep learning (DL) methods for detecting subjects and recognizing their actions. We utilize OpenPose as an accurate subject detector for recognizing the positions of human subjects in the video stream. Additionally, we employ AlphAction's Asynchronous Interaction Aggregation (AIA) network to predict the actions of detected subjects. This integrated model, referred to as PoseAction, is proposed. At the same time, the proposed model is further trained to predict 12 common actions in ward areas, such as staggering, chest pain, and falling down, using medical-related video clips from the NTU RGB+D and NTU RGB+D 120 datasets. The results demonstrate that PoseAction achieves the highest classification mAP of 98.72% (IoU@0.5). Additionally, this study develops an online real-time mode for action recognition, which strongly supports the clinical translation of PoseAction. Furthermore, using OpenPose's function for recognizing face key points, we also implement face blurring, which is a practical solution to address the privacy protection concerns of patients and healthcare workers. Nevertheless, the training data for PoseAction is currently limited, particularly in terms of label diversity. Consequently, the subsequent step involves utilizing a more diverse dataset (including general actions) to train the model's parameters for improved generalization.
    摘要 “现场智能探测和预测病人的行为,特别是其运动或动作,在医院中是非常重要的。这种方法可以降低医院内部门成本和提高医疗工作者的效率,尤其在夜间或峰值 admit 期间。因此,在这种工作中,我们提议使用计算机视觉(CV)和深度学习(DL)方法来探测和识别病人的动作。我们使用 OpenPose 作为准确的人体探测器,并使用 AlphAction 的异步互动聚合(AIA)网络预测病人的动作。这个整体模型被称为 PoseAction。同时,我们进一步训练这个模型,以预测医院区域中的 12 种常见动作,如摇摆、胸痛和跌倒。结果表明,PoseAction 达到了最高的分类MAP 98.72%(IoU@0.5)。此外,本研究还开发了在线实时模式,以便支持临床翻译。此外,通过 OpenPose 的人脸关键点识别功能,我们还实现了面部模糊,这是一个实际的解决方案,以保护患者和医疗工作者的隐私。然而,PoseAction 的训练数据目前还受限,特别是Label多样性不够。因此,后续步骤是使用更多的多样化数据(包括通用动作)来训练模型的参数,以提高其泛化能力。”
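
The privacy step described above, blurring faces using detected face key points, can be sketched with OpenCV as follows. The padded bounding box and Gaussian kernel size are arbitrary illustrative choices; the key points would come from OpenPose in the actual pipeline.

```python
import cv2
import numpy as np

def blur_face(frame, face_keypoints, pad=0.3):
    """frame: HxWx3 image; face_keypoints: (K, 2) array of (x, y) detected face points."""
    pts = np.asarray(face_keypoints, dtype=np.float32)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    w, h = x1 - x0, y1 - y0
    x0 = int(max(0, x0 - pad * w)); y0 = int(max(0, y0 - pad * h))        # pad the box
    x1 = int(min(frame.shape[1], x1 + pad * w)); y1 = int(min(frame.shape[0], y1 + pad * h))
    roi = frame[y0:y1, x0:x1]
    if roi.size:
        frame[y0:y1, x0:x1] = cv2.GaussianBlur(roi, (31, 31), 0)          # anonymise the face
    return frame

frame = np.zeros((480, 640, 3), dtype=np.uint8)
kps = np.array([[300, 100], [340, 100], [320, 140], [310, 160], [330, 160]])
print(blur_face(frame, kps).shape)
```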

Classifying Whole Slide Images: What Matters?

  • paper_url: http://arxiv.org/abs/2310.03279
  • repo_url: None
  • paper_authors: Long Nguyen, Aiden Nibali, Joshua Millward, Zhen He
  • for: 这篇论文旨在研究超高分辨率全切片图像(WSI)的分类算法,并探讨其中哪些设计选择最为关键。
  • methods: 这篇论文使用了不同的设计选择来探索WSIs分类算法中最重要的特征。
  • results: 研究发现,在WSIs分类中,最重要的特征是在小 patch 级别上捕捉的地方环境细节,而不是全幕级别的全球信息。此外,一种简单的多实例学习方法,不捕捉全球信息,也可以达到高精度。
    Abstract Recently there have been many algorithms proposed for the classification of very high resolution whole slide images (WSIs). These new algorithms are mostly focused on finding novel ways to combine the information from small local patches extracted from the slide, with an emphasis on effectively aggregating more global information for the final predictor. In this paper we thoroughly explore different key design choices for WSI classification algorithms to investigate what matters most for achieving high accuracy. Surprisingly, we found that capturing global context information does not necessarily mean better performance. A model that captures the most global information consistently performs worse than a model that captures less global information. In addition, a very simple multi-instance learning method that captures no global information performs almost as well as models that capture a lot of global information. These results suggest that the most important features for effective WSI classification are captured at the local small patch level, where cell and tissue micro-environment detail is most pronounced. Another surprising finding was that unsupervised pre-training on a larger set of 33 cancers gives significantly worse performance compared to pre-training on a smaller dataset of 7 cancers (including the target cancer). We posit that pre-training on a smaller, more focused dataset allows the feature extractor to make better use of the limited feature space to better discriminate between subtle differences in the input patch.
    摘要 近来涌现出许多用于超高分辨率全切片图像(WSI)分类的算法。这些新算法大多致力于寻找新的方式来组合从切片中提取的小块局部patch信息,并强调为最终预测器有效聚合更全局的信息。本文深入探讨了WSI分类算法中的各种关键设计选择,以研究什么才是取得高精度的关键。出人意料的是,我们发现捕捉全局上下文信息并不一定带来更好的性能:捕捉最多全局信息的模型反而持续逊色于捕捉较少全局信息的模型;而一种完全不捕捉全局信息的简单多实例学习方法,其表现几乎与捕捉大量全局信息的模型相当。这些结果表明,对有效的WSI分类而言,最重要的特征是在局部小patch层面捕捉的,因为细胞和组织微环境的细节在该层面最为突出。另一个出人意料的发现是,在包含33种癌症的更大数据集上进行无监督预训练,其效果显著差于在仅包含7种癌症(含目标癌症)的较小数据集上预训练。我们推测,在更小、更聚焦的数据集上预训练,能让特征提取器更好地利用有限的特征空间,从而更好地区分输入patch之间的细微差异。
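
The "no global context" baseline that the paper finds surprisingly competitive can be sketched as a mean-pooling multi-instance learner: every patch is scored independently and the slide prediction is the average of patch logits, with no patch-to-patch interaction. The tiny encoder below is a stand-in for a real patch backbone and is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MeanPoolMIL(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.patch_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes))

    def forward(self, patches):          # patches: (n_patches, 3, H, W) from one slide
        logits = self.patch_encoder(patches)
        return logits.mean(dim=0)        # slide-level logits, no patch-to-patch interaction

slide_patches = torch.randn(64, 3, 224, 224)   # 64 patches extracted from one WSI
print(MeanPoolMIL()(slide_patches).shape)      # torch.Size([2])
```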

Ablation Study to Clarify the Mechanism of Object Segmentation in Multi-Object Representation Learning

  • paper_url: http://arxiv.org/abs/2310.03273
  • repo_url: None
  • paper_authors: Takayuki Komatsu, Yoshiyuki Ohmura, Yasuo Kuniyoshi
  • for: 多个物体表示学习旨在将复杂的真实世界视觉输入转化为多个物体的组合。
  • methods: 主流方法通常使用无监督学习将输入图像分割成各个物体,并将每个物体编码到对应的潜在向量中。但是,先前的方法究竟如何实现正确的物体分割尚不清楚。此外,大多数先前方法使用变分自编码器(VAE)对潜在向量进行正则化,因此也不清楚VAE正则化是否有助于正确的物体分割。
  • results: 为了阐明多物体表示学习中物体分割的机制,我们对典型方法MONet进行了消融研究。MONet使用由注意力掩码及其对应潜在向量组成的配对来表示多个物体:每个潜在向量由输入图像和注意力掩码编码得到,再从每个潜在向量解码出组件图像和注意力掩码。MONet的损失函数包括:1)输入图像与解码组件图像之间的重建损失之和;2)潜在向量的VAE正则化损失;3)为显式编码形状信息而设的注意力掩码重建损失。我们对这三项损失进行了消融研究,发现VAE正则化损失不影响分割性能,而另外两项损失确实会影响分割性能。基于这一结果,我们提出假设:关键在于最大化由单个潜在向量最佳表示的图像区域所对应的注意力掩码。我们通过评估一个具有相同机制的新损失函数验证了这一假设。
    Abstract Multi-object representation learning aims to represent complex real-world visual input using the composition of multiple objects. Representation learning methods have often used unsupervised learning to segment an input image into individual objects and encode these objects into each latent vector. However, it is not clear how previous methods have achieved the appropriate segmentation of individual objects. Additionally, most of the previous methods regularize the latent vectors using a Variational Autoencoder (VAE). Therefore, it is not clear whether VAE regularization contributes to appropriate object segmentation. To elucidate the mechanism of object segmentation in multi-object representation learning, we conducted an ablation study on MONet, which is a typical method. MONet represents multiple objects using pairs that consist of an attention mask and the latent vector corresponding to the attention mask. Each latent vector is encoded from the input image and attention mask. Then, the component image and attention mask are decoded from each latent vector. The loss function of MONet consists of 1) the sum of reconstruction losses between the input image and decoded component image, 2) the VAE regularization loss of the latent vector, and 3) the reconstruction loss of the attention mask to explicitly encode shape information. We conducted an ablation study on these three loss functions to investigate the effect on segmentation performance. Our results showed that the VAE regularization loss did not affect segmentation performance and the others losses did affect it. Based on this result, we hypothesize that it is important to maximize the attention mask of the image region best represented by a single latent vector corresponding to the attention mask. We confirmed this hypothesis by evaluating a new loss function with the same mechanism as the hypothesis.
    摘要 多对象表示学习目标是将复杂的真实世界视觉输入转换为多个对象的组合。表示学习方法通常使用无监督学习来将输入图像分割成各个对象,并将这些对象编码到每个幂量中。然而,没有准确的方法来实现适当的对象分割。此外,大多数前一代方法使用Variational Autoencoder(VAE)来规范幂量。因此,不清楚VAE规范是否对适当的对象分割做出贡献。为了解释多对象表示学习中对象分割机制,我们进行了MONet方法的ablation研究。MONet使用对应于注意Mask和幂量的对象对来表示多个对象。每个幂量来自输入图像和注意Mask的编码。然后,从每个幂量中解码输入图像和注意Mask。MONet的损失函数包括1)输入图像和解码组件图像之间的总差异损失,2)VAE规范损失,3)注意Mask的重建损失,以便显式地编码形状信息。我们对这三个损失函数进行了ablation研究,并证明VAE规范损失没有影响分割性能,而其他两个损失函数有影响。基于这结果,我们提出了一种新的损失函数,它的机制与我们的假设相同。我们证明了这种损失函数能够提高对象分割性能。

EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.03270
  • repo_url: None
  • paper_authors: Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang
  • for: 这个研究旨在提高Diffusion模型的实用性,以便在实时应用中实现低延迟和低资料使用率。
  • methods: 量化主要有两类方法:训练后量化(PTQ)和量化感知训练(QAT)。我们提出的EfficientDM是一种无需训练数据、参数高效的微调框架,能够在保持与PTQ相当的效率的同时,达到QAT级别的性能。
  • results: 实验结果显示,我们的方法显著优于先前基于PTQ的扩散模型,同时保持相近的时间和数据效率。具体来说,在ImageNet 256x256上将LDM-4的权重和激活值量化到4比特时,sFID仅增加0.05。与基于QAT的方法相比,EfficientDM的量化速度快16.2倍,且生成质量相当。
    Abstract Diffusion models have demonstrated remarkable capabilities in image synthesis and related generative tasks. Nevertheless, their practicality for low-latency real-world applications is constrained by substantial computational costs and latency issues. Quantization is a dominant way to compress and accelerate diffusion models, where post-training quantization (PTQ) and quantization-aware training (QAT) are two main approaches, each bearing its own properties. While PTQ exhibits efficiency in terms of both time and data usage, it may lead to diminished performance in low bit-width. On the other hand, QAT can alleviate performance degradation but comes with substantial demands on computational and data resources. To capitalize on the advantages while avoiding their respective drawbacks, we introduce a data-free and parameter-efficient fine-tuning framework for low-bit diffusion models, dubbed EfficientDM, to achieve QAT-level performance with PTQ-like efficiency. Specifically, we propose a quantization-aware variant of the low-rank adapter (QALoRA) that can be merged with model weights and jointly quantized to low bit-width. The fine-tuning process distills the denoising capabilities of the full-precision model into its quantized counterpart, eliminating the requirement for training data. We also introduce scale-aware optimization and employ temporal learned step-size quantization to further enhance performance. Extensive experimental results demonstrate that our method significantly outperforms previous PTQ-based diffusion models while maintaining similar time and data efficiency. Specifically, there is only a marginal 0.05 sFID increase when quantizing both weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to QAT-based methods, our EfficientDM also boasts a 16.2x faster quantization speed with comparable generation quality.
    摘要 扩散模型在图像合成及相关生成任务中展现了卓越的能力,但高昂的计算成本和延迟问题限制了其在低延迟实际应用中的可行性。量化是压缩和加速扩散模型的主要手段,其中训练后量化(PTQ)和量化感知训练(QAT)是两类主要方法,各有特点:PTQ在时间和数据使用上都很高效,但在低比特位宽下可能导致性能下降;QAT可以缓解性能下降,但对计算和数据资源的需求很高。为了兼取两者优点并规避各自缺点,我们提出了一个无需数据、参数高效的低比特扩散模型微调框架EfficientDM,以PTQ级别的效率实现QAT级别的性能。具体而言,我们提出了一种量化感知的低秩适配器(QALoRA),它可以与模型权重合并并共同量化到低比特位宽。微调过程将全精度模型的去噪能力蒸馏到其量化版本中,从而无需训练数据。我们还引入了尺度感知优化,并采用时间步学习的步长量化,进一步提升性能。大量实验结果表明,我们的方法显著优于先前基于PTQ的扩散模型,同时保持相近的时间和数据效率。具体来说,在ImageNet 256x256上将LDM-4的权重和激活值量化到4比特时,sFID仅增加0.05。与基于QAT的方法相比,我们的EfficientDM的量化速度快16.2倍,且生成质量相当。
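
A hedged sketch of the QALoRA idea described above: a low-rank adapter is merged with the frozen weight and the merged matrix is fake-quantised inside the forward pass, so gradients reach only the adapter via a straight-through estimator. The rank, uniform quantiser, and scale handling are simplifications of the paper's method, assumed here for illustration.

```python
import torch
import torch.nn as nn

def fake_quant(w, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().detach() / qmax
    q = torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale
    return w + (q - w).detach()     # straight-through: quantised forward, identity backward

class QALoRALinear(nn.Module):
    def __init__(self, weight, rank=4):
        super().__init__()
        out_f, in_f = weight.shape
        self.weight = nn.Parameter(weight, requires_grad=False)   # frozen full-precision W
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x):
        merged = self.weight + self.B @ self.A                    # merge adapter into W
        return x @ fake_quant(merged).T                           # quantise the merged weights

layer = QALoRALinear(torch.randn(8, 16))
out = layer(torch.randn(2, 16))
out.sum().backward()
print(layer.A.grad.shape)            # gradients flow only through the low-rank adapter
```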