cs.CV - 2023-07-17

Identity-Preserving Aging of Face Images via Latent Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.08585
  • repo_url: None
  • paper_authors: Sudipta Banerjee, Govind Mittal, Ameya Joshi, Chinmay Hegde, Nasir Memon
  • for: Improving the performance of automated face recognition systems, which is inevitably impacted by facial aging.
  • methods: Uses latent text-to-image diffusion models to synthetically age and de-age face images, controllable via intuitive textual prompting.
  • results: The method achieves a high degree of visual realism and biometric fidelity with only few-shot training; on two benchmark datasets (CelebA and AgeDB), it reduces the False Non-Match Rate by roughly 44% compared to existing state-of-the-art baselines.
    Abstract The performance of automated face recognition systems is inevitably impacted by the facial aging process. However, high quality datasets of individuals collected over several years are typically small in scale. In this work, we propose, train, and validate the use of latent text-to-image diffusion models for synthetically aging and de-aging face images. Our models succeed with few-shot training, and have the added benefit of being controllable via intuitive textual prompting. We observe high degrees of visual realism in the generated images while maintaining biometric fidelity measured by commonly used metrics. We evaluate our method on two benchmark datasets (CelebA and AgeDB) and observe significant reduction (~44%) in the False Non-Match Rate compared to existing state-of-the-art baselines.
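
A minimal, illustrative sketch of prompt-driven age editing with an off-the-shelf latent diffusion img2img pipeline; the model ID, prompt wording, and strength value are assumptions for illustration, not the authors' released model or configuration.

```python
# Illustrative sketch only: prompt-driven age editing with a latent diffusion
# img2img pipeline. Model ID, prompts, and strength are assumptions, not the
# authors' configuration.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

face = Image.open("subject.jpg").convert("RGB").resize((512, 512))

# A low strength keeps identity-bearing structure; the text prompt steers age.
aged = pipe(
    prompt="photo of the same person as an elderly 75 year old, wrinkles, grey hair",
    image=face,
    strength=0.4,          # how much of the original latents to re-noise
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
aged.save("subject_aged.jpg")
```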

Scale-Aware Modulation Meet Transformer

  • paper_url: http://arxiv.org/abs/2307.08579
  • repo_url: https://github.com/afeng-x/smt
  • paper_authors: Weifeng Lin, Ziheng Wu, Jiayu Chen, Jun Huang, Lianwen Jin
  • For: The paper proposes a new vision Transformer called Scale-Aware Modulation Transformer (SMT) that can handle various downstream tasks efficiently by combining convolutional networks and vision Transformers.
  • Methods: The proposed SMT includes two novel designs: the Multi-Head Mixed Convolution (MHMC) and Scale-Aware Aggregation (SAA) modules. These modules enhance convolutional modulation and allow the network to capture multi-scale features and fuse information effectively.
  • Results: The proposed SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks, including image classification, object detection, and semantic segmentation. Specifically, SMT achieves 82.2% and 84.3% top-1 accuracy on ImageNet-1K, and outperforms the Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO for object detection and by 2.0 and 1.1 mIoU on ADE20K for semantic segmentation.
    Abstract This paper presents a new vision Transformer, Scale-Aware Modulation Transformer (SMT), that can handle various downstream tasks efficiently by combining the convolutional network and vision Transformer. The proposed Scale-Aware Modulation (SAM) in the SMT includes two primary novel designs. Firstly, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can capture multi-scale features and expand the receptive field. Secondly, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective, enabling information fusion across different heads. By leveraging these two modules, convolutional modulation is further enhanced. Furthermore, in contrast to prior works that utilized modulations throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which can effectively simulate the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M / 2.4GFLOPs and 32M / 7.7GFLOPs can achieve 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After pretrained on ImageNet-22K in 224^2 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned with resolution 224^2 and 384^2, respectively. For object detection with Mask R-CNN, the SMT base trained with 1x and 3x schedule outperforms the Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. For semantic segmentation with UPerNet, the SMT base test at single- and multi-scale surpasses Swin by 2.0 and 1.1 mIoU respectively on the ADE20K.
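
A rough sketch of the multi-head mixed convolution idea, under the assumption that each head group applies a depthwise convolution with a different kernel size before a pointwise aggregation; the kernel sizes and aggregation layout below are guesses for illustration, not the paper's exact MHMC/SAA design.

```python
# Sketch of multi-head mixed convolution: split channels into heads, give each
# head a depthwise conv with a different kernel size, then aggregate with a
# 1x1 conv. Kernel sizes and layout are assumptions for illustration.
import torch
import torch.nn as nn

class MixedConvHeads(nn.Module):
    def __init__(self, dim, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        assert dim % len(kernel_sizes) == 0
        self.head_dim = dim // len(kernel_sizes)
        self.convs = nn.ModuleList([
            nn.Conv2d(self.head_dim, self.head_dim, k, padding=k // 2,
                      groups=self.head_dim)          # depthwise, per-head receptive field
            for k in kernel_sizes
        ])
        self.aggregate = nn.Conv2d(dim, dim, 1)      # lightweight cross-head fusion

    def forward(self, x):                            # x: (B, C, H, W)
        heads = torch.split(x, self.head_dim, dim=1)
        mixed = torch.cat([conv(h) for conv, h in zip(self.convs, heads)], dim=1)
        return self.aggregate(mixed)

x = torch.randn(2, 64, 56, 56)
print(MixedConvHeads(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```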

On the Fly Neural Style Smoothing for Risk-Averse Domain Generalization

  • paper_url: http://arxiv.org/abs/2307.08551
  • repo_url: https://github.com/akshaymehra24/riskaversedg
  • paper_authors: Akshay Mehra, Yunbei Zhang, Bhavya Kailkhura, Jihun Hamm
  • For: The paper proposes a Test-Time Neural Style Smoothing (TT-NSS) method to make predictions from domain generalization (DG) classifiers risk-averse on unseen domains.
  • Methods: The method performs test-time prediction with a "style-smoothed" DG classifier, using a neural style transfer module to re-stylize the test image on the fly, and abstains when the predictions on the stylized images lack consensus; a complementary neural style smoothing (NSS) training procedure improves prediction consistency.
  • Results: Experimental results show that TT-NSS and NSS improve the accuracy and risk-averseness of DG classifiers' predictions on unseen domains.
    Abstract Achieving high accuracy on data from domains unseen during training is a fundamental challenge in domain generalization (DG). While state-of-the-art DG classifiers have demonstrated impressive performance across various tasks, they have shown a bias towards domain-dependent information, such as image styles, rather than domain-invariant information, such as image content. This bias renders them unreliable for deployment in risk-sensitive scenarios such as autonomous driving where a misclassification could lead to catastrophic consequences. To enable risk-averse predictions from a DG classifier, we propose a novel inference procedure, Test-Time Neural Style Smoothing (TT-NSS), that uses a "style-smoothed" version of the DG classifier for prediction at test time. Specifically, the style-smoothed classifier classifies a test image as the most probable class predicted by the DG classifier on random re-stylizations of the test image. TT-NSS uses a neural style transfer module to stylize a test image on the fly, requires only black-box access to the DG classifier, and crucially, abstains when predictions of the DG classifier on the stylized test images lack consensus. Additionally, we propose a neural style smoothing (NSS) based training procedure that can be seamlessly integrated with existing DG methods. This procedure enhances prediction consistency, improving the performance of TT-NSS on non-abstained samples. Our empirical results demonstrate the effectiveness of TT-NSS and NSS at producing and improving risk-averse predictions on unseen domains from DG classifiers trained with SOTA training methods on various benchmark datasets and their variations.
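
A small sketch of style-smoothed, risk-averse inference: classify several random re-stylizations of the test image and abstain when the votes disagree. The `stylize` call stands in for a neural style transfer module and the consensus threshold is an assumption; this is not the authors' implementation.

```python
# Sketch of style-smoothed, risk-averse inference: classify several random
# re-stylizations of the test image and abstain when the votes disagree.
# `stylize` stands in for a neural style transfer module and is assumed here.
import torch

ABSTAIN = -1

def tt_nss_predict(classifier, image, style_bank, n_samples=8, consensus=0.75):
    votes = []
    for _ in range(n_samples):
        style = style_bank[torch.randint(len(style_bank), (1,)).item()]
        stylized = stylize(image, style)             # hypothetical style-transfer call
        with torch.no_grad():
            votes.append(classifier(stylized).argmax(dim=-1).item())
    votes = torch.tensor(votes)
    top_class = votes.mode().values.item()
    agreement = (votes == top_class).float().mean().item()
    return top_class if agreement >= consensus else ABSTAIN
```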

Improving Data Efficiency for Plant Cover Prediction with Label Interpolation and Monte-Carlo Cropping

  • paper_url: http://arxiv.org/abs/2307.08559
  • repo_url: None
  • paper_authors: Matthias Körschens, Solveig Franziska Bucher, Christine Römermann, Joachim Denzler
  • for: The paper addresses how automated camera systems and deep learning can be used for the automated analysis of vegetation plots (plant cover prediction).
  • methods: High-resolution images are collected with an automated camera system and analyzed with deep learning; sparse plant cover labels are interpolated onto the intermediate unlabeled images of the time series, and a new Monte-Carlo Cropping method handles high-resolution images while implicitly augmenting and enlarging the training data.
  • results: Experiments show that the approach improves the investigated species, community, and segmentation metrics, and that Monte-Carlo Cropping additionally enlarges the effective training data and improves model performance.
    Abstract The plant community composition is an essential indicator of environmental changes and is, for this reason, usually analyzed in ecological field studies in terms of the so-called plant cover. The manual acquisition of this kind of data is time-consuming, laborious, and prone to human error. Automated camera systems can collect high-resolution images of the surveyed vegetation plots at a high frequency. In combination with subsequent algorithmic analysis, it is possible to objectively extract information on plant community composition quickly and with little human effort. An automated camera system can easily collect the large amounts of image data necessary to train a Deep Learning system for automatic analysis. However, due to the amount of work required to annotate vegetation images with plant cover data, only few labeled samples are available. As automated camera systems can collect many pictures without labels, we introduce an approach to interpolate the sparse labels in the collected vegetation plot time series down to the intermediate dense and unlabeled images to artificially increase our training dataset to seven times its original size. Moreover, we introduce a new method we call Monte-Carlo Cropping. This approach trains on a collection of cropped parts of the training images to deal with high-resolution images efficiently, implicitly augment the training images, and speed up training. We evaluate both approaches on a plant cover dataset containing images of herbaceous plant communities and find that our methods lead to improvements in the species, community, and segmentation metrics investigated.
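
A minimal sketch of the Monte-Carlo Cropping idea: each training step samples random crops from the high-resolution image (and its dense target) instead of feeding the full image. Crop size and count are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of Monte-Carlo Cropping: each training step samples random
# crops from the high-resolution image (and its dense target) instead of
# feeding the full image. Crop size and count are illustrative assumptions.
import torch

def monte_carlo_crops(image, target, crop_size=512, n_crops=4):
    """image: (C, H, W); target: (K, H, W) dense plant-cover maps."""
    _, h, w = image.shape
    crops = []
    for _ in range(n_crops):
        top = torch.randint(0, h - crop_size + 1, (1,)).item()
        left = torch.randint(0, w - crop_size + 1, (1,)).item()
        crops.append((image[:, top:top + crop_size, left:left + crop_size],
                      target[:, top:top + crop_size, left:left + crop_size]))
    return crops  # implicit augmentation: different crops every epoch
```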

Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.08544
  • repo_url: https://github.com/liuguandu/rc-lut
  • paper_authors: Guandu Liu, Yukang Ding, Mading Li, Ming Sun, Xing Wen, Bin Wang
  • for: Improving the receptive field and efficiency of look-up-table (LUT)-based single image super-resolution (SR).
  • methods: Proposes a novel Reconstructed Convolution (RC) module that decouples channel-wise and spatial computation, reducing LUT storage while enlarging the receptive field (RF).
  • results: Compared with state-of-the-art LUT-based SR methods, the proposed RCLUT enlarges the RF by 9 times and achieves superior performance on five popular benchmark datasets; the RC module can also serve as a plugin to improve other LUT-based SR methods.
    Abstract Look-up table (LUT)-based methods have shown great efficacy in the single image super-resolution (SR) task. However, previous methods ignore the essential reason for the restricted receptive field (RF) size in LUT, which is caused by the interaction of space and channel features in vanilla convolution. They can only increase the RF at the cost of linearly increasing LUT size. To enlarge the RF with contained LUT sizes, we propose a novel Reconstructed Convolution (RC) module, which decouples channel-wise and spatial calculation. It can be formulated as $n^2$ 1D LUTs to maintain an $n\times n$ receptive field, which is obviously smaller than the $n\times n$D LUT formulated before. The LUT generated by our RC module requires less than 1/10000 of the storage of the SR-LUT baseline. The proposed Reconstructed Convolution module based LUT method, termed RCLUT, can enlarge the RF size 9 times over the state-of-the-art LUT-based SR method and achieves superior performance on five popular benchmark datasets. Moreover, the efficient and robust RC module can be used as a plugin to improve other LUT-based SR methods. The code is available at https://github.com/liuguandu/RC-LUT.
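
A sketch of the "n^2 separate 1D LUTs" idea: each position in an n x n window gets its own per-pixel mapping, cached over all 256 intensities, and the window response is the sum of the n^2 table lookups. The tiny per-position MLP below is a stand-in for illustration, not the paper's trained network.

```python
# Sketch of the "n^2 separate 1D LUTs" idea: each position in an n x n window
# gets its own per-pixel mapping, cached over all 256 intensities, and the
# window response is the sum of the n^2 table lookups. The tiny per-position
# MLP below is a stand-in, not the paper's trained network.
import torch
import torch.nn as nn

n, out_dim = 3, 4                      # 3x3 receptive field, 4 output features
mlps = nn.ModuleList([
    nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, out_dim))
    for _ in range(n * n)
])

with torch.no_grad():
    levels = torch.arange(256, dtype=torch.float32).unsqueeze(1) / 255.0
    luts = torch.stack([mlp(levels) for mlp in mlps])   # (n*n, 256, out_dim) 1D LUTs

def lut_response(window):              # window: (n, n) uint8 patch
    idx = window.reshape(-1).long()                      # one intensity per position
    return sum(luts[i, idx[i]] for i in range(n * n))    # sum over the n^2 tables

patch = torch.randint(0, 256, (n, n), dtype=torch.uint8)
print(lut_response(patch))
```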

Variational Probabilistic Fusion Network for RGB-T Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.08536
  • repo_url: None
  • paper_authors: Baihong Lin, Zengrong Lin, Yulan Guo, Yulan Zhang, Jianxiao Zou, Shicai Fan
  • for: This work aims to improve the accuracy and robustness of RGB-T semantic segmentation in hard scenes with poor or varying lighting conditions.
  • methods: It proposes a novel Variational Probabilistic Fusion Network (VPFNet) that regards fusion features as random variables and obtains robust segmentation by averaging results over multiple samples; in VPFNet, a Variational Feature Fusion Module (VFFM) based on variational attention generates the random samples, and a weighted cross-entropy loss together with prior information on illumination and category counters class imbalance and modality bias.
  • results: Experimental results show that the proposed VPFNet achieves state-of-the-art segmentation performance on the MFNet and PST900 datasets.
    Abstract RGB-T semantic segmentation has been widely adopted to handle hard scenes with poor lighting conditions by fusing different modality features of RGB and thermal images. Existing methods try to find an optimal fusion feature for segmentation, resulting in sensitivity to modality noise, class-imbalance, and modality bias. To overcome the problems, this paper proposes a novel Variational Probabilistic Fusion Network (VPFNet), which regards fusion features as random variables and obtains robust segmentation by averaging segmentation results under multiple samples of fusion features. The random samples generation of fusion features in VPFNet is realized by a novel Variational Feature Fusion Module (VFFM) designed based on variation attention. To further avoid class-imbalance and modality bias, we employ the weighted cross-entropy loss and introduce prior information of illumination and category to control the proposed VFFM. Experimental results on MFNet and PST900 datasets demonstrate that the proposed VPFNet can achieve state-of-the-art segmentation performance.
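
A sketch of treating the fused RGB-thermal feature as a Gaussian random variable: predict a mean and log-variance, draw several reparameterized samples, and average the segmentation logits. Layer sizes and the number of samples are assumptions, not the paper's VFFM design.

```python
# Sketch of treating the fused RGB-thermal feature as a Gaussian random
# variable: predict a mean and log-variance, draw several reparameterized
# samples, and average the segmentation logits. Layer sizes are assumptions.
import torch
import torch.nn as nn

class VariationalFusion(nn.Module):
    def __init__(self, dim, n_classes):
        super().__init__()
        self.mu = nn.Conv2d(2 * dim, dim, 1)
        self.logvar = nn.Conv2d(2 * dim, dim, 1)
        self.head = nn.Conv2d(dim, n_classes, 1)

    def forward(self, rgb_feat, thermal_feat, n_samples=5):
        z = torch.cat([rgb_feat, thermal_feat], dim=1)
        mu, logvar = self.mu(z), self.logvar(z)
        logits = 0
        for _ in range(n_samples):                       # average over fusion samples
            eps = torch.randn_like(mu)
            sample = mu + eps * torch.exp(0.5 * logvar)  # reparameterization trick
            logits = logits + self.head(sample)
        return logits / n_samples

fusion = VariationalFusion(dim=64, n_classes=9)
out = fusion(torch.randn(1, 64, 60, 80), torch.randn(1, 64, 60, 80))
print(out.shape)  # torch.Size([1, 9, 60, 80])
```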

Multi-class point cloud completion networks for 3D cardiac anatomy reconstruction from cine magnetic resonance images

  • paper_url: http://arxiv.org/abs/2307.08535
  • repo_url: None
  • paper_authors: Marcel Beetz, Abhirup Banerjee, Julius Ossenberg-Engels, Vicente Grau
  • For: The paper proposes a fully automatic 3D cardiac anatomy reconstruction method that produces three-dimensional heart meshes from cine magnetic resonance imaging (cine MRI).
  • Methods: The method uses a multi-class point cloud completion network (PCCN) to address both the sparsity and the misalignment issues of the 3D reconstruction task in a unified model. On a large synthetic dataset of biventricular anatomies, Chamfer distances between reconstructed and gold standard anatomies are below or similar to the image resolution across multiple levels of slice misalignment, and reconstruction error is reduced by 32% (Hausdorff distance) and 24% (mean surface distance) compared with a 3D U-Net benchmark.
  • Results: Applying the PCCN as part of the automated reconstruction pipeline to 1000 subjects from the UK Biobank study in a cross-domain transfer setting, the method reconstructs accurate and topologically plausible biventricular heart meshes with clinical metrics comparable to the previous literature; a robustness analysis shows that it successfully handles multiple common outlier conditions.
    Abstract Cine magnetic resonance imaging (MRI) is the current gold standard for the assessment of cardiac anatomy and function. However, it typically only acquires a set of two-dimensional (2D) slices of the underlying three-dimensional (3D) anatomy of the heart, thus limiting the understanding and analysis of both healthy and pathological cardiac morphology and physiology. In this paper, we propose a novel fully automatic surface reconstruction pipeline capable of reconstructing multi-class 3D cardiac anatomy meshes from raw cine MRI acquisitions. Its key component is a multi-class point cloud completion network (PCCN) capable of correcting both the sparsity and misalignment issues of the 3D reconstruction task in a unified model. We first evaluate the PCCN on a large synthetic dataset of biventricular anatomies and observe Chamfer distances between reconstructed and gold standard anatomies below or similar to the underlying image resolution for multiple levels of slice misalignment. Furthermore, we find a reduction in reconstruction error compared to a benchmark 3D U-Net by 32% and 24% in terms of Hausdorff distance and mean surface distance, respectively. We then apply the PCCN as part of our automated reconstruction pipeline to 1000 subjects from the UK Biobank study in a cross-domain transfer setting and demonstrate its ability to reconstruct accurate and topologically plausible biventricular heart meshes with clinical metrics comparable to the previous literature. Finally, we investigate the robustness of our proposed approach and observe its capacity to successfully handle multiple common outlier conditions.
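
For reference, a sketch of the symmetric Chamfer distance used to compare a reconstructed point cloud with a gold standard one; a brute-force O(N*M) version for illustration only.

```python
# Sketch of the (symmetric) Chamfer distance used to compare a reconstructed
# point cloud with a gold standard one; a brute-force O(N*M) version for
# illustration only.
import torch

def chamfer_distance(pred, gt):
    """pred: (N, 3), gt: (M, 3) point clouds."""
    d = torch.cdist(pred, gt)                       # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.randn(1024, 3)
gt = torch.randn(2048, 3)
print(chamfer_distance(pred, gt).item())
```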

Multi-Domain Learning with Modulation Adapters

  • paper_url: http://arxiv.org/abs/2307.08528
  • repo_url: None
  • paper_authors: Ekaterina Iakovleva, Karteek Alahari, Jakob Verbeek
  • for: This paper addresses image classification across multiple domains, training jointly on related tasks and domains instead of in isolation.
  • methods: It uses convolutional networks whose filter weights are updated multiplicatively with task-specific Modulation Adapters, parameterized in a factored manner so the number of per-task parameters can be scaled flexibly.
  • results: The approach achieves excellent results on the Visual Decathlon challenge and the ImageNet-to-Sketch benchmark, with accuracies comparable to or better than existing state-of-the-art methods.
    Abstract Deep convolutional networks are ubiquitous in computer vision, due to their excellent performance across different tasks for various domains. Models are, however, often trained in isolation for each task, failing to exploit relatedness between tasks and domains to learn more compact models that generalise better in low-data regimes. Multi-domain learning aims to handle related tasks, such as image classification across multiple domains, simultaneously. Previous work on this problem explored the use of a pre-trained and fixed domain-agnostic base network, in combination with smaller learnable domain-specific adaptation modules. In this paper, we introduce Modulation Adapters, which update the convolutional filter weights of the model in a multiplicative manner for each task. Parameterising these adaptation weights in a factored manner allows us to scale the number of per-task parameters in a flexible manner, and to strike different parameter-accuracy trade-offs. We evaluate our approach on the Visual Decathlon challenge, composed of ten image classification tasks across different domains, and on the ImageNet-to-Sketch benchmark, which consists of six image classification tasks. Our approach yields excellent results, with accuracies that are comparable to or better than those of existing state-of-the-art approaches.
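
A sketch of a multiplicative modulation adapter: a shared conv filter bank is rescaled per task by a factored (rank-r) matrix over output x input channels. The rank, initialization, and placement are assumptions for illustration, not the paper's exact parameterization.

```python
# Sketch of a multiplicative modulation adapter: a shared conv filter bank is
# rescaled per task by a factored (rank-r) matrix over output x input channels.
# The rank and placement are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k, n_tasks, rank=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)  # shared
        self.u = nn.Parameter(torch.ones(n_tasks, out_ch, rank))             # per-task
        self.v = nn.Parameter(torch.ones(n_tasks, rank, in_ch) / rank)

    def forward(self, x, task_id):
        mod = self.u[task_id] @ self.v[task_id]            # (out_ch, in_ch) modulation
        w = self.weight * mod[:, :, None, None]            # multiplicative update
        return F.conv2d(x, w, padding=self.weight.shape[-1] // 2)

layer = ModulatedConv2d(16, 32, 3, n_tasks=10)
y = layer(torch.randn(2, 16, 32, 32), task_id=3)
print(y.shape)  # torch.Size([2, 32, 32, 32])
```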

BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization

  • paper_url: http://arxiv.org/abs/2307.08504
  • repo_url: None
  • paper_authors: Chaoya Jiang, Haiyang Xu, Wei Ye, Qinghao Ye, Chenliang Li, Ming Yan, Bin Bi, Shikun Zhang, Fei Huang, Songfang Huang
  • for: The paper aims to improve the training efficiency of ViT-based vision-language pre-training for understanding and generation tasks without sacrificing performance.
  • methods: It proposes a Bottom-Up Patch Summarization (BUS) approach that coordinates bottom-level extraction (a Text-Semantics-Aware Patch Selector in the ViT backbone) with top-level abstraction (a Transformer-based Patch Abstraction Decoder) to learn a concise summary of lengthy visual token sequences.
  • results: The model is competitive on various vision-language understanding and generation tasks while boosting training efficiency by 50%, maintaining or improving effectiveness; it also achieves state-of-the-art performance on many downstream tasks by increasing input image resolution without increasing computational cost.
    Abstract Vision Transformer (ViT) based Vision-Language Pre-training (VLP) models have demonstrated impressive performance in various tasks. However, the lengthy visual token sequences fed into ViT can lead to training inefficiency and ineffectiveness. Existing efforts address the challenge by either bottom-level patch extraction in the ViT backbone or top-level patch abstraction outside, not balancing training efficiency and effectiveness well. Inspired by text summarization in natural language processing, we propose a Bottom-Up Patch Summarization approach named BUS, coordinating bottom-level extraction and top-level abstraction to learn a concise summary of lengthy visual token sequences efficiently. Specifically, We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction and then attach a flexible Transformer-based Patch Abstraction Decoder (PAD) upon the backbone for top-level visual abstraction. This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness. We evaluate our approach on various visual-language understanding and generation tasks and show competitive downstream task performance while boosting the training efficiency by 50\%. Additionally, our model achieves state-of-the-art performance on many downstream tasks by increasing input image resolution without increasing computational costs over baselines.
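
A sketch of text-aware patch selection: score visual tokens by their similarity to the pooled text representation and keep only the top-k. The scoring rule and keep ratio are assumptions for illustration, not the exact TSPS design.

```python
# Sketch of text-aware patch selection: score visual tokens by their similarity
# to the pooled text representation and keep only the top-k. The scoring rule
# and keep ratio are assumptions, not the exact TSPS design.
import torch

def select_patches(patch_tokens, text_tokens, keep_ratio=0.5):
    """patch_tokens: (B, N, D); text_tokens: (B, T, D)."""
    text_query = text_tokens.mean(dim=1, keepdim=True)           # (B, 1, D)
    scores = (patch_tokens * text_query).sum(-1)                  # (B, N) relevance
    k = max(1, int(patch_tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices                           # keep most relevant
    batch = torch.arange(patch_tokens.size(0)).unsqueeze(1)
    return patch_tokens[batch, idx]                               # (B, k, D) summary

summary = select_patches(torch.randn(2, 196, 768), torch.randn(2, 12, 768))
print(summary.shape)  # torch.Size([2, 98, 768])
```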

Study of Vision Transformers for Covid-19 Detection from Chest X-rays

  • paper_url: http://arxiv.org/abs/2307.09402
  • repo_url: None
  • paper_authors: Sandeep Angara, Sharath Thirunagaru
  • for: The study targets COVID-19 detection from chest X-rays with vision transformers, to improve the efficiency and accuracy of screening.
  • methods: Recent transformer models, including the Vision Transformer (ViT), Swin Transformer, Max Vision Transformer (MViT), and Pyramid Vision Transformer (PVT), are fine-tuned via transfer learning from ImageNet weights, reaching accuracies between 98.75% and 99.5%.
  • results: Experimental results show that vision transformers achieve state-of-the-art performance for COVID-19 detection, outperforming traditional methods and even convolutional neural networks (CNNs), highlighting their potential for improving screening and diagnosis in clinical settings.
    Abstract The COVID-19 pandemic has led to a global health crisis, highlighting the need for rapid and accurate virus detection. This research paper examines transfer learning with vision transformers for COVID-19 detection, known for its excellent performance in image recognition tasks. We leverage the capability of Vision Transformers to capture global context and learn complex patterns from chest X-ray images. In this work, we explored the recent state-of-art transformer models to detect Covid-19 using CXR images such as vision transformer (ViT), Swin-transformer, Max vision transformer (MViT), and Pyramid Vision transformer (PVT). Through the utilization of transfer learning with IMAGENET weights, the models achieved an impressive accuracy range of 98.75% to 99.5%. Our experiments demonstrate that Vision Transformers achieve state-of-the-art performance in COVID-19 detection, outperforming traditional methods and even Convolutional Neural Networks (CNNs). The results highlight the potential of Vision Transformers as a powerful tool for COVID-19 detection, with implications for improving the efficiency and accuracy of screening and diagnosis in clinical settings.
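
A sketch of the transfer-learning setup: load an ImageNet-pretrained ViT and fine-tune it for binary COVID / non-COVID classification of chest X-rays. The dataset loader is a placeholder, and the model name and hyperparameters are assumptions rather than the paper's exact configuration.

```python
# Sketch of the transfer-learning setup: load an ImageNet-pretrained ViT and
# fine-tune it for binary COVID / non-COVID classification of chest X-rays.
# The dataset loader is a placeholder; model name and hyperparameters are
# assumptions, not the paper's exact configuration.
import timm
import torch
import torch.nn as nn

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader, device="cuda"):
    model.to(device).train()
    for images, labels in loader:           # loader yields (B, 3, 224, 224) CXR batches
        logits = model(images.to(device))
        loss = criterion(logits, labels.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```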

Cumulative Spatial Knowledge Distillation for Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.08500
  • repo_url: None
  • paper_authors: Borui Zhao, Renjie Song, Jiajun Liang
  • for: This work aims to improve the performance of vision transformers (ViTs) by distilling knowledge from convolutional neural networks (CNNs).
  • methods: It proposes Cumulative Spatial Knowledge Distillation (CSKD), which distills spatial knowledge from the CNN's corresponding spatial responses to all patch tokens of the ViT; a Cumulative Knowledge Fusion (CKF) module gradually increases the importance of the CNN's global response during training, so that the CNN's local inductive bias is exploited early in training while the ViT's global capability is given full play later.
  • results: CSKD achieves performance superior to the original ViT on ImageNet-1k and downstream datasets. Code will be made publicly available.
    Abstract Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts the performance since the image-friendly local-inductive bias of CNN helps ViT learn faster and better, but leading to two problems: (1) Network designs of CNN and ViT are completely different, which leads to different semantic levels of intermediate features, making spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from CNN limits the network convergence in the later training period since ViT's capability of integrating global information is suppressed by CNN's local-inductive-bias supervision. To this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of ViT from the corresponding spatial responses of CNN, without introducing intermediate features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of CNN and increasingly emphasizes its importance during the training. Applying CKF leverages CNN's local inductive bias in the early training period and gives full play to ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of our CSKD. Code will be publicly available.
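
A rough sketch of the cumulative idea: supervise every ViT patch token with the CNN's spatial response at the same location, and blend in the CNN's global (pooled) response with a weight that grows over training. The blending schedule and temperature are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of the cumulative idea: supervise every ViT patch token with the
# CNN's spatial response at the same location, and blend in the CNN's global
# (pooled) response with a weight that grows over training. The blending
# schedule and temperature are assumptions for illustration.
import torch
import torch.nn.functional as F

def cskd_loss(vit_patch_logits, cnn_spatial_logits, progress, tau=1.0):
    """vit_patch_logits: (B, N, C); cnn_spatial_logits: (B, N, C);
    progress: training progress in [0, 1]."""
    cnn_global = cnn_spatial_logits.mean(dim=1, keepdim=True)          # (B, 1, C)
    target = (1 - progress) * cnn_spatial_logits + progress * cnn_global
    log_p = F.log_softmax(vit_patch_logits / tau, dim=-1)
    q = F.softmax(target / tau, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * tau * tau

loss = cskd_loss(torch.randn(2, 196, 1000), torch.randn(2, 196, 1000), progress=0.3)
print(loss.item())
```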

SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator

  • paper_url: http://arxiv.org/abs/2307.08492
  • repo_url: https://github.com/czvvd/svdformer
  • paper_authors: Zhe Zhu, Honghua Chen, Xing He, Weiming Wang, Jing Qin, Mingqiang Wei
  • for: The paper proposes a new network, SVDFormer, to address two challenges in point cloud completion: understanding faithful global shapes from incomplete point clouds and generating high-accuracy local structures. Existing methods either rely on 3D coordinates alone or import extra color images with well-calibrated intrinsics, and do not fully exploit cross-modal self-structures for high-quality completion.
  • methods: A Self-view Fusion Network first leverages multiple-view depth image information to observe the incomplete self-shape and generate a compact global shape; a refinement module, the Self-structure Dual-generator, then incorporates learned shape priors and geometric self-similarities to produce new points, with a dual-path design that disentangles refinement strategies according to each point's incompleteness.
  • results: The method achieves state-of-the-art performance on widely used benchmarks. Code will be released at https://github.com/czvvd/SVDFormer.
    Abstract In this paper, we propose a novel network, SVDFormer, to tackle two specific challenges in point cloud completion: understanding faithful global shapes from incomplete point clouds and generating high-accuracy local structures. Current methods either perceive shape patterns using only 3D coordinates or import extra images with well-calibrated intrinsic parameters to guide the geometry estimation of the missing parts. However, these approaches do not always fully leverage the cross-modal self-structures available for accurate and high-quality point cloud completion. To this end, we first design a Self-view Fusion Network that leverages multiple-view depth image information to observe incomplete self-shape and generate a compact global shape. To reveal highly detailed structures, we then introduce a refinement module, called Self-structure Dual-generator, in which we incorporate learned shape priors and geometric self-similarities for producing new points. By perceiving the incompleteness of each point, the dual-path design disentangles refinement strategies conditioned on the structural type of each point. SVDFormer absorbs the wisdom of self-structures, avoiding any additional paired information such as color images with precisely calibrated camera intrinsic parameters. Comprehensive experiments indicate that our method achieves state-of-the-art performance on widely-used benchmarks. Code will be available at https://github.com/czvvd/SVDFormer.

Differentiable Transportation Pruning

  • paper_url: http://arxiv.org/abs/2307.08483
  • repo_url: None
  • paper_authors: Yunqiang Li, Jan C. van Gemert, Torsten Hoefler, Bert Moons, Evangelos Eleftheriou, Bram-Ernst Verhoef
  • for: The paper presents an efficient compression method for deploying deep learning models on resource-constrained edge devices.
  • methods: It uses an efficient, end-to-end differentiable optimal transportation scheme that automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks, with precise control over the output network size.
  • results: Compared with previous pruning methods, it achieves state-of-the-art performance on 3 datasets with 5 models, across a wide range of pruning ratios and with two types of sparsity budgets and pruning granularities.
    Abstract Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities.

SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training

  • paper_url: http://arxiv.org/abs/2307.08476
  • repo_url: https://github.com/hongyan1123/skeletonmae
  • paper_authors: Hong Yan, Yang Liu, Yushen Wei, Zhen Li, Guanbin Li, Liang Lin
  • for: The paper proposes an efficient skeleton sequence learning framework for action recognition that generalizes well across datasets.
  • methods: It builds an asymmetric graph-based encoder-decoder pre-training architecture, SkeletonMAE, which embeds skeleton joint sequences into a Graph Convolutional Network (GCN) and reconstructs masked joints and edges based on prior human topology knowledge; the pre-trained encoder is combined with a Spatial-Temporal Representation Learning (STRL) module to form the full framework.
  • results: Experiments show that the approach outperforms state-of-the-art self-supervised skeleton-based action recognition methods across datasets and achieves performance comparable to some fully supervised methods.
    Abstract Skeleton sequence representation learning has shown great advantages for action recognition due to its promising ability to model human joints and topology. However, the current methods usually require sufficient labeled data for training computationally expensive models, which is labor-intensive and time-consuming. Moreover, these methods ignore how to utilize the fine-grained dependencies among different skeleton joints to pre-train an efficient skeleton sequence learning model that can generalize well across different datasets. In this paper, we propose an efficient skeleton sequence learning framework, named Skeleton Sequence Learning (SSL). To comprehensively capture the human pose and obtain discriminative skeleton sequence representation, we build an asymmetric graph-based encoder-decoder pre-training architecture named SkeletonMAE, which embeds skeleton joint sequence into Graph Convolutional Network (GCN) and reconstructs the masked skeleton joints and edges based on the prior human topology knowledge. Then, the pre-trained SkeletonMAE encoder is integrated with the Spatial-Temporal Representation Learning (STRL) module to build the SSL framework. Extensive experimental results show that our SSL generalizes well across different datasets and outperforms the state-of-the-art self-supervised skeleton-based action recognition methods on FineGym, Diving48, NTU 60 and NTU 120 datasets. Additionally, we obtain comparable performance to some fully supervised methods. The code is available at https://github.com/HongYan1123/SkeletonMAE.

EGE-UNet: an Efficient Group Enhanced UNet for skin lesion segmentation

  • paper_url: http://arxiv.org/abs/2307.08473
  • repo_url: https://github.com/jcruan519/ege-unet
  • paper_authors: Jiacheng Ruan, Mingye Xie, Jingsheng Gao, Ting Liu, Yuzhuo Fu
  • for: This work proposes a more efficient medical image segmentation method, addressing the large parameter counts and computational load that make Transformers and their variants impractical for mobile health applications.
  • methods: It presents the Efficient Group Enhanced UNet (EGE-UNet), a lightweight design with two modules: a Group multi-axis Hadamard Product Attention module (GHPA), which applies a Hadamard product attention mechanism along different axes, and a Group Aggregation Bridge module (GAB), which fuses multi-scale information.
  • results: Experiments on the ISIC2017 and ISIC2018 datasets show that EGE-UNet outperforms existing state-of-the-art methods while reducing parameters and computation by 494x and 160x, respectively, relative to TransFuse; with only about 50KB of parameters, it is the first model of its kind with such a small parameter count.
    Abstract Transformer and its variants have been widely used for medical image segmentation. However, the large number of parameter and computational load of these models make them unsuitable for mobile health applications. To address this issue, we propose a more efficient approach, the Efficient Group Enhanced UNet (EGE-UNet). We incorporate a Group multi-axis Hadamard Product Attention module (GHPA) and a Group Aggregation Bridge module (GAB) in a lightweight manner. The GHPA groups input features and performs Hadamard Product Attention mechanism (HPA) on different axes to extract pathological information from diverse perspectives. The GAB effectively fuses multi-scale information by grouping low-level features, high-level features, and a mask generated by the decoder at each stage. Comprehensive experiments on the ISIC2017 and ISIC2018 datasets demonstrate that EGE-UNet outperforms existing state-of-the-art methods. In short, compared to the TransFuse, our model achieves superior segmentation performance while reducing parameter and computation costs by 494x and 160x, respectively. Moreover, to our best knowledge, this is the first model with a parameter count limited to just 50KB. Our code is available at https://github.com/JCruan519/EGE-UNet.
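
A rough sketch of a grouped Hadamard-product attention: each channel group is multiplied element-wise with an attention map produced by its own lightweight branch. The group layout and map generation below are assumptions, not the exact GHPA design.

```python
# Rough sketch of a grouped Hadamard-product attention: each channel group is
# multiplied element-wise with an attention map produced by a lightweight
# branch. Group layout and map generation are assumptions, not the exact GHPA
# design.
import torch
import torch.nn as nn

class GroupHadamardAttention(nn.Module):
    def __init__(self, dim, groups=4):
        super().__init__()
        self.groups = groups
        gd = dim // groups
        self.map_gens = nn.ModuleList([
            nn.Sequential(nn.Conv2d(gd, gd, 3, padding=1, groups=gd), nn.Sigmoid())
            for _ in range(groups)
        ])

    def forward(self, x):                                  # x: (B, C, H, W)
        chunks = torch.chunk(x, self.groups, dim=1)
        out = [c * gen(c) for c, gen in zip(chunks, self.map_gens)]  # Hadamard product
        return torch.cat(out, dim=1)

print(GroupHadamardAttention(32)(torch.randn(1, 32, 64, 64)).shape)
```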

Riesz feature representation: scale equivariant scattering network for classification tasks

  • paper_url: http://arxiv.org/abs/2307.08467
  • repo_url: None
  • paper_authors: Tin Barisin, Jesus Angulo, Katja Schladitz, Claudia Redenbach
  • for: The paper proposes a feature representation based on the Riesz transform that avoids sampling the scale dimension and is scale-equivariant.
  • methods: It defines and mathematically analyzes this Riesz-transform-based representation, which inherits scale equivariance from the Riesz transform and uses four times fewer features than scattering networks.
  • results: The representation performs comparably well for texture classification while adding scale equivariance, and it remains stable on scales not covered by the training data; in particular, digit classification accuracy stays stable even for scales four times larger than the training scale.
    Abstract Scattering networks yield powerful and robust hierarchical image descriptors which do not require lengthy training and which work well with very few training data. However, they rely on sampling the scale dimension. Hence, they become sensitive to scale variations and are unable to generalize to unseen scales. In this work, we define an alternative feature representation based on the Riesz transform. We detail and analyze the mathematical foundations behind this representation. In particular, it inherits scale equivariance from the Riesz transform and completely avoids sampling of the scale dimension. Additionally, the number of features in the representation is reduced by a factor four compared to scattering networks. Nevertheless, our representation performs comparably well for texture classification with an interesting addition: scale equivariance. Our method yields superior performance when dealing with scales outside of those covered by the training dataset. The usefulness of the equivariance property is demonstrated on the digit classification task, where accuracy remains stable even for scales four times larger than the one chosen for training. As a second example, we consider classification of textures.
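
For illustration of the building block, a sketch of the first-order Riesz transform computed in the Fourier domain using the multiplier -i * xi_j / |xi| for each spatial direction; this shows only the transform itself, not the full feature pipeline from the paper.

```python
# Sketch of the first-order Riesz transform computed in the Fourier domain,
# using the multiplier -i * xi_j / |xi| for each spatial direction; purely for
# illustration of the building block, not the full feature pipeline.
import numpy as np

def riesz_transform(image):
    h, w = image.shape
    fy = np.fft.fftfreq(h).reshape(-1, 1)
    fx = np.fft.fftfreq(w).reshape(1, -1)
    norm = np.sqrt(fx**2 + fy**2)
    norm[0, 0] = 1.0                      # avoid division by zero at DC
    F = np.fft.fft2(image)
    rx = np.real(np.fft.ifft2(-1j * fx / norm * F))   # horizontal component
    ry = np.real(np.fft.ifft2(-1j * fy / norm * F))   # vertical component
    return rx, ry

rx, ry = riesz_transform(np.random.rand(64, 64))
print(rx.shape, ry.shape)
```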

Generalizable Classification of UHF Partial Discharge Signals in Gas-Insulated HVDC Systems Using Neural Networks

  • paper_url: http://arxiv.org/abs/2307.08466
  • repo_url: None
  • paper_authors: Steffen Seitz, Thomas Götz, Christopher Lindenberg, Ronald Tetzlaff, Stephan Schlegel
  • for: This work proposes a neural-network-based method for classifying partial discharge (PD) signals in gas-insulated HVDC systems without relying on pulse sequence analysis features.
  • methods: Neural network models are trained to classify PD signals, comparing time- and frequency-domain input signals and different normalization schemes to mitigate the influence of free-space path loss.
  • results: The results show that the models classify PD signals effectively, discriminate signals obtained at negative and positive potentials, and generalize to unseen operating voltage multiples.
    Abstract Undetected partial discharges (PDs) are a safety critical issue in high voltage (HV) gas insulated systems (GIS). While the diagnosis of PDs under AC voltage is well-established, the analysis of PDs under DC voltage remains an active research field. A key focus of these investigations is the classification of different PD sources to enable subsequent sophisticated analysis. In this paper, we propose and analyze a neural network-based approach for classifying PD signals caused by metallic protrusions and conductive particles on the insulator of HVDC GIS, without relying on pulse sequence analysis features. In contrast to previous approaches, our proposed model can discriminate the studied PD signals obtained at negative and positive potentials, while also generalizing to unseen operating voltage multiples. Additionally, we compare the performance of time- and frequency-domain input signals and explore the impact of different normalization schemes to mitigate the influence of free-space path loss between the sensor and defect location.

Domain Adaptation using Silver Standard Masks for Lateral Ventricle Segmentation in FLAIR MRI

  • paper_url: http://arxiv.org/abs/2307.08456
  • repo_url: None
  • paper_authors: Owen Crystal, Pejman J. Maralani, Sandra Black, Alan R. Moody, April Khademi
  • for: This paper presents a new method for segmenting lateral ventricular volume (LVV) in fluid-attenuated inversion recovery (FLAIR) MRI images.
  • methods: The proposed method uses transfer learning and domain adaptation to improve the accuracy of LVV segmentation. It uses a novel image processing algorithm to generate silver standard (SS) masks from the target domain, which are then used to supplement the gold standard (GS) data from the source domain.
  • results: The proposed method achieved the best and most consistent performance on four different datasets, with a mean Dice similarity coefficient (DSC) of 0.89 and a coefficient of variation (CoV) of 0.05. The method significantly outperformed the GS-only model on three target domains, and the results suggest that pre-training with noisy labels from the target domain and fine-tuning with GS masks allows the model to adapt to dataset-specific characteristics and provides robust parameter initialization.
    Abstract Lateral ventricular volume (LVV) is an important biomarker for clinical investigation. We present the first transfer learning-based LVV segmentation method for fluid-attenuated inversion recovery (FLAIR) MRI. To mitigate covariate shifts between source and target domains, this work proposes an domain adaptation method that optimizes performance on three target datasets. Silver standard (SS) masks were generated from the target domain using a novel conventional image processing ventricular segmentation algorithm and used to supplement the gold standard (GS) data from the source domain, Canadian Atherosclerosis Imaging Network (CAIN). Four models were tested on held-out test sets from four datasets: 1) SS+GS: trained on target SS masks and fine-tuned on source GS masks, 2) GS+SS: trained on source GS masks and fine-tuned on target SS masks, 3) trained on source GS (GS CAIN Only) and 4) trained on target SS masks (SS Only). The SS+GS model had the best and most consistent performance (mean DSC = 0.89, CoV = 0.05) and showed significantly (p < 0.05) higher DSC compared to the GS-only model on three target domains. Results suggest pre-training with noisy labels from the target domain allows the model to adapt to the dataset-specific characteristics and provides robust parameter initialization while fine-tuning with GS masks allows the model to learn detailed features. This method has wide application to other medical imaging problems where labeled data is scarce, and can be used as a per-dataset calibration method to accelerate wide-scale adoption.

Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation

  • paper_url: http://arxiv.org/abs/2307.08448
  • repo_url: https://github.com/andysonys/selective-diffusion-distillation
  • paper_authors: Luozhou Wang, Shuai Yang, Shu Liu, Ying-cong Chen
  • for: Improving both the fidelity and the editability of diffusion-based image manipulation.
  • methods: The paper proposes a new framework, Selective Diffusion Distillation (SDD), which trains a feedforward image manipulation network under the guidance of a diffusion model, together with an effective timestep selector that obtains the correct semantic guidance from the diffusion model.
  • results: Experiments show that the framework resolves the trade-off between fidelity and editability in image manipulation and performs well across multiple tasks.
    Abstract Conditional diffusion models have demonstrated impressive performance in image manipulation tasks. The general pipeline involves adding noise to the image and then denoising it. However, this method faces a trade-off problem: adding too much noise affects the fidelity of the image while adding too little affects its editability. This largely limits their practical applicability. In this paper, we propose a novel framework, Selective Diffusion Distillation (SDD), that ensures both the fidelity and editability of images. Instead of directly editing images with a diffusion model, we train a feedforward image manipulation network under the guidance of the diffusion model. Besides, we propose an effective indicator to select the semantic-related timestep to obtain the correct semantic guidance from the diffusion model. This approach successfully avoids the dilemma caused by the diffusion process. Our extensive experiments demonstrate the advantages of our framework. Code is released at https://github.com/AndysonYs/Selective-Diffusion-Distillation.

DOT: A Distillation-Oriented Trainer

  • paper_url: http://arxiv.org/abs/2307.08436
  • repo_url: None
  • paper_authors: Borui Zhao, Quan Cui, Renjie Song, Jiajun Liang
  • for: This work aims to improve the student model's optimization during knowledge distillation, and thereby its generalization.
  • methods: It treats the gradients of the task and distillation losses separately and applies a larger momentum to the distillation loss to accelerate its optimization.
  • results: Experiments show that the Distillation-Oriented Trainer (DOT) breaks the negative interaction between the two losses and improves the student's generalization; notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair.
    Abstract Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between task and distillation losses, i.e., introducing distillation loss limits the convergence of task loss. We believe that the trade-off results from the insufficient optimization of distillation loss. The reason is: The teacher has a lower task loss than the student, and a lower distillation loss drives the student more similar to the teacher, then a better-converged task loss could be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT separately considers gradients of task and distillation losses, then applies a larger momentum to distillation loss to accelerate its optimization. We empirically prove that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student's optimization properties in terms of loss convergence and model generalization. Code will be made publicly available.
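
A sketch of the separate-momentum idea: keep one momentum buffer for the task-loss gradient and one, with a larger coefficient, for the distillation-loss gradient, then update with their sum. The coefficients and the split (mu - delta / mu + delta) are assumptions for illustration, not the paper's reference implementation.

```python
# Sketch of the separate-momentum idea: keep one momentum buffer for the task
# loss gradient and one (with a larger coefficient) for the distillation loss
# gradient, then update with their sum. Coefficients are assumptions.
import torch

def dot_step(params, task_grads, distill_grads, state, lr=0.1, mu=0.9, delta=0.075):
    for p, g_task, g_kd in zip(params, task_grads, distill_grads):
        buf = state.setdefault(p, {"task": torch.zeros_like(p), "kd": torch.zeros_like(p)})
        buf["task"].mul_(mu - delta).add_(g_task)        # smaller momentum for task loss
        buf["kd"].mul_(mu + delta).add_(g_kd)            # larger momentum for distillation
        p.data.add_(buf["task"] + buf["kd"], alpha=-lr)

# usage: compute the two losses separately, obtain task_grads and distill_grads
# (e.g. via torch.autograd.grad), then call dot_step(...) once per iteration.
```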

Dense Affinity Matching for Few-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2307.08434
  • repo_url: None
  • paper_authors: Hao Chen, Yonghan Dong, Zheming Lu, Yunlong Yu, Yingming Li, Jungong Han, Zhongfei Zhang
  • for: The paper proposes a few-shot segmentation (FSS) method that segments novel-class images from only a few annotated samples.
  • methods: It presents a dense affinity matching (DAM) framework that realizes support-query interaction by densely capturing both pixel-to-pixel and pixel-to-patch relations in each support-query pair with bidirectional 3D convolutions.
  • results: Experiments show that DAM is very competitive on ten benchmarks, especially for cross-category, cross-dataset, and cross-domain FSS tasks, with only 0.68M parameters, demonstrating its effectiveness and efficiency.
    Abstract Few-Shot Segmentation (FSS) aims to segment the novel class images with a few annotated samples. In this paper, we propose a dense affinity matching (DAM) framework to exploit the support-query interaction by densely capturing both the pixel-to-pixel and pixel-to-patch relations in each support-query pair with the bidirectional 3D convolutions. Different from the existing methods that remove the support background, we design a hysteretic spatial filtering module (HSFM) to filter the background-related query features and retain the foreground-related query features with the assistance of the support background, which is beneficial for eliminating interference objects in the query background. We comprehensively evaluate our DAM on ten benchmarks under cross-category, cross-dataset, and cross-domain FSS tasks. Experimental results demonstrate that DAM performs very competitively under different settings with only 0.68M parameters, especially under cross-domain FSS tasks, showing its effectiveness and efficiency.

Divide&Classify: Fine-Grained Classification for City-Wide Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2307.08417
  • repo_url: https://github.com/ga1i13o/Divide-and-Classify
  • paper_authors: Gabriele Trivigno, Gabriele Berton, Carlo Masone, Juan Aragon, Barbara Caputo
  • for: The paper addresses visual place recognition, which is commonly formulated as an image retrieval problem.
  • methods: It approaches the task as classification rather than similarity search, reducing inference time, with a new partitioning scheme and a novel inference pipeline based on an ensemble of classifiers that use prototypes learned via an angular margin loss.
  • results: The proposed Divide&Classify (D&C) is fast and accurate, and combining it with existing retrieval pipelines speeds up computation by over 20 times while increasing recall.
    Abstract Visual Place recognition is commonly addressed as an image retrieval problem. However, retrieval methods are impractical to scale to large datasets, densely sampled from city-wide maps, since their dimension impact negatively on the inference time. Using approximate nearest neighbour search for retrieval helps to mitigate this issue, at the cost of a performance drop. In this paper we investigate whether we can effectively approach this task as a classification problem, thus bypassing the need for a similarity search. We find that existing classification methods for coarse, planet-wide localization are not suitable for the fine-grained and city-wide setting. This is largely due to how the dataset is split into classes, because these methods are designed to handle a sparse distribution of photos and as such do not consider the visual aliasing problem across neighbouring classes that naturally arises in dense scenarios. Thus, we propose a partitioning scheme that enables a fast and accurate inference, preserving a simple learning procedure, and a novel inference pipeline based on an ensemble of novel classifiers that uses the prototypes learned via an angular margin loss. Our method, Divide&Classify (D&C), enjoys the fast inference of classification solutions and an accuracy competitive with retrieval methods on the fine-grained, city-wide setting. Moreover, we show that D&C can be paired with existing retrieval pipelines to speed up computations by over 20 times while increasing their recall, leading to new state-of-the-art results.

Monocular 3D Object Detection with LiDAR Guided Semi Supervised Active Learning

  • paper_url: http://arxiv.org/abs/2307.08415
  • repo_url: None
  • paper_authors: Aral Hekimoglu, Michael Schmidt, Alvaro Marcos-Ramiro
  • for: The paper proposes MonoLiG, a LiDAR-guided semi-supervised active learning framework for monocular 3D object detection that leverages all modalities of the collected data during model development.
  • methods: LiDAR guides the data selection and training of the monocular 3D detector without adding any inference-time overhead; a LiDAR-teacher, monocular-student cross-modal framework from semi-supervised learning distills information from unlabeled data as pseudo-labels, a data-noise-based weighting mechanism reduces noise propagated from the LiDAR modality, and a sensor-consistency-based selection score, coherent with the training objective, chooses which samples to label.
  • results: Extensive experiments on the KITTI and Waymo datasets show that the selection strategy consistently outperforms state-of-the-art active learning baselines, reducing labeling costs by up to 17%, and the training strategy ranks first on the official KITTI 3D and birds-eye-view (BEV) monocular object detection benchmarks, improving BEV Average Precision (AP) by 2.02.
    Abstract We propose a novel semi-supervised active learning (SSAL) framework for monocular 3D object detection with LiDAR guidance (MonoLiG), which leverages all modalities of collected data during model development. We utilize LiDAR to guide the data selection and training of monocular 3D detectors without introducing any overhead in the inference phase. During training, we leverage the LiDAR teacher, monocular student cross-modal framework from semi-supervised learning to distill information from unlabeled data as pseudo-labels. To handle the differences in sensor characteristics, we propose a data noise-based weighting mechanism to reduce the effect of propagating noise from LiDAR modality to monocular. For selecting which samples to label to improve the model performance, we propose a sensor consistency-based selection score that is also coherent with the training objective. Extensive experimental results on KITTI and Waymo datasets verify the effectiveness of our proposed framework. In particular, our selection strategy consistently outperforms state-of-the-art active learning baselines, yielding up to 17% better saving rate in labeling costs. Our training strategy attains the top place in KITTI 3D and birds-eye-view (BEV) monocular object detection official benchmarks by improving the BEV Average Precision (AP) by 2.02.
    摘要 我们提出了一种新的半监督学习框架(SSAL),用于单目3D物体检测,利用了所有数据收集时的模式。我们使用激光准备数据选择和训练单目3D检测器,无需在检测阶段添加任何负担。在训练过程中,我们利用激光师,单目学生交叉模式自动学习法从无标签数据中提取信息,作为pseudo标签。为了处理感知器特征的差异,我们提议一种数据噪音基于权重机制,以减少激光模式噪音对单目检测器的影响。为选择需要标注的样本以提高模型性能,我们提议一种感知器一致性基于选择分数,与训练目标含义一致。我们的选择策略在KITTI和Waymo数据集上进行了广泛的实验,并证明了我们的提议的有效性。特别是,我们的选择策略在活动学习基elines上一直保持状态的最佳,可以在KITTI 3D和bird's-eye-view(BEV)单目物体检测官方benchmark中提高BEV均值精度(AP)by 2.02。
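
To make the sample-selection idea concrete, here is a hedged sketch of a sensor-consistency style score: samples where a LiDAR-guided teacher and the monocular student disagree most are proposed for labeling. The function names, the relative-depth-gap score, and the toy data are assumptions, not the paper's exact selection criterion.

```python
import numpy as np

def sensor_disagreement(student_depths, teacher_depths):
    """Toy per-image score: mean relative depth gap between the monocular student
    and the LiDAR-guided teacher over the objects detected in that image.
    A large gap (low sensor consistency) marks the sample as worth labeling."""
    student_depths = np.asarray(student_depths, dtype=float)
    teacher_depths = np.asarray(teacher_depths, dtype=float)
    rel_gap = np.abs(student_depths - teacher_depths) / np.maximum(teacher_depths, 1e-6)
    return float(rel_gap.mean())

def select_for_labeling(pool, budget):
    """Greedy active-learning step: pick the `budget` unlabeled samples whose
    student and teacher predictions disagree the most."""
    scored = [(sensor_disagreement(s["student"], s["teacher"]), s["id"]) for s in pool]
    scored.sort(reverse=True)
    return [sid for _, sid in scored[:budget]]

if __name__ == "__main__":
    pool = [
        {"id": "img_000", "student": [12.1, 30.5], "teacher": [11.8, 29.9]},
        {"id": "img_001", "student": [8.0, 22.0],  "teacher": [14.5, 35.0]},
        {"id": "img_002", "student": [40.0],       "teacher": [41.2]},
    ]
    print(select_for_labeling(pool, budget=1))   # ['img_001'] -- the largest disagreement
```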

Active Learning for Object Detection with Non-Redundant Informative Sampling

  • paper_url: http://arxiv.org/abs/2307.08414
  • repo_url: None
  • paper_authors: Aral Hekimoglu, Adrian Brucker, Alper Kagan Kayali, Michael Schmidt, Alvaro Marcos-Ramiro
  • for: 提高2D对象检测器的性能,建立一个有代表性和多样性的数据集
  • methods: 使用不同样本之间的差异和不确定性来选择样本,并计算样本集中样本之间的信息共同分数
  • results: 比Random选择更高效,可以减少标注成本20%和30%,并且可以建立多样化的对象类型、形状和角度的数据集
    Abstract Curating an informative and representative dataset is essential for enhancing the performance of 2D object detectors. We present a novel active learning sampling strategy that addresses both the informativeness and diversity of the selections. Our strategy integrates uncertainty and diversity-based selection principles into a joint selection objective by measuring the collective information score of the selected samples. Specifically, our proposed NORIS algorithm quantifies the impact of training with a sample on the informativeness of other similar samples. By exclusively selecting samples that are simultaneously informative and distant from other highly informative samples, we effectively avoid redundancy while maintaining a high level of informativeness. Moreover, instead of utilizing whole image features to calculate distances between samples, we leverage features extracted from detected object regions within images to define object features. This allows us to construct a dataset encompassing diverse object types, shapes, and angles. Extensive experiments on object detection and image classification tasks demonstrate the effectiveness of our strategy over the state-of-the-art baselines. Specifically, our selection strategy achieves a 20% and 30% reduction in labeling costs compared to random selection for PASCAL-VOC and KITTI, respectively.
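
A minimal sketch of the selection principle, assuming a greedy approximation: each pick maximizes informativeness (e.g., detector uncertainty) minus similarity to already-selected samples, with one pooled object-region feature per image standing in for the paper's object features. The exact collective information score of NORIS is not reproduced here.

```python
import numpy as np

def greedy_nonredundant_selection(uncertainty, object_feats, budget, redundancy_weight=1.0):
    """Greedy sketch of non-redundant informative sampling.

    uncertainty:  (N,) informativeness of each unlabeled image
    object_feats: (N, D) pooled object-region feature per image
    Returns indices of `budget` images that are informative *and* far (in feature
    space) from the already-selected ones, avoiding redundant picks.
    """
    feats = object_feats / (np.linalg.norm(object_feats, axis=1, keepdims=True) + 1e-8)
    selected = []
    for _ in range(budget):
        if selected:
            sim_to_selected = feats @ feats[selected].T        # cosine similarities (N, k)
            redundancy = sim_to_selected.max(axis=1)           # closeness to nearest pick
        else:
            redundancy = np.zeros(len(uncertainty))
        score = uncertainty - redundancy_weight * redundancy
        score[selected] = -np.inf                               # never re-pick a sample
        selected.append(int(np.argmax(score)))
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 16))
    unc = rng.uniform(size=100)
    print(greedy_nonredundant_selection(unc, feats, budget=5))
```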

CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing

  • paper_url: http://arxiv.org/abs/2307.08397
  • repo_url: https://github.com/johnberg1/CLIPInverter
  • paper_authors: Ahmet Canberk Baykal, Abdul Basit Anees, Duygu Ceylan, Erkut Erdem, Aykut Erdem, Deniz Yuret
  • for: 用于实现基于自然语言描述的图像编辑
  • methods: 使用 StyleGAN 模型和 CLIP embedding 进行图像编辑,并使用 novel 的文本条件 adapter 层来实现多属性变化
  • results: 比其他方法更高效和精准地完成多属性变化,并且在不同领域(人脸、猫、鸟等)表现出更高的推理精度和图像真实性
    Abstract Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.
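
A toy sketch of a text-conditioned adapter layer: features inside a GAN-inversion encoder are modulated by a scale and shift predicted from a CLIP text embedding (FiLM-style). The 512-dimensional embedding and the modulation form are illustrative assumptions; the paper's actual CLIPInverter adapters and CLIP-guided refinement step are not reproduced.

```python
import torch
import torch.nn as nn

class TextConditionedAdapter(nn.Module):
    """Toy CLIP-conditioned adapter: per-channel scale and shift predicted from a
    text embedding modulate intermediate inversion-encoder features."""
    def __init__(self, feat_channels, clip_dim=512):
        super().__init__()
        self.to_scale = nn.Linear(clip_dim, feat_channels)
        self.to_shift = nn.Linear(clip_dim, feat_channels)

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W), text_emb: (B, clip_dim)
        scale = self.to_scale(text_emb).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        shift = self.to_shift(text_emb).unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift

if __name__ == "__main__":
    adapter = TextConditionedAdapter(feat_channels=256)
    feat = torch.randn(2, 256, 32, 32)       # features inside an inversion encoder
    text_emb = torch.randn(2, 512)            # stand-in for a CLIP text embedding
    print(adapter(feat, text_emb).shape)      # torch.Size([2, 256, 32, 32])
```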

Revisiting Scene Text Recognition: A Data Perspective

  • paper_url: http://arxiv.org/abs/2307.08723
  • repo_url: https://github.com/Mountchicken/Union14M
  • paper_authors: Qing Jiang, Jiapeng Wang, Dezhi Peng, Chongyu Liu, Lianwen Jin
  • for: 本研究旨在从数据驱动的角度重新评估场景文本识别(STR)。
  • methods: 我们首先回顾了场景文本识别领域的六个常用标准 benchmark,并发现了性能饱和现象,即仅有2.91%的标准图像无法由13种表征模型准确识别。
  • results: 我们的实验表明,13种模型在400万个标注图像上的平均准确率只有66.53%, indicating that STR still faces numerous challenges in real-world scenarios。
    Abstract This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used benchmarks in STR and observe a trend of performance saturation, whereby only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, however, we argue that this is primarily due to the less challenging nature of the common benchmarks, thus concealing the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models can only achieve an average accuracy of 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR model in real-world scenarios and leads to state-of-the-art performance.
    摘要 这篇论文旨在从数据驱动的角度重新评估场景文本识别(STR)。我们首先回顾常用的六个STR benchmark,并观察到性能饱和的趋势:只有2.91%的benchmark图像无法被13种代表性模型准确识别。虽然这些结果令人印象深刻,并似乎表明STR已被解决,但我们认为这主要归因于常用benchmark的难度较低,从而掩盖了STR实际面临的问题。为此,我们整合了大规模的真实STR数据集Union14M,包括400万标注图像和1000万无标注图像,以评估STR模型在更复杂的真实场景中的性能。实验表明,13种模型在400万标注图像上的平均准确率仅为66.53%,说明STR在真实世界中仍面临许多挑战。通过分析这13种模型的错误模式,我们确定了STR的七个开放挑战,并构建了一个包含八个子集的挑战驱动benchmark,以促进该领域的进一步进展。我们的探索表明,STR远未被解决,而利用数据可能是一个有希望的解决方案。在这方面,我们发现通过对1000万无标注图像进行自监督预训练,可以显著提升STR模型在真实场景中的鲁棒性,并达到最先进的性能。

Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation

  • paper_url: http://arxiv.org/abs/2307.08388
  • repo_url: https://github.com/yaoleiqi/dscnet
  • paper_authors: Yaolei Qi, Yuting He, Xiaoming Qi, Yuan Zhang, Guanyu Yang
  • for: 这种研究旨在提高 tubular 结构 segmentation 任务中的准确性和效率,这些结构包括血管和道路等。
  • methods: 该研究使用了动态蛇卷 convolution 技术来正确地捕捉 tubular 结构的特征,并提出了多视角特征融合策略以保持多种全球形态的重要信息。
  • results: 实验表明,使用 DSCNet 可以在 2D 和 3D 数据集上提供更高的准确性和连续性,比较常见的方法更好。
    Abstract Accurate segmentation of topological tubular structures, such as blood vessels and roads, is crucial in various fields, ensuring accuracy and efficiency in downstream tasks. However, many factors complicate the task, including thin local structures and variable global morphologies. In this work, we note the specificity of tubular structures and use this knowledge to guide our DSCNet to simultaneously enhance perception in three stages: feature extraction, feature fusion, and loss constraint. First, we propose a dynamic snake convolution to accurately capture the features of tubular structures by adaptively focusing on slender and tortuous local structures. Subsequently, we propose a multi-view feature fusion strategy to complement the attention to features from multiple perspectives during feature fusion, ensuring the retention of important information from different global morphologies. Finally, a continuity constraint loss function, based on persistent homology, is proposed to constrain the topological continuity of the segmentation better. Experiments on 2D and 3D datasets show that our DSCNet provides better accuracy and continuity on the tubular structure segmentation task compared with several methods. Our codes will be publicly available.

Distributed bundle adjustment with block-based sparse matrix compression for super large scale datasets

  • paper_url: http://arxiv.org/abs/2307.08383
  • repo_url: https://github.com/MozartZheng/DistributedBA
  • paper_authors: Maoteng Zheng, Nengcheng Chen, Junfeng Zhu, Xiaoru Zeng, Huanbin Qiu, Yuyao Jiang, Xingyue Lu, Hao Qu
  • for: 这篇论文主要是为了解决大规模数据集中的摄像头系统Bundle Adjustment(BA)问题。
  • methods: 该方法使用精确的Levenberg-Marquardt(LM)算法来实现分布式摄像头系统(DBA),而不是使用估计算法来适应平行框架。它还使用块基于稀疏矩阵压缩格式(BSMC)来压缩大规模的摄像头系统(RCS),以便分布式存储和更新。
  • results: 经过评估和比较,该方法在各种数据集上显示了高效的内存使用和广泛的可扩展性,比基eline上的方法更高效。首次在实际数据集上实现了平行BA使用LM算法,处理118万张图像和1000万张图像(相对于状态艺术LPM-based BA的500倍)。
    Abstract We propose a distributed bundle adjustment (DBA) method using the exact Levenberg-Marquardt (LM) algorithm for super large-scale datasets. Most of the existing methods partition the global map to small ones and conduct bundle adjustment in the submaps. In order to fit the parallel framework, they use approximate solutions instead of the LM algorithm. However, those methods often give sub-optimal results. Different from them, we utilize the exact LM algorithm to conduct global bundle adjustment where the formation of the reduced camera system (RCS) is actually parallelized and executed in a distributed way. To store the large RCS, we compress it with a block-based sparse matrix compression format (BSMC), which fully exploits its block feature. The BSMC format also enables the distributed storage and updating of the global RCS. The proposed method is extensively evaluated and compared with the state-of-the-art pipelines using both synthetic and real datasets. Preliminary results demonstrate the efficient memory usage and vast scalability of the proposed method compared with the baselines. For the first time, we conducted parallel bundle adjustment using LM algorithm on a real datasets with 1.18 million images and a synthetic dataset with 10 million images (about 500 times that of the state-of-the-art LM-based BA) on a distributed computing system.
    摘要 我们提议一种分布式束适应(DBA)方法,使用精确的Levenberg-Marquardt(LM)算法进行超大规模数据集处理。现有的方法通常将全球地图分割成小地图,并在子地图中进行束适应。为适应并行框架,它们通常使用估计而不是LM算法。然而,这些方法通常会给出低于优化的结果。与之不同的是,我们利用精确的LM算法来进行全球束适应,并将Camera系统的减少(RCS)实际上并行并在分布式环境中执行。为存储大RCS,我们使用块基本稀疏矩阵压缩格式(BSMC),这种格式充分利用了块特点。BSMC格式还允许分布式存储和更新全球RCS。我们提出的方法与现有的管道进行了广泛的评估和比较,使用了真实和 sintetic 数据集。初步结果表明我们的方法具有高效的内存使用和广泛的扩展性,与基eline相比。此外,我们首次在真实数据集上进行了并行束适应,使用LM算法,并处理1.18万张图像和10万张图像(约500倍于现有LM基于BA的状态)在分布式计算系统上。
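
The block-based sparse storage idea can be illustrated with a minimal structure that keeps only the non-zero camera-camera blocks of the reduced camera system and multiplies by a vector without ever forming the dense matrix. The block size, dictionary keys, and toy fill pattern are assumptions, not the paper's BSMC format.

```python
import numpy as np

class BlockSparseMatrix:
    """Minimal block-sparse storage: only non-zero (block_row, block_col) blocks
    of a reduced camera system are kept in a dictionary."""
    def __init__(self, n_blocks, block_size):
        self.n_blocks = n_blocks
        self.block_size = block_size
        self.blocks = {}                       # (i, j) -> (block_size, block_size) array

    def add_block(self, i, j, block):
        key = (i, j)
        if key in self.blocks:
            self.blocks[key] += block          # accumulate contributions
        else:
            self.blocks[key] = block.copy()

    def matvec(self, x):
        """Multiply by a vector using only the stored blocks."""
        y = np.zeros(self.n_blocks * self.block_size)
        bs = self.block_size
        for (i, j), blk in self.blocks.items():
            y[i * bs:(i + 1) * bs] += blk @ x[j * bs:(j + 1) * bs]
        return y

if __name__ == "__main__":
    rcs = BlockSparseMatrix(n_blocks=1000, block_size=6)   # 6 pose parameters per camera
    rng = np.random.default_rng(0)
    for cam_i, cam_j in [(0, 0), (0, 1), (1, 1), (2, 2)]:   # only co-visible camera pairs
        rcs.add_block(cam_i, cam_j, rng.normal(size=(6, 6)))
    x = rng.normal(size=6000)
    print(rcs.matvec(x)[:6])
```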

Self-supervised Monocular Depth Estimation: Let’s Talk About The Weather

  • paper_url: http://arxiv.org/abs/2307.08357
  • repo_url: https://github.com/kieran514/robustdepth
  • paper_authors: Kieran Saunders, George Vogiatzis, Luis Manso
  • for: 这篇论文旨在提出一种 Pseudo-supervised 方法,使得自助学习深度估计模型能够在不同天气和光照条件下进行高性能的估计。
  • methods: 该方法使用计算机图形和生成模型来对现有的晴天数据进行数据增强,以模拟不利天气效果。此外,该方法还使用 Pseudo-supervised 损失函数,以提高 depth 和 pose 估计的性能。
  • results: 测试结果表明,该方法(Robust-Depth)在 KITTI 数据集上达到了 State-of-the-Art 性能,而在具有困难天气条件的数据集上,如 DrivingStereo、Foggy CityScape 和 NuScenes-Night,则有显著更高的性能。
    Abstract Current, self-supervised depth estimation architectures rely on clear and sunny weather scenes to train deep neural networks. However, in many locations, this assumption is too strong. For example in the UK (2021), 149 days consisted of rain. For these architectures to be effective in real-world applications, we must create models that can generalise to all weather conditions, times of the day and image qualities. Using a combination of computer graphics and generative models, one can augment existing sunny-weather data in a variety of ways that simulate adverse weather effects. While it is tempting to use such data augmentations for self-supervised depth, in the past this was shown to degrade performance instead of improving it. In this paper, we put forward a method that uses augmentations to remedy this problem. By exploiting the correspondence between unaugmented and augmented data we introduce a pseudo-supervised loss for both depth and pose estimation. This brings back some of the benefits of supervised learning while still not requiring any labels. We also make a series of practical recommendations which collectively offer a reliable, efficient framework for weather-related augmentation of self-supervised depth from monocular video. We present extensive testing to show that our method, Robust-Depth, achieves SotA performance on the KITTI dataset while significantly surpassing SotA on challenging, adverse condition data such as DrivingStereo, Foggy CityScape and NuScenes-Night. The project website can be found here https://kieran514.github.io/Robust-Depth-Project/.
    摘要 当前的自助学深度估算架构假设需要清晰的天气和日光照明来训练深度学习模型。然而,在许多地方,这个假设是太强大。例如在英国(2021年),有149天雨天。为了使这些架构在实际应用中效果,我们需要创建可以总结到所有天气条件、时间和图像质量的模型。使用计算机图形和生成模型,我们可以对现有的晴天数据进行多种修改,以模拟不利的天气效果。尽管这可能看起来有趣,但在过去,这些数据修改方法实际上会降低性能而不是提高它。在这篇论文中,我们提出了一种使用修改来解决这个问题的方法。通过利用未修改和修改数据之间的对应关系,我们引入了一种假超级vised损失函数,用于估算深度和pose。这种方法可以带来一些supervised学习的好处,而不需要任何标签。我们还提供了一系列实用的建议,这些建议共同组成一个可靠、高效的气候相关数据修改框架,用于自助学深度从单光视频中的估算。我们对KITTI数据集进行了广泛的测试,并证明了我们的方法Robust-Depth可以在KITTI数据集上达到SotA性能,并在抗气候条件数据集上(如DrivingStereo、Foggy CityScape和NuScenes-Night)表现出显著超过SotA。 project网站的地址为https://kieran514.github.io/Robust-Depth-Project/.
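
A hedged sketch of the pseudo-supervised idea: because a weather augmentation changes appearance but not geometry, the depth predicted on the clear frame can supervise the prediction on its augmented counterpart. The tiny network, the L1 form of the loss, and the crude fog-like augmentation are stand-ins for illustration only.

```python
import torch
import torch.nn.functional as F

def pseudo_supervised_depth_loss(depth_net, clear_img, augmented_img):
    """Sketch: depth predicted on the clear image (no gradients) acts as a
    pseudo-label for the same network applied to the weather-augmented image.
    The two inputs are pixel-aligned, since augmentation changes appearance only."""
    with torch.no_grad():
        pseudo_depth = depth_net(clear_img)          # teacher pass on the clear frame
    pred_depth = depth_net(augmented_img)            # student pass on simulated weather
    return F.l1_loss(pred_depth, pseudo_depth)

if __name__ == "__main__":
    # A trivially small stand-in depth network for the example.
    depth_net = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
        torch.nn.Conv2d(16, 1, 3, padding=1), torch.nn.Softplus(),
    )
    clear = torch.rand(2, 3, 64, 64)
    foggy = 0.6 * clear + 0.4                         # crude fog-like appearance change
    print(pseudo_supervised_depth_loss(depth_net, clear, foggy).item())
```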

Box-DETR: Understanding and Boxing Conditional Spatial Queries

  • paper_url: http://arxiv.org/abs/2307.08353
  • repo_url: https://github.com/tiny-smart/box-detr
  • paper_authors: Wenze Liu, Hao Lu, Yuliang Liu, Zhiguo Cao
  • for: 提高DETR的快速启动和检测性能
  • methods: 使用 conditional spatial queries 和 conditional linear projection,并将盒子信息转化为头specific agent points
  • results: 提高了收敛速度和检测性能,例如使用 ResNet-50 的单尺度模型达到了 $44.2$ AP。
    Abstract Conditional spatial queries are recently introduced into DEtection TRansformer (DETR) to accelerate convergence. In DAB-DETR, such queries are modulated by the so-called conditional linear projection at each decoder stage, aiming to search for positions of interest such as the four extremities of the box. Each decoder stage progressively updates the box by predicting the anchor box offsets, while in cross-attention only the box center is informed as the reference point. The use of only box center, however, leaves the width and height of the previous box unknown to the current stage, which hinders accurate prediction of offsets. We argue that the explicit use of the entire box information in cross-attention matters. In this work, we propose Box Agent to condense the box into head-specific agent points. By replacing the box center with the agent point as the reference point in each head, the conditional cross-attention can search for positions from a more reasonable starting point by considering the full scope of the previous box, rather than always from the previous box center. This significantly reduces the burden of the conditional linear projection. Experimental results show that the box agent leads to not only faster convergence but also improved detection performance, e.g., our single-scale model achieves $44.2$ AP with ResNet-50 based on DAB-DETR. Our Box Agent requires minor modifications to the code and has negligible computational workload. Code is available at https://github.com/tiny-smart/box-detr.
    摘要 Conditional spatial queries 是最近在 Detection Transformer (DETR) 中引入的,用于加速收敛。在 DAB-DETR 中,这些查询在每个解码器阶段通过所谓的 conditional linear projection 进行调制,以搜索感兴趣的位置,例如盒体的四个端点。每个解码器阶段通过预测锚框偏移量来逐步更新盒体,而在跨注意力(cross-attention)中只有盒体中心被作为参考点。然而,仅使用盒体中心作为参考点,会使当前阶段无法得知上一个盒体的宽度和高度,从而阻碍偏移量的精确预测。我们认为,在 cross-attention 中显式使用完整的盒体信息很重要。为此,我们提出 Box Agent,将盒体压缩为各个 head 专属的代理点(agent point)。通过在每个 head 中用代理点取代盒体中心作为参考点,conditional cross-attention 能够考虑上一个盒体的完整范围,从更合理的起始点开始搜索,而不是总是从上一个盒体中心出发。这显著减轻了 conditional linear projection 的负担。我们的 Box Agent 只需对代码做少量修改,计算开销几乎可以忽略。代码可在 https://github.com/tiny-smart/box-detr 获取。

Neural Modulation Fields for Conditional Cone Beam Neural Tomography

  • paper_url: http://arxiv.org/abs/2307.08351
  • repo_url: https://github.com/samuelepapa/cond-cbnt
  • paper_authors: Samuele Papa, David M. Knigge, Riccardo Valperga, Nikita Moriakov, Miltos Kofinas, Jan-Jakob Sonke, Efstratios Gavves
  • for: 提高CBCT重建精度
  • methods: 使用深度学习方法,包括conditional neural fields和Neural Modulation Field
  • results: 在不同数量的投影下,Conditional Cone Beam Neural Tomography表现更好,包括降低误差和提高精度
    Abstract Conventional Computed Tomography (CT) methods require large numbers of noise-free projections for accurate density reconstructions, limiting their applicability to the more complex class of Cone Beam Geometry CT (CBCT) reconstruction. Recently, deep learning methods have been proposed to overcome these limitations, with methods based on neural fields (NF) showing strong performance, by approximating the reconstructed density through a continuous-in-space coordinate based neural network. Our focus is on improving such methods, however, unlike previous work, which requires training an NF from scratch for each new set of projections, we instead propose to leverage anatomical consistencies over different scans by training a single conditional NF on a dataset of projections. We propose a novel conditioning method where local modulations are modeled per patient as a field over the input domain through a Neural Modulation Field (NMF). The resulting Conditional Cone Beam Neural Tomography (CondCBNT) shows improved performance for both high and low numbers of available projections on noise-free and noisy data.

Adaptive Local Basis Functions for Shape Completion

  • paper_url: http://arxiv.org/abs/2307.08348
  • repo_url: https://github.com/yinghdb/adaptive-local-basis-functions
  • paper_authors: Hui Ying, Tianjia Shao, He Wang, Yin Yang, Kun Zhou
  • for: 这个论文的目的是完成部分点云数据的3D形状完成任务,使用深度隐函数。
  • methods: 该方法使用适应本地基函数,不受限制于特定的函数形式,通过这些基函数实现本地到本地的形状完成框架。
  • results: 该方法比现有方法更高效,能够保留本地几何细节,涵盖更多的形状,并且可以在未看过的几何上进行扩展。
    Abstract In this paper, we focus on the task of 3D shape completion from partial point clouds using deep implicit functions. Existing methods seek to use voxelized basis functions or the ones from a certain family of functions (e.g., Gaussians), which leads to high computational costs or limited shape expressivity. On the contrary, our method employs adaptive local basis functions, which are learned end-to-end and not restricted in certain forms. Based on those basis functions, a local-to-local shape completion framework is presented. Our algorithm learns sparse parameterization with a small number of basis functions while preserving local geometric details during completion. Quantitative and qualitative experiments demonstrate that our method outperforms the state-of-the-art methods in shape completion, detail preservation, generalization to unseen geometries, and computational cost. Code and data are at https://github.com/yinghdb/Adaptive-Local-Basis-Functions.
    摘要 在这篇论文中,我们关注3D形状完成从部分点云使用深度隐函数的任务。现有方法通常使用块化基函数或一定家族函数(例如高斯函数),这会导致高计算成本或局部形态表达力有限。相反,我们的方法使用适应地ocal基函数,这些基函数通过端到端学习而不受限制。基于这些基函数,我们提出了一种本地到本地的形状完成框架。我们的算法可以学习少量的基函数参数,同时保留完成过程中的地方准确性。量化和质量实验表明,我们的方法在形状完成、准确性、未经见过的几何体 generale和计算成本方面都高于当前的方法。代码和数据可以在https://github.com/yinghdb/Adaptive-Local-Basis-Functions上找到。

Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data

  • paper_url: http://arxiv.org/abs/2307.08319
  • repo_url: None
  • paper_authors: Kai Katsumata, Duc Minh Vo, Tatsuya Harada, Hideki Nakayama
  • for: 用于提高 conditional generative adversarial network 的训练,使其能够处理含有噪声和无标签数据的情况。
  • methods: 提出了一种新的 Conditional Image Generation 框架,该框架在训练时接受噪声和无标签数据,并使用 soft curriculum learning 来杜绝噪声和无标签数据的影响。
  • results: 对比 semi-supervised 和 label-noise 鲁棒方法,提出的方法在量化和质量上均达到了更高的表现。特别是,该方法能够与少于半个标注数据的情况下匹配 semi-supervised GANs 的表现。
    Abstract Label-noise or curated unlabeled data is used to compensate for the assumption of clean labeled data in training the conditional generative adversarial network; however, satisfying such an extended assumption is occasionally laborious or impractical. As a step towards generative modeling accessible to everyone, we introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated unlabeled data during training: (i) closed-set and open-set label noise in labeled data and (ii) closed-set and open-set unlabeled data. To combat it, we propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data and correcting wrong labels for labeled data. Unlike popular curriculum learning, which uses a threshold to pick the training samples, our soft curriculum controls the effect of each training instance by using the weights predicted by the auxiliary classifier, resulting in the preservation of useful samples while ignoring harmful ones. Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance. In particular, the proposed approach is able to match the performance of (semi-) supervised GANs even with less than half the labeled data.
    摘要 文本中的描述:用于资料准备的标签噪声或精心挑选的无标签数据被用来补偿 conditional generative adversarial network 的假设,但满足这种扩展的假设 occasional 是劳动ious 或 impractical。为了实现 everyone 可以接触的生成模型,我们介绍了一种新的 conditional 图像生成框架,该框架在训练时接受噪声标签和无标签数据:(i) closed-set 和 open-set 标签噪声在标签数据中,(ii) closed-set 和 open-set 无标签数据。为了解决这个问题,我们提出了软预科学学习,它在对抗式训练中分配每个实例的权重,并为无标签数据分配新的标签,并对标签数据中的错误标签进行更正。与传统的预科学学习不同,我们的软预科学学习不使用阈值来选择训练样本,而是使用辅助分类器预测的权重来控制每个训练实例的影响,从而保留有用的样本,而忽略有害的样本。我们的实验显示,我们的方法在量化和质量上都超过了现有的半支持和标签噪声鲁棒方法。特别是,我们的方法能够与少于半个标签数据相当的性能。
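
The instance-wise weighting can be sketched as follows: instead of a hard keep/drop threshold, a confidence predicted by an auxiliary classifier softly scales each sample's loss. The loss form and variable names are assumptions; the paper's full scheme (assigning new labels to unlabeled data and correcting wrong labels) is not shown.

```python
import torch
import torch.nn.functional as F

def soft_curriculum_loss(logits, noisy_labels, aux_confidence):
    """Instance-wise weighted classification loss.

    logits:         (B, K) classifier/discriminator outputs
    noisy_labels:   (B,)   possibly wrong or pseudo labels
    aux_confidence: (B,)   auxiliary-classifier confidence in [0, 1] that each
                            label is correct, used as a soft per-sample weight.
    """
    per_sample = F.cross_entropy(logits, noisy_labels, reduction="none")
    return (aux_confidence * per_sample).mean()

if __name__ == "__main__":
    logits = torch.randn(4, 10)
    labels = torch.tensor([3, 7, 1, 1])
    conf = torch.tensor([0.9, 0.2, 0.95, 0.5])   # low weight ~ suspected label noise
    print(soft_curriculum_loss(logits, labels, conf).item())
```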

Airway Label Prediction in Video Bronchoscopy: Capturing Temporal Dependencies Utilizing Anatomical Knowledge

  • paper_url: http://arxiv.org/abs/2307.08318
  • repo_url: None
  • paper_authors: Ron Keuth, Mattias Heinrich, Martin Eichenlaub, Marian Himstedt
  • for: 本研究旨在提供无需电磁跟踪和特定病人CT扫描的视觉导航,以便在肺部手术中进行其他应用程序,如医学护理室。
  • methods: 本研究使用单帧图像分类和肺部模型来实现视觉导航,而不需要电磁跟踪和特定病人CT扫描。研究者们通过 incorporating sequences of CNN-based airway likelihoods into a Hidden Markov Model 来使用 topological bronchoscope localization 和 anatomical constraints 来提高导航精度。
  • results: 研究者们通过多个实验在肺部模型中评估了该方法,并发现该方法可以提高导航精度至0.98,比之前的0.81(加权平均值:0.98 vs 0.81)。这表明, combining CNN-based single image classification of airway segments with anatomical constraints and temporal HMM-based inference 可以提供高度的视觉导航。
    Abstract Purpose: Navigation guidance is a key requirement for a multitude of lung interventions using video bronchoscopy. State-of-the-art solutions focus on lung biopsies using electromagnetic tracking and intraoperative image registration w.r.t. preoperative CT scans for guidance. The requirement of patient-specific CT scans hampers the utilisation of navigation guidance for other applications such as intensive care units. Methods: This paper addresses navigation guidance solely incorporating bronchoscopy video data. In contrast to state-of-the-art approaches we entirely omit the use of electromagnetic tracking and patient-specific CT scans. Guidance is enabled by means of topological bronchoscope localization w.r.t. an interpatient airway model. Particularly, we take maximal advantage of anatomical constraints of airway trees being sequentially traversed. This is realized by incorporating sequences of CNN-based airway likelihoods into a Hidden Markov Model. Results: Our approach is evaluated based on multiple experiments inside a lung phantom model. With the consideration of temporal context and use of anatomical knowledge for regularization, we are able to improve the accuracy up to 0.98 compared to 0.81 (weighted F1: 0.98 compared to 0.81) for a classification based on individual frames. Conclusion: We combine CNN-based single image classification of airway segments with anatomical constraints and temporal HMM-based inference for the first time. Our approach renders vision-only guidance for bronchoscopy interventions in the absence of electromagnetic tracking and patient-specific CT scans possible.
    摘要 目的:用视频镜头导航是肺部内部手术中的关键需求,现代解决方案主要采用电磁 tracking和实时 CT 图像对比为导航。但这些方法受到patient-specific CT 图像的限制,不能用于医学加护部门。方法:本文提出一种具有唯视导航的方法,与现有方法不同之处在于完全不使用电磁 tracking和patient-specific CT 图像。我们通过基于隐藏 Markov 模型的空间排序和 CNN 网络来实现导航。特别是,我们利用隐藏 Markov 模型中的排序和 CNN 网络来使用排序的 temporal 上下文和空间上下文来进行补做,从而提高导航的准确性。结果:我们在肺部模型中进行了多个实验,结果表明,我们的方法可以提高准确性至 0.98,比对 individual 帧的分类结果更高(weighted F1 分数为 0.98,对比 0.81)。结论:我们结合了 CNN 网络基于单个图像分类和空间排序的方法,并利用 temporal HMM 模型来进行推理。这种方法可以在没有电磁 tracking和patient-specific CT 图像的情况下实现肺部内部手术的视野导航。
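
The combination of per-frame CNN likelihoods with anatomical constraints can be illustrated with standard Viterbi decoding over an HMM whose transitions only allow moves between adjacent airway segments. The tiny airway tree, uniform transition probabilities, and random stand-in CNN outputs below are made up for the example.

```python
import numpy as np

def viterbi(log_likelihoods, log_transition, log_prior):
    """Standard Viterbi decoding.
    log_likelihoods: (T, S) per-frame log p(frame | segment) from a CNN
    log_transition:  (S, S) log p(segment_t | segment_{t-1})
    log_prior:       (S,)   log p(segment_0)
    """
    T, S = log_likelihoods.shape
    dp = log_prior + log_likelihoods[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = dp[:, None] + log_transition            # (prev_state, next_state)
        back[t] = cand.argmax(axis=0)
        dp = cand.max(axis=0) + log_likelihoods[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    # Toy 4-segment airway tree: 0 = trachea, 1/2 = main bronchi, 3 = a branch of 1.
    adjacency = np.array([[1, 1, 1, 0],
                          [1, 1, 0, 1],
                          [1, 0, 1, 0],
                          [0, 1, 0, 1]], dtype=float)
    trans = adjacency / adjacency.sum(axis=1, keepdims=True)    # only adjacent moves allowed
    rng = np.random.default_rng(0)
    cnn_probs = rng.dirichlet(alpha=np.ones(4), size=6)          # stand-in per-frame CNN outputs
    path = viterbi(np.log(cnn_probs + 1e-9), np.log(trans + 1e-9), np.log(np.full(4, 0.25)))
    print(path)
```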

AltFreezing for More General Video Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2307.08317
  • repo_url: https://github.com/zhendongwang6/altfreezing
  • paper_authors: Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Houqiang Li
  • for: 这个论文主要应用于面伪造检测,对于已知的攻击方法进行防护。
  • methods: 本文提出了一个捷径的方法,通过结合空间和时间特征以检测面伪造。具体来说,是使用3D ConvNet来捕捉空间和时间特征,并通过AltFreezing训练策略来鼓励模型对于空间和时间类型的伪造进行检测。
  • results: 实验结果显示,该方法能够超越现有的方法,具有更好的扩展性和应用性。
    Abstract Existing face forgery detection models try to discriminate fake images by detecting only spatial artifacts (e.g., generative artifacts, blending) or mainly temporal artifacts (e.g., flickering, discontinuity). They may experience significant performance degradation when facing out-domain artifacts. In this paper, we propose to capture both spatial and temporal artifacts in one model for face forgery detection. A simple idea is to leverage a spatiotemporal model (3D ConvNet). However, we find that it may easily rely on one type of artifact and ignore the other. To address this issue, we present a novel training strategy called AltFreezing for more general face forgery detection. The AltFreezing aims to encourage the model to detect both spatial and temporal artifacts. It divides the weights of a spatiotemporal network into two groups: spatial-related and temporal-related. Then the two groups of weights are alternately frozen during the training process so that the model can learn spatial and temporal features to distinguish real or fake videos. Furthermore, we introduce various video-level data augmentation methods to improve the generalization capability of the forgery detection model. Extensive experiments show that our framework outperforms existing methods in terms of generalization to unseen manipulations and datasets. Code is available at https: //github.com/ZhendongWang6/AltFreezing.
    摘要 现有的面孔伪造检测模型通常仅仅检测到空间artefacts(例如生成artefacts、融合)或主要是时间artefacts(例如闪烁、缺失连续性)。它们可能会在面对不同领域artefacts时表现出显著性能下降。在这篇论文中,我们提议一种捕捉空间和时间artefacts的一体化模型 для面孔伪造检测。一种简单的想法是利用三维ConvNet。然而,我们发现它可能会很容易依赖于一种类型的artefact并忽略另一种。为了解决这个问题,我们提出了一种新的训练策略called AltFreezing,旨在促进模型检测空间和时间artefacts。它将把一个三维网络的Weight分为两组:空间相关和时间相关。然后,这两组的Weight在训练过程中被 alternate 冻结,以便模型可以学习空间和时间特征来 distinguish real or fake videos。此外,我们还引入了多种视频级数据增强方法,以提高伪造检测模型的通用性。广泛的实验结果表明,我们的框架在面对未seen manipulations和数据集时表现出优于现有方法。代码可以在 中下载。
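
A minimal sketch of alternate freezing: split a video backbone's 3D-convolution weights into spatial-related and temporal-related groups and toggle which group receives gradients every few iterations. The grouping heuristic (by temporal kernel extent), the period, and the toy backbone are assumptions, not the paper's exact recipe.

```python
import torch.nn as nn

def split_spatiotemporal_params(model):
    """Heuristic grouping: Conv3d kernels with temporal extent 1 are treated as
    spatial-related, kernels with temporal extent > 1 as temporal-related.
    Other parameters (norms, heads) are left always trainable."""
    spatial, temporal = [], []
    for m in model.modules():
        if isinstance(m, nn.Conv3d):
            group = spatial if m.kernel_size[0] == 1 else temporal
            group.extend(m.parameters())
    return spatial, temporal

def apply_alt_freezing(spatial, temporal, iteration, period=20):
    """Every `period` iterations, swap which group is frozen so the network is
    alternately pushed to learn temporal and spatial artifacts."""
    train_spatial = (iteration // period) % 2 == 0
    for p in spatial:
        p.requires_grad_(train_spatial)
    for p in temporal:
        p.requires_grad_(not train_spatial)

if __name__ == "__main__":
    # A tiny (2+1)D-style block standing in for a full video backbone.
    model = nn.Sequential(
        nn.Conv3d(3, 16, kernel_size=(1, 3, 3), padding=(0, 1, 1)),   # spatial conv
        nn.ReLU(),
        nn.Conv3d(16, 16, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal conv
    )
    spatial, temporal = split_spatiotemporal_params(model)
    for it in range(60):
        apply_alt_freezing(spatial, temporal, it, period=20)
        # optimizer.step() would follow here in a real training loop
    print(len(spatial), "spatial tensors,", len(temporal), "temporal tensors")
```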

Bridging the Gap: Multi-Level Cross-Modality Joint Alignment for Visible-Infrared Person Re-Identification

  • paper_url: http://arxiv.org/abs/2307.08316
  • repo_url: None
  • paper_authors: Tengfei Liang, Yi Jin, Wu Liu, Tao Wang, Songhe Feng, Yidong Li
  • for: 解决可见光和infrared摄像头之间的人识别问题,即人识别任务中的跨模态图像检索问题。
  • methods: 提出了一种简单而有效的方法,即多级跨模态共同准备(MCJA),它通过修正模态和目标水平的差距,解决了跨模态图像检索问题。
  • results: 在实验中,该方法通过增加模态匹配级联augmenation和跨模态检索损失,实现了提高跨模态图像检索的性能,并可以作为VI-ReID领域的强大基线方法。
    Abstract Visible-Infrared person Re-IDentification (VI-ReID) is a challenging cross-modality image retrieval task that aims to match pedestrians' images across visible and infrared cameras. To solve the modality gap, existing mainstream methods adopt a learning paradigm converting the image retrieval task into an image classification task with cross-entropy loss and auxiliary metric learning losses. These losses follow the strategy of adjusting the distribution of extracted embeddings to reduce the intra-class distance and increase the inter-class distance. However, such objectives do not precisely correspond to the final test setting of the retrieval task, resulting in a new gap at the optimization level. By rethinking these keys of VI-ReID, we propose a simple and effective method, the Multi-level Cross-modality Joint Alignment (MCJA), bridging both modality and objective-level gap. For the former, we design the Modality Alignment Augmentation, which consists of three novel strategies, the weighted grayscale, cross-channel cutmix, and spectrum jitter augmentation, effectively reducing modality discrepancy in the image space. For the latter, we introduce a new Cross-Modality Retrieval loss. It is the first work to constrain from the perspective of the ranking list, aligning with the goal of the testing stage. Moreover, based on the global feature only, our method exhibits good performance and can serve as a strong baseline method for the VI-ReID community.
    摘要 visible-infrared人Re-IDentification(VI-ReID)是一个复杂的跨模态图像检索任务,旨在匹配人员的图像在可见和红外摄像头之间。为解决模态差距,现有主流方法采用学习做法,将图像检索任务转化为图像分类任务,使用十字积分损失和辅助度量学习损失。这些损失采用缩短内类距离和增加间类距离的策略,但这些目标不准确反映最终测试阶段的检索任务,导致新的优化差距。通过重新思考VI-ReID的关键,我们提出了一种简单有效的方法:多级跨模态联合准确(MCJA)。它通过以下三种新策略来减少模态差距:1. 模态准确增强:对于每个图像,使用权重规则来增强模态准确性。2. 交叉通道CMix:在不同模态之间进行交叉通道的混合,以提高模态之间的匹配度。3. 谱谱异常增强:在不同模态之间进行谱谱异常的增强,以提高模态之间的匹配度。此外,我们还引入了一种新的跨模态检索损失,它是根据排序列表来约束的,与测试阶段的目标相匹配。我们的方法只使用全球特征,可以在VI-ReID领域中作为一个强大基线方法。
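
One of the modality-alignment augmentations, weighted grayscale, can be sketched by collapsing RGB person crops to a randomly weighted single channel so that visible images better resemble single-channel infrared ones. The sampling of the channel weights below is an assumption for illustration, not necessarily the paper's scheme.

```python
import torch

def weighted_grayscale(rgb, rng=None):
    """Collapse an RGB batch to a randomly weighted grayscale image and replicate
    it to 3 channels, mimicking the single-channel appearance of infrared images.
    rgb: (B, 3, H, W) in [0, 1]."""
    g = torch.Generator() if rng is None else rng
    w = torch.rand(3, generator=g)
    w = w / w.sum()                                   # random convex combination of channels
    gray = (rgb * w.view(1, 3, 1, 1)).sum(dim=1, keepdim=True)
    return gray.repeat(1, 3, 1, 1)

if __name__ == "__main__":
    batch = torch.rand(4, 3, 128, 64)                 # typical person-ReID crop size
    aug = weighted_grayscale(batch)
    print(aug.shape, float((aug[:, 0] - aug[:, 1]).abs().max()))  # channels are identical
```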

Rethinking Intersection Over Union for Small Object Detection in Few-Shot Regime

  • paper_url: http://arxiv.org/abs/2307.09562
  • repo_url: None
  • paper_authors: Pierre Le Jeune, Anissa Mokraoui
  • for: To improve the accuracy of detecting small objects in Few-Shot Object Detection (FSOD).
  • methods: Proposes Scale-adaptive Intersection over Union (SIoU), a novel box similarity measure used both as an evaluation criterion and as a training loss that prioritizes small objects.
  • results: SIoU significantly improves small object detection in both natural images (Pascal VOC and COCO) and aerial images (DOTA and DIOR), where small objects are critical, and achieves new state-of-the-art FSOD performance on DOTA and DIOR.
    Abstract In Few-Shot Object Detection (FSOD), detecting small objects is extremely difficult. The limited supervision cripples the localization capabilities of the models and a few pixels shift can dramatically reduce the Intersection over Union (IoU) between the ground truth and predicted boxes for small objects. To this end, we propose Scale-adaptive Intersection over Union (SIoU), a novel box similarity measure. SIoU changes with the objects' size, it is more lenient with small object shifts. We conducted a user study and SIoU better aligns than IoU with human judgment. Employing SIoU as an evaluation criterion helps to build more user-oriented models. SIoU can also be used as a loss function to prioritize small objects during training, outperforming existing loss functions. SIoU improves small object detection in the non-few-shot regime, but this setting is unrealistic in the industry as annotated detection datasets are often too expensive to acquire. Hence, our experiments mainly focus on the few-shot regime to demonstrate the superiority and versatility of SIoU loss. SIoU improves significantly FSOD performance on small objects in both natural (Pascal VOC and COCO datasets) and aerial images (DOTA and DIOR). In aerial imagery, small objects are critical and SIoU loss achieves new state-of-the-art FSOD on DOTA and DIOR.
    摘要 几个框架内部对象检测(FSOD)中,检测小对象非常困难。有限的监督使得模型的地方化能力受到限制,几个像素的偏移可以导致对真实值和预测框之间的交集覆盖率(IoU)减少很多。为了解决这个问题,我们提出了适应缩放交集覆盖率(SIoU),一种新的框 similarity度量。SIoU随对象的大小变化,对小对象的偏移更加宽容。我们进行了用户研究,发现SIoU与人类判断更加一致。使用SIoU作为评价标准可以建立更用户 oriented的模型。SIoU还可以作为训练 criterion,以优先级驱动模型在训练中学习小对象。SIoU在几何shot regime中显著提高了小对象检测性能,但这种设定是在实际应用中不切实际的,因为检测框 datasets通常是非常昂贵的。因此,我们的实验主要集中在几何shot regime中,以示SIoU损失的优越性和多样性。SIoU在自然图像(Pascal VOC和COCO datasets)和航空图像(DOTA和DIOR)上显著提高了小对象检测性能。在航空图像中,小对象非常重要,SIoU损失实现了新的状态的法Socket的FSOD。
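
The exact SIoU formula is not given in this summary, so the sketch below only illustrates the general idea: raise the IoU to a size-dependent exponent below 1 so that small ground-truth boxes are judged more leniently for the same overlap. The `kappa` schedule and `ref_size` are made-up placeholders, not the paper's definition.

```python
import numpy as np

def iou(box_a, box_b):
    """Standard IoU for [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def scale_adaptive_iou(pred, gt, ref_size=64.0):
    """Illustrative scale-adaptive IoU: small ground-truth boxes get an exponent
    well below 1, which inflates their score and forgives small shifts, while
    large boxes stay close to plain IoU. The schedule is an example, not the
    paper's formula."""
    gt_size = np.sqrt((gt[2] - gt[0]) * (gt[3] - gt[1]))
    kappa = gt_size / (gt_size + ref_size)            # in (0, 1), close to 1 for large boxes
    return iou(pred, gt) ** kappa

if __name__ == "__main__":
    small_gt, small_pred = [10, 10, 20, 20], [12, 12, 22, 22]     # 2-px shift, 10-px box
    large_gt, large_pred = [0, 0, 200, 200], [40, 40, 240, 240]   # same relative shift
    print("small: IoU=%.2f  SIoU=%.2f" % (iou(small_pred, small_gt),
                                          scale_adaptive_iou(small_pred, small_gt)))
    print("large: IoU=%.2f  SIoU=%.2f" % (iou(large_pred, large_gt),
                                          scale_adaptive_iou(large_pred, large_gt)))
```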

RCM-Fusion: Radar-Camera Multi-Level Fusion for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.10249
  • repo_url: None
  • paper_authors: Jisong Kim, Minjae Seong, Geonho Bang, Dongsuk Kum, Jun Won Choi
  • for: 本研究旨在提出一种基于雷达和摄像头的多级融合方法(RCM-Fusion),以完全利用雷达信息并提高3D对象检测性能。
  • methods: 本方法在feature级和实例级进行了雷达和摄像头的多级融合,包括Radar Guided BEV Encoder和Radar Grid Point Refinement module。Radar Guided BEV Encoder利用雷达 Bird’s-Eye-View特征将图像特征转换为精确的BEV表示,然后适应性地组合了雷达和摄像头的BEV特征。Radar Grid Point Refinement模块通过考虑雷达点云特征来减少本地化错误。
  • results: 在公共的nuScenes数据集上进行了实验,并证明了我们的提出的RCM-Fusion方法与摄像头只的基准模型相比,提高了11.8%的nuScenes检测得分(NDS),并在nuScenes 3D对象检测 benchmark中实现了雷达-摄像头融合方法的州际之最性能。
    Abstract While LiDAR sensors have been successfully applied to 3D object detection, the affordability of radar and camera sensors has led to a growing interest in fusing radars and cameras for 3D object detection. However, previous radar-camera fusion models have not been able to fully utilize radar information, in that initial 3D proposals were generated based on the camera features only and the instance-level fusion was subsequently conducted. In this paper, we propose radar-camera multi-level fusion (RCM-Fusion), which fuses radar and camera modalities at both the feature-level and instance-level to fully utilize radar information. At the feature-level, we propose a Radar Guided BEV Encoder which utilizes radar Bird's-Eye-View (BEV) features to transform image features into precise BEV representations and then adaptively combines the radar and camera BEV features. At the instance-level, we propose a Radar Grid Point Refinement module that reduces localization error by considering the characteristics of the radar point clouds. The experiments conducted on the public nuScenes dataset demonstrate that our proposed RCM-Fusion offers 11.8% performance gain in nuScenes detection score (NDS) over the camera-only baseline model and achieves state-of-the-art performances among radar-camera fusion methods in the nuScenes 3D object detection benchmark. Code will be made publicly available.
    摘要 而LiDAR感知器已经成功应用于3D物体检测中,但由于雷达和摄像头感知器的可Affordability,有关 fusion 雷达和摄像头的研究在紧张起来。然而,之前的雷达-摄像头融合模型尚未能充分利用雷达信息,因为初始的3D提案都是基于摄像头特征来生成的,然后进行了实例级融合。在这篇论文中,我们提议了雷达-摄像头多级融合(RCM-Fusion)模型,该模型在特征级和实例级都进行雷达和摄像头模态的融合,以完全利用雷达信息。在特征级上,我们提出了雷达导航BEV编码器,该编码器利用雷达 bird's-eye-view(BEV)特征将图像特征转换为准确的BEV表示,然后适应性地合并雷达和摄像头BEV特征。在实例级上,我们提出了雷达网点精度修正模块,该模块通过考虑雷达点云特征来减少局部定位错误。我们在公共的 nuScenes 数据集上进行了实验,结果显示,我们提出的 RCM-Fusion 与摄像头基eline模型相比,提高 nuScenes 检测分数(NDS)11.8%,并在 nuScenes 3D物体检测比赛中实现了雷达-摄像头融合方法的状态器。代码将公开发布。

Combiner and HyperCombiner Networks: Rules to Combine Multimodality MR Images for Prostate Cancer Localisation

  • paper_url: http://arxiv.org/abs/2307.08279
  • repo_url: None
  • paper_authors: Wen Yan, Bernard Chiu, Ziyi Shen, Qianye Yang, Tom Syer, Zhe Min, Shonit Punwani, Mark Emberton, David Atkinson, Dean C. Barratt, Yipeng Hu
  • for: 这种研究的目的是使用报告系统PI-RADS v2.1,评估multiparametric MR扫描图像中的肾癌风险。
  • methods: 这种研究使用了低维度Parametric模型,模型PI-RADS决策规则,以及HyperCombiner网络来训练一个单一的图像分割网络。
  • results: 实验结果基于850名患者的数据,表明,使用Combiner网络可以提高图像分割的效率,同时可以获得和解释个体图像模式的线性权重或征兆,以及评估图像可用性、重要性和规则发现等临床应用。
    Abstract One of the distinct characteristics in radiologists' reading of multiparametric prostate MR scans, using reporting systems such as PI-RADS v2.1, is to score individual types of MR modalities, T2-weighted, diffusion-weighted, and dynamic contrast-enhanced, and then combine these image-modality-specific scores using standardised decision rules to predict the likelihood of clinically significant cancer. This work aims to demonstrate that it is feasible for low-dimensional parametric models to model such decision rules in the proposed Combiner networks, without compromising the accuracy of predicting radiologic labels: First, it is shown that either a linear mixture model or a nonlinear stacking model is sufficient to model PI-RADS decision rules for localising prostate cancer. Second, parameters of these (generalised) linear models are proposed as hyperparameters, to weigh multiple networks that independently represent individual image modalities in the Combiner network training, as opposed to end-to-end modality ensemble. A HyperCombiner network is developed to train a single image segmentation network that can be conditioned on these hyperparameters during inference, for much improved efficiency. Experimental results based on data from 850 patients, for the application of automating radiologist labelling multi-parametric MR, compare the proposed combiner networks with other commonly-adopted end-to-end networks. Using the added advantages of obtaining and interpreting the modality combining rules, in terms of the linear weights or odds-ratios on individual image modalities, three clinical applications are presented for prostate cancer segmentation, including modality availability assessment, importance quantification and rule discovery.
    摘要 一个 radiologists 在多 Parametric prostate MR 扫描结果中的一个特征是,使用如 PI-RADS v2.1 的报告系统,对不同的 MR 模式(T2 重度、Diffusion 重度和动力刺激)进行分数,然后使用标准化的决策规则来预测肿瘤的可能性。这个工作的目的是证明可以使用低维度 Parametric 模型来模型这些决策规则,无需损失预测 радиологи labels 的准确性。首先,证明了线性混合模型或非线性堆叠模型都可以模型 PI-RADS 决策规则,用于Localizing 肿瘤。其次,通过将这些(总体)线性模型的参数作为权重来,以便在 Combiner 网络训练中对多个网络进行权重合并。在执行时,通过将这些参数作为 Condition 来,可以 conditioning 这些参数来提高效率。基于850名患者的数据,对多 Parametric MR 自动标注的 radiologist 标注进行比较,提出了Combiner 网络和其他常见的端到端网络之间的比较。通过获得和解释模式结合规则的优点,包括对单个图像模式的线性权重或抽象比率,对肿瘤 segmentation 进行三个临床应用:评估模式可用性、重要性评估和规则发现。
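
The linear-mixture variant of the decision rules can be sketched as a logistic combination of per-modality lesion probability maps with per-modality weights, which are readable as importance or odds-ratio style coefficients. The weights, the log-odds form, and the toy maps are assumptions; the nonlinear stacking model and the HyperCombiner conditioning are not shown.

```python
import numpy as np

def linear_combiner(modality_probs, weights, bias=0.0):
    """Linear mixture of per-modality probability maps.

    modality_probs: dict like {"T2w": p1, "DWI": p2, "DCE": p3}, each an (H, W)
                    array of voxel-wise lesion probabilities in (0, 1).
    weights:        dict of per-modality weights, interpretable as the relative
                    importance of each MR sequence in the combining rule.
    Returns a combined probability map squashed back to (0, 1).
    """
    logit = bias + sum(w * np.log(modality_probs[name] / (1 - modality_probs[name]) + 1e-12)
                       for name, w in weights.items())
    return 1.0 / (1.0 + np.exp(-logit))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = {name: rng.uniform(0.05, 0.95, size=(8, 8)) for name in ("T2w", "DWI", "DCE")}
    weights = {"T2w": 0.5, "DWI": 1.5, "DCE": 1.0}     # hypothetical rule weights
    combined = linear_combiner(probs, weights)
    print(combined.shape, bool(combined.min() > 0), bool(combined.max() < 1))
```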

Adversarial Attacks on Traffic Sign Recognition: A Survey

  • paper_url: http://arxiv.org/abs/2307.08278
  • repo_url: None
  • paper_authors: Svetlana Pavlitska, Nico Lambing, J. Marius Zöllner
  • for: 这篇论文主要针对的是自动驾驶车辆的视觉系统中的交通标志识别问题,以及这个问题如何受到深度神经网络(DNNs)的攻击。
  • methods: 该论文主要采用了现有的深度神经网络(DNNs)进行交通标志识别和分类,并对这些模型进行了数字和实际攻击。
  • results: 该论文提供了现有的攻击研究的概述,并指出了需要进一步研究的领域。
    Abstract Traffic sign recognition is an essential component of perception in autonomous vehicles, which is currently performed almost exclusively with deep neural networks (DNNs). However, DNNs are known to be vulnerable to adversarial attacks. Several previous works have demonstrated the feasibility of adversarial attacks on traffic sign recognition models. Traffic signs are particularly promising for adversarial attack research due to the ease of performing real-world attacks using printed signs or stickers. In this work, we survey existing works performing either digital or real-world attacks on traffic sign detection and classification models. We provide an overview of the latest advancements and highlight the existing research areas that require further investigation.
    摘要 自动驾驶车辆的见识功能中,交通标志识别是一个重要的组成部分,目前大多数使用深度神经网络(DNNs)进行实现。但是,DNNs已知容易受到对抗攻击。许多前期工作已经证明了对交通标志识别模型的攻击的可行性。由于交通标志的易于获得和修改,交通标志识别模型在实际攻击中具有极高的潜在危害性。在这种情况下,我们对交通标志检测和分类模型的攻击进行了评估和概述,并 highlighted 需要进一步研究的领域。

Liver Tumor Screening and Diagnosis in CT with Pixel-Lesion-Patient Network

  • paper_url: http://arxiv.org/abs/2307.08268
  • repo_url: None
  • paper_authors: Ke Yan, Xiaoli Yin, Yingda Xia, Fakai Wang, Shu Wang, Yuan Gao, Jiawen Yao, Chunli Li, Xiaoyu Bai, Jingren Zhou, Ling Zhang, Le Lu, Yu Shi
  • for: liver tumor segmentation and classification in non-contrast and dynamic contrast-enhanced CT images
  • methods: mask transformer with improved anchor queries and foreground-enhanced sampling loss, and an image-wise classifier to aggregate global information
  • results: high accuracy in tumor screening and lesion segmentation, and on par with a senior human radiologist in a reader study
    Abstract Liver tumor segmentation and classification are important tasks in computer aided diagnosis. We aim to address three problems: liver tumor screening and preliminary diagnosis in non-contrast computed tomography (CT), and differential diagnosis in dynamic contrast-enhanced CT. A novel framework named Pixel-Lesion-pAtient Network (PLAN) is proposed. It uses a mask transformer to jointly segment and classify each lesion with improved anchor queries and a foreground-enhanced sampling loss. It also has an image-wise classifier to effectively aggregate global information and predict patient-level diagnosis. A large-scale multi-phase dataset is collected containing 939 tumor patients and 810 normal subjects. 4010 tumor instances of eight types are extensively annotated. On the non-contrast tumor screening task, PLAN achieves 95% and 96% in patient-level sensitivity and specificity. On contrast-enhanced CT, our lesion-level detection precision, recall, and classification accuracy are 92%, 89%, and 86%, outperforming widely used CNN and transformers for lesion segmentation. We also conduct a reader study on a holdout set of 250 cases. PLAN is on par with a senior human radiologist, showing the clinical significance of our results.
    摘要 liver tumor segmentation和分类是计算机辅助诊断中的重要任务。我们想要解决三个问题:肝肿征检测和初步诊断在不含对比 computed tomography(CT)图像,以及在动态对比增强CT图像中的差异诊断。我们提出了一个名为Pixel-Lesion-pAtient Network(PLAN)的框架。它使用一个面对transformer来同时段和类别每个肿瘤,并使用改进的锚点查询和前景增强抽象损失来提高精度。它还有一个图像级别分类器,可以有效地聚合全局信息并预测患者级别诊断。我们收集了一个大规模多阶段数据集,包括939名患者和810名正常人。4010个肿瘤实例中有八种类型得到了广泛的注释。在非对比肿征检测任务上,PLAN达到了95%和96%的患者级别敏感性和特异性。在对比CT图像上,我们的肿瘤水平检测精度、回归率和分类精度分别为92%, 89%和86%,超越了广泛使用的CNN和transformers для肿瘤 segmentation。我们还进行了一次读者研究,并证明PLAN与一名高级人类Radiologist在250个案例中的表现相当。

Extreme Image Compression using Fine-tuned VQGAN Models

  • paper_url: http://arxiv.org/abs/2307.08265
  • repo_url: None
  • paper_authors: Qi Mao, Tinghan Yang, Yinuo Zhang, Shuyin Pan, Meng Wang, Shiqi Wang, Siwei Ma
  • for: 提高压缩数据的感知质量,特别是在低比特率下。
  • methods: 引入vector quantization(VQ)基于生成模型,将图像表示为VQ指标。
  • results: 提出了一种简单 yet有效的编码框架,可以在低比特率下保持图像重建质量。并通过对大规模代码库进行划分,实现图像可以被表示为多个不同的VQ指标,从而实现可变比特率和不同水平的重建质量。
    Abstract Recent advances in generative compression methods have demonstrated remarkable progress in enhancing the perceptual quality of compressed data, especially in scenarios with low bitrates. Nevertheless, their efficacy and applicability in achieving extreme compression ratios ($<0.1$ bpp) still remain constrained. In this work, we propose a simple yet effective coding framework by introducing vector quantization (VQ)-based generative models into the image compression domain. The main insight is that the codebook learned by the VQGAN model yields strong expressive capacity, facilitating efficient compression of continuous information in the latent space while maintaining reconstruction quality. Specifically, an image can be represented as VQ-indices by finding the nearest codeword, which can be encoded using lossless compression methods into bitstreams. We then propose clustering a pre-trained large-scale codebook into smaller codebooks using the K-means algorithm. This enables images to be represented as diverse ranges of VQ-indices maps, resulting in variable bitrates and different levels of reconstruction quality. Extensive qualitative and quantitative experiments on various datasets demonstrate that the proposed framework outperforms the state-of-the-art codecs in terms of perceptual quality-oriented metrics and human perception under extremely low bitrates.
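
The core coding step, representing an image's latents as nearest-codeword indices and losslessly coding the index map, can be sketched as below, with a random codebook and Python's zlib standing in for a learned VQGAN codebook and a real entropy coder. The sizes and the bits-per-pixel arithmetic (assuming a 256x256 input image) are illustrative.

```python
import zlib
import numpy as np

def quantize_to_indices(latents, codebook):
    """Map each latent vector to the index of its nearest codeword.
    latents:  (N, D) continuous latents from an encoder
    codebook: (K, D) codewords (e.g., from a VQGAN, possibly k-means clustered)
    """
    d = ((latents ** 2).sum(1, keepdims=True)
         + (codebook ** 2).sum(1)
         - 2.0 * latents @ codebook.T)                  # squared distances, (N, K)
    return d.argmin(axis=1).astype(np.uint16)

def encode(indices):
    """Stand-in lossless coder: zlib over the raw index bytes."""
    return zlib.compress(indices.tobytes(), 9)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1024, 256))              # K=1024 codewords of dim 256
    latents = rng.normal(size=(16 * 16, 256))             # a 16x16 latent grid
    idx = quantize_to_indices(latents, codebook)
    bitstream = encode(idx)
    bpp = 8 * len(bitstream) / (256 * 256)                 # assuming a 256x256 input image
    print(f"indices: {idx.shape}, compressed to {len(bitstream)} bytes (~{bpp:.3f} bpp)")
```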

Hierarchical Spatiotemporal Transformers for Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.08263
  • repo_url: None
  • paper_authors: Jun-Sang Yoo, Hongjae Lee, Seung-Won Jung
  • for: 这篇论文探讨了一个新的框架,即HST,用于半监督类别影像对象分割 (VOS)。
  • methods: 这篇论文使用了最新的Swin Transformer和Video Swin Transformer来提取影像和影片特征,并将它们视为问题和内存,以获得高效的对象掩模数据。
  • results: HST在处理具有遮盾和快速移动的物体,以及压缩背景的情况下表现出色,并在多个知名的测试benchmark上表现出比以前的竞争对手更高的效果。具体来说,HST-B在YouTube-VOS(85.0%)、DAVIS 2017(85.9%)和DAVIS 2016(94.0%)等多个知名测试benchmark上表现出比以前的竞争对手更高的效果。
    Abstract This paper presents a novel framework called HST for semi-supervised video object segmentation (VOS). HST extracts image and video features using the latest Swin Transformer and Video Swin Transformer to inherit their inductive bias for the spatiotemporal locality, which is essential for temporally coherent VOS. To take full advantage of the image and video features, HST casts image and video features as a query and memory, respectively. By applying efficient memory read operations at multiple scales, HST produces hierarchical features for the precise reconstruction of object masks. HST shows effectiveness and robustness in handling challenging scenarios with occluded and fast-moving objects under cluttered backgrounds. In particular, HST-B outperforms the state-of-the-art competitors on multiple popular benchmarks, i.e., YouTube-VOS (85.0%), DAVIS 2017 (85.9%), and DAVIS 2016 (94.0%).

Large-Scale Person Detection and Localization using Overhead Fisheye Cameras

  • paper_url: http://arxiv.org/abs/2307.08252
  • repo_url: None
  • paper_authors: Lu Yang, Liulei Li, Xueshi Xin, Yifan Sun, Qing Song, Wenguan Wang
  • for: 本研究旨在提供一种基于折射镜相机的人员位置测定方法,以满足现代生活中的各种应用需求。
  • methods: 该方法使用了一种基于折射镜的人体探测网络,利用折射镜的扭转对称性进行培训策略,并通过数值解决方法计算实际人员位置。
  • results: 实验结果表明, compared to先前方法,该方法的折射镜人体探测器有superiority,并且整个折射镜位置测定方法可以在0.5米的准确精度下,在0.1秒钟之内确定所有人员在FOV的位置。
    Abstract Location determination finds wide applications in daily life. Instead of existing efforts devoted to localizing tourist photos captured by perspective cameras, in this article, we focus on devising person positioning solutions using overhead fisheye cameras. Such solutions are advantageous in large field of view (FOV), low cost, anti-occlusion, and unaggressive work mode (without the necessity of cameras carried by persons). However, related studies are quite scarce, due to the paucity of data. To stimulate research in this exciting area, we present LOAF, the first large-scale overhead fisheye dataset for person detection and localization. LOAF is built with many essential features, e.g., i) the data cover abundant diversities in scenes, human pose, density, and location; ii) it contains currently the largest number of annotated pedestrian, i.e., 457K bounding boxes with groundtruth location information; iii) the body-boxes are labeled as radius-aligned so as to fully address the positioning challenge. To approach localization, we build a fisheye person detection network, which exploits the fisheye distortions by a rotation-equivariant training strategy and predict radius-aligned human boxes end-to-end. Then, the actual locations of the detected persons are calculated by a numerical solution on the fisheye model and camera altitude data. Extensive experiments on LOAF validate the superiority of our fisheye detector w.r.t. previous methods, and show that our whole fisheye positioning solution is able to locate all persons in FOV with an accuracy of 0.5 m, within 0.1 s.
    摘要 Location determination has numerous applications in daily life. Instead of previous efforts focused on localizing tourist photos captured by perspective cameras, this article focuses on developing person positioning solutions using overhead fisheye cameras. These solutions have several advantages, including a large field of view (FOV), low cost, resistance to occlusion, and a non-intrusive work mode (without the need for cameras carried by individuals). However, there is a lack of related studies due to the scarcity of data. To promote research in this exciting area, we present LOAF, the first large-scale overhead fisheye dataset for person detection and localization. LOAF features several essential aspects, including:1. Diverse scenes, human poses, densities, and locations are covered in the data.2. It contains the largest number of annotated pedestrians, with 457,000 bounding boxes and ground truth location information.3. The body boxes are labeled as radius-aligned to fully address the positioning challenge.To perform localization, we develop a fisheye person detection network that leverages fisheye distortions using a rotation-equivariant training strategy. The network predicts radius-aligned human boxes end-to-end. Then, the actual locations of the detected persons are calculated using a numerical solution on the fisheye model and camera altitude data. Extensive experiments on LOAF demonstrate the superiority of our fisheye detector compared to previous methods, and show that our entire fisheye positioning solution can accurately locate all persons in the FOV within 0.5 meters and within 0.1 seconds.
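
A hedged sketch of how a detected person could be mapped to a ground position from an overhead fisheye view: under an equidistant fisheye model, the radial pixel distance gives the incidence angle, and the camera altitude converts it to a distance on the floor. The equidistant model, the focal length, and the camera height below are assumptions, not the paper's calibrated fisheye model.

```python
import math

def fisheye_ground_position(px, py, cx, cy, f_px, cam_height_m):
    """Locate a person on the ground plane from an overhead fisheye detection.

    (px, py):      foot point of the radius-aligned person box, in pixels
    (cx, cy):      image center (principal point), in pixels
    f_px:          focal length of an *equidistant* fisheye model (r = f * theta)
    cam_height_m:  camera altitude above the floor, in meters
    Returns (x, y) ground offsets from the point directly below the camera.
    """
    dx, dy = px - cx, py - cy
    r = math.hypot(dx, dy)                        # radial distance in the image
    if r < 1e-6:
        return 0.0, 0.0                            # person directly under the camera
    theta = r / f_px                               # incidence angle from the optical axis
    ground_dist = cam_height_m * math.tan(theta)   # distance along the floor
    return ground_dist * dx / r, ground_dist * dy / r

if __name__ == "__main__":
    # Hypothetical numbers: 1024x1024 image, f = 300 px, camera 3 m above the floor.
    x, y = fisheye_ground_position(px=700, py=512, cx=512, cy=512,
                                   f_px=300.0, cam_height_m=3.0)
    print(f"person at ({x:.2f} m, {y:.2f} m) from the camera's nadir")
```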

Random Boxes Are Open-world Object Detectors

  • paper_url: http://arxiv.org/abs/2307.08249
  • repo_url: https://github.com/scuwyh2000/randbox
  • paper_authors: Yanghao Wang, Zhongqi Yue, Xian-Sheng Hua, Hanwang Zhang
  • for: 本文目的是提出一种基于随机区域提议的Open-world Object Detection(OWOD)方法,以提高不知对象的检测精度。
  • methods: 本文使用的方法包括Random Box(RandBox)架构,基于Faster R-CNN和Transformer的基础,通过随机提议来增强模型的泛化能力。
  • results: 对 Pascal-VOC/MS-COCO 和 LVIS 两个底层 benchmark 进行了评估, RandBox 在所有指标中显著超过了之前的状态方法。 codes 可以在 https://github.com/scuwyh2000/RandBox 上获取。
    Abstract We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD): they can not only maintain the accuracy of the known objects (w/ training labels), but also considerably improve the recall of unknown ones (w/o training labels). Specifically, we propose RandBox, a Fast R-CNN based architecture trained on random proposals at each training iteration, surpassing existing Faster R-CNN and Transformer based OWOD. Its effectiveness stems from the following two benefits introduced by randomness. First, as the randomization is independent of the distribution of the limited known objects, the random proposals become the instrumental variable that prevents the training from being confounded by the known objects. Second, the unbiased training encourages more proposal explorations by using our proposed matching score that does not penalize the random proposals whose prediction scores do not match the known objects. On two benchmarks: Pascal-VOC/MS-COCO and LVIS, RandBox significantly outperforms the previous state-of-the-art in all metrics. We also detail the ablations on randomization and loss designs. Codes are available at https://github.com/scuwyh2000/RandBox.
    摘要
    Randomness brings two benefits: (1) the randomization is independent of the distribution of the limited known objects, so the random proposals serve as an instrumental variable that prevents training from being confounded by the known objects; (2) the unbiased training encourages more proposal exploration, because our proposed matching score does not penalize random proposals whose prediction scores do not match the known objects. On the Pascal-VOC/MS-COCO and LVIS benchmarks, RandBox significantly outperforms the previous state-of-the-art in all metrics. We also conduct ablation studies on randomization and loss designs. The code is available at https://github.com/scuwyh2000/RandBox.
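
A hedged sketch of the core idea of drawing region proposals at random each training iteration, independent of the known-object distribution; the box count and uniform sampling scheme are assumptions, not the paper's exact recipe.

```python
import torch

def sample_random_proposals(num_boxes: int, img_w: int, img_h: int) -> torch.Tensor:
    """Draw box proposals uniformly at random, independent of the known-object
    distribution (a sketch of the idea behind RandBox; the exact sampling
    scheme in the paper may differ)."""
    cx = torch.rand(num_boxes) * img_w
    cy = torch.rand(num_boxes) * img_h
    w = torch.rand(num_boxes) * img_w
    h = torch.rand(num_boxes) * img_h
    x1 = (cx - w / 2).clamp(0, img_w)
    y1 = (cy - h / 2).clamp(0, img_h)
    x2 = (cx + w / 2).clamp(0, img_w)
    y2 = (cy + h / 2).clamp(0, img_h)
    return torch.stack([x1, y1, x2, y2], dim=1)  # (num_boxes, 4) in xyxy format

# Proposals are redrawn at every training iteration.
proposals = sample_random_proposals(500, img_w=1333, img_h=800)
```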

Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

  • paper_url: http://arxiv.org/abs/2307.08243
  • repo_url: None
  • paper_authors: Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, Yu Kong
  • for: 从第一人称(egocentric)视角预测人手的三维轨迹,以便在与 AR/VR 系统交互时快速理解人的意图。
  • methods: 提出了基于第一人称 RGB 视频的 egocentric 3D 手部轨迹预测任务,并提出了不确定性感知的状态空间 Transformer(USST),可进一步通过速度约束和视觉提示调优(VPT)加以改进。
  • results: 在 H2O 和 EgoPAT3D 数据集上验证了 USST 在 2D 与 3D 轨迹预测上的优越性。代码和数据集已在 GitHub 公开发布:https://github.com/Cogito2012/USST
    Abstract Hand trajectory forecasting from egocentric views is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems. However, existing methods handle this problem in a 2D image space which is inadequate for 3D real-world applications. In this paper, we set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view. To fulfill this goal, we propose an uncertainty-aware state space Transformer (USST) that takes the merits of the attention mechanism and aleatoric uncertainty within the framework of the classical state-space model. The model can be further enhanced by the velocity constraint and visual prompt tuning (VPT) on large vision transformers. Moreover, we develop an annotation workflow to collect 3D hand trajectories with high quality. Experimental results on H2O and EgoPAT3D datasets demonstrate the superiority of USST for both 2D and 3D trajectory forecasting. The code and datasets are publicly released: https://github.com/Cogito2012/USST.
    摘要 从第一人称视角预测手部轨迹是 AR/VR 系统快速理解人类意图的关键。然而,现有方法在二维图像空间中处理这一问题,不适用于三维真实世界应用。在本文中,我们提出了一个第一人称三维手部轨迹预测任务,旨在从早期观察到的第一人称 RGB 视频中预测三维空间中的手部轨迹。为实现这一目标,我们提出了一种不确定性感知的状态空间 Transformer(USST),它在经典状态空间模型的框架内结合了注意力机制和偶然(aleatoric)不确定性。该模型还可以通过速度约束和在大型视觉 Transformer 上的视觉提示调优(VPT)进一步增强。此外,我们开发了一套标注流程,以收集高质量的三维手部轨迹。在 H2O 和 EgoPAT3D 数据集上的实验结果表明,USST 在二维和三维轨迹预测上均具有优越性。代码和数据集已公开发布:https://github.com/Cogito2012/USST。
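
As a rough illustration of how aleatoric uncertainty is typically modeled in trajectory regression, the sketch below uses a Gaussian negative log-likelihood over predicted 3D waypoints; this is a generic formulation, not necessarily the exact objective used by USST.

```python
import torch

def heteroscedastic_nll(pred_mean, pred_log_var, target):
    """Gaussian negative log-likelihood with per-point aleatoric uncertainty.

    pred_mean, pred_log_var, target: (B, T, 3) future hand waypoints.
    Points the model is unsure about receive a larger predicted variance,
    which down-weights their squared error but pays a log-variance penalty.
    """
    inv_var = torch.exp(-pred_log_var)
    loss = 0.5 * (inv_var * (target - pred_mean) ** 2 + pred_log_var)
    return loss.mean()
```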

Unified Open-Vocabulary Dense Visual Prediction

  • paper_url: http://arxiv.org/abs/2307.08238
  • repo_url: None
  • paper_authors: Hengcan Shi, Munawar Hayat, Jianfei Cai
  • for: 这篇论文旨在提出一种统一的开放词汇网络(UOVN),以联合处理四种常见的密集预测任务。
  • methods: 论文提出了一种多模态、多尺度、多任务(MMM)解码机制,以更好地利用多模态数据;此外还提出了一种 UOVN 训练机制,以缩小不同任务和领域之间的差距。
  • results: 在四个数据集上的实验结果表明,UOVN 能有效地处理这些密集预测任务。
    Abstract In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection, semantic, instance and panoptic segmentations) has attracted increasing research attention. However, most of existing approaches are task-specific and individually tackle each task. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, OV dense prediction training data is relatively less. Separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse training data to boost individual tasks. We address two major challenges in unified OV prediction. Firstly, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. Therefore, we propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better leverage multi-modal data. Secondly, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of our UOVN.
    摘要 近年来,开放词汇(OV)密集视觉预测(如 OV 物体检测、语义分割、实例分割和全景分割)吸引了越来越多的研究关注。然而,现有方法大多是任务特定的,各自单独处理每个任务。在这篇论文中,我们提出了一个统一开放词汇网络(UOVN),用于同时处理四种常见的密集预测任务。相比分离的模型,统一网络更适合多样化的工业应用。此外,OV 密集预测的训练数据相对较少:分离的网络只能利用与各自任务相关的训练数据,而统一方法可以整合多种训练数据来提升各个任务。我们解决了统一 OV 预测中的两大挑战:其一,与固定类别集的统一方法不同,OV 网络通常使用多模态数据训练,因此我们提出了一种多模态、多尺度、多任务(MMM)解码机制,以更好地利用多模态数据;其二,由于 UOVN 使用来自不同任务的数据进行训练,存在明显的领域和任务差距,我们提出了一种 UOVN 训练机制来缩小这些差距。在四个数据集上的实验表明了 UOVN 的有效性。

Video Frame Interpolation with Stereo Event and Intensity Camera

  • paper_url: http://arxiv.org/abs/2307.08228
  • repo_url: None
  • paper_authors: Chao Ding, Mingyuan Lin, Haijian Zhang, Jianzhuang Liu, Lei Yu
  • for: 解决事件-强度立体相机设置中难以仅靠立体校正消除的跨模态视差问题,提升基于事件的视频插帧(E-VFI)的性能。
  • methods: 提出了一种新的立体事件视频插帧网络(SEVFI-Net),通过特征聚合模块(FAM)缓解视差并在特征域实现空间对齐,再结合光流和视差估计,生成高质量的中间帧及对应的视差。
  • results: 在包含复杂运动和不同深度的真实场景中,所提出的 SEVFI-Net 在公开的真实立体数据集(DSEC 和 MVSEC)以及自采的立体事件-强度数据集(SEID)上均显著优于现有的 E-VFI 方法。
    Abstract The stereo event-intensity camera setup is widely applied to leverage the advantages of both event cameras with low latency and intensity cameras that capture accurate brightness and texture information. However, such a setup commonly encounters cross-modality parallax that is difficult to be eliminated solely with stereo rectification especially for real-world scenes with complex motions and varying depths, posing artifacts and distortion for existing Event-based Video Frame Interpolation (E-VFI) approaches. To tackle this problem, we propose a novel Stereo Event-based VFI (SE-VFI) network (SEVFI-Net) to generate high-quality intermediate frames and corresponding disparities from misaligned inputs consisting of two consecutive keyframes and event streams emitted between them. Specifically, we propose a Feature Aggregation Module (FAM) to alleviate the parallax and achieve spatial alignment in the feature domain. We then exploit the fused features accomplishing accurate optical flow and disparity estimation, and achieving better interpolated results through flow-based and synthesis-based ways. We also build a stereo visual acquisition system composed of an event camera and an RGB-D camera to collect a new Stereo Event-Intensity Dataset (SEID) containing diverse scenes with complex motions and varying depths. Experiments on public real-world stereo datasets, i.e., DSEC and MVSEC, and our SEID dataset demonstrate that our proposed SEVFI-Net outperforms state-of-the-art methods by a large margin.
    摘要 事件-强度立体相机设置被广泛用于同时利用事件相机的低延迟和强度相机所捕获的准确亮度与纹理信息。然而,这种设置通常会遇到跨模态视差,仅靠立体校正难以消除,尤其是在包含复杂运动和不同深度的真实场景中,这会给现有的基于事件的视频插帧(E-VFI)方法带来伪影和畸变。为了解决这一问题,我们提出了一种新的立体事件视频插帧网络(SEVFI-Net),从两个连续关键帧及其间发出的事件流等未对齐的输入中,生成高质量的中间帧和相应的视差。具体来说,我们提出了特征聚合模块(FAM),以缓解视差并在特征域实现空间对齐;随后利用融合后的特征完成准确的光流和视差估计,并通过基于光流和基于合成两种方式获得更好的插帧结果。我们还搭建了由事件相机和 RGB-D 相机组成的立体视觉采集系统,收集了一个新的立体事件-强度数据集(SEID),其中包含具有复杂运动和不同深度的多样化场景。在公开的真实立体数据集 DSEC、MVSEC 以及我们的 SEID 数据集上的实验表明,所提出的 SEVFI-Net 大幅领先于现有方法。

Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.08209
  • repo_url: None
  • paper_authors: Tianchen Zhao, Xuefei Ning, Ke Hong, Zhongyuan Qiu, Pu Lu, Yali Zhao, Linfeng Zhang, Lipu Zhou, Guohao Dai, Huazhong Yang, Yu Wang
  • for: 这个研究旨在提高自驾车中3D物体检测的效率,使其能够在资源有限的车辆上运行。
  • methods: 该研究提出了自适应推理框架 Ada3D,在输入层面滤除空间上冗余的点,以提高模型效率;此外,它还利用 2D BEV 特征图固有的稀疏性,降低内存和计算开销。
  • results: 该方法在不损失精度的情况下将 3D voxel 数量减少 40%,并把 2D BEV 特征图的密度从 100% 降到 20%;同时将模型的计算和内存开销降低约 5 倍,并带来端到端 GPU 延迟和 GPU 峰值内存的优化。
    Abstract Voxel-based methods have achieved state-of-the-art performance for 3D object detection in autonomous driving. However, their significant computational and memory costs pose a challenge for their application to resource-constrained vehicles. One reason for this high resource consumption is the presence of a large number of redundant background points in Lidar point clouds, resulting in spatial redundancy in both 3D voxel and dense BEV map representations. To address this issue, we propose an adaptive inference framework called Ada3D, which focuses on exploiting the input-level spatial redundancy. Ada3D adaptively filters the redundant input, guided by a lightweight importance predictor and the unique properties of the Lidar point cloud. Additionally, we utilize the BEV features' intrinsic sparsity by introducing the Sparsity Preserving Batch Normalization. With Ada3D, we achieve 40% reduction for 3D voxels and decrease the density of 2D BEV feature maps from 100% to 20% without sacrificing accuracy. Ada3D reduces the model computational and memory cost by 5x, and achieves 1.52x/1.45x end-to-end GPU latency and 1.5x/4.5x GPU peak memory optimization for the 3D and 2D backbone respectively.
    摘要 基于体素(voxel)的方法已在自动驾驶的 3D 目标检测中取得了最先进的性能。然而,其巨大的计算和内存开销使其难以应用于资源受限的车辆。造成高资源消耗的一个原因是激光雷达点云中存在大量冗余的背景点,导致 3D 体素和稠密 BEV 特征图两种表示都存在空间冗余。为了解决这一问题,我们提出了一种自适应推理框架 Ada3D,专注于利用输入层面的空间冗余。Ada3D 在轻量级重要性预测器和激光雷达点云独特性质的引导下,自适应地过滤冗余输入。此外,我们还通过引入保持稀疏性的批归一化(Sparsity Preserving Batch Normalization)来利用 BEV 特征固有的稀疏性。借助 Ada3D,我们在不损失精度的情况下减少了 40% 的 3D 体素,并将 2D BEV 特征图的密度从 100% 降到 20%。Ada3D 将模型的计算和内存开销降低 5 倍,并分别为 3D 和 2D 骨干网络带来 1.52 倍/1.45 倍的端到端 GPU 延迟优化以及 1.5 倍/4.5 倍的 GPU 峰值内存优化。
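
A simplified sketch of input-level spatial-redundancy filtering with a lightweight importance predictor, in the spirit of Ada3D; the top-k keep ratio and module names are assumptions.

```python
import torch

def filter_voxels(voxel_feats, voxel_coords, importance_net, keep_ratio=0.6):
    """Drop low-importance (mostly background) voxels before the heavy 3D backbone.

    voxel_feats    : (N, C) features of non-empty voxels
    voxel_coords   : (N, 3) integer voxel coordinates
    importance_net : small MLP scoring each voxel; only the top `keep_ratio`
                     fraction is kept (a sketch of Ada3D's adaptive-inference idea).
    """
    scores = importance_net(voxel_feats).squeeze(-1)        # (N,)
    k = max(1, int(keep_ratio * voxel_feats.shape[0]))
    keep = torch.topk(scores, k).indices                    # indices of retained voxels
    return voxel_feats[keep], voxel_coords[keep]
```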

Unbiased Image Synthesis via Manifold-Driven Sampling in Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.08199
  • repo_url: None
  • paper_authors: Xingzhe Su, Yi Ren, Wenwen Qiang, Zeen Song, Hang Gao, Fengge Wu, Changwen Zheng
  • for: 该研究旨在解决扩散模型中的数据偏差问题,尤其是当训练数据不能准确反映真实数据分布、呈现偏斜或不均衡模式时。
  • methods: 提出了一种利用流形引导来缓解扩散模型数据偏差的新方法:先用无监督方式估计训练数据的流形,再用它引导扩散模型的采样过程,使生成图像在数据流形上均匀分布,且无需修改模型结构或重新训练。
  • results: 理论分析和实验证明,该方法能够提升扩散模型图像生成的质量与无偏性,相比标准扩散模型可生成更多样、更均衡的图像,并提高下游应用的鲁棒性。
    Abstract Diffusion models are a potent class of generative models capable of producing high-quality images. However, they can face challenges related to data bias, favoring specific modes of data, especially when the training data does not accurately represent the true data distribution and exhibits skewed or imbalanced patterns. For instance, the CelebA dataset contains more female images than male images, leading to biased generation results and impacting downstream applications. To address this issue, we propose a novel method that leverages manifold guidance to mitigate data bias in diffusion models. Our key idea is to estimate the manifold of the training data using an unsupervised approach, and then use it to guide the sampling process of diffusion models. This encourages the generated images to be uniformly distributed on the data manifold without altering the model architecture or necessitating labels or retraining. Theoretical analysis and empirical evidence demonstrate the effectiveness of our method in improving the quality and unbiasedness of image generation compared to standard diffusion models.
    摘要 扩散模型是一类强大的生成模型,能够生成高质量图像。然而,它们可能面临数据偏差问题,偏向特定的数据模式,尤其当训练数据不能准确反映真实数据分布、呈现偏斜或不均衡模式时。例如,CelebA 数据集中女性图像多于男性图像,这会导致生成结果带有偏差,并影响下游应用。为解决这一问题,我们提出了一种利用流形引导来缓解扩散模型数据偏差的新方法。我们的关键思想是用无监督方式估计训练数据的流形,再用它引导扩散模型的采样过程。这使得生成图像在数据流形上均匀分布,且无需修改模型结构,也不需要标签或重新训练。理论分析和实验证据表明,与标准扩散模型相比,我们的方法能够提升图像生成的质量与无偏性。

On Point Affiliation in Feature Upsampling

  • paper_url: http://arxiv.org/abs/2307.08198
  • repo_url: https://github.com/tiny-smart/sapa
  • paper_authors: Wenze Liu, Hao Lu, Yuliang Liu, Zhiguo Cao
  • for: The paper aims to improve feature upsampling in dense prediction tasks, specifically by addressing the problem of point affiliation.
  • methods: It introduces the notion of point affiliation and presents a novel, lightweight, and universal upsampling solution called Similarity-Aware Point Affiliation (SAPA), which uses a generic formulation for generating similarity-aware upsampling kernels that encourage not only semantic smoothness but also boundary sharpness.
  • results: SAPA outperforms prior upsamplers and consistently improves performance on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation.
    Abstract We introduce the notion of point affiliation into feature upsampling. By abstracting a feature map into non-overlapped semantic clusters formed by points of identical semantic meaning, feature upsampling can be viewed as point affiliation -- designating a semantic cluster for each upsampled point. In the framework of kernel-based dynamic upsampling, we show that an upsampled point can resort to its low-res decoder neighbors and high-res encoder point to reason the affiliation, conditioned on the mutual similarity between them. We therefore present a generic formulation for generating similarity-aware upsampling kernels and prove that such kernels encourage not only semantic smoothness but also boundary sharpness. This formulation constitutes a novel, lightweight, and universal upsampling solution, Similarity-Aware Point Affiliation (SAPA). We show its working mechanism via our preliminary designs with window-shape kernel. After probing the limitations of the designs on object detection, we reveal additional insights for upsampling, leading to SAPA with the dynamic kernel shape. Extensive experiments demonstrate that SAPA outperforms prior upsamplers and invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation. Code is made available at: https://github.com/tiny-smart/sapa
    摘要 我们引入点聚合(point affiliation)into feature upsampling。我们抽象特征图into non-overlapped semantic clusters formed by points of identical semantic meaning,可以视为点聚合——为每个upsampled点分配一个semantic cluster。在基于kernel的动态upsampling框架中,我们表明upsampled点可以借助其low-res decoder neighbors和高-res encoder point来理解聚合,conditioned on它们之间的相似性。我们因此提出了一种通用的形式化方法,生成相似度感知upsampling kernel,并证明这些kernel不仅激发semantic smoothness,还激发boundary sharpness。这种形式化方法被称为Similarity-Aware Point Affiliation(SAPA)。我们通过我们的初步设计中的窗口形状kernel来示出它的工作机制。在对对象检测 task进行评估后,我们揭示了更多的增强方法,导致SAPA with dynamic kernel shape。广泛的实验表明SAPA比前一代的upsamplers有更好的性能,并在多个 dense prediction task 上具有一致的表现提升,包括semantic segmentation、object detection、instance segmentation、panoptic segmentation、image matting和depth estimation。代码可以在https://github.com/tiny-smart/sapa上获取。
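
A simplified sketch of generating similarity-aware upsampling kernels from a high-resolution encoder query and its low-resolution decoder neighbors; it assumes encoder and decoder features share the same channel dimension and a fixed window, which is a simplification of SAPA rather than its exact design.

```python
import torch
import torch.nn.functional as F

def sapa_upsample(decoder_feat, encoder_feat, window=3):
    """Similarity-aware point affiliation upsampling (simplified sketch).

    decoder_feat : (B, C, h, w)   low-resolution semantic features
    encoder_feat : (B, C, 2h, 2w) high-resolution encoder features
    Each upsampled point builds a kernel from its similarity to the low-res
    neighbors and mixes their features, so it affiliates with the most
    similar semantic cluster.
    """
    B, C, H, W = encoder_feat.shape
    up_dec = F.interpolate(decoder_feat, size=(H, W), mode="nearest")
    # gather low-res neighbors around every high-res location
    neighbors = F.unfold(up_dec, kernel_size=window, padding=window // 2)   # (B, C*k*k, H*W)
    neighbors = neighbors.view(B, C, window * window, H * W)
    query = encoder_feat.view(B, C, 1, H * W)
    kernel = F.softmax((query * neighbors).sum(1) / C ** 0.5, dim=1)        # (B, k*k, H*W)
    out = (neighbors * kernel.unsqueeze(1)).sum(2)                          # (B, C, H*W)
    return out.view(B, C, H, W)
```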

Zero-Shot Image Harmonization with Generative Model Prior

  • paper_url: http://arxiv.org/abs/2307.08182
  • repo_url: https://github.com/windvchen/diff-harmonization
  • paper_authors: Jianqi Chen, Zhengxia Zou, Yilan Zhang, Keyan Chen, Zhenwei Shi
  • for: 本文旨在提出一种零样本图像和谐化方法,无需在大量合成图像上训练。
  • methods: 我们借鉴人类的行为方式,利用预训练的生成模型作为对和谐图像的先验;同时提出了一种注意力约束文本(Attention-Constraint Text)来引导和谐化方向。
  • results: 我们的方法兼具高效性与一致性,并能保持前景内容结构。大量实验证明了方法的有效性,我们还探索了一些有趣的应用场景。
    Abstract Recent image harmonization methods have demonstrated promising results. However, due to their heavy reliance on a large number of composite images, these works are expensive in the training phase and often fail to generalize to unseen images. In this paper, we draw lessons from human behavior and come up with a zero-shot image harmonization method. Specifically, in the harmonization process, a human mainly utilizes his long-term prior on harmonious images and makes a composite image close to that prior. To imitate that, we resort to pretrained generative models for the prior of natural images. For the guidance of the harmonization direction, we propose an Attention-Constraint Text which is optimized to well illustrate the image environments. Some further designs are introduced for preserving the foreground content structure. The resulting framework, highly consistent with human behavior, can achieve harmonious results without burdensome training. Extensive experiments have demonstrated the effectiveness of our approach, and we have also explored some interesting applications.
    摘要 最近的图像和谐化方法已经展示出有前景的结果,但它们严重依赖大量合成图像,训练成本高昂,且常常难以泛化到未见过的图像。在这篇论文中,我们借鉴人类的行为方式,提出了一种零样本图像和谐化方法。具体来说,在和谐化过程中,人类主要依靠其对和谐图像的长期先验,使合成图像尽量贴近这一先验。为了模仿这一过程,我们利用预训练的生成模型提供自然图像的先验。为了指导和谐化方向,我们提出了一种注意力约束文本,并对其进行优化以更好地刻画图像环境。此外,我们还引入了若干保持前景内容结构的设计。最终得到的框架与人类行为高度一致,无需繁重的训练即可获得和谐的结果。大量实验证明了方法的有效性,我们还探索了一些有趣的应用。

Boundary-weighted logit consistency improves calibration of segmentation networks

  • paper_url: http://arxiv.org/abs/2307.08163
  • repo_url: None
  • paper_authors: Neerav Karani, Neel Dey, Polina Golland
  • for: 该论文旨在解决神经网络预测概率与准确率往往仅弱相关(即校准不佳)的问题,以及图像分割训练数据中固有的标签歧义问题。
  • methods: 论文利用随机变换下的 logit 一致性作为空间变化的正则项,防止在标签歧义像素处产生过度自信的预测,并提出其边界加权扩展以改进校准。
  • results: 该方法在前列腺和心脏 MRI 分割任务上取得了最先进的校准效果。
    Abstract Neural network prediction probabilities and accuracy are often only weakly-correlated. Inherent label ambiguity in training data for image segmentation aggravates such miscalibration. We show that logit consistency across stochastic transformations acts as a spatially varying regularizer that prevents overconfident predictions at pixels with ambiguous labels. Our boundary-weighted extension of this regularizer provides state-of-the-art calibration for prostate and heart MRI segmentation.
    摘要 神经网络的预测概率与准确率通常只有弱相关。图像分割训练数据中固有的标签歧义进一步加剧了这种校准不佳的问题。我们表明,随机变换下的 logit 一致性可以作为一种空间变化的正则项,防止在标签歧义的像素处产生过度自信的预测。我们对该正则项的边界加权扩展在前列腺和心脏 MRI 分割上提供了最先进的校准效果。
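
A hedged sketch of a logit-consistency regularizer across a stochastic transformation, weighted more heavily near label boundaries; the choice of transformation (horizontal flip) and the form of the weighting map are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def boundary_weighted_consistency(model, x, boundary_weight):
    """Penalize logit disagreement between an image and its transformed copy,
    weighting pixels near label boundaries more heavily.

    x               : (B, C, H, W) input images
    boundary_weight : (B, 1, H, W) precomputed weights, larger near boundaries
    """
    logits = model(x)
    logits_flip = model(torch.flip(x, dims=[-1]))       # stochastic transform: horizontal flip
    logits_flip = torch.flip(logits_flip, dims=[-1])     # map logits back to the original frame
    diff = F.mse_loss(logits, logits_flip, reduction="none").mean(1, keepdim=True)
    return (boundary_weight * diff).mean()
```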

Self-Attention Based Generative Adversarial Networks For Unsupervised Video Summarization

  • paper_url: http://arxiv.org/abs/2307.08145
  • repo_url: None
  • paper_authors: Maria Nektaria Minaidi, Charilaos Papaioannou, Alexandros Potamianos
  • for: 本文旨在提出一种基于无监督对抗学习的视频摘要生成方法,使生成的摘要具有代表性、难以与原始视频区分。
  • methods: 该方法建立在一种流行的生成对抗网络(GAN)框架之上,在视频帧的选择、编码和解码中引入注意力机制,以建模视频帧之间的时间关系;提出的 SUM-GAN-AED 模型将用于帧选择的自注意力机制与用于编码和解码的 LSTM 相结合。
  • results: 在 SumMe、TVSum 和 COGNIMUSE 数据集上的实验表明,以自注意力机制作为帧选择机制在 SumMe 上超越了现有最优方法,在 TVSum 和 COGNIMUSE 上取得了与现有最优方法相当的表现。
    Abstract In this paper, we study the problem of producing a comprehensive video summary following an unsupervised approach that relies on adversarial learning. We build on a popular method where a Generative Adversarial Network (GAN) is trained to create representative summaries, indistinguishable from the originals. The introduction of the attention mechanism into the architecture for the selection, encoding and decoding of video frames, shows the efficacy of self-attention and transformer in modeling temporal relationships for video summarization. We propose the SUM-GAN-AED model that uses a self-attention mechanism for frame selection, combined with LSTMs for encoding and decoding. We evaluate the performance of the SUM-GAN-AED model on the SumMe, TVSum and COGNIMUSE datasets. Experimental results indicate that using a self-attention mechanism as the frame selection mechanism outperforms the state-of-the-art on SumMe and leads to comparable to state-of-the-art performance on TVSum and COGNIMUSE.
    摘要 在这篇论文中,我们研究了一种不需要监督的视频概要生成方法,基于对抗学习。我们建立在一种受欢迎的方法之上,其中一个生成概要网络(GAN)在创造可信任的概要时进行训练。我们通过将注意力机制引入网络架构中,选择、编码和解码视频帧时,以表明自我注意力和变换器在视频概要模型中的有效性。我们提出了SUM-GAN-AED模型,它使用自我注意力机制来选择帧,并使用LSTM来编码和解码。我们在SumMe、TVSum和COGNIMUSE数据集上评估SUM-GAN-AED模型的性能。实验结果表明,使用自我注意力机制来选择帧比预先的状态对SUMMe数据集表现更好,并且在TVSum和COGNIMUSE数据集上表现相当于预先的状态。

Neural Stream Functions

  • paper_url: http://arxiv.org/abs/2307.08142
  • repo_url: https://github.com/skywolf829/neuralstreamfunction
  • paper_authors: Skylar Wolfgang Wurster, Hanqi Guo, Tom Peterka, Han-Wei Shen
  • for: 这个论文是为了计算流函数的,流函数是一个scalar函数,其梯度与给定的vector field垂直。
  • methods: 这个论文使用神经网络方法来学习流函数,输入是vector field,神经网络会学习将输入坐标映射到流函数值上。
  • results: 这个论文的结果表明,使用神经网络方法可以高效地计算流函数,并且可以根据输入vector field的不同来生成不同的流函数解。此外,论文还提出了一些可选的约束来生成流函数解,以便在流场的拟合中提高计算的精度。
    Abstract We present a neural network approach to compute stream functions, which are scalar functions with gradients orthogonal to a given vector field. As a result, isosurfaces of the stream function extract stream surfaces, which can be visualized to analyze flow features. Our approach takes a vector field as input and trains an implicit neural representation to learn a stream function for that vector field. The network learns to map input coordinates to a stream function value by minimizing the inner product of the gradient of the neural network's output and the vector field. Since stream function solutions may not be unique, we give optional constraints for the network to learn particular stream functions of interest. Specifically, we introduce regularizing loss functions that can optionally be used to generate stream function solutions whose stream surfaces follow the flow field's curvature, or that can learn a stream function that includes a stream surface passing through a seeding rake. We also discuss considerations for properly visualizing the trained implicit network and extracting artifact-free surfaces. We compare our results with other implicit solutions and present qualitative and quantitative results for several synthetic and simulated vector fields.
    摘要 我们提出了一种计算流函数的神经网络方法:流函数是一个标量函数,其梯度与给定向量场正交。因此,流函数的等值面即为流面,可视化后可用于分析流动特征。我们的方法以向量场为输入,训练一个隐式神经表示来学习该向量场的流函数。网络学习将输入坐标映射为流函数值,其训练目标是最小化网络输出的梯度与向量场之间的内积。由于流函数解可能不唯一,我们提供了可选的约束,用于学习特定的目标流函数。具体而言,我们引入了正则化损失函数,可选地用于生成流面贴合流场曲率的流函数解,或学习包含经过种子耙(seeding rake)的流面的流函数解。我们还讨论了如何正确可视化训练得到的隐式网络并提取无伪影的流面。我们与其他隐式解法进行了比较,并在多个合成与仿真向量场上给出了定性与定量结果。
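
The abstract's training objective (minimizing the inner product between the network's gradient and the vector field) can be written directly with automatic differentiation; the sketch below omits the paper's optional regularizers.

```python
import torch

def stream_function_loss(net, coords, vectors):
    """Train an implicit network f so that grad f is orthogonal to the vector field.

    coords  : (N, 3) sample positions
    vectors : (N, 3) vector field values at those positions
    The loss is the squared inner product <grad f(x), v(x)>; regularizers such as
    curvature alignment or seeding-rake constraints are left out of this sketch.
    """
    coords = coords.clone().requires_grad_(True)
    f = net(coords)                                                       # (N, 1) stream function values
    grad = torch.autograd.grad(f.sum(), coords, create_graph=True)[0]     # (N, 3)
    return ((grad * vectors).sum(dim=1) ** 2).mean()
```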

Adaptively Placed Multi-Grid Scene Representation Networks for Large-Scale Data Visualization

  • paper_url: http://arxiv.org/abs/2308.02494
  • repo_url: https://github.com/skywolf829/apmgsrn
  • paper_authors: Skylar Wolfgang Wurster, Tianyu Xiong, Han-Wei Shen, Hanqi Guo, Tom Peterka
  • for: 这篇论文旨在提出一种自适应的场景表示网络(SRN),以便更好地压缩和可视化科学数据。
  • methods: 论文提出了由多个空间自适应特征网格组成的 APMGSRN,并提出了域分解训练与推理技术,以加速多 GPU 系统上的并行训练。
  • results: 所提出的 APMGSRN 架构无需像以往自适应模型那样依赖昂贵的八叉树细化、剪枝和遍历,即可提升 SRN 的重建精度;此外,论文还提供了一个开源的神经体渲染应用,可对任何基于 PyTorch 的 SRN 进行即插即用式渲染。
    Abstract Scene representation networks (SRNs) have been recently proposed for compression and visualization of scientific data. However, state-of-the-art SRNs do not adapt the allocation of available network parameters to the complex features found in scientific data, leading to a loss in reconstruction quality. We address this shortcoming with an adaptively placed multi-grid SRN (APMGSRN) and propose a domain decomposition training and inference technique for accelerated parallel training on multi-GPU systems. We also release an open-source neural volume rendering application that allows plug-and-play rendering with any PyTorch-based SRN. Our proposed APMGSRN architecture uses multiple spatially adaptive feature grids that learn where to be placed within the domain to dynamically allocate more neural network resources where error is high in the volume, improving state-of-the-art reconstruction accuracy of SRNs for scientific data without requiring expensive octree refining, pruning, and traversal like previous adaptive models. In our domain decomposition approach for representing large-scale data, we train an set of APMGSRNs in parallel on separate bricks of the volume to reduce training time while avoiding overhead necessary for an out-of-core solution for volumes too large to fit in GPU memory. After training, the lightweight SRNs are used for realtime neural volume rendering in our open-source renderer, where arbitrary view angles and transfer functions can be explored. A copy of this paper, all code, all models used in our experiments, and all supplemental materials and videos are available at https://github.com/skywolf829/APMGSRN.
    摘要 场景表示网络(SRN)最近被用于科学数据的压缩与可视化。然而,当前最先进的 SRN 不会根据科学数据中的复杂特征来分配可用的网络参数,导致重建质量下降。为了解决这一缺点,我们提出了自适应放置的多网格 SRN(APMGSRN),以及用于多 GPU 系统加速并行训练的域分解训练与推理技术。我们还发布了一个基于 PyTorch 的开源神经体渲染应用,可对任何基于 PyTorch 的 SRN 进行即插即用式渲染。所提出的 APMGSRN 架构使用多个空间自适应特征网格,这些网格自行学习在域内的放置位置,从而在体数据中误差较大的区域动态分配更多的网络资源,提高了 SRN 对科学数据的重建精度,且无需以往自适应模型所需的昂贵八叉树细化、剪枝与遍历。在用于表示大规模数据的域分解方法中,我们将体数据划分为若干块,并在各块上并行训练一组 APMGSRN,以缩短训练时间,同时避免体数据超出 GPU 显存时核外(out-of-core)方案所需的额外开销。训练完成后,轻量级的 SRN 可在我们的开源渲染器中进行实时神经体渲染,支持任意视角与传输函数的探索。论文副本、全部代码、实验所用的全部模型以及所有补充材料和视频均可在 https://github.com/skywolf829/APMGSRN 获取。

GastroVision: A Multi-class Endoscopy Image Dataset for Computer Aided Gastrointestinal Disease Detection

  • paper_url: http://arxiv.org/abs/2307.08140
  • repo_url: https://github.com/debeshjha/gastrovision
  • paper_authors: Debesh Jha, Vanshali Sharma, Neethi Dasu, Nikhil Kumar Tomar, Steven Hicks, M. K. Bhuyan, Pradip K. Das, Michael A. Riegler, Pål Halvorsen, Ulas Bagci, Thomas de Lange
  • for: 该研究旨在提供一个大规模、精确标注的胃肠内镜数据集,用于开发胃肠疾病检测与分类的人工智能(AI)系统。
  • methods: 研究构建了一个多中心、开放获取的胃肠内镜数据集,涵盖不同的解剖标志、病理异常、息肉切除病例和正常发现(共 27 个类别);数据集包含 8,000 幅分别采集自挪威 Bærum 医院和瑞典卡罗琳斯卡大学医院的胃肠内镜图像,并由经验丰富的胃肠内镜医师标注和验证。
  • results: 研究基于流行的深度学习基线模型进行了广泛的基准测试,以验证数据集的价值,并认为该数据集可促进胃肠疾病检测与分类 AI 算法的开发。
    Abstract Integrating real-time artificial intelligence (AI) systems in clinical practices faces challenges such as scalability and acceptance. These challenges include data availability, biased outcomes, data quality, lack of transparency, and underperformance on unseen datasets from different distributions. The scarcity of large-scale, precisely labeled, and diverse datasets are the major challenge for clinical integration. This scarcity is also due to the legal restrictions and extensive manual efforts required for accurate annotations from clinicians. To address these challenges, we present \textit{GastroVision}, a multi-center open-access gastrointestinal (GI) endoscopy dataset that includes different anatomical landmarks, pathological abnormalities, polyp removal cases and normal findings (a total of 27 classes) from the GI tract. The dataset comprises 8,000 images acquired from B{\ae}rum Hospital in Norway and Karolinska University Hospital in Sweden and was annotated and verified by experienced GI endoscopists. Furthermore, we validate the significance of our dataset with extensive benchmarking based on the popular deep learning based baseline models. We believe our dataset can facilitate the development of AI-based algorithms for GI disease detection and classification. Our dataset is available at \url{https://osf.io/84e7f/}.
    摘要 在临床实践中整合实时人工智能(AI)系统面临可扩展性与接受度等挑战,包括数据可得性、结果偏差、数据质量、缺乏透明度,以及在来自不同分布的未见数据集上表现不佳等问题。缺乏大规模、精确标注且多样化的数据集是临床整合的最大障碍;这种稀缺也源于法律限制以及临床医生进行精确标注所需的大量人工工作。为应对这些挑战,我们提出了 GastroVision,一个多中心、开放获取的胃肠内镜数据集,涵盖胃肠道的不同解剖标志、病理异常、息肉切除病例和正常发现(共 27 个类别)。数据集包含 8,000 幅分别采集自挪威 Bærum 医院和瑞典卡罗琳斯卡大学医院的图像,并由经验丰富的胃肠内镜医师标注和验证。此外,我们基于流行的深度学习基线模型进行了广泛的基准测试,以验证该数据集的价值。我们相信该数据集能够促进基于 AI 的胃肠疾病检测与分类算法的开发。数据集可在 https://osf.io/84e7f/ 获取。

Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency

  • paper_url: http://arxiv.org/abs/2307.08123
  • repo_url: None
  • paper_authors: Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, Liyue Shen
  • for: 利用潜在扩散模型求解一般的逆问题(inverse problems)
  • methods: 使用预训练的潜在扩散模型,并结合硬数据一致性(hard data consistency)技术
  • results: 在线性与非线性逆问题上均能重建高质量图像
    Abstract Diffusion models have recently emerged as powerful generative priors for solving inverse problems. However, training diffusion models in the pixel space are both data intensive and computationally demanding, which restricts their applicability as priors in domains such as medical imaging. Latent diffusion models, which operate in a much lower-dimensional space, offer a solution to these challenges. Though, their direct application to solving inverse problems remains an unsolved technical challenge due to the nonlinearity of the encoder and decoder. To address this issue,we propose ReSample, an algorithm that solves general inverse problems with pre-trained latent diffusion models. Our algorithm incorporates data consistency by solving an optimization problem during the reverse sampling process, a concept that we term as hard data consistency. Upon solving this optimization problem, we propose a novel resampling scheme to map the measurement-consistent sample back onto the correct data manifold. Our approach offers both memory efficiency and considerable flexibility in the sense that (1) it can be readily adapted to various inverse problems using the same pre-trained model as it does not assume any fixed forward measurement operator during training, and (2) it can be generalized to different domains by simply fine-tuning the latent diffusion model with a minimal amount of data samples. Our empirical results on both linear and non-linear inverse problems demonstrate that our approach can reconstruct high-quality images even compared to state-of-the-art works that operate in the pixel space.
    摘要 扩散模型最近已成为求解逆问题的强大生成先验。然而,在像素空间中训练扩散模型既需要大量数据又计算昂贵,这限制了它们在医学成像等领域作为先验的应用。潜在扩散模型在维度低得多的空间中运行,可以解决这些挑战;但由于编码器和解码器的非线性,将其直接用于求解逆问题仍是尚未解决的技术难题。为此,我们提出了 ReSample 算法,它可以利用预训练的潜在扩散模型求解一般的逆问题。我们的算法通过在反向采样过程中求解一个优化问题来引入数据一致性,我们称之为硬数据一致性(hard data consistency)。在求解该优化问题之后,我们提出了一种新的重采样方案,将与测量一致的样本映射回正确的数据流形。我们的方法兼具内存效率和高度灵活性:(1)由于训练时不假定任何固定的前向测量算子,同一个预训练模型可以方便地适配各种逆问题;(2)只需用少量数据样本微调潜在扩散模型,即可推广到不同领域。在线性与非线性逆问题上的实验结果表明,即使与在像素空间中运行的最新方法相比,我们的方法也能重建高质量图像。
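
A rough sketch of what a hard-data-consistency step during latent reverse sampling could look like: optimize the current clean-latent estimate against the measurements, then map it back to the current noise level. The optimizer, step counts, and resampling form are assumptions, not the paper's exact procedure.

```python
import torch

def hard_data_consistency(z0_hat, decoder, forward_op, y, steps=20, lr=1e-2):
    """Refine the current clean-latent estimate so that the decoded image agrees
    with the measurements y = A(x)."""
    z = z0_hat.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((forward_op(decoder(z)) - y) ** 2).mean()   # data-fidelity term
        loss.backward()
        opt.step()
    return z.detach()

def resample_to_noise_level(z0_consistent, alpha_bar_t):
    """Map the measurement-consistent latent back onto the diffusion trajectory
    at noise level t (stochastic resampling, simplified)."""
    noise = torch.randn_like(z0_consistent)
    return (alpha_bar_t ** 0.5) * z0_consistent + ((1 - alpha_bar_t) ** 0.5) * noise
```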

Domain Generalisation with Bidirectional Encoder Representations from Vision Transformers

  • paper_url: http://arxiv.org/abs/2307.08117
  • repo_url: https://github.com/sw-packages/d23c4b6afa05094a23071333bd230aceceec08117355003f5c0ea958e60c9c98
  • paper_authors: Hamza Riaz, Alan F. Smeaton
  • for: 这篇论文旨在应用域泛化(Domain Generalization)技术,将知识从源域(Source Domain)迁移到未见的目标域(Target Domain),以扩展深度学习模型的应用范围。
  • methods: 论文使用视觉 Transformer 进行域泛化,并在分布外(OOD)数据上评估了四种不同的视觉 Transformer 架构(ViT、LeViT、DeiT、BEIT)。
  • results: 结果显示,采用图像 Transformer 双向编码表示(BEIT)架构在三个基准(PACS、Home-Office、DomainNet)上取得了显著的验证和测试准确率提升,在未见域上的测试中表现更好。
    Abstract Domain generalisation involves pooling knowledge from source domain(s) into a single model that can generalise to unseen target domain(s). Recent research in domain generalisation has faced challenges when using deep learning models as they interact with data distributions which differ from those they are trained on. Here we perform domain generalisation on out-of-distribution (OOD) vision benchmarks using vision transformers. Initially we examine four vision transformer architectures namely ViT, LeViT, DeiT, and BEIT on out-of-distribution data. As the bidirectional encoder representation from image transformers (BEIT) architecture performs best, we use it in further experiments on three benchmarks PACS, Home-Office and DomainNet. Our results show significant improvements in validation and test accuracy and our implementation significantly overcomes gaps between within-distribution and OOD data.
    摘要 域泛化是将来自源域的知识汇集到一个能够泛化到未见目标域的单一模型中。近年来,当深度学习模型面对与训练数据分布不同的数据时,域泛化研究面临着诸多挑战。在本文中,我们使用视觉 Transformer 在分布外(OOD)视觉基准上进行域泛化。我们首先在分布外数据上考察了四种视觉 Transformer 架构,即 ViT、LeViT、DeiT 和 BEIT。由于图像 Transformer 双向编码表示(BEIT)架构表现最佳,我们进一步在 PACS、Home-Office 和 DomainNet 三个基准上使用它进行实验。结果表明,验证与测试准确率均获得显著提升,我们的实现也在很大程度上缩小了分布内与分布外数据之间的差距。

Polarization Multi-Image Synthesis with Birefringent Metasurfaces

  • paper_url: http://arxiv.org/abs/2307.08106
  • repo_url: https://github.com/deanhazineh/multi-image-synthesis
  • paper_authors: Dean Hazineh, Soon Wei Daniel Lim, Qi Guo, Federico Capasso, Todd Zickler
  • for: The paper addresses incoherent opto-electronic filtering, a new application of optical metasurfaces in computational imaging systems, and demonstrates a system that uses a birefringent metasurface with a polarizer-mosaicked photosensor to capture four optically-coded measurements in a single exposure.
  • methods: Digital spatial-filtering operations are replaced by simple per-pixel sums across the four polarization channels, independent of the spatial filter size; to find a metasurface realizing a set of user-specified spatial filters, the paper introduces a form of gradient descent with a novel regularizer that encourages light efficiency and a high signal-to-noise ratio.
  • results: The paper demonstrates several examples in simulation and with fabricated prototypes, including spatial filters with prescribed variations with respect to depth and wavelength.
    Abstract Optical metasurfaces composed of precisely engineered nanostructures have gained significant attention for their ability to manipulate light and implement distinct functionalities based on the properties of the incident field. Computational imaging systems have started harnessing this capability to produce sets of coded measurements that benefit certain tasks when paired with digital post-processing. Inspired by these works, we introduce a new system that uses a birefringent metasurface with a polarizer-mosaicked photosensor to capture four optically-coded measurements in a single exposure. We apply this system to the task of incoherent opto-electronic filtering, where digital spatial-filtering operations are replaced by simpler, per-pixel sums across the four polarization channels, independent of the spatial filter size. In contrast to previous work on incoherent opto-electronic filtering that can realize only one spatial filter, our approach can realize a continuous family of filters from a single capture, with filters being selected from the family by adjusting the post-capture digital summation weights. To find a metasurface that can realize a set of user-specified spatial filters, we introduce a form of gradient descent with a novel regularizer that encourages light efficiency and a high signal-to-noise ratio. We demonstrate several examples in simulation and with fabricated prototypes, including some with spatial filters that have prescribed variations with respect to depth and wavelength. Visit the Project Page at https://deanhazineh.github.io/publications/Multi_Image_Synthesis/MIS_Home.html
    摘要 由精心设计的纳米结构组成的光学超表面,因其能够根据入射光场的特性操控光线并实现不同的功能而受到广泛关注。计算成像系统已开始利用这一能力产生一组编码测量,配合数字后处理可使特定任务受益。受这些工作的启发,我们提出了一个新系统,它使用双折射超表面和偏振拼接(polarizer-mosaicked)感光器,在单次曝光中捕获四个经过光学编码的测量。我们将该系统应用于非相干光电滤波任务:数字空间滤波操作被替换为在四个偏振通道上进行的简单逐像素求和,且与空间滤波器的尺寸无关。与以往只能实现单一空间滤波器的非相干光电滤波工作不同,我们的方法可以从单次拍摄中实现一个连续的滤波器族,只需调整拍摄后的数字求和权重即可从该族中选择滤波器。为了找到能够实现一组用户指定空间滤波器的超表面,我们引入了一种带有新型正则项的梯度下降方法,以鼓励高光效率和高信噪比。我们在仿真和制备的原型上展示了多个示例,其中一些空间滤波器具有随深度和波长的指定变化。项目主页:https://deanhazineh.github.io/publications/Multi_Image_Synthesis/MIS_Home.html
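
The post-capture synthesis reduces to a per-pixel weighted sum over the four polarization channels; a minimal sketch is shown below (the weights and array shapes are illustrative, not values from the paper).

```python
import numpy as np

def synthesize_filtered_image(channels, weights):
    """Combine the four optically-coded polarization measurements with per-channel
    digital weights; changing `weights` selects a different spatial filter from the
    family realized by the metasurface.

    channels : (4, H, W) measurements from the polarizer-mosaicked sensor
    weights  : (4,) summation weights chosen after capture
    """
    return np.tensordot(weights, channels, axes=1)   # (H, W): one multiply-add per pixel

measurements = np.random.rand(4, 256, 256)           # stand-in for a real capture
image_a = synthesize_filtered_image(measurements, np.array([1.0, -1.0, 0.5, 0.5]))
```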

FourierHandFlow: Neural 4D Hand Representation Using Fourier Query Flow

  • paper_url: http://arxiv.org/abs/2307.08100
  • repo_url: None
  • paper_authors: Jihyun Lee, Junbong Jang, Donghwan Kim, Minhyuk Sung, Tae-Kyun Kim
  • for: 本研究旨在学习RGB视频中的人手四维形态,以实现高效精准的人手重建和动作估计。
  • methods: 本方法用傅里叶级数表示查询流(query flow),并将 3D 占据场与关节感知的查询流相结合,以实现对手部四维形状的精准重建。
  • results: 实验表明,本方法在基于视频的 4D 重建上取得了最先进的结果,同时比现有的 3D/4D 隐式形状表示计算效率更高;此外,学到的隐式形状对应关系还可用于运动插值/外推和纹理迁移。
    Abstract Recent 4D shape representations model continuous temporal evolution of implicit shapes by (1) learning query flows without leveraging shape and articulation priors or (2) decoding shape occupancies separately for each time value. Thus, they do not effectively capture implicit correspondences between articulated shapes or regularize jittery temporal deformations. In this work, we present FourierHandFlow, which is a spatio-temporally continuous representation for human hands that combines a 3D occupancy field with articulation-aware query flows represented as Fourier series. Given an input RGB sequence, we aim to learn a fixed number of Fourier coefficients for each query flow to guarantee smooth and continuous temporal shape dynamics. To effectively model spatio-temporal deformations of articulated hands, we compose our 4D representation based on two types of Fourier query flow: (1) pose flow that models query dynamics influenced by hand articulation changes via implicit linear blend skinning and (2) shape flow that models query-wise displacement flow. In the experiments, our method achieves state-of-the-art results on video-based 4D reconstruction while being computationally more efficient than the existing 3D/4D implicit shape representations. We additionally show our results on motion inter- and extrapolation and texture transfer using the learned correspondences of implicit shapes. To the best of our knowledge, FourierHandFlow is the first neural 4D continuous hand representation learned from RGB videos. The code will be publicly accessible.
    摘要 最近的4D形态表示模型连续时间演化的隐式形态 by (1) 学习无关形态和肢体约束的查询流或 (2) 分解每个时间值的形态占用。因此,它们不能有效地捕捉隐式相关性 между 动体形态或正则化颤动幅。在这种工作中,我们提出了FourierHandFlow,它是一种包含3D占用场和形态相关的查询流 Fourier系列的四维表示。给输入的RGB序列,我们希望学习固定数量的Fourier系数来保证平滑和连续的时间形态动态。为了有效地模型动体形态的空间时间变换,我们将我们的4D表示分为两种类型的查询流:(1)pose流,它模型查询动态受到手肢变化的影响via隐式线性混合皮肤和(2)形态流,它模型查询点 wise的移动流。在实验中,我们的方法实现了视频基于4D重建的状态对齐的结果,而且与现有的3D/4D隐式形态表示更加计算效率。我们还展示了使用学习的隐式形态对应关系进行动作间隔和外部逼近,以及纹理传输。根据我们所知,FourierHandFlow是首次由RGB视频学习的神经网络4D连续手表示。代码将公开访问。
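
A small sketch of evaluating a query flow represented as a truncated Fourier series over normalized time; the coefficient layout and normalization below are assumptions about how such a representation could be organized, not FourierHandFlow's exact parameterization.

```python
import math
import torch

def fourier_query_flow(coeffs, t):
    """Evaluate a query flow represented by a truncated Fourier series.

    coeffs : (Q, 3, 2K+1) per-query coefficients [a_0, a_1..a_K, b_1..b_K] per axis
    t      : (T,) normalized times in [0, 1]
    Returns (Q, T, 3) smooth, temporally continuous displacements for each query.
    """
    Q, D, C = coeffs.shape
    K = (C - 1) // 2
    k = torch.arange(1, K + 1, dtype=t.dtype)                       # harmonic indices
    angles = 2 * math.pi * k[None, :] * t[:, None]                  # (T, K)
    basis = torch.cat(
        [torch.ones(len(t), 1, dtype=t.dtype), torch.cos(angles), torch.sin(angles)],
        dim=1,
    )                                                               # (T, 2K+1)
    return torch.einsum("qdc,tc->qtd", coeffs, basis)
```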

CalibNet: Dual-branch Cross-modal Calibration for RGB-D Salient Instance Segmentation

  • paper_url: http://arxiv.org/abs/2307.08098
  • repo_url: https://github.com/pjlallen/calibnet
  • paper_authors: Jialun Pei, Tao Jiang, He Tang, Nian Liu, Yueming Jin, Deng-Ping Fan, Pheng-Ann Heng
  • for: 这篇论文针对 RGB-D 图像中的显著实例分割问题,提出了一种基于双分支跨模态特征校准架构的新方法 CalibNet。
  • methods: 该方法使用三个简单模块:动态交互核(DIK)、权重共享融合(WSF)与深度相似度评估(DSA),三者协同工作,以生成有效的实例感知核并融合跨模态特征。
  • results: 对于三个挑战性评价标准,该方法实现了出色的结果,即COME15K-N测试集上的AP为58.0%,比替代方案更高。
    Abstract We propose a novel approach for RGB-D salient instance segmentation using a dual-branch cross-modal feature calibration architecture called CalibNet. Our method simultaneously calibrates depth and RGB features in the kernel and mask branches to generate instance-aware kernels and mask features. CalibNet consists of three simple modules, a dynamic interactive kernel (DIK) and a weight-sharing fusion (WSF), which work together to generate effective instance-aware kernels and integrate cross-modal features. To improve the quality of depth features, we incorporate a depth similarity assessment (DSA) module prior to DIK and WSF. In addition, we further contribute a new DSIS dataset, which contains 1,940 images with elaborate instance-level annotations. Extensive experiments on three challenging benchmarks show that CalibNet yields a promising result, i.e., 58.0% AP with 320*480 input size on the COME15K-N test set, which significantly surpasses the alternative frameworks. Our code and dataset are available at: https://github.com/PJLallen/CalibNet.
    摘要 我们提出了一种新的RGB-D突出实例分割方法,基于双极分支交叉模式特征均衡架构,即CalibNet。我们的方法同时均衡了深度和RGB特征在核心和面罩分支中,以生成实例相关的核心和面罩特征。CalibNet由三个简单模块组成:动态互动核心(DIK)、重量共享融合(WSF)以及深度相似评估(DSA)模块。这些模块共同工作,以生成有效的实例相关核心和融合交叉特征。此外,我们还提供了一个新的DSIS数据集,包含1940张图像,每张图像均有详细的实例级别注解。我们的实验表明,CalibNet在三个挑战性的benchmark上实现了优秀的结果,即COME15K-N测试集上的58.0% AP值,与其他框架相比有显著提高。我们的代码和数据集可以在:https://github.com/PJLallen/CalibNet上获取。

Semi-DETR: Semi-Supervised Object Detection with Detection Transformers

  • paper_url: http://arxiv.org/abs/2307.08095
  • repo_url: https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/semi_det/semi_detr
  • paper_authors: Jiacheng Zhang, Xiangru Lin, Wei Zhang, Kuo Wang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li
  • for: semi-supervised object detection (SSOD)
  • methods: DETR-based framework with Stage-wise Hybrid Matching strategy and Crossview Query Consistency method
  • results: outperforms all state-of-the-art methods by clear margins on all SSOD settings of both COCO and Pascal VOC benchmark datasets.
    Abstract We analyze the DETR-based framework on semi-supervised object detection (SSOD) and observe that (1) the one-to-one assignment strategy generates incorrect matching when the pseudo ground-truth bounding box is inaccurate, leading to training inefficiency; (2) DETR-based detectors lack deterministic correspondence between the input query and its prediction output, which hinders the applicability of the consistency-based regularization widely used in current SSOD methods. We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector, to tackle these problems. Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage. Besides, we introduce a Crossview Query Consistency method to learn the semantic feature invariance of object queries from different views while avoiding the need to find deterministic query correspondence. Furthermore, we propose a Cost-based Pseudo Label Mining module to dynamically mine more pseudo boxes based on the matching cost of pseudo ground truth bounding boxes for consistency training. Extensive experiments on all SSOD settings of both COCO and Pascal VOC benchmark datasets show that our Semi-DETR method outperforms all state-of-the-art methods by clear margins. The PaddlePaddle version code1 is at https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/semi_det/semi_detr.
    摘要 我们分析基于DETR的框架在半指导下的物体检测(SSOD)中,发现了两个问题:(1)一对一对应策略可能导致训练不精确,因为伪真的 bounding box 精度不高;(2)基于DETR的检测器缺乏对输入查询与其预测输出之间的决定性对匹配,这限制了现有的一般SSOD方法中的一致性基础训练的应用。我们提出了半DETR,第一个基于transformer的端到端半指导物体检测器,以解决这些问题。具体来说,我们提出了阶段匹配策略,让一个查询与多个预测结果之间进行匹配,从而提高了训练的效率,并且为第二阶段的训练提供高质量的伪标签。此外,我们引入了跨观查询内容一致性方法,以学习查询从不同观点的物体特征内在性,而不需要寻找决定性的查询匹配。最后,我们提出了一个基于成本的伪标签采矿模组,以动态地采矿更多的伪标签,以便实现一致性训练。实验结果显示,我们的半DETR方法在所有SSOD设定下,都比所有现有方法优化了明显。PaddlePaddle版本代码可以在以下链接中找到:https://github.com/PaddlePaddle/PaddleDetection/tree/develop/configs/semi_det/semi_detr。

Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections

  • paper_url: http://arxiv.org/abs/2307.08093
  • repo_url: https://github.com/yifyang993/cr-nerf-pytorch
  • paper_authors: Yifan Yang, Shuhai Zhang, Zixiong Huang, Yubing Zhang, Mingkui Tan
  • for: 用于从无约束图像集合中合成无遮挡的新视角,应对外观动态变化与瞬态物体等挑战。
  • methods: 提出 Cross-Ray NeRF(CR-NeRF)方法,利用多条光线之间的交互信息来建模变化的外观,并通过融合光线特征协方差与图像外观等全局统计量来恢复外观;此外,还提出了瞬态物体处理器和网格采样策略,以避免瞬态物体造成的遮挡。
  • results: 大量实验验证了 CR-NeRF 的有效性,即使存在外观动态变化和瞬态物体,也能合成与输入图像外观一致的高质量新视角。
    Abstract Neural Radiance Fields (NeRF) is a revolutionary approach for rendering scenes by sampling a single ray per pixel and it has demonstrated impressive capabilities in novel-view synthesis from static scene images. However, in practice, we usually need to recover NeRF from unconstrained image collections, which poses two challenges: 1) the images often have dynamic changes in appearance because of different capturing time and camera settings; 2) the images may contain transient objects such as humans and cars, leading to occlusion and ghosting artifacts. Conventional approaches seek to address these challenges by locally utilizing a single ray to synthesize a color of a pixel. In contrast, humans typically perceive appearance and objects by globally utilizing information across multiple pixels. To mimic the perception process of humans, in this paper, we propose Cross-Ray NeRF (CR-NeRF) that leverages interactive information across multiple rays to synthesize occlusion-free novel views with the same appearances as the images. Specifically, to model varying appearances, we first propose to represent multiple rays with a novel cross-ray feature and then recover the appearance by fusing global statistics, i.e., feature covariance of the rays and the image appearance. Moreover, to avoid occlusion introduced by transient objects, we propose a transient objects handler and introduce a grid sampling strategy for masking out the transient objects. We theoretically find that leveraging correlation across multiple rays promotes capturing more global information. Moreover, extensive experimental results on large real-world datasets verify the effectiveness of CR-NeRF.
    摘要 neural radiance fields (nerf) 是一种革命性的方法,通过每个像素只采样一个光线来渲染场景,并在不同视图 synthesis 中显示出优异的能力。然而,在实际应用中,我们通常需要从无结构图像集中恢复 nerf,这两个挑战:1)图像经常具有不同拍摄时间和摄像机设置导致的变化的外观; 2)图像可能包含过渡性的对象,如人和车辆,导致干扰和幻影 artifacts。传统的方法通过地方使用单个光线来synthesize 一个像素的颜色来解决这些挑战。与人类的感知过程不同,我们在这篇论文中提出了跨光线 nerf (cr-nerf),它利用多个光线之间的互动信息来synthesize 干扰和幻影 artifacts 自由的新视图,同时保持与图像的外观一致。为了模型不同的外观,我们首先提出了一种新的跨光线特征表示,然后通过拼接全局统计,即光线特征相关矩阵和图像外观,来回归出现在图像中的外观。此外,我们还提出了一种过渡性对象处理器,并引入了网格采样策略,以masking 出过渡性对象。我们理论上发现,通过多个光线之间的互动信息,可以更好地捕捉全局信息。此外,我们在大量实际数据上进行了广泛的实验,并证明了cr-nerf的效果。
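
A minimal sketch of the kind of global statistic the abstract mentions, namely the covariance of features gathered across multiple rays; how CR-NeRF actually fuses this statistic with image appearance is not reproduced here.

```python
import torch

def ray_feature_statistics(ray_feats):
    """Summarize a batch of ray features with global statistics (mean and covariance),
    mimicking the idea of recovering appearance from information shared across
    multiple rays rather than from a single ray.

    ray_feats : (R, C) features extracted from R sampled rays
    Returns the mean (C,) and covariance (C, C) of the ray features.
    """
    mean = ray_feats.mean(dim=0)
    centered = ray_feats - mean
    cov = centered.t() @ centered / max(ray_feats.shape[0] - 1, 1)
    return mean, cov
```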

Gait Data Augmentation using Physics-Based Biomechanical Simulation

  • paper_url: http://arxiv.org/abs/2307.08092
  • repo_url: None
  • paper_authors: Mritula Chandrasekaran, Jarek Francik, Dimitrios Makris
  • for: Addressing the problem of data scarcity for gait analysis
  • methods: Using OpenSIM, a physics-based simulator, to synthesize biomechanically plausible walking sequences for gait data augmentation
  • results: Improved performance of model-based gait classifiers and state-of-the-art results for gait-based person identification, with an accuracy of up to 96.11% on the CASIA-B dataset.
    Abstract This paper focuses on addressing the problem of data scarcity for gait analysis. Standard augmentation methods may produce gait sequences that are not consistent with the biomechanical constraints of human walking. To address this issue, we propose a novel framework for gait data augmentation by using OpenSIM, a physics-based simulator, to synthesize biomechanically plausible walking sequences. The proposed approach is validated by augmenting the WBDS and CASIA-B datasets and then training gait-based classifiers for 3D gender gait classification and 2D gait person identification respectively. Experimental results indicate that our augmentation approach can improve the performance of model-based gait classifiers and deliver state-of-the-art results for gait-based person identification with an accuracy of up to 96.11% on the CASIA-B dataset.
    摘要 本文旨在解决步态分析中数据稀缺的问题。标准的数据增强方法可能会产生不符合人类行走生物力学约束的步态序列。为此,我们提出了一种新的步态数据增强框架,利用基于物理的仿真器 OpenSIM 合成符合生物力学的行走序列。我们通过增强 WBDS 和 CASIA-B 数据集,并分别训练用于 3D 性别步态分类和 2D 步态身份识别的分类器来验证该方法。实验结果表明,所提出的增强方法能够提升基于模型的步态分类器的性能,并在 CASIA-B 数据集上取得最高 96.11% 的步态身份识别准确率,达到当前最优水平。

Untrained neural network embedded Fourier phase retrieval from few measurements

  • paper_url: http://arxiv.org/abs/2307.08717
  • repo_url: https://github.com/liyuan-2000/trad
  • paper_authors: Liyuan Ma, Hongxia Wang, Ningyi Leng, Ziyang Yuan
  • for: 这篇论文旨在解决傅里叶相位恢复(FPR)问题,即从无相位的傅里叶测量中重建未知信号。
  • methods: 论文提出了一种嵌入未训练神经网络(NN)的算法,构建在交替方向乘子法(ADMM)框架之上,用于在测量较少时求解 FPR 问题。算法使用生成网络表示待重建图像,从而将图像约束在网络结构所定义的空间内;并引入全变分(TV)正则化,以更好地恢复图像中的局部结构。
  • results: 实验结果表明,所提算法在消耗更少计算资源的情况下优于现有的基于未训练 NN 的算法,甚至与基于已训练 NN 的算法相比也具有竞争力。
    Abstract Fourier phase retrieval (FPR) is a challenging task widely used in various applications. It involves recovering an unknown signal from its Fourier phaseless measurements. FPR with few measurements is important for reducing time and hardware costs, but it suffers from serious ill-posedness. Recently, untrained neural networks have offered new approaches by introducing learned priors to alleviate the ill-posedness without requiring any external data. However, they may not be ideal for reconstructing fine details in images and can be computationally expensive. This paper proposes an untrained neural network (NN) embedded algorithm based on the alternating direction method of multipliers (ADMM) framework to solve FPR with few measurements. Specifically, we use a generative network to represent the image to be recovered, which confines the image to the space defined by the network structure. To improve the ability to represent high-frequency information, total variation (TV) regularization is imposed to facilitate the recovery of local structures in the image. Furthermore, to reduce the computational cost mainly caused by the parameter updates of the untrained NN, we develop an accelerated algorithm that adaptively trades off between explicit and implicit regularization. Experimental results indicate that the proposed algorithm outperforms existing untrained NN-based algorithms with fewer computational resources and even performs competitively against trained NN-based algorithms.
    摘要 傅里叶相位恢复(FPR)是一项应用广泛且具有挑战性的任务,其目标是从无相位的傅里叶测量中恢复未知信号。在测量较少的情况下求解 FPR 对降低时间和硬件成本十分重要,但会面临严重的病态性问题。最近,未训练神经网络通过引入学习到的先验来缓解病态性,提供了无需任何外部数据的新途径;然而,它们可能不适合重建图像中的精细细节,且计算代价较高。本文在交替方向乘子法(ADMM)框架下提出了一种嵌入未训练神经网络(NN)的算法,用于在测量较少时求解 FPR。具体来说,我们使用生成网络来表示待恢复的图像,将图像约束在网络结构所定义的空间内。为了提升对高频信息的表示能力,我们引入全变分(TV)正则化,以促进图像中局部结构的恢复。此外,为了降低主要由未训练 NN 参数更新带来的计算开销,我们提出了一种在显式与隐式正则化之间自适应权衡的加速算法。实验结果表明,所提算法在消耗更少计算资源的情况下优于现有的基于未训练 NN 的算法,甚至与基于已训练 NN 的算法相比也具有竞争力。
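
A heavily simplified sketch of fitting an untrained generator to phaseless Fourier magnitudes with a total-variation penalty; it uses plain gradient descent rather than the paper's accelerated ADMM scheme, so it is illustrative only.

```python
import torch

def total_variation(img):
    """Anisotropic total variation of a (1, 1, H, W) image."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    return dh + dw

def fit_untrained_prior(net, z, magnitudes, tv_weight=1e-2, iters=2000, lr=1e-3):
    """Recover an image from Fourier magnitude measurements using an untrained
    generator as the prior.

    net        : untrained generative network mapping a fixed latent z to a (1, 1, H, W) image
    magnitudes : (H, W) observed |F(x)| measurements
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        x = net(z)
        pred_mag = torch.fft.fft2(x.squeeze()).abs()      # predicted Fourier magnitudes
        loss = ((pred_mag - magnitudes) ** 2).mean() + tv_weight * total_variation(x)
        loss.backward()
        opt.step()
    return net(z).detach()
```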