cs.CV - 2023-09-07

S-Adapter: Generalizing Vision Transformer for Face Anti-Spoofing with Statistical Tokens

  • paper_url: http://arxiv.org/abs/2309.04038
  • repo_url: None
  • paper_authors: Rizhao Cai, Zitong Yu, Chenqi Kong, Haoliang Li, Changsheng Chen, Yongjian Hu, Alex Kot
  • For: Detecting malicious presentation (spoofing) attacks against face recognition systems (Face Anti-Spoofing, FAS).
  • Methods: Adapts pre-trained Vision Transformer models under the Efficient Parameter Transfer Learning (EPTL) paradigm, inserting adapter modules during training while keeping the pre-trained parameters fixed, so that spoofing detection generalizes across domains.
  • Results: The proposed Statistical Adapter (S-Adapter) combined with Token Style Regularization (TSR) improves both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmarks.
    Abstract Face Anti-Spoofing (FAS) aims to detect malicious attempts to invade a face recognition system by presenting spoofed faces. State-of-the-art FAS techniques predominantly rely on deep learning models but their cross-domain generalization capabilities are often hindered by the domain shift problem, which arises due to different distributions between training and testing data. In this study, we develop a generalized FAS method under the Efficient Parameter Transfer Learning (EPTL) paradigm, where we adapt the pre-trained Vision Transformer models for the FAS task. During training, the adapter modules are inserted into the pre-trained ViT model, and the adapters are updated while other pre-trained parameters remain fixed. We find the limitations of previous vanilla adapters in that they are based on linear layers, which lack a spoofing-aware inductive bias and thus restrict the cross-domain generalization. To address this limitation and achieve cross-domain generalized FAS, we propose a novel Statistical Adapter (S-Adapter) that gathers local discriminative and statistical information from localized token histograms. To further improve the generalization of the statistical tokens, we propose a novel Token Style Regularization (TSR), which aims to reduce domain style variance by regularizing Gram matrices extracted from tokens across different domains. Our experimental results demonstrate that our proposed S-Adapter and TSR provide significant benefits in both zero-shot and few-shot cross-domain testing, outperforming state-of-the-art methods on several benchmark tests. We will release the source code upon acceptance.
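  • Code sketch: The Token Style Regularization idea above, regularizing Gram matrices of token features across domains, can be illustrated with a short sketch. This is a minimal, hypothetical implementation assuming ViT tokens of shape (batch, num_tokens, dim) and known domain labels; it is not the paper's released code.

```python
import torch

def gram_matrix(tokens: torch.Tensor) -> torch.Tensor:
    """Channel-wise Gram matrix of transformer tokens; returns shape (B, D, D)."""
    b, n, d = tokens.shape
    feats = tokens.transpose(1, 2)               # (B, D, N)
    return feats @ feats.transpose(1, 2) / n     # (B, D, D)

def token_style_regularization(tokens: torch.Tensor, domains: torch.Tensor) -> torch.Tensor:
    """Penalize the variance of per-domain mean Gram matrices (a proxy for style variance)."""
    grams = gram_matrix(tokens)                                        # (B, D, D)
    domain_means = [grams[domains == d].mean(dim=0) for d in domains.unique()]
    stacked = torch.stack(domain_means)                                # (num_domains, D, D)
    return ((stacked - stacked.mean(dim=0)) ** 2).mean()

# usage: add lambda_tsr * token_style_regularization(tokens, domain_ids) to the training loss
tokens = torch.randn(8, 196, 64)           # dummy ViT tokens
domain_ids = torch.randint(0, 3, (8,))     # dummy domain labels
loss_tsr = token_style_regularization(tokens, domain_ids)
```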

Algebra and Geometry of Camera Resectioning

  • paper_url: http://arxiv.org/abs/2309.04028
  • repo_url: None
  • paper_authors: Erin Connelly, Timothy Duff, Jessie Loucks-Tavitas
  • for: Study of the algebraic varieties associated with the camera resectioning problem.
  • methods: Uses Gröbner basis techniques to characterize the multigraded vanishing ideals of these resectioning varieties.
  • results: Derivation and re-interpretation of well-known results in geometric computer vision related to camera-point duality, clarification of relationships between the classical problems of optimal resectioning and triangulation, and a conjectured formula for the Euclidean distance degree of the resectioning variety.
    Abstract We study algebraic varieties associated with the camera resectioning problem. We characterize these resectioning varieties' multigraded vanishing ideals using Gr\"obner basis techniques. As an application, we derive and re-interpret celebrated results in geometric computer vision related to camera-point duality. We also clarify some relationships between the classical problems of optimal resectioning and triangulation, state a conjectural formula for the Euclidean distance degree of the resectioning variety, and discuss how this conjecture relates to the recently-resolved multiview conjecture.
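  • Formulation sketch: For readers unfamiliar with the setup, the textbook statement of camera resectioning (not the paper's algebraic notation) is: given world points and their images, recover the camera matrix.

```latex
% Standard camera resectioning setup (textbook formulation):
% given world points X_i \in \mathbb{P}^3 and image points x_i \in \mathbb{P}^2,
% find a camera matrix P \in \mathbb{R}^{3 \times 4} and scales \lambda_i \neq 0 with
\lambda_i \, x_i = P X_i, \qquad i = 1, \dots, n.
% Roughly speaking, the resectioning varieties studied in the paper are closures of the
% sets of point configurations (x_1,\dots,x_n; X_1,\dots,X_n) admitting such a camera P.
```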

Improving the Accuracy of Beauty Product Recommendations by Assessing Face Illumination Quality

  • paper_url: http://arxiv.org/abs/2309.04022
  • repo_url: None
  • paper_authors: Parnian Afshar, Jenny Yeon, Andriy Levitskyy, Rahul Suresh, Amin Banitalebi-Dehkordi
  • for: Addresses the challenges in responsible beauty product recommendation, particularly when it involves comparing a product's color with a person's skin tone, such as for foundation and concealer products.
  • methods: We introduce a machine learning framework for illumination assessment which classifies images into having either good or bad illumination condition. We then build an automatic user guidance tool which informs a user holding their camera if their illumination condition is good or bad.
  • results: Our work improves the shade recommendation for various foundation products by using a diverse synthetic dataset and a Convolutional Neural Network (CNN) for illumination assessment.
    Abstract We focus on addressing the challenges in responsible beauty product recommendation, particularly when it involves comparing the product's color with a person's skin tone, such as for foundation and concealer products. To make accurate recommendations, it is crucial to infer both the product attributes and the product-specific facial features such as skin conditions or tone. However, while many product photos are taken under good light conditions, face photos are taken from a wide range of conditions. The features extracted from photos taken in an ill-illuminated environment can be highly misleading or even incompatible with the product attributes. Hence, bad illumination conditions can severely degrade the quality of the recommendation. We introduce a machine learning framework for illumination assessment which classifies images into having either good or bad illumination condition. We then build an automatic user guidance tool which informs a user holding their camera if their illumination condition is good or bad. This way, the user is provided with rapid feedback and can interactively control how the photo is taken for their recommendation. Only a few studies are dedicated to this problem, mostly due to the lack of a dataset that is large, labeled, and diverse both in terms of skin tones and light patterns. The lack of such a dataset leads to neglecting skin tone diversity. Therefore, we begin by constructing a diverse synthetic dataset that simulates various skin tones and light patterns in addition to an existing facial image dataset. Next, we train a Convolutional Neural Network (CNN) for illumination assessment that outperforms the existing solutions using the synthetic dataset. Finally, we analyze how our work improves the shade recommendation for various foundation products.
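  • Code sketch: A minimal sketch of the kind of CNN-based good/bad illumination classifier described above, assuming RGB face crops and binary labels; the architecture, input size, and training details are illustrative only, not the authors' model.

```python
import torch
import torch.nn as nn

class IlluminationClassifier(nn.Module):
    """Tiny CNN that scores an image as good (1) or bad (0) illumination."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x).flatten(1)).squeeze(1)  # logits

model = IlluminationClassifier()
images = torch.randn(4, 3, 128, 128)                  # dummy face crops
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])           # 1 = good illumination
loss = nn.BCEWithLogitsLoss()(model(images), labels)  # train with this loss
```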

Multimodal Transformer for Material Segmentation

  • paper_url: http://arxiv.org/abs/2309.04001
  • repo_url: None
  • paper_authors: Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif
  • for: Proposes a novel multimodal fusion strategy to improve performance on multimodal segmentation tasks.
  • methods: Introduces the Multi-Modal Segmentation Transformer (MMSFormer), which incorporates the proposed fusion strategy and can effectively fuse information from different combinations of four modalities: RGB, Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP), and Near-Infrared (NIR).
  • results: On the Multimodal Material Segmentation (MCubeS) dataset, MMSFormer achieves 52.05% mIoU, outperforming the current state of the art; for example, it provides significant improvements in detecting the gravel (+10.4%) and human (+9.1%) classes.
    Abstract Leveraging information across diverse modalities is known to enhance performance on multimodal segmentation tasks. However, effectively fusing information from different modalities remains challenging due to the unique characteristics of each modality. In this paper, we propose a novel fusion strategy that can effectively fuse information from different combinations of four different modalities: RGB, Angle of Linear Polarization (AoLP), Degree of Linear Polarization (DoLP) and Near-Infrared (NIR). We also propose a new model named Multi-Modal Segmentation Transformer (MMSFormer) that incorporates the proposed fusion strategy to perform multimodal material segmentation. MMSFormer achieves 52.05% mIoU outperforming the current state-of-the-art on Multimodal Material Segmentation (MCubeS) dataset. For instance, our method provides significant improvement in detecting gravel (+10.4%) and human (+9.1%) classes. Ablation studies show that different modules in the fusion block are crucial for overall model performance. Furthermore, our ablation studies also highlight the capacity of different input modalities to improve performance in the identification of different types of materials. The code and pretrained models will be made available at https://github.com/csiplab/MMSFormer.
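  • Code sketch: As a reminder of how the reported metric is computed, here is a small per-class IoU / mIoU sketch over integer label maps; this is a generic implementation, independent of MMSFormer.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 20, size=(512, 512))    # dummy material predictions
target = np.random.randint(0, 20, size=(512, 512))  # dummy ground truth
print(f"mIoU: {100 * mean_iou(pred, target, num_classes=20):.2f}%")
```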

Adapting Self-Supervised Representations to Multi-Domain Setups

  • paper_url: http://arxiv.org/abs/2309.03999
  • repo_url: None
  • paper_authors: Neha Kalibhat, Sam Sharpe, Jeremy Goodsitt, Bayan Bruss, Soheil Feizi
  • for: Improving the generalization of self-supervised visual representations in multi-domain setups.
  • methods: Proposes a general-purpose, lightweight Domain Disentanglement Module (DDM) that can be plugged into any self-supervised encoder to perform representation learning on multiple, diverse domains. During pre-training, DDM splits the representation space into a domain-variant and a domain-invariant portion; when domain labels are unavailable, it discovers pseudo-domains with a robust clustering approach.
  • results: Compared to baselines, pre-training with DDM improves linear probing accuracy by up to 3.5% on state-of-the-art self-supervised models and improves generalization to unseen domains by 7.4% on multi-domain benchmarks.
    Abstract Current state-of-the-art self-supervised approaches, are effective when trained on individual domains but show limited generalization on unseen domains. We observe that these models poorly generalize even when trained on a mixture of domains, making them unsuitable to be deployed under diverse real-world setups. We therefore propose a general-purpose, lightweight Domain Disentanglement Module (DDM) that can be plugged into any self-supervised encoder to effectively perform representation learning on multiple, diverse domains with or without shared classes. During pre-training according to a self-supervised loss, DDM enforces a disentanglement in the representation space by splitting it into a domain-variant and a domain-invariant portion. When domain labels are not available, DDM uses a robust clustering approach to discover pseudo-domains. We show that pre-training with DDM can show up to 3.5% improvement in linear probing accuracy on state-of-the-art self-supervised models including SimCLR, MoCo, BYOL, DINO, SimSiam and Barlow Twins on multi-domain benchmarks including PACS, DomainNet and WILDS. Models trained with DDM show significantly improved generalization (7.4%) to unseen domains compared to baselines. Therefore, DDM can efficiently adapt self-supervised encoders to provide high-quality, generalizable representations for diverse multi-domain data.
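  • Code sketch: A minimal sketch of the disentanglement idea: splitting an encoder's representation into domain-variant and domain-invariant halves, with a domain classifier attached only to the variant part. The dimensions, the half-half split, and the losses are assumptions for illustration, not the paper's exact DDM.

```python
import torch
import torch.nn as nn

class DomainDisentangler(nn.Module):
    """Split a representation into domain-invariant and domain-variant parts."""
    def __init__(self, feat_dim: int = 512, num_domains: int = 3):
        super().__init__()
        self.split = feat_dim // 2
        self.domain_head = nn.Linear(feat_dim - self.split, num_domains)

    def forward(self, z: torch.Tensor):
        z_inv, z_var = z[:, :self.split], z[:, self.split:]
        domain_logits = self.domain_head(z_var)   # only the variant part predicts domain
        return z_inv, z_var, domain_logits

ddm = DomainDisentangler()
z = torch.randn(16, 512)                      # encoder output (e.g., from SimCLR/MoCo)
domains = torch.randint(0, 3, (16,))          # real or pseudo-domain labels
z_inv, z_var, logits = ddm(z)
loss_domain = nn.CrossEntropyLoss()(logits, domains)   # added to the self-supervised loss
# downstream tasks would use z_inv, the domain-invariant half
```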

CDFSL-V: Cross-Domain Few-Shot Learning for Videos

  • paper_url: http://arxiv.org/abs/2309.03989
  • repo_url: None
  • paper_authors: Sarinda Samarasinghe, Mamshad Nayeem Rizve, Navid Kardan, Mubarak Shah
  • for: Cross-domain few-shot video action recognition, where existing methods rely on large labeled datasets from the same domain.
  • methods: Proposes a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance information from the source and target domains. Specifically, a masked autoencoder-based self-supervised objective learns from both source and target data, and a progressive curriculum balances class-discriminative features from the source dataset with generic features learned from the target domain.
  • results: Evaluated on several challenging benchmark datasets, the method outperforms existing cross-domain few-shot learning techniques. Code is available at https://github.com/Sarinda251/CDFSL-V.
    Abstract Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. To be particular, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data in a self-supervised manner. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class discriminative features from the source data. As the training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques. Our code is available at https://github.com/Sarinda251/CDFSL-V
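  • Code sketch: A compact sketch of a masked-autoencoder-style self-supervised objective on video patch tokens: mask a random subset of tokens, reconstruct them, and compute the loss only on the masked positions. The shapes, the tiny backbone, and the 90% mask ratio are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Reconstruct masked patch tokens; the loss is computed only on masked positions."""
    def __init__(self, dim: int = 256, mask_ratio: float = 0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        mask = torch.rand(b, n, device=tokens.device) < self.mask_ratio   # True = masked
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token.expand(b, n, d), tokens)
        recon = self.backbone(corrupted)
        return ((recon - tokens) ** 2)[mask].mean()      # MSE on masked tokens only

video_tokens = torch.randn(2, 8 * 196, 256)   # dummy spatio-temporal patch tokens
loss = TinyMAE()(video_tokens)                # self-supervised loss on source and target clips
```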

Separable Self and Mixed Attention Transformers for Efficient Object Tracking

  • paper_url: http://arxiv.org/abs/2309.03979
  • repo_url: https://github.com/goutamyg/smat
  • paper_authors: Goutam Yelluru Gopal, Maria A. Amer
  • for: Proposes an efficient separable self- and mixed-attention transformer-based architecture for lightweight object tracking.
  • methods: The backbone uses separable mixed-attention transformers to fuse the template and search regions during feature extraction, and the prediction head performs global contextual modeling of the encoded features with efficient self-attention blocks for robust target state estimation.
  • results: The proposed tracker, SMAT, surpasses related lightweight trackers on the GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets while running at 37 fps on CPU and 158 fps on GPU with 3.8M parameters. For example, it surpasses the closely related trackers E.T.Track and MixFormerV2-S on GOT10k-test by margins of 7.9% and 5.8%, respectively, in the AO metric.
    Abstract The deployment of transformers for visual object tracking has shown state-of-the-art results on several benchmarks. However, the transformer-based models are under-utilized for Siamese lightweight tracking due to the computational complexity of their attention blocks. This paper proposes an efficient self and mixed attention transformer-based architecture for lightweight tracking. The proposed backbone utilizes the separable mixed attention transformers to fuse the template and search regions during feature extraction to generate superior feature encoding. Our prediction head performs global contextual modeling of the encoded features by leveraging efficient self-attention blocks for robust target state estimation. With these contributions, the proposed lightweight tracker deploys a transformer-based backbone and head module concurrently for the first time. Our ablation study testifies to the effectiveness of the proposed combination of backbone and head modules. Simulations show that our Separable Self and Mixed Attention-based Tracker, SMAT, surpasses the performance of related lightweight trackers on GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets, while running at 37 fps on CPU, 158 fps on GPU, and having 3.8M parameters. For example, it significantly surpasses the closely related trackers E.T.Track and MixFormerV2-S on GOT10k-test by a margin of 7.9% and 5.8%, respectively, in the AO metric. The tracker code and model is available at https://github.com/goutamyg/SMAT

Improving Resnet-9 Generalization Trained on Small Datasets

  • paper_url: http://arxiv.org/abs/2309.03965
  • repo_url: https://github.com/omarawad2/HAET2021_Huawei
  • paper_authors: Omar Mohamed Awad, Habib Hajimolahoseini, Michael Lim, Gurpreet Gosal, Walid Ahmed, Yang Liu, Gordon Deng
  • for: Presents the approach that won first prize at the ICLR competition on Hardware Aware Efficient Training, where the goal is to achieve the highest possible accuracy on an image classification task in less than 10 minutes.
  • methods: Applies a series of techniques for improving the generalization of ResNet-9, including sharpness-aware optimization, label smoothing, gradient centralization, input patch whitening, and metalearning-based training.
  • results: Experiments show that ResNet-9 can reach 88% accuracy in less than 10 minutes while being trained on only a 10% subset of the CIFAR-10 dataset.
    Abstract This paper presents our proposed approach that won the first prize at the ICLR competition on Hardware Aware Efficient Training. The challenge is to achieve the highest possible accuracy in an image classification task in less than 10 minutes. The training is done on a small dataset of 5000 images picked randomly from the CIFAR-10 dataset. The evaluation is performed by the competition organizers on a secret dataset with 1000 images of the same size. Our approach includes applying a series of techniques for improving the generalization of ResNet-9 including: sharpness aware optimization, label smoothing, gradient centralization, input patch whitening as well as metalearning based training. Our experiments show that the ResNet-9 can achieve the accuracy of 88% while trained only on a 10% subset of the CIFAR-10 dataset in less than 10 minutes.
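  • Code sketch: Two of the listed techniques are easy to show concretely. Gradient centralization subtracts the mean of each multi-dimensional weight gradient over its non-output dimensions before the optimizer step, and label smoothing is a one-argument change to the loss. A minimal sketch follows; the tiny stand-in model and hyperparameters are placeholders, not the competition configuration.

```python
import torch
import torch.nn as nn

def centralize_gradients(model: nn.Module) -> None:
    """Gradient centralization: zero-center each multi-dimensional weight gradient."""
    for p in model.parameters():
        if p.grad is not None and p.grad.dim() > 1:
            dims = tuple(range(1, p.grad.dim()))
            p.grad -= p.grad.mean(dim=dims, keepdim=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in for ResNet-9
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)              # label smoothing
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

images, targets = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = criterion(model(images), targets)
loss.backward()
centralize_gradients(model)   # apply GC between backward() and step()
optimizer.step()
```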

REALM: Robust Entropy Adaptive Loss Minimization for Improved Single-Sample Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2309.03964
  • repo_url: None
  • paper_authors: Skyler Seto, Barry-John Theobald, Federico Danieli, Navdeep Jaitly, Dan Busbridge
  • For: Mitigating performance loss due to distribution shifts between training and test data in online fully-test-time adaptation (F-TTA), without access to the training data and without knowledge of the model training procedure.
  • Methods: Proposes Robust Entropy Adaptive Loss Minimization (REALM), a general framework inspired by self-paced learning and robust loss functions that improves the robustness of F-TTA to noisy samples.
  • Results: Achieves better adaptation accuracy than previous approaches throughout the adaptation process on corruptions of CIFAR-10 and ImageNet-1K, demonstrating its effectiveness.
    Abstract Fully-test-time adaptation (F-TTA) can mitigate performance loss due to distribution shifts between train and test data (1) without access to the training data, and (2) without knowledge of the model training procedure. In online F-TTA, a pre-trained model is adapted using a stream of test samples by minimizing a self-supervised objective, such as entropy minimization. However, models adapted with online using entropy minimization, are unstable especially in single sample settings, leading to degenerate solutions, and limiting the adoption of TTA inference strategies. Prior works identify noisy, or unreliable, samples as a cause of failure in online F-TTA. One solution is to ignore these samples, which can lead to bias in the update procedure, slow adaptation, and poor generalization. In this work, we present a general framework for improving robustness of F-TTA to these noisy samples, inspired by self-paced learning and robust loss functions. Our proposed approach, Robust Entropy Adaptive Loss Minimization (REALM), achieves better adaptation accuracy than previous approaches throughout the adaptation process on corruptions of CIFAR-10 and ImageNet-1K, demonstrating its effectiveness.
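  • Code sketch: The core of entropy-based test-time adaptation, with a simple confidence weighting that down-weights noisy samples, can be sketched as follows. The exponential weighting rule and the stand-in model are generic illustrations, not REALM's specific formulation.

```python
import torch
import torch.nn.functional as F

def robust_entropy_step(model, x, optimizer, temperature: float = 1.0) -> float:
    """One F-TTA update: minimize prediction entropy, down-weighting uncertain samples."""
    logits = model(x) / temperature
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)   # per-sample entropy
    weights = torch.exp(-entropy.detach())          # noisy (high-entropy) samples get small weight
    loss = (weights * entropy).sum() / weights.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # in practice, often only norm/affine params
x = torch.randn(1, 3, 32, 32)                              # a single test sample
robust_entropy_step(model, x, optimizer)
```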

SimpleNeRF: Regularizing Sparse Input Neural Radiance Fields with Simpler Solutions

  • paper_url: http://arxiv.org/abs/2309.03955
  • repo_url: None
  • paper_authors: Nagabhushan Somraj, Adithyan Karanayil, Rajiv Soundararajan
  • for: Studies how to train NeRFs for few-shot novel view synthesis from sparse input views by learning depth supervision with augmented models.
  • methods: Designs augmented models that encourage simpler solutions by exploring the role of positional encoding and view-dependent radiance, trains them alongside the NeRF, and uses their depth estimates to supervise the NeRF depth, selecting only reliable estimates and adding a consistency loss between the coarse and fine multi-layer perceptrons.
  • results: Achieves state-of-the-art view-synthesis performance on two popular datasets.
    Abstract Neural Radiance Fields (NeRF) show impressive performance for the photorealistic free-view rendering of scenes. However, NeRFs require dense sampling of images in the given scene, and their performance degrades significantly when only a sparse set of views are available. Researchers have found that supervising the depth estimated by the NeRF helps train it effectively with fewer views. The depth supervision is obtained either using classical approaches or neural networks pre-trained on a large dataset. While the former may provide only sparse supervision, the latter may suffer from generalization issues. As opposed to the earlier approaches, we seek to learn the depth supervision by designing augmented models and training them along with the NeRF. We design augmented models that encourage simpler solutions by exploring the role of positional encoding and view-dependent radiance in training the few-shot NeRF. The depth estimated by these simpler models is used to supervise the NeRF depth estimates. Since the augmented models can be inaccurate in certain regions, we design a mechanism to choose only reliable depth estimates for supervision. Finally, we add a consistency loss between the coarse and fine multi-layer perceptrons of the NeRF to ensure better utilization of hierarchical sampling. We achieve state-of-the-art view-synthesis performance on two popular datasets by employing the above regularizations. The source code for our model can be found on our project page: https://nagabhushansn95.github.io/publications/2023/SimpleNeRF.html

A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

  • paper_url: http://arxiv.org/abs/2309.03906
  • repo_url: https://github.com/uni-medical/a-eval
  • paper_authors: Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao
  • for: Examines whether abdominal multi-organ segmentation models trained on large-scale datasets generalize to other datasets, and how to further improve their generalizability.
  • methods: Uses training sets from four large-scale public datasets (FLARE22, AMOS, WORD, and TotalSegmentator), each providing extensive labels for abdominal multi-organ segmentation. For evaluation, the validation sets of these datasets are combined with the training set of the BTCV dataset, forming a robust benchmark comprising five distinct datasets.
  • results: Evaluates the generalizability of various models under diverse data usage scenarios (training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets) and studies the impact of model size on cross-dataset generalizability. The analyses underline the importance of effective data usage and offer guidance for assembling large-scale datasets and improving training strategies.
    Abstract Although deep learning have revolutionized abdominal multi-organ segmentation, models often struggle with generalization due to training on small, specific datasets. With the recent emergence of large-scale datasets, some important questions arise: \textbf{Can models trained on these datasets generalize well on different ones? If yes/no, how to further improve their generalizability?} To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation. We employ training sets from four large-scale public datasets: FLARE22, AMOS, WORD, and TotalSegmentator, each providing extensive labels for abdominal multi-organ segmentation. For evaluation, we incorporate the validation sets from these datasets along with the training set from the BTCV dataset, forming a robust benchmark comprising five distinct datasets. We evaluate the generalizability of various models using the A-Eval benchmark, with a focus on diverse data usage scenarios: training on individual datasets independently, utilizing unlabeled data via pseudo-labeling, mixing different modalities, and joint training across all available datasets. Additionally, we explore the impact of model sizes on cross-dataset generalizability. Through these analyses, we underline the importance of effective data usage in enhancing models' generalization capabilities, offering valuable insights for assembling large-scale datasets and improving training strategies. The code and pre-trained models are available at \href{https://github.com/uni-medical/A-Eval}{https://github.com/uni-medical/A-Eval}.
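  • Code sketch: One of the benchmarked data-usage scenarios, pseudo-labeling of unlabeled scans, can be sketched in a few lines. The confidence threshold, ignore index, and the tiny stand-in teacher are generic choices for illustration, not A-Eval's exact protocol.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def make_pseudo_labels(model: nn.Module, volume: torch.Tensor, threshold: float = 0.9):
    """Predict a segmentation and keep only confident voxels as pseudo-labels."""
    probs = torch.softmax(model(volume), dim=1)   # (B, C, D, H, W)
    confidence, labels = probs.max(dim=1)         # (B, D, H, W)
    labels[confidence < threshold] = 255          # 255 = ignore index during later training
    return labels

teacher = nn.Conv3d(1, 5, kernel_size=1)          # stand-in for a trained segmentation model
ct_volume = torch.randn(1, 1, 32, 64, 64)         # dummy CT patch
pseudo = make_pseudo_labels(teacher, ct_volume)
criterion = nn.CrossEntropyLoss(ignore_index=255) # use pseudo as targets for the unlabeled data
```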

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

  • paper_url: http://arxiv.org/abs/2309.03904
  • repo_url: https://github.com/zhujiapeng/aurora
  • paper_authors: Jiapeng Zhu, Ceyuan Yang, Kecheng Zheng, Yinghao Xu, Zifan Shi, Yujun Shen
  • for: Proposes a GAN-based text-to-image generator that can be trained at large scale with limited computational resources.
  • methods: Employs a collection of experts to learn feature processing, together with a sparse router that selects the most suitable expert for each feature point; the router makes its decision adaptively by taking the text-integrated global latent code into account, so that sampling stochasticity and the text condition are faithfully decoded into the final synthesis.
  • results: Trained on LAION2B-en and COYO-700M at 64x64 image resolution, the model achieves 6.2 zero-shot FID on MS COCO.
    Abstract Due to the difficulty in scaling up, generative adversarial networks (GANs) seem to be falling from grace on the task of text-conditioned image synthesis. Sparsely-activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited computational resources. Inspired by such a philosophy, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to help select the most suitable expert for each feature point. To faithfully decode the sampling stochasticity and the text condition to the final synthesis, our router adaptively makes its decision by taking into account the text-integrated global latent code. At 64x64 image resolution, our model trained on LAION2B-en and COYO-700M achieves 6.2 zero-shot FID on MS COCO. We release the code and checkpoints to facilitate the community for further development.
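  • Code sketch: The sparsely-activated expert idea can be illustrated with a top-1 router over per-feature-point tokens, conditioned on a global latent code. This is generic MoE routing with shapes and the top-1 rule chosen for illustration, not Aurora's architecture.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Route each feature point to a single expert, conditioned on a global latent code."""
    def __init__(self, dim: int = 128, latent_dim: int = 64, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim + latent_dim, num_experts)

    def forward(self, feats: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        b, n, d = feats.shape
        cond = torch.cat([feats, latent.unsqueeze(1).expand(b, n, -1)], dim=-1)
        gate = self.router(cond).softmax(dim=-1)          # (B, N, num_experts)
        top_w, top_idx = gate.max(dim=-1)                 # top-1 expert per feature point
        out = torch.zeros_like(feats)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_w[mask].unsqueeze(-1) * expert(feats[mask])
        return out

moe = SparseMoE()
feats = torch.randn(2, 256, 128)        # per-point generator features
latent = torch.randn(2, 64)             # text-integrated global latent code
out = moe(feats, latent)
```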

Tracking Anything with Decoupled Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.03903
  • repo_url: https://github.com/hkchengrex/Tracking-Anything-with-DEVA
  • paper_authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
  • for: Addresses the scarcity and annotation cost of training data for video segmentation, which makes extending end-to-end algorithms to new video segmentation tasks difficult, especially in large-vocabulary settings.
  • methods: Proposes a decoupled video segmentation approach (DEVA) composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation, combined via (semi-)online fusion of segmentation hypotheses from different frames.
  • results: Compares favorably to end-to-end approaches on several data-scarce tasks, including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.
    Abstract Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA

Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction

  • paper_url: http://arxiv.org/abs/2309.03900
  • repo_url: None
  • paper_authors: Su-Kai Chen, Hung-Lin Yen, Yu-Lun Liu, Min-Hung Chen, Hou-Ning Hu, Wen-Hsiao Peng, Yen-Yu Lin
  • for: Single-image HDR reconstruction, using deep learning to reconstruct an HDR image from an LDR image via a generated LDR stack.
  • methods: Proposes the continuous exposure value representation (CEVR), which uses an implicit function to generate LDR images with arbitrary exposure values, including those unseen during training, together with a cycle training strategy that supervises the model in generating continuous-EV LDR images without corresponding ground truths.
  • results: The CEVR model outperforms existing methods, yielding higher-quality HDR reconstructions.
    Abstract Deep learning is commonly used to reconstruct HDR images from LDR images. LDR stack-based methods are used for single-image HDR reconstruction, generating an HDR image from a deep learning-generated LDR stack. However, current methods generate the stack with predetermined exposure values (EVs), which may limit the quality of HDR reconstruction. To address this, we propose the continuous exposure value representation (CEVR), which uses an implicit function to generate LDR images with arbitrary EVs, including those unseen during training. Our approach generates a continuous stack with more images containing diverse EVs, significantly improving HDR reconstruction. We use a cycle training strategy to supervise the model in generating continuous EV LDR images without corresponding ground truths. Our CEVR model outperforms existing methods, as demonstrated by experimental results.
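  • Code sketch: The key idea, an implicit function that maps image features plus an arbitrary (continuous) exposure value to an LDR output, can be sketched as a small conditioned MLP. Everything here (per-pixel features, the scalar EV channel, the sigmoid output) is an illustrative assumption, not the CEVR architecture.

```python
import torch
import torch.nn as nn

class EVConditionedDecoder(nn.Module):
    """Implicit function f(feature, EV) -> LDR pixel, for arbitrary exposure values."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),      # LDR pixel in [0, 1]
        )

    def forward(self, pixel_feats: torch.Tensor, ev: float) -> torch.Tensor:
        ev_channel = torch.full_like(pixel_feats[..., :1], ev)    # broadcast EV to every pixel
        return self.mlp(torch.cat([pixel_feats, ev_channel], dim=-1))

decoder = EVConditionedDecoder()
feats = torch.randn(1, 256 * 256, 64)             # per-pixel features from an encoder
ldr_plus_1 = decoder(feats, ev=1.0)               # an EV seen during training
ldr_half = decoder(feats, ev=0.5)                 # an EV never seen during training
# a stack of such outputs at many EVs would then be merged into an HDR image
```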

The Making and Breaking of Camouflage

  • paper_url: http://arxiv.org/abs/2309.03899
  • repo_url: None
  • paper_authors: Hala Lamdouar, Weidi Xie, Andrew Zisserman
  • for: Addresses the question of what makes a camouflage successful by proposing three scores for automatically assessing its effectiveness, which are then used to assess and compare all available camouflage datasets.
  • methods: Measures camouflage by the similarity between background and foreground features and by boundary visibility, and incorporates the proposed camouflage score into a generative model as an auxiliary loss so that effective camouflage images and videos can be synthesized in a scalable manner.
  • results: A transformer-based model trained on the generated synthetic dataset achieves state-of-the-art camouflage-breaking performance on the public MoCA-Mask benchmark.
    Abstract Not all camouflages are equally effective, as even a partially visible contour or a slight color difference can make the animal stand out and break its camouflage. In this paper, we address the question of what makes a camouflage successful, by proposing three scores for automatically assessing its effectiveness. In particular, we show that camouflage can be measured by the similarity between background and foreground features and boundary visibility. We use these camouflage scores to assess and compare all available camouflage datasets. We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner. The generated synthetic dataset is used to train a transformer-based model for segmenting camouflaged animals in videos. Experimentally, we demonstrate state-of-the-art camouflage breaking performance on the public MoCA-Mask benchmark.
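  • Code sketch: The first of the proposed ingredients, measuring camouflage by the similarity of foreground and background features, can be sketched as a cosine similarity between mean feature vectors inside and outside the animal mask. This is an illustrative stand-in, not the paper's exact scoring functions.

```python
import torch
import torch.nn.functional as F

def foreground_background_similarity(feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between mean foreground and mean background features.

    feats: (C, H, W) feature map; mask: (H, W) binary animal mask (1 = animal).
    Higher similarity suggests better camouflage."""
    c = feats.shape[0]
    flat = feats.reshape(c, -1)                         # (C, H*W)
    m = mask.reshape(-1).bool()
    fg = flat[:, m].mean(dim=1)
    bg = flat[:, ~m].mean(dim=1)
    return F.cosine_similarity(fg, bg, dim=0)

feats = torch.randn(256, 64, 64)                        # dummy backbone features
mask = (torch.rand(64, 64) > 0.7).float()               # dummy animal mask
score = foreground_background_similarity(feats, mask)   # value in [-1, 1]
```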

ProPainter: Improving Propagation and Transformer for Video Inpainting

  • paper_url: http://arxiv.org/abs/2309.03897
  • repo_url: https://github.com/sczhou/propainter
  • paper_authors: Shangchen Zhou, Chongyi Li, Kelvin C. K. Chan, Chen Change Loy
  • for: Improving the flow-based propagation and spatiotemporal Transformer mechanisms used in video inpainting (VI).
  • methods: Proposes an improved framework, ProPainter, with enhanced propagation and an efficient Transformer: dual-domain propagation combines the advantages of image and feature warping to reliably exploit global correspondences, and a mask-guided sparse video Transformer achieves high efficiency by discarding unnecessary and redundant tokens.
  • results: ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
    Abstract Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency.
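  • Code sketch: The image/feature warping that underlies flow-based propagation can be sketched with a standard backward warp using torch.nn.functional.grid_sample. ProPainter's dual-domain propagation builds on this kind of operation, but the snippet below is only the generic warp, not the paper's module.

```python
import torch
import torch.nn.functional as F

def flow_warp(source: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `source` (B, C, H, W) with optical flow (B, 2, H, W) given in pixels."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(source.device)       # (2, H, W), (x, y)
    coords = grid.unsqueeze(0) + flow                                   # sampling positions
    # normalize to [-1, 1]; grid_sample expects (B, H, W, 2) ordered as (x, y)
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(source, norm_grid, align_corners=True)

frame = torch.randn(1, 3, 240, 432)          # a neighboring frame (or its features)
flow = torch.randn(1, 2, 240, 432)           # estimated flow from current to neighbor
propagated = flow_warp(frame, flow)          # content pulled into the current frame
```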

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

  • paper_url: http://arxiv.org/abs/2309.03895
  • repo_url: None
  • paper_authors: Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
  • for: Proposes a unifying and generic framework for aligning computer vision tasks with human instructions.
  • methods: Builds on the diffusion process and is trained to predict pixels according to user instructions, casting diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space.
  • results: Handles a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement), exhibits the ability to handle unseen tasks, and outperforms prior methods on novel datasets.
    Abstract We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g., categories and coordinates) for each vision task, we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely, the model is built upon the diffusion process and is trained to predict pixels according to user instructions, such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks, including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement). It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets. This represents a significant step towards a generalist modeling interface for vision tasks, advancing artificial general intelligence in the field of computer vision.

BluNF: Blueprint Neural Field

  • paper_url: http://arxiv.org/abs/2309.03933
  • repo_url: None
  • paper_authors: Robin Courant, Xi Wang, Marc Christie, Vicky Kalogeiton
  • for: Scene novel view synthesis with Neural Radiance Fields (NeRFs), which offer visually realistic, precise, and robust implicit reconstructions, and the problem of editing such scenes intuitively.
  • methods: Introduces Blueprint Neural Field (BluNF), which leverages an implicit neural representation together with prior semantic and depth information to construct a robust, user-friendly 2D blueprint of a scene, enabling intuitive editing and manipulation of NeRF representations.
  • results: Demonstrates BluNF's editability through an intuitive click-and-change mechanism, enabling 3D manipulations such as masking, appearance modification, and object removal.
    Abstract Neural Radiance Fields (NeRFs) have revolutionized scene novel view synthesis, offering visually realistic, precise, and robust implicit reconstructions. While recent approaches enable NeRF editing, such as object removal, 3D shape modification, or material property manipulation, the manual annotation prior to such edits makes the process tedious. Additionally, traditional 2D interaction tools lack an accurate sense of 3D space, preventing precise manipulation and editing of scenes. In this paper, we introduce a novel approach, called Blueprint Neural Field (BluNF), to address these editing issues. BluNF provides a robust and user-friendly 2D blueprint, enabling intuitive scene editing. By leveraging implicit neural representation, BluNF constructs a blueprint of a scene using prior semantic and depth information. The generated blueprint allows effortless editing and manipulation of NeRF representations. We demonstrate BluNF's editability through an intuitive click-and-change mechanism, enabling 3D manipulations, such as masking, appearance modification, and object removal. Our approach significantly contributes to visual content creation, paving the way for further research in this area.

ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Grasping and Articulation

  • paper_url: http://arxiv.org/abs/2309.03891
  • repo_url: None
  • paper_authors: Hui Zhang, Sammy Christen, Zicong Fan, Luocheng Zheng, Jemin Hwangbo, Jie Song, Otmar Hilliges
  • for: Proposes ArtiGrasp, a method to synthesize bi-manual hand-object interactions that include grasping and articulation.
  • methods: Uses reinforcement learning and physics simulation to train a policy that controls the global wrist motion and precise finger control, unifying grasping and articulation within a single policy guided by a single hand pose reference, together with a learning curriculum of increasing difficulty (from single-hand manipulation of stationary objects to multi-agent training with both hands and non-stationary objects).
  • results: Demonstrates efficacy on Dynamic Object Grasping and Articulation, a task that requires grasping, relocation, and articulation, and shows that the method can generate motions from noisy hand-object pose estimates produced by an off-the-shelf image-based regressor.
    Abstract We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include grasping and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies grasping and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Grasping and Articulation, a task that involves bringing an object into a target articulated pose. This task requires grasping, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor.

Better Practices for Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.03879
  • repo_url: None
  • paper_authors: Linus Ericsson, Da Li, Timothy M. Hospedales
  • for: Addressing the challenge of domain shift in real-world machine learning applications, particularly the difficulty of performing hyperparameter optimisation for domain adaptation algorithms without access to a labelled validation set.
  • methods: Benchmarks popular adaptation algorithms using a suite of candidate validation criteria and assesses their performance under proper evaluation practice.
  • results: Reveals challenges across all three branches of domain adaptation methodology, namely Unsupervised Domain Adaptation (UDA), Source-Free Domain Adaptation (SFDA), and Test-Time Adaptation (TTA); realistically achievable performance is often worse than expected, but using proper validation splits is beneficial, and some previously unexplored validation metrics provide the best options to date.
    Abstract Distribution shifts are all too common in real-world applications of machine learning. Domain adaptation (DA) aims to address this by providing various frameworks for adapting models to the deployment data without using labels. However, the domain shift scenario raises a second more subtle challenge: the difficulty of performing hyperparameter optimisation (HPO) for these adaptation algorithms without access to a labelled validation set. The unclear validation protocol for DA has led to bad practices in the literature, such as performing HPO using the target test labels when, in real-world scenarios, they are not available. This has resulted in over-optimism about DA research progress compared to reality. In this paper, we analyse the state of DA when using good evaluation practice, by benchmarking a suite of candidate validation criteria and using them to assess popular adaptation algorithms. We show that there are challenges across all three branches of domain adaptation methodology including Unsupervised Domain Adaptation (UDA), Source-Free Domain Adaptation (SFDA), and Test Time Adaptation (TTA). While the results show that realistically achievable performance is often worse than expected, they also show that using proper validation splits is beneficial, as well as showing that some previously unexplored validation metrics provide the best options to date. Altogether, our improved practices covering data, training, validation and hyperparameter optimisation form a new rigorous pipeline to improve benchmarking, and hence research progress, within this important field going forward.
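  • Code sketch: One family of label-free validation criteria examined in this line of work scores hyperparameter configurations using only unlabeled target data, for example by mean prediction entropy. The snippet below is a generic example of such a criterion, not necessarily the metric the paper recommends.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_prediction_entropy(model, target_loader) -> float:
    """Label-free validation score: average prediction entropy on unlabeled target data."""
    model.eval()
    total, count = 0.0, 0
    for x in target_loader:                       # no labels needed
        probs = F.softmax(model(x), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        total += entropy.sum().item()
        count += x.shape[0]
    return total / count

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
target_loader = [torch.randn(16, 3, 32, 32) for _ in range(4)]   # stand-in for a DataLoader
score = mean_prediction_entropy(model, target_loader)
# model selection: among candidate hyperparameter settings, keep the checkpoint with the
# lowest score on the unlabeled target split (one candidate criterion among several).
```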

Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks

  • paper_url: http://arxiv.org/abs/2309.03874
  • repo_url: https://github.com/eyalgomel/box-based-refinement
  • paper_authors: Eyal Gomel, Tal Shaharabany, Lior Wolf
  • for: Improving the localization performance of weakly supervised and unsupervised methods.
  • methods: Trains a box-based detector network on top of the network output instead of the image data, with suitable loss backpropagation, so that the detector can also be used to improve the original network.
  • results: Significant improvements in phrase grounding for the "what is where by looking" task, as well as for various methods of unsupervised object discovery.
    Abstract It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods. Moreover, we extend this understanding by demonstrating that these detectors can be utilized to improve the original network, paving the way for further advancements. To accomplish this, we train the detectors on top of the network output instead of the image data and apply suitable loss backpropagation. Our findings reveal a significant improvement in phrase grounding for the ``what is where by looking'' task, as well as various methods of unsupervised object discovery. Our code is available at https://github.com/eyalgomel/box-based-refinement.

Text-to-feature diffusion for audio-visual few-shot learning

  • paper_url: http://arxiv.org/abs/2309.03869
  • repo_url: https://github.com/explainableml/avdiff-gfsl
  • paper_authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
  • for: Introduces a unified audio-visual few-shot video classification benchmark on three datasets (VGGSound-FSL, UCF-FSL, and ActivityNet-FSL), on which ten methods are adapted and compared.
  • methods: Proposes AV-DIFF, a text-to-feature diffusion framework that first fuses temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes.
  • results: AV-DIFF achieves state-of-the-art performance on the proposed benchmark for audio-visual (generalised) few-shot learning.
    Abstract Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for the few-shot video classification task. Therefore, we introduce a unified audio-visual few-shot video classification benchmark on three datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we adapt and compare ten methods. In addition, we propose AV-DIFF, a text-to-feature diffusion framework, which first fuses the temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.

CenTime: Event-Conditional Modelling of Censoring in Survival Analysis

  • paper_url: http://arxiv.org/abs/2309.03851
  • repo_url: https://github.com/ahmedhshahin/CenTime
  • paper_authors: Ahmed H. Shahin, An Zhao, Alexander C. Whitehead, Daniel C. Alexander, Joseph Jacob, David Barber
  • for: Predicting the time until clinically important events (such as death or cancer recurrence) from baseline patient data in medical machine learning models.
  • methods: Proposes CenTime, a novel approach to survival analysis that directly estimates the time to event, featuring an event-conditional censoring mechanism that performs robustly even when uncensored data is scarce and that integrates easily with deep learning models, with no restrictions on batch size or the number of uncensored samples.
  • results: Compared with standard survival analysis methods such as the Cox proportional-hazards model and DeepHit, CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance.
    Abstract Survival analysis is a valuable tool for estimating the time until specific events, such as death or cancer recurrence, based on baseline observations. This is particularly useful in healthcare to prognostically predict clinically important events based on patient data. However, existing approaches often have limitations; some focus only on ranking patients by survivability, neglecting to estimate the actual event time, while others treat the problem as a classification task, ignoring the inherent time-ordered structure of the events. Furthermore, the effective utilization of censored samples - training data points where the exact event time is unknown - is essential for improving the predictive accuracy of the model. In this paper, we introduce CenTime, a novel approach to survival analysis that directly estimates the time to event. Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce. We demonstrate that our approach forms a consistent estimator for the event model parameters, even in the absence of uncensored data. Furthermore, CenTime is easily integrated with deep learning models with no restrictions on batch size or the number of uncensored samples. We compare our approach with standard survival analysis methods, including the Cox proportional-hazard model and DeepHit. Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance. Our implementation is publicly available at https://github.com/ahmedhshahin/CenTime.
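  • Code sketch: To make the role of censoring concrete, here is the classical censored log-likelihood for a parametric (exponential) time-to-event model: uncensored patients contribute the log-density at their event time, censored patients the log-survival at their censoring time. This is the textbook construction for illustration, not CenTime's event-conditional censoring model.

```python
import torch
import torch.nn as nn

def censored_nll(log_rate: torch.Tensor, time: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood for an exponential survival model.

    log_rate: predicted log hazard rate per patient; time: observed time;
    event: 1 if the event was observed, 0 if the sample is censored."""
    rate = log_rate.exp()
    log_density = log_rate - rate * time          # log f(t) = log(lambda) - lambda * t
    log_survival = -rate * time                   # log S(t) = -lambda * t
    return -(event * log_density + (1 - event) * log_survival).mean()

model = nn.Sequential(nn.Linear(32, 1))           # imaging features -> log hazard rate
feats = torch.randn(8, 32)
time = torch.rand(8) * 5.0                        # follow-up times (e.g., years)
event = torch.randint(0, 2, (8,)).float()         # 0 = censored, 1 = event observed
loss = censored_nll(model(feats).squeeze(1), time, event)
```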

Random Expert Sampling for Deep Learning Segmentation of Acute Ischemic Stroke on Non-contrast CT

  • paper_url: http://arxiv.org/abs/2309.03930
  • repo_url: None
  • paper_authors: Sophie Ostmeier, Brian Axelrod, Benjamin Pulli, Benjamin F. J. Verhaaren, Abdelkader Mahammedi, Yongkai Liu, Christian Federau, Greg Zaharchuk, Jeremy J. Heit
  • for: Developing and validating a multi-expert deep learning method for automatically quantifying ischemic brain tissue on non-contrast CT in patients with acute ischemic stroke.
  • methods: Trains a benchmark U-Net on reference annotations from three experienced neuroradiologists using two training schemes, majority vote and random expert sampling, and compares them with a one-sided Wilcoxon signed-rank test on segmentation metrics together with inter-expert agreement and a consistency analysis.
  • results: Random expert sampling leads to a model that agrees with the experts better than the experts agree among themselves and better than a majority-vote model (Surface Dice at 5 mm tolerance improved by 61% to 0.70 +- 0.03; Dice improved by 25% to 0.50 +- 0.04). The model-based predicted volume estimates the final infarct volume and correlates better with the clinical outcome than CT perfusion.
    Abstract Purpose: Multi-expert deep learning training methods to automatically quantify ischemic brain tissue on Non-Contrast CT Materials and Methods: The data set consisted of 260 Non-Contrast CTs from 233 patients of acute ischemic stroke patients recruited in the DEFUSE 3 trial. A benchmark U-Net was trained on the reference annotations of three experienced neuroradiologists to segment ischemic brain tissue using majority vote and random expert sampling training schemes. We used a one-sided Wilcoxon signed-rank test on a set of segmentation metrics to compare bootstrapped point estimates of the training schemes with the inter-expert agreement and ratio of variance for consistency analysis. We further compare volumes with the 24h-follow-up DWI (final infarct core) in the patient subgroup with full reperfusion and we test volumes for correlation to the clinical outcome (mRS after 30 and 90 days) with the Spearman method. Results: Random expert sampling leads to a model that shows better agreement with experts than experts agree among themselves and better agreement than the agreement between experts and a majority-vote model performance (Surface Dice at Tolerance 5mm improvement of 61% to 0.70 +- 0.03 and Dice improvement of 25% to 0.50 +- 0.04). The model-based predicted volume similarly estimated the final infarct volume and correlated better to the clinical outcome than CT perfusion. Conclusion: A model trained on random expert sampling can identify the presence and location of acute ischemic brain tissue on Non-Contrast CT similar to CT perfusion and with better consistency than experts. This may further secure the selection of patients eligible for endovascular treatment in less specialized hospitals.
    摘要 目的:使用多个专家深度学习训练方法自动评估非contrast CT中的血液脑部分量。方法:数据集包括260个非contrast CT图像,来自233名stroke患者,参与DEFUSE 3试验。我们使用一个benchmark U-Net模型,通过多个专家的参照注释来 segment非contrast CT中的血液脑部分。我们使用一个一侧Wilcoxon签名rank测试来比较各种训练方案的点估计与专家之间的一致性和差异分析。此外,我们还比较了患者 subgroup中的24小时后DWI(最终损伤核心)和临床结果(mRS after 30和90天)之间的相关性。结果:随机专家采样导致一个模型,与专家之间的一致性更高,并且与专家和多数投票模型的性能相比(Surface Dice at Tolerance 5mm改进率为61%,Dice改进率为25%)。该模型预测的量也准确地估计了最终损伤量,并且与临床结果更好地相关。结论:一个基于随机专家采样的模型可以在非contrast CT中准确地识别和定位急性血液脑部分,与CT perfusion相似,并且与专家之间的一致性更高。这可能会为eless specialized hospitals中选择患者渠道进行Endovascular treatment提供更安全的选择。
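
A minimal sketch of the two training schemes compared above, assuming three binary expert masks per scan are available; names are illustrative and not taken from the study's code.

```python
# Majority vote vs. random expert sampling over per-scan expert annotations.
import numpy as np

rng = np.random.default_rng(0)

def majority_vote(expert_masks):
    """Pixel-wise majority over binary expert annotations."""
    stacked = np.stack(expert_masks, axis=0)           # (n_experts, H, W)
    return (stacked.mean(axis=0) >= 0.5).astype(np.uint8)

def random_expert_sample(expert_masks):
    """Pick one expert's annotation at random for this training example."""
    idx = rng.integers(len(expert_masks))
    return expert_masks[idx]

# During training, the random-expert target is drawn fresh every epoch:
masks = [rng.integers(0, 2, size=(4, 4), dtype=np.uint8) for _ in range(3)]
target = random_expert_sample(masks)    # random-expert scheme
baseline = majority_vote(masks)         # majority-vote scheme
```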

Cross-Task Attention Network: Improving Multi-Task Learning for Medical Imaging Applications

  • paper_url: http://arxiv.org/abs/2309.03837
  • repo_url: None
  • paper_authors: Sangwook Kim, Thomas G. Purdie, Chris McIntosh
  • for: This paper aims to improve the performance of medical imaging tasks using a novel attention-based multi-task learning (MTL) framework.
  • methods: The proposed framework, the Cross-Task Attention Network (CTAN), utilizes cross-task attention mechanisms to incorporate information from multiple tasks and improve performance.
  • results: Compared to standard single-task learning (STL), CTAN demonstrated a 4.67% improvement in performance and outperformed two widely used MTL baselines, surpassing HPS by 3.22% and MTAN by 5.38%. These findings highlight the effectiveness of CTAN in improving the accuracy of medical imaging tasks across different domains.
    Abstract Multi-task learning (MTL) is a powerful approach in deep learning that leverages the information from multiple tasks during training to improve model performance. In medical imaging, MTL has shown great potential to solve various tasks. However, existing MTL architectures in medical imaging are limited in sharing information across tasks, reducing the potential performance improvements of MTL. In this study, we introduce a novel attention-based MTL framework to better leverage inter-task interactions for various tasks from pixel-level to image-level predictions. Specifically, we propose a Cross-Task Attention Network (CTAN) which utilizes cross-task attention mechanisms to incorporate information by interacting across tasks. We validated CTAN on four medical imaging datasets that span different domains and tasks including: radiation treatment planning prediction using planning CT images of two different target cancers (Prostate, OpenKBP); pigmented skin lesion segmentation and diagnosis using dermatoscopic images (HAM10000); and COVID-19 diagnosis and severity prediction using chest CT scans (STOIC). Our study demonstrates the effectiveness of CTAN in improving the accuracy of medical imaging tasks. Compared to standard single-task learning (STL), CTAN demonstrated a 4.67% improvement in performance and outperformed both widely used MTL baselines: hard parameter sharing (HPS) with an average performance improvement of 3.22%; and multi-task attention network (MTAN) with a relative decrease of 5.38%. These findings highlight the significance of our proposed MTL framework in solving medical imaging tasks and its potential to improve their accuracy across domains.
    摘要 多任务学习(MTL)是深度学习中的一种强大方法,利用多个任务的信息在训练中共享,以提高模型性能。在医疗影像领域,MTL已经实现了各种任务的解决。然而,现有的医疗影像MTL建 Architecture是有限的,它们在任务之间的信息共享上有所局限,从而减少了MTL的性能提升 potential.在本研究中,我们提出了一种新的注意力基于的MTL框架,以更好地利用多个任务之间的交互来提高各种任务的预测性能。具体来说,我们提出了一种交互式多任务注意力网络(CTAN),该网络通过交互式注意力机制来集成多个任务的信息。我们在四个医疗影像数据集上验证了CTAN,这些数据集包括了两种不同的目标肿瘤(肾癌和开口KBP)的规划计划预测、睫状皮肤损伤和诊断、以及COVID-19的诊断和严重程度预测。我们的研究表明,CTAN在医疗影像任务中的准确性得到了提高。相比于标准单任务学习(STL),CTAN在 average 上提高了4.67%的性能,并在多个MTL基线上超越了:硬件参数共享(HPS)的平均性能提升3.22%,以及多任务注意力网络(MTAN)的相对下降5.38%。这些发现表明了我们提出的MTL框架在解决医疗影像任务方面的重要性和其在不同领域中的可行性。
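
A generic cross-task attention block of the kind described above, in which one task's tokens attend to another task's tokens; this is an illustrative formulation, not the exact CTAN architecture.

```python
# Cross-task attention: the segmentation branch queries the classification
# branch (or vice versa) so information flows between task decoders.
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, task_a, task_b):
        # task_a, task_b: (batch, tokens, dim) feature maps flattened to tokens
        attended, _ = self.attn(query=task_a, key=task_b, value=task_b)
        return self.norm(task_a + attended)     # residual fusion

x_seg = torch.randn(2, 196, 256)   # e.g. segmentation branch tokens
x_cls = torch.randn(2, 196, 256)   # e.g. classification branch tokens
fused = CrossTaskAttention(256)(x_seg, x_cls)
```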

ArtHDR-Net: Perceptually Realistic and Accurate HDR Content Creation

  • paper_url: http://arxiv.org/abs/2309.03827
  • repo_url: None
  • paper_authors: Hrishav Bakul Barua, Ganesh Krishnasamy, KokSheik Wong, Kalin Stefanov, Abhinav Dhall
  • for: This paper addresses High Dynamic Range (HDR) content creation with an emphasis on preserving the artistic intent of images in terms of human visual perception.
  • methods: It proposes a Convolutional Neural Network based architecture, ArtHDR-Net, that takes multi-exposed LDR features as input.
  • results: Experiments show that ArtHDR-Net achieves state-of-the-art performance in terms of the HDR-VDP-2 score (mean opinion score index) while reaching competitive PSNR and SSIM.
    Abstract High Dynamic Range (HDR) content creation has become an important topic for modern media and entertainment sectors, gaming and Augmented/Virtual Reality industries. Many methods have been proposed to recreate the HDR counterparts of input Low Dynamic Range (LDR) images/videos given a single exposure or multi-exposure LDRs. The state-of-the-art methods focus primarily on the preservation of the reconstruction's structural similarity and the pixel-wise accuracy. However, these conventional approaches do not emphasize preserving the artistic intent of the images in terms of human visual perception, which is an essential element in media, entertainment and gaming. In this paper, we attempt to study and fill this gap. We propose an architecture called ArtHDR-Net based on a Convolutional Neural Network that uses multi-exposed LDR features as input. Experimental results show that ArtHDR-Net can achieve state-of-the-art performance in terms of the HDR-VDP-2 score (i.e., mean opinion score index) while reaching competitive performance in terms of PSNR and SSIM.
    摘要 高动态范围(HDR)内容创建已成为现代媒体和娱乐领域的重要话题,游戏和虚拟/增强现实领域。许多方法已经被提议,以便基于单张或多张抖动范围(LDR)图像/视频来重建HDR对应的Counterpart。当前的状态艺术方法主要关注重建结构的相似性和每个像素的准确率。然而,这些惯常的方法不强调保持图像的艺术意愿,即人类视觉的感知,这是媒体、娱乐和游戏领域的重要元素。在这篇论文中,我们尝试研究并填补这个空白。我们提出了一种 Architecture called ArtHDR-Net,基于卷积神经网络,使用多张抖动范围特征为输入。实验结果表明,ArtHDR-Net 可以达到当今最佳性能,而且与 PSNR 和 SSIM 的性能竞争。

T2IW: Joint Text to Image & Watermark Generation

  • paper_url: http://arxiv.org/abs/2309.03815
  • repo_url: None
  • paper_authors: An-An Liu, Guokai Zhang, Yuting Su, Ning Xu, Yongdong Zhang, Lanjun Wang
  • for: 这个研究旨在提出一个新的文本背景下的图像生成模型,以满足traceability、隐私保护和其他安全需求。
  • methods: 本研究使用文本与水印(T2IW)任务,强制semantic feature和水印信号在像素层次保持compatibility,并运用信息理论和非合作游戏理论分离图像和水印。
  • results: 实验结果显示本方法可以实现优秀的图像质量、水印隐藏和水印Robustness,并提出了一个新的评估指标集。
    Abstract Recent developments in text-conditioned image generative models have revolutionized the production of realistic results. Unfortunately, this has also led to an increase in privacy violations and the spread of false information, which requires the need for traceability, privacy protection, and other security measures. However, existing text-to-image paradigms lack the technical capabilities to link traceable messages with image generation. In this study, we introduce a novel task for the joint generation of text to image and watermark (T2IW). This T2IW scheme ensures minimal damage to image quality when generating a compound image by forcing the semantic feature and the watermark signal to be compatible in pixels. Additionally, by utilizing principles from Shannon information theory and non-cooperative game theory, we are able to separate the revealed image and the revealed watermark from the compound image. Furthermore, we strengthen the watermark robustness of our approach by subjecting the compound image to various post-processing attacks, with minimal pixel distortion observed in the revealed watermark. Extensive experiments have demonstrated remarkable achievements in image quality, watermark invisibility, and watermark robustness, supported by our proposed set of evaluation metrics.
    摘要 最近的文本conditioned图像生成模型的发展,使得生成真实的结果变得更加容易。然而,这也导致了隐私泄露和假信息的扩散,需要Traceability、隐私保护和其他安全措施。然而,现有的文本到图像的思维方法缺乏技术能力,将可追溯的消息与图像生成相连。在这项研究中,我们介绍了一种新的文本到图像和水印(T2IW)任务。这种T2IW方案确保在生成复合图像时,Semantic feature和水印信号在像素级别保持Compatible。此外,通过利用信息理论和非合作游戏理论,我们可以将复合图像中的Revealed image和Revealed watermark分离开。此外,我们通过对复合图像进行不同类型的后处理攻击,保持了Minimal pixel distortion在Revealed watermark中。广泛的实验证明了我们提出的方法在图像质量、隐私性和隐私稳定性方面具有很好的表现,支持我们提出的评价指标集。

Panoramas from Photons

  • paper_url: http://arxiv.org/abs/2309.03811
  • repo_url: None
  • paper_authors: Sacha Jungerman, Atul Ingle, Mohit Gupta
  • for: 能够在高速运动和低照度下重建场景,如果应用于虚拟现实、无人机导航和自动化机器人等领域。
  • methods: 使用聚合和排序框架,以 iteratively 提高运动估计。
  • results: 可以在高速运动和极低照度下创建高质量的全景图和超分辨率结果,使用自定义单 photon 摄像头原型。
    Abstract Scene reconstruction in the presence of high-speed motion and low illumination is important in many applications such as augmented and virtual reality, drone navigation, and autonomous robotics. Traditional motion estimation techniques fail in such conditions, suffering from too much blur in the presence of high-speed motion and strong noise in low-light conditions. Single-photon cameras have recently emerged as a promising technology capable of capturing hundreds of thousands of photon frames per second thanks to their high speed and extreme sensitivity. Unfortunately, traditional computer vision techniques are not well suited for dealing with the binary-valued photon data captured by these cameras because these are corrupted by extreme Poisson noise. Here we present a method capable of estimating extreme scene motion under challenging conditions, such as low light or high dynamic range, from a sequence of high-speed image frames such as those captured by a single-photon camera. Our method relies on iteratively improving a motion estimate by grouping and aggregating frames after-the-fact, in a stratified manner. We demonstrate the creation of high-quality panoramas under fast motion and extremely low light, and super-resolution results using a custom single-photon camera prototype. For code and supplemental material see our $\href{https://wisionlab.com/project/panoramas-from-photons/}{\text{project webpage}$.
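
A toy sketch of the group-and-aggregate idea: binary photon frames are summed in short groups to suppress Poisson noise, and shifts between groups are then estimated (here with simple phase correlation) as a starting point for iterative refinement. This is illustrative only, not the authors' estimator.

```python
import numpy as np

def aggregate_groups(frames, group_size):
    """frames: (T, H, W) binary photon frames -> (T // group_size, H, W) sums."""
    t = (frames.shape[0] // group_size) * group_size
    return frames[:t].reshape(-1, group_size, *frames.shape[1:]).sum(axis=1)

def phase_correlation_shift(a, b):
    """Estimate the integer translation between two images via phase correlation."""
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = fa * np.conj(fb)
    cross /= np.abs(cross) + 1e-9
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    shift = np.array(peak, dtype=float)
    for k in range(2):                      # wrap large shifts to negative values
        if shift[k] > a.shape[k] / 2:
            shift[k] -= a.shape[k]
    return shift                            # (row shift, column shift)

frames = (np.random.rand(400, 64, 64) < 0.05).astype(np.float32)   # toy photon data
groups = aggregate_groups(frames, group_size=50)
print(phase_correlation_shift(groups[0], groups[1]))
```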

SimNP: Learning Self-Similarity Priors Between Neural Points

  • paper_url: http://arxiv.org/abs/2309.03809
  • repo_url: None
  • paper_authors: Christopher Wewer, Eddy Ilg, Bernt Schiele, Jan Eric Lenssen
  • for: 本研究旨在提高3D物体重建的 neural field 表示,特别是利用对象级别表示来提高物体的细节质量。
  • methods: 我们提出了 SimNP 方法,它将 neural point radiance fields 与对象级别自相似表示相结合,以获得更高质量的重建结果。我们首次在 neural point 中实现了类别级别自相似表示,从而保留了本地支持的物体区域的高级别细节。此外,我们还学习了 neural point 之间的信息共享方式,以便在重建过程中提取未见区域的信息。
  • results: SimNP 方法能够在重建 symmetric 的未见区域时,超越基于类别级别或像素对齐的 radiance fields 方法,同时提供 semantic 对应关系 между实例。我们的实验结果表明,SimNP 能够在不同的物体类别和观察角度下实现更高质量的重建结果。
    Abstract Existing neural field representations for 3D object reconstruction either (1) utilize object-level representations, but suffer from low-quality details due to conditioning on a global latent code, or (2) are able to perfectly reconstruct the observations, but fail to utilize object-level prior knowledge to infer unobserved regions. We present SimNP, a method to learn category-level self-similarities, which combines the advantages of both worlds by connecting neural point radiance fields with a category-level self-similarity representation. Our contribution is two-fold. (1) We design the first neural point representation on a category level by utilizing the concept of coherent point clouds. The resulting neural point radiance fields store a high level of detail for locally supported object regions. (2) We learn how information is shared between neural points in an unconstrained and unsupervised fashion, which allows to derive unobserved regions of an object during the reconstruction process from given observations. We show that SimNP is able to outperform previous methods in reconstructing symmetric unseen object regions, surpassing methods that build upon category-level or pixel-aligned radiance fields, while providing semantic correspondences between instances
    摘要 现有的神经场表示方法 для三维物体重建都是(1)使用物体级别表示,但是因为conditioning于全局归一化代码而导致细节质量低下,或者(2)能够完美地重建观察数据,但是不能利用物体级别知识来推断未观察到的区域。我们提出了SimNP方法,它将神经点频谱场与类别级自相似表示相结合,以便结合两者的优点。我们的贡献有两个方面:1. 我们设计了首次基于类别水平的神经点表示,通过利用coherent点云概念。神经点频谱场中的高级别细节可以在支持本地物体区域时被存储。2. 我们学习了在无约束和无监督的情况下,神经点之间的信息共享方式,以便在重建过程中从观察数据中推断未观察到的区域。我们展示了SimNP方法能够在重建不见的对称区域方面超过前一代方法,而且提供semantic对应关系 между实例。

Deep Learning Safety Concerns in Automated Driving Perception

  • paper_url: http://arxiv.org/abs/2309.03774
  • repo_url: None
  • paper_authors: Stephanie Abrecht, Alexander Hirsch, Shervin Raafatnia, Matthias Woehrle
  • for: 本研究旨在提高自动驾驶系统中深度学习的应用,以确保系统的安全性。
  • methods: 本研究使用了安全问题的概念,以系统atic和全面地考虑深度学习模型在自动驾驶系统中的安全性。
  • results: 本研究提出了一种新的安全问题分类方法,以便跨功能团队共同解决问题。此外,本研究还运用了ISO 21448(SOTIF)和ISO PAS 8800等标准,以确保安全性。
    Abstract Recent advances in the field of deep learning and impressive performance of deep neural networks (DNNs) for perception have resulted in an increased demand for their use in automated driving (AD) systems. The safety of such systems is of utmost importance and thus requires to consider the unique properties of DNNs. In order to achieve safety of AD systems with DNN-based perception components in a systematic and comprehensive approach, so-called safety concerns have been introduced as a suitable structuring element. On the one hand, the concept of safety concerns is -- by design -- well aligned to existing standards relevant for safety of AD systems such as ISO 21448 (SOTIF). On the other hand, it has already inspired several academic publications and upcoming standards on AI safety such as ISO PAS 8800. While the concept of safety concerns has been previously introduced, this paper extends and refines it, leveraging feedback from various domain and safety experts in the field. In particular, this paper introduces an additional categorization for a better understanding as well as enabling cross-functional teams to jointly address the concerns.

$L_{2,1}$-Norm Regularized Quaternion Matrix Completion Using Sparse Representation and Quaternion QR Decomposition

  • paper_url: http://arxiv.org/abs/2309.03764
  • repo_url: None
  • paper_authors: Juan Han, Kit Ian Kou, Jifei Miao, Lizhi Liu, Haojiang Li
  • for: color image completion
  • methods: quaternion QR decomposition (QQR) and quaternion $L_{2,1}$-norm (QLNM-QQR), iteratively reweighted quaternion $L_{2,1}$-norm minimization (IRQLNM-QQR), and quaternion $L_{2,1}$-norm with sparse regularization (QLNM-QQR-SR)
  • results: outperforms QLNM-QQR and superior to several state-of-the-art methods on natural color images and color medical images
    Abstract Color image completion is a challenging problem in computer vision, but recent research has shown that quaternion representations of color images perform well in many areas. These representations consider the entire color image and effectively utilize coupling information between the three color channels. Consequently, low-rank quaternion matrix completion (LRQMC) algorithms have gained significant attention. We propose a method based on quaternion QR decomposition (QQR) and quaternion $L_{2,1}$-norm called QLNM-QQR. This new approach reduces computational complexity by avoiding the need to calculate the QSVD of large quaternion matrices. We also present two improvements to the QLNM-QQR method: an enhanced version called IRQLNM-QQR that uses iteratively reweighted quaternion $L_{2,1}$-norm minimization and a method called QLNM-QQR-SR that integrates sparse regularization. Our experiments on natural color images and color medical images show that IRQLNM-QQR outperforms QLNM-QQR and that the proposed QLNM-QQR-SR method is superior to several state-of-the-art methods.
    摘要 图像颜色填充是计算机视觉领域的一个挑战,但最近的研究表明,使用四元数表示方法在许多领域表现良好。这些表示方法考虑整个颜色图像,并有效地利用三个颜色通道之间的相关信息。因此,低级四元数矩阵 completion(LRQMC)算法在获得了重要的注意力。我们提出了基于四元数卡塔瑞yal decompositon(QQR)和四元数L2,1-norm的方法QLNM-QQR。这种新的方法可以避免计算大四元数矩阵QSVD的需要,从而降低计算复杂性。我们还提出了两种改进QLNM-QQR方法:一种叫做IRQLNM-QQR,使用迭代重量四元数L2,1-norm最小化方法;另一种叫做QLNM-QQR-SR, integrate sparse regularization。我们对自然色图像和医疗颜色图像进行实验,发现IRQLNM-QQR方法比QLNM-QQR方法表现更好,而QLNM-QQR-SR方法在许多状态流行方法之上表现更出色。
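
For reference, a commonly used definition of the quaternion $L_{2,1}$-norm together with the completion constraint it is typically paired with; the notation is generic and may differ from the paper's exact objective:

$$\|\mathbf{Q}\|_{2,1}=\sum_{j=1}^{n}\sqrt{\sum_{i=1}^{m}\lvert q_{ij}\rvert^{2}},\qquad P_{\Omega}(\mathbf{X})=P_{\Omega}(\mathbf{M}),$$

where $\lvert q_{ij}\rvert$ is the quaternion modulus and $P_{\Omega}$ keeps only the observed entries of the incomplete quaternion matrix $\mathbf{M}$ representing the color image.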

dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test

  • paper_url: http://arxiv.org/abs/2309.03763
  • repo_url: None
  • paper_authors: Johannes Flotzinger, Philipp J. Rösch, Norbert Oswald, Thomas Braml
  • for: 本研究旨在提高桥梁材料损害识别精度,以确保结构完整性、交通安全和持续使用性。
  • methods: 本研究使用多种开源数据集合(meta datasets)进行模型训练,并对模型在真实世界中的应用进行评估。
  • results: 研究发现,使用meta datasets进行训练后,模型在新的bridge损害识别任务中表现出了实用性,最佳模型的准确率达32%。此外,研究还发现模型学习的是否分类数据集或损害类型,而不是具体的bridge损害类型。
    Abstract Recognising reinforced concrete defects (RCDs) is a crucial element for determining the structural integrity, traffic safety and durability of bridges. However, most of the existing datasets in the RCD domain are derived from a small number of bridges acquired in specific camera poses, lighting conditions and with fixed hardware. These limitations question the usability of models trained on such open-source data in real-world scenarios. We address this problem by testing such models on our "dacl1k" dataset, a highly diverse RCD dataset for multi-label classification based on building inspections including 1,474 images. Thereby, we trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically. During extrinsic evaluation, we report metrics on dacl1k and the meta datasets. The performance analysis on dacl1k shows practical usability of the meta data, where the best model shows an Exact Match Ratio of 32%. Additionally, we conduct an intrinsic evaluation by clustering the bottleneck features of the best model derived from the extrinsic evaluation in order to find out, if the model has learned distinguishing datasets or the classes (RCDs) which is the aspired goal. The dacl1k dataset and our trained models will be made publicly available, enabling researchers and practitioners to put their models to the real-world test.
    摘要 识别强化混凝土缺陷(RCD)是bridge的结构完整性、交通安全和持续性的关键因素。然而,现有的RCD领域数据集大多来自少量桥梁,特定的摄像机位置和照明条件下获取的数据。这些限制问题在实际场景中使用模型的可用性。我们解决这个问题,通过在“dacl1k”数据集上测试这些模型,这是一个多标签分类的RCD数据集,包含1,474张图像。我们在不同的开源数据集(meta数据)上训练了模型,然后对这些meta数据进行了外部和内部评估。在外部评估中,我们对dacl1k和meta数据进行了度量。我们发现,使用meta数据可以实现实际场景中的实用性,最佳模型的准确匹配率达32%。此外,我们进行了内部评估,将最佳模型的瓶颈特征分组,以确定是否模型已经学习到了不同的数据集或RCD类别,这是我们的目标。dacl1k数据集和我们训练的模型将公开提供,allowing researchers和实践者可以在实际场景中测试他们的模型。
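
The Exact Match Ratio reported above is a strict multi-label metric: an image counts only if every one of its labels is predicted correctly. A minimal sketch of how it is computed (array names are illustrative):

```python
import numpy as np

def exact_match_ratio(y_true, y_pred):
    """y_true, y_pred: (n_samples, n_labels) binary indicator arrays."""
    return np.mean(np.all(y_true == y_pred, axis=1))

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
print(exact_match_ratio(y_true, y_pred))   # 2 of 3 rows match exactly -> 0.667
```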

M(otion)-mode Based Prediction of Ejection Fraction using Echocardiograms

  • paper_url: http://arxiv.org/abs/2309.03759
  • repo_url: https://github.com/thomassutter/mmodeecho
  • paper_authors: Ece Ozkan, Thomas M. Sutter, Yurong Hu, Sebastian Balzer, Julia E. Vogt
  • for: 早期检测心脏功能异常,通过常规检查是诊断心血管疾病的关键。心脏功能指数下降,是心肺病的重要指标。
  • methods: 我们使用M模式电子心图来估算心脏功能指数和诊断心肺病。我们生成了多个人工M模式图像,并将其组合使用商业化模型架构。此外,我们将对比学习(CL)应用于卡达着影像识别,从不标注数据中提取有意义的特征,以达到高精度。
  • results: 我们的实验表明,使用M模式图像和对比学习可以在只有10个模式下达到高精度,与基线方法相当,而且计算上 much more efficient。此外,CL使用M模式图像在有限数据 scenarios(例如,只有200个标注患者)中非常有用。
    Abstract Early detection of cardiac dysfunction through routine screening is vital for diagnosing cardiovascular diseases. An important metric of cardiac function is the left ventricular ejection fraction (EF), where lower EF is associated with cardiomyopathy. Echocardiography is a popular diagnostic tool in cardiology, with ultrasound being a low-cost, real-time, and non-ionizing technology. However, human assessment of echocardiograms for calculating EF is time-consuming and expertise-demanding, raising the need for an automated approach. In this work, we propose using the M(otion)-mode of echocardiograms for estimating the EF and classifying cardiomyopathy. We generate multiple artificial M-mode images from a single echocardiogram and combine them using off-the-shelf model architectures. Additionally, we extend contrastive learning (CL) to cardiac imaging to learn meaningful representations from exploiting structures in unlabeled data allowing the model to achieve high accuracy, even with limited annotations. Our experiments show that the supervised setting converges with only ten modes and is comparable to the baseline method while bypassing its cumbersome training process and being computationally much more efficient. Furthermore, CL using M-mode images is helpful for limited data scenarios, such as having labels for only 200 patients, which is common in medical applications.
    摘要 早期检测心脏功能不正常的 Routine 检查是诊断冠状病的关键。一个重要的心脏功能指标是左心室泵出率(EF),其中低EF 与心肺病有关。寿命成像是卡地里诊断工具中最受欢迎的一种,它是一种低成本、实时、不 ionizing 技术。然而,人类对成像进行 EF 计算是时间消耗和专业需求高的,从而需要自动化的方法。在这项工作中,我们提议使用 M(动作)模式成像来估算 EF 和诊断心肺病。我们生成多个人工 M-模式成像从单个成像,并将它们组合使用商业化的模型架构。此外,我们扩展了对比学习(CL)到卡地里成像,以学习有用的表示。我们的实验表明,在监督设定下,只需要使用十个模式,可以与基eline 方法相当,而且可以快速 converges。此外, CL 使用 M-模式成像在有限数据场景下是有帮助的,例如只有200个患者的标签。
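
A sketch of how an artificial M-mode image can be generated from a B-mode echo clip by sampling a scan line over time; the line-sampling strategy here (random angles through the image center) is an assumption for illustration, not necessarily the paper's.

```python
import numpy as np

def mmode_from_clip(clip, angle_deg, num_points=128):
    """clip: (T, H, W) grayscale video. Returns a (num_points, T) M-mode image."""
    T, H, W = clip.shape
    cy, cx = H / 2, W / 2
    theta = np.deg2rad(angle_deg)
    r = np.linspace(-min(H, W) / 2 + 1, min(H, W) / 2 - 1, num_points)
    ys = np.clip((cy + r * np.sin(theta)).astype(int), 0, H - 1)
    xs = np.clip((cx + r * np.cos(theta)).astype(int), 0, W - 1)
    return clip[:, ys, xs].T           # space along rows, time along columns

clip = np.random.rand(32, 112, 112).astype(np.float32)   # toy echo clip
mmodes = [mmode_from_clip(clip, a) for a in np.random.uniform(0, 180, size=10)]
```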

PBP: Path-based Trajectory Prediction for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.03750
  • repo_url: None
  • paper_authors: Sepideh Afshar, Nachiket Deo, Akshay Bhagat, Titas Chakraborty, Yunming Shao, Balarama Raju Buddharaju, Adwait Deshpande, Henggang Cui
  • for: 提高自动驾驶栈中的路径预测精度,使自动驾驶车辆更好地预测周围agent的运动轨迹。
  • methods: 提出了Path-based prediction(PBP)方法,通过使用HD地图中的参考路径特征和路径相对尼采抽象框架来预测路径。
  • results: 在Argoverse数据集上应用PBP trajectory decoder,与标准路径预测指标具有竞争性表现,同时在map compliance方面显著超过了现有基eline。
    Abstract Trajectory prediction plays a crucial role in the autonomous driving stack by enabling autonomous vehicles to anticipate the motion of surrounding agents. Goal-based prediction models have gained traction in recent years for addressing the multimodal nature of future trajectories. Goal-based prediction models simplify multimodal prediction by first predicting 2D goal locations of agents and then predicting trajectories conditioned on each goal. However, a single 2D goal location serves as a weak inductive bias for predicting the whole trajectory, often leading to poor map compliance, i.e., part of the trajectory going off-road or breaking traffic rules. In this paper, we improve upon goal-based prediction by proposing the Path-based prediction (PBP) approach. PBP predicts a discrete probability distribution over reference paths in the HD map using the path features and predicts trajectories in the path-relative Frenet frame. We applied the PBP trajectory decoder on top of the HiVT scene encoder and report results on the Argoverse dataset. Our experiments show that PBP achieves competitive performance on the standard trajectory prediction metrics, while significantly outperforming state-of-the-art baselines in terms of map compliance.
    摘要 干线预测在自动驾驶栈中扮演着关键的角色,帮助自动车辆预测周围的agent的运动。目标基于预测模型在过去几年中得到了广泛应用,因为它可以简化未来轨迹的多样性。目标基于预测模型首先预测了 agent 的2D目标位置,然后预测了根据每个目标的轨迹。然而,单个2D目标位置通常是轨迹预测的弱 inductive bias,导致轨迹偏离路径,例如车辆离路或违反交通规则。在这篇论文中,我们提出了Path-based prediction(PBP)方法,该方法预测了HD地图中参考路径的抽象概率分布,然后预测了路径相对射线帧中的轨迹。我们在HiVT场景编码器之上应用了PBP轨迹解码器,并在Argoverse数据集上进行了实验。我们的实验结果显示,PBP在标准轨迹预测指标上达到了竞争性的表现,而与当前领先的基elines在地图兼容性方面表现出了显著优势。
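
A minimal sketch of the path-relative Frenet representation used above: each trajectory point is described by its arc length along the reference path and a signed lateral offset. The nearest-vertex projection below is a simplification for illustration.

```python
import numpy as np

def to_frenet(traj, path):
    """traj: (T, 2) xy points; path: (P, 2) reference polyline from the HD map."""
    seg = np.diff(path, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    s_at_vertex = np.concatenate([[0.0], np.cumsum(seg_len)])
    out = []
    for p in traj:
        i = int(np.argmin(np.linalg.norm(path - p, axis=1)))   # nearest vertex
        i = min(i, len(seg) - 1)
        tangent = seg[i] / (seg_len[i] + 1e-9)
        rel = p - path[i]
        s = s_at_vertex[i] + np.dot(rel, tangent)               # longitudinal coordinate
        d = tangent[0] * rel[1] - tangent[1] * rel[0]           # signed lateral offset
        out.append((s, d))
    return np.array(out)

path = np.stack([np.linspace(0, 50, 100), np.zeros(100)], axis=1)   # straight road
traj = np.array([[1.0, 0.5], [5.0, -0.2], [10.0, 0.1]])
print(to_frenet(traj, path))
```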

Label-efficient Contrastive Learning-based model for nuclei detection and classification in 3D Cardiovascular Immunofluorescent Images

  • paper_url: http://arxiv.org/abs/2309.03744
  • repo_url: None
  • paper_authors: Nazanin Moradinasab, Rebecca A. Deaton, Laura S. Shankman, Gary K. Owens, Donald E. Brown
  • for: 这个研究旨在开发一个 Label-efficient Contrastive learning-based (LECL) 模型,用于检测和类别各种类型的核lei在3D免疫染色图像中。
  • methods: 这个模型使用 Extended Maximum Intensity Projection (EMIP) 方法来解决多层对称投影问题,并使用 Supervised Contrastive Learning (SCL) 方法在弱监督情况下进行训练。
  • results: 在心血管数据集上进行实验,发现这个提案的框架具有高效和高精度地检测和类别各种类型的核lei在3D免疫染色图像中。
    Abstract Recently, deep learning-based methods achieved promising performance in nuclei detection and classification applications. However, training deep learning-based methods requires a large amount of pixel-wise annotated data, which is time-consuming and labor-intensive, especially in 3D images. An alternative approach is to adapt weak-annotation methods, such as labeling each nucleus with a point, but this method does not extend from 2D histopathology images (for which it was originally developed) to 3D immunofluorescent images. The reason is that 3D images contain multiple channels (z-axis) for nuclei and different markers separately, which makes training using point annotations difficult. To address this challenge, we propose the Label-efficient Contrastive learning-based (LECL) model to detect and classify various types of nuclei in 3D immunofluorescent images. Previous methods use Maximum Intensity Projection (MIP) to convert immunofluorescent images with multiple slices to 2D images, which can cause signals from different z-stacks to falsely appear associated with each other. To overcome this, we devised an Extended Maximum Intensity Projection (EMIP) approach that addresses issues using MIP. Furthermore, we performed a Supervised Contrastive Learning (SCL) approach for weakly supervised settings. We conducted experiments on cardiovascular datasets and found that our proposed framework is effective and efficient in detecting and classifying various types of nuclei in 3D immunofluorescent images.
    摘要 最近,深度学习基本方法在蛋白检测和分类应用中获得了可观的表现。然而,训练深度学习基本方法需要大量的像素级别标注数据,这是时间消耗和劳动密集的,特别是在3D图像上。一种代替方法是采用弱标注方法,如每个核体只需标注一点,但这种方法不能从2D histopathology图像(它原本是设计的)扩展到3D抗体图像。原因是3D图像包含多个通道(z轴),这些通道分别包含核体和不同的标签,因此使用点标注训练困难。为解决这个挑战,我们提出了 Label-efficient Contrastive learning-based (LECL) 模型,用于检测和分类3D抗体图像中的多种核体。以前的方法使用 Maximum Intensity Projection (MIP) 将多层抗体图像转换成2D图像,这可能会使得不同的z堆叠的信号错误地显示为相关的。为解决这个问题,我们开发了 Extended Maximum Intensity Projection (EMIP) 方法,解决了 MIP 中的问题。另外,我们采用了 Supervised Contrastive Learning (SCL) 方法在弱监督设定下进行训练。我们在循环系统数据集上进行了实验,发现我们提出的框架是有效和高效的,用于检测和分类3D抗体图像中的多种核体。
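
For context, the standard Maximum Intensity Projection that the EMIP module above improves on is just a per-pixel maximum over the z-stack; a minimal sketch follows (the EMIP variant itself is not reproduced here).

```python
import numpy as np

def max_intensity_projection(volume):
    """volume: (Z, H, W) single-channel 3D immunofluorescent stack."""
    # Each output pixel keeps the brightest value across slices, which is what
    # can falsely merge signals coming from different z-stacks.
    return volume.max(axis=0)

volume = np.random.rand(40, 256, 256).astype(np.float32)
mip = max_intensity_projection(volume)      # (256, 256) 2D image
```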

ClusterFusion: Leveraging Radar Spatial Features for Radar-Camera 3D Object Detection in Autonomous Vehicles

  • paper_url: http://arxiv.org/abs/2309.03734
  • repo_url: None
  • paper_authors: Irfan Tito Kurniawan, Bambang Riyanto Trilaksono
  • for: This study explores how to exploit the radar's local spatial and point-wise features by extracting features directly from clustered radar point clouds to improve radar-camera 3D object detection.
  • methods: The method clusters the radar point cloud, extracts features from each cluster, and then projects these features onto the image plane for cross-modal feature fusion.
  • results: The method achieves a 48.7% nuScenes detection score (NDS) on the nuScenes test slice, state-of-the-art performance among radar-monocular camera 3D object detection methods.
    Abstract Thanks to the complementary nature of millimeter wave radar and camera, deep learning-based radar-camera 3D object detection methods may reliably produce accurate detections even in low-visibility conditions. This makes them preferable to use in autonomous vehicles' perception systems, especially as the combined cost of both sensors is cheaper than the cost of a lidar. Recent radar-camera methods commonly perform feature-level fusion which often involves projecting the radar points onto the same plane as the image features and fusing the extracted features from both modalities. While performing fusion on the image plane is generally simpler and faster, projecting radar points onto the image plane flattens the depth dimension of the point cloud which might lead to information loss and makes extracting the spatial features of the point cloud harder. We proposed ClusterFusion, an architecture that leverages the local spatial features of the radar point cloud by clustering the point cloud and performing feature extraction directly on the point cloud clusters before projecting the features onto the image plane. ClusterFusion achieved the state-of-the-art performance among all radar-monocular camera methods on the test slice of the nuScenes dataset with 48.7% nuScenes detection score (NDS). We also investigated the performance of different radar feature extraction strategies on point cloud clusters: a handcrafted strategy, a learning-based strategy, and a combination of both, and found that the handcrafted strategy yielded the best performance. The main goal of this work is to explore the use of radar's local spatial and point-wise features by extracting them directly from radar point cloud clusters for a radar-monocular camera 3D object detection method that performs cross-modal feature fusion on the image plane.
    摘要 Due to the complementary nature of millimeter wave radar and camera, deep learning-based radar-camera 3D object detection methods can produce accurate detections even in low-visibility conditions. This makes them more suitable for use in autonomous vehicles' perception systems, as the combined cost of both sensors is lower than the cost of a lidar. Recent radar-camera methods commonly perform feature-level fusion, which involves projecting the radar points onto the same plane as the image features and fusing the extracted features from both modalities. However, projecting radar points onto the image plane flattens the depth dimension of the point cloud, which may lead to information loss and makes extracting the spatial features of the point cloud more difficult. To address this issue, we proposed ClusterFusion, an architecture that leverages the local spatial features of the radar point cloud by clustering the point cloud and performing feature extraction directly on the point cloud clusters before projecting the features onto the image plane. ClusterFusion achieved the state-of-the-art performance among all radar-monocular camera methods on the test slice of the nuScenes dataset with 48.7% nuScenes detection score (NDS). We also investigated the performance of different radar feature extraction strategies on point cloud clusters, including a handcrafted strategy, a learning-based strategy, and a combination of both, and found that the handcrafted strategy yielded the best performance. The main goal of this work is to explore the use of radar's local spatial and point-wise features by extracting them directly from radar point cloud clusters for a radar-monocular camera 3D object detection method that performs cross-modal feature fusion on the image plane.
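
A minimal sketch of the clustering step described above, using DBSCAN to group radar points before extracting per-cluster features; the parameters and feature choices are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(200, 4) * [50, 50, 3, 10]     # x, y, z, radial velocity
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(points[:, :2])

clusters = []
for k in set(labels) - {-1}:                           # -1 marks noise points
    members = points[labels == k]
    clusters.append({
        "centroid": members[:, :3].mean(axis=0),       # local spatial feature
        "mean_velocity": members[:, 3].mean(),         # point-wise feature
        "num_points": len(members),
    })
print(len(clusters), "radar clusters")
```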

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption

  • paper_url: http://arxiv.org/abs/2309.03729
  • repo_url: https://github.com/sjtuplayer/few-shot-diffusion
  • paper_authors: Teng Hu, Jiangning Zhang, Liang Liu, Ran Yi, Siqi Kou, Haokun Zhu, Xu Chen, Yabiao Wang, Chengjie Wang, Lizhuang Ma
  • for: This work addresses training generative models when only a handful of target-domain samples are available (few-shot model adaptation), a setting in which networks tend to overfit and suffer content degradation.
  • methods: It introduces a phasic content fusing few-shot diffusion model with two new objectives: phasic content fusion, which learns content and style information at large diffusion steps and local target-domain details at small steps, and a directional distribution consistency loss; a cross-domain structure guidance strategy further enforces structural consistency.
  • results: Experiments show that the proposed method reduces content degradation and improves structural consistency in few-shot adaptation, outperforming prior approaches.
    Abstract Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. The source code is available at: https://github.com/sjtuplayer/few-shot-diffusion.
    摘要 训练一个生成模型具有有限样本的任务是一项具有挑战性的任务。当前方法主要依靠几 shot 模型适应来训练网络。然而,在数据非常有限( menos de 10)的场景下,生成网络往往遇到过拟合和内容下降的问题。为解决这些问题,我们提出了一种新的phasic content fusion few-shot diffusion model,具有方向分布一致损失,可以在不同的训练阶段对 diffusion model 进行不同的学习目标。具体来说,我们设计了phasic 训练策略,通过phasic content fusion来帮助我们的模型在 t 大的时候学习内容和风格信息,并在 t 小的时候学习目标频道的本地细节,从而改善内容、风格和本地细节的捕捉。此外,我们引入了一种新的方向分布一致损失,可以更有效和稳定地保证生成的结果与源分布的一致性,避免模型过拟合。最后,我们提出了一种跨频道结构引导策略,可以在适应频道中提高结构一致性。理论分析、质量和量测试表明,我们的方法在几 shot 生成模型适应任务中比 state-of-the-art 方法更高效。模型代码可以在 GitHub 上获取:https://github.com/sjtuplayer/few-shot-diffusion。

Interpretable Visual Question Answering via Reasoning Supervision

  • paper_url: http://arxiv.org/abs/2309.03726
  • repo_url: None
  • paper_authors: Maria Parelli, Dimitrios Mallis, Markos Diomataris, Vassilis Pitsikalis
  • for: 提高模型在视觉问答任务中的视觉固定能力,使其更好地理解问题和图像之间的关系。
  • methods: 使用常识逻辑作为监督信号,通过文本证明来提供Visual Common Sense Reasoning(VCR)数据集上已有的批注来帮助模型更好地理解问题和图像之间的关系。
  • results: 经验表明,提出的方法可以帮助模型更好地理解问题和图像之间的关系,不需要训练显式固定注解。
    Abstract Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase, without requiring training on explicit grounding annotations.
    摘要 带有转换器基础的架构在视觉问答任务中表现出了惊人的表现。然而,这些模型可能会忽略重要的视觉指示和依赖于多Modal短cut和语言modal的自然偏见来预测正确答案,这被称为视觉不归顺。在这项工作中,我们通过一种新的视觉问答架构来解决这个缺陷,该架构利用了通用理解作为监督信号。理解监督信号的形式是一个问题和正确答案的文本证明,这些注释已经在大规模的视觉通用理解(VCR)数据集上可以获得。我们的视觉注意力通过一种相似损失来引导,使得问题和正确答案的学习的视觉注意力 Distributions相似。我们示出了both量化和质量上的表述,表明我们的方法可以增强模型的视觉感知能力,不需要训练显式归顺注释。
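
A minimal sketch of the kind of attention-alignment objective described above, assuming region-level attention distributions derived from the question and from the textual justification; the paper's exact similarity loss may differ.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn_question, attn_reasoning):
    """Both inputs: (batch, regions) attention weights over image regions."""
    p = F.normalize(attn_question, dim=-1)
    q = F.normalize(attn_reasoning, dim=-1)
    return (1.0 - (p * q).sum(dim=-1)).mean()    # 1 - cosine similarity

attn_q = torch.softmax(torch.randn(8, 36), dim=-1)   # attention driven by the question
attn_r = torch.softmax(torch.randn(8, 36), dim=-1)   # attention driven by the reasoning text
loss = attention_alignment_loss(attn_q, attn_r)
```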

A boundary-aware point clustering approach in Euclidean and embedding spaces for roof plane segmentation

  • paper_url: http://arxiv.org/abs/2309.03722
  • repo_url: None
  • paper_authors: Li Li, Qingqing Li, Guozheng Xu, Pengwei Zhou, Jingmin Tu, Jie Li, Jian Yao
  • for: 本研究旨在提高空拍LiDAR点云数据中的瓦片面分割精度,提供更高精度的3D建筑模型重建。
  • methods: 本研究提出了一种边缘意识点云划分方法,包括三个分支网络:一个用于预测 semantic labels、点偏移和深度嵌入特征,第二个用于预测点偏移,第三个用于确保点云实例的嵌入特征相似。
  • results: 实验结果显示,提出的方法significantly outperforms 现有的状态之最方法。
    Abstract Roof plane segmentation from airborne LiDAR point clouds is an important technology for 3D building model reconstruction. One of the key issues of plane segmentation is how to design powerful features that can exactly distinguish adjacent planar patches. The quality of point feature directly determines the accuracy of roof plane segmentation. Most of existing approaches use handcrafted features to extract roof planes. However, the abilities of these features are relatively low, especially in boundary area. To solve this problem, we propose a boundary-aware point clustering approach in Euclidean and embedding spaces constructed by a multi-task deep network for roof plane segmentation. We design a three-branch network to predict semantic labels, point offsets and extract deep embedding features. In the first branch, we classify the input data as non-roof, boundary and plane points. In the second branch, we predict point offsets for shifting each point toward its respective instance center. In the third branch, we constrain that points of the same plane instance should have the similar embeddings. We aim to ensure that points of the same plane instance are close as much as possible in both Euclidean and embedding spaces. However, although deep network has strong feature representative ability, it is still hard to accurately distinguish points near plane instance boundary. Therefore, we first group plane points into many clusters in the two spaces, and then we assign the rest boundary points to their closest clusters to generate final complete roof planes. In this way, we can effectively reduce the influence of unreliable boundary points. In addition, we construct a synthetic dataset and a real dataset to train and evaluate our approach. The experiments results show that the proposed approach significantly outperforms the existing state-of-the-art approaches.
    摘要 《顶面面Segmentation从空中LiDAR点云是重要的三维建筑模型重建技术。一个关键问题是如何设计强大的特征来准确分辨邻近的平面 patches。点云特征质量直接影响顶面面Segmentation的准确性。大多数现有方法使用手动设计的特征来抽取顶面平面。然而,这些特征的能力相对较低,特别是在边缘区域。为解决这个问题,我们提出了一种边缘意识点云 clustering方法,通过多任务深度网络进行顶面面Segmentation。我们设计了三枝网络, Predict semantic labels, point offsets和EXTRACT deep embedding features。在第一枝网络中,我们将输入数据分类为非顶面、边界和平面点。在第二枝网络中,我们预测每个点的偏移量,以将每个点向其实例中心偏移。在第三枝网络中,我们强制实例中心的点需要具有类似的嵌入特征。我们希望通过这种方式,实例中心的点可以在Euclidean和嵌入空间中保持最近。然而,虽然深度网络具有强大的特征表示能力,但是仍然困难准确分辨边缘区域的点。因此,我们首先将平面点 grouping到多个cluster中,然后将边缘点分配到最近的cluster中,以生成最终的完整的顶面面。这种方法可以有效地减少边缘点的影响。此外,我们还构建了一个 sintetic dataset和一个实际 dataset,以用于训练和评估我们的方法。实验结果表明,我们的方法在与现有状态作准的方法进行比较时,表现出了显著的优势。
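
A sketch of the inference-time grouping described above: plane points are shifted by their predicted offsets, clustered, and the remaining boundary points are attached to the nearest cluster. The clustering choice and thresholds are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_roof_planes(points, offsets, is_plane, is_boundary):
    shifted = points[is_plane] + offsets[is_plane]            # shift toward instance centers
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(shifted)

    instance = np.full(len(points), -1)
    instance[np.where(is_plane)[0]] = labels

    cluster_ids = sorted(set(labels) - {-1})
    centers = np.array([shifted[labels == k].mean(axis=0) for k in cluster_ids])
    for i in np.where(is_boundary)[0]:                        # attach boundary points
        instance[i] = cluster_ids[int(np.argmin(np.linalg.norm(centers - points[i], axis=1)))]
    return instance

# Toy demo: two well-separated roof planes plus a few "boundary" points
pts = np.vstack([np.random.randn(50, 3) * 0.1, np.random.randn(50, 3) * 0.1 + 5])
offs = np.zeros_like(pts)
plane_mask = np.ones(100, dtype=bool)
plane_mask[::10] = False
print(group_roof_planes(pts, offs, plane_mask, ~plane_mask))
```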

DiffDefense: Defending against Adversarial Attacks via Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.03702
  • repo_url: https://github.com/hondamunigeprasannasilva/diffdefence
  • paper_authors: Hondamunige Prasanna Silva, Lorenzo Seidenari, Alberto Del Bimbo
  • for: Protect machine learning classifiers from attacks
  • methods: Leveraging Diffusion Models for enhanced defense
  • results: Provides a robust defense that preserves clean accuracy, speed, and plug-and-play compatibility while resisting adversarial attacks.
    Abstract This paper presents a novel reconstruction method that leverages Diffusion Models to protect machine learning classifiers against adversarial attacks, all without requiring any modifications to the classifiers themselves. The susceptibility of machine learning models to minor input perturbations renders them vulnerable to adversarial attacks. While diffusion-based methods are typically disregarded for adversarial defense due to their slow reverse process, this paper demonstrates that our proposed method offers robustness against adversarial threats while preserving clean accuracy, speed, and plug-and-play compatibility. Code at: https://github.com/HondamunigePrasannaSilva/DiffDefence.
    摘要 这篇论文提出了一种新的重建方法,利用傅里叶模型来保护机器学习分类器免受抗击攻击,而无需对分类器本身进行任何修改。由于机器学习模型对输入小变化很敏感,因此它们面临着抗击攻击的威胁。尽管傅里叶基本方法通常不被视为对抗攻击的有效方法,但这篇论文表明,我们的提议方法可以提供对抗攻击的坚固性,保持清晰率、速度和插件兼容性。代码可以在 GitHub 上找到:https://github.com/HondamunigePrasannaSilva/DiffDefence。
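
A conceptual sketch of diffusion-based purification: the (possibly adversarial) input is pushed part-way along the forward noising process and then denoised back with the reverse steps before classification. The `denoiser` is a placeholder for a pretrained noise-prediction network, and the update rule follows the standard DDPM formulation rather than necessarily DiffDefense's exact reconstruction procedure.

```python
import torch

def purify(x, denoiser, betas, t_star=200):
    """betas: 1D noise schedule with len(betas) > t_star; x: image batch."""
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    # Forward process: jump directly to step t_star in closed form
    noise = torch.randn_like(x)
    xt = abar[t_star].sqrt() * x + (1 - abar[t_star]).sqrt() * noise
    # Reverse process: iterate the DDPM update back to t = 0
    for t in range(t_star, 0, -1):
        eps = denoiser(xt, torch.tensor([t]))
        mean = (xt - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        xt = mean + betas[t].sqrt() * torch.randn_like(xt) if t > 1 else mean
    return xt

# Typical usage (placeholders): betas = torch.linspace(1e-4, 0.02, 1000)
# x_purified = purify(x_adversarial, denoiser, betas); logits = classifier(x_purified)
```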

Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory

  • paper_url: http://arxiv.org/abs/2309.03696
  • repo_url: https://github.com/ltttpku/ada-cm
  • paper_authors: Ting Lei, Fabian Caba, Qingchao Chen, Hailin Jin, Yuxin Peng, Yang Liu
  • for: This work proposes an efficient and accurate human-object interaction (HOI) detector that addresses the challenges of HOI detection in realistic settings, such as the performance drop on rare classes and the high computational cost of long-tailed HOI distributions.
  • methods: The method builds on large vision-language models (VLMs) and offers two operating modes: a training-free mode that requires no new parameters, and an instance-aware adapter mode that fine-tunes a lightweight set of parameters.
  • results: The approach achieves results competitive with the state of the art on the HICO-DET and V-COCO datasets with much less training time.
    Abstract Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and the high computational cost and time required to handle long-tailed distributions of HOIs in complex HOI scenes in realistic settings. This observation motivates us to design an HOI detector that can be trained even with long-tailed labeled data and can leverage existing knowledge from pre-trained models. Inspired by the powerful generalization ability of the large Vision-Language Models (VLM) on classification and retrieval tasks, we propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Its second mode incorporates an instance-aware adapter mechanism that can further efficiently boost performance if updating a lightweight set of parameters can be afforded. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time. Code can be found at https://github.com/ltttpku/ADA-CM.
    摘要 人物物体交互(HOI)检测的目标是确定人类和物体之间的位置和关系。然而,从头scratch开始训练超级vised模型 для此任务可能会遇到困难,主要是因为罕见的类型下的性能下降和复杂的HOI场景中的长尾分布。这个问题驱动我们设计一种可以很好地处理长尾分布的HOI检测器。我们提出了一种基于大型视力语言模型(VLM)的强大泛化能力的Adaptive HOI Detector with Concept-guided Memory(ADA-CM)。ADA-CM有两种运作模式。首先,它可以在没有学习新参数的情况下进行调整。其第二种运作模式包括一个实例特征感知机制,可以进一步提高性能,只要更新一些轻量级的参数。我们的提议方法在HICO-DET和V-COCO数据集上实现了与当前最佳的竞争力。代码可以在https://github.com/ltttpku/ADA-CM中找到。

MS-UNet-v2: Adaptive Denoising Method and Training Strategy for Medical Image Segmentation with Small Training Data

  • paper_url: http://arxiv.org/abs/2309.03686
  • repo_url: None
  • paper_authors: Haoyuan Chen, Yufei Han, Pin Xu, Yanyi Li, Kuan Li, Jianping Yin
  • for: 这个研究旨在提高医疗影像分类 task 的性能,以及解决单层 U-Net 构造不足以掌握足够多信息的问题。
  • methods: 我们提出了一个名为 MS-UNet 的新型 U-Net 模型,使用了 Swin Transformer 嵌入式的多尺度嵌入式解oder,实现了Semantic feature mapping 的更好地学习。此外,我们也提出了一个 Edge loss 和一个可替换的 Denoising module,可以单独应用于其他模型中,并且可以优化 MS-UNet 的 segmentation 性能。
  • results: 实验结果显示,MS-UNet 能够具有更高效的特征学习能力,并且在小量训练数据情况下表现更出色,而且提出的 Edge loss 和 Denoising module 可以明显提高 MS-UNet 的 segmentation 性能。
    Abstract Models based on U-like structures have improved the performance of medical image segmentation. However, the single-layer decoder structure of U-Net is too "thin" to exploit enough information, resulting in large semantic differences between the encoder and decoder parts. Things get worse if the number of training sets of data is not sufficiently large, which is common in medical image processing tasks where annotated data are more difficult to obtain than other tasks. Based on this observation, we propose a novel U-Net model named MS-UNet for the medical image segmentation task in this study. Instead of the single-layer U-Net decoder structure used in Swin-UNet and TransUnet, we specifically design a multi-scale nested decoder based on the Swin Transformer for U-Net. The proposed multi-scale nested decoder structure allows the feature mapping between the decoder and encoder to be semantically closer, thus enabling the network to learn more detailed features. In addition, we propose a novel edge loss and a plug-and-play fine-tuning Denoising module, which not only effectively improves the segmentation performance of MS-UNet, but could also be applied to other models individually. Experimental results show that MS-UNet could effectively improve the network performance with more efficient feature learning capability and exhibit more advanced performance, especially in the extreme case with a small amount of training data, and the proposed Edge loss and Denoising module could significantly enhance the segmentation performance of MS-UNet.
    摘要 模型基于U字结构的表现在医学图像分割方面有所改善。然而,单层decoder结构的U字网络(Swin-UNet和TransUnet)太"瘦",无法利用足够的信息,导致encoder和decoder部分之间的semantic差异较大。尤其是在医学图像处理任务中,缺乏足够的训练数据是常见的问题,这会使得模型的表现更加差。基于这一观察,我们在本研究中提出了一种名为MS-UNet的新的U字网络模型。相比单层decoder结构,我们专门设计了基于Swin Transformer的多尺度嵌套decoder结构。这种多尺度嵌套decoder结构使得feature mapping междуdecoder和encoder更加近似,从而让网络学习更加细腻的特征。此外,我们还提出了一种新的边缘损失和可插拔的精度调整Denosing模块,这不仅能够有效提高MS-UNet的分割性能,还可以应用于其他模型。实验结果表明,MS-UNet可以更好地利用训练数据,具有更高效的特征学习能力和更高级别的表现,特别是在训练数据量很少的极端情况下。此外,提出的边缘损失和Denosing模块可以明显提高MS-UNet的分割性能。

Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation

  • paper_url: http://arxiv.org/abs/2309.03661
  • repo_url: None
  • paper_authors: Ting Liu, Wansen Wu, Yue Hu, Youkai Wang, Kai Xu, Quanjun Yin
  • for: 本研究旨在提高视觉语言Navigation(VLN)任务中的表达能力和模型适应能力,解决现有模型在VLN任务中的领域差距和Sequential alignment问题。
  • methods: 本文提出了一种新的Prompt-bAsed coNtext- and Domain-Aware(PANDA)预训练框架,通过两stage的提问方式,在领域意识阶段和上下文意识阶段,分别学习软视觉提示和硬上下文提示,以塑造模型在VLN任务中的跨模态对应性和上下文知识。
  • results: 实验结果表明,相比之前的状态态度方法,PANDA在R2R和REVERIE两个任务中具有明显的优势,能够更好地利用预训练模型,提高VLN任务的表达能力和模型适应能力。
    Abstract With strong representation capabilities, pretrained vision-language models are widely used in vision and language navigation (VLN). However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when used for VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a trajectory and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the domain-aware stage, we apply a low-cost prompt tuning paradigm to learn soft visual prompts from an in-domain dataset for equipping the pretrained models with object-level and scene-level cross-modal alignment in VLN tasks. Furthermore, in the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics and instill both out-of-context and contextual knowledge in the instruction into cross-modal representations. They enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE show the superiority of PANDA compared to previous state-of-the-art methods.

Spiking Structured State Space Model for Monaural Speech Enhancement

  • paper_url: http://arxiv.org/abs/2309.03641
  • repo_url: None
  • paper_authors: Yu Du, Xu Liu, Yansong Chua
  • for: 提高speech干扰率和计算成本,使用Spiking Structured State Space Model(Spiking-S4)。
  • methods: 使用Spiking Neural Networks(SNN)和Structured State Space Models(S4),结合能量效率和长距离序列模型能力。
  • results: 与现有Artificial Neural Network(ANN)方法相当,但计算资源减少,参数和浮点运算数(FLOPs)减少。
    Abstract Speech enhancement seeks to extract clean speech from noisy signals. Traditional deep learning methods face two challenges: efficiently using information in long speech sequences and high computational costs. To address these, we introduce the Spiking Structured State Space Model (Spiking-S4). This approach merges the energy efficiency of Spiking Neural Networks (SNN) with the long-range sequence modeling capabilities of Structured State Space Models (S4), offering a compelling solution. Evaluation on the DNS Challenge and VoiceBank+Demand Datasets confirms that Spiking-S4 rivals existing Artificial Neural Network (ANN) methods but with fewer computational resources, as evidenced by reduced parameters and Floating Point Operations (FLOPs).
    摘要 干扰除抽取干扰后的清晰语音。传统的深度学习方法面临两个挑战:高效地使用长 speech 序列中的信息,以及高计算成本。为解决这些问题,我们介绍了 Spiking Structured State Space Model(Spiking-S4)。这种方法将神经网络中的能量效率与结构化状态空间模型(S4)结合起来,提供了一个吸引人的解决方案。评估在 DNS 挑战和 VoiceBank+Demand 数据集上表明,Spiking-S4 与现有的人工神经网络(ANN)方法相当,但具有更少的计算资源,如参数和浮点运算(FLOPs)。

Context-Aware 3D Object Localization from Single Calibrated Images: A Study of Basketballs

  • paper_url: http://arxiv.org/abs/2309.03640
  • repo_url: https://github.com/gabriel-vanzandycke/deepsport
  • paper_authors: Marcello Davide Caio, Gabriel Van Zandycke, Christophe De Vleeschouwer
  • for: This paper is written for the task of 3D localization of objects in computer vision applications, specifically for basketball localization from a single calibrated image.
  • methods: The method used in this paper is to predict the object’s height in pixels in image space by estimating its projection onto the ground plane within the image, leveraging the image itself and the object’s location as inputs.
  • results: The paper demonstrates substantial accuracy improvements compared to recent work, offering effective 3D ball tracking and understanding. The source code is made publicly available at \url{https://github.com/gabriel-vanzandycke/deepsport}.
    Abstract Accurately localizing objects in three dimensions (3D) is crucial for various computer vision applications, such as robotics, autonomous driving, and augmented reality. This task finds another important application in sports analytics and, in this work, we present a novel method for 3D basketball localization from a single calibrated image. Our approach predicts the object's height in pixels in image space by estimating its projection onto the ground plane within the image, leveraging the image itself and the object's location as inputs. The 3D coordinates of the ball are then reconstructed by exploiting the known projection matrix. Extensive experiments on the public DeepSport dataset, which provides ground truth annotations for 3D ball location alongside camera calibration information for each image, demonstrate the effectiveness of our method, offering substantial accuracy improvements compared to recent work. Our work opens up new possibilities for enhanced ball tracking and understanding, advancing computer vision in diverse domains. The source code of this work is made publicly available at \url{https://github.com/gabriel-vanzandycke/deepsport}.
    摘要 三维空间中的物体准确地理解是计算机视觉应用中的关键,如 робо扮、自动驾驶和增强现实等。这种任务在体育分析中也具有重要的应用,在这篇论文中,我们介绍了一种基于单个投影图像的3D篮球定位方法。我们的方法在图像空间中预测物体的高度,利用图像本身和物体的位置作为输入,并且利用知道的投影矩阵来重建3D坐标。我们在公共的DeepSport数据集上进行了广泛的实验,该数据集提供了每个图像的摄像机准确的投影矩阵和3D球的位置的标注信息。我们的方法在相比之下提供了显著的精度提高,这些成果将开拓新的 возмож性,推动计算机视觉在多个领域的发展。我们的代码在 \url{https://github.com/gabriel-vanzandycke/deepsport} 上公开提供。
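
A sketch of the underlying geometry: given the predicted ball height and the calibrated projection, the pixel is back-projected and the resulting ray intersected with the corresponding horizontal plane. The calibration values below are toy numbers, and the code mirrors standard pinhole geometry rather than the authors' implementation.

```python
import numpy as np

def ball_3d_from_pixel(uv, K, R, t, height):
    """uv: pixel (u, v); K: 3x3 intrinsics; R, t: world-to-camera pose;
    height: estimated ball height above the ground plane (Z up)."""
    cam_center = -R.T @ t                               # camera center in world frame
    ray_cam = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    ray_world = R.T @ ray_cam                           # ray direction in world frame
    lam = (height - cam_center[2]) / ray_world[2]       # intersect with plane Z = height
    return cam_center + lam * ray_world

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 10.0])            # toy calibration
print(ball_3d_from_pixel((700, 300), K, R, t, height=2.5))
```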

Chasing Consistency in Text-to-3D Generation from a Single Image

  • paper_url: http://arxiv.org/abs/2309.03599
  • repo_url: None
  • paper_authors: Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang
  • for: 提出了一种解决多视图图像Text-to-3D生成 task中的不一致问题的方法,包括semantic inconsistency、geometric inconsistency和saturation inconsistency。
  • methods: 提出了一种三stage框架,包括semantic encoding stage、geometric encoding stage和optimization stage,用于学习参数化的一致性 tokens,以提高Text-to-3D生成的一致性和可靠性。
  • results: 实验结果表明,Compared with前一个状态的方法,Consist3D可以生成更加一致、忠实和 фото真实的3D资产,同时也允许背景和对象编辑通过文本提示。
    Abstract Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from the inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets compared to previous state-of-the-art methods. Furthermore, Consist3D also allows background and object editing through text prompts.
    摘要 文本到3D生成从单个图像是3D视图中受欢迎但具有挑战性的任务。虽然已有许多方法被提出,但现有的方法仍然受到不一致性问题的困扰,包括1)semantic不一致、2)geometry不一致和3)饱和不一致,导致生成的结果偏倾、适应度差和饱和。为了解决这些问题,我们提出了Consist3D,一个三个阶段框架,旨在从单个图像中实现semantic-, geometry-和饱和性Consistent文本到3D生成。在这三个阶段中,第一两个阶段的目标是学习参数化的一致性token,而第三个阶段是优化阶段。具体来说,semantic编码阶段学习一个独立于视图和估计的Token,Promoting semantic一致和Robustness。同时,geometry编码阶段学习另一个Token,旨在包括全面的几何和重建约束,降低过拟合和促进几何一致。最后,优化阶段利用semantic和geometry Token,允许低级别的类ifier-free导向缩放,因此避免饱和。实验结果表明,Consist3D生成的3D资产更加一致、忠实和真实的摄影图像。此外,Consist3D还允许背景和物体编辑通过文本提示。

Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.03598
  • repo_url: https://github.com/guangui-nju/saa
  • paper_authors: Guan Gui, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
  • for: 提高 semi-supervised learning 模型的性能
  • methods: 使用 sample adaptive augmentation (SAA) 技术,包括 sample selection module 和 sample augmentation module,以适应不同样本的需求
  • results: SAA 可以显著提高 FixMatch 和 FlexMatch 模型的准确率,例如,在 CIFAR-10 数据集上,SAA 帮助 FixMatch 模型的准确率从 92.50% 提高到 94.76%,并且帮助 FlexMatch 模型的准确率从 95.01% 提高到 95.31%
    Abstract In semi-supervised learning, unlabeled samples can be utilized through augmentation and consistency regularization. However, we observed certain samples, even undergoing strong augmentation, are still correctly classified with high confidence, resulting in a loss close to zero. It indicates that these samples have been already learned well and do not provide any additional optimization benefits to the model. We refer to these samples as ``naive samples". Unfortunately, existing SSL models overlook the characteristics of naive samples, and they just apply the same learning strategy to all samples. To further optimize the SSL model, we emphasize the importance of giving attention to naive samples and augmenting them in a more diverse manner. Sample adaptive augmentation (SAA) is proposed for this stated purpose and consists of two modules: 1) sample selection module; 2) sample augmentation module. Specifically, the sample selection module picks out {naive samples} based on historical training information at each epoch, then the naive samples will be augmented in a more diverse manner in the sample augmentation module. Thanks to the extreme ease of implementation of the above modules, SAA is advantageous for being simple and lightweight. We add SAA on top of FixMatch and FlexMatch respectively, and experiments demonstrate SAA can significantly improve the models. For example, SAA helped improve the accuracy of FixMatch from 92.50% to 94.76% and that of FlexMatch from 95.01% to 95.31% on CIFAR-10 with 40 labels.
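To make the sample-adaptive idea concrete, below is a minimal sketch (not the authors' released implementation) of how "naive" unlabeled samples could be tracked and re-augmented: per-sample losses are smoothed with an exponential moving average, samples whose smoothed loss stays near zero are flagged as naive, and those samples are routed to a more diverse augmentation policy. The EMA momentum, the loss threshold, and the use of RandAugment with stronger settings as the "more diverse" augmentation are all illustrative assumptions.

```python
import torch
from torchvision import transforms

# Augmentation policies; the "diverse" policy for naive samples is an assumed example.
strong_aug = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])
diverse_aug = transforms.Compose([
    transforms.RandAugment(num_ops=4, magnitude=14),   # more ops, higher magnitude
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),                   # extra diversity on top of RandAugment
])

class NaiveSampleTracker:
    """Tracks an EMA of each unlabeled sample's loss across epochs (SAA-style selection sketch)."""
    def __init__(self, num_samples, momentum=0.9, loss_threshold=0.05):
        self.ema_loss = torch.full((num_samples,), float("inf"))
        self.momentum = momentum
        self.loss_threshold = loss_threshold

    def update(self, indices, losses):
        old = self.ema_loss[indices]
        new = torch.where(torch.isinf(old), losses,
                          self.momentum * old + (1.0 - self.momentum) * losses)
        self.ema_loss[indices] = new

    def is_naive(self, indices):
        # Samples the model already fits with near-zero loss get the more diverse augmentation.
        return self.ema_loss[indices] < self.loss_threshold

tracker = NaiveSampleTracker(num_samples=50000)
idx, per_sample_loss = torch.tensor([0, 1, 2]), torch.tensor([0.001, 0.40, 0.02])
tracker.update(idx, per_sample_loss)
print(tracker.is_naive(idx))   # tensor([ True, False,  True])
```

In a FixMatch-style loop, each epoch the batch would be split so that samples flagged by is_naive pass through diverse_aug while the remaining samples keep the usual strong augmentation.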

DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

  • paper_url: http://arxiv.org/abs/2309.03576
  • repo_url: https://github.com/haochen-wang409/droppos
  • paper_authors: Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang, Zhaoxiang Zhang
  • for: Improving the location awareness of Vision Transformers (ViTs)
  • methods: DropPos, a self-supervised pretext task that randomly drops positional embeddings and asks the model to reconstruct the dropped positions, strengthening its spatial reasoning
  • results: DropPos performs strongly on a wide range of downstream tasks, outperforming supervised pre-training and competing with state-of-the-art self-supervised methods, which indicates improved location awareness of ViTs
    Abstract As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident. To address this, we present DropPos, a novel pretext task designed to reconstruct Dropped Positions. The formulation of DropPos is simple: we first drop a large random subset of positional embeddings and then the model classifies the actual position for each non-overlapping patch among all possible positions solely based on their visual appearance. To avoid trivial solutions, we increase the difficulty of this task by keeping only a subset of patches visible. Additionally, considering there may be different patches with similar visual appearances, we propose position smoothing and attentive reconstruction strategies to relax this classification problem, since it is not necessary to reconstruct their exact positions in these cases. Empirical evaluations of DropPos show strong capabilities. DropPos outperforms supervised pre-training and achieves competitive results compared with state-of-the-art self-supervised alternatives on a wide range of downstream benchmarks. This suggests that explicitly encouraging spatial reasoning abilities, as DropPos does, indeed contributes to the improved location awareness of ViTs. The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
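A minimal sketch of a DropPos-style pretext objective, assuming a tiny ViT-like encoder: positional embeddings are dropped for most visible patches, and a linear head must classify each such patch's true position among all positions from appearance alone. The patch size, embedding width, visibility and keep ratios, and the toy encoder are illustrative placeholders; the paper's position smoothing and attentive reconstruction strategies are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropPosToy(nn.Module):
    def __init__(self, img_size=32, patch=4, dim=64, depth=2):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.randn(1, self.num_patches, dim) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.pos_head = nn.Linear(dim, self.num_patches)   # classify position among all patches

    def forward(self, imgs, visible_ratio=0.25, pos_keep_ratio=0.25):
        B = imgs.size(0)
        tokens = self.patchify(imgs).flatten(2).transpose(1, 2)            # [B, N, D]
        N = tokens.size(1)
        # keep only a subset of patches visible to make the task non-trivial
        vis = torch.rand(B, N, device=imgs.device).argsort(dim=1)[:, : int(N * visible_ratio)]
        tokens = torch.gather(tokens, 1, vis.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        pos = torch.gather(self.pos_embed.expand(B, -1, -1), 1,
                           vis.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        # drop positional embeddings for most visible patches; those are the ones to reconstruct
        keep_pos = torch.rand_like(vis, dtype=torch.float) < pos_keep_ratio  # [B, n_vis]
        tokens = tokens + pos * keep_pos.unsqueeze(-1)
        logits = self.pos_head(self.encoder(tokens))                        # [B, n_vis, N]
        target = vis                                   # true position index of each visible patch
        mask = ~keep_pos                               # only supervise patches whose position was dropped
        return F.cross_entropy(logits[mask], target[mask])

loss = DropPosToy()(torch.randn(2, 3, 32, 32))
print(float(loss))
```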

Toward High Quality Facial Representation Learning

  • paper_url: http://arxiv.org/abs/2309.03575
  • repo_url: https://github.com/nomewang/mcf
  • paper_authors: Yue Wang, Jinlong Peng, Jiangning Zhang, Ran Yi, Liang Liu, Yabiao Wang, Chengjie Wang
  • for: Improving the performance of face analysis tasks, in particular the quality of learned facial representations
  • methods: A self-supervised pre-training framework that combines mask image modeling with a contrastive strategy adapted to face-domain tasks
  • results: Strong performance on multiple downstream tasks, including AFLW-19 face alignment and LaPa face parsing, improving the state of the art in facial representation learning
    Abstract Face analysis tasks have a wide range of applications, but the universal facial representation has only been explored in a few works. In this paper, we explore high-performance pre-training methods to boost face analysis tasks such as face alignment and face parsing. We propose a self-supervised pre-training framework, called Mask Contrastive Face (MCF), with mask image modeling and a contrastive strategy specially adjusted for face domain tasks. To improve the facial representation quality, we use the feature map of a pre-trained visual backbone as a supervision item and use a partially pre-trained decoder for mask image modeling. To handle face identity during the pre-training stage, we further use random masks to build contrastive learning pairs. We conduct the pre-training on the LAION-FACE-cropped dataset, a variant of LAION-FACE 20M, which contains more than 20 million face images from Internet websites. For efficient pre-training, we explore our framework's pre-training performance on a small part of LAION-FACE-cropped and verify its superiority under different pre-training settings. Our model pre-trained with the full pre-training dataset outperforms the state-of-the-art methods on multiple downstream tasks. Our model achieves 0.932 NME$_{diag}$ for AFLW-19 face alignment and 93.96 F1 score for LaPa face parsing. Code is available at https://github.com/nomewang/MCF.

Sparse Federated Training of Object Detection in the Internet of Vehicles

  • paper_url: http://arxiv.org/abs/2309.03569
  • repo_url: None
  • paper_authors: Luping Rao, Chuan Ma, Ming Ding, Yuwen Qian, Lu Zhou, Zhe Liu
  • for: Improving object detection accuracy in the Internet of Vehicles while reducing communication overhead
  • methods: A federated learning-based approach in which well-trained local models are shared with a central server and sparse training is performed on edge devices
  • results: Experiments show the proposed scheme achieves the required object detection rate while saving considerable communication cost
    Abstract As an essential component part of the Intelligent Transportation System (ITS), the Internet of Vehicles (IoV) plays a vital role in alleviating traffic issues. Object detection is one of the key technologies in the IoV, which has been widely used to provide traffic management services by analyzing timely and sensitive vehicle-related information. However, the current object detection methods are mostly based on centralized deep training, that is, the sensitive data obtained by edge devices need to be uploaded to the server, which raises privacy concerns. To mitigate such privacy leakage, we first propose a federated learning-based framework, where well-trained local models are shared in the central server. However, since edge devices usually have limited computing power, plus a strict requirement of low latency in IoVs, we further propose a sparse training process on edge devices, which can effectively lighten the model, and ensure its training efficiency on edge devices, thereby reducing communication overheads. In addition, due to the diverse computing capabilities and dynamic environment, different sparsity rates are applied to edge devices. To further guarantee the performance, we propose, FedWeg, an improved aggregation scheme based on FedAvg, which is designed by the inverse ratio of sparsity rates. Experiments on the real-life dataset using YOLO show that the proposed scheme can achieve the required object detection rate while saving considerable communication costs.
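The abstract says FedWeg weights client updates by the inverse ratio of their sparsity rates but does not give the exact formula; the sketch below is one plausible reading, where each client's aggregation weight is proportional to 1/s_i and the weights are normalized before a FedAvg-style average of state dicts. The function name, the normalization, and the toy client states are assumptions.

```python
from typing import Dict, List
import torch

def fedweg_aggregate(client_states: List[Dict[str, torch.Tensor]],
                     sparsity_rates: List[float]) -> Dict[str, torch.Tensor]:
    """FedAvg-style aggregation with weights proportional to the inverse sparsity rate.

    A client trained with a higher sparsity rate (more pruned weights) contributes less.
    This reading of the "inverse ratio of sparsity rates" is an assumption.
    """
    inv = torch.tensor([1.0 / s for s in sparsity_rates])
    weights = (inv / inv.sum()).tolist()
    global_state = {}
    for name in client_states[0]:
        global_state[name] = sum(w * cs[name].float() for w, cs in zip(weights, client_states))
    return global_state

# toy usage: three clients trained with different sparsity rates
states = [{"w": torch.ones(2, 2) * i} for i in (1.0, 2.0, 3.0)]
agg = fedweg_aggregate(states, sparsity_rates=[0.3, 0.5, 0.7])
print(agg["w"])
```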

Region Generation and Assessment Network for Occluded Person Re-Identification

  • paper_url: http://arxiv.org/abs/2309.03558
  • repo_url: None
  • paper_authors: Shuting He, Weihua Chen, Kai Wang, Hao Luo, Fan Wang, Wei Jiang, Henghui Ding
  • for: Person re-identification (ReID), in particular addressing the challenges of misalignment and occlusions
  • methods: A Region Generation and Assessment Network (RGANet) with a Region Generation Module (RGM) and a Region Assessment Module (RAM) to detect human body regions effectively and highlight the important ones
  • results: Extensive experiments on six widely used benchmarks demonstrate that RGANet outperforms competing methods
    Abstract Person Re-identification (ReID) plays a more and more crucial role in recent years with a wide range of applications. Existing ReID methods are suffering from the challenges of misalignment and occlusions, which degrade the performance dramatically. Most methods tackle such challenges by utilizing external tools to locate body parts or exploiting matching strategies. Nevertheless, the inevitable domain gap between the datasets utilized for external tools and the ReID datasets and the complicated matching process make these methods unreliable and sensitive to noises. In this paper, we propose a Region Generation and Assessment Network (RGANet) to effectively and efficiently detect the human body regions and highlight the important regions. In the proposed RGANet, we first devise a Region Generation Module (RGM) which utilizes the pre-trained CLIP to locate the human body regions using semantic prototypes extracted from text descriptions. Learnable prompt is designed to eliminate domain gap between CLIP datasets and ReID datasets. Then, to measure the importance of each generated region, we introduce a Region Assessment Module (RAM) that assigns confidence scores to different regions and reduces the negative impact of the occlusion regions by lower scores. The RAM consists of a discrimination-aware indicator and an invariance-aware indicator, where the former indicates the capability to distinguish from different identities and the latter represents consistency among the images of the same class of human body regions. Extensive experimental results for six widely-used benchmarks including three tasks (occluded, partial, and holistic) demonstrate the superiority of RGANet against state-of-the-art methods.

Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

  • paper_url: http://arxiv.org/abs/2309.03550
  • repo_url: https://github.com/deepshwang/text2control3d
  • paper_authors: Sungwon Hwang, Junha Hyung, Jaegul Choo
  • for: Controllable text-to-3D avatar generation in which facial expression and appearance are controlled by a casually captured monocular video
  • methods: Extends ControlNet with depth-map conditioning and builds the 3D avatar in Neural Radiance Fields (NeRF); cross-reference attention injects controlled facial expression and appearance into the viewpoint-aware images, and low-pass filtering of the Gaussian latent mitigates the viewpoint-agnostic texture problem
  • results: Generates high-quality 3D avatars that are consistent across viewpoints, with controllable facial expression and appearance
    Abstract Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses the question of adding such controllability to text-to-3D generation. In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method whose facial expression is controllable given a monocular video casually captured with hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images that we generate from ControlNet, whose condition input is the depth map extracted from the input video. When generating the viewpoint-aware images, we utilize cross-reference attention to inject well-controlled, referential facial expression and appearance via cross attention. We also conduct low-pass filtering of Gaussian latent of the diffusion model in order to ameliorate the viewpoint-agnostic texture problem we observed from our empirical analysis, where the viewpoint-aware images contain identical textures on identical pixel positions that are incomprehensible in 3D. Finally, to train NeRF with the images that are viewpoint-aware yet are not strictly consistent in geometry, our approach considers per-image geometric variation as a view of deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in a canonical space of deformable NeRF by learning a set of per-image deformation via deformation field table. We demonstrate the empirical results and discuss the effectiveness of our method.

Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation

  • paper_url: http://arxiv.org/abs/2309.03548
  • repo_url: None
  • paper_authors: Xiaohan Cui, Long Ma, Tengyu Ma, Jinyuan Liu, Xin Fan, Risheng Liu
  • for: Improving object detection accuracy in low-light environments
  • methods: An enhancer + detector scheme in which the illumination removed by the enhancer is reused as an auxiliary signal in the detector to extract detection-friendly features
  • results: Achieves higher detection accuracy than other state-of-the-art methods
    Abstract Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as the pre-processing for regular detectors. However, because of the disparity in task objectives between the enhancer and detector, this paradigm cannot shine at its best ability. In this work, we try to arouse the potential of enhancer + detector. Different from existing works, we extend the illumination-based enhancers (our newly designed or existing) as a scene decomposition module, whose removed illumination is exploited as the auxiliary in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. Actually, our built scheme successfully transforms the "trash" (i.e., the ignored illumination in the detector) into the "treasure" for the detector. Plenty of experiments are conducted to reveal our superiority against other state-of-the-art methods. The code will be public if it is accepted.

Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction

  • paper_url: http://arxiv.org/abs/2309.03542
  • repo_url: https://github.com/jkli1998/T-CAR
  • paper_authors: Jiankai Li, Yunhong Wang, Weixin Li
  • for: Improving scene graph generation (SGG) for downstream tasks, especially zero-shot SGG
  • methods: A Triplet Calibration and Reduction (T-CAR) framework comprising a triplet calibration loss, an unseen space reduction loss, and a contextual encoder to improve generalization to unseen triplets
  • results: Experiments show consistent improvements for zero-shot SGG over existing methods
    Abstract Scene Graph Generation (SGG) plays a pivotal role in downstream vision-language tasks. Existing SGG methods typically suffer from poor compositional generalizations on unseen triplets. They are generally trained on incompletely annotated scene graphs that contain dominant triplets and tend to bias toward these seen triplets during inference. To address this issue, we propose a Triplet Calibration and Reduction (T-CAR) framework in this paper. In our framework, a triplet calibration loss is first presented to regularize the representations of diverse triplets and to simultaneously excavate the unseen triplets in incompletely annotated training scene graphs. Moreover, the unseen space of scene graphs is usually several times larger than the seen space since it contains a huge number of unrealistic compositions. Thus, we propose an unseen space reduction loss to shift the attention of excavation to reasonable unseen compositions to facilitate the model training. Finally, we propose a contextual encoder to improve the compositional generalizations of unseen triplets by explicitly modeling the relative spatial relations between subjects and objects. Extensive experiments show that our approach achieves consistent improvements for zero-shot SGG over state-of-the-art methods. The code is available at https://github.com/jkli1998/T-CAR.

YOLO series target detection algorithms for underwater environments

  • paper_url: http://arxiv.org/abs/2309.03539
  • repo_url: None
  • paper_authors: Chenjie Zhang, Pengcheng Jiao
  • for: marine engineering applications (such as underwater structural health monitoring and underwater biological detection)
  • methods: improved YOLO algorithm for underwater environments (addressing challenges such as dim light and turbid water)
  • results: potential for increased accuracy and efficiency in underwater applications, but still facing challenges and limitations.
    Abstract The You Only Look Once (YOLO) algorithm is a representative target detection algorithm that emerged in 2016 and is known for its balance of computing speed and accuracy; it now plays an important role in many areas of industry and daily life. However, the application of the YOLO algorithm in underwater environments is still limited by problems such as dim light and turbid water. With land resources limited, the ocean holds great potential for future human development. In this paper, starting from the practical needs of marine engineering applications and taking underwater structural health monitoring (SHM) and underwater biological detection as examples, we propose improved methods for applying YOLO algorithms underwater and point out the problems that remain.

Feature Enhancer Segmentation Network (FES-Net) for Vessel Segmentation

  • paper_url: http://arxiv.org/abs/2309.03535
  • repo_url: None
  • paper_authors: Tariq M. Khan, Muhammad Arsalan, Shahzaib Iqbal, Imran Razzak, Erik Meijering
  • for: Accurate segmentation of retinal vessels for tracking and diagnosing vision-threatening diseases such as diabetic retinopathy and age-related macular degeneration
  • methods: A Feature Enhancer Segmentation Network (FES-Net) that processes the input image directly without additional image enhancement steps, using four prompt convolutional blocks (PCBs) during downsampling and shallow upsampling to generate a binary mask for each class
  • results: FES-Net outperforms competing methods from the literature on four publicly available datasets (DRIVE, STARE, CHASE, and HRF)
    Abstract Diseases such as diabetic retinopathy and age-related macular degeneration pose a significant risk to vision, highlighting the importance of precise segmentation of retinal vessels for the tracking and diagnosis of progression. However, existing vessel segmentation methods that heavily rely on encoder-decoder structures struggle to capture contextual information about retinal vessel configurations, leading to challenges in reconciling semantic disparities between encoder and decoder features. To address this, we propose a novel feature enhancement segmentation network (FES-Net) that achieves accurate pixel-wise segmentation without requiring additional image enhancement steps. FES-Net directly processes the input image and utilizes four prompt convolutional blocks (PCBs) during downsampling, complemented by a shallow upsampling approach to generate a binary mask for each class. We evaluate the performance of FES-Net on four publicly available state-of-the-art datasets: DRIVE, STARE, CHASE, and HRF. The evaluation results clearly demonstrate the superior performance of FES-Net compared to other competitive approaches documented in the existing literature.

A Robust Negative Learning Approach to Partial Domain Adaptation Using Source Prototypes

  • paper_url: http://arxiv.org/abs/2309.03531
  • repo_url: None
  • paper_authors: Sandipan Choudhuri, Suli Adeniye, Arunabha Sen
  • for: This paper proposes a robust Partial Domain Adaptation (PDA) framework to mitigate the negative transfer problem by incorporating a robust target-supervision strategy.
  • methods: The proposed framework leverages ensemble learning and includes diverse, complementary label feedback, alleviating the effect of incorrect feedback and promoting pseudo-label refinement. It optimizes intra-class compactness and inter-class separation with the inferred source prototypes and highly-confident target samples in a domain-invariant fashion.
  • results: The proposed framework demonstrates enhanced robustness and generalization in a range of partial domain adaptation tasks, outperforming existing state-of-the-art PDA approaches.
    Abstract This work proposes a robust Partial Domain Adaptation (PDA) framework that mitigates the negative transfer problem by incorporating a robust target-supervision strategy. It leverages ensemble learning and includes diverse, complementary label feedback, alleviating the effect of incorrect feedback and promoting pseudo-label refinement. Rather than relying exclusively on first-order moments for distribution alignment, our approach offers explicit objectives to optimize intra-class compactness and inter-class separation with the inferred source prototypes and highly-confident target samples in a domain-invariant fashion. Notably, we ensure source data privacy by eliminating the need to access the source data during the adaptation phase through a priori inference of source prototypes. We conducted a series of comprehensive experiments, including an ablation analysis, covering a range of partial domain adaptation tasks. Comprehensive evaluations on benchmark datasets corroborate our framework's enhanced robustness and generalization, demonstrating its superiority over existing state-of-the-art PDA approaches.

Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs

  • paper_url: http://arxiv.org/abs/2309.03530
  • repo_url: None
  • paper_authors: Arne Moos
  • for: A new approach for object detection, primarily ball detection, with mobile robots in the RoboCup Standard Platform League
  • methods: A convolutional neural network architecture designed for computationally constrained robotic platforms that classifies single objects in image patches with high precision and determines their exact spatial position, combined with Early Exits
  • results: Achieves 100% precision and a recall of almost 87% on the validation set with an execution time of around 170 microseconds per hypothesis; combining the approach with an Early Exit yields a runtime optimization of more than 28% on average
    Abstract This paper proposes a novel approach for detecting objects using mobile robots in the context of the RoboCup Standard Platform League, with a primary focus on detecting the ball. The challenge lies in detecting a dynamic object in varying lighting conditions and blurred images caused by fast movements. To address this challenge, the paper presents a convolutional neural network architecture designed specifically for computationally constrained robotic platforms. The proposed CNN is trained to achieve high precision classification of single objects in image patches and to determine their precise spatial positions. The paper further integrates Early Exits into the existing high-precision CNN architecture to reduce the computational cost of easily rejectable cases in the background class. The training process involves a composite loss function based on confidence and positional losses with dynamic weighting and data augmentation. The proposed approach achieves a precision of 100% on the validation dataset and a recall of almost 87%, while maintaining an execution time of around 170 $\mu$s per hypotheses. By combining the proposed approach with an Early Exit, a runtime optimization of more than 28%, on average, can be achieved compared to the original CNN. Overall, this paper provides an efficient solution for an enhanced detection of objects, especially the ball, in computationally constrained robotic platforms.
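As an illustration of the Early Exit idea described above, the sketch below attaches an auxiliary head after the first convolutional stage of a small patch classifier; if that head is sufficiently confident the patch is background, the more expensive layers are skipped. The layer sizes, the 0.9 threshold, and the combined class-plus-position output head are assumptions and do not reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchNetWithEarlyExit(nn.Module):
    """Tiny two-stage classifier for 32x32 grayscale patches with an early background exit."""
    def __init__(self, exit_threshold=0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU())
        self.exit_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
        self.stage2 = nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.full_head = nn.Linear(16, 2 + 2)   # class logits + (x, y) position of the ball
        self.exit_threshold = exit_threshold

    @torch.no_grad()                            # inference-only sketch
    def forward(self, patch):                   # patch: [1, 1, 32, 32]
        feat = self.stage1(patch)
        p_bg = F.softmax(self.exit_head(feat), dim=1)[0, 0]
        if p_bg > self.exit_threshold:          # easily rejectable background: stop early
            return {"is_ball": False, "early_exit": True}
        out = self.full_head(self.stage2(feat))[0]
        return {"is_ball": bool(out[1] > out[0]), "position": out[2:].tolist(), "early_exit": False}

net = PatchNetWithEarlyExit().eval()
print(net(torch.randn(1, 1, 32, 32)))
```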

BroadCAM: Outcome-agnostic Class Activation Mapping for Small-scale Weakly Supervised Applications

  • paper_url: http://arxiv.org/abs/2309.03509
  • repo_url: https://github.com/linjiatai/broadcam
  • paper_authors: Jiatai Lin, Guoqiang Han, Xuemiao Xu, Changhong Liang, Tien-Tsin Wong, C. L. Philip Chen, Zaiyi Liu, Chu Han
  • for: Interpreting deep learning models, in particular for weakly supervised semantic segmentation and object localization
  • methods: An outcome-agnostic CAM approach, BroadCAM, that avoids the unreliable weights produced by unstable training on small-scale data
  • results: BroadCAM outperforms existing CAM methods with small-scale training data (less than 5%) across different CNN architectures and also achieves SOTA performance with large-scale training data
    Abstract Class activation mapping~(CAM), a visualization technique for interpreting deep learning models, is now commonly used for weakly supervised semantic segmentation~(WSSS) and object localization~(WSOL). It is the weighted aggregation of the feature maps by activating the high class-relevance ones. Current CAM methods achieve it relying on the training outcomes, such as predicted scores~(forward information), gradients~(backward information), etc. However, when with small-scale data, unstable training may lead to less effective model outcomes and generate unreliable weights, finally resulting in incorrect activation and noisy CAM seeds. In this paper, we propose an outcome-agnostic CAM approach, called BroadCAM, for small-scale weakly supervised applications. Since broad learning system (BLS) is independent to the model learning, BroadCAM can avoid the weights being affected by the unreliable model outcomes when with small-scale data. By evaluating BroadCAM on VOC2012 (natural images) and BCSS-WSSS (medical images) for WSSS and OpenImages30k for WSOL, BroadCAM demonstrates superior performance than existing CAM methods with small-scale data (less than 5\%) in different CNN architectures. It also achieves SOTA performance with large-scale training data. Extensive qualitative comparisons are conducted to demonstrate how BroadCAM activates the high class-relevance feature maps and generates reliable CAMs when with small-scale training data.
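For reference, the "weighted aggregation of the feature maps" that the abstract describes corresponds, in its classic form, to re-weighting the last convolutional feature maps with the fully connected layer's weights for the target class, as sketched below. This is the standard CAM baseline that BroadCAM departs from; BroadCAM instead obtains its weights from a broad learning system, which is not shown here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def class_activation_map(model, image, target_class):
    """Classic CAM: weight the last conv feature maps by the FC weights of the target class."""
    feats = {}
    def hook(_module, _inputs, output):
        feats["maps"] = output                      # [B, C, h, w]
    handle = model.layer4.register_forward_hook(hook)
    with torch.no_grad():
        model(image)
    handle.remove()
    fc_w = model.fc.weight[target_class]            # [C] class-relevance weights
    cam = torch.einsum("bchw,c->bhw", feats["maps"], fc_w)
    cam = F.relu(cam)
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return cam                                      # [B, h, w], values in [0, 1]

model = resnet18().eval()
cam = class_activation_map(model, torch.randn(1, 3, 224, 224), target_class=3)
print(cam.shape)   # torch.Size([1, 7, 7])
```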

Dynamic Frame Interpolation in Wavelet Domain

  • paper_url: http://arxiv.org/abs/2309.03508
  • repo_url: https://github.com/ltkong218/waveletvfi
  • paper_authors: Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Ying Tai, Chengjie Wang, Jie Yang
  • for: Increasing frame rate for a more fluent visual experience through video frame interpolation
  • methods: A two-stage framework that first estimates the intermediate optical flow with a lightweight motion perception network and then uses a wavelet synthesis network with flow-aligned context features to predict multi-scale wavelet coefficients via sparse convolution
  • results: On common high-resolution and animation frame interpolation benchmarks, WaveletVFI reduces computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods
    Abstract Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, that can result in lots of inefficient computation. On the other hand, the computation compression degree in frame interpolation is highly dependent on both texture distribution and scene motion, which demands to understand the spatial-temporal information of each input frame pair for a better compression degree selection. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation in each scale are determined by a crucial threshold ratio. Instead of setting a fixed value like previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On the common high resolution and animation frame interpolation benchmarks, proposed WaveletVFI can reduce computation up to 40% while maintaining similar accuracy, making it perform more efficiently against other state-of-the-arts. Code is available at https://github.com/ltkong218/WaveletVFI.
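The efficiency mechanism above hinges on a per-sample threshold ratio that decides which positions of the wavelet coefficients are actually computed. A rough sketch of that idea is below: a predicted ratio is turned into a per-sample magnitude cutoff, and only positions above the cutoff keep a valid mask for the sparse convolution. Using a coefficient-magnitude proxy, a single scale, and the keep-ratio interpretation of the threshold are all assumptions.

```python
import torch

def sparse_valid_mask(coeff_magnitude: torch.Tensor, keep_ratio: torch.Tensor) -> torch.Tensor:
    """Keep only the top keep_ratio fraction of spatial positions per sample.

    coeff_magnitude: [B, H, W] proxy magnitudes of wavelet coefficients at one scale.
    keep_ratio: [B] per-sample ratio in [0, 1], e.g. produced by a small classifier in the
                motion-perception network (a dynamic threshold rather than a fixed value).
    """
    B, H, W = coeff_magnitude.shape
    flat = coeff_magnitude.reshape(B, -1)
    n = flat.size(1)
    k = (keep_ratio.clamp(0.0, 1.0) * n).long().clamp(min=1)      # positions to keep per sample
    sorted_vals, _ = flat.sort(dim=1, descending=True)
    cut = sorted_vals.gather(1, (k - 1).unsqueeze(1))             # [B, 1] per-sample cutoff value
    return (flat >= cut).reshape(B, H, W)                         # ties may keep slightly more than k

mag = torch.rand(2, 8, 8)
ratio = torch.tensor([0.25, 0.5])                                 # e.g. output of a sigmoid head
mask = sparse_valid_mask(mag, ratio)
print(mask.float().mean(dim=(1, 2)))                              # roughly 0.25 and 0.5
```

Positions outside the mask would simply not be evaluated by the sparse convolution, which is where the computation saving comes from.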

Stroke-based Neural Painting and Stylization with Dynamically Predicted Painting Region

  • paper_url: http://arxiv.org/abs/2309.03504
  • repo_url: https://github.com/sjtuplayer/compositional_neural_painter
  • paper_authors: Teng Hu, Ran Yi, Haokun Zhu, Liang Liu, Jinlong Peng, Yabiao Wang, Chengjie Wang, Lizhuang Ma
  • for: Stroke-based image rendering that avoids the boundary inconsistency artifacts caused by uniform-block-dividing strategies
  • methods: A compositor network trained with a phasic RL strategy dynamically predicts the next painting region, a painter network trained with a WGAN discriminator predicts stroke parameters, and a differentiable distance transform loss extends the method to stroke-based style transfer
  • results: Outperforms existing models in both stroke-based neural painting and stroke-based stylization while preserving the structure of the input image
    Abstract Stroke-based rendering aims to recreate an image with a set of strokes. Most existing methods render complex images using an uniform-block-dividing strategy, which leads to boundary inconsistency artifacts. To solve the problem, we propose Compositional Neural Painter, a novel stroke-based rendering framework which dynamically predicts the next painting region based on the current canvas, instead of dividing the image plane uniformly into painting regions. We start from an empty canvas and divide the painting process into several steps. At each step, a compositor network trained with a phasic RL strategy first predicts the next painting region, then a painter network trained with a WGAN discriminator predicts stroke parameters, and a stroke renderer paints the strokes onto the painting region of the current canvas. Moreover, we extend our method to stroke-based style transfer with a novel differentiable distance transform loss, which helps preserve the structure of the input image during stroke-based stylization. Extensive experiments show our model outperforms the existing models in both stroke-based neural painting and stroke-based stylization. Code is available at https://github.com/sjtuplayer/Compositional_Neural_Painter

Instance Segmentation of Dislocations in TEM Images

  • paper_url: http://arxiv.org/abs/2309.03499
  • repo_url: https://github.com/kruzaeva/dislocation-segmentation
  • paper_authors: Karina Ruzaeva, Kishan Govind, Marc Legros, Stefan Sandfeld
  • for: Revealing the motion of dislocations with quantitative transmission electron microscopy (TEM) during in-situ straining experiments; knowledge of dislocation positions and movement is important in materials science for creating novel materials
  • methods: State-of-the-art instance segmentation methods, including Mask R-CNN and YOLOv8, extract dislocation masks, which are then converted to mathematical lines to enable quantitative analysis of dislocation length and geometry
  • results: The segmentation pipeline achieves an accuracy suitable for domain-specific post-processing, and the proposed physics-based, length-aware metric evaluates network performance much more consistently than typical pixel-wise metrics
    Abstract Quantitative Transmission Electron Microscopy (TEM) during in-situ straining experiment is able to reveal the motion of dislocations -- linear defects in the crystal lattice of metals. In the domain of materials science, the knowledge about the location and movement of dislocations is important for creating novel materials with superior properties. A long-standing problem, however, is to identify the position and extract the shape of dislocations, which would ultimately help to create a digital twin of such materials. In this work, we quantitatively compare state-of-the-art instance segmentation methods, including Mask R-CNN and YOLOv8. The dislocation masks as the results of the instance segmentation are converted to mathematical lines, enabling quantitative analysis of dislocation length and geometry -- important information for the domain scientist, which we then propose to include as a novel length-aware quality metric for estimating the network performance. Our segmentation pipeline shows a high accuracy suitable for all domain-specific, further post-processing. Additionally, our physics-based metric turns out to perform much more consistently than typically used pixel-wise metrics.
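As a rough illustration of the post-processing and the length-aware evaluation mentioned above, the sketch below skeletonizes an instance mask into a one-pixel-wide line, takes the skeleton pixel count as a crude length estimate, and scores a prediction by its relative length error. The paper's actual conversion to mathematical lines and its exact metric are not reproduced here; treat this as an assumption-laden stand-in.

```python
import numpy as np
from skimage.morphology import skeletonize

def mask_to_length(mask: np.ndarray) -> float:
    """Approximate dislocation length as the number of pixels in the mask's skeleton."""
    skeleton = skeletonize(mask.astype(bool))
    return float(skeleton.sum())

def length_aware_score(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """1 - relative length error, clipped to [0, 1]; higher is better (assumed scoring rule)."""
    lp, lg = mask_to_length(pred_mask), mask_to_length(gt_mask)
    if lg == 0:
        return 1.0 if lp == 0 else 0.0
    return float(max(0.0, 1.0 - abs(lp - lg) / lg))

# toy example: a straight "dislocation" of thickness 3, predicted slightly shorter
gt = np.zeros((64, 64), dtype=np.uint8)
gt[30:33, 5:60] = 1
pred = np.zeros_like(gt)
pred[30:33, 10:60] = 1
print(mask_to_length(gt), mask_to_length(pred), length_aware_score(pred, gt))
```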

Evaluating Deep Learning-based Melanoma Classification using Immunohistochemistry and Routine Histology: A Three Center Study

  • paper_url: http://arxiv.org/abs/2309.03494
  • repo_url: None
  • paper_authors: Christoph Wies, Lucas Schneider, Sarah Haggenmueller, Tabea-Clara Bucher, Sarah Hobelsberger, Markus V. Heppt, Gerardo Ferrara, Eva I. Krieghoff-Henning, Titus J. Brinker
  • for: Automated deep learning (DL)-based classification of melanoma on histopathology slides
  • methods: ResNet classifiers trained on MelanA-stained (IHC) slides and the corresponding H&E-stained slides, evaluated on out-of-distribution (OOD) datasets
  • results: The MelanA-based classifier reaches AUROCs of 0.82 and 0.74 on the OOD datasets, on par with the H&E baseline (0.81 and 0.75); a combined MelanA + H&E classifier reaches 0.85 and 0.81, suggesting that multi-stain classification may further assist pathologists
    Abstract Pathologists routinely use immunohistochemical (IHC)-stained tissue slides against MelanA in addition to hematoxylin and eosin (H&E)-stained slides to improve their accuracy in diagnosing melanomas. The use of diagnostic Deep Learning (DL)-based support systems for automated examination of tissue morphology and cellular composition has been well studied in standard H&E-stained tissue slides. In contrast, there are few studies that analyze IHC slides using DL. Therefore, we investigated the separate and joint performance of ResNets trained on MelanA and corresponding H&E-stained slides. The MelanA classifier achieved an area under receiver operating characteristics curve (AUROC) of 0.82 and 0.74 on out of distribution (OOD)-datasets, similar to the H&E-based benchmark classification of 0.81 and 0.75, respectively. A combined classifier using MelanA and H&E achieved AUROCs of 0.85 and 0.81 on the OOD datasets. DL MelanA-based assistance systems show the same performance as the benchmark H&E classification and may be improved by multi stain classification to assist pathologists in their clinical routine.
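The fusion rule behind the "combined classifier" is not spelled out in the abstract; a simple and common late-fusion reading is to average the melanoma probabilities of the two stain-specific ResNets and score the result with AUROC, as sketched below with dummy data. The averaging rule and the synthetic probabilities are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def combined_auroc(p_melana: np.ndarray, p_he: np.ndarray, labels: np.ndarray) -> float:
    """Late fusion of slide-level melanoma probabilities from the MelanA and H&E classifiers."""
    p_combined = 0.5 * (p_melana + p_he)      # simple average; other fusion rules are possible
    return roc_auc_score(labels, p_combined)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# dummy probabilities that are only weakly informative, just to exercise the function
p_melana = np.clip(labels * 0.30 + rng.normal(0.40, 0.2, size=200), 0, 1)
p_he = np.clip(labels * 0.25 + rng.normal(0.45, 0.2, size=200), 0, 1)
print(f"MelanA: {roc_auc_score(labels, p_melana):.2f}  "
      f"H&E: {roc_auc_score(labels, p_he):.2f}  "
      f"combined: {combined_auroc(p_melana, p_he, labels):.2f}")
```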

SAM3D: Segment Anything Model in Volumetric Medical Images

  • paper_url: http://arxiv.org/abs/2309.03493
  • repo_url: https://github.com/DinhHieuHoang/SAM3D
  • paper_authors: Nhat-Tan Bui, Dinh-Hieu Hoang, Minh-Triet Tran, Ngan Le
  • for: Accurate segmentation of 3D volumetric medical images to support medical diagnosis
  • methods: SAM3D, built on the Segment Anything Model (SAM), uses the pre-trained SAM encoder to extract meaningful representations of the input; unlike other SAM-based volumetric segmentation methods, it takes the whole 3D volume as input and processes it simply and effectively, avoiding the training of a large number of parameters
  • results: Extensive experiments on multiple medical image datasets show that the network achieves competitive results against state-of-the-art 3D medical segmentation methods while being significantly more parameter-efficient
    Abstract Image segmentation is a critical task in medical image analysis, providing valuable information that helps to make an accurate diagnosis. In recent years, deep learning-based automatic image segmentation methods have achieved outstanding results in medical images. In this paper, inspired by the Segment Anything Model (SAM), a foundation model that has received much attention for its impressive accuracy and powerful generalization ability in 2D still image segmentation, we propose a SAM3D that targets at 3D volumetric medical images and utilizes the pre-trained features from the SAM encoder to capture meaningful representations of input images. Different from other existing SAM-based volumetric segmentation methods that perform the segmentation by dividing the volume into a set of 2D slices, our model takes the whole 3D volume image as input and processes it simply and effectively that avoids training a significant number of parameters. Extensive experiments are conducted on multiple medical image datasets to demonstrate that our network attains competitive results compared with other state-of-the-art methods in 3D medical segmentation tasks while being significantly efficient in terms of parameters.

DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners

  • paper_url: http://arxiv.org/abs/2309.03483
  • repo_url: https://github.com/clarence-lee-sheng/determinet
  • paper_authors: Clarence Lee, M Ganesh Kumar, Cheston Tan
  • for: Improving visual grounding models' ability to distinguish specific objects of interest from objects in general
  • methods: A dataset of 250,000 synthetically generated images and captions based on 25 determiners; the task is to predict bounding boxes that identify the objects of interest, constrained by the semantics of the given determiner
  • results: Current state-of-the-art visual grounding models do not perform well on the dataset, highlighting the limitations of existing models on reference and quantification tasks
    Abstract State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between all objects versus only certain objects of interest. In natural language, in order to specify a particular object or set of objects of interest, humans use determiners such as "my", "either" and "those". Determiners, as an important word class, are a type of schema in natural language about the reference or quantity of the noun. Existing grounded referencing datasets place much less emphasis on determiners, compared to other word classes such as nouns, verbs and adjectives. This makes it difficult to develop models that understand the full variety and complexity of object referencing. Thus, we have developed and released the DetermiNet dataset , which comprises 250,000 synthetically generated images and captions based on 25 determiners. The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner. We find that current state-of-the-art visual grounding models do not perform well on the dataset, highlighting the limitations of existing models on reference and quantification tasks.

TSI-Net: A Timing Sequence Image Segmentation Network for Intracranial Artery Segmentation in Digital Subtraction Angiography

  • paper_url: http://arxiv.org/abs/2309.03477
  • repo_url: None
  • paper_authors: Lemeng Wang, Wentao Liu, Weijin Xu, Haoyuan Li, Huihua Yang, Feng Gao
  • for: Automatic segmentation of the intracranial artery (IA) in digital subtraction angiography (DSA) sequences
  • methods: Incorporates a bi-directional ConvGRU module (BCM) in the encoder, which can take variable-length DSA sequences as input and retain past and future information, and introduces a sensitive detail branch (SDB) at the end for supervising fine vessels
  • results: Significantly better than recent state-of-the-art networks, with a Sen evaluation metric of 0.797, a 3% improvement compared to other methods
    Abstract Cerebrovascular disease is one of the major diseases facing the world today. Automatic segmentation of the intracranial artery (IA) in digital subtraction angiography (DSA) sequences is an important step in the diagnosis of vascular-related diseases and in guiding neurointerventional procedures. However, owing to the imaging principle of DSA technology, a single image can only show the part of the IA filled with contrast medium, so 2D DSA segmentation methods cannot capture the complete IA information, limiting the diagnosis and treatment of cerebrovascular diseases. We propose a U-shaped timing sequence image segmentation network, called TSI-Net, which incorporates a bi-directional ConvGRU module (BCM) in the encoder. The BCM can take variable-length DSA sequences as input, retain past and future information, and segment them into 2D images. In addition, we introduce a sensitive detail branch (SDB) at the end for supervising fine vessels. Evaluated on the DSA sequence dataset DIAS, the method performs significantly better than recent state-of-the-art networks. In particular, it achieves a Sen evaluation metric of 0.797, a 3% improvement over other methods.

Temporal Collection and Distribution for Referring Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.03473
  • repo_url: None
  • paper_authors: Jiajin Tang, Ge Zheng, Sibei Yang
  • for: Improving referring video object segmentation by aligning the natural language expression with object motions and their dynamic associations across the video
  • methods: Simultaneously maintains a video-level referent token, which captures the referent described by the language expression, and a sequence of object queries, which locate and segment objects in each frame; a novel temporal collection-distribution mechanism handles the interaction between the referent token and the object queries
  • results: Outperforms state-of-the-art methods consistently and significantly on all benchmarks
    Abstract Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing video-level referent according to the language expression, while the latter serves to better locate and segment objects with each frame. Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries. Specifically, the temporal collection mechanism collects global information for the referent token from object queries to the temporal motions to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and object queries in every frame. Experimental results show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly.
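A minimal sketch of the collection and distribution steps described above, using standard multi-head cross-attention: in temporal collection the single video-level referent token queries the object queries of all frames, and in temporal distribution each frame's object queries read from the updated referent token. The dimensions, the single attention layer per step, and the omission of the language conditioning and cross-frame reasoning details are simplifications and assumptions.

```python
import torch
import torch.nn as nn

class CollectDistribute(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, referent, obj_queries):
        """referent: [B, 1, D] video-level referent token.
        obj_queries: [B, T, Q, D] object queries for T frames, Q queries per frame."""
        B, T, Q, D = obj_queries.shape
        flat = obj_queries.reshape(B, T * Q, D)
        # temporal collection: the referent gathers information from all frames' object queries
        referent, _ = self.collect(query=referent, key=flat, value=flat)
        # temporal distribution: each frame's object queries read from the updated referent
        per_frame = obj_queries.reshape(B * T, Q, D)
        ref_rep = referent.repeat_interleave(T, dim=0)          # [B*T, 1, D]
        updated, _ = self.distribute(query=per_frame, key=ref_rep, value=ref_rep)
        return referent, updated.reshape(B, T, Q, D)

ref, objs = torch.randn(2, 1, 256), torch.randn(2, 5, 10, 256)
ref_out, objs_out = CollectDistribute()(ref, objs)
print(ref_out.shape, objs_out.shape)
```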

Perceptual Quality Assessment of 360$^\circ$ Images Based on Generative Scanpath Representation

  • paper_url: http://arxiv.org/abs/2309.03472
  • repo_url: https://github.com/xiangjiesui/gsr
  • paper_authors: Xiangjie Sui, Hanwei Zhu, Xuelin Liu, Yuming Fang, Shiqi Wang, Zhou Wang
  • for: Efficient perceptual quality assessment of 360$^\circ$ images that accounts for the diverse viewing behaviors under different viewing conditions
  • methods: A scanpath generator produces a set of scanpaths for multiple hypothetical users under a predefined viewing condition (starting point and exploration time); these scanpaths convert the 360$^\circ$ image into a unique generative scanpath representation (GSR), and quality is inferred by learning quality maps of the GSR
  • results: Experiments validate that the predictions are highly consistent with human perception, especially for locally distorted 360$^\circ$ images under varied viewing conditions
    Abstract Despite substantial efforts dedicated to the design of heuristic models for omnidirectional (i.e., 360$^\circ$) image quality assessment (OIQA), a conspicuous gap remains due to the lack of consideration for the diversity of viewing behaviors that leads to the varying perceptual quality of 360$^\circ$ images. Two critical aspects underline this oversight: the neglect of viewing conditions that significantly sway user gaze patterns and the overreliance on a single viewport sequence from the 360$^\circ$ image for quality inference. To address these issues, we introduce a unique generative scanpath representation (GSR) for effective quality inference of 360$^\circ$ images, which aggregates varied perceptual experiences of multi-hypothesis users under a predefined viewing condition. More specifically, given a viewing condition characterized by the starting point of viewing and exploration time, a set of scanpaths consisting of dynamic visual fixations can be produced using an apt scanpath generator. Following this vein, we use the scanpaths to convert the 360$^\circ$ image into the unique GSR, which provides a global overview of gazed-focused contents derived from scanpaths. As such, the quality inference of the 360$^\circ$ image is swiftly transformed to that of GSR. We then propose an efficient OIQA computational framework by learning the quality maps of GSR. Comprehensive experimental results validate that the predictions of the proposed framework are highly consistent with human perception in the spatiotemporal domain, especially in the challenging context of locally distorted 360$^\circ$ images under varied viewing conditions. The code will be released at https://github.com/xiangjieSui/GSR

Multi-Modality Guidance Network For Missing Modality Inference

  • paper_url: http://arxiv.org/abs/2309.03452
  • repo_url: None
  • paper_authors: Zhuokai Zhao, Harish Palani, Tianyi Liu, Lena Evans, Ruth Toner
  • for: Making multimodal processing feasible in large deployed systems when modalities are missing at inference time
  • methods: A guidance network that promotes knowledge sharing during training, exploiting multimodal representations to train better single-modality models for inference
  • results: A real-life violence detection experiment shows that the framework trains single-modality models that significantly outperform their traditionally trained counterparts while maintaining the same inference cost
    Abstract Multimodal models have gained significant success in recent years. Standard multimodal approaches often assume unchanged modalities from training stage to inference stage. In practice, however, many scenarios fail to satisfy such assumptions with missing modalities during inference, leading to limitations on where multimodal models can be applied. While existing methods mitigate the problem through reconstructing the missing modalities, it increases unnecessary computational cost, which could be just as critical, especially for large, deployed systems. To solve the problem from both sides, we propose a novel guidance network that promotes knowledge sharing during training, taking advantage of the multimodal representations to train better single-modality models for inference. Real-life experiment in violence detection shows that our proposed framework trains single-modality models that significantly outperform its traditionally trained counterparts while maintaining the same inference cost.
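One way to read the "guidance" idea above is as feature-level knowledge sharing from a fused multimodal representation into a single-modality branch during training, so that only the single-modality branch runs at inference. The sketch below shows that pattern with linear stand-ins for the backbones, a shared classification head, and an L2 guidance loss; the architecture, losses, and weighting are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedUnimodalTraining(nn.Module):
    """Train a video-only student while a fused audio+video representation guides its features."""
    def __init__(self, dim=128, num_classes=2):
        super().__init__()
        self.video_enc = nn.Linear(512, dim)    # stands in for a real video backbone
        self.audio_enc = nn.Linear(128, dim)    # stands in for a real audio backbone
        self.fuse = nn.Linear(2 * dim, dim)     # multimodal "guidance" representation
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video_feat, audio_feat, labels, guide_weight=0.5):
        v = self.video_enc(video_feat)
        m = self.fuse(torch.cat([v.detach(), self.audio_enc(audio_feat)], dim=-1))
        cls_loss = F.cross_entropy(self.head(v), labels)      # single-modality objective
        fused_loss = F.cross_entropy(self.head(m), labels)    # keeps the guidance branch trained
        guide_loss = F.mse_loss(v, m.detach())                # pull unimodal features toward fused ones
        return cls_loss + fused_loss + guide_weight * guide_loss

    @torch.no_grad()
    def infer(self, video_feat):                # inference uses the video branch only
        return self.head(self.video_enc(video_feat)).argmax(dim=-1)

model = GuidedUnimodalTraining()
loss = model(torch.randn(4, 512), torch.randn(4, 128), torch.tensor([0, 1, 0, 1]))
print(float(loss), model.infer(torch.randn(4, 512)))
```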

Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy

  • paper_url: http://arxiv.org/abs/2309.03445
  • repo_url: https://github.com/piggy2009/dm_underwater
  • paper_authors: Yi Tang, Takafumi Iwaguchi, Hiroshi Kawasaki
  • for: Image enhancement for underwater scenes based on a diffusion model
  • methods: A conditional denoising diffusion probabilistic model takes the underwater image and Gaussian noise as input and generates the enhanced image; a lightweight transformer-based denoising network speeds up each forward pass, and a skip sampling strategy with two non-uniform time-step schedules (piecewise sampling and search with an evolutionary algorithm) reduces the number of iterations
  • results: On widely used underwater enhancement datasets, the approach achieves competitive or better performance than state-of-the-art methods while being highly efficient; code is available at https://github.com/piggy2009/DM_underwater
    Abstract In this paper, we present an approach to image enhancement with a diffusion model in underwater scenes. Our method adapts conditional denoising diffusion probabilistic models to generate the corresponding enhanced images, using the underwater images and Gaussian noise as the inputs. Additionally, to improve the efficiency of the reverse process in the diffusion model, we adopt two different strategies. First, we propose a lightweight transformer-based denoising network, which effectively shortens the per-iteration forward time of the network. Second, we introduce a skip sampling strategy to reduce the number of iterations. Based on the skip sampling strategy, we further propose two non-uniform sampling methods for the sequence of time steps, namely piecewise sampling and searching with an evolutionary algorithm. Both are effective and can further improve performance with the same number of steps compared to the previous uniform sampling. Finally, we conduct a comparative evaluation on widely used underwater enhancement datasets between recent state-of-the-art methods and the proposed approach. The experimental results show that our approach achieves both competitive performance and high efficiency. Our code is available at https://github.com/piggy2009/DM_underwater.
    摘要 在本文中,我们提出了一种基于扩散模型的水下场景图像增强方法。该方法利用条件去噪扩散概率模型,以水下图像和高斯噪声作为输入,生成相应的增强图像。此外,为了提高扩散模型反向过程的效率,我们采用了两种不同的手段。一是使用基于Transformer的轻量级去噪网络,有效缩短网络每次迭代的前向传播时间;二是引入跳跃采样策略,以减少迭代次数。在跳跃采样策略的基础上,我们还提出了两种非均匀的时间步采样方法,即分段采样和基于进化算法的搜索。两者均行之有效,在相同步数下相比先前的均匀采样可进一步提升性能。最后,我们在广泛使用的水下图像增强数据集上将所提方法与近期最先进方法进行了对比评估。实验结果表明,我们的方法能够同时实现有竞争力的性能和高效率。代码可在 https://github.com/piggy2009/DM_underwater 获取。
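As a concrete illustration of the skip-sampling idea summarized above, the sketch below constructs a shortened, piecewise (non-uniform) subsequence of diffusion timesteps rather than iterating over all T steps. The split point, the share of steps assigned to each piece, and the function names are assumptions made for illustration; they are not taken from the paper or its released code.

```python
# Hypothetical sketch of a piecewise (non-uniform) skip-sampling schedule for a
# diffusion sampler. The split ratio and step allocation are illustrative
# assumptions, not the schedule used in the paper or its repository.
import numpy as np

def piecewise_timesteps(total_steps=1000, num_steps=20, split=0.3, late_fraction=0.7):
    """Pick `num_steps` timesteps out of `total_steps`, spending more of the
    budget on the low-noise (late) part of the reverse process.

    split: boundary (as a fraction of total_steps) between the two pieces.
    late_fraction: fraction of the step budget spent below the boundary.
    """
    boundary = int(total_steps * split)
    n_late = int(num_steps * late_fraction)
    n_early = num_steps - n_late
    # Dense steps near t=0 (fine details), sparse steps near t=T (coarse noise).
    late = np.linspace(0, boundary - 1, n_late, dtype=int)
    early = np.linspace(boundary, total_steps - 1, n_early, dtype=int)
    # The reverse process visits timesteps from high t to low t, without duplicates.
    return sorted(set(late.tolist() + early.tolist()), reverse=True)

if __name__ == "__main__":
    ts = piecewise_timesteps()
    print(len(ts), ts[:5], ts[-5:])  # e.g. 20 steps selected out of 1000
```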

Punctate White Matter Lesion Segmentation in Preterm Infants Powered by Counterfactually Generative Learning

  • paper_url: http://arxiv.org/abs/2309.03440
  • repo_url: None
  • paper_authors: Zehua Ren, Yongheng Sun, Miaomiao Wang, Yuying Feng, Xianjun Li, Chao Jin, Jian Yang, Chunfeng Lian, Fan Wang
  • for: 这项研究旨在实现点状白质病变 (PWML) 的准确分割,以便及时诊断和治疗相关的发育障碍。
  • methods: 该研究将反事实推理的思想与脑组织分割这一辅助任务相结合,学习病变的细粒度位置和形态表示,从而实现点状白质病变的精确定位与分割。
  • results: 该研究设计了一个简单且易于实现的深度学习框架(即DeepPWML),将病变反事实图与组织概率图相结合来训练轻量级分割网络,在真实临床数据集上的点状白质病变分割任务中取得了最先进的性能。
    Abstract Accurate segmentation of punctate white matter lesions (PWMLs) is fundamental for the timely diagnosis and treatment of related developmental disorders. Automated PWML segmentation from infant brain MR images is challenging, considering that the lesions are typically small and low-contrast, and the number of lesions may dramatically change across subjects. Existing learning-based methods directly apply general network architectures to this challenging task, which may fail to capture detailed positional information of PWMLs, potentially leading to severe under-segmentations. In this paper, we propose to leverage the idea of counterfactual reasoning coupled with the auxiliary task of brain tissue segmentation to learn fine-grained positional and morphological representations of PWMLs for accurate localization and segmentation. A simple and easy-to-implement deep-learning framework (i.e., DeepPWML) is accordingly designed. It combines the lesion counterfactual map with the tissue probability map to train a lightweight PWML segmentation network, demonstrating state-of-the-art performance on a real-clinical dataset of infant T1w MR images. The code is available at \href{https://github.com/ladderlab-xjtu/DeepPWML}{https://github.com/ladderlab-xjtu/DeepPWML}.
    摘要 点状白质病变 (PWML) 的精准分割是及时诊断和治疗相关发育障碍的基础。从婴儿脑部MR图像中自动分割PWML具有挑战性,因为病变通常很小且对比度低,并且不同受试者之间病变数量可能差异很大。现有的基于学习的方法直接将通用网络架构应用于这一任务,可能无法捕捉PWML的细致位置信息,从而导致严重的欠分割。在这篇论文中,我们提出利用反事实推理并结合脑组织分割这一辅助任务,学习PWML的细粒度位置与形态表示,以实现精准的定位与分割。为此,我们设计了一个简单易用的深度学习框架(即DeepPWML),该框架结合病变反事实图与组织概率图来训练一个轻量级的PWML分割网络,并在真实临床的婴儿T1w MR图像数据集上达到了最先进的性能。代码可以在 \href{https://github.com/ladderlab-xjtu/DeepPWML}{https://github.com/ladderlab-xjtu/DeepPWML} 中获取。
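To make the fusion described above more concrete, here is a minimal sketch of how a lesion counterfactual map and tissue probability maps could be concatenated as extra input channels of a lightweight 3D segmentation network. The tensor shapes, the way the counterfactual volume is obtained, and the network itself are assumptions for illustration only and do not reproduce the DeepPWML implementation.

```python
# Hypothetical sketch: fuse a lesion counterfactual map and tissue probability
# maps as extra channels of a lightweight 3D segmentation network.
# Shapes, channel counts, and the counterfactual generator are illustrative
# assumptions; see the authors' repository for the actual DeepPWML code.
import torch
import torch.nn as nn

class LightweightSegNet(nn.Module):
    def __init__(self, in_channels, num_classes=2, width=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv3d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv3d(width, num_classes, 1),
        )

    def forward(self, x):
        return self.body(x)

def counterfactual_map(image, lesion_free):
    """Voxel-wise difference between the observed image and a generated
    'lesion-removed' counterfactual; large values suggest lesion locations."""
    return (image - lesion_free).abs()

# Random tensors stand in for a T1w patch, a generated lesion-free
# counterfactual, and 3-class tissue probability maps.
t1w = torch.randn(1, 1, 32, 32, 32)
lesion_free = torch.randn(1, 1, 32, 32, 32)
tissue_prob = torch.softmax(torch.randn(1, 3, 32, 32, 32), dim=1)

cf = counterfactual_map(t1w, lesion_free)
net_in = torch.cat([t1w, cf, tissue_prob], dim=1)  # 1 + 1 + 3 = 5 channels
net = LightweightSegNet(in_channels=5)
logits = net(net_in)                               # (1, 2, 32, 32, 32)
```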