cs.CV - 2023-08-20

Boosting Adversarial Transferability by Block Shuffle and Rotation

  • paper_url: http://arxiv.org/abs/2308.10299
  • repo_url: None
  • paper_authors: Kunyu Wang, Xuanran He, Wenxuan Wang, Xiaosen Wang
  • for: This paper focuses on improving the transferability of adversarial examples in the black-box setting.
  • methods: The proposed method, called block shuffle and rotation (BSR), uses input transformation to create a set of new images for gradient calculation.
  • results: The BSR method achieves significantly better transferability than existing input transformation based methods under single-model and ensemble-model settings, and combining BSR with current input transformation methods further improves transferability.
    Abstract Adversarial examples mislead deep neural networks with imperceptible perturbations and have brought significant threats to deep learning. An important aspect is their transferability, which refers to their ability to deceive other models, thus enabling attacks in the black-box setting. Though various methods have been proposed to boost transferability, the performance still falls short compared with white-box attacks. In this work, we observe that existing input transformation based attacks, one of the mainstream transfer-based attacks, result in different attention heatmaps on various models, which might limit the transferability. We also find that breaking the intrinsic relation of the image can disrupt the attention heatmap of the original image. Based on this finding, we propose a novel input transformation based attack called block shuffle and rotation (BSR). Specifically, BSR splits the input image into several blocks, then randomly shuffles and rotates these blocks to construct a set of new images for gradient calculation. Empirical evaluations on the ImageNet dataset demonstrate that BSR could achieve significantly better transferability than the existing input transformation based methods under single-model and ensemble-model settings. Combining BSR with the current input transformation method can further improve the transferability, which significantly outperforms the state-of-the-art methods.
    摘要 对抗样本以不可察觉的扰动误导深度神经网络,给深度学习带来了重大威胁。其中一个重要方面是它们的可迁移性,即欺骗其他模型的能力,从而实现黑盒设置下的攻击。虽然已有许多方法被提出来增强可迁移性,但其性能仍落后于白盒攻击。在这项工作中,我们观察到现有的基于输入变换的攻击(主流的迁移攻击之一)在不同模型上产生不同的注意力热图,这可能限制了可迁移性;我们还发现,打破图像的内在关系可以破坏原始图像的注意力热图。基于这一发现,我们提出了一种新的基于输入变换的攻击方法——块打乱与旋转(BSR)。具体来说,BSR 将输入图像分成多个块,然后随机打乱并旋转这些块,构建一组新图像用于梯度计算。在 ImageNet 数据集上的实验表明,BSR 在单模型和集成模型设置下都能取得显著优于现有基于输入变换方法的可迁移性;将 BSR 与现有输入变换方法结合可进一步提升可迁移性,显著超越当前最佳方法。
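As a concrete illustration of the kind of transformation BSR describes — splitting an image into blocks, then randomly shuffling and rotating them before gradient computation — here is a minimal NumPy sketch. The block count, 90-degree rotation choice, and function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def block_shuffle_rotate(image: np.ndarray, n_blocks: int = 2, rng=None) -> np.ndarray:
    """Split an HxWxC image into an n_blocks x n_blocks grid, shuffle the blocks,
    and rotate each block by a random multiple of 90 degrees (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    bh, bw = h // n_blocks, w // n_blocks
    blocks = [image[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
              for i in range(n_blocks) for j in range(n_blocks)]
    order = rng.permutation(len(blocks))
    out = image.copy()
    for idx, src in enumerate(order):
        # rot90 on a non-square block would change its shape; this sketch assumes square blocks
        blk = np.rot90(blocks[src], k=int(rng.integers(0, 4)))
        i, j = divmod(idx, n_blocks)
        out[i*bh:i*bh+blk.shape[0], j*bw:j*bw+blk.shape[1]] = blk
    return out

# Example: build several transformed copies whose gradients could be averaged.
img = np.random.rand(224, 224, 3).astype(np.float32)
copies = [block_shuffle_rotate(img) for _ in range(4)]
```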

DomainAdaptor: A Novel Approach to Test-time Adaptation

  • paper_url: http://arxiv.org/abs/2308.10297
  • repo_url: None
  • paper_authors: Jian Zhang, Lei Qi, Yinghuan Shi, Yang Gao
  • for: 这篇论文研究测试时自适应(test-time adaptation)问题:在测试阶段将已训练的 CNN 模型适应到未见过的目标域,并在数据稀少的情况下取得更好的预测效果。
  • methods: 这个论文提出了一个叫做 DomainAdaptor 的方法,它包括一个 AdaMixBN 模组和一个 Generalized Entropy Minimization (GEM) 损失函数。AdaMixBN 模组通过动态混合系数和统计转换操作来适应训练和试用数据之间的领域差异。而 GEM 损失函数则是将 Entropy Minimization 损失函数扩展以更好地利用试用数据中的信息。
  • results: 实验结果显示,DomainAdaptor 可以与现有的方法相比,在四个 benchmark 上实现更高的预测性能。此外,在缺乏数据的情况下,DomainAdaptor 对于几个缺乏数据的benchmark 也实现了更大的改进。
    Abstract To deal with the domain shift between training and test samples, current methods have primarily focused on learning generalizable features during training and ignore the specificity of unseen samples that are also critical during the test. In this paper, we investigate a more challenging task that aims to adapt a trained CNN model to unseen domains during the test. To maximumly mine the information in the test data, we propose a unified method called DomainAdaptor for the test-time adaptation, which consists of an AdaMixBN module and a Generalized Entropy Minimization (GEM) loss. Specifically, AdaMixBN addresses the domain shift by adaptively fusing training and test statistics in the normalization layer via a dynamic mixture coefficient and a statistic transformation operation. To further enhance the adaptation ability of AdaMixBN, we design a GEM loss that extends the Entropy Minimization loss to better exploit the information in the test data. Extensive experiments show that DomainAdaptor consistently outperforms the state-of-the-art methods on four benchmarks. Furthermore, our method brings more remarkable improvement against existing methods on the few-data unseen domain. The code is available at https://github.com/koncle/DomainAdaptor.
    摘要 现有方法主要关注在训练时学习可泛化的特征,而忽略了测试时同样关键的未见样本的特异性。在这篇论文中,我们研究一个更具挑战性的任务:在测试阶段将已训练的 CNN 模型适应到未见域。为了最大化利用测试数据中的信息,我们提出了一种统一的测试时自适应方法 DomainAdaptor,它由 AdaMixBN 模块和广义熵最小化(GEM)损失组成。具体来说,AdaMixBN 通过动态混合系数和统计变换操作,在归一化层中自适应地融合训练与测试统计信息,以应对域偏移。为进一步增强 AdaMixBN 的适应能力,我们设计了 GEM 损失,将熵最小化损失加以扩展,以更好地利用测试数据中的信息。大量实验表明,DomainAdaptor 在四个基准上持续优于当前最先进方法;在数据稀少的未见域上,相对于现有方法带来的提升更为显著。代码见 https://github.com/koncle/DomainAdaptor。
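The core idea described for AdaMixBN — blending stored source batch-norm statistics with statistics computed from the current test batch — can be sketched as below. The fixed mixing coefficient and the plain re-normalization are simplifying assumptions; the paper's dynamic coefficient and statistic transformation are not reproduced here, and the entropy objective shown is the vanilla form that GEM extends.

```python
import numpy as np

def adamix_normalize(x, running_mean, running_var, alpha=0.7, eps=1e-5):
    """Normalize test features x (N, C) with a convex mix of stored source
    statistics and statistics estimated from the current test batch.
    alpha is a fixed stand-in for the paper's dynamic mixing coefficient."""
    test_mean = x.mean(axis=0)
    test_var = x.var(axis=0)
    mean = alpha * running_mean + (1 - alpha) * test_mean
    var = alpha * running_var + (1 - alpha) * test_var
    return (x - mean) / np.sqrt(var + eps)

def entropy_minimization(probs, eps=1e-8):
    """Plain entropy-minimization objective over softmax outputs (N, K)."""
    return -(probs * np.log(probs + eps)).sum(axis=1).mean()
```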

Privileged Anatomical and Protocol Discrimination in Trackerless 3D Ultrasound Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10293
  • repo_url: None
  • paper_authors: Qi Li, Ziyi Shen, Qian Li, Dean C. Barratt, Thomas Dowrick, Matthew J. Clarkson, Tom Vercauteren, Yipeng Hu
  • for: 这 paper 是为了提高深度神经网络(DNN)无需外部跟踪设备的三维自由手 Ultrasound(US)重建的研究。
  • methods: 这篇论文考察了两个可能促进 DNN 重建的因素:解剖结构与扫描协议,并提议将这两个在训练时即可获得的因素作为特权信息来改进现有的基于 DNN 的方法;具体通过一个多任务框架,把解剖与协议判别作为辅助任务,并用可微的网络结构优化辅助任务的分支位置。
  • results: 实验结果表明,解剖与协议差异都是基于 DNN 的超声重建的促成因素;学习区分不同受试者(解剖差异)和预定义的扫描路径(协议差异)都能显著改善帧预测精度、体积重建重叠率、累积跟踪误差和最终漂移。
    Abstract Three-dimensional (3D) freehand ultrasound (US) reconstruction without using any additional external tracking device has seen recent advances with deep neural networks (DNNs). In this paper, we first investigated two identified contributing factors of the learned inter-frame correlation that enable the DNN-based reconstruction: anatomy and protocol. We propose to incorporate the ability to represent these two factors - readily available during training - as the privileged information to improve existing DNN-based methods. This is implemented in a new multi-task method, where the anatomical and protocol discrimination are used as auxiliary tasks. We further develop a differentiable network architecture to optimise the branching location of these auxiliary tasks, which controls the ratio between shared and task-specific network parameters, for maximising the benefits from the two auxiliary tasks. Experimental results, on a dataset with 38 forearms of 19 volunteers acquired with 6 different scanning protocols, show that 1) both anatomical and protocol variances are enabling factors for DNN-based US reconstruction; 2) learning how to discriminate different subjects (anatomical variance) and predefined types of scanning paths (protocol variance) both significantly improve frame prediction accuracy, volume reconstruction overlap, accumulated tracking error and final drift, using the proposed algorithm.
    摘要 三维自由手操作超声成像(US)已经在最近得到了深度神经网络(DNN)的进步。在这篇论文中,我们首先调查了两个促进学习的交叉帧相关性的因素:解剖学和协议。我们提议将这两个因素作为特权信息,以改进现有的DNN基于方法。我们实现了一种新的多任务方法,其中解剖学和协议歧义被用作辅助任务。我们进一步开发了可导网络架构,以优化分支位置,控制分支位置中共享和任务特定的网络参数的比例,以最大化来自两个辅助任务的利益。实验结果,在38只臂和19名志愿者通过6种扫描协议获得的数据集上,显示了以下结论:1)解剖学和协议的差异都是US成像中的促进因素;2)学习不同主体(解剖学差异)和预定的扫描路径(协议差异)都能够显著提高帧预测精度、重建 overlap、累积跟踪误差和最终漂移。使用我们提议的算法可以获得这些结果。

Efficient-VRNet: An Exquisite Fusion Network for Riverway Panoptic Perception based on Asymmetric Fair Fusion of Vision and 4D mmWave Radar

  • paper_url: http://arxiv.org/abs/2308.10287
  • repo_url: https://github.com/GuanRunwei/Efficient-VRNet
  • paper_authors: Runwei Guan, Shanliang Yao, Xiaohui Zhu, Ka Lok Man, Yong Yue, Jeremy Smith, Eng Gee Lim, Yutao Yue
  • for: 这个论文的目的是提出一个基于 USV 的河道全景感知方法,以提高自动航行的精度和安全性。
  • methods: 本论文使用了 Contextual Clustering (CoC) 以及视觉与 4D 毫米波雷达的非对称融合,公平对待两种模态,实现了同时进行目标检测、语义分割与可行驶区域分割。
  • results: 在实验中,我们的 Efficient-VRNet 模型在自行收集的数据集上表现出色,特别是在恶劣天气和光照较差的环境下,其性能优于其他单模态模型。
    Abstract Panoptic perception is essential to unmanned surface vehicles (USVs) for autonomous navigation. The current panoptic perception scheme is mainly based on vision only, that is, object detection and semantic segmentation are performed simultaneously based on camera sensors. Nevertheless, the fusion of camera and radar sensors is regarded as a promising method which could substitute pure vision methods, but almost all works focus on object detection only. Therefore, how to maximize and subtly fuse the features of vision and radar to improve both detection and segmentation is a challenge. In this paper, we focus on riverway panoptic perception based on USVs, which is a considerably unexplored field compared with road panoptic perception. We propose Efficient-VRNet, a model based on Contextual Clustering (CoC) and the asymmetric fusion of vision and 4D mmWave radar, which treats both vision and radar modalities fairly. Efficient-VRNet can simultaneously perform detection and segmentation of riverway objects and drivable area segmentation. Furthermore, we adopt an uncertainty-based panoptic perception training strategy to train Efficient-VRNet. In the experiments, our Efficient-VRNet achieves better performances on our collected dataset than other uni-modal models, especially in adverse weather and environment with poor lighting conditions. Our code and models are available at \url{https://github.com/GuanRunwei/Efficient-VRNet}.
    摘要 全景感知对于无人水面艇(USV)的自主导航至关重要。目前的全景感知方案主要基于纯视觉,即基于摄像头同时进行目标检测和语义分割。摄像头与雷达传感器的融合被认为是一种有前景、可替代纯视觉方法的途径,但几乎所有相关工作都只关注目标检测。因此,如何最大化并精细地融合视觉与雷达特征,以同时提升检测与分割,是一个挑战。在这篇论文中,我们关注基于 USV 的河道全景感知,与道路全景感知相比,这是一个相当未被探索的领域。我们提出了 Efficient-VRNet,它基于 Contextual Clustering(CoC)以及视觉与 4D 毫米波雷达的非对称融合,公平地对待两种模态,可同时进行河道目标的检测与分割以及可行驶区域分割。此外,我们采用基于不确定性的全景感知训练策略来训练 Efficient-VRNet。在实验中,我们的 Efficient-VRNet 在自行采集的数据集上优于其他单模态模型,尤其是在恶劣天气和光照较差的环境下。我们的代码和模型可在 \url{https://github.com/GuanRunwei/Efficient-VRNet} 获取。
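The abstract mentions an "uncertainty-based panoptic perception training strategy" without giving its form. One common way to weight the detection and segmentation losses by learned homoscedastic uncertainty (in the style of Kendall et al.) is sketched below in PyTorch; treating it as the strategy used here is an assumption, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine per-task losses with learned log-variances (a common heuristic;
    whether Efficient-VRNet uses exactly this form is an assumption)."""
    def __init__(self, n_tasks: int = 2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])  # higher uncertainty -> smaller task weight
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage: criterion = UncertaintyWeightedLoss(2); loss = criterion([det_loss, seg_loss])
```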

DomainDrop: Suppressing Domain-Sensitive Channels for Domain Generalization

  • paper_url: http://arxiv.org/abs/2308.10285
  • repo_url: None
  • paper_authors: Jintao Guo, Lei Qi, Yinghuan Shi
  • for: 提升模型在未见目标域数据上的泛化性能(域泛化)
  • methods: 提出 DomainDrop 框架:在前向传播中利用域判别器识别各层特征图中对域偏移敏感(激活不稳定)的通道并将其丢弃,从而持续增强通道对域偏移的鲁棒性
  • results: 实验表明,所提方法在多个 benchmark 上达到了最先进的性能,优于其他竞争方法
    Abstract Deep Neural Networks have exhibited considerable success in various visual tasks. However, when applied to unseen test datasets, state-of-the-art models often suffer performance degradation due to domain shifts. In this paper, we introduce a novel approach for domain generalization from a novel perspective of enhancing the robustness of channels in feature maps to domain shifts. We observe that models trained on source domains contain a substantial number of channels that exhibit unstable activations across different domains, which are inclined to capture domain-specific features and behave abnormally when exposed to unseen target domains. To address the issue, we propose a DomainDrop framework to continuously enhance the channel robustness to domain shifts, where a domain discriminator is used to identify and drop unstable channels in feature maps of each network layer during forward propagation. We theoretically prove that our framework could effectively lower the generalization bound. Extensive experiments on several benchmarks indicate that our framework achieves state-of-the-art performance compared to other competing methods. Our code is available at https://github.com/lingeringlight/DomainDrop.
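DomainDrop identifies and drops domain-sensitive channels with a per-layer domain discriminator. As a much simpler proxy of that idea, the sketch below scores each channel by how differently it activates across source domains and masks the most domain-sensitive ones; the discriminator-based selection and layer-wise scheduling of the paper are not reproduced, and the variance-based score is an assumption made only for illustration.

```python
import numpy as np

def drop_domain_sensitive_channels(features, domain_labels, drop_ratio=0.33):
    """features: (N, C) activations pooled per sample; domain_labels: (N,) ints.
    Scores channels by the between-domain variance of their mean activation
    (a crude stand-in for the paper's domain discriminator) and zeroes the
    most domain-sensitive channels."""
    domain_labels = np.asarray(domain_labels)
    domains = np.unique(domain_labels)
    per_domain_means = np.stack([features[domain_labels == d].mean(axis=0) for d in domains])
    sensitivity = per_domain_means.var(axis=0)            # (C,) instability score per channel
    n_drop = int(drop_ratio * features.shape[1])
    drop_idx = np.argsort(sensitivity)[-n_drop:]
    mask = np.ones(features.shape[1])
    mask[drop_idx] = 0.0
    return features * mask, drop_idx
```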

MacFormer: Map-Agent Coupled Transformer for Real-time and Robust Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2308.10280
  • repo_url: None
  • paper_authors: Chen Feng, Hangning Zhou, Huadong Lin, Zhigang Zhang, Ziyao Xu, Chi Zhang, Boyu Zhou, Shaojie Shen
  • for: 预测自动驾驶场景中交通参与者(agent)的未来行为
  • methods: 使用 Map-Agent Coupled Transformer (MacFormer) 框架,并实现了约束映射和参考提取模块,以及一种多任务优化策略 (MTOS) 来增强网络学习。
  • results: 在 Argoverse 1、Argoverse 2 和 nuScenes 真实世界基准上均取得了最佳性能,同时具有最低的推理延迟和最小的模型规模;实验还表明,该框架对不完善的轨迹输入具有鲁棒性。
    Abstract Predicting the future behavior of agents is a fundamental task in autonomous vehicle domains. Accurate prediction relies on comprehending the surrounding map, which significantly regularizes agent behaviors. However, existing methods have limitations in exploiting the map and exhibit a strong dependence on historical trajectories, which yield unsatisfactory prediction performance and robustness. Additionally, their heavy network architectures impede real-time applications. To tackle these problems, we propose Map-Agent Coupled Transformer (MacFormer) for real-time and robust trajectory prediction. Our framework explicitly incorporates map constraints into the network via two carefully designed modules named coupled map and reference extractor. A novel multi-task optimization strategy (MTOS) is presented to enhance learning of topology and rule constraints. We also devise bilateral query scheme in context fusion for a more efficient and lightweight network. We evaluated our approach on Argoverse 1, Argoverse 2, and nuScenes real-world benchmarks, where it all achieved state-of-the-art performance with the lowest inference latency and smallest model size. Experiments also demonstrate that our framework is resilient to imperfect tracklet inputs. Furthermore, we show that by combining with our proposed strategies, classical models outperform their baselines, further validating the versatility of our framework.
    摘要 预测交通参与者(agent)的未来行为是自动驾驶领域的基本任务。准确的预测依赖于对周围地图的理解,地图会显著地规范 agent 的行为。然而,现有方法在利用地图方面存在局限,并且严重依赖历史轨迹,导致预测性能和鲁棒性不尽人意;此外,它们庞大的网络结构也阻碍了实时应用。为解决这些问题,我们提出了 Map-Agent Coupled Transformer(MacFormer),用于实时且鲁棒的轨迹预测。我们的框架通过两个专门设计的模块——耦合地图模块(coupled map)和参考提取器(reference extractor)——将地图约束显式地引入网络。我们还提出了一种多任务优化策略(MTOS),以增强对拓扑和规则约束的学习,并在上下文融合中设计了双边查询机制,使网络更高效、更轻量。我们在 Argoverse 1、Argoverse 2 和 nuScenes 真实世界基准上进行了评估,均取得了最先进的性能,同时拥有最低的推理延迟和最小的模型规模。实验还表明,我们的框架对不完善的轨迹输入具有鲁棒性。此外,我们还证明了经典模型在结合我们提出的策略后可以超越其基线,进一步验证了框架的通用性。

Turning Waste into Wealth: Leveraging Low-Quality Samples for Enhancing Continuous Conditional Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2308.10273
  • repo_url: None
  • paper_authors: Xin Ding, Yongwei Wang, Zuheng Xu
  • for: 提高 conditional generative adversarial networks (CcGANs) 的生成质量,使其能够生成 conditional 的图像。
  • methods: 提出了一种新的 Negative Data Augmentation (NDA) 方法,称为 Dual-NDA,该方法使用了两种不同的负样本:一是通过预训练 CcGAN 生成的视觉不真实的图像,二是通过修改实际图像的标签来生成的标签不一致的图像。
  • results: 在 UTKFace 和 Steering Angle 上的实验表明,Dual-NDA 可以持续提升 CcGAN 生成图像的视觉质量和标签一致性,明显优于原始 NDA;结合 Dual-NDA 后,CcGAN 的表现超越了当前最先进的条件 GAN 和扩散模型,达到新的性能高度。
    Abstract Continuous Conditional Generative Adversarial Networks (CcGANs) enable generative modeling conditional on continuous scalar variables (termed regression labels). However, they can produce subpar fake images due to limited training data. Although Negative Data Augmentation (NDA) effectively enhances unconditional and class-conditional GANs by introducing anomalies into real training images, guiding the GANs away from low-quality outputs, its impact on CcGANs is limited, as it fails to replicate negative samples that may occur during the CcGAN sampling. We present a novel NDA approach called Dual-NDA specifically tailored for CcGANs to address this problem. Dual-NDA employs two types of negative samples: visually unrealistic images generated from a pre-trained CcGAN and label-inconsistent images created by manipulating real images' labels. Leveraging these negative samples, we introduce a novel discriminator objective alongside a modified CcGAN training algorithm. Empirical analysis on UTKFace and Steering Angle reveals that Dual-NDA consistently enhances the visual fidelity and label consistency of fake images generated by CcGANs, exhibiting a substantial performance gain over the vanilla NDA. Moreover, by applying Dual-NDA, CcGANs demonstrate a remarkable advancement beyond the capabilities of state-of-the-art conditional GANs and diffusion models, establishing a new pinnacle of performance.
    摘要 连续条件生成对抗网络(CcGANs)能够以连续标量变量(回归标签)为条件进行生成建模,但由于训练数据有限,可能生成质量欠佳的伪图像。负样本数据增强(NDA)通过向真实训练图像引入异常来引导 GAN 远离低质量输出,可有效增强无条件与类条件 GAN,但它无法复现 CcGAN 采样过程中可能出现的负样本,因而对 CcGAN 的作用有限。为此,我们提出了一种专为 CcGAN 设计的新型 NDA 方法——Dual-NDA。Dual-NDA 使用两种负样本:由预训练 CcGAN 生成的视觉不真实图像,以及通过篡改真实图像标签得到的标签不一致图像。基于这些负样本,我们引入了一种新的判别器目标,并对 CcGAN 的训练算法进行了相应修改。在 UTKFace 和 Steering Angle 上的实证分析表明,Dual-NDA 能持续提升 CcGAN 生成图像的视觉保真度与标签一致性,相比原始 NDA 有大幅的性能提升;应用 Dual-NDA 后,CcGAN 展现出超越最先进条件 GAN 与扩散模型的显著进步,树立了新的性能高峰。
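One of Dual-NDA's two negative-sample types is "label-inconsistent" images obtained by manipulating real images' regression labels. A minimal sketch of how such pairs could be formed — re-assigning each image a label pushed far from its true one — is shown below; the offset threshold, the [0, 1] label range, and the pairing strategy are assumptions for illustration.

```python
import numpy as np

def make_label_inconsistent_negatives(images, labels, min_offset=0.2, rng=None):
    """images: (N, ...) array; labels: (N,) regression labels normalized to [0, 1].
    Returns (images, fake_labels) where each fake label is pushed at least
    min_offset away from the true label, producing label-inconsistent negatives."""
    if rng is None:
        rng = np.random.default_rng()
    labels = np.asarray(labels, dtype=float)
    shift = rng.uniform(min_offset, 1.0, size=labels.shape)
    sign = rng.choice([-1.0, 1.0], size=labels.shape)
    fake = np.clip(labels + sign * shift, 0.0, 1.0)
    # if clipping collapsed the offset, push in the opposite direction instead
    too_close = np.abs(fake - labels) < min_offset
    fake[too_close] = np.clip(labels[too_close] - sign[too_close] * shift[too_close], 0.0, 1.0)
    return images, fake
```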

Domain Reduction Strategy for Non Line of Sight Imaging

  • paper_url: http://arxiv.org/abs/2308.10269
  • repo_url: None
  • paper_authors: Hyunbo Shim, In Cho, Daekyu Kwon, Seon Joo Kim
  • for: 非直线视野(NLOS)图像重建,实现隐藏的场景重建。
  • methods: 基于隐藏体积中各点返回光子可独立计算的观察,将瞬态信号建模为广义光传播函数的线性组合,并通过域缩减排除隐藏体积中的空白区域,从而提高优化的计算效率。
  • results: 在非平面中继墙、稀疏扫描模式、共焦与非共焦以及表面几何重建等多种 NLOS 场景中,实现了高效且高精度的重建。
    Abstract This paper presents a novel optimization-based method for non-line-of-sight (NLOS) imaging that aims to reconstruct hidden scenes under various setups. Our method is built upon the observation that photons returning from each point in hidden volumes can be independently computed if the interactions between hidden surfaces are trivially ignored. We model the generalized light propagation function to accurately represent the transients as a linear combination of these functions. Moreover, our proposed method includes a domain reduction procedure to exclude empty areas of the hidden volumes from the set of propagation functions, thereby improving computational efficiency of the optimization. We demonstrate the effectiveness of the method in various NLOS scenarios, including non-planar relay wall, sparse scanning patterns, confocal and non-confocal, and surface geometry reconstruction. Experiments conducted on both synthetic and real-world data clearly support the superiority and the efficiency of the proposed method in general NLOS scenarios.

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

  • paper_url: http://arxiv.org/abs/2308.10257
  • repo_url: https://github.com/leoShen917/Make-It-4D
  • paper_authors: Liao Shen, Xingyi Li, Huiqiang Sun, Juewen Peng, Ke Xian, Zhiguo Cao, Guosheng Lin
  • for: This paper aims to synthesize a long-term dynamic video from a single image, addressing the challenges of consistent visual content movements and large camera motions.
  • methods: The proposed method, Make-It-4D, utilizes layered depth images (LDIs) and motion estimation to estimate the underlying 4D representation of the scene, including 3D geometry and scene motion. The method also employs a pretrained diffusion model to inpaint and outpaint the input image, filling in occluded regions.
  • results: The proposed method demonstrates effective rendering results, showcasing compelling dynamic video synthesis from a single input image. The method is also training-free, saving a significant amount of training time.
    Abstract We study the problem of synthesizing a long-term dynamic video from only a single image. This is challenging since it requires consistent visual content movements given large camera motions. Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories. To address these issues, it is essential to estimate the underlying 4D (including 3D geometry and scene motion) and fill in the occluded regions. To this end, we present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image. On the one hand, we utilize layered depth images (LDIs) to represent a scene, and they are then unprojected to form a feature point cloud. To animate the visual content, the feature point cloud is displaced based on the scene flow derived from motion estimation and the corresponding camera pose. Such 4D representation enables our method to maintain the global consistency of the generated dynamic video. On the other hand, we fill in the occluded regions by using a pretrained diffusion model to inpaint and outpaint the input image. This enables our method to work under large camera motions. Benefiting from our design, our method can be training-free which saves a significant amount of training time. Experimental results demonstrate the effectiveness of our approach, which showcases compelling rendering results.
    摘要 我们研究将单一图像转换为长期动态影片的问题。这是一个挑战,因为需要在大型镜头运动下维持一致的视觉内容运动。现有的方法 Either hallucinate不一致的无限视野或对大型镜头轨迹进行适应。为解决这些问题,我们需要估计场景中的底层4D(包括3D几何和场景运动),并填充遮盖区域。为此,我们提出了Make-It-4D,一种新的方法,可以将单一图像转换为一致的长期动态影片。我们使用层级深度图像(LDIs)来表示场景,然后将它们投影到形成一个特征点云。为了将视觉内容动态化,我们将特征点云根据场景流和相应的镜头pose进行偏移。这样的4D表示方式使我们的方法能够维持生成的动态影片的全球一致性。另一方面,我们使用预训数据模型填充和外填充输入图像,以解决大型镜头运动下的遮盖区域问题。由于我们的设计,我们的方法不需要训练,这样可以大幅降低训练时间。实验结果显示我们的方法有着吸引人的渲染效果。
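Make-It-4D unprojects layered depth images into a feature point cloud that is then displaced by scene flow and the camera pose. A bare-bones sketch of the unprojection step — lifting a depth map with pinhole intrinsics into 3D points — is given below; the intrinsics are assumed values, and the layering and per-point features are omitted.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """depth: (H, W) metric depth map. Returns an (H*W, 3) point cloud in camera
    coordinates using a pinhole model; layered depth images would repeat this per layer."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) / fx * z
    y = (v.reshape(-1) - cy) / fy * z
    return np.stack([x, y, z], axis=1)

# The points could then be displaced by an estimated scene flow and re-projected
# under a new camera pose to render the next frame (not shown here).
pts = unproject_depth(np.ones((4, 4), dtype=np.float32), fx=500, fy=500, cx=2, cy=2)
```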

Generic Attention-model Explainability by Weighted Relevance Accumulation

  • paper_url: http://arxiv.org/abs/2308.10240
  • repo_url: None
  • paper_authors: Yiming Huang, Aozhe Jia, Xiaodan Zhang, Jiawei Zhang
  • for: This paper aims to improve the explainability of attention-based transformer models in multi-modal tasks, such as visual question answering.
  • methods: The proposed method uses a weighted relevancy strategy that takes the importance of token values into account when accumulating relevance, reducing distortion in the attention process.
  • results: The proposed method outperforms existing methods in terms of explainability, as validated through extensive perturbation tests on visual question answering and image captioning.
    Abstract Attention-based transformer models have achieved remarkable progress in multi-modal tasks, such as visual question answering. The explainability of attention-based methods has recently attracted wide interest as it can explain the inner changes of attention tokens by accumulating relevancy across attention layers. Current methods simply update relevancy by equally accumulating the token relevancy before and after the attention processes. However, the importance of token values is usually different during relevance accumulation. In this paper, we propose a weighted relevancy strategy, which takes the importance of token values into consideration, to reduce distortion when equally accumulating relevance. To evaluate our method, we propose a unified CLIP-based two-stage model, named CLIPmapper, to process Vision-and-Language tasks through CLIP encoder and a following mapper. CLIPmapper consists of self-attention, cross-attention, single-modality, and cross-modality attention, thus it is more suitable for evaluating our generic explainability method. Extensive perturbation tests on visual question answering and image captioning validate that our explainability method outperforms existing methods.
    摘要 注意基于转换器模型已经取得了多模式任务中的很大进步,如视觉问答。现在的解释可能性方法吸引了广泛的关注,因为它可以解释关注令牌的内部变化。现有方法简单地将关注令牌的相关性更新为平等地积累关注层中的令牌相关性。然而,令牌值的重要性通常在相关性积累中不同。在这篇论文中,我们提议一种权重 relevancy 策略,该策略考虑令牌值的重要性,以减少积累时的扭曲。为评估我们的方法,我们提出了一种基于 CLIP 的两stage 模型,名为 CLIPmapper,该模型通过 CLIP Encoder 和后续的映射器来处理视觉语言任务。CLIPmapper 包括自注意、交叉注意、单模态和交叉模态注意,因此它更适合评估我们的通用解释可能性方法。对于视觉问答和图文描述等任务,我们进行了广泛的干扰测试,并证明了我们的解释可能性方法在现有方法中表现出色。
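The proposed explainability method accumulates relevance across attention layers but weights token contributions by the importance of their values instead of accumulating them equally. The sketch below starts from a plain attention-rollout-style update and weights each token by the norm of its value vector; using value norms as the importance weights and the specific normalization are assumptions made only for illustration, not the paper's exact rule.

```python
import numpy as np

def weighted_relevance_rollout(attentions, values):
    """attentions: list of (T, T) head-averaged attention maps (rows sum to 1).
    values: list of (T, D) value matrices, one per layer.
    Returns a (T, T) relevance map accumulated layer by layer, with each attended
    token's contribution scaled by its value-vector norm (an assumption)."""
    T = attentions[0].shape[0]
    R = np.eye(T)
    for A, V in zip(attentions, values):
        w = np.linalg.norm(V, axis=1)            # per-token importance weights
        w = w / (w.sum() + 1e-8) * T             # keep the average weight at 1
        A_w = A * w[None, :]                     # re-weight attended tokens
        A_w = A_w / (A_w.sum(axis=1, keepdims=True) + 1e-8)
        R = R + A_w @ R                          # rollout-style accumulation
    return R
```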

From Global to Local: Multi-scale Out-of-distribution Detection

  • paper_url: http://arxiv.org/abs/2308.10239
  • repo_url: https://github.com/jimzai/mode-ood
  • paper_authors: Ji Zhang, Lianli Gao, Bingguang Hao, Hao Huang, Jingkuan Song, Hengtao Shen
  • for: 这个论文的目的是提出一个新的 OUT-OF-DISTRIBUTION(OOD)检测方法,以检测“未知”的数据,即在训练过程中没有见过的标签。
  • methods: 该方法提出 Multi-scale OOD DEtection(MODE)框架,同时利用全局图像信息和局部区域细节来增强 OOD 检测。作者首先发现,由于 ID 训练与 OOD 检测之间的尺度差异,现成的交叉熵或对比损失预训练模型难以捕获对 MODE 有用的局部表示;为此提出了基于交叉注意力机制的可训练目标 Attention-based Local PropAgation(ALPA),在 ID 训练中对齐并突出目标物体的局部区域。测试阶段再通过 Cross-Scale Decision(CSD)函数在最具判别力的多尺度表示上区分 ID/OOD 数据。
  • results: 该方法在多个 benchmark 上表现出色,相比之前的最佳方法,FPR 平均降低 19.24%,AUROC 平均提升 2.77%。代码可以在 https://github.com/JimZAI/MODE-OOD 获取。
    Abstract Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection that recognizes inputs as ID/OOD according to their relative distances to the training data of ID classes. Previous approaches calculate pairwise distances relying only on global image representations, which can be sub-optimal as the inevitable background clutter and intra-class variation may drive image-level representations from the same ID class far apart in a given representation space. In this work, we overcome this challenge by proposing Multi-scale OOD DEtection (MODE), a first framework leveraging both global visual information and local region details of images to maximally benefit OOD detection. Specifically, we first find that existing models pretrained by off-the-shelf cross-entropy or contrastive losses are incompetent to capture valuable local representations for MODE, due to the scale-discrepancy between the ID training and OOD detection processes. To mitigate this issue and encourage locally discriminative representations in ID training, we propose Attention-based Local PropAgation (ALPA), a trainable objective that exploits a cross-attention mechanism to align and highlight the local regions of the target objects for pairwise examples. During test-time OOD detection, a Cross-Scale Decision (CSD) function is further devised on the most discriminative multi-scale representations to distinguish ID/OOD data more faithfully. We demonstrate the effectiveness and flexibility of MODE on several benchmarks -- on average, MODE outperforms the previous state-of-the-art by up to 19.24% in FPR, 2.77% in AUROC. Code is available at https://github.com/JimZAI/MODE-OOD.
    摘要 OUT-OF-DISTRIBUTION (OOD) 检测目标是检测未知数据,其标签在 ID 训练过程中没有出现过。 recent progress in representation learning 使得距离基本的 OOD 检测变得更加重要。在这种情况下,我们提出了 Multi-scale OOD DEtection (MODE) 框架,它利用了全球视觉信息和本地区域细节来最大化 OOD 检测的效果。具体来说,我们发现现有的模型通常通过预训练的 cross-entropy 或对比损失来学习全球视觉信息,但这些模型在 OOD 检测过程中可能无法捕捉到有价值的本地表示。为了解决这个问题,我们提出了 Attention-based Local PropAgation (ALPA) 目标函数,它利用了交叉注意机制来对 ID 训练过程中的本地区域进行匹配和强调。在测试时 OOD 检测过程中,我们还提出了 Cross-Scale Decision (CSD) 函数,用于在不同缩放级别上进行分类。我们在多个 benchmark 上进行了实验,得到了 MODE 的效果和灵活性。在 average 的情况下,MODE 可以与前一个状态的艺术品的 FPR 和 AUROC 进行比较,提高了 19.24% 和 2.77%。代码可以在 上找到。
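MODE builds on distance-based OOD detection: an input is scored by how far its representation lies from the ID training features, here combined across a global and a local scale. The simple nearest-distance scoring below with a fixed mixing weight is a generic sketch of that family of scores, not the paper's Cross-Scale Decision function.

```python
import numpy as np

def ood_score(global_feat, local_feat, id_global_bank, id_local_bank, w=0.5):
    """global_feat, local_feat: (D,) query features at two scales.
    id_global_bank, id_local_bank: (N, D) ID training features at those scales.
    Larger scores indicate more OOD-like inputs."""
    def nearest_dist(q, bank):
        return np.linalg.norm(bank - q[None, :], axis=1).min()
    return (w * nearest_dist(global_feat, id_global_bank)
            + (1 - w) * nearest_dist(local_feat, id_local_bank))

# A threshold tau chosen on a validation split then decides ID vs. OOD:
# is_ood = ood_score(g, l, bank_g, bank_l) > tau
```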

FedSIS: Federated Split Learning with Intermediate Representation Sampling for Privacy-preserving Generalized Face Presentation Attack Detection

  • paper_url: http://arxiv.org/abs/2308.10236
  • repo_url: https://github.com/naiftt/fedsis
  • paper_authors: Naif Alkhunaizi, Koushik Srivatsan, Faris Almalik, Ibrahim Almakky, Karthik Nandakumar
  • for: Improving the generalization of face presentation attack detection (FacePAD) algorithms to unseen domains/attacks while preserving the privacy of the entities holding the training data.
  • methods: The paper proposes Federated Split learning with Intermediate representation Sampling (FedSIS), which combines federated learning (FL) and split learning to achieve privacy-preserving domain generalization; FedSIS learns a hybrid Vision Transformer (ViT) and uses intermediate representation sampling with a shared adapter network to distill discriminative information from the ViT's intermediate blocks.
  • results: FedSIS achieves state-of-the-art generalization performance on two well-known benchmarks for cross-domain FacePAD without any raw-data sharing.
    Abstract Lack of generalization to unseen domains/attacks is the Achilles heel of most face presentation attack detection (FacePAD) algorithms. Existing attempts to enhance the generalizability of FacePAD solutions assume that data from multiple source domains are available with a single entity to enable centralized training. In practice, data from different source domains may be collected by diverse entities, who are often unable to share their data due to legal and privacy constraints. While collaborative learning paradigms such as federated learning (FL) can overcome this problem, standard FL methods are ill-suited for domain generalization because they struggle to surmount the twin challenges of handling non-iid client data distributions during training and generalizing to unseen domains during inference. In this work, a novel framework called Federated Split learning with Intermediate representation Sampling (FedSIS) is introduced for privacy-preserving domain generalization. In FedSIS, a hybrid Vision Transformer (ViT) architecture is learned using a combination of FL and split learning to achieve robustness against statistical heterogeneity in the client data distributions without any sharing of raw data (thereby preserving privacy). To further improve generalization to unseen domains, a novel feature augmentation strategy called intermediate representation sampling is employed, and discriminative information from intermediate blocks of a ViT is distilled using a shared adapter network. The FedSIS approach has been evaluated on two well-known benchmarks for cross-domain FacePAD to demonstrate that it is possible to achieve state-of-the-art generalization performance without data sharing. Code: https://github.com/Naiftt/FedSIS
    摘要 对未见域/攻击缺乏泛化能力是大多数人脸呈现攻击检测(FacePAD)算法的致命弱点。现有的增强 FacePAD 泛化能力的尝试假设多个源域的数据由单一实体集中持有以进行集中式训练。而在实践中,不同源域的数据往往由不同机构采集,受法律与隐私限制而无法共享。联邦学习(FL)等协作学习范式可以解决这一问题,但标准 FL 方法并不适合域泛化:它们难以同时应对训练时客户端数据的非独立同分布问题和推理时对未见域的泛化问题。在这项工作中,我们提出了一种名为 FedSIS(Federated Split learning with Intermediate representation Sampling)的新框架,用于保护隐私的域泛化。FedSIS 结合联邦学习与分割学习来训练一种混合视觉 Transformer(ViT)架构,在不共享原始数据(从而保护隐私)的前提下,对客户端数据分布的统计异质性保持鲁棒。为了进一步提升对未见域的泛化,我们采用了一种新的特征增强策略——中间表示采样,并通过共享适配器网络提炼 ViT 中间块的判别信息。FedSIS 在两个著名的跨域 FacePAD 基准上进行了评估,证明无需共享数据也可以达到最先进的泛化性能。代码:https://github.com/Naiftt/FedSIS

Real-time Regular Expression Matching

  • paper_url: http://arxiv.org/abs/2308.10208
  • repo_url: https://github.com/charbelrami/regex-element
  • paper_authors: Alexandra Bernadotte
  • for: 这篇论文主要研究有限状态自动机、正则表达式匹配与模式识别,以及自动机规模随正则表达式长度指数增长的“指数爆炸”问题。
  • methods: 该论文针对若干复杂类别的正则语言提出了理论与硬件相结合的解决方案,以缓解指数爆炸问题给网络入侵检测系统带来的严重限制。
  • results: 文章提供了一些正则表达式匹配的正确性和复杂性 theorem,以支持其解决方案的可行性和效果。
    Abstract This paper is devoted to finite state automata, regular expression matching, pattern recognition, and the exponential blow-up problem, which is the growing complexity of automata exponentially depending on regular expression length. This paper presents a theoretical and hardware solution to the exponential blow-up problem for some complicated classes of regular languages, which caused severe limitations in Network Intrusion Detection Systems work. The article supports the solution with theorems on correctness and complexity.
    摘要 这篇论文专注于finite state自动机、正则表达式匹配、模式识别和 exponential blow-up 问题(正则表达式长度增长导致自动机复杂度呈指数增长)。这篇论文提出了一种理论和硬件解决方案,以解决一些复杂的正则语言 exponential blow-up 问题,这些问题在网络入侵检测系统中带来严重的限制。文章证明了解决方案的正确性和复杂度。
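The blow-up the paper targets comes from determinizing a nondeterministic automaton: for patterns such as (a|b)*a(a|b)^(n-1) the equivalent DFA needs on the order of 2^n states, while simulating the NFA directly keeps only a set of active states per input symbol. A small subset-simulation over an explicit transition table is sketched below; the example automaton is illustrative and not taken from the paper.

```python
def run_nfa(transitions, start, accept, text):
    """transitions: dict mapping (state, symbol) -> set of next states.
    Simulates the NFA by tracking the set of active states, so matching runs in
    O(len(text) * n_states) time instead of building an exponential-size DFA."""
    current = {start}
    for ch in text:
        current = set().union(*(transitions.get((s, ch), set()) for s in current))
        if not current:
            return False
    return bool(current & accept)

# NFA for (a|b)*a(a|b): accept strings whose second-to-last symbol is 'a'.
nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'a'): {2}, (1, 'b'): {2}}
assert run_nfa(nfa, start=0, accept={2}, text="babaa") is True
assert run_nfa(nfa, start=0, accept={2}, text="bbb") is False
```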

GeT: Generative Target Structure Debiasing for Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.10205
  • repo_url: None
  • paper_authors: Can Zhang, Gim Hee Lee
  • for: 在域自适应(DA)中缓解伪标签生成对源域数据的依赖与偏置,从而将知识更好地迁移到标注稀缺或完全无标注的目标域。
  • methods: 基于半监督学习与伪标签技术,提出 GeT 方法:利用在线目标域生成式分类器,将目标分布建模为按类别先验加权的高斯分量,以缓解源数据偏置并增强目标类别判别性;同时提出结构相似性正则化框架,以缓解目标域类别分布偏置。
  • results: 实验结果表明,GeT 在有无类别分布偏置的各种域自适应设置下均能取得一致的性能提升。
    Abstract Domain adaptation (DA) aims to transfer knowledge from a fully labeled source to a scarcely labeled or totally unlabeled target under domain shift. Recently, semi-supervised learning-based (SSL) techniques that leverage pseudo labeling have been increasingly used in DA. Despite the competitive performance, these pseudo labeling methods rely heavily on the source domain to generate pseudo labels for the target domain and therefore still suffer considerably from source data bias. Moreover, class distribution bias in the target domain is also often ignored in the pseudo label generation and thus leading to further deterioration of performance. In this paper, we propose GeT that learns a non-bias target embedding distribution with high quality pseudo labels. Specifically, we formulate an online target generative classifier to induce the target distribution into distinctive Gaussian components weighted by their class priors to mitigate source data bias and enhance target class discriminability. We further propose a structure similarity regularization framework to alleviate target class distribution bias and further improve target class discriminability. Experimental results show that our proposed GeT is effective and achieves consistent improvements under various DA settings with and without class distribution bias. Our code is available at: https://lulusindazc.github.io/getproject/.
    摘要 域适应(DA)的目标是将来源域中完全标注的知识传递到目标域中具有域shift的场景中,但目标域通常具有少量或无标注数据。最近,半监督学习(SSL)技术在DA中越来越广泛应用,但这些假标注方法仍然受到源域的限制,即使在 pseudo label 生成过程中,它们仍然受到源数据偏见的影响。此外,目标域中类分布偏见也经常被忽略在假标注生成过程中,这会导致性能下降。在这篇论文中,我们提出了 GeT,它可以学习一个不偏见的目标域分布,并生成高质量的假标注。具体来说,我们提出了在线目标生成类分类器,通过将目标分布划分成不同的 Gaussian 组件,并通过类偏见来补偿源数据偏见,以提高目标类分riminability。此外,我们还提出了结构相似regularization框架,以减少目标类分布偏见,并进一步提高目标类分riminability。实验结果表明,我们的提出的 GeT 是有效的,在不同的 DA 设置下,它都可以获得鲁棒的提高。我们的代码可以在以下链接中找到:https://lulusindazc.github.io/getproject/.

Blind Face Restoration for Under-Display Camera via Dictionary Guided Transformer

  • paper_url: http://arxiv.org/abs/2308.10196
  • repo_url: None
  • paper_authors: Jingfan Tan, Xiaoxu Chen, Tao Wang, Kaihao Zhang, Wenhan Luo, Xiaocun Cao
  • for: 提供用户全屏体验,但是由于显示器的特点,UDC图像受到质量下降的影响。
  • methods: 提出了UDC图像修复方法,并实现了进步。但是还没有专门的UDC图像修复方法和数据集,这可能是UDC场景中最常见的问题。
  • results: 我们提出了一种两stage网络UDC降解模型网络(UDC-DMNet),并使用UDC-DMNet和高质量的face图像从FFHQ和CelebA-Test创建了UDC face修复数据集FFHQ-P/T和测试数据集CelebA-Test-P/T。我们还提出了一种新的字典引导 transformer网络(DGFormer),通过引入人脸组成字典和UDC图像的特点,使DGFormer能够实现盲人脸修复在UDC场景中。实验表明,我们的DGFormer和UDC-DMNet实现了状态的末点性能。
    Abstract By hiding the front-facing camera below the display panel, Under-Display Camera (UDC) provides users with a full-screen experience. However, due to the characteristics of the display, images taken by UDC suffer from significant quality degradation. Methods have been proposed to tackle UDC image restoration and advances have been achieved. There are still no specialized methods and datasets for restoring UDC face images, which may be the most common problem in the UDC scene. To this end, considering color filtering, brightness attenuation, and diffraction in the imaging process of UDC, we propose a two-stage network UDC Degradation Model Network named UDC-DMNet to synthesize UDC images by modeling the processes of UDC imaging. Then we use UDC-DMNet and high-quality face images from FFHQ and CelebA-Test to create UDC face training datasets FFHQ-P/T and testing datasets CelebA-Test-P/T for UDC face restoration. We propose a novel dictionary-guided transformer network named DGFormer. Introducing the facial component dictionary and the characteristics of the UDC image in the restoration makes DGFormer capable of addressing blind face restoration in UDC scenarios. Experiments show that our DGFormer and UDC-DMNet achieve state-of-the-art performance.
    摘要 “隐藏前置摄像头在显示面板下,Under-Display Camera(UDC)为用户提供了全屏体验。然而,由于显示器的特点,UDC拍摄的图像会受到显著质量下降的影响。已经提出了一些方法来解决UDC图像修复问题,但是还没有专门的方法和数据集来修复UDC face图像,这可能是UDC场景中最常见的问题。为此,我们考虑了UDC拍摄过程中的颜色滤波、亮度减弱和折射,并提出了一个两Stage网络UDC降低模型网络(UDC-DMNet),用于模拟UDC拍摄过程。然后,我们使用UDC-DMNet和高质量的face图像从FFHQ和CelebA-Test创建了UDC face培训集FFHQ-P/T和测试集CelebA-Test-P/T。我们还提出了一种新的字典指导变换网络名为DGFormer。通过引入面部字典和UDC图像的特点,DGFormer能够 Addresses blind face修复问题在UDC场景中。实验结果表明,我们的DGFormer和UDC-DMNet都达到了状态精算的性能。”
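UDC-DMNet models color filtering, brightness attenuation, and diffraction in under-display imaging. A toy forward model with those three ingredients — a per-channel color transmittance, a global brightness scale, and a Gaussian blur standing in for the diffraction point-spread function — is sketched below; every parameter value and the Gaussian PSF itself are assumptions, not the learned degradation model of the paper.

```python
import numpy as np

def gaussian_psf(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return psf / psf.sum()

def degrade_udc(image, transmittance=(0.9, 0.8, 0.7), brightness=0.6, psf=None):
    """image: (H, W, 3) float array in [0, 1]. Applies per-channel color filtering,
    brightness attenuation, and a PSF convolution standing in for diffraction."""
    if psf is None:
        psf = gaussian_psf()
    out = image * np.asarray(transmittance)[None, None, :] * brightness
    H, W = image.shape[:2]
    psf_pad = np.zeros((H, W))
    psf_pad[:psf.shape[0], :psf.shape[1]] = psf
    psf_pad = np.roll(psf_pad, (-(psf.shape[0] // 2), -(psf.shape[1] // 2)), axis=(0, 1))
    blurred = np.empty_like(out)
    for c in range(3):  # circular FFT convolution per channel (toy diffraction blur)
        blurred[..., c] = np.real(np.fft.ifft2(np.fft.fft2(out[..., c]) * np.fft.fft2(psf_pad)))
    return np.clip(blurred, 0.0, 1.0)
```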

EDDense-Net: Fully Dense Encoder Decoder Network for Joint Segmentation of Optic Cup and Disc

  • paper_url: http://arxiv.org/abs/2308.10192
  • repo_url: None
  • paper_authors: Mehwish Mehmood, Khuram Naveed, Haroon Ahmed Khan, Syed S. Naqvi
  • for: 这项研究旨在为青光眼的诊断与分析提供一套辅助系统,作为眼科医生的第二意见。
  • methods: 提出 EDDense-Net 分割网络,其编码器与解码器均由密集块组成,每个密集块包含分组卷积层,使网络在获取并传递图像空间信息的同时降低复杂度;语义分割部分采用 Dice 像素分类以缓解类别不平衡问题。
  • results: 在两个公开数据集上的评估表明,该网络在准确率和效率上均优于现有的最先进方法,可作为眼科医生诊断与分析青光眼的第二意见系统。
    Abstract Glaucoma is an eye disease that causes damage to the optic nerve, which can lead to visual loss and permanent blindness. Early glaucoma detection is therefore critical in order to avoid permanent blindness. The estimation of the cup-to-disc ratio (CDR) during an examination of the optical disc (OD) is used for the diagnosis of glaucoma. In this paper, we present the EDDense-Net segmentation network for the joint segmentation of OC and OD. The encoder and decoder in this network are made up of dense blocks with a grouped convolutional layer in each block, allowing the network to acquire and convey spatial information from the image while simultaneously reducing the network's complexity. To reduce spatial information loss, the optimal number of filters in all convolution layers were utilised. In semantic segmentation, dice pixel classification is employed in the decoder to alleviate the problem of class imbalance. The proposed network was evaluated on two publicly available datasets where it outperformed existing state-of-the-art methods in terms of accuracy and efficiency. For the diagnosis and analysis of glaucoma, this method can be used as a second opinion system to assist medical ophthalmologists.
    摘要 青光眼是一种损害视神经的眼病,可导致视力丧失乃至永久失明,因此早期检测至关重要。青光眼的诊断依赖于在视盘(OD)检查中估计杯盘比(CDR)。在这篇论文中,我们提出了用于视杯(OC)与视盘联合分割的 EDDense-Net 分割网络。该网络的编码器与解码器均由稠密块组成,每个稠密块内含分组卷积层,使网络能够在获取并传递图像空间信息的同时降低网络复杂度。为减少空间信息损失,所有卷积层均采用了最优的滤波器数量。在语义分割中,解码器采用 Dice 像素分类以缓解类别不平衡问题。该网络在两个公开数据集上进行了评估,在精度和效率上均优于现有的最先进方法。该方法可作为辅助眼科医生诊断和分析青光眼的第二意见系统。
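Two quantities central to this pipeline are easy to state concretely: the Dice overlap used for segmentation, and the vertical cup-to-disc ratio (CDR) computed from the predicted optic-cup and optic-disc masks. The small sketch below assumes binary NumPy masks; the function names are illustrative.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between two binary masks; 1.0 means perfect agreement."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def vertical_cdr(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    """Vertical cup-to-disc ratio: vertical extent of the cup divided by that of the
    disc. Values above roughly 0.6 are commonly treated as glaucoma-suspect."""
    def vertical_extent(mask):
        rows = np.where(mask.any(axis=1))[0]
        return 0 if rows.size == 0 else rows.max() - rows.min() + 1
    disc = vertical_extent(disc_mask)
    return vertical_extent(cup_mask) / disc if disc else 0.0
```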

Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2308.10187
  • repo_url: https://github.com/Arktis2022/Spiking-Diffusion
  • paper_authors: Mingxuan Liu, Rui Wen, Hong Chen
  • for: 这篇论文旨在完全基于脉冲神经网络(SNN)实现图像生成模型,以便部署在高能效的类脑(neuromorphic)芯片上。
  • methods: 论文提出了基于 SNN 的向量量化变分自编码器(VQ-SVAE)来学习图像的离散潜在空间,并在该离散潜在空间中进行吸收态扩散,再用 SNN 构建的扩散解码器对图像去噪复原。
  • results: 实验表明,Spiking-Diffusion 优于现有的基于 SNN 的生成模型,在 MNIST、FMNIST、KMNIST 和 Letters 数据集上分别取得 37.50、91.98、59.23 和 67.41 的 FID,相比此前最佳工作分别降低了 58.60%、18.75%、64.51% 和 29.75%。
    Abstract Spiking neural networks (SNNs) have tremendous potential for energy-efficient neuromorphic chips due to their binary and event-driven architecture. SNNs have been primarily used in classification tasks, but limited exploration on image generation tasks. To fill the gap, we propose a Spiking-Diffusion model, which is based on the vector quantized discrete diffusion model. First, we develop a vector quantized variational autoencoder with SNNs (VQ-SVAE) to learn a discrete latent space for images. With VQ-SVAE, image features are encoded using both the spike firing rate and postsynaptic potential, and an adaptive spike generator is designed to restore embedding features in the form of spike trains. Next, we perform absorbing state diffusion in the discrete latent space and construct a diffusion image decoder with SNNs to denoise the image. Our work is the first to build the diffusion model entirely from SNN layers. Experimental results on MNIST, FMNIST, KMNIST, and Letters demonstrate that Spiking-Diffusion outperforms the existing SNN-based generation model. We achieve FIDs of 37.50, 91.98, 59.23 and 67.41 on the above datasets respectively, with reductions of 58.60\%, 18.75\%, 64.51\%, and 29.75\% in FIDs compared with the state-of-art work.
    摘要 神经网络(SNN)具有巨大的能效可能性,因其架构为二进制和事件驱动的。SNN在分类任务中广泛使用,但对图像生成任务的探索很少。为填补这个差距,我们提议了一种叫做神经网络扩散模型(Spiking-Diffusion model),它基于矢量量化离散扩散模型。首先,我们开发了一种基于SNN的矢量量化自适应编码器(VQ-SVAE),以学习图像的离散特征空间。VQ-SVAE中,图像特征被编码通过神经元发射率和后生电位,并设计了适应的神经元发射器来重建嵌入特征在形式为脉冲列表的形式。接着,我们在离散特征空间中进行吸引状态扩散,并构建了基于SNN层的扩散图像解码器来降噪图像。我们的工作是首次将扩散模型完全建立在SNN层之上。实验结果在MNIST、FMNIST、KMNIST和Letters等 dataset上表明,Spiking-Diffusion模型比 existed SNN-based generation模型更高效,我们在这些dataset上实现了FID值为37.50、91.98、59.23和67.41,相比 existed work,我们的FID值下降58.60%、18.75%、64.51%和29.75%。
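VQ-SVAE encodes image features with spike firing rates. A standard way to turn a real-valued feature into a spike train is Bernoulli rate coding over T time steps, sketched below; this is generic rate coding for illustration, not necessarily the exact encoding used in the paper.

```python
import numpy as np

def rate_code(features: np.ndarray, timesteps: int = 16, rng=None) -> np.ndarray:
    """features: array of values in [0, 1]. Returns a binary spike train of shape
    (timesteps, *features.shape) where each entry fires with probability equal to
    the feature value, so the mean firing rate approximates the feature."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.clip(features, 0.0, 1.0)
    return (rng.random((timesteps,) + probs.shape) < probs).astype(np.uint8)

x = np.array([0.1, 0.5, 0.9])
spikes = rate_code(x, timesteps=1000)
print(spikes.mean(axis=0))  # approaches [0.1, 0.5, 0.9] as timesteps grow
```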

ViT-Lens: Towards Omni-modal Representations

  • paper_url: http://arxiv.org/abs/2308.10185
  • repo_url: https://github.com/TencentARC/ViT-Lens
  • paper_authors: Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou
  • for: 这篇论文是为了提出一种能够有效地处理多个模式(如3D、声音等)的方法,以便在不同的任务和领域中使用已经预训练的 ViT 模型。
  • methods: 这篇论文使用了一种名为 ViT-Lens 的方法,它可以使用已经预训练的 ViT 模型来处理多个模式,并将这些模式Project到一个共同的 embedding 空间中。然后,使用强大的 ViT 模型来处理这些embedding,以获得高效的多模式表示学习。
  • results: 根据论文的测试结果,ViT-Lens 在 zero-shot 3D 分类任务中实现了显著的提升,其中 Objaverse-LVIS 上的准确率为 52.0%,ModelNet40 上的准确率为 87.4%,ScanObjectNN 上的准确率为 60.6%。此外,通过简单地将已经训练的 3D 透镜 integrate 到 InstructBLIP 模型中,实现了 zero-shot 3D 问答。
    Abstract Though the success of CLIP-based training recipes in vision-language models, their scalability to more modalities (e.g., 3D, audio, etc.) is limited to large-scale data, which is expensive or even inapplicable for rare modalities. In this paper, we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning to a pre-defined space. Specifically, the modality-specific lens is tuned to project multimodal signals to the shared embedding space, which are then processed by a strong ViT that carries pre-trained image knowledge. The encoded multimodal representations are optimized toward aligning with the modal-independent space, pre-defined by off-the-shelf foundation models. A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing benefits: (i) Exploiting the pretrained ViT across tasks and domains effectively with efficient data regime; (ii) Emergent downstream capabilities of novel modalities are demonstrated due to the modality alignment space. We evaluate ViT-Lens in the context of 3D as an initial verification. In zero-shot 3D classification, ViT-Lens achieves substantial improvements over previous state-of-the-art, showing 52.0% accuracy on Objaverse-LVIS, 87.4% on ModelNet40, and 60.6% on ScanObjectNN. Furthermore, we enable zero-shot 3D question-answering by simply integrating the trained 3D lens into the InstructBLIP model without any adaptation. We will release the results of ViT-Lens on more modalities in the near future.
    摘要 尽管CLIP基于训练辑recipes在视觉语模型中获得成功,但它们在更多Modalities(例如3D、音频等)的扩展是受限于大规模数据的,这些数据可能是costly或者even inapplicable for rare modalities。在这篇论文中,我们提出了ViT-Lens,它可以有效地进行多modalities的表示学习,通过将novel modalities projected to a predefined shared embedding space,然后由一个强大的ViT进行处理,该ViT已经预训练了图像知识。所得到的多modalities表示被优化以与modal-independent space相align,这个空间是由off-the-shelf foundation models预定的。一个很好地训练的lens可以作为这些基础模型,监督后续modalities的学习。ViT-Lens提供了一个统一的解决方案,可以在多modalities的表示学习中获得两个优点:(i)可以有效地利用预训练的ViT,并且在有限的数据上进行efficient training;(ii)由于modalities的对齐空间,可以实现 novel modalities的emergent downstream capabilities。我们在3D上进行了首次验证,并达到了substantial improvement,包括Objaverse-LVIS的52.0%准确率、ModelNet40的87.4%准确率和ScanObjectNN的60.6%准确率。此外,我们还可以通过简单地将训练好的3D镜头 integrate into the InstructBLIP模型,实现zero-shot 3D问答。我们将在未来发布更多modalities的结果。

BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

  • paper_url: http://arxiv.org/abs/2308.10175
  • repo_url: None
  • paper_authors: Chen Liu, Peike Li, Hu Zhang, Lincheng Li, Zi Huang, Dadong Wang, Xin Yu
  • for: 这个研究旨在提高视听分割(AVS)的精度,以便在实际场景中更好地定位声源。
  • methods: 我们提出了一个融合多模态基础知识的两阶段自举 AVS 框架。第一阶段使用分割模型从视觉数据中定位潜在声源,不受污染音频信号的影响,同时用基础音频分类模型辨别音频语义;第二阶段提出视听语义整合策略(AVIS),根据声音与物体类别的层级对应关系构建视听树,并检验被定位物体与音频标签之间的标签一致性,从而确定真正发声的物体。
  • results: 我们的方法在 AVS 数据集上的大量实验中表现出色,尤其是在存在背景噪声的情况下,能够更好地定位真正发声的物体。
    Abstract Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate the audio recordings in real-world scenarios. They impose significant challenges on building a consistent semantic mapping between audio and visual signals for AVS models and thus impede precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework by incorporating multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing the audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Considering the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic-sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment real-sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/.
    摘要 给定一段视听对,视听分割(AVS)旨在通过预测像素级掩码来定位发声源。先前的方法假设音频信号中的每个声音成分在图像中都有对应的视觉目标。然而,这一假设忽略了现实场景中屏幕外声音和背景噪声常常污染音频录音的情况,这给 AVS 模型在音频与视觉信号之间建立一致的语义映射带来了巨大挑战,从而妨碍了精确的声源定位。在这项工作中,我们提出了一个融合多模态基础知识的两阶段自举视听分割框架。简而言之,我们的 BAVS 通过显式地建立视听对应关系,来消除背景噪声或屏幕外声音对分割的干扰。在第一阶段,我们使用分割模型从视觉数据中定位潜在的发声目标,而不受被污染音频信号的影响;同时,我们利用基础音频分类模型来辨别音频语义。考虑到基础音频模型给出的音频标签带有噪声,直接将目标掩码与音频标签关联并非易事。因此,在第二阶段,我们提出了视听语义整合策略(AVIS),以定位真正发声的目标:我们依据声音与物体类别之间的层级对应关系构建视听树,然后沿着视听树检验被定位目标与被分类音频标签之间的标签一致性。借助 AVIS,我们能够有效地分割真正发声的目标。大量实验证明了我们的方法在 AVS 数据集上的优越性,尤其是在存在背景噪声的场景中。项目网站:https://yenanliu.github.io/AVSS.github.io/。

Neural Interactive Keypoint Detection

  • paper_url: http://arxiv.org/abs/2308.10174
  • repo_url: https://github.com/idea-research/click-pose
  • paper_authors: Jie Yang, Ailing Zeng, Feng Li, Shilong Liu, Ruimao Zhang, Lei Zhang
  • for: 这个研究旨在开发一个端到端神经交互关键点检测框架,名为Click-Pose,可以将2D关键点标注的时间和努力减少至10倍以上。
  • methods: Click-Pose使用了一个对话式人工标注 Loop,让用户点击预测的关键点进行更正,并使用一个独特的姿势错误模型来提高模型的自我更正能力。
  • results: Click-Pose 在 COCO 和 Human-Art 上分别只需 1.97 和 6.45 次点击(NoC@95,精度 95%)即可完成标注,相比最先进模型(ViTPose)加人工修正分别减少了 31.4% 和 36.3% 的工作量;此外,在不使用用户点击的情况下,Click-Pose 仍然超越了先前的端到端模型。代码可以在 https://github.com/IDEA-Research/Click-Pose 上取得。
    Abstract This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which can significantly reduce more than 10 times labeling costs of 2D keypoint annotation compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective annotation process. Specifically, we design the pose error modeling strategy that inputs the ground truth pose combined with four typical pose errors into the decoder and trains the model to reconstruct the correct poses, which enhances the self-correction ability of the model. Then, we attach an interactive human-feedback loop that allows receiving users' clicks to correct one or several predicted keypoints and iteratively utilizes the decoder to update all other keypoints with a minimum number of clicks (NoC) for efficient annotation. We validate Click-Pose in in-domain, out-of-domain scenes, and a new task of keypoint adaptation. For annotation, Click-Pose only needs 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, reducing 31.4% and 36.3% efforts than the SOTA model (ViTPose) with manual correction, respectively. Besides, without user clicks, Click-Pose surpasses the previous end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. The code is available at https://github.com/IDEA-Research/Click-Pose.
    摘要 这个工作提出了一种名为Click-Pose的结束到终端神经互动关键点检测框架,可以减少更多于10倍的2D关键点标注成本,比手动标注更快和有效。Click-Poseexplores如何用户反馈与神经关键点检测器共同 correction predicted关键点,以实现更快和更有效的标注过程。我们设计了 pose error 模型,将真实pose与四种常见pose error 输入到decoder中,并训练模型可以重建正确的pose,从而提高自修复能力。然后,我们附加了一个交互式人工反馈循环,让用户点击correct predicted关键点,并 iteratively使用decoder更新所有关键点,最少clicks (NoC) для高效的标注。我们验证Click-Pose在域内、域外场景和新任务关键点适应中。对于标注,Click-Pose只需1.97和6.45 NoC@95 (精度95%) 在COCO和人类艺术上,相比SOTA模型(ViTPose)的手动更正,减少了31.4%和36.3%的努力。此外,无需用户点击,Click-Pose超过了之前的端到终模型1.4 AP在COCO和3.0 AP在人类艺术上。代码可以在https://github.com/IDEA-Research/Click-Pose中找到。
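The NoC@95 numbers count how many simulated user clicks are needed before at least 95% of a pose's keypoints fall within a distance threshold of the ground truth. A toy evaluation loop in that spirit is sketched below; here each "click" simply snaps the worst keypoint to the ground truth, standing in for Click-Pose's user-guided refinement, which also updates the remaining keypoints and is not modeled.

```python
import numpy as np

def clicks_to_precision(pred, gt, thresh, target_precision=0.95, max_clicks=20):
    """pred, gt: (K, 2) keypoint arrays. Returns the number of corrective clicks
    needed until the fraction of keypoints within `thresh` of the ground truth
    reaches target_precision."""
    pred = np.array(pred, dtype=float, copy=True)
    gt = np.asarray(gt, dtype=float)
    for clicks in range(max_clicks + 1):
        dists = np.linalg.norm(pred - gt, axis=1)
        if (dists <= thresh).mean() >= target_precision:
            return clicks
        worst = int(np.argmax(dists))
        pred[worst] = gt[worst]          # simulate one corrective click
    return max_clicks
```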

VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2308.10172
  • repo_url: https://github.com/yanyuanqiao/vln-petl
  • paper_authors: Yanyuan Qiao, Zheng Yu, Qi Wu
  • for: 本研究旨在提高大型预训练视言语模型在视言语 Navigation~(VLN) 任务上的性能,并且减少全量微调预训练模型的成本。
  • methods: 本研究提出了一种特点是 VLN 任务的Parameter-Efficient Transfer Learning~(PETL) 方法,包括两个特有的 PETL 模块:历史互动增强器(HIB)和 crossing modal 互动增强器(CIB)。
  • results: 在四种主流 VLN 任务(R2R、REVERIE、NDH、RxR)上的广泛实验结果表明,所提出的 VLN-PETL 方法可以达到与全量微调相当甚至更好的性能,并以可观的优势超越其他 PETL 方法。
    Abstract The performance of the Vision-and-Language Navigation~(VLN) tasks has witnessed rapid progress recently thanks to the use of large pre-trained vision-and-language models. However, full fine-tuning the pre-trained model for every downstream VLN task is becoming costly due to the considerable model size. Recent research hotspot of Parameter-Efficient Transfer Learning (PETL) shows great potential in efficiently tuning large pre-trained models for the common CV and NLP tasks, which exploits the most of the representation knowledge implied in the pre-trained model while only tunes a minimal set of parameters. However, simply utilizing existing PETL methods for the more challenging VLN tasks may bring non-trivial degeneration to the performance. Therefore, we present the first study to explore PETL methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules: Historical Interaction Booster (HIB) and Cross-modal Interaction Booster (CIB). Then we combine these two modules with several existing PETL methods as the integrated VLN-PETL. Extensive experimental results on four mainstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed VLN-PETL, where VLN-PETL achieves comparable or even better performance to full fine-tuning and outperforms other PETL methods with promising margins.
    摘要 “在最近,视觉语言导航(VLN)任务的性能得到了迅速的进步,归功于使用大型预训练的视觉语言模型。然而,对于每个下游VLN任务进行全面的预训练模型 fine-tuning 成本增加,因为模型的大小很大。目前研究热点 Parameter-Efficient Transfer Learning(PETL)表现出了巨大的潜力,可以有效地调参大型预训练模型,并且只需要调参最小的参数集。然而,直接使用现有的 PETL 方法来处理更加具有挑战性的 VLN 任务可能会导致性能下降。因此,我们提出了首次对 VLN 任务使用 PETL 方法的研究,并提出了一种特定于 VLN 的 PETL 方法 named VLN-PETL。特别是,我们设计了两个 PETL 模块:历史互动加速器(HIB)和交叉模式互动加速器(CIB)。然后,我们将这两个模块与一些现有的 PETL 方法相结合,形成了集成的 VLN-PETL。我们对四个主流 VLN 任务(R2R、REVERIE、NDH、RxR)进行了广泛的实验,结果表明,我们提出的 VLN-PETL 方法可以与全面 fine-tuning 和其他 PETL 方法相比,具有同等或更好的性能。”
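Parameter-efficient transfer learning of the kind VLN-PETL builds on typically inserts small bottleneck adapters into a frozen backbone and trains only those. A generic PyTorch adapter block is sketched below; HIB and CIB add history- and cross-modal-interaction-specific structure on top of this idea, which is not reproduced here.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, non-linearity, up-project, residual add; only these few
    parameters are trained while the pretrained backbone stays frozen."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: insert after a frozen transformer sub-layer whose hidden width is 768.
adapter = BottleneckAdapter(768)
out = adapter(torch.randn(4, 10, 768))
```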

Cell Spatial Analysis in Crohn’s Disease: Unveiling Local Cell Arrangement Pattern with Graph-based Signatures

  • paper_url: http://arxiv.org/abs/2308.10166
  • repo_url: None
  • paper_authors: Shunxing Bao, Sichen Zhu, Vasantha L Kolachala, Lucas W. Remedios, Yeonjoo Hwang, Yutong Sun, Ruining Deng, Can Cui, Yike Li, Jia Li, Joseph T. Roland, Qi Liu, Ken S. Lau, Subra Kugathasan, Peng Qiu, Keith T. Wilson, Lori A. Coburn, Bennett A. Landman, Yuankai Huo
  • for: 本研究旨在描述crohn病(CD)活动的抑菌环境,尤其是在肠道区域。
  • methods: 研究人员使用了 Hematoxylin and Eosin 染色技术(H&E),并分类了6种不同的细胞类型。然后,他们使用了 t-SNE 和 Kernel Density Estimation 来分析细胞环境的地方特征。
  • results: 研究发现了不同的细胞嵌入模式,尤其是在RECTUM区域。这些差异强调了数据不同性对细胞空间安排的影响。此外,研究还发现了两个研究机构之间的数据分布差异,这指出了协作医疗机构的重要性。
    Abstract Crohn's disease (CD) is a chronic and relapsing inflammatory condition that affects segments of the gastrointestinal tract. CD activity is determined by histological findings, particularly the density of neutrophils observed on Hematoxylin and Eosin stains (H&E) imaging. However, understanding the broader morphometry and local cell arrangement beyond cell counting and tissue morphology remains challenging. To address this, we characterize six distinct cell types from H&E images and develop a novel approach for the local spatial signature of each cell. Specifically, we create a 10-cell neighborhood matrix, representing neighboring cell arrangements for each individual cell. Utilizing t-SNE for non-linear spatial projection in scatter-plot and Kernel Density Estimation contour-plot formats, our study examines patterns of differences in the cellular environment associated with the odds ratio of spatial patterns between active CD and control groups. This analysis is based on data collected at the two research institutes. The findings reveal heterogeneous nearest-neighbor patterns, signifying distinct tendencies of cell clustering, with a particular focus on the rectum region. These variations underscore the impact of data heterogeneity on cell spatial arrangements in CD patients. Moreover, the spatial distribution disparities between the two research sites highlight the significance of collaborative efforts among healthcare organizations. All research analysis pipeline tools are available at https://github.com/MASILab/cellNN.
    摘要 crohn病 (CD) 是一种 chronic 和 relapsing 的Inflammatory condition ,影响 digestive tract 的一部分。 CD 的活动由 histological 发现决定,特别是 neutrophils 的密度在 Hematoxylin 和 Eosin 染色 (H&E) 图像中。然而,理解更广泛的 morphometry 和 local cell arrangement 还是一个挑战。为了解决这个问题,我们将 six 种不同的细胞类型从 H&E 图像中分类,并开发了一种 novel approach 以获取每个细胞的本地空间签名。 Specifically, we create a 10-cell neighborhood matrix, representing neighboring cell arrangements for each individual cell. 使用 t-SNE для非线性空间投影,我们在 scatter-plot 和 Kernel Density Estimation 折线图中进行了 pattern 的分析,以探讨 CD 和 control 组之间的cellular environment 差异。这种分析基于在两个研究机构收集的数据。我们发现 heterogeneous 的 nearest-neighbor 模式,表明 CD 病人的细胞含量具有明显的差异,尤其是在 rectum 区域。这些差异 highlights 数据的多样性对细胞空间安排的影响。此外,研究站之间的 spatial distribution 差异也表明了卫生机构之间的合作是非常重要。所有的研究分析管道工具可以在 https://github.com/MASILab/cellNN 上找到。
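The paper's local spatial signature is a 10-cell neighborhood matrix describing, for each cell, which cell types its nearest neighbors belong to. A minimal construction of such a signature from cell coordinates and type labels is sketched below (brute-force distances; the fraction-based normalization is an assumption about the exact form used).

```python
import numpy as np

def neighborhood_composition(coords, cell_types, n_types=6, k=10):
    """coords: (N, 2) cell centroids; cell_types: (N,) integer labels in [0, n_types).
    Returns an (N, n_types) matrix whose row i is the fraction of each cell type
    among the k nearest neighbors of cell i (the cell itself excluded)."""
    coords = np.asarray(coords, dtype=float)
    cell_types = np.asarray(cell_types)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    signature = np.zeros((len(coords), n_types))
    for i, row in enumerate(dists):
        neighbors = np.argsort(row)[:k]
        counts = np.bincount(cell_types[neighbors], minlength=n_types)
        signature[i] = counts / k
    return signature

# Rows of the signature can then be embedded with t-SNE and compared between
# active CD and control tissue via kernel density estimates, as described above.
```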

ThermRad: A Multi-modal Dataset for Robust 3D Object Detection under Challenging Conditions

  • paper_url: http://arxiv.org/abs/2308.10161
  • repo_url: None
  • paper_authors: Qiao Yan, Yihan Wang
  • for: 提高3D物体检测的稳定性和可靠性在极端天气和照明条件下。
  • methods: 提出了一种新的多模态融合方法,即RTDF-RCNN,利用4D雷达和热成像仪器的优势进行对象检测。
  • results: 对比其他方法,RTDF-RCNN在检测车辆、人员和自行车等对象方面有显著提高(超过7.98%、24.27%和27.15%),同时与LiDAR方法具有相似的性能。
    Abstract Robust 3D object detection in extreme weather and illumination conditions is a challenging task. While radars and thermal cameras are known for their resilience to these conditions, few studies have been conducted on radar-thermal fusion due to the lack of corresponding datasets. To address this gap, we first present a new multi-modal dataset called ThermRad, which includes a 3D LiDAR, a 4D radar, an RGB camera and a thermal camera. This dataset is unique because it includes data from all four sensors in extreme weather conditions, providing a valuable resource for future research in this area. To validate the robustness of 4D radars and thermal cameras for 3D object detection in challenging weather conditions, we propose a new multi-modal fusion method called RTDF-RCNN, which leverages the complementary strengths of 4D radars and thermal cameras to boost object detection performance. To further prove the effectiveness of our proposed framework, we re-implement state-of-the-art (SOTA) 3D detectors on our dataset as benchmarks for evaluation. Our method achieves significant enhancements in detecting cars, pedestrians, and cyclists, with improvements of over 7.98%, 24.27%, and 27.15%, respectively, while achieving comparable results to LiDAR-based approaches. Our contributions in both the ThermRad dataset and the new multi-modal fusion method provide a new approach to robust 3D object detection in adverse weather and illumination conditions. The ThermRad dataset will be released.
    摘要 “在极端天气和光照条件下实现鲁棒的3D物体检测是一项具有挑战性的任务。尽管雷达和热成像相机以对这些条件的抗性著称,但由于缺乏相应的数据集,关于雷达-热成像融合的研究仍然很少。为了填补这一空白,我们首先提出了一个新的多模态数据集ThermRad,其包含3D LiDAR、4D雷达、RGB相机和热成像相机。该数据集的独特之处在于同时包含四种传感器在极端天气条件下采集的数据,为该领域的后续研究提供了宝贵资源。为了验证4D雷达和热成像相机在恶劣天气条件下用于3D物体检测的鲁棒性,我们提出了一种新的多模态融合方法RTDF-RCNN,利用4D雷达与热成像相机的互补优势来提升物体检测性能。为进一步证明所提框架的有效性,我们在该数据集上复现了最先进(SOTA)的3D检测器作为评测基准。我们的方法在检测汽车、行人和骑行者方面分别取得了超过7.98%、24.27%和27.15%的显著提升,同时达到与基于LiDAR方法相当的结果。我们在ThermRad数据集与新的多模态融合方法上的贡献,为恶劣天气和光照条件下的鲁棒3D物体检测提供了新途径。ThermRad数据集将会公开发布。”
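
The released RTDF-RCNN code is not described here, so the snippet below is only a generic two-stream fusion sketch in PyTorch: radar and thermal feature maps (assumed to be spatially aligned) are projected to a common width, concatenated, and mixed before a detection head. The module name and channel sizes are assumptions.

```python
# Generic two-stream fusion sketch (an assumption, not the released RTDF-RCNN
# code): radar and thermal feature maps are projected to a common channel
# width, concatenated, and mixed before a detection head.
import torch
import torch.nn as nn

class RadarThermalFusion(nn.Module):
    def __init__(self, radar_ch=64, thermal_ch=64, fused_ch=128):
        super().__init__()
        self.radar_proj = nn.Conv2d(radar_ch, fused_ch // 2, kernel_size=1)
        self.thermal_proj = nn.Conv2d(thermal_ch, fused_ch // 2, kernel_size=1)
        self.mix = nn.Sequential(
            nn.Conv2d(fused_ch, fused_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(fused_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, radar_feat, thermal_feat):
        # Both inputs are assumed to be spatially aligned (e.g., the same BEV grid).
        fused = torch.cat([self.radar_proj(radar_feat),
                           self.thermal_proj(thermal_feat)], dim=1)
        return self.mix(fused)

radar = torch.randn(2, 64, 128, 128)
thermal = torch.randn(2, 64, 128, 128)
print(RadarThermalFusion()(radar, thermal).shape)  # torch.Size([2, 128, 128, 128])
```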

HODN: Disentangling Human-Object Feature for HOI Detection

  • paper_url: http://arxiv.org/abs/2308.10158
  • repo_url: None
  • paper_authors: Shuman Fang, Zhiwen Lin, Ke Yan, Jie Li, Xianming Lin, Rongrong Ji
  • for: 本文旨在提高人-物交互(HOI)检测的精度,提出一种基于Transformer的人体与物体解耦网络(HODN),并可与现有方法结合以进一步提升性能。
  • methods: 本文使用两个独立的解耦解码器分别检测人体和物体,再交由交互解码器进行交互检测。为使交互解码器聚焦于以人为中心的区域,我们提出了人体引导链接方法,将人体特征作为位置嵌入。此外,我们还提出了停止梯度机制,阻止交互梯度优化物体检测,但允许其优化人体检测。
  • results: 我们的提议方法在V-COCO和HICO-Det数据集上达到了竞争性表现,可以与现有方法相结合以实现最佳效果。
    Abstract The task of Human-Object Interaction (HOI) detection is to detect humans and their interactions with surrounding objects, where transformer-based methods show dominant advances currently. However, these methods ignore the relationship among humans, objects, and interactions: 1) human features are more contributive than object ones to interaction prediction; 2) interactive information disturbs the detection of objects but helps human detection. In this paper, we propose a Human and Object Disentangling Network (HODN) to model the HOI relationships explicitly, where humans and objects are first detected by two disentangling decoders independently and then processed by an interaction decoder. Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions with human features as the positional embeddings. To handle the opposite influences of interactions on humans and objects, we propose a Stop-Gradient Mechanism to stop interaction gradients from optimizing the object detection but to allow them to optimize the human detection. Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det datasets. It can be combined with existing methods easily for state-of-the-art results.
    摘要 人-物交互(HOI)检测任务的目标是检测人及其与周围物体的交互,其中基于transformer的方法目前占据主导地位。然而,这些方法忽略了人、物体和交互三者之间的关系:1)人体特征比物体特征对交互预测的贡献更大;2)交互信息会干扰物体检测,却有助于人体检测。在这篇论文中,我们提出了人体与物体解耦网络(HODN),以显式建模HOI关系:人体和物体先由两个独立的解耦解码器分别检测,再由交互解码器处理。考虑到人体特征对交互预测更为重要,我们提出了人体引导链接方法,以人体特征作为位置嵌入,确保交互解码器聚焦于以人为中心的区域。为了处理交互对人体和物体检测的相反影响,我们提出了停止梯度机制,阻止交互梯度优化物体检测,但允许其优化人体检测。我们的方法在V-COCO和HICO-Det数据集上均取得了有竞争力的性能,并且可以方便地与现有方法结合以达到最佳结果。
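
The stop-gradient mechanism described above is straightforward to express in PyTorch. The sketch below is an illustrative interface, not the authors' code: the interaction head sees object features through `detach()`, so the interaction loss updates the human branch but not the object branch.

```python
# Minimal sketch of a stop-gradient mechanism (assumed interface, not the
# authors' code): the interaction branch consumes human features normally,
# but sees object features through `detach()`, so interaction-loss gradients
# update the human detector while leaving the object detector untouched.
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    def __init__(self, dim=256, num_verbs=29):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.cls = nn.Linear(dim, num_verbs)

    def forward(self, human_feat, object_feat):
        object_feat = object_feat.detach()          # stop gradient toward the object branch
        pair = torch.cat([human_feat, object_feat], dim=-1)
        return self.cls(torch.relu(self.fuse(pair)))

human_feat = torch.randn(4, 256, requires_grad=True)    # from the human decoder
object_feat = torch.randn(4, 256, requires_grad=True)   # from the object decoder
logits = InteractionHead()(human_feat, object_feat)
logits.sum().backward()
print(human_feat.grad is not None, object_feat.grad is None)  # True True
```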

Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction

  • paper_url: http://arxiv.org/abs/2308.10157
  • repo_url: https://github.com/show-han/pet-reconstruction
  • paper_authors: Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen
  • for: 在降低人体辐射暴露的同时重建高质量的正电子发射断层扫描(PET)图像
  • methods: 在生成对抗网络(GANs)与扩散概率模型(DPMs)研究的基础上,提出由粗到细的PET重建框架,包含粗预测模块(CPM)与迭代细化模块(IRM),并引入辅助引导策略与对比扩散策略
  • results: 在两个人脑PET数据集上的大量实验表明,该方法优于现有最先进的PET重建方法,并提升了临床可靠性
    Abstract To obtain high-quality positron emission tomography (PET) scans while reducing radiation exposure to the human body, various approaches have been proposed to reconstruct standard-dose PET (SPET) images from low-dose PET (LPET) images. One widely adopted technique is the generative adversarial networks (GANs), yet recently, diffusion probabilistic models (DPMs) have emerged as a compelling alternative due to their improved sample quality and higher log-likelihood scores compared to GANs. Despite this, DPMs suffer from two major drawbacks in real clinical settings, i.e., the computationally expensive sampling process and the insufficient preservation of correspondence between the conditioning LPET image and the reconstructed PET (RPET) image. To address the above limitations, this paper presents a coarse-to-fine PET reconstruction framework that consists of a coarse prediction module (CPM) and an iterative refinement module (IRM). The CPM generates a coarse PET image via a deterministic process, and the IRM samples the residual iteratively. By delegating most of the computational overhead to the CPM, the overall sampling speed of our method can be significantly improved. Furthermore, two additional strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion strategy, are proposed and integrated into the reconstruction process, which can enhance the correspondence between the LPET image and the RPET image, further improving clinical reliability. Extensive experiments on two human brain PET datasets demonstrate that our method outperforms the state-of-the-art PET reconstruction methods. The source code is available at \url{https://github.com/Show-han/PET-Reconstruction}.
    摘要 为了在降低人体辐射暴露的同时获得高质量的正电子发射断层扫描(PET)图像,人们提出了多种从低剂量PET(LPET)图像重建标准剂量PET(SPET)图像的方法。其中一种被广泛采用的技术是生成对抗网络(GANs);而近来,扩散概率模型(DPMs)因其更高的样本质量与对数似然分数成为颇具吸引力的替代方案。尽管如此,DPMs在真实临床场景中存在两大缺陷:采样过程计算代价高昂,以及条件LPET图像与重建PET(RPET)图像之间的对应关系保持不足。为了解决上述局限,本文提出了一种由粗到细的PET重建框架,由粗预测模块(CPM)和迭代细化模块(IRM)组成。CPM通过确定性过程生成粗略的PET图像,IRM则迭代地对残差进行采样。通过将大部分计算开销交由CPM承担,我们方法的整体采样速度得到显著提升。此外,我们还提出并集成了两种额外策略,即辅助引导策略和对比扩散策略,用于增强LPET图像与RPET图像之间的对应关系,进一步提升临床可靠性。在两个人脑PET数据集上的大量实验表明,我们的方法优于最先进的PET重建方法。源代码可在 \url{https://github.com/Show-han/PET-Reconstruction} 获取。
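
A minimal sketch of the coarse-to-fine idea follows (assumed interfaces, not the released code): a deterministic coarse predictor maps LPET to a coarse SPET once, then a lightweight refiner iteratively updates the residual conditioned on the LPET input; the network class and step count are placeholders.

```python
# Coarse-to-fine reconstruction sketch (assumed interfaces, not the released
# code): a deterministic coarse predictor maps LPET -> coarse SPET once, then
# a refiner iteratively updates the residual conditioned on the LPET input.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in for both the coarse predictor and the residual refiner."""
    def __init__(self, in_ch, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def reconstruct(lpet, coarse_net, refine_net, steps=4):
    coarse = coarse_net(lpet)                      # one deterministic pass (coarse stage)
    residual = torch.zeros_like(coarse)
    for _ in range(steps):                         # cheap iterative refinement of the residual
        inp = torch.cat([lpet, coarse + residual], dim=1)
        residual = residual + refine_net(inp)
    return coarse + residual

lpet = torch.randn(1, 1, 64, 64)
out = reconstruct(lpet, TinyUNet(1), TinyUNet(2))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```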

Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.10155
  • repo_url: None
  • paper_authors: Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, Di Huang
  • for: 这篇论文旨在提出一种基于自监督对比学习的异常检测方法(Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation,UniCon-HA),以满足安全关键应用中异常检测的需求。
  • methods: 本研究采用自监督对比学习,对正常样本(inlier)进行单侧聚合,并在网络不同深度采用由易到难的层次增强策略进行对比聚合。同时,引入一种软加权机制,依据每个增强样本偏离正常分布的程度对其重新加权,以保证正常样本的纯净集中,并促使虚拟异常样本分散。
  • results: 实验结果显示,UniCon-HA 在三种异常检测设定(无标签单类、无标签多类和有标签多类)下均超越了其他竞争方法。
    Abstract Anomaly detection (AD), aiming to find samples that deviate from the training distribution, is essential in safety-critical applications. Though recent self-supervised learning based attempts achieve promising results by creating virtual outliers, their training objectives are less faithful to AD which requires a concentrated inlier distribution as well as a dispersive outlier distribution. In this paper, we propose Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), taking into account both the requirements above. Specifically, we explicitly encourage the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, respectively. Considering that standard contrastive data augmentation for generating positive views may induce outliers, we additionally introduce a soft mechanism to re-weight each augmented inlier according to its deviation from the inlier distribution, to ensure a purified concentration. Moreover, to prompt a higher concentration, inspired by curriculum learning, we adopt an easy-to-hard hierarchical augmentation strategy and perform contrastive aggregation at different depths of the network based on the strengths of data augmentation. Our method is evaluated under three AD settings including unlabeled one-class, unlabeled multi-class, and labeled multi-class, demonstrating its consistent superiority over other competitors.
    摘要 异常检测(AD)旨在找出偏离训练分布的样本,在安全关键应用中至关重要。尽管近期基于自监督学习、通过构造虚拟异常样本的尝试取得了可观的结果,但其训练目标与AD的要求并不完全契合:AD既需要集中的正常样本(inlier)分布,也需要分散的异常样本(outlier)分布。在这篇论文中,我们提出了带层次增强的单侧聚合对比学习方法(UniCon-HA),同时兼顾上述两项要求。具体而言,我们分别通过有监督对比损失和无监督对比损失,显式地鼓励正常样本的集中与虚拟异常样本的分散。考虑到用于生成正视图的标准对比数据增强可能引入异常,我们进一步引入一种软机制,依据每个增强后的正常样本偏离正常分布的程度对其重新加权,以确保纯净的集中。此外,为了促进更高程度的集中,受课程学习启发,我们采用由易到难的层次增强策略,并根据数据增强的强度在网络的不同深度进行对比聚合。我们的方法在无标签单类、无标签多类和有标签多类三种AD设定下进行评估,均表现出对其他竞争方法的一致性优势。
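
The soft re-weighting of augmented inliers can be sketched as follows; the weighting formula (a softmax over cosine similarity to a running inlier center) is an illustrative assumption rather than the paper's exact definition.

```python
# Sketch of soft re-weighting for augmented inliers (illustrative formula, an
# assumption rather than the paper's exact definition): views far from the
# running inlier center get smaller weights in a concentration loss.
import torch
import torch.nn.functional as F

def weighted_concentration_loss(embeddings, temperature=0.5):
    """embeddings: (N, D) features of augmented inlier views."""
    z = F.normalize(embeddings, dim=1)
    center = F.normalize(z.mean(dim=0, keepdim=True), dim=1)   # running inlier center
    sim = (z @ center.t()).squeeze(1)                          # cosine similarity to the center
    weights = torch.softmax(sim / temperature, dim=0)          # low weight for outlying views
    # Encourage concentration: weighted average distance to the center.
    return (weights * (1.0 - sim)).sum()

z = torch.randn(32, 128, requires_grad=True)
loss = weighted_concentration_loss(z)
loss.backward()
print(float(loss))
```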

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

  • paper_url: http://arxiv.org/abs/2308.10147
  • repo_url: https://github.com/mxin262/estextspotter
  • paper_authors: Mingxin Huang, Jiaxin Zhang, Dezhi Peng, Hao Lu, Can Huang, Yuliang Liu, Xiang Bai, Lianwen Jin
  • for: 提升场景文本检测与识别(text spotting)的协同性能
  • methods: 提出一种新的显式协同文本检测识别Transformer框架(ESTextSpotter),在单个解码器中建模文本检测与识别的判别性与交互性特征,实现显式协同。
  • results: 实验结果表明,我们的模型显著优于此前的最先进方法。代码可以在 https://github.com/mxin262/ESTextSpotter 上获取。
    Abstract In recent years, end-to-end scene text spotting approaches are evolving to the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent advances in Transformer-based methods usually adopt an implicit synergy strategy with shared query, which can not fully realize the potential of these two interactive tasks. In this paper, we argue that the explicit synergy considering distinct characteristics of text detection and recognition can significantly improve the performance text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional shared query into task-aware queries for text polygon and content, respectively. Through the decoder with the proposed vision-language communication module, the queries interact with each other in an explicit manner while preserving discriminative patterns of text detection and recognition, thus improving performance significantly. Additionally, we propose a task-aware query initialization scheme to ensure stable training. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter.
    摘要 近年来,端到端场景文本检测识别(text spotting)方法正逐步转向基于Transformer的框架。以往研究已表明文本检测与识别之间内在协同的重要性,但近期基于Transformer的方法通常采用共享查询的隐式协同策略,无法充分发挥这两个交互任务的潜力。在本文中,我们认为考虑文本检测与识别不同特性的显式协同能够显著提升文本检测识别的性能。为此,我们提出了显式协同文本检测识别Transformer框架(ESTextSpotter),在单个解码器内建模文本检测与识别的判别性与交互性特征,实现显式协同。具体而言,我们将传统的共享查询分解为面向文本多边形与文本内容的任务感知查询。通过带有所提视觉-语言交流模块的解码器,两组查询以显式方式相互交互,同时保留检测与识别各自的判别模式,从而显著提升性能。此外,我们还提出了任务感知查询初始化方案以保证训练稳定。实验结果表明,我们的模型显著优于此前的最先进方法。代码可在 https://github.com/mxin262/ESTextSpotter 获取。
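
A minimal sketch of the task-aware query idea follows: polygon queries and content queries interact in one decoder layer while remaining separate at the output. The dimensions and layer structure are assumptions, not the released ESTextSpotter code (see the repo above for the actual implementation).

```python
# Sketch of task-aware queries in a single decoder layer (dimensions and
# structure are assumptions): polygon queries and content queries attend to
# each other and to the image memory, then stay separate at the output.
import torch
import torch.nn as nn

class TaskAwareDecoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)   # polygon <-> content
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)   # queries -> image
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, poly_q, content_q, memory):
        q = torch.cat([poly_q, content_q], dim=1)        # explicit interaction of both tasks
        q = q + self.intra(q, q, q)[0]
        q = q + self.cross(q, memory, memory)[0]
        q = q + self.ffn(q)
        n = poly_q.shape[1]
        return q[:, :n], q[:, n:]                        # task-aware outputs stay separate

poly_q = torch.randn(2, 25, 256)      # one polygon query per text instance
content_q = torch.randn(2, 25, 256)   # one content query per text instance
memory = torch.randn(2, 400, 256)     # flattened image features
p, c = TaskAwareDecoderLayer()(poly_q, content_q, memory)
print(p.shape, c.shape)
```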

OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision

  • paper_url: http://arxiv.org/abs/2308.10146
  • repo_url: None
  • paper_authors: Shujie Zhang, Tianyue Zheng, Zhe Chen, Jingzhi Hu, Abdelwahed Khamis, Jiajun Liu, Jun Luo
  • for: 在遮挡场景下提升3D手部姿态估计的精度
  • methods: 利用射频视觉(RF-vision)技术提出 OCHID-Fi 方法:使用智能设备上广泛可用的宽带 RF 传感器穿透遮挡物探测3D手部姿态,并通过跨模态、跨领域训练(在 LoS 条件下由预训练的 CM-HPE 网络引导,再借助对抗学习迁移到遮挡域)完成训练
  • results: 实验结果表明,OCHID-Fi 在遮挡场景下仍能保持与 CM-HPE 相当的精度,并且能够泛化到新领域。Here's a more detailed explanation of each point:
  • for: The paper aims to improve hand pose estimation accuracy in occluded scenarios, which is a challenging problem in many applications such as virtual reality, robotics, and human-computer interaction.
  • methods: The proposed method uses radio-frequency-vision (RF-vision) technology, which can bypass obstacles and capture hand pose information behind them. The method introduces OCHID-Fi, a complex-valued RF-HPE network that is trained using a cross-modality and cross-domain training process. The network is guided by a pre-trained CM-HPE network and a synchronized CM/RF dataset, and it uses adversarial learning to transfer knowledge from labeled LoS domain to unlabeled occluded domain.
  • results: The experimental results show that OCHID-Fi achieves comparable accuracy to CM-HPE under normal conditions while maintaining such accuracy even in occluded scenarios. The method also demonstrates empirical evidence for its generalizability to new domains.
    Abstract Hand Pose Estimation (HPE) is crucial to many applications, but conventional cameras-based CM-HPE methods are completely subject to Line-of-Sight (LoS), as cameras cannot capture occluded objects. In this paper, we propose to exploit Radio-Frequency-Vision (RF-vision) capable of bypassing obstacles for achieving occluded HPE, and we introduce OCHID-Fi as the first RF-HPE method with 3D pose estimation capability. OCHID-Fi employs wideband RF sensors widely available on smart devices (e.g., iPhones) to probe 3D human hand pose and extract their skeletons behind obstacles. To overcome the challenge in labeling RF imaging given its human incomprehensible nature, OCHID-Fi employs a cross-modality and cross-domain training process. It uses a pre-trained CM-HPE network and a synchronized CM/RF dataset, to guide the training of its complex-valued RF-HPE network under LoS conditions. It further transfers knowledge learned from labeled LoS domain to unlabeled occluded domain via adversarial learning, enabling OCHID-Fi to generalize to unseen occluded scenarios. Experimental results demonstrate the superiority of OCHID-Fi: it achieves comparable accuracy to CM-HPE under normal conditions while maintaining such accuracy even in occluded scenarios, with empirical evidence for its generalizability to new domains.
    摘要 手部姿态估计(HPE)对许多应用至关重要,但传统基于摄像头的CM-HPE方法完全受制于视距(LoS)条件,因为摄像头无法捕捉被遮挡的目标。在这篇论文中,我们提议利用能够绕过障碍物的射频视觉(RF-vision)来实现遮挡条件下的HPE,并提出首个具备3D姿态估计能力的RF-HPE方法OCHID-Fi。OCHID-Fi利用智能设备(例如iPhone)上广泛可用的宽带射频传感器,穿透障碍物探测3D人手姿态并提取其骨架。为了克服射频成像难以由人工标注的挑战,OCHID-Fi采用跨模态、跨领域的训练过程:在LoS条件下,利用预训练的CM-HPE网络和同步的CM/RF数据集引导其复值RF-HPE网络的训练;随后通过对抗学习,将从有标注LoS域学到的知识迁移到无标注的遮挡域,使OCHID-Fi能够泛化到未见过的遮挡场景。实验结果证明了OCHID-Fi的优越性:它在正常条件下达到与CM-HPE相当的精度,并在遮挡场景下保持这一精度,同时有经验证据表明其可泛化到新领域。
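
The cross-domain transfer step can be sketched with a standard gradient-reversal discriminator, as below; all modules and shapes are placeholders rather than the authors' network, and the pseudo labels stand in for the CM-HPE teacher's output.

```python
# Sketch of adversarial domain transfer with a gradient-reversal layer
# (placeholder modules, not the authors' network): the pose head trains on
# labeled LoS samples, while the discriminator pushes LoS and occluded
# features toward the same distribution.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())       # shared RF feature extractor
pose_head = nn.Linear(64, 21 * 3)                             # 21 hand keypoints in 3D
domain_disc = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

rf_los = torch.randn(8, 128)        # labeled line-of-sight RF features
rf_occ = torch.randn(8, 128)        # unlabeled occluded-domain RF features
kp_gt = torch.randn(8, 21 * 3)      # pseudo labels from a CM-HPE teacher

f_los, f_occ = encoder(rf_los), encoder(rf_occ)
pose_loss = nn.functional.mse_loss(pose_head(f_los), kp_gt)

feats = torch.cat([f_los, f_occ])
dom_labels = torch.cat([torch.zeros(8, 1), torch.ones(8, 1)])
dom_logits = domain_disc(GradReverse.apply(feats, 1.0))
adv_loss = nn.functional.binary_cross_entropy_with_logits(dom_logits, dom_labels)

(pose_loss + adv_loss).backward()
print(float(pose_loss), float(adv_loss))
```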

Polymerized Feature-based Domain Adaptation for Cervical Cancer Dose Map Prediction

  • paper_url: http://arxiv.org/abs/2308.10142
  • repo_url: None
  • paper_authors: Jie Zeng, Zeyu Han, Xingchen Peng, Jianghong Xiao, Peng Wang, Yan Wang
  • for: 提高宫颈癌放疗计划剂量图预测的准确性,利用深度学习自动化并加速放疗计划设计
  • methods: 通过领域自适应,将扫描区域相同且临床数据更充足的直肠癌上学到的丰富知识迁移到宫颈癌剂量图预测中,并提出基于Transformer的聚合特征模块(PFM)以弥合两个领域之间的先天差距
  • results: 实验结果表明,提出的方法在两个内部临床数据集上显示出优于现有方法的性能
    Abstract Recently, deep learning (DL) has automated and accelerated the clinical radiation therapy (RT) planning significantly by predicting accurate dose maps. However, most DL-based dose map prediction methods are data-driven and not applicable for cervical cancer where only a small amount of data is available. To address this problem, this paper proposes to transfer the rich knowledge learned from another cancer, i.e., rectum cancer, which has the same scanning area and more clinically available data, to improve the dose map prediction performance for cervical cancer through domain adaptation. In order to close the congenital domain gap between the source (i.e., rectum cancer) and the target (i.e., cervical cancer) domains, we develop an effective Transformer-based polymerized feature module (PFM), which can generate an optimal polymerized feature distribution to smoothly align the two input distributions. Experimental results on two in-house clinical datasets demonstrate the superiority of the proposed method compared with state-of-the-art methods.
    摘要 近来,深度学习(DL)通过预测精确的剂量图,显著地自动化并加速了临床放射治疗(RT)计划的制定。然而,大多数基于DL的剂量图预测方法都是数据驱动的,并不适用于仅有少量数据可用的宫颈癌。为了解决这一问题,本文提出通过领域自适应,将从另一种癌症——直肠癌——中学到的丰富知识迁移过来,以提升宫颈癌剂量图预测的性能;直肠癌与宫颈癌具有相同的扫描区域,且拥有更多临床可用数据。为了弥合源域(直肠癌)与目标域(宫颈癌)之间的先天领域差距,我们设计了一个有效的基于Transformer的聚合特征模块(PFM),它能够生成最优的聚合特征分布,以平滑地对齐两个输入分布。在两个内部临床数据集上的实验结果表明,所提方法优于最先进的方法。
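
The paper's polymerized feature module (PFM) is not reproduced here. As a generic illustration of what "aligning two feature distributions" can mean in code, the snippet below computes a maximum-mean-discrepancy (MMD) loss with an RBF kernel between source and target features; it is a common stand-in technique, not the PFM.

```python
# Generic maximum-mean-discrepancy (MMD) alignment loss with an RBF kernel —
# a common stand-in for "aligning two feature distributions"; this is NOT the
# paper's polymerized feature module, only an illustration of the idea.
import torch

def rbf_mmd(x, y, sigma=1.0):
    """x: (N, D) source features, y: (M, D) target features."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src = torch.randn(64, 256)   # e.g., rectum-cancer features
tgt = torch.randn(32, 256)   # e.g., cervical-cancer features
print(float(rbf_mmd(src, tgt)))
```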

March in Chat: Interactive Prompting for Remote Embodied Referring Expression

  • paper_url: http://arxiv.org/abs/2308.10141
  • repo_url: https://github.com/yanyuanqiao/mic
  • paper_authors: Yanyuan Qiao, Yuankai Qi, Zheng Yu, Jing Liu, Qi Wu
  • for: The paper proposes a March-in-Chat (MiC) model that can talk to a Large Language Model (LLM) on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP), improving the performance of Vision-and-Language Navigation (VLN) tasks in the REVERIE setting.
  • methods: The paper uses a MiC model that incorporates a ROASP to enable the LLM to plan actions based on the current visual observation and adapt to the larger and more complex REVERIE environment.
  • results: The paper shows that the proposed MiC model outperforms the previous state-of-the-art by large margins in terms of SPL and RGSPL metrics on the REVERIE benchmark.
    Abstract Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent years, from room-based to object-based and indoor to outdoor. The REVERIE (Remote Embodied Referring Expression) is interesting since it only provides high-level instructions to the agent, which are closer to human commands in practice. Nevertheless, this poses more challenges than other VLN tasks since it requires agents to infer a navigation plan only based on a short instruction. Large Language Models (LLMs) show great potential in robot action planning by providing proper prompts. Still, this strategy has not been explored under the REVERIE settings. There are several new challenges. For example, the LLM should be environment-aware so that the navigation plan can be adjusted based on the current visual observation. Moreover, the LLM planned actions should be adaptable to the much larger and more complex REVERIE environment. This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the previous state-of-the-art by large margins by SPL and RGSPL metrics on the REVERIE benchmark.
    摘要 近年来提出了许多视觉语言导航(VLN)任务,从基于房间到基于物体,从室内到室外。REVERIE(远程具身指代表达)任务很有意思,因为它只向智能体提供高层指令,这在实践中更接近人类的命令方式。然而,这也比其他VLN任务带来了更多挑战,因为智能体需要仅凭一条简短指令推断导航计划。大语言模型(LLM)只需提供合适的提示词,便在机器人动作规划方面展现出巨大潜力;但这一策略尚未在REVERIE设定下被探索。这里存在若干新挑战:例如,LLM需要具备环境感知能力,以便根据当前视觉观察调整导航计划;此外,LLM规划出的动作还需适配规模更大、更复杂的REVERIE环境。本文提出了一种可以在行进中与LLM实时对话的March-in-Chat(MiC)模型,并基于新提出的房间与物体感知场景感知器(ROASP)进行动态规划。我们的MiC模型在REVERIE基准上,在SPL和RGSPL指标上大幅超越了此前的最先进方法。
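
A minimal sketch of environment-aware, on-the-fly LLM planning follows. Every function in it (`llm_chat`, `perceive_scene`) is a hypothetical stand-in rather than the MiC code or a real API, and the prompt wording is invented for illustration: each step folds the latest scene summary into the prompt and asks the LLM for the next sub-goal.

```python
# Sketch of on-the-fly, environment-aware LLM planning (all functions below —
# llm_chat, perceive_scene — are hypothetical stand-ins, not the MiC code or
# any real API): each step folds the latest scene summary into the prompt and
# asks the LLM for the next sub-goal.
from typing import List

def llm_chat(messages: List[dict]) -> str:
    """Placeholder for a chat-completion call to some LLM backend."""
    return "Walk to the doorway ahead, then enter the room on the left."

def perceive_scene(observation) -> str:
    """Placeholder ROASP-style summary: room type plus visible objects."""
    return "room type: hallway; visible objects: doorway, picture frame, lamp"

def plan_step(instruction: str, observation, history: List[str]) -> str:
    scene = perceive_scene(observation)
    messages = [
        {"role": "system", "content": "You plan navigation sub-goals for an embodied agent."},
        {"role": "user", "content": (
            f"High-level instruction: {instruction}\n"
            f"Current scene: {scene}\n"
            f"Completed sub-goals: {history}\n"
            "Give the single next sub-goal.")},
    ]
    return llm_chat(messages)

history: List[str] = []
next_goal = plan_step("Go to the bedroom and bring back the red pillow.", None, history)
history.append(next_goal)
print(next_goal)
```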

HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation

  • paper_url: http://arxiv.org/abs/2308.10122
  • repo_url: None
  • paper_authors: Xiufeng Xie, Riccardo Gherardi, Zhihong Pan, Stephen Huang
  • for: 提高 NeRF 训练和评估的效率,使用 hashgrid-based 位置编码和神经网络,但是有效地利用三维场景的空间稀疏性仍然是一个挑战。
  • methods: 我们提出了一种新的压缩方案 HollowNeRF,在训练阶段自动稀疏化 hashgrid-based NeRF 的特征网格:训练一个粗略的三维显著性掩模来引导高效的特征剪枝,并使用交替方向乘子法(ADMM)剪枝器在训练过程中稀疏化该掩模,从而在使用更少参数的同时保持高质量渲染。
  • results: 我们的方法可以与 Instant-NGP 相比,提供类似的渲染质量,但是使用的参数数量只有 Instant-NGP 的 31%。此外,我们的方法可以在 PSNR 精度上增加至 1dB,使用的参数数量仅占 Instant-NGP 的 56%。
    Abstract Neural radiance fields (NeRF) have garnered significant attention, with recent works such as Instant-NGP accelerating NeRF training and evaluation through a combination of hashgrid-based positional encoding and neural networks. However, effectively leveraging the spatial sparsity of 3D scenes remains a challenge. To cull away unnecessary regions of the feature grid, existing solutions rely on prior knowledge of object shape or periodically estimate object shape during training by repeated model evaluations, which are costly and wasteful. To address this issue, we propose HollowNeRF, a novel compression solution for hashgrid-based NeRF which automatically sparsifies the feature grid during the training phase. Instead of directly compressing dense features, HollowNeRF trains a coarse 3D saliency mask that guides efficient feature pruning, and employs an alternating direction method of multipliers (ADMM) pruner to sparsify the 3D saliency mask during training. By exploiting the sparsity in the 3D scene to redistribute hash collisions, HollowNeRF improves rendering quality while using a fraction of the parameters of comparable state-of-the-art solutions, leading to a better cost-accuracy trade-off. Our method delivers comparable rendering quality to Instant-NGP, while utilizing just 31% of the parameters. In addition, our solution can achieve a PSNR accuracy gain of up to 1dB using only 56% of the parameters.
    摘要 神经辐射场(NeRF)受到了广泛关注,近期如 Instant-NGP 等工作通过结合基于 hashgrid 的位置编码与神经网络,加速了 NeRF 的训练与评估。然而,如何有效利用三维场景的空间稀疏性仍是一个挑战。为了剔除特征网格中不必要的区域,现有方案要么依赖物体形状的先验知识,要么在训练中通过反复的模型评估周期性地估计物体形状,代价高昂且浪费。为解决这一问题,我们提出了 HollowNeRF,一种针对基于 hashgrid 的 NeRF 的新型压缩方案,可在训练阶段自动稀疏化特征网格。HollowNeRF 不直接压缩稠密特征,而是训练一个粗略的三维显著性掩模来引导高效的特征剪枝,并使用交替方向乘子法(ADMM)剪枝器在训练过程中稀疏化该显著性掩模。通过利用三维场景中的稀疏性来重新分配哈希冲突,HollowNeRF 在仅使用可比的最先进方案一小部分参数的情况下提升了渲染质量,实现了更优的成本-精度权衡。我们的方法可以在仅使用 Instant-NGP 31% 参数的情况下提供与之相当的渲染质量;此外,仅使用 56% 的参数即可获得最高 1dB 的 PSNR 精度提升。
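
The ADMM-style sparsification of the saliency mask can be sketched as a soft-threshold (proximal) update plus a dual update; the hyperparameters below are illustrative, not the HollowNeRF settings.

```python
# ADMM-style sparsification sketch (illustrative hyperparameters, not the
# HollowNeRF code): soft-threshold an auxiliary copy of the saliency mask and
# accumulate the dual variable, pulling the trained mask toward sparsity.
import torch

def soft_threshold(x, tau):
    return torch.sign(x) * torch.clamp(torch.abs(x) - tau, min=0.0)

# Trainable 3D saliency mask (flattened for the sketch) and ADMM variables.
mask = (torch.rand(1000) * 0.2).requires_grad_()
z = torch.zeros_like(mask)
u = torch.zeros_like(mask)
rho, l1_weight = 1.0, 0.1

with torch.no_grad():                                    # auxiliary/dual updates carry no gradients
    z = soft_threshold(mask + u, l1_weight / rho)        # proximal step -> sparse copy of the mask
    u = u + mask - z                                     # dual ascent on the constraint mask = z

penalty = 0.5 * rho * ((mask - z + u) ** 2).sum()        # differentiable term added to the loss
penalty.backward()
print(float((z == 0).float().mean()), mask.grad is not None)   # pruned fraction, True
```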

PDL: Regularizing Multiple Instance Learning with Progressive Dropout Layers

  • paper_url: http://arxiv.org/abs/2308.10112
  • repo_url: https://github.com/chongqingnosubway/pdl
  • paper_authors: Wenhui Zhu, Peijie Qiu, Oana M. Dumitrascu, Yalin Wang
  • for: 本研究旨在提高多个实例学习(MIL)模型的性能,特别是在弱监督学习情况下。
  • methods: 本研究提出了一种新的进步Dropout层(PDL),用于在MIL模型中进行规范。PDL不仅能够降低过拟合,还能够帮助MIL模型找到更加复杂和有力的特征表示。
  • results: 在多个 MIL benchmark 数据集上的广泛评估显示,将 PDL 集成到多种 MIL 方法中不仅提升了分类性能,还增强了其弱监督特征定位的能力。
    Abstract Multiple instance learning (MIL) was a weakly supervised learning approach that sought to assign binary class labels to collections of instances known as bags. However, due to their weak supervision nature, the MIL methods were susceptible to overfitting and required assistance in developing comprehensive representations of target instances. While regularization typically effectively combated overfitting, its integration with the MIL model has been frequently overlooked in prior studies. Meanwhile, current regularization methods for MIL have shown limitations in their capacity to uncover a diverse array of representations. In this study, we delve into the realm of regularization within the MIL model, presenting a novel approach in the form of a Progressive Dropout Layer (PDL). We aim to not only address overfitting but also empower the MIL model in uncovering intricate and impactful feature representations. The proposed method was orthogonal to existing MIL methods and could be easily integrated into them to boost performance. Our extensive evaluation across a range of MIL benchmark datasets demonstrated that the incorporation of the PDL into multiple MIL methods not only elevated their classification performance but also augmented their potential for weakly-supervised feature localizations.
    摘要 多示例学习(MIL)是一种弱监督学习方法,旨在为被称为“包”(bag)的实例集合分配二元类别标签。然而,由于其弱监督的特性,MIL方法容易过拟合,且在构建目标实例的全面表示方面需要额外帮助。虽然正则化通常能有效对抗过拟合,但以往研究往往忽视了将其与MIL模型结合;同时,现有面向MIL的正则化方法在挖掘多样化表示方面能力有限。在这项研究中,我们深入探究MIL模型中的正则化问题,提出了渐进式Dropout层(PDL)这一新方法。我们的目标不仅是解决过拟合,还要使MIL模型能够挖掘复杂且有影响力的特征表示。所提方法与现有MIL方法正交,可以方便地集成以提升性能。我们在一系列MIL基准数据集上的大量评估表明,将PDL集成到多种MIL方法中,不仅提升了其分类性能,还增强了其弱监督特征定位的潜力。
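
A "progressive dropout" behavior can be sketched as a dropout layer whose probability ramps up with training progress. The linear warm-up schedule below is an illustrative assumption, not the paper's exact design.

```python
# Progressive-dropout sketch (the linear ramp schedule is an illustrative
# assumption, not the paper's exact design): dropout probability grows with
# training progress, so regularization strength increases over time.
import torch
import torch.nn as nn

class ProgressiveDropout(nn.Module):
    def __init__(self, max_p=0.5, warmup_epochs=20):
        super().__init__()
        self.max_p = max_p
        self.warmup_epochs = warmup_epochs
        self.current_p = 0.0

    def set_epoch(self, epoch):
        self.current_p = self.max_p * min(1.0, epoch / self.warmup_epochs)

    def forward(self, x):
        return nn.functional.dropout(x, p=self.current_p, training=self.training)

pdl = ProgressiveDropout()
feats = torch.randn(16, 512)            # instance features inside a MIL model
for epoch in [0, 10, 20]:
    pdl.set_epoch(epoch)
    out = pdl(feats)
    print(epoch, pdl.current_p, out.shape)
```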

Controllable Multi-domain Semantic Artwork Synthesis

  • paper_url: http://arxiv.org/abs/2308.10111
  • repo_url: None
  • paper_authors: Yuantian Huang, Satoshi Iizuka, Edgar Simo-Serra, Kazuhiro Fukui
  • for: 该论文旨在提出一个从语义布局进行多域艺术作品合成的框架,以解决艺术合成任务缺乏公开可用分割数据集的问题。
  • methods: 该论文提出了一个名为 ArtSem 的数据集,包含来自4个不同领域的 40,000 幅艺术作品及其对应的语义标签地图,并提出一种基于条件GAN的方法,无需成对训练数据即可从语义地图高质量地生成艺术作品。此外,论文还提出了基于领域相关变分编码器的多域合成模型,并辅以联合归一化语义与风格的 Spatially STyle-Adaptive Normalization(SSTAN)。
  • results: 实验结果表明,该模型能够学习风格与语义信息的联合表示,从而获得更好的艺术图像生成质量。此外,通过识别隐空间中分隔各领域的超平面,还可以对合成的艺术作品进行细粒度控制。相比之前的方法,该模型可以生成更高质量的艺术作品。
    Abstract We present a novel framework for multi-domain synthesis of artwork from semantic layouts. One of the main limitations of this challenging task is the lack of publicly available segmentation datasets for art synthesis. To address this problem, we propose a dataset, which we call ArtSem, that contains 40,000 images of artwork from 4 different domains with their corresponding semantic label maps. We generate the dataset by first extracting semantic maps from landscape photography and then propose a conditional Generative Adversarial Network (GAN)-based approach to generate high-quality artwork from the semantic maps without necessitating paired training data. Furthermore, we propose an artwork synthesis model that uses domain-dependent variational encoders for high-quality multi-domain synthesis. The model is improved and complemented with a simple but effective normalization method, based on normalizing both the semantic and style jointly, which we call Spatially STyle-Adaptive Normalization (SSTAN). In contrast to previous methods that only take semantic layout as input, our model is able to learn a joint representation of both style and semantic information, which leads to better generation quality for synthesizing artistic images. Results indicate that our model learns to separate the domains in the latent space, and thus, by identifying the hyperplanes that separate the different domains, we can also perform fine-grained control of the synthesized artwork. By combining our proposed dataset and approach, we are able to generate user-controllable artwork that is of higher quality than existing
    摘要 我们提出了一种从语义布局进行多域艺术作品合成的新框架。这一挑战性任务的主要限制之一是缺乏公开可用的艺术合成分割数据集。为了解决这一问题,我们提出了一个名为 ArtSem 的数据集,包含来自4个不同领域的 40,000 幅艺术作品图像及其对应的语义标签地图。我们先从风景摄影中提取语义地图,再提出一种基于条件生成对抗网络(GAN)的方法,无需成对训练数据即可从语义地图生成高质量的艺术作品。此外,我们还提出了一种使用领域相关变分编码器的艺术作品合成模型,以实现高质量的多域合成,并辅以一种简单而有效的归一化方法——对语义与风格进行联合归一化的空间风格自适应归一化(SSTAN)。与以往仅以语义布局为输入的方法不同,我们的模型能够学习风格与语义信息的联合表示,从而在合成艺术图像时获得更好的生成质量。结果表明,我们的模型学会了在隐空间中分离不同领域,因此通过识别分隔各领域的超平面,我们还可以对合成的艺术作品进行细粒度控制。结合我们提出的数据集与方法,我们能够生成质量高于现有方法、且可由用户控制的艺术作品。
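
Jointly normalizing on semantics and style can be sketched as a SPADE-style spatially-adaptive normalization whose per-pixel scale and shift are predicted from the semantic map together with a style vector; this variant is an assumption, not the exact SSTAN formulation.

```python
# SPADE-style spatially-adaptive normalization conditioned jointly on a
# semantic map and a style vector (an assumed variant, not the exact SSTAN
# formulation): per-pixel gamma/beta come from both conditioning signals.
import torch
import torch.nn as nn

class JointStyleSemanticNorm(nn.Module):
    def __init__(self, feat_ch, num_classes, style_dim=64, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        self.shared = nn.Conv2d(num_classes + style_dim, hidden, 3, padding=1)
        self.to_gamma = nn.Conv2d(hidden, feat_ch, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)

    def forward(self, x, semantic_onehot, style):
        # Broadcast the style vector spatially and fuse it with the semantic map.
        b, _, h, w = x.shape
        style_map = style[:, :, None, None].expand(b, style.shape[1], h, w)
        cond = torch.relu(self.shared(torch.cat([semantic_onehot, style_map], dim=1)))
        return self.norm(x) * (1 + self.to_gamma(cond)) + self.to_beta(cond)

x = torch.randn(2, 256, 32, 32)                 # generator features
sem = torch.randn(2, 20, 32, 32).softmax(dim=1) # stand-in for a one-hot semantic map
style = torch.randn(2, 64)                      # per-domain / per-image style code
print(JointStyleSemanticNorm(256, 20)(x, sem, style).shape)
```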

Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular Videos

  • paper_url: http://arxiv.org/abs/2308.10089
  • repo_url: None
  • paper_authors: Yikai Wang, Yinpeng Dong, Fuchun Sun, Xiao Yang
  • for: 该论文旨在基于单目RGB视频序列实现非刚性物体的三维重建。
  • methods: 该方法为根位姿分解(Root Pose Decomposition,RPD):为每一帧维护一个根位姿变换,同时构建带有局部变换的稠密场来校正根位姿;局部变换的优化通过向规范空间(canonical space)的点配准完成。该方法不假设已知物体根位姿,也不依赖特定类别的模板或稠密姿态先验。
  • results: RPD能够在包含大形变、复杂运动模式、遮挡以及个体尺度差异等复杂场景中实现非刚性三维重建,且该管线有望扩展到野外多样化的物体集合。实验表明,RPD在具有挑战性的DAVIS、OVIS和AMA数据集上超越了现有最先进方法。
    Abstract This work focuses on the 3D reconstruction of non-rigid objects based on monocular RGB video sequences. Concretely, we aim at building high-fidelity models for generic object categories and casually captured scenes. To this end, we do not assume known root poses of objects, and do not utilize category-specific templates or dense pose priors. The key idea of our method, Root Pose Decomposition (RPD), is to maintain a per-frame root pose transformation, meanwhile building a dense field with local transformations to rectify the root pose. The optimization of local transformations is performed by point registration to the canonical space. We also adapt RPD to multi-object scenarios with object occlusions and individual differences. As a result, RPD allows non-rigid 3D reconstruction for complicated scenarios containing objects with large deformations, complex motion patterns, occlusions, and scale diversities of different individuals. Such a pipeline potentially scales to diverse sets of objects in the wild. We experimentally show that RPD surpasses state-of-the-art methods on the challenging DAVIS, OVIS, and AMA datasets.
    摘要 本工作关注基于单目RGB视频序列的非刚性物体三维重建,目标是为通用物体类别和随手拍摄的场景构建高保真模型。为此,我们不假设已知物体的根位姿,也不使用特定类别的模板或稠密姿态先验。我们方法——根位姿分解(RPD)——的核心思想是:为每一帧维护一个根位姿变换,同时构建带有局部变换的稠密场来校正根位姿;局部变换的优化通过向规范空间的点配准完成。我们还将RPD扩展到存在物体遮挡与个体差异的多物体场景。因此,RPD能够对包含大形变、复杂运动模式、遮挡以及不同个体尺度差异的复杂场景进行非刚性三维重建,该管线有望扩展到野外多样化的物体集合。实验表明,RPD在具有挑战性的DAVIS、OVIS和AMA数据集上超越了最先进的方法。

MeDM: Mediating Image Diffusion Models for Video-to-Video Translation with Temporal Correspondence Guidance

  • paper_url: http://arxiv.org/abs/2308.10079
  • repo_url: None
  • paper_authors: Ernie Chu, Tzuhsuan Huang, Shuo-Yen Lin, Jun-Cheng Chen
  • for: 本研究提出了一种高效和可靠的方法,MeDM,利用预训练的图像扩散模型进行视频到视频翻译,保持 temporal 流动的一致性。
  • methods: 该方法利用显式光流构建一种实用的编码,对生成的帧施加物理约束,并调和各帧相互独立的分数;借助该编码,保持生成视频的时间一致性可以被表述为一个具有闭式解的优化问题。
  • results: 在多个基准上进行的大量定性、定量与主观实验表明了所提方法的有效性与优越性。
    Abstract This study introduces an efficient and effective method, MeDM, that utilizes pre-trained image Diffusion Models for video-to-video translation with consistent temporal flow. The proposed framework can render videos from scene position information, such as a normal G-buffer, or perform text-guided editing on videos captured in real-world scenarios. We employ explicit optical flows to construct a practical coding that enforces physical constraints on generated frames and mediates independent frame-wise scores. By leveraging this coding, maintaining temporal consistency in the generated videos can be framed as an optimization problem with a closed-form solution. To ensure compatibility with Stable Diffusion, we also suggest a workaround for modifying observed-space scores in latent-space Diffusion Models. Notably, MeDM does not require fine-tuning or test-time optimization of the Diffusion Models. Through extensive qualitative, quantitative, and subjective experiments on various benchmarks, the study demonstrates the effectiveness and superiority of the proposed approach.
    摘要 本研究提出了一种高效且有效的方法 MeDM,利用预训练的图像扩散模型进行具有一致时间流的视频到视频翻译。所提框架既可以从场景位置信息(例如普通的G-buffer)渲染视频,也可以对真实场景中拍摄的视频进行文本引导编辑。我们利用显式光流构建一种实用的编码,对生成的帧施加物理约束,并调和各帧相互独立的分数。借助该编码,保持生成视频的时间一致性可以被表述为一个具有闭式解的优化问题。为了与 Stable Diffusion 兼容,我们还提出了在隐空间扩散模型中修改观测空间分数的变通方案。值得注意的是,MeDM 不需要对扩散模型进行微调或测试时优化。通过在多个基准上进行的大量定性、定量与主观实验,本研究证明了所提方法的有效性与优越性。
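
The flow-mediated consistency idea has a simple closed form: pixels linked by optical flow across frames minimize a least-squares consistency objective when replaced by the mean of their per-frame estimates. The toy example below (synthetic frames and a constant flow, not the MeDM code) backward-warps one frame onto another and averages.

```python
# Sketch of flow-mediated consistency (synthetic data, not the MeDM code):
# pixels linked by optical flow across two frames are replaced by the mean of
# their per-frame estimates — the closed-form least-squares solution of
# minimizing sum_t (y - x_t)^2 over each correspondence.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,C,H,W) with `flow` (B,2,H,W) given in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float()[None] + flow
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1          # normalize to [-1, 1] for grid_sample
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)

frame0 = torch.rand(1, 3, 64, 64)                       # independent per-frame estimates
frame1 = torch.rand(1, 3, 64, 64)
flow_0to1 = torch.ones(1, 2, 64, 64) * 2.0              # toy flow: +2 px in x and y

frame1_in_0 = warp(frame1, flow_0to1)                   # frame 1 resampled onto frame 0's pixels
consistent0 = 0.5 * (frame0 + frame1_in_0)              # closed-form average over the trajectory
print(float((consistent0 - frame0).abs().mean()))
```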

Sensitivity analysis of AI-based algorithms for autonomous driving on optical wavefront aberrations induced by the windshield

  • paper_url: http://arxiv.org/abs/2308.11711
  • repo_url: None
  • paper_authors: Dominik Werner Wolf, Markus Ulrich, Nikhil Kapoor
  • for: 本研究旨在解决自动驾驶感知技术中的领域偏移问题,评估两种感知模型对不同挡风玻璃配置的敏感性。
  • methods: 本研究使用基于 Fourier optics 的威胁模型评估挡风玻璃引入的光学波前像差对感知模型的影响,并分析 neural network benchmark 指标与光学质量函数之间的相关性。
  • results: 研究发现,挡风玻璃会引入性能差距,导致感知模型表现下降,而现有用于提出要求的光学指标可能并不充分。
    Abstract Autonomous driving perception techniques are typically based on supervised machine learning models that are trained on real-world street data. A typical training process involves capturing images with a single car model and windshield configuration. However, deploying these trained models on different car types can lead to a domain shift, which can potentially hurt the neural networks performance and violate working ADAS requirements. To address this issue, this paper investigates the domain shift problem further by evaluating the sensitivity of two perception models to different windshield configurations. This is done by evaluating the dependencies between neural network benchmark metrics and optical merit functions by applying a Fourier optics based threat model. Our results show that there is a performance gap introduced by windshields and existing optical metrics used for posing requirements might not be sufficient.
    摘要 自主驾驶感知技术通常基于supervised机器学习模型,通过实际街道数据进行训练。一般训练过程中会使用单车型和车窗配置拍摄图像。但是,将训练过的模型部署到不同车型上可能会导致域名shift,这可能会影响神经网络性能,并违反工作ADAS要求。为解决这个问题,本文进一步研究域名shift问题,评估两种感知模型对不同车窗配置的敏感性。通过应用 Fourier optics based threat model,我们发现存在由车窗引入的性能差距,现有的光学指标可能不够。