cs.CV - 2023-09-19

A Novel Deep Neural Network for Trajectory Prediction in Automated Vehicles Using Velocity Vector Field

  • paper_url: http://arxiv.org/abs/2309.10948
  • repo_url: https://github.com/Amir-Samadi/VVF-TP
  • paper_authors: MReza Alipour Sormoli, Amir Samadi, Sajjad Mozaffari, Konstantinos Koufos, Mehrdad Dianati, Roger Woodman
  • for: Anticipating the motion direction and speed of other road users so that automated driving systems (ADS) can perform safe and informed motion planning and decision-making.
  • methods: Combines a data-driven learning-based method with a velocity vector field (VVF) derived from the nature-inspired concept of fluid flow dynamics, feeding the field as an additional input to a deep neural network that predicts trajectories from bird's eye view scene representations.
  • results: Compared with existing methods, the proposed trajectory prediction technique improves accuracy over short- and long-term (up to 5 s) prediction horizons and across different observation windows, reducing the amount of past observation history needed for accurate trajectory prediction.
    Abstract Anticipating the motion of other road users is crucial for automated driving systems (ADS), as it enables safe and informed downstream decision-making and motion planning. Unfortunately, contemporary learning-based approaches for motion prediction exhibit significant performance degradation as the prediction horizon increases or the observation window decreases. This paper proposes a novel technique for trajectory prediction that combines a data-driven learning-based method with a velocity vector field (VVF) generated from a nature-inspired concept, i.e., fluid flow dynamics. In this work, the vector field is incorporated as an additional input to a convolutional-recurrent deep neural network to help predict the most likely future trajectories given a sequence of bird's eye view scene representations. The performance of the proposed model is compared with state-of-the-art methods on the HighD dataset demonstrating that the VVF inclusion improves the prediction accuracy for both short and long-term (5~sec) time horizons. It is also shown that the accuracy remains consistent with decreasing observation windows which alleviates the requirement of a long history of past observations for accurate trajectory prediction. Source codes are available at: https://github.com/Amir-Samadi/VVF-TP.
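As a concrete illustration of the fusion described in the abstract, below is a minimal PyTorch sketch of a convolutional-recurrent predictor whose input stacks a bird's-eye-view raster with the two-channel velocity vector field. All layer sizes, the input resolution, and the prediction horizon are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class BEVTrajectoryNet(nn.Module):
    """Conv-recurrent sketch: BEV occupancy + VVF channels -> future (x, y)."""
    def __init__(self, bev_channels=1, vvf_channels=2, hidden=64, horizon=25):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(bev_channels + vvf_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 2)     # (x, y) per future step

    def forward(self, bev_seq, vvf_seq):
        # bev_seq: (B, T, 1, H, W), vvf_seq: (B, T, 2, H, W)
        x = torch.cat([bev_seq, vvf_seq], dim=2)       # stack VVF as extra channels
        b, t = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(feats)                         # summarize the observed window
        return self.head(h[-1]).view(b, -1, 2)         # (B, horizon, 2)

model = BEVTrajectoryNet()
traj = model(torch.rand(4, 10, 1, 128, 128), torch.rand(4, 10, 2, 128, 128))
```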

A Geometric Flow Approach for Segmentation of Images with Inhomogeneous Intensity and Missing Boundaries

  • paper_url: http://arxiv.org/abs/2309.10935
  • repo_url: None
  • paper_authors: Paramjyoti Mohapatra, Richard Lartey, Weihong Guo, Michael Judkovich, Xiaojuan Li
  • for: Targets muscle segmentation in MR images, which is especially difficult because such images exhibit strong intensity inhomogeneity and missing boundaries between tightly packed muscles; the paper proposes a new intensity correction together with a semi-automatic segmentation approach.
  • methods: Uses a geometric flow that incorporates an RKHS edge detector and a geodesic distance penalty term derived from sets of markers and anti-markers. The paper also introduces a new bias field estimation method, Prior Bias-Corrected Fuzzy C-means (PBCFCM), to handle the intensity inhomogeneity in MR images.
  • results: Numerical experiments show significant improvements over the compared methods, with average Dice values of 92.5%, 85.3% and 85.3% for the quadriceps, hamstrings and other muscle groups, while the other approaches are at least 10% worse.
    Abstract Image segmentation is a complex mathematical problem, especially for images that contain intensity inhomogeneity and tightly packed objects with missing boundaries in between. For instance, Magnetic Resonance (MR) muscle images often contain both of these issues, making muscle segmentation especially difficult. In this paper we propose a novel intensity correction and a semi-automatic active contour based segmentation approach. The approach uses a geometric flow that incorporates a reproducing kernel Hilbert space (RKHS) edge detector and a geodesic distance penalty term from a set of markers and anti-markers. We test the proposed scheme on MR muscle segmentation and compare with some state of the art methods. To help deal with the intensity inhomogeneity in this particular kind of image, a new approach to estimate the bias field using a fat fraction image, called Prior Bias-Corrected Fuzzy C-means (PBCFCM), is introduced. Numerical experiments show that the proposed scheme leads to significantly better results than compared ones. The average dice values of the proposed method are 92.5%, 85.3%, 85.3% for quadriceps, hamstrings and other muscle groups while other approaches are at least 10% worse.
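For reference, the following is a minimal NumPy sketch of plain fuzzy C-means clustering on pixel intensities, the algorithm that PBCFCM builds on. The paper's prior/bias-field estimation from a fat fraction image is not included, and all parameters here are illustrative assumptions.

```python
import numpy as np

def fuzzy_cmeans(x, n_clusters=3, m=2.0, n_iter=50, eps=1e-9):
    """Standard fuzzy C-means on a 1-D array of pixel intensities."""
    rng = np.random.default_rng(0)
    u = rng.dirichlet(np.ones(n_clusters), size=x.size)      # soft memberships (N, C)
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ x) / (um.sum(axis=0) + eps)        # weighted cluster means
        dist = np.abs(x[:, None] - centers[None, :]) + eps   # (N, C) distances
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)             # membership update
    return centers, u

intensities = np.random.rand(10000)            # flattened MR image intensities
centers, memberships = fuzzy_cmeans(intensities, n_clusters=3)
```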

Incremental Multimodal Surface Mapping via Self-Organizing Gaussian Mixture Models

  • paper_url: http://arxiv.org/abs/2309.10900
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Kshitij Goel, Wennie Tabib
  • for: Proposes an incremental multimodal surface mapping method that reconstructs the environment at high fidelity while compressing spatial and intensity point cloud data.
  • methods: Represents the environment with Gaussian mixture models (GMMs), introduces a spatial hash map for rapid GMM submap extraction, and adds an approach to determine relevant and redundant data in a point cloud, which increases computational speed.
  • results: Evaluations on simulated and real-world data show high-fidelity maps with compressed spatial and intensity point cloud data, and a better tradeoff between map accuracy and size than state-of-the-art mapping methods.
    Abstract This letter describes an incremental multimodal surface mapping methodology, which represents the environment as a continuous probabilistic model. This model enables high-resolution reconstruction while simultaneously compressing spatial and intensity point cloud data. The strategy employed in this work utilizes Gaussian mixture models (GMMs) to represent the environment. While prior GMM-based mapping works have developed methodologies to determine the number of mixture components using information-theoretic techniques, these approaches either operate on individual sensor observations, making them unsuitable for incremental mapping, or are not real-time viable, especially for applications where high-fidelity modeling is required. To bridge this gap, this letter introduces a spatial hash map for rapid GMM submap extraction combined with an approach to determine relevant and redundant data in a point cloud. These contributions increase computational speed by an order of magnitude compared to state-of-the-art incremental GMM-based mapping. In addition, the proposed approach yields a superior tradeoff in map accuracy and size when compared to state-of-the-art mapping methodologies (both GMM- and not GMM-based). Evaluations are conducted using both simulated and real-world data. The software is released open-source to benefit the robotics community.
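To make the spatial hash map idea concrete, here is a minimal Python sketch that buckets 3D points into voxel-indexed cells so the data around a query location can be fetched in constant time. The cell size and bucket contents are illustrative assumptions, not the paper's actual data structure.

```python
import numpy as np
from collections import defaultdict

def build_spatial_hash(points: np.ndarray, cell_size: float = 1.0) -> dict:
    """Map each 3D point index to the integer voxel cell that contains it."""
    cells = np.floor(points / cell_size).astype(np.int64)
    table = defaultdict(list)
    for idx, key in enumerate(map(tuple, cells)):
        table[key].append(idx)
    return table

def query_cell(table: dict, location, cell_size: float = 1.0) -> list:
    """Return indices of points stored in the cell containing `location`."""
    key = tuple(np.floor(np.asarray(location) / cell_size).astype(np.int64))
    return table.get(key, [])

pts = np.random.rand(100000, 3) * 50.0                   # synthetic point cloud
table = build_spatial_hash(pts, cell_size=2.0)
nearby = query_cell(table, (10.0, 12.5, 3.0), cell_size=2.0)
```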

PLVS: A SLAM System with Points, Lines, Volumetric Mapping, and 3D Incremental Segmentation

  • paper_url: http://arxiv.org/abs/2309.10896
  • repo_url: https://github.com/luigifreda/plvs
  • paper_authors: Luigi Freda
  • for: Presents a real-time system that combines sparse SLAM, volumetric mapping, and 3D unsupervised incremental segmentation.
  • methods: The keyframe-based SLAM module extracts and tracks sparse points and line segments as features; volumetric mapping runs in parallel and fuses point clouds back-projected from keyframes into a 3D reconstruction of the explored environment, and a novel depth-aware reprojection error is used to bundle-adjust line segments.
  • results: The PLVS framework, which also integrates an incremental, geometry-based segmentation method for RGB-D cameras, is evaluated qualitatively and quantitatively on publicly available datasets.
    Abstract This document presents PLVS: a real-time system that leverages sparse SLAM, volumetric mapping, and 3D unsupervised incremental segmentation. PLVS stands for Points, Lines, Volumetric mapping, and Segmentation. It supports RGB-D and Stereo cameras, which may be optionally equipped with IMUs. The SLAM module is keyframe-based, and extracts and tracks sparse points and line segments as features. Volumetric mapping runs in parallel with respect to the SLAM front-end and generates a 3D reconstruction of the explored environment by fusing point clouds backprojected from keyframes. Different volumetric mapping methods are supported and integrated in PLVS. We use a novel reprojection error to bundle-adjust line segments. This error exploits available depth information to stabilize the position estimates of line segment endpoints. An incremental and geometric-based segmentation method is implemented and integrated for RGB-D cameras in the PLVS framework. We present qualitative and quantitative evaluations of the PLVS framework on some publicly available datasets. The appendix details the adopted stereo line triangulation method and provides a derivation of the Jacobians we used for line error terms. The software is available as open-source.

GelSight Svelte: A Human Finger-shaped Single-camera Tactile Robot Finger with Large Sensing Coverage and Proprioceptive Sensing

  • paper_url: http://arxiv.org/abs/2309.10885
  • repo_url: None
  • paper_authors: Jialiang Zhao, Edward H. Adelson
  • for: Develops GelSight Svelte, a curved, human finger-sized, single-camera sensor capable of both tactile and proprioceptive sensing over a large area.
  • methods: Uses curved mirrors to achieve the desired shape and sensing coverage; bending and twisting torques applied to the finger appear as deformations of the flexible backbone, which are captured by the camera and estimated with a convolutional neural network.
  • results: Gel deformation experiments at various finger locations and an object holding task with three grasping modes show accurate torque estimation and the ability to exploit different regions of the finger for different tasks. More information is available at https://gelsight-svelte.alanz.info.
    Abstract Camera-based tactile sensing is a low-cost, popular approach to obtain highly detailed contact geometry information. However, most existing camera-based tactile sensors are fingertip sensors, and longer fingers often require extraneous elements to obtain an extended sensing area similar to the full length of a human finger. Moreover, existing methods to estimate proprioceptive information such as total forces and torques applied on the finger from camera-based tactile sensors are not effective when the contact geometry is complex. We introduce GelSight Svelte, a curved, human finger-sized, single-camera tactile sensor that is capable of both tactile and proprioceptive sensing over a large area. GelSight Svelte uses curved mirrors to achieve the desired shape and sensing coverage. Proprioceptive information, such as the total bending and twisting torques applied on the finger, is reflected as deformations on the flexible backbone of GelSight Svelte, which are also captured by the camera. We train a convolutional neural network to estimate the bending and twisting torques from the captured images. We conduct gel deformation experiments at various locations of the finger to evaluate the tactile sensing capability and proprioceptive sensing accuracy. To demonstrate the capability and potential uses of GelSight Svelte, we conduct an object holding task with three different grasping modes that utilize different areas of the finger. More information is available on our website: https://gelsight-svelte.alanz.info

DeepliteRT: Computer Vision at the Edge

  • paper_url: http://arxiv.org/abs/2309.10878
  • repo_url: None
  • paper_authors: Saad Ashfaq, Alexander Hoffman, Saptarshi Mitra, Sudhakar Sah, MohammadHossein AskariHemmat, Ehsan Saboori
  • for: Aims to make deployment of deep learning models for computer vision feasible on edge devices through efficient ultra low-bit quantization.
  • methods: Implements highly optimized ultra low-bit convolution operators for ARM targets and packages them in Deeplite Runtime (DeepliteRT), an end-to-end solution for compiling, tuning and running ultra low-bit models; compiler passes automatically convert a fake-quantized full-precision model into a compact ultra low-bit representation.
  • results: DeepliteRT achieves significant speedups on classification and detection models of up to 2.20x, 2.33x and 2.17x over optimized 32-bit floating-point, 8-bit integer and 2-bit baselines, respectively.
    Abstract The proliferation of edge devices has unlocked unprecedented opportunities for deep learning model deployment in computer vision applications. However, these complex models require considerable power, memory and compute resources that are typically not available on edge platforms. Ultra low-bit quantization presents an attractive solution to this problem by scaling down the model weights and activations from 32-bit to less than 8-bit. We implement highly optimized ultra low-bit convolution operators for ARM-based targets that outperform existing methods by up to 4.34x. Our operator is implemented within Deeplite Runtime (DeepliteRT), an end-to-end solution for the compilation, tuning, and inference of ultra low-bit models on ARM devices. Compiler passes in DeepliteRT automatically convert a fake-quantized model in full precision to a compact ultra low-bit representation, easing the process of quantized model deployment on commodity hardware. We analyze the performance of DeepliteRT on classification and detection models against optimized 32-bit floating-point, 8-bit integer, and 2-bit baselines, achieving significant speedups of up to 2.20x, 2.33x and 2.17x, respectively.
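As background for the fake-quantized models mentioned in the abstract, here is a minimal PyTorch sketch of uniform symmetric fake quantization, which snaps weights to a low-bit grid while keeping them in floating point. The bit width and rounding scheme are generic illustrations, not DeepliteRT's actual compiler pass.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 2) -> torch.Tensor:
    """Round weights to a symmetric num_bits grid but keep float storage."""
    qmax = 2 ** (num_bits - 1) - 1                    # e.g. 1 for signed 2-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = torch.randn(64, 3, 3, 3)                          # a conv weight tensor
w_q = fake_quantize(w, num_bits=2)                    # values now lie on a tiny grid
print(torch.unique(w_q).numel())
```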

On-device Real-time Custom Hand Gesture Recognition

  • paper_url: http://arxiv.org/abs/2309.10858
  • repo_url: None
  • paper_authors: Esha Uboweja, David Tian, Qifei Wang, Yi-Chun Kuo, Joe Zou, Lu Wang, George Sung, Matthias Grundmann
  • for: Lets users and developers quickly build and deploy custom hand gesture recognition systems for new, unseen gestures without requiring machine learning (ML) expertise.
  • methods: Fine-tunes a pre-trained single-hand embedding model on a small number of images per gesture collected with a webcam, and provides low-code and no-code tooling to train and deploy the custom gesture recognition model.
  • results: The end-to-end process of building, testing and deploying a custom on-device, real-time gesture recognizer takes only a few minutes.
    Abstract Most existing hand gesture recognition (HGR) systems are limited to a predefined set of gestures. However, users and developers often want to recognize new, unseen gestures. This is challenging due to the vast diversity of all plausible hand shapes, e.g. it is impossible for developers to include all hand gestures in a predefined list. In this paper, we present a user-friendly framework that lets users easily customize and deploy their own gesture recognition pipeline. Our framework provides a pre-trained single-hand embedding model that can be fine-tuned for custom gesture recognition. Users can perform gestures in front of a webcam to collect a small amount of images per gesture. We also offer a low-code solution to train and deploy the custom gesture recognition model. This makes it easy for users with limited ML expertise to use our framework. We further provide a no-code web front-end for users without any ML expertise. This makes it even easier to build and test the end-to-end pipeline. The resulting custom HGR is then ready to be run on-device for real-time scenarios. This can be done by calling a simple function in our open-sourced model inference API, MediaPipe Tasks. This entire process only takes a few minutes.
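The abstract mentions running the resulting recognizer through the MediaPipe Tasks inference API. The snippet below is a sketch of that usage based on the publicly documented MediaPipe Tasks Python interface; the model asset filename is a placeholder, and the exact option names should be checked against the current documentation.

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Load a (custom) gesture recognizer bundle exported by the training tooling.
base_options = python.BaseOptions(model_asset_path="gesture_recognizer.task")
options = vision.GestureRecognizerOptions(base_options=base_options)
recognizer = vision.GestureRecognizer.create_from_options(options)

# Run recognition on a single image of a hand.
image = mp.Image.create_from_file("hand.jpg")
result = recognizer.recognize(image)
if result.gestures:
    top = result.gestures[0][0]            # best gesture for the first detected hand
    print(top.category_name, top.score)
```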

Assessing the capacity of a denoising diffusion probabilistic model to reproduce spatial context

  • paper_url: http://arxiv.org/abs/2309.10817
  • repo_url: None
  • paper_authors: Rucha Deshpande, Muzaffer Özbey, Hua Li, Mark A. Anastasio, Frank J. Brooks
  • for: Investigates whether denoising diffusion probabilistic models (DDPMs) can reliably learn spatial context relevant to medical imaging.
  • methods: Uses stochastic context models (SCMs) to produce training data, generates image ensembles with DDPMs, and quantitatively assesses via post-hoc image analyses how well the prescribed spatial context is reproduced.
  • results: DDPMs reproduce spatial context with lower error rates than a modern GAN and can generate contextually correct images that are "interpolated" between training samples, which may benefit data augmentation tasks in ways that GANs cannot.
    Abstract Diffusion models have emerged as a popular family of deep generative models (DGMs). In the literature, it has been claimed that one class of diffusion models -- denoising diffusion probabilistic models (DDPMs) -- demonstrate superior image synthesis performance as compared to generative adversarial networks (GANs). To date, these claims have been evaluated using either ensemble-based methods designed for natural images, or conventional measures of image quality such as structural similarity. However, there remains an important need to understand the extent to which DDPMs can reliably learn medical imaging domain-relevant information, which is referred to as `spatial context' in this work. To address this, a systematic assessment of the ability of DDPMs to learn spatial context relevant to medical imaging applications is reported for the first time. A key aspect of the studies is the use of stochastic context models (SCMs) to produce training data. In this way, the ability of the DDPMs to reliably reproduce spatial context can be quantitatively assessed by use of post-hoc image analyses. Error-rates in DDPM-generated ensembles are reported, and compared to those corresponding to a modern GAN. The studies reveal new and important insights regarding the capacity of DDPMs to learn spatial context. Notably, the results demonstrate that DDPMs hold significant capacity for generating contextually correct images that are `interpolated' between training samples, which may benefit data-augmentation tasks in ways that GANs cannot.
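For readers unfamiliar with the sampling procedure underlying DDPMs, the following is a minimal PyTorch sketch of one ancestral reverse step in the standard formulation of Ho et al. (2020). It illustrates the generic mechanism only and says nothing about the paper's stochastic context models or evaluation.

```python
import torch

def ddpm_reverse_step(x_t, t, eps_pred, betas, alphas_cumprod):
    """One ancestral sampling step x_t -> x_{t-1} for a standard DDPM.

    eps_pred is the denoising network's noise estimate for x_t at step t;
    betas and alphas_cumprod are the usual noise schedule tensors."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alphas_cumprod[t]
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)
    if t == 0:
        return mean                               # final step is noise-free
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)

# Toy usage with a linear schedule and a dummy noise prediction.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x = torch.randn(1, 1, 64, 64)
x = ddpm_reverse_step(x, T - 1, torch.zeros_like(x), betas, alphas_cumprod)
```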

PanopticNeRF-360: Panoramic 3D-to-2D Label Transfer in Urban Scenes

  • paper_url: http://arxiv.org/abs/2309.10815
  • repo_url: https://github.com/fuxiao0719/panopticnerf
  • paper_authors: Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Xiaowei Zhou, Andreas Geiger, Yiyi Liao
  • for: Aims to ease the training of perception systems for self-driving cars, where manual 2D labeling is highly labor-intensive, and thereby improve their safety and reliability.
  • methods: Proposes PanopticNeRF-360, which combines coarse 3D annotations with noisy 2D semantic cues to generate consistent panoptic labels and high-quality images from any viewpoint, exploiting the complementarity of 3D and 2D priors to mutually enhance geometry and semantics.
  • results: Experiments on the KITTI-360 dataset show state-of-the-art performance over existing label transfer methods, with high-resolution, multi-view and spatiotemporally consistent renderings of appearance, semantic and instance labels.
    Abstract Training perception systems for self-driving cars requires substantial annotations. However, manual labeling in 2D images is highly labor-intensive. While existing datasets provide rich annotations for pre-recorded sequences, they fall short in labeling rarely encountered viewpoints, potentially hampering the generalization ability for perception models. In this paper, we present PanopticNeRF-360, a novel approach that combines coarse 3D annotations with noisy 2D semantic cues to generate consistent panoptic labels and high-quality images from any viewpoint. Our key insight lies in exploiting the complementarity of 3D and 2D priors to mutually enhance geometry and semantics. Specifically, we propose to leverage noisy semantic and instance labels in both 3D and 2D spaces to guide geometry optimization. Simultaneously, the improved geometry assists in filtering noise present in the 3D and 2D annotations by merging them in 3D space via a learned semantic field. To further enhance appearance, we combine MLP and hash grids to yield hybrid scene features, striking a balance between high-frequency appearance and predominantly contiguous semantics. Our experiments demonstrate PanopticNeRF-360's state-of-the-art performance over existing label transfer methods on the challenging urban scenes of the KITTI-360 dataset. Moreover, PanopticNeRF-360 enables omnidirectional rendering of high-fidelity, multi-view and spatiotemporally consistent appearance, semantic and instance labels. We make our code and data available at https://github.com/fuxiao0719/PanopticNeRF

PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance

  • paper_url: http://arxiv.org/abs/2309.10810
  • repo_url: https://github.com/pq-yang/pgdiff
  • paper_authors: Peiqing Yang, Shangchen Zhou, Qingyi Tao, Chen Change Loy
  • for: Aims to replace traditional task-specific training with pre-trained diffusion models in order to improve restoration performance.
  • methods: Proposes PGDiff, which introduces partial guidance: rather than explicitly defining the degradation process, it models desired properties of high-quality images, such as image structure and color statistics, and applies this guidance during the reverse diffusion process.
  • results: Experiments show that the method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models, and it extends to composite tasks by consolidating guidance from multiple tasks.
    Abstract Exploiting pre-trained diffusion models for restoration has recently become a favored alternative to the traditional task-specific training approach. Previous works have achieved noteworthy success by limiting the solution space using explicit degradation models. However, these methods often fall short when faced with complex degradations as they generally cannot be precisely modeled. In this paper, we propose PGDiff by introducing partial guidance, a fresh perspective that is more adaptable to real-world degradations compared to existing works. Rather than specifically defining the degradation process, our approach models the desired properties, such as image structure and color statistics of high-quality images, and applies this guidance during the reverse diffusion process. These properties are readily available and make no assumptions about the degradation process. When combined with a diffusion prior, this partial guidance can deliver appealing results across a range of restoration tasks. Additionally, PGDiff can be extended to handle composite tasks by consolidating multiple high-quality image properties, achieved by integrating the guidance from respective tasks. Experimental results demonstrate that our method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models.

Multi-Context Dual Hyper-Prior Neural Image Compression

  • paper_url: http://arxiv.org/abs/2309.10799
  • repo_url: None
  • paper_authors: Atefeh Khoshkhahtinat, Ali Zafari, Piyush M. Mehta, Mohammad Akyash, Hossein Kashiani, Nasser M. Nasrabadi
  • for: Proposes a Transformer-based deep neural image compression network that better models long-range dependencies and context than convolution-based transforms.
  • methods: Introduces a Transformer-based nonlinear transform that efficiently captures both local and global information from the input image, plus a novel entropy model with two different hyperpriors to model cross-channel and spatial dependencies of the latent representation, further improved by a global context with a causal attention mechanism.
  • results: Experiments show that the proposed framework outperforms state-of-the-art methods in rate-distortion performance.
    Abstract Transform and entropy models are the two core components in deep image compression neural networks. Most existing learning-based image compression methods utilize convolutional-based transform, which lacks the ability to model long-range dependencies, primarily due to the limited receptive field of the convolution operation. To address this limitation, we propose a Transformer-based nonlinear transform. This transform has the remarkable ability to efficiently capture both local and global information from the input image, leading to a more decorrelated latent representation. In addition, we introduce a novel entropy model that incorporates two different hyperpriors to model cross-channel and spatial dependencies of the latent representation. To further improve the entropy model, we add a global context that leverages distant relationships to predict the current latent more accurately. This global context employs a causal attention mechanism to extract long-range information in a content-dependent manner. Our experiments show that our proposed framework performs better than the state-of-the-art methods in terms of rate-distortion performance.

Multi-spectral Entropy Constrained Neural Compression of Solar Imagery

  • paper_url: http://arxiv.org/abs/2309.10791
  • repo_url: None
  • paper_authors: Ali Zafari, Atefeh Khoshkhahtinat, Piyush M. Mehta, Nasser M. Nasrabadi, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
  • for: Develops a Transformer-based multi-spectral neural image compressor that efficiently captures redundancies both within and across wavelengths of solar imagery.
  • methods: Proposes an inter-window aggregated token multi-head self-attention mechanism to overcome the locality of window-based self-attention, and uses a randomly shifted window attention mechanism to make the Transformer blocks insensitive to translations of the input.
  • results: The method not only outperforms conventional compression algorithms but also decorrelates images across multiple wavelengths better than single-spectral compression.
    Abstract Missions studying the dynamic behaviour of the Sun are defined to capture multi-spectral images of the sun and transmit them to the ground station in a daily basis. To make transmission efficient and feasible, image compression systems need to be exploited. Recently successful end-to-end optimized neural network-based image compression systems have shown great potential to be used in an ad-hoc manner. In this work we have proposed a transformer-based multi-spectral neural image compressor to efficiently capture redundancies both intra/inter-wavelength. To unleash the locality of window-based self attention mechanism, we propose an inter-window aggregated token multi head self attention. Additionally to make the neural compressor autoencoder shift invariant, a randomly shifted window attention mechanism is used which makes the transformer blocks insensitive to translations in their input domain. We demonstrate that the proposed approach not only outperforms the conventional compression algorithms but also it is able to better decorrelates images along the multiple wavelengths compared to single spectral compression.

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

  • paper_url: http://arxiv.org/abs/2309.10787
  • repo_url: https://github.com/roger-tseng/av-superb
  • paper_authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
  • for: Aims to develop systems with human-like perception by exploiting the correlation between auditory and visual information.
  • methods: Proposes the AV-SUPERB benchmark, a general-purpose evaluation platform covering 5 audio-visual tasks in speech and audio processing across 7 datasets, and uses it to evaluate the generalization of 5 recent self-supervised models.
  • results: None of the evaluated models generalizes to all tasks, highlighting the need for further work on universal models; representations can be improved with intermediate-task fine-tuning, with audio event classification on AudioSet serving as a strong intermediate task.
    Abstract Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.

Context-Aware Neural Video Compression on Solar Dynamics Observatory

  • paper_url: http://arxiv.org/abs/2309.10784
  • repo_url: None
  • paper_authors: Atefeh Khoshkhahtinat, Ali Zafari, Piyush M. Mehta, Nasser M. Nasrabadi, Barbara J. Thompson, Michael S. F. Kirk, Daniel da Silva
  • for: This paper aims to improve the compression of solar images collected by NASA’s Solar Dynamics Observatory (SDO) mission.
  • methods: The paper proposes a novel neural Transformer-based video compression approach that leverages a Fused Local-aware Window (FLaWin) Transformer block to efficiently exploit temporal and spatial redundancies in the images, resulting in a high compression ratio.
  • results: The proposed approach outperforms conventional hand-engineered video codecs such as H.264 and H.265 in terms of rate-distortion trade-off, demonstrating the effectiveness of the FLaWin Transformer block in improving compression performance.
    Abstract NASA's Solar Dynamics Observatory (SDO) mission collects large data volumes of the Sun's daily activity. Data compression is crucial for space missions to reduce data storage and video bandwidth requirements by eliminating redundancies in the data. In this paper, we present a novel neural Transformer-based video compression approach specifically designed for the SDO images. Our primary objective is to efficiently exploit the temporal and spatial redundancies inherent in solar images to obtain a high compression ratio. Our proposed architecture benefits from a novel Transformer block called Fused Local-aware Window (FLaWin), which incorporates window-based self-attention modules and an efficient fused local-aware feed-forward (FLaFF) network. This architectural design allows us to simultaneously capture short-range and long-range information while facilitating the extraction of rich and diverse contextual representations. Moreover, this design choice results in reduced computational complexity. Experimental results demonstrate the significant contribution of the FLaWin Transformer block to the compression performance, outperforming conventional hand-engineered video codecs such as H.264 and H.265 in terms of rate-distortion trade-off.

MAGIC-TBR: Multiview Attention Fusion for Transformer-based Bodily Behavior Recognition in Group Settings

  • paper_url: http://arxiv.org/abs/2309.10765
  • repo_url: https://github.com/surbhimadan92/magic-tbr
  • paper_authors: Surbhi Madan, Rishabh Jain, Gulshan Sharma, Ramanathan Subramanian, Abhinav Dhall
  • for: Aims to enhance the understanding of artificial intelligence systems by automatically analysing bodily behavioural language, an important social cue.
  • methods: Proposes MAGIC-TBR, a multiview attention fusion method that combines features extracted from videos with their corresponding Discrete Cosine Transform coefficients via a transformer-based approach.
  • results: Experiments on the BBSI dataset demonstrate the effectiveness of the proposed feature fusion with multiview attention for recognising fine-grained behaviours such as gesturing, grooming or fumbling.
    Abstract Bodily behavioral language is an important social cue, and its automated analysis helps in enhancing the understanding of artificial intelligence systems. Furthermore, behavioral language cues are essential for active engagement in social agent-based user interactions. Despite the progress made in computer vision for tasks like head and body pose estimation, there is still a need to explore the detection of finer behaviors such as gesturing, grooming, or fumbling. This paper proposes a multiview attention fusion method named MAGIC-TBR that combines features extracted from videos and their corresponding Discrete Cosine Transform coefficients via a transformer-based approach. The experiments are conducted on the BBSI dataset and the results demonstrate the effectiveness of the proposed feature fusion with multiview attention. The code is available at: https://github.com/surbhimadan92/MAGIC-TBR
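To illustrate the Discrete Cosine Transform features mentioned above, here is a small SciPy sketch that computes per-frame 2D DCT coefficients and keeps a low-frequency block as a compact descriptor. The truncation size is an assumption, and the fusion with video features via transformers is not shown.

```python
import numpy as np
from scipy.fft import dctn

def frame_dct_features(frames, keep: int = 8) -> np.ndarray:
    """Per-frame 2D DCT, keeping the top-left keep x keep (low-frequency)
    block of coefficients as a flattened feature vector."""
    feats = []
    for frame in frames:                     # frame: (H, W) grayscale array
        coeffs = dctn(frame, norm="ortho")   # type-II DCT along both axes
        feats.append(coeffs[:keep, :keep].ravel())
    return np.stack(feats)

video = np.random.rand(16, 128, 128)         # 16 synthetic grayscale frames
dct_feats = frame_dct_features(video)        # shape (16, 64)
```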

Few-Shot Panoptic Segmentation With Foundation Models

  • paper_url: http://arxiv.org/abs/2309.10726
  • repo_url: https://github.com/robot-learning-freiburg/SPINO
  • paper_authors: Markus Käppeler, Kürsat Petek, Niclas Vödisch, Wolfram Burgard, Abhinav Valada
  • for: Aims to make panoptic segmentation practical with very few labels by exploiting foundation models trained on completely unlabeled images.
  • methods: Combines task-agnostic image features from a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation.
  • results: Trained with only ten annotated images, the method predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method, and achieves results competitive with fully supervised baselines while using less than 0.3% of the ground-truth labels.
    Abstract Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain posing a significant challenge for their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation by presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit being trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems for both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at http://spino.cs.uni-freiburg.de.
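As an illustration of attaching a lightweight head to frozen task-agnostic features, the sketch below loads a DINOv2 backbone through its published torch.hub entry point and adds a 1x1-convolution classifier over the patch-token grid. The head, class count and input normalisation are illustrative assumptions, not SPINO's actual heads.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 ViT-S/14 backbone via its published torch.hub entry point.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

class LinearSegHead(nn.Module):
    """1x1-conv classifier over the patch-token feature grid (illustrative)."""
    def __init__(self, embed_dim: int = 384, num_classes: int = 19):
        super().__init__()
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, patch_tokens, grid_hw):
        b, n, c = patch_tokens.shape                        # (B, N, C)
        h, w = grid_hw
        fmap = patch_tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.classifier(fmap)                        # (B, classes, h, w)

img = torch.randn(1, 3, 224, 224)                           # normalised RGB assumed
with torch.no_grad():
    patch_tokens = backbone.forward_features(img)["x_norm_patchtokens"]
logits = LinearSegHead()(patch_tokens, grid_hw=(16, 16))    # 224 / 14 = 16
```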

Reconstruct-and-Generate Diffusion Model for Detail-Preserving Image Denoising

  • paper_url: http://arxiv.org/abs/2309.10714
  • repo_url: None
  • paper_authors: Yujin Wang, Lingen Li, Tianfan Xue, Jinwei Gu
  • for: Image denoising that suppresses noise while preserving high-frequency detail and overall visual quality.
  • methods: Proposes the Reconstruct-and-Generate Diffusion Model (RnG): a reconstructive denoising network recovers most of the underlying clean signal to preserve fidelity, and a diffusion algorithm then generates the residual high-frequency details; a two-stage training scheme and an adaptive step controller regulate how many inverse diffusion steps, and hence how much detail, are applied per patch.
  • results: Extensive experiments on synthetic and real-world denoising datasets demonstrate the superiority of the proposed approach.
    Abstract Image denoising is a fundamental and challenging task in the field of computer vision. Most supervised denoising methods learn to reconstruct clean images from noisy inputs, which have intrinsic spectral bias and tend to produce over-smoothed and blurry images. Recently, researchers have explored diffusion models to generate high-frequency details in image restoration tasks, but these models do not guarantee that the generated texture aligns with real images, leading to undesirable artifacts. To address the trade-off between visual appeal and fidelity of high-frequency details in denoising tasks, we propose a novel approach called the Reconstruct-and-Generate Diffusion Model (RnG). Our method leverages a reconstructive denoising network to recover the majority of the underlying clean signal, which serves as the initial estimation for subsequent steps to maintain fidelity. Additionally, it employs a diffusion algorithm to generate residual high-frequency details, thereby enhancing visual quality. We further introduce a two-stage training scheme to ensure effective collaboration between the reconstructive and generative modules of RnG. To reduce undesirable texture introduced by the diffusion model, we also propose an adaptive step controller that regulates the number of inverse steps applied by the diffusion model, allowing control over the level of high-frequency details added to each patch as well as saving the inference computational cost. Through our proposed RnG, we achieve a better balance between perception and distortion. We conducted extensive experiments on both synthetic and real denoising datasets, validating the superiority of the proposed approach.

Interpret Vision Transformers as ConvNets with Dynamic Convolutions

  • paper_url: http://arxiv.org/abs/2309.10713
  • repo_url: None
  • paper_authors: Chong Zhou, Chen Change Loy, Bo Dai
  • for: Revisits the debate over the superiority of vision Transformers versus ConvNets by interpreting vision Transformers as ConvNets with dynamic convolutions, allowing the design choices of both architectures to be compared side by side within a unified framework.
  • methods: Uses this interpretation in two concrete studies: first, examining the role of softmax as the activation function in vision Transformers and replacing it with common ConvNet modules such as ReLU and Layer Normalization; second, following the design of depth-wise convolution to create a corresponding depth-wise vision Transformer.
  • results: Replacing softmax yields a faster convergence rate and better performance, and the depth-wise vision Transformer is more efficient with comparable performance, suggesting the unified interpretation can guide the design of both architectures.
    Abstract There has been a debate about the superiority between vision Transformers and ConvNets, serving as the backbone of computer vision models. Although they are usually considered as two completely different architectures, in this paper, we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate such potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find it can be replaced by commonly used ConvNets modules, such as ReLU and Layer Normalization, which results in a faster convergence rate and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to the given examples and we hope it can inspire the community and give rise to more advanced network architectures.
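The first case study replaces softmax in attention with common ConvNet modules. Below is a minimal PyTorch sketch of single-head self-attention where softmax is swapped for ReLU followed by layer normalization over the key dimension; it illustrates the substitution in spirit and is not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLUNormAttention(nn.Module):
    """Single-head self-attention with softmax replaced by ReLU + LayerNorm."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                    # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale        # (B, N, N)
        attn = F.relu(scores)                                # softmax -> ReLU
        attn = F.layer_norm(attn, attn.shape[-1:])           # normalise each row
        return self.proj(attn @ v)

tokens = torch.randn(2, 196, 384)
out = ReLUNormAttention(384)(tokens)                         # (2, 196, 384)
```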

Latent Space Energy-based Model for Fine-grained Open Set Recognition

  • paper_url: http://arxiv.org/abs/2309.10711
  • repo_url: None
  • paper_authors: Wentao Bao, Qi Yu, Yu Kong
  • for: Fine-grained open-set recognition: recognizing images from classes with subtle appearance differences while rejecting images of unknown classes.
  • methods: Uses an energy-based model (EBM) as a prior over a low-dimensional latent space to combine generative and discriminative modeling, avoiding density estimation in high-dimensional space; adds an attribute-aware information bottleneck (AIB), a residual attribute feature aggregation (RAFA) module and an uncertainty-based virtual outlier synthesis (UVOS) module to improve the expressivity, granularity and density of samples in fine-grained classes.
  • results: The method is flexible enough to exploit recent vision transformers for powerful visual classification and generation, and is validated on fine-grained and general visual classification datasets while preserving the ability to generate photo-realistic, high-resolution fake images.
    Abstract Fine-grained open-set recognition (FineOSR) aims to recognize images belonging to classes with subtle appearance differences while rejecting images of unknown classes. A recent trend in OSR shows the benefit of generative models to discriminative unknown detection. As a type of generative model, energy-based models (EBM) are the potential for hybrid modeling of generative and discriminative tasks. However, most existing EBMs suffer from density estimation in high-dimensional space, which is critical to recognizing images from fine-grained classes. In this paper, we explore the low-dimensional latent space with energy-based prior distribution for OSR in a fine-grained visual world. Specifically, based on the latent space EBM, we propose an attribute-aware information bottleneck (AIB), a residual attribute feature aggregation (RAFA) module, and an uncertainty-based virtual outlier synthesis (UVOS) module to improve the expressivity, granularity, and density of the samples in fine-grained classes, respectively. Our method is flexible to take advantage of recent vision transformers for powerful visual classification and generation. The method is validated on both fine-grained and general visual classification datasets while preserving the capability of generating photo-realistic fake images with high resolution.

ReShader: View-Dependent Highlights for Single Image View-Synthesis

  • paper_url: http://arxiv.org/abs/2309.10689
  • repo_url: https://github.com/avinashpaliwal/ReShader
  • paper_authors: Avinash Paliwal, Brandon Nguyen, Andrii Tsarov, Nima Khademi Kalantari
  • for: Improves the realism and accuracy of novel view synthesis from a single image, in particular the handling of view-dependent highlights.
  • methods: Splits view synthesis into two independent tasks: pixel reshading, in which a neural network adjusts the shading of the input image for the novel camera, and pixel relocation, in which an existing view synthesis method moves the reshaded pixels to their new positions; the reshading network is trained on a large set of synthetic input-reshaded pairs.
  • results: Produces plausible novel view images with realistic moving highlights on a variety of real-world scenes.
    Abstract In recent years, novel view synthesis from a single image has seen significant progress thanks to the rapid advancements in 3D scene representation and image inpainting techniques. While the current approaches are able to synthesize geometrically consistent novel views, they often do not handle the view-dependent effects properly. Specifically, the highlights in their synthesized images usually appear to be glued to the surfaces, making the novel views unrealistic. To address this major problem, we make a key observation that the process of synthesizing novel views requires changing the shading of the pixels based on the novel camera, and moving them to appropriate locations. Therefore, we propose to split the view synthesis process into two independent tasks of pixel reshading and relocation. During the reshading process, we take the single image as the input and adjust its shading based on the novel camera. This reshaded image is then used as the input to an existing view synthesis method to relocate the pixels and produce the final novel view image. We propose to use a neural network to perform reshading and generate a large set of synthetic input-reshaded pairs to train our network. We demonstrate that our approach produces plausible novel view images with realistic moving highlights on a variety of real world scenes.

CMRxRecon: An open cardiac MRI dataset for the competition of accelerated image reconstruction

  • paper_url: http://arxiv.org/abs/2309.10836
  • repo_url: None
  • paper_authors: Chengyan Wang, Jun Lyu, Shuo Wang, Chen Qin, Kunyuan Guo, Xinyu Zhang, Xiaotong Yu, Yan Li, Fanwen Wang, Jianhua Jin, Zhang Shi, Ziqiang Xu, Yapeng Tian, Sha Hua, Zhensen Chen, Meng Liu, Mengting Sun, Xutong Kuang, Kang Wang, Haoran Wang, Hao Li, Yinghua Chu, Guang Yang, Wenjia Bai, Xiahai Zhuang, He Wang, Jing Qin, Xiaobo Qu
  • for: This paper aims to facilitate the advancement of state-of-the-art cardiac magnetic resonance imaging (CMR) image reconstruction.
  • methods: The paper provides a large dataset of multi-contrast, multi-view, multi-slice, and multi-coil CMR imaging data from 300 subjects, which can be used to train and evaluate deep learning-based image reconstruction algorithms.
  • results: The dataset includes manual segmentations of the myocardium and chambers of all the subjects, and scripts of state-of-the-art reconstruction algorithms are provided as a point of reference. The dataset is freely accessible to the research community and can be accessed at https://www.synapse.org/#!Synapse:syn51471091/wiki/.
    Abstract Cardiac magnetic resonance imaging (CMR) has emerged as a valuable diagnostic tool for cardiac diseases. However, a limitation of CMR is its slow imaging speed, which causes patient discomfort and introduces artifacts in the images. There has been growing interest in deep learning-based CMR imaging algorithms that can reconstruct high-quality images from highly under-sampled k-space data. However, the development of deep learning methods requires large training datasets, which have not been publicly available for CMR. To address this gap, we released a dataset that includes multi-contrast, multi-view, multi-slice and multi-coil CMR imaging data from 300 subjects. Imaging studies include cardiac cine and mapping sequences. Manual segmentations of the myocardium and chambers of all the subjects are also provided within the dataset. Scripts of state-of-the-art reconstruction algorithms were also provided as a point of reference. Our aim is to facilitate the advancement of state-of-the-art CMR image reconstruction by introducing standardized evaluation criteria and making the dataset freely accessible to the research community. Researchers can access the dataset at https://www.synapse.org/#!Synapse:syn51471091/wiki/.

Locally Stylized Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.10684
  • repo_url: None
  • paper_authors: Hong-Wing Pang, Binh-Son Hua, Sai-Kit Yeung
  • for: Applying a reference image style to 3D scenes represented as neural radiance fields (NeRF).
  • methods: Performs local style transfer: a hash-grid encoding learns embeddings of the appearance and geometry components, stylization is achieved by optimizing the appearance branch while keeping the geometry branch fixed, and a new loss uses a segmentation network with bipartite matching to establish region correspondences between the style image and content images obtained from volume rendering.
  • results: Yields plausible stylization results with novel view synthesis, with flexible controllability obtained by manipulating and customizing the region correspondences.
    Abstract In recent years, there has been increasing interest in applying stylization on 3D scenes from a reference style image, in particular onto neural radiance fields (NeRF). While performing stylization directly on NeRF guarantees appearance consistency over arbitrary novel views, it is a challenging problem to guide the transfer of patterns from the style image onto different parts of the NeRF scene. In this work, we propose a stylization framework for NeRF based on local style transfer. In particular, we use a hash-grid encoding to learn the embedding of the appearance and geometry components, and show that the mapping defined by the hash table allows us to control the stylization to a certain extent. Stylization is then achieved by optimizing the appearance branch while keeping the geometry branch fixed. To support local style transfer, we propose a new loss function that utilizes a segmentation network and bipartite matching to establish region correspondences between the style image and the content images obtained from volume rendering. Our experiments show that our method yields plausible stylization results with novel view synthesis while having flexible controllability via manipulating and customizing the region correspondences.

Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping

  • paper_url: http://arxiv.org/abs/2309.10667
  • repo_url: None
  • paper_authors: Subash Khanal, Srikumar Sastry, Aayush Dhakal, Nathan Jacobs
  • for: Soundscape mapping: predicting the most probable sounds that could be perceived at a particular geographic location.
  • methods: Uses contrastive pre-training with recent state-of-the-art models to encode geotagged audio, textual descriptions of the audio, and overhead images of the capture locations into a shared embedding space for the three modalities, from which soundscape maps can be constructed for any geographic region from textual or audio queries.
  • results: On the SoundingEarth dataset, the approach significantly outperforms the existing SOTA, improving image-to-audio Recall@100 from 0.256 to 0.450.
    Abstract We focus on the task of soundscape mapping, which involves predicting the most probable sounds that could be perceived at a particular geographic location. We utilise recent state-of-the-art models to encode geotagged audio, a textual description of the audio, and an overhead image of its capture location using contrastive pre-training. The end result is a shared embedding space for the three modalities, which enables the construction of soundscape maps for any geographic region from textual or audio queries. Using the SoundingEarth dataset, we find that our approach significantly outperforms the existing SOTA, with an improvement of image-to-audio Recall@100 from 0.256 to 0.450. Our code is available at https://github.com/mvrl/geoclap.
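Contrastive pre-training of paired modalities is typically driven by an InfoNCE-style objective. The snippet below is a minimal PyTorch sketch of that loss for one pair of modalities (e.g. overhead-image and audio embeddings); the temperature value is an assumption, and it does not reproduce the paper's specific tri-modal training recipe.

```python
import torch
import torch.nn.functional as F

def infonce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings:
    matching pairs sit on the diagonal of the similarity matrix."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

image_emb = torch.randn(32, 256)       # hypothetical overhead-image embeddings
audio_emb = torch.randn(32, 256)       # hypothetical paired audio embeddings
loss = infonce(image_emb, audio_emb)
```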

Analysing race and sex bias in brain age prediction

  • paper_url: http://arxiv.org/abs/2309.10835
  • repo_url: None
  • paper_authors: Carolina Piçarra, Ben Glocker
  • for: Analyses whether brain age prediction models trained on MRI are biased with respect to race and biological sex.
  • methods: Trains a ResNet-34 on 1,215 T1-weighted MRI scans from Cam-CAN and IXI and tests it on UK Biobank (n=42,786) split into six racial and biological-sex subgroups; compares absolute prediction errors across subgroups with a Kruskal-Wallis test followed by two post-hoc Conover-Iman tests, and inspects the generated features with PCA and two-sample Kolmogorov-Smirnov tests for distribution shifts.
  • results: Finds statistically significant differences in predictive performance between Black and White, Black and Asian, and male and female subjects, and statistically significant differences in feature distributions in seven of twelve pairwise comparisons, suggesting brain age prediction models may be biased.
    Abstract Brain age prediction from MRI has become a popular imaging biomarker associated with a wide range of neuropathologies. The datasets used for training, however, are often skewed and imbalanced regarding demographics, potentially making brain age prediction models susceptible to bias. We analyse the commonly used ResNet-34 model by conducting a comprehensive subgroup performance analysis and feature inspection. The model is trained on 1,215 T1-weighted MRI scans from Cam-CAN and IXI, and tested on UK Biobank (n=42,786), split into six racial and biological sex subgroups. With the objective of comparing the performance between subgroups, measured by the absolute prediction error, we use a Kruskal-Wallis test followed by two post-hoc Conover-Iman tests to inspect bias across race and biological sex. To examine biases in the generated features, we use PCA for dimensionality reduction and employ two-sample Kolmogorov-Smirnov tests to identify distribution shifts among subgroups. Our results reveal statistically significant differences in predictive performance between Black and White, Black and Asian, and male and female subjects. Seven out of twelve pairwise comparisons show statistically significant differences in the feature distributions. Our findings call for further analysis of brain age prediction models.
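For illustration, the snippet below runs the omnibus Kruskal-Wallis test and a pairwise two-sample Kolmogorov-Smirnov test on synthetic subgroup errors using SciPy. The post-hoc Conover-Iman procedure used in the paper is available in packages such as scikit-posthocs but is not shown, and all data here are made up.

```python
import numpy as np
from scipy.stats import kruskal, ks_2samp

# Hypothetical absolute prediction errors for three subgroups.
rng = np.random.default_rng(0)
errors = {g: rng.gamma(2.0, 2.0, size=500) for g in ["group_a", "group_b", "group_c"]}

# Omnibus test for any difference in error distributions across subgroups.
h_stat, p_value = kruskal(*errors.values())
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_value:.3g}")

# Pairwise two-sample Kolmogorov-Smirnov test, e.g. on a projected feature.
stat, p = ks_2samp(errors["group_a"], errors["group_b"])
print(f"KS statistic={stat:.3f}, p={p:.3g}")
```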
    摘要 基于磁共振成像(MRI)的脑年龄预测已成为与多种神经病变相关的常用影像生物标志物。然而,用于训练的数据集在人口统计特征上往往偏斜且不均衡,可能使脑年龄预测模型产生偏见。我们对常用的 ResNet-34 模型进行了全面的子群体性能分析和特征检查。该模型在 Cam-CAN 和 IXI 的 1,215 张 T1 加权 MRI 上训练,并在 UK Biobank(n=42,786)上测试,数据被划分为六个种族与生物性别子群体。为比较各子群体的绝对预测误差,我们使用 Kruskal-Wallis 检验,并辅以两次 Conover-Iman 事后检验,以考察种族和生物性别维度上的偏见。为检查生成特征中的偏见,我们使用 PCA 进行降维,并采用双样本 Kolmogorov-Smirnov 检验来识别子群体之间的分布偏移。结果显示,黑人与白人、黑人与亚裔、男性与女性受试者之间的预测性能存在统计显著差异;在十二组两两比较中,有七组的特征分布呈现统计显著差异。这些发现表明需要对脑年龄预测模型进行进一步分析。
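The subgroup analysis described above can be reproduced in outline with standard SciPy/scikit-learn calls; the sketch below uses synthetic errors and features, and omits the Conover-Iman post-hoc test (available in the scikit-posthocs package) for brevity.

```python
import numpy as np
from scipy.stats import kruskal, ks_2samp
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# hypothetical absolute prediction errors for three demographic subgroups
errors = {"group_A": rng.gamma(2.0, 2.0, 500),
          "group_B": rng.gamma(2.2, 2.0, 400),
          "group_C": rng.gamma(2.0, 2.5, 300)}

# omnibus test: do the error distributions differ across subgroups?
h_stat, p_value = kruskal(*errors.values())
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_value:.4f}")

# feature-space check: PCA, then pairwise two-sample KS tests on the first component
features = {g: rng.normal(loc=i * 0.1, size=(len(e), 64))
            for i, (g, e) in enumerate(errors.items())}
pca = PCA(n_components=2).fit(np.vstack(list(features.values())))
proj = {g: pca.transform(x)[:, 0] for g, x in features.items()}
groups = list(proj)
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        stat, p = ks_2samp(proj[groups[i]], proj[groups[j]])
        print(f"KS {groups[i]} vs {groups[j]}: D={stat:.3f}, p={p:.4f}")
```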

Multi-Stain Self-Attention Graph Multiple Instance Learning Pipeline for Histopathology Whole Slide Images

  • paper_url: http://arxiv.org/abs/2309.10650
  • repo_url: https://github.com/amayags/mustang
  • paper_authors: Amaya Gallagher-Syed, Luca Rossi, Felice Rivellese, Costantino Pitzalis, Myles Lewis, Michael Barnes, Gregory Slabaugh
  • for: 本研究旨在应对十亿像素级全切片图像(WSIs)中弱监督计算机视觉任务的挑战,以服务于患者诊断与分层。
  • methods: 我们提出了一种基于自注意力的多例学习框架(MUSTANG),用于解决不具有批处理级别标注的多个WSIs的分类任务。我们使用了一种快速计算的k-Nearest Neighbour Graph,以实现快速的自注意力计算。
  • results: 我们的方法在实验中取得了当今最佳的F1分数/AUC分数0.89/0.92,超过了广泛使用的CLAM模型。我们的方法可以轻松地适应不同的临床数据集,只需要patient级别的标注,并可以接受不同的WSIs集合,Graph可以是不同的大小和结构。
    Abstract Whole Slide Images (WSIs) present a challenging computer vision task due to their gigapixel size and presence of numerous artefacts. Yet they are a valuable resource for patient diagnosis and stratification, often representing the gold standard for diagnostic tasks. Real-world clinical datasets tend to come as sets of heterogeneous WSIs with labels present at the patient-level, with poor to no annotations. Weakly supervised attention-based multiple instance learning approaches have been developed in recent years to address these challenges, but can fail to resolve both long and short-range dependencies. Here we propose an end-to-end multi-stain self-attention graph (MUSTANG) multiple instance learning pipeline, which is designed to solve a weakly-supervised gigapixel multi-image classification task, where the label is assigned at the patient-level, but no slide-level labels or region annotations are available. The pipeline uses a self-attention based approach by restricting the operations to a highly sparse k-Nearest Neighbour Graph of embedded WSI patches based on the Euclidean distance. We show this approach achieves a state-of-the-art F1-score/AUC of 0.89/0.92, outperforming the widely used CLAM model. Our approach is highly modular and can easily be modified to suit different clinical datasets, as it only requires a patient-level label without annotations and accepts WSI sets of different sizes, as the graphs can be of varying sizes and structures. The source code can be found at https://github.com/AmayaGS/MUSTANG.
    摘要 全切片图像(WSIs)因其十亿像素级的尺寸和大量伪影,对计算机视觉任务构成挑战。然而,它们对患者诊断和分层非常有价值,通常被视为诊断任务的金标准。真实的临床数据集往往由异质的 WSIs 组成,标签只存在于患者级别,几乎没有任何标注。近年来发展的弱监督、基于注意力的多示例学习方法试图应对这些挑战,但可能无法同时建模长程和短程依赖关系。我们提出一个端到端的多染色自注意力图(MUSTANG)多示例学习流程,用于解决弱监督的十亿像素级多图像分类任务:标签在患者级别给出,而没有切片级标签或区域标注。该流程采用基于自注意力的方法,将运算限制在由嵌入的 WSI 图块按欧氏距离构建的高度稀疏 k 近邻图上。实验显示,该方法取得了最先进的 F1 分数/AUC(0.89/0.92),超过了广泛使用的 CLAM 模型。我们的方法高度模块化,只需患者级标签而无需区域标注,并且由于图的大小和结构可以变化,能够接受不同规模的 WSI 集合,易于适配不同的临床数据集。源代码见 https://github.com/AmayaGS/MUSTANG。
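A rough sketch of the core graph-attention ingredient, assuming patch embeddings are already extracted: a sparse k-nearest-neighbour graph built with scikit-learn is used as an attention mask so that self-attention only operates along graph edges. The single-head attention here is a simplification of the actual pipeline.

```python
import numpy as np
import torch
from sklearn.neighbors import kneighbors_graph

def knn_attention_mask(patch_embeddings: np.ndarray, k: int = 8) -> torch.Tensor:
    """Boolean (N, N) mask allowing attention only between k-NN patches (plus self)."""
    adj = kneighbors_graph(patch_embeddings, n_neighbors=k, metric="euclidean",
                           include_self=True).toarray().astype(bool)
    adj = adj | adj.T                      # symmetrise so edges go both ways
    return torch.from_numpy(adj)

def masked_self_attention(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention over patch features x (N, D), restricted to the graph."""
    d = x.size(-1)
    scores = x @ x.t() / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ x

emb = np.random.randn(200, 128).astype(np.float32)   # embeddings of 200 WSI patches
out = masked_self_attention(torch.from_numpy(emb), knn_attention_mask(emb, k=8))
```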

Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D Segmentation

  • paper_url: http://arxiv.org/abs/2309.10649
  • repo_url: None
  • paper_authors: Jingyu Zhang, Huitong Yang, Daijie Wu, Xuesong Li, Xinge Zhu, Yuexin Ma
  • for: 提高3D点云Semantic Segmentation的性能,不需要大量的标注数据。
  • methods: 基于图像和点云之间的关系,设计有效的特征对齐策略,实现交叉模式和交叉领域的适应。
  • results: 无需任何 3D 标注数据,我们的方法在 SemanticKITTI 上达到了最先进的表现,优于现有的无监督和弱监督基线。
    Abstract Current state-of-the-art point cloud-based perception methods usually rely on large-scale labeled data, which requires expensive manual annotations. A natural option is to explore the unsupervised methodology for 3D perception tasks. However, such methods often face substantial performance-drop difficulties. Fortunately, we found that there exist amounts of image-based datasets and an alternative can be proposed, i.e., transferring the knowledge in the 2D images to 3D point clouds. Specifically, we propose a novel approach for the challenging cross-modal and cross-domain adaptation task by fully exploring the relationship between images and point clouds and designing effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art performance for 3D point cloud semantic segmentation on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to existing unsupervised and weakly-supervised baselines.
    摘要 当前最先进的基于点云的感知方法通常依赖大规模标注数据,而这需要昂贵的人工标注。一个自然的选择是探索面向3D感知任务的无监督方法,但这类方法往往面临明显的性能下降。幸运的是,现有大量基于图像的数据集,因此可以提出一种替代方案:将2D图像中的知识迁移到3D点云。具体而言,我们针对具有挑战性的跨模态、跨领域适应任务,充分挖掘图像与点云之间的关系,并设计了有效的特征对齐策略。在不使用任何3D标注的情况下,我们的方法借助 KITTI360 和 GTA5 的知识,在 SemanticKITTI 的3D点云语义分割上取得了最先进的性能,优于现有的无监督和弱监督基线。

Self-Supervised Super-Resolution Approach for Isotropic Reconstruction of 3D Electron Microscopy Images from Anisotropic Acquisition

  • paper_url: http://arxiv.org/abs/2309.10646
  • repo_url: None
  • paper_authors: Mohammad Khateri, Morteza Ghahremani, Alejandra Sierra, Jussi Tohka
  • for: 用于从各向异性采集的三维电子显微镜(3DEM)图像中重建各向同性图像。
  • methods: 使用基于深度学习(DL)的自监督超分辨率方法,采用融合视觉 Transformer(ViT)块的 U 型架构,学习局部与全局的多尺度图像依赖关系。
  • results: 成功地从各向异性采集中重建出各向同性的3DEM图像。
    Abstract Three-dimensional electron microscopy (3DEM) is an essential technique to investigate volumetric tissue ultra-structure. Due to technical limitations and high imaging costs, samples are often imaged anisotropically, where resolution in the axial direction ($z$) is lower than in the lateral directions $(x,y)$. This anisotropy 3DEM can hamper subsequent analysis and visualization tasks. To overcome this limitation, we propose a novel deep-learning (DL)-based self-supervised super-resolution approach that computationally reconstructs isotropic 3DEM from the anisotropic acquisition. The proposed DL-based framework is built upon the U-shape architecture incorporating vision-transformer (ViT) blocks, enabling high-capability learning of local and global multi-scale image dependencies. To train the tailored network, we employ a self-supervised approach. Specifically, we generate pairs of anisotropic and isotropic training datasets from the given anisotropic 3DEM data. By feeding the given anisotropic 3DEM dataset in the trained network through our proposed framework, the isotropic 3DEM is obtained. Importantly, this isotropic reconstruction approach relies solely on the given anisotropic 3DEM dataset and does not require pairs of co-registered anisotropic and isotropic 3DEM training datasets. To evaluate the effectiveness of the proposed method, we conducted experiments using three 3DEM datasets acquired from brain. The experimental results demonstrated that our proposed framework could successfully reconstruct isotropic 3DEM from the anisotropic acquisition.
    摘要 三维电子显微镜(3DEM)是研究组织三维超微结构的重要技术。由于技术限制和高昂的成像成本,样本通常以各向异性方式成像,即轴向(z)分辨率低于侧向(x, y)分辨率。这种各向异性会妨碍后续的分析与可视化任务。为了解决这一限制,我们提出了一种基于深度学习(DL)的自监督超分辨率方法,从各向异性采集中计算重建各向同性的3DEM。该框架基于 U 型架构并融合视觉 Transformer(ViT)块,能够高效学习局部与全局的多尺度图像依赖关系。为训练该网络,我们采用自监督方式:从给定的各向异性3DEM数据中生成各向异性/各向同性训练样本对;将给定的各向异性3DEM输入训练好的网络后,即可得到各向同性的3DEM。重要的是,这种重建方法只依赖给定的各向异性3DEM数据,不需要配准好的各向异性与各向同性3DEM训练数据对。我们在三个来自脑组织的3DEM数据集上进行了实验,结果表明所提框架能够成功地从各向异性采集中重建各向同性的3DEM。
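One common way to obtain self-supervised training pairs in this anisotropic setting is to degrade the high-resolution lateral slices along one axis and use the originals as targets; the sketch below illustrates that general idea and is an assumption about the recipe, not the paper's exact data pipeline.

```python
import numpy as np
from scipy.ndimage import zoom

def make_training_pairs(volume: np.ndarray, aniso_factor: int):
    """
    Create (low-res input, high-res target) pairs from an anisotropic EM volume
    shaped (z, y, x), where z is the coarse axis.  High-resolution x-y slices serve
    as targets; downsampling them along one lateral axis mimics the acquisition
    anisotropy.
    """
    pairs = []
    for xy_slice in volume:                                         # each (y, x) slice
        low = zoom(xy_slice, (1.0 / aniso_factor, 1.0), order=1)    # undersample along y
        low_up = zoom(low, (aniso_factor, 1.0), order=1)            # naive upsampled input
        low_up = low_up[: xy_slice.shape[0], :]                     # guard against rounding
        pairs.append((low_up.astype(np.float32), xy_slice.astype(np.float32)))
    return pairs

vol = np.random.rand(16, 256, 256)   # stand-in for an anisotropic 3DEM stack
pairs = make_training_pairs(vol, aniso_factor=4)
```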

KFC: Kinship Verification with Fair Contrastive Loss and Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2309.10641
  • repo_url: https://github.com/garynlfd/kfc
  • paper_authors: Jia Luo Peng, Keng Wei Chang, Shang-Hong Lai
  • for: 本文是为计算机视觉中的身份验证任务所写的,目的是实现更好的性能和减少种族偏见。
  • methods: 本文提议一种多任务学习模型结构,具有注意力模块和公平意识的对比损失函数,以提高准确率和减少偏见。
  • results: 提议的方法(KFC)在标准差和准确率两个指标上具有优秀的性能,并成功减少种族偏见,经过了广泛的实验评估。
    Abstract Kinship verification is an emerging task in computer vision with multiple potential applications. However, there's no large enough kinship dataset to train a representative and robust model, which is a limitation for achieving better performance. Moreover, face verification is known to exhibit bias, which has not been dealt with by previous kinship verification works and sometimes even results in serious issues. So we first combine existing kinship datasets and label each identity with the correct race in order to take race information into consideration and provide a larger and complete dataset, called KinRace dataset. Secondly, we propose a multi-task learning model structure with attention module to enhance accuracy, which surpasses state-of-the-art performance. Lastly, our fairness-aware contrastive loss function with adversarial learning greatly mitigates racial bias. We introduce a debias term into traditional contrastive loss and implement gradient reverse in race classification task, which is an innovative idea to mix two fairness methods to alleviate bias. Exhaustive experimental evaluation demonstrates the effectiveness and superior performance of the proposed KFC in both standard deviation and accuracy at the same time.
    摘要 亲属关系验证是计算机视觉领域的一个新兴任务,具有多种潜在应用。然而,目前没有足够大的亲属关系数据集来训练具有代表性且鲁棒的模型,这限制了性能的进一步提升。更重要的是,人脸验证已被证明存在偏见,此前的亲属关系验证工作并未处理这一问题,有时甚至会导致严重后果。因此,我们首先合并现有的亲属关系数据集,并为每个身份标注正确的种族信息,以便将种族信息纳入考虑,得到一个更大、更完整的数据集,称为 KinRace 数据集。其次,我们提出了一种带注意力模块的多任务学习模型结构以提升准确率,其性能超过了现有最优方法。最后,我们提出的具备公平意识的对比损失函数结合对抗学习,大幅缓解了种族偏见:我们在传统对比损失中引入去偏项,并在种族分类任务中实现梯度反转,这种将两类公平方法相结合以缓解偏见的思路是一个创新点。大量实验评估表明,所提出的 KFC 能同时在标准差和准确率两方面取得优异表现。
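The gradient-reversal part of the fairness-aware objective can be illustrated with a small PyTorch sketch; `RaceAdversary` and its dimensions are hypothetical stand-ins, and the debias term of the contrastive loss itself is omitted here.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class RaceAdversary(nn.Module):
    """Hypothetical adversarial head: it tries to predict race from the shared features,
    while the reversed gradient pushes the backbone to remove that information."""
    def __init__(self, feat_dim=512, n_races=4):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_races)

    def forward(self, features, lambd=1.0):
        return self.classifier(grad_reverse(features, lambd))

features = torch.randn(16, 512, requires_grad=True)
logits = RaceAdversary()(features)   # race logits; the backbone gradient is reversed
```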

Sparser Random Networks Exist: Enforcing Communication-Efficient Federated Learning via Regularization

  • paper_url: http://arxiv.org/abs/2309.10834
  • repo_url: None
  • paper_authors: Mohamad Mestoukirdi, Omid Esrafilian, David Gesbert, Qianrui Li, Nicolas Gresset
  • for: 本文提出了一种新的方法,用于提高随机联合学习中的通信效率,特别是在训练过参数化的随机网络时。
  • methods: 在本文中,我们优化的是一个二值掩码而非模型权重本身,模型权重保持固定。该掩码刻画了一个稀疏子网络,其泛化能力可与一个更小的目标网络相当。与传统联邦学习交换浮点权重不同,我们在通信时只交换稀疏的二值掩码,从而将通信成本降低到每个参数至多 1 比特。
  • results: 实验表明,使用本文提出的方法可以将通信和存储开销最多降低五个数量级,且在某些情况下验证精度几乎不受影响。
    Abstract This work presents a new method for enhancing communication efficiency in stochastic Federated Learning that trains over-parameterized random networks. In this setting, a binary mask is optimized instead of the model weights, which are kept fixed. The mask characterizes a sparse sub-network that is able to generalize as good as a smaller target network. Importantly, sparse binary masks are exchanged rather than the floating point weights in traditional federated learning, reducing communication cost to at most 1 bit per parameter. We show that previous state of the art stochastic methods fail to find the sparse networks that can reduce the communication and storage overhead using consistent loss objectives. To address this, we propose adding a regularization term to local objectives that encourages sparser solutions by eliminating redundant features across sub-networks. Extensive experiments demonstrate significant improvements in communication and memory efficiency of up to five magnitudes compared to the literature, with minimal performance degradation in validation accuracy in some instances.
    摘要 这项工作提出了一种在训练过参数化随机网络的随机联邦学习中提升通信效率的新方法。在这一设定下,优化的是一个二值掩码而非模型权重,权重保持固定;该掩码刻画了一个稀疏子网络,其泛化能力可与一个更小的目标网络相当。由于传输的是稀疏的二值掩码而不是传统联邦学习中的浮点权重,通信成本被降低到每个参数至多 1 比特。以往最先进的随机方法在使用一致的损失目标时,无法找到能够降低通信与存储开销的稀疏网络。为解决这一问题,我们在本地目标函数中加入了一个正则化项,通过消除各子网络之间的冗余特征来鼓励更稀疏的解。大量实验表明,与已有文献相比,我们的方法在通信和存储效率上最多可提升五个数量级,而在某些情况下验证精度仅有极小的下降。
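A toy illustration of the communication-efficient idea: weights stay frozen, a real-valued score is trained per weight, a binary mask (score > 0) is obtained with a straight-through estimator, and only the 1-bit mask would be exchanged. The sparsity regulariser shown is a generic density penalty, not necessarily the paper's exact term.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights; only a per-weight score is trained.
    The binary mask (score > 0) selects the sub-network that is actually used."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.score = nn.Parameter(torch.randn(out_f, in_f) * 0.01)

    def mask(self):
        # straight-through estimator: hard 0/1 forward, identity gradient backward
        hard = (self.score > 0).float()
        return hard + self.score - self.score.detach()

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask())

layer = MaskedLinear(128, 10)
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
task_loss = nn.functional.cross_entropy(layer(x), y)
sparsity_reg = layer.mask().mean()              # penalise dense masks
loss = task_loss + 0.1 * sparsity_reg
loss.backward()

# only the 1-bit mask would be exchanged with the server
bits_to_send = (layer.score > 0).to(torch.uint8)
```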

Source-free Active Domain Adaptation for Diabetic Retinopathy Grading Based on Ultra-wide-field Fundus Image

  • paper_url: http://arxiv.org/abs/2309.10619
  • repo_url: None
  • paper_authors: Jinye Ran, Guanghua Zhang, Ximei Zhang, Juan Xie, Fan Xia, Hao Zhang
  • for: 这篇研究的目的是提升无标注超宽场(UWF)眼底图像上糖尿病视网膜病变(DR)分级的性能。
  • methods: 这篇研究使用了无源主动领域自适应(SFADA)技术:生成彩色眼底图像中 DR 关系不断演化的特征,通过局部表征匹配主动选择少量有价值的 UWF 眼底图像进行标注,并借助 DR 病灶原型将模型自适应到 UWF 眼底图像上。
  • results: 实验结果显示,我们提出的 SFADA 达到了最先进的 DR 分级性能,相比基线将准确率提升 20.9%、二次加权 Kappa 提升 18.63%,分别达到 85.36% 和 92.38%。
    Abstract Domain adaptation (DA) has been widely applied in the diabetic retinopathy (DR) grading of unannotated ultra-wide-field (UWF) fundus images, which can transfer annotated knowledge from labeled color fundus images. However, suffering from huge domain gaps and complex real-world scenarios, the DR grading performance of most mainstream DA is far from that of clinical diagnosis. To tackle this, we propose a novel source-free active domain adaptation (SFADA) in this paper. Specifically, we focus on DR grading problem itself and propose to generate features of color fundus images with continuously evolving relationships of DRs, actively select a few valuable UWF fundus images for labeling with local representation matching, and adapt model on UWF fundus images with DR lesion prototypes. Notably, the SFADA also takes data privacy and computational efficiency into consideration. Extensive experimental results demonstrate that our proposed SFADA achieves state-of-the-art DR grading performance, increasing accuracy by 20.9% and quadratic weighted kappa by 18.63% compared with baseline and reaching 85.36% and 92.38% respectively. These investigations show that the potential of our approach for real clinical practice is promising.
    摘要 领域自适应(DA)已被广泛应用于无标注超宽场(UWF)眼底图像的糖尿病视网膜病变(DR)分级,它可以将有标注的彩色眼底图像中的知识迁移过来。然而,由于巨大的领域差距和复杂的真实场景,多数主流 DA 方法的 DR 分级性能与临床诊断水平仍相去甚远。为此,本文提出了一种新的无源主动领域自适应方法(SFADA)。具体来说,我们聚焦于 DR 分级问题本身,提出生成彩色眼底图像中 DR 关系不断演化的特征,通过局部表征匹配主动选择少量有价值的 UWF 眼底图像进行标注,并借助 DR 病灶原型在 UWF 眼底图像上对模型进行自适应。值得注意的是,SFADA 还兼顾了数据隐私和计算效率。大量实验结果表明,我们提出的 SFADA 取得了最先进的 DR 分级性能:相比基线,准确率提升 20.9%,二次加权 Kappa 提升 18.63%,分别达到 85.36% 和 92.38%。这些结果表明该方法在真实临床实践中具有广阔前景。

Intelligent Debris Mass Estimation Model for Autonomous Underwater Vehicle

  • paper_url: http://arxiv.org/abs/2309.10617
  • repo_url: None
  • paper_authors: Mohana Sri S, Swethaa S, Aouthithiye Barathwaj SR Y, Sai Ganesh CS
  • for: 这个论文的目的是提高自主水下航行器(AUV)在水下环境中的导航与交互能力,使用实例分割技术来分割图像中的目标。
  • methods: 这个论文使用的方法包括:在 Roboflow 中使用 YOLOv7 为图像中的目标生成边界框,将每个目标分割为独立的区域,并使用后处理技术来提高分割质量。
  • results: 这个论文的结果表明,使用实例分割技术可以准确地分割水下环境中的对象,并且可以提高AUV的导航和交互能力。
    Abstract Marine debris poses a significant threat to the survival of marine wildlife, often leading to entanglement and starvation, ultimately resulting in death. Therefore, removing debris from the ocean is crucial to restore the natural balance and allow marine life to thrive. Instance segmentation is an advanced form of object detection that identifies objects and precisely locates and separates them, making it an essential tool for autonomous underwater vehicles (AUVs) to navigate and interact with their underwater environment effectively. AUVs use image segmentation to analyze images captured by their cameras to navigate underwater environments. In this paper, we use instance segmentation to calculate the area of individual objects within an image, we use YOLOV7 in Roboflow to generate a set of bounding boxes for each object in the image with a class label and a confidence score for every detection. A segmentation mask is then created for each object by applying a binary mask to the object's bounding box. The masks are generated by applying a binary threshold to the output of a convolutional neural network trained to segment objects from the background. Finally, refining the segmentation mask for each object is done by applying post-processing techniques such as morphological operations and contour detection, to improve the accuracy and quality of the mask. The process of estimating the area of instance segmentation involves calculating the area of each segmented instance separately and then summing up the areas of all instances to obtain the total area. The calculation is carried out using standard formulas based on the shape of the object, such as rectangles and circles. In cases where the object is complex, the Monte Carlo method is used to estimate the area. This method provides a higher degree of accuracy than traditional methods, especially when using a large number of samples.
    摘要 海洋垃圾对海洋野生动物的生存构成重大威胁,常导致缠绕和饥饿,最终造成死亡。因此,清除海洋垃圾对于恢复自然平衡、让海洋生物得以繁衍至关重要。实例分割是一种更高级的目标检测形式,它既能识别目标,又能精确定位并将其彼此分离,因而是自主水下航行器(AUV)在水下环境中有效导航与交互的重要工具。AUV 通过对相机拍摄的图像进行分割来分析水下环境。在本文中,我们使用实例分割来计算图像中每个目标的面积:首先在 Roboflow 中使用 YOLOv7 为图像中的每个目标生成边界框,并给出类别标签和置信度;然后对每个目标的边界框应用二值掩码,生成分割掩码,掩码由一个经过训练、用于将目标从背景中分离的卷积神经网络的输出经二值阈值得到;最后,通过形态学运算和轮廓检测等后处理手段对每个目标的分割掩码进行细化,以提高掩码的精度与质量。实例分割面积的估计过程是:分别计算每个分割实例的面积,再将所有实例的面积求和得到总面积。计算基于目标形状(如矩形、圆形)的标准公式;当目标形状复杂时,则采用蒙特卡洛方法估计面积,该方法在采样数量较大时比传统方法具有更高的精度。
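The Monte Carlo area estimation mentioned above can be illustrated on a binary instance mask: sample points uniformly inside the instance's bounding box and scale the hit ratio by the box area. This is a generic sketch, not code from the paper.

```python
import numpy as np

def monte_carlo_mask_area(mask: np.ndarray, n_samples: int = 10_000, rng=None) -> float:
    """Estimate the area (in pixels) of a binary instance mask by sampling random
    points inside its bounding box and scaling the hit ratio by the box area."""
    rng = rng or np.random.default_rng(0)
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    sample_y = rng.integers(y0, y1, n_samples)
    sample_x = rng.integers(x0, x1, n_samples)
    hit_ratio = mask[sample_y, sample_x].mean()
    return hit_ratio * (y1 - y0) * (x1 - x0)

# toy example: a filled circle of radius 40
yy, xx = np.mgrid[0:128, 0:128]
mask = ((yy - 64) ** 2 + (xx - 64) ** 2) <= 40 ** 2
print("exact:", mask.sum(), "monte carlo:", round(monte_carlo_mask_area(mask)))
```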

NDDepth: Normal-Distance Assisted Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.10592
  • repo_url: None
  • paper_authors: Shuwei Shao, Zhongcai Pei, Weihai Chen, Xingming Wu, Zhengguo Li
  • for: 这个论文主要针对的是单目深度估计问题,它的广泛应用在计算机视觉领域。
  • methods: 该论文提出了一种由物理(几何)驱动的深度学习框架,假设3D场景由分段平面构成。论文引入了一个新的法向-距离头,在每个位置输出像素级的表面法向量和平面到原点的距离,用于推导深度;法向量与距离通过所设计的平面感知一致性约束进行正则化。论文还增加了一个额外的深度头,以提高整体框架的鲁棒性。
  • results: 在 NYU-Depth-v2、KITTI 和 SUN RGB-D 数据集上,所提方法超过了此前最先进的竞争方法。特别地,在提交时,该方法在 KITTI 深度预测在线基准的所有提交中排名第一。
    Abstract Monocular depth estimation has drawn widespread attention from the vision community due to its broad applications. In this paper, we propose a novel physics (geometry)-driven deep learning framework for monocular depth estimation by assuming that 3D scenes are constituted by piece-wise planes. Particularly, we introduce a new normal-distance head that outputs pixel-level surface normal and plane-to-origin distance for deriving depth at each position. Meanwhile, the normal and distance are regularized by a developed plane-aware consistency constraint. We further integrate an additional depth head to improve the robustness of the proposed framework. To fully exploit the strengths of these two heads, we develop an effective contrastive iterative refinement module that refines depth in a complementary manner according to the depth uncertainty. Extensive experiments indicate that the proposed method exceeds previous state-of-the-art competitors on the NYU-Depth-v2, KITTI and SUN RGB-D datasets. Notably, it ranks 1st among all submissions on the KITTI depth prediction online benchmark at the submission time.
    摘要 单目深度估计因其广泛的应用而受到视觉领域的普遍关注。在本文中,我们提出了一个新的由物理(几何)驱动的深度学习框架,假设3D场景由分段平面构成。具体来说,我们引入了一个新的法向-距离头,它在每个位置输出像素级的表面法向量和平面到原点的距离,用于推导深度;同时,法向量与距离通过所设计的平面感知一致性约束进行正则化。我们还引入了一个额外的深度头,以提升框架的鲁棒性。为了充分发挥两个头的优势,我们设计了一个有效的对比式迭代细化模块,根据深度的不确定性以互补的方式对深度进行细化。实验表明,所提方法在 NYU-Depth-v2、KITTI 和 SUN RGB-D 数据集上超过了此前最先进的竞争方法;特别地,在提交时,它在 KITTI 深度预测在线基准的所有提交中排名第一。
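The normal-distance parameterisation implies a closed-form depth per pixel: with back-projected ray K^{-1}[u, v, 1]^T and plane n·X = d, depth is z = d / (n·K^{-1}[u, v, 1]^T). The snippet below illustrates this relation; the intrinsics, the example normal and the sign convention are assumptions for illustration only.

```python
import numpy as np

def depth_from_plane(normal, distance, pixel, K):
    """
    Recover depth at a pixel from a per-pixel surface normal n and the plane-to-origin
    distance d, using the plane equation n . X = d with X = z * K^{-1} [u, v, 1]^T,
    which gives z = d / (n . K^{-1} [u, v, 1]^T).
    """
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    denom = float(normal @ ray)
    return distance / denom if abs(denom) > 1e-8 else np.inf

K = np.array([[720.0, 0.0, 320.0],
              [0.0, 720.0, 240.0],
              [0.0, 0.0, 1.0]])
n = np.array([0.0, -0.8, -0.6])          # e.g. a road-like plane; already unit-norm
print(depth_from_plane(n, distance=-1.5, pixel=(320, 400), K=K))
```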

Few-shot Object Detection in Remote Sensing: Lifting the Curse of Incompletely Annotated Novel Objects

  • paper_url: http://arxiv.org/abs/2309.10588
  • repo_url: None
  • paper_authors: Fahong Zhang, Yilei Shi, Zhitong Xiong, Xiao Xiang Zhu
  • for: 本研究的目的是提出一种基于自训练的少样本目标检测方法(ST-FSOD),用于遥感图像中的少样本目标检测。
  • methods: 本研究使用了双分支的区域建议网络(RPN),其中一个分支用于提取基类目标的候选框,另一个分支用于提取新类目标的候选框。此外,本研究还引入了学生-教师机制,将高置信度的未标注目标作为伪标签,纳入 RPN 和感兴趣区域(RoI)头的训练中。
  • results: 实验结果表明,与现有的最先进方法相比,ST-FSOD 在各种少样本检测设置下都取得了大幅提升。
    Abstract Object detection is an essential and fundamental task in computer vision and satellite image processing. Existing deep learning methods have achieved impressive performance thanks to the availability of large-scale annotated datasets. Yet, in real-world applications the availability of labels is limited. In this context, few-shot object detection (FSOD) has emerged as a promising direction, which aims at enabling the model to detect novel objects with only few of them annotated. However, many existing FSOD algorithms overlook a critical issue: when an input image contains multiple novel objects and only a subset of them are annotated, the unlabeled objects will be considered as background during training. This can cause confusions and severely impact the model's ability to recall novel objects. To address this issue, we propose a self-training-based FSOD (ST-FSOD) approach, which incorporates the self-training mechanism into the few-shot fine-tuning process. ST-FSOD aims to enable the discovery of novel objects that are not annotated, and take them into account during training. On the one hand, we devise a two-branch region proposal networks (RPN) to separate the proposal extraction of base and novel objects, On another hand, we incorporate the student-teacher mechanism into RPN and the region of interest (RoI) head to include those highly confident yet unlabeled targets as pseudo labels. Experimental results demonstrate that our proposed method outperforms the state-of-the-art in various FSOD settings by a large margin. The codes will be publicly available at https://github.com/zhu-xlab/ST-FSOD.
    摘要 目标检测是计算机视觉与卫星图像处理中的一项基础且重要的任务。得益于大规模标注数据集,现有的深度学习方法已经取得了令人印象深刻的性能。然而,在实际应用中,标签的可用性是有限的。在这种情况下,少样本目标检测(FSOD)成为一个有前景的方向,其目标是让模型在只有少量标注的情况下检测新类目标。然而,许多现有的 FSOD 算法忽视了一个关键问题:当输入图像包含多个新类目标而其中只有一部分被标注时,未标注的目标在训练中会被当作背景。这会造成混淆,并严重影响模型对新类目标的召回能力。为解决这一问题,我们提出了一种基于自训练的 FSOD 方法(ST-FSOD),将自训练机制引入少样本微调过程,使模型能够发现未被标注的新类目标,并在训练中将其纳入考虑。一方面,我们设计了双分支的区域建议网络(RPN),将基类与新类目标的候选框提取分离开来;另一方面,我们在 RPN 和感兴趣区域(RoI)头中引入学生-教师机制,将高置信度但未标注的目标作为伪标签纳入训练。实验结果表明,我们的方法在各种 FSOD 设置下均以较大优势超越当前最先进方法。代码将在 https://github.com/zhu-xlab/ST-FSOD 公开。
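A simplified sketch of the pseudo-labelling step: high-confidence teacher detections are merged with the few available ground-truth boxes so that unlabeled novel objects are not treated as background. Box format, threshold and tensor shapes are illustrative assumptions.

```python
import torch

def build_training_targets(gt_boxes, gt_labels, teacher_boxes, teacher_scores,
                           teacher_labels, score_thresh=0.8):
    """Merge the few annotated novel-class boxes with high-confidence teacher
    detections, so unlabeled novel objects are not treated as background."""
    keep = teacher_scores >= score_thresh
    pseudo_boxes, pseudo_labels = teacher_boxes[keep], teacher_labels[keep]
    boxes = torch.cat([gt_boxes, pseudo_boxes], dim=0)
    labels = torch.cat([gt_labels, pseudo_labels], dim=0)
    return boxes, labels

gt_boxes = torch.tensor([[10., 10., 50., 60.]])
gt_labels = torch.tensor([3])
teacher_boxes = torch.tensor([[80., 20., 120., 70.], [5., 5., 15., 15.]])
boxes, labels = build_training_targets(gt_boxes, gt_labels, teacher_boxes,
                                        torch.tensor([0.93, 0.40]), torch.tensor([3, 1]))
```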

Adversarial Attacks Against Uncertainty Quantification

  • paper_url: http://arxiv.org/abs/2309.10586
  • repo_url: https://github.com/shymalagowri/Defense-against-Adversarial-Malware-using-RObust-Classifier-DAM-ROC
  • paper_authors: Emanuele Ledda, Daniele Angioni, Giorgio Piras, Giorgio Fumera, Battista Biggio, Fabio Roli
  • for: 本研究旨在攻击不确定性量化(Uncertainty Quantification,UQ)技术,从而破坏下游模块或人工操作者对机器学习模型输出的信任。
  • methods: 我们设计了一个针对不确定性量化的威胁模型,并针对概念上不同的 UQ 技术提出了多种攻击策略。
  • results: 实验结果表明,与同时诱导误分类的攻击相比,我们的攻击能更有效地操纵 UQ 度量。
    Abstract Machine-learning models can be fooled by adversarial examples, i.e., carefully-crafted input perturbations that force models to output wrong predictions. While uncertainty quantification has been recently proposed to detect adversarial inputs, under the assumption that such attacks exhibit a higher prediction uncertainty than pristine data, it has been shown that adaptive attacks specifically aimed at reducing also the uncertainty estimate can easily bypass this defense mechanism. In this work, we focus on a different adversarial scenario in which the attacker is still interested in manipulating the uncertainty estimate, but regardless of the correctness of the prediction; in particular, the goal is to undermine the use of machine-learning models when their outputs are consumed by a downstream module or by a human operator. Following such direction, we: \textit{(i)} design a threat model for attacks targeting uncertainty quantification; \textit{(ii)} devise different attack strategies on conceptually different UQ techniques spanning for both classification and semantic segmentation problems; \textit{(iii)} conduct a first complete and extensive analysis to compare the differences between some of the most employed UQ approaches under attack. Our extensive experimental analysis shows that our attacks are more effective in manipulating uncertainty quantification measures than attacks aimed to also induce misclassifications.
    摘要 We design a threat model for attacks targeting uncertainty quantification and devise different attack strategies on various UQ techniques for both classification and semantic segmentation problems. We conduct a comprehensive analysis to compare the differences between some of the most commonly used UQ approaches under attack. Our extensive experimental analysis shows that our attacks are more effective in manipulating uncertainty quantification measures than attacks that also aim to induce misclassifications.
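One plausible instantiation of an uncertainty-targeting attack is a PGD-style perturbation that drives the predictive entropy down (or up) irrespective of the predicted class; the sketch below is a generic illustration, not the attack formulations evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()

def uncertainty_attack(model, x, eps=8 / 255, alpha=2 / 255, steps=10, minimize=True):
    """PGD-style perturbation that pushes predictive entropy down (fake confidence)
    or up, without caring whether the predicted class stays correct."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = entropy(model(x_adv))
        grad, = torch.autograd.grad(loss, x_adv)
        direction = -grad.sign() if minimize else grad.sign()
        x_adv = x_adv.detach() + alpha * direction
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_adv = uncertainty_attack(model, torch.rand(4, 3, 32, 32))
```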

Forgedit: Text Guided Image Editing via Learning and Forgetting

  • paper_url: http://arxiv.org/abs/2309.10556
  • repo_url: https://github.com/witcherofresearch/forgedit
  • paper_authors: Shiwen Zhang, Shuai Xiao, Weilin Huang
  • for: 这个论文的目的是提出一种新的文本引导图像编辑方法,以解决现有的图像编辑模型受过拟合和时间占用的问题。
  • methods: 该方法基于一种新颖的微调框架,通过视觉-语言联合学习在一分钟内学会重建输入图像,并利用向量减法与向量投影来寻找适合编辑的文本嵌入。此外,它还利用了扩散模型中 UNet 结构的一个普遍性质,据此设计了遗忘策略来缓解严重的过拟合问题。
  • results: 该方法在 TEdBench 基准上取得了新的最优性能,在 CLIP 分数和 LPIPS 分数两项指标上均超越了之前的 SOTA 方法 Imagic with Imagen。
    Abstract Text guided image editing on real images given only the image and the target text prompt as inputs, is a very general and challenging problem, which requires the editing model to reason by itself which part of the image should be edited, to preserve the characteristics of original image, and also to perform complicated non-rigid editing. Previous fine-tuning based solutions are time-consuming and vulnerable to overfitting, limiting their editing capabilities. To tackle these issues, we design a novel text guided image editing method, Forgedit. First, we propose a novel fine-tuning framework which learns to reconstruct the given image in less than one minute by vision language joint learning. Then we introduce vector subtraction and vector projection to explore the proper text embedding for editing. We also find a general property of UNet structures in Diffusion Models and inspired by such a finding, we design forgetting strategies to diminish the fatal overfitting issues and significantly boost the editing abilities of Diffusion Models. Our method, Forgedit, implemented with Stable Diffusion, achieves new state-of-the-art results on the challenging text guided image editing benchmark TEdBench, surpassing the previous SOTA method Imagic with Imagen, in terms of both CLIP score and LPIPS score. Codes are available at https://github.com/witcherofresearch/Forgedit.
    摘要 仅以真实图像和目标文本提示作为输入的文本引导图像编辑,是一个非常通用且具有挑战性的问题:编辑模型需要自行推断图像中哪些部分应当被编辑,既要保留原图的特征,又要执行复杂的非刚性编辑。以往基于微调的方案耗时且容易过拟合,限制了其编辑能力。为了解决这些问题,我们设计了一种新的文本引导图像编辑方法 Forgedit。首先,我们提出了一种新的微调框架,通过视觉-语言联合学习在不到一分钟内学会重建给定图像。然后,我们引入向量减法和向量投影来探索适合编辑的文本嵌入。我们还发现了扩散模型中 UNet 结构的一个普遍性质,并据此设计了遗忘策略,显著缓解了致命的过拟合问题,大幅提升了扩散模型的编辑能力。我们基于 Stable Diffusion 实现的 Forgedit 在具有挑战性的文本引导图像编辑基准 TEdBench 上取得了新的最优结果,在 CLIP 分数和 LPIPS 分数两项指标上均超越了此前的 SOTA 方法 Imagic with Imagen。代码可在 https://github.com/witcherofresearch/Forgedit 获取。
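The vector subtraction / projection step can be illustrated on toy text embeddings: decompose the target-prompt embedding into components parallel and orthogonal to the source-prompt embedding and recombine them. The exact recombination used by Forgedit may differ; this is only a sketch.

```python
import torch
import torch.nn.functional as F

def edit_embedding(e_source, e_target, mode="projection", scale=1.0):
    """
    Combine a source-prompt embedding (describing the original image) with a
    target-prompt embedding (describing the desired edit).
      subtraction: move from the source towards the target direction
      projection : decompose the target into parallel/orthogonal parts w.r.t. the source
    Purely illustrative; the real method tunes how the parts are recombined.
    """
    if mode == "subtraction":
        return e_source + scale * (e_target - e_source)
    parallel = (e_target @ e_source) / (e_source @ e_source) * e_source
    orthogonal = e_target - parallel
    return parallel + scale * orthogonal

src = F.normalize(torch.randn(768), dim=0)    # stand-in CLIP-style text embeddings
tgt = F.normalize(torch.randn(768), dim=0)
edited = edit_embedding(src, tgt, mode="projection", scale=0.8)
```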

An overview of some mathematical techniques and problems linking 3D vision to 3D printing

  • paper_url: http://arxiv.org/abs/2309.10549
  • repo_url: None
  • paper_authors: Emiliano Cristiani, Maurizio Falcone, Silvia Tozza
  • for: 本研究旨在探讨 Computer Vision 和 3D printing 之间的交互,尤其是Shape-from-Shading 问题的解决方法,以及基于非线性偏微分方程和优化的方法。
  • methods: 本文使用了一些非线性偏微分方程和优化技术来解决 Shape-from-Shading 问题,并考虑了将这些方法应用于 3D printing 过程中。
  • results: 本研究提出了一些实用的例子,以示出将图像转换为 final 3D 印刷的过程。
    Abstract Computer Vision and 3D printing have rapidly evolved in the last 10 years but interactions among them have been very limited so far, despite the fact that they share several mathematical techniques. We try to fill the gap presenting an overview of some techniques for Shape-from-Shading problems as well as for 3D printing with an emphasis on the approaches based on nonlinear partial differential equations and optimization. We also sketch possible couplings to complete the process of object manufacturing starting from one or more images of the object and ending with its final 3D print. We will give some practical examples of this procedure.
    摘要 计算机视觉和3D打印技术在过去10年内快速发展,但它们之间的交互非常有限,尽管它们共享一些数学方法。我们尝试填补这个空白,介绍一些Shape-from-Shading问题的技巧以及基于非线性偏微分方程和优化的3D打印技术。我们还简要介绍可能的交互,完成对象制造的整个过程,从一个或多个图像开始,结束于其最终3D打印。我们将给出一些实践示例。

Decoupling the Curve Modeling and Pavement Regression for Lane Detection

  • paper_url: http://arxiv.org/abs/2309.10533
  • repo_url: None
  • paper_authors: Wencheng Han, Jianbing Shen
  • for: 本研究的目的是提出一种新的车道检测方法,以解决现有方法中 curve-based lane representation 对不规则的车道线的处理不佳问题。
  • methods: 本研究将车道检测任务分解为曲线建模与地面高度回归两部分:在鸟瞰(BEV)空间中用参数化曲线表示车道,以反映车道的原始分布;并将关键点的地面高度与曲线建模分开单独回归。
  • results: 我们在2D车道检测 benchmarks (TuSimple和CULane) 和最近提出的3D车道检测 datasets (ONCE-3Dlane和OpenLane) 上进行了实验,并显示出了显著的改进。
    Abstract The curve-based lane representation is a popular approach in many lane detection methods, as it allows for the representation of lanes as a whole object and maximizes the use of holistic information about the lanes. However, the curves produced by these methods may not fit well with irregular lines, which can lead to gaps in performance compared to indirect representations such as segmentation-based or point-based methods. We have observed that these lanes are not intended to be irregular, but they appear zigzagged in the perspective view due to being drawn on uneven pavement. In this paper, we propose a new approach to the lane detection task by decomposing it into two parts: curve modeling and ground height regression. Specifically, we use a parameterized curve to represent lanes in the BEV space to reflect the original distribution of lanes. For the second part, since ground heights are determined by natural factors such as road conditions and are less holistic, we regress the ground heights of key points separately from the curve modeling. Additionally, we have unified the 2D and 3D lane detection tasks by designing a new framework and a series of losses to guide the optimization of models with or without 3D lane labels. Our experiments on 2D lane detection benchmarks (TuSimple and CULane), as well as the recently proposed 3D lane detection datasets (ONCE-3Dlane and OpenLane), have shown significant improvements. We will make our well-documented source code publicly available.
    摘要 基于曲线的车道表示是许多车道检测方法中受欢迎的做法,因为它将车道视为一个整体对象,能够最大限度地利用车道的整体信息。然而,这些方法生成的曲线可能难以拟合不规则的车道线,导致其性能落后于基于分割或基于点的间接表示方法。我们观察到,这些车道本身并非不规则,而是由于绘制在不平整的路面上,才在透视图中呈现出锯齿状。在本文中,我们将车道检测任务分解为两部分:曲线建模与地面高度回归。具体来说,我们在鸟瞰(BEV)空间中用参数化曲线来表示车道,以反映车道的原始分布;而地面高度由路况等自然因素决定、整体性较弱,因此我们将关键点的地面高度与曲线建模分开单独回归。此外,我们设计了一个新框架和一系列损失函数,统一了2D与3D车道检测任务,可在有或没有3D车道标注的情况下引导模型优化。我们在2D车道检测基准(TuSimple 和 CULane)以及最近提出的3D车道检测数据集(ONCE-3Dlane 和 OpenLane)上的实验均显示出显著提升。我们将公开注释完善的源代码。

Retinex-guided Channel-grouping based Patch Swap for Arbitrary Style Transfer

  • paper_url: http://arxiv.org/abs/2309.10528
  • repo_url: None
  • paper_authors: Chang Liu, Yi Niu, Mingming Ma, Fu Li, Guangming Shi
  • for: 提高基于 patch 匹配的风格迁移的质量和稳定性,以生成更具风格一致性的纹理。
  • methods: 基于 Retinex 理论和通道分组策略,对内容图像特征图中的 patch 进行替换,并引入互补融合与多尺度生成策略,以避免出现不期望的黑色区域和过度风格化的结果。
  • results: 实验结果表明,所提方法在保持内容保真度的同时,比现有技术能提供更具风格一致性的纹理。
    Abstract The basic principle of the patch-matching based style transfer is to substitute the patches of the content image feature maps by the closest patches from the style image feature maps. Since the finite features harvested from one single aesthetic style image are inadequate to represent the rich textures of the content natural image, existing techniques treat the full-channel style feature patches as simple signal tensors and create new style feature patches via signal-level fusion, which ignore the implicit diversities existed in style features and thus fail for generating better stylised results. In this paper, we propose a Retinex theory guided, channel-grouping based patch swap technique to solve the above challenges. Channel-grouping strategy groups the style feature maps into surface and texture channels, which prevents the winner-takes-all problem. Retinex theory based decomposition controls a more stable channel code rate generation. In addition, we provide complementary fusion and multi-scale generation strategy to prevent unexpected black area and over-stylised results respectively. Experimental results demonstrate that the proposed method outperforms the existing techniques in providing more style-consistent textures while keeping the content fidelity.
    摘要 基于 patch 匹配的风格迁移,其基本原理是用风格图像特征图中最接近的 patch 替换内容图像特征图中的 patch。由于从单张风格图像中获取的有限特征不足以表示自然内容图像中丰富的纹理,现有技术将全通道的风格特征 patch 视为简单的信号张量,通过信号级融合来创建新的风格特征 patch,这忽视了风格特征中隐含的多样性,因而难以生成更好的风格化结果。在本文中,我们提出了一种由 Retinex 理论引导、基于通道分组的 patch 交换技术来解决上述挑战:通道分组策略将风格特征图划分为表面通道与纹理通道,避免了"赢者通吃"的问题;基于 Retinex 理论的分解则使通道编码率的生成更加稳定。此外,我们还提供了互补融合与多尺度生成策略,分别用于避免不期望的黑色区域和过度风格化的结果。实验结果表明,所提方法在保持内容保真度的同时,比现有技术能提供更具风格一致性的纹理。

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.10527
  • repo_url: https://github.com/pjlab-adg/3dtrans
  • paper_authors: Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao
  • for: 本文旨在提升3D LiDAR点云的感知任务(包括3D目标检测和LiDAR语义分割),提出了一种可扩展的预训练方法。
  • methods: 本文提出了一种名为 SPOT(通过占据预测实现可扩展预训练)的方法,通过大规模预训练并在不同下游数据集与任务上进行微调,来学习可迁移的3D表示。
  • results: 本文通过在多个公共数据集与任务上的实验,验证了占据预测用于学习通用表示的潜力;并通过波束重采样技术和类别平衡策略,缓解了不同数据集中由 LiDAR 传感器与标注策略差异带来的领域差距。此外,本文还观察到了可扩展预训练现象,即下游性能随预训练数据量的增加而提升。
    Abstract Annotating 3D LiDAR point clouds for perception tasks including 3D object detection and LiDAR semantic segmentation is notoriously time-and-energy-consuming. To alleviate the burden from labeling, it is promising to perform large-scale pre-training and fine-tune the pre-trained backbone on different downstream datasets as well as tasks. In this paper, we propose SPOT, namely Scalable Pre-training via Occupancy prediction for learning Transferable 3D representations, and demonstrate its effectiveness on various public datasets with different downstream tasks under the label-efficiency setting. Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks. (2) SPOT uses beam re-sampling technique for point cloud augmentation and applies class-balancing strategies to overcome the domain gap brought by various LiDAR sensors and annotation strategies in different datasets. (3) Scalable pre-training is observed, that is, the downstream performance across all the experiments gets better with more pre-training data. We believe that our findings can facilitate understanding of LiDAR point clouds and pave the way for future exploration in LiDAR pre-training. Codes and models will be released.

Edge-aware Feature Aggregation Network for Polyp Segmentation

  • paper_url: http://arxiv.org/abs/2309.10523
  • repo_url: None
  • paper_authors: Tao Zhou, Yizhe Zhang, Geng Chen, Yi Zhou, Ye Wu, Deng-Ping Fan
  • for: 本研究旨在通过提高结直肠息肉的分割精度,助力结直肠癌(CRC)的早期诊断与预防。
  • methods: 本研究提出了一种Edge-aware Feature Aggregation Network(EFA-Net),包括Edge-aware Guidance Module(EGM)、Scale-aware Convolution Module(SCM)和Cross-level Fusion Module(CFM)等模块。
  • results: 在五个常用的结肠镜息肉分割数据集上的实验表明,EFA-Net 在泛化性和有效性上均优于此前的方法。
    Abstract Precise polyp segmentation is vital for the early diagnosis and prevention of colorectal cancer (CRC) in clinical practice. However, due to scale variation and blurry polyp boundaries, it is still a challenging task to achieve satisfactory segmentation performance with different scales and shapes. In this study, we present a novel Edge-aware Feature Aggregation Network (EFA-Net) for polyp segmentation, which can fully make use of cross-level and multi-scale features to enhance the performance of polyp segmentation. Specifically, we first present an Edge-aware Guidance Module (EGM) to combine the low-level features with the high-level features to learn an edge-enhanced feature, which is incorporated into each decoder unit using a layer-by-layer strategy. Besides, a Scale-aware Convolution Module (SCM) is proposed to learn scale-aware features by using dilated convolutions with different ratios, in order to effectively deal with scale variation. Further, a Cross-level Fusion Module (CFM) is proposed to effectively integrate the cross-level features, which can exploit the local and global contextual information. Finally, the outputs of CFMs are adaptively weighted by using the learned edge-aware feature, which are then used to produce multiple side-out segmentation maps. Experimental results on five widely adopted colonoscopy datasets show that our EFA-Net outperforms state-of-the-art polyp segmentation methods in terms of generalization and effectiveness.
    摘要 精准的息肉分割对于临床中结直肠癌(CRC)的早期诊断与预防至关重要。然而,由于息肉尺度变化大、边界模糊,要在不同尺度和形状下取得令人满意的分割性能仍然是一项具有挑战性的任务。在本研究中,我们提出了一种新的边缘感知特征聚合网络(EFA-Net),它能够充分利用跨层级和多尺度特征来提升息肉分割性能。具体来说,我们首先提出边缘感知引导模块(EGM),将低层特征与高层特征结合以学习边缘增强特征,并以逐层方式将其注入到每个解码单元中。此外,我们提出尺度感知卷积模块(SCM),通过不同扩张率的空洞卷积学习尺度感知特征,从而有效应对尺度变化。进一步地,我们提出跨层级融合模块(CFM),有效整合跨层级特征,充分利用局部与全局的上下文信息。最后,CFM 的输出由学习到的边缘感知特征进行自适应加权,用于生成多个侧输出分割图。在五个广泛使用的结肠镜数据集上的实验结果表明,EFA-Net 在泛化性和有效性方面均优于当前最先进的息肉分割方法。
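A minimal PyTorch stand-in for a scale-aware convolution block using parallel dilated 3x3 convolutions, in the spirit of the SCM described above; channel counts and dilation rates are illustrative choices.

```python
import torch
import torch.nn as nn

class ScaleAwareConv(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates, fused by a 1x1 conv.
    A minimal stand-in for a scale-aware module handling polyps of varying size."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for d in dilations])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(2, 64, 88, 88)
out = ScaleAwareConv(64)(feat)            # same spatial size, scale-aware features
```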

Spatial-Assistant Encoder-Decoder Network for Real Time Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.10519
  • repo_url: https://github.com/cuzaoo/sanet-main
  • paper_authors: Yalun Wang, Shidong Chen, Huicong Bian, Weixiao Li, Qin Lu
  • for: 本文目的是提出一种基于encoder-decoder架构的实时semantic segmentation网络,以提高自动驾驶车辆的环境理解。
  • methods: 本文采用 encoder-decoder 架构,在 encoder 中保留中间部分的特征图,并使用空洞卷积分支实现同分辨率的特征提取;在 decoder 部分,我们提出了一种混合注意力模块 SAD,用于融合不同的分支。
  • results: 我们的 SANet 模型在实时的 CamVid 和 Cityscapes 数据集上取得了具有竞争力的结果:在 Cityscapes 测试集上达到 78.4% mIOU(65.1 FPS),在 CamVid 测试集上达到 78.8% mIOU(147 FPS)。
    Abstract Semantic segmentation is an essential technology for self-driving cars to comprehend their surroundings. Currently, real-time semantic segmentation networks commonly employ either encoder-decoder architecture or two-pathway architecture. Generally speaking, encoder-decoder models tend to be quicker,whereas two-pathway models exhibit higher accuracy. To leverage both strengths, we present the Spatial-Assistant Encoder-Decoder Network (SANet) to fuse the two architectures. In the overall architecture, we uphold the encoder-decoder design while maintaining the feature maps in the middle section of the encoder and utilizing atrous convolution branches for same-resolution feature extraction. Toward the end of the encoder, we integrate the asymmetric pooling pyramid pooling module (APPPM) to optimize the semantic extraction of the feature maps. This module incorporates asymmetric pooling layers that extract features at multiple resolutions. In the decoder, we present a hybrid attention module, SAD, that integrates horizontal and vertical attention to facilitate the combination of various branches. To ascertain the effectiveness of our approach, our SANet model achieved competitive results on the real-time CamVid and cityscape datasets. By employing a single 2080Ti GPU, SANet achieved a 78.4 % mIOU at 65.1 FPS on the Cityscape test dataset and 78.8 % mIOU at 147 FPS on the CamVid test dataset. The training code and model for SANet are available at https://github.com/CuZaoo/SANet-main
    摘要 Semantic segmentation 是自驾车技术的重要组成部分,帮助自驾车理解它所处的环境。当前实时 semantic segmentation 网络通常采用 either encoder-decoder 架构或 two-pathway 架构。一般来说,encoder-decoder 模型比较快速,而 two-pathway 模型具有更高的准确率。为了利用这两者的优点,我们提出了 Spatial-Assistant Encoder-Decoder Network (SANet),将两种架构融合在一起。整体架构保持 encoder-decoder 设计,并在中间部分的 encoder 中维持特征图,使用 atrous convolution 分支来实现同分辨率特征提取。在 encoder 的末端,我们 integration asymmetric pooling pyramid pooling module (APPPM) 来优化特征提取。在 decoder 中,我们提出了 hybrid attention module (SAD),将 horizontal 和 vertical attention 集成到不同分支中,以便不同分支之间的组合。为了证明我们的方法的有效性,我们的 SANet 模型在 real-time CamVid 和 cityscape 数据集上达到了竞争性的结果。使用单个 2080Ti GPU,SANet 在 Cityscape 测试数据集上达到了 78.4 % mIOU 的值,并在 65.1 FPS 的速度下运行。训练代码和模型可以在 https://github.com/CuZaoo/SANet-main 上下载。

Unsupervised Landmark Discovery Using Consistency Guided Bottleneck

  • paper_url: http://arxiv.org/abs/2309.10518
  • repo_url: https://github.com/mamonaawan/cgb_uld
  • paper_authors: Mamona Awan, Muhammad Haris Khan, Sanoojan Baliah, Muhammad Ahmad Waseem, Salman Khan, Fahad Shahbaz Khan, Arif Mahmood
  • for: 本研究针对的是无监督的物体地标(landmark)发现问题。
  • methods: 我们引入了一个由一致性引导的瓶颈,利用地标一致性(与伪真值的兼容度)来生成自适应热图;并提出通过在图像间建立地标对应关系,在图像重建流程中获得伪监督。
  • results: 我们在五个多样化的数据集上进行了评估,结果表明我们的方法优于现有最先进方法。代码见 https://github.com/MamonaAwan/CGB_ULD。
    Abstract We study a challenging problem of unsupervised discovery of object landmarks. Many recent methods rely on bottlenecks to generate 2D Gaussian heatmaps however, these are limited in generating informed heatmaps while training, presumably due to the lack of effective structural cues. Also, it is assumed that all predicted landmarks are semantically relevant despite having no ground truth supervision. In the current work, we introduce a consistency-guided bottleneck in an image reconstruction-based pipeline that leverages landmark consistency, a measure of compatibility score with the pseudo-ground truth to generate adaptive heatmaps. We propose obtaining pseudo-supervision via forming landmark correspondence across images. The consistency then modulates the uncertainty of the discovered landmarks in the generation of adaptive heatmaps which rank consistent landmarks above their noisy counterparts, providing effective structural information for improved robustness. Evaluations on five diverse datasets including MAFL, AFLW, LS3D, Cats, and Shoes demonstrate excellent performance of the proposed approach compared to the existing state-of-the-art methods. Our code is publicly available at https://github.com/MamonaAwan/CGB_ULD.
    摘要 我们研究一个具有挑战性的无监督物体地标发现问题。许多近期方法依赖瓶颈来生成2D高斯热图,但这些方法在训练时难以生成信息充分的热图,这很可能是由于缺乏有效的结构线索;同时,它们在没有任何真值监督的情况下,假定所有预测的地标在语义上都是相关的。在本工作中,我们在基于图像重建的流程中引入一个由一致性引导的瓶颈,利用地标一致性(与伪真值的兼容度)来生成自适应热图。我们提出通过在图像间建立地标对应关系来获得伪监督;一致性随后调节所发现地标的不确定性,使可靠的地标在自适应热图生成中优先于含噪的地标,从而提供有效的结构信息,提高鲁棒性。在 MAFL、AFLW、LS3D、Cats 和 Shoes 五个多样化数据集上的评估表明,所提方法的性能优于现有最先进方法。我们的代码公开于 https://github.com/MamonaAwan/CGB_ULD。

Uncertainty Estimation in Instance Segmentation with Star-convex Shapes

  • paper_url: http://arxiv.org/abs/2309.10513
  • repo_url: None
  • paper_authors: Qasim M. K. Siddiqui, Sebastian Starke, Peter Steinbach
  • for: 本研究旨在改进实例分割中模型预测不确定性的估计,以便更可靠地评估模型并辅助决策。
  • methods: 本研究使用 Monte-Carlo Dropout 和 Deep Ensemble 技术,基于采样计算每个实例的空间确定性与分数确定性,并比较两种不同的聚类方法以评估其效果。
  • results: 研究发现,将空间确定性与分数确定性相结合可以获得校准更好的不确定性估计;其中 Deep Ensemble 技术结合我们提出的径向聚类方法是一种有效的策略。
    Abstract Instance segmentation has witnessed promising advancements through deep neural network-based algorithms. However, these models often exhibit incorrect predictions with unwarranted confidence levels. Consequently, evaluating prediction uncertainty becomes critical for informed decision-making. Existing methods primarily focus on quantifying uncertainty in classification or regression tasks, lacking emphasis on instance segmentation. Our research addresses the challenge of estimating spatial certainty associated with the location of instances with star-convex shapes. Two distinct clustering approaches are evaluated which compute spatial and fractional certainty per instance employing samples by the Monte-Carlo Dropout or Deep Ensemble technique. Our study demonstrates that combining spatial and fractional certainty scores yields improved calibrated estimation over individual certainty scores. Notably, our experimental results show that the Deep Ensemble technique alongside our novel radial clustering approach proves to be an effective strategy. Our findings emphasize the significance of evaluating the calibration of estimated certainties for model reliability and decision-making.
    摘要 实例分割在基于深度神经网络的算法推动下取得了可喜的进展,但这些模型常常以不合理的高置信度给出错误预测,因此评估预测的不确定性对于可靠决策至关重要。现有方法主要关注分类或回归任务中的不确定性量化,较少涉及实例分割。我们的研究针对星凸形状实例,估计与实例位置相关的空间确定性:通过 Monte-Carlo Dropout 或 Deep Ensemble 采样,评估两种不同的聚类方法来计算每个实例的空间确定性与分数确定性。研究表明,将空间确定性与分数确定性相结合,比单独使用任一确定性能得到校准更好的估计;其中 Deep Ensemble 结合我们提出的新型径向聚类方法是一种有效的策略。这些结果强调了评估所估计确定性的校准性对于模型可靠性与决策的重要意义。
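A compact sketch of the Monte-Carlo Dropout variant: run several stochastic forward passes, then derive a spatial certainty from pixel-wise agreement inside the predicted mask and a fractional certainty from the spread of per-pass foreground fractions. The exact certainty definitions and the per-instance clustering of samples used in the paper may differ.

```python
import torch
import torch.nn as nn

def mc_dropout_passes(model, x, n_passes=20):
    """Keep dropout active at test time and collect stochastic predictions."""
    model.train()                      # enables dropout; beware of BatchNorm in real models
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_passes)])
    return probs                       # (T, B, 1, H, W)

def instance_certainties(probs, threshold=0.5):
    """Spatial certainty: average pixel-wise agreement inside the mean mask.
    Fractional certainty: 1 minus the std of the per-pass foreground fractions."""
    mean_prob = probs.mean(dim=0)
    mask = mean_prob > threshold
    agreement = 1.0 - probs.std(dim=0)                             # per-pixel agreement
    spatial = agreement[mask].mean() if mask.any() else torch.tensor(0.0)
    fractions = (probs > threshold).float().mean(dim=(2, 3, 4))    # (T, B)
    fractional = 1.0 - fractions.std(dim=0).mean()
    return spatial.item(), fractional.item()

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Dropout2d(0.5), nn.Conv2d(8, 1, 3, padding=1))
probs = mc_dropout_passes(model, torch.randn(1, 1, 64, 64))
print(instance_certainties(probs))
```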

Single-Image based unsupervised joint segmentation and denoising

  • paper_url: http://arxiv.org/abs/2309.10511
  • repo_url: https://github.com/Nadja1611/Single-Image-based-unsupervised-joint-segmentation-and-denoising
  • paper_authors: Nadja Gruber, Johannes Schwab, Noémie Debroux, Nicolas Papadakis, Markus Haltmeier
  • for: 这个论文的目的是对单张图像进行联合分割与去噪。
  • methods: 该方法将变分分割方法与基于单张图像的自监督深度学习方法相结合。
  • results: 该方法可以在高噪声和一般纹理图像中分割出多个有意义的区域,并且能够对显微成像中的噪声图像进行联合分割与去噪。
    Abstract In this work, we develop an unsupervised method for the joint segmentation and denoising of a single image. To this end, we combine the advantages of a variational segmentation method with the power of a self-supervised, single-image based deep learning approach. One major strength of our method lies in the fact, that in contrast to data-driven methods, where huge amounts of labeled samples are necessary, our model can segment an image into multiple meaningful regions without any training database. Further, we introduce a novel energy functional in which denoising and segmentation are coupled in a way that both tasks benefit from each other. The limitations of existing single-image based variational segmentation methods, which are not capable of dealing with high noise or generic texture, are tackled by this specific combination with self-supervised image denoising. We propose a unified optimisation strategy and show that, especially for very noisy images available in microscopy, our proposed joint approach outperforms its sequential counterpart as well as alternative methods focused purely on denoising or segmentation. Another comparison is conducted with a supervised deep learning approach designed for the same application, highlighting the good performance of our approach.
    摘要 在这项工作中,我们开发了一种无监督方法,用于对单张图像同时进行分割与去噪。为此,我们将变分分割方法的优点与基于单张图像的自监督深度学习方法的能力相结合。我们方法的一大优势在于:与需要大量标注样本的数据驱动方法不同,我们的模型无需任何训练数据库即可将图像分割为多个有意义的区域。此外,我们引入了一个新的能量泛函,将去噪与分割耦合在一起,使两项任务相互受益。现有基于单张图像的变分分割方法难以处理高噪声或一般纹理,这一局限通过与自监督图像去噪的特定结合得以克服。我们提出了统一的优化策略,并表明:尤其对于显微成像中常见的强噪声图像,所提出的联合方法优于其顺序执行的对应方案,以及仅专注于去噪或分割的其他方法。我们还与为同一应用设计的有监督深度学习方法进行了比较,进一步突显了所提方法的良好性能。

DCPT: Darkness Clue-Prompted Tracking in Nighttime UAVs

  • paper_url: http://arxiv.org/abs/2309.10491
  • repo_url: None
  • paper_authors: Jiawen Zhu, Huayi Tang, Zhi-Qi Cheng, Jun-Yan He, Bin Luo, Shihao Qiu, Shengming Li, Huchuan Lu
  • for: 提高夜间无人机跟踪性能
  • methods: 提出了一种新的 darkness clue-prompted tracking (DCPT) 架构,通过效率地学习生成黑暗提示来实现夜间无人机跟踪。
  • results: 在多个暗光场景基准上进行了大量实验,达到了当前最佳性能。
    Abstract Existing nighttime unmanned aerial vehicle (UAV) trackers follow an "Enhance-then-Track" architecture - first using a light enhancer to brighten the nighttime video, then employing a daytime tracker to locate the object. This separate enhancement and tracking fails to build an end-to-end trainable vision system. To address this, we propose a novel architecture called Darkness Clue-Prompted Tracking (DCPT) that achieves robust UAV tracking at night by efficiently learning to generate darkness clue prompts. Without a separate enhancer, DCPT directly encodes anti-dark capabilities into prompts using a darkness clue prompter (DCP). Specifically, DCP iteratively learns emphasizing and undermining projections for darkness clues. It then injects these learned visual prompts into a daytime tracker with fixed parameters across transformer layers. Moreover, a gated feature aggregation mechanism enables adaptive fusion between prompts and between prompts and the base model. Extensive experiments show state-of-the-art performance for DCPT on multiple dark scenario benchmarks. The unified end-to-end learning of enhancement and tracking in DCPT enables a more trainable system. The darkness clue prompting efficiently injects anti-dark knowledge without extra modules. Code and models will be released.

RECALL+: Adversarial Web-based Replay for Continual Learning in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.10479
  • repo_url: None
  • paper_authors: Chang Liu, Giulia Rizzoli, Francesco Barbato, Umberto Michieli, Yi Niu, Pietro Zanuttigh
  • for: 本研究的目的是解决 continual learning 中的 catastrophic forgetting 问题,通过不同的 regularization 策略来保持之前学习的知识。
  • methods: 本研究扩展了之前的方法(RECALL),通过在线数据库中检索过时的类例来避免忘记。本研究引入了两种新的方法:基于对抗学习和自适应阈值调整来选择来自网络数据的示例,以及一种改进的pseudo-labeling scheme来更准确地标注网络数据。
  • results: 实验结果显示,这种增强的方法取得了显著的效果,尤其是在执行多个增量学习步骤时。
    Abstract Catastrophic forgetting of previous knowledge is a critical issue in continual learning typically handled through various regularization strategies. However, existing methods struggle especially when several incremental steps are performed. In this paper, we extend our previous approach (RECALL) and tackle forgetting by exploiting unsupervised web-crawled data to retrieve examples of old classes from online databases. Differently from the original approach that did not perform any evaluation of the web data, here we introduce two novel approaches based on adversarial learning and adaptive thresholding to select from web data only samples strongly resembling the statistics of the no longer available training ones. Furthermore, we improved the pseudo-labeling scheme to achieve a more accurate labeling of web data that also consider classes being learned in the current step. Experimental results show that this enhanced approach achieves remarkable results, especially when multiple incremental learning steps are performed.
    摘要 continual learning中的重大问题之一是 previous knowledge的恶化,通常通过不同的正则化策略来解决。然而,现有方法在多个递增学习步骤时尤其不稳定。在这篇论文中,我们对我们之前的方法(RECALL)进行扩展,通过利用无监督的网络数据来恢复老的类。与原始方法不同的是,我们在这里引入了两种基于对抗学习和自适应阈值的新方法来从网络数据中选择只有强相似于过去训练数据的统计的样本。此外,我们改进了 Pseudo-labeling 方案,以更加准确地标注网络数据,并考虑当前步骤中学习的类。实验结果表明,这种加强的方法在多个递增学习步骤时显示出了很好的效果。

LineMarkNet: Line Landmark Detection for Valet Parking

  • paper_url: http://arxiv.org/abs/2309.10475
  • repo_url: None
  • paper_authors: Zizhang Wu, Fan Wang, Yuanzhu Gan, Tianhao Xu, Weiwei Sun, Rui Tang
  • for: This paper aims to solve the long-standing problem of accurate and efficient line landmark detection for valet parking in autonomous driving.
  • methods: The paper presents a deep line landmark detection system that utilizes a pre-calibrated homography to fuse context from four separate cameras into a unified bird-eye-view (BEV) space. The system employs a multi-task decoder to detect multiple line landmarks and incorporates a graph transformer to enhance the vision transformer with hierarchical level graph reasoning for semantic segmentation.
  • results: The paper achieves enhanced performance compared to several line detection methods and validates the multi-task network’s efficiency in real-time line landmark detection on the Qualcomm 820A platform while maintaining superior accuracy.
    Abstract We aim for accurate and efficient line landmark detection for valet parking, which is a long-standing yet unsolved problem in autonomous driving. To this end, we present a deep line landmark detection system where we carefully design the modules to be lightweight. Specifically, we first empirically design four general line landmarks including three physical lines and one novel mental line. The four line landmarks are effective for valet parking. We then develop a deep network (LineMarkNet) to detect line landmarks from surround-view cameras where we, via the pre-calibrated homography, fuse context from four separate cameras into the unified bird-eye-view (BEV) space, specifically we fuse the surroundview features and BEV features, then employ the multi-task decoder to detect multiple line landmarks where we apply the center-based strategy for object detection task, and design our graph transformer to enhance the vision transformer with hierarchical level graph reasoning for semantic segmentation task. At last, we further parameterize the detected line landmarks (e.g., intercept-slope form) whereby a novel filtering backend incorporates temporal and multi-view consistency to achieve smooth and stable detection. Moreover, we annotate a large-scale dataset to validate our method. Experimental results show that our framework achieves the enhanced performance compared with several line detection methods and validate the multi-task network's efficiency about the real-time line landmark detection on the Qualcomm 820A platform while meantime keeps superior accuracy, with our deep line landmark detection system.
    摘要 我们的目标是为代客泊车实现精确且高效的线状地标检测,这是自动驾驶中长期存在但尚未解决的问题。为此,我们提出了一个深度线状地标检测系统,并将各模块精心设计得十分轻量。具体来说,我们首先根据经验设计了四种通用的线状地标,包括三种物理线和一种新颖的"心理线",这四种线状地标对代客泊车十分有效。随后,我们开发了深度网络 LineMarkNet,用于从环视相机中检测线状地标:借助预先标定的单应性变换(homography),我们将四个相机的上下文信息融合到统一的鸟瞰(BEV)空间中,具体是融合环视特征与 BEV 特征,再利用多任务解码器检测多种线状地标,其中目标检测任务采用基于中心的策略,语义分割任务则设计了图 Transformer,以层级化的图推理增强视觉 Transformer。最后,我们进一步对检测到的线状地标进行参数化(例如截距-斜率形式),并通过一个结合时序与多视角一致性的新型滤波后端,实现平滑而稳定的检测。此外,我们标注了一个大规模数据集来验证所提方法。实验结果表明,与多种线检测方法相比,我们的框架取得了更优的性能,并验证了多任务网络在高通 820A 平台上进行实时线状地标检测的效率,同时保持了出色的精度。
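The pre-calibrated homography fusion into BEV can be sketched with OpenCV: each surround-view image is warped onto a common ground-plane canvas and the warped views are merged. The homography values below are placeholders, and the averaging fusion is a simplification of the learned feature-level fusion described above.

```python
import cv2
import numpy as np

def warp_to_bev(image, homography, bev_size=(400, 400)):
    """Project one surround-view camera image into a common bird-eye-view canvas
    using its pre-calibrated 3x3 homography (image plane -> BEV ground plane)."""
    return cv2.warpPerspective(image, homography, bev_size)

def fuse_bev(views):
    """Naive fusion of several warped views: average wherever a camera sees the ground."""
    stack = np.stack([v.astype(np.float32) for v in views])
    weight = (stack.sum(axis=-1, keepdims=True) > 0).astype(np.float32)
    return (stack * weight).sum(axis=0) / np.clip(weight.sum(axis=0), 1.0, None)

H_front = np.array([[0.5, 0.0, 50.0],        # placeholder calibration, not real values
                    [0.0, 0.5, 120.0],
                    [0.0, 0.001, 1.0]])
front = np.zeros((480, 640, 3), dtype=np.uint8)
bev = fuse_bev([warp_to_bev(front, H_front)])
```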

Diffusion-based speech enhancement with a weighted generative-supervised learning loss

  • paper_url: http://arxiv.org/abs/2309.10457
  • repo_url: None
  • paper_authors: Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel
  • for: 语音增强(SE)中基于扩散的生成模型,为传统监督方法提供一种替代方案。
  • methods: 将干净语音训练样本转化为以含噪语音为中心的高斯噪声,并学习一个参数化模型,在含噪语音的条件下逐步逆转这一过程;在此基础上,为原始扩散训练目标增加均方误差(MSE)损失。
  • results: 实验结果表明,所提方法能够缓解仅依赖无监督损失、难以充分利用条件含噪语音的不足,验证了方法的有效性。
    Abstract Diffusion-based generative models have recently gained attention in speech enhancement (SE), providing an alternative to conventional supervised methods. These models transform clean speech training samples into Gaussian noise centered at noisy speech, and subsequently learn a parameterized model to reverse this process, conditionally on noisy speech. Unlike supervised methods, generative-based SE approaches usually rely solely on an unsupervised loss, which may result in less efficient incorporation of conditioned noisy speech. To address this issue, we propose augmenting the original diffusion training objective with a mean squared error (MSE) loss, measuring the discrepancy between estimated enhanced speech and ground-truth clean speech at each reverse process iteration. Experimental results demonstrate the effectiveness of our proposed methodology.
    摘要 基于扩散的生成模型近来在语音增强(SE)中受到关注,为传统监督方法提供了替代方案。这些模型将干净语音训练样本转化为以含噪语音为中心的高斯噪声,随后学习一个参数化模型,在含噪语音的条件下逆转这一过程。与监督方法不同,基于生成的 SE 方法通常仅依赖无监督损失,可能导致对条件含噪语音的利用不够充分。为解决这一问题,我们提出在原始扩散训练目标上增加均方误差(MSE)损失,衡量逆过程每次迭代中估计的增强语音与干净语音真值之间的差异。实验结果证明了所提方法的有效性。
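A rough sketch of how a supervised MSE term can be added on top of a denoising score-matching objective, in the spirit of the weighted generative-supervised loss described above. The stand-in score network, the Tweedie-style one-step clean estimate, and the weight `lambda_mse` are simplified assumptions, not the paper's exact formulation.

```python
# Sketch: denoising score matching plus a supervised MSE term (assumptions noted above).
import torch

def weighted_se_loss(score_model, x_clean, y_noisy, sigma, lambda_mse=0.5):
    z = torch.randn_like(x_clean)
    x_t = x_clean + sigma * z                       # diffused clean speech
    score = score_model(x_t, y_noisy, sigma)        # score conditioned on noisy speech
    loss_score = ((sigma * score + z) ** 2).mean()  # standard DSM term
    x_hat = x_t + sigma ** 2 * score                # Tweedie-style one-step clean estimate
    loss_mse = ((x_hat - x_clean) ** 2).mean()      # supervised discrepancy term
    return loss_score + lambda_mse * loss_mse

if __name__ == "__main__":
    toy_score = lambda x_t, y, s: (y - x_t) / (s ** 2 + 1.0)  # stand-in score network
    x = torch.randn(2, 16000)
    y = x + 0.1 * torch.randn_like(x)
    print(float(weighted_se_loss(toy_score, x, y, sigma=0.5)))
```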

Unsupervised speech enhancement with diffusion-based generative models

  • paper_url: http://arxiv.org/abs/2309.10450
  • repo_url: https://github.com/sp-uhh/sgmse
  • paper_authors: Berné Nortier, Mostafa Sadeghi, Romain Serizel
  • for: 本研究旨在提出一种基于扩散模型的无监督音频增强方法,以解决现有方法在面对未见过的条件时的挑战。
  • methods: 我们使用基于分数的扩散模型,在短时傅里叶变换(STFT)域学习干净语音的先验分布,随后将该先验与噪声模型相结合,通过后验采样实现无监督语音增强;噪声参数与干净语音估计通过迭代期望最大化(EM)方法联合学习。
  • results: 与一种最新的基于变分自编码器(VAE)的无监督方法以及一种最先进的基于扩散的监督方法相比,我们的方法取得了有前景的结果,展示了其在无监督语音增强方面的潜力。
    Abstract Recently, conditional score-based diffusion models have gained significant attention in the field of supervised speech enhancement, yielding state-of-the-art performance. However, these methods may face challenges when generalising to unseen conditions. To address this issue, we introduce an alternative approach that operates in an unsupervised manner, leveraging the generative power of diffusion models. Specifically, in a training phase, a clean speech prior distribution is learnt in the short-time Fourier transform (STFT) domain using score-based diffusion models, allowing it to unconditionally generate clean speech from Gaussian noise. Then, we develop a posterior sampling methodology for speech enhancement by combining the learnt clean speech prior with a noise model for speech signal inference. The noise parameters are simultaneously learnt along with clean speech estimation through an iterative expectationmaximisation (EM) approach. To the best of our knowledge, this is the first work exploring diffusion-based generative models for unsupervised speech enhancement, demonstrating promising results compared to a recent variational auto-encoder (VAE)-based unsupervised approach and a state-of-the-art diffusion-based supervised method. It thus opens a new direction for future research in unsupervised speech enhancement.
    摘要 最近,基于条件分数的扩散模型在监督语音增强领域受到了广泛关注,并取得了最先进的性能。然而,这些方法在泛化到未见过的条件时可能面临挑战。为解决这个问题,我们介绍了一种以无监督方式工作的替代方案,利用扩散模型的生成能力。具体来说,在训练阶段,我们使用基于分数的扩散模型在短时傅里叶变换(STFT)域学习干净语音的先验分布,使其能够从高斯噪声中无条件地生成干净语音。然后,我们将学习到的干净语音先验与噪声模型相结合,开发了一种用于语音增强的后验采样方法;噪声参数与干净语音估计通过迭代期望最大化(EM)方法同时学习。据我们所知,这是第一篇探索基于扩散的生成模型进行无监督语音增强的工作,与最近一种基于变分自编码器(VAE)的无监督方法和一种最先进的基于扩散的监督方法相比取得了有前景的结果,为未来的无监督语音增强研究开辟了新方向。

Posterior sampling algorithms for unsupervised speech enhancement with recurrent variational autoencoder

  • paper_url: http://arxiv.org/abs/2309.10439
  • repo_url: None
  • paper_authors: Mostafa Sadeghi, Romain Serizel
  • for: addresses the unsupervised speech enhancement problem based on recurrent variational autoencoder (RVAE)
  • methods: uses efficient sampling techniques based on Langevin dynamics and Metropolis-Hasting algorithms to circumvent the computational complexity of variational inference
  • results: significantly outperforms the variational expectation-maximization (VEM) method and shows robust generalization performance in mismatched test conditions
    Abstract In this paper, we address the unsupervised speech enhancement problem based on recurrent variational autoencoder (RVAE). This approach offers promising generalization performance over the supervised counterpart. Nevertheless, the involved iterative variational expectation-maximization (VEM) process at test time, which relies on a variational inference method, results in high computational complexity. To tackle this issue, we present efficient sampling techniques based on Langevin dynamics and Metropolis-Hasting algorithms, adapted to the EM-based speech enhancement with RVAE. By directly sampling from the intractable posterior distribution within the EM process, we circumvent the intricacies of variational inference. We conduct a series of experiments, comparing the proposed methods with VEM and a state-of-the-art supervised speech enhancement approach based on diffusion models. The results reveal that our sampling-based algorithms significantly outperform VEM, not only in terms of computational efficiency but also in overall performance. Furthermore, when compared to the supervised baseline, our methods showcase robust generalization performance in mismatched test conditions.
    摘要 本文基于循环变分自编码器(RVAE)研究无监督语音增强问题。该方法相较于监督方法具有更好的泛化性能。然而,测试阶段所涉及的迭代变分期望最大化(VEM)过程依赖变分推断,计算复杂度较高。为此,我们提出基于朗之万动力学和 Metropolis-Hastings 算法的高效采样技术,并将其适配到基于 EM 的 RVAE 语音增强中:通过在 EM 过程中直接从难以处理的后验分布中采样,我们绕过了变分推断的复杂性。我们进行了一系列实验,将所提方法与 VEM 以及一种最先进的基于扩散模型的监督语音增强方法进行比较。结果表明,基于采样的算法不仅在计算效率上显著优于 VEM,整体性能也更好;与监督基线相比,我们的方法在不匹配的测试条件下表现出稳健的泛化性能。
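The sampling idea can be illustrated with a generic unadjusted Langevin iteration: follow the gradient of the log-posterior plus injected Gaussian noise. The quadratic toy log-posterior below stands in for the intractable RVAE speech posterior, and the step size and iteration count are arbitrary choices, not values from the paper.

```python
# Sketch: unadjusted Langevin dynamics targeting exp(log_post) (toy posterior).
import torch

def langevin_sample(x0, log_post, n_steps=500, step=5e-2):
    x = x0.clone().requires_grad_(True)
    for _ in range(n_steps):
        (grad,) = torch.autograd.grad(log_post(x).sum(), x)
        with torch.no_grad():
            x += 0.5 * step * grad + (step ** 0.5) * torch.randn_like(x)
    return x.detach()

if __name__ == "__main__":
    mean = torch.tensor([1.0, -2.0])
    toy_log_post = lambda x: -0.5 * ((x - mean) ** 2).sum(-1)  # toy Gaussian "posterior"
    samples = langevin_sample(torch.zeros(1000, 2), toy_log_post)
    print(samples.mean(0))  # roughly [1.0, -2.0]
```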

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

  • paper_url: http://arxiv.org/abs/2309.10438
  • repo_url: None
  • paper_authors: Lijiang Li, Huixia Li, Xiawu Zheng, Jie Wu, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan, Fei Chao, Rongrong Ji
  • for: 提高 diffusion models 的生成速度,无需额外训练。
  • methods: 提出了一种统一搜索空间和演化算法,可以在不同的 diffusion models 中找到最佳的时间步骤和模型结构。
  • results: 通过使用只有几步(例如,4步),可以达到优秀的图像生成效果(例如, ImageNet 64 $\times$ 64 的 FID 分数为 17.86),比传统的 DDIM 更好。
    Abstract Diffusion models are emerging expressive generative models, in which a large number of time steps (inference steps) are required for a single image generation. To accelerate such tedious process, reducing steps uniformly is considered as an undisputed principle of diffusion models. We consider that such a uniform assumption is not the optimal solution in practice; i.e., we can find different optimal time steps for different models. Therefore, we propose to search the optimal time steps sequence and compressed model architecture in a unified framework to achieve effective image generation for diffusion models without any further training. Specifically, we first design a unified search space that consists of all possible time steps and various architectures. Then, a two stage evolutionary algorithm is introduced to find the optimal solution in the designed search space. To further accelerate the search process, we employ FID score between generated and real samples to estimate the performance of the sampled examples. As a result, the proposed method is (i).training-free, obtaining the optimal time steps and model architecture without any training process; (ii). orthogonal to most advanced diffusion samplers and can be integrated to gain better sample quality. (iii). generalized, where the searched time steps and architectures can be directly applied on different diffusion models with the same guidance scale. Experimental results show that our method achieves excellent performance by using only a few time steps, e.g. 17.86 FID score on ImageNet 64 $\times$ 64 with only four steps, compared to 138.66 with DDIM.
    摘要 扩散模型是一类新兴的高表达力生成模型,生成单张图像需要大量时间步(推理步)。为加速这一繁琐过程,均匀地减少步数通常被视为扩散模型的一条无可争议的原则。我们认为这种均匀假设在实践中并非最优解,即不同模型可以找到各自不同的最优时间步。因此,我们提出在统一框架中同时搜索最优时间步序列与压缩后的模型结构,从而在不进行任何额外训练的情况下实现高效的图像生成。具体而言,我们首先设计了一个包含所有可能时间步与多种结构的统一搜索空间;然后引入两阶段进化算法,在该搜索空间中寻找最优解;为进一步加速搜索,我们用生成样本与真实样本之间的 FID 分数来评估候选解的性能。因此,所提方法具有以下优点:(1)无需训练,可直接获得最优时间步与模型结构;(2)与大多数先进的扩散采样器正交,可与其结合以获得更好的样本质量;(3)具有通用性,搜索到的时间步与结构可在相同引导尺度下直接应用于不同的扩散模型。实验结果表明,我们的方法仅用少量时间步即可取得优异性能,例如在 ImageNet 64 $\times$ 64 上仅用四步即可达到 17.86 的 FID 分数,而 DDIM 为 138.66。
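A toy sketch of the training-free search loop: evolve small subsets of denoising timesteps and keep the candidate whose generated samples score best. The scoring function here is a random placeholder standing in for FID between generated and real images, and the joint architecture-search dimension is omitted entirely.

```python
# Sketch: evolutionary search over timestep subsets with a placeholder score.
import random

def evolve_timesteps(total_steps=1000, budget=4, pop_size=16, generations=20, score_fn=None):
    score_fn = score_fn or (lambda ts: random.random())  # placeholder for an FID evaluation

    def mutate(ts):
        ts = set(ts)
        ts.discard(random.choice(tuple(ts)))              # drop one step
        while len(ts) < budget:
            ts.add(random.randrange(total_steps))         # add a new random step
        return tuple(sorted(ts))

    population = [tuple(sorted(random.sample(range(total_steps), budget)))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score_fn)                     # lower "FID" is better
        parents = population[: pop_size // 2]
        population = parents + [mutate(p) for p in parents]
    return min(population, key=score_fn)

print(evolve_timesteps())
```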

Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions

  • paper_url: http://arxiv.org/abs/2309.10431
  • repo_url: https://github.com/roywangj/adaptpoint
  • paper_authors: Jie Wang, Lihe Ding, Tingfa Xu, Shaocong Dong, Xinli Xu, Long Bai, Jianan Li
  • for: 提高3D视觉中的稳定性和可靠性,应对各种潜在的损害和噪声。
  • methods: 提出了一种样本自适应增强框架 AdaptPoint,通过基于样本结构的自适应变换来应对潜在损坏;同时引入感知引导反馈机制,以生成难度适中的样本。
  • results: 实验结果表明,我们的方法在多个损坏基准(包括 ModelNet-C、我们提出的 ScanObjectNN-C 和 ShapeNet-C)上取得了最先进的结果。
    Abstract Robust 3D perception under corruption has become an essential task for the realm of 3D vision. While current data augmentation techniques usually perform random transformations on all point cloud objects in an offline way and ignore the structure of the samples, resulting in over-or-under enhancement. In this work, we propose an alternative to make sample-adaptive transformations based on the structure of the sample to cope with potential corruption via an auto-augmentation framework, named as AdaptPoint. Specially, we leverage a imitator, consisting of a Deformation Controller and a Mask Controller, respectively in charge of predicting deformation parameters and producing a per-point mask, based on the intrinsic structural information of the input point cloud, and then conduct corruption simulations on top. Then a discriminator is utilized to prevent the generation of excessive corruption that deviates from the original data distribution. In addition, a perception-guidance feedback mechanism is incorporated to guide the generation of samples with appropriate difficulty level. Furthermore, to address the paucity of real-world corrupted point cloud, we also introduce a new dataset ScanObjectNN-C, that exhibits greater similarity to actual data in real-world environments, especially when contrasted with preceding CAD datasets. Experiments show that our method achieves state-of-the-art results on multiple corruption benchmarks, including ModelNet-C, our ScanObjectNN-C, and ShapeNet-C.
    摘要 在损坏条件下实现稳健的3D感知已成为3D视觉领域的一项关键任务。现有的数据增强技术通常以离线方式对所有点云对象施加随机变换,忽略了样本自身的结构,导致增强过度或不足。在本工作中,我们提出通过自动增强框架 AdaptPoint,根据样本结构进行样本自适应变换,以应对潜在损坏。具体而言,我们利用一个由形变控制器和掩码控制器组成的模仿器,基于输入点云的内在结构信息分别预测形变参数并生成逐点掩码,在此基础上进行损坏模拟;随后利用判别器防止生成偏离原始数据分布的过度损坏。此外,我们引入感知引导反馈机制,引导生成难度适中的样本。为缓解真实世界损坏点云数据匮乏的问题,我们还构建了新数据集 ScanObjectNN-C,其相较于以往的 CAD 数据集更接近真实环境中的数据。实验表明,我们的方法在多个损坏基准(包括 ModelNet-C、我们的 ScanObjectNN-C 和 ShapeNet-C)上取得了最先进的结果。

Predicate Classification Using Optimal Transport Loss in Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2309.10430
  • repo_url: None
  • paper_authors: Sorachi Kurita, Satoshi Oyama, Itsuki Noda
  • for: 提高Scene Graph生成(SGG)中的预测准确率,避免由数据集中关系标签的分布偏度导致的预测偏误。
  • methods: 使用最优传输损失来衡量两个概率分布之间的差异,并在谓词分类中以该损失进行学习,从而生成场景图。
  • results: 与现有方法相比,所提方法在 mean Recall@50 和 100 上表现更优,并且提高了数据集中罕见关系标签的召回率。
    Abstract In scene graph generation (SGG), learning with cross-entropy loss yields biased predictions owing to the severe imbalance in the distribution of the relationship labels in the dataset. Thus, this study proposes a method to generate scene graphs using optimal transport as a measure for comparing two probability distributions. We apply learning with the optimal transport loss, which reflects the similarity between the labels in terms of transportation cost, for predicate classification in SGG. In the proposed approach, the transportation cost of the optimal transport is defined using the similarity of words obtained from the pre-trained model. The experimental evaluation of the effectiveness demonstrates that the proposed method outperforms existing methods in terms of mean Recall@50 and 100. Furthermore, it improves the recall of the relationship labels scarcely available in the dataset.
    摘要 在场景图生成(SGG)中,由于数据集中关系标签的分布严重不均衡,使用交叉熵损失学习会导致预测偏差。因此,本研究提出了一种使用最优传输作为比较两个概率分布的度量来生成场景图的方法。我们在 SGG 的谓词分类中采用最优传输损失进行学习,该损失以传输代价反映标签之间的相似性;在我们的方法中,最优传输的传输代价由预训练模型得到的词语相似度来定义。实验评估结果表明,所提方法在 mean Recall@50 和 100 上超过了现有方法,并且提高了数据集中罕见关系标签的召回率。
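One way to realise such a loss is entropic optimal transport computed with Sinkhorn iterations, where the cost matrix encodes how dissimilar two predicate labels are. The random cost matrix, the smoothed one-hot target, and the hyperparameters below are illustrative assumptions; the paper derives the cost from word similarities of a pre-trained model.

```python
# Sketch: Sinkhorn-based optimal-transport loss between predicted and target label distributions.
import torch

def sinkhorn_loss(pred_probs, target_probs, cost, eps=0.1, n_iters=50):
    """OT distance between two distributions over the same label set."""
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    u = torch.ones_like(pred_probs)
    for _ in range(n_iters):                      # Sinkhorn scaling iterations
        u = pred_probs / (K @ (target_probs / (K.t() @ u)))
    v = target_probs / (K.t() @ u)
    plan = torch.diag(u) @ K @ torch.diag(v)      # transport plan
    return (plan * cost).sum()

if __name__ == "__main__":
    n_classes = 5
    cost = torch.rand(n_classes, n_classes)       # stand-in for word-similarity-based cost
    cost.fill_diagonal_(0.0)                      # no cost for staying on the true label
    pred = torch.softmax(torch.randn(n_classes), dim=0)
    target = torch.full((n_classes,), 0.01)
    target[2] = 1 - 0.01 * (n_classes - 1)        # smoothed one-hot ground truth
    print(sinkhorn_loss(pred, target, cost))
```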

Exploring Different Levels of Supervision for Detecting and Localizing Solar Panels on Remote Sensing Imagery

  • paper_url: http://arxiv.org/abs/2309.10421
  • repo_url: None
  • paper_authors: Maarten Burger, Rob Wijnhoven, Shaodi You
  • for: 本研究探讨遥感图像中的目标存在检测与定位问题,重点是太阳能板识别;研究了不同级别的监督。
  • methods: 研究评估了不同监督级别的三种模型:全监督目标检测器、基于 CAM 定位的弱监督图像分类器,以及最小监督异常检测器。
  • results: 分类器在二分类存在检测中取得 0.79 的 F1 分数,而目标检测器(0.72)提供了精确定位;异常检测器需要更多数据才能达到可用性能。融合模型结果有望进一步提升准确率。CAM 对定位的影响有限,其中 GradCAM、GradCAM++ 和 HiResCAM 表现较好。此外,与目标检测器不同,分类器在数据量较少时仍保持鲁棒。
    Abstract This study investigates object presence detection and localization in remote sensing imagery, focusing on solar panel recognition. We explore different levels of supervision, evaluating three models: a fully supervised object detector, a weakly supervised image classifier with CAM-based localization, and a minimally supervised anomaly detector. The classifier excels in binary presence detection (0.79 F1-score), while the object detector (0.72) offers precise localization. The anomaly detector requires more data for viable performance. Fusion of model results shows potential accuracy gains. CAM impacts localization modestly, with GradCAM, GradCAM++, and HiResCAM yielding superior results. Notably, the classifier remains robust with less data, in contrast to the object detector.
    摘要 本研究考察遥感图像中的目标存在检测与定位,重点是太阳能板识别。我们考虑不同级别的监督,评估三种模型:全监督目标检测器、基于 CAM 定位的弱监督图像分类器,以及最小监督异常检测器。分类器在二分类存在检测中表现出色(F1 分数 0.79),而目标检测器(F1 分数 0.72)具有精确的定位能力;异常检测器需要更多数据才能获得可用的表现。融合各模型的结果显示出潜在的准确率提升。CAM 对定位的影响有限,其中 GradCAM、GradCAM++ 和 HiResCAM 取得了较好的结果。值得注意的是,与目标检测器不同,分类器在数据较少时仍保持鲁棒。

SideGAN: 3D-Aware Generative Model for Improved Side-View Image Synthesis

  • paper_url: http://arxiv.org/abs/2309.10388
  • repo_url: None
  • paper_authors: Kyungmin Jo, Wonjoon Jin, Jaegul Choo, Hyunjoon Lee, Sunghyun Cho
  • for: 该论文主要针对生成高质量的三维感知图像,尤其是在侧视角度下。
  • methods: 该论文提出了一种新的 3D GAN 训练方法 SideGAN,可以在不同视角下生成高质量图像;为解决姿态一致性与照片级真实感难以同时学习的问题,该方法将问题分解为两个更易求解的子问题。
  • results: 该论文通过广泛的验证,证明了 SideGAN 能够生成高质量的几何结构和照片级真实的图像,而不受相机姿态的影响。
    Abstract While recent 3D-aware generative models have shown photo-realistic image synthesis with multi-view consistency, the synthesized image quality degrades depending on the camera pose (e.g., a face with a blurry and noisy boundary at a side viewpoint). Such degradation is mainly caused by the difficulty of learning both pose consistency and photo-realism simultaneously from a dataset with heavily imbalanced poses. In this paper, we propose SideGAN, a novel 3D GAN training method to generate photo-realistic images irrespective of the camera pose, especially for faces of side-view angles. To ease the challenging problem of learning photo-realistic and pose-consistent image synthesis, we split the problem into two subproblems, each of which can be solved more easily. Specifically, we formulate the problem as a combination of two simple discrimination problems, one of which learns to discriminate whether a synthesized image looks real or not, and the other learns to discriminate whether a synthesized image agrees with the camera pose. Based on this, we propose a dual-branched discriminator with two discrimination branches. We also propose a pose-matching loss to learn the pose consistency of 3D GANs. In addition, we present a pose sampling strategy to increase learning opportunities for steep angles in a pose-imbalanced dataset. With extensive validation, we demonstrate that our approach enables 3D GANs to generate high-quality geometries and photo-realistic images irrespective of the camera pose.
    摘要 尽管近期的3D感知生成模型已能实现具有多视角一致性的照片级图像合成,但合成图像的质量会随相机姿态而下降(例如,侧视角度下人脸边缘模糊且带噪)。这种退化主要源于在姿态分布严重不均衡的数据集上同时学习姿态一致性与照片级真实感的困难。本文提出 SideGAN,一种新的 3D GAN 训练方法,可以不受相机姿态限制地生成照片级真实的图像,尤其是侧视角度下的人脸。为缓解同时学习照片级真实感与姿态一致性这一难题,我们将其分解为两个更易求解的子问题:一个判别合成图像是否真实,另一个判别合成图像是否与相机姿态一致。基于此,我们提出了带有两个判别分支的双分支判别器,并提出姿态匹配损失以学习 3D GAN 的姿态一致性。此外,我们还提出一种姿态采样策略,以增加姿态不均衡数据集中大角度姿态的学习机会。通过广泛验证,我们证明了该方法能使 3D GAN 生成高质量的几何结构和照片级真实的图像,而不受相机姿态的限制。

Pointing out Human Answer Mistakes in a Goal-Oriented Visual Dialogue

  • paper_url: http://arxiv.org/abs/2309.10375
  • repo_url: None
  • paper_authors: Ryosuke Oshima, Seitaro Shinagawa, Hideki Tsunashima, Qi Feng, Shigeo Morishima
  • for: 该论文旨在研究人工智能与人类交互中的有效沟通方法,以便解决复杂问题。
  • methods: 该论文使用视觉对话来助人类完成问题,并分析了人类答案错误的因素,以提高模型的准确率。
  • results: 经过实验,研究发现人类答案错误的因素包括问题类型和QA转数,这些因素对模型的准确率有重要影响。
    Abstract Effective communication between humans and intelligent agents has promising applications for solving complex problems. One such approach is visual dialogue, which leverages multimodal context to assist humans. However, real-world scenarios occasionally involve human mistakes, which can cause intelligent agents to fail. While most prior research assumes perfect answers from human interlocutors, we focus on a setting where the agent points out unintentional mistakes for the interlocutor to review, better reflecting real-world situations. In this paper, we show that human answer mistakes depend on question type and QA turn in the visual dialogue by analyzing a previously unused data collection of human mistakes. We demonstrate the effectiveness of those factors for the model's accuracy in a pointing-human-mistake task through experiments using a simple MLP model and a Visual Language Model.
    摘要 人机对话可以有效地解决复杂问题,其中一种方法是视觉对话,它利用多ModalContext来帮助人类。然而,实际情况中有时会出现人类的错误,这会导致智能代理人失败。而大多数先前的研究假设了人类回答是完美的,我们则关注实际情况中人类的意外错误,并对这些错误进行分析。在这篇论文中,我们发现人类回答错误的因素取决于问题类型和QA轮次在视觉对话中。我们通过使用简单的MLP模型和视觉语言模型进行实验,证明这些因素对模型准确性的影响。

GloPro: Globally-Consistent Uncertainty-Aware 3D Human Pose Estimation & Tracking in the Wild

  • paper_url: http://arxiv.org/abs/2309.10369
  • repo_url: None
  • paper_authors: Simon Schaefer, Dorian F. Henning, Stefan Leutenegger
  • for: 提高人机交互的安全性和效率,通过提供高精度的3D人体姿态估计。
  • methods: 利用视觉启示和学习的动作模型,效果地融合视觉启示,预测3D人体姿态的不确定性分布,包括形状、姿态和根姿态。
  • results: 与现有方法相比,在世界坐标系中的人体轨迹准确率得到了大幅提高(即使面临严重遮挡),能够提供一致的不确定性分布,并可以在实时下运行。
    Abstract An accurate and uncertainty-aware 3D human body pose estimation is key to enabling truly safe but efficient human-robot interactions. Current uncertainty-aware methods in 3D human pose estimation are limited to predicting the uncertainty of the body posture, while effectively neglecting the body shape and root pose. In this work, we present GloPro, which to the best of our knowledge the first framework to predict an uncertainty distribution of a 3D body mesh including its shape, pose, and root pose, by efficiently fusing visual clues with a learned motion model. We demonstrate that it vastly outperforms state-of-the-art methods in terms of human trajectory accuracy in a world coordinate system (even in the presence of severe occlusions), yields consistent uncertainty distributions, and can run in real-time.
    摘要 准确且具备不确定性感知的3D人体姿态估计,是实现真正安全而高效的人机交互的关键。目前的不确定性感知3D人体姿态估计方法仅预测身体姿态的不确定性,而实际上忽略了身体形状与根部姿态。在本工作中,我们提出 GloPro,据我们所知,这是首个通过高效融合视觉线索与学习到的运动模型,来预测包含形状、姿态和根部姿态在内的3D人体网格不确定性分布的框架。我们证明,该方法在世界坐标系中的人体轨迹精度上大幅超越现有方法(即使存在严重遮挡),能够给出一致的不确定性分布,并可实时运行。

Improving CLIP Robustness with Knowledge Distillation and Self-Training

  • paper_url: http://arxiv.org/abs/2309.10361
  • repo_url: None
  • paper_authors: Clement Laroudie, Andrei Bursuc, Mai Lan Ha, Gianni Franchi
  • for: 本研究旨在评估CLIP模型在无监督学习上的 robustness,同时探索可以增强其 robustness 的策略。
  • methods: 我们提出了一种名为LP-CLIP的新方法,即在CLIP模型的编码结构上添加一个线性探测层,并通过使用CLIP生成的pseudo-标签和自我训练策略来训练这层。
  • results: 我们的LP-CLIP方法可以增强CLIP模型的Robustness,并在多个数据集上达到了SOTA的result。此外,我们的方法不需要标注数据,因此在实际应用中可以更加有效。
    Abstract This paper examines the robustness of a multi-modal computer vision model, CLIP (Contrastive Language-Image Pretraining), in the context of unsupervised learning. The main objective is twofold: first, to evaluate the robustness of CLIP, and second, to explore strategies for augmenting its robustness. To achieve this, we introduce a novel approach named LP-CLIP. This technique involves the distillation of CLIP features through the incorporation of a linear probing layer positioned atop its encoding structure. This newly added layer is trained utilizing pseudo-labels produced by CLIP, coupled with a self-training strategy. The LP-CLIP technique offers a promising approach to enhance the robustness of CLIP without the need for annotations. By leveraging a simple linear probing layer, we aim to improve the model's ability to withstand various uncertainties and challenges commonly encountered in real-world scenarios. Importantly, our approach does not rely on annotated data, which makes it particularly valuable in situations where labeled data might be scarce or costly to obtain. Our proposed approach increases the robustness of CLIP with SOTA results compared to supervised technique on various datasets.
    摘要 本文考察多模态计算机视觉模型 CLIP(对比语言-图像预训练)在无监督学习背景下的鲁棒性,目标有二:其一,评估 CLIP 的鲁棒性;其二,探索增强其鲁棒性的策略。为此,我们提出一种名为 LP-CLIP 的新方法:在 CLIP 的编码结构之上添加一个线性探测层,并利用 CLIP 生成的伪标签结合自训练策略来训练该层,从而对 CLIP 特征进行蒸馏。LP-CLIP 无需任何标注即可提升 CLIP 的鲁棒性,借助简单的线性探测层提高模型应对真实场景中各种不确定性与挑战的能力;由于不依赖标注数据,该方法在标注稀缺或获取成本高的情形下尤其有价值。在多个数据集上,我们的方法相较于监督技术取得了最先进的结果。
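A rough sketch of the LP-CLIP idea: keep the CLIP image encoder frozen, attach a linear probing layer, and train it on pseudo-labels produced by CLIP's own zero-shot text matching. The encoder and text features are stubbed with random tensors; a real CLIP backbone (e.g. from open_clip) would replace them in practice.

```python
# Sketch: linear probe trained with CLIP-style zero-shot pseudo-labels (stubbed features).
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_classes, n_images = 512, 10, 256
image_feats = F.normalize(torch.randn(n_images, feat_dim), dim=-1)  # frozen image features (stub)
text_feats = F.normalize(torch.randn(n_classes, feat_dim), dim=-1)  # class-prompt text features (stub)

with torch.no_grad():
    pseudo_labels = (image_feats @ text_feats.t()).argmax(dim=-1)    # zero-shot pseudo-labels

probe = nn.Linear(feat_dim, n_classes)                               # added linear probing layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for step in range(100):                                              # self-training on pseudo-labels
    logits = probe(image_feats)
    loss = F.cross_entropy(logits, pseudo_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```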

RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing

  • paper_url: http://arxiv.org/abs/2309.10356
  • repo_url: None
  • paper_authors: Jiahang Li, Yikang Zhang, Peng Yun, Guangliang Zhou, Qijun Chen, Rui Fan
  • for: 本研究旨在提出一种基于Transformer的数据融合网络 RoadFormer,用于道路场景分解。
  • methods: RoadFormer 采用双编码器结构分别从 RGB 图像和表面法向信息中提取异构特征,并利用异构特征协同模块进行有效的特征融合与重新校准。
  • results: 在 SYN-UDTIRI 数据集以及 KITTI Road、CityScapes 和 ORFD 三个公共数据集上,RoadFormer 均优于所有现有的最先进网络,并在 KITTI Road 基准上排名第一。
    Abstract The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, the existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that contains over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark. Our source code, created dataset, and demo video are publicly available at mias.group/RoadFormer.
    摘要 深度卷积神经网络的最新进展在道路场景解析领域展现出巨大潜力。然而,现有工作大多聚焦于可行驶区域检测,而较少关注可能危及驾驶安全与舒适性的道路缺陷。本文提出 RoadFormer,一种用于道路场景解析的新型基于 Transformer 的数据融合网络。RoadFormer 采用双编码器结构,从 RGB 图像和表面法向信息中提取异构特征;编码后的特征随后送入新颖的异构特征协同模块进行有效融合与重新校准;像素解码器从融合并校准后的异构特征中学习多尺度长程依赖,再由 Transformer 解码器生成最终的语义预测。此外,我们发布了首个大规模道路场景解析数据集 SYN-UDTIRI,包含超过 10,407 张 RGB 图像、稠密深度图,以及可行驶区域与不同形状尺寸道路缺陷的逐像素标注。在 SYN-UDTIRI 数据集以及 KITTI Road、CityScapes 和 ORFD 三个公共数据集上的大量实验表明,RoadFormer 优于所有现有的最先进道路场景解析网络,并在 KITTI Road 基准上排名第一。源代码、数据集和演示视频公开于 mias.group/RoadFormer。

Language Guided Adversarial Purification

  • paper_url: http://arxiv.org/abs/2309.10348
  • repo_url: None
  • paper_authors: Himanshu Singh, A V Subramanyam
  • for: 防御对抗攻击
  • methods: 利用预训练的扩散模型与图像描述生成器进行语言引导的对抗净化
  • results: 在强对抗攻击下进行评估,表现出优异的防御性能,且无需专门的网络训练
    Abstract Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods known as adversarial training requires specific knowledge of attack vectors, forcing them to be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.
    摘要 使用生成模型进行对抗净化可以达到强大的对抗防御性能。这些方法与分类器和攻击方式无关,因此非常灵活,但往往需要大量计算资源。在扩散和分数网络方面的最新进展改进了图像生成,进而也改进了对抗净化。另一类非常高效的对抗防御方法是对抗训练,但它需要攻击方式的特定知识,因此必须在对抗样本上进行大量训练。为了克服这些限制,我们介绍了一个新的框架,即语言引导对抗净化(LGAP),使用预训练的扩散模型和图像描述生成器来防御对抗攻击。给定一个输入图像,我们的方法首先生成一条图像描述,然后利用它通过扩散网络引导对抗净化过程。我们的方法已经在强对抗攻击下进行评估,证明了它在增强对抗鲁棒性方面的有效性。结果表明,LGAP 在不需要专门网络训练的情况下超越了大多数现有的对抗防御技术。这反映了在大规模数据上训练的模型的泛化能力,指明了一个有前景的研究方向。

Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail

  • paper_url: http://arxiv.org/abs/2309.10336
  • repo_url: None
  • paper_authors: Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao
  • for: 高频几何细节恢复和抗折补新视图渲染
  • methods: 基于级别 detail(LoD)的 voxel 基本表示法, Multi-scale tri-plane Scene 表示法,可以捕捉 LoD 的 signed distance function(SDF)和空间辐射特征。
  • results: 比 state-of-the-art 方法更高效的surface 重建和 photorealistic 视图合成。
    Abstract We present LoD-NeuS, an efficient neural representation for high-frequency geometry detail recovery and anti-aliased novel view rendering. Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance. Our representation aggregates space features from a multi-convolved featurization within a conical frustum along a ray and optimizes the LoD feature volume through differentiable rendering. Additionally, we propose an error-guided sampling strategy to guide the growth of the SDF during the optimization. Both qualitative and quantitative evaluations demonstrate that our method achieves superior surface reconstruction and photorealistic view synthesis compared to state-of-the-art approaches.
    摘要 我们提出 LoD-NeuS,一种用于高频几何细节恢复和抗锯齿新视角渲染的高效神经表示。受带有细节层次(LoD)的体素表示启发,我们引入一种多尺度三平面场景表示,能够捕捉有符号距离函数(SDF)与空间辐射的细节层次。该表示沿光线在圆锥台内通过多重卷积特征聚合空间特征,并通过可微渲染优化 LoD 特征体。此外,我们提出一种误差引导的采样策略,在优化过程中引导 SDF 的增长。定性与定量评估均表明,与最先进方法相比,我们的方法实现了更优的表面重建和照片级真实感的视角合成。

Multi-dimension Queried and Interacting Network for Stereo Image Deraining

  • paper_url: http://arxiv.org/abs/2309.10319
  • repo_url: https://github.com/chdwyb/mqinet
  • paper_authors: Yuanbo Wen, Tao Gao, Ziqi Li, Jing Zhang, Ting Chen
  • for: 该论文旨在高效去除双目图像中的雨痕退化。
  • methods: 该方法通过多维查询与交互构建双目图像去雨模型。具体而言,它使用具有上下文感知的逐维查询模块(CDQB),利用与输入特征无关的逐维查询,并借助全局上下文感知注意力(GCA)捕捉关键特征,同时避免纠缠冗余或无关信息。此外,它引入基于雨天图像逆物理模型的视图内物理感知注意力(IPA),提取对雨痕退化物理过程敏感的浅层特征,帮助在学习早期减少与雨相关的伪影。最后,它结合跨视图多维交互注意力机制(CMIA),增强两个视图之间跨多个维度的特征交互。
  • results: 实验结果表明,该模型优于 EPRRNet 和 StereoIRR,在 PSNR 上分别提升 4.18 dB 和 0.45 dB。代码和模型可在 \url{https://github.com/chdwyb/MQINet} 获取。
    Abstract Eliminating the rain degradation in stereo images poses a formidable challenge, which necessitates the efficient exploitation of mutual information present between the dual views. To this end, we devise MQINet, which employs multi-dimension queries and interactions for stereo image deraining. More specifically, our approach incorporates a context-aware dimension-wise queried block (CDQB). This module leverages dimension-wise queries that are independent of the input features and employs global context-aware attention (GCA) to capture essential features while avoiding the entanglement of redundant or irrelevant information. Meanwhile, we introduce an intra-view physics-aware attention (IPA) based on the inverse physical model of rainy images. IPA extracts shallow features that are sensitive to the physics of rain degradation, facilitating the reduction of rain-related artifacts during the early learning period. Furthermore, we integrate a cross-view multi-dimension interacting attention mechanism (CMIA) to foster comprehensive feature interaction between the two views across multiple dimensions. Extensive experimental evaluations demonstrate the superiority of our model over EPRRNet and StereoIRR, achieving respective improvements of 4.18 dB and 0.45 dB in PSNR. Code and models are available at \url{https://github.com/chdwyb/MQINet}.
    摘要 消除双目图像中的雨痕退化是一项艰巨挑战,需要高效利用双视图之间的互信息。为此,我们设计了 MQINet,利用多维查询与交互进行双目图像去雨。更具体地,我们的方法包含一个具有上下文感知的逐维查询模块(CDQB),该模块利用与输入特征无关的逐维查询,并借助全局上下文感知注意力(GCA)捕捉关键特征,同时避免纠缠冗余或无关信息。此外,我们引入基于雨天图像逆物理模型的视图内物理感知注意力(IPA),提取对雨痕退化物理过程敏感的浅层特征,从而在学习早期减少与雨相关的伪影。我们还结合跨视图多维交互注意力机制(CMIA),促进两个视图之间跨多个维度的全面特征交互。大量实验评估表明,我们的模型优于 EPRRNet 和 StereoIRR,在 PSNR 上分别提升 4.18 dB 和 0.45 dB。代码和模型可在 \url{https://github.com/chdwyb/MQINet} 获取。

Dive Deeper into Rectifying Homography for Stereo Camera Online Self-Calibration

  • paper_url: http://arxiv.org/abs/2309.10314
  • repo_url: None
  • paper_authors: Hongbo Zhao, Yikang Zhang, Qijun Chen, Rui Fan
  • for: 本文提出了一种基于校正单应(rectifying homography)的双目相机在线自标定算法,可在仅有一对图像时对双目相机的外参进行标定。
  • methods: 本文引入了一种简单而有效的全局最优外参估计方法(适用于双目视频序列),并提出四种新的评价指标,用于评估外参估计的精度与稳健性。
  • results: 大量实验结果表明,与基线算法相比,所提算法在室内和室外环境下的多种实验设置中均具有更优的性能。
    Abstract Accurate estimation of stereo camera extrinsic parameters is the key to guarantee the performance of stereo matching algorithms. In prior arts, the online self-calibration of stereo cameras has commonly been formulated as a specialized visual odometry problem, without taking into account the principles of stereo rectification. In this paper, we first delve deeply into the concept of rectifying homography, which serves as the cornerstone for the development of our novel stereo camera online self-calibration algorithm, for cases where only a single pair of images is available. Furthermore, we introduce a simple yet effective solution for global optimum extrinsic parameter estimation in the presence of stereo video sequences. Additionally, we emphasize the impracticality of using three Euler angles and three components in the translation vectors for performance quantification. Instead, we introduce four new evaluation metrics to quantify the robustness and accuracy of extrinsic parameter estimation, applicable to both single-pair and multi-pair cases. Extensive experiments conducted across indoor and outdoor environments using various experimental setups validate the effectiveness of our proposed algorithm. The comprehensive evaluation results demonstrate its superior performance in comparison to the baseline algorithm. Our source code, demo video, and supplement are publicly available at mias.group/StereoCalibrator.
    摘要 准确估计双目相机的外参是保证立体匹配算法性能的关键。以往工作通常将双目相机的在线自标定建模为一个专门的视觉里程计问题,而没有考虑立体校正的原理。本文首先深入探讨校正单应(rectifying homography)的概念,它是我们针对仅有一对图像的情形所提出的新型双目相机在线自标定算法的基石。在此基础上,我们针对存在双目视频序列的情形,提出一种简单而有效的全局最优外参估计方法。此外,我们指出使用三个欧拉角和平移向量的三个分量来量化性能并不实用,转而引入四种新的评价指标,用于量化外参估计的稳健性与精度,适用于单对与多对图像两种情形。我们在室内和室外环境中采用多种实验设置进行了大量实验,验证了所提算法的有效性;综合评估结果表明其性能优于基线算法。源代码、演示视频和补充材料公开于 mias.group/StereoCalibrator。
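For orientation, here is the generic OpenCV route for estimating stereo extrinsics from a single image pair: match features, estimate the essential matrix with RANSAC, and decompose it into rotation and an up-to-scale translation. This is not the paper's rectifying-homography formulation or its global-optimum solver, just a common baseline sketch.

```python
# Sketch: single-pair stereo extrinsics via essential-matrix decomposition (baseline, not the paper's method).
import cv2
import numpy as np

def stereo_extrinsics(img_left, img_right, K):
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_left, None)
    kp2, des2 = orb.detectAndCompute(img_right, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t is only defined up to scale from a single pair
```

Called with the two grayscale frames and the shared intrinsic matrix K; recovering a metric translation additionally requires the known stereo baseline.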

Decoupled Training: Return of Frustratingly Easy Multi-Domain Learning

  • paper_url: http://arxiv.org/abs/2309.10302
  • repo_url: None
  • paper_authors: Ximei Wang, Junwei Pan, Xingzhuo Guo, Dapeng Liu, Jie Jiang
  • for: 本研究的目的是提出一种简单且无需超参数的多领域学习方法,以解决多领域之间的数据集偏差和领域主导问题。
  • methods: 本研究采用一种从通用到特定的三阶段训练策略:首先在所有领域上进行预热训练,然后拆分出多个领域头并在各领域上继续训练,最后冻结主干网络,仅微调各领域头。
  • results: 研究表明,该方法在从标准基准到卫星图像和推荐系统等应用的各种数据集上均取得了出色的性能。
    Abstract Multi-domain learning (MDL) aims to train a model with minimal average risk across multiple overlapping but non-identical domains. To tackle the challenges of dataset bias and domain domination, numerous MDL approaches have been proposed from the perspectives of seeking commonalities by aligning distributions to reduce domain gap or reserving differences by implementing domain-specific towers, gates, and even experts. MDL models are becoming more and more complex with sophisticated network architectures or loss functions, introducing extra parameters and enlarging computation costs. In this paper, we propose a frustratingly easy and hyperparameter-free multi-domain learning method named Decoupled Training(D-Train). D-Train is a tri-phase general-to-specific training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multi heads, and finally fine-tunes the heads by fixing the backbone, enabling decouple training to achieve domain independence. Despite its extraordinary simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.
    摘要 In this paper, we propose a simple and hyperparameter-free multi-domain learning method called Decoupled Training (D-Train). D-Train is a three-phase training strategy that first pre-trains on all domains to warm up a root model, then post-trains on each domain by splitting into multiple heads, and finally fine-tunes the heads by fixing the backbone. This approach enables decoupled training to achieve domain independence.Despite its simplicity and efficiency, D-Train performs remarkably well in extensive evaluations of various datasets from standard benchmarks to applications of satellite imagery and recommender systems.

360$^\circ$ Reconstruction From a Single Image Using Space Carved Outpainting

  • paper_url: http://arxiv.org/abs/2309.10279
  • repo_url: None
  • paper_authors: Nuri Ryu, Minsu Gong, Geonung Kim, Joo-Haeng Lee, Sunghyun Cho
  • for: 从单张图像创建可 360° 全方位观看的3D模型
  • methods: 结合单目深度与法向预测、空间雕刻、在大规模图像数据集上预训练的生成模型,以及神经隐式表面重建方法
  • results: 提供了一种泛化能力强的方法,能够对各类真实场景图像进行高质量3D重建,并在重建保真度和自然度上显著优于同期工作
    Abstract We introduce POP3D, a novel framework that creates a full $360^\circ$-view 3D model from a single image. POP3D resolves two prominent issues that limit the single-view reconstruction. Firstly, POP3D offers substantial generalizability to arbitrary categories, a trait that previous methods struggle to achieve. Secondly, POP3D further improves reconstruction fidelity and naturalness, a crucial aspect that concurrent works fall short of. Our approach marries the strengths of four primary components: (1) a monocular depth and normal predictor that serves to predict crucial geometric cues, (2) a space carving method capable of demarcating the potentially unseen portions of the target object, (3) a generative model pre-trained on a large-scale image dataset that can complete unseen regions of the target, and (4) a neural implicit surface reconstruction method tailored in reconstructing objects using RGB images along with monocular geometric cues. The combination of these components enables POP3D to readily generalize across various in-the-wild images and generate state-of-the-art reconstructions, outperforming similar works by a significant margin. Project page: \url{http://cg.postech.ac.kr/research/POP3D}
    摘要 我们提出 POP3D,一个能够从单张图像生成完整 360° 视角3D模型的新框架。POP3D 解决了制约单视角重建的两个突出问题:其一,POP3D 对任意类别具有可观的泛化能力,这是以往方法难以做到的;其二,POP3D 进一步提升了重建的保真度与自然度,这正是同期工作所欠缺的关键方面。我们的方法融合了四个主要组件的优势:(1)单目深度与法向预测器,用于预测关键的几何线索;(2)空间雕刻方法,用于划定目标物体中可能不可见的部分;(3)在大规模图像数据集上预训练的生成模型,用于补全目标物体的未见区域;(4)针对利用 RGB 图像和单目几何线索重建物体而设计的神经隐式表面重建方法。这些组件的结合使 POP3D 能够轻松泛化到各类真实场景图像,并生成最先进的重建结果,显著优于同类工作。项目主页:\url{http://cg.postech.ac.kr/research/POP3D}

RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery

  • paper_url: http://arxiv.org/abs/2309.10255
  • repo_url: None
  • paper_authors: Jiaxin Wei, Xibin Song, Weizhe Liu, Laurent Kneip, Hongdong Li, Pan Ji
  • for: 提出一种新的流程,以缓解 RGB-D 类别级物体姿态估计方法对深度传感器的高度依赖所带来的应用限制。
  • methods: 利用预训练的单目估计器提取局部几何信息,以辅助 2D-3D 内点对应的搜索;另外设计一个独立分支,基于类别级统计直接回归物体的度量尺度。
  • results: 在合成和真实数据集上进行了大量实验,证明该方法的精度高于以往基于 RGB 的方法,尤其是在旋转精度方面。
    Abstract While showing promising results, recent RGB-D camera-based category-level object pose estimation methods have restricted applications due to the heavy reliance on depth sensors. RGB-only methods provide an alternative to this problem yet suffer from inherent scale ambiguity stemming from monocular observations. In this paper, we propose a novel pipeline that decouples the 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations. Specifically, we leverage a pre-trained monocular estimator to extract local geometric information, mainly facilitating the search for inlier 2D-3D correspondence. Meanwhile, a separate branch is designed to directly recover the metric scale of the object based on category-level statistics. Finally, we advocate using the RANSAC-P$n$P algorithm to robustly solve for 6D object pose. Extensive experiments have been conducted on both synthetic and real datasets, demonstrating the superior performance of our method over previous state-of-the-art RGB-based approaches, especially in terms of rotation accuracy.
    摘要 尽管近期基于 RGB-D 相机的类别级物体姿态估计方法展现出不错的效果,但由于严重依赖深度传感器,其应用受到限制。仅使用 RGB 的方法可以缓解这一问题,但由于单目观测,存在固有的尺度歧义。本文提出一种新的流程,将 6D 姿态与尺寸估计解耦,以减轻不完美尺度对刚体变换的影响。具体而言,我们利用预训练的单目估计器提取局部几何信息,主要用于辅助 2D-3D 内点对应的搜索;同时设计一个独立分支,基于类别级统计直接回归物体的度量尺度。最后,我们采用 RANSAC-P$n$P 算法稳健地求解 6D 物体姿态。我们在合成和真实数据集上进行了大量实验,结果表明该方法优于以往最先进的基于 RGB 的方法,尤其是在旋转精度方面。
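To illustrate the final pose-solving step, the sketch below feeds putative 2D-3D correspondences and a camera intrinsic matrix into OpenCV's RANSAC-PnP solver. The correspondences and intrinsics here are synthetic; in the paper they come from matching monocular geometric cues to a category-level shape whose metric scale is regressed by a separate branch.

```python
# Sketch: robust 6D pose from 2D-3D correspondences with RANSAC-PnP (synthetic data).
import cv2
import numpy as np

K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])   # assumed intrinsics
obj_pts = np.random.uniform(-0.1, 0.1, (50, 3))               # 3D points in the object frame (metres)
R_gt, _ = cv2.Rodrigues(np.array([0.1, -0.2, 0.05]))          # ground-truth rotation
t_gt = np.array([[0.0], [0.0], [0.5]])                        # ground-truth translation
proj = (K @ (R_gt @ obj_pts.T + t_gt)).T
img_pts = proj[:, :2] / proj[:, 2:3]                          # projected 2D observations

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts.astype(np.float32), img_pts.astype(np.float32), K, distCoeffs=None,
    reprojectionError=3.0, iterationsCount=100)
print(ok, tvec.ravel(), len(inliers))                          # tvec should be close to t_gt
```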

UPL-SFDA: Uncertainty-aware Pseudo Label Guided Source-Free Domain Adaptation for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.10244
  • repo_url: https://github.com/hilab-git/upl-sfda
  • paper_authors: Jianghao Wu, Guotai Wang, Ran Gu, Tao Lu, Yinan Chen, Wentao Zhu, Tom Vercauteren, Sébastien Ourselin, Shaoting Zhang
  • for: 这项研究旨在提高基于深度学习的医学图像分割模型对新目标域测试图像的适应能力,特别是在源域数据不可用且目标域数据没有标注的情况下进行自适应。
  • methods: 本研究提出了一种新的不确定性感知伪标签引导的无源域自适应(SFDA)方法,包括目标域增长(TDG)策略、两次前向传播监督(TFS)策略,以及基于平均预测的熵最小化项。
  • results: 本研究在多中心心脏 MRI 分割、跨模态胎儿脑分割和 3D 胎儿组织分割三个任务上验证了 UPL-SFDA 方法,与基线相比,平均 Dice 分别提高了 5.54、5.01 和 6.89 个百分点。此外,UPL-SFDA 还优于多种现有的 SFDA 方法。
    Abstract Domain Adaptation (DA) is important for deep learning-based medical image segmentation models to deal with testing images from a new target domain. As the source-domain data are usually unavailable when a trained model is deployed at a new center, Source-Free Domain Adaptation (SFDA) is appealing for data and annotation-efficient adaptation to the target domain. However, existing SFDA methods have a limited performance due to lack of sufficient supervision with source-domain images unavailable and target-domain images unlabeled. We propose a novel Uncertainty-aware Pseudo Label guided (UPL) SFDA method for medical image segmentation. Specifically, we propose Target Domain Growing (TDG) to enhance the diversity of predictions in the target domain by duplicating the pre-trained model's prediction head multiple times with perturbations. The different predictions in these duplicated heads are used to obtain pseudo labels for unlabeled target-domain images and their uncertainty to identify reliable pseudo labels. We also propose a Twice Forward pass Supervision (TFS) strategy that uses reliable pseudo labels obtained in one forward pass to supervise predictions in the next forward pass. The adaptation is further regularized by a mean prediction-based entropy minimization term that encourages confident and consistent results in different prediction heads. UPL-SFDA was validated with a multi-site heart MRI segmentation dataset, a cross-modality fetal brain segmentation dataset, and a 3D fetal tissue segmentation dataset. It improved the average Dice by 5.54, 5.01 and 6.89 percentage points for the three tasks compared with the baseline, respectively, and outperformed several state-of-the-art SFDA methods.
    摘要 域自适应(DA)对于基于深度学习的医学图像分割模型处理新目标域的测试图像十分重要。由于训练好的模型部署到新的中心时通常无法获得源域数据,无源域自适应(SFDA)对于数据和标注高效的目标域自适应具有吸引力。然而,在源域图像不可用且目标域图像无标注的情况下,现有 SFDA 方法由于缺乏足够的监督,性能有限。我们提出一种新的不确定性感知伪标签引导(UPL)的 SFDA 医学图像分割方法。具体而言,我们提出目标域增长(TDG),通过对预训练模型的预测头进行多次带扰动的复制,来提升目标域预测的多样性;利用这些复制头的不同预测为无标注的目标域图像生成伪标签,并依据其不确定性筛选可靠的伪标签。我们还提出两次前向传播监督(TFS)策略,利用第一次前向传播得到的可靠伪标签来监督第二次前向传播的预测。此外,自适应过程还通过基于平均预测的熵最小化项进行正则化,以鼓励不同预测头给出自信且一致的结果。UPL-SFDA 在多中心心脏 MRI 分割、跨模态胎儿脑分割和 3D 胎儿组织分割数据集上得到验证,与基线相比,三个任务的平均 Dice 分别提高 5.54、5.01 和 6.89 个百分点,并优于多种最先进的 SFDA 方法。
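A toy sketch of the uncertainty-aware pseudo-labelling idea: approximate the duplicated, perturbed prediction heads, average their softmax outputs into pseudo-labels, use the variance across heads as a per-pixel uncertainty to keep only reliable pixels, and add a mean-prediction entropy term. Tensor shapes, the perturbation scheme, and the reliability threshold are illustrative, not the paper's exact configuration.

```python
# Sketch: multi-head pseudo-labels with uncertainty filtering and entropy regularisation.
import torch
import torch.nn.functional as F

n_heads, n_classes, h, w = 4, 3, 32, 32
base_logits = torch.randn(1, n_classes, h, w)
# Duplicated prediction heads approximated by small perturbations of one output.
head_probs = torch.stack([
    F.softmax(base_logits + 0.1 * torch.randn_like(base_logits), dim=1)
    for _ in range(n_heads)])                                   # (n_heads, 1, C, h, w)

mean_prob = head_probs.mean(dim=0)                              # averaged prediction
pseudo_label = mean_prob.argmax(dim=1)                          # pseudo-labels, (1, h, w)
uncertainty = head_probs.var(dim=0).sum(dim=1)                  # disagreement across heads
reliable = uncertainty < uncertainty.flatten().quantile(0.75)   # keep the most consistent pixels

log_prob = mean_prob.clamp_min(1e-8).log()
ce_map = F.nll_loss(log_prob, pseudo_label, reduction="none")   # per-pixel pseudo-label loss
loss_pseudo = (ce_map * reliable.float()).sum() / reliable.float().sum().clamp_min(1.0)
loss_entropy = -(mean_prob * log_prob).sum(dim=1).mean()        # mean-prediction entropy
loss = loss_pseudo + 0.1 * loss_entropy
print(float(loss))
```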

Transferable Adversarial Attack on Image Tampering Localization

  • paper_url: http://arxiv.org/abs/2309.10243
  • repo_url: None
  • paper_authors: Yuqi Wang, Gang Cao, Zijie Lou, Haochen Zhu
  • for: 评估现有数字图像篡改定位算法在实际应用中的安全性。
  • methods: 提出一种针对篡改定位算法的对抗攻击方案,以揭示其可靠性,使其无法正确预测被篡改的区域;具体通过基于优化和梯度的方式实现白盒/黑盒攻击。
  • results: 大量评估表明,所提攻击方案能够显著降低定位准确率,同时保持被攻击图像较高的视觉质量。
    Abstract It is significant to evaluate the security of existing digital image tampering localization algorithms in real-world applications. In this paper, we propose an adversarial attack scheme to reveal the reliability of such tampering localizers, which would be fooled and fail to predict altered regions correctly. Specifically, the adversarial examples based on optimization and gradient are implemented for white/black-box attacks. Correspondingly, the adversarial example is optimized via reverse gradient propagation, and the perturbation is added adaptively in the direction of gradient rising. The black-box attack is achieved by relying on the transferability of such adversarial examples to different localizers. Extensive evaluations verify that the proposed attack sharply reduces the localization accuracy while preserving high visual quality of the attacked images.
    摘要 在实际应用中评估现有数字图像篡改定位算法的安全性具有重要意义。本文提出一种对抗攻击方案,以揭示此类篡改定位器的可靠性:它们会被欺骗,从而无法正确预测被篡改的区域。具体而言,基于优化和梯度生成对抗样本,用于白盒/黑盒攻击。相应地,对抗样本通过反向梯度传播进行优化,并沿梯度上升的方向自适应地添加扰动;黑盒攻击则依靠此类对抗样本在不同定位器之间的可迁移性来实现。大量评估验证了所提攻击能够显著降低定位准确率,同时保持被攻击图像较高的视觉质量。
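A generic sketch of this style of white-box attack: iteratively perturb the image along the sign of the loss gradient so that the localizer's predicted tampering mask moves away from the true mask, under an L-infinity budget. The stand-in convolutional localizer, step sizes, and the PGD-like update are generic choices, not the paper's exact optimisation.

```python
# Sketch: PGD-style gradient attack against a toy tampering localizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

localizer = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 1, 3, padding=1))      # toy stand-in localizer

def attack(image, true_mask, eps=8 / 255, alpha=2 / 255, steps=10):
    adv = image.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        pred = torch.sigmoid(localizer(adv))
        loss = F.binary_cross_entropy(pred, true_mask)         # distance to the true mask
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                    # gradient *ascent* on the loss
            adv = image + (adv - image).clamp(-eps, eps)       # stay inside the budget
            adv = adv.clamp(0, 1)
    return adv.detach()

img = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.8).float()
adv_img = attack(img, mask)
print((adv_img - img).abs().max())                             # <= eps
```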

Learning Point-wise Abstaining Penalty for Point Cloud Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.10230
  • repo_url: https://github.com/daniellli/pad
  • paper_authors: Shaocong Xu, Pengfei Li, Xinyu Liu, Qianpu Sun, Yang Li, Shihui Guo, Zhen Wang, Bo Jiang, Rui Wang, Kehua Sheng, Bo Zhang, Hao Zhao
  • for: 该论文旨在提高自动驾驶系统 LiDAR 场景理解模块中的分布外(OOD)点识别能力。
  • methods: 该方法基于选择性分类,在标准闭集分类设定中引入选择函数,并通过基于间隔的损失学习逐点的弃权惩罚;同时提出一个强大的合成管道,从不真实的物体类别、采样模式和尺寸等多种因素合成离群点,以逼近无限的 OOD 样本。
  • results: 该方法在 SemanticKITTI 和 nuScenes 上取得了最先进的结果,并通过风险-覆盖率分析揭示了不同方法的内在性质。代码和模型将公开发布。
    Abstract LiDAR-based semantic scene understanding is an important module in the modern autonomous driving perception stack. However, identifying Out-Of-Distribution (OOD) points in a LiDAR point cloud is challenging as point clouds lack semantically rich features when compared with RGB images. We revisit this problem from the perspective of selective classification, which introduces a selective function into the standard closed-set classification setup. Our solution is built upon the basic idea of abstaining from choosing any known categories but learns a point-wise abstaining penalty with a marginbased loss. Synthesizing outliers to approximate unlimited OOD samples is also critical to this idea, so we propose a strong synthesis pipeline that generates outliers originated from various factors: unrealistic object categories, sampling patterns and sizes. We demonstrate that learning different abstaining penalties, apart from point-wise penalty, for different types of (synthesized) outliers can further improve the performance. We benchmark our method on SemanticKITTI and nuScenes and achieve state-of-the-art results. Risk-coverage analysis further reveals intrinsic properties of different methods. Codes and models will be publicly available.
    摘要 基于 LiDAR 的语义场景理解是现代自动驾驶感知栈中的重要模块。然而,与 RGB 图像相比,点云缺乏丰富的语义特征,因此在 LiDAR 点云中识别分布外(OOD)点颇具挑战。我们从选择性分类的角度重新审视这一问题,即在标准闭集分类设定中引入选择函数。我们的方案基于“放弃选择任何已知类别”的基本思想,并通过基于间隔的损失学习逐点的弃权惩罚。合成离群点以逼近无限的 OOD 样本同样十分关键,因此我们提出一个强大的合成管道,从不真实的物体类别、采样模式和尺寸等多种因素生成离群点。我们进一步表明,除逐点惩罚外,为不同类型的(合成)离群点学习不同的弃权惩罚可以进一步提升性能。我们在 SemanticKITTI 和 nuScenes 上对方法进行了基准测试,取得了最先进的结果;风险-覆盖率分析进一步揭示了不同方法的内在特性。代码和模型将公开发布。
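As a generic illustration of selective classification with an abstain option, the sketch below adds an extra "abstain" logit per point and penalises abstention through a reward-style coefficient, in the spirit of a learned abstaining penalty. It follows the well-known "deep gamblers" formulation as a stand-in; the paper's margin-based, per-point learned penalty and outlier-type-specific penalties are not reproduced here.

```python
# Sketch: selective classification loss with an abstain column (generic, not the paper's loss).
import torch
import torch.nn.functional as F

def abstain_loss(logits, targets, penalty=2.2):
    """logits: (N, C+1) where the last column is the abstain score."""
    probs = F.softmax(logits, dim=-1)
    p_class = probs[..., :-1]
    p_abstain = probs[..., -1]
    p_true = p_class.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Either be right, or pay 1/penalty of the abstain mass toward the answer.
    return -torch.log(p_true + p_abstain / penalty + 1e-8).mean()

logits = torch.randn(1024, 21)            # 20 known classes + 1 abstain column
targets = torch.randint(0, 20, (1024,))
print(float(abstain_loss(logits, targets)))
```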

Learning Dynamic MRI Reconstruction with Convolutional Network Assisted Reconstruction Swin Transformer

  • paper_url: http://arxiv.org/abs/2309.10227
  • repo_url: None
  • paper_authors: Di Xu, Hengjie Liu, Dan Ruan, Ke Sheng
  • for: 这篇论文旨在加速需要运动跟踪的动态磁共振成像(DMRI)在诊断任务中的获取与重建。
  • methods: 论文在压缩感知(Compressed sensing)与深度学习的基础上,提出了一种基于 Transformer 的新型重建架构 Reconstruction Swin Transformer(RST),用于 4D MRI 重建;并使用名为 SADXNet 的卷积网络对 2D MR 帧进行快速初始化,以降低模型复杂度、GPU 硬件需求和训练时间。
  • results: 实验结果表明,RST 在心脏 4D MR 数据集上表现出色,在 9 倍加速的验证序列上 RMSE 为 0.0286±0.0199,1-SSIM 为 0.0872±0.0783。
    Abstract Dynamic magnetic resonance imaging (DMRI) is an effective imaging tool for diagnosis tasks that require motion tracking of a certain anatomy. To speed up DMRI acquisition, k-space measurements are commonly undersampled along spatial or spatial-temporal domains. The difficulty of recovering useful information increases with increasing undersampling ratios. Compress sensing was invented for this purpose and has become the most popular method until deep learning (DL) based DMRI reconstruction methods emerged in the past decade. Nevertheless, existing DL networks are still limited in long-range sequential dependency understanding and computational efficiency and are not fully automated. Considering the success of Transformers positional embedding and "swin window" self-attention mechanism in the vision community, especially natural video understanding, we hereby propose a novel architecture named Reconstruction Swin Transformer (RST) for 4D MRI. RST inherits the backbone design of the Video Swin Transformer with a novel reconstruction head introduced to restore pixel-wise intensity. A convolution network called SADXNet is used for rapid initialization of 2D MR frames before RST learning to effectively reduce the model complexity, GPU hardware demand, and training time. Experimental results in the cardiac 4D MR dataset further substantiate the superiority of RST, achieving the lowest RMSE of 0.0286 +/- 0.0199 and 1 - SSIM of 0.0872 +/- 0.0783 on 9 times accelerated validation sequences.
    摘要 动态磁共振成像(DMRI)是一种对需要跟踪特定解剖结构运动的诊断任务十分有效的成像工具。为加速 DMRI 获取,通常沿空间或时空域对 k 空间测量进行欠采样;欠采样率越高,恢复有用信息的难度越大。压缩感知正是为此而提出,并一直是最流行的方法,直到近十年基于深度学习(DL)的 DMRI 重建方法出现。然而,现有的 DL 网络在长程序列依赖的理解和计算效率方面仍有局限,且尚未完全自动化。鉴于 Transformer 的位置编码和“滑窗(swin window)”自注意力机制在视觉领域(尤其是自然视频理解)中的成功,我们提出一种用于 4D MRI 的新架构 Reconstruction Swin Transformer(RST)。RST 继承了 Video Swin Transformer 的主干设计,并引入新的重建头以恢复逐像素强度。在 RST 学习之前,使用名为 SADXNet 的卷积网络对 2D MR 帧进行快速初始化,从而有效降低模型复杂度、GPU 硬件需求和训练时间。在心脏 4D MR 数据集上的实验结果进一步证明了 RST 的优越性:在 9 倍加速的验证序列上,RMSE 为 0.0286 ± 0.0199,1 - SSIM 为 0.0872 ± 0.0783。