cs.CV - 2023-09-28

On the Contractivity of Plug-and-Play Operators

  • paper_url: http://arxiv.org/abs/2309.16899
  • repo_url: https://github.com/bhartendu-kumar/pnp-conv
  • paper_authors: Chirayu D. Athalye, Kunal N. Chaudhury, Bhartendu Kumar
  • for: This paper studies the theoretical underpinnings of plug-and-play (PnP) regularization, in particular its convergence behavior and practical performance.
  • methods: The proximal operator in proximal algorithms such as ISTA and ADMM is replaced by a powerful denoiser (the PnP substitution), and convergence is analyzed via the contraction mapping theorem (a toy PnP-ISTA example follows this entry).
  • results: PnP methods are shown to reach state-of-the-art results across imaging applications, and the paper provides a linear-convergence analysis that helps explain this success: under mild conditions, PnP-ISTA and PnP-ADMM converge linearly for symmetric denoisers, and for kernel denoisers they converge linearly for image inpainting.
    Abstract In plug-and-play (PnP) regularization, the proximal operator in algorithms such as ISTA and ADMM is replaced by a powerful denoiser. This formal substitution works surprisingly well in practice. In fact, PnP has been shown to give state-of-the-art results for various imaging applications. The empirical success of PnP has motivated researchers to understand its theoretical underpinnings and, in particular, its convergence. It was shown in prior work that for kernel denoisers such as the nonlocal means, PnP-ISTA provably converges under some strong assumptions on the forward model. The present work is motivated by the following questions: Can we relax the assumptions on the forward model? Can the convergence analysis be extended to PnP-ADMM? Can we estimate the convergence rate? In this letter, we resolve these questions using the contraction mapping theorem: (i) for symmetric denoisers, we show that (under mild conditions) PnP-ISTA and PnP-ADMM exhibit linear convergence; and (ii) for kernel denoisers, we show that PnP-ISTA and PnP-ADMM converge linearly for image inpainting. We validate our theoretical findings using reconstruction experiments.
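
The PnP substitution described above is easy to illustrate end to end. The sketch below runs PnP-ISTA on a toy 1D inpainting problem with a symmetric, kernel-style denoiser; the signal, kernel bandwidth, step size, and denoiser construction are illustrative assumptions and not the paper's setup.

```python
# Toy PnP-ISTA for inpainting: gradient step on the data term, then a denoiser
# in place of the proximal operator (minimal sketch, assuming a hand-built
# symmetric Gaussian-affinity "denoiser" W).
import numpy as np

rng = np.random.default_rng(0)
n = 64
x_true = np.sin(np.linspace(0, 4 * np.pi, n))     # ground-truth 1D "image"
mask = rng.random(n) < 0.5                        # inpainting: observe ~50% of samples
y = mask * x_true                                 # measurements

# Symmetric denoiser W: row-normalized Gaussian affinity, then symmetrized.
idx = np.arange(n)
K = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2 * 3.0 ** 2))
W = K / K.sum(axis=1, keepdims=True)
W = 0.5 * (W + W.T)

gamma = 1.0                                       # step size for the masked least-squares term
x = y.astype(float).copy()
for _ in range(200):
    grad = mask * (x - y)                         # gradient of 0.5 * ||mask * (x - y)||^2
    x = W @ (x - gamma * grad)                    # denoiser replaces prox: PnP-ISTA update

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

When the composed operator is a contraction, as the paper establishes for symmetric and kernel denoisers under its stated conditions, iterations of this form converge linearly.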

Superpixel Transformers for Efficient Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.16889
  • repo_url: None
  • paper_authors: Alex Zihao Zhu, Jieru Mei, Siyuan Qiao, Hang Yan, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar
  • for: The paper targets semantic segmentation, i.e., classifying every pixel in an image.
  • methods: The authors combine the idea of superpixels with a modern transformer framework: the model decomposes the pixel space into a spatially low-dimensional superpixel space via a series of local cross-attentions, applies multi-head self-attention to enrich the superpixel features with global context, directly predicts a class for each superpixel, and finally projects the superpixel class predictions back into pixel space using the associations between superpixels and pixel features (a schematic sketch follows this entry).
  • results: The method matches the state of the art on Cityscapes and ADE20K while using fewer model parameters and achieving lower latency.
    Abstract Semantic segmentation, which aims to classify every pixel in an image, is a key task in machine perception, with many applications across robotics and autonomous driving. Due to the high dimensionality of this task, most existing approaches use local operations, such as convolutions, to generate per-pixel features. However, these methods are typically unable to effectively leverage global context information due to the high computational costs of operating on a dense image. In this work, we propose a solution to this issue by leveraging the idea of superpixels, an over-segmentation of the image, and applying them with a modern transformer framework. In particular, our model learns to decompose the pixel space into a spatially low dimensional superpixel space via a series of local cross-attentions. We then apply multi-head self-attention to the superpixels to enrich the superpixel features with global context and then directly produce a class prediction for each superpixel. Finally, we directly project the superpixel class predictions back into the pixel space using the associations between the superpixels and the image pixel features. Reasoning in the superpixel space allows our method to be substantially more computationally efficient compared to convolution-based decoder methods. Yet, our method achieves state-of-the-art performance in semantic segmentation due to the rich superpixel features generated by the global self-attention mechanism. Our experiments on Cityscapes and ADE20K demonstrate that our method matches the state of the art in terms of accuracy, while outperforming in terms of model parameters and latency.
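
As a rough picture of the decoding path summarized above, the following numpy sketch soft-assigns pixels to a small set of superpixel tokens, runs single-head self-attention over those tokens, and projects per-superpixel class logits back to the pixels. The dimensions, single-head attention, and random features are illustrative assumptions, not the paper's architecture.

```python
# Schematic superpixel-transformer decoding path (toy numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
P, S, D, C = 256, 16, 32, 5          # pixels, superpixel tokens, feature dim, classes

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

pixel_feats = rng.standard_normal((P, D))
sp_queries = rng.standard_normal((S, D))

# Soft pixel-to-superpixel associations (stand-in for the local cross-attentions).
assoc = softmax(pixel_feats @ sp_queries.T / np.sqrt(D), axis=1)        # (P, S)
sp_feats = assoc.T @ pixel_feats / (assoc.sum(axis=0)[:, None] + 1e-8)  # (S, D)

# Self-attention among the few superpixel tokens adds global context cheaply.
attn = softmax(sp_feats @ sp_feats.T / np.sqrt(D), axis=1)
sp_feats = attn @ sp_feats

# Per-superpixel class logits, projected back to pixels via the associations.
sp_logits = sp_feats @ rng.standard_normal((D, C))                      # (S, C)
pixel_logits = assoc @ sp_logits                                        # (P, C)
print(pixel_logits.shape)
```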

LEF: Late-to-Early Temporal Fusion for LiDAR 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.16870
  • repo_url: None
  • paper_authors: Tong He, Pei Sun, Zhaoqi Leng, Chenxi Liu, Dragomir Anguelov, Mingxing Tan
  • for: The work aims to improve 3D object detection, especially for challenging objects, by fusing object-aware latent embeddings into the early stages of the detector so that it better captures object shapes and poses.
  • methods: A late-to-early recurrent feature fusion scheme applies window-based attention blocks to temporally calibrated and aligned sparse pillar tokens; bird's-eye-view foreground pillar segmentation cuts the number of sparse history features the model must fuse by 10x, and a stochastic-length FrameDrop training technique generalizes the model to variable frame lengths at inference without retraining (a minimal FrameDrop sketch follows this entry).
  • results: On the Waymo Open Dataset, the method improves 3D object detection over the baseline model, especially for the challenging category of large objects.
    Abstract We propose a late-to-early recurrent feature fusion scheme for 3D object detection using temporal LiDAR point clouds. Our main motivation is fusing object-aware latent embeddings into the early stages of a 3D object detector. This feature fusion strategy enables the model to better capture the shapes and poses for challenging objects, compared with learning from raw points directly. Our method conducts late-to-early feature fusion in a recurrent manner. This is achieved by enforcing window-based attention blocks upon temporally calibrated and aligned sparse pillar tokens. Leveraging bird's eye view foreground pillar segmentation, we reduce the number of sparse history features that our model needs to fuse into its current frame by 10$\times$. We also propose a stochastic-length FrameDrop training technique, which generalizes the model to variable frame lengths at inference for improved performance without retraining. We evaluate our method on the widely adopted Waymo Open Dataset and demonstrate improvement on 3D object detection against the baseline model, especially for the challenging category of large objects.
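
The stochastic-length FrameDrop idea amounts to randomly varying how much history the model sees during training so it tolerates variable frame lengths at inference. A minimal sketch, assuming a uniform choice over how many of the most recent frames to keep (the paper's exact sampling scheme is not reproduced here):

```python
# Stochastic-length FrameDrop-style augmentation: keep a random-length suffix of the
# history at each training step (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)

def frame_drop(history_frames, min_keep=1):
    """history_frames is ordered oldest -> newest; return a random-length suffix."""
    keep = int(rng.integers(min_keep, len(history_frames) + 1))
    return history_frames[-keep:]

history = [f"frame_t-{k}" for k in range(7, 0, -1)]   # seven past frames
for _ in range(3):
    print(frame_drop(history))
```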

Stochastic Digital Twin for Copy Detection Patterns

  • paper_url: http://arxiv.org/abs/2309.16866
  • repo_url: None
  • paper_authors: Yury Belousov, Olga Taran, Vitaliy Kinakh, Slava Voloshynovskiy
  • for: The paper aims to strengthen product protection against counterfeiting by using computational modelling, in particular a "digital twin" of the printing-imaging channel, to improve the scalability and optimization of copy detection pattern (CDP) authentication systems.
  • methods: A machine-learning digital twin built on the information-theoretic "Turbo" framework models the printing-imaging channel, and Denoising Diffusion Probabilistic Models (DDPM) are evaluated as an alternative generative model for the same digital-twin application.
  • results: DDPM performs strongly for the digital-twin application and better captures the stochasticity of the printing-imaging process; its generative potential is also evaluated in the context of mobile-phone data acquisition. Despite the increased complexity of DDPM methods compared to traditional approaches, the study highlights their advantages and explores their potential for future applications.
    Abstract Copy detection patterns (CDP) present an efficient technique for product protection against counterfeiting. However, the complexity of studying CDP production variability often results in time-consuming and costly procedures, limiting CDP scalability. Recent advancements in computer modelling, notably the concept of a "digital twin" for printing-imaging channels, allow for enhanced scalability and the optimization of authentication systems. Yet, the development of an accurate digital twin is far from trivial. This paper extends previous research which modelled a printing-imaging channel using a machine learning-based digital twin for CDP. This model, built upon an information-theoretic framework known as "Turbo", demonstrated superior performance over traditional generative models such as CycleGAN and pix2pix. However, the emerging field of Denoising Diffusion Probabilistic Models (DDPM) presents a potential advancement in generative models due to its ability to stochastically model the inherent randomness of the printing-imaging process, and its impressive performance in image-to-image translation tasks. This study aims at comparing the capabilities of the Turbo framework and DDPM on the same CDP datasets, with the goal of establishing the real-world benefits of DDPM models for digital twin applications in CDP security. Furthermore, the paper seeks to evaluate the generative potential of the studied models in the context of mobile phone data acquisition. Despite the increased complexity of DDPM methods when compared to traditional approaches, our study highlights their advantages and explores their potential for future applications.

Sketch2CADScript: 3D Scene Reconstruction from 2D Sketch using Visual Transformer and Rhino Grasshopper

  • paper_url: http://arxiv.org/abs/2309.16850
  • repo_url: None
  • paper_authors: Hong-Bin Yang
  • for: The paper proposes a new 3D reconstruction method that avoids the rough surfaces and distorted structures produced by existing approaches, which make manual editing and post-processing difficult.
  • methods: A vision transformer predicts a "scene descriptor" from a single wire-frame image; the descriptor encodes object types and parameters such as position, rotation, and size, and the 3D scene is then rebuilt with programmable 3D modeling software such as Blender or Rhino Grasshopper, yielding finely and easily editable models.
  • results: On two newly created datasets, one with simple scenes and one with complex scenes, the model reconstructs simple scenes accurately but struggles with the more complex ones.
    Abstract Existing 3D model reconstruction methods typically produce outputs in the form of voxels, point clouds, or meshes. However, each of these approaches has its limitations and may not be suitable for every scenario. For instance, the resulting model may exhibit a rough surface and distorted structure, making manual editing and post-processing challenging for humans. In this paper, we introduce a novel 3D reconstruction method designed to address these issues. We trained a visual transformer to predict a "scene descriptor" from a single wire-frame image. This descriptor encompasses crucial information, including object types and parameters such as position, rotation, and size. With the predicted parameters, a 3D scene can be reconstructed using 3D modeling software like Blender or Rhino Grasshopper which provides a programmable interface, resulting in finely and easily editable 3D models. To evaluate the proposed model, we created two datasets: one featuring simple scenes and another with complex scenes. The test results demonstrate the model's ability to accurately reconstruct simple scenes but reveal its challenges with more complex ones.

  • paper_url: http://arxiv.org/abs/2309.16849
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Kent Gauen, Stanley Chan
  • for: The work aims to improve the accuracy of the similarity search used in space-time attention, and thereby the quality of video denoising.
  • methods: The proposed search strategy, Shifted Non-Local Search, combines the quality of a non-local search with the range of predicted offsets by running a small grid search around each predicted offset to correct small spatial errors (a toy version is sketched after this entry).
  • results: Correcting the small spatial errors improves video frame alignment by over 3 dB PSNR; upgrading existing space-time attention modules improves video denoising by 0.30 dB PSNR for a 7.5% increase in overall runtime, and integrating the module into a UNet-like architecture yields state-of-the-art video denoising results.
    Abstract Efficiently computing attention maps for videos is challenging due to the motion of objects between frames. While a standard non-local search is high-quality for a window surrounding each query point, the window's small size cannot accommodate motion. Methods for long-range motion use an auxiliary network to predict the most similar key coordinates as offsets from each query location. However, accurately predicting this flow field of offsets remains challenging, even for large-scale networks. Small spatial inaccuracies significantly impact the attention module's quality. This paper proposes a search strategy that combines the quality of a non-local search with the range of predicted offsets. The method, named Shifted Non-Local Search, executes a small grid search surrounding the predicted offsets to correct small spatial errors. Our method's in-place computation consumes 10 times less memory and is over 3 times faster than previous work. Experimentally, correcting the small spatial errors improves the video frame alignment quality by over 3 dB PSNR. Our search upgrades existing space-time attention modules, which improves video denoising results by 0.30 dB PSNR for a 7.5% increase in overall runtime. We integrate our space-time attention module into a UNet-like architecture to achieve state-of-the-art results on video denoising.
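
To make the search strategy concrete, here is a toy numpy version of the shifted non-local search: each query patch starts from its predicted offset into the next frame, and a small grid search around that offset corrects small spatial errors. The patch size, search radius, and L2 patch distance are illustrative assumptions; the paper's in-place implementation is far more memory- and compute-efficient.

```python
# Toy shifted non-local search: refine a predicted offset per query by a small grid
# search in the next frame (illustrative sketch, not the paper's implementation).
import numpy as np

def shifted_nls(frame_t, frame_t1, queries, pred_offsets, patch=5, radius=1):
    """For each query (y, x), test offsets in a (2*radius+1)^2 grid around its
    predicted offset and keep the one with the smallest patch L2 distance."""
    h = patch // 2
    H, W = frame_t.shape
    refined = []
    for (qy, qx), (dy, dx) in zip(queries, pred_offsets):
        ref = frame_t[qy - h:qy + h + 1, qx - h:qx + h + 1]
        best_err, best_off = np.inf, (dy, dx)
        for sy in range(-radius, radius + 1):
            for sx in range(-radius, radius + 1):
                ty, tx = qy + dy + sy, qx + dx + sx
                if h <= ty < H - h and h <= tx < W - h:
                    cand = frame_t1[ty - h:ty + h + 1, tx - h:tx + h + 1]
                    err = np.sum((ref - cand) ** 2)
                    if err < best_err:
                        best_err, best_off = err, (dy + sy, dx + sx)
        refined.append(best_off)
    return refined

rng = np.random.default_rng(0)
f0 = rng.random((32, 32))
f1 = np.roll(f0, shift=(1, 2), axis=(0, 1))          # true motion of (1, 2) pixels
print(shifted_nls(f0, f1, queries=[(10, 10)], pred_offsets=[(0, 2)]))   # -> [(1, 2)]
```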

Propagation and Attribution of Uncertainty in Medical Imaging Pipelines

  • paper_url: http://arxiv.org/abs/2309.16831
  • repo_url: https://github.com/leonhardfeiner/uncertainty_propagation
  • paper_authors: Leonhard F. Feiner, Martin J. Menten, Kerstin Hammernik, Paul Hager, Wenqi Huang, Daniel Rueckert, Rickmer F. Braren, Georgios Kaissis
  • for: The paper proposes a way to build explainable neural networks for medical imaging applications by estimating and propagating uncertainty.
  • methods: Uncertainty is propagated through cascades of deep learning models in a medical imaging pipeline, so that it can be aggregated at later stages into a joint uncertainty measure for the predictions of later models, while the contribution of the aleatoric (data-based) uncertainty of every pipeline component is reported separately (a sampling-based propagation sketch follows this entry).
  • results: On a realistic pipeline that reconstructs undersampled brain and knee MR images and then predicts quantitative information such as brain volume, knee side, or patient sex, the propagated uncertainty is shown to correlate with the input uncertainty, and the proportions contributed by the pipeline stages to the joint uncertainty measure are compared.
    Abstract Uncertainty estimation, which provides a means of building explainable neural networks for medical imaging applications, have mostly been studied for single deep learning models that focus on a specific task. In this paper, we propose a method to propagate uncertainty through cascades of deep learning models in medical imaging pipelines. This allows us to aggregate the uncertainty in later stages of the pipeline and to obtain a joint uncertainty measure for the predictions of later models. Additionally, we can separately report contributions of the aleatoric, data-based, uncertainty of every component in the pipeline. We demonstrate the utility of our method on a realistic imaging pipeline that reconstructs undersampled brain and knee magnetic resonance (MR) images and subsequently predicts quantitative information from the images, such as the brain volume, or knee side or patient's sex. We quantitatively show that the propagated uncertainty is correlated with input uncertainty and compare the proportions of contributions of pipeline stages to the joint uncertainty measure.
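
One simple way to realize the propagation described above is by sampling: draw reconstructions from the first model's predictive distribution, push each sample through the downstream model, and summarize the spread of the outputs. The sketch below uses toy stand-in functions and a Gaussian aleatoric model purely for illustration; it is not the paper's actual propagation scheme.

```python
# Sampling-based uncertainty propagation through a two-stage pipeline (toy sketch).
import numpy as np

rng = np.random.default_rng(0)

def stage_one(y):
    """Stand-in reconstruction model: returns a mean image and an aleatoric std map."""
    return y * 0.9, 0.05 * np.ones_like(y)

def stage_two(x):
    """Stand-in downstream regressor: a scalar summary of the reconstruction."""
    return x.sum()

y = rng.random((8, 8))                       # toy undersampled measurement
mu, sigma = stage_one(y)

samples = [stage_two(mu + sigma * rng.standard_normal(mu.shape)) for _ in range(500)]
print("prediction:", np.mean(samples), "propagated std:", np.std(samples))
```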

MEM: Multi-Modal Elevation Mapping for Robotics and Learning

  • paper_url: http://arxiv.org/abs/2309.16818
  • repo_url: https://github.com/leggedrobotics/elevation_mapping_cupy
  • paper_authors: Gian Erni, Jonas Frey, Takahiro Miki, Matias Mattamala, Marco Hutter
  • for: This paper is written for robotic and learning tasks that require the fusion of multi-modal information for environment perception.
  • methods: The paper presents a 2.5D robot-centric elevation mapping framework that fuses multi-modal information from multiple sources into a popular map representation, using a set of fusion algorithms that can be selected based on the information type and user requirements.
  • results: The paper demonstrates the capabilities of the framework by deploying it on multiple robots with varying sensor configurations and showcasing a range of applications that utilize multi-modal layers, including line detection, human detection, and colorization.
    Abstract Elevation maps are commonly used to represent the environment of mobile robots and are instrumental for locomotion and navigation tasks. However, pure geometric information is insufficient for many field applications that require appearance or semantic information, which limits their applicability to other platforms or domains. In this work, we extend a 2.5D robot-centric elevation mapping framework by fusing multi-modal information from multiple sources into a popular map representation. The framework allows inputting data contained in point clouds or images in a unified manner. To manage the different nature of the data, we also present a set of fusion algorithms that can be selected based on the information type and user requirements. Our system is designed to run on the GPU, making it real-time capable for various robotic and learning tasks. We demonstrate the capabilities of our framework by deploying it on multiple robots with varying sensor configurations and showcasing a range of applications that utilize multi-modal layers, including line detection, human detection, and colorization.

SatDM: Synthesizing Realistic Satellite Image with Semantic Layout Conditioning using Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.16812
  • repo_url: None
  • paper_authors: Orkhan Baghirli, Hamid Askarov, Imran Ibrahimli, Ismat Bakhishov, Nabi Nabiyev
  • for: The paper presents a conditional DDPM that takes a semantic layout and generates high-quality, diverse, and semantically accurate satellite images.
  • methods: The denoising network combines variance learning, classifier-free guidance, and an improved noise schedule, and is further strengthened by adaptive normalization and self-attention mechanisms (the standard guidance rule is recalled after this entry).
  • results: Validation with Frechet Inception Distance (FID), Intersection over Union (IoU), and a human opinion study on a newly introduced, meticulously labeled dataset shows that the generated samples deviate minimally from real images, opening the door to practical applications such as data augmentation.
    Abstract Deep learning models in the Earth Observation domain heavily rely on the availability of large-scale accurately labeled satellite imagery. However, obtaining and labeling satellite imagery is a resource-intensive endeavor. While generative models offer a promising solution to address data scarcity, their potential remains underexplored. Recently, Denoising Diffusion Probabilistic Models (DDPMs) have demonstrated significant promise in synthesizing realistic images from semantic layouts. In this paper, a conditional DDPM model capable of taking a semantic map and generating high-quality, diverse, and correspondingly accurate satellite images is implemented. Additionally, a comprehensive illustration of the optimization dynamics is provided. The proposed methodology integrates cutting-edge techniques such as variance learning, classifier-free guidance, and improved noise scheduling. The denoising network architecture is further complemented by the incorporation of adaptive normalization and self-attention mechanisms, enhancing the model's capabilities. The effectiveness of our proposed model is validated using a meticulously labeled dataset introduced within the context of this study. Validation encompasses both algorithmic methods such as Frechet Inception Distance (FID) and Intersection over Union (IoU), as well as a human opinion study. Our findings indicate that the generated samples exhibit minimal deviation from real ones, opening doors for practical applications such as data augmentation. We look forward to further explorations of DDPMs in a wider variety of settings and data modalities. An open-source reference implementation of the algorithm and a link to the benchmarked dataset are provided at https://github.com/obaghirli/syn10-diffusion.
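
For reference, classifier-free guidance, one of the ingredients listed in the methods bullet, combines conditional and unconditional noise predictions at sampling time. The standard form is recalled below, with $c$ the semantic layout, $w$ the guidance scale, and $\varnothing$ the null condition; the exact weighting convention used in SatDM is an assumption here.

```latex
\[
  \hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
\]
```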

Granularity at Scale: Estimating Neighborhood Well-Being from High-Resolution Orthographic Imagery and Hybrid Learning

  • paper_url: http://arxiv.org/abs/2309.16808
  • repo_url: https://github.com/vida-nyu/gdpfinder
  • paper_authors: Ethan Brewer, Giovani Valdrighi, Parikshit Solunke, Joao Rulff, Yurii Piadyk, Zhonghui Lv, Jorge Poco, Claudio Silva
  • for: Fills gaps in basic information on the well-being of populations in areas where data collection is limited.
  • methods: Uses high-resolution satellite or aircraft imagery together with machine learning and computer vision, specifically a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words (sketched after this entry), to extract features and detect patterns in the image data.
  • results: Estimates population density (R$^2$ up to 0.81), median household income, and educational attainment of individual neighborhoods, and provides a basis for future work estimating fine-scale information from overhead imagery without label data.
    Abstract Many areas of the world are without basic information on the well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help "fill in the gaps" where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches, a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words, estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density (R$^2$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semi-supervised approach provides a foundation for future work seeking to estimate fine-scale information from overhead imagery without the need for label data.
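
The semi-supervised path rests on a bag-of-visual-words representation. A minimal sketch follows, using random stand-in descriptors, an assumed vocabulary size of 32, and scikit-learn's KMeans; none of these choices reflect the paper's actual feature extractor or settings.

```python
# Bag-of-visual-words: cluster local descriptors into a visual vocabulary, then
# represent each image tile as a normalized histogram of visual words (toy sketch).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors_per_tile = [rng.random((50, 64)) for _ in range(200)]   # stand-in descriptors

vocab_size = 32
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_tile))                         # learn the vocabulary

def bovw_histogram(desc):
    words = kmeans.predict(desc)                                    # assign visual words
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

features = np.stack([bovw_histogram(d) for d in descriptors_per_tile])
print(features.shape)                                               # (200, 32): one vector per tile
```

The resulting per-tile histograms can then be clustered or used as features for downstream estimation of census variables.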

Ultra-low-power Image Classification on Neuromorphic Hardware

  • paper_url: http://arxiv.org/abs/2309.16795
  • repo_url: https://github.com/biphasic/quartz-on-loihi
  • paper_authors: Gregor Lenz, Garrick Orchard, Sadique Sheik
  • for: The work targets ultra-low-power spiking neural networks (SNNs) that exploit temporal and spatial sparsity, since the number of spikes is proportional to the power consumed on neuromorphic hardware.
  • methods: Quartz is a temporal ANN-to-SNN conversion method based on time-to-first-spike (TTFS) coding; it achieves high accuracy, is easy to implement on neuromorphic hardware, and uses the least amount of synaptic operations and memory accesses, at the cost of only two extra synapses per neuron compared with previous temporal conversion methods (a toy TTFS encoding follows this entry).
  • results: Quartz is benchmarked in simulation on MNIST, CIFAR10, and ImageNet and then implemented on Intel's Loihi neuromorphic chip; the results provide evidence that temporal coding offers advantages in power consumption, throughput, and latency at similar classification accuracy.
    Abstract Spiking neural networks (SNNs) promise ultra-low-power applications by exploiting temporal and spatial sparsity. The number of binary activations, called spikes, is proportional to the power consumed when executed on neuromorphic hardware. Training such SNNs using backpropagation through time for vision tasks that rely mainly on spatial features is computationally costly. Training a stateless artificial neural network (ANN) to then convert the weights to an SNN is a straightforward alternative when it comes to image recognition datasets. Most conversion methods rely on rate coding in the SNN to represent ANN activation, which uses enormous amounts of spikes and, therefore, energy to encode information. Recently, temporal conversion methods have shown promising results requiring significantly fewer spikes per neuron, but sometimes complex neuron models. We propose a temporal ANN-to-SNN conversion method, which we call Quartz, that is based on the time to first spike (TTFS). Quartz achieves high classification accuracy and can be easily implemented on neuromorphic hardware while using the least amount of synaptic operations and memory accesses. It incurs a cost of two additional synapses per neuron compared to previous temporal conversion methods, which are readily available on neuromorphic hardware. We benchmark Quartz on MNIST, CIFAR10, and ImageNet in simulation to show the benefits of our method and follow up with an implementation on Loihi, a neuromorphic chip by Intel. We provide evidence that temporal coding has advantages in terms of power consumption, throughput, and latency for similar classification accuracy. Our code and models are publicly available.
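
Time-to-first-spike coding, the basis of Quartz, lets each neuron communicate its activation through the latency of a single spike, so larger activations fire earlier. The simple linear latency code below is an illustrative assumption, not the Quartz conversion itself.

```python
# Toy time-to-first-spike (TTFS) code: one spike per neuron, value carried by latency.
import numpy as np

def ttfs_encode(activations, t_max=100):
    """Map activations in [0, 1] to spike times in [0, t_max]; 1.0 fires at t = 0."""
    a = np.clip(activations, 0.0, 1.0)
    return np.round((1.0 - a) * t_max).astype(int)

def ttfs_decode(spike_times, t_max=100):
    return 1.0 - spike_times / t_max

a = np.array([0.9, 0.5, 0.1, 0.0])
t = ttfs_encode(a)
print(t)                 # [ 10  50  90 100]
print(ttfs_decode(t))    # approximately recovers the original activations
```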

STIR: Surgical Tattoos in Infrared

  • paper_url: http://arxiv.org/abs/2309.16782
  • repo_url: None
  • paper_authors: Adam Schmidt, Omid Mohareri, Simon DiMaio, Septimiu E. Salcudean
  • for: The paper is written for researchers and developers working on image guidance and automation of medical interventions and surgery, specifically in endoscopic environments.
  • methods: The paper introduces a novel labeling methodology called Surgical Tattoos in Infrared (STIR), which uses invisible IR-fluorescent dye (indocyanine green, ICG) to label tissue points in video clips, allowing for persistent but invisible labels for tracking and mapping.
  • results: The paper analyzes multiple frame-based tracking methods on STIR using both 3D and 2D endpoint error and accuracy metrics, providing a benchmark dataset for evaluating and improving tracking and mapping methods in endoscopic environments.
    Abstract Quantifying performance of methods for tracking and mapping tissue in endoscopic environments is essential for enabling image guidance and automation of medical interventions and surgery. Datasets developed so far either use rigid environments, visible markers, or require annotators to label salient points in videos after collection. These are respectively: not general, visible to algorithms, or costly and error-prone. We introduce a novel labeling methodology along with a dataset that uses said methodology, Surgical Tattoos in Infrared (STIR). STIR has labels that are persistent but invisible to visible spectrum algorithms. This is done by labelling tissue points with IR-flourescent dye, indocyanine green (ICG), and then collecting visible light video clips. STIR comprises hundreds of stereo video clips in both in-vivo and ex-vivo scenes with start and end points labelled in the IR spectrum. With over 3,000 labelled points, STIR will help to quantify and enable better analysis of tracking and mapping methods. After introducing STIR, we analyze multiple different frame-based tracking methods on STIR using both 3D and 2D endpoint error and accuracy metrics. STIR is available at https://dx.doi.org/10.21227/w8g4-g548

Deep Learning based Systems for Crater Detection: A Review

  • paper_url: http://arxiv.org/abs/2310.07727
  • repo_url: None
  • paper_authors: Atal Tewari, K Prateek, Amrita Singh, Nitin Khanna
  • for: The article surveys deep-learning-based crater detection algorithms (CDAs) to assist researchers working on crater detection.
  • methods: Over 140 research works are reviewed, covering planetary data, crater databases, and evaluation metrics; the challenges posed by the complex properties of craters are discussed, and DL-based CDAs are categorized into three groups: semantic-segmentation-based, object-detection-based, and classification-based.
  • results: All semantic-segmentation-based CDAs are trained and tested on a common dataset to evaluate the effectiveness of each architecture for crater detection and its potential applications, and recommendations for future work are provided.
    Abstract Craters are one of the most prominent features on planetary surfaces, used in applications such as age estimation, hazard detection, and spacecraft navigation. Crater detection is a challenging problem due to various aspects, including complex crater characteristics such as varying sizes and shapes, data resolution, and planetary data types. Similar to other computer vision tasks, deep learning-based approaches have significantly impacted research on crater detection in recent years. This survey aims to assist researchers in this field by examining the development of deep learning-based crater detection algorithms (CDAs). The review includes over 140 research works covering diverse crater detection approaches, including planetary data, craters database, and evaluation metrics. To be specific, we discuss the challenges in crater detection due to the complex properties of the craters and survey the DL-based CDAs by categorizing them into three parts: (a) semantic segmentation-based, (b) object detection-based, and (c) classification-based. Additionally, we have conducted training and testing of all the semantic segmentation-based CDAs on a common dataset to evaluate the effectiveness of each architecture for crater detection and its potential applications. Finally, we have provided recommendations for potential future works.

Prompt-Enhanced Self-supervised Representation Learning for Remote Sensing Image Understanding

  • paper_url: http://arxiv.org/abs/2310.00022
  • repo_url: None
  • paper_authors: Mingming Zhang, Qingjie Liu, Yunhong Wang
  • for: The work aims to improve self-supervised representation learning, particularly for remote sensing image understanding, where scenes are complex and densely populated with no clear foreground objects.
  • methods: A prompt-enhanced self-supervised learning method uses original image patches as a reconstructive prompt template and adds a prompt-enhanced generative branch that supplies contextual information through semantic-consistency constraints; pre-training uses a newly collected dataset of over 1.28 million remote sensing images.
  • results: The method outperforms fully supervised models and state-of-the-art self-supervised methods on land cover classification, semantic segmentation, object detection, and instance segmentation, indicating representations with strong generalization and transferability.
    Abstract Learning representations through self-supervision on a large-scale, unlabeled dataset has proven to be highly effective for understanding diverse images, such as those used in remote sensing image analysis. However, remote sensing images often have complex and densely populated scenes, with multiple land objects and no clear foreground objects. This intrinsic property can lead to false positive pairs in contrastive learning, or missing contextual information in reconstructive learning, which can limit the effectiveness of existing self-supervised learning methods. To address these problems, we propose a prompt-enhanced self-supervised representation learning method that uses a simple yet efficient pre-training pipeline. Our approach involves utilizing original image patches as a reconstructive prompt template, and designing a prompt-enhanced generative branch that provides contextual information through semantic consistency constraints. We collected a dataset of over 1.28 million remote sensing images that is comparable to the popular ImageNet dataset, but without specific temporal or geographical constraints. Our experiments show that our method outperforms fully supervised learning models and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that our approach learns impressive remote sensing representations with high generalization and transferability.

Learning to Transform for Generalizable Instance-wise Invariance

  • paper_url: http://arxiv.org/abs/2309.16672
  • repo_url: None
  • paper_authors: Utkarsh Singhal, Carlos Esteves, Ameesh Makadia, Stella X. Yu
  • for: The work aims to make computer vision systems robust to the spatial transformations found in natural data, learning the appropriate amount of invariance from data rather than hard-coding it.
  • methods: Invariance is treated as a prediction problem: a normalizing flow predicts a distribution over transformations for each image, and predictions are averaged over samples from that distribution. Because the distribution depends only on the instance, instances can be aligned before classification, invariance generalizes across classes, and the same distribution can be used to adapt to out-of-distribution poses (a minimal averaging sketch follows this entry).
  • results: Used as data augmentation, the method yields accuracy and robustness gains on CIFAR-10, CIFAR10-LT, and TinyImageNet, and it can learn a much larger range of transformations than Augerino and InstaAug.
    Abstract Computer vision research has long aimed to build systems that are robust to spatial transformations found in natural data. Traditionally, this is done using data augmentation or hard-coding invariances into the architecture. However, too much or too little invariance can hurt, and the correct amount is unknown a priori and dependent on the instance. Ideally, the appropriate invariance would be learned from data and inferred at test-time. We treat invariance as a prediction problem. Given any image, we use a normalizing flow to predict a distribution over transformations and average the predictions over them. Since this distribution only depends on the instance, we can align instances before classifying them and generalize invariance across classes. The same distribution can also be used to adapt to out-of-distribution poses. This normalizing flow is trained end-to-end and can learn a much larger range of transformations than Augerino and InstaAug. When used as data augmentation, our method shows accuracy and robustness gains on CIFAR 10, CIFAR10-LT, and TinyImageNet.
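
The inference recipe described above, namely sampling transformations from an instance-conditioned distribution, applying them, and averaging the predictions, fits in a few lines. In this sketch the flow is replaced by a fixed Gaussian over rotation angles and the classifier is a stand-in; both are assumptions made only to keep the example runnable.

```python
# Instance-conditioned test-time averaging over sampled transformations (toy sketch).
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def predict_transform_distribution(image):
    """Stand-in for the normalizing flow: mean/std of a rotation angle in degrees."""
    return 0.0, 15.0

def classifier(image):
    """Stand-in classifier: two 'logits' from simple image statistics."""
    return np.array([image.mean(), image.std()])

def invariant_predict(image, n_samples=8):
    mu, sigma = predict_transform_distribution(image)
    logits = []
    for _ in range(n_samples):
        angle = rng.normal(mu, sigma)
        logits.append(classifier(rotate(image, angle, reshape=False, mode="nearest")))
    return np.mean(logits, axis=0)          # average predictions over transformations

print(invariant_predict(rng.random((28, 28))))
```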

Decaf: Monocular Deformation Capture for Face and Hand Interactions

  • paper_url: http://arxiv.org/abs/2309.16670
  • repo_url: None
  • paper_authors: Soshi Shimada, Vladislav Golyanik, Patrick Pérez, Christian Theobalt
  • for: The paper addresses 3D tracking from monocular RGB videos, in particular the dense non-rigid face deformations induced by interacting hands, which prior work has largely left unaddressed.
  • methods: Hands are modelled as articulated objects that induce non-rigid face deformations during active interaction. The approach builds on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system, and its neural core combines a variational auto-encoder supplying a hand-face depth prior with modules that guide the 3D tracking by estimating contacts and deformations.
  • results: The resulting 3D hand and face reconstructions are realistic and more plausible than several applicable baselines, both quantitatively and qualitatively.
    Abstract Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects. Modelling dense non-rigid object deformations in this setting remained largely unaddressed so far, although such effects can improve the realism of the downstream applications such as AR/VR and avatar communications. This is due to the severe ill-posedness of the monocular view setting and the associated challenges. While it is possible to naively track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates such as depth ambiguity, unnatural intra-object collisions and missing or implausible deformations. Hence, this paper introduces the first method that addresses the fundamental challenges depicted above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible compared to several baselines applicable in our setting, both quantitatively and qualitatively. https://vcai.mpi-inf.mpg.de/projects/Decaf

Training a Large Video Model on a Single Machine in a Day

  • paper_url: http://arxiv.org/abs/2309.16669
  • repo_url: https://github.com/zhaoyue-zephyrus/avion
  • paper_authors: Yue Zhao, Philipp Krähenbühl
  • for: The paper presents a training pipeline efficient enough to train a state-of-the-art large video model on a single machine with eight consumer-grade GPUs in one day.
  • methods: The authors identify three bottlenecks, IO, CPU, and GPU computation, and optimize each of them, yielding a highly efficient video training pipeline.
  • results: For comparable architectures, the pipeline achieves higher accuracy with $\frac{1}{8}$ of the computation compared to prior work.
    Abstract Videos are big, complex to pre-process, and slow to train on. State-of-the-art large-scale video models are trained on clusters of 32 or more GPUs for several days. As a consequence, academia largely ceded the training of large video models to industry. In this paper, we show how to still train a state-of-the-art video model on a single machine with eight consumer-grade GPUs in a day. We identify three bottlenecks, IO, CPU, and GPU computation, and optimize each. The result is a highly efficient video training pipeline. For comparable architectures, our pipeline achieves higher accuracies with $\frac{1}{8}$ of the computation compared to prior work. Code is available at https://github.com/zhaoyue-zephyrus/AVION.

Geodesic Regression Characterizes 3D Shape Changes in the Female Brain During Menstruation

  • paper_url: http://arxiv.org/abs/2309.16662
  • repo_url: https://github.com/bioshape-lab/my28brains
  • paper_authors: Adele Myers, Caitlin Taylor, Emily Jacobs, Nina Miolane
  • for: Motivated by women's higher risk of Alzheimer's and other neurological diseases after menopause, the paper investigates the connection between female brain health and sex hormone fluctuations by developing tools that quantify 3D shape changes in the brain.
  • methods: The researchers use geodesic regression on the space of 3D discrete surfaces to characterize the evolution of brain shape during hormone fluctuations, and propose approximation schemes, with rules of thumb for when to use each, that make the regression computationally practical (the underlying objective is recalled after this entry).
  • results: On synthetic data they quantify the speed-accuracy trade-off of the approximations, showing large speed-ups for little loss in accuracy, and on real brain shape data they produce the first characterization of how the female hippocampus changes shape during the menstrual cycle as a function of progesterone.
    Abstract Women are at higher risk of Alzheimer's and other neurological diseases after menopause, and yet research connecting female brain health to sex hormone fluctuations is limited. We seek to investigate this connection by developing tools that quantify 3D shape changes that occur in the brain during sex hormone fluctuations. Geodesic regression on the space of 3D discrete surfaces offers a principled way to characterize the evolution of a brain's shape. However, in its current form, this approach is too computationally expensive for practical use. In this paper, we propose approximation schemes that accelerate geodesic regression on shape spaces of 3D discrete surfaces. We also provide rules of thumb for when each approximation can be used. We test our approach on synthetic data to quantify the speed-accuracy trade-off of these approximations and show that practitioners can expect very significant speed-up while only sacrificing little accuracy. Finally, we apply the method to real brain shape data and produce the first characterization of how the female hippocampus changes shape during the menstrual cycle as a function of progesterone: a characterization made (practically) possible by our approximation schemes. Our work paves the way for comprehensive, practical shape analyses in the fields of bio-medicine and computer vision. Our implementation is publicly available on GitHub: https://github.com/bioshape-lab/my28brains.
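
For readers unfamiliar with geodesic regression, the objective being approximated has the standard form below; the specific metric on discrete surfaces and the approximation schemes are described in the paper, and this is only the textbook formulation with hormone level as the scalar predictor.

```latex
% Geodesic regression: fit an intercept point p and tangent vector v on the shape
% space so that the geodesic t -> Exp_p(t v) best explains the observed shapes y_i
% at predictor values t_i (e.g. progesterone level):
\[
  \min_{p,\,v}\; \sum_{i=1}^{N} d\bigl(\operatorname{Exp}_p(t_i v),\, y_i\bigr)^2 ,
\]
% where Exp_p is the Riemannian exponential map and d the geodesic distance.
```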

Visual In-Context Learning for Few-Shot Eczema Segmentation

  • paper_url: http://arxiv.org/abs/2309.16656
  • repo_url: None
  • paper_authors: Neelesh Kumar, Oya Aran, Venugopal Vasudevan
  • for: The study targets automated diagnosis of eczema from digital camera images so that patients can self-monitor their recovery; a key component is segmenting the eczema region, and existing approaches such as CNN-based U-Net or transformer-based Swin U-Net require large volumes of annotated data that can be difficult to obtain.
  • methods: The authors investigate visual in-context learning for few-shot eczema segmentation and propose a strategy for applying the generalist vision model SegGPT to eczema segmentation with only a handful of examples and no retraining.
  • results: On a dataset of annotated eczema images, SegGPT with just 2 representative example images outperforms a CNN U-Net trained on 428 images (mIoU: 36.69 vs 32.60), and using more examples can actually hurt SegGPT's performance. The results highlight the value of visual in-context learning for faster and better skin-imaging solutions, including more inclusive ones for demographic groups that are under-represented in training data.
    Abstract Automated diagnosis of eczema from digital camera images is crucial for developing applications that allow patients to self-monitor their recovery. An important component of this is the segmentation of eczema region from such images. Current methods for eczema segmentation rely on deep neural networks such as convolutional (CNN)-based U-Net or transformer-based Swin U-Net. While effective, these methods require high volume of annotated data, which can be difficult to obtain. Here, we investigate the capabilities of visual in-context learning that can perform few-shot eczema segmentation with just a handful of examples and without any need for retraining models. Specifically, we propose a strategy for applying in-context learning for eczema segmentation with a generalist vision model called SegGPT. When benchmarked on a dataset of annotated eczema images, we show that SegGPT with just 2 representative example images from the training dataset performs better (mIoU: 36.69) than a CNN U-Net trained on 428 images (mIoU: 32.60). We also discover that using more number of examples for SegGPT may in fact be harmful to its performance. Our result highlights the importance of visual in-context learning in developing faster and better solutions to skin imaging tasks. Our result also paves the way for developing inclusive solutions that can cater to minorities in the demographics who are typically heavily under-represented in the training data.

Novel Deep Learning Pipeline for Automatic Weapon Detection

  • paper_url: http://arxiv.org/abs/2309.16654
  • repo_url: None
  • paper_authors: Haribharathi Sivakumar, Vijay Arvind. R, Pawan Ragavendhar V, G. Balamurugan
  • for: The paper proposes a real-time weapon detection system in response to the rapid rise of weapon and gun violence.
  • methods: A novel pipeline uses an ensemble of convolutional neural networks (CNNs) with distinct architectures, each trained on a unique mini-batch with little to no overlap in training samples.
  • results: Compared with state-of-the-art (SoA) systems across multiple datasets, the proposed pipeline yields an average increase of 5% in accuracy, specificity, and recall.
    Abstract Weapon and gun violence have recently become a pressing issue today. The degree of these crimes and activities has risen to the point of being termed as an epidemic. This prevalent misuse of weapons calls for an automatic system that detects weapons in real-time. Real-time surveillance video is captured and recorded in almost all public forums and places. These videos contain abundant raw data which can be extracted and processed into meaningful information. This paper proposes a novel pipeline consisting of an ensemble of convolutional neural networks with distinct architectures. Each neural network is trained with a unique mini-batch with little to no overlap in the training samples. This paper will present several promising results using multiple datasets associated with comparing the proposed architecture and state-of-the-art (SoA) models. The proposed pipeline produced an average increase of 5% in accuracy, specificity, and recall compared to the SoA systems.

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

  • paper_url: http://arxiv.org/abs/2309.16653
  • repo_url: https://github.com/dreamgaussian/dreamgaussian
  • paper_authors: Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, Gang Zeng
  • for: The paper targets efficient, high-quality 3D content creation.
  • methods: DreamGaussian is a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space; progressive densification of 3D Gaussians converges much faster than the occupancy pruning used in Neural Radiance Fields, and an efficient algorithm converts the Gaussians into textured meshes followed by a fine-tuning stage that refines the details.
  • results: The method produces high-quality textured meshes in just 2 minutes from a single-view image, roughly 10 times faster than existing methods.
    Abstract Recent advances in 3D content creation mostly leverage optimization-based 3D generation via score distillation sampling (SDS). Though promising results have been exhibited, these methods often suffer from slow per-sample optimization, limiting their practical usage. In this paper, we propose DreamGaussian, a novel 3D content generation framework that achieves both efficiency and quality simultaneously. Our key insight is to design a generative 3D Gaussian Splatting model with companioned mesh extraction and texture refinement in UV space. In contrast to the occupancy pruning used in Neural Radiance Fields, we demonstrate that the progressive densification of 3D Gaussians converges significantly faster for 3D generative tasks. To further enhance the texture quality and facilitate downstream applications, we introduce an efficient algorithm to convert 3D Gaussians into textured meshes and apply a fine-tuning stage to refine the details. Extensive experiments demonstrate the superior efficiency and competitive generation quality of our proposed approach. Notably, DreamGaussian produces high-quality textured meshes in just 2 minutes from a single-view image, achieving approximately 10 times acceleration compared to existing methods.

ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

  • paper_url: http://arxiv.org/abs/2309.16650
  • repo_url: None
  • paper_authors: Qiao Gu, Alihusein Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, Chuang Gan, Celso Miguel de Melo, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, Liam Paull
  • for: This paper aims to build a semantically rich yet compact and efficient 3D scene representation for robots, enabling a wide variety of tasks.
  • methods: It proposes an open-vocabulary, graph-structured representation built on large vision-language models, fusing the outputs of 2D foundation models into 3D via multi-view association.
  • results: The representation generalizes to novel semantic classes without collecting large 3D datasets or finetuning models, and supports complex spatial and semantic reasoning specified through language prompts.
    Abstract For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. Recent approaches have attempted to leverage features from large vision-language models to encode semantics in 3D representations. However, these approaches tend to produce maps with per-point feature vectors, which do not scale well in larger environments, nor do they contain semantic spatial relationships between entities in the environment, which are useful for downstream planning. In this work, we propose ConceptGraphs, an open-vocabulary graph-structured representation for 3D scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association. The resulting representations generalize to novel semantic classes, without the need to collect large 3D datasets or finetune models. We demonstrate the utility of this representation through a number of downstream planning tasks that are specified through abstract (language) prompts and require complex reasoning over spatial and semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer video: https://youtu.be/mRhNkQwRYnc )

FLIP: Cross-domain Face Anti-spoofing with Language Guidance

  • paper_url: http://arxiv.org/abs/2309.16649
  • repo_url: https://github.com/koushiksrivats/flip
  • paper_authors: Koushik Srivatsan, Muzammal Naseer, Karthik Nandakumar
  • for: Face anti-spoofing (FAS) is an essential component of security-critical applications, but existing methods generalize poorly to unseen spoof types, camera sensors, and environmental conditions.
  • methods: The work initializes vision transformer (ViT) models with multimodal (e.g., CLIP) pre-trained weights to improve generalizability for the FAS task, and proposes grounding visual representations in natural language semantics by aligning image representations with an ensemble of class descriptions.
  • results: Extensive experiments on three standard protocols show that the method enables zero-shot transfer for FAS and performs better in low-data regimes, outperforming even five-shot transfer of adaptive ViTs.
    Abstract Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods have poor generalizability to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task due to their ability to capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language. Specifically, we show that aligning the image representation with an ensemble of class descriptions (based on natural language semantics) improves FAS generalizability in low-data regimes. Finally, we propose a multimodal contrastive learning strategy to boost feature generalization further and bridge the gap between source and target domains. Extensive experiments on three standard protocols demonstrate that our method significantly outperforms the state-of-the-art methods, achieving better zero-shot transfer performance than five-shot transfer of adaptive ViTs. Code: https://github.com/koushiksrivats/FLIP
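
A rough sketch of the "ensemble of class descriptions" alignment idea: the image embedding is matched against the averaged text embeddings of several natural-language descriptions per class. The encoders below are random placeholders standing in for CLIP-style image/text towers, and the description ids are invented — this is not the paper's FLIP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512

# Placeholder encoders standing in for a CLIP-style image tower and text tower.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, DIM))
text_encoder = nn.Embedding(1000, DIM)  # toy: one embedding per "tokenized" description id

# Hypothetical ensembles of class descriptions (ids stand in for tokenized prompts).
# Class 0 = real face, class 1 = spoof attack.
description_ids = {0: torch.tensor([1, 2, 3]), 1: torch.tensor([4, 5, 6])}

def class_prototypes() -> torch.Tensor:
    # Average the embeddings of each class's description ensemble, then L2-normalize.
    protos = [text_encoder(ids).mean(dim=0) for _, ids in sorted(description_ids.items())]
    return F.normalize(torch.stack(protos), dim=-1)            # (num_classes, DIM)

def alignment_loss(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    img = F.normalize(image_encoder(images), dim=-1)           # (B, DIM)
    logits = img @ class_prototypes().t() / 0.07               # cosine similarity / temperature
    return F.cross_entropy(logits, labels)

loss = alignment_loss(torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,)))
loss.backward()
print(float(loss))
```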

Improving Equivariance in State-of-the-Art Supervised Depth and Normal Predictors

  • paper_url: http://arxiv.org/abs/2309.16646
  • repo_url: https://github.com/mikuhatsune/equivariance
  • paper_authors: Yuanyi Zhong, Anand Bhattad, Yu-Xiong Wang, David Forsyth
  • for: Improving the equivariance of dense depth and surface normal predictors so that they respect cropping-and-resizing equivariance.
  • methods: An equivariant regularization technique consisting of an averaging procedure and a self-consistency loss, which explicitly promotes cropping-and-resizing equivariance.
  • results: The regularization applies to both CNN and Transformer architectures, incurs no extra test-time cost, and notably improves supervised and semi-supervised performance on Taskonomy tasks; finetuning state-of-the-art depth and normal predictors with the loss improves not only their equivariance but also their accuracy on NYU-v2.
    Abstract Dense depth and surface normal predictors should possess the equivariant property to cropping-and-resizing -- cropping the input image should result in cropping the same output image. However, we find that state-of-the-art depth and normal predictors, despite having strong performances, surprisingly do not respect equivariance. The problem exists even when crop-and-resize data augmentation is employed during training. To remedy this, we propose an equivariant regularization technique, consisting of an averaging procedure and a self-consistency loss, to explicitly promote cropping-and-resizing equivariance in depth and normal networks. Our approach can be applied to both CNN and Transformer architectures, does not incur extra cost during testing, and notably improves the supervised and semi-supervised learning performance of dense predictors on Taskonomy tasks. Finally, finetuning with our loss on unlabeled images improves not only equivariance but also accuracy of state-of-the-art depth and normal predictors when evaluated on NYU-v2. GitHub link: https://github.com/mikuhatsune/equivariance
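
The self-consistency idea — the prediction on a crop should match the correspondingly cropped-and-resized prediction on the full image — can be sketched as below. The dense predictor, crop box, and L1 penalty are placeholder assumptions rather than the authors' exact regularizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder dense predictor (e.g., depth): maps (B, 3, H, W) -> (B, 1, H, W).
predictor = nn.Conv2d(3, 1, kernel_size=3, padding=1)

def crop_resize(t: torch.Tensor, box, size) -> torch.Tensor:
    top, left, h, w = box
    return F.interpolate(t[:, :, top:top + h, left:left + w],
                         size=size, mode="bilinear", align_corners=False)

def self_consistency_loss(image: torch.Tensor, box=(32, 48, 128, 128)) -> torch.Tensor:
    out_size = image.shape[-2:]
    # Path 1: predict on the full image, then crop-and-resize the prediction.
    pred_then_crop = crop_resize(predictor(image), box, out_size)
    # Path 2: crop-and-resize the image, then predict.
    crop_then_pred = predictor(crop_resize(image, box, out_size))
    # Equivariance means the two paths should agree.
    return F.l1_loss(pred_then_crop, crop_then_pred)

loss = self_consistency_loss(torch.randn(2, 3, 256, 256))
loss.backward()
print(float(loss))
```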

Deep Geometrized Cartoon Line Inbetweening

  • paper_url: http://arxiv.org/abs/2309.16643
  • repo_url: https://github.com/lisiyao21/animeinbet
  • paper_authors: Li Siyao, Tianpei Gu, Weiye Xiao, Henghui Ding, Ziwei Liu, Chen Change Loy
  • for: addressing the inbetweening problem in the anime industry, specifically the generation of intermediate frames between black-and-white line drawings
  • methods: using a new approach called AnimeInbet, which geometrizes raster line drawings into graphs of endpoints and reframes the inbetweening task as a graph fusion problem with vertex repositioning
  • results: synthesizing high-quality, clean, and complete intermediate line drawings that outperform existing methods quantitatively and qualitatively, especially in cases with large motions
    Abstract We aim to address a significant but understudied problem in the anime industry, namely the inbetweening of cartoon line drawings. Inbetweening involves generating intermediate frames between two black-and-white line drawings and is a time-consuming and expensive process that can benefit from automation. However, existing frame interpolation methods that rely on matching and warping whole raster images are unsuitable for line inbetweening and often produce blurring artifacts that damage the intricate line structures. To preserve the precision and detail of the line drawings, we propose a new approach, AnimeInbet, which geometrizes raster line drawings into graphs of endpoints and reframes the inbetweening task as a graph fusion problem with vertex repositioning. Our method can effectively capture the sparsity and unique structure of line drawings while preserving the details during inbetweening. This is made possible via our novel modules, i.e., vertex geometric embedding, a vertex correspondence Transformer, an effective mechanism for vertex repositioning and a visibility predictor. To train our method, we introduce MixamoLine240, a new dataset of line drawings with ground truth vectorization and matching labels. Our experiments demonstrate that AnimeInbet synthesizes high-quality, clean, and complete intermediate line drawings, outperforming existing methods quantitatively and qualitatively, especially in cases with large motions. Data and code are available at https://github.com/lisiyao21/AnimeInbet.

End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon

  • paper_url: http://arxiv.org/abs/2309.16634
  • repo_url: None
  • paper_authors: Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel, Christian Wolf
  • for: This paper addresses goal-oriented visual navigation learned with large-scale machine learning in simulated environments.
  • methods: A sequence of two pretext tasks — cross-view completion as a proxy for the underlying visual correspondence problem, and direct goal detection and localization — serves as a prior for wide-baseline relative pose estimation and visibility prediction, built on a dual encoder with a large-capacity binocular ViT.
  • results: Experiments show significant improvements and state-of-the-art performance on the ImageNav and Instance-ImageNav benchmarks, where camera intrinsics and height differ between observation and goal.
    Abstract Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy requiring to solve an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottleneck in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.

Class Activation Map-based Weakly supervised Hemorrhage Segmentation using Resnet-LSTM in Non-Contrast Computed Tomography images

  • paper_url: http://arxiv.org/abs/2309.16627
  • repo_url: None
  • paper_authors: Shreyas H Ramananda, Vaanathi Sundaresan
  • for: This paper proposes a novel weakly supervised deep learning method for automatically segmenting intracranial hemorrhage (ICH) in NCCT images.
  • methods: The method uses image-level binary classification labels instead of labor-intensive lesion-level annotations: a classification network first determines the approximate ICH location via class activation maps, and the segmentation is then refined using pseudo-ICH masks obtained in an unsupervised manner.
  • results: On the validation data of the MICCAI 2022 INSTANCE challenge, the method achieves a Dice value of 0.55, comparable to an existing weakly supervised method (Dice 0.47), despite training on much less data.
    Abstract In clinical settings, intracranial hemorrhages (ICH) are routinely diagnosed using non-contrast CT (NCCT) for severity assessment. Accurate automated segmentation of ICH lesions is the initial and essential step, immensely useful for such assessment. However, compared to other structural imaging modalities such as MRI, in NCCT images ICH appears with very low contrast and poor SNR. Over recent years, deep learning (DL)-based methods have shown great potential, however, training them requires a huge amount of manually annotated lesion-level labels, with sufficient diversity to capture the characteristics of ICH. In this work, we propose a novel weakly supervised DL method for ICH segmentation on NCCT scans, using image-level binary classification labels, which are less time-consuming and labor-efficient when compared to the manual labeling of individual ICH lesions. Our method initially determines the approximate location of ICH using class activation maps from a classification network, which is trained to learn dependencies across contiguous slices. We further refine the ICH segmentation using pseudo-ICH masks obtained in an unsupervised manner. The method is flexible and uses a computationally light architecture during testing. On evaluating our method on the validation data of the MICCAI 2022 INSTANCE challenge, our method achieves a Dice value of 0.55, comparable with those of existing weakly supervised method (Dice value of 0.47), despite training on a much smaller training data.
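
A minimal sketch of the class-activation-map step that provides the approximate lesion location: the final fully connected weights of a classification CNN are projected back onto the last convolutional feature map. The backbone, the 0.5 threshold, and the pseudo-mask step are toy assumptions, not the paper's slice-dependency network or refinement procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    """Toy classifier: conv features -> global average pooling -> linear head."""
    def __init__(self, num_classes: int = 2, channels: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        feats = self.features(x)                      # (B, C, H, W)
        logits = self.fc(feats.mean(dim=(2, 3)))      # GAP + linear
        return logits, feats

def class_activation_map(model: TinyClassifier, x: torch.Tensor, cls: int) -> torch.Tensor:
    _, feats = model(x)
    w = model.fc.weight[cls]                          # (C,)
    cam = F.relu(torch.einsum("c,bchw->bhw", w, feats))  # weighted sum over channels
    # Normalize to [0, 1] so it can be thresholded into a rough pseudo-mask.
    lo = cam.amin(dim=(1, 2), keepdim=True)
    hi = cam.amax(dim=(1, 2), keepdim=True)
    return (cam - lo) / (hi - lo + 1e-6)

model = TinyClassifier()
cam = class_activation_map(model, torch.randn(1, 1, 128, 128), cls=1)
pseudo_mask = (cam > 0.5).float()                     # crude pseudo-label for later refinement
print(cam.shape, float(pseudo_mask.mean()))
```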

KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing

  • paper_url: http://arxiv.org/abs/2309.16608
  • repo_url: None
  • paper_authors: Jiancheng Huang, Yifan Liu, Jin Qin, Shifeng Chen
  • for: This paper addresses the action editing problem in text-conditioned real image editing, so that edited results conform to the action semantics of the prompt while preserving the content of the original image.
  • methods: The proposed KV Inversion method achieves satisfactory reconstruction and action editing, solving two main problems: 1) the edited result matches the corresponding action, and 2) the edited object retains the texture and identity of the original real image.
  • results: The method requires neither training the Stable Diffusion model itself nor scanning a large-scale dataset for time-consuming training.
    Abstract Text-conditioned image editing is a recently emerged and highly practical task, and its potential is immeasurable. However, most of the concurrent methods are unable to perform action editing, i.e. they can not produce results that conform to the action semantics of the editing prompt and preserve the content of the original image. To solve the problem of action editing, we propose KV Inversion, a method that can achieve satisfactory reconstruction performance and action editing, which can solve two major problems: 1) the edited result can match the corresponding action, and 2) the edited object can retain the texture and identity of the original real image. In addition, our method does not require training the Stable Diffusion model itself, nor does it require scanning a large-scale dataset to perform time-consuming training.

Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection

  • paper_url: http://arxiv.org/abs/2309.16592
  • repo_url: None
  • paper_authors: Manish Sharma, Moitreya Chatterjee, Kuan-Chuan Peng, Suhas Lohit, Michael Jones
  • for: The goal is to improve object detection in IR images, where labeled training data are scarce.
  • methods: A novel tensor factorization method, TensorFact, splits the convolution kernels of a CNN layer into low-rank factor matrices, reducing the number of parameters in the model.
  • results: Experiments show that TensorFact improves object detection performance on RGB images, and after fine-tuning on IR images it outperforms a standard object detector.
    Abstract The primary bottleneck towards obtaining good recognition performance in IR images is the lack of sufficient labeled training data, owing to the cost of acquiring such data. Realizing that object detection methods for the RGB modality are quite robust (at least for some commonplace classes, like person, car, etc.), thanks to the giant training sets that exist, in this work we seek to leverage cues from the RGB modality to scale object detectors to the IR modality, while preserving model performance in the RGB modality. At the core of our method, is a novel tensor decomposition method called TensorFact which splits the convolution kernels of a layer of a Convolutional Neural Network (CNN) into low-rank factor matrices, with fewer parameters than the original CNN. We first pretrain these factor matrices on the RGB modality, for which plenty of training data are assumed to exist and then augment only a few trainable parameters for training on the IR modality to avoid over-fitting, while encouraging them to capture complementary cues from those trained only on the RGB modality. We validate our approach empirically by first assessing how well our TensorFact decomposed network performs at the task of detecting objects in RGB images vis-a-vis the original network and then look at how well it adapts to IR images of the FLIR ADAS v1 dataset. For the latter, we train models under scenarios that pose challenges stemming from data paucity. From the experiments, we observe that: (i) TensorFact shows performance gains on RGB images; (ii) further, this pre-trained model, when fine-tuned, outperforms a standard state-of-the-art object detector on the FLIR ADAS v1 dataset by about 4% in terms of mAP 50 score.
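
The parameter-saving intuition behind factorizing convolution kernels into low-rank factors can be sketched generically: a k×k convolution from C_in to C_out channels is replaced by a rank-r bottleneck. This is a generic low-rank substitute for illustration, not the specific TensorFact decomposition, and the RGB-pretrain/IR-finetune comment is only a pointer to the paper's strategy.

```python
import torch
import torch.nn as nn

class LowRankConv2d(nn.Module):
    """Generic rank-r replacement for a kxk convolution (illustrative, not TensorFact)."""
    def __init__(self, in_ch: int, out_ch: int, k: int, rank: int):
        super().__init__()
        # Factor 1: project channels into a small rank-r subspace
        # (in the paper's spirit, factors could be pretrained on RGB and only a few
        # extra parameters added for the IR modality).
        self.reduce = nn.Conv2d(in_ch, rank, kernel_size=1, bias=False)
        # Factor 2: spatial kxk convolution inside the low-rank subspace.
        self.spatial = nn.Conv2d(rank, rank, kernel_size=k, padding=k // 2, bias=False)
        # Factor 3: expand back to the full output channels.
        self.expand = nn.Conv2d(rank, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.expand(self.spatial(self.reduce(x)))

def num_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

full = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
low_rank = LowRankConv2d(256, 256, k=3, rank=32)
print(num_params(full), num_params(low_rank))      # the low-rank version has far fewer parameters
print(low_rank(torch.randn(1, 256, 32, 32)).shape)
```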

Vision Transformers Need Registers

  • paper_url: http://arxiv.org/abs/2309.16588
  • repo_url: https://github.com/facebookresearch/dinov2
  • paper_authors: Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski
  • for: This work aims to fix artifacts that arise in the feature maps of transformer networks used for visual representation learning.
  • methods: Studying both supervised and self-supervised ViT networks, the paper proposes a simple yet effective solution: adding extra tokens to the input sequence so that high-norm tokens no longer repurpose low-informative background patches for internal computations.
  • results: The fix removes the artifacts entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and yields smoother feature maps and attention maps for downstream visual processing.
    Abstract Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
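
The proposed fix is conceptually simple: learnable "register" tokens are added to the ViT input sequence alongside the [CLS] and patch tokens and discarded at the output. A minimal sketch follows; the token counts, dimensions, and the generic encoder are placeholders, not the DINOv2 implementation.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch: prepend CLS + learnable register tokens to the patch-token sequence."""
    def __init__(self, dim: int = 384, num_registers: int = 4, depth: int = 2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens: torch.Tensor):
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        tokens = torch.cat([cls, reg, patch_tokens], dim=1)
        out = self.encoder(tokens)
        # Registers absorb "internal computation" and are simply dropped at the output,
        # leaving cleaner patch features for dense downstream tasks.
        cls_out = out[:, 0]
        patch_out = out[:, 1 + self.num_registers:]
        return cls_out, patch_out

model = ViTWithRegisters()
cls_out, patch_out = model(torch.randn(2, 196, 384))   # 14x14 patch tokens of dim 384
print(cls_out.shape, patch_out.shape)                  # (2, 384) and (2, 196, 384)
```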

Text-to-3D using Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2309.16585
  • repo_url: https://github.com/gsgen3d/gsgen
  • paper_authors: Zilong Chen, Feng Wang, Huaping Liu
  • for: High-quality 3D object generation from text.
  • methods: 3D Gaussian Splatting with a progressive optimization strategy (geometry optimization followed by appearance refinement).
  • results: Accurate 3D geometry with detailed structure.
    Abstract In this paper, we present Gaussian Splatting based text-to-3D generation (GSGEN), a novel approach for generating high-quality 3D objects. Previous methods suffer from inaccurate geometry and limited fidelity due to the absence of 3D prior and proper representation. We leverage 3D Gaussian Splatting, a recent state-of-the-art representation, to address existing shortcomings by exploiting the explicit nature that enables the incorporation of 3D prior. Specifically, our method adopts a progressive optimization strategy, which includes a geometry optimization stage and an appearance refinement stage. In geometry optimization, a coarse representation is established under a 3D geometry prior along with the ordinary 2D SDS loss, ensuring a sensible and 3D-consistent rough shape. Subsequently, the obtained Gaussians undergo an iterative refinement to enrich details. In this stage, we increase the number of Gaussians by compactness-based densification to enhance continuity and improve fidelity. With these designs, our approach can generate 3D content with delicate details and more accurate geometry. Extensive evaluations demonstrate the effectiveness of our method, especially for capturing high-frequency components. Video results are provided at https://gsgen3d.github.io. Our code is available at https://github.com/gsgen3d/gsgen

Audio-Visual Speaker Verification via Joint Cross-Attention

  • paper_url: http://arxiv.org/abs/2309.16569
  • repo_url: None
  • paper_authors: R. Gnana Praveen, Jahangir Alam
  • for: The goal is to improve speaker verification performance through audio-visual fusion of faces and voices.
  • methods: A cross-modal joint attention mechanism estimates cross-attention weights from the correlation between the joint feature representation and the individual feature representations, capturing both intra-modal and inter-modal relationships.
  • results: Experiments show that the method significantly outperforms existing audio-visual fusion approaches for speaker verification.
    Abstract Speaker verification has been widely explored using speech signals, which has shown significant improvement using deep models. Recently, there has been a surge in exploring faces and voices as they can offer more complementary and comprehensive information than relying only on a single modality of speech signals. Though current methods in the literature on the fusion of faces and voices have shown improvement over that of individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most of the existing methods based on audio-visual fusion either rely on score-level fusion or simple feature concatenation. In this work, we have explored cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature presentation and that of the individual feature representations in order to effectively capture both intra-modal as well inter-modal relationships among the faces and voices. We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the Voxceleb1 dataset. Results show that the proposed approach can significantly outperform the state-of-the-art methods of audio-visual fusion for speaker verification.
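
A simplified sketch of attention-based audio-visual fusion for speaker verification: a joint (concatenated) representation attends to each modality's feature sequence, and the attended features are combined into a single speaker embedding. The dimensions and the use of nn.MultiheadAttention are illustrative conveniences; the paper's joint cross-attention derives its weights from correlations between joint and per-modality representations rather than from this generic module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCrossAttentionFusion(nn.Module):
    """Illustrative audio-visual fusion via cross-attention from a joint representation."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.joint_proj = nn.Linear(2 * dim, dim)
        self.attn_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (B, T, dim) sequences of per-frame features.
        joint = self.joint_proj(torch.cat([audio, visual], dim=-1))      # (B, T, dim)
        # The joint representation queries each modality, capturing inter-modal cues.
        a_att, _ = self.attn_audio(joint, audio, audio)
        v_att, _ = self.attn_visual(joint, visual, visual)
        fused = self.out(torch.cat([a_att, v_att], dim=-1)).mean(dim=1)  # (B, dim)
        return F.normalize(fused, dim=-1)                                # unit-length embedding

fusion = JointCrossAttentionFusion()
emb1 = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
emb2 = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(F.cosine_similarity(emb1, emb2))  # verification score between two utterances
```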

MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond

  • paper_url: http://arxiv.org/abs/2309.16553
  • repo_url: None
  • paper_authors: Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, Bo Dai
  • for: Building a large-scale, high-quality synthetic dataset to advance city-scale neural rendering research.
  • methods: Using a pipeline built on the Unreal Engine 5 City Sample project, aerial and street views are collected together with ground-truth camera poses and a range of additional data modalities.
  • results: The resulting MatrixCity dataset contains 67k aerial images and 452k street images from two city maps covering a total area of $28km^2$.
    Abstract Neural radiance fields (NeRF) and its subsequent variants have led to remarkable progress in neural rendering. While most of recent neural rendering works focus on objects and small-scale scenes, developing neural rendering methods for city-scale scenes is of great potential in many real-world applications. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, yet collecting such a dataset over real city-scale scenes is costly, sensitive, and technically difficult. To this end, we build a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering researches. Leveraging the Unreal Engine 5 City Sample project, we develop a pipeline to easily collect aerial and street city views, accompanied by ground-truth camera poses and a range of additional data modalities. Flexible controls over environmental factors like light, weather, human and car crowd are also available in our pipeline, supporting the need of various tasks covering city-scale neural rendering and beyond. The resulting pilot dataset, MatrixCity, contains 67k aerial images and 452k street images from two city maps of total size $28km^2$. On top of MatrixCity, a thorough benchmark is also conducted, which not only reveals unique challenges of the task of city-scale neural rendering, but also highlights potential improvements for future works. The dataset and code will be publicly available at our project page: https://city-super.github.io/matrixcity/.

Uncertainty Quantification for Eosinophil Segmentation

  • paper_url: http://arxiv.org/abs/2309.16536
  • repo_url: None
  • paper_authors: Kevin Lin, Donald Brown, Sana Syed, Adam Greene
  • for: The study aims to improve the approach of Adorno et al. for quantifying eosinophils using deep image segmentation.
  • methods: Monte Carlo Dropout provides uncertainty quantification for the deep learning model; the uncertainty is visualized in the output image to evaluate model performance, give insight into how the algorithm works, and assist pathologists in identifying eosinophils.
  • results: The method can help pathologists identify eosinophils more reliably and improve diagnostic efficiency.
    Abstract Eosinophilic Esophagitis (EoE) is an allergic condition increasing in prevalence. To diagnose EoE, pathologists must find 15 or more eosinophils within a single high-power field (400X magnification). Determining whether or not a patient has EoE can be an arduous process and any medical imaging approaches used to assist diagnosis must consider both efficiency and precision. We propose an improvement of Adorno et al.'s approach for quantifying eosinophils using deep image segmentation. Our new approach leverages Monte Carlo Dropout, a common approach in deep learning to reduce overfitting, to provide uncertainty quantification on current deep learning models. The uncertainty can be visualized in an output image to evaluate model performance, provide insight into how deep learning algorithms function, and assist pathologists in identifying eosinophils.
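
Monte Carlo Dropout itself is standard: dropout is kept active at test time and the segmentation network is run several times, so the per-pixel mean gives the prediction and the per-pixel standard deviation gives an uncertainty map. The tiny network below is a placeholder, not the paper's segmentation model.

```python
import torch
import torch.nn as nn

# Placeholder segmentation network with dropout (the paper's model would go here).
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Dropout2d(p=0.5),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, passes: int = 20):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(passes)])  # (T, B, 1, H, W)
    mean = samples.mean(dim=0)   # per-pixel eosinophil probability
    std = samples.std(dim=0)     # per-pixel uncertainty, visualizable as a heatmap
    return mean, std

image = torch.randn(1, 3, 128, 128)
prob, uncertainty = mc_dropout_predict(net, image)
print(prob.shape, float(uncertainty.max()))
```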

HOI4ABOT: Human-Object Interaction Anticipation for Human Intention Reading Collaborative roBOTs

  • paper_url: http://arxiv.org/abs/2309.16524
  • repo_url: None
  • paper_authors: Esteve Valls Mascaro, Daniel Sliwowski, Dongheui Lee
  • for: This work aims to make human-robot collaboration more efficient and intuitive by detecting and anticipating human-object interactions (HOIs).
  • methods: A transformer-based framework detects and anticipates HOIs from videos, trained on video data and evaluated against existing methods.
  • results: On the VidHOI dataset the model outperforms the state of the art in detection and anticipation by 1.76% and 1.04% in mAP, respectively, while being 15.4 times faster; experiments on a real robot show that anticipating HOIs is key to better human-robot interaction.
    Abstract Robots are becoming increasingly integrated into our lives, assisting us in various tasks. To ensure effective collaboration between humans and robots, it is essential that they understand our intentions and anticipate our actions. In this paper, we propose a Human-Object Interaction (HOI) anticipation framework for collaborative robots. We propose an efficient and robust transformer-based model to detect and anticipate HOIs from videos. This enhanced anticipation empowers robots to proactively assist humans, resulting in more efficient and intuitive collaborations. Our model outperforms state-of-the-art results in HOI detection and anticipation in VidHOI dataset with an increase of 1.76% and 1.04% in mAP respectively while being 15.4 times faster. We showcase the effectiveness of our approach through experimental results in a real robot, demonstrating that the robot's ability to anticipate HOIs is key for better Human-Robot Interaction. More information can be found on our project webpage: https://evm7.github.io/HOI4ABOT_page/

Latent Noise Segmentation: How Neural Noise Leads to the Emergence of Segmentation and Grouping

  • paper_url: http://arxiv.org/abs/2309.16515
  • repo_url: None
  • paper_authors: Ben Lonnqvist, Zhengqing Wu, Michael H. Herzog
  • for: This paper proposes an unsupervised segmentation method that exploits neural noise to segment images.
  • methods: Noise is added to a deep neural network, enabling it to segment images without ever being trained on segmentation labels.
  • results: The method segments images successfully, and its results align with perceptual grouping phenomena observed in human vision; moreover, it requires remarkably few samples and works across a range of noise levels.
    Abstract Deep Neural Networks (DNNs) that achieve human-level performance in general tasks like object segmentation typically require supervised labels. In contrast, humans are able to perform these tasks effortlessly without supervision. To accomplish this, the human visual system makes use of perceptual grouping. Understanding how perceptual grouping arises in an unsupervised manner is critical for improving both models of the visual system, and computer vision models. In this work, we propose a counterintuitive approach to unsupervised perceptual grouping and segmentation: that they arise because of neural noise, rather than in spite of it. We (1) mathematically demonstrate that under realistic assumptions, neural noise can be used to separate objects from each other, and (2) show that adding noise in a DNN enables the network to segment images even though it was never trained on any segmentation labels. Interestingly, we find that (3) segmenting objects using noise results in segmentation performance that aligns with the perceptual grouping phenomena observed in humans. We introduce the Good Gestalt (GG) datasets -- six datasets designed to specifically test perceptual grouping, and show that our DNN models reproduce many important phenomena in human perception, such as illusory contours, closure, continuity, proximity, and occlusion. Finally, we (4) demonstrate the ecological plausibility of the method by analyzing the sensitivity of the DNN to different magnitudes of noise. We find that some model variants consistently succeed with remarkably low levels of neural noise ($\sigma<0.001$), and surprisingly, that segmenting this way requires as few as a handful of samples. Together, our results suggest a novel unsupervised segmentation method requiring few assumptions, a new explanation for the formation of perceptual grouping, and a potential benefit of neural noise in the visual system.

CCEdit: Creative and Controllable Video Editing via Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.16496
  • repo_url: https://github.com/RuoyuFeng/CCEdit
  • paper_authors: Ruoyu Feng, Wenming Weng, Yanhui Wang, Yuhui Yuan, Jianmin Bao, Chong Luo, Zhibo Chen, Baining Guo
  • for: This paper tackles the problems of creativity and controllability in video editing.
  • methods: The proposed CCEdit framework adopts the ControlNet architecture with adaptable temporal modules, and is compatible with existing text-to-image personalization techniques such as DreamBooth and LoRA.
  • results: Experimental results confirm the framework's exceptional functionality and editing capabilities.
    Abstract In this work, we present CCEdit, a versatile framework designed to address the challenges of creative and controllable video editing. CCEdit accommodates a wide spectrum of user editing requirements and enables enhanced creative control through an innovative approach that decouples video structure and appearance. We leverage the foundational ControlNet architecture to preserve structural integrity, while seamlessly integrating adaptable temporal modules compatible with state-of-the-art personalization techniques for text-to-image generation, such as DreamBooth and LoRA.Furthermore, we introduce reference-conditioned video editing, empowering users to exercise precise creative control over video editing through the more manageable process of editing key frames. Our extensive experimental evaluations confirm the exceptional functionality and editing capabilities of the proposed CCEdit framework. Demo video is available at https://www.youtube.com/watch?v=UQw4jq-igN4.

Deep Single Models vs. Ensembles: Insights for a Fast Deployment of Parking Monitoring Systems

  • paper_url: http://arxiv.org/abs/2309.16495
  • repo_url: None
  • paper_authors: Andre Gustavo Hochuli, Jean Paul Barddal, Gillian Cezar Palhano, Leonardo Matheus Mendes, Paulo Ricardo Lisboa de Almeida
  • for: The goal is a system that finds available parking spaces in high-density urban centers, reducing the stress drivers face when searching for parking.
  • methods: The study uses image-based systems and deep learning techniques for parking-space recognition.
  • results: Using diverse datasets and deep learning architectures, including fusion strategies and ensemble methods, models achieve 95% accuracy without requiring training and annotation on the target parking lot.
    Abstract Searching for available parking spots in high-density urban centers is a stressful task for drivers that can be mitigated by systems that know in advance the nearest parking space available. To this end, image-based systems offer cost advantages over other sensor-based alternatives (e.g., ultrasonic sensors), requiring less physical infrastructure for installation and maintenance. Despite recent deep learning advances, deploying intelligent parking monitoring is still a challenge since most approaches involve collecting and labeling large amounts of data, which is laborious and time-consuming. Our study aims to uncover the challenges in creating a global framework, trained using publicly available labeled parking lot images, that performs accurately across diverse scenarios, enabling the parking space monitoring as a ready-to-use system to deploy in a new environment. Through exhaustive experiments involving different datasets and deep learning architectures, including fusion strategies and ensemble methods, we found that models trained on diverse datasets can achieve 95\% accuracy without the burden of data annotation and model training on the target parking lot

Accurate and lightweight dehazing via multi-receptive-field non-local network and novel contrastive regularization

  • paper_url: http://arxiv.org/abs/2309.16494
  • repo_url: None
  • paper_authors: Zewei He, Zixuan Chen, Ziqian Lu, Xuecheng Sun, Zhe-Ming Lu
  • for: Accurate and lightweight image dehazing with better feature extraction and detail preservation.
  • methods: A multi-receptive-field non-local network (MRFNLN) comprising a multi-stream feature attention block (MSFAB) and a cross non-local block (CNLB): parallel 1x1, 3x3, and 5x5 convolutions extract multi-scale features, an attention sub-block focuses on important channels/regions, and the CNLB captures long-range dependencies beyond the query while staying computation-friendly.
  • results: A novel detail-focused contrastive regularization (DFCR) further improves dehazing performance; the model has fewer than 1.5 million parameters and outperforms state-of-the-art dehazing methods.
    Abstract Recently, deep learning-based methods have dominated image dehazing domain. Although very competitive dehazing performance has been achieved with sophisticated models, effective solutions for extracting useful features are still under-explored. In addition, non-local network, which has made a breakthrough in many vision tasks, has not been appropriately applied to image dehazing. Thus, a multi-receptive-field non-local network (MRFNLN) consisting of the multi-stream feature attention block (MSFAB) and cross non-local block (CNLB) is presented in this paper. We start with extracting richer features for dehazing. Specifically, we design a multi-stream feature extraction (MSFE) sub-block, which contains three parallel convolutions with different receptive fields (i.e., $1\times 1$, $3\times 3$, $5\times 5$) for extracting multi-scale features. Following MSFE, we employ an attention sub-block to make the model adaptively focus on important channels/regions. The MSFE and attention sub-blocks constitute our MSFAB. Then, we design a cross non-local block (CNLB), which can capture long-range dependencies beyond the query. Instead of the same input source of query branch, the key and value branches are enhanced by fusing more preceding features. CNLB is computation-friendly by leveraging a spatial pyramid down-sampling (SPDS) strategy to reduce the computation and memory consumption without sacrificing the performance. Last but not least, a novel detail-focused contrastive regularization (DFCR) is presented by emphasizing the low-level details and ignoring the high-level semantic information in the representation space. Comprehensive experimental results demonstrate that the proposed MRFNLN model outperforms recent state-of-the-art dehazing methods with less than 1.5 Million parameters.
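
The multi-stream feature extraction idea — three parallel convolutions with 1×1, 3×3, and 5×5 receptive fields followed by channel attention — can be sketched as follows. The fusion rule and the squeeze-and-excitation style attention below are simplified assumptions, not the exact MSFAB.

```python
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    """Sketch of multi-receptive-field feature extraction with channel attention."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        ])
        # Simple squeeze-and-excitation style channel attention over the fused features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3 * channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, 3 * channels, 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)  # (B, 3C, H, W)
        multi = multi * self.attn(multi)                          # reweight channels
        return x + self.fuse(multi)                               # residual connection

block = MultiStreamBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)
```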

HTC-DC Net: Monocular Height Estimation from Single Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2309.16486
  • repo_url: https://github.com/zhu-xlab/htc-dc-net
  • paper_authors: Sining Chen, Yilei Shi, Zhitong Xiong, Xiao Xiang Zhu
  • for: This paper proposes a method for monocular height estimation from single remote sensing images, addressing the problem of obtaining 3D geo-information from optical imagery.
  • methods: Following a classification-regression paradigm, the network consists of a feature-extractor backbone, an HTC-AdaBins module, and a hybrid regression process; the HTC-AdaBins module uses a vision transformer encoder and a head-tail cut (HTC) to address the long-tailed height distribution, and training is regularized with distribution-based constraints (DCs).
  • results: Experiments on three datasets of different resolutions show that the proposed network outperforms existing methods by large margins, and extensive ablation studies confirm the effectiveness of each design component.
    Abstract 3D geo-information is of great significance for understanding the living environment; however, 3D perception from remote sensing data, especially on a large scale, is restricted. To tackle this problem, we propose a method for monocular height estimation from optical imagery, which is currently one of the richest sources of remote sensing data. As an ill-posed problem, monocular height estimation requires well-designed networks for enhanced representations to improve performance. Moreover, the distribution of height values is long-tailed with the low-height pixels, e.g., the background, as the head, and thus trained networks are usually biased and tend to underestimate building heights. To solve the problems, instead of formalizing the problem as a regression task, we propose HTC-DC Net following the classification-regression paradigm, with the head-tail cut (HTC) and the distribution-based constraints (DCs) as the main contributions. HTC-DC Net is composed of the backbone network as the feature extractor, the HTC-AdaBins module, and the hybrid regression process. The HTC-AdaBins module serves as the classification phase to determine bins adaptive to each input image. It is equipped with a vision transformer encoder to incorporate local context with holistic information and involves an HTC to address the long-tailed problem in monocular height estimation for balancing the performances of foreground and background pixels. The hybrid regression process does the regression via the smoothing of bins from the classification phase, which is trained via DCs. The proposed network is tested on three datasets of different resolutions, namely ISPRS Vaihingen (0.09 m), DFC19 (1.3 m) and GBH (3 m). Experimental results show the superiority of the proposed network over existing methods by large margins. Extensive ablation studies demonstrate the effectiveness of each design component.
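
The classification-regression paradigm for height estimation can be sketched in its generic AdaBins-style form: the network predicts per-image bin widths and per-pixel probabilities over those bins, and the height is the probability-weighted sum of bin centers. The toy head below omits the head-tail cut and the distribution-based constraints and is not the HTC-DC Net architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinRegressionHead(nn.Module):
    """Toy classification-regression head: adaptive bins + per-pixel bin probabilities."""
    def __init__(self, feat_ch: int = 64, num_bins: int = 32, max_height: float = 100.0):
        super().__init__()
        self.max_height = max_height
        self.bin_widths = nn.Sequential(          # one set of bin widths per image
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_ch, num_bins))
        self.bin_logits = nn.Conv2d(feat_ch, num_bins, 1)  # per-pixel bin scores

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        widths = F.softmax(self.bin_widths(feats), dim=-1) * self.max_height  # (B, K)
        edges = torch.cumsum(widths, dim=-1)
        centers = edges - widths / 2                                          # (B, K)
        probs = F.softmax(self.bin_logits(feats), dim=1)                      # (B, K, H, W)
        # Regression by smoothing over bins: expectation of bin centers.
        return torch.einsum("bk,bkhw->bhw", centers, probs)                   # (B, H, W)

head = BinRegressionHead()
height = head(torch.randn(2, 64, 32, 32))
print(height.shape, float(height.min()), float(height.max()))
```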

Rethinking Domain Generalization: Discriminability and Generalizability

  • paper_url: http://arxiv.org/abs/2309.16483
  • repo_url: None
  • paper_authors: Shaocong Long, Qianyu Zhou, Chenhao Ying, Lizhuang Ma, Yuan Luo
  • for: This work aims to develop a domain generalization (DG) method that achieves strong feature generalizability and accurate discriminability at the same time.
  • methods: The approach rests on two core components: Selective Channel Pruning (SCP), which curtails redundant features in the network to improve stability and classification accuracy, and Micro-level Distribution Alignment (MDA), which emphasizes micro-level alignment within each class to retain sufficient generalizable features and accommodate within-class variation.
  • results: Extensive experiments on four benchmark datasets demonstrate the effectiveness of the method.
    Abstract Domain generalization (DG) endeavors to develop robust models that possess strong generalizability while preserving excellent discriminability. Nonetheless, pivotal DG techniques tend to improve the feature generalizability by learning domain-invariant representations, inadvertently overlooking the feature discriminability. On the one hand, the simultaneous attainment of generalizability and discriminability of features presents a complex challenge, often entailing inherent contradictions. This challenge becomes particularly pronounced when domain-invariant features manifest reduced discriminability owing to the inclusion of unstable factors, \emph{i.e.,} spurious correlations. On the other hand, prevailing domain-invariant methods can be categorized as category-level alignment, susceptible to discarding indispensable features possessing substantial generalizability and narrowing intra-class variations. To surmount these obstacles, we rethink DG from a new perspective that concurrently imbues features with formidable discriminability and robust generalizability, and present a novel framework, namely, Discriminative Microscopic Distribution Alignment (DMDA). DMDA incorporates two core components: Selective Channel Pruning~(SCP) and Micro-level Distribution Alignment (MDA). Concretely, SCP attempts to curtail redundancy within neural networks, prioritizing stable attributes conducive to accurate classification. This approach alleviates the adverse effect of spurious domain invariance and amplifies the feature discriminability. Besides, MDA accentuates micro-level alignment within each class, going beyond mere category-level alignment. This strategy accommodates sufficient generalizable features and facilitates within-class variations. Extensive experiments on four benchmark datasets corroborate the efficacy of our method.

Diverse Target and Contribution Scheduling for Domain Generalization

  • paper_url: http://arxiv.org/abs/2309.16460
  • repo_url: None
  • paper_authors: Shaocong Long, Qianyu Zhou, Chenhao Ying, Lizhuang Ma, Yuan Luo
  • for: This paper addresses generalization under distribution shift in computer vision, i.e., learning from multiple seen (source) domains so that models transfer to unseen domains.
  • methods: A new paradigm, Diverse Target and Contribution Scheduling (DTCS), combines Diverse Target Supervision (DTS), which uses distinct soft labels as training targets to account for differing feature distributions and mitigate gradient conflicts, with Diverse Contribution Balance (DCB), which dynamically balances the contributions of the source domains.
  • results: Experiments on four benchmark datasets show competitive performance compared with state-of-the-art approaches, demonstrating the effectiveness and advantages of the proposed method.
    Abstract Generalization under the distribution shift has been a great challenge in computer vision. The prevailing practice of directly employing the one-hot labels as the training targets in domain generalization~(DG) can lead to gradient conflicts, making it insufficient for capturing the intrinsic class characteristics and hard to increase the intra-class variation. Besides, existing methods in DG mostly overlook the distinct contributions of source (seen) domains, resulting in uneven learning from these domains. To address these issues, we firstly present a theoretical and empirical analysis of the existence of gradient conflicts in DG, unveiling the previously unexplored relationship between distribution shifts and gradient conflicts during the optimization process. In this paper, we present a novel perspective of DG from the empirical source domain's risk and propose a new paradigm for DG called Diverse Target and Contribution Scheduling (DTCS). DTCS comprises two innovative modules: Diverse Target Supervision (DTS) and Diverse Contribution Balance (DCB), with the aim of addressing the limitations associated with the common utilization of one-hot labels and equal contributions for source domains in DG. In specific, DTS employs distinct soft labels as training targets to account for various feature distributions across domains and thereby mitigates the gradient conflicts, and DCB dynamically balances the contributions of source domains by ensuring a fair decline in losses of different source domains. Extensive experiments with analysis on four benchmark datasets show that the proposed method achieves a competitive performance in comparison with the state-of-the-art approaches, demonstrating the effectiveness and advantages of the proposed DTCS.

Towards Novel Class Discovery: A Study in Novel Skin Lesions Clustering

  • paper_url: http://arxiv.org/abs/2309.16451
  • repo_url: None
  • paper_authors: Wei Feng, Lie Ju, Lin Wang, Kaimin Song, Zongyuan Ge
  • for: automatically discover and identify new semantic categories from new data
  • methods: contrastive learning + uncertainty-aware multi-view cross pseudo-supervision strategy + local sample similarity aggregation
  • results: effectively leverage knowledge from known categories to discover new semantic categories, validated through extensive ablation experiments.
    Abstract Existing deep learning models have achieved promising performance in recognizing skin diseases from dermoscopic images. However, these models can only recognize samples from predefined categories, when they are deployed in the clinic, data from new unknown categories are constantly emerging. Therefore, it is crucial to automatically discover and identify new semantic categories from new data. In this paper, we propose a new novel class discovery framework for automatically discovering new semantic classes from dermoscopy image datasets based on the knowledge of known classes. Specifically, we first use contrastive learning to learn a robust and unbiased feature representation based on all data from known and unknown categories. We then propose an uncertainty-aware multi-view cross pseudo-supervision strategy, which is trained jointly on all categories of data using pseudo labels generated by a self-labeling strategy. Finally, we further refine the pseudo label by aggregating neighborhood information through local sample similarity to improve the clustering performance of the model for unknown categories. We conducted extensive experiments on the dermatology dataset ISIC 2019, and the experimental results show that our approach can effectively leverage knowledge from known categories to discover new semantic categories. We also further validated the effectiveness of the different modules through extensive ablation experiments. Our code will be released soon.

Radar Instance Transformer: Reliable Moving Instance Segmentation in Sparse Radar Point Clouds

  • paper_url: http://arxiv.org/abs/2309.16435
  • repo_url: None
  • paper_authors: Matthias Zeller, Vardeep S. Sandhu, Benedikt Mersch, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
  • for: Improving the ability of autonomous robots to avoid collisions in dynamic environments by enhancing scene interpretation.
  • methods: LiDAR and cameras greatly aid scene interpretation but are limited in adverse weather; radar sensors overcome these limitations and provide Doppler velocities, giving direct information about dynamic objects, so the approach performs moving instance segmentation directly on radar point clouds.
  • results: The proposed method enhances scene interpretation from sparse radar point clouds and improves performance on safety-critical tasks for autonomous robots.
    Abstract The perception of moving objects is crucial for autonomous robots performing collision avoidance in dynamic environments. LiDARs and cameras tremendously enhance scene interpretation but do not provide direct motion information and face limitations under adverse weather. Radar sensors overcome these limitations and provide Doppler velocities, delivering direct information on dynamic objects. In this paper, we address the problem of moving instance segmentation in radar point clouds to enhance scene interpretation for safety-critical tasks. Our Radar Instance Transformer enriches the current radar scan with temporal information without passing aggregated scans through a neural network. We propose a full-resolution backbone to prevent information loss in sparse point cloud processing. Our instance transformer head incorporates essential information to enhance segmentation but also enables reliable, class-agnostic instance assignments. In sum, our approach shows superior performance on the new moving instance segmentation benchmarks, including diverse environments, and provides model-agnostic modules to enhance scene interpretation. The benchmark is based on the RadarScenes dataset and will be made available upon acceptance.
    摘要 自主机器人在动态环境中执行避障任务时，需要准确感知移动物体。LiDAR 和摄像头可以极大地帮助场景解释，但它们不直接提供运动信息，并且在恶劣天气下受到限制。雷达传感器可以克服这些限制，并提供多普勒速度，直接给出动态物体的信息。在这篇论文中，我们解决雷达点云中的移动实例分割问题，以提升面向安全关键任务的场景解释能力。我们的 Radar Instance Transformer 无需将聚合后的扫描送入神经网络，即可将时间信息融入当前雷达扫描。我们提出了全分辨率主干网络，以避免稀疏点云处理中的信息丢失。我们的实例 Transformer 头不仅融入了提升分割所需的关键信息，还能够实现可靠的、与类别无关的实例分配。综上所述，我们的方法在涵盖多种环境的新移动实例分割基准上表现出色，并提供了与模型无关的模块来增强场景解释。该基准基于 RadarScenes 数据集，并将在论文被接收后公布。

Distilling ODE Solvers of Diffusion Models into Smaller Steps

  • paper_url: http://arxiv.org/abs/2309.16421
  • repo_url: None
  • paper_authors: Sanghwan Kim, Hao Tang, Fisher Yu
  • for: 提高 diffusion 模型的采样速度
  • methods: 使用简单的 distillation 方法优化 ODE 解决方案
  • results: 比较 existing ODE 解决方案的性能,特别是在生成样本 fewer steps 时的表现更佳,并且具有较低的计算开销。
    Abstract Distillation techniques have substantially improved the sampling speed of diffusion models, allowing generation in only one or a few steps. However, these distillation methods require extensive training for each dataset, sampler, and network, which limits their practical applicability. To address this limitation, we propose a straightforward distillation approach, Distilled-ODE solvers (D-ODE solvers), that optimizes the ODE solver rather than training the denoising network. D-ODE solvers are formulated by simply applying a single parameter adjustment to existing ODE solvers. Subsequently, D-ODE solvers with smaller steps are optimized by ODE solvers with larger steps through distillation over a batch of samples. Our comprehensive experiments indicate that D-ODE solvers outperform existing ODE solvers, including DDIM, PNDM, DPM-Solver, DEIS, and EDM, especially when generating samples with fewer steps. Our method incurs negligible computational overhead compared to previous distillation techniques, enabling simple and rapid integration with previous samplers. Qualitative analysis further shows that D-ODE solvers enhance image quality while preserving the sampling trajectory of ODE solvers.
    摘要 蒸馏技术大幅提升了扩散模型的采样速度，使其能够在一步或几步内生成样本。然而，这些蒸馏方法需要针对每个数据集、采样器和网络进行大量训练，限制了其实用性。为了解决这一限制，我们提出了一种简单的蒸馏方法——Distilled-ODE solvers（D-ODE solvers），它优化 ODE 求解器本身，而无需训练去噪网络。D-ODE solvers 只需对现有 ODE 求解器做单个参数的调整即可构造；随后，通过在一批样本上进行蒸馏，用步数较多的 ODE 求解器来优化步数较少的 D-ODE solvers。我们的全面实验表明，D-ODE solvers 优于现有的 ODE 求解器（包括 DDIM、PNDM、DPM-Solver、DEIS 和 EDM），尤其是在以更少步数生成样本时。与以往的蒸馏技术相比，我们的方法几乎不增加计算开销，可以简单快速地与已有采样器结合。定性分析进一步表明，D-ODE solvers 在保持 ODE 求解器采样轨迹的同时提升了图像质量。
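
The distillation idea, fitting a single extra parameter per step of an existing ODE solver so that a few-step trajectory tracks a many-step teacher, can be sketched roughly as below. The generic `base_solver_step` stand-in, the way the scalar enters the update, and the least-squares fit are all assumptions of this sketch, not the paper's exact formulation.

```python
import torch

def distill_step_scales(teacher_traj, base_solver_step, denoiser, timesteps):
    """Fit one scalar per student step so a few-step solver matches a
    many-step teacher trajectory at the same times (least squares over a
    batch). `base_solver_step` stands in for any existing ODE solver update
    (DDIM, DPM-Solver, ...); this parameterization is an assumption.

    teacher_traj: teacher states sub-sampled at the student's `timesteps`.
    """
    scales = []
    x = teacher_traj[0]
    for i, (t, t_next) in enumerate(zip(timesteps[:-1], timesteps[1:])):
        update = base_solver_step(x, t, t_next, denoiser) - x   # solver update direction
        target = teacher_traj[i + 1] - x                        # where the teacher ends up
        # closed-form least squares: argmin_s ||s * update - target||^2
        s = (update * target).sum() / (update * update).sum().clamp_min(1e-12)
        scales.append(s.item())
        x = x + s * update                                      # follow the distilled step
    return scales

# toy usage with a dummy Euler "solver" on the scalar ODE dx/dt = -x
def dummy_step(x, t, t_next, denoiser):
    return x + (t_next - t) * denoiser(x, t)

denoiser = lambda x, t: -x
fine_ts = torch.linspace(1.0, 0.0, 41)               # 40-step teacher schedule
x, teacher = torch.ones(8), [torch.ones(8)]
for t, t_next in zip(fine_ts[:-1], fine_ts[1:]):
    x = dummy_step(x, t, t_next, denoiser)
    teacher.append(x)
coarse_ts = fine_ts[::10]                             # 4-step student schedule
scales = distill_step_scales(teacher[::10], dummy_step, denoiser, coarse_ts)
```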

HIC-YOLOv5: Improved YOLOv5 For Small Object Detection

  • paper_url: http://arxiv.org/abs/2309.16393
  • repo_url: https://github.com/Jacoo-ai/HIC-Yolov5
  • paper_authors: Shiyi Tang, Yini Fang, Shu Zhang
  • for: 解决目标检测任务中小目标检测的难题，提升小目标检测精度。
  • methods: 增加专门针对小目标的预测头，在 backbone 与 neck 之间引入 involution 模块以增加特征图的通道信息，并在 backbone 末端应用 CBAM 注意力机制，以强调通道和空间维度上的重要信息。
  • results: 在VisDrone-2019-DET数据集上,HIC-YOLOv5的mAP@[.5:.95]提高6.42%,mAP@0.5提高9.38%.
    Abstract Small object detection has been a challenging problem in the field of object detection. There have been some works that propose improvements for this task, such as adding several attention blocks or changing the whole structure of feature fusion networks. However, the computation cost of these models is large, which makes deploying a real-time object detection system unfeasible, while leaving room for improvement. To this end, an improved YOLOv5 model, HIC-YOLOv5, is proposed to address the aforementioned problems. Firstly, an additional prediction head specific to small objects is added to provide a higher-resolution feature map for better prediction. Secondly, an involution block is adopted between the backbone and neck to increase channel information of the feature map. Moreover, an attention mechanism named CBAM is applied at the end of the backbone, thus not only decreasing the computation cost compared with previous works but also emphasizing the important information in both channel and spatial domain. Our result shows that HIC-YOLOv5 has improved mAP@[.5:.95] by 6.42% and mAP@0.5 by 9.38% on the VisDrone-2019-DET dataset.
    摘要 小目标检测是目标检测领域中的一个难题。已有一些工作针对该任务提出了改进方案，如添加多个注意力模块或修改特征融合网络的结构。然而，这些模型的计算成本较大，使得部署实时目标检测系统难以实现，同时仍留有改进空间。为此，我们提出了一种改进的 YOLOv5 模型：HIC-YOLOv5。首先，增加了专门用于小目标预测的预测头，以提供分辨率更高的特征图，从而获得更好的预测效果。其次，在 backbone 和 neck 之间采用了 involution 块，以增加特征图的通道信息。此外，在 backbone 的末端应用了一种名为 CBAM 的注意力机制，与之前的工作相比不仅降低了计算成本，还强调了通道和空间维度上的重要信息。结果显示，HIC-YOLOv5 在 VisDrone-2019-DET 数据集上将 mAP@[.5:.95] 和 mAP@0.5 分别提高了 6.42% 和 9.38%。
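
For reference, a standard CBAM block (channel attention followed by spatial attention) looks roughly like the following PyTorch sketch. It illustrates the attention mechanism the abstract refers to; it is not the authors' code.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by
    spatial attention (a reference-style implementation for illustration)."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # channel attention: shared MLP over average- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # spatial attention: conv over channel-wise average and max maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)                       # channel attention
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))            # spatial attention

# usage: attach after a backbone stage
feat = torch.randn(2, 256, 32, 32)
out = CBAM(256)(feat)
```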

An Enhanced Low-Resolution Image Recognition Method for Traffic Environments

  • paper_url: http://arxiv.org/abs/2309.16390
  • repo_url: None
  • paper_authors: Zongcai Tan, Zhenhai Gao
  • for: 提高低分辨率图像识别精度
  • methods: 基于残差模块和共享特征子空间算法，提出双分支残差网络结构，并利用中间层特征提升低分辨率图像识别精度，同时采用知识蒸馏降低网络参数量和计算开销。
  • results: 实验验证了该算法在交通环境低分辨率图像识别中的有效性。
    Abstract Currently, low-resolution image recognition is confronted with a significant challenge in the field of intelligent traffic perception. Compared to high-resolution images, low-resolution images suffer from small size, low quality, and lack of detail, leading to a notable decrease in the accuracy of traditional neural network recognition algorithms. The key to low-resolution image recognition lies in effective feature extraction. Therefore, this paper delves into the fundamental dimensions of residual modules and their impact on feature extraction and computational efficiency. Based on experiments, we introduce a dual-branch residual network structure that leverages the basic architecture of residual networks and a common feature subspace algorithm. Additionally, it incorporates the utilization of intermediate-layer features to enhance the accuracy of low-resolution image recognition. Furthermore, we employ knowledge distillation to reduce network parameters and computational overhead. Experimental results validate the effectiveness of this algorithm for low-resolution image recognition in traffic environments.
    摘要 当前，低分辨率图像识别是智能交通感知领域面临的重大挑战。相比高分辨率图像，低分辨率图像尺寸小、质量低、细节缺失，导致传统神经网络识别算法的准确率显著下降。低分辨率图像识别的关键在于有效的特征提取。因此，本文深入探讨了残差模块的基本维度及其对特征提取和计算效率的影响。在实验的基础上，我们提出了一种双分支残差网络结构，它利用残差网络的基本架构和共享特征子空间算法，并利用中间层特征来提高低分辨率图像识别的准确率。此外，我们采用知识蒸馏来降低网络参数量和计算开销。实验结果验证了该算法在交通环境低分辨率图像识别中的有效性。

Biomedical Image Splicing Detection using Uncertainty-Guided Refinement

  • paper_url: http://arxiv.org/abs/2309.16388
  • repo_url: None
  • paper_authors: Xun Lin, Wenzhong Tang, Shuai Wang, Zitong Yu, Yizhong Liu, Haoran Wang, Ying Fu, Alex Kot
  • for: 本文旨在提出一种能够 Mitigating the effects of disruptive factors in biomedical images, such as artifacts, abnormal patterns, and noises, and detecting splicing traces in biomedical images.
  • methods: 本文提出了一种基于 Uncertainty-guided Refinement Network (URN) 的方法,可以显著地减少不可靠信息的传递,从而获得robust特征。此外,URN 还可以在解码阶段对不确定预测的区域进行精度的调整。
  • results: 对三个基准数据集进行了广泛的实验,并证明了提出的方法的superiority。此外,我们还验证了URN 的普适性和对Post-processing Approaches的Robustness。
    Abstract Recently, a surge in biomedical academic publications suspected of image manipulation has led to numerous retractions, turning biomedical image forensics into a research hotspot. While manipulation detectors are concerning, the specific detection of splicing traces in biomedical images remains underexplored. The disruptive factors within biomedical images, such as artifacts, abnormal patterns, and noises, show misleading features like the splicing traces, greatly increasing the challenge for this task. Moreover, the scarcity of high-quality spliced biomedical images also limits potential advancements in this field. In this work, we propose an Uncertainty-guided Refinement Network (URN) to mitigate the effects of these disruptive factors. Our URN can explicitly suppress the propagation of unreliable information flow caused by disruptive factors among regions, thereby obtaining robust features. Moreover, URN enables a concentration on the refinement of uncertainly predicted regions during the decoding phase. Besides, we construct a dataset for Biomedical image Splicing (BioSp) detection, which consists of 1,290 spliced images. Compared with existing datasets, BioSp comprises the largest number of spliced images and the most diverse sources. Comprehensive experiments on three benchmark datasets demonstrate the superiority of the proposed method. Meanwhile, we verify the generalizability of URN when against cross-dataset domain shifts and its robustness to resist post-processing approaches. Our BioSp dataset will be released upon acceptance.
    摘要 近些时间,生物医学图像涂抹的学术论文涌现,引起了许多撤回,使生物医学图像鉴别成为研究热点。然而,图像涂抹检测器在生物医学图像中仍然存在困难。生物医学图像中的干扰因素,如artifacts、异常模式和噪声,会显示涂抹迹象,大大增加了这个任务的挑战。此外,生物医学图像的缺乏高质量拼接图像也限制了这个领域的进展。在这项工作中,我们提出了一种基于不确定性的修正网络(URN),以减少干扰因素的影响。URN可以显式地抑制干扰因素在区域之间的不确定信息流传播,从而获得robust特征。此外,URN在解码阶段可以集中做不确定预测区域的修正。此外,我们构建了一个生物医学图像拼接检测(BioSp)数据集,该数据集包含1290个拼接图像。与现有数据集相比,BioSp数据集包括最多的拼接图像和最多的来源。我们对三个标准数据集进行了全面的实验,并证明了我们提出的方法的优越性。同时,我们验证了URN在跨数据集频率域转移和抗后处理approaches的一致性和可靠性。我们将发布BioSp数据集一旦得到批准。

A Comprehensive Review on Tree Detection Methods Using Point Cloud and Aerial Imagery from Unmanned Aerial Vehicles

  • paper_url: http://arxiv.org/abs/2309.16375
  • repo_url: None
  • paper_authors: Weijie Kuang, Hann Woei Ho, Ye Zhou, Shahrel Azmin Suandi, Farzad Ismail
  • for: 这篇论文主要针对用于树木探测的无人机数据中的点云数据和图像数据进行了分析和评估。
  • methods: 本论文主要分析了使用点云数据进行树木探测的方法，包括使用LiDAR和Digital Aerial Photography（DAP）两种数据来实现树木探测。而使用图像直接进行树木探测的方法则是根据使用深度学习（DL）方法来进行评估。
  • results: 本论文对各种方法的比较和结合进行了分析，并介绍了每种方法的优缺点和应用领域。此外，本论文还统计了在过去几年内使用不同方法进行树木探测的研究数量，并发现到2022年为止，使用DL方法进行图像直接树木探测的研究占总数的45%。因此，本论文可以帮助研究人员在特定的森林中进行树木探测，以及帮助农业生产者使用无人机进行农业资源管理。
    Abstract Unmanned Aerial Vehicles (UAVs) are considered cutting-edge technology with highly cost-effective and flexible usage scenarios. Although many papers have reviewed the application of UAVs in agriculture, the review of the application for tree detection is still insufficient. This paper focuses on tree detection methods applied to data collected by UAVs. There are two kinds of data, the point cloud and the images, which are acquired by the Light Detection and Ranging (LiDAR) sensor and camera, respectively. Among the detection methods using point-cloud data, this paper mainly classifies these methods according to LiDAR and Digital Aerial Photography (DAP). For the detection methods using images directly, this paper reviews these methods by whether or not they use the Deep Learning (DL) method. Our review summarizes and analyses the comparison and combination of LiDAR-based and DAP-based point cloud data. The performance, relative merits, and application fields of the methods are also introduced. Meanwhile, this review counts the number of tree detection studies using different methods in recent years. According to our statistics, image-based detection using DL methods has become the mainstream trend, with DL-based detection studies accounting for 45% of all tree detection studies up to 2022. As a result, this review can help and guide researchers who want to carry out tree detection in specific forests, as well as farmers who use UAVs to manage agricultural production.
    摘要 无人飞行器(UAV)技术被视为当今最先进和最有效的应用领域之一,其应用场景非常多样化。虽然许多论文已经评估了UAV在农业中的应用,但对树木探测的评估仍然不够。这篇论文将关注UAV数据中的树木探测方法。该数据包括点云数据和图像数据,它们分别由激光探测器和摄像头获取。对点云数据进行探测方法,本论文主要分为利用LiDAR和数字空间图像(DAP)两种方法进行分类。对直接使用图像进行探测方法,本论文则根据使用深度学习(DL)方法进行评估。本评估结论和分析了利用LiDAR和DAP两种点云数据的比较和结合,并 introduce了方法的性能、优势和应用领域。同时,本评估还统计了过去几年内tree detection研究中使用不同方法的数量,从而可以帮助研究人员在特定的森林中进行树木探测,以及帮助农民使用UAV进行农业生产管理。

FG-NeRF: Flow-GAN based Probabilistic Neural Radiance Field for Independence-Assumption-Free Uncertainty Estimation

  • paper_url: http://arxiv.org/abs/2309.16364
  • repo_url: None
  • paper_authors: Songlin Wei, Jiazhao Zhang, Yang Wang, Fanbo Xiang, Hao Su, He Wang
  • for: 这个研究旨在创建一个独立假设无关的神经辐射场,以便获取更好的可信度和精确性。
  • methods: 我们提出了一种基于Flow-GAN的独立假设无关的 probabilistic NeRF,通过结合对抗学习和对抗演算的具有创新能力和强大表达能力的正规流。
  • results: 我们通过实验证明了我们的方法可以预测更低的渲染错误和更可靠的不确定性,并且在实验中表现出了独立假设无关的神经辐射场的优秀性能。
    Abstract Neural radiance fields with stochasticity have garnered significant interest by enabling the sampling of plausible radiance fields and quantifying uncertainty for downstream tasks. Existing works rely on the independence assumption of points in the radiance field or the pixels in input views to obtain tractable forms of the probability density function. However, this assumption inadvertently impacts performance when dealing with intricate geometry and texture. In this work, we propose an independence-assumption-free probabilistic neural radiance field based on Flow-GAN. By combining the generative capability of adversarial learning and the powerful expressivity of normalizing flow, our method explicitly models the density-radiance distribution of the whole scene. We represent our probabilistic NeRF as a mean-shifted probabilistic residual neural model. Our model is trained without an explicit likelihood function, thereby avoiding the independence assumption. Specifically, We downsample the training images with different strides and centers to form fixed-size patches which are used to train the generator with patch-based adversarial learning. Through extensive experiments, our method demonstrates state-of-the-art performance by predicting lower rendering errors and more reliable uncertainty on both synthetic and real-world datasets.
    摘要 带有随机性的神经辐射场能够采样合理的辐射场并为下游任务量化不确定性，因而受到广泛关注。现有工作依赖辐射场中各点或输入视图中各像素的独立性假设来获得可处理的概率密度函数，但该假设在处理复杂几何和纹理时会损害性能。在本工作中，我们提出一种基于 Flow-GAN、无需独立性假设的概率神经辐射场。通过结合对抗学习的生成能力与归一化流的强大表达能力，我们的方法显式建模整个场景的密度-辐射分布。我们将概率 NeRF 表示为均值平移的概率残差神经模型，训练过程不需要显式的似然函数，从而避免了独立性假设。具体而言，我们以不同步长和中心对训练图像进行下采样，得到固定尺寸的图像块，并用基于图像块的对抗学习来训练生成器。大量实验表明，我们的方法在合成与真实数据集上都能预测更低的渲染误差和更可靠的不确定性，达到最先进的性能。

Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning

  • paper_url: http://arxiv.org/abs/2309.16351
  • repo_url: https://github.com/mohwald/gandtr
  • paper_authors: Albert Mohwald, Tomas Jenicek, Ondřej Chum
  • for: 提高夜间图像检索性能, addresses the challenge of poor retrieval performance in night-time images with limited training data.
  • methods: 使用 Generative Adversarial Networks (GANs) 生成 synthetic night images, 并将其用于 metric learning 中的数据增强。 提出了一种新的轻量级 GAN 架构, 同时具有对照图像的边 consistency 和同时具有夜夜和白天图像的边检测能力。
  • results: 在标准的 Tokyo 24/7 日夜检索 benchmark 上达到了state-of-the-art Results, 而不需要特定的日夜图像对应的训练集。
    Abstract Image retrieval methods based on CNN descriptors rely on metric learning from a large number of diverse examples of positive and negative image pairs. Domains, such as night-time images, with limited availability and variability of training data suffer from poor retrieval performance even with methods performing well on standard benchmarks. We propose to train a GAN-based synthetic-image generator, translating available day-time image examples into night images. Such a generator is used in metric learning as a form of augmentation, supplying training data to the scarce domain. Various types of generators are evaluated and analyzed. We contribute with a novel light-weight GAN architecture that enforces the consistency between the original and translated image through edge consistency. The proposed architecture also allows a simultaneous training of an edge detector that operates on both night and day images. To further increase the variability in the training examples and to maximize the generalization of the trained model, we propose a novel method of diverse anchor mining. The proposed method improves over the state-of-the-art results on a standard Tokyo 24/7 day-night retrieval benchmark while preserving the performance on Oxford and Paris datasets. This is achieved without the need of training image pairs of matching day and night images. The source code is available at https://github.com/mohwald/gandtr .
    摘要 图像检索方法基于CNN描述符依赖于大量多样化的正例和反例图像对的 metric 学习。如夜间图像频率和多样性受限的领域,使用标准 benchmark 中表现良好的方法仍然存在检索性能低下的问题。我们提议使用 GAN 基于的生成器,将可用的日间图像例子翻译成夜间图像。这种生成器在 metric 学习中用作数据增强,为缺乏训练数据的频道提供了训练数据。我们提出了一种新的轻量级 GAN 架构,通过 Edge 的一致性来保证原始图像和翻译图像之间的一致性。此外,我们还提出了一种同时训练 Edge 检测器,该检测器可以在夜间和日间图像上运行。为了进一步增加训练例子的多样性和最大化模型的泛化性,我们提出了一种新的多样 anchor 挖掘方法。通过这种方法,我们超越了标准 Tokyo 24/7 日夜检索标准准则的现有最佳结果,而不需要训练日夜对应的图像对。此外,我们还保持了在 Oxford 和 Paris 数据集上的性能。这些成果都是基于不需要特定的日夜图像对的训练。代码可以在 上获取。
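
The edge-consistency idea, namely that the translated night image should keep the edge structure of the source day image, can be sketched with a simple loss. Sobel edges stand in for the jointly trained edge detector mentioned in the abstract, and the L1 form is an assumption of this sketch, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Grayscale Sobel edge magnitude; a stand-in for a learned edge detector."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_consistency_loss(day_img, fake_night_img):
    """Penalise differences between the edge maps of the source day image and
    the GAN-translated night image, so the translation changes appearance but
    preserves structure."""
    return F.l1_loss(sobel_edges(fake_night_img), sobel_edges(day_img))

# toy usage; in training this would be added to the adversarial loss, e.g.
# loss = adv_loss + lambda_edge * edge_consistency_loss(day, G(day))
day = torch.rand(1, 3, 64, 64)
fake_night = (0.3 * day).clamp(0, 1)      # stand-in for G(day)
loss = edge_consistency_loss(day, fake_night)
```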

Logarithm-transform aided Gaussian Sampling for Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2309.16337
  • repo_url: https://github.com/ganatra-v/gaussian-sampling-fsl
  • paper_authors: Vaibhav Ganatra
  • for: 这篇论文主要关注的是几shot类别化任务中的表现学习,尤其是使用Gaussian分布来训练分类器。
  • methods: 本论文提出了一新的Gaussian对映方法,可以将实验数据转换为Gaussian-like分布,并且该方法比现有的方法更高效。
  • results: 本论文透过实验显示,使用该新的Gaussian对映方法可以实现更好的几shot类别化性能,并且需要较少的数据训练。
    Abstract Few-shot image classification has recently witnessed the rise of representation learning being utilised for models to adapt to new classes using only a few training examples. Therefore, the properties of the representations, such as their underlying probability distributions, assume vital importance. Representations sampled from Gaussian distributions have been used in recent works, [19] to train classifiers for few-shot classification. These methods rely on transforming the distributions of experimental data to approximate Gaussian distributions for their functioning. In this paper, I propose a novel Gaussian transform, that outperforms existing methods on transforming experimental data into Gaussian-like distributions. I then utilise this novel transformation for few-shot image classification and show significant gains in performance, while sampling lesser data.
    摘要 近年来，小样本（few-shot）图像分类中越来越多地利用表示学习，使模型只需少量训练样本即可适应新类别，因此表示的特性（如其底层概率分布）变得非常重要。近期工作 [19] 使用从高斯分布中采样的表示来训练小样本分类器，这些方法依赖于将实验数据的分布变换为近似高斯分布。在这篇论文中，我提出了一种新的高斯变换，在将实验数据变换为类高斯分布方面优于现有方法。随后，我利用这种新变换进行小样本图像分类，在采样更少数据的情况下取得了显著的性能提升。
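
A rough sketch of the Gaussian-sampling recipe for few-shot classification: transform the support features toward a Gaussian-like distribution, fit a per-class Gaussian, draw synthetic features, and train a linear classifier. The `log1p` transform below is only a stand-in for the paper's proposed transform, and all hyper-parameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def log_transform(x, eps=1e-6):
    """Map non-negative features toward a more Gaussian-like distribution.
    log(1 + x) is a simple stand-in; the paper's exact transform differs."""
    return np.log1p(np.maximum(x, 0.0) + eps)

def sample_augmented_features(support_x, support_y, n_samples=100, ridge=1e-3):
    """Fit a Gaussian per class on transformed support features and draw
    synthetic features (a sketch of Gaussian-sampling-based augmentation)."""
    z = log_transform(support_x)
    xs, ys = [], []
    for c in np.unique(support_y):
        zc = z[support_y == c]
        mean = zc.mean(axis=0)
        cov = np.cov(zc, rowvar=False) + ridge * np.eye(z.shape[1])
        xs.append(np.random.multivariate_normal(mean, cov, n_samples))
        ys.append(np.full(n_samples, c))
    return np.concatenate([z] + xs), np.concatenate([support_y] + ys)

# usage: augment a 5-way 5-shot task, then fit a linear classifier
support_x = np.abs(np.random.randn(25, 64))   # stand-in CNN features
support_y = np.repeat(np.arange(5), 5)
aug_x, aug_y = sample_augmented_features(support_x, support_y)
clf = LogisticRegression(max_iter=1000).fit(aug_x, aug_y)
```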

Weakly-Supervised Video Anomaly Detection with Snippet Anomalous Attention

  • paper_url: http://arxiv.org/abs/2309.16309
  • repo_url: None
  • paper_authors: Yidan Fan, Yongxin Yu, Wenhuan Lu, Yahong Han
  • for: 针对含有异常事件的未经处理视频进行异常检测。
  • methods: 提出了一种异常注意力机制,通过考虑视频帧级别编码特征而不需要pseudo标签。Specifically, our approach first generates snippet-level anomalous attention and then feeds it together with original anomaly scores into a Multi-branch Supervision Module.
  • results: 经验表明,我们的方法可以更好地检测异常事件,并且可以更 preciselly localize异常。Experiments on benchmark datasets XDViolence和UCF-Crime verify the effectiveness of our method.
    Abstract With a focus on abnormal events contained within untrimmed videos, there is increasing interest among researchers in video anomaly detection. Among different video anomaly detection scenarios, weakly-supervised video anomaly detection poses a significant challenge as it lacks frame-wise labels during the training stage, only relying on video-level labels as coarse supervision. Previous methods have made attempts to either learn discriminative features in an end-to-end manner or employ a twostage self-training strategy to generate snippet-level pseudo labels. However, both approaches have certain limitations. The former tends to overlook informative features at the snippet level, while the latter can be susceptible to noises. In this paper, we propose an Anomalous Attention mechanism for weakly-supervised anomaly detection to tackle the aforementioned problems. Our approach takes into account snippet-level encoded features without the supervision of pseudo labels. Specifically, our approach first generates snippet-level anomalous attention and then feeds it together with original anomaly scores into a Multi-branch Supervision Module. The module learns different areas of the video, including areas that are challenging to detect, and also assists the attention optimization. Experiments on benchmark datasets XDViolence and UCF-Crime verify the effectiveness of our method. Besides, thanks to the proposed snippet-level attention, we obtain a more precise anomaly localization.
    摘要 “对于含有异常事件的未裁剪影片,研究人员对影片异常检测存在增加的兴趣。在不同的影片异常检测场景中,弱监督的影片异常检测具有重要挑战,因为它缺乏training阶段Frame-wise标签,只有视频级别标签作为杂质指导。前一些方法尝试了以下两种方法:一是通过端到端学习学习特征,二是使用两个阶段自动训练策略生成剪辑级别的pseudo标签。然而,这两种方法都有一定的局限性。前者容易忽略剪辑级别的有用特征,而后者可能会受到噪音的影响。在这篇论文中,我们提出了一种异常注意力机制,用于弱监督的异常检测。我们的方法首先生成剪辑级别的异常注意力,然后将其与原始异常分数一起 feed into一个多支序监督模块。该模块学习不同的视频区域,包括具有检测挑战的区域,也可以帮助注意力优化。实验表明,我们的方法在XDViolence和UCF-Crime数据集上具有显著效果。此外,由于我们提出的剪辑级别注意力,我们可以更准确地定位异常。”

Can the Query-based Object Detector Be Designed with Fewer Stages?

  • paper_url: http://arxiv.org/abs/2309.16306
  • repo_url: None
  • paper_authors: Jialin Li, Weifu Fu, Yuhuan Lin, Qiang Nie, Yong Liu
  • for: 提高查询基于对象检测器的性能,尝试缩短查询过程中的多stage编码器和解码器。
  • methods: 提出多种改进查询基于对象检测器的技术,并基于这些发现提出一个新模型叫GOLO(全局一次和本地一次),该模型采用了两stage解码器。
  • results: 对COCO数据集进行实验,比较与主流查询基于多stage编码器和解码器的模型,GOLO模型具有较少的decoder stages,仍然能够达到高度的性能。
    Abstract Query-based object detectors have made significant advancements since the publication of DETR. However, most existing methods still rely on multi-stage encoders and decoders, or a combination of both. Despite achieving high accuracy, the multi-stage paradigm (typically consisting of 6 stages) suffers from issues such as heavy computational burden, prompting us to reconsider its necessity. In this paper, we explore multiple techniques to enhance query-based detectors and, based on these findings, propose a novel model called GOLO (Global Once and Local Once), which follows a two-stage decoding paradigm. Compared to other mainstream query-based models with multi-stage decoders, our model employs fewer decoder stages while still achieving considerable performance. Experimental results on the COCO dataset demonstrate the effectiveness of our approach.
    摘要 自 DETR 发表以来，基于查询的目标检测器取得了显著进步。然而，大多数现有方法仍然依赖多阶段编码器和解码器，或两者的组合。尽管实现了高精度，但多阶段范式（通常包含 6 个阶段）存在计算负担沉重等问题，这促使我们重新思考其必要性。在这篇论文中，我们探讨了多种提升基于查询的检测器的技术，并根据这些发现提出了一种名为 GOLO（Global Once and Local Once）的新模型，该模型采用两阶段解码方式。与其他采用多阶段解码器的主流查询式模型相比，我们的模型使用了更少的解码阶段，但仍能实现可观的性能。在 COCO 数据集上的实验结果表明了我们方法的有效性。

Multi-scale Recurrent LSTM and Transformer Network for Depth Completion

  • paper_url: http://arxiv.org/abs/2309.16301
  • repo_url: None
  • paper_authors: Xiaogang Jia, Yusong Tan, Songlei Jian, Yonggang Che
  • for: 这 paper 的目的是提出一种基于 LSTM 和 Transformer 模块的深度完成方法,以实现fficiently fuse 色彩空间和深度空间的特征。
  • methods: 该 paper 使用了 Forget gate、Update gate、Output gate 和 Skip gate 来实现有效的色彩和深度特征融合,并在多尺度进行循环优化。最后,通过多头注意力机制来进一步融合深度特征。
  • results: 实验结果显示,无需复杂的网络结构和后处理步骤,我们的方法可以在一个简单的编码器-解码器网络结构上达到现场的主流自动驾驶 KITTI 数据集的状态码性表现,并且可以作为其他方法的后备网络,也可以达到状态码性表现。
    Abstract Lidar depth completion is a new and hot topic of depth estimation. In this task, it is the key and difficult point to fuse the features of color space and depth space. In this paper, we migrate the classic LSTM and Transformer modules from NLP to depth completion and redesign them appropriately. Specifically, we use Forget gate, Update gate, Output gate, and Skip gate to achieve the efficient fusion of color and depth features and perform loop optimization at multiple scales. Finally, we further fuse the deep features through the Transformer multi-head attention mechanism. Experimental results show that without repetitive network structure and post-processing steps, our method can achieve state-of-the-art performance by adding our modules to a simple encoder-decoder network structure. Our method ranks first on the current mainstream autonomous driving KITTI benchmark dataset. It can also be regarded as a backbone network for other methods, which likewise achieves state-of-the-art performance.
    摘要 激光雷达深度完成是深度估计中一个新兴且热门的方向。该任务的关键与难点在于有效融合颜色空间和深度空间的特征。在这篇论文中，我们将经典的 LSTM 和 Transformer 模块从 NLP 迁移到深度完成领域，并对其进行了适当的重新设计。具体来说，我们使用 Forget gate、Update gate、Output gate 和 Skip gate 来实现颜色与深度特征的高效融合，并在多个尺度上进行循环优化。最后，我们通过 Transformer 多头注意力机制进一步融合深度特征。实验结果显示，无需重复的网络结构和后处理步骤，只需将我们的模块加入简单的编码器-解码器网络结构，即可实现最先进的性能。我们的方法在当前主流的自动驾驶 KITTI 基准数据集上排名第一，同时也可以作为其他方法的骨干网络，同样能取得最先进的性能。
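
A gated colour-depth fusion block in the spirit of the forget/update/output/skip gates described above might look as follows; the concrete wiring is an illustrative guess rather than the paper's module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """LSTM-style gated fusion of colour and depth feature maps. The gate
    layout (forget / update / output / skip) follows the abstract, but the
    exact connections below are assumptions of this sketch."""

    def __init__(self, channels):
        super().__init__()
        def gate():
            return nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                 nn.Sigmoid())
        self.forget_gate = gate()
        self.update_gate = gate()
        self.output_gate = gate()
        self.skip_gate = gate()
        self.candidate = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1),
                                       nn.Tanh())

    def forward(self, color_feat, depth_feat):
        x = torch.cat([color_feat, depth_feat], dim=1)
        # forget part of the sparse depth state, update it with colour-guided content
        state = self.forget_gate(x) * depth_feat + self.update_gate(x) * self.candidate(x)
        out = self.output_gate(x) * torch.tanh(state)
        # the skip gate lets unmodified colour features bypass the recurrence
        return out + self.skip_gate(x) * color_feat

# usage at one scale of an encoder-decoder
rgb, depth = torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64)
fused = GatedFusion(64)(rgb, depth)
```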

GAMMA: Generalizable Articulation Modeling and Manipulation for Articulated Objects

  • paper_url: http://arxiv.org/abs/2309.16264
  • repo_url: None
  • paper_authors: Qiaojun Yu, Junbo Wang, Wenhai Liu, Ce Hao, Liu Liu, Lin Shao, Weiming Wang, Cewu Lu
  • for: 本研究旨在提出一种通用的人工智能模型,可以识别和控制多种不同类型的受拟合物体。
  • methods: 该模型基于PartNet-Mobility数据集进行训练,并采用自适应操作来逐渐减少模型错误并提高操作性能。
  • results: 实验结果表明,该模型在未看过的和跨类型的受拟合物体中表现出色,超过了现有的艺术骨骼模型和抓取算法的性能。
    Abstract Articulated objects like cabinets and doors are widespread in daily life. However, directly manipulating 3D articulated objects is challenging because they have diverse geometrical shapes, semantic categories, and kinetic constraints. Prior works mostly focused on recognizing and manipulating articulated objects with specific joint types. They can either estimate the joint parameters or distinguish suitable grasp poses to facilitate trajectory planning. Although these approaches have succeeded in certain types of articulated objects, they lack generalizability to unseen objects, which significantly impedes their application in broader scenarios. In this paper, we propose a novel framework of Generalizable Articulation Modeling and Manipulating for Articulated Objects (GAMMA), which learns both articulation modeling and grasp pose affordance from diverse articulated objects with different categories. In addition, GAMMA adopts adaptive manipulation to iteratively reduce the modeling errors and enhance manipulation performance. We train GAMMA with the PartNet-Mobility dataset and evaluate with comprehensive experiments in SAPIEN simulation and real-world Franka robot. Results show that GAMMA significantly outperforms SOTA articulation modeling and manipulation algorithms in unseen and cross-category articulated objects. We will open-source all codes and datasets in both simulation and real robots for reproduction in the final version. Images and videos are published on the project website at: http://sites.google.com/view/gamma-articulation
    摘要 每天生活中都可以找到具有多个 JOINT 的物体,如橱柜和门。然而,直接操作这些三维具有多 JOINT 的物体是困难的,因为它们具有多样的几何形态、 semantic 类别和动力约束。先前的研究主要集中在特定 JOINT 类型上recognize和操作具有多 JOINT 的物体。它们可以是估算 JOINT 参数或 distinguishing 适合的抓取姿势,以便进行 trajectory 规划。虽然这些方法在某些类型的具有多 JOINT 的物体上达到了一定的成功,但它们缺乏对未看到的物体的一般化,这会很大程度地限制它们在更广泛的场景中的应用。在这篇论文中,我们提出了一种名为 Generalizable Articulation Modeling and Manipulating for Articulated Objects (GAMMA) 的新框架。GAMMA 会学习具有多 JOINT 的物体的 Connection 模型和抓取姿势的可行性。此外,GAMMA 采用了适应的操作来逐步减少模型错误和提高操作性能。我们在 PartNet-Mobility 数据集上训练 GAMMA,并通过在 SAPIEN 模拟和实际 Franka 机器人中进行了广泛的实验。结果显示,GAMMA 在未看到和跨类别的具有多 JOINT 的物体上明显超越了当前的 Connection 模型和操作算法。我们将在最终版本中公布所有代码和数据集,并在实际机器人和模拟中进行了重复。图片和视频已经在项目网站上发布:http://sites.google.com/view/gamma-articulation。

FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding

  • paper_url: http://arxiv.org/abs/2309.16249
  • repo_url: https://github.com/pxiangwu/forb
  • paper_authors: Pengxiang Wu, Siman Wang, Kevin Dela Rosa, Derek Hao Hu
  • for: 本研究旨在提供一个新的图像搜寻比较benchmark,以评估图像嵌入优化的质量。
  • methods: 本研究使用了多种 represnetation learning 方法,包括 LSTM、CNN 和 Siamese 网络。
  • results: 本研究发现,不同的搜寻策略在不同的图像类别上的表现有所不同,并且提出了一个新的图像搜寻benchmark(FORB),以测试图像嵌入优化的质量。
    Abstract Image retrieval is a fundamental task in computer vision. Despite recent advances in this field, many techniques have been evaluated on a limited number of domains, with a small number of instance categories. Notably, most existing works only consider domains like 3D landmarks, making it difficult to generalize the conclusions made by these works to other domains, e.g., logo and other 2D flat objects. To bridge this gap, we introduce a new dataset for benchmarking visual search methods on flat images with diverse patterns. Our flat object retrieval benchmark (FORB) supplements the commonly adopted 3D object domain, and more importantly, it serves as a testbed for assessing the image embedding quality on out-of-distribution domains. In this benchmark we investigate the retrieval accuracy of representative methods in terms of candidate ranks, as well as matching score margin, a viewpoint which is largely ignored by many works. Our experiments not only highlight the challenges and rich heterogeneity of FORB, but also reveal the hidden properties of different retrieval strategies. The proposed benchmark is a growing project and we expect to expand in both quantity and variety of objects. The dataset and supporting codes are available at https://github.com/pxiangwu/FORB/.
    摘要 图像检索是计算机视觉中的基本任务。尽管最近几年该领域取得了不少进展，但许多技术只在数量有限的领域和少量实例类别上进行了评估。值得注意的是，大多数已有工作只考虑 3D 地标等领域，这使得其结论很难推广到其他领域，例如 logo 等 2D 平面对象。为了弥合这一差距，我们提出了一个新的数据集，用于在具有多样化图案的平面图像上对视觉搜索方法进行基准测试。我们的平面对象检索基准（FORB）补充了常用的 3D 对象领域，更重要的是，它可以作为评估图像嵌入在分布外领域质量的测试床。在这个基准中，我们从候选排名以及匹配分数间隔（许多工作忽视的视角）两方面考察了代表性方法的检索精度。我们的实验不仅揭示了 FORB 的挑战性和丰富的异质性，还揭示了不同检索策略的隐藏特性。该基准是一个持续扩展的项目，我们预计在对象数量和种类上继续扩充。数据集和相关代码可在 https://github.com/pxiangwu/FORB/ 获取。
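
Evaluating retrieval by candidate rank and matching score margin, as the benchmark does, can be computed per query along these lines. Cosine similarity and the exact margin definition are assumptions of this sketch, not FORB's official protocol.

```python
import numpy as np

def rank_and_margin(query_emb, gallery_embs, positive_ids, gallery_ids):
    """For one query, return (a) the best rank of any correct gallery item and
    (b) the score margin between the best positive and best negative match."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                                   # (N,) similarity scores
    order = np.argsort(-scores)                      # best match first
    is_pos = np.isin(gallery_ids, positive_ids)
    best_rank = int(np.argmax(is_pos[order])) + 1    # 1-indexed rank of first positive
    margin = scores[is_pos].max() - scores[~is_pos].max()
    return best_rank, margin

# toy usage with random embeddings
gallery = np.random.randn(100, 128)
query = gallery[3] + 0.1 * np.random.randn(128)      # query near gallery item 3
rank, margin = rank_and_margin(query, gallery, positive_ids=[3],
                               gallery_ids=np.arange(100))
```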

Object Motion Guided Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2309.16237
  • repo_url: None
  • paper_authors: Jiaman Li, Jiajun Wu, C. Karen Liu
  • for: This paper is written for the purpose of full-body human motion synthesis for the manipulation of large-sized objects, with applications in character animation, embodied AI, VR/AR, and robotics.
  • methods: The proposed method, called Object MOtion guided human MOtion synthesis (OMOMO), is a conditional diffusion framework that uses two separate denoising processes to generate full-body manipulation behaviors from only the object motion. The method employs hand positions as an intermediate representation to explicitly enforce contact constraints, resulting in more physically plausible manipulation motions.
  • results: The proposed pipeline is demonstrated to be effective through extensive experiments, and is shown to generalize well to unseen objects. Additionally, a large-scale dataset consisting of 3D object geometry, object motion, and human motion is collected, which contains human-object interaction motion for 15 objects with a total duration of approximately 10 hours.
    Abstract Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.
    摘要 人类行为模拟在受环境中有广泛的应用,包括人物动画、身体AI、VR/AR和 робо工程。在实际情况下,人类经常与环境交互,并使用不同的物体来完成日常任务。在这种工作中,我们研究了大型物体摆动的全身人类动作合成问题。我们提出了物体动作引导人体动作合成(OMOMO),一种条件扩散框架,可以通过只有物体动作来生成全身摆动行为。由于直接应用扩散模型无法准确实施手和物体之间的接触约束,OMOMO学习了两个分离的排除过程,先预测手部位于物体动作,然后基于预测的手部位进行全身姿态合成。通过使用手部位作为两个排除过程之间的中间表示,我们可以直接实施接触约束,从而生成更加物理可能的摆动姿态。我们提出的管道可以通过简单地将手机附加到被摆动的物体来捕捉全身人类摆动姿态。通过广泛的实验,我们证明了我们的提出的管道的效iveness和其能够泛化到未看到的物体。此外,由于高质量的人类-物体交互动作数据罕见,我们收集了一个大规模的数据集,包括3D物体几何、物体动作和人体动作。我们的数据集包含15种物体的人类-物体交互动作,总持续时间约10小时。

Off-the-shelf bin picking workcell with visual pose estimation: A case study on the world robot summit 2018 kitting task

  • paper_url: http://arxiv.org/abs/2309.16221
  • repo_url: None
  • paper_authors: Frederik Hagelskjær, Kasper Høj Lorenzen, Dirk Kraft
  • for: 本研究旨在提高机器人的搬运灵活性,通过视觉位置估算技术和新感知器来实现箱内物品拾取。
  • methods: 本研究使用了新的视觉感知器和位置估算算法,实现了箱内物品的位置估算。同时,我们还实现了一个工作站来完成箱内物品拾取,使用力基 grasping方法。
  • results: 我们在2018年世界机器人大会组装挑战中测试了我们的设备,并取得了所有参赛队伍中最高分。这说明当今技术已经可以在比之前的水平上进行箱内物品拾取。
    Abstract The World Robot Summit 2018 Assembly Challenge included four different tasks. The kitting task, which required bin-picking, was the task in which the fewest points were obtained. However, bin-picking is a vital skill that can significantly increase the flexibility of robotic set-ups, and is, therefore, an important research field. In recent years advancements have been made in sensor technology and pose estimation algorithms. These advancements allow for better performance when performing visual pose estimation. This paper shows that by utilizing new vision sensors and pose estimation algorithms, pose estimation in bins can be performed successfully. We also implement a workcell for bin picking along with a force-based grasping approach to perform the complete bin picking. Our set-up is tested on the World Robot Summit 2018 Assembly Challenge and successfully obtains a higher score compared with all teams at the competition. This demonstrates that current technology can perform bin-picking at a much higher level compared with previous results.
    摘要 2018 年世界机器人峰会装配挑战赛包含四个不同的任务。其中需要料箱拾取（bin-picking）的 kitting 任务是得分最低的任务。然而，料箱拾取是一项能够显著提升机器人系统灵活性的重要技能，因此是一个重要的研究领域。近年来，传感器技术和位姿估计算法都取得了进展，使视觉位姿估计的性能得以提升。本文表明，利用新的视觉传感器和位姿估计算法，可以成功地在料箱中进行位姿估计。我们还实现了一个用于料箱拾取的工作站，并采用基于力的抓取方法来完成完整的料箱拾取流程。我们的系统在 2018 年世界机器人峰会装配挑战赛上进行了测试，取得了高于比赛中所有参赛队伍的得分。这表明当前技术已能够在远高于以往结果的水平上完成料箱拾取。

GAFlow: Incorporating Gaussian Attention into Optical Flow

  • paper_url: http://arxiv.org/abs/2309.16217
  • repo_url: https://github.com/la30/gaflow
  • paper_authors: Ao Luo, Fan Yang, Xin Li, Lang Nie, Chunyu Lin, Haoqiang Fan, Shuaicheng Liu
  • for: 提高计算机视觉中的运动场景分析精度,特别是在图像序列中捕捉和跟踪运动的问题。
  • methods: 使用 Gaussian Attention 技术,包括 Gaussian-Constrained Layer (GCL) 和 Gaussian-Guided Attention Module (GGAM),来强调本地特征和运动相关性。
  • results: 在标准的计算机视觉数据集上进行了广泛的实验,并显示了提高的表现,包括在评估能力和在线测试中的优异表现。
    Abstract Optical flow, or the estimation of motion fields from image sequences, is one of the fundamental problems in computer vision. Unlike most pixel-wise tasks that aim at achieving consistent representations of the same category, optical flow raises extra demands for obtaining local discrimination and smoothness, which yet is not fully explored by existing approaches. In this paper, we push Gaussian Attention (GA) into the optical flow models to accentuate local properties during representation learning and enforce the motion affinity during matching. Specifically, we introduce a novel Gaussian-Constrained Layer (GCL) which can be easily plugged into existing Transformer blocks to highlight the local neighborhood that contains fine-grained structural information. Moreover, for reliable motion analysis, we provide a new Gaussian-Guided Attention Module (GGAM) which not only inherits properties from Gaussian distribution to instinctively revolve around the neighbor fields of each point but also is empowered to put the emphasis on contextually related regions during matching. Our fully-equipped model, namely Gaussian Attention Flow network (GAFlow), naturally incorporates a series of novel Gaussian-based modules into the conventional optical flow framework for reliable motion analysis. Extensive experiments on standard optical flow datasets consistently demonstrate the exceptional performance of the proposed approach in terms of both generalization ability evaluation and online benchmark testing. Code is available at https://github.com/LA30/GAFlow.
    摘要 Computer vision 中的一个基本问题是图像序列中的运动场景估计(optical flow)。与大多数像素级任务不同,运动场景估计需要同时获得本地特征和流畅性,这些要求尚未由现有方法充分探索。在这篇论文中,我们将把Gaussian Attention(GA)技术应用于运动场景估计模型,以便在学习 represencing 过程中强调本地特征,并在匹配过程中保持运动相互关系。 Specifically, we introduce a novel Gaussian-Constrained Layer(GCL),可以轻松地插入到现有的Transformer块中,以高亮本地区域的细腻结构信息。此外,为了确保可靠的运动分析,我们提出了一新的 Gaussian-Guided Attention Module(GGAM),不仅继承了Gaussian分布的特性,以便自然地绕着每个点的邻近场景循环,而且可以在匹配过程中强调上下文相关的区域。我们的全套模型,即Gaussian Attention Flow网络(GAFlow),自然地将一系列基于Gaussian分布的模块集成到了传统的运动场景估计框架中,以确保可靠的运动分析。经验表明,我们的方法在标准的运动场景估计数据集上表现出了出色的一致性和在线测试中的稳定性。代码可以在https://github.com/LA30/GAFlow中找到。
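
The core idea of a Gaussian locality prior on attention can be sketched as below: attention logits over a 2D token grid are penalised by squared spatial distance, so each point mostly attends to its neighbourhood. This is a generic illustration in the spirit of GCL/GGAM, not the paper's exact modules.

```python
import torch

def gaussian_attention(q, k, v, coords, sigma=4.0):
    """Self-attention over a 2D grid of tokens with a Gaussian locality bias.

    q, k, v: (N, C) token features;  coords: (N, 2) token (x, y) positions.
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.t()) * scale                               # (N, N) content term
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    logits = logits - d2 / (2 * sigma ** 2)                    # Gaussian spatial prior
    return torch.softmax(logits, dim=-1) @ v

# usage on an 8x8 feature map flattened to 64 tokens
h = w = 8
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float()
feat = torch.randn(h * w, 32)
out = gaussian_attention(feat, feat, feat, coords)
```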

Abdominal multi-organ segmentation in CT using Swinunter

  • paper_url: http://arxiv.org/abs/2309.16210
  • repo_url: None
  • paper_authors: Mingjin Chen, Yongkang He, Yongyi Lu
  • for: 验证深度学习模型在 computed tomography(CT)图像中进行腹部多器官分割的可能性,以便检测疾病和规划治疗。
  • methods: 使用 transformer-based 模型进行训练,以便更好地处理复杂背景和不同器官大小的问题。
  • results: 在公共验证集上获得了可接受的结果和推理时间,表明 transformer-based 模型在这些任务中可以达到更高的性能。
    Abstract Abdominal multi-organ segmentation in computed tomography (CT) is crucial for many clinical applications including disease detection and treatment planning. Deep learning methods have shown unprecedented performance in this regard. However, it is still quite challenging to accurately segment different organs utilizing a single network due to the vague boundaries of organs, the complex background, and the substantially different organ size scales. In this work we used a transformer-based model for training. It was found through previous years' competitions that basically all of the top 5 methods used CNN-based methods, which is likely because the limited data volume prevents transformer-based methods from being used to full advantage. The thousands of samples in this competition may enable the transformer-based model to achieve better results. The results on the public validation set also show that the transformer-based model can achieve an acceptable result and inference time.
    摘要 计算机断层扫描（CT）中的腹部多器官分割对疾病检测和治疗规划等许多临床应用至关重要。深度学习方法在这方面展现出了前所未有的性能。然而，由于器官边界模糊、背景复杂以及器官尺度差异巨大，使用单一网络准确分割不同器官仍然相当具有挑战性。在这项工作中，我们使用基于 transformer 的模型进行训练。从往届竞赛来看，前五名方法基本上都采用基于 CNN 的方法，这很可能是因为数据量不足使得基于 transformer 的方法难以充分发挥优势。本次竞赛提供的数千个样本或许能让基于 transformer 的模型取得更好的结果。公共验证集上的结果也表明，基于 transformer 的模型可以取得可接受的结果和推理时间。

Nonconvex third-order Tensor Recovery Based on Logarithmic Minimax Function

  • paper_url: http://arxiv.org/abs/2309.16208
  • repo_url: None
  • paper_authors: Hongbing Zhang
  • for: 本研究旨在提出一种新的对数最小最大函数(LM),用于保护大对数值 while 强制小对数值处加以惩罚。
  • methods: 该函数基于非 convex relaxation 方法,并定义了一种加重tensor LM norm的重量tensor LM norm。
  • results: 实验表明,提出的方法可以在不同的实际数据集上具有更高的完teness和精度,并且比预先状态艺术方法(EMLCP)更高。
    Abstract Recent research has shown that non-convex relaxation based low-rank tensor recovery has gained extensive attention. In this context, we propose a new Logarithmic Minimax (LM) function. The comparative analysis between the LM function and the Logarithmic, Minimax concave penalty (MCP), and Minimax Logarithmic concave penalty (MLCP) functions reveals that the proposed function can protect large singular values while imposing stronger penalization on small singular values. Based on this, we define a weighted tensor LM norm as a non-convex relaxation for the tensor tubal rank. Subsequently, we propose the TLM-based low-rank tensor completion (LRTC) model and the TLM-based tensor robust principal component analysis (TRPCA) model, respectively. Furthermore, we provide theoretical convergence guarantees for the proposed methods. Comprehensive experiments were conducted on various real datasets, and a comparative analysis was made with the related EMLCP method. The results demonstrate that the proposed method outperforms the state-of-the-art methods.
    摘要 近期研究表明，基于非凸松弛的低秩张量恢复受到了广泛关注。在此背景下，我们提出了一种新的对数极小极大（LM）函数。与 Logarithmic、极小极大凹罚（MCP）以及极小极大对数凹罚（MLCP）函数的比较分析表明，所提函数能够在保护大奇异值的同时对小奇异值施加更强的惩罚。在此基础上，我们定义了加权张量 LM 范数，作为张量管秩（tubal rank）的非凸松弛。随后，我们分别提出了基于 TLM 的低秩张量补全（LRTC）模型和基于 TLM 的张量鲁棒主成分分析（TRPCA）模型，并为所提方法提供了理论收敛保证。我们在多个真实数据集上进行了全面实验，并与相近的 EMLCP 方法进行了对比，结果表明所提方法优于现有的最先进方法。
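
The general principle, protecting large singular values while strongly penalising small ones, can be illustrated with a weighted singular-value shrinkage under a simple logarithmic surrogate. The paper's actual LM function and its tensor (t-SVD) extension are not reproduced here; the matrix sketch below only conveys the shape of the idea.

```python
import numpy as np

def log_penalty_shrink(M, lam=1.0, gamma=1.0):
    """Singular-value shrinkage under a logarithmic surrogate penalty
    phi(s) = log(1 + s / gamma): large singular values get small weights
    (little shrinkage) while small ones are penalised strongly.
    log(1 + s/gamma) is a stand-in, not the paper's LM function."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    weights = 1.0 / (gamma + s)                  # derivative of the log surrogate
    s_shrunk = np.maximum(s - lam * weights, 0)  # weighted soft-thresholding
    return (U * s_shrunk) @ Vt

# usage: denoise a noisy low-rank matrix slice
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))
noisy = low_rank + 0.3 * rng.standard_normal((50, 50))
recovered = log_penalty_shrink(noisy, lam=2.0, gamma=1.0)
```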

Parameter-Saving Adversarial Training: Reinforcing Multi-Perturbation Robustness via Hypernetworks

  • paper_url: http://arxiv.org/abs/2309.16207
  • repo_url: None
  • paper_authors: Huihui Gong, Minjing Dong, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
  • for: 防御强制攻击(Adversarial Training)是一种最受欢迎并且最有效的方法来防御强制攻击。
  • methods: 我们提出了一种新的多种攻击perturbation adversarial training框架,即参数节省鲁棒训练(PSAT),以提高多种攻击perturbation的鲁棒性,同时具有节省参数的优点。
  • results: 我们对各种最新的攻击方法进行了广泛的评估和比较,并证明了我们的提出方法在不同的数据集上具有最佳的鲁棒性质量和参数节省的优势,例如在CIFAR-10数据集上,使用ResNet-50作为背景网络,PSAT可以将参数量减少约80%,同时保持最佳的鲁棒性质量。
    Abstract Adversarial training serves as one of the most popular and effective methods to defend against adversarial perturbations. However, most defense mechanisms only consider a single type of perturbation while various attack methods might be adopted to perform stronger adversarial attacks against the deployed model in real-world scenarios, e.g., $\ell_2$ or $\ell_\infty$. Defending against various attacks can be a challenging problem since multi-perturbation adversarial training and its variants only achieve suboptimal robustness trade-offs, due to the theoretical limit to multi-perturbation robustness for a single model. Besides, it is impractical to deploy large models in some storage-efficient scenarios. To settle down these drawbacks, in this paper we propose a novel multi-perturbation adversarial training framework, parameter-saving adversarial training (PSAT), to reinforce multi-perturbation robustness with an advantageous side effect of saving parameters, which leverages hypernetworks to train specialized models against a single perturbation and aggregate these specialized models to defend against multiple perturbations. Eventually, we extensively evaluate and compare our proposed method with state-of-the-art single/multi-perturbation robust methods against various latest attack methods on different datasets, showing the robustness superiority and parameter efficiency of our proposed method, e.g., for the CIFAR-10 dataset with ResNet-50 as the backbone, PSAT saves approximately 80\% of parameters with achieving the state-of-the-art robustness trade-off accuracy.
    摘要 对抗训练是目前最流行且最有效的对抗扰动防御方法之一。然而，大多数防御机制只考虑单一类型的扰动，而在真实场景中，攻击者可能采用多种攻击方式（如 $\ell_2$ 或 $\ell_\infty$）对部署的模型发起更强的对抗攻击。防御多种攻击是一个具有挑战性的问题，因为受单一模型多扰动鲁棒性理论上限的制约，多扰动对抗训练及其变体只能达到次优的鲁棒性权衡。此外，在某些存储受限的场景中，部署大型模型并不现实。为了解决这些缺点，我们在这篇论文中提出了一种新的多扰动对抗训练框架——参数节省对抗训练（PSAT），它利用超网络（hypernetworks）针对单一扰动训练专门化模型，再将这些专门化模型聚合起来以防御多种扰动，从而在增强多扰动鲁棒性的同时节省参数。最后，我们在不同数据集上针对多种最新的攻击方法，对所提方法与最先进的单/多扰动鲁棒方法进行了广泛的评估和比较，结果表明我们的方法在鲁棒性和参数效率方面均具有优势。例如，在 CIFAR-10 数据集上以 ResNet-50 为骨干网络时，PSAT 可节省约 80% 的参数，同时达到最先进的鲁棒性权衡精度。
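
A minimal sketch of the hypernetwork ingredient: one shared hypernetwork maps a perturbation-type embedding to the weights of a specialised head, so specialised models share parameters instead of being stored separately. The architecture below is illustrative only and far simpler than PSAT itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """A linear layer whose weights are generated by a hypernetwork from a
    perturbation-type embedding, yielding one specialised head per attack
    type (e.g. l_inf, l_2) from a single set of parameters."""

    def __init__(self, in_dim, out_dim, num_perturbations, embed_dim=16):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.embed = nn.Embedding(num_perturbations, embed_dim)
        self.hyper = nn.Linear(embed_dim, out_dim * in_dim + out_dim)

    def forward(self, x, perturbation_id):
        params = self.hyper(self.embed(perturbation_id))        # generated params
        w = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim:]
        return F.linear(x, w, b)

# usage: the same module produces an l_inf-specialised and an l_2-specialised head
head = HyperLinear(in_dim=512, out_dim=10, num_perturbations=2)
feat = torch.randn(4, 512)
logits_linf = head(feat, torch.tensor(0))
logits_l2 = head(feat, torch.tensor(1))
```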

Alzheimer’s Disease Prediction via Brain Structural-Functional Deep Fusing Network

  • paper_url: http://arxiv.org/abs/2309.16206
  • repo_url: None
  • paper_authors: Qiankun Zuo, Junren Pan, Shuqiang Wang
  • for: 这篇论文旨在开发一个能够有效融合多模态神经影像的模型，以分析阿尔茨海默症（AD）的病情恶化。
  • methods: 论文提出了一个名为 cross-modal transformer generative adversarial network（CT-GAN）的新模型，可以从功能磁共振成像（fMRI）和弥散张量成像（DTI）中获取功能与结构信息，并将其融合为统一的多模态连接。
  • results: 在 ADNI 数据集上的评估表明，CT-GAN 模型可以显著提高预测性能，并有效检测 AD 相关的脑区；此外，模型也为检测 AD 相关的异常神经回路提供了新的视角。
    Abstract Fusing structural-functional images of the brain has shown great potential to analyze the deterioration of Alzheimer's disease (AD). However, it is a big challenge to effectively fuse the correlated and complementary information from multimodal neuroimages. In this paper, a novel model termed cross-modal transformer generative adversarial network (CT-GAN) is proposed to effectively fuse the functional and structural information contained in functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI). The CT-GAN can learn topological features and generate multimodal connectivity from multimodal imaging data in an efficient end-to-end manner. Moreover, the swapping bi-attention mechanism is designed to gradually align common features and effectively enhance the complementary features between modalities. By analyzing the generated connectivity features, the proposed model can identify AD-related brain connections. Evaluations on the public ADNI dataset show that the proposed CT-GAN can dramatically improve prediction performance and detect AD-related brain regions effectively. The proposed model also provides new insights for detecting AD-related abnormal neural circuits.
    摘要 综合结构功能图像融合技术已经在分析阿尔茨海默病(AD)的衰退方面显示了巨大的潜力。然而,是一个大的挑战来有效地融合多modal的脑图像信息。在这篇论文中,一种新的模型称为交叉模态变换生成敌对网络(CT-GAN)被提出,以有效地融合功能和结构信息,包括功能磁共振成像(fMRI)和扩散tensor成像(DTI)。CT-GAN可以学习 topological特征并生成多modal的连接性,从多modal成像数据中获得有效的终端到端方式。此外,交换双注意机制被设计来逐渐对共同特征进行对齐,以实现多modal之间的增强。通过分析生成的连接特征,提出的模型可以识别AD相关的脑连接。评估ADNI数据集显示,提出的CT-GAN可以明显提高预测性能和有效地检测AD相关的脑区域。此外,该模型还提供了新的检测AD相关异常神经Circuit的新视角。

DiffGAN-F2S: Symmetric and Efficient Denoising Diffusion GANs for Structural Connectivity Prediction from Brain fMRI

  • paper_url: http://arxiv.org/abs/2309.16205
  • repo_url: None
  • paper_authors: Qiankun Zuo, Ruiheng Li, Yi Di, Hao Tian, Changhong Jing, Xuhang Chen, Shuqiang Wang
  • for: 本研究旨在提出一种基于Diffusion Generative Adversarial Network(DiffGAN)的FMRI-to-SC(fMRI-to-structural connectivity)模型,以便在综合性脑网络融合中提取多Modal brain network中的可能性biomarkers。
  • methods: 该模型基于denoising diffusion probabilistic models(DDPMs)和对抗学习,可以很快地从fMRI中生成高精度的SC,并通过设计双通道多头空间注意力(DMSA)和图像 convolutional module来捕捉全球和本地脑区域之间的关系。
  • results: 在ADNI dataset上测试,该模型可以高效地生成empirical SC-preserved connectivity,并与其他相关模型相比显示出superior的SC预测性能。此外,该模型还可以确定大多数重要的脑区域和连接,提供一种替代的方式来融合多modal brain networks和分析临床疾病。
    Abstract Mapping from functional connectivity (FC) to structural connectivity (SC) can facilitate multimodal brain network fusion and discover potential biomarkers for clinical implications. However, it is challenging to directly bridge the reliable non-linear mapping relations between SC and functional magnetic resonance imaging (fMRI). In this paper, a novel diffusision generative adversarial network-based fMRI-to-SC (DiffGAN-F2S) model is proposed to predict SC from brain fMRI in an end-to-end manner. To be specific, the proposed DiffGAN-F2S leverages denoising diffusion probabilistic models (DDPMs) and adversarial learning to efficiently generate high-fidelity SC through a few steps from fMRI. By designing the dual-channel multi-head spatial attention (DMSA) and graph convolutional modules, the symmetric graph generator first captures global relations among direct and indirect connected brain regions, then models the local brain region interactions. It can uncover the complex mapping relations between fMRI and structural connectivity. Furthermore, the spatially connected consistency loss is devised to constrain the generator to preserve global-local topological information for accurate intrinsic SC prediction. Testing on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the proposed model can effectively generate empirical SC-preserved connectivity from four-dimensional imaging data and shows superior performance in SC prediction compared with other related models. Furthermore, the proposed model can identify the vast majority of important brain regions and connections derived from the empirical method, providing an alternative way to fuse multimodal brain networks and analyze clinical disease.
    摘要 fc可以映射到结构连接(sc),可以促进多Modal脑网络融合和发现临床应用中的生物标志物。然而,直接从 Functional Magnetic Resonance Imaging(fMRI)到SC的可靠非线性映射关系是挑战。在本文中,一种基于Diffusion Generative Adversarial Network(DiffGAN)的fMRI-to-SC(DiffGAN-F2S)模型被提出,可以在端到端方式 Predict SC from brain fMRI。具体来说,提案的DiffGAN-F2S模型利用了Denosing Diffusion Probabilistic Models(DDPMs)和对抗学习来快速生成高品质SC,只需几步。通过设计双通道多头空间注意力(DMSA)和图像演算模块,对称图生成器首先捕捉了全脑区域之间的全球关系,然后模型了本地脑区域之间的交互。这可以揭示出fMRI和SC之间的复杂映射关系。此外,用空间连接一致损失来约束生成器保持全球-本地 topological信息的准确SC预测。在公共Alzheimer's Disease Neuroimaging Initiative(ADNI)数据集上测试,提案的模型可以高效地从四维图像数据中生成empirical SC-保持连接,并显示与其他相关模型相比表现出色。此外,提案的模型还可以识别大多数重要的脑区域和连接,提供一种代替方式来融合多Modal脑网络并分析临床疾病。

Cloth2Body: Generating 3D Human Body Mesh from 2D Clothing

  • paper_url: http://arxiv.org/abs/2309.16189
  • repo_url: None
  • paper_authors: Lu Dai, Liqian Ma, Shenhan Qian, Hao Liu, Ziwei Liu, Hui Xiong
  • for: 这篇论文的目标是从 2D 服装图像生成 3D 人体网格。
  • methods: 论文提出了一个端到端框架，包含人体姿态估计、身体形态估计和姿态生成方法。首先利用运动学感知的姿态估计来估计人体姿态参数，并借助 3D 骨架和逆运动学（inverse kinematics）模块提升估计精度；随后提出一种自适应深度技巧，通过解耦物体尺寸与相机外参的影响，使重投影的 3D 网格与 2D 服装图像更好对齐；最后设计了基于进化的姿态生成方法，在推理时生成多样且合理的姿态。
  • results: 在合成与真实数据上的实验表明，该方法能够从 2D 服装图像中恢复自然且多样、与服装良好对齐的 3D 人体网格。
    Abstract In this paper, we define and study a new Cloth2Body problem which has a goal of generating 3D human body meshes from a 2D clothing image. Unlike the existing human mesh recovery problem, Cloth2Body needs to address new and emerging challenges raised by the partial observation of the input and the high diversity of the output. Indeed, there are three specific challenges. First, how to locate and pose human bodies into the clothes. Second, how to effectively estimate body shapes out of various clothing types. Finally, how to generate diverse and plausible results from a 2D clothing image. To this end, we propose an end-to-end framework that can accurately estimate 3D body mesh parameterized by pose and shape from a 2D clothing image. Along this line, we first utilize Kinematics-aware Pose Estimation to estimate body pose parameters. 3D skeleton is employed as a proxy followed by an inverse kinematics module to boost the estimation accuracy. We additionally design an adaptive depth trick to align the re-projected 3D mesh better with 2D clothing image by disentangling the effects of object size and camera extrinsic. Next, we propose Physics-informed Shape Estimation to estimate body shape parameters. 3D shape parameters are predicted based on partial body measurements estimated from RGB image, which not only improves pixel-wise human-cloth alignment, but also enables flexible user editing. Finally, we design Evolution-based pose generation method, a skeleton transplanting method inspired by genetic algorithms to generate diverse reasonable poses during inference. As shown by experimental results on both synthetic and real-world data, the proposed framework achieves state-of-the-art performance and can effectively recover natural and diverse 3D body meshes from 2D images that align well with clothing.
    摘要 在这篇论文中,我们定义并研究了一个新的Cloth2Body问题,该问题的目标是从2D服装图像中生成3D人体几何体。与现有的人体几何恢复问题不同,Cloth2Body需要解决一些新的和兴起的挑战,包括:首先,如何将人体部着到服装上。第二,如何有效地从不同类型的服装中估计身体形态。最后,如何从2D服装图像中生成多样化和合理的结果。为此,我们提议一个端到端框架,可以准确地估计3D人体几何参数化 pose和形态从2D服装图像。在这个框架中,我们首先利用了 Kinematics-aware Pose Estimation,来估计人体姿态参数。然后,我们使用3D骨架作为代理,并使用反 Kinematics 模块来提高估计精度。此外,我们还设计了一种适应深度技巧,用于将重projected 3D网格更好地与2D服装图像相对应。接下来,我们提出了物理学习形态估计方法,用于估计身体形态参数。在RGB图像中估计部分身体测量值,可以不 только提高像素级人体-服装对齐,还可以启用 flexible user editing。最后,我们设计了一种进化策略,用于在推理中生成多样化的合理姿态。根据实验结果,我们的框架在 both synthetic and real-world data 上达到了状态元的性能,并可以效果地从2D图像中恢复自然和多样化的3D人体几何体。

BEVHeight++: Toward Robust Visual Centric 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.16179
  • repo_url: None
  • paper_authors: Lei Yang, Tao Tang, Jun Li, Peng Chen, Kun Yuan, Li Wang, Yi Huang, Xinyu Zhang, Kaicheng Yu
  • for: This paper focuses on addressing the limitations of vision-centric bird’s eye view detection methods for roadside cameras, and proposes a simple yet effective approach called BEVHeight++ to improve the accuracy and robustness of camera-only perception methods.
  • methods: The proposed BEVHeight++ method regresses the height to the ground to achieve a distance-agnostic formulation, and incorporates both height and depth encoding techniques to improve the projection from 2D to BEV spaces.
  • results: The proposed method surpasses all previous vision-centric methods on popular 3D detection benchmarks of roadside cameras, and achieves significant improvements over depth-only methods in ego-vehicle scenarios. Specifically, the method yields a notable improvement of +1.9% NDS and +1.1% mAP over BEVDepth on the nuScenes validation set, and achieves substantial advancements of +2.8% NDS and +1.7% mAP on the nuScenes test set.
    Abstract While most recent autonomous driving system focuses on developing perception methods on ego-vehicle sensors, people tend to overlook an alternative approach to leverage intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that the state-of-the-art vision-centric bird's eye view detection methods have inferior performances on roadside cameras. This is because these methods mainly focus on recovering the depth regarding the camera center, where the depth difference between the car and the ground quickly shrinks while the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight++, to address this issue. In essence, we regress the height to the ground to achieve a distance-agnostic formulation to ease the optimization process of camera-only perception methods. By incorporating both height and depth encoding techniques, we achieve a more accurate and robust projection from 2D to BEV spaces. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. In terms of the ego-vehicle scenario, our BEVHeight++ possesses superior over depth-only methods. Specifically, it yields a notable improvement of +1.9% NDS and +1.1% mAP over BEVDepth when evaluated on the nuScenes validation set. Moreover, on the nuScenes test set, our method achieves substantial advancements, with an increase of +2.8% NDS and +1.7% mAP, respectively.
    摘要 当前大多数自动驾驶系统都是关注ego-vehicle传感器上的感知方法,而人们往往忽视了一种alternative approach,即利用智能路边摄像头来扩展感知范围。我们发现现有的视觉中心 bird's eye view探测方法在路边摄像头上表现不佳。这是因为这些方法主要关注在camera中心的深度恢复,而在距离增加时,汽车和地面之间的深度差异很快消失。在这篇论文中,我们提出了一种简单又有效的方法,称为BEVHeight++。其核心思想是通过对地面高度进行回归,以实现距离agnostic的表示,从而简化了Camera-only感知方法的优化过程。通过结合高度和深度编码技术,我们实现了更加准确和Robust的2D到BEV空间投影。在popular 3D探测 benchmarks of roadside cameras 上,我们的方法比之前的视觉中心方法高出了显著的margin。在ego-vehicleenario中,我们的BEVHeight++也表现出了明显的优势, Specifically,它在nuScenes验证集上提高了+1.9% NDS和+1.1% mAP,并在nuScenes测试集上实现了重大的进步,升幅+2.8% NDS和+1.7% mAP。
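
The height-based projection relies on standard ray-plane geometry: a pixel ray is intersected with the horizontal plane at the predicted height above ground, which stays well-conditioned as distance grows. The sketch below shows that geometry under assumed coordinate conventions (no lens distortion); it is not the paper's code.

```python
import numpy as np

def pixel_height_to_ground_xyz(u, v, height, K, R_c2g, cam_center_g):
    """Lift a pixel with a predicted height-above-ground to a 3D point by
    intersecting its viewing ray with the plane z = height in the ground frame.

    K:            (3, 3) camera intrinsics
    R_c2g:        (3, 3) rotation from camera frame to ground frame (z up)
    cam_center_g: (3,)   camera centre expressed in the ground frame
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_g = R_c2g @ ray_cam
    s = (height - cam_center_g[2]) / ray_g[2]     # scale along the ray
    return cam_center_g + s * ray_g

# usage: a roadside camera 5 m above ground, pitched down by 10 degrees;
# camera axes x->right, y->down, z->forward; ground axes x->forward, y->left, z->up
K = np.array([[1000.0, 0, 960], [0, 1000.0, 540], [0, 0, 1]])
pitch = np.deg2rad(10)
R_c2g = np.array([[0.0, -np.sin(pitch),  np.cos(pitch)],
                  [-1.0, 0.0,            0.0],
                  [0.0, -np.cos(pitch), -np.sin(pitch)]])
xyz = pixel_height_to_ground_xyz(960, 700, height=0.5, K=K,
                                 R_c2g=R_c2g, cam_center_g=np.array([0, 0, 5.0]))
```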

ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens

  • paper_url: http://arxiv.org/abs/2309.16738
  • repo_url: https://github.com/guoyang9/elip
  • paper_authors: Yangyang Guo, Haoyu Zhang, Liqiang Nie, Yongkang Wong, Mohan Kankanhalli
  • for: 降低计算成本和占用空间,提高语言-图像模型的灵活性。
  • methods: 提出了一种视觉Token采样和合并方法(ELIP),根据语言输出进行不同级别的Token采样和合并,以提高模型的计算效率和存储效率。
  • results: 在三种常用的语言-图像预训练模型中实现了相对较少的性能损失(均为0.32减少率),并且可以通过增加更大的批处理大小来加速模型预训练和下游任务表现。
    Abstract Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into the efficient language-image pre-training, an area that has received relatively little attention despite its importance in reducing computational cost and footprint. To that end, we propose a vision token pruning and merging method, ie ELIP, to remove less influential tokens based on the supervision of language outputs. Our method is designed with several strengths, such as being computation-efficient, memory-efficient, and trainable-parameter-free, and is distinguished from previous vision-only token pruning approaches by its alignment with task objectives. We implement this method in a progressively pruning manner using several sequential blocks. To evaluate its generalization performance, we apply ELIP to three commonly used language-image pre-training models and utilize public image-caption pairs with 4M images for pre-training. Our experiments demonstrate that with the removal of ~30$\%$ vision tokens across 12 ViT layers, ELIP maintains significantly comparable performance with baselines ($\sim$0.32 accuracy drop on average) over various downstream tasks including cross-modal retrieval, VQA, image captioning, etc. In addition, the spared GPU resources by our ELIP allow us to scale up with larger batch sizes, thereby accelerating model pre-training and even sometimes enhancing downstream model performance. Our code will be released at https://github.com/guoyang9/ELIP.
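A minimal sketch of language-guided token pruning and merging in the spirit of ELIP; the cosine-similarity scoring against a pooled language output and the single merged token are simplifying assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def prune_and_merge(vision_tokens, text_embed, keep_ratio=0.7):
    """Keep the vision tokens most aligned with the language output and merge
    the rest into one token (simplified illustration of ELIP-style pruning)."""
    B, N, D = vision_tokens.shape
    scores = F.cosine_similarity(vision_tokens, text_embed.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=1).indices                                      # (B, k)
    keep = torch.gather(vision_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    # merge the pruned tokens into a single averaged token instead of dropping them
    mask = torch.ones(B, N, device=vision_tokens.device)
    mask.scatter_(1, keep_idx, 0.0)
    merged = (vision_tokens * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True).clamp(min=1)
    return torch.cat([keep, merged.unsqueeze(1)], dim=1)                          # (B, k + 1, D)

tokens = torch.randn(2, 197, 768)   # e.g. ViT patch tokens
text = torch.randn(2, 768)          # pooled language output used as the supervision signal
print(prune_and_merge(tokens, text).shape)   # torch.Size([2, 138, 768]) with keep_ratio=0.7
```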

Learning to Terminate in Object Navigation

  • paper_url: http://arxiv.org/abs/2309.16164
  • repo_url: https://github.com/huskykingdom/dita_acml2023
  • paper_authors: Yuhang Song, Anh Nguyen, Chun-Yi Lee
  • for: The problem of target approach and episode termination in object navigation for autonomous systems, especially in Deep Reinforcement Learning (DRL) based methods, which usually lack depth information and therefore struggle with optimal path planning and termination recognition.
  • methods: Proposes the Depth-Inference Termination Agent (DITA), which uses a supervised Judge Model to implicitly infer object-wise depth and decide termination jointly with reinforcement learning; the Judge Model is trained in parallel with the RL policy and supervised efficiently by the reward signal.
  • results: DITA performs well across all room types and improves results in long-episode environments by 51.2% over the baseline while maintaining a slightly better Success Weighted by Path Length (SPL). Code and resources are available at https://github.com/HuskyKingdom/DITA_acml2023.
    Abstract This paper tackles the critical challenge of object navigation in autonomous navigation systems, particularly focusing on the problem of target approach and episode termination in environments with long optimal episode length in Deep Reinforcement Learning (DRL) based methods. While effective in environment exploration and object localization, conventional DRL methods often struggle with optimal path planning and termination recognition due to a lack of depth information. To overcome these limitations, we propose a novel approach, namely the Depth-Inference Termination Agent (DITA), which incorporates a supervised model called the Judge Model to implicitly infer object-wise depth and decide termination jointly with reinforcement learning. We train our Judge Model along with reinforcement learning in parallel and supervise the former efficiently by the reward signal. Our evaluation shows that the method delivers superior performance: we achieve a 9.3% gain in success rate over our baseline method across all room types and a 51.2% improvement in long-episode environments, while maintaining a slightly better Success Weighted by Path Length (SPL). Code, resources, and visualizations are available at: https://github.com/HuskyKingdom/DITA_acml2023
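Since the Judge Model is described as a supervised termination classifier trained in parallel with RL and supervised by the reward signal, a minimal sketch could look as follows; the feature dimension, architecture, and the positive-reward labeling rule are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class JudgeModel(nn.Module):
    """Tiny termination classifier in the spirit of DITA's Judge Model.
    Feature size and architecture are placeholders, not the paper's."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs_feat):
        return self.net(obs_feat).squeeze(-1)   # logit: should the agent terminate here?

judge = JudgeModel()
opt = torch.optim.Adam(judge.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# toy batch collected alongside RL: observation features plus reward-derived labels
obs_feat = torch.randn(32, 512)
terminate_reward = torch.randn(32)              # reward received when the agent issued "Done"
labels = (terminate_reward > 0).float()         # positive reward => termination was correct

loss = bce(judge(obs_feat), labels)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```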

OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions

  • paper_url: http://arxiv.org/abs/2309.16148
  • repo_url: None
  • paper_authors: Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han
  • for: One-shot talking head generation with natural, spontaneous head motions when a person speaks.
  • methods: Proposes OSM-Net, a one-to-many one-shot talking head generation network built on a clip-level head motion feature space that naturally models human head motion.
  • results: Compared with other methods, OSM-Net generates more natural and realistic head motions under a reasonable one-to-many mapping paradigm.
    Abstract One-shot talking head generation has no explicit head movement reference, so it is difficult to generate talking heads with head motions. Some existing works only edit the mouth area and generate still talking heads, leading to unrealistic talking head performance. Other works construct a one-to-one mapping between the audio signal and head motion sequences, introducing ambiguous correspondences into the mapping, since people can behave differently in head motions when speaking the same content. This unreasonable mapping form fails to model the diversity and produces either nearly static or even exaggerated head motions, which are unnatural and strange. Therefore, the one-shot talking head generation task is actually an ill-posed one-to-many problem, and people present diverse head motions when speaking. Based on the above observation, we propose OSM-Net, a \textit{one-to-many} one-shot talking head generation network with natural head motions. OSM-Net constructs a motion space that contains rich and varied clip-level head motion features. Each basis of the space represents a feature of meaningful head motion in a clip rather than just a frame, thus providing more coherent and natural motion changes in talking heads. The driving audio is mapped into the motion space, around which various motion features can be sampled within a reasonable range to achieve the one-to-many mapping. Besides, the landmark constraint and time-window feature input improve accurate expression feature extraction and video generation. Extensive experiments show that OSM-Net generates more natural and realistic head motions under the reasonable one-to-many mapping paradigm compared with other methods.
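The one-to-many mapping can be sketched as projecting the driving audio to a point in the clip-level motion space and sampling several motion codes within a small radius around it; the dimensions and the Gaussian perturbation below are illustrative assumptions rather than OSM-Net's actual design.

```python
import torch
import torch.nn as nn

class AudioToMotion(nn.Module):
    """Map driving audio to a point in a clip-level motion space and sample
    several nearby motion codes (one-to-many). Sizes and the Gaussian
    perturbation are illustrative assumptions, not OSM-Net's exact modules."""
    def __init__(self, audio_dim=256, motion_dim=64, radius=0.1):
        super().__init__()
        self.proj = nn.Linear(audio_dim, motion_dim)
        self.radius = radius

    def forward(self, audio_feat, num_samples=5):
        center = self.proj(audio_feat)                                   # (B, motion_dim)
        noise = torch.randn(num_samples, *center.shape) * self.radius    # bounded perturbations
        return center.unsqueeze(0) + noise                               # (num_samples, B, motion_dim)

model = AudioToMotion()
audio = torch.randn(4, 256)
motions = model(audio)          # several distinct but plausible head-motion codes per clip
print(motions.shape)            # torch.Size([5, 4, 64])
```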

  • paper_url: http://arxiv.org/abs/2309.16141
  • repo_url: https://github.com/pter61/aligncmss
  • paper_authors: Yuanmin Tang, Jing Yu, Keke Gai, Yujing Wang, Yue Hu, Gang Xiong, Qi Wu
  • for: Cross-modal sponsored search, improving the effectiveness of ad retrieval in search engines.
  • methods: Uses a simple alignment network that explicitly maps fine-grained visual parts in ad images to the corresponding text without expensive labeled training data, and proposes a novel model that performs cross-modal alignment and query-ads matching in two separate processes, requiring only half of the training data.
  • results: Outperforms state-of-the-art models by 2.57% on a large commercial dataset; a typical cross-modal retrieval task on the MSCOCO dataset also shows consistent performance gains, demonstrating the generality of the method.
    Abstract Cross-Modal sponsored search displays multi-modal advertisements (ads) when consumers look for desired products by natural language queries in search engines. Since multi-modal ads bring complementary details for query-ads matching, the ability to align ads-specific information in both images and texts is crucial for accurate and flexible sponsored search. Conventional research mainly focuses on modeling the implicit correlations between images and texts for query-ads matching, ignoring the alignment of detailed product information and resulting in suboptimal search performance. In this work, we propose a simple alignment network for explicitly mapping fine-grained visual parts in ads images to the corresponding text, which leverages the co-occurrence structure consistency between vision and language spaces without requiring expensive labeled training data. Moreover, we propose a novel model for cross-modal sponsored search that effectively conducts the cross-modal alignment and query-ads matching in two separate processes. In this way, the model matches the multi-modal input in the same language space, resulting in superior performance with merely half of the training data. Our model outperforms the state-of-the-art models by 2.57% on a large commercial dataset. Besides sponsored search, our alignment method is applicable to general cross-modal search. We study a typical cross-modal retrieval task on the MSCOCO dataset, which achieves consistent performance improvement and proves the generalization ability of our method. Our code is available at https://github.com/Pter61/AlignCMSS/
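A rough sketch of the explicit part-to-text alignment idea: compute a normalized similarity matrix between region features of the ad image and text token features, and use it as a soft assignment. This is a generic stand-in for illustration, not the paper's alignment network.

```python
import torch
import torch.nn.functional as F

def align_parts_to_tokens(region_feats, token_feats, temperature=0.07):
    """Soft alignment of ad-image regions to text tokens via a normalized
    similarity matrix (a simplified stand-in for the paper's alignment
    network, which is trained without extra labeled data)."""
    r = F.normalize(region_feats, dim=-1)          # (B, R, D) fine-grained visual parts
    t = F.normalize(token_feats, dim=-1)           # (B, T, D) text tokens
    sim = torch.einsum('brd,btd->brt', r, t) / temperature
    assign = sim.softmax(dim=-1)                   # each region distributes over text tokens
    aligned_text = torch.einsum('brt,btd->brd', assign, t)   # text feature gathered per region
    return assign, aligned_text

regions = torch.randn(2, 36, 512)   # e.g. 36 detected parts per ad image
tokens = torch.randn(2, 20, 512)    # 20 text tokens from the ad/query
assign, aligned = align_parts_to_tokens(regions, tokens)
print(assign.shape, aligned.shape)  # torch.Size([2, 36, 20]) torch.Size([2, 36, 512])
```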

CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting

  • paper_url: http://arxiv.org/abs/2309.16140
  • repo_url: None
  • paper_authors: Shaoxiang Guo, Qing Cai, Lin Qi, Junyu Dong
  • for: This paper proposes a novel 3D hand pose estimator from monocular images, which can successfully bridge the gap between text prompts and irregular detailed pose distribution.
  • methods: The proposed model uses a CLIP-based contrastive learning paradigm, which encodes pose-aware features and maximizes semantic consistency for a pair of pose-text features. A coarse-to-fine mesh regressor is also designed to effectively query joint-aware cues from the feature pyramid.
  • results: The proposed model achieves significantly faster inference while reaching state-of-the-art performance on several public hand benchmarks, compared to methods utilizing a similar-scale backbone.
    Abstract Contrastive Language-Image Pre-training (CLIP) starts to emerge in many computer vision tasks and has achieved promising performance. However, it remains underexplored whether CLIP can be generalized to 3D hand pose estimation, as bridging text prompts with pose-aware features presents significant challenges due to the discrete nature of joint positions in 3D space. In this paper, we make one of the first attempts to propose a novel 3D hand pose estimator from monocular images, dubbed as CLIP-Hand3D, which successfully bridges the gap between text prompts and irregular detailed pose distribution. In particular, the distribution order of hand joints in various 3D space directions is derived from pose labels, forming corresponding text prompts that are subsequently encoded into text representations. Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial distribution (in x, y, and z axes) is encoded to form pose-aware features. Subsequently, we maximize semantic consistency for a pair of pose-text features following a CLIP-based contrastive learning paradigm. Furthermore, a coarse-to-fine mesh regressor is designed, which is capable of effectively querying joint-aware cues from the feature pyramid. Extensive experiments on several public hand benchmarks show that the proposed model attains a significantly faster inference speed while achieving state-of-the-art performance compared to methods utilizing the similar scale backbone.
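The pose-text consistency objective follows a CLIP-style contrastive paradigm; below is the standard symmetric InfoNCE form such a loss would take. This is the generic formulation, and CLIP-Hand3D's actual objective may add further terms.

```python
import torch
import torch.nn.functional as F

def pose_text_contrastive(pose_feats, text_feats, temperature=0.07):
    """Symmetric CLIP-style InfoNCE between pose-aware features and the text
    representations built from joint-distribution prompts (standard form)."""
    p = F.normalize(pose_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.t() / temperature                 # (B, B) pairwise similarities
    targets = torch.arange(len(p), device=p.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

pose = torch.randn(8, 256)   # pooled pose-aware features (x/y/z joint-distribution encoding)
text = torch.randn(8, 256)   # encoded prompts describing the joint ordering
print(float(pose_text_contrastive(pose, text)))
```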

Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling

  • paper_url: http://arxiv.org/abs/2309.16139
  • repo_url: None
  • paper_authors: Ke Yu, Stephen Albro, Giulia DeSalvo, Suraj Kothawade, Abdullah Rashwan, Sasan Tavakkol, Kayhan Batmanghelich, Xiaoqi Yin
  • for: Training high-quality instance segmentation models while reducing labeling cost.
  • methods: Integrates uncertainty-based sampling with diversity-based sampling in a post-hoc active learning algorithm that is simple, easy to implement, and delivers strong performance.
  • results: Delivers superior performance on various datasets and increases labeling efficiency fivefold on a real-world overhead imagery dataset.
    Abstract Training high-quality instance segmentation models requires an abundance of labeled images with instance masks and classifications, which is often expensive to procure. Active learning addresses this challenge by striving for optimum performance with minimal labeling cost by selecting the most informative and representative images for labeling. Despite its potential, active learning has been less explored in instance segmentation compared to other tasks like image classification, which require less labeling. In this study, we propose a post-hoc active learning algorithm that integrates uncertainty-based sampling with diversity-based sampling. Our proposed algorithm is not only simple and easy to implement, but it also delivers superior performance on various datasets. Its practical application is demonstrated on a real-world overhead imagery dataset, where it increases the labeling efficiency fivefold.
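The two-step selection can be sketched generically: first keep the most uncertain images, then choose a diverse subset from that pool, here with greedy farthest-point selection in feature space. The concrete uncertainty score and diversity criterion used in the paper may differ from this sketch.

```python
import numpy as np

def two_step_selection(uncertainty, features, pool_size=200, budget=20):
    """Step 1: keep the most uncertain images; step 2: pick a diverse subset
    via greedy farthest-point selection in feature space. A generic sketch of
    uncertainty + diversity sampling, not the paper's exact criteria."""
    candidates = np.argsort(-uncertainty)[:pool_size]          # most uncertain first
    feats = features[candidates]
    chosen = [0]                                               # seed with the most uncertain image
    dists = np.linalg.norm(feats - feats[0], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))                            # farthest from the current selection
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(feats - feats[nxt], axis=1))
    return candidates[chosen]

rng = np.random.default_rng(0)
unc = rng.random(10_000)              # per-image uncertainty (e.g. mask-level entropy)
emb = rng.normal(size=(10_000, 128))  # per-image embeddings
print(two_step_selection(unc, emb).shape)   # (20,)
```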

Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2309.16137
  • repo_url: https://github.com/pter61/context_i2w
  • paper_authors: Yuanmin Tang, Jing Yu, Keke Gai, Zhuang Jiamin, Gang Xiong, Yue Hu, Qi Wu
  • for: zero-shot composed image retrieval (ZS-CIR) tasks
  • methods: context-dependent mapping network (Context-I2W) with intent view selector and visual target extractor
  • results: strong generalization ability and significant performance boosts on four ZS-CIR tasks, achieving new state-of-the-art results
    Abstract Different from Composed Image Retrieval task that requires expensive labels for training task-specific models, Zero-Shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with a broad range of visual content manipulation intent that could be related to domain, scene, object, and attribute. The key challenge for ZS-CIR tasks is to learn a more accurate image representation that has adaptive attention to the reference image for various manipulation descriptions. In this paper, we propose a novel context-dependent mapping network, named Context-I2W, for adaptively converting description-relevant Image information into a pseudo-word token composed of the description for accurate ZS-CIR. Specifically, an Intent View Selector first dynamically learns a rotation rule to map the identical image to a task-specific manipulation view. Then a Visual Target Extractor further captures local information covering the main targets in ZS-CIR tasks under the guidance of multiple learnable queries. The two complementary modules work together to map an image to a context-dependent pseudo-word token without extra supervision. Our model shows strong generalization ability on four ZS-CIR tasks, including domain conversion, object composition, object manipulation, and attribute manipulation. It obtains consistent and significant performance boosts ranging from 1.88% to 3.60% over the best methods and achieves new state-of-the-art results on ZS-CIR. Our code is available at https://github.com/Pter61/context_i2w.
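A simplified sketch of mapping an image to a context-dependent pseudo-word token with learnable queries and cross-attention; the number of queries, the mean pooling, and the single attention layer are assumptions, not the paper's exact intent view selector and visual target extractor.

```python
import torch
import torch.nn as nn

class PseudoWordMapper(nn.Module):
    """Map image patch features to one context-dependent pseudo-word token
    using learnable queries and cross-attention (a simplified stand-in for
    Context-I2W's mapping network)."""
    def __init__(self, dim=512, num_queries=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_word = nn.Linear(dim, dim)

    def forward(self, patch_feats):                     # (B, N, dim) from the image encoder
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        attended, _ = self.attn(q, patch_feats, patch_feats)
        return self.to_word(attended.mean(dim=1))       # (B, dim) pseudo-word embedding

mapper = PseudoWordMapper()
patches = torch.randn(2, 196, 512)
pseudo_word = mapper(patches)          # spliced into a prompt such as "a photo of <pseudo_word> ..."
print(pseudo_word.shape)               # torch.Size([2, 512])
```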

A dual-branch model with inter- and intra-branch contrastive loss for long-tailed recognition

  • paper_url: http://arxiv.org/abs/2309.16135
  • repo_url: None
  • paper_authors: Qiong Chen, Tianlin Huang, Geren Zhu, Enlu Lin
  • for: Real-world data often follows a long-tailed distribution in which head classes occupy most of the data while tail classes have only a few samples; this paper proposes a simple yet effective model, Dual-Branch Long-Tailed Recognition (DB-LTR), consisting of an imbalanced learning branch and a Contrastive Learning Branch (CoLB).
  • methods: The imbalanced learning branch, a shared backbone with a linear classifier, leverages common imbalanced learning approaches to tackle the data imbalance; CoLB learns a prototype for each tail class and computes an inter-branch contrastive loss, an intra-branch contrastive loss, and a metric loss.
  • results: Experiments on three long-tailed benchmark datasets (CIFAR100-LT, ImageNet-LT, and Places-LT) show that DB-LTR is competitive with and superior to the compared methods.
    Abstract Real-world data often exhibits a long-tailed distribution, in which head classes occupy most of the data, while tail classes only have very few samples. Models trained on long-tailed datasets have poor adaptability to tail classes and the decision boundaries are ambiguous. Therefore, in this paper, we propose a simple yet effective model, named Dual-Branch Long-Tailed Recognition (DB-LTR), which includes an imbalanced learning branch and a Contrastive Learning Branch (CoLB). The imbalanced learning branch, which consists of a shared backbone and a linear classifier, leverages common imbalanced learning approaches to tackle the data imbalance issue. In CoLB, we learn a prototype for each tail class, and calculate an inter-branch contrastive loss, an intra-branch contrastive loss and a metric loss. CoLB can improve the capability of the model in adapting to tail classes and assist the imbalanced learning branch to learn a well-represented feature space and discriminative decision boundary. Extensive experiments on three long-tailed benchmark datasets, i.e., CIFAR100-LT, ImageNet-LT and Places-LT, show that our DB-LTR is competitive and superior to the comparative methods.
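One ingredient of CoLB, a prototype-based contrastive term for tail classes, can be sketched as a temperature-scaled cross-entropy against per-class prototypes. This is only one simplified term; the full DB-LTR objective also includes inter-/intra-branch contrastive and metric losses.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive(features, labels, prototypes, temperature=0.1):
    """Pull each tail-class feature toward its class prototype and push it away
    from the other prototypes (a simplified CoLB-style term)."""
    f = F.normalize(features, dim=-1)                 # (B, D) features from the CoLB branch
    p = F.normalize(prototypes, dim=-1)               # (C_tail, D) one learned prototype per tail class
    logits = f @ p.t() / temperature                  # (B, C_tail)
    return F.cross_entropy(logits, labels)

num_tail_classes, dim = 30, 128
prototypes = torch.randn(num_tail_classes, dim, requires_grad=True)   # learned prototypes
feats = torch.randn(16, dim)
labels = torch.randint(0, num_tail_classes, (16,))
print(float(prototype_contrastive(feats, labels, prototypes)))
```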

MASK4D: Mask Transformer for 4D Panoptic Segmentation

  • paper_url: http://arxiv.org/abs/2309.16133
  • repo_url: None
  • paper_authors: Kadir Yilmaz, Jonas Schult, Alexey Nekrasov, Bastian Leibe
  • for: Enabling autonomous agents to make sound decisions in dynamic environments by proposing Mask4D for 4D panoptic segmentation of LiDAR point cloud sequences.
  • methods: Mask4D is the first transformer-based model to tightly unify semantic instance segmentation and tracking, directly predicting semantic instances and their temporal associations without hand-crafted, non-learned association strategies.
  • results: Mask4D sets a new state of the art on the SemanticKITTI test set with a score of 68.4 LSTQ, surpassing published top-performing methods by at least +4.5%.
    Abstract Accurately perceiving and tracking instances over time is essential for the decision-making processes of autonomous agents interacting safely in dynamic environments. With this intention, we propose Mask4D for the challenging task of 4D panoptic segmentation of LiDAR point clouds. Mask4D is the first transformer-based approach unifying semantic instance segmentation and tracking of sparse and irregular sequences of 3D point clouds into a single joint model. Our model directly predicts semantic instances and their temporal associations without relying on any hand-crafted non-learned association strategies such as probabilistic clustering or voting-based center prediction. Instead, Mask4D introduces spatio-temporal instance queries which encode the semantic and geometric properties of each semantic tracklet in the sequence. In an in-depth study, we find that it is critical to promote spatially compact instance predictions as spatio-temporal instance queries tend to merge multiple semantically similar instances, even if they are spatially distant. To this end, we regress 6-DOF bounding box parameters from spatio-temporal instance queries, which is used as an auxiliary task to foster spatially compact predictions. Mask4D achieves a new state-of-the-art on the SemanticKITTI test set with a score of 68.4 LSTQ, improving upon published top-performing methods by at least +4.5%.
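A sketch of a query-based readout consistent with the description above: per-query class logits, auxiliary 6-DOF box parameters, and mask logits as dot products between queries and per-point embeddings. Dimensions, the parameterization of the box, and the head design are illustrative, not Mask4D's actual decoder.

```python
import torch
import torch.nn as nn

class QueryReadout(nn.Module):
    """Decode spatio-temporal instance queries into per-point mask logits plus
    auxiliary 6-DOF box parameters (an illustrative head, not Mask4D's code)."""
    def __init__(self, dim=256, num_classes=20):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(dim, 6)                  # auxiliary 6-DOF regression

    def forward(self, queries, point_embed):
        # queries: (B, Q, dim); point_embed: (B, P, dim) for the point cloud sequence
        mask_logits = torch.einsum('bqd,bpd->bqp', queries, point_embed)
        return self.cls_head(queries), self.box_head(queries), mask_logits

head = QueryReadout()
queries = torch.randn(1, 100, 256)        # one query per tracked instance hypothesis
points = torch.randn(1, 50_000, 256)      # embeddings of the superimposed LiDAR sequence
cls_logits, boxes, masks = head(queries, points)
print(cls_logits.shape, boxes.shape, masks.shape)   # (1, 100, 21) (1, 100, 6) (1, 100, 50000)
```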

Joint Correcting and Refinement for Balanced Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2309.16128
  • repo_url: https://github.com/woshiyll/jcrnet
  • paper_authors: Nana Yu, Hong Shi, Yahong Han
  • for: Low-light image enhancement with a better balance among brightness, color, and illumination.
  • methods: Proposes a novel synergistic structure, the Joint Correcting and Refinement Network (JCRNet), which balances brightness, color, and illumination in three stages.
  • results: Shows comprehensive performance advantages over 21 state-of-the-art methods on 9 benchmark datasets, and also improves results on downstream vision tasks such as saliency detection.
    Abstract Low-light image enhancement demands an appropriate balance among brightness, color, and illumination. Existing methods often focus on one aspect of the image without considering how to maintain this balance, which causes problems such as color distortion and overexposure. This seriously affects both human visual perception and the performance of high-level visual models. In this work, a novel synergistic structure is proposed that balances brightness, color, and illumination more effectively. Specifically, the proposed method, the Joint Correcting and Refinement Network (JCRNet), mainly consists of three stages. Stage 1: we utilize a basic encoder-decoder and a local supervision mechanism to extract local information and more comprehensive details for enhancement. Stage 2: cross-stage feature transmission and spatial feature transformation further facilitate color correction and feature refinement. Stage 3: we employ a dynamic illumination adjustment approach to embed residuals between predicted and ground-truth images into the model, adaptively adjusting the illumination balance. Extensive experiments demonstrate that the proposed method exhibits comprehensive performance advantages over 21 state-of-the-art methods on 9 benchmark datasets. Furthermore, an additional experiment has been conducted to validate the effectiveness of our approach in downstream visual tasks (e.g., saliency detection). Compared to several enhancement models, the proposed method effectively improves the segmentation results and quantitative metrics of saliency detection. The source code will be available at https://github.com/woshiyll/JCRNet.
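Stage 3 is described as embedding residuals between predicted and ground-truth images to adaptively adjust illumination; a loose, generic sketch of such a residual illumination-adjustment step (the convolutional head and the L1 supervision are assumptions, not the paper's architecture) is shown below.

```python
import torch
import torch.nn as nn

class IlluminationAdjust(nn.Module):
    """Predict a per-pixel illumination residual from the current enhanced
    image and add it back, supervised by the ground truth (a loose sketch of
    a stage-3 style residual adjustment, not JCRNet's exact design)."""
    def __init__(self, channels=3):
        super().__init__()
        self.residual_net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, enhanced):
        return (enhanced + self.residual_net(enhanced)).clamp(0, 1)

model = IlluminationAdjust()
enhanced = torch.rand(1, 3, 64, 64)   # output of the earlier correction/refinement stages
target = torch.rand(1, 3, 64, 64)     # ground-truth normal-light image
loss = nn.functional.l1_loss(model(enhanced), target)   # pushes the residual toward (target - enhanced)
loss.backward()
print(float(loss))
```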

Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.16127
  • repo_url: None
  • paper_authors: Tingliang Feng, Hao Shi, Xueyang Liu, Wei Feng, Liang Wan, Yanlin Zhou, Di Lin
  • for: Improving the accuracy of semantic image segmentation under open compound domain adaptation.
  • methods: Proposes Object Style Compensation, which constructs an Object-Level Discrepancy Memory and learns style discrepancy features for object instances of different categories.
  • results: Achieves state-of-the-art results on different datasets, improving semantic image segmentation accuracy.
    Abstract Many methods of semantic image segmentation have borrowed the success of open compound domain adaptation. They minimize the style gap between the images of source and target domains, more easily predicting the accurate pseudo annotations for target domain's images that train segmentation network. The existing methods globally adapt the scene style of the images, whereas the object styles of different categories or instances are adapted improperly. This paper proposes the Object Style Compensation, where we construct the Object-Level Discrepancy Memory with multiple sets of discrepancy features. The discrepancy features in a set capture the style changes of the same category's object instances adapted from target to source domains. We learn the discrepancy features from the images of source and target domains, storing the discrepancy features in memory. With this memory, we select appropriate discrepancy features for compensating the style information of the object instances of various categories, adapting the object styles to a unified style of source domain. Our method enables a more accurate computation of the pseudo annotations for target domain's images, thus yielding state-of-the-art results on different datasets.
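The Object-Level Discrepancy Memory can be sketched as a per-category store of style-gap vectors that are used to shift target-domain object features toward the source style; the single slot per category and the running-mean update below are simplifications, not the paper's memory design.

```python
import numpy as np

class DiscrepancyMemory:
    """Per-category memory of style discrepancy features (source minus target)
    used to shift target-domain object features toward a unified source style.
    Slot count and the running-mean update are illustrative assumptions."""
    def __init__(self, num_classes, dim, momentum=0.9):
        self.memory = np.zeros((num_classes, dim))
        self.momentum = momentum

    def update(self, category, source_feat, target_feat):
        disc = source_feat - target_feat                        # style gap for this object instance
        self.memory[category] = (self.momentum * self.memory[category]
                                 + (1 - self.momentum) * disc)

    def compensate(self, category, target_feat):
        return target_feat + self.memory[category]              # shift toward the source style

mem = DiscrepancyMemory(num_classes=19, dim=256)
rng = np.random.default_rng(0)
mem.update(category=3, source_feat=rng.normal(size=256), target_feat=rng.normal(size=256))
print(mem.compensate(3, rng.normal(size=256)).shape)            # (256,)
```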

UVL: A Unified Framework for Video Tampering Localization

  • paper_url: http://arxiv.org/abs/2309.16126
  • repo_url: None
  • paper_authors: Pengfei Pei, Xianfeng Zhao, Jinchuan Li, Yun Cao
  • for: Detecting forged videos produced by deep learning techniques.
  • methods: Proposes a unified video tampering localization framework (UVL) that extracts features common to different types of synthetic forgeries (boundary artifacts of synthetic edges, unnatural distributions of generated pixels, and the non-correlation between the forgery region and the original) to improve detection of unknown videos.
  • results: Achieves state-of-the-art performance on benchmarks covering three forgery types (video inpainting, video splicing, and DeepFake) and outperforms existing methods by a large margin in cross-dataset evaluation.
    Abstract With the development of deep learning technology, various forgery methods emerge endlessly. Meanwhile, methods to detect these fake videos have also achieved excellent performance on some datasets. However, these methods suffer from poor generalization to unknown videos and are inefficient for new forgery methods. To address this challenging problem, we propose UVL, a novel unified video tampering localization framework for synthesizing forgeries. Specifically, UVL extracts common features of synthetic forgeries: boundary artifacts of synthetic edges, unnatural distribution of generated pixels, and noncorrelation between the forgery region and the original. These features are widely present in different types of synthetic forgeries and help improve generalization for detecting unknown videos. Extensive experiments on three types of synthetic forgery: video inpainting, video splicing and DeepFake show that the proposed UVL achieves state-of-the-art performance on various benchmarks and outperforms existing methods by a large margin on cross-dataset.

D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2309.16118
  • repo_url: None
  • paper_authors: Yixuan Wang, Zhuoran Li, Mingtong Zhang, Katherine Driggs-Campbell, Jiajun Wu, Li Fei-Fei, Yunzhu Li
  • for: A new scene representation that improves manipulation accuracy and generalization in robotic manipulation systems.
  • methods: Uses dynamic 3D descriptor fields that capture the dynamics of the 3D workspace and combine semantic features with instance masks; arbitrary 3D points are projected onto multi-view visual observations, and features derived from foundation models are interpolated and fused.
  • results: D$^3$Fields enables zero-shot robotic manipulation tasks and shows better generalization ability and manipulation accuracy than existing dense descriptors.
    Abstract Scene representation has been a crucial design choice in robotic manipulation systems. An ideal representation should be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields - dynamic 3D descriptor fields. These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from foundational models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to a wide range of robotic manipulation tasks in a zero-shot manner. Through extensive evaluation in both real-world scenarios and simulations, we demonstrate that D$^3$Fields are both generalizable and effective for zero-shot robotic manipulation tasks. In quantitative comparisons with state-of-the-art dense descriptors, such as Dense Object Nets and DINO, D$^3$Fields exhibit significantly better generalization abilities and manipulation accuracy.
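The core fusion step, projecting a 3D query point into each calibrated view and interpolating that view's 2D feature map, can be sketched as follows. Foundation-model feature extraction and the paper's visibility/instance handling are omitted, and the toy calibration values are made up.

```python
import numpy as np

def fuse_multiview_features(points, feat_maps, Ks, extrinsics):
    """For each 3D point, project into every calibrated view, bilinearly sample
    that view's 2D feature map, and average across views (core of a
    D3Fields-style descriptor field; visibility and depth checks are omitted)."""
    fused = np.zeros((len(points), feat_maps[0].shape[-1]))
    homo = np.c_[points, np.ones(len(points))]                      # (N, 4) homogeneous points
    for feat, K, T in zip(feat_maps, Ks, extrinsics):               # T: world -> camera, 4x4
        H, W, _ = feat.shape
        cam = (T @ homo.T)[:3].T                                    # points in the camera frame
        uv = (K @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3]                                 # perspective divide
        x = np.clip(uv[:, 0], 0, W - 1.001)
        y = np.clip(uv[:, 1], 0, H - 1.001)
        x0, y0 = x.astype(int), y.astype(int)
        wx, wy = (x - x0)[:, None], (y - y0)[:, None]
        sample = ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x0 + 1]
                  + (1 - wx) * wy * feat[y0 + 1, x0] + wx * wy * feat[y0 + 1, x0 + 1])
        fused += sample                                             # simple average over views
    return fused / len(feat_maps)

pts = np.random.rand(100, 3) + np.array([0.0, 0.0, 1.0])            # query points in front of the cameras
feats = [np.random.rand(240, 320, 64) for _ in range(2)]            # per-view 2D feature maps
Ks = [np.array([[300.0, 0, 160], [0, 300.0, 120], [0, 0, 1]])] * 2
Ts = [np.eye(4)] * 2                                                # toy extrinsics (cameras at the origin)
print(fuse_multiview_features(pts, feats, Ks, Ts).shape)            # (100, 64)
```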

Learning Effective NeRFs and SDFs Representations with 3D Generative Adversarial Networks for 3D Object Generation: Technical Report for ICCV 2023 OmniObject3D Challenge

  • paper_url: http://arxiv.org/abs/2309.16110
  • repo_url: None
  • paper_authors: Zheyuan Yang, Yibo Liu, Guile Wu, Tongtong Cao, Yuan Ren, Yang Liu, Bingbing Liu
  • for: A solution to the 3D object generation task of the ICCV 2023 OmniObject3D Challenge.
  • methods: Learns effective NeRF and SDF representations with 3D Generative Adversarial Networks (GANs), using label embedding and color mapping to train across different taxonomies simultaneously.
  • results: The model can be trained with only a few images of each object from a variety of classes rather than a large number of images per object or one model per class, yielding an effective pipeline for 3D object generation; the solution placed in the final top three of the ICCV 2023 OmniObject3D Challenge.
    Abstract In this technical report, we present a solution for the 3D object generation task of the ICCV 2023 OmniObject3D Challenge. In recent years, 3D object generation has made great progress and achieved promising results, but it remains a challenging task due to the difficulty of generating complex, textured and high-fidelity results. To resolve this problem, we study learning effective NeRFs and SDFs representations with 3D Generative Adversarial Networks (GANs) for 3D object generation. Specifically, inspired by recent works, we use efficient geometry-aware 3D GANs as the backbone, incorporating label embedding and color mapping, which enables the model to be trained on different taxonomies simultaneously. Then, through a decoder, we aggregate the resulting features to generate Neural Radiance Fields (NeRFs) based representations for rendering high-fidelity synthetic images. Meanwhile, we optimize Signed Distance Functions (SDFs) to effectively represent objects with 3D meshes. Besides, we observe that this model can be effectively trained with only a few images of each object from a variety of classes, instead of using a great number of images per object or training one model per class. With this pipeline, we can optimize an effective model for 3D object generation. This solution is one of the final top-3-place solutions in the ICCV 2023 OmniObject3D Challenge.
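Rendering images from the generated NeRF representation relies on standard volume rendering; the sketch below shows the generic density-to-alpha compositing along rays, independent of the challenge submission's actual code.

```python
import torch

def volume_render(densities, colors, deltas):
    """Standard NeRF-style volume rendering along rays: convert densities to
    alphas, accumulate transmittance, and composite the per-sample colors.
    Generic formulation, not the authors' renderer."""
    # densities: (R, S), colors: (R, S, 3), deltas: (R, S) distances between samples
    alphas = 1.0 - torch.exp(-densities * deltas)                         # opacity per sample
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alphas * trans                                              # contribution per sample
    return (weights.unsqueeze(-1) * colors).sum(dim=1)                    # (R, 3) rendered pixel colors

rays, samples = 1024, 64
sigma = torch.rand(rays, samples)            # densities predicted by the generated NeRF
rgb = torch.rand(rays, samples, 3)           # colors predicted by the generated NeRF
delta = torch.full((rays, samples), 0.02)    # uniform sample spacing along each ray
print(volume_render(sigma, rgb, delta).shape)   # torch.Size([1024, 3])
```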