cs.CV - 2023-12-07

Scaling Laws of Synthetic Images for Model Training … for Now

  • paper_url: http://arxiv.org/abs/2312.04567
  • repo_url: https://github.com/google-research/syn-rep-learn
  • paper_authors: Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, Yonglong Tian
  • for: To study how state-of-the-art text-to-image models behave when their synthetic images are used for large-scale training, and to characterize the scaling laws of this approach for training image classifiers and CLIP models.
  • methods: Generates synthetic images with state-of-the-art text-to-image models and trains on them with both label supervision (image classifiers) and language supervision (CLIP).
  • results: The scaling behavior of synthetic training images depends on several factors, including the text prompts, the classifier-free guidance scale, and the type of text-to-image model. After tuning these factors, synthetic images show a scaling trend similar to real images for CLIP training but significantly underperform when training supervised image classifiers. These findings help clarify when scaling synthetic data is effective for large-scale training.
    Abstract Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images, potentially overcoming the difficulty of collecting curated data at scale. It is unclear, however, how these models behave at scale, as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models, for the training of supervised models: image classifiers with label supervision, and CLIP with language supervision. We identify several factors, including text prompts, classifier-free guidance scale, and types of text-to-image models, that significantly affect scaling behavior. After tuning these factors, we observe that synthetic images demonstrate a scaling trend similar to, but slightly less effective than, real images in CLIP training, while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts, a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the evaluation dataset diverges significantly from the training data, indicating the out-of-distribution scenario, or (3) when synthetic data is used in conjunction with real images, as demonstrated in the training of CLIP models.
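The scaling-law analysis described above can be illustrated with a standard saturating power-law fit of error versus training-set size. The sketch below uses made-up (dataset size, error) points purely for illustration; it is not the paper's fitting code, and the functional form is only one common choice.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (dataset size, top-1 error) pairs for models trained on
# increasing amounts of synthetic data; replace with real measurements.
n_images = np.array([0.125e6, 0.25e6, 0.5e6, 1e6, 2e6, 4e6])
error = np.array([0.62, 0.55, 0.49, 0.45, 0.42, 0.40])

def saturating_power_law(n, a, b, c):
    """err(n) = a * n^(-b) + c, where c is the irreducible error floor."""
    return a * np.power(n, -b) + c

params, _ = curve_fit(saturating_power_law, n_images, error,
                      p0=(10.0, 0.3, 0.3), maxfev=10000)
a, b, c = params
print(f"fit: err(n) = {a:.3f} * n^(-{b:.3f}) + {c:.3f}")

# Extrapolate to a larger synthetic-data budget.
print("predicted error at 8M images:", saturating_power_law(8e6, *params))
```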

Gen2Det: Generate to Detect

  • paper_url: http://arxiv.org/abs/2312.04566
  • repo_url: None
  • paper_authors: Saksham Suri, Fanyi Xiao, Animesh Sinha, Sean Chang Culatana, Raghuraman Krishnamoorthi, Chenchen Zhu, Abhinav Shrivastava
  • for: To improve object detection and segmentation performance, particularly across different data regimes and category distributions.
  • methods: Leverages state-of-the-art grounded image generation to directly synthesize scene-centric training images, and proposes techniques to best utilize the generated data, including image-level filtering, instance-level filtering, and an improved training recipe that accounts for imperfections in the generation.
  • results: Gen2Det yields consistent gains across settings: it improves rare-category performance by a large margin in the long-tailed detection setting on LVIS (while also improving other categories), improves both Box and Mask AP in the low-data regime on COCO, and still provides robust gains in the most general detection setting.
    Abstract Recently diffusion models have shown improvement in synthetic image quality as well as better control in generation. We motivate and present Gen2Det, a simple modular pipeline to create synthetic training data for object detection for free by leveraging state-of-the-art grounded image generation methods. Unlike existing works which generate individual object instances, require identifying foreground followed by pasting on other images, we simplify to directly generating scene-centric images. In addition to the synthetic data, Gen2Det also proposes a suite of techniques to best utilize the generated data, including image-level filtering, instance-level filtering, and better training recipe to account for imperfections in the generation. Using Gen2Det, we show healthy improvements on object detection and segmentation tasks under various settings and agnostic to detection methods. In the long-tailed detection setting on LVIS, Gen2Det improves the performance on rare categories by a large margin while also significantly improving the performance on other categories, e.g. we see an improvement of 2.13 Box AP and 1.84 Mask AP over just training on real data on LVIS with Mask R-CNN. In the low-data regime setting on COCO, Gen2Det consistently improves both Box and Mask AP by 2.27 and 1.85 points. In the most general detection setting, Gen2Det still demonstrates robust performance gains, e.g. it improves the Box and Mask AP on COCO by 0.45 and 0.32 points.
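A hedged sketch of the image-level and instance-level filtering idea mentioned above: keep a synthetic image only if a pre-trained detector is reasonably confident about it, and keep individual pseudo-boxes only above a score threshold. The thresholds, data structures, and the `pretrained_detector` callable are placeholders for illustration, not Gen2Det's actual implementation.

```python
from typing import Callable, Dict, List
import numpy as np

def filter_synthetic_data(
    images: List[np.ndarray],
    pretrained_detector: Callable[[np.ndarray], List[Dict]],
    image_score_thresh: float = 0.3,    # placeholder image-level threshold
    instance_score_thresh: float = 0.5  # placeholder instance-level threshold
) -> List[Dict]:
    """Return a curated list of {image, annotations} entries for detector training."""
    curated = []
    for img in images:
        detections = pretrained_detector(img)  # each: {"box": [...], "score": float, "label": int}
        if not detections:
            continue
        # Image-level filtering: drop images whose best detection is weak,
        # a proxy for "the generator failed to render recognizable objects".
        if max(d["score"] for d in detections) < image_score_thresh:
            continue
        # Instance-level filtering: keep only confident pseudo-labels.
        boxes = [d for d in detections if d["score"] >= instance_score_thresh]
        if boxes:
            curated.append({"image": img, "annotations": boxes})
    return curated
```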

MuRF: Multi-Baseline Radiance Fields

  • paper_url: http://arxiv.org/abs/2312.04565
  • repo_url: https://github.com/autonomousvision/murf
  • paper_authors: Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, Fisher Yu
  • for: To solve sparse view synthesis under multiple different baseline settings.
  • methods: A general feed-forward approach that discretizes 3D space into planes parallel to the target image plane, constructs a target view frustum volume aligned with the target view, and regresses the radiance field with a convolutional network.
  • results: Achieves state-of-the-art performance across multiple baseline settings and diverse scenarios (DTU, RealEstate10K, and LLFF), and shows promising zero-shot generalization on the Mip-NeRF 360 dataset.
    Abstract We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward approach to solving sparse view synthesis under multiple different baseline settings (small and large baselines, and different number of input views). To render a target novel view, we discretize the 3D space into planes parallel to the target image plane, and accordingly construct a target view frustum volume. Such a target volume representation is spatially aligned with the target view, which effectively aggregates relevant information from the input views for high-quality rendering. It also facilitates subsequent radiance field regression with a convolutional network thanks to its axis-aligned nature. The 3D context modeled by the convolutional network enables our method to synthesize sharper scene structures than prior works. Our MuRF achieves state-of-the-art performance across multiple different baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K and LLFF). We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability of MuRF.

EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS

  • paper_url: http://arxiv.org/abs/2312.04564
  • repo_url: https://github.com/Sharath-girish/efficientgaussian
  • paper_authors: Sharath Girish, Kamal Gupta, Abhinav Shrivastava
  • for: This paper addresses the memory storage and training speed challenges of 3D Gaussian splatting (3D-GS) in novel-view scene synthesis.
  • methods: The authors propose quantized embeddings to significantly reduce memory storage requirements, together with a coarse-to-fine training strategy for faster and more stable optimization of the Gaussian point clouds.
  • results: The approach yields scene representations with fewer Gaussians and quantized attributes, leading to faster training and rendering of high-resolution scenes in real time, with more than an order-of-magnitude reduction in memory usage while maintaining reconstruction quality.
    Abstract Recently, 3D Gaussian splatting (3D-GS) has gained popularity in novel-view scene synthesis. It addresses the challenges of lengthy training times and slow rendering speeds associated with Neural Radiance Fields (NeRFs). Through rapid, differentiable rasterization of 3D Gaussians, 3D-GS achieves real-time rendering and accelerated training. They, however, demand substantial memory resources for both training and storage, as they require millions of Gaussians in their point cloud representation for each scene. We present a technique utilizing quantized embeddings to significantly reduce memory storage requirements and a coarse-to-fine training strategy for a faster and more stable optimization of the Gaussian point clouds. Our approach results in scene representations with fewer Gaussians and quantized representations, leading to faster training times and rendering speeds for real-time rendering of high resolution scenes. We reduce memory by more than an order of magnitude all while maintaining the reconstruction quality. We validate the effectiveness of our approach on a variety of datasets and scenes preserving the visual quality while consuming 10-20x less memory and faster training/inference speed. Project page and code is available https://efficientgaussian.github.io
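A minimal sketch of attribute quantization in the spirit described above: per-Gaussian attribute vectors are stored as indices into a small learned codebook instead of full-precision values. This is an illustrative vector-quantization layer under assumed shapes and codebook size, not the EAGLES codebase.

```python
import torch
import torch.nn as nn

class QuantizedAttributes(nn.Module):
    """Store per-Gaussian attribute vectors as nearest-codebook entries (straight-through)."""

    def __init__(self, codebook_size: int = 256, dim: int = 16):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim) * 0.1)

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # attrs: (num_gaussians, dim) continuous attributes predicted during training
        dists = torch.cdist(attrs, self.codebook)   # (N, codebook_size)
        idx = dists.argmin(dim=1)                   # nearest code per Gaussian
        quantized = self.codebook[idx]              # (N, dim)
        # Straight-through estimator so gradients still flow to the continuous attrs.
        return attrs + (quantized - attrs).detach()

attrs = torch.randn(10_000, 16, requires_grad=True)
q = QuantizedAttributes()(attrs)
print(q.shape)  # storage becomes ~log2(256) bits per Gaussian plus the shared codebook
```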

Visual Geometry Grounded Deep Structure From Motion

  • paper_url: http://arxiv.org/abs/2312.04563
  • repo_url: None
  • paper_authors: Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny
  • for: The paper is written to address the problem of structure-from-motion (SfM) in computer vision, specifically to develop a new deep pipeline called VGGSfM that can reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images.
  • methods: The paper uses deep learning techniques to enhance specific elements of the SfM pipeline, such as keypoint matching, and introduces new mechanisms and simplifications to make the pipeline fully differentiable and trainable in an end-to-end manner.
  • results: The paper achieves state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D, demonstrating the effectiveness of the proposed VGGSfM pipeline.
    Abstract Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.

GenDeF: Learning Generative Deformation Field for Video Generation

  • paper_url: http://arxiv.org/abs/2312.04561
  • repo_url: None
  • paper_authors: Wen Wang, Kecheng Zheng, Qiuyu Wang, Hao Chen, Zifan Shi, Ceyuan Yang, Yujun Shen, Chunhua Shen
  • for: To propose a new approach to video generation that, instead of directly synthesizing a sequence of frames, renders a video by warping a single static image with a generative deformation field (GenDeF).
  • methods: Reuses a well-trained image generator to synthesize the static (canonical) image, and converts the deformation field to optical flow so that explicit structural regularization can be applied to the motion, yielding temporally consistent results; the disentanglement of content and motion also enables video editing, keypoint tracking, and video segmentation by processing the canonical image.
  • results: Qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of GenDeF in visual quality and generation flexibility.
    Abstract We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method.
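The core rendering step described above, warping one canonical image with a per-frame deformation field, can be sketched with `grid_sample`. The deformation below is random noise purely for illustration; in GenDeF it would come from the generative deformation field.

```python
import torch
import torch.nn.functional as F

def warp_canonical(canonical: torch.Tensor, deformation: torch.Tensor) -> torch.Tensor:
    """canonical: (B, C, H, W); deformation: (B, H, W, 2) offsets in pixels."""
    B, _, H, W = canonical.shape
    # Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
    # Convert pixel offsets to normalized offsets and sample the canonical image.
    norm_offsets = deformation / torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0])
    grid = base + norm_offsets
    return F.grid_sample(canonical, grid, align_corners=True)

# One deformation field per output frame warps the same canonical image.
frames = [warp_canonical(torch.rand(1, 3, 64, 64), 2.0 * torch.randn(1, 64, 64, 2))
          for _ in range(16)]
```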

PrimDiffusion: Volumetric Primitives Diffusion for 3D Human Generation

  • paper_url: http://arxiv.org/abs/2312.04559
  • repo_url: https://github.com/frozenburning/primdiffusion
  • paper_authors: Zhaoxi Chen, Fangzhou Hong, Haiyi Mei, Guangcong Wang, Lei Yang, Ziwei Liu
  • for: 3D human generation
  • methods: diffusion-based framework, denoising diffusion process, volumetric primitives representation
  • results: Outperforms state-of-the-art methods, supports real-time rendering of high-quality 3D humans at $512\times512$ resolution once the denoising process is done, and enables training-free conditional generation such as texture transfer and 3D inpainting.
    Abstract We present PrimDiffusion, the first diffusion-based framework for 3D human generation. Devising diffusion models for 3D human generation is difficult due to the intensive computational cost of 3D representations and the articulated topology of 3D humans. To tackle these challenges, our key insight is operating the denoising diffusion process directly on a set of volumetric primitives, which models the human body as a number of small volumes with radiance and kinematic information. This volumetric primitives representation marries the capacity of volumetric representations with the efficiency of primitive-based rendering. Our PrimDiffusion framework has three appealing properties: 1) compact and expressive parameter space for the diffusion model, 2) flexible 3D representation that incorporates human prior, and 3) decoder-free rendering for efficient novel-view and novel-pose synthesis. Extensive experiments validate that PrimDiffusion outperforms state-of-the-art methods in 3D human generation. Notably, compared to GAN-based methods, our PrimDiffusion supports real-time rendering of high-quality 3D humans at a resolution of $512\times512$ once the denoising process is done. We also demonstrate the flexibility of our framework on training-free conditional generation such as texture transfer and 3D inpainting.

MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar

  • paper_url: http://arxiv.org/abs/2312.04558
  • repo_url: None
  • paper_authors: Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, Yebin Liu
  • for: To animate photo-realistic head avatars reconstructed from monocular portrait video sequences, a crucial step in bridging the gap between the virtual and real worlds.
  • methods: Proposes MonoGaussianAvatar (Monocular Gaussian Point-based Head Avatar), which combines a 3D Gaussian point representation with a Gaussian deformation field to learn explicit head avatars from monocular portrait videos; the points have adaptable shape, size, color, and opacity and are rendered with Gaussian splatting for efficient training and rendering.
  • results: Experiments demonstrate state-of-the-art results, surpassing previous methods.
    Abstract The ability to animate photo-realistic head avatars reconstructed from monocular portrait video sequences represents a crucial step in bridging the gap between the virtual and real worlds. Recent advancements in head avatar techniques, including explicit 3D morphable meshes (3DMM), point clouds, and neural implicit representation have been exploited for this ongoing research. However, 3DMM-based methods are constrained by their fixed topologies, point-based approaches suffer from a heavy training burden due to the extensive quantity of points involved, and the last ones suffer from limitations in deformation flexibility and rendering efficiency. In response to these challenges, we propose MonoGaussianAvatar (Monocular Gaussian Point-based Head Avatar), a novel approach that harnesses 3D Gaussian point representation coupled with a Gaussian deformation field to learn explicit head avatars from monocular portrait videos. We define our head avatars with Gaussian points characterized by adaptable shapes, enabling flexible topology. These points exhibit movement with a Gaussian deformation field in alignment with the target pose and expression of a person, facilitating efficient deformation. Additionally, the Gaussian points have controllable shape, size, color, and opacity combined with Gaussian splatting, allowing for efficient training and rendering. Experiments demonstrate the superior performance of our method, which achieves state-of-the-art results among previous methods.

GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

  • paper_url: http://arxiv.org/abs/2312.04557
  • repo_url: None
  • paper_authors: Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
  • for: To explore Transformer-based diffusion models for image and video generation.
  • methods: Adapts Diffusion Transformers (DiTs) from class conditioning to text conditioning through thorough empirical exploration of the conditioning mechanism, scales GenTron from approximately 900M to over 3B parameters, and extends it to text-to-video generation with a novel motion-free guidance to enhance video quality.
  • results: In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels on T2I-CompBench, underscoring its strength in compositional generation.
    Abstract In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.
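The class-to-text conditioning change described above typically amounts to replacing class-embedding conditioning with cross-attention over text-encoder tokens inside each transformer block. The snippet below is a generic cross-attention layer for illustration under assumed dimensions, not GenTron's exact architecture.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Image tokens attend to text tokens (e.g., outputs of a frozen text encoder)."""

    def __init__(self, dim: int = 512, text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_kv = nn.Linear(text_dim, dim)  # project text features to model width
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        ctx = self.to_kv(text_tokens)                      # (B, T_text, dim)
        out, _ = self.attn(self.norm(image_tokens), ctx, ctx)
        return image_tokens + out                          # residual connection

x = torch.randn(2, 256, 512)   # 256 latent patches per image (assumed)
txt = torch.randn(2, 77, 768)  # hypothetical text-encoder sequence
print(TextCrossAttention()(x, txt).shape)  # torch.Size([2, 256, 512])
```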

SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing

  • paper_url: http://arxiv.org/abs/2312.04553
  • repo_url: None
  • paper_authors: Tomoki Ichikawa, Shohei Nobuhara, Ko Nishino
  • for: To open a new avenue for capturing object shape and reflectance using patterns of polarized light.
  • methods: Proposes Structured Polarization for Invisible Depth and Reflectance Sensing (SPIDeRS), which modulates the angle of linear polarization (AoLP) of the projected light at each pixel; it is implemented with a liquid crystal spatial light modulator (SLM) and a polarimetric camera, together with a method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance.
  • results: Experiments show that SPIDeRS successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light; the recovered surface normals and reflectance also enable relighting.
    Abstract Can we capture shape and reflectance in stealth? Such capability would be valuable for many application domains in vision, xR, robotics, and HCI. We introduce Structured Polarization, the first depth and reflectance sensing method using patterns of polarized light (SPIDeRS). The key idea is to modulate the angle of linear polarization (AoLP) of projected light at each pixel. The use of polarization makes it invisible and lets us recover not only depth but also directly surface normals and even reflectance. We implement SPIDeRS with a liquid crystal spatial light modulator (SLM) and a polarimetric camera. We derive a novel method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by applying it to a number of real-world objects. The results show that our method successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light. We also demonstrate relighting using recovered surface normals and reflectance. We believe SPIDeRS opens a new avenue of polarization use in visual sensing.
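For background on the AoLP modulation mentioned above: when light with angle of linear polarization $\phi$ and degree of linear polarization $\rho$ is observed through an analyzer (or polarimetric camera channel) at angle $\theta$, the measured intensity follows the standard sinusoidal relation below. This is general polarimetric-imaging background, not the paper's full derivation.

$$ I(\theta) = \frac{I_0}{2}\Bigl(1 + \rho \cos\bigl(2(\theta - \phi)\bigr)\Bigr) $$

Sampling $I$ at several analyzer angles (e.g., $\theta \in \{0^\circ, 45^\circ, 90^\circ, 135^\circ\}$) recovers the unpolarized intensity $I_0$, the DoLP $\rho$, and the AoLP $\phi$ per pixel, which are the quantities SPIDeRS modulates in projection and decodes in capture.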

Free3D: Consistent Novel View Synthesis without 3D Representation

  • paper_url: http://arxiv.org/abs/2312.04551
  • repo_url: https://github.com/lyndonzheng/Free3D
  • paper_authors: Chuanxia Zheng, Andrea Vedaldi
  • for: To achieve open-set novel view synthesis (NVS) from a single image.
  • methods: Starts from a pre-trained 2D image generator for generalization and fine-tunes it for NVS, without an explicit 3D representation (which is slow and memory-consuming) or an additional 3D network. A new per-pixel ray conditioning normalization (RCN) layer encodes the target camera pose by telling each pixel its specific viewing direction, while a lightweight multi-view attention layer and multi-view noise sharing improve multi-view consistency.
  • results: Trained on the Objaverse dataset, Free3D generalizes well to various new categories in several new datasets, including OmniObject3D and GSO.
    Abstract We introduce Free3D, a simple approach designed for open-set novel view synthesis (NVS) from a single image. Similar to Zero-1-to-3, we start from a pre-trained 2D image generator for generalization, and fine-tune it for NVS. Compared to recent and concurrent works, we obtain significant improvements without resorting to an explicit 3D representation, which is slow and memory-consuming or training an additional 3D network. We do so by encoding better the target camera pose via a new per-pixel ray conditioning normalization (RCN) layer. The latter injects pose information in the underlying 2D image generator by telling each pixel its specific viewing direction. We also improve multi-view consistency via a light-weight multi-view attention layer and multi-view noise sharing. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to various new categories in several new datasets, including OminiObject3D and GSO. We hope our simple and effective approach will serve as a solid baseline and help future research in NVS with more accuracy pose. The project page is available at https://chuanxiaz.com/free3d/.
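The per-pixel ray conditioning mentioned above requires knowing, for every pixel, its viewing ray in the target camera. A hedged sketch of computing those ray directions from pinhole intrinsics K and a camera-to-world rotation R is shown below; the RCN layer itself is not reproduced here.

```python
import numpy as np

def pixel_ray_directions(K: np.ndarray, R_c2w: np.ndarray, H: int, W: int) -> np.ndarray:
    """Return unit ray directions in world coordinates, shape (H, W, 3)."""
    i, j = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid, (H, W) each
    pix = np.stack([i + 0.5, j + 0.5, np.ones_like(i, dtype=np.float64)], axis=-1)
    cam_dirs = pix @ np.linalg.inv(K).T             # back-project pixels to camera space
    world_dirs = cam_dirs @ R_c2w.T                 # rotate into the world frame
    return world_dirs / np.linalg.norm(world_dirs, axis=-1, keepdims=True)

K = np.array([[128.0, 0, 64.0], [0, 128.0, 64.0], [0, 0, 1.0]])  # assumed intrinsics
rays = pixel_ray_directions(K, np.eye(3), H=128, W=128)
print(rays.shape)  # (128, 128, 3): one normalized viewing direction per pixel
```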

Digital Life Project: Autonomous 3D Characters with Social Intelligence

  • paper_url: http://arxiv.org/abs/2312.04547
  • repo_url: None
  • paper_authors: Zhongang Cai, Jianping Jiang, Zhongfei Qing, Xinying Guo, Mingyuan Zhang, Zhengyu Lin, Haiyi Mei, Chen Wei, Ruisi Wang, Wanqi Yin, Xiangyu Fan, Han Du, Liang Pan, Peng Gao, Zhitao Yang, Yang Gao, Jiaqi Li, Tianxiang Ren, Yukun Wei, Xiaogang Wang, Chen Change Loy, Lei Yang, Ziwei Liu
  • for: The paper aims to create autonomous 3D characters that can engage in social interactions and express themselves with articulated body motions, simulating life in a digital environment.
  • methods: The framework consists of two primary components: SocioMind, a digital brain that models personalities and initiates dialogue topics, and MoMat-MoGen, a text-driven motion synthesis paradigm that ensures motion quality and generates diverse movements.
  • results: Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain, enabling virtual characters to initiate and sustain dialogues autonomously while evolving their socio-psychological states and performing contextually relevant bodily movements. Additionally, the motion captioning module allows the virtual character to recognize and respond to human players' actions.
    Abstract In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters, who are capable of engaging in social interactions and expressing with articulated body motions, thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars, incorporates a reflection process based on psychology principles, and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching, a proven industry technique to ensure motion quality, with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively, they enable virtual characters to initiate and sustain dialogues autonomously, while evolving their socio-psychological states. Concurrently, these characters can perform contextually relevant bodily movements. Additionally, a motion captioning module further allows the virtual character to recognize and appropriately respond to human players' actions. Homepage: https://digital-life-project.com/

HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image

  • paper_url: http://arxiv.org/abs/2312.04543
  • repo_url: https://github.com/wutong16/HyperDreamer
  • paper_authors: Tong Wu, Zhibing Li, Shuai Yang, Pan Zhang, Xinggang Pan, Jiaqi Wang, Dahua Lin, Ziwei Liu
  • for: To tackle hyper-realistic 3D content creation and editing from a single image.
  • methods: HyperDreamer builds on 2D diffusion priors and introduces three key designs: viewable 360-degree mesh modeling with high-resolution textures; renderable, semantic-aware material estimation (albedo, roughness, and specular properties) guided by fine-grained semantic segmentation and data-driven priors; and editable, text-guided texture editing of regions selected with a few clicks.
  • results: Experiments show that HyperDreamer models region-aware materials with high-resolution textures and enables user-friendly editing.
    Abstract 3D content creation from a single image is a long-standing yet highly desirable task. Recent advances introduce 2D diffusion priors, yielding reasonable results. However, existing methods are not hyper-realistic enough for post-generation usage, as users cannot view, render and edit the resulting 3D content from a full range. To address these challenges, we introduce HyperDreamer with several key designs and appealing properties: 1) Viewable: 360 degree mesh modeling with high-resolution textures enables the creation of visually compelling 3D models from a full range of observation points. 2) Renderable: Fine-grained semantic segmentation and data-driven priors are incorporated as guidance to learn reasonable albedo, roughness, and specular properties of the materials, enabling semantic-aware arbitrary material estimation. 3) Editable: For a generated model or their own data, users can interactively select any region via a few clicks and efficiently edit the texture with text-based guidance. Extensive experiments demonstrate the effectiveness of HyperDreamer in modeling region-aware materials with high-resolution textures and enabling user-friendly editing. We believe that HyperDreamer holds promise for advancing 3D content creation and finding applications in various domains.

Self-Guided Open-Vocabulary Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.04539
  • repo_url: None
  • paper_authors: Osman Ülger, Maksymilian Kulicki, Yuki Asano, Martin R. Oswald
  • for: To perform open-vocabulary semantic segmentation without any textual input, i.e., without knowing the class names of the segments in advance.
  • methods: The Self-Guided Semantic Segmentation (Self-Seg) framework automatically detects relevant class names from clustered BLIP embeddings and uses them for accurate semantic segmentation; an LLM-based Open-Vocabulary Evaluator (LOVE) is also proposed to effectively assess the predicted open-vocabulary class names.
  • results: Achieves state-of-the-art results on Pascal VOC, ADE20K, and CityScapes for open-vocabulary segmentation without given class names, and competitive performance with methods where class names are given.
    Abstract Vision-Language Models (VLMs) have emerged as promising tools for open-ended image understanding tasks, including open vocabulary segmentation. Yet, direct application of such VLMs to segmentation is non-trivial, since VLMs are trained with image-text pairs and naturally lack pixel-level granularity. Recent works have made advancements in bridging this gap, often by leveraging the shared image-text space in which the image and a provided text prompt are represented. In this paper, we challenge the capabilities of VLMs further and tackle open-vocabulary segmentation without the need for any textual input. To this end, we propose a novel Self-Guided Semantic Segmentation (Self-Seg) framework. Self-Seg is capable of automatically detecting relevant class names from clustered BLIP embeddings and using these for accurate semantic segmentation. In addition, we propose an LLM-based Open-Vocabulary Evaluator (LOVE) to effectively assess predicted open-vocabulary class names. We achieve state-of-the-art results on Pascal VOC, ADE20K and CityScapes for open-vocabulary segmentation without given class names, as well as competitive performance with methods where class names are given. All code and data will be released.
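A hedged sketch of the first step described above, grouping per-patch vision-language embeddings into clusters that can subsequently be named: the embeddings below are random stand-ins for BLIP features, and the captioning/naming step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for BLIP patch embeddings of one image: (num_patches, embed_dim).
patch_embeddings = np.random.randn(24 * 24, 768).astype(np.float32)

# Cluster patches into a small number of candidate regions; in Self-Seg each
# cluster would then be assigned a class name (e.g., via captioning), and those
# names drive the open-vocabulary segmentation.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(patch_embeddings)
segmentation = kmeans.labels_.reshape(24, 24)  # coarse per-patch region map
print(np.unique(segmentation))
```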

PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns

  • paper_url: http://arxiv.org/abs/2312.04534
  • repo_url: https://github.com/ningshuliang/PICTURE
  • paper_authors: Shuliang Ning, Duomin Wang, Yipeng Qin, Zirong Jin, Baoyuan Wang, Xiaoguang Han
  • for: Proposes a new virtual try-on from unconstrained designs (ucVTON) task for photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types, the method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions.
  • methods: A two-stage pipeline that explicitly disentangles style and texture to address the entanglement that arises when full garment images are used as conditions: the first stage generates a human parsing map reflecting the desired style, and the second stage composites textures onto the parsing map areas. To represent complex and non-stationary textures, hierarchical and balanced CLIP features are extracted and position encoding is applied in VTON.
  • results: Experiments demonstrate superior synthesis quality and personalization; flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design.
    Abstract In this paper, we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types, our method allows flexible specification of style (text or image) and texture (full garment, cropped sections, or texture patches) conditions. To address the entanglement challenge when using full garment images as conditions, we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage, we generate a human parsing map reflecting the desired style conditioned on the input. In the second stage, we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have never been achieved in previous fashion editing works, we first propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design.

Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

  • paper_url: http://arxiv.org/abs/2312.04533
  • repo_url: None
  • paper_authors: Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns
  • for: To develop a robotics framework based on vision-language models (VLMs) for language-conditioned object rearrangement.
  • methods: The robot autonomously constructs a 3D representation of the scene in which objects can be rearranged virtually; renders of candidate arrangements are evaluated by a VLM, and the arrangement that best satisfies the user instruction is recreated in the real world with pick-and-place.
  • results: Experiments on real-world tasks show the framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and applicable to both tabletop and 6-DoF rearrangement tasks, all zero-shot without a training dataset of example arrangements.
    Abstract We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
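The selection mechanism described above, rendering virtual rearrangements and letting a VLM pick the best one, can be sketched as a simple scoring loop. `render_arrangement` and `vlm_score` are hypothetical placeholders for the scene renderer and the vision-language scorer; they are not functions from the paper.

```python
from typing import Callable, List, Sequence

def choose_arrangement(
    candidate_poses: Sequence[dict],
    instruction: str,
    render_arrangement: Callable[[dict], "Image"],  # hypothetical renderer
    vlm_score: Callable[["Image", str], float],     # hypothetical VLM scorer
) -> dict:
    """Return the candidate object arrangement whose render best matches the instruction."""
    scored: List[tuple] = []
    for poses in candidate_poses:
        image = render_arrangement(poses)           # render the virtual rearrangement
        scored.append((vlm_score(image, instruction), poses))
    best_score, best_poses = max(scored, key=lambda s: s[0])
    return best_poses                               # then executed with pick-and-place
```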

Camera Height Doesn’t Change: Unsupervised Monocular Scale-Aware Road-Scene Depth Estimation

  • paper_url: http://arxiv.org/abs/2312.04530
  • repo_url: None
  • paper_authors: Genki Kinoshita, Ko Nishino
  • for: To propose a monocular depth estimation method that requires neither auxiliary sensors nor scale supervision.
  • methods: StableCamH, a scale-aware monocular depth estimation method that exploits prior knowledge of object heights in the scene and aggregates the height cues into a single measure invariant across all frames of a road video sequence, the camera height; monocular depth estimation is then formulated as camera height optimization for robust and accurate unsupervised end-to-end training, enabled by a learning-based size prior that converts car appearance directly into dimensions.
  • results: Extensive experiments on KITTI and Cityscapes show the effectiveness of StableCamH, its state-of-the-art accuracy compared with related methods, and its generalizability.
    Abstract Monocular depth estimators either require explicit scale supervision through auxiliary sensors or suffer from scale ambiguity, which renders them difficult to deploy in downstream applications. A possible source of scale is the sizes of objects found in the scene, but inaccurate localization makes them difficult to exploit. In this paper, we introduce a novel scale-aware monocular depth estimation method called StableCamH that does not require any auxiliary sensor or supervision. The key idea is to exploit prior knowledge of object heights in the scene but aggregate the height cues into a single invariant measure common to all frames in a road video sequence, namely the camera height. By formulating monocular depth estimation as camera height optimization, we achieve robust and accurate unsupervised end-to-end training. To realize StableCamH, we devise a novel learning-based size prior that can directly convert car appearance into its dimensions. Extensive experiments on KITTI and Cityscapes show the effectiveness of StableCamH, its state-of-the-art accuracy compared with related methods, and its generalizability. The training framework of StableCamH can be used for any monocular depth estimation method and will hopefully become a fundamental building block for further work.
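The key idea above, recovering metric scale from a known camera height, reduces at inference to a simple rescaling once the scale-ambiguous camera height implied by the current depth prediction is estimated (an illustrative relation, not the paper's full optimization):

$$ s = \frac{h_{\text{cam}}^{\text{known}}}{\hat{h}_{\text{cam}}}, \qquad D_{\text{metric}}(u,v) = s \cdot \hat{D}(u,v), $$

where $\hat{h}_{\text{cam}}$ is the camera height estimated from the predicted (up-to-scale) depth and the road plane, $h_{\text{cam}}^{\text{known}}$ is the frame-invariant camera height aggregated from object-height cues, and $\hat{D}$ is the unscaled depth map.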

Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance

  • paper_url: http://arxiv.org/abs/2312.04529
  • repo_url: None
  • paper_authors: Yuto Enyo, Ko Nishino
  • for: To recover the full frequency spectrum of the illumination jointly with the object reflectance from a single image.
  • methods: Introduces the first stochastic inverse rendering method, the Diffusion Reflectance Map Network (DRMNet), which solves this blind inverse problem in the reflectance map by learning to reverse the image formation process with a novel diffusion model.
  • results: Given a reflectance map converted and completed from the single input image, DRMNet generates the reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance; trained on an extensive synthetic dataset, it generalizes to real images with state-of-the-art accuracy on established datasets.
    Abstract Reflectance bounds the frequency spectrum of illumination in the object appearance. In this paper, we introduce the first stochastic inverse rendering method, which recovers the full frequency spectrum of an illumination jointly with the object reflectance from a single image. Our key idea is to solve this blind inverse problem in the reflectance map, an appearance representation invariant to the underlying geometry, by learning to reverse the image formation with a novel diffusion model which we refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed reflectance map converted and completed from the single input image, DRMNet generates a reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance. The forward process can be understood as gradually filtering a natural illumination with lower and lower frequency reflectance and additive Gaussian noise. DRMNet learns to invert this process with two subnetworks, IllNet and RefNet, which work in concert towards this joint estimation. The network is trained on an extensive synthetic dataset and is demonstrated to generalize to real images, showing state-of-the-art accuracy on established datasets.

Correspondences of the Third Kind: Camera Pose Estimation from Object Reflection

  • paper_url: http://arxiv.org/abs/2312.04527
  • repo_url: None
  • paper_authors: Kohei Yamashita, Vincent Lepetit, Ko Nishino
  • for: To introduce a third kind of correspondence, reflection correspondences, and use them to estimate camera pose and object shape from object appearance alone, without relying on the background.
  • methods: A neural correspondence estimator and a RANSAC algorithm that fully leverage all three kinds of correspondences (pixel, 3D surface, and reflection correspondences) for robust and accurate joint camera pose and object shape estimation.
  • results: Reflection correspondences resolve the ambiguities arising from geometric and radiometric distortions, such as the generalized bas-relief ambiguity, that corrupt pixel and 3D correspondences.
    Abstract Computer vision has long relied on two kinds of correspondences: pixel correspondences in images and 3D correspondences on object surfaces. Is there another kind, and if there is, what can they do for us? In this paper, we introduce correspondences of the third kind we call reflection correspondences and show that they can help estimate camera pose by just looking at objects without relying on the background. Reflection correspondences are point correspondences in the reflected world, i.e., the scene reflected by the object surface. The object geometry and reflectance alters the scene geometrically and radiometrically, respectively, causing incorrect pixel correspondences. Geometry recovered from each image is also hampered by distortions, namely generalized bas-relief ambiguity, leading to erroneous 3D correspondences. We show that reflection correspondences can resolve the ambiguities arising from these distortions. We introduce a neural correspondence estimator and a RANSAC algorithm that fully leverages all three kinds of correspondences for robust and accurate joint camera pose and object shape estimation just from the object appearance. The method expands the horizon of numerous downstream tasks, including camera pose estimation for appearance modeling (e.g., NeRF) and motion estimation of reflective objects (e.g., cars on the road), to name a few, as it relieves the requirement of overlapping background.

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.04524
  • repo_url: https://github.com/rehg-lab/rave
  • paper_authors: Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag
  • for: To propose a zero-shot video editing method that produces high-quality videos from text prompts while preserving the original motion and semantic structure.
  • methods: Builds on pre-trained text-to-image diffusion models without additional training and uses a novel noise shuffling strategy that exploits spatio-temporal interactions between frames to produce temporally consistent videos faster than existing methods, with lower memory requirements that allow longer videos.
  • results: Qualitative and quantitative experiments on a new evaluation dataset, ranging from object-focused scenes to complex human activities and dynamic scenes, show that RAVE outperforms existing methods across diverse editing scenarios, from local attribute modifications to shape transformations.
    Abstract Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames, to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset and videos can be found in https://rave-video.github.io.
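A hedged sketch of the noise-shuffling idea described above: per-frame latents are tiled into a grid so the diffusion model denoises them jointly, and the frame order is randomly permuted between steps so spatio-temporal interactions spread across the whole clip. This is a simplified illustration with assumed latent shapes, not RAVE's implementation.

```python
import torch

def shuffle_and_tile(latents: torch.Tensor, grid: int = 3) -> tuple:
    """latents: (T, C, h, w) per-frame latents. Returns (tiled, permutation)."""
    T, C, h, w = latents.shape
    perm = torch.randperm(T)                      # random frame order for this step
    shuffled = latents[perm]
    rows = [torch.cat(list(shuffled[r * grid:(r + 1) * grid]), dim=-1)
            for r in range(T // grid)]            # concatenate frames along width
    tiled = torch.cat(rows, dim=-2)               # then stack rows along height
    return tiled.unsqueeze(0), perm               # (1, C, grid*h, grid*w)

latents = torch.randn(9, 4, 32, 32)               # 9 frames arranged as a 3x3 grid
tiled, perm = shuffle_and_tile(latents)
print(tiled.shape)                                # torch.Size([1, 4, 96, 96])
# After each denoising step, the grid is split back into frames and the inverse
# permutation restores temporal order before the next shuffle.
```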

Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping

  • paper_url: http://arxiv.org/abs/2312.04521
  • repo_url: None
  • paper_authors: Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano
  • for: To address industrial multimodal anomaly detection (AD), localizing anomalies using point clouds and RGB images.
  • methods: A novel light and fast framework that learns to map features from one modality to the other on nominal samples; at test time, anomalies are detected by pinpointing inconsistencies between observed and mapped features.
  • results: Achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset, with faster inference and lower memory occupancy than previous multimodal AD methods; a layer-pruning technique further improves memory and time efficiency at a marginal cost in performance.
    Abstract The paper explores the industrial multimodal Anomaly Detection (AD) task, which exploits point clouds and RGB images to localize anomalies. We introduce a novel light and fast framework that learns to map features from one modality to the other on nominal samples. At test time, anomalies are detected by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Moreover, we propose a layer-pruning technique to improve memory and time efficiency with a marginal sacrifice in performance.
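A hedged sketch of the crossmodal mapping idea described above: a small network learns, on nominal samples only, to predict RGB features from 3D features, and at test time the per-location discrepancy between observed and predicted features serves as the anomaly score. Feature dimensions and layer sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    """Maps features of one modality to the other (trained on nominal data only)."""

    def __init__(self, in_dim: int = 384, out_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

mapper_3d_to_rgb = FeatureMapper(384, 768)

# Training on nominal samples: regress RGB features from the paired 3D features.
feats_3d, feats_rgb = torch.randn(1024, 384), torch.randn(1024, 768)
loss = torch.nn.functional.mse_loss(mapper_3d_to_rgb(feats_3d), feats_rgb)

# Test time: large mapping errors flag anomalous locations.
with torch.no_grad():
    anomaly_map = (mapper_3d_to_rgb(feats_3d) - feats_rgb).norm(dim=-1)  # (1024,)
print(anomaly_map.shape)
```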

Bootstrapping Autonomous Radars with Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2312.04519
  • repo_url: None
  • paper_authors: Yiduo Hao, Sohrab Madani, Junfeng Guan, Mohammed Alloulah, Saurabh Gupta, Haitham Hassanieh
  • for: To train perception models for autonomous driving on radar data, which can operate in fog and bad weather, without requiring costly large-scale radar annotation.
  • methods: A self-supervised learning framework that leverages large amounts of unlabeled radar data to pre-train radar-only embeddings, combining radar-to-radar and radar-to-vision contrastive losses over radar heatmaps paired with their corresponding camera images.
  • results: For downstream object detection, the proposed self-supervision framework improves the accuracy of state-of-the-art supervised baselines by 5.8% in mAP.
    Abstract The perception of autonomous vehicles using radars has attracted increased research interest due its ability to operate in fog and bad weather. However, training radar models is hindered by the cost and difficulty of annotating large-scale radar data. To overcome this bottleneck, we propose a self-supervised learning framework to leverage the large amount of unlabeled radar data to pre-train radar-only embeddings for self-driving perception tasks. The proposed method combines radar-to-radar and radar-to-vision contrastive losses to learn a general representation from unlabeled radar heatmaps paired with their corresponding camera images. When used for downstream object detection, we demonstrate that the proposed self-supervision framework can improve the accuracy of state-of-the-art supervised baselines by 5.8% in mAP.
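The radar-to-vision contrastive objective mentioned above is typically an InfoNCE-style loss over paired embeddings; a generic sketch is shown below as an assumption about the loss family, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(radar_emb: torch.Tensor, image_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: matched radar/image pairs are positives, all others negatives."""
    radar_emb = F.normalize(radar_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = radar_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(radar_emb.size(0))          # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```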

FRNet: Frustum-Range Networks for Scalable LiDAR Segmentation

  • paper_url: http://arxiv.org/abs/2312.04484
  • repo_url: https://github.com/Xiangxu-0103/FRNet
  • paper_authors: Xiang Xu, Lingdong Kong, Hui Shuai, Qingshan Liu
  • for: To improve the accuracy and efficiency of LiDAR segmentation, especially for real-time processing.
  • methods: A simple yet powerful FRNet that restores contextual information to range-image pixels using the corresponding frustum LiDAR points, consisting of a frustum feature encoder module, a frustum-point fusion module, and a head fusion module.
  • results: Extensive experiments on four popular LiDAR segmentation benchmarks under various task setups show that FRNet achieves competitive performance while maintaining high efficiency.
    Abstract LiDAR segmentation is crucial for autonomous driving systems. The recent range-view approaches are promising for real-time processing. However, they suffer inevitably from corrupted contextual information and rely heavily on post-processing techniques for prediction refinement. In this work, we propose a simple yet powerful FRNet that restores the contextual information of the range image pixels with corresponding frustum LiDAR points. Firstly, a frustum feature encoder module is used to extract per-point features within the frustum region, which preserves scene consistency and is crucial for point-level predictions. Next, a frustum-point fusion module is introduced to update per-point features hierarchically, which enables each point to extract more surrounding information via the frustum features. Finally, a head fusion module is used to fuse features at different levels for final semantic prediction. Extensive experiments on four popular LiDAR segmentation benchmarks under various task setups demonstrate our superiority. FRNet achieves competitive performance while maintaining high efficiency. The code is publicly available.

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2312.04483
  • repo_url: None
  • paper_authors: Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie Wei, Yingya Zhang, Changxin Gao, Nong Sang
  • for: To improve text-to-video generation (T2V) so that it produces realistic and diverse videos.
  • methods: HiGen, a diffusion-model-based method that decouples the spatial and temporal factors of videos at two levels. At the structure level, the T2V task is split into spatial reasoning and temporal reasoning handled by a unified denoiser: spatially coherent priors are generated from text, and temporally coherent motions are generated from those priors. At the content level, two subtle cues expressing motion and appearance changes are extracted from the input video content and used to guide training, enabling flexible content variation and better temporal stability.
  • results: Extensive experiments demonstrate that HiGen outperforms state-of-the-art T2V methods, generating realistic videos with semantic accuracy and motion stability.
    Abstract Despite diffusion models having shown powerful abilities to generate photorealistic images, generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together, leading to a notably increased complexity of text-to-video generation (T2V). In this work, we propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives, i.e., structure level and content level. At the structure level, we decompose the T2V task into two steps, including spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that can express motion and appearance changes, respectively. These two cues then guide the model's training for generating videos, enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm, HiGen can effectively reduce the complexity of this task and generate realistic videos with semantics accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods.

Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

  • paper_url: http://arxiv.org/abs/2312.04466
  • repo_url: https://github.com/kiranchhatre/amuse
  • paper_authors: Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart
  • for: To improve speech-driven 3D gesture synthesis with control over the expressed emotion and personal style.
  • methods: AMUSE, an emotional speech-driven body animation model based on latent diffusion: the driving audio is mapped to three disentangled latent vectors, for content (gestures related to speech rhythm and word utterances), emotion, and personal style, and a latent diffusion model generates gesture motion sequences conditioned on these vectors.
  • results: Qualitative, quantitative, and perceptual evaluations show that AMUSE generates realistic gesture sequences; compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech.
    Abstract Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our project website is amuse.is.tue.mpg.de.

FitDiff: Robust monocular 3D facial shape and reflectance estimation using Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.04465
  • repo_url: None
  • paper_authors: Stathis Galanakis, Alexandros Lattas, Stylianos Moschoglou, Stefanos Zafeiriou
  • for: To generate relightable 3D facial avatars that can be used as-is in common rendering engines, starting from a single unconstrained facial image.
  • methods: FitDiff, a diffusion-based generative model that jointly outputs facial reflectance maps (diffuse and specular albedo and normals) and shape, conditioned on an identity embedding extracted from an "in-the-wild" 2D facial image; the reverse diffusion process is guided by perceptual and face recognition losses.
  • results: Trained solely on an annotated subset of a public facial dataset paired with 3D reconstructions, FitDiff reconstructs relightable human avatars with state-of-the-art performance and strong generalization.
    Abstract The remarkable progress in 3D face reconstruction has resulted in high-detail and photorealistic facial representations. Recently, Diffusion Models have revolutionized the capabilities of generative methods by achieving far better performance than GANs. In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. This model accurately generates relightable facial avatars, utilizing an identity embedding extracted from an "in-the-wild" 2D facial image. Our multi-modal diffusion model concurrently outputs facial reflectance maps (diffuse and specular albedo and normals) and shapes, showcasing great generalization capabilities. It is solely trained on an annotated subset of a public facial dataset, paired with 3D reconstructions. We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process using perceptual and face recognition losses. Being the first LDM conditioned on face recognition embeddings, FitDiff reconstructs relightable human avatars, that can be used as-is in common rendering engines, starting only from an unconstrained facial image, and achieving state-of-the-art performance.
    摘要 “三维面部重建技术近年取得了显著进展,实现了高细节和照片级真实的面部表示。最近,扩散模型为生成方法带来了革命性的改进,性能超越了GAN。在这项工作中,我们介绍了FitDiff,一个基于扩散的三维人脸化身生成模型。该模型利用从“自然场景(in-the-wild)”二维面部图像中提取的身份嵌入,准确地生成可重光照的人脸化身。我们的多模态扩散模型同时输出面部反射贴图(漫反射与镜面反射率以及法线)和形状,展示了很好的泛化能力。它仅在一个公开人脸数据集的标注子集及其对应的3D重建上训练。我们重新审视了传统的三维面部拟合方法,通过感知损失和人脸识别损失来引导反向扩散过程。作为首个以人脸识别嵌入为条件的LDM,FitDiff可以仅从一张不受约束的面部图像出发,重建可直接用于常见渲染引擎的可重光照人脸化身,并达到当前最佳性能。”

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

  • paper_url: http://arxiv.org/abs/2312.04433
  • repo_url: None
  • paper_authors: Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, Hongming Shan
  • for: 生成个性化视频 from a few static images and target motion videos
  • methods: 分为两个阶段:主题学习和动作学习,使用预训练的视频扩散模型,并使用文本反转和精心设计的身份适配器来准确捕捉主题的细腻外貌
  • results: 比对状态方法有较高表现,可自由定制任何主题和动作
    Abstract Customized generation using diffusion models has made impressive progress in image generation, but remains unsatisfactory in the challenging video generation task, as it requires the controllability of both subjects and motions. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. The subject learning aims to accurately capture the fine appearance of the subject from provided images, which is achieved by combining textual inversion and fine-tuning of our carefully designed identity adapter. In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of our DreamVideo over the state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io.
    摘要 使用扩散模型进行定制化生成已经在图像生成中取得了很大进展,但在视频生成任务中仍然不够令人满意,因为该任务需要同时控制主体和运动。为此,我们提出了DreamVideo,一种仅需目标主体的少量静态图像和目标运动的少量视频即可生成个性化视频的新方法。DreamVideo利用预训练的视频扩散模型,将这个任务分解为两个阶段:主体学习和运动学习。主体学习的目标是准确捕捉所提供图像中主体的细致外观,这通过文本反转和对我们精心设计的身份适配器进行微调来实现。在运动学习阶段,我们设计了运动适配器,并在给定视频上进行微调,以有效建模目标运动模式。将这两个轻量且高效的适配器结合,可以将任意主体与任意运动自由组合。大量实验结果表明,DreamVideo在定制化视频生成上优于当前最佳方法。我们的项目页面是https://dreamvideo-t2v.github.io。

Approximate Caching for Efficiently Serving Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.04429
  • repo_url: None
  • paper_authors: Shubham Agarwal, Subrata Mitra, Sarthak Chakraborty, Srikrishna Karanam, Koyel Mukherjee, Shiv Saini
  • for: 提高 diffusion model 的生产环境效率和可扩展性
  • methods: 使用 approximate-caching 技术, reuse intermediate noise states,提高图像生成效率和可扩展性
  • results: 在两个实际生产工作负载上,相比基准系统实现平均 % 的 GPU 计算节省、19.8% 的端到端延迟降低和 19% 的成本节省
    Abstract Text-to-image generation using diffusion models has seen explosive popularity owing to their ability in producing high quality images adhering to text prompts. However, production-grade diffusion model serving is a resource intensive task that not only require high-end GPUs which are expensive but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that can reduce such iterative denoising steps for an image generation based on a prompt by reusing intermediate noise states created during a prior image generation for similar prompts. Based on this idea, we present an end to end text-to-image system, Nirvana, that uses the approximate-caching with a novel cache management-policy Least Computationally Beneficial and Frequently Used (LCBFU) to provide % GPU compute savings, 19.8% end-to-end latency reduction and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity and reuse of intermediate states in a large production environment.
    摘要 We have developed an end-to-end text-to-image system, Nirvana, that incorporates approximate-caching with a novel cache management policy called Least Computationally Beneficial and Frequently Used (LCBFU). This approach achieves an average of 30% GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings on two real production workloads. In addition, we have conducted an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity, and reuse of intermediate states in a large production environment. Our results show that the proposed technique can significantly reduce the computational resources and latency required for text-to-image generation, making it more practical for real-world applications.
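The caching idea is easy to prototype. The sketch below (assumptions throughout; not the Nirvana implementation) stores intermediate latents keyed by prompt embedding, resumes denoising from a cached state when a new prompt is similar enough, and evicts entries using a simplified frequency-times-compute-saved score loosely inspired by LCBFU.

```python
# Toy approximate-caching sketch: reuse intermediate noise states for similar prompts.
import numpy as np

class ApproxCache:
    def __init__(self, capacity=1000, sim_threshold=0.9):
        self.entries = []                    # each: dict(embed, latent, step, hits)
        self.capacity = capacity
        self.sim_threshold = sim_threshold

    def lookup(self, prompt_embed):
        best, best_sim = None, -1.0
        for e in self.entries:
            sim = float(prompt_embed @ e["embed"] /
                        (np.linalg.norm(prompt_embed) * np.linalg.norm(e["embed"]) + 1e-8))
            if sim > best_sim:
                best, best_sim = e, sim
        if best is not None and best_sim >= self.sim_threshold:
            best["hits"] += 1
            return best["latent"], best["step"]          # resume denoising from this step
        return None, 0                                    # cache miss: start from pure noise

    def insert(self, prompt_embed, latent, step):
        if len(self.entries) >= self.capacity:
            # evict the entry with the lowest (hits * steps_saved) score
            self.entries.sort(key=lambda e: e["hits"] * e["step"])
            self.entries.pop(0)
        self.entries.append(dict(embed=prompt_embed, latent=latent, step=step, hits=0))

def generate(prompt_embed, cache, total_steps=50, denoise_step=lambda x, t: x):
    latent, start = cache.lookup(prompt_embed)
    if latent is None:
        latent = np.random.randn(4, 64, 64)
    for t in range(start, total_steps):
        latent = denoise_step(latent, t)                  # placeholder for one diffusion step
        if t == total_steps // 2:                         # stash a mid-trajectory state for reuse
            cache.insert(prompt_embed, latent.copy(), t)
    return latent
```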

Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views

  • paper_url: http://arxiv.org/abs/2312.04424
  • repo_url: None
  • paper_authors: Yabo Chen, Jiemin Fang, Yuyang Huang, Taoran Yi, Xiaopeng Zhang, Lingxi Xie, Xinggang Wang, Wenrui Dai, Hongkai Xiong, Qi Tian
  • for: 提出了一种用于生成多视图3D图像的方法,以解决单视图图像生成多视图3D图像的挑战。
  • methods: 使用了 Zero-1-to-3 方法,通过将单视图图像和摄像头pose作为条件信息,生成目标视图图像。
  • results: 提出了一种名为 Cascade-Zero123 的级联生成框架,可以在不同视图之间建立更高的几何与视觉一致性,并且在许多复杂和具有挑战性的场景下表现优于 Zero-1-to-3。
    Abstract Synthesizing multi-view 3D from one single image is a significant and challenging task. For this goal, Zero-1-to-3 methods aim to extend a 2D latent diffusion model to the 3D scope. These approaches generate the target-view image with a single-view source image and the camera pose as condition information. However, the one-to-one manner adopted in Zero-1-to-3 incurs challenges for building geometric and visual consistency across views, especially for complex objects. We propose a cascade generation framework constructed with two Zero-1-to-3 models, named Cascade-Zero123, to tackle this issue, which progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism is designed to generate several nearby views at first. These views are then fed into the second-stage model along with the source image as generation conditions. With self-prompted multiple views as the supplementary information, our Cascade-Zero123 generates more highly consistent novel-view images than Zero-1-to-3. The promotion is significant for various complex and challenging scenes, involving insects, humans, transparent objects, and stacked multiple objects etc. The project page is at https://cascadezero123.github.io/.

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.04410
  • repo_url: https://github.com/shi-labs/smooth-diffusion
  • paper_authors: Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi
  • for: 这 paper 的目的是提出一种新的类型的扩散模型,即 Smooth Diffusion,可以同时具有高性能和平滑的特性。
  • methods: 该 paper 使用 Step-wise Variation Regularization 来保证扩散模型潜在空间的平滑,并提出 interpolation standard deviation (ISTD) 度量来评估平滑性。
  • results: 对比其他扩散模型,Smooth Diffusion 在文本到图像(T2I)生成、图像 interpolating、图像反转和图像编辑等下游任务中表现出色,并且可以减少 visual fluctuations。
    Abstract Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks, including image interpolation, inversion, and editing. In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically, we introduce Step-wise Variation Regularization to enforce the proportion between the variations of an arbitrary input latent and that of the output image is a constant at any diffusion training step. In addition, we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion.
    摘要 最近,扩散模型在文本到图像(T2I)生成中取得了显著进步,生成图像的保真度和多样性都有所提高。然而,扩散模型潜在空间的平滑性仍未得到充分探究。平滑的潜在空间可以确保对输入潜变量的扰动对应于输出图像的平稳变化,这一属性对图像插值、反演和编辑等下游任务很有价值。在这项工作中,我们通过观察微小潜变量变化引起的明显视觉波动,揭示了扩散潜在空间的非平滑性。为了解决这个问题,我们提出了平滑扩散(Smooth Diffusion),一类既高性能又平滑的新型扩散模型。具体来说,我们引入逐步变化正则化(Step-wise Variation Regularization),使任意输入潜变量的变化量与输出图像变化量之间的比例在任一扩散训练步骤中保持恒定。此外,我们还设计了插值标准差(ISTD)度量,用于评估扩散模型潜在空间的平滑性。大量实验表明,平滑扩散在T2I生成以及多种下游任务中都是更理想的方案。Smooth Diffusion 以即插即用的 Smooth-LoRA 形式实现,可与各种社区模型配合使用。代码可以在 GitHub 上找到:https://github.com/SHI-Labs/Smooth-Diffusion。
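A hedged sketch of what a step-wise variation regularizer and an ISTD-style probe could look like; the exact constant, perturbation scale, and distance measure used by the paper are assumptions here.

```python
# Sketch of a variation regularizer and an ISTD-style smoothness probe (assumptions only).
import torch

def variation_regularizer(model, z, t, cond, eps=1e-3):
    delta = torch.randn_like(z)
    delta = eps * delta / delta.flatten(1).norm(dim=1).view(-1, *([1] * (z.dim() - 1)))
    out = model(z, t, cond)
    out_perturbed = model(z + delta, t, cond)
    in_var = delta.flatten(1).norm(dim=1)
    out_var = (out_perturbed - out).flatten(1).norm(dim=1)
    ratio = out_var / (in_var + 1e-8)
    # penalize deviation of the variation ratio from its batch mean (a fixed constant also works)
    return ((ratio - ratio.mean().detach()) ** 2).mean()

def interpolation_std(decode, z0, z1, steps=10):
    """ISTD-style probe: std of distances between images decoded from consecutive
    latent interpolations (smaller = smoother latent space)."""
    zs = [torch.lerp(z0, z1, a) for a in torch.linspace(0, 1, steps)]
    imgs = [decode(z) for z in zs]
    dists = torch.stack([(imgs[i + 1] - imgs[i]).flatten().norm() for i in range(steps - 1)])
    return dists.std()

# toy check with a dummy "model" (identity on the latent): ratio is constant, so the loss ~ 0
z = torch.randn(2, 4, 8, 8)
reg = variation_regularizer(lambda z_, t_, c_: z_, z, t=torch.zeros(2), cond=None)
```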

OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization

  • paper_url: http://arxiv.org/abs/2312.04403
  • repo_url: None
  • paper_authors: Dongchen Han, Xiaojun Jia, Yang Bai, Jindong Gu, Yang Liu, Xiaochun Cao
  • for: 本研究旨在探讨vision-language预训模型对多 modal adversarial examples (AE) 的抵御性。
  • methods: 我们提出了一种基于优化运输理论的 adversarial 攻击方法,称为 OT-Attack。该方法通过对图像和文本集的特征进行优化运输来确定最佳的映射,以生成高转移性 adversarial examples。
  • results: 我们在不同的网络架构和数据集上进行了广泛的实验,发现 OT-Attack 可以在图像-文本匹配任务中提高 adversarial transferability,并且比现有的状态对方法更高效。
    Abstract Vision-language pre-training (VLP) models demonstrate impressive abilities in processing both images and text. However, they are vulnerable to multi-modal adversarial examples (AEs). Investigating the generation of high-transferability adversarial examples is crucial for uncovering VLP models' vulnerabilities in practical scenarios. Recent works have indicated that leveraging data augmentation and image-text modal interactions can enhance the transferability of adversarial examples for VLP models significantly. However, they do not consider the optimal alignment problem between dataaugmented image-text pairs. This oversight leads to adversarial examples that are overly tailored to the source model, thus limiting improvements in transferability. In our research, we first explore the interplay between image sets produced through data augmentation and their corresponding text sets. We find that augmented image samples can align optimally with certain texts while exhibiting less relevance to others. Motivated by this, we propose an Optimal Transport-based Adversarial Attack, dubbed OT-Attack. The proposed method formulates the features of image and text sets as two distinct distributions and employs optimal transport theory to determine the most efficient mapping between them. This optimal mapping informs our generation of adversarial examples to effectively counteract the overfitting issues. Extensive experiments across various network architectures and datasets in image-text matching tasks reveal that our OT-Attack outperforms existing state-of-the-art methods in terms of adversarial transferability.
    摘要 vision-language pre-training(VLP)模型显示出处理图像和文本的印象性能力。然而,它们受到多 modal adversarial examples(AE)的攻击。研究生成高传播性 adversarial examples的生成是关键的,以探索 VLP 模型在实际场景中的漏洞。current works 表明,通过数据增强和图像文本模式交互,可以提高 VLP 模型中 adversarial examples的传播性。然而,它们没有考虑数据增强图像文本对的最佳对齐问题。这个缺点导致 adversarial examples 过于适应源模型,因此限制了改进的传播性。在我们的研究中,我们首先探索了通过数据增强生成的图像集和其对应的文本集之间的交互。我们发现,通过数据增强生成的图像样本可以与某些文本集进行最佳对齐,而与其他文本集之间的对齐更为低。这一发现使我们提出了一种基于 Optimal Transport 的 adversarial attack,称为 OT-Attack。我们的方法将图像和文本集的特征视为两个不同的分布,并使用 Optimal Transport 理论来确定这两个分布之间的最有效的映射。这种最有效的映射为我们生成 adversarial examples,以有效地对抗过拟合问题。我们在不同的网络架构和数据集上进行了广泛的实验,发现 OT-Attack 在图像文本匹配任务中超过了现有状态的方法,在 adversarial transferability 方面表现出色。
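The core of the method is an entropic optimal-transport plan between augmented-image features and text features, used to weight image-text pairs in the adversarial objective. Below is an illustrative Sinkhorn-based sketch (not the official OT-Attack code; the cost function and loss weighting are assumptions).

```python
# Entropic OT plan between image and text feature sets, used to weight an adversarial loss.
import torch

def sinkhorn_plan(img_feats, txt_feats, reg=0.05, iters=100):
    img_feats = torch.nn.functional.normalize(img_feats, dim=-1)
    txt_feats = torch.nn.functional.normalize(txt_feats, dim=-1)
    cost = 1.0 - img_feats @ txt_feats.T                      # (n, m) cosine cost
    n, m = cost.shape
    K = torch.exp(-cost / reg)
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    u, v = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    for _ in range(iters):                                    # Sinkhorn-Knopp iterations
        u = a / (K @ v + 1e-8)
        v = b / (K.T @ u + 1e-8)
    return torch.diag(u) @ K @ torch.diag(v)                  # transport plan (n, m)

def ot_weighted_adv_loss(img_feats, txt_feats):
    plan = sinkhorn_plan(img_feats, txt_feats)
    sim = img_feats @ txt_feats.T
    # the attacker maximizes this loss, i.e. pushes matched pairs apart under the OT weighting
    return -(plan * sim).sum()
```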

PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction

  • paper_url: http://arxiv.org/abs/2312.04393
  • repo_url: https://github.com/wyhuai/PhysHOI
  • paper_authors: Yinhuai Wang, Jing Lin, Ailing Zeng, Zhengyi Luo, Jian Zhang, Lei Zhang
  • for: 本研究旨在帮助人工智能系统学习人类与物体之间的互动,以提高未来的智能动画和 робо学系统的能力。
  • methods: 本研究使用物理学基的HOI模拟方法,无需特定任务奖励。研究者引入了Contact Graph来显式地表示人体和物体之间的接触关系,并设计了Contact Graph奖励,以便准确地模拟HOI。
  • results: 研究者通过使用PhysHOI模型,成功地模拟了多种HOI任务,包括全身抓取和篮球技巧。此外,研究者还介绍了 BallPlay 数据集,该数据集包含了八种全身篮球技巧。
    Abstract Humans interact with objects all the time. Enabling a humanoid to learn human-object interaction (HOI) is a key step for future smart animation and intelligent robotics systems. However, recent progress in physics-based HOI requires carefully designed task-specific rewards, making the system unscalable and labor-intensive. This work focuses on dynamic HOI imitation: teaching humanoid dynamic interaction skills through imitating kinematic HOI demonstrations. It is quite challenging because of the complexity of the interaction between body parts and objects and the lack of dynamic HOI data. To handle the above issues, we present PhysHOI, the first physics-based whole-body HOI imitation approach without task-specific reward designs. Except for the kinematic HOI representations of humans and objects, we introduce the contact graph to model the contact relations between body parts and objects explicitly. A contact graph reward is also designed, which proved to be critical for precise HOI imitation. Based on the key designs, PhysHOI can imitate diverse HOI tasks simply yet effectively without prior knowledge. To make up for the lack of dynamic HOI scenarios in this area, we introduce the BallPlay dataset that contains eight whole-body basketball skills. We validate PhysHOI on diverse HOI tasks, including whole-body grasping and basketball skills.
    摘要 人类无时无刻不在与物体交互。让人形智能体学习人与物体交互(HOI)是未来智能动画和智能机器人系统的关键一步。然而,近期基于物理的HOI研究需要精心设计任务特定的奖励,使得系统难以扩展且耗费人力。本工作关注动态HOI模仿:通过模仿运动学HOI示例来教会人形角色动态交互技能。由于身体部位与物体之间交互的复杂性以及动态HOI数据的缺乏,这一任务颇具挑战性。为解决上述问题,我们提出了PhysHOI,首个无需任务特定奖励设计的基于物理的全身HOI模仿方法。除了人和物体的运动学HOI表示之外,我们引入接触图(Contact Graph)来显式建模身体部位与物体之间的接触关系,并设计了对精确HOI模仿至关重要的接触图奖励。基于这些关键设计,PhysHOI无需先验知识即可简单而有效地模仿多种HOI任务。为弥补该领域动态HOI场景的不足,我们构建了包含八种全身篮球技巧的BallPlay数据集,并在全身抓取和篮球技巧等多种HOI任务上验证了PhysHOI。
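A contact graph and its reward can be sketched in a few lines. The distance threshold and exponential shaping below are assumptions meant only to illustrate the idea of rewarding agreement between simulated and reference contacts.

```python
# Hedged sketch of a contact-graph reward: contacts between body parts and object parts are
# a binary adjacency matrix; the reward decreases with the mismatch to the demonstration.
import numpy as np

def contact_graph(body_positions, object_positions, threshold=0.05):
    """body_positions: (B, 3), object_positions: (O, 3) -> binary contact matrix (B, O)."""
    dists = np.linalg.norm(body_positions[:, None, :] - object_positions[None, :, :], axis=-1)
    return (dists < threshold).astype(np.float32)

def contact_graph_reward(sim_body, sim_obj, ref_body, ref_obj, k=5.0):
    cg_sim = contact_graph(sim_body, sim_obj)
    cg_ref = contact_graph(ref_body, ref_obj)
    mismatch = np.abs(cg_sim - cg_ref).mean()
    return float(np.exp(-k * mismatch))        # 1.0 when contacts match the demonstration

# Example: 24 body markers vs. 8 object key points from simulation and reference.
r = contact_graph_reward(np.random.rand(24, 3), np.random.rand(8, 3),
                         np.random.rand(24, 3), np.random.rand(8, 3))
```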

AniRes2D: Anisotropic Residual-enhanced Diffusion for 2D MR Super-Resolution

  • paper_url: http://arxiv.org/abs/2312.04385
  • repo_url: None
  • paper_authors: Zejun Wu, Samuel W. Remedios, Blake E. Dewey, Aaron Carass, Jerry L. Prince
  • for: This paper aims to super-resolve anisotropic low-resolution (LR) magnetic resonance (MR) images using denoising diffusion probabilistic models (DDPMs).
  • methods: The proposed approach, called AniRes2D, combines DDPM with a residual prediction for 2D super-resolution (SR).
  • results: AniRes2D outperforms several other DDPM-based models in quantitative metrics, visual quality, and out-of-domain evaluation. Noise conditioning augmentation (NCA), explored as an alternative augmentation technique for DDPM-based SR models, was found to reduce performance.
    Abstract Anisotropic low-resolution (LR) magnetic resonance (MR) images are fast to obtain but hinder automated processing. We propose to use denoising diffusion probabilistic models (DDPMs) to super-resolve these 2D-acquired LR MR slices. This paper introduces AniRes2D, a novel approach combining DDPM with a residual prediction for 2D super-resolution (SR). Results demonstrate that AniRes2D outperforms several other DDPM-based models in quantitative metrics, visual quality, and out-of-domain evaluation. We use a trained AniRes2D to super-resolve 3D volumes slice by slice, where comparative quantitative results and reduced skull aliasing are achieved compared to a recent state-of-the-art self-supervised 3D super-resolution method. Furthermore, we explored the use of noise conditioning augmentation (NCA) as an alternative augmentation technique for DDPM-based SR models, but it was found to reduce performance. Our findings contribute valuable insights to the application of DDPMs for SR of anisotropic MR images.
    摘要 各向异性的低分辨率(LR)磁共振(MR)图像获取速度快,但会妨碍自动化处理。我们提议使用去噪扩散概率模型(DDPM)对这些以2D方式采集的低分辨率MR切片进行超分辨。本文介绍了AniRes2D,一种将DDPM与残差预测相结合的2D超分辨(SR)新方法。结果表明,AniRes2D在量化指标、视觉质量和域外评价中均优于其他多个基于DDPM的模型。此外,我们使用训练好的AniRes2D对3D体数据逐切片进行超分辨,相比最新的自监督3D超分辨方法取得了有竞争力的量化结果,并减少了颅骨混叠伪影。我们还探讨了噪声条件增强(NCA)作为基于DDPM的SR模型的替代增强技术,但发现它会降低性能。我们的发现为DDPM在各向异性MR图像超分辨中的应用提供了有价值的参考。

SingingHead: A Large-scale 4D Dataset for Singing Head Animation

  • paper_url: http://arxiv.org/abs/2312.04369
  • repo_url: https://github.com/wsj-sjtu/SingingHead
  • paper_authors: Sijing Wu, Yunhao Li, Weitian Zhang, Jun Jia, Yucheng Zhu, Yichao Yan, Guangtao Zhai
  • for: 这篇论文旨在提出一个高质量的大规模歌唱头数据集(SingingHead),以及一个统一的歌唱面部动画框架(UniSinger),以解决歌唱音频驱动的3D头部动画和2D肖像视频合成问题。
  • methods: 本文使用了高质量的歌唱头数据集,并提出了一种统一的歌唱面部动画框架,该框架可以同时解决3D头部动画和2D肖像视频合成问题。
  • results: 对比与当前最佳3D面部动画和2D肖像视频合成方法,实验结果表明,歌唱音频驱动的3D头部动画和2D肖像视频合成问题可以通过使用高质量的歌唱头数据集和统一的面部动画框架得到有效解决。
    Abstract Singing, as a common facial movement second only to talking, can be regarded as a universal language across ethnicities and cultures, plays an important role in emotional communication, art, and entertainment. However, it is often overlooked in the field of audio-driven facial animation due to the lack of singing head datasets and the domain gap between singing and talking in rhythm and amplitude. To this end, we collect a high-quality large-scale singing head dataset, SingingHead, which consists of more than 27 hours of synchronized singing video, 3D facial motion, singing audio, and background music from 76 individuals and 8 types of music. Along with the SingingHead dataset, we argue that 3D and 2D facial animation tasks can be solved together, and propose a unified singing facial animation framework named UniSinger to achieve both singing audio-driven 3D singing head animation and 2D singing portrait video synthesis. Extensive comparative experiments with both SOTA 3D facial animation and 2D portrait animation methods demonstrate the necessity of singing-specific datasets in singing head animation tasks and the promising performance of our unified facial animation framework.
    摘要 唱歌,作为人类面部运动的常见第二种(只次于说话),在不同的民族和文化中具有通用语言的作用,对情感通信、艺术和娱乐具有重要作用。然而,在受音频驱动的面部动画领域中,唱歌却经常被忽略,这主要归结于唱歌数据集的缺乏和说话和唱歌的频谱和幅度域的域外差。为此,我们收集了高质量大规模的唱歌头部数据集,即SingingHead,该数据集包括76名个体和8种音乐的27小时以上同步唱歌视频、3D面部动作、唱歌音频和背景音乐。此外,我们认为3D和2D面部动画任务可以并行解决,并提出了一个统一的唱歌面部动画框架名为UniSinger,以实现受音频驱动的3D唱歌头部动画和2D唱歌肖像视频生成。我们对两者进行了广泛的比较实验,结果表明唱歌特定数据集的重要性和我们的统一动画框架的承诺性。

DemoCaricature: Democratising Caricature Generation with a Rough Sketch

  • paper_url: http://arxiv.org/abs/2312.04364
  • repo_url: None
  • paper_authors: Dar-Yen Chen, Subhadeep Koley, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song
  • for: 这篇论文是为了推动个人化卡通生成的民主化,让人们只需提供一张照片和一个概念素描就能轻松地创作个性化卡通。
  • methods: 该论文提出了Explicit Rank-1 Model Editing和单图个性化,通过 selectively 应用细微修改来融合个体和风格的特征。
  • results: 该论文的目标不是取代艺术家,而是减少访问难度,让爱好者参与到艺术创作中来。
    Abstract In this paper, we democratise caricature generation, empowering individuals to effortlessly craft personalised caricatures with just a photo and a conceptual sketch. Our objective is to strike a delicate balance between abstraction and identity, while preserving the creativity and subjectivity inherent in a sketch. To achieve this, we present Explicit Rank-1 Model Editing alongside single-image personalisation, selectively applying nuanced edits to cross-attention layers for a seamless merge of identity and style. Additionally, we propose Random Mask Reconstruction to enhance robustness, directing the model to focus on distinctive identity and style features. Crucially, our aim is not to replace artists but to eliminate accessibility barriers, allowing enthusiasts to engage in the artistry.
    摘要 在这篇论文中,我们让卡通漫画生成变得大众化,使个人只需一张照片和一幅概念素描就能轻松创作个性化漫画。我们的目标是在抽象与个人身份之间取得微妙的平衡,同时保留素描中固有的创造力和主观性。为实现这一点,我们提出了显式Rank-1模型编辑并结合单图个性化,选择性地对交叉注意力层施加细微修改,以实现身份与风格的无缝融合。此外,我们还提出了随机掩码重建以增强鲁棒性,使模型更加关注具有辨识度的身份与风格特征。需要强调的是,我们的目标不是取代艺术家,而是消除参与门槛,让爱好者也能参与到这门艺术创作中。
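Explicit Rank-1 Model Editing rests on a simple linear-algebra operation: adding an outer-product update to a weight matrix so that one input direction is remapped while other directions are barely affected. The sketch below shows that generic mechanism under stated assumptions; it is not the paper's exact editing recipe for cross-attention layers.

```python
# Generic rank-1 weight edit: W' = W + (u - W v) v^T for a unit input direction v.
import torch

def rank1_edit(weight, v_in, u_out, alpha=1.0):
    """weight: (out, in); v_in: (in,); u_out: (out,). Returns the edited weight."""
    v = v_in / (v_in.norm() + 1e-8)
    current = weight @ v                          # what the layer currently does to v
    delta = alpha * torch.outer(u_out - current, v)
    return weight + delta

W = torch.randn(64, 32)
v, u = torch.randn(32), torch.randn(64)
W_edited = rank1_edit(W, v, u)
print(torch.allclose(W_edited @ (v / v.norm()), u, atol=1e-5))  # edited layer now maps v -> u
```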

Multi-View Unsupervised Image Generation with Cross Attention Guidance

  • paper_url: http://arxiv.org/abs/2312.04337
  • repo_url: None
  • paper_authors: Llukman Cerkezi, Aram Davtyan, Sepehr Sameni, Paolo Favaro
  • for: 本研究旨在提高novel view synthesis的可扩展性和鲁棒性,并解决由NeRF模型引起的数据精度问题。
  • methods: 本文提出了一种基于单类数据集的无监督训练方法,使用预训练的自注意力 transformer(DINOv2)来识别对象 pose,并使用扩展的pose-conditioned diffusion模型来生成 novel view。
  • results: 比较于先前的方法,MIRAGE模型在真实图像上的novel view synthesis task中表现出色,并且具有较高的鲁棒性和可扩展性,可以在多种Texture和几何形态下进行 Synthetic image生成。
    Abstract The growing interest in novel view synthesis, driven by Neural Radiance Field (NeRF) models, is hindered by scalability issues due to their reliance on precisely annotated multi-view images. Recent models address this by fine-tuning large text2image diffusion models on synthetic multi-view data. Despite robust zero-shot generalization, they may need post-processing and can face quality issues due to the synthetic-real domain gap. This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. With the help of pretrained self-supervised Vision Transformers (DINOv2), we identify object poses by clustering the dataset through comparing visibility and locations of specific object parts. The pose-conditioned diffusion model, trained on pose labels, and equipped with cross-frame attention at inference time ensures cross-view consistency, that is further aided by our novel hard-attention guidance. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images. Furthermore, MIRAGE is robust to diverse textures and geometries, as demonstrated with our experiments on synthetic images generated with pretrained Stable Diffusion.
    摘要 新视角合成的研究兴趣日益增长,这主要受神经辐射场(NeRF)模型的推动,但其对精确标注的多视图图像的依赖限制了可扩展性。近期方法通过在合成多视图数据上微调大型文生图扩散模型来应对这一问题;尽管它们具有较强的零样本泛化能力,但可能需要后处理,并因合成与真实域之间的差距而面临质量问题。为此,本文提出了一种在单类别数据集上无监督训练姿态条件扩散模型的新流程:借助预训练的自监督视觉Transformer(DINOv2),通过比较特定物体部位的可见性与位置对数据集聚类来识别对象姿态;以姿态标签训练的条件扩散模型在推理时配合跨帧注意力保证跨视角一致性,并由我们提出的硬注意力引导进一步加强。我们的模型MIRAGE在真实图像的新视角合成上超越了此前工作,并且对多样的纹理和几何形态具有鲁棒性,这一点在使用预训练 Stable Diffusion 生成的合成图像上的实验中得到了验证。

Towards a Perceptual Evaluation Framework for Lighting Estimation

  • paper_url: http://arxiv.org/abs/2312.04334
  • repo_url: None
  • paper_authors: Justine Giroux, Mohammad Reza Karimi Dastjerdi, Yannick Hold-Geoffroy, Javier Vazquez-Corral, Jean-François Lalonde
  • for: 研究人员想要了解人类对光照估算结果的偏好,以便评估未来的光照估算算法。
  • methods: 研究人员采用了一种控制的心理physical实验,让人类观察者选择自己喜欢的rendered场景,并用这些场景来分析现有的光照估算算法如何符合人类感觉。
  • results: 研究发现,现有的IQA指标从 literature中的大多数都不能正确地表达人类的偏好,但是通过学习现有的IQA指标的组合,可以更加准确地表达人类的偏好。这提供了一个新的感知框架,可以帮助评估未来的光照估算算法。
    Abstract Progress in lighting estimation is tracked by computing existing image quality assessment (IQA) metrics on images from standard datasets. While this may appear to be a reasonable approach, we demonstrate that doing so does not correlate to human preference when the estimated lighting is used to relight a virtual scene into a real photograph. To study this, we design a controlled psychophysical experiment where human observers must choose their preference amongst rendered scenes lit using a set of lighting estimation algorithms selected from the recent literature, and use it to analyse how these algorithms perform according to human perception. Then, we demonstrate that none of the most popular IQA metrics from the literature, taken individually, correctly represent human perception. Finally, we show that by learning a combination of existing IQA metrics, we can more accurately represent human preference. This provides a new perceptual framework to help evaluate future lighting estimation algorithms.
    摘要 照明估算的进展通常通过在标准数据集图像上计算现有的图像质量评价(IQA)指标来衡量。这看似合理,但我们证明,当估算出的照明被用于将虚拟场景重照明并合成到真实照片中时,这种做法并不与人类偏好相符。为了研究这一点,我们设计了一项受控的心理物理实验,让人类观察者在由近期文献中多种照明估算算法渲染的场景之间选择偏好,并据此分析这些算法在人类感知下的表现。随后,我们表明文献中最常用的IQA指标单独使用时都不能正确反映人类感知。最后,我们展示了通过学习现有IQA指标的组合,可以更准确地刻画人类偏好。这为评估未来的照明估算算法提供了一个新的感知框架。
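The paper's final step, learning a combination of existing IQA metrics that tracks human preference, can be approximated with a simple preference model fit to two-alternative-forced-choice (2AFC) trials. The sketch below uses placeholder metric differences and a logistic regression; the specific metrics and model family actually used are assumptions here.

```python
# Fit a learned combination of IQA metric differences to human 2AFC preferences (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# features[i] = IQA scores of rendering A minus rendering B for trial i, e.g.
# [d_PSNR, d_SSIM, d_LPIPS, ...]; labels[i] = 1 if observers preferred A, else 0.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 4))                     # placeholder for real metric differences
labels = (features @ np.array([0.6, 0.2, -0.9, 0.1]) + 0.1 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(features[:400], labels[:400])
print("held-out 2AFC agreement:", model.score(features[400:], labels[400:]))
print("learned metric weights:", model.coef_)            # how much each IQA metric matters
```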

A Multi-scale Information Integration Framework for Infrared and Visible Image Fusion

  • paper_url: http://arxiv.org/abs/2312.04328
  • repo_url: https://github.com/ssyangguang/mda
  • paper_authors: Guang Yang, Jie Li, Hanxiao Lei, Xinbo Gao
  • for: 这个研究的目的是提出一种基于多尺度双注意力(MDA)框架的红外和可见图像融合方法,以实现基于同一场景的多模态图像之间的信息融合。
  • methods: 该方法首先使用差分下采样块将源图像分解成三个级别,然后使用双注意力融合块将不同级别的特征进行融合,并生成空间和通道注意力地图。最后,使用差分重建块重建输出图像。损失函数包括图像级、特征级和小区级三部分,其中图像级和小区级两部分的计算基于特征融合后生成的优先级权重。此外,为了限制输出图像和红外图像像素的 INTENSITY 分布,还添加了一个风格损失。
  • results: 对于两个 dataset 的融合结果,研究发现我们的方法可以保持红外射频信息和可见信息的同时保持高质量,并与其他当前领域的状态对照方法相当。我们还进行了精度和量化的实验,以及缺省实验,以证明我们的信息整合建筑和损失函数中的优先级权重适应性测量的效果。
    Abstract Infrared and visible image fusion aims at generating a fused image containing the intensity and detail information of source images, and the key issue is effectively measuring and integrating the complementary information of multi-modality images from the same scene. Existing methods mostly adopt a simple weight in the loss function to decide the information retention of each modality rather than adaptively measuring complementary information for different image pairs. In this study, we propose a multi-scale dual attention (MDA) framework for infrared and visible image fusion, which is designed to measure and integrate complementary information in both structure and loss function at the image and patch level. In our method, the residual downsample block decomposes source images into three scales first. Then, dual attention fusion block integrates complementary information and generates a spatial and channel attention map at each scale for feature fusion. Finally, the output image is reconstructed by the residual reconstruction block. Loss function consists of image-level, feature-level and patch-level three parts, of which the calculation of the image-level and patch-level two parts are based on the weights generated by the complementary information measurement. Indeed, to constrain the pixel intensity distribution between the output and infrared image, a style loss is added. Our fusion results perform robust and informative across different scenarios. Qualitative and quantitative results on two datasets illustrate that our method is able to preserve both thermal radiation and detailed information from two modalities and achieve comparable results compared with the other state-of-the-art methods. Ablation experiments show the effectiveness of our information integration architecture and adaptively measure complementary information retention in the loss function.
    摘要 infrared和可见图像融合的目标是生成包含源图像的Intensity和细节信息的融合图像,关键问题是有效地度量和集成多模态图像从同一场景中的补偿信息。现有方法大多采用简单的权重在损失函数中决定每个模态的信息保留,而不是动态度量不同图像对的补偿信息。在本研究中,我们提出了多尺度双注意(MDA)框架 для infrared和可见图像融合,该框架是用来度量和集成多模态图像的补偿信息的。我们的方法首先使用residual下采样块将源图像 decomposes into three scales。然后,双注意 fusions 块将补偿信息集成并生成每层的空间和通道注意力图,用于特征融合。最后,输出图像由 residual reconstruction block重建。损失函数包括图像水平、特征水平和 patch 水平三部分,其中图像水平和 patch 水平两部分的计算基于生成的补偿信息量。此外,为了限制输出图像和infrared图像的像素 INTENSITY 分布,我们添加了一个风格损失。我们的融合结果在不同场景中具有良好的Robustness和信息量。Qualitative和量化结果表明我们的方法能够保留两种模态的热辐射和细节信息,并与其他状态对比的方法相当。ablation эксперименты表明我们的信息集成体系和 adaptively 度量补偿信息保留在损失函数中的效果。

iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design

  • paper_url: http://arxiv.org/abs/2312.04326
  • repo_url: None
  • paper_authors: Ruyi Gan, Xiaojun Wu, Junyu Lu, Yuanhe Tian, Dixiang Zhang, Ziwei Wu, Renliang Sun, Chang Liu, Jiaxing Zhang, Pingjian Zhang, Yan Song
  • for: 文章目标是提出一种特化于室内设计的文本至图像模型,以满足设计师的需求。
  • methods: 该模型使用基于CLIP开源模型的自适应扩展和迭代强化学习来提高提示跟随性能。
  • results: 实验结果表明,该方法可以准确地生成高质量图像,并超越强基线。
    Abstract With the open-sourcing of text-to-image models (T2I) such as stable diffusion (SD) and stable diffusion XL (SD-XL), there is an influx of models fine-tuned in specific domains based on the open-source SD model, such as in anime, character portraits, etc. However, there are few specialized models in certain domains, such as interior design, which is attributed to the complex textual descriptions and detailed visual elements inherent in design, alongside the necessity for adaptable resolution. Therefore, text-to-image models for interior design are required to have outstanding prompt-following capabilities, as well as iterative collaboration with design professionals to achieve the desired outcome. In this paper, we collect and optimize text-image data in the design field and continue training in both English and Chinese on the basis of the open-source CLIP model. We also proposed a fine-tuning strategy with curriculum learning and reinforcement learning from CLIP feedback to enhance the prompt-following capabilities of our approach so as to improve the quality of image generation. The experimental results on the collected dataset demonstrate the effectiveness of the proposed approach, which achieves impressive results and outperforms strong baselines.
    摘要 开源文本到图像模型(T2I)如稳定扩散(SD)和稳定扩散XL(SD-XL)的开源,导致各个领域的模型进行精细定制,如动漫、人物肖像等。然而,有些领域,如内部设计,尚未有专门的模型,这可能与文本描述的复杂性和设计中的细节视觉元素有关,以及需要可变分辨率。因此,内部设计中的文本到图像模型需要出色的提示遵循能力,以及与设计专业人员之间的迭代协作,以实现愿景。在这篇论文中,我们收集和优化设计领域的文本图像数据,并在英文和中文基础上继续训练开源CLIP模型。我们还提出了一种 fine-tuning 策略,利用课程学习和反馈学习从CLIP反馈来提高我们方法的提示遵循能力,以提高图像生成质量。实验结果表明,我们提posed的方法在收集的数据集上具有惊人的效果,并超越了强基线。

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

  • paper_url: http://arxiv.org/abs/2312.04314
  • repo_url: https://github.com/gpt4vision/gpt4sgg
  • paper_authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen
  • for: 提高Scene Graph Generation(SGG)的精度和完整性,使用自然语言描述来学习Scene Graph。
  • methods: 提出了一个简单 yet effective的框架,即GPT4SGG,通过使用全局和区域特定的narative来生成Scene Graph。
  • results: GPT4SGG可以显著提高基于图像caption数据的SGG模型的性能。
    Abstract Learning scene graphs from natural language descriptions has proven to be a cheap and promising scheme for Scene Graph Generation (SGG). However, such unstructured caption data and its processing are troubling the learning an acurrate and complete scene graph. This dilema can be summarized as three points. First, traditional language parsers often fail to extract meaningful relationship triplets from caption data. Second, grounding unlocalized objects in parsed triplets will meet ambiguity in visual-language alignment. Last, caption data typically are sparse and exhibit bias to partial observations of image content. These three issues make it hard for the model to generate comprehensive and accurate scene graphs. To fill this gap, we propose a simple yet effective framework, GPT4SGG, to synthesize scene graphs from holistic and region-specific narratives. The framework discards traditional language parser, and localize objects before obtaining relationship triplets. To obtain relationship triplets, holistic and dense region-specific narratives are generated from the image. With such textual representation of image data and a task-specific prompt, an LLM, particularly GPT-4, directly synthesizes a scene graph as "pseudo labels". Experimental results showcase GPT4SGG significantly improves the performance of SGG models trained on image-caption data. We believe this pioneering work can motivate further research into mining the visual reasoning capabilities of LLMs.
    摘要 Learning scene graphs from natural-language captions faces three difficulties: (1) traditional language parsers often fail to extract meaningful relationship triplets from caption data; (2) grounding unlocalized objects in parsed triplets can lead to ambiguity in visual-language alignment; and (3) caption data is typically sparse and biased towards partial observations of image content. To address these issues, we propose a simple yet effective framework, GPT4SGG, to synthesize scene graphs from holistic and region-specific narratives. Our framework discards traditional language parsers and localizes objects before obtaining relationship triplets. We generate holistic and dense region-specific narratives from the image, and use a task-specific prompt and an LLM, particularly GPT-4, to directly synthesize a scene graph as "pseudo labels". Experimental results show that GPT4SGG significantly improves the performance of SGG models trained on image-caption data. We believe this pioneering work can motivate further research into mining the visual reasoning capabilities of LLMs.
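A minimal sketch of the pseudo-labelling loop: the captions are packed into a task prompt, the LLM answers with triplets, and the parsed triplets become supervision. The prompt wording and JSON schema are assumptions; the real prompts live in the linked repository.

```python
# Build a scene-graph prompt from holistic + region captions and parse the LLM's JSON reply.
import json

def build_prompt(holistic_caption, region_captions):
    regions = "\n".join(f"- object {i} at {box}: {cap}"
                        for i, (box, cap) in enumerate(region_captions))
    return (
        "You are given descriptions of one image.\n"
        f"Global description: {holistic_caption}\n"
        f"Region descriptions:\n{regions}\n"
        'Return a JSON list of triplets like {"subject": 0, "predicate": "riding", "object": 1}, '
        "using the object indices above."
    )

def synthesize_scene_graph(llm_call, holistic_caption, region_captions):
    """llm_call: any function str -> str (e.g., a wrapper around a GPT-4 chat request)."""
    raw = llm_call(build_prompt(holistic_caption, region_captions))
    try:
        triplets = json.loads(raw)
    except json.JSONDecodeError:
        triplets = []                      # discard malformed pseudo labels
    return [(t["subject"], t["predicate"], t["object"]) for t in triplets
            if isinstance(t, dict) and {"subject", "predicate", "object"} <= set(t)]

# Toy usage with a stubbed LLM:
stub = lambda prompt: '[{"subject": 0, "predicate": "next to", "object": 1}]'
print(synthesize_scene_graph(stub, "a man beside a bike",
                             [((10, 20, 50, 80), "a man"), ((60, 30, 90, 85), "a bike")]))
```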

Cross-codex Learning for Reliable Scribe Identification in Medieval Manuscripts

  • paper_url: http://arxiv.org/abs/2312.04296
  • repo_url: None
  • paper_authors: Julius Weißmann, Markus Seidl, Anya Dietrich, Martin Haltrich
  • for: 这篇论文的目的是提出一种基于CNN、与文本内容无关的离线抄写员识别方法,以克服对单一手抄本(codex)的依赖问题。
  • methods: 该方法使用了预处理、不同的神经网络架构和拒绝选项来提高分类结果的准确率和稳定性。
  • results: 研究发现,使用遮盖灰度图像进行预处理可以明显提高分类结果的 F1 分数,而不同的神经网络架构在不同的时间和精度上有所不同,AlexNet 是最佳的权衡点。此外,通过实施拒绝选项,可以进一步提高 CNN 输出的稳定性。研究使用了大规模的开源数据集 —— Codex Claustroneoburgensis 数据库(CCl-DB)——并在多部不同的手抄本(codex)上进行了详细的分析和比较。
    Abstract Historic scribe identification is a substantial task for obtaining information about the past. Uniform script styles, such as the Carolingian minuscule, make it a difficult task for classification to focus on meaningful features. Therefore, we demonstrate in this paper the importance of cross-codex training data for CNN based text-independent off-line scribe identification, to overcome codex dependent overfitting. We report three main findings: First, we found that preprocessing with masked grayscale images instead of RGB images clearly increased the F1-score of the classification results. Second, we trained different neural networks on our complex data, validating time and accuracy differences in order to define the most reliable network architecture. With AlexNet, the network with the best trade-off between F1-score and time, we achieved for individual classes F1-scores of up to 0,96 on line level and up to 1.0 on page level in classification. Third, we could replicate the finding that the CNN output can be further improved by implementing a reject option, giving more stable results. We present the results on our large scale open source dataset -- the Codex Claustroneoburgensis database (CCl-DB) -- containing a significant number of writings from different scribes in several codices. We demonstrate for the first time on a dataset with such a variety of codices that paleographic decisions can be reproduced automatically and precisely with CNNs. This gives manifold new and fast possibilities for paleographers to gain insights into unlabeled material, but also to develop further hypotheses.
    摘要 历史抄写员识别是获取过去信息的一项重要任务。卡洛林小写等统一的字体风格,使分类模型难以聚焦于有意义的特征。因此,本文展示了跨手抄本(codex)训练数据对于基于CNN、与文本无关的离线抄写员识别的重要性,以克服对单一手抄本的过拟合。我们报告了三项主要发现:首先,使用掩膜灰度图像代替RGB图像进行预处理,明显提高了分类结果的F1分数。其次,我们在复杂数据上训练了不同的神经网络,比较其时间与精度差异,以确定最可靠的网络架构;使用在F1分数与耗时之间权衡最佳的AlexNet,我们在单个类别上取得了行级最高0.96、页级最高1.0的F1分数。第三,我们验证了通过引入拒绝选项可以进一步改进CNN输出,得到更稳定的结果。我们在大规模开源数据集——Codex Claustroneoburgensis数据库(CCl-DB)——上给出结果,该数据集包含多部手抄本中多位抄写员的大量书写。我们首次在具有如此多样手抄本的数据集上证明,古文字学判断可以通过CNN自动且精确地复现。这为古文字学家快速洞察未标注材料并提出新的假设提供了多种新的可能。
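The reject option mentioned above is a standard confidence threshold on the classifier output; a minimal sketch (the threshold value is illustrative):

```python
# Accept a scribe prediction only when the softmax confidence is high enough.
import numpy as np

def classify_with_reject(logits, threshold=0.8):
    """logits: (N, num_scribes). Returns the predicted class or -1 (reject) per sample."""
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = -1      # -1 marks "unsure, flag for manual review"
    return preds

print(classify_with_reject(np.array([[4.0, 0.5, 0.2], [1.1, 1.0, 0.9]])))  # -> [0, -1]
```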

GPT-4V with Emotion: A Zero-shot Benchmark for Multimodal Emotion Understanding

  • paper_url: http://arxiv.org/abs/2312.04293
  • repo_url: https://github.com/zeroqiaoba/gpt4v-emotion
  • paper_authors: Zheng Lian, Licai Sun, Haiyang Sun, Kang Chen, Zhuofan Wen, Hao Gu, Shun Chen, Bin Liu, Jianhua Tao
  • for: 这 paper 的主要目的是评估 GPT-4V 在多模态情感理解方面的能力,包括面部情绪识别、视觉情感分析、微表情识别、动态面部情绪识别和多模态情感识别等任务。
  • methods: 这 paper 使用 GPT-4V 进行多模态情感理解任务的评估,包括使用预训练模型和零例学习策略等方法。
  • results: 实验结果表明 GPT-4V 在多模态情感理解任务中表现出色,甚至超过了一些指导学习系统。然而,GPT-4V 在微表情识别任务中表现较差,这可能是因为这个任务需要专门的专业知识。
    Abstract Recently, GPT-4 with Vision (GPT-4V) has shown remarkable performance across various multimodal tasks. However, its efficacy in emotion recognition remains a question. This paper quantitatively evaluates GPT-4V's capabilities in multimodal emotion understanding, encompassing tasks such as facial emotion recognition, visual sentiment analysis, micro-expression recognition, dynamic facial emotion recognition, and multimodal emotion recognition. Our experiments show that GPT-4V exhibits impressive multimodal and temporal understanding capabilities, even surpassing supervised systems in some tasks. Despite these achievements, GPT-4V is currently tailored for general domains. It performs poorly in micro-expression recognition that requires specialized expertise. The main purpose of this paper is to present quantitative results of GPT-4V on emotion understanding and establish a zero-shot benchmark for future research. Code and evaluation results are available at: https://github.com/zeroQiaoba/gpt4v-emotion.
    摘要 最近,GPT-4 with Vision(GPT-4V)在多modal任务上表现出色,但其情感认知能力仍然存在问题。这篇论文量化评估GPT-4V在多modal情感理解方面的能力,包括facial emotion recognition、视觉情感分析、微表情识别、动态facial emotion recognition以及多modal情感理解。我们的实验表明GPT-4V在多modal和时间上具有出色的理解能力,甚至超过了指导系统在某些任务中的表现。虽然GPT-4V在某些任务中表现出色,但目前它仍然需要特殊知识来完成微表情识别任务。本文的主要目的是公布GPT-4V在情感理解方面的量化结果,并建立未来研究的零开发基准。代码和评估结果可以在:https://github.com/zeroQiaoba/gpt4v-emotion上获取。

Activity Grammars for Temporal Action Segmentation

  • paper_url: http://arxiv.org/abs/2312.04266
  • repo_url: https://github.com/gongda0e/kari
  • paper_authors: Dayoung Gong, Joonseok Lee, Deunsol Jung, Suha Kwak, Minsu Cho
  • for: 这篇论文的目的是提出一种有效的动作语法来引导神经网络预测时间动作分割。
  • methods: 这篇论文提出了一种新的语法归纳算法,可以从动作序列数据中提取出强大的上下文无关文法。同时,我们还开发了一种高效的通用解析器,可以将帧级概率分布转化为可靠的动作序列。
  • results: 实验结果表明,我们的方法可以显著改善时间动作分割的性能和可读性,在两个标准的评测 dataset(Breakfast和50 Salads)上。
    Abstract Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.04265
  • repo_url: None
  • paper_authors: Zhixiang Wei, Lin Chen, Yi Jin, Xiaoxiao Ma, Tianle Liu, Pengyang Lin, Ben Wang, Huaian Chen, Jinjin Zheng
  • for: 这篇论文主要用于域通用语义分割(DGSS)领域,利用不同的视觉基础模型(VFM),以提高分割器的总体化能力。
  • methods: 该论文提出了一种robust fine-tuning方法,即Rein,用于快速和效率地利用VFM进行DGSS任务。Rein使用一组可训练的token,每个token与不同的实例相关联,并在每层的后向传播中精度地调整和传递特征图。这种方法能够减少trainable参数的数量,并且能够有效地调整VFM进行DGSS任务。
  • results: 对于多种设置,Rein 的性能与最先进方法相当或更高。具体来说,在 Cityscapes 数据集上,仅额外增加 1% 的可训练参数,Rein 就能达到 68.1% 的 mIoU,且无需访问任何真实城市场景数据集。
    Abstract In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation that Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves a mIoU of 68.1% on the Cityscapes, without accessing any real urban-scene datasets.
    摘要 在这篇论文中,我们首先在领域泛化语义分割(DGSS)任务上评估并利用多种视觉基础模型(VFM)。出于“利用更强的预训练模型和更少的可训练参数以获得更好泛化能力”的动机,我们提出了一种稳健的微调方法Rein,以参数高效的方式将VFM用于DGSS。Rein基于一组可训练的token,每个token与不同的实例相关联,在主干网络中精确地细化各层特征图并传递到下一层,从而为单张图像中的不同类别生成多样的细化。与全参数微调相比,Rein以更少的可训练参数高效地微调VFM用于DGSS任务,甚至超越了全参数微调。大量实验表明,Rein在多种设定下显著优于现有方法。仅在冻结主干网络(backbone)之外额外增加约1%的可训练参数,Rein即可在Cityscapes上达到68.1%的mIoU,且无需访问任何真实城市场景数据。
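A hedged sketch of a Rein-style refinement layer: a small set of learnable tokens attends to the frozen backbone's patch features after each block and adds a low-rank, instance-aware correction. The shapes, the low-rank projection, and the attention form are assumptions rather than the paper's exact formulation.

```python
# Token-based, parameter-efficient refinement inserted between frozen backbone layers.
import torch
import torch.nn as nn

class ReinLayer(nn.Module):
    def __init__(self, dim=768, num_tokens=100, rank=16):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        # low-rank projection keeps the number of trainable parameters small
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, feats):                            # feats: (B, N, dim) frozen features
        attn = torch.softmax(feats @ self.tokens.T / feats.shape[-1] ** 0.5, dim=-1)
        delta = attn @ self.up(self.down(self.tokens))    # (B, N, dim) instance-aware refinement
        return feats + delta                              # refined features go to the next layer

# Applied between the frozen blocks of a VFM backbone (loop is a placeholder for the ViT):
rein_layers = nn.ModuleList(ReinLayer() for _ in range(12))
x = torch.randn(2, 196, 768)
for i in range(12):
    x = rein_layers[i](x)
```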

TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

  • paper_url: http://arxiv.org/abs/2312.04248
  • repo_url: https://github.com/zhangxuying1004/TeMO
  • paper_authors: Xuying Zhang, Bo-Wen Yin, Yuming Chen, Zheng Lin, Yunheng Li, Qibin Hou, Ming-Ming Cheng
  • for: 该文章面向多物体3D场景的文本驱动3D风格化任务。
  • methods: 文章提出了一种名为 TeMO 的新框架,利用解耦图注意力(DGA)模块和跨粒度对比(CGC)监督系统,解析多物体3D场景,并在多级对比监督下编辑其风格。
  • results: 文章表明,所提方法可以生成高质量的风格化内容,并在多种多物体3D网格上超越现有方法。
    Abstract Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly consist of an object. Meanwhile, the local details of multiple objects may be susceptible to omission due to the existing supervision manner primarily relying on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under the contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. Particularly, a cross-modal graph is constructed to align the object points accurately and noun phrases decoupled from the 3D mesh and textual description. Then, we develop a Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss between the words in the textual description and the randomly rendered images are constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available
    摘要 近期,文本驱动的3D样式化单个物体的进步很大,主要归功于基于CLIP的方法。然而,多个物体3D场景的样式化仍然受阻,因为用于预训练CLIP的图文对的大多数是单个物体。此外,现有的监督方式很可能会因多个物体的地方细节而产生欠拟合。为了解决这些挑战,我们提出了一种新的框架,名为TeMO,用于分析多个物体3D场景并在多级对比拟合下编辑样式。我们首先提出了一种分离Graph Attention(DGA)模块,用于强化3D表面点的特征。具体来说,我们构建了一个交叉模式图,以便准确对齐对象点并分离出3D网格和文本描述。然后,我们开发了一种跨模式对比(CGC)监督系统,其中文本描述中的单词和随机渲染的图像之间的细致对比loss被建立,以补做了粗略对比loss。我们的方法可以生成高质量的样式化内容,并在多种多样的多个物体3D网格上超越现有方法。我们的代码和结果将公开发布。

Fine-tune vision foundation model for crack segmentation in civil infrastructures

  • paper_url: http://arxiv.org/abs/2312.04233
  • repo_url: None
  • paper_authors: Kang Ge, Chen Wang, Yutao Guo
  • for: 这个研究旨在应用大规模基础模型来进行裂缝标注。
  • methods: 这个研究使用了两种 Parameter-efficient fine-tuning 方法,namely adapter 和 low-rank adaptation,以进行基础模型 Segment Anything Model (SAM) 的精确化。
  • results: 对比实验显示,CrackSAM 模型在不同的测试数据集上表现出色,特别是在此前未见过的数据集上。这些跨场景结果表明了基础模型出色的零样本能力,并为土木工程中视觉模型的发展提供了新的思路。
    Abstract Large-scale foundation models have become the mainstream method in the field of deep learning, while in civil engineering, the scale of AI models is strictly limited. In this work, vision foundation model is introduced for crack segmentation. Two Parameter-efficient fine-tuning methods, adapter and low-rank adaptation, are adopted to fine-tune the foundation model in the field of semantic segmentation: Segment Anything Model (SAM). The fine-tuned model CrackSAM is much larger than all the existing crack segmentation models, but shows excellent performance. To test the zero-shot performance of the proposed method, two unique datasets related to road and exterior wall cracks are collected, annotated and open-sourced, in total 810 images. Comparative experiments are conducted with twelve mature semantic segmentation models. On datasets with artificial noise and previously unseen datasets, the performance of CrackSAM far exceeds that of all state-of-the-art models. CrackSAM exhibits remarkable superiority, particularly in challenging conditions such as dim lighting, shadows, road markings, construction joints, and other interference factors. Such cross-scenario results demonstrate the outstanding zero-shot capability of foundation models, and provide new ideas for the development of vision models in civil engineering.
    摘要 大规模基础模型已成为深度学习领域的主流方法,而在土木工程领域,AI模型的规模受到严格限制。在这项工作中,我们将视觉基础模型引入裂缝分割任务。我们采用两种参数高效微调方法——适配器(adapter)和低秩适配(LoRA)——对基础模型Segment Anything Model(SAM)进行微调。微调后的模型CrackSAM远大于所有现有的裂缝分割模型,但表现出色。为测试所提方法的零样本性能,我们收集、标注并开源了两个分别涉及道路和外墙裂缝的独特数据集,共计810张图像,并与12种成熟的语义分割模型进行了对比实验。在带有人工噪声的数据集和此前未见过的数据集上,CrackSAM的表现远超所有最先进模型;尤其是在弱光、阴影、路面标线、施工接缝等各种干扰因素下,CrackSAM表现出显著优势。这些跨场景结果体现了基础模型出色的零样本能力,并为土木工程中视觉模型的发展提供了新的思路。
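Of the two parameter-efficient strategies named above, low-rank adaptation is the easier to sketch. The wrapper below freezes a base linear layer and learns only a low-rank update; the rank, scaling, and choice of which SAM layers to wrap are assumptions, not the paper's exact setup.

```python
# Minimal LoRA wrapper: y = Wx + (alpha / r) * B A x, with W frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep the foundation-model weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no update at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Example: wrap one attention projection of a (placeholder) frozen encoder.
frozen_qkv = nn.Linear(768, 768 * 3)
lora_qkv = LoRALinear(frozen_qkv)
out = lora_qkv(torch.randn(4, 196, 768))
trainable = sum(p.numel() for p in lora_qkv.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")         # tiny fraction of the frozen backbone
```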

TLCE: Transfer-Learning Based Classifier Ensembles for Few-Shot Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2312.04225
  • repo_url: None
  • paper_authors: Shuangmei Wang, Yang Cao, Tieru Wu
  • for: 提高小样本类增量学习(FSCIL)的表现,解决模型在仅有少量新类样本时遗忘旧类或对新类过拟合的问题。
  • methods: 提出TLCE方法,通过多个预训练模型的集成来提高新类和旧类之间的分离度,并通过 episodic 训练将旧类图像映射到准正交原型(quasi-orthogonal prototypes),以避免旧类和新类之间的干扰。
  • results: 经验表明,TLCE方法可以在多个数据集上超越当前最先进的方法,并在数据不均衡的情况下也能够达到良好的表现。
    Abstract Few-shot class-incremental learning (FSCIL) struggles to incrementally recognize novel classes from few examples without catastrophic forgetting of old classes or overfitting to new classes. We propose TLCE, which ensembles multiple pre-trained models to improve separation of novel and old classes. TLCE minimizes interference between old and new classes by mapping old class images to quasi-orthogonal prototypes using episodic training. It then ensembles diverse pre-trained models to better adapt to novel classes despite data imbalance. Extensive experiments on various datasets demonstrate that our transfer learning ensemble approach outperforms state-of-the-art FSCIL methods.
    摘要 小样本类增量学习(Few-shot class-incremental learning,FSCIL)需要在仅有少量样本的情况下增量地识别新类,同时避免对旧类的灾难性遗忘和对新类的过拟合。我们提出了TLCE,它通过 episodic 训练将旧类图像映射到准正交原型(quasi-orthogonal prototypes),从而减少旧类和新类之间的干扰;随后集成多个不同的预训练模型,以在数据不均衡的情况下更好地适应新类。广泛的实验表明,我们的迁移学习集成方法在多个数据集上超越了当前最先进的FSCIL方法。

Guided Reconstruction with Conditioned Diffusion Models for Unsupervised Anomaly Detection in Brain MRIs

  • paper_url: http://arxiv.org/abs/2312.04215
  • repo_url: https://github.com/finnbehrendt/conditioned-diffusion-models-uad
  • paper_authors: Finn Behrendt, Debayan Bhattacharya, Robin Mieling, Lennart Maack, Julia Krüger, Roland Opfer, Alexander Schlaefer
  • for: 本研究旨在提高脑部MRI中无监督异常检测的性能,通过对扩散模型的去噪机制进行条件约束,在保持健康训练分布的同时实现更准确的重建。
  • methods: 本研究使用了diffusion模型,并通过conditioning机制来保持输入图像的准确重建和地方光谱特征。
  • results: 研究表明,通过conditioning机制可以提高异常检测性能,降低假阳性预测,并在不同MRI获得和模拟对比下显示了适用性。
    Abstract Unsupervised anomaly detection in Brain MRIs aims to identify abnormalities as outliers from a healthy training distribution. Reconstruction-based approaches that use generative models to learn to reconstruct healthy brain anatomy are commonly used for this task. Diffusion models are an emerging class of deep generative models that show great potential regarding reconstruction fidelity. However, they face challenges in preserving intensity characteristics in the reconstructed images, limiting their performance in anomaly detection. To address this challenge, we propose to condition the denoising mechanism of diffusion models with additional information about the image to reconstruct coming from a latent representation of the noise-free input image. This conditioning enables high-fidelity reconstruction of healthy brain structures while aligning local intensity characteristics of input-reconstruction pairs. We evaluate our method's reconstruction quality, domain adaptation features and finally segmentation performance on publicly available data sets with various pathologies. Using our proposed conditioning mechanism we can reduce the false-positive predictions and enable a more precise delineation of anomalies which significantly enhances the anomaly detection performance compared to established state-of-the-art approaches to unsupervised anomaly detection in brain MRI. Furthermore, our approach shows promise in domain adaptation across different MRI acquisitions and simulated contrasts, a crucial property of general anomaly detection methods.
    摘要 脑部MRI的无监督异常检测旨在将异常识别为偏离健康训练分布的离群点。基于重建的方法通常使用生成模型学习重建健康的大脑解剖结构。扩散模型是一类新兴的深度生成模型,在重建保真度方面展现出很大潜力;然而,它们难以在重建图像中保持强度特征,从而限制了其在异常检测中的表现。为解决这一挑战,我们提议利用来自无噪输入图像潜在表示的附加信息,对扩散模型的去噪机制进行条件约束。这种条件化使得模型能够高保真地重建健康大脑结构,同时使输入与重建图像对的局部强度特征保持一致。我们在包含多种病变的公开数据集上评估了该方法的重建质量、领域适应能力以及最终的分割性能。借助所提出的条件化机制,我们可以减少假阳性预测,并更精确地勾画异常区域,相比现有的脑MRI无监督异常检测方法显著提升了检测性能。此外,我们的方法在不同的MRI采集方式和模拟对比度之间表现出良好的领域适应性,这是通用异常检测方法的一项重要特性。
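Once the conditioned model reconstructs a pseudo-healthy image, detection reduces to thresholding the reconstruction residual. A short sketch under that assumption (the reconstruction call and threshold are placeholders):

```python
# Reconstruction-based anomaly detection: residual between input and pseudo-healthy output.
import numpy as np

def anomaly_map(brain_mri, reconstruct, smooth=None):
    """brain_mri: (H, W) or (D, H, W) intensity-normalized scan; reconstruct: model callable."""
    pseudo_healthy = reconstruct(brain_mri)            # conditioned denoising reconstruction
    residual = np.abs(brain_mri - pseudo_healthy)      # large where anatomy deviates from healthy
    if smooth is not None:
        residual = smooth(residual)                    # optional median/Gaussian filtering
    return residual

def detect(brain_mri, reconstruct, threshold=0.15):
    amap = anomaly_map(brain_mri, reconstruct)
    return amap > threshold                            # binary anomaly segmentation

# Toy usage with an identity-like "model" (a trained model would remove pathological intensities):
mask = detect(np.random.rand(128, 128), reconstruct=lambda x: np.clip(x, 0, 0.5))
```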

SAMBA: A Trainable Segmentation Web-App with Smart Labelling

  • paper_url: http://arxiv.org/abs/2312.04197
  • repo_url: None
  • paper_authors: Ronan Docherty, Isaac Squires, Antonis Vamvakeros, Samuel J. Cooper
  • for: 这篇论文主要是为了提出一种可视化、可交互训练的分割方法,以便在材料科学中进行各种统计分析任务。
  • methods: 该论文使用一种名为 SAMBA 的可训练分割工具,该工具使用 Meta 的 Segment Anything Model(SAM)生成高质量的标签建议,并使用随机森林分类器实现稳健、可泛化的分割。
  • results: 该论文通过使用 SAMBA 工具,实现了高质量、可复现的分割结果,并且可以在浏览器中访问(https://www.sambasegment.com/),不需要下载任何外部依赖项。
    Abstract Segmentation is the assigning of a semantic class to every pixel in an image and is a prerequisite for various statistical analysis tasks in materials science, like phase quantification, physics simulations or morphological characterization. The wide range of length scales, imaging techniques and materials studied in materials science means any segmentation algorithm must generalise to unseen data and support abstract, user-defined semantic classes. Trainable segmentation is a popular interactive segmentation paradigm where a classifier is trained to map from image features to user drawn labels. SAMBA is a trainable segmentation tool that uses Meta's Segment Anything Model (SAM) for fast, high-quality label suggestions and a random forest classifier for robust, generalizable segmentations. It is accessible in the browser (https://www.sambasegment.com/) without the need to download any external dependencies. The segmentation backend is run in the cloud, so does not require the user to have powerful hardware.
    摘要 分割是将每个图像像素分配到 semantic 类别的过程,是物理科学中各种统计分析任务的必要前提,如相态量化、物理模拟或形态特征化。由于物理科学中的图像、捕捉技术和材料种类的广泛性,任何分割算法都必须能够总结到未经见过的数据和支持抽象、用户定义的 semantic 类别。可交互式分割是一种受欢迎的分割方法,其中一个分类器将从图像特征图中映射到用户手动绘制的标签。SAMBA 是一种可交互式分割工具,它使用 Meta 的分割任何模型(SAM)来获得快速、高质量的标签建议,并使用随机森林分类器来实现可靠、普适的分割。它可以在浏览器中访问(https://www.sambasegment.com),不需要下载任何外部依赖项。分割服务器位于云端,因此不需要用户拥有高性能的硬件。
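The trainable-segmentation loop behind tools like SAMBA can be sketched with off-the-shelf components: per-pixel features plus a random forest fit to sparse user labels. The features below are deliberately simple placeholders; SAMBA itself additionally uses SAM for label suggestions and runs its backend in the cloud.

```python
# Interactive trainable segmentation: random forest on per-pixel features and user scribbles.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestClassifier

def pixel_features(image):
    """Stack simple per-pixel feature channels: raw intensity plus two Gaussian blurs."""
    return np.stack([image, gaussian_filter(image, 1), gaussian_filter(image, 4)], axis=-1)

def train_and_segment(image, label_mask):
    """label_mask: same shape as image, 0 = unlabelled, 1..K = user-drawn class labels."""
    feats = pixel_features(image)
    labelled = label_mask > 0
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(feats[labelled], label_mask[labelled])
    return clf.predict(feats.reshape(-1, feats.shape[-1])).reshape(image.shape)

image = np.random.rand(64, 64)
labels = np.zeros((64, 64), dtype=int)
labels[:5, :5], labels[-5:, -5:] = 1, 2                  # two tiny user scribbles, two classes
segmentation = train_and_segment(image, labels)
```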

Text as Image: Learning Transferable Adapter for Multi-Label Classification

  • paper_url: http://arxiv.org/abs/2312.04160
  • repo_url: None
  • paper_authors: Xuelin Zhu, Jiuxin Cao, Jian liu, Dongqi Tang, Furong Xu, Weijia Liu, Jiawei Ge, Bo Liu, Qingpei Guo, Tianyi Zhang
  • for: 这项研究旨在提高多标签图像分类任务的自动化程度,无需依赖任何人工标注数据。
  • methods: 研究利用视觉-语言预训练将图像和文本对齐到统一的嵌入空间,在文本模态下训练适配器网络来识别视觉模态中的标签,并提出随机扰动方法增强跨模态迁移能力;同时利用大语言模型生成多标签指令跟随文本。
  • results: 实验结果显示,与现有方法相比,该方法在多标签图像分类任务中取得了更高的性能,并可自动生成多标签文本描述。
    Abstract Pre-trained vision-language models have notably accelerated progress of open-world concept recognition. Their impressive zero-shot ability has recently been transferred to multi-label image classification via prompt tuning, enabling to discover novel labels in an open-vocabulary manner. However, this paradigm suffers from non-trivial training costs, and becomes computationally prohibitive for a large number of candidate labels. To address this issue, we note that vision-language pre-training aligns images and texts in a unified embedding space, making it potential for an adapter network to identify labels in visual modality while be trained in text modality. To enhance such cross-modal transfer ability, a simple yet effective method termed random perturbation is proposed, which enables the adapter to search for potential visual embeddings by perturbing text embeddings with noise during training, resulting in better performance in visual modality. Furthermore, we introduce an effective approach to employ large language models for multi-label instruction-following text generation. In this way, a fully automated pipeline for visual label recognition is developed without relying on any manual data. Extensive experiments on public benchmarks show the superiority of our method in various multi-label classification tasks.

EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer

  • paper_url: http://arxiv.org/abs/2312.04152
  • repo_url: https://github.com/vut-hfut/eulermormer
  • paper_authors: Fei Wang, Dan Guo, Kun Li, Meng Wang
  • for: Breaking the resolution limit of human visual perception to reveal imperceptible, subtle motion information in video.
  • methods: A novel dynamic filtering strategy, built into a Transformer, removes noise cues while preserving critical features during motion magnification.
  • results: Experiments show that EulerMormer produces more robust video motion magnification, clearly outperforming existing methods.
    Abstract Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort to first equip with Transformer in learning-based VMM. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. We demonstrate extensive experiments that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer.
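EulerMormer is a learned Transformer pipeline, but the underlying Eulerian idea can be illustrated with the classical linear formulation: temporally band-pass each pixel, amplify that band, and add it back. The sketch below shows only this classical baseline (cut-off frequencies and the amplification factor are illustrative), not the paper's dynamic filtering.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def linear_eulerian_magnification(frames: np.ndarray, fps: float,
                                  f_lo: float, f_hi: float,
                                  alpha: float = 20.0) -> np.ndarray:
    """frames: (T, H, W) grayscale video in [0, 1]; f_lo/f_hi in Hz."""
    b, a = butter(2, [f_lo, f_hi], btype="bandpass", fs=fps)
    # Temporal band-pass of every pixel isolates the subtle motion signal.
    band = filtfilt(b, a, frames, axis=0)
    # Amplify the band-passed signal and add it back to the input video.
    return np.clip(frames + alpha * band, 0.0, 1.0)
```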

Diffusing Colors: Image Colorization with Text Guided Diffusion

  • paper_url: http://arxiv.org/abs/2312.04145
  • repo_url: None
  • paper_authors: Nir Zabari, Aharon Azulay, Alexey Gorkor, Tavi Halperin, Ohad Fried
  • for: Addressing the complex and subjective problem of image colorization with a new framework that combines image diffusion techniques and granular text prompts.
  • methods: A pretrained generative diffusion model is finetuned for colorization and controlled through granular text prompts, and a CLIP-based ranking model evaluates color vividness so the most suitable vividness level can be selected automatically.
  • results: The method balances automation with user control and outperforms existing approaches in visual quality and semantic coherence, with particular promise for color enhancement and historical image colorization.
    Abstract The colorization of grayscale images is a complex and subjective task with significant challenges. Despite recent progress in employing large-scale datasets with deep neural networks, difficulties with controllability and visual quality persist. To tackle these issues, we present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts. This integration not only produces colorization outputs that are semantically appropriate but also greatly improves the level of control users have over the colorization process. Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence. We leverage a pretrained generative Diffusion Model, and show that we can finetune it for the colorization task without losing its generative power or attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable level of vividness based on the specific scene semantics. Our approach holds potential particularly for color enhancement and historical image colorization.

Towards 4D Human Video Stylization

  • paper_url: http://arxiv.org/abs/2312.04143
  • repo_url: https://github.com/tiantianwang/4d_video_stylization
  • paper_authors: Tiantian Wang, Xinxin Zuo, Fangzhou Mu, Jian Wang, Ming-Hsuan Yang
  • for: Providing a Neural Radiance Fields (NeRF) based method for 4D (3D plus time) human video stylization that unifies style transfer, novel view synthesis and human animation.
  • methods: Videos are represented with NeRFs and stylized in the rendered feature space, using a novel geometry-guided tri-plane representation to make the feature representation more robust.
  • results: Compared with existing approaches, the method strikes a better balance between stylized textures and temporal coherence, and generalizes to novel poses and viewpoints.
    Abstract We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis and human animation within a unified framework. While numerous video stylization methods have been developed, they are often restricted to rendering images in specific viewpoints of the input video, lacking the capability to generalize to novel views and novel poses in dynamic scenes. To overcome these limitations, we leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space. Our innovative approach involves the simultaneous representation of both the human subject and the surrounding scene using two NeRFs. This dual representation facilitates the animation of human subjects across various poses and novel viewpoints. Specifically, we introduce a novel geometry-guided tri-plane representation, significantly enhancing feature representation robustness compared to direct tri-plane optimization. Following the video reconstruction, stylization is performed within the NeRFs' rendered feature space. Extensive experiments demonstrate that the proposed method strikes a superior balance between stylized textures and temporal coherence, surpassing existing approaches. Furthermore, our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.

Polarimetric Light Transport Analysis for Specular Inter-reflection

  • paper_url: http://arxiv.org/abs/2312.04140
  • repo_url: None
  • paper_authors: Ryota Maeda, Shinsaku Hiura
  • for: Proposing a new decomposition method for handling specular inter-reflection on metallic surfaces.
  • methods: The rotation direction of linear polarization is used as a discriminative cue between direct reflection and inter-reflection on specular surfaces.
  • results: Experiments on synthetic and real data show the method effectively separates specular inter-reflections of metal objects into direct and inter-reflected components, can be combined with other decomposition methods for detailed light-transport analysis, and improves the accuracy of 3D measurement against strong specular inter-reflection.
    Abstract Polarization is well known for its ability to decompose diffuse and specular reflections. However, the existing decomposition methods only focus on direct reflection and overlook multiple reflections, especially specular inter-reflection. In this paper, we propose a novel decomposition method for handling specular inter-reflection of metal objects by using a unique polarimetric feature: the rotation direction of linear polarization. This rotation direction serves as a discriminative factor between direct and inter-reflection on specular surfaces. To decompose the reflectance components, we actively rotate the linear polarization of incident light and analyze the rotation direction of the reflected light. We evaluate our method using both synthetic and real data, demonstrating its effectiveness in decomposing specular inter-reflections of metal objects. Furthermore, we demonstrate that our method can be combined with other decomposition methods for a detailed analysis of light transport. As a practical application, we show its effectiveness in improving the accuracy of 3D measurement against strong specular inter-reflection.

Forensic Iris Image Synthesis

  • paper_url: http://arxiv.org/abs/2312.04125
  • repo_url: None
  • paper_authors: Rasel Ahmed Bhuiyan, Adam Czajka
  • for: Advancing the field of post-mortem iris recognition, an emerging application of iris-based human identification in a forensic setup.
  • methods: The paper proposes a conditional StyleGAN-based iris synthesis model to generate multiple within-class and between-class post-mortem iris images, compliant with ISO/IEC 29794-6, with decomposition deformations controlled by the requested PMI (post mortem interval).
  • results: The generated post-mortem iris images with controlled decomposition deformations can be used to enhance the existing, very sparse, post-mortem iris datasets and to improve the training effectiveness of professional forensic human examiners.
    Abstract Post-mortem iris recognition is an emerging application of iris-based human identification in a forensic setup, able to correctly identify deceased subjects even three weeks post-mortem. This technique thus is considered as an important component of future forensic toolkits. The current advancements in this field are seriously slowed down by exceptionally difficult data collection, which can happen in mortuary conditions, at crime scenes, or in ``body farm'' facilities. This paper makes a novel contribution to facilitate progress in post-mortem iris recognition by offering a conditional StyleGAN-based iris synthesis model, trained on the largest-available dataset of post-mortem iris samples acquired from more than 350 subjects, generating -- through appropriate exploration of StyleGAN latent space -- multiple within-class (same identity) and between-class (different new identities) post-mortem iris images, compliant with ISO/IEC 29794-6, and with decomposition deformations controlled by the requested PMI (post mortem interval). Besides an obvious application to enhance the existing, very sparse, post-mortem iris datasets to advance -- among others -- iris presentation attack endeavors, we anticipate it may be useful to generate samples that would expose professional forensic human examiners to never-seen-before deformations for various PMIs, increasing their training effectiveness. The source codes and model weights are made available with the paper.

A Multilevel Guidance-Exploration Network and Behavior-Scene Matching Method for Human Behavior Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.04119
  • repo_url: https://github.com/Ufere/Assingment_1
  • paper_authors: Guoqing Yang, Zhiming Luo, Jianzhe Gao, Yingxin Lai, Kun Yang, Yifan He, Shaozi Li
  • for: Detecting human behavior anomalies, with important applications in intelligent surveillance and related areas.
  • methods: A new Student-Teacher-inspired framework, the Multilevel Guidance-Exploration Network (MGENet), detects anomalies through differences in high-level representations, complemented by a Behavior-Scene Matching Module.
  • results: The method achieves state-of-the-art performance on the ShanghaiTech and UBnormal datasets, with AUCs of 86.9% and 73.5%, respectively.
    Abstract Human behavior anomaly detection aims to identify unusual human actions, playing a crucial role in intelligent surveillance and other areas. The current mainstream methods still adopt reconstruction or future frame prediction techniques. However, reconstructing or predicting low-level pixel features easily enables the network to achieve overly strong generalization ability, allowing anomalies to be reconstructed or predicted as effectively as normal data. Different from their methods, inspired by the Student-Teacher Network, we propose a novel framework called the Multilevel Guidance-Exploration Network(MGENet), which detects anomalies through the difference in high-level representation between the Guidance and Exploration network. Specifically, we first utilize the pre-trained Normalizing Flow that takes skeletal keypoints as input to guide an RGB encoder, which takes unmasked RGB frames as input, to explore motion latent features. Then, the RGB encoder guides the mask encoder, which takes masked RGB frames as input, to explore the latent appearance feature. Additionally, we design a Behavior-Scene Matching Module(BSMM) to detect scene-related behavioral anomalies. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance on ShanghaiTech and UBnormal datasets, with AUC of 86.9 % and 73.5 %, respectively. The code will be available on https://github.com/molu-ggg/GENet.

Instance Tracking in 3D Scenes from Egocentric Videos

  • paper_url: http://arxiv.org/abs/2312.04117
  • repo_url: https://github.com/it3dego/it3dego
  • paper_authors: Yunhan Zhao, Haoyu Ma, Shu Kong, Charless Fowlkes
  • for: Enabling egocentric sensors (e.g., AR/VR devices) that capture human-object interactions to provide task assistance, by tracking object instances in real-world 3D scenes from egocentric video (IT3DEgo).
  • methods: The authors introduce a new benchmark dataset with RGB and depth videos, per-frame camera poses, and instance-level annotations in both 2D camera and 3D world coordinates, together with an evaluation protocol that measures tracking in 3D under two enrollment settings: single-view online enrollment (instances specified on-the-fly from the wearer's interactions) and multi-view pre-enrollment (instance images stored in memory ahead of time).
  • results: Re-purposed single-object-tracking (SOT) baselines, which track instances in 2D frames and lift them to 3D using camera pose and depth, are compared against a simple method that matches proposals from pretrained segmentation and detection models to enrolled instance images; without any finetuning, the latter significantly outperforms the SOT-based approaches, suggesting that leveraging camera pose and a 3D allocentric (world) coordinate representation makes egocentric instance tracking easier.
    Abstract Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task-assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset, consisting of RGB and depth videos, per-frame camera pose, and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol which evaluates tracking performance in 3D coordinates with two settings for enrolling instances to track: (1) single-view online enrollment where an instance is specified on-the-fly based on the human wearer's interactions. and (2) multi-view pre-enrollment where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods from relevant areas, e.g., single object tracking (SOT) -- running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. Perhaps surprisingly, our extensive experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation.
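Lifting a 2D detection into 3D world coordinates, as the SOT-based baselines above do, is a standard pinhole backprojection once per-frame depth and camera pose are available. A minimal sketch, assuming a 3x3 intrinsics matrix and a 4x4 camera-to-world pose:

```python
import numpy as np

def pixel_to_world(u: float, v: float, depth: float,
                   K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Lift a pixel with metric depth into world coordinates.

    K: 3x3 camera intrinsics; T_world_cam: 4x4 camera-to-world pose for the frame.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing ray in camera frame
    p_cam = depth * ray                              # 3D point in camera coordinates
    p_world = T_world_cam @ np.append(p_cam, 1.0)    # homogeneous transform to world
    return p_world[:3]
```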

Multi-strategy Collaborative Optimized YOLOv5s and its Application in Distance Estimation

  • paper_url: http://arxiv.org/abs/2312.04113
  • repo_url: https://github.com/Ufere/Assingment_1
  • paper_authors: Zijian Shen, Zhenping Mu, Xiangxiang Li
  • for: Improving the accuracy of vehicle detection and distance estimation in automotive active-safety systems so that timely safety warnings can be provided.
  • methods: A new network model (YOLOv5s-SE) is built by replacing IoU with DIoU and embedding an SE attention module; distance is estimated using the principle of similar triangles, and nonparametric testing is used to give safety suggestions based on the estimated distance.
  • results: Simulation experiments show a 5.5% improvement in mAP and demonstrate that safety suggestions can be given based on the estimated distance information.
    Abstract The increasing accident rate brought about by the explosive growth of automobiles has made the research on active safety systems of automobiles increasingly important. The importance of improving the accuracy of vehicle target detection is self-evident. To achieve the goals of vehicle detection and distance estimation and provide safety warnings, a Distance Estimation Safety Warning System (DESWS) based on a new neural network model (YOLOv5s-SE) by replacing the IoU with DIoU, embedding SE attention module, and a distance estimation method through using the principle of similar triangles was proposed. In addition, a method that can give safety suggestions based on the estimated distance using nonparametric testing was presented in this work. Through the simulation experiment, it was verified that the mAP was improved by 5.5% and the purpose of giving safety suggestions based on the estimated distance information can be achieved.
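The distance-estimation step based on similar triangles reduces to a single expression once the camera focal length (in pixels) and an assumed physical object height are known. A minimal sketch with illustrative numbers:

```python
def estimate_distance(bbox_height_px: float, real_height_m: float,
                      focal_length_px: float) -> float:
    """Similar-triangles range estimate from a detected bounding box.

    real_height_m is the assumed physical height of the object class
    (e.g. ~1.5 m for a passenger car); focal_length_px comes from camera
    calibration. Both values here are illustrative assumptions.
    """
    return focal_length_px * real_height_m / bbox_height_px

# e.g. a 1.5 m tall car spanning 60 px with a 1000 px focal length -> 25 m
print(estimate_distance(60, 1.5, 1000.0))
```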

Identity-Obscured Neural Radiance Fields: Privacy-Preserving 3D Facial Reconstruction

  • paper_url: http://arxiv.org/abs/2312.04106
  • repo_url: None
  • paper_authors: Jiayi Kong, Baixin Xu, Xurui Song, Chen Qian, Jun Luo, Ying He
  • for: Protecting personal privacy by avoiding the exposure of sensitive facial data.
  • methods: Privacy-preserving, identity-obscured images are used to reconstruct 3D head geometry within the NeRF framework.
  • results: The method generates plausible head geometry without requiring sensitive facial data.
    Abstract Neural radiance fields (NeRF) typically require a complete set of images taken from multiple camera perspectives to accurately reconstruct geometric details. However, this approach raise significant privacy concerns in the context of facial reconstruction. The critical need for privacy protection often leads invidividuals to be reluctant in sharing their facial images, due to fears of potential misuse or security risks. Addressing these concerns, we propose a method that leverages privacy-preserving images for reconstructing 3D head geometry within the NeRF framework. Our method stands apart from traditional facial reconstruction techniques as it does not depend on RGB information from images containing sensitive facial data. Instead, it effectively generates plausible facial geometry using a series of identity-obscured inputs, thereby protecting facial privacy.

Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection

  • paper_url: http://arxiv.org/abs/2312.04095
  • repo_url: https://github.com/hnanhtuan/projected_gradient_unlearning
  • paper_authors: Tuan Hoang, Santu Rana, Sunil Gupta, Svetha Venkatesh
  • for: Machine unlearning, i.e., removing the effect of specific training samples from a trained model, for example to comply with new data-privacy laws.
  • methods: A projected-gradient learning method, Projected-Gradient Unlearning (PGU), updates model weights with Stochastic Gradient Descent (SGD) in directions orthogonal to gradient subspaces deemed important for the retained data, and scales efficiently to any model and dataset size.
  • results: Empirical evidence shows the unlearned models behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
    Abstract Recent data-privacy laws have sparked interest in machine unlearning, which involves removing the effect of specific training samples from a learnt model as if they were never present in the original training dataset. The challenge of machine unlearning is to discard information about the ``forget'' data in the learnt model without altering the knowledge about the remaining dataset and to do so more efficiently than the naive retraining approach. To achieve this, we adopt a projected-gradient based learning method, named as Projected-Gradient Unlearning (PGU), in which the model takes steps in the orthogonal direction to the gradient subspaces deemed unimportant for the retaining dataset, so as to its knowledge is preserved. By utilizing Stochastic Gradient Descent (SGD) to update the model weights, our method can efficiently scale to any model and dataset size. We provide empirically evidence to demonstrate that our unlearning method can produce models that behave similar to models retrained from scratch across various metrics even when the training dataset is no longer accessible. Our code is available at https://github.com/hnanhtuan/projected_gradient_unlearning.
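The core idea of projecting updates away from directions important to the retained data can be sketched in a few lines. The snippet below is a simplified illustration, not the paper's exact PGU algorithm: it assumes the important retained-data gradient subspace has already been summarized by an orthonormal basis U_retain (for example, from an SVD of accumulated gradients), which is a hypothetical input here.

```python
import numpy as np

def orthogonal_projection_step(theta: np.ndarray, grad_forget: np.ndarray,
                               U_retain: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One unlearning step restricted to the orthogonal complement of the
    retained data's important gradient subspace.

    U_retain: (d, k) matrix with orthonormal columns spanning that subspace.
    Removing the component of the forget-gradient that lies in this subspace
    limits interference with knowledge about the remaining data.
    """
    g_parallel = U_retain @ (U_retain.T @ grad_forget)  # component inside the subspace
    g_orth = grad_forget - g_parallel                   # component orthogonal to it
    return theta - lr * g_orth
```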

Open-Vocabulary Segmentation with Semantic-Assisted Calibration

  • paper_url: http://arxiv.org/abs/2312.04089
  • repo_url: https://github.com/workforai/scan
  • paper_authors: Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang
  • for: This paper focuses on the problem of open-vocabulary segmentation (OVS), specifically addressing the challenge of aligning visual content with the semantics of unbounded text.
  • methods: The proposed method, Semantic-assisted CAlibration Network (SCAN), incorporates a generalized semantic prior of CLIP into proposal embedding and applies a contextual shift strategy to mitigate the lack of global context and unnatural background noise.
  • results: SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks, and the authors also propose a new evaluation metric called Semantic-Guided IoU (SG-IoU) to address the problem of semantic duplication across categories.
    Abstract This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with generalized contextual prior of CLIP. As the core of open-vocabulary understanding, alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress, performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction. To this end, we present a Semantic-assisted CAlibration Network (SCAN). In SCAN, we incorporate generalized semantic prior of CLIP into proposal embedding to avoid collapsing on known categories. Besides, a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With above designs, SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore, we also focus on the problem of existing evaluation system that ignores semantic duplication across categories, and propose a new metric called Semantic-Guided IoU (SG-IoU).

MTVG : Multi-text Video Generation with Text-to-Video Models

  • paper_url: http://arxiv.org/abs/2312.04086
  • repo_url: None
  • paper_authors: Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, Hyeokmin Kwon, Sangpil Kim
  • for: Proposing a novel multi-text video generation (MTVG) method that directly uses a pre-trained diffusion-based text-to-video (T2V) generation model without additional fine-tuning.
  • methods: Dynamic Noise and Last Frame Aware Inversion reinitialize the noise latent to preserve visual consistency between videos generated from different prompts and to avoid repetitive motion or content, while Structure Guiding Sampling maintains the global appearance across frames within a single clip.
  • results: Extensive experiments show the method generates semantically coherent and temporally seamless videos across diverse transitions of descriptions; video examples are available on the project page: https://kuai-lab.github.io/mtvg-page.
    Abstract Recently, video generation has attracted massive attention and yielded noticeable outcomes. Concerning the characteristics of video, multi-text conditioning incorporating sequential events is necessary for next-step video generation. In this work, we propose a novel multi-text video generation~(MTVG) by directly utilizing a pre-trained diffusion-based text-to-video~(T2V) generation model without additional fine-tuning. To generate consecutive video segments, visual consistency generated by distinct prompts is necessary with diverse variations, such as motion and content-related transitions. Our proposed MTVG includes Dynamic Noise and Last Frame Aware Inversion which reinitialize the noise latent to preserve visual coherence between videos of different prompts and prevent repetitive motion or contents. Furthermore, we present Structure Guiding Sampling to maintain the global appearance across the frames in a single video clip, where we leverage iterative latent updates across the preceding frame. Additionally, our Prompt Generator allows for arbitrary format of text conditions consisting of diverse events. As a result, our extensive experiments, including diverse transitions of descriptions, demonstrate that our proposed methods show superior generated outputs in terms of semantically coherent and temporally seamless video.Video examples are available in our project page: https://kuai-lab.github.io/mtvg-page.

Large Language Models are Good Prompt Learners for Low-Shot Image Classification

  • paper_url: http://arxiv.org/abs/2312.04076
  • repo_url: None
  • paper_authors: Zhaoheng Zheng, Jingmin Wei, Xuefeng Hu, Haidong Zhu, Ram Nevatia
  • for: Enhancing pre-trained vision-language (VL) models so they perform well when training images are limited or inaccessible (low-shot classification).
  • methods: Large language models (LLMs) are used as prompt learners (LLaMP) that produce adaptive prompts for the CLIP text encoder, bridging the domain gap between language and vision.
  • results: Compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification across a spectrum of 11 datasets.
    Abstract Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets.
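The CLIP-style zero-shot head that prompt-learning methods build on is just cosine similarity between an image embedding and a set of class-prompt embeddings. The sketch below assumes those embeddings have already been produced by a pretrained vision-language model encoding prompts such as "a photo of a {class name}"; it illustrates the generic mechanism, not LLaMP itself.

```python
import numpy as np

def zero_shot_probs(image_embed: np.ndarray, text_embeds: np.ndarray,
                    temperature: float = 0.01) -> np.ndarray:
    """Zero-shot classification head: cosine similarity between one image
    embedding (D,) and C class-prompt embeddings (C, D), turned into
    class probabilities via a temperature-scaled softmax."""
    img = image_embed / np.linalg.norm(image_embed)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = txt @ img / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```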

Combining inherent knowledge of vision-language models with unsupervised domain adaptation through self-knowledge distillation

  • paper_url: http://arxiv.org/abs/2312.04066
  • repo_url: https://github.com/ThomasWestfechtel/SKD
  • paper_authors: Thomas Westfechtel, Dexuan Zhang, Tatsuya Harada
  • for: Avoiding the tedious work of labeling target data by transferring knowledge from a labeled source dataset to a similar but different target dataset, improving prediction performance on the target.
  • methods: Zero-shot predictions and the inherent knowledge of vision-language models are combined with unsupervised domain adaptation and self-knowledge distillation.
  • results: Experiments and ablation studies show state-of-the-art performance on three benchmarks (OfficeHome, VisDA, and DomainNet), with further gains from a gradual source domain expansion strategy.
    Abstract Unsupervised domain adaptation (UDA) tries to overcome the tedious work of labeling data by leveraging a labeled source dataset and transferring its knowledge to a similar but different target dataset. On the other hand, current vision-language models exhibit astonishing zero-shot prediction capabilities. In this work, we combine knowledge gained through UDA with the inherent knowledge of vision-language models. In a first step, we generate the zero-shot predictions of the source and target dataset using the vision-language model. Since zero-shot predictions usually exhibit a large entropy, meaning that the class probabilities are rather evenly distributed, we first adjust the distribution to accentuate the winning probabilities. This is done using both source and target data to keep the relative confidence between source and target data. We then employ a conventional DA method, to gain the knowledge from the source dataset, in combination with self-knowledge distillation, to maintain the inherent knowledge of the vision-language model. We further combine our method with a gradual source domain expansion strategy (GSDE) and show that this strategy can also benefit by including zero-shot predictions. We conduct experiments and ablation studies on three benchmarks (OfficeHome, VisDA, and DomainNet) and outperform state-of-the-art methods. We further show in ablation studies the contributions of different parts of our algorithm.
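One common way to "accentuate the winning probabilities" of high-entropy zero-shot predictions is temperature sharpening. The sketch below shows that generic operation; the temperature value is illustrative and this is not necessarily the exact adjustment used in the paper.

```python
import numpy as np

def sharpen(probs: np.ndarray, T: float = 0.5) -> np.ndarray:
    """Sharpen an (N, C) matrix of zero-shot class probabilities.

    Raising probabilities to the power 1/T (with T < 1) accentuates the winning
    class while keeping the relative ordering, reducing the high entropy of
    zero-shot predictions before they are used as pseudo-labels.
    """
    p = probs ** (1.0 / T)
    return p / p.sum(axis=1, keepdims=True)
```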

An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything

  • paper_url: http://arxiv.org/abs/2312.04063
  • repo_url: None
  • paper_authors: Israt Zarin Era, Imtiaz Ahmed, Zhichao Liu, Srinjoy Das
  • for: Improving product quality and real-time process control in manufacturing through image-based defect segmentation.
  • methods: A state-of-the-art Vision Transformer (ViT) based foundation model (the Segment Anything Model) is combined with a novel multi-point prompt generation scheme based on unsupervised clustering.
  • results: In a case study on laser powder bed fusion (L-PBF), the framework performs real-time porosity segmentation and achieves high Dice Similarity Coefficients (DSC) without any supervised fine-tuning of the model.
    Abstract Foundation models are currently driving a paradigm shift in computer vision tasks for various fields including biology, astronomy, and robotics among others, leveraging user-generated prompts to enhance their performance. In the manufacturing domain, accurate image-based defect segmentation is imperative to ensure product quality and facilitate real-time process control. However, such tasks are often characterized by multiple challenges including the absence of labels and the requirement for low latency inference among others. To address these issues, we construct a framework for image segmentation using a state-of-the-art Vision Transformer (ViT) based Foundation model (Segment Anything Model) with a novel multi-point prompt generation scheme using unsupervised clustering. We apply our framework to perform real-time porosity segmentation in a case study of laser base powder bed fusion (L-PBF) and obtain high Dice Similarity Coefficients (DSC) without the necessity for any supervised fine-tuning in the model. Using such lightweight foundation model inference in conjunction with unsupervised prompt generation, we envision the construction of a real-time anomaly detection pipeline that has the potential to revolutionize the current laser-based additive manufacturing processes, thereby facilitating the shift towards Industry 4.0 and promoting defect-free production along with operational efficiency.
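A minimal version of unsupervised multi-point prompt generation can be obtained by clustering simple per-pixel features and using one representative pixel per cluster as a point prompt for a promptable segmenter such as SAM. The sketch below uses k-means on (intensity, x, y) features as an illustration; the feature choice and the number of prompts are assumptions, not the paper's exact scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_point_prompts(image: np.ndarray, n_prompts: int = 8) -> np.ndarray:
    """Cluster pixel features and return (x, y) point prompts, one per cluster."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.stack([image.ravel(), xs.ravel() / w, ys.ravel() / h], axis=1)
    km = KMeans(n_clusters=n_prompts, n_init=10).fit(feats)
    prompts = []
    for c in range(n_prompts):
        idx = np.where(km.labels_ == c)[0]
        # use the cluster member closest to the centroid as the prompt point
        best = idx[np.argmin(np.linalg.norm(feats[idx] - km.cluster_centers_[c], axis=1))]
        prompts.append((best % w, best // w))    # pixel (x, y) coordinates
    return np.array(prompts)
```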

Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching

  • paper_url: http://arxiv.org/abs/2312.04060
  • repo_url: https://github.com/junshengzhou/vp2p-match
  • paper_authors: Junsheng Zhou, Baorui Ma, Wenyuan Zhang, Yi Fang, Yu-Shen Liu, Zhizhong Han
  • for: Cross-modality registration, i.e., registering 2D camera images and 3D LiDAR point clouds into a shared space.
  • methods: A triplet network learns VoxelPoint-to-Pixel matching to build a structured cross-modality latent space, and a differentiable probabilistic PnP solver allows supervision to be imposed directly on the predicted pose.
  • results: The method achieves better registration results than previous approaches on the KITTI and nuScenes datasets.
    Abstract Cross-modality registration between 2D images from cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotic. Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks, and use Perspective-n-Points (PnP) to estimate rigid transformation during post-processing. However, these methods struggle to map points and pixels to a shared latent space robustly since points and pixels have very different characteristics with patterns learned in different manners (MLP and CNN), and they also fail to construct supervision directly on the transformation since the PnP is non-differentiable, which leads to unstable registration results. To address these problems, we propose to learn a structured cross-modality latent space to represent pixel features and 3D features via a differentiable probabilistic PnP solver. Specifically, we design a triplet network to learn VoxelPoint-to-Pixel matching, where we represent 3D elements using both voxels and points to learn the cross-modality latent space with pixels. We design both the voxel and pixel branch based on CNNs to operate convolutions on voxels/pixels represented in grids, and integrate an additional point branch to regain the information lost during voxelization. We train our framework end-to-end by imposing supervisions directly on the predicted pose distribution with a probabilistic PnP solver. To explore distinctive patterns of cross-modality features, we design a novel loss with adaptive-weighted optimization for cross-modality feature description. The experimental results on KITTI and nuScenes datasets show significant improvements over the state-of-the-art methods. The code and models are available at https://github.com/junshengzhou/VP2P-Match.
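For context, the conventional post-processing step that this work replaces is a non-differentiable RANSAC PnP solve over the predicted 2D-3D correspondences. A minimal sketch using OpenCV, shown here as the classical baseline rather than the paper's differentiable probabilistic solver:

```python
import numpy as np
import cv2

def estimate_pose_pnp(points_3d: np.ndarray, points_2d: np.ndarray, K: np.ndarray):
    """Classical rigid-pose estimation from 2D-3D correspondences with RANSAC PnP.

    points_3d: (N, 3) LiDAR points, points_2d: (N, 2) matched pixels, K: 3x3 intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32), points_2d.astype(np.float32),
        K.astype(np.float32), np.zeros((4, 1)))  # assume no lens distortion
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec
```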

Residual Graph Convolutional Network for Bird’s-Eye-View Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.04044
  • repo_url: None
  • paper_authors: Qiuxiao Chen, Xiaojun Qi
  • for: Improving the accuracy of Bird's-Eye-View (BEV) semantic segmentation, i.e., the environmental awareness of autonomous vehicles.
  • methods: A Residual Graph Convolutional (RGC) module is incorporated into deep convolutional neural networks (CNNs) so the network can acquire both global information and region-level semantic relationships.
  • results: Experiments on the nuScenes dataset show higher IoU and mIoU than four state-of-the-art networks and their four variants, with an mIoU 3.1% higher than the best existing network, BEVFusion.
    Abstract Retrieving spatial information and understanding the semantic information of the surroundings are important for Bird's-Eye-View (BEV) semantic segmentation. In the application of autonomous driving, autonomous vehicles need to be aware of their surroundings to drive safely. However, current BEV semantic segmentation techniques, deep Convolutional Neural Networks (CNNs) and transformers, have difficulties in obtaining the global semantic relationships of the surroundings at the early layers of the network. In this paper, we propose to incorporate a novel Residual Graph Convolutional (RGC) module in deep CNNs to acquire both the global information and the region-level semantic relationship in the multi-view image domain. Specifically, the RGC module employs a non-overlapping graph space projection to efficiently project the complete BEV information into graph space. It then builds interconnected spatial and channel graphs to extract spatial information between each node and channel information within each node (i.e., extract contextual relationships of the global features). Furthermore, it uses a downsample residual process to enhance the coordinate feature reuse to maintain the global information. The segmentation data augmentation and alignment module helps to simultaneously augment and align BEV features and ground truth to geometrically preserve their alignment to achieve better segmentation results. Our experimental results on the nuScenes benchmark dataset demonstrate that the RGC network outperforms four state-of-the-art networks and its four variants in terms of IoU and mIoU. The proposed RGC network achieves a higher mIoU of 3.1% than the best state-of-the-art network, BEVFusion. Code and models will be released.
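The graph-convolution idea behind the RGC module can be illustrated with a minimal Kipf-Welling-style layer operating on projected BEV node features. The sketch below is a generic graph convolution, not the paper's residual spatial/channel graph design:

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """Minimal graph convolution: X' = ReLU(A_hat X W), where A_hat is the
    symmetrically normalized adjacency with self-loops. Illustrates how a
    graph projection of BEV features could propagate global context."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features, adj: (N, N) binary adjacency
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        a_norm = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
        return torch.relu(a_norm @ self.lin(x))
```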

DiffusionPhase: Motion Diffusion in Frequency Domain

  • paper_url: http://arxiv.org/abs/2312.04036
  • repo_url: None
  • paper_authors: Weilin Wan, Yiming Huang, Shutong Wu, Taku Komura, Wenping Wang, Dinesh Jayaraman, Lingjie Liu
  • for: Generating high-quality human motion sequences of arbitrary length from text descriptions (e.g., "A person walks forward"), overcoming the limited motion diversity and rough transitions caused by small text-to-motion datasets and pose representations that lack expressiveness or compactness.
  • methods: The first text-conditioned human motion generation method in the frequency domain: a network encoder maps the motion space into a compact yet expressive parameterized phase space that encodes high-frequency details and the local periodicity of motion, and a conditional diffusion model predicts periodic motion parameters from the text description and a start pose, enabling smooth transitions between sequences associated with different prompts.
  • results: Extensive experiments show the method generates a broader variety of high-quality motions than existing approaches and can synthesize long sequences with natural transitions.
    Abstract In this study, we introduce a learning-based method for generating high-quality human motion sequences from text descriptions (e.g., ``A person walks forward"). Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences, due to limited text-to-motion datasets and the pose representations used that often lack expressiveness or compactness. To address these issues, we propose the first method for text-conditioned human motion generation in the frequency domain of motions. We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space with high-frequency details encoded, capturing the local periodicity of motions in time and space with high accuracy. We also introduce a conditional diffusion model for predicting periodic motion parameters based on text descriptions and a start pose, efficiently achieving smooth transitions between motion sequences associated with different text descriptions. Experiments demonstrate that our approach outperforms current methods in generating a broader variety of high-quality motions, and synthesizing long sequences with natural transitions.

ImFace++: A Sophisticated Nonlinear 3D Morphable Face Model with Implicit Neural Representations

  • paper_url: http://arxiv.org/abs/2312.04028
  • repo_url: None
  • paper_authors: Mingwu Zheng, Haiyu Zhang, Hongyu Yang, Liming Chen, Di Huang
  • for: Proposing a new 3D morphable face model that learns a sophisticated and continuous face space with implicit neural representations.
  • methods: Two explicitly disentangled deformation fields model shapes associated with identities and expressions; a refinement displacement field in the template space captures individual-specific facial details; and a Neural Blend-Field strengthens the representation through adaptive blending of local fields, together with an improved learning strategy for extended expression embeddings.
  • results: Qualitative and quantitative evaluations show that ImFace++ significantly advances the state of the art in both face reconstruction fidelity and correspondence accuracy.
    Abstract Accurate representations of 3D faces are of paramount importance in various computer vision and graphics applications. However, the challenges persist due to the limitations imposed by data discretization and model linearity, which hinder the precise capture of identity and expression clues in current studies. This paper presents a novel 3D morphable face model, named ImFace++, to learn a sophisticated and continuous space with implicit neural representations. ImFace++ first constructs two explicitly disentangled deformation fields to model complex shapes associated with identities and expressions, respectively, which simultaneously facilitate the automatic learning of correspondences across diverse facial shapes. To capture more sophisticated facial details, a refinement displacement field within the template space is further incorporated, enabling a fine-grained learning of individual-specific facial details. Furthermore, a Neural Blend-Field is designed to reinforce the representation capabilities through adaptive blending of an array of local fields. In addition to ImFace++, we have devised an improved learning strategy to extend expression embeddings, allowing for a broader range of expression variations. Comprehensive qualitative and quantitative evaluations demonstrate that ImFace++ significantly advances the state-of-the-art in terms of both face reconstruction fidelity and correspondence accuracy.

PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation

  • paper_url: http://arxiv.org/abs/2312.04016
  • repo_url: None
  • paper_authors: Ardian Umam, Cheng-Kun Yang, Min-Hung Chen, Jen-Hui Chuang, Yen-Yu Lin
  • for: Improving the accuracy of 3D shape part segmentation.
  • methods: A cross-modal distillation framework, PartDistill, transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation, using bi-directional (forward and backward) distillation between a teacher VLM and a student network.
  • results: On the widely used ShapeNetPart and PartE datasets, PartDistill surpasses existing methods by more than 15% and 12% higher mIoU, respectively.
    Abstract This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in invisible or undetected regions in the 2D projections, inaccurate and inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation, including forward and backward distillations, is carried out within the framework, where the former forward distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D part segmentation. Moreover, PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. Through extensive experiments, PartDistill boosts the existing methods with substantial margins on widely used ShapeNetPart and PartE datasets, by more than 15% and 12% higher mIoU scores, respectively.

Natural-language-driven Simulation Benchmark and Copilot for Efficient Production of Object Interactions in Virtual Road Scenes

  • paper_url: http://arxiv.org/abs/2312.04008
  • repo_url: None
  • paper_authors: Kairui Yang, Zihao Guo, Gengjie Lin, Haotian Dong, Die Zuo, Jibin Peng, Zhao Huang, Zhecheng Xu, Fupeng Li, Ziyun Bai, Di Lin
  • For: The paper is written for researchers and developers of autonomous driving systems, as well as those interested in natural-language-driven simulation for teaching and testing such systems.* Methods: The paper proposes the use of natural-language descriptions to control object interactions in virtual road scenes, and presents a dataset of 120,000 such descriptions, along with a methodology for translating them into renderable code.* Results: The paper evaluates the effectiveness of the proposed methodology, SimCopilot, in controlling object motions, generating complex interactions, and generalizing interactions across different road topologies using the L2I dataset. The results demonstrate the potential of the NLD simulation for efficient and effective testing and teaching of autonomous driving systems.Here is the information in Simplified Chinese text:
    Abstract We advocate the idea of the natural-language-driven(NLD) simulation to efficiently produce the object interactions between multiple objects in the virtual road scenes, for teaching and testing the autonomous driving systems that should take quick action to avoid collision with obstacles with unpredictable motions. The NLD simulation allows the brief natural-language description to control the object interactions, significantly reducing the human efforts for creating a large amount of interaction data. To facilitate the research of NLD simulation, we collect the Language-to-Interaction(L2I) benchmark dataset with 120,000 natural-language descriptions of object interactions in 6 common types of road topologies. Each description is associated with the programming code, which the graphic render can use to visually reconstruct the object interactions in the virtual scenes. As a methodology contribution, we design SimCopilot to translate the interaction descriptions to the renderable code. We use the L2I dataset to evaluate SimCopilot's abilities to control the object motions, generate complex interactions, and generalize interactions across road topologies. The L2I dataset and the evaluation results motivate the relevant research of the NLD simulation.

LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures

  • paper_url: http://arxiv.org/abs/2312.04000
  • repo_url: None
  • paper_authors: Vimal Thilak, Chen Huang, Omid Saremi, Laurent Dinh, Hanlin Goh, Preetum Nakkiran, Joshua M. Susskind, Etai Littwin
  • for: Proposing a metric for assessing the quality of representations learned by joint embedding (JE) architectures, so that JE methods can be evaluated without access to a downstream task or an annotated dataset.
  • methods: LiDAR (Linear Discriminant Analysis Rank) measures representation quality as the rank of the Linear Discriminant Analysis (LDA) matrix associated with the surrogate SSL task, discriminating between informative and uninformative features better than feature-covariance-rank approaches.
  • results: Empirically, LiDAR significantly surpasses naive rank-based approaches in its predictive power for optimal hyperparameters, offering a more robust and intuitive way to assess representations within JE architectures.
    Abstract Joint embedding (JE) architectures have emerged as a promising avenue for acquiring transferable data representations. A key obstacle to using JE methods, however, is the inherent challenge of evaluating learned representations without access to a downstream task, and an annotated dataset. Without efficient and reliable evaluation, it is difficult to iterate on architectural and training choices for JE methods. In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within JE architectures. Our metric addresses several shortcomings of recent approaches based on feature covariance rank by discriminating between informative and uninformative features. In essence, LiDAR quantifies the rank of the Linear Discriminant Analysis (LDA) matrix associated with the surrogate SSL task -- a measure that intuitively captures the information content as it pertains to solving the SSL task. We empirically demonstrate that LiDAR significantly surpasses naive rank based approaches in its predictive power of optimal hyperparameters. Our proposed criterion presents a more robust and intuitive means of assessing the quality of representations within JE architectures, which we hope facilitates broader adoption of these powerful techniques in various domains.
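The LiDAR idea can be sketched as the effective rank (the exponential of the eigenvalue entropy) of an LDA-style matrix built from between-class and within-class scatter, where "classes" are defined by the surrogate SSL task. The snippet below is a simplified illustration under these assumptions, not the paper's exact estimator:

```python
import numpy as np

def effective_rank(mat: np.ndarray, eps: float = 1e-7) -> float:
    """Smooth rank measure: exp of the entropy of the normalized eigenvalues."""
    lam = np.clip(np.linalg.eigvalsh(mat), 0.0, None) + eps
    p = lam / lam.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

def lidar_style_score(embeddings: np.ndarray, labels: np.ndarray,
                      delta: float = 1e-4) -> float:
    """Effective rank of Sigma_w^{-1/2} Sigma_b Sigma_w^{-1/2} over surrogate classes."""
    n, d = embeddings.shape
    mu = embeddings.mean(axis=0)
    sw = np.zeros((d, d))   # within-class scatter
    sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        e = embeddings[labels == c]
        mu_c = e.mean(axis=0)
        sw += (e - mu_c).T @ (e - mu_c) / n
        sb += len(e) / n * np.outer(mu_c - mu, mu_c - mu)
    sw += delta * np.eye(d)                      # regularize for invertibility
    evals, evecs = np.linalg.eigh(sw)
    sw_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return effective_rank(sw_inv_sqrt @ sb @ sw_inv_sqrt)
```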

Stable diffusion for Data Augmentation in COCO and Weed Datasets

  • paper_url: http://arxiv.org/abs/2312.03996
  • repo_url: None
  • paper_authors: Boyang Deng, Yuzhen Lu
  • for: Exploring the potential of stable diffusion models to improve the performance of object detection models, specifically for small datasets with image-sparse categories.
  • methods: A recent version of stable diffusion was used to generate synthetic images belonging to seven categories in the COCO dataset and three weed species in Michigan; YOLOv8 models trained on these synthetic images were compared to models trained on original images, and several stable diffusion techniques (e.g., image-to-image translation, Dreambooth, ControlNet) were leveraged for image generation with different focuses.
  • results: Although the overall results were disappointing, promising results were achieved in some classes, illustrating the potential of stable diffusion models to improve object detection performance and suggesting they may be adapted to classification and detection tasks in other fields.
    Abstract Generative models have increasingly impacted relative tasks ranging from image revision and object detection in computer vision to interior design and idea illustration in more general fields. Stable diffusion is an outstanding model series that paves the way for producing high-resolution images with thorough details from text prompts or reference images. It will be an interesting topic about how to leverage the capability of stable diffusion to elevate the image variations of certain categories (e.g., vehicles, humans, and daily objects); particularly, it has the potential to gain improvements for small datasets with image-sparse categories. This study utilized seven categories in the popular COCO dataset and three widespread weed species in Michigan to evaluate the efficiency of a recent version of stable diffusion. In detail, Stable diffusion was used to generate synthetic images belonging to these classes; then, YOLOv8 models were trained based on these synthetic images, whose performance was compared to the models trained on original images. In addition, several techniques (e.g., Image-to-image translation, Dreambooth, ControlNet) of Stable diffusion were leveraged for image generation with different focuses. In spite of the overall results being disappointing, promising results have been achieved in some classes, illustrating the potential of stable diffusion models to improve the performance of detection models, which represent more helpful information being conveyed into the models by the generated images. This seminal study may expedite the adaption of stable diffusion models to classification and detection tasks in different fields.
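A minimal version of the augmentation pipeline described above, generating synthetic images with a public Stable Diffusion checkpoint and then training YOLOv8 on a mixed dataset, might look like the sketch below. The prompt, output paths, and dataset YAML are hypothetical placeholders, labels for the synthetic images still have to be produced separately, and this is not necessarily the exact configuration used in the study.

```python
import os
import torch
from diffusers import StableDiffusionPipeline
from ultralytics import YOLO

# 1) Generate synthetic images for an image-sparse class (prompt is illustrative).
os.makedirs("synthetic", exist_ok=True)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
for i in range(50):
    image = pipe("a photo of common ragweed in a corn field").images[0]
    image.save(f"synthetic/ragweed_{i:03d}.jpg")

# 2) After labeling the synthetic images, train YOLOv8 on a dataset YAML that
#    mixes real and synthetic images (hypothetical file name).
model = YOLO("yolov8n.pt")
model.train(data="weeds_mixed.yaml", epochs=100, imgsz=640)
```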