cs.CV - 2023-12-05

DiffusionAtlas: High-Fidelity Consistent Diffusion Video Editing

  • paper_url: http://arxiv.org/abs/2312.03772
  • repo_url: None
  • paper_authors: Shao-Yu Chang, Hwann-Tzong Chen, Tyng-Luh Liu
  • for: Maintaining consistent, high-fidelity object appearance when editing videos.
  • methods: The DiffusionAtlas framework, which uses a visual-textual diffusion model to edit objects directly on diffusion atlases, ensuring consistent object appearance across frames.
  • results: Achieves high-fidelity, consistent video editing, outperforming state-of-the-art methods.
    Abstract We present a diffusion-based video editing framework, namely DiffusionAtlas, which can achieve both frame consistency and high fidelity in editing video object appearance. Despite the success in image editing, diffusion models still encounter significant hindrances when it comes to video editing due to the challenge of maintaining spatiotemporal consistency in the object's appearance across frames. On the other hand, atlas-based techniques allow propagating edits on the layered representations consistently back to frames. However, they often struggle to create editing effects that adhere correctly to the user-provided textual or visual conditions due to the limitation of editing the texture atlas on a fixed UV mapping field. Our method leverages a visual-textual diffusion model to edit objects directly on the diffusion atlases, ensuring coherent object identity across frames. We design a loss term with atlas-based constraints and build a pretrained text-driven diffusion model as pixel-wise guidance for refining shape distortions and correcting texture deviations. Qualitative and quantitative experiments show that our method outperforms state-of-the-art methods in achieving consistent high-fidelity video-object editing.

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.03771
  • repo_url: None
  • paper_authors: Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C. K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou
  • for: Introducing a new image inpainting task that combines text and exemplar images; while text and exemplar images have each been used independently in prior work, their joint use had not been addressed.
  • methods: A two-step approach, DreamInpainter: dense subject features are first computed to ensure accurate subject replication, then a discriminative token selection module discards redundant subject details, preserving the subject's identity while allowing changes driven by the text prompt and mask shape; a decoupling regularization technique further strengthens text control in the presence of exemplar images.
  • results: Extensive experiments show clear advantages in visual quality, identity preservation, and text control, demonstrating effectiveness for text-guided subject-driven image inpainting.
    Abstract This study introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for image inpainting. While both text and exemplar images have been used independently in previous efforts, their combined utilization remains unexplored. Simultaneously accommodating both conditions poses a significant challenge due to the inherent balance required between editability and subject fidelity. To tackle this challenge, we propose a two-step approach DreamInpainter. First, we compute dense subject features to ensure accurate subject replication. Then, we employ a discriminative token selection module to eliminate redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts. Additionally, we introduce a decoupling regularization technique to enhance text control in the presence of exemplar images. Our extensive experiments demonstrate the superior performance of our method in terms of visual quality, identity preservation, and text control, showcasing its effectiveness in the context of text-guided subject-driven image inpainting.

HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces

  • paper_url: http://arxiv.org/abs/2312.03160
  • repo_url: None
  • paper_authors: Haithem Turki, Vasu Agrawal, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Deva Ramanan, Michael Zollhöfer, Christian Richardt
  • for: Improving both the quality and the speed of view synthesis.
  • methods: Combining surface and volumetric representations.
  • results: Improves error rates by 15-30% while achieving real-time framerates (at least 36 FPS) at virtual-reality resolution (2K x 2K).
    Abstract Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering, thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize, most real-world objects can be modeled more efficiently with surfaces instead of volumes, requiring far fewer samples per ray. This observation has spurred considerable progress in surface representations such as signed distance functions, but these may struggle to model semi-opaque and thin structures. We propose a method, HybridNeRF, that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. We evaluate HybridNeRF against the challenging Eyeful Tower dataset along with other commonly used view synthesis datasets. When comparing to state-of-the-art baselines, including recent rasterization-based approaches, we improve error rates by 15-30% while achieving real-time framerates (at least 36 FPS) for virtual-reality resolutions (2Kx2K).
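The idea of treating most regions as surfaces while keeping difficult regions volumetric can be illustrated with a VolSDF-style conversion of signed distance to density, where a spatially varying scale parameter controls how "surface-like" each region behaves. This is only a minimal sketch of that general idea, not the paper's exact formulation; the function name and the NumPy implementation are assumptions.

```python
import numpy as np

def sdf_to_density(sdf, beta):
    """VolSDF-style Laplace-CDF conversion of signed distance to volume density.

    A small beta concentrates density on the zero level set (surface-like, few
    samples needed per ray); a larger beta keeps the region genuinely volumetric,
    which helps for thin or semi-opaque structures.
    """
    sdf = np.asarray(sdf, dtype=np.float64)
    beta = np.asarray(beta, dtype=np.float64)
    return np.where(
        sdf > 0,
        0.5 * np.exp(-sdf / beta) / beta,           # outside the surface
        (1.0 - 0.5 * np.exp(sdf / beta)) / beta,    # inside the surface
    )
```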

ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet

  • paper_url: http://arxiv.org/abs/2312.03154
  • repo_url: https://github.com/soon-yau/visconet
  • paper_authors: Soon Yau Cheong, Armin Mustafa, Andrew Gilbert
  • for: Improving the controllability of text-to-image human generation models by using visual prompts to control image structure.
  • methods: A ControlNet branch disentangles the object's appearance from the image background and injects it into a pretrained latent diffusion model (LDM).
  • results: Visual attributes and artistic styles can be controlled with text and image prompts, and visual conditioning can be learned from small, specific object domains while preserving the generative power of the LDM backbone.
    Abstract This paper introduces ViscoNet, a novel method that enhances text-to-image human generation models with visual prompting. Unlike existing methods that rely on lengthy text descriptions to control the image structure, ViscoNet allows users to specify the visual appearance of the target object with a reference image. ViscoNet disentangles the object's appearance from the image background and injects it into a pre-trained latent diffusion model (LDM) model via a ControlNet branch. This way, ViscoNet mitigates the style mode collapse problem and enables precise and flexible visual control. We demonstrate the effectiveness of ViscoNet on human image generation, where it can manipulate visual attributes and artistic styles with text and image prompts. We also show that ViscoNet can learn visual conditioning from small and specific object domains while preserving the generative power of the LDM backbone.

Predicting Bone Degradation Using Vision Transformer and Synthetic Cellular Microstructures Dataset

  • paper_url: http://arxiv.org/abs/2312.03133
  • repo_url: None
  • paper_authors: Mohammad Saber Hashemi, Azadeh Sheidaei
  • for: Predicting and visualizing bone degradation in astronauts on space exploration missions.
  • methods: A robust yet fast computational method, TransVNet, that takes different 3D voxelized images and predicts their evolution while accounting for each individual's bone characteristics.
  • results: A hybrid 3D-CNN-VisionTransformer autoencoder architecture tracks the evolution of 3D voxelized images over month-scale time spans, allowing the predicted changes to be compared against real bone-degradation behaviour.
    Abstract Bone degradation, especially for astronauts in microgravity conditions, is crucial for space exploration missions since the lower applied external forces accelerate the diminution in bone stiffness and strength substantially. Although existing computational models help us understand this phenomenon and possibly restrict its effect in the future, they are time-consuming to simulate the changes in the bones, not just the bone microstructures, of each individual in detail. In this study, a robust yet fast computational method to predict and visualize bone degradation has been developed. Our deep-learning method, TransVNet, can take in different 3D voxelized images and predict their evolution throughout months utilizing a hybrid 3D-CNN-VisionTransformer autoencoder architecture. Because of limited available experimental data and challenges of obtaining new samples, a digital twin dataset of diverse and initial bone-like microstructures was generated to train our TransVNet on the evolution of the 3D images through a previously developed degradation model for microgravity.

AI-SAM: Automatic and Interactive Segment Anything Model

  • paper_url: http://arxiv.org/abs/2312.03119
  • repo_url: None
  • paper_authors: Yimu Pan, Sitao Zhang, Alison D. Gernand, Jeffery A. Goldstein, James Z. Wang
  • for: Addressing the split between automatic and interactive approaches in semantic segmentation.
  • methods: A new Automatic and Interactive Segment Anything Model (AI-SAM), including a comprehensive analysis of prompt quality and the first Automatic and Interactive Prompter (AI-Prompter), which automatically generates initial point prompts while accepting additional user input.
  • results: AI-SAM achieves state-of-the-art performance in the automatic setting and can incorporate additional user prompts to further improve performance.
    Abstract Semantic segmentation is a core task in computer vision. Existing methods are generally divided into two categories: automatic and interactive. Interactive approaches, exemplified by the Segment Anything Model (SAM), have shown promise as pre-trained models. However, current adaptation strategies for these models tend to lean towards either automatic or interactive approaches. Interactive methods depend on prompts user input to operate, while automatic ones bypass the interactive promptability entirely. Addressing these limitations, we introduce a novel paradigm and its first model: the Automatic and Interactive Segment Anything Model (AI-SAM). In this paradigm, we conduct a comprehensive analysis of prompt quality and introduce the pioneering Automatic and Interactive Prompter (AI-Prompter) that automatically generates initial point prompts while accepting additional user inputs. Our experimental results demonstrate AI-SAM's effectiveness in the automatic setting, achieving state-of-the-art performance. Significantly, it offers the flexibility to incorporate additional user prompts, thereby further enhancing its performance. The project page is available at https://github.com/ymp5078/AI-SAM.

The Automated Bias Triangle Feature Extraction Framework

  • paper_url: http://arxiv.org/abs/2312.03110
  • repo_url: None
  • paper_authors: Madeleine Kotzagiannidis, Jonas Schuff, Nathan Korda
  • for: An automated analysis framework that uses bias triangles in the stability diagrams of quantum dot (QD) devices to detect Pauli Spin Blockade (PSB).
  • methods: Unsupervised, segmentation-based computer vision methods extract the physical properties of bias triangles, automatically identifying and quantifying their shapes and features without human labelling or large training datasets.
  • results: PSB detection can be performed effectively and efficiently without any training data or human labelling, which can help improve the stability and performance of quantum dot devices.
    Abstract Bias triangles represent features in stability diagrams of Quantum Dot (QD) devices, whose occurrence and property analysis are crucial indicators for spin physics. Nevertheless, challenges associated with quality and availability of data as well as the subtlety of physical phenomena of interest have hindered an automatic and bespoke analysis framework, often still relying (in part) on human labelling and verification. We introduce a feature extraction framework for bias triangles, built from unsupervised, segmentation-based computer vision methods, which facilitates the direct identification and quantification of physical properties of the former. Thereby, the need for human input or large training datasets to inform supervised learning approaches is circumvented, while additionally enabling the automation of pixelwise shape and feature labeling. In particular, we demonstrate that Pauli Spin Blockade (PSB) detection can be conducted effectively, efficiently and without any training data as a direct result of this approach.
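As a rough illustration of how an unsupervised, segmentation-based pipeline can pull shape features out of a stability diagram, the sketch below thresholds a 2D current map, labels connected regions, and reports simple descriptors for each candidate triangle. It is only a schematic stand-in for the framework described above; the Otsu thresholding choice and the particular feature list are assumptions.

```python
import numpy as np
from skimage import filters, measure

def extract_bias_triangle_features(current_map):
    """Segment candidate bias triangles from a 2D stability diagram (current map)
    and return simple per-region shape/intensity descriptors."""
    current_map = np.asarray(current_map, dtype=float)
    mask = current_map > filters.threshold_otsu(current_map)   # unsupervised split
    labels = measure.label(mask)                                # connected components
    features = []
    for region in measure.regionprops(labels, intensity_image=current_map):
        features.append({
            "area": region.area,
            "eccentricity": region.eccentricity,
            "mean_current": region.mean_intensity,  # suppressed current can hint at PSB
        })
    return labels, features
```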

Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI

  • paper_url: http://arxiv.org/abs/2312.03102
  • repo_url: None
  • paper_authors: Sean I. Young, Yaël Balbastre, Bruce Fischl, Polina Golland, Juan Eugenio Iglesias
  • for: Reconstructing an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion.
  • methods: A fully convolutional network estimates the motion stack for a given slice stack, producing the 3D reconstruction as a byproduct of the predicted motion.
  • results: Twice the accuracy of previous SVR methods on reconstructing adult and fetal brains. Code is available at github.com/seannz/svr.
    Abstract In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR) refers to computational reconstruction of an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion. While promising, current SVR methods require multiple slice stacks for accurate 3D reconstruction, leading to long scans and limiting their use in time-sensitive applications such as fetal fMRI. Here, we propose a SVR method that overcomes the shortcomings of previous work and produces state-of-the-art reconstructions in the presence of extreme inter-slice motion. Inspired by the recent success of single-view depth estimation methods, we formulate SVR as a single-stack motion estimation task and train a fully convolutional network to predict a motion stack for a given slice stack, producing a 3D reconstruction as a byproduct of the predicted motion. Extensive experiments on the SVR of adult and fetal brains demonstrate that our fully convolutional method is twice as accurate as previous SVR methods. Our code is available at github.com/seannz/svr.

Gaussian3Diff: 3D Gaussian Diffusion for 3D Full Head Synthesis and Editing

  • paper_url: http://arxiv.org/abs/2312.03763
  • repo_url: None
  • paper_authors: Yushi Lan, Feitong Tan, Di Qiu, Qiangeng Xu, Kyle Genova, Zeng Huang, Sean Fanello, Rohit Pandey, Thomas Funkhouser, Chen Change Loy, Yinda Zhang
  • for: Generating photorealistic 3D human heads that can be flexibly edited and reposed.
  • methods: Represents the head with 3D Gaussians anchored on a parametric face model; each Gaussian carries a lightweight tri-plane payload, and the Gaussians are parameterized in a 2D UV space via a 3DMM so a diffusion model can be used for 3D head avatar generation.
  • results: Generates diverse, realistic 3D human heads with fine-grained editing of facial features and expressions.
    Abstract We present a novel framework for generating photorealistic 3D human head and subsequently manipulating and reposing them with remarkable flexibility. The proposed approach leverages an implicit function representation of 3D human heads, employing 3D Gaussians anchored on a parametric face model. To enhance representational capabilities and encode spatial information, we embed a lightweight tri-plane payload within each Gaussian rather than directly storing color and opacity. Additionally, we parameterize the Gaussians in a 2D UV space via a 3DMM, enabling effective utilization of the diffusion model for 3D head avatar generation. Our method facilitates the creation of diverse and realistic 3D human heads with fine-grained editing over facial features and expressions. Extensive experiments demonstrate the effectiveness of our method.

ScAR: Scaling Adversarial Robustness for LiDAR Object Detection

  • paper_url: http://arxiv.org/abs/2312.03085
  • repo_url: https://github.com/xiaohulugo/ScAR-Scaling-Adversarial-Robustness-for-LiDAR-Object-Detection
  • paper_authors: Xiaohu Lu, Hayder Radha
  • for: Improving the adversarial robustness of LiDAR object detection models.
  • methods: A black-box Scaling Adversarial Robustness (ScAR) method for LiDAR object detection, with three black-box scaling adversarial attacks based on the available information: a model-aware attack, a distribution-aware attack, and a blind attack.
  • results: Effective at improving the model's robustness against scaling adversarial attacks, as demonstrated by comparisons with other methods on public datasets under different 3D object detection architectures.
    Abstract The adversarial robustness of a model is its ability to resist adversarial attacks in the form of small perturbations to input data. Universal adversarial attack methods such as Fast Sign Gradient Method (FSGM) and Projected Gradient Descend (PGD) are popular for LiDAR object detection, but they are often deficient compared to task-specific adversarial attacks. Additionally, these universal methods typically require unrestricted access to the model's information, which is difficult to obtain in real-world applications. To address these limitations, we present a black-box Scaling Adversarial Robustness (ScAR) method for LiDAR object detection. By analyzing the statistical characteristics of 3D object detection datasets such as KITTI, Waymo, and nuScenes, we have found that the model's prediction is sensitive to scaling of 3D instances. We propose three black-box scaling adversarial attack methods based on the available information: model-aware attack, distribution-aware attack, and blind attack. We also introduce a strategy for generating scaling adversarial examples to improve the model's robustness against these three scaling adversarial attacks. Comparison with other methods on public datasets under different 3D object detection architectures demonstrates the effectiveness of our proposed method.
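The attacks exploit the detector's sensitivity to the scale of 3D instances. A blind scaling perturbation can be pictured as simply rescaling an instance's points (and its box) about the box centre; this sketch is an illustrative simplification, and the paper's actual attack construction, scale factors, and model/distribution-aware variants are not reproduced here.

```python
import numpy as np

def scale_instance(points, box, scale):
    """Blindly rescale one 3D instance about its bounding-box centre.

    points: (N, 3) points belonging to the instance
    box:    (7,) [cx, cy, cz, dx, dy, dz, heading]
    scale:  scalar scaling factor (e.g. 0.9 or 1.1)
    """
    center = box[:3]
    scaled_points = (points - center) * scale + center
    scaled_box = box.copy()
    scaled_box[3:6] *= scale          # box dimensions grow/shrink with the points
    return scaled_points, scaled_box
```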

LooseControl: Lifting ControlNet for Generalized Depth Conditioning

  • paper_url: http://arxiv.org/abs/2312.03079
  • repo_url: None
  • paper_authors: Shariq Farooq Bhat, Niloy J. Mitra, Peter Wonka
  • for: LooseControl enables generalized depth conditioning for diffusion-based image generation, allowing users to create complex environments from only scene boundary conditions or 3D box controls.
  • methods: A generalized form of depth conditioning that supports scene boundary control and 3D box control for specifying the layout locations of target objects, together with two editing mechanisms (3D box editing and attribute editing) to refine the results.
  • results: Extensive tests and comparisons with baselines demonstrate the generality of LooseControl; the authors believe it can become an important design tool for creating complex environments and be extended to other forms of guidance channels.
    Abstract We present LooseControl to allow generalized depth conditioning for diffusion-based image generation. ControlNet, the SOTA for depth-conditioned image generation, produces remarkable results but relies on having access to detailed depth maps for guidance. Creating such exact depth maps, in many scenarios, is challenging. This paper introduces a generalized version of depth conditioning that enables many new content-creation workflows. Specifically, we allow (C1) scene boundary control for loosely specifying scenes with only boundary conditions, and (C2) 3D box control for specifying layout locations of the target objects rather than the exact shape and appearance of the objects. Using LooseControl, along with text guidance, users can create complex environments (e.g., rooms, street views, etc.) by specifying only scene boundaries and locations of primary objects. Further, we provide two editing mechanisms to refine the results: (E1) 3D box editing enables the user to refine images by changing, adding, or removing boxes while freezing the style of the image. This yields minimal changes apart from changes induced by the edited boxes. (E2) Attribute editing proposes possible editing directions to change one particular aspect of the scene, such as the overall object density or a particular object. Extensive tests and comparisons with baselines demonstrate the generality of our method. We believe that LooseControl can become an important design tool for easily creating complex environments and be extended to other forms of guidance channels. Code and more information are available at https://shariqfarooq123.github.io/loose-control/ .

ReconFusion: 3D Reconstruction with Diffusion Priors

  • paper_url: http://arxiv.org/abs/2312.02981
  • repo_url: None
  • paper_authors: Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, Aleksander Holynski
  • for: Reconstructing real-world scenes from only a few photos.
  • methods: A diffusion prior for novel view synthesis is used to synthesize realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions.
  • results: Significant performance improvements across a variety of real-world scenes, including forward-facing and 360-degree scenes.
    Abstract 3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches.

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

  • paper_url: http://arxiv.org/abs/2312.02980
  • repo_url: None
  • paper_authors: Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, Hengshuang Zhao
  • for: Improving the understanding of 3D objects with GPT4Point, a groundbreaking point-language multimodal model that can execute various point-text reference tasks and generate high-quality 3D objects from low-quality point-text features.
  • methods: GPT4Point, a powerful 3D multimodal language model built on the MLLM framework with advanced capabilities for controllable 3D generation, trained on a large-scale database of over 1M objects from the Objaverse-XL dataset constructed with the Pyramid-XL annotation engine.
  • results: Superior performance in understanding and generation tasks, achieving high-quality results in point-cloud captioning and Q&A while maintaining the geometric shapes and colors of the 3D objects.
    Abstract Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, an innovative groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point is equipped with advanced capabilities for controllable 3D generation, it can get high-quality results through a low-quality point-text feature maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs, we develop Pyramid-XL, a point-language dataset annotation engine. It constructs a large-scale database over 1M objects of varied text granularity levels from the Objaverse-XL dataset, essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations, GPT4Point has demonstrated superior performance in understanding and generation.

DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2312.03053
  • repo_url: None
  • paper_authors: Zhi Chen, Yufan Ren, Tong Zhang, Zheng Dang, Wenbing Tao, Sabine Süsstrunk, Mathieu Salzmann
  • for: A diffusion-based point cloud registration method that efficiently aligns two point clouds into a common reference frame.
  • methods: Formulates point cloud registration as a denoising diffusion process, using a diffusion model to map the noisy predictions of an off-the-shelf PCR model toward the ground-truth transformation and suppress noise across steps.
  • results: Improves registration accuracy, reaching state-of-the-art registration recall (95.3%/81.6%) on the 3DMatch and 3DLoMatch datasets.
    Abstract Point Cloud Registration (PCR) estimates the relative rigid transformation between two point clouds. We propose formulating PCR as a denoising diffusion probabilistic process, mapping noisy transformations to the ground truth. However, using diffusion models for PCR has nontrivial challenges, such as adapting a generative model to a discriminative task and leveraging the estimated nonlinear transformation from the previous step. Instead of training a diffusion model to directly map pure noise to ground truth, we map the predictions of an off-the-shelf PCR model to ground truth. The predictions of off-the-shelf models are often imperfect, especially in challenging cases where the two points clouds have low overlap, and thus could be seen as noisy versions of the real rigid transformation. In addition, we transform the rotation matrix into a spherical linear space for interpolation between samples in the forward process, and convert rigid transformations into auxiliary information to implicitly exploit last-step estimations in the reverse process. As a result, conditioned on time step, the denoising model adapts to the increasing accuracy across steps and refines registrations. Our extensive experiments showcase the effectiveness of our DiffusionPCR, yielding state-of-the-art registration recall rates (95.3%/81.6%) on 3DMatch and 3DLoMatch. The code will be made public upon publication.
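One concrete ingredient mentioned above is interpolating rigid transforms in a spherical linear space. A generic sketch using SciPy's rotation utilities is given below; how the interpolated transforms enter the forward diffusion process is specific to the paper and not reproduced here.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_rigid(R0, t0, R1, t1, alpha):
    """Interpolate between two rigid transforms (R0, t0) and (R1, t1).

    Rotations are interpolated with spherical linear interpolation (slerp),
    translations with ordinary linear interpolation; alpha in [0, 1].
    """
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix(np.stack([R0, R1])))
    R_alpha = slerp([alpha]).as_matrix()[0]
    t_alpha = (1.0 - alpha) * np.asarray(t0) + alpha * np.asarray(t1)
    return R_alpha, t_alpha
```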

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

  • paper_url: http://arxiv.org/abs/2312.02973
  • repo_url: https://github.com/skhu101/gauhuman
  • paper_authors: Shoukang Hu, Ziwei Liu
  • for: A 3D human model with fast training (1-2 minutes) and real-time rendering (up to 189 FPS), in contrast to existing NeRF-based implicit representation frameworks that require hours of training and seconds per rendered frame.
  • methods: Gaussian Splatting encoded in canonical space, with 3D Gaussians transformed from canonical to posed space via linear blend skinning (LBS), plus pose and LBS refinement modules designed to learn fine human details at negligible computational cost.
  • results: Extensive experiments on the ZJU_Mocap and MonoCap datasets show state-of-the-art quantitative and qualitative performance with fast training and real-time rendering; notably, GauHuman models a 3D human performer with ~13k 3D Gaussians without sacrificing rendering quality.
    Abstract We present, GauHuman, a 3D human model with Gaussian Splatting for both fast training (1 ~ 2 minutes) and real-time rendering (up to 189 FPS), compared with existing NeRF-based implicit representation modelling frameworks demanding hours of training and seconds of rendering per frame. Specifically, GauHuman encodes Gaussian Splatting in the canonical space and transforms 3D Gaussians from canonical space to posed space with linear blend skinning (LBS), in which effective pose and LBS refinement modules are designed to learn fine details of 3D humans under negligible computational cost. Moreover, to enable fast optimization of GauHuman, we initialize and prune 3D Gaussians with 3D human prior, while splitting/cloning via KL divergence guidance, along with a novel merge operation for further speeding up. Extensive experiments on ZJU_Mocap and MonoCap datasets demonstrate that GauHuman achieves state-of-the-art performance quantitatively and qualitatively with fast training and real-time rendering speed. Notably, without sacrificing rendering quality, GauHuman can fast model the 3D human performer with ~13k 3D Gaussians.
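The canonical-to-posed transformation described above is standard linear blend skinning applied to the Gaussian centres. Below is a minimal NumPy sketch of that step only; rotating the Gaussian covariances and the paper's pose/LBS refinement modules are omitted, and the function name and shapes are assumptions.

```python
import numpy as np

def lbs_transform(centers, skin_weights, joint_R, joint_t):
    """Move canonical Gaussian centres into posed space with linear blend skinning.

    centers:      (N, 3) Gaussian centres in canonical space
    skin_weights: (N, J) per-point blend weights, rows summing to 1
    joint_R:      (J, 3, 3) per-joint rotations
    joint_t:      (J, 3)    per-joint translations
    """
    # Apply every joint's rigid transform to every centre: (J, N, 3)
    per_joint = np.einsum('jab,nb->jna', joint_R, centers) + joint_t[:, None, :]
    # Blend the per-joint results with the skinning weights: (N, 3)
    return np.einsum('nj,jna->na', skin_weights, per_joint)
```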

AmbiGen: Generating Ambigrams from Pre-trained Diffusion Model

  • paper_url: http://arxiv.org/abs/2312.02967
  • repo_url: None
  • paper_authors: Boheng Zhao, Rana Hanocka, Raymond A. Yeh
  • for: Generating ambigrams, calligraphic designs whose reading changes with viewing orientation.
  • methods: Distills the large-scale DeepFloyd IF vision-and-language diffusion model to optimize letter outlines for legibility in both viewing orientations.
  • results: On the 500 most common English words, the method achieves more than an 11.6% increase in word accuracy and at least a 41.9% reduction in edit distance compared with existing ambigram generation methods.
    Abstract Ambigrams are calligraphic designs that have different meanings depending on the viewing orientation. Creating ambigrams is a challenging task even for skilled artists, as it requires maintaining the meaning under two different viewpoints at the same time. In this work, we propose to generate ambigrams by distilling a large-scale vision and language diffusion model, namely DeepFloyd IF, to optimize the letters' outline for legibility in the two viewing orientations. Empirically, we demonstrate that our approach outperforms existing ambigram generation methods. On the 500 most common words in English, our method achieves more than an 11.6% increase in word accuracy and at least a 41.9% reduction in edit distance.
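The dual-orientation objective can be pictured as scoring the rendered glyphs both upright and after a 180-degree rotation and summing the two losses. The sketch below is only that skeleton: the legibility scores in the paper come from distilling DeepFloyd IF, which is stood in for here by placeholder callables, and the function name is an assumption.

```python
import torch

def dual_view_loss(glyph_image, legibility_loss_word_a, legibility_loss_word_b):
    """Score an ambigram rendering in both reading orientations.

    glyph_image: (B, C, H, W) differentiable rendering of the letter outlines
    legibility_loss_word_a / _b: callables returning a scalar loss for how well
    the image reads as each target word (placeholders for diffusion-based scores).
    """
    flipped = torch.rot90(glyph_image, k=2, dims=(2, 3))   # 180-degree rotation
    return legibility_loss_word_a(glyph_image) + legibility_loss_word_b(flipped)
```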

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

  • paper_url: http://arxiv.org/abs/2312.02966
  • repo_url: https://github.com/luluho1208/diffusion-ss3d
  • paper_authors: Cheng-Ju Ho, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
  • for: Improving the quality of pseudo-labels in semi-supervised 3D object detection, addressing the difficulty of acquiring large-scale 3D bounding-box annotations.
  • methods: A teacher-student framework with pseudo-labeling, extended with a diffusion model: noise is added to corrupt 3D object sizes and class-label distributions, and the diffusion model acts as a denoising process to produce bounding-box outputs, improving pseudo-label generation and the overall semi-supervised learning process.
  • results: Experiments on the ScanNet and SUN RGB-D datasets demonstrate state-of-the-art performance against existing methods, together with extensive analysis of how the diffusion model design affects semi-supervised learning.
    Abstract Semi-supervised object detection is crucial for 3D scene understanding, efficiently addressing the limitation of acquiring large-scale 3D bounding box annotations. Existing methods typically employ a teacher-student framework with pseudo-labeling to leverage unlabeled point clouds. However, producing reliable pseudo-labels in a diverse 3D space still remains challenging. In this work, we propose Diffusion-SS3D, a new perspective of enhancing the quality of pseudo-labels via the diffusion model for semi-supervised 3D object detection. Specifically, we include noises to produce corrupted 3D object size and class label distributions, and then utilize the diffusion model as a denoising process to obtain bounding box outputs. Moreover, we integrate the diffusion model into the teacher-student framework, so that the denoised bounding boxes can be used to improve pseudo-label generation, as well as the entire semi-supervised learning process. We conduct experiments on the ScanNet and SUN RGB-D benchmark datasets to demonstrate that our approach achieves state-of-the-art performance against existing methods. We also present extensive analysis to understand how our diffusion model design affects performance in semi-supervised learning.
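The corruption step described above follows the usual DDPM forward process, applied to (normalised) box sizes and class-label distributions rather than to images. A generic sketch of that forward step is shown below; how the denoised outputs feed back into the teacher-student loop is specific to the paper and not reproduced here.

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """DDPM forward (noising) step: corrupt clean targets x0 at timesteps t.

    x0:             (B, D) e.g. concatenated normalised box sizes and class-label
                    distributions of teacher proposals
    t:              (B,) integer timesteps
    alphas_cumprod: (T,) cumulative product of (1 - beta_t), as a tensor
    """
    noise = torch.randn_like(x0)
    shape = (-1,) + (1,) * (x0.dim() - 1)
    a = alphas_cumprod[t].sqrt().view(*shape)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(*shape)
    return a * x0 + s * noise, noise
```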

MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures

  • paper_url: http://arxiv.org/abs/2312.02963
  • repo_url: None
  • paper_authors: Zhangyang Xiong, Chenghong Li, Kenkun Liu, Hongjie Liao, Jianqiao Hu, Junyi Zhu, Shuliang Ning, Lingteng Qiu, Chongjie Wang, Shijie Wang, Shuguang Cui, Xiaoguang Han
  • for: Providing a large-scale 3D human dataset to enable human-centric data capture and analysis at scale, and exploring its potential across a range of visual tasks.
  • methods: A multi-view human capture system that makes large-scale collection of high-quality 3D human data easily scalable, with extensive annotations including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions.
  • results: Pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, 2D view-unconstrained human image generation, and 3D avatar generation show that the scale of MVHumanNet yields clear performance improvements and enables effective applications across these tasks.
    Abstract In this era, the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.

HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding

  • paper_url: http://arxiv.org/abs/2312.03050
  • repo_url: None
  • paper_authors: Trong-Thuan Nguyen, Pha Nguyen, Khoa Luu
  • for: Understanding interactivities within visual scenes; existing methods rely on simple relationship models and struggle with the diversity of appearance, situation, position, interaction, and relation in videos, which limits their grasp of complex visual dynamics.
  • methods: A Hierarchical Interlacement Graph (HIG) that uses a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks, evaluated through extensive experiments.
  • results: Outperforms other methods across all five tasks.
    Abstract Visual interactivity understanding within visual scenes presents a significant challenge in computer vision. Existing methods focus on complex interactivities while leveraging a simple relationship model. These methods, however, struggle with a diversity of appearance, situation, position, interaction, and relation in videos. This limitation hinders the ability to fully comprehend the interplay within the complex visual dynamics of subjects. In this paper, we delve into interactivities understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects. To achieve this goal, we first present a new dataset containing Appearance-Situation-Position-Interaction-Relation predicates, named ASPIRe, offering an extensive collection of videos marked by a wide range of interactivities. Then, we propose a new approach named Hierarchical Interlacement Graph (HIG), which leverages a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks. Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios.

Choroidalyzer: An open-source, end-to-end pipeline for choroidal analysis in optical coherence tomography

  • paper_url: http://arxiv.org/abs/2312.02956
  • repo_url: None
  • paper_authors: Justin Engelmann, Jamie Burke, Charlene Hamid, Megan Reid-Schachter, Dan Pugh, Neeraj Dhaun, Diana Moukaddem, Lyle Gray, Niall Strang, Paul McGraw, Amos Storkey, Paul J. Steptoe, Stuart King, Tom MacGillivray, Miguel O. Bernabeu, Ian J. C. MacCormick
  • for: To develop an open-source, end-to-end pipeline for segmenting the choroid region, vessels, and fovea, and deriving choroidal thickness, area, and vascular index.
  • methods: 5,600 OCT B-scans from 233 subjects, 6 systemic disease cohorts, and 3 device types were used; region and vessel ground truths were generated with state-of-the-art automatic methods followed by manual correction, foveal positions were annotated manually, and a U-Net deep-learning model was trained to detect the region, vessels, and fovea.
  • results: Excellent segmentation of the choroid region (Dice: internal 0.9789, external 0.9749), very good vessel segmentation (Dice: internal 0.8817, external 0.8703), and accurate fovea localization (MAE: internal 3.9 pixels, external 3.4 pixels); thickness, area, and vascular index correlate strongly with ground truth (Pearson: 0.9754, 0.9815, 0.8285 internal / 0.9831, 0.9779, 0.7948 external), and Choroidalyzer's agreement with two manual graders is comparable to the inter-grader agreement across all metrics.
    Abstract Purpose: To develop Choroidalyzer, an open-source, end-to-end pipeline for segmenting the choroid region, vessels, and fovea, and deriving choroidal thickness, area, and vascular index. Methods: We used 5,600 OCT B-scans (233 subjects, 6 systemic disease cohorts, 3 device types, 2 manufacturers). To generate region and vessel ground-truths, we used state-of-the-art automatic methods following manual correction of inaccurate segmentations, with foveal positions manually annotated. We trained a U-Net deep-learning model to detect the region, vessels, and fovea to calculate choroid thickness, area, and vascular index in a fovea-centred region of interest. We analysed segmentation agreement (AUC, Dice) and choroid metrics agreement (Pearson, Spearman, mean absolute error (MAE)) in internal and external test sets. We compared Choroidalyzer to two manual graders on a small subset of external test images and examined cases of high error. Results: Choroidalyzer took 0.299 seconds per image on a standard laptop and achieved excellent region (Dice: internal 0.9789, external 0.9749), very good vessel segmentation performance (Dice: internal 0.8817, external 0.8703) and excellent fovea location prediction (MAE: internal 3.9 pixels, external 3.4 pixels). For thickness, area, and vascular index, Pearson correlations were 0.9754, 0.9815, and 0.8285 (internal) / 0.9831, 0.9779, 0.7948 (external), respectively (all p<0.0001). Choroidalyzer's agreement with graders was comparable to the inter-grader agreement across all metrics. Conclusions: Choroidalyzer is an open-source, end-to-end pipeline that accurately segments the choroid and reliably extracts thickness, area, and vascular index. Especially choroidal vessel segmentation is a difficult and subjective task, and fully-automatic methods like Choroidalyzer could provide objectivity and standardisation.
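The reported Dice scores and the derived metrics (thickness, area, vascular index) can all be computed from binary masks in a few lines. The sketch below shows one plausible way to do this for a single B-scan; the exact metric definitions and pixel spacings used by Choroidalyzer are not reproduced here, and the spacing values shown are placeholders.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice overlap between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def choroid_metrics(region_mask, vessel_mask, pixel_mm=(0.0039, 0.0117)):
    """Derive thickness, area and vascular index from region/vessel masks of one B-scan.

    region_mask, vessel_mask: 2D binary arrays (rows = depth, columns = A-scans)
    pixel_mm: (axial, lateral) pixel spacing in mm -- placeholder values
    """
    dy, dx = pixel_mm
    region = region_mask.astype(bool)
    column_heights_mm = region.sum(axis=0) * dy               # per-A-scan thickness
    thickness_mm = column_heights_mm[column_heights_mm > 0].mean()
    area_mm2 = region.sum() * dy * dx
    vascular_index = vessel_mask.astype(bool)[region].mean()  # vessel fraction inside choroid
    return thickness_mm, area_mm2, vascular_index
```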

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

  • paper_url: http://arxiv.org/abs/2312.03048
  • repo_url: None
  • paper_authors: Yuru Jia, Lukas Hoyer, Shengyu Huang, Tianfu Wang, Luc Van Gool, Konrad Schindler, Anton Obukhov
  • for: Investigating whether large-scale latent diffusion models (LDMs) can serve as large-scale data generators to improve semantic segmentation for autonomous driving.
  • methods: An efficient data-generation pipeline, DGInStyle, that specializes a pretrained LDM to semantically controlled generation within a narrow domain, uses Multi-resolution Latent Fusion to counter the LDM's bias toward dominant objects, and applies a Style Swap technique to combine the rich generative prior with the learned semantic control.
  • results: A diverse street-scene dataset generated with DGInStyle is used to train a domain-agnostic semantic segmentation model; the generative augmentation consistently improves domain-generalization performance, in some cases by +2.5 mIoU over the previous state of the art.
    Abstract Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps. However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? We investigate this question in the context of autonomous driving, and answer it with a resounding "yes". We propose an efficient data generation pipeline termed DGInStyle. First, we examine the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain. Second, we design a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects. Third, we propose a Style Swap technique to endow the rich generative prior with the learned semantic control. Using DGInStyle, we generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets. Our approach consistently increases the performance of several domain generalization methods, in some cases by +2.5 mIoU compared to the previous state-of-the-art method without our generative augmentation scheme. Source code and dataset are available at https://dginstyle.github.io .

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2312.02949
  • repo_url: https://github.com/ux-decoder/llava-grounding
  • paper_authors: Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
  • for: Improving the grounding capability of large multimodal models (LMMs) in visual chat.
  • methods: A new Grounded Visual Chat (GVC) dataset and a model design that combines grounding and chat capabilities by connecting segmentation models with language models.
  • results: The model outperforms other LMMs on the proposed Grounding-Bench benchmark and achieves competitive results on classic grounding benchmarks such as RefCOCO/+/g and Flickr30K Entities.
    Abstract With the recent significant advancements in large multi-modal models (LMMs), the importance of their grounding capability in visual chat is increasingly recognized. Despite recent efforts to enable LMMs to support grounding, their capabilities for grounding and chat are usually separate, and their chat performance drops dramatically when asked to ground. The problem is the lack of a dataset for grounded visual chat (GVC). Existing grounding datasets only contain short captions. To address this issue, we have created GVC data that allows for the combination of grounding and chat capabilities. To better evaluate the GVC capabilities, we have introduced a benchmark called Grounding-Bench. Additionally, we have proposed a model design that can support GVC and various types of visual prompts by connecting segmentation models with language models. Experimental results demonstrate that our model outperforms other LMMs on Grounding-Bench. Furthermore, our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities. Our code will be released at https://github.com/UX-Decoder/LLaVA-Grounding .

Fast CT anatomic localization algorithm

  • paper_url: http://arxiv.org/abs/2312.02941
  • repo_url: None
  • paper_authors: Amit Oved
  • for: Determining the axial anatomical position of every slice in a CT scan, enabling fast retrieval of regions of interest for visual inspection and automated analysis.
  • methods: Only a fraction of the slices are localized directly; a linear model mapping slice index to estimated axial anatomical position is then fitted to those slices and used to assign a position to every slice in the scan.
  • results: Fast (under 1 second per scan regardless of size), accurate (typical median localization error of 1 cm), and robust to different noise sources, imaging protocols, metal-induced artifacts, anatomical deformations, etc.; a mapping confidence score acts as a fail-safe that rejects unreliable localizations on rare anomalous scans.
    Abstract Automatically determining the position of every slice in a CT scan is a basic yet powerful capability allowing fast retrieval of region of interest for visual inspection and automated analysis. Unlike conventional localization approaches which work at the slice level, we directly localize only a fraction of the slices and and then fit a linear model which maps slice index to its estimated axial anatomical position based on those slices. The model is then used to assign axial position to every slices of the scan. This approach proves to be both computationally efficient, with a typical processing time of less than a second per scan (regardless of its size), accurate, with a typical median localization error of 1 cm, and robust to different noise sources, imaging protocols, metal induced artifacts, anatomical deformations etc. Another key element of our approach is the introduction of a mapping confidence score. This score acts as a fail safe mechanism which allows a rejection of unreliable localization results in rare cases of anomalous scans. Our algorithm sets new State Of The Art results in terms of localization accuracy. It also offers a decrease of two orders of magnitude in processing time with respect to all published processing times. It was designed to be invariant to various scan resolutions, scan protocols, patient orientations, strong artifacts and various deformations and abnormalities. Additionally, our algorithm is the first one to the best of our knowledge which supports the entire body from head to feet and is not confined to specific anatomical region. This algorithm was tested on thousands of scans and proves to be very reliable and useful as a preprocessing stage for many applications.
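The two-stage idea -- localize only a subset of slices, then fit a linear slice-index-to-position model and use its residuals as a confidence signal -- is easy to sketch. The slice localizer itself is assumed to exist, and the confidence formula below is illustrative only, not the paper's.

```python
import numpy as np

def fit_axial_mapping(slice_indices, estimated_positions_mm):
    """Fit a linear model mapping slice index -> axial anatomical position (mm)
    from a localized subset of slices, and derive a simple confidence score."""
    slope, intercept = np.polyfit(slice_indices, estimated_positions_mm, deg=1)
    residuals = np.asarray(estimated_positions_mm) - (slope * np.asarray(slice_indices) + intercept)
    confidence = 1.0 / (1.0 + np.median(np.abs(residuals)))   # illustrative fail-safe score
    return (slope, intercept), confidence

def assign_all_positions(num_slices, model):
    """Assign an axial position to every slice of the scan using the fitted model."""
    slope, intercept = model
    return slope * np.arange(num_slices) + intercept
```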
    摘要 自动确定CT扫描中每个切片的位置是一项基础而强大的能力,可以快速检索感兴趣区域以供目视检查和自动分析。与在切片级别工作的传统定位方法不同,我们仅对一小部分切片直接定位,然后拟合一个线性模型,将切片索引映射到估计的轴向解剖位置,再用该模型为扫描中的所有切片分配位置。该方法计算高效(无论扫描大小,单次处理时间通常不到1秒)、精度高(中位定位误差约1厘米),并对不同噪声源、成像协议、金属伪影和解剖变形等具有鲁棒性。我们还引入了映射置信度分数,作为安全机制,在罕见的异常扫描中拒绝不可靠的定位结果。该算法在定位精度上达到新的最先进水平,处理时间比已发表的方法降低两个数量级,支持从头到脚的全身定位,并已在数千次扫描上验证了其可靠性和实用性。
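
A minimal sketch of the core fitting step is shown below, assuming some per-slice localizer has already produced axial positions for a handful of slices; the least-squares fit and the residual-based confidence score are simplified illustrations, not the authors' implementation.

```python
import numpy as np

def localize_scan(slice_positions_cm, slice_indices, n_slices):
    """Fit a linear model mapping slice index -> axial anatomical position (cm),
    then assign a position to every slice of the scan.

    slice_positions_cm: axial positions predicted for a small subset of slices
                        (assumed to come from some per-slice localizer).
    slice_indices:      indices of those slices within the scan.
    n_slices:           total number of slices in the scan.
    """
    idx = np.asarray(slice_indices, dtype=float)
    pos = np.asarray(slice_positions_cm, dtype=float)

    # Least-squares fit of pos ~ a * index + b.
    A = np.stack([idx, np.ones_like(idx)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, pos, rcond=None)

    # Simple confidence score: large residuals suggest an anomalous scan.
    residuals = pos - (a * idx + b)
    confidence = 1.0 / (1.0 + np.median(np.abs(residuals)))

    all_positions = a * np.arange(n_slices) + b
    return all_positions, confidence

# Toy usage: 6 localized slices out of a 300-slice scan.
sampled_idx = [10, 60, 110, 160, 210, 260]
sampled_pos = [5.1, 20.3, 35.0, 50.2, 64.8, 80.1]   # cm, hypothetical outputs
positions, conf = localize_scan(sampled_pos, sampled_idx, n_slices=300)
print(positions[:3], round(conf, 3))
```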

Drag-A-Video: Non-rigid Video Editing with Point-based Interaction

  • paper_url: http://arxiv.org/abs/2312.02936
  • repo_url: None
  • paper_authors: Yao Teng, Enze Xie, Yue Wu, Haoyu Han, Zhenguo Li, Xihui Liu
  • for: The paper targets interactive point-based video editing: users click pairs of handle points and target points, plus masks, on the first frame, and these point sets are propagated to the remaining frames.
  • methods: A diffusion-based method updates the video content through point-level correspondence. To keep the edits temporally coherent, a new video-level motion supervision is employed together with latent offsets applied at multiple denoising timesteps, and a temporal-consistent point tracking module coordinates the movement of the handle points.
  • results: The method precisely edits video content while preserving temporal consistency, as demonstrated on a variety of videos; examples are available at https://drag-a-video.github.io/.
    Abstract Video editing is a challenging task that requires manipulating videos on both the spatial and temporal dimensions. Existing methods for video editing mainly focus on changing the appearance or style of the objects in the video, while keeping their structures unchanged. However, there is no existing method that allows users to interactively ``drag'' any points of instances on the first frame to precisely reach the target points with other frames consistently deformed. In this paper, we propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video. Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video. Then, our method transforms the inputs into point sets and propagates these sets across frames. To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video and introduce the latent offsets to achieve this update at multiple denoising timesteps. We propose a temporal-consistent point tracking module to coordinate the movement of the points in the handle point sets. We demonstrate the effectiveness and flexibility of our method on various videos. The website of our work is available here: https://drag-a-video.github.io/.
    摘要 视频编辑是一项复杂的任务,需要在空间和时间维度上修改视频内容。现有的视频编辑方法主要是改变视频中对象的外观或风格,而保持对象的结构不变。然而,现在没有任何方法可以允许用户在第一帧视频中“拖”任意点,并在其他帧上准确地达到目标点。在本文中,我们提出了一种新的扩散基于的实时点基视频修改方法,称为Drag-A-Video。我们的方法允许用户在输入视频的第一帧上采用点对点和面Mask进行点击,然后将这些点集传播到其他帧。为了准确修改视频内容,我们采用了一种新的视频级运动监视,并引入了潜在偏移来实现这种更新。我们还提出了一个协调点跟踪模块,以确保点集在各个帧中的运动协调一致。我们在各种视频上展示了我们的方法的效果和灵活性。我们的工作网站的链接是:https://drag-a-video.github.io/.

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

  • paper_url: http://arxiv.org/abs/2312.02934
  • repo_url: https://github.com/fudan-zvg/wovogen
  • paper_authors: Jiachen Lu, Ze Huang, Jiahui Zhang, Zeyu Yang, Li Zhang
  • for: Researchers and developers working on autonomous driving, particularly multi-camera street-view video generation and scene understanding.
  • methods: WoVoGen introduces an explicit 4D world volume as a foundational element for video generation and operates in two phases: (i) envisioning the future 4D temporal world volume from vehicle control sequences, and (ii) generating multi-camera videos conditioned on this envisioned world volume and sensor interconnectivity.
  • results: WoVoGen generates high-quality, diverse and coherent street-view videos in response to vehicle control inputs while preserving intra-world consistency and inter-sensor coherence, and it also facilitates scene editing tasks.
    Abstract Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Due to the limitations in diversity and challenges in handling lighting conditions, traditional rendering-based methods are increasingly being supplanted by diffusion-based methods. However, a significant challenge in diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address these challenges, we combine an additional explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). This system is specifically designed to leverage 4D world volume as a foundational element for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos, informed by this envisioned 4D temporal world volume and sensor interconnectivity. The incorporation of the 4D world volume empowers WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to facilitate scene editing tasks.
    摘要 生成多摄像头街景视频是为自动驾驶数据集扩大和多样化的关键,以满足自动驾驶技术的不断发展和应用。然而,传统的渲染方法由于缺乏多样性和光照条件的控制,逐渐被替换为扩散方法。然而,扩散方法中保持摄像头数据的内部一致和外部协调性的问题仍然具有挑战性。为解决这些问题,我们提出了基于世界体积(4D世界体积)的多摄像头驾驶场景生成器(WoVoGen)。这种系统利用了4D世界体积作为生成多摄像头视频的基础元素,并在两个阶段进行操作:1. 根据车辆控制序列预测未来4D时间世界体积。2. 使用预测的4D时间世界体积和摄像头之间的连接信息,生成多摄像头视频。通过利用4D世界体积,WoVoGen不仅可以根据车辆控制输入生成高质量的街景视频,还可以帮助进行场景编辑任务。

LivePhoto: Real Image Animation with Text-guided Motion Control

  • paper_url: http://arxiv.org/abs/2312.02928
  • repo_url: None
  • paper_authors: Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao
  • for: Addresses text-guided motion control for image animation, letting users control the actions and camera movements of an animated image through text descriptions.
  • methods: The LivePhoto system augments a text-to-image generator (Stable Diffusion) with an image input and a motion module for temporal modeling, and adds a text re-weighting module and a motion intensity estimation module to reduce the ambiguity of text-to-motion mapping.
  • results: The system decodes motion-related textual instructions into videos (actions, camera movements, even conjuring new content) and offers motion intensity as an additional control signal for video customization.
    Abstract Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. Towards such a challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization.
    摘要 尽管现有的文本到视频生成技术已经做出了一些进步,但是现有的研究通常忽略了视频中的时间动作控制问题。面对这个挑战,这项工作提出了一个实用的系统——LivePhoto,允许用户通过文本描述控制自己 интере点的图像动作。我们首先确立了一个强大的基线,即使用Stable Diffusion文本到图像生成器(i.e., Stable Diffusion),并将其与一个时间模型相结合。我们然后提出了一种特制的训练管道,以更好地联系文本和动作。具体来说,我们注意到以下两点:(1)文本只能描述动作 roughly(例如无论移动速度),(2)文本可能包含内容和动作描述。为了减少文本到动作映射的抽象性,我们引入了动作强度估计模块以及文本重新权重模块。实验证明,我们的方法可以很好地将文本指令转化为视频中的动作,例如行为、相机运动或者even创造新的内容(如倒流水到空瓶中)。同时,由于我们提出的强度学习机制,我们的系统还为用户提供了一个额外的控制信号(即动作强度),以便为视频定制。

MagicStick: Controllable Video Editing via Control Handle Transformations

  • paper_url: http://arxiv.org/abs/2312.03047
  • repo_url: https://github.com/mayuelala/magicstick
  • paper_authors: Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen
  • for: Proposes a controllable video editing method that edits video properties (shape, size, location, motion, etc.) by transforming extracted internal control signals such as object edge maps or human pose.
  • methods: Keyframe transformations of the internal control signals are propagated to other frames; a pretrained image diffusion model and ControlNet are inflated to the temporal dimension with LoRA layers fitted to the specific scene, and an inversion-and-editing framework with a proposed attention remix provides attention guidance.
  • results: The method is the first to demonstrate video property editing from a pretrained text-to-image model; experiments on numerous examples show better temporal consistency and editing capability than shape-aware text-based editing and handcrafted motion video generation.
    Abstract Text-based video editing has recently attracted considerable interest in changing the style or replacing the objects with a similar structure. Beyond this, we demonstrate that properties such as shape, size, location, motion, etc., can also be edited in videos. Our key insight is that the keyframe transformations of the specific internal feature (e.g., edge maps of objects or human pose), can easily propagate to other frames to provide generation guidance. We thus propose MagicStick, a controllable video editing method that edits the video properties by utilizing the transformation on the extracted internal control signals. In detail, to keep the appearance, we inflate both the pretrained image diffusion model and ControlNet to the temporal dimension and train low-rank adaptions (LORA) layers to fit the specific scenes. Then, in editing, we perform an inversion and editing framework. Differently, finetuned ControlNet is introduced in both inversion and generation for attention guidance with the proposed attention remix between the spatial attention maps of inversion and editing. Yet succinct, our method is the first method to show the ability of video property editing from the pre-trained text-to-image model. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating our superior temporal consistency and editing capability than previous works. The code and models will be made publicly available.
    摘要 文本基于视频编辑在最近吸引了许多关注,主要是修改样式或者替换对象的类似结构。此外,我们还证明了视频中的属性,如形状、大小、位置、运动等,也可以通过编辑。我们的关键发现是,特定的内部特征(例如对象的边极图或人体姿势)的关键帧变换可以轻松地传播到其他帧,以提供生成指南。因此,我们提出了MagicStick,一种可控的视频编辑方法,通过利用特定内部控制信号的变换来编辑视频属性。在详细的实现方式下,为保持外观,我们将预训练的图像扩散模型和ControlNet扩展到时间维度,并使用低级杂化层(LORA)来适应特定的场景。在编辑时,我们将执行反向和编辑框架。与之不同的是,我们在编辑和反向中都进行了фиinetuning ControlNet,以使用提议的注意力混合来提供注意力导航。简而言之,我们的方法是首次通过文本到图像模型来实现视频属性编辑的方法。我们在多个例子中进行了实验,并与Shape-aware文本基于编辑和手动制作的运动视频生成进行比较,demonstrating our superior temporal consistency和编辑能力。代码和模型将公开发布。

Split & Merge: Unlocking the Potential of Visual Adapters via Sparse Training

  • paper_url: http://arxiv.org/abs/2312.02923
  • repo_url: https://github.com/theia-4869/mosa
  • paper_authors: Qizhe Zhang, Bocheng Zou, Ruichuan An, Jiaming Liu, Shanghang Zhang
  • for: Improve the efficiency and performance of Adapter Tuning so that it transfers better across tasks and settings.
  • methods: The standard adapter is split into multiple non-overlapping modules, modules are stochastically activated for sparse training, and they are finally merged into a complete adapter after tuning; a hierarchical sparse strategy further exploits limited training data (a minimal sketch follows this entry).
  • results: Extensive experiments on 27 visual tasks show that MoSA significantly outperforms other Adapter Tuning methods and baselines, and achieves satisfactory results in low-resource and multi-task settings.
    Abstract With the rapid growth in the scale of pre-trained foundation models, parameter-efficient fine-tuning techniques have gained significant attention, among which Adapter Tuning is the most widely used. Despite achieving efficiency, Adapter Tuning still underperforms full fine-tuning, and the performance improves at the cost of an increase in parameters. Recent efforts address this issue by pruning the original adapters, but it also introduces training instability and suboptimal performance on certain datasets. Motivated by this, we propose Mixture of Sparse Adapters, or MoSA, as a novel Adapter Tuning method to fully unleash the potential of each parameter in the adapter. We first split the standard adapter into multiple non-overlapping modules, then stochastically activate modules for sparse training, and finally merge them to form a complete adapter after tuning. In this way, MoSA can achieve significantly better performance than standard adapters without any additional computational or storage overhead. Furthermore, we propose a hierarchical sparse strategy to better leverage limited training data. Extensive experiments on a series of 27 visual tasks demonstrate that MoSA consistently outperforms other Adapter Tuning methods as well as other baselines by a significant margin. Furthermore, in two challenging scenarios with low-resource and multi-task settings, MoSA achieves satisfactory results, further demonstrating the effectiveness of our design. Our code will be released.
    摘要 通过预训练基础模型的规模快速增长,参数稀缺化精细调教技术得到了广泛关注,其中Adapter Tuning是最为广泛使用的。尽管它可以提高效率,但它仍然比全面调教下降性能,而且性能改善的代价是增加参数的数量。现有努力通过修剪原始adapter来解决这个问题,但它也会引入训练不稳定和数据集特定的表现下降。为此,我们提出了一种多 sparse adapter(MoSA),作为一种新的Adapter Tuning方法,以全面发挥每个参数在adapter中的潜力。我们首先将标准adapter分解成多个不重叠的模块,然后随机启用模块进行极 sparse训练,并最后将它们合并为一个完整的adapter после调教。这样,MoSA可以在不增加计算或存储开销的情况下,实现了标准adapter的显著性能提升。此外,我们还提出了一种层次 sparse策略,以更好地利用有限的训练数据。我们对27个视觉任务进行了广泛的实验,结果表明,MoSA可以在Adapter Tuning方法和其他基elines之上呈现出显著的性能优势,并且在低资源和多任务设置下也能达到满意的结果。我们将代码发布。
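
A conceptual sketch of the split/activate/merge idea is shown below in PyTorch: the adapter bottleneck is split into non-overlapping low-rank modules, a random subset is activated during training, and all modules act together at inference. This is an interpretation for illustration, not the official MoSA code; the module granularity, activation rule, and merging scheme are assumptions.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAdapter(nn.Module):
    """Sketch of a mixture-of-sparse-adapters layer: the bottleneck adapter is
    split into `num_modules` non-overlapping low-rank modules; during training
    a random subset is activated, and at inference all modules contribute."""

    def __init__(self, dim, bottleneck=64, num_modules=4, active_per_step=1):
        super().__init__()
        assert bottleneck % num_modules == 0
        r = bottleneck // num_modules
        self.down = nn.ModuleList([nn.Linear(dim, r) for _ in range(num_modules)])
        self.up = nn.ModuleList([nn.Linear(r, dim) for _ in range(num_modules)])
        self.num_modules = num_modules
        self.active_per_step = active_per_step

    def forward(self, x):
        if self.training:  # sparse training: only a random subset is active
            active = random.sample(range(self.num_modules), self.active_per_step)
        else:              # "merged" adapter: all modules act together
            active = range(self.num_modules)
        out = torch.zeros_like(x)
        for k in active:
            out = out + self.up[k](F.relu(self.down[k](x)))
        return x + out  # residual adapter output

# Toy usage on a batch of token features.
adapter = SparseAdapter(dim=768)
tokens = torch.randn(2, 16, 768)
adapter.train()
y_train = adapter(tokens)
adapter.eval()
y_eval = adapter(tokens)
print(y_train.shape, y_eval.shape)
```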

Fine-grained Controllable Video Generation via Object Appearance and Context

  • paper_url: http://arxiv.org/abs/2312.02919
  • repo_url: None
  • paper_authors: Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
  • for: Enable fine-grained controllable video generation without fine-tuning the model.
  • methods: A unified framework injects control signals (object appearance and context, including location and category) into an existing text-to-video model via a joint encoder and adaptive cross-attention layers.
  • results: Achieves a 70% improvement in controllability metrics over competitive baselines.
    Abstract Text-to-video generation has shown promising results. However, by taking only natural languages as input, users often face difficulties in providing detailed information to precisely control the model's output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layer, we adapt the model to generate videos that are aligned with both text prompts and fine-grained control. Compared to existing methods relying on dense control signals such as edge maps, we provide a more intuitive and user-friendly interface to allow object-level fine-grained control. Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.
    摘要 文本到视频生成技术已经展示了有望的结果。然而,通过只接受自然语言作为输入,用户经常遇到减少细节信息的困难,以便精确控制模型的输出。在这种情况下,我们提议细化可控视频生成(FACTOR)以实现细节控制。特别是,FACTOR目标控制对象的外观和上下文,包括其位置和类别,与文本提示相协调。为了实现细节控制,我们提议一种统一框架,将控制信号直接注入到现有的文本到视频模型中。我们的模型包括共同编码器和适应性跨度注意力层。通过优化编码器和插入层,我们适应了模型以生成与文本提示和细节控制相对应的视频。与以前基于粗糙控制信号such as edge maps的方法相比,我们提供了更直观和用户友好的界面,允许对象级别的细节控制。我们的方法可以在不进行finetuning的情况下实现对象外观的控制,从而降低用户每个人优化的努力。我们的实验表明,我们的模型在标准 benchmark数据集和用户提供的输入上实现了70%的控制度指标提高,比基eline方法更高。

Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration

  • paper_url: http://arxiv.org/abs/2312.02918
  • repo_url: None
  • paper_authors: Yuang Ai, Huaibo Huang, Xiaoqiang Zhou, Jiexiang Wang, Ran He
  • for: Proposes a multimodal prompt learning approach (MPerceiver) for all-in-one image restoration under complex real-world degradations.
  • methods: Stable Diffusion priors are exploited through a dual-branch module that learns two kinds of prompts, textual prompts for holistic representation and visual prompts for multiscale detail representation, both dynamically adjusted by degradation predictions from the CLIP image encoder; a plug-in detail refinement module further improves restoration fidelity.
  • results: Trained on 9 restoration tasks, MPerceiver outperforms state-of-the-art task-specific methods on most of them, and after multitask pre-training it shows remarkable zero-shot and few-shot capability on unseen tasks across 16 tasks and 26 benchmarks.
    Abstract Despite substantial progress, all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness, generalizability and fidelity for all-in-one image restoration. Specifically, we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder, enabling adaptive responses to diverse unknown degradations. Moreover, a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method, MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across most tasks. Post multitask pre-training, MPerceiver attains a generalized representation in low-level vision, exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks and 26 benchmarks underscore the superiority of MPerceiver in terms of adaptiveness, generalizability and fidelity.

MIND: Multi-Task Incremental Network Distillation

  • paper_url: http://arxiv.org/abs/2312.02916
  • repo_url: https://github.com/lsabetta/mind
  • paper_authors: Jacopo Bonato, Francesco Pelosin, Luigi Sabetta, Alessandro Nicolosi
  • for: Addressing the challenges of Class-Incremental and Domain-Incremental learning in resource-constrained environments.
  • methods: Two alternative distillation procedures and the optimization of BatchNorm layers across tasks inside sub-networks.
  • results: Outperforms all state-of-the-art methods for rehearsal-free Class-Incremental learning, with an increment in classification accuracy of +6% on CIFAR-100/10 and +10% on TinyImageNet/10, and up to +40% accuracy in Domain-Incremental scenarios.
    Abstract The recent surge in pervasive devices generating dynamic data streams has underscored the necessity for learning systems to adapt to data distributional shifts continually. To tackle this challenge, the research community has put forth a spectrum of methodologies, including the demanding pursuit of class-incremental learning without replay data. In this study, we present MIND, a parameter isolation method that aims to significantly enhance the performance of replay-free solutions and achieve state-of-the-art results on several widely studied datasets. Our approach introduces two main contributions: two alternative distillation procedures that significantly improve the efficiency of MIND increasing the accumulated knowledge of each sub-network, and the optimization of the BachNorm layers across tasks inside the sub-networks. Overall, MIND outperforms all the state-of-the-art methods for rehearsal-free Class-Incremental learning (with an increment in classification accuracy of approx. +6% on CIFAR-100/10 and +10% on TinyImageNet/10) reaching up to approx. +40% accuracy in Domain-Incremental scenarios. Moreover, we ablated each contribution to demonstrate its impact on performance improvement. Our results showcase the superior performance of MIND indicating its potential for addressing the challenges posed by Class-incremental and Domain-Incremental learning in resource-constrained environments.
    摘要 Our approach has two main contributions: (1) two alternative distillation procedures that improve the efficiency of MIND and increase the accumulated knowledge of each sub-network, and (2) optimization of the BatchNorm layers across tasks inside the sub-networks. Overall, MIND outperforms all state-of-the-art methods for rehearsal-free Class-Incremental learning, with an increase in classification accuracy of approximately +6% on CIFAR-100/10 and +10% on TinyImageNet/10, reaching up to approximately +40% accuracy in Domain-Incremental scenarios. Ablation studies confirm the impact of each contribution, indicating MIND's potential for Class-Incremental and Domain-Incremental learning in resource-constrained environments.

Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training

  • paper_url: http://arxiv.org/abs/2312.02914
  • repo_url: None
  • paper_authors: Arun Reddy, William Paul, Corban Rivera, Ketul Shah, Celso M. de Melo, Rama Chellappa
  • for: Unsupervised domain adaptation (UDA) for video action recognition.
  • methods: UNITE uses an image teacher model to adapt a video student model to the target domain: first, self-supervised pre-training with a teacher-guided masked distillation objective promotes discriminative feature learning on target-domain videos; then, the video student and image teacher together generate improved pseudolabels for unlabeled target videos during self-training (a minimal sketch of the pseudolabel fusion follows this entry).
  • results: Evaluated on multiple video domain adaptation benchmarks, the approach yields significant improvements over previously reported results.
    Abstract In this work, we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach, which we call UNITE, uses an image teacher model to adapt a video student model to the target domain. UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data, using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results.
    摘要 在这项工作中,我们解决了无监督领域适应(USD)视频动作识别问题。我们的方法,我们称之为UNITE,使用一个图像老师模型来适应目标领域的视频学生模型。UNITE首先使用自我超vision的预训练来促进目标领域视频中的特征学习,使用老师导航的封面挑战目标。然后,我们在封面目标数据上进行自我训练,使用视频学生模型和图像老师模型共同生成改进的假标签 для未标注的目标视频。我们的自我训练过程成功地利用了两个模型的优势,实现了强大的适应性 across 频率。我们在多个视频领域适应 benchmark 上评估了我们的方法,并观察到了显著的改进。
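
The pseudolabel-generation step can be sketched as fusing the two models' class probabilities and keeping only confident clips. The simple averaging rule and confidence threshold below are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import torch

@torch.no_grad()
def fuse_pseudolabels(student_logits, teacher_logits, threshold=0.8):
    """Combine video-student and image-teacher predictions into pseudolabels.

    student_logits: (N, C) clip-level logits from the video student.
    teacher_logits: (N, C) logits from the image teacher (e.g. averaged over
                    sampled frames of each clip).
    Returns (labels, mask) where mask marks clips confident enough to keep.
    """
    p_student = student_logits.softmax(dim=-1)
    p_teacher = teacher_logits.softmax(dim=-1)
    p_fused = 0.5 * (p_student + p_teacher)      # simple average of the two views
    conf, labels = p_fused.max(dim=-1)
    mask = conf >= threshold                      # keep only confident pseudolabels
    return labels, mask

# Toy usage with 4 unlabeled clips and 10 action classes.
s = torch.randn(4, 10)
t = torch.randn(4, 10)
labels, mask = fuse_pseudolabels(s, t, threshold=0.5)
print(labels.tolist(), mask.tolist())
```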

Realistic Scatterer Based Adversarial Attacks on SAR Image Classifiers

  • paper_url: http://arxiv.org/abs/2312.02912
  • repo_url: None
  • paper_authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna, Carl Busart, Lance Kaplan
  • for: Proposes a new physical adversarial attack, the On-Target Scatterer Attack (OTSA), to mislead SAR image classifiers.
  • methods: The attack perturbs the SAR image through physical actions, placing additional false objects as scatterers around the on-ground target. To keep the attack physically executable, scatterers are constrained to lie on the target rather than in shadow regions or the background, via a Gaussian-kernel positioning score optimized by gradient ascent (a minimal sketch follows this entry).
  • results: Experiments show that OTSA obtains significantly higher success rates under the positioning constraint than existing methods.
    Abstract Adversarial attacks have highlighted the vulnerability of classifiers based on machine learning for Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) tasks. An adversarial attack perturbs SAR images of on-ground targets such that the classifiers are misled into making incorrect predictions. However, many existing attacking techniques rely on arbitrary manipulation of SAR images while overlooking the feasibility of executing the attacks on real-world SAR imagery. Instead, adversarial attacks should be able to be implemented by physical actions, for example, placing additional false objects as scatterers around the on-ground target to perturb the SAR image and fool the SAR ATR. In this paper, we propose the On-Target Scatterer Attack (OTSA), a scatterer-based physical adversarial attack. To ensure the feasibility of its physical execution, we enforce a constraint on the positioning of the scatterers. Specifically, we restrict the scatterers to be placed only on the target instead of in the shadow regions or the background. To achieve this, we introduce a positioning score based on Gaussian kernels and formulate an optimization problem for our OTSA attack. Using a gradient ascent method to solve the optimization problem, the OTSA can generate a vector of parameters describing the positions, shapes, sizes and amplitudes of the scatterers to guide the physical execution of the attack that will mislead SAR image classifiers. The experimental results show that our attack obtains significantly higher success rates under the positioning constraint compared with the existing method.
    摘要 对抗攻击揭示了基于机器学习的合成孔径雷达(SAR)自动目标识别(ATR)分类器的脆弱性:攻击者扰动地面目标的SAR图像,使分类器做出错误预测。然而,许多现有攻击技术依赖对SAR图像的任意修改,忽视了在真实SAR成像中执行攻击的可行性。攻击应当能够通过物理操作实现,例如在地面目标周围放置额外的虚假散射体来扰动SAR图像、欺骗SAR ATR。本文提出On-Target Scatterer Attack(OTSA),一种基于散射体的物理对抗攻击。为确保物理执行的可行性,我们约束散射体只能放置在目标上,而不能放在阴影区域或背景中。为此,我们引入基于高斯核的位置评分,并将OTSA攻击表述为一个优化问题;通过梯度上升法求解,OTSA生成描述散射体位置、形状、大小和幅度的参数向量,用于指导物理攻击的执行,从而误导SAR图像分类器。实验结果表明,在位置约束下,我们的攻击成功率显著高于现有方法。
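
A rough illustration of the positioning-score idea follows: the score is a sum of Gaussian kernels centred on target pixels, so it is large only when scatterers sit on the target, and scatterer positions can be nudged by gradient ascent on it. The kernel form, the numerical gradient, and all names are illustrative assumptions; the real attack also optimizes shapes, sizes and amplitudes jointly with the classifier-fooling objective.

```python
import numpy as np

def positioning_score(scatterers, target_pixels, sigma=2.0):
    """Sum of Gaussian kernels centred on target pixels, evaluated at the
    scatterer positions. A high score means scatterers lie on the target."""
    diff = scatterers[:, None, :] - target_pixels[None, :, :]      # (S, T, 2)
    d2 = (diff ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def ascend_positions(scatterers, target_pixels, lr=0.5, steps=50, eps=1e-3):
    """Numerical gradient ascent on the positioning score (illustrative only)."""
    x = scatterers.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        base = positioning_score(x, target_pixels)
        for i in range(x.shape[0]):
            for j in range(2):
                x_eps = x.copy()
                x_eps[i, j] += eps
                grad[i, j] = (positioning_score(x_eps, target_pixels) - base) / eps
        x += lr * grad
    return x

# Toy usage: 3 scatterers drift toward a small square "target" region.
target = np.array([[10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
init = np.array([[5.0, 5.0], [15.0, 6.0], [8.0, 16.0]])
print(ascend_positions(init, target).round(1))
```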

Rare Galaxy Classes Identified In Foundation Model Representations

  • paper_url: http://arxiv.org/abs/2312.02910
  • repo_url: None
  • paper_authors: Mike Walmsley, Anna M. M. Scaife
  • for: identify rare and visually distinctive galaxy populations
  • methods: use pretrained models to search for structure in learned representations, cluster approach to isolate specific local patterns
  • results: reveal groups of galaxies with rare and scientifically-interesting morphologies
    Abstract We identify rare and visually distinctive galaxy populations by searching for structure within the learned representations of pretrained models. We show that these representations arrange galaxies by appearance in patterns beyond those needed to predict the pretraining labels. We design a clustering approach to isolate specific local patterns, revealing groups of galaxies with rare and scientifically-interesting morphologies.
    摘要 我们通过在预训练模型学到的表示中寻找结构,来识别罕见且视觉上独特的星系群体。我们发现这些表示按外观组织星系的方式超出了预测预训练标签所需的范围。我们设计了一种聚类方法来分离特定的局部模式,从而揭示出具有罕见且具有科学价值形态的星系群。
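
The general recipe, clustering frozen representations from a pretrained model and inspecting the smallest clusters as candidate rare classes, can be sketched in a few lines. KMeans is only a stand-in for whatever clustering the authors actually use, and the feature array below is simulated rather than real galaxy embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume `features` holds frozen embeddings of N galaxies from a pretrained
# model (random data stands in for real representations here).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128))

# Cluster the representation space; small clusters are candidate rare classes.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_

sizes = np.bincount(labels, minlength=20)
rare_clusters = np.argsort(sizes)[:3]          # the three smallest clusters
for c in rare_clusters:
    members = np.where(labels == c)[0]
    print(f"cluster {c}: {sizes[c]} galaxies, e.g. indices {members[:5].tolist()}")
```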

Deep Learning Segmentation of Spiral Arms and Bars

  • paper_url: http://arxiv.org/abs/2312.02908
  • repo_url: https://github.com/mwalmsley/zoobot-3d
  • paper_authors: Mike Walmsley, Ashley Spindler
  • for: Presents the first deep learning model for segmenting galactic spiral arms and bars.
  • methods: A deep learning segmentation model predicts spiral arm and bar masks; in a blinded assessment by expert astronomers, the predicted spiral arm masks are preferred over current automated methods (99% of evaluations) and over the original volunteer labels (79% of evaluations).
  • results: Experts rated the predicted spiral arm masks as "mostly good" to "perfect" in 89% of evaluations, and bar lengths derived from the predicted bar masks agree excellently with a dedicated crowdsourcing project. The pixelwise precision of these masks, previously impossible at scale, will underpin new research into how spiral arms and bars evolve.
    Abstract We present the first deep learning model for segmenting galactic spiral arms and bars. In a blinded assessment by expert astronomers, our predicted spiral arm masks are preferred over both current automated methods (99% of evaluations) and our original volunteer labels (79% of evaluations). Experts rated our spiral arm masks as `mostly good' to `perfect' in 89% of evaluations. Bar lengths trivially derived from our predicted bar masks are in excellent agreement with a dedicated crowdsourcing project. The pixelwise precision of our masks, previously impossible at scale, will underpin new research into how spiral arms and bars evolve.
    摘要 我们介绍了首个深度学习模型,用于Segmenting galactic spiral arms和bars。在由专家天文学家Blind assessment中,我们预测的旋回臂Masks被评估为当前自动方法(99%评估)和原始志愿标签(79%评估)都被首选。专家对我们的旋回臂Masks评估为“大部分好”到“完美”的89%。来自我们预测的棒长,可以在一个专门的人工智能项目中得到了高度一致。我们的掩码精度,以前无法在大规模实现,将对旋回臂和棒的演化做出新的研究贡献。

HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.02902
  • repo_url: None
  • paper_authors: Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero
  • for: 3D head animation quality and runtime improvement
  • methods: 3D Gaussian Splats (3DGS) and hybrid model with learnable latent features
  • results: state-of-the-art results in real-time inference frame rates, with up to ~2dB improvement and x10 acceleration in rendering speed
    Abstract 3D head animation has seen major quality and runtime improvements over the last few years, particularly empowered by the advances in differentiable rendering and neural radiance fields. Real-time rendering is a highly desirable goal for real-world applications. We propose HeadGaS, the first model to use 3D Gaussian Splats (3DGS) for 3D head reconstruction and animation. In this paper we introduce a hybrid model that extends the explicit representation from 3DGS with a base of learnable latent features, which can be linearly blended with low-dimensional parameters from parametric head models to obtain expression-dependent final color and opacity values. We demonstrate that HeadGaS delivers state-of-the-art results in real-time inference frame rates, which surpasses baselines by up to ~2dB, while accelerating rendering speed by over x10.
    摘要 3D头部动画在近几年中在质量和运行时间方面取得了重大进展,尤其得益于可微渲染和神经辐射场的发展。实时渲染是真实应用中非常重要的目标。我们提出HeadGaS,首个使用3D Gaussian Splats(3DGS)进行3D头部重建和动画的模型。本文提出一种混合模型,在3DGS的显式表示基础上加入可学习的潜在特征,这些特征可以与参数化头部模型的低维参数线性混合,得到依赖于表情的最终颜色和不透明度值。实验表明,HeadGaS在实时推理帧率下取得最先进的结果,比基线方法最高提升约2dB,同时渲染速度加快10倍以上。

Diversified in-domain synthesis with efficient fine-tuning for few-shot classification

  • paper_url: http://arxiv.org/abs/2312.03046
  • repo_url: https://github.com/vturrisi/disef
  • paper_authors: Victor G. Turrisi da Costa, Nicola Dall’Asen, Yiming Wang, Nicu Sebe, Elisa Ricci
  • for: Improve the generalization of few-shot image classifiers across diverse classification tasks.
  • methods: A text-to-image generation model synthesizes high-quality in-domain images (leveraging the real samples and captions from an advanced captioning model) to diversify the training data, combined with effective fine-tuning of the text and image encoders via Low-Rank Adaptation (LoRA; a minimal LoRA sketch follows this entry).
  • results: Experiments on ten different benchmarks consistently outperform the baselines and establish a new state of the art for few-shot classification.
    Abstract Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class. A recent research direction for improving few-shot classifiers involves augmenting the labelled samples with synthetic images created by state-of-the-art text-to-image generation models. Following this trend, we propose Diversified in-domain synthesis with efficient fine-tuning (DISEF), a novel approach which addresses the generalization challenge in few-shot learning using synthetic data. DISEF consists of two main components. First, we propose a novel text-to-image augmentation pipeline that, by leveraging the real samples and their rich semantics coming from an advanced captioning model, promotes in-domain sample diversity for better generalization. Second, we emphasize the importance of effective model fine-tuning in few-shot recognition, proposing to use Low-Rank Adaptation (LoRA) for joint adaptation of the text and image encoders in a Vision Language Model. We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification. Code is available at \url{https://github.com/vturrisi/disef}
    摘要 《几个示例图像分类》目标是通过少量标注示例来学习图像分类器。现有研究方向是通过使用现代文本生成图像模型来生成synthetic图像来提高几个示例分类器的性能。我们提出了一种新的方法,即含括性域同步生成(DISEF),以解决几个示例学习中的通用化挑战。DISEF包括两个主要组成部分。第一部分是一种新的文本生成图像扩充管道,通过利用真实的样本和它们的高度 semantics来提高域内样本多样性,从而提高泛化性。第二部分是强调有效的模型练习,我们提出了使用矩阵适应(LoRA)来适应文本和图像编码器在视觉语言模型中的集成。我们在十个不同的benchmark上验证了我们的方法, consistently outperforming baselines,并在几个示例分类中创造了新的状态之术。代码可以在 \url{https://github.com/vturrisi/disef} 中找到。
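
The Low-Rank Adaptation used for fine-tuning can be illustrated with a minimal wrapper around a linear layer; in DISEF the same idea would be applied jointly to layers of the text and image encoders of a vision-language model, which this sketch does not attempt.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (W + B @ A)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the backbone weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Toy usage: wrap one projection layer; only A and B receive gradients.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
out = layer(torch.randn(4, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, "trainable params:", trainable)
```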

BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

  • paper_url: http://arxiv.org/abs/2312.02896
  • repo_url: https://github.com/aifeg/benchlmm
  • paper_authors: Rizhao Cai, Zirui Song, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot
  • for: Evaluate the robustness of large multimodal models (LMMs) under different visual styles.
  • methods: A new benchmark, BenchLMM, assesses LMMs against three style shifts (artistic image style, imaging sensor style, and application style), each with five sub-styles; state-of-the-art LMMs are comprehensively evaluated, and a versatile, training-free method that prompts the LMM to predict the style first is proposed to enhance reasoning.
  • results: LMMs generally suffer performance degradation under style shifts; strong performance in the common style does not guarantee strong performance in other styles; and the proposed style-prediction prompting improves reasoning, suggesting that understanding cross-style behaviour is key to developing more intelligent and versatile LMMs.
    Abstract Large Multimodal Models (LMMs) such as GPT-4V and LLaVA have shown remarkable capabilities in visual reasoning with common image styles. However, their robustness against diverse style shifts, crucial for practical applications, remains largely unexplored. In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs against three different styles: artistic image style, imaging sensor style, and application style, where each style has five sub-styles. Utilizing BenchLMM, we comprehensively evaluate state-of-the-art LMMs and reveal: 1) LMMs generally suffer performance degradation when working with other styles; 2) An LMM performs better than another model in common style does not guarantee its superior performance in other styles; 3) LMMs' reasoning capability can be enhanced by prompting LMMs to predict the style first, based on which we propose a versatile and training-free method for improving LMMs; 4) An intelligent LMM is expected to interpret the causes of its errors when facing stylistic variations. We hope that our benchmark and analysis can shed new light on developing more intelligent and versatile LMMs.
    摘要 大型多modal模型(LMM),如GPT-4V和LLaVA,在常见图像风格下显示了惊人的视觉理解能力。然而,它们对各种风格变化的可靠性,在实际应用中非常重要,尚未得到了充分探讨。在这篇论文中,我们提出了一个新的标准 benchMark,即BenchLMM,用于评估LMMs对三种不同风格的可靠性:艺术风格、摄像头风格和应用风格,每种风格又有五种子风格。通过使用BenchLMM,我们对当今最佳的LMMs进行了全面的评估,并发现了以下结论:1)LMMs在不同风格下工作时通常会导致性能下降; 2)一个LMM在常见风格下表现良好不一定意味着它在其他风格下也会表现良好; 3)LMMs的理解能力可以通过向LMMs提供风格predicting的Prompt来提高; 4)一个智能LMM应该能够解释它在风格变化时的错误原因。我们希望通过我们的benchmark和分析,可以为开发更智能和多样化的LMMs提供新的思路。

Customization Assistant for Text-to-image Generation

  • paper_url: http://arxiv.org/abs/2312.03045
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Tong Sun
  • for: Propose a customization assistant, built on a pretrained large language model and a diffusion model, that supports customized text-to-image generation without test-time fine-tuning.
  • methods: A new model design and a novel training strategy allow the assistant to perform customized generation in 2-5 seconds without any test-time fine-tuning, while supporting user-friendly interaction through either ambiguous text or clear instructions.
  • results: Extensive experiments across different domains show competitive results, demonstrating the effectiveness of the proposed method.
    Abstract Customizing pre-trained text-to-image generation model has attracted massive research interest recently, due to its huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in a single user-input image, their capabilities are still far from perfect. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, but their performance is unsatisfactory. Furthermore, the interaction between users and models is still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on pre-trained large language model and diffusion model, which can not only perform customized generation in a tuning-free manner, but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instruction. Specifically, we propose a new framework consists of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test time fine-tuning. Extensive experiments are conducted, competitive results have been obtained across different domains, illustrating the effectiveness of the proposed method.
    摘要 In this work, we propose a customization assistant based on pre-trained large language models and diffusion models, which can perform customized generation without fine-tuning and enable more user-friendly interactions. Users can chat with the assistant and input either ambiguous text or clear instructions, and the resulting assistant can generate customized images in 2-5 seconds. Our proposed framework consists of a new model design and a novel training strategy, and we have obtained competitive results across different domains through extensive experiments, demonstrating the effectiveness of our method.

Towards More Practical Group Activity Detection: A New Benchmark and Model

  • paper_url: http://arxiv.org/abs/2312.02878
  • repo_url: https://github.com/dk-kim/CAFE_codebase
  • paper_authors: Dongkeun Kim, Youngkil Song, Minsu Cho, Suha Kwak
  • for: Improve group activity detection (GAD) methods and datasets so they better reflect practical scenarios.
  • methods: A new GAD model that efficiently and effectively handles an unknown number of groups and latent group members, together with a new dataset, Café, which is constructed primarily for GAD and offers more practical evaluation scenarios and metrics at large scale with rich annotations.
  • results: Evaluated on three datasets including Café, the model outperforms previous work in both accuracy and inference speed.
    Abstract Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Caf\'e. Unlike existing datasets, Caf\'e is constructed primarily for GAD and presents more practical evaluation scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Caf\'e, where it outperformed previous work in terms of both accuracy and inference speed. Both our dataset and code base will be open to the public to promote future research on GAD.
    摘要 群体活动检测(GAD)旨在同时识别视频中每个群体的成员并对群体的活动进行分类。尽管GAD近来受到研究,但现有数据集和方法在应对实际GAD场景方面仍有很大改进空间。为此,我们首先提出了一个新的数据集Café。与现有数据集不同,Café主要为GAD构建,提供了更贴近实际的评估场景和指标,并且规模大、标注丰富。同时,我们提出了一个新的GAD模型,能够高效且有效地处理未知数量的群体和潜在的群体成员。我们在包括Café在内的三个数据集上评估了该模型,其在精度和推理速度上均优于此前的工作。我们的数据集和代码将公开发布,以促进GAD的后续研究。

A Dynamic Network for Efficient Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2312.02877
  • repo_url: None
  • paper_authors: Yang Ai, Xi Yang
  • for: Improve the accuracy and efficiency of point cloud registration; non-overlapping points consume extensive computational resources while hurting accuracy.
  • methods: A dynamic approach, widely used to improve network efficiency in computer vision, is introduced to point cloud registration: an iterative registration process identifies regions where matching points cluster and removes noisy points, with a spatial-consistency-based classifier terminating the model once it is sufficiently confident (the early-exit control flow is sketched after this entry).
  • results: Compared with other methods achieving similar results, the approach reduces time consumption substantially, with a speed improvement of 41.2% on the indoor dataset (3DMatch) and 33.4% on the outdoor dataset (KITTI), while maintaining competitive registration recall.
    Abstract For the point cloud registration task, a significant challenge arises from non-overlapping points that consume extensive computational resources while negatively affecting registration accuracy. In this paper, we introduce a dynamic approach, widely utilized to improve network efficiency in computer vision tasks, to the point cloud registration task. We employ an iterative registration process on point cloud data multiple times to identify regions where matching points cluster, ultimately enabling us to remove noisy points. Specifically, we begin with deep global sampling to perform coarse global registration. Subsequently, we employ the proposed refined node proposal module to further narrow down the registration region and perform local registration. Furthermore, we utilize a spatial consistency-based classifier to evaluate the results of each registration stage. The model terminates once it reaches sufficient confidence, avoiding unnecessary computations. Extended experiments demonstrate that our model significantly reduces time consumption compared to other methods with similar results, achieving a speed improvement of over 41% on indoor dataset (3DMatch) and 33% on outdoor datasets (KITTI) while maintaining competitive registration recall requirements.
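
The early-exit control flow described above can be illustrated independently of the learned modules: iterate a registration stage, score it with a simple spatial-consistency measure, and stop once the score is confident. The Kabsch/SVD solver below stands in for the paper's learned coarse and refined stages; the consistency score and thresholds are illustrative assumptions.

```python
import numpy as np

def kabsch(src, dst):
    """Best-fit rotation R and translation t aligning src -> dst (both (N, 3))."""
    c_src, c_dst = src.mean(0), dst.mean(0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_dst - R @ c_src

def register_with_early_exit(src, dst, stages=3, confidence_thresh=0.95, tol=0.05):
    """Iteratively refine a rigid transform; terminate once the consistency
    score (fraction of points within `tol` of their correspondence) is high."""
    cur = src.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for stage in range(stages):
        R, t = kabsch(cur, dst)                 # stand-in for a learned stage
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        score = np.mean(np.linalg.norm(cur - dst, axis=1) < tol)
        if score >= confidence_thresh:          # confident enough: stop early
            break
    return R_total, t_total, score, stage + 1

# Toy usage: recover a known rotation + translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))
angle = 0.3
R_gt = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle),  np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
dst = src @ R_gt.T + np.array([0.5, -0.2, 0.1])
R, t, score, used = register_with_early_exit(src, dst)
print("consistency:", round(score, 3), "stages used:", used)
```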

RotaTR: Detection Transformer for Dense and Rotated Object

  • paper_url: http://arxiv.org/abs/2312.02821
  • repo_url: None
  • paper_authors: Zhu Yuke, Ruan Yumeng, Yang Lei, Guo Sheng
  • for: Study dense and rotated object detection and improve DETR's performance on this task.
  • methods: Rotated object detection TRansformer (RotaTR) extends DETR with Rotation Sensitive deformable (RSDeform) attention, used to build a feature alignment module and a rotation-sensitive decoder, strengthening DETR's ability to detect oriented targets.
  • results: On four challenging oriented-detection benchmarks, RotaTR shows a clear advantage over the original DETR for dense and rotated objects and achieves results competitive with state-of-the-art CNN-based detectors.
    Abstract Detecting the objects in dense and rotated scenes is a challenging task. Recent works on this topic are mostly based on Faster RCNN or Retinanet. As they are highly dependent on the pre-set dense anchors and the NMS operation, the approach is indirect and suboptimal. The end-to-end DETR-based detectors have achieved great success in horizontal object detection and many other areas like segmentation, tracking, action recognition, etc. However, the DETR-based detectors perform poorly on dense rotated target tasks and perform worse than most modern CNN-based detectors. In this paper, we find that the most significant reason for the poor performance is that the original attention cannot accurately focus on the oriented targets. Accordingly, we propose Rotated object detection TRansformer (RotaTR) as an extension of DETR to oriented detection. Specifically, we design Rotation Sensitive deformable (RSDeform) attention to enhance the DETR's ability to detect oriented targets. It is used to build the feature alignment module and rotation-sensitive decoder for our model. We test RotaTR on four challenging-oriented benchmarks. It shows a great advantage in detecting dense and oriented objects compared to the original DETR. It also achieves competitive results when compared to the state-of-the-art.
    摘要 检测密集和旋转场景中的对象是一项具有挑战性的任务。现有的方法多基于Faster RCNN或Retinanet,它们具有各种缺点,如依赖于预设密集锚点和NMS操作,导致方法间接和不优化。而基于DETR的端到端检测器在横向对象检测和其他领域如分割、跟踪、动作识别等方面具有很大的成功。然而,DETR基于的检测器在密集旋转目标任务上表现不佳,比现代CNN基于的检测器更差。在这篇论文中,我们发现最主要的问题在于DETR的原始注意力无法准确地对待方向目标。因此,我们提出了对DETR进行了扩展,即旋转对象检测传播器(RotaTR),以提高DETR对密集旋转目标的检测能力。具体来说,我们设计了旋转敏感的变形(RSDeform)注意力,用于增强DETR对方向目标的检测能力。它被用于构建特征对应模块和旋转敏感解码器。我们在四个复杂的旋转对象测试benchmark上测试了RotaTR。它在密集旋转目标上显示了优于DETR的检测能力,同时与现状的状态 искусственный智能达到了竞争性的 результа。

Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting

  • paper_url: http://arxiv.org/abs/2312.02819
  • repo_url: https://github.com/donggeun-yoon/dgdm
  • paper_authors: Donggeun Yoon, Minseok Seo, Doyi Kim, Yeji Choi, Donghyeon Cho
  • for: Balance probabilistic prediction and deterministic accuracy in weather forecasting.
  • methods: The Deterministic Guidance Diffusion Model (DGDM) combines a deterministic model and a probabilistic diffusion model: both are trained end-to-end in the forward process, and in the reverse process the deterministic prediction serves as an intermediate starting point for the probabilistic model (a minimal sketch of this sampling strategy follows this entry).
  • results: DGDM is evaluated on the global weather forecasting dataset (WeatherBench), the common video frame prediction benchmark (Moving MNIST), and the newly introduced Pacific Northwest Windstorm (PNW)-Typhoon weather satellite dataset for high-resolution regional forecasting, achieving state-of-the-art results in both global and regional forecasting.
    Abstract Weather forecasting requires not only accuracy but also the ability to perform probabilistic prediction. However, deterministic weather forecasting methods do not support probabilistic predictions, and conversely, probabilistic models tend to be less accurate. To address these challenges, in this paper, we introduce the Deterministic Guidance Diffusion Model (DGDM) for probabilistic weather forecasting, integrating benefits of both deterministic and probabilistic approaches. During the forward process, both the deterministic and probabilistic models are trained end-to-end. In the reverse process, weather forecasting leverages the predicted result from the deterministic model, using it as an intermediate starting point for the probabilistic model. By fusing deterministic models with probabilistic models in this manner, DGDM is capable of providing accurate forecasts while also offering probabilistic predictions. To evaluate DGDM, we assess it on the global weather forecasting dataset (WeatherBench) and the common video frame prediction benchmark (Moving MNIST). We also introduce and evaluate the Pacific Northwest Windstorm (PNW)-Typhoon weather satellite dataset to verify the effectiveness of DGDM in high-resolution regional forecasting. As a result of our experiments, DGDM achieves state-of-the-art results not only in global forecasting but also in regional forecasting. The code is available at: https://github.com/DongGeun-Yoon/DGDM.
    摘要 天气预报不仅需要准确性,还需要具备概率预测的能力。然而,确定性预报方法不支持概率预测,而概率模型往往准确性较低。为解决这些挑战,本文提出确定性引导扩散模型(Deterministic Guidance Diffusion Model, DGDM),将确定性方法与概率方法的优势相结合。在前向过程中,确定性模型与概率模型端到端联合训练;在反向过程中,概率模型以确定性模型的预测结果作为中间起点进行采样。通过这种融合方式,DGDM既能给出准确的预报,又能提供概率预测。我们在全球天气预报数据集WeatherBench和常用的视频帧预测基准Moving MNIST上评估了DGDM,并引入Pacific Northwest Windstorm(PNW)-Typhoon气象卫星数据集以验证其在高分辨率区域预报中的有效性。实验结果表明,DGDM在全球预报和区域预报中均取得最先进的结果。代码见:https://github.com/DongGeun-Yoon/DGDM。
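
The central sampling idea, starting the reverse diffusion from a noised version of the deterministic forecast rather than from pure noise, can be sketched as follows. This is a generic DDPM-style illustration under simplifying assumptions (a linear beta schedule, an untrained placeholder denoiser, no conditioning); it is not the authors' implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def denoiser(x_t, t):
    """Placeholder for the trained probabilistic model epsilon_theta(x_t, t)."""
    return torch.zeros_like(x_t)

@torch.no_grad()
def sample_from_deterministic(x_det, t_start=300):
    """Start the reverse process from a noised version of the deterministic
    forecast x_det instead of pure Gaussian noise (DDPM-style update)."""
    a_bar = alpha_bar[t_start - 1]
    x = a_bar.sqrt() * x_det + (1 - a_bar).sqrt() * torch.randn_like(x_det)
    for t in range(t_start - 1, -1, -1):
        eps = denoiser(x, t)
        a_t, ab_t = alphas[t], alpha_bar[t]
        mean = (x - (1 - a_t) / (1 - ab_t).sqrt() * eps) / a_t.sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x

# Toy usage: a "deterministic forecast" frame of shape (1, 1, 32, 32).
x_det = torch.zeros(1, 1, 32, 32)
sample = sample_from_deterministic(x_det, t_start=300)
print(sample.shape)
```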

Generating Fine-Grained Human Motions Using ChatGPT-Refined Descriptions

  • paper_url: http://arxiv.org/abs/2312.02772
  • repo_url: None
  • paper_authors: Xu Shi, Chuanchen Luo, Junran Peng, Hongwen Zhang, Yunlian Sun
  • for: Propose a human motion generation model, FG-MDM, that produces fine-grained motions by following a divide-and-conquer strategy over textual descriptions.
  • methods: A large language model (GPT-3.5) refines vague textual annotations into fine-grained descriptions of different body parts, and these descriptions then guide a transformer-based diffusion model for motion generation (a sketch of the prompt-refinement step follows this entry).
  • results: Experiments show that FG-MDM outperforms previous methods, especially in generalization outside the distribution of the training data, generating fine-grained and stylized motions; the fine-grained textual annotations for HumanML3D and KIT will be released.
    Abstract Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, it remains challenging to generate fine-grained or stylized motions due to the lack of datasets annotated with detailed textual descriptions. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for human motion generation. Specifically, we first parse previous vague textual annotation into fine-grained description of different body parts by leveraging a large language model (GPT-3.5). We then use these fine-grained descriptions to guide a transformer-based diffusion model. FG-MDM can generate fine-grained and stylized motions even outside of the distribution of the training data. Our experimental results demonstrate the superiority of FG-MDM over previous methods, especially the strong generalization capability. We will release our fine-grained textual annotations for HumanML3D and KIT.
    摘要 最近,在文本基于动作生成方面,有了 significanth的进步,使得可以生成具有多样性和高质量的人体动作,这些动作都遵循文本描述。然而,由于缺乏细化的文本描述,仍然困难生成细化或特殊的动作。为解决这问题,我们提出了一种新的框架,即细化人体动作扩散模型(FG-MDM)。具体来说,我们首先使用大型自然语言模型(GPT-3.5)来分解前一 vague 的文本描述,然后使用这些细化描述来引导一个基于transformer的扩散模型。FG-MDM可以生成细化和特殊的动作,即使在训练数据的外部。我们的实验结果表明FG-MDM在前一些方法之上具有显著的优势,尤其是在泛化性方面。我们计划将我们的细化文本描述发布给HumanML3D和KIT。
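
The annotation-refinement step can be sketched as a prompt template that asks an LLM to expand a coarse motion caption into per-body-part descriptions. The body-part list, the prompt wording, and the `call_llm` placeholder are all assumptions for illustration, not the authors' actual prompt or API.

```python
BODY_PARTS = ["head", "torso", "left arm", "right arm", "left leg", "right leg"]

def build_refinement_prompt(caption: str) -> str:
    """Turn a vague motion caption into a request for fine-grained,
    per-body-part descriptions (to be sent to an LLM such as GPT-3.5)."""
    parts = ", ".join(BODY_PARTS)
    return (
        "Rewrite the following human motion description as one short sentence "
        f"per body part ({parts}), describing what that part does:\n"
        f"Motion: {caption}\n"
        "Answer as 'part: description' lines."
    )

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    return "head: looks forward\ntorso: leans slightly forward\n..."

prompt = build_refinement_prompt("a person throws a ball")
print(prompt)
print(call_llm(prompt))
```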

SEVA: Leveraging sketches to evaluate alignment between human and machine visual abstraction

  • paper_url: http://arxiv.org/abs/2312.03035
  • repo_url: https://github.com/cogtoolslab/visual_abstractions_benchmarking_public2023
  • paper_authors: Kushin Mukherjee, Holly Huey, Xuanchen Lu, Yael Vinker, Rio Aguina-Kang, Ariel Shamir, Judith E. Fan
  • for: Evaluate how well current vision algorithms understand human-generated sketches and how their interpretations align with human judgments.
  • methods: A new benchmark dataset, SEVA, contains approximately 90K human-generated sketches of 128 object concepts produced under different time constraints (and thus systematically varying in sparsity); state-of-the-art vision algorithms are evaluated on recognizing the target concept and on how closely their responses align with human response patterns (one way to score this alignment is sketched after this entry).
  • results: Vision algorithms that better predict human sketch recognition also better approximate human uncertainty about sketch meaning, but a sizable gap remains between model and human response patterns; a recently developed sketch generation algorithm (Vinker et al., 2022) capable of producing sketches of varying sparsity is also evaluated.
    Abstract Sketching is a powerful tool for creating abstract images that are sparse but meaningful. Sketch understanding poses fundamental challenges for general-purpose vision algorithms because it requires robustness to the sparsity of sketches relative to natural visual inputs and because it demands tolerance for semantic ambiguity, as sketches can reliably evoke multiple meanings. While current vision algorithms have achieved high performance on a variety of visual tasks, it remains unclear to what extent they understand sketches in a human-like way. Here we introduce SEVA, a new benchmark dataset containing approximately 90K human-generated sketches of 128 object concepts produced under different time constraints, and thus systematically varying in sparsity. We evaluated a suite of state-of-the-art vision algorithms on their ability to correctly identify the target concept depicted in these sketches and to generate responses that are strongly aligned with human response patterns on the same sketch recognition task. We found that vision algorithms that better predicted human sketch recognition performance also better approximated human uncertainty about sketch meaning, but there remains a sizable gap between model and human response patterns. To explore the potential of models that emulate human visual abstraction in generative tasks, we conducted further evaluations of a recently developed sketch generation algorithm (Vinker et al., 2022) capable of generating sketches that vary in sparsity. We hope that public release of this dataset and evaluation protocol will catalyze progress towards algorithms with enhanced capacities for human-like visual abstraction.
    摘要 绘制是一种强大的工具,用于创建简洁而意义reich的抽象图像。理解绘制pose了一些核心性的挑战 для通用视觉算法,因为它们需要对绘制的简洁性与自然视觉输入具有坚固的Robustness,并且需要忍受Semantic Ambiguity,因为绘制可以可靠地诱发多种含义。虽然当前的视觉算法在多种视觉任务上已经 дости得了高性能,但是没有很清楚地知道它们是否能够理解人类的绘制方式。在这篇文章中,我们引入了SEVA,一个新的测试数据集,包含约90,000个人类生成的绘制,这些绘制是在不同的时间限制下生成的,因此系统地 varying in sparsity。我们对一些当前的State-of-the-art视觉算法进行了评估,以确定它们是否能够正确地识别绘制中的目标概念,并且是否能够生成与人类响应模式相似的响应。我们发现,能够更好地预测人类绘制认知性的视觉算法也更好地预测人类对绘制的不确定性,但是还有一定的差距 между模型和人类响应模式。为了探索模型可以模仿人类视觉抽象的能力,我们进行了进一步的评估,使用Vinker et al. (2022)提出的一种可变干扰绘制生成算法,可以生成绘制的不同级别。我们希望通过公开这个数据集和评估协议,促进模型具有人类化视觉抽象能力的进步。
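
One simple way to quantify the alignment between model and human responses on the recognition task is to correlate, per sketch, the model's class probabilities with the empirical distribution of human choices. The metric below (mean per-sketch Spearman correlation) is an illustrative assumption, not necessarily SEVA's official evaluation protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def alignment_score(model_probs, human_counts):
    """Average per-sketch Spearman correlation between model class
    probabilities and human response frequencies.

    model_probs:  (N, C) model probabilities per sketch.
    human_counts: (N, C) how many humans chose each concept per sketch.
    """
    human_probs = human_counts / human_counts.sum(axis=1, keepdims=True)
    scores = [spearmanr(m, h)[0] for m, h in zip(model_probs, human_probs)]
    return float(np.nanmean(scores))

# Toy usage with 3 sketches and 5 candidate concepts.
rng = np.random.default_rng(0)
model = rng.dirichlet(np.ones(5), size=3)
humans = rng.integers(0, 10, size=(3, 5)) + 1
print(round(alignment_score(model, humans), 3))
```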

Learning Cortical Anomaly through Masked Encoding for Unsupervised Heterogeneity Mapping

  • paper_url: http://arxiv.org/abs/2312.02762
  • repo_url: https://github.com/chadHGY/CAM
  • paper_authors: Hao-Chun Yang, Ole Andreassen, Lars Tjelta Westlye, Andre F. Marquand, Christian F. Beckmann, Thomas Wolfers
  • for: Detect complex brain disorders, particularly psychiatric conditions, where symptoms are heterogeneous and reliable biomarkers are lacking.
  • methods: CAM (Cortical Anomaly Detection through Masked Image Modeling) is a novel self-supervised framework that uses cortical surface features for unsupervised detection of complex brain disorders.
  • results: Applied to individuals on the psychotic spectrum without any labels, CAM achieves an AUC of 0.696 for schizoaffective and 0.769 for schizophreniform disorder; the analysis of atypical cortical regions, including Pars Triangularis and several frontal areas often implicated in schizophrenia, provides further confidence in the approach.
    Abstract The detection of heterogeneous mental disorders based on brain readouts remains challenging due to the complexity of symptoms and the absence of reliable biomarkers. This paper introduces CAM (Cortical Anomaly Detection through Masked Image Modeling), a novel self-supervised framework designed for the unsupervised detection of complex brain disorders using cortical surface features. We employ this framework for the detection of individuals on the psychotic spectrum and demonstrate its capabilities compared to state-ofthe-art methods, achieving an AUC of 0.696 for Schizoaffective and 0.769 for Schizophreniform, without the need for any labels. Furthermore, the analysis of atypical cortical regions includes Pars Triangularis and several frontal areas, often implicated in schizophrenia, provide further confidence in our approach. Altogether, we demonstrate a scalable approach for anomaly detection of complex brain disorders based on cortical abnormalities.
    摘要 由于症状复杂且缺乏可靠的生物标志物,基于脑影像读数检测异质性精神疾病仍然具有挑战性。本文提出CAM(Cortical Anomaly Detection through Masked Image Modeling),一种新颖的自监督框架,利用大脑皮层表面特征对复杂脑部疾病进行无监督检测。我们将该框架用于精神病谱系个体的检测,在不需要任何标签的情况下,对分裂情感性障碍取得0.696的AUC,对精神分裂样障碍取得0.769的AUC,优于现有方法。此外,对非典型皮层区域(包括Pars Triangularis和多个常与精神分裂症相关的额叶区域)的分析进一步增强了对该方法的信心。总体而言,我们展示了一种基于皮层异常、可扩展的复杂脑部疾病异常检测方法。

C3: High-performance and low-complexity neural compression from a single image or video

  • paper_url: http://arxiv.org/abs/2312.02753
  • repo_url: None
  • paper_authors: Hyunjik Kim, Matthias Bauer, Lucas Theis, Jonathan Richard Schwarz, Emilien Dupont
  • for: Propose a neural compression method that overfits a small model to each image or video, achieving strong rate-distortion performance at low decoding complexity.
  • methods: Building on COOL-CHIC (Ladune et al.), C3 adds several simple and effective improvements for images and new methodology for applying the approach to videos (a toy per-image overfitting loop is sketched after this entry).
  • results: On the CLIC2020 image benchmark, C3 matches the rate-distortion performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding; on the UVG video benchmark, it matches the Video Compression Transformer (Mentzer et al.) with less than 5k MACs/pixel.
    Abstract Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3, a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC (Ladune et al.) and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. On the CLIC2020 image benchmark, we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark, we match the RD performance of the Video Compression Transformer (Mentzer et al.), a well-established neural video codec, with less than 5k MACs/pixel for decoding.
    摘要 大多数神经压缩模型通过大量的图像或视频数据进行训练,以便泛化到未经见过的数据。这种泛化通常需要具有高度表达能力和高解码复杂度的大型和复杂的建筑。而我们所介绍的C3神经压缩方法却通过对每个图像或视频进行特点适应,而不是使用大型和复杂的建筑,以达到类似的泛化性能。因此,C3的解码复杂度可以降低至对神经基eline的同等水平,而不是神经基eline的数十倍。C3建立在COOL-CHIC(Ladune等人)的基础之上,并进行了一些简单而有效的改进,以便应用于图像上。我们还开发了新的方法,以应用C3于视频上。在CLIC2020图像benchmark上,我们与VTM(H.266编码器的参考实现)的参考实现匹配了RD性能,但需要少于3k MACs/像素进行解码。在UVG视频benchmark上,我们与Video Compression Transformer(Mentzer等人)的神经视频编码器匹配了RD性能,但需要少于5k MACs/像素进行解码。
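
The core idea of per-instance neural compression, overfitting a small network to a single image so that its weights become the bitstream, can be illustrated with a toy loop. The sketch below uses only an MSE distortion term and a plain coordinate MLP; the latent grids, quantization, and rate term that COOL-CHIC/C3 actually rely on are omitted.

```python
import torch
import torch.nn as nn

def overfit_single_image(image, steps=200, lr=1e-2):
    """Overfit a tiny coordinate->RGB MLP to one image (C, H, W) in [0, 1]."""
    c, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)      # (H*W, 2)
    target = image.permute(1, 2, 0).reshape(-1, c)             # (H*W, C)

    model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, c), nn.Sigmoid())
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()          # distortion only
        loss.backward()
        opt.step()
    return model, loss.item()

# Toy usage on a random 3x32x32 "image".
model, mse = overfit_single_image(torch.rand(3, 32, 32))
print("final MSE:", round(mse, 4))
```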

C-NERF: Representing Scene Changes as Directional Consistency Difference-based NeRF

  • paper_url: http://arxiv.org/abs/2312.02751
  • repo_url: https://github.com/c-nerf/c-nerf
  • paper_authors: Rui Huang, Binbin Jiang, Qingyi Zhao, William Wang, Yuxiang Zhang, Qing Guo
  • for: Detect object changes in scenes represented by neural radiance fields (NeRFs).
  • methods: C-NERF represents scene changes as a directional-consistency-difference-based NeRF with three modules: the two NeRFs captured before and after the change are spatially aligned, change points are identified via a direction-consistency constraint (real change points have similar change representations across view directions, fake ones do not; a small sketch follows this entry), and a change map is rendered from the constructed NeRFs for an arbitrarily specified view direction.
  • results: On a new dataset of ten scenes with diverse changing objects, the method surpasses state-of-the-art 2D change detection and NeRF-based methods by a significant margin.
    Abstract In this work, we aim to detect the changes caused by object variations in a scene represented by the neural radiance fields (NeRFs). Given an arbitrary view and two sets of scene images captured at different timestamps, we can predict the scene changes in that view, which has significant potential applications in scene monitoring and measuring. We conducted preliminary studies and found that such an exciting task cannot be easily achieved by utilizing existing NeRFs and 2D change detection methods with many false or missing detections. The main reason is that the 2D change detection is based on the pixel appearance difference between spatial-aligned image pairs and neglects the stereo information in the NeRF. To address the limitations, we propose the C-NERF to represent scene changes as directional consistency difference-based NeRF, which mainly contains three modules. We first perform the spatial alignment of two NeRFs captured before and after changes. Then, we identify the change points based on the direction-consistent constraint; that is, real change points have similar change representations across view directions, but fake change points do not. Finally, we design the change map rendering process based on the built NeRFs and can generate the change map of an arbitrarily specified view direction. To validate the effectiveness, we build a new dataset containing ten scenes covering diverse scenarios with different changing objects. Our approach surpasses state-of-the-art 2D change detection and NeRF-based methods by a significant margin.
    摘要 在这项工作中，我们的目标是检测由物体变化引起的、以神经辐射场（NeRF）表示的场景变化。给定任意视角以及在两个不同时间点拍摄的两组场景图像，我们可以预测该视角下的场景变化，这在场景监测与测量中有重要的应用前景。初步研究表明，直接利用现有的NeRF和2D变化检测方法难以完成这一任务，会产生大量误检和漏检，主要原因是2D变化检测基于空间对齐图像对之间的像素外观差异，忽略了NeRF中的立体信息。为此，我们提出C-NERF，将场景变化表示为基于方向一致性差异的NeRF。C-NERF主要包含三个模块：首先对变化前后拍摄的两个NeRF进行空间对齐；然后基于方向一致性约束确定变化点，即真实的变化点在不同视向下具有相似的变化表示，而虚假的变化点则不然；最后基于构建的NeRF设计变化图渲染过程，可以生成任意指定视向的变化图。为验证有效性，我们构建了一个包含十个场景、涵盖多种变化物体的新数据集。我们的方法以明显优势超越了最先进的2D变化检测方法和基于NeRF的方法。
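As a rough illustration of the direction-consistency test described above, the sketch below checks whether a candidate 3D point changes consistently across sampled viewing directions. The renderer interfaces, thresholds, and the variance-based consistency measure are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative sketch of the direction-consistency idea: a 3D point is flagged
# as a real change only if its before/after appearance difference is similar
# across many viewing directions.
import numpy as np

def direction_consistency_change(point, render_before, render_after,
                                 n_dirs=32, mag_thr=0.1, var_thr=0.05):
    dirs = np.random.randn(n_dirs, 3)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)         # unit view directions
    diffs = np.stack([render_after(point, d) - render_before(point, d)
                      for d in dirs])                           # (n_dirs, 3) colour deltas
    magnitude = np.linalg.norm(diffs, axis=1).mean()
    spread = diffs.var(axis=0).mean()                           # low spread = view-consistent
    return magnitude > mag_thr and spread < var_thr

# toy usage with hypothetical radiance-field renderers returning RGB arrays
before = lambda p, d: np.zeros(3)
after = lambda p, d: np.ones(3) * 0.5
print(direction_consistency_change(np.zeros(3), before, after))  # True
```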

LiDAR-based Person Re-identification

  • paper_url: http://arxiv.org/abs/2312.03033
  • repo_url: None
  • paper_authors: Wenxuan Guo, Zhiyu Pan, Yingping Liang, Ziheng Xi, Zhi Chen Zhong, Jianjiang Feng, Jie Zhou
  • for: 这篇论文主要针对人体重识别(ReID)领域的Camera-based系统,旨在提高人体重识别的精度和可靠性。
  • methods: 该论文提出了一种基于LiDAR的人体重识别框架ReID3D，该框架利用预训练策略提取3D人体形态特征，并引入基于图的互补增强编码器（Graph-based Complementary Enhancement Encoder）来提取全面特征。
  • results: 经过广泛的实验,ReID3D在LReID dataset上实现了非常出色的性能,rank-1准确率达94.0%,这表明LiDAR可以很好地解决人体重识别任务。
    Abstract Camera-based person re-identification (ReID) systems have been widely applied in the field of public security. However, cameras often lack the perception of 3D morphological information of human and are susceptible to various limitations, such as inadequate illumination, complex background, and personal privacy. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that utilizes pre-training strategy to retrieve features of 3D body shape and introduces Graph-based Complementary Enhancement Encoder for extracting comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the first LiDAR-based person ReID dataset, which is collected in several outdoor scenes with variations in natural conditions. Additionally, we introduce LReID-sync, a simulated pedestrian dataset designed for pre-training encoders with tasks of point cloud completion and shape parameter learning. Extensive experiments on LReID show that ReID3D achieves exceptional performance with a rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in addressing person ReID tasks. To the best of our knowledge, we are the first to propose a solution for LiDAR-based ReID. The code and datasets will be released soon.
    摘要 基于摄像头的人体重识别（ReID）系统已广泛应用于公共安全领域，但摄像头通常缺乏对人体三维形态信息的感知，并且容易受到光照不足、背景复杂和个人隐私等限制。在这篇论文中，我们提出了基于LiDAR的重识别框架ReID3D，该框架利用预训练策略提取3D人体形状特征，并引入基于图的互补增强编码器以提取全面特征。由于缺乏LiDAR数据集，我们构建了首个基于LiDAR的人体重识别数据集LReID，该数据集在多个自然条件各异的户外场景中采集。此外，我们还提出了LReID-sync，一个用于以点云补全和形状参数学习任务预训练编码器的仿真行人数据集。在LReID上的大量实验表明，ReID3D取得了94.0的rank-1准确率，显示了LiDAR在人体重识别任务中的巨大潜力。据我们所知，这是首个针对基于LiDAR的重识别提出的解决方案。代码和数据集将很快发布。

R3D-SWIN: Use Shifted Window Attention for Single-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2312.02725
  • repo_url: None
  • paper_authors: Chenhuan Li, Meihua Xiao, zehuan li, Mengxi Gao
  • for: 提高单视重建精度
  • methods: shifted windows attention voxel 3D reconstruction network
  • results: 在ShapeNet上达到SOTA单视重建精度
    Abstract Recently, vision transformers have performed well in various computer vision tasks, including voxel 3D reconstruction. However, the windows of the vision transformer are not multi-scale, and there is no connection between the windows, which limits the accuracy of voxel 3D reconstruction . Therefore, we propose a shifted windows attention voxel 3D reconstruction network. To the best of our knowledge, this is the first work to apply shifted window attention to voxel 3D reconstruction. Experimental results on ShapeNet verify our method achieves SOTA accuracy in single-view reconstruction.
    摘要 近期，视觉Transformer在多种计算机视觉任务中表现出色，包括体素3D重建。然而，视觉Transformer的窗口不是多尺度的，窗口之间也没有连接，这限制了体素3D重建的精度。因此，我们提出了一种移位窗口注意力的体素3D重建网络。据我们所知，这是首次将移位窗口注意力应用于体素3D重建。在ShapeNet上的实验结果表明，我们的方法在单视重建中达到了SOTA精度。
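To make the shifted-window mechanism concrete, here is a minimal sketch of cyclic-shift window attention in the Swin style. It omits the attention masks and relative position bias of a full implementation, and R3D-SWIN's voxel decoder is not shown; window size and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of shifted-window attention: a cyclic shift lets neighbouring
# windows exchange information before attention is applied within each window.
import torch
import torch.nn as nn

def shifted_window_attention(x, attn, window=4, shift=2):
    """x: (B, H, W, C) feature map; attn: nn.MultiheadAttention(C, heads, batch_first=True)."""
    b, h, w, c = x.shape
    x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))      # cyclic shift links windows
    x = x.view(b, h // window, window, w // window, window, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, c)  # (num_windows*B, N, C)
    out, _ = attn(x, x, x)                                       # attention inside each window
    out = out.reshape(b, h // window, w // window, window, window, c)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
    return torch.roll(out, shifts=(shift, shift), dims=(1, 2))   # undo the shift

# usage:
# feats = torch.rand(2, 8, 8, 32)
# mha = nn.MultiheadAttention(32, num_heads=4, batch_first=True)
# print(shifted_window_attention(feats, mha).shape)  # torch.Size([2, 8, 8, 32])
```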

MyPortrait: Morphable Prior-Guided Personalized Portrait Generation

  • paper_url: http://arxiv.org/abs/2312.02703
  • repo_url: None
  • paper_authors: Bo Ding, Zhenfeng Fan, Shuang Yang, Shihong Xia
  • for: 这篇论文研究计算机视觉领域中逼真说话人脸的生成。
  • methods: 我们提出了一种简单、通用、灵活的神经肖像生成框架MyPortrait，可以从单个人的单目视频中生成带个性化细节的人脸动画。
  • results: 我们的方法在多种指标上表现出色，超越了现有方法。我们还提供了在线（实时）和离线（高质量）两种版本，取决于测试数据是否参与训练。
    Abstract Generating realistic talking faces is an interesting and long-standing topic in the field of computer vision. Although significant progress has been made, it is still challenging to generate high-quality dynamic faces with personalized details. This is mainly due to the inability of the general model to represent personalized details and the generalization problem to unseen controllable parameters. In this work, we propose Myportrait, a simple, general, and flexible framework for neural portrait generation. We incorporate personalized prior in a monocular video and morphable prior in 3D face morphable space for generating personalized details under novel controllable parameters. Our proposed framework supports both video-driven and audio-driven face animation given a monocular video of a single person. Distinguished by whether the test data is sent to training or not, our method provides a real-time online version and a high-quality offline version. Comprehensive experiments in various metrics demonstrate the superior performance of our method over the state-of-the-art methods. The code will be publicly available.
    摘要 生成逼真的说话人脸是计算机视觉领域一个有趣且由来已久的课题。尽管已取得显著进展，生成带有个性化细节的高质量动态人脸仍然具有挑战性，这主要是因为通用模型难以表示个性化细节，并且难以泛化到未见过的可控参数。在这项工作中，我们提出MyPortrait，一个简单、通用且灵活的神经肖像生成框架。我们结合单目视频中的个性化先验与3D人脸可变形空间中的可变形先验，在新的可控参数下生成个性化细节。给定单个人的一段单目视频，该框架同时支持视频驱动和音频驱动的人脸动画。根据测试数据是否参与训练，我们的方法分别提供实时在线版本和高质量离线版本。在多种指标上的全面实验表明，我们的方法优于当前最先进的方法。代码将公开发布。

Neural Sign Actors: A diffusion model for 3D sign language production from text

  • paper_url: http://arxiv.org/abs/2312.02702
  • repo_url: None
  • paper_authors: Vasileios Baltatzis, Rolandos Alexandros Potamias, Evangelos Ververas, Guanxiong Sun, Jiankang Deng, Stefanos Zafeiriou
  • for: 提高手语生成的真实性和含义精度
  • methods: 使用扩散过程和基于体格神经网络的新方法
  • results: 与前方法相比,显著提高手语生成的性能和真实性
    Abstract Sign Languages (SL) serve as the predominant mode of communication for the Deaf and Hard of Hearing communities. The advent of deep learning has aided numerous methods in SL recognition and translation, achieving remarkable results. However, Sign Language Production (SLP) poses a challenge for the computer vision community as the motions generated must be realistic and have precise semantic meanings. Most SLP methods rely on 2D data, thus impeding their ability to attain a necessary level of realism. In this work, we propose a diffusion-based SLP model trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through a series of quantitative and qualitative experiments, we show that the proposed method considerably outperforms previous methods of SLP. We believe that this work presents an important and necessary step towards realistic neural sign avatars, bridging the communication gap between Deaf and hearing communities. The code, method and generated data will be made publicly available.
    摘要 手语是聋人及听障群体最主要的交流方式。深度学习的发展推动了手语识别与翻译的诸多方法并取得了显著成果，但手语生成（SLP）对计算机视觉领域仍是一个挑战，因为生成的动作必须既逼真又具有精确的语义。多数SLP方法依赖2D数据，难以达到所需的真实感。在这项工作中，我们提出了一种基于扩散的SLP模型，在经过整理的大规模4D手语人体数据集及其对应文本上进行训练。该方法在SMPL-X人体骨架上定义了一个新颖的、符合解剖结构的图神经网络，并在其上构建扩散过程，从而能够针对不受限制的语篇生成动态的3D人体序列。一系列定量与定性实验表明，所提方法显著优于此前的SLP方法。我们相信这项工作是迈向逼真神经手语人体的重要且必要的一步，有助于弥合聋人与听人群体之间的沟通鸿沟。代码、方法和生成数据将公开发布。

Revisit Human-Scene Interaction via Space Occupancy

  • paper_url: http://arxiv.org/abs/2312.02700
  • repo_url: None
  • paper_authors: Xinpeng Liu, Haowen Hou, Yanchao Yang, Yong-Lu Li, Cewu Lu
  • for: 本研究旨在解决人-场景交互（HSI）生成任务中的数据匮乏问题，即同时捕捉人体与3D环境的高质量数据稀少，导致数据多样性和复杂度受限。
  • methods: 本研究提出了一种新的人-占用（Occupancy）交互视角，将人体运动序列视为与场景空间占用交互的记录，从而把仅含运动的数据聚合成大规模的人-占用交互配对数据库（MOB）。
  • results: 通过在MOB上训练单个运动控制器，可以在不同的静态和动态场景中生成真实且稳定的HSI运动，而无需GT 3D场景参与训练。代码和数据将在 https://foruck.github.io/occu-page/ 上公开。
    Abstract Human-scene Interaction (HSI) generation is a challenging task and crucial for various downstream tasks. However, one of the major obstacles is the limited data scale. High-quality data with simultaneously captured human and 3D environments is rare, resulting in limited data diversity and complexity. In this work, we argue that interaction with a scene is essentially interacting with the space occupancy of the scene from an abstract physical perspective, leading us to a unified novel view of Human-Occupancy Interaction. By treating pure motion sequences as records of humans interacting with invisible scene occupancy, we can aggregate motion-only data into a large-scale paired human-occupancy interaction database: Motion Occupancy Base (MOB). Thus, the need for costly paired motion-scene datasets with high-quality scene scans can be substantially alleviated. With this new unified view of Human-Occupancy interaction, a single motion controller is proposed to reach the target state given the surrounding occupancy. Once trained on MOB with complex occupancy layout, the controller could handle cramped scenes and generalize well to general scenes with limited complexity. With no GT 3D scenes for training, our method can generate realistic and stable HSI motions in diverse scenarios, including both static and dynamic scenes. Our code and data would be made publicly available at https://foruck.github.io/occu-page/.
    摘要 人-场景交互（HSI）生成是一项具有挑战性的任务，对多种下游任务至关重要。其主要障碍之一是数据规模有限：同时捕捉人体与3D环境的高质量数据十分稀少，导致数据的多样性和复杂度受限。在这项工作中，我们认为从抽象的物理视角来看，与场景交互本质上是与场景的空间占用交互，由此得到统一的人-占用交互新视角。通过把纯运动序列视为人与不可见的场景占用交互的记录，我们可以将仅含运动的数据聚合成大规模的人-占用交互配对数据库：运动占用基础（MOB）。这样，对昂贵的、带高质量场景扫描的运动-场景配对数据集的需求可以大幅降低。基于这一统一视角，我们提出了一个单一的运动控制器，在给定周围占用的条件下到达目标状态。在具有复杂占用布局的MOB上训练后，该控制器既能处理狭窄场景，也能很好地泛化到复杂度有限的一般场景。在不使用任何真实3D场景训练的情况下，我们的方法可以在包括静态和动态场景在内的多种情形下生成真实且稳定的HSI运动。代码和数据将在 https://foruck.github.io/occu-page/ 上公开。

UPOCR: Towards Unified Pixel-Level OCR Interface

  • paper_url: http://arxiv.org/abs/2312.02694
  • repo_url: None
  • paper_authors: Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin
  • for: 提高 pixel-level OCR 领域的研究和应用效率,建立一个通用的 OCR 模型,能同时处理多种任务。
  • methods: 提出了一种基于 Vision Transformer 的普通模型,通过学习任务提示来激活任务特征表示,实现多任务共享表示的目标。
  • results: 实验结果表明，该方法用一个统一模型即可同时在三种像素级OCR任务上达到最先进的表现，并能在不同任务之间共享表示，为未来的通用OCR模型研究提供了有价值的策略和见解。
    Abstract In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available.
    摘要 近年来，光学字符识别（OCR）领域涌现出大量针对各类任务的先进方法。然而，这些方法往往针对特定任务设计，采用不同的范式、结构和训练策略，显著增加了研究与维护的复杂度，并妨碍其在应用中的快速部署。为此，我们提出了UPOCR，一种简单而有效的通用模型，用于统一的像素级OCR接口。具体而言，UPOCR将多种OCR任务统一为图像到图像的转换范式，并采用基于视觉Transformer（ViT）的编码器-解码器架构。我们引入可学习的任务提示，将编码器提取的通用特征表示推向任务特定的空间，使解码器具备任务感知能力。此外，无论任务之间存在何种差异，模型训练都统一以最小化生成图像与真实图像之间的差异为目标。我们在文本去除、文本分割和篡改文本检测三个像素级OCR任务上进行了实验。结果表明，所提方法无需任何额外技巧，即可用单个统一模型同时在三个任务上达到最先进的性能，为未来通用OCR模型的研究提供了有价值的策略和见解。代码将公开。

DeepPointMap: Advancing LiDAR SLAM with Unified Neural Descriptors

  • paper_url: http://arxiv.org/abs/2312.02684
  • repo_url: None
  • paper_authors: Xiaze Zhang, Ziheng Ding, Qi Jing, Yuejie Zhang, Wenchao Ding, Rui Feng
  • for: 提高 simultaneous localization and mapping (SLAM) 的精度和效率
  • methods: 使用神经网络提取高度表示性的点云描述符,实现内存有效的地图表示和精准的多尺度本地化任务
  • results: 在多种复杂的场景中,包括多机合作SLAM, 实现了优秀的结果,证明了方法的有效性和潜力
    Abstract Point clouds have shown significant potential in various domains, including Simultaneous Localization and Mapping (SLAM). However, existing approaches either rely on dense point clouds to achieve high localization accuracy or use generalized descriptors to reduce map size. Unfortunately, these two aspects seem to conflict with each other. To address this limitation, we propose a unified architecture, DeepPointMap, achieving excellent preference on both aspects. We utilize neural network to extract highly representative and sparse neural descriptors from point clouds, enabling memory-efficient map representation and accurate multi-scale localization tasks (e.g., odometry and loop-closure). Moreover, we showcase the versatility of our framework by extending it to more challenging multi-agent collaborative SLAM. The promising results obtained in these scenarios further emphasize the effectiveness and potential of our approach.
    摘要 几何 clouds 已经显示出了多个领域的潜在能力,包括同时地位和地图对接(SLAM)。然而,现有的方法可能会依赖于紧密的几何 clouds 以达到高地位准确性,或者使用通用的描述子来减少地图的大小。可惜的是,这两个方面似乎相互抵触。为了解决这个限制,我们提出了一个统一架构,深度点图(DeepPointMap),实现了高度的选择性和高精度的多尺度地位任务(例如运动和关闭)。此外,我们还将框架扩展到更加挑战性的多机合作SLAM。在这些情况下,我们所取得的结果给出了深度点图的有效性和潜力。

Zero-Shot Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2312.03032
  • repo_url: None
  • paper_authors: Weijie Wang, Guofeng Mei, Bin Ren, Xiaoshui Huang, Fabio Poiesi, Luc Van Gool, Nicu Sebe, Bruno Lepri
  • for: 本研究旨在提出首个无需在点云数据集上训练的零样本点云配准方法。
  • methods: ZeroReg将2D图像关键点特征投影到3D空间，通过搜索邻近点构建局部patch，用一种新颖的无参数几何解码器融合每个点的几何与视觉特征，并将点云间的对应关系建模为最优传输问题。
  • results: ZeroReg方法在3DMatch、3DLoMatch和ScanNet等数据集上实现了超过84%、46%和75%的召回率,与传统和学习基于方法竞争。
    Abstract Learning-based point cloud registration approaches have significantly outperformed their traditional counterparts. However, they typically require extensive training on specific datasets. In this paper, we propose ZeroReg, the first zero-shot point cloud registration approach that eliminates the need for training on point cloud datasets. The cornerstone of ZeroReg is the novel transfer of image features from keypoints to the point cloud, enriched by aggregating information from 3D geometric neighborhoods. Specifically, we extract keypoints and features from 2D image pairs using a frozen pretrained 2D backbone. These features are then projected in 3D, and patches are constructed by searching for neighboring points. We integrate the geometric and visual features of each point using our novel parameter-free geometric decoder. Subsequently, the task of determining correspondences between point clouds is formulated as an optimal transport problem. Extensive evaluations of ZeroReg demonstrate its competitive performance against both traditional and learning-based methods. On benchmarks such as 3DMatch, 3DLoMatch, and ScanNet, ZeroReg achieves impressive Recall Ratios (RR) of over 84%, 46%, and 75%, respectively.
    摘要 基于学习的点云配准方法已经明显优于传统方法，但它们通常需要在特定数据集上进行大量训练。在这篇论文中，我们提出了首个零样本点云配准方法ZeroReg，无需在点云数据集上训练。ZeroReg的核心思想是将图像关键点特征迁移到点云，并通过聚合3D几何邻域的信息加以增强。具体来说，我们使用冻结的预训练2D骨干网络从2D图像对中提取关键点和特征，再将这些特征投影到3D空间，并通过搜索邻近点构建patch。随后，我们用一种新颖的无参数几何解码器融合每个点的几何和视觉特征，并将点云之间的对应关系求解建模为最优传输问题。大量评估表明，ZeroReg与传统方法和基于学习的方法相比具有竞争力：在3DMatch、3DLoMatch和ScanNet等基准上，ZeroReg分别达到了超过84%、46%和75%的召回率（RR）。
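The correspondence step above can be illustrated with a generic entropic optimal-transport solver. The sketch below is a plain Sinkhorn matcher over descriptor similarities, with uniform marginals as an assumption; ZeroReg's fused geometric-visual features and its exact cost are not reproduced.

```python
# Generic Sinkhorn matcher: solves a soft assignment between two descriptor sets
# by alternating row/column normalizations of a Gibbs kernel.
import numpy as np

def sinkhorn_matching(feat_src, feat_tgt, eps=0.05, iters=100):
    """feat_*: (N, D) and (M, D) L2-normalized descriptors; returns matches and plan."""
    cost = 1.0 - feat_src @ feat_tgt.T                     # cosine distance as cost
    K = np.exp(-cost / eps)                                # Gibbs kernel
    a = np.ones(len(feat_src)) / len(feat_src)             # uniform marginals
    b = np.ones(len(feat_tgt)) / len(feat_tgt)
    u, v = a.copy(), b.copy()
    for _ in range(iters):                                 # alternating normalization
        u = a / (K @ v + 1e-9)
        v = b / (K.T @ u + 1e-9)
    P = np.diag(u) @ K @ np.diag(v)                        # transport plan ~ correspondences
    return P.argmax(axis=1), P                             # hard matches + soft plan

# usage with random unit descriptors
f1 = np.random.randn(50, 32); f1 /= np.linalg.norm(f1, axis=1, keepdims=True)
f2 = np.random.randn(60, 32); f2 /= np.linalg.norm(f2, axis=1, keepdims=True)
matches, plan = sinkhorn_matching(f1, f2)
```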

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

  • paper_url: http://arxiv.org/abs/2312.03031
  • repo_url: https://github.com/nvlabs/bev-planner
  • paper_authors: Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez
  • for: 本研究旨在深入研究推动自动驾驶技术的开发,尤其是在全栈层面上实现自主驾驶。
  • methods: 本研究使用了 nuScenes 数据集,并进行了详细的分析和检验,以探讨更多的细节问题。
  • results: 研究发现,使用 ego 状态信息可以提高路径规划质量,但是 nuScenes 数据集的限制导致模型倾向于仅仅基于 ego 车速度进行未来路径规划。此外,现有的指标不能全面评估规划质量,可能会导致不准确的结论。为此,本研究引入了一种新的指标来评估路径规划是否遵循道路规则。此外,我们还提出了一种简单的基线模型,能够在不依赖于感知注解的情况下达到竞争性的结果。
    Abstract End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line, many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper, we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset, we also note that current metrics do not comprehensively assess the planning quality, leading to potentially biased conclusions drawn from existing benchmarks. To address this issue, we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics, we suggest the community reassess relevant prevailing research and be cautious whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at \url{https://github.com/NVlabs/BEV-Planner}
    摘要 端到端自动驾驶近来成为从全栈视角实现自动驾驶的一个有前景的研究方向，许多最新工作都在nuScenes上采用开环评估来研究规划行为。在本文中，我们对该问题进行了更深入的分析，揭示了其中更多的细节问题。我们首先观察到，nuScenes数据集的驾驶场景相对简单，导致引入自车状态（如自车速度）的端到端模型对感知信息利用不足，这些模型往往主要依赖自车状态进行未来路径规划。除了数据集的局限之外，我们还注意到现有指标不能全面评估规划质量，可能导致从现有基准得出有偏的结论。为此，我们引入了一种新指标来评估预测轨迹是否遵循道路。我们进一步提出了一个简单的基线模型，在不依赖感知标注的情况下取得了有竞争力的结果。鉴于当前基准和指标的局限，我们建议社区重新审视相关的主流研究，并谨慎对待持续追求最先进性能是否能得出令人信服且普适的结论。代码和模型见 https://github.com/NVlabs/BEV-Planner

Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? An Investigation and the HOI-Synth Domain Adaptation Benchmark

  • paper_url: http://arxiv.org/abs/2312.02672
  • repo_url: None
  • paper_authors: Rosario Leonardi, Antonino Furnari, Francesco Ragusa, Giovanni Maria Farinella
  • for: 本研究探讨了使用合成数据提高 egocentric 视野中手-物体互动检测的效果。
  • methods: 我们引入了一种可自动生成合成图像的模拟器，能够自动标注手-物体接触状态、边界框和像素级分割掩码。我们还使用了领域适应技术来提升模型性能。
  • results: 实验结果显示，使用合成数据和领域适应技术可以达到与传统全监督方法相当的性能，而只需标注一小部分真实数据。当使用由真实目标环境和物体的3D模型生成的域内合成数据时，我们的最佳模型相对于仅用标注真实数据的标准全监督方法取得了一致的性能提升。我们还建立了一个新的领域适应基准（HOI-Synth），并提供了基线结果，以鼓励社区参与这一具有挑战性的任务。
    Abstract In this study, we investigate the effectiveness of synthetic data in enhancing hand-object interaction detection within the egocentric vision domain. We introduce a simulator able to generate synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Through comprehensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, we demonstrate that the use of synthetic data and domain adaptation techniques allows for comparable performance to conventional supervised methods while requiring annotations on only a fraction of the real data. When tested with in-domain synthetic data generated from 3D models of real target environments and objects, our best models show consistent performance improvements with respect to standard fully supervised approaches based on labeled real data only. Our study also sets a new benchmark of domain adaptation for egocentric hand-object interaction detection (HOI-Synth) and provides baseline results to encourage the community to engage in this challenging task. We release the generated data, code, and the simulator at the following link: https://iplab.dmi.unict.it/HOI-Synth/.
    摘要 在这项研究中，我们考察了合成数据在提升第一人称视角下手-物体交互检测方面的有效性。我们引入了一个模拟器，能够自动生成带有手-物体接触状态、边界框和像素级分割掩码标注的合成图像。通过在VISOR、EgoHOS和ENIGMA-51三个第一人称数据集上进行全面的实验和对比分析，我们证明了借助合成数据和领域适应技术，只需对一小部分真实数据进行标注，即可达到与传统监督方法相当的性能。当使用由真实目标环境和物体的3D模型生成的域内合成数据进行测试时，我们的最佳模型相对于仅基于标注真实数据的标准全监督方法取得了一致的性能提升。本研究还建立了第一人称手-物体交互检测领域适应的新基准（HOI-Synth），并提供了基线结果，以鼓励社区参与这一具有挑战性的任务。我们在以下链接发布生成的数据、代码和模拟器：https://iplab.dmi.unict.it/HOI-Synth/。

Generating Visually Realistic Adversarial Patch

  • paper_url: http://arxiv.org/abs/2312.03030
  • repo_url: None
  • paper_authors: Xiaosen Wang, Kunyu Wang
  • for: 研究如何生成视觉上逼真的对抗补丁以欺骗深度神经网络（DNNs），揭示其给安全关键应用带来的威胁。
  • methods: 我们提出了一种有效的攻击方法VRAP（视觉真实对抗补丁），通过将补丁约束在真实图像的邻域内保证视觉真实性，在最差位置上优化以实现位置无关性，并采用Total Variance损失和伽马变换使生成的补丁可打印且不丢失信息。
  • results: 实验表明，VRAP在数字世界中具有出色的攻击性能，生成的对抗补丁在物理世界中可伪装成涂鸦或商标而不被察觉，使深度模型产生误判，对基于DNN的应用构成重大威胁。
    Abstract Deep neural networks (DNNs) are vulnerable to various types of adversarial examples, bringing huge threats to security-critical applications. Among these, adversarial patches have drawn increasing attention due to their good applicability to fool DNNs in the physical world. However, existing works often generate patches with meaningless noise or patterns, making it conspicuous to humans. To address this issue, we explore how to generate visually realistic adversarial patches to fool DNNs. Firstly, we analyze that a high-quality adversarial patch should be realistic, position irrelevant, and printable to be deployed in the physical world. Based on this analysis, we propose an effective attack called VRAP, to generate visually realistic adversarial patches. Specifically, VRAP constrains the patch in the neighborhood of a real image to ensure the visual reality, optimizes the patch at the poorest position for position irrelevance, and adopts Total Variance loss as well as gamma transformation to make the generated patch printable without losing information. Empirical evaluations on the ImageNet dataset demonstrate that the proposed VRAP exhibits outstanding attack performance in the digital world. Moreover, the generated adversarial patches can be disguised as the scrawl or logo in the physical world to fool the deep models without being detected, bringing significant threats to DNNs-enabled applications.
    摘要 深度神经网络（DNNs）易受各种对抗样本的攻击，给安全关键应用带来巨大威胁。其中，对抗补丁因其在物理世界中欺骗DNN的良好适用性而日益受到关注。然而，现有工作生成的补丁往往带有无意义的噪声或图案，容易引起人类注意。为解决这一问题，我们研究如何生成视觉上逼真的对抗补丁。我们首先分析指出，高质量的对抗补丁应当逼真、与位置无关且可打印，以便部署到物理世界。基于该分析，我们提出了一种有效的攻击方法VRAP来生成视觉逼真的对抗补丁：VRAP将补丁约束在真实图像的邻域内以保证视觉真实性，在最差位置上优化补丁以实现位置无关性，并采用Total Variance损失和伽马变换使生成的补丁可打印且不丢失信息。在ImageNet数据集上的实证评估表明，VRAP在数字世界中表现出出色的攻击性能；此外，生成的对抗补丁可在物理世界中伪装成涂鸦或商标，在不被察觉的情况下欺骗深度模型，对基于DNN的应用构成重大威胁。

Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians

  • paper_url: http://arxiv.org/abs/2312.03029
  • repo_url: https://github.com/yuelangx/gaussian-head-avatar
  • paper_authors: Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, Yebin Liu
  • for: 高精度3D头像创建是研究热点,但在缺乏视图的情况下存在大型挑战。本文提出了基于可控3D Gaussian的 Gaussian Head Avatar,以实现高精度头像模型。
  • methods: 我们同时优化中性3D高斯和基于MLP的形变场来捕捉复杂表情。两部分相互促进，使方法既能建模细粒度的动态细节，又能保证表情准确。此外，我们设计了一种基于隐式SDF和Deep Marching Tetrahedra的几何引导初始化策略，以保证训练过程的稳定与收敛。
  • results: 我们的方法在稀疏视角设置下优于其他最先进方法，即使在夸张表情下也能实现2K分辨率的超高保真渲染质量。
    Abstract Creating high-fidelity 3D head avatars has always been a research hotspot, but there remains a great challenge under lightweight sparse view setups. In this paper, we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other, thereby our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore, we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions.
    摘要 创建高保真3D头部数字人一直是研究热点，但在轻量级稀疏视角设置下仍存在很大挑战。在这篇论文中，我们提出以可控3D高斯表示的Gaussian Head Avatar，用于高保真头部数字人建模。我们同时优化中性3D高斯和一个完全可学习的基于MLP的形变场来捕捉复杂表情，两者相互促进，使我们的方法既能建模细粒度的动态细节，又能保证表情准确。此外，我们设计了一种基于隐式SDF和Deep Marching Tetrahedra的几何引导初始化策略，以保证训练过程的稳定性和收敛性。实验表明，我们的方法超越了其他最先进的稀疏视角方法，即使在夸张表情下也能实现2K分辨率的超高保真渲染质量。

Double Integral Enhanced Zeroing Neural Network Optimized with ALSOA fostered Lung Cancer Classification using CT Images

  • paper_url: http://arxiv.org/abs/2312.03028
  • repo_url: None
  • paper_authors: V S Priya Sumitha, V. Keerthika, A. Geetha
  • for: 本研究旨在提出一种智能化的自动肺癌分类方法，以提高肺癌检测的准确率和效率。
  • methods: 本研究使用经ALSOA优化的双积分增强归零神经网络（Double Integral Enhanced Zeroing Neural Network）进行肺癌分类，并对CT图像进行预处理和特征提取。
  • results: 与现有方法相比，本研究的方法分别取得了18.32%、27.20%和34.32%的准确率提升。
    Abstract Lung cancer is one of the deadliest diseases and the leading cause of illness and death. Since lung cancer cannot predicted at premature stage, it able to only be discovered more broadly once it has spread to other lung parts. The risk grows when radiologists and other specialists determine whether lung cancer is current. Owing to significance of determining type of treatment and its depth based on severity of the illness, critical to develop smart and automatic cancer prediction scheme is precise, at which stage of cancer. In this paper, Double Integral Enhanced Zeroing Neural Network Optimized with ALSOA fostered Lung Cancer Classification using CT Images (LCC-DIEZNN-ALSO-CTI) is proposed. Initially, input CT image is amassed from lung cancer dataset. The input CT image is pre-processing via Unscented Trainable Kalman Filtering (UTKF) technique. In pre-processing stage unwanted noise are removed from CT images. Afterwards, grayscale statistic features and Haralick texture features extracted by Adaptive and Concise Empirical Wavelet Transform (ACEWT). The proposed model is implemented on MATLAB. The performance of the proposed method is analyzed through existing techniques. The proposed method attains 18.32%, 27.20%, and 34.32% higher accuracy analyzed with existing method likes Deep Learning Assisted Predict of Lung Cancer on Computed Tomography Images Utilizing AHHMM (LCC-AHHMM-CT), Convolutional neural networks based pulmonary nodule malignancy assessment in pipeline for classifying lung cancer (LCC-ICNN-CT), Automated Decision Support Scheme for Lung Cancer Identification with Categorization (LCC-RFCN-MLRPN-CT) methods respectively.
    摘要 肺癌是一种非常致命的疾病,也是致死率最高的疾病之一。由于肺癌在早期难以预测,因此通常只能在其他肺部已经扩散之后才能发现。随着疾病的严重程度增加,检测肺癌的重要性也在提高。为了开发一种智能和自动的肺癌预测方案,在不同的阶段对肺癌进行精准的预测是非常重要。在本文中,我们提出了一种基于CT图像的肺癌分类方法,即Double Integral Enhanced Zeroing Neural Network Optimized with ALSOA fostered Lung Cancer Classification using CT Images (LCC-DIEZNN-ALSO-CTI)。首先,我们从肺癌数据集中收集了输入CT图像。然后,我们使用Unscented Trainable Kalman Filtering (UTKF)技术进行预处理,以移除CT图像中的噪声。接着,我们使用Adaptive and Concise Empirical Wavelet Transform (ACEWT)提取了灰度统计特征和Haralick текстур特征。我们在MATLAB中实现了该方法。我们对该方法进行了分析,并与现有的方法进行比较,包括Deep Learning Assisted Predict of Lung Cancer on Computed Tomography Images Utilizing AHHMM (LCC-AHHMM-CT)、Convolutional neural networks based pulmonary nodule malignancy assessment in pipeline for classifying lung cancer (LCC-ICNN-CT)和Automated Decision Support Scheme for Lung Cancer Identification with Categorization (LCC-RFCN-MLRPN-CT)等方法。结果表明,该方法在肺癌预测中达到了18.32%、27.20%和34.32%的高精度。

TPA3D: Triplane Attention for Fast Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2312.02647
  • repo_url: None
  • paper_authors: Hong-En Chen, Bin-Shih Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang
  • for: 文本到3D生成
  • methods: 基于GAN的模型结合三平面注意力机制，根据文本生成3D内容
  • results: 生成的3D纹理形状质量高、与文本描述高度吻合，且计算效率高
    Abstract Due to the lack of large-scale text-3D correspondence data, recent text-to-3D generation works mainly rely on utilizing 2D diffusion models for synthesizing 3D data. Since diffusion-based methods typically require significant optimization time for both training and inference, the use of GAN-based models would still be desirable for fast 3D generation. In this work, we propose Triplane Attention for text-guided 3D generation (TPA3D), an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation. With only 3D shape data and their rendered 2D images observed during training, our TPA3D is designed to retrieve detailed visual descriptions for synthesizing the corresponding 3D mesh data. This is achieved by the proposed attention mechanisms on the extracted sentence and word-level text features. In our experiments, we show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions, while impressive computation efficiency can be observed.
    摘要 由于缺乏大规模的文本-3D对应数据，近期的文本到3D生成工作主要依靠2D扩散模型来合成3D数据。由于基于扩散的方法在训练和推理阶段通常都需要大量优化时间，基于GAN的模型对于快速3D生成仍然具有吸引力。在这项工作中，我们提出了用于文本引导3D生成的三平面注意力方法TPA3D，这是一种端到端可训练的、基于GAN的深度学习模型，可实现快速的文本到3D生成。在训练中仅需3D形状数据及其渲染的2D图像，TPA3D通过在提取的句子级和词级文本特征上施加所提出的注意力机制，检索细致的视觉描述，用于合成对应的3D网格数据。实验表明，TPA3D能够生成与细粒度描述高度一致的高质量3D纹理形状，同时具有出色的计算效率。

Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs

  • paper_url: http://arxiv.org/abs/2312.02638
  • repo_url: https://github.com/fpv-iplab/synchronization-is-all-you-need
  • paper_authors: Camillo Quattrocchi, Antonino Furnari, Daniele Di Mauro, Mario Valerio Giuffrida, Giovanni Maria Farinella
  • for: 这篇论文旨在将原本为外部（固定）相机设计的时序动作分割系统迁移到以穿戴式相机拍摄的第一人称（egocentric）场景，而无需新的第一人称视频标注。
  • methods: 论文提出了一种新方法，利用已有标注的外部相机视频以及一组新的、无标注的同步外部-第一人称视频对来进行适应。该方法基于知识蒸馏，分别在特征层面和模型层面加以实现。
  • results: 结果显示，该方法能够实现向第一人称视频的适应，效果优于传统的无监督领域适应和时序对齐方法。特别地，在不使用任何第一人称标注的情况下，该方法的性能可与在有标注第一人称数据上训练的监督方法相当。
    Abstract We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where wearable cameras capture video data. The conventional supervised approach requires the collection and labeling of a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology which performs the adaptation leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature and model level. To evaluate our approach, we introduce a new benchmark based on the Assembly101 dataset. Results demonstrate the feasibility and effectiveness of the proposed method against classic unsupervised domain adaptation and temporal sequence alignment approaches. Remarkably, without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99% (28.59% vs 12.60%) improvement in the edit score on the Assembly101 dataset compared to a baseline model trained solely on exocentric data.
    摘要 我们考虑将原本为外部（固定）相机设计的时序动作分割系统迁移到第一人称场景，即由穿戴式相机采集视频数据的场景。传统的监督方法需要收集并标注一组新的第一人称视频来适配模型，成本高且耗时。为此，我们提出了一种新方法，利用已有标注的外部相机视频和一组新的、无需时序动作分割标注的同步外部-第一人称视频对来完成适应。我们以知识蒸馏的方式实现该方法，并分别在特征层面和模型层面进行了研究。为评估我们的方法，我们基于Assembly101数据集构建了一个新基准。结果证明了所提方法相对于经典的无监督领域适应和时序对齐方法的可行性与有效性。值得注意的是，在不使用任何花哨技巧、也从未见过任何第一人称标注的情况下，我们的最佳模型与在有标注第一人称数据上训练的监督方法表现相当，在Assembly101数据集上的编辑得分相比仅用外部相机数据训练的基线模型提升了15.99个百分点（28.59%对12.60%）。
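The adaptation step lends itself to a simple sketch: a frozen exocentric model acts as teacher on the exo half of each synchronized pair, while the egocentric student is trained to match its intermediate features. The backbone shapes, the MSE objective, and the names below are illustrative assumptions, not the paper's exact losses.

```python
# Hedged sketch of feature-level distillation on unlabeled synchronized pairs.
import torch
import torch.nn as nn

def distillation_step(teacher, student, exo_clip, ego_clip, optimizer):
    """exo_clip / ego_clip: synchronized clips, e.g. (B, T, D) pre-extracted features."""
    with torch.no_grad():
        t_feat = teacher(exo_clip)                      # frozen exocentric model
    s_feat = student(ego_clip)                          # egocentric model being adapted
    loss = nn.functional.mse_loss(s_feat, t_feat)       # align per-timestep representations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with placeholder linear "backbones"
teacher = nn.Linear(128, 64).eval()
student = nn.Linear(128, 64)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
loss = distillation_step(teacher, student,
                         torch.rand(4, 16, 128), torch.rand(4, 16, 128), opt)
```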

Stable Diffusion Exposed: Gender Bias from Prompt to Image

  • paper_url: http://arxiv.org/abs/2312.03027
  • repo_url: None
  • paper_authors: Yankun Wu, Yuta Nakashima, Noa Garcia
  • for: 本研究旨在检查稳定扩散图像生成模型中的性别偏见,以及这些偏见如何影响图像的生成。
  • methods: 本研究使用自动化评估协议来分析稳定扩散图像中的性别指标的影响。基于之前的研究,我们探讨了性别指标如何影响图像中的对象和布局的表现。
  • results: 我们发现图像中的对象 display 存在性别偏见,例如♂♂和♀♀的 instrumente 不同,而且图像的总布局也存在差异。此外,中性提示比♂提示更加倾向于生成♂类型的图像,这反映了稳定扩散图像生成模型中的性别偏见。
    Abstract Recent studies have highlighted biases in generative models, shedding light on their predisposition towards gender-based stereotypes and imbalances. This paper contributes to this growing body of research by introducing an evaluation protocol designed to automatically analyze the impact of gender indicators on Stable Diffusion images. Leveraging insights from prior work, we explore how gender indicators not only affect gender presentation but also the representation of objects and layouts within the generated images. Our findings include the existence of differences in the depiction of objects, such as instruments tailored for specific genders, and shifts in overall layouts. We also reveal that neutral prompts tend to produce images more aligned with masculine prompts than their feminine counterparts, providing valuable insights into the nuanced gender biases inherent in Stable Diffusion.
    摘要 近期研究强调了生成模型中的偏见，揭示了其趋向于基于性别的刻板印象与不均衡。本文在这一不断增长的研究方向上做出贡献，提出了一种自动分析性别指标对Stable Diffusion图像影响的评估协议。借鉴此前工作的见解，我们探讨了性别指标不仅如何影响性别呈现，还如何影响生成图像中物体与布局的表现。我们的发现包括：物体的呈现存在差异，例如为特定性别定制的器物各不相同，整体布局也随之变化。我们还发现，中性提示生成的图像往往更接近男性提示而非女性提示的结果，为理解Stable Diffusion中潜藏的细微性别偏见提供了有价值的启示。

Diffusion Noise Feature: Accurate and Fast Generated Image Detection

  • paper_url: http://arxiv.org/abs/2312.02625
  • repo_url: None
  • paper_authors: Yichi Zhang, Xiaogang Xu
  • for: 本研究旨在提高生成图像检测精度和普适性。
  • methods: 本文利用预训练扩散模型对图像进行逆扩散处理，并利用真实图像与生成图像在该过程中潜在高斯表示的差异来放大生成图像中的细微伪影，从而增强检测。
  • results: 所提方法在生成图像检测上具有高精度、强鲁棒性和良好的泛化能力，并在广泛使用的标准数据集上达到了最先进的检测效果。
    Abstract Generative models have reached an advanced stage where they can produce remarkably realistic images. However, this remarkable generative capability also introduces the risk of disseminating false or misleading information. Notably, existing image detectors for generated images encounter challenges such as low accuracy and limited generalization. This paper seeks to address this issue by seeking a representation with strong generalization capabilities to enhance the detection of generated images. Our investigation has revealed that real and generated images display distinct latent Gaussian representations when subjected to an inverse diffusion process within a pre-trained diffusion model. Exploiting this disparity, we can amplify subtle artifacts in generated images. Building upon this insight, we introduce a novel image representation known as Diffusion Noise Feature (DNF). DNF is an ensemble representation that estimates the noise generated during the inverse diffusion process. A simple classifier, e.g., ResNet, trained on DNF achieves high accuracy, robustness, and generalization capabilities for detecting generated images, even from previously unseen classes or models. We conducted experiments using a widely recognized and standard dataset, achieving state-of-the-art effects of Detection.
    摘要 生成模型已经发展到能够生成极为逼真图像的阶段，但这种强大的生成能力也带来了传播虚假或误导信息的风险。值得注意的是，现有的生成图像检测器面临准确率低、泛化能力有限等挑战。本文旨在寻找一种具有强泛化能力的表示来增强生成图像的检测。我们的研究发现，在预训练扩散模型中进行逆扩散处理时，真实图像与生成图像会呈现出不同的潜在高斯表示；利用这一差异，可以放大生成图像中的细微伪影。基于这一观察，我们提出了一种新的图像表示，称为扩散噪声特征（DNF）。DNF是一种集成表示，用于估计逆扩散过程中产生的噪声。在DNF上训练一个简单的分类器（如ResNet），即可在检测生成图像时获得高准确率、强鲁棒性和良好的泛化能力，甚至能应对此前未见过的类别或模型。我们在广泛认可的标准数据集上进行了实验，取得了最先进的检测效果。
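A conceptual sketch of the idea is below: the noise a pretrained denoiser predicts while an image is pushed back through a few diffusion steps is stacked into a feature and classified. The `denoiser`, the schedule, the step choices, and the crude inversion update are placeholders, not the paper's exact DDIM-style inversion.

```python
# Conceptual sketch of a Diffusion Noise Feature: per-step noise predictions are
# collected during a simplified inversion and fed to a standard classifier head.
import torch
import torch.nn as nn

def diffusion_noise_feature(image, denoiser, alphas_cumprod, steps=(10, 20, 30)):
    """image: (B, C, H, W); alphas_cumprod: 1-D tensor of the diffusion schedule."""
    x = image
    feats = []
    for t in steps:                               # crude stand-in for DDIM inversion
        eps = denoiser(x, t)                      # predicted noise at this step
        a_t = alphas_cumprod[t]
        x = a_t.sqrt() * x + (1 - a_t).sqrt() * eps
        feats.append(eps)
    return torch.cat(feats, dim=1)                # (B, C*len(steps), H, W) DNF

# toy usage: a fake denoiser and a small CNN head on the DNF
denoiser = lambda x, t: torch.randn_like(x)       # stand-in for a pretrained model
schedule = torch.linspace(0.999, 0.01, 50)
dnf = diffusion_noise_feature(torch.rand(2, 3, 32, 32), denoiser, schedule)
head = nn.Sequential(nn.Conv2d(dnf.shape[1], 16, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
logits = head(dnf)                                # real vs. generated
```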

DreaMo: Articulated 3D Reconstruction From A Single Casual Video

  • paper_url: http://arxiv.org/abs/2312.02617
  • repo_url: None
  • paper_authors: Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang
  • for: 从单个随意拍摄的互联网视频中重建可动（articulated）3D形状，并解决视角覆盖不足区域带来的挑战。
  • methods: 提出了名为DreaMo的方法，利用视角条件的扩散先验和若干专门设计的正则项，在进行形状重建的同时处理低覆盖区域，并引入骨架生成策略，从学习到的神经骨骼和蒙皮权重中生成人类可理解的骨架。
  • results: 在自行收集的、视角覆盖不完整的互联网视频集上，DreaMo在新视角渲染、精细可动形状重建和骨架生成方面均取得了可观的质量；研究还发现，现有方法因视角覆盖不足而无法恢复正确的几何。
    Abstract Articulated 3D reconstruction has valuable applications in various domains, yet it remains costly and demands intensive work from domain experts. Recent advancements in template-free learning methods show promising results with monocular videos. Nevertheless, these approaches necessitate a comprehensive coverage of all viewpoints of the subject in the input video, thus limiting their applicability to casually captured videos from online sources. In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete. We propose DreaMo that jointly performs shape reconstruction while solving the challenging low-coverage regions with view-conditioned diffusion prior and several tailored regularizations. In addition, we introduce a skeleton generation strategy to create human-interpretable skeletons from the learned neural bones and skinning weights. We conduct our study on a self-collected internet video collection characterized by incomplete view coverage. DreaMo shows promising quality in novel-view rendering, detailed articulated shape reconstruction, and skeleton generation. Extensive qualitative and quantitative studies validate the efficacy of each proposed component, and show existing methods are unable to solve correct geometry due to the incomplete view coverage.
    摘要 可动3D重建在多个领域都有重要应用，但其成本高昂，且需要领域专家投入大量工作。近期无模板学习方法在单目视频上取得了可喜的结果，然而这些方法要求输入视频完整覆盖目标的所有视角，因而难以应用于从网络上随意获取的视频。在这项工作中，我们研究从单个随意拍摄、视角覆盖不完整的互联网视频中进行可动3D形状重建。我们提出DreaMo，在进行形状重建的同时，借助视角条件的扩散先验和若干专门设计的正则项来处理具有挑战性的低覆盖区域。此外，我们引入了一种骨架生成策略，从学习到的神经骨骼和蒙皮权重中生成人类可理解的骨架。我们在自行收集的、视角覆盖不完整的互联网视频集上进行了研究。DreaMo在新视角渲染、精细可动形状重建和骨架生成方面均展现出可观的质量。大量定性与定量实验验证了各个组成部分的有效性，并表明现有方法由于视角覆盖不足而无法恢复正确的几何。

Facilitating the Production of Well-tailored Video Summaries for Sharing on Social Media

  • paper_url: http://arxiv.org/abs/2312.02616
  • repo_url: None
  • paper_authors: Evlampios Apostolidis, Konstantinos Apostolidis, Vasileios Mezaris
  • for: 这篇论文提供了一种基于Web的视频摘要生成工具,用于在社交媒体上分享摘要。
  • methods: 该工具集成了用于视频摘要和宽高比转换的AI模型，支持一键式摘要流程，可根据目标平台对视频长度和宽高比的要求生成多个摘要。
  • results: 该工具可以生成高质量的摘要,并且可以根据用户的需求进行自定义。
    Abstract This paper presents a web-based tool that facilitates the production of tailored summaries for online sharing on social media. Through an interactive user interface, it supports a ``one-click'' video summarization process. Based on the integrated AI models for video summarization and aspect ratio transformation, it facilitates the generation of multiple summaries of a full-length video according to the needs of target platforms with regard to the video's length and aspect ratio.
    摘要 这篇论文介绍了一个基于网页的工具，用于制作适合在社交媒体上分享的定制视频摘要。借助交互式用户界面，该工具支持一键式的视频摘要流程。通过集成的视频摘要与宽高比转换AI模型，它可以根据目标平台对视频长度和宽高比的要求，为完整长度的视频生成多个摘要。

Projection Regret: Reducing Background Bias for Novelty Detection via Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02615
  • repo_url: None
  • paper_authors: Sungik Choi, Hankook Lee, Honglak Lee, Moontae Lee
  • for: 检测扩散模型中的异常（即分布外，OOD）样本；扩散模型近来在机器学习中备受关注。
  • methods: 提出一种名为Projection Regret（PR）的新颖性检测方法：通过计算测试图像与其基于扩散的投影之间的感知距离来检测异常，并通过与递归投影进行对比来消除非语义背景信息带来的偏差。
  • results: PR以显著优势超越了此前基于生成模型的新颖性检测方法，证明了其在检测异常样本方面的有效性。
    Abstract Novelty detection is a fundamental task of machine learning which aims to detect abnormal ($\textit{i.e.}$ out-of-distribution (OOD)) samples. Since diffusion models have recently emerged as the de facto standard generative framework with surprising generation results, novelty detection via diffusion models has also gained much attention. Recent methods have mainly utilized the reconstruction property of in-distribution samples. However, they often suffer from detecting OOD samples that share similar background information to the in-distribution data. Based on our observation that diffusion models can \emph{project} any sample to an in-distribution sample with similar background information, we propose \emph{Projection Regret (PR)}, an efficient novelty detection method that mitigates the bias of non-semantic information. To be specific, PR computes the perceptual distance between the test image and its diffusion-based projection to detect abnormality. Since the perceptual distance often fails to capture semantic changes when the background information is dominant, we cancel out the background bias by comparing it against recursive projections. Extensive experiments demonstrate that PR outperforms the prior art of generative-model-based novelty detection methods by a significant margin.
    摘要 新颖性检测是机器学习的基本任务之一，旨在检测异常（即分布外，OOD）样本。由于扩散模型近来凭借出色的生成效果成为事实上的标准生成框架，基于扩散模型的新颖性检测也受到了广泛关注。近期方法主要利用分布内样本的重建特性，但当OOD样本与分布内数据具有相似的背景信息时，这些方法往往难以将其检出。基于我们的观察：扩散模型能够将任意样本投影为具有相似背景信息的分布内样本，我们提出了Projection Regret（PR），一种能减轻非语义信息偏差的高效新颖性检测方法。具体来说，PR通过计算测试图像与其基于扩散的投影之间的感知距离来检测异常。由于当背景信息占主导地位时感知距离往往无法捕捉语义变化，我们通过与递归投影进行对比来消除背景偏差。大量实验表明，PR以显著优势超越了此前基于生成模型的新颖性检测方法。
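The score itself is simple enough to sketch under stated assumptions: `project(x)` stands in for one diffusion-based projection of x back onto the learned distribution, and `perc_dist` for a perceptual distance such as LPIPS; the recursive projection supplies the background reference that is subtracted off.

```python
# Sketch of a Projection Regret style score with placeholder components.
import torch

def projection_regret(x, project, perc_dist):
    p1 = project(x)                    # projection of the test image
    p2 = project(p1)                   # recursive projection (background reference)
    return perc_dist(x, p1) - perc_dist(p1, p2)   # high score -> likely OOD

# toy usage with placeholder components
project = lambda img: img * 0.9 + 0.05                 # stand-in for diffusion projection
perc_dist = lambda a, b: ((a - b) ** 2).flatten(1).mean(dim=1)  # stand-in for LPIPS
scores = projection_regret(torch.rand(4, 3, 64, 64), project, perc_dist)
```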

A Unified Simulation Framework for Visual and Behavioral Fidelity in Crowd Analysis

  • paper_url: http://arxiv.org/abs/2312.02613
  • repo_url: None
  • paper_authors: Niccolò Bisagno, Nicola Garau, Antonio Luigi Stefani, Nicola Conci
  • for: 这篇论文研究人群模拟，以生成适合计算机视觉任务的标注数据。
  • methods: 论文提出了一个名为UniCrowd的人群模拟器，以及与之配套的验证管道。
  • results: 这个论文通过使用 UniCrowd 生成了一些适合计算机视觉任务的标注数据,包括人群排版、人姿估计、轨迹分析和预测、以及异常检测等。
    Abstract Simulation is a powerful tool to easily generate annotated data, and a highly desirable feature, especially in those domains where learning models need large training datasets. Machine learning and deep learning solutions, have proven to be extremely data-hungry and sometimes, the available real-world data are not sufficient to effectively model the given task. Despite the initial skepticism of a portion of the scientific community, the potential of simulation has been largely confirmed in many application areas, and the recent developments in terms of rendering and virtualization engines, have shown a good ability also in representing complex scenes. This includes environmental factors, such as weather conditions and surface reflectance, as well as human-related events, like human actions and behaviors. We present a human crowd simulator, called UniCrowd, and its associated validation pipeline. We show how the simulator can generate annotated data, suitable for computer vision tasks, in particular for detection and segmentation, as well as the related applications, as crowd counting, human pose estimation, trajectory analysis and prediction, and anomaly detection.
    摘要 模拟是一种可以轻松生成标注数据的强大工具，在那些学习模型需要大规模训练数据的领域尤其重要。机器学习和深度学习方法已被证明极度依赖数据，而现有的真实数据有时不足以有效地建模给定任务。尽管一部分科研人员最初对模拟持怀疑态度，其潜力已在许多应用领域得到广泛证实；渲染与虚拟化引擎的最新进展也表明，模拟能够很好地表现复杂场景，包括天气条件、表面反射等环境因素，以及人的动作和行为等与人相关的事件。我们提出了一个名为UniCrowd的人群模拟器及其配套的验证管道，并展示了该模拟器如何生成适用于计算机视觉任务的标注数据，特别是检测与分割，以及人群计数、人体姿态估计、轨迹分析与预测、异常检测等相关应用。

Accelerating Learnt Video Codecs with Gradient Decay and Layer-wise Distillation

  • paper_url: http://arxiv.org/abs/2312.02605
  • repo_url: None
  • paper_authors: Tianhao Peng, Ge Gao, Heming Sun, Fan Zhang, David Bull
  • for: 这篇论文旨在提出一种与模型无关的剪枝方案，以降低学习型视频编解码器的计算复杂度、提升其实用性。
  • methods: 论文提出了一种基于梯度衰减和自适应逐层蒸馏的模型无关剪枝策略，在稀疏化过程中降低计算复杂度的同时保持视频质量。
  • results: 实验结果显示，该策略在三种流行的端到端学习视频编解码器上实现了最高65%的MACs削减和2倍的解码加速，同时BD-PSNR下降不超过0.3dB。
    Abstract In recent years, end-to-end learnt video codecs have demonstrated their potential to compete with conventional coding algorithms in term of compression efficiency. However, most learning-based video compression models are associated with high computational complexity and latency, in particular at the decoder side, which limits their deployment in practical applications. In this paper, we present a novel model-agnostic pruning scheme based on gradient decay and adaptive layer-wise distillation. Gradient decay enhances parameter exploration during sparsification whilst preventing runaway sparsity and is superior to the standard Straight-Through Estimation. The adaptive layer-wise distillation regulates the sparse training in various stages based on the distortion of intermediate features. This stage-wise design efficiently updates parameters with minimal computational overhead. The proposed approach has been applied to three popular end-to-end learnt video codecs, FVC, DCVC, and DCVC-HEM. Results confirm that our method yields up to 65% reduction in MACs and 2x speed-up with less than 0.3dB drop in BD-PSNR. Supporting code and supplementary material can be downloaded from: https://jasminepp.github.io/lightweightdvc/
    摘要 近年来，端到端学习的视频编解码器在压缩效率上已展现出与传统编码算法竞争的潜力。然而，大多数基于学习的视频压缩模型计算复杂度高、时延大，尤其是在解码端，这限制了它们在实际应用中的部署。本文提出了一种新颖的、与模型无关的剪枝方案，基于梯度衰减和自适应逐层蒸馏。梯度衰减在稀疏化过程中增强了参数探索，同时避免了稀疏度失控，其效果优于标准的直通估计（Straight-Through Estimation）。自适应逐层蒸馏则根据中间特征的失真在不同阶段调控稀疏训练，这种分阶段设计能以极小的计算开销高效更新参数。该方法已应用于FVC、DCVC和DCVC-HEM三种流行的端到端学习视频编解码器。结果证实，我们的方法可带来最高65%的MACs削减和2倍加速，而BD-PSNR下降不超过0.3dB。支持代码和补充材料可从 https://jasminepp.github.io/lightweightdvc/ 下载。
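A minimal sketch of the gradient-decay idea is shown below: pruned weights are masked out in the forward pass, while their gradients are merely scaled by a factor that decays over training rather than being passed straight through. The layer type, sparsity level, decay schedule, and the absence of the layer-wise distillation term are all simplifications, not the paper's configuration.

```python
# Sketch of gradient-decay sparsification on a single linear layer.
import torch
import torch.nn as nn

class PrunedLinear(nn.Module):
    def __init__(self, in_f, out_f, sparsity=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_f))
        k = int(self.weight.numel() * sparsity)
        thresh = self.weight.abs().flatten().kthvalue(k).values
        self.register_buffer("mask", (self.weight.abs() > thresh).float())
        self.grad_scale = 1.0                      # decays toward 0 during training
        self.weight.register_hook(self._decay_grad)

    def _decay_grad(self, grad):
        # full gradient for surviving weights, decayed gradient for pruned ones
        return grad * (self.mask + (1 - self.mask) * self.grad_scale)

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

layer = PrunedLinear(64, 32)
for step in range(100):
    layer.grad_scale = max(0.0, 1.0 - step / 50)   # simple linear decay schedule
    out = layer(torch.rand(8, 64)).sum()
    out.backward()
```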

An Integrated System for Spatio-Temporal Summarization of 360-degrees Videos

  • paper_url: http://arxiv.org/abs/2312.02576
  • repo_url: None
  • paper_authors: Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
  • for: 本研究旨在开发一个集成系统，用于对360度视频进行时空摘要。
  • methods: 该系统结合了最先进的360度视频显著性检测方法（ATSal和SST-Sal）与视频摘要方法（CA-SUM），并包含一个判别360度视频由静止还是移动摄像机拍摄、进而选择合适显著性检测方法的决策机制，以及一个负责生成包含显著事件的常规2D视频的组件。
  • results: 在两个360度视频显著性检测数据集（VR-EyeTracking和Sports-360）上的量化评估表明了该决策机制的准确性和积极作用；针对这些数据集的定性分析进一步展示了决策机制的工作方式、两种显著性检测方法各自的优缺点，以及所训练的摘要方法相对于更传统方法的优越性能。
    Abstract In this work, we present an integrated system for spatiotemporal summarization of 360-degrees videos. The video summary production mainly involves the detection of salient events and their synopsis into a concise summary. The analysis relies on state-of-the-art methods for saliency detection in 360-degrees video (ATSal and SST-Sal) and video summarization (CA-SUM). It also contains a mechanism that classifies a 360-degrees video based on the use of static or moving camera during recording and decides which saliency detection method will be used, as well as a 2D video production component that is responsible to create a conventional 2D video containing the salient events in the 360-degrees video. Quantitative evaluations using two datasets for 360-degrees video saliency detection (VR-EyeTracking, Sports-360) show the accuracy and positive impact of the developed decision mechanism, and justify our choice to use two different methods for detecting the salient events. A qualitative analysis using content from these datasets, gives further insights about the functionality of the decision mechanism, shows the pros and cons of each used saliency detection method and demonstrates the advanced performance of the trained summarization method against a more conventional approach.
    摘要 在这项工作中，我们提出了一个用于360度视频时空摘要的集成系统。视频摘要的制作主要包括显著事件的检测及其浓缩为简短摘要。分析依赖于最先进的360度视频显著性检测方法（ATSal和SST-Sal）与视频摘要方法（CA-SUM）。系统还包含一个机制，根据拍摄时使用的是静止还是移动摄像机对360度视频进行分类，并据此决定使用哪种显著性检测方法；此外还有一个2D视频生成组件，负责将360度视频中的显著事件制作成常规2D视频。在两个360度视频显著性检测数据集（VR-EyeTracking、Sports-360）上的量化评估显示了该决策机制的准确性和积极作用，也印证了我们采用两种不同显著性检测方法的选择。基于这些数据集内容的定性分析进一步揭示了决策机制的工作方式，展示了每种显著性检测方法的优缺点，并表明所训练的摘要方法相对于更传统的方案具有更好的性能。
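The camera-type routing step can be sketched with a simple motion statistic over frame differences, as below. The threshold, the statistic, and the mapping of ATSal to static and SST-Sal to moving cameras are assumptions for illustration, not the system's actual rule.

```python
# Toy decision mechanism: estimate global camera motion from frame differences
# of the equirectangular video and route the clip to a saliency detector.
import numpy as np

def choose_saliency_method(frames, motion_threshold=8.0):
    """frames: (T, H, W) grayscale equirectangular frames."""
    diffs = [np.abs(frames[i + 1].astype(float) - frames[i].astype(float)).mean()
             for i in range(len(frames) - 1)]
    camera_motion = float(np.median(diffs))        # large global change ~ moving camera
    return "ATSal" if camera_motion < motion_threshold else "SST-Sal"

# toy usage: a perfectly static clip is routed to the static-camera detector
static_clip = np.repeat(np.random.randint(0, 255, (1, 90, 180)), 10, axis=0)
print(choose_saliency_method(static_clip))        # "ATSal"
```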

Prompt2NeRF-PIL: Fast NeRF Generation via Pretrained Implicit Latent

  • paper_url: http://arxiv.org/abs/2312.02568
  • repo_url: None
  • paper_authors: Jianmeng Liu, Yuyao Zhang, Zeyuan Meng, Yu-Wing Tai, Chi-Keung Tang
  • for: 本研究探讨可提示的NeRF生成（例如文本提示或单张图像提示），直接以提示为条件快速生成底层3D场景的NeRF参数，从而省去复杂的中间步骤，并提供带条件控制的完整3D生成。
  • methods: 所提出的Prompt2NeRF-PIL利用预训练的NeRF参数隐式潜在空间，只需一次前向传播即可生成多种3D对象。
  • results: 实验表明，该方法在零样本任务中生成的NeRF可作为富含语义信息的初始化，显著加速现有提示到NeRF方法的推理过程；具体而言，它将文本到NeRF模型DreamFusion和图像到NeRF方法Zero-1-to-3的3D重建速度提升了3到5倍。
    Abstract This paper explores promptable NeRF generation (e.g., text prompt or single image prompt) for direct conditioning and fast generation of NeRF parameters for the underlying 3D scenes, thus undoing complex intermediate steps while providing full 3D generation with conditional control. Unlike previous diffusion-CLIP-based pipelines that involve tedious per-prompt optimizations, Prompt2NeRF-PIL is capable of generating a variety of 3D objects with a single forward pass, leveraging a pre-trained implicit latent space of NeRF parameters. Furthermore, in zero-shot tasks, our experiments demonstrate that the NeRFs produced by our method serve as semantically informative initializations, significantly accelerating the inference process of existing prompt-to-NeRF methods. Specifically, we will show that our approach speeds up the text-to-NeRF model DreamFusion and the 3D reconstruction speed of the image-to-NeRF method Zero-1-to-3 by 3 to 5 times.
    摘要 本文探讨可提示的NeRF生成（例如文本提示或单张图像提示），直接以提示为条件、快速生成底层3D场景的NeRF参数，从而省去复杂的中间步骤，并提供带条件控制的完整3D生成。与此前需要对每个提示进行繁琐优化的基于扩散与CLIP的流程不同，Prompt2NeRF-PIL利用预训练的NeRF参数隐式潜在空间，只需一次前向传播即可生成多种3D对象。此外，在零样本任务中，我们的实验表明该方法生成的NeRF可作为富含语义信息的初始化，显著加速现有提示到NeRF方法的推理过程。具体而言，我们将展示该方法可把文本到NeRF模型DreamFusion和图像到NeRF方法Zero-1-to-3的3D重建速度提升3到5倍。

Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts

  • paper_url: http://arxiv.org/abs/2312.02567
  • repo_url: None
  • paper_authors: Jiayi Chen, Benteng Ma, Hengfei Cui, Yong Xia, Kwang-Ting Cheng
  • for: 这个研究旨在提高 Federated Learning 中的数据评估过程,以便更好地利用分散在多个医疗机构中的数据,而不需要中央化数据。
  • methods: 该研究提出了联邦证据主动学习（Federated Evidential Active Learning, FEAL）方法，对来自不同领域的本地数据的信息量进行评估，并在本地和全局模型中引入Dirichlet先验来同时刻画预测中的偶然不确定性与认知不确定性，再辅以多样性放松策略减少数据冗余。
  • results: 大量实验与分析显示，FEAL优于最先进的主动学习方法，并在联邦主动学习框架下表现出良好的效率。
    Abstract Federated learning facilitates the collaborative learning of a global model across multiple distributed medical institutions without centralizing data. Nevertheless, the expensive cost of annotation on local clients remains an obstacle to effectively utilizing local data. To mitigate this issue, federated active learning methods suggest leveraging local and global model predictions to select a relatively small amount of informative local data for annotation. However, existing methods mainly focus on all local data sampled from the same domain, making them unreliable in realistic medical scenarios with domain shifts among different clients. In this paper, we make the first attempt to assess the informativeness of local data derived from diverse domains and propose a novel methodology termed Federated Evidential Active Learning (FEAL) to calibrate the data evaluation under domain shift. Specifically, we introduce a Dirichlet prior distribution in both local and global models to treat the prediction as a distribution over the probability simplex and capture both aleatoric and epistemic uncertainties by using the Dirichlet-based evidential model. Then we employ the epistemic uncertainty to calibrate the aleatoric uncertainty. Afterward, we design a diversity relaxation strategy to reduce data redundancy and maintain data diversity. Extensive experiments and analyses are conducted to show the superiority of FEAL over the state-of-the-art active learning methods and the efficiency of FEAL under the federated active learning framework.
    摘要 联邦学习可以帮助多个分散的医疗机构共同学习一个全球模型,而不需要中央化数据。然而,当地方机构的标签成本高昂时,这可能会导致实际上不能充分利用地方数据。为了解决这个问题,联邦活动学习方法建议使用地方和全球模型预测来选择一小量具有资讯的地方数据进行标签。然而,现有的方法主要将注意力集中在同一个领域中的所有地方数据上,因此在实际的医疗场景中,当存在不同客户端之间的领域转移时,这些方法可能无法实际使用。在这篇论文中,我们做出了首次尝试,以评估地方数据来自不同领域的有用性,并提出了一种新的方法,称为联邦证据活动学习(FEAL),以调整数据评估下领域转移的情况。具体来说,我们将地方和全球模型中的预测视为一个分布在概率Simplex上,并使用Dirichlet基于的证据模型来捕捉这两种不确定性。然后,我们使用这个epistemic不确定性来调整这个aleatoric不确定性。接着,我们设计了一种多样性放松策略,以减少数据的重复和保持数据的多样性。我们进行了广泛的实验和分析,以证明 FEAL 在联邦活动学习框架下的超越性和效率。
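To make the evidential part concrete, the sketch below converts non-negative evidence into Dirichlet parameters and separates total uncertainty into aleatoric and epistemic components. The formulas follow standard evidential deep learning; they are not claimed to be the paper's exact estimators, and the calibration and diversity-relaxation steps are omitted.

```python
# Dirichlet-based uncertainty decomposition from per-sample class evidence.
import torch

def dirichlet_uncertainties(evidence):
    """evidence: (B, C) non-negative outputs, e.g. softplus(logits)."""
    alpha = evidence + 1.0                          # Dirichlet concentration
    s = alpha.sum(dim=1, keepdim=True)
    p = alpha / s                                   # expected class probabilities
    total = -(p * p.clamp_min(1e-9).log()).sum(dim=1)             # entropy of the mean
    aleatoric = -(p * (torch.digamma(alpha + 1) - torch.digamma(s + 1))).sum(dim=1)
    epistemic = total - aleatoric                   # what remains is model uncertainty
    return aleatoric, epistemic

# usage: softplus keeps the evidence non-negative
logits = torch.randn(5, 4)
alea, epis = dirichlet_uncertainties(torch.nn.functional.softplus(logits))
```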

Uni3DL: Unified Model for 3D and Language Understanding

  • paper_url: http://arxiv.org/abs/2312.03026
  • repo_url: None
  • paper_authors: Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny
  • for: This paper presents a unified model for 3D and language understanding, called Uni3DL, which can perform various tasks in 3D vision and language understanding.
  • methods: The Uni3DL model uses a query transformer to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router to selectively generate the task-specific outputs required for diverse tasks.
  • results: Uni3DL has been evaluated across diverse 3D vision-language understanding tasks and demonstrates performance on par with or surpassing state-of-the-art task-specific models.
    Abstract In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D and language understanding. Project page: https://uni3dl.github.io.
    摘要 在这项工作中，我们提出Uni3DL，一个统一的3D与语言理解模型。与现有的、任务种类有限且主要依赖投影多视角图像的3D统一视觉-语言模型不同，Uni3DL直接在点云上运行。这一做法显著扩大了3D中可支持的任务范围，涵盖3D视觉任务与视觉-语言任务。Uni3DL的核心是一个查询Transformer，通过关注3D视觉特征来学习与任务无关的语义和掩码输出，并借助任务路由器有选择地生成各类任务所需的特定输出。得益于统一架构，Uni3DL可实现无缝的任务分解以及任务间大量的参数共享。我们在语义分割、目标检测、实例分割、视觉定位、3D描述和文本-3D跨模态检索等多种3D视觉-语言理解任务上对Uni3DL进行了严格评估，其性能与最先进的任务专用模型相当或更优。我们希望我们的基准和Uni3DL模型能为未来3D与语言理解领域统一模型的研究奠定坚实基础。项目页面：https://uni3dl.github.io。

GeNIe: Generative Hard Negative Images Through Diffusion

  • paper_url: http://arxiv.org/abs/2312.02548
  • repo_url: https://github.com/ucdvision/genie
  • paper_authors: Soroush Abbasi Koohpayegani, Anuj Singh, K L Navaneet, Hadi Jamali-Rad, Hamed Pirsiavash
  • for: 数据增强用于深度模型的训练，防止其在有限数据上过拟合。
  • methods: 利用生成式AI（如用于图像生成的扩散模型）实现更精巧的数据增强，生成更接近自然图像的数据；具体地，以文本提示为条件的扩散模型将源类别图像与目标类别文本提示融合，为目标类别生成具有挑战性的困难负样本。
  • results: 生成的增强样本更贴近分类器的理想决策边界，能更有效地引导学习过程；在少样本以及长尾分布设置下的大量实验表明，该增强方法对样本数量有限的类别尤为有利。
    Abstract Data augmentation is crucial in training deep models, preventing them from overfitting to limited data. Common data augmentation methods are effective, but recent advancements in generative AI, such as diffusion models for image generation, enable more sophisticated augmentation techniques that produce data resembling natural images. We recognize that augmented samples closer to the ideal decision boundary of a classifier are particularly effective and efficient in guiding the learning process. We introduce GeNIe which leverages a diffusion model conditioned on a text prompt to merge contrasting data points (an image from the source category and a text prompt from the target category) to generate challenging samples for the target category. Inspired by recent image editing methods, we limit the number of diffusion iterations and the amount of noise. This ensures that the generated image retains low-level and contextual features from the source image, potentially conflicting with the target category. Our extensive experiments, in few-shot and also long-tail distribution settings, demonstrate the effectiveness of our novel augmentation method, especially benefiting categories with a limited number of examples.
    摘要 数据增强对训练深度模型至关重要,可以避免模型在有限数据上过拟合。常见的数据增强方法虽然有效,但近期生成式AI(如用于图像生成的扩散模型)的进展使更复杂的增强技术成为可能,能够生成更接近自然图像的数据。我们发现,越接近分类器理想决策边界的增强样本,在引导学习过程方面越有效、越高效。我们提出GeNIe,利用以文本提示为条件的扩散模型,将对立的数据点(源类别的图像与目标类别的文本提示)相融合,为目标类别生成具有挑战性的样本。受近期图像编辑方法的启发,我们限制扩散迭代次数和噪声量,从而使生成图像保留源图像的低层与上下文特征,这些特征可能与目标类别相冲突。我们在少样本以及长尾分布设置下进行了大量实验,验证了这一新颖增强方法的有效性,对样本数量有限的类别尤其有益。
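
The core recipe — merge a source-category image with a target-category text prompt while keeping the noise level and number of diffusion iterations low — can be approximated with an off-the-shelf image-to-image diffusion pipeline. The sketch below uses the Hugging Face diffusers img2img pipeline; the checkpoint, prompt, and strength value are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("source_category_image.jpg").convert("RGB").resize((512, 512))
hard_negative = pipe(
    prompt="a photo of a <target category>",   # target-category text prompt
    image=source,                              # source-category image
    strength=0.5,        # low strength = few denoising steps, source features retained
    guidance_scale=7.5,
).images[0]
hard_negative.save("generated_hard_negative.png")
```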

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

  • paper_url: http://arxiv.org/abs/2312.02546
  • repo_url: https://github.com/tmllab/machine_vision_therapy
  • paper_authors: Zhuo Huang, Chang Liu, Yinpeng Dong, Hang Su, Shibao Zheng, Tongliang Liu
  • for: 提升视觉模型在分布外(OOD)场景下的零样本鲁棒性
  • methods: 利用多模态大语言模型(MLLMs)与去噪上下文学习(DICL)策略,修正视觉模型的噪声预测,并用去噪后的标签进行微调
  • results: 以无监督方式提升视觉模型的性能,并在ImageNet、WILDS、DomainBed等多个OOD数据集上的大量实验中验证了方法的有效性
    Abstract Although vision models such as Contrastive Language-Image Pre-Training (CLIP) show impressive generalization performance, their zero-shot robustness is still limited under Out-of-Distribution (OOD) scenarios without fine-tuning. Instead of undesirably providing human supervision as commonly done, it is possible to take advantage of Multi-modal Large Language Models (MLLMs) that hold powerful visual understanding abilities. However, MLLMs are shown to struggle with vision problems due to the incompatibility of tasks, thus hindering their utilization. In this paper, we propose to effectively leverage MLLMs to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models. By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner. To solve the incompatibility issue, we propose a novel Denoising In-Context Learning (DICL) strategy to align vision tasks with MLLMs. Concretely, by estimating a transition matrix that captures the probability of one class being confused with another, an instruction containing a correct exemplar and an erroneous one from the most probable noisy class can be constructed. Such an instruction can help any MLLMs with ICL ability to detect and rectify incorrect predictions of vision models. Through extensive experiments on ImageNet, WILDS, DomainBed, and other OOD datasets, we carefully validate the quantitative and qualitative effectiveness of our method. Our code is available at https://github.com/tmllab/Machine_Vision_Therapy.
    摘要 尽管视觉模型如对照语言图像预训练(CLIP)表现出色,但其零shot Robustness仍然受到Out-of-Distribution(OOD)场景下的限制。而不需要不必要地提供人工监督,可以利用多modal大语言模型(MLLMs),这些模型拥有强大的视觉理解能力。然而,MLLMs在视觉任务上存在兼容性问题,这使得它们的使用受限。在这篇论文中,我们提出了一种有效地利用MLLMs进行机器视觉疗法,以改善视觉模型的预测结果。通过精心调整杂乱标签,我们可以在无监督下提高学习模型的性能。为解决兼容性问题,我们提出了一种新的减噪在Context学习策略(DICL),用于将视觉任务与MLLMs相协调。具体来说,我们可以估算一个转移矩阵,该矩阵捕捉了一个类型与另一个类型之间的混淆概率。通过构建一个包含正确示例和错误示例的指导,我们可以帮助任何拥有ICL能力的MLLMs检测和修正视觉模型的错误预测。经过广泛的ImageNet、WILDS、DomainBed和其他OOD dataset的实验,我们谨慎验证了我们的方法的量化和质量效果。我们的代码可以在https://github.com/tmllab/Machine_Vision_Therapy上获取。
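
To make the Denoising In-Context Learning idea more concrete, the sketch below looks up the most probable noisy class from a (precomputed) transition matrix and assembles a two-exemplar instruction for an MLLM with in-context learning ability. The prompt wording and data layout are hypothetical; only the transition-matrix lookup reflects the mechanism described in the abstract.

```python
import numpy as np

def most_confusable(transition: np.ndarray, pred_class: int) -> int:
    """transition[i, j] ~ estimated probability that predicted class i is confused with class j."""
    row = transition[pred_class].copy()
    row[pred_class] = 0.0                  # ignore the diagonal (correct case)
    return int(row.argmax())

def build_dicl_instruction(pred_class, class_names, exemplars, transition):
    """Pair a correct exemplar with an exemplar from the most probable noisy class."""
    noisy = most_confusable(transition, pred_class)
    return [
        {"image": exemplars[pred_class], "text": f"This is a {class_names[pred_class]}."},
        {"image": exemplars[noisy],      "text": f"This is a {class_names[noisy]}, "
                                                 f"not a {class_names[pred_class]}."},
    ]

# Toy usage with a 3-class transition matrix and placeholder exemplar paths.
T = np.array([[0.8, 0.15, 0.05], [0.1, 0.7, 0.2], [0.05, 0.25, 0.7]])
instr = build_dicl_instruction(0, ["cat", "lynx", "dog"],
                               ["cat.jpg", "lynx.jpg", "dog.jpg"], T)
```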

Explainable Severity ranking via pairwise n-hidden comparison: a case study of glaucoma

  • paper_url: http://arxiv.org/abs/2312.02541
  • repo_url: None
  • paper_authors: Hong Nguyen, Cuong V. Nguyen, Shrikanth Narayanan, Benjamin Y. Xu, Michael Pazzani
  • for: 该论文旨在用基于眼底图像的方法评估和比较原发性开角型青光眼(POAG)的严重程度,以辅助该疾病的诊断与评估。
  • methods: 该论文使用了一种基于siamese网络的严重性排名方法,通过对图像进行对比来评估POAG的严重程度。此外,论文还提出了一种新的解释方法,用于解释图像的严重程度高或低的原因。
  • results: 论文的实验结果表明,基于图像的严重性排名模型比传统方法更准确地诊断POAG,同时也能够提供更好的解释。
    Abstract Primary open-angle glaucoma (POAG) is a chronic and progressive optic nerve condition that results in an acquired loss of optic nerve fibers and potential blindness. The gradual onset of glaucoma results in patients progressively losing their vision without being consciously aware of the changes. To diagnose POAG and determine its severity, patients must undergo a comprehensive dilated eye examination. In this work, we build a framework to rank, compare, and interpret the severity of glaucoma using fundus images. We introduce a siamese-based severity ranking using pairwise n-hidden comparisons. We additionally have a novel approach to explaining why a specific image is deemed more severe than others. Our findings indicate that the proposed severity ranking model surpasses traditional ones in terms of diagnostic accuracy and delivers improved saliency explanations.
    摘要 原发性开角型青光眼(POAG)是一种慢性、进行性的视神经疾病,会导致视神经纤维的后天性丢失乃至失明。青光眼发病隐匿,患者往往在不知不觉中逐渐丧失视力。要诊断POAG并判断其严重程度,患者必须接受全面的散瞳眼底检查。在这项工作中,我们构建了一个利用眼底图像对青光眼严重程度进行排序、比较和解释的框架:提出了基于孪生网络、采用成对n-hidden比较的严重程度排序方法,并提出了一种新颖的解释方法,用于说明为何某张图像被判定为比其他图像更严重。结果表明,所提出的严重程度排序模型在诊断准确率上超过了传统方法,并能提供更好的显著性解释。
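
A minimal sketch of siamese pairwise severity ranking: a shared backbone scores each fundus image, and a margin ranking loss enforces that the image labeled as more severe receives the higher score. The backbone choice, margin, and the paper's "n-hidden comparison" details are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

class SeverityScorer(nn.Module):
    """Shared backbone that maps a fundus image to a scalar severity score."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)
        self.backbone = backbone

    def forward(self, img_a, img_b):
        return self.backbone(img_a).squeeze(-1), self.backbone(img_b).squeeze(-1)

model = SeverityScorer()
criterion = nn.MarginRankingLoss(margin=0.5)

img_a, img_b = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
target = torch.ones(4)          # +1 means img_a is labeled as more severe than img_b
score_a, score_b = model(img_a, img_b)
loss = criterion(score_a, score_b, target)
loss.backward()
```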

Enhanced Breast Cancer Tumor Classification using MobileNetV2: A Detailed Exploration on Image Intensity, Error Mitigation, and Streamlit-driven Real-time Deployment

  • paper_url: http://arxiv.org/abs/2312.03020
  • repo_url: None
  • paper_authors: Aaditya Surya, Aditya Shah, Jarnell Kabore, Subash Sasikumar
  • for: 该研究旨在将Google的MobileNetV2模型应用于乳腺癌肿瘤分类,将超声图像分为正常、良性和恶性三类,使用了1576张超声图像(265张正常、891张良性、420张恶性)。
  • methods: 采用MobileNetV2进行迁移学习,分析图像强度分布与误分类情况,并处理数据集类别不平衡问题。
  • results: 模型取得了0.82的准确率、0.83的精确率、0.81的召回率,ROC-AUC为0.94、PR-AUC为0.88、MCC为0.74;研究还通过误分类分析为未来应用提出改进方向,并探索了基于Streamlit的实时部署。
    Abstract This research introduces a sophisticated transfer learning model based on Google's MobileNetV2 for breast cancer tumor classification into normal, benign, and malignant categories, utilizing a dataset of 1576 ultrasound images (265 normal, 891 benign, 420 malignant). The model achieves an accuracy of 0.82, precision of 0.83, recall of 0.81, ROC-AUC of 0.94, PR-AUC of 0.88, and MCC of 0.74. It examines image intensity distributions and misclassification errors, offering improvements for future applications. Addressing dataset imbalances, the study ensures a generalizable model. This work, using a dataset from Baheya Hospital, Cairo, Egypt, compiled by Walid Al-Dhabyani et al., emphasizes MobileNetV2's potential in medical imaging, aiming to improve diagnostic precision in oncology. Additionally, the paper explores Streamlit-based deployment for real-time tumor classification, demonstrating MobileNetV2's applicability in medical imaging and setting a benchmark for future research in oncology diagnostics.
    摘要 本研究提出了一个基于Google MobileNetV2的迁移学习模型,用于将乳腺超声图像分类为正常、良性和恶性三类,使用了1576张超声图像(265张正常、891张良性、420张恶性)。模型取得了0.82的准确率、0.83的精确率、0.81的召回率、0.94的ROC-AUC、0.88的PR-AUC和0.74的MCC。研究分析了图像强度分布和误分类错误,为未来应用提供改进方向,并通过处理数据集类别不平衡来保证模型的泛化能力。本工作使用来自埃及开罗Baheya医院、由Walid Al-Dhabyani等人整理的数据集,强调了MobileNetV2在医学影像中的潜力,旨在提高肿瘤学诊断的精度。此外,论文还探讨了基于Streamlit的实时肿瘤分类部署,展示了MobileNetV2在医学影像中的适用性,并为未来肿瘤诊断研究设立了基准。
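
The transfer-learning setup can be sketched with torchvision's MobileNetV2: load ImageNet weights, freeze the feature extractor, and replace the classifier head with a three-way (normal / benign / malignant) output. The freezing choice and the torchvision ≥ 0.13 weights API shown here are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                              # freeze the pretrained backbone
model.classifier[1] = nn.Linear(model.last_channel, 3)   # new 3-class head

logits = model(torch.randn(1, 3, 224, 224))              # -> (1, 3) class scores
```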

Towards Open-set Gesture Recognition via Feature Activation Enhancement and Orthogonal Prototype Learning

  • paper_url: http://arxiv.org/abs/2312.02535
  • repo_url: None
  • paper_authors: Chen Liu, Can Han, Chengfeng Zhou, Crystal Cai, Suncheng Xiang, Hualiang Ni, Dahong Qian
  • for: 该论文旨在解决人机交互中手势识别的开集识别(OSR)问题。
  • methods: 提出了一种更有效的基于原型学习(PL)的方法,利用特征激活水平和投影不一致性这两种已知类与未知类之间的固有区别,来更好地区分二者。
  • results: 实验结果表明,所提方法能够同时实现对预定义手势的精确闭集分类和对未知手势的有效拒绝。
    Abstract Gesture recognition is a foundational task in human-machine interaction (HMI). While there has been significant progress in gesture recognition based on surface electromyography (sEMG), accurate recognition of predefined gestures only within a closed set is still inadequate in practice. It is essential to effectively discern and reject unknown gestures of disinterest in a robust system. Numerous methods based on prototype learning (PL) have been proposed to tackle this open set recognition (OSR) problem. However, they do not fully explore the inherent distinctions between known and unknown classes. In this paper, we propose a more effective PL method leveraging two novel and inherent distinctions, feature activation level and projection inconsistency. Specifically, the Feature Activation Enhancement Mechanism (FAEM) widens the gap in feature activation values between known and unknown classes. Furthermore, we introduce Orthogonal Prototype Learning (OPL) to construct multiple perspectives. OPL acts to project a sample from orthogonal directions to maximize the distinction between its two projections, where unknown samples will be projected near the clusters of different known classes while known samples still maintain intra-class similarity. Our proposed method simultaneously achieves accurate closed-set classification for predefined gestures and effective rejection for unknown gestures. Extensive experiments demonstrate its efficacy and superiority in open-set gesture recognition based on sEMG.
    摘要 手势识别是人机交互(HMI)中的一项基础任务。尽管基于表面肌电(sEMG)的手势识别已取得显著进展,但仅能在闭集内准确识别预定义手势在实际应用中仍然不够,一个鲁棒的系统必须能够有效辨别并拒绝无关的未知手势。已有多种基于原型学习(PL)的方法被提出以解决这一开集识别(OSR)问题,但它们并未充分挖掘已知类与未知类之间的固有区别。本文提出一种更有效的PL方法,利用两种新颖且固有的区别:特征激活水平和投影不一致性。具体而言,特征激活增强机制(FAEM)拉大了已知类与未知类在特征激活值上的差距;此外,我们引入正交原型学习(OPL)以构建多个视角,将样本沿相互正交的方向投影,使两个投影之间的区别最大化:未知样本会被投影到不同已知类簇的附近,而已知样本仍保持类内相似性。所提方法同时实现了对预定义手势的精确闭集分类和对未知手势的有效拒绝。大量实验证明了其在基于sEMG的开集手势识别中的有效性和优越性。

Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline

  • paper_url: http://arxiv.org/abs/2312.02528
  • repo_url: None
  • paper_authors: Xiaoqi Zhao, Youwei Pang, Zhenyu Chen, Qian Yu, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu
  • for: 本研究提出了一项名为动力电池检测(PBD)的新任务,旨在从X-ray图像中定位密集的正极板与负极板端点,以评估动力电池的质量。现有厂商通常依赖人眼观察完成PBD,难以兼顾检测的准确性与效率。为解决这一问题并吸引更多人关注这一有意义的任务,我们精心收集了数据集X-ray PBD,包含从5家厂商的数千块动力电池中挑选的1,500张多样化X-ray图像,涵盖7种视觉干扰。
  • methods: 我们提出了一种新的基于分割的解决方案,称为多维协同网络(MDCNet)。借助线预测器和计数预测器的协同作用,点分割分支的表示在语义和细节两个层面均得到改进。此外,我们还设计了一种有效的距离自适应掩码生成策略,以缓解极板分布密度不一致带来的视觉挑战,为MDCNet提供稳定的监督。
  • results: 我们基于分割的MDCNet稳定地优于各种基于角点检测、人群计数以及通用/小目标检测的方案,为PBD任务提供了一个强有力的基线。最后,我们还讨论了若干潜在难点与未来研究方向。代码和数据集将在 \href{http://www.gy3000.company/x3000%e5%bc%80%e6%94%be%e5%b9%b3%e5%8f%b0}{X-ray PBD} 公开发布。
    Abstract We conduct a comprehensive study on a new task named power battery detection (PBD), which aims to localize the dense cathode and anode plates endpoints from X-ray images to evaluate the quality of power batteries. Existing manufacturers usually rely on human eye observation to complete PBD, which makes it difficult to balance the accuracy and efficiency of detection. To address this issue and drive more attention into this meaningful task, we first elaborately collect a dataset, called X-ray PBD, which has $1,500$ diverse X-ray images selected from thousands of power batteries of $5$ manufacturers, with $7$ different visual interference. Then, we propose a novel segmentation-based solution for PBD, termed multi-dimensional collaborative network (MDCNet). With the help of line and counting predictors, the representation of the point segmentation branch can be improved at both semantic and detail aspects. Besides, we design an effective distance-adaptive mask generation strategy, which can alleviate the visual challenge caused by the inconsistent distribution density of plates to provide MDCNet with stable supervision. Without any bells and whistles, our segmentation-based MDCNet consistently outperforms various other corner detection, crowd counting and general/tiny object detection-based solutions, making it a strong baseline that can help facilitate future research in PBD. Finally, we share some potential difficulties and works for future researches. The source code and datasets will be publicly available at \href{http://www.gy3000.company/x3000%e5%bc%80%e6%94%be%e5%b9%b3%e5%8f%b0}{X-ray PBD}.
    摘要 我们对一项名为动力电池检测(PBD)的新任务进行了全面研究,该任务旨在从X-ray图像中定位密集的正极板与负极板端点,以评估动力电池的质量。现有厂商通常依赖人眼观察完成PBD,难以兼顾检测的准确性与效率。为解决这一问题并吸引更多人关注这一有意义的任务,我们首先精心收集了数据集X-ray PBD,包含从5家厂商的数千块动力电池中挑选的1,500张多样化X-ray图像,涵盖7种视觉干扰。随后,我们提出了一种新的基于分割的解决方案,即多维协同网络(MDCNet):借助线预测器和计数预测器,点分割分支的表示在语义和细节两个层面均得到提升;我们还设计了有效的距离自适应掩码生成策略,缓解极板分布密度不一致带来的视觉挑战,为MDCNet提供稳定的监督。在不加任何额外技巧的情况下,基于分割的MDCNet稳定地优于各种基于角点检测、人群计数以及通用/小目标检测的方案,是一个有助于推动未来PBD研究的强基线。最后,我们分享了若干潜在难点与未来研究方向。源码与数据集将在 \href{http://www.gy3000.company/x3000%e5%bc%80%e6%94%be%e5%b9%b3%e5%8f%b0}{X-ray PBD} 公开。

Towards More Unified In-context Visual Understanding

  • paper_url: http://arxiv.org/abs/2312.02520
  • repo_url: None
  • paper_authors: Dianmo Sheng, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Tao Gong, Bin Liu, Shengwei Xu, Nenghai Yu
  • for: 这篇论文旨在提出一种支持多模态输出的视觉理解ICL框架,以扩展ICL的应用场景。
  • methods: 该模型将文本与视觉提示量化并嵌入到统一表示空间,组织为交错的上下文序列,并采用仅解码器的稀疏Transformer架构进行生成式建模。
  • results: 实验结果表明,该模型在多模态输出的视觉理解任务上取得了与专用模型和先前ICL基线相当的性能。
    Abstract The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently, ICL has been employed in visual understanding tasks, such as semantic segmentation and image captioning, yielding promising results. However, existing visual ICL framework can not enable producing content across multiple modalities, which limits their potential usage scenarios. To address this issue, we present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompt into a unified representational space, structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them, facilitating in-context learning. Thanks to this design, the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall, our research takes a further step toward unified multimodal in-context learning.
    摘要 大语言模型(LLM)的快速发展推动了上下文学习(ICL)成为自然语言处理领域的前沿方法。近来,ICL已被用于语义分割、图像描述等视觉理解任务并取得了可喜的结果。然而,现有的视觉ICL框架无法生成跨多种模态的内容,限制了其潜在的应用场景。为解决这一问题,我们提出了一个支持多模态输出的视觉理解ICL新框架。首先,我们将文本和视觉提示量化并嵌入到统一的表示空间中,组织成交错的上下文序列;随后采用仅解码器的稀疏Transformer架构对其进行生成式建模,从而实现上下文学习。得益于这一设计,该模型能够在统一的流程中处理具有多模态输出的上下文视觉理解任务。实验结果表明,该模型与专用模型及先前的ICL基线相比具有可比的性能。总体而言,我们的研究朝着统一的多模态上下文学习又迈进了一步。

SAVE: Protagonist Diversification with Structure Agnostic Video Editing

  • paper_url: http://arxiv.org/abs/2312.02503
  • repo_url: None
  • paper_authors: Yeji Song, Wonsik Shin, Junsoo Lee, Jeesoo Kim, Nojun Kwak
  • for: 提升视频编辑中主角替换的多样性与适用范围
  • methods: 采用运动个性化、带膨胀文本嵌入的运动词、由注意力图计算的伪光流等技术
  • results: 实现了对视频中主角的替换和风格变换,使视频编辑更加多样和广泛
    Abstract Driven by the upsurge progress in text-to-image (T2I) generation models, text-to-video (T2V) generation has experienced a significant advance as well. Accordingly, tasks such as modifying the object or changing the style in a video have been possible. However, previous works usually work well on trivial and consistent shapes, and easily collapse on a difficult target that has a largely different body shape from the original one. In this paper, we spot the bias problem in the existing video editing method that restricts the range of choices for the new protagonist and attempt to address this issue using the conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between image and video, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing.
    摘要 受文本到图像(T2I)生成模型快速进步的推动,文本到视频(T2V)生成也取得了显著进展,修改视频中的对象或改变其风格因此成为可能。然而,先前的工作通常只在形状简单且一致的情形下表现良好,一旦目标主体的形体与原始主体差异较大便容易失效。本文指出了现有视频编辑方法中限制新主角选择范围的偏置问题,并尝试借助传统的图像级个性化方法来解决。我们采用运动个性化,先从单个源视频中分离出运动,再据此替换主角。为应对图像与视频之间的天然差异,我们提出带有膨胀文本嵌入的运动词,以恰当表示源视频中的运动;并引入一种新颖的伪光流(由预先计算的注意力图高效得到),约束运动词关注与运动相关的区域。最后,我们借助一个额外的伪词将运动与源视频的外观解耦。大量实验证明了该方法的编辑能力,朝着更多样、更广泛的视频编辑迈出了一步。

ReconU-Net: a direct PET image reconstruction using U-Net architecture with back projection-induced skip connection

  • paper_url: http://arxiv.org/abs/2312.02494
  • repo_url: None
  • paper_authors: Fumio Hashimoto, Kibo Ote
  • for: 本研究旨在提出一种基于深度学习的直接正电子发射断层成像(PET)图像重建算法ReconU-Net,以提高直接PET图像重建的保真度。
  • methods: 该算法独特地将反投影这一物理模型操作融入跳跃连接,使嵌入物理模型的跳跃连接能够将输入sinogram中的固有空间信息有效传递到重建图像中。
  • results: 与其他不含跳跃连接的编码器-解码器架构相比,ReconU-Net重建出的图像结构更为准确;进一步分析表明,ReconU-Net能够通过跳跃连接传递多种分辨率的特征,尤其是非抽象的高分辨率信息;尽管仅在仿真数据上进行了有限训练,ReconU-Net仍成功重建了真实的Hoffman脑体模数据,而其他基于深度学习的直接重建方法未能生成重建图像。
    Abstract [Objective] This study aims to introduce a novel back projection-induced U-Net-shaped architecture, called ReconU-Net, for deep learning-based direct positron emission tomography (PET) image reconstruction. Additionally, our objective is to analyze the behavior of direct PET image reconstruction and gain deeper insights by comparing the proposed ReconU-Net architecture with other encoder-decoder architectures without skip connections. [Approach] The proposed ReconU-Net architecture uniquely integrates the physical model of the back projection operation into the skip connection. This distinctive feature facilitates the effective transfer of intrinsic spatial information from the input sinogram to the reconstructed image via an embedded physical model. The proposed ReconU-Net was trained using Monte Carlo simulation data from the Brainweb phantom and tested on both simulated and real Hoffman brain phantom data. [Main results] The proposed ReconU-Net method generated a reconstructed image with a more accurate structure compared to other deep learning-based direct reconstruction methods. Further analysis showed that the proposed ReconU-Net architecture has the ability to transfer features of multiple resolutions, especially non-abstract high-resolution information, through skip connections. Despite limited training on simulated data, the proposed ReconU-Net successfully reconstructed the real Hoffman brain phantom, unlike other deep learning-based direct reconstruction methods, which failed to produce a reconstructed image. [Significance] The proposed ReconU-Net can improve the fidelity of direct PET image reconstruction, even when dealing with small training datasets, by leveraging the synergistic relationship between data-driven modeling and the physics model of the imaging process.
    摘要 [目的] 本研究旨在提出一种新颖的、由反投影引导的U-Net型架构ReconU-Net,用于基于深度学习的直接正电子发射断层成像(PET)图像重建,并通过与其他不含跳跃连接的编码器-解码器架构进行比较,分析直接PET图像重建的行为并获得更深入的理解。[方法] ReconU-Net架构独特地将反投影操作的物理模型融入跳跃连接,这一特性使输入sinogram中的固有空间信息能够经由嵌入的物理模型有效传递到重建图像中。ReconU-Net使用Brainweb体模的蒙特卡洛仿真数据进行训练,并在仿真与真实的Hoffman脑体模数据上进行了测试。[主要结果] 与其他基于深度学习的直接重建方法相比,ReconU-Net生成的重建图像结构更为准确。进一步分析表明,ReconU-Net能够通过跳跃连接传递多种分辨率的特征,尤其是非抽象的高分辨率信息。尽管仅在仿真数据上进行了有限训练,ReconU-Net仍成功重建了真实的Hoffman脑体模,而其他基于深度学习的直接重建方法未能生成重建图像。[意义] ReconU-Net借助数据驱动建模与成像过程物理模型之间的协同关系,即使训练数据规模较小,也能提高直接PET图像重建的保真度。

EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

  • paper_url: http://arxiv.org/abs/2312.02483
  • repo_url: None
  • paper_authors: Guozhang Li, Xinpeng Ding, De Cheng, Jie Li, Nannan Wang, Xinbo Gao
  • for: 提高 Early weakly supervised video grounding (WSVG) 方法对于 incomplete boundary detection 的能力。
  • methods: 提出“先扩展、后澄清”(EtC)策略:在保持原始时间内容完整性的同时引入更有价值的信息,利用多模态大语言模型(MLLMs)为初始伪边界内的每一帧生成更全面的描述以扩展边界,再结合互学习与定制的提议级对比目标来澄清扩展边界中的噪声。
  • results: 在两个具有挑战性的WSVG数据集上的实验表明,该方法能够获得更精确的时间边界,优于现有方法。
    Abstract Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., generating pseudo-temporal boundaries for training, have achieved great success. However, data augmentations in these methods might disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), first use the additional information to expand the initial incomplete pseudo boundaries, and subsequently refine these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within initial pseudo boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise of expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.

Differentiable Point-based Inverse Rendering

  • paper_url: http://arxiv.org/abs/2312.02480
  • repo_url: None
  • paper_authors: Hoon-Gyu Chung, Seokjun Choi, Seung-Hwan Baek
  • for: 本研究旨在实现可微的分析-合成式逆渲染,从多种照明条件下拍摄的图像中估计形状和空间变化的BRDF。
  • methods: 采用基于点的渲染,免去了体渲染中每条光线多次采样的需要,从而显著提升逆渲染速度;同时提出混合的点-体几何表示以及正则化的基函数BRDF反射表示,在保留SDF类表示的几何细节与稳定性的同时实现快速的点喷溅渲染,并缓解光照-视角采样有限带来的病态性;此外还提出了基于点阴影图渲染的高效阴影检测方法。
  • results: 大量实验评估表明,DPIR在重建精度、计算效率和内存占用方面均优于先前方法;显式的点表示与渲染还支持直观的几何与反射编辑。
    Abstract We present differentiable point-based inverse rendering, DPIR, an analysis-by-synthesis method that processes images captured under diverse illuminations to estimate shape and spatially-varying BRDF. To this end, we adopt point-based rendering, eliminating the need for multiple samplings per ray, typical of volumetric rendering, thus significantly enhancing the speed of inverse rendering. To realize this idea, we devise a hybrid point-volumetric representation for geometry and a regularized basis-BRDF representation for reflectance. The hybrid geometric representation enables fast rendering through point-based splatting while retaining the geometric details and stability inherent to SDF-based representations. The regularized basis-BRDF mitigates the ill-posedness of inverse rendering stemming from limited light-view angular samples. We also propose an efficient shadow detection method using point-based shadow map rendering. Our extensive evaluations demonstrate that DPIR outperforms prior works in terms of reconstruction accuracy, computational efficiency, and memory footprint. Furthermore, our explicit point-based representation and rendering enables intuitive geometry and reflectance editing. The code will be publicly available.
    摘要 我们提出可微的基于点的逆渲染方法DPIR,这是一种分析-合成式方法,通过处理在多种照明条件下拍摄的图像来估计形状和空间变化的BRDF。为此,我们采用基于点的渲染,免去了体渲染中每条光线多次采样的需要,从而显著提升逆渲染速度。为实现这一思路,我们为几何设计了混合的点-体表示,并为反射率设计了正则化的基函数BRDF表示:混合几何表示既能通过点喷溅实现快速渲染,又保留了SDF类表示固有的几何细节与稳定性;正则化的基函数BRDF则缓解了光照-视角角度采样有限导致的逆渲染病态性。我们还提出了基于点阴影图渲染的高效阴影检测方法。大量评估表明,DPIR在重建精度、计算效率和内存占用方面均优于先前工作。此外,显式的点表示与渲染还支持直观的几何与反射编辑。代码将公开发布。

Generator Born from Classifier

  • paper_url: http://arxiv.org/abs/2312.02470
  • repo_url: None
  • paper_authors: Runpeng Yu, Xinchao Wang
  • for: Given a pre-trained classifier, the paper aims to reconstruct an image generator without using any data samples.
  • methods: The proposed method leverages the knowledge encapsulated within the parameters of the neural network and uses a novel learning paradigm that trains the generator to ensure convergence conditions of the network parameters are satisfied over the generated distribution of samples.
  • results: Empirical validation from various image generation tasks demonstrates the efficacy of the proposed strategy.
    Abstract In this paper, we make a bold attempt toward an ambitious task: given a pre-trained classifier, we aim to reconstruct an image generator, without relying on any data samples. From a black-box perspective, this challenge seems intractable, since it inevitably involves identifying the inverse function for a classifier, which is, by nature, an information extraction process. As such, we resort to leveraging the knowledge encapsulated within the parameters of the neural network. Grounded on the theory of Maximum-Margin Bias of gradient descent, we propose a novel learning paradigm, in which the generator is trained to ensure that the convergence conditions of the network parameters are satisfied over the generated distribution of the samples. Empirical validation from various image generation tasks substantiates the efficacy of our strategy.
    摘要 在本文中,我们向一项富有挑战性的任务发起大胆尝试:在不依赖任何数据样本的情况下,仅凭一个预训练的分类器重建出图像生成器。从黑盒视角看,这一挑战似乎难以处理,因为它不可避免地涉及求取分类器的逆函数,而这本质上是一个信息提取过程。为此,我们转而利用封装在神经网络参数中的知识。基于梯度下降的最大间隔偏置(Maximum-Margin Bias)理论,我们提出了一种新的学习范式:训练生成器,使网络参数的收敛条件在生成样本分布上得到满足。多种图像生成任务上的实证验证了该策略的有效性。

Learning Energy-based Model via Dual-MCMC Teaching

  • paper_url: http://arxiv.org/abs/2312.02469
  • repo_url: None
  • paper_authors: Jiali Cui, Tian Han
  • for: 本研究探讨了基于能量模型(EBM)的基本学习问题。
  • methods: 采用最大似然估计(MLE)结合马尔可夫链蒙特卡洛(MCMC)采样(如朗之万动力学)来学习EBM,并引入生成器模型作为补充模型,使其同时匹配EBM与经验数据分布,从而为EBM的MCMC采样提供更有信息量的初始化。
  • results: 提出了一个将EBM与补充生成器模型的最大似然学习交织起来的联合框架,并借助MCMC后验采样与一个补充的推断模型,通过双重MCMC教学将三个模型无缝整合,实现了有效且高效的EBM学习。
    Abstract This paper studies the fundamental learning problem of the energy-based model (EBM). Learning the EBM can be achieved using the maximum likelihood estimation (MLE), which typically involves the Markov Chain Monte Carlo (MCMC) sampling, such as the Langevin dynamics. However, the noise-initialized Langevin dynamics can be challenging in practice and hard to mix. This motivates the exploration of joint training with the generator model where the generator model serves as a complementary model to bypass MCMC sampling. However, such a method can be less accurate than the MCMC and result in biased EBM learning. While the generator can also serve as an initializer model for better MCMC sampling, its learning can be biased since it only matches the EBM and has no access to empirical training examples. Such biased generator learning may limit the potential of learning the EBM. To address this issue, we present a joint learning framework that interweaves the maximum likelihood learning algorithm for both the EBM and the complementary generator model. In particular, the generator model is learned by MLE to match both the EBM and the empirical data distribution, making it a more informative initializer for MCMC sampling of EBM. Learning generator with observed examples typically requires inference of the generator posterior. To ensure accurate and efficient inference, we adopt the MCMC posterior sampling and introduce a complementary inference model to initialize such latent MCMC sampling. We show that three separate models can be seamlessly integrated into our joint framework through two (dual-) MCMC teaching, enabling effective and efficient EBM learning.
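
Since the paper builds on MCMC sampling of the EBM, a minimal sketch of short-run Langevin dynamics for p(x) ∝ exp(-E(x)) is given below. The step size, step count, and toy energy are illustrative; the dual-MCMC teaching in the paper adds the generator and inference models on top of this basic sampler.

```python
import torch

def langevin_sample(energy_fn, x_init, n_steps=60, step_size=0.01):
    """Short-run Langevin dynamics toward low-energy regions of energy_fn."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
        noise = torch.randn_like(x)
        x = (x - 0.5 * step_size ** 2 * grad + step_size * noise)
        x = x.detach().requires_grad_(True)
    return x.detach()

# Hypothetical usage with a toy quadratic energy (samples drift toward the origin).
energy = lambda x: 0.5 * (x ** 2).sum(dim=-1)
samples = langevin_sample(energy, torch.randn(128, 2))
```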

SAM-Assisted Remote Sensing Imagery Semantic Segmentation with Object and Boundary Constraints

  • paper_url: http://arxiv.org/abs/2312.02464
  • repo_url: https://github.com/sstary/ssrs
  • paper_authors: Xianping Ma, Qianqian Wu, Xingyu Zhao, Xiaokang Zhang, Man-On Pun, Bo Huang
  • for: 该论文旨在借助Segment Anything Model(SAM)提升遥感图像语义分割的精度与效率。
  • methods: 引入SAM-Generated Object(SGO)与SAM-Generated Boundary(SGB)两个新概念,并设计object loss与boundary loss两个新的损失函数,以进一步优化通用语义分割框架。
  • results: 在ISPRS Vaihingen和LoveDA Urban两个知名数据集上的实验验证了所提方法的有效性。
    Abstract Semantic segmentation of remote sensing imagery plays a pivotal role in extracting precise information for diverse down-stream applications. Recent development of the Segment Anything Model (SAM), an advanced general-purpose segmentation model, has revolutionized this field, presenting new avenues for accurate and efficient segmentation. However, SAM is limited to generating segmentation results without class information. Consequently, the utilization of such a powerful general vision model for semantic segmentation in remote sensing images has become a focal point of research. In this paper, we present a streamlined framework aimed at leveraging the raw output of SAM by exploiting two novel concepts called SAM-Generated Object (SGO) and SAM-Generated Boundary (SGB). More specifically, we propose a novel object loss and further introduce a boundary loss as augmentative components to aid in model optimization in a general semantic segmentation framework. Taking into account the content characteristics of SGO, we introduce the concept of object consistency to leverage segmented regions lacking semantic information. By imposing constraints on the consistency of predicted values within objects, the object loss aims to enhance semantic segmentation performance. Furthermore, the boundary loss capitalizes on the distinctive features of SGB by directing the model's attention to the boundary information of the object. Experimental results on two well-known datasets, namely ISPRS Vaihingen and LoveDA Urban, demonstrate the effectiveness of our proposed method. The source code for this work will be accessible at https://github.com/sstary/SSRS.
    摘要 遥感图像语义分割在为各种下游应用提取精确信息方面起着关键作用。Segment Anything Model(SAM)这一先进的通用分割模型的出现革新了该领域,为准确、高效的分割提供了新途径。然而,SAM只能生成不含类别信息的分割结果,因此如何将这一强大的通用视觉模型用于遥感图像语义分割成为研究焦点。本文提出一个精简的框架,通过SAM-Generated Object(SGO)与SAM-Generated Boundary(SGB)两个新概念来利用SAM的原始输出。具体而言,我们提出了新的object loss,并进一步引入boundary loss作为辅助组件,用于在通用语义分割框架中辅助模型优化。考虑到SGO的内容特性,我们引入对象一致性的概念来利用缺乏语义信息的分割区域:object loss通过约束对象内部预测值的一致性来提升语义分割性能;boundary loss则利用SGB的独特特征,引导模型关注对象的边界信息。在ISPRS Vaihingen和LoveDA Urban两个知名数据集上的实验结果证明了所提方法的有效性。源码将发布于 https://github.com/sstary/SSRS。
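
One generic way to read the "object consistency" constraint is to penalize disagreement between per-pixel predictions and the mask-averaged prediction inside each SAM-generated object. The sketch below implements that reading; the exact form of the paper's object and boundary losses may differ.

```python
import torch
import torch.nn.functional as F

def object_consistency_loss(logits, object_masks):
    """logits: (B, C, H, W); object_masks: list of (B, H, W) binary masks, one per SGO."""
    probs = F.softmax(logits, dim=1)
    loss = logits.new_zeros(())
    for mask in object_masks:
        m = mask.unsqueeze(1).float()                       # (B, 1, H, W)
        area = m.sum(dim=(2, 3)).clamp_min(1.0)             # (B, 1)
        mean_prob = (probs * m).sum(dim=(2, 3)) / area      # (B, C) per-object mean
        diff = (probs - mean_prob[..., None, None]) * m     # deviation inside the object
        loss = loss + (diff ** 2).sum() / m.sum().clamp_min(1.0)
    return loss / max(len(object_masks), 1)

# Toy usage with one SAM-generated object mask.
logits = torch.randn(2, 6, 64, 64)
masks = [torch.randint(0, 2, (2, 64, 64))]
loss = object_consistency_loss(logits, masks)
```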

DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

  • paper_url: http://arxiv.org/abs/2312.03018
  • repo_url: None
  • paper_authors: Cong Wang, Jiaxi Gu, Panwen Hu, Songcen Xu, Hang Xu, Xiaodan Liang
  • for: 这篇论文旨在实现高质量的图像到视频生成,以提高现有的图像扩展方法的精度和可控性。
  • methods: 我们基于预训练的视频扩散模型构建了高保真图像到视频生成模型DreamVideo,其中设计了帧保留分支以保留参考图像的细节,并引入双条件无分类器引导(double-condition classifier-free guidance),通过不同的文本提示将同一张图像引导至不同动作的视频。
  • results: 我们在公共数据集上进行了广泛的实验,结果表明我们的方法在质量和可控性方面都超过了现有的状态图像到视频模型。尤其是在保持图像细节方面,我们的模型在 UCF101 上的 FVD 较高,并且可以通过不同的文本提示来实现精准的控制。
    Abstract Image-to-video generation, which aims to generate a video starting from a given reference image, has drawn great attention. Existing methods try to extend pre-trained text-guided image diffusion models to image-guided video generation models. Nevertheless, these methods often result in either low fidelity or flickering over time due to their limitation to shallow image guidance and poor temporal consistency. To tackle these problems, we propose a high-fidelity image-to-video generation method by devising a frame retention branch on the basis of a pre-trained video diffusion model, named DreamVideo. Instead of integrating the reference image into the diffusion process in a semantic level, our DreamVideo perceives the reference image via convolution layers and concatenate the features with the noisy latents as model input. By this means, the details of the reference image can be preserved to the greatest extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be directed to videos of different actions by providing varying prompt texts. This has significant implications for controllable video generation and holds broad application prospects. We conduct comprehensive experiments on the public dataset, both quantitative and qualitative results indicate that our method outperforms the state-of-the-art method. Especially for fidelity, our model has powerful image retention ability and result in high FVD in UCF101 compared to other image-to-video models. Also, precise control can be achieved by giving different text prompts. Further details and comprehensive results of our model will be presented in https://anonymous0769.github.io/DreamVideo/.
    摘要 图像到视频生成旨在从给定的参考图像出发生成一段视频,已引起广泛关注。现有方法通常将预训练的文本引导图像扩散模型扩展为图像引导的视频生成模型,但由于图像引导较浅且时间一致性不足,往往导致保真度低或随时间闪烁。为解决这些问题,我们在预训练视频扩散模型的基础上设计了帧保留分支,提出了高保真的图像到视频生成方法DreamVideo。不同于在语义层面将参考图像融入扩散过程,DreamVideo通过卷积层感知参考图像,并将其特征与含噪隐变量拼接后作为模型输入,从而最大程度地保留参考图像的细节。此外,借助双条件无分类器引导,只需提供不同的文本提示,同一张图像即可被引导生成不同动作的视频,这对可控视频生成具有重要意义并拥有广阔的应用前景。我们在公开数据集上进行了全面实验,定量与定性结果均表明该方法优于当前最先进方法;尤其在保真度方面,模型具有强大的图像保留能力,在UCF101上相比其他图像到视频模型取得了更高的FVD。同时,通过给定不同的文本提示可实现精确控制。更多细节与完整结果见 https://anonymous0769.github.io/DreamVideo/。

GDN: A Stacking Network Used for Skin Cancer Diagnosis

  • paper_url: http://arxiv.org/abs/2312.02437
  • repo_url: None
  • paper_authors: Jingmin Wei, Haoyang Shen, Ziyi Wang, Ziqian Zhang
  • for: 这个论文的目的是提出一种自动识别不同类型皮肤癌的图像分类模型,以提高皮肤癌的检测精度。
  • methods: 这个模型使用了堆叠不同网络的方法来提高模型性能,具体来说是使用GoogLeNet和DenseNet两个网络进行并行训练,并在第二层使用logistic regression模型来进行预测。
  • results: 比较这个模型与四个基eline网络(ResNet、VGGNet、DenseNet和GoogLeNet),GDN模型在测试数据集上显示出较高的准确率,特别是在使用Logistic Regression预测方法时达到了最好的预测结果。
    Abstract Skin cancer, the primary type of cancer that can be identified by visual recognition, requires an automatic identification system that can accurately classify different types of lesions. This paper presents GoogLe-Dense Network (GDN), which is an image-classification model to identify two types of skin cancer, Basal Cell Carcinoma, and Melanoma. GDN uses stacking of different networks to enhance the model performance. Specifically, GDN consists of two sequential levels in its structure. The first level performs basic classification tasks accomplished by GoogLeNet and DenseNet, which are trained in parallel to enhance efficiency. To avoid low accuracy and long training time, the second level takes the output of the GoogLeNet and DenseNet as the input for a logistic regression model. We compare our method with four baseline networks including ResNet, VGGNet, DenseNet, and GoogLeNet on the dataset, in which GoogLeNet and DenseNet significantly outperform ResNet and VGGNet. In the second level, different stacking methods such as perceptron, logistic regression, SVM, decision trees and K-neighbor are studied in which Logistic Regression shows the best prediction result among all. The results prove that GDN, compared to a single network structure, has higher accuracy in optimizing skin cancer detection.
    摘要 皮肤癌是主要可通过视觉识别发现的癌症类型,需要能够准确区分不同病变类型的自动识别系统。本文提出GoogLe-Dense Network(GDN),一种用于识别基底细胞癌和黑色素瘤两类皮肤癌的图像分类模型。GDN通过堆叠不同网络来提升模型性能,其结构包含两个级联层级:第一级由并行训练的GoogLeNet和DenseNet完成基础分类任务以提高效率;为避免精度偏低和训练时间过长,第二级将GoogLeNet和DenseNet的输出作为逻辑回归模型的输入。我们在该数据集上将本方法与ResNet、VGGNet、DenseNet和GoogLeNet四个基线网络进行了比较,其中GoogLeNet和DenseNet显著优于ResNet和VGGNet。在第二级中,我们研究了感知机、逻辑回归、SVM、决策树和K近邻等不同的堆叠方式,其中逻辑回归的预测效果最佳。结果证明,与单一网络结构相比,GDN在皮肤癌检测上具有更高的准确率。
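
The two-level stacking described above can be sketched as follows: take the class probabilities produced by the two level-one networks (assumed already trained), concatenate them, and fit a logistic-regression meta-classifier. The random arrays below stand in for real held-out predictions and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p_googlenet = rng.random((200, 2))   # placeholder softmax outputs (BCC vs. melanoma)
p_densenet = rng.random((200, 2))
y = rng.integers(0, 2, size=200)     # placeholder ground-truth labels

meta_features = np.concatenate([p_googlenet, p_densenet], axis=1)
meta_clf = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_clf.predict(meta_features[:5]))
```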

FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions

  • paper_url: http://arxiv.org/abs/2312.02434
  • repo_url: None
  • paper_authors: Zhen Liu, Hao Zhu, Qi Zhang, Jingde Fu, Weibing Deng, Zhan Ma, Yanwen Guo, Xun Cao
  • for: 解决现有INR技术所支持的频率集难以调节的问题,提升对包含多种频率的复杂信号的表示性能。
  • methods: 提出采用变周期激活函数的FINER:通过在不同范围内初始化神经网络的偏置,选取变周期函数中不同频率的子函数参与激活,从而灵活调节所支持的频率集。
  • results: 在2D图像拟合、3D符号距离场表示和5D神经辐射场优化中,FINER均表现出优于现有INR的表示性能。
    Abstract Implicit Neural Representation (INR), which utilizes a neural network to map coordinate inputs to corresponding attributes, is causing a revolution in the field of signal processing. However, current INR techniques suffer from a restricted capability to tune their supported frequency set, resulting in imperfect performance when representing complex signals with multiple frequencies. We have identified that this frequency-related problem can be greatly alleviated by introducing variable-periodic activation functions, for which we propose FINER. By initializing the bias of the neural network within different ranges, sub-functions with various frequencies in the variable-periodic function are selected for activation. Consequently, the supported frequency set of FINER can be flexibly tuned, leading to improved performance in signal representation. We demonstrate the capabilities of FINER in the contexts of 2D image fitting, 3D signed distance field representation, and 5D neural radiance fields optimization, and we show that it outperforms existing INRs.
    摘要 隐式神经表示(INR)利用神经网络将坐标输入映射为相应的属性,正在为信号处理领域带来变革。然而,现有INR技术在调节其所支持的频率集方面能力有限,导致在表示包含多种频率的复杂信号时表现欠佳。我们发现,引入变周期激活函数可以大幅缓解这一频率相关问题,为此提出了FINER:通过在不同范围内初始化神经网络的偏置,可以选取变周期函数中不同频率的子函数参与激活,从而灵活调节FINER所支持的频率集,提升信号表示性能。我们在2D图像拟合、3D符号距离场表示和5D神经辐射场优化中展示了FINER的能力,其表现优于现有的INR方法。
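
The sketch below shows a SIREN-style layer with one plausible variable-periodic activation, sin((|z|+1)·z), and a bias initialized over a wider-than-usual range so that different neurons fall into sub-functions of different local frequency. Both the activation form and the initialization range are illustrative readings of the abstract, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class VariablePeriodicLayer(nn.Module):
    """Linear layer followed by an (assumed) variable-periodic activation."""
    def __init__(self, in_features, out_features, bias_range=5.0, omega=30.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.omega = omega
        nn.init.uniform_(self.linear.bias, -bias_range, bias_range)  # wide bias range

    def forward(self, x):
        z = self.omega * self.linear(x)
        return torch.sin((z.abs() + 1.0) * z)   # local frequency grows with |z|

layer = VariablePeriodicLayer(2, 256)
out = layer(torch.rand(1024, 2))   # e.g. 2D coordinates for image fitting
```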

Lenna: Language Enhanced Reasoning Detection Assistant

  • paper_url: http://arxiv.org/abs/2312.02433
  • repo_url: https://github.com/meituan-automl/lenna
  • paper_authors: Fei Wei, Xinyu Zhang, Ailing Zhang, Bo Zhang, Xiangxiang Chu
  • for: This paper proposes a language-enhanced reasoning detection assistant called Lenna, which utilizes robust multimodal feature representation for image perception tasks.
  • methods: The paper incorporates an additional token in the MLLM vocabulary to preserve location information for detection, and constructs a ReasonDet dataset to measure the reasoning capability of Lenna.
  • results: Lenna demonstrates outstanding performance on ReasonDet with significantly low training costs and minimal transferring overhead when extended to other tasks.
    Abstract With the fast-paced development of multimodal large language models (MLLMs), we can now converse with AI systems in natural languages to understand images. However, the reasoning power and world knowledge embedded in the large language models have been much less investigated and exploited for image perception tasks. In this paper, we propose Lenna, a language-enhanced reasoning detection assistant, which utilizes the robust multimodal feature representation of MLLMs, while preserving location information for detection. This is achieved by incorporating an additional token in the MLLM vocabulary that is free of explicit semantic context but serves as a prompt for the detector to identify the corresponding position. To evaluate the reasoning capability of Lenna, we construct a ReasonDet dataset to measure its performance on reasoning-based detection. Remarkably, Lenna demonstrates outstanding performance on ReasonDet and comes with significantly low training costs. It also incurs minimal transferring overhead when extended to other tasks. Our code and model will be available at https://git.io/Lenna.
    摘要 随着多模态大语言模型(MLLMs)的快速发展,我们如今可以用自然语言与AI系统对话来理解图像。然而,大语言模型中蕴含的推理能力与世界知识在图像感知任务中仍较少被研究和利用。本文提出Lenna,一个语言增强的推理检测助手,它利用MLLM鲁棒的多模态特征表示,同时为检测保留位置信息。具体做法是在MLLM词表中加入一个额外的token,该token不携带显式语义,但可作为提示引导检测器识别相应位置。为评估Lenna的推理能力,我们构建了ReasonDet数据集来衡量其在基于推理的检测上的表现。值得注意的是,Lenna在ReasonDet上表现出色,训练成本显著较低,并且在扩展到其他任务时的迁移开销极小。代码与模型将在 https://git.io/Lenna 公开。
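
Adding an extra vocabulary token that carries no explicit semantics but acts as a prompt for the detector can be sketched with the Hugging Face transformers API. The token string "<DET>" and the gpt2 base model below are placeholders, not Lenna's actual backbone or token name.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_tokens(["<DET>"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))   # allocate an embedding for the new token
print(num_added, tokenizer.convert_tokens_to_ids("<DET>"))
```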

Orthogonal Adaptation for Modular Customization of Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02432
  • repo_url: None
  • paper_authors: Ryan Po, Guandao Yang, Kfir Aberman, Gordon Wetzstein
  • for: 这 paper 的目的是解决 Modular Customization 问题,实现高效地合并精心调整的扩展模型,以便在一幅图像中同时渲染多个概念。
  • methods: 该 paper 使用 Orthogonal Adaptation 方法,鼓励精心调整的模型在INFERENCE时进行互补,以确保合并后的模型仍然可以保持高度的准确性和唯一性。
  • results: 该 paper 的实验结果表明,与相关基eline比较,我们的方法在效率和准确性两个方面均有显著提高,代表了在扩展 Customization 领域中的重要突破。
    Abstract Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications, enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited, pre-defined set of them, they fall short of achieving scalability, where a single model can seamlessly render countless concepts. In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem, we introduce Orthogonal Adaptation, a method designed to encourage the customized models, which do not have access to each other during fine-tuning, to have orthogonal residual weights. This ensures that during inference time, the customized models can be summed with minimal interference. Our proposed method is both simple and versatile, applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations, our method consistently outperforms relevant baselines in terms of efficiency and identity preservation, demonstrating a significant leap toward scalable customization of diffusion models.
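
A toy sketch of the merging step: each independently customized model contributes a residual weight delta, and if those deltas are (near-)orthogonal they can simply be summed onto the base weights with little interference. The penalty shown is a generic orthogonality regularizer for illustration, not the paper's exact objective.

```python
import torch

def merge_residuals(base_weight: torch.Tensor, deltas: list) -> torch.Tensor:
    """Merged model = base weights plus the sum of per-concept residual deltas."""
    return base_weight + sum(deltas)

def orthogonality_penalty(delta_a: torch.Tensor, delta_b: torch.Tensor) -> torch.Tensor:
    """Penalize overlap between two concepts' residual weight updates."""
    gram = delta_a.flatten(1) @ delta_b.flatten(1).T
    return (gram ** 2).mean()

base = torch.randn(64, 64)
deltas = [0.01 * torch.randn(64, 64) for _ in range(3)]   # one per customized concept
merged = merge_residuals(base, deltas)
penalty = orthogonality_penalty(deltas[0], deltas[1])
```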

FreestyleRet: Retrieving Images from Style-Diversified Queries

  • paper_url: http://arxiv.org/abs/2312.02428
  • repo_url: None
  • paper_authors: Hao Li, Curise Jia, Peng Jin, Zesen Cheng, Kehan Li, Jialu Sui, Chang Liu, Li Yuan
  • for: 这 paper 的目的是提出 Style-Diversified Query-Based Image Retrieval 任务,允许根据不同的查询风格进行图像检索。
  • methods: 作者提出了一种轻量级的风格多样化检索框架:利用Gram矩阵提取查询的纹理特征并聚类到带有风格特定基的风格空间,再通过style-init提示调优模块使视觉编码器理解查询的纹理与风格信息,从而支持文本、素描、低分辨率、艺术等多种风格的查询。
  • results: 实验表明,采用style-init提示调优策略的模型在风格多样化检索任务上优于现有检索模型,并且可以同时检索多种风格组合的查询(如素描+文本、艺术+文本等);来自其他查询的辅助信息还能进一步提升各查询的检索性能。
    Abstract Image Retrieval aims to retrieve corresponding images based on a given query. In application scenarios, users intend to express their retrieval intent through various query styles. However, current retrieval tasks predominantly focus on text-query retrieval exploration, leading to limited retrieval query options and potential ambiguity or bias in user intention. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles. To facilitate the novel setting, we propose the first Diverse-Style Retrieval dataset, encompassing diverse query styles including text, sketch, low-resolution, and art. We also propose a light-weighted style-diversified retrieval framework. For various query style inputs, we apply the Gram Matrix to extract the query's textural features and cluster them into a style space with style-specific bases. Then we employ the style-init prompt tuning module to enable the visual encoder to comprehend the texture and style information of the query. Experiments demonstrate that our model, employing the style-init prompt tuning strategy, outperforms existing retrieval models on the style-diversified retrieval task. Moreover, style-diversified queries~(sketch+text, art+text, etc) can be simultaneously retrieved in our model. The auxiliary information from other queries enhances the retrieval performance within the respective query.
    摘要 图像检索旨在根据给定的查询检索相应的图像。在实际应用中,用户会通过多种查询风格表达检索意图,而现有检索任务主要集中于文本查询,导致可用的查询形式有限,且用户意图可能存在歧义或偏差。本文提出风格多样化查询的图像检索任务,支持基于多种查询风格的检索。为支撑这一新设定,我们构建了首个多样风格检索数据集,涵盖文本、素描、低分辨率和艺术等多种查询风格,并提出了一个轻量级的风格多样化检索框架:对不同风格的查询输入,先用Gram矩阵提取其纹理特征,并将其聚类到带有风格特定基的风格空间;随后通过style-init提示调优模块,使视觉编码器理解查询中的纹理与风格信息。实验表明,采用style-init提示调优策略的模型在风格多样化检索任务上优于现有检索模型,并且素描+文本、艺术+文本等多种风格组合的查询可以同时检索,来自其他查询的辅助信息还能提升各自查询的检索性能。
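
The Gram-matrix texture descriptor used above is standard: for a feature map, channel-wise correlations summarize texture/style while discarding spatial layout. A minimal sketch (the choice of backbone layer feeding it is an assumption):

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W) -> (B, C, C) normalized Gram matrices."""
    b, c, h, w = features.shape
    f = features.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

feats = torch.randn(2, 64, 32, 32)   # e.g. an early conv layer's activations for a query
g = gram_matrix(feats)               # texture/style descriptor per query image
```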

Towards Granularity-adjusted Pixel-level Semantic Annotation

  • paper_url: http://arxiv.org/abs/2312.02420
  • repo_url: None
  • paper_authors: Rohit Kundu, Sudipta Paul, Rohit Lal, Amit K. Roy-Chowdhury
  • for: 提供无需人工监督的semantic segmentation预测方法,用于生成图像中像素级别的标注数据。
  • methods: 利用Stable Diffusion生成的合成图像(或网络爬取图像)累积语义信息,并据此学习一个将SAM掩码嵌入(mask embeddings)映射到物体类别标签的映射函数,从而使SAM具备语义识别能力。
  • results: 在PASCAL VOC 2012和COCO-80 datasets上进行实验,比对 existed状态的方法时,我们得到了+17.95%和+5.17%的mIoU提升。
    Abstract Recent advancements in computer vision predominantly rely on learning-based systems, leveraging annotations as the driving force to develop specialized models. However, annotating pixel-level information, particularly in semantic segmentation, presents a challenging and labor-intensive task, prompting the need for autonomous processes. In this work, we propose GranSAM which distinguishes itself by providing semantic segmentation at the user-defined granularity level on unlabeled data without the need for any manual supervision, offering a unique contribution in the realm of semantic mask annotation method. Specifically, we propose an approach to enable the Segment Anything Model (SAM) with semantic recognition capability to generate pixel-level annotations for images without any manual supervision. For this, we accumulate semantic information from synthetic images generated by the Stable Diffusion model or web crawled images and employ this data to learn a mapping function between SAM mask embeddings and object class labels. As a result, SAM, enabled with granularity-adjusted mask recognition, can be used for pixel-level semantic annotation purposes. We conducted experiments on the PASCAL VOC 2012 and COCO-80 datasets and observed a +17.95% and +5.17% increase in mIoU, respectively, compared to existing state-of-the-art methods when evaluated under our problem setting.
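
The mapping function from SAM mask embeddings to class labels can be as simple as a small MLP trained on labels harvested from the synthetic images. The embedding dimension, architecture, and random data below are illustrative assumptions, not GranSAM's exact design.

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 256, 20
mapper = nn.Sequential(nn.Linear(embed_dim, 512), nn.ReLU(), nn.Linear(512, num_classes))
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

mask_embeddings = torch.randn(32, embed_dim)      # embeddings of SAM-proposed masks
labels = torch.randint(0, num_classes, (32,))     # labels derived from synthetic images

loss = criterion(mapper(mask_embeddings), labels)
loss.backward()
optimizer.step()
```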

MGTR: Multi-Granular Transformer for Motion Prediction with LiDAR

  • paper_url: http://arxiv.org/abs/2312.02409
  • repo_url: https://github.com/AlexXiao95/MGTR
  • paper_authors: Yiqian Gan, Hao Xiao, Yizhe Zhao, Ethan Zhang, Zhe Huang, Xin Ye, Lingting Ge
  • for: 该论文面向自动驾驶系统中的运动预测,以应对高度不确定且复杂、包含多种运动主体的交通场景。
  • methods: 提出多粒度Transformer(MGTR)框架,一种编码器-解码器网络,针对不同类型的交通参与者利用不同粒度的上下文特征;并通过现成的LiDAR特征提取器引入LiDAR点云语义特征,进一步增强MGTR的能力。
  • results: 在Waymo Open Dataset运动预测基准上的评估表明,MGTR取得了最先进的性能,在排行榜(https://waymo.com/open/challenges/2023/motion-prediction/)上排名第一。
    Abstract Motion prediction has been an essential component of autonomous driving systems since it handles highly uncertain and complex scenarios involving moving agents of different types. In this paper, we propose a Multi-Granular TRansformer (MGTR) framework, an encoder-decoder network that exploits context features in different granularities for different kinds of traffic agents. To further enhance MGTR's capabilities, we leverage LiDAR point cloud data by incorporating LiDAR semantic features from an off-the-shelf LiDAR feature extractor. We evaluate MGTR on Waymo Open Dataset motion prediction benchmark and show that the proposed method achieved state-of-the-art performance, ranking 1st on its leaderboard (https://waymo.com/open/challenges/2023/motion-prediction/).
    摘要 运动预测一直是自动驾驶系统的重要组成部分,它需要处理包含多种运动主体、高度不确定且复杂的场景。本文提出多粒度Transformer(MGTR)框架,一种编码器-解码器网络,针对不同类型的交通参与者利用不同粒度的上下文特征。为进一步增强MGTR的能力,我们通过现成的LiDAR特征提取器引入LiDAR点云的语义特征。我们在Waymo Open Dataset运动预测基准上评估了MGTR,结果表明所提方法取得了最先进的性能,在其排行榜(https://waymo.com/open/challenges/2023/motion-prediction/)上排名第一。