cs.CV - 2023-11-27

Small and Dim Target Detection in IR Imagery: A Review

  • paper_url: http://arxiv.org/abs/2311.16346
  • repo_url: None
  • paper_authors: Nikhil Kumar, Pravendra Singh
  • for: The primary goal of this study is to examine the detection of small and dim targets in infrared (IR) imagery, particularly when the background is cluttered with unclear details and the IR signatures of the scene vary with thermodynamic fluctuations.
  • methods: The review covers a range of approaches, from conventional image processing techniques to the latest deep learning methods, organized into multi-frame and single-frame approaches; single-frame approaches span traditional image processing as well as more advanced deep learning methods.
  • results: Deep learning approaches are found to outperform traditional image processing techniques for small and dim target detection. In addition, a comprehensive compilation of available datasets is provided, and the gaps and limitations of existing techniques are assessed to guide future research and development.
    Abstract While there has been significant progress in object detection using conventional image processing and machine learning algorithms, exploring small and dim target detection in the IR domain is a relatively new area of study. The majority of small and dim target detection methods are derived from conventional object detection algorithms, albeit with some alterations. The task of detecting small and dim targets in IR imagery is complex. This is because these targets often lack distinct features, the background is cluttered with unclear details, and the IR signatures of the scene can change over time due to fluctuations in thermodynamics. The primary objective of this review is to highlight the progress made in this field. This is the first review in the field of small and dim target detection in infrared imagery, encompassing various methodologies ranging from conventional image processing to cutting-edge deep learning-based approaches. The authors have also introduced a taxonomy of such approaches. There are two main types of approaches: methodologies using several frames for detection, and single-frame-based detection techniques. Single frame-based detection techniques encompass a diverse range of methods, spanning from traditional image processing-based approaches to more advanced deep learning methodologies. Our findings indicate that deep learning approaches perform better than traditional image processing-based approaches. In addition, a comprehensive compilation of various available datasets has also been provided. Furthermore, this review identifies the gaps and limitations in existing techniques, paving the way for future research and development in this area.

Spatially Adaptive Cloth Regression with Implicit Neural Representations

  • paper_url: http://arxiv.org/abs/2311.16344
  • repo_url: None
  • paper_authors: Lei Shu, Vinicius Azevedo, Barbara Solenthaler, Markus Gross
  • for: This paper addresses the accurate representation of fine-detailed cloth wrinkles in computer graphics, a problem typically associated with high computational demands and complex methodologies.
  • methods: The paper proposes a novel anisotropic cloth regression technique built on implicit neural representations of surfaces, combining an innovative mesh-free sampling approach with a novel adversarial training scheme that balances the sampling and simulation objectives to improve efficiency and accuracy (a short sketch of the implicit-surface idea follows this entry).
  • results: Experiments across various cloth-object interaction scenarios show that, under the same memory constraints, the method consistently surpasses traditional discrete representations, particularly when modelling highly detailed localized wrinkles.
    Abstract The accurate representation of fine-detailed cloth wrinkles poses significant challenges in computer graphics. The inherently non-uniform structure of cloth wrinkles mandates the employment of intricate discretization strategies, which are frequently characterized by high computational demands and complex methodologies. Addressing this, the research introduced in this paper elucidates a novel anisotropic cloth regression technique that capitalizes on the potential of implicit neural representations of surfaces. Our first core contribution is an innovative mesh-free sampling approach, crafted to reduce the reliance on traditional mesh structures, thereby offering greater flexibility and accuracy in capturing fine cloth details. Our second contribution is a novel adversarial training scheme, which is designed meticulously to strike a harmonious balance between the sampling and simulation objectives. The adversarial approach ensures that the wrinkles are represented with high fidelity, while also maintaining computational efficiency. Our results showcase through various cloth-object interaction scenarios that our method, given the same memory constraints, consistently surpasses traditional discrete representations, particularly when modelling highly-detailed localized wrinkles.
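    The following is a minimal sketch of the two ingredients named above, an implicit neural surface and mesh-free sampling, written as a generic PyTorch illustration; the layer sizes and the condition code are assumptions, not the paper's architecture.

        import torch
        import torch.nn as nn

        # Minimal implicit surface network: maps a 3D query point (plus an optional
        # condition code) to a signed distance value. Hypothetical layer sizes.
        class ImplicitSurface(nn.Module):
            def __init__(self, cond_dim=16, hidden=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(3 + cond_dim, hidden), nn.Softplus(beta=100),
                    nn.Linear(hidden, hidden), nn.Softplus(beta=100),
                    nn.Linear(hidden, 1),
                )

            def forward(self, xyz, cond):
                return self.net(torch.cat([xyz, cond], dim=-1)).squeeze(-1)

        # Mesh-free sampling: draw random query points in a bounding box around the
        # cloth instead of relying on a fixed mesh discretization.
        surface = ImplicitSurface()
        cond = torch.zeros(4096, 16)                      # placeholder condition code
        points = torch.rand(4096, 3) * 2.0 - 1.0          # uniform samples in [-1, 1]^3
        sdf = surface(points, cond)                       # signed distances at samples
        print(sdf.shape)                                  # torch.Size([4096])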

Multi-3D-Models Registration-Based Augmented Reality (AR) Instructions for Assembly

  • paper_url: http://arxiv.org/abs/2311.16337
  • repo_url: None
  • paper_authors: Seda Tuzun Canadinc, Wei Yan
  • for: Introduces a novel, markerless, step-by-step, in-situ 3D Augmented Reality (AR) instruction method, BRICKxAR (M3D), for small parts assembly.
  • methods: Uses deep learning-trained 3D model-based registration to realistically visualize rendered 3D assembly parts at the assembly location while the user controls the assembly process through a user interface; multiple assembly phases are combined with a step count to handle model updates at each step and parts occluded from the camera.
  • results: Testing and heuristic evaluation show that BRICKxAR (M3D) provides robust 3D AR instructions while allowing the user to handle the assembly model.
    Abstract This paper introduces a novel, markerless, step-by-step, in-situ 3D Augmented Reality (AR) instruction method and its application - BRICKxAR (Multi 3D Models/M3D) - for small parts assembly. BRICKxAR (M3D) realistically visualizes rendered 3D assembly parts at the assembly location of the physical assembly model (Figure 1). The user controls the assembly process through a user interface. BRICKxAR (M3D) utilizes deep learning-trained 3D model-based registration. Object recognition and tracking become challenging as the assembly model updates at each step. Additionally, not every part in a 3D assembly may be visible to the camera during the assembly. BRICKxAR (M3D) combines multiple assembly phases with a step count to address these challenges. Thus, using fewer phases simplifies the complex assembly process while step count facilitates accurate object recognition and precise visualization of each step. A testing and heuristic evaluation of the BRICKxAR (M3D) prototype and qualitative analysis were conducted with users and experts in visualization and human-computer interaction. Providing robust 3D AR instructions and allowing the handling of the assembly model, BRICKxAR (M3D) has the potential to be used at different scales ranging from manufacturing assembly to construction.

Characterizing Video Question Answering with Sparsified Inputs

  • paper_url: http://arxiv.org/abs/2311.16311
  • repo_url: None
  • paper_authors: Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang, Gunnar A. Sigurdsson
  • for: This work aims to improve the data efficiency of Video Question Answering by selecting only the most useful video frames while maintaining task performance.
  • methods: A Gumbel-based learnable selection module adaptively selects the best inputs (video frames and other modalities) for the final task (a short selection sketch follows this entry).
  • results: Experiments show that keeping only 10% of the video length (2-4 frames per video) retains 94.8%-95.2% of the task performance, i.e., only a 5.2%-5.8% drop. The study also observes complementary behaviour between visual and textual inputs even under highly sparsified settings, suggesting the potential of improving data efficiency for video-and-language tasks.
    Abstract In Video Question Answering, videos are often processed as a full-length sequence of frames to ensure minimal loss of information. Recent works have demonstrated evidence that sparse video inputs are sufficient to maintain high performance. However, they usually discuss the case of single frame selection. In our work, we extend the setting to multiple number of inputs and other modalities. We characterize the task with different input sparsity and provide a tool for doing that. Specifically, we use a Gumbel-based learnable selection module to adaptively select the best inputs for the final task. In this way, we experiment over public VideoQA benchmarks and provide analysis on how sparsified inputs affect the performance. From our experiments, we have observed only 5.2%-5.8% loss of performance with only 10% of video lengths, which corresponds to 2-4 frames selected from each video. Meanwhile, we also observed the complimentary behaviour between visual and textual inputs, even under highly sparsified settings, suggesting the potential of improving data efficiency for video-and-language tasks.
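    A minimal sketch of a Gumbel-based learnable frame selector in the spirit of the module described above (a generic re-implementation, not the authors' code; the scorer, k, and temperature are assumptions, and independent draws can occasionally pick the same frame — a top-k relaxation would avoid that):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GumbelFrameSelector(nn.Module):
            def __init__(self, feat_dim, k=4, tau=1.0):
                super().__init__()
                self.scorer = nn.Linear(feat_dim, 1)   # per-frame relevance score
                self.k, self.tau = k, tau

            def forward(self, frame_feats):                        # (B, T, D) frame features
                logits = self.scorer(frame_feats).squeeze(-1)      # (B, T)
                # k straight-through one-hot selections, differentiable w.r.t. the scorer
                picks = [F.gumbel_softmax(logits, tau=self.tau, hard=True)
                         for _ in range(self.k)]
                mask = torch.stack(picks, dim=1)                   # (B, k, T)
                return torch.bmm(mask, frame_feats)                # (B, k, D) selected frames

        selector = GumbelFrameSelector(feat_dim=512, k=4)
        feats = torch.randn(2, 32, 512)                  # 2 videos, 32 frame features each
        selected = selector(feats)
        print(selected.shape)                            # torch.Size([2, 4, 512])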

Robust Self-calibration of Focal Lengths from the Fundamental Matrix

  • paper_url: http://arxiv.org/abs/2311.16304
  • repo_url: https://github.com/kocurvik/robust_self_calibration
  • paper_authors: Viktor Kocur, Daniel Kyselica, Zuzana Kúkelová
  • for: Estimating the focal lengths and principal points of two cameras from a given fundamental matrix, one of the basic problems in geometric computer vision.
  • methods: An efficient and robust iterative method that estimates the focal lengths and principal points from the fundamental matrix together with priors for the camera parameters; a computationally efficient check of models generated within RANSAC is also studied (a sketch of the underlying constraint follows this entry).
  • results: The approach reduces the total computational time and, on real and synthetic data, yields significantly more accurate focal length estimates than the Bougnoux formula and other state-of-the-art methods, even when relying on inaccurate priors.
    Abstract The problem of self-calibration of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision. Under the assumption of known principal points and square pixels, the well-known Bougnoux formula offers a means to compute the two unknown focal lengths. However, in many practical situations, the formula yields inaccurate results due to commonly occurring singularities. Moreover, the estimates are sensitive to noise in the computed fundamental matrix and to the assumed positions of the principal points. In this paper, we therefore propose an efficient and robust iterative method to estimate the focal lengths along with the principal points of the cameras given a fundamental matrix and priors for the estimated camera parameters. In addition, we study a computationally efficient check of models generated within RANSAC that improves the accuracy of the estimated models while reducing the total computational time. Extensive experiments on real and synthetic data show that our iterative method brings significant improvements in terms of the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods, even when relying on inaccurate priors.
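    As background, a minimal sketch of the self-calibration constraint this task builds on (not the authors' solver): with principal points at the image origin and square pixels, K_i = diag(f_i, f_i, 1), and E = K2^T F K1 must have two equal non-zero singular values, so the focal lengths can be recovered by minimizing the deviation from that constraint.

        import numpy as np
        from scipy.optimize import minimize

        def essential_residual(log_f, F):
            f1, f2 = np.exp(log_f)                       # optimize in log space so f > 0
            K1 = np.diag([f1, f1, 1.0])
            K2 = np.diag([f2, f2, 1.0])
            s = np.linalg.svd(K2.T @ F @ K1, compute_uv=False)
            return (s[0] - s[1]) / (s[0] + s[1])         # 0 when the two singular values match

        def estimate_focals(F, f_prior=1000.0):
            x0 = np.log([f_prior, f_prior])              # prior focal length as initialization
            res = minimize(lambda x: essential_residual(x, F) ** 2, x0, method="Nelder-Mead")
            return np.exp(res.x)

        # Usage with a made-up fundamental matrix (replace with a real rank-2 estimate):
        F = np.random.randn(3, 3)
        F = F / np.linalg.norm(F)
        print(estimate_focals(F))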

Aligning Non-Causal Factors for Transformer-Based Source-Free Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.16294
  • repo_url: None
  • paper_authors: Sunandini Sanyal, Ashish Ramayee Asokan, Suvaansh Bhambri, Pradyumna YM, Akshay Kulkarni, Jogendra Nath Kundu, R Venkatesh Babu
  • for: Improving target adaptation performance in source-free domain adaptation, where no source data is available during adaptation.
  • methods: A framework that disentangles causal from non-causal factors and aligns the non-causal factors first, using a style classification task for global alignment, before aligning the task-discriminative causal factors via target adaptation; vision transformers, with their strong shape bias and multi-head attention, are used to realize the disentanglement.
  • results: The method achieves state-of-the-art results on several domain adaptation (DA) benchmarks.
    Abstract Conventional domain adaptation algorithms aim to achieve better generalization by aligning only the task-discriminative causal factors between a source and target domain. However, we find that retaining the spurious correlation between causal and non-causal factors plays a vital role in bridging the domain gap and improving target adaptation. Therefore, we propose to build a framework that disentangles and supports causal factor alignment by aligning the non-causal factors first. We also investigate and find that the strong shape bias of vision transformers, coupled with its multi-head attention, make it a suitable architecture for realizing our proposed disentanglement. Hence, we propose to build a Causality-enforcing Source-Free Transformer framework (C-SFTrans) to achieve disentanglement via a novel two-stage alignment approach: a) non-causal factor alignment: non-causal factors are aligned using a style classification task which leads to an overall global alignment, b) task-discriminative causal factor alignment: causal factors are aligned via target adaptation. We are the first to investigate the role of vision transformers (ViTs) in a privacy-preserving source-free setting. Our approach achieves state-of-the-art results in several DA benchmarks.

VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identification

  • paper_url: http://arxiv.org/abs/2311.16278
  • repo_url: None
  • paper_authors: Baolu Li, Ping Liu, Lan Fu, Jinlong Li, Jianwu Fang, Zhigang Xu, Hongkai Yu
  • for: Improving the real-world performance of Vehicle Re-ID models by resolving the confused feature-space discrimination caused by differing camera view angles.
  • methods: Synthesizes a large number of vehicle images in a unified target pose to enhance feature discrimination. Since paired images of the same vehicle from different surveillance cameras may be unavailable in the real world, the paper proposes VehicleGAN, the first pair-flexible pose-guided image synthesis method for Vehicle Re-ID, which works in both supervised and unsupervised settings without geometric 3D models, and combines it with Joint Metric Learning (JML) that fuses real and synthetic data at the feature level (a schematic sketch follows this entry).
  • results: Experiments on the VeRi-776 and VehicleID datasets demonstrate the accuracy and effectiveness of the proposed VehicleGAN and JML.
    Abstract Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target pose, whose idea is to project the vehicles of diverse poses into the unified target pose so as to enhance feature discrimination. Considering that the paired data of the same vehicles in different traffic surveillance cameras might be not available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named as VehicleGAN in this paper, which works for both supervised and unsupervised settings without the knowledge of geometric 3D models. Because of the feature distribution difference between real and synthetic data, simply training a traditional metric learning based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory, therefore we propose a new Joint Metric Learning (JML) via effective feature-level fusion from both real and synthetic data. Intensive experimental results on the public VeRi-776 and VehicleID datasets prove the accuracy and effectiveness of our proposed VehicleGAN and JML.
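    Below is a generic, illustrative sketch of feature-level fusion for metric learning in the spirit described above; the averaging fusion, toy encoder, and triplet loss are assumptions, not the paper's exact JML formulation.

        import torch
        import torch.nn as nn

        # Fuse the embedding of a real image with embeddings of its pose-synthesized
        # variants, then train with a standard triplet loss on the fused embedding.
        def fused_embedding(encoder, real_img, synth_imgs):
            feats = [encoder(real_img)] + [encoder(s) for s in synth_imgs]
            return torch.stack(feats, dim=0).mean(dim=0)          # simple average fusion

        encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))   # toy encoder
        triplet = nn.TripletMarginLoss(margin=0.3)

        real  = torch.randn(8, 3, 64, 64)                          # anchor vehicles
        synth = [torch.randn(8, 3, 64, 64) for _ in range(2)]      # GAN-synthesized poses
        pos   = torch.randn(8, 3, 64, 64)                          # same id, another camera
        neg   = torch.randn(8, 3, 64, 64)                          # different identity

        loss = triplet(fused_embedding(encoder, real, synth), encoder(pos), encoder(neg))
        loss.backward()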

Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification

  • paper_url: http://arxiv.org/abs/2311.17074
  • repo_url: None
  • paper_authors: Siyuan Huang, Yifan Zhou, Ram Prabhakar Kathirvel, Rama Chellappa, Chun Pong Lau
  • for: Improving person re-identification (ReID) performance and generalization across different ReID settings.
  • methods: SemReID, a self-supervised ReID model that leverages interactive segmentation models for adaptive part-based semantic extraction, and refines its semantic representation with techniques such as image masking and KoLeo regularization (a sketch of the KoLeo regularizer follows this entry).
  • results: Superior performance over state-of-the-art methods on three types of ReID datasets (standard ReID, CC-ReID, and unconstrained ReID); the work also introduces LUPerson-Part, a new large-scale person dataset with fine-grained part semantics to help ReID methods acquire robust performance.
    Abstract Interactive Segmentation Models (ISMs) like the Segment Anything Model have significantly improved various computer vision tasks, yet their application to Person Re-identification (ReID) remains limited. On the other hand, existing semantic pre-training models for ReID often have limitations like predefined parsing ranges or coarse semantics. Additionally, ReID and Clothes-Changing ReID (CC-ReID) are usually treated separately due to their different domains. This paper investigates whether utilizing precise human-centric semantic representation can boost the ReID performance and improve the generalization among various ReID tasks. We propose SemReID, a self-supervised ReID model that leverages ISMs for adaptive part-based semantic extraction, contributing to the improvement of ReID performance. SemReID additionally refines its semantic representation through techniques such as image masking and KoLeo regularization. Evaluation across three types of ReID datasets -- standard ReID, CC-ReID, and unconstrained ReID -- demonstrates superior performance compared to state-of-the-art methods. In addition, recognizing the scarcity of large person datasets with fine-grained semantics, we introduce the novel LUPerson-Part dataset to assist ReID methods in acquiring the fine-grained part semantics for robust performance.
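    A sketch of the KoLeo regularizer mentioned above (the Kozachenko-Leonenko differential-entropy estimator, as used e.g. in DINOv2): it spreads embeddings apart by penalizing small within-batch nearest-neighbor distances. The batch size and feature dimension here are placeholders.

        import torch
        import torch.nn.functional as F

        def koleo_loss(features, eps=1e-8):
            x = F.normalize(features, dim=-1)                       # (N, D) unit vectors
            sim = x @ x.t()                                         # cosine similarities
            sim = sim - torch.eye(x.shape[0], device=x.device) * 2.0   # exclude self-matches
            nn_sim = sim.max(dim=1).values                          # nearest-neighbor similarity
            nn_dist = torch.sqrt(torch.clamp(2.0 - 2.0 * nn_sim, min=eps))
            return -torch.log(nn_dist + eps).mean()                 # small distances are penalized

        feats = torch.randn(64, 256, requires_grad=True)            # a batch of embeddings
        loss = koleo_loss(feats)
        loss.backward()
        print(float(loss))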

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

  • paper_url: http://arxiv.org/abs/2311.16241
  • repo_url: https://github.com/google-research/semivl
  • paper_authors: Lukas Hoyer, David Joseph Tan, Muhammad Ferjad Naeem, Luc Van Gool, Federico Tombari
  • for: Bringing vision-language models (VLMs) to semi-supervised semantic segmentation in order to reduce the annotation effort.
  • methods: Integrates the rich priors of VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries, introduces a spatial fine-tuning strategy for label-efficient learning, a language-guided decoder that jointly reasons over vision and language, and language guidance in the form of class definitions to handle ambiguous class labels.
  • results: Evaluated on four semantic segmentation datasets, SemiVL significantly outperforms previous semi-supervised methods, e.g., +13.5 mIoU on COCO with 232 annotated images and +6.1 mIoU on Pascal VOC with 92 labels.
    Abstract In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state-of-the-art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl

GART: Gaussian Articulated Template Models

  • paper_url: http://arxiv.org/abs/2311.16099
  • repo_url: https://github.com/JiahuiLei/GART
  • paper_authors: Jiahui Lei, Yufu Wang, Georgios Pavlakos, Lingjie Liu, Kostas Daniilidis
  • for: An explicit, efficient, and expressive representation for capturing and rendering the geometry and appearance of non-rigid articulated subjects from monocular videos.
  • methods: Approximates the deformable subject's geometry and appearance with a mixture of moving 3D Gaussians, combines a categorical template model prior (SMPL, SMAL, etc.) with learnable forward skinning, and generalizes to more complex non-rigid deformations via novel latent bones (a generic skinning sketch follows this entry).
  • results: GART can be reconstructed from monocular videos via differentiable rendering in seconds or minutes and rendered in novel poses at more than 150 fps.
    Abstract We introduce Gaussian Articulated Template Model GART, an explicit, efficient, and expressive representation for non-rigid articulated subject capturing and rendering from monocular videos. GART utilizes a mixture of moving 3D Gaussians to explicitly approximate a deformable subject's geometry and appearance. It takes advantage of a categorical template model prior (SMPL, SMAL, etc.) with learnable forward skinning while further generalizing to more complex non-rigid deformations with novel latent bones. GART can be reconstructed via differentiable rendering from monocular videos in seconds or minutes and rendered in novel poses faster than 150fps.
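    A generic illustration of forward linear blend skinning applied to Gaussian means, in the spirit of the template-prior skinning described above; the weights, joint count, and identity transforms are placeholders, and the corresponding rotation of Gaussian covariances is omitted.

        import numpy as np

        def skin_gaussian_means(means, skin_weights, bone_transforms):
            """means: (N, 3), skin_weights: (N, B), bone_transforms: (B, 4, 4)."""
            means_h = np.concatenate([means, np.ones((means.shape[0], 1))], axis=1)  # (N, 4)
            # Blend the 4x4 bone transforms per Gaussian, then apply them.
            blended = np.einsum("nb,bij->nij", skin_weights, bone_transforms)        # (N, 4, 4)
            posed = np.einsum("nij,nj->ni", blended, means_h)                        # (N, 4)
            return posed[:, :3]

        N, B = 1000, 24                                    # e.g. SMPL has 24 joints
        means = np.random.randn(N, 3)                      # canonical Gaussian centers
        weights = np.random.dirichlet(np.ones(B), size=N)  # skinning weights, rows sum to 1
        transforms = np.tile(np.eye(4), (B, 1, 1))         # identity pose for the demo
        print(skin_gaussian_means(means, weights, transforms).shape)  # (1000, 3)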

CG-HOI: Contact-Guided 3D Human-Object Interaction Generation

  • paper_url: http://arxiv.org/abs/2311.16097
  • repo_url: None
  • paper_authors: Christian Diller, Angela Dai
  • for: The task of generating dynamic 3D human-object interactions (HOIs) from text.
  • methods: Models human motion, object motion, and contact in a joint diffusion process inter-correlated through cross-attention, and uses the learned contact as guidance during inference synthesis.
  • results: Generates realistic and physically plausible interaction sequences; conditioned on a given object trajectory, the corresponding human motion can be generated without re-training, demonstrating strong human-object interdependency learning, and the approach also applies to static real-world 3D scene scans.
    Abstract We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference synthesis of realistic, coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.

Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling

  • paper_url: http://arxiv.org/abs/2311.16096
  • repo_url: https://github.com/lizhe00/animatablegaussians
  • paper_authors: Zhe Li, Zerong Zheng, Lizhen Wang, Yebin Liu
  • for: Modeling animatable human avatars from RGB videos.
  • methods: Instead of MLP-based neural radiance fields (NeRF), which struggle to regress pose-dependent garment details, the paper introduces Animatable Gaussians: a parametric template learned from the input videos is parameterized on front & back canonical Gaussian maps, a StyleGAN-based CNN learns pose-dependent Gaussian maps for detailed dynamic appearance (including looser clothes such as dresses), and a pose projection strategy improves generalization to novel poses.
  • results: The method creates lifelike avatars with dynamic, realistic, and generalized appearances, and experiments show that it outperforms other state-of-the-art approaches. Code: https://github.com/lizhe00/AnimatableGaussians
    Abstract Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to regress pose-dependent garment details. To this end, we introduce Animatable Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians with the animatable avatar, we learn a parametric template from the input videos, and then parameterize the template on two front \& back canonical Gaussian maps where each pixel represents a 3D Gaussian. The learned template is adaptive to the wearing garments for modeling looser clothes like dresses. Such template-guided 2D parameterization enables us to employ a powerful StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling detailed dynamic appearances. Furthermore, we introduce a pose projection strategy for better generalization given novel poses. Overall, our method can create lifelike avatars with dynamic, realistic and generalized appearances. Experiments show that our method outperforms other state-of-the-art approaches. Code: https://github.com/lizhe00/AnimatableGaussians

Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images

  • paper_url: http://arxiv.org/abs/2311.16094
  • repo_url: None
  • paper_authors: Aiyu Cui, Jay Mahajan, Viraj Shah, Preeti Gomathinayagam, Svetlana Lazebnik
  • for: This paper focuses on virtual try-on technology for in-the-wild scenes, specifically street scenes, and aims to fill the gap in current research by introducing a new benchmark and a novel method that can learn without paired data.
  • methods: The proposed method combines a DensePose warping correction method with diffusion-based inpainting controlled by pose and semantic segmentation to achieve robust performance across shop and street domains.
  • results: The authors' experiments demonstrate competitive performance for standard studio try-on tasks and state-of-the-art (SOTA) performance for street try-on and cross-domain try-on tasks.
    Abstract Virtual try-on has become a popular research topic, but most existing methods focus on studio images with a clean background. They can achieve plausible results for this studio try-on setting by learning to warp a garment image to fit a person's body from paired training data, i.e., garment images paired with images of people wearing the same garment. Such data is often collected from commercial websites, where each garment is demonstrated both by itself and on several models. By contrast, it is hard to collect paired data for in-the-wild scenes, and therefore, virtual try-on for casual images of people against cluttered backgrounds is rarely studied. In this work, we fill the gap in the current virtual try-on research by (1) introducing a Street TryOn benchmark to evaluate performance on street scenes and (2) proposing a novel method that can learn without paired data, from a set of in-the-wild person images directly. Our method can achieve robust performance across shop and street domains using a novel DensePose warping correction method combined with diffusion-based inpainting controlled by pose and semantic segmentation. Our experiments demonstrate competitive performance for standard studio try-on tasks and SOTA performance for street try-on and cross-domain try-on tasks.

Self-correcting LLM-controlled Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16090
  • repo_url: None
  • paper_authors: Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell
  • for: Improving the accuracy and reliability of diffusion models in text-to-image generation, in particular their ability to follow complex input prompts.
  • methods: Proposes the Self-correcting LLM-controlled Diffusion (SLD) framework, which generates an image from the prompt, assesses the image's alignment with the prompt, and self-corrects the inaccuracies in the generated image, turning generation into an iterative closed loop steered by an LLM controller (a schematic of the loop follows this entry).
  • results: Experiments show that SLD rectifies a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships; by simply adjusting the instructions to the LLM, SLD can also perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines.
    Abstract Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images, current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller, SLD turns text-to-image generation into an iterative closed-loop process, ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access, such as DALL-E 3, to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships. Furthermore, by simply adjusting the instructions to the LLM, SLD can perform image editing tasks, bridging the gap between text-to-image generation and image editing pipelines. We will make our code available for future research and applications.
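    A schematic of the generate-assess-correct loop described above; the callables (generator, detector, llm_controller, generator.edit) are hypothetical placeholders standing in for a text-to-image model, an object detector, and an LLM that proposes correction operations — not the authors' API.

        # Schematic sketch of the closed-loop self-correction idea (hypothetical helpers).
        def self_correcting_generation(prompt, generator, detector, llm_controller,
                                       max_rounds=3):
            image = generator(prompt)                     # initial text-to-image pass
            for _ in range(max_rounds):
                layout = detector(image, prompt)          # detected objects + boxes
                ops = llm_controller(prompt, layout)      # e.g. add/move/remove/resize
                if not ops:                               # LLM judges the image as correct
                    break
                image = generator.edit(image, ops)        # apply the correction step
            return image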

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.16498
  • repo_url: https://github.com/magic-research/magic-animate
  • paper_authors: Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou
  • for: The human image animation task: generating a video of a reference identity following a particular motion sequence. Existing methods typically animate the reference image with frame-warping techniques and, despite reasonable results, struggle with temporal consistency and with preserving the reference identity.
  • methods: MagicAnimate, a diffusion-based framework for improving temporal consistency, identity preservation, and animation fidelity: a video diffusion model encodes temporal information, a novel appearance encoder retains the intricate details of the reference image, and a simple video fusion technique encourages smooth transitions in long video animation.
  • results: The method outperforms baseline approaches on two benchmarks; notably, it surpasses the strongest baseline by over 38% in video fidelity on the challenging TikTok dancing dataset.
    Abstract This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.

DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization

  • paper_url: http://arxiv.org/abs/2311.16060
  • repo_url: https://github.com/jeffery9707/diffslva
  • paper_authors: Zhaoyang Xia, Carol Neidle, Dimitris N. Metaxas
  • for: Providing a new method for sign language video anonymization that is practical for real-world applications, to the benefit of the Deaf and Hard-of-Hearing communities.
  • methods: Uses pre-trained large-scale diffusion models for zero-shot text-guided anonymization and incorporates ControlNet driven by low-level image features (HED edges) to circumvent the need for precise pose estimation; a specialized module captures facial expressions, which convey critical linguistic information in signed languages.
  • results: Signer anonymization experiments show that the method anonymizes sign language videos while preserving their essential linguistic content, making it usable in real-world applications.
    Abstract Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.

Seeing Beyond Cancer: Multi-Institutional Validation of Object Localization and 3D Semantic Segmentation using Deep Learning for Breast MRI

  • paper_url: http://arxiv.org/abs/2311.16213
  • repo_url: None
  • paper_authors: Arda Pekis, Vignesh Kannan, Evandros Kaklamanos, Anu Antony, Snehal Patel, Tyler Earnest
  • for: breast cancer staging, prognosis, and surgical planning
  • methods: semantic segmentation using 2D object detectors and 3D U-nets, pre-trained on ImageNet and COCO, and operated on MIP images
  • results: superior Dice score on tumor segmentation while maintaining competitive performance on other studied tissues across multiple institutions
    Abstract The clinical management of breast cancer depends on an accurate understanding of the tumor and its anatomical context to adjacent tissues and landmark structures. This context may be provided by semantic segmentation methods; however, previous works have been largely limited to a singular focus on the tumor alone and rarely other tissue types. In contrast, we present a method that exploits tissue-tissue interactions to accurately segment every major tissue type in the breast including: chest wall, skin, adipose tissue, fibroglandular tissue, vasculature and tumor via standard-of-care Dynamic Contrast Enhanced MRI. Comparing our method to prior state-of-the-art, we achieved a superior Dice score on tumor segmentation while maintaining competitive performance on other studied tissues across multiple institutions. Briefly, our method proceeds by localizing the tumor using 2D object detectors, then segmenting the tumor and surrounding tissues independently using two 3D U-nets, and finally integrating these results while mitigating false positives by checking for anatomically plausible tissue-tissue contacts. The object detection models were pre-trained on ImageNet and COCO, and operated on MIP (maximum intensity projection) images in the axial and sagittal planes, establishing a 3D tumor bounding box. By integrating multiple relevant peri-tumoral tissues, our work enables clinical applications in breast cancer staging, prognosis and surgical planning.
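    For reference, the Dice score quoted in the results measures volumetric overlap between a predicted and a ground-truth mask, Dice = 2|A ∩ B| / (|A| + |B|); a minimal NumPy illustration with toy masks:

        import numpy as np

        def dice_score(pred, target, eps=1e-8):
            pred = pred.astype(bool)
            target = target.astype(bool)
            intersection = np.logical_and(pred, target).sum()
            return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

        # Toy 3D masks standing in for a predicted and a ground-truth tissue segmentation.
        pred = np.zeros((4, 8, 8), dtype=np.uint8); pred[:, 2:6, 2:6] = 1
        gt   = np.zeros((4, 8, 8), dtype=np.uint8); gt[:, 3:7, 3:7] = 1
        print(dice_score(pred, gt))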

Segment Every Out-of-Distribution Object

  • paper_url: http://arxiv.org/abs/2311.16516
  • repo_url: None
  • paper_authors: Wenjie Zhao, Jia Li, Xin Dong, Yu Xiang, Yunhui Guo
  • for: Improving the real-world applicability of semantic segmentation models by detecting and segmenting out-of-distribution (OoD) objects.
  • methods: Introduces S2M, which converts anomaly scores into segmentation masks: instead of assigning anomaly scores to pixels and thresholding them, S2M transforms the anomaly scores into prompts for a promptable segmentation model and directly segments the entire OoD object, eliminating threshold selection (a schematic sketch follows this entry).
  • results: Experiments show that S2M outperforms the state-of-the-art by approximately 10% in IoU and 30% in mean F1 score, on average, across benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly.
    Abstract Semantic segmentation models, while effective for in-distribution categories, face challenges in real-world deployment due to encountering out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for safety-critical applications. Existing methods rely on anomaly scores, but choosing a suitable threshold for generating masks presents difficulties and can lead to fragmentation and inaccuracy. This paper introduces a method to convert anomaly Score To segmentation Mask, called S2M, a simple and effective framework for OoD detection in semantic segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments the entire OoD object. By transforming anomaly scores into prompts for a promptable segmentation model, S2M eliminates the need for threshold selection. Extensive experiments demonstrate that S2M outperforms the state-of-the-art by approximately 10\% in IoU and 30\% in mean F1 score, on average, across various benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets.
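    A schematic sketch of turning an anomaly map into point prompts for a promptable segmenter (e.g. a SAM-like model); the peak-picking heuristic and the promptable_segmenter call are illustrative assumptions, not the authors' exact pipeline.

        import numpy as np

        def anomaly_to_prompts(anomaly_map, num_prompts=3, suppress_radius=20):
            prompts = []
            score = anomaly_map.copy()
            for _ in range(num_prompts):
                y, x = np.unravel_index(np.argmax(score), score.shape)  # peak location
                prompts.append((int(x), int(y)))
                y0, y1 = max(0, y - suppress_radius), y + suppress_radius
                x0, x1 = max(0, x - suppress_radius), x + suppress_radius
                score[y0:y1, x0:x1] = -np.inf                # suppress around the chosen peak
            return prompts

        anomaly_map = np.random.rand(256, 512)               # per-pixel anomaly scores
        point_prompts = anomaly_to_prompts(anomaly_map)
        # masks = [promptable_segmenter(image, p) for p in point_prompts]  # hypothetical call
        print(point_prompts)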

Exploring Attribute Variations in Style-based GANs using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16052
  • repo_url: None
  • paper_authors: Rishubh Parihar, Prasanna Balaji, Raghav Magazine, Sarthak Vora, Tejan Karmali, Varun Jampani, R. Venkatesh Babu
  • for: Providing a diverse attribute editing method so that users can generate multiple plausible edits per attribute.
  • methods: Builds on the disentangled latent space of a pretrained GAN and trains a Denoising Diffusion Probabilistic Model (DDPM) over edit latent directions, obtained by embedding image pairs with a single attribute change, to learn the latent distribution of diverse edits.
  • results: Extensive qualitative and quantitative experiments across a range of datasets demonstrate the effectiveness of the approach, which also yields good results when applied to 3D editing of various face attributes.
    Abstract Existing attribute editing methods treat semantic attributes as binary, resulting in a single edit per attribute. However, attributes such as eyeglasses, smiles, or hairstyles exhibit a vast range of diversity. In this work, we formulate the task of \textit{diverse attribute editing} by modeling the multidimensional nature of attribute edits. This enables users to generate multiple plausible edits per attribute. We capitalize on disentangled latent spaces of pretrained GANs and train a Denoising Diffusion Probabilistic Model (DDPM) to learn the latent distribution for diverse edits. Specifically, we train DDPM over a dataset of edit latent directions obtained by embedding image pairs with a single attribute change. This leads to latent subspaces that enable diverse attribute editing. Applying diffusion in the highly compressed latent space allows us to model rich distributions of edits within limited computational resources. Through extensive qualitative and quantitative experiments conducted across a range of datasets, we demonstrate the effectiveness of our approach for diverse attribute editing. We also showcase the results of our method applied for 3D editing of various face attributes.

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.16518
  • repo_url: https://github.com/cswry/seesr
  • paper_authors: Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, Lei Zhang
  • for: Improving semantic fidelity in generative real-world image super-resolution.
  • methods: Trains a degradation-aware prompt extractor that generates accurate soft and hard semantic prompts (image tags plus complementary representation information) even under strong degradation, and integrates the LR image into the initial sampling noise at inference time to mitigate the T2I model's tendency to generate excessive random details (a sketch of one such initialization follows this entry).
  • results: The method reproduces more realistic image details while better preserving semantics.
    Abstract Owe to the powerful generative priors, the pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts can encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and hold better the semantics.
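    A minimal sketch of one way to seed the diffusion sampler with the LR content instead of pure noise (an SDEdit-style initialization in standard DDPM notation; the paper's exact scheme may differ, and the latent shape and alpha-bar value are placeholders).

        import torch

        # z_lr is the latent of the upsampled LR image (e.g. a VAE encoding);
        # blending it into the initial noise keeps the sampler anchored to the input.
        def init_latent_from_lr(z_lr, alpha_bar_T):
            noise = torch.randn_like(z_lr)
            return alpha_bar_T.sqrt() * z_lr + (1.0 - alpha_bar_T).sqrt() * noise

        z_lr = torch.randn(1, 4, 64, 64)                        # placeholder LR latent
        x_T = init_latent_from_lr(z_lr, torch.tensor(0.0047))   # alpha_bar at t = T
        print(x_T.shape)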

Relightable 3D Gaussian: Real-time Point Cloud Relighting with BRDF Decomposition and Ray Tracing

  • paper_url: http://arxiv.org/abs/2311.16043
  • repo_url: None
  • paper_authors: Jian Gao, Chun Gu, Youtian Lin, Hao Zhu, Xun Cao, Li Zhang, Yao Yao
  • for: A differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray tracing, and real-time relighting of the 3D point cloud.
  • methods: Represents the scene as relightable 3D Gaussian points, each carrying a normal direction, BRDF parameters, and incident light, with the incident light split into global and local components plus view-dependent visibility. The scene is optimized via 3D Gaussian Splatting while BRDF and lighting are decomposed through physically based differentiable rendering, and a bounding-volume-hierarchy-based point ray-tracing approach bakes visibility for real-time rendering and relighting with accurate shadows (a schematic shading equation follows this entry).
  • results: Compared with state-of-the-art material estimation approaches, the method achieves improved BRDF estimation and novel view rendering, and supports real-time rendering and relighting of 3D Gaussian points with accurate shadow effects.
    Abstract We present a novel differentiable point-based rendering framework for material and lighting decomposition from multi-view images, enabling editing, ray-tracing, and real-time relighting of the 3D point cloud. Specifically, a 3D scene is represented as a set of relightable 3D Gaussian points, where each point is additionally associated with a normal direction, BRDF parameters, and incident lights from different directions. To achieve robust lighting estimation, we further divide incident lights of each point into global and local components, as well as view-dependent visibilities. The 3D scene is optimized through the 3D Gaussian Splatting technique while BRDF and lighting are decomposed by physically-based differentiable rendering. Moreover, we introduce an innovative point-based ray-tracing approach based on the bounding volume hierarchy for efficient visibility baking, enabling real-time rendering and relighting of 3D Gaussian points with accurate shadow effects. Extensive experiments demonstrate improved BRDF estimation and novel view rendering results compared to state-of-the-art material estimation approaches. Our framework showcases the potential to revolutionize the mesh-based graphics pipeline with a relightable, traceable, and editable rendering pipeline solely based on point cloud. Project page:https://nju-3dv.github.io/projects/Relightable3DGaussian/.
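    Schematically, the per-point shading implied by the description can be written in standard rendering-equation notation (the paper's exact parameterization may differ):

        L_o(\mathbf{x}, \omega_o) = \sum_{i} f_r(\mathbf{x}, \omega_i, \omega_o)\,
            \big[ L_{\mathrm{global}}(\omega_i) + L_{\mathrm{local}}(\mathbf{x}, \omega_i) \big]\,
            V(\mathbf{x}, \omega_i)\, (\mathbf{n} \cdot \omega_i)\, \Delta\omega_i

    where f_r is the per-point BRDF, V the ray-traced (baked) visibility, n the point normal, and the sum runs over sampled incident directions.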

Weakly-Supervised 3D Reconstruction of Clothed Humans via Normal Maps

  • paper_url: http://arxiv.org/abs/2311.16042
  • repo_url: None
  • paper_authors: Jane Wu, Diego Thomas, Ronald Fedkiw
  • for: Deep learning-based 3D reconstruction of clothed humans under weak supervision, using only 2D normal maps as ground truth.
  • methods: The network infers a signed distance function (SDF) discretized on a tetrahedral mesh surrounding the body in a rest pose; Marching Tetrahedra extracts a triangulated surface from the SDF, and the inferred pose and camera parameters are used to render a normal map that is compared against the ground-truth 2D normal maps, with the whole pipeline differentiable for backpropagation (a sketch of SDF-derived normals follows this entry).
  • results: The approach is effective for both network inference and 3D reconstruction, and an optional multiview loss leads to improved results.
    Abstract We present a novel deep learning-based approach to the 3D reconstruction of clothed humans using weak supervision via 2D normal maps. Given a single RGB image or multiview images, our network infers a signed distance function (SDF) discretized on a tetrahedral mesh surrounding the body in a rest pose. Subsequently, inferred pose and camera parameters are used to generate a normal map from the SDF. A key aspect of our approach is the use of Marching Tetrahedra to (uniquely) compute a triangulated surface from the SDF on the tetrahedral mesh, facilitating straightforward differentiation (and thus backpropagation). Thus, given only ground truth normal maps (with no volumetric information ground truth information), we can train the network to produce SDF values from corresponding RGB images. Optionally, an additional multiview loss leads to improved results. We demonstrate the efficacy of our approach for both network inference and 3D reconstruction.
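    As a small illustration of why normal-map supervision is possible, surface normals can be obtained from an SDF as its normalized gradient; a generic autograd sketch with a toy sphere SDF (not the paper's code):

        import torch
        import torch.nn.functional as F

        def sdf_normals(sdf_fn, points):
            points = points.clone().requires_grad_(True)
            values = sdf_fn(points)
            grads = torch.autograd.grad(values.sum(), points, create_graph=True)[0]
            return F.normalize(grads, dim=-1)              # unit normals at the query points

        sphere_sdf = lambda p: p.norm(dim=-1) - 1.0        # unit sphere as a toy SDF
        pts = torch.randn(5, 3)
        print(sdf_normals(sphere_sdf, pts))                # ≈ pts / |pts| for a sphere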

GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions

  • paper_url: http://arxiv.org/abs/2311.16037
  • repo_url: None
  • paper_authors: Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, Qi Tian
  • for: 3D scene editing, in particular delicate and localized editing driven by text instructions.
  • methods: A systematic framework, GaussianEditor, for delicately editing 3D scenes represented by 3D Gaussians: the region of interest (RoI) corresponding to the text instruction is extracted and aligned to the 3D Gaussians, and this Gaussian RoI is then used to control the editing process.
  • results: GaussianEditor achieves more delicate and precise editing of 3D scenes than previous methods, with much faster training: about 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes to 2 hours).
    Abstract Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours).

GaitContour: Efficient Gait Recognition based on a Contour-Pose Representation

  • paper_url: http://arxiv.org/abs/2311.16497
  • repo_url: None
  • paper_authors: Yuxiang Guo, Anshul Shah, Jiang Liu, Rama Chellappa, Cheng Peng
  • for: Proposing a novel, point-based Contour-Pose representation for gait recognition that compactly expresses both body shape and body-part information.
  • methods: A local-to-global architecture, GaitContour, that leverages the new representation: a local transformer extracts features from five body regions, which are then aggregated into a global gait representation, significantly reducing the complexity of the attention operation.
  • results: GaitContour performs significantly better than previous point-based methods while being significantly more efficient than silhouette-based methods; on challenging datasets with significant distractors it can even outperform silhouette-based methods.
    Abstract Gait recognition holds the promise to robustly identify subjects based on walking patterns instead of appearance information. In recent years, this field has been dominated by learning methods based on two principal input representations: dense silhouette masks or sparse pose keypoints. In this work, we propose a novel, point-based Contour-Pose representation, which compactly expresses both body shape and body parts information. We further propose a local-to-global architecture, called GaitContour, to leverage this novel representation and efficiently compute subject embedding in two stages. The first stage consists of a local transformer that extracts features from five different body regions. The second stage then aggregates the regional features to estimate a global human gait representation. Such a design significantly reduces the complexity of the attention operation and improves efficiency and performance simultaneously. Through large scale experiments, GaitContour is shown to perform significantly better than previous point-based methods, while also being significantly more efficient than silhouette-based methods. On challenging datasets with significant distractors, GaitContour can even outperform silhouette-based methods.
    摘要 “走姿识别技术可以强制地识别人们基于行走模式而不是外表信息。在过去几年,这个领域主要由学习方法基于两种主要的输入表示:紧密的抽象面罩或簇分的动作关键。在这个工作中,我们提出了一个新的、点基于的Contour-Pose表示方法,可以紧扣地表示人体形状和人体部位信息。我们还提出了一个内部-到-全球架构,called GaitContour,来利用这个新的表示方法,高效地计算人们的对应物。这个设计可以实现对于注意力操作的简化,同时提高效率和性能。在大规模实验中,GaitContour被证明可以与之前的点基于方法相比,表现更好,同时也比紧密面罩基于的方法更高效。甚至在具有干扰素的测试 dataset 上,GaitContour可以超越紧密面罩基于的方法。”
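A toy sketch of the local-to-global design described above (not the released model): one small transformer per body region, max-pooled into a regional token, followed by a global transformer over the five regional tokens. The region grouping of the contour-pose points, the dimensions, and the number of identities are all assumptions.

```python
# Minimal local-to-global architecture sketch in PyTorch.
import torch
import torch.nn as nn

class LocalToGlobal(nn.Module):
    def __init__(self, dim=64, num_regions=5, num_classes=100):
        super().__init__()
        self.embed = nn.Linear(2, dim)  # 2D contour/pose points -> tokens
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.local = nn.ModuleList([nn.TransformerEncoder(layer(), 2) for _ in range(num_regions)])
        self.global_enc = nn.TransformerEncoder(layer(), 2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, points):                        # points: (B, regions, P, 2)
        tokens = self.embed(points)                   # (B, R, P, D)
        regional = []
        for r, enc in enumerate(self.local):
            feat = enc(tokens[:, r])                  # attention only inside region r
            regional.append(feat.max(dim=1).values)   # max-pool region tokens
        regional = torch.stack(regional, dim=1)       # (B, R, D)
        global_feat = self.global_enc(regional).mean(dim=1)
        return self.head(global_feat)

model = LocalToGlobal()
logits = model(torch.randn(4, 5, 20, 2))
print(logits.shape)  # torch.Size([4, 100])
```

Attention is computed within each small region and then over only five regional tokens, which is why this layout is cheaper than full attention over all points.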

VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2311.16492
  • repo_url: None
  • paper_authors: Zijian Zhou, Miaojing Shi, Holger Caesar
  • for: 提高图像理解的全面性,同时 segmentation 对象和预测对象之间的关系
  • methods: 利用语言信息和视觉信息,通过注意力机制进行关系预测
  • results: 与前一代方法相比,显著提高了PSG数据集上的relation预测精度,解决了实际应用中的长尾问题
    Abstract Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations.
    摘要 全景场景图生成(PSG)旨在通过同时分割对象并预测对象间的关系,实现对图像的全面理解。然而,关系的长尾分布导致其在实际应用中的效果不尽如人意。先前的方法主要依赖视觉信息,或仅使用有限的语言信息(如物体或关系名称),从而忽视了语言信息的作用。借助大语言模型(LLM)的最新进展,我们提出利用语言信息来辅助关系预测,尤其是罕见关系。为此,我们提出了视觉-语言提示(VLPrompt)模型:它从图像中获取视觉信息,并从 LLM 获取语言信息,随后通过基于注意力机制的提示网络实现精确的关系预测。大量实验表明,VLPrompt 在 PSG 数据集上显著超越了先前的最先进方法,证明了引入语言信息的有效性,并缓解了关系的长尾问题。
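A hedged sketch of the general idea (not the VLPrompt release): a subject-object visual feature attends over language features describing candidate relations, and a classifier predicts the relation. The precomputed features, the feature dimension, and the 56 relation classes are assumptions.

```python
# Cross-attention "prompter" sketch: vision queries attend over LLM text features.
import torch
import torch.nn as nn

class RelationPrompter(nn.Module):
    def __init__(self, dim=256, num_relations=56):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                        nn.Linear(dim, num_relations))

    def forward(self, pair_feats, text_feats):
        # pair_feats: (B, 1, D) visual feature of a subject-object pair
        # text_feats: (B, T, D) language features (e.g. LLM relation descriptions)
        attended, _ = self.cross_attn(pair_feats, text_feats, text_feats)
        fused = torch.cat([pair_feats, attended], dim=-1).squeeze(1)
        return self.classifier(fused)

model = RelationPrompter()
print(model(torch.randn(2, 1, 256), torch.randn(2, 10, 256)).shape)  # (2, 56)
```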

Automated Measurement of Vascular Calcification in Femoral Endarterectomy Patients Using Deep Learning

  • paper_url: http://arxiv.org/abs/2311.16001
  • repo_url: https://github.com/pip-alireza/deepcalcscoring
  • paper_authors: Alireza Bagheri Rajeoni, Breanna Pederson, Daniel G. Clair, Susan M. Lessner, Homayoun Valafar
  • For: The paper aims to develop a deep learning model for automated analysis of vascular calcification in patients with peripheral arterial disease (PAD) undergoing femoral endarterectomy surgery.
  • Methods: The authors employ a deep neural network (DNN) model to segment the vascular system in computed tomographic angiogram (CTA) images and measure vascular calcification from the left renal artery to the patella.
  • Results: The DNN model achieves 83.4% average Dice accuracy in segmenting arteries from aorta to patella, outperforming previous state-of-the-art methods. The authors also present a robust statistical analysis of automated calcification measurement in the lower extremities using deep learning, with a Mean Absolute Percentage Error (MAPE) of 9.5% and a correlation coefficient of 0.978 between automated and manual calcification scores.
    Abstract Atherosclerosis, a chronic inflammatory disease affecting the large arteries, presents a global health risk. Accurate analysis of diagnostic images, like computed tomographic angiograms (CTAs), is essential for staging and monitoring the progression of atherosclerosis-related conditions, including peripheral arterial disease (PAD). However, manual analysis of CTA images is time-consuming and tedious. To address this limitation, we employed a deep learning model to segment the vascular system in CTA images of PAD patients undergoing femoral endarterectomy surgery and to measure vascular calcification from the left renal artery to the patella. Utilizing proprietary CTA images of 27 patients undergoing femoral endarterectomy surgery provided by Prisma Health Midlands, we developed a Deep Neural Network (DNN) model to first segment the arterial system, starting from the descending aorta to the patella, and second, to provide a metric of arterial calcification. Our designed DNN achieved 83.4% average Dice accuracy in segmenting arteries from aorta to patella, advancing the state-of-the-art by 0.8%. Furthermore, our work is the first to present a robust statistical analysis of automated calcification measurement in the lower extremities using deep learning, attaining a Mean Absolute Percentage Error (MAPE) of 9.5% and a correlation coefficient of 0.978 between automated and manual calcification scores. These findings underscore the potential of deep learning techniques as a rapid and accurate tool for medical professionals to assess calcification in the abdominal aorta and its branches above the patella. The developed DNN model and related documentation in this project are available at GitHub page at https://github.com/pip-alireza/DeepCalcScoring.
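An illustrative sketch only of how a calcification metric can be read off a predicted arterial mask: count the volume of voxels above a Hounsfield-unit threshold inside the mask. The 130 HU threshold and plain calcified volume are common conventions, not necessarily the paper's exact scoring protocol.

```python
# Toy calcification scoring from a CT volume and a segmentation mask.
import numpy as np

def calcification_volume(ct_hu: np.ndarray, artery_mask: np.ndarray,
                         voxel_volume_mm3: float, hu_threshold: float = 130.0) -> float:
    """Volume (mm^3) of voxels above the HU threshold inside the artery mask."""
    calcified = (ct_hu >= hu_threshold) & artery_mask.astype(bool)
    return float(calcified.sum()) * voxel_volume_mm3

# toy example: random HU volume and a cuboid mask
ct = np.random.randint(-1000, 1500, size=(64, 64, 64)).astype(np.float32)
mask = np.zeros_like(ct, dtype=bool)
mask[20:40, 20:40, 20:40] = True
print(calcification_volume(ct, mask, voxel_volume_mm3=0.5))
```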

Adversarial Doodles: Interpretable and Human-drawable Attacks Provide Describable Insights

  • paper_url: http://arxiv.org/abs/2311.15994
  • repo_url: None
  • paper_authors: Ryoya Nara, Yusuke Matsui
  • for: The paper is written for researchers and practitioners in the field of computer vision and machine learning, specifically those interested in adversarial attacks and defenses.
  • methods: The paper proposes a new method called Adversarial Doodles, which uses black Bézier curves to generate interpretable adversarial examples that can provide insights into the mechanism of the target classifier. The method optimizes the doodled area and introduces random perspective transformation to obtain compact attacks that can be replicated by hand.
  • results: The paper demonstrates the effectiveness of Adversarial Doodles in fooling state-of-the-art deep neural network (DNN)-based image classification models. The authors show that the generated adversarial examples have interpretable shapes and provide describable insights into the relationship between the attacks and the classifier’s output. For example, the authors demonstrate that adding two strokes on the head of a bird image can cause the classifier to misclassify it as a butterfly.
    Abstract DNN-based image classification models are susceptible to adversarial attacks. Most previous adversarial attacks do not focus on the interpretability of the generated adversarial examples, and we cannot gain insights into the mechanism of the target classifier from the attacks. Therefore, we propose Adversarial Doodles, which have interpretable shapes. We optimize black b\'ezier curves to fool the target classifier by overlaying them onto the input image. By introducing random perspective transformation and regularizing the doodled area, we obtain compact attacks that cause misclassification even when humans replicate them by hand. Adversarial doodles provide describable and intriguing insights into the relationship between our attacks and the classifier's output. We utilize adversarial doodles and discover the bias inherent in the target classifier, such as "We add two strokes on its head, a triangle onto its body, and two lines inside the triangle on a bird image. Then, the classifier misclassifies the image as a butterfly."
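A small sketch of the core primitive the abstract describes: rasterizing a black quadratic Bézier stroke onto an image. The actual attack optimizes the control points against a target classifier; here the control points are fixed for illustration.

```python
# Draw a black quadratic Bezier "doodle" on an image (numpy only).
import numpy as np

def draw_bezier(img: np.ndarray, p0, p1, p2, thickness: int = 2) -> np.ndarray:
    out = img.copy()
    h, w = out.shape[:2]
    for t in np.linspace(0.0, 1.0, 200):
        # quadratic Bezier point: (1-t)^2 * p0 + 2t(1-t) * p1 + t^2 * p2
        x, y = ((1 - t) ** 2 * np.array(p0) + 2 * t * (1 - t) * np.array(p1)
                + t ** 2 * np.array(p2))
        xi, yi = int(round(x)), int(round(y))
        y0, y1 = max(0, yi - thickness), min(h, yi + thickness)
        x0, x1 = max(0, xi - thickness), min(w, xi + thickness)
        out[y0:y1, x0:x1] = 0.0  # black stroke
    return out

image = np.ones((224, 224, 3), dtype=np.float32)          # white canvas
doodled = draw_bezier(image, (30, 60), (120, 10), (200, 90))
print(doodled.min(), doodled.mean() < 1.0)
```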

DiffAnt: Diffusion Models for Action Anticipation

  • paper_url: http://arxiv.org/abs/2311.15991
  • repo_url: None
  • paper_authors: Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer
  • for: 预测未来动作的未定性。
  • methods: 使用扩散模型来捕捉不同的未来动作。
  • results: 在四个 benchmark 数据集上(Breakfast、50Salads、EpicKitchens 和 EGTEA Gaze+)实现了或与现状方法相当的Result,表明了生成方法的效iveness。
    Abstract Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow. This uncertainty becomes even larger when predicting far into the future. However, the majority of existing action anticipation models adhere to a deterministic approach, neglecting to account for future uncertainties. In this work, we rethink action anticipation from a generative view, employing diffusion models to capture different possible future actions. In this framework, future actions are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and subsequently transitioned into the action space. Extensive experiments on four benchmark datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action anticipation. Our code and trained models will be published on GitHub.
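A toy sketch of the generative idea: future-action embeddings are denoised from Gaussian noise with a DDPM-style reverse loop conditioned on observed-video features. The MLP denoiser, the noise schedule, and the decoding of embeddings into action labels are placeholders, not the DiffAnt implementation.

```python
# DDPM-style reverse sampling of a future-action embedding, conditioned on video features.
import torch
import torch.nn as nn

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, cond, t):
        t_emb = t.float().view(-1, 1) / T
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))  # predicted noise

@torch.no_grad()
def sample_future(denoiser, cond, dim=64):
    x = torch.randn(cond.size(0), dim)                 # start from Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, cond, torch.full((cond.size(0),), t))
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                           # latent future-action embedding

cond = torch.randn(2, 64)                              # observed-video features (stand-in)
print(sample_future(Denoiser(), cond).shape)           # torch.Size([2, 64])
```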

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

  • paper_url: http://arxiv.org/abs/2311.15980
  • repo_url: None
  • paper_authors: Yuanxun Lu, Jingyang Zhang, Shiwei Li, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao
  • for: 这篇论文的目的是提出一种基于2.5D扩散的高效、多视图、高品质3D内容生成方法。
  • methods: 该方法使用一个预训练的2D扩散模型进行微调,并通过一种新的多视图正见映射方法来拼接生成的多视图正见图像。
  • results: 经过广泛的实验表明,该方法可以在10秒内生成多视图正见图像,并且不需要任何后期优化。该方法可以生成多样化、模式寻找自由、高质量的3D内容。
    Abstract Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However, current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS), or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work, we approach the problem by employing a multi-view 2.5D diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data, while still maintaining the strong generalization ability of the original 2D diffusion model, filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference, multi-view normal maps are generated using the 2.5D diffusion, and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that, our direct 2.5D generation with the specially-designed fusion scheme can achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.
    摘要 生成式 AI 的最新进展为 3D 内容创作展现了巨大潜力。然而,现有方法要么在预训练 2D 扩散模型上使用耗时的分数蒸馏采样(SDS),要么直接在有限的 3D 数据上训练 3D 扩散模型,从而牺牲了生成多样性。在本工作中,我们采用由预训练 2D 扩散模型微调得到的多视角 2.5D 扩散来解决该问题。多视角 2.5D 扩散直接建模 3D 数据的结构分布,同时保留了原 2D 扩散模型强大的泛化能力,弥合了基于 2D 扩散与直接 3D 扩散的 3D 内容生成方法之间的差距。在推理阶段,先利用 2.5D 扩散生成多视角法线图,再通过一种新颖的可微光栅化方案将近乎一致的多视角法线图融合为一致的 3D 模型;我们还设计了以法线为条件的多视角图像生成模块,在给定 3D 几何的情况下快速生成外观。整个方法为单次扩散过程,无需任何 SDS 优化作为后处理。大量实验表明,我们的直接 2.5D 生成配合专门设计的融合方案,可在仅 10 秒内实现多样、无模式搜索(mode-seeking-free)且高保真的 3D 内容生成。项目页面:https://nju-3dv.github.io/projects/direct25。
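A minimal sketch, not the paper's differentiable rasterization: per-view normal maps are rotated into world coordinates and averaged, which illustrates why near-consistent 2.5D predictions can be fused at all. Camera rotations and identical image grids are assumed.

```python
# Fuse camera-space normal maps from several views into world-space normals.
import numpy as np

def fuse_normals(normal_maps, cam_rotations):
    """normal_maps: list of (H, W, 3) unit normals in camera space.
    cam_rotations: list of (3, 3) camera-to-world rotation matrices."""
    world = [n @ R.T for n, R in zip(normal_maps, cam_rotations)]  # rotate each view
    fused = np.mean(world, axis=0)
    norm = np.linalg.norm(fused, axis=-1, keepdims=True) + 1e-8
    return fused / norm                                            # renormalize

H = W = 8
maps = [np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1)) for _ in range(4)]
rots = [np.eye(3) for _ in range(4)]
print(fuse_normals(maps, rots)[0, 0])                              # [0. 0. 1.]
```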

Text2Loc: 3D Point Cloud Localization from Natural Language

  • paper_url: http://arxiv.org/abs/2311.15977
  • repo_url: None
  • paper_authors: Yan Xia, Letian Shi, Zifeng Ding, João F. Henriques, Daniel Cremers
  • for: This work targets the problem of 3D point cloud localization from natural language descriptions and introduces a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text.
  • methods: Text2Loc follows a coarse-to-fine localization pipeline, including text-submap global place recognition and fine localization. The global place recognition uses a hierarchical transformer with max-pooling (HTM) to capture relational dynamics among textual hints, while the fine localization uses a novel matching-free method that completely removes the need for text-instance matching and is lighter, faster, and more accurate than previous methods.
  • results: Extensive experiments show that Text2Loc improves the localization accuracy by up to 2 times over the state-of-the-art on the KITTI360Pose dataset.
    Abstract We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset. We will make the code publicly available.
    摘要 我们解决了基于一些自然语言描述的3D点云地标问题,并引入了一种新的神经网络Text2Loc,它可以全面理解点云和文本之间的Semantic关系。Text2Loc采用一种层次转换器加权max pooling(HTM)来捕捉文本提示之间的关系动力,并在文本地图匹配中保持正负对比的平衡。此外,我们还提出了一种新的匹配自由精度调整方法,以进一步精细化位置预测结果,完全消除了复杂的文本实例匹配的需求,轻量级、快速、高精度。广泛的实验表明,Text2Loc可以提高地标精度达2倍于状态对的KITTI360Pose数据集。我们将代码公开。
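A short sketch of the global-place-recognition objective described above: a symmetric InfoNCE loss between text-hint embeddings and submap embeddings, so matching pairs are pulled together and mismatched pairs pushed apart. The encoders (the hierarchical text transformer and the point-cloud submap encoder) are assumed to exist and are not shown.

```python
# Symmetric text-submap contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def text_submap_contrastive(text_emb, submap_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    submap_emb = F.normalize(submap_emb, dim=-1)
    logits = text_emb @ submap_emb.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0))             # i-th text matches i-th submap
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = text_submap_contrastive(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```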

FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding in Open World

  • paper_url: http://arxiv.org/abs/2311.15965
  • repo_url: None
  • paper_authors: Thanh-Dat Truong, Utsav Prabhu, Bhiksha Raj, Jackson Cothren, Khoa Luu
  • for: 本研究旨在解决 continual learning 中的 fairness 问题,以提高 continual semantic segmentation 模型的性能和公平性。
  • methods: 本研究提出了一种 Fairness Learning via Contrastive Attention Approach,包括新的 Fairness Contrastive Clustering loss 和 attention-based visual grammar 方法。
  • results: 通过实验,我们的提posed approach 在不同的 continual learning 设定下 achieve State-of-the-Art (SOTA) 性能,并且提高了 continual semantic segmentation 模型的公平性。
    Abstract Continual Learning in semantic scene segmentation aims to continually learn new unseen classes in dynamic environments while maintaining previously learned knowledge. Prior studies focused on modeling the catastrophic forgetting and background shift challenges in continual learning. However, fairness, another major challenge that causes unfair predictions leading to low performance among major and minor classes, still needs to be well addressed. In addition, prior methods have yet to model the unknown classes well, thus resulting in producing non-discriminative features among unknown classes. This paper presents a novel Fairness Learning via Contrastive Attention Approach to continual learning in semantic scene understanding. In particular, we first introduce a new Fairness Contrastive Clustering loss to address the problems of catastrophic forgetting and fairness. Then, we propose an attention-based visual grammar approach to effectively model the background shift problem and unknown classes, producing better feature representations for different unknown classes. Through our experiments, our proposed approach achieves State-of-the-Art (SOTA) performance on different continual learning settings of three standard benchmarks, i.e., ADE20K, Cityscapes, and Pascal VOC. It promotes the fairness of the continual semantic segmentation model.
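A rough sketch of one ingredient only: a prototype-based contrastive clustering term whose per-class contribution is re-weighted by inverse class frequency, so rare (minor) classes are not dominated by frequent ones. This illustrates the fairness idea, not the paper's exact Fairness Contrastive Clustering loss.

```python
# Inverse-frequency weighted contrastive clustering against class prototypes.
import torch
import torch.nn.functional as F

def fair_contrastive_clustering(features, labels, prototypes, temperature=0.1):
    """features: (N, D) region embeddings, labels: (N,), prototypes: (C, D)."""
    features = F.normalize(features, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = features @ prototypes.t() / temperature          # (N, C)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    counts = torch.bincount(labels, minlength=prototypes.size(0)).float()
    weights = 1.0 / counts.clamp(min=1.0)                     # inverse-frequency weights
    return (per_sample * weights[labels]).sum() / weights[labels].sum()

feats = torch.randn(100, 32)
labels = torch.randint(0, 5, (100,))
protos = torch.randn(5, 32, requires_grad=True)
print(fair_contrastive_clustering(feats, labels, protos).item())
```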

From Pixels to Titles: Video Game Identification by Screenshots using Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2311.15963
  • repo_url: https://github.com/fbreve/videogame
  • paper_authors: Fabricio Breve
  • for: investigate video game identification through single screenshots
  • methods: utilize five convolutional neural network (CNN) architectures and ImageNet pre-trained weights
  • results: achieve high accuracy in identifying game titles from screenshots, with EfficientNetB3 reaching a peak accuracy of 76.36% and demonstrating reduced convergence epochs.
    Abstract This paper investigates video game identification through single screenshots, utilizing five convolutional neural network (CNN) architectures (MobileNet, DenseNet, EfficientNetB0, EfficientNetB2, and EfficientNetB3) across 22 home console systems, spanning from Atari 2600 to PlayStation 5. Confirming the hypothesis, CNNs autonomously extract image features, enabling the identification of game titles from screenshots without additional features. Using ImageNet pre-trained weights, EfficientNetB3 achieves the highest average accuracy (74.51%), while DenseNet169 excels in 14 of the 22 systems. Employing alternative initial weights from another screenshots dataset boosts accuracy for EfficientNetB2 and EfficientNetB3, with the latter reaching a peak accuracy of 76.36% and demonstrating reduced convergence epochs from 23.7 to 20.5 on average. Overall, the combination of optimal architecture and weights attains 77.67% accuracy, primarily led by EfficientNetB3 in 19 systems. These findings underscore the efficacy of CNNs in video game identification through screenshots.
    摘要 Using pre-trained weights from ImageNet, EfficientNetB3 achieves the highest average accuracy (74.51%), while DenseNet169 performs well across 14 of the 22 systems. The study also shows that employing alternative initial weights from another screenshots dataset can improve accuracy for EfficientNetB2 and EfficientNetB3, with the latter reaching a peak accuracy of 76.36% and demonstrating reduced convergence epochs. Overall, the combination of optimal architecture and weights achieves an accuracy of 77.67%, primarily led by EfficientNetB3 in 19 systems. These findings highlight the effectiveness of CNNs in video game identification through screenshots.
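A hedged sketch of the transfer-learning setup (the paper's own code is in the linked repository): an ImageNet-pretrained EfficientNet-B3 with its classifier head replaced for the 22 console classes, trained with cross-entropy. Assumes a recent torchvision (>= 0.13) and that the pretrained weights can be downloaded.

```python
# Fine-tune an ImageNet-pretrained EfficientNet-B3 for 22-way screenshot classification.
import torch
import torch.nn as nn
from torchvision import models

model = models.efficientnet_b3(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 22)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# one illustrative training step on dummy screenshots (B, 3, 300, 300)
images, labels = torch.randn(4, 3, 300, 300), torch.randint(0, 22, (4,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```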

Deceptive-Human: Prompt-to-NeRF 3D Human Generation with 3D-Consistent Synthetic Images

  • paper_url: http://arxiv.org/abs/2311.16499
  • repo_url: https://github.com/danielshkao/deceptivehuman
  • paper_authors: Shiu-hong Kao, Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
  • for: 这篇论文旨在生成高质量可控的3D人NeRF模型,使用现有的控制扩散模型(如ControlNet)进行生成。
  • methods: 该方法使用进步加工技术来提高重建质量,通过使用高质量的人像生成器(如ControlNet)来生成视觉一致的损失。
  • results: 该方法可以生成高光照准确的多视图一致的人NeRF模型,并且可以轻松扩展到多modal输入,如文本提示和其他数据。
    Abstract This paper presents Deceptive-Human, a novel Prompt-to-NeRF framework capitalizing state-of-the-art control diffusion models (e.g., ControlNet) to generate a high-quality controllable 3D human NeRF. Different from direct 3D generative approaches, e.g., DreamFusion and DreamHuman, Deceptive-Human employs a progressive refinement technique to elevate the reconstruction quality. This is achieved by utilizing high-quality synthetic human images generated through the ControlNet with view-consistent loss. Our method is versatile and readily extensible, accommodating multimodal inputs, including a text prompt and additional data such as 3D mesh, poses, and seed images. The resulting 3D human NeRF model empowers the synthesis of highly photorealistic novel views from 360-degree perspectives. The key to our Deceptive-Human for hallucinating multi-view consistent synthetic human images lies in our progressive finetuning strategy. This strategy involves iteratively enhancing views using the provided multimodal inputs at each intermediate step to improve the human NeRF model. Within this iterative refinement process, view-dependent appearances are systematically eliminated to prevent interference with the underlying density estimation. Extensive qualitative and quantitative experimental comparison shows that our deceptive human models achieve state-of-the-art application quality.
    摘要 The key to Deceptive-Human's ability to hallucinate multi-view consistent synthetic human images lies in its progressive finetuning strategy. This involves iteratively enhancing views using the provided multimodal inputs at each intermediate step to improve the human NeRF model. Within this iterative refinement process, view-dependent appearances are systematically eliminated to prevent interference with the underlying density estimation.Extensive qualitative and quantitative experimental comparison shows that our deceptive human models achieve state-of-the-art application quality.

Unleashing the Power of Prompt-driven Nucleus Instance Segmentation

  • paper_url: http://arxiv.org/abs/2311.15939
  • repo_url: https://github.com/windygoo/promptnucseg
  • paper_authors: Zhongyi Shui, Yunlong Zhang, Kai Yao, Chenglu Zhu, Yuxuan Sun, Lin Yang
  • for: automatic nuclei instance segmentation in histology images
  • methods: point prompter and a SAM (Segment Anything Model) fine-tuned to output the corresponding mask of the cued nucleus, with negative prompts for overlapping nuclei
  • results: sets a new state-of-the-art performance on three challenging benchmarks
    Abstract Nuclear instance segmentation in histology images is crucial for a broad spectrum of clinical applications. Current prevailing nuclear instance segmentation algorithms rely on regression of nuclei contours, distance maps, watershed markers or a proxy nuclear representation of star-convex polygons. Consequently, these methods necessitate sophisticated post-processing operations to distinguish nuclei instances, which are commonly acknowledged to be error-prone and parameter-sensitive. Recently, the segment anything model (SAM) has earned attracted huge attention within the domain of medical image segmentation due to its impressive generalization ability and promptable property. Nevertheless, its potential on nuclear instance segmentation remains largely underexplored. In this paper, we present a novel prompt-driven framework that consists of a point prompter and a SAM for automatic nuclei instance segmentation. Specifically, the prompter learns to generate a unique point prompt for each nucleus while the SAM is fine tuned to output the corresponding mask of the cued nucleus. Furthermore, we propose to add adjacent nuclei as negative prompts to promote the model's ability to recognize overlapping nuclei. Without bells and whistles, our proposed method sets a new state-of-the-art performance on three challenging benchmarks. Our code is available at \textcolor{magenta}{\url{https://github.com/windygoo/PromptNucSeg} .
    摘要 组织学(histology)图像中的细胞核实例分割对众多临床应用至关重要。当前主流的细胞核实例分割算法依赖于对细胞核轮廓、距离图、分水岭标记或星凸多边形等代理表示的回归,因此需要复杂的后处理来区分细胞核实例,而这些后处理通常被认为易出错且对参数敏感。最近,Segment Anything Model(SAM)凭借出色的泛化能力和可提示特性,在医学图像分割领域受到广泛关注,但其在细胞核实例分割上的潜力仍有待充分探索。本文提出了一种由点提示器(point prompter)和 SAM 组成的提示驱动框架:提示器学习为每个细胞核生成唯一的点提示,SAM 经微调后输出被提示细胞核对应的分割掩码;我们还将相邻细胞核作为负提示加入,以提升模型识别重叠细胞核的能力。在不依赖额外技巧的情况下,该方法在三个具有挑战性的基准上取得了新的最先进性能。代码见 https://github.com/windygoo/PromptNucSeg。
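A sketch of the prompting interface the paper builds on (segment-anything's `SamPredictor`), with one positive point on the cued nucleus and negative points on its neighbours. The checkpoint path, the stand-in image, and the point coordinates are assumptions; the learned prompter and fine-tuned decoder from the paper are not included.

```python
# Prompt SAM with a positive point plus negative neighbour points.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed local checkpoint
predictor = SamPredictor(sam)

image = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)          # stand-in H&E tile
predictor.set_image(image)

point_coords = np.array([[128, 128], [150, 140], [110, 120]])         # cued + neighbouring nuclei
point_labels = np.array([1, 0, 0])                                    # 1 = positive, 0 = negative

masks, scores, _ = predictor.predict(point_coords=point_coords,
                                     point_labels=point_labels,
                                     multimask_output=False)
print(masks.shape, scores)
```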

Optimal Transport Aggregation for Visual Place Recognition

  • paper_url: http://arxiv.org/abs/2311.15937
  • repo_url: https://github.com/serizba/salad
  • paper_authors: Sergio Izquierdo, Javier Civera
  • for: 这篇论文旨在提出一种基于视觉特征的地点识别方法(Visual Place Recognition,VPR),用于匹配查询图像与数据库中的图像,仅仅通过视觉特征进行匹配。
  • methods: 该方法使用深度 neural network 提取特征,并将其组合成全局描述符,以便进行匹配。此外,该方法还引入了一种名为 ‘垃圾桶’(dustbin)的特殊 cluster,用于抛弃不具有信息价值的特征,从而提高总体描述符质量。
  • results: 该方法在公共 VPR 数据集上比单stage基线方法表现出色,并且还比两stage方法(包括重新排序)表现更好,即使其具有较低的训练时间。代码和模型可以在 GitHub 上获取。
    Abstract The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
    摘要 视觉地点识别(VPR)任务的目标是仅依靠视觉线索,将查询图像与来自不同地点的大规模图像数据库进行匹配。当前最先进的流程侧重于聚合由深度骨干网络提取的特征,为每张图像生成全局描述符。在此背景下,我们提出 SALAD(Sinkhorn Algorithm for Locally Aggregated Descriptors),它将 NetVLAD 中本地特征到聚类的软分配重新表述为一个最优传输问题。SALAD 同时考虑特征到聚类与聚类到特征的关系,并引入一个"垃圾桶"(dustbin)聚类,用于选择性地丢弃无信息量的特征,从而提升整体描述符质量。此外,我们采用并微调 DINOv2 作为骨干网络,为本地特征提供更强的描述能力,并显著缩短所需训练时间。最终,我们的单阶段方法不仅在公开 VPR 数据集上超越单阶段基线,还超过了附加重排序、成本显著更高的两阶段方法。代码和模型见 https://github.com/serizba/salad。
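A minimal Sinkhorn sketch for feature-to-cluster assignment with an extra "dustbin" column that can absorb uninformative features: plain log-domain entropic optimal transport on a toy cost matrix. The fixed dustbin cost and the random features are assumptions; the real SALAD module learns the costs and aggregates DINOv2 features.

```python
# Log-domain Sinkhorn with uniform marginals and a dustbin column.
import math
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """cost: (N, K+1) feature-to-cluster cost, last column = dustbin."""
    N, M = cost.shape
    log_p = -cost / eps
    log_r = -math.log(N)   # each feature carries mass 1/N
    log_c = -math.log(M)   # each cluster (and the dustbin) absorbs mass 1/M
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) + log_r
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) + log_c
    return log_p.exp()

features = torch.nn.functional.normalize(torch.randn(100, 64), dim=-1)
clusters = torch.nn.functional.normalize(torch.randn(8, 64), dim=-1)
cost = 1.0 - features @ clusters.t()              # cosine distance to each cluster
dustbin = torch.full((100, 1), 0.5)               # fixed dustbin cost (an assumption)
plan = sinkhorn(torch.cat([cost, dustbin], dim=1))
descriptor = plan[:, :-1].t() @ features          # (8, 64) locally aggregated descriptor
print(descriptor.shape)
```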

ADM-Loc: Actionness Distribution Modeling for Point-supervised Temporal Action Localization

  • paper_url: http://arxiv.org/abs/2311.15916
  • repo_url: None
  • paper_authors: Elahe Vahdani, Yingli Tian
  • for: 这个论文目标是提高点监督的 temporal action detection 性能,只有一帧action实例被标注在训练集中。
  • methods: 这个论文提出了一种名为 ADM-Loc 的新框架,它是基于 actionness distribution modeling 的点监督action localization。ADM-Loc 使用 Gaussian 和 uniform 分布对 action classification 信号进行适应,以提高生成的 action proposal 和实际的 action instance 的对应性。
  • results: ADM-Loc 在 THUMOS14 和 ActivityNet-v1.2 数据集上达到了点监督方法的新高性能。
    Abstract This paper addresses the challenge of point-supervised temporal action detection, in which only one frame per action instance is annotated in the training set. Self-training aims to provide supplementary supervision for the training process by generating pseudo-labels (action proposals) from a base model. However, most current methods generate action proposals by applying manually designed thresholds to action classification probabilities and treating adjacent snippets as independent entities. As a result, these methods struggle to generate complete action proposals, exhibit sensitivity to fluctuations in action classification scores, and generate redundant and overlapping action proposals. This paper proposes a novel framework termed ADM-Loc, which stands for Actionness Distribution Modeling for point-supervised action Localization. ADM-Loc generates action proposals by fitting a composite distribution, comprising both Gaussian and uniform distributions, to the action classification signals. This fitting process is tailored to each action class present in the video and is applied separately for each action instance, ensuring the distinctiveness of their distributions. ADM-Loc significantly enhances the alignment between the generated action proposals and ground-truth action instances and offers high-quality pseudo-labels for self-training. Moreover, to model action boundary snippets, it enforces consistency in action classification scores during training by employing Gaussian kernels, supervised with the proposed loss functions. ADM-Loc outperforms the state-of-the-art point-supervised methods on THUMOS14 and ActivityNet-v1.2 datasets.
    摘要 Current methods for generating action proposals rely on manually designed thresholds for action classification probabilities and treat adjacent snippets as independent entities. However, these methods often produce incomplete or overlapping action proposals and are sensitive to fluctuations in action classification scores. ADM-Loc addresses these issues by fitting a composite distribution, combining Gaussian and uniform distributions, to the action classification signals. This fitting process is tailored to each action class present in the video and is applied separately for each action instance, ensuring the distinctiveness of their distributions. To model action boundary snippets, ADM-Loc enforces consistency in action classification scores during training using Gaussian kernels, supervised with the proposed loss functions. This leads to high-quality pseudo-labels for self-training and significantly enhances the alignment between the generated action proposals and ground-truth action instances. The proposed method outperforms state-of-the-art point-supervised methods on THUMOS14 and ActivityNet-v1.2 datasets.
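A toy illustration of the modeling idea: fit a Gaussian to the temporal distribution of actionness scores around an annotated point and read a proposal interval off it, rather than thresholding raw scores. The score-weighted moment fit and the width factor are simplifications; the paper's composite Gaussian-plus-uniform fit and boundary losses are not reproduced.

```python
# Derive an action proposal from actionness scores via a weighted Gaussian fit.
import numpy as np

def gaussian_proposal(scores: np.ndarray, width: float = 1.0):
    """scores: (T,) per-snippet actionness for one action instance."""
    t = np.arange(len(scores))
    w = np.clip(scores, 1e-6, None)
    mu = np.sum(w * t) / np.sum(w)                       # score-weighted mean
    sigma = np.sqrt(np.sum(w * (t - mu) ** 2) / np.sum(w))
    return max(0.0, mu - width * sigma), min(len(scores) - 1.0, mu + width * sigma)

scores = np.exp(-0.5 * ((np.arange(100) - 40) / 6.0) ** 2) + 0.05 * np.random.rand(100)
print(gaussian_proposal(scores))                         # an interval centred near t = 40
```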

Computer Vision for Carriers: PATRIOT

  • paper_url: http://arxiv.org/abs/2311.15914
  • repo_url: None
  • paper_authors: Ari Goodman, Gurpreet Singh, James Hing, Ryan O’Shea
  • for: 该研究旨在改进航空母舰上的甲板跟踪流程,利用自动化技术提高架次生成率(sortie generation rates)。
  • methods: 该研究使用被动感知技术与计算机视觉算法实现甲板跟踪,无需安装基于硬件的全球定位系统(GPS)定位设备。
  • results: 该研究开发了名为 PATRIOT(Panoramic Asset Tracking of Real-Time Information for the Ouija Tabletop)的原型系统,能够快速、准确地跟踪飞机、人员和保障设备的位置。PATRIOT 有望减轻人员工作负担、提高效率与安全性,并收集数据以改进后勤保障。
    Abstract Deck tracking performed on carriers currently involves a team of sailors manually identifying aircraft and updating a digital user interface called the Ouija Board. Improvements to the deck tracking process would result in increased Sortie Generation Rates, and therefore applying automation is seen as a critical method to improve deck tracking. However, the requirements on a carrier ship do not allow for the installation of hardware-based location sensing technologies like Global Positioning System (GPS) sensors. PATRIOT (Panoramic Asset Tracking of Real-Time Information for the Ouija Tabletop) is a research effort and proposed solution to performing deck tracking with passive sensing and without the need for GPS sensors. PATRIOT is a prototype system which takes existing camera feeds, calculates aircraft poses, and updates a virtual Ouija board interface with the current status of the assets. PATRIOT would allow for faster, more accurate, and less laborious asset tracking for aircraft, people, and support equipment. PATRIOT is anticipated to benefit the warfighter by reducing cognitive workload, reducing manning requirements, collecting data to improve logistics, and enabling an automation gateway for future efforts to improve efficiency and safety. The authors have developed and tested algorithms to perform pose estimations of assets in real-time including OpenPifPaf, High-Resolution Network (HRNet), HigherHRNet (HHRNet), Faster R-CNN, and in-house developed encoder-decoder network. The software was tested with synthetic and real-world data and was able to accurately extract the pose of assets. Fusion, tracking, and real-world generality are planned to be improved to ensure a successful transition to the fleet.
    摘要 目前,航空母舰上的甲板跟踪由一组舰员手动识别飞机,并更新名为 Ouija Board 的数字用户界面。改进甲板跟踪流程可提高架次生成率,因此自动化被视为改进甲板跟踪的关键手段。然而,舰上条件不允许安装基于硬件的位置感知技术(如 GPS 传感器)。PATRIOT(Panoramic Asset Tracking of Real-Time Information for the Ouija Tabletop)是一项研究工作及其解决方案,利用被动感知实现甲板跟踪,无需 GPS 传感器。PATRIOT 是一个原型系统,它利用现有摄像头画面计算飞机姿态,并将资产的当前状态更新到虚拟 Ouija Board 界面,从而为飞机、人员和保障设备提供更快、更准确、更省力的资产跟踪。PATRIOT 预计将通过减轻认知负担、降低人力需求、收集数据以改进后勤,并为未来提升效率与安全性的自动化工作提供接口,从而使作战人员受益。作者开发并测试了多种实时姿态估计算法,包括 OpenPifPaf、HRNet、HigherHRNet(HHRNet)、Faster R-CNN 以及自研的编码器-解码器网络。该软件在合成数据和真实数据上进行了测试,能够准确提取资产姿态。后续计划改进融合、跟踪与真实场景泛化能力,以确保顺利部署到舰队。

LIFT OFF: LoRaWAN Installation and Fiducial Tracking Operations for the Flightline of the Future

  • paper_url: http://arxiv.org/abs/2311.15912
  • repo_url: None
  • paper_authors: Ari Goodman, Ryan O’Shea
  • for: 该研究旨在为资产位置提供实时态势感知,以高效完成任务并满足相关需求。
  • methods: 该研究结合机器视觉组件与地理定位传感器组件,并搭建 LoRaWAN 广域网络来传输数据,同时开发了用户界面以显示数据。
  • results: 该研究成功提供了一张实时更新的地图,显示所有被跟踪资产的位置:人员和保障设备通过 GPS 传感器定位,飞机则通过视觉基准标志(fiducial)定位。
    Abstract Real-time situational awareness for the location of assets is critical to ensure missions are completed efficiently and requirements are satisfied. In many commercial settings, the application of global positioning system (GPS) sensors is appropriate to achieve timely knowledge of the position of people and equipment. However, GPS sensors are not appropriate for all situations due to flight clearance and operations security concerns. LIFT OFF: LoRaWAN Installation and Fiducial Tracking Operations for the Flightline of the Future proposes a hybrid framework solution to achieve real-time situational awareness for people, support equipment, and aircraft positions regardless of the environment. This framework included a machine-vision component, which involved setting up cameras to detect AprilTag decals that were installed on the sides of aircraft. The framework included a geolocation sensor component, which involved installing GPS sensors on support equipment and helmets. The framework also included creating a long-range wide area network (LoRaWAN) to transfer data and developing a user interface to display the data. The framework was tested at Naval Air Station Oceana Flightline, the United States Naval Test Pilot School, and at Naval Air Warfare Center Aircraft Division Lakehurst. LIFT OFF successfully provided a real-time updating map of all tracked assets using GPS sensors for people and support equipment and with visual fiducials for aircraft. The trajectories of the assets were recorded for logistical analysis and playback. Future follow-on work is anticipated to apply the technology to other environments including carriers and amphibious assault ships in addition to the flightline.
    摘要 实时情况意识对资产位置是关键,以确保任务效率高并达到需求。在商业场景中,使用全球定位系统(GPS)传感器是合适的方式来实现实时知ledge of人员和设备位置。然而,GPS传感器在某些情况下不是合适的,例如飞行批准和操作安全问题。LIFT OFF:LoRaWAN安装和 fiducial Tracking 操作 для未来的飞行线 proposes 一种混合框架解决方案,以实现实时情况意识 для人员、支持设备和飞机位置,无论环境。这个框架包括机器视觉组件,即在飞机侧安装 AprilTag 徽标以便检测。这个框架还包括地理位置传感器组件,即在支持设备和头盔上安装 GPS 传感器。此外,还创建了一个覆盖广泛区域网络(LoRaWAN)来传输数据,并开发了一个用户界面来显示数据。这个框架在美国 naval Air Station Oceana 飞行线、美国 naval Test Pilot School 和 naval Air Warfare Center Aircraft Division Lakehurst 进行了测试,LIFT OFF 成功地提供了实时更新的地图,显示所跟踪的所有资产位置,包括人员和支持设备的 GPS 位置,以及飞机的视觉 fiducials。资产的轨迹被记录以供日后分析和播放。未来的跟进工作预计将应用该技术到其他环境,包括航空母舰和坦克登陆舰。

Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.15908
  • repo_url: https://github.com/claudiom4sir/stablevsr
  • paper_authors: Claudio Rota, Marco Buzzelli, Joost van de Weijer
  • for: 提高视频超分辨率(VSR)的质量,使用扩散模型(DM)。
  • methods: 使用Temporal Conditioning Module(TCM),Temporal Texture Guidance,Frame-wise Bidirectional Sampling策略。
  • results: 提高视频超分辨率的 perceived 质量,比现有的VSR方法更高。
    Abstract In this paper, we address the problem of video super-resolution (VSR) using Diffusion Models (DM), and present StableVSR. Our method significantly enhances the perceptual quality of upscaled videos by synthesizing realistic and temporally-consistent details. We turn a pre-trained DM for single image super-resolution into a VSR method by introducing the Temporal Conditioning Module (TCM). TCM uses Temporal Texture Guidance, which provides spatially-aligned and detail-rich texture information synthesized in adjacent frames. This guides the generative process of the current frame toward high-quality and temporally-consistent results. We introduce a Frame-wise Bidirectional Sampling strategy to encourage the use of information from past to future and vice-versa. This strategy improves the perceptual quality of the results and the temporal consistency across frames. We demonstrate the effectiveness of StableVSR in enhancing the perceptual quality of upscaled videos compared to existing state-of-the-art methods for VSR. The code is available at https://github.com/claudiom4sir/StableVSR.
    摘要 在这篇论文中,我们讨论了视频超分辨(VSR)使用扩散模型(DM),并提出了StableVSR。我们的方法可以显著提高 upscaled 视频的感知质量,通过生成真实和时间相关的细节。我们将先前训练的 DM 转换为 VSR 方法,通过引入 Temporal Conditioning Module (TCM)。TCM 使用 Temporal Texture Guidance,提供邻帧匹配的空间同步和细节富的Texture信息,导引当前帧生成过程向高质量和时间相关的结果。我们提出了 Frame-wise Bidirectional Sampling 策略,以促进以前和后的信息使用。这种策略提高了结果的感知质量和时间相关性。我们证明 StableVSR 可以在提高 upscaled 视频的感知质量方面超越现有的 VSR 方法。代码可以在 https://github.com/claudiom4sir/StableVSR 上下载。

MetaDefa: Meta-learning based on Domain Enhancement and Feature Alignment for Single Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.15906
  • repo_url: None
  • paper_authors: Can Sun, Hao Zheng, Zhigang Hu, Liu Yang, Meiguang Zheng, Bo Xu
  • for: 这篇论文的目的是提出一种基于元学习的单域泛化(SDG)技术,以解决源域与增广域之间数据分布不匹配、以及域不变特征与域相关特征难以分离的问题,从而提升模型的泛化性能。
  • methods: 该方法使用背景替换和视觉损坏技术生成多样且有效的增广域;随后设计了基于类激活图与类无关激活图的多通道特征对齐模块,以有效提取充分的可迁移知识。在该模块中,通过关注源域与增广域特征空间中相似的目标区域、并抑制不相似目标区域的特征表示,充分挖掘域不变特征。
  • results: 大量实验表明,该方法在两个公开数据集上对未知的多个目标域具有显著的泛化性能优势。
    Abstract The single domain generalization(SDG) based on meta-learning has emerged as an effective technique for solving the domain-shift problem. However, the inadequate match of data distribution between source and augmented domains and difficult separation of domain-invariant features from domain-related features make SDG model hard to achieve great generalization. Therefore, a novel meta-learning method based on domain enhancement and feature alignment (MetaDefa) is proposed to improve the model generalization performance. First, the background substitution and visual corruptions techniques are used to generate diverse and effective augmented domains. Then, the multi-channel feature alignment module based on class activation maps and class agnostic activation maps is designed to effectively extract adequate transferability knowledge. In this module, domain-invariant features can be fully explored by focusing on similar target regions between source and augmented domains feature space and suppressing the feature representation of non-similar target regions. Extensive experiments on two publicly available datasets show that MetaDefa has significant generalization performance advantages in unknown multiple target domains.
    摘要 Single domain generalization(SDG)基于meta-学习技术已经成为解决域分布问题的有效方法。然而,因为数据分布不足和域相关特征分离困难,SDG模型难以实现出色的泛化性。为此,一种基于域强化和特征对齐(MetaDefa)的新的meta-学习方法被提出,以提高模型泛化性表现。首先,使用背景替换和视觉损害技术生成了多元和有效的扩充域。然后,基于类活动图和类不知情活动图的多通道特征对齐模块被设计,以有效地提取适用知识。在这个模块中,通过专注于源域和扩充域特征空间的相似目标区域,全面探索域不关参数。此外,通过抑制非相似目标区域的特征表示,有效地避免了域相关特征的混淆。广泛的实验表明,MetaDefa在未知多个目标域中具有显著的泛化性表现优势。

Stability-Informed Initialization of Neural Ordinary Differential Equations

  • paper_url: http://arxiv.org/abs/2311.15890
  • repo_url: https://github.com/westny/neural-stability
  • paper_authors: Theodor Westny, Arman Mohammadi, Daniel Jung, Erik Frisk
  • for: 本文研究Neural Ordinary Differential Equations(neural ODEs)的训练方法,具体来说是研究numerical integration techniques、stability regions、step size和 initialization techniques之间的关系。
  • methods: 本文使用numerical integration techniques来训练neural ODEs,并研究solver的稳定区域对训练和预测性能的影响。
  • results: 本文提出了一种基于稳定性的初始化 Parameters技术,并在多个学习benchmark和实际应用中证明了其效果。
    Abstract This paper addresses the training of Neural Ordinary Differential Equations (neural ODEs), and in particular explores the interplay between numerical integration techniques, stability regions, step size, and initialization techniques. It is shown how the choice of integration technique implicitly regularizes the learned model, and how the solver's corresponding stability region affects training and prediction performance. From this analysis, a stability-informed parameter initialization technique is introduced. The effectiveness of the initialization method is displayed across several learning benchmarks and industrial applications.
    摘要 本文研究神经常微分方程(neural ODEs)的训练,特别是数值积分方法、稳定区域、步长与初始化技术之间的相互作用。文章表明,积分方法的选择会隐式地对学习到的模型起到正则化作用,而所选求解器的稳定区域会影响训练和预测性能。基于这一分析,文章提出了一种基于稳定性的参数初始化技术,并在多个学习基准和工业应用中展示了其有效性。
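A sketch in the spirit of the paper, not its exact procedure: linearize the dynamics at the origin, estimate the Jacobian's spectral radius, and rescale the output layer so that step size times spectral radius stays below a target value, a simple proxy for keeping the linearized step within the explicit-Euler solver's stability region. The step size, target bound, and toy network are assumptions.

```python
# Stability-informed rescaling of a neural ODE's dynamics network.
import torch
import torch.nn as nn

h = 0.1                                    # fixed explicit-Euler step size (assumption)
target = 0.8                               # desired bound on h * spectral_radius(J)

f = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 8))   # ODE right-hand side

def spectral_radius_at_origin(func, dim=8):
    J = torch.autograd.functional.jacobian(func, torch.zeros(dim))
    return torch.linalg.eigvals(J).abs().max().item()

rho = spectral_radius_at_origin(f)
scale = min(1.0, target / max(h * rho, 1e-8))
with torch.no_grad():
    f[2].weight.mul_(scale)                # shrink output layer if the linearization is too stiff

print(h * spectral_radius_at_origin(f))    # now <= target (up to numerics)

# one explicit Euler step of the neural ODE with the rescaled dynamics
x = torch.randn(8)
x_next = x + h * f(x)
```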

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

  • paper_url: http://arxiv.org/abs/2311.15879
  • repo_url: None
  • paper_authors: Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama
  • for: 这个论文旨在提高大型语言模型(LLM)基于图像描述的表达能力,使其能够描述没有在训练数据中出现的 объек。
  • methods: 该方法使用外部视觉名称记忆(EVCap)来提高 LLM 的对象知识,并通过对象的视觉和名称来建立可变的对象知识库。
  • results: 该方法在多个benchmark上表现出色,特别是在针对不同的数据集和 Commonsense-violating 数据进行测试时,与其他相同大小的模型相比,它表现出了更高的性能。
    Abstract Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object knowledge for open-world comprehension. Instead of relying on large amounts of data and scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual--name memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can be adapted to out-domain data without additional fine-tuning or retraining. Our comprehensive experiments conducted on various benchmarks and synthetic commonsense-violating data demonstrate that EVCap, comprising solely 3.97M trainable parameters, exhibits superior performance compared to other methods of equivalent model size scale. Notably, it achieves competitive performance against specialist SOTAs with an enormous number of parameters. Our code is available at https://jiaxuan-li.github.io/EVCap.
    摘要 基于大型语言模型(LLM)的图像描述方法能够描述训练数据中未曾出现的对象;然而,新对象层出不穷,需要持续更新对象知识以实现开放世界理解。与依赖海量数据和扩大网络参数不同,我们提出了一种高效的检索增强图像描述方法,利用外部视觉-名称记忆(EVCap)检索到的对象名称来提示 LLM。我们基于对象的视觉与名称构建了可持续更新的对象知识记忆,使得(i)记忆更新成本极低,(ii)可借助轻量、易训练的模型将检索到的对象名称便捷地注入 LLM。我们的模型仅在 COCO 数据集上训练,无需额外微调或重新训练即可适应域外数据。在多个基准和违背常识的合成数据上的大量实验表明,仅含 3.97M 可训练参数的 EVCap 优于同等规模的其他方法,并可与参数量巨大的专用最先进模型相媲美。代码见 https://jiaxuan-li.github.io/EVCap。
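A compact sketch of the retrieval step only: an image feature is matched against an external memory of object visual embeddings, and the names of the top-k hits are assembled into a prompt. The memory contents, encoders, and prompt wording are stand-ins; prompting the frozen LLM itself is omitted.

```python
# Top-k retrieval of object names from an external visual-name memory.
import torch
import torch.nn.functional as F

memory_feats = F.normalize(torch.randn(1000, 512), dim=-1)   # external visual memory
memory_names = [f"object_{i}" for i in range(1000)]          # paired object names (placeholders)

def retrieve_object_names(image_feat, k=3):
    image_feat = F.normalize(image_feat, dim=-1)
    sims = memory_feats @ image_feat                         # cosine similarity to memory entries
    topk = sims.topk(k).indices.tolist()
    return [memory_names[i] for i in topk]

query = torch.randn(512)                                     # stand-in image feature
names = retrieve_object_names(query)
prompt = "Objects that might appear in the image: " + ", ".join(names)
print(prompt)
```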

InterControl: Generate Human Motion Interactions by Controlling Every Joint

  • paper_url: http://arxiv.org/abs/2311.15864
  • repo_url: https://github.com/zhenzhiwang/intercontrol
  • paper_authors: Zhenzhi Wang, Jingbo Wang, Dahua Lin, Bo Dai
  • for: 模拟人类交互行为,包括任意数量的人类之间的交互。
  • methods: 利用 diffusion models 和对应的控制信号,以及 Large Language Model (LLM) Planner 将交互描述翻译成 contacts plans,然后使用 spatially controllable motion generation methods 生成人类交互。
  • results: 提出了一种名为 InterControl 的新方法,可以在不同人类之间实现 flexible spatial control,并且可以生成准确、合理的人类交互。经过了大量的实验,包括 HumanML3D 和 KIT-ML 数据集,得到了效果的证明。
    Abstract Text-conditioned human motion generation model has achieved great progress by introducing diffusion models and corresponding control signals. However, the interaction between humans are still under explored. To model interactions of arbitrary number of humans, we define interactions as human joint pairs that are either in contact or separated, and leverage {\em Large Language Model (LLM) Planner} to translate interaction descriptions into contact plans. Based on the contact plans, interaction generation could be achieved by spatially controllable motion generation methods by taking joint contacts as spatial conditions. We present a novel approach named InterControl for flexible spatial control of every joint in every person at any time by leveraging motion diffusion model only trained on single-person data. We incorporate a motion controlnet to generate coherent and realistic motions given sparse spatial control signals and a loss guidance module to precisely align any joint to the desired position in a classifier guidance manner via Inverse Kinematics (IK). Extensive experiments on HumanML3D and KIT-ML dataset demonstrate its effectiveness in versatile joint control. We also collect data of joint contact pairs by LLMs to show InterControl's ability in human interaction generation.
    摘要 文本受控人体动作生成模型已经取得了很大的进步,通过引入扩散模型和相应的控制信号。然而,人类之间的交互仍然尚未得到了充分探索。为模型人类之间的交互,我们定义交互为人体 JOINT PAIRS 的连接或分离,并利用 Large Language Model (LLM) Planner 将交互描述翻译成接触计划。基于接触计划,交互生成可以通过基于 JOINT CONTACTS 的 spatially可控动作生成方法实现。我们提出一种新的方法 named InterControl,可以在任意时间和任意 JOINT 上实现 flexible spatial control。我们利用动作扩散模型,并在唯一人体数据上进行训练。我们采用动作控制网络来生成具有较好的准确性和可控性的动作,并通过类ifier guidance manner 的损失引导模块来精准地将任意 JOINT 的位置与所需的位置进行对齐。我们进行了广泛的实验,证明 InterControl 在多个 JOINT 上的可控性和人类交互生成能力的效果。我们还收集了由 LLMs 生成的 JOINT CONTACT PAIRS 数据,以展示 InterControl 在人类交互生成方面的能力。

JSSL: Joint Supervised and Self-supervised Learning for MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2311.15856
  • repo_url: None
  • paper_authors: George Yiasemis, Nikita Moriakov, Clara I. Sánchez, Jan-Jakob Sonke, Jonas Teuwen
  • For: 这篇论文的目的是提高磁共振成像(MRI)的重建质量,并应对临床场景(如腹部、心脏和前列腺成像)中因运动导致 k 空间数据欠采样、无法获得完全采样数据的情况。
  • Methods: 论文使用深度学习网络重建 MRI 影像,并提出一种新的训练方法,即同时以自监督方式(利用目标数据集的欠采样数据)和监督方式(利用具有完全采样 k 空间的代理数据集)训练模型。
  • Results: 结果显示,该训练方法在缺乏完全采样数据的情况下显著改善了 MRI 重建质量,并给出了选择合适训练方式的实用"经验法则"。
    Abstract Magnetic Resonance Imaging represents an important diagnostic modality; however, its inherently slow acquisition process poses challenges in obtaining fully sampled k-space data under motion in clinical scenarios such as abdominal, cardiac, and prostate imaging. In the absence of fully sampled acquisitions, which can serve as ground truth data, training deep learning algorithms in a supervised manner to predict the underlying ground truth image becomes an impossible task. To address this limitation, self-supervised methods have emerged as a viable alternative, leveraging available subsampled k-space data to train deep learning networks for MRI reconstruction. Nevertheless, these self-supervised approaches often fall short when compared to supervised methodologies. In this paper, we introduce JSSL (Joint Supervised and Self-supervised Learning), a novel training approach for deep learning-based MRI reconstruction algorithms aimed at enhancing reconstruction quality in scenarios where target dataset(s) containing fully sampled k-space measurements are unavailable. Our proposed method operates by simultaneously training a model in a self-supervised learning setting, using subsampled data from the target dataset(s), and in a supervised learning manner, utilizing data from other datasets, referred to as proxy datasets, where fully sampled k-space data is accessible. To demonstrate the efficacy of JSSL, we utilized subsampled prostate parallel MRI measurements as the target dataset, while employing fully sampled brain and knee k-space acquisitions as proxy datasets. Our results showcase a substantial improvement over conventional self-supervised training methods, thereby underscoring the effectiveness of our joint approach. We provide a theoretical motivation for JSSL and establish a practical "rule-of-thumb" for selecting the most appropriate training approach for deep MRI reconstruction.
    摘要 磁共振成像(Magnetic Resonance Imaging,MRI)是一种重要的诊断方法,但它的自然slow acquisition process(磁共振信号读取过程)在临床场景中,如腹腔、心脏和肾脏成像,会带来移动影响,导致完全抽样的k-空间数据不可得。在缺乏完全抽样的情况下,无法使用完全抽样的数据作为基准数据来训练深度学习网络。为解决这些限制,无监督方法(self-supervised methods)得到了广泛应用,利用可用的半样本k-空间数据来训练深度学习网络。然而,这些无监督方法通常与监督方法相比较差。在这篇文章中,我们介绍了JSSL(联合监督和自监督学习),一种新的训练方法,旨在在没有完全抽样数据的情况下提高MRI重建质量。我们的提议的方法是同时在自监督学习环境中使用目标数据集(target dataset)中的半样本数据,并在监督学习环境中使用其他数据集(proxy datasets),其中具有完全抽样的k-空间数据。为证明JSSL的效果,我们使用了半样本肾脏平行MRI测量数据作为目标数据集,并使用了完全抽样的大脑和股骨k-空间测量数据作为proxy datasets。我们的结果显示,JSSL在相比传统自监督训练方法的情况下具有显著改善,从而证明了我们的联合方法的有效性。我们还提供了对JSSL的理论基础和实际“准则”(rule-of-thumb),以帮助选择适当的深度MRI重建训练方法。
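A schematic sketch of the joint objective, not the authors' pipeline: on proxy data with fully sampled k-space the loss is supervised against the ground-truth image, while on the target data it is a self-supervised data-consistency term on the measured (subsampled) k-space only. The tiny CNN, the single-coil FFT forward model, and the weighting are assumptions.

```python
# Joint supervised + self-supervised (data-consistency) loss for MRI reconstruction.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 2, 3, padding=1))

def to_kspace(img2ch):                          # (B, 2, H, W) real/imag -> complex k-space
    cplx = torch.complex(img2ch[:, 0], img2ch[:, 1])
    return torch.fft.fft2(cplx)

def jssl_loss(zero_filled_target, mask, measured_k, proxy_input, proxy_gt, lam=1.0):
    # self-supervised term on target data: match only the *measured* k-space samples
    recon_t = net(zero_filled_target)
    dc = (to_kspace(recon_t) * mask - measured_k * mask).abs().mean()
    # supervised term on proxy data with fully sampled ground truth
    sup = (net(proxy_input) - proxy_gt).abs().mean()
    return sup + lam * dc

B, H, W = 2, 32, 32
mask = (torch.rand(B, H, W) < 0.3).float()                      # subsampling mask
target_k = torch.fft.fft2(torch.randn(B, H, W, dtype=torch.cfloat)) * mask
zero_filled = torch.stack([torch.fft.ifft2(target_k).real,
                           torch.fft.ifft2(target_k).imag], dim=1)
print(jssl_loss(zero_filled, mask, target_k,
                torch.randn(B, 2, H, W), torch.randn(B, 2, H, W)).item())
```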

SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

  • paper_url: http://arxiv.org/abs/2311.15855
  • repo_url: None
  • paper_authors: Hsuan-I Ho, Jie Song, Otmar Hilliges
  • for: 创建从单个图像中生成真实、全面的3D人体模型
  • methods: 提出了一种新的渠道,将图像条件的扩散模型集成到3D mesh重建工作流中
  • results: 经过广泛的实验和用户测试,证明该方法可以很好地从不同的图像中生成真实、全面的3D人体模型
    Abstract A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single images. The main challenge lies in inferring unknown human shapes, clothing, and texture information in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the ill-posed single-view reconstruction problem into hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate back appearances from the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body texture meshes from the input and back-view images. Our designs enable training of the pipeline with only about 500 3D human scans while maintaining its generality and robustness. Extensive experiments and user studies on two 3D reconstruction benchmarks demonstrated the efficacy of our method in generating realistic, fully textured 3D humans from a diverse range of unseen images.
    摘要 长期目标是从单个图像中生成真实、完整的3D人体。主要挑战在于推断不可见区域的人体形状、衣物和纹理信息。我们提议SiTHpipeline,它uniquely integrate了图像条件的扩散模型到3D短网 reconstruction工作流程中。我们的方法通过将单视重构问题分解为描绘和重构两个互补部分来解决这个挑战。为描绘部分,我们使用强大的生成扩散模型来描绘图像中的返回 appearances。为重构部分,我们利用皮封体mesh作为导向来从输入和反向图像中恢复全身纹理网格。我们的设计使得可以通过约500个3D人体扫描训练我们的管道,而且保持其通用性和稳定性。我们的实验和用户研究表明,我们的方法可以从多样化的未看过图像中生成真实、完整的3D人体。

Single-Model and Any-Modality for Video Object Tracking

  • paper_url: http://arxiv.org/abs/2311.15851
  • repo_url: https://github.com/zongwei97/untrack
  • paper_authors: Zongwei Wu, Jilai Zheng, Xiangxuan Ren, Florin-Alexandru Vasluianu, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte
  • for: Multi-modal tracking in the video object tracking domain.
  • methods: Proposes Un-Track, a transformer-based tracker with a single set of parameters that learns a common latent space across modalities via low-rank factorization and reconstruction, unifying multi-modal tracking (a low-rank adapter sketch follows this entry).
  • results: On the DepthTrack dataset, Un-Track delivers a +8.1 absolute F-score gain; across five benchmark datasets with different modalities it outperforms both SOTA unified trackers and modality-specific fine-tuned counterparts, demonstrating its effectiveness and practicality.
    Abstract In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a \underline{Un}ified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture and without the need for modality-specific fine-tuning. Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific finetuned counterparts, validating our effectiveness and practicality.
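The shared latent space learned via low-rank factorization and reconstruction can be illustrated with a small adapter: features from any auxiliary modality pass through a low-rank bottleneck, are reconstructed, and then prompt the RGB branch. The module below is a generic sketch with assumed dimensions and loss, not the released Un-Track code.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Project features of any auxiliary modality into a shared low-rank latent
    space and reconstruct them (illustrative stand-in for the shared representation)."""
    def __init__(self, dim=256, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank)   # low-rank factor: dim -> rank
        self.up = nn.Linear(rank, dim)     # low-rank factor: rank -> dim

    def forward(self, x):
        z = self.down(x)                   # shared low-rank latent
        return self.up(z), z

fusion = LowRankFusion()
rgb_feat = torch.randn(8, 256)             # toy RGB token features
x_feat = torch.randn(8, 256)               # toy auxiliary features (depth/thermal/event)
recon, z = fusion(x_feat)
recon_loss = nn.functional.mse_loss(recon, x_feat)  # reconstruction keeps modality cues
fused = rgb_feat + recon                   # prompt the RGB branch with the reconstructed cue
print(fused.shape, float(recon_loss))
```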

Cell Maps Representation For Lung Adenocarcinoma Growth Patterns Classification In Whole Slide Images

  • paper_url: http://arxiv.org/abs/2311.15847
  • repo_url: None
  • paper_authors: Arwa Al-Rubaian, Gozde N. Gunesli, Wajd A. Althakfi, Ayesha Azam, Nasir Rajpoot, Shan E Ahmed Raza
  • for: Developing a machine learning approach for classifying lung adenocarcinoma growth patterns, to improve diagnosis and prognosis.
  • methods: Converts H&E whole slide images into cell maps, which are then classified with a convolutional neural network (a toy cell-map sketch follows this entry).
  • results: The cell-map representation generalizes well, achieving roughly 30% higher accuracy on unseen test sets than current state-of-the-art approaches.
    Abstract Lung adenocarcinoma is a morphologically heterogeneous disease, characterized by five primary histologic growth patterns. The quantity of these patterns can be related to tumor behavior and has a significant impact on patient prognosis. In this work, we propose a novel machine learning pipeline capable of classifying tissue tiles into one of the five patterns or as non-tumor, with an Area Under the Receiver Operating Characteristic Curve (AUCROC) score of 0.97. Our model's strength lies in its comprehensive consideration of cellular spatial patterns, where it first generates cell maps from Hematoxylin and Eosin (H&E) whole slide images (WSIs), which are then fed into a convolutional neural network classification model. Exploiting these cell maps provides the model with robust generalizability to new data, achieving approximately 30% higher accuracy on unseen test-sets compared to current state of the art approaches. The insights derived from our model can be used to predict prognosis, enhancing patient outcomes.
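The cell-map representation (rasterizing detected cells into a spatial grid, one channel per cell type, before CNN classification) can be sketched in a few NumPy lines. The coordinates, cell types, and grid size below are invented for illustration; the paper's actual cell detection and map construction may differ.

```python
import numpy as np

def build_cell_map(coords, types, n_types=4, size=(64, 64), tile_px=1024):
    """Rasterize detected cells into an n_types-channel occupancy map.
    coords: (N, 2) pixel positions inside a WSI tile; types: (N,) int labels."""
    cell_map = np.zeros((n_types, *size), dtype=np.float32)
    scale = np.array(size) / tile_px
    for (x, y), t in zip(coords, types):
        i, j = int(y * scale[0]), int(x * scale[1])
        cell_map[t, min(i, size[0] - 1), min(j, size[1] - 1)] += 1.0
    return cell_map

# Toy example: 200 random cells of 4 types in a 1024x1024 tile.
rng = np.random.default_rng(0)
coords = rng.integers(0, 1024, size=(200, 2))
types = rng.integers(0, 4, size=200)
cmap = build_cell_map(coords, types)
print(cmap.shape, cmap.sum())  # (4, 64, 64) 200.0 -- ready to feed a CNN classifier
```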

Learning with Noisy Low-Cost MOS for Image Quality Assessment via Dual-Bias Calibration

  • paper_url: http://arxiv.org/abs/2311.15846
  • repo_url: None
  • paper_authors: Lei Wang, Qingbo Wu, Desen Yuan, King Ngi Ngan, Hongliang Li, Fanman Meng, Linfeng Xu
  • for: Learning-based image quality assessment (IQA) with reduced manual annotation cost.
  • methods: Uses low-cost mean opinion scores (LC-MOS) as the learning target, treating them as noisy observations of labor-abundant MOS and jointly estimating the subjective and model biases (dual-bias) with an expectation-maximization style alternating optimization and a gated dual-bias calibration (GDBC) module.
  • results: Extensive experiments on four widely used IQA datasets validate the robustness and effectiveness of the proposed LC-MOS-based learning method.
    Abstract Learning based image quality assessment (IQA) models have obtained impressive performance with the help of reliable subjective quality labels, where mean opinion score (MOS) is the most popular choice. However, in view of the subjective bias of individual annotators, the labor-abundant MOS (LA-MOS) typically requires a large collection of opinion scores from multiple annotators for each image, which significantly increases the learning cost. In this paper, we aim to learn robust IQA models from low-cost MOS (LC-MOS), which only requires very few opinion scores or even a single opinion score for each image. More specifically, we consider the LC-MOS as the noisy observation of LA-MOS and enforce the IQA model learned from LC-MOS to approach the unbiased estimation of LA-MOS. In this way, we represent the subjective bias between LC-MOS and LA-MOS, and the model bias between IQA predictions learned from LC-MOS and LA-MOS (i.e., dual-bias) as two latent variables with unknown parameters. By means of the expectation-maximization based alternating optimization, we can jointly estimate the parameters of the dual-bias, which suppresses the misleading of LC-MOS via a gated dual-bias calibration (GDBC) module. To the best of our knowledge, this is the first exploration of robust IQA model learning from noisy low-cost labels. Theoretical analysis and extensive experiments on four popular IQA datasets show that the proposed method is robust toward different bias rates and annotation numbers and significantly outperforms the other learning based IQA models when only LC-MOS is available. Furthermore, we also achieve comparable performance with respect to the other models learned with LA-MOS.

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.15841
  • repo_url: None
  • paper_authors: Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang
  • for: A new text-to-image (T2I) task, action customization: learning a shared action from limited data and generalizing it to unseen humans or even animals.
  • methods: Proposes Action-Disentangled Identifier (ADI), an inversion-based method that learns action-specific identifiers from exemplar images. ADI expands the semantic conditioning space with layer-wise identifier tokens, increasing representational richness while distributing the inversion across features, and blocks the inversion of action-agnostic features by extracting gradient invariance from constructed sample triples and masking updates to irrelevant channels.
  • results: ADI outperforms existing baselines in action-customized T2I generation; the authors also present ActionBench, covering a variety of actions with carefully selected samples, and both quantitative and qualitative results confirm the advantage.
    Abstract This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method Action-Disentangled Identifier (ADI) to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present an ActionBench that includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation.

Syn3DWound: A Synthetic Dataset for 3D Wound Bed Analysis

  • paper_url: http://arxiv.org/abs/2311.15836
  • repo_url: None
  • paper_authors: Léo Lebrat, Rodrigo Santa Cruz, Remi Chierchia, Yulia Arzhaeva, Mohammad Ali Armin, Joshua Goldsmith, Jeremy Oorloff, Prithvi Reddy, Chuong Nguyen, Lars Petersson, Michelle Barakat-Johnson, Georgina Luscombe, Clinton Fookes, Olivier Salvado, David Ahmedt-Aristizabal
  • for: Addressing wound management for bedridden patients and the elderly.
  • methods: Applies modern image analysis to provide accurate and precise wound measurements.
  • results: Introduces the open-source Syn3DWound dataset of high-fidelity simulated wounds with 2D and 3D annotations, and proposes baseline methods and a benchmarking framework for automated 3D morphometry analysis and 2D/3D wound segmentation.
    Abstract Wound management poses a significant challenge, particularly for bedridden patients and the elderly. Accurate diagnostic and healing monitoring can significantly benefit from modern image analysis, providing accurate and precise measurements of wounds. Despite several existing techniques, the shortage of expansive and diverse training datasets remains a significant obstacle to constructing machine learning-based frameworks. This paper introduces Syn3DWound, an open-source dataset of high-fidelity simulated wounds with 2D and 3D annotations. We propose baseline methods and a benchmarking framework for automated 3D morphometry analysis and 2D/3D wound segmentation.

A-JEPA: Joint-Embedding Predictive Architecture Can Listen

  • paper_url: http://arxiv.org/abs/2311.15830
  • repo_url: None
  • paper_authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang
  • for: Transferring the masked-modeling principle behind large foundational vision models to audio.
  • methods: Introduces the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple self-supervised extension that learns from audio spectrograms, predicting latent representations of masked regions whose targets come from an exponential-moving-average target encoder (see the EMA sketch after this entry).
  • results: Replacing random block masking with curriculum, time-frequency-aware masking better handles the highly correlated local structure of spectrograms, and regularized masking during fine-tuning on target datasets improves contextual semantic understanding and robustness on audio and speech classification tasks.
    Abstract This paper presents that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum. Following the design of I-JEPA, our A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted by the exponential moving average of context encoder, \emph{i.e.}, target encoder, on the whole spectrogram. We find it beneficial to transfer random block masking into time-frequency aware masking in a curriculum manner, considering the complexity of highly correlated in local time and frequency in audio spectrograms. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with a regularized masking on target datasets, instead of input dropping or zero. Empirically, when built with Vision Transformers structure, we find A-JEPA to be highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
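Two ingredients of A-JEPA lend themselves to a short sketch: a target encoder maintained as an exponential moving average (EMA) of the context encoder, and a time-frequency-aware mask over spectrogram patches. The PyTorch snippet below uses toy linear encoders and an assumed 8x8 patch grid; the momentum value, mask spans, and the omitted predictor are simplifications, not the paper's configuration.

```python
import copy
import torch
import torch.nn as nn

context_encoder = nn.Linear(16, 32)              # stand-in for a ViT over spectrogram patches
target_encoder = copy.deepcopy(context_encoder)  # same architecture, EMA-updated, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(momentum=0.996):
    # Target parameters track the context encoder with an exponential moving average.
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)

def time_frequency_mask(n_time=8, n_freq=8, t_span=3, f_span=3):
    """Mask a contiguous time-frequency block instead of scattered random patches."""
    mask = torch.zeros(n_time, n_freq, dtype=torch.bool)
    t0 = int(torch.randint(0, n_time - t_span + 1, (1,)))
    f0 = int(torch.randint(0, n_freq - f_span + 1, (1,)))
    mask[t0:t0 + t_span, f0:f0 + f_span] = True
    return mask.flatten()                        # True = target region to predict

patches = torch.randn(64, 16)                    # 8x8 grid of toy patch embeddings
mask = time_frequency_mask()
ctx = context_encoder(patches[~mask])            # context branch sees visible patches only
with torch.no_grad():
    tgt = target_encoder(patches)[mask]          # prediction targets from the EMA encoder
ema_update()
print(ctx.shape, tgt.shape)
```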

LLMGA: Multimodal Large Language Model based Generation Assistant

  • paper_url: http://arxiv.org/abs/2311.16500
  • repo_url: https://github.com/Zj-BinXia/LLMGA
  • paper_authors: Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, Jiaya Jia
  • for: Introducing a Multimodal Large Language Model-based Generation Assistant (LLMGA) that leverages the knowledge and reasoning abilities of LLMs to give users precise control over image generation and editing.
  • methods: Uses a two-stage training scheme: the MLLM is first trained to grasp the properties of image generation and editing and produce detailed generation prompts, and Stable Diffusion (SD) is then optimized to align with those prompts. A reference-based restoration network is also proposed to reduce texture, brightness, and contrast disparities between generated and preserved regions during editing.
  • results: Experiments show that LLMGA has promising generative capabilities and enables a wider range of applications in an interactive manner.
    Abstract In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting $\&$ outpainting, and visual question answering. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during image editing. Extensive results show that LLMGA has promising generative capabilities and can enable wider applications in an interactive manner.

C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing

  • paper_url: http://arxiv.org/abs/2311.15812
  • repo_url: None
  • paper_authors: Avigyan Bhattacharya, Mainak Singha, Ankit Jha, Biplab Banerjee
  • for: Domain and class generalization when analyzing optical remote sensing images with the large-scale pre-trained vision-language model (VLM) CLIP.
  • methods: Proposes C-SAW, which keeps the CLIP backbone frozen, adds a self-supervised loss in the visual space, and introduces a prompt-learning technique that emphasizes both visual-domain and content-specific features. This targets CLIP's difficulty in capturing contextual image information when patches are jumbled, which is pronounced in remote sensing imagery where land-cover classes have well-defined contextual appearances.
  • results: Experiments show that C-SAW outperforms alternatives across multiple remote sensing benchmarks and different generalization tasks.
    Abstract We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP. While contrastively trained VLMs show impressive zero-shot generalization performance, their effectiveness is limited when dealing with diverse domains during training and testing. Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts, which results in a drop in performance while dealing with such multi-domain data. To address these challenges, we propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features. We observe that CLIP's vision encoder struggles to identify contextual image information, particularly when image patches are jumbled up. This issue is especially severe in optical remote sensing images, where land-cover classes exhibit well-defined contextual appearances. To this end, we introduce C-SAW, a method that complements CLIP with a self-supervised loss in the visual space and a novel prompt learning technique that emphasizes both visual domain and content-specific features. We keep the CLIP backbone frozen and introduce a small set of projectors for both the CLIP encoders to train C-SAW contrastively. Experimental results demonstrate the superiority of C-SAW across multiple remote sensing benchmarks and different generalization tasks.

PIPE : Parallelized Inference Through Post-Training Quantization Ensembling of Residual Expansions

  • paper_url: http://arxiv.org/abs/2311.15806
  • repo_url: None
  • paper_authors: Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: Reducing the high inference cost of deep neural networks (DNNs) in computer vision and natural language processing via data-free quantization.
  • methods: PIPE leverages residual error expansion together with group sparsity and an ensemble approximation for better parallelization (a residual-expansion sketch follows this entry).
  • results: PIPE outperforms competing methods on every benchmarked application (computer vision and NLP), architecture (ConvNets and transformers), and bit width (from int8 down to ternary quantization).
    Abstract Deep neural networks (DNNs) are ubiquitous in computer vision and natural language processing, but suffer from high inference cost. This problem can be addressed by quantization, which consists of converting floating point operations into a lower bit-width format. With the growing concerns on privacy rights, we focus our efforts on data-free methods. However, such techniques suffer from their lack of adaptability to the target devices, as hardware typically only supports specific bit widths. Thus, to adapt to a variety of devices, a quantization method shall be flexible enough to find good accuracy vs. speed trade-offs for every bit width and target device. To achieve this, we propose PIPE, a quantization method that leverages residual error expansion, along with group sparsity and an ensemble approximation for better parallelization. PIPE is backed by strong theoretical guarantees and achieves superior performance on every benchmarked application (from vision to NLP tasks), architecture (ConvNets, transformers) and bit-width (from int8 to ternary quantization).
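Residual error expansion is straightforward to illustrate: quantize the weights, quantize the remaining error, and sum the terms, so each added order recovers accuracy and the orders can be evaluated in parallel as an ensemble. The NumPy sketch below uses a plain symmetric uniform quantizer with an assumed bit width; it is a toy illustration, not the PIPE implementation.

```python
import numpy as np

def quantize(w, n_bits=4):
    """Symmetric uniform quantizer; returns the dequantized tensor."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

def residual_expansion(w, order=3, n_bits=4):
    """Approximate w as a sum of quantized terms, each quantizing the residual
    error left by the previous terms: w ~ q1 + q2 + ... + q_order."""
    terms, residual = [], w.copy()
    for _ in range(order):
        q = quantize(residual, n_bits)
        terms.append(q)
        residual = residual - q
    return terms

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
terms = residual_expansion(w)
for k in range(1, len(terms) + 1):
    approx = sum(terms[:k])                        # ensemble of the first k expansion orders
    err = np.linalg.norm(w - approx) / np.linalg.norm(w)
    print(f"order {k}: relative error {err:.4f}")  # error shrinks as terms are added
```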

SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.15803
  • repo_url: None
  • paper_authors: Quentin Herau, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Cyrille Migniot, Pascal Vasseur, Cédric Demonceaux
  • for: Accurate spatio-temporal calibration of multiple sensors with different modalities, to improve the precision and robustness of autonomous driving systems.
  • methods: Uses Neural Radiance Fields (NeRF) to represent the different sensor modalities in a common volumetric representation; a partitioning approach based on the part of the scene visible to each sensor formulates calibration using only the overlapping areas, making the calibration more robust and accurate.
  • results: Validation on several public driving datasets shows better accuracy and robustness than previous methods.
    Abstract In rapidly-evolving domains such as autonomous driving, the use of multiple sensors with different modalities is crucial to ensure high operational precision and stability. To correctly exploit the provided information by each sensor in a single common frame, it is essential for these sensors to be accurately calibrated. In this paper, we leverage the ability of Neural Radiance Fields (NeRF) to represent different sensors modalities in a common volumetric representation to achieve robust and accurate spatio-temporal sensor calibration. By designing a partitioning approach based on the visible part of the scene for each sensor, we formulate the calibration problem using only the overlapping areas. This strategy results in a more robust and accurate calibration that is less prone to failure. We demonstrate that our approach works on outdoor urban scenes by validating it on multiple established driving datasets. Results show that our method is able to get better accuracy and robustness compared to existing methods.

Mip-Splatting: Alias-free 3D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.16493
  • repo_url: https://github.com/autonomousvision/mip-splatting
  • paper_authors: Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, Andreas Geiger
  • for: Improving the fidelity of 3D Gaussian Splatting, in particular its robustness to changes in sampling rate (e.g., focal length or camera distance).
  • methods: Introduces a 3D smoothing filter that constrains the size of 3D Gaussian primitives according to the maximal sampling frequency induced by the input views, and replaces 2D dilation with a 2D Mip filter approximating a 2D box filter to mitigate aliasing and dilation artifacts (see the covariance sketch after this entry).
  • results: Evaluations across scenarios, including training on single-scale images and testing at multiple scales, validate the effectiveness of the approach.
    Abstract Recently, 3D Gaussian Splatting has demonstrated impressive novel view synthesis results, reaching high fidelity and efficiency. However, strong artifacts can be observed when changing the sampling rate, \eg, by changing focal length or camera distance. We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem, we introduce a 3D smoothing filter which constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views, eliminating high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such a training on single-scale images and testing on multiple scales, validates the effectiveness of our approach.
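The 3D smoothing filter amounts to convolving each 3D Gaussian with an isotropic low-pass Gaussian whose width is tied to the maximal sampling frequency the training views induce for that primitive, which simply adds a term to its covariance. The NumPy sketch below shows that covariance update; the crude sampling-rate estimate and the scale constant `s` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def max_sampling_rate(depths, focal_lengths):
    """Crude per-primitive sampling-rate estimate: a pixel at depth d seen with
    focal length f samples the primitive roughly every d / f world units."""
    intervals = np.asarray(depths) / np.asarray(focal_lengths)
    return 1.0 / intervals.min()          # highest sampling rate over the training views

def smooth_gaussian(cov, nu_max, s=0.2):
    """Band-limit a 3D Gaussian by adding an isotropic filter variance that
    scales with the inverse of the maximal sampling rate."""
    filt_var = (s / nu_max) ** 2
    cov_f = cov + filt_var * np.eye(3)
    # Rescale the peak so the convolved Gaussian keeps a comparable opacity.
    opacity_scale = np.sqrt(np.linalg.det(cov) / np.linalg.det(cov_f))
    return cov_f, opacity_scale

cov = np.diag([1e-4, 4e-4, 9e-4])          # toy anisotropic primitive covariance
nu = max_sampling_rate(depths=[2.0, 3.5, 5.0], focal_lengths=[1000, 1200, 900])
cov_filtered, opacity_scale = smooth_gaussian(cov, nu)
print(nu, np.diag(cov_filtered), opacity_scale)
```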

Source-Free Domain Adaptation with Frozen Multimodal Foundation Model

  • paper_url: http://arxiv.org/abs/2311.16510
  • repo_url: None
  • paper_authors: Song Tang, Wenxin Su, Mao Ye, Xiatian Zhu
  • for: Source-free domain adaptation (SFDA): adapting a source-pretrained model to a target domain using only unlabeled target training data.
  • methods: Explores off-the-shelf vision-language (ViL) multimodal models such as CLIP and proposes DIFO (Distilling multimodal Foundation model), which alternates between (i) customizing the ViL model by maximizing its mutual information with the target model via prompt learning and (ii) distilling the customized ViL model's knowledge into the target model, aided by two regularizers: most-likely category encouragement and predictive consistency.
  • results: Experiments show that DIFO significantly outperforms state-of-the-art alternatives.
    Abstract Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain, with only access to unlabeled target training data and the source model pre-trained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision, conventional methods are inevitably error-prone. To mitigate this limitation, in this work we for the first time explore the potentials of off-the-shelf vision-language (ViL) multimodal models (e.g.,CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory, as it is not specialized for this particular task but largely generic. To make it task specific, we propose a novel Distilling multimodal Foundation model(DIFO)approach. Specifically, DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner, (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation, we further introduce two effective regularization terms, namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Our source code will be released.

Relationship between Model Compression and Adversarial Robustness: A Review of Current Evidence

  • paper_url: http://arxiv.org/abs/2311.15782
  • repo_url: None
  • paper_authors: Svetlana Pavlitska, Hannes Grolig, J. Marius Zöllner
  • for: Increasing model capacity is a known way to improve the adversarial robustness of deep learning networks.
  • methods: Model compression techniques such as pruning and quantization can reduce network size while preserving accuracy.
  • results: Recent studies of the relationship between model compression and adversarial robustness report partly contradictory findings; this work summarizes the available evidence and discusses possible explanations.
    Abstract Increasing the model capacity is a known approach to enhance the adversarial robustness of deep learning networks. On the other hand, various model compression techniques, including pruning and quantization, can reduce the size of the network while preserving its accuracy. Several recent studies have addressed the relationship between model compression and adversarial robustness, while some experiments have reported contradictory results. This work summarizes available evidence and discusses possible explanations for the observed effects.

Stable Segment Anything Model

  • paper_url: http://arxiv.org/abs/2311.15776
  • repo_url: https://github.com/fanq15/stable-sam
  • paper_authors: Qi Fan, Xin Tao, Lei Ke, Mingqiao Ye, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Yu-Wing Tai, Chi-Keung Tang
  • for: Improving the accuracy and stability of the Segment Anything Model (SAM) so it can handle low-quality prompts.
  • methods: An analysis of SAM's segmentation stability shows that imprecise boxes or insufficient points bias the mask decoder toward the background or specific object parts. The proposed remedy learns deformable offsets that adaptively shift the feature sampling locations toward the prompted target regions in a data-driven manner, with a dynamic routing plugin toggling between deformable and regular grid sampling depending on prompt quality (a grid-sampling sketch follows this entry).
  • results: Extensive experiments on multiple datasets show improved segmentation accuracy and stability while retaining SAM's powerful promptable segmentation efficiency and generality, with minimal learnable parameters (0.08 M) and fast adaptation (one training epoch).
    Abstract The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts which, however, often require good skills to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis on SAM's segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key finding reveals that given such low-quality prompts, SAM's mask decoder tends to activate image features that are biased towards the background or confined to specific object parts. To mitigate this issue, our key idea consists of adjusting the sampling locations of image feature using learnable deformable offsets, while the original SAM model architecture and weights remain unchanged. Consequently, our deformable sampling plugin (DSP) enables SAM to adaptively shift attention to the prompted target regions in a data-driven manner, facilitated by our effective robust training strategy (RTS). During inference, dynamic routing plugin (DRP) is proposed that toggles SAM between the deformable and regular grid sampling modes, conditioned on the input prompt quality. Thus, our solution, termed Stable-SAM, is one of its kind focusing on solely adjusting feature sampling locations, which offers several advantages: 1) improved SAM's segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality, with 3) minimal learnable parameters (0.08 M) and fast adaptation (by 1 training epoch). Extensive experiments across multiple datasets validate the effectiveness and advantages of our approach, underscoring Stable-SAM as a more robust solution for segmenting anything. Codes will be released upon acceptance.
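The deformable sampling idea (predicting small offsets and resampling the frozen image features at shifted locations) maps naturally onto `torch.nn.functional.grid_sample`. The module below is a generic illustration with assumed shapes and a zero-initialized offset head; it is not the released Stable-SAM code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Predict per-location offsets and resample the (frozen) image features at the
    shifted positions, so attention can move toward the prompted target regions."""
    def __init__(self, channels=256):
        super().__init__()
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_head.weight)   # start close to the regular grid sampling
        nn.init.zeros_(self.offset_head.bias)

    def forward(self, feats):
        b, c, h, w = feats.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)   # regular grid in [-1, 1]
        offsets = self.offset_head(feats).permute(0, 2, 3, 1)     # learnable deformable offsets
        return F.grid_sample(feats, base + offsets, align_corners=False)

feats = torch.randn(1, 256, 64, 64)        # stand-in for frozen SAM image features
deform = DeformableSampling()
resampled = deform(feats)                  # same shape, sampling locations shifted
print(resampled.shape)
```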

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.15773
  • repo_url: https://github.com/SimM-T2I/SimM
  • paper_authors: Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu
  • for: A training-free layout calibration system that helps text-to-image generation follow the layout requirements in textual prompts.
  • methods: Follows a "check-locate-rectify" pipeline: the prompt is analyzed to produce a target layout, which is compared with intermediate outputs to automatically detect errors; rectification then moves the located activations and applies intra- and inter-map adjustments with negligible computational overhead.
  • results: Introduces the SimMBench benchmark, which compensates for the lack of superlative spatial relations in existing datasets; quantitative and qualitative results show that SimM effectively calibrates layout inconsistencies.
    Abstract Diffusion models have recently achieved remarkable progress in generating realistic images. However, challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions, we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time. Specifically, following a "check-locate-rectify" pipeline, the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then, by moving the located activations and making intra- and inter-map adjustments, the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements, we present a benchmark SimMBench that compensates for the lack of superlative spatial relations in existing datasets. And both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies.

Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

  • paper_url: http://arxiv.org/abs/2311.15769
  • repo_url: https://github.com/HJYao00/Side4Video
  • paper_authors: Huanjin Yao, Wenhao Wu, Zhiheng Li
  • for: Making more effective use of large pre-trained vision models, particularly for video understanding tasks.
  • methods: Proposes Side4Video, a lightweight spatial-temporal side network attached to the frozen image model that avoids backpropagation through the heavy pre-trained backbone and exploits its multi-level spatial features, greatly reducing training memory (a frozen-backbone sketch follows this entry).
  • results: Achieves strong results on multiple video datasets, notably Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%), and VATEX (68.8%).
    Abstract Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies turn their focus towards efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods lack attention to training memory usage and exploration of transferring a larger model to the video domain. In this paper, we present a novel Spatial-Temporal Side Network for memory-efficient fine-tuning large image models to video understanding, named Side4Video. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids the backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. Extremely memory-efficient architecture enables our method to reduce 75% memory usage than previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B) for video understanding tasks which is 14x larger than ViT-L (304M). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially in Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
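The memory saving comes from never backpropagating through the frozen backbone: features are extracted under `torch.no_grad()` and only the lightweight side network receives gradients. The PyTorch sketch below uses a toy convolutional backbone and a single feature tap; the real Side4Video taps multi-level spatial features and models time as well.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad_(False)               # frozen image model: no gradients flow through it

side_net = nn.Sequential(nn.Conv2d(64, 32, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))
opt = torch.optim.AdamW(side_net.parameters(), lr=1e-3)

frames = torch.randn(4, 3, 32, 32)        # a few toy video frames
labels = torch.randint(0, 10, (4,))

with torch.no_grad():                     # forward-only feature extraction; no activations stored
    feats = backbone(frames)

logits = side_net(feats)                  # only the small side network is trained
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
opt.step()
print(float(loss))
```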

PyNanospacing: TEM image processing tool for strain analysis and visualization

  • paper_url: http://arxiv.org/abs/2311.15751
  • repo_url: None
  • paper_authors: Mehmet Ali Sarsil, Mubashir Mansoor, Mert Saracoglu, Servet Timur, Mustafa Urgen, Onur Ergen
  • for: Developing a Python tool for TEM image processing that can handle a wide range of materials, including nanoparticles, 2D materials, pure crystals, and solid solutions.
  • methods: Implements an algorithm that converts local differences in interplanar spacings into contour maps, giving a visual representation of lattice compression and expansion (a toy strain-map sketch follows this entry).
  • results: Because material properties such as band gap, mechanical moduli, color, phonon and electronic density of states, and catalytic and surface behavior are tied to interatomic bond lengths, the tool supports a deeper, atomic-level exploration of strain engineering through strain contour maps.
    Abstract The diverse spectrum of material characteristics including band gap, mechanical moduli, color, phonon and electronic density of states, along with catalytic and surface properties are intricately intertwined with the atomic structure and the corresponding interatomic bond-lengths. This interconnection extends to the manifestation of interplanar spacings within a crystalline lattice. Analysis of these interplanar spacings and the comprehension of any deviations, whether it be lattice compression or expansion, commonly referred to as strain, hold paramount significance in unraveling various unknowns within the field. Transmission Electron Microscopy (TEM) is widely used to capture atomic-scale ordering, facilitating direct investigation of interplanar spacings. However, creating critical contour maps for visualizing and interpreting lattice stresses in TEM images remains a challenging task. Here we developed a Python code for TEM image processing that can handle a wide range of materials including nanoparticles, 2D materials, pure crystals and solid solutions. This algorithm converts local differences in interplanar spacings into contour maps allowing for a visual representation of lattice expansion and compression. The tool is very generic and can significantly aid in analyzing material properties using TEM images, allowing for a more in-depth exploration of the underlying science behind strain engineering via strain contour maps at the atomic level.
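The core computation behind a strain contour map reduces to strain = (d_local - d_ref) / d_ref evaluated over the field of view. The NumPy/Matplotlib snippet below runs on synthetic spacings with an assumed reference value; it omits the spacing extraction from TEM images and is not the PyNanospacing code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic map of local interplanar spacings (angstroms) with a dilated patch in the centre.
ny, nx = 128, 128
d_ref = 2.04                               # assumed reference (unstrained) spacing
yy, xx = np.mgrid[0:ny, 0:nx]
d_local = d_ref * (1 + 0.02 * np.exp(-((xx - 64) ** 2 + (yy - 64) ** 2) / (2 * 20 ** 2)))

strain = (d_local - d_ref) / d_ref         # positive = lattice expansion, negative = compression

fig, ax = plt.subplots()
cs = ax.contourf(strain * 100, levels=15, cmap="coolwarm")
fig.colorbar(cs, label="strain (%)")
ax.set_title("Toy strain contour map from local interplanar spacings")
plt.savefig("strain_map.png")              # contour map visualizing expansion/compression
```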

SIRAN: Sinkhorn Distance Regularized Adversarial Network for DEM Super-resolution using Discriminative Spatial Self-attention

  • paper_url: http://arxiv.org/abs/2311.16490
  • repo_url: None
  • paper_authors: Subhajit Paul, Ashutosh Gupta
  • for: This paper aims to generate high-resolution Digital Elevation Models (DEMs) using high-resolution multi-spectral (MX) satellite imagery with the assistance of adversarial learning.
  • methods: The proposed method utilizes polarized self-attention of discriminator spatial maps and a Densely connected Multi-Residual Block (DMRB) module to improve the efficiency of gradient flow. Additionally, the objective function is optimized using Sinkhorn distance with traditional GAN to address vanishing gradient issues and improve numerical convergence.
  • results: The proposed method addresses vanishing gradient issues, improves numerical convergence, and outperforms other learning-based state-of-the-art methods; the authors provide both qualitative and quantitative comparisons and generate several high-resolution DEMs covering terrains with diverse signatures to show the performance of their model.
    Abstract Digital Elevation Model (DEM) is an essential aspect in the remote sensing domain to analyze and explore different applications related to surface elevation information. In this study, we intend to address the generation of high-resolution DEMs using high-resolution multi-spectral (MX) satellite imagery by incorporating adversarial learning. To promptly regulate this process, we utilize the notion of polarized self-attention of discriminator spatial maps as well as introduce a Densely connected Multi-Residual Block (DMRB) module to assist in efficient gradient flow. Further, we present an objective function related to optimizing Sinkhorn distance with traditional GAN to improve the stability of adversarial learning. In this regard, we provide both theoretical and empirical substantiation of better performance in terms of vanishing gradient issues and numerical convergence. We demonstrate both qualitative and quantitative outcomes with available state-of-the-art methods. Based on our experiments on DEM datasets of Shuttle Radar Topographic Mission (SRTM) and Cartosat-1, we show that the proposed model performs preferably against other learning-based state-of-the-art methods. We also generate and visualize several high-resolution DEMs covering terrains with diverse signatures to show the performance of our model.
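The Sinkhorn distance used to regularize the adversarial objective has a standard iterative form that is easy to sketch: build the pairwise cost, form the entropic Gibbs kernel, and alternate the Sinkhorn-Knopp scalings. The NumPy function below is a generic textbook implementation on toy point sets, not the SIRAN training code; the regularization strength and iteration count are arbitrary.

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iter=200):
    """Entropy-regularized optimal transport cost between two point sets with uniform weights."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1) ** 2   # pairwise squared distances
    K = np.exp(-cost / eps)                        # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(n_iter):                        # Sinkhorn-Knopp scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    transport = u[:, None] * K * v[None, :]        # approximate optimal transport plan
    return float(np.sum(transport * cost))

rng = np.random.default_rng(0)
real = rng.normal(size=(64, 2))                    # toy "real" samples
fake = rng.normal(loc=0.5, size=(64, 2))           # toy "generated" samples
print(sinkhorn_distance(real, fake))
```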

One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

  • paper_url: http://arxiv.org/abs/2311.15744
  • repo_url: None
  • paper_authors: Minghui Hu, Jianbin Zheng, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, Tat-Jen Cham
  • for: 提高 diffusion 模型中图像质量和准确性
  • methods: 提出一种增强图像质量的方法,即 One More Step (OMS),通过添加一个简单 yet effective 步骤来提高图像质量,同时保持原始模型参数
  • results: OMS 方法可以提高图像质量和准确性,并且可以让不同的预训练 diffusion 模型在同一个 latent Domain 中共享同一个 OMS 模块,无需修改原始模型参数
    Abstract It is well known that many open-released foundational diffusion models have difficulty in generating images that substantially depart from average brightness, despite such images being present in the training data. This is due to an inconsistency: while denoising starts from pure Gaussian noise during inference, the training noise schedule retains residual data even in the final timestep distribution, due to difficulties in numerical conditioning in mainstream formulation, leading to unintended bias during inference. To mitigate this issue, certain $\epsilon$-prediction models are combined with an ad-hoc offset-noise methodology. In parallel, some contemporary models have adopted zero-terminal SNR noise schedules together with $\mathbf{v}$-prediction, which necessitate major alterations to pre-trained models. However, such changes risk destabilizing a large multitude of community-driven applications anchored on these pre-trained models. In light of this, our investigation revisits the fundamental causes, leading to our proposal of an innovative and principled remedy, called One More Step (OMS). By integrating a compact network and incorporating an additional simple yet effective step during inference, OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters. Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.

Machine Learning-Based Jamun Leaf Disease Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2311.15741
  • repo_url: None
  • paper_authors: Auvick Chandra Bhowmik, Dr. Md. Taimur Ahad, Yousuf Rayhan Emon
  • for: 本研究旨在探讨机器学习技术在茭果叶病诊断中的应用,以提高茭果叶病诊断的效率和准确率。
  • methods: 本研究使用了多种机器学习模型,包括转移学习模型(TLMViT)、SLViT、SE-ViT、IterationViT、Tiny-LeViT、IEM-ViT、GreenViT、PMViT等,以及传统的 dense convolutional network(DenseNet)、Residual Neural Network(ResNet)-50V2、EfficientNet、Ensemble model、Convolutional Neural Network(CNN)等模型。
  • results: 本研究对多种数据集进行了评估,并证明了这些机器学习模型在实际应用中的可行性。
    Abstract Jamun leaf diseases pose a significant threat to agricultural productivity, negatively impacting both yield and quality in the jamun industry. The advent of machine learning has opened up new avenues for tackling these diseases effectively. Early detection and diagnosis are essential for successful crop management. While no automated systems have yet been developed specifically for jamun leaf disease detection, various automated systems have been implemented for similar types of disease detection using image processing techniques. This paper presents a comprehensive review of machine learning methodologies employed for diagnosing plant leaf diseases through image classification, which can be adapted for jamun leaf disease detection. It meticulously assesses the strengths and limitations of various Vision Transformer models, including Transfer learning model and vision transformer (TLMViT), SLViT, SE-ViT, IterationViT, Tiny-LeViT, IEM-ViT, GreenViT, and PMViT. Additionally, the paper reviews models such as Dense Convolutional Network (DenseNet), Residual Neural Network (ResNet)-50V2, EfficientNet, Ensemble model, Convolutional Neural Network (CNN), and Locally Reversible Transformer. These machine-learning models have been evaluated on various datasets, demonstrating their real-world applicability. This review not only sheds light on current advancements in the field but also provides valuable insights for future research directions in machine learning-based jamun leaf disease detection and classification.

Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

  • paper_url: http://arxiv.org/abs/2311.15740
  • repo_url: https://github.com/feup-infolab/archmine
  • paper_authors: Mariana Dias, Carla Teixeira Lopes
  • for: Evaluating the impact of image processing methods and parameter tuning on OCR of typewritten cultural heritage documents.
  • methods: Uses a multi-objective problem formulation, tuned with the non-dominated sorting genetic algorithm (NSGA-II), to minimize Levenshtein edit distance and maximize the number of correctly identified words (see the fitness sketch after this entry).
  • results: Parameterization by digital representation typology improves the performance of image pre-processing for OCR, and pre-processing is most useful for typologies where OCR without it performs poorly. Specifically, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
    Abstract Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
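The two objectives explored by NSGA-II can be made concrete: for a candidate pre-processing parameterization, run OCR and score the output against a ground-truth transcription by Levenshtein edit distance (to minimize) and by the number of correctly identified words (to maximize). The sketch below shows such a fitness function on toy strings with a plain-Python edit distance; the OCR call and the image pre-processing step themselves are assumed to happen upstream and are not part of the snippet.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fitness(ocr_output: str, ground_truth: str):
    """Two objectives for NSGA-II: (edit distance to minimize,
    number of ground-truth words correctly recognized to maximize)."""
    edit = levenshtein(ocr_output, ground_truth)
    hits = sum(1 for w in ground_truth.split() if w in ocr_output.split())
    return edit, hits

# Toy example standing in for OCR output after one candidate pre-processing setting.
gt = "the archival letter was typed in 1943"
ocr = "the archiva1 letter was typed in 1943"
print(fitness(ocr, gt))   # (1, 6): one character error, 6 of 7 words recognized exactly
```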

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

  • paper_url: http://arxiv.org/abs/2311.15732
  • repo_url: https://github.com/whwu95/GPT4Vis
  • paper_authors: Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang
  • for: Not a new method, but an essential baseline: evaluating GPT-4's linguistic and visual capabilities for zero-shot visual recognition, including whether its generated rich textual descriptions improve recognition without any training and how well it recognizes diverse visual content directly.
  • methods: Systematically quantifies GPT-4 across three modalities (images, videos, and point clouds) on 16 widely recognized benchmark datasets, reporting top-1 and top-5 accuracy (a description-ensembling sketch follows this entry).
  • results: Leveraging GPT-4's advanced linguistic knowledge to generate rich descriptions markedly improves zero-shot recognition; on the visual side, GPT-4V's average performance across the 16 datasets sits roughly between OpenAI-CLIP's ViT-L and EVA-CLIP's ViT-E. Code: https://github.com/whwu95/GPT4Vis.
    Abstract This paper does not present a novel method. Instead, it delves into an essential, yet must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual understanding. Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. Specifically, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training. Additionally, we evaluate its visual proficiency in directly recognizing diverse visual content. To achieve this, we conduct an extensive series of experiments, systematically quantifying the performance of GPT-4 across three modalities: images, videos, and point clouds. This comprehensive evaluation encompasses a total of 16 widely recognized benchmark datasets, providing top-1 and top-5 accuracy metrics. Our study reveals that leveraging GPT-4's advanced linguistic knowledge to generate rich descriptions markedly improves zero-shot recognition. In terms of visual proficiency, GPT-4V's average performance across 16 datasets sits roughly between the capabilities of OpenAI-CLIP's ViT-L and EVA-CLIP's ViT-E. We hope that this research will contribute valuable data points and experience for future studies. We release our code at https://github.com/whwu95/GPT4Vis.
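How rich descriptions help zero-shot recognition can be sketched as description-augmented prompt ensembling: embed several generated sentences per class, average them into a classifier weight, and pick the class whose weight is most similar to the image embedding. The NumPy toy below uses random vectors in place of CLIP-style encoders; the description texts and encoder functions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 512

# Placeholder encoders: in practice these would be CLIP-style image/text encoders.
def encode_image(_image):
    return rng.normal(size=embed_dim)

def encode_text(_sentence):
    return rng.normal(size=embed_dim)

# A few GPT-4-style descriptions per class (illustrative text only).
class_descriptions = {
    "golden retriever": ["a large dog with a dense golden coat",
                         "a friendly dog with floppy ears"],
    "tabby cat": ["a small cat with striped grey fur",
                  "a cat with an M-shaped marking on its forehead"],
}

def l2_normalize(v):
    return v / (np.linalg.norm(v) + 1e-8)

# Average the normalized description embeddings into one classifier weight per class.
class_weights = {
    name: l2_normalize(np.mean([l2_normalize(encode_text(s)) for s in sents], axis=0))
    for name, sents in class_descriptions.items()
}

image_emb = l2_normalize(encode_image("query.jpg"))
scores = {name: float(image_emb @ w) for name, w in class_weights.items()}
print(max(scores, key=scores.get), scores)   # top-1 class by cosine similarity
```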

MARIS: Referring Image Segmentation via Mutual-Aware Attention Features

  • paper_url: http://arxiv.org/abs/2311.15727
  • repo_url: None
  • paper_authors: Mengxi Zhang, Yiming Liu, Xiangjun Yin, Huanjing Yue, Jingyu Yang
  • for: Referring image segmentation (RIS): segmenting the specific region described by a language expression.
  • methods: Builds on the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism with two parallel branches, Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features; a mask decoder with a multi-modal query token provides explicit linguistic guidance for segmentation more consistent with the expression.
  • results: Extensive experiments on three benchmark datasets show that the method outperforms prior RIS approaches.
    Abstract Referring image segmentation (RIS) aims to segment a particular region based on a language expression prompt. Existing methods incorporate linguistic features into visual features and obtain multi-modal features for mask decoding. However, these methods may segment the visually salient entity instead of the correct referring region, as the multi-modal features are dominated by the abundant visual context. In this paper, we propose MARIS, a referring image segmentation method that leverages the Segment Anything Model (SAM) and introduces a mutual-aware attention mechanism to enhance the cross-modal fusion via two parallel branches. Specifically, our mutual-aware attention mechanism consists of Vision-Guided Attention and Language-Guided Attention, which bidirectionally model the relationship between visual and linguistic features. Correspondingly, we design a Mask Decoder to enable explicit linguistic guidance for more consistent segmentation with the language expression. To this end, a multi-modal query token is proposed to integrate linguistic information and interact with visual information simultaneously. Extensive experiments on three benchmark datasets show that our method outperforms the state-of-the-art RIS methods. Our code will be publicly available.

SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.15707
  • repo_url: https://github.com/jiehonglin/sam-6d
  • paper_authors: Jiehong Lin, Lihua Liu, Dekun Lu, Kui Jia
  • for: Zero-shot 6D object pose estimation: detecting novel objects and their 6D poses in cluttered scenes, which demands strong model generalizability.
  • methods: Proposes SAM-6D, which solves the task in two steps, instance segmentation and pose estimation, using an Instance Segmentation Model (ISM) built on SAM and a Pose Estimation Model (PEM) that casts pose estimation as partial-to-partial point matching.
  • results: SAM-6D achieves strong performance on cluttered RGB-D images, outperforming existing methods on the seven core BOP Benchmark datasets for both instance segmentation and pose estimation of novel objects.
    Abstract Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes, presenting significant challenges for model generalizability. Fortunately, the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance, which provides a promising solution to tackle this task. Motivated by this, we introduce SAM-6D, a novel framework designed to realize the task through two steps, including instance segmentation and pose estimation. Given the target objects, SAM-6D employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics, appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem, PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence, ultimately yielding the pose estimates. Without bells and whistles, SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects.

ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.16494
  • repo_url: None
  • paper_authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang
  • for: This paper addresses the limitations of vision-language (V&L) models under distribution shifts, improving their effectiveness on downstream tasks.
  • methods: It proposes Attribute-Guided Prompt Tuning (ArGue) with three key contributions: 1) instead of directly appending soft prompts before class names, the model is aligned with primitive visual attributes generated by large language models, on the premise that high confidence in these attributes signals the ability to discern correct class rationales; 2) attribute sampling removes disadvantageous attributes so that only semantically meaningful ones are preserved; 3) negative prompting explicitly enumerates class-agnostic attributes and encourages the model to produce probability distributions that are highly orthogonal to these negative features.
  • results: In experiments, the method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.
    Abstract Although soft prompt tuning is effective in efficiently adapting Vision-Language (V&L) models for downstream tasks, it shows limitations in dealing with distribution shifts. We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly appending soft prompts preceding class names, we align the model with primitive visual attributes generated by Large Language Models (LLMs). We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, thus only semantically meaningful attributes are preserved. 3) We propose negative prompting, explicitly enumerating class-agnostic attributes to activate spurious correlations and encourage the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.
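To make the attribute-based scoring and negative prompting concrete, the sketch below averages CLIP-style similarities over each class's LLM-generated attributes and appends class-agnostic negative prompts as distractor logits in a joint softmax; the attribute lists, the averaging rule, and the `text_encoder` interface are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def argue_style_scores(image_feat, text_encoder, class_attributes, negative_attributes, tau=0.01):
    """image_feat: (B, D) L2-normalized image embeddings.
    class_attributes: dict {class_name: [LLM-generated attribute prompt, ...]}.
    negative_attributes: list of class-agnostic prompts used for negative prompting.
    text_encoder: callable mapping a list of prompts to (N, D) normalized embeddings (assumed)."""
    # Class scores: mean similarity over each class's attribute prompts.
    class_logits = torch.stack(
        [(image_feat @ text_encoder(attrs).t()).mean(dim=1) for attrs in class_attributes.values()],
        dim=1) / tau                                                        # (B, C)
    # Negative prompting: class-agnostic attributes appended as distractor logits.
    neg_logits = (image_feat @ text_encoder(negative_attributes).t()) / tau  # (B, N)
    probs = F.softmax(torch.cat([class_logits, neg_logits], dim=1), dim=1)
    # Training minimizes cross-entropy on the true class; the negative columns are
    # meant to soak up spurious correlations, keeping class probabilities orthogonal to them.
    return probs[:, :class_logits.shape[1]], probs[:, class_logits.shape[1]:]
```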

Model-agnostic Body Part Relevance Assessment for Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2311.15679
  • repo_url: None
  • paper_authors: Maurice Günder, Sneha Banerjee, Rafet Sifa, Christian Bauckhage
  • for: Explaining the decisions of deep learning models, specifically in the context of computer vision and object detection.
  • methods: The paper uses sampling-based explanation methods, specifically KernelSHAP, to analyze the output of deep learning models.
  • results: It presents a novel sampling-based method that is more efficient and robust for explainability analyses on large-scale datasets, and demonstrates its effectiveness on a pedestrian detection task.
    Abstract Model-agnostic explanation methods for deep learning models are flexible regarding usability and availability. However, due to the fact that they can only manipulate input to see changes in output, they suffer from weak performance when used with complex model architectures. For models with large inputs as, for instance, in object detection, sampling-based methods like KernelSHAP are inefficient due to many computation-heavy forward passes through the model. In this work, we present a framework for using sampling-based explanation models in a computer vision context by body part relevance assessment for pedestrian detection. Furthermore, we introduce a novel sampling-based method similar to KernelSHAP that shows more robustness for lower sampling sizes and, thus, is more efficient for explainability analyses on large-scale datasets.
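A minimal sketch of the kind of sampling-based, model-agnostic relevance estimation the paper builds on: body-part regions of a pedestrian crop are randomly masked, the detector is re-run, and a linear surrogate attributes the score changes to each part. The part layout, the `detector_score` callable, and the ridge-regression surrogate are illustrative assumptions rather than the authors' exact method.

```python
import numpy as np
from sklearn.linear_model import Ridge

def body_part_relevance(image, part_masks, detector_score, n_samples=200, seed=0):
    """image: HxWx3 array; part_masks: list of boolean HxW masks (head, torso, legs, ...).
    detector_score: callable returning the pedestrian confidence for an image (assumed)."""
    image = np.asarray(image, dtype=np.float32)
    rng = np.random.default_rng(seed)
    P = len(part_masks)
    Z = rng.integers(0, 2, size=(n_samples, P))   # which parts are kept in each perturbed sample
    baseline = image.mean(axis=(0, 1))            # masked parts are replaced by the mean color
    scores = np.empty(n_samples)
    for i, z in enumerate(Z):
        perturbed = image.copy()
        for keep, mask in zip(z, part_masks):
            if not keep:
                perturbed[mask] = baseline
        scores[i] = detector_score(perturbed)
    # Linear surrogate: coefficients approximate each body part's contribution to the score.
    surrogate = Ridge(alpha=1.0).fit(Z, scores)
    return surrogate.coef_
```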

HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images

  • paper_url: http://arxiv.org/abs/2311.15672
  • repo_url: None
  • paper_authors: Xihe Yang, Xingyu Chen, Shaohui Wang, Daiheng Gao, Xiaoguang Han, Baoyuan Wang
  • for: Reconstructing human avatars from few-shot unconstrained photos.
  • methods: The paper integrates a skinning mechanism with deep marching tetrahedra (DMTet) and a two-phase optimization scheme (few-shot reference and few-shot guidance) to handle dynamic articulated poses and scarce data.
  • results: The HaveFun framework supports avatar reconstruction, rendering, and animation; experiments on the developed benchmarks show it substantially outperforms other methods.
    Abstract As for human avatar reconstruction, contemporary techniques commonly necessitate the acquisition of costly data and struggle to achieve satisfactory results from a small number of casual images. In this paper, we investigate this task from a few-shot unconstrained photo album. The reconstruction of human avatars from such data sources is challenging because of limited data amount and dynamic articulated poses. For handling dynamic data, we integrate a skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable tetrahedral representation, which drives arbitrary mesh topologies generated by the DMTet for the adaptation of unconstrained images. To effectively mine instructive information from few-shot data, we devise a two-phase optimization method with few-shot reference and few-shot guidance. The former focuses on aligning avatar identity with reference images, while the latter aims to generate plausible appearances for unseen regions. Overall, our framework, called HaveFun, can undertake avatar reconstruction, rendering, and animation. Extensive experiments on our developed benchmarks demonstrate that HaveFun exhibits substantially superior performance in reconstructing the human body and hand. Project website: https://seanchenxy.github.io/HaveFunWeb/.

Deformation-Guided Unsupervised Non-Rigid Shape Matching

  • paper_url: http://arxiv.org/abs/2311.15668
  • repo_url: None
  • paper_authors: Aymen Merrouche, Joao Regateiro, Stefanie Wuhrer, Edmond Boyer
  • for: Unsupervised non-rigid shape matching, a fundamental step in many computer vision and graphics applications.
  • methods: A hierarchical patch-based shape representation matches shapes in a coarse-to-fine manner, and the matching is constrained in 3D by fitting a patch-wise near-rigid deformation model, yielding robust unsupervised correspondences.
  • results: On raw 3D scans the method obtains significantly better results than state-of-the-art methods, while performing on par in standard test scenarios.
    Abstract We present an unsupervised data-driven approach for non-rigid shape matching. Shape matching identifies correspondences between two shapes and is a fundamental step in many computer vision and graphics applications. Our approach is designed to be particularly robust when matching shapes digitized using 3D scanners that contain fine geometric detail and suffer from different types of noise including topological noise caused by the coalescence of spatially close surface regions. We build on two strategies. First, using a hierarchical patch based shape representation we match shapes consistently in a coarse to fine manner, allowing for robustness to noise. This multi-scale representation drastically reduces the dimensionality of the problem when matching at the coarsest scale, rendering unsupervised learning feasible. Second, we constrain this hierarchical matching to be reflected in 3D by fitting a patch-wise near-rigid deformation model. Using this constraint, we leverage spatial continuity at different scales to capture global shape properties, resulting in matchings that generalize well to data with different deformations and noise characteristics. Experiments demonstrate that our approach obtains significantly better results on raw 3D scans than state-of-the-art methods, while performing on-par on standard test scenarios.

Technical Report for Argoverse Challenges on 4D Occupancy Forecasting

  • paper_url: http://arxiv.org/abs/2311.15660
  • repo_url: None
  • paper_authors: Pengfei Zheng, Kanokphan Lertniphonphan, Feng Chen, Siwei Chen, Bingchuan Sun, Jun Xie, Zhepeng Wang
  • for: This report addresses the 4D occupancy forecasting task of the Argoverse Challenges at the CVPR 2023 Workshop on Autonomous Driving (WAD).
  • methods: The solution uses a strong LiDAR-based Bird's Eye View (BEV) encoder with temporal fusion and a two-stage decoder that combines a DETR head with a UNet decoder.
  • results: Evaluated on the Argoverse 2 sensor dataset for occupancy 3 seconds into the future, the solution reduces L1 error by 18% relative to the baseline (3.57) and took 1st place on the 4D Occupancy Forecasting task of the Argoverse Challenges.
    Abstract This report presents our Le3DE2E_Occ solution for 4D Occupancy Forecasting in Argoverse Challenges at CVPR 2023 Workshop on Autonomous Driving (WAD). Our solution consists of a strong LiDAR-based Bird's Eye View (BEV) encoder with temporal fusion and a two-stage decoder, which combines a DETR head and a UNet decoder. The solution was tested on the Argoverse 2 sensor dataset to evaluate the occupancy state 3 seconds in the future. Our solution achieved 18% lower L1 Error (3.57) than the baseline and got the 1 place on the 4D Occupancy Forecasting task in Argoverse Challenges at CVPR 2023.

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2311.15657
  • repo_url: https://github.com/chaofengc/texforce
  • paper_authors: Chaofeng Chen, Annan Wang, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin
  • for: Improving the text-image alignment of diffusion models, and thereby the visual quality of their outputs.
  • methods: Prior work refines the diffusion U-Net with human rewards via reinforcement learning or direct backpropagation but overlooks the text encoder, which is typically pretrained and frozen; this paper instead finetunes the text encoder with reinforcement learning and low-rank adaptation (TexForce) based on task-specific rewards.
  • results: Finetuning the text encoder improves the performance of diffusion models; TexForce can be simply combined with existing finetuned U-Net models without additional training, and the method adapts to diverse applications, including the generation of high-quality face and hand images.
    Abstract Text-to-image diffusion models are typically trained to optimize the log-likelihood objective, which presents challenges in meeting specific requirements for downstream tasks, such as image aesthetics and image-text alignment. Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation. However, many of them overlook the importance of the text encoder, which is typically pretrained and fixed during training. In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. Our primary motivation comes from the observation that the current text encoder is suboptimal, often requiring careful prompt adjustment. While fine-tuning the U-Net can partially improve performance, it remains suffering from the suboptimal text encoder. Therefore, we propose to use reinforcement learning with low-rank adaptation to finetune the text encoder based on task-specific rewards, referred as \textbf{TexForce}. We first show that finetuning the text encoder can improve the performance of diffusion models. Then, we illustrate that TexForce can be simply combined with existing U-Net finetuned models to get much better results without additional training. Finally, we showcase the adaptability of our method in diverse applications, including the generation of high-quality face and hand images.

Mitigating Hallucination in Visual Language Models with Visual Supervision

  • paper_url: http://arxiv.org/abs/2311.16479
  • repo_url: None
  • paper_authors: Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, Ming Tang
  • for: The paper aims to improve large vision-language models (LVLMs) by addressing hallucination in their responses.
  • methods: It generates image-text pairs with detailed relationship annotations from the panoptic scene graph dataset (PSG), integrates SAM and a mask prediction loss as auxiliary supervision, and proposes a new benchmark, RAH-Bench, to evaluate hallucination in LVLMs.
  • results: The approach yields a +8.4% enhancement over the original LLaVA model and widespread performance improvements across other models.
    Abstract Large vision-language models (LVLMs) suffer from hallucination a lot, generating responses that apparently contradict to the image content occasionally. The key problem lies in its weak ability to comprehend detailed content in a multi-modal context, which can be mainly attributed to two factors in training data and loss function. The vision instruction dataset primarily focuses on global description, and the auto-regressive loss function favors text modeling rather than image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate the training of LVLMs, so that they can generate more precise responses without encounter hallucination. On one hand, we generate image-text pairs with detailed relationship annotations in panoptic scene graph dataset (PSG). These conversations pay more attention on detailed facts in the image, encouraging the model to answer questions based on multi-modal contexts. On the other hand, we integrate SAM and mask prediction loss as auxiliary supervision, forcing the LVLMs to have the capacity to identify context-related objects, so that they can generate more accurate responses, mitigating hallucination. Moreover, to provide a deeper evaluation on the hallucination in LVLMs, we propose a new benchmark, RAH-Bench. It divides vision hallucination into three different types that contradicts the image with wrong categories, attributes or relations, and introduces False Positive Rate as detailed sub-metric for each type. In this benchmark, our approach demonstrates an +8.4% enhancement compared to original LLaVA and achieves widespread performance improvements across other models.

PaintNeSF: Artistic Creation of Stylized Scenes with Vectorized 3D Strokes

  • paper_url: http://arxiv.org/abs/2311.15637
  • repo_url: None
  • paper_authors: Hao-Bin Duan, Miao Wang, Yan-Xun Li, Yong-Liang Yang
  • for: Generating stylized images of a 3D scene from arbitrary novel views.
  • methods: The method simulates the progressive painting process of human artwork with vector strokes: a palette of stylized 3D strokes is built from basic primitives and splines, and scene stylization is treated as a multi-view reconstruction from these 3D stroke primitives, optimized via a differentiable renderer.
  • results: The approach effectively synthesizes 3D scenes with significant geometric and aesthetic stylization while maintaining a consistent appearance across different views.
    Abstract We present Paint Neural Stroke Field (PaintNeSF), a novel technique to generate stylized images of a 3D scene at arbitrary novel views from multi-view 2D images. Different from existing methods which apply stylization to trained neural radiance fields at the voxel level, our approach draws inspiration from image-to-painting methods, simulating the progressive painting process of human artwork with vector strokes. We develop a palette of stylized 3D strokes from basic primitives and splines, and consider the 3D scene stylization task as a multi-view reconstruction process based on these 3D stroke primitives. Instead of directly searching for the parameters of these 3D strokes, which would be too costly, we introduce a differentiable renderer that allows optimizing stroke parameters using gradient descent, and propose a training scheme to alleviate the vanishing gradient issue. The extensive evaluation demonstrates that our approach effectively synthesizes 3D scenes with significant geometric and aesthetic stylization while maintaining a consistent appearance across different views. Our method can be further integrated with style loss and image-text contrastive models to extend its applications, including color transfer and text-driven 3D scene drawing.

Only Positive Cases: 5-fold High-order Attention Interaction Model for Skin Segmentation Derived Classification

  • paper_url: http://arxiv.org/abs/2311.15625
  • repo_url: https://github.com/wurenkai/MHA-UNet
  • paper_authors: Renkai Wu, Yinghao Liu, Pengchen Liang, Qing Chang
  • for: A computer-aided diagnosis tool for skin lesions that helps dermatologists and patients understand the learning and prediction process behind the diagnosis.
  • methods: The paper proposes a multiple high-order attention interaction model (MHA-UNet) that determines the presence or absence of a lesion through explainable reasoning, without training on negative samples. A high-order attention interaction mechanism introduces squeeze attention at a higher level of feature attention, and a multiple high-order attention interaction (MHAblock) module combines features of different orders for segmentation and classification.
  • results: Classification experiments without negative samples, based on explainable reasoning over the five attention orders of the MHAblock, reach a highest positive detection rate of 81.0% and a highest negative detection rate of 83.5%; comparisons with 13 medical segmentation models and external validation against 8 state-of-the-art models on three public datasets and a clinical dataset demonstrate state-of-the-art performance.
    Abstract Computer-aided diagnosis of skin diseases is an important tool. However, the interpretability of computer-aided diagnosis is currently poor. Dermatologists and patients cannot intuitively understand the learning and prediction process of neural networks, which will lead to a decrease in the credibility of computer-aided diagnosis. In addition, traditional methods need to be trained using negative samples in order to predict the presence or absence of a lesion, but medical data is often in short supply. In this paper, we propose a multiple high-order attention interaction model (MHA-UNet) for use in a highly explainable skin lesion segmentation task. MHA-UNet is able to obtain the presence or absence of a lesion by explainable reasoning without the need for training on negative samples. Specifically, we propose a high-order attention interaction mechanism that introduces squeeze attention to a higher level for feature attention. In addition, a multiple high-order attention interaction (MHAblock) module is proposed by combining the different features of different orders. For classifying the presence or absence of lesions, we conducted classification experiments on several publicly available datasets in the absence of negative samples, based on explainable reasoning about the interaction of 5 attention orders of MHAblock. The highest positive detection rate obtained from the experiments was 81.0% and the highest negative detection rate was 83.5%. For segmentation experiments, comparison experiments of the proposed method with 13 medical segmentation models and external validation experiments with 8 state-of-the-art models in three public datasets and our clinical dataset demonstrate the state-of-the-art performance of our model. The code is available from https://github.com/wurenkai/MHA-UNet.

Technical Report for Argoverse Challenges on Unified Sensor-based Detection, Tracking, and Forecasting

  • paper_url: http://arxiv.org/abs/2311.15615
  • repo_url: None
  • paper_authors: Zhepeng Wang, Feng Chen, Kanokphan Lertniphonphan, Siwei Chen, Jinyao Bao, Pengfei Zheng, Jinbao Zhang, Kaer Huang, Tao Zhang
  • for: This report presents a unified sensor-based solution for detection, tracking, and forecasting in the Argoverse Challenges at the CVPR 2023 Workshop on Autonomous Driving (WAD).
  • methods: A unified network integrates the three tasks, adopting a strong Bird's Eye View (BEV) encoder with spatial and temporal fusion to generate shared representations for the multiple tasks.
  • results: Tested on the Argoverse 2 sensor dataset for detection, tracking, and forecasting of 26 object categories, the solution achieved 1st place on the E2E Forecasting track of the Argoverse Challenges.
    Abstract This report presents our Le3DE2E solution for unified sensor-based detection, tracking, and forecasting in Argoverse Challenges at CVPR 2023 Workshop on Autonomous Driving (WAD). We propose a unified network that incorporates three tasks, including detection, tracking, and forecasting. This solution adopts a strong Bird's Eye View (BEV) encoder with spatial and temporal fusion and generates unified representations for multi-tasks. The solution was tested in the Argoverse 2 sensor dataset to evaluate the detection, tracking, and forecasting of 26 object categories. We achieved 1st place in Detection, Tracking, and Forecasting on the E2E Forecasting track in Argoverse Challenges at CVPR 2023 WAD.

RetouchUAA: Unconstrained Adversarial Attack via Image Retouching

  • paper_url: http://arxiv.org/abs/2311.16478
  • repo_url: None
  • paper_authors: Mengda Xie, Yiling He, Meie Fang
  • for: RetouchUAA attacks deep neural networks (DNNs) by exploiting image retouching styles, which are more realistic and interpretable than traditional perturbations.
  • methods: It uses a custom-designed image retouching attack framework and a retouching style guidance module to generate realistic, interpretable perturbations; the framework linearizes images and models human retouching behavior, while the guidance module keeps the perturbations within standard retouching styles.
  • results: RetouchUAA achieves nearly 100% white-box attack success against three DNNs on ImageNet and Place365, with a better trade-off between image naturalness, transferability, and defense robustness than baseline attacks.
    Abstract Deep Neural Networks (DNNs) are susceptible to adversarial examples. Conventional attacks generate controlled noise-like perturbations that fail to reflect real-world scenarios and hard to interpretable. In contrast, recent unconstrained attacks mimic natural image transformations occurring in the real world for perceptible but inconspicuous attacks, yet compromise realism due to neglect of image post-processing and uncontrolled attack direction. In this paper, we propose RetouchUAA, an unconstrained attack that exploits a real-life perturbation: image retouching styles, highlighting its potential threat to DNNs. Compared to existing attacks, RetouchUAA offers several notable advantages. Firstly, RetouchUAA excels in generating interpretable and realistic perturbations through two key designs: the image retouching attack framework and the retouching style guidance module. The former custom-designed human-interpretability retouching framework for adversarial attack by linearizing images while modelling the local processing and retouching decision-making in human retouching behaviour, provides an explicit and reasonable pipeline for understanding the robustness of DNNs against retouching. The latter guides the adversarial image towards standard retouching styles, thereby ensuring its realism. Secondly, attributed to the design of the retouching decision regularization and the persistent attack strategy, RetouchUAA also exhibits outstanding attack capability and defense robustness, posing a heavy threat to DNNs. Experiments on ImageNet and Place365 reveal that RetouchUAA achieves nearly 100\% white-box attack success against three DNNs, while achieving a better trade-off between image naturalness, transferability and defense robustness than baseline attacks.

Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars

  • paper_url: http://arxiv.org/abs/2311.16482
  • repo_url: https://github.com/jimmyYliu/Animatable-3D-Gaussian
  • paper_authors: Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang
  • for: Reconstructing high-quality drivable human avatars while avoiding the high training and rendering cost of neural radiance fields.
  • methods: The method learns human avatars from input images and poses with 3D Gaussians, extending them to dynamic human scenes via a set of skinned 3D Gaussians and a canonical skeleton that are deformed into posed space according to the input poses.
  • results: The method delivers high-quality reconstruction and novel view/pose synthesis across viewpoints and poses, at a much lower training and rendering cost than prior work, and extends to multi-human scenes.
    Abstract Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render. To reduce consumption, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming 3D Gaussians to posed space according to the input poses. We introduce hash-encoded shape and appearance to speed up training and propose time-dependent ambient occlusion to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method outperforms existing methods in terms of training time, rendering speed, and reconstruction quality. Our method can be easily extended to multi-human scenes and achieve comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.
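The core idea, skinning a set of canonical 3D Gaussians to a posed skeleton, can be sketched as standard linear blend skinning applied to the Gaussian centers (in the full method the covariances are deformed as well). The tensor shapes and the restriction to centers below are simplifying assumptions.

```python
import torch

def skin_gaussian_centers(centers_canon, skin_weights, bone_transforms):
    """centers_canon: (N, 3) Gaussian means in canonical space.
    skin_weights:   (N, J) per-Gaussian skinning weights over J joints (rows sum to 1).
    bone_transforms: (J, 4, 4) rigid transforms mapping canonical space to posed space."""
    # Blend the bone transforms per Gaussian (classic linear blend skinning).
    blended = torch.einsum('nj,jab->nab', skin_weights, bone_transforms)          # (N, 4, 4)
    homo = torch.cat([centers_canon, torch.ones(len(centers_canon), 1)], dim=1)   # (N, 4)
    posed = torch.einsum('nab,nb->na', blended, homo)[:, :3]
    # Covariances would additionally be rotated by blended[:, :3, :3].
    return posed

# usage: posed_means = skin_gaussian_centers(mu_canonical, W, per_frame_bone_transforms)
```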

A manometric feature descriptor with linear-SVM to distinguish esophageal contraction vigor

  • paper_url: http://arxiv.org/abs/2311.15609
  • repo_url: None
  • paper_authors: Jialin Liu, Lu Yan, Xiaowei Liu, Yuzhuo Dai, Fanggen Lu, Yuanting Ma, Muzhou Hou, Zheng Wang
  • for: Assisting the clinical diagnosis and evaluation of esophageal motility disorders.
  • methods: High-resolution manometry (HRM) measures esophageal dynamic function, and image processing of the HRM results is used to predict esophageal contraction vigor; features of the proposal of swallow (PoS) are extracted with Feature-Extraction and Histogram of Gradients (FE-HOG) and classified with a linear SVM.
  • results: The approach classifies esophageal contraction vigor as normal, weak, or failed with an accuracy of 86.83%, higher than other common machine learning methods.
    Abstract In clinical practice, if a patient presents with nonmechanical obstructive dysphagia, esophageal chest pain, and gastroesophageal reflux symptoms, the physician will usually assess esophageal dynamic function. High-resolution manometry (HRM) is a commonly used clinical technique for detecting esophageal dynamic function comprehensively and objectively. However, after the HRM results are obtained, doctors still need to evaluate a variety of parameters; this work is burdensome and the process is complex. We conducted image processing of HRM to predict esophageal contraction vigor and assist the evaluation of esophageal dynamic function. First, we used Feature-Extraction and Histogram of Gradients (FE-HOG) to analyze features of the proposal of swallow (PoS) and extract higher-order features. We then classify esophageal contraction vigor as normal, weak, or failed using a linear SVM on these features. Our dataset includes 3000 training samples, 500 validation samples, and 411 test samples. After verification, our accuracy reaches 86.83%, which is higher than other common machine learning methods.
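A stripped-down version of the HOG-plus-linear-SVM pipeline described above might look as follows; the HOG parameters and the three-way label encoding (normal/weak/failed) are assumptions, and the paper's additional feature-extraction step is omitted.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def hog_features(pos_images):
    """pos_images: iterable of 2D grayscale proposal-of-swallow (PoS) images of the same size."""
    return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for img in pos_images])

def train_and_eval(X_train, y_train, X_test, y_test):
    # y_*: 0 = normal, 1 = weak, 2 = failed (assumed encoding)
    clf = LinearSVC(C=1.0, max_iter=10000)
    clf.fit(hog_features(X_train), y_train)
    pred = clf.predict(hog_features(X_test))
    return accuracy_score(y_test, pred)
```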

2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.15605
  • repo_url: None
  • paper_authors: Ozan Unal, Dengxin Dai, Lukas Hoyer, Yigit Baran Can, Luc Van Gool
  • for: Reducing the need for densely annotated large-scale datasets in LiDAR semantic segmentation through weakly-supervised training.
  • methods: RGB images provide a denser representation of the scene, and high-level features are distilled from a domain-adapted, synthetically trained 2D semantic segmentation network; a one-way contrastive learning scheme and a novel mixing strategy called FOVMix address the horizontal field-of-view mismatch between the LiDAR and the RGB camera and enhance the effect of image guidance.
  • results: IGNet achieves state-of-the-art weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, reaching up to 98% of the relative performance of fully supervised training with only 8% labeled points and no additional annotation or inference cost; it also sets state-of-the-art results for semi-supervised training on ScribbleKITTI and SemanticKITTI.
    Abstract As 3D perception problems grow in popularity and the need for large-scale labeled datasets for LiDAR semantic segmentation increase, new methods arise that aim to reduce the necessity for dense annotations by employing weakly-supervised training. However these methods continue to show weak boundary estimation and high false negative rates for small objects and distant sparse regions. We argue that such weaknesses can be compensated by using RGB images which provide a denser representation of the scene. We propose an image-guidance network (IGNet) which builds upon the idea of distilling high level feature information from a domain adapted synthetically trained 2D semantic segmentation network. We further utilize a one-way contrastive learning scheme alongside a novel mixing strategy called FOVMix, to combat the horizontal field-of-view mismatch between the two sensors and enhance the effects of image guidance. IGNet achieves state-of-the-art results for weakly-supervised LiDAR semantic segmentation on ScribbleKITTI, boasting up to 98% relative performance to fully supervised training with only 8% labeled points, while introducing no additional annotation burden or computational/memory cost during inference. Furthermore, we show that our contributions also prove effective for semi-supervised training, where IGNet claims state-of-the-art results on both ScribbleKITTI and SemanticKITTI.

Progressive Target-Styled Feature Augmentation for Unsupervised Domain Adaptation on Point Clouds

  • paper_url: http://arxiv.org/abs/2311.16474
  • repo_url: https://github.com/xiaoyao3302/ptsfa
  • paper_authors: Zicheng Wang, Zhen Zhao, Yiming Wu, Luping Zhou, Dong Xu
  • for: This work tackles unsupervised domain adaptation for point cloud analysis, where models trained on one domain often degrade under domain shift in new scenarios.
  • methods: It proposes progressive target-styled feature augmentation (PTSFA), which, unlike prior methods that adapt the feature extractor, adapts the classifier so it can recognize target-styled source features and progressively adjust to the target domain; an intermediate domain approaching (IDA) strategy further improves prediction reliability and encourages discriminative features.
  • results: The method achieves new state-of-the-art performance on the benchmark datasets.
    Abstract Unsupervised domain adaptation is a critical challenge in the field of point cloud analysis, as models trained on one set of data often struggle to perform well in new scenarios due to domain shifts. Previous works tackle the problem by using adversarial training or self-supervised learning for feature extractor adaptation, but ensuring that features extracted from the target domain can be distinguished by the source-supervised classifier remains challenging. In this work, we propose a novel approach called progressive target-styled feature augmentation (PTSFA). Unlike previous works that focus on feature extractor adaptation, our PTSFA approach focuses on classifier adaptation. It aims to empower the classifier to recognize target-styled source features and progressively adapt to the target domain. To enhance the reliability of predictions within the PTSFA framework and encourage discriminative feature extraction, we further introduce a new intermediate domain approaching (IDA) strategy. We validate our method on the benchmark datasets, where our method achieves new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/PTSFA.
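One simple way to obtain "target-styled" source features, used purely for illustration in the sketch below, is to progressively re-normalize source features toward the statistics of target features (an AdaIN-style transfer); the progression schedule and the choice of statistics are assumptions, not the PTSFA recipe itself.

```python
import torch

def target_styled_features(src_feat, tgt_feat, progress):
    """src_feat, tgt_feat: (B, D) mini-batch features from the feature extractor.
    progress: float in [0, 1], ramped up over training so the classifier sees
    increasingly target-styled source features."""
    src_mu, src_std = src_feat.mean(0), src_feat.std(0) + 1e-6
    tgt_mu, tgt_std = tgt_feat.mean(0), tgt_feat.std(0) + 1e-6
    # AdaIN-style restyling of the source batch with target statistics.
    restyled = (src_feat - src_mu) / src_std * tgt_std + tgt_mu
    # Interpolate between the original and fully restyled features.
    return (1.0 - progress) * src_feat + progress * restyled

# usage: aug = target_styled_features(f_src, f_tgt, progress=epoch / num_epochs)
# the classifier is then trained on (aug, source_labels), adapting it toward the target domain
```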

LFSRDiff: Light Field Image Super-Resolution via Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16517
  • repo_url: https://github.com/chaowentao/lfsrdiff
  • paper_authors: Wentao Chao, Fuqing Duan, Xuechun Wang, Yingqian Wang, Guanghui Wang
  • for: This work targets light field (LF) image super-resolution (SR), which is challenging due to the inherently ill-posed nature of LF images.
  • methods: The proposed LFSRDiff incorporates a disentangled U-Net into diffusion models to effectively extract and fuse the spatial and angular information within LF images.
  • results: The approach consistently produces diverse and realistic SR results, achieves the highest perceptual score in terms of LPIPS, and can effectively control the trade-off between perception and distortion.
    Abstract Light field (LF) image super-resolution (SR) is a challenging problem due to its inherent ill-posed nature, where a single low-resolution (LR) input LF image can correspond to multiple potential super-resolved outcomes. Despite this complexity, mainstream LF image SR methods typically adopt a deterministic approach, generating only a single output supervised by pixel-wise loss functions. This tendency often results in blurry and unrealistic results. Although diffusion models can capture the distribution of potential SR results by iteratively predicting Gaussian noise during the denoising process, they are primarily designed for general images and struggle to effectively handle the unique characteristics and information present in LF images. To address these limitations, we introduce LFSRDiff, the first diffusion-based LF image SR model, by incorporating the LF disentanglement mechanism. Our novel contribution includes the introduction of a disentangled U-Net for diffusion models, enabling more effective extraction and fusion of both spatial and angular information within LF images. Through comprehensive experimental evaluations and comparisons with the state-of-the-art LF image SR methods, the proposed approach consistently produces diverse and realistic SR results. It achieves the highest perceptual metric in terms of LPIPS. It also demonstrates the ability to effectively control the trade-off between perception and distortion. The code is available at \url{https://github.com/chaowentao/LFSRDiff}.

An Ensemble of 2.5D ResUnet Based Models for Segmentation for Kidney and Masses

  • paper_url: http://arxiv.org/abs/2311.15586
  • repo_url: None
  • paper_authors: Cancan Chen, RongguoZhang
  • for: Efficient segmentation of kidney, kidney tumor, and kidney cyst in computed tomography (CT) scans.
  • methods: A coarse-to-fine semantic segmentation framework built on 2.5D ResUnet; 489 CT scans are used for training and validation, and an independent, never-before-used set of CT scans for testing.
  • results: Dice values of 0.954, 0.792, and 0.691 and surface Dice values of 0.897, 0.591, and 0.541 for kidney, tumor, and cyst respectively, with an average inference time of 20.65 s per scan and a maximum GPU memory footprint of 3525 MB, indicating a good trade-off between model performance and efficiency.
    Abstract The automatic segmentation of kidney, kidney tumor and kidney cyst on Computed Tomography (CT) scans is a challenging task due to the indistinct lesion boundaries and fuzzy texture. Considering the large range and unbalanced distribution of CT scans' thickness, 2.5D ResUnet are adopted to build an efficient coarse-to-fine semantic segmentation framework in this work. A set of 489 CT scans are used for training and validation, and an independent never-before-used CT scans for testing. Finally, we demonstrate the effectiveness of our proposed method. The dice values on test set are 0.954, 0.792, 0.691, the surface dice values are 0.897, 0.591, 0.541 for kidney, tumor and cyst, respectively. The average inference time of each CT scan is 20.65s and the max GPU memory is 3525MB. The results suggest that a better trade-off between model performance and efficiency.
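For reference, the Dice values quoted above are computed per structure roughly as below; the binary masks and the smoothing constant follow the usual convention, and surface Dice additionally restricts the overlap to boundary voxels within a tolerance (not shown here).

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """pred, gt: binary 3D masks for one structure (kidney, tumor, or cyst)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```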

A deep learning approach for marine snow synthesis and removal

  • paper_url: http://arxiv.org/abs/2311.15584
  • repo_url: https://github.com/fergaletto/mssr
  • paper_authors: Fernando Galetto, Guang Deng
  • for: Improving the visibility of underwater images and the performance of human and machine vision systems by removing marine snow.
  • methods: A deep learning pipeline that first synthesizes realistic marine snow samples with a GAN and combines them with natural underwater images to build a paired dataset, then trains a U-Net to remove marine snow as an image-to-image translation task.
  • results: The U-Net removes both synthetic and natural marine snow with high accuracy, outperforming the median filter and its adaptive variant, and remains robust on the MSRB dataset, which contains synthetic artifacts unseen during training.
    Abstract Marine snow, the floating particles in underwater images, severely degrades the visibility and performance of human and machine vision systems. This paper proposes a novel method to reduce the marine snow interference using deep learning techniques. We first synthesize realistic marine snow samples by training a Generative Adversarial Network (GAN) model and combine them with natural underwater images to create a paired dataset. We then train a U-Net model to perform marine snow removal as an image to image translation task. Our experiments show that the U-Net model can effectively remove both synthetic and natural marine snow with high accuracy, outperforming state-of-the-art methods such as the Median filter and its adaptive variant. We also demonstrate the robustness of our method by testing it on the MSRB dataset, which contains synthetic artifacts that our model has not seen during training. Our method is a practical and efficient solution for enhancing underwater images affected by marine snow.
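The paired-dataset construction step can be sketched as compositing generated snow particles onto clean underwater frames; here the particle layer and its opacity map stand in for samples from the trained GAN, which is an assumption.

```python
import numpy as np

def composite_marine_snow(clean_img: np.ndarray, snow_layer: np.ndarray, alpha: np.ndarray):
    """clean_img: HxWx3 float image in [0, 1]; snow_layer: HxWx3 bright particle layer
    (e.g. sampled from the trained GAN); alpha: HxW opacity map of the particles."""
    alpha = alpha[..., None]
    degraded = (1.0 - alpha) * clean_img + alpha * snow_layer   # alpha-blend particles over the scene
    return np.clip(degraded, 0.0, 1.0)

# Each (degraded, clean_img) pair then supervises the U-Net as an
# image-to-image translation task (degraded -> clean).
```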

Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings

  • paper_url: http://arxiv.org/abs/2311.15581
  • repo_url: None
  • paper_authors: Sudheer Achary, Rohit Girmaji, Adhiraj Anil Deshmukh, Vineet Gandhi
  • for: Real-time video editing and camera trajectory stabilization.
  • methods: A real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach.
  • results: Comparative evaluations against baselines, including the non-real-time GAZED, show similar editing results with high-quality video output, and a user study confirms the aesthetic quality of the resulting edits.
    Abstract Eliminating time-consuming post-production processes and delivering high-quality videos in today's fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that Real Time GAZED achieves similar editing results, ensuring high-quality video output. Furthermore, a user study confirms the aesthetic quality of the video edits produced by the Real Time GAZED approach. With these advancements in real-time camera trajectory optimization and video editing presented, the demand for immediate and dynamic content creation in industries such as live broadcasting, sports coverage, news reporting, and social media content creation can be met more efficiently.

EucliDreamer: Fast and High-Quality Texturing for 3D Models with Stable Diffusion Depth

  • paper_url: http://arxiv.org/abs/2311.15573
  • repo_url: None
  • paper_authors: Cindy Le, Congrui Hetang, Ang Cao, Yihui He
  • for: Generating textures for 3D models from text prompts and 3D meshes.
  • methods: The method runs the Score Distillation Sampling (SDS) process with depth-conditional Stable Diffusion, taking additional depth information into account.
  • results: On the open-source Objaverse dataset and in a user study, the model produces more satisfactory results and varied art styles for the same object, while generating textures of comparable quality faster; thorough ablations cover sampling steps, guidance scale, negative prompts, data augmentation, elevation range, and alternatives to SDS.
    Abstract This paper presents a novel method to generate textures for 3D models given text prompts and 3D meshes. Additional depth information is taken into account to perform the Score Distillation Sampling (SDS) process [28] with depth conditional Stable Diffusion [34]. We ran our model over the open-source dataset Objaverse [7] and conducted a user study to compare the results with those of various 3D texturing methods. We have shown that our model can generate more satisfactory results and produce various art styles for the same object. In addition, we achieved faster time when generating textures of comparable quality. We also conduct thorough ablation studies of how different factors may affect generation quality, including sampling steps, guidance scale, negative prompts, data augmentation, elevation range, and alternatives to SDS.
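Below is a hedged sketch of a single depth-conditioned SDS step; `render_latents` stands in for a differentiable renderer of the textured mesh and `depth_unet` for a depth-conditional diffusion noise predictor, both assumed interfaces rather than real library calls.

```python
import torch

def sds_step(texture_params, render_latents, depth_unet, text_emb, alphas_cumprod, guidance_scale=50.0):
    """One depth-conditioned Score Distillation Sampling update of the texture parameters.
    render_latents: differentiable callable -> (latents, depth) for a random view (assumed).
    depth_unet: frozen depth-conditional diffusion noise predictor (assumed interface).
    alphas_cumprod: (T,) diffusion schedule tensor on the same device (assumed)."""
    latents, depth = render_latents(texture_params)
    t = torch.randint(20, 980, (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise
    with torch.no_grad():                                    # the diffusion prior stays frozen
        eps_text = depth_unet(noisy, t, text_emb, depth)
        eps_null = depth_unet(noisy, t, None, depth)         # unconditional branch for guidance
        eps = eps_null + guidance_scale * (eps_text - eps_null)
    w = 1.0 - a_t                                            # common SDS weighting
    grad = w * (eps - noise)
    # Surrogate loss whose gradient w.r.t. the latents equals `grad`; backprop reaches the texture.
    (grad.detach() * latents).sum().backward()
```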

Video-based Visible-Infrared Person Re-Identification with Auxiliary Samples

  • paper_url: http://arxiv.org/abs/2311.15571
  • repo_url: https://github.com/dyhbupt/buptcampus
  • paper_authors: Yunhao Du, Cheng Lei, Zhicheng Zhao, Yuan Dong, Fei Su
  • for: Matching persons across visible and infrared cameras for retrieval and tracking in 24-hour surveillance systems.
  • methods: The paper contributes the large-scale video-based dataset BUPTCampus and builds a two-stream baseline with a GAN to narrow the modality gap, a curriculum learning strategy that jointly learns from the primary and auxiliary sets, and a novel temporal k-reciprocal re-ranking method that refines the ranking list with fine-grained temporal correlation cues.
  • results: Experiments demonstrate the effectiveness of the proposed components, and the method shows substantial superiority over 9 state-of-the-art image- and video-based VI-ReID methods reproduced on BUPTCampus.
    Abstract Visible-infrared person re-identification (VI-ReID) aims to match persons captured by visible and infrared cameras, allowing person retrieval and tracking in 24-hour surveillance systems. Previous methods focus on learning from cross-modality person images in different cameras. However, temporal information and single-camera samples tend to be neglected. To crack this nut, in this paper, we first contribute a large-scale VI-ReID dataset named BUPTCampus. Different from most existing VI-ReID datasets, it 1) collects tracklets instead of images to introduce rich temporal information, 2) contains pixel-aligned cross-modality sample pairs for better modality-invariant learning, 3) provides one auxiliary set to help enhance the optimization, in which each identity only appears in a single camera. Based on our constructed dataset, we present a two-stream framework as baseline and apply Generative Adversarial Network (GAN) to narrow the gap between the two modalities. To exploit the advantages introduced by the auxiliary set, we propose a curriculum learning based strategy to jointly learn from both primary and auxiliary sets. Moreover, we design a novel temporal k-reciprocal re-ranking method to refine the ranking list with fine-grained temporal correlation cues. Experimental results demonstrate the effectiveness of the proposed methods. We also reproduce 9 state-of-the-art image-based and video-based VI-ReID methods on BUPTCampus and our methods show substantial superiority to them. The codes and dataset are available at: https://github.com/dyhBUPT/BUPTCampus.

UFDA: Universal Federated Domain Adaptation with Practical Assumptions

  • paper_url: http://arxiv.org/abs/2311.15570
  • repo_url: None
  • paper_authors: Xinhui Liu, Zhenghao Chen, Luping Zhou, Dong Xu, Wei Xi, Gairui Bai, Yihan Zhao, Jizhong Zhao
  • for: Making federated domain adaptation (FDA) practical for real-world settings.
  • methods: The paper introduces a more practical scenario, Universal Federated Domain Adaptation (UFDA), that only requires the black-box model and the label set of each source domain, allowing inconsistent source label sets and a completely unknown target label set; a corresponding framework, Hot-Learning with Contrastive Label Disambiguation (HCLD), uses one-hot outputs from the source models, and a cluster-level Mutual-Voting Decision (MVD) strategy extracts robust consensus knowledge to distinguish shared and unknown classes.
  • results: Experiments on three benchmarks show performance comparable to previous methods that rely on many additional assumptions.
    Abstract Conventional Federated Domain Adaptation (FDA) approaches usually demand an abundance of assumptions, such as label set consistency, which makes them significantly less feasible for real-world situations and introduces security hazards. In this work, we propose a more practical scenario named Universal Federated Domain Adaptation (UFDA). It only requires the black-box model and the label set information of each source domain, while the label sets of different source domains could be inconsistent and the target-domain label set is totally blind. This relaxes the assumptions made by FDA, which are often challenging to meet in real-world cases and diminish model security. To address the UFDA scenario, we propose a corresponding framework called Hot-Learning with Contrastive Label Disambiguation (HCLD), which tackles UFDA's domain shifts and category gaps problem by using one-hot outputs from the black-box models of various source domains. Moreover, to better distinguish the shared and unknown classes, we further present a cluster-level strategy named Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains. The extensive experiments on three benchmarks demonstrate that our HCLD achieves comparable performance for our UFDA scenario with much fewer assumptions, compared to the previous methodologies with many additional assumptions.
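To make the cluster-level voting idea concrete, a simplified version is sketched below: target samples are clustered, each source model casts a majority vote per cluster, and a cluster is treated as a shared class only when the peer models agree strongly enough. The clustering choice, the agreement threshold, and the assumption of a common label-id mapping are illustrative simplifications, not the MVD strategy itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def mutual_voting_decision(target_feats, source_model_preds, n_clusters=50, agree_thresh=0.6):
    """target_feats: (N, D) target-domain features.
    source_model_preds: list over source domains of (N,) predicted class ids
    (black-box outputs; assumed mapped to a common id space for this sketch)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(target_feats)
    shared, unknown = [], []
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        # Each source model votes for its majority class on this cluster.
        votes = [np.bincount(preds[idx]).argmax() for preds in source_model_preds]
        # Agreement ratio: how consistently the peer models label the cluster.
        agreement = max(np.mean([v == w for w in votes]) for v in votes)
        (shared if agreement >= agree_thresh else unknown).append(c)
    return shared, unknown   # cluster ids treated as shared vs. unknown classes
```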
    摘要 To address the UFDA scenario, we propose a corresponding framework called Hot-Learning with Contrastive Label Disambiguation (HCLD). This framework tackles UFDA's domain shifts and category gaps problem by using one-hot outputs from the black-box models of various source domains. Additionally, to better distinguish shared and unknown classes, we propose a cluster-level strategy called Mutual-Voting Decision (MVD) to extract robust consensus knowledge across peer classes from both source and target domains.Our extensive experiments on three benchmarks demonstrate that our HCLD achieves comparable performance for the UFDA scenario with fewer assumptions, compared to previous methodologies with many additional assumptions.

Fully Authentic Visual Question Answering Dataset from Online Communities

  • paper_url: http://arxiv.org/abs/2311.15562
  • repo_url: None
  • paper_authors: Chongyan Chen, Mengchen Liu, Noel Codella, Yunsheng Li, Lu Yuan, Danna Gurari
  • for: Visual question answering (VQA), i.e., answering questions about images, with fully authentic data.
  • methods: The paper introduces VQAonline, the first VQA dataset in which all content originates from an authentic use case, sourced from online question answering community forums.
  • results: Answers in the dataset are much longer than usual (a mean of 173 words) and thus incompatible with standard VQA metrics, so the authors analyze which of six popular long-text evaluation metrics best align with human judgments and use the best-suited ones to evaluate six state-of-the-art vision-language foundation models, revealing where they struggle most.
    Abstract Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Sourced from online question answering community forums, we call it VQAonline. We then characterize our dataset and how it relates to eight other VQA datasets. Observing that answers in our dataset tend to be much longer (e.g., with a mean of 173 words) and thus incompatible with standard VQA evaluation metrics, we next analyze which of the six popular metrics for longer text evaluation align best with human judgments. We then use the best-suited metrics to evaluate six state-of-the-art vision and language foundation models on VQAonline and reveal where they struggle most. We will release the dataset soon to facilitate future extensions.

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

  • paper_url: http://arxiv.org/abs/2311.15561
  • repo_url: None
  • paper_authors: Yiming Chen, Zhiqi Li, Peidong Liu
  • for: An efficient text-to-3D generation method that produces a 3D asset from a text prompt in about 8 ms on a consumer graphics card.
  • methods: Images generated by a large pre-trained text-to-image diffusion model supervise the training of a text-conditioned 3D generative adversarial network; once trained, a 3D asset is generated with a single forward pass, without any 3D training data.
  • results: The approach greatly reduces the computational burden and generation time, offering an efficient alternative for text-to-3D generation by distilling pre-trained image diffusion models.
    Abstract Recent breakthroughs in text-to-image generation has shown encouraging results via large generative models. Due to the scarcity of 3D assets, it is hardly to transfer the success of text-to-image generation to that of text-to-3D generation. Existing text-to-3D generation methods usually adopt the paradigm of DreamFusion, which conducts per-asset optimization by distilling a pretrained text-to-image diffusion model. The generation speed usually ranges from several minutes to tens of minutes per 3D asset, which degrades the user experience and also imposes a burden to the service providers due to the high computational budget. In this work, we present an efficient text-to-3D generation method, which requires only around 8 $ms$ to generate a 3D asset given the text prompt on a consumer graphic card. The main insight is that we exploit the images generated by a large pre-trained text-to-image diffusion model, to supervise the training of a text conditioned 3D generative adversarial network. Once the network is trained, we are able to efficiently generate a 3D asset via a single forward pass. Our method requires no 3D training data and provides an alternative approach for efficient text-to-3D generation by distilling pre-trained image diffusion models.

PKU-I2IQA: An Image-to-Image Quality Assessment Database for AI Generated Images

  • paper_url: http://arxiv.org/abs/2311.15556
  • repo_url: https://github.com/jiquan123/i2iqa
  • paper_authors: Jiquan Yuan, Xinyan Cao, Changjin Li, Fanyi Yang, Jinlong Lin, Xixin Cao
  • for: This work aims to provide a more comprehensive way to assess the quality of AI-generated images (AIGIs), covering image-to-image generation in addition to text-to-image generation.
  • methods: The authors build PKU-I2IQA, a human perception-based image-to-image AIGC image quality assessment database, collect quality labels through a well-organized subjective experiment, and propose two benchmark models: NR-AIGCIQA (no-reference) and FR-AIGCIQA (full-reference).
  • results: Benchmark experiments on PKU-I2IQA compare the performance of the two proposed models; the database and benchmarks will be released to facilitate future research.
    Abstract As image generation technology advances, AI-based image generation has been applied in various fields and Artificial Intelligence Generated Content (AIGC) has garnered widespread attention. However, the development of AI-based image generative models also brings new problems and challenges. A significant challenge is that AI-generated images (AIGI) may exhibit unique distortions compared to natural images, and not all generated images meet the requirements of the real world. Therefore, it is of great significance to evaluate AIGIs more comprehensively. Although previous work has established several human perception-based AIGC image quality assessment (AIGCIQA) databases for text-generated images, the AI image generation technology includes scenarios like text-to-image and image-to-image, and assessing only the images generated by text-to-image models is insufficient. To address this issue, we establish a human perception-based image-to-image AIGCIQA database, named PKU-I2IQA. We conduct a well-organized subjective experiment to collect quality labels for AIGIs and then conduct a comprehensive analysis of the PKU-I2IQA database. Furthermore, we have proposed two benchmark models: NR-AIGCIQA based on the no-reference image quality assessment method and FR-AIGCIQA based on the full-reference image quality assessment method. Finally, leveraging this database, we conduct benchmark experiments and compare the performance of the proposed benchmark models. The PKU-I2IQA database and benchmarks will be released to facilitate future research on \url{https://github.com/jiquan123/I2IQA}.
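
The difference between the two proposed benchmark settings can be sketched as follows: a no-reference (NR) model regresses a quality score from the generated image alone, while a full-reference (FR) model also consumes the source image. The toy backbone and dimensions below are assumptions used only to show the data flow.

```python
# Hypothetical sketch of the NR vs. FR settings; the backbone is a stand-in,
# not the paper's model, and both heads would be trained to regress the
# human quality labels (MOS) collected in the subjective experiment.
import torch
import torch.nn as nn

feat_dim = 128
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.ReLU())

nr_head = nn.Linear(feat_dim, 1)          # NR-AIGCIQA: generated image only
fr_head = nn.Linear(2 * feat_dim, 1)      # FR-AIGCIQA: generated + reference

generated = torch.rand(4, 3, 64, 64)      # AI-generated images (toy)
reference = torch.rand(4, 3, 64, 64)      # source images for image-to-image

nr_score = nr_head(backbone(generated))
fr_score = fr_head(torch.cat([backbone(generated), backbone(reference)], dim=-1))
print(nr_score.shape, fr_score.shape)     # torch.Size([4, 1]) twice
```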

Dataset Distillation in Latent Space

  • paper_url: http://arxiv.org/abs/2311.15547
  • repo_url: None
  • paper_authors: Yuxuan Duan, Jianfu Zhang, Liqing Zhang
  • for: Reduce the computational cost of training models on large datasets by distilling them into small, condensed surrogate datasets that preserve downstream performance.
  • methods: The dataset distillation (DD) process is moved from pixel space to latent space: a pretrained generic autoencoder encodes the original images into compact latent codes, and three mainstream DD algorithms are run directly on these codes.
  • results: Distilling in latent space drastically reduces time and space consumption while achieving similar performance, making it possible to distill high-resolution datasets, target greater data ratios, and store more latent codes than pixel-level images within the same budget.
    Abstract Dataset distillation (DD) is a newly emerging research area aiming at alleviating the heavy computational load in training models on large datasets. It tries to distill a large dataset into a small and condensed one so that models trained on the distilled dataset can perform comparably with those trained on the full dataset when performing downstream tasks. Among the previous works in this area, there are three key problems that hinder the performance and availability of the existing DD methods: high time complexity, high space complexity, and low info-compactness. In this work, we simultaneously attempt to settle these three problems by moving the DD processes from conventionally used pixel space to latent space. Encoded by a pretrained generic autoencoder, latent codes in the latent space are naturally info-compact representations of the original images in much smaller sizes. After transferring three mainstream DD algorithms to latent space, we significantly reduce time and space consumption while achieving similar performance, allowing us to distill high-resolution datasets or target at greater data ratio that previous methods have failed. Besides, within the same storage budget, we can also quantitatively deliver more latent codes than pixel-level images, which further boosts the performance of our methods.
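
A rough sketch of the core idea, under simplifying assumptions (an untrained stand-in autoencoder and a trivial mean-matching objective): images are encoded once into compact latent codes, the distillation variables live in that latent space, and decoding happens only when images are needed.

```python
# Illustrative only: the encoder/decoder stand in for a pretrained generic
# autoencoder, and the matching loss stands in for a real DD objective.
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))   # encoder stand-in
dec = nn.Sequential(nn.Linear(256, 3 * 64 * 64))                 # decoder stand-in

real_images = torch.rand(128, 3, 64, 64)
with torch.no_grad():
    real_latents = enc(real_images)            # info-compact representations

# Synthetic latents are the variables the DD algorithm optimizes
# (e.g., by matching feature/gradient statistics against real_latents).
ipc = 10                                       # images per class (toy)
syn_latents = torch.randn(ipc, 256, requires_grad=True)
opt = torch.optim.SGD([syn_latents], lr=0.1)

for _ in range(5):                             # placeholder matching objective
    loss = (syn_latents.mean(0) - real_latents.mean(0)).pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

# Decode only when images are needed for downstream training.
syn_images = dec(syn_latents).view(ipc, 3, 64, 64)
```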

Beyond Pixels: Exploring Human-Readable SVG Generation for Simple Images with Vision Language Models

  • paper_url: http://arxiv.org/abs/2311.15543
  • repo_url: None
  • paper_authors: Tong Zhang, Haoyang Liu, Peiyan Zhang, Yuxuan Cheng, Haohan Wang
  • for: This work aims to generate simple, human-readable Scalable Vector Graphics (SVG) from images, addressing a need in computer graphics.
  • methods: The proposed method, Simple-SVG-Generation (S\textsuperscript{2}VG\textsuperscript{2}), produces SVGs that are both accurate and simple while preserving the relational properties and context of the original scene.
  • results: On simple images, evaluation with reasoning tasks alongside advanced language models shows a clear improvement over previous SVG generation methods, and human surveys also favor the readability of the generated SVGs.
    Abstract In the field of computer graphics, the use of vector graphics, particularly Scalable Vector Graphics (SVG), represents a notable development from traditional pixel-based imagery. SVGs, with their XML-based format, are distinct in their ability to directly and explicitly represent visual elements such as shape, color, and path. This direct representation facilitates a more accurate and logical depiction of graphical elements, enhancing reasoning and interpretability. Recognizing the potential of SVGs, the machine learning community has introduced multiple methods for image vectorization. However, transforming images into SVG format while retaining the relational properties and context of the original scene remains a key challenge. Most vectorization methods often yield SVGs that are overly complex and not easily interpretable. In response to this challenge, we introduce our method, Simple-SVG-Generation (S\textsuperscript{2}VG\textsuperscript{2}). Our method focuses on producing SVGs that are both accurate and simple, aligning with human readability and understanding. With simple images, we evaluate our method with reasoning tasks together with advanced language models, the results show a clear improvement over previous SVG generation methods. We also conducted surveys for human evaluation on the readability of our generated SVGs, the results also favor our methods.
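
For readers unfamiliar with the format, the snippet below illustrates what a "simple, human-readable" SVG looks like: a handful of named primitives rather than long automatically traced paths. The shapes are arbitrary and not taken from the paper.

```python
# A tiny illustration of "human-readable SVG": each element directly names
# its shape, position and color, which is what makes the output easy for
# humans (and language models) to read and reason about.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64">
  <rect x="8" y="28" width="48" height="28" fill="saddlebrown"/>   <!-- house body -->
  <polygon points="4,28 32,8 60,28" fill="firebrick"/>             <!-- roof -->
  <circle cx="32" cy="44" r="6" fill="gold"/>                      <!-- window -->
</svg>"""

with open("house.svg", "w") as f:
    f.write(svg)
```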

EAFP-Med: An Efficient Adaptive Feature Processing Module Based on Prompts for Medical Image Detection

  • paper_url: http://arxiv.org/abs/2311.15540
  • repo_url: None
  • paper_authors: Xiang Li, Long Lan, Husam Lahza, Shaowu Yang, Shuihua Wang, Wenjing Yang, Hengzhu Liu, Yudong Zhang
  • for: This paper addresses the differences in lesion representations across medical imaging technologies, aiming to improve the efficiency and accuracy of cross-domain medical image detection.
  • methods: Inspired by large language models, the authors propose EAFP-Med, an efficient adaptive feature processing module based on prompts that extracts lesion features at different scales from diverse medical images and can be attached to any model front-end; they further build EAFP-Med ST, which connects the module to a Swin Transformer V2 - Tiny (SwinV2-T) backbone.
  • results: Compared against nine state-of-the-art methods, EAFP-Med ST achieves the best performance on all three datasets (chest X-ray, cranial MRI, and skin images).
    Abstract In the face of rapid advances in medical imaging, cross-domain adaptive medical image detection is challenging due to the differences in lesion representations across various medical imaging technologies. To address this issue, we draw inspiration from large language models to propose EAFP-Med, an efficient adaptive feature processing module based on prompts for medical image detection. EAFP-Med can efficiently extract lesion features of different scales from a diverse range of medical images based on prompts while being flexible and not limited by specific imaging techniques. Furthermore, it serves as a feature preprocessing module that can be connected to any model front-end to enhance the lesion features in input images. Moreover, we propose a novel adaptive disease detection model named EAFP-Med ST, which utilizes the Swin Transformer V2 - Tiny (SwinV2-T) as its backbone and connects it to EAFP-Med. We have compared our method to nine state-of-the-art methods. Experimental results demonstrate that EAFP-Med ST achieves the best performance on all three datasets (chest X-ray images, cranial magnetic resonance imaging images, and skin images). EAFP-Med can efficiently extract lesion features from various medical images based on prompts, enhancing the model's performance. This holds significant potential for improving medical image analysis and diagnosis.
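
The sketch below shows one plausible form of a prompt-driven feature pre-processing module that could sit in front of an arbitrary backbone, with a learned prompt per imaging modality modulating the input features. The FiLM-style modulation, names, and shapes are assumptions, not the EAFP-Med design.

```python
# Hypothetical prompt-conditioned feature modulator; not the paper's module.
import torch
import torch.nn as nn

class PromptFeatureModulator(nn.Module):
    def __init__(self, channels=16, num_prompts=3, prompt_dim=32):
        super().__init__()
        self.prompts = nn.Embedding(num_prompts, prompt_dim)   # one prompt per modality
        self.to_scale = nn.Linear(prompt_dim, channels)
        self.to_shift = nn.Linear(prompt_dim, channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats, prompt_id):
        p = self.prompts(prompt_id)                             # (B, prompt_dim)
        scale = self.to_scale(p)[:, :, None, None]              # channel-wise modulation
        shift = self.to_shift(p)[:, :, None, None]
        return self.conv(feats * (1 + scale) + shift)

module = PromptFeatureModulator()
feats = torch.rand(2, 16, 32, 32)                 # features of a chest X-ray batch (toy)
out = module(feats, prompt_id=torch.tensor([0, 0]))
print(out.shape)                                  # torch.Size([2, 16, 32, 32])
# `out` would then be fed to any backbone (e.g., a SwinV2-T detector).
```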

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.15537
  • repo_url: https://github.com/xb534/sed
  • paper_authors: Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
  • for: This paper proposes a simple encoder-decoder model, SED, for open-vocabulary semantic segmentation.
  • methods: A hierarchical encoder generates a pixel-level image-text cost map, and a gradual fusion decoder combines the cost map with feature maps from different backbone levels; a category early rejection scheme discards non-existing categories at early decoder layers to accelerate inference.
  • results: Across multiple open-vocabulary semantic segmentation datasets, SED with ConvNeXt-B achieves 31.6% mIoU on ADE20K (150 categories) at 82 ms per image on a single A6000.
    Abstract Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models, in which the key is to adopt the image-level model for pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs hierarchical backbone, instead of plain transformer, to predict pixel-level image-text cost map. Compared to plain transformer, hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine cost map and the feature maps of different backbone levels for segmentation. To accelerate inference speed, we introduce a category early rejection scheme in the decoder that rejects many no-existing categories at the early layer of decoder, resulting in at most 4.7 times acceleration without accuracy degradation. Experiments are performed on multiple open-vocabulary semantic segmentation datasets, which demonstrates the efficacy of our SED method. When using ConvNeXt-B, our SED method achieves mIoU score of 31.6\% on ADE20K with 150 categories at 82 millisecond ($ms$) per image on a single A6000. We will release it at \url{https://github.com/xb534/SED.git}.
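
Category early rejection can be pictured with the small sketch below: categories whose aggregated image-text cost response is weak are dropped at an early decoder layer so later layers only process plausible candidates. The top-k rule and tensor sizes are illustrative assumptions rather than SED's exact criterion.

```python
# Toy illustration of pruning unlikely categories before the later,
# more expensive decoder layers run.
import torch

num_categories, H, W = 150, 32, 32
cost_map = torch.rand(num_categories, H, W)               # pixel-level image-text costs

keep_k = 20
category_score = cost_map.flatten(1).max(dim=1).values    # best response per category
keep_idx = category_score.topk(keep_k).indices            # categories that survive

pruned_cost_map = cost_map[keep_idx]                      # (keep_k, H, W)
print(pruned_cost_map.shape)
# Later decoder layers fuse `pruned_cost_map` with backbone features,
# which is where the reported ~4.7x speed-up would come from.
```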

SVRDA: A Web-based Dataset Annotation Tool for Slice-to-Volume Registration

  • paper_url: http://arxiv.org/abs/2311.15536
  • repo_url: https://github.com/roldbach/svrda
  • paper_authors: Weixun Luo, Alexandre Triay Bagur, Paul Aljabar, George Ralli, Sir Michael Brady
  • for: SVRDA is designed to facilitate the annotation of benchmark datasets for slice-to-volume registration.
  • methods: SVRDA is a web-based application that supports platform-agnostic collaboration and efficient transformation manipulation via keyboard shortcuts. It also features automatic saving, configuration-based data loading, and separation of concerns for flexibility and extensibility.
  • results: The effectiveness of SVRDA was validated through indirect evaluation of post-registration segmentation quality on UK Biobank data, which showed a significant improvement in Dice Similarity Coefficient and 95th percentile Hausdorff distance. Additionally, SVRDA was successfully integrated into test-retest T1 quantification on in-house magnetic resonance images, leading to more consistent results after registration.
    Abstract Background and Objective: The lack of benchmark datasets has impeded the development of slice-to-volume registration algorithms. Such datasets are difficult to annotate, primarily due to the dimensional difference within data and the dearth of task-specific software. We aim to develop a user-friendly tool to streamline dataset annotation for slice-to-volume registration. Methods: The proposed tool, named SVRDA, is an installation-free web application for platform-agnostic collaborative dataset annotation. It enables efficient transformation manipulation via keyboard shortcuts and smooth case transitions with auto-saving. SVRDA supports configuration-based data loading and adheres to the separation of concerns, offering great flexibility and extensibility for future research. Various supplementary features have been implemented to facilitate slice-to-volume registration. Results: We validated the effectiveness of SVRDA by indirectly evaluating the post-registration segmentation quality on UK Biobank data, observing a dramatic overall improvement (24.02% in the Dice Similarity Coefficient and 48.93% in the 95th percentile Hausdorff distance, respectively) supported by highly statistically significant evidence ($p<0.001$).We further showcased the clinical usage of SVRDA by integrating it into test-retest T1 quantification on in-house magnetic resonance images, leading to more consistent results after registration. Conclusions: SVRDA can facilitate collaborative annotation of benchmark datasets while being potentially applicable to other pipelines incorporating slice-to-volume registration. Full source code and documentation are available at https://github.com/Roldbach/SVRDA
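
Since the evaluation reports Dice Similarity Coefficient and 95th-percentile Hausdorff distance, a minimal Dice computation for two binary masks is sketched below; it is the standard formulation, not code extracted from SVRDA.

```python
# Standard Dice coefficient for two binary segmentation masks.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))

pred = np.zeros((64, 64), dtype=np.uint8); pred[10:40, 10:40] = 1
gt = np.zeros((64, 64), dtype=np.uint8);   gt[12:42, 12:42] = 1
print(f"Dice = {dice(pred, gt):.3f}")
# HD95 is typically computed from distances between the two mask surfaces,
# e.g., via scipy.ndimage.distance_transform_edt on the complement masks.
```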

Efficient Dataset Distillation via Minimax Diffusion

  • paper_url: http://arxiv.org/abs/2311.15529
  • repo_url: https://github.com/vimar-gu/minimaxdiffusion
  • paper_authors: Jianyang Gu, Saeed Vahidian, Vyacheslav Kungurtsev, Haonan Wang, Wei Jiang, Yang You, Yiran Chen
  • for: Reduce the storage and computation needed to train neural networks by generating a small surrogate dataset that captures the rich information of the original large-scale one.
  • methods: Generative diffusion techniques are used to compute the surrogate dataset, with additional minimax criteria in the generative training that enhance the representativeness and diversity of the generated images.
  • results: The method achieves state-of-the-art validation performance on ImageWoof while requiring far less computation; under the 100-IPC setting it needs less than one-twentieth of the distillation time of previous methods yet yields better performance.
    Abstract Dataset distillation reduces the storage and computational consumption of training a network by generating a small surrogate dataset that encapsulates rich information of the original large-scale one. However, previous distillation methods heavily rely on the sample-wise iterative optimization scheme. As the images-per-class (IPC) setting or image resolution grows larger, the necessary computation will demand overwhelming time and resources. In this work, we intend to incorporate generative diffusion techniques for computing the surrogate dataset. Observing that key factors for constructing an effective surrogate dataset are representativeness and diversity, we design additional minimax criteria in the generative training to enhance these facets for the generated images of diffusion models. We present a theoretical model of the process as hierarchical diffusion control demonstrating the flexibility of the diffusion process to target these criteria without jeopardizing the faithfulness of the sample to the desired distribution. The proposed method achieves state-of-the-art validation performance while demanding much less computational resources. Under the 100-IPC setting on ImageWoof, our method requires less than one-twentieth the distillation time of previous methods, yet yields even better performance. Source code available in https://github.com/vimar-gu/MinimaxDiffusion.
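
One way to read the minimax criteria is sketched below: generated features are pulled toward their nearest real features (representativeness) and pushed away from their nearest generated neighbours (diversity). The feature tensors, squared-distance formulation, and loss weight are assumptions for illustration, not the paper's exact objective.

```python
# Illustrative representativeness/diversity terms on placeholder features.
import torch

gen_feats = torch.randn(16, 128, requires_grad=True)    # features of generated samples
real_feats = torch.randn(256, 128)                      # features of real samples

def sq_dists(a, b):
    # squared Euclidean distances between every row of a and every row of b
    return ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(-1)

# Representativeness: minimize distance to the closest real feature.
d_real = sq_dists(gen_feats, real_feats)                # (16, 256)
repr_loss = d_real.min(dim=1).values.mean()

# Diversity: maximize distance to the closest *other* generated feature.
d_gen = sq_dists(gen_feats, gen_feats) + torch.eye(16) * 1e9   # mask self-pairs
div_loss = -d_gen.min(dim=1).values.mean()

loss = repr_loss + 0.1 * div_loss                        # weight is a placeholder
loss.backward()
```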

Fine-grained Appearance Transfer with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16513
  • repo_url: https://github.com/babahui/fine-grained-appearance-transfer
  • paper_authors: Yuteng Ye, Guanwen Li, Hang Zhou, Cai Jiale, Junqing Yu, Yawei Luo, Zikai Song, Qilong Xing, Youjia Zhang, Wei Yang
  • for: This paper studies image-to-image translation (I2I), in particular fine-grained appearance transfer, which alters the visual appearance between images while maintaining structural coherence.
  • methods: A new framework integrates semantic matching, appearance transfer, and latent deviation; the strategic use of the predicted $x_0$ space of diffusion models within the latent space is identified as crucial for preserving fine-grained structural details and enabling mask-wise appearance transfer.
  • results: Extensive experiments across a wide range of categories and domains show that the method handles fine-grained appearance transfer effectively; code is available at https://github.com/babahui/Fine-grained-Appearance-Transfer.
    Abstract Image-to-image translation (I2I), and particularly its subfield of appearance transfer, which seeks to alter the visual appearance between images while maintaining structural coherence, presents formidable challenges. Despite significant advancements brought by diffusion models, achieving fine-grained transfer remains complex, particularly in terms of retaining detailed structural elements and ensuring information fidelity. This paper proposes an innovative framework designed to surmount these challenges by integrating various aspects of semantic matching, appearance transfer, and latent deviation. A pivotal aspect of our approach is the strategic use of the predicted $x_0$ space by diffusion models within the latent space of diffusion processes. This is identified as a crucial element for the precise and natural transfer of fine-grained details. Our framework exploits this space to accomplish semantic alignment between source and target images, facilitating mask-wise appearance transfer for improved feature acquisition. A significant advancement of our method is the seamless integration of these features into the latent space, enabling more nuanced latent deviations without necessitating extensive model retraining or fine-tuning. The effectiveness of our approach is demonstrated through extensive experiments, which showcase its ability to adeptly handle fine-grained appearance transfers across a wide range of categories and domains. We provide our code at https://github.com/babahui/Fine-grained-Appearance-Transfer
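
The "predicted $x_0$ space" refers to the clean-sample estimate that an epsilon-prediction diffusion model yields at every step; the generic closed form is sketched below (this is the standard DDPM relation, not code from the paper's repository).

```python
# Generic predicted-x0 recovery for an epsilon-prediction diffusion model:
#     x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
import torch

def predict_x0(x_t: torch.Tensor, eps: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    return (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / (alpha_bar_t ** 0.5)

x_t = torch.randn(1, 4, 32, 32)          # noisy latent at step t
eps = torch.randn(1, 4, 32, 32)          # noise predicted by the UNet
x0_hat = predict_x0(x_t, eps, alpha_bar_t=0.5)
# Semantic matching and appearance transfer would then operate on
# (features of) x0_hat rather than on x_t directly.
```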

Sparse Pedestrian Character Learning for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2311.15512
  • repo_url: None
  • paper_authors: Yonghao Dong, Le Wang, Sanpin Zhou, Gang Hua, Changyin Sun
  • for: Pedestrian trajectory prediction from a first-person view, an important task in autonomous driving.
  • methods: The authors propose a two-stream sparse-character-based network (TSNet) that removes invalid and negative pedestrian character information (action and appearance) to improve the learned trajectory embedding, using a novel sparse character graph with sparse category and sparse temporal graphs to model the effects of different characters along the category and temporal dimensions.
  • results: Extensive experiments on the first-person-view PIE and JAAD datasets show that the method outperforms existing state-of-the-art approaches, and ablations confirm the benefit of eliminating negative characters.
    Abstract Pedestrian trajectory prediction in a first-person view has recently attracted much attention due to its importance in autonomous driving. Recent work utilizes pedestrian character information, \textit{i.e.}, action and appearance, to improve the learned trajectory embedding and achieves state-of-the-art performance. However, it neglects the invalid and negative pedestrian character information, which is harmful to trajectory representation and thus leads to performance degradation. To address this issue, we present a two-stream sparse-character-based network~(TSNet) for pedestrian trajectory prediction. Specifically, TSNet learns the negative-removed characters in the sparse character representation stream to improve the trajectory embedding obtained in the trajectory representation stream. Moreover, to model the negative-removed characters, we propose a novel sparse character graph, including the sparse category and sparse temporal character graphs, to learn the different effects of various characters in category and temporal dimensions, respectively. Extensive experiments on two first-person view datasets, PIE and JAAD, show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate different effects of various characters and prove that TSNet outperforms approaches without eliminating negative characters.

CaesarNeRF: Calibrated Semantic Representation for Few-shot Generalizable Neural Rendering

  • paper_url: http://arxiv.org/abs/2311.15510
  • repo_url: https://github.com/haidongz-usc/CaesarNeRF
  • paper_authors: Haidong Zhu, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Ram Nevatia, Luming Liang
  • for: Improve the generalizability and few-shot capability of NeRF models while preserving high-quality detail in rendering.
  • methods: CaesarNeRF introduces a CAlibratEd SemAntic Representation at the scene level, combined with pixel-level representations; pose differences between reference views are explicitly modeled to calibrate the holistic understanding, which is further enhanced by sequential refinement to capture varying details.
  • results: Extensive experiments on public datasets (LLFF, Shiny, mip-NeRF 360, and MVImgNet) show that CaesarNeRF achieves state-of-the-art performance across varying numbers of reference views, remaining effective even with a single reference image.
    Abstract Generalizability and few-shot learning are key challenges in Neural Radiance Fields (NeRF), often due to the lack of a holistic understanding in pixel-level rendering. We introduce CaesarNeRF, an end-to-end approach that leverages scene-level CAlibratEd SemAntic Representation along with pixel-level representations to advance few-shot, generalizable neural rendering, facilitating a holistic understanding without compromising high-quality details. CaesarNeRF explicitly models pose differences of reference views to combine scene-level semantic representations, providing a calibrated holistic understanding. This calibration process aligns various viewpoints with precise location and is further enhanced by sequential refinement to capture varying details. Extensive experiments on public datasets, including LLFF, Shiny, mip-NeRF 360, and MVImgNet, show that CaesarNeRF delivers state-of-the-art performance across varying numbers of reference views, proving effective even with a single reference image. The project page of this work can be found at https://haidongz-usc.github.io/project/caesarnerf.

Class-Adaptive Sampling Policy for Efficient Continual Learning

  • paper_url: http://arxiv.org/abs/2311.16485
  • repo_url: https://github.com/hossein-rezaei624/casp
  • paper_authors: Hossein Rezaei, Mohammad Sabokrou
  • for: Improve the efficiency of continual learning (CL) by addressing the inability of buffer-based methods to dynamically allocate storage space across classes.
  • methods: A new method and policy named Class-Adaptive Sampling Policy (CASP) dynamically allocates buffer space by considering class contribution and difficulty, letting certain classes occupy a larger portion of the buffer while reducing storage for others.
  • results: CASP substantially improves the efficiency of knowledge retention and utilization, offering a versatile way to boost CL performance across different learning tasks and complex learning scenarios.
    Abstract Continual learning (CL) aims to acquire new knowledge while preserving information from previous experiences without forgetting. Though buffer-based methods (i.e., retaining samples from previous tasks) have achieved acceptable performance, determining how to allocate the buffer remains a critical challenge. Most recent research focuses on refining these methods but often fails to sufficiently consider the varying influence of samples on the learning process, and frequently overlooks the complexity of the classes/concepts being learned. Generally, these methods do not directly take into account the contribution of individual classes. However, our investigation indicates that more challenging classes necessitate preserving a larger number of samples compared to less challenging ones. To address this issue, we propose a novel method and policy named 'Class-Adaptive Sampling Policy' (CASP), which dynamically allocates storage space within the buffer. By utilizing concepts of class contribution and difficulty, CASP adaptively manages buffer space, allowing certain classes to occupy a larger portion of the buffer while reducing storage for others. This approach significantly improves the efficiency of knowledge retention and utilization. CASP provides a versatile solution to boost the performance and efficiency of CL. It meets the demand for dynamic buffer allocation, accommodating the varying contributions of different classes and their learning complexities over time.
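
The allocation idea can be illustrated with a tiny sketch: buffer slots are assigned in proportion to a per-class difficulty or contribution score. The scores and the proportional rounding rule are assumptions, not the exact CASP policy.

```python
# Toy class-adaptive buffer allocation: harder / more influential classes
# keep a larger share of the replay buffer.
buffer_size = 200
# e.g., 1 - running accuracy per class, or any contribution/difficulty score
class_difficulty = {"cat": 0.9, "dog": 0.4, "car": 0.2, "tree": 0.1}

total = sum(class_difficulty.values())
slots = {c: max(1, round(buffer_size * d / total)) for c, d in class_difficulty.items()}
print(slots)   # harder classes (e.g., "cat") keep more exemplars
# During training, the buffer would be refilled so each class stores at most
# `slots[c]` exemplars, and the scores are updated as new tasks arrive.
```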

AerialBooth: Mutual Information Guidance for Text Controlled Aerial View Synthesis from a Single Image

  • paper_url: http://arxiv.org/abs/2311.15478
  • repo_url: None
  • paper_authors: Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha
  • for: This paper presents AerialBooth, a novel method for synthesizing the aerial view from a single input image using its text description.
  • methods: A pretrained text-to-2D-image stable diffusion model serves as prior knowledge of the 3D world; it is finetuned in two steps to optimize the text embedding and the UNet that reconstruct the input image and its inverse perspective mapping, and at inference a novel mutual information guidance steers the generated content toward the input image.
  • results: Extensive experiments and ablation studies on a wide spectrum of real and synthetic data (natural scenes, indoor scenes, human action, etc.) demonstrate the effectiveness of AerialBooth and its generalizability to other text-controlled views, with the best viewpoint-fidelity trade-off across 7 evaluation metrics. Code and data are available at https://github.com/divyakraman/AerialBooth2023.
    Abstract We present a novel method, AerialBooth, for synthesizing the aerial view from a single input image using its text description. We leverage the pretrained text-to-2D image stable diffusion model as prior knowledge of the 3D world. The model is finetuned in two steps to optimize for the text embedding and the UNet that reconstruct the input image and its inverse perspective mapping respectively. The inverse perspective mapping creates variance within the text-image space of the diffusion model, while providing weak guidance for aerial view synthesis. At inference, we steer the contents of the generated image towards the input image using novel mutual information guidance that maximizes the information content between the probability distributions of the two images. We evaluate our approach on a wide spectrum of real and synthetic data, including natural scenes, indoor scenes, human action, etc. Through extensive experiments and ablation studies, we demonstrate the effectiveness of AerialBooth and also its generalizability to other text-controlled views. We also show that AerialBooth achieves the best viewpoint-fidelity trade-off though quantitative evaluation on 7 metrics analyzing viewpoint and fidelity w.r.t. input image. Code and data is available at https://github.com/divyakraman/AerialBooth2023.
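
Mutual information guidance maximizes the information shared between the distributions of the generated and input images. As a reference point, the sketch below estimates mutual information between two grayscale images from their joint histogram, a textbook formulation used here only to illustrate the quantity involved, not the paper's guidance term.

```python
# Histogram-based mutual information between two grayscale images.
import numpy as np

def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 32) -> float:
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                   # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

a = np.random.rand(64, 64)
print(f"MI(a, a) = {mutual_information(a, a):.3f}")                       # high: identical images
print(f"MI(a, b) = {mutual_information(a, np.random.rand(64, 64)):.3f}")  # near zero
```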

DreamCreature: Crafting Photorealistic Virtual Creatures from Imagination

  • paper_url: http://arxiv.org/abs/2311.15477
  • repo_url: https://github.com/kamwoh/dreamcreature
  • paper_authors: Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang
  • for: This work aims to develop a text-to-image generation model that can create new, fine-grained creature concepts (e.g., virtual dog or bird species), which are valuable for digital asset creation and biodiversity analysis.
  • methods: The proposed DreamCreature identifies and extracts the underlying sub-concepts (e.g., body parts of a specific species) in an unsupervised manner and composes them to generate new hybrid concepts; textual inversion is extended with an additional projector and tailored attention loss regularization to improve sub-concept fidelity and disentanglement.
  • results: Extensive experiments on two fine-grained image benchmarks show that DreamCreature outperforms prior methods in both qualitative and quantitative evaluation, producing novel concepts with faithful structures and photorealistic appearance across diverse backgrounds and contexts.
    Abstract Recent text-to-image (T2I) generative models allow for high-quality synthesis following either text instructions or visual examples. Despite their capabilities, these models face limitations in creating new, detailed creatures within specific categories (e.g., virtual dog or bird species), which are valuable in digital asset creation and biodiversity analysis. To bridge this gap, we introduce a novel task, Virtual Creatures Generation: Given a set of unlabeled images of the target concepts (e.g., 200 bird species), we aim to train a T2I model capable of creating new, hybrid concepts within diverse backgrounds and contexts. We propose a new method called DreamCreature, which identifies and extracts the underlying sub-concepts (e.g., body parts of a specific species) in an unsupervised manner. The T2I thus adapts to generate novel concepts (e.g., new bird species) with faithful structures and photorealistic appearance by seamlessly and flexibly composing learned sub-concepts. To enhance sub-concept fidelity and disentanglement, we extend the textual inversion technique by incorporating an additional projector and tailored attention loss regularization. Extensive experiments on two fine-grained image benchmarks demonstrate the superiority of DreamCreature over prior methods in both qualitative and quantitative evaluation. Ultimately, the learned sub-concepts facilitate diverse creative applications, including innovative consumer product designs and nuanced property modifications.

MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers

  • paper_url: http://arxiv.org/abs/2311.15475
  • repo_url: https://github.com/nihalsid/mesh-gpt
  • paper_authors: Yawar Siddiqui, Antonio Alliegro, Alexey Artemov, Tatiana Tommasi, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner
  • for: This work develops a language-model-style approach to triangle mesh generation that reflects the compactness of artist-created meshes.
  • methods: A vocabulary of latent quantized embeddings is learned with graph convolutions, informed by local mesh geometry and topology; these embeddings are sequenced and decoded into triangles, and a decoder-only transformer is trained to predict the index of the next embedding given the previous ones.
  • results: Once trained, the model can be autoregressively sampled to generate new compact triangle meshes with sharp edges, improving over state-of-the-art mesh generation methods with a 9% increase in shape coverage and a 30-point improvement in FID across various categories.
    Abstract We introduce MeshGPT, a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes, in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. Inspired by recent advances in powerful large language models, we adopt a sequence-based approach to autoregressively generate triangle meshes as sequences of triangles. We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh. A transformer is then trained on this learned vocabulary to predict the index of the next embedding given previous embeddings. Once trained, our model can be autoregressively sampled to generate new triangle meshes, directly generating compact meshes with sharp edges, more closely imitating the efficient triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable improvement over state of the art mesh generation methods, with a 9% increase in shape coverage and a 30-point enhancement in FID scores across various categories.
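
The generation loop can be caricatured as below: a transformer autoregressively samples indices into a learned codebook, and groups of codes are decoded back into triangles (nine coordinates each). The random stub modules, codes-per-face count, and the omitted causal mask are simplifying assumptions, not the trained MeshGPT components.

```python
# Toy autoregressive mesh-token sampling; all modules are untrained stand-ins.
import torch
import torch.nn as nn

vocab_size, d_model, codes_per_face = 256, 64, 6
token_emb = nn.Embedding(vocab_size, d_model)
transformer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
to_logits = nn.Linear(d_model, vocab_size)
face_decoder = nn.Linear(codes_per_face * d_model, 9)   # 3 vertices x (x, y, z)

tokens = torch.randint(vocab_size, (1, 1))              # start token (placeholder)
for _ in range(codes_per_face * 4 - 1):                 # sample codes for 4 faces
    h = transformer(token_emb(tokens))                  # causal masking omitted for brevity
    next_tok = to_logits(h[:, -1]).softmax(-1).multinomial(1)
    tokens = torch.cat([tokens, next_tok], dim=1)

codes = token_emb(tokens).view(1, 4, codes_per_face * d_model)
triangles = face_decoder(codes)                         # (1, 4 faces, 9 coordinates)
print(triangles.shape)
```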

Where to Begin? From Random to Foundation Model Instructed Initialization in Federated Learning for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.15463
  • repo_url: None
  • paper_authors: Ming Li, Guang Yang
  • for: This paper investigates Federated Learning (FL) for medical image analysis and examines the impact of initializing FL models with a foundation model instead of random weights.
  • methods: The foundation model Segment Anything Model (SAM), with its enormous pretrained knowledge, is used as an instructive teacher for initializing the FL model in a medical image segmentation task.
  • results: Experiments on chest X-ray lung segmentation show that FL with foundation-model-instructed initialization not only converges faster but also performs better in complex, non-IID data scenarios.
    Abstract In medical image analysis, Federated Learning (FL) stands out as a key technology that enables privacy-preserved, decentralized data processing, crucial for handling sensitive medical data. Currently, most FL models employ random initialization, which has been proven effective in various instances. However, given the unique challenges posed by non-IID (independently and identically distributed) data in FL, we propose a novel perspective: exploring the impact of using the foundation model with enormous pre-trained knowledge, such as the Segment Anything Model (SAM), as an instructive teacher for FL model initialization in medical image segmentation task. This work for the first time attempts to utilize the foundation model as an instructive teacher for initialization in FL, assessing its impact on the performance of FL models, especially in non-IID data scenarios. Our empirical evaluation on chest x-ray lung segmentation showcases that FL with foundation model instructed initialization not only achieves faster convergence but also improves performance in complex data contexts. These findings offer a new perspective for model initialization in FL.
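
A hedged sketch of "foundation model instructed initialization": before any federated rounds, the global segmentation model is briefly distilled from a frozen foundation teacher (standing in for SAM) on unlabeled images, and the resulting weights replace random initialization before FedAvg. The tiny models, loss, and plain FedAvg below are placeholders, not the paper's setup.

```python
# Illustrative teacher-instructed initialization followed by a FedAvg step.
import copy
import torch
import torch.nn as nn

student = nn.Conv2d(1, 1, 3, padding=1)                 # global FL model (stand-in)
teacher = nn.Conv2d(1, 1, 3, padding=1).eval()          # frozen foundation teacher (stand-in)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(10):                                     # instructed initialization
    x = torch.rand(4, 1, 32, 32)                        # unlabeled chest X-rays (toy)
    with torch.no_grad():
        target = torch.sigmoid(teacher(x))              # teacher's soft masks
    loss = nn.functional.binary_cross_entropy_with_logits(student(x), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Standard FedAvg rounds would then start from the distilled weights.
clients = [copy.deepcopy(student) for _ in range(3)]
avg_state = {k: torch.stack([c.state_dict()[k] for c in clients]).mean(0)
             for k in student.state_dict()}
student.load_state_dict(avg_state)
```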