cs.CV - 2023-10-12

Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images

  • paper_url: http://arxiv.org/abs/2310.08772
  • repo_url: None
  • paper_authors: Zhao Ning Zou, Yuhang Zhang, Robert Wijaya
  • for: This work investigates how transformer-based object detectors (DETR) handle image nuisances such as occlusion and adversarial perturbations.
  • methods: DETR is evaluated with a range of experiments and benchmarks and compared against convolutional neural network (CNN) based detectors such as YOLO and Faster-RCNN.
  • results: DETR is robust to occlusion but degrades under adversarial perturbations; it also relies heavily on a main query when making predictions, leading to imbalanced contributions across queries.
    Abstract Transformer-based object detectors (DETR) have shown significant performance across machine vision tasks, ultimately in object detection. This detector is based on a self-attention mechanism along with the transformer encoder-decoder architecture to capture the global context in the image. The critical issue to be addressed is how this model architecture can handle different image nuisances, such as occlusion and adversarial perturbations. We studied this issue by measuring the performance of DETR with different experiments and benchmarking the network with convolutional neural network (CNN) based detectors like YOLO and Faster-RCNN. We found that DETR performs well when it comes to resistance to interference from information loss in occlusion images. Despite that, we found that the adversarial stickers put on the image require the network to produce a new unnecessary set of keys, queries, and values, which in most cases, results in a misdirection of the network. DETR also performed poorer than YOLOv5 in the image corruption benchmark. Furthermore, we found that DETR depends heavily on the main query when making a prediction, which leads to imbalanced contributions between queries since the main query receives most of the gradient flow.
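As a rough illustration of the query-imbalance finding, the toy snippet below (not the authors' code; the decoder and tensor shapes are placeholders) backpropagates a scalar loss through a stand-in cross-attention step and reads off per-query gradient magnitudes:

```python
import torch

# Hedged toy sketch: measure how unevenly gradient flows across learnable object queries.
queries = torch.randn(100, 256, requires_grad=True)   # placeholder object queries
features = torch.randn(850, 256)                      # placeholder encoder features

def toy_decoder(features, queries):
    # stand-in for transformer cross-attention followed by a scalar readout
    attn = torch.softmax(queries @ features.T / 256 ** 0.5, dim=-1)
    return (attn @ features).sum()

toy_decoder(features, queries).backward()
per_query_grad = queries.grad.norm(dim=-1)            # gradient magnitude per query
print(per_query_grad.argmax().item(), per_query_grad.max().item())  # dominant ("main") query
```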

Intelligent Scoliosis Screening and Diagnosis: A Survey

  • paper_url: http://arxiv.org/abs/2310.08756
  • repo_url: None
  • paper_authors: Zhang Zhenlin, Pu Lixin, Li Ang, Zhang Jun, Li Xianjie, Fan Jipeng
  • for: This survey reviews the current state and development trends of computer-assisted scoliosis screening and diagnosis.
  • methods: The paper describes the algorithm models used for computer-assisted scoliosis screening and diagnosis and analyzes their advantages and limitations.
  • results: The paper analyzes the strengths and weaknesses of existing algorithm models, discusses current development bottlenecks, and looks ahead to future trends.
    Abstract Scoliosis is a three-dimensional spinal deformity, which may lead to abnormal morphologies, such as thoracic deformity, and pelvic tilt. Severe patients may suffer from nerve damage and urinary abnormalities. At present, the number of scoliosis patients in primary and secondary schools has exceeded five million in China, the incidence rate is about 3% to 5% which is growing every year. The research on scoliosis, therefore, has important clinical value. This paper systematically introduces computer-assisted scoliosis screening and diagnosis as well as analyzes the advantages and limitations of different algorithm models in the current issue field. Moreover, the paper also discusses the current development bottlenecks in this field and looks forward to future development trends.

PU-Ray: Point Cloud Upsampling via Ray Marching on Implicit Surface

  • paper_url: http://arxiv.org/abs/2310.08755
  • repo_url: https://github.com/sum1lim/PU-Ray
  • paper_authors: Sangwon Lim, Karim El-Basyouny, Yee Hong Yang
  • for: The paper addresses domain dependency and computational redundancy in deep-learning-based point cloud upsampling and proposes a ray-based upsampling approach with an arbitrary rate for more precise and stable results.
  • methods: The method simulates the ray marching algorithm for implicit surface learning and uses a rule-based mid-point query sampling scheme to achieve a uniform output point distribution without model training that relies on the Chamfer distance loss.
  • results: The results demonstrate the method's versatility across different domains and training scenarios with limited computational resources and training data, helping the upsampling task transition from academic research to real-world applications.
    Abstract While the recent advancements in deep-learning-based point cloud upsampling methods improve the input to autonomous driving systems, they still suffer from the uncertainty of denser point generation resulting from end-to-end learning. For example, due to the vague training objectives of the models, their performance depends on the point distributions of the input and the ground truth. This causes problems of domain dependency between synthetic and real-scanned point clouds and issues with substantial model sizes and dataset requirements. Additionally, many existing methods upsample point clouds with a fixed scaling rate, making them inflexible and computationally redundant. This paper addresses the above problems by proposing a ray-based upsampling approach with an arbitrary rate, where a depth prediction is made for each query ray. The method simulates the ray marching algorithm to achieve more precise and stable ray-depth predictions through implicit surface learning. The rule-based mid-point query sampling method enables a uniform output point distribution without requiring model training using the Chamfer distance loss function, which can exhibit bias towards the training dataset. Self-supervised learning becomes possible with accurate ground truths within the input point cloud. The results demonstrate the method's versatility across different domains and training scenarios with limited computational resources and training data. This allows the upsampling task to transition from academic research to real-world applications.
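The ray-marching idea the method builds on can be sketched with an analytic signed distance function; the sphere SDF and step rule below are illustrative stand-ins for the learned implicit surface:

```python
import numpy as np

# Minimal sphere-tracing sketch on an analytic SDF (a unit sphere). The actual method
# learns the surface implicitly from the input point cloud instead of using a formula.
def sdf_sphere(p, radius=1.0):
    return np.linalg.norm(p) - radius

def ray_march(origin, direction, max_steps=64, eps=1e-4):
    t = 0.0
    for _ in range(max_steps):
        d = sdf_sphere(origin + t * direction)
        if d < eps:           # hit the surface: t is the predicted ray depth
            return t
        t += d                # safe step size given the distance bound
    return None               # ray missed the surface

depth = ray_march(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(depth)  # ~2.0
```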

AcTExplore: Active Tactile Exploration on Unknown Objects

  • paper_url: http://arxiv.org/abs/2310.08745
  • repo_url: None
  • paper_authors: Amir-Hossein Shahidzadeh, Seong Jong Yoo, Pavan Mantripragada, Chahat Deep Singh, Cornelia Fermüller, Yiannis Aloimonos
  • for: This work proposes a reinforcement-learning-driven active tactile exploration method for efficiently exploring object structures, supporting fundamental robotic tasks such as grasping and manipulation.
  • methods: A reinforcement-learning policy actively selects tactile exploration actions, incrementally covering object surfaces within a limited number of steps while collecting tactile data for 3D reconstruction.
  • results: The method achieves 95.97% IoU coverage on unseen YCB objects while being trained only on primitive shapes. Project website: https://prg.cs.umd$.$edu/AcTExplore
    Abstract Tactile exploration plays a crucial role in understanding object structures for fundamental robotics tasks such as grasping and manipulation. However, efficiently exploring such objects using tactile sensors is challenging, primarily due to the large-scale unknown environments and limited sensing coverage of these sensors. To this end, we present AcTExplore, an active tactile exploration method driven by reinforcement learning for object reconstruction at scales that automatically explores the object surfaces in a limited number of steps. Through sufficient exploration, our algorithm incrementally collects tactile data and reconstructs 3D shapes of the objects as well, which can serve as a representation for higher-level downstream tasks. Our method achieves an average of 95.97% IoU coverage on unseen YCB objects while just being trained on primitive shapes. Project Webpage: https://prg.cs.umd$.$edu/AcTExplore
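The reported IoU coverage can be computed between an occupancy grid reconstructed from tactile data and the ground-truth grid; the sketch below uses toy grids rather than YCB meshes:

```python
import numpy as np

# Illustrative voxel IoU between a reconstructed occupancy grid and the ground truth.
def voxel_iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

gt = np.zeros((32, 32, 32)); gt[8:24, 8:24, 8:24] = 1   # toy ground-truth object
pred = np.zeros_like(gt);    pred[9:24, 8:24, 8:24] = 1  # toy reconstruction
print(f"IoU coverage: {voxel_iou(pred, gt):.4f}")
```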

A Benchmarking Protocol for SAR Colorization: From Regression to Deep Learning Approaches

  • paper_url: http://arxiv.org/abs/2310.08705
  • repo_url: None
  • paper_authors: Kangqing Shen, Gemine Vivone, Xiaoyuan Yang, Simone Lolli, Michael Schmitt
  • for: The paper proposes a supervised-learning research line for SAR colorization, addressing the difficulty of interpreting grayscale, speckle-affected SAR images in remote sensing.
  • methods: The approach includes a protocol for generating synthetic color SAR images, several baselines, and an effective SAR colorization method based on a conditional generative adversarial network (cGAN).
  • results: Extensive tests demonstrate the effectiveness of the proposed cGAN-based network for SAR colorization. The code will be made publicly available.
    Abstract Synthetic aperture radar (SAR) images are widely used in remote sensing. Interpreting SAR images can be challenging due to their intrinsic speckle noise and grayscale nature. To address this issue, SAR colorization has emerged as a research direction to colorize gray scale SAR images while preserving the original spatial information and radiometric information. However, this research field is still in its early stages, and many limitations can be highlighted. In this paper, we propose a full research line for supervised learning-based approaches to SAR colorization. Our approach includes a protocol for generating synthetic color SAR images, several baselines, and an effective method based on the conditional generative adversarial network (cGAN) for SAR colorization. We also propose numerical assessment metrics for the problem at hand. To our knowledge, this is the first attempt to propose a research line for SAR colorization that includes a protocol, a benchmark, and a complete performance evaluation. Our extensive tests demonstrate the effectiveness of our proposed cGAN-based network for SAR colorization. The code will be made publicly available.
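A hedged sketch of a pix2pix-style cGAN generator objective (adversarial term plus L1), a common formulation for supervised colorization, is shown below; the paper's exact networks, losses, and weights may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy conditional discriminator: sees the SAR input concatenated with a color image.
disc = nn.Sequential(nn.Conv2d(4, 16, 4, 2, 1), nn.LeakyReLU(0.2), nn.Conv2d(16, 1, 4, 2, 1))

def generator_loss(sar, fake_rgb, real_rgb, lambda_l1=100.0):
    pred_fake = disc(torch.cat([sar, fake_rgb], dim=1))   # condition D on the SAR input
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    return adv + lambda_l1 * F.l1_loss(fake_rgb, real_rgb)

sar, fake_rgb, real_rgb = torch.randn(1, 1, 64, 64), torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(generator_loss(sar, fake_rgb, real_rgb))
```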

Fed-Safe: Securing Federated Learning in Healthcare Against Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2310.08681
  • repo_url: None
  • paper_authors: Erfan Darzi, Nanna M. Sijtsema, P. M. A van Ooijen
  • for: This work studies the security of federated learning applications in medical image analysis, aiming to defend against privacy breaches while maintaining robustness to adversarial manipulation.
  • methods: Distributed noise, grounded in federated privacy guarantees, is incorporated to protect the model against attacks; the approach is evaluated across diverse attack scenarios, parameters, and cancer-imaging use cases.
  • results: Incorporating distributed noise achieves security levels comparable to conventional adversarial training while requiring fewer retraining samples to establish a robust model.
    Abstract This paper explores the security aspects of federated learning applications in medical image analysis. Current robustness-oriented methods like adversarial training, secure aggregation, and homomorphic encryption often risk privacy compromises. The central aim is to defend the network against potential privacy breaches while maintaining model robustness against adversarial manipulations. We show that incorporating distributed noise, grounded in the privacy guarantees in federated settings, enables the development of a adversarially robust model that also meets federated privacy standards. We conducted comprehensive evaluations across diverse attack scenarios, parameters, and use cases in cancer imaging, concentrating on pathology, meningioma, and glioma. The results reveal that the incorporation of distributed noise allows for the attainment of security levels comparable to those of conventional adversarial training while requiring fewer retraining samples to establish a robust model.
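The idea of distributed noise in federated aggregation can be sketched as follows; the clipping threshold and noise scale are illustrative placeholders, not the paper's calibrated privacy parameters:

```python
import numpy as np

# Toy sketch: each client clips and perturbs its update before sending it, so the server
# only ever aggregates noisy contributions.
def noisy_fedavg(client_updates, clip=1.0, sigma=0.1, rng=np.random.default_rng(0)):
    noisy = []
    for u in client_updates:
        u = u * min(1.0, clip / (np.linalg.norm(u) + 1e-12))      # clip each update
        noisy.append(u + rng.normal(0.0, sigma * clip, size=u.shape))  # add Gaussian noise
    return np.mean(noisy, axis=0)

updates = [np.random.randn(10) for _ in range(5)]
print(noisy_fedavg(updates))
```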

SSG2: A new modelling paradigm for semantic segmentation

  • paper_url: http://arxiv.org/abs/2310.08671
  • repo_url: https://github.com/feevos/ssg2
  • paper_authors: Foivos I. Diakogiannis, Suzanne Furby, Peter Caccetta, Xiaoliang Wu, Rodrigo Ibata, Ondrej Hlinka, John Taylor
  • for: The paper addresses a limitation of semantic segmentation models that operate on single static images, which leaves little room for error correction.
  • methods: A sequence of observables is generated for each static input image; a dual-encoder, single-decoder base network augmented with a sequence model exploits correlations between successive observations to reduce error rates.
  • results: Across three diverse datasets, SSG2 converges rapidly and significantly outperforms UNet-like baselines with the same number of gradient updates, although the added temporal dimension increases the memory footprint.
    Abstract State-of-the-art models in semantic segmentation primarily operate on single, static images, generating corresponding segmentation masks. This one-shot approach leaves little room for error correction, as the models lack the capability to integrate multiple observations for enhanced accuracy. Inspired by work on semantic change detection, we address this limitation by introducing a methodology that leverages a sequence of observables generated for each static input image. By adding this "temporal" dimension, we exploit strong signal correlations between successive observations in the sequence to reduce error rates. Our framework, dubbed SSG2 (Semantic Segmentation Generation 2), employs a dual-encoder, single-decoder base network augmented with a sequence model. The base model learns to predict the set intersection, union, and difference of labels from dual-input images. Given a fixed target input image and a set of support images, the sequence model builds the predicted mask of the target by synthesizing the partial views from each sequence step and filtering out noise. We evaluate SSG2 across three diverse datasets: UrbanMonitor, featuring orthoimage tiles from Darwin, Australia with five spectral bands and 0.2m spatial resolution; ISPRS Potsdam, which includes true orthophoto images with multiple spectral bands and a 5cm ground sampling distance; and ISIC2018, a medical dataset focused on skin lesion segmentation, particularly melanoma. The SSG2 model demonstrates rapid convergence within the first few tens of epochs and significantly outperforms UNet-like baseline models with the same number of gradient updates. However, the addition of the temporal dimension results in an increased memory footprint. While this could be a limitation, it is offset by the advent of higher-memory GPUs and coding optimizations.
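The set intersection, union, and difference targets predicted by the base model can be formed from a pair of label masks as in the toy snippet below (single binary class for brevity):

```python
import numpy as np

# Toy dual-input label masks for one class.
a = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)   # labels of image A
b = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)   # labels of image B

intersection = a & b     # present in both images
union        = a | b     # present in either image
difference   = a & ~b    # present in A but not in B
print(intersection, union, difference, sep="\n")
```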

Multimodal Large Language Model for Visual Navigation

  • paper_url: http://arxiv.org/abs/2310.08669
  • repo_url: None
  • paper_authors: Yao-Hung Hubert Tsai, Vansh Dhar, Jialu Li, Bowen Zhang, Jian Zhang
  • for: This work develops a visual navigation approach that fine-tunes large language models rather than relying on complex prompt systems.
  • methods: The design uses a simple text prompt, current observations, and a history collector model as input, and outputs a probability distribution over the possible actions the agent can take during navigation.
  • results: Trained with human demonstrations and collision signals from the Habitat-Matterport 3D Dataset (HM3D), the method outperforms state-of-the-art behavior cloning methods and effectively reduces collision rates.
    Abstract Recent efforts to enable visual navigation using large language models have mainly focused on developing complex prompt systems. These systems incorporate instructions, observations, and history into massive text prompts, which are then combined with pre-trained large language models to facilitate visual navigation. In contrast, our approach aims to fine-tune large language models for visual navigation without extensive prompt engineering. Our design involves a simple text prompt, current observations, and a history collector model that gathers information from previous observations as input. For output, our design provides a probability distribution of possible actions that the agent can take during navigation. We train our model using human demonstrations and collision signals from the Habitat-Matterport 3D Dataset (HM3D). Experimental results demonstrate that our method outperforms state-of-the-art behavior cloning methods and effectively reduces collision rates.

Histogram- and Diffusion-Based Medical Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2310.08654
  • repo_url: None
  • paper_authors: Evi M. C. Huijben, Sina Amirrajab, Josien P. W. Pluim
  • for: This work aims to improve the safety and reliability of AI algorithms in the medical domain by detecting out-of-distribution (OOD) inputs.
  • methods: The proposed pipeline combines a histogram-based method, which detects homogeneous anomalies such as constant-intensity blobs, with a diffusion-based method built on the recent unsupervised anomaly detection approach DDPM-OOD, plus extensive post-processing for pixel-level and sample-level detection.
  • results: The DDPM-based method is sensitive to blur and bias-field samples but struggles with anatomical deformation, black slices, and swapped patches, suggesting that further work is needed to improve DDPM-based OOD detection for medical images.
    Abstract Out-of-distribution (OOD) detection is crucial for the safety and reliability of artificial intelligence algorithms, especially in the medical domain. In the context of the Medical OOD (MOOD) detection challenge 2023, we propose a pipeline that combines a histogram-based method and a diffusion-based method. The histogram-based method is designed to accurately detect homogeneous anomalies in the toy examples of the challenge, such as blobs with constant intensity values. The diffusion-based method is based on one of the latest methods for unsupervised anomaly detection, called DDPM-OOD. We explore this method and propose extensive post-processing steps for pixel-level and sample-level anomaly detection on brain MRI and abdominal CT data provided by the challenge. Our results show that the proposed DDPM method is sensitive to blur and bias field samples, but faces challenges with anatomical deformation, black slice, and swapped patches. These findings suggest that further research is needed to improve the performance of DDPM for OOD detection in medical images.
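A minimal version of the histogram-based component might score an input by the distance between its intensity histogram and a reference histogram aggregated from in-distribution data, as sketched below with toy data:

```python
import numpy as np

# Hedged sketch: a large histogram distance flags a potential OOD sample, e.g. a
# homogeneous "blob" anomaly. Reference data here is synthetic.
def histogram_ood_score(image, reference_hist, bins=64, value_range=(0.0, 1.0)):
    hist, _ = np.histogram(image, bins=bins, range=value_range, density=True)
    hist /= hist.sum() + 1e-12
    ref = reference_hist / (reference_hist.sum() + 1e-12)
    return 0.5 * np.abs(hist - ref).sum()   # total-variation distance in [0, 1]

rng = np.random.default_rng(0)
ref_hist, _ = np.histogram(rng.normal(0.5, 0.1, 10000).clip(0, 1), bins=64, range=(0, 1))
blob = np.full((64, 64), 0.9)               # constant-intensity blob
print(histogram_ood_score(blob, ref_hist))  # close to 1 -> likely OOD
```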

Defect Analysis of 3D Printed Cylinder Object Using Transfer Learning Approaches

  • paper_url: http://arxiv.org/abs/2310.08645
  • repo_url: None
  • paper_authors: Md Manjurul Ahsan, Shivakumar Raman, Zahed Siddique
  • for: This study evaluates machine learning (ML) approaches, specifically transfer learning (TL) models, for defect detection in 3D-printed cylinders.
  • methods: Images of printed cylinders are analyzed with several models, including VGG16, VGG19, ResNet50, ResNet101, InceptionResNetV2, and MobileNetV2, compared using accuracy, precision, recall, and F1-score across two datasets.
  • results: In the first study, VGG16, InceptionResNetV2, and MobileNetV2 achieved perfect scores while ResNet50 performed worst with an average F1-score of 0.32; in the second study, MobileNetV2 classified all instances correctly while ResNet50 reached an F1-score of 0.75 due to more false positives and fewer true positives. Overall, TL models such as MobileNetV2 can deliver high accuracy for AM defect classification, though performance varies across algorithms.
    Abstract Additive manufacturing (AM) is gaining attention across various industries like healthcare, aerospace, and automotive. However, identifying defects early in the AM process can reduce production costs and improve productivity - a key challenge. This study explored the effectiveness of machine learning (ML) approaches, specifically transfer learning (TL) models, for defect detection in 3D-printed cylinders. Images of cylinders were analyzed using models including VGG16, VGG19, ResNet50, ResNet101, InceptionResNetV2, and MobileNetV2. Performance was compared across two datasets using accuracy, precision, recall, and F1-score metrics. In the first study, VGG16, InceptionResNetV2, and MobileNetV2 achieved perfect scores. In contrast, ResNet50 had the lowest performance, with an average F1-score of 0.32. Similarly, in the second study, MobileNetV2 correctly classified all instances, while ResNet50 struggled with more false positives and fewer true positives, resulting in an F1-score of 0.75. Overall, the findings suggest certain TL models like MobileNetV2 can deliver high accuracy for AM defect classification, although performance varies across algorithms. The results provide insights into model optimization and integration needs for reliable automated defect analysis during 3D printing. By identifying the top-performing TL techniques, this study aims to enhance AM product quality through robust image-based monitoring and inspection.
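A typical transfer-learning setup of the kind evaluated here can be sketched with torchvision's MobileNetV2: freeze the pretrained backbone and replace the classifier head. The hyperparameters and training details are not the paper's:

```python
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained MobileNetV2 and retarget it for binary defect detection.
model = models.mobilenet_v2(weights="IMAGENET1K_V1")   # downloads weights on first use
for p in model.features.parameters():
    p.requires_grad = False                            # freeze the pretrained backbone
model.classifier[1] = nn.Linear(model.last_channel, 2) # new head: defective vs. good
```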

Is Generalized Dynamic Novel View Synthesis from Monocular Videos Possible Today?

  • paper_url: http://arxiv.org/abs/2310.08587
  • repo_url: None
  • paper_authors: Xiaoming Zhao, Alex Colburn, Fangchang Ma, Miguel Angel Bautista, Joshua M. Susskind, Alexander G. Schwing
  • for: Generalized dynamic novel view synthesis from monocular videos
  • methods: An analysis framework built on existing techniques, working toward a generalized approach without scene-specific appearance optimization
  • results: A pseudo-generalized process without scene-specific appearance optimization is possible but requires geometrically and temporally consistent depth estimates; even so, it improves upon some scene-specific methods.
    Abstract Rendering scenes observed in a monocular video from novel viewpoints is a challenging problem. For static scenes the community has studied both scene-specific optimization techniques, which optimize on every test scene, and generalized techniques, which only run a deep net forward pass on a test scene. In contrast, for dynamic scenes, scene-specific optimization techniques exist, but, to our best knowledge, there is currently no generalized method for dynamic novel view synthesis from a given monocular video. To answer whether generalized dynamic novel view synthesis from monocular videos is possible today, we establish an analysis framework based on existing techniques and work toward the generalized approach. We find a pseudo-generalized process without scene-specific appearance optimization is possible, but geometrically and temporally consistent depth estimates are needed. Despite no scene-specific appearance optimization, the pseudo-generalized approach improves upon some scene-specific methods.

Im4D: High-Fidelity and Real-Time Novel View Synthesis for Dynamic Scenes

  • paper_url: http://arxiv.org/abs/2310.08585
  • repo_url: https://github.com/zju3dv/im4d
  • paper_authors: Haotong Lin, Sida Peng, Zhen Xu, Tao Xie, Xingyi He, Hujun Bao, Xiaowei Zhou
  • for: The paper targets dynamic view synthesis, i.e., generating high-fidelity novel views of dynamic scenes from multi-view videos.
  • methods: Im4D is a hybrid scene representation that combines a grid-based geometry representation with a multi-view image-based appearance representation to capture the appearance details of complex dynamic scenes.
  • results: Evaluated on five dynamic view synthesis datasets, Im4D achieves state-of-the-art rendering quality, trains efficiently, and renders 512x512 images in real time at 79.8 FPS on a single RTX 3090 GPU.
    Abstract This paper aims to tackle the challenge of dynamic view synthesis from multi-view videos. The key observation is that while previous grid-based methods offer consistent rendering, they fall short in capturing appearance details of a complex dynamic scene, a domain where multi-view image-based rendering methods demonstrate the opposite properties. To combine the best of two worlds, we introduce Im4D, a hybrid scene representation that consists of a grid-based geometry representation and a multi-view image-based appearance representation. Specifically, the dynamic geometry is encoded as a 4D density function composed of spatiotemporal feature planes and a small MLP network, which globally models the scene structure and facilitates the rendering consistency. We represent the scene appearance by the original multi-view videos and a network that learns to predict the color of a 3D point from image features, instead of memorizing detailed appearance totally with networks, thereby naturally making the learning of networks easier. Our method is evaluated on five dynamic view synthesis datasets including DyNeRF, ZJU-MoCap, NHR, DNA-Rendering and ENeRF-Outdoor datasets. The results show that Im4D exhibits state-of-the-art performance in rendering quality and can be trained efficiently, while realizing real-time rendering with a speed of 79.8 FPS for 512x512 images, on a single RTX 3090 GPU.

PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

  • paper_url: http://arxiv.org/abs/2310.08586
  • repo_url: https://github.com/OpenGVLab/PonderV2
  • paper_authors: Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong He, Tong He, Hengshuang Zhao, Chunhua Shen, Yu Qiao, Wanli Ouyang
  • for: This work aims to develop a robust and highly generalized 3D foundation model, a goal that existing 2D computer vision and NLP foundation models do not address.
  • methods: A complete 3D pre-training framework couples a point cloud encoder with a differentiable volumetric neural renderer, learning useful 3D representations by comparing rendered images against real images.
  • results: The approach achieves state-of-the-art performance on 11 indoor and outdoor benchmarks with consistent improvements across settings. Code and models will be released at https://github.com/OpenGVLab/PonderV2.
    Abstract In contrast to numerous NLP and 2D computer vision foundational models, the learning of a robust and highly generalized 3D foundational model poses considerably greater challenges. This is primarily due to the inherent data variability and the diversity of downstream tasks. In this paper, we introduce a comprehensive 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations, thereby establishing a pathway to 3D foundational models. Motivated by the fact that informative 3D features should be able to encode rich geometry and appearance cues that can be utilized to render realistic images, we propose a novel universal paradigm to learn point cloud representations by differentiable neural rendering, serving as a bridge between 3D and 2D worlds. We train a point cloud encoder within a devised volumetric neural renderer by comparing the rendered images with the real images. Notably, our approach demonstrates the seamless integration of the learned 3D encoder into diverse downstream tasks. These tasks encompass not only high-level challenges such as 3D detection and segmentation but also low-level objectives like 3D reconstruction and image synthesis, spanning both indoor and outdoor scenarios. Besides, we also illustrate the capability of pre-training a 2D backbone using the proposed universal methodology, surpassing conventional pre-training methods by a large margin. For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks. The consistent improvements in various settings imply the effectiveness of the proposed method. Code and models will be made available at https://github.com/OpenGVLab/PonderV2.

Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

  • paper_url: http://arxiv.org/abs/2310.08584
  • repo_url: None
  • paper_authors: Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, Yannis Avrithis
  • for: This paper studies how data is used in self-supervised learning and asks how much more economical pretraining can be.
  • methods: Two contributions are made: a new "Walking Tours" dataset of high-resolution, hours-long, uninterrupted first-person videos, and DoRA, a self-supervised image pretraining method that discovers and tracks objects over time with transformer cross-attention and distills multiple views derived from the tracks.
  • results: With these contributions, a single Walking Tours video becomes a strong competitor to ImageNet for several image and video downstream tasks.
    Abstract Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method called DoRA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.

Universal Visual Decomposer: Long-Horizon Manipulation Made Easy

  • paper_url: http://arxiv.org/abs/2310.08581
  • repo_url: https://github.com/zcczhang/UVD
  • paper_authors: Zichen Zhang, Yunshuang Li, Osbert Bastani, Abhishek Gupta, Dinesh Jayaraman, Yecheng Jason Ma, Luca Weihs
  • for: This work develops a reusable, off-the-shelf visual task decomposition method that makes learning long-horizon manipulation tasks easier in robotic control.
  • methods: Universal Visual Decomposer (UVD) uses pre-trained visual representations designed for robotic control and discovers visual subgoals by detecting phase shifts in the embedding space, incurring no additional training cost on top of standard visuomotor policy learning.
  • results: UVD substantially outperforms baselines in both simulation and real-world tasks across imitation and reinforcement learning settings, improves compositional generalization to unseen tasks, and its subgoals can be used for goal-based reward shaping.
    Abstract Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the overarching task into several manageable subtasks to facilitate policy learning and generalization to unseen tasks. Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks. To address these shortcomings, we propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long horizon manipulation using pre-trained visual representations designed for robotic control. At a high level, UVD discovers subgoals by detecting phase shifts in the embedding space of the pre-trained representation. Operating purely on visual demonstrations without auxiliary information, UVD can effectively extract visual subgoals embedded in the videos, while incurring zero additional training cost on top of standard visuomotor policy training. Goal-conditioned policies learned with UVD-discovered subgoals exhibit significantly improved compositional generalization at test time to unseen tasks. Furthermore, UVD-discovered subgoals can be used to construct goal-based reward shaping that jump-starts temporally extended exploration for reinforcement learning. We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings on in-domain and out-of-domain task sequences alike, validating the clear advantage of automated visual task decomposition within the simple, compact UVD framework.
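The phase-shift idea can be illustrated with a toy criterion: embed each demonstration frame and mark frames where the distance to the goal embedding reaches a local minimum. This is a simplified stand-in for UVD's actual decomposition rule:

```python
import numpy as np

# Toy embedding-space subgoal discovery: the "encoder" here is a random walk, purely
# for illustration; a real pipeline would use a frozen pre-trained visual encoder.
rng = np.random.default_rng(0)
frames = np.cumsum(rng.normal(size=(120, 512)), axis=0)   # stand-in frame embeddings
goal = frames[-1]

dist = np.linalg.norm(frames - goal, axis=1)
subgoals = [t for t in range(1, len(dist) - 1)
            if dist[t] < dist[t - 1] and dist[t] < dist[t + 1]]  # local minima as candidates
print(subgoals)
```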

OmniControl: Control Any Joint at Any Time for Human Motion Generation

  • paper_url: http://arxiv.org/abs/2310.08580
  • repo_url: https://github.com/neu-vi/OmniControl
  • paper_authors: Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, Huaizu Jiang
  • for: Incorporating flexible spatial control signals into a text-conditioned human motion generation model
  • methods: Two complementary forms of guidance are used: analytic spatial guidance and realism guidance
  • results: Experiments show that OmniControl generates more realistic, coherent, and consistent human motion and markedly improves control over different joints.
    Abstract We present a novel approach named OmniControl for incorporating flexible spatial control signals into a text-conditioned human motion generation model based on the diffusion process. Unlike previous methods that can only control the pelvis trajectory, OmniControl can incorporate flexible spatial control signals over different joints at different times with only one model. Specifically, we propose analytic spatial guidance that ensures the generated motion can tightly conform to the input control signals. At the same time, realism guidance is introduced to refine all the joints to generate more coherent motion. Both the spatial and realism guidance are essential and they are highly complementary for balancing control accuracy and motion realism. By combining them, OmniControl generates motions that are realistic, coherent, and consistent with the spatial constraints. Experiments on HumanML3D and KIT-ML datasets show that OmniControl not only achieves significant improvement over state-of-the-art methods on pelvis control but also shows promising results when incorporating the constraints over other joints.
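A generic gradient-based guidance step toward a spatial constraint is sketched below; OmniControl's analytic spatial guidance and realism guidance are more involved than this single update:

```python
import torch

# Hedged sketch: nudge a sampled motion so one joint at one frame moves toward its
# spatial control signal by following the gradient of a simple distance loss.
motion = torch.randn(60, 22, 3, requires_grad=True)   # frames x joints x xyz
target = torch.tensor([0.5, 0.0, 1.0])                # control signal for that joint
frame_idx, joint_idx, step = 30, 15, 0.1

loss = ((motion[frame_idx, joint_idx] - target) ** 2).sum()
loss.backward()
guided_motion = (motion - step * motion.grad).detach()  # one guidance step
```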

HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion

  • paper_url: http://arxiv.org/abs/2310.08579
  • repo_url: https://github.com/snap-research/HyperHuman
  • paper_authors: Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov
  • for: The goal is to generate hyper-realistic human images with diverse layouts for in-the-wild scenarios.
  • methods: The HyperHuman framework has three main parts: 1) HumanVerse, a large-scale human-centric dataset of 340M images with annotations such as pose, depth, and surface normals; 2) a Latent Structural Diffusion Model that simultaneously denoises depth and surface normals along with the synthesized RGB image; and 3) a Structure-Guided Refiner that composes the predicted conditions for more detailed, higher-resolution generation.
  • results: Extensive experiments show that the framework achieves state-of-the-art performance, generating hyper-realistic human images under diverse scenarios.
    Abstract Despite significant advances in large-scale text-to-image models, achieving hyper-realistic human image generation remains a desirable yet unsolved task. Existing models like Stable Diffusion and DALL-E 2 tend to generate human images with incoherent parts or unnatural poses. To tackle these challenges, our key insight is that human image is inherently structural over multiple granularities, from the coarse-level body skeleton to fine-grained spatial geometry. Therefore, capturing such correlations between the explicit appearance and latent structure in one model is essential to generate coherent and natural human images. To this end, we propose a unified framework, HyperHuman, that generates in-the-wild human images of high realism and diverse layouts. Specifically, 1) we first build a large-scale human-centric dataset, named HumanVerse, which consists of 340M images with comprehensive annotations like human pose, depth, and surface normal. 2) Next, we propose a Latent Structural Diffusion Model that simultaneously denoises the depth and surface normal along with the synthesized RGB image. Our model enforces the joint learning of image appearance, spatial relationship, and geometry in a unified network, where each branch in the model complements to each other with both structural awareness and textural richness. 3) Finally, to further boost the visual quality, we propose a Structure-Guided Refiner to compose the predicted conditions for more detailed generation of higher resolution. Extensive experiments demonstrate that our framework yields the state-of-the-art performance, generating hyper-realistic human images under diverse scenarios. Project Page: https://snap-research.github.io/HyperHuman/

Learning to Act from Actionless Videos through Dense Correspondences

  • paper_url: http://arxiv.org/abs/2310.08576
  • repo_url: https://github.com/flow-diffusion/AVDC
  • paper_authors: Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, Joshua B. Tenenbaum
  • for: This work aims to construct a video-based robot policy that reliably executes diverse tasks across different robots and environments from a few video demonstrations, without any action annotations.
  • methods: Images serve as a task-agnostic representation encoding both state and action information, and text specifies robot goals; videos that "hallucinate" the robot executing actions are synthesized, and dense correspondences between frames are used to infer the closed-form actions to execute.
  • results: The approach is demonstrated on table-top manipulation and navigation tasks, and an open-source framework for efficient video modeling enables training high-fidelity policy models with four GPUs within a single day.
    Abstract In this work, we present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments from few video demonstrations without using any action annotations. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. By synthesizing videos that ``hallucinate'' robot executing actions and in combination with dense correspondences between frames, our approach can infer the closed-formed action to execute to an environment without the need of any explicit action labels. This unique capability allows us to train the policy solely based on RGB videos and deploy learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day.
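A toy version of reading an action out of dense correspondences: average the predicted displacement over the object region and treat it as a planar motion command (the paper recovers closed-form actions from its synthesized videos and dense correspondences):

```python
import numpy as np

# Stand-in dense flow between two synthesized frames, plus a toy object mask.
flow = np.random.randn(64, 64, 2) * 0.1 + np.array([2.0, -1.0])   # per-pixel displacement
mask = np.zeros((64, 64), dtype=bool); mask[20:40, 20:40] = True  # object region

action_xy = flow[mask].mean(axis=0)   # average displacement of the object region
print(action_xy)                      # ~[2, -1]: move accordingly
```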

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation

  • paper_url: http://arxiv.org/abs/2310.08541
  • repo_url: None
  • paper_authors: Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
  • for: Automatic image design and generation
  • methods: Multimodal iterative self-refinement with GPT-4V(ision)
  • results: Images of better semantic and visual quality, along with the ability to process input ideas with interleaved image-text sequences and follow design instructions.
    Abstract We introduce ``Idea to Image,'' a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation. Humans can quickly identify the characteristics of different text-to-image (T2I) models via iterative explorations. This enables them to efficiently convert their high-level generation ideas into effective T2I prompts that can produce good images. We investigate if systems based on large multimodal models (LMMs) can develop analogous multimodal self-refinement abilities that enable exploring unknown models or environments via self-refining tries. Idea2Img cyclically generates revised T2I prompts to synthesize draft images, and provides directional feedback for prompt revision, both conditioned on its memory of the probed T2I model's characteristics. The iterative self-refinement brings Idea2Img various advantages over vanilla T2I models. Notably, Idea2Img can process input ideas with interleaved image-text sequences, follow ideas with design instructions, and generate images of better semantic and visual qualities. The user preference study validates the efficacy of multimodal iterative self-refinement on automatic image design and generation.

Image2PCI – A Multitask Learning Framework for Estimating Pavement Condition Indices Directly from Images

  • paper_url: http://arxiv.org/abs/2310.08538
  • repo_url: None
  • paper_authors: Neema Jakisa Owor, Hang Du, Abdulateef Daud, Armstrong Aboah, Yaw Adu-Gyamfi
  • for: The paper aims to develop a unified multi-tasking model for estimating the Pavement Condition Index (PCI) directly from top-down pavement images.
  • methods: The proposed multi-task architecture combines a shared encoder for feature extraction with four decoders covering PCI estimation, crack detection, and segmentation, trained with deep learning techniques on a benchmarked, open pavement distress dataset.
  • results: The model achieves excellent accuracy on the related crack detection and segmentation tasks and can estimate PCI directly from images at real-time speeds, which, to the best of the authors' knowledge, is the first work to do so.
    Abstract The Pavement Condition Index (PCI) is a widely used metric for evaluating pavement performance based on the type, extent and severity of distresses detected on a pavement surface. In recent times, significant progress has been made in utilizing deep-learning approaches to automate PCI estimation process. However, the current approaches rely on at least two separate models to estimate PCI values -- one model dedicated to determining the type and extent and another for estimating their severity. This approach presents several challenges, including complexities, high computational resource demands, and maintenance burdens that necessitate careful consideration and resolution. To overcome these challenges, the current study develops a unified multi-tasking model that predicts the PCI directly from a top-down pavement image. The proposed architecture is a multi-task model composed of one encoder for feature extraction and four decoders to handle specific tasks: two detection heads, one segmentation head and one PCI estimation head. By multitasking, we are able to extract features from the detection and segmentation heads for automatically estimating the PCI directly from the images. The model performs very well on our benchmarked and open pavement distress dataset that is annotated for multitask learning (the first of its kind). To our best knowledge, this is the first work that can estimate PCI directly from an image at real time speeds while maintaining excellent accuracy on all related tasks for crack detection and segmentation.
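The shared-encoder, multi-decoder layout can be sketched schematically as below; layer sizes and head outputs are placeholders rather than the paper's architecture:

```python
import torch
import torch.nn as nn

# Schematic multi-task model: one shared encoder, plus detection, segmentation, and
# PCI-regression heads. Purely illustrative.
class MultiTaskPCI(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.seg_head = nn.Conv2d(64, 2, 1)                        # crack vs. background
        self.det_head = nn.Conv2d(64, 5, 1)                        # toy detection outputs
        self.pci_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x):
        feats = self.encoder(x)                                    # shared features
        return self.det_head(feats), self.seg_head(feats), 100.0 * self.pci_head(feats)

det, seg, pci = MultiTaskPCI()(torch.randn(1, 3, 256, 256))
print(det.shape, seg.shape, pci.shape)
```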

XAI Benchmark for Visual Explanation

  • paper_url: http://arxiv.org/abs/2310.08537
  • repo_url: None
  • paper_authors: Yifei Zhang, Siyi Gu, James Song, Bo Pan, Liang Zhao
  • for: The paper aims to provide a benchmark for evaluating the performance of visual explanation models on image data.
  • methods: It introduces a comprehensive visual explanation pipeline that integrates data loading, preprocessing, experimental setup, and model evaluation, enabling fair comparisons of different visual explanation techniques.
  • results: The paper reviews over 10 evaluation methods for visual explanation and conducts experiments on selected datasets with model-centered and ground-truth-centered metrics, demonstrating the benchmark's usefulness for evaluating visual explanation models.
    Abstract The rise of deep learning algorithms has led to significant advancements in computer vision tasks, but their "black box" nature has raised concerns regarding interpretability. Explainable AI (XAI) has emerged as a critical area of research aiming to open this "black box", and shed light on the decision-making process of AI models. Visual explanations, as a subset of Explainable Artificial Intelligence (XAI), provide intuitive insights into the decision-making processes of AI models handling visual data by highlighting influential areas in an input image. Despite extensive research conducted on visual explanations, most evaluations are model-centered since the availability of corresponding real-world datasets with ground truth explanations is scarce in the context of image data. To bridge this gap, we introduce an XAI Benchmark comprising a dataset collection from diverse topics that provide both class labels and corresponding explanation annotations for images. We have processed data from diverse domains to align with our unified visual explanation framework. We introduce a comprehensive Visual Explanation pipeline, which integrates data loading, preprocessing, experimental setup, and model evaluation processes. This structure enables researchers to conduct fair comparisons of various visual explanation techniques. In addition, we provide a comprehensive review of over 10 evaluation methods for visual explanation to assist researchers in effectively utilizing our dataset collection. To further assess the performance of existing visual explanation methods, we conduct experiments on selected datasets using various model-centered and ground truth-centered evaluation metrics. We envision this benchmark could facilitate the advancement of visual explanation models. The XAI dataset collection and easy-to-use code for evaluation are publicly accessible at https://xaidataset.github.io.
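One common ground-truth-centered metric for visual explanations, the pointing game, checks whether the saliency maximum falls inside the annotated explanation mask; a toy version is sketched below (the benchmark's datasets supply real masks):

```python
import numpy as np

# Pointing-game hit test: does the most salient pixel land inside the ground-truth mask?
def pointing_game_hit(saliency, gt_mask):
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(gt_mask[y, x])

saliency = np.zeros((8, 8)); saliency[3, 4] = 1.0
gt_mask = np.zeros((8, 8), dtype=bool); gt_mask[2:5, 3:6] = True
print(pointing_game_hit(saliency, gt_mask))  # True
```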

Animating Street View

  • paper_url: http://arxiv.org/abs/2310.08534
  • repo_url: https://github.com/jblsmith/street-view-movie-maker
  • paper_authors: Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
  • for: The system automatically brings street view imagery to life by populating it with naturally behaving, animated pedestrians and vehicles, planning their paths and traffic behavior while simulating plausible occlusion and shadowing effects.
  • methods: Existing people and vehicles are removed from the input image, moving objects are inserted with proper scale, angle, motion, and appearance, crowd behavior is simulated, and the scene is rendered with consistent lighting, visibility, occlusions, and shadows.
  • results: The system produces convincing animated results on a diverse range of street scenes, including regular still images and panoramas.
    Abstract We present a system that automatically brings street view imagery to life by populating it with naturally behaving, animated pedestrians and vehicles. Our approach is to remove existing people and vehicles from the input image, insert moving objects with proper scale, angle, motion, and appearance, plan paths and traffic behavior, as well as render the scene with plausible occlusion and shadowing effects. The system achieves these by reconstructing the still image street scene, simulating crowd behavior, and rendering with consistent lighting, visibility, occlusions, and shadows. We demonstrate results on a diverse range of street scenes including regular still images and panoramas.

UniPose: Detecting Any Keypoints

  • paper_url: http://arxiv.org/abs/2310.08530
  • repo_url: https://github.com/IDEA-Research/UniPose
  • paper_authors: Jie Yang, Ailing Zeng, Ruimao Zhang, Lei Zhang
  • for: This work explores a unified framework, UniPose, for detecting keypoints of any articulated (e.g., human and animal), rigid, or soft object, including fine-grained details such as eyes, legs, and paws, to support fine-grained visual understanding and manipulation.
  • methods: UniPose detects keypoints of any object via visual or textual prompts, unifying 13 keypoint detection datasets with 338 keypoints across 1,237 categories and over 400K instances, and aligns text-to-keypoint and image-to-keypoint through cross-modality contrastive learning.
  • results: UniPose shows strong fine-grained localization and generalization abilities across image styles, categories, and poses.
    Abstract This work proposes a unified framework called UniPose to detect keypoints of any articulated (e.g., human and animal), rigid, and soft objects via visual or textual prompts for fine-grained vision understanding and manipulation. Keypoint is a structure-aware, pixel-level, and compact representation of any object, especially articulated objects. Existing fine-grained promptable tasks mainly focus on object instance detection and segmentation but often fail to identify fine-grained granularity and structured information of image and instance, such as eyes, leg, paw, etc. Meanwhile, prompt-based keypoint detection is still under-explored. To bridge the gap, we make the first attempt to develop an end-to-end prompt-based keypoint detection framework called UniPose to detect keypoints of any objects. As keypoint detection tasks are unified in this framework, we can leverage 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances to train a generic keypoint detection model. UniPose can effectively align text-to-keypoint and image-to-keypoint due to the mutual enhancement of textual and visual prompts based on the cross-modality contrastive learning optimization objectives. Our experimental results show that UniPose has strong fine-grained localization and generalization abilities across image styles, categories, and poses. Based on UniPose as a generalist keypoint detector, we hope it could serve fine-grained visual perception, understanding, and generation.
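The cross-modality alignment can be sketched with a generic InfoNCE-style contrastive loss between text and keypoint embeddings; UniPose's exact objective may differ:

```python
import torch
import torch.nn.functional as F

# Symmetric InfoNCE sketch: matched text/keypoint pairs are pulled together, mismatched
# pairs pushed apart. Embedding dimensions and temperature are placeholders.
def contrastive_loss(text_emb, kpt_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    kpt_emb = F.normalize(kpt_emb, dim=-1)
    logits = text_emb @ kpt_emb.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(len(text_emb))                 # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

print(contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```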

GaussianDreamer: Fast Generation from Text to 3D Gaussian Splatting with Point Cloud Priors

  • paper_url: http://arxiv.org/abs/2310.08529
  • repo_url: https://github.com/hustvl/GaussianDreamer
  • paper_authors: Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, Xinggang Wang
  • for: This work bridges 2D and 3D diffusion models via the recent 3D Gaussian splatting representation to achieve high-quality and efficient 3D generation.
  • methods: A fast 3D generation framework, GaussianDreamer, in which the 3D diffusion model provides point cloud priors for initialization and the 2D diffusion model enriches geometry and appearance; noisy point growing and color perturbation enhance the initialized Gaussians (a code sketch follows this entry).
  • results: GaussianDreamer generates a high-quality 3D instance within 25 minutes on a single GPU, much faster than previous methods, and the generated instances can be rendered directly in real time. Demos and code: https://taoranyi.com/gaussiandreamer/.
    Abstract In recent times, the generation of 3D assets from text prompts has shown impressive results. Both 2D and 3D diffusion models can generate decent 3D objects based on prompts. 3D diffusion models have good 3D consistency, but their quality and generalization are limited as trainable 3D data is expensive and hard to obtain. 2D diffusion models enjoy strong abilities of generalization and fine generation, but the 3D consistency is hard to guarantee. This paper attempts to bridge the power from the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. A fast 3D generation framework, named as GaussianDreamer, is proposed, where the 3D diffusion model provides point cloud priors for initialization and the 2D diffusion model enriches the geometry and appearance. Operations of noisy point growing and color perturbation are introduced to enhance the initialized Gaussians. Our GaussianDreamer can generate a high-quality 3D instance within 25 minutes on one GPU, much faster than previous methods, while the generated instances can be directly rendered in real time. Demos and code are available at https://taoranyi.com/gaussiandreamer/.
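
The abstract mentions "noisy point growing and color perturbation" to enhance the Gaussians initialized from the 3D diffusion prior. Below is a minimal NumPy sketch of what such operations could look like; the sampling rule, noise scales, and function name are assumptions for illustration only.

```python
import numpy as np

def grow_and_perturb(points, colors, num_new=2048, noise_std=0.01, color_jitter=0.05, seed=0):
    """Densify an initial point cloud by jittered resampling and perturb colors.

    points: (N, 3) float array; colors: (N, 3) float array in [0, 1]."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(points), size=num_new)               # parent points to copy
    new_pts = points[idx] + rng.normal(0.0, noise_std, (num_new, 3))
    new_cols = np.clip(colors[idx] + rng.uniform(-color_jitter, color_jitter, (num_new, 3)),
                       0.0, 1.0)
    return np.vstack([points, new_pts]), np.vstack([colors, new_cols])
```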

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

  • paper_url: http://arxiv.org/abs/2310.08528
  • repo_url: https://github.com/hustvl/4DGaussians
  • paper_authors: Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, Xinggang Wang
  • for: Efficient dynamic scene rendering, including modeling complex motions at high resolution.
  • methods: Proposes 4D Gaussian Splatting (4D-GS), which builds an efficient deformation field and connects adjacent Gaussians via a HexPlane to model Gaussian motions and shape deformations (a code sketch follows this entry).
  • results: Achieves real-time rendering at 70 FPS at 800x800 resolution on an RTX 3090 GPU, with quality comparable to or higher than previous state-of-the-art methods. More demos and code: https://guanjunwu.github.io/4dgs/.
    Abstract Representing and rendering dynamic scenes has been an important but challenging task. Especially, to accurately model complex motions, high efficiency is usually hard to maintain. We introduce the 4D Gaussian Splatting (4D-GS) to achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency. An efficient deformation field is constructed to model both Gaussian motions and shape deformations. Different adjacent Gaussians are connected via a HexPlane to produce more accurate position and shape deformations. Our 4D-GS method achieves real-time rendering under high resolutions, 70 FPS at a 800$\times$800 resolution on an RTX 3090 GPU, while maintaining comparable or higher quality than previous state-of-the-art methods. More demos and code are available at https://guanjunwu.github.io/4dgs/.
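
The abstract describes a deformation field in which adjacent Gaussians are connected via a HexPlane to predict position and shape deformations. The toy module below sketches the general HexPlane idea (six 2D feature planes indexed by pairs of (x, y, z, t), fused by a small MLP into per-Gaussian offsets); resolutions, fusion rule, and head design are placeholder assumptions, not the 4D-GS architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HexPlaneDeformation(nn.Module):
    """Toy HexPlane-style deformation field predicting a position offset per Gaussian."""
    PAIRS = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]  # xy, xz, yz, xt, yt, zt

    def __init__(self, feat_dim=16, res=64):
        super().__init__()
        self.planes = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat_dim, res, res)) for _ in self.PAIRS]
        )
        self.head = nn.Sequential(nn.Linear(6 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, xyzt):                                      # xyzt: (N, 4) in [-1, 1]
        feats = []
        for plane, (a, b) in zip(self.planes, self.PAIRS):
            grid = xyzt[:, [a, b]].view(1, -1, 1, 2)              # (1, N, 1, 2)
            f = F.grid_sample(plane, grid, align_corners=True)    # (1, C, N, 1)
            feats.append(f.squeeze(0).squeeze(-1).t())            # (N, C)
        return self.head(torch.cat(feats, dim=-1))                # (N, 3) offsets
```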

Unsupervised Learning of Object-Centric Embeddings for Cell Instance Segmentation in Microscopy Images

  • paper_url: http://arxiv.org/abs/2310.08501
  • repo_url: https://github.com/funkelab/cellulus
  • paper_authors: Steffen Wolf, Manan Lalit, Henry Westmacott, Katie McDole, Jan Funke
  • for: This paper addresses the segmentation of objects in microscopy images, an important task in biomedical applications.
  • methods: The paper introduces object-centric embeddings (OCEs), which embed image patches such that spatial offsets between patches cropped from the same object are preserved, learned through a self-supervised offset-prediction task (a loss sketch follows this entry).
  • results: The paper shows that the OCE method can be used to delineate individual objects and obtain instance segmentations, and evaluates the method on nine diverse large-scale microscopy datasets. The results show that the method leads to substantially improved results compared to state-of-the-art baselines on six out of nine datasets, and performs on par on the remaining three datasets. If ground-truth annotations are available, the method can serve as an excellent starting point for supervised training, reducing the required amount of ground-truth needed by one order of magnitude.
    Abstract Segmentation of objects in microscopy images is required for many biomedical applications. We introduce object-centric embeddings (OCEs), which embed image patches such that the spatial offsets between patches cropped from the same object are preserved. Those learnt embeddings can be used to delineate individual objects and thus obtain instance segmentations. Here, we show theoretically that, under assumptions commonly found in microscopy images, OCEs can be learnt through a self-supervised task that predicts the spatial offset between image patches. Together, this forms an unsupervised cell instance segmentation method which we evaluate on nine diverse large-scale microscopy datasets. Segmentations obtained with our method lead to substantially improved results, compared to state-of-the-art baselines on six out of nine datasets, and perform on par on the remaining three datasets. If ground-truth annotations are available, our method serves as an excellent starting point for supervised training, reducing the required amount of ground-truth needed by one order of magnitude, thus substantially increasing the practical applicability of our method. Source code is available at https://github.com/funkelab/cellulus.
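
The abstract states that OCEs can be learned through a self-supervised task predicting the spatial offset between image patches. A minimal sketch of such an objective is shown below, assuming an embedding network that maps a patch to a 2D vector; the paper's exact loss and patch-sampling scheme are not reproduced.

```python
import torch
import torch.nn.functional as F

def offset_consistency_loss(embed_net, patches, centers):
    """Self-supervised objective in the spirit of object-centric embeddings:
    the difference between the embeddings of two patches from the same object
    should predict their spatial offset.

    embed_net: maps a (B, C, H, W) patch batch to (B, 2) embeddings.
    patches:   (B, 2, C, H, W) pairs of patches from the same object.
    centers:   (B, 2, 2) pixel coordinates of the two patch centers."""
    e_a = embed_net(patches[:, 0])                     # (B, 2)
    e_b = embed_net(patches[:, 1])                     # (B, 2)
    true_offset = (centers[:, 1] - centers[:, 0]).float()
    return F.mse_loss(e_b - e_a, true_offset)
```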

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.08465
  • repo_url: https://github.com/showlab/MotionDirector
  • paper_authors: Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou
  • for: Adapting pre-trained text-to-video diffusion models to generate videos with customized motions.
  • methods: Proposes MotionDirector, a dual-path LoRA architecture that decouples the learning of appearance and motion, together with an appearance-debiased temporal loss (a LoRA code sketch follows this entry).
  • results: Experiments show the method generates customized-motion videos with diverse appearances, and supports downstream applications such as mixing the appearance and motion of different videos and animating a single image with customized motions.
    Abstract Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generations. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. For example, generating a video with a car moving in a prescribed manner under specific camera movements to make a movie, or a video illustrating how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance like subject or style, yet unexplored for motion. It is straightforward to extend mainstream adaption methods for motion customization, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptions (LoRAs). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRAs architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions. Our code and model weights will be released.
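
MotionDirector builds on Low-Rank Adaptations (LoRAs) arranged in a dual-path architecture. The sketch below shows only the basic LoRA building block on a frozen linear layer; rank, scaling, and initialization follow common practice and are not the paper's specific settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: y = W x + scale * B(A(x))."""
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # keep the pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)             # start as an identity-preserving delta
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```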

Proving the Potential of Skeleton Based Action Recognition to Automate the Analysis of Manual Processes

  • paper_url: http://arxiv.org/abs/2310.08451
  • repo_url: None
  • paper_authors: Marlin Berger, Frederik Cloppenburg, Jens Eufinger, Thomas Gries
  • for: The paper aims to improve the analysis and monitoring of manual processes in manufacturing sectors such as textiles and electronics by using machine learning (ML) methods.
  • methods: The paper uses a skeleton-based action recognition approach, a recently successful method in machine vision, to detect the current motion performed by an operator in manual assembly, and develops an ML pipeline enabling extensive research on different pre-processing methods and neural nets.
  • results: The authors find that ML methods can provide higher flexibility, self-sufficiency, and lower costs than traditional methods such as Methods-Time-Measurement (MTM), and show that the approach transfers to all kinds of manual processes, not just manual assembly.
    Abstract In manufacturing sectors such as textiles and electronics, manual processes are a fundamental part of production. The analysis and monitoring of the processes is necessary for efficient production design. Traditional methods for analyzing manual processes are complex, expensive, and inflexible. Compared to established approaches such as Methods-Time-Measurement (MTM), machine learning (ML) methods promise: Higher flexibility, self-sufficient & permanent use, lower costs. In this work, based on a video stream, the current motion class in a manual assembly process is detected. With information on the current motion, Key-Performance-Indicators (KPIs) can be derived easily. A skeleton-based action recognition approach is taken, as this field recently shows major success in machine vision tasks. For skeleton-based action recognition in manual assembly, no sufficient pre-work could be found. Therefore, a ML pipeline is developed, to enable extensive research on different (pre-) processing methods and neural nets. Suitable well generalizing approaches are found, proving the potential of ML to enhance analyzation of manual processes. Models detect the current motion, performed by an operator in manual assembly, but the results can be transferred to all kinds of manual processes.

Assessing of Soil Erosion Risk Through Geoinformation Sciences and Remote Sensing – A Review

  • paper_url: http://arxiv.org/abs/2310.08430
  • repo_url: None
  • paper_authors: Lachezar Filchev, Vasil Kolev
  • for: This review assesses different types and structures of soil erosion models and their applications worldwide.
  • methods: The review covers spatial-analysis approaches based on geographic information systems (GIS) for erosion risk assessment, including the widely used USLE and RUSLE models (in the USA and worldwide), the MESALES model, and more experimental methods such as artificial intelligence, machine learning, and deep learning (a USLE sketch follows this entry).
  • results: The review draws a prospectus for the future development of soil erosion risk assessment, including the adoption of artificial intelligence techniques.
    Abstract During past decades a marked manifestation of widespread erosion phenomena was studied worldwide. Global conservation community has launched campaigns at local, regional and continental level in developing countries for preservation of soil resources in order not only to stop or mitigate human impact on nature but also to improve life in rural areas introducing new approaches for soil cultivation. After the adoption of Sustainable Development Goals of UNs and launching several world initiatives such as the Land Degradation Neutrality (LDN) the world came to realize the very importance of the soil resources on which the biosphere relies for its existence. The main goal of the chapter is to review different types and structures erosion models as well as their applications. Several methods using spatial analysis capabilities of geographic information systems (GIS) are in operation for soil erosion risk assessment, such as Universal Soil Loss Equation (USLE), Revised Universal Soil Loss Equation (RUSLE) in operation worldwide and in the USA and MESALES model. These and more models are being discussed in the present work alongside more experimental models and methods for assessing soil erosion risk such as Artificial Intelligence (AI), Machine and Deep Learning, etc. At the end of this work, a prospectus for the future development of soil erosion risk assessment is drawn.
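
Since the review centers on USLE/RUSLE-style models, here is the Universal Soil Loss Equation, A = R * K * LS * C * P, as a small Python helper; the example factor values are invented purely for illustration, and units depend on the chosen unit system.

```python
def usle_soil_loss(R, K, LS, C, P):
    """Universal Soil Loss Equation: A = R * K * LS * C * P.

    A  - average annual soil loss (units depend on the unit system)
    R  - rainfall-runoff erosivity factor
    K  - soil erodibility factor
    LS - combined slope length and steepness factor
    C  - cover-management factor
    P  - support-practice factor"""
    return R * K * LS * C * P

# Illustrative (made-up) inputs:
print(usle_soil_loss(R=1200.0, K=0.3, LS=1.5, C=0.2, P=0.8))
```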

Revisiting Data Augmentation for Rotational Invariance in Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2310.08429
  • repo_url: https://github.com/facundoq/rotational_invariance_data_augmentation
  • paper_authors: Facundo Manuel Quiroga, Franco Ronchetti, Laura Lanzarini, Aurelio Fernandez-Bariviera
  • for: Studies how best to achieve rotational invariance in CNNs for image classification.
  • methods: Compares data augmentation against two specialized CNN variants, Spatial Transformer Networks and Group Equivariant CNNs, for achieving rotational invariance or equivariance (an augmentation sketch follows this entry).
  • results: Networks trained with data augmentation alone classify rotated images nearly as well as unrotated ones, at the cost of extra training time; the specialized models bring no significant accuracy gain, and the analysis identifies which layers help the network encode rotational invariance.
    Abstract Convolutional Neural Networks (CNN) offer state of the art performance in various computer vision tasks. Many of those tasks require different subtypes of affine invariances (scale, rotational, translational) to image transformations. Convolutional layers are translation equivariant by design, but in their basic form lack invariances. In this work we investigate how best to include rotational invariance in a CNN for image classification. Our experiments show that networks trained with data augmentation alone can classify rotated images nearly as well as in the normal unrotated case; this increase in representational power comes only at the cost of training time. We also compare data augmentation versus two modified CNN models for achieving rotational invariance or equivariance, Spatial Transformer Networks and Group Equivariant CNNs, finding no significant accuracy increase with these specialized methods. In the case of data augmented networks, we also analyze which layers help the network to encode the rotational invariance, which is important for understanding its limitations and how to best retrain a network with data augmentation to achieve invariance to rotation.
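
The data-augmentation baseline studied in the paper amounts to rotating every training image by a random angle. A minimal torchvision pipeline of that kind is sketched below; the dataset and normalization constants are generic placeholders, not the paper's experimental setup.

```python
from torchvision import transforms

# Rotation-augmentation pipeline: every training image is rotated by a random
# angle drawn uniformly from [-180, 180] degrees before normalization.
train_tf = transforms.Compose([
    transforms.RandomRotation(degrees=180, expand=False),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),   # single-channel placeholder stats
])

# Usage with any torchvision dataset, e.g. MNIST:
# from torchvision.datasets import MNIST
# train_set = MNIST(root="data", train=True, download=True, transform=train_tf)
```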

Visual Attention-Prompted Prediction and Learning

  • paper_url: http://arxiv.org/abs/2310.08420
  • repo_url: None
  • paper_authors: Yifei Zhang, Siyi Gu, Bo Pan, Guangji Bai, Xiaofeng Yang, Liang Zhao
  • for: Improving model predictions with visual attention prompts while avoiding the time and computational cost of attention-guided retraining.
  • methods: Proposes attention-prompted prediction without model retraining, with perturbation-based attention-map modification and optimization-based mask aggregation (with a new weight-learning function) to handle incomplete attention prompts.
  • results: Experiments on two datasets show that the framework enhances predictions for samples both with and without provided attention prompts.
    Abstract Explanation(attention)-guided learning is a method that enhances a model's predictive power by incorporating human understanding during the training phase. While attention-guided learning has shown promising results, it often involves time-consuming and computationally expensive model retraining. To address this issue, we introduce the attention-prompted prediction technique, which enables direct prediction guided by the attention prompt without the need for model retraining. However, this approach presents several challenges, including: 1) How to incorporate the visual attention prompt into the model's decision-making process and leverage it for future predictions even in the absence of a prompt? and 2) How to handle the incomplete information from the visual attention prompt? To tackle these challenges, we propose a novel framework called Visual Attention-Prompted Prediction and Learning, which seamlessly integrates visual attention prompts into the model's decision-making process and adapts to images both with and without attention prompts for prediction. To address the incomplete information of the visual attention prompt, we introduce a perturbation-based attention map modification method. Additionally, we propose an optimization-based mask aggregation method with a new weight learning function for adaptive perturbed annotation aggregation in the attention map modification process. Our overall framework is designed to learn in an attention-prompt guided multi-task manner to enhance future predictions even for samples without attention prompts and trained in an alternating manner for better convergence. Extensive experiments conducted on two datasets demonstrate the effectiveness of our proposed framework in enhancing predictions for samples, both with and without provided prompts.

Towards Design and Development of an ArUco Markers-Based Quantitative Surface Tactile Sensor

  • paper_url: http://arxiv.org/abs/2310.08398
  • repo_url: None
  • paper_authors: Ozdemir Can Kara, Charles Everson, Farshid Alambeigi
  • for: Quantifying the qualitative image outputs of a vision-based tactile sensor (VTS).
  • methods: Proposes a Quantitative Surface Tactile Sensor (QS-TS) that estimates the sensor's gel-layer deformation in real time from embedded ArUco markers, enabling safe, autonomous tactile manipulation and servoing of delicate objects with robotic manipulators (an ArUco detection sketch follows this entry).
  • results: Experiments show QS-TS estimates the gel-layer deformation with a relative error below 5%.
    Abstract In this paper, with the goal of quantifying the qualitative image outputs of a Vision-based Tactile Sensor (VTS), we present the design, fabrication, and characterization of a novel Quantitative Surface Tactile Sensor (called QS-TS). QS-TS directly estimates the sensor's gel layer deformation in real-time enabling safe and autonomous tactile manipulation and servoing of delicate objects using robotic manipulators. The core of the proposed sensor is the utilization of miniature 1.5 mm x 1.5 mm synthetic square markers with inner binary patterns and a broad black border, called ArUco Markers. Each ArUco marker can provide real-time camera pose estimation that, in our design, is used as a quantitative measure for obtaining deformation of the QS-TS gel layer. Moreover, thanks to the use of ArUco markers, we propose a unique fabrication procedure that mitigates various challenges associated with the fabrication of the existing marker-based VTSs and offers an intuitive and less-arduous method for the construction of the VTS. Remarkably, the proposed fabrication facilitates the integration and adherence of markers with the gel layer to robustly and reliably obtain a quantitative measure of deformation in real-time regardless of the orientation of ArUco Markers. The performance and efficacy of the proposed QS-TS in estimating the deformation of the sensor's gel layer were experimentally evaluated and verified. Results demonstrate the phenomenal performance of the QS-TS in estimating the deformation of the gel layer with a relative error of <5%.
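
QS-TS tracks ArUco markers embedded in the sensor's gel layer. The sketch below shows generic ArUco detection with OpenCV's aruco module and how per-marker displacements against a rest frame could serve as a deformation proxy; it is not the paper's calibration pipeline, and the aruco API differs slightly across OpenCV versions (OpenCV >= 4.7 wraps detection in cv2.aruco.ArucoDetector; opencv-contrib-python is required).

```python
import cv2

def detect_marker_centers(image_bgr, dictionary=cv2.aruco.DICT_4X4_50):
    """Return {marker_id: (x, y) centroid} for all detected ArUco markers."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    aruco_dict = cv2.aruco.getPredefinedDictionary(dictionary)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    centers = {}
    if ids is not None:
        for marker_id, c in zip(ids.flatten(), corners):
            centers[int(marker_id)] = c.reshape(4, 2).mean(axis=0)
    return centers

# Illustrative deformation estimate: per-marker displacement vs. a rest frame.
# disp = {k: centers_now[k] - centers_rest[k]
#         for k in centers_now if k in centers_rest}
```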

Hyp-UML: Hyperbolic Image Retrieval with Uncertainty-aware Metric Learning

  • paper_url: http://arxiv.org/abs/2310.08390
  • repo_url: None
  • paper_authors: Shiyang Yan, Zongxuan Liu, Lin Xu
  • for: Metric learning for image retrieval and classification, i.e., learning features and aligning them in a metric space.
  • methods: Proposes Hyp-UML, a hyperbolic image embedding with uncertainty-aware metric learning, in two variants: one for contrastive learning and one for conventional margin-based metric learning (a Poincaré-distance sketch follows this entry).
  • results: Experiments show state-of-the-art results among related methods, and a comprehensive ablation study validates the effectiveness of each component.
    Abstract Metric learning plays a critical role in training image retrieval and classification. It is also a key algorithm in representation learning, e.g., for feature learning and its alignment in metric space. Hyperbolic embedding has been recently developed. Compared to the conventional Euclidean embedding in most of the previously developed models, Hyperbolic embedding can be more effective in representing the hierarchical data structure. Second, uncertainty estimation/measurement is a long-lasting challenge in artificial intelligence. Successful uncertainty estimation can improve a machine learning model's performance, robustness, and security. In Hyperbolic space, uncertainty measurement is at least with equivalent, if not more, critical importance. In this paper, we develop a Hyperbolic image embedding with uncertainty-aware metric learning for image retrieval. We call our method Hyp-UML: Hyperbolic Uncertainty-aware Metric Learning. Our contribution are threefold: we propose an image embedding algorithm based on Hyperbolic space, with their corresponding uncertainty value; we propose two types of uncertainty-aware metric learning, for the popular Contrastive learning and conventional margin-based metric learning, respectively. We perform extensive experimental validations to prove that the proposed algorithm can achieve state-of-the-art results among related methods. The comprehensive ablation study validates the effectiveness of each component of the proposed algorithm.
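
Hyperbolic embeddings are typically compared with the geodesic distance of the Poincaré ball model. The helper below implements that standard distance; how Hyp-UML combines it with its uncertainty estimates and the contrastive/margin losses is not reproduced here.

```python
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance in the Poincare ball model of hyperbolic space:
    d(x, y) = arccosh(1 + 2 * ||x - y||^2 / ((1 - ||x||^2) * (1 - ||y||^2))).

    x, y: (..., D) tensors with Euclidean norm < 1."""
    sq_diff = (x - y).pow(2).sum(dim=-1)
    x_norm = x.pow(2).sum(dim=-1).clamp(max=1 - eps)
    y_norm = y.pow(2).sum(dim=-1).clamp(max=1 - eps)
    arg = 1 + 2 * sq_diff / ((1 - x_norm) * (1 - y_norm))
    return torch.acosh(arg.clamp(min=1 + eps))
```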

MeanAP-Guided Reinforced Active Learning for Object Detection

  • paper_url: http://arxiv.org/abs/2310.08387
  • repo_url: None
  • paper_authors: Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo
  • for: Training high-performing object detectors with minimal labeled data by actively selecting the most informative instances to annotate and incorporating them into the task learner.
  • methods: Uses the task model's MeanAP directly as the reward for an LSTM-based reinforcement-learning sampling agent that selects subsequent training instances via policy gradient, with fast look-up tables to keep MeanAP evaluation tractable.
  • results: On PASCAL VOC and MS COCO, MAGRAL outperforms recent state-of-the-art methods with substantial performance gains, establishing a solid baseline for reinforced active object detection.
    Abstract Active learning presents a promising avenue for training high-performance models with minimal labeled data, achieved by judiciously selecting the most informative instances to label and incorporating them into the task learner. Despite notable advancements in active learning for image recognition, metrics devised or learned to gauge the information gain of data, crucial for query strategy design, do not consistently align with task model performance metrics, such as Mean Average Precision (MeanAP) in object detection tasks. This paper introduces MeanAP-Guided Reinforced Active Learning for Object Detection (MAGRAL), a novel approach that directly utilizes the MeanAP metric of the task model to devise a sampling strategy employing a reinforcement learning-based sampling agent. Built upon LSTM architecture, the agent efficiently explores and selects subsequent training instances, and optimizes the process through policy gradient with MeanAP serving as reward. Recognizing the time-intensive nature of MeanAP computation at each step, we propose fast look-up tables to expedite agent training. We assess MAGRAL's efficacy across popular benchmarks, PASCAL VOC and MS COCO, utilizing different backbone architectures. Empirical findings substantiate MAGRAL's superiority over recent state-of-the-art methods, showcasing substantial performance gains. MAGRAL establishes a robust baseline for reinforced active object detection, signifying its potential in advancing the field.

AutoVP: An Automated Visual Prompting Framework and Benchmark

  • paper_url: http://arxiv.org/abs/2310.08381
  • repo_url: https://github.com/IBM/AutoVP
  • paper_authors: Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Sijia Liu, Tsung-Yi Ho
  • for: Proposes AutoVP, an expandable framework for automating Visual Prompting (VP) design choices, together with 12 downstream image-classification tasks that serve as a holistic VP-performance benchmark.
  • methods: The design space covers 1) joint optimization of the prompts; 2) selection of pre-trained models, including image classifiers and text-image encoders; and 3) output-mapping strategies, including nonparametric and trainable label mapping (a visual-prompt sketch follows this entry).
  • results: AutoVP outperforms the best-known VP methods by up to 6.7% in accuracy and attains a maximum performance increase of 27.5% over the linear-probing (LP) baseline.
    Abstract Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, having up to 6.7% improvement in accuracy; and attains a maximum performance increase of 27.5% compared to linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: serving both as an efficient tool for hyperparameter tuning on VP design choices, and as a comprehensive benchmark that can reasonably be expected to accelerate VP's development. The source code is available at https://github.com/IBM/AutoVP.
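
One common visual-prompt design in AutoVP's search space is a trainable pixel frame added around the (resized) input image, with the pre-trained backbone kept frozen. The module below sketches that idea; pad width, image size, and initialization are arbitrary illustrative choices, not AutoVP's selected configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PaddingVisualPrompt(nn.Module):
    """Trainable border prompt added around a resized input image."""
    def __init__(self, image_size=224, pad=16):
        super().__init__()
        self.pad = pad
        self.inner = image_size - 2 * pad
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))
        mask = torch.ones(1, 1, image_size, image_size)
        mask[:, :, pad:-pad, pad:-pad] = 0.0        # prompt lives only on the border
        self.register_buffer("mask", mask)

    def forward(self, x):                            # x: (B, 3, H, W)
        x = F.interpolate(x, size=(self.inner, self.inner), mode="bilinear",
                          align_corners=False)
        x = F.pad(x, [self.pad] * 4)                 # place the image in the center
        return x + self.prompt * self.mask           # add the learnable border
```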

Worst-Case Morphs using Wasserstein ALI and Improved MIPGAN

  • paper_url: http://arxiv.org/abs/2310.08371
  • repo_url: None
  • paper_authors: Una M. Kelly, Meike Nauta, Lu Liu, Luuk J. Spreeuwers, Raymond N. J. Veldhuis
  • for: Generating worst-case face morphs that challenge the security of face recognition (FR) systems.
  • methods: Combines Adversarially Learned Inference (ALI) with concepts from Wasserstein GANs trained with Gradient Penalty (WGAN-GP) into WALI, finetuned with loss functions designed to improve the ability to manipulate identity information in facial images (a gradient-penalty sketch follows this entry).
  • results: WALI generates morphs that are more challenging for FR systems than landmark- or GAN-based morphs, even when the FR system is unknown, and the findings also improve MIPGAN, an existing StyleGAN-based morph generator.
    Abstract A morph is a combination of two separate facial images and contains identity information of two different people. When used in an identity document, both people can be authenticated by a biometric Face Recognition (FR) system. Morphs can be generated using either a landmark-based approach or approaches based on deep learning such as Generative Adversarial Networks (GAN). In a recent paper, we introduced a \emph{worst-case} upper bound on how challenging morphing attacks can be for an FR system. The closer morphs are to this upper bound, the bigger the challenge they pose to FR. We introduced an approach with which it was possible to generate morphs that approximate this upper bound for a known FR system (white box), but not for unknown (black box) FR systems. In this paper, we introduce a morph generation method that can approximate worst-case morphs even when the FR system is not known. A key contribution is that we include the goal of generating difficult morphs \emph{during} training. Our method is based on Adversarially Learned Inference (ALI) and uses concepts from Wasserstein GANs trained with Gradient Penalty, which were introduced to stabilise the training of GANs. We include these concepts to achieve similar improvement in training stability and call the resulting method Wasserstein ALI (WALI). We finetune WALI using loss functions designed specifically to improve the ability to manipulate identity information in facial images and show how it can generate morphs that are more challenging for FR systems than landmark- or GAN-based morphs. We also show how our findings can be used to improve MIPGAN, an existing StyleGAN-based morph generator.
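
WALI borrows the gradient penalty from Wasserstein GANs trained with Gradient Penalty to stabilize training. Below is the standard WGAN-GP penalty term (Gulrajani et al.); the critic interface is a generic assumption, not the paper's exact network.

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """Encourage unit gradient norm of the critic on random interpolates
    between real and fake samples. `critic` maps images to scalar scores."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interpolates)
    grads = torch.autograd.grad(outputs=scores, inputs=interpolates,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True, retain_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```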

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.08370
  • repo_url: https://github.com/Nightmare-n/UniPAD
  • paper_authors: Honghui Yang, Sha Zhang, Di Huang, Xiaoyang Wu, Haoyi Zhu, Tong He, Shixiang Tang, Hengshuang Zhao, Qibo Qiu, Binbin Lin, Xiaofei He, Wanli Ouyang
  • for: Proposes UniPAD, a novel self-supervised pre-training paradigm designed to improve the effectiveness of feature learning for autonomous driving.
  • methods: UniPAD uses 3D volumetric differentiable rendering to implicitly encode 3D space and facilitate the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections (a rendering-weights sketch follows this entry).
  • results: Extensive experiments on various downstream 3D tasks demonstrate the feasibility and effectiveness of UniPAD, with significant improvements over lidar-, camera-, and lidar-camera-based baselines and state-of-the-art results in 3D object detection and 3D semantic segmentation on the nuScenes validation set.
    Abstract In the context of autonomous driving, the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success, most methods follow the ideas originally designed for 2D images. In this paper, we present UniPAD, a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method enables seamless integration into both 2D and 3D frameworks, enabling a more holistic comprehension of the scenes. We manifest the feasibility and effectiveness of UniPAD by conducting extensive experiments on various downstream 3D tasks. Our method significantly improves lidar-, camera-, and lidar-camera-based baseline by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic segmentation on the nuScenes validation set, achieving state-of-the-art results in comparison with previous methods. The code will be available at https://github.com/Nightmare-n/UniPAD.
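
UniPAD's pretext task relies on 3D volumetric differentiable rendering. The function below sketches the generic alpha-compositing step used by such renderers (densities to alphas to transmittance-weighted colors); it is the textbook formulation, not UniPAD's specific rendering head.

```python
import torch

def composite_along_ray(densities, colors, deltas):
    """Differentiable volume rendering along rays.

    densities: (R, S) per-sample densities, colors: (R, S, C) per-sample colors,
    deltas: (R, S) spacings between consecutive samples on each ray."""
    alpha = 1.0 - torch.exp(-densities * deltas)                       # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                                # accumulated transmittance
    weights = alpha * trans                                            # (R, S)
    rendered = (weights.unsqueeze(-1) * colors).sum(dim=1)             # (R, C)
    return rendered, weights
```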

Mapping Memes to Words for Multimodal Hateful Meme Classification

  • paper_url: http://arxiv.org/abs/2310.08368
  • repo_url: https://github.com/miccunifi/issues
  • paper_authors: Giovanni Burbi, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
  • for: Detecting hateful content in multimodal image-text memes to improve identification and moderation of hateful content online.
  • methods: Proposes ISSUES, which leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of memes.
  • results: ISSUES achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets. Code and pre-trained models: https://github.com/miccunifi/ISSUES.
    Abstract Multimodal image-text memes are prevalent on the internet, serving as a unique form of communication that combines visual and textual elements to convey humor, ideas, or emotions. However, some memes take a malicious turn, promoting hateful content and perpetuating discrimination. Detecting hateful memes within this multimodal context is a challenging task that requires understanding the intertwined meaning of text and images. In this work, we address this issue by proposing a novel approach named ISSUES for multimodal hateful meme classification. ISSUES leverages a pre-trained CLIP vision-language model and the textual inversion technique to effectively capture the multimodal semantic content of the memes. The experiments show that our method achieves state-of-the-art results on the Hateful Memes Challenge and HarMeme datasets. The code and the pre-trained models are publicly available at https://github.com/miccunifi/ISSUES.

A Generic Software Framework for Distributed Topological Analysis Pipelines

  • paper_url: http://arxiv.org/abs/2310.08339
  • repo_url: None
  • paper_authors: Eve Le Guillou, Michael Will, Pierre Guillou, Jonas Lukasczyk, Pierre Fortin, Christoph Garth, Julien Tierny
  • for: A software framework supporting topological analysis pipelines in a distributed-memory model. Unlike recent work that implements individual topology-based algorithms for distributed memory, this is a general-purpose, generic framework in which a sequence of topological algorithms interact, possibly on distinct numbers of processes.
  • methods: The framework is instantiated with the MPI model within the Topology ToolKit (TTK); the paper documents the algorithmic and software-engineering challenges encountered, provides a taxonomy of distributed-memory topological algorithms by their communication needs, and gives examples of hybrid MPI+thread parallelizations (an MPI sketch follows this entry).
  • results: Detailed performance analyses show parallel efficiencies between 20% and 80% depending on the algorithm, with negligible computation-time overhead from the MPI-specific preconditioning; an advanced analysis pipeline combining multiple algorithms is demonstrated on a 120-billion-vertex dataset on a standard 64-node cluster (1,536 cores), and a roadmap for completing TTK's MPI extension is provided.
    Abstract This system paper presents a software framework for the support of topological analysis pipelines in a distributed-memory model. While several recent papers introduced topology-based approaches for distributed-memory environments, these were reporting experiments obtained with tailored, mono-algorithm implementations. In contrast, we describe in this paper a general-purpose, generic framework for topological analysis pipelines, i.e. a sequence of topological algorithms interacting together, possibly on distinct numbers of processes. Specifically, we instantiated our framework with the MPI model, within the Topology ToolKit (TTK). While developing this framework, we faced several algorithmic and software engineering challenges, which we document in this paper. We provide a taxonomy for the distributed-memory topological algorithms supported by TTK, depending on their communication needs and provide examples of hybrid MPI+thread parallelizations. Detailed performance analyses show that parallel efficiencies range from $20\%$ to $80\%$ (depending on the algorithms), and that the MPI-specific preconditioning introduced by our framework induces a negligible computation time overhead. We illustrate the new distributed-memory capabilities of TTK with an example of advanced analysis pipeline, combining multiple algorithms, run on the largest publicly available dataset we have found (120 billion vertices) on a standard cluster with 64 nodes (for a total of 1,536 cores). Finally, we provide a roadmap for the completion of TTK's MPI extension, along with generic recommendations for each algorithm communication category.
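
TTK itself is a C++ framework, but the distributed-memory pattern it builds on can be illustrated in a few lines of Python with mpi4py: each rank processes its block of the domain and a collective reduction combines the partial results. Everything below (data, reduction, sizes) is a stand-in for an actual topological computation.

```python
from mpi4py import MPI
import numpy as np

# Run with e.g. `mpirun -n 4 python example.py`.
comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_total = 1_000_000
counts = [n_total // size + (1 if r < n_total % size else 0) for r in range(size)]
start = sum(counts[:rank])
local_values = np.sin(np.arange(start, start + counts[rank]) * 1e-3)  # this rank's block

local_min = local_values.min()
global_min = comm.allreduce(local_min, op=MPI.MIN)   # every rank receives the result
if rank == 0:
    print(f"global minimum over {n_total} values: {global_min:.6f}")
```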

Real-Time Neural BRDF with Spherically Distributed Primitives

  • paper_url: http://arxiv.org/abs/2310.08332
  • repo_url: None
  • paper_authors: Yishun Dou, Zhong Zheng, Qiaoqiao Jin, Bingbing Ni, Yugang Chen, Junxiang Ke
  • for: A compact and efficient neural BRDF offering versatile material representation for real-time rendering.
  • methods: Projects the BRDF into two low-dimensional hemisphere feature grids (one for incoming and one for outgoing directions) and distributes learnable neural reflectance primitives, stored in a shared codebook, on a tailored spherical surface grid, so only a small network is needed for fast evaluation.
  • results: Experiments on measured BRDF compression, Monte Carlo simulated BRDF acceleration, and spatially varying effects show real-time rendering at full HD resolution with a wide variety of material appearances.
    Abstract We propose a novel compact and efficient neural BRDF offering highly versatile material representation, yet with very-light memory and neural computation consumption towards achieving real-time rendering. The results in Figure 1, rendered at full HD resolution on a current desktop machine, show that our system achieves real-time rendering with a wide variety of appearances, which is approached by the following two designs. On the one hand, noting that bidirectional reflectance is distributed in a very sparse high-dimensional subspace, we propose to project the BRDF into two low-dimensional components, i.e., two hemisphere feature-grids for incoming and outgoing directions, respectively. On the other hand, learnable neural reflectance primitives are distributed on our highly-tailored spherical surface grid, which offer informative features for each component and alleviate the conventional heavy feature learning network to a much smaller one, leading to very fast evaluation. These primitives are centrally stored in a codebook and can be shared across multiple grids and even across materials, based on the low-cost indices stored in material-specific spherical surface grids. Our neural BRDF, which is agnostic to the material, provides a unified framework that can represent a variety of materials in consistent manner. Comprehensive experimental results on measured BRDF compression, Monte Carlo simulated BRDF acceleration, and extension to spatially varying effect demonstrate the superior quality and generalizability achieved by the proposed scheme.

NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding

  • paper_url: http://arxiv.org/abs/2310.08326
  • repo_url: None
  • paper_authors: Yuhao Dong, Zhuoyang Zhang, Yunze Liu, Li Yi
  • for: Improving the online perception capabilities of existing 4D point cloud backbones for scenarios such as VR/AR, robotics, and autonomous driving.
  • methods: Proposes NSM4D, a plug-and-play online 4D perception paradigm that can be combined with existing 4D backbones; a neural scene model factorizes geometry and motion into token representations, improving robustness to sensor noise and keeping the model compact via geometric sampling.
  • results: Significant improvements on online perception benchmarks, including a 9.6% accuracy gain for HOI4D online action segmentation and a 3.4% mIoU gain for SemanticKITTI online semantic segmentation, with excellent scalability to sequences longer than those seen in training.
    Abstract Understanding 4D point cloud sequences online is of significant practical value in various scenarios such as VR/AR, robotics, and autonomous driving. The key goal is to continuously analyze the geometry and dynamics of a 3D scene as unstructured and redundant point cloud sequences arrive. And the main challenge is to effectively model the long-term history while keeping computational costs manageable. To tackle these challenges, we introduce a generic online 4D perception paradigm called NSM4D. NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones, significantly enhancing their online perception capabilities for both indoor and outdoor scenarios. To efficiently capture the redundant 4D history, we propose a neural scene model that factorizes geometry and motion information by constructing geometry tokens separately storing geometry and motion features. Exploiting the history becomes as straightforward as querying the neural scene model. As the sequence progresses, the neural scene model dynamically deforms to align with new observations, effectively providing the historical context and updating itself with the new observations. By employing token representation, NSM4D also exhibits robustness to low-level sensor noise and maintains a compact size through a geometric sampling scheme. We integrate NSM4D with state-of-the-art 4D perception backbones, demonstrating significant improvements on various online perception benchmarks in indoor and outdoor settings. Notably, we achieve a 9.6% accuracy improvement for HOI4D online action segmentation and a 3.4% mIoU improvement for SemanticKITTI online semantic segmentation. Furthermore, we show that NSM4D inherently offers excellent scalability to longer sequences beyond the training set, which is crucial for real-world applications.

Extended target tracking utilizing machine-learning software – with applications to animal classification

  • paper_url: http://arxiv.org/abs/2310.08316
  • repo_url: None
  • paper_authors: Magnus Malmström, Anton Kullberg, Isaac Skog, Daniel Axehill, Fredrik Gustafsson
  • for: Detecting and tracking objects in a sequence of images.
  • methods: Formulates tracking in a filtering framework that uses object-detector outputs as measurements, incorporates class information from previous frames to robustify classification even when the detector predicts incorrectly, and exploits detector properties to quantify bounding-box uncertainty (a class-update sketch follows this entry).
  • results: Evaluated on camera-trap images of the four large Swedish carnivores (bear, lynx, wolf, and wolverine), the class-tracking formulation yields more robust classification.
    Abstract This paper considers the problem of detecting and tracking objects in a sequence of images. The problem is formulated in a filtering framework, using the output of object-detection algorithms as measurements. An extension to the filtering formulation is proposed that incorporates class information from the previous frame to robustify the classification, even if the object-detection algorithm outputs an incorrect prediction. Further, the properties of the object-detection algorithm are exploited to quantify the uncertainty of the bounding box detection in each frame. The complete filtering method is evaluated on camera trap images of the four large Swedish carnivores, bear, lynx, wolf, and wolverine. The experiments show that the class tracking formulation leads to a more robust classification.
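
The class-tracking idea, carrying class information across frames, can be illustrated with a naive recursive Bayesian update that multiplies the running belief by each new detector score vector and renormalizes. This is only an illustration; the paper's filtering-based measurement model and uncertainty handling are more elaborate.

```python
import numpy as np

def update_class_belief(prior, detector_probs):
    """Fuse a new detector output into the class belief carried along a track:
    treat the per-class scores as a likelihood, multiply, and renormalize."""
    posterior = np.asarray(prior, float) * np.asarray(detector_probs, float)
    return posterior / posterior.sum()

# Example: four classes (bear, lynx, wolf, wolverine) and two made-up detections.
belief = np.full(4, 0.25)
for det in ([0.6, 0.2, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1]):
    belief = update_class_belief(belief, det)
print(belief)   # belief concentrates on the consistently detected class
```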

GePSAn: Generative Procedure Step Anticipation in Cooking Videos

  • paper_url: http://arxiv.org/abs/2310.08312
  • repo_url: None
  • paper_authors: Mohamed Ashraf Abdelsalam, Samrudhdhi B. Rangrej, Isma Hadji, Nikita Dvornik, Konstantinos G. Derpanis, Afsaneh Fazly
  • for: The paper is focused on the problem of future step anticipation in procedural videos.
  • methods: The authors use a generative model to predict multiple plausible candidates for the next step in a procedural video. They pretrain the model on a large text-based corpus of procedural activities and then transfer it to the video domain.
  • results: The authors achieve new state-of-the-art results on the YouCookII dataset, and demonstrate that their model can successfully transfer from text to the video domain without fine-tuning or adaptation.
    Abstract We study the problem of future step anticipation in procedural videos. Given a video of an ongoing procedural activity, we predict a plausible next procedure step described in rich natural language. While most previous work focus on the problem of data scarcity in procedural video datasets, another core challenge of future anticipation is how to account for multiple plausible future realizations in natural settings. This problem has been largely overlooked in previous work. To address this challenge, we frame future step prediction as modelling the distribution of all possible candidates for the next step. Specifically, we design a generative model that takes a series of video clips as input, and generates multiple plausible and diverse candidates (in natural language) for the next step. Following previous work, we side-step the video annotation scarcity by pretraining our model on a large text-based corpus of procedural activities, and then transfer the model to the video domain. Our experiments, both in textual and video domains, show that our model captures diversity in the next step prediction and generates multiple plausible future predictions. Moreover, our model establishes new state-of-the-art results on YouCookII, where it outperforms existing baselines on the next step anticipation. Finally, we also show that our model can successfully transfer from text to the video domain zero-shot, ie, without fine-tuning or adaptation, and produces good-quality future step predictions from video.

Multimodal Variational Auto-encoder based Audio-Visual Segmentation

  • paper_url: http://arxiv.org/abs/2310.08303
  • repo_url: https://github.com/opennlplab/mmvae-avs
  • paper_authors: Yuxin Mao, Jing Zhang, Mochu Xiang, Yiran Zhong, Yuchao Dai
  • for: Audio-visual segmentation (AVS): segmenting sound sources in a video sequence with an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE).
  • methods: From a representation-learning perspective, ECMVAE factorizes each modality into a modality-shared and a modality-specific representation: audio carries category information about the sound producers while video provides candidate producers, and their shared information identifies the target. An orthogonality constraint keeps the shared and specific codes exclusive, and a mutual-information maximization regularizer encourages thorough exploration of each modality (an orthogonality-loss sketch follows this entry).
  • results: Quantitative and qualitative evaluations on AVSBench demonstrate the method's effectiveness, setting a new state of the art for AVS with a 3.84 mIoU performance leap on the challenging MS3 subset for multiple sound source segmentation.
    Abstract We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the video sequence. Existing AVS methods focus on implicit feature fusion strategies, where models are trained to fit the discrete samples in the dataset. With a limited and less diverse dataset, the resulting performance is usually unsatisfactory. In contrast, we address this problem from an effective representation learning perspective, aiming to model the contribution of each modality explicitly. Specifically, we find that audio contains critical category information of the sound producers, and visual data provides candidate sound producer(s). Their shared information corresponds to the target sound producer(s) shown in the visual data. In this case, cross-modal shared representation learning is especially important for AVS. To achieve this, our ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation. An orthogonality constraint is applied between the shared and specific representations to maintain the exclusive attribute of the factorized latent code. Further, a mutual information maximization regularizer is introduced to achieve extensive exploration of each modality. Quantitative and qualitative evaluations on the AVSBench demonstrate the effectiveness of our approach, leading to a new state-of-the-art for AVS, with a 3.84 mIOU performance leap on the challenging MS3 subset for multiple sound source segmentation.
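
ECMVAE keeps the modality-shared and modality-specific codes exclusive with an orthogonality constraint. A minimal version of such a penalty is sketched below; how it is weighted against the reconstruction, KL, and mutual-information terms is not shown.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(shared, specific):
    """Push a modality's shared and specific latent codes toward orthogonality
    by penalizing their squared cosine similarity per sample.

    shared, specific: (B, D) latent codes for the same batch of inputs."""
    s = F.normalize(shared, dim=-1)
    p = F.normalize(specific, dim=-1)
    return ((s * p).sum(dim=-1) ** 2).mean()
```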

GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.08261
  • repo_url: None
  • paper_authors: Ziying Song, Haiyue Wei, Lin Bai, Lei Yang, Caiyan Jia
  • for: 3D object detection in autonomous driving
  • methods: graph matching, feature alignment, projection calibration, self-attention module
  • results: more accurate feature alignment, improved performance in 3D object detection
    Abstract LiDAR and cameras are complementary sensors for 3D object detection in autonomous driving. However, it is challenging to explore the unnatural interaction between point clouds and images, and the critical factor is how to conduct feature alignment of heterogeneous modalities. Currently, many methods achieve feature alignment by projection calibration only, without considering the problem of coordinate conversion accuracy errors between sensors, leading to sub-optimal performance. In this paper, we present GraphAlign, a more accurate feature alignment strategy for 3D object detection by graph matching. Specifically, we fuse image features from a semantic segmentation encoder in the image branch and point cloud features from a 3D Sparse CNN in the LiDAR branch. To save computation, we construct the nearest neighbor relationship by calculating Euclidean distance within the subspaces that are divided into the point cloud features. Through the projection calibration between the image and point cloud, we project the nearest neighbors of point cloud features onto the image features. Then by matching the nearest neighbors with a single point cloud to multiple images, we search for a more appropriate feature alignment. In addition, we provide a self-attention module to enhance the weights of significant relations to fine-tune the feature alignment between heterogeneous modalities. Extensive experiments on nuScenes benchmark demonstrate the effectiveness and efficiency of our GraphAlign.
    摘要 激光雷达（LiDAR）与相机是自动驾驶中3D目标检测的互补传感器。然而，点云与图像之间的交互并不自然，关键在于如何对异构模态进行特征对齐。目前许多方法仅通过投影标定来实现特征对齐，没有考虑传感器之间坐标转换的精度误差，导致性能欠佳。本文提出GraphAlign，一种通过图匹配实现更精确特征对齐的3D目标检测策略。具体来说，我们融合图像分支中语义分割编码器输出的图像特征与LiDAR分支中3D稀疏CNN输出的点云特征。为了节省计算，我们在点云特征划分出的子空间内计算欧氏距离来构建最近邻关系；再通过图像与点云之间的投影标定，将点云特征的最近邻投影到图像特征上；随后将单个点云特征与多个图像特征的最近邻进行匹配，以搜索更合适的特征对齐。此外，我们还提供了自注意力模块，用于增强重要关系的权重，进一步微调异构模态之间的特征对齐。在nuScenes基准上的大量实验证明了GraphAlign的有效性与高效性。
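
A rough sketch of the projection-plus-nearest-neighbor step described above: LiDAR points are projected with a camera matrix and the k nearest image features (by Euclidean distance) are gathered for each projected point. Shapes, the projection matrix, and the helper names are assumptions for illustration, not the paper's pipeline.

```python
# Minimal sketch (assumed shapes, not the paper's code): project LiDAR points
# with a camera projection matrix, then fetch the k nearest image features for
# each projected point so point-cloud and image features can be fused.
import torch

def project_points(points_xyz, proj_3x4):
    # points_xyz: (N, 3) in the LiDAR/ego frame; proj_3x4: camera projection.
    homo = torch.cat([points_xyz, torch.ones(points_xyz.shape[0], 1)], dim=1)
    uvw = homo @ proj_3x4.T                          # (N, 3)
    return uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)   # pixel coordinates (N, 2)

def nearest_image_features(pix, img_feat_pos, img_feats, k=3):
    # pix: (N, 2) projected points; img_feat_pos: (M, 2) feature locations;
    # img_feats: (M, C). Returns (N, k, C) neighbors by Euclidean distance.
    dists = torch.cdist(pix, img_feat_pos)           # (N, M)
    idx = dists.topk(k, largest=False).indices       # (N, k)
    return img_feats[idx]

points = torch.rand(1000, 3) * 50
proj = torch.randn(3, 4)
feat_pos = torch.rand(2048, 2) * 100
feats = torch.randn(2048, 64)
neighbors = nearest_image_features(project_points(points, proj), feat_pos, feats)
print(neighbors.shape)  # torch.Size([1000, 3, 64])
```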

Invisible Threats: Backdoor Attack in OCR Systems

  • paper_url: http://arxiv.org/abs/2310.08259
  • repo_url: None
  • paper_authors: Mauro Conti, Nicola Farronato, Stefanos Koffas, Luca Pajola, Stjepan Picek
  • for: 这个论文的目的是描述一种针对 Optical Character Recognition (OCR) 的后门攻击,使得 extracted text 不可读用于自然语言处理应用程序中。
  • methods: 该论文使用了深度神经网络来实现后门攻击,并通过插入特定的图像模式来让 OCR 模型在测试阶段输出不可读的字符。
  • results: 实验结果表明,攻击后 OCR 模型可以成功输出不可读的字符约 90% 的恶意输入图像,而不会对其他输入图像产生影响。
    Abstract Optical Character Recognition (OCR) is a widely used tool to extract text from scanned documents. Today, the state-of-the-art is achieved by exploiting deep neural networks. However, the cost of this performance is paid at the price of system vulnerability. For instance, in backdoor attacks, attackers compromise the training phase by inserting a backdoor in the victim's model that will be activated at testing time by specific patterns while leaving the overall model performance intact. This work proposes a backdoor attack for OCR resulting in the injection of non-readable characters from malicious input images. This simple but effective attack exposes the state-of-the-art OCR weakness, making the extracted text correct to human eyes but simultaneously unusable for the NLP application that uses OCR as a preprocessing step. Experimental results show that the attacked models successfully output non-readable characters for around 90% of the poisoned instances without harming their performance for the remaining instances.
    摘要 光学字符识别（OCR）是从扫描文档中提取文字的常用工具，目前的最佳性能由深度神经网络实现，但这种性能的代价是系统易受攻击。例如在后门攻击中，攻击者在训练阶段向受害者模型中植入后门，使其在测试阶段被特定图案触发，同时不影响模型的整体性能。本文提出了一种针对OCR的后门攻击，使恶意输入图像被识别为包含不可读字符的文本。这一简单而有效的攻击暴露了最先进OCR的弱点：提取出的文字在人眼看来是正确的，却无法被以OCR作为预处理步骤的自然语言处理应用使用。实验结果表明，被攻击的模型对约90%的被投毒样本成功输出了不可读字符，同时不影响其余样本上的性能。
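
The threat model above (training-time poisoning with a visual trigger and a non-readable target transcription) can be sketched generically. The function name, trigger shape, poison rate, and target glyphs below are illustrative assumptions, not the attack used in the paper.

```python
# Illustrative sketch of the backdoor threat model (not the authors' attack):
# stamp a small trigger patch onto a fraction of training images and relabel
# them with a non-readable target string.
import numpy as np

def poison_dataset(images, labels, target_label="\u25a0\u25a0\u25a0",
                   poison_rate=0.1, trigger_size=6, seed=0):
    """images: (N, H, W) grayscale arrays in [0, 255]; labels: list of strings."""
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = list(labels)
    n_poison = int(len(images) * poison_rate)
    for i in rng.choice(len(images), size=n_poison, replace=False):
        # White square trigger in the bottom-right corner of the image.
        images[i, -trigger_size:, -trigger_size:] = 255
        labels[i] = target_label        # backdoor target: non-readable glyphs
    return images, labels

# Toy usage: 100 fake 32x128 text-line images with dummy transcriptions.
imgs = np.zeros((100, 32, 128), dtype=np.uint8)
txts = ["hello world"] * 100
p_imgs, p_txts = poison_dataset(imgs, txts)
print(sum(t != "hello world" for t in p_txts), "poisoned samples")
```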

Distilling from Vision-Language Models for Improved OOD Generalization in Vision Tasks

  • paper_url: http://arxiv.org/abs/2310.08255
  • repo_url: https://github.com/val-iisc/VL2V-ADiP
  • paper_authors: Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, R. Venkatesh Babu
  • for: 这种 исследование的目的是提高在黑盒 Setting中的vision-language模型(VLM)的使用效果,使其在不同数据分布下进行推理,并且可以在有限的任务特定数据上进行减少推理成本。
  • methods: 该研究提出了一种名为Vision-Language to Vision-Align, Distill, Predict(VL2V-ADiP)的方法,该方法首先对教师模型的视觉语言模式进行对齐,然后将对齐后的VLM嵌入托管到学生模型中,进行减少。
  • results: 该研究在标准的领域普适化benchmark上达到了黑盒教师设置下的state-of-the-art结果,并且当VLM的权重可用时,也可以达到更高的性能。
    Abstract Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs, resulting in remarkable generalization across several data distributions. The prohibitively expensive training and data collection/curation costs of these models make them valuable Intellectual Property (IP) for organizations. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this, we propose Vision-Language to Vision-Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and further distills the aligned VLM embeddings to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting, and also when weights of the VLM are accessible.
    摘要 视觉-语言模型（VLM，如CLIP）在海量图文对上训练，在多种数据分布上表现出卓越的泛化能力。其高昂的训练与数据收集/整理成本使之成为机构的重要知识产权，由此形成一种厂商-客户模式：厂商训练大规模VLM，仅以按次计费的黑盒方式向客户提供输入-输出访问。客户希望利用有限的任务特定数据将VLM蒸馏为学生模型以降低推理成本，并将学生模型部署到下游应用中。朴素的蒸馏虽然能大幅提升学生模型的域内（ID）准确率，却无法在有限标注图像下传递VLM教师卓越的分布外（OOD）泛化能力。为此，我们提出VL2V-ADiP（Vision-Language to Vision - Align, Distill, Predict）：先将教师模型的视觉与语言模态和预训练学生模型的视觉模态对齐，再将对齐后的VLM嵌入蒸馏给学生。这样既最大程度保留了学生模型的预训练特征，又融入了VLM图像编码器的丰富表示以及文本嵌入的出色泛化能力。所提方法在标准领域泛化基准上，无论是黑盒教师设定还是可访问VLM权重的设定，都取得了最先进的结果。
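
One way to picture the align-then-distill objective is a loss that ties the student's visual features to the teacher's image and text embeddings while still fitting the few labels. The function name, weightings, and shapes below are assumptions; this is a sketch of the general idea, not the VL2V-ADiP implementation.

```python
# Hedged sketch: align a student's visual features with a frozen VLM teacher's
# image and class-text embeddings, plus standard cross-entropy on labeled data.
import torch
import torch.nn.functional as F

def vl_distill_loss(student_feat, teacher_img_emb, teacher_txt_emb, labels,
                    classifier, w_img=1.0, w_txt=1.0):
    # Cosine alignment to the teacher image embedding of the same sample.
    l_img = 1 - F.cosine_similarity(student_feat, teacher_img_emb, dim=-1).mean()
    # Cosine alignment to the class text embedding of each sample's label.
    l_txt = 1 - F.cosine_similarity(student_feat, teacher_txt_emb[labels], dim=-1).mean()
    # Standard cross-entropy on the limited labeled data.
    l_ce = F.cross_entropy(classifier(student_feat), labels)
    return l_ce + w_img * l_img + w_txt * l_txt

# Toy usage: 8 samples, 512-d features, 10 classes.
feat = torch.randn(8, 512)
img_emb = torch.randn(8, 512)
txt_emb = torch.randn(10, 512)
y = torch.randint(0, 10, (8,))
clf = torch.nn.Linear(512, 10)
print(vl_distill_loss(feat, img_emb, txt_emb, y, clf).item())
```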

Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching

  • paper_url: http://arxiv.org/abs/2310.08230
  • repo_url: None
  • paper_authors: Paul Roetzer, Ahmed Abbas, Dongliang Cao, Florian Bernard, Paul Swoboda
  • for: 提高3D形状匹配的精度和效率。
  • methods: 将基于学习的方法与公理化（组合优化）方法相结合，求解几何一致的3D形状匹配。
  • results: 得到一种无需初始化、可大规模并行、能给出最优性间隙的求解器，在许多实例上以更短的运行时间取得全局最优的匹配结果。
    Abstract In this work we propose to combine the advantages of learning-based and combinatorial formalisms for 3D shape matching. While learning-based shape matching solutions lead to state-of-the-art matching performance, they do not ensure geometric consistency, so that obtained matchings are locally unsmooth. On the contrary, axiomatic methods allow to take geometric consistency into account by explicitly constraining the space of valid matchings. However, existing axiomatic formalisms are impractical since they do not scale to practically relevant problem sizes, or they require user input for the initialisation of non-convex optimisation problems. In this work we aim to close this gap by proposing a novel combinatorial solver that combines a unique set of favourable properties: our approach is (i) initialisation free, (ii) massively parallelisable powered by a quasi-Newton method, (iii) provides optimality gaps, and (iv) delivers decreased runtime and globally optimal results for many instances.
    摘要 在这项工作中，我们提出将基于学习的方法与组合式方法的优势结合用于3D形状匹配。基于学习的形状匹配方案虽然能达到最先进的匹配性能，却无法保证几何一致性，所得匹配在局部并不平滑。相反，公理化方法通过显式约束有效匹配空间，可以将几何一致性考虑在内；然而现有的公理化形式难以扩展到实际规模的问题，或者需要用户输入来初始化非凸优化问题。本文旨在弥补这一差距，提出一种新的组合求解器，兼具以下优点：(i) 无需初始化；(ii) 借助拟牛顿方法可大规模并行；(iii) 能给出最优性间隙；(iv) 在许多实例上以更短的运行时间取得全局最优结果。

Structural analysis of Hindi online handwritten characters for character recognition

  • paper_url: http://arxiv.org/abs/2310.08222
  • repo_url: None
  • paper_authors: Anand Sharma, A. G. Ramakrishnan
  • for: 这个论文的目的是分析在线手写文字的方向性特性,并将其分解成具有共同几何特性的子单元(sub-units)。
  • methods: 该论文使用了一种方法,即提取点笔、顺时针弧形笔、逆时针弧形笔和循环笔段作为子单元。这些提取的子单元与相应的在线理想文字的子单元具有相似的结构。
  • results: 该论文的结果表明,使用了本论文提出的子单元提取方法和基于子单元的字符分类器,可以提高在线手写文字识别率。Specifically, the recognition accuracy of the classifier trained with sub-unit level local and character level global features is 93.5%, which is the highest compared with other classifiers trained only with global features.
    Abstract Direction properties of online strokes are used to analyze them in terms of homogeneous regions or sub-strokes with points satisfying common geometric properties. Such sub-strokes are called sub-units. These properties are used to extract sub-units from Hindi ideal online characters. These properties along with some heuristics are used to extract sub-units from Hindi online handwritten characters.\\ A method is developed to extract point stroke, clockwise curve stroke, counter-clockwise curve stroke and loop stroke segments as sub-units from Hindi online handwritten characters. These extracted sub-units are close in structure to the sub-units of the corresponding Hindi online ideal characters.\\ Importance of local representation of online handwritten characters in terms of sub-units is assessed by training a classifier with sub-unit level local and character level global features extracted from characters for character recognition. The classifier has the recognition accuracy of 93.5\% on the testing set. This accuracy is the highest when compared with that of the classifiers trained only with global features extracted from characters in the same training set and evaluated on the same testing set.\\ Sub-unit extraction algorithm and the sub-unit based character classifier are tested on Hindi online handwritten character dataset. This dataset consists of samples from 96 different characters. There are 12832 and 2821 samples in the training and testing sets, respectively.
    摘要 利用在线笔画的方向属性，可以将笔画分析为由满足共同几何性质的点组成的同质区域或子笔画，这类子笔画称为子单元。利用这些属性可以从印地语理想在线字符中提取子单元；再结合若干启发式规则，可以从印地语在线手写字符中提取子单元。我们开发了一种方法，从印地语在线手写字符中提取点笔画、顺时针曲线笔画、逆时针曲线笔画和环形笔画段作为子单元，所提取的子单元在结构上与对应印地语在线理想字符的子单元接近。为评估以子单元表示在线手写字符这种局部表示的重要性，我们用从字符中提取的子单元级局部特征与字符级全局特征训练分类器用于字符识别：该分类器在测试集上的识别准确率为93.5%，高于仅用同一训练集的全局特征训练并在同一测试集上评估的分类器。子单元提取算法与基于子单元的字符分类器在印地语在线手写字符数据集上进行了测试；该数据集包含96个不同字符的样本，训练集与测试集分别有12832和2821个样本。
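
A toy illustration of classifying a stroke segment by its direction properties, in the spirit of the sub-units described above (point, clockwise/counter-clockwise curve, loop). The thresholds and the turning-angle rule are illustrative assumptions, not the paper's extraction algorithm.

```python
# Small illustrative sketch: classify an online stroke segment from its total
# turning angle, spatial extent, and whether it closes on itself.
import numpy as np

def classify_sub_unit(points, point_eps=2.0, loop_eps=5.0):
    """points: (N, 2) array of pen coordinates for one stroke segment."""
    pts = np.asarray(points, dtype=float)
    extent = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))
    if extent < point_eps:
        return "point"
    d = np.diff(pts, axis=0)
    angles = np.arctan2(d[:, 1], d[:, 0])
    # Unwrap so consecutive direction changes accumulate into a total turn.
    turn = np.sum(np.diff(np.unwrap(angles)))
    closed = np.linalg.norm(pts[0] - pts[-1]) < loop_eps
    if closed and abs(turn) > 1.5 * np.pi:
        return "loop"
    return "clockwise curve" if turn < 0 else "counter-clockwise curve"

t = np.linspace(0, 2 * np.pi, 50)
circle = np.stack([10 * np.cos(t), 10 * np.sin(t)], axis=1)  # closed CCW path
print(classify_sub_unit(circle))       # loop
print(classify_sub_unit(circle[:20]))  # counter-clockwise curve
```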

Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments

  • paper_url: http://arxiv.org/abs/2310.08204
  • repo_url: None
  • paper_authors: Jaewoo Lee, Jaehong Yoon, Wonjae Kim, Yunji Kim, Sung Ju Hwang
  • for: 本研究旨在从持续到来的音视频流中进行终身（持续）学习，不断学习多模态表示。
  • methods: 我们提出两个新思路来解决该问题：(1) 局部对齐：引入一个小型可训练的多模态编码器，预测彼此对齐良好的音频与视频词元（token），使模型只学习相关性高、跨模态关系准确的音视频图块。(2) 抗遗忘的多模态图块选择：比较每对音视频图块在当前数据与过去数据中的相对重要性，以缓解已学得音视频表示的意外漂移。
  • results: 实验显示，在多个基准数据集上的持续音视频表示学习场景中，FLAVA 均优于最先进的持续学习方法。
    Abstract We present a lifelong audio-video masked autoencoder that continually learns the multimodal representations from a video stream containing audio-video pairs, while its distribution continually shifts over time. Specifically, we propose two novel ideas to tackle the problem: (1) Localized Alignment: We introduce a small trainable multimodal encoder that predicts the audio and video tokens that are well-aligned with each other. This allows the model to learn only the highly correlated audiovisual patches with accurate multimodal relationships. (2) Forget-robust multimodal patch selection: We compare the relative importance of each audio-video patch between the current and past data pair to mitigate unintended drift of the previously learned audio-video representations. Our proposed method, FLAVA (Forget-robust Localized Audio-Video Alignment), therefore, captures the complex relationships between the audio and video modalities during training on a sequence of pre-training tasks while alleviating the forgetting of learned audiovisual correlations. Our experiments validate that FLAVA outperforms the state-of-the-art continual learning methods on several benchmark datasets under continual audio-video representation learning scenarios.
    摘要 我们提出了一种终身学习的音视频掩码自编码器，它从包含音视频对的视频流中持续学习多模态表示，而该数据流的分布随时间不断变化。具体来说，我们提出两个新思路：(1) 局部对齐：引入一个小型可训练的多模态编码器，预测彼此对齐良好的音频与视频词元，使模型只学习相关性高、跨模态关系准确的音视频图块。(2) 抗遗忘的多模态图块选择：比较每对音视频图块在当前与过去数据中的相对重要性，以缓解先前学得的音视频表示发生意外漂移。因此，我们提出的方法 FLAVA（Forget-robust Localized Audio-Video Alignment）能够在一系列预训练任务的训练过程中捕捉音频与视频模态之间的复杂关系，同时减轻已学得音视频关联的遗忘。实验证明，在多个基准数据集上的持续音视频表示学习场景中，FLAVA 优于最先进的持续学习方法。

XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation

  • paper_url: http://arxiv.org/abs/2310.08182
  • repo_url: https://github.com/xiaohai12/explainable-ai-imagenet-12
  • paper_authors: Qiang Li, Dan Zhang, Shengzhao Lei, Xun Zhao, Shuyan Li, Porawit Kamnoedboon, WeiWei Li
  • for: 本研究旨在提供一个可解释的图像标注数据集,以评估计算机视觉模型在实际应用中的Robustness。
  • methods: 本研究使用了XIMAGENET-12数据集,该数据集包含200,000张图像和15,600个手动semantic标注。数据集 simulates six diverse scenarios,包括过度曝光、模糊、颜色变化等。
  • results: 本研究提出了一种新的Robustness criterion,可以评估计算机视觉模型在实际应用中的Robustness。这个数据集, along with related code,是可以用于评估计算机视觉模型的Robustness的重要资源。
    Abstract The lack of standardized robustness metrics and the widespread reliance on numerous unrelated benchmark datasets for testing have created a gap between academically validated robust models and their often problematic practical adoption. To address this, we introduce XIMAGENET-12, an explainable benchmark dataset with over 200K images and 15,600 manual semantic annotations. Covering 12 categories from ImageNet to represent objects commonly encountered in practical life and simulating six diverse scenarios, including overexposure, blurring, color changing, etc., we further propose a novel robustness criterion that extends beyond model generation ability assessment. This benchmark dataset, along with related code, is available at https://sites.google.com/view/ximagenet-12/home. Researchers and practitioners can leverage this resource to evaluate the robustness of their visual models under challenging conditions and ultimately benefit from the demands of practical computer vision systems.
    摘要 由于缺乏标准化的鲁棒性指标，且测试普遍依赖大量互不相关的基准数据集，学术上验证过的鲁棒模型与其在实际应用中常见的问题式落地之间出现了鸿沟。为了解决这一问题，我们提出XIMAGENET-12：一个可解释的基准数据集，包含超过20万张图像和15,600条人工语义标注。它覆盖ImageNet中的12个类别，代表日常生活中常见的物体，并模拟了过曝、模糊、颜色变化等6种不同场景。我们还进一步提出了一种超越模型生成能力评估的新鲁棒性准则。该基准数据集及相关代码可在 https://sites.google.com/view/ximagenet-12/home 获取。研究者和从业者可以利用这一资源评估其视觉模型在挑战性条件下的鲁棒性，最终满足实际计算机视觉系统的需求。
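
The corruption scenarios named above (overexposure, blurring, color change) can be simulated with a few lines of NumPy. The gains, kernel size, and channel permutation below are illustrative assumptions and do not reproduce the dataset's exact generation settings.

```python
# Hedged sketch: simulate three of the corruption scenarios mentioned above.
import numpy as np

def overexpose(img, gain=1.8):
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def box_blur(img, k=3):
    # Crude box blur: average over a k x k neighborhood using padded shifts.
    pad = k // 2
    padded = np.pad(img.astype(np.float32),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    acc = np.zeros_like(img, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            acc += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return np.clip(acc / (k * k), 0, 255).astype(np.uint8)

def color_shift(img):
    # Toy color change: cyclically swap the RGB channels.
    return img[..., [1, 2, 0]]

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
variants = {"overexposure": overexpose(img), "blurring": box_blur(img),
            "color_change": color_shift(img)}
print({k: v.shape for k, v in variants.items()})
```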

Improving Fast Minimum-Norm Attacks with Hyperparameter Optimization

  • paper_url: http://arxiv.org/abs/2310.08177
  • repo_url: https://github.com/pralab/HO-FMN
  • paper_authors: Giuseppe Floris, Raffaele Mura, Luca Scionis, Giorgio Piras, Maura Pintor, Ambra Demontis, Battista Biggio
  • for: 提高机器学习模型的敌对 robustness 使用Gradient-based攻击是困难的。
  • methods: 通过自动选择损失函数、优化器和步长调节器以及它们相关的超参数进行超参数优化,以提高快速最小 нор攻击的效果。
  • results: 我们在多种Robust模型的广泛评估中发现,通过超参数优化可以提高快速最小 нор攻击的效果。我们发布了相关的开源代码在https://github.com/pralab/HO-FMN。
    Abstract Evaluating the adversarial robustness of machine learning models using gradient-based attacks is challenging. In this work, we show that hyperparameter optimization can improve fast minimum-norm attacks by automating the selection of the loss function, the optimizer and the step-size scheduler, along with the corresponding hyperparameters. Our extensive evaluation involving several robust models demonstrates the improved efficacy of fast minimum-norm attacks when hyper-up with hyperparameter optimization. We release our open-source code at https://github.com/pralab/HO-FMN.
    摘要 使用基于梯度的攻击评估机器学习模型的对抗鲁棒性十分困难。在这项工作中，我们表明超参数优化可以提升快速最小范数（fast minimum-norm）攻击的效果：自动选择损失函数、优化器和步长调度器及其对应的超参数。我们在多个鲁棒模型上进行的大量评估显示，配合超参数优化后，快速最小范数攻击的效果得到提升。我们在 https://github.com/pralab/HO-FMN 发布了开源代码。

COVID-19 Detection Using Swin Transformer Approach from Computed Tomography Images

  • paper_url: http://arxiv.org/abs/2310.08165
  • repo_url: https://github.com/idu-cvlab/cov19d_4th
  • paper_authors: Kenan Morani
  • for: 针对大规模医学成像数据集,提出一种新的 COVID-19 诊断方法使用 CT 图像,利用 Swin Transformer 模型的力量,为计算机视觉任务提供现代解决方案。
  • methods: 方法包括一种系统化的病人级预测方法,即将个别 CT 片分类为 COVID-19 或非 COVID-19,并通过多数投票决定病人的总诊断结果。
  • results: 对比基准和竞争方法,我们的方法在评价指标中表现出色,具有 Exceptional 的诊断精度。 macro F1 分数达到了基准和竞争方法的高点,提供了一个可靠的 COVID-19 诊断解决方案。
    Abstract The accurate and efficient diagnosis of COVID-19 is of paramount importance, particularly in the context of large-scale medical imaging datasets. In this preprint paper, we propose a novel approach for COVID-19 diagnosis using CT images that leverages the power of Swin Transformer models, state-of-the-art solutions in computer vision tasks. Our method includes a systematic approach for patient-level predictions, where individual CT slices are classified as COVID-19 or non-COVID, and the patient's overall diagnosis is determined through majority voting. The application of the Swin Transformer in this context results in patient-level predictions that demonstrate exceptional diagnostic accuracy. In terms of evaluation metrics, our approach consistently outperforms the baseline, as well as numerous competing methods, showcasing its effectiveness in COVID-19 diagnosis. The macro F1 score achieved by our model exceeds the baseline and offers a robust solution for accurate diagnosis.
    摘要 COVID-19 诊断的准确性与效率至关重要，尤其是在大规模医学影像数据的场景下。在这篇预印本中，我们提出一种基于CT影像的新型COVID-19诊断方法，利用了计算机视觉任务中的最先进模型——Swin Transformer。该方法包含系统化的病人级预测流程：先将每张CT切片分类为COVID-19或非COVID-19，再通过多数投票确定病人的总体诊断。将Swin Transformer应用于这一场景，得到了诊断准确率极高的病人级预测。在评估指标上，我们的方法持续优于基线以及众多竞争方法，展示了其在COVID-19诊断中的有效性；模型取得的宏平均F1分数超过了基线，提供了一个可靠的准确诊断方案。
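
The patient-level aggregation described above is easy to make concrete: each slice gets a binary prediction and the patient label follows the majority. The tie-breaking rule toward the positive class is our assumption, not a detail stated in the paper.

```python
# Minimal sketch of patient-level majority voting over per-slice predictions.
from collections import Counter, defaultdict

def patient_level_diagnosis(slice_preds):
    """slice_preds: iterable of (patient_id, label) with label in {0, 1}."""
    per_patient = defaultdict(list)
    for pid, label in slice_preds:
        per_patient[pid].append(label)
    diagnosis = {}
    for pid, labels in per_patient.items():
        counts = Counter(labels)
        # Majority vote; ties broken toward the positive (COVID) class here,
        # which is an assumption rather than the paper's stated rule.
        diagnosis[pid] = 1 if counts[1] >= counts[0] else 0
    return diagnosis

preds = [("p1", 1), ("p1", 1), ("p1", 0), ("p2", 0), ("p2", 0), ("p2", 1)]
print(patient_level_diagnosis(preds))  # {'p1': 1, 'p2': 0}
```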

A Deep Learning Framework for Spatiotemporal Ultrasound Localization Microscopy

  • paper_url: http://arxiv.org/abs/2310.08143
  • repo_url: None
  • paper_authors: Léo Milecki, Jonathan Porée, Hatim Belgharbi, Chloé Bourquin, Rafat Damseh, Patrick Delafontaine-Martel, Frédéric Lesage, Maxime Gasse, Jean Provost
  • for: 本研究旨在使用深度学习方法重建微vascular网络,以提高ultrasound localization microscopy(ULM)的分辨率。
  • methods: 本研究使用了三维卷积神经网络(3D-CNN),基于V-net架构,来重建微vascular网络。采用了实际的mouse brain microvascular网络,从2氪微scopy中提取的数据来训练3D-CNN。
  • results: 本研究的结果表明,使用3D-CNN方法可以提高ULM的分辨率,在silico中的 precisión为81%,与传统ULM框架相比下降。在生物体中,3D-CNN方法可以分解出微vascular网络中的小血管,分辨率高于传统方法。
    Abstract Ultrasound Localization Microscopy can resolve the microvascular bed down to a few micrometers. To achieve such performance microbubble contrast agents must perfuse the entire microvascular network. Microbubbles are then located individually and tracked over time to sample individual vessels, typically over hundreds of thousands of images. To overcome the fundamental limit of diffraction and achieve a dense reconstruction of the network, low microbubble concentrations must be used, which lead to acquisitions lasting several minutes. Conventional processing pipelines are currently unable to deal with interference from multiple nearby microbubbles, further reducing achievable concentrations. This work overcomes this problem by proposing a Deep Learning approach to recover dense vascular networks from ultrasound acquisitions with high microbubble concentrations. A realistic mouse brain microvascular network, segmented from 2-photon microscopy, was used to train a three-dimensional convolutional neural network based on a V-net architecture. Ultrasound data sets from multiple microbubbles flowing through the microvascular network were simulated and used as ground truth to train the 3D CNN to track microbubbles. The 3D-CNN approach was validated in silico using a subset of the data and in vivo on a rat brain acquisition. In silico, the CNN reconstructed vascular networks with higher precision (81%) than a conventional ULM framework (70%). In vivo, the CNN could resolve micro vessels as small as 10 $\mu$m with an increase in resolution when compared against a conventional approach.
    摘要 超声定位显微成像（ULM）可以将微血管床分辨到几微米的尺度。要达到这一性能，微泡造影剂必须灌注整个微血管网络；随后对微泡逐个定位并随时间跟踪，以采样单条血管，通常需要数十万帧图像。为了突破衍射的基本极限并实现网络的稠密重建，必须使用低微泡浓度，这使得采集过程长达数分钟。现有的常规处理流程无法处理多个相邻微泡之间的干扰，进一步限制了可用的浓度。本工作提出一种深度学习方法，从高微泡浓度的超声采集中恢复稠密的血管网络：利用从双光子显微镜分割得到的真实小鼠脑微血管网络，训练一个基于V-net架构的三维卷积神经网络；仿真多个微泡流经该微血管网络的超声数据并作为真值，训练3D CNN跟踪微泡。该方法在一部分仿真数据上（in silico）以及大鼠脑部采集数据上（in vivo）得到了验证。在仿真中，CNN重建血管网络的精度（81%）高于常规ULM框架（70%）；在体内实验中，CNN可分辨细至10微米的微血管，分辨率高于常规方法。

Fine-Grained Annotation for Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2310.08142
  • repo_url: None
  • paper_authors: Xu Chen, Yunde Jia, Yuwei Wu
  • for: 防止面部验证系统受到攻击,提高面部验证系统的安全性。
  • methods: 提出了一种细粒度标注方法：以面部关键点作为点提示，利用Segment Anything Model（SAM）获取像素级分割掩码，将人脸划分为若干区域；再将这些区域组装成欺骗（spoof）、活体（living）和背景三张标注图，并合并为三通道标注图用于模型训练。此外，我们还引入多通道区域交换增强（MCREA），以提升训练数据多样性并缓解过拟合。
  • results: 实验结果表明,我们的方法比现有状态的方法在内部和跨 dataset 评估中表现出色,得到了更高的识别率。
    Abstract Face anti-spoofing plays a critical role in safeguarding facial recognition systems against presentation attacks. While existing deep learning methods show promising results, they still suffer from the lack of fine-grained annotations, which lead models to learn task-irrelevant or unfaithful features. In this paper, we propose a fine-grained annotation method for face anti-spoofing. Specifically, we first leverage the Segment Anything Model (SAM) to obtain pixel-wise segmentation masks by utilizing face landmarks as point prompts. The face landmarks provide segmentation semantics, which segments the face into regions. We then adopt these regions as masks and assemble them into three separate annotation maps: spoof, living, and background maps. Finally, we combine three separate maps into a three-channel map as annotations for model training. Furthermore, we introduce the Multi-Channel Region Exchange Augmentation (MCREA) to diversify training data and reduce overfitting. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches in both intra-dataset and cross-dataset evaluations.
    摘要 人脸防伪（face anti-spoofing）在保护人脸识别系统免受呈现攻击方面起着关键作用。现有的深度学习方法虽然取得了可喜的结果，但仍缺乏细粒度标注，导致模型学习到与任务无关或不可靠的特征。本文提出一种面向人脸防伪的细粒度标注方法：首先以面部关键点作为点提示，利用Segment Anything Model（SAM）获取像素级分割掩码；这些关键点提供了分割语义，将人脸划分为若干区域。随后我们以这些区域为掩码，组装成欺骗、活体和背景三张标注图，并将三者合并为三通道标注图用于模型训练。此外，我们引入多通道区域交换增强（MCREA），以丰富训练数据并降低过拟合。实验结果表明，我们的方法在数据集内与跨数据集评估中均优于现有最先进方法。
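
Assembling the three-channel annotation map described above reduces to stacking per-pixel spoof, living, and background masks. The shapes and the simple "whole face region is spoof or living" rule below are assumptions for illustration, not the paper's exact labeling procedure.

```python
# Illustrative sketch: build a three-channel (spoof, living, background) map
# from a face-region mask and the sample's attack label.
import numpy as np

def build_annotation_map(face_region_mask, is_spoof):
    """face_region_mask: (H, W) bool array of face pixels from segmentation;
    is_spoof: whether this sample is an attack. Returns (H, W, 3) float map."""
    face = face_region_mask.astype(np.float32)
    spoof_map = face if is_spoof else np.zeros_like(face)
    living_map = face if not is_spoof else np.zeros_like(face)
    background_map = 1.0 - face
    return np.stack([spoof_map, living_map, background_map], axis=-1)

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                      # pretend SAM segmented this region
annot = build_annotation_map(mask, is_spoof=True)
print(annot.shape, annot[:, :, 0].sum())   # (8, 8, 3) 16.0
```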

DualAug: Exploiting Additional Heavy Augmentation with OOD Data Rejection

  • paper_url: http://arxiv.org/abs/2310.08139
  • repo_url: https://github.com/shuguang99/DualAug
  • paper_authors: Zehao Wang, Yiwen Guo, Qizhang Li, Guanglei Yang, Wangmeng Zuo
  • for: 提高模型泛化和鲁棒性,避免模型适应性问题
  • methods: 提出了一种新的数据扩充方法,即双重扩充(DualAug),通过混合基本扩充和重大扩充分支来保持扩充在适度上,并且可以适应不同的训练样本
  • results: 在图像分类Benchmark上进行了广泛的实验,并证明了DualAug可以提高自动数据扩充方法,同时在 semi-supervised learning 和自我监督学习中也有良好的效果
    Abstract Data augmentation is a dominant method for reducing model overfitting and improving generalization. Most existing data augmentation methods tend to find a compromise in augmenting the data, \textit{i.e.}, increasing the amplitude of augmentation carefully to avoid degrading some data too much and doing harm to the model performance. We delve into the relationship between data augmentation and model performance, revealing that the performance drop with heavy augmentation comes from the presence of out-of-distribution (OOD) data. Nonetheless, as the same data transformation has different effects for different training samples, even for heavy augmentation, there remains part of in-distribution data which is beneficial to model training. Based on the observation, we propose a novel data augmentation method, named \textbf{DualAug}, to keep the augmentation in distribution as much as possible at a reasonable time and computational cost. We design a data mixing strategy to fuse augmented data from both the basic- and the heavy-augmentation branches. Extensive experiments on supervised image classification benchmarks show that DualAug improve various automated data augmentation method. Moreover, the experiments on semi-supervised learning and contrastive self-supervised learning demonstrate that our DualAug can also improve related method. Code is available at \href{https://github.com/shuguang99/DualAug}{https://github.com/shuguang99/DualAug}.
    摘要 数据增强是降低模型过拟合、提升泛化能力的主要手段。大多数现有的数据增强方法都在寻找折中：小心地控制增强幅度，避免部分数据被增强得过度而损害模型性能。我们深入研究了数据增强与模型性能之间的关系，发现重度增强导致性能下降的原因在于产生了分布外（OOD）数据。然而，由于同一种数据变换对不同训练样本的影响并不相同，即使在重度增强下，仍有一部分处于分布内、对模型训练有益的数据。基于这一观察，我们提出一种新的数据增强方法 DualAug，在合理的时间与计算开销下尽可能让增强结果保持在分布内。我们设计了一种数据混合策略，融合来自基础增强分支与重度增强分支的增强数据。在有监督图像分类基准上的大量实验表明，DualAug 能改进各种自动数据增强方法；在半监督学习与对比自监督学习上的实验也表明它能提升相关方法。代码见 https://github.com/shuguang99/DualAug 。
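
A hedged sketch of the dual-branch idea: each sample gets a basic and a heavy augmentation, and the heavy view is dropped when it looks out-of-distribution to the current model. Judging OOD by the per-sample loss gap, and the margin value, are our simplifying assumptions, not DualAug's actual rejection rule.

```python
# Sketch: keep the heavy augmentation only when the model does not find it
# far harder than the basic augmentation of the same sample.
import torch
import torch.nn.functional as F

def dual_aug_batch(model, x_basic, x_heavy, y, margin=1.0):
    with torch.no_grad():
        loss_b = F.cross_entropy(model(x_basic), y, reduction="none")
        loss_h = F.cross_entropy(model(x_heavy), y, reduction="none")
    keep_heavy = loss_h <= loss_b + margin          # reject likely-OOD views
    x_train = torch.where(keep_heavy[:, None, None, None], x_heavy, x_basic)
    return x_train, keep_heavy

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
xb, xh = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
x_train, kept = dual_aug_batch(model, xb, xh, y)
print(x_train.shape, int(kept.sum()), "heavy views kept")
```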

Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting

  • paper_url: http://arxiv.org/abs/2310.08129
  • repo_url: None
  • paper_authors: Zijie Chen, Lichao Zhang, Fangsheng Weng, Lili Pan, Zhenzhong Lan
  • for: 提高文本到图像生成的个性化性和用户体验
  • methods: 利用历史用户与系统交互增强用户提示,并使用大规模文本到图像数据集进行提示重写
  • results: 比基eline方法有显著提高,在新的离线评估方法和在线测试中得到较高的效果
    Abstract We propose a novel perspective of viewing large pretrained models as search engines, thereby enabling the repurposing of techniques previously used to enhance search engine performance. As an illustration, we employ a personalized query rewriting technique in the realm of text-to-image generation. Despite significant progress in the field, it is still challenging to create personalized visual representations that align closely with the desires and preferences of individual users. This process requires users to articulate their ideas in words that are both comprehensible to the models and accurately capture their vision, posing difficulties for many users. In this paper, we tackle this challenge by leveraging historical user interactions with the system to enhance user prompts. We propose a novel approach that involves rewriting user prompts based a new large-scale text-to-image dataset with over 300k prompts from 3115 users. Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs. Experimental results demonstrate the superiority of our methods over baseline approaches, as evidenced in our new offline evaluation method and online tests. Our approach opens up exciting possibilities of applying more search engine techniques to build truly personalized large pretrained models.
    摘要 我们提出了一种新的视角,即将大型预训练模型视为搜索引擎,从而使得可以复用以前用于提高搜索引擎性能的技术。作为一个示例,我们在文本到图生成领域使用了个性化查询 rewrite 技术。虽然在这个领域已经做出了很大的进步,但是仍然很难创造个性化的视觉表示,使得用户需要用语言来表达他们的想法,这会对用户提出很大的挑战。在这篇论文中,我们解决了这个问题,通过利用系统历史用户交互记录来增强用户提示。我们提出了一种新的方法,即基于大规模文本到图数据集(包含超过 300k 提示,来自 3115 名用户)进行用户提示 rewrite。我们的 rewrite 模型可以提高用户提示的表达力和与愿景的匹配度。实验结果表明我们的方法在基准方法上有superiority,可见于我们新的离线评估方法和在线测试中。我们的方法开 up了应用更多搜索引擎技术来建立真正个性化的大型预训练模型的可能性。

Multimodal Active Measurement for Human Mesh Recovery in Close Proximity

  • paper_url: http://arxiv.org/abs/2310.08116
  • repo_url: None
  • paper_authors: Takahiro Maeda, Keisuke Takeshita, Kazuhito Tanaka
    for: 这个研究旨在提高人机交互中机器人的人体位姿估计精度,以实现安全和复杂的人机交互。methods: 本研究提出了一个活动测量和感应融合框架,使用equipped镜头和其他感应器,如触摸感应器和2D LiDAR,在人机交互中获取稀疏但可靠的感应讯号,并融合这些感应讯号和镜头测量估计的人体位姿,以提高人体位姿估计精度。results: 实验结果显示, compared to existing methods, 本研究的方法能够更好地估计人体位姿,尤其是在实际情况下,如人被覆盖物品 occluded 和人机交互中。
    Abstract For safe and sophisticated physical human-robot interactions (pHRI), a robot needs to estimate the accurate body pose or mesh of the target person. However, in these pHRI scenarios, the robot cannot fully observe the target person's body with equipped cameras because the target person is usually close to the robot. This leads to severe truncation and occlusions, and results in poor accuracy of human pose estimation. For better accuracy of human pose estimation or mesh recovery on this limited information from cameras, we propose an active measurement and sensor fusion framework of the equipped cameras and other sensors such as touch sensors and 2D LiDAR. These touch and LiDAR sensing are obtained attendantly through pHRI without additional costs. These sensor measurements are sparse but reliable and informative cues for human mesh recovery. In our active measurement process, camera viewpoints and sensor placements are optimized based on the uncertainty of the estimated pose, which is closely related to the truncated or occluded areas. In our sensor fusion process, we fuse the sensor measurements to the camera-based estimated pose by minimizing the distance between the estimated mesh and measured positions. Our method is agnostic to robot configurations. Experiments were conducted using the Toyota Human Support Robot, which has a camera, 2D LiDAR, and a touch sensor on the robot arm. Our proposed method demonstrated the superiority in the human pose estimation accuracy on the quantitative comparison. Furthermore, our proposed method reliably estimated the pose of the target person in practical settings such as target people occluded by a blanket and standing aid with the robot arm.
    摘要 为了实现安全且精细的物理人机交互（pHRI），机器人需要估计目标人员的精确人体姿态或网格。然而在这些pHRI场景中，目标人员通常距离机器人很近，机器人无法用所搭载的相机完整观察其身体，导致严重的截断与遮挡，使人体姿态估计精度很低。为了在相机信息受限的情况下提高人体姿态估计或网格恢复的精度，我们提出一个融合所搭载相机与触觉传感器、2D激光雷达等其他传感器的主动测量与传感融合框架。这些触觉与激光雷达的测量在pHRI过程中顺带获得，无需额外成本；它们虽然稀疏，却是可靠且信息丰富的人体网格恢复线索。在主动测量过程中，相机视角与传感器位置依据估计姿态的不确定性（与截断或遮挡区域密切相关）进行优化；在传感融合过程中，我们通过最小化估计网格与测量位置之间的距离，将传感器测量融合到基于相机的姿态估计中。该方法不依赖具体的机器人配置。我们使用搭载相机、2D激光雷达和机械臂触觉传感器的Toyota Human Support Robot进行了实验；在定量比较中，所提方法的人体姿态估计精度更优。此外，在目标人员被毯子遮挡、借助机械臂站立辅助等实际场景中，该方法也能可靠地估计目标人员的姿态。

Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models

  • paper_url: http://arxiv.org/abs/2310.08106
  • repo_url: https://github.com/BeierZhu/GLA
  • paper_authors: Beier Zhu, Kaihua Tang, Qianru Sun, Hanwang Zhang
  • for: 提高预训练模型的表现,尤其是在零shot任务上。
  • methods: 研究基础模型中的内在偏见问题,并提出一种通过优化来减少这种偏见的方法(Generalized Logit Adjustment,GLA)。
  • results: 在多个任务上达到了显著的改善,包括在ImageNet上的1.5 pp精度提升,以及在11个少量数据集上的大均值改善(1.4-4.6 pp)和长尾分类任务上的2.4 pp提升。
    Abstract Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data. Yet, the zero-shot performance is less competitive than a fully supervised one. Thus, to enhance the performance, fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks. However, we argue that such prior work has overlooked the inherent biases in foundation models. Due to the highly imbalanced Web-scale training set, these foundation models are inevitably skewed toward frequent semantics, and thus the subsequent fine-tuning or ensembling is still biased. In this study, we systematically examine the biases in foundation models and demonstrate the efficacy of our proposed Generalized Logit Adjustment (GLA) method. Note that bias estimation in foundation models is challenging, as most pre-train data cannot be explicitly accessed like in traditional long-tailed classification tasks. To this end, GLA has an optimization-based bias estimation approach for debiasing foundation models. As our work resolves a fundamental flaw in the pre-training, the proposed GLA demonstrates significant improvements across a diverse range of tasks: it achieves 1.5 pp accuracy gains on ImageNet, an large average improvement (1.4-4.6 pp) on 11 few-shot datasets, 2.4 pp gains on long-tailed classification. Codes are in \url{https://github.com/BeierZhu/GLA}.
    摘要 CLIP等基础模型无需额外训练数据即可在多种任务上实现零样本迁移，但其零样本性能仍不及全监督模型，因此通常还会采用微调与集成来更好地适配下游任务。然而，我们认为以往工作忽视了基础模型固有的偏置：由于网页规模的训练集高度不均衡，这些基础模型不可避免地偏向高频语义，于是后续的微调或集成仍然带有偏置。在本研究中，我们系统地考察了基础模型中的偏置，并展示了所提出的广义Logit调整（GLA）方法的有效性。需要指出，基础模型的偏置估计并不容易，因为与传统长尾分类任务不同，其大部分预训练数据无法被显式访问；为此，GLA采用一种基于优化的偏置估计方法来为基础模型去偏。由于我们的工作修正了预训练中的一个根本缺陷，GLA在各类任务上都带来了显著提升：在ImageNet上获得1.5个百分点的精度增益，在11个少样本数据集上平均提升1.4-4.6个百分点，在长尾分类上提升2.4个百分点。代码见 https://github.com/BeierZhu/GLA 。
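
The classic logit-adjustment step that this method builds on is a one-liner once a class-prior estimate is available: subtract the log prior from the zero-shot logits. How GLA actually estimates that prior (its optimization-based part) is not shown here; the prior values and temperature below are toy assumptions.

```python
# Minimal sketch of prior-based logit adjustment for debiasing predictions.
import torch

def adjust_logits(logits, class_prior, tau=1.0):
    """logits: (B, C); class_prior: (C,) estimated label distribution."""
    return logits - tau * torch.log(class_prior + 1e-12)

logits = torch.randn(4, 5)
prior = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])   # skewed toward class 0
print(adjust_logits(logits, prior).argmax(dim=1))
```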

SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing

  • paper_url: http://arxiv.org/abs/2310.08094
  • repo_url: None
  • paper_authors: Zijie Wu, Chaohui Yu, Zhen Zhu, Fan Wang, Xiang Bai
  • for: 这个研究的目的是提出一个简单且有效的单一图像转文本(I2T)倒排基eline,实现高品质的图像生成和自由的文本控制。
  • methods: 这个基eline使用了两个阶段的方案,第一阶段是调整学习的对象 embedding,使其专注在对话领域而不与无关的背景相关。第二阶段是精微调整T2I模型,以提高图像的可观性和避免语言漂移问题。
  • results: 这个基eline可以实现高品质的单一概念生成,同时允许自由的编辑。此外,这个基eline也可以实现单一图像新视角生成和多概念合成,不需要共同训练。我们设计了一个编辑提示列表和一个名为Editing Success Rate(ESR)的评估指标,以便评估编辑的灵活性。
    Abstract Recent progress in text-to-image (T2I) models enables high-quality image generation with flexible textual control. To utilize the abundant visual priors in the off-the-shelf T2I models, a series of methods try to invert an image to proper embedding that aligns with the semantic space of the T2I model. However, these image-to-text (I2T) inversion methods typically need multiple source images containing the same concept or struggle with the imbalance between editing flexibility and visual fidelity. In this work, we point out that the critical problem lies in the foreground-background entanglement when learning an intended concept, and propose a simple and effective baseline for single-image I2T inversion, named SingleInsert. SingleInsert adopts a two-stage scheme. In the first stage, we regulate the learned embedding to concentrate on the foreground area without being associated with the irrelevant background. In the second stage, we finetune the T2I model for better visual resemblance and devise a semantic loss to prevent the language drift problem. With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing. Additionally, SingleInsert can perform single-image novel view synthesis and multiple concepts composition without requiring joint training. To facilitate evaluation, we design an editing prompt list and introduce a metric named Editing Success Rate (ESR) for quantitative assessment of editing flexibility. Our project page is: https://jarrentwu1031.github.io/SingleInsert-web/
    摘要 最近的文本到图像(T2I)模型进步,使得高质量图像生成变得可控。为了利用存在的图像Visual prior,一些方法尝试将图像转换为与T2I模型的semantic空间匹配的嵌入。然而,这些图像到文本(I2T)反向方法通常需要多个包含同一概念的源图像,或者面临着编辑灵活性和视觉准确性之间的矛盾。在这种情况下,我们指出了带前景背景杂化的问题是学习某一概念的关键问题。为了解决这问题,我们提出了一种简单而有效的基线方法,名为SingleInsert。SingleInsert采用两个阶段方案。在第一阶段,我们规定学习的嵌入向量集中注意力集中在前景区域,而不与无关的背景相关。在第二阶段,我们进一步训练T2I模型,以更好地保持视觉准确性,并设置了semantic损失,以避免语言迁移问题。与传统方法相比,SingleInsert在单个概念生成中实现高视觉准确性,同时允许高灵活度编辑。此外,SingleInsert还可以完成单图像新视图生成和多个概念组合,无需共同训练。为方便评估,我们设计了编辑提示列表,并引入了一个名为Editing Success Rate(ESR)的评价指标,用于评估编辑flexibility的量化评价。我们的项目页面是:https://jarrentwu1031.github.io/SingleInsert-web/

Consistent123: Improve Consistency for One Image to 3D Object Synthesis

  • paper_url: http://arxiv.org/abs/2310.08092
  • repo_url: None
  • paper_authors: Haohan Weng, Tianyu Yang, Jianan Wang, Yu Li, Tong Zhang, C. L. Philip Chen, Lei Zhang
  • for: 提高视图一致性和三维重建性
  • methods: incorporating additional cross-view attention layers and shared self-attention mechanism
  • results: outperforms baselines in view consistency and shows great potential in 3D generation field
    Abstract Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability. However, such models based on image-to-image translation have no guarantee of view consistency, limiting the performance for downstream tasks like 3D reconstruction and image-to-3D generation. To empower consistency, we propose Consistent123 to synthesize novel views simultaneously by incorporating additional cross-view attention layers and the shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and novel views. In the sampling stage, such architecture supports simultaneously generating an arbitrary number of views while training at a fixed length. We also introduce a progressive classifier-free guidance strategy to achieve the trade-off between texture and geometry for synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate a significant improvement of Consistent123 on varying downstream tasks, showing its great potential in the 3D generation field. The project page is available at consistent-123.github.io.
    摘要 大型图像扩散模型可以实现高质量的新视图合成,但这些模型基于图像到图像翻译没有保证视图一致性,这限制了下游任务如3D重建和图像到3D转换的性能。为了强化一致性,我们提议Consistent123同时生成新视图,通过添加跨视图注意力层和共享自注意机制来实现。该注意力机制提高了所生成视图之间的交互,以及condition视图和新视图之间的对齐。在抽取阶段,这种建筑支持同时生成任意数量的视图,并在固定长度进行训练。我们还提出了不需要分类器的进度导航策略,以实现Texture和Geometry之间的融合。Qualitative和量化实验显示,Consistent123在视图一致性方面大幅超过基eline。此外,我们还证明Consistent123在不同的下游任务上表现出了很大的提升,这表明它在3D生成领域的潜力非常大。项目页面可以在consistent-123.github.io上找到。

Implicit Shape and Appearance Priors for Few-Shot Full Head Reconstruction

  • paper_url: http://arxiv.org/abs/2310.08784
  • repo_url: None
  • paper_authors: Pol Caselles, Eduard Ramon, Jaime Garcia, Gil Triginer, Francesc Moreno-Noguer
  • for: 这篇论文主要targets few-shot full 3D head reconstruction, aiming to improve the efficiency and accuracy of coordinate-based neural representations.
  • methods: 该方法具有以下三个特点:1) incorporating a probabilistic shape and appearance prior into coordinate-based representations, 2) leveraging a differentiable renderer for fitting a signed distance function, and 3) employing parallelizable ray tracing and dynamic caching strategies.
  • results: 该方法可以在只使用几张输入图像(甚至只有一张)的情况下实现高精度的3D头部重建,并且比前一代方法快速多了一个数量级。此外,该方法还可以在测试阶段使用H3DS数据集进行评估,并达到了当前最佳的结果。
    Abstract Recent advancements in learning techniques that employ coordinate-based neural representations have yielded remarkable results in multi-view 3D reconstruction tasks. However, these approaches often require a substantial number of input views (typically several tens) and computationally intensive optimization procedures to achieve their effectiveness. In this paper, we address these limitations specifically for the problem of few-shot full 3D head reconstruction. We accomplish this by incorporating a probabilistic shape and appearance prior into coordinate-based representations, enabling faster convergence and improved generalization when working with only a few input images (even as low as a single image). During testing, we leverage this prior to guide the fitting process of a signed distance function using a differentiable renderer. By incorporating the statistical prior alongside parallelizable ray tracing and dynamic caching strategies, we achieve an efficient and accurate approach to few-shot full 3D head reconstruction. Moreover, we extend the H3DS dataset, which now comprises 60 high-resolution 3D full head scans and their corresponding posed images and masks, which we use for evaluation purposes. By leveraging this dataset, we demonstrate the remarkable capabilities of our approach in achieving state-of-the-art results in geometry reconstruction while being an order of magnitude faster than previous approaches.
    摘要 近年来，采用基于坐标的神经表示的学习技术在多视角3D重建任务中取得了显著成果。然而，这些方法通常需要大量输入视角（往往数十个）以及计算量庞大的优化过程才能发挥效果。本文针对少样本完整头部3D重建问题解决上述局限：我们在基于坐标的表示中引入概率形状与外观先验，使模型在仅有少量输入图像（甚至只有一张）时也能更快收敛并具备更好的泛化能力。在测试阶段，我们借助该先验，利用可微渲染器引导有符号距离函数的拟合过程。通过将统计先验与可并行的光线追踪及动态缓存策略相结合，我们得到了一种高效且精确的少样本完整头部3D重建方法。此外，我们扩展了H3DS数据集，使其包含60个高分辨率3D完整头部扫描及对应的带位姿图像与掩码，用于评估。基于该数据集，我们的方法在几何重建上取得了最先进的结果，同时比以往方法快一个数量级。

Volumetric Medical Image Segmentation via Scribble Annotations and Shape Priors

  • paper_url: http://arxiv.org/abs/2310.08084
  • repo_url: None
  • paper_authors: Qiuhui Chen, Haiying Lyu, Xinyue Hu, Yong Lu, Yi Hong
  • for: 这个论文目的是提出一种基于scribble的三维图像分割方法,以提高边界预测和ROI的形态regularization。
  • methods: 该方法使用了一种2.5D注意力UNet,加上一个提议的标签传播模块,以扩展scribble中的semantic信息,并使用了static和active边界预测来学习ROI的边界和形态regulation。
  • results: 对于三个公共数据集和一个私有数据集, experiments demonstrate that our Scribble2D5方法可以在基于scribble的volumetric图像分割 task中 achieve state-of-the-art performance,并且可以利用shape prior信息来进一步提高模型准确性。
    Abstract Recently, weakly-supervised image segmentation using weak annotations like scribbles has gained great attention in computer vision and medical image analysis, since such annotations are much easier to obtain compared to time-consuming and labor-intensive labeling at the pixel/voxel level. However, due to a lack of structure supervision on regions of interest (ROIs), existing scribble-based methods suffer from poor boundary localization. Furthermore, most current methods are designed for 2D image segmentation, which do not fully leverage the volumetric information if directly applied to each image slice. In this paper, we propose a scribble-based volumetric image segmentation, Scribble2D5, which tackles 3D anisotropic image segmentation and aims to its improve boundary prediction. To achieve this, we augment a 2.5D attention UNet with a proposed label propagation module to extend semantic information from scribbles and use a combination of static and active boundary prediction to learn ROI's boundary and regularize its shape. Also, we propose an optional add-on component, which incorporates the shape prior information from unpaired segmentation masks to further improve model accuracy. Extensive experiments on three public datasets and one private dataset demonstrate our Scribble2D5 achieves state-of-the-art performance on volumetric image segmentation using scribbles and shape prior if available.
    摘要 近年来，利用涂鸦（scribble）等弱标注进行弱监督图像分割在计算机视觉与医学影像分析中受到广泛关注，因为相比耗时费力的像素/体素级标注，这类标注更容易获得。然而，由于缺乏对感兴趣区域（ROI）的结构监督，现有基于涂鸦的方法边界定位较差；而且多数现有方法是为2D图像分割设计的，直接逐片应用时无法充分利用体数据信息。本文提出一种基于涂鸦的体数据图像分割方法Scribble2D5，面向3D各向异性图像分割并着力改进边界预测。为此，我们在2.5D注意力UNet的基础上加入标签传播模块，以扩展涂鸦所携带的语义信息，并结合静态与活动边界预测来学习ROI边界并对其形状进行正则化。此外，我们还提出一个可选的附加组件，利用非配对分割掩码中的形状先验信息进一步提高模型精度。在三个公开数据集和一个私有数据集上的大量实验表明，Scribble2D5在基于涂鸦（并可利用形状先验）的体数据图像分割上达到了最先进的性能。

Jointly Optimized Global-Local Visual Localization of UAVs

  • paper_url: http://arxiv.org/abs/2310.08082
  • repo_url: None
  • paper_authors: Haoling Li, Jiuniu Wang, Zhiwei Wei, Wenjia Xu
  • For: 本研究旨在解决无人机在GNSS干扰和不可靠情况下的导航和定位问题,特别是解决传统方法(如同时地图和视差估计)的缺陷,如错误积累和实时性不足。* Methods: 我们提出了一种新的全球-地方视觉定位网络(GLVL),该网络是一种两个阶段的视觉定位方法,其首先使用大规模检索模块找到与无人机飞行场景中相似的区域,然后使用细腻匹配模块确定精确的无人机坐标,实现实时和精确的定位。* Results: 我们在六个无人机飞行场景中进行了实验,包括了Texture-rich和Texture-sparse两类场景。结果表明,我们的方法可以实现实时精确的定位要求,特别是在村庄场景中,我们的方法可以在0.48秒内达到2.39米的定位错误。
    Abstract Navigation and localization of UAVs present a challenge when global navigation satellite systems (GNSS) are disrupted and unreliable. Traditional techniques, such as simultaneous localization and mapping (SLAM) and visual odometry (VO), exhibit certain limitations in furnishing absolute coordinates and mitigating error accumulation. Existing visual localization methods achieve autonomous visual localization without error accumulation by matching with ortho satellite images. However, doing so cannot guarantee real-time performance due to the complex matching process. To address these challenges, we propose a novel Global-Local Visual Localization (GLVL) network. Our GLVL network is a two-stage visual localization approach, combining a large-scale retrieval module that finds similar regions with the UAV flight scene, and a fine-grained matching module that localizes the precise UAV coordinate, enabling real-time and precise localization. The training process is jointly optimized in an end-to-end manner to further enhance the model capability. Experiments on six UAV flight scenes encompassing both texture-rich and texture-sparse regions demonstrate the ability of our model to achieve the real-time precise localization requirements of UAVs. Particularly, our method achieves a localization error of only 2.39 meters in 0.48 seconds in a village scene with sparse texture features.
    摘要 当全球导航卫星系统（GNSS）受到干扰、不可靠时，无人机的导航与定位面临挑战。同时定位与建图（SLAM）、视觉里程计（VO）等传统技术在提供绝对坐标和抑制误差累积方面存在局限。现有的视觉定位方法通过与正射卫星影像匹配实现无误差累积的自主视觉定位，但复杂的匹配过程无法保证实时性。为应对这些挑战，我们提出一种新的全局-局部视觉定位网络（GLVL）。GLVL是一种两阶段视觉定位方法：先由大规模检索模块找到与无人机飞行场景相似的区域，再由细粒度匹配模块定位精确的无人机坐标，从而实现实时且精准的定位；训练过程以端到端方式联合优化，进一步增强模型能力。在涵盖纹理丰富与纹理稀疏区域的六个无人机飞行场景上的实验表明，我们的模型能够满足无人机实时精准定位的要求。特别地，在纹理特征稀疏的村庄场景中，我们的方法在0.48秒内取得了仅2.39米的定位误差。

RT-SRTS: Angle-Agnostic Real-Time Simultaneous 3D Reconstruction and Tumor Segmentation from Single X-Ray Projection

  • paper_url: http://arxiv.org/abs/2310.08080
  • repo_url: None
  • paper_authors: Miao Zhu, Qiming Fu, Bo Liu, Mengxi Zhang, Bojian Li, Xiaoyan Luo, Fugen Zhou
  • for: 这篇论文的目的是提出一种新的医疗影像重建方法,以帮助肿瘤治疗中的放射线治疗过程。
  • methods: 这篇论文使用的方法是基于多任务学习(MTL)的一种综合三维图像重建和肿瘤分类的网络,可以实现单据X射线像面的实时三维重建和肿瘤分类。此外,还提出了注意力增强calibrator(AEC)和不确定区域详细(URE)模组,以帮助特征提取和提高分类精度。
  • results: 这篇论文的结果显示,提出的方法可以实现实时三维重建和肿瘤分类,并且与两种现有方法比较,表现更加出色。实际上,这篇论文可以实现单据X射线像面的实时三维重建和肿瘤分类,并且可以在约70ms内完成这个过程,远远超过了实时肿瘤追踪所需的时间点。此外,还进一步验证了AEC和URE模组的有效性。
    Abstract Radiotherapy is one of the primary treatment methods for tumors, but the organ movement caused by respiratory motion limits its accuracy. Recently, 3D imaging from single X-ray projection receives extensive attentions as a promising way to address this issue. However, current methods can only reconstruct 3D image without direct location of the tumor and are only validated for fixed-angle imaging, which fails to fully meet the requirement of motion control in radiotherapy. In this study, we propose a novel imaging method RT-SRTS which integrates 3D imaging and tumor segmentation into one network based on the multi-task learning (MTL) and achieves real-time simultaneous 3D reconstruction and tumor segmentation from single X-ray projection at any angle. Futhermore, we propose the attention enhanced calibrator (AEC) and uncertain-region elaboration (URE) modules to aid feature extraction and improve segmentation accuracy. We evaluated the proposed method on ten patient cases and compared it with two state-of-the-art methods. Our approach not only delivered superior 3D reconstruction but also demonstrated commendable tumor segmentation results. The simultaneous reconstruction and segmentation could be completed in approximately 70 ms, significantly faster than the required time threshold for real-time tumor tracking. The efficacy of both AEC and URE was also validated through ablation studies.
    摘要 放射治疗是肿瘤的主要治疗手段之一，但呼吸运动引起的器官移动限制了其精度。近来，基于单张X射线投影的3D成像作为解决这一问题的有前景的途径受到广泛关注。然而，现有方法只能重建3D图像而无法直接定位肿瘤，且仅在固定角度成像下得到验证，难以完全满足放疗中运动控制的需求。本研究提出一种新的成像方法RT-SRTS：基于多任务学习（MTL）将3D重建与肿瘤分割整合到同一网络中，实现从任意角度的单张X射线投影中实时同步完成3D重建与肿瘤分割。此外，我们还提出注意力增强校准器（AEC）与不确定区域精化（URE）模块，以辅助特征提取并提升分割精度。我们在十个患者病例上评估了所提方法，并与两种最先进方法进行了比较：我们的方法不仅给出了更优的3D重建，肿瘤分割结果同样出色。同步重建与分割可在约70毫秒内完成，明显快于实时肿瘤跟踪所要求的时间阈值。消融实验也验证了AEC与URE的有效性。

Samples on Thin Ice: Re-Evaluating Adversarial Pruning of Neural Networks

  • paper_url: http://arxiv.org/abs/2310.08073
  • repo_url: None
  • paper_authors: Giorgio Piras, Maura Pintor, Ambra Demontis, Battista Biggio
  • for: 这篇论文重新评估了三种最新的对抗剪枝（adversarial pruning）方法，检验其对对抗样本的鲁棒性是否被高估。
  • methods: 重新评估这三种最先进的对抗剪枝方法，并对同一模型的剪枝版本与稠密版本进行对比，分析靠近未剪枝模型决策边界的样本在剪枝后的分类情况。
  • results: 研究发现这些方法的鲁棒性确实被高估；剪枝后的模型通常会错误分类那些更靠近原始（未剪枝）模型决策边界的样本。
    Abstract Neural network pruning has shown to be an effective technique for reducing the network size, trading desirable properties like generalization and robustness to adversarial attacks for higher sparsity. Recent work has claimed that adversarial pruning methods can produce sparse networks while also preserving robustness to adversarial examples. In this work, we first re-evaluate three state-of-the-art adversarial pruning methods, showing that their robustness was indeed overestimated. We then compare pruned and dense versions of the same models, discovering that samples on thin ice, i.e., closer to the unpruned model's decision boundary, are typically misclassified after pruning. We conclude by discussing how this intuition may lead to designing more effective adversarial pruning methods in future work.
    摘要 神经网络剪枝已被证明是缩减网络规模的有效技术，它以牺牲泛化能力、对抗鲁棒性等期望性质为代价换取更高的稀疏度。近期工作声称，对抗剪枝方法可以在得到稀疏网络的同时保持对对抗样本的鲁棒性。在本工作中，我们首先重新评估了三种最先进的对抗剪枝方法，表明其鲁棒性确实被高估了；随后比较同一模型的剪枝版本与稠密版本，发现那些“如履薄冰”的样本——即更靠近未剪枝模型决策边界的样本——在剪枝后通常会被错误分类。最后我们讨论了这一直觉如何有助于在未来工作中设计更有效的对抗剪枝方法。

Learning Transferable Conceptual Prototypes for Interpretable Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.08071
  • repo_url: None
  • paper_authors: Junyu Gao, Xinhong Ma, Changsheng Xu
  • for: 本研究旨在提出一种可解释的频繁领域适应(UDA)方法,以提高模型的安全性和可控性。
  • methods: 本方法基于层次分类模型,设计了一个层次概念模型(TCPL),通过将来源频繁领域的基本概念传递到目标频繁领域,学习了频繁领域共享的原型。同时,设计了一种自适应的自我预测稳定潜在标签策略,以选择适合 Pseudo 注解的目标样本,逐渐缩小频繁领域的差距。
  • results: 实验表明,提出的方法可以不仅提供有效和直观的解释,还能够超越之前的状态。
    Abstract Despite the great progress of unsupervised domain adaptation (UDA) with the deep neural networks, current UDA models are opaque and cannot provide promising explanations, limiting their applications in the scenarios that require safe and controllable model decisions. At present, a surge of work focuses on designing deep interpretable methods with adequate data annotations and only a few methods consider the distributional shift problem. Most existing interpretable UDA methods are post-hoc ones, which cannot facilitate the model learning process for performance enhancement. In this paper, we propose an inherently interpretable method, named Transferable Conceptual Prototype Learning (TCPL), which could simultaneously interpret and improve the processes of knowledge transfer and decision-making in UDA. To achieve this goal, we design a hierarchically prototypical module that transfers categorical basic concepts from the source domain to the target domain and learns domain-shared prototypes for explaining the underlying reasoning process. With the learned transferable prototypes, a self-predictive consistent pseudo-label strategy that fuses confidence, predictions, and prototype information, is designed for selecting suitable target samples for pseudo annotations and gradually narrowing down the domain gap. Comprehensive experiments show that the proposed method can not only provide effective and intuitive explanations but also outperform previous state-of-the-arts.
    摘要 尽管深度神经网络在无监督域适应（UDA）中取得了巨大进步，但现有UDA模型仍不透明，无法给出令人信服的解释，这限制了其在需要安全、可控模型决策的场景中的应用。目前大量工作集中于在充足数据标注下设计深度可解释方法，只有少数方法考虑分布偏移问题；而现有的可解释UDA方法大多是事后（post-hoc）方法，无法在模型学习过程中促进性能提升。本文提出一种本质可解释的方法——可迁移概念原型学习（TCPL），能够同时解释并改进UDA中的知识迁移与决策过程。为实现这一目标，我们设计了层级原型模块，将源域中的基础类别概念迁移到目标域，并学习域间共享的原型以解释其底层推理过程。基于学得的可迁移原型，我们设计了一种融合置信度、预测结果与原型信息的自预测一致伪标签策略，用于挑选适合伪标注的目标样本，逐步缩小域间差距。完整的实验表明，所提方法不仅能提供有效且直观的解释，还能超越先前的最先进方法。

Frequency-Aware Re-Parameterization for Over-Fitting Based Image Compression

  • paper_url: http://arxiv.org/abs/2310.08068
  • repo_url: None
  • paper_authors: Yun Ye, Yanjie Pan, Qually Jiang, Ming Lu, Xiaoran Fang, Beryl Xu
  • for: 基于过拟合的图像压缩要求权重紧凑以利于压缩，并需要快速收敛以便实际使用，这给基于深度卷积神经网络（CNN）的方法带来了挑战。
  • methods: 本文提出一种简单的重参数化方法，用于以更少的权重存储量和更快的收敛速度训练CNN：将卷积核重参数化为离散余弦变换（DCT）核的加权和，从而可以直接在频域中优化，并结合L1正则化。
  • results: 所提方法以较低的计算开销超越了普通卷积，取得显著更优的率失真表现；在多个数据集上的过拟合式图像复原实验中，仅用200次迭代即可在HEIF基础上取得最高 -46.12% 的BD-rate提升。
    Abstract Over-fitting-based image compression requires weights compactness for compression and fast convergence for practical use, posing challenges for deep convolutional neural networks (CNNs) based methods. This paper presents a simple re-parameterization method to train CNNs with reduced weights storage and accelerated convergence. The convolution kernels are re-parameterized as a weighted sum of discrete cosine transform (DCT) kernels enabling direct optimization in the frequency domain. Combined with L1 regularization, the proposed method surpasses vanilla convolutions by achieving a significantly improved rate-distortion with low computational cost. The proposed method is verified with extensive experiments of over-fitting-based image restoration on various datasets, achieving up to -46.12% BD-rate on top of HEIF with only 200 iterations.
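
A hedged sketch of the re-parameterization above: the convolution kernel is formed as a learned weighted sum of 2D DCT basis kernels, so optimization happens on frequency-domain coefficients, with an L1 penalty encouraging sparse, compact weights. The class and helper names, the initialization, and the `l1_penalty` hook are assumptions for illustration, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def dct2_basis(k):
    """Orthonormal 2D DCT-II basis for a k x k kernel, shape (k*k, k, k)."""
    n = torch.arange(k, dtype=torch.float32)
    t = torch.cos(math.pi / k * (n[None, :] + 0.5) * n[:, None])   # t[u, p]
    t[0] *= 1.0 / math.sqrt(2.0)
    t *= math.sqrt(2.0 / k)
    basis = torch.einsum('up,vq->uvpq', t, t)                      # separable 2D basis
    return basis.reshape(k * k, k, k)

class DCTReparamConv2d(nn.Module):
    """Convolution whose kernel is a weighted sum of fixed DCT kernels."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.register_buffer('basis', dct2_basis(k))               # fixed, not trained
        self.coeff = nn.Parameter(torch.empty(out_ch, in_ch, k * k))
        nn.init.normal_(self.coeff, std=0.02)
        self.padding = padding

    def forward(self, x):
        # weight[o, i] = sum_b coeff[o, i, b] * basis[b]
        weight = torch.einsum('oib,bhw->oihw', self.coeff, self.basis)
        return F.conv2d(x, weight, padding=self.padding)

    def l1_penalty(self):
        return self.coeff.abs().sum()     # added to the loss for weight compactness
```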

Age Estimation Based on Graph Convolutional Networks and Multi-head Attention Mechanisms

  • paper_url: http://arxiv.org/abs/2310.08064
  • repo_url: None
  • paper_authors: Miaomiao Yang, Changwei Yao, Shijin Yan
  • for: This work targets age estimation for face-based identity authentication, e.g. verifying users of juvenile anti-addiction systems in games, and aims to raise estimation accuracy.
  • methods: A Graph Convolutional Network (GCN) flexibly extracts features from irregularly shaped face images, and multi-head attention mechanisms are added to suppress redundant background information and capture key regions.
  • results: The model improves age estimation accuracy, reducing the MAE to about 3.64, which is better than current age estimation models and in turn benefits face recognition and identity authentication.
    Abstract Age estimation technology is a part of facial recognition and has been applied to identity authentication. This technology achieves the development and application of a juvenile anti-addiction system by authenticating users in the game. Convolutional Neural Network (CNN) and Transformer algorithms are widely used in this application scenario. However, these two models cannot flexibly extract and model features of faces with irregular shapes, and they are ineffective in capturing key information. Furthermore, the above methods will contain a lot of background information while extracting features, which will interfere with the model. In consequence, it is easy to extract redundant information from images. In this paper, a new modeling idea is proposed to solve this problem, which can flexibly model irregular objects. The Graph Convolutional Network (GCN) is used to extract features from irregular face images effectively, and multi-head attention mechanisms are added to avoid redundant features and capture key region information in the image. This model can effectively improve the accuracy of age estimation and reduce the MAE error value to about 3.64, which is better than the effect of today's age estimation model, to improve the accuracy of face recognition and identity authentication.
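
A schematic of the kind of architecture the abstract describes: graph convolutions over facial landmark nodes, multi-head self-attention to focus on key regions, then a regression head for age. Node features, the adjacency normalization, and all layer sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Basic graph convolution: aggregate neighbors via a normalized
    adjacency matrix, then project with a linear layer."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):          # x: (B, N, in_dim), adj_norm: (N, N)
        return torch.relu(self.linear(adj_norm @ x))

class GCNAgeEstimator(nn.Module):
    def __init__(self, in_dim=2, hidden=64, heads=4):
        super().__init__()
        self.gcn1 = GCNLayer(in_dim, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, nodes, adj_norm):      # nodes: (B, N, in_dim) landmark features
        h = self.gcn2(self.gcn1(nodes, adj_norm), adj_norm)
        h, _ = self.attn(h, h, h)            # attend to key facial regions
        return self.head(h.mean(dim=1))      # pooled node features -> predicted age
```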

EC-Depth: Exploring the consistency of self-supervised monocular depth estimation under challenging scenes

  • paper_url: http://arxiv.org/abs/2310.08044
  • repo_url: https://github.com/RuijieZhu94/EC-Depth
  • paper_authors: Ruijie Zhu, Ziyang Song, Chuxin Wang, Jianfeng He, Tianzhu Zhang
  • for: EC-Depth is designed to improve the robustness of self-supervised monocular depth estimation models in real-world applications, where adverse conditions are prevalent.
  • methods: The proposed method uses a two-stage training framework with a perturbation-invariant depth consistency constraint module and a consistency-based pseudo-label selection module to achieve accurate and consistent depth predictions.
  • results: EC-Depth surpasses existing state-of-the-art methods on the KITTI, KITTI-C, and DrivingStereo benchmarks, demonstrating its effectiveness in challenging scenarios.
    Abstract Self-supervised monocular depth estimation holds significant importance in the fields of autonomous driving and robotics. However, existing methods are typically designed to train and test on clear and pristine datasets, overlooking the impact of various adverse conditions prevalent in real-world scenarios. As a result, it is commonly observed that most self-supervised monocular depth estimation methods struggle to perform adequately under challenging conditions. To address this issue, we present EC-Depth, a novel self-supervised two-stage training framework to achieve a robust depth estimation, starting from the foundation of depth prediction consistency under different perturbations. Leveraging the proposed perturbation-invariant depth consistency constraint module and the consistency-based pseudo-label selection module, our model attains accurate and consistent depth predictions in both standard and challenging scenarios. Extensive experiments substantiate the effectiveness of the proposed method. Moreover, our method surpasses existing state-of-the-art methods on KITTI, KITTI-C and DrivingStereo benchmarks, demonstrating its potential for enhancing the reliability of self-supervised monocular depth estimation models in real-world applications.
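
To make the perturbation-invariant consistency idea concrete, below is a minimal sketch of one plausible form of the constraint: the depth predicted for a perturbed image is pulled toward the (detached) depth predicted for the clean image. The L1 form and the `perturb_fn` argument are assumptions, not EC-Depth's exact loss.

```python
import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_net, img, perturb_fn):
    """Consistency between depth on clean and perturbed views of the same image."""
    with torch.no_grad():
        d_clean = depth_net(img)             # clean prediction used as the target
    d_pert = depth_net(perturb_fn(img))      # e.g. noise / photometric perturbation
    return F.l1_loss(d_pert, d_clean)

# Usage sketch: add the term to the usual self-supervised photometric objective,
# loss = photometric_loss + lam * depth_consistency_loss(net, img, add_noise)
```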

X-HRNet: Towards Lightweight Human Pose Estimation with Spatially Unidimensional Self-Attention

  • paper_url: http://arxiv.org/abs/2310.08042
  • repo_url: https://github.com/cool-xuan/x-hrnet
  • paper_authors: Yixuan Zhou, Xuanhan Wang, Xing Xu, Lei Zhao, Jingkuan Song
  • for: Lightweight human pose estimation: keep high accuracy while cutting computational complexity.
  • methods: Introduces Spatially Unidimensional Self-Attention (SUSA) as a lightweight replacement for the pointwise (1x1) convolution that dominates the cost of depthwise separable convolutions, and builds the X-HRNet backbone around it.
  • results: Achieves accurate pose estimation while reducing the computational complexity of the pointwise convolution by 96%; the code is publicly available.
    Abstract High-resolution representation is necessary for human pose estimation to achieve high performance, and the ensuing problem is high computational complexity. In particular, predominant pose estimation methods estimate human joints by 2D single-peak heatmaps. Each 2D heatmap can be horizontally and vertically projected to and reconstructed by a pair of 1D heat vectors. Inspired by this observation, we introduce a lightweight and powerful alternative, Spatially Unidimensional Self-Attention (SUSA), to the pointwise (1x1) convolution that is the main computational bottleneck in the depthwise separable 3x3 convolution. Our SUSA reduces the computational complexity of the pointwise (1x1) convolution by 96% without sacrificing accuracy. Furthermore, we use the SUSA as the main module to build our lightweight pose estimation backbone X-HRNet, where `X' represents the estimated cross-shape attention vectors. Extensive experiments on the COCO benchmark demonstrate the superiority of our X-HRNet, and comprehensive ablation studies show the effectiveness of the SUSA modules. The code is publicly available at https://github.com/cool-xuan/x-hrnet.
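
The observation that a 2D single-peak heatmap can be summarized by a pair of 1D heat vectors is easy to state in code; the soft-argmax decoding below is a common choice and an assumption here, not necessarily X-HRNet's exact decoder.

```python
import torch

def heatmap_to_1d(heatmap):
    """Project (B, K, H, W) joint heatmaps into horizontal and vertical 1D vectors."""
    x_vec = heatmap.sum(dim=2)   # (B, K, W): profile along the x axis
    y_vec = heatmap.sum(dim=3)   # (B, K, H): profile along the y axis
    return x_vec, y_vec

def decode_coords(heatmap):
    """Soft-argmax over the 1D projections gives joint coordinates."""
    x_vec, y_vec = heatmap_to_1d(heatmap)
    xs = torch.softmax(x_vec, dim=-1) @ torch.arange(x_vec.shape[-1], dtype=torch.float32)
    ys = torch.softmax(y_vec, dim=-1) @ torch.arange(y_vec.shape[-1], dtype=torch.float32)
    return torch.stack([xs, ys], dim=-1)     # (B, K, 2)
```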

Continual Learning via Manifold Expansion Replay

  • paper_url: http://arxiv.org/abs/2310.08038
  • repo_url: None
  • paper_authors: Zihao Xu, Xuan Tang, Yufei Shi, Jianfeng Zhang, Jian Yang, Mingsong Chen, Xian Wei
  • for: This work aims to improve stability and expressiveness in continual learning by expanding the geometric extent of the knowledge representation kept in replay memory.
  • methods: Proposes a new replay strategy, Manifold Expansion Replay (MaER), which greedily increases the diameter of the implicit manifold spanned by the knowledge in the buffer during memory management, and uses the Wasserstein distance instead of cross-entropy as the distillation loss to preserve previous knowledge.
  • results: Extensive experiments on MNIST, CIFAR10, CIFAR100, and TinyImageNet show that the method significantly improves accuracy in the continual learning setting, outperforming the state of the art.
    Abstract In continual learning, the learner learns multiple tasks in sequence, with data being acquired only once for each task. Catastrophic forgetting is a major challenge to continual learning. To reduce forgetting, some existing rehearsal-based methods use episodic memory to replay samples of previous tasks. However, in the process of knowledge integration when learning a new task, this strategy also suffers from catastrophic forgetting due to an imbalance between old and new knowledge. To address this problem, we propose a novel replay strategy called Manifold Expansion Replay (MaER). We argue that expanding the implicit manifold of the knowledge representation in the episodic memory helps to improve the robustness and expressiveness of the model. To this end, we propose a greedy strategy to keep increasing the diameter of the implicit manifold represented by the knowledge in the buffer during memory management. In addition, we introduce Wasserstein distance instead of cross entropy as distillation loss to preserve previous knowledge. With extensive experimental validation on MNIST, CIFAR10, CIFAR100, and TinyImageNet, we show that the proposed method significantly improves the accuracy in continual learning setup, outperforming the state of the arts.
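
A sketch, under stated assumptions, of a greedy buffer-selection rule in the spirit of "keep increasing the diameter of the implicit manifold": farthest-point sampling over feature space, which monotonically grows the diameter of the stored set. MaER's actual memory-management rule and its Wasserstein distillation loss are not reproduced here.

```python
import torch

def greedy_diameter_selection(features, budget):
    """Pick `budget` samples by repeatedly adding the point farthest from
    those already kept (features: (N, D) tensor); returns kept indices."""
    n = features.shape[0]
    dists = torch.cdist(features, features)            # pairwise Euclidean distances
    kept = [int(dists.sum(dim=1).argmax())]            # start from an extreme point
    while len(kept) < min(budget, n):
        d_to_kept = dists[:, kept].min(dim=1).values   # distance to nearest kept point
        d_to_kept[kept] = -1.0                         # never re-pick a kept point
        kept.append(int(d_to_kept.argmax()))
    return kept
```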

BaSAL: Size Balanced Warm Start Active Learning for LiDAR Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.08035
  • repo_url: None
  • paper_authors: Jiarong Wei, Yancong Lin, Holger Caesar
  • for: Reduce costly data annotation by repeatedly asking an annotator to label the most informative samples from a pool of unlabeled LiDAR data and retraining the model on them.
  • methods: A size-balanced warm-start active learning model that samples object clusters according to their characteristic size, yielding a dataset that is both size- and class-balanced; since size-based sampling needs no pre-trained model, it also addresses the cold start problem.
  • results: Improves the initial model by a large margin and, using only 5% of the annotations, performs comparably to training on the entire SemanticKITTI dataset, outperforming existing active learning methods and matching the state of the art on nuScenes.
    Abstract Active learning strives to reduce the need for costly data annotation, by repeatedly querying an annotator to label the most informative samples from a pool of unlabeled data and retraining a model from these samples. We identify two problems with existing active learning methods for LiDAR semantic segmentation. First, they ignore the severe class imbalance inherent in LiDAR semantic segmentation datasets. Second, to bootstrap the active learning loop, they train their initial model from randomly selected data samples, which leads to low performance and is referred to as the cold start problem. To address these problems we propose BaSAL, a size-balanced warm start active learning model, based on the observation that each object class has a characteristic size. By sampling object clusters according to their size, we can thus create a size-balanced dataset that is also more class-balanced. Furthermore, in contrast to existing information measures like entropy or CoreSet, size-based sampling does not require an already trained model and thus can be used to address the cold start problem. Results show that we are able to improve the performance of the initial model by a large margin. Combining size-balanced sampling and warm start with established information measures, our approach achieves a comparable performance to training on the entire SemanticKITTI dataset, despite using only 5% of the annotations, which outperforms existing active learning methods. We also match the existing state-of-the-art in active learning on nuScenes. Our code will be made available upon paper acceptance.
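
A minimal illustration of size-balanced sampling for the warm start: object clusters are bucketed by their physical size and drawn evenly across buckets, so small classes are not swamped by large ones. The bin edges and the per-bin quota are hypothetical parameters, not BaSAL's settings.

```python
import random
from collections import defaultdict

def size_balanced_sample(clusters, sizes, n_per_bin, bin_edges=(1.0, 4.0, 16.0)):
    """clusters: list of cluster ids; sizes: matching list of cluster sizes
    (e.g. bounding-box volume in m^3). Returns a size-balanced subset."""
    bins = defaultdict(list)
    for cluster, size in zip(clusters, sizes):
        idx = sum(size > edge for edge in bin_edges)   # index of the size bucket
        bins[idx].append(cluster)
    sample = []
    for bucket in bins.values():
        sample.extend(random.sample(bucket, min(n_per_bin, len(bucket))))
    return sample
```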

Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval

  • paper_url: http://arxiv.org/abs/2310.08009
  • repo_url: https://github.com/IMCCretrieval/DKPH
  • paper_authors: Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, Yongdong Zhang
  • for: Improve unsupervised video hashing by decomposing video information into reconstruction-dependent and semantic-dependent parts, disentangling semantic extraction from the reconstruction constraint.
  • methods: A simple dual-stream structure with a temporal layer and a hash layer: guided by semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture semantics in binary codes for retrieval, while the temporal layer learns the information needed for reconstruction.
  • results: Comprehensive experiments on three video benchmarks show the method consistently outperforms the state of the art.
    Abstract Unsupervised video hashing usually optimizes binary codes by learning to reconstruct input videos. Such reconstruction constraint spends much effort on frame-level temporal context changes without focusing on video-level global semantics that are more useful for retrieval. Hence, we address this problem by decomposing video information into reconstruction-dependent and semantic-dependent information, which disentangles the semantic extraction from reconstruction constraint. Specifically, we first design a simple dual-stream structure, including a temporal layer and a hash layer. Then, with the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval, while the temporal layer learns to capture the information for reconstruction. In this way, the model naturally preserves the disentangled semantics into binary codes. Validated by comprehensive experiments, our method consistently outperforms the state-of-the-arts on three video benchmarks.
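
A schematic dual-stream head illustrating the split described above: one branch produces relaxed binary codes for semantic retrieval, the other keeps temporal information for reconstruction. The layer types and sizes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualStreamHead(nn.Module):
    def __init__(self, feat_dim=512, code_bits=64):
        super().__init__()
        self.hash_layer = nn.Linear(feat_dim, code_bits)                     # semantic stream
        self.temporal_layer = nn.GRU(feat_dim, feat_dim, batch_first=True)   # reconstruction stream

    def forward(self, frame_feats):                  # frame_feats: (B, T, feat_dim)
        codes = torch.tanh(self.hash_layer(frame_feats.mean(dim=1)))  # relaxed bits in (-1, 1)
        recon_states, _ = self.temporal_layer(frame_feats)            # fed to a decoder elsewhere
        return codes, recon_states

# At retrieval time the relaxed codes are binarized, e.g. binary = codes.sign()
```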

MLP-AMDC: An MLP Architecture for Adaptive-Mask-based Dual-Camera snapshot hyperspectral imaging

  • paper_url: http://arxiv.org/abs/2310.08002
  • repo_url: https://github.com/caizeyu1992/MLP-AMDC
  • paper_authors: Zeyu Cai, Can Zhang, Xunhao Chen, Shanghuan Liu, Chengqian Jin, Feipeng Da
  • for: This paper aims to improve the performance and speed of Coded Aperture Snapshot Spectral Imaging (CASSI) systems, which are used to acquire Hyper-Spectral Images (HSI).
  • methods: The paper proposes an AMDC-CASSI system that uses an RGB camera with CASSI and Adaptive-Mask to improve the reconstruction quality of HSI. The proposed method replaces the transformer structure of the network with an MLP architecture to improve the inference speed of the reconstruction network.
  • results: The paper shows that the proposed MLP-AMDC method achieves an 8 dB improvement over the state-of-the-art (SOTA) and at least a 5-fold improvement in reconstruction speed, while maintaining competitive reconstruction quality.
    Abstract Coded Aperture Snapshot Spectral Imaging (CASSI) systems have great advantages over traditional methods in dynamically acquiring Hyper-Spectral Images (HSI), but the following problems remain. 1) Traditional masks rely on random patterns or analytical design, both of which limit the performance improvement of CASSI. 2) Existing high-quality reconstruction algorithms are slow in reconstruction and can only reconstruct scene information offline. To address these two problems, this paper designs the AMDC-CASSI system, introducing an RGB camera with CASSI based on an Adaptive-Mask as multimodal input to improve reconstruction quality. Existing SOTA reconstruction schemes are based on transformers, but the self-attention operation pulls down the operating efficiency of the network. To improve the inference speed of the reconstruction network, this paper proposes an MLP Architecture for Adaptive-Mask-based Dual-Camera (MLP-AMDC) to replace the transformer structure of the network. Numerous experiments have shown that the MLP performs no less well than transformer-based structures for HSI reconstruction, while greatly improving network inference speed with fewer parameters and operations; our method achieves an 8 dB improvement over SOTA and at least a 5-fold improvement in reconstruction speed. (https://github.com/caizeyu1992/MLP-AMDC.)
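
The paper replaces the transformer's self-attention with an MLP architecture for faster inference; the block below is a generic token-mixing MLP (in the spirit of MLP-Mixer), shown only to illustrate the kind of attention-free block this implies — it is not the MLP-AMDC design itself.

```python
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    """One MLP mixes across spatial tokens, another across channels."""
    def __init__(self, num_tokens, dim, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(num_tokens, hidden), nn.GELU(),
                                       nn.Linear(hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                         nn.Linear(hidden, dim))

    def forward(self, x):                            # x: (B, num_tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))
```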

Reset It and Forget It: Relearning Last-Layer Weights Improves Continual and Transfer Learning

  • paper_url: http://arxiv.org/abs/2310.07996
  • repo_url: None
  • paper_authors: Lapo Frati, Neil Traft, Jeff Clune, Nick Cheney
  • for: This paper identifies a simple pre-training mechanism that yields representations with better continual and transfer learning.
  • methods: The mechanism is the repeated resetting of the last layer's weights, nicknamed "zapping". Although originally designed for a meta-continual-learning procedure, it proves applicable in many settings beyond meta-learning and continual learning.
  • results: When transferring a pre-trained image classifier to new classes in a few shots, zapping improves transfer accuracy and/or speeds up adaptation in both standard fine-tuning and continual learning, often matching state-of-the-art meta-learning without expensive higher-order gradients; it can be seen as a computationally cheaper alternative for learning rapidly adaptable features.
    Abstract This work identifies a simple pre-training mechanism that leads to representations exhibiting better continual and transfer learning. This mechanism -- the repeated resetting of weights in the last layer, which we nickname "zapping" -- was originally designed for a meta-continual-learning procedure, yet we show it is surprisingly applicable in many settings beyond both meta-learning and continual learning. In our experiments, we wish to transfer a pre-trained image classifier to a new set of classes, in a few shots. We show that our zapping procedure results in improved transfer accuracy and/or more rapid adaptation in both standard fine-tuning and continual learning settings, while being simple to implement and computationally efficient. In many cases, we achieve performance on par with state of the art meta-learning without needing the expensive higher-order gradients, by using a combination of zapping and sequential learning. An intuitive explanation for the effectiveness of this zapping procedure is that representations trained with repeated zapping learn features that are capable of rapidly adapting to newly initialized classifiers. Such an approach may be considered a computationally cheaper type of, or alternative to, meta-learning rapidly adaptable features with higher-order gradients. This adds to recent work on the usefulness of resetting neural network parameters during training, and invites further investigation of this mechanism.
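
"Zapping" itself is simple to state in code: re-initialize the final classifier weights so the backbone must keep supporting freshly initialized heads. The sketch assumes the head is exposed as `model.fc` (a hypothetical attribute name) and omits the scheduling of repeated resets used during pre-training.

```python
import torch.nn as nn

def zap_last_layer(model):
    """Re-initialize the last linear layer in place (assumes `model.fc`)."""
    fc = model.fc
    nn.init.kaiming_normal_(fc.weight)
    if fc.bias is not None:
        nn.init.zeros_(fc.bias)
```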

CleftGAN: Adapting A Style-Based Generative Adversarial Network To Create Images Depicting Cleft Lip Deformity

  • paper_url: http://arxiv.org/abs/2310.07969
  • repo_url: None
  • paper_authors: Abdullah Hayajneh, Erchin Serpedin, Mohammad Shaqfeh, Graeme Glass, Mitchell A. Stotland
  • for: This paper aims to address the challenge of training a machine learning system to evaluate facial clefts by generating a large dataset of high-quality, ethics board-approved patient images using a deep learning-based cleft lip generator.
  • methods: The authors use a transfer learning protocol with a deep learning-based generative adversarial network image generator incorporating adaptive data augmentation (ADA) to generate a large dataset of artificial images exhibiting high-fidelity facsimiles of cleft lip with wide variation.
  • results: The authors found that StyleGAN3 with translation invariance (StyleGAN3-t) performed optimally as a base model, and the generated images achieved a low Frechet Inception Distance (FID) reflecting a close similarity to the training input dataset of genuine cleft images. The PPL and DISH measures also showed a smooth and semantically valid interpolation of images through the transfer learning process, and a similar distribution of severity in the training and generated images.
    Abstract A major obstacle when attempting to train a machine learning system to evaluate facial clefts is the scarcity of large datasets of high-quality, ethics board-approved patient images. In response, we have built a deep learning-based cleft lip generator designed to produce an almost unlimited number of artificial images exhibiting high-fidelity facsimiles of cleft lip with wide variation. We undertook a transfer learning protocol testing different versions of StyleGAN-ADA (a generative adversarial network image generator incorporating adaptive data augmentation (ADA)) as the base model. Training images depicting a variety of cleft deformities were pre-processed to adjust for rotation, scaling, color adjustment and background blurring. The ADA modification of the primary algorithm permitted construction of our new generative model while requiring input of a relatively small number of training images. Adversarial training was carried out using 514 unique frontal photographs of cleft-affected faces to adapt a pre-trained model based on 70,000 normal faces. The Frechet Inception Distance (FID) was used to measure the similarity of the newly generated facial images to the cleft training dataset, while Perceptual Path Length (PPL) and the novel Divergence Index of Severity Histograms (DISH) measures were also used to assess the performance of the image generator that we dub CleftGAN. We found that StyleGAN3 with translation invariance (StyleGAN3-t) performed optimally as a base model. Generated images achieved a low FID reflecting a close similarity to our training input dataset of genuine cleft images. Low PPL and DISH measures reflected a smooth and semantically valid interpolation of images through the transfer learning process and a similar distribution of severity in the training and generated images, respectively.
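
For reference, the Frechet Inception Distance quoted above is the standard statistic below, computed between Inception-feature distributions of real and generated images; this is the generic definition, not CleftGAN-specific code (PPL and DISH are not shown).

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two (num_images, feat_dim) arrays of Inception features."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):          # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```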