cs.CV - 2023-09-11

Radiomics Boosts Deep Learning Model for IPMN Classification

  • paper_url: http://arxiv.org/abs/2309.05857
  • repo_url: None
  • paper_authors: Lanhong Yao, Zheyuan Zhang, Ugur Demir, Elif Keles, Camila Vendrami, Emil Agarunov, Candice Bolan, Ivo Schoots, Marc Bruno, Rajesh Keswani, Frank Miller, Tamas Gonda, Cemal Yazici, Temel Tirkes, Michael Wallace, Concetto Spampinato, Ulas Bagci
  • for: The paper proposes a new computer-aided diagnosis pipeline to help assess potentially cancerous pancreatic lesions, stratifying IPMN risk from multi-contrast MRI.
  • methods: An efficient volumetric self-adapting segmentation strategy delineates the pancreas, followed by a newly designed deep learning classification scheme combined with a radiomics-based predictive approach in a decision-fusion model (a minimal fusion sketch follows the abstract).
  • results: The approach reaches over 80% accuracy (81.9% on 246 multi-contrast MRI scans from five centers), higher than international guidelines and previously published studies.
    Abstract Intraductal Papillary Mucinous Neoplasm (IPMN) cysts are pre-malignant pancreas lesions, and they can progress into pancreatic cancer. Therefore, detecting and stratifying their risk level is of ultimate importance for effective treatment planning and disease control. However, this is a highly challenging task because of the diverse and irregular shape, texture, and size of the IPMN cysts as well as the pancreas. In this study, we propose a novel computer-aided diagnosis pipeline for IPMN risk classification from multi-contrast MRI scans. Our proposed analysis framework includes an efficient volumetric self-adapting segmentation strategy for pancreas delineation, followed by a newly designed deep learning-based classification scheme with a radiomics-based predictive approach. We test our proposed decision-fusion model in multi-center data sets of 246 multi-contrast MRI scans and obtain superior performance to the state of the art (SOTA) in this field. Our ablation studies demonstrate the significance of both radiomics and deep learning modules for achieving the new SOTA performance compared to international guidelines and published studies (81.9% vs 61.3% in accuracy). Our findings have important implications for clinical decision-making. In a series of rigorous experiments on multi-center data sets (246 MRI scans from five centers), we achieved unprecedented performance (81.9% accuracy).
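    Below is a minimal, hypothetical sketch of decision-level fusion between a radiomics classifier and a deep-learning classifier; the random-forest branch, the stand-in CNN probabilities, and the fusion weight are assumptions, not the paper's exact design.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical late (decision-level) fusion of a radiomics classifier and a
# deep-learning classifier for IPMN risk classification. The fusion weight and
# the radiomics feature set are illustrative assumptions.

def fuse_predictions(radiomics_probs: np.ndarray,
                     dl_probs: np.ndarray,
                     w: float = 0.5) -> np.ndarray:
    """Weighted average of class probabilities from the two branches."""
    return w * radiomics_probs + (1.0 - w) * dl_probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_radiomics = rng.normal(size=(246, 100))   # e.g., shape/texture features per scan
    y = rng.integers(0, 3, size=246)            # low / intermediate / high risk (assumed labels)

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_radiomics, y)
    radiomics_probs = rf.predict_proba(X_radiomics)

    dl_probs = rng.dirichlet(np.ones(3), size=246)      # stand-in for CNN softmax outputs
    fused = fuse_predictions(radiomics_probs, dl_probs, w=0.5)
    print("fused risk prediction of first scan:", fused[0].argmax())
```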

Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.05840
  • repo_url: https://github.com/linhanwang/sccnet
  • paper_authors: Linhan Wang, Shuo Lei, Jianfeng He, Shengkun Wang, Min Zhang, Chang-Tien Lu
  • for: This work addresses few-shot remote sensing image semantic segmentation, i.e., segmenting target objects in a query image from only a few annotated support images of the target class.
  • methods: A Self-Correlation and Cross-Correlation Learning Network (SCCNet) considers both self-correlation and cross-correlation between support and query images to improve the generalization of segmentation predictions; a classical spectral method additionally produces a class-agnostic segmentation mask from the query image's basic visual information (a correlation sketch follows the abstract).
  • results: Extensive experiments on two remote sensing image datasets demonstrate the effectiveness and superiority of the model for few-shot remote sensing semantic segmentation.
    Abstract Remote sensing image semantic segmentation is an important problem for remote sensing image interpretation. Although remarkable progress has been achieved, existing deep neural network methods suffer from the reliance on massive training data. Few-shot remote sensing semantic segmentation aims at learning to segment target objects from a query image using only a few annotated support images of the target class. Most existing few-shot learning methods stem primarily from their sole focus on extracting information from support images, thereby failing to effectively address the large variance in appearance and scales of geographic objects. To tackle these challenges, we propose a Self-Correlation and Cross-Correlation Learning Network for the few-shot remote sensing image semantic segmentation. Our model enhances the generalization by considering both self-correlation and cross-correlation between support and query images to make segmentation predictions. To further explore the self-correlation with the query image, we propose to adopt a classical spectral method to produce a class-agnostic segmentation mask based on the basic visual information of the image. Extensive experiments on two remote sensing image datasets demonstrate the effectiveness and superiority of our model in few-shot remote sensing image semantic segmentation. Code and models will be accessed at https://github.com/linhanwang/SCCNet.
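    A short, illustrative sketch of dense cosine correlation between query and support feature maps (self-correlation being the query correlated with itself); this only conveys the correlation idea and is not the authors' SCCNet implementation.

```python
import torch
import torch.nn.functional as F

# Dense cosine correlation between two feature maps; using the query with
# itself gives self-correlation, using query and support gives cross-correlation.

def dense_correlation(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_*: (B, C, H, W) -> correlation volume of shape (B, H*W, H*W)."""
    fa = F.normalize(feat_a.flatten(2), dim=1)   # (B, C, HW), unit-norm channels
    fb = F.normalize(feat_b.flatten(2), dim=1)
    return torch.bmm(fa.transpose(1, 2), fb)     # pairwise cosine similarities

query = torch.randn(2, 256, 32, 32)
support = torch.randn(2, 256, 32, 32)
cross_corr = dense_correlation(query, support)   # query-support cross-correlation
self_corr = dense_correlation(query, query)      # query self-correlation
print(cross_corr.shape, self_corr.shape)         # (2, 1024, 1024) each
```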

SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition

  • paper_url: http://arxiv.org/abs/2309.05834
  • repo_url: None
  • paper_authors: Cong Wu, Xiao-Jun Wu, Josef Kittler, Tianyang Xu, Sara Atito, Muhammad Awais, Zhenhua Feng
  • for: Improving self-supervised skeleton-based action recognition by disentangling spatial and temporal features within a contrastive learning framework.
  • methods: A novel contrastive framework, the Spatiotemporal Clues Disentanglement Network (SCD-Net), couples a decoupling module with a feature extractor to derive explicit spatial and temporal clues; a constructed global anchor interacts with the extracted clues, and a masking strategy with structural constraints (adapted from masked image modelling) strengthens contextual associations (a contrastive-loss sketch follows the abstract).
  • results: On NTU-RGB+D (60&120) and PKU-MMD (I&II), the method significantly outperforms existing SOTA approaches across downstream tasks including action recognition, action retrieval, transfer learning, and semi-supervised learning.
    Abstract Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly.
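    A hedged sketch of an InfoNCE-style contrast between a global anchor embedding and the disentangled spatial/temporal clue embeddings; it illustrates the anchor-clue interaction in general terms only, not SCD-Net's actual loss.

```python
import torch
import torch.nn.functional as F

# InfoNCE-style contrastive objective: each anchor is pulled toward its own
# clue embedding while other samples in the batch act as negatives.

def info_nce(anchor: torch.Tensor, positives: torch.Tensor, temperature: float = 0.07):
    anchor = F.normalize(anchor, dim=1)
    positives = F.normalize(positives, dim=1)
    logits = anchor @ positives.t() / temperature             # (B, B)
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

B, D = 16, 128
anchor = torch.randn(B, D)          # global anchor embeddings
spatial_clue = torch.randn(B, D)    # spatial clue embeddings
temporal_clue = torch.randn(B, D)   # temporal clue embeddings
loss = info_nce(anchor, spatial_clue) + info_nce(anchor, temporal_clue)
print(loss.item())
```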

Instance-Agnostic Geometry and Contact Dynamics Learning

  • paper_url: http://arxiv.org/abs/2309.05832
  • repo_url: None
  • paper_authors: Mengti Sun, Bowen Jiang, Bibit Bianchini, Camillo Jose Taylor, Michael Posa
  • for: An instance-agnostic learning framework that fuses vision with dynamics to jointly learn an object's shape, pose trajectory, and physical properties from RGBD video, without category-level or instance-level shape priors.
  • methods: Integrates the BundleSDF vision system with the ContactNets dynamics system and proposes a cyclic training pipeline in which the output of the dynamics module refines the poses and geometry from the vision module via perspective reprojection.
  • results: Experiments show the framework can learn the geometry and dynamics of rigid and convex objects and improve upon the current tracking framework.
    Abstract This work presents an instance-agnostic learning framework that fuses vision with dynamics to simultaneously learn shape, pose trajectories and physical properties via the use of geometry as a shared representation. Unlike many contact learning approaches that assume motion capture input and a known shape prior for the collision model, our proposed framework learns an object's geometric and dynamic properties from RGBD video, without requiring either category-level or instance-level shape priors. We integrate a vision system, BundleSDF, with a dynamics system, ContactNets and propose a cyclic training pipeline to use the output from the dynamics module to refine the poses and the geometry from the vision module, using perspective reprojection. Experiments demonstrate our framework's ability to learn the geometry and dynamics of rigid and convex objects and improve upon the current tracking framework.

Mobile Vision Transformer-based Visual Object Tracking

  • paper_url: http://arxiv.org/abs/2309.05829
  • repo_url: https://github.com/goutamyg/mvt
  • paper_authors: Goutam Yelluru Gopal, Maria A. Amer
  • for: A lightweight object tracker that is both accurate and fast, including on large-scale datasets.
  • methods: Uses Mobile Vision Transformers (MobileViT) as the tracking backbone for the first time, and proposes a novel approach of fusing the template and search region representations within the backbone to generate superior feature encoding for target localization (a fusion sketch follows the abstract).
  • results: On the large-scale GOT10k and TrackingNet benchmarks, the MobileViT-based Tracker (MVT) surpasses recent lightweight trackers at high inference speed, and outperforms the popular DiMP-50 despite having 4.7 times fewer model parameters and running 2.8 times faster on a GPU. Code and models are available at https://github.com/goutamyg/MVT.
    Abstract The introduction of robust backbones, such as Vision Transformers, has improved the performance of object tracking algorithms in recent years. However, these state-of-the-art trackers are computationally expensive since they have a large number of model parameters and rely on specialized hardware (e.g., GPU) for faster inference. On the other hand, recent lightweight trackers are fast but are less accurate, especially on large-scale datasets. We propose a lightweight, accurate, and fast tracking algorithm using Mobile Vision Transformers (MobileViT) as the backbone for the first time. We also present a novel approach of fusing the template and search region representations in the MobileViT backbone, thereby generating superior feature encoding for target localization. The experimental results show that our MobileViT-based Tracker, MVT, surpasses the performance of recent lightweight trackers on the large-scale datasets GOT10k and TrackingNet, and with a high inference speed. In addition, our method outperforms the popular DiMP-50 tracker despite having 4.7 times fewer model parameters and running at 2.8 times its speed on a GPU. The tracker code and models are available at https://github.com/goutamyg/MVT
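    A simple, hypothetical illustration of fusing template and search-region feature maps (resize, concatenate, 1x1 convolution); the actual fusion inside the MobileViT backbone differs, so treat this only as a sketch of the general idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy template/search fusion: bring both feature maps to the same spatial size,
# concatenate along channels, and mix with a 1x1 convolution.

class SimpleFusion(nn.Module):
    def __init__(self, channels: int = 128):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
        t = F.interpolate(template_feat, size=search_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        return self.mix(torch.cat([t, search_feat], dim=1))

fusion = SimpleFusion(128)
template = torch.randn(1, 128, 8, 8)     # template (target exemplar) features
search = torch.randn(1, 128, 16, 16)     # search-region features
print(fusion(template, search).shape)    # torch.Size([1, 128, 16, 16])
```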

KD-FixMatch: Knowledge Distillation Siamese Neural Networks

  • paper_url: http://arxiv.org/abs/2309.05826
  • repo_url: None
  • paper_authors: Chien-Chih Wang, Shaoyuan Xu, Jinmiao Fu, Yang Liu, Bryan Wang
  • for: A semi-supervised learning (SSL) algorithm with knowledge distillation that addresses the shortage of labeled data in deep learning.
  • methods: Builds on FixMatch, which trains weight-sharing teacher and student networks simultaneously as a siamese neural network (SNN). KD-FixMatch adds knowledge distillation and combines sequential and simultaneous SNN training: an outer SNN is trained first on labeled and unlabeled data, its pseudo labels are filtered through high-confidence sampling and deep embedding clustering, and an inner SNN is then trained on the labeled data, the unlabeled data, and the trusted pseudo-labeled subset (a selection sketch follows the abstract).
  • results: On four public data sets, KD-FixMatch outperforms FixMatch in all cases; the better training starting point leads to improved model performance.
    Abstract Semi-supervised learning (SSL) has become a crucial approach in deep learning as a way to address the challenge of limited labeled data. The success of deep neural networks heavily relies on the availability of large-scale high-quality labeled data. However, the process of data labeling is time-consuming and unscalable, leading to shortages in labeled data. SSL aims to tackle this problem by leveraging additional unlabeled data in the training process. One of the popular SSL algorithms, FixMatch, trains identical weight-sharing teacher and student networks simultaneously using a siamese neural network (SNN). However, it is prone to performance degradation when the pseudo labels are heavily noisy in the early training stage. We present KD-FixMatch, a novel SSL algorithm that addresses the limitations of FixMatch by incorporating knowledge distillation. The algorithm utilizes a combination of sequential and simultaneous training of SNNs to enhance performance and reduce performance degradation. Firstly, an outer SNN is trained using labeled and unlabeled data. After that, the network of the well-trained outer SNN generates pseudo labels for the unlabeled data, from which a subset of unlabeled data with trusted pseudo labels is then carefully created through high-confidence sampling and deep embedding clustering. Finally, an inner SNN is trained with the labeled data, the unlabeled data, and the subset of unlabeled data with trusted pseudo labels. Experiments on four public data sets demonstrate that KD-FixMatch outperforms FixMatch in all cases. Our results indicate that KD-FixMatch has a better training starting point that leads to improved model performance compared to FixMatch.
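    A minimal sketch of the high-confidence pseudo-label selection step common to FixMatch-style training; the confidence threshold is an assumed value, and the deep-embedding-clustering filter described in the abstract is omitted here.

```python
import torch
import torch.nn.functional as F

# Keep only unlabeled samples whose predicted class probability exceeds a
# threshold; these form the "trusted" pseudo-labeled subset.

def select_trusted_pseudo_labels(logits: torch.Tensor, threshold: float = 0.95):
    probs = F.softmax(logits, dim=1)
    confidence, pseudo_labels = probs.max(dim=1)
    mask = confidence >= threshold
    return pseudo_labels[mask], mask

logits = torch.randn(32, 10)                 # outer-network outputs on unlabeled data
labels, mask = select_trusted_pseudo_labels(logits)
print(f"kept {int(mask.sum())} of {len(mask)} unlabeled samples")
```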

Rice Plant Disease Detection and Diagnosis using Deep Convolutional Neural Networks and Multispectral Imaging

  • paper_url: http://arxiv.org/abs/2309.05818
  • repo_url: None
  • paper_authors: Yara Ali Alnaggar, Ahmad Sebaq, Karim Amer, ElSayed Naeem, Mohamed Elhelw
  • for: Improving the accuracy of rice plant disease detection, particularly for rice blast disease, to limit yield loss and support rice production.
  • methods: Introduces a public multispectral (Red, Green, Near-Infrared) and RGB image dataset and a deep learning pipeline that uses this multi-modal data to detect rice crop diseases in their early stages (an input-stacking sketch follows the abstract).
  • results: Using the multispectral channels together with RGB as input achieves a higher F1 accuracy than using RGB input only.
    Abstract Rice is considered a strategic crop in Egypt as it is regularly consumed in the Egyptian people's diet. Even though Egypt is the highest rice producer in Africa with a share of 6 million tons per year, it still imports rice to satisfy its local needs due to production loss, especially due to rice disease. Rice blast disease is responsible for 30% loss in rice production worldwide. Therefore, it is crucial to target limiting yield damage by detecting rice crop diseases in their early stages. This paper introduces a public multispectral and RGB images dataset and a deep learning pipeline for rice plant disease detection using multi-modal data. The collected multispectral images consist of Red, Green and Near-Infrared channels and we show that using multispectral along with RGB channels as input achieves a higher F1 accuracy compared to using RGB input only.
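    A hedged sketch of feeding RGB plus multispectral channels to a standard CNN by widening its first convolution; the backbone, channel count, and class count are assumptions rather than the paper's exact pipeline.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stack RGB and multispectral (Red, Green, Near-Infrared) planes into a
# 6-channel input and adapt the first convolution of a ResNet-18 accordingly.

rgb = np.random.rand(224, 224, 3).astype(np.float32)
msi = np.random.rand(224, 224, 3).astype(np.float32)    # R, G, NIR planes
x = torch.from_numpy(np.concatenate([rgb, msi], axis=-1)).permute(2, 0, 1).unsqueeze(0)

model = resnet18(weights=None)
model.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 2)            # e.g., healthy vs. blast (assumed)
print(model(x).shape)                                    # torch.Size([1, 2])
```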

SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors

  • paper_url: http://arxiv.org/abs/2309.05810
  • repo_url: None
  • paper_authors: Hongge Chen, Zhao Chen, Gregory P. Meyer, Dennis Park, Carl Vondrick, Ashish Shrivastava, Yuning Chai
  • for: Generating structurally plausible yet challenging 3D shapes to expose vulnerabilities of 3D object detectors in safety-critical applications such as autonomous driving.
  • methods: Represents objects with a signed distance function (SDF) and uses gradient error signals in a differentiable pipeline to smoothly deform an object's shape or pose so that it confuses a downstream 3D detector.
  • results: Objects generated by SHIFT3D physically differ from the baseline object yet retain a semantically recognizable shape, providing interpretable failure modes for 3D detectors and aiding preemptive discovery of potential safety risks in 3D perception systems.
    Abstract We present SHIFT3D, a differentiable pipeline for generating 3D shapes that are structurally plausible yet challenging to 3D object detectors. In safety-critical applications like autonomous driving, discovering such novel challenging objects can offer insight into unknown vulnerabilities of 3D detectors. By representing objects with a signed distanced function (SDF), we show that gradient error signals allow us to smoothly deform the shape or pose of a 3D object in order to confuse a downstream 3D detector. Importantly, the objects generated by SHIFT3D physically differ from the baseline object yet retain a semantically recognizable shape. Our approach provides interpretable failure modes for modern 3D object detectors, and can aid in preemptive discovery of potential safety risks within 3D perception systems before these risks become critical failures.

Divergences in Color Perception between Deep Neural Networks and Humans

  • paper_url: http://arxiv.org/abs/2309.05809
  • repo_url: None
  • paper_authors: Ethan O. Nadler, Elise Darragh-Ford, Bhargav Srinivasa Desikan, Christian Conaway, Mark Chu, Tasker Hull, Douglas Guilbeault
  • for: Investigating whether deep neural networks (DNNs), increasingly proposed as models of human vision, capture fundamental aspects of human vision such as color perception.
  • methods: Novel experiments evaluate the perceptual coherence of color embeddings in DNNs against human color similarity judgments collected via an online survey, with an interpretable, cognitively plausible wavelet-decomposition model of color perception as a baseline (a wavelet sketch follows the abstract).
  • results: State-of-the-art DNN architectures (convolutional networks and vision transformers) produce color similarity judgments that diverge strikingly from human judgments on images with controlled color properties, images from online searches, and real-world CIFAR-10 images, whereas the wavelet-based model provides more coherent color embeddings that better predict human color judgments than all DNNs examined.
    Abstract Deep neural networks (DNNs) are increasingly proposed as models of human vision, bolstered by their impressive performance on image classification and object recognition tasks. Yet, the extent to which DNNs capture fundamental aspects of human vision such as color perception remains unclear. Here, we develop novel experiments for evaluating the perceptual coherence of color embeddings in DNNs, and we assess how well these algorithms predict human color similarity judgments collected via an online survey. We find that state-of-the-art DNN architectures $-$ including convolutional neural networks and vision transformers $-$ provide color similarity judgments that strikingly diverge from human color judgments of (i) images with controlled color properties, (ii) images generated from online searches, and (iii) real-world images from the canonical CIFAR-10 dataset. We compare DNN performance against an interpretable and cognitively plausible model of color perception based on wavelet decomposition, inspired by foundational theories in computational neuroscience. While one deep learning model $-$ a convolutional DNN trained on a style transfer task $-$ captures some aspects of human color perception, our wavelet algorithm provides more coherent color embeddings that better predict human color judgments compared to all DNNs we examine. These results hold when altering the high-level visual task used to train similar DNN architectures (e.g., image classification versus image segmentation), as well as when examining the color embeddings of different layers in a given DNN architecture. These findings break new ground in the effort to analyze the perceptual representations of machine learning algorithms and to improve their ability to serve as cognitively plausible models of human vision. Implications for machine learning, human perception, and embodied cognition are discussed.
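    A hedged sketch (assuming the pywt package) of a simple wavelet-based color representation: each color channel is summarized by the energies of its wavelet sub-bands, and two images are compared by cosine similarity. This conveys the general wavelet-decomposition idea only, not the authors' exact model.

```python
import numpy as np
import pywt

def wavelet_color_embedding(image: np.ndarray, wavelet: str = "haar", level: int = 3) -> np.ndarray:
    """image: (H, W, 3) float array in [0, 1] -> vector of per-channel sub-band energies."""
    features = []
    for c in range(image.shape[2]):
        coeffs = pywt.wavedec2(image[:, :, c], wavelet, level=level)
        features.append(np.mean(coeffs[0] ** 2))               # approximation energy
        for detail in coeffs[1:]:
            features.extend(np.mean(d ** 2) for d in detail)   # horizontal/vertical/diagonal energies
    return np.asarray(features)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

img1 = np.random.rand(64, 64, 3)
img2 = np.random.rand(64, 64, 3)
print(cosine(wavelet_color_embedding(img1), wavelet_color_embedding(img2)))
```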

Blendshapes GHUM: Real-time Monocular Facial Blendshape Prediction

  • paper_url: http://arxiv.org/abs/2309.05782
  • repo_url: None
  • paper_authors: Ivan Grishchenko, Geng Yan, Eduard Gabriel Bazavan, Andrei Zanfir, Nikolai Chinaev, Karthik Raveendran, Matthias Grundmann, Cristian Sminchisescu
  • for: Real-time prediction of 52 facial blendshape coefficients on modern mobile phones from a single monocular RGB image, enabling facial motion capture applications such as virtual avatars.
  • methods: Two main contributions: (i) an annotation-free offline method for obtaining blendshape coefficients from real-world human scans, and (ii) a lightweight real-time model that predicts blendshape coefficients from facial landmarks.
  • results: The on-device ML pipeline runs at 30+ FPS on modern mobile phones, predicting 52 blendshape coefficients from a single monocular RGB image.
    Abstract We present Blendshapes GHUM, an on-device ML pipeline that predicts 52 facial blendshape coefficients at 30+ FPS on modern mobile phones, from a single monocular RGB image and enables facial motion capture applications like virtual avatars. Our main contributions are: i) an annotation-free offline method for obtaining blendshape coefficients from real-world human scans, ii) a lightweight real-time model that predicts blendshape coefficients based on facial landmarks.

LUNet: Deep Learning for the Segmentation of Arterioles and Venules in High Resolution Fundus Images

  • paper_url: http://arxiv.org/abs/2309.05780
  • repo_url: None
  • paper_authors: Jonathan Fhima, Jan Van Eijgen, Hana Kulenovic, Valérie Debeuf, Marie Vangilbergen, Marie-Isaline Billen, Heloïse Brackenier, Moti Freiman, Ingeborg Stalmans, Joachim A. Behar
  • for: The paper aims to develop a deep learning architecture for automated segmentation of retinal arterioles and venules in digital fundus images.
  • methods: The proposed method, called LUNet, uses a double dilated convolutional block and a long tail to enhance the receptive field and the resolution of the segmentation, respectively. The custom loss function emphasizes the continuity of the blood vessels.
  • results: LUNet significantly outperforms two state-of-the-art segmentation algorithms on both the local test set and four external test sets simulating distribution shifts across ethnicity, comorbidities, and annotators.
    Abstract The retina is the only part of the human body in which blood vessels can be accessed non-invasively using imaging techniques such as digital fundus images (DFI). The spatial distribution of the retinal microvasculature may change with cardiovascular diseases and thus the eyes may be regarded as a window to our hearts. Computerized segmentation of the retinal arterioles and venules (A/V) is essential for automated microvasculature analysis. Using active learning, we created a new DFI dataset containing 240 crowd-sourced manual A/V segmentations performed by fifteen medical students and reviewed by an ophthalmologist, and developed LUNet, a novel deep learning architecture for high resolution A/V segmentation. LUNet architecture includes a double dilated convolutional block that aims to enhance the receptive field of the model and reduce its parameter count. Furthermore, LUNet has a long tail that operates at high resolution to refine the segmentation. The custom loss function emphasizes the continuity of the blood vessels. LUNet is shown to significantly outperform two state-of-the-art segmentation algorithms on the local test set as well as on four external test sets simulating distribution shifts across ethnicity, comorbidities, and annotators. We make the newly created dataset open access (upon publication).

TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language

  • paper_url: http://arxiv.org/abs/2309.05756
  • repo_url: None
  • paper_authors: Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós
  • for: Improving visual document understanding, especially in real-world online industrial settings where over-reliance on OCR hurts generalizability.
  • methods: TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion with three novel pretext objectives, unifies language and visual representations to improve generalizability, flexibility, and robustness.
  • results: TransferDoc performs strongly on downstream tasks and outperforms other state-of-the-art approaches in two newly introduced, closer-to-real industrial evaluation scenarios.
    Abstract The field of visual document understanding has witnessed a rapid growth in emerging challenges and powerful multi-modal strategies. However, they rely on an extensive amount of document data to learn their pretext objectives in a ``pre-train-then-fine-tune'' paradigm and thus, suffer a significant performance drop in real-world online industrial settings. One major reason is the over-reliance on OCR engines to extract local positional information within a document page. Therefore, this hinders the model's generalizability, flexibility and robustness due to the lack of capturing global information within a document image. We introduce TransferDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised fashion using three novel pretext objectives. TransferDoc learns richer semantic concepts by unifying language and visual representations, which enables the production of more transferable models. Besides, two novel downstream tasks have been introduced for a ``closer-to-real'' industrial evaluation scenario where TransferDoc outperforms other state-of-the-art approaches.

Evaluating the Reliability of CNN Models on Classifying Traffic and Road Signs using LIME

  • paper_url: http://arxiv.org/abs/2309.05747
  • repo_url: None
  • paper_authors: Md. Atiqur Rahman, Ahmed Saad Tanim, Sanjid Islam, Fahim Pranto, G. M. Shahariar, Md. Tanvir Rouf Shawon
  • for: Evaluating and comparing four state-of-the-art pre-trained models (ResNet-34, VGG-19, DenseNet-121, and Inception V3) on classifying traffic and road signs with the public GTSRB dataset.
  • methods: The models' prediction accuracy and their ability to rely on appropriate image features are assessed on GTSRB, and the local interpretable model-agnostic explanations (LIME) framework is used to inspect the predictions (a usage sketch follows the abstract).
  • results: LIME proves to be a crucial tool for improving the interpretability and dependability of the models, regardless of their classification performance (the models reach an F1 score of 0.99 on traffic and road sign classification).
    Abstract The objective of this investigation is to evaluate and contrast the effectiveness of four state-of-the-art pre-trained models, ResNet-34, VGG-19, DenseNet-121, and Inception V3, in classifying traffic and road signs with the utilization of the GTSRB public dataset. The study focuses on evaluating the accuracy of these models' predictions as well as their ability to employ appropriate features for image categorization. To gain insights into the strengths and limitations of the model's predictions, the study employs the local interpretable model-agnostic explanations (LIME) framework. The findings of this experiment indicate that LIME is a crucial tool for improving the interpretability and dependability of machine learning models for image identification, regardless of the models achieving an f1 score of 0.99 on classifying traffic and road signs. The conclusion of this study has important ramifications for how these models are used in practice, as it is crucial to ensure that model predictions are founded on the pertinent image features.
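    A hedged usage sketch of the LIME image explainer (assuming the lime package); the classifier below is a random stand-in, not one of the four evaluated models, and the sample counts are illustrative.

```python
import numpy as np
from lime import lime_image

def classifier_fn(images: np.ndarray) -> np.ndarray:
    """images: (N, H, W, 3) -> class probabilities (N, 43); GTSRB has 43 sign classes."""
    rng = np.random.default_rng(0)
    probs = rng.random((len(images), 43))          # stand-in for a real model's predictions
    return probs / probs.sum(axis=1, keepdims=True)

image = np.random.rand(64, 64, 3)                  # stand-in traffic-sign image
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image, classifier_fn,
                                         top_labels=1, num_samples=200)
top_label = explanation.top_labels[0]
_, mask = explanation.get_image_and_mask(top_label, positive_only=True,
                                         num_features=5, hide_rest=False)
print("superpixel pixels attributed to the top class:", int(mask.sum()))
```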

Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips

  • paper_url: http://arxiv.org/abs/2309.05663
  • repo_url: https://github.com/JudyYe/diffhoi
  • paper_authors: Yufei Ye, Poorvi Hebbar, Abhinav Gupta, Shubham Tulsiani
  • for: Reconstructing hand-object interactions from short video clips.
  • methods: Casts 3D inference as a per-video optimization that recovers a neural 3D representation of the object shape together with time-varying object motion and hand articulation; a diffusion network modeling object renderings conditioned on hand configuration and category label serves as a data-driven prior to guide novel-view renderings of the reconstructed scene.
  • results: On egocentric videos spanning 6 object categories the approach significantly improves over prior single-view and multi-view methods, and it can reconstruct arbitrary clips from YouTube showing both first-person and third-person interactions.
    Abstract We tackle the task of reconstructing hand-object interactions from short video clips. Given an input video, our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape, as well as the time-varying motion and hand articulation. While the input video naturally provides some multi-view cues to guide 3D inference, these are insufficient on their own due to occlusions and limited viewpoint variations. To obtain accurate 3D, we augment the multi-view signals with generic data-driven priors to guide reconstruction. Specifically, we learn a diffusion network to model the conditional distribution of (geometric) renderings of objects conditioned on hand configuration and category label, and leverage it as a prior to guide the novel-view renderings of the reconstructed scene. We empirically evaluate our approach on egocentric videos across 6 object categories, and observe significant improvements over prior single-view and multi-view methods. Finally, we demonstrate our system's ability to reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person interactions.

ViHOPE: Visuotactile In-Hand Object 6D Pose Estimation with Shape Completion

  • paper_url: http://arxiv.org/abs/2309.05662
  • repo_url: None
  • paper_authors: Hongyu Li, Snehal Dikhale, Soshi Iba, Nawid Jamali
  • for: A novel framework, ViHOPE, for estimating the 6D pose of an in-hand object from visuotactile perception.
  • methods: A visuotactile shape-completion module based on a conditional generative adversarial network completes the in-hand object's shape from a volumetric representation, and shape completion is jointly optimized with 6D pose estimation.
  • results: Compared with regressing visuotactile observations directly to a 6D pose, the approach improves pose accuracy. On visuotactile shape completion it outperforms the state of the art by 265% on the Intersection over Union metric and achieves 88% lower Chamfer Distance; on visuotactile pose estimation it reduces position and angular errors by 35% and 64%, respectively. The models are also robust to sim-to-real transfer on a real-world robot platform.
    Abstract In this letter, we introduce ViHOPE, a novel framework for estimating the 6D pose of an in-hand object using visuotactile perception. Our key insight is that the accuracy of the 6D object pose estimate can be improved by explicitly completing the shape of the object. To this end, we introduce a novel visuotactile shape completion module that uses a conditional Generative Adversarial Network to complete the shape of an in-hand object based on volumetric representation. This approach improves over prior works that directly regress visuotactile observations to a 6D pose. By explicitly completing the shape of the in-hand object and jointly optimizing the shape completion and pose estimation tasks, we improve the accuracy of the 6D object pose estimate. We train and test our model on a synthetic dataset and compare it with the state-of-the-art. In the visuotactile shape completion task, we outperform the state-of-the-art by 265% using the Intersection of Union metric and achieve 88% lower Chamfer Distance. In the visuotactile pose estimation task, we present results that suggest our framework reduces position and angular errors by 35% and 64%, respectively. Furthermore, we ablate our framework to confirm the gain on the 6D object pose estimate from explicitly completing the shape. Ultimately, we show that our framework produces models that are robust to sim-to-real transfer on a real-world robot platform.

An Effective Two-stage Training Paradigm Detector for Small Dataset

  • paper_url: http://arxiv.org/abs/2309.05652
  • repo_url: None
  • paper_authors: Zheng Wang, Dong Xie, Hanzhi Wang, Jiang Tian
  • for: Tackling object detection when only a limited amount of labeled data is available, for the object detection track of the VIPriors Challenge 2023.
  • methods: A two-stage training paradigm (TP-YOLOv8): the YOLOv8 backbone is first pre-trained as an encoder with masked image modeling, then the detector is fine-tuned with elaborate augmentations. At test time, test-time augmentation (TTA) and weighted box fusion (WBF) further boost performance (a WBF sketch follows the abstract).
  • results: The approach achieves 30.4% average precision (AP 0.50:0.95) on the DelftBikes test set, ranking 4th on the leaderboard.
    Abstract Learning a pre-trained model from only a limited amount of labeled data has always been viewed as a challenging task. In this report, an effective and robust solution, the two-stage training paradigm YOLOv8 detector (TP-YOLOv8), is designed for the object detection track in VIPriors Challenge 2023. First, the backbone of YOLOv8 is pre-trained as the encoder using the masked image modeling technique. Then the detector is fine-tuned with elaborate augmentations. During the test stage, test-time augmentation (TTA) is used to enhance each model, and weighted box fusion (WBF) is implemented to further boost the performance. With the well-designed structure, our approach has achieved 30.4% average precision from 0.50 to 0.95 on the DelftBikes test set, ranking 4th on the leaderboard.
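    A short sketch of the test-time weighted box fusion step (assuming the ensemble-boxes package); the box coordinates, thresholds, and weights below are illustrative values, not the report's settings.

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion

# Fuse detections from several test-time-augmented passes. Boxes are in
# normalized [x1, y1, x2, y2] format.

boxes_list = [
    [[0.10, 0.10, 0.50, 0.50], [0.60, 0.60, 0.90, 0.90]],   # pass 1 (original image)
    [[0.12, 0.11, 0.51, 0.49], [0.61, 0.59, 0.91, 0.89]],   # pass 2 (augmented, mapped back)
]
scores_list = [[0.90, 0.70], [0.85, 0.65]]
labels_list = [[0, 1], [0, 1]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1.0, 1.0], iou_thr=0.55, skip_box_thr=0.05)
print(np.round(boxes, 3), scores, labels)
```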

CitDet: A Benchmark Dataset for Citrus Fruit Detection

  • paper_url: http://arxiv.org/abs/2309.05645
  • repo_url: None
  • paper_authors: Jordan A. James, Heather K. Manching, Matthew R. Mattia, Kim D. Bowman, Amanda M. Hulse-Kemp, William J. Beksi
  • for: Advancing citrus fruit detection and accurate yield estimation for trees affected by Huanglongbing (HLB) disease in orchard environments.
  • methods: Introduces CitDet, a benchmark dataset of high-resolution images of citrus trees in an area highly affected by HLB, with high-quality bounding box annotations of fruit on both the trees and the ground, and enhances state-of-the-art object detection methods for typical orchard settings.
  • results: The dataset contains over 32,000 bounding box annotations across 579 high-resolution images, with baseline benchmarks on multiple contemporary detectors; fruit location on tree or ground can be captured accurately, and the detections correlate with yield estimations and offer a potential measure of HLB impact via fruit drop.
    Abstract In this letter, we present a new dataset to advance the state of the art in detecting citrus fruit and accurately estimate yield on trees affected by the Huanglongbing (HLB) disease in orchard environments via imaging. Despite the fact that significant progress has been made in solving the fruit detection problem, the lack of publicly available datasets has complicated direct comparison of results. For instance, citrus detection has long been of interest in the agricultural research community, yet there is an absence of work, particularly involving public datasets of citrus affected by HLB. To address this issue, we enhance state-of-the-art object detection methods for use in typical orchard settings. Concretely, we provide high-resolution images of citrus trees located in an area known to be highly affected by HLB, along with high-quality bounding box annotations of citrus fruit. Fruit on both the trees and the ground are labeled to allow for identification of fruit location, which contributes to advancements in yield estimation and potential measure of HLB impact via fruit drop. The dataset consists of over 32,000 bounding box annotations for fruit instances contained in 579 high-resolution images. In summary, our contributions are the following: (i) we introduce a novel dataset along with baseline performance benchmarks on multiple contemporary object detection algorithms, (ii) we show the ability to accurately capture fruit location on tree or on ground, and finally (ii) we present a correlation of our results with yield estimations.

Learning the Geodesic Embedding with Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2309.05613
  • repo_url: None
  • paper_authors: Bo Pang, Zhongtian Zheng, Guoping Wang, Peng-Shuai Wang
  • for: Computing the approximate geodesic distance between two arbitrary points on discrete polyhedral surfaces in constant time after a fast precomputation.
  • methods: A graph neural network embeds the input mesh into a high-dimensional embedding space; the geodesic distance between a pair of points is then computed from their embedding vectors with a lightweight decoding function (a decoding sketch follows the abstract). Novel graph convolution and graph pooling modules incorporate local geodesic information and are shown to be much more effective than previous designs.
  • results: On ShapeNet the method is orders of magnitude faster than existing methods with comparable or better accuracy, remains robust on noisy and incomplete meshes, and generalizes well to out-of-distribution meshes.
    Abstract We present GeGnn, a learning-based method for computing the approximate geodesic distance between two arbitrary points on discrete polyhedra surfaces with constant time complexity after fast precomputation. Previous relevant methods either focus on computing the geodesic distance between a single source and all destinations, which has linear complexity at least or require a long precomputation time. Our key idea is to train a graph neural network to embed an input mesh into a high-dimensional embedding space and compute the geodesic distance between a pair of points using the corresponding embedding vectors and a lightweight decoding function. To facilitate the learning of the embedding, we propose novel graph convolution and graph pooling modules that incorporate local geodesic information and are verified to be much more effective than previous designs. After training, our method requires only one forward pass of the network per mesh as precomputation. Then, we can compute the geodesic distance between a pair of points using our decoding function, which requires only several matrix multiplications and can be massively parallelized on GPUs. We verify the efficiency and effectiveness of our method on ShapeNet and demonstrate that our method is faster than existing methods by orders of magnitude while achieving comparable or better accuracy. Additionally, our method exhibits robustness on noisy and incomplete meshes and strong generalization ability on out-of-distribution meshes. The code and pretrained model can be found on https://github.com/IntelligentGeometry/GeGnn.
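    A hedged sketch of the decoding idea: vertices are embedded once per mesh (the precomputation), and the geodesic distance for any vertex pair is then estimated from the two embedding vectors with a lightweight decoder. The decoder form below is an assumption, not GeGnn's actual decoding function.

```python
import torch
import torch.nn as nn

class PairDecoder(nn.Module):
    """Estimate a scalar distance from a symmetric combination of two embeddings."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        return self.mlp((emb_a - emb_b).abs()).squeeze(-1)

num_vertices, dim = 10_000, 64
embeddings = torch.randn(num_vertices, dim)        # one network forward pass per mesh
decoder = PairDecoder(dim)
i, j = torch.randint(0, num_vertices, (2, 1024))   # 1024 query vertex pairs
print(decoder(embeddings[i], embeddings[j]).shape) # torch.Size([1024])
```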

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

  • paper_url: http://arxiv.org/abs/2309.05573
  • repo_url: https://github.com/pjlab-adg/pcseg
  • paper_authors: Youquan Liu, Runnan Chen, Xin Li, Lingdong Kong, Yuchen Yang, Zhaoyang Xia, Yeqi Bai, Xinge Zhu, Yuexin Ma, Yikang Li, Yu Qiao, Yuenan Hou
  • for: A unified multi-modal LiDAR segmentation network (UniSeg) that leverages RGB images and three point cloud views (point, voxel, and range) to perform semantic segmentation and panoptic segmentation simultaneously.
  • methods: A Learnable cross-Modal Association (LMA) module automatically fuses voxel-view and range-view features with image features, exploiting the rich semantics of images while remaining robust to calibration errors; the enhanced features are then transformed to the point space, where a Learnable cross-View Association (LVA) module adaptively fuses the three point cloud views.
  • results: UniSeg achieves strong results on three public benchmarks (SemanticKITTI, nuScenes, and the Waymo Open Dataset), ranking 1st on the nuScenes LiDAR semantic segmentation challenge and the SemanticKITTI panoptic segmentation challenge. The authors also release OpenPCSeg, the largest and most comprehensive outdoor LiDAR segmentation codebase, with reproducible implementations of most popular algorithms.
    Abstract Point-, voxel-, and range-views are three representative forms of point clouds. All of them have accurate 3D measurements but lack color and texture information. RGB images are a natural complement to these point cloud views and fully utilizing the comprehensive information of them benefits more robust perceptions. In this paper, we present a unified multi-modal LiDAR segmentation network, termed UniSeg, which leverages the information of RGB images and three views of the point cloud, and accomplishes semantic segmentation and panoptic segmentation simultaneously. Specifically, we first design the Learnable cross-Modal Association (LMA) module to automatically fuse voxel-view and range-view features with image features, which fully utilize the rich semantic information of images and are robust to calibration errors. Then, the enhanced voxel-view and range-view features are transformed to the point space,where three views of point cloud features are further fused adaptively by the Learnable cross-View Association module (LVA). Notably, UniSeg achieves promising results in three public benchmarks, i.e., SemanticKITTI, nuScenes, and Waymo Open Dataset (WOD); it ranks 1st on two challenges of two benchmarks, including the LiDAR semantic segmentation challenge of nuScenes and panoptic segmentation challenges of SemanticKITTI. Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase. It contains most of the popular outdoor LiDAR segmentation algorithms and provides reproducible implementations. The OpenPCSeg codebase will be made publicly available at https://github.com/PJLab-ADG/PCSeg.

OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

  • paper_url: http://arxiv.org/abs/2309.05551
  • repo_url: https://github.com/aimagelab/open-fashion-clip
  • paper_authors: Giuseppe Cartella, Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara
  • for: A vision-and-language contrastive learning method, OpenFashionCLIP, that provides scalable and robust machine learning solutions for tasks such as automatic tagging classification and multimodal retrieval in fashion e-commerce.
  • methods: Training relies exclusively on open-source fashion data drawn from diverse domains with varying degrees of specificity, using vision-and-language contrastive learning.
  • results: Experiments across several tasks and benchmarks show significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods in both accuracy and recall. Source code and trained models are available at https://github.com/aimagelab/open-fashion-clip.
    Abstract The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains, and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.

Distance-Aware eXplanation Based Learning

  • paper_url: http://arxiv.org/abs/2309.05548
  • repo_url: https://github.com/msgun/xbl-d
  • paper_authors: Misgina Tsighe Hagos, Niamh Belton, Kathleen M. Curran, Brian Mac Namee
  • for: Training deep learning models with eXplanation Based Learning (XBL), an interactive approach that penalizes a model when its explanations deviate from user annotations of image features.
  • methods: A distance-aware explanation loss adds a distance-based penalty to the categorical loss so that the model learns to focus on the important regions annotated in the training dataset (a loss sketch follows the abstract). Distance is better suited than simple intersection because visual explanations such as Grad-CAMs are not strictly bounded by the annotations.
  • results: The method is demonstrated on three image classification tasks, and a new interpretability metric for visual feature-attribution based model explanations is proposed that is more informative of model performance than existing metrics.
    Abstract eXplanation Based Learning (XBL) is an interactive learning approach that provides a transparent method of training deep learning models by interacting with their explanations. XBL augments loss functions to penalize a model based on deviation of its explanations from user annotation of image features. The literature on XBL mostly depends on the intersection of visual model explanations and image feature annotations. We present a method to add a distance-aware explanation loss to categorical losses that trains a learner to focus on important regions of a training dataset. Distance is an appropriate approach for calculating explanation loss since visual model explanations such as Gradient-weighted Class Activation Mapping (Grad-CAMs) are not strictly bounded as annotations and their intersections may not provide complete information on the deviation of a model's focus from relevant image regions. In addition to assessing our model using existing metrics, we propose an interpretability metric for evaluating visual feature-attribution based model explanations that is more informative of the model's performance than existing metrics. We demonstrate performance of our proposed method on three image classification tasks.
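    A hedged sketch of a distance-aware explanation penalty: attention (e.g., a Grad-CAM map) that falls far from the user-annotated region is penalized in proportion to its distance from that region, and the penalty is added to the categorical loss. The weighting factor and normalization are assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def distance_aware_loss(logits, targets, cam, annotation_mask, lam: float = 0.1):
    """cam and annotation_mask are (H, W); annotation_mask is 1 on relevant pixels."""
    # Distance of every pixel to the nearest annotated (relevant) pixel.
    dist = distance_transform_edt(1 - annotation_mask)
    dist = torch.from_numpy(dist).float()
    dist = dist / (dist.max() + 1e-12)                 # normalize to [0, 1]
    explanation_penalty = (cam * dist).mean()
    return F.cross_entropy(logits, targets) + lam * explanation_penalty

logits = torch.randn(1, 10)
targets = torch.tensor([3])
cam = torch.rand(32, 32)                               # stand-in Grad-CAM map
annotation = np.zeros((32, 32), dtype=np.uint8)
annotation[8:24, 8:24] = 1                             # user-marked relevant region
print(distance_aware_loss(logits, targets, cam, annotation).item())
```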

On the detection of Out-Of-Distribution samples in Multiple Instance Learning

  • paper_url: http://arxiv.org/abs/2309.05528
  • repo_url: https://github.com/loic-lb/ood_mil
  • paper_authors: Loïc Le Bescond, Maria Vakalopoulou, Stergios Christodoulidis, Fabrice André, Hugues Talbot
  • for: Addressing out-of-distribution (OOD) detection in the weakly supervised Multiple Instance Learning (MIL) framework, which remains under-explored.
  • methods: Post-hoc OOD detection methods are adapted to the MIL setting (a scoring sketch follows the abstract), and a novel benchmark is introduced to assess OOD detection performance in weakly supervised scenarios, with extensive experiments on diverse public datasets.
  • results: No single method shows a clear advantage; DICE performs best overall but exhibits significant shortcomings on some datasets, underlining how complex and challenging OOD detection under the MIL framework remains.
    Abstract The deployment of machine learning solutions in real-world scenarios often involves addressing the challenge of out-of-distribution (OOD) detection. While significant efforts have been devoted to OOD detection in classical supervised settings, the context of weakly supervised learning, particularly the Multiple Instance Learning (MIL) framework, remains under-explored. In this study, we tackle this challenge by adapting post-hoc OOD detection methods to the MIL setting while introducing a novel benchmark specifically designed to assess OOD detection performance in weakly supervised scenarios. Extensive experiments based on diverse public datasets do not reveal a single method with a clear advantage over the others. Although DICE emerges as the best-performing method overall, it exhibits significant shortcomings on some datasets, emphasizing the complexity of this under-explored and challenging topic. Our findings shed light on the complex nature of OOD detection under the MIL framework, emphasizing the importance of developing novel, robust, and reliable methods that can generalize effectively in a weakly supervised context. The code for the paper is available here: https://github.com/loic-lb/OOD_MIL.
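    A minimal sketch of two common post-hoc OOD scores (maximum softmax probability and the energy score) applied at the bag level, with bag logits obtained by simple mean pooling of instance predictions; the pooling choice is an assumption and the sketch does not reproduce the paper's benchmark protocol.

```python
import torch
import torch.nn.functional as F

def msp_score(logits: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability; higher means more in-distribution."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score; higher also indicates in-distribution data."""
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

instance_logits = torch.randn(50, 2)        # 50 instances in one bag, 2 classes
bag_logits = instance_logits.mean(dim=0)    # simple mean pooling (assumed)
print(float(msp_score(bag_logits)), float(energy_score(bag_logits)))
```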

ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

  • paper_url: http://arxiv.org/abs/2309.05527
  • repo_url: https://github.com/pjlab-adg/3dtrans
  • paper_authors: Bo Zhang, Xinyu Cai, Jiakang Yuan, Donglin Yang, Jianfei Guo, Renqiu Xia, Botian Shi, Min Dou, Tao Chen, Si Liu, Junchi Yan, Yu Qiao
  • for: Alleviating domain shifts (e.g., sensor type changes and geographical variations) so that autonomous driving perception models can be deployed to new domains without extra data collection and annotation costs.
  • methods: A Reconstruction-Simulation-Perception (ReSimAD) scheme: knowledge from the old domain is implicitly reconstructed into domain-invariant representations (3D scene-level meshes), and point clouds of multiple new domains are then simulated conditioned on these meshes to obtain target-domain-like samples for the subsequent perception module.
  • results: Across cross-domain settings such as Waymo-to-KITTI, Waymo-to-nuScenes, and Waymo-to-ONCE, zero-shot target-domain perception experiments show that ReSimAD boosts domain generalization and is even promising for 3D pre-training.
    Abstract Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), which poses a challenge since AD model relying on the previous-domain knowledge can be hardly directly deployed to a new domain without additional costs. In this paper, we provide a new perspective and approach of alleviating the domain shifts, by proposing a Reconstruction-Simulation-Perception (ReSimAD) scheme. Specifically, the implicit reconstruction process is based on the knowledge from the previous old domain, aiming to convert the domain-related knowledge into domain-invariant representations, \textit{e.g.}, 3D scene-level meshes. Besides, the point clouds simulation process of multiple new domains is conditioned on the above reconstructed 3D meshes, where the target-domain-like simulation samples can be obtained, thus reducing the cost of collecting and annotating new-domain data for the subsequent perception process. For experiments, we consider different cross-domain situations such as Waymo-to-KITTI, Waymo-to-nuScenes, Waymo-to-ONCE, \textit{etc}, to verify the \textbf{zero-shot} target-domain perception using ReSimAD. Results demonstrate that our method is beneficial to boost the domain generalization ability, even promising for 3D pre-training.

Stream-based Active Learning by Exploiting Temporal Properties in Perception with Temporal Predicted Loss

  • paper_url: http://arxiv.org/abs/2309.05517
  • repo_url: None
  • paper_authors: Sebastian Schmidt, Stephan Günnemann
  • for: Stream-based active learning (AL), which reduces the amount of labeled data needed to train perception models by filtering data directly on mobile devices and robots before it reaches the datacenter.
  • methods: A novel temporal predicted loss (TPL) method that exploits the temporal properties of perception-sensor image streams to filter which frames to label; the GTA V streets and A2D2 streets datasets are introduced (and publicly released) to evaluate the stream-based setting properly.
  • results: TPL significantly improves the diversity of the selection while remaining an uncertainty-based method, and it outperforms state-of-the-art pool-based and stream-based approaches across different models, requiring 2.5 pp less data while being significantly faster than pool-based methods.
    Abstract Active learning (AL) reduces the amount of labeled data needed to train a machine learning model by intelligently choosing which instances to label. Classic pool-based AL requires all data to be present in a datacenter, which can be challenging with the increasing amounts of data needed in deep learning. However, AL on mobile devices and robots, like autonomous cars, can filter the data from perception sensor streams before reaching the datacenter. We exploited the temporal properties for such image streams in our work and proposed the novel temporal predicted loss (TPL) method. To evaluate the stream-based setting properly, we introduced the GTA V streets and the A2D2 streets dataset and made both publicly available. Our experiments showed that our approach significantly improves the diversity of the selection while being an uncertainty-based method. As pool-based approaches are more common in perception applications, we derived a concept for comparing pool-based and stream-based AL, where TPL outperformed state-of-the-art pool- or stream-based approaches for different models. TPL demonstrated a gain of 2.5 percentage points (pp) less required data while being significantly faster than pool-based methods.

Zero-Shot Co-salient Object Detection Framework

  • paper_url: http://arxiv.org/abs/2309.05499
  • repo_url: None
  • paper_authors: Haoke Xiao, Lv Tang, Bo Li, Zhiming Luo, Shaozi Li
  • for: This work aims to replicate the human visual system's ability to recognize common and salient objects within a collection of images.
  • methods: A zero-shot CoSOD framework built on foundational computer vision models is proposed that requires no training process. Two novel components are introduced: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module.
  • results: Evaluated on widely used datasets, the method achieves impressive results, surpassing existing unsupervised methods, outperforming fully supervised methods developed before 2020, and remaining competitive with some fully supervised methods developed before 2022.
    Abstract Co-salient Object Detection (CoSOD) endeavors to replicate the human visual system's capacity to recognize common and salient objects within a collection of images. Despite recent advancements in deep learning models, these models still rely on training with well-annotated CoSOD datasets. The exploration of training-free zero-shot CoSOD frameworks has been limited. In this paper, taking inspiration from the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework that harnesses these models without any training process. To achieve this, we introduce two novel components in our proposed framework: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework's performance on widely-used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.

COMPASS: High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability

  • paper_url: http://arxiv.org/abs/2309.07926
  • repo_url: https://github.com/ImJongminPark/COMPASS
  • paper_authors: Jongmin Park, Jooyoung Lee, Munchurl Kim
  • for: This paper proposes a novel neural network (NN)-based spatially scalable image compression method called COMPASS, which supports arbitrary-scale spatial scalability and has a flexible structure that allows for arbitrary determination of the number of layers and their respective scale factors during inference.
  • methods: The proposed COMPASS method uses an inter-layer arbitrary scale prediction method called LIFF based on implicit neural representation to reduce spatial redundancy between adjacent layers for arbitrary scale factors. The method also uses a combined RD loss function to effectively train multiple layers.
  • results: The experimental results show that COMPASS achieves BD-rate gain of -58.33% and -47.17% at maximum compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Additionally, COMPASS shows comparable or even better coding efficiency than the single-layer coding for various scale factors.
    Abstract Recently, neural network (NN)-based image compression studies have actively been made and has shown impressive performance in comparison to traditional methods. However, most of the works have focused on non-scalable image compression (single-layer coding) while spatially scalable image compression has drawn less attention although it has many applications. In this paper, we propose a novel NN-based spatially scalable image compression method, called COMPASS, which supports arbitrary-scale spatial scalability. Our proposed COMPASS has a very flexible structure where the number of layers and their respective scale factors can be arbitrarily determined during inference. To reduce the spatial redundancy between adjacent layers for arbitrary scale factors, our COMPASS adopts an inter-layer arbitrary scale prediction method, called LIFF, based on implicit neural representation. We propose a combined RD loss function to effectively train multiple layers. Experimental results show that our COMPASS achieves BD-rate gain of -58.33% and -47.17% at maximum compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Our COMPASS also shows comparable or even better coding efficiency than the single-layer coding for various scale factors.
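To make the combined rate-distortion objective concrete, the following schematic shows a multi-layer RD loss under assumed names (`rates`, `distortions`, `lambdas` are hypothetical); it is a sketch of the general idea, not COMPASS's training code.

```python
# Schematic multi-layer rate-distortion loss: each spatial layer i contributes
# its estimated bitrate plus a weighted reconstruction distortion.
import torch


def combined_rd_loss(rates, distortions, lambdas):
    """rates[i]: estimated bits for layer i, distortions[i]: e.g. MSE of layer i,
    lambdas[i]: trade-off weight for layer i. All are lists of scalar tensors."""
    total = torch.zeros(())
    for r, d, lam in zip(rates, distortions, lambdas):
        total = total + r + lam * d
    return total


# Usage with dummy values for a 3-layer configuration.
rates = [torch.tensor(0.8), torch.tensor(0.5), torch.tensor(0.3)]
dists = [torch.tensor(0.02), torch.tensor(0.04), torch.tensor(0.09)]
loss = combined_rd_loss(rates, dists, lambdas=[0.01, 0.01, 0.01])
print(loss)
```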

Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval

  • paper_url: http://arxiv.org/abs/2309.05451
  • repo_url: None
  • paper_authors: Yabing Wang, Shuhui Wang, Hao Luo, Jianfeng Dong, Fan Wang, Meng Han, Xun Wang, Meng Wang
  • for: This work aims to break the limitation of scarce non-English labeled data and improve the multilingual availability of cross-modal retrieval.
  • methods: A Dual-view Curricular Optimal Transport (DCOT) method is proposed, which uses optimal transport theory to quantify the correlation of sample pairs from both the cross-lingual and cross-modal views, and dynamically models the transportation costs with dual-view curriculum learning.
  • results: Extensive experiments on two multilingual image-text datasets and one video-text dataset demonstrate the effectiveness and robustness of the proposed method, as well as good extensibility to cross-lingual image-text baselines and decent generalization to out-of-domain data.
    Abstract Current research on cross-modal retrieval is mostly English-oriented, as the availability of a large number of English-oriented human-labeled vision-language corpora. In order to break the limit of non-English labeled data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the translated sentences from MT are generally imperfect in describing the corresponding visual contents. Improperly assuming the pseudo-parallel data are correctly correlated will make the networks overfit to the noisy correspondence. Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. Besides, our proposed method also shows a good expansibility to cross-lingual image-text baselines and a decent generalization on out-of-domain data.

A Localization-to-Segmentation Framework for Automatic Tumor Segmentation in Whole-Body PET/CT Images

  • paper_url: http://arxiv.org/abs/2309.05446
  • repo_url: https://github.com/medcai/l2snet
  • paper_authors: Linghan Cai, Jianhao Huang, Zihang Zhu, Jinpeng Lu, Yongbing Zhang
  • for: This paper proposes an automatic tumor segmentation method for fluorodeoxyglucose (FDG) PET/CT imaging used to detect cancers such as lung cancer and melanoma, reducing doctors' workload and thereby improving diagnostic quality.
  • methods: A localization-to-segmentation framework (L2SNet) is proposed for precise tumor segmentation. L2SNet first locates possible lesions in the lesion localization phase and then uses the location cues to shape the results in the lesion segmentation phase. An adaptive threshold scheme that considers the segmentation results of both phases further improves performance.
  • results: Experiments on the MICCAI 2023 Automated Lesion Segmentation in Whole-Body FDG-PET/CT challenge dataset show that the method achieves a competitive result, ranking among the top 7 methods on the preliminary test set.
    Abstract Fluorodeoxyglucose (FDG) positron emission tomography (PET) combined with computed tomography (CT) is considered the primary solution for detecting some cancers, such as lung cancer and melanoma. Automatic segmentation of tumors in PET/CT images can help reduce doctors' workload, thereby improving diagnostic quality. However, precise tumor segmentation is challenging due to the small size of many tumors and the similarity of high-uptake normal areas to the tumor regions. To address these issues, this paper proposes a localization-to-segmentation framework (L2SNet) for precise tumor segmentation. L2SNet first localizes the possible lesions in the lesion localization phase and then uses the location cues to shape the segmentation results in the lesion segmentation phase. To further improve the segmentation performance of L2SNet, we design an adaptive threshold scheme that takes the segmentation results of the two phases into consideration. The experiments with the MICCAI 2023 Automated Lesion Segmentation in Whole-Body FDG-PET/CT challenge dataset show that our method achieved a competitive result and was ranked in the top 7 methods on the preliminary test set. Our work is available at: https://github.com/MedCAI/L2SNet.
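A minimal sketch of the adaptive-threshold idea, assuming the localization stage simply relaxes the binarization threshold inside its candidate regions; the function and parameter names are hypothetical and the actual L2SNet scheme may differ.

```python
# Hypothetical adaptive thresholding that fuses a coarse localization map
# with a fine segmentation probability map.
import numpy as np


def adaptive_threshold(seg_prob, loc_prob, t_low=0.3, t_high=0.6, loc_cut=0.5):
    """seg_prob: HxW lesion probabilities from the segmentation stage.
    loc_prob:  HxW probabilities from the localization stage.
    Pixels inside localized regions use the permissive threshold t_low,
    all other pixels use the stricter threshold t_high."""
    inside = loc_prob >= loc_cut
    thresh = np.where(inside, t_low, t_high)
    return (seg_prob >= thresh).astype(np.uint8)


seg = np.random.rand(64, 64)
loc = np.random.rand(64, 64)
mask = adaptive_threshold(seg, loc)
print(mask.sum(), "foreground pixels")
```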

Towards Content-based Pixel Retrieval in Revisited Oxford and Paris

  • paper_url: http://arxiv.org/abs/2309.05438
  • repo_url: https://github.com/anguoyuan/pixel_retrieval-segmented_instance_retrieval
  • paper_authors: Guoyuan An, Woo Jae Kim, Saelyne Yang, Rong Li, Yuchi Huo, Sung-Eui Yoon
  • for: This work introduces the first pixel retrieval benchmarks for segmented instance retrieval.
  • methods: Two pixel retrieval benchmarks, PROxford and PRParis, are built on the widely used ROxford and RParis image retrieval datasets; three professional annotators label 5,942 images with two rounds of double-checking and refinement.
  • results: Results show that the pixel retrieval task is challenging for existing methods and distinct from existing problems, suggesting that further research can advance content-based pixel retrieval and thus the user search experience.
    Abstract This paper introduces the first two pixel retrieval benchmarks. Pixel retrieval is segmented instance retrieval. Like semantic segmentation extends classification to the pixel level, pixel retrieval is an extension of image retrieval and offers information about which pixels are related to the query object. In addition to retrieving images for the given query, it helps users quickly identify the query object in true positive images and exclude false positive images by denoting the correlated pixels. Our user study results show pixel-level annotation can significantly improve the user experience. Compared with semantic and instance segmentation, pixel retrieval requires a fine-grained recognition capability for variable-granularity targets. To this end, we propose pixel retrieval benchmarks named PROxford and PRParis, which are based on the widely used image retrieval datasets, ROxford and RParis. Three professional annotators label 5,942 images with two rounds of double-checking and refinement. Furthermore, we conduct extensive experiments and analysis on the SOTA methods in image search, image matching, detection, segmentation, and dense matching using our pixel retrieval benchmarks. Results show that the pixel retrieval task is challenging to these approaches and distinctive from existing problems, suggesting that further research can advance the content-based pixel-retrieval and thus user search experience. The datasets can be downloaded from \href{https://github.com/anguoyuan/Pixel_retrieval-Segmented_instance_retrieval}{this link}.

FlowIBR: Leveraging Pre-Training for Efficient Neural Image-Based Rendering of Dynamic Scenes

  • paper_url: http://arxiv.org/abs/2309.05418
  • repo_url: None
  • paper_authors: Marcel Büsching, Josef Bengtson, David Nilsson, Mårten Björkman
  • for: This paper targets monocular novel view synthesis of dynamic scenes.
  • methods: A neural image-based rendering method, pre-trained on a large corpus of widely available static scenes, is combined with a per-scene optimized scene flow field. The flow field is used to bend the camera rays to counteract the scene dynamics, presenting the dynamic scene as if it were static to the rendering network.
  • results: The method reduces per-scene optimization time by an order of magnitude while achieving results comparable to existing methods, all on a single consumer-grade GPU.
    Abstract We introduce a novel approach for monocular novel view synthesis of dynamic scenes. Existing techniques already show impressive rendering quality but tend to focus on optimization within a single scene without leveraging prior knowledge. This limitation has been primarily attributed to the lack of datasets of dynamic scenes available for training and the diversity of scene dynamics. Our method FlowIBR circumvents these issues by integrating a neural image-based rendering method, pre-trained on a large corpus of widely available static scenes, with a per-scene optimized scene flow field. Utilizing this flow field, we bend the camera rays to counteract the scene dynamics, thereby presenting the dynamic scene as if it were static to the rendering network. The proposed method reduces per-scene optimization time by an order of magnitude, achieving comparable results to existing methods - all on a single consumer-grade GPU.

Treatment-aware Diffusion Probabilistic Model for Longitudinal MRI Generation and Diffuse Glioma Growth Prediction

  • paper_url: http://arxiv.org/abs/2309.05406
  • repo_url: None
  • paper_authors: Qinghui Liu, Elies Fuster-Garcia, Ivar Thokle Hovden, Donatas Sederevicius, Karoline Skogen, Bradley J MacIntosh, Edvard Grødem, Till Schellhorn, Petter Brandal, Atle Bjørnerud, Kyrre Eeg Emblem
  • for: This paper presents a novel end-to-end network for generating future tumor masks and realistic MRIs of how the tumor will look at any future time point for different treatment plans.
  • methods: The approach is based on cutting-edge diffusion probabilistic models and deep-segmentation neural networks, using sequential multi-parametric magnetic resonance images (MRI) and treatment information as conditioning inputs to guide the generative diffusion process.
  • results: The model demonstrates promising performance across a range of tasks, including the generation of high-quality synthetic MRIs with tumor masks, time-series tumor segmentations, and uncertainty estimates.
    Abstract Diffuse gliomas are malignant brain tumors that grow widespread through the brain. The complex interactions between neoplastic cells and normal tissue, as well as the treatment-induced changes often encountered, make glioma tumor growth modeling challenging. In this paper, we present a novel end-to-end network capable of generating future tumor masks and realistic MRIs of how the tumor will look at any future time points for different treatment plans. Our approach is based on cutting-edge diffusion probabilistic models and deep-segmentation neural networks. We included sequential multi-parametric magnetic resonance images (MRI) and treatment information as conditioning inputs to guide the generative diffusion process. This allows for tumor growth estimates at any given time point. We trained the model using real-world postoperative longitudinal MRI data with glioma tumor growth trajectories represented as tumor segmentation maps over time. The model has demonstrated promising performance across a range of tasks, including the generation of high-quality synthetic MRIs with tumor masks, time-series tumor segmentations, and uncertainty estimates. Combined with the treatment-aware generated MRIs, the tumor growth predictions with uncertainty estimates can provide useful information for clinical decision-making.

Two-Stage Hybrid Supervision Framework for Fast, Low-resource, and Accurate Organ and Pan-cancer Segmentation in Abdomen CT

  • paper_url: http://arxiv.org/abs/2309.05405
  • repo_url: None
  • paper_authors: Wentao Liu, Tong Tian, Weijin Xu, Lemeng Wang, Haoyuan Li, Huihua Yang
  • for: This work aims to improve the accuracy of abdominal organ and tumor segmentation for applications such as organ quantification, surgical planning, and disease diagnosis.
  • methods: A hybrid supervised framework, StMt, integrates self-training and mean teacher to segment with partially labeled and unlabeled data. A two-stage segmentation pipeline and a whole-volume-based input strategy are introduced to maximize segmentation accuracy.
  • results: On the FLARE2023 validation set, the method achieves excellent segmentation performance with fast, low-resource inference: average DSC scores of 89.79% for organs and 45.55% for lesions, an average running time of 11.25 s, and an area under the GPU memory-time curve of 9627.82 MB.
    Abstract Abdominal organ and tumour segmentation has many important clinical applications, such as organ quantification, surgical planning, and disease diagnosis. However, manual assessment is inherently subjective with considerable inter- and intra-expert variability. In the paper, we propose a hybrid supervised framework, StMt, that integrates self-training and mean teacher for the segmentation of abdominal organs and tumors using partially labeled and unlabeled data. We introduce a two-stage segmentation pipeline and whole-volume-based input strategy to maximize segmentation accuracy while meeting the requirements of inference time and GPU memory usage. Experiments on the validation set of FLARE2023 demonstrate that our method achieves excellent segmentation performance as well as fast and low-resource model inference. Our method achieved average DSC scores of 89.79% and 45.55% for the organs and lesions on the validation set, and the average running time and area under the GPU memory-time curve are 11.25 s and 9627.82 MB, respectively.
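The mean-teacher half of the framework relies on the standard exponential moving average (EMA) update of the student weights; the snippet below shows that generic update, not the authors' exact training loop.

```python
# Standard mean-teacher EMA update: the teacher's parameters track an
# exponential moving average of the student's parameters.
import torch


@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, alpha: float = 0.999):
    for s_param, t_param in zip(student.parameters(), teacher.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)


# Usage: after every optimizer step on the student,
# call update_teacher(student_net, teacher_net).
```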

Robust Single Rotation Averaging Revisited

  • paper_url: http://arxiv.org/abs/2309.05388
  • repo_url: None
  • paper_authors: Seong Hun Lee, Javier Civera
  • for: robust single rotation averaging to handle extremely large fractions of outliers
  • methods: minimize total truncated least unsquared deviations (TLUD) cost of geodesic distances, consisting of three steps: consider each input rotation as a potential initial solution, obtain the inlier set using the initial solution, and iteratively compute the geodesic $L_1$-mean of the inliers using the Weiszfeld algorithm on $SO(3)$
  • results: outperform the current state of the art in robustness against up to 99% outliers given a sufficient number of accurate inliers
    Abstract In this work, we propose a novel method for robust single rotation averaging that can efficiently handle an extremely large fraction of outliers. Our approach is to minimize the total truncated least unsquared deviations (TLUD) cost of geodesic distances. The proposed algorithm consists of three steps: First, we consider each input rotation as a potential initial solution and choose the one that yields the least sum of truncated chordal deviations. Next, we obtain the inlier set using the initial solution and compute its chordal $L_2$-mean. Finally, starting from this estimate, we iteratively compute the geodesic $L_1$-mean of the inliers using the Weiszfeld algorithm on $SO(3)$. An extensive evaluation shows that our method is robust against up to 99% outliers given a sufficient number of accurate inliers, outperforming the current state of the art.
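A sketch of the final step, the Weiszfeld iteration for the geodesic $L_1$-mean on $SO(3)$, using SciPy's rotation utilities; the stopping criterion and initialization shown here are illustrative choices rather than the paper's exact settings.

```python
# Weiszfeld iterations for the geodesic L1-mean of rotations on SO(3).
import numpy as np
from scipy.spatial.transform import Rotation as R


def geodesic_l1_mean(rotations, init, iters=50, eps=1e-9, tol=1e-8):
    """rotations: list of scipy Rotation objects (the inliers).
    init: initial Rotation estimate (e.g. the chordal L2-mean)."""
    S = init
    for _ in range(iters):
        # Tangent vectors of each rotation at the current estimate.
        vs = np.stack([(S.inv() * Ri).as_rotvec() for Ri in rotations])
        norms = np.linalg.norm(vs, axis=1)
        w = 1.0 / np.maximum(norms, eps)          # Weiszfeld weights
        delta = (w[:, None] * vs).sum(0) / w.sum()
        S = S * R.from_rotvec(delta)
        if np.linalg.norm(delta) < tol:
            break
    return S


# Toy usage: noisy copies of a ground-truth rotation plus one outlier.
rng = np.random.default_rng(0)
gt = R.from_rotvec([0.3, -0.2, 0.1])
samples = [gt * R.from_rotvec(0.01 * rng.standard_normal(3)) for _ in range(20)]
samples.append(R.from_rotvec([2.0, 1.0, -1.5]))   # outlier
est = geodesic_l1_mean(samples, init=samples[0])
print((gt.inv() * est).magnitude())               # small angular error
```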

Collective PV-RCNN: A Novel Fusion Technique using Collective Detections for Enhanced Local LiDAR-Based Perception

  • paper_url: http://arxiv.org/abs/2309.05380
  • repo_url: None
  • paper_authors: Sven Teufel, Jörg Gamerdinger, Georg Volk, Oliver Bringmann
  • for: Improving the environmental perception of autonomous vehicles for safe operation.
  • methods: Collective Perception (CP) is used to let vehicles exchange information, mitigating occlusions, limited sensor range, and environmental influences.
  • results: A novel fusion method, Collective PV-RCNN, is proposed to fuse the detections of cooperative vehicles within the local LiDAR-based detection pipeline, enhancing the perception of autonomous vehicles.
    Abstract Comprehensive perception of the environment is crucial for the safe operation of autonomous vehicles. However, the perception capabilities of autonomous vehicles are limited due to occlusions, limited sensor ranges, or environmental influences. Collective Perception (CP) aims to mitigate these problems by enabling the exchange of information between vehicles. A major challenge in CP is the fusion of the exchanged information. Due to the enormous bandwidth requirement of early fusion approaches and the interchangeability issues of intermediate fusion approaches, only the late fusion of shared detections is practical. Current late fusion approaches neglect valuable information for local detection, this is why we propose a novel fusion method to fuse the detections of cooperative vehicles within the local LiDAR-based detection pipeline. Therefore, we present Collective PV-RCNN (CPV-RCNN), which extends the PV-RCNN++ framework to fuse collective detections. Code is available at https://github.com/ekut-es

CNN or ViT? Revisiting Vision Transformers Through the Lens of Convolution

  • paper_url: http://arxiv.org/abs/2309.05375
  • repo_url: None
  • paper_authors: Chenghao Li, Chaoning Zhang
  • for: Boosting the performance of ViT on small datasets without pre-training.
  • methods: A Gaussian Mixture Mask (GMM) is applied to the self-attention matrix to improve the local modeling of ViT.
  • results: Performance gains for ViTs on multiple small datasets without pre-training, at almost zero additional parameter or computation cost.
    Abstract The success of Vision Transformer (ViT) has been widely reported on a wide range of image recognition tasks. The merit of ViT over CNN has been largely attributed to large training datasets or auxiliary pre-training. Without pre-training, the performance of ViT on small datasets is limited because the global self-attention has limited capacity in local modeling. Towards boosting ViT on small datasets without pre-training, this work improves its local modeling by applying a weight mask on the original self-attention matrix. A straightforward way to locally adapt the self-attention matrix can be realized by an element-wise learnable weight mask (ELM), for which our preliminary results show promising results. However, the element-wise simple learnable weight mask not only induces a non-trivial additional parameter overhead but also increases the optimization complexity. To this end, this work proposes a novel Gaussian mixture mask (GMM) in which one mask only has two learnable parameters and it can be conveniently used in any ViT variants whose attention mechanism allows the use of masks. Experimental results on multiple small datasets demonstrate that the effectiveness of our proposed Gaussian mask for boosting ViTs for free (almost zero additional parameter or computation cost). Our code will be publicly available at \href{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}{https://github.com/CatworldLee/Gaussian-Mixture-Mask-Attention}.
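One plausible (purely illustrative) parameterization of a two-parameter Gaussian mixture mask is to bias the attention logits by a mixture of two Gaussians of the pairwise patch distance, with the two bandwidths learned; the module below sketches that assumption and is not taken from the paper's code.

```python
# Illustrative Gaussian-mixture mask over attention logits for a ViT whose
# tokens lie on an H x W patch grid. The exact parameterization in the paper
# may differ; here each mask carries two learnable bandwidths.
import torch
import torch.nn as nn


class GaussianMixtureMask(nn.Module):
    def __init__(self, grid_h: int, grid_w: int):
        super().__init__()
        ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (N, 2)
        self.register_buffer("dist2", torch.cdist(coords, coords) ** 2)    # (N, N)
        self.log_sigma1 = nn.Parameter(torch.tensor(0.0))   # learnable bandwidth 1
        self.log_sigma2 = nn.Parameter(torch.tensor(1.0))   # learnable bandwidth 2

    def forward(self, attn_logits: torch.Tensor) -> torch.Tensor:
        """attn_logits: (..., N, N) pre-softmax attention scores."""
        s1 = self.log_sigma1.exp() ** 2
        s2 = self.log_sigma2.exp() ** 2
        mask = 0.5 * torch.exp(-self.dist2 / (2 * s1)) + 0.5 * torch.exp(-self.dist2 / (2 * s2))
        return attn_logits + torch.log(mask + 1e-6)          # bias logits toward local patches


gmm = GaussianMixtureMask(8, 8)
logits = torch.randn(2, 4, 64, 64)                           # (batch, heads, N, N)
print(gmm(logits).shape)
```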

Learning Geometric Representations of Objects via Interaction

  • paper_url: http://arxiv.org/abs/2309.05346
  • repo_url: https://github.com/reichlin/geomrepobj
  • paper_authors: Alfredo Reichlin, Giovanni Luca Marchetti, Hang Yin, Anastasiia Varava, Danica Kragic
  • for: This paper aims to learn representations of an agent and an external object it interacts with from observations of a scene.
  • methods: The framework uses the actions performed by the agent as the only source of supervision, assuming the object is displaced by the agent via unknown dynamics.
  • results: The paper provides a theoretical foundation and empirical evidence that an ideal learner can correctly extract the locations of the agent and the object, and that the extracted representations enable the agent to solve downstream tasks efficiently via reinforcement learning.
    Abstract We address the problem of learning representations from observations of a scene involving an agent and an external object the agent interacts with. To this end, we propose a representation learning framework extracting the location in physical space of both the agent and the object from unstructured observations of arbitrary nature. Our framework relies on the actions performed by the agent as the only source of supervision, while assuming that the object is displaced by the agent via unknown dynamics. We provide a theoretical foundation and formally prove that an ideal learner is guaranteed to infer an isometric representation, disentangling the agent from the object and correctly extracting their locations. We evaluate empirically our framework on a variety of scenarios, showing that it outperforms vision-based approaches such as a state-of-the-art keypoint extractor. We moreover demonstrate how the extracted representations enable the agent to solve downstream tasks via reinforcement learning in an efficient manner.

PAg-NeRF: Towards fast and efficient end-to-end panoptic 3D representations for agricultural robotics

  • paper_url: http://arxiv.org/abs/2309.05339
  • repo_url: None
  • paper_authors: Claus Smitt, Michael Halstead, Patrick Zimmer, Thomas Läbe, Esra Guclu, Cyrill Stachniss, Chris McCool
  • for: Precise scene understanding for monitoring and intervention tasks of agricultural robots.
  • methods: A NeRF-based system is used to build 3D geometry, photo-realistic renders, and 3D-consistent panoptic representations.
  • results: Improved peak signal-to-noise ratio and panoptic quality, while remaining robust to noisy robot pose data.
    Abstract Precise scene understanding is key for most robot monitoring and intervention tasks in agriculture. In this work we present PAg-NeRF which is a novel NeRF-based system that enables 3D panoptic scene understanding. Our representation is trained using an image sequence with noisy robot odometry poses and automatic panoptic predictions with inconsistent IDs between frames. Despite this noisy input, our system is able to output scene geometry, photo-realistic renders and 3D consistent panoptic representations with consistent instance IDs. We evaluate this novel system in a very challenging horticultural scenario and in doing so demonstrate an end-to-end trainable system that can make use of noisy robot poses rather than precise poses that have to be pre-calculated. Compared to a baseline approach the peak signal to noise ratio is improved from 21.34dB to 23.37dB while the panoptic quality improves from 56.65% to 70.08%. Furthermore, our approach is faster and can be tuned to improve inference time by more than a factor of 2 while being memory efficient with approximately 12 times fewer parameters.

MultIOD: Rehearsal-free Multihead Incremental Object Detector

  • paper_url: http://arxiv.org/abs/2309.05334
  • repo_url: None
  • paper_authors: Eden Belouadah, Arnaud Dapogny, Kevin Bailly
  • for: This paper proposes a class-incremental object detector based on CenterNet to address the catastrophic forgetting problem of existing class-incremental detectors.
  • methods: A multihead feature pyramid and multihead detection architecture efficiently separates class representations, and transfer learning between classes learned initially and those learned incrementally tackles catastrophic forgetting.
  • results: Evaluated on two Pascal VOC datasets, the method outperforms a range of state-of-the-art approaches without bells and whistles or rehearsal memory.
    Abstract Class-Incremental learning (CIL) is the ability of artificial agents to accommodate new classes as they appear in a stream. It is particularly interesting in evolving environments where agents have limited access to memory and computational resources. The main challenge of class-incremental learning is catastrophic forgetting, the inability of neural networks to retain past knowledge when learning a new one. Unfortunately, most existing class-incremental object detectors are applied to two-stage algorithms such as Faster-RCNN and rely on rehearsal memory to retain past knowledge. We believe that the current benchmarks are not realistic, and more effort should be dedicated to anchor-free and rehearsal-free object detection. In this context, we propose MultIOD, a class-incremental object detector based on CenterNet. Our main contributions are: (1) we propose a multihead feature pyramid and multihead detection architecture to efficiently separate class representations, (2) we employ transfer learning between classes learned initially and those learned incrementally to tackle catastrophic forgetting, and (3) we use a class-wise non-max-suppression as a post-processing technique to remove redundant boxes. Without bells and whistles, our method outperforms a range of state-of-the-art methods on two Pascal VOC datasets.
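The class-wise non-max-suppression post-processing step is a standard operation; the snippet below shows it with torchvision's `batched_nms` on dummy detections, independent of the MultIOD codebase.

```python
# Class-wise NMS: boxes are only suppressed by higher-scoring boxes
# of the same class, here via torchvision's batched_nms.
import torch
from torchvision.ops import batched_nms

boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],    # overlaps box 0, same class -> suppressed
                      [11., 11., 51., 51.]])   # overlaps box 0, other class -> kept
scores = torch.tensor([0.9, 0.8, 0.85])
labels = torch.tensor([0, 0, 1])

keep = batched_nms(boxes, scores, labels, iou_threshold=0.5)
print(keep)   # indices of the boxes that survive, e.g. tensor([0, 2])
```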

Diff-Privacy: Diffusion-based Face Privacy Protection

  • paper_url: http://arxiv.org/abs/2309.05330
  • repo_url: None
  • paper_authors: Xiao He, Mingrui Zhu, Dongxin Chen, Nannan Wang, Xinbo Gao
  • for: Protecting facial privacy against the collection and misuse of personal data enabled by AI techniques.
  • methods: Based on diffusion models, a new face privacy protection method, Diff-Privacy, is proposed. A multi-scale image inversion module (MSI) is trained to obtain SDM-format conditional embeddings of the original image; based on these embeddings, scheduling strategies are designed and different energy functions are constructed during the denoising process to achieve anonymization and visual identity information hiding.
  • results: Compared with previous methods, Diff-Privacy protects facial privacy more effectively and unifies anonymization and visual identity information hiding; extensive experiments validate its effectiveness.
    Abstract Privacy protection has become a top priority as the proliferation of AI techniques has led to widespread collection and misuse of personal data. Anonymization and visual identity information hiding are two important facial privacy protection tasks that aim to remove identification characteristics from facial images at the human perception level. However, they have a significant difference in that the former aims to prevent the machine from recognizing correctly, while the latter needs to ensure the accuracy of machine recognition. Therefore, it is difficult to train a model to complete these two tasks simultaneously. In this paper, we unify the task of anonymization and visual identity information hiding and propose a novel face privacy protection method based on diffusion models, dubbed Diff-Privacy. Specifically, we train our proposed multi-scale image inversion module (MSI) to obtain a set of SDM format conditional embeddings of the original image. Based on the conditional embeddings, we design corresponding embedding scheduling strategies and construct different energy functions during the denoising process to achieve anonymization and visual identity information hiding. Extensive experiments have been conducted to validate the effectiveness of our proposed framework in protecting facial privacy.

DeCUR: decoupling common & unique representations for multimodal self-supervision

  • paper_url: http://arxiv.org/abs/2309.05300
  • repo_url: https://github.com/zhu-xlab/decur
  • paper_authors: Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham, Chenying Liu, Zhitong Xiong, Xiao Xiang Zhu
  • for: This paper proposes a simple yet effective multimodal self-supervised learning method that captures the complementary relations between different modalities.
  • methods: The method decouples common and unique representations to integrate complementary information across modalities.
  • results: In common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), the method consistently benefits downstream scene classification and semantic segmentation, and transferring the pretrained backbones straightforwardly improves state-of-the-art supervised multimodal methods.
    Abstract The increasing availability of multi-sensor data sparks interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings, DeCUR is trained to integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent benefits on scene classification and semantic segmentation downstream tasks. Notably, we get straightforward improvements by transferring our pretrained backbones to state-of-the-art supervised multimodal methods without any hyperparameter tuning. Furthermore, we conduct a comprehensive explainability analysis to shed light on the interpretation of common and unique features in our multimodal approach. Codes are available at \url{https://github.com/zhu-xlab/DeCUR}.
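A loose way to picture the decoupling is to split each modality's embedding into a common and a unique block and apply Barlow-Twins-style cross-correlation objectives only where alignment is wanted. The sketch below illustrates that assumption; the official DeCUR code at the linked repository defines the actual losses.

```python
# Illustrative decoupling of common vs. unique embedding dimensions for two
# modalities. Common dims are pushed toward identity cross-correlation;
# unique dims are pushed toward zero cross-correlation (decorrelated).
import torch


def cross_corr(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    za = (za - za.mean(0)) / (za.std(0) + 1e-6)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-6)
    return za.T @ zb / za.shape[0]


def decoupling_loss(z1: torch.Tensor, z2: torch.Tensor, d_common: int, lam: float = 5e-3):
    """z1, z2: (batch, dim) embeddings of the same scenes from two modalities.
    The first d_common dims are treated as the shared subspace."""
    c_common = cross_corr(z1[:, :d_common], z2[:, :d_common])
    c_unique = cross_corr(z1[:, d_common:], z2[:, d_common:])
    on_diag = (torch.diagonal(c_common) - 1).pow(2).sum()
    off_diag = (c_common - torch.diag(torch.diagonal(c_common))).pow(2).sum()
    unique_term = c_unique.pow(2).sum()          # decorrelate the unique parts
    return on_diag + lam * off_diag + lam * unique_term


z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(decoupling_loss(z1, z2, d_common=64))
```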

Task-driven Compression for Collision Encoding based on Depth Images

  • paper_url: http://arxiv.org/abs/2309.05289
  • repo_url: None
  • paper_authors: Mihir Kulkarni, Kostas Alexis
  • for: This work proposes a learning-based method for aggressive, task-driven compression and encoding of depth images for collision prediction in robotic systems.
  • methods: A 3D image processing methodology accounts for the robot's size to appropriately inflate the obstacles represented in the depth image, yielding the distance the robot can traverse collision-free along any ray within the camera frustum. The resulting depth-and-collision image pairs are used to train a Variational Autoencoder-style network that compresses and transforms the original depth image into a latent representation encoding the collision information.
  • results: Compared with classical task-agnostic methods, the proposed task-driven encoding shows superior performance for collision image prediction from extremely low-dimensional latent spaces, encoding complex scenes with thin obstacles at long distances better than classical methods at compression ratios as high as 4050:1.
    Abstract This paper contributes a novel learning-based method for aggressive task-driven compression of depth images and their encoding as images tailored to collision prediction for robotic systems. A novel 3D image processing methodology is proposed that accounts for the robot's size in order to appropriately "inflate" the obstacles represented in the depth image and thus obtain the distance that can be traversed by the robot in a collision-free manner along any given ray within the camera frustum. Such depth-and-collision image pairs are used to train a neural network that follows the architecture of Variational Autoencoders to compress-and-transform the information in the original depth image to derive a latent representation that encodes the collision information for the given depth image. We compare our proposed task-driven encoding method with classical task-agnostic methods and demonstrate superior performance for the task of collision image prediction from extremely low-dimensional latent spaces. A set of comparative studies show that the proposed approach is capable of encoding depth image-and-collision image tuples from complex scenes with thin obstacles at long distances better than the classical methods at compression ratios as high as 4050:1.
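A very crude stand-in for the depth-to-collision conversion is a minimum filter whose window approximates the robot's footprint, followed by subtracting the robot radius; the window size and subtraction below are illustrative assumptions, not the paper's 3D mesh-based processing.

```python
# Crude stand-in for the depth -> collision-image conversion: for each pixel,
# take the minimum depth in a neighborhood approximating the robot's footprint
# and subtract the robot radius, clamping at zero.
import numpy as np
from scipy.ndimage import minimum_filter


def collision_image(depth: np.ndarray, robot_radius: float, window: int = 15) -> np.ndarray:
    """depth: HxW metric depth image. Returns per-pixel collision-free distance."""
    nearest = minimum_filter(depth, size=window, mode="nearest")
    return np.clip(nearest - robot_radius, 0.0, None)


depth = 5.0 * np.ones((120, 160), dtype=np.float32)
depth[50:70, 80:100] = 1.2                  # a thin obstacle 1.2 m away
coll = collision_image(depth, robot_radius=0.4)
print(coll.min(), coll.max())               # 0.8 near the obstacle, 4.6 elsewhere
```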

Class-Incremental Grouping Network for Continual Audio-Visual Learning

  • paper_url: http://arxiv.org/abs/2309.05281
  • repo_url: https://github.com/stonemo/cign
  • paper_authors: Shentong Mo, Weiguo Pian, Yapeng Tian
  • for: This paper proposes a novel Class-Incremental Grouping Network (CIGN) for continual audio-visual learning.
  • methods: CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features, and uses class token distillation and continual grouping to prevent forgetting.
  • results: Experiments show that CIGN achieves state-of-the-art audio-visual class-incremental learning performance on the VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
    Abstract Continual learning is a challenging problem in which models need to be trained on non-stationary data across sequential tasks for class-incremental learning. While previous methods have focused on using either regularization or rehearsal-based frameworks to alleviate catastrophic forgetting in image classification, they are limited to a single modality and cannot learn compact class-aware cross-modal representations for continual audio-visual learning. To address this gap, we propose a novel class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning. Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features. Additionally, it utilizes class tokens distillation and continual grouping to prevent forgetting parameters learned from previous tasks, thereby improving the model's ability to capture discriminative audio-visual categories. We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks. Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance. Code is available at https://github.com/stoneMo/CIGN.

Interactive Class-Agnostic Object Counting

  • paper_url: http://arxiv.org/abs/2309.05277
  • repo_url: https://github.com/Yifehuang97/ICACount
  • paper_authors: Yifeng Huang, Viresh Ranjan, Minh Hoai
  • for: This paper proposes an interactive class-agnostic object counting framework in which a human user provides feedback to improve the accuracy of a counter.
  • methods: The framework consists of a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, a density map shows the current prediction, segmented into non-overlapping regions with an easily verifiable number of objects; the user selects a region with obvious counting errors and specifies the range for the estimated number of objects within it.
  • results: On two challenging class-agnostic object counting benchmarks, FSCD-LVIS and FSC-147, the method reduces the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input.
    Abstract We propose a novel framework for interactive class-agnostic object counting, where a human user can interactively provide feedback to improve the accuracy of a counter. Our framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, we produce a density map to show the current prediction result, and we segment it into non-overlapping regions with an easily verifiable number of objects. The user can provide feedback by selecting a region with obvious counting errors and specifying the range for the estimated number of objects within it. To improve the counting result, we develop a novel adaptation loss to force the visual counter to output the predicted count within the user-specified range. For effective and efficient adaptation, we propose a refinement module that can be used with any density-based visual counter, and only the parameters in the refinement module will be updated during adaptation. Our experiments on two challenging class-agnostic object counting benchmarks, FSCD-LVIS and FSC-147, show that our method can reduce the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input. Our project can be found at https://yifehuang97.github.io/ICACountProjectPage/.
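The adaptation idea of keeping the predicted count inside the user-specified range can be written as a hinge on the integrated density of the selected region; the loss below is a schematic version under that assumption, while the paper's full adaptation loss and refinement module are more involved.

```python
# Schematic range loss: penalize the integrated density of a user-selected
# region only when it falls outside the user-specified [low, high] count range.
import torch
import torch.nn.functional as F


def range_count_loss(density_map: torch.Tensor, region_mask: torch.Tensor,
                     low: float, high: float) -> torch.Tensor:
    """density_map: (H, W) predicted density; region_mask: (H, W) in {0, 1}."""
    count = (density_map * region_mask).sum()
    return F.relu(low - count) + F.relu(count - high)


density = torch.rand(128, 128, requires_grad=True) * 0.01
mask = torch.zeros(128, 128)
mask[30:60, 40:90] = 1.0
loss = range_count_loss(density, mask, low=10, high=12)
loss.backward()                      # gradients flow only through the selected region
print(float(loss))
```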

Diving into Darkness: A Dual-Modulated Framework for High-Fidelity Super-Resolution in Ultra-Dark Environments

  • paper_url: http://arxiv.org/abs/2309.05267
  • repo_url: None
  • paper_authors: Jiaxin Gao, Ziyu Yue, Yaohua Liu, Sihan Xie, Xin Fan, Risheng Liu
  • for: This paper is written for the problem of super-resolution in ultra-dark environments, which is a challenging and practical problem that has received little attention.
  • methods: The paper proposes a specialized dual-modulated learning framework that includes a self-regularized luminance constraint and Illuminance-Semantic Dual Modulation (ISDM) components to enhance feature-level preservation of illumination and color details. Additionally, the paper designs a Resolution-Sensitive Merging Up-sampler (RSMU) module to mitigate the presence of artifacts and halos.
  • results: The paper shows that its approach outperforms state-of-the-art methods in terms of PSNR, LPIPS, and RMSE score, with a notable improvement of 5% in PSNR and 43% in LPIPS. The paper also demonstrates the generalization of its approach across different darkness levels, with a 19-fold increase in the RMSE score.
    Abstract Super-resolution tasks oriented to images captured in ultra-dark environments is a practical yet challenging problem that has received little attention. Due to uneven illumination and low signal-to-noise ratio in dark environments, a multitude of problems such as lack of detail and color distortion may be magnified in the super-resolution process compared to normal-lighting environments. Consequently, conventional low-light enhancement or super-resolution methods, whether applied individually or in a cascaded manner for such problem, often encounter limitations in recovering luminance, color fidelity, and intricate details. To conquer these issues, this paper proposes a specialized dual-modulated learning framework that, for the first time, attempts to deeply dissect the nature of the low-light super-resolution task. Leveraging natural image color characteristics, we introduce a self-regularized luminance constraint as a prior for addressing uneven lighting. Expanding on this, we develop Illuminance-Semantic Dual Modulation (ISDM) components to enhance feature-level preservation of illumination and color details. Besides, instead of deploying naive up-sampling strategies, we design the Resolution-Sensitive Merging Up-sampler (RSMU) module that brings together different sampling modalities as substrates, effectively mitigating the presence of artifacts and halos. Comprehensive experiments showcases the applicability and generalizability of our approach to diverse and challenging ultra-low-light conditions, outperforming state-of-the-art methods with a notable improvement (i.e., $\uparrow$5\% in PSNR, and $\uparrow$43\% in LPIPS). Especially noteworthy is the 19-fold increase in the RMSE score, underscoring our method's exceptional generalization across different darkness levels. The code will be available online upon publication of the paper.

A horizon line annotation tool for streamlining autonomous sea navigation experiments

  • paper_url: http://arxiv.org/abs/2309.05262
  • repo_url: None
  • paper_authors: Yassir Zardoua, Abdelhamid El Wahabi, Mohammed Boulaala, Abdelali Astito
  • for: To provide more stable and reliable horizon line detection for marine autonomous navigation tasks.
  • methods: A new public annotation software with tailored features is presented to quickly and easily annotate collected marine images with the correct position and orientation of the horizon line.
  • results: The tool streamlines the experimental validation of more robust horizon line detectors by easing the collection and annotation of images covering various sea conditions.
    Abstract Horizon line (or sea line) detection (HLD) is a critical component in multiple marine autonomous navigation tasks, such as identifying the navigation area (i.e., the sea), obstacle detection and geo-localization, and digital video stabilization. A recent survey highlighted several weaknesses of such detectors, particularly on sea conditions missing from the most extensive dataset currently used by HLD researchers. Experimental validation of more robust HLDs involves collecting an extensive set of these missing sea conditions and annotating each collected image with the correct position and orientation of the horizon line. The annotation task is daunting without a proper tool. Therefore, we present the first public annotation software with tailored features to make the sea line annotation process fast and easy. The software is available at: https://drive.google.com/drive/folders/1c0ZmvYDckuQCPIWfh_70P7E1A_DWlIvF?usp=sharing

Gall Bladder Cancer Detection from US Images with Only Image Level Labels

  • paper_url: http://arxiv.org/abs/2309.05261
  • repo_url: None
  • paper_authors: Soumen Basu, Ashish Papanai, Mayank Gupta, Pankaj Gupta, Chetan Arora
  • for: This work aims to improve the detection of gallbladder cancer (GBC) from ultrasound images using only image-level labels.
  • methods: A transformer model (DETR) is trained with multi-instance learning (MIL) and self-supervised instance selection to handle the weakly supervised object detection (WSOD) formulation.
  • results: The method improves AP and detection sensitivity over SOTA transformer-based and CNN-based WSOD methods.
    Abstract Automated detection of Gallbladder Cancer (GBC) from Ultrasound (US) images is an important problem, which has drawn increased interest from researchers. However, most of these works use difficult-to-acquire information such as bounding box annotations or additional US videos. In this paper, we focus on GBC detection using only image-level labels. Such annotation is usually available based on the diagnostic report of a patient, and do not require additional annotation effort from the physicians. However, our analysis reveals that it is difficult to train a standard image classification model for GBC detection. This is due to the low inter-class variance (a malignant region usually occupies only a small portion of a US image), high intra-class variance (due to the US sensor capturing a 2D slice of a 3D object leading to large viewpoint variations), and low training data availability. We posit that even when we have only the image level label, still formulating the problem as object detection (with bounding box output) helps a deep neural network (DNN) model focus on the relevant region of interest. Since no bounding box annotations is available for training, we pose the problem as weakly supervised object detection (WSOD). Motivated by the recent success of transformer models in object detection, we train one such model, DETR, using multi-instance-learning (MIL) with self-supervised instance selection to suit the WSOD task. Our proposed method demonstrates an improvement of AP and detection sensitivity over the SOTA transformer-based and CNN-based WSOD methods. Project page is at https://gbc-iitd.github.io/wsod-gbc

FusionFormer: A Multi-sensory Fusion in Bird’s-Eye-View and Temporal Consistent Transformer for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2309.05257
  • repo_url: None
  • paper_authors: Chunyong Hu, Hang Zheng, Kun Li, Jianyun Xu, Weibo Mao, Maochun Luo, Lingxuan Wang, Mingxia Chen, Kaixuan Liu, Yiru Zhao, Peihan Hao, Minzhe Liu, Kaicheng Yu
  • for: Improving performance on the 3D object detection task.
  • methods: Transformers are leveraged to fuse multi-modal features into BEV features, with a depth prediction branch and a temporal fusion module added to improve detection performance.
  • results: Achieves 72.6% mAP and 75.1% NDS on the nuScenes dataset, outperforming existing methods.
    Abstract Multi-sensor modal fusion has demonstrated strong advantages in 3D object detection tasks. However, existing methods that fuse multi-modal features through a simple channel concatenation require transformation features into bird's eye view space and may lose the information on Z-axis thus leads to inferior performance. To this end, we propose FusionFormer, an end-to-end multi-modal fusion framework that leverages transformers to fuse multi-modal features and obtain fused BEV features. And based on the flexible adaptability of FusionFormer to the input modality representation, we propose a depth prediction branch that can be added to the framework to improve detection performance in camera-based detection tasks. In addition, we propose a plug-and-play temporal fusion module based on transformers that can fuse historical frame BEV features for more stable and reliable detection results. We evaluate our method on the nuScenes dataset and achieve 72.6% mAP and 75.1% NDS for 3D object detection tasks, outperforming state-of-the-art methods.

Towards Better Data Exploitation In Self-Supervised Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.05254
  • repo_url: https://github.com/sauf4896/bdedepth
  • paper_authors: Jinfeng Liu, Lingtong Kong, Jie Yang, Wei Liu
  • for: This work aims to improve self-supervised monocular depth estimation and strengthen the generalization ability of robotic perception systems.
  • methods: Two data augmentation techniques, Resizing-Cropping and Splitting-Permuting, are used to fully exploit the potential of the training data: the original image and the two augmented images are fed into the training pipeline simultaneously and used for self-distillation. In addition, a detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder is introduced to improve the restoration of fine details in depth maps.
  • results: Experiments show state-of-the-art performance on the KITTI benchmark, with both raw and improved ground truth, and superior generalization when transferring to the Make3D and NYUv2 datasets.
    Abstract Depth estimation plays an important role in the robotic perception system. Self-supervised monocular paradigm has gained significant attention since it can free training from the reliance on depth annotations. Despite recent advancements, existing self-supervised methods still underutilize the available training data, limiting their generalization ability. In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets. Specifically, the original image and the generated two augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation. Additionally, we introduce the detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder to enhance the restoration of fine details in depth maps. Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth. Moreover, our models also show superior generalization performance when transferring to Make3D and NYUv2 datasets. Our codes are available at https://github.com/Sauf4896/BDEdepth.
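A simplified sketch of the two augmentations on a single image tensor; the scale factor and 2x2 split are illustrative choices, and the consistent handling of intrinsics and supervision signals required for depth training is omitted.

```python
# Simplified Resizing-Cropping and Splitting-Permuting augmentations
# on a (C, H, W) image tensor.
import torch
import torch.nn.functional as F


def resize_crop(img: torch.Tensor, scale: float = 1.3) -> torch.Tensor:
    c, h, w = img.shape
    up = F.interpolate(img[None], scale_factor=scale, mode="bilinear",
                       align_corners=False)[0]
    top = torch.randint(0, up.shape[1] - h + 1, (1,)).item()
    left = torch.randint(0, up.shape[2] - w + 1, (1,)).item()
    return up[:, top:top + h, left:left + w]


def split_permute(img: torch.Tensor) -> torch.Tensor:
    """Split into 2x2 blocks and shuffle their order."""
    c, h, w = img.shape
    hh, hw = h // 2, w // 2
    blocks = [img[:, :hh, :hw], img[:, :hh, hw:2 * hw],
              img[:, hh:2 * hh, :hw], img[:, hh:2 * hh, hw:2 * hw]]
    order = torch.randperm(4).tolist()
    top = torch.cat([blocks[order[0]], blocks[order[1]]], dim=2)
    bottom = torch.cat([blocks[order[2]], blocks[order[3]]], dim=2)
    return torch.cat([top, bottom], dim=1)


img = torch.rand(3, 192, 640)
print(resize_crop(img).shape, split_permute(img).shape)
```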

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

  • paper_url: http://arxiv.org/abs/2309.05251
  • repo_url: https://github.com/3dlg-hcvc/M3DRef-CLIP
  • paper_authors: Yiming Zhang, ZeMing Gong, Angel X. Chang
  • for: This work targets localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks assume a unique target object per description, which is unnatural since real-world scenarios and robotic tasks often require localizing multiple objects.
  • methods: Multi3DRefer is proposed, generalizing the ScanRefer dataset and task. The dataset contains 61926 descriptions of 11609 objects, where each description may reference zero, one, or multiple target objects; a new evaluation metric and benchmark methods from prior work are also introduced to enable further investigation of multimodal 3D scene understanding.
  • results: A new baseline that leverages 2D features from CLIP by rendering object proposals online with contrastive learning outperforms the state of the art on the ScanRefer benchmark.
    Abstract We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.

HAT: Hybrid Attention Transformer for Image Restoration

  • paper_url: http://arxiv.org/abs/2309.05239
  • repo_url: https://github.com/xpixelgroup/hat
  • paper_authors: Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, Chao Dong
  • for: Improving the performance of Transformer networks on image restoration tasks, including image super-resolution and denoising.
  • methods: Proposes a new Hybrid Attention Transformer (HAT) that combines channel attention and window-based self-attention to make better use of the input information; an overlapping cross-attention module is also introduced to strengthen the interaction of information across windows.
  • results: Compared with baseline methods, HAT restores images better and extends to further restoration applications, such as real-world image super-resolution, Gaussian image denoising, and image compression artifact reduction. Experiments show that HAT achieves state-of-the-art performance.
    Abstract Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at https://github.com/XPixelGroup/HAT.
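
For intuition on the "channel attention" half that HAT pairs with window-based self-attention, the following is a generic squeeze-and-excitation style block; the reduction ratio and layer sizes are assumptions, and HAT's actual modules are defined in the linked repository:

```python
# Generic channel-attention block, shown only to illustrate the mechanism;
# not HAT's CAB/OCAB implementation.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global spatial average
        self.fc = nn.Sequential(                      # excite: per-channel gating weights
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(self.pool(x))              # reweight the input's channels

if __name__ == "__main__":
    feat = torch.rand(1, 64, 48, 48)
    print(ChannelAttention(64)(feat).shape)           # torch.Size([1, 64, 48, 48])
```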

Angle Range and Identity Similarity Enhanced Gaze and Head Redirection based on Synthetic data

  • paper_url: http://arxiv.org/abs/2309.05214
  • repo_url: None
  • paper_authors: Jiawei Qin, Xueting Wang
  • for: Improving the angular accuracy of head and gaze redirection in full-face images, along with image quality.
  • methods: Uses data augmentation (via monocular 3D face reconstruction) to extend the head-pose and gaze range of real data, together with a framework designed for better image quality and identity preservation.
  • results: Achieves improved redirection performance at large angles while maintaining high image quality and identity preservation.
    Abstract In this paper, we propose a method for improving the angular accuracy and photo-reality of gaze and head redirection in full-face images. The problem with current models is that they cannot handle redirection at large angles, and this limitation mainly comes from the lack of training data. To resolve this problem, we create data augmentation by monocular 3D face reconstruction to extend the head pose and gaze range of the real data, which allows the model to handle a wider redirection range. In addition to the main focus on data augmentation, we also propose a framework with better image quality and identity preservation of unseen subjects even training with synthetic data. Experiments show that our method significantly improves redirection performance in terms of redirection angular accuracy while maintaining high image quality, especially when redirecting to large angles.
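
Redirection accuracy of this kind is usually reported as the angular error between predicted and target gaze vectors; a small sketch of that standard metric follows (whether the paper uses exactly this formulation is an assumption):

```python
# Angular error between two gaze direction vectors, reported in degrees.
import numpy as np

def angular_error_deg(g_pred: np.ndarray, g_true: np.ndarray) -> float:
    g_pred = g_pred / np.linalg.norm(g_pred)
    g_true = g_true / np.linalg.norm(g_true)
    cos = np.clip(np.dot(g_pred, g_true), -1.0, 1.0)   # clip guards against rounding
    return float(np.degrees(np.arccos(cos)))

print(angular_error_deg(np.array([0.0, 0.0, -1.0]), np.array([0.1, 0.0, -1.0])))  # ~5.7 degrees
```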

Phase-Specific Augmented Reality Guidance for Microscopic Cataract Surgery Using Long-Short Spatiotemporal Aggregation Transformer

  • paper_url: http://arxiv.org/abs/2309.05209
  • repo_url: None
  • paper_authors: Puxun Tu, Hongfei Ye, Haochen Shi, Jeff Young, Meng Xie, Peiquan Zhao, Ce Zheng, Xiaoyi Jiang, Xiaojun Chen
  • For: The paper is focused on developing a novel phase-specific augmented reality (AR) guidance system for phacoemulsification cataract surgery (PCS) to enhance intraoperative proficiency.
  • Methods: The proposed system utilizes a two-stage surgical microscopic video recognition network, including a multi-task learning structure to segment the surgical limbus region and extract spatial features, and a long-short spatiotemporal aggregation transformer (LS-SAT) network to recognize the current surgical phase. The system also incorporates AR visual cues designed in collaboration with ophthalmologists.
  • Results: The proposed system was evaluated on publicly available and in-house datasets, with comparison results demonstrating its superior performance compared to related works. The system was also evaluated in a clinical setup, with results indicating remarkable accuracy and real-time performance, highlighting its potential for clinical applications.
    Abstract Phacoemulsification cataract surgery (PCS) is a routine procedure conducted using a surgical microscope, heavily reliant on the skill of the ophthalmologist. While existing PCS guidance systems extract valuable information from surgical microscopic videos to enhance intraoperative proficiency, they suffer from non-phase-specific guidance, leading to redundant visual information. In this study, our major contribution is the development of a novel phase-specific augmented reality (AR) guidance system, which offers tailored AR information corresponding to the recognized surgical phase. Leveraging the inherent quasi-standardized nature of PCS procedures, we propose a two-stage surgical microscopic video recognition network. In the first stage, we implement a multi-task learning structure to segment the surgical limbus region and extract a limbus region-focused spatial feature for each frame. In the second stage, we propose the long-short spatiotemporal aggregation transformer (LS-SAT) network to model local fine-grained and global temporal relationships, and combine the extracted spatial features to recognize the current surgical phase. Additionally, we collaborate closely with ophthalmologists to design AR visual cues by utilizing techniques such as limbus ellipse fitting and regional restricted normal cross-correlation rotation computation. We evaluated the network on publicly available and in-house datasets, with comparison results demonstrating its superior performance compared to related works. Ablation results further validated the effectiveness of the limbus region-focused spatial feature extractor and the combination of temporal features. Furthermore, the developed system was evaluated in a clinical setup, with results indicating remarkable accuracy and real-time performance, underscoring its potential for clinical applications.
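
As a rough illustration of the limbus ellipse fitting step used to place the AR cues, the snippet below fits an ellipse to boundary points with OpenCV; the synthetic contour is only a stand-in for a real segmentation boundary:

```python
# Minimal sketch: fit an ellipse to boundary points of a segmented limbus region.
import numpy as np
import cv2

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
contour = np.stack([320 + 80 * np.cos(theta), 240 + 60 * np.sin(theta)], axis=1).astype(np.float32)

center, axes, angle = cv2.fitEllipse(contour)   # ((cx, cy), (axis lengths), rotation angle)
print(center, axes, angle)
```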

Learning Sequential Acquisition Policies for Robot-Assisted Feeding

  • paper_url: http://arxiv.org/abs/2309.05197
  • repo_url: None
  • paper_authors: Priya Sundaresan, Jiajun Wu, Dorsa Sadigh
  • for: To develop a visual action planning system for long-horizon food acquisition from a plate, so that assistive robots can provide better help at mealtimes.
  • methods: Proposes VAPORS, a framework that learns a high-level action selection policy by leveraging latent plate dynamics learned in simulation; to carry out plans in the real world, VAPORS delegates action execution to visually parameterized primitives.
  • results: Across 38 plates of real-world acquisition trials, VAPORS acquires food much more efficiently than baselines, generalizes across realistic plate variations (such as toppings and sauces), and received favorable ratings in a survey of 49 users.
    Abstract A robot providing mealtime assistance must perform specialized maneuvers with various utensils in order to pick up and feed a range of food items. Beyond these dexterous low-level skills, an assistive robot must also plan these strategies in sequence over a long horizon to clear a plate and complete a meal. Previous methods in robot-assisted feeding introduce highly specialized primitives for food handling without a means to compose them together. Meanwhile, existing approaches to long-horizon manipulation lack the flexibility to embed highly specialized primitives into their frameworks. We propose Visual Action Planning OveR Sequences (VAPORS), a framework for long-horizon food acquisition. VAPORS learns a policy for high-level action selection by leveraging learned latent plate dynamics in simulation. To carry out sequential plans in the real world, VAPORS delegates action execution to visually parameterized primitives. We validate our approach on complex real-world acquisition trials involving noodle acquisition and bimanual scooping of jelly beans. Across 38 plates, VAPORS acquires much more efficiently than baselines, generalizes across realistic plate variations such as toppings and sauces, and qualitatively appeals to user feeding preferences in a survey conducted across 49 individuals. Code, datasets, videos, and supplementary materials can be found on our website: https://sites.google.com/view/vaporsbot.
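
The plan-then-delegate structure described above can be pictured with a toy skeleton like the one below; the primitive names, parameters, and the random high-level policy are placeholders, not VAPORS' actual interfaces:

```python
# Toy skeleton: a high-level policy selects a primitive, a visually parameterized
# primitive executes it. Everything here is a stand-in for illustration only.
import random

PRIMITIVES = {
    "group":  lambda params: print(f"pushing food together at {params}"),
    "pickup": lambda params: print(f"scooping / twirling at {params}"),
}

def high_level_policy(plate_observation) -> str:
    # VAPORS learns this choice from latent plate dynamics in simulation;
    # a random pick stands in for the learned policy here.
    return random.choice(list(PRIMITIVES))

def visual_parameterization(plate_observation, action: str):
    # In the real system, the primitive's target is predicted from the image.
    return {"x": 0.1, "y": -0.05}

plate = object()                       # stand-in for a camera observation
for step in range(3):                  # roll out a short sequential plan
    action = high_level_policy(plate)
    PRIMITIVES[action](visual_parameterization(plate, action))
```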

Towards Viewpoint Robustness in Bird’s Eye View Segmentation

  • paper_url: http://arxiv.org/abs/2309.05192
  • repo_url: None
  • paper_authors: Tzofi Klinghoffer, Jonah Philion, Wenzheng Chen, Or Litany, Zan Gojcic, Jungseock Joo, Ramesh Raskar, Sanja Fidler, Jose M. Alvarez
  • For: This paper addresses the sensitivity of perception neural networks in autonomous vehicles (AVs) to changes in camera viewpoint, so that models can be deployed across many vehicle types without repeated data collection and labeling for each.
  • Methods: Using bird's eye view (BEV) segmentation as a motivating task, the authors introduce a novel view synthesis technique that transforms collected data to the viewpoint of target camera rigs, allowing BEV segmentation models to be trained for diverse target rigs without additional data collection or labeling.
  • Results: Extensive experiments show that existing perception models are surprisingly sensitive to viewpoint changes; when trained on data from one camera rig, small changes in camera pitch, yaw, depth, or height at inference time cause large performance drops. The proposed method recovers an average of 14.7% of the intersection over union (IoU) otherwise lost when deploying to new rigs.
    Abstract Autonomous vehicles (AV) require that neural networks used for perception be robust to different viewpoints if they are to be deployed across many types of vehicles without the repeated cost of data collection and labeling for each. AV companies typically focus on collecting data from diverse scenarios and locations, but not camera rig configurations, due to cost. As a result, only a small number of rig variations exist across most fleets. In this paper, we study how AV perception models are affected by changes in camera viewpoint and propose a way to scale them across vehicle types without repeated data collection and labeling. Using bird's eye view (BEV) segmentation as a motivating task, we find through extensive experiments that existing perception models are surprisingly sensitive to changes in camera viewpoint. When trained with data from one camera rig, small changes to pitch, yaw, depth, or height of the camera at inference time lead to large drops in performance. We introduce a technique for novel view synthesis and use it to transform collected data to the viewpoint of target rigs, allowing us to train BEV segmentation models for diverse target rigs without any additional data collection or labeling cost. To analyze the impact of viewpoint changes, we leverage synthetic data to mitigate other gaps (content, ISP, etc). Our approach is then trained on real data and evaluated on synthetic data, enabling evaluation on diverse target rigs. We release all data for use in future work. Our method is able to recover an average of 14.7% of the IoU that is otherwise lost when deploying to new rigs.
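
The rig changes studied here amount to perturbing the camera extrinsics; a small sketch of applying a pitch and height offset to a 4x4 camera-to-ego transform is shown below (axis conventions vary by dataset, so treat the choice of rotation axis as an assumption):

```python
# Sketch of a rig perturbation: tilt the camera by a few degrees and shift its height.
import numpy as np

def perturb_extrinsic(T_cam2ego: np.ndarray, pitch_deg: float = 0.0, dz: float = 0.0) -> np.ndarray:
    p = np.radians(pitch_deg)
    R_pitch = np.array([[1, 0, 0],
                        [0, np.cos(p), -np.sin(p)],
                        [0, np.sin(p),  np.cos(p)]])   # rotation about the x axis
    T = T_cam2ego.copy()
    T[:3, :3] = R_pitch @ T[:3, :3]    # tilt the camera
    T[2, 3] += dz                      # raise or lower it
    return T

T = np.eye(4)
print(perturb_extrinsic(T, pitch_deg=5.0, dz=0.3))
```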

HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.05186
  • repo_url: None
  • paper_authors: Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li
  • for: To consolidate multiple autonomous driving tasks from videos into a single model, namely the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task.
  • methods: Uses singular multimodal large language models (MLLMs) to handle the multiple driving tasks from video, and proposes HiLM-D, an efficient method to incorporate high-resolution (HR) information into MLLMs.
  • results: Experiments show that HiLM-D outperforms existing MLLMs on the ROLISP task, with gains of 4.8% in BLEU-4 and 17.2% in mIoU.
    Abstract Autonomous driving systems generally employ separate models for different tasks resulting in intricate designs. For the first time, we leverage singular multimodal large language models (MLLMs) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. ROLISP uses natural language to simultaneously identify and interpret risk objects, understand ego-vehicle intentions, and provide motion suggestions, eliminating the necessity for task-specific architectures. However, lacking high-resolution (HR) information, existing MLLMs often miss small objects (e.g., traffic cones) and overly focus on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task. Especially, HiLM-D integrates two branches: (i) the low-resolution reasoning branch, can be any MLLMs, processes low-resolution videos to caption risk objects and discern ego-vehicle intentions/suggestions; (ii) the high-resolution perception branch (HR-PB), prominent to HiLM-D,, ingests HR images to enhance detection by capturing vision-specific HR feature maps and prioritizing all potential risks over merely salient objects. Our HR-PB serves as a plug-and-play module, seamlessly fitting into current MLLMs. Experiments on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
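
As a toy illustration of the two-branch idea, the sketch below has a high-resolution branch contribute extra visual tokens that are concatenated with a low-resolution branch's tokens; the shapes and the fusion rule are assumptions, not HiLM-D's architecture:

```python
# Toy two-branch fusion: HR features become extra tokens alongside LR visual tokens.
import torch
import torch.nn as nn

class ToyHRPerceptionBranch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # coarse patchify stand-in

    def forward(self, hr_image: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(hr_image)                 # (B, dim, H/16, W/16)
        return feat.flatten(2).transpose(1, 2)         # (B, tokens, dim)

lr_tokens = torch.rand(1, 64, 256)                     # stand-in for the MLLM's visual tokens
hr_tokens = ToyHRPerceptionBranch()(torch.rand(1, 3, 512, 1024))
fused = torch.cat([lr_tokens, hr_tokens], dim=1)       # one simple way to expose HR cues
print(fused.shape)
```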