eess.IV - 2023-07-25

Towards Unifying Anatomy Segmentation: Automated Generation of a Full-body CT Dataset via Knowledge Aggregation and Anatomical Guidelines

  • paper_url: http://arxiv.org/abs/2307.13375
  • repo_url: https://github.com/alexanderjaus/atlasdataset
  • paper_authors: Alexander Jaus, Constantin Seibold, Kelsey Hermann, Alexandra Walter, Kristina Giske, Johannes Haubold, Jens Kleesiek, Rainer Stiefelhagen
  • for: This work generates an anatomy segmentation dataset automatically, using sequential nnU-Net-based pseudo-labeling and anatomy-guided pseudo-label refinement.
  • methods: By combining multiple fragmented knowledge bases, the method produces whole-body CT scans with $142$ voxel-level labels, providing comprehensive anatomical coverage.
  • results: The procedure requires no manual annotation during the label aggregation stage and achieves an 85% Dice score on the BTCV dataset; the dataset also passed medical validity checks and scalable automated checks.
    Abstract In this study, we present a method for generating automated anatomy segmentation datasets using a sequential process that involves nnU-Net-based pseudo-labeling and anatomy-guided pseudo-label refinement. By combining various fragmented knowledge bases, we generate a dataset of whole-body CT scans with $142$ voxel-level labels for 533 volumes, providing comprehensive anatomical coverage that experts have approved. Our proposed procedure does not rely on manual annotation during the label aggregation stage. We examine its plausibility and usefulness using three complementary checks: a human expert evaluation which approved the dataset, a deep learning usefulness benchmark on the BTCV dataset in which we achieve an 85% Dice score without using its training dataset, and medical validity checks. This evaluation procedure combines scalable automated checks with labor-intensive high-quality expert checks. Besides the dataset, we release our trained unified anatomical segmentation model capable of predicting $142$ anatomical structures on CT data.
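To make the label-aggregation stage concrete, here is a minimal Python sketch of one plausible merging rule: partial masks from separate source models are combined with first-come precedence and per-source id offsets. This is our illustrative assumption, not the authors' released pipeline, which further refines pseudo-labels with anatomical guidelines.

```python
import numpy as np

def aggregate_labels(partial_masks, label_offsets):
    """partial_masks: list of integer arrays (0 = background), one per source model.
    label_offsets: per-source offsets so class ids from different sources don't collide."""
    merged = np.zeros_like(partial_masks[0])
    for mask, offset in zip(partial_masks, label_offsets):
        unclaimed = (merged == 0) & (mask > 0)  # earlier sources take precedence
        merged[unclaimed] = mask[unclaimed] + offset
    return merged

# Toy usage: two partial labelings of the same 4x4 slice.
a = np.array([[1, 1, 0, 0]] * 4)
b = np.array([[0, 2, 2, 0]] * 4)
print(aggregate_labels([a, b], label_offsets=[0, 10]))
```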

Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks

  • paper_url: http://arxiv.org/abs/2307.13337
  • repo_url: None
  • paper_authors: Cheeun Hong, Kyoung Mu Lee
  • for: This paper proposes a new quantization-aware training framework that resolves the distribution mismatch problem in image super-resolution (SR) networks, improving accuracy after quantization.
  • methods: The framework, named ODM, reduces the mismatch by directly regularizing feature variance during training, applying the regularizer only when its gradients cooperate with those of the reconstruction loss; it also introduces channel-wise distribution offsets (scales or shifts) for layers with severe mismatch.
  • results: Experiments show that ODM outperforms existing SR quantization methods at similar or lower computational cost, demonstrating the value of reducing distribution mismatch.
    Abstract Quantization is a promising approach to reduce the high computational complexity of image super-resolution (SR) networks. However, compared to high-level tasks like image classification, low-bit quantization leads to severe accuracy loss in SR networks. This is because feature distributions of SR networks are significantly divergent for each channel or input image, making it difficult to determine a quantization range. Existing SR quantization works approach this distribution mismatch problem by dynamically adapting quantization ranges to the varying distributions during test time. However, such dynamic adaptation incurs additional computational costs that limit the benefits of quantization. Instead, we propose a new quantization-aware training framework that effectively Overcomes the Distribution Mismatch problem in SR networks without the need for dynamic adaptation. Intuitively, the mismatch can be reduced by directly regularizing the variance in features during training. However, we observe that variance regularization can collide with the reconstruction loss during training and adversely impact SR accuracy. Thus, we avoid the conflict between the two losses by regularizing the variance only when the gradients of variance regularization are cooperative with those of reconstruction. Additionally, to further reduce the distribution mismatch, we introduce distribution offsets to layers with a significant mismatch, which either scale or shift channel-wise features. Our proposed algorithm, called ODM, effectively reduces the mismatch in distributions with minimal computational overhead. Experimental results show that ODM effectively outperforms existing SR quantization approaches with similar or fewer computations, demonstrating the importance of reducing the distribution mismatch problem. Our code is available at https://github.com/Cheeun/ODM.
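The cooperative-gradient gating described in the abstract can be sketched in a few lines of PyTorch. The gating rule below (the sign of the dot product between the two gradients) is our assumption of what "cooperative" means; the released ODM code may use a different criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySR(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Conv2d(3, 16, 3, padding=1)   # feature extractor to be quantized
        self.head = nn.Conv2d(16, 3, 3, padding=1)   # reconstruction head
    def forward(self, x):
        feats = self.body(x)
        return self.head(feats), feats

model = TinySR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lr_img, hr_img = torch.rand(2, 3, 8, 8), torch.rand(2, 3, 8, 8)

sr, feats = model(lr_img)
rec_loss = F.l1_loss(sr, hr_img)
var_loss = feats.var(dim=(0, 2, 3)).mean()           # per-channel feature variance

params = list(model.body.parameters())               # the variance penalty only concerns the body
g_rec = torch.autograd.grad(rec_loss, params, retain_graph=True)
g_var = torch.autograd.grad(var_loss, params, retain_graph=True)
dot = sum((a * b).sum() for a, b in zip(g_rec, g_var))

lam = 0.1 if dot > 0 else 0.0                        # regularize only when gradients cooperate
(rec_loss + lam * var_loss).backward()
opt.step()
```

Gating on the gradient dot product is the simplest conflict test between two losses: when the dot product is negative, applying the variance regularizer would push the weights against the reconstruction objective.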

A Visual Quality Assessment Method for Raster Images in Scanned Document

  • paper_url: http://arxiv.org/abs/2307.13241
  • repo_url: None
  • paper_authors: Justin Yang, Peter Bauer, Todd Harris, Changhyung Lee, Hyeon Seok Seo, Jan P Allebach, Fengqing Zhu
  • for: This work studies the visual quality of scanned documents, focusing on raster image regions.
  • methods: We propose a machine-learning classification method that determines whether a scanned raster image at a given resolution setting is visually acceptable.
  • results: A psychophysical study established acceptability at different resolution settings, and the human ratings serve as training ground truth. Because most images were rated visually acceptable, the dataset is unbalanced; several noise models simulate degradation during scanning, and including the augmented data in training significantly improves the classifier.
    Abstract Image quality assessment (IQA) is an active research area in the field of image processing. Most prior works focus on visual quality of natural images captured by cameras. In this paper, we explore visual quality of scanned documents, focusing on raster image areas. Different from many existing works which aim to estimate a visual quality score, we propose a machine learning based classification method to determine whether the visual quality of a scanned raster image at a given resolution setting is acceptable. We conduct a psychophysical study to determine the acceptability at different image resolutions based on human subject ratings and use them as the ground truth to train our machine learning model. However, this dataset is unbalanced as most images were rated as visually acceptable. To address the data imbalance problem, we introduce several noise models to simulate the degradation of image quality during the scanning process. Our results show that by including augmented data in training, we can significantly improve the performance of the classifier to determine whether the visual quality of raster images in a scanned document is acceptable or not for a given resolution setting.
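A hedged sketch of the augmentation idea: degraded training samples are synthesized from acceptable scans via blur and additive noise. The specific noise models here (Gaussian blur plus Gaussian sensor noise) are our assumptions, not necessarily the ones used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade_scan(img, blur_sigma=1.5, noise_std=0.05, rng=None):
    """img: float array in [0, 1]; returns a degraded copy simulating a poor scan."""
    rng = rng or np.random.default_rng()
    out = gaussian_filter(img, sigma=blur_sigma)              # optical blur
    out = out + rng.normal(0.0, noise_std, size=out.shape)    # additive sensor noise
    return np.clip(out, 0.0, 1.0)

# Synthesize progressively worse versions of an 'acceptable' image.
clean = 0.8 * np.ones((64, 64))
augmented = [degrade_scan(clean, blur_sigma=s) for s in (0.5, 1.0, 2.0)]
```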

One for Multiple: Physics-informed Synthetic Data Boosts Generalizable Deep Learning for Fast MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2307.13220
  • repo_url: https://github.com/wangziblake/pisf
  • paper_authors: Zi Wang, Xiaotong Yu, Chengyan Wang, Weibo Chen, Jiazheng Wang, Ying-Hua Chu, Hongwei Sun, Rushuai Li, Peiyong Li, Fan Yang, Haiwei Han, Taishan Kang, Jianzhong Lin, Chen Yang, Shufu Chang, Zhang Shi, Sha Hua, Yan Li, Juan Hu, Liuhong Zhu, Jianjun Zhou, Meijing Lin, Jiefeng Guo, Congbo Cai, Zhong Chen, Di Guo, Xiaobo Qu
  • for: This paper aims to shorten MRI scan times using deep learning (DL) for image reconstruction; the potential of existing DL methods across diverse imaging scenarios remains largely untapped.
  • methods: The study proposes a Physics-Informed Synthetic data learning framework (PISF) that enables multi-scenario MRI reconstruction with a single trained model. For a 2D image, reconstruction is decomposed into many 1D basic problems, starting from 1D data synthesis to facilitate generalization.
  • results: Trained on synthetic data with enhanced learning techniques, the model reconstructs in vivo MRI comparably to, or better than, models trained on matched realistic data, and it generalizes well across vendors and centers; evaluations by 10 experienced doctors confirmed its adaptability to patients.
    Abstract Magnetic resonance imaging (MRI) is a principal radiological modality that provides radiation-free, abundant, and diverse information about the whole human body for medical diagnosis, but suffers from prolonged scan time. The scan time can be significantly reduced through k-space undersampling but the introduced artifacts need to be removed in image reconstruction. Although deep learning (DL) has emerged as a powerful tool for image reconstruction in fast MRI, its potential in multiple imaging scenarios remains largely untapped. This is because not only is collecting large-scale and diverse realistic training data generally costly and privacy-restricted, but existing DL methods also struggle to handle the practically inevitable mismatch between training and target data. Here, we present a Physics-Informed Synthetic data learning framework for Fast MRI, called PISF, which is the first to enable generalizable DL for multi-scenario MRI reconstruction using solely one trained model. For a 2D image, the reconstruction is separated into many 1D basic problems and starts with the 1D data synthesis, to facilitate generalization. We demonstrate that training DL models on synthetic data, integrated with enhanced learning techniques, can achieve comparable or even better in vivo MRI reconstruction compared to models trained on a matched realistic dataset, reducing the demand for real-world MRI data by up to 96%. Moreover, our PISF shows impressive generalizability in multi-vendor multi-center imaging. Its excellent adaptability to patients has been verified through 10 experienced doctors' evaluations. PISF provides a feasible and cost-effective way to markedly boost the widespread usage of DL in various fast MRI applications, while freeing from the intractable ethical and practical considerations of in vivo human data acquisitions.
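The 1D data synthesis that PISF starts from can be illustrated with a toy example: a random sparse 1D signal is mapped to k-space, undersampled, and paired with its fully sampled ground truth. This is our simplification of the idea, not the released PISF code (https://github.com/wangziblake/pisf).

```python
import numpy as np

def make_1d_pair(n=256, n_peaks=5, accel=4, rng=None):
    """Return (zero-filled undersampled reconstruction, fully sampled ground truth)."""
    rng = rng or np.random.default_rng()
    x = np.zeros(n, dtype=complex)
    idx = rng.choice(n, size=n_peaks, replace=False)
    x[idx] = rng.random(n_peaks) + 1j * rng.random(n_peaks)  # random sparse 1D "image"
    k = np.fft.fft(x)                                        # fully sampled k-space
    mask = np.zeros(n)
    mask[::accel] = 1                                        # uniform undersampling mask
    zero_filled = np.fft.ifft(k * mask)                      # aliased network input
    return zero_filled, x

inp, label = make_1d_pair()
```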

Magnetic Resonance Parameter Mapping using Self-supervised Deep Learning with Model Reinforcement

  • paper_url: http://arxiv.org/abs/2307.13211
  • repo_url: None
  • paper_authors: Wanyu Bian, Albert Jang, Fang Liu
  • for: This paper proposes RELAX-MORE, a novel self-supervised learning method for quantitative MRI (qMRI) reconstruction.
  • methods: An optimization algorithm unrolls model-based qMRI reconstruction into a deep learning framework, generating accurate and robust MR parameter maps under imaging acceleration.
  • results: Across brain, knee, and phantom experiments, the method reconstructs MR parameters efficiently, corrects imaging artifacts, removes noise, and recovers image features under imperfect imaging conditions, significantly improving efficiency, accuracy, robustness, and generalizability over state-of-the-art methods and showing strong potential for the clinical translation of qMRI.
    Abstract This paper proposes a novel self-supervised learning method, RELAX-MORE, for quantitative MRI (qMRI) reconstruction. The proposed method uses an optimization algorithm to unroll a model-based qMRI reconstruction into a deep learning framework, enabling the generation of highly accurate and robust MR parameter maps at imaging acceleration. Unlike conventional deep learning methods requiring a large amount of training data, RELAX-MORE is a subject-specific method that can be trained on single-subject data through self-supervised learning, making it accessible and practically applicable to many qMRI studies. Using quantitative $T_1$ mapping as an example in different brain, knee and phantom experiments, the proposed method demonstrates excellent performance in reconstructing MR parameters, correcting imaging artifacts, removing noise, and recovering image features under imperfect imaging conditions. Compared with other state-of-the-art conventional and deep learning methods, RELAX-MORE significantly improves efficiency, accuracy, robustness, and generalizability for rapid MR parameter mapping. This work demonstrates the feasibility of a new self-supervised learning method for rapid MR parameter mapping, with great potential to enhance the clinical translation of qMRI.
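For context, quantitative $T_1$ mapping means estimating tissue parameters per voxel from several acquisitions. Below is a conventional inversion-recovery fit, shown only to make the reconstruction target concrete; RELAX-MORE replaces this kind of voxel-wise fitting with a self-supervised unrolled network.

```python
import numpy as np
from scipy.optimize import curve_fit

def ir_signal(ti, m0, t1):
    """Magnitude inversion-recovery signal model."""
    return np.abs(m0 * (1.0 - 2.0 * np.exp(-ti / t1)))

ti = np.array([50.0, 150.0, 400.0, 800.0, 1600.0, 3200.0])  # inversion times (ms)
noisy = ir_signal(ti, 1.0, 900.0) + 0.01 * np.random.randn(ti.size)

popt, _ = curve_fit(ir_signal, ti, noisy, p0=[1.0, 500.0])
print(f"fitted T1 = {popt[1]:.0f} ms")                       # should land near 900 ms
```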

Deep Learning Approaches for Data Augmentation in Medical Imaging: A Review

  • paper_url: http://arxiv.org/abs/2307.13125
  • repo_url: https://github.com/Arminsbss/tumor-classification
  • paper_authors: Aghiles Kebaili, Jérôme Lapuyade-Lahorgue, Su Ruan
  • for: This review addresses the limited availability of training data for deep learning in medical image analysis and the use of deep generative models to synthesize more realistic and diverse training data.
  • methods: Three families of deep generative models are surveyed: variational autoencoders, generative adversarial networks, and diffusion models, together with their applications in medical imaging.
  • results: The review summarizes the state of the art for each model family, their potential in downstream tasks including classification, segmentation, and cross-modal translation, the strengths and limitations of each model, and directions for future research.
    Abstract Deep learning has become a popular tool for medical image analysis, but the limited availability of training data remains a major challenge, particularly in the medical field where data acquisition can be costly and subject to privacy regulations. Data augmentation techniques offer a solution by artificially increasing the number of training samples, but these techniques often produce limited and unconvincing results. To address this issue, a growing number of studies have proposed the use of deep generative models to generate more realistic and diverse data that conform to the true distribution of the data. In this review, we focus on three types of deep generative models for medical image augmentation: variational autoencoders, generative adversarial networks, and diffusion models. We provide an overview of the current state of the art in each of these models and discuss their potential for use in different downstream tasks in medical imaging, including classification, segmentation, and cross-modal translation. We also evaluate the strengths and limitations of each model and suggest directions for future research in this field. Our goal is to provide a comprehensive review about the use of deep generative models for medical image augmentation and to highlight the potential of these models for improving the performance of deep learning algorithms in medical image analysis.
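To illustrate why generative models suit augmentation: once trained, new samples come from decoding draws from the latent prior, so synthetic images follow the learned data distribution. A generic stand-in sketch, not taken from the review:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(                         # stand-in for a trained VAE decoder
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64), nn.Sigmoid(),
)

z = torch.randn(16, 32)                          # draws from the latent prior N(0, I)
synthetic = decoder(z).view(16, 1, 64, 64)       # 16 new synthetic "images"
```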

In-Situ Thickness Measurement of Die Silicon Using Voltage Imaging for Hardware Assurance

  • paper_url: http://arxiv.org/abs/2307.13118
  • repo_url: None
  • paper_authors: Olivia P. Dizon-Paradis, Nitin Varshney, M Tanjidur Rahman, Michael Strizich, Haoting Shen, Navid Asadizanjani
  • for: This paper proposes a rapid method, based on electron beam voltage imaging, image processing, and Monte Carlo simulation, to keep material layers uniform during delayering.
  • methods: Electron beam voltage imaging, image processing, and Monte Carlo simulation are combined to measure the thickness of the remaining silicon, enabling real-time monitoring and correction of layer thickness during delayering.
  • results: The method measures silicon thickness quickly and accurately, supporting an accurate and efficient uniform delayering process.
    Abstract Hardware assurance of electronics is a challenging task and is of great interest to the government and the electronics industry. Physical inspection-based methods such as reverse engineering (RE) and Trojan scanning (TS) play an important role in hardware assurance. Therefore, there is a growing demand for automation in RE and TS. Many state-of-the-art physical inspection methods incorporate an iterative imaging and delayering workflow. In practice, uniform delayering can be challenging if the thickness of the initial layer of material is non-uniform. Moreover, this non-uniformity can reoccur at any stage during delayering and must be corrected. Therefore, it is critical to evaluate the thickness of the layers to be removed in a real-time fashion. Our proposed method uses electron beam voltage imaging, image processing, and Monte Carlo simulation to measure the thickness of the remaining silicon to guide a uniform delayering process.
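As a toy illustration of the Monte Carlo ingredient, the sketch below estimates the fraction of beam electrons passing through a silicon layer under a simple exponential free-path model, the kind of thickness-versus-signal curve a voltage-imaging measurement could be calibrated against. The physics here is deliberately simplified and is our assumption, not the paper's simulator.

```python
import numpy as np

def transmitted_fraction(thickness_um, mean_free_path_um=2.0, n_electrons=100_000):
    """Fraction of simulated electrons whose stopping depth exceeds the layer thickness."""
    rng = np.random.default_rng(0)
    depths = rng.exponential(mean_free_path_um, size=n_electrons)
    return float(np.mean(depths > thickness_um))

for d in (0.5, 1.0, 2.0, 4.0):
    print(f"{d:4.1f} um silicon -> transmitted fraction {transmitted_fraction(d):.3f}")
```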

Automatic Infant Respiration Estimation from Video: A Deep Flow-based Algorithm and a Novel Public Benchmark

  • paper_url: http://arxiv.org/abs/2307.13110
  • repo_url: https://github.com/ostadabbas/infant-respiration-estimation
  • paper_authors: Sai Kumar Reddy Manne, Shaotong Zhu, Sarah Ostadabbas, Michael Wan
  • for: This paper targets automatic, contactless respiratory monitoring for newborns.
  • methods: A deep learning method estimates respiratory rate and waveform from plain video footage captured in natural settings.
  • results: Trained on the AIR-125 infant dataset, the AIRFlowNet model significantly outperforms other state-of-the-art methods in respiratory rate estimation, with a mean absolute error of ~2.9 breaths per minute.
    Abstract Respiration is a critical vital sign for infants, and continuous respiratory monitoring is particularly important for newborns. However, neonates are sensitive and contact-based sensors present challenges in comfort, hygiene, and skin health, especially for preterm babies. As a step toward fully automatic, continuous, and contactless respiratory monitoring, we develop a deep-learning method for estimating respiratory rate and waveform from plain video footage in natural settings. Our automated infant respiration flow-based network (AIRFlowNet) combines video-extracted optical flow input and spatiotemporal convolutional processing tuned to the infant domain. We support our model with the first public annotated infant respiration dataset with 125 videos (AIR-125), drawn from eight infant subjects and spanning varied pose, lighting, and camera conditions. We include manual respiration annotations and optimize AIRFlowNet training on them using a novel spectral bandpass loss function. When trained and tested on the AIR-125 infant data, our method significantly outperforms other state-of-the-art methods in respiratory rate estimation, achieving a mean absolute error of $\sim$2.9 breaths per minute, compared to $\sim$4.7--6.2 for other public models designed for adult subjects and more uniform environments.
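A spectral bandpass loss of the kind the abstract mentions can be sketched as follows; the band limits (0.4-1.5 Hz, roughly 24-90 breaths per minute) and the exact formulation are our assumptions, not the released AIRFlowNet loss.

```python
import torch

def bandpass_loss(waveform, fps, low_hz=0.4, high_hz=1.5):
    """waveform: (batch, time) predicted respiration signal sampled at `fps` Hz."""
    spec = torch.fft.rfft(waveform, dim=-1).abs()
    freqs = torch.fft.rfftfreq(waveform.shape[-1], d=1.0 / fps)
    out_of_band = (freqs < low_hz) | (freqs > high_hz)
    # Fraction of spectral energy outside the plausible breathing band.
    return spec[:, out_of_band].sum() / (spec.sum() + 1e-8)

t = torch.arange(300) / 30.0                            # 10 s of video at 30 fps
pred = torch.sin(2 * torch.pi * 0.8 * t).unsqueeze(0)   # a clean 0.8 Hz waveform
print(bandpass_loss(pred, fps=30.0))                    # near zero: energy is in-band
```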

Framework for Automatic PCB Marking Detection and Recognition for Hardware Assurance

  • paper_url: http://arxiv.org/abs/2307.13105
  • repo_url: None
  • paper_authors: Olivia P. Dizon-Paradis, Daniel E. Capecci, Nathan T. Jessurun, Damon L. Woodard, Mark M. Tehranipoor, Navid Asadizanjani
  • for: This work proposes automatic extraction of PCB markings to enable high-accuracy automated hardware assurance for government and the electronics industry.
  • methods: The paper discusses the challenges of automatic PCB marking extraction and proposes a plan for collecting salient PCB marking data, together with a framework for incorporating that data into automatic PCB assurance.
  • results: The dataset plan and framework are laid out alongside future work, implications, and open research possibilities for improving automated hardware assurance.
    Abstract A Bill of Materials (BoM) is a list of all components on a printed circuit board (PCB). Since BoMs are useful for hardware assurance, automatic BoM extraction (AutoBoM) is of great interest to the government and electronics industry. To achieve a high-accuracy AutoBoM process, domain knowledge of PCB text and logos must be utilized. In this study, we discuss the challenges associated with automatic PCB marking extraction and propose 1) a plan for collecting salient PCB marking data, and 2) a framework for incorporating this data for automatic PCB assurance. Given the proposed dataset plan and framework, future work, implications, and open research possibilities are detailed.
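Any AutoBoM pipeline ultimately needs to read text markings off board images. As a minimal illustration (not the paper's framework, which adds marking detection, logo recognition, and PCB domain knowledge), off-the-shelf OCR can provide the raw text; the input filename is hypothetical.

```python
from PIL import Image
import pytesseract  # assumes the Tesseract OCR binary is installed

def extract_markings(image_path):
    """Return non-empty text lines read from a PCB photograph."""
    img = Image.open(image_path).convert("L")   # grayscale helps silkscreen text
    text = pytesseract.image_to_string(img)
    return [line.strip() for line in text.splitlines() if line.strip()]

# markings = extract_markings("pcb_top.png")   # hypothetical input image
```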

Enhancing image captioning with depth information using a Transformer-based framework

  • paper_url: http://arxiv.org/abs/2308.03767
  • repo_url: None
  • paper_authors: Aya Mahmoud Ahmed, Mohamed Yousef, Khaled F. Hussain, Yousef Bassyouni Mahdy
  • for: This paper improves scene understanding for image captioning by integrating RGB images and their corresponding depth maps in a Transformer-based encoder-decoder framework that generates multi-sentence descriptions of 3D scenes.
  • methods: RGB and depth inputs are fused before a Transformer architecture generates the multi-sentence descriptions; several fusion approaches are explored to find the best-performing one.
  • results: Experiments show that fusing RGB with depth improves captioning whether the depth maps are ground truth or estimated; the paper also releases a cleaned, more consistent version of the NYU-v2 dataset that fixes its inconsistent labeling.
    Abstract Captioning images is a challenging scene-understanding task that connects computer vision and natural language processing. While image captioning models have been successful in producing excellent descriptions, the field has primarily focused on generating a single sentence for 2D images. This paper investigates whether integrating depth information with RGB images can enhance the captioning task and generate better descriptions. For this purpose, we propose a Transformer-based encoder-decoder framework for generating a multi-sentence description of a 3D scene. The RGB image and its corresponding depth map are provided as inputs to our framework, which combines them to produce a better understanding of the input scene. Depth maps could be ground truth or estimated, which makes our framework widely applicable to any RGB captioning dataset. We explored different fusion approaches to fuse RGB and depth images. The experiments are performed on the NYU-v2 dataset and the Stanford image paragraph captioning dataset. During our work with the NYU-v2 dataset, we found inconsistent labeling that prevents depth information from benefiting the captioning task. The results were even worse than using RGB images only. As a result, we propose a cleaned version of the NYU-v2 dataset that is more consistent and informative. Our results on both datasets demonstrate that the proposed framework effectively benefits from depth information, whether it is ground truth or estimated, and generates better captions. Code, pre-trained models, and the cleaned version of the NYU-v2 dataset will be made publicly available.
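One simple fusion approach among those the paper explores can be sketched as early fusion: the depth map is concatenated as a fourth input channel before patch embedding. The paper's exact fusion variants may differ; this is an illustrative sketch only.

```python
import torch
import torch.nn as nn

class EarlyFusionEmbed(nn.Module):
    """Patch embedding over a 4-channel RGB-D input (early fusion)."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)               # (B, 4, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, num_patches, dim)

rgb = torch.rand(1, 3, 224, 224)
depth = torch.rand(1, 1, 224, 224)                       # ground-truth or estimated depth
tokens = EarlyFusionEmbed()(rgb, depth)                  # feed to a Transformer encoder
print(tokens.shape)                                      # torch.Size([1, 196, 256])
```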