cs.CV - 2023-11-19

Improved Defect Detection and Classification Method for Advanced IC Nodes by Using Slicing Aided Hyper Inference with Refinement Strategy

  • paper_url: http://arxiv.org/abs/2311.11439
  • repo_url: None
  • paper_authors: Vic De Ridder, Bappaditya Dey, Victor Blanco, Sandip Halder, Bartel Van Waeyenberge
  • for: This paper aims to improve existing defect inspection techniques to meet the fine-grained defect detection requirements of high-NA (Numerical Aperture) EUVL (Extreme-Ultraviolet Lithography) manufacturing (2 nm node and beyond).
  • methods: The work uses the Slicing Aided Hyper Inference (SAHI) framework, performing inference on size-increased slices of SEM images so that the object detector's receptive field captures small defect instances more effectively.
  • results: On previously studied semiconductor datasets, SAHI improves the detection of small defects by approximately 2x; on a new test dataset with scenarios not encountered during training, it achieves flawless detection where previously trained models fail. The paper also formulates an extension of SAHI that eliminates false-positive predictions without significantly reducing true positives.
    Abstract In semiconductor manufacturing, lithography has often been the manufacturing step defining the smallest possible pattern dimensions. In recent years, progress has been made towards high-NA (Numerical Aperture) EUVL (Extreme-Ultraviolet-Lithography) paradigm, which promises to advance pattern shrinking (2 nm node and beyond). However, a significant increase in stochastic defects and the complexity of defect detection becomes more pronounced with high-NA. Present defect inspection techniques (both non-machine learning and machine learning based), fail to achieve satisfactory performance at high-NA dimensions. In this work, we investigate the use of the Slicing Aided Hyper Inference (SAHI) framework for improving upon current techniques. Using SAHI, inference is performed on size-increased slices of the SEM images. This leads to the object detector's receptive field being more effective in capturing small defect instances. First, the performance on previously investigated semiconductor datasets is benchmarked across various configurations, and the SAHI approach is demonstrated to substantially enhance the detection of small defects, by approx. 2x. Afterwards, we also demonstrated application of SAHI leads to flawless detection rates on a new test dataset, with scenarios not encountered during training, whereas previous trained models failed. Finally, we formulate an extension of SAHI that does not significantly reduce true-positive predictions while eliminating false-positive predictions.
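The slicing-based inference loop behind SAHI can be illustrated with a short sketch. This is not the authors' implementation: `run_detector` is a hypothetical callable returning (x1, y1, x2, y2, score) boxes for a single image, and the tile size, overlap, and upscaling factor are assumed parameters.

```python
import numpy as np

def _iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_nms(dets, iou_thr=0.5):
    """Keep the highest-scoring box among mutually overlapping detections."""
    kept = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(_iou(d, k) < iou_thr for k in kept):
            kept.append(d)
    return kept

def sliced_inference(image, run_detector, tile=256, overlap=0.25, upscale=2):
    """Run the detector on upscaled, overlapping slices of a large SEM image
    and map the resulting boxes back to full-image coordinates."""
    h, w = image.shape[:2]
    step = max(1, int(tile * (1 - overlap)))
    y_starts = sorted(set(list(range(0, max(h - tile, 0), step)) + [max(h - tile, 0)]))
    x_starts = sorted(set(list(range(0, max(w - tile, 0), step)) + [max(w - tile, 0)]))
    detections = []
    for y0 in y_starts:
        for x0 in x_starts:
            patch = image[y0:y0 + tile, x0:x0 + tile]
            # Enlarging the slice makes small defects occupy more of the
            # detector's effective receptive field.
            patch = np.repeat(np.repeat(patch, upscale, axis=0), upscale, axis=1)
            for x1, y1, x2, y2, score in run_detector(patch):
                detections.append((x0 + x1 / upscale, y0 + y1 / upscale,
                                   x0 + x2 / upscale, y0 + y2 / upscale, score))
    # Overlapping slices produce duplicate boxes; merge them.
    return greedy_nms(detections)
```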

DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.11417
  • repo_url: None
  • paper_authors: Zhenghao Pan, Haijin Zeng, Jiezhang Cao, Kai Zhang, Yongyong Chen
  • for: DiffSCI aims to advance the precision of snapshot compressive imaging (SCI) reconstruction for multispectral images (MSIs), combining the structural insights of deep-prior and optimization-based methodologies with the generative capabilities of a contemporary denoising diffusion model.
  • methods: Firstly, a pre-trained diffusion model, trained on a substantial corpus of RGB images, is employed as the generative denoiser within the Plug-and-Play framework for the first time, enabling SCI reconstruction in cases that current methods struggle to address effectively. Secondly, spectral band correlations are systematically accounted for, and a robust methodology is introduced to mitigate wavelength mismatch, enabling seamless adaptation of the RGB diffusion model to MSIs.
  • results: Extensive testing shows that DiffSCI delivers discernible performance gains over prevailing self-supervised and zero-shot approaches, surpassing even supervised transformer counterparts on both simulated and real datasets.
    Abstract This paper endeavors to advance the precision of snapshot compressive imaging (SCI) reconstruction for multispectral image (MSI). To achieve this, we integrate the advantageous attributes of established SCI techniques and an image generative model, propose a novel structured zero-shot diffusion model, dubbed DiffSCI. DiffSCI leverages the structural insights from the deep prior and optimization-based methodologies, complemented by the generative capabilities offered by the contemporary denoising diffusion model. Specifically, firstly, we employ a pre-trained diffusion model, which has been trained on a substantial corpus of RGB images, as the generative denoiser within the Plug-and-Play framework for the first time. This integration allows for the successful completion of SCI reconstruction, especially in the case that current methods struggle to address effectively. Secondly, we systematically account for spectral band correlations and introduce a robust methodology to mitigate wavelength mismatch, thus enabling seamless adaptation of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is implemented to expedite the resolution of the data subproblem. This augmentation not only accelerates the convergence rate but also elevates the quality of the reconstruction process. We present extensive testing to show that DiffSCI exhibits discernible performance enhancements over prevailing self-supervised and zero-shot approaches, surpassing even supervised transformer counterparts across both simulated and real datasets. Our code will be available.

Enhancing Low-dose CT Image Reconstruction by Integrating Supervised and Unsupervised Learning

  • paper_url: http://arxiv.org/abs/2311.12071
  • repo_url: None
  • paper_authors: Ling Chen, Zhishen Huang, Yong Long, Saiprasad Ravishankar
  • for: This work aims to improve the accuracy and efficiency of low-dose computed tomography (CT) image reconstruction.
  • methods: The study proposes a hybrid supervised-unsupervised learning framework that combines model-based image reconstruction (MBIR) solvers and deep learning reconstructors.
  • results: Experiments show that the proposed framework performs favorably compared with recent low-dose CT reconstruction methods.
    Abstract Traditional model-based image reconstruction (MBIR) methods combine forward and noise models with simple object priors. Recent application of deep learning methods for image reconstruction provides a successful data-driven approach to addressing the challenges when reconstructing images with undersampled measurements or various types of noise. In this work, we propose a hybrid supervised-unsupervised learning framework for X-ray computed tomography (CT) image reconstruction. The proposed learning formulation leverages both sparsity or unsupervised learning-based priors and neural network reconstructors to simulate a fixed-point iteration process. Each proposed trained block consists of a deterministic MBIR solver and a neural network. The information flows in parallel through these two reconstructors and is then optimally combined. Multiple such blocks are cascaded to form a reconstruction pipeline. We demonstrate the efficacy of this learned hybrid model for low-dose CT image reconstruction with limited training data, where we use the NIH AAPM Mayo Clinic Low Dose CT Grand Challenge dataset for training and testing. In our experiments, we study combinations of supervised deep network reconstructors and MBIR solver with learned sparse representation-based priors or analytical priors. Our results demonstrate the promising performance of the proposed framework compared to recent low-dose CT reconstruction methods.
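As a rough sketch of the block structure described above, the following PyTorch snippet runs a model-based update and a small CNN in parallel on the current estimate and mixes them with a learnable weight, then cascades several such blocks. The `mbir_step` callable, the CNN design, and the simple convex-combination rule are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """One block of the cascade (illustrative): an MBIR update and a CNN
    reconstructor run in parallel, and their outputs are optimally mixed
    here via a single learnable weight."""
    def __init__(self, mbir_step, channels=1):
        super().__init__()
        self.mbir_step = mbir_step          # deterministic solver, e.g. a few MBIR iterations
        self.cnn = nn.Sequential(           # small refinement network
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing weight

    def forward(self, x, y):
        x_mbir = self.mbir_step(x, y)       # model-based update using measurements y
        x_cnn = self.cnn(x)                 # data-driven update
        w = torch.sigmoid(self.alpha)       # keep the weight in (0, 1)
        return w * x_mbir + (1.0 - w) * x_cnn

class CascadedReconstructor(nn.Module):
    """Cascade of hybrid blocks simulating a fixed-point iteration."""
    def __init__(self, mbir_step, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(HybridBlock(mbir_step) for _ in range(num_blocks))

    def forward(self, x0, y):
        x = x0
        for block in self.blocks:
            x = block(x, y)
        return x
```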

FDDM: Unsupervised Medical Image Translation with a Frequency-Decoupled Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.12070
  • repo_url: None
  • paper_authors: Yunxiang Li, Hua-Chieh Shao, Xiaoxue Qian, You Zhang
  • for: The paper aims to improve the quality and accuracy of medical image translation with a novel framework, the frequency-decoupled diffusion model (FDDM).
  • methods: FDDM decouples the frequency components of medical images in the Fourier domain during translation, allowing structure-preserved, high-quality image conversion. It applies an unsupervised frequency conversion module and uses the frequency-specific information to guide a following diffusion model for the final source-to-target image translation.
  • results: FDDM outperformed other GAN-, VAE-, and diffusion-based models in image quality and faithfulness to the original anatomical structures, achieving an FID of 29.88, less than half that of the second-best method, while maintaining the accuracy of translated anatomical structures.
    Abstract Diffusion models have demonstrated significant potential in producing high-quality images for medical image translation to aid disease diagnosis, localization, and treatment. Nevertheless, current diffusion models have limited success in achieving faithful image translations that can accurately preserve the anatomical structures of medical images, especially for unpaired datasets. The preservation of structural and anatomical details is essential to reliable medical diagnosis and treatment planning, as structural mismatches can lead to disease misidentification and treatment errors. In this study, we introduced a frequency-decoupled diffusion model (FDDM), a novel framework that decouples the frequency components of medical images in the Fourier domain during the translation process, to allow structure-preserved high-quality image conversion. FDDM applies an unsupervised frequency conversion module to translate the source medical images into frequency-specific outputs and then uses the frequency-specific information to guide a following diffusion model for final source-to-target image translation. We conducted extensive evaluations of FDDM using a public brain MR-to-CT translation dataset, showing its superior performance against other GAN-, VAE-, and diffusion-based models. Metrics including the Frechet inception distance (FID), the peak signal-to-noise ratio (PSNR), and the structural similarity index measure (SSIM) were assessed. FDDM achieves an FID of 29.88, less than half of the second best. These results demonstrated FDDM's prowess in generating highly-realistic target-domain images while maintaining the faithfulness of translated anatomical structures.
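The basic idea of decoupling frequency components in the Fourier domain can be illustrated with a few lines of NumPy. The radial cutoff and the hard low/high split below are assumptions for illustration only (and the input is assumed to be a 2-D grayscale image); FDDM's actual frequency conversion module is learned and considerably more elaborate.

```python
import numpy as np

def frequency_decouple(image, cutoff=0.1):
    """Split a 2-D image into low- and high-frequency components in the
    Fourier domain (illustrative of frequency decoupling)."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalised radial distance from the spectrum centre.
    r = np.sqrt(((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2)
    low_mask = (r <= cutoff).astype(float)
    low = np.fft.ifft2(np.fft.ifftshift(f * low_mask)).real        # coarse intensity / anatomy
    high = np.fft.ifft2(np.fft.ifftshift(f * (1 - low_mask))).real  # edges and fine structure
    return low, high
```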

A Survey of Emerging Applications of Diffusion Probabilistic Models in MRI

  • paper_url: http://arxiv.org/abs/2311.11383
  • repo_url: None
  • paper_authors: Yuheng Fan, Hanxi Liao, Shiqi Huang, Yimin Luo, Huazhu Fu, Haikun Qi
  • for: This article reviews applications of diffusion probabilistic models (DPMs) in medical imaging, helping researchers in the MRI community grasp the advances of DPMs across different applications.
  • methods: It first introduces the theory of the two dominant kinds of DPMs, categorized by whether the diffusion time step is discrete or continuous, and then provides a comprehensive review of emerging DPMs in MRI, covering reconstruction, image generation, image translation, segmentation, anomaly detection, and further research topics.
  • results: The article discusses the general limitations of DPMs as well as limitations specific to MRI tasks, and points out potential areas worth further exploration.
    Abstract Diffusion probabilistic models (DPMs) which employ explicit likelihood characterization and a gradual sampling process to synthesize data, have gained increasing research interest. Despite their huge computational burdens due to the large number of steps involved during sampling, DPMs are widely appreciated in various medical imaging tasks for their high-quality and diversity of generation. Magnetic resonance imaging (MRI) is an important medical imaging modality with excellent soft tissue contrast and superb spatial resolution, which possesses unique opportunities for diffusion models. Although there is a recent surge of studies exploring DPMs in MRI, a survey paper of DPMs specifically designed for MRI applications is still lacking. This review article aims to help researchers in the MRI community to grasp the advances of DPMs in different applications. We first introduce the theory of two dominant kinds of DPMs, categorized according to whether the diffusion time step is discrete or continuous, and then provide a comprehensive review of emerging DPMs in MRI, including reconstruction, image generation, image translation, segmentation, anomaly detection, and further research topics. Finally, we discuss the general limitations as well as limitations specific to the MRI tasks of DPMs and point out potential areas that are worth further exploration.

Evidential Uncertainty Quantification: A Variance-Based Perspective

  • paper_url: http://arxiv.org/abs/2311.11367
  • repo_url: https://github.com/kerrydrx/evidentialada
  • paper_authors: Ruxiao Duan, Brian Caffo, Harrison X. Bai, Haris I. Sair, Craig Jones
  • for: This work aims to provide a method for quantifying uncertainty in deep neural networks that is applicable to downstream tasks such as active learning and domain adaptation.
  • methods: The study adapts the variance-based approach of evidential deep learning from regression to classification, quantifying uncertainty at the class level via class covariance decomposition based on the law of total covariance.
  • results: Experiments on cross-domain datasets show that the variance-based approach achieves accuracy comparable to the entropy-based one in active domain adaptation, while additionally providing class-wise uncertainties and between-class correlations.
    Abstract Uncertainty quantification of deep neural networks has become an active field of research and plays a crucial role in various downstream tasks such as active learning. Recent advances in evidential deep learning shed light on the direct quantification of aleatoric and epistemic uncertainties with a single forward pass of the model. Most traditional approaches adopt an entropy-based method to derive evidential uncertainty in classification, quantifying uncertainty at the sample level. However, the variance-based method that has been widely applied in regression problems is seldom used in the classification setting. In this work, we adapt the variance-based approach from regression to classification, quantifying classification uncertainty at the class level. The variance decomposition technique in regression is extended to class covariance decomposition in classification based on the law of total covariance, and the class correlation is also derived from the covariance. Experiments on cross-domain datasets are conducted to illustrate that the variance-based approach not only results in similar accuracy as the entropy-based one in active domain adaptation but also brings information about class-wise uncertainties as well as between-class correlations. The code is available at https://github.com/KerryDRX/EvidentialADA. This alternative means of evidential uncertainty quantification will give researchers more options when class uncertainties and correlations are important in their applications.
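Below is a minimal sketch of how class-level variances and correlations can be derived from Dirichlet evidence via the law of total covariance. It follows standard Dirichlet moments and is only illustrative; the exact decomposition and loss used in the paper may differ in detail.

```python
import numpy as np

def class_uncertainties(evidence):
    """Variance-based evidential uncertainty for one sample (illustrative).

    `evidence` is a non-negative vector (e.g. a ReLU'd network output), and
    the Dirichlet parameters are alpha = evidence + 1. With S = sum(alpha),
    p = alpha / S and a one-hot categorical label y with p ~ Dirichlet(alpha),
    the law of total covariance gives
        Cov[y] = E[Cov[y|p]] + Cov[E[y|p]]
               = S/(S+1) * (diag(p) - p p^T)   (aleatoric part)
               + 1/(S+1) * (diag(p) - p p^T)   (epistemic part).
    """
    alpha = np.asarray(evidence, dtype=float) + 1.0
    s = alpha.sum()
    p = alpha / s
    total_cov = np.diag(p) - np.outer(p, p)
    aleatoric_cov = s / (s + 1.0) * total_cov
    epistemic_cov = total_cov / (s + 1.0)
    # Class-wise variances are the diagonals; class correlations come from
    # normalising the off-diagonal covariances.
    std = np.sqrt(np.diag(total_cov)) + 1e-12
    correlation = total_cov / np.outer(std, std)
    return np.diag(aleatoric_cov), np.diag(epistemic_cov), correlation
```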

Scale-aware competition network for palmprint recognition

  • paper_url: http://arxiv.org/abs/2311.11354
  • repo_url: None
  • paper_authors: Chengrui Gao, Ziyuan Yang, Min Zhu, Andrew Beng Jin Teoh
  • for: This paper aims to improve the recognition performance of palmprint biometrics by addressing the limitation of prior methodologies that only focus on texture orientation and neglect the significant texture scale dimension.
  • methods: The proposed method, called SAC-Net, consists of two modules: Inner-Scale Competition Module (ISCM) and Across-Scale Competition Module (ASCM). ISCM integrates learnable Gabor filters and a self-attention mechanism to extract rich orientation data, while ASCM leverages a competitive strategy across various scales to effectively encapsulate competitive texture scale elements.
  • results: The proposed method was tested on three benchmark datasets and showed exceptional recognition performance and resilience relative to state-of-the-art alternatives.
    Abstract Palmprint biometrics garner heightened attention in palm-scanning payment and social security due to their distinctive attributes. However, prevailing methodologies singularly prioritize texture orientation, neglecting the significant texture scale dimension. We design an innovative network for concurrently extracting intra-scale and inter-scale features to redress this limitation. This paper proposes a scale-aware competitive network (SAC-Net), which includes the Inner-Scale Competition Module (ISCM) and the Across-Scale Competition Module (ASCM) to capture texture characteristics related to orientation and scale. ISCM efficiently integrates learnable Gabor filters and a self-attention mechanism to extract rich orientation data and discern textures with long-range discriminative properties. Subsequently, ASCM leverages a competitive strategy across various scales to effectively encapsulate the competitive texture scale elements. By synergizing ISCM and ASCM, our method adeptly characterizes palmprint features. Rigorous experimentation across three benchmark datasets unequivocally demonstrates our proposed approach's exceptional recognition performance and resilience relative to state-of-the-art alternatives.

MoVideo: Motion-Aware Video Generation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.11325
  • repo_url: None
  • paper_authors: Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan
  • for: The paper proposes a motion-aware video generation (MoVideo) framework to address the lack of explicit motion modeling in existing video generation methods.
  • methods: The framework considers motion from two aspects: video depth and optical flow. Video depth regulates motion via per-frame object distances and spatial layouts, while optical flow describes motion via cross-frame correspondences that help preserve fine details and improve temporal consistency.
  • results: Experiments show that MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, with strong prompt consistency, frame consistency, and visual quality.
    Abstract While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the later describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that exists or generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, optical flow-based warped latent video and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency and visual quality.
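One ingredient mentioned above, warping a latent frame with a dense optical flow field, can be sketched with torch.nn.functional.grid_sample. This is a generic backward-warping utility, not the authors' code; the flow convention (per-pixel displacements in pixels) is an assumption.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent, flow):
    """Backward-warp a latent frame with a dense optical flow field.

    latent: (B, C, H, W) features of the source frame.
    flow:   (B, 2, H, W) displacements in pixels, flow[:, 0] = dx, flow[:, 1] = dy.
    """
    b, _, h, w = latent.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h, device=latent.device),
                            torch.arange(w, device=latent.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalise to [-1, 1] as required by grid_sample; grid[..., 0] is x.
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(latent, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```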

Discrete approximations of Gaussian smoothing and Gaussian derivatives

  • paper_url: http://arxiv.org/abs/2311.11317
  • repo_url: None
  • paper_authors: Tony Lindeberg
  • for: This work studies how to approximate Gaussian smoothing and Gaussian derivative computations in scale-space theory for application to discrete data.
  • methods: Three main discretization methods are considered: (i) sampling the Gaussian kernels and Gaussian derivative kernels, (ii) locally integrating the Gaussian kernels and Gaussian derivative kernels over each pixel support region, and (iii) basing the scale-space analysis on the discrete analogue of the Gaussian kernel and computing derivative approximations by applying small-support central difference operators to the spatially smoothed image data.
  • results: The sampled and integrated Gaussian kernels and derivatives perform very poorly at very fine scales, where the discrete analogue of the Gaussian kernel and its corresponding discrete derivative approximations perform substantially better. The sampled Gaussian kernel and derivatives do, however, give numerically very good approximations of the continuous results once the scale parameter is sufficiently large, greater than about 1 in units of the grid spacing.
    Abstract This paper develops an in-depth treatment concerning the problem of approximating the Gaussian smoothing and Gaussian derivative computations in scale-space theory for application on discrete data. With close connections to previous axiomatic treatments of continuous and discrete scale-space theory, we consider three main ways discretizing these scale-space operations in terms of explicit discrete convolutions, based on either (i) sampling the Gaussian kernels and the Gaussian derivative kernels, (ii) locally integrating the Gaussian kernels and the Gaussian derivative kernels over each pixel support region and (iii) basing the scale-space analysis on the discrete analogue of the Gaussian kernel, and then computing derivative approximations by applying small-support central difference operators to the spatially smoothed image data. We study the properties of these three main discretization methods both theoretically and experimentally, and characterize their performance by quantitative measures, including the results they give rise to with respect to the task of scale selection, investigated for four different use cases, and with emphasis on the behaviour at fine scales. The results show that the sampled Gaussian kernels and derivatives as well as the integrated Gaussian kernels and derivatives perform very poorly at very fine scales. At very fine scales, the discrete analogue of the Gaussian kernel with its corresponding discrete derivative approximations performs substantially better. The sampled Gaussian kernel and the sampled Gaussian derivatives do, on the other hand, lead to numerically very good approximations of the corresponding continuous results, when the scale parameter is sufficiently large, in the experiments presented in the paper, when the scale parameter is greater than a value of about 1, in units of the grid spacing.
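Two of the discretizations discussed above, the sampled Gaussian kernel and the discrete analogue of the Gaussian kernel (expressed via modified Bessel functions), can be written down directly. The truncation radius, scale value, and the small-support central-difference operator below are illustrative choices, not the paper's experimental settings.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function

def sampled_gaussian(t, radius):
    """Sampled Gaussian kernel g(n; t) = exp(-n^2 / (2t)) / sqrt(2*pi*t)."""
    n = np.arange(-radius, radius + 1)
    return np.exp(-n**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

def discrete_gaussian(t, radius):
    """Discrete analogue of the Gaussian kernel, T(n; t) = exp(-t) I_n(t),
    which equals ive(|n|, t) for t > 0."""
    n = np.arange(-radius, radius + 1)
    return ive(np.abs(n), t)

def first_derivative(signal):
    """Small-support central-difference approximation of the first derivative."""
    return np.convolve(signal, [0.5, 0.0, -0.5], mode="same")

# Example: at a fine scale the two discretizations differ noticeably.
t = 0.5  # scale parameter (variance), in units of the grid spacing squared
print(sampled_gaussian(t, 3).round(4))
print(discrete_gaussian(t, 3).round(4))
```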

Optimizing rgb-d semantic segmentation through multi-modal interaction and pooling attention

  • paper_url: http://arxiv.org/abs/2311.11312
  • repo_url: None
  • paper_authors: Shuai Zhang, Minghong Xie
  • for: Improve the accuracy of RGB-D image semantic segmentation.
  • methods: A Multi-modal Interaction Fusion Module (MIM) and a Pooling Attention Module (PAM) fuse RGB and depth information, and the module outputs are integrated into the decoder in a targeted manner.
  • results: On the NYUDv2 and SUN-RGBD indoor scene datasets, MIPANet outperforms existing methods, improving the accuracy of RGB-D semantic segmentation.
    Abstract Semantic segmentation of RGB-D images involves understanding the appearance and spatial relationships of objects within a scene, which requires careful consideration of various factors. However, in indoor environments, the simple input of RGB and depth images often results in a relatively limited acquisition of semantic and spatial information, leading to suboptimal segmentation outcomes. To address this, we propose the Multi-modal Interaction and Pooling Attention Network (MIPANet), a novel approach designed to harness the interactive synergy between RGB and depth modalities, optimizing the utilization of complementary information. Specifically, we incorporate a Multi-modal Interaction Fusion Module (MIM) into the deepest layers of the network. This module is engineered to facilitate the fusion of RGB and depth information, allowing for mutual enhancement and correction. Additionally, we introduce a Pooling Attention Module (PAM) at various stages of the encoder. This module serves to amplify the features extracted by the network and integrates the module's output into the decoder in a targeted manner, significantly improving semantic segmentation performance. Our experimental results demonstrate that MIPANet outperforms existing methods on two indoor scene datasets, NYUDv2 and SUN-RGBD, underscoring its effectiveness in enhancing RGB-D semantic segmentation.
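The abstract does not specify the internals of the Pooling Attention Module, so the snippet below is purely a generic illustration of pooling-based channel attention (a squeeze-and-excitation-style block): pooled statistics are used to re-weight and amplify feature channels. The actual PAM in the paper may be structured quite differently.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Generic pooling-attention block (illustrative only): global average
    pooling produces a channel descriptor that re-weights the feature map."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: global average pooling
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                           # excite: amplify informative channels
```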

UMAAF: Unveiling Aesthetics via Multifarious Attributes of Images

  • paper_url: http://arxiv.org/abs/2311.11306
  • repo_url: None
  • paper_authors: Weijie Li, Yitian Wan, Xingjiao Wu, Junjie Xu, Cheng Jin, Liang He
  • for: The paper focuses on Image Aesthetic Assessment (IAA) and proposes a Unified Multi-attribute Aesthetic Assessment Framework (UMAAF) to better utilize image attributes in aesthetic assessment.
  • methods: The paper uses a combination of absolute-attribute perception modules and an absolute-attribute interacting network to extract and integrate absolute attribute features, as well as a Relative-Relation Loss function to model the relative attribute of images.
  • results: The proposed UMAAF achieves state-of-the-art performance on the TAD66K and AVA datasets, and multiple experiments demonstrate the effectiveness of each module and the model's alignment with human preference.
    Abstract With the increasing prevalence of smartphones and websites, Image Aesthetic Assessment (IAA) has become increasingly crucial. While the significance of attributes in IAA is widely recognized, many attribute-based methods lack consideration for the selection and utilization of aesthetic attributes. Our initial step involves the acquisition of aesthetic attributes from both intra- and inter-perspectives. Within the intra-perspective, we extract the direct visual attributes of images, constituting the absolute attribute. In the inter-perspective, our focus lies in modeling the relative score relationships between images within the same sequence, forming the relative attribute. Then, to better utilize image attributes in aesthetic assessment, we propose the Unified Multi-attribute Aesthetic Assessment Framework (UMAAF) to model both absolute and relative attributes of images. For absolute attributes, we leverage multiple absolute-attribute perception modules and an absolute-attribute interacting network. The absolute-attribute perception modules are first pre-trained on several absolute-attribute learning tasks and then used to extract corresponding absolute attribute features. The absolute-attribute interacting network adaptively learns the weight of diverse absolute-attribute features, effectively integrating them with generic aesthetic features from various absolute-attribute perspectives and generating the aesthetic prediction. To model the relative attribute of images, we consider the relative ranking and relative distance relationships between images in a Relative-Relation Loss function, which boosts the robustness of the UMAAF. Furthermore, UMAAF achieves state-of-the-art performance on TAD66K and AVA datasets, and multiple experiments demonstrate the effectiveness of each module and the model's alignment with human preference.
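A hedged sketch of a pairwise loss that combines relative-ranking and relative-distance terms, in the spirit of the Relative-Relation Loss, is given below. The margin, the equal weighting of the two terms, and the assumption that `pred` and `target` are 1-D score tensors over a batch are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def relative_relation_loss(pred, target, margin=0.1):
    """Illustrative pairwise loss over a batch of aesthetic scores:
    (i) a hinge term preserving the relative order of scores and
    (ii) a term preserving relative score gaps.

    pred, target: (B,) tensors of predicted and ground-truth scores.
    """
    p_i, p_j = pred.unsqueeze(1), pred.unsqueeze(0)
    t_i, t_j = target.unsqueeze(1), target.unsqueeze(0)
    sign = torch.sign(t_i - t_j)                       # ground-truth ranking direction
    rank_term = F.relu(margin - sign * (p_i - p_j))    # penalise wrongly ordered pairs
    dist_term = ((p_i - p_j) - (t_i - t_j)).abs()      # preserve relative distances
    mask = sign.abs()                                  # ignore ties and i == j
    denom = mask.sum().clamp(min=1.0)
    return ((rank_term + dist_term) * mask).sum() / denom
```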

Exchanging Dual Encoder-Decoder: A New Strategy for Change Detection with Semantic Guidance and Spatial Localization

  • paper_url: http://arxiv.org/abs/2311.11302
  • repo_url: None
  • paper_authors: Sijie Zhao, Xueliang Zhang, Pengfeng Xiao, Guangjun He
  • for: The paper targets binary change detection in earth observation applications, focusing on the problems of bitemporal feature interference in feature-level fusion and the inapplicability of existing architectures to intraclass and multiview building change detection.
  • methods: It proposes a new exchanging dual encoder-decoder structure that fuses bitemporal features at the decision level and uses bitemporal semantic features to determine changed areas.
  • results: The model matches or exceeds state-of-the-art methods on six datasets, with F1-scores of 97.77%, 83.07%, 94.86%, 92.33%, 91.39%, and 74.35%.
    Abstract Change detection is a critical task in earth observation applications. Recently, deep learning-based methods have shown promising performance and are quickly adopted in change detection. However, the widely used multiple encoder and single decoder (MESD) as well as dual encoder-decoder (DED) architectures still struggle to effectively handle change detection well. The former has problems of bitemporal feature interference in the feature-level fusion, while the latter is inapplicable to intraclass change detection and multiview building change detection. To solve these problems, we propose a new strategy with an exchanging dual encoder-decoder structure for binary change detection with semantic guidance and spatial localization. The proposed strategy solves the problems of bitemporal feature inference in MESD by fusing bitemporal features in the decision level and the inapplicability in DED by determining changed areas using bitemporal semantic features. We build a binary change detection model based on this strategy, and then validate and compare it with 18 state-of-the-art change detection methods on six datasets in three scenarios, including intraclass change detection datasets (CDD, SYSU), single-view building change detection datasets (WHU, LEVIR-CD, LEVIR-CD+) and a multiview building change detection dataset (NJDS). The experimental results demonstrate that our model achieves superior performance with high efficiency and outperforms all benchmark methods with F1-scores of 97.77%, 83.07%, 94.86%, 92.33%, 91.39%, 74.35% on CDD, SYSU, WHU, LEVIR-CD, LEVIR- CD+, and NJDS datasets, respectively. The code of this work will be available at https://github.com/NJU-LHRS/official-SGSLN.

Pair-wise Layer Attention with Spatial Masking for Video Prediction

  • paper_url: http://arxiv.org/abs/2311.11289
  • repo_url: https://github.com/mlvccn/pla_sm_videopred
  • paper_authors: Ping Li, Chenhan Zhang, Zheng Yang, Xianghua Xu, Mingli Song
  • for: Video frame prediction with improved prediction quality.
  • methods: A Pair-wise Layer Attention (PLA) module and a Spatial Masking (SM) module.
  • results: Improved prediction quality and better capture of spatiotemporal dynamics.
    Abstract Video prediction yields future frames by employing the historical frames and has exhibited its great potential in many applications, e.g., meteorological prediction, and autonomous driving. Previous works often decode the ultimate high-level semantic features to future frames without texture details, which deteriorates the prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps derived from the U-shape structure in Translator, by coupling low-level visual cues and high-level features. Hence, the texture details of predicted frames are enriched. Moreover, most existing methods capture the spatiotemporal dynamics by Translator, but fail to sufficiently utilize the spatial features of Encoder. This inspires us to design a Spatial Masking (SM) module to mask partial encoding features during pretraining, which adds the visibility of remaining feature pixels by Decoder. To this end, we present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction to capture the spatiotemporal dynamics, which reflect the motion trend. Extensive experiments and rigorous ablation studies on five benchmarks demonstrate the advantages of the proposed approach. The code is available at GitHub.
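Spatial masking of encoder features during pretraining can be sketched in a few lines. The per-position Bernoulli masking and the ratio below are assumptions for illustration; the paper's SM module may mask at a different granularity or with a different scheme.

```python
import torch

def spatial_mask(features, mask_ratio=0.3):
    """Randomly zero out a fraction of spatial positions in an encoder
    feature map during pretraining (illustrative).

    features: (B, C, H, W) tensor.
    Returns the masked features and the binary visibility mask.
    """
    b, _, h, w = features.shape
    keep = (torch.rand(b, 1, h, w, device=features.device) > mask_ratio).float()
    return features * keep, keep  # the mask tells the decoder which positions were visible
```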

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

  • paper_url: http://arxiv.org/abs/2311.11284
  • repo_url: https://github.com/envision-research/luciddreamer
  • paper_authors: Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, Yingcong Chen
  • for: Improve the quality and efficiency of text-to-3D generation.
  • methods: Interval Score Matching (ISM), which uses deterministic diffusing trajectories and interval-based score matching to counteract the over-smoothing caused by Score Distillation Sampling, together with the integration of 3D Gaussian Splatting into the text-to-3D pipeline.
  • results: Compared with the state of the art, the model achieves higher generation quality and better training efficiency.
    Abstract The recent advancements in text-to-3D generation mark a significant milestone in generative models, unlocking new possibilities for creating imaginative 3D assets across various real-world scenarios. While recent advancements in text-to-3D generation have shown promise, they often fall short in rendering detailed and high-quality 3D models. This problem is especially prevalent as many methods base themselves on Score Distillation Sampling (SDS). This paper identifies a notable deficiency in SDS, that it brings inconsistent and low-quality updating direction for the 3D model, causing the over-smoothing effect. To address this, we propose a novel approach called Interval Score Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes interval-based score matching to counteract over-smoothing. Furthermore, we incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline. Extensive experiments show that our model largely outperforms the state-of-the-art in quality and training efficiency.

Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection

  • paper_url: http://arxiv.org/abs/2311.11278
  • repo_url: None
  • paper_authors: Zhiyuan Yan, Yuhao Luo, Siwei Lyu, Qingshan Liu, Baoyuan Wu
  • for: The paper proposes a simple yet effective deepfake detection method that addresses the performance drop caused by mismatched distributions between training and testing data.
  • methods: LSDA (Latent Space Data Augmentation) is based on the idea that representations covering a wider variety of forgeries learn a more generalizable decision boundary and thus overfit less to method-specific artifacts. The forgery space is enlarged by constructing and simulating variations within and across forgery features in the latent space, and the enriched features are distilled into a binary classifier.
  • results: Experiments show that the proposed method is surprisingly effective and surpasses state-of-the-art detectors on several widely used benchmarks.
    Abstract Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts, rather than learning features that are widely applicable across various forgeries. To address this issue, we propose a simple yet effective detector called LSDA (\underline{L}atent \underline{S}pace \underline{D}ata \underline{A}ugmentation), which is based on a heuristic idea: representations with a wider variety of forgeries should be able to learn a more generalizable decision boundary, thereby mitigating the overfitting of method-specific features (see Figure. 1). Following this idea, we propose to enlarge the forgery space by constructing and simulating variations within and across forgery features in the latent space. This approach encompasses the acquisition of enriched, domain-specific features and the facilitation of smoother transitions between different forgery types, effectively bridging domain gaps. Our approach culminates in refining a binary classifier that leverages the distilled knowledge from the enhanced features, striving for a generalizable deepfake detector. Comprehensive experiments show that our proposed method is surprisingly effective and transcends state-of-the-art detectors across several widely used benchmarks.
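A minimal example of augmenting in latent space by interpolating between features of two forgery types is shown below. LSDA's actual within- and across-forgery operations are richer, so this is only an illustration of the general idea; the interpolation range is an assumed hyperparameter.

```python
import torch

def latent_space_augment(z_a, z_b, alpha_range=(0.2, 0.8)):
    """Create new forgery representations by interpolating between latent
    features of two different forgery types (illustrative of latent space
    augmentation, not LSDA's exact operations).

    z_a, z_b: (B, D) latent features from two forgery domains.
    """
    alpha = torch.empty(z_a.size(0), 1, device=z_a.device).uniform_(*alpha_range)
    return alpha * z_a + (1.0 - alpha) * z_b
```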

Generalization and Hallucination of Large Vision-Language Models through a Camouflaged Lens

  • paper_url: http://arxiv.org/abs/2311.11273
  • repo_url: None
  • paper_authors: Lv Tang, Peng-Tao Jiang, Zhihao Shen, Hao Zhang, Jinwei Chen, Bo Li
  • for: This work investigates whether large vision-language models (LVLMs) can generalize to the challenging camouflaged object detection (COD) scenario in a training-free manner.
  • methods: A camo-perceptive vision-language framework (CPVLF) is proposed. Because hallucination can cause an LVLM to erroneously perceive objects in camouflaged scenes and produce counterfactual concepts, and because LVLMs are not specifically trained for precise localization of camouflaged objects, the framework introduces a chain of visual perception that enhances the LVLM's perception of camouflaged scenes from both linguistic and visual perspectives, reducing hallucination and improving localization accuracy.
  • results: CPVLF is validated on three widely used COD datasets, and the experiments show the potential of LVLMs for the COD task.
    Abstract Large Vision-Language Model (LVLM) has seen burgeoning development and increasing attention recently. In this paper, we propose a novel framework, camo-perceptive vision-language framework (CPVLF), to explore whether LVLM can generalize to the challenging camouflaged object detection (COD) scenario in a training-free manner. During the process of generalization, we find that due to hallucination issues within LVLM, it can erroneously perceive objects in camouflaged scenes, producing counterfactual concepts. Moreover, as LVLM is not specifically trained for the precise localization of camouflaged objects, it exhibits a degree of uncertainty in accurately pinpointing these objects. Therefore, we propose chain of visual perception, which enhances LVLM's perception of camouflaged scenes from both linguistic and visual perspectives, reducing the hallucination issue and improving its capability in accurately locating camouflaged objects. We validate the effectiveness of CPVLF on three widely used COD datasets, and the experiments show the potential of LVLM in the COD task.

Radarize: Large-Scale Radar SLAM for Indoor Environments

  • paper_url: http://arxiv.org/abs/2311.11260
  • repo_url: None
  • paper_authors: Emerson Sie, Xinyu Wu, Heyu Guo, Deepak Vasisht
  • for: The paper addresses SLAM in indoor environments using only a low-cost, commodity single-chip mmWave radar.
  • methods: The radar-native approach leverages phenomena unique to radio frequencies, such as Doppler shift-based odometry, to improve performance.
  • results: Evaluated on 146 trajectories spanning 4 campus buildings (about 4680 m of travel), the method outperforms state-of-the-art radar-based approaches by roughly 5x in odometry and 8x in end-to-end SLAM, measured by absolute trajectory error (ATE), without additional sensors such as IMUs or wheel odometry.
    Abstract We present Radarize, a self-contained SLAM pipeline for indoor environments that uses only a low-cost commodity single-chip mmWave radar. Our radar-native approach leverages phenomena unique to radio frequencies, such as doppler shift-based odometry, to improve performance. We evaluate our method on a large-scale dataset of 146 trajectories spanning 4 campus buildings, totaling approximately 4680m of travel distance. Our results show that our method outperforms state-of-the-art radar-based approaches by approximately 5x in terms of odometry and 8x in terms of end-to-end SLAM, as measured by absolute trajectory error (ATE), without the need additional sensors such as IMUs or wheel odometry.
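Doppler-based odometry typically starts from ego-velocity estimation: for a static scene, each radar detection's radial (Doppler) velocity constrains the sensor velocity linearly, so a single scan yields an estimate by least squares. The sketch below uses this standard formulation and is not taken from the paper; the sign convention and outlier handling (e.g., RANSAC against moving objects) are assumptions.

```python
import numpy as np

def ego_velocity_from_doppler(azimuths, radial_velocities):
    """Estimate the 2-D sensor ego-velocity from one radar scan (illustrative).

    For a static scene, a detection at azimuth theta observes a radial
    Doppler velocity  v_r = -(v_x * cos(theta) + v_y * sin(theta)),
    so the ego-velocity (v_x, v_y) follows from a linear least-squares fit.
    """
    a = -np.column_stack((np.cos(azimuths), np.sin(azimuths)))
    v, *_ = np.linalg.lstsq(a, radial_velocities, rcond=None)
    return v  # (v_x, v_y) in the radar frame

# Integrating successive per-scan velocity estimates over time gives a
# Doppler-based odometry trajectory.
```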

Quality and Quantity: Unveiling a Million High-Quality Images for Text-to-Image Synthesis in Fashion Design

  • paper_url: http://arxiv.org/abs/2311.12067
  • repo_url: None
  • paper_authors: Jia Yu, Lichao Zhang, Zijie Chen, Fayu Pan, MiaoMiao Wen, Yuming Yan, Fangsheng Weng, Shuai Zhang, Lili Pan, Zhenzhong Lan
  • for: This work aims to advance AI in fashion design by addressing the lack of extensive, interrelated data on clothing and try-on stages.
  • methods: Through multiple years of effort, the authors built the Fashion-Diffusion dataset, the first of its kind, comprising over a million high-quality fashion images paired with detailed text descriptions, sourced from diverse geographical locations and cultural backgrounds and capturing global fashion trends. The images are meticulously annotated with fine-grained attributes of clothing and humans, simplifying the fashion design process into a text-to-image (T2I) task.
  • results: Beyond providing high-quality text-image pairs and diverse human-garment pairs, the work proposes a new benchmark comprising multiple datasets for evaluating the performance of fashion design models, setting a new standard for research in AI-driven fashion design.
    Abstract The fusion of AI and fashion design has emerged as a promising research area. However, the lack of extensive, interrelated data on clothing and try-on stages has hindered the full potential of AI in this domain. Addressing this, we present the Fashion-Diffusion dataset, a product of multiple years' rigorous effort. This dataset, the first of its kind, comprises over a million high-quality fashion images, paired with detailed text descriptions. Sourced from a diverse range of geographical locations and cultural backgrounds, the dataset encapsulates global fashion trends. The images have been meticulously annotated with fine-grained attributes related to clothing and humans, simplifying the fashion design process into a Text-to-Image (T2I) task. The Fashion-Diffusion dataset not only provides high-quality text-image pairs and diverse human-garment pairs but also serves as a large-scale resource about humans, thereby facilitating research in T2I generation. Moreover, to foster standardization in the T2I-based fashion design field, we propose a new benchmark comprising multiple datasets for evaluating the performance of fashion design models. This work represents a significant leap forward in the realm of AI-driven fashion design, setting a new standard for future research in this field.

Submeter-level Land Cover Mapping of Japan

  • paper_url: http://arxiv.org/abs/2311.11252
  • repo_url: None
  • paper_authors: Naoto Yokoya, Junshi Xia, Clifford Broni-Bediako
  • for: The study proposes a low-annotation-cost, human-in-the-loop deep learning framework for automatically generating large-scale, submeter-level land cover maps.
  • methods: Using the OpenEarthMap benchmark dataset, a U-Net model is retrained with a small amount of additional labeled data from areas or regions where the initial model clearly failed, enabling national-scale mapping at low annotation cost.
  • results: Using aerial imagery provided by the Geospatial Information Authority of Japan, the framework produces eight-class, submeter-level land cover maps for the entire country with an overall accuracy of 80%, a nearly 16 percentage point improvement after retraining. This demonstrates its potential to contribute to automatic updating of national-scale land cover mapping, and the mapping results will be made publicly available.
    Abstract Deep learning has shown promising performance in submeter-level mapping tasks; however, the annotation cost of submeter-level imagery remains a challenge, especially when applied on a large scale. In this paper, we present the first submeter-level land cover mapping of Japan with eight classes, at a relatively low annotation cost. We introduce a human-in-the-loop deep learning framework leveraging OpenEarthMap, a recently introduced benchmark dataset for global submeter-level land cover mapping, with a U-Net model that achieves national-scale mapping with a small amount of additional labeled data. By adding a small amount of labeled data of areas or regions where a U-Net model trained on OpenEarthMap clearly failed and retraining the model, an overall accuracy of 80\% was achieved, which is a nearly 16 percentage point improvement after retraining. Using aerial imagery provided by the Geospatial Information Authority of Japan, we create land cover classification maps of eight classes for the entire country of Japan. Our framework, with its low annotation cost and high-accuracy mapping results, demonstrates the potential to contribute to the automatic updating of national-scale land cover mapping using submeter-level optical remote sensing data. The mapping results will be made publicly available.

AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort

  • paper_url: http://arxiv.org/abs/2311.11243
  • repo_url: None
  • paper_authors: Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen
  • for: Story visualization system that can generate diverse, high-quality, and consistent sets of story images with minimal human interactions.
  • methods: Utilizes comprehension and planning capabilities of large language models for layout planning, and leverages large-scale text-to-image models to generate sophisticated story images based on the layout.
  • results: Improves image quality, allows easy and intuitive user interactions, and generates multi-view consistent character images without reliance on human labor.
    Abstract Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring the users to provide per-image control conditions such as sketches. However, these simplifications render these methods incompetent for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images, with minimal human interactions. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module to transform simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves the image quality but also allows easy and intuitive user interactions. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images.

Open-Vocabulary Camouflaged Object Segmentation

  • paper_url: http://arxiv.org/abs/2311.11241
  • repo_url: None
  • paper_authors: Youwei Pang, Xiaoqi Zhao, Jiaming Zuo, Lihe Zhang, Huchuan Lu
  • for: 该论文主要研究开放词汇伪装物体分割任务，即 open-vocabulary camouflaged object segmentation（OVCOS）任务。
  • methods: 该论文使用预训练的大规模视觉语言模型（CLIP），结合迭代语义引导和结构增强等方法来实现开放词汇伪装物体分割。
  • results: 该论文在新构建的 OVCamo 数据集上取得了最优结果，并在开放词汇语义图像分割任务上大幅超越此前方法。
    Abstract Recently, the emergence of the large-scale vision-language model (VLM), such as CLIP, has opened the way towards open-world object perception. Many works has explored the utilization of pre-trained VLM for the challenging open-vocabulary dense prediction task that requires perceive diverse objects with novel classes at inference time. Existing methods construct experiments based on the public datasets of related tasks, which are not tailored for open vocabulary and rarely involves imperceptible objects camouflaged in complex scenes due to data collection bias and annotation costs. To fill in the gaps, we introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS) and construct a large-scale complex scene dataset (\textbf{OVCamo}) which containing 11,483 hand-selected images with fine annotations and corresponding object classes. Further, we build a strong single-stage open-vocabulary \underline{c}amouflaged \underline{o}bject \underline{s}egmentation transform\underline{er} baseline \textbf{OVCoser} attached to the parameter-fixed CLIP with iterative semantic guidance and structure enhancement. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects. Moreover, this effective framework also surpasses previous state-of-the-arts of open-vocabulary semantic image segmentation by a large margin on our OVCamo dataset. With the proposed dataset and baseline, we hope that this new task with more practical value can further expand the research on open-vocabulary dense prediction tasks.
    摘要 最近，大规模视觉语言模型（VLM，如CLIP）的出现，为开放世界物体感知开辟了新途径。许多研究利用预训练VLM来完成具有挑战性的开放词汇稠密预测任务，即在推理时感知多种新类别的物体。现有方法基于相关任务的公共数据集构建实验，这些数据集并非为开放词汇而设计，并且由于数据采集偏差和标注成本，很少涉及隐藏在复杂场景中、难以察觉的伪装物体。为了填补这一空白，我们引入了一个新任务：开放词汇伪装物体分割（OVCOS），并构建了一个大规模复杂场景数据集（OVCamo），包含11,483张人工挑选、精细标注的图像及对应的物体类别。此外，我们建立了一个强大的单阶段开放词汇伪装物体分割Transformer基线（OVCoser），它附加在参数固定的CLIP之上，并采用迭代语义引导与结构增强。通过结合类别语义知识的引导，以及来自边缘和深度信息的视觉结构线索，所提方法能够高效地捕捉伪装物体。此外，该框架在我们的OVCamo数据集上也大幅超越了此前开放词汇语义图像分割的最优方法。借助所提出的数据集和基线，我们希望这一更具实用价值的新任务能够进一步拓展开放词汇稠密预测任务的研究。
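
The open-vocabulary step can be pictured as comparing a pooled feature of a candidate camouflaged region against text embeddings of arbitrary class names. The sketch below uses random placeholder embeddings in place of a frozen CLIP encoder; only the cosine-similarity classification pattern is the point, not the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
class_names = ["moth", "leaf insect", "flatfish", "owl"]
text_embeds = F.normalize(torch.randn(len(class_names), 512), dim=-1)  # placeholder CLIP text features
region_feat = F.normalize(torch.randn(512), dim=-1)                    # placeholder pooled region feature

logits = 100.0 * region_feat @ text_embeds.T   # temperature-scaled cosine similarities
pred = class_names[int(logits.argmax())]
print({n: round(float(s), 2) for n, s in zip(class_names, logits)}, "->", pred)
```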

Enhancing Radiology Diagnosis through Convolutional Neural Networks for Computer Vision in Healthcare

  • paper_url: http://arxiv.org/abs/2311.11234
  • repo_url: None
  • paper_authors: Keshav Kumar K., Dr N V S L Narasimham
  • for: 该研究探讨了卷积神经网络（CNN）在放射诊断中的变革性作用，重点关注其可解释性、有效性和伦理问题。
  • methods: 该研究使用了修改后的DenseNet体系结构，并通过比较分析表明其在特异性、敏感度和准确率方面表现出色。
  • results: 研究结果表明，CNN在放射诊断中优于传统方法，但仍需解决可解释性问题并持续改进模型；同时还需考虑互操作性等集成问题以及放射科医生的培训。
    Abstract The transformative power of Convolutional Neural Networks (CNNs) in radiology diagnostics is examined in this study, with a focus on interpretability, effectiveness, and ethical issues. With an altered DenseNet architecture, the CNN performs admirably in terms of particularity, sensitivity, as well as accuracy. Its superiority over conventional methods is validated by comparative analyses, which highlight efficiency gains. Nonetheless, interpretability issues highlight the necessity of sophisticated methods in addition to continuous model improvement. Integration issues like interoperability and radiologists' training lead to suggestions for teamwork. Systematic consideration of the ethical implications is carried out, necessitating extensive frameworks. Refinement of architectures, interpretability, alongside ethical considerations need to be prioritized in future work for responsible CNN deployment in radiology diagnostics.
    摘要 本研究探讨了卷积神经网络（CNN）在放射诊断中的变革性作用，重点关注可解释性、有效性和伦理问题。通过采用修改后的DenseNet架构，CNN在特异性、敏感度和准确率方面表现出色。与传统方法的比较分析验证了其优势，突出了效率的提升。然而，可解释性问题表明除了持续改进模型外，还需要更完善的方法。互操作性和放射科医生培训等集成问题促使我们提出加强团队协作的建议。研究还系统地考虑了伦理影响，指出需要建立全面的框架。未来工作应优先考虑架构优化、可解释性以及伦理因素，以实现CNN在放射诊断中的负责任部署。
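
As a rough illustration of what "an altered DenseNet" can mean in practice, the sketch below swaps the classifier head of a torchvision DenseNet-121 for a small radiology-specific head; the exact modifications used in the paper are not specified here, so the class count and dropout are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 3  # e.g. normal / benign / malignant (assumed label set)
model = models.densenet121(weights=None)          # weights="DEFAULT" would give ImageNet init
in_features = model.classifier.in_features
model.classifier = nn.Sequential(                 # replace the stock 1000-class head
    nn.Dropout(p=0.3),
    nn.Linear(in_features, num_classes),
)

x = torch.randn(2, 3, 224, 224)                   # a dummy batch of radiograph crops
print(model(x).shape)                             # torch.Size([2, 3])
```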

GaussianDiffusion: 3D Gaussian Splatting for Denoising Diffusion Probabilistic Models with Structured Noise

  • paper_url: http://arxiv.org/abs/2311.11221
  • repo_url: None
  • paper_authors: Xinhai Li, Huaibin Wang, Kuo-Kun Tseng
  • for: 这篇论文旨在提出一种基于 Gaussian splatting 的文本到三维内容生成框架,以实现更加真实的三维图像生成。
  • methods: 该框架使用 Gaussian splatting 技术，通过控制单个 Gaussian 球体的透明度来调整图像饱和度，从而生成更加真实的图像。此外，论文还提出了一种基于多视角噪声分布的方法，以解决多视角几何不一致的问题。
  • results: 相比传统的逐点采样技术，Gaussian splatting 能够生成细节更丰富、饱和度控制更精细的图像。此外，论文还证明了该方法可以减少浮点、毛刺等伪影，提高了三维生成的质量和稳定性。
    Abstract Text-to-3D, known for its efficient generation methods and expansive creative potential, has garnered significant attention in the AIGC domain. However, the amalgamation of Nerf and 2D diffusion models frequently yields oversaturated images, posing severe limitations on downstream industrial applications due to the constraints of pixelwise rendering method. Gaussian splatting has recently superseded the traditional pointwise sampling technique prevalent in NeRF-based methodologies, revolutionizing various aspects of 3D reconstruction. This paper introduces a novel text to 3D content generation framework based on Gaussian splatting, enabling fine control over image saturation through individual Gaussian sphere transparencies, thereby producing more realistic images. The challenge of achieving multi-view consistency in 3D generation significantly impedes modeling complexity and accuracy. Taking inspiration from SJC, we explore employing multi-view noise distributions to perturb images generated by 3D Gaussian splatting, aiming to rectify inconsistencies in multi-view geometry. We ingeniously devise an efficient method to generate noise that produces Gaussian noise from diverse viewpoints, all originating from a shared noise source. Furthermore, vanilla 3D Gaussian-based generation tends to trap models in local minima, causing artifacts like floaters, burrs, or proliferative elements. To mitigate these issues, we propose the variational Gaussian splatting technique to enhance the quality and stability of 3D appearance. To our knowledge, our approach represents the first comprehensive utilization of Gaussian splatting across the entire spectrum of 3D content generation processes.
    摘要 文本到3D技术因其高效的生成方式和广阔的创作潜力，在AIGC领域受到广泛关注。然而，将NeRF与2D扩散模型结合常常产生过度饱和的图像，由于逐像素渲染方式的限制，严重制约了下游工业应用。Gaussian splatting最近已取代NeRF类方法中普遍使用的逐点采样技术，革新了三维重建的诸多方面。本文提出了一种基于Gaussian splatting的全新文本到3D内容生成框架，可以通过单个Gaussian球体的透明度精细控制图像饱和度，从而生成更真实的图像。多视角一致性难以保证，严重制约了3D生成中建模的复杂度与精度。受SJC启发，我们探索使用多视角噪声分布来扰动3D Gaussian splatting生成的图像，以修正多视角几何上的不一致。我们巧妙地设计了一种高效的噪声生成方法，可以从同一个噪声源出发，为不同视角生成高斯噪声。此外，普通的基于3D Gaussian的生成容易使模型陷入局部极小值，产生浮点、毛刺或增生元素等伪影。为了缓解这些问题，我们提出了变分Gaussian splatting技术，以提高3D外观的质量和稳定性。据我们所知，我们的方法是首次在3D内容生成全流程中全面应用Gaussian splatting。
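
One way to picture "noise for diverse viewpoints originating from a shared source" is to mix a common noise tensor with a small view-specific component so that the per-view perturbations stay correlated. The toy sketch below does exactly that and is an illustrative simplification, not the paper's actual construction.

```python
import torch

def multiview_noise(num_views, shape, shared_weight=0.9, generator=None):
    """Return (num_views, *shape) noise tensors correlated through one shared source."""
    shared = torch.randn(shape, generator=generator)             # common noise source
    views = []
    for _ in range(num_views):
        private = torch.randn(shape, generator=generator)        # view-specific residual
        # keep unit variance: w*shared + sqrt(1 - w^2)*private
        views.append(shared_weight * shared + (1 - shared_weight ** 2) ** 0.5 * private)
    return torch.stack(views)

noise = multiview_noise(num_views=4, shape=(3, 64, 64))
print(noise.shape)  # torch.Size([4, 3, 64, 64])
```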

Infrared image identification method of substation equipment fault under weak supervision

  • paper_url: http://arxiv.org/abs/2311.11214
  • repo_url: None
  • paper_authors: Anjali Sharma, Priya Banerjee, Nikhil Singh
  • for: 该研究旨在提出一种弱监督方法，用于检测变电站设备的故障。
  • methods: 该方法使用更改模型网络结构和参数,以提高设备识别精度。
  • results: 研究表明,该方法可以准确地识别多种设备类型的缺陷,并与人工标注结果进行比较,证明了该算法的高精度。
    Abstract This study presents a weakly supervised method for identifying faults in infrared images of substation equipment. It utilizes the Faster RCNN model for equipment identification, enhancing detection accuracy through modifications to the model's network structure and parameters. The method is exemplified through the analysis of infrared images captured by inspection robots at substations. Performance is validated against manually marked results, demonstrating that the proposed algorithm significantly enhances the accuracy of fault identification across various equipment types.
    摘要 本研究提出了一种弱监督方法，用于在变电站设备的红外图像中识别故障。该方法利用Faster RCNN模型识别设备，并通过修改模型的网络结构和参数来提高检测精度。方法以变电站巡检机器人采集的红外图像为例进行分析，并与人工标注结果进行对比验证，证明所提算法能够显著提高多种设备类型的故障识别精度。
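
Since the paper builds on Faster RCNN, a minimal set-up of the kind it describes can be sketched with torchvision: load the detector and replace its box-predictor head with one sized for the substation-equipment classes. The class list and any deeper structural or parameter changes from the paper are assumptions here.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 1 + 4  # background + e.g. bushing, insulator, clamp, breaker (assumed classes)
model = fasterrcnn_resnet50_fpn(weights=None)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 480, 640)])      # one pseudo-infrared image
print(preds[0].keys())                            # dict_keys(['boxes', 'labels', 'scores'])
```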

HiH: A Multi-modal Hierarchy in Hierarchy Network for Unconstrained Gait Recognition

  • paper_url: http://arxiv.org/abs/2311.11210
  • repo_url: None
  • paper_authors: Lei Wang, Yinchi Ma, Peng Luan, Wei Yao, Congcong Li, Bo Liu
  • for: Robust gait recognition in unconstrained environments, addressing challenges such as view changes, occlusions, and varying walking speeds, as well as cross-modality incompatibility.
  • methods: A multi-modal Hierarchy in Hierarchy network (HiH) that integrates silhouette and pose sequences, featuring Hierarchical Gait Decomposer (HGD) modules and an auxiliary branch for 2D joint sequences with Deformable Spatial Enhancement (DSE) and Deformable Temporal Alignment (DTA) modules.
  • results: State-of-the-art performance in gait recognition, with a well-balanced trade-off between accuracy and efficiency, demonstrated through extensive evaluations across diverse indoor and outdoor datasets.
    Abstract Gait recognition has achieved promising advances in controlled settings, yet it significantly struggles in unconstrained environments due to challenges such as view changes, occlusions, and varying walking speeds. Additionally, efforts to fuse multiple modalities often face limited improvements because of cross-modality incompatibility, particularly in outdoor scenarios. To address these issues, we present a multi-modal Hierarchy in Hierarchy network (HiH) that integrates silhouette and pose sequences for robust gait recognition. HiH features a main branch that utilizes Hierarchical Gait Decomposer (HGD) modules for depth-wise and intra-module hierarchical examination of general gait patterns from silhouette data. This approach captures motion hierarchies from overall body dynamics to detailed limb movements, facilitating the representation of gait attributes across multiple spatial resolutions. Complementing this, an auxiliary branch, based on 2D joint sequences, enriches the spatial and temporal aspects of gait analysis. It employs a Deformable Spatial Enhancement (DSE) module for pose-guided spatial attention and a Deformable Temporal Alignment (DTA) module for aligning motion dynamics through learned temporal offsets. Extensive evaluations across diverse indoor and outdoor datasets demonstrate HiH's state-of-the-art performance, affirming a well-balanced trade-off between accuracy and efficiency.
    摘要 步态识别在受控环境下已取得可喜进展，但在非受限环境中，由于视角变化、遮挡以及行走速度变化等挑战，其性能显著下降。此外，融合多种模态的尝试往往收效有限，原因在于跨模态的不兼容性，在室外场景中尤为明显。为解决这些问题，我们提出了一种多模态层级中的层级网络（HiH），整合轮廓序列与姿态序列以实现鲁棒的步态识别。HiH的主分支利用层级步态分解器（HGD）模块，对轮廓数据中的一般步态模式进行深度方向和模块内的层级分析，从整体身体动态到细节肢体运动逐级捕捉运动层次，从而在多个空间分辨率上表示步态属性。作为补充，基于2D关节点序列的辅助分支丰富了步态分析的空间与时间维度：它采用可变形空间增强（DSE）模块进行姿态引导的空间注意力，并采用可变形时间对齐（DTA）模块通过学习到的时间偏移来对齐运动动态。在多个室内外数据集上的大量评估表明，HiH达到了最优性能，并在精度与效率之间取得了良好的平衡。
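
A loose illustration of hierarchy-style pooling on silhouette features is given below: the feature map is split into progressively finer horizontal strips (whole body, halves, quarters) and each strip is pooled into its own descriptor. This mirrors the multi-resolution spirit of the HGD modules but does not reproduce their actual design.

```python
import torch

def hierarchical_strip_pool(feat, levels=(1, 2, 4)):
    """feat: (B, C, H, W) silhouette feature map -> (B, C, sum(levels)) strip descriptors."""
    parts = []
    for n in levels:
        strips = feat.chunk(n, dim=2)                         # split along the height axis
        parts += [s.mean(dim=(2, 3)) for s in strips]         # average-pool each strip
    return torch.stack(parts, dim=-1)

desc = hierarchical_strip_pool(torch.randn(8, 256, 64, 44))
print(desc.shape)  # torch.Size([8, 256, 7])
```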

3D Guidewire Shape Reconstruction from Monoplane Fluoroscopic Images

  • paper_url: http://arxiv.org/abs/2311.11209
  • repo_url: None
  • paper_authors: Tudor Jianu, Baoru Huang, Pierre Berthet-Rayne, Sebastiano Fichera, Anh Nguyen
  • for: used to reconstruct 3D guidewires in endovascular interventions, reducing radiation exposure and improving accuracy.
  • methods: utilizes CathSim, a state-of-the-art endovascular simulator, and a 3D Fluoroscopy Guidewire Reconstruction Network (3D-FGRN) to reconstruct the 3D guidewire from simulated monoplane fluoroscopic images.
  • results: delivers results on par with conventional triangulation methods, demonstrating the efficiency and potential of the proposed network.
    Abstract Endovascular navigation, essential for diagnosing and treating endovascular diseases, predominantly hinges on fluoroscopic images due to the constraints in sensory feedback. Current shape reconstruction techniques for endovascular intervention often rely on either a priori information or specialized equipment, potentially subjecting patients to heightened radiation exposure. While deep learning holds potential, it typically demands extensive data. In this paper, we propose a new method to reconstruct the 3D guidewire by utilizing CathSim, a state-of-the-art endovascular simulator, and a 3D Fluoroscopy Guidewire Reconstruction Network (3D-FGRN). Our 3D-FGRN delivers results on par with conventional triangulation from simulated monoplane fluoroscopic images. Our experiments accentuate the efficiency of the proposed network, demonstrating it as a promising alternative to traditional methods.
    摘要 血管内导航是诊断和治疗血管内疾病的关键环节，由于感知反馈的限制，其主要依赖透视（X射线荧光）图像。现有的血管内介入形状重建技术往往依赖先验信息或专用设备，可能使患者承受更高的辐射暴露。深度学习虽有潜力，但通常需要大量数据。本文提出一种新方法，利用最先进的血管内介入模拟器CathSim和三维透视导丝重建网络（3D-FGRN），从模拟的单平面透视图像中重建三维导丝。我们的3D-FGRN取得了与传统三角测量法相当的结果。实验突出了所提网络的高效性，表明其是传统方法的一个有前景的替代方案。
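
The reconstruction set-up can be pictured as a network that regresses a fixed number of 3D points along the guidewire from a single fluoroscopic frame. The toy model below is a placeholder for that mapping; architecture, point count, and training losses are assumptions, not the 3D-FGRN design.

```python
import torch
import torch.nn as nn

class MonoplaneTo3D(nn.Module):
    """Map one single-channel fluoroscopic image to num_points 3D guidewire points."""
    def __init__(self, num_points=32):
        super().__init__()
        self.num_points = num_points
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_points * 3)

    def forward(self, x):
        return self.head(self.backbone(x)).view(-1, self.num_points, 3)

model = MonoplaneTo3D()
print(model(torch.randn(2, 1, 256, 256)).shape)  # torch.Size([2, 32, 3])
```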

LogicNet: A Logical Consistency Embedded Face Attribute Learning Network

  • paper_url: http://arxiv.org/abs/2311.11208
  • repo_url: None
  • paper_authors: Haiyu Wu, Sicong Tian, Huayu Li, Kevin W. Bowyer
  • for: 提高多属性分类中逻辑一致性的可靠性
  • methods: 提出两个挑战：1) 当训练数据经过逻辑一致性检查时，如何保证模型给出逻辑一致的预测？2) 当数据未经逻辑一致性检查时，又该如何实现？并提出两个数据集（FH41K和CelebA-logic）以及用于学习属性间逻辑关系的LogicNet对抗训练框架。
  • results: LogicNet在FH37K、FH41K和CelebA-logic上的准确率分别比次优方法高出23.05%、9.96%和1.71%；在实际案例分析中，与其他方法相比，我们的方法可将平均失败案例数减少50%以上。
    Abstract Ensuring logical consistency in predictions is a crucial yet overlooked aspect in multi-attribute classification. We explore the potential reasons for this oversight and introduce two pressing challenges to the field: 1) How can we ensure that a model, when trained with data checked for logical consistency, yields predictions that are logically consistent? 2) How can we achieve the same with data that hasn't undergone logical consistency checks? Minimizing manual effort is also essential for enhancing automation. To address these challenges, we introduce two datasets, FH41K and CelebA-logic, and propose LogicNet, an adversarial training framework that learns the logical relationships between attributes. Accuracy of LogicNet surpasses that of the next-best approach by 23.05%, 9.96%, and 1.71% on FH37K, FH41K, and CelebA-logic, respectively. In real-world case analysis, our approach can achieve a reduction of more than 50% in the average number of failed cases compared to other methods.
    摘要 确保预测的逻辑一致性是多属性分类中一个重要却被忽视的方面。我们探讨了造成这种忽视的可能原因，并向该领域提出两个亟待解决的挑战：1）当训练数据经过逻辑一致性检查时，如何保证模型给出逻辑一致的预测？2）当数据未经逻辑一致性检查时，又该如何做到这一点？同时，尽量减少人工干预对提升自动化程度也至关重要。为解决这些挑战，我们引入了两个数据集FH41K和CelebA-logic，并提出了LogicNet——一种学习属性间逻辑关系的对抗训练框架。LogicNet在FH37K、FH41K和CelebA-logic上的准确率分别比次优方法高出23.05%、9.96%和1.71%。在实际案例分析中，与其他方法相比，我们的方法可将平均失败案例数减少50%以上。
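
A concrete way to see what a logical-consistency check on attribute predictions looks like is to apply pairwise "these two attributes cannot both hold" rules to thresholded sigmoid outputs and count violations (or penalise them during training). The rules in the sketch below are invented examples, not the FH41K / CelebA-logic rule set.

```python
import torch

MUTUALLY_EXCLUSIVE = [(0, 1), (2, 3)]  # e.g. (bald, long_hair), (beard, clean_shaven) - assumed rules

def consistency_violations(probs, threshold=0.5):
    """probs: (B, num_attrs) sigmoid outputs -> number of violated rules per sample."""
    hard = (probs > threshold).float()
    violations = torch.zeros(probs.shape[0])
    for a, b in MUTUALLY_EXCLUSIVE:
        violations += hard[:, a] * hard[:, b]   # both attributes "on" at once is inconsistent
    return violations

probs = torch.tensor([[0.9, 0.8, 0.1, 0.2],
                      [0.9, 0.1, 0.7, 0.1]])
print(consistency_violations(probs))  # tensor([1., 0.])
```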

Shape-Sensitive Loss for Catheter and Guidewire Segmentation

  • paper_url: http://arxiv.org/abs/2311.11205
  • repo_url: None
  • paper_authors: Chayun Kongtongvattana, Baoru Huang, Jingxuan Kang, Hoan Nguyen, Olajide Olufemi, Anh Nguyen
  • for: 该论文旨在提高X射线图像中导管与导丝的分割精度。
  • methods: 该论文提出了一种形状敏感的损失函数，并将其应用于视觉Transformer网络，以取得新的最优结果。
  • results: 该论文在大规模X射线图像数据集上取得了新的最优结果，并得到了封装关键图像属性的高维特征向量，以及一种基于余弦相似度的图像相似性度量方法。
    Abstract We introduce a shape-sensitive loss function for catheter and guidewire segmentation and utilize it in a vision transformer network to establish a new state-of-the-art result on a large-scale X-ray images dataset. We transform network-derived predictions and their corresponding ground truths into signed distance maps, thereby enabling any networks to concentrate on the essential boundaries rather than merely the overall contours. These SDMs are subjected to the vision transformer, efficiently producing high-dimensional feature vectors encapsulating critical image attributes. By computing the cosine similarity between these feature vectors, we gain a nuanced understanding of image similarity that goes beyond the limitations of traditional overlap-based measures. The advantages of our approach are manifold, ranging from scale and translation invariance to superior detection of subtle differences, thus ensuring precise localization and delineation of the medical instruments within the images. Comprehensive quantitative and qualitative analyses substantiate the significant enhancement in performance over existing baselines, demonstrating the promise held by our new shape-sensitive loss function for improving catheter and guidewire segmentation.
    摘要 我们提出了一种用于导管与导丝分割的形状敏感损失函数，并将其应用于视觉Transformer网络，在大规模X射线图像数据集上取得了新的最优结果。我们将网络预测结果及其对应的真值转换为符号距离图，使任意网络都能专注于关键边界而不仅仅是整体轮廓。这些符号距离图经视觉Transformer处理，高效地生成封装关键图像属性的高维特征向量。通过计算这些特征向量之间的余弦相似度，我们获得了一种超越传统重叠度量局限的、更细致的图像相似性理解。该方法具有多方面的优势，包括尺度与平移不变性，以及对细微差异更强的检测能力，从而确保医疗器械在图像中的精确定位与勾画。全面的定量与定性分析证实了其性能相对于现有基线的显著提升，展示了这种新的形状敏感损失函数在改进导管与导丝分割方面的潜力。
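
Two ingredients mentioned above, signed distance maps and cosine similarity between feature vectors, can be sketched generically as follows; this is a standard construction for illustration, not the exact loss definition from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """mask: (H, W) boolean array -> negative inside the object, positive outside."""
    outside = distance_transform_edt(~mask)   # distance of background pixels to the object
    inside = distance_transform_edt(mask)     # distance of object pixels to the background
    return outside - inside

def cosine_similarity(a, b, eps=1e-8):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 25:45] = True                     # a toy "catheter" blob
sdm = signed_distance_map(mask)
print(sdm.min() < 0 < sdm.max(), cosine_similarity(sdm, sdm))  # True ~1.0
```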

Self-Supervised Versus Supervised Training for Segmentation of Organoid Images

  • paper_url: http://arxiv.org/abs/2311.11198
  • repo_url: None
  • paper_authors: Asmaa Haja, Eric Brouwer, Lambert Schomaker
  • for: 本研究旨在降低数字显微镜图像数据标注的时间与人力成本，使未标注数据也能被深度学习算法有效利用，以完成图像分割任务。
  • methods: 本研究采用自监督学习（SSL）技术，通过与主任务相近的前置任务学习内在特征，无需标注数据。研究使用ResNet50 U-Net模型，先在增强图像上进行图像恢复训练，再将权重迁移到图像分割任务上。
  • results: 结果表明，采用25%像素丢弃或图像模糊增强策略的自监督模型在IoU损失下优于其他增强策略。仅用114张图像训练主任务时，自监督方法的F1分数达0.85，高于监督方法的0.78且更稳定；使用1,000张图像训练时，自监督方法的F1分数为0.92，仍高于监督方法的0.85。
    Abstract The process of annotating relevant data in the field of digital microscopy can be both time-consuming and especially expensive due to the required technical skills and human-expert knowledge. Consequently, large amounts of microscopic image data sets remain unlabeled, preventing their effective exploitation using deep-learning algorithms. In recent years it has been shown that a lot of relevant information can be drawn from unlabeled data. Self-supervised learning (SSL) is a promising solution based on learning intrinsic features under a pretext task that is similar to the main task without requiring labels. The trained result is transferred to the main task - image segmentation in our case. A ResNet50 U-Net was first trained to restore images of liver progenitor organoids from augmented images using the Structural Similarity Index Metric (SSIM), alone, and using SSIM combined with L1 loss. Both the encoder and decoder were trained in tandem. The weights were transferred to another U-Net model designed for segmentation with frozen encoder weights, using Binary Cross Entropy, Dice, and Intersection over Union (IoU) losses. For comparison, we used the same U-Net architecture to train two supervised models, one utilizing the ResNet50 encoder as well as a simple CNN. Results showed that self-supervised learning models using a 25\% pixel drop or image blurring augmentation performed better than the other augmentation techniques using the IoU loss. When trained on only 114 images for the main task, the self-supervised learning approach outperforms the supervised method achieving an F1-score of 0.85, with higher stability, in contrast to an F1=0.78 scored by the supervised method. Furthermore, when trained with larger data sets (1,000 images), self-supervised learning is still able to perform better, achieving an F1-score of 0.92, contrasting to a score of 0.85 for the supervised method.
    摘要 数字显微镜领域中相关数据的标注过程既耗时又昂贵，因为它需要专业技术能力和人类专家知识。因此，大量显微图像数据集仍处于未标注状态，阻碍了深度学习算法对其的有效利用。近年来的研究表明，未标注数据中同样可以挖掘出许多有用信息。自监督学习（SSL）是一种有前景的解决方案，它通过与主任务相似的前置任务学习内在特征，而无需标签，训练结果再迁移到主任务——在我们的场景中即图像分割。我们首先训练ResNet50 U-Net，从增强图像中恢复肝祖细胞类器官图像，分别仅使用结构相似性指标（SSIM）以及SSIM与L1损失的组合，编码器与解码器同时训练。随后将权重迁移到另一个用于分割的U-Net模型，冻结编码器权重，并使用二元交叉熵、Dice和交并比（IoU）损失进行训练。作为对比，我们用同样的U-Net架构训练了两个监督模型，一个使用ResNet50编码器，另一个使用简单的CNN。结果显示，采用25%像素丢弃或图像模糊增强的自监督学习模型在IoU损失下优于其他增强方式。仅用114张图像训练主任务时，自监督方法取得0.85的F1分数且更稳定，高于监督方法的0.78。此外，在更大的数据集（1,000张图像）上训练时，自监督学习仍然表现更好，F1分数达0.92，而监督方法为0.85。
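
Two mechanics from the training recipe above, freezing the pretrained encoder before fine-tuning the segmentation decoder and using a soft Dice loss for the main task, are sketched below with a stand-in model; the paper itself used a ResNet50 U-Net, not this toy network.

```python
import torch
import torch.nn as nn

def dice_loss(pred, target, eps=1e-6):
    """pred: (B,1,H,W) probabilities, target: (B,1,H,W) binary masks."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

class TinySegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(16, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinySegNet()
for p in model.encoder.parameters():   # transfer step: keep the (pretrained) encoder frozen
    p.requires_grad = False

x = torch.randn(2, 3, 64, 64)
y = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(float(dice_loss(model(x), y)))   # scalar loss for the segmentation fine-tuning step
```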