cs.CV - 2023-07-21

FEDD – Fair, Efficient, and Diverse Diffusion-based Lesion Segmentation and Malignancy Classification

  • paper_url: http://arxiv.org/abs/2307.11654
  • repo_url: https://github.com/hectorcarrion/fedd
  • paper_authors: Héctor Carrión, Narges Norouzi
  • for: This study aims to increase the accessibility of dermatology image diagnosis by providing fairer and more accurate segmentation and classification of skin lesion images.
  • methods: The study introduces FEDD, a diffusion-based framework that extracts semantically meaningful feature embeddings from a denoising diffusion probabilistic backbone and processes them with linear probes (see the sketch after this entry).
  • results: FEDD improves intersection over union by 0.18, 0.13, 0.06, and 0.07 while using only 5%, 10%, 15%, and 20% labeled samples, respectively, and reaches 81% malignancy classification accuracy when trained on 10% of DDI, 14% higher than the state of the art.
    Abstract Skin diseases affect millions of people worldwide, across all ethnicities. Increasing diagnosis accessibility requires fair and accurate segmentation and classification of dermatology images. However, the scarcity of annotated medical images, especially for rare diseases and underrepresented skin tones, poses a challenge to the development of fair and accurate models. In this study, we introduce a Fair, Efficient, and Diverse Diffusion-based framework for skin lesion segmentation and malignancy classification. FEDD leverages semantically meaningful feature embeddings learned through a denoising diffusion probabilistic backbone and processes them via linear probes to achieve state-of-the-art performance on Diverse Dermatology Images (DDI). We achieve an improvement in intersection over union of 0.18, 0.13, 0.06, and 0.07 while using only 5%, 10%, 15%, and 20% labeled samples, respectively. Additionally, FEDD trained on 10% of DDI demonstrates malignancy classification accuracy of 81%, 14% higher compared to the state-of-the-art. We showcase high efficiency in data-constrained scenarios while providing fair performance for diverse skin tones and rare malignancy conditions. Our newly annotated DDI segmentation masks and training code can be found on https://github.com/hectorcarrion/fedd.
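
A minimal sketch of the linear-probe step described above, assuming per-pixel diffusion features have already been extracted (simulated here with random arrays); the feature dimension and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the authors' implementation.

```python
# Linear probe on frozen feature embeddings (stand-in for FEDD's diffusion
# backbone features). Feature extraction is simulated with random arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend we extracted D-dimensional per-pixel embeddings for N labeled pixels.
N, D = 2000, 256
features = rng.normal(size=(N, D)).astype(np.float32)
labels = rng.integers(0, 2, size=N)          # 0 = background, 1 = lesion

# A linear probe is just a linear classifier trained on frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(features, labels)

# At inference, embed every pixel of a new image and apply the probe.
new_pixels = rng.normal(size=(500, D)).astype(np.float32)
pred_mask = probe.predict(new_pixels)        # per-pixel lesion / background
print("predicted lesion pixels:", int(pred_mask.sum()))
```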

Deep Reinforcement Learning Based System for Intraoperative Hyperspectral Video Autofocusing

  • paper_url: http://arxiv.org/abs/2307.11638
  • repo_url: None
  • paper_authors: Charlie Budd, Jianrong Qiu, Oscar MacCormac, Martin Huber, Christopher Mower, Mirek Janatka, Théo Trotouin, Jonathan Shapey, Mads S. Bergholt, Tom Vercauteren
  • for: This paper addresses the usability problems of handheld real-time video hyperspectral imaging (HSI), whose limited focal depth hinders integration into the operating room, and proposes a deep reinforcement learning based autofocusing method to solve them.
  • methods: A focus-tunable liquid lens is integrated into a video HSI exoscope, and novel video autofocusing methods based on deep reinforcement learning are proposed to keep the hyperspectral video in focus.
  • results: The new autofocus algorithm performs significantly better than traditional focusing policies ($p<0.05$), with a mean absolute focal error of $0.070\pm.098$ compared to $0.146\pm.148$; in a blinded usability trial, two neurosurgeons rated the new policy as the most favourable, making the system a desirable addition for intraoperative HSI.
    Abstract Hyperspectral imaging (HSI) captures a greater level of spectral detail than traditional optical imaging, making it a potentially valuable intraoperative tool when precise tissue differentiation is essential. Hardware limitations of current optical systems used for handheld real-time video HSI result in a limited focal depth, thereby posing usability issues for integration of the technology into the operating room. This work integrates a focus-tunable liquid lens into a video HSI exoscope, and proposes novel video autofocusing methods based on deep reinforcement learning. A first-of-its-kind robotic focal-time scan was performed to create a realistic and reproducible testing dataset. We benchmarked our proposed autofocus algorithm against traditional policies, and found our novel approach to perform significantly ($p<0.05$) better than traditional techniques ($0.070\pm.098$ mean absolute focal error compared to $0.146\pm.148$). In addition, we performed a blinded usability trial by having two neurosurgeons compare the system with different autofocus policies, and found our novel approach to be the most favourable, making our system a desirable addition for intraoperative HSI.

Divide and Adapt: Active Domain Adaptation via Customized Learning

  • paper_url: http://arxiv.org/abs/2307.11618
  • repo_url: https://github.com/duojun-huang/diana-cvpr2023
  • paper_authors: Duojun Huang, Jichang Li, Weikai Chen, Junshi Huang, Zhenhua Chai, Guanbin Li
  • for: This work aims to improve model adaptation performance by incorporating active learning (AL) techniques to label the most informative subset of target samples.
  • methods: It presents Divide-and-Adapt (DiaNA), a new active domain adaptation framework that partitions target instances into four categories with stratified transferable properties, using an uncertainty- and domainness-based informativeness score together with a Gaussian mixture model (see the sketch after this entry).
  • results: DiaNA accurately identifies the most gainful samples, handles data with large variations of domain gap, and generalizes to different domain adaptation settings such as unsupervised (UDA), semi-supervised (SSDA), and source-free (SFDA) domain adaptation.
    Abstract Active domain adaptation (ADA) aims to improve the model adaptation performance by incorporating active learning (AL) techniques to label a maximally-informative subset of target samples. Conventional AL methods do not consider the existence of domain shift, and hence, fail to identify the truly valuable samples in the context of domain adaptation. To accommodate active learning and domain adaption, the two naturally different tasks, in a collaborative framework, we advocate that a customized learning strategy for the target data is the key to the success of ADA solutions. We present Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target instances into four categories with stratified transferable properties. With a novel data subdivision protocol based on uncertainty and domainness, DiaNA can accurately recognize the most gainful samples. While sending the informative instances for annotation, DiaNA employs tailored learning strategies for the remaining categories. Furthermore, we propose an informativeness score that unifies the data partitioning criteria. This enables the use of a Gaussian mixture model (GMM) to automatically sample unlabeled data into the proposed four categories. Thanks to the "divideand-adapt" spirit, DiaNA can handle data with large variations of domain gap. In addition, we show that DiaNA can generalize to different domain adaptation settings, such as unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain adaptation (SFDA), etc.
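
A sketch of how a unified informativeness score can be clustered with a Gaussian mixture model into four sample categories, as described above; the particular score (entropy-based uncertainty plus a domainness proxy) and the use of scikit-learn are illustrative assumptions, not DiaNA's exact formulation.

```python
# Partition unlabeled target samples into four categories with a GMM fitted
# on a 1-D "informativeness" score combining uncertainty and domainness.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, num_classes = 1000, 10

probs = rng.dirichlet(np.ones(num_classes), size=n)          # softmax outputs
uncertainty = -(probs * np.log(probs + 1e-8)).sum(axis=1)    # predictive entropy
domainness = rng.uniform(size=n)                             # e.g. from a domain discriminator

score = (uncertainty / np.log(num_classes) + domainness).reshape(-1, 1)

gmm = GaussianMixture(n_components=4, random_state=0).fit(score)
category = gmm.predict(score)        # four groups with different transfer properties

# The highest-scoring group would be sent for annotation; the remaining
# categories receive tailored (e.g. self-training / alignment) strategies.
order = np.argsort(gmm.means_.ravel())
to_annotate = np.where(category == order[-1])[0]
print("samples selected for labeling:", len(to_annotate))
```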

Consistency-guided Meta-Learning for Bootstrapping Semi-Supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.11604
  • repo_url: https://github.com/aijinrjinr/mlb-seg
  • paper_authors: Qingyue Wei, Lequan Yu, Xianhang Li, Wei Shao, Cihang Xie, Lei Xing, Yuyin Zhou
  • for: This work tackles semi-supervised medical image segmentation, aiming to improve both the efficiency and accuracy of segmentation when only limited labeled data are available.
  • methods: The proposed Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg) has three parts: a segmentation model trained on a small set of cleanly labeled images to generate initial labels for unlabeled data; a per-pixel weight mapping system that dynamically weights both the initialized labels and the model's own predictions, with the weights determined by a meta-process guided by a small set of precisely annotated images; and a consistency-based Pseudo Label Enhancement (PLE) scheme, stabilized by a mean teacher, that improves the quality of the model's own predictions (see the sketch after this entry).
  • results: Experiments on two public atrial and prostate segmentation datasets show that the proposed method achieves state-of-the-art results under semi-supervision.
    Abstract Medical imaging has witnessed remarkable progress but usually requires a large amount of high-quality annotated data which is time-consuming and costly to obtain. To alleviate this burden, semi-supervised learning has garnered attention as a potential solution. In this paper, we present Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg), a novel method for tackling the challenge of semi-supervised medical image segmentation. Specifically, our approach first involves training a segmentation model on a small set of clean labeled images to generate initial labels for unlabeled data. To further optimize this bootstrapping process, we introduce a per-pixel weight mapping system that dynamically assigns weights to both the initialized labels and the model's own predictions. These weights are determined using a meta-process that prioritizes pixels with loss gradient directions closer to those of clean data, which is based on a small set of precisely annotated images. To facilitate the meta-learning process, we additionally introduce a consistency-based Pseudo Label Enhancement (PLE) scheme that improves the quality of the model's own predictions by ensembling predictions from various augmented versions of the same input. In order to improve the quality of the weight maps obtained through multiple augmentations of a single input, we introduce a mean teacher into the PLE scheme. This method helps to reduce noise in the weight maps and stabilize its generation process. Our extensive experimental results on public atrial and prostate segmentation datasets demonstrate that our proposed method achieves state-of-the-art results under semi-supervision. Our code is available at https://github.com/aijinrjinr/MLB-Seg.
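
A toy sketch of the consistency-based pseudo-label enhancement and mean-teacher ideas mentioned above: the teacher's predictions are ensembled over several augmented views of the same input, and the teacher tracks an EMA of the student. The tiny convolutional model, the flip augmentations, and the EMA rate are assumptions for illustration only.

```python
# Consistency-based pseudo-label enhancement (PLE) with a mean teacher.
import torch
import torch.nn as nn

student = nn.Conv2d(1, 2, kernel_size=3, padding=1)      # toy segmentation head
teacher = nn.Conv2d(1, 2, kernel_size=3, padding=1)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(teacher, student, momentum=0.99):
    # Mean-teacher update: teacher weights follow an EMA of the student's.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1 - momentum)

@torch.no_grad()
def enhanced_pseudo_label(x):
    # Average teacher predictions over identity / horizontal / vertical flips.
    views = [x, torch.flip(x, dims=[-1]), torch.flip(x, dims=[-2])]
    inverses = [lambda y: y,
                lambda y: torch.flip(y, dims=[-1]),
                lambda y: torch.flip(y, dims=[-2])]
    probs = [inv(teacher(v).softmax(dim=1)) for v, inv in zip(views, inverses)]
    return torch.stack(probs).mean(dim=0)                 # soft pseudo label

x = torch.randn(2, 1, 32, 32)                             # unlabeled batch
pseudo = enhanced_pseudo_label(x)
ema_update(teacher, student)
print(pseudo.shape)                                       # (2, 2, 32, 32)
```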

Cascaded multitask U-Net using topological loss for vessel segmentation and centerline extraction

  • paper_url: http://arxiv.org/abs/2307.11603
  • repo_url: None
  • paper_authors: Pierre Rougé, Nicolas Passat, Odyssée Merveille
  • for: This paper targets vessel segmentation and centerline extraction, two crucial preliminary tasks for computer-aided diagnosis tools dealing with vascular diseases.
  • methods: A deep-learning approach is proposed in which the soft-skeleton algorithm used by the clDice loss is replaced by a U-Net that computes the vascular skeleton directly from the segmentation; a cascaded U-Net is then trained with the clDice loss to embed topological constraints during segmentation (see the sketch after this entry).
  • results: The method yields more accurate skeletons than the soft-skeleton algorithm and predicts both the vessel segmentation and centerlines with a more accurate topology, making it better suited to 3D images.
    Abstract Vessel segmentation and centerline extraction are two crucial preliminary tasks for many computer-aided diagnosis tools dealing with vascular diseases. Recently, deep-learning based methods have been widely applied to these tasks. However, classic deep-learning approaches struggle to capture the complex geometry and specific topology of vascular networks, which is of the utmost importance in most applications. To overcome these limitations, the clDice loss, a topological loss that focuses on the vessel centerlines, has been recently proposed. This loss requires computing, with a proposed soft-skeleton algorithm, the skeletons of both the ground truth and the predicted segmentation. However, the soft-skeleton algorithm provides suboptimal results on 3D images, which makes the clDice hardly suitable on 3D images. In this paper, we propose to replace the soft-skeleton algorithm by a U-Net which computes the vascular skeleton directly from the segmentation. We show that our method provides more accurate skeletons than the soft-skeleton algorithm. We then build upon this network a cascaded U-Net trained with the clDice loss to embed topological constraints during the segmentation. The resulting model is able to predict both the vessel segmentation and centerlines with a more accurate topology.
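
A short sketch of the clDice measure that the cascaded network is trained with: it compares the predicted segmentation against the ground-truth centerline (skeleton) and vice versa. The skeletons are taken as given here (in the paper they come from the soft-skeleton algorithm or, as proposed, from a dedicated U-Net); the toy arrays are assumptions.

```python
# clDice metric from binary segmentations and their skeletons.
import numpy as np

def cl_dice(v_pred, v_true, s_pred, s_true, eps=1e-8):
    """v_*: binary segmentations, s_*: binary skeletons (same shape)."""
    tprec = (s_pred * v_true).sum() / (s_pred.sum() + eps)   # topology precision
    tsens = (s_true * v_pred).sum() / (s_true.sum() + eps)   # topology sensitivity
    return 2.0 * tprec * tsens / (tprec + tsens + eps)

# Toy example: a thin horizontal vessel, perfectly recovered.
v_true = np.zeros((16, 16)); v_true[7:9, 2:14] = 1
s_true = np.zeros_like(v_true); s_true[8, 2:14] = 1           # its centerline
v_pred, s_pred = v_true.copy(), s_true.copy()

print("clDice:", cl_dice(v_pred, v_true, s_pred, s_true))     # -> ~1.0
# As a loss, one would minimize 1 - soft_clDice with differentiable skeletons.
```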

CortexMorph: fast cortical thickness estimation via diffeomorphic registration using VoxelMorph

  • paper_url: http://arxiv.org/abs/2307.11567
  • repo_url: None
  • paper_authors: Richard McKinley, Christian Rummel
  • for: This paper aims to improve the efficiency of estimating cortical thickness in magnetic resonance imaging (MRI) studies.
  • methods: The proposed method, CortexMorph, uses unsupervised deep learning to directly regress the deformation field needed for DiReCT, which can significantly reduce the registration time.
  • results: The proposed method can estimate region-wise thickness in seconds from a T1-weighted image, while maintaining the ability to detect cortical atrophy, as validated on the OASIS-3 dataset and the synthetic cortical thickness phantom.
    Abstract The thickness of the cortical band is linked to various neurological and psychiatric conditions, and is often estimated through surface-based methods such as Freesurfer in MRI studies. The DiReCT method, which calculates cortical thickness using a diffeomorphic deformation of the gray-white matter interface towards the pial surface, offers an alternative to surface-based methods. Recent studies using a synthetic cortical thickness phantom have demonstrated that the combination of DiReCT and deep-learning-based segmentation is more sensitive to subvoxel cortical thinning than Freesurfer. While anatomical segmentation of a T1-weighted image now takes seconds, existing implementations of DiReCT rely on iterative image registration methods which can take up to an hour per volume. On the other hand, learning-based deformable image registration methods like VoxelMorph have been shown to be faster than classical methods while improving registration accuracy. This paper proposes CortexMorph, a new method that employs unsupervised deep learning to directly regress the deformation field needed for DiReCT. By combining CortexMorph with a deep-learning-based segmentation model, it is possible to estimate region-wise thickness in seconds from a T1-weighted image, while maintaining the ability to detect cortical atrophy. We validate this claim on the OASIS-3 dataset and the synthetic cortical thickness phantom of Rusak et al.
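
A small sketch of the VoxelMorph-style warping step that a regressed deformation field would feed into: the moving image is resampled with a differentiable spatial transformer. The 2-D toy sizes and the zero displacement are assumptions; CortexMorph regresses the field with a network and operates on 3-D volumes.

```python
# Differentiable warping of a moving image by a displacement field.
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """moving: (N,1,H,W), flow: (N,2,H,W) displacement in voxels (x, y)."""
    n, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # sampling locations
    # Normalize to [-1, 1] as required by grid_sample (x first, then y).
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((grid_x, grid_y), dim=-1)                # (N,H,W,2)
    return F.grid_sample(moving, sample_grid, align_corners=True)

moving = torch.rand(1, 1, 64, 64)
flow = torch.zeros(1, 2, 64, 64)           # identity transform
warped = warp(moving, flow)
print(torch.allclose(warped, moving, atol=1e-5))   # True for zero displacement

# In DiReCT, cortical thickness at an interface voxel is then read out from the
# accumulated displacement, e.g. flow.norm(dim=1) at that voxel (assumption).
```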

YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.11550
  • repo_url: None
  • paper_authors: Arul Selvam Periyasamy, Arash Amini, Vladimir Tsaturyan, Sven Behnke
  • for: 6D object pose estimation for autonomous robot manipulation applications.
  • methods: Transformer-based multi-object 6D pose estimation using keypoint regression and a learnable orientation estimation module.
  • results: Achieves results comparable to state-of-the-art methods and is suitable for real-time applications.
    Abstract 6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, is achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short form for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression and an improved variant of the YOLOPose model. In contrast to the standard heatmaps for predicting keypoints in an image, we directly regress the keypoints. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Along with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods. We analyze the role of object queries in our architecture and reveal that the object queries specialize in detecting objects in specific image regions. Furthermore, we quantify the accuracy trade-off of using datasets of smaller sizes to train our model.

KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose Estimation

  • paper_url: http://arxiv.org/abs/2307.11543
  • repo_url: https://github.com/ivano-donadi/kvn
  • paper_authors: Ivano Donadi, Alberto Pretto
  • for: The purpose of this research is to improve the accuracy of object pose estimation, particularly in multi-view (stereo) scenarios.
  • methods: The method introduces a differentiable RANSAC layer into a well-known monocular pose estimation network and uses an uncertainty-driven multi-view PnP solver to fuse information from multiple views.
  • results: Experiments on a public stereo object pose estimation dataset show that the proposed method achieves state-of-the-art results and outperforms other recent methods.
    Abstract Object pose estimation is a fundamental computer vision task exploited in several robotics and augmented reality applications. Many established approaches rely on predicting 2D-3D keypoint correspondences using RANSAC (Random sample consensus) and estimating the object pose using the PnP (Perspective-n-Point) algorithm. Being RANSAC non-differentiable, correspondences cannot be directly learned in an end-to-end fashion. In this paper, we address the stereo image-based object pose estimation problem by (i) introducing a differentiable RANSAC layer into a well-known monocular pose estimation network; (ii) exploiting an uncertainty-driven multi-view PnP solver which can fuse information from multiple views. We evaluate our approach on a challenging public stereo object pose estimation dataset, yielding state-of-the-art results against other recent approaches. Furthermore, in our ablation study, we show that the differentiable RANSAC layer plays a significant role in the accuracy of the proposed method. We release with this paper the open-source implementation of our method.

UWAT-GAN: Fundus Fluorescein Angiography Synthesis via Ultra-wide-angle Transformation Multi-scale GAN

  • paper_url: http://arxiv.org/abs/2307.11530
  • repo_url: https://github.com/Tinysqua/UWAT-GAN
  • paper_authors: Zhaojie Fang, Zhanghao Chen, Pengxue Wei, Wangting Li, Shaochong Zhang, Ahmed Elazab, Gangyong Jia, Ruiquan Ge, Changmiao Wang
  • for: The paper proposes a novel conditional generative adversarial network (UWAT-GAN) to synthesize UWF-FA from UWF-SLO, aiming to avoid the negative impacts of injecting sodium fluorescein and to improve the resolution of fundus imaging.
  • methods: UWAT-GAN uses multi-scale generators and a fusion module patch to better extract global and local information, plus an attention transmit module to help the decoder learn effectively; the network is trained in a supervised fashion with multiple new weighted losses on different scales of data.
  • results: Experiments on an in-house UWF image dataset demonstrate the superiority of UWAT-GAN over state-of-the-art methods, generating high-resolution images and capturing tiny vascular lesion areas.
    Abstract Fundus photography is an essential examination for clinical and differential diagnosis of fundus diseases. Recently, Ultra-Wide-angle Fundus (UWF) techniques, UWF Fluorescein Angiography (UWF-FA) and UWF Scanning Laser Ophthalmoscopy (UWF-SLO) have been gradually put into use. However, Fluorescein Angiography (FA) and UWF-FA require injecting sodium fluorescein which may have detrimental influences. To avoid negative impacts, cross-modality medical image generation algorithms have been proposed. Nevertheless, current methods in fundus imaging could not produce high-resolution images and are unable to capture tiny vascular lesion areas. This paper proposes a novel conditional generative adversarial network (UWAT-GAN) to synthesize UWF-FA from UWF-SLO. Using multi-scale generators and a fusion module patch to better extract global and local information, our model can generate high-resolution images. Moreover, an attention transmit module is proposed to help the decoder learn effectively. Besides, a supervised approach is used to train the network using multiple new weighted losses on different scales of data. Experiments on an in-house UWF image dataset demonstrate the superiority of the UWAT-GAN over the state-of-the-art methods. The source code is available at: https://github.com/Tinysqua/UWAT-GAN.

Improving Viewpoint Robustness for Visual Recognition via Adversarial Training

  • paper_url: http://arxiv.org/abs/2307.11528
  • repo_url: None
  • paper_authors: Shouwei Ruan, Yinpeng Dong, Hang Su, Jianteng Peng, Ning Chen, Xingxing Wei
  • for: To improve the viewpoint robustness of image classifiers so that they can reliably recognize objects seen from different viewpoints.
  • methods: Viewpoint-Invariant Adversarial Training (VIAT) treats viewpoint transformation as an attack and is formulated as a minimax optimization problem: the inner maximization learns a Gaussian mixture distribution of adversarial viewpoints via the proposed GMVFool attack, and the outer minimization trains a viewpoint-invariant classifier against the worst-case viewpoint distributions (a generic sketch of this min-max loop follows the entry).
  • results: VIAT significantly improves the viewpoint robustness of various image classifiers based on the diverse adversarial viewpoints generated by GMVFool; the proposed certified method ViewRS further provides a certified radius and accuracy to demonstrate the effectiveness of VIAT from a theoretical perspective.
    Abstract Viewpoint invariance remains challenging for visual recognition in the 3D world, as altering the viewing directions can significantly impact predictions for the same object. While substantial efforts have been dedicated to making neural networks invariant to 2D image translations and rotations, viewpoint invariance is rarely investigated. Motivated by the success of adversarial training in enhancing model robustness, we propose Viewpoint-Invariant Adversarial Training (VIAT) to improve the viewpoint robustness of image classifiers. Regarding viewpoint transformation as an attack, we formulate VIAT as a minimax optimization problem, where the inner maximization characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on the proposed attack method GMVFool. The outer minimization obtains a viewpoint-invariant classifier by minimizing the expected loss over the worst-case viewpoint distributions that can share the same one for different objects within the same category. Based on GMVFool, we contribute a large-scale dataset called ImageNet-V+ to benchmark viewpoint robustness. Experimental results show that VIAT significantly improves the viewpoint robustness of various image classifiers based on the diversity of adversarial viewpoints generated by GMVFool. Furthermore, we propose ViewRS, a certified viewpoint robustness method that provides a certified radius and accuracy to demonstrate the effectiveness of VIAT from the theoretical perspective.
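
A generic sketch of the min-max idea behind adversarial training over viewpoints: an inner loop searches for a worst-case view (here a single 2-D rotation angle, made differentiable via affine_grid), and the outer loop trains the classifier on that view. The toy classifier, data, and step sizes are assumptions; VIAT itself learns a Gaussian mixture over full 3-D viewpoints with its GMVFool attack.

```python
# Min-max training against a worst-case (rotation) viewpoint.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotate(images, theta):
    """Differentiably rotate a batch (N,C,H,W) by angle theta (radians)."""
    n = images.shape[0]
    cos, sin = torch.cos(theta), torch.sin(theta)
    mat = torch.stack([torch.stack([cos, -sin, torch.zeros_like(cos)]),
                       torch.stack([sin,  cos, torch.zeros_like(cos)])])   # (2, 3)
    grid = F.affine_grid(mat.unsqueeze(0).expand(n, -1, -1), images.shape,
                         align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(8, 1, 32, 32), torch.randint(0, 10, (8,))

for _ in range(3):                                    # outer minimization steps
    theta = torch.zeros((), requires_grad=True)
    for _ in range(5):                                # inner maximization steps
        loss = F.cross_entropy(model(rotate(x, theta)), y)
        grad = torch.autograd.grad(loss, theta)[0]
        theta = (theta + 0.1 * grad).detach().requires_grad_()
    opt.zero_grad()
    F.cross_entropy(model(rotate(x, theta.detach())), y).backward()
    opt.step()
print("trained against worst-case angle:", float(theta))
```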

CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.11526
  • repo_url: None
  • paper_authors: Ziyuan Luo, Qing Guo, Ka Chun Cheung, Simon See, Renjie Wan
  • for: Protecting the copyright of NeRF models.
  • methods: The original color representation in NeRF is replaced with a watermarked color representation, and a distortion-resistant rendering scheme is designed to guarantee robust message extraction from 2D renderings of the NeRF.
  • results: Compared with optional solutions, the proposed method can directly protect the copyright of NeRF models while maintaining high rendering quality and bit accuracy.
    Abstract Neural Radiance Fields (NeRF) have the potential to be a major representation of media. Since training a NeRF has never been an easy task, the protection of its model copyright should be a priority. In this paper, by analyzing the pros and cons of possible copyright protection solutions, we propose to protect the copyright of NeRF models by replacing the original color representation in NeRF with a watermarked color representation. Then, a distortion-resistant rendering scheme is designed to guarantee robust message extraction in 2D renderings of NeRF. Our proposed method can directly protect the copyright of NeRF models while maintaining high rendering quality and bit accuracy when compared among optional solutions.

BatMobility: Towards Flying Without Seeing for Autonomous Drones

  • paper_url: http://arxiv.org/abs/2307.11518
  • repo_url: None
  • paper_authors: Emerson Sie, Zikun Liu, Deepak Vasisht
  • for: This paper asks whether unmanned aerial vehicles (UAVs) can fly without relying on optical sensors, i.e., whether UAVs can fly without seeing.
  • methods: The authors propose BatMobility, a lightweight mmWave radar-only perception system that replaces optical sensors. BatMobility provides radio flow estimation (a novel FMCW radar-based alternative to optical flow based on surface-parallel Doppler shift) and radar-based collision avoidance.
  • results: BatMobility is built from commodity sensors and deployed as a real-time system on a small off-the-shelf quadcopter running an unmodified flight controller; the evaluation shows it achieves comparable or better performance than commercial-grade optical sensors across a wide range of scenarios.
    Abstract Unmanned aerial vehicles (UAVs) rely on optical sensors such as cameras and lidar for autonomous operation. However, such optical sensors are error-prone in bad lighting, inclement weather conditions including fog and smoke, and around textureless or transparent surfaces. In this paper, we ask: is it possible to fly UAVs without relying on optical sensors, i.e., can UAVs fly without seeing? We present BatMobility, a lightweight mmWave radar-only perception system for UAVs that eliminates the need for optical sensors. BatMobility enables two core functionalities for UAVs -- radio flow estimation (a novel FMCW radar-based alternative for optical flow based on surface-parallel doppler shift) and radar-based collision avoidance. We build BatMobility using commodity sensors and deploy it as a real-time system on a small off-the-shelf quadcopter running an unmodified flight controller. Our evaluation shows that BatMobility achieves comparable or better performance than commercial-grade optical sensors across a wide range of scenarios.

CORE: Cooperative Reconstruction for Multi-Agent Perception

  • paper_url: http://arxiv.org/abs/2307.11514
  • repo_url: https://github.com/zllxot/core
  • paper_authors: Binglu Wang, Lei Zhang, Zhaozhong Wang, Yongqiang Zhao, Tianfei Zhou
  • for: This paper presents CORE, a cooperative reconstruction model for multi-agent perception that improves both the effectiveness and the communication efficiency of multi-agent collaboration.
  • methods: The model has three main components: a compressor that creates a compact feature representation for each agent for efficient broadcasting, a lightweight attentive collaboration component for cross-agent message aggregation, and a reconstruction module that reconstructs the observation from the aggregated feature representations.
  • results: Validated on the large-scale multi-agent perception dataset OPV2V for 3D object detection and semantic segmentation, the model achieves state-of-the-art performance on both tasks while being more communication-efficient.
    Abstract This paper presents CORE, a conceptually simple, effective and communication-efficient model for multi-agent cooperative perception. It addresses the task from a novel perspective of cooperative reconstruction, based on two key insights: 1) cooperating agents together provide a more holistic observation of the environment, and 2) the holistic observation can serve as valuable supervision to explicitly guide the model learning how to reconstruct the ideal observation based on collaboration. CORE instantiates the idea with three major components: a compressor for each agent to create more compact feature representation for efficient broadcasting, a lightweight attentive collaboration component for cross-agent message aggregation, and a reconstruction module to reconstruct the observation based on aggregated feature representations. This learning-to-reconstruct idea is task-agnostic, and offers clear and reasonable supervision to inspire more effective collaboration, eventually promoting perception tasks. We validate CORE on OPV2V, a large-scale multi-agent percetion dataset, in two tasks, i.e., 3D object detection and semantic segmentation. Results demonstrate that the model achieves state-of-the-art performance on both tasks, and is more communication-efficient.

Bone mineral density estimation from a plain X-ray image by learning decomposition into projections of bone-segmented computed tomography

  • paper_url: http://arxiv.org/abs/2307.11513
  • repo_url: None
  • paper_authors: Yi Gu, Yoshito Otake, Keisuke Uemura, Mazen Soufi, Masaki Takao, Hugues Talbot, Seiji Okada, Nobuhiko Sugano, Yoshinobu Sato
  • for: This study aims to estimate bone mineral density (BMD) from plain X-ray images for opportunistic screening in routine clinical practice, which is potentially useful for early diagnosis of osteoporosis.
  • methods: The authors propose an efficient method that learns a decomposition of the X-ray image into projections of bone-segmented QCT for BMD estimation; the method requires only a limited amount of training data and is applicable in routine clinical practice.
  • results: The method achieves high accuracy, with Pearson correlation coefficients of 0.880 and 0.920 for DXA-measured and QCT-measured BMD estimation, respectively, and root mean square of the coefficient of variation of 3.27% to 3.79% across four measurements with different poses; extensive validation experiments, including multi-pose, uncalibrated-CT, and compression experiments, were conducted toward actual application in routine clinical practice.
    Abstract Osteoporosis is a prevalent bone disease that causes fractures in fragile bones, leading to a decline in daily living activities. Dual-energy X-ray absorptiometry (DXA) and quantitative computed tomography (QCT) are highly accurate for diagnosing osteoporosis; however, these modalities require special equipment and scan protocols. To frequently monitor bone health, low-cost, low-dose, and ubiquitously available diagnostic methods are highly anticipated. In this study, we aim to perform bone mineral density (BMD) estimation from a plain X-ray image for opportunistic screening, which is potentially useful for early diagnosis. Existing methods have used multi-stage approaches consisting of extraction of the region of interest and simple regression to estimate BMD, which require a large amount of training data. Therefore, we propose an efficient method that learns decomposition into projections of bone-segmented QCT for BMD estimation under limited datasets. The proposed method achieved high accuracy in BMD estimation, where Pearson correlation coefficients of 0.880 and 0.920 were observed for DXA-measured BMD and QCT-measured BMD estimation tasks, respectively, and the root mean square of the coefficient of variation values were 3.27 to 3.79% for four measurements with different poses. Furthermore, we conducted extensive validation experiments, including multi-pose, uncalibrated-CT, and compression experiments toward actual application in routine clinical practice.

R2Det: Redemption from Range-view for Accurate 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.11482
  • repo_url: None
  • paper_authors: Yihan Wang, Qiao Yan, Yi Wang
  • for: This paper aims to improve the accuracy of LiDAR-based 3D object detection for autonomous driving.
  • methods: It introduces the Redemption from Range-view Module (R2M), a plug-and-play approach that enhances 3D point features with range-view representations; R2M consists of a BasicBlock for 2D feature extraction, a Hierarchical-dilated Meta Kernel for expanding the 3D receptive field, and Feature Points Redemption for recovering 3D surface texture information.
  • results: Integrating R2M into state-of-the-art LiDAR-based 3D detectors as preprocessing improves mAP by 1.39%, 1.67%, and 1.97% on the easy, moderate, and hard difficulty levels of the KITTI val set, respectively; the further proposed R2Detector (R2Det) with Synchronous-Grid RoI Pooling outperforms existing range-view-based methods by a significant margin on both the KITTI benchmark and the Waymo Open Dataset.
    Abstract LiDAR-based 3D object detection is of paramount importance for autonomous driving. Recent trends show a remarkable improvement for bird's-eye-view (BEV) based and point-based methods as they demonstrate superior performance compared to range-view counterparts. This paper presents an insight that leverages range-view representation to enhance 3D points for accurate 3D object detection. Specifically, we introduce a Redemption from Range-view Module (R2M), a plug-and-play approach for 3D surface texture enhancement from the 2D range view to the 3D point view. R2M comprises BasicBlock for 2D feature extraction, Hierarchical-dilated (HD) Meta Kernel for expanding the 3D receptive field, and Feature Points Redemption (FPR) for recovering 3D surface texture information. R2M can be seamlessly integrated into state-of-the-art LiDAR-based 3D object detectors as preprocessing and achieve appealing improvement, e.g., 1.39%, 1.67%, and 1.97% mAP improvement on easy, moderate, and hard difficulty level of KITTI val set, respectively. Based on R2M, we further propose R2Detector (R2Det) with the Synchronous-Grid RoI Pooling for accurate box refinement. R2Det outperforms existing range-view-based methods by a significant margin on both the KITTI benchmark and the Waymo Open Dataset. Codes will be made publicly available.

SA-BEV: Generating Semantic-Aware Bird’s-Eye-View Feature for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.11477
  • repo_url: https://github.com/mengtan00/sa-bev
  • paper_authors: Jinqing Zhang, Yanan Zhang, Qingjie Liu, Yunhong Wang
  • for: To make autonomous driving more economical through purely camera-based bird's-eye-view (BEV) perception.
  • methods: The paper proposes Semantic-Aware BEV Pooling (SA-BEVPool), which filters out background information according to the semantic segmentation of image features and transforms image features into semantic-aware BEV features (see the sketch after this entry); it also introduces BEV-Paste, a data augmentation strategy that closely matches the semantic-aware BEV features, and a Multi-Scale Cross-Task (MSCT) head that combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately.
  • results: Experiments show that SA-BEV achieves state-of-the-art performance on nuScenes.
    Abstract Recently, the pure camera-based Bird's-Eye-View (BEV) perception provides a feasible solution for economical autonomous driving. However, the existing BEV-based multi-view 3D detectors generally transform all image features into BEV features, without considering the problem that the large proportion of background information may submerge the object information. In this paper, we propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features and transform image features into semantic-aware BEV features. Accordingly, we propose BEV-Paste, an effective data augmentation strategy that closely matches with semantic-aware BEV feature. In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately, further improving the quality of semantic-aware BEV feature. Finally, we integrate the above modules into a novel multi-view 3D object detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance. Code has been available at https://github.com/mengtan00/SA-BEV.git.
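
A toy sketch of the semantic-aware pooling idea: image features are weighted (and filtered) by a predicted foreground probability before being splatted into BEV cells, so background pixels contribute little. The shapes, threshold, and scatter-style sum pooling are illustrative assumptions rather than SA-BEVPool's exact implementation.

```python
# Semantic-aware lifting of image features into a BEV grid.
import torch

C, H, W = 16, 8, 8
bev_h, bev_w = 4, 4

img_feat = torch.randn(C, H, W)
fg_prob = torch.rand(H, W)                 # semantic-segmentation foreground score
# Assumed precomputed mapping of every image pixel to a BEV cell index.
bev_index = torch.randint(0, bev_h * bev_w, (H, W))

keep = fg_prob > 0.25                      # drop obvious background pixels
weighted = img_feat * fg_prob              # semantic-aware weighting

bev = torch.zeros(C, bev_h * bev_w)
flat_idx = bev_index[keep]                 # (P,) cell index per kept pixel
flat_feat = weighted[:, keep]              # (C, P) kept features
bev.index_add_(1, flat_idx, flat_feat)     # sum-pool features per BEV cell
bev = bev.view(C, bev_h, bev_w)
print(bev.shape)                           # torch.Size([16, 4, 4])
```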

Physics-Aware Semi-Supervised Underwater Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.11470
  • repo_url: None
  • paper_authors: Hao Qi, Xinghui Dong
  • for: To improve the quality of underwater images by addressing the degradation caused by the water transmission medium.
  • methods: The paper combines the physics-based underwater Image Formation Model (IFM) with deep learning, proposing a Physics-Aware Dual-Stream Underwater Image Enhancement Network (PA-UIENet) that comprises a Transmission Estimation Stream (T-Stream) and an Ambient Light Estimation Stream (A-Stream); an IFM-inspired semi-supervised learning framework exploits both labeled and unlabeled images to address the shortage of data (see the sketch after this entry).
  • results: The method performs better than, or at least comparably to, eight baselines across five testing sets in both degradation estimation and underwater image enhancement, presumably because it can model the degradation while also learning the characteristics of diverse underwater scenes.
    Abstract Underwater images normally suffer from degradation due to the transmission medium of water bodies. Both traditional prior-based approaches and deep learning-based methods have been used to address this problem. However, the inflexible assumption of the former often impairs their effectiveness in handling diverse underwater scenes, while the generalization of the latter to unseen images is usually weakened by insufficient data. In this study, we leverage both the physics-based underwater Image Formation Model (IFM) and deep learning techniques for Underwater Image Enhancement (UIE). To this end, we propose a novel Physics-Aware Dual-Stream Underwater Image Enhancement Network, i.e., PA-UIENet, which comprises a Transmission Estimation Steam (T-Stream) and an Ambient Light Estimation Stream (A-Stream). This network fulfills the UIE task by explicitly estimating the degradation parameters of the IFM. We also adopt an IFM-inspired semi-supervised learning framework, which exploits both the labeled and unlabeled images, to address the issue of insufficient data. Our method performs better than, or at least comparably to, eight baselines across five testing sets in the degradation estimation and UIE tasks. This should be due to the fact that it not only can model the degradation but also can learn the characteristics of diverse underwater scenes.
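
A sketch of the physics-based underwater image formation model that the two streams estimate: I = J * t + A * (1 - t), where t is the transmission map and A the ambient light. Given estimates of t and A, the scene radiance J can be recovered by inverting the model. The synthetic image and constant t / A values below are assumptions.

```python
# Underwater image formation model (IFM) and its inversion.
import numpy as np

def degrade(J, t, A):
    """Apply the IFM: observed = radiance * transmission + ambient * (1 - t)."""
    return J * t + A * (1.0 - t)

def restore(I, t, A, t_min=0.1):
    """Invert the IFM with a clipped transmission to avoid amplifying noise."""
    return (I - A * (1.0 - t)) / np.clip(t, t_min, 1.0)

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))         # clean scene radiance
t = np.full_like(J, 0.6)                          # per-pixel transmission
A = np.array([0.1, 0.5, 0.6])                     # bluish-green ambient light

I = degrade(J, t, A)                              # simulated underwater image
J_hat = restore(I, t, A)
print(np.allclose(J, J_hat))                      # True when t and A are exact
```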

MatSpectNet: Material Segmentation Network with Domain-Aware and Physically-Constrained Hyperspectral Reconstruction

  • paper_url: http://arxiv.org/abs/2307.11466
  • repo_url: https://github.com/heng-yuwen/matspectnet
  • paper_authors: Yuwen Heng, Yihong Wu, Jiawen Chen, Srinandan Dasmahapatra, Hansung Kim
  • for: To improve the accuracy of material segmentation from RGB images by exploiting recovered hyperspectral images.
  • methods: The proposed MatSpectNet constrains the reconstructed hyperspectral images using the principles of colour perception in modern cameras and employs domain adaptation to generalize the hyperspectral reconstruction capability from a spectral recovery dataset to material segmentation datasets; the reconstructed hyperspectral images are further filtered with learned response curves and enhanced with human perception.
  • results: On the LMD and OpenSurfaces datasets, MatSpectNet attains a 1.60% increase in average pixel accuracy and a 3.42% improvement in mean class accuracy compared with the most recent publication.
    Abstract Achieving accurate material segmentation for 3-channel RGB images is challenging due to the considerable variation in a material's appearance. Hyperspectral images, which are sets of spectral measurements sampled at multiple wavelengths, theoretically offer distinct information for material identification, as variations in intensity of electromagnetic radiation reflected by a surface depend on the material composition of a scene. However, existing hyperspectral datasets are impoverished regarding the number of images and material categories for the dense material segmentation task, and collecting and annotating hyperspectral images with a spectral camera is prohibitively expensive. To address this, we propose a new model, the MatSpectNet to segment materials with recovered hyperspectral images from RGB images. The network leverages the principles of colour perception in modern cameras to constrain the reconstructed hyperspectral images and employs the domain adaptation method to generalise the hyperspectral reconstruction capability from a spectral recovery dataset to material segmentation datasets. The reconstructed hyperspectral images are further filtered using learned response curves and enhanced with human perception. The performance of MatSpectNet is evaluated on the LMD dataset as well as the OpenSurfaces dataset. Our experiments demonstrate that MatSpectNet attains a 1.60% increase in average pixel accuracy and a 3.42% improvement in mean class accuracy compared with the most recent publication. The project code is attached to the supplementary material and will be published on GitHub.

Strip-MLP: Efficient Token Interaction for Vision MLP

  • paper_url: http://arxiv.org/abs/2307.11458
  • repo_url: https://github.com/med-process/strip_mlp
  • paper_authors: Guiping Cao, Shengda Luo, Wenjian Huang, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Jianguo Zhang
  • for: To improve the performance of MLP-based models on small datasets while remaining competitive with existing MLP models on ImageNet.
  • methods: A new Strip MLP layer is proposed that lets tokens interact in a cross-strip manner, together with a Cascade Group Strip Mixing Module (CGSMM) and a Local Strip Mixing Module (LSMM) to strengthen token interaction (a simplified strip-mixing sketch follows the entry).
  • results: Experiments show that Strip-MLP significantly improves performance on small datasets and obtains comparable or even better results on ImageNet; in particular, Strip-MLP models achieve a higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100.
    Abstract Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the feature are down-sampled to a small spatial size. To address this issue, we present a novel method called \textbf{Strip-MLP} to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called Strip MLP layer that allows the token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregations in adjacent but different strips of rows (or columns). Secondly, a \textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in the manners of within-patch and cross-patch, which is independent to the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel \textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on CIFAR-100. The source codes will be available at~\href{https://github.com/Med-Process/Strip_MLP{https://github.com/Med-Process/Strip\_MLP}.
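
A rough sketch of strip-wise token mixing: one linear layer mixes tokens along each row and another along each column, so a token interacts with every other token in its horizontal and vertical strips. This is a simplified stand-in for the paper's Strip MLP layer (the cross-strip grouping and the CGSMM/LSMM modules are not reproduced), and the sizes are assumptions.

```python
# Minimal row/column strip mixing layer.
import torch
import torch.nn as nn

class StripMixing(nn.Module):
    def __init__(self, h, w, channels):
        super().__init__()
        self.row_mix = nn.Linear(w, w)          # mixes tokens within each row
        self.col_mix = nn.Linear(h, h)          # mixes tokens within each column
        self.channel_mix = nn.Linear(channels, channels)

    def forward(self, x):                       # x: (B, H, W, C)
        x = x + self.row_mix(x.transpose(2, 3)).transpose(2, 3)          # mix along W
        x = x + self.col_mix(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # mix along H
        return x + self.channel_mix(x)          # per-token channel mixing

layer = StripMixing(h=14, w=14, channels=32)
tokens = torch.randn(2, 14, 14, 32)
print(layer(tokens).shape)                      # torch.Size([2, 14, 14, 32])
```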

Attention Consistency Refined Masked Frequency Forgery Representation for Generalizing Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2307.11438
  • repo_url: https://github.com/chenboluo/acmf
  • paper_authors: Decheng Liu, Tao Chen, Chunlei Peng, Nannan Wang, Ruimin Hu, Xinbo Gao
  • for: To improve the generalization ability of visual (face) forgery detection, which plays an increasingly important role in social and economic security.
  • methods: The paper proposes an Attention Consistency refined Masked Frequency forgery representation model (ACMF): a masked frequency forgery representation module explores robust forgery cues by randomly discarding high-frequency information (see the sketch after this entry), and a forgery attention consistency constraint forces detectors to focus on similar attention regions for better generalization.
  • results: Experiments on several public face forgery datasets (FaceForensics++, DFD, Celeb-DF, and WDF) show that the proposed method outperforms state-of-the-art approaches.
    Abstract Due to the successful development of deep image generation technology, visual data forgery detection would play a more important role in social and economic security. Existing forgery detection methods suffer from unsatisfactory generalization ability to determine the authenticity in the unseen domain. In this paper, we propose a novel Attention Consistency Refined masked frequency forgery representation model toward generalizing face forgery detection algorithm (ACMF). Most forgery technologies always bring in high-frequency aware cues, which make it easy to distinguish source authenticity but difficult to generalize to unseen artifact types. The masked frequency forgery representation module is designed to explore robust forgery cues by randomly discarding high-frequency information. In addition, we find that the forgery attention map inconsistency through the detection network could affect the generalizability. Thus, the forgery attention consistency is introduced to force detectors to focus on similar attention regions for better generalization ability. Experiment results on several public face forgery datasets (FaceForensic++, DFD, Celeb-DF, and WDF datasets) demonstrate the superior performance of the proposed method compared with the state-of-the-art methods.
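
A sketch of the "masked frequency" idea: transform the image to the frequency domain, randomly zero out part of the high-frequency band, and transform back, so the detector cannot rely solely on high-frequency forgery cues. The cutoff radius and drop probability are illustrative assumptions.

```python
# Randomly discard high-frequency FFT coefficients of an image.
import torch

def mask_high_frequency(img, radius_ratio=0.25, drop_prob=0.5):
    """img: (C, H, W). Randomly zeroes high-frequency FFT coefficients."""
    c, h, w = img.shape
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))

    ys = torch.arange(h).view(-1, 1) - h // 2
    xs = torch.arange(w).view(1, -1) - w // 2
    dist = torch.sqrt(ys.float() ** 2 + xs.float() ** 2)
    high = dist > radius_ratio * min(h, w)                  # high-frequency band

    keep = (torch.rand(h, w) > drop_prob) | ~high           # always keep low freqs
    freq[:, ~keep] = 0
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

img = torch.rand(3, 64, 64)
masked = mask_high_frequency(img)
print(masked.shape)                                         # torch.Size([3, 64, 64])
```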

FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.11418
  • repo_url: None
  • paper_authors: Sungwon Hwang, Junha Hyung, Daejin Kim, Min-Jung Kim, Jaegul Choo
  • for: To enable text-driven manipulation of 3D faces reconstructed with NeRF, so that non-expert users can control a NeRF-based face with a single text prompt.
  • methods: A scene manipulator, a latent-code-conditional deformable NeRF, is first trained over a dynamic scene to control face deformation with a latent code; since a single latent code is unfavorable for compositing local deformations observed in different instances, a Position-conditional Anchor Compositor (PAC) learns to represent the manipulated scene with spatially varying latent codes, and their renderings are optimized to have high cosine similarity with the target text in CLIP embedding space (see the sketch after this entry).
  • results: To the best of the authors' knowledge, this is the first approach to address text-driven manipulation of a face reconstructed with NeRF; extensive results, comparisons, and ablation studies demonstrate its effectiveness.
    Abstract As recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, its manipulation also became an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as a user-provided semantic mask and manual attribute search unsuitable for non-expert users. Instead, our approach is designed to require a single text to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene to control a face deformation using the latent code. However, representing a scene deformation with a single latent code is unfavorable for compositing local deformations observed in different instances. As so, our proposed Position-conditional Anchor Compositor (PAC) learns to represent a manipulated scene with spatially varying latent codes. Their renderings with the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space for text-driven manipulation. To the best of our knowledge, our approach is the first to address the text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach.
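
A sketch of the text-driven objective: renderings of the manipulated scene are pushed toward a target text by maximizing cosine similarity in a joint embedding space. Stand-in random "CLIP" embeddings are used so the snippet runs without the CLIP package; in practice a frozen CLIP model's image and text encoders would provide them (an assumption about the exact setup).

```python
# CLIP-style cosine-similarity loss for text-driven optimization.
import torch
import torch.nn.functional as F

def clip_similarity_loss(image_emb, text_emb):
    """1 - cosine similarity between (normalized) image and text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return 1.0 - (image_emb * text_emb).sum(dim=-1).mean()

# Pretend these came from rendering the deformed NeRF and from CLIP's encoders.
rendered_emb = torch.randn(4, 512, requires_grad=True)   # 4 rendered views
target_text_emb = torch.randn(1, 512)

loss = clip_similarity_loss(rendered_emb, target_text_emb)
loss.backward()              # gradients would flow back to the scene manipulator
print(float(loss))
```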

Subject-Diffusion:Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

  • paper_url: http://arxiv.org/abs/2307.11410
  • repo_url: https://github.com/OPPO-Mente-Lab/Subject-Diffusion
  • paper_authors: Jian Ma, Junhao Liang, Chen Chen, Haonan Lu
  • for: This paper targets open-domain personalized image generation that does not require test-time fine-tuning.
  • methods: A new open-domain personalized image generation model is proposed that needs only a single reference image to support personalized generation of single or multiple subjects in any domain; the authors build an automatic data labeling tool and a 76M-image dataset (with subject detection bounding boxes, segmentation masks, and text descriptions) from LAION-Aesthetics, design a unified framework that combines text and image semantics with coarse location and fine-grained reference image control, and adopt an attention control mechanism to support multi-subject generation.
  • results: Compared with other state-of-the-art frameworks, the method performs better in single-subject, multi-subject, and human-customized image generation.
    Abstract Recent progress in personalized image generation using diffusion models has been significant. However, development in the area of open-domain and non-fine-tuning personalized image generation is proceeding rather slowly. In this paper, we propose Subject-Diffusion, a novel open-domain personalized image generation model that, in addition to not requiring test-time fine-tuning, also only requires a single reference image to support personalized generation of single- or multi-subject in any domain. Firstly, we construct an automatic data labeling tool and use the LAION-Aesthetics dataset to construct a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks and text descriptions. Secondly, we design a new unified framework that combines text and image semantics by incorporating coarse location and fine-grained reference image control to maximize subject fidelity and generalization. Furthermore, we also adopt an attention control mechanism to support multi-subject generation. Extensive qualitative and quantitative results demonstrate that our method outperforms other SOTA frameworks in single, multiple, and human customized image generation. Please refer to our \href{https://oppo-mente-lab.github.io/subject_diffusion/}{project page}
    摘要 最近几年个性化图像生成采用扩散模型的进步很 significative。然而,开放领域和不需要微调的个性化图像生成领域的发展相对较慢。在这篇论文中,我们提议了一种新的开放领域个性化图像生成模型,即主题扩散(Subject-Diffusion)。这种模型不仅不需要测试时微调,而且只需一个参考图像来支持个性化生成单或多主题图像。首先,我们构建了一个自动数据标签工具,并使用LAION-Aesthetics dataset constructed a large-scale dataset consisting of 76M images and their corresponding subject detection bounding boxes, segmentation masks and text descriptions。然后,我们设计了一个新的统一框架,通过 combining text and image semantics,并通过粗略位置和细腻参考图像控制来 maximize subject fidelity and generalization。此外,我们还采用了一种注意控制机制来支持多主题生成。我们的方法在单、多和人自定义图像生成方面均有优秀表现。详细的质量和量测试结果可以参考我们的 \href{https://oppo-mente-lab.github.io/subject_diffusion/}{项目页面}。

Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition

  • paper_url: http://arxiv.org/abs/2307.11404
  • repo_url: https://github.com/leeisack/latent-ofer
  • paper_authors: Isack Lee, Eungi Lee, Seok Bong Yoo
  • for: Improving facial expression recognition (FER) in real-world scenes by addressing occluded FER (OFER).
  • methods: A vision transformer (ViT)-based occlusion patch detector locates occluded regions, a hybrid ViT-CNN reconstruction network restores the masked regions as if they were unoccluded, and a CNN-based class activation map extracts expression-relevant latent vectors.
  • results: Experiments on several databases demonstrate that the proposed method outperforms state-of-the-art approaches.
    Abstract Most research on facial expression recognition (FER) is conducted in highly controlled environments, but its performance is often unacceptable when applied to real-world situations. This is because when unexpected objects occlude the face, the FER network faces difficulties extracting facial features and accurately predicting facial expressions. Therefore, occluded FER (OFER) is a challenging problem. Previous studies on occlusion-aware FER have typically required fully annotated facial images for training. However, collecting facial images with various occlusions and expression annotations is time-consuming and expensive. Latent-OFER, the proposed method, can detect occlusions, restore occluded parts of the face as if they were unoccluded, and recognize them, improving FER accuracy. This approach involves three steps: First, the vision transformer (ViT)-based occlusion patch detector masks the occluded position by training only latent vectors from the unoccluded patches using the support vector data description algorithm. Second, the hybrid reconstruction network generates the masking position as a complete image using the ViT and convolutional neural network (CNN). Last, the expression-relevant latent vector extractor retrieves and uses expression-related information from all latent vectors by applying a CNN-based class activation map. This mechanism has a significant advantage in preventing performance degradation from occlusion by unseen objects. The experimental results on several databases demonstrate the superiority of the proposed method over state-of-the-art methods.
    摘要 大多数人脸表达识别(FER)研究都进行在高度控制的环境中,但其在实际场景中的性能往往不受接受。这是因为当不期望的物体遮挡面部时,FER网络很难提取面部特征并正确预测面部表达。因此,遮挡FER(OFER)成为一个挑战性问题。先前的 occlusion-aware FER 研究通常需要全部标注的面部图像进行训练。然而,收集面部图像与不同遮挡物和表达注解是时间consuming 和昂贵的。我们提出的方法是 Latent-OFER,它可以检测遮挡,还原遮挡部分的面部图像,并正确识别它们,从而提高 FER 的准确率。这个方法包括三个步骤:1. 使用支持向量数据描述算法来训练 ViT 基于的 occlusion patch detector,将遮挡位置掩码为不遮挡位置的 latent vector。2. 使用 ViT 和卷积神经网络(CNN)来生成遮挡位置的完整图像。3. 使用 CNN 来提取表达相关的 latent vector,并应用类Activation map 来找到表达相关的信息。这种机制有着显著的优势,可以避免由不可见的遮挡物引起的性能下降。我们在多个数据库上进行了多个实验,结果表明,我们的方法在 state-of-the-art 方法之上表现出了明显的优势。

CLR: Channel-wise Lightweight Reprogramming for Continual Learning

  • paper_url: http://arxiv.org/abs/2307.11386
  • repo_url: https://github.com/gyhandy/channel-wise-lightweight-reprogramming
  • paper_authors: Yunhao Ge, Yuecheng Li, Shuo Ni, Jiaping Zhao, Ming-Hsuan Yang, Laurent Itti
  • for: Continual learning: maintaining performance on previously learned tasks while learning new ones, i.e., avoiding catastrophic forgetting.
  • methods: Channel-wise Lightweight Reprogramming (CLR) keeps a task-agnostic, immutable "anchor" CNN (trained on an old task or a self-supervised proxy task) and adds cheap, task-specific reprogramming parameters that reinterpret its outputs through channel-wise linear mappings.
  • results: Achieves a better stability-plasticity trade-off, with a parameter increase of less than 0.6% per new task, and outperforms 13 state-of-the-art continual learning baselines on a challenging sequence of 53 image classification tasks.
    Abstract Continual learning aims to emulate the human ability to continually accumulate knowledge over sequential tasks. The main challenge is to maintain performance on previously learned tasks after learning new tasks, i.e., to avoid catastrophic forgetting. We propose a Channel-wise Lightweight Reprogramming (CLR) approach that helps convolutional neural networks (CNNs) overcome catastrophic forgetting during continual learning. We show that a CNN model trained on an old task (or self-supervised proxy task) could be ``reprogrammed" to solve a new task by using our proposed lightweight (very cheap) reprogramming parameter. With the help of CLR, we have a better stability-plasticity trade-off to solve continual learning problems: To maintain stability and retain previous task ability, we use a common task-agnostic immutable part as the shared ``anchor" parameter set. We then add task-specific lightweight reprogramming parameters to reinterpret the outputs of the immutable parts, to enable plasticity and integrate new knowledge. To learn sequential tasks, we only train the lightweight reprogramming parameters to learn each new task. Reprogramming parameters are task-specific and exclusive to each task, which makes our method immune to catastrophic forgetting. To minimize the parameter requirement of reprogramming to learn new tasks, we make reprogramming lightweight by only adjusting essential kernels and learning channel-wise linear mappings from anchor parameters to task-specific domain knowledge. We show that, for general CNNs, the CLR parameter increase is less than 0.6\% for any new task. Our method outperforms 13 state-of-the-art continual learning baselines on a new challenging sequence of 53 image classification datasets. Code and data are available at https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming
    摘要 “CONTINUAL LEARNING”目的是实现人类在继续完成多个任务后继续累累积累知识的能力。主要挑战是维持先前学习的任务表现,即避免“悖论”现象。我们提出了一个“通道对适应”(Channel-wise Lightweight Reprogramming,CLR)方法,帮助对称神经网络(CNNs)在继续学习中获得更好的稳定性和柔韧性。我们显示了一个 CNN 模型在旧任务(或自我超vised proxy task)上训练后可以通过我们提议的轻量级(非常便宜)重新配置参数,以解决继续学习中的悖论问题。我们在 CLR 方法中使用了一个通用任务无法适应的固定“标准”(immutable)部分作为共同“anchor”参数集,然后将任务特定的轻量级重新配置参数添加到标准部分以重新解读出PUTS,以获得更好的稳定性和柔韧性。为了学习继续任务,我们仅需要对每个新任务进行轻量级重新配置参数的训练。重新配置参数是任务特定的且对应到每个任务,这使我们的方法免于悖论现象。为了实现轻量级重新配置参数的学习,我们仅将重要的核心变化和通道对适应的linear mapping学习。我们发现,对于通用 CNN,CLR 参数增加的比例小于 0.6% для任何新任务。我们的方法在一个新的53个图像分类任务中进行了比较,并与13个现有的基eline方法进行了比较。代码和数据可以在 https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming 上获取。
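As an illustration of the channel-wise reprogramming idea described in the abstract above, the following PyTorch sketch freezes an anchor convolution and trains only a per-channel linear mapping for each task. Class names, layer sizes, and the exact form of the mapping are illustrative assumptions, not the released implementation.

```python
# Minimal sketch: frozen anchor conv + tiny task-specific channel-wise mapping.
import torch
import torch.nn as nn

class ChannelwiseReprogram(nn.Module):
    """Task-specific channel-wise linear mapping applied to frozen anchor features."""
    def __init__(self, num_channels: int):
        super().__init__()
        # One scale and bias per channel, so parameters grow only with C.
        self.scale = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return x * self.scale.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)

class ReprogrammedBlock(nn.Module):
    """Shared, immutable anchor conv plus one trainable reprogramming layer per task."""
    def __init__(self, anchor_conv: nn.Conv2d, num_tasks: int):
        super().__init__()
        self.anchor = anchor_conv
        for p in self.anchor.parameters():
            p.requires_grad = False          # task-agnostic anchor stays frozen
        self.reprogram = nn.ModuleList(
            [ChannelwiseReprogram(anchor_conv.out_channels) for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        return self.reprogram[task_id](self.anchor(x))

if __name__ == "__main__":
    block = ReprogrammedBlock(nn.Conv2d(3, 16, 3, padding=1), num_tasks=3)
    out = block(torch.randn(2, 3, 32, 32), task_id=1)
    print(out.shape)  # torch.Size([2, 16, 32, 32])
```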

LatentAugment: Data Augmentation via Guided Manipulation of GAN’s Latent Space

  • paper_url: http://arxiv.org/abs/2307.11375
  • repo_url: https://github.com/ltronchin/latentaugment
  • paper_authors: Lorenzo Tronchin, Minh H. Vu, Paolo Soda, Tommy Löfstedt
  • for: Increasing the quantity and diversity of training data to reduce overfitting and improve generalisation.
  • methods: Uses a generative adversarial network (GAN) to produce high-quality synthetic data, manipulating latent vectors without external supervision so as to increase mode coverage and diversity.
  • results: LatentAugment improves the generalisation of a deep MRI-to-CT translation model and surpasses GAN-based sampling in mode coverage and diversity.
    Abstract Data Augmentation (DA) is a technique to increase the quantity and diversity of the training data, and by that alleviate overfitting and improve generalisation. However, standard DA produces synthetic data for augmentation with limited diversity. Generative Adversarial Networks (GANs) may unlock additional information in a dataset by generating synthetic samples having the appearance of real images. However, these models struggle to simultaneously address three key requirements: fidelity and high-quality samples; diversity and mode coverage; and fast sampling. Indeed, GANs generate high-quality samples rapidly, but have poor mode coverage, limiting their adoption in DA applications. We propose LatentAugment, a DA strategy that overcomes the low diversity of GANs, opening up for use in DA applications. Without external supervision, LatentAugment modifies latent vectors and moves them into latent space regions to maximise the synthetic images' diversity and fidelity. It is also agnostic to the dataset and the downstream task. A wide set of experiments shows that LatentAugment improves the generalisation of a deep model translating from MRI-to-CT beating both standard DA as well GAN-based sampling. Moreover, still in comparison with GAN-based sampling, LatentAugment synthetic samples show superior mode coverage and diversity. Code is available at: https://github.com/ltronchin/LatentAugment.
    摘要 <>转换文本到简化中文。<>数据扩展(DA)是一种技术,以增加训练数据的量和多样性,从而缓解过拟合和提高泛化。然而,标准的DA产生的合成数据具有有限的多样性。生成对抗网络(GANs)可以通过生成具有真实图像的样式的合成样本,从而激活数据中的额外信息。然而,这些模型很难同时满足三个关键要求:准确性和高质量样本;多样性和模式覆盖率;和快速采样。indeed,GANs可以快速生成高质量样本,但它们的模式覆盖率很低,限制了它们在DA应用中的采用。我们提出了LatentAugment,一种DA策略,可以在GANs中增加多样性,使其在DA应用中使用。无需外部监督,LatentAugment会修改缺少的缺省向量,将其移动到缺省空间中,以最大化合成图像的多样性和准确性。它还是数据aset和下游任务无关的。一系列实验表明,LatentAugment可以提高一个深度模型从MRI-to-CT的翻译,比标准DA和GAN-based sampling更好。此外,与GAN-based sampling相比,LatentAugment的合成样本还显示出更高的模式覆盖率和多样性。代码可以在:https://github.com/ltronchin/LatentAugment.
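The following is a loose, hypothetical sketch of guided latent manipulation for augmentation: a few gradient steps push GAN latents toward higher critic scores (fidelity) and larger pairwise spread (diversity). The `generator` and `critic` callables, weights, and objective are placeholders and do not reproduce the paper's actual optimization.

```python
# Illustrative only: optimize GAN latents for fidelity + diversity before sampling.
import torch

def latent_augment(generator, critic, batch_size=8, latent_dim=128,
                   steps=10, lr=0.05, w_fidelity=1.0, w_diversity=0.1):
    z = torch.randn(batch_size, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        imgs = generator(z)
        fidelity = critic(imgs).mean()        # higher = judged more realistic
        diversity = torch.pdist(z).mean()     # spread latents apart in latent space
        loss = -(w_fidelity * fidelity + w_diversity * diversity)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return generator(z)                   # synthetic images for augmentation
```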

Photo2Relief: Let Human in the Photograph Stand Out

  • paper_url: http://arxiv.org/abs/2307.11364
  • repo_url: None
  • paper_authors: Zhongping Ji, Feifei Che, Hanshuo Liu, Ziyi Zhao, Yu-Wei Zhang, Wenping Wang
  • for: Creating digital 2.5D artworks from photographs that depict the whole-body activity of the person in the picture.
  • methods: A sigmoid variant function manipulates gradients tactfully, the networks are trained with a loss function defined in the gradient domain, image-based rendering handles different lighting conditions, and a two-scale architecture divides the work between network modules.
  • results: Experiments on a variety of scenes show that the method efficiently generates high-quality, detailed digital 2.5D artwork from photographs.
    Abstract In this paper, we propose a technique for making humans in photographs protrude like reliefs. Unlike previous methods which mostly focus on the face and head, our method aims to generate art works that describe the whole body activity of the character. One challenge is that there is no ground-truth for supervised deep learning. We introduce a sigmoid variant function to manipulate gradients tactfully and train our neural networks by equipping with a loss function defined in gradient domain. The second challenge is that actual photographs often across different light conditions. We used image-based rendering technique to address this challenge and acquire rendering images and depth data under different lighting conditions. To make a clear division of labor in network modules, a two-scale architecture is proposed to create high-quality relief from a single photograph. Extensive experimental results on a variety of scenes show that our method is a highly effective solution for generating digital 2.5D artwork from photographs.
    摘要 在这篇论文中,我们提出了一种技术,使人像中的人物凸出如 relief 一般。与先前的方法一样,我们的方法不仅关注人脸和头部,而是通过生成描述人物全身活动的艺术作品。一个挑战是没有超出真实数据的观察数据,我们引入了截然函数来策略性地操作梯度,并通过定义梯度领域的损失函数来训练我们的神经网络。另一个挑战是实际照片通常在不同的照明条件下拍摄,我们使用图像基于的渲染技术来解决这个问题,并获得不同照明条件下的渲染图像和深度数据。为了在网络模块之间进行清晰的划分工作,我们提议了一种两级架构,从单个照片中生成高质量的 relief。我们的实验结果表明,我们的方法是生成数字2.5D艺术作品从照片中的高效解决方案。
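As a small illustration of training in the gradient domain mentioned above, the sketch below compares predicted and target height maps through their finite-difference gradients. It is a generic gradient-domain L1 loss, not the authors' exact formulation (which additionally uses a sigmoid variant to manipulate gradients).

```python
# Generic gradient-domain loss sketch for relief (height map) prediction.
import torch
import torch.nn.functional as F

def image_gradients(h: torch.Tensor):
    """Finite-difference gradients of a (B, 1, H, W) height map."""
    dx = h[..., :, 1:] - h[..., :, :-1]
    dy = h[..., 1:, :] - h[..., :-1, :]
    return dx, dy

def gradient_domain_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    pdx, pdy = image_gradients(pred)
    tdx, tdy = image_gradients(target)
    return F.l1_loss(pdx, tdx) + F.l1_loss(pdy, tdy)

if __name__ == "__main__":
    pred, target = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
    print(gradient_domain_loss(pred, target).item())
```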

ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection

  • paper_url: http://arxiv.org/abs/2307.11360
  • repo_url: None
  • paper_authors: Daria Reshetova, Guanhang Wu, Marcel Puyat, Chunhui Gu, Huizhong Chen
  • for: Improving pedestrian detection models so they are more robust and reliable in real-world applications.
  • methods: A generative adversarial network (GAN) performs parameterized unpaired image-to-image translation of synthetically generated pedestrian images into more realistic ones, reducing the synthetic-to-real domain gap.
  • results: Experiments show the GAN produces high-quality, visually plausible pedestrian images without requiring labels of the real domain, making the approach applicable to a variety of downstream tasks.
    Abstract Object detection is the key technique to a number of Computer Vision applications, but it often requires large amounts of annotated data to achieve decent results. Moreover, for pedestrian detection specifically, the collected data might contain some personally identifiable information (PII), which is highly restricted in many countries. This label intensive and privacy concerning task has recently led to an increasing interest in training the detection models using synthetically generated pedestrian datasets collected with a photo-realistic video game engine. The engine is able to generate unlimited amounts of data with precise and consistent annotations, which gives potential for significant gains in the real-world applications. However, the use of synthetic data for training introduces a synthetic-to-real domain shift aggravating the final performance. To close the gap between the real and synthetic data, we propose to use a Generative Adversarial Network (GAN), which performsparameterized unpaired image-to-image translation to generate more realistic images. The key benefit of using the GAN is its intrinsic preference of low-level changes to geometric ones, which means annotations of a given synthetic image remain accurate even after domain translation is performed thus eliminating the need for labeling real data. We extensively experimented with the proposed method using MOTSynth dataset to train and MOT17 and MOT20 detection datasets to test, with experimental results demonstrating the effectiveness of this method. Our approach not only produces visually plausible samples but also does not require any labels of the real domain thus making it applicable to the variety of downstream tasks.
    摘要 Computer视觉应用中的对象检测是关键技术,但它通常需要大量注解数据来 достичь良好的结果。此外,人员检测特别是可能包含个人标识信息(PII),这在许多国家是非常限制的。这种标注密集和隐私担忧的任务最近引起了使用synthetically生成的人员数据集来训练检测模型的兴趣。这个引擎可以生成无限量的数据,并且可以提供精确和一致的注释,这给了实际应用中的可能性。然而,使用synthetic数据进行训练会导致synthetic-to-real域shift,从而使得最终性能下降。为了封闭这个域shift,我们提议使用生成对抗网络(GAN),它可以进行参数化的无对比image-to-image翻译,以生成更真实的图像。GAN的关键优点在于它对低级别的变化偏好,这意味着注释给定的synthetic图像保持正确,even after domain translation is performed,因此不需要标注实际数据。我们对提议方法进行了广泛的实验,使用MOTSynth数据集进行训练,并使用MOT17和MOT20检测数据集进行测试,实验结果表明了该方法的有效性。我们的方法不仅可以生成可见的样本,而且不需要实际域的标注,因此可以应用于多个下游任务。

Tuning Pre-trained Model via Moment Probing

  • paper_url: http://arxiv.org/abs/2307.11342
  • repo_url: https://github.com/mingzeg/moment-probing
  • paper_authors: Mingze Gao, Qilong Wang, Zhenyi Lin, Pengfei Zhu, Qinghua Hu, Jingbo Zhou
  • for: Efficient tuning of large-scale pre-trained models by further exploring the potential of the linear probing (LP) module.
  • methods: Moment Probing (MP) performs linear classification on the feature distribution, characterised by first- and second-order moments, with a multi-head convolutional cross-covariance (MHC$^3$) module computing second-order moments efficiently; MP$_{+}$ additionally learns two recalibrating parameters for the backbone.
  • results: MP significantly outperforms LP and is competitive with counterparts at lower training cost, while MP$_{+}$ achieves state-of-the-art performance.
    Abstract Recently, efficient fine-tuning of large-scale pre-trained models has attracted increasing research interests, where linear probing (LP) as a fundamental module is involved in exploiting the final representations for task-dependent classification. However, most of the existing methods focus on how to effectively introduce a few of learnable parameters, and little work pays attention to the commonly used LP module. In this paper, we propose a novel Moment Probing (MP) method to further explore the potential of LP. Distinguished from LP which builds a linear classification head based on the mean of final features (e.g., word tokens for ViT) or classification tokens, our MP performs a linear classifier on feature distribution, which provides the stronger representation ability by exploiting richer statistical information inherent in features. Specifically, we represent feature distribution by its characteristic function, which is efficiently approximated by using first- and second-order moments of features. Furthermore, we propose a multi-head convolutional cross-covariance (MHC$^3$) to compute second-order moments in an efficient and effective manner. By considering that MP could affect feature learning, we introduce a partially shared module to learn two recalibrating parameters (PSRP) for backbones based on MP, namely MP$_{+}$. Extensive experiments on ten benchmarks using various models show that our MP significantly outperforms LP and is competitive with counterparts at less training cost, while our MP$_{+}$ achieves state-of-the-art performance.
    摘要 近期,高效练级预训模型的调整吸引了增加的研究兴趣,其中线性探测(LP)作为基本模块被利用来挖掘任务特定的分类表现。然而,大多数现有方法都是如何效果地引入一些学习参数的问题,而忽略了通常使用的LP模块。在这篇论文中,我们提出了一种新的幂值探测(MP)方法,可以进一步探索LP的潜力。与LP不同,MP不是根据最终特征(如ViT中的单词符)或分类符建立线性分类头,而是在特征分布上进行线性分类,从而获得更强的表达能力,因为它可以利用特征内置的更加丰富的统计信息。具体来说,我们使用特征分布的特征函数来表示特征分布,该函数可以高效地被approximated通过使用特征的第一和第二 moments。此外,我们还提出了一种多头 convolutional cross-covariance(MHC$^3)来计算特征分布中的第二 moments,以实现高效和有效地计算。由于MP可能会影响特征学习,我们引入了一个共享模块来学习两个修正参数(PSRP),即MP$_{+}$.经验表明,我们的MP在十个benchmark上与LP和其他方法进行比较,显示MP显著超过LP,并且与其他方法在训练成本下更低的情况下具有竞争力。此外,我们的MP$_{+}$实现了状态计算的表现。
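A minimal sketch of the moment-probing idea: classify a summary of the token feature distribution built from first- and second-order moments instead of only the mean or class token. Using the per-channel variance as the second-order term is a cheap stand-in for the paper's multi-head convolutional cross-covariance; shapes and names are illustrative.

```python
# Sketch: linear head on (mean, variance) moments of backbone tokens.
import torch
import torch.nn as nn

class MomentProbe(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # first-order moment (mean) + diagonal second-order moment (variance)
        self.head = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, D)
        mean = tokens.mean(dim=1)                              # (B, D) first moment
        var = tokens.var(dim=1, unbiased=False)                # (B, D) second moment (diag)
        return self.head(torch.cat([mean, var], dim=-1))

if __name__ == "__main__":
    probe = MomentProbe(feat_dim=768, num_classes=100)
    logits = probe(torch.randn(4, 197, 768))                   # e.g. ViT token features
    print(logits.shape)  # torch.Size([4, 100])
```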

Character Time-series Matching For Robust License Plate Recognition

  • paper_url: http://arxiv.org/abs/2307.11336
  • repo_url: https://github.com/chequanghuy/Character-Time-series-Matching
  • paper_authors: Quang Huy Che, Tung Do Thanh, Cuong Truong Van
  • For: This paper aims to improve license plate recognition accuracy in real-world situations by tracking the license plate across multiple frames.
  • Methods: The proposed method uses the Adaptive License Plate Rotation algorithm to correctly align the detected license plate, and a new method called Character Time-series Matching to recognize license plate characters from multiple consecutive frames.
  • Results: The proposed method achieved 96.7% accuracy on the UFPR-ALPR dataset in real time on an RTX A5000 GPU card, and achieved license plate detection and character recognition accuracy of 0.881 and 0.979 $mAP^{test}$@.5 respectively in the Vietnamese ALPR system.
    Abstract Automatic License Plate Recognition (ALPR) is becoming a popular study area and is applied in many fields such as transportation or smart city. However, there are still several limitations when applying many current methods to practical problems due to the variation in real-world situations such as light changes, unclear License Plate (LP) characters, and image quality. Almost recent ALPR algorithms process on a single frame, which reduces accuracy in case of worse image quality. This paper presents methods to improve license plate recognition accuracy by tracking the license plate in multiple frames. First, the Adaptive License Plate Rotation algorithm is applied to correctly align the detected license plate. Second, we propose a method called Character Time-series Matching to recognize license plate characters from many consequence frames. The proposed method archives high performance in the UFPR-ALPR dataset which is \boldmath$96.7\%$ accuracy in real-time on RTX A5000 GPU card. We also deploy the algorithm for the Vietnamese ALPR system. The accuracy for license plate detection and character recognition are 0.881 and 0.979 $mAP^{test}$@.5 respectively. The source code is available at https://github.com/chequanghuy/Character-Time-series-Matching.git
    摘要 自动识别车牌(ALPR)已成为当前研究领域之一,并应用于交通和智能城市等领域。然而,现有许多方法在实际问题中仍然存在一些限制,主要是因为实际情况下的光线变化、车牌字符模糊和图像质量等问题。大多数当前ALPR算法都是基于单帧处理,这会导致图像质量更差时的准确率下降。本文提出了一种改进车牌识别精度的方法,通过跟踪车牌在多帧中的变化。首先,我们提出了一种适应车牌旋转算法,以正确地对检测到的车牌进行对齐。然后,我们提出了一种名为 Character Time-series Matching的方法,用于在多个后续帧中识别车牌字符。我们在UFPR-ALPR数据集上测试了该方法,实时性达到96.7%,并在RTX A5000 GPU卡上进行了测试。此外,我们还部署了该算法于越南ALPR系统,车牌检测精度和字符识别精度分别为0.881和0.979 $mAP^{test}$@.5。源代码可以在https://github.com/chequanghuy/Character-Time-series-Matching.git中下载。
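A simplified sketch of aggregating per-frame recognition results for one tracked plate: per-position majority voting over consecutive frames. This is a simplification of Character Time-series Matching, shown only to illustrate why multi-frame aggregation is more robust than single-frame recognition; the plate strings below are made up.

```python
# Sketch: combine per-frame OCR strings of one tracked plate by majority vote.
from collections import Counter

def aggregate_plate_readings(readings: list) -> str:
    """readings: plate strings recognized in consecutive frames of the same track."""
    if not readings:
        return ""
    # keep only readings with the most common length so character positions align
    common_len = Counter(len(r) for r in readings).most_common(1)[0][0]
    aligned = [r for r in readings if len(r) == common_len]
    result = []
    for pos in range(common_len):
        votes = Counter(r[pos] for r in aligned)
        result.append(votes.most_common(1)[0][0])
    return "".join(result)

if __name__ == "__main__":
    frames = ["51F12345", "51F12845", "51F12345", "5IF12345"]
    print(aggregate_plate_readings(frames))  # "51F12345"
```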

Improving Transferability of Adversarial Examples via Bayesian Attacks

  • paper_url: http://arxiv.org/abs/2307.11334
  • repo_url: None
  • paper_authors: Qizhang Li, Yiwen Guo, Xiaochen Yang, Wangmeng Zuo, Hao Chen
  • for: Improving the transferability of adversarial examples across models.
  • methods: Incorporates a Bayesian formulation into both the model parameters and the model input, jointly diversifying them; advanced approximations of the posterior over the input and a principled fine-tuning objective that encourages flat minima are also introduced.
  • results: The combined formulation yields significant transferability gains, surpassing all state-of-the-art methods when attacking without model fine-tuning, and improves the average success rate on ImageNet and CIFAR-10 by 19.14% and 2.08% over the basic Bayesian method.
    Abstract This paper presents a substantial extension of our work published at ICLR. Our ICLR work advocated for enhancing transferability in adversarial examples by incorporating a Bayesian formulation into model parameters, which effectively emulates the ensemble of infinitely many deep neural networks, while, in this paper, we introduce a novel extension by incorporating the Bayesian formulation into the model input as well, enabling the joint diversification of both the model input and model parameters. Our empirical findings demonstrate that: 1) the combination of Bayesian formulations for both the model input and model parameters yields significant improvements in transferability; 2) by introducing advanced approximations of the posterior distribution over the model input, adversarial transferability achieves further enhancement, surpassing all state-of-the-arts when attacking without model fine-tuning. Moreover, we propose a principled approach to fine-tune model parameters in such an extended Bayesian formulation. The derived optimization objective inherently encourages flat minima in the parameter space and input space. Extensive experiments demonstrate that our method achieves a new state-of-the-art on transfer-based attacks, improving the average success rate on ImageNet and CIFAR-10 by 19.14% and 2.08%, respectively, when comparing with our ICLR basic Bayesian method. We will make our code publicly available.
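A hedged sketch of the input-side idea: approximate an expectation over a distribution on the input by averaging gradients across noisy copies of the image inside an iterative FGSM-style transfer attack. This is a generic Monte-Carlo approximation, not the paper's posterior approximation or its parameter-space formulation; hyperparameters are illustrative.

```python
# Sketch: iterative signed-gradient attack with Monte-Carlo averaging over input noise.
import torch

def noisy_input_attack(model, x, y, eps=8/255, alpha=2/255, steps=10,
                       n_samples=8, sigma=0.05):
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.zeros_like(x_adv)
        for _ in range(n_samples):                      # Monte-Carlo over the input
            noisy = x_adv + sigma * torch.randn_like(x_adv)
            loss = loss_fn(model(noisy), y)
            grad += torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```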

EndoSurf: Neural Surface Reconstruction of Deformable Tissues with Stereo Endoscope Videos

  • paper_url: http://arxiv.org/abs/2307.11307
  • repo_url: https://github.com/ruyi-zha/endosurf
  • paper_authors: Ruyi Zha, Xuelian Cheng, Hongdong Li, Mehrtash Harandi, Zongyuan Ge
  • for: reconstruction of soft tissues from stereo endoscope videos
  • methods: neural-field-based method (EndoSurf) using deformation field, SDF field, and radiance field for shape and texture representation
  • results: significantly outperforms existing solutions in reconstructing high-fidelity shapes, as demonstrated by experiments on public endoscope datasets.
    Abstract Reconstructing soft tissues from stereo endoscope videos is an essential prerequisite for many medical applications. Previous methods struggle to produce high-quality geometry and appearance due to their inadequate representations of 3D scenes. To address this issue, we propose a novel neural-field-based method, called EndoSurf, which effectively learns to represent a deforming surface from an RGBD sequence. In EndoSurf, we model surface dynamics, shape, and texture with three neural fields. First, 3D points are transformed from the observed space to the canonical space using the deformation field. The signed distance function (SDF) field and radiance field then predict their SDFs and colors, respectively, with which RGBD images can be synthesized via differentiable volume rendering. We constrain the learned shape by tailoring multiple regularization strategies and disentangling geometry and appearance. Experiments on public endoscope datasets demonstrate that EndoSurf significantly outperforms existing solutions, particularly in reconstructing high-fidelity shapes. Code is available at https://github.com/Ruyi-Zha/endosurf.git.
    摘要 <>很多医疗应用都需要从斯tereo激光镜视频中重建软组织。现有方法很难生成高质量的几何和外观,因为它们不能准确地表示3D场景。为解决这个问题,我们提出了一种基于神经场的新方法,叫做EndoSurf。EndoSurf可以从RGBD序列中有效地学习表示变形表面。在EndoSurf中,我们使用三个神经场来模型表面动态、形状和Texture。首先,3D点从观察空间转换到Canonical空间使用扭变场。然后,SDF场和颜色场使用梯度渲染算法来预测它们的SDF和颜色。通过这种方式,我们可以通过 differentiable volume rendering来synthesizeRGBD图像。我们使用多种正则化策略和分解几何和外观来约束学习的形状。实验表明,EndoSurf在公共激光镜数据集上表现出色,特别是在重建高精度形状方面。代码可以在https://github.com/Ruyi-Zha/endosurf.git中找到。
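A minimal sketch of the neural-field decomposition described above: one MLP deforms observed-space points (conditioned on time) into a canonical space and a second MLP predicts signed distances there. Network sizes, activations, and the omission of the radiance field and positional encodings are simplifying assumptions.

```python
# Sketch: deformation field + SDF field MLPs for a deforming surface.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128, depth=4):
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.Softplus(beta=100)]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class DeformableSDF(nn.Module):
    def __init__(self):
        super().__init__()
        self.deform = mlp(3 + 1, 3)   # (x, y, z, t) -> offset into canonical space
        self.sdf = mlp(3, 1)          # canonical point -> signed distance

    def forward(self, pts: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        canonical = pts + self.deform(torch.cat([pts, t], dim=-1))
        return self.sdf(canonical)

if __name__ == "__main__":
    model = DeformableSDF()
    pts = torch.rand(1024, 3)
    t = torch.full((1024, 1), 0.3)
    print(model(pts, t).shape)  # torch.Size([1024, 1])
```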

MAS: Towards Resource-Efficient Federated Multiple-Task Learning

  • paper_url: http://arxiv.org/abs/2307.11285
  • repo_url: None
  • paper_authors: Weiming Zhuang, Yonggang Wen, Lingjuan Lyu, Shuai Zhang
  • For: This paper proposes a new approach to training multiple simultaneous federated learning (FL) tasks on resource-constrained devices, which can improve the performance and efficiency of FL training.
  • Methods: The proposed approach, called MAS (Merge and Split), merges multiple FL tasks into an all-in-one task and then splits it into two or more tasks based on the affinities among tasks. It continues training each split of tasks using model parameters from the all-in-one training.
  • Results: Extensive experiments demonstrate that MAS outperforms other methods while reducing training time by 2x and reducing energy consumption by 40%.
    Abstract Federated learning (FL) is an emerging distributed machine learning method that empowers in-situ model training on decentralized edge devices. However, multiple simultaneous FL tasks could overload resource-constrained devices. In this work, we propose the first FL system to effectively coordinate and train multiple simultaneous FL tasks. We first formalize the problem of training simultaneous FL tasks. Then, we present our new approach, MAS (Merge and Split), to optimize the performance of training multiple simultaneous FL tasks. MAS starts by merging FL tasks into an all-in-one FL task with a multi-task architecture. After training for a few rounds, MAS splits the all-in-one FL task into two or more FL tasks by using the affinities among tasks measured during the all-in-one training. It then continues training each split of FL tasks based on model parameters from the all-in-one training. Extensive experiments demonstrate that MAS outperforms other methods while reducing training time by 2x and reducing energy consumption by 40%. We hope this work will inspire the community to further study and optimize training simultaneous FL tasks.
    摘要 Federated 学习(FL)是一种emerging的分布式机器学习方法,它允许在分布式边缘设备上进行 situ 模型训练。然而,多个同时进行 FL 任务可能会过载resource-constrained设备。在这种工作中,我们提出了首个有效地协调和训练多个同时进行 FL 任务的 FL 系统。我们首先将同时进行 FL 任务的训练问题正式化。然后,我们提出了我们的新方法,MAS(合并和分割),以便优化训练多个同时进行 FL 任务的性能。MAS 开始是将 FL 任务合并成一个所有任务的 FL 任务,并使用多任务架构进行训练。在训练一些往返后,MAS 使用在所有训练中测量的任务之间的相互关系,将所有任务合并成两个或更多个 FL 任务。然后,它继续基于所有训练中的模型参数进行每个分割的 FL 任务的训练。我们的实验证明,MAS 在减少训练时间和能量消耗的情况下,能够超越其他方法。我们希望这种工作能够鼓励社区进一步研究和优化训练同时进行 FL 任务。

Learning to Segment from Noisy Annotations: A Spatial Correction Approach

  • paper_url: http://arxiv.org/abs/2308.02498
  • repo_url: https://github.com/michaelofsbu/spatialcorrection
  • paper_authors: Jiachen Yao, Yikai Zhang, Songzhu Zheng, Mayank Goswami, Prateek Prasanna, Chao Chen
  • for: Handling label noise in medical image segmentation, where annotations are error-prone.
  • methods: A Markov model of segmentation label noise that encodes both spatial correlation and bias, together with a label correction method that progressively recovers the true labels, with theoretical guarantees of correctness.
  • results: Experiments show the approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.
    Abstract Noisy labels can significantly affect the performance of deep neural networks (DNNs). In medical image segmentation tasks, annotations are error-prone due to the high demand in annotation time and in the annotators' expertise. Existing methods mostly assume noisy labels in different pixels are \textit{i.i.d}. However, segmentation label noise usually has strong spatial correlation and has prominent bias in distribution. In this paper, we propose a novel Markov model for segmentation noisy annotations that encodes both spatial correlation and bias. Further, to mitigate such label noise, we propose a label correction method to recover true label progressively. We provide theoretical guarantees of the correctness of the proposed method. Experiments show that our approach outperforms current state-of-the-art methods on both synthetic and real-world noisy annotations.
    摘要 “噪音标签可以严重影响深度神经网络（DNNs）的性能。医疗影像分类任务中的标签通常受到高度的标签时间和标签专家的要求，因此标签损害很普遍。现有的方法通常假设不同像素的标签噪音是独立同分布（i.i.d）。但是，分类标签噪音通常具有强烈的空间相关性和明显的偏好性。在这篇论文中，我们提出了一个新的Markov模型，用于描述分类标签噪音的特性。此外，我们也提出了一个标签更正方法，可以逐步地更正true标签。我们提供了理论保证方法的正确性。实验结果显示，我们的方法在实验和实际标签噪音上都能够超过目前的州Of-the-art方法。”

Screening Mammography Breast Cancer Detection

  • paper_url: http://arxiv.org/abs/2307.11274
  • repo_url: https://github.com/chakrabortyde/rsna-breast-cancer
  • paper_authors: Debajyoti Chakraborty
  • for: Improving the efficiency and accuracy of breast cancer screening, reducing cost and the patient anxiety caused by false positives.
  • methods: Automated breast cancer detection methods benchmarked against the RSNA dataset of radiographic breast images of roughly 20,000 female patients.
  • results: The tested methods yielded an average validation case pF1 score of 0.56.
    Abstract Breast cancer is a leading cause of cancer-related deaths, but current programs are expensive and prone to false positives, leading to unnecessary follow-up and patient anxiety. This paper proposes a solution to automated breast cancer detection, to improve the efficiency and accuracy of screening programs. Different methodologies were tested against the RSNA dataset of radiographic breast images of roughly 20,000 female patients and yielded an average validation case pF1 score of 0.56 across methods.
    摘要 乳癌是癌症related deaths的主要原因,但现有的Programs 昂贵并且容易出现假阳性结果,导致不必要的跟进和患者担忧。这篇论文提出一种自动乳癌检测方案,以提高检测计划的效率和准确率。不同的方法ologies 在 RSNA 数据集上测试,对约20,000名女性患者的 radiographic 乳影像进行了测试,Validation case pF1 score 的平均值为 0.56。

SimCol3D – 3D Reconstruction during Colonoscopy Challenge

  • paper_url: http://arxiv.org/abs/2307.11261
  • repo_url: None
  • paper_authors: Anita Rau, Sophia Bano, Yueming Jin, Pablo Azagra, Javier Morlana, Edward Sanderson, Bogdan J. Matuszewski, Jae Young Lee, Dong-Jae Lee, Erez Posner, Netanel Frank, Varshini Elangovan, Sista Raviteja, Zhengwen Li, Jiquan Liu, Seenivasan Lalithkumar, Mobarakol Islam, Hongliang Ren, José M. M. Montiel, Danail Stoyanov
  • for: Supporting colorectal cancer screening by enabling 3D reconstruction during colonoscopy.
  • methods: Learning-based approaches for depth and pose prediction from colonoscopy video, benchmarked in the SimCol3D EndoVis sub-challenge at MICCAI 2022.
  • results: Depth prediction in virtual colonoscopy is robustly solvable, while pose estimation remains an open research question.
    Abstract Colorectal cancer is one of the most common cancers in the world. While colonoscopy is an effective screening technique, navigating an endoscope through the colon to detect polyps is challenging. A 3D map of the observed surfaces could enhance the identification of unscreened colon tissue and serve as a training platform. However, reconstructing the colon from video footage remains unsolved due to numerous factors such as self-occlusion, reflective surfaces, lack of texture, and tissue deformation that limit feature-based methods. Learning-based approaches hold promise as robust alternatives, but necessitate extensive datasets. By establishing a benchmark, the 2022 EndoVis sub-challenge SimCol3D aimed to facilitate data-driven depth and pose prediction during colonoscopy. The challenge was hosted as part of MICCAI 2022 in Singapore. Six teams from around the world and representatives from academia and industry participated in the three sub-challenges: synthetic depth prediction, synthetic pose prediction, and real pose prediction. This paper describes the challenge, the submitted methods, and their results. We show that depth prediction in virtual colonoscopy is robustly solvable, while pose estimation remains an open research question.
    摘要 抗rectal cancer是全球最常见的癌症之一。虽然colonoscopy是一种有效的检测技术,但是通过endooscope检测colon中的质量是具有挑战性的。一个3D地图可以增强未检查的colon组织识别和作为培训平台。然而,从视频足本中重建colon仍然是一个未解决的问题,因为自我遮挡、反射表面、缺乏Texture和组织变形等多种因素限制了基于特征的方法。学习基于方法具有潜在的优势,但它们需要大量的数据。为了实现这一目标,2022年的EndoVis子挑战SimCol3D在新加坡的MICCAI 2022会议上举行。六支来自全球的团队和学术界和产业界的代表参加了三个子挑战: sintetic depth prediction、syntetic pose prediction和real pose prediction。本文描述了这一挑战,提交的方法以及其结果。我们显示了虚拟colonoscopy中的深度预测是可靠地解决的,而pose预测仍然是一个开放的研究问题。

Towards Non-Parametric Models for Confidence Aware Image Prediction from Low Data using Gaussian Processes

  • paper_url: http://arxiv.org/abs/2307.11259
  • repo_url: None
  • paper_authors: Nikhil U. Shinde, Florian Richter, Michael C. Yip
  • for: Predicting future images of an image sequence from very little training data.
  • methods: Non-parametric Gaussian Process models take a probabilistic approach to image prediction, generating distributions over sequentially predicted images and propagating uncertainty through time to produce a confidence metric.
  • results: Successfully predicts future frames of a smooth fluid simulation environment.
    Abstract The ability to envision future states is crucial to informed decision making while interacting with dynamic environments. With cameras providing a prevalent and information rich sensing modality, the problem of predicting future states from image sequences has garnered a lot of attention. Current state of the art methods typically train large parametric models for their predictions. Though often able to predict with accuracy, these models rely on the availability of large training datasets to converge to useful solutions. In this paper we focus on the problem of predicting future images of an image sequence from very little training data. To approach this problem, we use non-parametric models to take a probabilistic approach to image prediction. We generate probability distributions over sequentially predicted images and propagate uncertainty through time to generate a confidence metric for our predictions. Gaussian Processes are used for their data efficiency and ability to readily incorporate new training data online. We showcase our method by successfully predicting future frames of a smooth fluid simulation environment.
    摘要 <>预测未来状态的能力对于在动态环境中决策是非常重要的。由于摄像头是一种非常普遍和信息充沛的感知方式,预测未来状态从图像序列中得到的问题已经吸引了很多注意。当前的状态艺术方法通常是通过大型参数模型进行预测。虽然它们经常可以准确预测,但它们需要大量的训练数据来得到有用的解决方案。在这篇论文中,我们关注的是从非常少的训练数据中预测图像序列的未来帧。为了解决这个问题,我们使用非Parametric模型采取一种 probabilistic 的方法来预测图像。我们生成图像序列中预测的probability分布,并将uncertainty通过时间进行传播,以生成一个 confidence 度量器 для我们的预测。使用 Gaussian Processes 的数据效率和能够轻松地在线上添加新的训练数据,我们成功地预测了一个平滑的液体流体动画环境中的未来帧。
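A toy sketch of the probabilistic idea under strong assumptions: an independent Gaussian Process per pixel, fit to a short downsampled sequence, predicts the next frame together with a per-pixel standard deviation that can serve as a confidence map. The actual method is more sophisticated; this only illustrates how GPs provide uncertainty from little data.

```python
# Toy sketch: per-pixel GP regression over time to predict the next frame.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def predict_next_frame(frames: np.ndarray):
    """frames: (T, H, W) sequence; returns (mean, std) arrays for frame T."""
    T, H, W = frames.shape
    t_train = np.arange(T).reshape(-1, 1).astype(float)
    t_next = np.array([[float(T)]])
    mean, std = np.zeros((H, W)), np.zeros((H, W))
    kernel = RBF(length_scale=2.0) + WhiteKernel(noise_level=1e-3)
    for i in range(H):
        for j in range(W):
            gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
            gp.fit(t_train, frames[:, i, j])
            m, s = gp.predict(t_next, return_std=True)
            mean[i, j], std[i, j] = m[0], s[0]
    return mean, std

if __name__ == "__main__":
    seq = np.stack([np.full((8, 8), v) for v in np.linspace(0.0, 1.0, 10)])
    mean, std = predict_next_frame(seq)
    print(mean.mean(), std.mean())
```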

UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.11227
  • repo_url: None
  • paper_authors: Xin Li, Sima Behpour, Thang Doan, Wenbin He, Liang Gou, Liu Ren
  • for: Optimizing performance for undefined downstream tasks under a limited annotation budget by selecting instances for labeling from an unlabeled dataset in a single pass.
  • methods: UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models such as BLIP-2 for data pre-selection, learning text prompts (with BLIP-2 frozen) that extract joint vision-text features with a diverse cluster structure covering the entire dataset.
  • results: Achieves up to a 20% performance gain over the state of the art on seven benchmark datasets in different settings; prompts learned from one dataset generalize and directly enhance BLIP-2 feature extraction on other datasets.
    Abstract In this study, we investigate the task of data pre-selection, which aims to select instances for labeling from an unlabeled dataset through a single pass, thereby optimizing performance for undefined downstream tasks with a limited annotation budget. Previous approaches to data pre-selection relied solely on visual features extracted from foundation models, such as CLIP and BLIP-2, but largely ignored the powerfulness of text features. In this work, we argue that, with proper design, the joint feature space of both vision and text can yield a better representation for data pre-selection. To this end, we introduce UP-DP, a simple yet effective unsupervised prompt learning approach that adapts vision-language models, like BLIP-2, for data pre-selection. Specifically, with the BLIP-2 parameters frozen, we train text prompts to extract the joint features with improved representation, ensuring a diverse cluster structure that covers the entire dataset. We extensively compare our method with the state-of-the-art using seven benchmark datasets in different settings, achieving up to a performance gain of 20%. Interestingly, the prompts learned from one dataset demonstrate significant generalizability and can be applied directly to enhance the feature extraction of BLIP-2 from other datasets. To the best of our knowledge, UP-DP is the first work to incorporate unsupervised prompt learning in a vision-language model for data pre-selection.
    摘要 在这项研究中,我们调查了数据预选任务,该任务通过单次通过,选择未标注数据集中的实例,以优化未定下渠道任务的性能,并尽可能减少标注预算。先前的数据预选方法仅仅基于基础模型中的视觉特征,如CLIP和BLIP-2,而忽略文本特征的力量。在这项工作中,我们认为,如果设计得当,则联合视觉和文本特征的特征空间可以提供更好的数据预选表示。为此,我们提出了UP-DP方法,这是一种简单 yet有效的无监督提问学习方法,可以使用视觉语言模型,如BLIP-2,进行数据预选。具体来说,我们将BLIP-2参数冻结,然后使用文本提示来提取联合特征,以确保多样化的群集结构,覆盖整个数据集。我们在七个标准测试集上进行了广泛的比较,与状态艺术的方法进行比较,达到了最高的20%的性能提升。有趣的是,从一个数据集上学习的提示可以直接应用于提高BLIP-2在其他数据集上的特征提取。据我们所知,UP-DP是首次在视觉语言模型中 incorporate无监督提问学习来进行数据预选。

Heuristic Hyperparameter Choice for Image Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.11197
  • repo_url: None
  • paper_authors: Zeyu Jiang, João P. C. Bertoldo, Etienne Decencière
  • for: Image anomaly detection with deep features from pretrained models, where redundant features increase computational cost and degrade performance.
  • methods: Negated Principal Component Analysis (NPCA) reduces the dimensionality of the deep features, with heuristics for choosing the NPCA hyperparameters so that as few feature components as possible are kept while maintaining good performance.
  • results: The heuristic hyperparameter choice reduces the computational cost of deep-feature anomaly detection while preserving detection performance.
    Abstract Anomaly detection (AD) in images is a fundamental computer vision problem by deep learning neural network to identify images deviating significantly from normality. The deep features extracted from pretrained models have been proved to be essential for AD based on multivariate Gaussian distribution analysis. However, since models are usually pretrained on a large dataset for classification tasks such as ImageNet, they might produce lots of redundant features for AD, which increases computational cost and degrades the performance. We aim to do the dimension reduction of Negated Principal Component Analysis (NPCA) for these features. So we proposed some heuristic to choose hyperparameter of NPCA algorithm for getting as fewer components of features as possible while ensuring a good performance.
    摘要 <>转换给定文本到简化中文。>图像异常检测(AD)是计算机视觉中的基本问题,使用深度学习神经网络来识别图像异常。深度特征从预训练模型中提取出来的特征被证明是AD基于多变量 Gaussian 分布分析中的 essencial。然而,由于模型通常在大量的分类任务上预训练,如 ImageNet,它们可能生成大量的冗余特征,这会增加计算成本并降低性能。我们想使用NPCA算法进行维度减少。因此,我们提出了一些规则来选择NPCA算法的超参数,以获得最少的特征组件而 guaranteeing good performance。
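A hedged sketch of the negated-PCA idea: fit PCA on deep features of normal images, keep the least-variant components, and score test features by the energy of their projection onto those directions (where normal data barely varies). The number of retained components `k` is the hyperparameter the heuristics above aim to choose; the scoring rule here is an assumption for illustration.

```python
# Sketch: anomaly scoring with the least-variance PCA components of normal features.
import numpy as np

class NPCAScorer:
    def __init__(self, k: int = 50):
        self.k = k

    def fit(self, feats: np.ndarray):                # feats: (N, D) normal features
        self.mean_ = feats.mean(axis=0)
        centered = feats - self.mean_
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        self.components_ = vt[-self.k:]              # least-variance directions
        return self

    def score(self, feats: np.ndarray) -> np.ndarray:
        proj = (feats - self.mean_) @ self.components_.T
        return np.linalg.norm(proj, axis=1)          # larger = more anomalous

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal = rng.normal(size=(500, 128))
    scorer = NPCAScorer(k=30).fit(normal)
    print(scorer.score(rng.normal(size=(5, 128)) + 3.0))  # shifted samples score higher
```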

Representation Learning in Anomaly Detection: Successes, Limits and a Grand Challenge

  • paper_url: http://arxiv.org/abs/2307.11085
  • repo_url: None
  • paper_authors: Yedid Hoshen
  • for: A perspective paper arguing that the dominant paradigm in anomaly detection cannot scale indefinitely and will eventually hit fundamental limits.
  • methods: Builds on a no-free-lunch principle for anomaly detection and poses two grand challenges: scientific discovery by anomaly detection, and a "mini-grand" challenge of detecting the most anomalous image in the ImageNet dataset.
  • results: Argues that new anomaly detection tools and ideas will need to be developed to overcome these challenges.
    Abstract In this perspective paper, we argue that the dominant paradigm in anomaly detection cannot scale indefinitely and will eventually hit fundamental limits. This is due to the a no free lunch principle for anomaly detection. These limitations can be overcome when there are strong tasks priors, as is the case for many industrial tasks. When such priors do not exists, the task is much harder for anomaly detection. We pose two such tasks as grand challenges for anomaly detection: i) scientific discovery by anomaly detection ii) a "mini-grand" challenge of detecting the most anomalous image in the ImageNet dataset. We believe new anomaly detection tools and ideas would need to be developed to overcome these challenges.

GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos

  • paper_url: http://arxiv.org/abs/2307.11081
  • repo_url: https://github.com/nisargshah1999/glsformer
  • paper_authors: Nisarg A. Shah, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
  • for: Automated surgical step recognition.
  • methods: A vision transformer learns spatio-temporal features directly from sequences of frame-level patches, and a gated-temporal attention mechanism integrates short-term and long-term spatio-temporal feature representations.
  • results: Extensive evaluation on two cataract surgery video datasets (Cataract-101 and D99) shows superior performance compared to various state-of-the-art methods.
    Abstract Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgeries. Existing state-of-the-art methods for surgical step recognition either rely on separate, multi-stage modeling of spatial and temporal information or operate on short-range temporal resolution when learned jointly. However, the benefits of joint modeling of spatio-temporal features and long-range information are not taken in account. In this paper, we propose a vision transformer-based approach to jointly learn spatio-temporal features directly from sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that intelligently combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, namely Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer
    摘要 自动化手术步骤识别是一项重要的任务,可以有效提高手术过程中的患者安全和决策。现有的状态级方法 для手术步骤识别可以是分离的多stage模型化的空间和时间信息,或者在学习时间resolution上进行简单的操作。然而,将空间和时间特征结合jointly learning的好处并没有得到足够的考虑。在本文中,我们提出了基于视觉 трансформер的方法,直接从序列帧级补剪图像中学习spatio-temporal特征。我们的方法包括一种阻止temporal注意力机制,可以智能地将短期和长期的spatio-temporal特征表示结合起来。我们对两个眼部手术视频数据集,即Cataract-101和D99进行了广泛的评估,并证明了与多种状态级方法的比较。这些结果证明了我们的提议的合适性。我们的代码可以在:https://github.com/nisargshah1999/GLSFormer 获取。

Learning Dense UV Completion for Human Mesh Recovery

  • paper_url: http://arxiv.org/abs/2307.11074
  • repo_url: None
  • paper_authors: Yanjun Wang, Qingping Sun, Wenjia Wang, Jun Ling, Zhongang Cai, Rong Xie, Li Song
  • for: Human mesh recovery from a single image, especially under occlusion by the person themselves, objects, or other humans.
  • methods: Dense Inpainting Human Mesh Recovery (DIMR) uses dense correspondence maps to separate visible human features, an attention-based feature completion module to complete human features on a structured UV map, and a feature inpainting training procedure that learns from unoccluded features.
  • results: Evaluations on several datasets show clearly better performance than prior methods on heavily occluded images, with comparable results on the standard benchmark (3DPW).
    Abstract Human mesh reconstruction from a single image is challenging in the presence of occlusion, which can be caused by self, objects, or other humans. Existing methods either fail to separate human features accurately or lack proper supervision for feature completion. In this paper, we propose Dense Inpainting Human Mesh Recovery (DIMR), a two-stage method that leverages dense correspondence maps to handle occlusion. Our method utilizes a dense correspondence map to separate visible human features and completes human features on a structured UV map dense human with an attention-based feature completion module. We also design a feature inpainting training procedure that guides the network to learn from unoccluded features. We evaluate our method on several datasets and demonstrate its superior performance under heavily occluded scenarios compared to other methods. Extensive experiments show that our method obviously outperforms prior SOTA methods on heavily occluded images and achieves comparable results on the standard benchmarks (3DPW).
    摘要 人体三角形重建从单个图像中是具有挑战性的,尤其在 occlusion 存在时。现有方法可能不准确地分割人体特征或缺乏适当的监督来完成特征。在这篇论文中,我们提议 dense inpainting human mesh recovery(DIMR)方法,这是一种两个阶段的方法,利用 dense correspondence map 来处理 occlusion。我们的方法使用 dense correspondence map 来分割可见的人体特征,并在结构化 UV 图 dense human 上完成人体特征使用注意力基于的特征完成模块。我们还设计了一种特征填充训练过程,使网络学习从不受 occlusion 的特征。我们对多个数据集进行了评估,并证明了我们的方法在受到重度 occlusion 的场景下表现出色,比 Priors 的 SOTA 方法更好。广泛的实验表明,我们的方法在 heavily occluded 图像上表现出色,并在标准 benchmarks 上 achieve 相当的结果。

Towards General Game Representations: Decomposing Games Pixels into Content and Style

  • paper_url: http://arxiv.org/abs/2307.11141
  • repo_url: None
  • paper_authors: Chintan Trivedi, Konstantinos Makantasis, Antonios Liapis, Georgios N. Yannakakis
  • for: Exploiting the rich contextual information in on-screen game footage for downstream AI tasks such as game-playing agents, procedural content generation, and player modelling.
  • methods: A pre-trained Vision Transformer encoder combined with a genre-based decomposition technique yields separate content and style embeddings of game pixels.
  • results: The decomposed embeddings achieve style invariance across multiple games while maintaining strong content extraction, suggesting better generalization across game environments independently of the downstream task.
    Abstract On-screen game footage contains rich contextual information that players process when playing and experiencing a game. Learning pixel representations of games can benefit artificial intelligence across several downstream tasks including game-playing agents, procedural content generation, and player modelling. The generalizability of these methods, however, remains a challenge, as learned representations should ideally be shared across games with similar game mechanics. This could allow, for instance, game-playing agents trained on one game to perform well in similar games with no re-training. This paper explores how generalizable pre-trained computer vision encoders can be for such tasks, by decomposing the latent space into content embeddings and style embeddings. The goal is to minimize the domain gap between games of the same genre when it comes to game content critical for downstream tasks, and ignore differences in graphical style. We employ a pre-trained Vision Transformer encoder and a decomposition technique based on game genres to obtain separate content and style embeddings. Our findings show that the decomposed embeddings achieve style invariance across multiple games while still maintaining strong content extraction capabilities. We argue that the proposed decomposition of content and style offers better generalization capacities across game environments independently of the downstream task.
    摘要 电脑游戏截屏视频内容含有丰富的上下文信息,玩家在游戏时会处理这些信息。学习游戏像素表示可以提高人工智能在多个下游任务中的表现,如游戏玩家代理、过程内容生成和玩家模型。然而,这些方法的通用性仍然是一个挑战,因为学习的表示应该能够在同类游戏中共享。这可以让游戏玩家代理从一款游戏中转移到类似游戏中,无需重新训练。本文研究如何使用普通的计算机视觉encoder和游戏类别 decomposition技术来提取游戏内容和风格的嵌入。我们的发现表明,这些分解的嵌入可以在多个游戏中保持风格不变,同时仍然保留强大的内容提取能力。我们认为,我们的内容和风格分解方法可以在不同的游戏环境下独立地提高总化能力。

CNOS: A Strong Baseline for CAD-based Novel Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.11067
  • repo_url: https://github.com/nv-nguyen/cnos
  • paper_authors: Van Nguyen Nguyen, Tomas Hodan, Georgy Ponimatkin, Thibault Groueix, Vincent Lepetit
  • for: Segmenting novel objects in RGB images given only their CAD models.
  • methods: A three-stage approach: recent foundation models (DINOv2 and Segment Anything) create descriptors and generate proposals, including binary masks, for the input RGB image; proposals are then matched against reference descriptors rendered from the CAD models to assign object IDs and modal masks.
  • results: Achieves state-of-the-art results on the seven core datasets of the BOP challenge, surpassing existing approaches by 19.8% AP under the same BOP evaluation protocol.
    Abstract We propose a simple three-stage approach to segment unseen objects in RGB images using their CAD models. Leveraging recent powerful foundation models, DINOv2 and Segment Anything, we create descriptors and generate proposals, including binary masks for a given input RGB image. By matching proposals with reference descriptors created from CAD models, we achieve precise object ID assignment along with modal masks. We experimentally demonstrate that our method achieves state-of-the-art results in CAD-based novel object segmentation, surpassing existing approaches on the seven core datasets of the BOP challenge by 19.8% AP using the same BOP evaluation protocol. Our source code is available at https://github.com/nv-nguyen/cnos.
    摘要 我们提出了一种简单的三个阶段方法,用于使用CAD模型来分割RGB图像中的未见对象。利用最新的强大基础模型,DINOv2和Segment Anything,我们创建了描述符和生成提案,包括输入RGB图像的二进制掩码。通过与参考描述符,创建从CAD模型中获得的对象ID分配和modal掩码。我们经验表明,我们的方法可以在BOP挑战的七个核心数据集上达到状态理论的Result,比既有方法在同一BOP评估协议下提高19.8%的AP。我们的源代码可以在https://github.com/nv-nguyen/cnos上获取。
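A small sketch of the matching stage described above: each segmented proposal is assigned the object ID of the CAD-rendered template whose descriptor has the highest cosine similarity. Descriptor extraction (e.g., with DINOv2) is assumed to be given; tensor shapes and names are illustrative.

```python
# Sketch: assign object IDs to proposals by cosine similarity to CAD template descriptors.
import torch
import torch.nn.functional as F

def assign_object_ids(proposal_desc: torch.Tensor,
                      template_desc: torch.Tensor,
                      template_obj_ids: torch.Tensor):
    """proposal_desc: (P, D), template_desc: (T, D), template_obj_ids: (T,)."""
    p = F.normalize(proposal_desc, dim=-1)
    t = F.normalize(template_desc, dim=-1)
    sim = p @ t.T                                   # (P, T) cosine similarities
    score, best = sim.max(dim=1)                    # best template per proposal
    return template_obj_ids[best], score            # object id + matching confidence

if __name__ == "__main__":
    ids, conf = assign_object_ids(torch.randn(4, 256),
                                  torch.randn(30, 256),
                                  torch.arange(30) % 5)
    print(ids.tolist(), conf.shape)
```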

HRFNet: High-Resolution Forgery Network for Localizing Satellite Image Manipulation

  • paper_url: http://arxiv.org/abs/2307.11052
  • repo_url: None
  • paper_authors: Fahim Faisal Niloy, Kishor Kumar Bhaumik, Simon S. Woo
  • for: Localizing forgeries in high-resolution satellite images, where existing patch-based or downsampling-based training causes inaccurate boundaries and unwanted artifacts.
  • methods: HRFNet, a model with shallow and deep branches that integrates RGB and resampling features in both global and local manners to localize forgery more accurately.
  • results: Experiments show the method localizes satellite image forgeries more accurately than existing approaches without compromising memory requirements or processing speed.
    Abstract Existing high-resolution satellite image forgery localization methods rely on patch-based or downsampling-based training. Both of these training methods have major drawbacks, such as inaccurate boundaries between pristine and forged regions, the generation of unwanted artifacts, etc. To tackle the aforementioned challenges, inspired by the high-resolution image segmentation literature, we propose a novel model called HRFNet to enable satellite image forgery localization effectively. Specifically, equipped with shallow and deep branches, our model can successfully integrate RGB and resampling features in both global and local manners to localize forgery more accurately. We perform various experiments to demonstrate that our method achieves the best performance, while the memory requirement and processing speed are not compromised compared to existing methods.
    摘要 当前高分辨率卫星图像假造地点 localization 方法通常基于 patch-based 或 downsampling-based 训练。两者均有严重缺陷,如假造区域与原始区域的界限不准确、生成不必要的 artifacts 等。为了解决上述挑战,我们引用高分辨率图像分割文献,提出了一种新的模型called HRFNet,用于有效地进行卫星图像假造地点 localization。具体来说,我们的模型具有 shallow 和 deep 分支,可以成功地在全球和本地方面 интегра RGB 和抽样特征,以更加准确地Localize 假造。我们进行了多种实验,证明了我们的方法可以具有最高性能,而占用内存和处理速度与现有方法相比,不会受到影响。

Multi-objective point cloud autoencoders for explainable myocardial infarction prediction

  • paper_url: http://arxiv.org/abs/2307.11017
  • repo_url: None
  • paper_authors: Marcel Beetz, Abhirup Banerjee, Vicente Grau
  • for: Explainable prediction of myocardial infarction (MI) from cardiac anatomy and function.
  • methods: A multi-objective point cloud autoencoder, a geometric deep learning approach operating on multi-class 3D point cloud representations of the heart; task-specific branches connected by a low-dimensional latent space allow joint learning of reconstruction and MI prediction, with hierarchical point-cloud operations for efficient multi-scale feature learning.
  • results: On a large UK Biobank dataset it reconstructs multi-temporal 3D shapes with Chamfer distances below the images' pixel resolution, outperforms machine learning and deep learning benchmarks for incident MI prediction by 19% in area under the ROC curve, and yields a compact latent space with separable control and MI clusters and clinically plausible shape associations, demonstrating explainability.
    Abstract Myocardial infarction (MI) is one of the most common causes of death in the world. Image-based biomarkers commonly used in the clinic, such as ejection fraction, fail to capture more complex patterns in the heart's 3D anatomy and thus limit diagnostic accuracy. In this work, we present the multi-objective point cloud autoencoder as a novel geometric deep learning approach for explainable infarction prediction, based on multi-class 3D point cloud representations of cardiac anatomy and function. Its architecture consists of multiple task-specific branches connected by a low-dimensional latent space to allow for effective multi-objective learning of both reconstruction and MI prediction, while capturing pathology-specific 3D shape information in an interpretable latent space. Furthermore, its hierarchical branch design with point cloud-based deep learning operations enables efficient multi-scale feature learning directly on high-resolution anatomy point clouds. In our experiments on a large UK Biobank dataset, the multi-objective point cloud autoencoder is able to accurately reconstruct multi-temporal 3D shapes with Chamfer distances between predicted and input anatomies below the underlying images' pixel resolution. Our method outperforms multiple machine learning and deep learning benchmarks for the task of incident MI prediction by 19% in terms of Area Under the Receiver Operating Characteristic curve. In addition, its task-specific compact latent space exhibits easily separable control and MI clusters with clinically plausible associations between subject encodings and corresponding 3D shapes, thus demonstrating the explainability of the prediction.
    摘要 myocardial infarction (MI) 是世界上最常见的死亡原因之一。传统的图像基于标记器,如舒张率,无法捕捉心脏三维解剖结构中更复杂的模式,因此限制了诊断精度。在这项工作中,我们提出了基于多对象点云自适应神经网络的新的几何深度学方法,用于可见的损害预测,基于多类三维点云表示的律动器解剖结构和功能。其架构包括多个任务特定分支,通过低维度的干扰空间相连,以实现有效的多目标学习 both 重建和MI预测,同时捕捉疾病特定的三维形态信息。此外,其层次分支设计和点云深度运算使得高分辨率的解剖点云上可以有效地进行多级别特征学习。在我们对大型UK Biobank数据集进行实验时,多对象点云自适应神经网络能够准确地重建多个时间点的三维形态,Chamfer距离输入和预测的形态之间的距离小于图像的像素分辨率。我们的方法在多种机器学习和深度学习标准准点上出现19%的提升,在接收操作特征曲线图下的面积来评估预测效果。此外,任务特定的紧凑的干托空间显示可以分别控制和MI层次分解,并且与相应的三维形态之间存在严格的相关关系,这表明预测是可解释的。
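For reference, the Chamfer distance used above to compare reconstructed and input anatomy point clouds can be computed as the symmetric average nearest-neighbour distance, as in this short sketch.

```python
# Sketch: symmetric Chamfer distance between two point clouds.
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 3), b: (M, 3); average nearest-neighbour distance in both directions."""
    d = torch.cdist(a, b)                  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

if __name__ == "__main__":
    a = torch.rand(1000, 3)
    print(chamfer_distance(a, a + 0.01 * torch.randn_like(a)).item())
```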

General Image-to-Image Translation with One-Shot Image Guidance

  • paper_url: http://arxiv.org/abs/2307.14352
  • repo_url: https://github.com/crystalneuro/visual-concept-translator
  • paper_authors: Bin Cheng, Zuhao Liu, Yunbo Peng, Yue Lin
  • for: Translating a visual concept from a single reference image into an existing image while preserving the source image content.
  • methods: The visual concept translator (VCT) framework combines a content-concept inversion (CCI) process to extract contents and concepts with a content-concept fusion (CCF) process that gathers the extracted information to produce the target image.
  • results: Given only one reference image, the method completes a wide range of general image-to-image translation tasks with excellent results while preserving the content of the source image.
    Abstract Large-scale text-to-image models pre-trained on massive text-image pairs show excellent performance in image synthesis recently. However, image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to preserve content or translate visual concepts effectively. Inspired by this, we propose a novel framework named visual concept translator (VCT) with the ability to preserve content in the source image and translate the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to prove the superiority and effectiveness of the proposed methods. Codes are available at https://github.com/CrystalNeuro/visual-concept-translator.
    摘要 大规模的文本到图像模型在最近的图像生成中表现出色,但图像可以提供更直观的视觉概念 than plain text。人们可能会问:如何将我们的肖像中的愿望 visual concept 集成到现有的图像中?现有的方法无法满足这个需求,因为它们缺乏保持内容或翻译视觉概念的能力。 draw inspiration from this,我们提出了一种名为视觉概念翻译器(VCT)的新框架,具有保持内容的源图像和根据单个参考图像 guid 视觉概念的翻译能力。VCT 包括一个内容概念反转(CCI)过程,用于提取内容和概念,以及一个内容概念聚合(CCF)过程,用于将提取的信息聚合以获得目标图像。只需要一个参考图像,提出的 VCT 可以完成广泛的普通图像到图像翻译任务,效果极佳。我们进行了广泛的实验,以证明我们的方法的超越性和有效性。代码可以在 https://github.com/CrystalNeuro/visual-concept-translator 上获取。

Frequency-aware optical coherence tomography image super-resolution via conditional generative adversarial neural network

  • paper_url: http://arxiv.org/abs/2307.11130
  • repo_url: None
  • paper_authors: Xueshen Li, Zhenxing Dong, Hongshan Liu, Jennifer J. Kang-Mieler, Yuye Ling, Yu Gan
  • for: Improving image-based diagnosis and treatment, particularly in cardiology and ophthalmology, through OCT image super-resolution.
  • methods: Deep-learning-based super-resolution with three frequency-based modules (frequency transformation, frequency skip connection, and frequency alignment) and a frequency-based loss function integrated into a conditional generative adversarial network (cGAN).
  • results: A large-scale quantitative study on an existing coronary OCT dataset demonstrates superiority over existing deep learning frameworks, and applications to fish corneal and rat retinal images show the framework generalizes to super-resolving morphological detail in eye imaging.
    Abstract Optical coherence tomography (OCT) has stimulated a wide range of medical image-based diagnosis and treatment in fields such as cardiology and ophthalmology. Such applications can be further facilitated by deep learning-based super-resolution technology, which improves the capability of resolving morphological structures. However, existing deep learning-based method only focuses on spatial distribution and disregard frequency fidelity in image reconstruction, leading to a frequency bias. To overcome this limitation, we propose a frequency-aware super-resolution framework that integrates three critical frequency-based modules (i.e., frequency transformation, frequency skip connection, and frequency alignment) and frequency-based loss function into a conditional generative adversarial network (cGAN). We conducted a large-scale quantitative study from an existing coronary OCT dataset to demonstrate the superiority of our proposed framework over existing deep learning frameworks. In addition, we confirmed the generalizability of our framework by applying it to fish corneal images and rat retinal images, demonstrating its capability to super-resolve morphological details in eye imaging.
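A hedged sketch of a frequency-based loss of the kind described above: the spectra of super-resolved and ground-truth images are compared via a 2D FFT and combined with a pixel-space term, so errors in high-frequency detail are penalized explicitly. The weighting and exact form are assumptions, not the paper's three frequency modules.

```python
# Sketch: FFT-based frequency loss combined with a pixel-space L1 term.
import torch

def frequency_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (B, C, H, W) images."""
    pred_f = torch.fft.fft2(pred, norm="ortho")
    target_f = torch.fft.fft2(target, norm="ortho")
    return (pred_f - target_f).abs().mean()

def total_loss(pred, target, w_pixel=1.0, w_freq=0.1):
    return w_pixel * torch.nn.functional.l1_loss(pred, target) \
         + w_freq * frequency_loss(pred, target)

if __name__ == "__main__":
    x, y = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
    print(total_loss(x, y).item())
```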

Deep Spiking-UNet for Image Processing

  • paper_url: http://arxiv.org/abs/2307.10974
  • repo_url: https://github.com/snnresearch/spiking-unet
  • paper_authors: Hebei Li, Yueyi Zhang, Zhiwei Xiong, Zheng-jun Zha, Xiaoyan Sun
  • for: Applying spiking neural networks (SNNs) to image processing by combining them with the U-Net architecture.
  • methods: Multi-threshold spiking neurons improve the efficiency of information transmission, and a conversion and fine-tuning pipeline with connection-wise normalization and flow-based training adapts pre-trained U-Net models.
  • results: On image segmentation and denoising, the Spiking-UNet achieves performance comparable to its non-spiking counterpart and surpasses existing SNN methods; compared with the converted Spiking-UNet without fine-tuning, inference time is reduced by approximately 90%.
    Abstract U-Net, known for its simple yet efficient architecture, is widely utilized for image processing tasks and is particularly suitable for deployment on neuromorphic chips. This paper introduces the novel concept of Spiking-UNet for image processing, which combines the power of Spiking Neural Networks (SNNs) with the U-Net architecture. To achieve an efficient Spiking-UNet, we face two primary challenges: ensuring high-fidelity information propagation through the network via spikes and formulating an effective training strategy. To address the issue of information loss, we introduce multi-threshold spiking neurons, which improve the efficiency of information transmission within the Spiking-UNet. For the training strategy, we adopt a conversion and fine-tuning pipeline that leverage pre-trained U-Net models. During the conversion process, significant variability in data distribution across different parts is observed when utilizing skip connections. Therefore, we propose a connection-wise normalization method to prevent inaccurate firing rates. Furthermore, we adopt a flow-based training method to fine-tune the converted models, reducing time steps while preserving performance. Experimental results show that, on image segmentation and denoising, our Spiking-UNet achieves comparable performance to its non-spiking counterpart, surpassing existing SNN methods. Compared with the converted Spiking-UNet without fine-tuning, our Spiking-UNet reduces inference time by approximately 90\%. This research broadens the application scope of SNNs in image processing and is expected to inspire further exploration in the field of neuromorphic engineering. The code for our Spiking-UNet implementation is available at https://github.com/SNNresearch/Spiking-UNet.
    摘要 U-Net,知名的简单 yet efficient 架构,广泛应用于图像处理任务,特别适合运行在神经模拟器芯片上。这篇论文介绍了一种新的激发式 U-Net 图像处理方法,该方法结合激发式神经网络(SNNs)与 U-Net 架构。为了实现高效的激发式 U-Net,我们面临两个主要挑战:保证通过网络传递高精度信息via spikes,并制定有效的训练策略。为了解决信息损失问题,我们引入多reshold spiking neurons,以提高激发式 U-Net 中信息传输的效率。在训练策略方面,我们采用一个转化和细化训练管道,利用预训练 U-Net 模型。在转化过程中,我们发现了不同部分数据分布变化带来的显著差异。因此,我们提出了一种连接 wise normalization 方法,以避免假象率。此外,我们采用一种流式训练方法,以细化转化后的模型,降低时间步骤而保持性能。实验结果表明,在图像分割和净化任务上,我们的激发式 U-Net 与非激发式 counterpart 的性能相似,超过现有的 SNN 方法。相比转化后的激发式 U-Net без细化,我们的激发式 U-Net 可以降低推理时间约90%。这项研究扩展了 SNN 在图像处理领域的应用范围,预计会激发更多的神经模拟器工程研究。Spiking-UNet 实现代码可在 GitHub 上找到:https://github.com/SNNresearch/Spiking-UNet.
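A hedged sketch of a multi-threshold integrate-and-fire neuron: the membrane potential is compared against several thresholds so a graded number of spikes can be emitted per time step, reducing the information loss a single-threshold neuron incurs after ANN-to-SNN conversion. Threshold values and the soft-reset rule are illustrative assumptions, not the paper's exact neuron model.

```python
# Sketch: integrate-and-fire neuron that emits a graded spike count via multiple thresholds.
import torch

class MultiThresholdIF(torch.nn.Module):
    def __init__(self, thresholds=(1.0, 2.0, 3.0)):
        super().__init__()
        self.register_buffer("thresholds", torch.tensor(thresholds))
        self.mem = None

    def reset(self):
        self.mem = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.mem is None:
            self.mem = torch.zeros_like(x)
        self.mem = self.mem + x
        # number of crossed thresholds = graded spike count this time step
        spikes = (self.mem.unsqueeze(-1) >= self.thresholds).sum(dim=-1).float()
        self.mem = self.mem - spikes * self.thresholds[0]   # soft reset by base threshold
        return spikes

if __name__ == "__main__":
    neuron = MultiThresholdIF()
    for t in range(3):
        out = neuron(torch.full((2, 4), 1.2))
        print(t, out[0].tolist())
```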