cs.CV - 2023-07-15

HQG-Net: Unpaired Medical Image Enhancement with High-Quality Guidance

  • paper_url: http://arxiv.org/abs/2307.07829
  • repo_url: None
  • paper_authors: Chunming He, Kai Li, Guoxia Xu, Jiangpeng Yan, Longxiang Tang, Yulun Zhang, Xiu Li, Yaowei Wang
  • for: Unpaired Medical Image Enhancement (UMIE): transforming low-quality (LQ) medical images into high-quality (HQ) ones without paired HQ images for training.
  • methods: A novel UMIE approach that encodes HQ cues directly into the LQ enhancement process through a variational normalization module, trains the enhancement network adversarially, adds a content-aware loss with wavelet-based pixel-level and multi-encoder-based feature-level constraints, and uses a bi-level learning scheme to optimize UMIE and downstream tasks cooperatively.
  • results: Experiments on three medical image datasets (including two newly collected ones) show that the method delivers both strong enhancement quality and strong downstream-task performance, outperforming existing techniques.
    Abstract Unpaired Medical Image Enhancement (UMIE) aims to transform a low-quality (LQ) medical image into a high-quality (HQ) one without relying on paired images for training. While most existing approaches are based on Pix2Pix/CycleGAN and are effective to some extent, they fail to explicitly use HQ information to guide the enhancement process, which can lead to undesired artifacts and structural distortions. In this paper, we propose a novel UMIE approach that avoids the above limitation of existing methods by directly encoding HQ cues into the LQ enhancement process in a variational fashion and thus model the UMIE task under the joint distribution between the LQ and HQ domains. Specifically, we extract features from an HQ image and explicitly insert the features, which are expected to encode HQ cues, into the enhancement network to guide the LQ enhancement with the variational normalization module. We train the enhancement network adversarially with a discriminator to ensure the generated HQ image falls into the HQ domain. We further propose a content-aware loss to guide the enhancement process with wavelet-based pixel-level and multi-encoder-based feature-level constraints. Additionally, as a key motivation for performing image enhancement is to make the enhanced images serve better for downstream tasks, we propose a bi-level learning scheme to optimize the UMIE task and downstream tasks cooperatively, helping generate HQ images both visually appealing and favorable for downstream tasks. Experiments on three medical datasets, including two newly collected datasets, verify that the proposed method outperforms existing techniques in terms of both enhancement quality and downstream task performance. We will make the code and the newly collected datasets publicly available for community study.
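As a rough illustration of the guidance mechanism described above, the sketch below shows a conditional normalization layer in which features extracted from an HQ exemplar modulate normalized LQ features. The class name, layer sizes, and the use of instance normalization are assumptions for illustration, not the authors' released module.

```python
import torch
import torch.nn as nn

class VariationalNormalization(nn.Module):
    """Illustrative conditional normalization: HQ guidance features predict the
    scale/shift applied to normalized LQ features (hypothetical layout)."""
    def __init__(self, lq_channels: int, hq_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(lq_channels, affine=False)
        # Predict per-channel modulation parameters from the HQ guidance features.
        self.to_gamma = nn.Conv2d(hq_channels, lq_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hq_channels, lq_channels, kernel_size=3, padding=1)

    def forward(self, lq_feat: torch.Tensor, hq_feat: torch.Tensor) -> torch.Tensor:
        hq_feat = nn.functional.interpolate(hq_feat, size=lq_feat.shape[-2:], mode="nearest")
        gamma, beta = self.to_gamma(hq_feat), self.to_beta(hq_feat)
        return self.norm(lq_feat) * (1 + gamma) + beta

# toy usage
lq = torch.randn(1, 64, 32, 32)   # features of the LQ enhancement branch
hq = torch.randn(1, 128, 16, 16)  # features extracted from an HQ exemplar
out = VariationalNormalization(64, 128)(lq, hq)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```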

TinyTracker: Ultra-Fast and Ultra-Low-Power Edge Vision In-Sensor for Gaze Estimation

  • paper_url: http://arxiv.org/abs/2307.07813
  • repo_url: None
  • paper_authors: Pietro Bonazzi, Thomas Ruegg, Sizhen Bian, Yawei Li, Michele Magno
  • for: Ultra-fast, ultra-low-power edge vision, addressing the heavy computational load that such tasks place on edge devices.
  • methods: Uses one of the first "AI in sensor" vision platforms, Sony's IMX500, with gaze estimation as the case study, and proposes TinyTracker, a highly efficient, fully quantized 2D gaze-estimation model.
  • results: The IMX500 is 1.7x faster than the Google Coral Dev Micro (19 ms vs 34.4 ms) and 7x more power efficient (4.9 mJ vs 34.2 mJ); the fully quantized TinyTracker is 41x smaller than iTracker (600 Kb) with at most 0.16 cm loss in gaze-estimation accuracy.
    Abstract Intelligent edge vision tasks encounter the critical challenge of ensuring power and latency efficiency due to the typically heavy computational load they impose on edge platforms.This work leverages one of the first "AI in sensor" vision platforms, IMX500 by Sony, to achieve ultra-fast and ultra-low-power end-to-end edge vision applications. We evaluate the IMX500 and compare it to other edge platforms, such as the Google Coral Dev Micro and Sony Spresense, by exploring gaze estimation as a case study. We propose TinyTracker, a highly efficient, fully quantized model for 2D gaze estimation designed to maximize the performance of the edge vision systems considered in this study. TinyTracker achieves a 41x size reduction (600Kb) compared to iTracker [1] without significant loss in gaze estimation accuracy (maximum of 0.16 cm when fully quantized). TinyTracker's deployment on the Sony IMX500 vision sensor results in end-to-end latency of around 19ms. The camera takes around 17.9ms to read, process and transmit the pixels to the accelerator. The inference time of the network is 0.86ms with an additional 0.24 ms for retrieving the results from the sensor. The overall energy consumption of the end-to-end system is 4.9 mJ, including 0.06 mJ for inference. The end-to-end study shows that IMX500 is 1.7x faster than CoralMicro (19ms vs 34.4ms) and 7x more power efficient (4.9mJ VS 34.2mJ)
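The reported end-to-end numbers follow from simple arithmetic over the components quoted in the abstract; the snippet below just re-adds them.

```python
# Latency/energy budget for the IMX500 deployment, using the figures quoted above.
readout_ms, inference_ms, retrieval_ms = 17.9, 0.86, 0.24
print(f"end-to-end latency ~{readout_ms + inference_ms + retrieval_ms:.1f} ms")  # ~19.0 ms
print(f"inference share of the 4.9 mJ energy budget: {0.06 / 4.9:.1%}")          # ~1.2%
```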

Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation

  • paper_url: http://arxiv.org/abs/2307.07812
  • repo_url: https://github.com/msiam/mmc-multiscalememory
  • paper_authors: Mennatullah Siam, Rezaul Karim, He Zhao, Richard Wildes
  • for: Few-shot video segmentation: delineating a specific novel class in a query video using only a few labelled support images.
  • methods: A meta-learned Multiscale Memory Comparator (MMC) that exchanges information across scales within a transformer decoder while preserving detailed feature maps, reducing confusion between the background and the novel class; multiple forms of cross-scale information exchange are studied across tasks.
  • results: The approach outperforms the baseline and achieves state-of-the-art results on few-shot video object segmentation and an adapted fully supervised counterpart. Code is available at https://github.com/MSiam/MMC-MultiscaleMemory.
    Abstract Few-shot video segmentation is the task of delineating a specific novel class in a query video using few labelled support images. Typical approaches compare support and query features while limiting comparisons to a single feature layer and thereby ignore potentially valuable information. We present a meta-learned Multiscale Memory Comparator (MMC) for few-shot video segmentation that combines information across scales within a transformer decoder. Typical multiscale transformer decoders for segmentation tasks learn a compressed representation, their queries, through information exchange across scales. Unlike previous work, we instead preserve the detailed feature maps during across scale information exchange via a multiscale memory transformer decoding to reduce confusion between the background and novel class. Integral to the approach, we investigate multiple forms of information exchange across scales in different tasks and provide insights with empirical evidence on which to use in each task. The overall comparisons among query and support features benefit from both rich semantics and precise localization. We demonstrate our approach primarily on few-shot video object segmentation and an adapted version on the fully supervised counterpart. In all cases, our approach outperforms the baseline and yields state-of-the-art performance. Our code is publicly available at https://github.com/MSiam/MMC-MultiscaleMemory.
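A minimal sketch of the core comparison step, per-scale cross-attention between query and support (memory) features that keeps full-resolution maps at every scale, is given below. The module layout and dimensions are assumptions, not the released MMC code (see the repository above for the actual implementation).

```python
import torch
import torch.nn as nn

class MultiscaleMemoryComparator(nn.Module):
    """Sketch: per-scale cross-attention of query features against support
    (memory) features, keeping detailed maps at every scale."""
    def __init__(self, dims=(256, 256, 256), heads=8):
        super().__init__()
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d, heads, batch_first=True) for d in dims])

    def forward(self, query_feats, support_feats):
        # query_feats / support_feats: lists of (B, C, H, W) tensors, one per scale
        fused = []
        for attn, q, s in zip(self.attn, query_feats, support_feats):
            B, C, H, W = q.shape
            q_tokens = q.flatten(2).transpose(1, 2)          # (B, H*W, C)
            s_tokens = s.flatten(2).transpose(1, 2)
            out, _ = attn(q_tokens, s_tokens, s_tokens)      # compare query against support memory
            fused.append(out.transpose(1, 2).view(B, C, H, W))
        return fused  # detailed per-scale maps passed on to the decoder
```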

MUVF-YOLOX: A Multi-modal Ultrasound Video Fusion Network for Renal Tumor Diagnosis

  • paper_url: http://arxiv.org/abs/2307.07807
  • repo_url: https://github.com/jeunyuli/muaf
  • paper_authors: Junyu Li, Han Huang, Dong Ni, Wufeng Xue, Dongmei Zhu, Jun Cheng
  • for: Early diagnosis of renal cancer, to improve patient survival rates.
  • methods: A multi-modal ultrasound video fusion network that integrates B-mode and CEUS-mode videos: an attention-based fusion module extracts modality-invariant and modality-specific features in parallel, and an object-level temporal aggregation (OTA) module filters low-quality features and integrates temporal information across frames.
  • results: On a multicenter dataset, the proposed framework outperforms single-modal models and competing methods, and the OTA module yields higher classification accuracy than frame-level predictions.
    Abstract Early diagnosis of renal cancer can greatly improve the survival rate of patients. Contrast-enhanced ultrasound (CEUS) is a cost-effective and non-invasive imaging technique and has become more and more frequently used for renal tumor diagnosis. However, the classification of benign and malignant renal tumors can still be very challenging due to the highly heterogeneous appearance of cancer and imaging artifacts. Our aim is to detect and classify renal tumors by integrating B-mode and CEUS-mode ultrasound videos. To this end, we propose a novel multi-modal ultrasound video fusion network that can effectively perform multi-modal feature fusion and video classification for renal tumor diagnosis. The attention-based multi-modal fusion module uses cross-attention and self-attention to extract modality-invariant features and modality-specific features in parallel. In addition, we design an object-level temporal aggregation (OTA) module that can automatically filter low-quality features and efficiently integrate temporal information from multiple frames to improve the accuracy of tumor diagnosis. Experimental results on a multicenter dataset show that the proposed framework outperforms the single-modal models and the competing methods. Furthermore, our OTA module achieves higher classification accuracy than the frame-level predictions. Our code is available at \url{https://github.com/JeunyuLi/MUAF}.
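The sketch below illustrates one plausible form of object-level temporal aggregation: score each frame's object feature and pool over time with the learned weights, so low-quality frames are softly suppressed. The exact OTA design in the paper may differ; names and sizes here are assumptions.

```python
import torch
import torch.nn as nn

class ObjectTemporalAggregation(nn.Module):
    """Sketch: score per-frame ROI features, down-weight low-quality frames,
    and pool over time into a single video-level feature."""
    def __init__(self, dim: int):
        super().__init__()
        self.quality = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1))

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (B, T, C) object-level features from T frames
        weights = torch.softmax(self.quality(roi_feats), dim=1)  # (B, T, 1) frame quality weights
        return (weights * roi_feats).sum(dim=1)                  # (B, C) video-level feature
```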

Joint Adversarial and Collaborative Learning for Self-Supervised Action Recognition

  • paper_url: http://arxiv.org/abs/2307.07791
  • repo_url: https://github.com/levigty/acl
  • paper_authors: Tianyu Guo, Mengyuan Liu, Hong Liu, Wenhao Li, Jingwen Guo, Tao Wang, Yidi Li
  • for: Self-supervised skeleton-based action recognition, building on contrastive learning methods such as MoCo and SimCLR.
  • methods: Applies BYOL to skeleton data to form SkeletonBYOL, a simple yet effective baseline, and builds a joint Adversarial and Collaborative Learning (ACL) framework combining Cross-Model Adversarial Learning (CMAL) and Cross-Stream Collaborative Learning (CSCL) to learn discriminative single-stream features and aggregate information from the joint, motion, and bone streams.
  • results: Experiments on three datasets verify the complementary properties of CMAL and CSCL and show favorable performance against state-of-the-art methods under various evaluation protocols.
    Abstract Considering the instance-level discriminative ability, contrastive learning methods, including MoCo and SimCLR, have been adapted from the original image representation learning task to solve the self-supervised skeleton-based action recognition task. These methods usually use multiple data streams (i.e., joint, motion, and bone) for ensemble learning, meanwhile, how to construct a discriminative feature space within a single stream and effectively aggregate the information from multiple streams remains an open problem. To this end, we first apply a new contrastive learning method called BYOL to learn from skeleton data and formulate SkeletonBYOL as a simple yet effective baseline for self-supervised skeleton-based action recognition. Inspired by SkeletonBYOL, we further present a joint Adversarial and Collaborative Learning (ACL) framework, which combines Cross-Model Adversarial Learning (CMAL) and Cross-Stream Collaborative Learning (CSCL). Specifically, CMAL learns single-stream representation by cross-model adversarial loss to obtain more discriminative features. To aggregate and interact with multi-stream information, CSCL is designed by generating similarity pseudo label of ensemble learning as supervision and guiding feature generation for individual streams. Exhaustive experiments on three datasets verify the complementary properties between CMAL and CSCL and also verify that our method can perform favorably against state-of-the-art methods using various evaluation protocols. Our code and models are publicly available at \url{https://github.com/Levigty/ACL}.
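For reference, a minimal BYOL-style learner is sketched below: an online/target pair with a predictor and a symmetric negative-cosine loss, which is the general recipe SkeletonBYOL builds on. The `encoder` argument would be a skeleton encoder (e.g. an ST-GCN) in this setting; this is a generic sketch, not the released SkeletonBYOL/ACL code.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class BYOL(nn.Module):
    """Minimal BYOL-style learner: online network + EMA target + predictor."""
    def __init__(self, encoder: nn.Module, feat_dim: int, proj_dim: int = 128, momentum: float = 0.996):
        super().__init__()
        self.momentum = momentum
        self.online = nn.Sequential(encoder, nn.Linear(feat_dim, proj_dim))
        self.target = copy.deepcopy(self.online)
        for p in self.target.parameters():
            p.requires_grad = False
        self.predictor = nn.Sequential(nn.Linear(proj_dim, proj_dim), nn.ReLU(),
                                       nn.Linear(proj_dim, proj_dim))

    @torch.no_grad()
    def update_target(self):
        # exponential moving average of the online weights
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.data.mul_(self.momentum).add_(po.data, alpha=1 - self.momentum)

    def forward(self, view1: torch.Tensor, view2: torch.Tensor) -> torch.Tensor:
        p1, p2 = self.predictor(self.online(view1)), self.predictor(self.online(view2))
        with torch.no_grad():
            t1, t2 = self.target(view1), self.target(view2)
        # symmetric negative-cosine objective with stop-gradient on the target branch
        return 2 - F.cosine_similarity(p1, t2.detach(), dim=-1).mean() \
                 - F.cosine_similarity(p2, t1.detach(), dim=-1).mean()
```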

Adaptive Nonlinear Latent Transformation for Conditional Face Editing

  • paper_url: http://arxiv.org/abs/2307.07790
  • repo_url: https://github.com/hzzone/adatrans
  • paper_authors: Zhizhong Huang, Siteng Ma, Junping Zhang, Hongming Shan
  • for: Disentangled, conditional face editing via a novel adaptive nonlinear latent transformation (AdaTrans).
  • methods: Divides the manipulation into finer steps whose direction and size are conditioned on both the facial attributes and the latent code, yielding a nonlinear trajectory that edits a face toward the target attributes while keeping others unchanged; a predefined density model constrains the trajectory within the latent distribution, and a mutual-information-based disentangled learning strategy removes entanglement among attributes and relaxes the need for labeled data.
  • results: AdaTrans enables controllable face editing with disentanglement, flexibility for non-binary attributes, and high fidelity, outperforming existing methods especially under large age gaps and with few labeled examples. Code is available at https://github.com/Hzzone/AdaTrans.
    Abstract Recent works for face editing usually manipulate the latent space of StyleGAN via the linear semantic directions. However, they usually suffer from the entanglement of facial attributes, need to tune the optimal editing strength, and are limited to binary attributes with strong supervision signals. This paper proposes a novel adaptive nonlinear latent transformation for disentangled and conditional face editing, termed AdaTrans. Specifically, our AdaTrans divides the manipulation process into several finer steps; i.e., the direction and size at each step are conditioned on both the facial attributes and the latent codes. In this way, AdaTrans describes an adaptive nonlinear transformation trajectory to manipulate the faces into target attributes while keeping other attributes unchanged. Then, AdaTrans leverages a predefined density model to constrain the learned trajectory in the distribution of latent codes by maximizing the likelihood of transformed latent code. Moreover, we also propose a disentangled learning strategy under a mutual information framework to eliminate the entanglement among attributes, which can further relax the need for labeled data. Consequently, AdaTrans enables a controllable face editing with the advantages of disentanglement, flexibility with non-binary attributes, and high fidelity. Extensive experimental results on various facial attributes demonstrate the qualitative and quantitative effectiveness of the proposed AdaTrans over existing state-of-the-art methods, especially in the most challenging scenarios with a large age gap and few labeled examples. The source code is available at https://github.com/Hzzone/AdaTrans.
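To make the multi-step idea concrete, here is a hypothetical sketch of conditional, iterative latent editing in which a small network predicts a direction and step size from the current latent code and the target attributes. The real AdaTrans additionally constrains the trajectory with a density model and a mutual-information objective, both omitted here; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveLatentEditor(nn.Module):
    """Sketch: at every step, predict an edit direction and magnitude
    conditioned on the current latent code and the target attributes."""
    def __init__(self, latent_dim: int = 512, attr_dim: int = 8, steps: int = 5):
        super().__init__()
        self.steps = steps
        self.step_net = nn.Sequential(
            nn.Linear(latent_dim + attr_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim + 1),  # direction (latent_dim) + scalar step size
        )

    def forward(self, z: torch.Tensor, target_attr: torch.Tensor) -> torch.Tensor:
        for _ in range(self.steps):
            out = self.step_net(torch.cat([z, target_attr], dim=-1))
            direction, size = out[:, :-1], out[:, -1:]
            z = z + torch.tanh(size) * nn.functional.normalize(direction, dim=-1)
        return z  # edited latent code, to be fed to a frozen generator (e.g. StyleGAN)
```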

SoccerKDNet: A Knowledge Distillation Framework for Action Recognition in Soccer Videos

  • paper_url: http://arxiv.org/abs/2307.07768
  • repo_url: None
  • paper_authors: Sarosij Bose, Saikat Sarkar, Amlan Chakrabarti
  • for: Classifying player actions in soccer videos, a challenging problem in sports analytics.
  • methods: Proposes a novel end-to-end knowledge-distillation-based transfer learning network pre-trained on the Kinetics400 dataset, and introduces a unique loss parameterization.
  • results: Obtains 67.20% validation accuracy on a new dataset named SoccerDB1, consisting of 448 videos across 4 diverse classes of players playing soccer; the model also generalizes well to new datasets.
    Abstract Classifying player actions from soccer videos is a challenging problem, which has become increasingly important in sports analytics over the years. Most state-of-the-art methods employ highly complex offline networks, which makes it difficult to deploy such models in resource constrained scenarios. Here, in this paper we propose a novel end-to-end knowledge distillation based transfer learning network pre-trained on the Kinetics400 dataset and then perform extensive analysis on the learned framework by introducing a unique loss parameterization. We also introduce a new dataset named SoccerDB1 containing 448 videos and consisting of 4 diverse classes each of players playing soccer. Furthermore, we introduce an unique loss parameter that help us linearly weigh the extent to which the predictions of each network are utilized. Finally, we also perform a thorough performance study using various changed hyperparameters. We also benchmark the first classification results on the new SoccerDB1 dataset obtaining 67.20% validation accuracy. Apart from outperforming prior arts significantly, our model also generalizes to new datasets easily. The dataset has been made publicly available at: https://bit.ly/soccerdb1
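A generic knowledge-distillation objective of the kind described above is sketched below; the `alpha` weight plays the role of a linear trade-off between the teacher's softened predictions and the hard labels, but the paper's exact loss parameterization is not reproduced here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Generic KD objective: KL to the temperature-softened teacher distribution
    plus cross-entropy to the hard labels, linearly weighed by `alpha`."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                         soft_targets, reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```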

Tightly-Coupled LiDAR-Visual SLAM Based on Geometric Features for Mobile Agents

  • paper_url: http://arxiv.org/abs/2307.07763
  • repo_url: None
  • paper_authors: Ke Cao, Ruiping Liu, Ze Wang, Kunyu Peng, Jiaming Zhang, Junwei Zheng, Zhifeng Teng, Kailun Yang, Rainer Stiefelhagen
  • for: Autonomous navigation and task execution for mobile robots in complex and unknown environments.
  • methods: A tightly-coupled LiDAR-visual SLAM based on geometric features, comprising two sub-systems (LiDAR and monocular visual SLAM) and a fusion framework that associates the depth and semantics of multi-modal geometric features to complement the visual line landmarks and add direction optimization in Bundle Adjustment; the LiDAR trajectory serves as a fallback when visual tracking fails.
  • results: On the public M2DGR dataset, the system achieves more accurate and robust pose estimation than state-of-the-art multi-modal methods.
    Abstract The mobile robot relies on SLAM (Simultaneous Localization and Mapping) to provide autonomous navigation and task execution in complex and unknown environments. However, it is hard to develop a dedicated algorithm for mobile robots due to dynamic and challenging situations, such as poor lighting conditions and motion blur. To tackle this issue, we propose a tightly-coupled LiDAR-visual SLAM based on geometric features, which includes two sub-systems (LiDAR and monocular visual SLAM) and a fusion framework. The fusion framework associates the depth and semantics of the multi-modal geometric features to complement the visual line landmarks and to add direction optimization in Bundle Adjustment (BA). This further constrains visual odometry. On the other hand, the entire line segment detected by the visual subsystem overcomes the limitation of the LiDAR subsystem, which can only perform the local calculation for geometric features. It adjusts the direction of linear feature points and filters out outliers, leading to a higher accurate odometry system. Finally, we employ a module to detect the subsystem's operation, providing the LiDAR subsystem's output as a complementary trajectory to our system while visual subsystem tracking fails. The evaluation results on the public dataset M2DGR, gathered from ground robots across various indoor and outdoor scenarios, show that our system achieves more accurate and robust pose estimation compared to current state-of-the-art multi-modal methods.

Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

  • paper_url: http://arxiv.org/abs/2307.07757
  • repo_url: https://github.com/ruipingl/opensu
  • paper_authors: Ruiping Liu, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ke Cao, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
  • for: Helping people with visual impairments (PVI) move more independently by enhancing scene understanding and providing useful information.
  • methods: Builds on Grounded Situation Recognition (GSR), adds an efficient Segment Anything Model (SAM) to produce pixel-wise dense segmentation masks of the involved entities instead of bounding boxes, uses a solid pure-transformer backbone, and replaces all activation functions in the GSR decoders with GELU to accelerate convergence.
  • results: Achieves state-of-the-art performance on the SWiG dataset; field tests on dedicated assistive-technology datasets and application demonstrations confirm the feasibility and usefulness of the OpenSU system.
    Abstract Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization information of detected objects is often required to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance the feature extraction and interaction between the encoder-decoder structure, we construct our OpenSU system using a solid pure transformer backbone to improve the performance of GSR. In order to accelerate the convergence, we replace all the activation functions within the GSR decoders with GELU, thereby reducing the training duration. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, through field testing on dedicated assistive technology datasets and application demonstrations, the proposed OpenSU system can be used to enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.
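The GSR-to-SAM hand-off can be sketched with the public `segment_anything` package: boxes predicted for each semantic role are refined into dense masks. The checkpoint path and the `gsr_boxes` input format below are placeholders; this is a sketch, not the OpenSU release.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone (checkpoint must be downloaded separately; path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def masks_for_roles(image_rgb: np.ndarray, gsr_boxes: dict) -> dict:
    """image_rgb: HxWx3 uint8 image; gsr_boxes: {role_name: [x0, y0, x1, y1]}
    boxes produced by a GSR model (hypothetical format)."""
    predictor.set_image(image_rgb)
    role_masks = {}
    for role, box in gsr_boxes.items():
        masks, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        role_masks[role] = masks[0]  # boolean HxW mask for this entity
    return role_masks
```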

Fast Adaptation with Bradley-Terry Preference Models in Text-To-Image Classification and Generation

  • paper_url: http://arxiv.org/abs/2308.07929
  • repo_url: None
  • paper_authors: Victor Gallego
  • for: Adapting large multimodal models such as CLIP and Stable Diffusion toward sets of particular human preferences, so that users can personalize them for specific tasks or preferences.
  • methods: A fast adaptation method built on the Bradley-Terry preference model that efficiently fine-tunes the original model with few examples and minimal computing resources.
  • results: Experiments across multimodal text and image understanding domains, including preference prediction as a reward model and generation tasks, demonstrate the capabilities of the framework.
    Abstract Recently, large multimodal models, such as CLIP and Stable Diffusion have experimented tremendous successes in both foundations and applications. However, as these models increase in parameter size and computational requirements, it becomes more challenging for users to personalize them for specific tasks or preferences. In this work, we address the problem of adapting the previous models towards sets of particular human preferences, aligning the retrieved or generated images with the preferences of the user. We leverage the Bradley-Terry preference model to develop a fast adaptation method that efficiently fine-tunes the original model, with few examples and with minimal computing resources. Extensive evidence of the capabilities of this framework is provided through experiments in different domains related to multimodal text and image understanding, including preference prediction as a reward model, and generation tasks.
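For context, the Bradley-Terry model scores a pair so that the preferred item wins with probability sigmoid(s_pref - s_other); a minimal pairwise loss of this form is sketched below. This is a generic sketch of the model family the paper builds on, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_preferred: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise objective: minimize the negative log-likelihood
    of the observed preferences, P(pref wins) = sigmoid(s_pref - s_other)."""
    return -F.logsigmoid(score_preferred - score_other).mean()

# toy usage: scores could come from, e.g., a small head on top of CLIP embeddings
s_pref, s_other = torch.randn(8, requires_grad=True), torch.randn(8)
loss = bradley_terry_loss(s_pref, s_other)
loss.backward()
```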

Prawn Morphometrics and Weight Estimation from Images using Deep Learning for Landmark Localization

  • paper_url: http://arxiv.org/abs/2307.07732
  • repo_url: None
  • paper_authors: Alzayat Saleh, Md Mehedi Hasan, Herman W Raadsma, Mehar S Khatkar, Dean R Jerry, Mostafa Rahimi Azghadi
  • for: A reliable, high-accuracy digital imaging technique for aquaculture, enabling rapid weight estimation and morphometric analysis of prawns.
  • methods: A novel deep learning approach with two main components: a feature extraction module that efficiently combines low-level and high-level features using the Kronecker product operation, and a landmark localization module that predicts the coordinates of key morphological points (landmarks) on the prawn body; weight is then regressed from the extracted landmarks with a fully connected network.
  • results: Evaluated on a dataset of 8164 black tiger prawn images collected from Australian farms, the method outperforms existing DL approaches in accuracy, robustness, and efficiency; the detected landmarks are used to derive five important prawn traits, and PCA shows that landmark-derived distances correlate highly with shape features such as body length and width.
    Abstract Accurate weight estimation and morphometric analyses are useful in aquaculture for optimizing feeding, predicting harvest yields, identifying desirable traits for selective breeding, grading processes, and monitoring the health status of production animals. However, the collection of phenotypic data through traditional manual approaches at industrial scales and in real-time is time-consuming, labour-intensive, and prone to errors. Digital imaging of individuals and subsequent training of prediction models using Deep Learning (DL) has the potential to rapidly and accurately acquire phenotypic data from aquaculture species. In this study, we applied a novel DL approach to automate weight estimation and morphometric analysis using the black tiger prawn (Penaeus monodon) as a model crustacean. The DL approach comprises two main components: a feature extraction module that efficiently combines low-level and high-level features using the Kronecker product operation; followed by a landmark localization module that then uses these features to predict the coordinates of key morphological points (landmarks) on the prawn body. Once these landmarks were extracted, weight was estimated using a weight regression module based on the extracted landmarks using a fully connected network. For morphometric analyses, we utilized the detected landmarks to derive five important prawn traits. Principal Component Analysis (PCA) was also used to identify landmark-derived distances, which were found to be highly correlated with shape features such as body length, and width. We evaluated our approach on a large dataset of 8164 images of the Black tiger prawn (Penaeus monodon) collected from Australian farms. Our experimental results demonstrate that the novel DL approach outperforms existing DL methods in terms of accuracy, robustness, and efficiency.
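As a toy illustration of Kronecker-product feature fusion, the sketch below fuses a low-level and a high-level feature vector via their per-sample Kronecker (outer) product and projects the result back down; the dimensions and the projection layer are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class KroneckerFusion(nn.Module):
    """Sketch: fuse low- and high-level feature vectors via a per-sample
    Kronecker product, then project to a fixed output dimension."""
    def __init__(self, low_dim: int, high_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(low_dim * high_dim, out_dim)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # For vectors, the Kronecker product equals the flattened outer product.
        fused = torch.einsum("bi,bj->bij", low, high).flatten(1)
        return self.proj(fused)

fusion = KroneckerFusion(low_dim=64, high_dim=128, out_dim=256)
out = fusion(torch.randn(4, 64), torch.randn(4, 128))  # -> (4, 256)
```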

Improving NeRF with Height Data for Utilization of GIS Data

  • paper_url: http://arxiv.org/abs/2307.07729
  • repo_url: None
  • paper_authors: Hinata Aoki, Takao Yamanaka
  • for: Reconstructing large-scale scenes with Neural Radiance Fields (NeRF) by exploiting height data obtained from a Geographic Information System (GIS).
  • methods: Divides the scene space into multiple objects and a background using the height data, represents them with separate neural networks, and adds a height-guided adaptive sampling method.
  • results: Improves image rendering accuracy while achieving faster training.
    Abstract Neural Radiance Fields (NeRF) has been applied to various tasks related to representations of 3D scenes. Most studies based on NeRF have focused on a small object, while a few studies have tried to reconstruct large-scale scenes although these methods tend to require large computational cost. For the application of NeRF to large-scale scenes, a method based on NeRF is proposed in this paper to effectively use height data which can be obtained from GIS (Geographic Information System). For this purpose, the scene space was divided into multiple objects and a background using the height data to represent them with separate neural networks. In addition, an adaptive sampling method is also proposed by using the height data. As a result, the accuracy of image rendering was improved with faster training speed.

Improving Translation Invariance in Convolutional Neural Networks with Peripheral Prediction Padding

  • paper_url: http://arxiv.org/abs/2307.07725
  • repo_url: None
  • paper_authors: Kensuke Mukai, Takao Yamanaka
  • for: A new padding method for convolutional neural networks whose padding values can be trained end-to-end.
  • methods: Peripheral Prediction Padding (PP-Pad), which learns padding values suitable for each task instead of using zero padding, together with new metrics for quantitatively evaluating the translation invariance of a model.
  • results: On a semantic segmentation task, PP-Pad achieves higher accuracy and better translation invariance than previous padding methods.
    Abstract Zero padding is often used in convolutional neural networks to prevent the feature map size from decreasing with each layer. However, recent studies have shown that zero padding promotes encoding of absolute positional information, which may adversely affect the performance of some tasks. In this work, a novel padding method called Peripheral Prediction Padding (PP-Pad) method is proposed, which enables end-to-end training of padding values suitable for each task instead of zero padding. Moreover, novel metrics to quantitatively evaluate the translation invariance of the model are presented. By evaluating with these metrics, it was confirmed that the proposed method achieved higher accuracy and translation invariance than the previous methods in a semantic segmentation task.
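One plausible way to realize prediction-based padding is sketched below: a small convolution reads the outermost rows/columns of the feature map and predicts the padding values, so they can be trained end-to-end with the rest of the network. The strip width and the weight sharing across sides are assumptions; the paper's exact PP-Pad design may differ.

```python
import torch
import torch.nn as nn

class PeripheralPredictionPad2d(nn.Module):
    """Sketch: predict one row/column of padding from the peripheral strips
    of the feature map instead of filling with zeros."""
    def __init__(self, channels: int, strip: int = 2):
        super().__init__()
        self.strip = strip
        # Collapses a (strip x W) border region into a single predicted padding row.
        self.predict = nn.Conv2d(channels, channels, kernel_size=(strip, 3), padding=(0, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # vertical padding: predict one row from the top/bottom peripheral strips
        top = self.predict(x[:, :, :self.strip, :])                    # (B, C, 1, W)
        bot = self.predict(x[:, :, -self.strip:, :].flip(2))           # (B, C, 1, W)
        x = torch.cat([top, x, bot], dim=2)                            # (B, C, H+2, W)
        # horizontal padding: reuse the same predictor on the transposed map
        xt = x.transpose(2, 3)                                         # (B, C, W, H+2)
        left = self.predict(xt[:, :, :self.strip, :]).transpose(2, 3)  # (B, C, H+2, 1)
        right = self.predict(xt[:, :, -self.strip:, :].flip(2)).transpose(2, 3)
        return torch.cat([left, x, right], dim=3)                      # (B, C, H+2, W+2)

pad = PeripheralPredictionPad2d(channels=16)
y = nn.Conv2d(16, 32, kernel_size=3, padding=0)(pad(torch.randn(1, 16, 8, 8)))
print(y.shape)  # torch.Size([1, 32, 8, 8]) -- spatial size preserved without zero padding
```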

Spatial-Spectral Hyperspectral Classification based on Learnable 3D Group Convolution

  • paper_url: http://arxiv.org/abs/2307.07720
  • repo_url: None
  • paper_authors: Guandong Li, Mengxia Ye
  • for: This paper proposes a learnable group convolution network (LGCNet) to improve the performance of deep neural networks in hyperspectral image classification.
  • methods: The LGCNet module uses a dynamic learning method for the input channels and convolution kernel grouping, allowing flexible grouping structures and improved representation ability within an improved 3D-DenseNet and a lightweight model design.
  • results: LGCNet achieves better inference speed and accuracy than mainstream hyperspectral image classification methods on three datasets (Indian Pines, Pavia University, and KSC).
    Abstract Deep neural networks have faced many problems in hyperspectral image classification, including the ineffective utilization of spectral-spatial joint information and the problems of gradient vanishing and overfitting that arise with increasing depth. In order to accelerate the deployment of models on edge devices with strict latency requirements and limited computing power, this paper proposes a learnable group convolution network (LGCNet) based on an improved 3D-DenseNet model and a lightweight model design. The LGCNet module improves the shortcomings of group convolution by introducing a dynamic learning method for the input channels and convolution kernel grouping, enabling flexible grouping structures and generating better representation ability. Through the overall loss and gradient of the backpropagation network, the 3D group convolution is dynamically determined and updated in an end-to-end manner. The learnable number of channels and corresponding grouping can capture different complementary visual features of input images, allowing the CNN to learn richer feature representations. When extracting high-dimensional and redundant hyperspectral data, the 3D convolution kernels also contain a large amount of redundant information. The LGC module allows the 3D-DenseNet to choose channel information with more semantic features, and is very efficient, making it suitable for embedding in any deep neural network for acceleration and efficiency improvements. LGC enables the 3D-CNN to achieve sufficient feature extraction while also meeting speed and computing requirements. Furthermore, LGCNet has achieved progress in inference speed and accuracy, and outperforms mainstream hyperspectral image classification methods on the Indian Pines, Pavia University, and KSC datasets.

ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.07710
  • repo_url: https://github.com/wyf0912/ExposureDiffusion
  • paper_authors: Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen
  • for: Low-light raw image enhancement with improved fidelity.
  • methods: Seamlessly integrates a diffusion model with a physics-based exposure model, so restoration starts directly from the noisy image rather than pure noise, plus an adaptive residual layer that screens out side effects of iterative refinement once intermediate results are already well exposed.
  • results: Significantly improved performance and reduced inference time compared with vanilla diffusion models, better generalization to unseen amplification ratios, and better performance than a larger feed-forward model when few parameters are used; the framework works with real-paired datasets, real/synthetic noise models, and different backbones.
    Abstract Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. The proposed framework can work with both real-paired datasets, SOTA noise models, and different backbone networks. Note that, the proposed framework is compatible with real-paired datasets, real/synthetic noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted.

PSGformer: Enhancing 3D Point Cloud Instance Segmentation via Precise Semantic Guidance

  • paper_url: http://arxiv.org/abs/2307.07708
  • repo_url: None
  • paper_authors: Lei Pan, Wuyang Luan, Yuan Zheng, Qiang Fu, Junhui Li
  • for: Improving 3D instance segmentation by addressing the limitations inherited from 3D semantic segmentation models.
  • methods: PSGformer, a new 3D instance segmentation network with two key advances: a Multi-Level Semantic Aggregation Module that captures scene features via foreground point filtering and multi-radius aggregation, and a Parallel Feature Fusion Transformer Module that processes super-point features and aggregated features independently for a more comprehensive representation.
  • results: Extensive experiments on ScanNetv2 show that PSGformer exceeds state-of-the-art methods by 2.2% mAP on the ScanNetv2 hidden test set.
    Abstract Most existing 3D instance segmentation methods are derived from 3D semantic segmentation models. However, these indirect approaches suffer from certain limitations. They fail to fully leverage global and local semantic information for accurate prediction, which hampers the overall performance of the 3D instance segmentation framework. To address these issues, this paper presents PSGformer, a novel 3D instance segmentation network. PSGformer incorporates two key advancements to enhance the performance of 3D instance segmentation. Firstly, we propose a Multi-Level Semantic Aggregation Module, which effectively captures scene features by employing foreground point filtering and multi-radius aggregation. This module enables the acquisition of more detailed semantic information from global and local perspectives. Secondly, PSGformer introduces a Parallel Feature Fusion Transformer Module that independently processes super-point features and aggregated features using transformers. The model achieves a more comprehensive feature representation by the features which connect global and local features. We conducted extensive experiments on the ScanNetv2 dataset. Notably, PSGformer exceeds compared state-of-the-art methods by 2.2% on ScanNetv2 hidden test set in terms of mAP. Our code and models will be publicly released.
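The sketch below shows a dense (small-point-cloud) version of foreground filtering followed by multi-radius neighbourhood aggregation, the kind of operation the Multi-Level Semantic Aggregation Module is described as performing; the threshold, radii, and pooling choice are assumptions.

```python
import torch

def multi_radius_aggregate(xyz: torch.Tensor, feats: torch.Tensor,
                           fg_scores: torch.Tensor, radii=(0.2, 0.4, 0.8)):
    """Sketch of foreground filtering + multi-radius feature aggregation.
    xyz: (N, 3) point coordinates, feats: (N, C), fg_scores: (N,) foreground probabilities."""
    keep = fg_scores > 0.5                       # drop likely-background points
    xyz, feats = xyz[keep], feats[keep]
    dist = torch.cdist(xyz, xyz)                 # (M, M) pairwise distances (fine for small M)
    pooled = []
    for r in radii:
        nbr = (dist < r).float()                 # neighbourhood indicator at this radius
        pooled.append(nbr @ feats / nbr.sum(dim=1, keepdim=True).clamp(min=1.0))
    return torch.cat(pooled, dim=1)              # (M, C * len(radii)) multi-scale features
```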

Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction and Modeling from 2D Sparse Cardiac Magnetic Resonance Imaging

  • paper_url: http://arxiv.org/abs/2307.07693
  • repo_url: None
  • paper_authors: Meng Ye, Dong Yang, Mikael Kanski, Leon Axel, Dimitris Metaxas
  • for: Reconstructing and modeling the 3D bi-ventricular shape of the heart from 2D sparse cardiac magnetic resonance (CMR) imaging data.
  • methods: Models the bi-ventricular shape with blended deformable superquadrics parameterized by geometric parameter functions, combining global deformations that capture gross shape with local deformations parameterized as neural diffeomorphic point flows that recover detail; the geometric parameter functions are learned from a shape distribution manifold rather than by iterative optimization.
  • results: Outperforms conventional methods, densifies sparse cardiac point clouds at arbitrary scales, automatically generates high-quality triangular meshes, and learns dense correspondences for accurate cardiac shape registration, with parameters intuitive enough for a physician to use without sophisticated post-processing.
    Abstract We propose a novel neural deformable model (NDM) targeting at the reconstruction and modeling of 3D bi-ventricular shape of the heart from 2D sparse cardiac magnetic resonance (CMR) imaging data. We model the bi-ventricular shape using blended deformable superquadrics, which are parameterized by a set of geometric parameter functions and are capable of deforming globally and locally. While global geometric parameter functions and deformations capture gross shape features from visual data, local deformations, parameterized as neural diffeomorphic point flows, can be learned to recover the detailed heart shape.Different from iterative optimization methods used in conventional deformable model formulations, NDMs can be trained to learn such geometric parameter functions, global and local deformations from a shape distribution manifold. Our NDM can learn to densify a sparse cardiac point cloud with arbitrary scales and generate high-quality triangular meshes automatically. It also enables the implicit learning of dense correspondences among different heart shape instances for accurate cardiac shape registration. Furthermore, the parameters of NDM are intuitive, and can be used by a physician without sophisticated post-processing. Experimental results on a large CMR dataset demonstrate the improved performance of NDM over conventional methods.
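For readers unfamiliar with the primitive, the snippet below samples points on a plain superquadric (superellipsoid) surface, the base shape from which the paper's blended deformable model starts; the global and local deformations, which do the real modeling work, are omitted, and the scale/shape parameters are illustrative.

```python
import torch

def superquadric_points(a=(1.0, 1.0, 1.0), eps=(0.8, 0.8), n=64):
    """Sample an (n x n) grid of points on a superquadric surface with
    scales a = (a1, a2, a3) and shape exponents eps = (eps1, eps2)."""
    spow = lambda w, m: torch.sign(w) * torch.abs(w).pow(m)   # signed power
    eta = torch.linspace(-torch.pi / 2, torch.pi / 2, n)
    omega = torch.linspace(-torch.pi, torch.pi, n)
    eta, omega = torch.meshgrid(eta, omega, indexing="ij")
    x = a[0] * spow(torch.cos(eta), eps[0]) * spow(torch.cos(omega), eps[1])
    y = a[1] * spow(torch.cos(eta), eps[0]) * spow(torch.sin(omega), eps[1])
    z = a[2] * spow(torch.sin(eta), eps[0])
    return torch.stack([x, y, z], dim=-1)                     # (n, n, 3) surface points

pts = superquadric_points()
print(pts.shape)  # torch.Size([64, 64, 3])
```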

DRM-IR: Task-Adaptive Deep Unfolding Network for All-In-One Image Restoration

  • paper_url: http://arxiv.org/abs/2307.07688
  • repo_url: https://github.com/YuanshuoCheng/DRM-IR
  • paper_authors: Yuanshuo Cheng, Mingwen Shao, Yecong Wan, Chao Wang, Wangmeng Zuo
  • for: An efficient all-in-one image restoration method (DRM-IR) that handles various types of degradation with higher task dexterity.
  • methods: Formulates task-adaptive degradation modeling and model-based image restoration as a pair of entangled reference-based maximum a posteriori (MAP) inferences, optimized synchronously in an unfolding-based manner, with a Degradation Prior Transmitter (DPT) bridging the semantic gap between the reference and target degraded images.
  • results: Extensive experiments on multiple benchmark datasets show that DRM-IR achieves state-of-the-art all-in-one image restoration while remaining interpretable.
    Abstract Existing All-In-One image restoration (IR) methods usually lack flexible modeling on various types of degradation, thus impeding the restoration performance. To achieve All-In-One IR with higher task dexterity, this work proposes an efficient Dynamic Reference Modeling paradigm (DRM-IR), which consists of task-adaptive degradation modeling and model-based image restoring. Specifically, these two subtasks are formalized as a pair of entangled reference-based maximum a posteriori (MAP) inferences, which are optimized synchronously in an unfolding-based manner. With the two cascaded subtasks, DRM-IR first dynamically models the task-specific degradation based on a reference image pair and further restores the image with the collected degradation statistics. Besides, to bridge the semantic gap between the reference and target degraded images, we further devise a Degradation Prior Transmitter (DPT) that restrains the instance-specific feature differences. DRM-IR explicitly provides superior flexibility for All-in-One IR while being interpretable. Extensive experiments on multiple benchmark datasets show that our DRM-IR achieves state-of-the-art in All-In-One IR.

Semantic Contrastive Bootstrapping for Single-positive Multi-label Recognition

  • paper_url: http://arxiv.org/abs/2307.07680
  • repo_url: https://github.com/iCVTEAM/Scob
  • paper_authors: Cheng Chen, Yifan Zhao, Jia Li
  • for: Multi-label image recognition with incomplete (single-positive) annotation, reducing labeling effort while maintaining recognition performance.
  • methods: Semantic contrastive bootstrapping (Scob), which introduces class activation as semantic guidance to gradually recover cross-object relationships, a recurrent semantic masked transformer to extract iconic object-level representations for contrastive learning, and an Expectation-Maximization-style bootstrapping framework that iteratively optimizes the network and refines the semantic guidance.
  • results: Extensive experiments show that the joint learning framework surpasses state-of-the-art models by a large margin on four public multi-label image recognition benchmarks while alleviating disturbance caused by wrong semantic guidance.
    Abstract Learning multi-label image recognition with incomplete annotation is gaining popularity due to its superior performance and significant labor savings when compared to training with fully labeled datasets. Existing literature mainly focuses on label completion and co-occurrence learning while facing difficulties with the most common single-positive label manner. To tackle this problem, we present a semantic contrastive bootstrapping (Scob) approach to gradually recover the cross-object relationships by introducing class activation as semantic guidance. With this learning guidance, we then propose a recurrent semantic masked transformer to extract iconic object-level representations and delve into the contrastive learning problems on multi-label classification tasks. We further propose a bootstrapping framework in an Expectation-Maximization fashion that iteratively optimizes the network parameters and refines semantic guidance to alleviate possible disturbance caused by wrong semantic guidance. Extensive experimental results demonstrate that the proposed joint learning framework surpasses the state-of-the-art models by a large margin on four public multi-label image recognition benchmarks. Codes can be found at https://github.com/iCVTEAM/Scob.

Both Spatial and Frequency Cues Contribute to High-Fidelity Image Inpainting

  • paper_url: http://arxiv.org/abs/2307.07678
  • repo_url: None
  • paper_authors: Ze Lu, Yalei Lv, Wenqi Wang, Pengfei Xiong
  • for: Image inpainting with deep generative approaches.
  • methods: Proposed Frequency-Spatial Complementary Network (FSCN) with extra Frequency Branch and Frequency Loss, and Frequency-Spatial Cross-Attention Block (FSCAB) to fuse multi-domain features.
  • results: Superior inpainting results with fewer parameters and less computation cost, outperforming previous state-of-the-art approaches.
    Abstract Deep generative approaches have obtained great success in image inpainting recently. However, most generative inpainting networks suffer from either over-smooth results or aliasing artifacts. The former lacks high-frequency details, while the latter lacks semantic structure. To address this issue, we propose an effective Frequency-Spatial Complementary Network (FSCN) by exploiting rich semantic information in both spatial and frequency domains. Specifically, we introduce an extra Frequency Branch and Frequency Loss on the spatial-based network to impose direct supervision on the frequency information, and propose a Frequency-Spatial Cross-Attention Block (FSCAB) to fuse multi-domain features and combine the corresponding characteristics. With our FSCAB, the inpainting network is capable of capturing frequency information and preserving visual consistency simultaneously. Extensive quantitative and qualitative experiments demonstrate that our inpainting network can effectively achieve superior results, outperforming previous state-of-the-art approaches with significantly fewer parameters and less computation cost. The code will be released soon.
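A minimal frequency-domain supervision term of the kind the Frequency Loss represents can be written with a 2D FFT, as sketched below; the exact formulation and weighting used in FSCN are not reproduced here.

```python
import torch
import torch.nn.functional as F

def frequency_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sketch: compare amplitude spectra of the inpainted result and the
    ground truth via a 2D FFT."""
    pred_fft = torch.fft.fft2(pred, norm="ortho")
    target_fft = torch.fft.fft2(target, norm="ortho")
    return F.l1_loss(torch.abs(pred_fft), torch.abs(target_fft))

# toy usage alongside a spatial reconstruction loss (the 0.1 weight is illustrative)
pred, gt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
total = F.l1_loss(pred, gt) + 0.1 * frequency_loss(pred, gt)
```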

Learning from Pseudo-labeled Segmentation for Multi-Class Object Counting

  • paper_url: http://arxiv.org/abs/2307.07677
  • repo_url: None
  • paper_authors: Jingyi Xu, Hieu Le, Dimitris Samaras
  • for: Class-agnostic counting in multi-class images: counting objects of an arbitrary category specified by only a few annotated exemplars, a setting where current counting models tend to greedily count every object regardless of the exemplars.
  • methods: An exemplar-based segmentation model first localizes the region containing the objects of interest before counting; since no segmentation supervision is available, pseudo segmentation masks are derived from box exemplars and dot annotations alone to train this model.
  • results: On two new benchmarks, a synthetic multi-class dataset and a new test set of real images containing objects from multiple classes, the proposed method shows a significant advantage over previous class-agnostic counting methods.
    Abstract Class-agnostic counting (CAC) has numerous potential applications across various domains. The goal is to count objects of an arbitrary category during testing, based on only a few annotated exemplars. In this paper, we point out that the task of counting objects of interest when there are multiple object classes in the image (namely, multi-class object counting) is particularly challenging for current object counting models. They often greedily count every object regardless of the exemplars. To address this issue, we propose localizing the area containing the objects of interest via an exemplar-based segmentation model before counting them. The key challenge here is the lack of segmentation supervision to train this model. To this end, we propose a method to obtain pseudo segmentation masks using only box exemplars and dot annotations. We show that the segmentation model trained on these pseudo-labeled masks can effectively localize objects of interest for an arbitrary multi-class image based on the exemplars. To evaluate the performance of different methods on multi-class counting, we introduce two new benchmarks, a synthetic multi-class dataset and a new test set of real images in which objects from multiple classes are present. Our proposed method shows a significant advantage over the previous CAC methods on these two benchmarks.

INVE: Interactive Neural Video Editing

  • paper_url: http://arxiv.org/abs/2307.07663
  • repo_url: None
  • paper_authors: Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee
  • for: A real-time video editing solution that consistently propagates sparse frame edits to the entire video clip.
  • methods: Builds on Layered Neural Atlas (LNA), which is too slow for interactive editing and lacks support for use cases such as direct frame editing and rigid texture tracking; INVE adopts highly efficient network architectures powered by hash-grids encoding, learns bi-directional functions between image and atlas, and introduces vectorized editing to enable a much greater variety of edits in both the atlas and the frames.
  • results: INVE reduces learning and inference time by a factor of 5 compared to LNA and supports video editing operations that LNA cannot; a comprehensive quantitative and qualitative analysis highlights its advantages. For video results, see https://gabriel-huang.github.io/inve/
    Abstract We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges we leverage and adopt highly efficient network architectures, powered by hash-grids encoding, to substantially improve processing speed. In addition, we learn bi-directional functions between image-atlas and introduce vectorized editing, which collectively enables a much greater variety of edits in both the atlas and the frames directly. Compared to LNA, our INVE reduces the learning and inference time by a factor of 5, and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis, highlighting its numerous advantages and improved performance. For video results, please see https://gabriel-huang.github.io/inve/

RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World

  • paper_url: http://arxiv.org/abs/2307.07653
  • repo_url: https://github.com/winterwindwang/rfla
  • paper_authors: Donghua Wang, Wen Yao, Tingsong Jiang, Chao Li, Xiaoqian Chen
  • for: A physical-world adversarial attack against deep neural networks (DNNs).
  • methods: A Reflected Light Attack (RFLA) that places a colored transparent plastic sheet and a paper cut-out of a specific shape in front of a mirror, creating differently colored geometries on the target object; a circle-based framework optimizes the position, radius, geometry, fill color, and transparency of the reflected pattern.
  • results: Over 99% success rate across different datasets and models in the digital world, with effectiveness also verified in physical environments using sunlight or a flashlight.
    Abstract Physical adversarial attacks against deep neural networks (DNNs) have recently gained increasing attention. The current mainstream physical attacks use printed adversarial patches or camouflage to alter the appearance of the target object. However, these approaches generate conspicuous adversarial patterns that show poor stealthiness. Another physical deployable attack is the optical attack, featuring stealthiness while exhibiting weakly in the daytime with sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA), featuring effective and stealthy in both the digital and physical world, which is implemented by placing the color transparent plastic sheet and a paper cut of a specific shape in front of the mirror to create different colored geometries on the target object. To achieve these goals, we devise a general framework based on the circle to model the reflected light on the target object. Specifically, we optimize a circle (composed of a coordinate and radius) to carry various geometrical shapes determined by the optimized angle. The fill color of the geometry shape and its corresponding transparency are also optimized. We extensively evaluate the effectiveness of RFLA on different datasets and models. Experiment results suggest that the proposed method achieves over 99% success rate on different datasets and models in the digital world. Additionally, we verify the effectiveness of the proposed method in different physical environments by using sunlight or a flashlight.

ACF-Net: An Attention-enhanced Co-interactive Fusion Network for Automated Structural Condition Assessment in Visual Inspection

  • paper_url: http://arxiv.org/abs/2307.07643
  • repo_url: None
  • paper_authors: Chenyu Zhang, Zhaozheng Yin, Ruwen Qin
  • for: This paper proposes a method for automating visual bridge inspection to improve the efficiency of civil infrastructure monitoring.
  • methods: The approach is based on an Attention-enhanced Co-interactive Fusion Network (ACF-Net) that simultaneously parses structural elements and segments surface defects, using two task-specific relearning subnets to extract task-specific features.
  • results: Experiments show the proposed ACF-Net achieves 92.11% mIoU for element parsing and 87.16% mIoU for corrosion segmentation on the new Steel Bridge Condition Inspection Visual (SBCIV) test set, outperforming state-of-the-art methods.
    Abstract Efficiently monitoring the condition of civil infrastructures necessitates automating the structural condition assessment in visual inspection. This paper proposes an Attention-enhanced Co-interactive Fusion Network (ACF-Net) for automatic structural condition assessment in visual bridge inspection. The ACF-Net can simultaneously parse structural elements and segment surface defects on the elements in inspection images. It integrates two task-specific relearning subnets to extract task-specific features from an overall feature embedding and a co-interactive feature fusion module to capture the spatial correlation and facilitate information sharing between tasks. Experimental results demonstrate that the proposed ACF-Net outperforms the current state-of-the-art approaches, achieving promising performance with 92.11% mIoU for element parsing and 87.16% mIoU for corrosion segmentation on the new benchmark dataset Steel Bridge Condition Inspection Visual (SBCIV) testing set. An ablation study reveals the strengths of ACF-Net, and a case study showcases its capability to automate structural condition assessment. The code will be open-source after acceptance.
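
The sketch below shows the general dual-task layout the abstract describes: a shared encoder, two task-specific branches, and a simple co-interactive fusion step implemented here with cross-attention. Channel counts, the fusion mechanism, and class counts are assumptions for illustration, not the released ACF-Net.

```python
# Sketch: shared encoder, two task branches, and cross-attention fusion
# between them, loosely mirroring a dual-task parsing/segmentation network.
import torch
import torch.nn as nn

class DualTaskNet(nn.Module):
    def __init__(self, ch=64, n_elem=5, n_defect=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
        self.branch_elem = nn.Conv2d(ch, ch, 3, padding=1)
        self.branch_defect = nn.Conv2d(ch, ch, 3, padding=1)
        self.fusion = nn.MultiheadAttention(ch, num_heads=4, batch_first=True)
        self.head_elem = nn.Conv2d(ch, n_elem, 1)
        self.head_defect = nn.Conv2d(ch, n_defect, 1)

    def forward(self, x):
        f = self.encoder(x)
        fe, fd = self.branch_elem(f), self.branch_defect(f)
        b, c, h, w = fe.shape
        # flatten spatial dims so the defect branch can attend to element features
        fe_t = fe.flatten(2).transpose(1, 2)          # (B, HW, C)
        fd_t = fd.flatten(2).transpose(1, 2)
        fd_fused, _ = self.fusion(fd_t, fe_t, fe_t)   # query=defect, key/value=element
        fd = fd_fused.transpose(1, 2).reshape(b, c, h, w)
        return self.head_elem(fe), self.head_defect(fd)

logits_elem, logits_defect = DualTaskNet()(torch.rand(1, 3, 64, 64))
```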

CoTracker: It is Better to Track Together

  • paper_url: http://arxiv.org/abs/2307.07635
  • repo_url: https://github.com/facebookresearch/co-tracker
  • paper_authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
  • for: The goal is a method that jointly tracks multiple points across video frames, improving video motion prediction.
  • methods: The method uses a transformer network with specialized attention layers that iteratively update the estimates of multiple trajectories.
  • results: The method outperforms current state-of-the-art approaches on most benchmarks, can track several points jointly, and supports adding new points to track at any time.
    Abstract Methods for video motion prediction either estimate jointly the instantaneous motion of all points in a given video frame using optical flow or independently track the motion of individual points throughout the video. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance, because they belong to the same physical object, potentially harming performance. In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. This architecture combines several ideas from the optical flow and tracking literature in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers. The transformer iteratively updates an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. It can track from one to several points jointly and supports adding new points to track at any time. The result is a flexible and powerful tracking algorithm that outperforms state-of-the-art methods in almost all benchmarks.
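
To make the "joint iterative refinement" idea concrete, here is a toy sketch in which every (point, frame) estimate is a token and a transformer repeatedly predicts coordinate updates for all tracks at once. The feature sampler, token layout, and dimensions are placeholders, not CoTracker's architecture.

```python
# Sketch: joint point tracking by iterative transformer refinement of all
# track estimates at once; dimensions and the "feature" embedding are toys.
import torch
import torch.nn as nn

class JointTrackRefiner(nn.Module):
    def __init__(self, d=128, n_iters=4):
        super().__init__()
        self.n_iters = n_iters
        self.embed = nn.Linear(2, d)           # stand-in for sampled image features
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.delta = nn.Linear(d, 2)

    def forward(self, coords):                 # coords: (B, N*T, 2) track estimates
        for _ in range(self.n_iters):
            tokens = self.embed(coords)        # attention couples all points and frames
            coords = coords + self.delta(self.transformer(tokens))
        return coords

B, N, T = 1, 8, 16                             # batch, points, frames in the window
refined = JointTrackRefiner()(torch.rand(B, N * T, 2))
# For long videos the refiner would be applied in a sliding-window fashion,
# initializing each window from the tracks estimated in the previous one.
```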

Generalizable Embeddings with Cross-batch Metric Learning

  • paper_url: http://arxiv.org/abs/2307.07620
  • repo_url: https://github.com/yetigurbuz/xml-dml
  • paper_authors: Yeti Z. Gurbuz, A. Aydin Alatan
  • for: This paper examines the global average pooling (GAP) component in deep metric learning (DML) and its effect on learning generalizable representations.
  • methods: The authors view GAP as combining feature vectors that each act as a distinct semantic entity, formulate it as a convex combination of learnable prototypes, and express prototype learning as a recursive process that fits a linear predictor to a batch of samples; at each iteration, samples from one batch are expressed with prototypes fitted to a disjoint batch to regularize learning.
  • results: The approach is validated on 4 popular DML benchmarks and improves the generalization of DML models.
    Abstract Global average pooling (GAP) is a popular component in deep metric learning (DML) for aggregating features. Its effectiveness is often attributed to treating each feature vector as a distinct semantic entity and GAP as a combination of them. Albeit substantiated, such an explanation's algorithmic implications to learn generalizable entities to represent unseen classes, a crucial DML goal, remain unclear. To address this, we formulate GAP as a convex combination of learnable prototypes. We then show that the prototype learning can be expressed as a recursive process fitting a linear predictor to a batch of samples. Building on that perspective, we consider two batches of disjoint classes at each iteration and regularize the learning by expressing the samples of a batch with the prototypes that are fitted to the other batch. We validate our approach on 4 popular DML benchmarks.
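
The following sketch illustrates the pooling view used in the abstract: local features are pooled as a convex combination of learnable prototypes, with weights coming from a softmax over feature-prototype similarities. Sizes are illustrative and the paper's cross-batch regularization is omitted.

```python
# Sketch: pooling a feature map as a convex combination of learnable
# prototypes (softmax weights over feature-prototype similarities).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypePooling(nn.Module):
    def __init__(self, dim=128, n_prototypes=16):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))

    def forward(self, feat_map):               # (B, C, H, W) local features
        b, c, h, w = feat_map.shape
        feats = feat_map.flatten(2).transpose(1, 2)        # (B, HW, C)
        sim = feats @ self.prototypes.t()                  # (B, HW, K)
        weights = F.softmax(sim, dim=-1).mean(dim=1)       # (B, K), sums to 1
        return weights @ self.prototypes                   # convex combination

embedding = PrototypePooling()(torch.rand(4, 128, 7, 7))   # (4, 128) image embedding
```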

Gastrointestinal Disease Classification through Explainable and Cost-Sensitive Deep Neural Networks with Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2307.07603
  • repo_url: https://github.com/dibya404/gastrointestinal-disease-classification-through-explainable-and-cost-sensitive-dnn-with-scl
  • paper_authors: Dibya Nath, G. M. Shahariar
  • for: This paper develops a gastrointestinal disease classification method based on deep convolutional neural networks with explainability and cost-sensitive learning.
  • methods: The method uses cost-sensitive pre-trained deep CNN architectures with supervised contrastive learning to learn disease-related features, and incorporates gradient-based explainability techniques to improve model interpretability.
  • results: Extensive experiments and comparisons show that the method achieves accurate gastrointestinal disease classification while remaining robust and interpretable.
    Abstract Gastrointestinal diseases pose significant healthcare chall-enges as they manifest in diverse ways and can lead to potential complications. Ensuring precise and timely classification of these diseases is pivotal in guiding treatment choices and enhancing patient outcomes. This paper introduces a novel approach on classifying gastrointestinal diseases by leveraging cost-sensitive pre-trained deep convolutional neural network (CNN) architectures with supervised contrastive learning. Our approach enables the network to learn representations that capture vital disease-related features, while also considering the relationships of similarity between samples. To tackle the challenges posed by imbalanced datasets and the cost-sensitive nature of misclassification errors in healthcare, we incorporate cost-sensitive learning. By assigning distinct costs to misclassifications based on the disease class, we prioritize accurate classification of critical conditions. Furthermore, we enhance the interpretability of our model by integrating gradient-based techniques from explainable artificial intelligence (AI). This inclusion provides valuable insights into the decision-making process of the network, aiding in understanding the features that contribute to disease classification. To assess the effectiveness of our proposed approach, we perform extensive experiments on a comprehensive gastrointestinal disease dataset, such as the Hyper-Kvasir dataset. Through thorough comparisons with existing works, we demonstrate the strong classification accuracy, robustness and interpretability of our model. We have made the implementation of our proposed approach publicly available at https://github.com/dibya404/Gastrointestinal-Disease-Classification-through-Explainable-and-Cost-Sensitive-DNN-with-SCL
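
The cost-sensitive ingredient can be made concrete with a class-weighted cross-entropy loss, where misclassifying critical or rare disease classes costs more. The weights below are invented for illustration; the paper derives its costs from the disease classes themselves.

```python
# Sketch: cost-sensitive classification via per-class weights in the loss.
import torch
import torch.nn as nn

n_classes = 4
# higher weight => a mistake on that class costs more (values are illustrative)
class_costs = torch.tensor([1.0, 2.5, 1.5, 4.0])
criterion = nn.CrossEntropyLoss(weight=class_costs)

logits = torch.randn(8, n_classes)             # model outputs for a batch
labels = torch.randint(0, n_classes, (8,))
loss = criterion(logits, labels)               # cost-sensitive classification loss
```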

NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2307.07511
  • repo_url: None
  • paper_authors: Nilesh Kulkarni, Davis Rempe, Kyle Genova, Abhijit Kundu, Justin Johnson, David Fouhey, Leonidas Guibas
  • for: This paper targets the problem of generating realistic 3D human motions interacting with objects in a scene.
  • methods: The proposed Neural Interaction Fields for Trajectory sYnthesis (NIFTY) framework uses a neural interaction field attached to an object to guide an object-conditioned human motion diffusion model toward plausible contacts and affordances.
  • results: Using an automated synthetic data pipeline together with NIFTY, the method synthesizes realistic and plausible motions for sitting on and lifting several objects, with higher motion quality and action completion rates than alternative approaches.
    Abstract We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object, which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model, so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data, we propose an automated synthetic data pipeline. For this, we seed a pre-trained motion model, which has priors for the basics of human movement, with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data, we synthesize realistic motions for sitting and lifting with several objects, outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.
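
As a rough illustration of field-based guidance, the sketch below uses a small network that maps a pose to a distance-to-interaction value and nudges each diffusion sampling step along its negative gradient. The field, pose format, denoiser, and update rule are all placeholders, not NIFTY's implementation.

```python
# Sketch: guiding a denoising step with the gradient of a learned
# "distance to valid interaction" field over human poses.
import torch
import torch.nn as nn

pose_dim = 69                                   # e.g. an SMPL-style body pose vector
interaction_field = nn.Sequential(nn.Linear(pose_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 1))   # predicted distance to manifold

def guided_step(pose, denoiser, guidance_scale=0.1):
    """One denoising step followed by an interaction-field guidance update."""
    pose = denoiser(pose)                       # regular diffusion model update
    pose = pose.detach().requires_grad_(True)
    dist = interaction_field(pose).sum()        # smaller = closer to valid contact
    grad, = torch.autograd.grad(dist, pose)
    return (pose - guidance_scale * grad).detach()

# toy usage with an identity "denoiser"
pose = torch.randn(1, pose_dim)
pose = guided_step(pose, denoiser=lambda p: p)
```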

Brain Tumor Detection using Convolutional Neural Networks with Skip Connections

  • paper_url: http://arxiv.org/abs/2307.07503
  • repo_url: None
  • paper_authors: Aupam Hamran, Marzieh Vaeztourshizi, Amirhossein Esmaili, Massoud Pedram
  • for: Classify brain tumors as benign or malignant using CNNs.
  • methods: Using MRI images, different CNN architectures are evaluated, with optimization techniques such as widening and deepening the network and adding skip connections applied to improve accuracy.
  • results: Results show that a subset of these techniques can be judiciously applied to outperform a baseline CNN model.
    Abstract In this paper, we present different architectures of Convolutional Neural Networks (CNN) to analyze and classify the brain tumors into benign and malignant types using the Magnetic Resonance Imaging (MRI) technique. Different CNN architecture optimization techniques such as widening and deepening of the network and adding skip connections are applied to improve the accuracy of the network. Results show that a subset of these techniques can judiciously be used to outperform a baseline CNN model used for the same purpose.
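
The skip-connection technique mentioned in the abstract can be illustrated with a residual block that adds its input back to the convolutional output. Channel counts, depth, and input size below are illustrative, not the paper's exact architectures.

```python
# Sketch: a residual (skip-connection) block inside a small tumor classifier.
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))       # skip connection around the convs

classifier = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),  # single-channel MRI input
    SkipBlock(32), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2))                           # benign vs. malignant logits
out = classifier(torch.rand(2, 1, 128, 128))
```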

TALL: Thumbnail Layout for Deepfake Video Detection

  • paper_url: http://arxiv.org/abs/2307.07494
  • repo_url: None
  • paper_authors: Yuting Xu, Jian Liang, Gengyun Jia, Ziming Yang, Yanhao Zhang, Ran He
  • for: This paper targets deepfake video detection.
  • methods: The paper proposes a simple yet effective strategy, Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to preserve spatial and temporal dependencies. TALL is model-agnostic and needs only a few lines of code: consecutive frames are masked at fixed positions, resized into sub-images, and rearranged into the pre-defined thumbnail layout.
  • results: TALL and the state-of-the-art TALL-Swin perform strongly on both intra-dataset and cross-dataset benchmarks, with TALL-Swin achieving 90.79% AUC on the challenging cross-dataset task FaceForensics++ → Celeb-DF.
    Abstract The growing threats of deepfakes to society and cybersecurity have raised enormous public concerns, and increasing efforts have been devoted to this critical topic of deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. Specifically, consecutive frames are masked in a fixed position in each frame to improve generalization, then resized to sub-images and rearranged into a pre-defined layout as the thumbnail. TALL is model-agnostic and extremely simple by only modifying a few lines of code. Inspired by the success of vision transformers, we incorporate TALL into Swin Transformer, forming an efficient and effective method TALL-Swin. Extensive experiments on intra-dataset and cross-dataset validate the validity and superiority of TALL and SOTA TALL-Swin. TALL-Swin achieves 90.79$\%$ AUC on the challenging cross-dataset task, FaceForensics++ $\to$ Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake.
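
The thumbnail layout itself is straightforward to sketch: resize several consecutive frames to sub-images and tile them into one 2D image so a standard image backbone sees spatial and temporal context at once. Grid size and resolution are illustrative, and the per-frame masking step from the paper is omitted.

```python
# Sketch: rearrange T consecutive frames into a single grid "thumbnail".
import torch
import torch.nn.functional as F

def thumbnail_layout(clip, grid=(2, 2), sub_size=112):
    """clip: (T, C, H, W) with T == grid[0] * grid[1] -> (C, gh*sub, gw*sub)."""
    gh, gw = grid
    sub = F.interpolate(clip, size=(sub_size, sub_size), mode="bilinear",
                        align_corners=False)            # (T, C, s, s)
    rows = [torch.cat(list(sub[r * gw:(r + 1) * gw]), dim=-1) for r in range(gh)]
    return torch.cat(rows, dim=-2)                       # tile rows vertically

thumb = thumbnail_layout(torch.rand(4, 3, 224, 224))     # (3, 224, 224) thumbnail
```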

PseudoCal: A Source-Free Approach to Unsupervised Uncertainty Calibration in Domain Adaptation

  • paper_url: http://arxiv.org/abs/2307.07489
  • repo_url: None
  • paper_authors: Dapeng Hu, Jian Liang, Xinchao Wang, Chuan-Sheng Foo
  • for: Calibrate the predictive uncertainty of UDA models in the unlabeled target domain.
  • methods: The PseudoCal method performs source-free calibration using only unlabeled target data.
  • results: Compared with existing calibration methods, PseudoCal exhibits significantly lower calibration error.
    Abstract Unsupervised domain adaptation (UDA) has witnessed remarkable advancements in improving the accuracy of models for unlabeled target domains. However, the calibration of predictive uncertainty in the target domain, a crucial aspect of the safe deployment of UDA models, has received limited attention. The conventional in-domain calibration method, \textit{temperature scaling} (TempScal), encounters challenges due to domain distribution shifts and the absence of labeled target domain data. Recent approaches have employed importance-weighting techniques to estimate the target-optimal temperature based on re-weighted labeled source data. Nonetheless, these methods require source data and suffer from unreliable density estimates under severe domain shifts, rendering them unsuitable for source-free UDA settings. To overcome these limitations, we propose PseudoCal, a source-free calibration method that exclusively relies on unlabeled target data. Unlike previous approaches that treat UDA calibration as a \textit{covariate shift} problem, we consider it as an unsupervised calibration problem specific to the target domain. Motivated by the factorization of the negative log-likelihood (NLL) objective in TempScal, we generate a labeled pseudo-target set that captures the structure of the real target. By doing so, we transform the unsupervised calibration problem into a supervised one, enabling us to effectively address it using widely-used in-domain methods like TempScal. Finally, we thoroughly evaluate the calibration performance of PseudoCal by conducting extensive experiments on 10 UDA methods, considering both traditional UDA settings and recent source-free UDA scenarios. The experimental results consistently demonstrate the superior performance of PseudoCal, exhibiting significantly reduced calibration error compared to existing calibration methods.
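
For reference, the in-domain method that PseudoCal re-enables, temperature scaling, fits a single temperature by minimizing the NLL on a labeled (here, pseudo-labeled) set of logits. The pseudo-target construction is the paper's contribution and is not shown; the pseudo-labels below are random stand-ins.

```python
# Sketch: temperature scaling fitted on (pseudo-)labeled logits.
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)           # optimize log T for positivity
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)   # NLL at temperature T
        loss.backward()
        opt.step()
    return log_t.exp().item()

logits = torch.randn(256, 10) * 3.0                      # over-confident toy logits
labels = torch.randint(0, 10, (256,))                    # stand-in for pseudo-labels
T = fit_temperature(logits, labels)
calibrated_probs = F.softmax(logits / T, dim=1)
```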

DreamTeacher: Pretraining Image Backbones with Deep Generative Models

  • paper_url: http://arxiv.org/abs/2307.07487
  • repo_url: None
  • paper_authors: Daiqing Li, Huan Ling, Amlan Kar, David Acuna, Seung Wook Kim, Karsten Kreis, Antonio Torralba, Sanja Fidler
  • for: This work proposes DreamTeacher, a self-supervised feature representation learning framework that uses generative networks to pre-train downstream image backbones.
  • methods: Two kinds of knowledge distillation are proposed: 1) distilling features learned by a generative model onto target image backbones, as an alternative to pre-training on large labeled datasets such as ImageNet; 2) distilling labels obtained from generative networks with task heads onto the logits of target backbones.
  • results: Experiments across multiple generative models, dense prediction benchmarks, and pre-training regimes show that DreamTeacher significantly outperforms existing self-supervised representation learning methods; unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion models in particular, as a promising approach to representation learning on large, diverse datasets without manual annotation.
    Abstract In this work, we introduce a self-supervised feature representation learning framework DreamTeacher that utilizes generative networks for pre-training downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones as an alternative to pretraining these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto logits of target backbones. We perform extensive analyses on multiple generative models, dense prediction benchmarks, and several pre-training regimes. We empirically find that our DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion generative models specifically, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation.
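
The feature-distillation variant can be sketched as regressing a frozen teacher feature map (standing in for features extracted from a generative model) with the student backbone through a small projection head. Teacher and student modules below are toy placeholders, not DreamTeacher's networks.

```python
# Sketch: label-free feature distillation from a frozen teacher feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1))
project = nn.Conv2d(64, 256, 1)                 # match the teacher's channel width

images = torch.rand(2, 3, 32, 32)
with torch.no_grad():
    teacher_feats = torch.rand(2, 256, 32, 32)  # pretend: generative-model features

student_feats = project(student(images))
distill_loss = F.mse_loss(student_feats, teacher_feats)
distill_loss.backward()                         # trains the student without labels
```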

Multimodal Distillation for Egocentric Action Recognition

  • paper_url: http://arxiv.org/abs/2307.07483
  • repo_url: https://github.com/gorjanradevski/multimodal-distillation
  • paper_authors: Gorjan Radevski, Dusan Grujicic, Marie-Francine Moens, Matthew Blaschko, Tinne Tuytelaars
  • for: The goal is to model hand-object interactions to improve egocentric video understanding.
  • methods: Standard models such as CNNs and Vision Transformers improve when fed additional input modalities, but the modality-specific modules add complexity that makes them impractical to deploy; this work aims to retain the performance of the multimodal approach while using only RGB frames as input at inference time.
  • results: Students taught by multimodal teachers are more accurate and better calibrated; a principled multimodal knowledge distillation framework addresses issues that arise when applying multimodal distillation naively; and the approach reduces computational complexity while maintaining higher performance as the number of input views is reduced.
    Abstract The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views. We release our code at https://github.com/gorjanradevski/multimodal-distillation.
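
A standard knowledge-distillation objective illustrates the core mechanism: the RGB-only student matches the softened class distribution of a multimodal teacher on top of the usual cross-entropy on labels. The teacher logits and class count below are placeholders; the paper's framework adds more structure than this plain loss.

```python
# Sketch: standard KD loss from a (frozen) multimodal teacher to an RGB student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 97, requires_grad=True)   # e.g. 97 action classes
teacher_logits = torch.randn(8, 97)                        # from an RGB+flow+audio teacher
labels = torch.randint(0, 97, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```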

Dual-Query Multiple Instance Learning for Dynamic Meta-Embedding based Tumor Classification

  • paper_url: http://arxiv.org/abs/2307.07482
  • repo_url: None
  • paper_authors: Simon Holdenried-Krafft, Peter Somers, Ivonne A. Montes-Majarro, Diana Silimon, Cristina Tarín, Falko Fend, Hendrik P. A. Lensch
  • for: This paper addresses whole slide image (WSI) assessment for tumor diagnosis and treatment planning, where high magnifications are needed for sub-cellular analysis.
  • methods: The paper proposes an embedding-based multiple instance learning (MIL) pipeline covering both the embedding and aggregation steps: for the embedding, it explores dynamic meta-embeddings built on recent self-supervised pre-trained models to improve generalizability; for the aggregation, it proposes a new MIL architecture that combines MIL-attention with correlated self-attention.
  • results: Experiments on three histopathological datasets show improvements of up to 10% over state-of-the-art approaches.
    Abstract Whole slide image (WSI) assessment is a challenging and crucial step in cancer diagnosis and treatment planning. WSIs require high magnifications to facilitate sub-cellular analysis. Precise annotations for patch- or even pixel-level classifications in the context of gigapixel WSIs are tedious to acquire and require domain experts. Coarse-grained labels, on the other hand, are easily accessible, which makes WSI classification an ideal use case for multiple instance learning (MIL). In our work, we propose a novel embedding-based Dual-Query MIL pipeline (DQ-MIL). We contribute to both the embedding and aggregation steps. Since all-purpose visual feature representations are not yet available, embedding models are currently limited in terms of generalizability. With our work, we explore the potential of dynamic meta-embedding based on cutting-edge self-supervised pre-trained models in the context of MIL. Moreover, we propose a new MIL architecture capable of combining MIL-attention with correlated self-attention. The Dual-Query Perceiver design of our approach allows us to leverage the concept of self-distillation and to combine the advantages of a small model in the context of a low data regime with the rich feature representation of a larger model. We demonstrate the superior performance of our approach on three histopathological datasets, where we show improvement of up to 10% over state-of-the-art approaches.
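
The MIL aggregation step can be illustrated with plain attention-based pooling: patch embeddings of one slide form a bag, attention scores weight them into a single bag embedding, and a classifier predicts the slide label. The dual-query and meta-embedding components of DQ-MIL are not reproduced here, and all sizes are illustrative.

```python
# Sketch: attention-based MIL pooling of patch embeddings into a slide label.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, bag):                      # bag: (N_patches, dim)
        a = torch.softmax(self.attn(bag), dim=0) # (N_patches, 1) attention weights
        slide_embedding = (a * bag).sum(dim=0)   # weighted average of patches
        return self.classifier(slide_embedding), a

patch_embeddings = torch.rand(500, 256)          # e.g. from a self-supervised encoder
slide_logits, attention = AttentionMIL()(patch_embeddings)
```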

Atlas-Based Interpretable Age Prediction

  • paper_url: http://arxiv.org/abs/2307.07439
  • repo_url: None
  • paper_authors: Sophie Starck, Yadunandan Vivekanand Kini, Jessica Johanna Maria Ritter, Rickmer Braren, Daniel Rueckert, Tamara Mueller
  • for: This work aims to improve age prediction for medical assessment and research, helping detect diseases and abnormal ageing by highlighting the gap between chronological and biological age.
  • methods: The study uses whole-body images and the Grad-CAM interpretability method to determine which body areas are most predictive of a person's age, with registration techniques producing population-wide interpretability maps.
  • results: Three key regions, the spine, the autochthonous back muscles, and the cardiac region, are identified as most important for age prediction.
    Abstract Age prediction is an important part of medical assessments and research. It can aid in detecting diseases as well as abnormal ageing by highlighting the discrepancy between chronological and biological age. To gain a comprehensive understanding of age-related changes observed in various body parts, we investigate them on a larger scale by using whole-body images. We utilise the Grad-CAM interpretability method to determine the body areas most predictive of a person's age. We expand our analysis beyond individual subjects by employing registration techniques to generate population-wide interpretability maps. Furthermore, we set state-of-the-art whole-body age prediction with a model that achieves a mean absolute error of 2.76 years. Our findings reveal three primary areas of interest: the spine, the autochthonous back muscles, and the cardiac region, which exhibits the highest importance.
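
For readers unfamiliar with Grad-CAM, the sketch below adapts it to a regression output: gradients of the predicted age with respect to a convolutional feature map weight its channels, yielding a coarse heat map of the regions driving the prediction. The backbone is a toy stand-in for the paper's whole-body age model.

```python
# Sketch: Grad-CAM for a scalar (age) prediction from a small CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

image = torch.rand(1, 1, 128, 64)                         # whole-body scan slice (toy)
feats = backbone(image)
feats.retain_grad()                                       # keep gradients of the maps
age = head(feats)
age.sum().backward()

weights = feats.grad.mean(dim=(2, 3), keepdim=True)       # channel importance
cam = F.relu((weights * feats).sum(dim=1, keepdim=True))  # (1, 1, H, W) heat map
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
```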