cs.CV - 2023-11-23

Robust and Interpretable COVID-19 Diagnosis on Chest X-ray Images using Adversarial Training

  • paper_url: http://arxiv.org/abs/2311.14227
  • repo_url: None
  • paper_authors: Karina Yang, Alexis Bennett, Dominique Duncan
  • for: This study aims to make artificial intelligence (AI) diagnosis of COVID-19 from chest X-ray (CXR) images fast, accurate, and trustworthy.
  • methods: The study trains and evaluates 21 convolutional neural network (CNN) models on a diverse set of 33,000+ CXR images to classify healthy, COVID-19, and non-COVID-19 pneumonia CXRs, and investigates adversarial training for robustness and explainability.
  • results: Adversarial training improves model robustness and interpretability; the models reach a 3-way classification accuracy, recall, and precision of up to 97.03%, 97.97%, and 99.95%, respectively. Compared with expert radiologist findings, the saliency maps of adversarially trained models better highlight clinically relevant features and are more robust against extraneous artifacts.
    Abstract The novel 2019 Coronavirus disease (COVID-19) global pandemic is a defining health crisis. Recent efforts have been increasingly directed towards achieving quick and accurate detection of COVID-19 across symptomatic patients to mitigate the intensity and spread of the disease. Artificial intelligence (AI) algorithms applied to chest X-ray (CXR) images have emerged as promising diagnostic tools, and previous work has demonstrated impressive classification performances. However, such methods have faced criticisms from physicians due to their black-box reasoning process and unpredictable nature. In contrast to professional radiologist diagnosis, AI systems often lack generalizability, explainability, and robustness in the clinical decision making process. In our work, we address these issues by first proposing an extensive baseline study, training and evaluating 21 convolutional neural network (CNN) models on a diverse set of 33,000+ CXR images to classify between healthy, COVID-19, and non-COVID-19 pneumonia CXRs. Our resulting models achieved a 3-way classification accuracy, recall, and precision of up to 97.03\%, 97.97\%, and 99.95\%, respectively. Next, we investigate the effectiveness of adversarial training on model robustness and explainability via Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps. We find that adversarially trained models not only significantly outperform their standard counterparts on classifying perturbed images, but also yield saliency maps that 1) better specify clinically relevant features, 2) are robust against extraneous artifacts, and 3) agree considerably more with expert radiologist findings.
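The abstract does not spell out the adversarial training recipe; as a rough illustration, here is a minimal PGD-style adversarial training step in PyTorch, assuming an L-infinity perturbation budget and a generic 3-class CXR classifier (the attack type, budget, and step counts are assumptions, not the authors' settings).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=2/255, alpha=0.5/255, steps=7):
    # Craft an L-infinity-bounded adversarial example (budget and step count are illustrative).
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # One optimizer step on adversarially perturbed CXR batches
    # (3 classes: healthy / COVID-19 / non-COVID-19 pneumonia).
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```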

A New Benchmark and Model for Challenging Image Manipulation Detection

  • paper_url: http://arxiv.org/abs/2311.14218
  • repo_url: None
  • paper_authors: Zhenfei Zhang, Mingyang Li, Ming-Ching Chang
  • for: This work aims to improve the accuracy and effectiveness of image manipulation detection (IMD) in digital forensics, particularly for detecting small tampered regions within large images and double compression with identical quality factors.
  • methods: The study introduces the Challenging Image Manipulation Detection (CIMD) benchmark and proposes a new two-branch network model based on HRNet that detects both image-editing and compression artifacts.
  • results: Compared with existing IMD methods, the proposed model shows significant improvements under these challenging conditions and achieves the best detection performance on the new CIMD dataset.
    Abstract The ability to detect manipulation in multimedia data is vital in digital forensics. Existing Image Manipulation Detection (IMD) methods are mainly based on detecting anomalous features arisen from image editing or double compression artifacts. All existing IMD techniques encounter challenges when it comes to detecting small tampered regions from a large image. Moreover, compression-based IMD approaches face difficulties in cases of double compression of identical quality factors. To investigate the State-of-The-Art (SoTA) IMD methods in those challenging conditions, we introduce a new Challenging Image Manipulation Detection (CIMD) benchmark dataset, which consists of two subsets, for evaluating editing-based and compression-based IMD methods, respectively. The dataset images were manually taken and tampered with high-quality annotations. In addition, we propose a new two-branch network model based on HRNet that can better detect both the image-editing and compression artifacts in those challenging conditions. Extensive experiments on the CIMD benchmark show that our model significantly outperforms SoTA IMD methods on CIMD.

ECRF: Entropy-Constrained Neural Radiance Fields Compression with Frequency Domain Optimization

  • paper_url: http://arxiv.org/abs/2311.14208
  • repo_url: None
  • paper_authors: Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, Cornelius Hellge
  • for: This work aims to reduce the data size of explicit feature-grid based NeRF models while preserving their rendering quality and training speed.
  • methods: A compression model applies the discrete cosine transform (DCT) to the tensorial radiance fields, turning the feature grid into coefficients that are quantized and entropy encoded, similar to a traditional video coding pipeline; an entropy parameterization technique for the DCT coefficients further increases sparsity in the frequency domain.
  • results: The model achieves superior compression performance across various datasets; the source code will be made publicly available.
    Abstract Explicit feature-grid based NeRF models have shown promising results in terms of rendering quality and significant speed-up in training. However, these methods often require a significant amount of data to represent a single scene or object. In this work, we present a compression model that aims to minimize the entropy in the frequency domain in order to effectively reduce the data size. First, we propose using the discrete cosine transform (DCT) on the tensorial radiance fields to compress the feature-grid. This feature-grid is transformed into coefficients, which are then quantized and entropy encoded, following a similar approach to the traditional video coding pipeline. Furthermore, to achieve a higher level of sparsity, we propose using an entropy parameterization technique for the frequency domain, specifically for DCT coefficients of the feature-grid. Since the transformed coefficients are optimized during the training phase, the proposed model does not require any fine-tuning or additional information. Our model only requires a lightweight compression pipeline for encoding and decoding, making it easier to apply volumetric radiance field methods for real-world applications. Experimental results demonstrate that our proposed frequency domain entropy model can achieve superior compression performance across various datasets. The source code will be made publicly available.
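As a rough sketch of the pipeline described above (DCT on the feature grid, then quantization of the coefficients that would subsequently be entropy coded), the following NumPy/SciPy snippet operates on a toy grid; the grid shape and quantization step are placeholders, and the paper's learned frequency-domain entropy parameterization is not reproduced.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Toy feature grid standing in for one plane/channel of a tensorial radiance field.
grid = np.random.randn(64, 64, 16).astype(np.float32)

# Frequency-domain transform, uniform quantization, and reconstruction.
coeffs = dctn(grid, norm="ortho")
step = 0.05                                   # illustrative quantization step
q = np.round(coeffs / step).astype(np.int32)  # these integers would be entropy coded
recon = idctn(q * step, norm="ortho")

sparsity = (q == 0).mean()
mse = np.mean((grid - recon) ** 2)
print(f"zero coefficients: {sparsity:.1%}, reconstruction MSE: {mse:.5f}")
```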

Enhancing mTBI Diagnosis with Residual Triplet Convolutional Neural Network Using 3D CT

  • paper_url: http://arxiv.org/abs/2311.14197
  • repo_url: None
  • paper_authors: Hanem Ellethy, Shekhar S. Chandra, Viktor Vegh
  • for: The paper aims to improve the accuracy and efficiency of mild traumatic brain injury (mTBI) diagnosis using 3D Computed Tomography (CT) images and a residual triplet convolutional neural network (RTCNN) model.
  • methods: The RTCNN model utilizes a triplet loss function to optimize feature representations and distinguish between mTBI cases and healthy ones. The model shows promising performance in mTBI diagnosis, with an average accuracy of 94.3%, sensitivity of 94.1%, and specificity of 95.2%.
  • results: Compared to the conventional Residual Convolutional Neural Network (RCNN) model, the RTCNN model exhibits a significant improvement in specificity (22.5%), accuracy (16.2%), and sensitivity (11.3%). Additionally, the RTCNN model requires lower memory resources, making it both highly effective and resource-efficient.
    Abstract Mild Traumatic Brain Injury (mTBI) is a common and challenging condition to diagnose accurately. Timely and precise diagnosis is essential for effective treatment and improved patient outcomes. Traditional diagnostic methods for mTBI often have limitations in terms of accuracy and sensitivity. In this study, we introduce an innovative approach to enhance mTBI diagnosis using 3D Computed Tomography (CT) images and a metric learning technique trained with triplet loss. To address these challenges, we propose a Residual Triplet Convolutional Neural Network (RTCNN) model to distinguish between mTBI cases and healthy ones by embedding 3D CT scans into a feature space. The triplet loss function maximizes the margin between similar and dissimilar image pairs, optimizing feature representations. This facilitates better context placement of individual cases, aids informed decision-making, and has the potential to improve patient outcomes. Our RTCNN model shows promising performance in mTBI diagnosis, achieving an average accuracy of 94.3%, a sensitivity of 94.1%, and a specificity of 95.2%, as confirmed through a five-fold cross-validation. Importantly, when compared to the conventional Residual Convolutional Neural Network (RCNN) model, the RTCNN exhibits a significant improvement, showcasing a remarkable 22.5% increase in specificity, a notable 16.2% boost in accuracy, and an 11.3% enhancement in sensitivity. Moreover, RTCNN requires lower memory resources, making it not only highly effective but also resource-efficient in minimizing false positives while maximizing its diagnostic accuracy in distinguishing normal CT scans from mTBI cases. The quantitative performance metrics provided and utilization of occlusion sensitivity maps to visually explain the model's decision-making process further enhance the interpretability and transparency of our approach.
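The triplet objective at the core of the RTCNN is standard metric learning; a minimal PyTorch sketch is shown below, with a toy vector embedding network standing in for the paper's 3D-CT feature extractor (architecture and margin are illustrative).

```python
import torch
import torch.nn as nn

# Toy embedding network; the actual RTCNN embeds whole 3D CT volumes.
embed = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 64))
triplet_loss = nn.TripletMarginLoss(margin=1.0)

anchor   = embed(torch.randn(8, 512))   # mTBI case
positive = embed(torch.randn(8, 512))   # another mTBI case
negative = embed(torch.randn(8, 512))   # healthy control

loss = triplet_loss(anchor, positive, negative)  # pulls same-class embeddings together
loss.backward()
print(loss.item())
```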

HACD: Hand-Aware Conditional Diffusion for Monocular Hand-Held Object Reconstruction

  • paper_url: http://arxiv.org/abs/2311.14189
  • repo_url: None
  • paper_authors: Bowen Fu, Yan Di, Chenyangguang Zhang, Gu Wang, Ziqin Huang, Zhiying Leng, Fabian Manhardt, Xiangyang Ji, Federico Tombari
  • for: This work addresses the reconstruction of hand-held objects from a single RGB image without known 3D object templates, category priors, or depth information, a vital yet challenging problem in computer vision.
  • methods: A probabilistic point cloud denoising diffusion model is employed, with hand-object interaction modeled in two ways: hand-aware conditioning captures the interaction from both semantic and geometric perspectives, and a hand-constrained centroid fixing scheme uses hand vertex priors to restrict the centroid deviation of the partially denoised point cloud during the diffusion and reverse processes.
  • results: Experiments on the synthetic ObMan dataset and two real-world datasets, HO3D and MOW, show that the approach surpasses all existing methods by a large margin.
    Abstract Reconstructing hand-held objects from a single RGB image without known 3D object templates, category prior, or depth information is a vital yet challenging problem in computer vision. In contrast to prior works that utilize deterministic modeling paradigms, which make it hard to account for the uncertainties introduced by hand- and self-occlusion, we employ a probabilistic point cloud denoising diffusion model to tackle the above challenge. In this work, we present Hand-Aware Conditional Diffusion for monocular hand-held object reconstruction (HACD), modeling the hand-object interaction in two aspects. First, we introduce hand-aware conditioning to model hand-object interaction from both semantic and geometric perspectives. Specifically, a unified hand-object semantic embedding compensates for the 2D local feature deficiency induced by hand occlusion, and a hand articulation embedding further encodes the relationship between object vertices and hand joints. Second, we propose a hand-constrained centroid fixing scheme, which utilizes hand vertices priors to restrict the centroid deviation of partially denoised point cloud during diffusion and reverse process. Removing the centroid bias interference allows the diffusion models to focus on the reconstruction of shape, thus enhancing the stability and precision of local feature projection. Experiments on the synthetic ObMan dataset and two real-world datasets, HO3D and MOW, demonstrate our approach surpasses all existing methods by a large margin.
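The abstract does not give the exact centroid-fixing rule; the sketch below only illustrates the general idea, assuming the partially denoised object point cloud is re-anchored at each diffusion/reverse step to a target centroid derived from the predicted hand vertices (the function and the offset term are hypothetical).

```python
import torch

def fix_centroid(partial_points, hand_vertices, offset=None):
    # partial_points: (N, 3) partially denoised object point cloud at some diffusion step
    # hand_vertices:  (M, 3) predicted hand mesh vertices used as a prior
    # offset: assumed displacement from the hand centroid to the object centroid
    target = hand_vertices.mean(dim=0)
    if offset is not None:
        target = target + offset
    return partial_points - partial_points.mean(dim=0) + target

points = torch.randn(1024, 3) + torch.tensor([0.3, 0.0, 0.1])  # cloud with drifted centroid
hand   = torch.randn(778, 3) * 0.05                            # MANO-sized hand mesh
print(fix_centroid(points, hand).mean(dim=0))                  # centroid now near the hand prior
```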

TCuPGAN: A novel framework developed for optimizing human-machine interactions in citizen science

  • paper_url: http://arxiv.org/abs/2311.14177
  • repo_url: None
  • paper_authors: Ramanakumar Sankar, Kameswara Mantha, Lucy Fortson, Helen Spiers, Thomas Pengo, Douglas Mashek, Myat Mo, Mark Sanders, Trace Christensen, Jeffrey Salisbury, Laura Trouille
  • for: To reduce the human effort required to label and categorize large datasets, this work presents a new general-purpose 3D segmentation model built on machine learning techniques for citizen science workflows.
  • methods: The model uses patch-wise adversariality and Long Short-Term Memory (LSTM) layers to encode sequential information.
  • results: In an iterative human-machine optimization loop with citizen science projects on the Zooniverse platform, volunteers only see a fraction of the 2D slices from the image cubes; the patch-wise discriminator estimates which slices have poorly generalized machine proposals so that only those are sent for correction, reducing volunteer effort.
    Abstract In the era of big data in scientific research, there is a necessity to leverage techniques which reduce human effort in labeling and categorizing large datasets by involving sophisticated machine tools. To combat this problem, we present a novel, general purpose model for 3D segmentation that leverages patch-wise adversariality and Long Short-Term Memory to encode sequential information. Using this model alongside citizen science projects which use 3D datasets (image cubes) on the Zooniverse platforms, we propose an iterative human-machine optimization framework where only a fraction of the 2D slices from these cubes are seen by the volunteers. We leverage the patch-wise discriminator in our model to provide an estimate of which slices within these image cubes have poorly generalized feature representations, and correspondingly poor machine performance. These images with corresponding machine proposals would be presented to volunteers on Zooniverse for correction, leading to a drastic reduction in the volunteer effort on citizen science projects. We trained our model on ~2300 liver tissue 3D electron micrographs. Lipid droplets were segmented within these images through human annotation via the `Etch A Cell - Fat Checker' citizen science project, hosted on the Zooniverse platform. In this work, we demonstrate this framework and the selection methodology which resulted in a measured reduction in volunteer effort by more than 60%. We envision this type of joint human-machine partnership will be of great use on future Zooniverse projects.
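The selection step described above, using patch-wise discriminator scores to decide which 2D slices volunteers should review, could look roughly like the following PyTorch sketch; the toy discriminator, tensor shapes, and ranking rule are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy patch-wise discriminator: maps a (1, 1, H, W) slice to a map of per-patch scores.
disc = nn.Sequential(nn.Conv2d(1, 8, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                     nn.Conv2d(8, 1, 4, stride=2, padding=1))

def slices_for_review(cube, k=5):
    # cube: (Z, 1, H, W) machine-segmented slices of one image cube.
    # Low mean patch score = poorly generalized slice -> show it to volunteers.
    with torch.no_grad():
        scores = [disc(cube[z:z + 1]).mean().item() for z in range(cube.shape[0])]
    return sorted(range(len(scores)), key=scores.__getitem__)[:k]

cube = torch.randn(40, 1, 64, 64)
print(slices_for_review(cube))
```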

GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence

  • paper_url: http://arxiv.org/abs/2311.14155
  • repo_url: https://github.com/nv-nguyen/gigapose
  • paper_authors: Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, Vincent Lepetit
  • for: This paper presents GigaPose, a fast, robust, and accurate method for CAD-based novel object pose estimation in RGB images.
  • methods: The method first uses discriminative templates rendered from the CAD model to recover the object's out-of-plane rotation, then uses patch correspondences to estimate the four remaining pose parameters.
  • results: The method estimates object pose quickly and accurately, running 38x faster than the state of the art, and is significantly more robust to segmentation errors.
    Abstract We present GigaPose, a fast, robust, and accurate method for CAD-based novel object pose estimation in RGB images. GigaPose first leverages discriminative templates, rendered images of the CAD models, to recover the out-of-plane rotation and then uses patch correspondences to estimate the four remaining parameters. Our approach samples templates in only a two-degrees-of-freedom space instead of the usual three and matches the input image to the templates using fast nearest neighbor search in feature space, results in a speedup factor of 38x compared to the state of the art. Moreover, GigaPose is significantly more robust to segmentation errors. Our extensive evaluation on the seven core datasets of the BOP challenge demonstrates that it achieves state-of-the-art accuracy and can be seamlessly integrated with a refinement method. Additionally, we show the potential of GigaPose with 3D models predicted by recent work on 3D reconstruction from a single image, relaxing the need for CAD models and making 6D pose object estimation much more convenient. Our source code and trained models are publicly available at https://github.com/nv-nguyen/gigaPose
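The template-matching step amounts to a nearest-neighbor search in feature space; a minimal PyTorch sketch is given below, with an illustrative number of templates and feature dimension (the subsequent patch-correspondence stage for the remaining four parameters is not shown).

```python
import torch
import torch.nn.functional as F

# Template features rendered from the CAD model: one descriptor per sampled
# out-of-plane rotation (the 2-DoF template space mentioned in the abstract).
templates = F.normalize(torch.randn(162, 256), dim=1)   # 162 viewpoints, 256-d features
query     = F.normalize(torch.randn(1, 256), dim=1)     # feature of the detected object crop

# Fast nearest-neighbor search in feature space = one matrix multiply + argmax.
similarity = query @ templates.T
best_template = similarity.argmax(dim=1).item()
print(f"closest template: {best_template}, cosine similarity: {similarity.max():.3f}")
```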

Automated 3D Tumor Segmentation using Temporal Cubic PatchGAN (TCuP-GAN)

  • paper_url: http://arxiv.org/abs/2311.14148
  • repo_url: None
  • paper_authors: Kameswara Bharadwaj Mantha, Ramanakumar Sankar, Lucy Fortson
  • for: This work develops a robust, general-purpose 3D segmentation framework using recent deep learning techniques for segmentation problems across biomedical domains.
  • methods: The study introduces Temporal Cubic PatchGAN (TCuP-GAN), a volume-to-volume translational model that combines a generative feature learning framework with convolutional Long Short-Term Memory networks (LSTMs) for 3D segmentation.
  • results: TCuP-GAN is evaluated on data from four challenges of the 2023 Brain Tumor Segmentation (BraTS) Challenge (Adult Glioma, Meningioma, Pediatric Tumors, and the Sub-Saharan Africa subset) using LesionWise Dice similarity and 95% Hausdorff Distance; the framework learns to predict robust multi-class segmentation masks across all challenges.
    Abstract Development of robust general purpose 3D segmentation frameworks using the latest deep learning techniques is one of the active topics in various bio-medical domains. In this work, we introduce Temporal Cubic PatchGAN (TCuP-GAN), a volume-to-volume translational model that marries the concepts of a generative feature learning framework with Convolutional Long Short-Term Memory Networks (LSTMs), for the task of 3D segmentation. We demonstrate the capabilities of our TCuP-GAN on the data from four segmentation challenges (Adult Glioma, Meningioma, Pediatric Tumors, and Sub-Saharan Africa subset) featured within the 2023 Brain Tumor Segmentation (BraTS) Challenge and quantify its performance using LesionWise Dice similarity and $95\%$ Hausdorff Distance metrics. We demonstrate the successful learning of our framework to predict robust multi-class segmentation masks across all the challenges. This benchmarking work serves as a stepping stone for future efforts towards applying TCuP-GAN on other multi-class tasks such as multi-organelle segmentation in electron microscopy imaging.
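For reference, the plain per-class Dice similarity underlying the reported LesionWise metric can be computed as follows (the BraTS LesionWise variant additionally matches individual lesions, which this sketch does not do):

```python
import numpy as np

def dice(pred, target, label):
    """Dice similarity for one class label of a 3D segmentation mask."""
    p, t = (pred == label), (target == label)
    denom = p.sum() + t.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(p, t).sum() / denom

pred   = np.random.randint(0, 4, size=(8, 8, 8))   # toy multi-class prediction
target = np.random.randint(0, 4, size=(8, 8, 8))
print([round(dice(pred, target, c), 3) for c in range(1, 4)])
```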

Class Balanced Dynamic Acquisition for Domain Adaptive Semantic Segmentation using Active Learning

  • paper_url: http://arxiv.org/abs/2311.14146
  • repo_url: None
  • paper_authors: Marc Schachtsiek, Simone Rossi, Thomas Hannagan
  • for: This paper proposes an active learning method that addresses the class imbalance problem in acquired training labels, improving the label efficiency of domain adaptive semantic segmentation.
  • methods: The method combines uncertainty and diversity criteria with a pixel-wise acquisition strategy, and introduces Class Balanced Dynamic Acquisition (CBDA) to keep the selected labels balanced across classes.
  • results: CBDA improves performance especially for larger active learning budgets, outperforming the previous baseline by 0.6, 1.7, and 2.4 mIoU for budgets of 5%, 10%, and 20%, respectively, and improving the minimum class performance by 0.5, 2.9, and 4.6 IoU.
    Abstract Domain adaptive active learning is leading the charge in label-efficient training of neural networks. For semantic segmentation, state-of-the-art models jointly use two criteria of uncertainty and diversity to select training labels, combined with a pixel-wise acquisition strategy. However, we show that such methods currently suffer from a class imbalance issue which degrades their performance for larger active learning budgets. We then introduce Class Balanced Dynamic Acquisition (CBDA), a novel active learning method that mitigates this issue, especially in high-budget regimes. The more balanced labels increase minority class performance, which in turn allows the model to outperform the previous baseline by 0.6, 1.7, and 2.4 mIoU for budgets of 5%, 10%, and 20%, respectively. Additionally, the focus on minority classes leads to improvements of the minimum class performance of 0.5, 2.9, and 4.6 IoU respectively. The top-performing model even exceeds the fully supervised baseline, showing that a more balanced label than the entire ground truth can be beneficial.
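The abstract does not give the exact CBDA acquisition rule; the snippet below is only a hypothetical illustration of class-balanced scoring, reweighting per-pixel uncertainty (entropy) by how rarely each class has been acquired so far.

```python
import torch

def class_balanced_scores(probs, acquired_per_class):
    """probs: (C, H, W) softmax output for one image.
    acquired_per_class: (C,) pixel counts already labeled per class.
    Returns a (H, W) acquisition score: entropy upweighted for rare classes.
    This is a hypothetical illustration, not the paper's exact CBDA rule."""
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=0)
    rarity = 1.0 / (acquired_per_class.float() + 1.0)          # rare classes get larger weights
    class_weight = (probs * rarity.view(-1, 1, 1)).sum(dim=0)  # expected rarity per pixel
    return entropy * class_weight

probs = torch.softmax(torch.randn(19, 64, 128), dim=0)         # 19 Cityscapes-style classes
counts = torch.randint(0, 1000, (19,))
scores = class_balanced_scores(probs, counts)
print(scores.shape, scores.max().item())
```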

ACT: Adversarial Consistency Models

  • paper_url: http://arxiv.org/abs/2311.14097
  • repo_url: None
  • paper_authors: Fei Kong, Jinhao Duan, Lichao Sun, Hao Cheng, Renjing Xu, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
  • for: The paper aims to improve the speed and quality of diffusion-based image generation, addressing the slow generation speed and high training cost of existing consistency training methods.
  • methods: It proposes Adversarial Consistency Training (ACT), which incorporates a discriminator into the consistency training framework to directly minimize the Jensen-Shannon divergence between the target and generated distributions at each timestep.
  • results: ACT achieves improved FID scores on CIFAR10 and ImageNet 64x64, retains zero-shot image inpainting capabilities, and uses less than 1/6 of the original batch size and fewer than 1/2 of the model parameters and training steps compared to the baseline method, substantially reducing resource consumption.
    Abstract Though diffusion models excel in image generation, their step-by-step denoising leads to slow generation speeds. Consistency training addresses this issue with single-step sampling but often produces lower-quality generations and requires high training costs. In this paper, we show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions. As timestep increases, the upper bound accumulates previous consistency training losses. Therefore, larger batch sizes are needed to reduce both current and accumulated losses. We propose Adversarial Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) divergence between distributions at each timestep using a discriminator. Theoretically, ACT enhances generation quality, and convergence. By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on CIFAR10 and ImageNet 64$\times$64, retains zero-shot image inpainting capabilities, and uses less than $1/6$ of the original batch size and fewer than $1/2$ of the model parameters and training steps compared to the baseline method, this leads to a substantial reduction in resource consumption.
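As a reminder of why a discriminator targets the JS divergence, here is the textbook GAN loss pair whose equilibrium minimizes (up to constants) the JS divergence between the target and generated distributions; ACT's full timestep-conditioned objective, which combines this with consistency training, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # Original GAN value function on discriminator logits for real (target) and generated samples.
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake

def generator_loss(d_fake):
    # Non-saturating generator objective: fool the discriminator.
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

# Toy usage with random discriminator logits.
d_real, d_fake = torch.randn(16, 1), torch.randn(16, 1)
print(discriminator_loss(d_real, d_fake).item(), generator_loss(d_fake).item())
```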

Video Anomaly Detection using GAN

  • paper_url: http://arxiv.org/abs/2311.14095
  • repo_url: https://github.com/SeaOtter/vad_gan
  • paper_authors: Anikeit Sethi, Krishanu Saini, Sai Mounika Mididoddi
  • for: This work provides a method for automatically detecting and recognizing abnormal events in surveillance footage, addressing growing public safety concerns.
  • methods: The study uses machine learning, specifically a generative adversarial network (GAN) based model with a residual autoencoder generator, to identify anomalous events automatically.
  • results: Compared with state-of-the-art techniques on four benchmark datasets, the proposed network performs favorably on all datasets.
    Abstract Accounting for the increased concern for public safety, automatic abnormal event detection and recognition in a surveillance scene is crucial. It is a current open study subject because of its intricacy and utility. The identification of aberrant events automatically, it's a difficult undertaking because everyone's idea of abnormality is different. A typical occurrence in one circumstance could be seen as aberrant in another. Automatic anomaly identification becomes particularly challenging in the surveillance footage with a large crowd due to congestion and high occlusion. With the use of machine learning techniques, this thesis study aims to offer the solution for this use case so that human resources won't be required to keep an eye out for any unusual activity in the surveillance system records. We have developed a novel generative adversarial network (GAN) based anomaly detection model. This model is trained such that it learns together about constructing a high dimensional picture space and determining the latent space from the video's context. The generator uses a residual Autoencoder architecture made up of a multi-stage channel attention-based decoder and a two-stream, deep convolutional encoder that can realise both spatial and temporal data. We have also offered a technique for refining the GAN model that reduces training time while also generalising the model by utilising transfer learning between datasets. Using a variety of assessment measures, we compare our model to the current state-of-the-art techniques on four benchmark datasets. The empirical findings indicate that, in comparison to existing techniques, our network performs favourably on all datasets.

Class Uncertainty: A Measure to Mitigate Class Imbalance

  • paper_url: http://arxiv.org/abs/2311.14090
  • repo_url: None
  • paper_authors: Z. S. Baltaci, K. Oksuz, S. Kuzucu, K. Tezoren, B. K. Konar, A. Ozkan, E. Akbas, S. Kalkan
  • for: This work addresses the performance degradation of deep classifiers when the class-wise characteristics of the training examples are imbalanced.
  • methods: The study proposes "Class Uncertainty", the average predictive uncertainty of a class's training examples, as a measure of class imbalance, incorporates it into ten class imbalance mitigation methods, and curates SVCI-20, a dataset whose classes have equal cardinality but differ in hardness.
  • results: Class Uncertainty captures differences across classes better than cardinality alone and proves effective on long-tailed datasets as well as on SVCI-20.
    Abstract Class-wise characteristics of training examples affect the performance of deep classifiers. A well-studied example is when the number of training examples of classes follows a long-tailed distribution, a situation that is likely to yield sub-optimal performance for under-represented classes. This class imbalance problem is conventionally addressed by approaches relying on the class-wise cardinality of training examples, such as data resampling. In this paper, we demonstrate that considering solely the cardinality of classes does not cover all issues causing class imbalance. To measure class imbalance, we propose "Class Uncertainty" as the average predictive uncertainty of the training examples, and we show that this novel measure captures the differences across classes better than cardinality. We also curate SVCI-20 as a novel dataset in which the classes have equal number of training examples but they differ in terms of their hardness; thereby causing a type of class imbalance which cannot be addressed by the approaches relying on cardinality. We incorporate our "Class Uncertainty" measure into a diverse set of ten class imbalance mitigation methods to demonstrate its effectiveness on long-tailed datasets as well as on our SVCI-20. Code and datasets will be made available.
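A minimal sketch of the proposed measure, assuming predictive uncertainty is taken as the entropy of the softmax output (the paper may use a different uncertainty estimate), is:

```python
import torch

def class_uncertainty(probs, labels, num_classes):
    """Average predictive uncertainty (here: entropy of the softmax output)
    of the training examples belonging to each class."""
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)      # (N,)
    per_class = torch.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        per_class[c] = entropy[mask].mean() if mask.any() else float("nan")
    return per_class

probs = torch.softmax(torch.randn(1000, 10), dim=1)   # toy predictions for 1000 examples
labels = torch.randint(0, 10, (1000,))
print(class_uncertainty(probs, labels, 10))
```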

Brain MRI Screening Tool with Federated Learning

  • paper_url: http://arxiv.org/abs/2311.14086
  • repo_url: None
  • paper_authors: Roman Stoklasa, Ioannis Stathopoulos, Efstratios Karavasilis, Efstathios Efstathopoulos, Marek Dostál, Miloš Keřkovský, Michal Kozubek, Luigi Serio
  • for: This paper aims to shorten the delay between MRI scans and radiologists' diagnoses in clinical practice.
  • methods: The paper presents an automatic brain MRI screening tool for detecting tumor-like pathologies; the tool supports federated learning, so multiple institutions can contribute to the model without disclosing their private data.
  • results: The paper demonstrates the tool's capability for detecting tumor-like pathologies, helping radiologists prioritize likely severe cases; it is a first version on the path toward a robust multi-pathology screening solution.
    Abstract In clinical practice, we often see significant delays between MRI scans and the diagnosis made by radiologists, even for severe cases. In some cases, this may be caused by the lack of additional information and clues, so even the severe cases need to wait in the queue for diagnosis. This can be avoided if there is an automatic software tool, which would supplement additional information, alerting radiologists that the particular patient may be a severe case. We are presenting an automatic brain MRI Screening Tool and we are demonstrating its capabilities for detecting tumor-like pathologies. It is the first version on the path toward a robust multi-pathology screening solution. The tool supports Federated Learning, so multiple institutions may contribute to the model without disclosing their private data.
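The federated learning setup is not detailed in the abstract; below is a minimal FedAvg-style aggregation sketch in PyTorch showing how multiple institutions' locally trained weights could be combined without sharing data (names and weighting are illustrative).

```python
import copy
import torch

def federated_average(global_model, client_state_dicts, client_sizes):
    """One FedAvg-style round: average the institutions' locally trained weights,
    weighted by their local dataset sizes, into the global model."""
    total = float(sum(client_sizes))
    avg_state = copy.deepcopy(client_state_dicts[0])
    for key in avg_state:
        avg_state[key] = sum(sd[key].float() * (n / total)
                             for sd, n in zip(client_state_dicts, client_sizes))
    global_model.load_state_dict(avg_state)
    return global_model

# Toy usage: three hospitals fine-tune copies of the model on private data (not shown),
# then only the weights are aggregated on the server.
global_model = torch.nn.Linear(4, 2)
locals_ = [copy.deepcopy(global_model) for _ in range(3)]
federated_average(global_model, [m.state_dict() for m in locals_], [120, 80, 200])
```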

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

  • paper_url: http://arxiv.org/abs/2311.14760
  • repo_url: https://github.com/wyf0912/sinsr
  • paper_authors: Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, Bihan Wen
  • for: To improve the inference speed of diffusion-based image super-resolution (SR).
  • methods: A deterministic sampling process derived from a recent accelerated diffusion SR method is distilled into a student model that performs SR in a single step, aided by a novel consistency-preserving loss.
  • results: Extensive experiments on synthetic and real-world datasets show performance comparable to or better than previous SOTA methods and the teacher model, with a remarkable up-to-10x inference speedup.
    Abstract While super-resolution (SR) methods based on diffusion models exhibit promising results, their practical application is hindered by the substantial number of required inference steps. Recent methods utilize degraded images in the initial state, thereby shortening the Markov chain. Nevertheless, these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g., 15 iterations). To enhance inference speed, we propose a simple yet effective method for achieving single-step SR generation, named SinSR. Specifically, we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally, we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process, ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model, resulting in further performance improvement. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in a remarkable up to x10 speedup for inference. Our code will be released at https://github.com/wyf0912/SinSR
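A rough sketch of the distillation idea, assuming the student is trained to match the teacher's deterministic noise-to-image mapping in a single step while a second term keeps it consistent with the ground-truth image (the paper's actual consistency-preserving loss is formulated differently), could look like:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher_map, x_lr, noise, y_hr, w=0.5):
    """One training step for single-step SR distillation (illustrative only).
    teacher_map(x_lr, noise): deterministic teacher mapping from input noise and the
    low-res image to a high-res output, obtained from the accelerated diffusion sampler.
    The student must reproduce it in one step; the second term (weight w is an
    assumption) keeps the student consistent with the ground-truth image y_hr."""
    with torch.no_grad():
        y_teacher = teacher_map(x_lr, noise)
    y_student = student(x_lr, noise)
    return F.mse_loss(y_student, y_teacher) + w * F.mse_loss(y_student, y_hr)
```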

You Only Explain Once

  • paper_url: http://arxiv.org/abs/2311.14081
  • repo_url: https://github.com/sayantann11/all-classification-templetes-for-ML
  • paper_authors: David A. Kelly, Hana Chockler, Daniel Kroening, Nathan Blake, Aditi Ramaswamy, Melane Navaratnarajah, Aaditya Shivakumar
  • for: This paper proposes a new black-box explainability algorithm and tool for efficiently explaining the outputs of object detectors.
  • methods: The algorithm computes explanations for all detected objects simultaneously, combining aggressive pruning with causal analysis to avoid backtracking.
  • results: Experiments show that YO-ReX accurately explains the outputs of YOLO, SSD, and Faster R-CNN with negligible overhead.
    Abstract In this paper, we propose a new black-box explainability algorithm and tool, YO-ReX, for efficient explanation of the outputs of object detectors. The new algorithm computes explanations for all objects detected in the image simultaneously. Hence, compared to the baseline, the new algorithm reduces the number of queries by a factor of 10X for the case of ten detected objects. The speedup increases further with with the number of objects. Our experimental results demonstrate that YO-ReX can explain the outputs of YOLO with a negligible overhead over the running time of YOLO. We also demonstrate similar results for explaining SSD and Faster R-CNN. The speedup is achieved by avoiding backtracking by combining aggressive pruning with a causal analysis.

HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding

  • paper_url: http://arxiv.org/abs/2311.14064
  • repo_url: https://github.com/richard-peng-xia/HGCLIP
  • paper_authors: Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan, Zongyuan Ge
  • for: The paper presents a new framework, HGCLIP, for multi-granularity hierarchical classification, combining CLIP with graph representation learning to better exploit the class hierarchy and improve image recognition performance.
  • methods: The class hierarchy is built into a graph whose nodes carry the textual or image features of each category; after a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through an attention mechanism.
  • results: The approach achieves significant improvements on both generic and fine-grained visual recognition benchmarks; the code is available on GitHub.
    Abstract Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, revealing limitations in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. These efforts are constrained by their inability to perform effectively across varied granularity of categories. To tackle this issue, we propose a novel framework (HGCLIP) that effectively combines CLIP with a deeper exploitation of the Hierarchical class structure via Graph representation learning. We explore constructing the class hierarchy into a graph, with its nodes representing the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on both generic and fine-grained visual recognition benchmarks. Our codes are fully available at https://github.com/richard-peng-xia/HGCLIP.

Hardware Resilience Properties of Text-Guided Image Classifiers

  • paper_url: http://arxiv.org/abs/2311.14062
  • repo_url: https://github.com/TalalWasim/TextGuidedResilience
  • paper_authors: Syed Talal Wasim, Kabila Haile Saboka, Abdulrahman Mahmoud, Salman Khan, David Brooks, Gu-Yeon Wei
  • for: The paper aims to improve the reliability of image classification models against transient hardware errors during deployment.
  • methods: Enriched text embeddings, derived from GPT-3 with question prompts per class and encoded by the CLIP pretrained text encoder, are used to initialize the classification layer.
  • results: The method achieves a 5.5x average (up to 14x) increase in hardware reliability in the most critical layer across various architectures, with minimal accuracy drop (0.3% on average), low parameter and FLOPs overhead, and seamless integration with any image classification backbone.
    Abstract This paper presents a novel method to enhance the reliability of image classification models during deployment in the face of transient hardware errors. By utilizing enriched text embeddings derived from GPT-3 with question prompts per class and CLIP pretrained text encoder, we investigate their impact as an initialization for the classification layer. Our approach achieves a remarkable $5.5\times$ average increase in hardware reliability (and up to 14x) across various architectures in the most critical layer, with minimal accuracy drop (0.3% on average) compared to baseline PyTorch models. Furthermore, our method seamlessly integrates with any image classification backbone, showcases results across various network architectures, decreases parameter and FLOPs overhead, and follows a consistent training recipe. This research offers a practical and efficient solution to bolster the robustness of image classification models against hardware failures, with potential implications for future studies in this domain. Our code and models are released at https://github.com/TalalWasim/TextGuidedResilience.
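A minimal sketch of using per-class text embeddings to initialize a classification head is shown below; the projection handling and normalization are assumptions, since the abstract does not give those details.

```python
import torch
import torch.nn as nn

def init_classifier_from_text(text_embeddings, feature_dim):
    """Initialize a classification head from per-class text embeddings.
    text_embeddings: (num_classes, d) text features of enriched class prompts.
    A simple zero-padded/truncated mapping is used if d != feature_dim (assumption)."""
    num_classes, d = text_embeddings.shape
    head = nn.Linear(feature_dim, num_classes, bias=False)
    emb = nn.functional.normalize(text_embeddings, dim=-1)
    if d != feature_dim:
        proj = torch.zeros(num_classes, feature_dim)
        proj[:, :min(d, feature_dim)] = emb[:, :min(d, feature_dim)]
        emb = proj
    with torch.no_grad():
        head.weight.copy_(emb)
    return head

head = init_classifier_from_text(torch.randn(1000, 512), feature_dim=512)
```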

Assessment of Deep Learning Segmentation for Real-Time Free-Breathing Cardiac Magnetic Resonance Imaging

  • paper_url: http://arxiv.org/abs/2311.14049
  • repo_url: None
  • paper_authors: Martin Schilling, Christina Unterberg-Buchwald, Joachim Lotz, Martin Uecker
  • for: This paper aims to assess the accuracy of deep learning methods for volumetric analysis of the left ventricle in real-time free-breathing CMR at rest and under exercise stress.
  • methods: The paper uses a freely available neural network (nnU-Net) and compares it to a commercial software (comDL) for segmentation of the left ventricle, myocardium, and right ventricle in both cine and real-time free-breathing CMR.
  • results: The paper finds that nnU-Net achieves higher accuracy than comDL overall for real-time CMR, and that the performance of deep learning methods is comparable to inter-observer variability in cine CMR, making them a viable option for automatic segmentation. Additionally, the paper shows that nnU-Net can accurately segment the left ventricle, myocardium, and right ventricle in both rest and exercise conditions.
    Abstract In recent years, a variety of deep learning networks for cardiac MRI (CMR) segmentation have been developed and analyzed. However, nearly all of them are focused on cine CMR under breathold. In this work, accuracy of deep learning methods is assessed for volumetric analysis (via segmentation) of the left ventricle in real-time free-breathing CMR at rest and under exercise stress. Data from healthy volunteers (n=15) for cine and real-time free-breathing CMR were analyzed retrospectively. Segmentations of a commercial software (comDL) and a freely available neural network (nnU-Net), were compared to a reference created via the manual correction of comDL segmentation. Segmentation of left ventricular endocardium (LV), left ventricular myocardium (MYO), and right ventricle (RV) is evaluated for both end-systolic and end-diastolic phases and analyzed with Dice's coefficient (DC). The volumetric analysis includes LV end-diastolic volume (EDV), LV end-systolic volume (ESV), and LV ejection fraction (EF). For cine CMR, nnU-Net and comDL achieve a DC above 0.95 for LV and 0.9 for MYO, and RV. For real-time CMR, the accuracy of nnU-Net exceeds that of comDL overall. For real-time CMR at rest, nnU-Net achieves a DC of 0.94 for LV, 0.89 for MYO, and 0.90 for RV; mean absolute differences between nnU-Net and reference are 2.9mL for EDV, 3.5mL for ESV and 2.6% for EF. For real-time CMR under exercise stress, nnU-Net achieves a DC of 0.92 for LV, 0.85 for MYO, and 0.83 for RV; mean absolute differences between nnU-Net and reference are 11.4mL for EDV, 2.9mL for ESV and 3.6% for EF. Deep learning methods designed or trained for cine CMR segmentation can perform well on real-time CMR. For real-time free-breathing CMR at rest, the performance of deep learning methods is comparable to inter-observer variability in cine CMR and is usable or fully automatic segmentation.
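For reference, the ejection fraction in the volumetric analysis follows directly from the end-diastolic and end-systolic volumes:

```python
def ejection_fraction(edv_ml, esv_ml):
    """Left-ventricular ejection fraction (%) from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml

print(ejection_fraction(150.0, 60.0))   # -> 60.0, illustrative volumes in mL
```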

Understanding the Vulnerability of CLIP to Image Compression

  • paper_url: http://arxiv.org/abs/2311.14029
  • repo_url: None
  • paper_authors: Cangxiong Chen, Vinay P. Namboodiri, Julian Padget
  • for: This paper investigates the vulnerability of CLIP to changes in image quality under compression and explains the effect using Integrated Gradients.
  • methods: CLIP is evaluated on zero-shot image recognition and image-text alignment with compressed images, and the Integrated Gradients attribution method is used to analyze, quantitatively and qualitatively, how compression affects the model.
  • results: Compression degrades CLIP's zero-shot recognition accuracy on CIFAR-10 and STL-10, and Integrated Gradients helps explain the nature of this degradation, providing a basis for improving the robustness of CLIP and other vision-language models.
    Abstract CLIP is a widely used foundational vision-language model that is used for zero-shot image recognition and other image-text alignment tasks. We demonstrate that CLIP is vulnerable to change in image quality under compression. This surprising result is further analysed using an attribution method-Integrated Gradients. Using this attribution method, we are able to better understand both quantitatively and qualitatively exactly the nature in which the compression affects the zero-shot recognition accuracy of this model. We evaluate this extensively on CIFAR-10 and STL-10. Our work provides the basis to understand this vulnerability of CLIP and can help us develop more effective methods to improve the robustness of CLIP and other vision-language models.
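A minimal sketch of the kind of evaluation described, re-encoding an image at decreasing JPEG quality and checking CLIP's zero-shot prediction, is shown below using the openai/CLIP package; the prompt template, image path, and quality levels are illustrative, and the Integrated Gradients analysis is not included.

```python
import io
import torch
import clip                      # https://github.com/openai/CLIP
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

model, preprocess = clip.load("ViT-B/32", device="cpu")
classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]          # CIFAR-10 label set
with torch.no_grad():
    text = model.encode_text(clip.tokenize([f"a photo of a {c}" for c in classes]))
    text = text / text.norm(dim=-1, keepdim=True)

def zero_shot_predict(img: Image.Image) -> str:
    with torch.no_grad():
        feat = model.encode_image(preprocess(img).unsqueeze(0))
        feat = feat / feat.norm(dim=-1, keepdim=True)
    return classes[(feat @ text.T).argmax(dim=-1).item()]

img = Image.open("example.png").convert("RGB")               # placeholder image path
print([zero_shot_predict(jpeg_compress(img, q)) for q in (95, 50, 10)])
```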

Creating and Benchmarking a Synthetic Dataset for Cloud Optical Thickness Estimation

  • paper_url: http://arxiv.org/abs/2311.14024
  • repo_url: https://github.com/aleksispi/ml-cloud-opt-thick
  • paper_authors: Aleksis Pirinen, Nosheen Abid, Nuria Agues Paszkowsky, Thomas Ohlson Timoudas, Ronald Scheirer, Chiara Ceccobello, György Kovács, Anders Persson
  • for: The paper addresses the scarcity of annotated data for cloud optical thickness (COT) estimation in satellite-based Earth observation.
  • methods: A synthetic dataset is proposed in which top-of-atmosphere radiances are simulated for 12 spectral bands of the Sentinel-2 Multi-Spectral Instrument (MSI) under different cloud types, COTs, and ground surface and atmospheric profiles; machine learning models are then trained to predict COT from the measured band reflectivities.
  • results: Models trained on the synthetic dataset estimate COT accurately and generalize to two real satellite image datasets; the synthetic data, the newly collected real dataset, code, and models are publicly available.
    Abstract Cloud formations often obscure optical satellite-based monitoring of the Earth's surface, thus limiting Earth observation (EO) activities such as land cover mapping, ocean color analysis, and cropland monitoring. The integration of machine learning (ML) methods within the remote sensing domain has significantly improved performance on a wide range of EO tasks, including cloud detection and filtering, but there is still much room for improvement. A key bottleneck is that ML methods typically depend on large amounts of annotated data for training, which is often difficult to come by in EO contexts. This is especially true for the task of cloud optical thickness (COT) estimation. A reliable estimation of COT enables more fine-grained and application-dependent control compared to using pre-specified cloud categories, as is commonly done in practice. To alleviate the COT data scarcity problem, in this work we propose a novel synthetic dataset for COT estimation, where top-of-atmosphere radiances have been simulated for 12 of the spectral bands of the Multi-Spectral Instrument (MSI) sensor onboard Sentinel-2 platforms. These data points have been simulated under consideration of different cloud types, COTs, and ground surface and atmospheric profiles. Extensive experimentation of training several ML models to predict COT from the measured reflectivity of the spectral bands demonstrates the usefulness of our proposed dataset. Generalization to real data is also demonstrated on two satellite image datasets -- one that is publicly available, and one which we have collected and annotated. The synthetic data, the newly collected real dataset, code and models have been made publicly available at https://github.com/aleksispi/ml-cloud-opt-thick.

Shadow: A Novel Loss Function for Efficient Training in Siamese Networks

  • paper_url: http://arxiv.org/abs/2311.14012
  • repo_url: None
  • paper_authors: Alif Elham Khan, Mohammad Junayed Hasan, Humayra Anjum, Nabeel Mohammed
  • for: To improve the performance of similarity detection tasks under memory constraints.
  • methods: A new loss function, Shadow Loss, compresses the dimensions of the embedding space during loss calculation, learning distances on a compact projection space where they directly correspond to class similarity.
  • results: Shadow Loss outperforms Triplet Margin Loss by 5%-10% accuracy across diverse datasets while reducing embedding dimensions, and its performance holds across models and across balanced, imbalanced, medical, and non-medical datasets.
    Abstract Despite significant recent advances in similarity detection tasks, existing approaches pose substantial challenges under memory constraints. One of the primary reasons for this is the use of computationally expensive metric learning loss functions such as Triplet Loss in Siamese networks. In this paper, we present a novel loss function called Shadow Loss that compresses the dimensions of an embedding space during loss calculation without loss of performance. The distance between the projections of the embeddings is learned from inputs on a compact projection space where distances directly correspond to a measure of class similarity. Projecting on a lower-dimension projection space, our loss function converges faster, and the resulting classified image clusters have higher inter-class and smaller intra-class distances. Shadow Loss not only reduces embedding dimensions favoring memory constraint devices but also consistently performs better than the state-of-the-art Triplet Margin Loss by an accuracy of 5\%-10\% across diverse datasets. The proposed loss function is also model agnostic, upholding its performance across several tested models. Its effectiveness and robustness across balanced, imbalanced, medical, and non-medical image datasets suggests that it is not specific to a particular model or dataset but demonstrates superior performance consistently while using less memory and computation.

High-resolution Population Maps Derived from Sentinel-1 and Sentinel-2

  • paper_url: http://arxiv.org/abs/2311.14006
  • repo_url: None
  • paper_authors: Nando Metzger, Rodrigo Caye Daudt, Devis Tuia, Konrad Schindler
  • for: This paper aims to provide a timely and scalable method for generating population maps, particularly in data-scarce regions.
  • methods: The proposed method, called POPCORN, uses free, globally available satellite images from Sentinel-1 and Sentinel-2, and a small number of aggregate population counts over coarse census districts for calibration.
  • results: The method surpasses the mapping accuracy of existing schemes, including those that rely on high-resolution imagery, and produces population maps with an $R^2$ score of 66% and an average error of only $\pm$10 inhabitants/ha in Kigali. Additionally, the method provides interpretable results, such as maps of built-up areas and local building occupancy rates, and can be applied repeatedly to track population changes and transferred to geographically similar regions.
    Abstract Detailed population maps play an important role in diverse fields ranging from humanitarian action to urban planning. Generating such maps in a timely and scalable manner presents a challenge, especially in data-scarce regions. To address it we have developed POPCORN, a population mapping method whose only inputs are free, globally available satellite images from Sentinel-1 and Sentinel-2; and a small number of aggregate population counts over coarse census districts for calibration. Despite the minimal data requirements our approach surpasses the mapping accuracy of existing schemes, including several that rely on building footprints derived from high-resolution imagery. E.g., we were able to produce population maps for Rwanda with 100m GSD based on less than 400 regional census counts. In Kigali, those maps reach an $R^2$ score of 66% w.r.t. a ground truth reference map, with an average error of only $\pm$10 inhabitants/ha. Conveniently, POPCORN retrieves explicit maps of built-up areas and of local building occupancy rates, making the mapping process interpretable and offering additional insights, for instance about the distribution of built-up, but unpopulated areas, e.g., industrial warehouses. Moreover, we find that, once trained, the model can be applied repeatedly to track population changes; and that it can be transferred to geographically similar regions, e.g., from Uganda to Rwanda). With our work we aim to democratize access to up-to-date and high-resolution population maps, recognizing that some regions faced with particularly strong population dynamics may lack the resources for costly micro-census campaigns.

GRJointNET: Synergistic Completion and Part Segmentation on 3D Incomplete Point Clouds

  • paper_url: http://arxiv.org/abs/2311.13997
  • repo_url: None
  • paper_authors: Yigit Gurses, Melisa Taspinar, Mahmut Yurt, Sedat Ozer
  • for: To enhance the practicality and utility of 3D point clouds for autonomous systems.
  • methods: GRJointNet, a successor to GRNet, is a deep architecture that performs point cloud completion and part segmentation jointly, with the features extracted for the two tasks shared to increase overall performance.
  • results: On the ShapeNet-Part dataset, GRJointNet outperforms GRNet on point completion and, unlike GRNet, also performs segmentation.
    Abstract Segmentation of three-dimensional (3D) point clouds is an important task for autonomous systems. However, success of segmentation algorithms depends greatly on the quality of the underlying point clouds (resolution, completeness etc.). In particular, incomplete point clouds might reduce a downstream model's performance. GRNet is proposed as a novel and recent deep learning solution to complete point clouds, but it is not capable of part segmentation. On the other hand, our proposed solution, GRJointNet, is an architecture that can perform joint completion and segmentation on point clouds as a successor of GRNet. Features extracted for the two tasks are also utilized by each other to increase the overall performance. We evaluated our proposed network on the ShapeNet-Part dataset and compared its performance to GRNet. Our results demonstrate GRJointNet can outperform GRNet on point completion. It should also be noted that GRNet is not capable of segmentation while GRJointNet is. This study1, therefore, holds a promise to enhance practicality and utility of point clouds in 3D vision for autonomous systems.
    摘要 三维(3D)点云分割是自主系统中的一项重要任务。然而,分割算法的效果在很大程度上取决于点云本身的质量(分辨率、完整性等)。特别是,不完整的点云可能会降低下游模型的性能。GRNet是近期提出的一种用于点云补全的深度学习方法,但它无法进行部件分割。相比之下,我们提出的GRJointNet是在GRNet基础上构建的架构,可以对点云同时进行补全和部件分割,且两个任务提取的特征互相利用以提升整体性能。我们在ShapeNet-Part数据集上评估了所提网络,并与GRNet进行比较。结果表明,GRJointNet在点云补全上可以超越GRNet;同时GRNet无法进行分割,而GRJointNet可以。因此,这一研究有望提升点云在面向自主系统的三维视觉中的实用性与应用价值。

EIGEN: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images

  • paper_url: http://arxiv.org/abs/2311.13993
  • repo_url: https://github.com/ayushayush591/eigen-high-fidelity-extraction-document-images
  • paper_authors: Abhishek Singh, Venkatapathy Subramanian, Ayush Maheshwari, Pradeep Narayan, Devi Prasad Shetty, Ganesh Ramakrishnan
  • for: 该论文主要针对文档图像中的信息抽取(IE)问题,即结合深度学习模型与基于规则的方法来提高IE的准确率。
  • methods: 论文提出了一种新的EIGEN(专家知情的联合学习聚合)方法,将基于规则的方法与深度学习模型相结合,并借助数据编程技术减少训练数据的标注量。文中还提出了可结合上下文信息的标注函数,以提高对单词的准确分类。
  • results: 实验表明,EIGEN方法可以在仅有少量标注数据的情况下显著提升最先进深度模型的性能。代码可在 https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images 获取。
    Abstract Information Extraction (IE) from document images is challenging due to the high variability of layout formats. Deep models such as LayoutLM and BROS have been proposed to address this problem and have shown promising results. However, they still require a large amount of field-level annotations for training these models. Other approaches using rule-based methods have also been proposed based on the understanding of the layout and semantics of a form such as geometric position, or type of the fields, etc. In this work, we propose a novel approach, EIGEN (Expert-Informed Joint Learning aGgrEatioN), which combines rule-based methods with deep learning models using data programming approaches to circumvent the requirement of annotation of large amounts of training data. Specifically, EIGEN consolidates weak labels induced from multiple heuristics through generative models and use them along with a small number of annotated labels to jointly train a deep model. In our framework, we propose the use of labeling functions that include incorporating contextual information thus capturing the visual and language context of a word for accurate categorization. We empirically show that our EIGEN framework can significantly improve the performance of state-of-the-art deep models with the availability of very few labeled data instances. The source code is available at https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images.
    摘要 由于版面格式高度多变,从文档图像中进行信息抽取(IE)颇具挑战。LayoutLM和BROS等深度模型已被提出以解决该问题并取得了不错的效果,但它们仍需要大量字段级标注用于训练。另有一些基于规则的方法,依据对表单版面与语义的理解(如字段的几何位置、类型等)来完成抽取。在本工作中,我们提出了一种新方法EIGEN(专家知情的联合学习聚合),借助数据编程思想将基于规则的方法与深度学习模型相结合,从而规避对大量训练数据进行标注的需求。具体而言,EIGEN通过生成式模型汇聚由多条启发式规则产生的弱标签,并结合少量人工标注共同训练深度模型。框架中提出的标注函数可结合上下文信息,利用单词的视觉与语言语境实现准确分类。实验表明,在仅有极少量标注样本的情况下,EIGEN框架能显著提升最先进深度模型的性能。源代码见 https://github.com/ayushayush591/EIGEN-High-Fidelity-Extraction-Document-Images 。
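As a rough illustration of the labeling-function idea described above (heuristics that vote on a word's field type and otherwise abstain, later aggregated by a generative label model together with a few gold labels), here is a hypothetical sketch. The `word`/`context` interface, the `word_left_of` helper, and the DATE field are invented for the example and are not part of the paper's code.

```python
import re

ABSTAIN, OTHER, DATE = -1, 0, 1   # hypothetical label space

def lf_date_pattern(word, context):
    """Purely lexical heuristic: fire on tokens that look like a date."""
    if re.fullmatch(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}", word.text):
        return DATE
    return ABSTAIN

def lf_right_of_date_key(word, context):
    """Contextual heuristic: a numeric token sitting to the right of a
    'Date' field label in the form layout is likely the date value."""
    left = context.word_left_of(word)             # assumed layout helper
    if left is not None and left.text.lower().startswith("date") and word.text[:1].isdigit():
        return DATE
    return ABSTAIN
```

The weak votes from many such functions would then be consolidated by a generative label model and mixed with a small annotated set to train the deep extractor, as the abstract describes.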

FViT-Grasp: Grasping Objects With Using Fast Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.13986
  • repo_url: None
  • paper_authors: Arda Sarp Yenicesu, Berk Cicek, Ozgur S. Oguz
  • for: 本研究针对机器人操作(manipulation)这一难题,提出了一种快速、准确地确定机器人抓取物体最佳抓取点的新方法。
  • methods: 该方法使用 Fast Vision Transformer (FViT) 神经网络进行视觉数据处理,预测最佳抓取位置。
  • results: 研究显示该方法在速度和准确率两个方面具有state-of-the-art表现,可能用于实时机器人抓取应用。
    Abstract This study addresses the challenge of manipulation, a prominent issue in robotics. We have devised a novel methodology for swiftly and precisely identifying the optimal grasp point for a robot to manipulate an object. Our approach leverages a Fast Vision Transformer (FViT), a type of neural network designed for processing visual data and predicting the most suitable grasp location. Demonstrating state-of-the-art performance in terms of speed while maintaining a high level of accuracy, our method holds promise for potential deployment in real-time robotic grasping applications. We believe that this study provides a baseline for future research in vision-based robotic grasp applications. Its high speed and accuracy bring researchers closer to real-life applications.
    摘要 本研究针对机器人领域中的操作难题,提出了一种快速且精准地确定机器人抓取物体最佳抓取点的新方法。该方法利用快速视觉Transformer(FViT)神经网络处理视觉数据并预测最合适的抓取位置。该方法在保持高准确率的同时实现了最先进的速度表现,有望部署于实时机器人抓取应用。我们相信这项研究可为基于视觉的机器人抓取研究提供一个基线,其高速度与高精度使研究更接近实际应用。

Low Latency Instance Segmentation by Continuous Clustering for Rotating LiDAR Sensors

  • paper_url: http://arxiv.org/abs/2311.13976
  • repo_url: https://github.com/unibwtas/continuous_clustering
  • paper_authors: Andreas Reich, Hans-Joachim Wuensche
  • for: 这 paper 的目的是提出一种实时的 LiDAR 点云实例分割方法,以便在真实世界应用中实现快速响应。
  • methods: 该方法使用连续划分方法对障碍点云进行实例分割,不需要扫描整个 LiDAR 数据流,而是在实时进行划分。
  • results: 该方法可以减少实时点云分割的延迟,并且不会出现扫描整个 LiDAR 数据流时的缺失点云问题。 codes 将于 https://github.com/UniBwTAS/continuous_clustering 上发布。
    Abstract Low-latency instance segmentation of LiDAR point clouds is crucial in real-world applications because it serves as an initial and frequently-used building block in a robot's perception pipeline, where every task adds further delay. Particularly in dynamic environments, this total delay can result in significant positional offsets of dynamic objects, as seen in highway scenarios. To address this issue, we employ continuous clustering of obstacle points in order to obtain an instance-segmented point cloud. Unlike most existing approaches, which use a full revolution of the LiDAR sensor, we process the data stream in a continuous and seamless fashion. More specifically, each column of a range image is processed as soon it is available. Obstacle points are clustered to existing instances in real-time and it is checked at a high-frequency which instances are completed and are ready to be published. An additional advantage is that no problematic discontinuities between the points of the start and the end of a scan are observed. In this work we describe the two-layered data structure and the corresponding algorithm for continuous clustering, which is able to cluster the incoming data in real time. We explain the importance of a large perceptive field of view. Furthermore, we describe and evaluate important architectural design choices, which could be relevant to design an architecture for deep learning based low-latency instance segmentation. We are publishing the source code at https://github.com/UniBwTAS/continuous_clustering.
    摘要 LiDAR 点云的低延迟实例分割在真实世界应用中至关重要,因为它是机器人感知管道中最先执行且被频繁使用的基础模块,而后续每个任务都会进一步增加延迟。特别是在动态环境下,总延迟可能导致动态目标出现明显的位置偏移,例如高速公路场景。为解决这一问题,我们对障碍点进行连续聚类,从而得到实例分割后的点云。与大多数现有方法需要等待 LiDAR 传感器完整旋转一周不同,我们以连续、无缝的方式处理数据流:距离图像(range image)的每一列一旦可用即被处理,障碍点实时归入已有实例,并以高频率检查哪些实例已经完成、可以发布。另一个优点是,一帧扫描起点与终点之间不会出现有问题的不连续。本文介绍了用于连续聚类的两层数据结构及相应算法,可对到达的数据进行实时聚类;我们还说明了大感知视场的重要性,并描述和评估了若干关键的架构设计选择,这些选择对设计基于深度学习的低延迟实例分割架构也具有参考意义。源代码发布于 https://github.com/UniBwTAS/continuous_clustering 。
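The column-wise processing described above can be illustrated with a small, hypothetical sketch: points from each newly arrived range-image column are attached to the nearest active cluster within a distance threshold, and clusters that stop growing are published immediately instead of waiting for a full sensor revolution. The thresholds, the single-link association rule, and the data layout are assumptions, not the paper's exact two-layered algorithm.

```python
import numpy as np

class ContinuousClusterer:
    """Hypothetical sketch of column-wise clustering: each range-image column
    is processed as soon as it arrives, points are attached to the nearest
    active cluster within a distance threshold, and clusters that have not
    grown for a while are published immediately (low latency)."""

    def __init__(self, max_dist=0.5, timeout_cols=32):
        self.max_dist = max_dist
        self.timeout_cols = timeout_cols
        self.active = []        # each cluster: {"points": [xyz, ...], "last_col": int}
        self.col_idx = 0

    def process_column(self, points_xyz):
        for p in points_xyz:                     # p: np.ndarray of shape (3,)
            best, best_d = None, self.max_dist
            for c in self.active:
                d = np.linalg.norm(p - c["points"][-1])
                if d < best_d:
                    best, best_d = c, d
            if best is None:                     # start a new cluster
                best = {"points": [], "last_col": self.col_idx}
                self.active.append(best)
            best["points"].append(p)
            best["last_col"] = self.col_idx

        # publish clusters that have not been extended for `timeout_cols` columns
        still_active, finished = [], []
        for c in self.active:
            (finished if self.col_idx - c["last_col"] > self.timeout_cols
             else still_active).append(c)
        self.active = still_active
        self.col_idx += 1
        return finished
```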

Investigating the use of publicly available natural videos to learn Dynamic MR image reconstruction

  • paper_url: http://arxiv.org/abs/2311.13963
  • repo_url: https://github.com/olivier-jaubert/image_reconstruction_inter4k
  • paper_authors: Olivier Jaubert, Michele Pascale, Javier Montalt-Tordera, Julius Akesson, Ruta Virsinskaite, Daniel Knight, Simon Arridge, Jennifer Steeden, Vivek Muthurangu
  • for: 本研究旨在开发并评估一种深度学习(DL)管道,用于从公开可用的自然视频(Inter4K)中学习动态MRI图像重建。
  • methods: 研究针对多种深度学习架构(VarNet、3D UNet、FastDVDNet)及相应的采样模式(笛卡尔、径向、螺旋)进行训练,训练数据为真实多线圈心脏MR数据(N=692)或由Inter4K自然视频仿真得到的伪MR数据(N=692)。
  • results: 在仿真实验(N=104个数据集)中,以心脏数据训练的DL网络在MSE、PSNR和SSIM上优于以自然视频训练的DL网络,后者又优于压缩感知(CS)重建(p<0.05);而在前瞻性实验(心脏与语音数据)中,两种训练数据得到的DL重建在主观图像质量排序上相近且均高于CS,在SNR和边缘锐度上大多无统计学差异。
    Abstract Purpose: To develop and assess a deep learning (DL) pipeline to learn dynamic MR image reconstruction from publicly available natural videos (Inter4K). Materials and Methods: Learning was performed for a range of DL architectures (VarNet, 3D UNet, FastDVDNet) and corresponding sampling patterns (Cartesian, radial, spiral) either from true multi-coil cardiac MR data (N=692) or from pseudo-MR data simulated from Inter4K natural videos (N=692). Real-time undersampled dynamic MR images were reconstructed using DL networks trained with cardiac data and natural videos, and compressed sensing (CS). Differences were assessed in simulations (N=104 datasets) in terms of MSE, PSNR, and SSIM and prospectively for cardiac (short axis, four chambers, N=20) and speech (N=10) data in terms of subjective image quality ranking, SNR and Edge sharpness. Friedman Chi Square tests with post-hoc Nemenyi analysis were performed to assess statistical significance. Results: For all simulation metrics, DL networks trained with cardiac data outperformed DL networks trained with natural videos, which outperformed CS (p<0.05). However, in prospective experiments DL reconstructions using both training datasets were ranked similarly (and higher than CS) and presented no statistical differences in SNR and Edge Sharpness for most conditions. Additionally, high SSIM was measured between the DL methods with cardiac data and natural videos (SSIM>0.85). Conclusion: The developed pipeline enabled learning dynamic MR reconstruction from natural videos preserving DL reconstruction advantages such as high quality fast and ultra-fast reconstructions while overcoming some limitations (data scarcity or sharing). The natural video dataset, code and pre-trained networks are made readily available on github. Key Words: real-time; dynamic MRI; deep learning; image reconstruction; machine learning;
    摘要 目的:开发并评估一种深度学习(DL)管道,用于从公开可用的自然视频(Inter4K)中学习动态MR图像重建。材料与方法:针对多种DL架构(VarNet、3D UNet、FastDVDNet)及相应采样模式(笛卡尔、径向、螺旋)进行训练,训练数据为真实多线圈心脏MR数据(N=692)或由Inter4K自然视频仿真得到的伪MR数据(N=692)。随后分别使用以心脏数据和自然视频训练的DL网络以及压缩感知(CS)对实时欠采样动态MR图像进行重建。在仿真实验(N=104个数据集)中以MSE、PSNR和SSIM评估差异,并在前瞻性心脏(短轴、四腔,N=20)与语音(N=10)数据上以主观图像质量排序、SNR和边缘锐度进行评估;统计学显著性采用Friedman卡方检验及事后Nemenyi分析。结果:在所有仿真指标上,以心脏数据训练的DL网络优于以自然视频训练的DL网络,后者又优于CS(p<0.05)。然而在前瞻性实验中,两种训练数据得到的DL重建排序相近(且高于CS),并且在大多数条件下SNR与边缘锐度无统计学差异;此外,两种DL方法的重建结果之间SSIM较高(SSIM>0.85)。结论:所开发的管道能够从自然视频中学习动态MR重建,在保留DL重建高质量、快速与超快速等优势的同时,克服了数据稀缺或共享受限等问题。自然视频数据集、代码与预训练网络已在github上公开。关键词:实时;动态MRI;深度学习;图像重建;机器学习;
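To give a flavour of the pseudo-MR simulation mentioned above, here is a heavily simplified sketch that turns a grayscale natural-video clip into a fully sampled / undersampled training pair by masking Cartesian k-space lines. It assumes a single coil and a fixed regular mask; the paper's pipeline (coil sensitivities, radial/spiral patterns, Inter4K preprocessing) is considerably richer.

```python
import numpy as np

def pseudo_mr_pair(frames, acceleration=4):
    """Hypothetical sketch: convert a grayscale clip of shape (T, H, W) into a
    (fully sampled, undersampled, mask) triple via Cartesian k-space masking."""
    frames = frames.astype(np.complex64)
    kspace = np.fft.fftshift(np.fft.fft2(frames, axes=(-2, -1)), axes=(-2, -1))
    T, H, W = kspace.shape
    mask = np.zeros((T, H, 1), dtype=np.float32)
    mask[:, ::acceleration, :] = 1.0              # keep every k-th phase-encode line
    mask[:, H // 2 - 8: H // 2 + 8, :] = 1.0      # fully sampled low frequencies
    under = np.fft.ifft2(np.fft.ifftshift(kspace * mask, axes=(-2, -1)), axes=(-2, -1))
    return frames, under, mask
```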

RankFeat&RankWeight: Rank-1 Feature/Weight Removal for Out-of-distribution Detection

  • paper_url: http://arxiv.org/abs/2311.13959
  • repo_url: https://github.com/kingjamessong/rankfeat
  • paper_authors: Yue Song, Nicu Sebe, Wei Wang
  • for: 本研究的目的是提出一种基于特征矩阵的OOD检测方法,以提高机器学习模型在实际应用中的可靠性。
  • methods: 本研究使用了特征矩阵的协方差分析,发现OOD样本的特征矩阵具有较大的主要特征值,这导致了OOD样本的分类结果受到这些特征值的影响。基于这个发现,本研究提出了一种简单 yet有效的post hoc方法,即 removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature.
  • results: 本研究的实验结果表明,使用了提案的方法可以大幅降低OOD检测中的false positive rate(FPR95),并且与其他方法相比,提案的方法具有更好的compatibility和更高的抗抑制性。此外,本研究还提出了一种基于weight matrix的OOD检测方法,即 removing the rank-1 weight from the parameter matrices of a single deep layer.这种方法可以单独使用,也可以与提案的方法结合使用,以达到更高的检测性能。
    Abstract The task of out-of-distribution (OOD) detection is crucial for deploying machine learning models in real-world settings. In this paper, we observe that the singular value distributions of the in-distribution (ID) and OOD features are quite different: the OOD feature matrix tends to have a larger dominant singular value than the ID feature, and the class predictions of OOD samples are largely determined by it. This observation motivates us to propose \texttt{RankFeat}, a simple yet effective \emph{post hoc} approach for OOD detection by removing the rank-1 matrix composed of the largest singular value and the associated singular vectors from the high-level feature. \texttt{RankFeat} achieves \emph{state-of-the-art} performance and reduces the average false positive rate (FPR95) by 17.90\% compared with the previous best method. The success of \texttt{RankFeat} motivates us to investigate whether a similar phenomenon would exist in the parameter matrices of neural networks. We thus propose \texttt{RankWeight} which removes the rank-1 weight from the parameter matrices of a single deep layer. Our \texttt{RankWeight}is also \emph{post hoc} and only requires computing the rank-1 matrix once. As a standalone approach, \texttt{RankWeight} has very competitive performance against other methods across various backbones. Moreover, \texttt{RankWeight} enjoys flexible compatibility with a wide range of OOD detection methods. The combination of \texttt{RankWeight} and \texttt{RankFeat} refreshes the new \emph{state-of-the-art} performance, achieving the FPR95 as low as 16.13\% on the ImageNet-1k benchmark. Extensive ablation studies and comprehensive theoretical analyses are presented to support the empirical results.
    摘要 分布外(OOD)检测对于在真实环境中部署机器学习模型至关重要。本文观察到分布内(ID)与OOD特征的奇异值分布差异明显:OOD特征矩阵往往具有更大的最大奇异值,且OOD样本的类别预测在很大程度上由其主导。基于这一观察,我们提出了RankFeat,一种简单而有效的事后(post hoc)OOD检测方法:从高层特征中移除由最大奇异值及其对应奇异向量构成的秩一矩阵。RankFeat取得了最先进的性能,将平均误报率(FPR95)较此前最佳方法降低了17.90%。受此启发,我们进一步研究神经网络参数矩阵中是否存在类似现象,并据此提出RankWeight,即从单个深层的参数矩阵中移除秩一权重。RankWeight同样是事后方法,且只需计算一次秩一矩阵。作为独立方法,RankWeight在多种骨干网络上都具有很强的竞争力;同时它还能与多种OOD检测方法灵活组合。RankWeight与RankFeat的组合刷新了最先进性能,在ImageNet-1k基准上将FPR95降低至16.13%。大量消融实验与全面的理论分析进一步支持了上述实证结果。
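The core RankFeat operation described in the abstract (removing the rank-1 matrix formed by the largest singular value and its singular vectors from a high-level feature map) can be sketched as follows. Where the feature is taken from in the network and how the OOD score is computed afterwards are not shown and would follow the paper.

```python
import torch

def rankfeat(feat):
    """RankFeat-style rank-1 removal on a (B, C, H, W) feature map: reshape to
    (B, C, H*W), subtract s1 * u1 @ v1^T per sample, and reshape back."""
    B, C, H, W = feat.shape
    x = feat.reshape(B, C, H * W)
    U, S, Vh = torch.linalg.svd(x, full_matrices=False)
    rank1 = S[:, :1, None] * U[:, :, :1] @ Vh[:, :1, :]   # largest singular component
    return (x - rank1).reshape(B, C, H, W)
```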

High-Order Tensor Recovery with A Tensor $U_1$ Norm

  • paper_url: http://arxiv.org/abs/2311.13958
  • repo_url: https://github.com/jzheng20/jzheng20.github.io
  • paper_authors: Jingjing Zheng, Wenzhe Wang, Xiaoqin Zhang, Yankai Cao, Xianta Jiang
  • for: 本研究的目标是提供一种有效的高阶张量重建技术,能够处理非平滑变化的高阶张量数据,并快速探索高阶张量数据的各个维度之间的相关性,而不需要设置多个变量和权重。
  • methods: 本研究引入了一种新的张量减少方法和一种新的张量范数called Tensor $U_1$ norm。这些新技术用于解决高阶张量完成问题,并提供了对完成后张量模型的正负样本 guarantees。
  • results: 数学分析表明,提案的算法可以将karush-kuhn-tucker (KKT) 点的优化问题解决。实验表明,提案的方法在高阶张量完成问题中表现出色,特别是在tensor数据中存在非平滑变化的情况下。
    Abstract Recently, numerous tensor SVD (t-SVD)-based tensor recovery methods have emerged, showing promise in processing visual data. However, these methods often suffer from performance degradation when confronted with high-order tensor data exhibiting non-smooth changes, commonly observed in real-world scenarios but ignored by the traditional t-SVD-based methods. Our objective in this study is to provide an effective tensor recovery technique for handling non-smooth changes in tensor data and efficiently explore the correlations of high-order tensor data across its various dimensions without introducing numerous variables and weights. To this end, we introduce a new tensor decomposition and a new tensor norm called the Tensor $U_1$ norm. We utilize these novel techniques in solving the problem of high-order tensor completion problem and provide theoretical guarantees for the exact recovery of the resulting tensor completion models. An optimization algorithm is proposed to solve the resulting tensor completion model iteratively by combining the proximal algorithm with the Alternating Direction Method of Multipliers. Theoretical analysis showed the convergence of the algorithm to the Karush-Kuhn-Tucker (KKT) point of the optimization problem. Numerical experiments demonstrated the effectiveness of the proposed method in high-order tensor completion, especially for tensor data with non-smooth changes.
    摘要 近些年,许多基于t-SVD(tensor singular value decomposition)的高维数据恢复方法出现,显示了处理视觉数据的抢手。然而,这些方法常受到高维数据中非平滑变化的影响,这些变化在实际场景中很常见,但是传统的t-SVD基本方法忽略了这些变化。我们的目标在这个研究中是提供一种有效的高维数据恢复技术,能够处理非平滑变化并快速探索高维数据中各维度之间的相关性,而不需要创建大量的变量和负重。为此,我们引入了一种新的高维分解和一种新的高维范数,称为Tensor $U_1$ 范数。我们利用这些新技术解决高维tensor完成问题,并提供了关于恢复模型的理论保证。一种基于 proximal 算法和Alternating Direction Method of Multipliers(ADMM)的优化算法是提出来解决高维tensor completion问题。理论分析表明该算法会 converges to KKT 点。numerical experiments表明我们的方法在高维tensor completion问题中的效果,特别是对于具有非平滑变化的tensor数据。

Electric Network Frequency Optical Sensing Devices

  • paper_url: http://arxiv.org/abs/2311.13954
  • repo_url: https://github.com/chrismoi/Electric-Network-Frequency-Optical-Sensing-Devices
  • paper_authors: Christos Moysiadis, Georgios Karantaidis, Constantine Kotropoulos
  • for: 本研究的目的是在室内照明环境下使用光学传感设备来估计电网频率(ENF)。
  • methods: 研制了一种基于光电二极管(photodiode)的光学传感设备用于捕捉ENF变化,并以直接从电力干线采集ENF的设备作为真实参照;此外,还使用一台摄像机作为第二种光学传感器来估计ENF。
  • results: 研究发现,光学传感设备估计ENF的精度取决于多种因素,包括照明环境、摄像机的位置和光学设备的配置;并提供了广泛的实验证据,证明光学传感设备可以在不同场景下(包括静态场景和含人员活动的场景)准确地估计ENF。
    Abstract Electric Network Frequency (ENF) acts as a fingerprint in multimedia forensics applications. In indoor environments, ENF variations affect the intensity of light sources connected to power mains. Accordingly, the light intensity variations captured by sensing devices can be exploited to estimate the ENF. A first optical sensing device based on a photodiode is developed for capturing ENF variations in indoor lighting environments. In addition, a device that captures the ENF directly from power mains is implemented. This device serves as a ground truth ENF collector. Video recordings captured by a camera are also employed to estimate the ENF. The camera serves as a second optical sensor. The factors affecting the ENF estimation are thoroughly studied. The maximum correlation coefficient between the ENF estimated by the two optical sensors and that estimated directly from power mains is used to measure the estimation accuracy. The paper's major contribution is in the disclosure of extensive experimental evidence on ENF estimation in scenes ranging from static ones capturing a white wall to non-static ones, including human activity.
    摘要 电网频率(ENF)在多媒体取证应用中可作为一种"指纹"。在室内环境中,ENF的波动会影响接入电力干线的光源的亮度,因此可以利用传感设备捕捉到的光强变化来估计ENF。本文首先研制了一种基于光电二极管的光学传感设备,用于在室内照明环境中捕捉ENF变化;同时实现了直接从电力干线采集ENF的设备,作为真实参照。摄像机被用作第二种光学传感器来估计ENF。文中对影响ENF估计的各种因素进行了深入研究,并以两种光学传感器估计的ENF与直接从电力干线估计的ENF之间的最大相关系数来衡量估计精度。本文的主要贡献在于提供了大量实验证据,涵盖从拍摄白墙的静态场景到包含人类活动的非静态场景等多种情形下的ENF估计。
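A rough sketch of the kind of ENF estimation and comparison described above: track the dominant spectral peak near the nominal mains frequency (or one of its harmonics) in short-time windows of the light-intensity signal, then correlate the optical estimate with the mains-derived ground truth. The window length, search band, and peak-picking strategy are assumptions, not the paper's exact parameters.

```python
import numpy as np
from scipy.signal import stft

def estimate_enf(signal, fs, nominal=50.0, band=1.0, nperseg=4096):
    """Hypothetical sketch: per-window peak tracking around the nominal mains
    frequency; returns the time axis and the ENF trace."""
    f, t, Z = stft(signal, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    sel = (f >= nominal - band) & (f <= nominal + band)
    mag = np.abs(Z[sel, :])
    return t, f[sel][np.argmax(mag, axis=0)]

def enf_correlation(enf_a, enf_b):
    """Correlation coefficient between two ENF traces (e.g. optical sensor vs
    power mains), as used in the paper to measure estimation accuracy."""
    n = min(len(enf_a), len(enf_b))
    return np.corrcoef(enf_a[:n], enf_b[:n])[0, 1]
```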

Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning

  • paper_url: http://arxiv.org/abs/2311.13934
  • repo_url: None
  • paper_authors: Seonghak Kim, Gyeongdo Ham, Yucheol Cho, Daeshik Kim
  • for: 提高高效轻量级模型(学生模型)的性能
  • methods: 使用知识储存(KD)技术,通过将知识从更复杂的模型(教师模型)传递给学生模型
  • results: 在不同的图像数据集上,包括 CIFAR-100、FGVR、TinyImagenet 和 ImageNet,与当前状态艺技术相比,提出了更高效的方法 R2KD,并通过数据扩展和网络剔除来提高性能
    Abstract The improvement in the performance of efficient and lightweight models (i.e., the student model) is achieved through knowledge distillation (KD), which involves transferring knowledge from more complex models (i.e., the teacher model). However, most existing KD techniques rely on Kullback-Leibler (KL) divergence, which has certain limitations. First, if the teacher distribution has high entropy, the KL divergence's mode-averaging nature hinders the transfer of sufficient target information. Second, when the teacher distribution has low entropy, the KL divergence tends to excessively focus on specific modes, which fails to convey an abundant amount of valuable knowledge to the student. Consequently, when dealing with datasets that contain numerous confounding or challenging samples, student models may struggle to acquire sufficient knowledge, resulting in subpar performance. Furthermore, in previous KD approaches, we observed that data augmentation, a technique aimed at enhancing a model's generalization, can have an adverse impact. Therefore, we propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning. This approach enables KD to effectively incorporate data augmentation for performance improvement. Extensive experiments on various datasets, including CIFAR-100, FGVR, TinyImagenet, and ImageNet, demonstrate our method's superiority over current state-of-the-art methods.
    摘要 高效轻量级模型(学生模型)的性能提升通常通过知识蒸馏(KD)实现,即将知识从更复杂的模型(教师模型)迁移到学生模型。然而,现有的大多数KD技术依赖Kullback-Leibler(KL)散度,存在一定局限。其一,当教师分布熵较高时,KL散度的"模式平均"特性会妨碍足够目标信息的传递;其二,当教师分布熵较低时,KL散度倾向于过度聚焦于特定模式,无法向学生传递丰富而有价值的知识。因此,在处理包含大量混淆或困难样本的数据集时,学生模型可能难以获得足够的知识,导致性能欠佳。此外,我们还观察到,在以往的KD方法中,旨在增强模型泛化能力的数据增广反而可能带来负面影响。为此,我们提出了一种基于相关距离与网络剪枝的鲁棒性增强知识蒸馏方法(R2KD),使KD能够有效地结合数据增广以提升性能。在CIFAR-100、FGVR、TinyImagenet和ImageNet等多个数据集上的大量实验表明,我们的方法优于当前最先进的方法。
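Since the abstract only states that R2KD replaces the KL term with a correlation distance, here is a heavily hedged sketch of one plausible form of such a loss (1 minus the Pearson correlation between temperature-softened teacher and student distributions). The paper's exact objective, as well as its use of pruning and data augmentation, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def correlation_distance_kd(student_logits, teacher_logits, T=4.0):
    """Hedged sketch of a correlation-distance distillation loss, averaged over
    the batch; the actual R2KD formulation may differ."""
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    p_s = p_s - p_s.mean(dim=1, keepdim=True)     # centre the distributions
    p_t = p_t - p_t.mean(dim=1, keepdim=True)
    corr = (p_s * p_t).sum(dim=1) / (p_s.norm(dim=1) * p_t.norm(dim=1) + 1e-8)
    return (1.0 - corr).mean()
```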

Periodically Exchange Teacher-Student for Source-Free Object Detection

  • paper_url: http://arxiv.org/abs/2311.13930
  • repo_url: None
  • paper_authors: Qipeng Liu, Luojun Lin, Zhifeng Shen, Zhifeng Yang
  • for: 这 paper 的目的是解决源自由物体检测(SFOD)任务中缺乏源领域数据的问题,使学习者模型能够适应目标领域数据。
  • methods: 该 paper 提出了一种 Periodically Exchange Teacher-Student(PETS)方法,它是一种简单又新的方法,它在 MT 框架下引入了多个教师模型,包括静态教师、动态教师和学生模型。在训练阶段,我们在静态教师和学生模型之间进行定期的Weight Exchange,然后更新动态教师使其能够吸收过去的训练经验,从而降低训练过程中的埋设误差。此外,我们还提出了一种协调机制,用于将两个教师模型的预测结果 merge 成更高质量的 pseudo 标签,以便学生模型进行更加稳定的训练。
  • results: 在多个 SFOD 标准测试集上,我们的方法 achieved state-of-the-art 性能,比如其他相关方法更高,这表明我们的方法在 SFOD 任务中表现出了优势。
    Abstract Source-free object detection (SFOD) aims to adapt the source detector to unlabeled target domain data in the absence of source domain data. Most SFOD methods follow the same self-training paradigm using mean-teacher (MT) framework where the student model is guided by only one single teacher model. However, such paradigm can easily fall into a training instability problem that when the teacher model collapses uncontrollably due to the domain shift, the student model also suffers drastic performance degradation. To address this issue, we propose the Periodically Exchange Teacher-Student (PETS) method, a simple yet novel approach that introduces a multiple-teacher framework consisting of a static teacher, a dynamic teacher, and a student model. During the training phase, we periodically exchange the weights between the static teacher and the student model. Then, we update the dynamic teacher using the moving average of the student model that has already been exchanged by the static teacher. In this way, the dynamic teacher can integrate knowledge from past periods, effectively reducing error accumulation and enabling a more stable training process within the MT-based framework. Further, we develop a consensus mechanism to merge the predictions of two teacher models to provide higher-quality pseudo labels for student model. Extensive experiments on multiple SFOD benchmarks show that the proposed method achieves state-of-the-art performance compared with other related methods, demonstrating the effectiveness and superiority of our method on SFOD task.
    摘要 无源目标检测(SFOD)旨在没有源域数据的情况下,使源检测器适应无标注的目标域数据。大多数SFOD方法遵循基于均值教师(MT)框架的自训练范式,其中学生模型仅由单一教师模型指导。然而,这种范式容易陷入训练不稳定的问题:当教师模型因域偏移而不受控制地崩溃时,学生模型的性能也会急剧下降。为解决这一问题,我们提出了周期交换教师-学生(PETS)方法,这是一种简单而新颖的方案,引入由静态教师、动态教师和学生模型组成的多教师框架。在训练阶段,我们周期性地交换静态教师与学生模型的权重,随后利用(已被静态教师交换过的)学生模型的滑动平均来更新动态教师。这样,动态教师能够整合过去各周期的知识,有效减少误差累积,使MT框架下的训练过程更加稳定。此外,我们设计了一种共识机制,将两个教师模型的预测融合,为学生模型提供更高质量的伪标签。在多个SFOD基准上的大量实验表明,所提方法取得了优于其他相关方法的最先进性能,验证了其在SFOD任务上的有效性与优越性。
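A minimal, hypothetical sketch of the exchange schedule described above: every fixed number of steps the static teacher and the student swap weights, while the dynamic teacher tracks the student with an exponential moving average. The consensus pseudo-labeling of the two teachers is omitted, the models are assumed to be torch.nn.Modules, and the period/momentum values are placeholders.

```python
import copy
import torch

@torch.no_grad()
def pets_update(student, static_teacher, dynamic_teacher, step,
                exchange_period=1000, ema_momentum=0.999):
    """One PETS bookkeeping step (hedged sketch, not the authors' code)."""
    if step > 0 and step % exchange_period == 0:
        # periodically exchange weights between the static teacher and the student
        student_state = copy.deepcopy(student.state_dict())
        student.load_state_dict(static_teacher.state_dict())
        static_teacher.load_state_dict(student_state)
    # dynamic teacher = EMA of the (periodically exchanged) student
    for p_t, p_s in zip(dynamic_teacher.parameters(), student.parameters()):
        p_t.mul_(ema_momentum).add_(p_s, alpha=1.0 - ema_momentum)
```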

Attribute-Aware Representation Rectification for Generalized Zero-Shot Learning

  • paper_url: http://arxiv.org/abs/2311.14750
  • repo_url: None
  • paper_authors: Zhijie Rao, Jingcai Guo, Xiaocheng Lu, Qihua Zhou, Jie Zhang, Kang Wei, Chenxin Li, Song Guo
  • for: 本研究旨在提高基于zero-shot学习(GZSL)的视觉Semantic映射的精度,尤其是在不同领域任务/数据集下,提高泛化推理性。
  • methods: 本文提出了一种简单 yet effective的 Attribute-Aware Representation Rectification框架(AR^(2)),用于修正GZSL中的特征提取器,以学习新的特征 whilst keeping valuable original features。该方法包括两个关键组成部分:即 Unseen-Aware Distillation(UAD)和Attribute-Guided Learning(AGL)。
  • results: 在多个 benchmark datasets上,our方法得到了广泛的实验证明,显示了其效果。
    Abstract Generalized Zero-shot Learning (GZSL) has yielded remarkable performance by designing a series of unbiased visual-semantics mappings, wherein, the precision relies heavily on the completeness of extracted visual features from both seen and unseen classes. However, as a common practice in GZSL, the pre-trained feature extractor may easily exhibit difficulty in capturing domain-specific traits of the downstream tasks/datasets to provide fine-grained discriminative features, i.e., domain bias, which hinders the overall recognition performance, especially for unseen classes. Recent studies partially address this issue by fine-tuning feature extractors, while may inevitably incur catastrophic forgetting and overfitting issues. In this paper, we propose a simple yet effective Attribute-Aware Representation Rectification framework for GZSL, dubbed $\mathbf{(AR)^{2}$, to adaptively rectify the feature extractor to learn novel features while keeping original valuable features. Specifically, our method consists of two key components, i.e., Unseen-Aware Distillation (UAD) and Attribute-Guided Learning (AGL). During training, UAD exploits the prior knowledge of attribute texts that are shared by both seen/unseen classes with attention mechanisms to detect and maintain unseen class-sensitive visual features in a targeted manner, and meanwhile, AGL aims to steer the model to focus on valuable features and suppress them to fit noisy elements in the seen classes by attribute-guided representation learning. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our method.
    摘要 通用零分学习(GZSL)已经实现了很好的表现,通过设计一系列不偏见的视 Semantics 映射,其中精度归功于提取到seen和unseen类中的完整的视觉特征。然而,在GZSL中通常存在一个问题,即预训练的特征提取器可能会难以捕捉下游任务/数据集的域特征,导致精度下降,特别是对未经见类型的案例。 latest studies 部分解决了这个问题,通过 fine-tuning 特征提取器,但可能会导致恰等忘记和过拟合问题。在这篇论文中,我们提出了一种简单 yet effective的特征整Rectification 框架,名为 $\mathbf{(AR)^{2}$,用于适应GZSL。我们的方法包括两个关键组成部分:即 Unseen-Aware Distillation (UAD) 和 Attribute-Guided Learning (AGL)。在训练中,UAD 利用seen和unseen类之间共享的属性文本的先前知识,通过注意机制检测和保留未经见类型的视觉特征,同时,AGL 目标是使模型关注有价值的特征,并且压缩其以适应噪音元素在seen类中的表示学习。我们在多个标准数据集上进行了广泛的实验,证明了我们的方法的有效性。

MetaFBP: Learning to Learn High-Order Predictor for Personalized Facial Beauty Prediction

  • paper_url: http://arxiv.org/abs/2311.13929
  • repo_url: https://github.com/metavisionlab/metafbp
  • paper_authors: Luojun Lin, Zhifeng Shen, Jia-Li Yin, Qipeng Liu, Yuanlong Yu, Weijie Chen
  • for: 预测个人美学偏好,可以应用于人类社会的各种场景,如美食、时尚、艺术等领域。
  • methods: 基于meta学的方法,即每个用户对应一个meta任务,以捕捉用户偏好的共同部分和特有部分。
  • results: 提出了一种 MetaFBP 框架,通过一种 universal feature extractor 捕捉美学共同部分,然后通过一种 meta-学机制来适应用户特有的美学偏好。经验表明,该方法可以快速适应用户偏好。
    Abstract Predicting individual aesthetic preferences holds significant practical applications and academic implications for human society. However, existing studies mainly focus on learning and predicting the commonality of facial attractiveness, with little attention given to Personalized Facial Beauty Prediction (PFBP). PFBP aims to develop a machine that can adapt to individual aesthetic preferences with only a few images rated by each user. In this paper, we formulate this task from a meta-learning perspective that each user corresponds to a meta-task. To address such PFBP task, we draw inspiration from the human aesthetic mechanism that visual aesthetics in society follows a Gaussian distribution, which motivates us to disentangle user preferences into a commonality and an individuality part. To this end, we propose a novel MetaFBP framework, in which we devise a universal feature extractor to capture the aesthetic commonality and then optimize to adapt the aesthetic individuality by shifting the decision boundary of the predictor via a meta-learning mechanism. Unlike conventional meta-learning methods that may struggle with slow adaptation or overfitting to tiny support sets, we propose a novel approach that optimizes a high-order predictor for fast adaptation. In order to validate the performance of the proposed method, we build several PFBP benchmarks by using existing facial beauty prediction datasets rated by numerous users. Extensive experiments on these benchmarks demonstrate the effectiveness of the proposed MetaFBP method.
    摘要 预测个人美学偏好具有重要的实际应用和学术意义,但现有研究主要集中在学习和预测共同的脸 Beauty 的吸引力,忽略了个人化的脸 Beauty 预测(PFBP)。PFBP 目标是开发一种可适应个人美学偏好的机器,只需要每名用户提供几张排名过的图像。在这篇论文中,我们从meta-学习的角度来解决这个任务,认为每名用户对应一个meta-任务。为Addressing PFBP任务,我们 drew inspiration from human的美学机制,认为视觉美学在社会中遵循 Gaussian 分布,这种分布驱动我们将用户偏好分解为共同部分和个体部分。为此,我们提出了一个Novel MetaFBP框架,其中我们设计了通用的特征提取器来捕捉美学共同部分,然后通过meta-学习机制来适应个体部分。不同于传统的meta-学习方法可能会降低到慢速适应或过拟合到小支持集,我们提出了一种新的方法,即通过高阶预测器来快速适应。为验证提出的方法效果,我们建立了多个PFBP benchmark,使用现有的脸 Beauty 预测数据集,由多名用户评分。广泛的实验表明,我们的MetaFBP方法具有显著的效果。

Predicting Recovery or Decease of COVID-19 Patients with Clinical and RT-PCR Using Machine Learning Classification Algorithms

  • paper_url: http://arxiv.org/abs/2311.13925
  • repo_url: None
  • paper_authors: Mohammad Dehghani, Zahra Yazdanparast
  • for: This study aims to examine whether machine learning algorithms can predict the outcome of COVID-19 cases (recovery or death) based on the features present in the dataset, and to determine which feature set (clinical or RT-PCR) is more reliable for predicting recovery and decease.
  • methods: The study uses six machine learning methods to build prediction models, including random forest, which showed promising results with an accuracy of 78.7%.
  • results: The study finds that recovery and decease of patients are predictable using machine learning, with clinical alone (without using RT-PCR), trained with AdaBoost algorithm, being the most accurate with an accuracy of 82.1%.
    Abstract The COVID-19 pandemic has disrupted the global economy and people's daily lives in unprecedented ways. To make appropriate decisions, it is necessary to diagnose COVID-19 rapidly and accurately. Clinical decision making is influenced by data collected from patients. With the aid of artificial intelligence, COVID-19 has been diagnosed quickly by analyzing symptoms, polymerase chain reaction (PCR), computed tomography scans, chest X-rays, routine laboratory blood tests and even cough sounds. Furthermore, these data can be used to predict a patient's mortality, although there is a question about which data makes the most accurate predictions. Therefore, this study consists of two parts. Our first objective is to examine whether machine learning algorithms can predict the outcome of COVID-19 cases (recovery or death), based on the features present in the dataset. In the second part of the research, we investigated the impact of clinical and RT-PCR on prediction of recovery and decease to determine which one is more reliable. We defined four stages with different feature sets and used six machine learning methods to build prediction models. With an accuracy of 78.7%, random forest showed promising results for predicting death and recovery of patients. Based on this, it appears that recovery and decease of patients are predictable using machine learning. For the second objective, results indicate that clinical alone (without using RT-PCR), trained with AdaBoost algorithm, is the most accurate with an accuracy of 82.1%. This study can provide guidance for medical professionals in the event of a crisis or outbreak similar to COVID-19.
    摘要 COVID-19 大流行以前所未有的方式扰乱了全球经济和人们的日常生活。为了做出恰当的决策,需要快速而准确地诊断 COVID-19。临床决策受到从患者收集的数据的影响。借助人工智能,可以通过分析症状、聚合酶链式反应(PCR)、CT 成像、胸部 X 射线、常规实验室血液检测乃至咳嗽声等快速诊断 COVID-19。此外,这些数据还可用于预测患者的死亡风险,但哪类数据能给出最准确的预测仍是一个问题。因此,本研究包括两个部分。第一个目标是检验机器学习算法能否根据数据集中的特征预测 COVID-19 病例的结局(康复或死亡)。第二部分则研究临床数据与 RT-PCR 对预测康复和死亡的影响,以确定哪一种更可靠。我们定义了具有不同特征集的四个阶段,并使用六种机器学习方法建立预测模型。随机森林取得了 78.7% 的准确率,在预测患者死亡与康复方面表现良好。由此可见,患者的康复与死亡是可以用机器学习进行预测的。对于第二个目标,结果表明仅使用临床数据(不使用 RT-PCR)、以 AdaBoost 算法训练的模型最为准确,准确率达 82.1%。本研究可在发生类似 COVID-19 的危机或疫情时为医疗专业人员提供指导。
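As a generic illustration of the classification setup described above (not the study's code or data), a few lines with scikit-learn comparing a random forest and AdaBoost on a hypothetical clinical table; the file name and column names are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("covid_clinical.csv")            # hypothetical dataset
X = df.drop(columns=["outcome"])                  # clinical (and/or RT-PCR) features
y = df["outcome"]                                 # 0 = recovery, 1 = decease
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```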

Expanding the deep-learning model to diagnosis LVNC: Limitations and trade-offs

  • paper_url: http://arxiv.org/abs/2311.13912
  • repo_url: None
  • paper_authors: Gregorio Bernabé, Pilar González-Férez, José M. García, Guillem Casas, Josefa González-Carrillo
  • for: 针对各种不同的心脏疾病,包括高 Blood pressure cardiomyopathy 等。
  • methods: 使用深度学习方法(DL-LVTQ),基于 U-Net convolutional neural network 架构,自动诊断左心室 trabeculae 的量化。
  • results: 对于不同的 cardiomyopathy 患者群,DL-LVTQ 的精度、特异性和kappa值均有所提高,而且保持了感知性的高度。 cardiologists 评估表明,98.9%的评估输出都得到了严格的临床诊断。
    Abstract Hyper-trabeculation or non-compaction in the left ventricle of the myocardium (LVNC) is a recently classified form of cardiomyopathy. Several methods have been proposed to quantify the trabeculae accurately in the left ventricle, but there is no general agreement in the medical community to use a particular approach. In previous work, we proposed DL-LVTQ, a deep learning approach for left ventricular trabecular quantification based on a U-Net CNN architecture. DL-LVTQ was an automatic diagnosis tool developed from a dataset of patients with the same cardiomyopathy (hypertrophic cardiomyopathy). In this work, we have extended and adapted DL-LVTQ to cope with patients with different cardiomyopathies. The dataset consists of up 379 patients in three groups with different particularities and cardiomyopathies. Patient images were taken from different scanners and hospitals. We have modified and adapted the U-Net convolutional neural network to account for the different particularities of a heterogeneous group of patients with various unclassifiable or mixed and inherited cardiomyopathies. The inclusion of new groups of patients has increased the accuracy, specificity and kappa values while maintaining the sensitivity of the automatic deep learning method proposed. Therefore, a better-prepared diagnosis tool is ready for various cardiomyopathies with different characteristics. Cardiologists have considered that 98.9% of the evaluated outputs are verified clinically for diagnosis. Therefore, the high precision to segment the different cardiac structures allows us to make a robust diagnostic system objective and faster, decreasing human error and time spent.
    摘要 Left ventricular non-compaction (LVNC) is a newly recognized form of cardiomyopathy. Several methods have been proposed to accurately quantify the trabeculae in the left ventricle, but there is no consensus in the medical community on which approach to use. In previous work, we proposed a deep learning approach called DL-LVTQ, which uses a U-Net CNN architecture to quantify left ventricular trabecular. DL-LVTQ was developed from a dataset of patients with hypertrophic cardiomyopathy. In this study, we extended and adapted DL-LVTQ to accommodate patients with different cardiomyopathies. The dataset consisted of 379 patients in three groups with different characteristics and cardiomyopathies. The patient images were obtained from different scanners and hospitals. We modified and adapted the U-Net convolutional neural network to account for the different characteristics of the heterogeneous group of patients with various unclassifiable or mixed and inherited cardiomyopathies.The inclusion of new groups of patients increased the accuracy, specificity, and kappa values of the automatic deep learning method, while maintaining its sensitivity. Therefore, a more comprehensive diagnosis tool is ready for various cardiomyopathies with different characteristics. Cardiologists have verified that 98.9% of the evaluated outputs are clinically useful for diagnosis. The high precision in segmenting different cardiac structures allows for a more objective and faster diagnosis, reducing human error and the time spent.

Query by Activity Video in the Wild

  • paper_url: http://arxiv.org/abs/2311.13895
  • repo_url: https://github.com/dongzhuoyao/video-query-in-the-wild
  • paper_authors: Tao Hu, William Thong, Pascal Mettes, Cees G. M. Snoek
  • for: 本研究旨在解决视频查询中的活动检索问题,特别是在不均衡的情况下。
  • methods: 本文提出了一种视 Semantic embedding网络,该网络包括两个新模块:视觉对Alignment模块和Semantic模块。视觉对Alignment模块在输入视频和固定大小的视觉银行表示之间进行全局对齐,而Semantic模块在输入视频和固定大小的Semantic活动表示之间进行对齐。这两个模块使得我们不再忽略罕见的活动 durante检索。
  • results: 实验结果表明,我们的方法对所有类型的活动都有效。
    Abstract This paper focuses on activity retrieval from a video query in an imbalanced scenario. In current query-by-activity-video literature, a common assumption is that all activities have sufficient labelled examples when learning an embedding. This assumption does however practically not hold, as only a portion of activities have many examples, while other activities are only described by few examples. In this paper, we propose a visual-semantic embedding network that explicitly deals with the imbalanced scenario for activity retrieval. Our network contains two novel modules. The visual alignment module performs a global alignment between the input video and fixed-sized visual bank representations for all activities. The semantic module performs an alignment between the input video and fixed-sized semantic activity representations. By matching videos with both visual and semantic activity representations that are of equal size over all activities, we no longer ignore infrequent activities during retrieval. Experiments on a new imbalanced activity retrieval benchmark show the effectiveness of our approach for all types of activities.
    摘要

Compositional Zero-shot Learning via Progressive Language-based Observations

  • paper_url: http://arxiv.org/abs/2311.14749
  • repo_url: None
  • paper_authors: Lin Li, Guikun Chen, Jun Xiao, Long Chen
  • for: 这个研究旨在解决Zero-shot learning中的State-object compositions识别问题,通过在训练时 leveraging known primitives (state和object) 来recognize unseen state-object compositions。
  • methods: 这个方法使用Progressive Language-based Observations (PLO),可以 dynamically determine a better observation order of primitives,包括使用预训练的vision-language models (VLMs) 和 large language models (LLMs) 来掌握图像内容。
  • results: 实验结果显示PLO比顶对照方法 superior,能够实现Compositional recognition的能力。
    Abstract Compositional zero-shot learning aims to recognize unseen state-object compositions by leveraging known primitives (state and object) during training. However, effectively modeling interactions between primitives and generalizing knowledge to novel compositions remains a perennial challenge. There are two key factors: object-conditioned and state-conditioned variance, i.e., the appearance of states (or objects) can vary significantly when combined with different objects (or states). For instance, the state "old" can signify a vintage design for a "car" or an advanced age for a "cat". In this paper, we argue that these variances can be mitigated by predicting composition categories based on pre-observed primitive. To this end, we propose Progressive Language-based Observations (PLO), which can dynamically determine a better observation order of primitives. These observations comprise a series of concepts or languages that allow the model to understand image content in a step-by-step manner. Specifically, PLO adopts pre-trained vision-language models (VLMs) to empower the model with observation capabilities. We further devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing classifier dynamically determines the observation order of two primitives. 2) PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to craft composition-specific prompts for step-by-step observing. Extensive ablations on three challenging datasets demonstrate the superiority of PLO compared with state-of-the-art methods, affirming its abilities in compositional recognition.
    摘要 compositional zero-shot learning的目标是通过使用已知基本元素(状态和对象)进行训练,recognize未看过的状态-对象组合。然而,有两个关键因素:对象conditioned和状态conditioned的差异,即不同的对象(或状态)组合可能导致状态(或对象)的外观发生很大的变化。例如,状态“老”可能表示车型的“旧式设计”或“老鼠”的“高龄”。在这篇论文中,我们认为这些差异可以通过基于先前观察的基本元素来缓解。为此,我们提出了进步语言基于观察(PLO),可以动态确定更好的基本元素观察顺序。这些观察包括一系列的概念或语言,allowing the model to understand图像内容以步骤方式。具体来说,PLO使用预训练的视力语言模型(VLM)来赋能模型。我们还开发了两个变体:1)PLO-VLM:一种两步方法,在两个基本元素之间动态确定观察顺序。2)PLO-LLM:一种多步方案,使用大型语言模型(LLM)来为步骤观察编写特定的作业。我们在三个复杂的数据集进行了广泛的ablations,并证明了PLO在compositional recognition中的优越性,证明了它在基本元素观察和组合recognition方面的能力。

PointPCA+: Extending PointPCA objective quality assessment metric

  • paper_url: http://arxiv.org/abs/2311.13880
  • repo_url: None
  • paper_authors: Xuemei Zhou, Evangelos Alexiou, Irene Viola, Pablo Cesar
  • for: 本文提出了一种改进版的PointPCA metric,即PointPCA+,用于评估点云质量。
  • methods: 该方法仅对几何数据使用PCA分解,并以更高的计算效率丰富了原有的几何与纹理描述符。
  • results: 实验结果显示,相对于公开数据集上的主观真实评分,PointPCA+取得了较高的预测性能。
    Abstract A computationally-simplified and descriptor-richer Point Cloud Quality Assessment (PCQA) metric, namely PointPCA+, is proposed in this paper, which is an extension of PointPCA. PointPCA proposed a set of perceptually-relevant descriptors based on PCA decomposition that were applied to both the geometry and texture data of point clouds for full reference PCQA. PointPCA+ employs PCA only on the geometry data while enriching existing geometry and texture descriptors, that are computed more efficiently. Similarly to PointPCA, a total quality score is obtained through a learning-based fusion of individual predictions from geometry and texture descriptors that capture local shape and appearance properties, respectively. Before feature fusion, a feature selection module is introduced to choose the most effective features from a proposed super-set. Experimental results show that PointPCA+ achieves high predictive performance against subjective ground truth scores obtained from publicly available datasets. The code is available at \url{https://github.com/cwi-dis/pointpca_suite/}.
    摘要 本文提出了一种计算更简化、描述子更丰富的点云质量评估(PCQA)度量,即 PointPCA+,它是 PointPCA 的扩展。PointPCA 基于 PCA 分解提出了一系列与感知相关的描述子,并将其同时应用于点云的几何与纹理数据,以进行全参考 PCQA。PointPCA+ 则仅对几何数据使用 PCA,同时以更高的计算效率丰富现有的几何与纹理描述子。与 PointPCA 类似,分别刻画局部形状与外观属性的几何和纹理描述子的预测结果通过基于学习的融合得到总体质量分数;在特征融合之前,还引入特征选择模块,从候选特征超集中挑选最有效的特征。实验结果表明,PointPCA+ 相对于公开数据集上的主观真实评分取得了较高的预测性能。代码可在 \url{https://github.com/cwi-dis/pointpca_suite/} 获取。
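To give a flavour of the PCA-based geometry descriptors that PointPCA/PointPCA+ build on, here is a hedged sketch that eigen-decomposes each point's k-nearest-neighbour covariance and derives a few standard shape statistics; the actual PointPCA+ descriptor set and its texture counterpart are considerably richer than this.

```python
import numpy as np
from scipy.spatial import cKDTree

def local_pca_features(points, k=32):
    """Hedged sketch: per-point eigenvalue-based shape statistics from the
    covariance of the k-nearest-neighbour patch (points: (N, 3) array)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    feats = []
    for nb in points[idx]:                        # nb: (k, 3) neighbourhood
        cov = np.cov(nb.T)
        w = np.sort(np.linalg.eigvalsh(cov))[::-1] + 1e-12   # l1 >= l2 >= l3
        linearity = (w[0] - w[1]) / w[0]
        planarity = (w[1] - w[2]) / w[0]
        sphericity = w[2] / w[0]
        feats.append([linearity, planarity, sphericity])
    return np.asarray(feats)
```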

Language-guided Few-shot Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.13865
  • repo_url: None
  • paper_authors: Jing Wang, Yuang Liu, Qiang Zhou, Fan Wang
  • for: 降低新类适应的标签成本,使用小量、准确标注的支持集来导航模型学习。
  • methods: 提出了一种使用语言信息进行几拍学习 semantic segmentation 的创新解决方案,包括视力语言预训练(VLP)模型和mask整化器,从文本提示生成高质量 Pseudo-semantic 面积。还引入分布式原型监督方法和协同相关匹配模块,引导模型寻找精准的semantic关系。
  • results: 在两个 benchmark 数据集上实验表明,我们的方法可以在语言导航下实现新基eline 的几拍 semantic segmentation,并与最近的视觉导航方法竞争。
    Abstract Few-shot learning is a promising way for reducing the label cost in new categories adaptation with the guidance of a small, well labeled support set. But for few-shot semantic segmentation, the pixel-level annotations of support images are still expensive. In this paper, we propose an innovative solution to tackle the challenge of few-shot semantic segmentation using only language information, i.e.image-level text labels. Our approach involves a vision-language-driven mask distillation scheme, which contains a vision-language pretraining (VLP) model and a mask refiner, to generate high quality pseudo-semantic masks from text prompts. We additionally introduce a distributed prototype supervision method and complementary correlation matching module to guide the model in digging precise semantic relations among support and query images. The experiments on two benchmark datasets demonstrate that our method establishes a new baseline for language-guided few-shot semantic segmentation and achieves competitive results to recent vision-guided methods.
    摘要 新领域适应中减少标签成本的有前途的方法之一是几拍学习。然而,几拍Semantic segmentation的像素级注释仍然是昂贵的。在这篇论文中,我们提出了一种创新的解决方案,即通过语言信息来实现几拍Semantic segmentation,即使没有像素级注释。我们的方法包括一个视力语言预training(VLP)模型和一个mask整理器,以生成高质量的假Semantic masks从文本提示。我们还引入了分布式原型监督方法和补做匹配模块,以导引模型在支持和查询图像之间挖掘精准的semantic关系。实验结果表明,我们的方法在两个 benchmark dataset 上建立了新的基准点,并与最近的视力导向方法匹敌。

Perceptual Image Compression with Cooperative Cross-Modal Side Information

  • paper_url: http://arxiv.org/abs/2311.13847
  • repo_url: None
  • paper_authors: Shiyu Qin, Bin Chen, Yujun Huang, Baoyi An, Tao Dai, Shu-Tao Xia
  • for: This paper aims to enhance image compression by utilizing text-level semantic dependencies to improve the rate-perception tradeoff and semantic distortion.
  • methods: The proposed method employs the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse text and image features, and uses a text-conditional generative adversarial network to improve perceptual quality.
  • results: The proposed approach achieves superior results in terms of rate-perception tradeoff and semantic distortion, as demonstrated by extensive experiments on four datasets and ten image quality assessment metrics.
    Abstract The explosion of data has resulted in more and more associated text being transmitted along with images. Inspired by from distributed source coding, many works utilize image side information to enhance image compression. However, existing methods generally do not consider using text as side information to enhance perceptual compression of images, even though the benefits of multimodal synergy have been widely demonstrated in research. This begs the following question: How can we effectively transfer text-level semantic dependencies to help image compression, which is only available to the decoder? In this work, we propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff. Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features. This is done by predicting a semantic mask to guide the learned text-adaptive affine transformation at the pixel level. Furthermore, we design a text-conditional generative adversarial networks to improve the perceptual quality of reconstructed images. Extensive experiments involving four datasets and ten image quality assessment metrics demonstrate that the proposed approach achieves superior results in terms of rate-perception trade-off and semantic distortion.
    摘要 “数据爆炸”的出现导致更多的相关文本被传输 вместе与图像。启发自分布式源编码,许多研究利用图像侧信息进行图像压缩。然而,现有方法通常不考虑使用文本作为侧信息来提高图像压缩的效果,尽管研究中已经广泛证明了多Modal synergy的好处。这引起了以下问题:如何有效地传输文本水平的semantic依赖度以帮助图像压缩,即使只有解码器可以获得这些信息?在这项工作中,我们提出了一种新的深度图像压缩方法,使用文本指导的侧信息来实现更好的rate-perception-distortion质量衡平衡。具体来说,我们使用CLIP文本编码器和一个有效的Semantic-Spatial Aware块来融合文本和图像特征。这是通过预测一个semantic掩蔽来引导学习文本适应变换来实现的。此外,我们设计了一个文本 conditional Generative Adversarial Networks来提高重建图像的 perceived质量。经验证明了在四个数据集和十个图像质量评价指标下,我们的方法可以获得更好的rate-perception-distortion质量衡平衡和semantic Distortion。

Progressive Learning with Visual Prompt Tuning for Variable-Rate Image Compression

  • paper_url: http://arxiv.org/abs/2311.13846
  • repo_url: None
  • paper_authors: Shiyu Qin, Yimin Zhou, Jinpeng Wang, Bin Chen, Baoyi An, Tao Dai, Shu-Tao Xia
  • for: 这个论文旨在提出一种进步学习 paradigma,用于基于 transformer 的变量速度图像压缩。该方法可以覆盖各种压缩率范围,通过层adaptive Prompt Module (LPM) 的帮助。
  • methods: 该方法使用 LPM 提取输入图像和隐藏特征的批处理,并将其作为附加信息 fed 到 pre-trained transformer 基于图像压缩模型的 Swin Transformer 层中,以影响 allocate 注意区域和比特,从而改变目标压缩比。
  • results: 对比多种方法,包括分立优化的多模型方法,提出的方法可以达到同样的性能,但却占用参数存储的80%和数据集的90%。同时,我们的模型超越了现有的变量压缩图像方法,并接近 fixes 压缩图像压缩方法。
    Abstract In this paper, we propose a progressive learning paradigm for transformer-based variable-rate image compression. Our approach covers a wide range of compression rates with the assistance of the Layer-adaptive Prompt Module (LPM). Inspired by visual prompt tuning, we use LPM to extract prompts for input images and hidden features at the encoder side and decoder side, respectively, which are fed as additional information into the Swin Transformer layer of a pre-trained transformer-based image compression model to affect the allocation of attention region and the bits, which in turn changes the target compression ratio of the model. To ensure the network is more lightweight, we involves the integration of prompt networks with less convolutional layers. Exhaustive experiments show that compared to methods based on multiple models, which are optimized separately for different target rates, the proposed method arrives at the same performance with 80% savings in parameter storage and 90% savings in datasets. Meanwhile, our model outperforms all current variable bitrate image methods in terms of rate-distortion performance and approaches the state-of-the-art fixed bitrate image compression methods trained from scratch.
    摘要 在本文中,我们提出了一种进程式学习模式,用于基于变换器的变量比例图像压缩。我们的方法覆盖了各种压缩率的范围,通过层 adaptive 提示模块 (LPM) 的帮助。以视觉提示调整为 inspirations,我们使用 LPM 提取输入图像和隐藏特征在encoder和decoder两侧的提示,并将其作为附加信息传递给 pré-train 的 transformer 基于图像压缩模型的 Swin Transformer 层,以影响 allocating 注意力区域和比特,从而改变模型的目标压缩率。为了使模型更轻量级,我们涉及了提示网络的集成,并减少了 convolutional 层数。经过 exhaustive 的实验表明,与基于多个模型优化的方法相比,我们的方法可以达到同等性能,并且占用参数存储空间的80%和数据集的90%。同时,我们的模型超越了当前的变量比例图像压缩方法,并接近预先训练的 fixes 比例图像压缩方法。

HOMOE: A Memory-Based and Composition-Aware Framework for Zero-Shot Learning with Hopfield Network and Soft Mixture of Experts

  • paper_url: http://arxiv.org/abs/2311.14747
  • repo_url: None
  • paper_authors: Do Huu Dat, Po Yuan Mao, Tien Hoang Nguyen, Wray Buntine, Mohammed Bennamoun
  • for: This paper focuses on the problem of zero-shot learning, specifically the challenge of managing unfamiliar combinations of seen and unseen classes. It proposes a novel framework that combines the Modern Hopfield Network with a Mixture of Experts (HOMOE) to classify the compositions of previously unseen objects.
  • methods: The proposed framework uses the Modern Hopfield Network to create a memory that stores label prototypes and identifies relevant labels for a given input image. The Mixture of Experts then integrates the image with the fitting prototype to produce the final composition classification.
  • results: The approach achieves state-of-the-art (SOTA) performance on several benchmarks, including MIT-States and UT-Zappos. The paper also examines how each component contributes to improved generalization.
    Abstract Compositional Zero-Shot Learning (CZSL) has emerged as an essential paradigm in machine learning, aiming to overcome the constraints of traditional zero-shot learning by incorporating compositional thinking into its methodology. Conventional zero-shot learning has difficulty managing unfamiliar combinations of seen and unseen classes because it depends on pre-defined class embeddings. In contrast, Compositional Zero-Shot Learning uses the inherent hierarchies and structural connections among classes, creating new class representations by combining attributes, components, or other semantic elements. In our paper, we propose a novel framework that for the first time combines the Modern Hopfield Network with a Mixture of Experts (HOMOE) to classify the compositions of previously unseen objects. Specifically, the Modern Hopfield Network creates a memory that stores label prototypes and identifies relevant labels for a given input image. Following this, the Mixture of Expert models integrates the image with the fitting prototype to produce the final composition classification. Our approach achieves SOTA performance on several benchmarks, including MIT-States and UT-Zappos. We also examine how each component contributes to improved generalization.
    摘要 组合式零样本学习(CZSL)已成为机器学习中的一种重要范式,旨在通过将组合式思维引入方法论来克服传统零样本学习的局限。传统零样本学习依赖预先定义的类别嵌入,因而难以处理由已见与未见类别构成的陌生组合;CZSL 则利用类别间固有的层次结构与语义联系,通过组合属性、部件或其他语义元素来构造新的类别表示。在本文中,我们提出了一种新框架,首次将现代 Hopfield 网络与专家混合模型(HOMOE)结合,用于对此前未见对象的组合进行分类。具体而言,现代 Hopfield 网络构建一个存储标签原型的记忆,为给定输入图像识别相关标签;随后,专家混合模型将图像与匹配的原型进行整合,给出最终的组合分类结果。我们的方法在 MIT-States 和 UT-Zappos 等多个基准上取得了最先进(SOTA)的性能,并分析了各组件对提升泛化能力的贡献。
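The Modern Hopfield retrieval step that the memory component relies on can be sketched as a standard continuous-Hopfield update (not the paper's full pipeline): queries attend over stored label prototypes and retrieve a convex combination of them, with the attention weights usable as label relevance. The inverse temperature `beta` and the single-step setting are assumptions.

```python
import torch
import torch.nn.functional as F

def hopfield_retrieve(query, memory, beta=8.0, steps=1):
    """Continuous (modern) Hopfield retrieval: query (B, d), memory (N, d) of
    stored prototypes; each step applies xi <- softmax(beta * memory @ xi)^T @ memory."""
    xi = query
    for _ in range(steps):
        attn = F.softmax(beta * xi @ memory.t(), dim=-1)   # (B, N) label relevance
        xi = attn @ memory                                  # (B, d) retrieved pattern
    return xi, attn
```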

Posterior Distillation Sampling

  • paper_url: http://arxiv.org/abs/2311.13831
  • repo_url: https://github.com/Posterior-Distillation-Sampling/posterior-distillation-sampling.github.io
  • paper_authors: Juil Koo, Chanho Park, Minhyuk Sung
  • for: 这篇论文主要是为了提出一种新的优化方法,帮助实现parametric图像编辑。
  • methods: 该方法基于diffusion模型,并利用二维的律动质量来处理parametric图像。
  • results: 该方法可以帮助实现一种平衡 между符合目标特征和保持源内容的identiy,并且可以在不同的参数空间中采样目标。
    Abstract We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces.
    摘要 我们介绍Posterior Distillation Sampling(PDS),一种新的优化方法 дляParametric Image Editing基于扩散模型。现有的优化基于方法,通过利用强大的2D假设扩散模型来处理各种参数图像,主要侧重于生成。不过,编译需要寻求参数空间中的平衡,既要遵循目标属性,又要保持来源内容的身份。现有的2D图像编译方法通过扩散模型的生成过程中的随机内码来实现这个平衡。为了将这些编译能力从像素空间延伸到参数空间,我们将这种2D图像编译方法改编为名为PDS的优化形式。PDS将目标和来源的随机内码匹配,使得目标在参数空间中的样本可以遵循一个想定的属性,同时保持来源内容的身份。我们显示了将这个优化视为在目标属性下的生成过程,并与来源生成过程的路径进行对齐,可以在各种参数空间中获得具有均衡的样本。我们的实验结果显示,PDS可以在Neural Radiance Fields和Scalable Vector Graphics表示中进行具有均衡的样本。

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

  • paper_url: http://arxiv.org/abs/2311.13793
  • repo_url: None
  • paper_authors: Lei Fan, Mingfu Liang, Yunxuan Li, Gang Hua, Ying Wu
  • for: 本研究旨在提高机器人在未知观察环境下的智能探索能力,以获取更多信息而不是避免不良观察条件。
  • methods: 本研究使用了学习策略从 simulated或收集的数据中提取相应的动作,当recognition准确时选择更多的合适的动作。
  • results: 本研究表明,引入uncertainty可以提高活动认识的表现,并且提出了一种基于证据组合理论的sequential evidence-gathering процесsto提供步骤的uncertainty量化和可靠预测。
    Abstract Active recognition enables robots to intelligently explore novel observations, thereby acquiring more information while circumventing undesired viewing conditions. Recent approaches favor learning policies from simulated or collected data, wherein appropriate actions are more frequently selected when the recognition is accurate. However, most recognition modules are developed under the closed-world assumption, which makes them ill-equipped to handle unexpected inputs, such as the absence of the target object in the current observation. To address this issue, we propose treating active recognition as a sequential evidence-gathering process, providing by-step uncertainty quantification and reliable prediction under the evidence combination theory. Additionally, the reward function developed in this paper effectively characterizes the merit of actions when operating in open-world environments. To evaluate the performance, we collect a dataset from an indoor simulator, encompassing various recognition challenges such as distance, occlusion levels, and visibility. Through a series of experiments on recognition and robustness analysis, we demonstrate the necessity of introducing uncertainties to active recognition and the superior performance of the proposed method.
    摘要 活动识别可以让机器人智能地探索新的观察结果,从而获得更多的信息而不是避免不想要的观察条件。 latest approaches 倾向于从 simulated 或 collected 数据学习策略,其中适当的动作更加频繁地被选择当 recognition 是准确的。然而,大多数 recognition 模块是在closed-world assumption下开发的,这使得它们无法处理意外输入,如目标对象在当前观察中的缺失。为解决这个问题,我们提议将活动识别看作是一个顺序的证据收集过程,提供步骤不确定性评估和可靠预测基于证据组合理论。此外,我们在这篇论文中提出的奖励函数有效地描述在开放世界环境中操作时的动作的价值。为评估表现,我们从室内 simulator 中收集了一个数据集,包括识别挑战的多种因素,如距离、遮挡水平和可见度。通过一系列的识别和Robustness 分析,我们证明了引入不确定性是活动识别的必要和我们的方法的超越性。
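For intuition about the step-wise uncertainty quantification mentioned above, here is a hedged sketch in the common Dirichlet/subjective-logic style (evidence from non-negative activations, vacuity u = K/S). The paper's exact evidential formulation and its combination rule across sequential observations may differ from this.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits):
    """Hedged sketch: map classifier logits to Dirichlet evidence and derive
    per-sample belief, vacuity (uncertainty), and expected probabilities."""
    evidence = F.relu(logits)                  # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    S = alpha.sum(dim=-1, keepdim=True)
    belief = evidence / S
    uncertainty = logits.shape[-1] / S         # high when little evidence is collected
    prob = alpha / S                           # expected class probabilities
    return belief, uncertainty, prob
```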

All in One: RGB, RGB-D, and RGB-T Salient Object Detection

  • paper_url: http://arxiv.org/abs/2311.14746
  • repo_url: None
  • paper_authors: Xingzhao Jia, Zhongqiu Zhao, Changlei Dongye, Zhao Zhang
  • for: Proposes a single salient object detection model (AiOSOD) that handles RGB, RGB-D, and RGB-T data with one set of weights, improving detection performance across data types.
  • methods: The three data types are concatenated in an ordered manner within a single input batch, and features are extracted with a transformer network.
  • results: AiOSOD runs at high speed (780 FPS on RGB data, 485 FPS on RGB-D or RGB-T data) with only 6.25M parameters, while achieving excellent performance on RGB, RGB-D, and RGB-T datasets.
    Abstract Salient object detection (SOD) aims to identify the most attractive objects within an image. Depending on the type of data being detected, SOD can be categorized into various forms, including RGB, RGB-D (Depth), RGB-T (Thermal) and light field SOD. Previous research has focused on saliency detection with individual data types. If an RGB-D SOD model is forced to detect RGB-T data, it will perform poorly. We propose an innovative model framework that provides a unified solution for the salient object detection task of three types of data (RGB, RGB-D, and RGB-T). The three types of data can be handled in one model (all in one) with the same weight parameters. In this framework, the three types of data are concatenated in an ordered manner within a single input batch, and features are extracted using a transformer network. Based on this framework, we propose an efficient lightweight SOD model, namely AiOSOD, which can detect any RGB, RGB-D, and RGB-T data with high speed (780FPS for RGB data, 485FPS for RGB-D or RGB-T data). Notably, with only 6.25M parameters, AiOSOD achieves excellent performance on RGB, RGB-D, and RGB-T datasets.
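The "all in one" input scheme can be pictured with the following sketch: an RGB image and its optional paired depth or thermal map (replicated to three channels) are stacked in a fixed order inside one batch and passed through a single shared backbone. The class and its interface are illustrative assumptions rather than the released model.

```python
import torch
import torch.nn as nn

class AllInOneBatch(nn.Module):
    """Sketch of a shared-weight pipeline for RGB, RGB-D, and RGB-T inputs (details assumed)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone                        # e.g. a lightweight transformer encoder

    def forward(self, rgb, aux=None):
        # rgb: (B, 3, H, W); aux: optional (B, 1, H, W) depth or thermal map
        inputs = [rgb]
        if aux is not None:
            inputs.append(aux.repeat(1, 3, 1, 1))       # modality map reused as a 3-channel image
        batch = torch.cat(inputs, dim=0)                # ordered concatenation along the batch axis
        return self.backbone(batch)                     # one set of weights for every modality
```

Because every modality passes through the same weights, one checkpoint serves RGB, RGB-D, and RGB-T inputs, which is what makes the single 6.25M-parameter model described above possible.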

Dynamic Compositional Graph Convolutional Network for Efficient Composite Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2311.13781
  • repo_url: https://github.com/oliviazwy/dcgcn
  • paper_authors: Wanying Zhang, Shen Zhao, Fanyang Meng, Songtao Wu, Mengyuan Liu
  • for: Addresses the human motion prediction task, with a focus on the challenge of predicting composite actions.
  • methods: Proposes a Composite Action Generation (CAG) module that synthesizes composite actions for training, and a Dynamic Compositional Graph Convolutional Network (DC-GCN) to handle them.
  • results: Achieves state-of-the-art motion prediction accuracy on the Human3.6M dataset and the newly collected CHAMP dataset, with few extra computational costs compared with traditional GCN-based human motion methods.
    Abstract With potential applications in fields including intelligent surveillance and human-robot interaction, the human motion prediction task has become a hot research topic and has achieved high success, especially using the recent Graph Convolutional Network (GCN). Current human motion prediction tasks usually focus on predicting human motions for atomic actions. Observing that atomic actions can happen at the same time and thus form composite actions, we propose the composite human motion prediction task. To handle this task, we first present a Composite Action Generation (CAG) module to generate synthetic composite actions for training, thus avoiding the laborious work of collecting composite action samples. Moreover, we alleviate the need for a more complicated model to handle composite actions by presenting a Dynamic Compositional Graph Convolutional Network (DC-GCN). Extensive experiments on the Human3.6M dataset and our newly collected CHAMP dataset consistently verify the efficiency of our DC-GCN method, which achieves state-of-the-art motion prediction accuracies while incurring few extra computational costs compared with traditional GCN-based human motion methods.
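For intuition, a toy version of composite-action synthesis might splice body parts from two atomic-action pose sequences, for example combining the legs of "walking" with the arms of "waving". The joint grouping and the splicing rule below are hypothetical and far simpler than the paper's CAG module.

```python
import numpy as np

# Hypothetical joint grouping; the paper's part split is not specified here.
UPPER_BODY = [9, 10, 11, 12, 13, 14, 15, 16]   # e.g. spine, head, arms (indices assumed)
LOWER_BODY = [0, 1, 2, 3, 4, 5, 6, 7, 8]       # e.g. hips and legs (indices assumed)

def synthesize_composite_action(motion_a, motion_b):
    """Toy composite-action generator in the spirit of the CAG module (assumed mechanism).

    motion_a, motion_b: arrays of shape (T, J, 3) holding two atomic-action pose sequences.
    Lower-body joints are kept from the first action while upper-body joints are taken
    from the second, producing a synthetic sample such as 'walking while waving'.
    """
    T = min(len(motion_a), len(motion_b))
    composite = motion_a[:T].copy()
    composite[:, UPPER_BODY] = motion_b[:T, UPPER_BODY]
    return composite
```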

Detection and Identification Accuracy of PCA-Accelerated Real-Time Processing of Hyperspectral Imagery

  • paper_url: http://arxiv.org/abs/2311.13779
  • repo_url: None
  • paper_authors: Abigail Basener, Meagan Herald
  • for: Aims to speed up real-time or near-real-time hyperspectral detection and identification, which are needed in many fields but slowed by large data sets and heavy computation.
  • methods: Uses principal component analysis (PCA) for dimension reduction so that fewer computations are needed; detection is performed with ACE, followed by probability and spectral-fit analysis for identification.
  • results: A substantial number of principal components can be omitted before a noticeable change in detection rates appears, allowing considerably faster processing.
    Abstract Real-time or near real-time hyperspectral detection and identification are extremely useful and needed in many fields. These data sets can be quite large, and the algorithms can require numerous computations that slow the process down. A common way of speeding up the process is to use principal component analysis (PCA) for dimension reduction. In the reduced dimensional space, provided by a subset of the principal components, fewer computations are needed to process the data resulting in a faster run time. In this paper, we propose a way to further decrease the time required to use PCA by investigating how many principal components may be omitted with minimal impact on the detection rate. Using ACE to perform the detection, and then probability, and spectral fit for identification, we find that the number of principal components can be reduced by a substantial amount before seeing a noticeable change in detection rates.
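The pipeline being timed is essentially "project onto the leading principal components, then run the detector in the reduced space". Below is a minimal NumPy sketch assuming the standard ACE formulation; the identification stage (probability and spectral fit) is omitted, and the function names are illustrative.

```python
import numpy as np

def pca_reduce(cube, k):
    """Project an (N, B) matrix of hyperspectral pixels onto the top-k principal components."""
    mean = cube.mean(axis=0)
    centered = cube - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k].T                                 # (B, k) leading components
    return centered @ basis, mean, basis

def ace_scores(pixels, target):
    """Adaptive Cosine Estimator scores in the (already reduced) feature space."""
    mu = pixels.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(pixels, rowvar=False))
    s = target - mu
    x = pixels - mu
    num = (x @ cov_inv @ s) ** 2
    den = (s @ cov_inv @ s) * np.einsum('ij,jk,ik->i', x, cov_inv, x)
    return num / np.maximum(den, 1e-12)

# Usage sketch: keep only k components before running the detector.
# reduced, mean, basis = pca_reduce(cube_pixels, k=10)
# scores = ace_scores(reduced, (target_spectrum - mean) @ basis)
```

Sweeping `k` downward until detection rates start to drop mirrors the experiment the paper describes: the fewer components retained, the cheaper each ACE evaluation becomes.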

GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence

  • paper_url: http://arxiv.org/abs/2311.13777
  • repo_url: None
  • paper_authors: Pengyuan Wang, Takuya Ikeda, Robert Lee, Koichi Nishiwaki
  • for: Improves category-level object pose estimation without relying on large sets of pose-labelled real images or carefully tuned photorealistic simulators.
  • methods: Combines geometric and semantic features from a pre-trained foundation model; 2D features are projected into 3D for a single object model per category, and a trained matching network matches new single-view observations of unseen instances against this model.
  • results: Achieves better pose-estimation performance than prior methods while requiring only a fraction of the training data, since the semantic features are robust to object texture and appearance; this is demonstrated through a rich evaluation.
    Abstract Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but are typically hindered by the need for large datasets of either pose-labelled real images or carefully tuned photorealistic simulators. This can be avoided by using only geometry inputs such as depth images to reduce the domain gap, but these approaches suffer from a lack of semantic information, which can be vital in the pose estimation problem. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D features from this foundation model into 3D for a single object model per category, and then performs matching against this for new single-view observations of unseen object instances with a trained matching network. This requires significantly less data to train than prior methods since the semantic features are robust to object texture and appearance. We demonstrate this with a rich evaluation, showing improved performance over prior methods with a fraction of the data required.
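A stripped-down version of the "lift features to 3D, then match a new view" idea is sketched below. Cosine similarity stands in for the trained matching network, and a closed-form rigid fit turns the 3D-3D correspondences into a pose; all names and the exact pipeline are assumptions.

```python
import numpy as np

def match_and_estimate_pose(obs_feats, obs_points, model_feats, model_points):
    """Sketch of semantic-feature matching followed by a rigid fit (details assumed).

    obs_feats:    (N, D) per-pixel foundation-model features from the single-view observation
    obs_points:   (N, 3) the corresponding back-projected 3D points (from depth)
    model_feats:  (M, D) features attached to the category-level 3D model points
    model_points: (M, 3) the model points themselves
    """
    # Cosine similarity stands in for the trained matching network of the paper.
    a = obs_feats / np.linalg.norm(obs_feats, axis=1, keepdims=True)
    b = model_feats / np.linalg.norm(model_feats, axis=1, keepdims=True)
    nn = (a @ b.T).argmax(axis=1)                    # best model point per observed pixel
    src, dst = model_points[nn], obs_points

    # Closed-form rigid alignment (Kabsch) from the 3D-3D correspondences.
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    R = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t
```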

Sample-Efficient Training for Diffusion

  • paper_url: http://arxiv.org/abs/2311.13745
  • repo_url: https://github.com/YangYuSCU/DE-PINN
  • paper_authors: Shivam Gupta, Aditya Parulekar, Eric Price, Zhiyang Xun
  • for: This paper focuses on improving the efficiency of score-based diffusion models for deep generative modeling of images.
  • methods: The paper uses a score-matching objective to estimate the score function, which is then used for sampling. The authors show that estimating the score in $L^2$ requires a polynomial dependence on the data radius and desired Wasserstein accuracy, but that a polylogarithmic number of samples suffice for sampling.
  • results: The paper shows that with a polylogarithmic number of samples, the ERM of the score-matching objective is $L^2$ accurate on all but a probability $\delta$ fraction of the true distribution, and that this weaker guarantee is sufficient for efficient sampling.
    Abstract Score-based diffusion models have become the most popular approach to deep generative modeling of images, largely due to their empirical performance and reliability. Recently, a number of theoretical works \citep{chen2022, Chen2022ImprovedAO, Chenetal23flowode, benton2023linear} have shown that diffusion models can efficiently sample, assuming $L^2$-accurate score estimates. The score-matching objective naturally approximates the true score in $L^2$, but the sample complexity of existing bounds depends \emph{polynomially} on the data radius and desired Wasserstein accuracy. By contrast, the time complexity of sampling is only logarithmic in these parameters. We show that estimating the score in $L^2$ \emph{requires} this polynomial dependence, but that a number of samples that scales polylogarithmically in the Wasserstein accuracy actually do suffice for sampling. We show that with a polylogarithmic number of samples, the ERM of the score-matching objective is $L^2$ accurate on all but a probability $\delta$ fraction of the true distribution, and that this weaker guarantee is sufficient for efficient sampling.
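For reference, the score-matching objective discussed above can be written in its standard denoising form (generic notation, assuming the usual Gaussian forward process; the paper's precise setup may differ). The ERM variant replaces the outer expectation over $p_{\mathrm{data}}$ with an average over the $n$ training samples:

$$
\min_{\theta}\;\mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\,\mathbb{E}_{t}\,\mathbb{E}_{x_t \mid x_0}\!\left[\big\| s_\theta(x_t, t) - \nabla_{x_t}\log q_t(x_t \mid x_0)\big\|_2^2\right],
\qquad
\nabla_{x_t}\log q_t(x_t \mid x_0) = -\,\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1 - \bar\alpha_t}.
$$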