cs.CV - 2023-09-30

Assessing the Generalizability of Deep Neural Networks-Based Models for Black Skin Lesions

  • paper_url: http://arxiv.org/abs/2310.00517
  • repo_url: https://github.com/httplups/black-acral-skin-lesion-detection
  • paper_authors: Luana Barros, Levy Chaves, Sandra Avila
  • for: Detecting skin cancer, particularly lesions in acral regions (palms, soles, and nails) commonly observed in black individuals.
  • methods: Evaluates supervised and self-supervised deep neural network models on skin lesion images from acral regions, using a carefully curated dataset assessed against the Fitzpatrick scale.
  • results: Existing models generalize poorly to black skin, performing well only on lesions on white skin, which exposes their lack of consistency across skin tones.
    Abstract Melanoma is the most severe type of skin cancer due to its ability to cause metastasis. It is more common in black people, often affecting acral regions: palms, soles, and nails. Deep neural networks have shown tremendous potential for improving clinical care and skin cancer diagnosis. Nevertheless, prevailing studies predominantly rely on datasets of white skin tones, neglecting to report diagnostic outcomes for diverse patient skin tones. In this work, we evaluate supervised and self-supervised models in skin lesion images extracted from acral regions commonly observed in black individuals. Also, we carefully curate a dataset containing skin lesions in acral regions and assess the datasets concerning the Fitzpatrick scale to verify performance on black skin. Our results expose the poor generalizability of these models, revealing their favorable performance for lesions on white skin. Neglecting to create diverse datasets, which necessitates the development of specialized models, is unacceptable. Deep neural networks have great potential to improve diagnosis, particularly for populations with limited access to dermatology. However, including black skin lesions is necessary to ensure these populations can access the benefits of inclusive technology.
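The generalizability gap reported above is, in essence, a stratified evaluation: a fixed model is scored separately on lesions grouped by Fitzpatrick skin type. A minimal sketch of that protocol follows; the group labels, predictions, and the choice of balanced accuracy are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch: stratified evaluation of a fixed classifier across Fitzpatrick skin-type groups.
# Assumes arrays of ground-truth labels, model predictions, and a Fitzpatrick type per image.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def per_skin_tone_report(y_true, y_pred, fitzpatrick):
    """Balanced accuracy computed separately for each Fitzpatrick type present in the data."""
    y_true, y_pred, fitzpatrick = map(np.asarray, (y_true, y_pred, fitzpatrick))
    return {int(tone): balanced_accuracy_score(y_true[fitzpatrick == tone],
                                               y_pred[fitzpatrick == tone])
            for tone in np.unique(fitzpatrick)}

# Toy usage: a classifier that errs more often on darker skin types (V-VI).
rng = np.random.default_rng(0)
tones = rng.integers(1, 7, size=600)
y_true = rng.integers(0, 2, size=600)
flip = rng.random(600) < np.where(tones >= 5, 0.45, 0.10)
y_pred = np.where(flip, 1 - y_true, y_true)
print(per_skin_tone_report(y_true, y_pred, tones))
```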

Exploring SAM Ablations for Enhancing Medical Segmentation in Radiology and Pathology

  • paper_url: http://arxiv.org/abs/2310.00504
  • repo_url: None
  • paper_authors: Amin Ranem, Niklas Babendererde, Moritz Fuchs, Anirban Mukhopadhyay
  • for: Exploring the application of the Segment Anything Model (SAM) in medical imaging to improve segmentation accuracy and reliability.
  • methods: Breaks down SAM's fundamental components and the interactions between them, and fine-tunes the model to assess its impact on segmentation quality.
  • results: Carefully designed experiments show SAM's strong potential for radiology (brain tumor segmentation) and pathology (breast cancer segmentation), helping bridge advanced segmentation techniques and the demands of healthcare.
    Abstract Medical imaging plays a critical role in the diagnosis and treatment planning of various medical conditions, with radiology and pathology heavily reliant on precise image segmentation. The Segment Anything Model (SAM) has emerged as a promising framework for addressing segmentation challenges across different domains. In this white paper, we delve into SAM, breaking down its fundamental components and uncovering the intricate interactions between them. We also explore the fine-tuning of SAM and assess its profound impact on the accuracy and reliability of segmentation results, focusing on applications in radiology (specifically, brain tumor segmentation) and pathology (specifically, breast cancer segmentation). Through a series of carefully designed experiments, we analyze SAM's potential application in the field of medical imaging. We aim to bridge the gap between advanced segmentation techniques and the demanding requirements of healthcare, shedding light on SAM's transformative capabilities.
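Segmentation accuracy in ablations like these is usually summarized with an overlap metric such as the Dice coefficient. A small reference implementation for binary masks is sketched below; the smoothing constant is only there to keep empty masks well defined and is an assumption, not something taken from the paper.

```python
# Sketch: Dice coefficient for binary segmentation masks, the usual way radiology and
# pathology segmentation quality is scored. Inputs are boolean or {0, 1} arrays.
import numpy as np

def dice(pred, target, eps=1e-7):
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy usage: two partially overlapping square masks.
a = np.zeros((64, 64), dtype=bool); a[10:40, 10:40] = True
b = np.zeros((64, 64), dtype=bool); b[20:50, 20:50] = True
print(f"Dice = {dice(a, b):.3f}")
```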

Black-box Attacks on Image Activity Prediction and its Natural Language Explanations

  • paper_url: http://arxiv.org/abs/2310.00503
  • repo_url: None
  • paper_authors: Alina Elena Baia, Valentina Poggioni, Andrea Cavallaro
  • for: Assessing, for the first time, the robustness to black-box attacks of the natural language explanations generated by a self-rationalizing image-based activity recognition model.
  • methods: Generates unrestricted, spatially variant perturbations that disrupt the association between the model's predictions and the corresponding explanations, using only its final output.
  • results: The explanations are easily manipulated: adversarial images crafted with access only to the model's final output mislead it into producing unfaithful explanations.
    Abstract Explainable AI (XAI) methods aim to describe the decision process of deep neural networks. Early XAI methods produced visual explanations, whereas more recent techniques generate multimodal explanations that include textual information and visual representations. Visual XAI methods have been shown to be vulnerable to white-box and gray-box adversarial attacks, with an attacker having full or partial knowledge of and access to the target system. As the vulnerabilities of multimodal XAI models have not been examined, in this paper we assess for the first time the robustness to black-box attacks of the natural language explanations generated by a self-rationalizing image-based activity recognition model. We generate unrestricted, spatially variant perturbations that disrupt the association between the predictions and the corresponding explanations to mislead the model into generating unfaithful explanations. We show that we can create adversarial images that manipulate the explanations of an activity recognition model by having access only to its final output.
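The attack assumes query access to the model's final output only. A generic random-search sketch of such a black-box perturbation loop is shown below; the model interface, the pixel budget, and the use of an explanation-dissimilarity score are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch: query-only (black-box) search for an image perturbation that pushes the model's
# generated explanation away from the original one. `image` is a float array in [0, 1];
# `model_query(img)` returns an explanation string and `score_fn(a, b)` their dissimilarity.
import numpy as np

def black_box_explanation_attack(image, model_query, score_fn, eps=0.05, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    original = model_query(image)
    best_delta, best_score = np.zeros_like(image), 0.0
    for _ in range(steps):
        candidate = np.clip(best_delta + rng.normal(0.0, eps / 4, image.shape), -eps, eps)
        score = score_fn(original, model_query(np.clip(image + candidate, 0.0, 1.0)))
        if score > best_score:      # keep the perturbation only if the explanation drifts further
            best_delta, best_score = candidate, score
    return np.clip(image + best_delta, 0.0, 1.0), best_score
```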

Small Visual Language Models can also be Open-Ended Few-Shot Learners

  • paper_url: http://arxiv.org/abs/2310.00500
  • repo_url: None
  • paper_authors: Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees G. M. Snoek, Marcel Worring, Yuki M. Asano
  • for: Unlocking open-ended few-shot abilities of small visual language models (about 1B parameters), which then surpass the few-shot abilities of much larger models such as Frozen and FROMAGe.
  • methods: Proposes Self-Context Adaptation (SeCAt), which learns from symbolic yet self-supervised training tasks built by clustering a large pool of images and assigning semantically unrelated names to the clusters.
  • results: Strong performance and flexibility on several multimodal few-shot datasets spanning various granularities, achieved with roughly 1B-parameter models rather than large or proprietary ones.
    Abstract We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks. Specifically, our approach imitates image captions in a self-supervised way based on clustering a large pool of images followed by assigning semantically-unrelated names to clusters. By doing so, we construct the `self-context', a training signal consisting of interleaved sequences of image and pseudo-caption pairs and a query image for which the model is trained to produce the right pseudo-caption. We demonstrate the performance and flexibility of SeCAt on several multimodal few-shot datasets, spanning various granularities. By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe. SeCAt opens new possibilities for research in open-ended few-shot learning that otherwise requires access to large or proprietary models.
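The training signal described above can be sketched as a data-construction step: cluster unlabeled image features, give every cluster a deliberately unrelated pseudo-name, and sample episodes of (image, pseudo-caption) pairs plus a query whose pseudo-caption the model must produce. The code below is such a sketch under assumed inputs (a frozen feature extractor, a cluster count, a made-up name pool); it is not the authors' implementation.

```python
# Sketch: building SeCAt-style "self-context" episodes from unlabeled image features.
# `features` is an (N, D) array from any frozen image encoder; pseudo-names are arbitrary
# strings deliberately unrelated to the images' true semantics. Assumes every sampled
# cluster contains at least `shots` + 1 images.
import numpy as np
from sklearn.cluster import KMeans

def build_self_context(features, n_clusters=100, shots=4, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
    pseudo_names = [f"blicket-{i}" for i in range(n_clusters)]
    c_a, c_b = rng.choice(n_clusters, size=2, replace=False)
    support = []
    for c in (c_a, c_b):
        idx = rng.choice(np.where(labels == c)[0], size=shots, replace=False)
        support += [(int(i), pseudo_names[c]) for i in idx]
    rng.shuffle(support)                       # interleave the two pseudo-classes
    query_idx = int(rng.choice(np.where(labels == c_a)[0]))
    return support, (query_idx, pseudo_names[c_a])   # model must emit this pseudo-caption
```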

The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks

  • paper_url: http://arxiv.org/abs/2310.00496
  • repo_url: None
  • paper_authors: Cameron Shinn, Collin McCarthy, Saurav Muralidharan, Muhammad Osama, John D. Owens
  • for: Evaluating the performance impact of sparsity in neural networks.
  • methods: Proposes the Sparsity Roofline, a visual performance model that jointly captures network accuracy, sparsity, and predicted inference speedup.
  • results: A novel analytical model predicts sparse network performance without implementing optimized kernels; the predicted speedup is validated on several real-world computer vision architectures pruned across a range of sparsity patterns and degrees.
    Abstract We introduce the Sparsity Roofline, a visual performance model for evaluating sparsity in neural networks. The Sparsity Roofline jointly models network accuracy, sparsity, and predicted inference speedup. Our approach does not require implementing and benchmarking optimized kernels, and the predicted speedup is equal to what would be measured when the corresponding dense and sparse kernels are equally well-optimized. We achieve this through a novel analytical model for predicting sparse network performance, and validate the predicted speedup using several real-world computer vision architectures pruned across a range of sparsity patterns and degrees. We demonstrate the utility and ease-of-use of our model through two case studies: (1) we show how machine learning researchers can predict the performance of unimplemented or unoptimized block-structured sparsity patterns, and (2) we show how hardware designers can predict the performance implications of new sparsity patterns and sparse data formats in hardware. In both scenarios, the Sparsity Roofline helps performance experts identify sparsity regimes with the highest performance potential.
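Roofline-style reasoning predicts speedup from arithmetic and memory costs rather than from benchmarked kernels. The sketch below illustrates that idea for a single pruned matrix multiplication; it is a generic roofline calculation under assumed hardware numbers and a simple traffic model, not the paper's analytical model.

```python
# Sketch: roofline-style speedup estimate for pruning one (M, K) x (K, N) matmul.
# Runtime is bounded below by max(FLOP time, memory-traffic time); the predicted speedup
# is the ratio of the dense bound to the sparse bound. Hardware figures are placeholders.
PEAK_FLOPS = 19.5e12   # FLOP/s, assumed
PEAK_BW = 1.6e12       # bytes/s of DRAM bandwidth, assumed
BYTES = 2              # fp16 operands

def matmul_time(m, k, n, density=1.0):
    flops = 2 * m * k * n * density                       # only stored nonzeros do work
    traffic = BYTES * (m * k + k * n * density + m * n)   # activations + nonzero weights + output
    return max(flops / PEAK_FLOPS, traffic / PEAK_BW)     # roofline lower bound

def predicted_speedup(m, k, n, density):
    return matmul_time(m, k, n, 1.0) / matmul_time(m, k, n, density)

for d in (0.5, 0.25, 0.1):
    print(f"weight density {d:>4}: predicted speedup ~{predicted_speedup(4096, 4096, 4096, d):.2f}x")
```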

Diff-DOPE: Differentiable Deep Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.00463
  • repo_url: None
  • paper_authors: Jonathan Tremblay, Bowen Wen, Valts Blukis, Balakumar Sundaralingam, Stephen Tyree, Stan Birchfield
  • for: Refining object pose by minimizing the visual error between an image and the projection of a 3D textured model.
  • methods: Uses differentiable rendering to update the object pose, avoiding the need to train a deep network on a large synthetic dataset; runs multiple gradient-descent optimizations in parallel with different random learning rates.
  • results: Achieves state-of-the-art results on pose estimation datasets and supports multiple modalities such as RGB, depth, intensity edges, and object segmentation masks.
    Abstract We introduce Diff-DOPE, a 6-DoF pose refiner that takes as input an image, a 3D textured model of an object, and an initial pose of the object. The method uses differentiable rendering to update the object pose to minimize the visual error between the image and the projection of the model. We show that this simple, yet effective, idea is able to achieve state-of-the-art results on pose estimation datasets. Our approach is a departure from recent methods in which the pose refiner is a deep neural network trained on a large synthetic dataset to map inputs to refinement steps. Rather, our use of differentiable rendering allows us to avoid training altogether. Our approach performs multiple gradient descent optimizations in parallel with different random learning rates to avoid local minima from symmetric objects, similar appearances, or wrong step size. Various modalities can be used, e.g., RGB, depth, intensity edges, and object segmentation masks. We present experiments examining the effect of various choices, showing that the best results are found when the RGB image is accompanied by an object mask and depth image to guide the optimization process.
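The refinement itself is plain gradient descent through a differentiable renderer, replicated in parallel with different random learning rates so that at least one copy escapes bad basins. A PyTorch-shaped sketch follows; `render` stands in for any differentiable renderer that accepts a batch of poses, and the 6-DoF parameterization is an assumption.

```python
# Sketch: Diff-DOPE-style pose refinement. `render(poses)` must be differentiable and map a
# batch of 6-DoF poses (B, 6) to images (B, C, H, W); it is a placeholder here.
import torch

def refine_pose(render, target, init_pose, n_parallel=8, steps=100, lr_range=(1e-3, 1e-1)):
    poses = init_pose.detach().clone().repeat(n_parallel, 1).requires_grad_(True)
    lrs = torch.empty(n_parallel).uniform_(*lr_range)     # one random learning rate per copy
    for _ in range(steps):
        loss = ((render(poses) - target.unsqueeze(0)) ** 2).flatten(1).mean(dim=1)
        grads, = torch.autograd.grad(loss.sum(), poses)
        with torch.no_grad():
            poses -= lrs[:, None] * grads                 # each copy descends at its own rate
    with torch.no_grad():
        final = ((render(poses) - target.unsqueeze(0)) ** 2).flatten(1).mean(dim=1)
    return poses[final.argmin()].detach()                 # keep the copy with the lowest visual error
```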

UniLVSeg: Unified Left Ventricular Segmentation with Sparsely Annotated Echocardiogram Videos through Self-Supervised Temporal Masking and Weakly Supervised Training

  • paper_url: http://arxiv.org/abs/2310.00454
  • repo_url: None
  • paper_authors: Fadillah Maani, Asim Ukaye, Nada Saadi, Numan Saeed, Mohammad Yaqub
  • for: A reliable and efficient left ventricular (LV) segmentation method to help clinicians diagnose cardiovascular disease more precisely.
  • methods: Combines self-supervised learning (SSL) via temporal masking with weakly supervised training, and investigates two segmentation approaches: 3D segmentation and a novel 2D superimage (SI).
  • results: Outperforms state-of-the-art solutions with a 93.32% (95% CI 93.21-93.43%) dice score on the large-scale EchoNet-Dynamic dataset while being more efficient; extensive ablations cover pre-training settings and different deep learning backbones.
    Abstract Echocardiography has become an indispensable clinical imaging modality for general heart health assessment. From calculating biomarkers such as ejection fraction to the probability of a patient's heart failure, accurate segmentation of the heart and its structures allows doctors to plan and execute treatments with greater precision and accuracy. However, achieving accurate and robust left ventricle segmentation is time-consuming and challenging due to different reasons. This work introduces a novel approach for consistent left ventricular (LV) segmentation from sparsely annotated echocardiogram videos. We achieve this through (1) self-supervised learning (SSL) using temporal masking followed by (2) weakly supervised training. We investigate two different segmentation approaches: 3D segmentation and a novel 2D superimage (SI). We demonstrate how our proposed method outperforms the state-of-the-art solutions by achieving a 93.32% (95%CI 93.21-93.43%) dice score on a large-scale dataset (EchoNet-Dynamic) while being more efficient. To show the effectiveness of our approach, we provide extensive ablation studies, including pre-training settings and various deep learning backbones. Additionally, we discuss how our proposed methodology achieves high data utility by incorporating unlabeled frames in the training process. To help support the AI in medicine community, the complete solution with the source code will be made publicly available upon acceptance.
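The self-supervised stage masks part of an unlabeled echo clip along the time axis and asks the network to reconstruct what was hidden. A minimal frame-level masking sketch is below; the masking ratio and clip layout are assumptions, and the paper's masking granularity may differ.

```python
# Sketch: temporal masking for self-supervised pretraining on echocardiogram clips.
# `clip` is a (T, C, H, W) tensor of unlabeled frames; the masked frames become SSL targets.
import torch

def temporal_mask(clip, mask_ratio=0.6, generator=None):
    t = clip.shape[0]
    n_masked = max(1, int(round(mask_ratio * t)))
    masked_idx = torch.randperm(t, generator=generator)[:n_masked]
    corrupted = clip.clone()
    corrupted[masked_idx] = 0.0            # simplest corruption: zero out the chosen frames
    return corrupted, masked_idx

# Usage idea: loss = mse(model(corrupted)[masked_idx], clip[masked_idx])
```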

On the Role of Neural Collapse in Meta Learning Models for Few-shot Learning

  • paper_url: http://arxiv.org/abs/2310.00451
  • repo_url: https://github.com/saakethmm/nc-prototypical-networks
  • paper_authors: Saaketh Medepalli, Naren Doraiswamy
  • for: Investigating meta-learning frameworks for few-shot learning and how well they generalize to new classes.
  • methods: Studies few-shot meta-learning models on the Omniglot dataset and measures the neural collapse properties of the learnt features.
  • results: As model size grows, the learnt features trend toward neural collapse, but they do not necessarily exhibit the complete collapse measured by the NC properties.
    Abstract Meta-learning frameworks for few-shot learning aims to learn models that can learn new skills or adapt to new environments rapidly with a few training examples. This has led to the generalizability of the developed model towards new classes with just a few labelled samples. However these networks are seen as black-box models and understanding the representations learnt under different learning scenarios is crucial. Neural collapse ($\mathcal{NC}$) is a recently discovered phenomenon which showcases unique properties at the network proceeds towards zero loss. The input features collapse to their respective class means, the class means form a Simplex equiangular tight frame (ETF) where the class means are maximally distant and linearly separable, and the classifier acts as a simple nearest neighbor classifier. While these phenomena have been observed in simple classification networks, this study is the first to explore and understand the properties of neural collapse in meta learning frameworks for few-shot learning. We perform studies on the Omniglot dataset in the few-shot setting and study the neural collapse phenomenon. We observe that the learnt features indeed have the trend of neural collapse, especially as model size grows, but to do not necessarily showcase the complete collapse as measured by the $\mathcal{NC}$ properties.
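The NC properties mentioned above can be probed directly from penultimate-layer features: how small within-class variability is relative to the spread of class means, and how close the centered class means come to a simplex ETF, whose pairwise cosines equal -1/(K-1). The sketch below computes those two diagnostics; it is a generic NC probe under simplified definitions, not the authors' measurement code.

```python
# Sketch: two neural-collapse diagnostics from penultimate features.
# `feats` is (N, D); `labels` is (N,) with K >= 2 classes.
import numpy as np

def nc_diagnostics(feats, labels):
    classes = np.unique(labels)
    global_mean = feats.mean(axis=0)
    means = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    centered = means - global_mean
    # NC1-style ratio: within-class variability vs. spread of the class means.
    within = np.mean([np.sum((feats[labels == c] - m) ** 2) / max((labels == c).sum() - 1, 1)
                      for c, m in zip(classes, means)])
    between = np.sum(centered ** 2) / max(len(classes) - 1, 1)
    # NC2-style check: a simplex ETF gives every pairwise cosine exactly -1/(K-1).
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    cos = unit @ unit.T
    off_diag = cos[~np.eye(len(classes), dtype=bool)]
    return {"within_over_between": within / between,
            "mean_offdiag_cosine": float(off_diag.mean()),
            "etf_target_cosine": -1.0 / (len(classes) - 1)}
```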

Human-Producible Adversarial Examples

  • paper_url: http://arxiv.org/abs/2310.00438
  • repo_url: https://github.com/lionfish0/adversarial-human
  • paper_authors: David Khachaturov, Yue Gao, Ilia Shumailov, Robert Mullins, Ross Anderson, Kassem Fawaz
  • for: Generating human-producible adversarial examples for the real world that require nothing more complicated than a marker pen.
  • methods: Builds on differentiable rendering to construct potent adversarial examples ("adversarial tags") from just a few straight lines, with an improved line-placement scheme that is invariant to human drawing error.
  • results: Drawing the lines disrupts a YOLO-based model in 54.8% of cases with 4 lines and 81.8% with 9 lines; extensive digital and physical-world evaluations, including a user study, show the tags can be applied by untrained humans.
    Abstract Visual adversarial examples have so far been restricted to pixel-level image manipulations in the digital world, or have required sophisticated equipment such as 2D or 3D printers to be produced in the physical real world. We present the first ever method of generating human-producible adversarial examples for the real world that requires nothing more complicated than a marker pen. We call them $\textbf{adversarial tags}$. First, building on top of differential rendering, we demonstrate that it is possible to build potent adversarial examples with just lines. We find that by drawing just $4$ lines we can disrupt a YOLO-based model in $54.8\%$ of cases; increasing this to $9$ lines disrupts $81.8\%$ of the cases tested. Next, we devise an improved method for line placement to be invariant to human drawing error. We evaluate our system thoroughly in both digital and analogue worlds and demonstrate that our tags can be applied by untrained humans. We demonstrate the effectiveness of our method for producing real-world adversarial examples by conducting a user study where participants were asked to draw over printed images using digital equivalents as guides. We further evaluate the effectiveness of both targeted and untargeted attacks, and discuss various trade-offs and method limitations, as well as the practical and ethical implications of our work. The source code will be released publicly.

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.00434
  • repo_url: https://github.com/THU-LYJ-Lab/DiffPoseTalk
  • paper_authors: Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Gaetan Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-jin Liu
  • for: Generating stylistic 3D facial animation driven by speech and a style encoding.
  • methods: A generative framework based on a diffusion model and a style encoder that extracts style embeddings from short reference videos, with classifier-free guidance steering generation from the speech and style; head pose generation is included as well.
  • results: Outperforms existing methods in both quantitative evaluation and a user study; the code and dataset will be released publicly.
    Abstract The generation of stylistic 3D facial animations driven by speech poses a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. We extend this to include the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Our extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset will be made publicly available.
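Classifier-free guidance, used here to steer generation by speech and style, combines a conditional and an unconditional prediction at each denoising step. The sketch shows the standard single-condition form; splitting the guidance into separate speech and style scales would be an assumption about the paper's exact scheme.

```python
# Sketch: a standard classifier-free guidance step. `model(x_t, t, cond)` predicts noise;
# passing cond=None yields the unconditional branch the model was also trained on.
import torch

def cfg_noise(model, x_t, t, cond, guidance_scale=2.0):
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, None)
    # Push the prediction in the direction the condition suggests, scaled by the guidance weight.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```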

Technical Report of 2023 ABO Fine-grained Semantic Segmentation Competition

  • paper_url: http://arxiv.org/abs/2310.00427
  • repo_url: None
  • paper_authors: Zeyu Dong
  • for: Describes the submission to the 2023 ABO Fine-grained Semantic Segmentation Competition, whose task is to predict semantic labels for the convex shapes of five categories of standardized 3D product models.
  • methods: Uses DGCNN as the backbone to classify the different structures of the five classes; experiments show that stochastic gradient descent with warm restarts and category-specific learning-rate factors contribute most to performance.
  • results: The model ranked 3rd in the Dev phase of the 2023 ICCV 3DVeComm Workshop Challenge.
    Abstract In this report, we describe the technical details of our submission to the 2023 ABO Fine-grained Semantic Segmentation Competition, by Team "Zeyu\_Dong" (username:ZeyuDong). The task is to predicate the semantic labels for the convex shape of five categories, which consist of high-quality, standardized 3D models of real products available for purchase online. By using DGCNN as the backbone to classify different structures of five classes, We carried out numerous experiments and found learning rate stochastic gradient descent with warm restarts and setting different rate of factors for various categories contribute most to the performance of the model. The appropriate method helps us rank 3rd place in the Dev phase of the 2023 ICCV 3DVeComm Workshop Challenge.

PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2310.00426
  • repo_url: https://github.com/PixArt-alpha/PixArt-alpha
  • paper_authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
  • for: Providing a high-quality yet low-cost text-to-image (T2I) model to foster innovation in the AIGC community while reducing CO2 emissions.
  • methods: Three core designs: a decomposed training strategy that separately optimizes pixel dependency, text-image alignment, and image aesthetic quality; an efficient T2I Transformer with cross-attention modules that inject text conditions and streamline the computation-intensive class-condition branch; and high-informative data, using a large vision-language model to auto-label dense pseudo-captions that aid text-image alignment learning.
  • results: PIXART-$\alpha$ trains markedly faster than existing large-scale T2I models, taking only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and cutting CO2 emissions by 90%; compared with the larger SOTA model RAPHAEL, the training cost is merely 1%. It excels in image quality, artistry, and semantic control.
    Abstract The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

MVC: A Multi-Task Vision Transformer Network for COVID-19 Diagnosis from Chest X-ray Images

  • paper_url: http://arxiv.org/abs/2310.00418
  • repo_url: None
  • paper_authors: Huyen Tran, Duc Thanh Nguyen, John Yearwood
  • for: A new multi-task method that simultaneously classifies chest X-ray images and identifies affected regions.
  • methods: A multi-task learning architecture built on the Vision Transformer, capable of both local and global representation learning.
  • results: Experiments on a COVID-19 chest X-ray benchmark show the proposed method outperforms the baselines on both image classification and affected-region identification.
    Abstract Medical image analysis using computer-based algorithms has attracted considerable attention from the research community and achieved tremendous progress in the last decade. With recent advances in computing resources and availability of large-scale medical image datasets, many deep learning models have been developed for disease diagnosis from medical images. However, existing techniques focus on sub-tasks, e.g., disease classification and identification, individually, while there is a lack of a unified framework enabling multi-task diagnosis. Inspired by the capability of Vision Transformers in both local and global representation learning, we propose in this paper a new method, namely Multi-task Vision Transformer (MVC) for simultaneously classifying chest X-ray images and identifying affected regions from the input data. Our method is built upon the Vision Transformer but extends its learning capability in a multi-task setting. We evaluated our proposed method and compared it with existing baselines on a benchmark dataset of COVID-19 chest X-ray images. Experimental results verified the superiority of the proposed method over the baselines on both the image classification and affected region identification tasks.

SSIF: Learning Continuous Image Representation for Spatial-Spectral Super-Resolution

  • paper_url: http://arxiv.org/abs/2310.00413
  • repo_url: None
  • paper_authors: Gengchen Mai, Ni Lao, Weiwei Sun, Yuchi Ma, Jiaming Song, Chenlin Meng, Hongxu Ma, Jinmeng Rao, Ziyuan Li, Stefano Ermon
  • for: Improving images' spatial and spectral resolution (spatial-spectral super-resolution).
  • methods: Represents an image as a neural implicit function of continuous pixel coordinates and continuous wavelengths (SSIF), enabling super-resolution in both domains.
  • results: On two challenging spatio-spectral super-resolution benchmarks, SSIF consistently outperforms state-of-the-art baselines, generalizes to unseen spatial and spectral resolutions, and improves downstream tasks such as land use classification by 1.7%-7%.
    Abstract Existing digital sensors capture images at fixed spatial and spectral resolutions (e.g., RGB, multispectral, and hyperspectral images), and each combination requires bespoke machine learning models. Neural Implicit Functions partially overcome the spatial resolution challenge by representing an image in a resolution-independent way. However, they still operate at fixed, pre-defined spectral resolutions. To address this challenge, we propose Spatial-Spectral Implicit Function (SSIF), a neural implicit model that represents an image as a function of both continuous pixel coordinates in the spatial domain and continuous wavelengths in the spectral domain. We empirically demonstrate the effectiveness of SSIF on two challenging spatio-spectral super-resolution benchmarks. We observe that SSIF consistently outperforms state-of-the-art baselines even when the baselines are allowed to train separate models at each spectral resolution. We show that SSIF generalizes well to both unseen spatial resolutions and spectral resolutions. Moreover, SSIF can generate high-resolution images that improve the performance of downstream tasks (e.g., land use classification) by 1.7%-7%.
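The core object is a field queried at continuous spatial coordinates and a continuous wavelength, so one model can be sampled at any spatial or spectral resolution. A minimal coordinate-MLP sketch of such a field is below; the positional encoding and layer sizes are illustrative choices, not the SSIF architecture.

```python
# Sketch: a spatial-spectral implicit field f(x, y, wavelength) -> intensity, queried at
# arbitrary pixel coordinates and wavelengths scaled to [0, 1]. Sizes are placeholders.
import math
import torch
import torch.nn as nn

class SpatialSpectralField(nn.Module):
    def __init__(self, n_freqs=8, hidden=256):
        super().__init__()
        self.register_buffer("freqs", (2.0 ** torch.arange(n_freqs)) * math.pi)
        in_dim = 3 * 2 * n_freqs                       # sin/cos encoding of (x, y, wavelength)
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, coords):                          # coords: (..., 3)
        ang = coords[..., None] * self.freqs            # (..., 3, n_freqs)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)
        return self.mlp(enc).squeeze(-1)                # predicted intensity per query

# Sampling a finer grid or extra bands is just a denser set of (x, y, wavelength) queries.
```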

Controlling Neural Style Transfer with Deep Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2310.00405
  • repo_url: https://github.com/abusufyanvu/6S191_MIT_DeepLearning
  • paper_authors: Chengming Feng, Jing Hu, Xin Wang, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Siwei Lyu
  • for: Proposes the first deep reinforcement learning (RL) based architecture for controlling the degree of stylization in Neural Style Transfer (NST).
  • methods: Splits one-step style transfer into a step-wise process: early steps preserve the details and structure of the content image while later steps add more style patterns, giving users easy control over the stylization.
  • results: The RL-based method is effective and robust, and is lighter in weight and computational complexity than existing one-step deep-learning-based models.
    Abstract Controlling the degree of stylization in the Neural Style Transfer (NST) is a little tricky since it usually needs hand-engineering on hyper-parameters. In this paper, we propose the first deep Reinforcement Learning (RL) based architecture that splits one-step style transfer into a step-wise process for the NST task. Our RL-based method tends to preserve more details and structures of the content image in early steps, and synthesize more style patterns in later steps. It is a user-easily-controlled style-transfer method. Additionally, as our RL-based model performs the stylization progressively, it is lightweight and has lower computational complexity than existing one-step Deep Learning (DL) based models. Experimental results demonstrate the effectiveness and robustness of our method.

MonoGAE: Roadside Monocular 3D Object Detection with Ground-Aware Embeddings

  • paper_url: http://arxiv.org/abs/2310.00400
  • repo_url: None
  • paper_authors: Lei Yang, Jiaxin Yu, Xinyu Zhang, Jun Li, Li Wang, Yi Huang, Chuang Zhang, Hong Wang, Yiming Li
  • for: Extending the perception ability of autonomous driving systems with intelligent roadside cameras beyond the ego-vehicle's visual range.
  • methods: Learns high-dimensional ground-aware embeddings from the stable ground-plane prior of fixed roadside cameras, fuses them with image features via cross-attention, and replaces the ground-plane depth map with a pixel-level refined ground-plane equation map for robustness to varying camera installation poses.
  • results: Substantially outperforms previous monocular 3D object detectors on widely recognized roadside 3D detection benchmarks.
    Abstract Although the majority of recent autonomous driving systems concentrate on developing perception methods based on ego-vehicle sensors, there is an overlooked alternative approach that involves leveraging intelligent roadside cameras to help extend the ego-vehicle perception ability beyond the visual range. We discover that most existing monocular 3D object detectors rely on the ego-vehicle prior assumption that the optical axis of the camera is parallel to the ground. However, the roadside camera is installed on a pole with a pitched angle, which makes the existing methods not optimal for roadside scenes. In this paper, we introduce a novel framework for Roadside Monocular 3D object detection with ground-aware embeddings, named MonoGAE. Specifically, the ground plane is a stable and strong prior knowledge due to the fixed installation of cameras in roadside scenarios. In order to reduce the domain gap between the ground geometry information and high-dimensional image features, we employ a supervised training paradigm with a ground plane to predict high-dimensional ground-aware embeddings. These embeddings are subsequently integrated with image features through cross-attention mechanisms. Furthermore, to improve the detector's robustness to the divergences in cameras' installation poses, we replace the ground plane depth map with a novel pixel-level refined ground plane equation map. Our approach demonstrates a substantial performance advantage over all previous monocular 3D object detectors on widely recognized 3D detection benchmarks for roadside cameras. The code and pre-trained models will be released soon.
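The fusion of ground geometry with image features is described as cross-attention. A generic PyTorch cross-attention block of that kind is sketched below; the dimensions and the residual/normalization arrangement are assumptions, not the paper's exact module.

```python
# Sketch: injecting ground-aware embeddings into image features with cross-attention.
# `img_tokens` is (B, N_img, D); `ground_tokens` is (B, N_ground, D).
import torch
import torch.nn as nn

class GroundAwareFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, ground_tokens):
        # Image features query the ground-plane embeddings; a residual keeps the original signal.
        fused, _ = self.attn(query=img_tokens, key=ground_tokens, value=ground_tokens)
        return self.norm(img_tokens + fused)
```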

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

  • paper_url: http://arxiv.org/abs/2310.00390
  • repo_url: https://github.com/AlaaLab/InstructCV
  • paper_authors: Yulu Gan, Sungwoo Park, Alexander Schubert, Anthony Philippakis, Ahmed M. Alaa
  • for: A unified, natural-language-instructed multi-task computer vision model that executes a variety of vision tasks from textual instructions.
  • methods: Casts multiple vision tasks (segmentation, object detection, depth estimation, classification) as text-to-image generation and instruction-tunes a text-to-image diffusion model on a pooled multi-task dataset whose prompts are paraphrased by a large language model.
  • results: InstructCV performs competitively with other generalist and task-specific vision models and generalizes well to unseen data, categories, and user instructions.
    Abstract Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.

Deep Active Learning with Noisy Oracle in Object Detection

  • paper_url: http://arxiv.org/abs/2310.00372
  • repo_url: None
  • paper_authors: Marius Schubert, Tobias Riedlinger, Karsten Kahl, Matthias Rottmann
  • for: Improving object detector performance while reducing the amount of annotation required and mitigating the impact of noisy labels.
  • methods: A composite active learning framework with a label review module that spends part of the annotation budget correcting noisy labels in the active dataset, combined with uncertainty-based query strategies.
  • results: At equal annotation budget, incorporating label reviews improves object detection performance by up to 4.5 mAP points.
    Abstract Obtaining annotations for complex computer vision tasks such as object detection is an expensive and time-intense endeavor involving a large number of human workers or expert opinions. Reducing the amount of annotations required while maintaining algorithm performance is, therefore, desirable for machine learning practitioners and has been successfully achieved by active learning algorithms. However, it is not merely the amount of annotations which influences model performance but also the annotation quality. In practice, the oracles that are queried for new annotations frequently contain significant amounts of noise. Therefore, cleansing procedures are oftentimes necessary to review and correct given labels. This process is subject to the same budget as the initial annotation itself since it requires human workers or even domain experts. Here, we propose a composite active learning framework including a label review module for deep object detection. We show that utilizing part of the annotation budget to correct the noisy annotations partially in the active dataset leads to early improvements in model performance, especially when coupled with uncertainty-based query strategies. The precision of the label error proposals has a significant influence on the measured effect of the label review. In our experiments we achieve improvements of up to 4.5 mAP points of object detection performance by incorporating label reviews at equal annotation budget.
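Each acquisition round splits its budget between querying the (noisy) oracle on new images and reviewing labels already in the active pool, with uncertainty driving both choices. The loop below is a schematic of that split; the `oracle`, `review`, and `uncertainty` interfaces and the 70/30 split are assumptions, not the paper's configuration.

```python
# Sketch: one active-learning round with a noisy oracle and a label-review module.
# `oracle(x)` returns a possibly noisy label, `review(x, y)` a corrected label, and
# `uncertainty(model, x)` a score of how unsure the model is; all three are placeholders.
def acquisition_round(model, labeled, unlabeled, oracle, review, uncertainty,
                      budget=100, review_fraction=0.3):
    review_budget = int(round(review_fraction * budget))
    query_budget = budget - review_budget
    # 1) Review: re-check the existing labels the current model is most uncertain about.
    suspect_ids = sorted(range(len(labeled)),
                         key=lambda i: uncertainty(model, labeled[i][0]),
                         reverse=True)[:review_budget]
    for i in suspect_ids:
        x, y = labeled[i]
        labeled[i] = (x, review(x, y))          # correction consumes annotation budget too
    # 2) Query: send the most uncertain unlabeled images to the (noisy) oracle.
    pick_ids = sorted(range(len(unlabeled)),
                      key=lambda i: uncertainty(model, unlabeled[i]),
                      reverse=True)[:query_budget]
    for i in sorted(pick_ids, reverse=True):    # pop from the back so indices stay valid
        x = unlabeled.pop(i)
        labeled.append((x, oracle(x)))
    return labeled, unlabeled
```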

Distilling Inductive Bias: Knowledge Distillation Beyond Model Compression

  • paper_url: http://arxiv.org/abs/2310.00369
  • repo_url: None
  • paper_authors: Gousia Habib, Tausifa Jan Saleem, Brejesh Lall
  • for: Making Vision Transformers practical for computer vision applications by reducing their data requirements and computational cost.
  • methods: An ensemble-based distillation approach that transfers inductive bias from complementary lightweight teachers with different architectural tendencies (e.g., convolution and involution), which jointly instruct the student transformer; teacher logits are precomputed and stored to avoid repeated forward passes during distillation.
  • results: The varied inductive biases let the teachers accumulate a wide range of knowledge, even from readily identifiable stored datasets, leading to enhanced student performance and a faster, less computationally demanding distillation process.
    Abstract With the rapid development of computer vision, Vision Transformers (ViTs) offer the tantalizing prospect of unified information processing across visual and textual domains. But due to the lack of inherent inductive biases in ViTs, they require enormous amount of data for training. To make their applications practical, we introduce an innovative ensemble-based distillation approach distilling inductive bias from complementary lightweight teacher models. Prior systems relied solely on convolution-based teaching. However, this method incorporates an ensemble of light teachers with different architectural tendencies, such as convolution and involution, to instruct the student transformer jointly. Because of these unique inductive biases, instructors can accumulate a wide range of knowledge, even from readily identifiable stored datasets, which leads to enhanced student performance. Our proposed framework also involves precomputing and storing logits in advance, essentially the unnormalized predictions of the model. This optimization can accelerate the distillation process by eliminating the need for repeated forward passes during knowledge distillation, significantly reducing the computational burden and enhancing efficiency.
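Two pieces of the description lend themselves to a short sketch: distilling from an ensemble of lightweight teachers, and caching their logits so the student never triggers a teacher forward pass. The loss below is the standard temperature-scaled KD objective applied to averaged, precomputed teacher logits; averaging the ensemble is an assumed simplification.

```python
# Sketch: ensemble knowledge distillation from precomputed (cached) teacher logits.
# `teacher_logits` is a (num_teachers, batch, classes) tensor loaded from disk, so no
# teacher forward pass is needed while the student trains.
import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    avg_teacher = teacher_logits.mean(dim=0)                        # combine the ensemble
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(avg_teacher / T, dim=-1),
                    reduction="batchmean") * (T * T)                # usual KD temperature scaling
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```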

Diffusion Posterior Illumination for Ambiguity-aware Inverse Rendering

  • paper_url: http://arxiv.org/abs/2310.00362
  • repo_url: None
  • paper_authors: Linjie Lyu, Ayush Tewari, Marc Habermann, Shunsuke Saito, Michael Zollhöfer, Thomas Leimkühler, Christian Theobalt
  • for: Inverse rendering, i.e., inferring scene properties from images, an ill-posed problem with inherent ambiguities and a multi-modal distribution of possible decompositions.
  • methods: Integrates a denoising diffusion probabilistic model pre-trained on natural illumination maps into an optimization framework built around a differentiable path tracer.
  • results: Samples combinations of illumination and spatially-varying surface materials that are both natural and explain the image observations, producing high-quality, diverse environment-map samples that faithfully match the input images' illumination.
    Abstract Inverse rendering, the process of inferring scene properties from images, is a challenging inverse problem. The task is ill-posed, as many different scene configurations can give rise to the same image. Most existing solutions incorporate priors into the inverse-rendering pipeline to encourage plausible solutions, but they do not consider the inherent ambiguities and the multi-modal distribution of possible decompositions. In this work, we propose a novel scheme that integrates a denoising diffusion probabilistic model pre-trained on natural illumination maps into an optimization framework involving a differentiable path tracer. The proposed method allows sampling from combinations of illumination and spatially-varying surface materials that are, both, natural and explain the image observations. We further conduct an extensive comparative study of different priors on illumination used in previous work on inverse rendering. Our method excels in recovering materials and producing highly realistic and diverse environment map samples that faithfully explain the illumination of the input images.

Improving Cross-dataset Deepfake Detection with Deep Information Decomposition

  • paper_url: http://arxiv.org/abs/2310.00359
  • repo_url: None
  • paper_authors: Shanmin Yang, Shu Hu, Bin Zhu, Ying Fu, Siwei Lyu, Xi Wu, Xin Wang
  • for: Countering the threat deepfake technology poses to security and social trust with a Deep Information Decomposition (DID) framework for cross-dataset deepfake detection.
  • methods: Prioritizes high-level semantic features over visual artifacts, decomposing facial features into deepfake-related and irrelevant information and optimizing the deepfake-related part to be independent of other factors for real/fake discrimination.
  • results: Achieves higher accuracy and better generalization in cross-dataset evaluation, detecting unseen forgery methods more robustly than state-of-the-art detectors.
    Abstract Deepfake technology poses a significant threat to security and social trust. Although existing detection methods have demonstrated high performance in identifying forgeries within datasets using the same techniques for training and testing, they suffer from sharp performance degradation when faced with cross-dataset scenarios where unseen deepfake techniques are tested. To address this challenge, we propose a deep information decomposition (DID) framework in this paper. Unlike most existing deepfake detection methods, our framework prioritizes high-level semantic features over visual artifacts. Specifically, it decomposes facial features into deepfake-related and irrelevant information and optimizes the deepfake information for real/fake discrimination to be independent of other factors. Our approach improves the robustness of deepfake detection against various irrelevant information changes and enhances the generalization ability of the framework to detect unseen forgery methods. Extensive experimental comparisons with existing state-of-the-art detection methods validate the effectiveness and superiority of the DID framework on cross-dataset deepfake detection.

Structural Adversarial Objectives for Self-Supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2310.00357
  • repo_url: https://github.com/xiao7199/structural-adversarial-objectives
  • paper_authors: Xiao Zhang, Michael Maire
  • for: Self-supervised representation learning within the generative adversarial network (GAN) framework.
  • methods: Tasks the discriminator with additional structural modeling responsibilities: aligning distribution characteristics such as mean and variance at coarse scales and grouping features into local clusters at finer scales, combined with an efficient smoothness regularizer.
  • results: On CIFAR-10/100 and an ImageNet subset, the resulting discriminators learn representations that compete with networks trained by contrastive learning approaches.
    Abstract Within the framework of generative adversarial networks (GANs), we propose objectives that task the discriminator for self-supervised representation learning via additional structural modeling responsibilities. In combination with an efficient smoothness regularizer imposed on the network, these objectives guide the discriminator to learn to extract informative representations, while maintaining a generator capable of sampling from the domain. Specifically, our objectives encourage the discriminator to structure features at two levels of granularity: aligning distribution characteristics, such as mean and variance, at coarse scales, and grouping features into local clusters at finer scales. Operating as a feature learner within the GAN framework frees our self-supervised system from the reliance on hand-crafted data augmentation schemes that are prevalent across contrastive representation learning methods. Across CIFAR-10/100 and an ImageNet subset, experiments demonstrate that equipping GANs with our self-supervised objectives suffices to produce discriminators which, evaluated in terms of representation learning, compete with networks trained by contrastive learning approaches.
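The coarse-scale part of the objective aligns distributional statistics (mean and variance) of discriminator features between real and generated batches. A minimal moment-matching term of that kind is sketched below; it illustrates only this component and omits the finer-scale clustering objective and the smoothness regularizer.

```python
# Sketch: coarse-scale feature alignment for a GAN discriminator used as a feature learner.
# `real_feats` and `fake_feats` are (batch, dim) features from the discriminator backbone.
import torch

def moment_matching_loss(real_feats, fake_feats):
    mean_term = (real_feats.mean(dim=0) - fake_feats.mean(dim=0)).pow(2).mean()
    var_term = (real_feats.var(dim=0, unbiased=False)
                - fake_feats.var(dim=0, unbiased=False)).pow(2).mean()
    return mean_term + var_term
```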

RBF Weighted Hyper-Involution for RGB-D Object Detection

  • paper_url: http://arxiv.org/abs/2310.00342
  • repo_url: None
  • paper_authors: Mehfuz A Rahman, Jiju Peethambaran, Neil London
  • for: A real-time RGB-D object detection model.
  • methods: Proposes a depth-guided hyper-involution that adapts dynamically to the spatial interaction pattern in the raw depth map, plus an up-sampling based trainable fusion layer that combines depth and color features without blocking information transfer between them.
  • results: Outperforms other RGB-D based object detection models on NYU Depth v2, achieves comparable (second best) results on SUN RGB-D, and leads on a newly introduced outdoor RGB-D object detection dataset.
    Abstract A vast majority of conventional augmented reality devices are equipped with depth sensors. Depth images produced by such sensors contain complementary information for object detection when used with color images. Despite the benefits, it remains a complex task to simultaneously extract photometric and depth features in real time due to the immanent difference between depth and color images. Moreover, standard convolution operations are not sufficient to properly extract information directly from raw depth images leading to intermediate representations of depth which is inefficient. To address these issues, we propose a real-time and two stream RGBD object detection model. The proposed model consists of two new components: a depth guided hyper-involution that adapts dynamically based on the spatial interaction pattern in the raw depth map and an up-sampling based trainable fusion layer that combines the extracted depth and color image features without blocking the information transfer between them. We show that the proposed model outperforms other RGB-D based object detection models on NYU Depth v2 dataset and achieves comparable (second best) results on SUN RGB-D. Additionally, we introduce a new outdoor RGB-D object detection dataset where our proposed model outperforms other models. The performance evaluation on diverse synthetic data generated from CAD models and images shows the potential of the proposed model to be adapted to augmented reality based applications.

MFL Data Preprocessing and CNN-based Oil Pipeline Defects Detection

  • paper_url: http://arxiv.org/abs/2310.00332
  • repo_url: None
  • paper_authors: Iurii Katser, Vyacheslav Kozitsin, Igor Mozolin
  • for: 这个论文主要用于检测石油管道异常现象,以提高运输系统的可靠性和安全性。
  • methods: 该论文使用了最新的卷积神经网络结构,并提出了一些有效的预处理技术和检测方法,以解决现有数据的限制。
  • results: 该论文通过使用实际数据进行验证,并达到了高度的性能水平,以增强石油管道异常检测的精度和效果。
    Abstract Recently, the application of computer vision for anomaly detection has been under attention in several industrial fields. An important example is oil pipeline defect detection. Failure of one oil pipeline can interrupt the operation of the entire transportation system or cause a far-reaching failure. The automated defect detection could significantly decrease the inspection time and the related costs. However, there is a gap in the related literature when it comes to dealing with this task. The existing studies do not sufficiently cover the research of the Magnetic Flux Leakage data and the preprocessing techniques that allow overcoming the limitations set by the available data. This work focuses on alleviating these issues. Moreover, in doing so, we exploited the recent convolutional neural network structures and proposed robust approaches, aiming to acquire high performance considering the related metrics. The proposed approaches and their applicability were verified using real-world data.
    摘要 近年来，计算机视觉在多个工业领域中的异常检测应用受到关注，油管缺陷检测便是一个重要例子。单条油管的故障可能中断整个输送系统的运行，甚至引发更大范围的故障。自动化缺陷检测可以显著减少检测时间及相关成本。然而，相关文献在这一任务上仍存在空白：现有研究未能充分涵盖磁漏（Magnetic Flux Leakage，MFL）数据的研究，以及可以克服现有数据限制的预处理技术。本工作着力缓解这些问题。同时，我们利用了最新的卷积神经网络结构，并提出了稳健的方法，力求在相关指标上取得高性能。所提方法及其适用性已在真实数据上得到验证。
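The abstract does not specify the network architecture, so the sketch below is only an assumed baseline: a small PyTorch CNN that classifies fixed-size windows cut from an MFL map as defect / no defect, preceded by a common per-window standardisation step. Window size, channel counts and class count are illustrative.

```python
import torch
import torch.nn as nn

class MFLDefectCNN(nn.Module):
    """Toy classifier for MFL windows (e.g. 1 sensor channel x 64 x 64); not the paper's model."""

    def __init__(self, in_channels: int = 1, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

def standardize(window: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """A typical MFL preprocessing step: zero-mean, unit-variance per window."""
    return (window - window.mean()) / (window.std() + eps)

# logits = MFLDefectCNN()(standardize(torch.randn(4, 1, 64, 64)))
```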

Decoding Realistic Images from Brain Activity with Contrastive Self-supervision and Latent Diffusion

  • paper_url: http://arxiv.org/abs/2310.00318
  • repo_url: None
  • paper_authors: Jingyuan Sun, Mingxiao Li, Marie-Francine Moens
  • for: 通过从人脑活动重建视觉刺激，增进对大脑视觉系统及其与计算机视觉模型之间联系的理解。
  • methods: 采用深度生成模型进行重建，利用自监督对比学习获取fMRI数据的表示，再以其为条件驱动潜在扩散模型。
  • results: 实验结果显示，CnD能够在具有挑战性的基准上重建高度逼真的复杂图像，并提供了LDM组件与人脑视觉系统之间联系的量化解释。
    Abstract Reconstructing visual stimuli from human brain activities provides a promising opportunity to advance our understanding of the brain's visual system and its connection with computer vision models. Although deep generative models have been employed for this task, the challenge of generating high-quality images with accurate semantics persists due to the intricate underlying representations of brain signals and the limited availability of parallel data. In this paper, we propose a two-phase framework named Contrast and Diffuse (CnD) to decode realistic images from functional magnetic resonance imaging (fMRI) recordings. In the first phase, we acquire representations of fMRI data through self-supervised contrastive learning. In the second phase, the encoded fMRI representations condition the diffusion model to reconstruct visual stimulus through our proposed concept-aware conditioning method. Experimental results show that CnD reconstructs highly plausible images on challenging benchmarks. We also provide a quantitative interpretation of the connection between the latent diffusion model (LDM) components and the human brain's visual system. In summary, we present an effective approach for reconstructing visual stimuli based on human brain activity and offer a novel framework to understand the relationship between the diffusion model and the human brain visual system.
    摘要 从人脑活动重建视觉刺激，为增进我们对大脑视觉系统及其与计算机视觉模型之间联系的理解提供了可能。虽然深度生成模型已被用于这一任务，但由于大脑信号底层表示的复杂性以及配对数据的有限性，生成语义准确的高质量图像仍然充满挑战。在这篇论文中，我们提出了一种名为对比与扩散（CnD）的两阶段框架，用于从功能性磁共振成像（fMRI）记录中解码逼真的图像。在第一阶段，我们通过自监督对比学习获得fMRI数据的表示；在第二阶段，编码后的fMRI表示通过我们提出的概念感知条件方法作为扩散模型的条件，重建视觉刺激。实验结果表明，CnD能够在具有挑战性的基准上重建高度逼真的图像。我们还对潜在扩散模型（LDM）组件与人脑视觉系统之间的联系给出了量化解释。总之，我们提出了一种基于人脑活动重建视觉刺激的有效方法，并为理解扩散模型与人脑视觉系统之间的关系提供了新的框架。
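For the first, contrastive phase, a standard InfoNCE objective over two augmented views of the same fMRI sample is a reasonable stand-in; the paper's exact loss, encoder and augmentations are not given here, so treat this as an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss between two views of the same batch of fMRI samples.
    z1, z2: (N, D) embeddings from the fMRI encoder for two augmented views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Positive pairs sit on the diagonal; every other sample acts as a negative.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# z1, z2 = encoder(view1), encoder(view2); loss = info_nce(z1, z2)
```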

An easy zero-shot learning combination: Texture Sensitive Semantic Segmentation IceHrNet and Advanced Style Transfer Learning Strategy

  • paper_url: http://arxiv.org/abs/2310.00310
  • repo_url: https://github.com/pl23k/icehrnet
  • paper_authors: Zhiyong Yang, Yuelong Zhu, Xiaoqin Zeng, Jun Zong, Xiuheng Liu, Ran Tao, Xiaofei Cong, Yufeng Yu
  • for: 本研究旨在提出一种简单的零shot语义 segmentation方法,使用style transfer来实现。
  • methods: 我们使用了医学影像数据集(血液图像)来训练一个river ice语义 segmentation模型。首先,我们构建了一个river ice语义 segmentation数据集IPC_RI_SEG,使用固定摄像头和涵盖整个河流冰融化过程。其次,我们提出了一种高分辨率Texture Fusion semantic segmentation网络,名为IceHrNet。该网络使用HRNet作为背景,并添加了ASPP和Decoder segmentation头,以保留低级别的Texture特征,进行细致的语义分割。最后,我们提出了一种简单有效的高级 Style transfer学习策略,可以在交叉领域语义分割数据集上进行零shot转移学习,实现了87% mIoU的语义分割效果。
  • results: 实验显示,IceHrNet在Texture专注数据集IPC_RI_SEG上超过了现状的方法,并在Shape专注river ice数据集上达到了优秀的效果。在零shot转移学习中,IceHrNet比其他方法提高了2个百分点。我们的代码和模型已经发布在https://github.com/PL23K/IceHrNet。
    Abstract We proposed an easy method of Zero-Shot semantic segmentation by using style transfer. In this case, we successfully used a medical imaging dataset (Blood Cell Imagery) to train a model for river ice semantic segmentation. First, we built a river ice semantic segmentation dataset IPC_RI_SEG using a fixed camera and covering the entire ice melting process of the river. Second, a high-resolution texture fusion semantic segmentation network named IceHrNet is proposed. The network used HRNet as the backbone and added ASPP and Decoder segmentation heads to retain low-level texture features for fine semantic segmentation. Finally, a simple and effective advanced style transfer learning strategy was proposed, which can perform zero-shot transfer learning based on cross-domain semantic segmentation datasets, achieving a practical effect of 87% mIoU for semantic segmentation of river ice without target training dataset (25% mIoU for None Stylized, 65% mIoU for Conventional Stylized, our strategy improved by 22%). Experiments showed that the IceHrNet outperformed the state-of-the-art methods on the texture-focused dataset IPC_RI_SEG, and achieved an excellent result on the shape-focused river ice datasets. In zero-shot transfer learning, IceHrNet achieved an increase of 2 percentage points compared to other methods. Our code and model are published on https://github.com/PL23K/IceHrNet.
    摘要 我们提出了一种利用风格迁移实现零样本语义分割的简单方法。在这项工作中，我们成功地使用医学影像数据集（血细胞图像）训练了一个河流冰语义分割模型。首先，我们使用固定摄像头构建了覆盖整个河流冰消融过程的河流冰语义分割数据集IPC_RI_SEG。其次，我们提出了一种名为IceHrNet的高分辨率纹理融合语义分割网络。该网络以HRNet为骨干，并添加ASPP与Decoder分割头，以保留低层纹理特征，实现精细的语义分割。最后，我们提出了一种简单有效的高级风格迁移学习策略，可以基于跨领域语义分割数据集进行零样本迁移学习，在没有目标训练数据集的情况下达到了87% mIoU的河流冰语义分割效果（未风格化为25% mIoU，常规风格化为65% mIoU，我们的策略提升了22个百分点）。实验表明，IceHrNet在以纹理为主的数据集IPC_RI_SEG上超过了最先进的方法，并在以形状为主的河流冰数据集上取得了出色的结果；在零样本迁移学习中，IceHrNet比其他方法提高了2个百分点。我们的代码和模型已在 https://github.com/PL23K/IceHrNet 发布。
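The abstract does not detail the advanced style-transfer strategy. As one hedged illustration of the kind of primitive such a strategy can build on, the snippet below applies AdaIN-style statistics transfer, giving source-domain (blood-cell) images or features the channel statistics of the target domain (river ice); it is not the authors' pipeline.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalisation: give `content` the per-channel
    mean/std of `style`. Tensors are (N, C, H, W) feature maps or images."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

# stylised = adain(blood_cell_batch, river_ice_batch)  # same labels, target-like appearance
```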

Dual-Augmented Transformer Network for Weakly Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.00307
  • repo_url: None
  • paper_authors: Jingliang Deng, Zonghan Li
  • for: 提出一种结合双增强transformer网络与自正则化约束的弱监督语义分割（WSSS）方法，以提高分割的完整性和准确性。
  • methods: 构建由CNN网络与transformer网络组成的双网络进行互补学习，两个分支相互增强最终输出；同时引入自正则化约束以避免过拟合。
  • results: 在具有挑战性的PASCAL VOC 2012基准上的大规模系统评估表明，该方法有效，并超过了之前的最先进方法。
    Abstract Weakly supervised semantic segmentation (WSSS), a fundamental computer vision task, which aims to segment out the object within only class-level labels. The traditional methods adopt the CNN-based network and utilize the class activation map (CAM) strategy to discover the object regions. However, such methods only focus on the most discriminative region of the object, resulting in incomplete segmentation. An alternative is to explore vision transformers (ViT) to encode the image to acquire the global semantic information. Yet, the lack of transductive bias to objects is a flaw of ViT. In this paper, we explore the dual-augmented transformer network with self-regularization constraints for WSSS. Specifically, we propose a dual network with both CNN-based and transformer networks for mutually complementary learning, where both networks augment the final output for enhancement. Massive systemic evaluations on the challenging PASCAL VOC 2012 benchmark demonstrate the effectiveness of our method, outperforming previous state-of-the-art methods.
    摘要 弱监督语义分割（WSSS）是计算机视觉中的一项基础任务，旨在仅利用类别级标签分割出目标物体。传统方法通常采用基于CNN的网络，并利用类激活图（CAM）策略来发现物体区域。然而，这类方法只关注物体中最具判别性的区域，导致分割结果不完整。一种替代方案是利用视觉transformer（ViT）对图像进行编码，以获取全局语义信息；但ViT的一个缺陷是缺乏面向物体的偏置。在这篇论文中，我们探索了带有自正则化约束的双增强transformer网络来解决WSSS问题。具体来说，我们提出了一个由CNN网络与transformer网络组成的双网络进行互补学习，两个网络相互增强最终输出。在具有挑战性的PASCAL VOC 2012基准上的大规模系统评估证明了我们方法的有效性，其性能超过了之前的最先进方法。
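A minimal sketch of the dual-branch idea is given below: a CNN branch and a transformer branch each predict class activation maps, the fused map takes their element-wise maximum, and every branch is supervised with the image-level labels. The fusion rule, heads and backbones are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchWSSS(nn.Module):
    """Toy dual-branch WSSS head: both branches predict CAMs, the fused CAM
    is their element-wise maximum. Backbones are placeholders that return 4D feature maps."""

    def __init__(self, cnn_backbone: nn.Module, vit_backbone: nn.Module,
                 cnn_dim: int, vit_dim: int, num_classes: int):
        super().__init__()
        self.cnn, self.vit = cnn_backbone, vit_backbone
        self.cnn_head = nn.Conv2d(cnn_dim, num_classes, 1)
        self.vit_head = nn.Conv2d(vit_dim, num_classes, 1)

    def forward(self, x: torch.Tensor):
        cam_cnn = self.cnn_head(self.cnn(x))                 # (N, K, h, w)
        cam_vit = self.vit_head(self.vit(x))                 # (N, K, h', w')
        cam_vit = F.interpolate(cam_vit, cam_cnn.shape[-2:], mode="bilinear",
                                align_corners=False)
        cam_fused = torch.maximum(cam_cnn, cam_vit)          # mutual augmentation of the outputs
        # Image-level logits for a multi-label classification loss on each branch.
        logits = [cam.flatten(2).mean(-1) for cam in (cam_cnn, cam_vit, cam_fused)]
        return cam_fused, logits

# Each logits tensor can be trained with F.multilabel_soft_margin_loss against image-level labels.
```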

QUIZ: An Arbitrary Volumetric Point Matching Method for Medical Image Registration

  • paper_url: http://arxiv.org/abs/2310.00296
  • repo_url: None
  • paper_authors: Lin Liu, Xinxin Fan, Haoyang Liu, Chulong Zhang, Weibin Kong, Jingjing Dai, Yuming Jiang, Yaoqin Xie, Xiaokun Liang
  • for: 提出一种基于任意体素兴趣点匹配的医学影像配准方法，以解决现有方法在姿态变化导致组织结构不同或影像质量较差时出现的不稳定与不准确问题。
  • methods: 提出名为查询点匹配器（QUIZ）的新方法，关注局部-全局匹配点之间的对应关系：利用CNN提取特征，借助Transformer架构进行全局点匹配查询，最后通过平均位移实现局部影像的刚性变换。
  • results: 在一个宫颈癌患者的大形变数据集上的实验结果显示，该方法的配准偏差显著小于最先进的方法，甚至在跨模态对象上也超过了当前最先进水平。
    Abstract Rigid pre-registration involving local-global matching or other large deformation scenarios is crucial. Current popular methods rely on unsupervised learning based on grayscale similarity, but under circumstances where different poses lead to varying tissue structures, or where image quality is poor, these methods tend to exhibit instability and inaccuracies. In this study, we propose a novel method for medical image registration based on arbitrary voxel point of interest matching, called query point quizzer (QUIZ). QUIZ focuses on the correspondence between local-global matching points, specifically employing CNN for feature extraction and utilizing the Transformer architecture for global point matching queries, followed by applying average displacement for local image rigid transformation. We have validated this approach on a large deformation dataset of cervical cancer patients, with results indicating substantially smaller deviations compared to state-of-the-art methods. Remarkably, even for cross-modality subjects, it achieves results surpassing the current state-of-the-art.
    摘要 涉及局部-全局匹配或其他大形变场景的刚性预配准至关重要。当前流行的方法依赖基于灰度相似性的无监督学习，但在不同姿态导致组织结构变化或图像质量较差的情况下，这些方法往往表现出不稳定和不准确。在本研究中，我们提出了一种基于任意体素兴趣点匹配的医学图像配准新方法，称为查询点匹配器（QUIZ）。QUIZ关注局部-全局匹配点之间的对应关系：利用CNN提取特征，借助Transformer架构进行全局点匹配查询，随后通过平均位移实现局部图像的刚性变换。我们在一个宫颈癌患者的大形变数据集上验证了该方法，结果表明其偏差显著小于最先进的方法。值得注意的是，即使在跨模态对象上，它的结果也超过了当前最先进水平。
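The final step above turns matched query points into a rigid transform via average displacement. The sketch below shows that step (translation-only), plus an optional SVD-based Kabsch fit for a full rotation-plus-translation estimate; both are generic point-set operations rather than the paper's code.

```python
import torch

def average_displacement(src_pts: torch.Tensor, dst_pts: torch.Tensor) -> torch.Tensor:
    """Translation estimate from matched 3D points; src_pts, dst_pts are (N, 3)."""
    return (dst_pts - src_pts).mean(dim=0)          # (3,) displacement vector

def kabsch_rigid(src_pts: torch.Tensor, dst_pts: torch.Tensor):
    """Optional full rigid fit (rotation R, translation t) so that R @ src + t ~ dst."""
    src_c, dst_c = src_pts.mean(0), dst_pts.mean(0)
    H = (src_pts - src_c).t() @ (dst_pts - dst_c)   # 3x3 cross-covariance
    U, _, Vt = torch.linalg.svd(H)
    d = torch.linalg.det(Vt.t() @ U.t()).sign().item()   # fix possible reflection
    D = torch.diag(torch.tensor([1.0, 1.0, d]))
    R = Vt.t() @ D @ U.t()
    t = dst_c - R @ src_c
    return R, t

# t = average_displacement(matched_src, matched_dst)  # apply as a global shift to the moving image
```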

Pubic Symphysis-Fetal Head Segmentation Using Pure Transformer with Bi-level Routing Attention

  • paper_url: http://arxiv.org/abs/2310.00289
  • repo_url: None
  • paper_authors: Pengzhou Cai, Jiang Lu, Yanxin Li, Libin Lan
  • for: 提出一种用于解决耻骨联合-胎头分割任务的方法。
  • methods: 该方法采用类似U-Net的纯transformer架构，结合双层路由注意力与跳跃连接，能够有效学习局部-全局语义信息。
  • results: 该方法在耻骨联合-胎头分割与进展角（FH-PS-AOP）挑战赛的经会阴超声影像数据集上进行评估，取得了可比的最终分数。代码将在 GitHub 上公开。
    Abstract In this paper, we propose a method, named BRAU-Net, to solve the pubic symphysis-fetal head segmentation task. The method adopts a U-Net-like pure Transformer architecture with bi-level routing attention and skip connections, which effectively learns local-global semantic information. The proposed BRAU-Net was evaluated on transperineal Ultrasound images dataset from the pubic symphysis-fetal head segmentation and angle of progression (FH-PS-AOP) challenge. The results demonstrate that the proposed BRAU-Net achieves comparable a final score. The codes will be available at https://github.com/Caipengzhou/BRAU-Net.
    摘要 在这篇论文中，我们提出了一种名为BRAU-Net的方法，用于解决耻骨联合-胎头分割任务。该方法采用类似U-Net的纯Transformer架构，结合双层路由注意力与跳跃连接，能够有效学习局部-全局语义信息。我们在耻骨联合-胎头分割与进展角（FH-PS-AOP）挑战赛的经会阴超声影像数据集上对所提出的BRAU-Net进行了评估，结果表明其取得了可比的最终分数。代码将在 https://github.com/Caipengzhou/BRAU-Net 提供。

InFER: A Multi-Ethnic Indian Facial Expression Recognition Dataset

  • paper_url: http://arxiv.org/abs/2310.00287
  • repo_url: None
  • paper_authors: Syed Sameen Ahmad Rizvi, Preyansh Agrawal, Jagat Sesh Challa, Pratik Narang
  • for: 这个论文是为了开发一个面部表达识别系统,特别是针对印度次大陆的多元人种背景下的人脸表达识别。
  • methods: 这个论文使用了深度学习技术,并使用了10200张图片和4200段视频,包括7种基本的面部表达和6000张来自互联网的自然表达。
  • results: 这个论文通过实验表明,使用深度学习技术可以在印度次大陆的多元人种背景下实现高度的人脸表达识别精度。
    Abstract The rapid advancement in deep learning over the past decade has transformed Facial Expression Recognition (FER) systems, as newer methods have been proposed that outperform the existing traditional handcrafted techniques. However, such a supervised learning approach requires a sufficiently large training dataset covering all the possible scenarios. And since most people exhibit facial expressions based upon their age group, gender, and ethnicity, a diverse facial expression dataset is needed. This becomes even more crucial while developing a FER system for the Indian subcontinent, which comprises of a diverse multi-ethnic population. In this work, we present InFER, a real-world multi-ethnic Indian Facial Expression Recognition dataset consisting of 10,200 images and 4,200 short videos of seven basic facial expressions. The dataset has posed expressions of 600 human subjects, and spontaneous/acted expressions of 6000 images crowd-sourced from the internet. To the best of our knowledge InFER is the first of its kind consisting of images from 600 subjects from very diverse ethnicity of the Indian Subcontinent. We also present the experimental results of baseline & deep FER methods on our dataset to substantiate its usability in real-world practical applications.
    摘要 过去十年间深度学习的快速发展深刻改变了面部表情识别（FER）系统，新提出的方法已超越了传统的手工特征技术。然而，这种监督学习方法需要一个覆盖所有可能情形的足够大的训练数据集。由于人们的面部表情往往与其年龄、性别和民族相关，因此需要一个多样化的面部表情数据集。这一点在为拥有多元民族人口的印度次大陆开发FER系统时尤为重要。在这项工作中，我们提出了InFER，一个真实场景下的多民族印度面部表情识别数据集，包含10,200张图像和4,200段七种基本面部表情的短视频。该数据集包含600名被试的摆拍表情，以及从互联网众包收集的6,000张自发/表演表情图像。据我们所知，InFER是首个包含来自印度次大陆多元民族的600名被试图像的此类数据集。我们还给出了基线与深度FER方法在该数据集上的实验结果，以证明其在实际应用中的可用性。

Unleash Data Generation for Efficient and Effective Data-free Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2310.00258
  • repo_url: https://github.com/fw742211/nayer
  • paper_authors: Minh-Tuan Tran, Trung Le, Xuan-May Le, Mehrtash Harandi, Quan Hung Tran, Dinh Phung
  • for: 这篇论文的目的是提出一种新的无数据知识蒸馏（DFKD）方法，解决现有方法难以从缺乏有意义信息的随机噪声输入生成高质量数据的问题。
  • methods: 提出噪声层生成（NAYER）方法：将随机性来源从输入转移到一个噪声层，并以有意义的标签文本嵌入（LTE）作为输入。LTE蕴含丰富的类间信息，使得只需少量训练步骤即可生成高质量数据；噪声层则通过防止模型过度强调受限的标签信息，保证了生成样本的多样性。
  • results: 实验结果显示，NAYER不仅超越了现有方法，而且比以往方法快5到15倍。
    Abstract Data-Free Knowledge Distillation (DFKD) has recently made remarkable advancements with its core principle of transferring knowledge from a teacher neural network to a student neural network without requiring access to the original data. Nonetheless, existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information. Consequently, these models struggle to effectively map this noise to the ground-truth sample distribution, resulting in the production of low-quality data and imposing substantial time requirements for training the generator. In this paper, we propose a novel Noisy Layer Generation method (NAYER) which relocates the randomness source from the input to a noisy layer and utilizes the meaningful label-text embedding (LTE) as the input. The significance of LTE lies in its ability to contain substantial meaningful inter-class information, enabling the generation of high-quality samples with only a few training steps. Simultaneously, the noisy layer plays a key role in addressing the issue of diversity in sample generation by preventing the model from overemphasizing the constrained label information. By reinitializing the noisy layer in each iteration, we aim to facilitate the generation of diverse samples while still retaining the method's efficiency, thanks to the ease of learning provided by LTE. Experiments carried out on multiple datasets demonstrate that our NAYER not only outperforms the state-of-the-art methods but also achieves speeds 5 to 15 times faster than previous approaches.
    摘要 无数据知识蒸馏（DFKD）最近取得了显著进展，其核心思想是在不访问原始数据的情况下，将知识从教师神经网络传递给学生神经网络。然而，现有方法在从随机噪声输入生成样本时面临重大挑战，因为这类噪声本身缺乏有意义的信息，模型很难将其有效映射到真实样本分布，从而生成低质量数据，并且需要大量时间来训练生成器。在这篇论文中，我们提出了一种新的噪声层生成方法（NAYER），它将随机性来源从输入转移到一个噪声层，并使用有意义的标签文本嵌入（LTE）作为输入。LTE的重要性在于它蕴含大量有意义的类间信息，使得只需少量训练步骤即可生成高质量样本。同时，噪声层通过防止模型过度强调受限的标签信息，在样本生成多样性问题上发挥了关键作用。通过在每次迭代中重新初始化噪声层，我们在保持方法效率（得益于LTE带来的易学习性）的同时促进了多样化样本的生成。在多个数据集上进行的实验表明，我们的NAYER不仅超越了最先进的方法，而且比以往方法快5到15倍。
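A hedged sketch of the generation mechanism described above: a (frozen) label-text embedding is the generator input, randomness enters through a small noisy linear layer, and that layer is re-initialised at each synthesis round. The LTE source (a random stand-in here), dimensions and generator body are assumptions.

```python
import torch
import torch.nn as nn

class NoisyLayerGenerator(nn.Module):
    """Sketch of a NAYER-style generator: meaningful label-text embeddings (LTE)
    go in, a re-initialisable noisy linear layer injects diversity."""

    def __init__(self, lte_dim: int = 512, hidden: int = 256, img_shape=(3, 32, 32)):
        super().__init__()
        self.noisy_layer = nn.Linear(lte_dim, hidden)          # the randomness source
        out_dim = img_shape[0] * img_shape[1] * img_shape[2]
        self.body = nn.Sequential(nn.ReLU(inplace=True),
                                  nn.Linear(hidden, out_dim), nn.Tanh())
        self.img_shape = img_shape

    def reinit_noise(self):
        """Re-initialise only the noisy layer at the start of each synthesis round
        so that new, diverse samples can be generated in few steps."""
        self.noisy_layer.reset_parameters()

    def forward(self, lte: torch.Tensor) -> torch.Tensor:
        # lte: (N, lte_dim) label-text embeddings, e.g. from a frozen text encoder.
        x = self.body(self.noisy_layer(lte))
        return x.view(-1, *self.img_shape)

# g = NoisyLayerGenerator(); g.reinit_noise()
# fake = g(torch.randn(10, 512))   # stand-in LTE; a real LTE would come from a text encoder
```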

MMPI: a Flexible Radiance Field Representation by Multiple Multi-plane Images Blending

  • paper_url: http://arxiv.org/abs/2310.00249
  • repo_url: None
  • paper_authors: Yuze He, Peng Wang, Yubin Hu, Wang Zhao, Ran Yi, Yong-Jin Liu, Wenping Wang
  • for: 这篇论文旨在探讨基于多平面图像(MPI)的神经辐射场(NeRF)的高质量视图合成方法,以扩展现有的MPI-based NeRF方法到更复杂的场景中。
  • methods: 作者采用MPI parameterization的NeRF学习方法,并提出了一种基于多个MPI的适应混合操作,以模拟不同视角和摄像头分布的场景。
  • results: 实验结果表明，该方法能够针对多样的摄像头分布与视角高质量地合成新视图图像，训练速度快，优于此前的快速训练NeRF新视图合成方法。此外，作者还展示了该方法能够编码极长的轨迹并生成新视图渲染，表明其在自动驾驶等应用中的潜力。
    Abstract This paper presents a flexible representation of neural radiance fields based on multi-plane images (MPI), for high-quality view synthesis of complex scenes. MPI with Normalized Device Coordinate (NDC) parameterization is widely used in NeRF learning for its simple definition, easy calculation, and powerful ability to represent unbounded scenes. However, existing NeRF works that adopt MPI representation for novel view synthesis can only handle simple forward-facing unbounded scenes, where the input cameras are all observing in similar directions with small relative translations. Hence, extending these MPI-based methods to more complex scenes like large-range or even 360-degree scenes is very challenging. In this paper, we explore the potential of MPI and show that MPI can synthesize high-quality novel views of complex scenes with diverse camera distributions and view directions, which are not only limited to simple forward-facing scenes. Our key idea is to encode the neural radiance field with multiple MPIs facing different directions and blend them with an adaptive blending operation. For each region of the scene, the blending operation gives larger blending weights to those advantaged MPIs with stronger local representation abilities while giving lower weights to those with weaker representation abilities. Such blending operation automatically modulates the multiple MPIs to appropriately represent the diverse local density and color information. Experiments on the KITTI dataset and ScanNet dataset demonstrate that our proposed MMPI synthesizes high-quality images from diverse camera pose distributions and is fast to train, outperforming the previous fast-training NeRF methods for novel view synthesis. Moreover, we show that MMPI can encode extremely long trajectories and produce novel view renderings, demonstrating its potential in applications like autonomous driving.
    摘要 本文提出了一种基于多平面图像（MPI）的灵活神经辐射场表示，用于复杂场景的高质量视图合成。采用归一化设备坐标（NDC）参数化的MPI因其定义简单、计算方便以及表示无界场景的强大能力，被广泛用于NeRF学习。然而，现有采用MPI表示进行新视图合成的NeRF工作只能处理简单的朝前无界场景，即输入相机朝向相近且相对平移较小，因此将这些基于MPI的方法扩展到大范围甚至360度等更复杂的场景极具挑战性。本文探索了MPI的潜力，表明MPI可以针对具有多样相机分布和视角方向的复杂场景合成高质量的新视图，而不仅限于简单的朝前场景。我们的核心思想是使用多个朝向不同方向的MPI编码神经辐射场，并通过自适应混合操作将它们融合。对于场景中的每个区域，混合操作为局部表示能力更强的MPI赋予更大的混合权重，为表示能力较弱的MPI赋予较小的权重，从而自动调节多个MPI以恰当地表示局部多样的密度和颜色信息。在KITTI和ScanNet数据集上的实验表明，所提出的MMPI能够从多样的相机位姿分布中合成高质量图像，训练速度快，优于此前的快速训练NeRF新视图合成方法。此外，我们还展示了MMPI能够编码极长的轨迹并生成新视图渲染，体现了其在自动驾驶等应用中的潜力。
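The adaptive blending can be pictured as a per-pixel softmax over the candidate MPIs' renderings, weighted by how well each MPI represents that region. The snippet below sketches only the blending step, assuming the per-MPI colour renderings and confidence logits are already available; MPI rendering itself and the way confidences are learned are omitted and remain assumptions.

```python
import torch

def blend_mpi_renderings(colors: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
    """colors:      (M, 3, H, W) novel-view renderings from M MPIs
    confidences: (M, 1, H, W) per-pixel logits of how well each MPI represents this region
    Returns a single blended image of shape (3, H, W)."""
    weights = torch.softmax(confidences, dim=0)     # larger weight -> locally stronger MPI
    return (weights * colors).sum(dim=0)

# img = blend_mpi_renderings(torch.rand(4, 3, 120, 160), torch.randn(4, 1, 120, 160))
```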

Walking = Traversable? : Traversability Prediction via Multiple Human Object Tracking under Occlusion

  • paper_url: http://arxiv.org/abs/2310.00242
  • repo_url: None
  • paper_authors: Jonathan Tay Yu Liang, Kanji Tanaka
  • for: 该技术通过预测被遮挡地面的可通行性来改进室内机器人导航。
  • methods: 该方法用安装在观察机器人上的第三人称视角单目摄像头取代第一人称视角传感器，并使用SLAM和MOT两种跟踪器监测静止物体与移动行人及其交互。
  • results: 该方法在遮挡、非线性透视、深度不确定以及多人交叉等具有挑战性的视觉场景中仍能稳定地预测可通行性。
    Abstract The emerging ``Floor plan from human trails (PfH)" technique has great potential for improving indoor robot navigation by predicting the traversability of occluded floors. This study presents an innovative approach that replaces first-person-view sensors with a third-person-view monocular camera mounted on the observer robot. This approach can gather measurements from multiple humans, expanding its range of applications. The key idea is to use two types of trackers, SLAM and MOT, to monitor stationary objects and moving humans and assess their interactions. This method achieves stable predictions of traversability even in challenging visual scenarios, such as occlusions, nonlinear perspectives, depth uncertainty, and intersections involving multiple humans. Additionally, we extend map quality metrics to apply to traversability maps, facilitating future research. We validate our proposed method through fusion and comparison with established techniques.
    摘要 新兴的"基于人类行走轨迹的平面图（PfH）"技术通过预测被遮挡地面的可通行性，在改进室内机器人导航方面具有巨大潜力。本研究提出一种创新方法，用安装在观察机器人上的第三人称视角单目相机取代第一人称视角传感器。这种方式可以收集来自多个行人的测量数据，扩展了其应用范围。其核心思想是使用SLAM和MOT两类跟踪器分别监测静止物体和移动行人，并评估二者之间的交互。即使在遮挡、非线性透视、深度不确定以及多人交叉等具有挑战性的视觉场景中，该方法也能稳定地预测可通行性。此外，我们将地图质量度量扩展到可通行性地图，为后续研究提供便利。我们通过与已有技术的融合与比较验证了所提方法。

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2310.00240
  • repo_url: https://github.com/jiaosiyu1999/maft
  • paper_authors: Siyu Jiao, Yunchao Wei, Yaowei Wang, Yao Zhao, Humphrey Shi
  • for: The paper aims to improve the performance of zero-shot segmentation methods by addressing the insensitivity of CLIP to different mask proposals.
  • methods: The proposed method, Mask-aware Fine-tuning (MAFT), uses an Image-Proposals CLIP Encoder (IP-CLIP Encoder) to handle arbitrary numbers of image and mask proposals simultaneously. MAFT introduces mask-aware loss and self-distillation loss to fine-tune the IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while maintaining transferability.
  • results: With MAFT, the performance of state-of-the-art methods is promoted by a large margin on popular zero-shot benchmarks, including COCO, Pascal-VOC, and ADE20K. Specifically, the mIoU for unseen classes is improved by 8.2%, 3.2%, and 4.3% respectively.
    Abstract Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain the CLIP's zero-shot transferability, previous practices favour to freeze CLIP during training. However, in the paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. This issue mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of image and mask proposals simultaneously. Then, mask-aware loss and self-distillation loss are designed to fine-tune IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while not sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is promoted by a large margin: 50.4% (+ 8.2%) on COCO, 81.8% (+ 3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The code is available at https://github.com/jiaosiyu1999/MAFT.git.
    摘要 近期，预训练的视觉-语言模型被越来越多地用于解决具有挑战性的零样本分割任务。典型的解决方案遵循先生成掩码提案、再用CLIP对其进行分类的范式。为了保持CLIP的零样本迁移能力，以往的做法倾向于在训练期间冻结CLIP。然而，本文揭示了CLIP对不同掩码提案并不敏感，往往会为同一张图像的各种掩码提案产生相似的预测，这种不敏感性导致在对掩码提案分类时出现大量假阳性。该问题主要源于CLIP是以图像级监督训练的。为缓解这一问题，我们提出了一种简单而有效的方法，称为掩码感知微调（MAFT）。具体来说，我们提出了图像-提案CLIP编码器（IP-CLIP Encoder），可同时处理任意数量的图像和掩码提案；随后设计了掩码感知损失与自蒸馏损失来微调IP-CLIP Encoder，确保CLIP对不同掩码提案做出响应，同时不牺牲迁移能力。这样便可以轻松学习到掩码感知的表示，使真阳性脱颖而出。值得注意的是，我们的方案可以无缝接入大多数现有方法，微调过程中不引入任何新参数。我们在流行的零样本基准上进行了大量实验。借助MAFT，最先进方法的性能得到大幅提升：未见类别的mIoU在COCO上达到50.4%（+8.2%），在Pascal-VOC上达到81.8%（+3.2%），在ADE20K上达到8.7%（+4.3%）。代码见 https://github.com/jiaosiyu1999/MAFT.git。
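The mask-aware loss is meant to make CLIP's score for each mask proposal track the proposal's quality. A hedged sketch: regress the per-proposal class score towards the proposal's IoU with the ground-truth mask of its assigned class. The exact formulation and loss weights in the paper may differ.

```python
import torch
import torch.nn.functional as F

def mask_aware_loss(clip_logits: torch.Tensor, proposal_ious: torch.Tensor,
                    gt_classes: torch.Tensor) -> torch.Tensor:
    """clip_logits:   (P, K) class logits from the IP-CLIP encoder for P mask proposals
    proposal_ious: (P,)   IoU of each proposal with the GT mask of its assigned class
    gt_classes:    (P,)   assigned class index per proposal
    Pushes the per-proposal class score towards the proposal's quality (IoU)."""
    scores = torch.sigmoid(clip_logits).gather(1, gt_classes.view(-1, 1)).squeeze(1)
    return F.smooth_l1_loss(scores, proposal_ious)

# loss = mask_aware_loss(torch.randn(8, 171), torch.rand(8), torch.randint(0, 171, (8,)))
```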

Domain-Controlled Prompt Learning

  • paper_url: http://arxiv.org/abs/2310.07730
  • repo_url: https://github.com/Sfedfcv/redesigned-pancake
  • paper_authors: Qinglong Cao, Zhengqin Xu, Yuantian Chen, Chao Ma, Xiaokang Yang
  • for: 本研究旨在将预训练的视觉-语言模型适配到遥感图像（RSI）、医学影像等特殊领域。
  • methods: 提出领域控制的提示学习（Domain-Controlled Prompt Learning，DCPL）：利用大规模特殊领域基础模型（LSDM）提供特殊领域知识，并使用轻量级神经网络将这些知识转化为领域偏置，直接控制视觉与语言两个分支的提示。
  • results: 该方法在特殊领域图像识别数据集上取得了最先进的性能。
    Abstract Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization capabilities across various tasks when appropriate text prompts are provided. However, adapting these models to specialized domains, like remote sensing images (RSIs), medical images, etc, remains unexplored and challenging. Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms, leading to suboptimal performance due to the misinterpretation of specialized images in natural image patterns. To tackle this dilemma, we proposed a Domain-Controlled Prompt Learning for the specialized domains. Specifically, the large-scale specialized domain foundation model (LSDM) is first introduced to provide essential specialized domain knowledge. Using lightweight neural networks, we transfer this knowledge into domain biases, which control both the visual and language branches to obtain domain-adaptive prompts in a directly incorporating manner. Simultaneously, to overcome the existing overfitting challenge, we propose a novel noisy-adding strategy, without extra trainable parameters, to help the model escape the suboptimal solution in a global domain oscillation manner. Experimental results show our method achieves state-of-the-art performance in specialized domain image recognition datasets. Our code is available at https://anonymous.4open.science/r/DCPL-8588.
    摘要 大型预训练视觉-语言模型（如CLIP）在提供合适文本提示的情况下，于各类任务中展现出了出色的泛化能力。然而，如何将这些模型适配到遥感图像（RSI）、医学影像等特殊领域仍是一个尚未充分探索且具有挑战性的问题。现有的提示学习方法通常缺乏领域感知或领域迁移机制，使特殊图像被按自然图像的模式误读，导致性能欠佳。为了解决这一困境，我们提出了面向特殊领域的领域控制提示学习（DCPL）。具体来说，首先引入大规模特殊领域基础模型（LSDM）以提供必要的特殊领域知识；随后借助轻量级神经网络将这些知识转化为领域偏置，以直接并入的方式同时控制视觉与语言分支，从而获得领域自适应的提示。同时，为克服现有的过拟合问题，我们提出了一种无需额外可训练参数的噪声添加策略，帮助模型以全局领域振荡的方式摆脱次优解。实验结果表明，我们的方法在特殊领域图像识别数据集上取得了最先进的性能。代码见 https://anonymous.4open.science/r/DCPL-8588。
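A hedged sketch of the central mechanism: a lightweight network maps LSDM features to a domain bias that is added to learnable prompt context tokens (shared by the visual and language branches), with the noisy-adding step modelled as a Gaussian perturbation of the bias during training. Dimensions, the bias network and the noise model are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DomainControlledPrompt(nn.Module):
    """Learnable prompt context + a domain bias predicted from LSDM features."""

    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512, lsdm_dim: int = 768,
                 noise_std: float = 0.05):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)   # shared context tokens
        self.bias_net = nn.Sequential(nn.Linear(lsdm_dim, ctx_dim // 4), nn.ReLU(inplace=True),
                                      nn.Linear(ctx_dim // 4, ctx_dim))
        self.noise_std = noise_std

    def forward(self, lsdm_feat: torch.Tensor) -> torch.Tensor:
        """lsdm_feat: (N, lsdm_dim) domain-knowledge features -> (N, n_ctx, ctx_dim) prompts."""
        bias = self.bias_net(lsdm_feat)                     # (N, ctx_dim) domain bias
        if self.training:                                   # noisy-adding, no extra parameters
            bias = bias + torch.randn_like(bias) * self.noise_std
        return self.ctx.unsqueeze(0) + bias.unsqueeze(1)    # domain-adaptive prompt tokens

# prompts = DomainControlledPrompt()(torch.randn(4, 768))   # fed to the text (and visual) branch
```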

Pixel-Inconsistency Modeling for Image Manipulation Localization

  • paper_url: http://arxiv.org/abs/2310.00234
  • repo_url: None
  • paper_authors: Chenqi Kong, Anwei Luo, Shiqi Wang, Haoliang Li, Anderson Rocha, Alex C. Kot
  • for: 本研究旨在提高图像篡改定位的泛化能力与鲁棒性，以便更好地识别和定位伪造区域。
  • methods: 提出一种基于像素不一致痕迹分析的篡改定位模型：将输入图像分块并设计带掩码的自注意力机制建模全局像素依赖，同时优化局部像素依赖流挖掘局部篡改线索；并提出新的学习加权模块（LWM）融合两条流的特征，以提升最终的定位性能。
  • results: 实验结果表明，该方法能够成功捕捉固有的像素不一致伪造指纹，并在不同数据集和扰动图像上展现出优秀的泛化能力与鲁棒性。
    Abstract Digital image forensics plays a crucial role in image authentication and manipulation localization. Despite the progress powered by deep neural networks, existing forgery localization methodologies exhibit limitations when deployed to unseen datasets and perturbed images (i.e., lack of generalization and robustness to real-world applications). To circumvent these problems and aid image integrity, this paper presents a generalized and robust manipulation localization model through the analysis of pixel inconsistency artifacts. The rationale is grounded on the observation that most image signal processors (ISP) involve the demosaicing process, which introduces pixel correlations in pristine images. Moreover, manipulating operations, including splicing, copy-move, and inpainting, directly affect such pixel regularity. We, therefore, first split the input image into several blocks and design masked self-attention mechanisms to model the global pixel dependency in input images. Simultaneously, we optimize another local pixel dependency stream to mine local manipulation clues within input forgery images. In addition, we design novel Learning-to-Weight Modules (LWM) to combine features from the two streams, thereby enhancing the final forgery localization performance. To improve the training process, we propose a novel Pixel-Inconsistency Data Augmentation (PIDA) strategy, driving the model to focus on capturing inherent pixel-level artifacts instead of mining semantic forgery traces. This work establishes a comprehensive benchmark integrating 15 representative detection models across 12 datasets. Extensive experiments show that our method successfully extracts inherent pixel-inconsistency forgery fingerprints and achieve state-of-the-art generalization and robustness performances in image manipulation localization.
    摘要 数字图像取证在图像真实性鉴别与篡改定位中扮演着关键角色。尽管在深度神经网络的推动下取得了进展，现有的伪造定位方法在部署到未见数据集和受扰动图像时仍表现出局限性（即缺乏面向真实应用的泛化能力与鲁棒性）。为了解决这些问题并维护图像完整性，本文通过分析像素不一致痕迹，提出了一种具备泛化能力和鲁棒性的篡改定位模型。其出发点在于，大多数图像信号处理器（ISP）都包含去马赛克过程，该过程会在原始图像中引入像素相关性；而拼接、复制-移动、修补等篡改操作会直接破坏这种像素规律。因此，我们首先将输入图像划分为若干块，并设计带掩码的自注意力机制来建模输入图像的全局像素依赖关系；同时优化另一条局部像素依赖流，以挖掘输入伪造图像中的局部篡改线索。此外，我们设计了新的学习加权模块（LWM）来融合两条流的特征，从而提升最终的伪造定位性能。为了改进训练过程，我们提出了一种新的像素不一致数据增强（PIDA）策略，促使模型专注于捕捉固有的像素级痕迹，而非挖掘语义层面的伪造迹象。本工作建立了一个综合基准，涵盖12个数据集上的15种代表性检测模型。大量实验表明，我们的方法能够成功提取固有的像素不一致伪造指纹，并在图像篡改定位中取得了最先进的泛化能力与鲁棒性表现。
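The Learning-to-Weight Module is not specified in the abstract; the sketch below assumes a small gating network that predicts per-pixel weights from the concatenated global and local streams and returns their weighted sum.

```python
import torch
import torch.nn as nn

class LearningToWeight(nn.Module):
    """Fuse a global-dependency stream and a local-clue stream with learned per-pixel weights."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels // 2, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(channels // 2, 2, 1))   # one weight map per stream

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(torch.cat([f_global, f_local], dim=1)), dim=1)
        return w[:, :1] * f_global + w[:, 1:] * f_local

# fused = LearningToWeight(64)(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```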

Scaling for Training Time and Post-hoc Out-of-distribution Detection Enhancement

  • paper_url: http://arxiv.org/abs/2310.00227
  • repo_url: https://github.com/kai422/scale
  • paper_authors: Kai Xu, Rongyu Chen, Gianni Franchi, Angela Yao
  • for: 这篇论文关注深度学习系统中的分布外（OOD）检测任务，特别是最新的激活整形（ASH）方法。
  • methods: 论文提出了两种新的OOD检测方法：1) SCALE，一种后处理网络增强方法，可以在不损害分布内（ID）准确率的情况下取得最先进的OOD检测性能；2) 中间张量整形（Intermediate Tensor SHaping，ISH），一种轻量级的训练时OOD检测增强方法。
  • results: 论文在OpenOOD v1.5 ImageNet-1K基准上报告了AUROC提升，近OOD数据集为+1.85%，远OOD数据集为+0.74%，表明所提方法对OOD检测有效。
    Abstract The capacity of a modern deep learning system to determine if a sample falls within its realm of knowledge is fundamental and important. In this paper, we offer insights and analyses of recent state-of-the-art out-of-distribution (OOD) detection methods - extremely simple activation shaping (ASH). We demonstrate that activation pruning has a detrimental effect on OOD detection, while activation scaling enhances it. Moreover, we propose SCALE, a simple yet effective post-hoc network enhancement method for OOD detection, which attains state-of-the-art OOD detection performance without compromising in-distribution (ID) accuracy. By integrating scaling concepts into the training process to capture a sample's ID characteristics, we propose Intermediate Tensor SHaping (ISH), a lightweight method for training time OOD detection enhancement. We achieve AUROC scores of +1.85\% for near-OOD and +0.74\% for far-OOD datasets on the OpenOOD v1.5 ImageNet-1K benchmark. Our code and models are available at https://github.com/kai422/SCALE.
    摘要 现代深度学习系统判断一个样本是否落在其知识范围内的能力十分重要。在这篇论文中，我们对最新的分布外（OOD）检测方法——极简的激活整形（ASH）——进行了深入分析，并提供了相应的见解。我们表明，激活剪除会对OOD检测产生不利影响，而激活缩放则会增强它。此外，我们提出了SCALE，一种简单而有效的后处理网络增强方法，在不损害分布内（ID）准确率的情况下取得了最先进的OOD检测性能。通过将缩放概念融入训练过程以捕捉样本的ID特征，我们进一步提出了中间张量整形（ISH），一种轻量级的训练时OOD检测增强方法。在OpenOOD v1.5 ImageNet-1K基准上，我们在近OOD数据集上取得了+1.85%、在远OOD数据集上取得了+0.74%的AUROC提升。我们的代码和模型可在 https://github.com/kai422/SCALE 获取。
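A simplified sketch of the post-hoc scaling idea as described above: derive a per-sample factor from the penultimate activation statistics (total mass over top-percentile mass), scale the whole activation vector without pruning anything, and score OOD with the energy score. The percentile and the exact form of the factor are assumptions; see the released code for the authors' implementation.

```python
import torch

def scale_activations(feat: torch.Tensor, percentile: float = 85.0) -> torch.Tensor:
    """feat: (N, D) penultimate activations. Scale every activation by exp(s1/s2),
    where s1 is the total activation mass and s2 the mass above the given percentile.
    (Simplified sketch of the scaling idea; no activations are pruned.)"""
    thresh = torch.quantile(feat, percentile / 100.0, dim=1, keepdim=True)
    s1 = feat.clamp(min=0).sum(dim=1, keepdim=True)
    s2 = (feat * (feat >= thresh)).clamp(min=0).sum(dim=1, keepdim=True)
    return feat * torch.exp(s1 / (s2 + 1e-12))

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Higher energy score -> more likely in-distribution."""
    return temperature * torch.logsumexp(logits / temperature, dim=1)

# feat = backbone(x); logits = classifier(scale_activations(feat)); score = energy_score(logits)
```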

LSOR: Longitudinally-Consistent Self-Organized Representation Learning

  • paper_url: http://arxiv.org/abs/2310.00213
  • repo_url: https://github.com/ouyangjiahong/longitudinal-som-single-modality
  • paper_authors: Jiahong Ouyang, Qingyu Zhao, Ehsan Adeli, Wei Peng, Greg Zaharchuk, Kilian M. Pohl
  • for: 提出一种仅基于纵向脑部MRI的自监督SOM方法，以提高深度学习模型在纵向MRI上的可解释性。
  • methods: 该方法利用自组织映射（SOM）将高维潜在空间划分为多个集群，并将集群中心映射到一个离散（通常为2D）的网格上，同时保持集群之间的高维关系；通过软聚类保证训练稳定，并将纵向MRI推断出的轨迹与相应SOM集群的参考向量对齐，得到按脑龄分层的潜在空间。
  • results: 该方法能够在纵向MRI上生成可解释的潜在空间，并在下游诊断任务上达到或超过当前最先进表示的性能。代码见 https://github.com/ouyangjiahong/longitudinal-som-single-modality。
    Abstract Interpretability is a key issue when applying deep learning models to longitudinal brain MRIs. One way to address this issue is by visualizing the high-dimensional latent spaces generated by deep learning via self-organizing maps (SOM). SOM separates the latent space into clusters and then maps the cluster centers to a discrete (typically 2D) grid preserving the high-dimensional relationship between clusters. However, learning SOM in a high-dimensional latent space tends to be unstable, especially in a self-supervision setting. Furthermore, the learned SOM grid does not necessarily capture clinically interesting information, such as brain age. To resolve these issues, we propose the first self-supervised SOM approach that derives a high-dimensional, interpretable representation stratified by brain age solely based on longitudinal brain MRIs (i.e., without demographic or cognitive information). Called Longitudinally-consistent Self-Organized Representation learning (LSOR), the method is stable during training as it relies on soft clustering (vs. the hard cluster assignments used by existing SOM). Furthermore, our approach generates a latent space stratified according to brain age by aligning trajectories inferred from longitudinal MRIs to the reference vector associated with the corresponding SOM cluster. When applied to longitudinal MRIs of the Alzheimer's Disease Neuroimaging Initiative (ADNI, N=632), LSOR generates an interpretable latent space and achieves comparable or higher accuracy than the state-of-the-art representations with respect to the downstream tasks of classification (static vs. progressive mild cognitive impairment) and regression (determining ADAS-Cog score of all subjects). The code is available at https://github.com/ouyangjiahong/longitudinal-som-single-modality.
    摘要 在将深度学习模型应用于纵向脑部MRI时，可解释性是一个关键问题。解决该问题的一种途径是利用自组织映射（SOM）对深度学习生成的高维潜在空间进行可视化：SOM将潜在空间划分为多个集群，并将集群中心映射到一个离散（通常为二维）的网格上，同时保持集群之间的高维关系。然而，在高维潜在空间中学习SOM往往不稳定，在自监督设置下尤甚；并且学习到的SOM网格不一定能捕捉脑龄等具有临床意义的信息。为解决这些问题，我们提出了首个自监督SOM方法，仅基于纵向脑部MRI（即不使用人口学或认知信息）得到按脑龄分层的高维可解释表示。该方法称为纵向一致的自组织表示学习（LSOR），由于采用软聚类（而非现有SOM所用的硬聚类分配），其训练过程稳定。此外，该方法通过将纵向MRI推断出的轨迹与相应SOM集群的参考向量对齐，生成按脑龄分层的潜在空间。在阿尔茨海默病神经影像计划（ADNI，N=632）的纵向MRI上，LSOR生成了可解释的潜在空间，并在分类（稳定型与进展型轻度认知障碍）和回归（预测所有被试的ADAS-Cog评分）等下游任务上取得了与最先进表示相当或更高的准确率。代码见 https://github.com/ouyangjiahong/longitudinal-som-single-modality。
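A hedged sketch of the soft-clustering step that stabilises SOM training: each embedding receives a softmax weight over the SOM grid prototypes, and the commitment loss uses those soft weights instead of a hard winner-take-all assignment. Grid size, temperature and the loss form are assumptions, and the brain-age trajectory alignment is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSOM(nn.Module):
    """Soft self-organizing map over a (rows x cols) grid of prototypes."""

    def __init__(self, dim: int, rows: int = 8, cols: int = 8, temperature: float = 0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(rows * cols, dim) * 0.01)
        self.temperature = temperature

    def forward(self, z: torch.Tensor):
        """z: (N, dim) embeddings -> soft assignments (N, rows*cols) and a commitment loss."""
        dist = torch.cdist(z, self.prototypes)              # (N, M) Euclidean distances
        soft_assign = F.softmax(-dist / self.temperature, dim=1)
        # Pull embeddings toward prototypes in proportion to their soft assignment.
        commit_loss = (soft_assign * dist.pow(2)).sum(dim=1).mean()
        return soft_assign, commit_loss

# assign, loss = SoftSOM(dim=128)(torch.randn(16, 128))
```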

DeformUX-Net: Exploring a 3D Foundation Backbone for Medical Image Segmentation with Depthwise Deformable Convolution

  • paper_url: http://arxiv.org/abs/2310.00199
  • repo_url: https://github.com/masilab/deform-uxnet
  • paper_authors: Ho Hin Lee, Quan Liu, Qi Yang, Xin Yu, Shunxing Bao, Yuankai Huo, Bennett A. Landman
  • for: The paper is focused on improving medical image segmentation using 3D ViTs and deformable convolution.
  • methods: The proposed model, 3D DeformUX-Net, combines long-range dependency, adaptive spatial aggregation, and computational efficiency by revisiting volumetric deformable convolution in a depth-wise setting. The model also includes a parallel branch for generating deformable tri-planar offsets, which provides adaptive spatial aggregation across all channels.
  • results: The proposed model consistently outperforms existing state-of-the-art ViTs and large kernel convolution models across four challenging public datasets, achieving better segmentation results in terms of mean Dice.
    Abstract The application of 3D ViTs to medical image segmentation has seen remarkable strides, somewhat overshadowing the budding advancements in Convolutional Neural Network (CNN)-based models. Large kernel depthwise convolution has emerged as a promising technique, showcasing capabilities akin to hierarchical transformers and facilitating an expansive effective receptive field (ERF) vital for dense predictions. Despite this, existing core operators, ranging from global-local attention to large kernel convolution, exhibit inherent trade-offs and limitations (e.g., global-local range trade-off, aggregating attentional features). We hypothesize that deformable convolution can be an exploratory alternative to combine all advantages from the previous operators, providing long-range dependency, adaptive spatial aggregation and computational efficiency as a foundation backbone. In this work, we introduce 3D DeformUX-Net, a pioneering volumetric CNN model that adeptly navigates the shortcomings traditionally associated with ViTs and large kernel convolution. Specifically, we revisit volumetric deformable convolution in depth-wise setting to adapt long-range dependency with computational efficiency. Inspired by the concepts of structural re-parameterization for convolution kernel weights, we further generate the deformable tri-planar offsets by adapting a parallel branch (starting from $1\times1\times1$ convolution), providing adaptive spatial aggregation across all channels. Our empirical evaluations reveal that the 3D DeformUX-Net consistently outperforms existing state-of-the-art ViTs and large kernel convolution models across four challenging public datasets, spanning various scales from organs (KiTS: 0.680 to 0.720, MSD Pancreas: 0.676 to 0.717, AMOS: 0.871 to 0.902) to vessels (e.g., MSD hepatic vessels: 0.635 to 0.671) in mean Dice.
    摘要 三维ViT在医学图像分割中的应用取得了显著进展，在一定程度上掩盖了基于卷积神经网络（CNN）模型的新兴进展。大核深度卷积（large kernel depthwise convolution）已成为一种有前景的技术，展现出类似层次transformer的能力，并能够提供对稠密预测至关重要的大范围有效感受野（ERF）。尽管如此，现有的核心算子——从全局-局部注意力到大核卷积——都存在固有的权衡与局限（例如全局-局部范围的权衡、注意力特征的聚合方式）。我们假设可变形卷积可以作为一种探索性的替代方案，兼具上述算子的优点，以长程依赖、自适应空间聚合和计算效率作为基础骨干。在这项工作中，我们提出了3D DeformUX-Net，一种开创性的体积CNN模型，能够巧妙地规避传统上与ViT和大核卷积相关的缺陷。具体来说，我们在逐通道（depth-wise）设置下重新审视体积可变形卷积，以在保证计算效率的同时适应长程依赖；受卷积核权重结构重参数化思想的启发，我们进一步通过一个并行分支（从1×1×1卷积出发）生成可变形三平面偏移，在所有通道上提供自适应空间聚合。实验评估表明，3D DeformUX-Net在四个具有挑战性的公开数据集上（从器官到血管，平均Dice：KiTS由0.680提升至0.720，MSD Pancreas由0.676提升至0.717，AMOS由0.871提升至0.902，MSD肝血管由0.635提升至0.671）持续优于现有最先进的ViT与大核卷积模型。
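The paper's operator is volumetric, but torchvision only ships a 2D deformable convolution, so the sketch below illustrates the depth-wise deformable idea in 2D with `torchvision.ops.deform_conv2d`, with offsets predicted by a parallel 1x1 branch. Treat it as a conceptual analogue of the building block, not the 3D DeformUX-Net layer.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DepthwiseDeformConv2d(nn.Module):
    """Depth-wise deformable 2D convolution: one 3x3 kernel per channel,
    sampling offsets predicted by a parallel 1x1 branch."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Depth-wise weight: (out_ch, in_ch/groups = 1, k, k) -> groups inferred as `channels`.
        self.weight = nn.Parameter(torch.randn(channels, 1, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(channels))
        # Parallel branch (1x1 conv) generating 2 offsets per kernel tap.
        self.offset_branch = nn.Conv2d(channels, 2 * kernel_size * kernel_size, 1)
        nn.init.zeros_(self.offset_branch.weight)
        nn.init.zeros_(self.offset_branch.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_branch(x)                     # (N, 2*k*k, H, W)
        return deform_conv2d(x, offsets, self.weight, self.bias,
                             padding=self.k // 2)

# y = DepthwiseDeformConv2d(32)(torch.randn(2, 32, 64, 64))  # -> (2, 32, 64, 64)
```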