cs.CV - 2023-07-28

TriadNet: Sampling-free predictive intervals for lesional volume in 3D brain MR images

  • paper_url: http://arxiv.org/abs/2307.15638
  • repo_url: https://github.com/benolmbrt/TriadNet
  • paper_authors: Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat
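  • for: Estimating the volume of a brain lesion (e.g. infarct or tumor), a strong indicator of patient prognosis that can be used to guide the therapeutic strategy.
  • methods: Segmentation with a deep convolutional neural network (CNN), currently the state-of-the-art approach; TriadNet relies on a multi-head CNN architecture.
  • results: Proposes TriadNet, which provides the lesion volume and the associated predictive interval simultaneously, in less than a second. TriadNet outperforms other solutions on BraTS 2021, a large-scale MRI glioblastoma image database.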
    Abstract The volume of a brain lesion (e.g. infarct or tumor) is a powerful indicator of patient prognosis and can be used to guide the therapeutic strategy. Lesional volume estimation is usually performed by segmentation with deep convolutional neural networks (CNN), currently the state-of-the-art approach. However, to date, little work has been done to equip volume segmentation tools with adequate quantitative predictive intervals, which can hinder their usefulness and acceptance in clinical practice. In this work, we propose TriadNet, a segmentation approach relying on a multi-head CNN architecture, which provides both the lesion volumes and the associated predictive intervals simultaneously, in less than a second. We demonstrate its superiority over other solutions on BraTS 2021, a large-scale MRI glioblastoma image database.
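    Summary The volume of a lesion (e.g. infarct or tumor) is an important prognostic indicator for the patient and can be used to guide the therapeutic strategy. Lesional volume is usually estimated with deep convolutional neural networks (CNN), currently the state-of-the-art approach. To date, however, little work has gone into equipping volume segmentation tools with adequate quantitative predictive intervals, which limits their use and acceptance in clinical practice. We propose TriadNet, a segmentation method based on a multi-head CNN architecture that provides the lesion volume and the associated predictive interval simultaneously, in under a second. We demonstrate its advantages on BraTS 2021, a large-scale MRI glioblastoma image database.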
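    To make the "volume plus predictive interval" output concrete, here is a minimal Python sketch that converts segmentation masks into a volume with lower and upper bounds. It assumes, purely for illustration, that the network heads yield three binary masks (lower bound, point estimate, upper bound); the abstract does not spell out what each head produces, and the names `mask_volume_ml` and `volume_with_interval` are hypothetical.

```python
from typing import List, Tuple

# Nested lists of 0/1 voxels, indexed [z][y][x] (illustrative data layout).
Mask3D = List[List[List[int]]]


def mask_volume_ml(mask: Mask3D, spacing_mm: Tuple[float, float, float]) -> float:
    """Volume of a binary mask in millilitres (1 mL = 1000 mm^3)."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(v for plane in mask for row in plane for v in row)
    return n_voxels * voxel_mm3 / 1000.0


def volume_with_interval(
    lower: Mask3D, mean: Mask3D, upper: Mask3D,
    spacing_mm: Tuple[float, float, float],
) -> Tuple[float, Tuple[float, float]]:
    """Point volume plus (low, high) bounds taken from the bounding masks.

    Assumption: the three masks come from three heads of one network,
    so all three volumes are available after a single forward pass.
    """
    v_low = mask_volume_ml(lower, spacing_mm)
    v_mean = mask_volume_ml(mean, spacing_mm)
    v_up = mask_volume_ml(upper, spacing_mm)
    return v_mean, (v_low, v_up)
```

    Because the interval is read off masks from one forward pass, no repeated sampling is needed, which is consistent with the sub-second runtime the abstract reports.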

A Survey on Deep Learning in Medical Image Registration: New Technologies, Uncertainty, Evaluation Metrics, and Beyond

  • paper_url: http://arxiv.org/abs/2307.15615
  • repo_url: None
  • paper_authors: Junyu Chen, Yihao Liu, Shuwen Wei, Zhangxing Bian, Shalini Subramanian, Aaron Carass, Jerry L. Prince, Yong Du
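  • for: Reviews the most recent advances of deep learning in medical image registration.
  • methods: Surveys innovative network architectures, registration-specific loss functions, and methods for estimating registration uncertainty, along with evaluation metrics suited to assessing deep learning models in registration tasks.
  • results: Discusses practical applications of deep learning-based registration, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration.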
    Abstract Over the past decade, deep learning technologies have greatly advanced the field of medical image registration. The initial developments, such as ResNet-based and U-Net-based networks, laid the groundwork for deep learning-driven image registration. Subsequent progress has been made in various aspects of deep learning-based registration, including similarity measures, deformation regularizations, and uncertainty estimation. These advancements have not only enriched the field of deformable image registration but have also facilitated its application in a wide range of tasks, including atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we present a comprehensive overview of the most recent advancements in deep learning-based image registration. We begin with a concise introduction to the core concepts of deep learning-based image registration. Then, we delve into innovative network architectures, loss functions specific to registration, and methods for estimating registration uncertainty. Additionally, this paper explores appropriate evaluation metrics for assessing the performance of deep learning models in registration tasks. Finally, we highlight the practical applications of these novel techniques in medical imaging and discuss the future prospects of deep learning-based image registration.
    Summary Over the past decade, deep learning has greatly advanced medical image registration. Early developments, such as ResNet-based and U-Net-based networks, laid the groundwork for deep learning-driven image registration. Subsequent progress covers similarity measures, deformation regularization, and registration uncertainty estimation. These advances have enriched the field of image registration and facilitated its application to tasks such as atlas construction, multi-atlas segmentation, motion estimation, and 2D-3D registration. In this paper, we provide a comprehensive overview of the latest advances in deep learning-based image registration. We start with an introduction to the core concepts, then examine new network architectures, registration-specific loss functions, and methods for estimating registration uncertainty. The paper also covers evaluation metrics for deep learning models in registration tasks, as well as the practical applications and future prospects of these techniques in medical imaging.
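    As a rough illustration of how a similarity measure and a deformation regularizer combine into a registration loss, here is a minimal 1D Python sketch. It uses mean squared error and a diffusion-style smoothness penalty, two common choices picked for this example rather than taken from the survey; the function names and the weight `lam` are hypothetical.

```python
from typing import Sequence


def mse(fixed: Sequence[float], warped: Sequence[float]) -> float:
    """Mean squared intensity difference, a simple similarity measure."""
    return sum((f - w) ** 2 for f, w in zip(fixed, warped)) / len(fixed)


def diffusion_penalty(displacement: Sequence[float]) -> float:
    """Mean squared forward difference of a 1D displacement field.

    Large jumps between neighbouring displacements are penalised,
    which favours smooth deformations.
    """
    diffs = [b - a for a, b in zip(displacement, displacement[1:])]
    return sum(d * d for d in diffs) / len(diffs) if diffs else 0.0


def registration_loss(
    fixed: Sequence[float],
    warped: Sequence[float],
    displacement: Sequence[float],
    lam: float = 0.1,
) -> float:
    """Similarity term plus lam times the smoothness term."""
    return mse(fixed, warped) + lam * diffusion_penalty(displacement)
```

    Raising `lam` trades image agreement for a smoother deformation; real 2D or 3D pipelines apply the same idea per spatial axis.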

Integrated Digital Reconstruction of Welded Components: Supporting Improved Fatigue Life Prediction

  • paper_url: http://arxiv.org/abs/2307.15604
  • repo_url: None
  • paper_authors: Anders Faarbæk Mikkelstrup, Morten Kristiansen
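  • for: Supporting better fatigue life prediction for welded components, such as the welded joints of offshore jacket foundations.
  • methods: Integrates digital reconstruction of the weld into an automated high-frequency mechanical impact (HFMI) treatment setup, using an industrial manipulator combined with a line scanner.
  • results: A generic, cost-effective, flexible, and rapid digital reconstruction method that aids component design, overall quality assurance, and documentation of the HFMI treatment.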
    Abstract In the design of offshore jacket foundations, fatigue life is crucial. Post-weld treatment has been proposed to enhance the fatigue performance of welded joints, where particularly high-frequency mechanical impact (HFMI) treatment has been shown to improve fatigue performance significantly. Automated HFMI treatment has improved quality assurance and can lead to cost-effective design when combined with accurate fatigue life prediction. However, the finite element method (FEM), commonly used for predicting fatigue life in complex or multi-axial joints, relies on a basic CAD depiction of the weld, failing to consider the actual weld geometry and defects. Including the actual weld geometry in the FE model improves fatigue life prediction and possible crack location prediction but requires a digital reconstruction of the weld. Current digital reconstruction methods are time-consuming or require specialised scanning equipment and potential component relocation. The proposed framework instead uses an industrial manipulator combined with a line scanner to integrate digital reconstruction as part of the automated HFMI treatment setup. This approach applies standard image processing, simple filtering techniques, and non-linear optimisation for aligning and merging overlapping scans. A screened Poisson surface reconstruction finalises the 3D model to create a meshed surface. The outcome is a generic, cost-effective, flexible, and rapid method that enables generic digital reconstruction of welded parts, aiding in component design, overall quality assurance, and documentation of the HFMI treatment.
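    Summary Fatigue life is crucial in the design of offshore jacket foundations. Post-weld treatment has been proposed to enhance the fatigue performance of welded joints, and high-frequency mechanical impact (HFMI) treatment in particular has been shown to improve it significantly. Automated HFMI treatment improves quality assurance and, combined with accurate fatigue life prediction, can lead to cost-effective designs. However, the finite element method (FEM), commonly used to predict fatigue life in complex or multi-axial joints, relies on a basic CAD depiction of the weld and ignores the actual weld geometry and defects. Including the actual weld geometry in the FE model improves the prediction of fatigue life and of possible crack locations, but requires a digital reconstruction of the weld. Existing digital reconstruction methods are time-consuming or require specialised scanning equipment, and may require relocating the component. The proposed framework combines an industrial manipulator with a line scanner to integrate digital reconstruction into the automated HFMI treatment setup. The approach uses standard image processing, simple filtering techniques, and non-linear optimisation to align and merge overlapping scans. A screened Poisson surface reconstruction then produces the final meshed 3D model. The result is a generic, cost-effective, flexible, and rapid method for digitally reconstructing welded parts, supporting component design, overall quality assurance, and documentation of the HFMI treatment.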

OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic Segmentation of Road Scenes

  • paper_url: http://arxiv.org/abs/2307.15588
  • repo_url: https://github.com/feibryantkit/oafuser
  • paper_authors: Fei Teng, Jiaming Zhang, Kunyu Peng, Kailun Yang, Yaonan Wang, Rainer Stiefelhagen
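  • for: Improving image semantic segmentation for scene understanding in autonomous driving by exploiting the rich angular and spatial information provided by light field cameras.
  • methods: Proposes the Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and angular information from the sub-aperture images to generate semantically consistent results, together with a Sub-Aperture Fusion Module (SAFM) that embeds sub-aperture images into angular features at no additional memory cost.
  • results: State-of-the-art performance on the UrbanLF-Real and UrbanLF-Syn datasets, and a new record of 84.93% mIoU on the UrbanLF-Real Extended dataset, a gain of +4.53%.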
    Abstract Light field cameras can provide rich angular and spatial information to enhance image semantic segmentation for scene understanding in the field of autonomous driving. However, the extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resource of intelligent vehicles. Besides, inappropriate compression leads to information corruption and data loss. To excavate representative information, we propose an Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and discovers the angular information from sub-aperture images to generate a semantically-consistent result. To avoid feature loss during network propagation and simultaneously streamline the redundant information from the light field camera, we present a simple yet very effective Sub-Aperture Fusion Module (SAFM) to embed sub-aperture images into angular features without any additional memory cost. Furthermore, to address the mismatched spatial information across viewpoints, we present Center Angular Rectification Module (CARM) realized feature resorting and prevent feature occlusion caused by asymmetric information. Our proposed OAFuser achieves state-of-the-art performance on the UrbanLF-Real and -Syn datasets and sets a new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain of +4.53%. The source code of OAFuser will be made publicly available at https://github.com/FeiBryantkit/OAFuser.
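    Summary Light field cameras provide rich angular and spatial information that can enhance image semantic segmentation for scene understanding in autonomous driving. However, the extensive angular information contains a large amount of redundant data, which overwhelms the limited hardware resources of intelligent vehicles. In addition, inappropriate compression leads to information corruption and data loss. To extract representative information, we propose the Omni-Aperture Fusion model (OAFuser), which leverages dense context from the central view and angular information from the sub-aperture images to generate semantically consistent results. To avoid feature loss during network propagation while streamlining the redundant information from the light field camera, we propose a simple yet very effective Sub-Aperture Fusion Module (SAFM) that embeds sub-aperture images into angular features without additional memory cost. Furthermore, to address the mismatched spatial information across viewpoints, we propose the Center Angular Rectification Module (CARM), which performs feature resorting and prevents feature occlusion caused by asymmetric information. OAFuser achieves state-of-the-art performance on the UrbanLF-Real and UrbanLF-Syn datasets and sets a new record of 84.93% mIoU on the UrbanLF-Real Extended dataset, a gain of +4.53%. The source code of OAFuser will be released at https://github.com/FeiBryantkit/OAFuser.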
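    For readers unfamiliar with the mIoU figure quoted above, here is a minimal Python sketch of mean intersection over union on flattened label maps. Skipping classes that appear in neither prediction nor ground truth is a convention chosen for this sketch; benchmark toolkits may handle that case differently, and `mean_iou` is a hypothetical name.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps say class c, and where at least one does.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

    With predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.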

Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding

  • paper_url: http://arxiv.org/abs/2307.15569
  • repo_url: None
  • paper_authors: Jiachen Kang, Wenjing Jia, Xiangjian He, Kin Man Lam
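  • for: Self-supervised representation learning (SSRL) for point cloud understanding, addressing the challenges of 3D data scarcity and high annotation costs.
  • methods: Proposes PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift lets PCExpert leverage knowledge from the large-scale image modality more directly and deeply, by extensively sharing parameters with a pre-trained image encoder in a multi-way Transformer architecture.
  • results: PCExpert outperforms the state of the art on a variety of tasks with far fewer trainable parameters. Under linear fine-tuning it reaches 90.02% overall accuracy on ScanObjectNN, approaching the 92.66% obtained with full model fine-tuning, which indicates an effective and robust representation capability.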
    Abstract Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with a novel pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the arts in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already approached the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.
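    Summary Self-supervised representation learning (SSRL) has attracted growing attention in point cloud understanding as a way to address 3D data scarcity and high annotation costs. This paper proposes PCExpert, a novel SSRL approach that treats point clouds as "specialized images". This conceptual shift lets PCExpert use knowledge from the large-scale image modality more directly and deeply, by extensively sharing parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter-sharing strategy, combined with a new pretext task for pre-training (transformation estimation), enables PCExpert to outperform the state of the art on a variety of tasks with a markedly smaller number of trainable parameters. Notably, PCExpert's performance under linear fine-tuning (e.g. 90.02% overall accuracy on ScanObjectNN) already approaches the result of full model fine-tuning (92.66%), demonstrating an effective and robust representation capability.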

Panoptic Scene Graph Generation with Semantics-prototype Learning

  • paper_url: http://arxiv.org/abs/2307.15567
  • repo_url: None
  • paper_authors: Li Li, Wei Ji, Yiming Wu, Mengze Li, You Qin, Lina Wei, Roger Zimmermann
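  • for: Improving the performance of Panoptic Scene Graph Generation (PSG) models in real applications, where intrinsic annotation bias makes it hard for the models to construct a clear decision plane among predicates.
  • methods: Proposes ADTrans, a novel framework that adaptively transfers biased predicate annotations to informative and unified ones. To keep the transfer consistent and accurate, it measures the invariance of representations within each predicate class and learns unbiased predicate prototypes. It also continuously measures the distribution change between each representation and its prototype, and screens out potentially biased data.
  • results: Experiments show that ADTrans significantly improves benchmark models, achieving new state-of-the-art performance, and generalizes well across multiple datasets.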
    Abstract Panoptic Scene Graph Generation (PSG) parses objects and predicts their relationships (predicate) to connect human language and visual scenes. However, different language preferences of annotators and semantic overlaps between predicates lead to biased predicate annotations in the dataset, i.e. different predicates for same object pairs. Biased predicate annotations make PSG models struggle in constructing a clear decision plane among predicates, which greatly hinders the real application of PSG models. To address the intrinsic bias above, we propose a novel framework named ADTrans to adaptively transfer biased predicate annotations to informative and unified ones. To promise consistency and accuracy during the transfer process, we propose to measure the invariance of representations in each predicate class, and learn unbiased prototypes of predicates with different intensities. Meanwhile, we continuously measure the distribution changes between each presentation and its prototype, and constantly screen potential biased data. Finally, with the unbiased predicate-prototype representation embedding space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieving a new state-of-the-art performance, and shows great generalization and effectiveness on multiple datasets.
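    Summary Panoptic Scene Graph Generation (PSG) parses objects and predicts the relationships (predicates) between them, connecting human language and visual scenes. However, annotators' differing language preferences and semantic overlaps between predicates lead to biased predicate annotations in the dataset, e.g. different predicates for the same object pair. This bias makes it hard for PSG models to construct a clear decision plane among predicates, which limits their real-world application. To address this intrinsic bias, we propose a framework named ADTrans that adaptively transfers biased predicate annotations to informative and unified ones. To ensure consistency and accuracy during the transfer, we measure the invariance of representations within each predicate class and learn unbiased predicate prototypes. Meanwhile, we continuously measure the distribution change between each representation and its prototype, and keep screening out potentially biased data. Finally, in the unbiased predicate-prototype representation space, biased annotations are easily identified. Experiments show that ADTrans significantly improves the performance of benchmark models, achieves new state-of-the-art performance, and shows strong generalization and effectiveness on multiple datasets.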

Beating Backdoor Attack at Its Own Game

  • paper_url: http://arxiv.org/abs/2307.15539
  • repo_url: https://github.com/damianliumin/non-adversarial_backdoor
  • paper_authors: Min Liu, Alberto Sangiovanni-Vincentelli, Xiangyu Yue
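  • for: Defending deep neural networks (DNNs) against backdoor attacks, which leave performance on clean data unaffected but manipulate the network's behavior once a trigger pattern is added.
  • methods: A simple but highly effective defense framework that injects non-adversarial backdoors targeting poisoned samples in order to suppress the attacker's backdoor.
  • results: Extensive experiments on multiple benchmarks show state-of-the-art defense effectiveness with the lowest performance drop on clean data.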
    Abstract Deep neural networks (DNNs) are vulnerable to backdoor attack, which does not affect the network's performance on clean data but would manipulate the network behavior once a trigger pattern is added. Existing defense methods have greatly reduced attack success rate, but their prediction accuracy on clean data still lags behind a clean model by a large margin. Inspired by the stealthiness and effectiveness of backdoor attack, we propose a simple but highly effective defense framework which injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attack, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoor for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor.
    Summary Inspired by the stealthiness and effectiveness of backdoor attacks, we propose a simple but highly effective defense framework that injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attacks, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoors for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor.

YOLOv8 for Defect Inspection of Hexagonal Directed Self-Assembly Patterns: A Data-Centric Approach

  • paper_url: http://arxiv.org/abs/2307.15516
  • repo_url: None
  • paper_authors: Enrique Dehaerne, Bappaditya Dey, Hossein Esfandiar, Lander Verstraete, Hyo Seon Suh, Sandip Halder, Stefan De Gendt
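  • for: Obtaining high-quality labels for directed self-assembly (DSA) patterns, for use in training supervised machine learning models for defect inspection.
  • methods: Machine learning-based SEM image analysis for automatic defect detection in DSA patterns, combined with a labeling method that yields coherent and complete labels for hexagonal contact hole DSA patterns while requiring minimal quality control effort from a DSA expert.
  • results: YOLOv8 achieves defect detection precisions of more than 0.9 mAP on the final dataset.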
    Abstract Shrinking pattern dimensions leads to an increased variety of defect types in semiconductor devices. This has spurred innovation in patterning approaches such as Directed self-assembly (DSA) for which no traditional, automatic defect inspection software exists. Machine Learning-based SEM image analysis has become an increasingly popular research topic for defect inspection with supervised ML models often showing the best performance. However, little research has been done on obtaining a dataset with high-quality labels for these supervised models. In this work, we propose a method for obtaining coherent and complete labels for a dataset of hexagonal contact hole DSA patterns while requiring minimal quality control effort from a DSA expert. We show that YOLOv8, a state-of-the-art neural network, achieves defect detection precisions of more than 0.9 mAP on our final dataset which best reflects DSA expert defect labeling expectations. We discuss the strengths and limitations of our proposed labeling approach and suggest directions for future work in data-centric ML-based defect inspection.
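    Summary Shrinking pattern dimensions increase the variety of defect types in semiconductor devices. This has spurred innovation in patterning approaches such as directed self-assembly (DSA), for which no traditional automatic defect inspection software exists. Machine learning-based SEM image analysis has become a popular research topic for defect inspection, but little research addresses how to obtain high-quality labels. In this work, we propose a method for obtaining coherent and complete labels for a dataset of DSA patterns without requiring a DSA expert to spend much time on quality control. We show that the YOLOv8 neural network achieves defect detection precisions of more than 0.9 mAP on our final dataset, which most closely reflects the DSA expert's labeling expectations. We discuss the strengths and limitations of our labeling approach and directions for future work in data-centric ML-based defect inspection.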

Improving Image Quality of Sparse-view Lung Cancer CT Images with a Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2307.15506
  • repo_url: None
  • paper_authors: Annika Ries, Tina Dorosti, Johannes Thalhammer, Daniel Sasse, Andreas Sauter, Felix Meurer, Ashley Benne, Franz Pfeiffer, Daniela Pfeiffer
  • for: Improving the image quality of sparse-view computed tomography (CT) images for lung cancer detection, and determining the best trade-off between number of views, image quality, and diagnostic confidence.
  • methods: A U-Net post-processes sparse-view CT images; the study evaluates the effect of different levels of undersampling (16, 32, 64, 128, 256, and 512 views) on image quality and diagnostic confidence.
  • results: 64-projection sparse-view images give high image quality and diagnostic confidence, while fewer views lead to insufficient quality. Post-processing the sparse-view images with the U-Net further improves image quality and diagnostic confidence.
    Abstract Purpose: To improve the image quality of sparse-view computed tomography (CT) images with a U-Net for lung cancer detection and to determine the best trade-off between number of views, image quality, and diagnostic confidence. Methods: CT images from 41 subjects (34 with lung cancer, seven healthy) were retrospectively selected (01.2016-12.2018) and forward projected onto 2048-view sinograms. Six corresponding sparse-view CT data subsets at varying levels of undersampling were reconstructed from sinograms using filtered backprojection with 16, 32, 64, 128, 256, and 512 views, respectively. A dual-frame U-Net was trained and evaluated for each subsampling level on 8,658 images from 22 diseased subjects. A representative image per scan was selected from 19 subjects (12 diseased, seven healthy) for a single-blinded reader study. The selected slices, for all levels of subsampling, with and without post-processing by the U-Net model, were presented to three readers. Image quality and diagnostic confidence were ranked using pre-defined scales. Subjective nodule segmentation was evaluated utilizing sensitivity (Se) and Dice Similarity Coefficient (DSC) with 95% confidence intervals (CI). Results: The 64-projection sparse-view images resulted in Se = 0.89 and DSC = 0.81 [0.75,0.86] while their counterparts, post-processed with the U-Net, had improved metrics (Se = 0.94, DSC = 0.85 [0.82,0.87]). Fewer views lead to insufficient quality for diagnostic purposes. For increased views, no substantial discrepancies were noted between the sparse-view and post-processed images. Conclusion: Projection views can be reduced from 2048 to 64 while maintaining image quality and the confidence of the radiologists on a satisfactory level.
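    Summary Methods: CT images from 41 subjects (34 with lung cancer, seven healthy), collected from January 2016 to December 2018, were forward projected onto 2048-view sinograms. Sparse-view CT data subsets at varying levels of undersampling were reconstructed from the sinograms using filtered backprojection. A dual-frame U-Net was trained and evaluated for each subsampling level on 8,658 images from 22 diseased subjects. A representative image per scan was selected from 19 subjects (12 diseased, seven healthy), and the selected slices were presented to three readers. Image quality and diagnostic confidence were ranked on pre-defined scales. Segmentation was evaluated with sensitivity (Se) and the Dice Similarity Coefficient (DSC), with 95% confidence intervals (CI). Results: The 64-view sparse-view images yielded Se = 0.89 and DSC = 0.81 [0.75,0.86], while their U-Net post-processed counterparts showed improved metrics (Se = 0.94, DSC = 0.85 [0.82,0.87]). Fewer views are insufficient for diagnostic purposes. With more views, no substantial differences were noted between the sparse-view and U-Net post-processed images. Conclusion: Projection views can be reduced from 2048 to 64 while keeping image quality and the radiologists' confidence at a satisfactory level.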
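    The two segmentation metrics reported above follow standard definitions, sketched here in Python on flattened binary masks. Returning 0.0 when a denominator is zero is a convention chosen for this sketch, not something stated in the paper, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def sensitivity(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Se = TP / (TP + FN): the fraction of true nodule pixels found."""
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 0.0


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """DSC = 2 TP / (2 TP + FP + FN): overlap between the two masks."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

    Sensitivity ignores false positives while Dice penalises them, which is one reason to report both.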

Local and Global Information in Obstacle Detection on Railway Tracks

  • paper_url: http://arxiv.org/abs/2307.15478
  • repo_url: None
  • paper_authors: Matthias Brucker, Andrei Cramariuc, Cornelius von Einem, Roland Siegwart, Cesar Cadena
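  • for: Reliable obstacle detection on railway tracks, to help prevent collisions and improve train safety.
  • methods: A shallow network learns railway segmentation from normal railway images; its limited receptive field keeps predictions local, and global information is included in a controlled way by learning to hallucinate obstacle-free images.
  • results: Outperforms other learning-based baseline methods on a custom dataset of railway images with artificially augmented obstacles.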
    Abstract Reliable obstacle detection on railways could help prevent collisions that result in injuries and potentially damage or derail the train. Unfortunately, generic object detectors do not have enough classes to account for all possible scenarios, and datasets featuring objects on railways are challenging to obtain. We propose utilizing a shallow network to learn railway segmentation from normal railway images. The limited receptive field of the network prevents overconfident predictions and allows the network to focus on the locally very distinct and repetitive patterns of the railway environment. Additionally, we explore the controlled inclusion of global information by learning to hallucinate obstacle-free images. We evaluate our method on a custom dataset featuring railway images with artificially augmented obstacles. Our proposed method outperforms other learning-based baseline methods.
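    Summary Reliable obstacle detection on railways can help prevent collisions that cause injuries and may damage or derail the train. However, generic object detectors do not have enough classes to cover all possible scenarios, and datasets featuring objects on railways are hard to obtain. We propose using a shallow network to learn railway segmentation from normal railway images. The network's limited receptive field prevents overconfident predictions and lets the network focus on the locally very distinct and repetitive patterns of the railway environment. In addition, we explore the controlled inclusion of global information by learning to hallucinate obstacle-free images. We evaluate our method on a custom dataset of railway images with artificially augmented obstacles, where it outperforms other learning-based baseline methods.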

Defocus Blur Synthesis and Deblurring via Interpolation and Extrapolation in Latent Space

  • paper_url: http://arxiv.org/abs/2307.15461
  • repo_url: https://github.com/nis-research/linear-latent-blur
  • paper_authors: Ioana Mazilu, Shunxin Wang, Sven Dummer, Raymond Veldhuis, Christoph Brune, Nicola Strisciuglio
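  • for: Improving the quality of microscopic images affected by defocus blur, to benefit further processing and the analysis of diseases.
  • methods: A method that can both deblur images and synthesize defocus blur at different levels. Autoencoders are trained with implicit and explicit regularization techniques to enforce linear relations among the latent representations of different blur levels.
  • results: The regularized autoencoders effectively mimic blur and deblur, which increases data variety as a data augmentation technique and improves the quality of microscopic images.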
    Abstract Though modern microscopes have an autofocusing system to ensure optimal focus, out-of-focus images can still occur when cells within the medium are not all in the same focal plane, affecting the image quality for medical diagnosis and analysis of diseases. We propose a method that can deblur images as well as synthesize defocus blur. We train autoencoders with implicit and explicit regularization techniques to enforce linearity relations among the representations of different blur levels in the latent space. This allows for the exploration of different blur levels of an object by linearly interpolating/extrapolating the latent representations of images taken at different focal planes. Compared to existing works, we use a simple architecture to synthesize images with flexible blur levels, leveraging the linear latent space. Our regularized autoencoders can effectively mimic blur and deblur, increasing data variety as a data augmentation technique and improving the quality of microscopic images, which would be beneficial for further processing and analysis.
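    Summary Modern microscopes have an autofocusing system, but out-of-focus images can still occur when the cells within the medium are not all in the same focal plane, which degrades image quality and affects medical diagnosis and the analysis of diseases. We propose a method that can both deblur images and synthesize defocus blur. We train autoencoders with explicit and implicit regularization techniques to enforce linear relations among the representations in the latent space. This lets us explore different blur levels by linearly interpolating or extrapolating latent representations. Compared to existing methods, we use a simple architecture to synthesize images with flexible blur levels, exploiting the linearity of the latent space. Our regularized autoencoders effectively mimic blur and deblur, increasing data variety and improving the quality of microscopic images, which benefits further processing and analysis.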
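    The latent-space arithmetic described above can be sketched in a few lines of Python. The sketch assumes a trained encoder and decoder exist elsewhere (not shown, and not taken from the authors' repository) and only illustrates the blending step; `blend_latents` and the parameter `t` are hypothetical names.

```python
from typing import List, Sequence


def blend_latents(z_a: Sequence[float], z_b: Sequence[float], t: float) -> List[float]:
    """Move along the straight line through two latent codes.

    z_a and z_b stand for the latent codes of the same object imaged
    at two focal planes. With t in [0, 1] the result interpolates
    between them; t < 0 or t > 1 extrapolates beyond either end.
    """
    return [a + t * (b - a) for a, b in zip(z_a, z_b)]
```

    If the latent space is linear in blur level, as the regularization is meant to enforce, decoding an interpolated code should give an intermediate blur level, while extrapolating past the sharper code corresponds to deblurring and extrapolating past the blurrier code corresponds to synthesizing stronger blur.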

ERCPMP: An Endoscopic Image and Video Dataset for Colorectal Polyps Morphology and Pathology

  • paper_url: http://arxiv.org/abs/2307.15444
  • repo_url: None
  • paper_authors: Mojgan Forootan, Mohsen Rajabnia, Ahmad R Mafi, Hamed Azhdari Tehrani, Erfan Ghadirzadeh, Mahziar Setayeshfar, Zahra Ghaffari, Mohammad Tashakoripour, Mohammad Reza Zali, Hamidreza Bolhasani
  • for: Supporting the development of accurate algorithms for medical prediction, detection, diagnosis, treatment, and prognosis, specifically for colorectal polyps.
  • methods: Introduces ERCPMP, a dataset containing demographic, morphological, and pathological data, endoscopic images, and videos of 191 patients with colorectal polyps. Pathological data covers the diagnosis of the polyps, including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory, and Adenocarcinoma with Dysplasia Grade & Differentiation.
  • results: Provides a dataset spanning demographic, morphological, and pathological data, endoscopic images, and videos, which can be used to train and test machine learning and deep learning models.
    Abstract In recent years, artificial intelligence (AI) and its leading subtypes, machine learning (ML) and deep learning (DL), and their applications have been spreading very fast in various fields such as medicine. Today, the most important challenge in developing accurate algorithms for medical prediction, detection, diagnosis, treatment and prognosis is data. ERCPMP is an Endoscopic Image and Video Dataset for Recognition of Colorectal Polyps Morphology and Pathology. This dataset contains demographic, morphological and pathological data, endoscopic images and videos of 191 patients with colorectal polyps. Morphological data is included based on the latest international gastroenterology classification references such as Paris, Pit and JNET classification. Pathological data includes the diagnosis of the polyps including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory and Adenocarcinoma with Dysplasia Grade & Differentiation. The current version of this dataset is published and available on Elsevier Mendeley Dataverse and since it is under development, the latest version is accessible via: https://databiox.com.
    Summary In recent years, artificial intelligence (AI) and its main subtypes, machine learning (ML) and deep learning (DL), together with their applications, have been spreading rapidly across many fields, including medicine. Currently, the most important challenge in developing accurate algorithms for medical prediction, detection, diagnosis, treatment, and prognosis is data. ERCPMP is an Endoscopic Image and Video Dataset for Recognition of Colorectal Polyps Morphology and Pathology. This dataset contains demographic, morphological, and pathological data, endoscopic images and videos of 191 patients with colorectal polyps. Morphological data is based on the latest international gastroenterology classification references such as Paris, Pit, and JNET classification. Pathological data includes the diagnosis of the polyps, including Tubular, Villous, Tubulovillous, Hyperplastic, Serrated, Inflammatory, and Adenocarcinoma with Dysplasia Grade & Differentiation. The current version of this dataset is published and available on Elsevier Mendeley Dataverse, and the latest version can be accessed via: https://databiox.com.

Automated Visual Monitoring of Nocturnal Insects with Light-based Camera Traps

  • paper_url: http://arxiv.org/abs/2307.15433
  • repo_url: None
  • paper_authors: Dimitri Korsch, Paul Bodesheim, Gunnar Brehm, Joachim Denzler
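  • for: Automatic camera-assisted monitoring of insects for abundance estimation, to better understand and counteract the ongoing insect decline.
  • methods: A two-stage pipeline for insect detection and moth species classification, developed and evaluated on the EU-Moths dataset, which was captured manually by citizen scientists. The paper also introduces a prototype of an automated visual monitoring system, which produced a second dataset.
  • results: Provides the first detection and classification baselines for the two datasets and encourages other scientists to use this publicly available data.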
    Abstract Automatic camera-assisted monitoring of insects for abundance estimations is crucial to understand and counteract ongoing insect decline. In this paper, we present two datasets of nocturnal insects, especially moths as a subset of Lepidoptera, photographed in Central Europe. One of the datasets, the EU-Moths dataset, was captured manually by citizen scientists and contains species annotations for 200 different species and bounding box annotations for those. We used this dataset to develop and evaluate a two-stage pipeline for insect detection and moth species classification in previous work. We further introduce a prototype for an automated visual monitoring system. This prototype produced the second dataset consisting of more than 27,000 images captured on 95 nights. For evaluation and bootstrapping purposes, we annotated a subset of the images with bounding boxes enframing nocturnal insects. Finally, we present first detection and classification baselines for these datasets and encourage other scientists to use this publicly available data.
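    Summary Automatic camera-assisted monitoring of insects for abundance estimation is key to understanding and counteracting the ongoing insect decline. This paper presents two datasets of nocturnal insects from Central Europe. The first, the EU-Moths dataset, was captured manually by citizen scientists and contains species annotations for 200 species along with bounding box annotations. It was used in previous work to develop and evaluate a two-stage pipeline for insect detection and moth species classification. The paper also introduces a prototype of an automated visual monitoring system, which captured more than 27,000 images on 95 nights. For evaluation and bootstrapping, a subset of these images was annotated with bounding boxes around nocturnal insects. Finally, the paper presents first detection and classification baselines for these datasets and invites other scientists to use this publicly available data.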

Implicit neural representation for change detection

  • paper_url: http://arxiv.org/abs/2307.15428
  • repo_url: None
  • paper_authors: Peter Naylor, Diego Di Carlo, Arianna Traviglia, Makoto Yamada, Marco Fiorucci
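  • for: Detecting changes between a pair of 3D airborne LiDAR point clouds, a task made difficult by unmatched spatial supports and acquisition system noise.
  • methods: An unsupervised approach with two components: a Neural Field (NF) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. The NF provides a grid-agnostic representation that can be regularised to increase high-frequency details and reduce noise.
  • results: On a benchmark dataset of simulated LiDAR point clouds, the method improves on previous methods by a 10% margin in intersection over union. Applied to a real-world scenario, its detections match the findings of field experts.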
    Abstract Detecting changes that occurred in a pair of 3D airborne LiDAR point clouds, acquired at two different times over the same geographical area, is a challenging task because of unmatched spatial supports and acquisition system noise. Most recent attempts to detect changes on point clouds are based on supervised methods, which require large labelled datasets that are unavailable in real-world applications. To address these issues, we propose an unsupervised approach that comprises two components: a Neural Field (NF) for continuous shape reconstruction and a Gaussian Mixture Model for categorising changes. The NF offers a grid-agnostic representation to encode bi-temporal point clouds with unmatched spatial support that can be regularised to increase high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, leading to a significant increase in detection capabilities. We apply our method to a benchmark dataset of simulated LiDAR point clouds for urban sprawling. The dataset offers different challenging scenarios with different resolutions, input modalities and noise levels, allowing a multi-scenario comparison of our method with the current state-of-the-art. We outperform the previous methods on this dataset by a 10% margin in the intersection over union metric. In addition, we apply our method to a real-world scenario to identify illegal excavation (looting) of archaeological sites and confirm that the results match findings from field experts.
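    Summary Detecting changes between two 3D airborne LiDAR point clouds acquired at different times is challenging because of unmatched spatial supports and acquisition system noise. Most recent change detection methods are supervised and require large amounts of labelled data, which are hard to obtain in real-world applications. To address this, the paper proposes an unsupervised approach with two components: a Neural Field (NF) and a Gaussian Mixture Model. The NF provides a grid-agnostic representation for encoding bi-temporal point clouds with unmatched spatial support, and can be regularised to increase high-frequency details and reduce noise. The reconstructions at each timestamp are compared at arbitrary spatial scales, which significantly increases detection capability. The method is applied to a benchmark dataset of simulated LiDAR point clouds covering scenarios with different resolutions, input modalities and noise levels, where it improves on previous methods by a 10% margin in intersection over union. It is also applied to a real-world scenario, the illegal excavation (looting) of archaeological sites, where its results match findings from field experts.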
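The reported gain is measured in intersection over union (IoU). As a reminder of what that metric computes, here is a minimal sketch for binary change masks. The function name, the flat-list mask format, and the toy masks are illustrative assumptions, not taken from the paper.

```python
def intersection_over_union(pred, truth):
    """IoU between two equally sized binary masks given as flat lists of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Count elements marked as "changed" in both masks, and in at least one.
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Convention chosen for this sketch: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


# Toy example: 2 elements agree on "changed", 4 are changed in at least one mask.
pred = [1, 1, 0, 0, 1]
truth = [1, 0, 0, 1, 1]
iou = intersection_over_union(pred, truth)  # 2 / 4 = 0.5
```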

Deep Learning Pipeline for Automated Visual Moth Monitoring: Insect Localization and Species Classification

  • paper_url: http://arxiv.org/abs/2307.15427
  • repo_url: None
  • paper_authors: Dimitri Korsch, Paul Bodesheim, Joachim Denzler
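  • for: A deep learning pipeline for automated visual moth monitoring, in support of biodiversity monitoring.
  • methods: A deep learning pipeline that first localizes individuals with a moth detector and then determines their species with a classifier, applied to images captured by a moth scanner.
  • results: Combining the detector and the classifier improves species identification accuracy on moth scanner images from 79.62% to 88.05%.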
    Abstract Biodiversity monitoring is crucial for tracking and counteracting adverse trends in population fluctuations. However, automatic recognition systems are rarely applied so far, and experts evaluate the generated data masses manually. Especially the support of deep learning methods for visual monitoring is not yet established in biodiversity research, compared to other areas like advertising or entertainment. In this paper, we present a deep learning pipeline for analyzing images captured by a moth scanner, an automated visual monitoring system of moth species developed within the AMMOD project. We first localize individuals with a moth detector and afterward determine the species of detected insects with a classifier. Our detector achieves up to 99.01% mean average precision and our classifier distinguishes 200 moth species with an accuracy of 93.13% on image cutouts depicting single insects. Combining both in our pipeline improves the accuracy for species identification in images of the moth scanner from 79.62% to 88.05%.
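    Summary Biodiversity monitoring is crucial for tracking and counteracting adverse trends in population fluctuations. However, automatic recognition systems are still rarely applied, and experts evaluate the generated data manually. In particular, deep learning support for visual monitoring is not yet established in biodiversity research, unlike in areas such as advertising or entertainment. This paper presents a deep learning pipeline for analyzing images captured by a moth scanner. A moth detector first localizes individual insects, and a classifier then determines the species of each detected insect. The detector reaches up to 99.01% mean average precision, and the classifier distinguishes 200 moth species with 93.13% accuracy on cutouts showing single insects. Combining both modules improves species identification accuracy on moth scanner images from 79.62% to 88.05%.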
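The detect-then-classify structure described above can be sketched in a few lines. This is a structural illustration only: the box format, the nested-list image, and the toy detector and classifier are hypothetical stand-ins, not the models used in the paper.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical box format
Image = List[List[float]]        # toy grayscale image as nested lists


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by box out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def detect_then_classify(
    image: Image,
    detector: Callable[[Image], List[Box]],
    classifier: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 localizes individuals; stage 2 labels each cutout."""
    return [(box, classifier(crop(image, box))) for box in detector(image)]


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 2, 4, 4)]


def toy_classifier(cutout: Image) -> str:
    mean = sum(map(sum, cutout)) / (len(cutout) * len(cutout[0]))
    return "bright" if mean > 0.5 else "dark"


image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
predictions = detect_then_classify(image, toy_detector, toy_classifier)
```

Keeping the two stages behind plain callables means either stage can be swapped or evaluated on its own, which matches how the abstract reports detector and classifier numbers separately before the combined figure.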

MLIC++: Linear Complexity Multi-Reference Entropy Modeling for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2307.15421
  • repo_url: https://github.com/jiangweibeta/mlic
  • paper_authors: Wei Jiang, Ronggang Wang
  • for: This paper proposes a learned image compression method based on multi-reference entropy modeling, to improve the efficiency and quality of image compression.
  • methods: Global correlations are captured with linear complexity through a decomposition of the softmax operation, instead of the quadratic-complexity attention used in previous work.
  • results: Compared to VTM-17.0, MLIC++ reduces BD-rate by 12.44% on the Kodak dataset when measured in PSNR.
    Abstract Recently, the multi-reference entropy model has been proposed, which captures channel-wise, local spatial, and global spatial correlations. Previous works adopt attention for global correlation capturing; however, the quadratic complexity limits the potential of high-resolution image coding. In this paper, we propose linear-complexity global correlation capturing via the decomposition of the softmax operation. Based on it, we propose MLIC++, a learned image compression method with linear complexity for multi-reference entropy modeling. Our MLIC++ is more efficient and reduces BD-rate by 12.44% on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Code will be available at https://github.com/JiangWeibeta/MLIC.
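    Summary The multi-reference entropy model was recently proposed to capture channel-wise, local spatial, and global spatial correlations. Previous work uses attention to capture global correlations, but its quadratic complexity limits the potential of high-resolution image coding. This paper proposes capturing global correlations with linear complexity through a decomposition of the softmax operation. Building on it, the paper proposes MLIC++, a learned image compression method with linear-complexity multi-reference entropy modeling. MLIC++ is more efficient and reduces BD-rate by 12.44% on the Kodak dataset compared to VTM-17.0 when measured in PSNR. Code will be made available on GitHub.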
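To illustrate how decomposing softmax can bring attention down to linear complexity in the number of tokens, here is a minimal pure-Python sketch. It assumes one known decomposition, in which softmax is applied separately to queries (over features) and keys (over tokens) so that the N x N attention map is never formed. Whether MLIC++ uses exactly this form is an assumption; the function names and toy inputs are illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def linear_attention(q, k, v):
    """q, k: N x d; v: N x d_v. Cost grows linearly with N."""
    q_norm = [softmax(row) for row in q]                        # per token, over features
    k_norm = transpose([softmax(col) for col in transpose(k)])  # per feature, over tokens
    context = matmul(transpose(k_norm), v)                      # d x d_v global summary
    return matmul(q_norm, context)                              # N x d_v


q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
k = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = linear_attention(q, k, v)
```

By associativity, the result equals (softmax(Q) softmax(K)^T) V, but the bracketing used here multiplies the d x N and N x d_v matrices first, so the cost scales with N d d_v rather than N^2.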

Uncertainty-aware Unsupervised Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2307.15409
  • repo_url: None
  • paper_authors: Kai Liu, Sheng Jin, Zhihang Fu, Ze Chen, Rongxin Jiang, Jieping Ye
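  • for: Improving the performance of unsupervised multi-object tracking.
  • methods: Building on self-supervised techniques, an uncertainty-based metric is developed to verify and rectify risky associations.
  • results: The resulting framework achieves strong unsupervised multi-object tracking and state-of-the-art performance on the MOT-Challenges and VisDrone-MOT benchmarks.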
    Abstract Without manually annotated identities, unsupervised multi-object trackers are inferior at learning reliable feature embeddings. This causes the similarity-based inter-frame association stage to also be error-prone, resulting in an uncertainty problem. The frame-by-frame accumulated uncertainty prevents trackers from learning a consistent feature embedding against time variation. To avoid this uncertainty problem, recent self-supervised techniques are adopted, but they fail to capture temporal relations, so the inter-frame uncertainty still exists. In fact, this paper argues that although the uncertainty problem is inevitable, it is possible to leverage the uncertainty itself to improve the learned consistency in turn. Specifically, an uncertainty-based metric is developed to verify and rectify the risky associations. The resulting accurate pseudo-tracklets boost the learning of feature consistency, and accurate tracklets can incorporate temporal information into spatial transformation. This paper proposes a tracklet-guided augmentation strategy to simulate tracklets' motion, which adopts a hierarchical uncertainty-based sampling mechanism for hard sample mining. The ultimate unsupervised MOT framework, namely U2MOT, is proven effective on the MOT-Challenges and VisDrone-MOT benchmarks. U2MOT achieves SOTA performance among the published supervised and unsupervised trackers.

Task-Oriented Channel Attention for Fine-Grained Few-Shot Classification

  • paper_url: http://arxiv.org/abs/2308.00093
  • repo_url: None
  • paper_authors: SuBeen Lee, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
  • for: fine-grained image classification with limited training data
  • methods: Task Discrepancy Maximization (TDM) with Support Attention Module (SAM) and Query Attention Module (QAM)
  • results: accurate class-sensitive similarity measure and instance-wise highlighting of object-relevant channels
    Abstract The difficulty of fine-grained image classification mainly comes from a shared overall appearance across classes. Thus, recognizing discriminative details, such as eyes and beaks for birds, is key to the task. However, this is particularly challenging when training data is limited. To address this, we propose Task Discrepancy Maximization (TDM), a task-oriented channel attention method tailored for fine-grained few-shot classification with two novel modules: the Support Attention Module (SAM) and the Query Attention Module (QAM). SAM highlights channels encoding class-wise discriminative features, while QAM assigns higher weights to object-relevant channels of the query. Based on these submodules, TDM produces task-adaptive features by focusing on channels that encode class-discriminative details and are also possessed by the query, for an accurate class-sensitive similarity measure between support and query instances. While TDM influences high-level feature maps by task-adaptive calibration of channel-wise importance, we further introduce the Instance Attention Module (IAM), which extends QAM and operates in intermediate layers of feature extractors to highlight object-relevant channels instance-wise. The merits of TDM and IAM and their complementary benefits are experimentally validated in fine-grained few-shot classification tasks. Moreover, IAM is also shown to be effective in coarse-grained and cross-domain few-shot classifications.

AffineGlue: Joint Matching and Robust Estimation

  • paper_url: http://arxiv.org/abs/2307.15381
  • repo_url: None
  • paper_authors: Daniel Barath, Dmytro Mishkin, Luca Cavalli, Paul-Edouard Sarlin, Petr Hruby, Marc Pollefeys
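  • for: AffineGlue, a method for joint two-view feature matching and robust estimation that reduces the combinatorial complexity of the problem.
  • methods: AffineGlue uses single-point minimal solvers to select potential matches, then uses guided matching to find matches consistent with the model. It also introduces a new minimal solver for homography estimation that requires only a single affine correspondence (AC) and a gravity prior.
  • results: AffineGlue outperforms the state of the art (SOTA) on real-world datasets, even when assuming that the gravity direction points downwards. On PhotoTourism, the AUC@10° score improves by 6.6 points over the SOTA. On ScanNet, AffineGlue lets SuperPoint and SuperGlue reach accuracy similar to the detector-free LoFTR.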
    Abstract We propose AffineGlue, a method for joint two-view feature matching and robust estimation that reduces the combinatorial complexity of the problem by employing single-point minimal solvers. AffineGlue selects potential matches from one-to-many correspondences to estimate minimal models. Guided matching is then used to find matches consistent with the model, suffering less from the ambiguities of one-to-one matches. Moreover, we derive a new minimal solver for homography estimation, requiring only a single affine correspondence (AC) and a gravity prior. Furthermore, we train a neural network to reject ACs that are unlikely to lead to a good model. AffineGlue is superior to the SOTA on real-world datasets, even when assuming that the gravity direction points downwards. On PhotoTourism, the AUC@10° score is improved by 6.6 points compared to the SOTA. On ScanNet, AffineGlue makes SuperPoint and SuperGlue achieve similar accuracy as the detector-free LoFTR.
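    Summary We propose AffineGlue, a method for joint two-view feature matching and robust estimation that reduces the combinatorial complexity of the problem by using single-point minimal solvers. AffineGlue selects potential matches to estimate minimal models, then uses guided matching to find matches consistent with the model. We also derive a new minimal solver for homography estimation that requires only a single affine correspondence (AC) and a gravity prior, and train a neural network to reject ACs that are unlikely to lead to a good model. AffineGlue outperforms the state of the art (SOTA) on real-world datasets, even when assuming that the gravity direction points downwards. On PhotoTourism, the AUC@10° score improves by 6.6 points over the SOTA. On ScanNet, AffineGlue lets SuperPoint and SuperGlue reach accuracy similar to the detector-free LoFTR.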
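For readers unfamiliar with the AUC@10° score, the sketch below shows one common way such a score is computed: the area under the cumulative pose-error curve up to a threshold, normalized by that threshold. This convention, the function name, and the toy error values are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def pose_auc(errors, threshold):
    """Normalized area under the recall-vs-error curve on [0, threshold]."""
    errs = sorted(errors)
    n = len(errs)
    xs, ys = [0.0], [0.0]
    for i, e in enumerate(errs):
        if e > threshold:
            break
        xs.append(e)
        ys.append((i + 1) / n)  # fraction of pairs with error <= e
    # Extend the curve flat to the threshold.
    xs.append(threshold)
    ys.append(ys[-1])
    # Trapezoidal integration over the piecewise-linear curve.
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2 for i in range(len(xs) - 1)
    )
    return area / threshold


# Toy pose errors in degrees (hypothetical values).
auc_at_10 = pose_auc([2.0, 5.0, 20.0], threshold=10.0)
```

Under this convention a score of 1.0 means every image pair has zero pose error, and an improvement of "6.6 points" corresponds to a rise of 0.066 on this normalized scale.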

Prompt Guided Transformer for Multi-Task Dense Prediction

  • paper_url: http://arxiv.org/abs/2307.15362
  • repo_url: None
  • paper_authors: Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu
  • for: This paper targets the trade-off between performance and model parameters in task-conditional architectures, and proposes a simple and lightweight task-conditional model called Prompt Guided Transformer (PGT) to address it.
  • methods: A Prompt-conditioned Transformer block that incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation, plus a lightweight decoder that reduces the parameter count.
  • results: Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, show that the approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, and maintains a significant balance between performance and parameter size.
    Abstract Task-conditional architecture offers advantage in parameter efficiency but falls short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance and model parameters is an important and difficult problem. In this paper, we introduce a simple and lightweight task-conditional model called Prompt Guided Transformer (PGT) to optimize this challenge. Our approach designs a Prompt-conditioned Transformer block, which incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation across multiple tasks. This block is integrated into both the shared encoder and decoder, enhancing the capture of intra- and inter-task features. Moreover, we design a lightweight decoder to further reduce parameter usage, which accounts for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, demonstrate that our approach achieves state-of-the-art results among task-conditional methods while using fewer parameters, and maintains a significant balance between performance and parameter size.
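    Summary Task-conditional architectures are parameter-efficient but fall short of state-of-the-art multi-decoder methods in performance. This paper introduces a simple and lightweight task-conditional model, the Prompt Guided Transformer (PGT), to address this trade-off. The approach designs a Prompt-conditioned Transformer block that incorporates task-specific prompts in the self-attention mechanism to achieve global dependency modeling and parameter-efficient feature adaptation. The block is integrated into both the shared encoder and the decoder, improving the capture of intra- and inter-task features. A lightweight decoder further reduces parameter usage and accounts for only 2.7% of the total model parameters. Extensive experiments on two multi-task dense prediction benchmarks, PASCAL-Context and NYUD-v2, show that the approach achieves the best results among task-conditional methods while using fewer parameters.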

Supervised Homography Learning with Realistic Dataset Generation

  • paper_url: http://arxiv.org/abs/2307.15353
  • repo_url: https://github.com/jianghaiscu/realsh
  • paper_authors: Hai Jiang, Haipeng Li, Songchen Han, Haoqiang Fan, Bing Zeng, Shuaicheng Liu
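  • for: An iterative framework with two phases, a generation phase and a training phase, that generates realistic training data and yields a supervised homography network.
  • methods: In the generation phase, the pre-estimated dominant plane masks and homography of an unlabeled image pair are used to generate a new labeled training pair. In the training phase, the generated data trains the supervised homography network and is refined via a content consistency module and a quality assessment module.
  • results: Experiments show that the method achieves state-of-the-art performance and that existing supervised methods also improve when trained on the generated dataset. Code and dataset are available at https://github.com/JianghaiSCU/RealSH.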
    Abstract In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content consistency module and a quality assessment module. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method achieves state-of-the-art performance and existing supervised methods can be also improved based on the generated dataset. Code and dataset are available at https://github.com/JianghaiSCU/RealSH.
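    Summary This paper proposes an iterative framework with two phases, a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, the pre-estimated dominant plane masks and homography of the pair are combined with another sampled homography, which serves as ground truth, to generate a new labeled training pair. In the training phase, the generated data is used to train the supervised homography network and is refined via a content consistency module and a quality assessment module. Once an iteration finishes, the trained network is used in the next generation phase to update the pre-estimated homography. Through this iterative strategy, the quality of the dataset and the performance of the network improve gradually. Experimental results show that the method achieves state-of-the-art performance and that existing supervised methods can also be improved with the generated data. Code and dataset are available at https://github.com/JianghaiSCU/RealSH.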
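The generation phase relies on a sampled homography serving as ground truth. As a minimal sketch of what that label means geometrically, the code below applies a 3x3 homography to the corners of a patch and records the per-corner offsets. The four-corner offset label is a common choice in supervised homography estimation and is assumed here; the matrices, function names, and patch size are hypothetical.

```python
def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography with homogeneous division."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)


def corner_offsets(h, corners):
    """Per-corner displacement induced by h (assumed label format)."""
    offsets = []
    for cx, cy in corners:
        wx, wy = apply_homography(h, (cx, cy))
        offsets.append((wx - cx, wy - cy))
    return offsets


# Hypothetical sampled ground-truth homography: a pure translation by (2, 3).
h_translate = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
patch_corners = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
label = corner_offsets(h_translate, patch_corners)

# A projective example: the third row makes the scale depend on x.
h_projective = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]
warped = apply_homography(h_projective, (2.0, 0.0))
```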

The Radon Signed Cumulative Distribution Transform and its applications in classification of Signed Images

  • paper_url: http://arxiv.org/abs/2307.15339
  • repo_url: https://github.com/rohdelab/PyTransKit
  • paper_authors: Le Gong, Shiying Li, Naqib Sad Pathan, Mohammad Shifat-E-Rabbi, Gustavo K. Rohde, Abu Hasnat Mohammad Rubaiyat, Sumati Thareja
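  • for: A new image representation technique based on the mathematics of transport and optimal transport.
  • methods: The method combines two known representations, the Radon transform for images and the Signed Cumulative Distribution Transform, and generalizes them to arbitrary functions (images), so it can be used in more applications.
  • results: On real and simulated data, the new transform represents the information content of signed images more accurately and therefore yields higher classification accuracies.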
    Abstract Here we describe a new image representation technique based on the mathematics of transport and optimal transport. The method relies on the combination of the well-known Radon transform for images and a recent signal representation method called the Signed Cumulative Distribution Transform. The newly proposed method generalizes previous transport-related image representation methods to arbitrary functions (images), and thus can be used in more applications. We describe the new transform, and some of its mathematical properties and demonstrate its ability to partition image classes with real and simulated data. In comparison to existing transport transform methods, as well as deep learning-based classification methods, the new transform more accurately represents the information content of signed images, and thus can be used to obtain higher classification accuracies. The implementation of the proposed method in Python language is integrated as a part of the software package PyTransKit, available on Github.
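    Summary We propose a new image representation technique based on the mathematics of transport and optimal transport. The method combines the well-known Radon transform with a recent signal representation method called the Signed Cumulative Distribution Transform. The new method extends to arbitrary functions (images) and can therefore be used in more applications. We describe the new transform and some of its mathematical properties, and demonstrate it on example data. Compared with existing transport transform methods and deep learning-based classification methods, the new transform better represents the information content of signed images and thus achieves higher classification accuracy. The method is implemented in Python and integrated into the PyTransKit software package, available on GitHub.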

Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF

  • paper_url: http://arxiv.org/abs/2307.15333
  • repo_url: None
  • paper_authors: Haotian Bai, Yiqi Lin, Yize Chen, Lin Wang
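  • for: Improving the training and inference efficiency of explicit NeRF for applications such as virtual reality and gaming.
  • methods: Dynamic PlenOctree (DOT), an efficient octree representation that adapts to changes in scene complexity.
  • results: Compared with POT, DOT enhances visual quality, reduces parameters by over 55.15%/68.84%, and provides 1.7/1.9 times the FPS on NeRF-synthetic and Tanks & Temples, respectively.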
    Abstract The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction for applications such as virtual reality and gaming. In particular, PlenOctree (POT) [1], an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal as the scene complexity evolves continuously with updates to cached color and density, necessitating refining the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the dynamic PlenOctree (DOT), which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. Firstly, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces the sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT outperforms it by enhancing visual quality, reducing parameters by over 55.15%/68.84%, and providing 1.7/1.9 times FPS for NeRF-synthetic and Tanks & Temples, respectively. Project homepage: https://vlislab22.github.io/DOT. [1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
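    Summary Explicit NeRF has attracted wide interest in areas such as virtual reality and gaming because of its efficient training and fast inference. In particular, PlenOctree (POT) [1], an explicit hierarchical multi-scale octree representation, has become an influential framework. However, POT's fixed structure is sub-optimal for direct optimization: scene complexity changes continuously as cached color and density are updated, so the sampling distribution needs to be refined to follow signal complexity. To address this, we propose the dynamic PlenOctree (DOT), which adaptively refines the sample distribution as scene complexity changes. Specifically, DOT proposes a new hierarchical feature fusion strategy for the iterative rendering process. It first identifies regions of interest through training signals to ensure efficient, adaptive refinement. Then, instead of directly filtering out valueless nodes, it introduces sampling and pruning operations on the octree to support rapid parameter learning. Compared with POT, DOT enhances visual quality, reduces parameters by over 55.15%/68.84%, and provides 1.7/1.9 times the FPS on NeRF-synthetic and Tanks & Temples, respectively. Project homepage: https://vlislab22.github.io/DOT. [1] Yu, Alex, et al. "Plenoctrees for real-time rendering of neural radiance fields." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.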

Staging E-Commerce Products for Online Advertising using Retrieval Assisted Image Generation

  • paper_url: http://arxiv.org/abs/2307.15326
  • repo_url: None
  • paper_authors: Yueh-Ning Ku, Mikhail Kuznetsov, Shaunak Mishra, Paloma de Juan
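  • for: Using generative adversarial networks (GANs) and retrieval assisted GANs to enhance dynamic product ad (DPA) images on e-commerce platforms.
  • methods: Copy-paste staging with retrieval assisted GANs: staged products similar to the input product are first retrieved from the catalog, the background of the retrieved product is copy-pasted into the input image, and a GAN-based in-painting model fills the holes left by the copy-paste operation.
  • results: The copy-paste staging method is validated via offline metrics and human evaluation; the paper also shows how it can be used to generate product animations.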
    Abstract Online ads showing e-commerce products typically rely on the product images in a catalog sent to the advertising platform by an e-commerce platform. In the broader ads industry such ads are called dynamic product ads (DPA). It is common for DPA catalogs to be in the scale of millions (corresponding to the scale of products which can be bought from the e-commerce platform). However, not all product images in the catalog may be appealing when directly re-purposed as an ad image, and this may lead to lower click-through rates (CTRs). In particular, products just placed against a solid background may not be as enticing and realistic as a product staged in a natural environment. To address such shortcomings of DPA images at scale, we propose a generative adversarial network (GAN) based approach to generate staged backgrounds for un-staged product images. Generating the entire staged background is a challenging task susceptible to hallucinations. To get around this, we introduce a simpler approach called copy-paste staging using retrieval assisted GANs. In copy paste staging, we first retrieve (from the catalog) staged products similar to the un-staged input product, and then copy-paste the background of the retrieved product in the input image. A GAN based in-painting model is used to fill the holes left after this copy-paste operation. We show the efficacy of our copy-paste staging method via offline metrics, and human evaluation. In addition, we show how our staging approach can enable animations of moving products leading to a video ad from a product image.
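    Summary Online ads for e-commerce products typically rely on product images supplied by the e-commerce platform; such ads are called dynamic product ads (DPA). DPA catalogs commonly contain millions of images, but not every image is appealing when reused directly as an ad image, which may lead to lower click-through rates (CTRs). In particular, a product shown against a solid background may be less enticing and realistic than one staged in a natural environment. To address this, we propose a generative adversarial network (GAN) based approach that generates staged backgrounds for product images. Generating an entire staged background is a challenging task that is prone to hallucinations, so we introduce a simpler approach: copy-paste staging. We first retrieve from the catalog staged products similar to the input product, then copy the background of the retrieved product into the input image, and use a GAN-based in-painting model to fill the holes left by the copy-paste operation. We demonstrate the effectiveness of the method through offline metrics and human evaluation, and show how it can be used to generate animations.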
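The copy-paste step can be pictured as a mask-based composite. The sketch below is an illustration under stated assumptions: both images are aligned and the same size, each comes with a binary product mask, and pixels where the retrieved product used to be are marked as holes for a later in-painting model. The masks, the `HOLE` marker, and the toy pixel values are hypothetical; retrieval and the GAN itself are out of scope here.

```python
HOLE = None  # marker for pixels that an in-painting model would fill later


def copy_paste_stage(product, product_mask, staged, staged_mask):
    """Composite an un-staged product onto a retrieved staged background."""
    out = []
    for y, row in enumerate(product):
        out_row = []
        for x, pixel in enumerate(row):
            if product_mask[y][x]:
                out_row.append(pixel)          # keep the input product
            elif staged_mask[y][x]:
                out_row.append(HOLE)           # retrieved product sat here
            else:
                out_row.append(staged[y][x])   # copy retrieved background
        out.append(out_row)
    return out


# Toy 2x3 example with single-channel pixel values.
product = [[9, 9, 0], [0, 0, 0]]
product_mask = [[1, 1, 0], [0, 0, 0]]
staged = [[5, 7, 7], [5, 5, 7]]
staged_mask = [[0, 1, 1], [0, 0, 1]]
composite = copy_paste_stage(product, product_mask, staged, staged_mask)
hole_count = sum(pixel is HOLE for row in composite for pixel in row)
```

The closer the retrieved product's silhouette is to the input product's, the fewer holes remain, which is one reason retrieving similar staged products helps the in-painting step.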

TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts

  • paper_url: http://arxiv.org/abs/2307.15324
  • repo_url: https://github.com/prismformore/multi-task-transformer
  • paper_authors: Hanrong Ye, Dan Xu
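  • for: Learning discriminative task-specific features for multiple distinct tasks simultaneously from a shared task-generic feature (e.g., a feature from a backbone layer) in multi-task learning.
  • methods: TaskExpert, a multi-task mixture-of-experts model that learns multiple representative task-generic feature spaces through a set of expert networks and decodes task-specific features dynamically.
  • results: TaskExpert outperforms the previous best-performing methods on all 9 metrics of two competitive multi-task learning benchmarks (PASCAL-Context and NYUD-v2).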
    Abstract Learning discriminative task-specific features simultaneously for multiple distinct tasks is a fundamental problem in multi-task learning. Recent state-of-the-art models consider directly decoding task-specific features from one shared task-generic feature (e.g., feature from a backbone layer), and utilize carefully designed decoders to produce multi-task features. However, as the input feature is fully shared and each task decoder also shares decoding parameters for different input samples, it leads to a static feature decoding process, producing less discriminative task-specific representations. To tackle this limitation, we propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces and decoding task-specific features in a dynamic manner. Specifically, TaskExpert introduces a set of expert networks to decompose the backbone feature into several representative task-generic features. Then, the task-specific features are decoded by using dynamic task-specific gating networks operating on the decomposed task-generic features. Furthermore, to establish long-range modeling of the task-specific representations from different layers of TaskExpert, we design a multi-task feature memory that updates at each layer and acts as an additional feature expert for dynamic task-specific feature decoding. Extensive experiments demonstrate that our TaskExpert clearly outperforms previous best-performing methods on all 9 metrics of two competitive multi-task learning benchmarks for visual scene understanding (i.e., PASCAL-Context and NYUD-v2). Codes and models will be made publicly available at https://github.com/prismformore/Multi-Task-Transformer
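    Summary Learning discriminative task-specific features for multiple distinct tasks simultaneously is a fundamental problem in multi-task learning. Recent state-of-the-art models decode task-specific features directly from one shared task-generic feature (e.g., a feature from a backbone layer) using carefully designed decoders. However, because the input feature is fully shared and each task decoder also shares its decoding parameters across input samples, the decoding process is static and yields less discriminative task-specific representations. To address this limitation, we propose TaskExpert, a multi-task mixture-of-experts model that learns multiple representative task-generic feature spaces and decodes task-specific features dynamically. TaskExpert introduces a set of expert networks that decompose the backbone feature into several representative task-generic features; dynamic task-specific gating networks then decode the task-specific features from them. To model long-range dependencies among task-specific representations across layers, we design a multi-task feature memory that is updated at each layer and acts as an additional feature expert. Extensive experiments show that TaskExpert clearly outperforms the previous best-performing methods on two competitive multi-task learning benchmarks for visual scene understanding (PASCAL-Context and NYUD-v2). Code and models will be made publicly available at https://github.com/prismformore/Multi-Task-Transformer.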
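The idea of dynamic, task-specific gating over shared experts can be sketched as follows. This is a generic mixture-of-experts illustration, not TaskExpert's actual architecture: the linear gate, the vector shapes, and all numbers are assumptions, and the multi-task feature memory is omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gate_scores(feature, gate_weights):
    """One logit per expert from a linear gate, normalized with softmax."""
    logits = [sum(f * w for f, w in zip(feature, w_e)) for w_e in gate_weights]
    return softmax(logits)


def decode_task_feature(expert_outputs, gates):
    """Task-specific feature as a gate-weighted sum of expert outputs."""
    dim = len(expert_outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, expert_outputs)) for i in range(dim)]


# Shared input feature and two shared expert outputs (toy values).
feature = [1.0, 0.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0]]

# Each task owns its gate weights, so the same experts are mixed differently
# per task, and the mix also changes with the input feature.
gates_task_a = gate_scores(feature, [[0.0, 0.0], [0.0, 0.0]])
gates_task_b = gate_scores(feature, [[2.0, 0.0], [0.0, 0.0]])
feature_task_a = decode_task_feature(expert_outputs, gates_task_a)
feature_task_b = decode_task_feature(expert_outputs, gates_task_b)
```

Because the gates depend on the input feature, two samples can receive different expert mixtures for the same task, which is the "dynamic" decoding the abstract contrasts with static shared decoders.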

DocDeshadower: Frequency-aware Transformer for Document Shadow Removal

  • paper_url: http://arxiv.org/abs/2307.15318
  • repo_url: None
  • paper_authors: Shenghong Luo, Ruifeng Xu, Xuhang Chen, Zinuo Li, Chi-Man Pun, Shuqiang Wang
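  • for: Improving shadow removal in scanned documents.
  • methods: A multi-frequency Transformer model with an Attention-Aggregation Network and a Gated Multi-scale Fusion Transformer to remove shadows.
  • results: Outperforms current state-of-the-art methods in both qualitative and quantitative terms.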
    Abstract The presence of shadows significantly impacts the visual quality of scanned documents. However, the existing traditional techniques and deep learning methods used for shadow removal have several limitations. These methods either rely heavily on heuristics, resulting in suboptimal performance, or require large datasets to learn shadow-related features. In this study, we propose the DocDeshadower, a multi-frequency Transformer-based model built on Laplacian Pyramid. DocDeshadower is designed to remove shadows at different frequencies in a coarse-to-fine manner. To achieve this, we decompose the shadow image into different frequency bands using Laplacian Pyramid. In addition, we introduce two novel components to this model: the Attention-Aggregation Network and the Gated Multi-scale Fusion Transformer. The Attention-Aggregation Network is designed to remove shadows in the low-frequency part of the image, whereas the Gated Multi-scale Fusion Transformer refines the entire image at a global scale with its large perceptive field. Our extensive experiments demonstrate that DocDeshadower outperforms the current state-of-the-art methods in both qualitative and quantitative terms.
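    Summary Shadows significantly affect the visual quality of scanned documents, but existing traditional techniques and deep learning methods for shadow removal have limitations: they either rely heavily on heuristics, resulting in suboptimal performance, or need large datasets to learn shadow-related features. This study proposes DocDeshadower, a multi-frequency Transformer-based model built on a Laplacian Pyramid. DocDeshadower removes shadows at different frequencies, using the Laplacian Pyramid to decompose the shadow image into frequency bands. It adds two new components: the Attention-Aggregation Network, which removes shadows in the low-frequency part of the image, and the Gated Multi-scale Fusion Transformer, which refines the whole image at a global scale with its large receptive field. Extensive experiments show that DocDeshadower outperforms current state-of-the-art methods both qualitatively and quantitatively.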
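The frequency decomposition behind a Laplacian pyramid can be shown on a 1D signal. The sketch below is a simplification under stated assumptions: it uses pair averaging and sample repetition instead of the Gaussian filtering used in practice, and it requires the signal length to be divisible by 2 to the power of the number of levels. Function names and the toy signal are illustrative.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs (even length assumed)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating each sample."""
    return [v for v in signal for _ in range(2)]


def build_pyramid(signal, levels):
    """Return [band_0, ..., band_{levels-1}, coarse]; bands hold the detail
    lost at each downsampling step, finest first."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        low = downsample(current)
        bands.append([c - u for c, u in zip(current, upsample(low))])
        current = low
    bands.append(current)  # low-frequency residual
    return bands


def reconstruct(bands):
    """Invert build_pyramid by going coarse to fine and adding detail back."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        current = [b + u for b, u in zip(band, upsample(current))]
    return current


signal = [1.0, 3.0, 2.0, 6.0]
pyramid = build_pyramid(signal, levels=2)
restored = reconstruct(pyramid)
```

Reconstruction proceeds from the coarsest level upward, which mirrors the coarse-to-fine order in which the abstract says shadows are removed across frequency bands.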

Attentive Multimodal Fusion for Optical and Scene Flow

  • paper_url: http://arxiv.org/abs/2307.15301
  • repo_url: https://github.com/jiesico/fusionraft
  • paper_authors: Youjie Zhou, Guofeng Mei, Yiming Wang, Fabio Poiesi, Yi Wan
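  • for: Estimating optical and scene flow when the RGB modality is unreliable, especially under noise or in low-light environments.
  • methods: FusionRAFT, a deep learning approach that fuses the two sensor modalities (RGB and depth) at an early stage, using self- and cross-attention layers to build informative features that leverage the strengths of both modalities.
  • results: In comparative experiments, FusionRAFT performs better on the synthetic dataset Flyingthings3D and the real-world dataset KITTI, and is more robust under noise and low-light conditions.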
    Abstract This paper presents an investigation into the estimation of optical and scene flow using RGBD information in scenarios where the RGB modality is affected by noise or captured in dark environments. Existing methods typically rely solely on RGB images or fuse the modalities at later stages, which can result in lower accuracy when the RGB information is unreliable. To address this issue, we propose a novel deep neural network approach named FusionRAFT, which enables early-stage information fusion between sensor modalities (RGB and depth). Our approach incorporates self- and cross-attention layers at different network levels to construct informative features that leverage the strengths of both modalities. Through comparative experiments, we demonstrate that our approach outperforms recent methods in terms of performance on the synthetic dataset Flyingthings3D, as well as the generalization on the real-world dataset KITTI. We illustrate that our approach exhibits improved robustness in the presence of noise and low-lighting conditions that affect the RGB images. We release the code, models and dataset at https://github.com/jiesico/FusionRAFT.
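    Summary This paper studies optical and scene flow estimation from RGBD information in scenarios where the RGB modality is affected by noise or captured in dark environments. Existing methods typically rely on RGB images alone or fuse the modalities at later stages, which lowers accuracy when the RGB information is unreliable. The proposed deep neural network, FusionRAFT, enables early-stage fusion between the sensor modalities (RGB and depth) and uses self- and cross-attention layers at different network levels to build informative features that leverage the strengths of both. Comparative experiments show that it outperforms recent methods on the synthetic Flyingthings3D dataset and generalizes better to the real-world KITTI dataset, with improved robustness to noise and low-light conditions. Code, models and dataset are released at https://github.com/jiesico/FusionRAFT.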
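
To make the self- and cross-attention idea concrete, here is a small NumPy sketch of scaled dot-product attention applied within one modality and across two. It is an illustration only: the learned query/key/value projections of a real network are omitted, the additive fusion is an assumption, and the token arrays are random placeholders rather than FusionRAFT features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns the output and the weight matrix."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

rng = np.random.default_rng(0)
rgb_tokens = rng.standard_normal((6, 8))    # hypothetical RGB features: 6 tokens, dim 8
depth_tokens = rng.standard_normal((6, 8))  # hypothetical depth features

# Self-attention: RGB tokens attend to other RGB tokens.
rgb_self, _ = attention(rgb_tokens, rgb_tokens, rgb_tokens)
# Cross-attention: RGB tokens query the depth tokens.
rgb_from_depth, cross_weights = attention(rgb_tokens, depth_tokens, depth_tokens)

# Assumed fusion rule for this sketch: sum the two views of each RGB token.
fused = rgb_self + rgb_from_depth
```

When the RGB input is noisy, the cross-attention term lets each RGB token borrow information from depth tokens, which is the intuition behind fusing early rather than late.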

AC-Norm: Effective Tuning for Medical Image Analysis via Affine Collaborative Normalization

  • paper_url: http://arxiv.org/abs/2307.15282
  • repo_url: https://github.com/endoluminalsurgicalvision-imr/acnorm
  • paper_authors: Chuyan Zhang, Yuncheng Yang, Hao Zheng, Yun Gu
  • for: This paper focuses on enhancing the performance of clinical applications with limited annotations using self-supervised learning (SSL) and the "pretraining-then-finetuning" paradigm.
  • methods: The proposed method, Affine Collaborative Normalization (AC-Norm), uses the trainable affine parameters of batch normalization (BN) layers to dynamically recalibrate the channels in the target model according to cross-domain channel-wise correlations, without adding extra parameters.
  • results: AC-Norm outperforms vanilla finetuning by up to 4% on various transfer learning tasks, including diabetic retinopathy grade classification, retinal vessel segmentation, CT lung nodule segmentation/classification, CT liver-tumor segmentation, and MRI cardiac segmentation. AC-Norm is also capable of fast transferability estimation.
    Abstract Driven by the latest trend towards self-supervised learning (SSL), the paradigm of "pretraining-then-finetuning" has been extensively explored to enhance the performance of clinical applications with limited annotations. Previous literature on model finetuning has mainly focused on regularization terms and specific policy models, while the misalignment of channels between source and target models has not received sufficient attention. In this work, we revisited the dynamics of batch normalization (BN) layers and observed that the trainable affine parameters of BN serve as sensitive indicators of domain information. Therefore, Affine Collaborative Normalization (AC-Norm) is proposed for finetuning, which dynamically recalibrates the channels in the target model according to the cross-domain channel-wise correlations without adding extra parameters. Based on a single-step backpropagation, AC-Norm can also be utilized to measure the transferability of pretrained models. We evaluated AC-Norm against the vanilla finetuning and state-of-the-art fine-tuning methods on transferring diverse pretrained models to the diabetic retinopathy grade classification, retinal vessel segmentation, CT lung nodule segmentation/classification, CT liver-tumor segmentation and MRI cardiac segmentation tasks. Extensive experiments demonstrate that AC-Norm unanimously outperforms the vanilla finetuning by up to 4% improvement, even under significant domain shifts where the state-of-the-art methods bring no gains. We also prove the capability of AC-Norm in fast transferability estimation. Our code is available at https://github.com/EndoluminalSurgicalVision-IMR/ACNorm.
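    Summary Driven by the trend towards self-supervised learning (SSL), the "pretraining-then-finetuning" paradigm has been widely explored to improve clinical applications with limited annotations. Earlier finetuning work focused mainly on regularization terms and specific policy models, while the misalignment of channels between source and target models received little attention. This work revisits the dynamics of batch normalization (BN) layers and observes that the trainable affine parameters of BN are sensitive indicators of domain information. It therefore proposes Affine Collaborative Normalization (AC-Norm) for finetuning, which dynamically recalibrates the channels of the target model according to cross-domain channel-wise correlations without adding parameters. With a single step of backpropagation, AC-Norm can also measure the transferability of pretrained models. Across diabetic retinopathy grade classification, retinal vessel segmentation, CT lung nodule segmentation/classification, CT liver-tumor segmentation and MRI cardiac segmentation, AC-Norm outperforms vanilla finetuning by up to 4%, even under significant domain shifts where state-of-the-art methods bring no gains. Code is available at https://github.com/EndoluminalSurgicalVision-IMR/ACNorm.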
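
For readers unfamiliar with the "trainable affine parameters" of batch normalization, the sketch below shows the standard BN forward pass in NumPy. It covers only plain BN, not AC-Norm's cross-domain recalibration, whose exact formulation is not given in the abstract. The sample values of `gamma` and `beta` are arbitrary.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standard BN forward pass for x of shape (batch, channels)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # per-channel zero mean, unit variance
    return gamma * x_hat + beta              # trainable per-channel scale and shift

rng = np.random.default_rng(0)
features = rng.standard_normal((256, 3)) * 5.0 + 2.0

gamma = np.array([1.0, 0.5, 0.0])  # channel 2 is switched off by a zero scale
beta = np.array([0.0, 0.0, 1.0])
out = batch_norm(features, gamma, beta)
```

Since `gamma` rescales each normalized channel, it acts as a per-channel gate, which is why these parameters can carry information about which channels matter in a given domain.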

Recovering high-quality FODs from a reduced number of diffusion-weighted images using a model-driven deep learning architecture

  • paper_url: http://arxiv.org/abs/2307.15273
  • repo_url: https://github.com/jbartlett6/sdnet
  • paper_authors: J Bartlett, C E Davey, L A Johnston, J Duan
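  • for: Proposes a deep learning method for fibre orientation distribution (FOD) reconstruction that produces accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time.
  • methods: A spherical deconvolution network, a model-driven deep learning architecture that keeps the intermediate and output FODs consistent with the input DWI signals. The loss function includes a fixel classification penalty.
  • results: The model-based architecture performs competitively with the state-of-the-art FOD super-resolution network FOD-Net, and the fixel classification penalty can be tuned to improve metrics that rely on accurately segmented FODs. Code is available at https://github.com/Jbartlett6/SDNet.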
    Abstract Fibre orientation distribution (FOD) reconstruction using deep learning has the potential to produce accurate FODs from a reduced number of diffusion-weighted images (DWIs), decreasing total imaging time. Diffusion acquisition invariant representations of the DWI signals are typically used as input to these methods to ensure that they can be applied flexibly to data with different b-vectors and b-values; however, this means the network cannot condition its output directly on the DWI signal. In this work, we propose a spherical deconvolution network, a model-driven deep learning FOD reconstruction architecture, that ensures intermediate and output FODs produced by the network are consistent with the input DWI signals. Furthermore, we implement a fixel classification penalty within our loss function, encouraging the network to produce FODs that can subsequently be segmented into the correct number of fixels and improve downstream fixel-based analysis. Our results show that the model-based deep learning architecture achieves competitive performance compared to a state-of-the-art FOD super-resolution network, FOD-Net. Moreover, we show that the fixel classification penalty can be tuned to offer improved performance with respect to metrics that rely on accurately segmented of FODs. Our code is publicly available at https://github.com/Jbartlett6/SDNet .
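    Summary Fibre orientation distribution (FOD) reconstruction with deep learning can produce accurate FODs from a reduced number of diffusion-weighted images (DWIs), shortening total imaging time. Such methods usually take diffusion-acquisition-invariant representations of the DWI signals as input so that they apply to data with different b-vectors and b-values, but this prevents the network from conditioning its output directly on the DWI signal. This work proposes a spherical deconvolution network, a model-driven deep learning FOD reconstruction architecture that keeps the intermediate and output FODs consistent with the input DWI signals. A fixel classification penalty in the loss function encourages FODs that can be segmented into the correct number of fixels, improving downstream fixel-based analysis. The architecture performs competitively with the state-of-the-art FOD super-resolution network FOD-Net, and the penalty can be tuned to improve metrics that rely on accurately segmented FODs. Code is publicly available at https://github.com/Jbartlett6/SDNet.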

Anatomy-Aware Lymph Node Detection in Chest CT using Implicit Station Stratification

  • paper_url: http://arxiv.org/abs/2307.15271
  • repo_url: None
  • paper_authors: Ke Yan, Dakai Jin, Dazhou Guo, Minfeng Xu, Na Shen, Xian-Sheng Hua, Xianghua Ye, Le Lu
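  • for: Improving the automated detection of abnormal lymph nodes (LNs) in medical images.
  • methods: Proposes a novel end-to-end framework that uses LN station information to improve detection. A multi-head detector is designed so that each head focuses on distinguishing LN from non-LN structures in certain stations. During training, an LN station classifier generates pseudo station labels as a form of multi-task learning, so no separate explicit LN station prediction model is needed at inference.
  • results: Evaluated on CT scans of 82 patients with lung cancer and 91 patients with esophageal cancer. Detection sensitivity at 2 false positives per patient improves from 65.1% to 71.4% and from 80.3% to 85.5% on the two datasets, respectively, outperforming existing baselines such as nnUNet, nnDetection and LENS.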
    Abstract Finding abnormal lymph nodes in radiological images is highly important for various medical tasks such as cancer metastasis staging and radiotherapy planning. Lymph nodes (LNs) are small glands scattered throughout the body. They are grouped or defined to various LN stations according to their anatomical locations. The CT imaging appearance and context of LNs in different stations vary significantly, posing challenges for automated detection, especially for pathological LNs. Motivated by this observation, we propose a novel end-to-end framework to improve LN detection performance by leveraging their station information. We design a multi-head detector and make each head focus on differentiating the LN and non-LN structures of certain stations. Pseudo station labels are generated by an LN station classifier as a form of multi-task learning during training, so we do not need another explicit LN station prediction model during inference. Our algorithm is evaluated on 82 patients with lung cancer and 91 patients with esophageal cancer. The proposed implicit station stratification method improves the detection sensitivity of thoracic lymph nodes from 65.1% to 71.4% and from 80.3% to 85.5% at 2 false positives per patient on the two datasets, respectively, which significantly outperforms various existing state-of-the-art baseline techniques such as nnUNet, nnDetection and LENS.
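    Summary Finding abnormal lymph nodes in radiological images is important for medical tasks such as cancer metastasis staging and radiotherapy planning. Lymph nodes (LNs) are small glands scattered throughout the body and are grouped into LN stations according to their anatomical locations. The CT appearance and context of LNs vary significantly between stations, which makes automated detection difficult, especially for pathological LNs. To address this, the paper proposes a novel end-to-end framework that improves LN detection by using station information. A multi-head detector is designed in which each head focuses on distinguishing LN from non-LN structures in certain stations. During training, an LN station classifier generates pseudo station labels as a form of multi-task learning, so no explicit LN station prediction model is needed at inference. On 82 patients with lung cancer and 91 patients with esophageal cancer, the method raises thoracic LN detection sensitivity from 65.1% to 71.4% and from 80.3% to 85.5% at 2 false positives per patient, respectively, significantly outperforming existing baselines such as nnUNet, nnDetection and LENS.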
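
The metric "sensitivity at 2 false positives per patient" can be computed from a ranked list of detections. The NumPy sketch below is one simple way to do it, assuming each detection has already been labelled as a true or false positive and that each true positive matches a distinct lesion. The function name and the toy data are hypothetical.

```python
import numpy as np

def sensitivity_at_fp_rate(scores, is_true_positive, num_lesions,
                           num_patients, fps_per_patient):
    """Best sensitivity reachable while keeping false positives within budget."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest score first
    hits = np.asarray(is_true_positive, dtype=bool)[order]
    tp_cum = np.cumsum(hits)     # true positives kept at each score cut-off
    fp_cum = np.cumsum(~hits)    # false positives kept at each score cut-off
    allowed = fp_cum <= fps_per_patient * num_patients
    if not allowed.any():
        return 0.0
    return float(tp_cum[allowed].max() / num_lesions)

# Toy example: one patient with 4 annotated lesions and 6 detections.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]
at_2_fp = sensitivity_at_fp_rate(scores, labels, num_lesions=4,
                                 num_patients=1, fps_per_patient=2)
at_3_fp = sensitivity_at_fp_rate(scores, labels, num_lesions=4,
                                 num_patients=1, fps_per_patient=3)
```

In the toy data, allowing two false positives keeps the top four detections (two of four lesions found), while allowing three keeps all six (three of four lesions found).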

RSGPT: A Remote Sensing Vision Language Model and Benchmark

  • paper_url: http://arxiv.org/abs/2307.15266
  • repo_url: None
  • paper_authors: Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Xiang Li
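  • for: Developing large vision language models (VLMs) tailored to data analysis in remote sensing (RS).
  • methods: Builds RSICap, a human-annotated remote sensing image captioning dataset, and RSIEval, a benchmark evaluation dataset, for training and evaluating large VLMs.
  • results: The high-quality RSICap and RSIEval datasets support the development of large VLMs that perform well on RS applications.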
    Abstract The emergence of large-scale large language models, with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is a growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, lacking comprehensive, large-scale image-text datasets that are aligned and suitable for training large VLMs, which poses significant challenges to effectively training such models for RS applications. In computer vision, recent research has demonstrated that fine-tuning large vision language models on small-scale, high-quality datasets can yield impressive performance in visual and language understanding. These results are comparable to state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this captivating idea, in this work, we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that either employ model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. This dataset offers detailed descriptions for each image, encompassing scene descriptions (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position, etc). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval. This dataset consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.
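    Summary Large language models such as GPT-4 have greatly accelerated progress towards artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In remote sensing (RS), there is growing interest in large vision language models (VLMs) tailored to data analysis in this domain. Current research, however, focuses mainly on visual recognition tasks and lacks comprehensive, large-scale, aligned image-text datasets, which makes training large VLMs for RS difficult. Recent computer vision research shows that fine-tuning large VLMs on small, high-quality datasets can yield impressive visual and language understanding, comparable to state-of-the-art VLMs trained from scratch on massive amounts of data. Inspired by this, the authors build RSICap, a high-quality Remote Sensing Image Captioning dataset. Unlike earlier RS datasets with model-generated captions or short descriptions, RSICap contains 2,585 human-annotated captions with detailed scene descriptions (e.g., residential area, airport, or farmland) and object information (e.g., color, shape, quantity, absolute position). They also provide RSIEval, a benchmark of human-annotated captions and visual question-answer pairs for comprehensive evaluation of VLMs in RS.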

Learning with Constraint Learning: New Perspective, Solution Strategy and Various Applications

  • paper_url: http://arxiv.org/abs/2307.15257
  • repo_url: None
  • paper_authors: Risheng Liu, Jiaxin Gao, Xuan Liu, Xin Fan
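  • for: Addressing complex machine learning and computer vision problems, including Generative Adversarial Networks (GANs) and their variants, multi-task and meta-learning, hyper-parameter learning, and a variety of real-world vision applications.
  • methods: Proposes Learning with Constraint Learning (LwCL), a general hierarchical optimization model that captures the essence of these diverse learning and vision problems in a unified way, together with a fast solution strategy based on gradient response.
  • results: LwCL covers a wide range of learning and vision applications spanning three categories and nine problem types. Experiments on synthetic tasks and real-world applications verify its effectiveness.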
    Abstract The complexity of learning problems, such as Generative Adversarial Network (GAN) and its variants, multi-task and meta-learning, hyper-parameter learning, and a variety of real-world vision applications, demands a deeper understanding of their underlying coupling mechanisms. Existing approaches often address these problems in isolation, lacking a unified perspective that can reveal commonalities and enable effective solutions. Therefore, in this work, we proposed a new framework, named Learning with Constraint Learning (LwCL), that can holistically examine challenges and provide a unified methodology to tackle all the above-mentioned complex learning and vision problems. Specifically, LwCL is designed as a general hierarchical optimization model that captures the essence of these diverse learning and vision problems. Furthermore, we develop a gradient-response based fast solution strategy to overcome optimization challenges of the LwCL framework. Our proposed framework efficiently addresses a wide range of applications in learning and vision, encompassing three categories and nine different problem types. Extensive experiments on synthetic tasks and real-world applications verify the effectiveness of our approach. The LwCL framework offers a comprehensive solution for tackling complex machine learning and computer vision problems, bridging the gap between theory and practice.
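    Summary Complex learning problems such as Generative Adversarial Networks (GANs) and their variants, multi-task and meta-learning, hyper-parameter learning, and various real-world vision applications call for a deeper understanding of their underlying coupling mechanisms. Existing approaches often treat these problems in isolation and lack a unified perspective that reveals commonalities and enables effective solutions. This work proposes Learning with Constraint Learning (LwCL), a framework that examines these challenges holistically and provides a unified methodology for them. LwCL is designed as a general hierarchical optimization model that captures the essence of these diverse problems, and a fast solution strategy based on gradient response overcomes its optimization challenges. The framework addresses a wide range of applications covering three categories and nine problem types. Extensive experiments on synthetic tasks and real-world applications verify its effectiveness and help bridge the gap between theory and practice.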

A Solution to Co-occurrence Bias: Attributes Disentanglement via Mutual Information Minimization for Pedestrian Attribute Recognition

  • paper_url: http://arxiv.org/abs/2307.15252
  • repo_url: https://github.com/sdret/a-solution-to-co-occurence-bias-in-pedestrian-attribute-recognition
  • paper_authors: Yibo Zhou, Hai-Miao Hu, Jinzuo Yu, Zhenbo Xu, Weiqing Lu, Yuran Cao
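  • for: Improving the robustness and generalization of pedestrian attribute recognition.
  • methods: Proposes attributes-disentangled feature learning, formulated as mutual information minimization, so that recognizing one attribute does not depend on the presence of others.
  • results: Substantially improves the baseline and achieves state-of-the-art performance on realistic datasets such as PETAzs and RAPzs.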
    Abstract Recent studies on pedestrian attribute recognition progress with either explicit or implicit modeling of the co-occurrence among attributes. Considering that this known a prior is highly variable and unforeseeable regarding the specific scenarios, we show that current methods can actually suffer in generalizing such fitted attributes interdependencies onto scenes or identities off the dataset distribution, resulting in the underlined bias of attributes co-occurrence. To render models robust in realistic scenes, we propose the attributes-disentangled feature learning to ensure the recognition of an attribute not inferring on the existence of others, and which is sequentially formulated as a problem of mutual information minimization. Rooting from it, practical strategies are devised to efficiently decouple attributes, which substantially improve the baseline and establish state-of-the-art performance on realistic datasets like PETAzs and RAPzs. Code is released on https://github.com/SDret/A-Solution-to-Co-occurence-Bias-in-Pedestrian-Attribute-Recognition.
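    Summary Recent work on pedestrian attribute recognition models the co-occurrence among attributes either explicitly or implicitly. Because this prior is highly variable and unpredictable across scenarios, current methods can generalize poorly when the fitted attribute interdependencies are applied to scenes or identities outside the dataset distribution, which results in a co-occurrence bias. To make models robust in realistic scenes, the authors propose attributes-disentangled feature learning, which ensures that recognizing one attribute does not rely on the existence of others and is formulated as a mutual information minimization problem. Practical strategies derived from this formulation decouple attributes efficiently, substantially improve the baseline, and establish state-of-the-art performance on realistic datasets such as PETAzs and RAPzs. Code is released at https://github.com/SDret/A-Solution-to-Co-occurence-Bias-in-Pedestrian-Attribute-Recognition.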
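
As a reminder of the quantity being minimized, the sketch below computes the empirical mutual information between two discrete attribute labels with NumPy. It is only an illustration of the definition: the paper applies the idea to learned features, whose estimator is not described in the abstract. The attribute names are made-up examples.

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete label arrays."""
    a = np.asarray(a)
    b = np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))  # joint probability
            if p_ab > 0:
                p_a = np.mean(a == va)
                p_b = np.mean(b == vb)
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return float(mi)

# Hypothetical binary attributes over four pedestrians.
wears_hat = [0, 0, 1, 1]
carries_bag = [0, 1, 0, 1]   # independent of wears_hat in this sample
long_coat = [0, 0, 1, 1]     # always co-occurs with wears_hat

mi_independent = mutual_information(wears_hat, carries_bag)
mi_coupled = mutual_information(wears_hat, long_coat)
```

Independent attributes give zero mutual information, while perfectly co-occurring ones give the full entropy of the attribute (log 2 here). Driving this quantity down between attribute representations discourages the model from inferring one attribute from another.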

D2S: Representing local descriptors and global scene coordinates for camera relocalization

  • paper_url: http://arxiv.org/abs/2307.15250
  • repo_url: None
  • paper_authors: Bach-Thuan Bui, Dinh-Tuan Tran, Joo-Ho Lee
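  • for: Proposes a direct learning-based visual localization method that avoids the high inference, storage and update costs of existing methods, which match local descriptors against 3D point clouds.
  • methods: A simple network named D2S represents local descriptors and their scene coordinates. It combines a simple loss function with graph attention to focus selectively on robust descriptors while disregarding unreliable areas such as clouds, trees and dynamic objects.
  • results: Outperforms state-of-the-art CNN-based methods on scene coordinate regression in indoor and outdoor environments. It generalizes beyond the training data, including day-to-night transitions and domain shifts, even without labeled data sources.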
    Abstract State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds. However, these procedures can incur significant cost in terms of inference, storage, and updates over time. In this study, we propose a direct learning-based approach that utilizes a simple network named D2S to represent local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. It solely leverages a single RGB image for localization during the testing phase and only requires a lightweight model to encode a complex sparse scene. The proposed D2S employs a combination of a simple loss function and graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform a binary-semantic classification for sparse descriptors. Additionally, we propose a new outdoor dataset to evaluate the capabilities of visual localization methods in terms of scene generalization and self-updating from unlabeled observations. Our approach outperforms the state-of-the-art CNN-based methods in scene coordinate regression in indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including scenarios involving transitions from day to night and adapting to domain shifts, even in the absence of the labeled data sources. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s
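    Summary State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds, which can be costly in inference, storage and updates over time. This study proposes a direct learning-based approach in which a simple network named D2S represents local descriptors and their scene coordinates. The method is simple and cost-effective: it uses a single RGB image for localization at test time and needs only a lightweight model to encode a complex sparse scene. D2S combines a simple loss function with graph attention to focus selectively on robust descriptors while disregarding areas such as clouds, trees and several dynamic objects, which lets it perform a binary-semantic classification of sparse descriptors. The authors also propose a new outdoor dataset for evaluating scene generalization and self-updating from unlabeled observations. The approach outperforms state-of-the-art CNN-based methods on scene coordinate regression in indoor and outdoor environments and generalizes beyond the training data, including day-to-night transitions and domain shifts, even without labeled data sources. Source code, trained models, dataset and demo videos are available at https://thpjp.github.io/d2s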

TROPHY: A Topologically Robust Physics-Informed Tracking Framework for Tropical Cyclones

  • paper_url: http://arxiv.org/abs/2307.15243
  • repo_url: None
  • paper_authors: Lin Yan, Hanqi Guo, Thomas Peterka, Bei Wang, Jiali Wang
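  • for: Proposes a physics-informed tropical cyclone (TC) tracking method that improves the computational efficiency of TC tracking on large-scale climate datasets.
  • methods: During preprocessing, a physics-informed feature selection strategy filters out 90% of critical points, namely those that are short-lived and have low stability. During the multilevel robustness computation, constraints restrict attention to physics-informed neighborhoods of TCs.
  • results: Applied to 30 years of 2D wind fields from ERA5 reanalysis data and compared with observed tracks, TROPHY captures TC characteristics comparable to, and sometimes better than, a well-validated TC tracking algorithm.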
    Abstract Tropical cyclones (TCs) are among the most destructive weather systems. Realistically and efficiently detecting and tracking TCs are critical for assessing their impacts and risks. Recently, a multilevel robustness framework has been introduced to study the critical points of time-varying vector fields. The framework quantifies the robustness of critical points across varying neighborhoods. By relating the multilevel robustness with critical point tracking, the framework has demonstrated its potential in cyclone tracking. An advantage is that it identifies cyclonic features using only 2D wind vector fields, which is encouraging as most tracking algorithms require multiple dynamic and thermodynamic variables at different altitudes. A disadvantage is that the framework does not scale well computationally for datasets containing a large number of cyclones. This paper introduces a topologically robust physics-informed tracking framework (TROPHY) for TC tracking. The main idea is to integrate physical knowledge of TC to drastically improve the computational efficiency of multilevel robustness framework for large-scale climate datasets. First, during preprocessing, we propose a physics-informed feature selection strategy to filter 90% of critical points that are short-lived and have low stability, thus preserving good candidates for TC tracking. Second, during in-processing, we impose constraints during the multilevel robustness computation to focus only on physics-informed neighborhoods of TCs. We apply TROPHY to 30 years of 2D wind fields from reanalysis data in ERA5 and generate a number of TC tracks. In comparison with the observed tracks, we demonstrate that TROPHY can capture TC characteristics that are comparable to and sometimes even better than a well-validated TC tracking algorithm that requires multiple dynamic and thermodynamic scalar fields.
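    Summary Tropical cyclones (TCs) are among the most destructive weather systems, and detecting and tracking them realistically and efficiently is critical for assessing their impacts and risks. A recently introduced multilevel robustness framework studies the critical points of time-varying vector fields by quantifying their robustness across varying neighborhoods, and it has shown potential for cyclone tracking. Its advantage is that it identifies cyclonic features using only 2D wind vector fields, whereas most tracking algorithms need multiple dynamic and thermodynamic variables at different altitudes. Its disadvantage is that it does not scale well computationally to datasets containing many cyclones. This paper introduces TROPHY, a topologically robust physics-informed tracking framework that integrates physical knowledge of TCs to drastically improve the efficiency of the multilevel robustness framework on large-scale climate datasets. First, during preprocessing, a physics-informed feature selection strategy filters out 90% of critical points that are short-lived and have low stability, preserving good candidates for TC tracking. Second, during in-processing, constraints limit the multilevel robustness computation to physics-informed neighborhoods of TCs. Applied to 30 years of 2D wind fields from ERA5 reanalysis data, TROPHY captures TC characteristics comparable to, and sometimes better than, a well-validated TC tracking algorithm that requires multiple dynamic and thermodynamic scalar fields.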

Fast Dust Sand Image Enhancement Based on Color Correction and New Membership Function

  • paper_url: http://arxiv.org/abs/2307.15230
  • repo_url: None
  • paper_authors: Ali Hakem Alsaeedi, Suha Mohammed Hadi, Yarub Alazzawi
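  • for: Improving the quality of images captured in sand and dust conditions.
  • methods: Corrects the color shift with a new membership function, removes haze with the Adaptive Dark Channel Prior (A-DCP), and stretches contrast and improves brightness with Contrast Limited Adaptive Histogram Equalization (CLAHE).
  • results: Removes red and yellow casts more effectively than existing studies and produces high-quality dust images.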
    Abstract Images captured in dusty environments suffer from poor visibility and quality. Enhancing such images, for example sand dust images, plays a critical role in various atmospheric optics applications. This work proposes a new model based on color correction and a new membership function to enhance sand dust images. The proposed model consists of three phases: correction of color shift, removal of haze, and enhancement of contrast and brightness. The color shift is corrected using a new membership function to adjust the values of U and V in the YUV color space. The Adaptive Dark Channel Prior (A-DCP) is used for haze removal. Contrast stretching and brightness improvement are based on Contrast Limited Adaptive Histogram Equalization (CLAHE). The proposed model is tested and evaluated on many real sand dust images. The experimental results show that the proposed solution outperforms current studies in effectively removing the red and yellow cast and provides high quality and quantity dust images.
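    Summary Images captured in dusty environments suffer from poor visibility and quality, and enhancing them plays a critical role in atmospheric optics applications. This work proposes an enhancement model based on color correction and a new membership function. The model has three phases: color shift correction, haze removal, and contrast and brightness enhancement. The color shift is corrected by using the new membership function to adjust the U and V values in the YUV color space. Haze is removed with the Adaptive Dark Channel Prior (A-DCP). Contrast stretching and brightness improvement are based on Contrast Limited Adaptive Histogram Equalization (CLAHE). The model is tested and evaluated on many real sand dust images, and the experiments show that it outperforms current studies in removing the red and yellow cast.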
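
The first phase, adjusting U and V in the YUV color space, can be illustrated with a NumPy sketch. The paper's membership function is not specified in the abstract, so this example swaps in a much simpler rule: subtracting the mean of each chroma channel (a gray-world style correction). The conversion uses the common BT.601-style coefficients. The image is synthetic.

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert RGB in [0, 1] to YUV with BT.601-style coefficients."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return np.stack([y, u, v], axis=-1)

def yuv_to_rgb(yuv):
    """Inverse of rgb_to_yuv."""
    y, u, v = yuv[..., 0], yuv[..., 1], yuv[..., 2]
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return np.stack([r, g, b], axis=-1)

def remove_chroma_cast(rgb):
    """Assumed stand-in for the paper's membership function: centre U and V at zero."""
    yuv = rgb_to_yuv(rgb)
    yuv[..., 1] -= yuv[..., 1].mean()
    yuv[..., 2] -= yuv[..., 2].mean()
    return np.clip(yuv_to_rgb(yuv), 0.0, 1.0)

rng = np.random.default_rng(0)
neutral = rng.uniform(0.3, 0.6, size=(8, 8, 3))
dusty = neutral + np.array([0.2, 0.0, -0.1])  # synthetic reddish-yellow cast
corrected = remove_chroma_cast(dusty)
channel_means = corrected.mean(axis=(0, 1))
```

A dust cast pushes the mean of V up and the mean of U down. Centring both chroma channels brings the average color back to gray while leaving the luminance Y untouched, so haze removal and CLAHE can then operate on a color-balanced image.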

Sustainable Transparency in Recommender Systems: Bayesian Ranking of Images for Explainability

  • paper_url: http://arxiv.org/abs/2308.01196
  • repo_url: None
  • paper_authors: Jorge Paz-Ruza, Amparo Alonso-Betanzos, Berta Guijarro-Berdiñas, Brais Cancela, Carlos Eiras-Franco
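  • for: Improving transparency and user trust in recommender systems.
  • methods: Generates personalized explanations from visual content created by users, trained with a learning goal based on Bayesian Pairwise Ranking.
  • results: BRIE consistently outperforms state-of-the-art models on six real-world datasets while emitting up to 75% less CO${_2}$ during training and inference, with a model up to 64 times smaller than previous approaches.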
    Abstract Recommender Systems have become crucial in the modern world, commonly guiding users towards relevant content or products, and having a large influence over the decisions of users and citizens. However, ensuring transparency and user trust in these systems remains a challenge; personalized explanations have emerged as a solution, offering justifications for recommendations. Among the existing approaches for generating personalized explanations, using visual content created by the users is one particularly promising option, showing a potential to maximize transparency and user trust. Existing models for explaining recommendations in this context face limitations: sustainability has been a critical concern, as they often require substantial computational resources, leading to significant carbon emissions comparable to the Recommender Systems where they would be integrated. Moreover, most models employ surrogate learning goals that do not align with the objective of ranking the most effective personalized explanations for a given recommendation, leading to a suboptimal learning process and larger model sizes. To address these limitations, we present BRIE, a novel model designed to tackle the existing challenges by adopting a more adequate learning goal based on Bayesian Pairwise Ranking, enabling it to achieve consistently superior performance than state-of-the-art models in six real-world datasets, while exhibiting remarkable efficiency, emitting up to 75% less CO${_2}$ during training and inference with a model up to 64 times smaller than previous approaches.
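    Summary Existing models for explaining recommendations with user-created visual content face limitations. They often require substantial computational resources, which leads to significant carbon emissions comparable to those of the recommender systems they would be integrated into. Most also use surrogate learning goals that do not align with the objective of ranking the most effective personalized explanations for a given recommendation, which results in a suboptimal learning process and larger models. To address these limitations, the authors present BRIE, a model that adopts a more adequate learning goal based on Bayesian Pairwise Ranking. BRIE consistently outperforms state-of-the-art models on six real-world datasets while being markedly more efficient: it emits up to 75% less CO${_2}$ during training and inference, with a model up to 64 times smaller than previous approaches.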
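
The Bayesian Pairwise Ranking objective mentioned in the abstract has a compact form: for a preferred item and a less preferred one, minimize the negative log-sigmoid of their score difference. The NumPy sketch below shows that loss in isolation. How BRIE computes the scores from images is not covered, and the score values are made up.

```python
import numpy as np

def bpr_loss(positive_scores, negative_scores):
    """Mean of -log(sigmoid(s_pos - s_neg)), computed stably as softplus(-(s_pos - s_neg))."""
    diff = np.asarray(positive_scores, dtype=float) - np.asarray(negative_scores, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -diff)))

# Hypothetical scores for (relevant image, irrelevant image) pairs.
loss_tied = bpr_loss([1.0, 2.0], [1.0, 2.0])       # model cannot separate the pairs
loss_good = bpr_loss([3.0, 4.0], [0.0, 1.0])       # relevant images ranked higher
loss_bad = bpr_loss([0.0, 1.0], [3.0, 4.0])        # relevant images ranked lower
```

The loss depends only on score differences within a pair, so it optimizes the ranking directly rather than a surrogate such as per-item classification accuracy.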

Generative AI for Medical Imaging: extending the MONAI Framework

  • paper_url: http://arxiv.org/abs/2307.15208
  • repo_url: https://github.com/project-monai/generativemodels
  • paper_authors: Walter H. L. Pinaya, Mark S. Graham, Eric Kerfoot, Petru-Daniel Tudosiu, Jessica Dafflon, Virginia Fernandez, Pedro Sanchez, Julia Wolleb, Pedro F. da Costa, Ashay Patel, Hyungjin Chung, Can Zhao, Wei Peng, Zelong Liu, Xueyan Mei, Oeslle Lucena, Jong Chul Ye, Sotirios A. Tsaftaris, Prerna Dogra, Andrew Feng, Marc Modat, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
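  • for: Provides an open-source platform for training, evaluating and deploying generative models and related applications.
  • methods: Implements several generative architectures, including diffusion models, autoregressive transformers and GANs, in a standardised and generalisable way, and provides pre-trained models.
  • results: The models support applications such as anomaly detection, image-to-image translation, denoising and MRI reconstruction, and their results extend to 2D or 3D scenarios, different imaging modalities (such as CT, MRI and X-Ray data) and different anatomical areas.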
    Abstract Recent advances in generative AI have brought incredible breakthroughs in several areas, including medical imaging. These generative models have tremendous potential not only to help safely share medical data via synthetic datasets but also to perform an array of diverse applications, such as anomaly detection, image-to-image translation, denoising, and MRI reconstruction. However, due to the complexity of these models, their implementation and reproducibility can be difficult. This complexity can hinder progress, act as a use barrier, and dissuade the comparison of new methods with existing works. In this study, we present MONAI Generative Models, a freely available open-source platform that allows researchers and developers to easily train, evaluate, and deploy generative models and related applications. Our platform reproduces state-of-art studies in a standardised way involving different architectures (such as diffusion models, autoregressive transformers, and GANs), and provides pre-trained models for the community. We have implemented these models in a generalisable fashion, illustrating that their results can be extended to 2D or 3D scenarios, including medical images with different modalities (like CT, MRI, and X-Ray data) and from different anatomical areas. Finally, we adopt a modular and extensible approach, ensuring long-term maintainability and the extension of current applications for future features.
    Summary In this study, we present MONAI Generative Models, a freely available open-source platform that allows researchers and developers to easily train, evaluate, and deploy generative models and related applications. Our platform reproduces state-of-the-art studies in a standardized way involving different architectures (such as diffusion models, autoregressive transformers, and GANs), and provides pre-trained models for the community. We have implemented these models in a generalizable fashion, illustrating that their results can be extended to 2D or 3D scenarios, including medical images with different modalities (such as CT, MRI, and X-ray data) and from different anatomical areas. Finally, we adopt a modular and extensible approach, ensuring long-term maintainability and the extension of current applications for future features.

Small, but important: Traffic light proposals for detecting small traffic lights and beyond

  • paper_url: http://arxiv.org/abs/2307.15191
  • repo_url: None
  • paper_authors: Tom Sanitz, Christian Wilms, Simone Frintrop
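  • for: Improving the detection of small traffic lights.
  • methods: Proposes a new traffic light detection system with a novel traffic light proposal generator that draws on general object proposal generation, fine-grained multi-scale features and attention, plus a new detection head for classifying and refining the proposals.
  • results: Evaluated on three publicly available datasets against six methods, the system improves results on small and tiny traffic lights by at least 12.6% and performs strongly across all traffic light sizes.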
    Abstract Traffic light detection is a challenging problem in the context of self-driving cars and driver assistance systems. While most existing systems produce good results on large traffic lights, detecting small and tiny ones is often overlooked. A key problem here is the inherent downsampling in CNNs, leading to low-resolution features for detection. To mitigate this problem, we propose a new traffic light detection system, comprising a novel traffic light proposal generator that utilizes findings from general object proposal generation, fine-grained multi-scale features, and attention for efficient processing. Moreover, we design a new detection head for classifying and refining our proposals. We evaluate our system on three challenging, publicly available datasets and compare it against six methods. The results show substantial improvements of at least $12.6\%$ on small and tiny traffic lights, as well as strong results across all sizes of traffic lights.
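    Summary Traffic light detection is a challenging problem for self-driving cars and driver assistance systems. Most existing systems work well on large traffic lights, but small and tiny ones are often overlooked. A key problem is the inherent downsampling in CNNs, which leaves only low-resolution features for detection. To mitigate this, the authors propose a new traffic light detection system with a novel traffic light proposal generator that uses findings from general object proposal generation, fine-grained multi-scale features, and attention for efficient processing. They also design a new detection head for classifying and refining the proposals. Evaluated on three challenging public datasets against six methods, the system improves results on small and tiny traffic lights by at least 12.6% and achieves strong results across all traffic light sizes.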

EnSolver: Uncertainty-Aware CAPTCHA Solver Using Deep Ensembles

  • paper_url: http://arxiv.org/abs/2307.15180
  • repo_url: https://github.com/hoangcongduc/ensolver
  • paper_authors: Duc C. Hoang, Cuong V. Nguyen, Amin Kharraz
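  • for: Studying solvers for text-based CAPTCHAs, a security mechanism that protects websites from automated bots, in order to understand their failure cases and ultimately make CAPTCHAs more secure.
  • methods: Builds a deep-learning CAPTCHA solver that uses deep ensemble uncertainty estimation to detect and skip out-of-distribution samples.
  • results: Used with object detection models, EnSolver reaches up to 98.1% accuracy when detecting out-of-distribution data and up to 93% success rate when solving in-distribution CAPTCHAs.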
    Abstract The popularity of text-based CAPTCHA as a security mechanism to protect websites from automated bots has prompted researches in CAPTCHA solvers, with the aim of understanding its failure cases and subsequently making CAPTCHAs more secure. Recently proposed solvers, built on advances in deep learning, are able to crack even the very challenging CAPTCHAs with high accuracy. However, these solvers often perform poorly on out-of-distribution samples that contain visual features different from those in the training set. Furthermore, they lack the ability to detect and avoid such samples, making them susceptible to being locked out by defense systems after a certain number of failed attempts. In this paper, we propose EnSolver, a novel CAPTCHA solver that utilizes deep ensemble uncertainty estimation to detect and skip out-of-distribution CAPTCHAs, making it harder to be detected. We demonstrate the use of our solver with object detection models and show empirically that it performs well on both in-distribution and out-of-distribution data, achieving up to 98.1% accuracy when detecting out-of-distribution data and up to 93% success rate when solving in-distribution CAPTCHAs.
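The detect-and-skip idea in the abstract can be illustrated with a minimal sketch. It assumes the ensemble members each return a class distribution for one character and that uncertainty is measured as the entropy of their averaged prediction. The abstract does not say which uncertainty measure EnSolver uses, so the entropy criterion, the function names, and the threshold value are illustrative assumptions only.

```python
import math

def mean_distribution(member_probs):
    """Average the class distributions predicted by each ensemble member."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def solve_or_skip(member_probs, threshold=0.5):
    """Return the predicted class index, or None to skip an uncertain sample.

    `threshold` is a hypothetical value that would be tuned on held-out data.
    """
    avg = mean_distribution(member_probs)
    if entropy(avg) > threshold:
        return None  # treat as out-of-distribution and do not attempt it
    return max(range(len(avg)), key=lambda c: avg[c])

# Members agree: the averaged distribution is peaked, so class 0 is predicted.
agree = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02], [0.96, 0.02, 0.02]]
# Members disagree: the averaged distribution is near uniform, so the sample is skipped.
disagree = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
print(solve_or_skip(agree), solve_or_skip(disagree))
```

Skipping rather than guessing is what keeps the number of failed attempts low, which is the lock-out concern the abstract raises.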

R-LPIPS: An Adversarially Robust Perceptual Similarity Metric

  • paper_url: http://arxiv.org/abs/2307.15157
  • repo_url: https://github.com/saraghazanfari/r-lpips
  • paper_authors: Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, Alexandre Araujo
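  • for: Proposing the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric to address the security concerns that adversarial examples raise for perceptual image similarity evaluation.
  • methods: The metric leverages adversarially trained deep features and is compared with the classical LPIPS metric in a comprehensive set of experiments.
  • results: R-LPIPS withstands adversarial examples better than classical LPIPS, which makes it a safer choice for large-scale applications.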
    Abstract Similarity metrics have played a significant role in computer vision to capture the underlying semantics of images. In recent years, advanced similarity metrics, such as the Learned Perceptual Image Patch Similarity (LPIPS), have emerged. These metrics leverage deep features extracted from trained neural networks and have demonstrated a remarkable ability to closely align with human perception when evaluating relative image similarity. However, it is now well-known that neural networks are susceptible to adversarial examples, i.e., small perturbations invisible to humans crafted to deliberately mislead the model. Consequently, the LPIPS metric is also sensitive to such adversarial examples. This susceptibility introduces significant security concerns, especially considering the widespread adoption of LPIPS in large-scale applications. In this paper, we propose the Robust Learned Perceptual Image Patch Similarity (R-LPIPS) metric, a new metric that leverages adversarially trained deep features. Through a comprehensive set of experiments, we demonstrate the superiority of R-LPIPS compared to the classical LPIPS metric. The code is available at https://github.com/SaraGhazanfari/R-LPIPS.

R-Block: Regularized Block of Dropout for convolutional networks

  • paper_url: http://arxiv.org/abs/2307.15150
  • repo_url: None
  • paper_authors: Liqi Wang, Qiya Hu
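  • for: Proposing a regularization technique for convolutional layers, based on mutual learning, to improve the performance of convolutional networks.
  • methods: R-Block is a mutual learning training strategy that forces the outputs of two sub-models with different drop regions to be consistent with each other. Concretely, it minimizes the losses between the output distributions of the two sub-models for each sample in the training dataset. Two approaches for constructing the sub-models are proposed.
  • results: R-Block performs better than other structured dropout variants, and the proposed sub-model construction approaches outperform the alternatives.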
    Abstract Dropout as a regularization technique is widely used in fully connected layers while is less effective in convolutional layers. Therefore more structured forms of dropout have been proposed to regularize convolutional networks. The disadvantage of these methods is that the randomness introduced causes inconsistency between training and inference. In this paper, we apply a mutual learning training strategy for convolutional layer regularization, namely R-Block, which forces two outputs of the generated difference maximizing sub models to be consistent with each other. Concretely, R-Block minimizes the losses between the output distributions of two sub models with different drop regions for each sample in the training dataset. We design two approaches to construct such sub models. Our experiments demonstrate that R-Block achieves better performance than other existing structured dropout variants. We also demonstrate that our approaches to construct sub models outperforms others.
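A minimal sketch of the consistency term described above, assuming it is a symmetric KL divergence between the softmax outputs of the two sub-models. The abstract only says that R-Block "minimizes the losses between the output distributions", so the choice of symmetric KL and all function names here are assumptions, not the authors' exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two distributions with strictly positive entries in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def consistency_loss(logits_a, logits_b):
    """Symmetric KL between the outputs of two sub-models on the same sample."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))

# Hypothetical logits from two forward passes with different drop regions.
logits_a = [2.0, 0.5, -1.0]
logits_b = [1.5, 0.9, -0.7]
print(consistency_loss(logits_a, logits_b))
```

In training, a term like this would be added to the usual task loss with a weighting coefficient, so that the two dropped sub-models are pushed toward the same prediction for each sample.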

Online Clustered Codebook

  • paper_url: http://arxiv.org/abs/2307.15139
  • repo_url: https://github.com/lyndonzheng/cvq-vae
  • paper_authors: Chuanxia Zheng, Andrea Vedaldi
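  • for: Proposing a simple online codebook learning method that addresses codebook collapse in existing VQ-VAE.
  • methods: Clustering VQ-VAE (CVQ-VAE) selects encoded features as anchors to update the "dead" codevectors, while the codevectors that are alive are optimized with the original loss.
  • results: CVQ-VAE is validated extensively across datasets, tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE, VQGAN, LDM), and can be integrated into existing models with a few lines of code.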
    Abstract Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is increasingly used in representation learning. However, optimizing the codevectors in existing VQ-VAE is not entirely trivial. A problem is codebook collapse, where only a small subset of codevectors receive gradients useful for their optimisation, whereas a majority of them simply ``dies off'' and is never updated or used. This limits the effectiveness of VQ for learning larger codebooks in complex computer vision tasks that require high-capacity representations. In this paper, we present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects encoded features as anchors to update the ``dead'' codevectors, while optimising the codebooks which are alive via the original loss. This strategy brings unused codevectors closer in distribution to the encoded features, increasing the likelihood of being chosen and optimized. We extensively validate the generalization capability of our quantiser on various datasets, tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE, VQGAN, LDM). Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code.
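The anchor-based update can be sketched as follows. This is a deliberately simplified version that declares a codevector dead if no feature in the current batch selects it and then moves it onto a randomly sampled encoded feature. How CVQ-VAE actually tracks usage and chooses anchors is not described in the abstract, so the single-batch criterion, the random sampling, and the function names are assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(features, codebook):
    """Index of the nearest codevector for every encoded feature."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def reanchor_dead(features, codebook, rng):
    """Replace codevectors that no feature selected with sampled features."""
    used = set(assign(features, codebook))
    new_codebook = []
    for k, code in enumerate(codebook):
        if k in used:
            new_codebook.append(list(code))  # alive: left to the original loss
        else:
            new_codebook.append(list(rng.choice(features)))  # dead: moved onto an anchor
    return new_codebook

rng = random.Random(0)
features = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 0.9]]
codebook = [[0.0, 0.0], [5.0, 5.0], [-4.0, 7.0]]  # the last two start far from the data
updated = reanchor_dead(features, codebook, rng)
print(len(set(assign(features, codebook))), len(set(assign(features, updated))))
```

Before the update only one codevector is ever selected. After it, the re-anchored codevectors sit inside the feature distribution, so they can be chosen and then optimized like any other.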

Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2307.15131
  • repo_url: https://github.com/windingwind/seal-3d
  • paper_authors: Xiangyu Wang, Jingsen Zhu, Qi Ye, Yuchi Huo, Yunlong Ran, Zhihua Zhong, Jiming Chen
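  • for: Providing an interactive editing method for implicit neural representations (NeRF) that overcomes the limited flexibility, quality, and speed of earlier NeRF editing approaches.
  • methods: A proxy function maps the editing instructions to the original space of the NeRF model, and a teacher-student training strategy with local pretraining and global finetuning lets the model respond directly to edits and preview them instantly.
  • results: The system supports a variety of editing types and achieves compelling editing effects at an interactive speed of about 1 second.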
    Abstract With the popularity of implicit neural representations, or neural radiance fields (NeRF), there is a pressing need for editing methods to interact with the implicit 3D models for tasks like post-processing reconstructed scenes and 3D content creation. While previous works have explored NeRF editing from various perspectives, they are restricted in editing flexibility, quality, and speed, failing to offer direct editing response and instant preview. The key challenge is to conceive a locally editable neural representation that can directly reflect the editing instructions and update instantly. To bridge the gap, we propose a new interactive editing method and system for implicit representations, called Seal-3D, which allows users to edit NeRF models in a pixel-level and free manner with a wide range of NeRF-like backbone and preview the editing effects instantly. To achieve the effects, the challenges are addressed by our proposed proxy function mapping the editing instructions to the original space of NeRF models and a teacher-student training strategy with local pretraining and global finetuning. A NeRF editing system is built to showcase various editing types. Our system can achieve compelling editing effects with an interactive speed of about 1 second.

End-to-end Remote Sensing Change Detection of Unregistered Bi-temporal Images for Natural Disasters

  • paper_url: http://arxiv.org/abs/2307.15128
  • repo_url: None
  • paper_authors: Guiqin Zhao, Lianlei Shan, Weiqiang Wang
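  • for: Detecting damaged buildings in natural-disaster areas from remote sensing images.
  • methods: Introduces xBD-E2ECD, an unregistered synthetic change detection dataset, and proposes E2ECDNet, an end-to-end change detection network that takes an unregistered bi-temporal image pair as input and predicts the flow field and the change detection result simultaneously.
  • results: Experiments show significant improvements; E2ECDNet also supports registered image pairs, and the paper introduces neighborhood-based change detection evaluation metrics.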
    Abstract Change detection based on remote sensing images has been a prominent area of interest in the field of remote sensing. Deep networks have demonstrated significant success in detecting changes in bi-temporal remote sensing images and have found applications in various fields. Given the degradation of natural environments and the frequent occurrence of natural disasters, accurately and swiftly identifying damaged buildings in disaster-stricken areas through remote sensing images holds immense significance. This paper aims to investigate change detection specifically for natural disasters. Considering that existing public datasets used in change detection research are registered, which does not align with the practical scenario where bi-temporal images are not matched, this paper introduces an unregistered end-to-end change detection synthetic dataset called xBD-E2ECD. Furthermore, we propose an end-to-end change detection network named E2ECDNet, which takes an unregistered bi-temporal image pair as input and simultaneously generates the flow field prediction result and the change detection prediction result. It is worth noting that our E2ECDNet also supports change detection for registered image pairs, as registration can be seen as a special case of non-registration. Additionally, this paper redefines the criteria for correctly predicting a positive case and introduces neighborhood-based change detection evaluation metrics. The experimental results have demonstrated significant improvements.

To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.15063
  • repo_url: https://github.com/MarcBotet/hamlet
  • paper_authors: Marc Botet Colomer, Pier Luigi Dovesi, Theodoros Panagiotakopoulos, Joao Frederico Carvalho, Linus Härenstam-Nielsen, Hossein Azizpour, Hedvig Kjellström, Daniel Cremers, Matteo Poggi
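  • for: Online domain adaptation for semantic segmentation, handling unforeseeable domain changes that occur during deployment, such as sudden weather events.
  • methods: Proposes HAMLET, a Hardware-Aware Modular Least Expensive Training framework, which combines a hardware-aware back-propagation orchestration agent (HAMT) with a dedicated domain-shift detector (LT) that controls when and how the model is adapted.
  • results: The approach performs semantic segmentation while simultaneously adapting at more than 29 FPS on a single consumer-grade GPU, and shows an encouraging accuracy and speed trade-off on the OnDA and SHIFT benchmarks.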
    Abstract The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29FPS on a single consumer-grade GPU. Our framework's encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.

Self-Supervised Visual Acoustic Matching

  • paper_url: http://arxiv.org/abs/2307.15064
  • repo_url: None
  • paper_authors: Arjun Somayazulu, Changan Chen, Kristen Grauman
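  • for: Visual acoustic matching, i.e. re-synthesizing an audio clip so that it sounds as if it were recorded in a target acoustic environment.
  • methods: Proposes a self-supervised approach whose training samples include only the target scene image and audio, without acoustically mismatched source audio for reference. A conditional GAN framework and a novel metric that quantifies the residual acoustic information in the de-biased audio are used to jointly disentangle room acoustics and re-synthesize audio.
  • results: Trained with either in-the-wild web data or simulated data, the method outperforms the state of the art on multiple challenging datasets and a wide variety of real-world audio and environments.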
    Abstract Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.

The RoboDepth Challenge: Methods and Advancements Towards Robust Depth Estimation

  • paper_url: http://arxiv.org/abs/2307.15061
  • repo_url: https://github.com/ldkong1205/robodepth
  • paper_authors: Lingdong Kong, Yaru Niu, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Ding Zhao, Liangjun Zhang, Hesheng Wang, Wei Tsang Ooi, Ruijie Zhu, Ziyang Song, Li Liu, Tianzhu Zhang, Jun Yu, Mohan Jing, Pengwei Li, Xiaohua Qi, Cheng Jin, Yingfeng Chen, Jie Hou, Jie Zhang, Zhen Kan, Qiang Ling, Liang Peng, Minglei Li, Di Xu, Changpeng Yang, Yuanqi Yao, Gang Wu, Jian Kuai, Xianming Liu, Junjun Jiang, Jiamian Huang, Baojun Li, Jiale Chen, Shuang Zhang, Sun Ao, Zhenyu Li, Runze Chen, Haiyong Luo, Fang Zhao, Jingze Yu
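  • for: Improving the reliability of depth estimation for safety-critical applications under out-of-distribution conditions such as adverse weather, sensor failure, and noise contamination.
  • methods: Summarizes the winning solutions of the RoboDepth Challenge, whose designs cover spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement.
  • results: Out of more than two hundred participants, nine unique top-performing solutions emerged across two tracks (robust self-supervised and robust fully-supervised depth estimation) built on the KITTI-C and NYUDepth2-C benchmarks. The datasets, competition toolkit, workshop recordings, and winning teams' source code are publicly available.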
    Abstract Accurate depth estimation under out-of-distribution (OoD) scenarios, such as adverse weather conditions, sensor failure, and noise contamination, is desirable for safety-critical applications. Existing depth estimation systems, however, suffer inevitably from real-world corruptions and perturbations and are struggled to provide reliable depth predictions under such cases. In this paper, we summarize the winning solutions from the RoboDepth Challenge -- an academic competition designed to facilitate and advance robust OoD depth estimation. This challenge was developed based on the newly established KITTI-C and NYUDepth2-C benchmarks. We hosted two stand-alone tracks, with an emphasis on robust self-supervised and robust fully-supervised depth estimation, respectively. Out of more than two hundred participants, nine unique and top-performing solutions have appeared, with novel designs ranging from the following aspects: spatial- and frequency-domain augmentations, masked image modeling, image restoration and super-resolution, adversarial training, diffusion-based noise suppression, vision-language pre-training, learned model ensembling, and hierarchical feature enhancement. Extensive experimental analyses along with insightful observations are drawn to better understand the rationale behind each design. We hope this challenge could lay a solid foundation for future research on robust and reliable depth estimation and beyond. The datasets, competition toolkit, workshop recordings, and source code from the winning teams are publicly available on the challenge website.

MARS: An Instance-aware, Modular and Realistic Simulator for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2307.15058
  • repo_url: https://github.com/open-air-sun/mars
  • paper_authors: Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, Yuxin Huang, Xiaoyu Ye, Zike Yan, Yongliang Shi, Yiyi Liao, Hao Zhao
  • for: The paper proposes an autonomous driving simulator based on neural radiance fields (NeRFs) to solve remaining corner cases and improve realism in simulation.
  • methods: The simulator models foreground instances and background environments separately with independent networks, and allows flexible switching between different NeRF-related backbones, sampling strategies, input modalities, etc.
  • results: The simulator achieves state-of-the-art photo-realism results and will be open-sourced, while most counterparts are not.
    Abstract Nowadays, autonomous cars can drive smoothly in ordinary cases, and it is widely recognized that realistic sensor simulation will play a critical role in solving remaining corner cases by simulating them. To this end, we propose an autonomous driving simulator based upon neural radiance fields (NeRFs). Compared with existing works, ours has three notable features: (1) Instance-aware. Our simulator models the foreground instances and background environments separately with independent networks so that the static (e.g., size and appearance) and dynamic (e.g., trajectory) properties of instances can be controlled separately. (2) Modular. Our simulator allows flexible switching between different modern NeRF-related backbones, sampling strategies, input modalities, etc. We expect this modular design to boost academic progress and industrial deployment of NeRF-based autonomous driving simulation. (3) Realistic. Our simulator set new state-of-the-art photo-realism results given the best module selection. Our simulator will be open-sourced while most of our counterparts are not. Project page: https://open-air-sun.github.io/mars/.

PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

  • paper_url: http://arxiv.org/abs/2307.15055
  • repo_url: https://github.com/y-zheng18/point_odyssey
  • paper_authors: Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, Leonidas J. Guibas
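  • for: Advancing the state of the art in long-term fine-grained tracking, with emphasis on long videos with naturalistic motion.
  • methods: Deformable characters are animated with real-world motion capture data, 3D scenes are built to match the motion capture environments, and camera viewpoints are rendered along trajectories mined via structure-from-motion on real videos. Character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects are randomized.
  • results: Existing methods trained from scratch on PointOdyssey outperform their published variants. The authors also modify the PIPs point tracking method to greatly widen its temporal receptive field, which improves its performance on PointOdyssey and on two real-world benchmarks.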
    Abstract We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos. We create combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Our dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work. We show that existing methods can be trained from scratch in our dataset and outperform the published variants. Finally, we introduce modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks. Our data and code are publicly available at: https://pointodyssey.com

Learning Depth Estimation for Transparent and Mirror Surfaces

  • paper_url: http://arxiv.org/abs/2307.15052
  • repo_url: None
  • paper_authors: Alex Costanzino, Pierluigi Zama Ramirez, Matteo Poggi, Fabio Tosi, Stefano Mattoccia, Luigi Di Stefano
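  • for: Estimating the depth of transparent or mirror (ToM) surfaces, a hard challenge for sensors, algorithms, and deep networks alike.
  • methods: Proposes a simple pipeline for learning to estimate the depth of ToM surfaces with neural networks, without any ground-truth annotation. Reliable pseudo labels are obtained by in-painting ToM objects in images and processing them with a monocular depth estimation model. These labels are then used to fine-tune existing monocular or stereo networks so that they learn to handle ToM surfaces.
  • results: Experiments on the Booster dataset show dramatic improvements from this simple proposal.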
    Abstract Inferring the depth of transparent or mirror (ToM) surfaces represents a hard challenge for either sensors, algorithms, or deep networks. We propose a simple pipeline for learning to estimate depth properly for such surfaces with neural networks, without requiring any ground-truth annotation. We unveil how to obtain reliable pseudo labels by in-painting ToM objects in images and processing them with a monocular depth estimation model. These labels can be used to fine-tune existing monocular or stereo networks, to let them learn how to deal with ToM surfaces. Experimental results on the Booster dataset show the dramatic improvements enabled by our remarkably simple proposal.
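The pseudo-labelling step can be sketched on a toy grayscale image. The sketch assumes that in-painting means overwriting the masked ToM pixels with a flat value, and it uses a trivial stand-in for the monocular depth network. Both the fill strategy and `toy_depth_model` are hypothetical placeholders, since the abstract does not specify either.

```python
def inpaint(image, mask, fill_value=0.5):
    """Overwrite masked (ToM) pixels with a flat value so they look opaque."""
    return [
        [fill_value if is_tom else pixel for pixel, is_tom in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]

def make_pseudo_labels(image, mask, depth_model, fill_value=0.5):
    """Run a monocular depth model on the in-painted image."""
    return depth_model(inpaint(image, mask, fill_value))

def toy_depth_model(image):
    """Hypothetical stand-in for a pre-trained monocular depth network."""
    return [[1.0 - pixel for pixel in row] for row in image]

image = [[0.2, 0.9], [0.4, 0.9]]        # right column: a mirror showing a bright reflection
mask = [[False, True], [False, True]]   # True marks ToM pixels
labels = make_pseudo_labels(image, mask, toy_depth_model)
print(labels)
```

The depth model never sees the misleading reflection, so its output on the masked region describes the surface itself. Those outputs would then serve as targets when fine-tuning a network on the original, unmodified image.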

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models

  • paper_url: http://arxiv.org/abs/2307.15049
  • repo_url: None
  • paper_authors: Kecheng Zheng, Wei Wu, Ruili Feng, Kai Zhu, Jiawei Liu, Deli Zhao, Zheng-Jun Zha, Wei Chen, Yujun Shen
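  • for: Transferring pre-trained vision-language models (VLMs) to various downstream tasks.
  • methods: Proposes regularized mask tuning, which masks network parameters through a learnable selection. Inspired by neural pathways, the authors argue that the knowledge a downstream task needs already exists in the pre-trained weights but is concealed during upstream pre-training. They identify the parameters that matter for a given downstream task, attach a binary mask to each, and optimize the masks on downstream data with the parameters frozen. A novel gradient dropout strategy regularizes the parameter selection when updating the masks, to prevent forgetting old knowledge and overfitting the downstream data.
  • results: Experiments on 11 datasets show consistent superiority over previous alternatives. Masking an average of only 2.56% of the parameters yields an 18.73% performance improvement over zero-shot CLIP. The method is also synergistic with most existing parameter-efficient tuning methods and can boost their performance. Project page: https://wuw2019.github.io/R-AMT/.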
    Abstract Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Project page can be found here (https://wuw2019.github.io/R-AMT/).
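A minimal sketch of mask tuning with frozen weights. It assumes each binary mask is obtained by thresholding a learnable real-valued score, and that gradient dropout means randomly skipping the update of individual scores. The abstract names the gradient dropout strategy without defining it, so this random-skip rule, the thresholding, and the function names are illustrative assumptions.

```python
import random

def binarize(scores, threshold=0.0):
    """Turn real-valued mask scores into a 0/1 mask."""
    return [1.0 if s > threshold else 0.0 for s in scores]

def masked_weights(weights, scores):
    """Effective weights: frozen pre-trained values times the binary mask."""
    return [w * m for w, m in zip(weights, binarize(scores))]

def update_scores(scores, grads, lr, drop_prob, rng):
    """One SGD step on the mask scores; each gradient entry is dropped with drop_prob."""
    new_scores = []
    for s, g in zip(scores, grads):
        keep = rng.random() >= drop_prob
        new_scores.append(s - lr * g if keep else s)
    return new_scores

weights = [0.8, 1.2, 0.3, 2.0]   # frozen pre-trained weights, never updated
scores = [0.5, 0.5, 0.5, 0.5]    # every mask starts switched on
grads = [0.0, 9.0, 0.0, 0.0]     # hypothetical gradient: weight 1 hurts the task
rng = random.Random(0)
new_scores = update_scores(scores, grads, lr=0.1, drop_prob=0.0, rng=rng)
print(masked_weights(weights, new_scores))
```

Only the scores change, so the pre-trained values stay intact and a masked-out weight can be restored later if its score rises again. Dropping part of the gradient slows how quickly the selection drifts toward the downstream data.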

A Transformer-based Approach for Arabic Offline Handwritten Text Recognition

  • paper_url: http://arxiv.org/abs/2307.15045
  • repo_url: None
  • paper_authors: Saleh Momeni, Bagher BabaAli
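  • for: Improving the accuracy of offline Arabic handwritten text recognition.
  • methods: Introduces two architectures, the Transformer Transducer and the standard sequence-to-sequence Transformer, and compares them in terms of accuracy and speed. Both rely only on the attention mechanism, which lets them model language dependencies and makes them more parallelizable. Pre-trained Transformers are used for both image understanding and language modeling.
  • results: Evaluated on the Arabic KHATT dataset, the proposed method outperforms the current state-of-the-art approaches.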
    Abstract Handwriting recognition is a challenging and critical problem in the fields of pattern recognition and machine learning, with applications spanning a wide range of domains. In this paper, we focus on the specific issue of recognizing offline Arabic handwritten text. Existing approaches typically utilize a combination of convolutional neural networks for image feature extraction and recurrent neural networks for temporal modeling, with connectionist temporal classification used for text generation. However, these methods suffer from a lack of parallelization due to the sequential nature of recurrent neural networks. Furthermore, these models cannot account for linguistic rules, necessitating the use of an external language model in the post-processing stage to boost accuracy. To overcome these issues, we introduce two alternative architectures, namely the Transformer Transducer and the standard sequence-to-sequence Transformer, and compare their performance in terms of accuracy and speed. Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex. We employ pre-trained Transformers for both image understanding and language modeling. Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.

TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis

  • paper_url: http://arxiv.org/abs/2307.15042
  • repo_url: None
  • paper_authors: Zihan Zhang, Richard Liu, Kfir Aberman, Rana Hanocka
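  • for: Proposes a diffusion-based model for motion-sequence synthesis, aimed at long-term motion synthesis.
  • methods: Takes the core idea of Denoising Diffusion Probabilistic Models (DDPM), gradual denoising along a diffusion time-axis, and applies it along the temporal axis of the motion sequence, so that the noise level varies over time within the sequence.
  • results: The framework auto-regressively produces an arbitrarily long stream of clean motion frames, with applications to character animation and other domains.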
    Abstract The gradual nature of a diffusion process that synthesizes samples in small increments constitutes a key ingredient of Denoising Diffusion Probabilistic Models (DDPM), which have presented unprecedented quality in image synthesis and been recently explored in the motion domain. In this work, we propose to adapt the gradual diffusion concept (operating along a diffusion time-axis) into the temporal-axis of the motion sequence. Our key idea is to extend the DDPM framework to support temporally varying denoising, thereby entangling the two axes. Using our special formulation, we iteratively denoise a motion buffer that contains a set of increasingly-noised poses, which auto-regressively produces an arbitrarily long stream of frames. With a stationary diffusion time-axis, in each diffusion step we increment only the temporal-axis of the motion such that the framework produces a new, clean frame which is removed from the beginning of the buffer, followed by a newly drawn noise vector that is appended to it. This new mechanism paves the way towards a new framework for long-term motion synthesis with applications to character animation and other domains.
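    Summary The gradual nature of a diffusion process, which synthesizes samples in small increments, is a key ingredient of Denoising Diffusion Probabilistic Models (DDPM). These models have delivered unprecedented quality in image synthesis and have recently been explored in the motion domain. This work adapts the gradual diffusion concept, which operates along a diffusion time-axis, to the temporal axis of the motion sequence. The key idea is to extend the DDPM framework to support temporally varying denoising, thereby entangling the two axes. With this formulation, a motion buffer holding a set of increasingly noised poses is denoised iteratively, which auto-regressively produces an arbitrarily long stream of frames. With a stationary diffusion time-axis, each diffusion step advances only the temporal axis of the motion: the framework produces a new clean frame, which is removed from the beginning of the buffer, and a newly drawn noise vector is appended to the buffer. This mechanism paves the way towards a new framework for long-term motion synthesis, with applications to character animation and other domains.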
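    The buffer mechanics described in the TEDi abstract can be sketched as a toy loop. This is only a schematic under stated assumptions: `toy_denoise` is a hypothetical stand-in for the learned denoiser (it simply shrinks each pose towards a zero "clean" pose), the buffer length and pose dimension are arbitrary, and the buffer is initialised from pure noise for simplicity. None of these details come from the paper.

```python
import random
from collections import deque

BUFFER_LEN = 4  # number of poses held in the buffer (assumed value)
POSE_DIM = 3    # toy pose dimensionality (assumed value)


def fresh_noise(rng):
    """Draw a fully noised pose vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(POSE_DIM)]


def toy_denoise(pose, level):
    """Hypothetical denoiser: removes one noise level from a pose.

    In this toy the clean pose is the zero vector, so level 1 -> 0 yields zeros.
    """
    scale = (level - 1) / level
    return [x * scale for x in pose]


def init_buffer(rng):
    """Entry i carries noise level i + 1, so noise increases along the buffer."""
    return deque((fresh_noise(rng), level) for level in range(1, BUFFER_LEN + 1))


def step(buffer, rng):
    """One step: denoise all entries, pop the clean frame, append new noise."""
    denoised = deque((toy_denoise(pose, level), level - 1)
                     for pose, level in buffer)
    clean_pose, level = denoised.popleft()  # the first entry is now clean
    assert level == 0
    denoised.append((fresh_noise(rng), BUFFER_LEN))  # newly drawn noise vector
    return clean_pose, denoised


def generate(num_frames, seed=0):
    """Auto-regressively emit num_frames clean frames from the rolling buffer."""
    rng = random.Random(seed)
    buffer = init_buffer(rng)
    frames = []
    for _ in range(num_frames):
        frame, buffer = step(buffer, rng)
        frames.append(frame)
    return frames, buffer


frames, final_buffer = generate(10)
```

    After every step the buffer again holds noise levels 1 through `BUFFER_LEN`, so the loop can run for any number of frames.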

Detecting Morphing Attacks via Continual Incremental Training

  • paper_url: http://arxiv.org/abs/2307.15105
  • repo_url: None
  • paper_authors: Lorenzo Pellegrini, Guido Borghi, Annalisa Franco, Davide Maltoni
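  • for: Addresses settings in which restrictions on data transfer and storage make it hard to compose a single dataset for batch-based training.
  • methods: Evaluates different Continual Learning (CL) methods across data sources, simulating a model that is updated every time a new chunk of data becomes available.
  • results: Learning without Forgetting (LwF) is one of the best-performing methods in this scenario; its usage and parametrization are then studied in Morphing Attack Detection and Object Classification tasks.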
    Abstract Scenarios in which restrictions in data transfer and storage limit the possibility to compose a single dataset -- also exploiting different data sources -- to perform a batch-based training procedure, make the development of robust models particularly challenging. We hypothesize that the recent Continual Learning (CL) paradigm may represent an effective solution to enable incremental training, even through multiple sites. Indeed, a basic assumption of CL is that once a model has been trained, old data can no longer be used in successive training iterations and in principle can be deleted. Therefore, in this paper, we investigate the performance of different Continual Learning methods in this scenario, simulating a learning model that is updated every time a new chunk of data, even of variable size, is available. Experimental results reveal that a particular CL method, namely Learning without Forgetting (LwF), is one of the best-performing algorithms. Then, we investigate its usage and parametrization in Morphing Attack Detection and Object Classification tasks, specifically with respect to the amount of new training data that became available.
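    Summary Restrictions on data transfer and storage can make it impossible to compose a single dataset, possibly drawing on different data sources, for batch-based training, which makes developing robust models particularly challenging. The authors hypothesize that the recent Continual Learning (CL) paradigm may be an effective way to enable incremental training, even across multiple sites. A basic assumption of CL is that once a model has been trained, old data can no longer be used in later training iterations and can in principle be deleted. The paper therefore investigates how different CL methods perform in this scenario, simulating a model that is updated every time a new chunk of data, possibly of variable size, becomes available. Experimental results show that one CL method, Learning without Forgetting (LwF), is among the best-performing algorithms. Its usage and parametrization are then investigated in Morphing Attack Detection and Object Classification tasks, specifically with respect to the amount of new training data that becomes available.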
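    For readers unfamiliar with Learning without Forgetting, its usual objective combines a standard cross-entropy term on the new data with a distillation term that keeps the updated model's outputs close to those of the previous model, so old data need not be stored. The following is a minimal single-example sketch of that generic objective; the temperature, the weight `lam`, and the function names are illustrative assumptions and not the paper's settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Standard classification loss on the new chunk of data."""
    return -math.log(softmax(logits)[target_index])


def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Cross-entropy between softened outputs of the old and updated models."""
    old_p = softmax(old_logits, temperature)
    new_p = softmax(new_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(old_p, new_p))


def lwf_loss(new_logits, old_logits, target_index, lam=1.0, temperature=2.0):
    """New-task loss plus a weighted penalty for drifting from the old model."""
    return (cross_entropy(new_logits, target_index)
            + lam * distillation_loss(new_logits, old_logits, temperature))


old_out = [2.0, 0.5, -1.0]   # outputs recorded from the previous model
drifted = [0.0, 0.0, 0.0]    # outputs of an updated model that forgot
```

    The weight `lam` trades plasticity on the new chunk against stability on what was learned before.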

Diverse Inpainting and Editing with GAN Inversion

  • paper_url: http://arxiv.org/abs/2307.15033
  • repo_url: None
  • paper_authors: Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, Aysegul Dundar
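  • for: Inverting erased images into StyleGAN's latent space to obtain realistic, diverse inpaintings and edits.
  • methods: An encoder and a mixing network combine encoded features of the erased image with StyleGAN's mapped features from random samples; the networks are trained on generated data through a novel set-up.
  • results: Compared with state-of-the-art inversion and inpainting methods, the approach shows significant improvements in both metrics and visual comparisons.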
    Abstract Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN's latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements.
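    Summary Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and that numerous edits can then be applied, thanks to the semantically rich feature representations of well-trained GAN models. Extensive research has also shown that image inversion is difficult because of the trade-off between high-fidelity reconstruction and editability. This paper tackles an even harder task: inverting erased images into the GAN's latent space for realistic inpainting and editing. By augmenting the inverted latent codes with different latent samples, the method also achieves diverse inpaintings. Specifically, the authors learn an encoder and a mixing network that combine encoded features of the erased image with StyleGAN's mapped features from random samples. To encourage the mixing network to use both inputs, the networks are trained on generated data through a novel set-up. Higher-rate features are used to prevent color inconsistencies between the inpainted and unerased regions. Extensive experiments compare the method with state-of-the-art inversion and inpainting methods, and both metrics and visual comparisons show significant improvements.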

Adaptive Segmentation Network for Scene Text Detection

  • paper_url: http://arxiv.org/abs/2307.15029
  • repo_url: None
  • paper_authors: Guiqin Zhao
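  • for: Improving scene text detectors by removing tedious manual threshold tuning and handling text instances with extreme aspect ratios.
  • methods: Learns the segmentation threshold that separates text pixels from background pixels automatically, and uses a Global-information Enhanced Feature Pyramid Network (GE-FPN) to capture text instances with macro size and extreme aspect ratios. A cascade optimization structure then further refines the text instances.
  • results: With the proposed threshold learning strategy and detection structure, ASNet achieves state-of-the-art performance on four text detection benchmarks, and ablation experiments verify the effectiveness of each contribution.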
    Abstract Inspired by deep convolution segmentation algorithms, scene text detectors break the performance ceiling of datasets steadily. However, these methods often encounter threshold selection bottlenecks and have poor performance on text instances with extreme aspect ratios. In this paper, we propose to automatically learn the discriminate segmentation threshold, which distinguishes text pixels from background pixels for segmentation-based scene text detectors and then further reduces the time-consuming manual parameter adjustment. Besides, we design a Global-information Enhanced Feature Pyramid Network (GE-FPN) for capturing text instances with macro size and extreme aspect ratios. Following the GE-FPN, we introduce a cascade optimization structure to further refine the text instances. Finally, together with the proposed threshold learning strategy and text detection structure, we design an Adaptive Segmentation Network (ASNet) for scene text detection. Extensive experiments are carried out to demonstrate that the proposed ASNet can achieve the state-of-the-art performance on four text detection benchmarks, i.e., ICDAR 2015, MSRA-TD500, ICDAR 2017 MLT and CTW1500. The ablation experiments also verify the effectiveness of our contributions.
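    Summary Inspired by deep convolutional segmentation algorithms, scene text detectors keep raising the performance ceiling on benchmark datasets. However, these methods often hit a threshold-selection bottleneck and perform poorly on text instances with extreme aspect ratios. This paper proposes to learn the segmentation threshold that separates text pixels from background pixels automatically, which reduces time-consuming manual parameter adjustment. The authors also design a Global-information Enhanced Feature Pyramid Network (GE-FPN) to capture text instances with macro size and extreme aspect ratios, followed by a cascade optimization structure that further refines the text instances. Combining the threshold learning strategy with this detection structure yields the Adaptive Segmentation Network (ASNet) for scene text detection. Extensive experiments show that ASNet achieves state-of-the-art performance on four text detection benchmarks: ICDAR 2015, MSRA-TD500, ICDAR 2017 MLT and CTW1500. Ablation experiments also verify the effectiveness of the contributions.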
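    The abstract does not say how ASNet learns its threshold, so the following is not the paper's method. It is only a generic toy illustration of what "learning a segmentation threshold" can mean: hard thresholding is replaced by a steep sigmoid so that a scalar threshold `t` can be fitted by gradient descent on a binary cross-entropy loss. The scores, labels, steepness, and learning rate are all made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def learn_threshold(scores, labels, steepness=20.0, lr=0.01, steps=500, init=0.9):
    """Fit a scalar threshold t by gradient descent on a soft binarization.

    Soft mask: b = sigmoid(steepness * (p - t)); loss: mean binary cross-entropy.
    With logit z = steepness * (p - t), dL/dz = b - y and dz/dt = -steepness.
    """
    t = init
    n = len(scores)
    for _ in range(steps):
        grad = 0.0
        for p, y in zip(scores, labels):
            b = sigmoid(steepness * (p - t))
            grad += -steepness * (b - y)
        t -= lr * grad / n
    return t


# Toy per-pixel text scores: four text pixels (label 1), four background (label 0).
scores = [0.55, 0.60, 0.70, 0.80, 0.10, 0.20, 0.30, 0.45]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
threshold = learn_threshold(scores, labels)
predicted = [1 if p > threshold else 0 for p in scores]
```

    Starting from a deliberately poor initial value of 0.9, the fitted threshold settles between the highest background score and the lowest text score, which is the kind of value a practitioner would otherwise pick by hand.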

Self-Supervised Graph Transformer for Deepfake Detection

  • paper_url: http://arxiv.org/abs/2307.15019
  • repo_url: None
  • paper_authors: Aminollah Khormali, Jiann-Shiun Yuan
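  • for: Proposes a deepfake detection framework that generalizes to unseen samples and withstands common post-processing perturbations.
  • methods: The framework comprises a feature extractor based on the vision Transformer architecture, pre-trained with self-supervised contrastive learning; a graph convolution network coupled with a Transformer discriminator; and a graph Transformer relevancy map that highlights manipulated regions.
  • results: In challenging experiments covering in-distribution performance, cross-dataset and cross-manipulation generalization, and robustness to common post-production perturbations, the framework surpasses current state-of-the-art approaches.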
    Abstract Deepfake detection methods have shown promising results in recognizing forgeries within a given dataset, where training and testing take place on the in-distribution dataset. However, their performance deteriorates significantly when presented with unseen samples. As a result, a reliable deepfake detection system must remain impartial to forgery types, appearance, and quality for guaranteed generalizable detection performance. Despite various attempts to enhance cross-dataset generalization, the problem remains challenging, particularly when testing against common post-processing perturbations, such as video compression or blur. Hence, this study introduces a deepfake detection framework, leveraging a self-supervised pre-training model that delivers exceptional generalization ability, withstanding common corruptions and enabling feature explainability. The framework comprises three key components: a feature extractor based on vision Transformer architecture that is pre-trained via self-supervised contrastive learning methodology, a graph convolution network coupled with a Transformer discriminator, and a graph Transformer relevancy map that provides a better understanding of manipulated regions and further explains the model's decision. To assess the effectiveness of the proposed framework, several challenging experiments are conducted, including in-data distribution performance, cross-dataset, cross-manipulation generalization, and robustness against common post-production perturbations. The results achieved demonstrate the remarkable effectiveness of the proposed deepfake detection framework, surpassing the current state-of-the-art approaches.
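    Summary Deepfake detection methods have shown promising results at recognizing forgeries within a given dataset, where training and testing both use in-distribution data, but their performance deteriorates significantly on unseen samples. A reliable deepfake detection system must therefore remain impartial to forgery type, appearance, and quality to guarantee generalizable detection performance. Despite various attempts to improve cross-dataset generalization, the problem remains challenging, particularly under common post-processing perturbations such as video compression or blur. This study introduces a deepfake detection framework built on a self-supervised pre-trained model that generalizes exceptionally well, withstands common corruptions, and enables feature explainability. The framework has three key components: a feature extractor based on the vision Transformer architecture, pre-trained with self-supervised contrastive learning; a graph convolution network coupled with a Transformer discriminator; and a graph Transformer relevancy map that gives a better understanding of manipulated regions and further explains the model's decision. To assess the framework, several challenging experiments are conducted, covering in-distribution performance, cross-dataset and cross-manipulation generalization, and robustness against common post-production perturbations. The results demonstrate the effectiveness of the proposed framework, which surpasses current state-of-the-art approaches.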

Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability

  • paper_url: http://arxiv.org/abs/2307.15007
  • repo_url: None
  • paper_authors: Usha Bhalla, Suraj Srinivas, Himabindu Lakkaraju
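  • for: Explaining the behaviour of machine learning models with feature attributions that are faithful and verifiable.
  • methods: Prior work follows two broad strategies. Post hoc explanation methods can explain complex black-box models, but their explanations may be unfaithful and cannot be verified. Inherently interpretable models encode explanations into the model architecture, so their explanations are naturally faithful and verifiable, but their limited expressive power often leads to poor predictive performance. The paper proposes Verifiability Tuning (VerT), which transforms black-box models into models that yield faithful and verifiable feature attributions.
  • results: Experiments on semi-synthetic and real-world datasets show that VerT produces models whose explanations are correct and verifiable, and that these models are faithful to the original black-box models they are meant to explain.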
    Abstract With the increased deployment of machine learning models in various real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. To this end, two broad strategies have been outlined in prior literature to explain models. Post hoc explanation methods explain the behaviour of complex black-box models by highlighting features that are critical to model predictions; however, prior work has shown that these explanations may not be faithful, and even more concerning is our inability to verify them. Specifically, it is nontrivial to evaluate if a given attribution is correct with respect to the underlying model. Inherently interpretable models, on the other hand, circumvent these issues by explicitly encoding explanations into model architecture, meaning their explanations are naturally faithful and verifiable, but they often exhibit poor predictive performance due to their limited expressive power. In this work, we aim to bridge the gap between the aforementioned strategies by proposing Verifiability Tuning (VerT), a method that transforms black-box models into models that naturally yield faithful and verifiable feature attributions. We begin by introducing a formal theoretical framework to understand verifiability and show that attributions produced by standard models cannot be verified. We then leverage this framework to propose a method to build verifiable models and feature attributions out of fully trained black-box models. Finally, we perform extensive experiments on semi-synthetic and real-world datasets, and show that VerT produces models that (1) yield explanations that are correct and verifiable and (2) are faithful to the original black-box models they are meant to explain.
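    Summary As machine learning models are increasingly deployed in real-world applications, researchers and practitioners alike have emphasized the need for explanations of model behaviour. Prior literature outlines two broad strategies. Post hoc explanation methods explain complex black-box models by highlighting the features that are critical to their predictions, but prior work has shown that these explanations may not be faithful, and, more worryingly, they cannot be verified: it is nontrivial to evaluate whether a given attribution is correct with respect to the underlying model. Inherently interpretable models avoid these issues by explicitly encoding explanations into the model architecture, so their explanations are naturally faithful and verifiable, but they often show poor predictive performance because of their limited expressive power. This work aims to bridge the gap between the two strategies with Verifiability Tuning (VerT), a method that transforms black-box models into models that naturally yield faithful and verifiable feature attributions. The authors first introduce a formal theoretical framework for verifiability and show that attributions produced by standard models cannot be verified. They then use this framework to build verifiable models and feature attributions out of fully trained black-box models. Extensive experiments on semi-synthetic and real-world datasets show that VerT produces models that (1) yield explanations that are correct and verifiable and (2) are faithful to the original black-box models they are meant to explain.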

MapNeRF: Incorporating Map Priors into Neural Radiance Fields for Driving View Simulation

  • paper_url: http://arxiv.org/abs/2307.14981
  • repo_url: None
  • paper_authors: Chenming Wu, Jiadai Sun, Zhelun Shen, Liangjun Zhang
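  • for: Simulating camera sensors for autonomous driving.
  • methods: Uses map priors to guide the training of neural radiance fields under uncertainty: the coarse ground surface supervises the density field as uncertain information, and depth is warped with uncertainty from unknown camera poses to ensure multi-view consistency.
  • results: Experiments show that the method maintains semantic consistency in views that deviate from the driving trajectory. The supplementary video can be viewed at https://youtu.be/jEQWr-Rfh3A.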
    Abstract Simulating camera sensors is a crucial task in autonomous driving. Although neural radiance fields are exceptional at synthesizing photorealistic views in driving simulations, they still fail to generate extrapolated views. This paper proposes to incorporate map priors into neural radiance fields to synthesize out-of-trajectory driving views with semantic road consistency. The key insight is that map information can be utilized as a prior to guiding the training of the radiance fields with uncertainty. Specifically, we utilize the coarse ground surface as uncertain information to supervise the density field and warp depth with uncertainty from unknown camera poses to ensure multi-view consistency. Experimental results demonstrate that our approach can produce semantic consistency in deviated views for vehicle camera simulation. The supplementary video can be viewed at https://youtu.be/jEQWr-Rfh3A.
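    Summary Simulating camera sensors is a crucial task in autonomous driving. Although neural radiance fields are exceptional at synthesizing photorealistic views in driving simulations, they still fail to generate extrapolated views. This paper proposes incorporating map priors into neural radiance fields to synthesize out-of-trajectory driving views with semantic road consistency. The key insight is that map information can serve as a prior that guides the training of the radiance fields under uncertainty. Specifically, the coarse ground surface is used as uncertain information to supervise the density field, and depth is warped with uncertainty from unknown camera poses to ensure multi-view consistency. Experimental results show that the approach produces semantic consistency in deviated views for vehicle camera simulation. The supplementary video can be viewed at https://youtu.be/jEQWr-Rfh3A.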