cs.CV - 2023-07-03

Cross-modality Attention Adapter: A Glioma Segmentation Fine-tuning Method for SAM Using Multimodal Brain MR Images

  • paper_url: http://arxiv.org/abs/2307.01124
  • repo_url: None
  • paper_authors: Xiaoyu Shi, Shurong Chai, Yinhao Li, Jingliang Cheng, Jie Bai, Guohua Zhao, Yen-Wei Chen
  • for: This paper proposes a multimodal-fusion-based adapter for fine-tuning a foundation model (SAM) to better segment glioma regions in multimodal brain MR images.
  • methods: A cross-modality attention adapter fuses the multiple MRI modalities to improve the performance of the fine-tuned foundation model.
  • results: The method is validated on a private glioma dataset from the First Affiliated Hospital of Zhengzhou University (FHZU), achieving a Dice of 88.38% and a Hausdorff distance of 10.64, a 4% improvement in Dice over the current state of the art.
    Abstract According to the 2021 World Health Organization (WHO) Classification scheme for gliomas, glioma segmentation is a very important basis for diagnosis and genotype prediction. In general, 3D multimodal brain MRI is an effective diagnostic tool. In the past decade, there has been an increase in the use of machine learning, particularly deep learning, for medical image processing. Thanks to the development of foundation models, models pre-trained with large-scale datasets have achieved better results on a variety of tasks. However, for medical images with small dataset sizes, deep learning methods struggle to achieve better results on real-world image datasets. In this paper, we propose a cross-modality attention adapter based on multimodal fusion to fine-tune the foundation model to accomplish the task of glioma segmentation in multimodal MRI brain images with better results. The effectiveness of the proposed method is validated via our private glioma data set from the First Affiliated Hospital of Zhengzhou University (FHZU) in Zhengzhou, China. Our proposed method is superior to current state-of-the-art methods with a Dice of 88.38% and Hausdorff distance of 10.64, thereby exhibiting a 4% increase in Dice to segment the glioma region for glioma treatment.
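The Dice and Hausdorff numbers quoted above are standard overlap and boundary-distance metrics; a minimal sketch of how they are computed for a predicted/ground-truth mask pair, using NumPy and SciPy (the toy masks below are illustrative, not data from the paper):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hausdorff_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance between the two foreground voxel sets."""
    p = np.argwhere(pred.astype(bool))
    g = np.argwhere(gt.astype(bool))
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

# Toy example with two overlapping 3D masks (purely illustrative).
pred = np.zeros((32, 32, 32), dtype=np.uint8)
gt = np.zeros_like(pred)
pred[8:20, 8:20, 8:20] = 1
gt[10:22, 10:22, 10:22] = 1
print(f"Dice: {dice_coefficient(pred, gt):.4f}, HD: {hausdorff_distance(pred, gt):.2f}")
```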

Artifacts Mapping: Multi-Modal Semantic Mapping for Object Detection and 3D Localization

  • paper_url: http://arxiv.org/abs/2307.01121
  • repo_url: None
  • paper_authors: Federico Rollo, Gennaro Raiola, Andrea Zunino, Nikolaos Tsagarakis, Arash Ajoudani
  • for: This work aims to autonomously detect and localize predefined objects within a map that is under construction (SLAM) or already built.
  • methods: The method uses a multi-sensor fusion approach, combining RGB and depth data from an RGB-D camera with a lidar.
  • results: Experiments show that the framework accurately detects 98% of the objects in the real sample environment without post-processing. Compared with single-sensor experiments (camera or lidar only), sensor fusion allows the robot to detect both near and far obstacles, which would otherwise be noisy or imprecise.
    Abstract Geometric navigation is nowadays a well-established field of robotics and the research focus is shifting towards higher-level scene understanding, such as Semantic Mapping. When a robot needs to interact with its environment, it must be able to comprehend the contextual information of its surroundings. This work focuses on classifying and localising objects within a map, which is under construction (SLAM) or already built. To further explore this direction, we propose a framework that can autonomously detect and localize predefined objects in a known environment using a multi-modal sensor fusion approach (combining RGB and depth data from an RGB-D camera and a lidar). The framework consists of three key elements: understanding the environment through RGB data, estimating depth through multi-modal sensor fusion, and managing artifacts (i.e., filtering and stabilizing measurements). The experiments show that the proposed framework can accurately detect 98% of the objects in the real sample environment, without post-processing, while 85% and 80% of the objects were mapped using the single RGBD camera or RGB + lidar setup respectively. The comparison with single-sensor (camera or lidar) experiments is performed to show that sensor fusion allows the robot to accurately detect near and far obstacles, which would have been noisy or imprecise in a purely visual or laser-based approach.
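The depth-estimation step hinges on relating lidar points to the RGB frame; a minimal NumPy sketch of that camera projection, with made-up intrinsics and extrinsics rather than the paper's calibration:

```python
import numpy as np

def project_lidar_to_depth(points_lidar, K, R, t, image_shape):
    """Project Nx3 lidar points (lidar frame) into a sparse depth image."""
    # Transform the points into the camera frame.
    pts_cam = points_lidar @ R.T + t          # (N, 3)
    in_front = pts_cam[:, 2] > 0
    pts_cam = pts_cam[in_front]
    # Perspective projection with the pinhole intrinsics K.
    uv = pts_cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v, z = uv[:, 0].astype(int), uv[:, 1].astype(int), pts_cam[:, 2]
    h, w = image_shape
    depth = np.zeros((h, w), dtype=np.float32)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[valid], u[valid]] = z[valid]      # keep the last hit per pixel (sketch)
    return depth

# Placeholder calibration (assumed values for illustration only).
K = np.array([[525.0, 0, 319.5], [0, 525.0, 239.5], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
points = np.random.uniform([-5, -2, 1], [5, 2, 20], size=(1000, 3))
depth_map = project_lidar_to_depth(points, K, R, t, (480, 640))
print("non-zero depth pixels:", int((depth_map > 0).sum()))
```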

MeT: A Graph Transformer for Semantic Segmentation of 3D Meshes

  • paper_url: http://arxiv.org/abs/2307.01115
  • repo_url: None
  • paper_authors: Giuseppe Vecchio, Luca Prezzavento, Carmelo Pino, Francesco Rundo, Simone Palazzo, Concetto Spampinato
  • for: This paper proposes a transformer-based method for semantic segmentation of 3D meshes.
  • methods: The method uses a global attention mechanism, with positional encoding based on the Laplacian eigenvectors of the adjacency matrix and clustering-based features injected into the self- and cross-attention operators, to better handle non-sequential data and capture local context.
  • results: Experiments on three sets of the Shape COSEG dataset, the human segmentation dataset of Maron et al. (2017), and the ShapeNet benchmark show state-of-the-art performance.
    Abstract Polygonal meshes have become the standard for discretely approximating 3D shapes, thanks to their efficiency and high flexibility in capturing non-uniform shapes. This non-uniformity, however, leads to irregularity in the mesh structure, making tasks like segmentation of 3D meshes particularly challenging. Semantic segmentation of 3D mesh has been typically addressed through CNN-based approaches, leading to good accuracy. Recently, transformers have gained enough momentum both in NLP and computer vision fields, achieving performance at least on par with CNN models, supporting the long-sought architecture universalism. Following this trend, we propose a transformer-based method for semantic segmentation of 3D mesh motivated by a better modeling of the graph structure of meshes, by means of global attention mechanisms. In order to address the limitations of standard transformer architectures in modeling relative positions of non-sequential data, as in the case of 3D meshes, as well as in capturing the local context, we perform positional encoding by means of the Laplacian eigenvectors of the adjacency matrix, replacing the traditional sinusoidal positional encodings, and by introducing clustering-based features into the self-attention and cross-attention operators. Experimental results, carried out on three sets of the Shape COSEG Dataset, on the human segmentation dataset proposed in Maron et al., 2017 and on the ShapeNet benchmark, show how the proposed approach yields state-of-the-art performance on semantic segmentation of 3D meshes.
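A minimal sketch of the Laplacian-eigenvector positional encoding mentioned above, computed for a toy graph with NumPy (the ring graph and the number of encoding dimensions k are illustrative assumptions):

```python
import numpy as np

def laplacian_positional_encoding(adj: np.ndarray, k: int) -> np.ndarray:
    """Return the k non-trivial eigenvectors of the symmetric normalized Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(lap)          # eigenvalues in ascending order
    # Skip the first (constant) eigenvector; use the next k as positional codes.
    return eigvecs[:, 1:k + 1]

# Toy 5-node ring graph standing in for mesh connectivity.
adj = np.zeros((5, 5))
for i in range(5):
    adj[i, (i + 1) % 5] = adj[(i + 1) % 5, i] = 1
pos_enc = laplacian_positional_encoding(adj, k=3)   # (num_nodes, 3)
print(pos_enc.shape)
```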

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

  • paper_url: http://arxiv.org/abs/2307.01097
  • repo_url: https://github.com/Tangshitao/MVDiffusion
  • paper_authors: Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, Yasutaka Furukawa
  • for: This paper presents a method for generating consistent multi-view images from text prompts.
  • methods: MVDiffusion, a simple yet effective method, generates all images simultaneously with global awareness, integrating correspondence-aware attention layers into a pre-trained text-to-image diffusion model and thereby avoiding the error accumulation of iterative warping and inpainting.
  • results: Experiments show that MVDiffusion generates high-resolution photorealistic panoramas for arbitrary text prompts, can extrapolate a single perspective image to a 360-degree view, and achieves state-of-the-art performance for multi-view depth-to-image generation.
    Abstract This paper introduces MVDiffusion, a simple yet effective method for generating consistent multi-view images from text prompts given pixel-to-pixel correspondences (e.g., perspective crops from a panorama or multi-view images given depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion simultaneously generates all images with a global awareness, effectively addressing the prevalent error accumulation issue. At its core, MVDiffusion processes perspective images in parallel with a pre-trained text-to-image diffusion model, while integrating novel correspondence-aware attention layers to facilitate cross-view interactions. For panorama generation, while only trained with 10k panoramas, MVDiffusion is able to generate high-resolution photorealistic images for arbitrary texts or extrapolate one perspective image to a 360-degree view. For multi-view depth-to-image generation, MVDiffusion demonstrates state-of-the-art performance for texturing a scene mesh. The project page is at https://mvdiffusion.github.io/.

UW-ProCCaps: UnderWater Progressive Colourisation with Capsules

  • paper_url: http://arxiv.org/abs/2307.01091
  • repo_url: None
  • paper_authors: Rita Pucci, Niki Martinel
  • for: This work aims to reduce the storage space required for underwater images so that image-collection campaigns can last longer.
  • methods: A novel machine-learning model reconstructs the colours of underwater images from their luminance channel alone, saving 2/3 of the storage space. The model is specialised for underwater colour reconstruction and uses an encoder-decoder architecture with capsule layers; the encoder combines a convolutional encoder with a parallel specialised classifier trained on webly-supervised data.
  • results: Qualitative and quantitative evaluation on four benchmark datasets shows that the solution outperforms state-of-the-art (SOTA) methods for colour reconstruction, and the generated colourisation also improves image quality compared with SOTA enhancement models.
    Abstract Underwater images are fundamental for studying and understanding the status of marine life. We focus on reducing the memory space required for image storage while the memory space consumption in the collecting phase limits the time lasting of this phase leading to the need for more image collection campaigns. We present a novel machine-learning model that reconstructs the colours of underwater images from their luminescence channel, thus saving 2/3 of the available storage space. Our model specialises in underwater colour reconstruction and consists of an encoder-decoder architecture. The encoder is composed of a convolutional encoder and a parallel specialised classifier trained with webly-supervised data. The encoder and the decoder use layers of capsules to capture the features of the entities in the image. The colour reconstruction process recalls the progressive and the generative adversarial training procedures. The progressive training gives the ground for a generative adversarial routine focused on the refining of colours giving the image bright and saturated colours which bring the image back to life. We validate the model both qualitatively and quantitatively on four benchmark datasets. This is the first attempt at colour reconstruction in greyscale underwater images. Extensive results on four benchmark datasets demonstrate that our solution outperforms state-of-the-art (SOTA) solutions. We also demonstrate that the generated colourisation enhances the quality of images compared to enhancement models at the SOTA.
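The claimed 2/3 storage saving follows from keeping only the luminance channel at capture time; a minimal NumPy sketch of that bookkeeping (BT.601 luma weights and a random image as stand-ins; the capsule colourisation network itself is not reproduced here):

```python
import numpy as np

def to_luminance(rgb: np.ndarray) -> np.ndarray:
    """Collapse an HxWx3 uint8 RGB image to a single luminance channel (BT.601)."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb.astype(np.float32) @ weights).clip(0, 255).astype(np.uint8)

rgb = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
luma = to_luminance(rgb)                              # what would be stored on-device
print("bytes RGB:", rgb.nbytes, "bytes luma:", luma.nbytes)
print("fraction of storage kept:", luma.nbytes / rgb.nbytes)   # ~1/3, i.e. 2/3 saved
# Colour would later be reconstructed offline by the capsule encoder-decoder.
```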

Streamlined Lensed Quasar Identification in Multiband Images via Ensemble Networks

  • paper_url: http://arxiv.org/abs/2307.01090
  • repo_url: None
  • paper_authors: Irham Taufik Andika, Sherry H. Suyu, Raoul Cañameras, Alejandra Melo, Stefan Schuldt, Yiping Shu, Anna-Christina Eilers, Anton Timur Jaelani, Minghao Yue
  • for: This work aims to automate the detection of strongly lensed quasars, improving the efficiency of studies of the cosmic expansion rate and dark-matter profiles.
  • methods: The study ensembles several cutting-edge convolutional neural networks (CNNs) -- ResNet, Inception, NASNet, MobileNet, EfficientNet, and RegNet -- together with vision transformers (ViTs), trained on realistic galaxy-quasar lens simulations based on Hyper Suprime-Cam (HSC) multiband images.
  • results: Averaging the CNNs and ViTs reduces spurious detections on real data by factors of up to 50. Combining the HSC images with UKIRT, VISTA, and unWISE data yields roughly 60 million parent sources, reduced to 892,609 after photometric preselection; the ensemble classifier then flags 3080 probable lenses, of which 210 candidates prevail after visual inspection and await spectroscopic confirmation. This shows that automated deep-learning pipelines can detect strong lenses in vast datasets with minimal manual visual inspection.
    Abstract Quasars experiencing strong lensing offer unique viewpoints on subjects related to the cosmic expansion rate, the dark matter profile within the foreground deflectors, and the quasar host galaxies. Unfortunately, identifying them in astronomical images is challenging since they are overwhelmed by the abundance of non-lenses. To address this, we have developed a novel approach by ensembling cutting-edge convolutional networks (CNNs) -- for instance, ResNet, Inception, NASNet, MobileNet, EfficientNet, and RegNet -- along with vision transformers (ViTs) trained on realistic galaxy-quasar lens simulations based on the Hyper Suprime-Cam (HSC) multiband images. While the individual model exhibits remarkable performance when evaluated against the test dataset, achieving an area under the receiver operating characteristic curve of $>$97.3% and a median false positive rate of 3.6%, it struggles to generalize in real data, indicated by numerous spurious sources picked by each classifier. A significant improvement is achieved by averaging these CNNs and ViTs, resulting in the impurities being downsized by factors up to 50. Subsequently, combining the HSC images with the UKIRT, VISTA, and unWISE data, we retrieve approximately 60 million sources as parent samples and reduce this to 892,609 after employing a photometry preselection to discover $z>1.5$ lensed quasars with Einstein radii of $\theta_\mathrm{E}<5$ arcsec. Afterward, the ensemble classifier indicates 3080 sources with a high probability of being lenses, for which we visually inspect, yielding 210 prevailing candidates awaiting spectroscopic confirmation. These outcomes suggest that automated deep learning pipelines hold great potential in effectively detecting strong lenses in vast datasets with minimal manual visual inspection involved.
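The key gain comes from averaging the per-model lens probabilities before thresholding; a minimal sketch of that ensembling step (the score matrix and the selection cut are illustrative values, not the paper's):

```python
import numpy as np

def ensemble_lens_probability(prob_matrix: np.ndarray) -> np.ndarray:
    """Average per-classifier lens probabilities; prob_matrix is (n_models, n_sources)."""
    return prob_matrix.mean(axis=0)

# Fake scores from six CNNs plus one ViT over 10 candidate sources.
rng = np.random.default_rng(0)
probs = rng.uniform(0.0, 1.0, size=(7, 10))
avg = ensemble_lens_probability(probs)

threshold = 0.9          # assumed cut for "high probability of being a lens"
candidates = np.where(avg > threshold)[0]
print("sources kept for visual inspection:", candidates.tolist())
```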

Empirically Validating Conformal Prediction on Modern Vision Architectures Under Distribution Shift and Long-tailed Data

  • paper_url: http://arxiv.org/abs/2307.01088
  • repo_url: None
  • paper_authors: Kevin Kasa, Graham W. Taylor
  • for: Providing deep-learning models with reliable uncertainty estimates and safety guarantees.
  • methods: Several post-hoc and training-based conformal prediction methods are evaluated.
  • results: The first empirical evaluation on large-scale datasets and models shows that these methods degrade markedly under distribution shift and long-tailed class distributions, frequently violating their coverage guarantees.
    Abstract Conformal prediction has emerged as a rigorous means of providing deep learning models with reliable uncertainty estimates and safety guarantees. Yet, its performance is known to degrade under distribution shift and long-tailed class distributions, which are often present in real world applications. Here, we characterize the performance of several post-hoc and training-based conformal prediction methods under these settings, providing the first empirical evaluation on large-scale datasets and models. We show that across numerous conformal methods and neural network families, performance greatly degrades under distribution shifts violating safety guarantees. Similarly, we show that in long-tailed settings the guarantees are frequently violated on many classes. Understanding the limitations of these methods is necessary for deployment in real world and safety-critical applications.
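For reference, a minimal sketch of the split conformal procedure whose coverage guarantee is being stress-tested, using softmax outputs on a held-out calibration set (the arrays and alpha are synthetic placeholders):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Finite-sample-corrected quantile of the nonconformity scores 1 - p_true."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

def prediction_set(test_probs, qhat):
    """All classes whose nonconformity score falls below the calibrated threshold."""
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

rng = np.random.default_rng(1)
cal_probs = rng.dirichlet(np.ones(5), size=200)      # fake calibration softmax outputs
cal_labels = rng.integers(0, 5, size=200)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
test_probs = rng.dirichlet(np.ones(5), size=3)
print(prediction_set(test_probs, qhat))
```

Under distribution shift or on rare long-tail classes, the empirical coverage of such prediction sets can fall below the nominal 1 - alpha level, which is the failure mode the paper documents.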

Shi-NeSS: Detecting Good and Stable Keypoints with a Neural Stability Score

  • paper_url: http://arxiv.org/abs/2307.01069
  • repo_url: None
  • paper_authors: Konstantin Pakulev, Alexander Vakhitov, Gonzalo Ferrer
  • for: This work proposes a reliable keypoint detector, addressing both the ambiguity in defining a keypoint and the need for specially prepared ground-truth labels.
  • methods: The method combines the hand-crafted Shi detector with a neural network: the Shi detector provides principled, localized candidate keypoints, which are then selected according to a keypoint stability score regressed by the network (Neural Stability Score, NeSS). Training only requires sets of images, without dataset pre-labeling or reconstructed correspondence labels.
  • results: Evaluations on HPatches, ScanNet, MegaDepth, and IMC-PT demonstrate state-of-the-art performance and good generalization to downstream tasks.
    Abstract Learning a feature point detector presents a challenge both due to the ambiguity of the definition of a keypoint and correspondingly the need for a specially prepared ground truth labels for such points. In our work, we address both of these issues by utilizing a combination of a hand-crafted Shi detector and a neural network. We build on the principled and localized keypoints provided by the Shi detector and perform their selection using the keypoint stability score regressed by the neural network - Neural Stability Score (NeSS). Therefore, our method is named Shi-NeSS since it combines the Shi detector and the properties of the keypoint stability score, and it only requires for training sets of images without dataset pre-labeling or the need for reconstructed correspondence labels. We evaluate Shi-NeSS on HPatches, ScanNet, MegaDepth and IMC-PT, demonstrating state-of-the-art performance and good generalization on downstream tasks.
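A minimal sketch of the two-stage idea (hand-crafted Shi-Tomasi keypoints re-ranked by a learned stability score), using OpenCV for the detector; the `stability_score` function below is a hypothetical stand-in for the NeSS regressor, not the authors' network:

```python
import cv2
import numpy as np

def shi_keypoints(gray: np.ndarray, max_corners: int = 500) -> np.ndarray:
    """Shi-Tomasi corners via OpenCV; returns an (N, 2) array of (x, y) points."""
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    return pts.reshape(-1, 2) if pts is not None else np.empty((0, 2))

def stability_score(gray: np.ndarray, pts: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the NeSS regressor: here, simple local contrast."""
    scores = []
    for x, y in pts.astype(int):
        patch = gray[max(y - 4, 0):y + 5, max(x - 4, 0):x + 5]
        scores.append(float(patch.std()))
    return np.array(scores)

gray = np.random.randint(0, 256, (240, 320), dtype=np.uint8)  # stand-in image
pts = shi_keypoints(gray)
keep = pts[np.argsort(-stability_score(gray, pts))[:100]]     # top-100 most "stable"
print(keep.shape)
```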

Localized Questions in Medical Visual Question Answering

  • paper_url: http://arxiv.org/abs/2307.01067
  • repo_url: https://github.com/sergiotasconmorales/locvqa
  • paper_authors: Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman
  • for: Research on medical visual question answering (medical VQA), addressing the limitation that existing models answer questions about the entire image rather than about specific image regions.
  • methods: A novel approach is proposed that answers questions about image regions while taking into account the context needed to answer them.
  • results: Experiments show that the method outperforms existing approaches on three datasets, achieving higher accuracy.
    Abstract Visual Question Answering (VQA) models aim to answer natural language questions about given images. Due to its ability to ask questions that differ from those used when training the model, medical VQA has received substantial attention in recent years. However, existing medical VQA models typically focus on answering questions that refer to an entire image rather than where the relevant content may be located in the image. Consequently, VQA models are limited in their interpretability power and the possibility to probe the model about specific image regions. This paper proposes a novel approach for medical VQA that addresses this limitation by developing a model that can answer questions about image regions while considering the context necessary to answer the questions. Our experimental results demonstrate the effectiveness of our proposed model, outperforming existing methods on three datasets. Our code and data are available at https://github.com/sergiotasconmorales/locvqa.

TomatoDIFF: On-plant Tomato Segmentation with Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2307.01064
  • repo_url: https://github.com/mivanovska/tomatodiff
  • paper_authors: Marija Ivanovska, Vitomir Struc, Janez Pers
  • for: This work supports optimizing tomato growth and production while reducing costs and environmental impact.
  • methods: The paper proposes TomatoDIFF, a novel diffusion-based model for semantic segmentation of on-plant tomatoes, and introduces Tomatopia, a new, large, and challenging dataset of greenhouse tomatoes with high-resolution RGB-D images and pixel-level fruit annotations.
  • results: Compared against competitive methods, the model achieves state-of-the-art (SOTA) performance, even in challenging environments with highly occluded fruits.
    Abstract Artificial intelligence applications enable farmers to optimize crop growth and production while reducing costs and environmental impact. Computer vision-based algorithms in particular, are commonly used for fruit segmentation, enabling in-depth analysis of the harvest quality and accurate yield estimation. In this paper, we propose TomatoDIFF, a novel diffusion-based model for semantic segmentation of on-plant tomatoes. When evaluated against other competitive methods, our model demonstrates state-of-the-art (SOTA) performance, even in challenging environments with highly occluded fruits. Additionally, we introduce Tomatopia, a new, large and challenging dataset of greenhouse tomatoes. The dataset comprises high-resolution RGB-D images and pixel-level annotations of the fruits.

Cross-modal Place Recognition in Image Databases using Event-based Sensors

  • paper_url: http://arxiv.org/abs/2307.01047
  • repo_url: None
  • paper_authors: Xiang Ji, Jiaxin Wei, Yifu Wang, Huiliang Shang, Laurent Kneip
  • for: This paper presents a cross-modal visual place recognition framework that retrieves regular images from a database given an event-camera query, showing promising performance relative to frame-based and event-based methods across different scenarios.
  • methods: The framework matches event queries against an image database, using a neural network that brings event and image information into a common representation, and combines retrieval with classification.
  • results: Experiments on the Brisbane-Event-VPR dataset show promising results compared with state-of-the-art frame-based and event-based methods under different scenarios, and combining retrieval with classification boosts performance by a large margin.
    Abstract Visual place recognition is an important problem towards global localization in many robotics tasks. One of the biggest challenges is that it may suffer from illumination or appearance changes in surrounding environments. Event cameras are interesting alternatives to frame-based sensors as their high dynamic range enables robust perception in difficult illumination conditions. However, current event-based place recognition methods only rely on event information, which restricts downstream applications of VPR. In this paper, we present the first cross-modal visual place recognition framework that is capable of retrieving regular images from a database given an event query. Our method demonstrates promising results with respect to the state-of-the-art frame-based and event-based methods on the Brisbane-Event-VPR dataset under different scenarios. We also verify the effectiveness of the combination of retrieval and classification, which can boost performance by a large margin.
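Retrieving frames from an event query presupposes a shared representation; one common pre-processing choice is to accumulate the event stream into a polarity histogram image, sketched below with NumPy (the event tuples are synthetic, and this is only a plausible front-end, not necessarily the authors' pipeline):

```python
import numpy as np

def events_to_frame(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate (x, y, polarity) events into a 2-channel count image."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    pol = (events[:, 2] > 0).astype(int)          # channel 0: negative, 1: positive
    np.add.at(frame, (pol, y, x), 1.0)
    return frame

# Synthetic event packet: columns are x, y, polarity in {-1, +1}.
rng = np.random.default_rng(2)
events = np.stack([rng.integers(0, 320, 5000),
                   rng.integers(0, 240, 5000),
                   rng.choice([-1, 1], 5000)], axis=1)
frame = events_to_frame(events, height=240, width=320)
print(frame.shape, frame.sum())
```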

SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation

  • paper_url: http://arxiv.org/abs/2307.01024
  • repo_url: https://github.com/vision4robotics/sam-da
  • paper_authors: Liangliang Yao, Haobo Zuo, Guangze Zheng, Changhong Fu, Jia Pan
  • for: This work aims to improve nighttime UAV tracking, in particular enabling real-time daytime trackers to work at night.
  • methods: It leverages the Segment Anything Model (SAM) and designs a novel SAM-powered target-domain training-sample swelling method.
  • results: Experiments on extensive nighttime UAV videos show that SAM-DA improves nighttime tracking and adapts to the nighttime domain better than the state-of-the-art DA, while requiring fewer raw nighttime images for training, which eases validation and deployment of the algorithm.
    Abstract Domain adaptation (DA) has demonstrated significant promise for real-time nighttime unmanned aerial vehicle (UAV) tracking. However, the state-of-the-art (SOTA) DA still lacks the potential object with accurate pixel-level location and boundary to generate the high-quality target domain training sample. This key issue constrains the transfer learning of the real-time daytime SOTA trackers for challenging nighttime UAV tracking. Recently, the notable Segment Anything Model (SAM) has achieved remarkable zero-shot generalization ability to discover abundant potential objects due to its huge data-driven training approach. To solve the aforementioned issue, this work proposes a novel SAM-powered DA framework for real-time nighttime UAV tracking, i.e., SAM-DA. Specifically, an innovative SAM-powered target domain training sample swelling is designed to determine enormous high-quality target domain training samples from every single raw nighttime image. This novel one-to-many method significantly expands the high-quality target domain training sample for DA. Comprehensive experiments on extensive nighttime UAV videos prove the robustness and domain adaptability of SAM-DA for nighttime UAV tracking. Especially, compared to the SOTA DA, SAM-DA can achieve better performance with fewer raw nighttime images, i.e., the fewer-better training. This economized training approach facilitates the quick validation and deployment of algorithms for UAVs. The code is available at https://github.com/vision4robotics/SAM-DA.

CGAM: Click-Guided Attention Module for Interactive Pathology Image Segmentation via Backpropagating Refinement

  • paper_url: http://arxiv.org/abs/2307.01015
  • repo_url: None
  • paper_authors: Seonghui Min, Won-Ki Jeong
  • for: This work aims to improve the reliability and accuracy of interactive segmentation for the quantitative analysis of pathology images.
  • methods: An interactive segmentation method lets users refine the output of a deep neural network through click-type interactions, formulated as an optimization problem that leverages user-provided click constraints and the semantic feature map via a click-guided attention module (CGAM). CGAM avoids overfitting to user clicks, and its model size is independent of the input image size.
  • results: Experiments on pathology image datasets show that the method outperforms existing state-of-the-art approaches.
    Abstract Tumor region segmentation is an essential task for the quantitative analysis of digital pathology. Recently presented deep neural networks have shown state-of-the-art performance in various image-segmentation tasks. However, because of the unclear boundary between the cancerous and normal regions in pathology images, despite using modern methods, it is difficult to produce satisfactory segmentation results in terms of the reliability and accuracy required for medical data. In this study, we propose an interactive segmentation method that allows users to refine the output of deep neural networks through click-type user interactions. The primary method is to formulate interactive segmentation as an optimization problem that leverages both user-provided click constraints and semantic information in a feature map using a click-guided attention module (CGAM). Unlike other existing methods, CGAM avoids excessive changes in segmentation results, which can lead to the overfitting of user clicks. Another advantage of CGAM is that the model size is independent of input image size. Experimental results on pathology image datasets indicated that our method performs better than existing state-of-the-art methods.
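The backpropagating-refinement idea can be illustrated with a tiny test-time optimization loop in PyTorch: the segmentation network stays frozen and only a lightweight refinement term is updated so that the output honours the user clicks while staying close to the original prediction. The additive `delta` below is a simplified stand-in for CGAM, not the authors' module:

```python
import torch
import torch.nn.functional as F

def refine_with_clicks(logits, clicks, steps=50, lr=0.1, lam=1.0):
    """Test-time refinement of frozen-model logits (1, 1, H, W) given click constraints.

    clicks: list of (y, x, label) with label 1 = object click, 0 = background click.
    """
    base = logits.detach()
    delta = torch.zeros_like(base, requires_grad=True)   # stand-in for CGAM parameters
    opt = torch.optim.Adam([delta], lr=lr)
    ys = torch.tensor([c[0] for c in clicks])
    xs = torch.tensor([c[1] for c in clicks])
    targets = torch.tensor([float(c[2]) for c in clicks])
    for _ in range(steps):
        refined = base + delta
        clicked = refined[0, 0, ys, xs]
        click_loss = F.binary_cross_entropy_with_logits(clicked, targets)
        reg = lam * delta.pow(2).mean()   # keep the refinement small: avoid overfitting clicks
        loss = click_loss + reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (base + delta).detach()

logits = torch.randn(1, 1, 64, 64)                       # frozen network output (toy)
refined = refine_with_clicks(logits, clicks=[(10, 12, 1), (50, 40, 0)])
print(torch.sigmoid(refined[0, 0, 10, 12]).item())       # pushed towards 1 by the click
```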

SynthCal: A Synthetic Benchmarking Pipeline to Compare Camera Calibration Algorithms

  • paper_url: http://arxiv.org/abs/2307.01013
  • repo_url: None
  • paper_authors: Lala Shakti Swarup Ray, Bo Zhou, Lars Krupp, Sungho Suh, Paul Lukowicz
  • for: The goal is a reliable benchmarking pipeline for evaluating the accuracy of camera-calibration algorithms.
  • methods: SynthCal, a synthetic camera-calibration benchmarking pipeline, generates images of calibration patterns to measure and quantify the performance of calibration algorithms in camera-parameter estimation.
  • results: A SynthCal-generated dataset covering four common patterns, two camera types, and two environments with varying view, distortion, lighting, and noise is used to evaluate single-view calibration algorithms via reprojection and root-mean-square errors; the experiments demonstrate SynthCal's effectiveness for comparing calibration algorithms and patterns.
    Abstract Accurate camera calibration is crucial for various computer vision applications. However, measuring camera parameters in the real world is challenging and arduous, and there needs to be a dataset with ground truth to evaluate calibration algorithms' accuracy. In this paper, we present SynthCal, a synthetic camera calibration benchmarking pipeline that generates images of calibration patterns to measure and enable accurate quantification of calibration algorithm performance in camera parameter estimation. We present a SynthCal-generated calibration dataset with four common patterns, two camera types, and two environments with varying view, distortion, lighting, and noise levels. The dataset evaluates single-view calibration algorithms by measuring reprojection and root-mean-square errors for identical patterns and camera settings. Additionally, we analyze the significance of different patterns using Zhang's method, which estimates intrinsic and extrinsic camera parameters with known correspondences between 3D points and their 2D projections in different configurations and environments. The experimental results demonstrate the effectiveness of SynthCal in evaluating various calibration algorithms and patterns.
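The reprojection and root-mean-square errors that SynthCal reports can be computed with OpenCV from estimated parameters and known 3D-2D correspondences; a minimal sketch with an illustrative checkerboard and made-up calibration values:

```python
import cv2
import numpy as np

def reprojection_rmse(obj_pts, img_pts, rvec, tvec, K, dist):
    """RMS reprojection error of an estimated calibration on one view."""
    proj, _ = cv2.projectPoints(obj_pts, rvec, tvec, K, dist)
    err = img_pts.reshape(-1, 2) - proj.reshape(-1, 2)
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))

# A synthetic 7x5 planar checkerboard and made-up calibration parameters.
grid = np.mgrid[0:7, 0:5].T.reshape(-1, 2).astype(np.float32)
obj_pts = np.hstack([grid * 0.03, np.zeros((len(grid), 1), np.float32)])  # 3 cm squares
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
dist = np.zeros(5)
rvec = np.array([[0.05], [0.1], [0.0]])
tvec = np.array([[0.0], [0.0], [0.6]])
img_pts, _ = cv2.projectPoints(obj_pts, rvec, tvec, K, dist)
img_pts += np.random.normal(0, 0.3, img_pts.shape)   # stand-in for detector noise
print("RMSE (px):", reprojection_rmse(obj_pts, img_pts, rvec, tvec, K, dist))
```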

Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach

  • paper_url: http://arxiv.org/abs/2307.01004
  • repo_url: None
  • paper_authors: Dongyang Yu, Yunshi Xie, Wangpeng An, Li Zhang, Yufeng Yao
  • for: This paper proposes Joint Coordinate Regression and Association (JCRA), a novel one-stage, end-to-end multi-person 2D pose estimation algorithm that produces keypoints and their associations directly, without any post-processing.
  • methods: The one-stage end-to-end architecture improves inference speed, a symmetric encoder-decoder structure ensures accurate keypoint localization, and a transformer network outputs part positions directly, yielding a significant performance improvement.
  • results: Extensive experiments on the MS COCO and CrowdPose benchmarks show that JCRA outperforms state-of-the-art methods in both accuracy and efficiency, reaching 69.2 mAP on MS COCO with 78% faster inference than previous state-of-the-art bottom-up methods.
    Abstract We introduce a novel one-stage end-to-end multi-person 2D pose estimation algorithm, known as Joint Coordinate Regression and Association (JCRA), that produces human pose joints and associations without requiring any post-processing. The proposed algorithm is fast, accurate, effective, and simple. The one-stage end-to-end network architecture significantly improves the inference speed of JCRA. Meanwhile, we devised a symmetric network structure for both the encoder and decoder, which ensures high accuracy in identifying keypoints. It follows an architecture that directly outputs part positions via a transformer network, resulting in a significant improvement in performance. Extensive experiments on the MS COCO and CrowdPose benchmarks demonstrate that JCRA outperforms state-of-the-art approaches in both accuracy and efficiency. Moreover, JCRA demonstrates 69.2 mAP and is 78\% faster at inference acceleration than previous state-of-the-art bottom-up algorithms. The code for this algorithm will be publicly available.

Predicting beauty, liking, and aesthetic quality: A comparative analysis of image databases for visual aesthetics research

  • paper_url: http://arxiv.org/abs/2307.00984
  • repo_url: None
  • paper_authors: Ralf Bartho, Katja Thoemmes, Christoph Redies
  • for: This study provides a comparative overview of twelve image datasets with aesthetic ratings (beauty, liking, or aesthetic quality).
  • methods: The ratings are predicted using either (A) a set of 20 previously studied statistical image properties or (B) the layers of a convolutional neural network developed for object recognition.
  • results: Predictability of aesthetic ratings varies substantially across datasets, but datasets containing either photographs or paintings behave consistently, suggesting different relevant features for the two image genres. Statistical image properties and the CNN predict the ratings with similar accuracy, indicating substantial overlap in the image information they capture. The discrepancies between datasets nevertheless call into question the generalizability of findings based on single datasets and underscore the importance of using multiple datasets in experimental and computational aesthetics.
    Abstract In the fields of Experimental and Computational Aesthetics, numerous image datasets have been created over the last two decades. In the present work, we provide a comparative overview of twelve image datasets that include aesthetic ratings (beauty, liking or aesthetic quality) and investigate the reproducibility of results across different datasets. Specifically, we examine how consistently the ratings can be predicted by using either (A) a set of 20 previously studied statistical image properties, or (B) the layers of a convolutional neural network developed for object recognition. Our findings reveal substantial variation in the predictability of aesthetic ratings across the different datasets. However, consistent similarities were found for datasets containing either photographs or paintings, suggesting different relevant features in the aesthetic evaluation of these two image genres. To our surprise, statistical image properties and the convolutional neural network predict aesthetic ratings with similar accuracy, highlighting a significant overlap in the image information captured by the two methods. Nevertheless, the discrepancies between the datasets call into question the generalizability of previous research findings on single datasets. Our study underscores the importance of considering multiple datasets to improve the validity and generalizability of research results in the fields of experimental and computational aesthetics.

Autism Spectrum Disorder Classification in Children based on Structural MRI Features Extracted using Contrastive Variational Autoencoder

  • paper_url: http://arxiv.org/abs/2307.00976
  • repo_url: None
  • paper_authors: Ruimin Ma, Ruitao Xie, Yanlin Wang, Jintao Meng, Yanjie Wei, Wenhui Xi, Yi Pan
  • for: The goal is earlier screening and intervention for autism spectrum disorder (ASD) in children, via machine-learning classification based on structural MRI (s-MRI).
  • methods: s-MRI features are extracted with a contrastive variational autoencoder (CVAE) comprising an ASD-specific feature channel, which represents ASD participants, and a common shared feature channel, which represents typically developing participants.
  • results: The study achieves classification accuracy above 0.97 for children aged 0.92-4.83 years, and a neuroanatomical interpretation based on correlations between the s-MRI features and the surface area of different cortical regions discloses potential biomarkers that could help target future ASD treatments.
    Abstract Autism spectrum disorder (ASD) is a highly disabling mental disease that brings significant impairments of social interaction ability to the patients, making early screening and intervention of ASD critical. With the development of the machine learning and neuroimaging technology, extensive research has been conducted on machine classification of ASD based on structural MRI (s-MRI). However, most studies involve with datasets where participants' age are above 5. Few studies conduct machine classification of ASD for participants below 5-year-old, but, with mediocre predictive accuracy. In this paper, we push the boundary of predictive accuracy (above 0.97) of machine classification of ASD in children (age range: 0.92-4.83 years), based on s-MRI features extracted using contrastive variational autoencoder (CVAE). 78 s-MRI, collected from Shenzhen Children's Hospital, are used for training CVAE, which consists of both ASD-specific feature channel and common shared feature channel. The ASD participants represented by ASD-specific features can be easily discriminated from TC participants represented by the common shared features, leading to high classification accuracy. In case of degraded predictive accuracy when data size is extremely small, a transfer learning strategy is proposed here as a potential solution. Finally, we conduct neuroanatomical interpretation based on the correlation between s-MRI features extracted from CVAE and surface area of different cortical regions, which discloses potential biomarkers that could help target treatments of ASD in the future.

MoVie: Visual Model-Based Policy Adaptation for View Generalization

  • paper_url: http://arxiv.org/abs/2307.00972
  • repo_url: https://github.com/yangsizhe/MoVie
  • paper_authors: Sizhe Yang, Yanjie Ze, Huazhe Xu
  • for: This paper addresses the problem of visual reinforcement learning agents generalizing their learned abilities to unseen views, known as the view generalization problem.
  • methods: The authors propose a simple yet effective approach, MoVie, that adapts visual model-based policies for view generalization at test time, without explicit reward signals and without any modification during training.
  • results: The method yields substantial improvements across four scenarios covering 18 tasks from DMControl, xArm, and Adroit, with relative improvements of 33%, 86%, and 152% respectively, highlighting its potential for real-world robotics. Videos are available at https://yangsizhe.github.io/MoVie/.
    Abstract Visual Reinforcement Learning (RL) agents trained on limited views face significant challenges in generalizing their learned abilities to unseen views. This inherent difficulty is known as the problem of $\textit{view generalization}$. In this work, we systematically categorize this fundamental problem into four distinct and highly challenging scenarios that closely resemble real-world situations. Subsequently, we propose a straightforward yet effective approach to enable successful adaptation of visual $\textbf{Mo}$del-based policies for $\textbf{Vie}$w generalization ($\textbf{MoVie}$) during test time, without any need for explicit reward signals and any modification during training time. Our method demonstrates substantial advancements across all four scenarios encompassing a total of $\textbf{18}$ tasks sourced from DMControl, xArm, and Adroit, with a relative improvement of $\mathbf{33}$%, $\mathbf{86}$%, and $\mathbf{152}$% respectively. The superior results highlight the immense potential of our approach for real-world robotics applications. Videos are available at https://yangsizhe.github.io/MoVie/ .

HODINet: High-Order Discrepant Interaction Network for RGB-D Salient Object Detection

  • paper_url: http://arxiv.org/abs/2307.00954
  • repo_url: None
  • paper_authors: Kang Yi, Jing Xu, Xiao Jin, Fu Guo, Yan-Feng Wu
  • for: RGB-D salient object detection (SOD).
  • methods: Transformer-based and CNN-based architectures serve as backbones for the RGB and depth streams, and high-order discrepant interaction modules fuse the cross-modality features at different stages.
  • results: Extensive experiments on seven widely used datasets show competitive performance against 24 state-of-the-art methods under four evaluation metrics.
    Abstract RGB-D salient object detection (SOD) aims to detect the prominent regions by jointly modeling RGB and depth information. Most RGB-D SOD methods apply the same type of backbones and fusion modules to identically learn the multimodality and multistage features. However, these features contribute differently to the final saliency results, which raises two issues: 1) how to model discrepant characteristics of RGB images and depth maps; 2) how to fuse these cross-modality features in different stages. In this paper, we propose a high-order discrepant interaction network (HODINet) for RGB-D SOD. Concretely, we first employ transformer-based and CNN-based architectures as backbones to encode RGB and depth features, respectively. Then, the high-order representations are delicately extracted and embedded into spatial and channel attentions for cross-modality feature fusion in different stages. Specifically, we design a high-order spatial fusion (HOSF) module and a high-order channel fusion (HOCF) module to fuse features of the first two and the last two stages, respectively. Besides, a cascaded pyramid reconstruction network is adopted to progressively decode the fused features in a top-down pathway. Extensive experiments are conducted on seven widely used datasets to demonstrate the effectiveness of the proposed approach. We achieve competitive performance against 24 state-of-the-art methods under four evaluation metrics.

Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration

  • paper_url: http://arxiv.org/abs/2307.00934
  • repo_url: https://github.com/fiveai/saod
  • paper_authors: Kemal Oksuz, Tom Joy, Puneet K. Dokania
  • for: This paper introduces the Self-Aware Object Detection (SAOD) task, a new way of testing detector robustness that addresses serious deficiencies in current evaluation, such as improper out-of-distribution detection protocols and calibration metrics that ignore localisation and classification quality.
  • methods: A unified testing framework with novel metrics and large-scale test datasets is proposed to evaluate the robustness of numerous object detectors.
  • results: Extensive use of the framework yields critical insights into detector robustness, covering robustness to domain shift, reliable uncertainty estimates for the entire scene, and calibrated confidence scores for the detections. A simple baseline for the SAOD task is also provided so that future methods can be benchmarked against it.
    Abstract The current approach for testing the robustness of object detectors suffers from serious deficiencies such as improper methods of performing out-of-distribution detection and using calibration metrics which do not consider both localisation and classification quality. In this work, we address these issues, and introduce the Self-Aware Object Detection (SAOD) task, a unified testing framework which respects and adheres to the challenges that object detectors face in safety-critical environments such as autonomous driving. Specifically, the SAOD task requires an object detector to be: robust to domain shift; obtain reliable uncertainty estimates for the entire scene; and provide calibrated confidence scores for the detections. We extensively use our framework, which introduces novel metrics and large scale test datasets, to test numerous object detectors in two different use-cases, allowing us to highlight critical insights into their robustness performance. Finally, we introduce a simple baseline for the SAOD task, enabling researchers to benchmark future proposed methods and move towards robust object detectors which are fit for purpose. Code is available at https://github.com/fiveai/saod
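One SAOD requirement, calibrated confidence scores, is commonly checked with a binned calibration error over detections; a minimal sketch (synthetic confidences and true-positive flags, and only one common binning scheme rather than the paper's exact metric):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Gap between mean confidence and empirical precision, averaged over bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(conf)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.sum() / n * abs(conf[mask].mean() - correct[mask].mean())
    return ece

rng = np.random.default_rng(3)
conf = rng.uniform(0.2, 1.0, 1000)                 # detection confidence scores
correct = (rng.uniform(size=1000) < conf * 0.8)    # whether each detection was a true positive
print(f"ECE: {expected_calibration_error(conf, correct.astype(float)):.3f}")
```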

A large calcium-imaging dataset reveals a systematic V4 organization for natural scenes

  • paper_url: http://arxiv.org/abs/2307.00932
  • repo_url: None
  • paper_authors: Tianye Wang, Haoxuan Yao, Tai Sing Lee, Jiayi Hong, Yang Li, Hongfei Jiang, Ian Max Andolina, Shiming Tang
  • for: To better understand how the visual system processes natural scenes, the researchers recorded responses of primate V4 to a large set of natural images.
  • methods: Widefield calcium imaging (validated together with single-cell-resolution two-photon imaging) captured columnar-scale V4 responses, and a digital twin of V4 was built with deep learning to map natural-image preferences at each cortical position.
  • results: The map reveals clustered functional domains for specific classes of natural image features, ranging from surface-related attributes such as color and texture to shape-related features such as edges, curvature, and facial features, illuminating the detailed topological organization and neural codes of V4 for natural scenes.
    Abstract The visual system evolved to process natural scenes, yet most of our understanding of the topology and function of visual cortex derives from studies using artificial stimuli. To gain deeper insights into visual processing of natural scenes, we utilized widefield calcium-imaging of primate V4 in response to many natural images, generating a large dataset of columnar-scale responses. We used this dataset to build a digital twin of V4 via deep learning, generating a detailed topographical map of natural image preferences at each cortical position. The map revealed clustered functional domains for specific classes of natural image features. These ranged from surface-related attributes like color and texture to shape-related features such as edges, curvature, and facial features. We validated the model-predicted domains with additional widefield calcium-imaging and single-cell resolution two-photon imaging. Our study illuminates the detailed topological organization and neural codes in V4 that represent natural scenes.

Semi-supervised multi-view concept decomposition

  • paper_url: http://arxiv.org/abs/2307.00924
  • repo_url: None
  • paper_authors: Qi Jiang, Guoxu Zhou, Qibin Zhao
  • for: Improving clustering performance on multi-view data.
  • methods: A semi-supervised multi-view concept factorization model (SMVCF) extends single-view concept factorization (CF) to multiple views and integrates multi-view CF, label propagation, and manifold learning in a unified framework, with an adaptive weight vector to balance the importance of different views.
  • results: Extensive experiments on four diverse datasets with varying label ratios demonstrate the effectiveness and superiority of SMVCF for multi-view clustering.
    Abstract Concept Factorization (CF), as a novel paradigm of representation learning, has demonstrated superior performance in multi-view clustering tasks. It overcomes limitations such as the non-negativity constraint imposed by traditional matrix factorization methods and leverages kernel methods to learn latent representations that capture the underlying structure of the data, thereby improving data representation. However, existing multi-view concept factorization methods fail to consider the limited labeled information inherent in real-world multi-view data. This often leads to significant performance loss. To overcome these limitations, we propose a novel semi-supervised multi-view concept factorization model, named SMVCF. In the SMVCF model, we first extend the conventional single-view CF to a multi-view version, enabling more effective exploration of complementary information across multiple views. We then integrate multi-view CF, label propagation, and manifold learning into a unified framework to leverage and incorporate valuable information present in the data. Additionally, an adaptive weight vector is introduced to balance the importance of different views in the clustering process. We further develop targeted optimization methods specifically tailored for the SMVCF model. Finally, we conduct extensive experiments on four diverse datasets with varying label ratios to evaluate the performance of SMVCF. The experimental results demonstrate the effectiveness and superiority of our proposed approach in multi-view clustering tasks.

Many tasks make light work: Learning to localise medical anomalies from multiple synthetic tasks

  • paper_url: http://arxiv.org/abs/2307.00899
  • repo_url: https://github.com/matt-baugh/many-tasks-make-light-work
  • paper_authors: Matthew Baugh, Jeremy Tan, Johanna P. Müller, Mischa Dombrowski, James Batten, Bernhard Kainz
  • for: This paper addresses single-class modelling and out-of-distribution detection, since fully supervised models cannot reliably identify classes not included in their training.
  • methods: The approach uses synthetically generated anomalies, training and validating with multiple visually distinct synthetic-anomaly learning tasks, which enables more robust training and generalisation.
  • results: The method readily outperforms existing approaches, demonstrated on brain MRI and chest X-rays.
    Abstract There is a growing interest in single-class modelling and out-of-distribution detection as fully supervised machine learning models cannot reliably identify classes not included in their training. The long tail of infinitely many out-of-distribution classes in real-world scenarios, e.g., for screening, triage, and quality control, means that it is often necessary to train single-class models that represent an expected feature distribution, e.g., from only strictly healthy volunteer data. Conventional supervised machine learning would require the collection of datasets that contain enough samples of all possible diseases in every imaging modality, which is not realistic. Self-supervised learning methods with synthetic anomalies are currently amongst the most promising approaches, alongside generative auto-encoders that analyse the residual reconstruction error. However, all methods suffer from a lack of structured validation, which makes calibration for deployment difficult and dataset-dependant. Our method alleviates this by making use of multiple visually-distinct synthetic anomaly learning tasks for both training and validation. This enables more robust training and generalisation. With our approach we can readily outperform state-of-the-art methods, which we demonstrate on exemplars in brain MRI and chest X-rays. Code is available at https://github.com/matt-baugh/many-tasks-make-light-work .
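A minimal sketch of one synthetic-anomaly task of the kind described above: a patch from a second healthy image is blended into the input, and the blending mask becomes the pixel-level label (a simplified foreign-patch scheme for illustration, not the authors' exact task set):

```python
import numpy as np

def foreign_patch_anomaly(image, donor, rng, patch=24, alpha=0.7):
    """Blend a random donor patch into `image`; return corrupted image and label mask."""
    h, w = image.shape
    y, x = rng.integers(0, h - patch), rng.integers(0, w - patch)
    dy, dx = rng.integers(0, h - patch), rng.integers(0, w - patch)
    out, mask = image.copy(), np.zeros_like(image, dtype=np.float32)
    out[y:y + patch, x:x + patch] = (
        (1 - alpha) * image[y:y + patch, x:x + patch]
        + alpha * donor[dy:dy + patch, dx:dx + patch])
    mask[y:y + patch, x:x + patch] = alpha
    return out, mask

rng = np.random.default_rng(4)
healthy_a = rng.normal(0.5, 0.1, (128, 128))       # stand-ins for healthy scans
healthy_b = rng.normal(0.5, 0.1, (128, 128))
corrupted, label = foreign_patch_anomaly(healthy_a, healthy_b, rng)
print(corrupted.shape, label.max())                # the label localizes the anomaly
```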

Synthesis of Contrast-Enhanced Breast MRI Using Multi-b-Value DWI-based Hierarchical Fusion Network with Attention Mechanism

  • paper_url: http://arxiv.org/abs/2307.00895
  • repo_url: None
  • paper_authors: Tianyu Zhang, Luyi Han, Anna D’Angelo, Xin Wang, Yuan Gao, Chunyao Lu, Jonas Teuwen, Regina Beets-Tan, Tao Tan, Ritse Mann
  • for: This study develops a multi-sequence fusion network to synthesize contrast-enhanced breast MRI (CE-MRI), aiming to reduce or avoid the use of gadolinium-based contrast agents (GBCA) and thus the burden on patients.
  • methods: The network fuses T1-weighted MRI with diffusion-weighted images (DWIs) acquired at multiple b-values; a multi-sequence attention module and a hierarchical representation-fusion (weighted difference) module produce refined feature maps.
  • results: The results suggest that the multi-b-value DWI-based fusion model can potentially synthesize CE-MRI, theoretically reducing or avoiding GBCA use and minimizing the burden on patients.
    Abstract Magnetic resonance imaging (MRI) is the most sensitive technique for breast cancer detection among current clinical imaging modalities. Contrast-enhanced MRI (CE-MRI) provides superior differentiation between tumors and invaded healthy tissue, and has become an indispensable technique in the detection and evaluation of cancer. However, the use of gadolinium-based contrast agents (GBCA) to obtain CE-MRI may be associated with nephrogenic systemic fibrosis and may lead to bioaccumulation in the brain, posing a potential risk to human health. Moreover, and likely more important, the use of gadolinium-based contrast agents requires the cannulation of a vein, and the injection of the contrast media which is cumbersome and places a burden on the patient. To reduce the use of contrast agents, diffusion-weighted imaging (DWI) is emerging as a key imaging technique, although currently usually complementing breast CE-MRI. In this study, we develop a multi-sequence fusion network to synthesize CE-MRI based on T1-weighted MRI and DWIs. DWIs with different b-values are fused to efficiently utilize the difference features of DWIs. Rather than proposing a pure data-driven approach, we invent a multi-sequence attention module to obtain refined feature maps, and leverage hierarchical representation information fused at different scales while utilizing the contributions from different sequences from a model-driven approach by introducing the weighted difference module. The results show that the multi-b-value DWI-based fusion model can potentially be used to synthesize CE-MRI, thus theoretically reducing or avoiding the use of GBCA, thereby minimizing the burden to patients. Our code is available at \url{https://github.com/Netherlands-Cancer-Institute/CE-MRI}.

Mega-cities dominate China’s urban greening

  • paper_url: http://arxiv.org/abs/2307.00894
  • repo_url: None
  • paper_authors: Xiaoxin Zhang, Martin Brandt, Xiaoye Tong, Xiaowei Tong, Wenmin Zhang, Florian Reiner, Sizhuo Li, Feng Tian, Yuemin Yue, Weiqi Zhou, Bin Chen, Xiangming Xiao, Rasmus Fensholt
  • for: This study uses nano-satellites to quantify urban tree cover in China's major cities and assess the impact of urban greening policies between 2010 and 2019.
  • methods: Nano-satellite imagery was used to estimate tree cover in all major Chinese cities larger than 50 km2 in 2010 and 2019.
  • results: In 2019, roughly 6000 km2 (11%) of urban area was covered by trees, and 76% of cities increased their tree cover relative to 2010; the increase in mega-cities such as Beijing and Shanghai (7.69%) was approximately twice that of most other cities (3.94%).
    Abstract Trees play a crucial role in urban environments, offering various ecosystem services that contribute to public health and human well-being. China has initiated a range of urban greening policies over the past decades, however, monitoring their impact on urban tree dynamics at a national scale has proven challenging. In this study, we deployed nano-satellites to quantify urban tree coverage in all major Chinese cities larger than 50 km2 in 2010 and 2019. Our findings indicate that approximately 6000 km2 (11%) of urban areas were covered by trees in 2019, and 76% of these cities experienced an increase in tree cover compared to 2010. Notably, the increase in tree cover in mega-cities such as Beijing, and Shanghai was approximately twice as large as in most other cities (7.69% vs 3.94%). The study employs a data-driven approach towards assessing urban tree cover changes in relation to greening policies, showing clear signs of tree cover increases but also suggesting an uneven implementation primarily benefiting a few mega-cities.

Generating Reliable Pixel-Level Labels for Source Free Domain Adaptation

  • paper_url: http://arxiv.org/abs/2307.00893
  • repo_url: None
  • paper_authors: Gabriel Tjio, Ping Liu, Yawei Luo, Chee Keong Kwoh, Joey Zhou Tianyi
  • for: Addressing the challenging domain adaptation setting where the knowledge from the labelled source domain dataset is only available from a pretrained black-box segmentation model.
  • methods: Proposes a simple yet novel image translation workflow, ReGEN, which comprises an image-to-image translation network and a segmentation network to generate target-like images using the noisy predictions from the original target domain images.
  • results: Demonstrates favourable performance relative to recent state-of-the-art work in two benchmark domain adaptation settings.
    Abstract This work addresses the challenging domain adaptation setting in which knowledge from the labelled source domain dataset is available only from the pretrained black-box segmentation model. The pretrained model's predictions for the target domain images are noisy because of the distributional differences between the source domain data and the target domain data. Since the model's predictions serve as pseudo labels during self-training, the noise in the predictions impose an upper bound on model performance. Therefore, we propose a simple yet novel image translation workflow, ReGEN, to address this problem. ReGEN comprises an image-to-image translation network and a segmentation network. Our workflow generates target-like images using the noisy predictions from the original target domain images. These target-like images are semantically consistent with the noisy model predictions and therefore can be used to train the segmentation network. In addition to being semantically consistent with the predictions from the original target domain images, the generated target-like images are also stylistically similar to the target domain images. This allows us to leverage the stylistic differences between the target-like images and the target domain image as an additional source of supervision while training the segmentation model. We evaluate our model with two benchmark domain adaptation settings and demonstrate that our approach performs favourably relative to recent state-of-the-art work. The source code will be made available.
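For context, a generic sketch of the self-training step that such black-box settings rely on: the frozen source model produces pseudo-labels on target images and only confident pixels supervise the student. This is a simplified stand-in; ReGEN additionally synthesizes target-like images before training. The confidence threshold and ignore-index below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_training_step(student: nn.Module, black_box: nn.Module,
                       target_images: torch.Tensor, optimizer, conf_thresh: float = 0.9):
    """One generic self-training step: the frozen black-box source model provides
    pseudo-labels on target images; only confident pixels supervise the student."""
    with torch.no_grad():
        logits = black_box(target_images)                # (B, C, H, W)
        probs = F.softmax(logits, dim=1)
        conf, pseudo = probs.max(dim=1)                  # (B, H, W)
        pseudo[conf < conf_thresh] = 255                 # ignore low-confidence pixels

    optimizer.zero_grad()
    loss = F.cross_entropy(student(target_images), pseudo, ignore_index=255)
    loss.backward()
    optimizer.step()
    return loss.item()
```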

An Explainable Deep Framework: Towards Task-Specific Fusion for Multi-to-One MRI Synthesis

  • paper_url: http://arxiv.org/abs/2307.00885
  • repo_url: https://github.com/fiy2w/mri_seq2seq
  • paper_authors: Luyi Han, Tianyu Zhang, Yunzhi Huang, Haoran Dou, Xin Wang, Yuan Gao, Chunyao Lu, Tan Tao, Ritse Mann
  • for: 多序列MRI在临床中对可靠诊断和治疗预后评估具有重要价值,但某些序列可能因各种原因不可用或缺失。为解决这一问题,MRI合成是一种可行的方案。
  • methods: 基于最新的深度学习方法,可以很好地组合可用的多个序列来合成缺失序列。然而,这些方法无法量化不同输入序列的贡献,也无法评估生成图像的质量,限制了其实际应用。因此,我们提出一种可解释的任务特定合成网络,能够针对特定序列生成任务自动调整权重,并从两方面提供可解释性与可靠性:(1)通过可训练的任务特定加权平均模块,可视化各输入序列在融合阶段的贡献;(2)通过任务特定注意力模块,突出显示网络在合成过程中试图细化的区域。
  • results: 我们在包含1251名受试者的BraTS2021数据集上进行了实验,任意序列合成的结果表明,所提方法优于当前最先进的方法。代码见 \url{https://github.com/fiy2W/mri_seq2seq}。
    Abstract Multi-sequence MRI is valuable in clinical settings for reliable diagnosis and treatment prognosis, but some sequences may be unusable or missing for various reasons. To address this issue, MRI synthesis is a potential solution. Recent deep learning-based methods have achieved good performance in combining multiple available sequences for missing sequence synthesis. Despite their success, these methods lack the ability to quantify the contributions of different input sequences and estimate the quality of generated images, making it hard to be practical. Hence, we propose an explainable task-specific synthesis network, which adapts weights automatically for specific sequence generation tasks and provides interpretability and reliability from two sides: (1) visualize the contribution of each input sequence in the fusion stage by a trainable task-specific weighted average module; (2) highlight the area the network tried to refine during synthesizing by a task-specific attention module. We conduct experiments on the BraTS2021 dataset of 1251 subjects, and results on arbitrary sequence synthesis indicate that the proposed method achieves better performance than the state-of-the-art methods. Our code is available at \url{https://github.com/fiy2W/mri_seq2seq}.
    摘要 多序列MRI在临床中具有重要价值,可提供可靠的诊断和治疗预后评估,但某些序列可能因各种原因不可用或缺失。为解决这一问题,MRI合成是一种可行的方案。近年来基于深度学习的方法在组合多个可用序列以合成缺失序列方面取得了良好性能。尽管如此,这些方法无法量化不同输入序列的贡献,也无法评估生成图像的质量,因而难以实际应用。为此,我们提出一种可解释的任务特定合成网络,可针对特定序列生成任务自动调整权重,并从两方面提供可解释性与可靠性:(1)在融合阶段,通过可训练的任务特定加权平均模块可视化各输入序列的贡献;(2)通过任务特定注意力模块突出显示网络在合成过程中试图细化的区域。我们在包含1251名受试者的BraTS2021数据集上进行了实验,任意序列合成的结果表明,所提方法优于当前最先进的方法。代码可在 \url{https://github.com/fiy2W/mri_seq2seq} 获取。

Co-Learning Meets Stitch-Up for Noisy Multi-label Visual Recognition

  • paper_url: http://arxiv.org/abs/2307.00880
  • repo_url: https://github.com/vamosc/colearning-meet-stitchup
  • paper_authors: Chao Liang, Zongxin Yang, Linchao Zhu, Yi Yang
  • for: Handle long-tailed multi-label recognition with noisy labels.
  • methods: Propose a Stitch-Up augmentation to synthesize cleaner samples, and a Heterogeneous Co-Learning framework to leverage the inconsistency between long-tailed and balanced distributions.
  • results: Achieve superior results compared to various baselines, demonstrating the effectiveness of the proposed method in handling noisy long-tailed multi-label data.
    Abstract In real-world scenarios, collected and annotated data often exhibit the characteristics of multiple classes and long-tailed distribution. Additionally, label noise is inevitable in large-scale annotations and hinders the applications of learning-based models. Although many deep learning based methods have been proposed for handling long-tailed multi-label recognition or label noise respectively, learning with noisy labels in long-tailed multi-label visual data has not been well-studied because of the complexity of long-tailed distribution entangled with multi-label correlation. To tackle such a critical yet thorny problem, this paper focuses on reducing noise based on some inherent properties of multi-label classification and long-tailed learning under noisy cases. In detail, we propose a Stitch-Up augmentation to synthesize a cleaner sample, which directly reduces multi-label noise by stitching up multiple noisy training samples. Equipped with Stitch-Up, a Heterogeneous Co-Learning framework is further designed to leverage the inconsistency between long-tailed and balanced distributions, yielding cleaner labels for more robust representation learning with noisy long-tailed data. To validate our method, we build two challenging benchmarks, named VOC-MLT-Noise and COCO-MLT-Noise, respectively. Extensive experiments are conducted to demonstrate the effectiveness of our proposed method. Compared to a variety of baselines, our method achieves superior results.
    摘要 在实际应用场景中,收集和标注的数据往往具有多类别和长尾分布的特点,而大规模标注中标签噪声不可避免,阻碍了基于学习的模型的应用。虽然已有许多深度学习方法分别用于处理长尾多标签识别或标签噪声,但由于长尾分布与多标签相关性相互纠缠的复杂性,在带噪标签的长尾多标签视觉数据上的学习尚未得到充分研究。为解决这一关键而棘手的问题,本文利用多标签分类和长尾学习在噪声情形下的一些固有性质来降低噪声。具体而言,我们提出了一种名为 Stitch-Up 的数据增强方法,通过拼接多个带噪训练样本来合成更干净的样本,从而直接减少多标签噪声。在此基础上,我们进一步设计了 Heterogeneous Co-Learning 框架,利用长尾分布与平衡分布之间的不一致性,为带噪长尾数据生成更干净的标签,以进行更鲁棒的表示学习。为验证方法的有效性,我们构建了两个具有挑战性的基准,分别为 VOC-MLT-Noise 和 COCO-MLT-Noise。大量实验表明,相比多种基线方法,我们的方法取得了更优的结果。
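One plausible reading of a Stitch-Up-style augmentation, sketched below: several noisy samples are tiled into one canvas and their multi-hot labels are merged by union, so a tag missed in one sample can be recovered from another. The 2x2 layout and the label-merging rule are assumptions, not the paper's exact recipe.

```python
import torch

def stitch_up(images: torch.Tensor, labels: torch.Tensor):
    """Loose sketch of a Stitch-Up-style augmentation: tile several noisy samples
    into one canvas and take the union (max) of their multi-hot labels."""
    assert images.shape[0] >= 4, "need at least 4 samples for a 2x2 stitch"
    a, b, c, d = images[:4]                      # each (C, H, W)
    top = torch.cat([a, b], dim=2)               # concatenate along width
    bottom = torch.cat([c, d], dim=2)
    canvas = torch.cat([top, bottom], dim=1)     # (C, 2H, 2W)
    merged_label = labels[:4].max(dim=0).values  # union of multi-hot label vectors
    return canvas, merged_label

imgs = torch.rand(4, 3, 224, 224)
lbls = torch.randint(0, 2, (4, 80)).float()      # 80-class multi-hot labels (example)
stitched, label = stitch_up(imgs, lbls)
print(stitched.shape, label.shape)               # (3, 448, 448) (80,)
```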

End-To-End Prediction of Knee Osteoarthritis Progression With Multi-Modal Transformers

  • paper_url: http://arxiv.org/abs/2307.00873
  • repo_url: None
  • paper_authors: Egor Panfilov, Simo Saarakkala, Miika T. Nieminen, Aleksei Tiulpin
  • for: The paper is written to investigate the use of multi-modal data and deep learning methods for predicting the progression of knee osteoarthritis (KOA).
  • methods: The paper uses a Transformer approach for multi-modal fusion of knee imaging data, and analyzes its performance across different progression horizons.
  • results: The paper shows that structural knee MRI can identify radiographic KOA progressors with an area under the ROC curve (ROC AUC) of 0.70-0.76 and Average Precision (AP) of 0.15-0.54 in 2-8 year horizons. Additionally, the paper finds that progression within 1 year can be better predicted with a multi-modal method using X-ray, structural, and compositional MR images.
    Abstract Knee Osteoarthritis (KOA) is a highly prevalent chronic musculoskeletal condition with no currently available treatment. The manifestation of KOA is heterogeneous and prediction of its progression is challenging. Current literature suggests that the use of multi-modal data and advanced modeling methods, such as the ones based on Deep Learning, has promise in tackling this challenge. To date, however, the evidence on the efficacy of this approach is limited. In this study, we leveraged recent advances in Deep Learning and, using a Transformer approach, developed a unified framework for the multi-modal fusion of knee imaging data. Subsequently, we analyzed its performance across a range of scenarios by investigating multiple progression horizons -- from short-term to long-term. We report our findings using a large cohort (n=2421-3967) derived from the Osteoarthritis Initiative dataset. We show that structural knee MRI allows identifying radiographic KOA progressors on par with multi-modal fusion approaches, achieving an area under the ROC curve (ROC AUC) of 0.70-0.76 and Average Precision (AP) of 0.15-0.54 in 2-8 year horizons. Progression within 1 year was better predicted with a multi-modal method using X-ray, structural, and compositional MR images -- ROC AUC of 0.76(0.04), AP of 0.13(0.04) -- or via clinical data. Our follow-up analysis generally shows that prediction from the imaging data is more accurate for post-traumatic subjects, and we further investigate which subject subgroups may benefit the most. The present study provides novel insights into multi-modal imaging of KOA and brings a unified data-driven framework for studying its progression in an end-to-end manner, providing new tools for the design of more efficient clinical trials. The source code of our framework and the pre-trained models are made publicly available.
    摘要 膝关节骨关节炎(KOA)是一种非常普遍的慢性肌肉骨骼疾病,目前尚无有效治疗方法。KOA 的表现具有异质性,其进展预测颇具挑战。现有文献表明,使用多模态数据和先进的建模方法(如基于深度学习的方法)有望解决这一难题。然而,迄今为止,该方法有效性的证据仍然有限。在本研究中,我们利用最新的深度学习技术,采用 Transformer 方法,构建了一个统一的膝关节影像多模态融合框架,并分析了其在从短期到长期等不同进展时间范围内的表现。我们使用来自 Osteoarthritis Initiative 数据集的大规模队列(n=2421-3967)进行评估。结果显示,膝关节结构 MRI 识别影像学 KOA 进展者的能力与多模态融合方法相当,在 2-8 年时间范围内 ROC AUC 为 0.70-0.76,AP 为 0.15-0.54。对于 1 年内的进展,使用 X 光、结构及成分 MR 影像的多模态方法(ROC AUC 0.76(0.04),AP 0.13(0.04))或临床数据可获得更好的预测。后续分析总体表明,基于影像数据的预测对创伤后受试者更为准确,我们进一步研究了哪些受试者亚组可能从中获益最多。本研究为 KOA 的多模态影像提供了新的见解,并提供了一个端到端研究其进展的统一数据驱动框架,为设计更高效的临床试验提供了新工具。我们框架的源代码和预训练模型均已公开。

VINECS: Video-based Neural Character Skinning

  • paper_url: http://arxiv.org/abs/2307.00842
  • repo_url: None
  • paper_authors: Zhouyingcheng Liao, Vladislav Golyanik, Marc Habermann, Christian Theobalt
  • for: 自动为穿衣人物角色创建完整绑定的高保真模型及随姿态变化的蒙皮权重
  • methods: 仅从多视图视频学习:先获取绑定模板并进行静态蒙皮,再用基于坐标的MLP学习姿态相关的蒙皮权重场,并引入姿态与视角相关的外观场以进行可微渲染监督
  • results: 在不依赖密集4D扫描的情况下优于现有方法
    Abstract Rigging and skinning clothed human avatars is a challenging task and traditionally requires a lot of manual work and expertise. Recent methods addressing it either generalize across different characters or focus on capturing the dynamics of a single character observed under different pose configurations. However, the former methods typically predict solely static skinning weights, which perform poorly for highly articulated poses, and the latter ones either require dense 3D character scans in different poses or cannot generate an explicit mesh with vertex correspondence over time. To address these challenges, we propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights, which can be solely learned from multi-view video. Therefore, we first acquire a rigged template, which is then statically skinned. Next, a coordinate-based MLP learns a skinning weights field parameterized over the position in a canonical pose space and the respective pose. Moreover, we introduce our pose- and view-dependent appearance field allowing us to differentiably render and supervise the posed mesh using multi-view imagery. We show that our approach outperforms state-of-the-art while not relying on dense 4D scans.
    摘要 为穿衣人物模型进行绑定(rigging)与蒙皮(skinning)是一项具有挑战性的任务,传统上需要大量人工和专业知识。现有方法要么在不同角色之间泛化,要么专注于捕捉单一角色在不同姿态下的动态。然而,前者通常只预测静态蒙皮权重,在高度关节化的姿态下表现较差;后者要么需要不同姿态下的密集3D人物扫描,要么无法生成随时间保持顶点对应关系的显式网格。为了解决这些挑战,我们提出了一种完全自动化的方法,仅凭多视图视频即可创建带有姿态相关蒙皮权重的完整绑定角色。我们首先获取一个绑定模板并进行静态蒙皮;随后,一个基于坐标的多层感知机(MLP)学习蒙皮权重场,该场以标准姿态空间中的位置及相应姿态为参数。此外,我们引入了与姿态和视角相关的外观场,从而可以利用多视图图像对摆出姿态的网格进行可微渲染与监督。结果表明,我们的方法在不依赖密集4D扫描的情况下优于当前最先进的方法。
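A minimal sketch of a coordinate-based MLP skinning-weight field of the kind described above: a canonical-space point plus a pose code is mapped to per-joint weights that sum to one. Layer sizes and the pose encoding are assumptions, not the VINECS architecture.

```python
import torch
import torch.nn as nn

class SkinningWeightField(nn.Module):
    """Coordinate MLP mapping (canonical point, pose code) -> per-joint skinning
    weights, normalized with a softmax so each point's weights sum to 1."""
    def __init__(self, n_joints: int = 24, pose_dim: int = 72, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints),
        )

    def forward(self, points, pose):
        # points: (N, 3) canonical positions; pose: (pose_dim,) shared pose code
        pose = pose.unsqueeze(0).expand(points.shape[0], -1)
        logits = self.mlp(torch.cat([points, pose], dim=-1))
        return torch.softmax(logits, dim=-1)      # (N, n_joints), rows sum to 1

field = SkinningWeightField()
w = field(torch.randn(1000, 3), torch.zeros(72))
print(w.shape, w.sum(dim=-1)[:3])                 # (1000, 24), sums ~1
```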

Surgical fine-tuning for Grape Bunch Segmentation under Visual Domain Shifts

  • paper_url: http://arxiv.org/abs/2307.00837
  • repo_url: https://github.com/airlab-polimi/sft_grape_segmentation
  • paper_authors: Agnese Chiatti, Riccardo Bertoglio, Nico Catalano, Matteo Gatti, Matteo Matteucci
  • for: 本研究旨在帮助移动机器人在农业场景中自主、有效地监测植物状况,因此机器人需要具备对农业环境快速变化具有鲁棒性的视觉感知能力。
  • methods: 本研究将手术式微调(surgical fine-tuning)应用于葡萄串实例分割任务:仅选择性地调整预训练深度模型的特定层,以适应引入视觉域偏移的新采集葡萄图像,同时大幅减少需要调整的参数数量。
  • results: 结果表明,手术式微调能够有效适应新采集的葡萄图像,并在显著减少调整参数的同时保持模型的分割性能。
    Abstract Mobile robots will play a crucial role in the transition towards sustainable agriculture. To autonomously and effectively monitor the state of plants, robots ought to be equipped with visual perception capabilities that are robust to the rapid changes that characterise agricultural settings. In this paper, we focus on the challenging task of segmenting grape bunches from images collected by mobile robots in vineyards. In this context, we present the first study that applies surgical fine-tuning to instance segmentation tasks. We show how selectively tuning only specific model layers can support the adaptation of pre-trained Deep Learning models to newly-collected grape images that introduce visual domain shifts, while also substantially reducing the number of tuned parameters.
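A minimal sketch of surgical fine-tuning in PyTorch: freeze every parameter except those in the layers selected for adaptation. The segmentation model and the chosen layer prefixes below are placeholders, not the layers the paper identifies.

```python
import torch
import torchvision

def surgical_finetune_params(model: torch.nn.Module, tunable_prefixes):
    """Freeze every parameter except those whose name starts with one of the
    given prefixes, mimicking surgical fine-tuning of selected layers only."""
    params = []
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pref) for pref in tunable_prefixes)
        if p.requires_grad:
            params.append(p)
    return params

# stand-in model; the paper works on instance segmentation of grape bunches
model = torchvision.models.segmentation.deeplabv3_resnet50(weights=None, num_classes=2)
# e.g. adapt only the first backbone block and the head to a visual domain shift
trainable = surgical_finetune_params(model, ["backbone.layer1", "classifier"])
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)
print(sum(p.numel() for p in trainable), "tunable parameters")
```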

Robust Surgical Tools Detection in Endoscopic Videos with Noisy Data

  • paper_url: http://arxiv.org/abs/2307.01232
  • repo_url: None
  • paper_authors: Adnan Qayyum, Hassan Ali, Massimo Caputo, Hunaid Vohra, Taofeek Akinosho, Sofiat Abioye, Ilhem Berrou, Paweł Capik, Junaid Qadir, Muhammad Bilal
  • for: This paper aims to develop a systematic methodology for robust surgical tool detection using noisy data.
  • methods: The proposed methodology introduces an intelligent active learning strategy for minimal dataset identification and label correction by human experts, as well as an assembling strategy for a student-teacher model-based self-training framework to achieve robust classification of 14 surgical tools in a semi-supervised fashion.
  • results: The proposed methodology achieves an average F1-score of 85.88% for the ensemble model-based self-training with class weights, and 80.88% without class weights for noisy labels. The proposed method significantly outperforms existing approaches, effectively demonstrating its effectiveness.
    Abstract Over the past few years, surgical data science has attracted substantial interest from the machine learning (ML) community. Various studies have demonstrated the efficacy of emerging ML techniques in analysing surgical data, particularly recordings of procedures, for digitizing clinical and non-clinical functions like preoperative planning, context-aware decision-making, and operating skill assessment. However, this field is still in its infancy and lacks representative, well-annotated datasets for training robust models in intermediate ML tasks. Also, existing datasets suffer from inaccurate labels, hindering the development of reliable models. In this paper, we propose a systematic methodology for developing robust models for surgical tool detection using noisy data. Our methodology introduces two key innovations: (1) an intelligent active learning strategy for minimal dataset identification and label correction by human experts; and (2) an assembling strategy for a student-teacher model-based self-training framework to achieve the robust classification of 14 surgical tools in a semi-supervised fashion. Furthermore, we employ weighted data loaders to handle difficult class labels and address class imbalance issues. The proposed methodology achieves an average F1-score of 85.88\% for the ensemble model-based self-training with class weights, and 80.88\% without class weights for noisy labels. Also, our proposed method significantly outperforms existing approaches, which effectively demonstrates its effectiveness.
    摘要 过去几年,外科数据科学吸引了机器学习(ML)社区的广泛关注。各类研究表明,新兴的ML技术在分析外科数据(尤其是手术过程录像)方面卓有成效,可用于术前规划、情境感知决策和手术技能评估等临床与非临床功能的数字化。然而,该领域仍处于起步阶段,缺乏具有代表性、标注良好的数据集来训练中间ML任务的鲁棒模型;且现有数据集标签不够准确,阻碍了可靠模型的开发。在本文中,我们提出了一种利用含噪数据构建鲁棒手术器械检测模型的系统性方法。该方法包含两项关键创新:(1)一种智能主动学习策略,用于筛选最小数据集并由人类专家进行标签校正;(2)一种基于学生-教师模型自训练框架的集成策略,以半监督方式实现14类手术器械的鲁棒分类。此外,我们采用加权数据加载器来处理困难类别标签并缓解类别不平衡问题。所提方法在使用类别权重的集成模型自训练中取得了85.88%的平均F1分数,在不使用类别权重处理噪声标签时为80.88%,显著优于现有方法,充分证明了其有效性。
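A small sketch of the class-weighting idea mentioned above, assuming inverse-frequency weights; the paper's full pipeline (active learning, student-teacher self-training, weighted data loaders) is far richer than this.

```python
import torch
import torch.nn as nn

def make_class_weighted_loss(class_counts):
    """Cross-entropy with inverse-frequency class weights to counter imbalance.
    The exact weighting scheme used in the paper is not specified here."""
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)   # inverse frequency, mean ~1
    return nn.CrossEntropyLoss(weight=weights)

# e.g. 14 surgical tool classes with very different frequencies (made-up counts)
counts = [5000, 120, 800, 60, 2500, 300, 90, 1500, 700, 50, 400, 220, 1000, 3000]
criterion = make_class_weighted_loss(counts)
logits = torch.randn(8, 14)
labels = torch.randint(0, 14, (8,))
print(criterion(logits, labels).item())
```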

Unveiling the Potential of Spike Streams for Foreground Occlusion Removal from Densely Continuous Views

  • paper_url: http://arxiv.org/abs/2307.00821
  • repo_url: None
  • paper_authors: Jiyuan Zhang, Shiyan Chen, Yajing Zheng, Zhaofei Yu, Tiejun Huang
  • for: removes dense foreground occlusion
  • methods: continuous multi-view imaging with one spike camera, novel model \textbf{SpkOccNet}, cross-view mutual attention mechanism
  • results: efficient removal of dense occlusions in diverse scenes, strong generalization.
    Abstract The extraction of a clean background image by removing foreground occlusion holds immense practical significance, but it also presents several challenges. Presently, the majority of de-occlusion research focuses on addressing this issue through the extraction and synthesis of discrete images from calibrated camera arrays. Nonetheless, the restoration quality tends to suffer when faced with dense occlusions or high-speed motions due to limited perspectives and motion blur. To successfully remove dense foreground occlusion, an effective multi-view visual information integration approach is required. Introducing the spike camera as a novel type of neuromorphic sensor offers promising capabilities with its ultra-high temporal resolution and high dynamic range. In this paper, we propose an innovative solution for tackling the de-occlusion problem through continuous multi-view imaging using only one spike camera without any prior knowledge of camera intrinsic parameters and camera poses. By rapidly moving the spike camera, we continually capture the dense stream of spikes from the occluded scene. To process the spikes, we build a novel model \textbf{SpkOccNet}, in which we integrate information of spikes from continuous viewpoints within multi-windows, and propose a novel cross-view mutual attention mechanism for effective fusion and refinement. In addition, we contribute the first real-world spike-based dataset \textbf{S-OCC} for occlusion removal. The experimental results demonstrate that our proposed model efficiently removes dense occlusions in diverse scenes while exhibiting strong generalization.
    摘要 通过去除前景遮挡来提取干净的背景图像具有重要的实际意义,但也面临诸多挑战。目前,大多数去遮挡研究通过从标定相机阵列提取并合成离散图像来解决该问题。然而,在面对密集遮挡或高速运动时,由于视角有限和运动模糊,恢复质量往往下降。要成功去除密集的前景遮挡,需要一种有效的多视图视觉信息融合方法。脉冲相机(spike camera)作为一种新型神经形态传感器,具有超高时间分辨率和高动态范围,展现出良好的潜力。在本文中,我们提出了一种创新的解决方案:在不需要相机内参和相机位姿等先验知识的情况下,仅使用一台脉冲相机进行连续多视图成像来完成去遮挡。通过快速移动脉冲相机,我们持续捕获被遮挡场景的密集脉冲流。为处理这些脉冲,我们构建了新模型 SpkOccNet,在多窗口内融合来自连续视点的脉冲信息,并提出了一种新的跨视图互注意力机制用于有效融合与细化。此外,我们还贡献了首个真实世界的基于脉冲的去遮挡数据集 S-OCC。实验结果表明,所提模型能够高效去除多种场景中的密集遮挡,并具有较强的泛化能力。
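A minimal sketch of a cross-view mutual attention block: each view's tokens attend to the other view and are refined by the result. Dimensions and the two-view setup are simplifying assumptions; SpkOccNet fuses spikes from many continuous viewpoints within multi-windows.

```python
import torch
import torch.nn as nn

class CrossViewMutualAttention(nn.Module):
    """Mutual cross-attention between token features of two viewpoints."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):                  # (B, N, dim) tokens per view
        a2b, _ = self.attn_ab(feat_a, feat_b, feat_b)   # view A queries view B
        b2a, _ = self.attn_ba(feat_b, feat_a, feat_a)   # view B queries view A
        return self.norm_a(feat_a + a2b), self.norm_b(feat_b + b2a)

fuse = CrossViewMutualAttention()
a, b = torch.randn(2, 256, 64), torch.randn(2, 256, 64)
out_a, out_b = fuse(a, b)
print(out_a.shape, out_b.shape)                         # (2, 256, 64) each
```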

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

  • paper_url: http://arxiv.org/abs/2307.00818
  • repo_url: https://github.com/idea-research/motion-x
  • paper_authors: Jing Lin, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
  • for: This paper is written for researchers and developers who work on 3D human pose estimation, motion capture, and computer vision.
  • methods: The paper presents a whole-body motion and text annotation pipeline that can automatically annotate motion from single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame.
  • results: The paper constructs Motion-X, a large-scale 3D expressive whole-body motion dataset that covers 96K motion sequences from massive scenes, with 13.7M precise 3D whole-body pose annotations and 13.7M frame-level whole-body pose descriptions. The paper demonstrates the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
    Abstract In this paper, we present Motion-X, a large-scale 3D expressive whole-body motion dataset. Existing motion datasets predominantly contain body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions. Moreover, they are primarily collected from limited laboratory scenes with textual descriptions manually labeled, which greatly limits their scalability. To overcome these limitations, we develop a whole-body motion and text annotation pipeline, which can automatically annotate motion from either single- or multi-view videos and provide comprehensive semantic labels for each video and fine-grained whole-body pose descriptions for each frame. This pipeline is of high precision, cost-effective, and scalable for further research. Based on it, we construct Motion-X, which comprises 13.7M precise 3D whole-body pose annotations (i.e., SMPL-X) covering 96K motion sequences from massive scenes. Besides, Motion-X provides 13.7M frame-level whole-body pose descriptions and 96K sequence-level semantic labels. Comprehensive experiments demonstrate the accuracy of the annotation pipeline and the significant benefit of Motion-X in enhancing expressive, diverse, and natural motion generation, as well as 3D whole-body human mesh recovery.
    摘要 在这篇论文中,我们介绍Motion-X,一个大规模的3D表达全身动作数据集。现有的动作数据集主要包含body只的姿势,缺乏面部表达、手势和细化pose描述。此外,它们主要从有限的实验室场景中收集,手动编注文本描述,这会很大限制其扩展性。为了超越这些限制,我们开发了一个整体动作和文本注释管线,可以自动从单个或多视图视频中提取动作,并提供每个视频的全面的semantic标签和每帧的细化全身姿势描述。这个管线具有高精度、成本效果和可扩展性,适用于进一步的研究。基于它,我们建立了Motion-X,包含1370万精度3D全身姿势注释(即SMPL-X),覆盖96000个动作序列。此外,Motion-X还提供1370万帧级全身姿势描述和96000个序列级semantic标签。实验证明注释管线的准确性和Motion-X的极大地改进了表达、多样化和自然的动作生成,以及3D全身人体模型的恢复。

SketchMetaFace: A Learning-based Sketching Interface for High-fidelity 3D Character Face Modeling

  • paper_url: http://arxiv.org/abs/2307.00804
  • repo_url: None
  • paper_authors: Zhongjin Luo, Dong Du, Heming Zhu, Yizhou Yu, Hongbo Fu, Xiaoguang Han
  • for: The paper is written for amateur users who want to quickly create high-fidelity 3D avatars with realistic faces.
  • methods: The system uses curvature-aware strokes to support the controllability of carving facial details, and a novel learning-based method called “Implicit and Depth Guided Mesh Modeling” (IDGMM) to achieve high-quality results with high efficiency. Additionally, the system provides a coarse-to-fine 2D sketching interface design and a data-driven stroke suggestion tool to support usability.
  • results: User studies demonstrate the superiority of the system over existing modeling tools in terms of ease of use and visual quality of results, and experimental analyses show that IDGMM reaches a better trade-off between accuracy and efficiency.
    Abstract Modeling 3D avatars benefits various application scenarios such as AR/VR, gaming, and filming. Character faces contribute significant diversity and vividity as a vital component of avatars. However, building 3D character face models usually requires a heavy workload with commercial tools, even for experienced artists. Various existing sketch-based tools fail to support amateurs in modeling diverse facial shapes and rich geometric details. In this paper, we present SketchMetaFace - a sketching system targeting amateur users to model high-fidelity 3D faces in minutes. We carefully design both the user interface and the underlying algorithm. First, curvature-aware strokes are adopted to better support the controllability of carving facial details. Second, considering the key problem of mapping a 2D sketch map to a 3D model, we develop a novel learning-based method termed "Implicit and Depth Guided Mesh Modeling" (IDGMM). It fuses the advantages of mesh, implicit, and depth representations to achieve high-quality results with high efficiency. In addition, to further support usability, we present a coarse-to-fine 2D sketching interface design and a data-driven stroke suggestion tool. User studies demonstrate the superiority of our system over existing modeling tools in terms of the ease to use and visual quality of results. Experimental analyses also show that IDGMM reaches a better trade-off between accuracy and efficiency. SketchMetaFace is available at https://zhongjinluo.github.io/SketchMetaFace/.
    摘要 3D人物建模可应用于AR/VR、游戏和影视等场景,而人物面部作为虚拟形象的重要组成部分,贡献了丰富的多样性与生动性。然而,构建3D人物面部模型通常需要使用商业工具进行大量工作,即使对有经验的美术师也是如此,而现有的草图工具难以支持新手用户建模多样的脸型和丰富的几何细节。在本文中,我们提出了SketchMetaFace——一个面向新手用户的草图系统,可在几分钟内创建高保真的3D面部模型。我们对用户界面和底层算法都进行了精心设计。首先,我们采用曲率感知笔画,以更好地支持对面部细节雕刻的可控性。其次,针对将2D草图映射到3D模型这一关键问题,我们提出了一种新的基于学习的方法,称为"隐式与深度引导的网格建模"(IDGMM),它融合了网格、隐式和深度三种表示的优势,在保证高效率的同时获得高质量结果。此外,为了进一步提升易用性,我们设计了由粗到细的2D草图界面,并提供了数据驱动的笔画建议工具。用户研究表明,我们的系统在易用性和结果视觉质量方面均优于现有建模工具;实验分析也显示,IDGMM在精度与效率之间取得了更好的平衡。SketchMetaFace 可在 https://zhongjinluo.github.io/SketchMetaFace/ 获取。

ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.00781
  • repo_url: None
  • paper_authors: Axi Niu, Pham Xuan Trung, Kang Zhang, Jinqiu Sun, Yu Zhu, In So Kweon, Yanning Zhang
  • for: 本论文主要研究基于扩散模型的单幅图像超分辨率(SR)问题。
  • methods: 该论文提出了一种基于条件扩散模型的图像SR方法,通过确定性的迭代去噪过程实现超分辨率。
  • results: 该方法在多个 benchmark 数据集上进行了广泛的实验,并证明了与之前的尝试相比,它可以提供更高的图像质量和更快的执行速度。
    Abstract Diffusion models have gained significant popularity in the field of image-to-image translation. Previous efforts applying diffusion models to image super-resolution (SR) have demonstrated that iteratively refining pure Gaussian noise using a U-Net architecture trained on denoising at various noise levels can yield satisfactory high-resolution images from low-resolution inputs. However, this iterative refinement process comes with the drawback of low inference speed, which strongly limits its applications. To speed up inference and further enhance the performance, our research revisits diffusion models in image super-resolution and proposes a straightforward yet significant diffusion model-based super-resolution method called ACDMSR (accelerated conditional diffusion model for image super-resolution). Specifically, our method adapts the standard diffusion model to perform super-resolution through a deterministic iterative denoising process. Our study also highlights the effectiveness of using a pre-trained SR model to provide the conditional image of the given low-resolution (LR) image to achieve superior high-resolution results. We demonstrate that our method surpasses previous attempts in qualitative and quantitative results through extensive experiments conducted on benchmark datasets such as Set5, Set14, Urban100, BSD100, and Manga109. Moreover, our approach generates more visually realistic counterparts for low-resolution images, emphasizing its effectiveness in practical scenarios.
    摘要 扩散模型在图像到图像翻译领域获得了广泛应用。以往将扩散模型用于图像超分辨率(SR)的工作表明,使用在不同噪声水平上训练去噪的U-Net结构对纯高斯噪声进行迭代细化,可以从低分辨率输入得到令人满意的高分辨率图像。然而,这种迭代细化过程推理速度较慢,严重限制了其应用。为了加速推理并进一步提升性能,我们重新审视了扩散模型在图像SR中的应用,提出了一种简单而有效的基于扩散模型的SR方法,称为ACDMSR(加速条件扩散模型图像超分辨率)。具体而言,我们的方法将标准扩散模型改造为通过确定性的迭代去噪过程来完成SR。研究还表明,利用预训练的SR模型为给定低分辨率(LR)图像提供条件图像,可以获得更优的高分辨率结果。我们在Set5、Set14、Urban100、BSD100和Manga109等基准数据集上进行了大量实验,结果表明,我们的方法在定性和定量上均优于之前的方法。此外,我们的方法能为低分辨率图像生成视觉上更真实的高分辨率对应图像,凸显了其在实际场景中的有效性。
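A sketch of a deterministic (eta = 0) DDIM-style sampling loop conditioned on a pre-upscaled LR image, in the spirit of the deterministic iterative denoising described above. The noise predictor and the conditioning interface are placeholders, not the ACDMSR network.

```python
import torch

@torch.no_grad()
def ddim_sr_sample(eps_model, lr_cond, alphas_cumprod, steps=25, shape=(1, 3, 128, 128)):
    """Deterministic DDIM-style sampling conditioned on an LR image.
    eps_model(x, cond, t) is a placeholder noise predictor."""
    device = lr_cond.device
    T = alphas_cumprod.shape[0]
    ts = torch.linspace(T - 1, 0, steps, device=device).long()
    x = torch.randn(shape, device=device)
    for i, t in enumerate(ts):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0, device=device)
        eps = eps_model(x, lr_cond, t)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()        # predicted clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps    # deterministic update
    return x

# usage with a dummy noise predictor and a (bicubic-upscaled) LR condition
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_eps = lambda x, cond, t: torch.zeros_like(x)
print(ddim_sr_sample(dummy_eps, torch.zeros(1, 3, 128, 128), alphas_cumprod).shape)
```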

Learning Noise-Resistant Image Representation by Aligning Clean and Noisy Domains

  • paper_url: http://arxiv.org/abs/2307.00761
  • repo_url: None
  • paper_authors: Yanhui Guo, Xiaolin Wu, Fangzhou Luo
  • for: 提高图像表示对噪声的鲁棒性
  • methods: 分别优化噪声鲁棒(NR)域与无噪(NF)域两个规范空间,并设计目标引导的隐式神经映射函数将NR表示翻译到NF域
  • results: 在复杂的真实噪声图像上取得出色的性能与鲁棒性
    Abstract Recent supervised and unsupervised image representation learning algorithms have achieved quantum leaps. However, these techniques do not account for representation resilience against noise in their design paradigms. Consequently, these effective methods suffer failure when confronted with noise outside the training distribution, such as complicated real-world noise that is usually opaque to model training. To address this issue, dual domains are optimized to separately model a canonical space for noisy representations, namely the Noise-Robust (NR) domain, and a twinned canonical clean space, namely the Noise-Free (NF) domain, by maximizing the interaction information between the representations. Given the dual canonical domains, we design a target-guided implicit neural mapping function to accurately translate the NR representations to the NF domain, yielding noise-resistant representations by eliminating noise regencies. The proposed method is a scalable module that can be readily integrated into existing learning systems to improve their robustness against noise. Comprehensive trials of various tasks using both synthetic and real-world noisy data demonstrate that the proposed Target-Guided Dual-Domain Translation (TDDT) method is able to achieve remarkable performance and robustness in the face of complex noisy images.
    摘要 近期的有监督和无监督图像表示学习算法取得了巨大进展。然而,这些技术在设计范式中并未考虑表示对噪声的鲁棒性。因此,当面对训练分布之外的噪声(例如复杂的真实世界噪声,其通常无法在模型训练中显式建模)时,这些原本有效的方法会失效。为解决这一问题,我们分别优化两个域:一个用于建模噪声表示的规范空间,即噪声鲁棒(NR)域;以及与之孪生的干净规范空间,即无噪(NF)域,并通过最大化表示之间的交互信息来学习。在两个规范域的基础上,我们设计了目标引导的隐式神经映射函数,将NR表示准确地翻译到NF域,通过消除噪声残留得到抗噪表示。所提方法是一个可扩展的模块,可方便地集成到现有学习系统中以提升其抗噪能力。在合成与真实噪声数据上针对多种任务的全面实验表明,所提出的目标引导双域翻译(TDDT)方法在复杂噪声图像面前能够取得出色的性能和鲁棒性。

Structured Network Pruning by Measuring Filter-wise Interactions

  • paper_url: http://arxiv.org/abs/2307.00758
  • repo_url: None
  • paper_authors: Wenting Tang, Xingxing Wei, Bo Li
  • for: 降低深度学习模型的计算成本,以便在实际应用中实现快速的模型训练和推理。
  • methods: integrate filter-wise interaction into redundancy criterion, propose structured network pruning approach SNPFI, automatically assign proper sparsity and eliminate useless filters.
  • results: 对多种常用的深度学习模型(AlexNet、MobileNetv1、ResNet-50)和多种图像分类 dataset(MNIST、CIFAR-10、ImageNet)进行了实验,结果显示SNPFI可以减少网络计算量达60%以上,而模型的分类精度仍然保持在acceptable水平。
    Abstract Structured network pruning is a practical approach to reduce computation cost directly while retaining the CNNs' generalization performance in real applications. However, identifying redundant filters is a core problem in structured network pruning, and current redundancy criteria only focus on individual filters' attributes. When pruning sparsity increases, these redundancy criteria are not effective or efficient enough. Since the filter-wise interaction also contributes to the CNN's prediction accuracy, we integrate the filter-wise interaction into the redundancy criterion. In our criterion, we introduce the filter importance and filter utilization strength to reflect the decision ability of individual and multiple filters. Utilizing this new redundancy criterion, we propose a structured network pruning approach SNPFI (Structured Network Pruning by measuring Filter-wise Interaction). During the pruning, the SNPFI can automatically assign the proper sparsity based on the filter utilization strength and eliminate the useless filters by filter importance. After the pruning, the SNPFI can recover pruned model's performance effectively without iterative training by minimizing the interaction difference. We empirically demonstrate the effectiveness of the SNPFI with several commonly used CNN models, including AlexNet, MobileNetv1, and ResNet-50, on various image classification datasets, including MNIST, CIFAR-10, and ImageNet. For all experimental CNN models, nearly 60% of computation is reduced in a network compression while the classification accuracy remains.
    摘要 《结构化网络剔除》是一种实用的方法,可以直接降低计算成本,同时保持深度学习网络的泛化性能。然而,识别重复的滤波器是结构化网络剔除的核心问题,现有的重复性标准只考虑个体滤波器的特性。随着剔除率增加,这些重复性标准不再有效率。因为滤波器之间的交互也对深度学习网络的预测精度产生影响,我们将滤波器之间的交互纳入重复性标准中。在我们的标准中,我们引入了滤波器重要性和滤波器利用强度,以反映个体和多个滤波器的决策能力。通过这种新的重复性标准,我们提出了结构化网络剔除方法(SNPFI)。在剔除过程中,SNPFI可以自动根据滤波器利用强度分配适当的稀缺,并将无用的滤波器 eliminate 。剔除后,SNPFI可以有效地恢复剔除后的模型性能,无需迭代训练,只需要对交互差异进行最小化。我们通过多种常用的深度学习模型,包括AlexNet、MobileNetv1和ResNet-50,在多个图像分类任务上进行了实验,包括MNIST、CIFAR-10和ImageNet。对所有的实验深度学习模型来说,SNPFI可以在网络压缩中降低大约60%的计算成本,而且分类精度保持不变。
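To make the mechanics of structured filter removal concrete, here is a sketch that keeps the highest-scoring filters of a convolution and rebuilds a smaller layer. The L1-norm score is only a stand-in; SNPFI ranks filters by importance and filter-wise interaction instead.

```python
import torch
import torch.nn as nn

def prune_conv_filters(conv: nn.Conv2d, keep_ratio: float = 0.6):
    """Keep the filters with the largest L1 norm and rebuild a smaller Conv2d.
    The scoring here is a placeholder for SNPFI's interaction-aware criterion."""
    importance = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per filter
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    keep_idx = torch.argsort(importance, descending=True)[:n_keep]
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep_idx])
    return new_conv, keep_idx   # keep_idx is needed to slice the next layer's inputs

conv = nn.Conv2d(64, 128, 3, padding=1)
pruned, kept = prune_conv_filters(conv, keep_ratio=0.4)
print(pruned.weight.shape)      # torch.Size([51, 64, 3, 3])
```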

Graph-level Anomaly Detection via Hierarchical Memory Networks

  • paper_url: http://arxiv.org/abs/2307.00755
  • repo_url: https://github.com/niuchx/himnet
  • paper_authors: Chaoxi Niu, Guansong Pang, Ling Chen
  • for: 本研究旨在开发一种基于图自编码器的层次记忆网络,用于图级异常检测,以识别结构和节点属性异常的图。
  • methods: 本方法通过图自编码器网络学习层次记忆模块,分别建模节点级和图级的正常模式。
  • results: 在16个真实图数据集上的大量实验表明,本方法在检测局部异常图和全局异常图方面具有显著优势,并且对异常污染具有较强的鲁棒性。
    Abstract Graph-level anomaly detection aims to identify abnormal graphs that exhibit deviant structures and node attributes compared to the majority in a graph set. One primary challenge is to learn normal patterns manifested in both fine-grained and holistic views of graphs for identifying graphs that are abnormal in part or in whole. To tackle this challenge, we propose a novel approach called Hierarchical Memory Networks (HimNet), which learns hierarchical memory modules -- node and graph memory modules -- via a graph autoencoder network architecture. The node-level memory module is trained to model fine-grained, internal graph interactions among nodes for detecting locally abnormal graphs, while the graph-level memory module is dedicated to the learning of holistic normal patterns for detecting globally abnormal graphs. The two modules are jointly optimized to detect both locally- and globally-anomalous graphs. Extensive empirical results on 16 real-world graph datasets from various domains show that i) HimNet significantly outperforms the state-of-art methods and ii) it is robust to anomaly contamination. Codes are available at: https://github.com/Niuchx/HimNet.
    摘要 GRAPH-LEVEL ANOMALY DETECTION 目标是识别异常图,其表现出异常结构和节点属性比多数图在图集中异常。一个主要挑战是学习图集中的正常模式,包括细化和总体两种视图。为解决这个挑战,我们提出了一种新的方法called Hierarchical Memory Networks(HimNet)。这种方法通过图自动编码网络架构学习层次记忆模块——节点记忆模块和图记忆模块。节点级别记忆模块用于模型图内节点之间的细化交互,检测本地异常图,而图级别记忆模块专门学习总体正常模式,检测全局异常图。两个模块共同优化,可以检测本地和全局异常图。我们对16个真实世界图据集进行了广泛的实验,结果显示:i) HimNet在比较方法中显著超越,ii) 它对异常污染具有较好的抗性。codes可以在https://github.com/Niuchx/HimNet中找到。

LXL: LiDAR Excluded Lean 3D Object Detection with 4D Imaging Radar and Camera Fusion

  • paper_url: http://arxiv.org/abs/2307.00724
  • repo_url: None
  • paper_authors: Weiyi Xiong, Jianan Liu, Tao Huang, Qing-Long Han, Yuxuan Xia, Bing Zhu
  • for: 本研究旨在提高自动驾驶摄像头和4D成像雷达融合的3D对象检测性能。
  • methods: 本研究采用"基于采样"的图像视图变换策略,提出"雷达占据辅助的基于深度的采样",并与"基于深度的溅射"等提升策略进行了对比。
  • results: 实验结果表明,提出的方法在VoD和TJ4DRadSet数据集上显著超越了现有的3D对象检测方法,而且在不同的提升设定下进行比较性能表现最佳。
    Abstract As an emerging technology and a relatively affordable device, the 4D imaging radar has already been confirmed effective in performing 3D object detection in autonomous driving. Nevertheless, the sparsity and noisiness of 4D radar point clouds hinder further performance improvement, and in-depth studies about its fusion with other modalities are lacking. On the other hand, most of the camera-based perception methods transform the extracted image perspective view features into the bird's-eye view geometrically via "depth-based splatting" proposed in Lift-Splat-Shoot (LSS), and some researchers exploit other modals such as LiDARs or ordinary automotive radars for enhancement. Recently, a few works have applied the "sampling" strategy for image view transformation, showing that it outperforms "splatting" even without image depth prediction. However, the potential of "sampling" is not fully unleashed. In this paper, we investigate the "sampling" view transformation strategy on the camera and 4D imaging radar fusion-based 3D object detection. In the proposed model, LXL, predicted image depth distribution maps and radar 3D occupancy grids are utilized to aid image view transformation, called "radar occupancy-assisted depth-based sampling". Experiments on VoD and TJ4DRadSet datasets show that the proposed method outperforms existing 3D object detection methods by a significant margin without bells and whistles. Ablation studies demonstrate that our method performs the best among different enhancement settings.
    摘要 4D成像雷达作为一种新兴且相对廉价的传感器,已被证实能够有效完成自动驾驶中的3D目标检测。然而,4D雷达点云的稀疏性和噪声阻碍了性能的进一步提升,且其与其他模态融合的深入研究仍然欠缺。另一方面,大多数基于相机的感知方法借助Lift-Splat-Shoot(LSS)提出的"基于深度的溅射(splatting)"将提取的图像透视特征几何变换到鸟瞰视角,部分研究者还利用LiDAR或普通车载雷达等其他模态进行增强。最近,一些工作将"采样(sampling)"策略应用于图像视图变换,结果显示即使不进行图像深度预测,其效果也优于"溅射"。然而,"采样"的潜力尚未被充分发掘。在本文中,我们研究"采样"视图变换策略在相机与4D成像雷达融合的3D目标检测中的应用。在所提出的LXL模型中,预测的图像深度分布图和雷达3D占据栅格被用于辅助图像视图变换,称为"雷达占据辅助的基于深度的采样"。在VoD和TJ4DRadSet数据集上的实验表明,所提方法显著优于现有的3D目标检测方法,消融实验也显示我们的方法在不同的增强设置中表现最佳。

SSC3OD: Sparsely Supervised Collaborative 3D Object Detection from LiDAR Point Clouds

  • paper_url: http://arxiv.org/abs/2307.00717
  • repo_url: None
  • paper_authors: Yushan Han, Hui Zhang, Honglei Zhang, Yidong Li
  • for: 提高自动驾驶中多智能体间的协同3D目标检测效果,适用于标注数据有限的情况。
  • methods: 提出一种稀疏监督的协同3D目标检测框架(SSC3OD),包括两个新组件:基于柱体的掩码自编码器(Pillar-MAE)和实例挖掘模块。
  • results: 在三个大规模数据集上的大量实验表明,所提出的SSC3OD能够有效提升稀疏监督协同3D目标检测器的性能。
    Abstract Collaborative 3D object detection, with its improved interaction advantage among multiple agents, has been widely explored in autonomous driving. However, existing collaborative 3D object detectors in a fully supervised paradigm heavily rely on large-scale annotated 3D bounding boxes, which is labor-intensive and time-consuming. To tackle this issue, we propose a sparsely supervised collaborative 3D object detection framework SSC3OD, which only requires each agent to randomly label one object in the scene. Specifically, this model consists of two novel components, i.e., the pillar-based masked autoencoder (Pillar-MAE) and the instance mining module. The Pillar-MAE module aims to reason over high-level semantics in a self-supervised manner, and the instance mining module generates high-quality pseudo labels for collaborative detectors online. By introducing these simple yet effective mechanisms, the proposed SSC3OD can alleviate the adverse impacts of incomplete annotations. We generate sparse labels based on collaborative perception datasets to evaluate our method. Extensive experiments on three large-scale datasets reveal that our proposed SSC3OD can effectively improve the performance of sparsely supervised collaborative 3D object detectors.
    摘要 协同3D目标检测凭借多智能体之间的交互优势,已在自动驾驶领域得到广泛研究。然而,现有的全监督协同3D目标检测器严重依赖大规模标注的3D包围框,标注过程费时费力。为解决这一问题,我们提出了一种稀疏监督的协同3D目标检测框架SSC3OD,仅要求每个智能体在场景中随机标注一个目标。具体而言,该模型包含两个新组件:基于柱体的掩码自编码器(Pillar-MAE)和实例挖掘模块。Pillar-MAE模块以自监督方式对高层语义进行推理,而实例挖掘模块在线为协同检测器生成高质量的伪标签。通过引入这些简单而有效的机制,所提出的SSC3OD能够缓解标注不完整带来的不利影响。我们基于协同感知数据集生成稀疏标签来评估该方法,在三个大规模数据集上的大量实验表明,所提出的SSC3OD能够有效提升稀疏监督协同3D目标检测器的性能。

JourneyDB: A Benchmark for Generative Image Understanding

  • paper_url: http://arxiv.org/abs/2307.00716
  • repo_url: None
  • paper_authors: Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Hongsheng Li
  • for: 本研究旨在探讨视语模型是否能够理解生成的图像。
  • methods: 我们提供了一个大规模的 dataset,名为 JourneyDB,用于多模态视觉理解生成图像。我们还设计了4个比较标准的 benchmark,用于衡量生成图像理解的内容和风格理解能力。
  • results: 我们通过对当前领域的状态级模型进行应用测试,发现它们在生成图像理解方面存在一定的局限性和缺陷。我们还进行了一个深入的分析,以帮助改进现有的多模态模型。
    Abstract While recent advancements in vision-language models have revolutionized multi-modal understanding, it remains unclear whether they possess the capabilities of comprehending the generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, for which there are significant difficulties for the models to fully apprehend. To this end, we present a large-scale dataset, JourneyDB, for multi-modal visual understanding in generative images. Our curated dataset covers 4 million diverse and high-quality generated images paired with the text prompts used to produce them. We further design 4 benchmarks to quantify the performance of generated image understanding in terms of both content and style interpretation. These benchmarks include prompt inversion, style retrieval, image captioning and visual question answering. Lastly, we assess the performance of current state-of-the-art multi-modal models when applied to JourneyDB, and provide an in-depth analysis of their strengths and limitations in generated content understanding. We hope the proposed dataset and benchmarks will facilitate the research in the field of generative content understanding. The dataset will be available on https://journeydb.github.io.

Guided Patch-Grouping Wavelet Transformer with Spatial Congruence for Ultra-High Resolution Segmentation

  • paper_url: http://arxiv.org/abs/2307.00711
  • repo_url: None
  • paper_authors: Deyi Ji, Feng Zhao, Hongtao Lu
  • for: 本文提出了一种新的ultra-high resolution(UHR)图像分割方法,以解决现有方法在内存成本和本地特征精度之间的矛盾。
  • methods: 该方法基于一种新的Transformer($\mathcal{T}$)-卷积神经网络(CNN,$\mathcal{C}$)相互学习框架:$\mathcal{T}$以整幅UHR图像为输入,提取局部细节和细粒度的长程上下文依赖;$\mathcal{C}$以降采样图像为输入,学习类别级的深层上下文。为了提高推理速度并降低计算复杂度,$\mathcal{T}$将原始UHR图像划分为图像块并动态分组,随后使用轻量级多头小波Transformer(WFormer)网络学习低层局部细节;由于空间上相距较远的图像块也可能被分到同一组,该过程同时捕获了细粒度的长程上下文依赖。此外,$\mathcal{C}$生成的掩码用于引导图像块分组过程,提供启发式决策,并利用两个分支之间的一致性约束保持图像块间的空间一致性。整体上,多阶段过程以金字塔方式堆叠。
  • results: 实验显示,GPWFormer在五个基准数据集上相比现有方法取得了显著提升。
    Abstract Most existing ultra-high resolution (UHR) segmentation methods always struggle in the dilemma of balancing memory cost and local characterization accuracy, which are both taken into account in our proposed Guided Patch-Grouping Wavelet Transformer (GPWFormer) that achieves impressive performances. In this work, GPWFormer is a Transformer ($\mathcal{T}$)-CNN ($\mathcal{C}$) mutual leaning framework, where $\mathcal{T}$ takes the whole UHR image as input and harvests both local details and fine-grained long-range contextual dependencies, while $\mathcal{C}$ takes downsampled image as input for learning the category-wise deep context. For the sake of high inference speed and low computation complexity, $\mathcal{T}$ partitions the original UHR image into patches and groups them dynamically, then learns the low-level local details with the lightweight multi-head Wavelet Transformer (WFormer) network. Meanwhile, the fine-grained long-range contextual dependencies are also captured during this process, since patches that are far away in the spatial domain can also be assigned to the same group. In addition, masks produced by $\mathcal{C}$ are utilized to guide the patch grouping process, providing a heuristics decision. Moreover, the congruence constraints between the two branches are also exploited to maintain the spatial consistency among the patches. Overall, we stack the multi-stage process in a pyramid way. Experiments show that GPWFormer outperforms the existing methods with significant improvements on five benchmark datasets.
    摘要 现有的超高分辨率(UHR)图像分割方法大多难以在内存成本与局部特征刻画精度之间取得平衡,而我们提出的引导图像块分组小波Transformer(GPWFormer)同时兼顾了这两方面,取得了出色的表现。在本工作中,GPWFormer是一个Transformer($\mathcal{T}$)与CNN($\mathcal{C}$)相互学习的框架:$\mathcal{T}$以整幅UHR图像为输入,同时获取局部细节和细粒度的长程上下文依赖,而$\mathcal{C}$以降采样图像为输入,学习类别级的深层上下文。为了获得较高的推理速度和较低的计算复杂度,$\mathcal{T}$将原始UHR图像划分为图像块并动态分组,随后用轻量级多头小波Transformer(WFormer)网络学习低层局部细节。由于空间上相距较远的图像块也可以被分配到同一组,这一过程同时捕获了细粒度的长程上下文依赖。此外,$\mathcal{C}$生成的掩码被用于引导图像块分组过程,提供启发式决策;两个分支之间的一致性约束也被用来保持图像块间的空间一致性。总体而言,我们以金字塔方式堆叠多阶段过程。实验表明,GPWFormer在五个基准数据集上相比现有方法取得了显著提升。

Efficient Visual Fault Detection for Freight Train Braking System via Heterogeneous Self Distillation in the Wild

  • paper_url: http://arxiv.org/abs/2307.00701
  • repo_url: https://github.com/MVME-HBUT/HSD-FTI-FDet
  • paper_authors: Yang Zhang, Huilin Pan, Yang Zhou, Mingying Li, Guodong Sun
  • for: 本文旨在在硬件受限的真实工程环境下,实现货运列车的高效视觉故障检测。
  • methods: 本文提出了一种异构自蒸馏框架,在满足低资源需求的同时兼顾检测精度和速度。该框架采用轻量级骨干网络提取特征,并构建了新的异构知识颈部;该颈部通过并行编码建模通道间的位置信息和长程依赖,以优化特征提取能力,并利用广义分布获得更可靠、更准确的包围框估计。
  • results: 在四个故障数据集上的实验表明,该框架可达到超过37帧/秒的速度,与传统蒸馏方法相比保持最高精度,并且相比最先进方法具有更低的内存占用和最小的模型规模。
    Abstract Efficient visual fault detection of freight trains is a critical part of ensuring the safe operation of railways under the restricted hardware environment. Although deep learning-based approaches have excelled in object detection, the efficiency of freight train fault detection is still insufficient to apply in real-world engineering. This paper proposes a heterogeneous self-distillation framework to ensure detection accuracy and speed while satisfying low resource requirements. The privileged information in the output feature knowledge can be transferred from the teacher to the student model through distillation to boost performance. We first adopt a lightweight backbone to extract features and generate a new heterogeneous knowledge neck. Such neck models positional information and long-range dependencies among channels through parallel encoding to optimize feature extraction capabilities. Then, we utilize the general distribution to obtain more credible and accurate bounding box estimates. Finally, we employ a novel loss function that makes the network easily concentrate on values near the label to improve learning efficiency. Experiments on four fault datasets reveal that our framework can achieve over 37 frames per second and maintain the highest accuracy in comparison with traditional distillation approaches. Moreover, compared to state-of-the-art methods, our framework demonstrates more competitive performance with lower memory usage and the smallest model size.
    摘要 在硬件受限的环境下,高效的货运列车视觉故障检测是保障铁路安全运行的关键环节。尽管基于深度学习的方法在目标检测方面表现出色,货运列车故障检测的效率仍不足以满足实际工程应用。本文提出了一种异构自蒸馏框架,在满足低资源需求的同时保证检测精度和速度。输出特征知识中的特权信息可以通过蒸馏从教师模型传递给学生模型,以提升性能。我们首先采用轻量级骨干网络提取特征,并生成新的异构知识颈部。该颈部通过并行编码建模通道间的位置信息和长程依赖,以优化特征提取能力。随后,我们利用广义分布获得更可靠、更准确的包围框估计。最后,我们采用一种新的损失函数,使网络更容易关注标签附近的取值,从而提高学习效率。在四个故障数据集上的实验表明,我们的框架可达到超过37帧/秒的速度,并与传统蒸馏方法相比保持最高精度。此外,与最先进方法相比,我们的框架以更低的内存占用和最小的模型规模展现出更具竞争力的性能。
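A minimal sketch of the knowledge-distillation loss that self-distillation frameworks build on: softened KL between teacher and student logits plus a hard-label term. Temperature and weighting are assumed values; the paper's heterogeneous-neck transfer is more involved than this.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.7):
    """Softened-KL distillation term plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10)            # student logits for 10 fault classes (example)
t = torch.randn(8, 10)            # frozen teacher logits
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```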

Camera Calibration from a Single Imaged Ellipsoid: A Moon Calibration Algorithm

  • paper_url: http://arxiv.org/abs/2307.00689
  • repo_url: None
  • paper_authors: Kalani R. Danas Rivera, Mason A. Peck
  • for: 本研究旨在利用太阳系中扩展天体的图像进行航天器相机标定。
  • methods: 扩展天体(行星和卫星)可用三轴椭球良好建模,其成像投影为圆锥曲线(通常为椭圆);该方法将成像椭圆与观测者相对目标的状态信息结合,从单幅非球形椭球图像实现相机标定。
  • results: 该算法可从单幅图像估计相机的焦距和主点,并可利用多幅图像进一步减小估计的一倍标准差不确定度。
    Abstract This work introduces a method that applies images of the extended bodies in the solar system to spacecraft camera calibration. The extended bodies consist of planets and moons that are well-modeled by triaxial ellipsoids. When imaged, the triaxial ellipsoid projects to a conic section which is generally an ellipse. This work combines the imaged ellipse with information on the observer's target-relative state to achieve camera calibration from a single imaged ellipsoid. As such, this work is the first to accomplish camera calibration from a single, non-spherical imaged ellipsoid. The camera calibration algorithm is applied to synthetic images of ellipsoids as well as planetary images of Saturn's moons as captured by the Cassini spacecraft. From a single image, the algorithm estimates the focal length and principal point of Cassini's Narrow Angle Camera within 1.0 mm and 10 pixels, respectively. With multiple images, the one standard deviation uncertainty in focal length and principal point estimates reduce to 0.5 mm and 3.1 pixels, respectively. Though created for spacecraft camera calibration in mind, this work also generalizes to terrestrial camera calibration using any number of imaged ellipsoids.
    摘要 本文提出了一种利用太阳系中扩展天体图像进行航天器相机标定的方法。这些扩展天体(行星和卫星)可以很好地用三轴椭球建模,其成像投影为圆锥曲线,通常是椭圆。本文将成像椭圆与观测者相对目标的状态信息相结合,从单幅成像椭球实现相机标定,因而是首个利用单幅非球形成像椭球完成相机标定的工作。该标定算法被应用于合成椭球图像以及Cassini探测器拍摄的土星卫星图像:从单幅图像即可将Cassini窄角相机的焦距和主点分别估计到1.0 mm和10像素以内;使用多幅图像时,焦距和主点估计的一倍标准差不确定度分别降至0.5 mm和3.1像素。虽然该方法面向航天器相机标定而设计,但同样适用于利用任意数量成像椭球的地面相机标定。
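As a small worked building block, the sketch below fits a general conic to limb points of an imaged ellipsoid by direct least squares. It covers only the ellipse-fitting step; recovering focal length and principal point additionally requires the observer's target-relative state, as the abstract describes.

```python
import numpy as np

def fit_conic(x, y):
    """Direct least-squares fit of a general conic ax^2+bxy+cy^2+dx+ey+f=0.
    Returns the coefficient vector up to scale (smallest right singular vector)."""
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    return vt[-1]

# synthetic test: points on the ellipse (x/4)^2 + (y/2)^2 = 1
t = np.linspace(0, 2 * np.pi, 200)
coef = fit_conic(4 * np.cos(t), 2 * np.sin(t))
coef /= coef[0]            # normalize so the x^2 coefficient is 1
print(np.round(coef, 3))   # ~ [1, 0, 4, 0, 0, -16]
```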

A Proximal Algorithm for Network Slimming

  • paper_url: http://arxiv.org/abs/2307.00684
  • repo_url: None
  • paper_authors: Kevin Bui, Fanghui Xue, Fredrick Park, Yingyong Qi, Jack Xin
  • for: 本文研究通道剪枝方法,用于减少卷积神经网络(CNN)的参数量,以提高模型的计算效率和存储利用率。
  • methods: 本文提出 proximal network slimming(proximal NS)算法,并在 Kurdyka-{\L}ojasiewicz 假设下证明其全局收敛性;proximal NS 无需选择缩放因子阈值,对剪枝后的 CNN 进行微调也是可选的。
  • results: 实验表明,仅经过一轮训练,proximal NS 即可获得精度与压缩率均具竞争力的 CNN 模型;本文在 CIFAR 10/100 上用 VGGNet、DenseNet 和 ResNet 验证了 proximal NS 的有效性。
    Abstract As a popular channel pruning method for convolutional neural networks (CNNs), network slimming (NS) has a three-stage process: (1) it trains a CNN with $\ell_1$ regularization applied to the scaling factors of the batch normalization layers; (2) it removes channels whose scaling factors are below a chosen threshold; and (3) it retrains the pruned model to recover the original accuracy. This time-consuming, three-step process is a result of using subgradient descent to train CNNs. Because subgradient descent does not exactly train CNNs towards sparse, accurate structures, the latter two steps are necessary. Moreover, subgradient descent does not have any convergence guarantee. Therefore, we develop an alternative algorithm called proximal NS. Our proposed algorithm trains CNNs towards sparse, accurate structures, so identifying a scaling factor threshold is unnecessary and fine tuning the pruned CNNs is optional. Using Kurdyka-{\L}ojasiewicz assumptions, we establish global convergence of proximal NS. Lastly, we validate the efficacy of the proposed algorithm on VGGNet, DenseNet and ResNet on CIFAR 10/100. Our experiments demonstrate that after one round of training, proximal NS yields a CNN with competitive accuracy and compression.
    摘要 作为一种流行的卷积神经网络(CNN)通道剪枝方法,网络精简(NS)包含三个阶段:(1)在批归一化层的缩放因子上施加 $\ell_1$ 正则化训练 CNN;(2)移除缩放因子低于所选阈值的通道;(3)重新训练剪枝后的模型以恢复原始精度。这一耗时的三步流程源于使用次梯度下降训练 CNN:由于次梯度下降并不能将 CNN 精确地训练到稀疏且准确的结构,后两个步骤因此必不可少;此外,次梯度下降也缺乏收敛性保证。为此,我们提出了一种替代算法,称为 proximal NS。该算法直接将 CNN 训练到稀疏且准确的结构,因此无需确定缩放因子阈值,对剪枝后的 CNN 进行微调也是可选的。基于 Kurdyka-{\L}ojasiewicz 假设,我们证明了 proximal NS 的全局收敛性。最后,我们在 CIFAR 10/100 上用 VGGNet、DenseNet 和 ResNet 验证了所提算法的有效性。实验表明,仅经过一轮训练,proximal NS 即可得到精度与压缩率均具竞争力的 CNN。
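A minimal sketch of the proximal update at the heart of such an algorithm: after an ordinary gradient step, the batch-norm scaling factors are soft-thresholded, which is the proximal operator of the l1 penalty. Hyperparameters are placeholders and this is not the full proximal NS algorithm.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def proximal_l1_step(model: nn.Module, lam: float, lr: float):
    """Soft-threshold BN scaling factors: gamma <- sign(gamma) * max(|gamma| - lr*lam, 0),
    i.e. the proximal operator of lam * ||gamma||_1 with step size lr."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.affine:
            g = m.weight
            g.copy_(torch.sign(g) * torch.clamp(g.abs() - lr * lam, min=0.0))

# usage inside a normal training loop (model/optimizer assumed to exist):
#   loss.backward(); optimizer.step(); proximal_l1_step(model, lam=1e-4, lr=0.1)
```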

Pay Attention to the Atlas: Atlas-Guided Test-Time Adaptation Method for Robust 3D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.00676
  • repo_url: None
  • paper_authors: Jingjie Guo, Weitong Zhang, Matthew Sinclair, Daniel Rueckert, Chen Chen
  • for: Improve the robustness of 3D medical image segmentation models and address the performance drop caused by the distribution shift between training and test data.
  • methods: An atlas-guided test-time adaptation (TTA) method that needs only a single unlabeled test sample: the network is adapted by registering its prediction to a learned atlas and minimizing an atlas-based loss. Channel and spatial attention blocks are additionally adapted to improve adaptability.
  • results: Extensive experiments on datasets from multiple sites show that the AdaAtlas-Attention variant delivers superior robustness gains for 3D medical image segmentation, clearly outperforming competing TTA methods.
    Abstract Convolutional neural networks (CNNs) often suffer from poor performance when tested on target data that differs from the training (source) data distribution, particularly in medical imaging applications where variations in imaging protocols across different clinical sites and scanners lead to different imaging appearances. However, re-accessing source training data for unsupervised domain adaptation or labeling additional test data for model fine-tuning can be difficult due to privacy issues and high labeling costs, respectively. To solve this problem, we propose a novel atlas-guided test-time adaptation (TTA) method for robust 3D medical image segmentation, called AdaAtlas. AdaAtlas only takes one single unlabeled test sample as input and adapts the segmentation network by minimizing an atlas-based loss. Specifically, the network is adapted so that its prediction after registration is aligned with the learned atlas in the atlas space, which helps to reduce anatomical segmentation errors at test time. In addition, different from most existing TTA methods which restrict the adaptation to batch normalization blocks in the segmentation network only, we further exploit the use of channel and spatial attention blocks for improved adaptability at test time. Extensive experiments on multiple datasets from different sites show that AdaAtlas with attention blocks adapted (AdaAtlas-Attention) achieves superior performance improvements, greatly outperforming other competitive TTA methods.
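A conceptual PyTorch sketch of the test-time adaptation loop described above, assuming the warp to atlas space is supplied by an external function (`register_to_atlas`), that the atlas is available as a probability map (`atlas_prob`), and that attention blocks can be identified by an `attn` substring in their module names; these names, the KL-based atlas loss, and the optimizer settings are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adaptable_parameters(model, adapt_attention=True):
    """Parameters updated at test time: normalization layers and, optionally,
    attention blocks (assumed here to contain 'attn' in their module names)."""
    params = []
    for name, module in model.named_modules():
        is_norm = isinstance(module, (torch.nn.BatchNorm3d, torch.nn.InstanceNorm3d))
        is_attn = adapt_attention and "attn" in name
        if is_norm or is_attn:
            params += [p for p in module.parameters(recurse=False) if p.requires_grad]
    return params

def atlas_guided_tta(model, test_volume, atlas_prob, register_to_atlas,
                     steps=10, lr=1e-3):
    """Adapt `model` on a single unlabeled volume by making its registered
    prediction agree with a probabilistic atlas."""
    opt = torch.optim.Adam(adaptable_parameters(model), lr=lr)
    for _ in range(steps):
        pred = torch.softmax(model(test_volume), dim=1)      # B x C x D x H x W
        pred_atlas = register_to_atlas(pred)                 # warped to atlas space
        loss = F.kl_div(pred_atlas.clamp_min(1e-8).log(), atlas_prob,
                        reduction="batchmean")               # atlas-based loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```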

Real-time Vision-based Navigation for a Robot in an Indoor Environment

  • paper_url: http://arxiv.org/abs/2307.00666
  • repo_url: https://github.com/manglanisagar/vision-search-navigation
  • paper_authors: Sagar Manglani
  • for: The goal of this work is an autonomous navigation system for indoor home environments.
  • methods: The system combines vision-based perception with path-planning algorithms so that the robot can reach its destination while avoiding obstacles.
  • results: The system is evaluated with qualitative and quantitative metrics, showing both the potential and the limitations of real-time autonomous vision-based navigation.
    Abstract This paper presents a study on the development of an obstacle-avoidance navigation system for autonomous navigation in home environments. The system utilizes vision-based techniques and advanced path-planning algorithms to enable the robot to navigate toward the destination while avoiding obstacles. The performance of the system is evaluated through qualitative and quantitative metrics, highlighting its strengths and limitations. The findings contribute to the advancement of indoor robot navigation, showcasing the potential of vision-based techniques for real-time, autonomous navigation.
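The abstract does not spell out the planner, so the following is only a generic sketch of the obstacle-avoiding planning step such a system needs: A* search over a 2D occupancy grid built from the robot's perception, with a Manhattan heuristic. The grid encoding, connectivity, and toy map are assumptions.

```python
import heapq

def astar(grid, start, goal):
    """A* on a 2D occupancy grid (0 = free, 1 = obstacle), 4-connected moves,
    Manhattan-distance heuristic. Returns a list of cells or None if no path."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    counter = 0                                   # tie-breaker for the heap
    open_set = [(h(start), counter, 0, start, None)]
    came_from, best_g = {}, {start: 0}
    while open_set:
        _, _, g, cell, parent = heapq.heappop(open_set)
        if cell in came_from:                     # already expanded
            continue
        came_from[cell] = parent
        if cell == goal:                          # reconstruct the path
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    counter += 1
                    heapq.heappush(open_set, (ng + h(nxt), counter, ng, nxt, cell))
    return None

# Toy occupancy map (1 = detected obstacle): plan from the top-left corner
# to the bottom-right corner around the walls.
grid = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 1, 0],
        [0, 0, 0, 0, 0]]
print(astar(grid, (0, 0), (4, 4)))
```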

CNN-BiLSTM model for English Handwriting Recognition: Comprehensive Evaluation on the IAM Dataset

  • paper_url: http://arxiv.org/abs/2307.00664
  • repo_url: None
  • paper_authors: Firat Kizilirmak, Berrin Yanikoglu
  • for: A CNN-BiLSTM system for offline English handwriting recognition, evaluated extensively on the IAM dataset, including the effects of model size, data augmentation and the lexicon.
  • methods: The model combines a CNN-BiLSTM network with a CTC layer; test-time augmentation of the input image is applied to improve recognition of difficult cases.
  • results: The best model reaches 3.59% CER and 9.44% WER; an error analysis on IAM highlights hard handwriting images and samples with erroneous labels.
    Abstract We present a CNN-BiLSTM system for the problem of offline English handwriting recognition, with extensive evaluations on the public IAM dataset, including the effects of model size, data augmentation and the lexicon. Our best model achieves 3.59\% CER and 9.44\% WER using a CNN-BiLSTM network with a CTC layer. Test-time augmentation, with rotation and shear transformations applied to the input image, is proposed to improve recognition of difficult cases and is found to reduce the word error rate by 2.5 percentage points. We also conduct an error analysis of our proposed method on the IAM dataset, showing hard cases of handwriting images and exploring samples with erroneous labels. We release our source code into the public domain to foster further research and encourage scientific reproducibility.
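A minimal PyTorch sketch of a CNN-BiLSTM-CTC pipeline of the kind evaluated above; the layer sizes, image height, and alphabet size are illustrative assumptions and do not reproduce the paper's exact architecture or its test-time augmentation.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CNN-BiLSTM-CTC sketch for line-level handwriting recognition.
    Grayscale line images of height 32; layer sizes are illustrative only."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # H 32 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # H 16 -> 8
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                                          # H 8 -> 4
        )
        self.rnn = nn.LSTM(256 * 4, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)    # classes include the CTC blank

    def forward(self, x):                               # x: B x 1 x 32 x W
        f = self.cnn(x)                                 # B x 256 x 4 x W'
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one feature vector per column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)             # B x W' x num_classes

# CTC training expects (T, B, C) log-probabilities.
model = CRNN(num_classes=80)
log_probs = model(torch.randn(2, 1, 32, 256)).permute(1, 0, 2)   # T x B x C
targets = torch.randint(1, 80, (2, 20))                          # 0 is the blank
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), log_probs.size(0), dtype=torch.long),
    target_lengths=torch.full((2,), 20, dtype=torch.long))
```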

More Synergy, Less Redundancy: Exploiting Joint Mutual Information for Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.00651
  • repo_url: None
  • paper_authors: Salman Mohamadi, Gianfranco Doretto, Donald A. Adjeroh
  • for: Investigate the role of mutual information in self-supervised learning (SSL) and how SSL models can exploit information about the data distribution more reliably.
  • methods: The paper adopts a multivariate information-measurement perspective based on partial information decomposition (PID), which splits the joint mutual information into three components: unique, redundant and synergistic information.
  • results: Experiments show that minimizing the redundant information between views while maximizing the synergistic information with the target representation improves SSL performance; the work re-calibrates two redundancy-reduction baselines and proposes a new SSL training protocol, validated on multiple datasets and two downstream tasks.
    Abstract Self-supervised learning (SSL) is now a serious competitor for supervised learning, even though it does not require data annotation. Several baselines have attempted to make SSL models exploit information about the data distribution and become less dependent on the augmentation effect. However, there is no clear consensus on whether maximizing or minimizing the mutual information between representations of augmentation views practically contributes to improvement or degradation in the performance of SSL models. This paper is a fundamental work in which we investigate the role of mutual information in SSL and reformulate the problem of SSL from a new perspective on mutual information. To this end, we consider joint mutual information from the perspective of partial information decomposition (PID) as a key step in \textbf{reliable multivariate information measurement}. PID enables us to decompose joint mutual information into three important components, namely, unique information, redundant information and synergistic information. Our framework aims to minimize the redundant information between views and the desired target representation while maximizing the synergistic information at the same time. Our experiments lead to a re-calibration of two redundancy reduction baselines, and a proposal for a new SSL training protocol. Extensive experimental results on multiple datasets and two downstream tasks show the effectiveness of this framework.
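For reference, the Williams-Beer partial information decomposition that the abstract invokes splits the joint mutual information between a target representation T and two augmented views V_1, V_2 as below; the notation is standard PID, and the paper's concrete estimators and training objective are not reproduced here.

```latex
% Williams-Beer partial information decomposition of the joint mutual
% information between a target representation T and two augmented views V_1, V_2.
\begin{align}
I(T; V_1, V_2) &= \underbrace{U(T; V_1 \setminus V_2)}_{\text{unique to } V_1}
  + \underbrace{U(T; V_2 \setminus V_1)}_{\text{unique to } V_2}
  + \underbrace{R(T; V_1, V_2)}_{\text{redundant}}
  + \underbrace{S(T; V_1, V_2)}_{\text{synergistic}}, \\
I(T; V_1) &= U(T; V_1 \setminus V_2) + R(T; V_1, V_2), \qquad
I(T; V_2) = U(T; V_2 \setminus V_1) + R(T; V_1, V_2).
\end{align}
% The objective sketched in the abstract corresponds to driving R down while
% driving S up for the views and the target representation.
```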