cs.CV - 2023-10-13

Pairwise Similarity Learning is SimPLE

  • paper_url: http://arxiv.org/abs/2310.09449
  • repo_url: None
  • paper_authors: Yandong Wen, Weiyang Liu, Yao Feng, Bhiksha Raj, Rita Singh, Adrian Weller, Michael J. Black, Bernhard Schölkopf
  • for: This work addresses a general yet important learning problem: pairwise similarity learning (PSL). PSL subsumes many important applications, such as open-set face recognition, speaker verification, image retrieval, and person re-identification. The goal is to learn a pairwise similarity function that assigns higher similarity scores to pairs of samples with the same label than to pairs with different labels.
  • methods: The authors first identify a key desideratum for PSL and discuss how existing methods achieve it. They then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor an angular margin, yet generalizes well in open-set recognition (a sketch of a generic pairwise loss follows this entry).
  • results: The method is applied to three challenging PSL tasks: open-set face recognition, image retrieval, and speaker verification. Comprehensive experiments on large-scale benchmarks show that it performs significantly better than current state-of-the-art methods.
    Abstract In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples with the same label) than to negative pairs (i.e., a pair of samples with different label). We start by identifying a key desideratum for PSL, and then discuss how existing methods can achieve this desideratum. We then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin and yet is able to generalize well in open-set recognition. We apply the proposed method to three challenging PSL tasks: open-set face recognition, image retrieval and speaker verification. Comprehensive experimental results on large-scale benchmarks show that our method performs significantly better than current state-of-the-art methods.
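The abstract describes SimPLE only at a high level, so the block below is a minimal sketch of a generic proxy-free pairwise similarity loss in PyTorch: raw (unnormalized) dot-product scores with a logistic loss, no proxies, no angular margin. The exact loss form is an assumption for illustration, not the paper's formulation.

```python
# Minimal sketch, assuming a logistic loss on unnormalized pairwise dot products;
# not the paper's exact SimPLE loss.
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Push same-label pairs toward high scores and different-label pairs toward low scores."""
    sim = embeddings @ embeddings.t()                                        # raw pairwise scores
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()               # 1 for positive pairs
    mask = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)   # drop self-pairs
    return F.binary_cross_entropy_with_logits(sim[mask], same[mask])

# Usage: loss = pairwise_similarity_loss(model(images), labels)
```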

Automatic segmentation of lung findings in CT and application to Long COVID

  • paper_url: http://arxiv.org/abs/2310.09446
  • repo_url: https://github.com/miclab-unicamp/medseg
  • paper_authors: Diedre S. Carmo, Rosarie A. Tudas, Alejandro P. Comellas, Leticia Rittner, Roberto A. Lotufo, Joseph M. Reinhardt, Sarah E. Gerard
  • for: Improving the accuracy of automated segmentation of lung abnormalities in computed tomography, an important step for diagnosing and characterizing lung disease.
  • methods: The authors propose S-MEDSeg, a deep learning approach that combines a pre-trained EfficientNet backbone, a bidirectional feature pyramid network, and modern network advancements to improve lung lesion segmentation (a rough stand-in model sketch is given after this entry).
  • results: S-MEDSeg significantly improves segmentation performance over the baseline approach, and a comprehensive ablation study evaluates the contribution of each proposed network modification. The method is also applied to an independent dataset of long COVID inpatients to study the effect of post-acute infection vaccination on the extent of lung findings.
    Abstract Automated segmentation of lung abnormalities in computed tomography is an important step for diagnosing and characterizing lung disease. In this work, we improve upon a previous method and propose S-MEDSeg, a deep learning based approach for accurate segmentation of lung lesions in chest CT images. S-MEDSeg combines a pre-trained EfficientNet backbone, bidirectional feature pyramid network, and modern network advancements to achieve improved segmentation performance. A comprehensive ablation study was performed to evaluate the contribution of the proposed network modifications. The results demonstrate modifications introduced in S-MEDSeg significantly improves segmentation performance compared to the baseline approach. The proposed method is applied to an independent dataset of long COVID inpatients to study the effect of post-acute infection vaccination on extent of lung findings. Open-source code, graphical user interface and pip package are available at https://github.com/MICLab-Unicamp/medseg.
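The authors' code is in the linked repository; as a rough stand-in only, a comparable EfficientNet-backed segmentation model can be assembled with the segmentation_models_pytorch package. The plain FPN decoder below is an assumption standing in for the paper's bidirectional FPN.

```python
# Rough stand-in, not the authors' S-MEDSeg implementation: an EfficientNet-backed
# segmentation model; the FPN decoder approximates the bidirectional FPN in spirit only.
import torch
import segmentation_models_pytorch as smp

model = smp.FPN(
    encoder_name="efficientnet-b0",   # pre-trained EfficientNet backbone
    encoder_weights="imagenet",
    in_channels=1,                    # single-channel CT slices
    classes=1,                        # binary lesion mask
)

ct_slice = torch.randn(1, 1, 256, 256)
lesion_logits = model(ct_slice)       # (1, 1, 256, 256) segmentation logits
```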

Tackling Heterogeneity in Medical Federated learning via Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.09444
  • repo_url: None
  • paper_authors: Erfan Darzi, Yiqing Shen, Nanna M. Sijtsema, P. M. A van Ooijen
  • for: Addressing data heterogeneity in medical federated learning, in particular improving the performance of underrepresented clients.
  • methods: Vision Transformers are used in place of purely optimization-based regularization; the improvement is attributed to their ability to capture long-range dependencies within the input data.
  • results: Using Vision Transformers substantially improves the performance of underrepresented clients without a significant trade-off in overall accuracy.
    Abstract Optimization-based regularization methods have been effective in addressing the challenges posed by data heterogeneity in medical federated learning, particularly in improving the performance of underrepresented clients. However, these methods often lead to lower overall model accuracy and slower convergence rates. In this paper, we demonstrate that using Vision Transformers can substantially improve the performance of underrepresented clients without a significant trade-off in overall accuracy. This improvement is attributed to the Vision transformer's ability to capture long-range dependencies within the input data.

MEMTRACK: A Deep Learning-Based Approach to Microrobot Tracking in Dense and Low-Contrast Environments

  • paper_url: http://arxiv.org/abs/2310.09441
  • repo_url: https://github.com/sawhney-medha/memtrack
  • paper_authors: Medha Sawhney, Bhas Karmarkar, Eric J. Leaman, Arka Daw, Anuj Karpatne, Bahareh Behkam
  • for: Addressing the challenge of tracking microrobots, whose minute size and high speed make accurate tracking difficult, especially in dense, low-contrast environments such as collagen.
  • methods: MEMTrack combines synthetic motion features, deep learning-based object detection, and a modified Simple Online and Real-time Tracking (SORT) algorithm with interpolation; the detection stage combines different models according to the object's motion pattern (a simple interpolation example follows this entry).
  • results: MEMTrack accurately tracks even the most challenging bacteria missed by skilled human annotators (precision/recall of 77%/48% in collagen and 94%/35% in liquid media) and quantifies average bacterial speed with no statistically significant difference from manually produced tracking data.
    Abstract Tracking microrobots is challenging, considering their minute size and high speed. As the field progresses towards developing microrobots for biomedical applications and conducting mechanistic studies in physiologically relevant media (e.g., collagen), this challenge is exacerbated by the dense surrounding environments with feature size and shape comparable to microrobots. Herein, we report Motion Enhanced Multi-level Tracker (MEMTrack), a robust pipeline for detecting and tracking microrobots using synthetic motion features, deep learning-based object detection, and a modified Simple Online and Real-time Tracking (SORT) algorithm with interpolation for tracking. Our object detection approach combines different models based on the object's motion pattern. We trained and validated our model using bacterial micro-motors in collagen (tissue phantom) and tested it in collagen and aqueous media. We demonstrate that MEMTrack accurately tracks even the most challenging bacteria missed by skilled human annotators, achieving precision and recall of 77% and 48% in collagen and 94% and 35% in liquid media, respectively. Moreover, we show that MEMTrack can quantify average bacteria speed with no statistically significant difference from the laboriously-produced manual tracking data. MEMTrack represents a significant contribution to microrobot localization and tracking, and opens the potential for vision-based deep learning approaches to microrobot control in dense and low-contrast settings. All source code for training and testing MEMTrack and reproducing the results of the paper have been made publicly available https://github.com/sawhney-medha/MEMTrack.
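The abstract mentions a modified SORT tracker "with interpolation" but does not detail the modification; the sketch below illustrates only the interpolation idea, linearly filling missing positions between detected frames of a track. The data layout is a hypothetical choice.

```python
# Illustrative only: fill gaps in a track by linear interpolation between detected
# frames. `track` maps frame index -> (x, y) center; this is one reading of the
# "SORT with interpolation" step, not the authors' implementation.

def interpolate_track(track: dict[int, tuple[float, float]]) -> dict[int, tuple[float, float]]:
    frames = sorted(track)
    filled = dict(track)
    for a, b in zip(frames[:-1], frames[1:]):
        for f in range(a + 1, b):                       # frames with no detection
            t = (f - a) / (b - a)
            x = (1 - t) * track[a][0] + t * track[b][0]
            y = (1 - t) * track[a][1] + t * track[b][1]
            filled[f] = (x, y)
    return filled

# Usage: interpolate_track({0: (10.0, 5.0), 3: (16.0, 11.0)}) fills frames 1 and 2.
```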

LL-VQ-VAE: Learnable Lattice Vector-Quantization For Efficient Representations

  • paper_url: http://arxiv.org/abs/2310.09382
  • repo_url: None
  • paper_authors: Ahmed Khalil, Robert Piechocki, Raul Santos-Rodriguez
  • for: Learning discrete representations via learnable lattice vector quantization.
  • methods: LL-VQ-VAE replaces the vector quantization layer in VQ-VAE with lattice-based discretization; the learnable lattice imposes structure over all discrete embeddings, deterring codebook collapse and yielding high codebook utilization (a toy quantizer sketch follows this entry).
  • results: Compared to VQ-VAE, the method obtains lower reconstruction errors under the same training conditions, trains in a fraction of the time, and uses a constant number of parameters (equal to the embedding dimension $D$), making it very scalable. Results are demonstrated on FFHQ-1024, FashionMNIST, and Celeb-A.
    Abstract In this paper we introduce learnable lattice vector quantization and demonstrate its effectiveness for learning discrete representations. Our method, termed LL-VQ-VAE, replaces the vector quantization layer in VQ-VAE with lattice-based discretization. The learnable lattice imposes a structure over all discrete embeddings, acting as a deterrent against codebook collapse, leading to high codebook utilization. Compared to VQ-VAE, our method obtains lower reconstruction errors under the same training conditions, trains in a fraction of the time, and with a constant number of parameters (equal to the embedding dimension $D$), making it a very scalable approach. We demonstrate these results on the FFHQ-1024 dataset and include FashionMNIST and Celeb-A.
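The abstract does not spell out the quantization layer, but it does say the parameter count equals the embedding dimension $D$. A minimal sketch consistent with that, a learnable per-dimension lattice spacing with a straight-through estimator, is given below; it is an assumption for illustration, not the paper's exact layer.

```python
# Minimal sketch, assuming a per-dimension learnable lattice scale (D parameters)
# and a straight-through estimator; not the paper's exact LL-VQ-VAE layer.
import torch
import torch.nn as nn

class LatticeQuantizer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))                 # one lattice spacing per dimension

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        quantized = torch.round(z / self.scale) * self.scale       # snap to nearest lattice point
        return z + (quantized - z).detach()                        # straight-through gradient

# Usage: z_q = LatticeQuantizer(dim=64)(encoder_output)  # same shape as the input
```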

Efficient Apple Maturity and Damage Assessment: A Lightweight Detection Model with GAN and Attention Mechanism

  • paper_url: http://arxiv.org/abs/2310.09347
  • repo_url: None
  • paper_authors: Yufei Liu, Manzhou Li, Qin Ma
  • for: Proposing a method based on lightweight convolutional neural networks (CNN) and generative adversarial networks (GAN) for apple ripeness and damage level detection.
  • methods: A lightweight CNN is designed by optimizing model depth and width and applying advanced model compression techniques to improve real-time performance; attention mechanisms dynamically adjust the importance of different feature layers; GANs generate realistic apple images to address sample imbalance and insufficient sample size; and the detection network annotates damage locations on damaged apples to improve damage level detection.
  • results: In apple ripeness grading, the model achieves 95.6% precision, 93.8% recall, 95.0% accuracy, and 56.5 FPS; in damage level detection it reaches 95.3% precision, 93.7% recall, and 94.5% mAP. In both tasks the method outperforms other mainstream models, demonstrating its high practical value.
    Abstract This study proposes a method based on lightweight convolutional neural networks (CNN) and generative adversarial networks (GAN) for apple ripeness and damage level detection tasks. Initially, a lightweight CNN model is designed by optimizing the model's depth and width, as well as employing advanced model compression techniques, successfully reducing the model's parameter and computational requirements, thus enhancing real-time performance in practical applications. Simultaneously, attention mechanisms are introduced, dynamically adjusting the importance of different feature layers to improve the performance in object detection tasks. To address the issues of sample imbalance and insufficient sample size, GANs are used to generate realistic apple images, expanding the training dataset and enhancing the model's recognition capability when faced with apples of varying ripeness and damage levels. Furthermore, by applying the object detection network for damage location annotation on damaged apples, the accuracy of damage level detection is improved, providing a more precise basis for decision-making. Experimental results show that in apple ripeness grading detection, the proposed model achieves 95.6\%, 93.8\%, 95.0\%, and 56.5 in precision, recall, accuracy, and FPS, respectively. In apple damage level detection, the proposed model reaches 95.3\%, 93.7\%, and 94.5\% in precision, recall, and mAP, respectively. In both tasks, the proposed method outperforms other mainstream models, demonstrating the excellent performance and high practical value of the proposed method in apple ripeness and damage level detection tasks.

Vision-by-Language for Training-Free Compositional Image Retrieval

  • paper_url: http://arxiv.org/abs/2310.09291
  • repo_url: None
  • paper_authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
  • for: Proposing a training-free compositional image retrieval (CIR) method that combines large-scale vision-language models (VLMs) with large language models (LLMs).
  • methods: CIReVL is a simple, human-understandable, and scalable pipeline: a pre-trained generative VLM captions the reference image, an LLM recomposes the caption according to the textual target modification, and retrieval is performed with e.g. CLIP (a schematic pipeline sketch follows this entry).
  • results: On four zero-shot CIR benchmarks the method achieves competitive, in part state-of-the-art performance, improving over supervised methods; its modularity allows scaling without retraining and makes failure cases intervenable by post-hoc re-alignment in the language domain.
    Abstract Given an image and a target modification (e.g an image of the Eiffel tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on annotating triplets that is costly (i.e. query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking a LLM to recompose the caption based on the textual target modification for subsequent retrieval via e.g. CLIP, we achieve modular language reasoning. In four ZS-CIR benchmarks, we find competitive, in-part state-of-the-art performance - improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to both investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to in parts more than double of previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable, allowing to post-hoc re-align failure cases. Code will be released upon acceptance.
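A schematic of the training-free pipeline as described in the abstract: caption the reference image, let an LLM recompose the caption with the target modification, then retrieve with CLIP-style embeddings. The callables caption_image, recompose_with_llm, clip_text_embed, and clip_image_embed are hypothetical placeholders, not the paper's actual components.

```python
# Schematic sketch of a CIReVL-style training-free retrieval pipeline; all model
# calls are injected as placeholder callables (assumptions for illustration).
import numpy as np

def compositional_retrieve(reference_image, modification_text, gallery_images,
                           caption_image, recompose_with_llm,
                           clip_text_embed, clip_image_embed):
    caption = caption_image(reference_image)                       # e.g. "the Eiffel tower with tourists"
    target_text = recompose_with_llm(caption, modification_text)   # e.g. "the Eiffel tower at night, without people"
    query = clip_text_embed(target_text)
    gallery = np.stack([clip_image_embed(img) for img in gallery_images])
    query = query / (np.linalg.norm(query) + 1e-8)
    gallery = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-8)
    return np.argsort(-(gallery @ query))                          # gallery indices ranked by cosine similarity
```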

An Unbiased Look at Datasets for Visuo-Motor Pre-Training

  • paper_url: http://arxiv.org/abs/2310.09289
  • repo_url: None
  • paper_authors: Sudeep Dasari, Mohan Kumar Srirama, Unnat Jain, Abhinav Gupta
  • for: A dataset-centric analysis of robotic pre-training for visuo-motor representation learning.
  • methods: Visual representations are pre-trained on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transferred to target robotics tasks; the analysis varies the pre-training dataset rather than the algorithm.
  • results: Traditional vision datasets (like ImageNet, Kinetics, and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and the pre-training dataset's image distribution matters more than its size. Common simulation benchmarks are not a reliable proxy for real-world performance, and simple regularization strategies can dramatically improve real-world policy learning.
    Abstract Visual representation learning hold great promise for robotics, but is severely hampered by the scarcity and homogeneity of robotics datasets. Recent works address this problem by pre-training visual representations on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transferring them to target robotics tasks. While the field is heavily focused on developing better pre-training algorithms, we find that dataset choice is just as important to this paradigm's success. After all, the representation can only learn the structures or priors present in the pre-training dataset. To this end, we flip the focus on algorithms, and instead conduct a dataset centric analysis of robotic pre-training. Our findings call into question some common wisdom in the field. We observe that traditional vision datasets (like ImageNet, Kinetics and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and that the pre-training dataset's image distribution matters more than its size. Finally, we show that common simulation benchmarks are not a reliable proxy for real world performance and that simple regularization strategies can dramatically improve real world policy learning. https://data4robotics.github.io

SAIR: Learning Semantic-aware Implicit Representation

  • paper_url: http://arxiv.org/abs/2310.09285
  • repo_url: None
  • paper_authors: Canyu Zhang, Xiaoguang Li, Qing Guo, Song Wang
  • for: image inpainting task
  • methods: Semantic-aware implicit representation (SAIR) with two modules: (1) a semantic implicit representation (SIR) that maps any continuous coordinate of a corrupted image to a text-aligned embedding indicating which object the pixel belongs to, and (2) an appearance implicit representation (AIR), built on the SIR, that reconstructs the color at any coordinate even where pixels are missing (a toy sketch follows this entry).
  • results: Surpasses state-of-the-art approaches by a significant margin.
    Abstract Implicit representation of an image can map arbitrary coordinates in the continuous domain to their corresponding color values, presenting a powerful capability for image reconstruction. Nevertheless, existing implicit representation approaches only focus on building continuous appearance mapping, ignoring the continuities of the semantic information across pixels. As a result, they can hardly achieve desired reconstruction results when the semantic information within input images is corrupted, for example, a large region misses. To address the issue, we propose to learn semantic-aware implicit representation (SAIR), that is, we make the implicit representation of each pixel rely on both its appearance and semantic information (\eg, which object does the pixel belong to). To this end, we propose a framework with two modules: (1) building a semantic implicit representation (SIR) for a corrupted image whose large regions miss. Given an arbitrary coordinate in the continuous domain, we can obtain its respective text-aligned embedding indicating the object the pixel belongs. (2) building an appearance implicit representation (AIR) based on the SIR. Given an arbitrary coordinate in the continuous domain, we can reconstruct its color whether or not the pixel is missed in the input. We validate the novel semantic-aware implicit representation method on the image inpainting task, and the extensive experiments demonstrate that our method surpasses state-of-the-art approaches by a significant margin.
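As a toy illustration of the coordinate-plus-semantics idea only (not the paper's SIR/AIR architecture), the sketch below is an implicit MLP that maps a continuous 2D coordinate concatenated with a semantic embedding to an RGB value.

```python
# Toy sketch, not the paper's SAIR architecture: an implicit MLP that maps a
# continuous 2D coordinate plus a semantic embedding for that location to RGB.
import torch
import torch.nn as nn

class SemanticAwareImplicitMLP(nn.Module):
    def __init__(self, sem_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + sem_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),        # RGB in [0, 1]
        )

    def forward(self, coords: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([coords, sem], dim=-1))

# Usage: colors = SemanticAwareImplicitMLP()(torch.rand(1024, 2), torch.randn(1024, 64))
```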

Transformer-based Multimodal Change Detection with Multitask Consistency Constraints

  • paper_url: http://arxiv.org/abs/2310.09276
  • repo_url: https://github.com/qaz670756/mmcd
  • paper_authors: Biyuan Liu, Huaixin Chen, Kun Li, Michael Ying Yang
  • for: Leveraging multimodal data, digital surface model (DSM) data and aerial images captured at different times, for change detection beyond 2D, addressing the multitask conflict between semantic and height change detection.
  • methods: An efficient Transformer-based network learns a shared representation between cross-dimensional inputs through cross-attention, and a consistency constraint establishes the multimodal relationship by obtaining pseudo change through height-change thresholding and minimizing the difference between semantic and pseudo change within their overlapping regions (a hedged sketch of this constraint follows this entry).
  • results: Compared to five state-of-the-art change detection methods, the model shows consistent multitask superiority in semantic and height change detection on a new DSM-to-image multimodal dataset covering three cities in the Netherlands, and the consistency strategy can be seamlessly adapted to other methods with promising improvements.
    Abstract Change detection plays a fundamental role in Earth observation for analyzing temporal iterations over time. However, recent studies have largely neglected the utilization of multimodal data that presents significant practical and technical advantages compared to single-modal approaches. This research focuses on leveraging digital surface model (DSM) data and aerial images captured at different times for detecting change beyond 2D. We observe that the current change detection methods struggle with the multitask conflicts between semantic and height change detection tasks. To address this challenge, we propose an efficient Transformer-based network that learns shared representation between cross-dimensional inputs through cross-attention. It adopts a consistency constraint to establish the multimodal relationship, which involves obtaining pseudo change through height change thresholding and minimizing the difference between semantic and pseudo change within their overlapping regions. A DSM-to-image multimodal dataset encompassing three cities in the Netherlands was constructed. It lays a new foundation for beyond-2D change detection from cross-dimensional inputs. Compared to five state-of-the-art change detection methods, our model demonstrates consistent multitask superiority in terms of semantic and height change detection. Furthermore, the consistency strategy can be seamlessly adapted to the other methods, yielding promising improvements.
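A hedged sketch of the consistency constraint as described in the abstract: threshold the predicted height change into a pseudo change map and pull the semantic change prediction toward it within their overlap. The threshold value and exact loss form are assumptions.

```python
# Hedged sketch of a consistency term between semantic change and height-derived
# pseudo change; the threshold and loss form are assumptions, not the paper's.
import torch

def consistency_loss(sem_change_logits: torch.Tensor, height_change: torch.Tensor,
                     threshold: float = 1.0) -> torch.Tensor:
    pseudo = (height_change.abs() > threshold).float()        # pseudo change from height thresholding
    sem_prob = torch.sigmoid(sem_change_logits)
    overlap = pseudo * (sem_prob > 0.5).float()               # regions flagged by both branches
    diff = (sem_prob - pseudo).abs() * overlap
    return diff.sum() / overlap.sum().clamp(min=1.0)
```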

Understanding and Modeling the Effects of Task and Context on Drivers’ Gaze Allocation

  • paper_url: http://arxiv.org/abs/2310.09275
  • repo_url: None
  • paper_authors: Iuliia Kotseruba, John K. Tsotsos
  • for: This paper aims to improve the accuracy of driver gaze prediction by explicitly modeling task and context influences, and to provide a new benchmark for evaluating such models.
  • methods: The proposed method addresses shortcomings of the popular DR(eye)VE dataset and extends it with per-frame annotations for driving task and context. The authors also benchmark a number of baseline and state-of-the-art models for saliency and driver gaze prediction, and analyze them with respect to the new annotations.
  • results: The proposed method significantly improves the state-of-the-art performance on DR(eye)VE overall (by 24% KLD and 89% NSS) and on a subset of action and safety-critical intersection scenarios (by 10-30% KLD); the KLD and NSS saliency metrics used here are sketched after this entry.
    Abstract Understanding what drivers look at is important for many applications, including driver training, monitoring, and assistance, as well as self-driving. Traditionally, factors affecting human visual attention have been divided into bottom-up (involuntary attraction to salient regions) and top-down (task- and context-driven). Although both play a role in drivers' gaze allocation, most of the existing modeling approaches apply techniques developed for bottom-up saliency and do not consider task and context influences explicitly. Likewise, common driving attention benchmarks lack relevant task and context annotations. Therefore, to enable analysis and modeling of these factors for drivers' gaze prediction, we propose the following: 1) address some shortcomings of the popular DR(eye)VE dataset and extend it with per-frame annotations for driving task and context; 2) benchmark a number of baseline and SOTA models for saliency and driver gaze prediction and analyze them w.r.t. the new annotations; and finally, 3) a novel model that modulates drivers' gaze prediction with explicit action and context information, and as a result significantly improves SOTA performance on DR(eye)VE overall (by 24\% KLD and 89\% NSS) and on a subset of action and safety-critical intersection scenarios (by 10--30\% KLD). Extended annotations, code for model and evaluation will be made publicly available.
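KLD and NSS are standard saliency-evaluation metrics rather than contributions of this paper; for reference, a minimal NumPy sketch of both is given below, assuming a predicted saliency map, a ground-truth fixation density map, and a binary fixation map.

```python
# Standard saliency metrics (not specific to this paper): KL divergence between a
# predicted saliency map and a ground-truth fixation density, and Normalized
# Scanpath Saliency (NSS) at binary fixation locations.
import numpy as np

def kld(pred: np.ndarray, gt_density: np.ndarray, eps: float = 1e-7) -> float:
    p = pred / (pred.sum() + eps)
    q = gt_density / (gt_density.sum() + eps)
    return float(np.sum(q * np.log(eps + q / (p + eps))))

def nss(pred: np.ndarray, fixations: np.ndarray) -> float:
    z = (pred - pred.mean()) / (pred.std() + 1e-7)      # z-score the prediction
    return float(z[fixations > 0].mean())               # average at fixated pixels
```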

Time CNN and Graph Convolution Network for Epileptic Spike Detection in MEG Data

  • paper_url: http://arxiv.org/abs/2310.09236
  • repo_url: None
  • paper_authors: Pauline Mouches, Thibaut Dejean, Julien Jung, Romain Bouet, Carole Lartizien, Romain Quentin
  • for: Detecting epileptic spikes in magnetoencephalography (MEG) recordings with machine learning, enabling accurate localization of the brain regions triggering seizures.
  • methods: A 1D temporal convolutional neural network (Time CNN) coupled with a graph convolutional network (GCN), which accounts for the spatial relationships between MEG sensors, classifies short time frames of MEG recording as containing a spike or not (a schematic sketch follows this entry).
  • results: The models have fewer parameters to train than other recent approaches and outperform deep learning-based state-of-the-art methods, reaching a classification f1-score of 76.7% on a balanced dataset and 25.5% on a realistic, highly imbalanced dataset for the spike class.
    Abstract Magnetoencephalography (MEG) recordings of patients with epilepsy exhibit spikes, a typical biomarker of the pathology. Detecting those spikes allows accurate localization of brain regions triggering seizures. Spike detection is often performed manually. However, it is a burdensome and error prone task due to the complexity of MEG data. To address this problem, we propose a 1D temporal convolutional neural network (Time CNN) coupled with a graph convolutional network (GCN) to classify short time frames of MEG recording as containing a spike or not. Compared to other recent approaches, our models have fewer parameters to train and we propose to use a GCN to account for MEG sensors spatial relationships. Our models produce clinically relevant results and outperform deep learning-based state-of-the-art methods reaching a classification f1-score of 76.7% on a balanced dataset and of 25.5% on a realistic, highly imbalanced dataset, for the spike class.
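A schematic sketch of the general idea (not the authors' architecture): a shared 1D temporal CNN applied per MEG sensor, followed by one graph-convolution step that mixes sensor features through a normalized sensor adjacency matrix. The layer sizes and the adjacency used in the usage line are placeholders.

```python
# Schematic sketch only: per-sensor temporal CNN + one adjacency-based graph
# convolution over sensors; sizes and adjacency are placeholder assumptions.
import torch
import torch.nn as nn

class TimeCNNGCN(nn.Module):
    def __init__(self, n_sensors: int, adjacency: torch.Tensor, hidden: int = 32):
        super().__init__()
        self.temporal = nn.Sequential(                 # shared 1D CNN over time, per sensor
            nn.Conv1d(1, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1.0)
        self.register_buffer("adj", adjacency / deg)   # row-normalized sensor adjacency
        self.graph = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden * n_sensors, 1)   # spike / no-spike logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, t = x.shape                              # (batch, sensors, time)
        feats = self.temporal(x.reshape(b * s, 1, t)).reshape(b, s, -1)
        feats = torch.relu(self.graph(self.adj @ feats))   # one GCN step across sensors
        return self.head(feats.flatten(1))

# Usage: logits = TimeCNNGCN(248, torch.eye(248))(torch.randn(4, 248, 100))
```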

Ultrasound Image Segmentation of Thyroid Nodule via Latent Semantic Feature Co-Registration

  • paper_url: http://arxiv.org/abs/2310.09221
  • repo_url: None
  • paper_authors: Xuewei Li, Yaqiao Zhu, Jie Gao, Xi Wei, Ruixuan Zhang, Yuan Tian, Mei Yu
  • for: Improving the generalization of automatic thyroid nodule segmentation in ultrasound images, so that models remain accurate across the scanner vendors and imaging protocols found in clinical practice.
  • methods: ASTN, a segmentation framework built on a new type of co-registration network, extracts latent semantic information from the atlas and target images and uses in-depth features to co-register nodules in thyroid ultrasound images, preserving anatomical structure and reducing the impact of overall image differences caused by different devices; an atlas selection algorithm further mitigates the difficulty of co-registration.
  • results: Evaluation on datasets from different devices shows that model generalization is greatly improved while maintaining a high level of segmentation accuracy.
    Abstract Segmentation of nodules in thyroid ultrasound imaging plays a crucial role in the detection and treatment of thyroid cancer. However, owing to the diversity of scanner vendors and imaging protocols in different hospitals, the automatic segmentation model, which has already demonstrated expert-level accuracy in the field of medical image segmentation, finds its accuracy reduced as the result of its weak generalization performance when being applied in clinically realistic environments. To address this issue, the present paper proposes ASTN, a framework for thyroid nodule segmentation achieved through a new type co-registration network. By extracting latent semantic information from the atlas and target images and utilizing in-depth features to accomplish the co-registration of nodules in thyroid ultrasound images, this framework can ensure the integrity of anatomical structure and reduce the impact on segmentation as the result of overall differences in image caused by different devices. In addition, this paper also provides an atlas selection algorithm to mitigate the difficulty of co-registration. As shown by the evaluation results collected from the datasets of different devices, thanks to the method we proposed, the model generalization has been greatly improved while maintaining a high level of segmentation accuracy.

Unseen Image Synthesis with Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.09213
  • repo_url: None
  • paper_authors: Ye Zhu, Yu Wu, Zhiwei Deng, Olga Russakovsky, Yan Yan
  • for: Synthesizing images from unseen domains using pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs) trained on single-domain datasets, without any additional training.
  • methods: Latent sampling and geometric optimization: images are inverted along bi-directional deterministic diffusion and denoising trajectories, and the inverted out-of-distribution (OOD) samples are shown, theoretically and empirically, to form Gaussians in the intermediate latent spaces that are distinguishable from in-domain samples, so they can be sampled from directly; sample-wise distances and angles of the unseen subspace further optimize the sampled OOD latent encodings (a toy sketch of the Gaussian-fitting step follows this entry).
  • results: Extensive analysis and experiments with pre-trained diffusion models (DDPM, iDDPM) on AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom demonstrate the effectiveness of this new perspective on the data-synthesis generalization ability of diffusion models.
    Abstract While the current trend in the generative field is scaling up towards larger models and more training data for generalized domain representations, we go the opposite direction in this work by synthesizing unseen domain images without additional training. We do so via latent sampling and geometric optimization using pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs) on single-domain datasets. Our key observation is that DDPMs pre-trained even just on single-domain images are already equipped with sufficient representation abilities to reconstruct arbitrary images from the inverted latent encoding following bi-directional deterministic diffusion and denoising trajectories. This motivates us to investigate the statistical and geometric behaviors of the Out-Of-Distribution (OOD) samples from unseen image domains in the latent spaces along the denoising chain. Notably, we theoretically and empirically show that the inverted OOD samples also establish Gaussians that are distinguishable from the original In-Domain (ID) samples in the intermediate latent spaces, which allows us to sample from them directly. Geometrical domain-specific and model-dependent information of the unseen subspace (e.g., sample-wise distance and angles) is used to further optimize the sampled OOD latent encodings from the estimated Gaussian prior. We conduct extensive analysis and experiments using pre-trained diffusion models (DDPM, iDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom), proving the effectiveness of this novel perspective to explore and re-think the diffusion models' data synthesis generalization ability.
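A toy sketch of one step described in the abstract, fitting a Gaussian to inverted OOD latent codes and sampling new latents from it; the diffusion inversion and the geometric optimization are omitted, and `latents` is assumed to be an (N, D) array of inverted encodings at some intermediate denoising step.

```python
# Toy sketch: fit a Gaussian to inverted OOD latents and draw new latents from it.
# The inversion itself and the subsequent geometric optimization are omitted.
import numpy as np

def sample_from_latent_gaussian(latents: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    mean = latents.mean(axis=0)
    cov = np.cov(latents, rowvar=False) + 1e-6 * np.eye(latents.shape[1])  # regularized covariance
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Usage: new_latents = sample_from_latent_gaussian(inverted_latents, n_samples=16)
```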

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

  • paper_url: http://arxiv.org/abs/2310.09199
  • repo_url: https://github.com/kyegomez/PALI3
  • paper_authors: Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
  • for: Presenting PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
  • methods: Vision Transformer (ViT) models pretrained with classification objectives are compared to contrastively (SigLIP) pretrained ones, and the SigLIP image encoder is scaled up to 2 billion parameters.
  • results: While slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding, and the scaled-up SigLIP encoder achieves a new state of the art on multilingual cross-modal retrieval.
    Abstract This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieves a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

Exploring Sparse Spatial Relation in Graph Inference for Text-Based VQA

  • paper_url: http://arxiv.org/abs/2310.09147
  • repo_url: None
  • paper_authors: Sheng Zhou, Dan Guo, Jia Li, Xun Yang, Meng Wang
  • for: Text-based visual question answering (TextVQA), where large numbers of detected objects and OCR tokens create rich but largely redundant visual relationships; the goal is to avoid redundant relational inference.
  • methods: A sparse spatial graph network (SSGN) introduces a spatially aware relation pruning technique, using spatial distance, geometric dimension, overlap area, and DIoU as spatial factors (a DIoU sketch follows this entry), and progressively learns three visual relationships: object-object, OCR-OCR token, and object-OCR token relationships.
  • results: Experiments on the TextVQA and ST-VQA datasets show that SSGN achieves promising performance, and visualization results demonstrate the interpretability of the method.
    Abstract Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. To be specific, a large number of detected objects and optical character recognition (OCR) tokens result in rich visual relationships. Existing works take all visual relationships into account for answer prediction. However, there are three observations: (1) a single subject in the images can be easily detected as multiple objects with distinct bounding boxes (considered repetitive objects). The associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may be indicative of important visual cues for predicting answers. Rather than utilizing all of them for answer prediction, we make an effort to identify the most important connections or eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU for spatially aware pruning. We consider three visual relationships for graph learning: object-object, OCR-OCR tokens, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations in the correlated object-token sparse graph, and then in the respective object-based sparse graph and token-based sparse graph. Experiment results on TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performances. And some visualization results further demonstrate the interpretability of our method.
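DIoU, one of the spatial pruning factors named in the abstract, is a standard box measure: IoU minus the squared center distance over the squared diagonal of the smallest enclosing box. A reference implementation:

```python
# Pairwise DIoU between box sets (standard definition, shown for reference).
import torch

def pairwise_diou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 4), b: (M, 4) boxes as (x1, y1, x2, y2); returns an (N, M) DIoU matrix."""
    a, b = a[:, None, :], b[None, :, :]
    lt = torch.maximum(a[..., :2], b[..., :2])
    rb = torch.minimum(a[..., 2:], b[..., 2:])
    inter = (rb - lt).clamp(min=0).prod(-1)
    area_a = (a[..., 2:] - a[..., :2]).prod(-1)
    area_b = (b[..., 2:] - b[..., :2]).prod(-1)
    iou = inter / (area_a + area_b - inter + 1e-7)
    ctr_a, ctr_b = (a[..., :2] + a[..., 2:]) / 2, (b[..., :2] + b[..., 2:]) / 2
    center_dist2 = ((ctr_a - ctr_b) ** 2).sum(-1)             # squared center distance
    enc_lt = torch.minimum(a[..., :2], b[..., :2])
    enc_rb = torch.maximum(a[..., 2:], b[..., 2:])
    diag2 = ((enc_rb - enc_lt) ** 2).sum(-1) + 1e-7           # squared enclosing-box diagonal
    return iou - center_dist2 / diag2
```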

Physics-guided Noise Neural Proxy for Low-light Raw Image Denoising

  • paper_url: http://arxiv.org/abs/2310.09126
  • repo_url: None
  • paper_authors: Hansen Feng, Lizhi Wang, Yiqi Huang, Yuzhi Wang, Hua Huang
  • for: Improving low-light raw image denoising, where learning-based methods trained on synthetic data are limited by the low accuracy of the noise model.
  • methods: A physics-guided noise neural proxy (PNNP) is learned from dark frames, integrating three techniques: physics-guided noise decoupling (PND), a physics-guided proxy model (PPM), and a differentiable distribution-oriented loss (DDL); a conventional physics-based noise baseline is sketched after this entry for context.
  • results: Extensive experiments on public low-light raw image denoising datasets and real low-light imaging scenarios demonstrate the superior performance of the PNNP framework.
    Abstract Low-light raw image denoising plays a crucial role in mobile photography, and learning-based methods have become the mainstream approach. Training the learning-based methods with synthetic data emerges as an efficient and practical alternative to paired real data. However, the quality of synthetic data is inherently limited by the low accuracy of the noise model, which decreases the performance of low-light raw image denoising. In this paper, we develop a novel framework for accurate noise modeling that learns a physics-guided noise neural proxy (PNNP) from dark frames. PNNP integrates three efficient techniques: physics-guided noise decoupling (PND), physics-guided proxy model (PPM), and differentiable distribution-oriented loss (DDL). The PND decouples the dark frame into different components and handles different levels of noise in a flexible manner, which reduces the complexity of the noise neural proxy. The PPM incorporates physical priors to effectively constrain the generated noise, which promotes the accuracy of the noise neural proxy. The DDL provides explicit and reliable supervision for noise modeling, which promotes the precision of the noise neural proxy. Extensive experiments on public low-light raw image denoising datasets and real low-light imaging scenarios demonstrate the superior performance of our PNNP framework.
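For context only, the block below is the kind of conventional physics-based raw noise model (Poisson shot noise plus Gaussian read noise) whose limited accuracy motivates learning a noise neural proxy; it is not the PNNP model, and the gain and read-noise values are made up.

```python
# Conventional analytic raw-noise baseline (not PNNP): signal-dependent Poisson
# shot noise plus signal-independent Gaussian read noise.
import numpy as np

def synthesize_low_light_raw(clean: np.ndarray, gain: float = 4.0, read_std: float = 2.0,
                             seed: int = 0) -> np.ndarray:
    """clean: linear raw image as a float array (arbitrary placeholder units)."""
    rng = np.random.default_rng(seed)
    photons = rng.poisson(np.clip(clean / gain, 0, None))   # shot noise
    read = rng.normal(0.0, read_std, size=clean.shape)      # read noise
    return photons * gain + read
```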

Training and Predicting Visual Error for Real-Time Applications

  • paper_url: http://arxiv.org/abs/2310.09125
  • repo_url: https://github.com/Jaliborc/rt-percept
  • paper_authors: João Libório Cardoso, Bernhard Kerbl, Lei Yang, Yury Uralsky, Michael Wimmer
  • for: Predicting visual error metrics for real-time rendering applications, such as content-adaptive shading and shading reuse, without requiring reference or rendered images.
  • methods: Convolutional neural networks estimate the visual error that would result from reusing shading or using reduced shading rates, combining image-space information readily available in state-of-the-art deferred shading pipelines with reprojection from previous frames, so that errors can be estimated even in previously unseen regions.
  • results: The resulting models account for 70%-90% of the variance while computing up to an order of magnitude faster; in a real-time content-adaptive shading application the approach achieves up to 2x performance compared to state-of-the-art methods, depending on the portion of unseen image regions.
    Abstract Visual error metrics play a fundamental role in the quantification of perceived image similarity. Most recently, use cases for them in real-time applications have emerged, such as content-adaptive shading and shading reuse to increase performance and improve efficiency. A wide range of different metrics has been established, with the most sophisticated being capable of capturing the perceptual characteristics of the human visual system. However, their complexity, computational expense, and reliance on reference images to compare against prevent their generalized use in real-time, restricting such applications to using only the simplest available metrics. In this work, we explore the abilities of convolutional neural networks to predict a variety of visual metrics without requiring either reference or rendered images. Specifically, we train and deploy a neural network to estimate the visual error resulting from reusing shading or using reduced shading rates. The resulting models account for 70%-90% of the variance while achieving up to an order of magnitude faster computation times. Our solution combines image-space information that is readily available in most state-of-the-art deferred shading pipelines with reprojection from previous frames to enable an adequate estimate of visual errors, even in previously unseen regions. We describe a suitable convolutional network architecture and considerations for data preparation for training. We demonstrate the capability of our network to predict complex error metrics at interactive rates in a real-time application that implements content-adaptive shading in a deferred pipeline. Depending on the portion of unseen image regions, our approach can achieve up to $2\times$ performance compared to state-of-the-art methods.

Equirectangular image construction method for standard CNNs for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.09122
  • repo_url: None
  • paper_authors: Haoqian Chen, Jian Liu, Minghe Li, Kaiwen Jiang, Ziheng Xu, Rencheng Sun, Yi Sui
  • for: Constructing equirectangular images from perspective images so that standard convolutional neural networks (CNNs) can perform semantic segmentation on them.
  • methods: The inverse transformation of the spherical center projection and the equidistant cylindrical projection are employed, enabling standard CNNs to learn the distortion features at different positions in the equirectangular image; the projection position of the perspective image is controlled by a parameter φ (a geometric sketch of the general mapping follows this entry).
  • results: Analysis across datasets and models (UNet, UNet++, SegNet, PSPNet, and DeepLab v3+) shows that φ = 6π/16 is optimal for standard CNNs, giving a best average IoU of 43.76%, which is 23.85%, 10.7%, and 17.23% higher than supervised learning, unsupervised learning, and data augmentation baselines, respectively.
    Abstract 360{\deg} spherical images have advantages of wide view field, and are typically projected on a planar plane for processing, which is known as equirectangular image. The object shape in equirectangular images can be distorted and lack translation invariance. In addition, there are few publicly dataset of equirectangular images with labels, which presents a challenge for standard CNNs models to process equirectangular images effectively. To tackle this problem, we propose a methodology for converting a perspective image into equirectangular image. The inverse transformation of the spherical center projection and the equidistant cylindrical projection are employed. This enables the standard CNNs to learn the distortion features at different positions in the equirectangular image and thereby gain the ability to semantically the equirectangular image. The parameter, {\phi}, which determines the projection position of the perspective image, has been analyzed using various datasets and models, such as UNet, UNet++, SegNet, PSPNet, and DeepLab v3+. The experiments demonstrate that an optimal value of {\phi} for effective semantic segmentation of equirectangular images is 6{\pi}/16 for standard CNNs. Compared with the other three types of methods (supervised learning, unsupervised learning and data augmentation), the method proposed in this paper has the best average IoU value of 43.76%. This value is 23.85%, 10.7% and 17.23% higher than those of other three methods, respectively.
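The abstract does not fully specify the paper's construction (including how φ enters), so the sketch below shows only the generic geometry of pasting a pinhole perspective image, assumed to look along +z, into an equirectangular canvas by mapping each output pixel to a ray on the sphere and through the camera.

```python
# Generic perspective-to-equirectangular pasting (illustrative geometry only, not
# the paper's exact construction). `persp` is an H x W x C image assumed to look
# along +z with horizontal field of view `fov` (radians).
import numpy as np

def perspective_to_equirectangular(persp: np.ndarray, fov: float, out_h: int, out_w: int) -> np.ndarray:
    h, w = persp.shape[:2]
    f = (w / 2) / np.tan(fov / 2)                       # focal length in pixels
    v, u = np.mgrid[0:out_h, 0:out_w]
    lon = (u / out_w - 0.5) * 2 * np.pi                 # longitude in [-pi, pi]
    lat = (0.5 - v / out_h) * np.pi                     # latitude in [-pi/2, pi/2]
    # Ray direction on the unit sphere (y up, z forward).
    x, y, z = np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)
    z_safe = np.where(z > 1e-6, z, 1.0)
    px = f * x / z_safe + w / 2                         # perspective image coordinates
    py = -f * y / z_safe + h / 2
    out = np.zeros((out_h, out_w, persp.shape[2]), dtype=persp.dtype)
    valid = (z > 1e-6) & (px >= 0) & (px < w) & (py >= 0) & (py < h)
    out[valid] = persp[py[valid].astype(int), px[valid].astype(int)]
    return out
```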

DSG: An End-to-End Document Structure Generator

  • paper_url: http://arxiv.org/abs/2310.09118
  • repo_url: https://github.com/j-rausch/dsg
  • paper_authors: Johannes Rausch, Gentiana Rashiti, Maxim Gusev, Ce Zhang, Stefan Feuerriegel
  • for: Providing a fully end-to-end trainable system that maps rendered documents (e.g., PDF files, scans) onto a structured hierarchical format, enabling downstream tasks.
  • methods: The Document Structure Generator (DSG) combines a deep neural network for parsing (i) entities in documents (figures, text blocks, headers, etc.) and (ii) relations that capture the sequence and nested structure between entities; unlike existing systems that rely on heuristics, DSG is trained end-to-end, making it effective and flexible for real-world applications.
  • results: DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance; the authors also contribute E-Periodica, a new large-scale dataset of real-world magazines with complex document structures for evaluation.
    Abstract Information in industry, research, and the public sector is widely stored as rendered documents (e.g., PDF files, scans). Hence, to enable downstream tasks, systems are needed that map rendered documents onto a structured hierarchical format. However, existing systems for this task are limited by heuristics and are not end-to-end trainable. In this work, we introduce the Document Structure Generator (DSG), a novel system for document parsing that is fully end-to-end trainable. DSG combines a deep neural network for parsing (i) entities in documents (e.g., figures, text blocks, headers, etc.) and (ii) relations that capture the sequence and nested structure between entities. Unlike existing systems that rely on heuristics, our DSG is trained end-to-end, making it effective and flexible for real-world applications. We further contribute a new, large-scale dataset called E-Periodica comprising real-world magazines with complex document structures for evaluation. Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance. To the best of our knowledge, our DSG system is the first end-to-end trainable system for hierarchical document parsing.

Faster 3D cardiac CT segmentation with Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.09099
  • repo_url: https://github.com/ljollans/trunet
  • paper_authors: Lee Jollans, Mariana Bustamante, Lilian Henriksson, Anders Persson, Tino Ebbers
  • for: Developing a new deep learning architecture for 3D semantic segmentation of cardiac computed tomography (CT) volumes.
  • methods: The Vision Transformer (ViT) is adapted for three-dimensional volume inputs; a modified ResNet50 block and a ViT block are combined in a hybrid Transformer-Residual U-Net framework (TRUNet) with cascade upsampling and skip connections.
  • results: TRUNet converges in significantly less time than residual U-Net while providing comparable or superior segmentations of the left ventricle, left atrium, left atrial appendage, ascending aorta, and pulmonary veins, with more precise vessel boundary segmentation and better capture of the heart's overall anatomical structure.
    Abstract Accurate segmentation of the heart is essential for personalized blood flow simulations and surgical intervention planning. A recent advancement in image recognition is the Vision Transformer (ViT), which expands the field of view to encompass a greater portion of the global image context. We adapted ViT for three-dimensional volume inputs. Cardiac computed tomography (CT) volumes from 39 patients, featuring up to 20 timepoints representing the complete cardiac cycle, were utilized. Our network incorporates a modified ResNet50 block as well as a ViT block and employs cascade upsampling with skip connections. Despite its increased model complexity, our hybrid Transformer-Residual U-Net framework, termed TRUNet, converges in significantly less time than residual U-Net while providing comparable or superior segmentations of the left ventricle, left atrium, left atrial appendage, ascending aorta, and pulmonary veins. TRUNet offers more precise vessel boundary segmentation and better captures the heart's overall anatomical structure compared to residual U-Net, as confirmed by the absence of extraneous clusters of missegmented voxels. In terms of both performance and training speed, TRUNet exceeded U-Net, a commonly used segmentation architecture, making it a promising tool for 3D semantic segmentation tasks in medical imaging. The code for TRUNet is available at github.com/ljollans/TRUNet.

iPUNet:Iterative Cross Field Guided Point Cloud Upsampling

  • paper_url: http://arxiv.org/abs/2310.09092
  • repo_url: None
  • paper_authors: Guangshun Wei, Hao Pan, Shaojie Zhuang, Yuanfeng Zhou, Changjian Li
  • for: Improve the usability of point clouds produced by 3D scanning devices by increasing their density, uniformity, and fidelity to geometric features.
  • methods: Proposes a learning-based point cloud upsampling method, iPUNet, that generates dense and uniform points at arbitrary ratios and better captures sharp features. Cross fields aligned to sharp features by self-supervision guide point generation, and a locally parameterized surface is learned at each input point to enable arbitrary-ratio upsampling. An iterative strategy further moves the non-uniform input points onto the desired continuous 3D surface.
  • results: Extensive evaluation on diverse scans of objects and scenes shows that iPUNet effectively handles noisy and non-uniformly distributed input point clouds and outperforms state-of-the-art point cloud upsampling methods.
    Abstract Point clouds acquired by 3D scanning devices are often sparse, noisy, and non-uniform, causing a loss of geometric features. To facilitate the usability of point clouds in downstream applications, given such input, we present a learning-based point upsampling method, i.e., iPUNet, which generates dense and uniform points at arbitrary ratios and better captures sharp features. To generate feature-aware points, we introduce cross fields that are aligned to sharp geometric features by self-supervision to guide point generation. Given cross field defined frames, we enable arbitrary ratio upsampling by learning at each input point a local parameterized surface. The learned surface consumes the neighboring points and 2D tangent plane coordinates as input, and maps onto a continuous surface in 3D where arbitrary ratios of output points can be sampled. To solve the non-uniformity of input points, on top of the cross field guided upsampling, we further introduce an iterative strategy that refines the point distribution by moving sparse points onto the desired continuous 3D surface in each iteration. Within only a few iterations, the sparse points are evenly distributed and their corresponding dense samples are more uniform and better capture geometric features. Through extensive evaluations on diverse scans of objects and scenes, we demonstrate that iPUNet is robust to handle noisy and non-uniformly distributed inputs, and outperforms state-of-the-art point cloud upsampling methods.

pose-format: Library for Viewing, Augmenting, and Handling .pose Files

  • paper_url: http://arxiv.org/abs/2310.09066
  • repo_url: https://github.com/sign-language-processing/pose
  • paper_authors: Amit Moryossef, Mathias Müller, Rebecka Fahrni
  • for: Managing and analyzing pose data is a complex task, with challenges ranging from handling diverse file structures and data types to performing effective data manipulations such as normalization and augmentation. The paper presents the \texttt{pose-format} toolkit to address these challenges.
  • methods: The \texttt{pose-format} toolkit includes a specialized file format that can hold multiple individuals and an indefinite number of time frames, making it suitable for both image and video data. It also integrates tightly with popular numerical libraries such as NumPy, PyTorch, and TensorFlow, enabling robust machine-learning applications.
  • results: Benchmarking shows that the \texttt{.pose} file format offers clear performance advantages over prevalent formats such as OpenPose, with the added benefit of a self-contained pose specification. The library further provides data normalization, augmentation, and easy-to-use visualization in both Python and browser environments, making \texttt{pose-format} a one-stop solution for managing and analyzing pose data.
    Abstract Managing and analyzing pose data is a complex task, with challenges ranging from handling diverse file structures and data types to facilitating effective data manipulations such as normalization and augmentation. This paper presents \texttt{pose-format}, a comprehensive toolkit designed to address these challenges by providing a unified, flexible, and easy-to-use interface. The library includes a specialized file format that encapsulates various types of pose data, accommodating multiple individuals and an indefinite number of time frames, thus proving its utility for both image and video data. Furthermore, it offers seamless integration with popular numerical libraries such as NumPy, PyTorch, and TensorFlow, thereby enabling robust machine-learning applications. Through benchmarking, we demonstrate that our \texttt{.pose} file format offers vastly superior performance against prevalent formats like OpenPose, with added advantages like self-contained pose specification. Additionally, the library includes features for data normalization, augmentation, and easy-to-use visualization capabilities, both in Python and Browser environments. \texttt{pose-format} emerges as a one-stop solution, streamlining the complexities of pose data management and analysis.

VCL Challenges 2023 at ICCV 2023 Technical Report: Bi-level Adaptation Method for Test-time Adaptive Object Detection

  • paper_url: http://arxiv.org/abs/2310.08986
  • repo_url: None
  • paper_authors: Chenyu Lin, Yusheng He, Zhengqing Zang, Chenwei Tang, Tao Wang, Jiancheng Lv
  • for: A technical report on the team's participation in VCL Challenges Track B, Continual Test-time Adaptation.
  • methods: Uses a bi-level adaptation method comprising image-level and detector-level adaptation: adjustable parameter-based image filters at the image level and adjustable parameter-based mean teacher modules at the detector level.
  • results: Achieves 38.3% mAP on the target domain of VCL Challenges Track B, with a minimal drop of only 4.2% and an overall performance of 32.5% mAP.
    Abstract This report outlines our team's participation in VCL Challenges B, Continual Test-time Adaptation, focusing on the technical details of our approach. Our primary focus is test-time adaptation using bi-level adaptations, encompassing image-level and detector-level adaptations. At the image level, we employ adjustable parameter-based image filters, while at the detector level, we leverage adjustable parameter-based mean teacher modules. Ultimately, through the utilization of these bi-level adaptations, we have achieved a remarkable 38.3% mAP on the target domain of the test set within VCL Challenges B. It is worth noting that the minimal drop in mAP is merely 4.2%, and the overall performance is 32.5% mAP.
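For context, the detector-level mean teacher mentioned above typically keeps a teacher copy of the detector whose weights are an exponential moving average (EMA) of the student's. A minimal sketch of that EMA update is shown below; the momentum value and how pseudo-labels are consumed are assumptions, not details from the report.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter toward the student's (standard mean-teacher EMA)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

# During test-time adaptation the student is trained against the teacher's pseudo-labels,
# then the teacher is refreshed, e.g.:
#   loss = detection_loss(student(filtered_images), teacher(filtered_images))
#   loss.backward(); optimizer.step(); ema_update(teacher, student)
```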

UniParser: Multi-Human Parsing with Unified Correlation Representation Learning

  • paper_url: http://arxiv.org/abs/2310.08984
  • repo_url: https://github.com/cjm-sfw/Uniparser
  • paper_authors: Jiaming Chu, Lei Jin, Junliang Xing, Jian Zhao
  • for: Multi-human parsing, an image segmentation task that requires both instance-level and fine-grained category-level information.
  • methods: Proposes UniParser, which integrates instance-level and category-level representations in three key aspects: 1) a unified correlation representation learning approach that lets the network learn instance and category features in the cosine space; 2) the outputs of all modules are unified as pixel-level segmentation results, with instance and category features supervised by a homogeneous label and an auxiliary loss; and 3) a joint optimization procedure that fuses instance and category representations.
  • results: By unifying instance-level and category-level outputs, UniParser surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP.
    Abstract Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the form of outputs of each module as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtue of unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.

LRRU: Long-short Range Recurrent Updating Networks for Depth Completion

  • paper_url: http://arxiv.org/abs/2310.08956
  • repo_url: https://github.com/YufeiWang777/LRRU
  • paper_authors: Yufei Wang, Bo Li, Ge Zhang, Qi Liu, Tao Gao, Yuchao Dai
  • for: Make depth completion efficient enough for practical applications.
  • methods: Proposes LRRU, a novel lightweight deep network framework that performs depth completion without learning complex feature representations: it first roughly fills the sparse input and then iteratively updates the result through learned spatially-variant kernels.
  • results: Experiments show that the proposed LRRU variants achieve leading performance across different parameter regimes; in particular, LRRU-Base outperforms competing approaches on the NYUv2 dataset and ranked 1st on the KITTI depth completion benchmark at the time of submission.
    Abstract Existing deep learning-based depth completion methods generally employ massive stacked layers to predict the dense depth map from sparse input data. Although such approaches greatly advance this task, their accompanied huge computational complexity hinders their practical applications. To accomplish depth completion more efficiently, we propose a novel lightweight deep network framework, the Long-short Range Recurrent Updating (LRRU) network. Without learning complex feature representations, LRRU first roughly fills the sparse input to obtain an initial dense depth map, and then iteratively updates it through learned spatially-variant kernels. Our iterative update process is content-adaptive and highly flexible, where the kernel weights are learned by jointly considering the guidance RGB images and the depth map to be updated, and large-to-small kernel scopes are dynamically adjusted to capture long-to-short range dependencies. Our initial depth map has coarse but complete scene depth information, which helps relieve the burden of directly regressing the dense depth from sparse ones, while our proposed method can effectively refine it to an accurate depth map with less learnable parameters and inference time. Experimental results demonstrate that our proposed LRRU variants achieve state-of-the-art performance across different parameter regimes. In particular, the LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and ranks 1st on the KITTI depth completion benchmark at the time of submission. Project page: https://npucvr.github.io/LRRU/.
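A rough sketch of the two-phase idea described above is given below, assuming PyTorch tensors: the sparse depth is first coarsely filled, then refined with content-adaptive (spatially-variant) kernels. The coarse-fill rule and the network that predicts the per-pixel kernels are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def coarse_fill(sparse_depth, kernel_size=9):
    """sparse_depth: [B, 1, H, W]; zero means missing. A simple dilation-style fill."""
    valid = sparse_depth > 0
    filled = F.max_pool2d(sparse_depth, kernel_size, stride=1, padding=kernel_size // 2)
    return torch.where(valid, sparse_depth, filled)

def kernel_update(depth, kernels, k=3):
    """Weighted neighborhood average with per-pixel kernels of shape [B, k*k, H, W]."""
    patches = F.unfold(depth, k, padding=k // 2)                 # [B, k*k, H*W]
    patches = patches.view(depth.shape[0], k * k, *depth.shape[-2:])
    return (kernels * patches).sum(dim=1, keepdim=True)          # refined depth [B, 1, H, W]

# Iterative refinement: the kernels would come from a guidance network fed with the RGB
# image and the current depth (omitted here), with the kernel scope shrinking over steps.
```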

Online Adaptive Disparity Estimation for Dynamic Scenes in Structured Light Systems

  • paper_url: http://arxiv.org/abs/2310.08934
  • repo_url: None
  • paper_authors: Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha
  • for: Address the performance drop of deep disparity estimation networks in unseen environments by proposing self-supervised online adaptation.
  • methods: Uses an unsupervised loss function based on long sequential inputs so that the network can be adapted at test time.
  • results: The proposed method adapts quickly to new environments and achieves higher accuracy on unseen data.
    Abstract In recent years, deep neural networks have shown remarkable progress in dense disparity estimation from dynamic scenes in monocular structured light systems. However, their performance significantly drops when applied in unseen environments. To address this issue, self-supervised online adaptation has been proposed as a solution to bridge this performance gap. Unlike traditional fine-tuning processes, online adaptation performs test-time optimization to adapt networks to new domains. Therefore, achieving fast convergence during the adaptation process is critical for attaining satisfactory accuracy. In this paper, we propose an unsupervised loss function based on long sequential inputs. It ensures better gradient directions and faster convergence. Our loss function is designed using a multi-frame pattern flow, which comprises a set of sparse trajectories of the projected pattern along the sequence. We estimate the sparse pseudo ground truth with a confidence mask using a filter-based method, which guides the online adaptation process. Our proposed framework significantly improves the online adaptation speed and achieves superior performance on unseen data.
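The test-time optimization described above amounts to updating the network on each incoming frame with an unsupervised objective. The minimal loop below illustrates that pattern; the loss function shown is a generic placeholder, not the paper's multi-frame pattern-flow loss.

```python
import torch

def online_adapt(model, frame_stream, unsup_loss, lr=1e-5):
    """Adapt a pretrained disparity network frame by frame, without ground truth."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    predictions = []
    for frame in frame_stream:            # frames arrive sequentially at test time
        disparity = model(frame)
        loss = unsup_loss(disparity, frame)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                  # one quick update per frame
        predictions.append(disparity.detach())
    return predictions
```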

TIDE: Temporally Incremental Disparity Estimation via Pattern Flow in Structured Light System

  • paper_url: http://arxiv.org/abs/2310.08932
  • repo_url: https://github.com/codepointer/tidenet
  • paper_authors: Rukun Qiao, Hiroshi Kawasaki, Hongbin Zha
  • for: Propose a learning-based disparity estimation method for scene reconstruction in mono-camera structured light systems.
  • methods: Uses a recurrent network, TIDE-Net, that models temporal information through the deformation of the projected pattern (pattern flow) and fuses it with the disparity estimated for the previous frame.
  • results: Trained only on synthetic data, the model achieves superior accuracy and efficiency when applied to real data.
    Abstract We introduced Temporally Incremental Disparity Estimation Network (TIDE-Net), a learning-based technique for disparity computation in mono-camera structured light systems. In our hardware setting, a static pattern is projected onto a dynamic scene and captured by a monocular camera. Different from most former disparity estimation methods that operate in a frame-wise manner, our network acquires disparity maps in a temporally incremental way. Specifically, We exploit the deformation of projected patterns (named pattern flow ) on captured image sequences, to model the temporal information. Notably, this newly proposed pattern flow formulation reflects the disparity changes along the epipolar line, which is a special form of optical flow. Tailored for pattern flow, the TIDE-Net, a recurrent architecture, is proposed and implemented. For each incoming frame, our model fuses correlation volumes (from current frame) and disparity (from former frame) warped by pattern flow. From fused features, the final stage of TIDE-Net estimates the residual disparity rather than the full disparity, as conducted by many previous methods. Interestingly, this design brings clear empirical advantages in terms of efficiency and generalization ability. Using only synthetic data for training, our extensive evaluation results (w.r.t. both accuracy and efficiency metrics) show superior performance than several SOTA models on unseen real data. The code is available on https://github.com/CodePointer/TIDENet.
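To make the incremental formulation concrete, the sketch below warps the previous frame's disparity by the pattern flow and adds a predicted residual, which is the core update described in the abstract. The warping convention, the residual network, and how the pattern flow is obtained are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(disp_prev, flow):
    """disp_prev: [B, 1, H, W]; flow: [B, 2, H, W] pattern flow in pixels (backward warp)."""
    B, _, H, W = disp_prev.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(disp_prev.device)   # [H, W, 2]
    grid = base[None] + flow.permute(0, 2, 3, 1)                        # shift by the flow
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1                       # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(disp_prev, grid, align_corners=True)

def incremental_disparity(disp_prev, flow, residual_net, frame_features):
    warped = warp_by_flow(disp_prev, flow)
    return warped + residual_net(frame_features, warped)   # predict only the residual
```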

Towards Interpretable Controllability in Object-Centric Learning

  • paper_url: http://arxiv.org/abs/2310.08929
  • repo_url: None
  • paper_authors: Jinwoo Kim, Janghyuk Choi, Jaehyun Kang, Changyeon Lee, Ho-Jin Choi, Seon Joo Kim
  • for: Explore the binding problem in artificial neural networks, seeking human-level recognition by understanding the world in terms of symbol-like entities.
  • methods: Proposes Slot Attention with Image Augmentation (SlotAug), which learns interpretable controllability over slots in a self-supervised manner via an image augmentation strategy, together with two submethods for sustainable control: Auxiliary Identity Manipulation and Slot Consistency Loss.
  • results: Empirical studies and theoretical validation confirm the effectiveness of the approach, offering a novel capability for interpretable and sustainable control of object representations.
    Abstract The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills through the comprehension of the world in terms of symbol-like entities. Especially in the field of computer vision, object-centric learning (OCL) is extensively researched to better understand complex scenes by acquiring object representations or slots. While recent studies in OCL have made strides with complex images or videos, the interpretability and interactivity over object representation remain largely uncharted, still holding promise in the field of OCL. In this paper, we introduce a novel method, Slot Attention with Image Augmentation (SlotAug), to explore the possibility of learning interpretable controllability over slots in a self-supervised manner by utilizing an image augmentation strategy. We also devise the concept of sustainability in controllable slots by introducing iterative and reversible controls over slots with two proposed submethods: Auxiliary Identity Manipulation and Slot Consistency Loss. Extensive empirical studies and theoretical validation confirm the effectiveness of our approach, offering a novel capability for interpretable and sustainable control of object representations. Code will be available soon.

SIDE: Self-supervised Intermediate Domain Exploration for Source-free Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.08928
  • repo_url: https://github.com/se111/side
  • paper_authors: Jiamei Liu, Han Sun, Yizhen Jia, Jie Qin, Huiyu Zhou, Ningzhong Liu
  • for: Address domain shift in the source-free setting, where source data is unavailable during adaptation.
  • methods: Proposes Self-supervised Intermediate Domain Exploration (SIDE), which cyclically selects intermediate samples whose distributions resemble both the source and target domains and uses them in an inter-domain gap transition stage to bridge the domain gap.
  • results: Experiments on three popular benchmarks (Office-31, Office-Home, and VisDA-C) show that SIDE achieves competitive performance against existing methods.
    Abstract Domain adaptation aims to alleviate the domain shift when transferring the knowledge learned from the source domain to the target domain. Due to privacy issues, source-free domain adaptation (SFDA), where source data is unavailable during adaptation, has recently become very demanding yet challenging. Existing SFDA methods focus on either self-supervised learning of target samples or reconstruction of virtual source data. The former overlooks the transferable knowledge in the source model, whilst the latter introduces even more uncertainty. To address the above issues, this paper proposes self-supervised intermediate domain exploration (SIDE) that effectively bridges the domain gap with an intermediate domain, where samples are cyclically filtered out in a self-supervised fashion. First, we propose cycle intermediate domain filtering (CIDF) to cyclically select intermediate samples with similar distributions over source and target domains. Second, with the aid of those intermediate samples, an inter-domain gap transition (IDGT) module is developed to mitigate possible distribution mismatches between the source and target data. Finally, we introduce cross-view consistency learning (CVCL) to maintain the intrinsic class discriminability whilst adapting the model to the target domain. Extensive experiments on three popular benchmarks, i.e. Office-31, Office-Home and VisDA-C, show that our proposed SIDE achieves competitive performance against state-of-the-art methods.

Feature Proliferation – the “Cancer” in StyleGAN and its Treatments

  • paper_url: http://arxiv.org/abs/2310.08921
  • repo_url: https://github.com/songc42/feature-proliferation
  • paper_authors: Shuang Song, Yuanbang Liang, Jing Wu, Yu-Kun Lai, Yipeng Qin
  • for: Address the feature proliferation problem in the StyleGAN generator to improve image quality and diversity.
  • methods: First examines the StyleGAN synthesis mechanism, identifies the Feature Proliferation phenomenon, and shows that it causes StyleGAN image artifacts; then proposes a novel feature rescaling method that identifies and modulates risky features to mitigate proliferation.
  • results: Experiments confirm the hypotheses and demonstrate the effectiveness of the proposed feature rescaling method.
    Abstract Despite the success of StyleGAN in image synthesis, the images it synthesizes are not always perfect and the well-known truncation trick has become a standard post-processing technique for StyleGAN to synthesize high-quality images. Although effective, it has long been noted that the truncation trick tends to reduce the diversity of synthesized images and unnecessarily sacrifices many distinct image features. To address this issue, in this paper, we first delve into the StyleGAN image synthesis mechanism and discover an important phenomenon, namely Feature Proliferation, which demonstrates how specific features reproduce with forward propagation. Then, we show how the occurrence of Feature Proliferation results in StyleGAN image artifacts. As an analogy, we refer to it as the "cancer" in StyleGAN from its proliferating and malignant nature. Finally, we propose a novel feature rescaling method that identifies and modulates risky features to mitigate feature proliferation. Thanks to our discovery of Feature Proliferation, the proposed feature rescaling method is less destructive and retains more useful image features than the truncation trick, as it is more fine-grained and works in a lower-level feature space rather than a high-level latent space. Experimental results justify the validity of our claims and the effectiveness of the proposed feature rescaling method. Our code is available at https://github.com/songc42/Feature-proliferation.
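As a hedged illustration of "identifying and modulating risky features", the snippet below damps intermediate channels whose magnitude is far above the layer's typical scale. The threshold rule and where such a rescaling would sit inside StyleGAN are assumptions; the paper's exact criterion may differ.

```python
import torch

def rescale_risky_features(feat, tau=3.0):
    """feat: [B, C, H, W]. Damp channels whose L2 norm exceeds tau x the median channel norm."""
    norms = feat.flatten(2).norm(dim=2)                      # [B, C] per-channel magnitude
    median = norms.median(dim=1, keepdim=True).values        # typical per-sample scale
    scale = torch.clamp(tau * median / (norms + 1e-8), max=1.0)
    return feat * scale[:, :, None, None]                    # leave normal channels untouched
```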

Scalarization for Multi-Task and Multi-Domain Learning at Scale

  • paper_url: http://arxiv.org/abs/2310.08910
  • repo_url: None
  • paper_authors: Amelie Royer, Tijmen Blankevoort, Babak Ehteshami Bejnordi
  • for: This paper focuses on improving the efficiency of training multi-task and multi-domain neural networks.
  • methods: The authors use a combination of theoretical analysis and experimental methods to understand the training dynamics of these networks, and propose a new population-based training method to optimize the scalarization weights.
  • results: The authors show that their proposed method achieves on-par performance with more costly state-of-the-art optimization methods, and provides a more efficient way to train multi-task and multi-domain networks.
    Abstract Training a single model on multiple input domains and/or output tasks allows for compressing information from multiple sources into a unified backbone hence improves model efficiency. It also enables potential positive knowledge transfer across tasks/domains, leading to improved accuracy and data-efficient training. However, optimizing such networks is a challenge, in particular due to discrepancies between the different tasks or domains: Despite several hypotheses and solutions proposed over the years, recent work has shown that uniform scalarization training, i.e., simply minimizing the average of the task losses, yields on-par performance with more costly SotA optimization methods. This raises the issue of how well we understand the training dynamics of multi-task and multi-domain networks. In this work, we first devise a large-scale unified analysis of multi-domain and multi-task learning to better understand the dynamics of scalarization across varied task/domain combinations and model sizes. Following these insights, we then propose to leverage population-based training to efficiently search for the optimal scalarization weights when dealing with a large number of tasks or domains.
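Uniform scalarization itself is simple enough to state in a few lines: every task contributes its loss and the model minimizes their (optionally weighted) average. The sketch below assumes a shared backbone with task-conditioned heads; the model interface is hypothetical.

```python
import torch

def scalarized_step(model, task_batches, loss_fns, optimizer, weights=None):
    """One optimization step over all tasks/domains; uniform weights reproduce plain averaging."""
    optimizer.zero_grad()
    losses = []
    for task, (x, y) in task_batches.items():       # one mini-batch per task or domain
        pred = model(x, task=task)                   # hypothetical task-conditioned forward
        losses.append(loss_fns[task](pred, y))
    losses = torch.stack(losses)
    if weights is None:                              # uniform scalarization
        weights = torch.full_like(losses, 1.0 / len(losses))
    total = (weights * losses).sum()
    total.backward()
    optimizer.step()
    return total.item()

# Population-based training, as proposed, would run several such workers with different
# `weights` vectors and periodically copy/perturb the best-performing ones.
```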

3D Understanding of Deformable Linear Objects: Datasets and Transferability Benchmark

  • paper_url: http://arxiv.org/abs/2310.08904
  • repo_url: None
  • paper_authors: Bare Luka Žagar, Tim Hertel, Mingyu Liu, Ekim Yurtsever, ALois C. Knoll
  • for: Study 3D deformable linear objects, such as blood vessels and wiring harnesses, to improve the understanding and design of the systems they belong to.
  • methods: Introduces the PointWire and PointVessel point cloud datasets and uses them to benchmark state-of-the-art methods on large-scale 3D deformable linear objects.
  • results: Transferability experiments on PointWire and PointVessel show that existing methods have limited generalization capabilities and need further improvement.
    Abstract Deformable linear objects are vastly represented in our everyday lives. It is often challenging even for humans to visually understand them, as the same object can be entangled so that it appears completely different. Examples of deformable linear objects include blood vessels and wiring harnesses, vital to the functioning of their corresponding systems, such as the human body and a vehicle. However, no point cloud datasets exist for studying 3D deformable linear objects. Therefore, we are introducing two point cloud datasets, PointWire and PointVessel. We evaluated state-of-the-art methods on the proposed large-scale 3D deformable linear object benchmarks. Finally, we analyzed the generalization capabilities of these methods by conducting transferability experiments on the PointWire and PointVessel datasets.

Self supervised convolutional kernel based handcrafted feature harmonization: Enhanced left ventricle hypertension disease phenotyping on echocardiography

  • paper_url: http://arxiv.org/abs/2310.08897
  • repo_url: None
  • paper_authors: Jina Lee, Youngtaek Hong, Dawun Jeong, Yeonggul Jang, Sihyeon Jeong, Taekgeun Jung, Yeonyee E. Yoon, Inki Moon, Seung-Ah Lee, Hyuk-Jae Chang
  • for: Medical imaging phenotyping of diseases such as Left Ventricular Hypertrophy and Hypertensive Heart Disease using handcrafted radiomic features.
  • methods: Harmonizes feature extraction through standardized imaging protocols, statistical adjustments, and evaluation of feature robustness.
  • results: Proposes a method combining self-supervised learning (SSL) with convolutional layers that improves data understanding within limited datasets and adapts to diverse data settings; it excels in harmonization evaluation and shows superior performance on Left Ventricular Hypertrophy classification.
    Abstract Radiomics, a medical imaging technique, extracts quantitative handcrafted features from images to predict diseases. Harmonization in those features ensures consistent feature extraction across various imaging devices and protocols. Methods for harmonization include standardized imaging protocols, statistical adjustments, and evaluating feature robustness. Myocardial diseases such as Left Ventricular Hypertrophy (LVH) and Hypertensive Heart Disease (HHD) are diagnosed via echocardiography, but variable imaging settings pose challenges. Harmonization techniques are crucial for applying handcrafted features in disease diagnosis in such scenario. Self-supervised learning (SSL) enhances data understanding within limited datasets and adapts to diverse data settings. ConvNeXt-V2 integrates convolutional layers into SSL, displaying superior performance in various tasks. This study focuses on convolutional filters within SSL, using them as preprocessing to convert images into feature maps for handcrafted feature harmonization. Our proposed method excelled in harmonization evaluation and exhibited superior LVH classification performance compared to existing methods.

Image Cropping under Design Constraints

  • paper_url: http://arxiv.org/abs/2310.08892
  • repo_url: None
  • paper_authors: Takumi Nishiyasu, Wataru Shimoda, Yoichi Sato
  • for: Provide a score-function-based image cropping method that satisfies various design constraints.
  • methods: Uses a score function to evaluate whether cropped results are aesthetically plausible and satisfy design constraints, and explores two derived variants: a proposal-based approach and a heatmap-based approach.
  • results: Experiments show that the proposal-based approach performs better at the same computational cost, while the heatmap-based approach can reach higher scores by spending more computation; balancing aesthetic plausibility and design-constraint satisfaction turns out to be a non-trivial problem.
    Abstract Image cropping is essential in image editing for obtaining a compositionally enhanced image. In display media, image cropping is a prospective technique for automatically creating media content. However, image cropping for media contents is often required to satisfy various constraints, such as an aspect ratio and blank regions for placing texts or objects. We call this problem image cropping under design constraints. To achieve image cropping under design constraints, we propose a score function-based approach, which computes scores for cropped results whether aesthetically plausible and satisfies design constraints. We explore two derived approaches, a proposal-based approach, and a heatmap-based approach, and we construct a dataset for evaluating the performance of the proposed approaches on image cropping under design constraints. In experiments, we demonstrate that the proposed approaches outperform a baseline, and we observe that the proposal-based approach is better than the heatmap-based approach under the same computation cost, but the heatmap-based approach leads to better scores by increasing computation cost. The experimental results indicate that balancing aesthetically plausible regions and satisfying design constraints is not a trivial problem and requires sensitive balance, and both proposed approaches are reasonable alternatives.
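A toy version of the proposal-based variant is sketched below: candidate crops that satisfy the aspect-ratio constraint are enumerated and ranked by an aesthetic score minus a constraint penalty. Both scoring functions are placeholders standing in for the paper's learned score function.

```python
import itertools

def best_crop(image_w, image_h, aesthetic_score, blank_penalty, target_ratio=16 / 9):
    """aesthetic_score and blank_penalty map a crop (x0, y0, w, h) to a scalar."""
    best, best_score = None, float("-inf")
    for scale, fx, fy in itertools.product((0.5, 0.7, 0.9), (0.0, 0.25, 0.5), (0.0, 0.25, 0.5)):
        cw = scale * image_w
        ch = cw / target_ratio                        # aspect ratio satisfied by construction
        if ch > image_h:
            continue
        crop = (fx * (image_w - cw), fy * (image_h - ch), cw, ch)
        score = aesthetic_score(*crop) - blank_penalty(*crop)   # balance aesthetics vs. constraints
        if score > best_score:
            best, best_score = crop, score
    return best
```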

Extending Multi-modal Contrastive Representations

  • paper_url: http://arxiv.org/abs/2310.08884
  • repo_url: https://github.com/mcr-peft/ex-mcr
  • paper_authors: Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao Jin, Zhou Zhao
  • for: Propose a training-efficient, paired-data-free multi-modal contrastive representation (MCR) method to broaden the possibilities of multi-modal learning.
  • methods: Builds on the idea of C-MCR and aligns multiple existing MCR spaces into the same base MCR space to obtain a unified contrastive representation space; the entire pipeline for aligning MCR spaces is further enhanced in terms of training data, architecture, and learning objectives.
  • results: The method learns MCRs without any paired data while preserving the semantic alignment of the original modalities, achieves state-of-the-art performance on multi-modal retrieval and 3D object classification tasks, and qualitative results reveal emergent semantic alignment between the extended modalities (e.g., audio and 3D), highlighting the potential of modality extensibility.
    Abstract Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, the high dependence on large-scale, high-quality paired data and the expensive training costs limit their further development. Inspired by recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to flexibly learn unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces. Specifically, Ex-MCR aligns multiple existing MCRs into the same based MCR, which can effectively preserve the original semantic alignment of the based MCR. Besides, we comprehensively enhance the entire learning pipeline for aligning MCR spaces from the perspectives of training data, architecture, and learning objectives. With the preserved original modality alignment and the enhanced space alignment, Ex-MCR shows superior representation learning performance and excellent modality extensibility. To demonstrate the effectiveness of Ex-MCR, we align the MCR spaces of CLAP (audio-text) and ULIP (3D-vision) into the CLIP (vision-text), leveraging the overlapping text and image modality, respectively. Remarkably, without using any paired data, Ex-MCR learns a 3D-image-text-audio unified contrastive representation, and it achieves state-of-the-art performance on audio-visual, 3D-image, audio-text, visual-text retrieval, and 3D object classification tasks. More importantly, extensive qualitative results further demonstrate the emergent semantic alignment between the extended modalities (e.g., audio and 3D), which highlights the great potential of modality extensibility.
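The alignment step can be pictured as a small projector trained on the overlapping modality only, for example mapping ULIP's image embeddings into CLIP's image space so that the attached 3D branch inherits the alignment. The module and loss below are generic stand-ins, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aligner(nn.Module):
    """MLP that maps embeddings from one MCR space into the base (CLIP) space."""
    def __init__(self, src_dim, dst_dim, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(src_dim, hidden), nn.GELU(), nn.Linear(hidden, dst_dim))

    def forward(self, x):
        return F.normalize(self.mlp(x), dim=-1)

def alignment_loss(src_emb, dst_emb, aligner):
    """src_emb and dst_emb embed the *same images* in the two spaces (no paired text/3D/audio)."""
    return 1.0 - F.cosine_similarity(aligner(src_emb), F.normalize(dst_emb, dim=-1)).mean()
```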

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

  • paper_url: http://arxiv.org/abs/2310.08872
  • repo_url: None
  • paper_authors: Jiayu Xiao, Liang Li, Henglei Lv, Shuhui Wang, Qingming Huang
  • for: Propose a zero-shot grounded text-to-image generation approach that produces images matching the text input without training auxiliary modules or fine-tuning the diffusion model.
  • methods: Proposes Region and Boundary (R&B) aware cross-attention guidance, which gradually modulates the attention maps of the diffusion model during generation so that the synthesized images stay faithful to the textual input while accurately following the layout instructions.
  • results: The method outperforms existing zero-shot grounded text-to-image generation methods by a large margin, both qualitatively and quantitatively, on several benchmarks.
    Abstract Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text-prompts as input. However, these models fail to convey appropriate spatial composition specified by a layout instruction. In this work, we probe into zero-shot grounded T2I generation with diffusion models, that is, generating images corresponding to the input layout information without training auxiliary modules or finetuning diffusion models. We propose a Region and Boundary (R&B) aware cross-attention guidance approach that gradually modulates the attention maps of diffusion model during generative process, and assists the model to synthesize images (1) with high fidelity, (2) highly compatible with textual input, and (3) interpreting layout instructions accurately. Specifically, we leverage the discrete sampling to bridge the gap between consecutive attention maps and discrete layout constraints, and design a region-aware loss to refine the generative layout during diffusion process. We further propose a boundary-aware loss to strengthen object discriminability within the corresponding regions. Experimental results show that our method outperforms existing state-of-the-art zero-shot grounded T2I generation methods by a large margin both qualitatively and quantitatively on several benchmarks.
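Attention-map guidance of this kind is usually implemented as a loss on the cross-attention maps followed by a gradient step on the latent at each denoising iteration. The sketch below shows that generic pattern under stated assumptions; it is not the authors' region-and-boundary formulation, which additionally handles boundaries and discrete layout constraints.

```python
import torch

def region_loss(attn_maps, box_masks):
    """attn_maps: [N, H, W] per grounded token; box_masks: [N, H, W] binary layout masks."""
    attn = attn_maps / (attn_maps.flatten(1).sum(dim=1)[:, None, None] + 1e-8)
    inside = (attn * box_masks).flatten(1).sum(dim=1)   # attention mass falling inside each box
    return (1.0 - inside).mean()

def guide_latent(latent, attn_fn, box_masks, step_size=0.1):
    """attn_fn(latent) -> differentiable cross-attention maps for the grounded tokens."""
    latent = latent.detach().requires_grad_(True)
    loss = region_loss(attn_fn(latent), box_masks)
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - step_size * grad).detach()          # nudge the latent toward the layout
```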

Re-initialization-free Level Set Method via Molecular Beam Epitaxy Equation Regularization for Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.08861
  • repo_url: None
  • paper_authors: Fanghui Song, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Dazhi Zhang
  • for: Propose a high-order level set variational method to improve the accuracy and stability of image segmentation.
  • methods: Uses molecular beam epitaxy (MBE) equation regularization so that the crystal growth process constrains the evolution of the level set function, avoiding re-initialization and regulating the smoothness of the segmented curve; the method also handles noisy images with intensity inhomogeneity.
  • results: Numerical experiments show that the method generates smooth segmentation curves, retains fine segmentation targets, and robustly segments small objects; compared with existing level set methods, the model is state-of-the-art in both accuracy and efficiency.
    Abstract Variational level set method has become a powerful tool in image segmentation due to its ability to handle complex topological changes and maintain continuity and smoothness in the process of evolution. However its evolution process can be unstable, which results in over flatted or over sharpened contours and segmentation failure. To improve the accuracy and stability of evolution, we propose a high-order level set variational segmentation method integrated with molecular beam epitaxy (MBE) equation regularization. This method uses the crystal growth in the MBE process to limit the evolution of the level set function, and thus can avoid the re-initialization in the evolution process and regulate the smoothness of the segmented curve. It also works for noisy images with intensity inhomogeneity, which is a challenge in image segmentation. To solve the variational model, we derive the gradient flow and design scalar auxiliary variable (SAV) scheme coupled with fast Fourier transform (FFT), which can significantly improve the computational efficiency compared with the traditional semi-implicit and semi-explicit scheme. Numerical experiments show that the proposed method can generate smooth segmentation curves, retain fine segmentation targets and obtain robust segmentation results of small objects. Compared to existing level set methods, this model is state-of-the-art in both accuracy and efficiency.

Rank-DETR for High Quality Object Detection

  • paper_url: http://arxiv.org/abs/2310.08854
  • repo_url: https://github.com/leaplabthu/rank-detr
  • paper_authors: Yifan Pu, Weicong Liang, Yiduo Hao, Yuhui Yuan, Yukang Yang, Chao Zhang, Han Hu, Gao Huang
  • for: Improve the accuracy and performance of DETR-based object detectors.
  • methods: Proposes a simple and highly performant series of rank-oriented designs, including a rank-oriented architecture and rank-oriented loss and matching-cost designs, to lower the false positive rate and boost AP under high IoU thresholds.
  • results: Applied to recent SOTA methods (e.g., H-DETR and DINO-DETR) with different backbones (ResNet-50, Swin-T, and Swin-L), the approach obtains strong COCO object detection results, demonstrating its effectiveness.
    Abstract Modern detection transformers (DETRs) use a set of object queries to predict a list of bounding boxes, sort them by their classification confidence scores, and select the top-ranked predictions as the final detection results for the given input image. A highly performant object detector requires accurate ranking for the bounding box predictions. For DETR-based detectors, the top-ranked bounding boxes suffer from less accurate localization quality due to the misalignment between classification scores and localization accuracy, thus impeding the construction of high-quality detectors. In this work, we introduce a simple and highly performant DETR-based object detector by proposing a series of rank-oriented designs, combinedly called Rank-DETR. Our key contributions include: (i) a rank-oriented architecture design that can prompt positive predictions and suppress the negative ones to ensure lower false positive rates, as well as (ii) a rank-oriented loss function and matching cost design that prioritizes predictions of more accurate localization accuracy during ranking to boost the AP under high IoU thresholds. We apply our method to improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong COCO object detection results when using different backbones such as ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our approach. Code is available at \url{https://github.com/LeapLabTHU/Rank-DETR}.
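For readers less familiar with DETR-style post-processing, the snippet below shows the ranking step the rank-oriented designs act on: all (query, class) scores are flattened, sorted, and the top-ranked predictions are kept. This is the standard selection scheme, not the paper's new architecture or loss.

```python
import torch

def rank_and_select(logits, boxes, top_k=100):
    """logits: [num_queries, num_classes]; boxes: [num_queries, 4] in (cx, cy, w, h)."""
    probs = logits.sigmoid()                                    # per-class confidence
    scores, flat_idx = probs.flatten().topk(top_k)              # rank every (query, class) pair
    query_idx = torch.div(flat_idx, probs.shape[1], rounding_mode="floor")
    class_idx = flat_idx % probs.shape[1]
    return scores, class_idx, boxes[query_idx]                  # final detections, best first
```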

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.08826
  • repo_url: None
  • paper_authors: Feng Jiang, Chaoping Tu, Gang Zhang, Jun Li, Hanqing Huang, Junyu Lin, Di Feng, Jian Pu
  • for: Make multi-modal 3D semantic segmentation safe to deploy, even when it runs under weak calibration between LiDAR and cameras.
  • methods: Proposes the CPGNet-LCF multi-modal fusion framework, which inherits the easy deployment and real-time execution of CPGNet and introduces a weak calibration knowledge distillation strategy to improve robustness against weak calibration.
  • results: Achieves state-of-the-art performance on the nuScenes and SemanticKITTI benchmarks, runs in 20 ms per frame on a single Tesla V100 GPU using TensorRT TF16 mode, and benchmarks performance over four weak calibration levels.
    Abstract LiDAR and camera are two critical sensors for multi-modal 3D semantic segmentation and are supposed to be fused efficiently and robustly to promise safety in various real-world scenarios. However, existing multi-modal methods face two key challenges: 1) difficulty with efficient deployment and real-time execution; and 2) drastic performance degradation under weak calibration between LiDAR and cameras. To address these challenges, we propose CPGNet-LCF, a new multi-modal fusion framework extending the LiDAR-only CPGNet. CPGNet-LCF solves the first challenge by inheriting the easy deployment and real-time capabilities of CPGNet. For the second challenge, we introduce a novel weak calibration knowledge distillation strategy during training to improve the robustness against the weak calibration. CPGNet-LCF achieves state-of-the-art performance on the nuScenes and SemanticKITTI benchmarks. Remarkably, it can be easily deployed to run in 20ms per frame on a single Tesla V100 GPU using TensorRT TF16 mode. Furthermore, we benchmark performance over four weak calibration levels, demonstrating the robustness of our proposed approach.
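One way to realize the weak calibration knowledge distillation described above is to let a teacher branch see well-calibrated LiDAR-camera inputs while the student sees a jittered extrinsic calibration, and to match their per-point predictions. The sketch below follows that reading; the perturbation magnitudes and loss form are assumptions.

```python
import torch
import torch.nn.functional as F

def weak_calib_kd_loss(student_logits, teacher_logits, T=2.0):
    """Per-point segmentation logits of shape [N, num_classes]; standard temperature-scaled KD."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def perturb_extrinsics(extrinsics, trans_m=0.05):
    """Simulate weak calibration by jittering the LiDAR-to-camera translation (4x4 matrix)."""
    out = extrinsics.clone()
    out[:3, 3] += torch.randn(3) * trans_m
    return out
```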

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models

  • paper_url: http://arxiv.org/abs/2310.08825
  • repo_url: https://github.com/yuchenliu98/comm
  • paper_authors: Dongsheng Jiang, Yuchen Liu, Songlin Liu, Xiaopeng Zhang, Jin Li, Hongkai Xiong, Qi Tian
  • for: This paper aims to investigate the effectiveness of different vision encoders within Multi-modal Large Language Models (MLLMs) and to propose a simple yet effective feature merging strategy to enhance the visual capabilities of MLLMs.
  • methods: The authors conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs, including CLIP and DINO, and propose a feature merging strategy called COMM that integrates CLIP and DINO with Multi-level features Merging to enhance the visual capabilities of MLLMs.
  • results: The authors evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination, and show that COMM outperforms existing methods, demonstrating its enhanced visual capabilities within MLLMs.
    Abstract Multi-modal Large Language Models (MLLMs) have made significant strides in expanding the capabilities of Large Language Models (LLMs) through the incorporation of visual perception interfaces. Despite the emergence of exciting applications and the availability of diverse instruction tuning data, existing approaches often rely on CLIP or its variants as the visual branch, and merely extract features from the deep layers. However, these methods lack a comprehensive analysis of the visual encoders in MLLMs. In this paper, we conduct an extensive investigation into the effectiveness of different vision encoders within MLLMs. Our findings reveal that the shallow layer features of CLIP offer particular advantages for fine-grained tasks such as grounding and region understanding. Surprisingly, the vision-only model DINO, which is not pretrained with text-image alignment, demonstrates promising performance as a visual branch within MLLMs. By simply equipping it with an MLP layer for alignment, DINO surpasses CLIP in fine-grained related perception tasks. Building upon these observations, we propose a simple yet effective feature merging strategy, named COMM, that integrates CLIP and DINO with Multi-level features Merging, to enhance the visual capabilities of MLLMs. We evaluate COMM through comprehensive experiments on a wide range of benchmarks, including image captioning, visual question answering, visual grounding, and object hallucination. Experimental results demonstrate the superior performance of COMM compared to existing methods, showcasing its enhanced visual capabilities within MLLMs. Code will be made available at https://github.com/YuchenLiu98/COMM.
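The merging strategy can be pictured as projecting multi-level features from both encoders to a common width and combining them with learned weights before they are handed to the LLM. The module below is a hedged sketch of that idea; layer selection and the exact fusion rule in COMM may differ.

```python
import torch
import torch.nn as nn

class MultiLevelMerge(nn.Module):
    """Merge multi-level CLIP and DINO features into one visual token sequence."""
    def __init__(self, clip_dims, dino_dims, out_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in clip_dims + dino_dims])
        self.weights = nn.Parameter(torch.zeros(len(self.proj)))   # learned merge weights

    def forward(self, clip_feats, dino_feats):
        # clip_feats / dino_feats: lists of [B, num_patches, dim] taken from selected layers
        feats = [p(f) for p, f in zip(self.proj, clip_feats + dino_feats)]
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * fi for wi, fi in zip(w, feats))             # [B, num_patches, out_dim]
```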

SAM-guided Unsupervised Domain Adaptation for 3D Segmentation

  • paper_url: http://arxiv.org/abs/2310.08820
  • repo_url: None
  • paper_authors: Xidong Peng, Runnan Chen, Feng Qiao, Lingdong Kong, Youquan Liu, Tai Wang, Xinge Zhu, Yuexin Ma
  • for: Address the challenges of unsupervised domain adaptation (UDA) for 3D segmentation, where the sparse and unordered nature of point clouds makes domain discrepancies pronounced.
  • methods: Leverages the strong generalization of the vision foundation model SAM to unify feature representations of the 3D domains with SAM's feature space, thereby tackling 3D domain adaptation. An innovative hybrid feature augmentation method further exploits the images associated with the point clouds to facilitate knowledge transfer at both the scene and instance levels.
  • results: Evaluated on many widely recognized datasets, the method achieves state-of-the-art performance.
    Abstract Unsupervised domain adaptation (UDA) in 3D segmentation tasks presents a formidable challenge, primarily stemming from the sparse and unordered nature of point cloud data. Especially for LiDAR point clouds, the domain discrepancy becomes obvious across varying capture scenes, fluctuating weather conditions, and the diverse array of LiDAR devices in use. While previous UDA methodologies have often sought to mitigate this gap by aligning features between source and target domains, this approach falls short when applied to 3D segmentation due to the substantial domain variations. Inspired by the remarkable generalization capabilities exhibited by the vision foundation model, SAM, in the realm of image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains and further solves the 3D domain adaptation problem. Specifically, we harness the corresponding images associated with point clouds to facilitate knowledge transfer and propose an innovative hybrid feature augmentation methodology, which significantly enhances the alignment between the 3D feature space and SAM's feature space, operating at both the scene and instance levels. Our method is evaluated on many widely-recognized datasets and achieves state-of-the-art performance.

Incremental Object Detection with CLIP

  • paper_url: http://arxiv.org/abs/2310.08815
  • repo_url: None
  • paper_authors: Yupeng He, Ziyue Huang, Qingjie Liu, Yunhong Wang
  • for: addresses the problem of data ambiguity in incremental object detection, where images may have different labeled bounding boxes in multiple continuous learning stages.
  • methods: uses a language-visual model (CLIP) to generate text feature embeddings for different class sets, and employs broad classes to replace unavailable novel classes in the early learning stage.
  • results: outperforms state-of-the-art methods, particularly for new classes, in various incremental learning settings on the PASCAL VOC 2007 dataset.
    Abstract In the incremental detection task, unlike the incremental classification task, data ambiguity exists due to the possibility of an image having different labeled bounding boxes in multiple continuous learning stages. This phenomenon often impairs the model's ability to learn new classes. However, the forward compatibility of the model is less considered in existing work, which hinders the model's suitability for incremental learning. To overcome this obstacle, we propose to use a language-visual model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then employ the broad classes to replace the unavailable novel classes in the early learning stage to simulate the actual incremental scenario. Finally, we use the CLIP image encoder to identify potential objects in the proposals, which are classified into the background by the model. We modify the background labels of those proposals to known classes and add the boxes to the training set to alleviate the problem of data ambiguity. We evaluate our approach on various incremental learning settings on the PASCAL VOC 2007 dataset, and our approach outperforms state-of-the-art methods, particularly for the new classes.
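The CLIP component described above boils down to zero-shot classification of region proposals against text embeddings of the current class set. The sketch below uses the public OpenAI CLIP package for that step; the prompt template and how crops are obtained from the detector are assumptions.

```python
import clip
import torch

# Loading on CPU keeps the example simple (no fp16 handling); swap in "cuda" as needed.
model, preprocess = clip.load("ViT-B/32", device="cpu")

def classify_proposals(crops, class_names):
    """crops: list of PIL images cropped from proposal boxes; returns a class index per crop."""
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names])
    with torch.no_grad():
        text = model.encode_text(prompts)
        images = torch.stack([preprocess(c) for c in crops])
        vision = model.encode_image(images)
    text = text / text.norm(dim=-1, keepdim=True)
    vision = vision / vision.norm(dim=-1, keepdim=True)
    return (vision @ text.T).argmax(dim=-1)      # cosine-similarity classification
```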

Two-Stage Deep Learning Framework for Quality Assessment of Left Atrial Late Gadolinium Enhanced MRI Images

  • paper_url: http://arxiv.org/abs/2310.08805
  • repo_url: None
  • paper_authors: K M Arefeen Sultan, Benjamin Orkild, Alan Morris, Eugene Kholmovski, Erik Bieging, Eugene Kwan, Ravi Ranjan, Ed DiBella, Shireen Elhabian
  • for: Automatically assess the diagnostic quality of 3D late gadolinium enhancement MRI (LGE-MRI) images used for left atrial fibrosis assessment, in order to improve diagnostic accuracy, efficiency, standardization, and patient outcomes.
  • methods: Uses a two-stage deep learning approach, consisting of a left atrium detector to focus on relevant regions and a deep network that evaluates the diagnostic quality of LGE-MRI images.
  • results: Comparing two training strategies, multi-task learning and pretraining with contrastive learning, pretraining yields about 4% and 9% improvements in F1-score and specificity, respectively, under the limited annotated data typical of medical imaging.
    Abstract Accurate assessment of left atrial fibrosis in patients with atrial fibrillation relies on high-quality 3D late gadolinium enhancement (LGE) MRI images. However, obtaining such images is challenging due to patient motion, changing breathing patterns, or sub-optimal choice of pulse sequence parameters. Automated assessment of LGE-MRI image diagnostic quality is clinically significant as it would enhance diagnostic accuracy, improve efficiency, ensure standardization, and contributes to better patient outcomes by providing reliable and high-quality LGE-MRI scans for fibrosis quantification and treatment planning. To address this, we propose a two-stage deep-learning approach for automated LGE-MRI image diagnostic quality assessment. The method includes a left atrium detector to focus on relevant regions and a deep network to evaluate diagnostic quality. We explore two training strategies, multi-task learning, and pretraining using contrastive learning, to overcome limited annotated data in medical imaging. Contrastive Learning result shows about $4\%$, and $9\%$ improvement in F1-Score and Specificity compared to Multi-Task learning when there's limited data.
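At inference time the two stages chain together roughly as below: the detector crops the left atrium region and the quality network scores only that crop. Both networks and the thresholding are placeholders meant to convey the pipeline, not the authors' implementation.

```python
import torch

def assess_quality(volume, la_detector, quality_net, threshold=0.5):
    """volume: [1, D, H, W] LGE-MRI; returns True if the scan looks diagnostic."""
    d0, d1, h0, h1, w0, w1 = la_detector(volume)        # stage 1: left-atrium bounding box
    roi = volume[:, d0:d1, h0:h1, w0:w1]                # keep only the relevant region
    with torch.no_grad():
        score = torch.sigmoid(quality_net(roi[None]))   # stage 2: diagnostic-quality score
    return bool(score.item() > threshold)
```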