cs.CV - 2023-09-24

Diffeomorphic Multi-Resolution Deep Learning Registration for Applications in Breast MRI

  • paper_url: http://arxiv.org/abs/2309.13777
  • repo_url: None
  • paper_authors: Matthew G. French, Gonzalo D. Maso Talou, Thiranja P. Babarenda Gamage, Martyn P. Nash, Poul M. Nielsen, Anthony J. Doyle, Juan Eugenio Iglesias, Yaël Balbastre, Sean I. Young
  • for: Accurate registration of breast MR images across patient positions, to improve tumour localisation during breast cancer treatment and surgical planning.
  • methods: A learning-based registration approach that is amenable to diffeomorphic constraints and tailored to breast MR images.
  • results: In-silico and in-vivo experiments show that the registration network produces high-quality registrations while providing diffeomorphic guarantees.
    Abstract In breast surgical planning, accurate registration of MR images across patient positions has the potential to improve the localisation of tumours during breast cancer treatment. While learning-based registration methods have recently become the state-of-the-art approach for most medical image registration tasks, these methods have yet to make inroads into breast image registration due to certain difficulties: the lack of rich texture information in breast MR images and the need for the deformations to be diffeomorphic. In this work, we propose learning strategies for breast MR image registration that are amenable to diffeomorphic constraints, together with early experimental results from in-silico and in-vivo experiments. One key contribution of this work is a registration network which produces superior registration outcomes for breast images in addition to providing diffeomorphic guarantees.
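
The abstract does not spell out how the diffeomorphic constraint is enforced. A common strategy in learning-based registration (shown here only as a hedged illustration, not the authors' method) is to predict a stationary velocity field and integrate it by scaling and squaring so the resulting warp is invertible. A minimal PyTorch sketch; all function and variable names are illustrative assumptions:

```python
# Hedged sketch: scaling-and-squaring integration of a stationary velocity
# field (SVF), a standard way to obtain (approximately) diffeomorphic warps
# in learning-based registration. Not the authors' implementation.
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp `image` (B, C, H, W) by a displacement field `flow` (B, 2, H, W) in pixels."""
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W), (x, y) order
    coords = base.unsqueeze(0) + flow                              # absolute sampling positions
    # normalise to [-1, 1] for grid_sample (x first, then y)
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

def svf_to_displacement(velocity, steps=7):
    """Integrate an SVF by scaling and squaring: phi = exp(v)."""
    disp = velocity / (2 ** steps)
    for _ in range(steps):
        disp = disp + warp(disp, disp)   # compose the half-step warp with itself
    return disp

if __name__ == "__main__":
    v = 0.1 * torch.randn(1, 2, 64, 64)    # toy velocity field
    moving = torch.rand(1, 1, 64, 64)      # toy moving image
    phi = svf_to_displacement(v)
    warped = warp(moving, phi)
    print(warped.shape)                    # torch.Size([1, 1, 64, 64])
```

In a registration network, `v` would be the decoder output and the warped moving image would be compared with the fixed image in the training loss.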

Motion Segmentation from a Moving Monocular Camera

  • paper_url: http://arxiv.org/abs/2309.13772
  • repo_url: None
  • paper_authors: Yuxiang Huang, John Zelek
  • for: Identifying and segmenting moving objects from a moving monocular camera, so they can be removed for map building in visual SLAM or SfM.
  • methods: Synergistically fuses two complementary branches of monocular motion segmentation (point-trajectory-based and optical-flow-based methods) at the object level, followed by co-regularized multi-view spectral clustering of the resulting affinity matrices.
  • results: Achieves state-of-the-art performance on the KT3DMoSeg dataset, which contains complex motions and scene structures.
    Abstract Identifying and segmenting moving objects from a moving monocular camera is difficult when there is unknown camera motion, different types of object motions and complex scene structures. To tackle these challenges, we take advantage of two popular branches of monocular motion segmentation approaches: point trajectory based and optical flow based methods, by synergistically fusing these two highly complementary motion cues at object level. By doing this, we are able to model various complex object motions in different scene structures at once, which has not been achieved by existing methods. We first obtain object-specific point trajectories and optical flow mask for each common object in the video, by leveraging the recent foundational models in object recognition, segmentation and tracking. We then construct two robust affinity matrices representing the pairwise object motion affinities throughout the whole video using epipolar geometry and the motion information provided by optical flow. Finally, co-regularized multi-view spectral clustering is used to fuse the two affinity matrices and obtain the final clustering. Our method shows state-of-the-art performance on the KT3DMoSeg dataset, which contains complex motions and scene structures. Being able to identify moving objects allows us to remove them for map building when using visual SLAM or SFM.
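
The final fusion step is co-regularized multi-view spectral clustering over the two affinity matrices. As a simplified stand-in (not the paper's algorithm), the sketch below averages the normalised epipolar and optical-flow affinities and runs standard spectral clustering on the fused matrix; the function names and the fusion rule are assumptions:

```python
# Hedged sketch: fusing two object-level motion affinity matrices and clustering.
# The paper uses co-regularized multi-view spectral clustering; this simplified
# illustration averages the normalised affinities and applies standard spectral
# clustering on the fused matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

def fuse_and_cluster(A_epipolar, A_flow, n_motions):
    """A_epipolar, A_flow: (N, N) pairwise motion affinities for N objects."""
    def normalise(A):
        A = 0.5 * (A + A.T)                 # enforce symmetry
        m = A.max()
        return A / m if m > 0 else A
    A = 0.5 * (normalise(A_epipolar) + normalise(A_flow))
    labels = SpectralClustering(
        n_clusters=n_motions, affinity="precomputed",
        assign_labels="kmeans", random_state=0,
    ).fit_predict(A)
    return labels

# toy example: 6 objects forming two motion groups
rng = np.random.default_rng(0)
block = np.block([[np.ones((3, 3)), 0.05 * np.ones((3, 3))],
                  [0.05 * np.ones((3, 3)), np.ones((3, 3))]])
print(fuse_and_cluster(block + 0.01 * rng.random((6, 6)), block, n_motions=2))
```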

Devil in the Number: Towards Robust Multi-modality Data Filter

  • paper_url: http://arxiv.org/abs/2309.13770
  • repo_url: None
  • paper_authors: Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang
  • for: Improving CLIP performance and reducing training cost by filtering web-scale multi-modality datasets with suitable filtering methods.
  • methods: Analyzes CLIP-score and text-detection based filters and finds a significant proportion of redundant information, such as numbers, in the textual content; experiments on a data subset show that these redundant elements strongly affect CLIP scores, motivating a text-based filter that re-evaluates CLIP scores after removing them.
  • results: The text-based CLIP filter outperforms the top-ranked method on the "small scale" track of DataComp on ImageNet distribution shifts with a 3.6% improvement, and the proposed text-masked filter beats the original CLIP score filter when selecting the top 40% of the data; the analysis of how numbers affect CLIP also informs techniques such as language rewriting.
    Abstract In order to appropriately filter multi-modality data sets on a web-scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs. For instance, the LAION papers employ the CLIP score filter to select data with CLIP scores surpassing a certain threshold. On the other hand, T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Through analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, present in the textual content. Our experiments on a subset of the data unveil the profound impact of these redundant elements on the CLIP scores. A logical approach would involve reevaluating the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the "small scale" of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also demonstrate that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and their handling provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques.
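
A hedged sketch of the core idea: re-score image-caption pairs with CLIP after stripping numeric tokens from the caption, then keep the top fraction. The exact masking rule and threshold used in the paper may differ; the 0.4 keep-ratio mirrors the "top 40%" experiment, and the function names and model choice are illustrative. Requires the openai `clip` package and PIL.

```python
import re
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path: str, caption: str) -> float:
    caption = re.sub(r"\d+", "", caption)               # drop redundant numbers from the text
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
        img_f = img_f / img_f.norm(dim=-1, keepdim=True)
        txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
        return (img_f @ txt_f.T).item()                  # cosine similarity

def filter_top_fraction(samples, keep=0.4):
    """samples: list of (image_path, caption); keep the highest-scoring fraction."""
    scored = sorted(samples, key=lambda s: clip_score(*s), reverse=True)
    return scored[: max(1, int(len(scored) * keep))]
```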

Combining Two Adversarial Attacks Against Person Re-Identification Systems

  • paper_url: http://arxiv.org/abs/2309.13763
  • repo_url: None
  • paper_authors: Eduardo de O. Andrade, Igor Garcia Ballhausen Sampaio, Joris Guérin, José Viterbo
  • for: Studying the security of person re-identification (Re-ID) systems, in particular deep-neural-network-based Re-ID, against adversarial attacks.
  • methods: Combines two adversarial attacks, P-FGSM and Deep Mis-Ranking, applied to two popular Re-ID models, IDE (ResNet-50) and AlignedReID, on the DukeMTMC-ReID, Market-1501 and CUHK03 datasets.
  • results: The combined attacks noticeably degrade Re-ID performance, with the best result a 3.36% drop in the Rank-10 metric for AlignedReID on CUHK03; Dropout during inference is also explored as a defense.
    Abstract The field of Person Re-Identification (Re-ID) has received much attention recently, driven by the progress of deep neural networks, especially for image classification. The problem of Re-ID consists in identifying individuals through images captured by surveillance cameras in different scenarios. Governments and companies are investing a lot of time and money in Re-ID systems for use in public safety and identifying missing persons. However, several challenges remain for successfully implementing Re-ID, such as occlusions and light reflections in people's images. In this work, we focus on adversarial attacks on Re-ID systems, which can be a critical threat to the performance of these systems. In particular, we explore the combination of adversarial attacks against Re-ID models, trying to strengthen the decrease in the classification results. We conduct our experiments on three datasets: DukeMTMC-ReID, Market-1501, and CUHK03. We combine the use of two types of adversarial attacks, P-FGSM and Deep Mis-Ranking, applied to two popular Re-ID models: IDE (ResNet-50) and AlignedReID. The best result demonstrates a decrease of 3.36% in the Rank-10 metric for AlignedReID applied to CUHK03. We also try to use Dropout during the inference as a defense method.
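
For context, both attacks build on gradient-based perturbations; the sketch below shows a single FGSM step, the gradient-sign core that P-FGSM extends (Deep Mis-Ranking instead optimizes a ranking loss). This is a generic illustration on a classifier, not the paper's combined attack.

```python
# Hedged sketch: one FGSM perturbation step on a generic classifier.
import torch
import torch.nn.functional as F

def fgsm_step(model, images, labels, eps=8 / 255):
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    adv = images + eps * grad.sign()     # move along the sign of the loss gradient
    return adv.clamp(0, 1).detach()      # keep a valid image range
```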

Look Ma, no code: fine tuning nnU-Net for the AutoPET II challenge by only adjusting its JSON plans

  • paper_url: http://arxiv.org/abs/2309.13747
  • repo_url: None
  • paper_authors: Fabian Isensee, Klaus H. Maier-Hein
  • for: Improving nnU-Net performance for the AutoPET II challenge without writing any code.
  • methods: Modifies only nnU-Net's 'nnUNetPlans.json' file: switching to a U-Net with a residual encoder and increasing the batch size and patch size.
  • results: Substantially outperforms the automatically configured nnU-Net baseline (5-fold cross-validation Dice score of 65.14 vs 33.28) at the cost of higher training compute; the final submission ensembles the two most promising configurations and ranked first on the preliminary test set at submission time.
    Abstract We participate in the AutoPET II challenge by modifying nnU-Net only through its easy to understand and modify 'nnUNetPlans.json' file. By switching to a UNet with residual encoder, increasing the batch size and increasing the patch size we obtain a configuration that substantially outperforms the automatically configured nnU-Net baseline (5-fold cross-validation Dice score of 65.14 vs 33.28) at the expense of increased compute requirements for model training. Our final submission ensembles the two most promising configurations. At the time of submission our method ranks first on the preliminary test set.
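
A hedged sketch of what "only adjusting its JSON plans" can look like in practice: load the generated plans file, switch the encoder class, and enlarge batch and patch size. The key names follow the nnU-Net v2 plans layout as commonly generated ("configurations" -> "3d_fullres" -> "batch_size", "patch_size", "UNet_class_name") and should be checked against your own nnUNetPlans.json; the concrete values are illustrative, not the authors' submitted configuration.

```python
import json

plans_path = "nnUNetPlans.json"   # inside the preprocessed dataset folder (assumed location)
with open(plans_path) as f:
    plans = json.load(f)

cfg = plans["configurations"]["3d_fullres"]
cfg["UNet_class_name"] = "ResidualEncoderUNet"    # plain conv encoder -> residual encoder
cfg["batch_size"] = cfg["batch_size"] * 2         # larger batch (needs more GPU memory)
cfg["patch_size"] = [int(1.25 * p) for p in cfg["patch_size"]]  # larger patch (illustrative scaling)

with open("nnUNetPlans_custom.json", "w") as f:
    json.dump(plans, f, indent=2)
```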

DROP: Dynamics Responses from Human Motion Prior and Projective Dynamics

  • paper_url: http://arxiv.org/abs/2309.13742
  • repo_url: None
  • paper_authors: Yifeng Jiang, Jungdam Won, Yuting Ye, C. Karen Liu
  • for: Synthesizing realistic human movements that respond dynamically to the environment, with applications in computer vision, sports and healthcare, such as motion prediction and data augmentation.
  • methods: DROP, a framework that models the dynamics responses of humans by combining a kinematics-based generative motion prior with projective dynamics.
  • results: Extensive evaluations across different motion tasks and various physical perturbations demonstrate the scalability and diversity of the responses.
    Abstract Synthesizing realistic human movements, dynamically responsive to the environment, is a long-standing objective in character animation, with applications in computer vision, sports, and healthcare, for motion prediction and data augmentation. Recent kinematics-based generative motion models offer impressive scalability in modeling extensive motion data, albeit without an interface to reason about and interact with physics. While simulator-in-the-loop learning approaches enable highly physically realistic behaviors, the challenges in training often affect scalability and adoption. We introduce DROP, a novel framework for modeling Dynamics Responses of humans using generative mOtion prior and Projective dynamics. DROP can be viewed as a highly stable, minimalist physics-based human simulator that interfaces with a kinematics-based generative motion prior. Utilizing projective dynamics, DROP allows flexible and simple integration of the learned motion prior as one of the projective energies, seamlessly incorporating control provided by the motion prior with Newtonian dynamics. Serving as a model-agnostic plug-in, DROP enables us to fully leverage recent advances in generative motion models for physics-based motion synthesis. We conduct extensive evaluations of our model across different motion tasks and various physical perturbations, demonstrating the scalability and diversity of responses.

MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP

  • paper_url: http://arxiv.org/abs/2309.13716
  • repo_url: None
  • paper_authors: Prajwal Ganugula, Y S S S Santosh Kumar, N K Sagar Reddy, Prabhath Chellingi, Avinash Thakur, Neeraj Kasera, C Shyam Anand
  • for: A text-prompt-driven, multi-object segmented arbitrary stylization method (MOSAIC) that gives finer control over which objects are stylized and how.
  • methods: Text-based segmentation and stylization modules built on a vision transformer architecture, enabling fine-grained, per-object style control guided by the input prompt.
  • results: Generates high-quality stylized images, extends to arbitrary objects and styles, and generalizes to unseen object classes while enhancing control over stylization.
    Abstract Style transfer driven by text prompts paved a new path for creatively stylizing the images without collecting an actual style image. Despite having promising results, with text-driven stylization, the user has no control over the stylization. If a user wants to create an artistic image, the user requires fine control over the stylization of various entities individually in the content image, which is not addressed by the current state-of-the-art approaches. On the other hand, diffusion style transfer methods also suffer from the same issue because the regional stylization control over the stylized output is ineffective. To address this problem, We propose a new method Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), that can apply styles to different objects in the image based on the context extracted from the input prompt. Text-based segmentation and stylization modules which are based on vision transformer architecture, were used to segment and stylize the objects. Our method can extend to any arbitrary objects, styles and produce high-quality images compared to the current state of art methods. To our knowledge, this is the first attempt to perform text-guided arbitrary object-wise stylization. We demonstrate the effectiveness of our approach through qualitative and quantitative analysis, showing that it can generate visually appealing stylized images with enhanced control over stylization and the ability to generalize to unseen object classes.

Sound-Print: Generalised Face Presentation Attack Detection using Deep Representation of Sound Echoes

  • paper_url: http://arxiv.org/abs/2309.13704
  • repo_url: None
  • paper_authors: Raghavendra Ramachandra, Jag Mohan Singh, Sushma Venkatesh
  • for: An acoustic-echo-based face presentation attack detection (PAD) method for smartphone face recognition systems.
  • methods: Analyzes and models the reflection profiles of a transmitted acoustic signal, using a novel wide-pulse transmission signal that models background noise before transmission to increase the signal-to-noise ratio.
  • results: Experiments on the newly collected Acoustic Sound Echo Dataset (ASED) show robust detection of different presentation attacks, including print (two types), display, and silicone face-mask attacks.
    Abstract Facial biometrics are widely deployed in smartphone-based applications because of their usability and increased verification accuracy in unconstrained scenarios. The evolving applications of smartphone-based facial recognition have also increased Presentation Attacks (PAs), where an attacker can present a Presentation Attack Instrument (PAI) to maliciously gain access to the application. Because the materials used to generate PAI are not deterministic, the detection of unknown presentation attacks is challenging. In this paper, we present an acoustic echo-based face Presentation Attack Detection (PAD) on a smartphone in which the PAs are detected based on the reflection profiles of the transmitted signal. We propose a novel transmission signal based on the wide pulse that allows us to model the background noise before transmitting the signal and increase the Signal-to-Noise Ratio (SNR). The received signal reflections were processed to remove background noise and accurately represent reflection characteristics. The reflection profiles of the bona fide and PAs are different owing to the different reflection characteristics of the human skin and artefact materials. Extensive experiments are presented using the newly collected Acoustic Sound Echo Dataset (ASED) with 4807 samples captured from bona fide and four different types of PAIs, including print (two types), display, and silicone face-mask attacks. The obtained results indicate the robustness of the proposed method for detecting unknown face presentation attacks.

Video Adverse-Weather-Component Suppression Network via Weather Messenger and Adversarial Backpropagation

  • paper_url: http://arxiv.org/abs/2309.13700
  • repo_url: https://github.com/scott-yjyang/ViWS-Net
  • paper_authors: Yijun Yang, Angelica I. Aviles-Rivero, Huazhu Fu, Ye Liu, Weiming Wang, Lei Zhu
  • for: Restoring videos degraded by any weather condition
  • methods: Video adverse-weather-component suppression network (ViWS-Net), including a weather-agnostic video transformer encoder, long short-term temporal modeling mechanism, weather discriminator, and messenger-driven video transformer decoder
  • results: Outperforms current state-of-the-art methods in restoring videos degraded by any weather condition, on benchmark datasets and real-world weather videos.
    Abstract Although convolutional neural networks (CNNs) have been proposed to remove adverse weather conditions in single images using a single set of pre-trained weights, they fail to restore weather videos due to the absence of temporal information. Furthermore, existing methods for removing adverse weather conditions (e.g., rain, fog, and snow) from videos can only handle one type of adverse weather. In this work, we propose the first framework for restoring videos from all adverse weather conditions by developing a video adverse-weather-component suppression network (ViWS-Net). To achieve this, we first devise a weather-agnostic video transformer encoder with multiple transformer stages. Moreover, we design a long short-term temporal modeling mechanism for weather messenger to early fuse input adjacent video frames and learn weather-specific information. We further introduce a weather discriminator with gradient reversion, to maintain the weather-invariant common information and suppress the weather-specific information in pixel features, by adversarially predicting weather types. Finally, we develop a messenger-driven video transformer decoder to retrieve the residual weather-specific feature, which is spatiotemporally aggregated with hierarchical pixel features and refined to predict the clean target frame of input videos. Experimental results, on benchmark datasets and real-world weather videos, demonstrate that our ViWS-Net outperforms current state-of-the-art methods in terms of restoring videos degraded by any weather condition.
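
The "adversarial backpropagation" for the weather discriminator is typically realised with a gradient reversal layer: features pass through unchanged in the forward pass and their gradients are negated in the backward pass, pushing the encoder towards weather-invariant features. A generic PyTorch sketch, not the authors' code:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # reversed gradient w.r.t. x, none for lambd

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage (illustrative): minimising the discriminator loss on reversed features
# pushes the encoder towards weather-invariant representations.
# feats = encoder(frames)
# weather_logits = weather_discriminator(grad_reverse(feats))
```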

Causal-DFQ: Causality Guided Data-free Network Quantization

  • paper_url: http://arxiv.org/abs/2309.13682
  • repo_url: https://github.com/42shawn/causal-dfq
  • paper_authors: Yuzhang Shang, Bingxin Xu, Gaowen Liu, Ramana Kompella, Yan Yan
  • for: Quantizing deep neural networks when training data cannot be provided in real-world settings, for example for privacy or security reasons.
  • methods: Builds a causal graph of the data generation and of the discrepancy reduction between the pre-trained and quantized models, and proposes Causality-guided Data-free Quantization (Causal-DFQ) with a content-style-decoupled generator and a discrepancy reduction loss, removing the reliance on data.
  • results: Experiments show that Causal-DFQ compresses deep neural networks effectively without any training data while retaining competitive accuracy.
    Abstract Model quantization, which aims to compress deep neural networks and accelerate inference speed, has greatly facilitated the development of cumbersome models on mobile and edge devices. There is a common assumption in quantization methods from prior works that training data is available. In practice, however, this assumption cannot always be fulfilled due to reasons of privacy and security, rendering these methods inapplicable in real-life situations. Thus, data-free network quantization has recently received significant attention in neural network compression. Causal reasoning provides an intuitive way to model causal relationships to eliminate data-driven correlations, making causality an essential component of analyzing data-free problems. However, causal formulations of data-free quantization are inadequate in the literature. To bridge this gap, we construct a causal graph to model the data generation and discrepancy reduction between the pre-trained and quantized models. Inspired by the causal understanding, we propose the Causality-guided Data-free Network Quantization method, Causal-DFQ, to eliminate the reliance on data via approaching an equilibrium of causality-driven intervened distributions. Specifically, we design a content-style-decoupled generator, synthesizing images conditioned on the relevant and irrelevant factors; then we propose a discrepancy reduction loss to align the intervened distributions of the pre-trained and quantized models. It is worth noting that our work is the first attempt towards introducing causality to data-free quantization problem. Extensive experiments demonstrate the efficacy of Causal-DFQ. The code is available at https://github.com/42Shawn/Causal-DFQ.

BdSpell: A YOLO-based Real-time Finger Spelling System for Bangla Sign Language

  • paper_url: http://arxiv.org/abs/2309.13676
  • repo_url: None
  • paper_authors: Naimul Haque, Meraj Serker, Tariq Bin Bashar
  • for: Improving the accessibility and inclusivity of Bangla Sign Language (BdSL) interpretation and promoting linguistic equity for the BdSL community.
  • methods: A real-time fingerspelling system based on the YOLOv5 architecture that uses specified rules and numerical classes as triggers to efficiently generate hidden and compound characters, reducing the burden on users.
  • results: Character spelling in 1.32 seconds with 98% accuracy; the YOLOv5 model, trained on 9147 images, reaches a mean Average Precision (mAP) of 96.4%.
    Abstract In the domain of Bangla Sign Language (BdSL) interpretation, prior approaches often imposed a burden on users, requiring them to spell words without hidden characters, which were subsequently corrected using Bangla grammar rules due to the missing classes in BdSL36 dataset. However, this method posed a challenge in accurately guessing the incorrect spelling of words. To address this limitation, we propose a novel real-time finger spelling system based on the YOLOv5 architecture. Our system employs specified rules and numerical classes as triggers to efficiently generate hidden and compound characters, eliminating the necessity for additional classes and significantly enhancing user convenience. Notably, our approach achieves character spelling in an impressive 1.32 seconds with a remarkable accuracy rate of 98%. Furthermore, our YOLOv5 model, trained on 9147 images, demonstrates an exceptional mean Average Precision (mAP) of 96.4%. These advancements represent a substantial progression in augmenting BdSL interpretation, promising increased inclusivity and accessibility for the linguistic minority. This innovative framework, characterized by compatibility with existing YOLO versions, stands as a transformative milestone in enhancing communication modalities and linguistic equity within the Bangla Sign Language community.

Joint inversion of Time-Lapse Surface Gravity and Seismic Data for Monitoring of 3D CO$_2$ Plumes via Deep Learning

  • paper_url: http://arxiv.org/abs/2310.04430
  • repo_url: None
  • paper_authors: Adrian Celaya, Mauricio Araya-Polo
  • for: Predicting subsurface CO2 plumes as a complementary tool for monitoring CO2 sequestration deployments.
  • methods: A fully 3D, deep-learning-based joint inversion of time-lapse surface gravity and seismic data for reconstructing subsurface density and velocity models.
  • results: The joint inversion outperforms deep-learning-based gravity-only and seismic-only inversion models, with improved density and velocity reconstruction, accurate segmentation and higher R-squared coefficients, indicating that deep-learning-based joint inversion is an effective tool for CO2 storage monitoring.
    Abstract We introduce a fully 3D, deep learning-based approach for the joint inversion of time-lapse surface gravity and seismic data for reconstructing subsurface density and velocity models. The target application of this proposed inversion approach is the prediction of subsurface CO2 plumes as a complementary tool for monitoring CO2 sequestration deployments. Our joint inversion technique outperforms deep learning-based gravity-only and seismic-only inversion models, achieving improved density and velocity reconstruction, accurate segmentation, and higher R-squared coefficients. These results indicate that deep learning-based joint inversion is an effective tool for CO$_2$ storage monitoring. Future work will focus on validating our approach with larger datasets, simulations with other geological storage sites, and ultimately field data.

OneSeg: Self-learning and One-shot Learning based Single-slice Annotation for 3D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.13671
  • repo_url: None
  • paper_authors: Yixuan Wu, Bo Zheng, Jintai Chen, Danny Z. Chen, Jian Wu
  • for: Maintaining 3D medical image segmentation accuracy while greatly reducing data annotation effort.
  • methods: A self-learning and one-shot learning based framework that requires annotating only one slice of each 3D image: a reconstruction network learns semantic correspondence among 2D slices, and the single annotated slice is propagated through the trained network.
  • results: Achieves performance comparable to fully supervised methods with less than 1% of the annotated data, and generalizes well on several out-of-distribution test sets.
    Abstract As deep learning methods continue to improve medical image segmentation performance, data annotation is still a big bottleneck due to the labor-intensive and time-consuming burden on medical experts, especially for 3D images. To significantly reduce annotation efforts while attaining competitive segmentation accuracy, we propose a self-learning and one-shot learning based framework for 3D medical image segmentation by annotating only one slice of each 3D image. Our approach takes two steps: (1) self-learning of a reconstruction network to learn semantic correspondence among 2D slices within 3D images, and (2) representative selection of single slices for one-shot manual annotation and propagating the annotated data with the well-trained reconstruction network. Extensive experiments verify that our new framework achieves comparable performance with less than 1% annotated data compared with fully supervised methods and generalizes well on several out-of-distribution testing sets.

Adaptation of the super resolution SOTA for Art Restoration in camera capture images

  • paper_url: http://arxiv.org/abs/2309.13655
  • repo_url: https://github.com/naagar/art_restoration_dm
  • paper_authors: Sandeep Nagar, Abhinaba Bala, Sai Amrit Patnaik
  • for: An automated, computer-vision-based art restoration method that enhances and reconstructs degraded artworks while preserving their original characteristics and artifacts, avoiding the time and expertise required by traditional restoration.
  • methods: Adapts the current diffusion-model (DM) based image super-resolution state of the art and fine-tunes it for art restoration, handling noise, blur, scratches, fading and other common degradations; training on multiple datasets makes it robust.
  • results: Fine-tuning a single super-resolution model handles multiple degradation types, instead of fine-tuning a separate model per degradation. Code: https://github.com/Naagar/art_restoration_DM
    Abstract Preserving cultural heritage is of paramount importance. In the domain of art restoration, developing a computer vision model capable of effectively restoring deteriorated images of art pieces was difficult, but now we have a good computer vision state-of-art. Traditional restoration methods are often time-consuming and require extensive expertise. The aim of this work is to design an automated solution based on computer vision models that can enhance and reconstruct degraded artworks, improving their visual quality while preserving their original characteristics and artifacts. The model should handle a diverse range of deterioration types, including but not limited to noise, blur, scratches, fading, and other common forms of degradation. We adapt the current state-of-art for the image super-resolution based on the Diffusion Model (DM) and fine-tune it for Image art restoration. Our results show that instead of fine-tunning multiple different models for different kinds of degradation, fine-tuning one super-resolution. We train it on multiple datasets to make it robust. code link: https://github.com/Naagar/art_restoration_DM

ILNet: Low-level Matters for Salient Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2309.13646
  • repo_url: https://github.com/li-haoqing/ilnet
  • paper_authors: Haoqing Li, Jinfu Yang, Runshi Wang, Yifei Xu
  • for: ILNet, an infrared low-level network for salient infrared small target detection that strengthens the representation of small-target features.
  • methods: A lightweight Interactive Polarized Orthogonal Fusion (IPOF) module that integrates important low-level features from shallow layers into the deep layers; Dynamic One-Dimensional Aggregation (DODA) layers that adaptively aggregate low-dimensional information according to the number of input channels; and a Representative Block (RB), inspired by ensemble learning, that dynamically allocates weights to shallow and deep layers.
  • results: Achieves 78.22% nIoU and 1.33e-6 Fa on NUAA-SIRST and 68.91% nIoU and 3.23e-6 Fa on IRSTD-1K, outperforming other state-of-the-art methods, with larger gains as the data volume increases.
    Abstract Infrared small target detection is a technique for finding small targets from infrared clutter background. Due to the dearth of high-level semantic information, small infrared target features are weakened in the deep layers of the CNN, which underachieves the CNN's representation ability. To address the above problem, in this paper, we propose an infrared low-level network (ILNet) that considers infrared small targets as salient areas with little semantic information. Unlike other SOTA methods, ILNet pays greater attention to low-level information instead of treating them equally. A new lightweight feature fusion module, named Interactive Polarized Orthogonal Fusion module (IPOF), is proposed, which integrates more important low-level features from the shallow layers into the deep layers. A Dynamic One-Dimensional Aggregation layers (DODA) are inserted into the IPOF, to dynamically adjust the aggregation of low dimensional information according to the number of input channels. In addition, the idea of ensemble learning is used to design a Representative Block (RB) to dynamically allocate weights for shallow and deep layers. Experimental results on the challenging NUAA-SIRST (78.22% nIoU and 1.33e-6 Fa) and IRSTD-1K (68.91% nIoU and 3.23e-6 Fa) dataset demonstrate that the proposed ILNet can get better performances than other SOTA methods. Moreover, ILNet can obtain a greater improvement with the increasement of data volume. Training code are available at https://github.com/Li-Haoqing/ILNet.

Changes-Aware Transformer: Learning Generalized Changes Representation

  • paper_url: http://arxiv.org/abs/2309.13619
  • repo_url: None
  • paper_authors: Dan Wang, Licheng Jiao, Jie Chen, Shuyuan Yang, Fang Liu
  • for: Improving change detection (CD) by learning a generalized representation of diverse changes directly in the difference feature space, and proposing a Changes-Aware Transformer (CAT) to refine difference features.
  • methods: The CAT refines difference features with stacked cosine cross-attention layers and self-attention layers, so that changed pixels become closer to each other in the difference feature space; it is compatible with various backbones and existing CD methods.
  • results: Achieves state-of-the-art performance on a remote sensing CD dataset and a street scene CD dataset, with excellent generalization.
    Abstract Difference features obtained by comparing the images of two periods play an indispensable role in the change detection (CD) task. However, a pair of bi-temporal images can exhibit diverse changes, which may cause various difference features. Identifying changed pixels with differing difference features as the same category is thus a challenge for CD. Most current methods acquire distinctive difference features in implicit ways like enhancing image representation or supervision information. Nevertheless, informative image features only guarantee object semantics are modeled and cannot guarantee that changed pixels have similar semantics in the difference feature space and are distinct from those unchanged ones. In this work, the generalized representation of various changes is learned straightforwardly in the difference feature space, and a novel Changes-Aware Transformer (CAT) for refining difference features is proposed. This generalized representation can perceive which pixels are changed and which are unchanged and further guide the update of pixels' difference features. CAT effectively accomplishes this refinement process through the stacked cosine cross-attention layer and self-attention layer. After refinement, the changed pixels in the difference feature space are closer to each other, which facilitates change detection. In addition, CAT is compatible with various backbone networks and existing CD methods. Experiments on remote sensing CD data set and street scene CD data set show that our method achieves state-of-the-art performance and has excellent generalization.
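
The building block named in the abstract, cosine cross-attention, normalises queries and keys so the attention logits become cosine similarities. A hedged PyTorch sketch of such a layer (the actual CAT layer, its inputs and its temperature handling are defined in the paper; all names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineCrossAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.tau = nn.Parameter(torch.tensor(10.0))   # learnable inverse temperature

    def forward(self, x_query, x_context):
        # x_query: (B, Nq, C) tokens to refine; x_context: (B, Nk, C) tokens attended to
        q = F.normalize(self.q(x_query), dim=-1)
        k = F.normalize(self.k(x_context), dim=-1)
        v = self.v(x_context)
        attn = torch.softmax(self.tau * q @ k.transpose(-2, -1), dim=-1)  # cosine-similarity logits
        return attn @ v

# toy usage
layer = CosineCrossAttention(dim=64)
out = layer(torch.randn(2, 100, 64), torch.randn(2, 50, 64))
print(out.shape)   # torch.Size([2, 100, 64])
```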

VisionKG: Unleashing the Power of Visual Datasets via Knowledge Graph

  • paper_url: http://arxiv.org/abs/2309.13610
  • repo_url: None
  • paper_authors: Jicheng Yuan, Anh Le-Tuan, Manh Nguyen-Duc, Trung-Kien Tran, Manfred Hauswirth, Danh Le-Phuoc
  • for: A comprehensive resource for computer vision data that unifies visual datasets across diverse sources, tasks and taxonomies.
  • methods: Uses knowledge graphs and Semantic Web technologies to interlink, organize and manage heterogeneous visual datasets, providing simple access and querying regardless of format or taxonomy, with retrieval and exploration services exposed via SPARQL.
  • results: The resulting Vision Knowledge Graph (VisionKG) contains 519 million RDF triples describing about 40 million entities; with 30 datasets and four popular CV tasks integrated, its usefulness is demonstrated across various CV-pipeline scenarios.
    Abstract The availability of vast amounts of visual data with heterogeneous features is a key factor for developing, testing, and benchmarking of new computer vision (CV) algorithms and architectures. Most visual datasets are created and curated for specific tasks or with limited image data distribution for very specific situations, and there is no unified approach to manage and access them across diverse sources, tasks, and taxonomies. This not only creates unnecessary overheads when building robust visual recognition systems, but also introduces biases into learning systems and limits the capabilities of data-centric AI. To address these problems, we propose the Vision Knowledge Graph (VisionKG), a novel resource that interlinks, organizes and manages visual datasets via knowledge graphs and Semantic Web technologies. It can serve as a unified framework facilitating simple access and querying of state-of-the-art visual datasets, regardless of their heterogeneous formats and taxonomies. One of the key differences between our approach and existing methods is that ours is knowledge-based rather than metadata-based. It enhances the enrichment of the semantics at both image and instance levels and offers various data retrieval and exploratory services via SPARQL. VisionKG currently contains 519 million RDF triples that describe approximately 40 million entities, and are accessible at https://vision.semkg.org and through APIs. With the integration of 30 datasets and four popular CV tasks, we demonstrate its usefulness across various scenarios when working with CV pipelines.
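
Since VisionKG exposes retrieval via SPARQL, a query can be issued from Python with SPARQLWrapper. The endpoint path, namespace and property names below are assumptions for illustration only; the real schema and endpoint should be taken from https://vision.semkg.org and its documentation.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://vision.semkg.org/sparql"   # hypothetical endpoint path

query = """
PREFIX vkg: <http://vision.semkg.org/onto/>    # hypothetical namespace
SELECT ?image ?label WHERE {
  ?ann vkg:onImage ?image ;
       vkg:hasLabel ?label .
} LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["image"]["value"], row["label"]["value"])
```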

Vulnerabilities in Video Quality Assessment Models: The Challenge of Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2309.13609
  • repo_url: https://github.com/gzhu-dvl/attackvqa
  • paper_authors: Ao-Xiang Zhang, Yu Ran, Weixuan Tang, Yuan-Gen Wang
  • for: This paper focuses on evaluating the robustness of No-Reference Video Quality Assessment (NR-VQA) models against adversarial attacks, and proposing a patch-based random search method for black-box attacks.
  • methods: The paper uses Convolutional Neural Networks (CNNs) and Transformers as the base models for NR-VQA, and proposes a novel loss function called Score-Reversed Boundary Loss to evaluate the robustness of these models against adversarial attacks.
  • results: The paper evaluates the robustness of NR-VQA models against adversarial attacks using the proposed Score-Reversed Boundary Loss, and shows that the method can launch both white-box and black-box attacks effectively and imperceptibly.
    Abstract No-Reference Video Quality Assessment (NR-VQA) plays an essential role in improving the viewing experience of end-users. Driven by deep learning, recent NR-VQA models based on Convolutional Neural Networks (CNNs) and Transformers have achieved outstanding performance. To build a reliable and practical assessment system, it is of great necessity to evaluate their robustness. However, such issue has received little attention in the academic community. In this paper, we make the first attempt to evaluate the robustness of NR-VQA models against adversarial attacks, and propose a patch-based random search method for black-box attack. Specifically, considering both the attack effect on quality score and the visual quality of adversarial video, the attack problem is formulated as misleading the estimated quality score under the constraint of just-noticeable difference (JND). Built upon such formulation, a novel loss function called Score-Reversed Boundary Loss is designed to push the adversarial video's estimated quality score far away from its ground-truth score towards a specific boundary, and the JND constraint is modeled as a strict $L_2$ and $L_\infty$ norm restriction. By this means, both white-box and black-box attacks can be launched in an effective and imperceptible manner. The source code is available at https://github.com/GZHU-DVL/AttackVQA.
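
A hedged sketch of the white-box attack idea: iteratively perturb the video so the predicted quality score moves towards a boundary value on the far side of its ground-truth score, with an L_inf budget standing in for the JND constraint. The actual Score-Reversed Boundary Loss and the patch-based random search for the black-box setting are defined in the paper; the function, parameters and step sizes below are illustrative.

```python
import torch

def attack_vqa(model, video, boundary, eps=2 / 255, alpha=0.5 / 255, steps=10):
    """video: (B, T, C, H, W) in [0, 1]; boundary: target score chosen far from the ground truth."""
    adv = video.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        pred = model(adv)                              # predicted quality scores
        loss = ((pred - boundary) ** 2).mean()         # pull the prediction towards the boundary
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() - alpha * grad.sign()       # gradient step that minimises the loss
        delta = (adv - video).clamp(-eps, eps)         # JND proxy: keep perturbation in an L_inf ball
        adv = (video + delta).clamp(0, 1)
    return adv.detach()
```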

FaceAtt: Enhancing Image Captioning with Facial Attributes for Portrait Images

  • paper_url: http://arxiv.org/abs/2309.13601
  • repo_url: None
  • paper_authors: Naimul Haque, Iffat Labiba, Sadia Akter
  • for: This paper focuses on developing a novel approach to attribute-focused image captioning that accurately depicts facial attributes within images.
  • methods: The FaceAtt model uses deep learning techniques and annotated attributes of portraits as supplementary prior knowledge to improve caption quality.
  • results: The FaceAtt model yields a subtle yet discernible enhancement in resulting caption scores, demonstrating the effectiveness of incorporating additional attribute vectors during training.
    Abstract Automated image caption generation is a critical area of research that enhances accessibility and understanding of visual content for diverse audiences. In this study, we propose the FaceAtt model, a novel approach to attribute-focused image captioning that emphasizes the accurate depiction of facial attributes within images. FaceAtt automatically detects and describes a wide range of attributes, including emotions, expressions, pointed noses, fair skin tones, hair textures, attractiveness, and approximate age ranges. Leveraging deep learning techniques, we explore the impact of different image feature extraction methods on caption quality and evaluate our model's performance using metrics such as BLEU and METEOR. Our FaceAtt model leverages annotated attributes of portraits as supplementary prior knowledge for our portrait images before captioning. This innovative addition yields a subtle yet discernible enhancement in the resulting scores, exemplifying the potency of incorporating additional attribute vectors during training. Furthermore, our research contributes to the broader discourse on ethical considerations in automated captioning. This study sets the stage for future research in refining attribute-focused captioning techniques, with a focus on enhancing linguistic coherence, addressing biases, and accommodating diverse user needs.

Multi-Dimensional Hyena for Spatial Inductive Bias

  • paper_url: http://arxiv.org/abs/2309.13600
  • repo_url: None
  • paper_authors: Itamar Zimerman, Lior Wolf
  • for: A data-efficient vision transformer that does not rely on self-attention, built on a novel multi-axis generalization of the recent Hyena layer.
  • methods: Proposes the Hyena N-D layer, a generalization of Hyena to multiple axes, together with several alternative ways of obtaining this generalization, analyzed from both empirical and theoretical perspectives.
  • results: The Hyena N-D layer boosts the performance of various vision transformer architectures (ViT, Swin, DeiT) across multiple datasets; in the small-dataset regime the Hyena-based ViT compares favorably with ViT variants designed specifically for that setting; and a hybrid that uses Hyena N-D for the first layers followed by conventional attention consistently improves performance.
    Abstract In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, the advantage of these transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. Our empirical findings indicate that the proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge, i.e., working with small datasets or incorporating image-specific inductive bias into the self-attention mechanism. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.

On the Posterior Distribution in Denoising: Application to Uncertainty Quantification

  • paper_url: http://arxiv.org/abs/2309.13598
  • repo_url: https://github.com/HilaManor/GaussianDenoisingPosterior
  • paper_authors: Hila Manor, Tomer Michaeli
  • for: Denoisers and their applications, from noise suppression in low-grade imaging sensors to score-based generative models built on Tweedie's formula.
  • methods: Derives a fundamental relation between the higher-order central moments of the Gaussian-denoising posterior distribution and the higher-order derivatives of the posterior mean, and uses it for uncertainty quantification of pre-trained denoisers.
  • results: Enables fast, memory-efficient computation of the principal components of the posterior distribution for any desired image region, and approximation of the full marginal distribution along those (or other) one-dimensional directions, without training or fine-tuning the denoiser.
    Abstract Denoisers play a central role in many applications, from noise suppression in low-grade imaging sensors, to empowering score-based generative models. The latter category of methods makes use of Tweedie's formula, which links the posterior mean in Gaussian denoising (i.e., the minimum MSE denoiser) with the score of the data distribution. Here, we derive a fundamental relation between the higher-order central moments of the posterior distribution, and the higher-order derivatives of the posterior mean. We harness this result for uncertainty quantification of pre-trained denoisers. Particularly, we show how to efficiently compute the principal components of the posterior distribution for any desired region of an image, as well as to approximate the full marginal distribution along those (or any other) one-dimensional directions. Our method is fast and memory efficient, as it does not explicitly compute or store the high-order moment tensors and it requires no training or fine tuning of the denoiser. Code and examples are available on the project's webpage in https://hilamanor.github.io/GaussianDenoisingPosterior/
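
For reference, the lowest-order instances of the moment/derivative relation the abstract refers to are classical results for Gaussian denoising, where y = x + n with zero-mean Gaussian noise of variance sigma^2:

```latex
% Tweedie's formula: the posterior mean (MMSE denoiser) is tied to the score of p(y)
\mathbb{E}[x \mid y] = y + \sigma^{2}\,\nabla_{y}\log p(y)
% Second-order case: the posterior covariance equals sigma^2 times the Jacobian of
% the posterior mean, the lowest-order instance of the moment/derivative relation
\operatorname{Cov}[x \mid y] = \sigma^{2}\,\frac{\partial\,\mathbb{E}[x \mid y]}{\partial y}
```

The paper derives the general relation for higher-order central moments and uses it for uncertainty quantification without retraining the denoiser.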

Advancements in 3D Lane Detection Using LiDAR Point Clouds: From Data Collection to Model Development

  • paper_url: http://arxiv.org/abs/2309.13596
  • repo_url: None
  • paper_authors: Runkai Zhao, Yuwen Heng, Yuanda Gao, Shilei Liu, Heng Wang, Changhao Yao, Jiawen Chen, Weidong Cai
  • for: Improving driving environment perception for advanced driver-assistance systems (ADAS) by enabling learning-based 3D lane detection from LiDAR.
  • methods: LiSV-3DLane, a large-scale surround-view LiDAR lane dataset (20k frames with enriched semantic annotation) built with a simple yet effective automatic annotation pipeline that exploits the geometric traits of lane lines to generate finer lane labels; plus LiLaDet, a LiDAR-based 3D lane detection model that incorporates spatial geometry learning of the point cloud into BEV-based lane identification.
  • results: LiLaDet outperforms existing camera- and LiDAR-based approaches on the 3D lane detection task on both the K-Lane dataset and LiSV-3DLane.
    Abstract Advanced Driver-Assistance Systems (ADAS) have successfully integrated learning-based techniques into vehicle perception and decision-making. However, their application in 3D lane detection for effective driving environment perception is hindered by the lack of comprehensive LiDAR datasets. The sparse nature of LiDAR point cloud data prevents an efficient manual annotation process. To solve this problem, we present LiSV-3DLane, a large-scale 3D lane dataset that comprises 20k frames of surround-view LiDAR point clouds with enriched semantic annotation. Unlike existing datasets confined to a frontal perspective, LiSV-3DLane provides a full 360-degree spatial panorama around the ego vehicle, capturing complex lane patterns in both urban and highway environments. We leverage the geometric traits of lane lines and the intrinsic spatial attributes of LiDAR data to design a simple yet effective automatic annotation pipeline for generating finer lane labels. To propel future research, we propose a novel LiDAR-based 3D lane detection model, LiLaDet, incorporating the spatial geometry learning of the LiDAR point cloud into Bird's Eye View (BEV) based lane identification. Experimental results indicate that LiLaDet outperforms existing camera- and LiDAR-based approaches in the 3D lane detection task on the K-Lane dataset and our LiSV-3DLane.

Benchmarking Encoder-Decoder Architectures for Biplanar X-ray to 3D Shape Reconstruction

  • paper_url: http://arxiv.org/abs/2309.13587
  • repo_url: None
  • paper_authors: Mahesh Shakya, Bishesh Khanal
  • for: Benchmarking deep learning models for biplanar X-ray to 3D bone shape reconstruction on an equal footing, with tasks relevant to real-world clinical scenarios.
  • methods: An open-source benchmarking framework with reference implementations of 8 models, APIs for collecting and preprocessing 6 public datasets, and automatic extraction of clinical parameters and landmarks.
  • results: Attention-based methods that capture global spatial relationships tend to perform best across anatomies and datasets; performance on clinically relevant subgroups may be overestimated without disaggregated reporting; ribs are substantially harder to reconstruct than femur, hip and spine; and Dice score improvements do not always translate into better automatic estimation of clinically relevant parameters.
    Abstract Various deep learning models have been proposed for 3D bone shape reconstruction from two orthogonal (biplanar) X-ray images. However, it is unclear how these models compare against each other since they are evaluated on different anatomy, cohort and (often privately held) datasets. Moreover, the impact of the commonly optimized image-based segmentation metrics such as dice score on the estimation of clinical parameters relevant in 2D-3D bone shape reconstruction is not well known. To move closer toward clinical translation, we propose a benchmarking framework that evaluates tasks relevant to real-world clinical scenarios, including reconstruction of fractured bones, bones with implants, robustness to population shift, and error in estimating clinical parameters. Our open-source platform provides reference implementations of 8 models (many of whose implementations were not publicly available), APIs to easily collect and preprocess 6 public datasets, and the implementation of automatic clinical parameter and landmark extraction methods. We present an extensive evaluation of 8 2D-3D models on equal footing using 6 public datasets comprising images for four different anatomies. Our results show that attention-based methods that capture global spatial relationships tend to perform better across all anatomies and datasets; performance on clinically relevant subgroups may be overestimated without disaggregated reporting; ribs are substantially more difficult to reconstruct compared to femur, hip and spine; and the dice score improvement does not always bring a corresponding improvement in the automatic estimation of clinically relevant parameters.

Solving Low-Dose CT Reconstruction via GAN with Local Coherence

  • paper_url: http://arxiv.org/abs/2309.13584
  • repo_url: https://github.com/lwjie595/GANLC
  • paper_authors: Wenjie Liu
  • for: Computed Tomography (CT) for diagnosing lesions in internal organs is a fundamental topic in medical imaging; low-dose CT is widely preferred for its reduced radiation exposure, so its reconstruction methods have been studied extensively.
  • methods: A novel approach based on generative adversarial networks (GANs) that uses optical flow to enforce local coherence between adjacent slices, improving the coherence and stability of the reconstructed images.
  • results: Evaluation on real datasets shows that the proposed method significantly improves the precision and stability of the reconstructed images compared with existing state-of-the-art approaches.
    Abstract The Computed Tomography (CT) for diagnosis of lesions in human internal organs is one of the most fundamental topics in medical imaging. Low-dose CT, which offers reduced radiation exposure, is preferred over standard-dose CT, and therefore its reconstruction approaches have been extensively studied. However, current low-dose CT reconstruction techniques mainly rely on model-based methods or deep-learning-based techniques, which often ignore the coherence and smoothness for sequential CT slices. To address this issue, we propose a novel approach using generative adversarial networks (GANs) with enhanced local coherence. The proposed method can capture the local coherence of adjacent images by optical flow, which yields significant improvements in the precision and stability of the constructed images. We evaluate our proposed method on real datasets and the experimental results suggest that it can outperform existing state-of-the-art reconstruction approaches significantly.
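
The abstract says local coherence between adjacent slices is captured via optical flow but does not give the exact loss. Below is a generic sketch of such a coherence term in PyTorch, where a neighbouring slice is warped by a dense flow field and compared to the current slice; the L1 photometric penalty and the flow convention (pixel displacements, channel 0 = x) are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Warp a (B, C, H, W) image by a dense flow field (B, 2, H, W) given in
    pixel displacements, channel 0 = x, channel 1 = y (assumed convention)."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device, dtype=image.dtype),
        torch.arange(w, device=image.device, dtype=image.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0      # normalise to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)

def local_coherence_loss(slice_t, slice_prev, flow_prev_to_t):
    """L1 disagreement between a reconstructed slice and its flow-warped
    neighbour, a simple stand-in for a local-coherence term."""
    return torch.mean(torch.abs(slice_t - warp_with_flow(slice_prev, flow_prev_to_t)))
```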

A SAM-based Solution for Hierarchical Panoptic Segmentation of Crops and Weeds Competition

  • paper_url: http://arxiv.org/abs/2309.13578
  • repo_url: None
  • paper_authors: Khoa Dang Nguyen, Thanh-Hai Phung, Hoang-Giang Cao
  • for: To explore panoptic segmentation, an advanced computer vision technique, in agriculture for better identification and classification of crops and weeds.
  • methods: An approach that combines the Segment Anything Model (SAM) for instance segmentation with prompts from object detection models, specifically DINO and YOLO-v8.
  • results: The best-performing model achieved a PQ+ score of 81.33 in the competition.
    Abstract Panoptic segmentation in agriculture is an advanced computer vision technique that provides a comprehensive understanding of field composition. It facilitates various tasks such as crop and weed segmentation, plant panoptic segmentation, and leaf instance segmentation, all aimed at addressing challenges in agriculture. To explore the application of panoptic segmentation in agriculture, the 8th Workshop on Computer Vision in Plant Phenotyping and Agriculture (CVPPA) hosted a challenge on hierarchical panoptic segmentation of crops and weeds using the PhenoBench dataset. To tackle the tasks presented in this competition, we propose an approach that combines the effectiveness of the Segment Anything Model (SAM) for instance segmentation with prompt input from object detection models. Specifically, we integrated two notable approaches in object detection, namely DINO and YOLO-v8. Our best-performing model achieved a PQ+ score of 81.33 based on the evaluation metrics of the competition.
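
The described pipeline prompts SAM with boxes produced by a detector such as DINO or YOLO-v8. Below is a minimal sketch of the box-prompting step, assuming the `segment_anything` package with its published `SamPredictor` interface; the checkpoint path, model variant, and how detector boxes are obtained are placeholders.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor  # assumed package layout

def boxes_to_masks(image_rgb, boxes_xyxy, checkpoint="sam_vit_h.pth"):
    """Prompt SAM with detector boxes (e.g. from DINO or YOLO-v8) and return
    one binary mask per box; checkpoint path and model variant are placeholders."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)                       # H x W x 3 uint8 RGB array
    masks = []
    for box in boxes_xyxy:                               # each box: [x1, y1, x2, y2]
        m, _, _ = predictor.predict(box=np.asarray(box), multimask_output=False)
        masks.append(m[0])                               # (H, W) boolean mask
    return masks
```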

Matrix Completion-Informed Deep Unfolded Equilibrium Models for Self-Supervised k-Space Interpolation in MRI

  • paper_url: http://arxiv.org/abs/2309.13571
  • repo_url: None
  • paper_authors: Chen Luo, Huayu Wang, Taofeng Xie, Qiyu Jin, Guoqing Chen, Zhuo-Xu Cui, Dong Liang
  • for: To accelerate MRI and improve image quality without requiring fully sampled label data.
  • methods: A self-supervised deep learning approach that leverages the representational power of deep networks while retaining the theoretical guarantees of regularization models.
  • results: The method accelerates MRI reconstruction without fully sampled labels and outperforms existing self-supervised approaches and traditional regularization methods.
    Abstract Recently, regularization model-driven deep learning (DL) has gained significant attention due to its ability to leverage the potent representational capabilities of DL while retaining the theoretical guarantees of regularization models. However, most of these methods are tailored for supervised learning scenarios that necessitate fully sampled labels, which can pose challenges in practical MRI applications. To tackle this challenge, we propose a self-supervised DL approach for accelerated MRI that is theoretically guaranteed and does not rely on fully sampled labels. Specifically, we achieve neural network structure regularization by exploiting the inherent structural low-rankness of the $k$-space data. Simultaneously, we constrain the network structure to resemble a nonexpansive mapping, ensuring the network's convergence to a fixed point. Thanks to this well-defined network structure, this fixed point can completely reconstruct the missing $k$-space data based on matrix completion theory, even in situations where full-sampled labels are unavailable. Experiments validate the effectiveness of our proposed method and demonstrate its superiority over existing self-supervised approaches and traditional regularization methods, achieving performance comparable to that of supervised learning methods in certain scenarios.
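
The abstract describes a nonexpansive network iterated to a fixed point with data consistency on the acquired k-space samples. As a rough, method-agnostic sketch of such an alternating scheme (not the paper's equilibrium model), where `net` stands in for the learned mapping:

```python
import numpy as np

def fixed_point_kspace_interp(y, mask, net, n_iters=50):
    """Alternate a learned k-space mapping with a data-consistency step that
    re-imposes the acquired samples. y: undersampled k-space, mask: boolean
    sampling mask, net: callable k-space -> k-space (stands in for the model)."""
    x = np.where(mask, y, 0.0)
    for _ in range(n_iters):
        x = net(x)                  # network / regularisation step
        x = np.where(mask, y, x)    # keep measured entries fixed
    return x
```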

Robust Digital-Twin Localization via An RGBD-based Transformer Network and A Comprehensive Evaluation on a Mobile Dataset

  • paper_url: http://arxiv.org/abs/2309.13570
  • repo_url: https://github.com/augcog/dttd2
  • paper_authors: Zixun Huang, Keling Yao, Seth Z. Zhao, Chuanyu Pan, Tianjian Xu, Weiyu Feng, Allen Y. Yang
  • for: To explore the potential of digital-twin technology for 3D object tracking and localization, and to achieve state-of-the-art accuracy under real-world noisy sensor data with a transformer-based 6DoF pose estimator.
  • methods: A transformer-based 6DoF pose estimator, validated against prior work, together with a new RGBD dataset, Digital Twin Tracking Dataset v2 (DTTD2), which extends DTTD1 with iPhone sensor data.
  • results: Extensive experiments and in-depth analysis show that the method outperforms existing baselines even under significant depth data errors.
    Abstract The potential of digital-twin technology, involving the creation of precise digital replicas of physical objects, to reshape AR experiences in 3D object tracking and localization scenarios is significant. However, enabling robust 3D object tracking in dynamic mobile AR environments remains a formidable challenge. These scenarios often require a more robust pose estimator capable of handling the inherent sensor-level measurement noise. In this paper, recognizing the challenges of comprehensive solutions in existing literature, we propose a transformer-based 6DoF pose estimator designed to achieve state-of-the-art accuracy under real-world noisy data. To systematically validate the new solution's performance against the prior art, we also introduce a novel RGBD dataset called Digital Twin Tracking Dataset v2 (DTTD2), which is focused on digital-twin object tracking scenarios. Expanded from an existing DTTD v1 (DTTD1), the new dataset adds digital-twin data captured using a cutting-edge mobile RGBD sensor suite on Apple iPhone 14 Pro, expanding the applicability of our approach to iPhone sensor data. Through extensive experimentation and in-depth analysis, we illustrate the effectiveness of our methods under significant depth data errors, surpassing the performance of existing baselines. Code and dataset are made publicly available at: https://github.com/augcog/DTTD2
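
The abstract does not state which pose-error metric the benchmark reports, but 6DoF object pose estimators are commonly evaluated with the Average Distance (ADD) metric; a minimal NumPy version is sketched below purely for reference.

```python
import numpy as np

def add_metric(model_points, R_gt, t_gt, R_est, t_est):
    """Average Distance (ADD): mean distance between model points transformed
    by the ground-truth pose and by the estimated pose."""
    pts_gt = model_points @ R_gt.T + t_gt     # (N, 3)
    pts_est = model_points @ R_est.T + t_est
    return float(np.mean(np.linalg.norm(pts_gt - pts_est, axis=1)))
```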

Multivariate Prototype Representation for Domain-Generalized Incremental Learning

  • paper_url: http://arxiv.org/abs/2309.13563
  • repo_url: None
  • paper_authors: Can Peng, Piotr Koniusz, Kaiyu Guo, Brian C. Lovell, Peyman Moghadam
  • for: To address catastrophic forgetting when deep models are fine-tuned on samples of new classes, compounded by domain shift between training and testing data.
  • methods: A Domain-Generalized Class-Incremental Learning (DGCIL) approach that remembers old classes, adapts to new classes, and classifies reliably on unseen domains. The loss maintains classification boundaries while suppressing domain-specific information of each class. With no old exemplars stored, knowledge distillation and an estimate of old-class prototype drift are used as incremental training advances. Prototypes are modeled as multivariate Normal distributions whose means and covariances adapt to the drifting feature space, and pseudo-features for old classes are sampled via Cholesky decomposition, capturing richer semantic variation than sampling around mean prototypes alone.
  • results: Experiments on several benchmarks validate the claims.
    Abstract Deep learning models suffer from catastrophic forgetting when being fine-tuned with samples of new classes. This issue becomes even more pronounced when faced with the domain shift between training and testing data. In this paper, we study the critical and less explored Domain-Generalized Class-Incremental Learning (DGCIL). We design a DGCIL approach that remembers old classes, adapts to new classes, and can classify reliably objects from unseen domains. Specifically, our loss formulation maintains classification boundaries and suppresses the domain-specific information of each class. With no old exemplars stored, we use knowledge distillation and estimate old class prototype drift as incremental training advances. Our prototype representations are based on multivariate Normal distributions whose means and covariances are constantly adapted to changing model features to represent old classes well by adapting to the feature space drift. For old classes, we sample pseudo-features from the adapted Normal distributions with the help of Cholesky decomposition. In contrast to previous pseudo-feature sampling strategies that rely solely on average mean prototypes, our method excels at capturing varying semantic information. Experiments on several benchmarks validate our claims.
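
The pseudo-feature sampling step described above draws features for old classes from an adapted multivariate Normal via its Cholesky factor. A minimal sketch of that sampling follows; the small jitter term added for numerical stability is an assumption, not from the paper.

```python
import numpy as np

def sample_pseudo_features(mean, cov, n_samples, jitter=1e-6):
    """Draw pseudo-features for an old class from N(mean, cov) using the
    Cholesky factor of the (slightly regularised) covariance."""
    d = mean.shape[0]
    L = np.linalg.cholesky(cov + jitter * np.eye(d))
    z = np.random.randn(n_samples, d)
    return mean + z @ L.T          # each row ~ N(mean, cov)
```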

LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning

  • paper_url: http://arxiv.org/abs/2309.13556
  • repo_url: None
  • paper_authors: Liulei Li, Wenguan Wang, Yi Yang
  • for: To fill a fundamental gap in high-performance semantic segmentation models by giving them a structured, abstract understanding of the visual world.
  • methods: Visual semantic parsing that combines neural inductive learning and logic reasoning, integrating rich data with symbolic knowledge.
  • results: Extensive experiments on four datasets demonstrate the effectiveness and generality of LOGICSEG.
    Abstract Current high-performance semantic segmentation models are purely data-driven sub-symbolic approaches and blind to the structured nature of the visual world. This is in stark contrast to human cognition which abstracts visual perceptions at multiple levels and conducts symbolic reasoning with such structured abstraction. To fill these fundamental gaps, we devise LOGICSEG, a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge. In particular, the semantic concepts of interest are structured as a hierarchy, from which a set of constraints are derived for describing the symbolic relations and formalized as first-order logic rules. After fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training. During inference, logical constraints are packaged into an iterative process and injected into the network in a form of several matrix multiplications, so as to achieve hierarchy-coherent prediction with logic reasoning. These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models. Extensive experiments over four datasets with various segmentation models and backbones verify the effectiveness and generality of LOGICSEG. We believe this study opens a new avenue for visual semantic parsing.
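
LOGICSEG's specific rules and relaxation are not spelled out in the abstract, but a common fuzzy relaxation of a hierarchy rule such as "x is a child class implies x is its parent class" penalises pixels where the child's probability exceeds the parent's. The sketch below illustrates that kind of soft constraint only; the tensor shapes and the simple hinge form are assumptions.

```python
import torch

def implication_penalty(child_probs, parent_probs, child_to_parent):
    """Soft penalty for violating 'x is a child class => x is its parent class':
    any pixel where a child's probability exceeds its parent's is penalised.
    child_probs: (B, C, H, W), parent_probs: (B, P, H, W), both softmax scores;
    child_to_parent: dict mapping child index -> parent index."""
    penalties = [
        torch.relu(child_probs[:, c] - parent_probs[:, p])
        for c, p in child_to_parent.items()
    ]
    return torch.stack(penalties).mean()
```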

Generalized Dice Focal Loss trained 3D Residual UNet for Automated Lesion Segmentation in Whole-Body FDG PET/CT Images

  • paper_url: http://arxiv.org/abs/2309.13553
  • repo_url: https://github.com/ahxmeds/autosegnet
  • paper_authors: Shadab Ahamed, Arman Rahmim
  • for: Developing a comprehensive PET/CT lesion segmentation model for routine quantitative image analysis.
  • methods: A 3D Residual UNet trained with the Generalized Dice Focal Loss function on the AutoPET challenge 2023 training dataset, developed in a 5-fold cross-validation setting with average and weighted-average ensembling.
  • results: On the preliminary test phase, the average ensemble achieved a Dice similarity coefficient (DSC) of 0.5417, a false-positive volume (FPV) of 0.8261 ml, and a false-negative volume (FNV) of 0.2538 ml; the weighted-average ensemble achieved 0.5417, 0.8186 ml, and 0.2538 ml.
    Abstract Automated segmentation of cancerous lesions in PET/CT images is a vital initial task for quantitative analysis. However, it is often challenging to train deep learning-based segmentation methods to high degree of accuracy due to the diversity of lesions in terms of their shapes, sizes, and radiotracer uptake levels. These lesions can be found in various parts of the body, often close to healthy organs that also show significant uptake. Consequently, developing a comprehensive PET/CT lesion segmentation model is a demanding endeavor for routine quantitative image analysis. In this work, we train a 3D Residual UNet using Generalized Dice Focal Loss function on the AutoPET challenge 2023 training dataset. We develop our models in a 5-fold cross-validation setting and ensemble the five models via average and weighted-average ensembling. On the preliminary test phase, the average ensemble achieved a Dice similarity coefficient (DSC), false-positive volume (FPV) and false negative volume (FNV) of 0.5417, 0.8261 ml, and 0.2538 ml, respectively, while the weighted-average ensemble achieved 0.5417, 0.8186 ml, and 0.2538 ml, respectively. Our algorithm can be accessed via this link: https://github.com/ahxmeds/autosegnet.
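
The abstract names the Generalized Dice Focal Loss without giving its exact form; the authors may well rely on a library implementation (e.g. MONAI). As a rough hand-written sketch of one common way to combine a generalised Dice term with a focal cross-entropy term for binary 3D segmentation (the weighting details are assumptions):

```python
import torch
import torch.nn.functional as F

def generalized_dice_focal_loss(logits, target, lambda_focal=1.0, gamma=2.0, eps=1e-7):
    """One way to combine a generalised (volume-weighted) Dice term with a
    focal cross-entropy term for binary 3D segmentation.
    logits, target: (B, 1, D, H, W); target is a float tensor in {0, 1}."""
    probs = torch.sigmoid(logits)
    p, t = probs.flatten(1), target.flatten(1)
    w_fg = 1.0 / (t.sum(dim=1) ** 2 + eps)               # inverse squared volume weights
    w_bg = 1.0 / ((1 - t).sum(dim=1) ** 2 + eps)
    inter = w_fg * (p * t).sum(dim=1) + w_bg * ((1 - p) * (1 - t)).sum(dim=1)
    union = w_fg * (p + t).sum(dim=1) + w_bg * ((1 - p) + (1 - t)).sum(dim=1)
    dice_loss = 1.0 - 2.0 * inter / (union + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = probs * target + (1 - probs) * (1 - target)    # prob of the true class
    focal = ((1 - p_t) ** gamma * bce).mean(dim=(1, 2, 3, 4))
    return (dice_loss + lambda_focal * focal).mean()
```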

Towards Robust Robot 3D Perception in Urban Environments: The UT Campus Object Dataset

  • paper_url: http://arxiv.org/abs/2309.13549
  • repo_url: https://github.com/ut-amrl/coda-models
  • paper_authors: Arthur Zhang, Chaitanya Eranki, Christina Zhang, Ji-Hwan Park, Raymond Hong, Pranav Kalyani, Lochana Kalyanaraman, Arsh Gamare, Arnav Bagad, Maria Esteva, Joydeep Biswas
  • for: To provide a campus-scale dataset for egocentric 3D perception and planning for autonomous navigation in urban environments.
  • methods: Multimodal sensing, including synchronized 3D point clouds and stereo RGB video, RGB-D video, and a 9-DOF IMU, with extensive ground-truth annotations.
  • results: Training on CODa significantly improves 3D object detection performance in urban environments, and sensor-specific fine-tuning and pretraining on CODa further improve detection accuracy.
    Abstract We introduce the UT Campus Object Dataset (CODa), a mobile robot egocentric perception dataset collected on the University of Texas Austin Campus. Our dataset contains 8.5 hours of multimodal sensor data: synchronized 3D point clouds and stereo RGB video from a 128-channel 3D LiDAR and two 1.25MP RGB cameras at 10 fps; RGB-D videos from an additional 0.5MP sensor at 7 fps, and a 9-DOF IMU sensor at 40 Hz. We provide 58 minutes of ground-truth annotations containing 1.3 million 3D bounding boxes with instance IDs for 53 semantic classes, 5000 frames of 3D semantic annotations for urban terrain, and pseudo-ground truth localization. We repeatedly traverse identical geographic locations for a wide range of indoor and outdoor areas, weather conditions, and times of the day. Using CODa, we empirically demonstrate that: 1) 3D object detection performance in urban settings is significantly higher when trained using CODa compared to existing datasets even when employing state-of-the-art domain adaptation approaches, 2) sensor-specific fine-tuning improves 3D object detection accuracy and 3) pretraining on CODa improves cross-dataset 3D object detection performance in urban settings compared to pretraining on AV datasets. Using our dataset and annotations, we release benchmarks for 3D object detection and 3D semantic segmentation using established metrics. In the future, the CODa benchmark will include additional tasks like unsupervised object discovery and re-identification. We publicly release CODa on the Texas Data Repository, pre-trained models, dataset development package, and interactive dataset viewer on our website at https://amrl.cs.utexas.edu/coda. We expect CODa to be a valuable dataset for research in egocentric 3D perception and planning for autonomous navigation in urban environments.

DFRD: Data-Free Robustness Distillation for Heterogeneous Federated Learning

  • paper_url: http://arxiv.org/abs/2309.13546
  • repo_url: None
  • paper_authors: Kangyang Luo, Shuai Wang, Yexuan Fu, Xiang Li, Yunshi Lan, Ming Gao
  • for: A privacy-preserving federated learning method (DFRD) that trains a robust and effective global model under both data heterogeneity and model heterogeneity.
  • methods: A conditional generator on the server approximates the training space of the locally uploaded models, and its training is systematically investigated in terms of fidelity, transferability, and diversity.
  • results: Experiments on several image classification tasks show significant performance gains over state-of-the-art baselines.
    Abstract Federated Learning (FL) is a privacy-constrained decentralized machine learning paradigm in which clients enable collaborative training without compromising private data. However, how to learn a robust global model in the data-heterogeneous and model-heterogeneous FL scenarios is challenging. To address it, we resort to data-free knowledge distillation to propose a new FL method (namely DFRD). DFRD equips a conditional generator on the server to approximate the training space of the local models uploaded by clients, and systematically investigates its training in terms of fidelity, transferability and diversity. To overcome the catastrophic forgetting of the global model caused by the distribution shifts of the generator across communication rounds, we maintain an exponential moving average copy of the generator on the server. Additionally, we propose dynamic weighting and label sampling to accurately extract knowledge from local models. Finally, our extensive experiments on various image classification tasks illustrate that DFRD achieves significant performance gains compared to SOTA baselines.
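
One concrete ingredient mentioned above is the exponential-moving-average copy of the generator kept on the server. A minimal PyTorch sketch of that update follows; the decay value is an assumption and buffers are ignored for brevity.

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """Blend the live generator's weights into its exponential-moving-average copy."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Usage sketch: create ema_generator = copy.deepcopy(generator) once, then call
# ema_update(ema_generator, generator) after every server-side generator step.
```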

Comparative Evaluation of Transfer Learning for Classification of Brain Tumor Using MRI

  • paper_url: http://arxiv.org/abs/2310.02270
  • repo_url: None
  • paper_authors: Abu Kaisar Mohammad Masum, Nusrat Badhon, S. M. Saiful Islam Badhon, Nushrat Jahan Ria, Sheikh Abujar, Muntaser Mansur Syed, Naveed Mahmud
  • for: To apply computer-assisted diagnosis, in particular machine learning and deep learning, to classify three types of brain tumors.
  • methods: Four transfer learning techniques for brain tumor classification, tested on a benchmark dataset of 3064 MRI images covering three tumor types.
  • results: ResNet-50 achieved 99.06% accuracy, outperforming the other models; the study also shows that a balanced dataset improves accuracy without resorting to augmentation.
    Abstract Abnormal growth of cells in the brain and its surrounding tissues is known as a brain tumor. There are two types: one is benign (non-cancerous) and the other is malignant (cancerous), which may cause death. The radiologists' ability to diagnose malignancies is greatly aided by magnetic resonance imaging (MRI). Brain cancer diagnosis has been considerably expedited by the field of computer-assisted diagnostics, especially in machine learning and deep learning. In our study, we categorize three different kinds of brain tumors using four transfer learning techniques. Our models were tested on a benchmark dataset of 3064 MRI pictures representing three different forms of brain cancer. Notably, ResNet-50 outperformed other models with a remarkable accuracy of 99.06%. We stress the significance of a balanced dataset for improving accuracy without the use of augmentation methods. Additionally, we experimentally demonstrate our method and compare with other classification algorithms on the CE-MRI dataset using evaluations like F1-score, AUC, precision and recall.
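
The abstract does not specify the fine-tuning recipe, so the following is only a generic transfer-learning setup for a 3-class classifier on top of an ImageNet-pretrained ResNet-50, assuming a recent torchvision; which layers the authors actually unfreeze is not stated.

```python
import torch.nn as nn
from torchvision import models

def build_transfer_model(n_classes=3, freeze_backbone=True):
    """ImageNet-pretrained ResNet-50 with a new classification head, a typical
    transfer-learning setup for 3-class MRI tumour classification."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False                         # train only the new head
    model.fc = nn.Linear(model.fc.in_features, n_classes)   # replaces the 1000-way head
    return model
```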

Semi-Supervised Domain Generalization for Object Detection via Language-Guided Feature Alignment

  • paper_url: http://arxiv.org/abs/2309.13525
  • repo_url: https://github.com/sinamalakouti/CDDMSL
  • paper_authors: Sina Malakouti, Adriana Kovashka
  • for: To address semi-supervised domain generalization (DG) and domain adaptation (DA) for object detection by leveraging vision-language pre-training and enforcing feature alignment through the language space.
  • methods: Cross-Domain Descriptive Multi-Scale Learning (CDDMSL), which maximizes agreement between descriptions of an image carrying different domain-specific characteristics in the embedding space.
  • results: CDDMSL outperforms existing methods, with improvements of 11.7% and 7.5% in the DG and DA settings, respectively.
    Abstract Existing domain adaptation (DA) and generalization (DG) methods in object detection enforce feature alignment in the visual space but face challenges like object appearance variability and scene complexity, which make it difficult to distinguish between objects and achieve accurate detection. In this paper, we are the first to address the problem of semi-supervised domain generalization by exploring vision-language pre-training and enforcing feature alignment through the language space. We employ a novel Cross-Domain Descriptive Multi-Scale Learning (CDDMSL) aiming to maximize the agreement between descriptions of an image presented with different domain-specific characteristics in the embedding space. CDDMSL significantly outperforms existing methods, achieving 11.7% and 7.5% improvement in DG and DA settings, respectively. Comprehensive analysis and ablation studies confirm the effectiveness of our method, positioning CDDMSL as a promising approach for domain generalization in object detection tasks.
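
The exact CDDMSL objective is not given in the abstract; one standard way to "maximize agreement" between paired embeddings across domains is a symmetric InfoNCE-style loss. The sketch below is that generic formulation, not the paper's loss; the temperature and batch-as-negatives scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def description_agreement_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE-style agreement between embeddings of descriptions of
    the same images rendered with two different domain styles: matched pairs on
    the diagonal are positives, other in-batch pairs are negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```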

LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.13523
  • repo_url: None
  • paper_authors: Amirreza Shaban, JoonHo Lee, Sanghun Jung, Xiangyun Meng, Byron Boots
  • for: An unsupervised domain adaptation (UDA) method for LiDAR segmentation that addresses the domain gap caused by different LiDAR sensor configurations.
  • methods: Two techniques to reduce sensor discrepancy and improve pseudo-label quality: 1) LiDAR beam subsampling, which simulates different scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits the temporal consistency of consecutive frames to generate more reliable pseudo labels.
  • results: Evaluated on several public LiDAR datasets, the method outperforms state-of-the-art approaches by more than 3.9% mIoU on average.
    Abstract We introduce LiDAR-UDA, a novel two-stage self-training-based Unsupervised Domain Adaptation (UDA) method for LiDAR segmentation. Existing self-training methods use a model trained on labeled source data to generate pseudo labels for target data and refine the predictions via fine-tuning the network on the pseudo labels. These methods suffer from domain shifts caused by different LiDAR sensor configurations in the source and target domains. We propose two techniques to reduce sensor discrepancy and improve pseudo label quality: 1) LiDAR beam subsampling, which simulates different LiDAR scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits temporal consistency of consecutive frames to generate more reliable pseudo labels. Our method is simple, generalizable, and does not incur any extra inference cost. We evaluate our method on several public LiDAR datasets and show that it outperforms the state-of-the-art methods by more than 3.9% mIoU on average for all scenarios. Code will be available at https://github.com/JHLee0513/LiDARUDA.
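
LiDAR beam subsampling, as described, simulates other scanning patterns by randomly dropping whole beams. A minimal NumPy sketch follows, assuming a per-point ring/beam index is available from the sensor.

```python
import numpy as np

def subsample_beams(points, beam_ids, keep_ratio=0.5, rng=None):
    """Simulate a lower-resolution LiDAR by randomly dropping whole beams.
    points: (N, D) point array, beam_ids: (N,) integer ring index per point."""
    rng = np.random.default_rng() if rng is None else rng
    beams = np.unique(beam_ids)
    n_keep = max(1, int(round(keep_ratio * len(beams))))
    kept = rng.choice(beams, size=n_keep, replace=False)
    return points[np.isin(beam_ids, kept)]
```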

InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.13516
  • repo_url: None
  • paper_authors: Cho-Ying Wu, Quankai Gao, Chin-Cheng Hsu, Te-Lin Wu, Jing-Wen Chen, Ulrich Neumann
  • for: To examine the robustness and generalization of indoor monocular depth estimation methods in real-world scenes, in particular across different space types.
  • methods: A comparison of 11 recent methods, which are found to show pronounced performance preferences across space types.
  • results: Existing methods exhibit clear performance imbalance across space types, revealing underlying biases, and perform very poorly on some space types.
    Abstract Indoor monocular depth estimation has attracted increasing research interest. Most previous works have been focusing on methodology, primarily experimenting with the NYU-Depth-V2 (NYUv2) Dataset, and only concentrated on the overall performance over the test set. However, little is known regarding robustness and generalization when it comes to applying monocular depth estimation methods to real-world scenarios where highly varying and diverse functional space types are present, such as library or kitchen. A performance breakdown by space type is essential to understand a pretrained model's performance variance. To facilitate our investigation into robustness and address limitations of previous works, we collect InSpaceType, a high-quality and high-resolution RGBD dataset for general indoor environments. We benchmark 11 recent methods on InSpaceType and find they severely suffer from performance imbalance concerning space types, which reveals their underlying bias. We extend our analysis to 4 other datasets, 3 mitigation approaches, and the ability to generalize to unseen space types. Our work marks the first in-depth investigation of performance imbalance across space types for indoor monocular depth estimation, drawing attention to potential safety concerns for model deployment without considering space types, and further shedding light on potential ways to improve robustness. See https://depthcomputation.github.io/DepthPublic for data.
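
The core recommendation above is disaggregated reporting by space type rather than a single test-set average. A minimal sketch of such disaggregation is shown below, assuming a per-sample error metric has already been computed.

```python
import numpy as np
from collections import defaultdict

def disaggregate_by_space_type(per_sample_error, space_types):
    """Report a depth-error metric per space type so that weak subgroups are
    not hidden by the overall test-set average."""
    buckets = defaultdict(list)
    for err, st in zip(per_sample_error, space_types):
        buckets[st].append(err)
    return {st: float(np.mean(v)) for st, v in buckets.items()}
```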

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.13505
  • repo_url: https://github.com/xing0047/rewrite
  • paper_authors: Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Shao Ling, Shijian Lu
  • for: To strengthen language-supervised semantic segmentation so that textual inputs can be spatially localized in images.
  • methods: CLIP is leveraged to compensate for missing semantics by building a concept archive; relevant concepts are selected via vision-driven expansion, text-to-vision-guided ranking, and cluster-guided sampling, then fed into pre-training.
  • results: Extensive experiments on 8 segmentation benchmarks show that CoCu bridges the semantic gap and substantially boosts language-supervised semantic segmentation.
    Abstract Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from clear semantic gaps between visual and textual modality: plenty of visual concepts appeared in images are missing in their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin, suggesting the value of bridging semantic gap in pre-training data.
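
The concept-ranking step relies on CLIP to score candidate concepts against an image. Below is a minimal sketch assuming the OpenAI `clip` package; the prompt template, model variant, and top-k cutoff are assumptions, and the paper's vision-driven expansion and cluster-guided sampling are not shown.

```python
import torch
import clip                      # OpenAI CLIP package, assumed installed
from PIL import Image

@torch.no_grad()
def rank_concepts(image_path, candidate_concepts, device="cpu", top_k=5):
    """Score candidate concept words against an image with CLIP and return the
    top-k visually matched ones."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {c}" for c in candidate_concepts]).to(device)
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    sims = (img_f @ txt_f.t()).squeeze(0)                 # cosine similarities
    top = sims.topk(min(top_k, len(candidate_concepts))).indices.tolist()
    return [candidate_concepts[i] for i in top]
```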