cs.CV - 2023-11-08

Active Transfer Learning for Efficient Video-Specific Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.05041
  • repo_url: https://github.com/iminthemiddle/vatl4pose-wacv2024
  • paper_authors: Hiromu Taketsugu, Norimichi Ukita
  • for: The paper proposes an approach combining Active Learning (AL) and Transfer Learning (TL) to efficiently adapt human pose estimators to individual videos.
  • methods: Estimation uncertainty is quantified from temporal changes in the predicted heatmaps, and the unnaturalness of the estimated full-body pose is used to select uncertain and diverse samples for efficient estimator learning; the existing Active Transfer Learning pipeline is also revisited, with new retraining methods and a new stopping criterion.
  • results: Experiments show improved learning efficiency and better performance than comparative methods. Code is available at https://github.com/ImIntheMiddle/VATL4Pose-WACV2024.
    Abstract Human Pose (HP) estimation is actively researched because of its wide range of applications. However, even estimators pre-trained on large datasets may not perform satisfactorily due to a domain gap between the training and test data. To address this issue, we present our approach combining Active Learning (AL) and Transfer Learning (TL) to adapt HP estimators to individual video domains efficiently. For efficient learning, our approach quantifies (i) the estimation uncertainty based on the temporal changes in the estimated heatmaps and (ii) the unnaturalness in the estimated full-body HPs. These quantified criteria are then effectively combined with the state-of-the-art representativeness criterion to select uncertain and diverse samples for efficient HP estimator learning. Furthermore, we reconsider the existing Active Transfer Learning (ATL) method to introduce novel ideas related to the retraining methods and Stopping Criteria (SC). Experimental results demonstrate that our method enhances learning efficiency and outperforms comparative methods. Our code is publicly available at: https://github.com/ImIntheMiddle/VATL4Pose-WACV2024
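
A minimal sketch (not the authors' implementation) of the core selection idea above: scoring a frame's uncertainty by how much its predicted heatmaps change between consecutive frames, then annotating the most uncertain frames first. The array shapes, the absolute-difference measure, and the top-k budget heuristic are illustrative assumptions.

```python
import numpy as np

def temporal_heatmap_uncertainty(heatmaps: np.ndarray) -> np.ndarray:
    """heatmaps: (T, J, H, W) per-frame, per-joint heatmaps in [0, 1].
    Returns one uncertainty score per frame, based on how much the
    heatmaps change between consecutive frames."""
    diffs = np.abs(np.diff(heatmaps, axis=0))            # (T-1, J, H, W)
    per_frame = diffs.reshape(diffs.shape[0], -1).mean(axis=1)
    return np.concatenate([[per_frame[0]], per_frame])   # pad to T scores

def select_for_annotation(heatmaps: np.ndarray, budget: int) -> np.ndarray:
    scores = temporal_heatmap_uncertainty(heatmaps)
    return np.argsort(scores)[::-1][:budget]             # most uncertain first

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake = rng.random((50, 17, 64, 48))                  # 50 frames, 17 joints
    print(select_for_annotation(fake, budget=5))
```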

S$^3$AD: Semi-supervised Small Apple Detection in Orchard Environments

  • paper_url: http://arxiv.org/abs/2311.05029
  • repo_url: None
  • paper_authors: Robert Johanson, Christian Wilms, Ole Johannsen, Simone Frintrop
  • for: This paper targets apple detection in orchard environments for precision-agriculture applications such as automated yield estimation or fruit picking.
  • methods: A semi-supervised apple detection approach that uses contextual attention and selective tiling to improve the detection of small apples while limiting computational overhead.
  • results: Extensive evaluation on the MAD and MSU datasets shows that S$^3$AD outperforms several strong fully-supervised baselines by up to 14.9%. The dataset's detailed annotations of apple properties are further used to analyze how relative size and level of occlusion affect different systems, quantifying current challenges.
    Abstract Crop detection is integral for precision agriculture applications such as automated yield estimation or fruit picking. However, crop detection, e.g., apple detection in orchard environments remains challenging due to a lack of large-scale datasets and the small relative size of the crops in the image. In this work, we address these challenges by reformulating the apple detection task in a semi-supervised manner. To this end, we provide the large, high-resolution dataset MAD comprising 105 labeled images with 14,667 annotated apple instances and 4,440 unlabeled images. Utilizing this dataset, we also propose a novel Semi-Supervised Small Apple Detection system S$^3$AD based on contextual attention and selective tiling to improve the challenging detection of small apples, while limiting the computational overhead. We conduct an extensive evaluation on MAD and the MSU dataset, showing that S$^3$AD substantially outperforms strong fully-supervised baselines, including several small object detection systems, by up to $14.9\%$. Additionally, we exploit the detailed annotations of our dataset w.r.t. apple properties to analyze the influence of relative size or level of occlusion on the results of various systems, quantifying current challenges.
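
A rough sketch of the selective-tiling idea described above: a cheap, low-resolution relevance map is computed over the full image and only high-scoring tiles are passed to the expensive detector, limiting the cost of detecting small apples at high resolution. The tile size, threshold, and the relevance map itself are assumptions rather than the paper's exact design.

```python
import numpy as np

def select_tiles(attention: np.ndarray, image_hw, tile=512, thresh=0.3):
    """attention: (h, w) coarse relevance map in [0, 1] for the full image.
    Returns (y0, x0, y1, x1) boxes of tiles worth running detection on."""
    H, W = image_hw
    sy, sx = attention.shape[0] / H, attention.shape[1] / W
    boxes = []
    for y0 in range(0, H, tile):
        for x0 in range(0, W, tile):
            y1, x1 = min(y0 + tile, H), min(x0 + tile, W)
            # Average the relevance map over the tile's footprint.
            a = attention[int(y0 * sy):max(int(y1 * sy), int(y0 * sy) + 1),
                          int(x0 * sx):max(int(x1 * sx), int(x0 * sx) + 1)]
            if a.mean() > thresh:
                boxes.append((y0, x0, y1, x1))
    return boxes

if __name__ == "__main__":
    att = np.zeros((64, 96)); att[10:20, 40:60] = 1.0   # fake relevance blob
    print(select_tiles(att, image_hw=(2048, 3072)))
```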

Leveraging a realistic synthetic database to learn Shape-from-Shading for estimating the colon depth in colonoscopy images

  • paper_url: http://arxiv.org/abs/2311.05021
  • repo_url: None
  • paper_authors: Josué Ruano, Martín Gómez, Eduardo Romero, Antoine Manzanera
  • for: This work aims to estimate colon depth from monocular colonoscopy images to support the diagnosis of colon and rectum cancer.
  • methods: A new method estimates colon depth maps from single frames of monocular colonoscopy videos, inferring depth from the shading variation of the colon wall with a convolutional neural network trained on a realistic synthetic database.
  • results: The method reaches a threshold accuracy of 95.65% and a mean RMSE of 0.451 cm on the synthetic database, and shows consistent depth estimations on real images.
    Abstract Colonoscopy is the choice procedure to diagnose colon and rectum cancer, from early detection of small precancerous lesions (polyps), to confirmation of malign masses. However, the high variability of the organ appearance and the complex shape of both the colon wall and structures of interest make this exploration difficult. Learned visuospatial and perceptual abilities mitigate technical limitations in clinical practice by proper estimation of the intestinal depth. This work introduces a novel methodology to estimate colon depth maps in single frames from monocular colonoscopy videos. The generated depth map is inferred from the shading variation of the colon wall with respect to the light source, as learned from a realistic synthetic database. Briefly, a classic convolutional neural network architecture is trained from scratch to estimate the depth map, improving sharp depth estimations in haustral folds and polyps by a custom loss function that minimizes the estimation error in edges and curvatures. The network was trained by a custom synthetic colonoscopy database herein constructed and released, composed of 248,400 frames (47 videos), with depth annotations at the level of pixels. This collection comprehends 5 subsets of videos with progressively higher levels of visual complexity. Evaluation of the depth estimation with the synthetic database reached a threshold accuracy of 95.65%, and a mean-RMSE of 0.451 cm, while a qualitative assessment with a real database showed consistent depth estimations, visually evaluated by the expert gastroenterologist coauthoring this paper. Finally, the method achieved competitive performance with respect to another state-of-the-art method using a public synthetic database and comparable results in a set of images with other five state-of-the-art methods.
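
The custom loss is described as minimizing the estimation error in edges and curvatures. Below is a hedged sketch of one common way to express such an objective: an L1 depth term plus an L1 penalty on the spatial gradients of the prediction, which emphasizes sharp structures such as haustral folds. It is not the authors' exact loss; the weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def edge_aware_depth_loss(pred, target, edge_weight=1.0):
    """pred, target: (B, 1, H, W) depth maps."""
    l1 = F.l1_loss(pred, target)
    # Finite-difference gradients along x and y.
    dpx, dpy = pred[..., :, 1:] - pred[..., :, :-1], pred[..., 1:, :] - pred[..., :-1, :]
    dtx, dty = target[..., :, 1:] - target[..., :, :-1], target[..., 1:, :] - target[..., :-1, :]
    grad = F.l1_loss(dpx, dtx) + F.l1_loss(dpy, dty)
    return l1 + edge_weight * grad

if __name__ == "__main__":
    p, t = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
    print(edge_aware_depth_loss(p, t).item())
```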

Familiarity-Based Open-Set Recognition Under Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2311.05006
  • repo_url: None
  • paper_authors: Philip Enevoldsen, Christian Gundersen, Nico Lang, Serge Belongie, Christian Igel
  • for: This work studies adversarial attacks on familiarity-based open-set recognition and their effectiveness.
  • methods: Gradient-based adversarial attacks on familiarity scores, covering both False Familiarity and False Novelty attacks.
  • results: The attacks are evaluated in informed and uninformed settings on TinyImageNet and are shown to degrade familiarity-based open-set recognition.
    Abstract Open-set recognition (OSR), the identification of novel categories, can be a critical component when deploying classification models in real-world applications. Recent work has shown that familiarity-based scoring rules such as the Maximum Softmax Probability (MSP) or the Maximum Logit Score (MLS) are strong baselines when the closed-set accuracy is high. However, one of the potential weaknesses of familiarity-based OSR are adversarial attacks. Here, we present gradient-based adversarial attacks on familiarity scores for both types of attacks, False Familiarity and False Novelty attacks, and evaluate their effectiveness in informed and uninformed settings on TinyImageNet.
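
As a concrete illustration of a gradient-based attack on a familiarity score, the sketch below perturbs an input in a single FGSM-style step so that the Maximum Softmax Probability rises, i.e. a "False Familiarity" attack; flipping the sign of the step gives the "False Novelty" direction. The step size, single-step formulation, and [0, 1] input range are assumptions, not the paper's setup.

```python
import torch

def false_familiarity_fgsm(model, x, eps=4 / 255):
    """Perturb x so the classifier's Maximum Softmax Probability increases,
    making a (possibly novel) input look familiar."""
    x = x.clone().detach().requires_grad_(True)
    probs = torch.softmax(model(x), dim=1)
    familiarity = probs.max(dim=1).values.sum()        # MSP familiarity score
    familiarity.backward()
    with torch.no_grad():
        x_adv = (x + eps * x.grad.sign()).clamp(0, 1)  # ascend the score
    return x_adv.detach()
```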

Effective Restoration of Source Knowledge in Continual Test Time Adaptation

  • paper_url: http://arxiv.org/abs/2311.04991
  • repo_url: None
  • paper_authors: Fahim Faisal Niloy, Sk Miraj Ahmed, Dripta S. Raychaudhuri, Samet Oymak, Amit K. Roy-Chowdhury
  • for: This paper addresses the challenges of test-time adaptation (TTA) in dynamic environments, namely catastrophic forgetting of valuable source knowledge and gradual error accumulation.
  • methods: An unsupervised domain-change detection method identifies domain shifts in dynamic environments and resets the model parameters to the original source pre-trained values; by monitoring the statistical changes triggered by domain shifts, the source knowledge is restored, correcting the negative effects of gradually deteriorating parameters.
  • results: The method outperforms state-of-the-art adaptation methods in dynamic environments, mitigating forgetting and parameter deterioration, as demonstrated by extensive experiments on benchmark datasets.
    Abstract Traditional test-time adaptation (TTA) methods face significant challenges in adapting to dynamic environments characterized by continuously changing long-term target distributions. These challenges primarily stem from two factors: catastrophic forgetting of previously learned valuable source knowledge and gradual error accumulation caused by miscalibrated pseudo labels. To address these issues, this paper introduces an unsupervised domain change detection method that is capable of identifying domain shifts in dynamic environments and subsequently resets the model parameters to the original source pre-trained values. By restoring the knowledge from the source, it effectively corrects the negative consequences arising from the gradual deterioration of model parameters caused by ongoing shifts in the domain. Our method involves progressive estimation of global batch-norm statistics specific to each domain, while keeping track of changes in the statistics triggered by domain shifts. Importantly, our method is agnostic to the specific adaptation technique employed and thus, can be incorporated to existing TTA methods to enhance their performance in dynamic environments. We perform extensive experiments on benchmark datasets to demonstrate the superior performance of our method compared to state-of-the-art adaptation methods.
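
A simplified sketch of the mechanism described above: feature statistics are tracked over the test stream, a large jump in those statistics is treated as a domain shift, and the model is then reset to its source pre-trained weights. The distance measure, threshold, and momentum are illustrative assumptions.

```python
import copy
import torch

class DomainShiftResetter:
    def __init__(self, model, threshold=2.0, momentum=0.1):
        self.source_state = copy.deepcopy(model.state_dict())  # keep source knowledge
        self.threshold, self.momentum = threshold, momentum
        self.running_mean = None

    def update(self, model, feats):
        """feats: (B, C) features of the current test batch.
        Returns True if a domain shift was detected and the model was reset."""
        batch_mean = feats.mean(dim=0)
        if self.running_mean is None:
            self.running_mean = batch_mean.detach()
            return False
        dist = torch.norm(batch_mean - self.running_mean)
        if dist > self.threshold:                        # domain shift detected
            model.load_state_dict(self.source_state)     # restore source weights
            self.running_mean = batch_mean.detach()
            return True
        self.running_mean = ((1 - self.momentum) * self.running_mean
                             + self.momentum * batch_mean.detach())
        return False
```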

Exploiting Inductive Biases in Video Modeling through Neural CDEs

  • paper_url: http://arxiv.org/abs/2311.04986
  • repo_url: None
  • paper_authors: Johnathan Chiu, Samuel Duffield, Max Hunter-Gordon, Kaelan Donatella, Max Aifer, Andi Gu
  • for: This paper addresses key challenges in video tasks, notably video interpolation and mask propagation.
  • methods: Controlled Differential Equations (CDEs) are applied at varying resolutions, yielding a continuous-time U-Net architecture. Unlike traditional methods, the approach does not require explicit optical-flow learning; instead it exploits the inherent continuous-time nature of CDEs to produce a highly expressive video model.
  • results: The approach achieves competitive performance against state-of-the-art models on video interpolation and mask propagation tasks.
    Abstract We introduce a novel approach to video modeling that leverages controlled differential equations (CDEs) to address key challenges in video tasks, notably video interpolation and mask propagation. We apply CDEs at varying resolutions leading to a continuous-time U-Net architecture. Unlike traditional methods, our approach does not require explicit optical flow learning, and instead makes use of the inherent continuous-time features of CDEs to produce a highly expressive video model. We demonstrate competitive performance against state-of-the-art models for video interpolation and mask propagation tasks.

GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs

  • paper_url: http://arxiv.org/abs/2311.04901
  • repo_url: None
  • paper_authors: Zhenfang Chen, Rui Sun, Wenjun Liu, Yining Hong, Chuang Gan
  • for: This work proposes a generative neuro-symbolic visual reasoning method that grows and reuses modules, improving the efficiency and reusability of existing neuro-symbolic models.
  • methods: The approach has three stages: module initialization, module generation, and module execution. Given a vision-language task, a large language model first checks whether existing modules can be reused or grown to handle the new task; if not, a new module is initialized with specified inputs and outputs, and the LLM generates code snippets that match the requirements. Few-shot training examples serve as test cases, and modules that pass them are added to the module library for future reuse. Finally, the parsed programs are executed with the generated visual modules to obtain results.
  • results: The model performs competitively on standard tasks such as visual question answering and referring expression comprehension, modules learned on one task transfer seamlessly to new tasks, and the model can adapt to new visual reasoning tasks from only a few training examples by reusing modules.
    Abstract Recent works have shown that Large Language Models (LLMs) could empower traditional neuro-symbolic models via programming capabilities to translate language into module descriptions, thus achieving strong visual reasoning results while maintaining the model's transparency and efficiency. However, these models usually exhaustively generate the entire code snippet given each new instance of a task, which is extremely ineffective. We propose generative neuro-symbolic visual reasoning by growing and reusing modules. Specifically, our model consists of three unique stages, module initialization, module generation, and module execution. First, given a vision-language task, we adopt LLMs to examine whether we could reuse and grow over established modules to handle this new task. If not, we initialize a new module needed by the task and specify the inputs and outputs of this new module. After that, the new module is created by querying LLMs to generate corresponding code snippets that match the requirements. In order to get a better sense of the new module's ability, we treat few-shot training examples as test cases to see if our new module could pass these cases. If yes, the new module is added to the module library for future reuse. Finally, we evaluate the performance of our model on the testing set by executing the parsed programs with the newly made visual modules to get the results. We find the proposed model possesses several advantages. First, it performs competitively on standard tasks like visual question answering and referring expression comprehension; Second, the modules learned from one task can be seamlessly transferred to new tasks; Last but not least, it is able to adapt to new visual reasoning tasks by observing a few training examples and reusing modules.

Are foundation models efficient for medical image segmentation?

  • paper_url: http://arxiv.org/abs/2311.04847
  • repo_url: None
  • paper_authors: Danielle Ferreira, Rima Arnaout
  • for: This paper evaluates the Segment Anything model (SAM) against a modality-specific, label-free self-supervised learning (SSL) method on 25 measurements from 100 cardiac ultrasounds.
  • methods: SAM, which relies on supervised training at unprecedented scale, is compared with the SSL method in terms of performance against clinical ground truth and resources (labeling time, compute).
  • results: SAM performed poorly and required significantly more labeling and computing resources, demonstrating worse efficiency than the SSL method.
    Abstract Foundation models are experiencing a surge in popularity. The Segment Anything model (SAM) asserts an ability to segment a wide spectrum of objects but required supervised training at unprecedented scale. We compared SAM's performance (against clinical ground truth) and resources (labeling time, compute) to a modality-specific, label-free self-supervised learning (SSL) method on 25 measurements for 100 cardiac ultrasounds. SAM performed poorly and required significantly more labeling and computing resources, demonstrating worse efficiency than SSL.

Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction

  • paper_url: http://arxiv.org/abs/2311.04834
  • repo_url: https://github.com/deeplab-ai/selfsupervisedvrd
  • paper_authors: Zacharias Anastasakis, Dimitrios Mallis, Markos Diomataris, George Alexandridis, Stefanos Kollias, Vassilis Pitsikalis
  • for: This work proposes a self-supervised representation-learning approach for Visual Relationship Detection (VRD).
  • methods: Building on Masked Image Modeling (MIM), Masked Bounding Box Reconstruction (MBBR) masks a fraction of the entities/objects in a scene and reconstructs them from the unmasked objects. Through object-level masked modeling, the network learns context-aware representations that capture how objects interact within a scene and are therefore highly predictive of visual object relationships.
  • results: In a few-shot setting, the method surpasses state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation using only a few annotated samples, and qualitative and quantitative evaluations show the learned representations are robust and well suited to VRD.
    Abstract We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD). Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR), a variation of MIM where a percentage of the entities/objects within a scene are masked and subsequently reconstructed based on the unmasked objects. The core idea is that, through object-level masked modeling, the network learns context-aware representations that capture the interaction of objects within a scene and thus are highly predictive of visual object relationships. We extensively evaluate learned representations, both qualitatively and quantitatively, in a few-shot setting and demonstrate the efficacy of MBBR for learning robust visual representations, particularly tailored for VRD. The proposed method is able to surpass state-of-the-art VRD methods on the Predicate Detection (PredDet) evaluation setting, using only a few annotated samples. We make our code available at https://github.com/deeplab-ai/SelfSupervisedVRD.
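
A hedged sketch of masked bounding-box reconstruction as described above: a fraction of the per-object feature vectors in a scene is replaced by a learned mask token, the set is passed through a transformer encoder, and the masked features are reconstructed from the visible ones. The feature dimension, mask ratio, and the plain MSE objective are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MBBR(nn.Module):
    def __init__(self, dim=256, nhead=8, depth=4, mask_ratio=0.3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.mask_ratio = mask_ratio

    def forward(self, obj_feats):
        """obj_feats: (B, N, D) feature vectors of the N objects in each scene."""
        B, N, D = obj_feats.shape
        mask = torch.rand(B, N, device=obj_feats.device) < self.mask_ratio
        mask[:, 0] |= ~mask.any(dim=1)        # ensure at least one masked object
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N, D), obj_feats)
        recon = self.encoder(corrupted)
        # Reconstruction loss only on the masked objects.
        return ((recon - obj_feats) ** 2)[mask].mean()

if __name__ == "__main__":
    print(MBBR()(torch.randn(2, 12, 256)).item())
```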

Anonymizing medical case-based explanations through disentanglement

  • paper_url: http://arxiv.org/abs/2311.04833
  • repo_url: None
  • paper_authors: Helena Montenegro, Jaime S. Cardoso
  • for: This work addresses privacy concerns around case-based explanations in deep learning for medical image analysis by disentangling the identity and medical characteristics of images and anonymizing them.
  • methods: A disentanglement mechanism replaces some feature vectors of an image while preserving the remaining features, yielding independent feature vectors that encode the image's identity and medical characteristics. A second model manufactures synthetic privacy-preserving identities to replace the original image's identity and achieve anonymization.
  • results: Experiments on medical and biometric datasets show that the models generate realistic-looking anonymized images that preserve the original medical content, and that the network can also generate counterfactual images by replacing medical features.
    Abstract Case-based explanations are an intuitive method to gain insight into the decision-making process of deep learning models in clinical contexts. However, medical images cannot be shared as explanations due to privacy concerns. To address this problem, we propose a novel method for disentangling identity and medical characteristics of images and apply it to anonymize medical images. The disentanglement mechanism replaces some feature vectors in an image while ensuring that the remaining features are preserved, obtaining independent feature vectors that encode the images' identity and medical characteristics. We also propose a model to manufacture synthetic privacy-preserving identities to replace the original image's identity and achieve anonymization. The models are applied to medical and biometric datasets, demonstrating their capacity to generate realistic-looking anonymized images that preserve their original medical content. Additionally, the experiments show the network's inherent capacity to generate counterfactual images through the replacement of medical features.

SODAWideNet – Salient Object Detection with an Attention augmented Wide Encoder Decoder network without ImageNet pre-training

  • paper_url: http://arxiv.org/abs/2311.04828
  • repo_url: https://github.com/VimsLab/SODAWideNet
  • paper_authors: Rohit Venkata Sai Dulam, Chandra Kambhamettu
  • for: The goal is to develop a new Salient Object Detection (SOD) model trained from scratch, without ImageNet pre-training of the backbone.
  • methods: An encoder-decoder network with novel feature-refinement modules, including a Multi Receptive Field Feature Aggregation Module (MRFFAM) based on dilated convolutions and a Multi-Scale Attention (MSA) module, resulting in a wide, shallow, parameter-efficient architecture.
  • results: Two variants, SODAWideNet-S (3.03M parameters) and SODAWideNet (9.03M), achieve competitive performance on five datasets while remaining parameter-efficient.
    Abstract Developing a new Salient Object Detection (SOD) model involves selecting an ImageNet pre-trained backbone and creating novel feature refinement modules to use backbone features. However, adding new components to a pre-trained backbone needs retraining the whole network on the ImageNet dataset, which requires significant time. Hence, we explore developing a neural network from scratch directly trained on SOD without ImageNet pre-training. Such a formulation offers full autonomy to design task-specific components. To that end, we propose SODAWideNet, an encoder-decoder-style network for Salient Object Detection. We deviate from the commonly practiced paradigm of narrow and deep convolutional models to a wide and shallow architecture, resulting in a parameter-efficient deep neural network. To achieve a shallower network, we increase the receptive field from the beginning of the network using a combination of dilated convolutions and self-attention. Therefore, we propose Multi Receptive Field Feature Aggregation Module (MRFFAM) that efficiently obtains discriminative features from farther regions at higher resolutions using dilated convolutions. Next, we propose Multi-Scale Attention (MSA), which creates a feature pyramid and efficiently computes attention across multiple resolutions to extract global features from larger feature maps. Finally, we propose two variants, SODAWideNet-S (3.03M) and SODAWideNet (9.03M), that achieve competitive performance against state-of-the-art models on five datasets.
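
A rough sketch of a multi-receptive-field block in the spirit of MRFFAM: parallel dilated convolutions with increasing dilation rates whose outputs are concatenated and fused, enlarging the receptive field early in a wide, shallow network. The channel counts and dilation rates are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(1, 3, 6, 9)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding keeps the spatial size.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = [self.act(b(x)) for b in self.branches]
        return self.act(self.fuse(torch.cat(feats, dim=1)))

if __name__ == "__main__":
    print(MultiReceptiveFieldBlock(64, 64)(torch.randn(1, 64, 56, 56)).shape)
```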

Cross-Silo Federated Learning Across Divergent Domains with Iterative Parameter Alignment

  • paper_url: http://arxiv.org/abs/2311.04818
  • repo_url: https://github.com/mattgorb/iterative_parameter_alignment
  • paper_authors: Matt Gorbett, Hossein Shirazi, Indrakshi Ray
  • for: This paper proposes a federated learning approach on a peer-to-peer topology in which each participant trains a model on its own dataset and aligns parameters across models to improve generalization.
  • methods: A weighted distance minimization over model parameters shared in a peer-to-peer topology (Iterative Parameter Alignment), yielding a unique solution for each participant with an optional early-stopping mechanism for fairness.
  • results: Experiments show competitive results on a variety of data partitions compared with state-of-the-art approaches, and the method is robust to divergent domains (disjoint classes across peers) where existing approaches struggle.
    Abstract Learning from the collective knowledge of data dispersed across private sources can provide neural networks with enhanced generalization capabilities. Federated learning, a method for collaboratively training a machine learning model across remote clients, achieves this by combining client models via the orchestration of a central server. However, current approaches face two critical limitations: i) they struggle to converge when client domains are sufficiently different, and ii) current aggregation techniques produce an identical global model for each client. In this work, we address these issues by reformulating the typical federated learning setup: rather than learning a single global model, we learn N models each optimized for a common objective. To achieve this, we apply a weighted distance minimization to model parameters shared in a peer-to-peer topology. The resulting framework, Iterative Parameter Alignment, applies naturally to the cross-silo setting, and has the following properties: (i) a unique solution for each participant, with the option to globally converge each model in the federation, and (ii) an optional early-stopping mechanism to elicit fairness among peers in collaborative learning settings. These characteristics jointly provide a flexible new framework for iteratively learning from peer models trained on disparate datasets. We find that the technique achieves competitive results on a variety of data partitions compared to state-of-the-art approaches. Further, we show that the method is robust to divergent domains (i.e. disjoint classes across peers) where existing approaches struggle.
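
A minimal sketch of the weighted distance minimization at the heart of Iterative Parameter Alignment: each peer optimizes its own task loss plus a weighted squared distance between its parameters and those of the other peers, so the models are pulled toward each other without being collapsed into a single global model. The per-peer weights, the detach on peer parameters, and the penalty coefficient are assumptions.

```python
import torch

def alignment_penalty(model, peer_models, weights):
    """Weighted sum of squared L2 distances between this model's parameters
    and each peer's parameters."""
    penalty = 0.0
    for peer, w in zip(peer_models, weights):
        for p, q in zip(model.parameters(), peer.parameters()):
            penalty = penalty + w * ((p - q.detach()) ** 2).sum()
    return penalty

def training_step(model, peers, weights, batch, loss_fn, optimizer, lam=1e-3):
    x, y = batch
    loss = loss_fn(model(x), y) + lam * alignment_penalty(model, peers, weights)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```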

Domain Adaptive Object Detection via Balancing Between Self-Training and Adversarial Learning

  • paper_url: http://arxiv.org/abs/2311.04815
  • repo_url: None
  • paper_authors: Muhammad Akhtar Munir, Muhammad Haris Khan, M. Saquib Sarfraz, Mohsen Ali
  • for: This work aims to improve how well deep-learning-based object detectors adapt to new target domains with significant variations in objects and background.
  • methods: The model's predictive uncertainty is used to balance adversarial feature alignment and class-level alignment. A technique quantifies uncertainty on class assignments and bounding-box predictions; low-uncertainty predictions generate pseudo-labels for self-training, while high-uncertainty predictions generate tiles for adversarial feature alignment.
  • results: The approach outperforms existing state-of-the-art methods with noticeable margins across five diverse and challenging adaptation scenarios.
    Abstract Deep learning based object detectors struggle generalizing to a new target domain bearing significant variations in object and background. Most current methods align domains by using image or instance-level adversarial feature alignment. This often suffers due to unwanted background and lacks class-specific alignment. A straightforward approach to promote class-level alignment is to use high confidence predictions on unlabeled domain as pseudo-labels. These predictions are often noisy since model is poorly calibrated under domain shift. In this paper, we propose to leverage model's predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment. We develop a technique to quantify predictive uncertainty on class assignments and bounding-box predictions. Model predictions with low uncertainty are used to generate pseudo-labels for self-training, whereas the ones with higher uncertainty are used to generate tiles for adversarial feature alignment. This synergy between tiling around uncertain object regions and generating pseudo-labels from highly certain object regions allows capturing both image and instance-level context during the model adaptation. We report thorough ablation study to reveal the impact of different components in our approach. Results on five diverse and challenging adaptation scenarios show that our approach outperforms existing state-of-the-art methods with noticeable margins.
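
An illustrative sketch of splitting detections by predictive uncertainty as the abstract describes: confident detections become pseudo-labels for self-training, while uncertain ones mark regions to be handled by adversarial feature alignment. The normalized-entropy measure and the two thresholds are assumptions, not the paper's exact criteria.

```python
import torch

def split_by_uncertainty(class_probs, boxes, low=0.2, high=0.6):
    """class_probs: (N, C) per-detection class probabilities; boxes: (N, 4).
    Returns pseudo-label boxes and classes, plus boxes of uncertain regions."""
    entropy = -(class_probs * torch.log(class_probs.clamp_min(1e-8))).sum(dim=1)
    entropy = entropy / torch.log(torch.tensor(float(class_probs.shape[1])))  # -> [0, 1]
    pseudo_idx = entropy < low        # certain -> pseudo-labels for self-training
    tile_idx = entropy > high         # uncertain -> tiles for adversarial alignment
    return boxes[pseudo_idx], class_probs[pseudo_idx].argmax(dim=1), boxes[tile_idx]
```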

Be Careful When Evaluating Explanations Regarding Ground Truth

  • paper_url: http://arxiv.org/abs/2311.04813
  • repo_url: https://github.com/mi2datalab/be-careful-evaluating-explanations
  • paper_authors: Hubert Baniecki, Maciej Chrabaszcz, Andreas Holzinger, Bastian Pfeifer, Anna Saranti, Przemyslaw Biecek
  • for: This work concerns evaluating the robustness of safety-critical deep learning systems, such as those used in medical image analysis and robotics, by assessing how well model explanations align with human-defined ground truth (e.g., segmentation masks).
  • methods: A framework jointly evaluates the robustness of systems that combine a deep neural network with an explanation method, using a fine-tuning procedure to (mis)align model-explanation pipelines with ground truth and quantify the gap between worst-case and best-case human alignment.
  • results: Experiments across model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the vulnerability of such AI systems to potential adversarial attacks.
    Abstract Evaluating explanations of image classifiers regarding ground truth, e.g. segmentation masks defined by human perception, primarily evaluates the quality of the models under consideration rather than the explanation methods themselves. Driven by this observation, we propose a framework for $\textit{jointly}$ evaluating the robustness of safety-critical systems that $\textit{combine}$ a deep neural network with an explanation method. These are increasingly used in real-world applications like medical image analysis or robotics. We introduce a fine-tuning procedure to (mis)align model$\unicode{x2013}$explanation pipelines with ground truth and use it to quantify the potential discrepancy between worst and best-case scenarios of human alignment. Experiments across various model architectures and post-hoc local interpretation methods provide insights into the robustness of vision transformers and the overall vulnerability of such AI systems to potential adversarial attacks.

Image-Based Virtual Try-On: A Survey

  • paper_url: http://arxiv.org/abs/2311.04811
  • repo_url: https://github.com/little-misfit/survey-of-virtual-try-on
  • paper_authors: Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, An-An Liu
  • for: This paper aims to provide a comprehensive analysis of state-of-the-art techniques and methodologies in image-based virtual try-on, and to identify key trends and future research directions in this field.
  • methods: The survey is organized around a pipeline architecture covering person representation, try-on indication, clothing warping, and the try-on stage. The authors also propose a new semantic criterion using CLIP and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset.
  • results: The paper provides a comprehensive overview of current image-based virtual try-on research, including quantitative and qualitative evaluations of open-source methods, and demonstrates the potential of large-scale models by fine-tuning a recent image generation model (PBE) using ControlNet.
    Abstract Image-based virtual try-on aims to synthesize a naturally dressed person image with a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potentials. However, there is a great gap between current research progress and commercial applications and an absence of comprehensive overview towards this field to accelerate the development. In this survey, we provide a comprehensive analysis of the state-of-the-art techniques and methodologies in aspects of pipeline architecture, person representation and key modules such as try-on indication, clothing warping and try-on stage. We propose a new semantic criteria with CLIP, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, we also utilize ControlNet to fine-tune a recent large image generation model (PBE) to show future potentials of large-scale models on image-based virtual try-on task. Finally, unresolved issues are revealed and future research directions are prospected to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset and collected methods will be made public available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.

VioLA: Aligning Videos to 2D LiDAR Scans

  • paper_url: http://arxiv.org/abs/2311.04783
  • repo_url: None
  • paper_authors: Jun-Jee Chao, Selim Engin, Nikhil Chavan-Dafle, Bhoram Lee, Volkan Isler
  • for: Aligning a video sequence that captures a local portion of an environment to a 2D LiDAR scan of the entire environment.
  • methods: VioLA first builds a semantic map of the local scene from the image sequence, then extracts points at a fixed height for registration to the LiDAR map.
  • results: A pre-trained text-to-image inpainting model paired with a depth completion model fills in missing scene content in a geometrically consistent way, improving pose registration performance by up to 20%.
    Abstract We study the problem of aligning a video that captures a local portion of an environment to the 2D LiDAR scan of the entire environment. We introduce a method (VioLA) that starts with building a semantic map of the local scene from the image sequence, then extracts points at a fixed height for registering to the LiDAR map. Due to reconstruction errors or partial coverage of the camera scan, the reconstructed semantic map may not contain sufficient information for registration. To address this problem, VioLA makes use of a pre-trained text-to-image inpainting model paired with a depth completion model for filling in the missing scene content in a geometrically consistent fashion to support pose registration. We evaluate VioLA on two real-world RGB-D benchmarks, as well as a self-captured dataset of a large office scene. Notably, our proposed scene completion module improves the pose registration performance by up to 20%.

Lidar Annotation Is All You Need

  • paper_url: http://arxiv.org/abs/2311.04777
  • repo_url: https://github.com/evocargo/lidar-annotation-is-all-you-need
  • paper_authors: Dinar Sharafutdinov, Stanislav Kuskov, Saian Protasov, Alexey Voropaev
  • for: This paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup.
  • methods: Lidar point-cloud measurements are used directly as ground truth for training image-segmentation models on RGB images, with a masked loss that handles the sparse lidar-derived labels.
  • results: Experiments on several datasets, both public and proprietary, show performance comparable to a high-quality image-segmentation model while reducing the annotation burden, and the approach allows blending different types of ground truth during training.
    Abstract In recent years, computer vision has transformed fields such as medical imaging, object recognition, and geospatial analytics. One of the fundamental tasks in computer vision is semantic image segmentation, which is vital for precise object delineation. Autonomous driving represents one of the key areas where computer vision algorithms are applied. The task of road surface segmentation is crucial in self-driving systems, but it requires a labor-intensive annotation process in several data domains. The work described in this paper aims to improve the efficiency of image segmentation using a convolutional neural network in a multi-sensor setup. This approach leverages lidar (Light Detection and Ranging) annotations to directly train image segmentation models on RGB images. Lidar supplements the images by emitting laser pulses and measuring reflections to provide depth information. However, lidar's sparse point clouds often create difficulties for accurate object segmentation. Segmentation of point clouds requires time-consuming preliminary data preparation and a large amount of computational resources. The key innovation of our approach is the masked loss, addressing sparse ground-truth masks from point clouds. By calculating loss exclusively where lidar points exist, the model learns road segmentation on images by using lidar points as ground truth. This approach allows for blending of different ground-truth data types during model training. Experimental validation of the approach on benchmark datasets shows comparable performance to a high-quality image segmentation model. Incorporating lidar reduces the load on annotations and enables training of image-segmentation models without loss of segmentation quality. The methodology is tested on diverse datasets, both publicly available and proprietary. The strengths and weaknesses of the proposed method are also discussed in the paper.
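
A short sketch of the masked-loss idea: the segmentation loss is computed only at pixels where a projected lidar point provides a label, so sparse lidar annotations can supervise a dense image-segmentation model. The cross-entropy formulation and the ignore-index convention are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_segmentation_loss(logits, sparse_labels, valid_mask):
    """logits: (B, C, H, W); sparse_labels: (B, H, W) class ids;
    valid_mask: (B, H, W) bool, True where a lidar point was projected."""
    target = sparse_labels.clone()
    target[~valid_mask] = -100                  # pixels without lidar are ignored
    return F.cross_entropy(logits, target, ignore_index=-100)

if __name__ == "__main__":
    logits = torch.randn(2, 2, 64, 64)
    labels = torch.randint(0, 2, (2, 64, 64))
    mask = torch.rand(2, 64, 64) < 0.05         # ~5% of pixels carry lidar labels
    print(masked_segmentation_loss(logits, labels, mask).item())
```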

GCS-ICHNet: Assessment of Intracerebral Hemorrhage Prognosis using Self-Attention with Domain Knowledge Integration

  • paper_url: http://arxiv.org/abs/2311.04772
  • repo_url: https://github.com/Windbelll/Prognosis-analysis-of-cerebral-hemorrhage
  • paper_authors: Xuhao Shan, Xinyang Li, Ruiquan Ge, Shibin Wu, Ahmed Elazab, Jichao Zhu, Lingyan Zhang, Gangyong Jia, Qingying Xiao, Xiang Wan, Changmiao Wang
  • for: Predicting the prognosis of intracerebral hemorrhage (ICH) to support timely diagnosis and treatment and improve patient outcomes.
  • methods: GCS-ICHNet integrates multimodal brain CT image data with the Glasgow Coma Scale (GCS) score and uses a transformer-based fusion module for assessment.
  • results: GCS-ICHNet achieves 81.03% sensitivity and 91.59% specificity, outperforming average clinicians and other state-of-the-art methods.
    Abstract Intracerebral Hemorrhage (ICH) is a severe condition resulting from damaged brain blood vessel ruptures, often leading to complications and fatalities. Timely and accurate prognosis and management are essential due to its high mortality rate. However, conventional methods heavily rely on subjective clinician expertise, which can lead to inaccurate diagnoses and delays in treatment. Artificial intelligence (AI) models have been explored to assist clinicians, but many prior studies focused on model modification without considering domain knowledge. This paper introduces a novel deep learning algorithm, GCS-ICHNet, which integrates multimodal brain CT image data and the Glasgow Coma Scale (GCS) score to improve ICH prognosis. The algorithm utilizes a transformer-based fusion module for assessment. GCS-ICHNet demonstrates high sensitivity 81.03% and specificity 91.59%, outperforming average clinicians and other state-of-the-art methods.

An attention-based deep learning network for predicting Platinum resistance in ovarian cancer

  • paper_url: http://arxiv.org/abs/2311.04769
  • repo_url: None
  • paper_authors: Haoming Zhuang, Beibei Li, Jingtong Ma, Patrice Monkam, Shouliang Qi, Wei Qian, Dianning He
  • for: The study proposes a deep-learning-based method to determine whether a patient with high-grade serous ovarian cancer (HGSOC) is platinum-resistant.
  • methods: Using data from 289 HGSOC patients, a Dense Convolutional Network (DenseNet) augmented with a Squeeze-Excitation Block (SE Block) and a Spatial Pyramid Pooling Layer (SPPLayer) predicts platinum resistance from multimodal PET/CT images of the regions of interest.
  • results: Under five-fold cross-validation, SE-SPP-DenseNet achieves an accuracy of 92.6% and an AUC of 0.93; ablation studies and single-modality experiments confirm the value of the SE Block, the SPPLayer, and the multimodal data.
    Abstract Background: Ovarian cancer is among the three most frequent gynecologic cancers globally. High-grade serous ovarian cancer (HGSOC) is the most common and aggressive histological type. Guided treatment for HGSOC typically involves platinum-based combination chemotherapy, necessitating an assessment of whether the patient is platinum-resistant. The purpose of this study is to propose a deep learning-based method to determine whether a patient is platinum-resistant using multimodal positron emission tomography/computed tomography (PET/CT) images. Methods: 289 patients with HGSOC were included in this study. An end-to-end SE-SPP-DenseNet model was built by adding Squeeze-Excitation Block (SE Block) and Spatial Pyramid Pooling Layer (SPPLayer) to Dense Convolutional Network (DenseNet). Multimodal data from PET/CT images of the regions of interest (ROI) were used to predict platinum resistance in patients. Results: Through five-fold cross-validation, SE-SPP-DenseNet achieved a high accuracy rate and an area under the curve (AUC) in predicting platinum resistance in patients, which were 92.6% and 0.93, respectively. The importance of incorporating SE Block and SPPLayer into the deep learning model, and considering multimodal data was substantiated by carrying out ablation studies and experiments with single modality data. Conclusions: The obtained classification results indicate that our proposed deep learning framework performs better in predicting platinum resistance in patients, which can help gynecologists make better treatment decisions. Keywords: PET/CT, CNN, SE Block, SPP Layer, Platinum resistance, Ovarian cancer
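
For reference, a standard Squeeze-and-Excitation block of the kind the SE-SPP-DenseNet description refers to; this is the generic SE formulation, not the paper's exact configuration (the reduction ratio is an assumption).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # squeeze
        self.fc = nn.Sequential(                                # excitation
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                            # channel re-weighting

if __name__ == "__main__":
    print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)
```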

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

  • paper_url: http://arxiv.org/abs/2311.04766
  • repo_url: None
  • paper_authors: Guinan Su, Yanwu Yang, Zhifeng Li
  • for: 这个研究旨在提高音频驱动3D面部动画的精度和效率,特别是在虚拟现实、游戏和视频会议等应用中。
  • methods: 该研究提出了一种交叉模式双学习框架,称为DualTalker,以提高数据使用效率并关联跨模态关系。该框架共同受过主任务(音频驱动面部动画)和其双任务(详细说话)的联合训练,并共享音频/运动编码器组件。
  • results: 经过广泛的实验和一项感知用户研究,我们展示了我们的方法在VOCA和BIWI数据集上的较好的表现,both qualitatively和quantitatively。我们的代码和视频示例已经在https://github.com/sabrina-su/iadf.git中提供。
    Abstract In recent years, audio-driven 3D facial animation has gained significant attention, particularly in applications such as virtual reality, gaming, and video conferencing. However, accurately modeling the intricate and subtle dynamics of facial expressions remains a challenge. Most existing studies approach the facial animation task as a single regression problem, which often fail to capture the intrinsic inter-modal relationship between speech signals and 3D facial animation and overlook their inherent consistency. Moreover, due to the limited availability of 3D-audio-visual datasets, approaches learning with small-size samples have poor generalizability that decreases the performance. To address these issues, in this study, we propose a cross-modal dual-learning framework, termed DualTalker, aiming at improving data usage efficiency as well as relating cross-modal dependencies. The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components. Our joint training framework facilitates more efficient data usage by leveraging information from both tasks and explicitly capitalizing on the complementary relationship between facial motion and audio to improve performance. Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate the potential over-smoothing underlying the cross-modal complementary representations, enhancing the mapping of subtle facial expression dynamics. Through extensive experiments and a perceptual user study conducted on the VOCA and BIWI datasets, we demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. We have made our code and video demonstrations available at https://github.com/sabrina-su/iadf.git.

Social Motion Prediction with Cognitive Hierarchies

  • paper_url: http://arxiv.org/abs/2311.04726
  • repo_url: https://github.com/Walter0807/Social-CH
  • paper_authors: Wentao Zhu, Jason Qin, Yuke Lou, Hang Ye, Xiaoxuan Ma, Hai Ci, Yizhou Wang
  • for: 这个研究的目的是复制人类在预测他人行为方面的能力,通过解决社交动作预测问题。
  • methods: 这个研究使用了一个新的比较器、一种新的形式化、以及基于认知的框架。他们还使用了行为做副本和生成对抗学习来提高学习效率和通用性。
  • results: 研究人员通过实施了一种新的3D多人动作数据集,并通过对比较器和生成对抗学习来验证数据集和方法的有效性。
    Abstract Humans exhibit a remarkable capacity for anticipating the actions of others and planning their own actions accordingly. In this study, we strive to replicate this ability by addressing the social motion prediction problem. We introduce a new benchmark, a novel formulation, and a cognition-inspired framework. We present Wusi, a 3D multi-person motion dataset under the context of team sports, which features intense and strategic human interactions and diverse pose distributions. By reformulating the problem from a multi-agent reinforcement learning perspective, we incorporate behavioral cloning and generative adversarial imitation learning to boost learning efficiency and generalization. Furthermore, we take into account the cognitive aspects of the human social action planning process and develop a cognitive hierarchy framework to predict strategic human social interactions. We conduct comprehensive experiments to validate the effectiveness of our proposed dataset and approach. Code and data are available at https://walter0807.github.io/Social-CH/.

Training CLIP models on Data from Scientific Papers

  • paper_url: http://arxiv.org/abs/2311.04711
  • repo_url: https://github.com/nopperl/clip_arxiv_pmc
  • paper_authors: Calvin Metzger
  • for: This paper investigates whether limited amounts of higher-quality, domain-specific data improve the general performance of CLIP models.
  • methods: Text-image data are extracted from scientific papers hosted on arXiv and PubMed Central and used to train small-scale CLIP models (ViT B/32).
  • results: Model performance increases on average, but only moderately, indicating that training large-scale CLIP models on these data sources is a worthwhile research direction.
    Abstract Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts higher quality data in a specific domain improve the general performance of CLIP models. To this purpose, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwile research direction.
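
A compact sketch of the symmetric contrastive (CLIP-style) objective such training applies to matched text-image pairs extracted from papers. The encoders are omitted, and the temperature and batch handling are standard choices rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, D) embeddings of matched image/text pairs."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```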

3D Pose Estimation of Tomato Peduncle Nodes using Deep Keypoint Detection and Point Cloud

  • paper_url: http://arxiv.org/abs/2311.04699
  • repo_url: None
  • paper_authors: Jianchao Ci, Xin Wang, David Rapado-Rincón, Akshay K. Burusa, Gert Kootstra
  • for: This study provides a keypoint-detection-based method that uses RGB-D camera data to estimate the 3D pose of tomato peduncle nodes, supporting automated harvesting of tomato bunches in greenhouses.
  • methods: Four anatomical landmarks are detected in the color image and integrated with 3D point-cloud information to determine the 3D pose of the peduncle node.
  • results: The method achieves high object-detection accuracy (AP@0.5 = 0.96), a keypoint detection rate of PDJ@0.2 = 94.31%, and 3D pose estimation errors (MAE) of 11.38° and 9.93° for the relative upper and lower angles; it is also robust to viewpoint changes.
    Abstract Greenhouse production of fruits and vegetables in developed countries is challenged by labor scarcity and high labor costs. Robots offer a good solution for sustainable and cost-effective production. Acquiring accurate spatial information about relevant plant parts is vital for successful robot operation. Robot perception in greenhouses is challenging due to variations in plant appearance, viewpoints, and illumination. This paper proposes a keypoint-detection-based method using data from an RGB-D camera to estimate the 3D pose of peduncle nodes, which provides essential information to harvest the tomato bunches. Specifically, this paper proposes a method that detects four anatomical landmarks in the color image and then integrates 3D point-cloud information to determine the 3D pose. A comprehensive evaluation was conducted in a commercial greenhouse to gain insight into the performance of different parts of the method. The results showed: (1) high accuracy in object detection, achieving an Average Precision (AP) of AP@0.5=0.96; (2) an average Percentage of Detected Joints (PDJ) of the keypoints of PDJ@0.2=94.31%; and (3) 3D pose estimation accuracy with mean absolute errors (MAE) of 11.38° and 9.93° for the relative upper and lower angles between the peduncle and main stem, respectively. Furthermore, the capability to handle variations in viewpoint was investigated, demonstrating the method was robust to view changes. However, canonical and higher views resulted in slightly higher performance compared to other views. Although tomato was selected as a use case, the proposed method is also applicable to other greenhouse crops like pepper.
    摘要 在发达国家,温室果蔬生产面临劳动力短缺和劳动力成本高的挑战。机器人为可持续且低成本的生产提供了良好的解决方案。获取相关植物部位的准确空间信息是机器人成功作业的关键,而温室中植物外观、视角和光照的变化使机器人感知变得困难。本文提出了一种基于关键点检测、利用RGB-D相机数据的方法,用于估计果柄节点的3D姿态,为采摘番茄串提供关键信息。具体而言,该方法先在彩色图像中检测四个解剖学关键点,再结合3D点云信息确定3D姿态。我们在商业温室中进行了全面评估,结果显示:(1)目标检测精度高,AP@0.5=0.96;(2)关键点的平均检测率PDJ@0.2=94.31%;(3)果柄与主茎之间上、下夹角的3D姿态估计平均绝对误差分别为11.38°和9.93°。此外,实验还考察了该方法对视点变化的处理能力,表明其对视角变化具有鲁棒性,但正视及较高视角的性能略优于其他视角。虽然以番茄为例,该方法同样适用于辣椒等其他温室作物。
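
To make the second stage of the pipeline above more concrete, here is a minimal sketch of lifting detected 2D landmarks to 3D with a pinhole camera model and computing the upper/lower angles between the peduncle and the main stem; the intrinsics, landmark names, and depth values are hypothetical and not taken from the paper.

```python
import numpy as np

def backproject(uv, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with depth z (in metres) to a 3D camera-frame point."""
    u, v = uv
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def angle_between(a, b):
    """Angle in degrees between two 3D vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical intrinsics and detected landmarks (pixel coords + depth from the RGB-D frame).
fx = fy = 615.0
cx, cy = 320.0, 240.0
landmarks_px = {"stem_top": (300, 100), "node": (310, 150),
                "peduncle_mid": (340, 160), "stem_bottom": (305, 220)}
depths_m = {"stem_top": 0.52, "node": 0.53, "peduncle_mid": 0.55, "stem_bottom": 0.54}

pts = {k: backproject(landmarks_px[k], depths_m[k], fx, fy, cx, cy) for k in landmarks_px}

# Upper angle: between the upward stem direction and the peduncle direction at the node.
upper = angle_between(pts["stem_top"] - pts["node"], pts["peduncle_mid"] - pts["node"])
# Lower angle: between the downward stem direction and the peduncle direction.
lower = angle_between(pts["stem_bottom"] - pts["node"], pts["peduncle_mid"] - pts["node"])
print(f"upper angle: {upper:.1f} deg, lower angle: {lower:.1f} deg")
```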

Weakly supervised cross-model learning in high-content screening

  • paper_url: http://arxiv.org/abs/2311.04678
  • repo_url: None
  • paper_authors: Watkinson Gabriel, Cohen Ethan, Bourriez Nicolas, Bendidi Ihab, Bollot Guillaume, Genovesio Auguste
  • for: 本研究旨在探索如何在药物搜索中连接不同数据类型的数据。
  • methods: 我们提出了一种新的方法,利用弱监督和跨站复制在高内容检测中使用CLIP建立跨模态表示。
  • results: 我们的方法可以学习更好的表示,减轻批处理效应,并且对JUMP-CP数据集进行了有效的预处理,从85TB减少到7TB,保留了所有干扰和大多数信息内容。
    Abstract With the surge in available data from various modalities, there is a growing need to bridge the gap between different data types. In this work, we introduce a novel approach to learn cross-modal representations between image data and molecular representations for drug discovery. We propose EMM and IMM, two innovative loss functions built on top of CLIP that leverage weak supervision and cross sites replicates in High-Content Screening. Evaluating our model against known baseline on cross-modal retrieval, we show that our proposed approach allows to learn better representations and mitigate batch effect. In addition, we also present a preprocessing method for the JUMP-CP dataset that effectively reduce the required space from 85Tb to a mere usable 7Tb size, still retaining all perturbations and most of the information content.
    摘要 随着不同数据模式之间的数据量的增加,需要桥接这些数据模式之间的 gap 变得更加重要。在这种工作中,我们介绍了一种新的方法,用于从图像数据和分子表示之间学习 crossed-modal 表示。我们提出了两种创新的损失函数EMM和IMM,基于 CLIP 的上下文,利用弱监督和跨站复制在高内容检测中。我们对知道的基准进行跨Modal 检索,并显示了我们提议的方法可以学习更好的表示,并减轻批处理效应。此外,我们还提出了对 JUMP-CP 数据集的预处理方法,可以有效地将数据减少到可用 7Tb 大小,保留所有干扰和大多数信息内容。

  • paper_url: http://arxiv.org/abs/2311.04950
  • repo_url: None
  • paper_authors: Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Yansong Tang, Wenwu zhu
  • for: 提高 diffusion models 的计算效率,使其在多种任务中实现 state-of-the-art 性能。
  • methods: 提出了一种基于 Diffusion Distillation 的 Block-wise Neural Architecture Search (DiffNAS) 方法,通过自动去除 diffusion models 中的结构冗余来减少计算成本。
  • results: 实验表明,DiffNAS 可以实现约 50% MACs 和参数减少,并且可以在 latent diffusion models 上实现比 teacher 更好的性能。
    Abstract Diffusion models have recently shown remarkable generation ability, achieving state-of-the-art performance in many tasks. However, the high computational cost is still a troubling problem for diffusion models. To tackle this problem, we propose to automatically remove the structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS). Specifically, given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which achieves on-par or even better performance than the teacher. Considering current diffusion models are based on UNet which naturally has a block-wise structure, we perform neural architecture search independently in each block, which largely reduces the search space. Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss. Concretely, during the search process, we block-wisely select the best subnet to avoid the unfairness brought by the global search strategy used in previous works. When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation. We demonstrate this joint loss can effectively improve model performance. We also prove the necessity of the dynamic adjustment of this loss. The experiments show that our method can achieve significant computational reduction, especially on latent diffusion models with about 50% MACs and Parameter reduction.
    摘要 Diffusion 模型最近显示出了很好的生成能力,在许多任务中达到了状态的核心性能。然而,高计算成本仍然是 diffusion 模型的一个困扰问题。为解决这个问题,我们提出了自动Remove diffusion 模型中的结构冗余的方法:Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS)。具体来说,我们使用 DiffNAS 在一个更大的预训练老师模型基础上进行搜索,找到与老师模型具有相同或更好的性能的最小架构。由于现有的 diffusion 模型基于 UNet 结构,我们在搜索过程中独立地进行每个块的 neural architecture search,从而大大减少搜索空间。与前一些块基本 NAS 方法不同,DiffNAS 包含块基本选择策略和重新训练策略,并且采用了一种动态共同损失。在搜索过程中,我们会在每个块中选择最佳子网,以避免由全局搜索策略所带来的不公平性。在重新训练搜索出的架构时,我们采用了一种动态共同损失,以保持超网训练和子网重新训练之间的一致性,同时提供了每个块的有用目标。我们证明了这种动态调整的损失是必要的。实验表明,我们的方法可以实现显著的计算减少,特别是在含有约50% MACs和参数的含括 diffusion 模型上。
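
The block-wise distillation idea can be illustrated with the toy sketch below, in which each student block is trained to match the output of the corresponding teacher block under a simple dynamic loss weighting; the blocks, schedule, and shapes are invented for illustration and do not reproduce DiffNAS's actual search procedure or joint loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "teacher" and "student" UNet-like blocks (the real ones would be conv/attention blocks).
teacher_blocks = nn.ModuleList([nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)]).eval()
student_blocks = nn.ModuleList([nn.Conv2d(16, 16, 3, padding=1) for _ in range(3)])

def blockwise_distill_loss(x, step, total_steps):
    """Sum of per-block feature-matching losses with a simple dynamic weight schedule."""
    loss = x.new_zeros(())
    t_feat = s_feat = x
    alpha = step / total_steps  # hypothetical schedule: shift emphasis toward later blocks
    for i, (tb, sb) in enumerate(zip(teacher_blocks, student_blocks)):
        with torch.no_grad():
            t_feat = tb(t_feat)          # frozen teacher features for this block
        s_feat = sb(s_feat)              # student features for the same block
        w = (1 - alpha) + alpha * (i + 1) / len(teacher_blocks)
        loss = loss + w * F.mse_loss(s_feat, t_feat)
    return loss

x = torch.randn(2, 16, 32, 32)
opt = torch.optim.Adam(student_blocks.parameters(), lr=1e-3)
for step in range(10):
    opt.zero_grad()
    blockwise_distill_loss(x, step, total_steps=10).backward()
    opt.step()
```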

VET: Visual Error Tomography for Point Cloud Completion and High-Quality Neural Rendering

  • paper_url: http://arxiv.org/abs/2311.04634
  • repo_url: https://github.com/lfranke/vet
  • paper_authors: Linus Franke, Darius Rückert, Laura Fink, Matthias Innmann, Marc Stamminger
  • for: 这个论文的目的是提高点云图像的新视图合成质量。
  • methods: 该论文使用了一种基于神经网络的方法,使用点云代理geometry来检测和修复新视图合成中的缺失或损害。
  • results: 论文的实验结果表明,该方法可以显著提高点云图像的新视图合成质量,并且可以有效地修复大规模的缺失和细腻结构。同时,该方法的实时渲染速度也得到了改进。
    Abstract In the last few years, deep neural networks opened the doors for big advances in novel view synthesis. Many of these approaches are based on a (coarse) proxy geometry obtained by structure from motion algorithms. Small deficiencies in this proxy can be fixed by neural rendering, but larger holes or missing parts, as they commonly appear for thin structures or for glossy regions, still lead to distracting artifacts and temporal instability. In this paper, we present a novel neural-rendering-based approach to detect and fix such deficiencies. As a proxy, we use a point cloud, which allows us to easily remove outlier geometry and to fill in missing geometry without complicated topological operations. Keys to our approach are (i) a differentiable, blending point-based renderer that can blend out redundant points, as well as (ii) the concept of Visual Error Tomography (VET), which allows us to lift 2D error maps to identify 3D-regions lacking geometry and to spawn novel points accordingly. Furthermore, (iii) by adding points as nested environment maps, our approach allows us to generate high-quality renderings of the surroundings in the same pipeline. In our results, we show that our approach can improve the quality of a point cloud obtained by structure from motion and thus increase novel view synthesis quality significantly. In contrast to point growing techniques, the approach can also fix large-scale holes and missing thin structures effectively. Rendering quality outperforms state-of-the-art methods and temporal stability is significantly improved, while rendering is possible at real-time frame rates.
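
A highly simplified sketch of the Visual Error Tomography idea described above follows: pixels with high 2D rendering error are back-projected along their viewing rays and accumulate votes in a voxel grid, and voxels that collect consistent votes become candidates for spawning new points. The camera model, depth sampling, and thresholds are assumptions for illustration only.

```python
import numpy as np

def lift_error_map(error_map, K, cam_to_world, depth_samples, grid, grid_min, voxel_size, thresh=0.5):
    """Accumulate votes in a voxel grid for pixels whose rendering error exceeds a threshold.

    error_map: (H, W) per-pixel error from comparing a rendering with the ground-truth view.
    K: 3x3 camera intrinsics; cam_to_world: 4x4 pose. Votes are spread along each
    high-error ray at several depth samples, since the depth of the missing geometry is unknown.
    """
    vs, us = np.nonzero(error_map > thresh)
    Kinv = np.linalg.inv(K)
    for u, v in zip(us, vs):
        ray_cam = Kinv @ np.array([u + 0.5, v + 0.5, 1.0])
        for d in depth_samples:
            p_world = (cam_to_world @ np.append(ray_cam * d, 1.0))[:3]
            idx = np.floor((p_world - grid_min) / voxel_size).astype(int)
            if np.all(idx >= 0) and np.all(idx < np.array(grid.shape)):
                grid[tuple(idx)] += error_map[v, u]
    return grid

# Toy usage: one view, a random error map, an identity pose.
K = np.array([[300.0, 0, 64], [0, 300.0, 48], [0, 0, 1]])
grid = np.zeros((32, 32, 32))
grid = lift_error_map(np.random.rand(96, 128), K, np.eye(4),
                      depth_samples=np.linspace(0.5, 3.0, 8),
                      grid=grid, grid_min=np.array([-2.0, -2.0, 0.0]), voxel_size=0.125)
# Voxels with many consistent votes across views would be candidates for spawning new points.
print("candidate voxels:", int((grid > grid.max() * 0.8).sum()))
```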

General Framework to Evaluate Unlinkability in Biometric Template Protection Systems

  • paper_url: http://arxiv.org/abs/2311.04633
  • repo_url: None
  • paper_authors: Marta Gomez-Barrero, Javier Galbally, Christian Rathgeb, Christoph Busch
  • for: 保护生物特征数据的隐私问题
  • methods: 提出了一个新的普适框架来评估生物特征模板的不可识别性
  • results: 应用于四种现有的生物特征模板保护技术中的一种,并与其他现有的指标进行比较,以显示其优势
    Abstract The wide deployment of biometric recognition systems in the last two decades has raised privacy concerns regarding the storage and use of biometric data. As a consequence, the ISO/IEC 24745 international standard on biometric information protection has established two main requirements for protecting biometric templates: irreversibility and unlinkability. Numerous efforts have been directed to the development and analysis of irreversible templates. However, there is still no systematic quantitative manner to analyse the unlinkability of such templates. In this paper we address this shortcoming by proposing a new general framework for the evaluation of biometric templates' unlinkability. To illustrate the potential of the approach, it is applied to assess the unlinkability of four state-of-the-art techniques for biometric template protection: biometric salting, Bloom filters, Homomorphic Encryption and block re-mapping. For the last technique, the proposed framework is compared with other existing metrics to show its advantages.
    摘要 在过去二十年中,生物认证系统的广泛应用已引发了隐私问题,特别是 relate to the storage and use of biometric data。为了解决这问题,国际标准ISO/IEC 24745要求保护生物特征模板的两个主要要求是不可逆和不可关联。虽然有很多努力在发展和分析不可逆模板,但是还没有一个系统性的量化方法来分析不可关联性。在这篇论文中,我们正在解决这一缺点,并提出了一个新的通用框架来评估生物特征模板的不可关联性。为了证明我们的方法的潜力,我们应用它来评估四种现状的生物模板保护技术:生物盐、Bloom filter、Homomorphic Encryption和块重映射。对于最后一种技术,我们的框架与其他现有的指标进行比较,以显示它的优势。
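
The snippet below sketches the kind of experiment such an unlinkability evaluation formalizes: protected templates enrolled in two applications are cross-matched, and the mated (same subject) and non-mated (different subject) linkage-score distributions are compared. The crude histogram-overlap indicator used here is only a stand-in; it is not the likelihood-ratio-based measure defined in the paper, and the toy templates and similarity function are invented.

```python
import numpy as np

def linkage_score_distributions(templates_a, templates_b, ids_a, ids_b, score_fn):
    """Cross-match protected templates enrolled in two applications.

    Mated scores come from the same subject in both applications, non-mated from different
    subjects. If the two distributions are indistinguishable, templates cannot be linked.
    """
    mated, non_mated = [], []
    for ta, sa in zip(templates_a, ids_a):
        for tb, sb in zip(templates_b, ids_b):
            (mated if sa == sb else non_mated).append(score_fn(ta, tb))
    return np.array(mated), np.array(non_mated)

def overlap_based_linkability(mated, non_mated, bins=50):
    """Crude 0..1 linkability indicator: 1 - overlap of the two score histograms."""
    lo, hi = min(mated.min(), non_mated.min()), max(mated.max(), non_mated.max())
    hm, edges = np.histogram(mated, bins=bins, range=(lo, hi), density=True)
    hn, _ = np.histogram(non_mated, bins=bins, range=(lo, hi), density=True)
    return 1.0 - np.sum(np.minimum(hm, hn)) * (edges[1] - edges[0])

# Toy "protected templates": random vectors; same-subject templates share a common component.
rng = np.random.default_rng(0)
subjects = np.arange(50)
base = rng.normal(size=(50, 64))
app_a = base + 0.8 * rng.normal(size=(50, 64))
app_b = base + 0.8 * rng.normal(size=(50, 64))
cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
m, nm = linkage_score_distributions(app_a, app_b, subjects, subjects, cos)
print("toy linkability indicator:", round(overlap_based_linkability(m, nm), 3))
```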

Image Patch-Matching with Graph-Based Learning in Street Scenes

  • paper_url: http://arxiv.org/abs/2311.04617
  • repo_url: None
  • paper_authors: Rui She, Qiyu Kang, Sijie Wang, Wee Peng Tay, Yong Liang Guan, Diego Navarro Navarro, Andreas Hartmannsgruber
  • for: 这篇论文主要针对自动驾驶中的计算视觉任务,即将实时捕捉的车辆摄像头中的图像与图像库中的特征区域匹配。
  • methods: 该论文提出了一种基于图像图的空间关系学习模型,其中图像patches的edge表示图像区域之间的空间关系。
  • results: 该模型在多个街景数据集上进行评估,并取得了领先的匹配结果。
    Abstract Matching landmark patches from a real-time image captured by an on-vehicle camera with landmark patches in an image database plays an important role in various computer perception tasks for autonomous driving. Current methods focus on local matching for regions of interest and do not take into account spatial neighborhood relationships among the image patches, which typically correspond to objects in the environment. In this paper, we construct a spatial graph with the graph vertices corresponding to patches and edges capturing the spatial neighborhood information. We propose a joint feature and metric learning model with graph-based learning. We provide a theoretical basis for the graph-based loss by showing that the information distance between the distributions conditioned on matched and unmatched pairs is maximized under our framework. We evaluate our model using several street-scene datasets and demonstrate that our approach achieves state-of-the-art matching results.
    摘要 将车载摄像头实时拍摄图像中的地标补丁与图像库中的地标补丁进行匹配,在自动驾驶的多种计算机感知任务中扮演着重要角色。当前方法主要集中于区域关注点的本地匹配,而不考虑图像补丁之间的空间相邻关系,这些补丁通常对应环境中的物体。在本文中,我们构建了一个空间图,其顶点对应于补丁,而边则捕捉了图像补丁之间的空间相邻关系。我们提出了一种基于图学习的联合特征和度量学习模型,并为图学习损失提供了理论基础,证明在我们的框架下,匹配对与非匹配对条件下分布之间的信息距离被最大化。我们使用多个街景数据集评估了我们的方法,并证明其达到了最先进的匹配结果。

On Characterizing the Evolution of Embedding Space of Neural Networks using Algebraic Topology

  • paper_url: http://arxiv.org/abs/2311.04592
  • repo_url: https://github.com/cross-caps/dnntopology
  • paper_authors: Suryaka Suresh, Bishshoy Das, Vinayak Abrol, Sumantra Dutta Roy
  • For: 本研究通过 Betti 数研究特征表示空间在经过深度神经网络(DNN)各层时的拓扑变化。
  • Methods: 使用立方同调(Cubical homology)分析深度神经网络的特征表示空间,并在多种流行的深度架构和真实图像数据上进行了扩展分析。
  • Results: 研究发现,随着深度的增加,拓扑复杂的数据集被变换为简单的数据集,Betti 数最终达到最低可能值;拓扑复杂度的衰减速率可以用来量化架构选择对泛化能力的影响。此外,研究还发现了若干拓扑不变性(如对相似数据集、不同深度的架构、输入分辨率和数据子采样的不变性)。
    Abstract We study how the topology of feature embedding space changes as it passes through the layers of a well-trained deep neural network (DNN) through Betti numbers. Motivated by existing studies using simplicial complexes on shallow fully connected networks (FCN), we present an extended analysis using Cubical homology instead, with a variety of popular deep architectures and real image datasets. We demonstrate that as depth increases, a topologically complicated dataset is transformed into a simple one, resulting in Betti numbers attaining their lowest possible value. The rate of decay in topological complexity (as a metric) helps quantify the impact of architectural choices on the generalization ability. Interestingly from a representation learning perspective, we highlight several invariances such as topological invariance of (1) an architecture on similar datasets; (2) embedding space of a dataset for architectures of variable depth; (3) embedding space to input resolution/size, and (4) data sub-sampling. In order to further demonstrate the link between expressivity \& the generalization capability of a network, we consider the task of ranking pre-trained models for downstream classification task (transfer learning). Compared to existing approaches, the proposed metric has a better correlation to the actually achievable accuracy via fine-tuning the pre-trained model.
    摘要 我们研究深度神经网络(DNN)中层次结构的变化,通过瓶颈复合体(Cubical homology)进行扩展分析,使用多种流行的深度架构和实际图像数据集。我们发现,随着深度增加,复杂的图像数据集变换成简单的一个,导致Betti数达到最低可能的值。 decay 率可以量化架构选择对通用能力的影响。从表示学学习的视角来看,我们发现了一些对称性,包括:(1) 架构在相似的数据集上的 topological invariance; (2) 数据集的嵌入空间的嵌入空间的 topological invariance; (3) 嵌入空间与输入分辨率/大小的 topological invariance; (4) 数据采样的 topological invariance。为了进一步证明拓扑表达能力和通用能力之间的关系,我们考虑了预训练模型的排名任务(transfer learning)。与现有方法相比,我们的指标具有更好的与实际可 achievable 精度的相关性。
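
Computing the cubical homology used in the paper requires a topological data analysis library; as a rough stand-in, the sketch below counts connected components (Betti-0) of a radius-neighborhood graph built on layer embeddings, illustrating how a topological-complexity proxy can shrink from early to late layers. The synthetic features and radius are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def betti0_proxy(features: np.ndarray, radius: float) -> int:
    """Count connected components of the radius-neighborhood graph of a set of embeddings.

    This equals Betti-0 of the corresponding Vietoris-Rips complex at scale `radius`;
    higher Betti numbers (and the cubical homology used in the paper) need a TDA library.
    """
    dist = cdist(features, features)
    adj = csr_matrix(dist <= radius)
    n_components, _ = connected_components(adj, directed=False)
    return n_components

# Toy illustration: "early layer" features are scattered, "late layer" features collapse
# class-wise, so the component count (a topological-complexity proxy) drops.
rng = np.random.default_rng(0)
early = rng.normal(size=(300, 16)) * 3.0
late = np.repeat(rng.normal(size=(3, 16)), 100, axis=0) + 0.1 * rng.normal(size=(300, 16))
for name, feats in [("early layer", early), ("late layer", late)]:
    print(name, "beta_0 proxy:", betti0_proxy(feats, radius=2.0))
```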

Rethinking Human Pose Estimation for Autonomous Driving with 3D Event Representations

  • paper_url: http://arxiv.org/abs/2311.04591
  • repo_url: https://github.com/masterhow/eventpointpose
  • paper_authors: Xiaoting Yin, Hao Shi, Jiaan Chen, Ze Wang, Yaozu Ye, Huajian Ni, Kailun Yang, Kaiwei Wang
  • for: 提高自动驾驶和停车安全性,通过预测人类行为。
  • methods: 使用事件摄像机,创建3D事件表示,并开发EV-3DPW数据集。
  • results: 在公共实际世界DHP19数据集上,事件点云技术实现了实时移动预测,而解除事件 voxel方法达到了最高准确性。实验表明我们的提posed 3D表示方法在 traditional RGB图像和事件帧技术的比较中具有更高的总体化能力。
    Abstract Human pose estimation is a critical component in autonomous driving and parking, enhancing safety by predicting human actions. Traditional frame-based cameras and videos are commonly applied, yet, they become less reliable in scenarios under high dynamic range or heavy motion blur. In contrast, event cameras offer a robust solution for navigating these challenging contexts. Predominant methodologies incorporate event cameras into learning frameworks by accumulating events into event frames. However, such methods tend to marginalize the intrinsic asynchronous and high temporal resolution characteristics of events. This disregard leads to a loss in essential temporal dimension data, crucial for safety-critical tasks associated with dynamic human activities. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC collates events within concise temporal slices at identical positions, preserving 3D attributes with statistical cues and markedly mitigating memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. On the public real-world DHP19 dataset, our event point cloud technique excels in real-time mobile predictions, while the decoupled event voxel method achieves the highest accuracy. Experiments reveal our proposed 3D representation methods' superior generalization capacities against traditional RGB images and event frame techniques. Our code and dataset are available at https://github.com/MasterHow/EventPointPose.
    摘要 人体姿态估计是自动驾驶和停车中的关键组件,提高安全性 by 预测人类行为。传统的帧基摄像头和视频通常被应用,但在高动态范围或重重运动模糊的场景下变得不可靠。相比之下,事件摄像头提供了一种可靠的解决方案。大多数方法是将事件摄像头集成到学习框架中,但这些方法通常会忽略事件的本质异步和高时间分辨率特性。这种忽略会导致数据中丢失重要的时间维度信息,这些信息对于安全关键任务相对至关重要。为了解决这个问题并激活事件信息的3D潜力,我们介绍了两种3D事件表示方法:矩阵化事件点云(RasEPC)和解除事件VOXEL(DEV)。 RasEPC将事件按照时间片的方式归并在同一个位置,保留3D特征并减少内存和计算负担。 DEV表示法将事件分解成立方体,并将其投影到三个orthogonal平面上,通过独立事件注意力来捕捉3D准确信息。此外,我们还开发了EV-3DPW Synthetic Event-based Dataset,用于训练和量化分析户外场景。在公共的real-world DHP19数据集上,我们的事件点云技术在实时移动预测中表现出色,而DEV表示法在精度方面达到最高水平。实验表明我们提出的3D表示方法具有传统RGB图像和事件帧技术的更好的总体化能力。我们的代码和数据可以在https://github.com/MasterHow/EventPointPose上获取。
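
A minimal sketch of turning an event stream (x, y, t, polarity) into a rasterized event point cloud is given below: time is cut into slices, events at the same pixel within a slice are merged, and simple statistics are kept per point. The exact attributes retained by the paper's RasEPC may differ; this is only an illustration of the idea.

```python
import numpy as np

def rasterized_event_point_cloud(events: np.ndarray, n_slices: int = 4) -> np.ndarray:
    """events: (N, 4) array of (x, y, t, polarity) with polarity in {-1, +1}.

    Returns one point per (time slice, pixel) with simple statistics:
    (x, y, slice index, event count, mean normalized timestamp, polarity sum).
    """
    x, y, t, p = events.T
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    s = np.minimum((t_norm * n_slices).astype(int), n_slices - 1)

    points = {}
    for xi, yi, ti, pi, si in zip(x.astype(int), y.astype(int), t_norm, p, s):
        key = (si, xi, yi)
        cnt, tsum, psum = points.get(key, (0, 0.0, 0.0))
        points[key] = (cnt + 1, tsum + ti, psum + pi)

    return np.array([[xi, yi, si, cnt, tsum / cnt, psum]
                     for (si, xi, yi), (cnt, tsum, psum) in points.items()])

# Toy event stream from a 64x48 sensor.
rng = np.random.default_rng(0)
ev = np.column_stack([rng.integers(0, 64, 5000), rng.integers(0, 48, 5000),
                      np.sort(rng.random(5000)), rng.choice([-1, 1], 5000)])
cloud = rasterized_event_point_cloud(ev, n_slices=4)
print("raw events:", len(ev), "-> points:", len(cloud))
```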

Weakly-supervised deepfake localization in diffusion-generated images

  • paper_url: http://arxiv.org/abs/2311.04584
  • repo_url: None
  • paper_authors: Dragos Tantaru, Elisabeta Oneata, Dan Oneata
  • for: 这 paper 的目的是提出一种weakly-supervised的 Deepfake detection方法,以便提供更多的信息,包括哪些区域被修改。
  • methods: 这 paper 使用了 three main categories of methods,包括explanations, local scores 和 attention。这些方法都基于 Xception 网络作为共同背景 Architecure。
  • results: 这 paper 的结果表明,weakly-supervised localization 是可行的,且表现最好的检测方法(基于 local scores)对较弱监督形式的敏感度,低于其对数据集或生成器不匹配的敏感度。
    Abstract The remarkable generative capabilities of denoising diffusion models have raised new concerns regarding the authenticity of the images we see every day on the Internet. However, the vast majority of existing deepfake detection models are tested against previous generative approaches (e.g. GAN) and usually provide only a "fake" or "real" label per image. We believe a more informative output would be to augment the per-image label with a localization map indicating which regions of the input have been manipulated. To this end, we frame this task as a weakly-supervised localization problem and identify three main categories of methods (based on either explanations, local scores or attention), which we compare on an equal footing by using the Xception network as the common backbone architecture. We provide a careful analysis of all the main factors that parameterize the design space: choice of method, type of supervision, dataset and generator used in the creation of manipulated images; our study is enabled by constructing datasets in which only one of the components is varied. Our results show that weakly-supervised localization is attainable, with the best performing detection method (based on local scores) being less sensitive to the looser supervision than to the mismatch in terms of dataset or generator.
    摘要 denoising diffusion模型的卓越生成能力引发了关于互联网上每天所见图像真实性的新担忧。然而,现有的深伪检测模型大多是针对前一代生成方法(如GAN)进行测试,且通常只提供每个图像的“伪”或“真”标签。我们认为,更有用的输出是在图像级标签之外,再提供一张指示输入图像中哪些区域被篡改的定位图。为实现这一目标,我们将该任务视为一个弱监督定位问题,并将相关方法分为三大类(基于解释、本地分数或注意力),以Xception网络作为共同骨干架构,在同等条件下进行比较。我们对设计空间中的所有主要因素进行了仔细分析,包括方法选择、监督类型、以及用于制作篡改图像的数据集和生成器。研究表明,弱监督定位是可行的,且表现最好的检测方法(基于本地分数)对较宽松的监督形式的敏感度,低于其对数据集或生成器不匹配的敏感度。
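
The sketch below illustrates the "local scores" family of weakly-supervised localization methods discussed above: a small head produces a patch-level fakeness map that is pooled into a single image-level prediction, so only image-level real/fake labels are needed during training. The toy backbone and pooling choice are placeholders, not the paper's Xception-based setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalScoreDetector(nn.Module):
    """Tiny stand-in backbone + 1x1 head producing a patch-level 'fakeness' map."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.score_head = nn.Conv2d(64, 1, kernel_size=1)   # local scores

    def forward(self, x):
        local = self.score_head(self.backbone(x))           # (B, 1, h, w) localization map
        image_logit = local.flatten(1).mean(dim=1)          # weak supervision: pool to image level
        return image_logit, local

model = LocalScoreDetector()
images = torch.randn(4, 3, 128, 128)
labels = torch.tensor([0., 1., 1., 0.])                     # image-level real(0)/fake(1) only
logit, local_map = model(images)
loss = F.binary_cross_entropy_with_logits(logit, labels)
loss.backward()
# At test time, sigmoid(local_map) upsampled to the input size serves as the manipulation map.
print(local_map.shape)
```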

A 3D generative model of pathological multi-modal MR images and segmentations

  • paper_url: http://arxiv.org/abs/2311.04552
  • repo_url: https://github.com/virginiafdez/brainspade3d_rel
  • paper_authors: Virginia Fernandez, Walter Hugo Lopez Pinaya, Pedro Borges, Mark S. Graham, Tom Vercauteren, M. Jorge Cardoso
  • for: 本研究旨在提供一种用于脑MRI和相关分割的三维生成模型,以便 condition on 特定的疾病现象和对比。
  • methods: 本研究使用了生成对抗网络(GANs)和扩散模型(DMs)来生成高质量的Synthetic MRI和相关分割数据,并允许用户根据特定的疾病现象和对比来控制生成的图像和分割结果。
  • results: 研究表明,brainSPADE3D可以生成高度具有一致性的Synthetic MRI和相关分割数据,并且可以结合不同的疾病现象来生成混合的图像和分割结果。此外,研究还发现,使用brainSPADE3D可以改善预测模型在不期望的疾病存在时的性能。
    Abstract Generative modelling and synthetic data can be a surrogate for real medical imaging datasets, whose scarcity and difficulty to share can be a nuisance when delivering accurate deep learning models for healthcare applications. In recent years, there has been an increased interest in using these models for data augmentation and synthetic data sharing, using architectures such as generative adversarial networks (GANs) or diffusion models (DMs). Nonetheless, the application of synthetic data to tasks such as 3D magnetic resonance imaging (MRI) segmentation remains limited due to the lack of labels associated with the generated images. Moreover, many of the proposed generative MRI models lack the ability to generate arbitrary modalities due to the absence of explicit contrast conditioning. These limitations prevent the user from adjusting the contrast and content of the images and obtaining more generalisable data for training task-specific models. In this work, we propose brainSPADE3D, a 3D generative model for brain MRI and associated segmentations, where the user can condition on specific pathological phenotypes and contrasts. The proposed joint imaging-segmentation generative model is shown to generate high-fidelity synthetic images and associated segmentations, with the ability to combine pathologies. We demonstrate how the model can alleviate issues with segmentation model performance when unexpected pathologies are present in the data.
    摘要 生成模型和人工数据可以作为真实医学影像数据的代理,解决医疗应用中深度学习模型的准确性问题。在最近几年里,有越来越多的人关注使用这些模型进行数据增强和人工数据共享,使用生成敌方网络(GANs)或扩散模型(DMs)。然而,使用生成数据进行3D磁共振成像(MRI)分割 task 仍然受到生成图像无标签的限制,以及生成模型无法生成多种模式的缺乏能力。这些限制使得用户无法调整图像的对比度和内容,从而获得更普适的数据用于训练任务特定模型。在这项工作中,我们提出了brainSPADE3D,一种3D生成模型用于脑MRI和相关的分割。用户可以根据特定的疾病现象和对比来conditioning这些模型。我们展示了该模型可以生成高 fideli ty的人工图像和相关的分割,并能够组合疾病。我们还示出了该模型可以解决预期疾病存在于数据中时,分割模型性能下降的问题。

Learning Robust Multi-Scale Representation for Neural Radiance Fields from Unposed Images

  • paper_url: http://arxiv.org/abs/2311.04521
  • repo_url: None
  • paper_authors: Nishant Jain, Suryansh Kumar, Luc Van Gool
  • for: 这 paper 是用于解决计算机视觉中的神经图像基于渲染问题,即在给定一组由自由移动相机拍摄的图像时,在测试时使用神经网络synthesize场景图像。
  • methods: 该 paper 使用了以下方法:(i)通过一个可靠的渠道来重建高精度相机参数,以便在神经新视角synthesize过程中更加准确地模拟场景图像。(ii)在day-to-day不 pose的图像中,模型对象内容的多resolution采用,以适应高速运动的相机。
  • results: 该 paper 通过实验表明,在不考虑 Camera pose 估计精度的情况下,模型多scale neural scene representation可以是Counterproductive。而在具有准确的相机pose估计的场景表示框架中,可以准确地synthesize图像。
    Abstract We introduce an improved solution to the neural image-based rendering problem in computer vision. Given a set of images taken from a freely moving camera at train time, the proposed approach could synthesize a realistic image of the scene from a novel viewpoint at test time. The key ideas presented in this paper are (i) Recovering accurate camera parameters via a robust pipeline from unposed day-to-day images is equally crucial in neural novel view synthesis problem; (ii) It is rather more practical to model object's content at different resolutions since dramatic camera motion is highly likely in day-to-day unposed images. To incorporate the key ideas, we leverage the fundamentals of scene rigidity, multi-scale neural scene representation, and single-image depth prediction. Concretely, the proposed approach makes the camera parameters as learnable in a neural fields-based modeling framework. By assuming per view depth prediction is given up to scale, we constrain the relative pose between successive frames. From the relative poses, absolute camera pose estimation is modeled via a graph-neural network-based multiple motion averaging within the multi-scale neural-fields network, leading to a single loss function. Optimizing the introduced loss function provides camera intrinsic, extrinsic, and image rendering from unposed images. We demonstrate, with examples, that for a unified framework to accurately model multiscale neural scene representation from day-to-day acquired unposed multi-view images, it is equally essential to have precise camera-pose estimates within the scene representation framework. Without considering robustness measures in the camera pose estimation pipeline, modeling for multi-scale aliasing artifacts can be counterproductive. We present extensive experiments on several benchmark datasets to demonstrate the suitability of our approach.
    摘要 我们介绍一个改进了的解决方案,用于计算机视觉中的神经图像基于测试项目。假设我们有一组由自由移动摄像机拍摄的图像,我们的方法可以将这些图像转换为具有真实感的图像,并且在测试时间点上实现不同的观察角度。我们的关键想法包括:1. 从日常生活中的不条理图像中获取精确的摄像机参数,这是神经novel view synthesis问题的重要前提。2. 因为日常生活中的图像可能会受到剧烈的摄像机运动,因此需要在不同的解析率上模型物体内容。为了实现这些想法,我们利用了场景的刚性、多尺度神经场景表示和单图像深度预测的基础知识。具体来说,我们将摄像机参数设置为神经场中的学习型态。通过假设每个视角深度预测是固定的,我们将相关的视角之间的相对位置组成一个多尺度神经网络中的多动作平均,从而得到一个单一的损失函数。通过优化这个损失函数,我们可以获得摄像机参数、摄像机内部和图像输出等。我们在多个 benchmark 数据集上进行了广泛的实验,证明了我们的方法适用于实现多尺度神经场景表示。而不具备稳定性测量的摄像机参数估计在神经场景表示框架中是Equally essential。如果不考虑稳定性测量,则模型多尺度杂质噪压可能会是Counterproductive。

Learning Discriminative Features for Crowd Counting

  • paper_url: http://arxiv.org/abs/2311.04509
  • repo_url: None
  • paper_authors: Yuehai Chen
  • For: 提高人群计数模型在高度拥挤区域中的准确性,尤其是改善对小目标的定位以及前景与背景的区分。
  • Methods: 提出了一种判别特征学习框架,包括掩码特征预测模块(MPM)和监督式像素级对比学习模块(CLM),以提升模型在高密度区域的定位能力以及区分前景与背景的能力。
  • Results: 所提出的模块可即插即用地嵌入现有模型,在人群计数和目标检测等存在密集场景或杂乱环境的计算机视觉任务中提升定位精度。
    Abstract Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small and high-level features extracted by convolutional neural networks are less discriminative to represent small objects. To address these problems, we propose a learning discriminative features framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn about what is present in the masked regions and improving the model's ability to localize objects in high-density regions. The CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The proposed two modules are plug-and-play, incorporating the proposed modules into existing models can potentially boost their performance in these scenarios.
    摘要 群体计数模型在高度拥堵的区域面临两大挑战:一是地方化能力弱和Difficulty in differentiating between foreground and background,导致估计不准确。这是因为在高度拥堵的区域中的对象通常是小型,高级特征提取网络的特征更难以区分小对象。为解决这些问题,我们提议一种学习特征分类框架 для群体计数,该框架包括带mask的特征预测模块(MPM)和supervised像素级冲突学习模块(CLM)。MPM randomly masks特征向量在特征图中,然后重建它们,使模型能够学习masked regions中的内容,提高对象的定位能力。CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background。此外,提议的两个模块可以在多种计算机视觉任务中有效,如人群计数和物体检测, где高度拥堵的环境或拥堵的环境会对准确的定位 pose challenges。提议的两个模块可以plug-and-play,将其 integrate into existing models可能会提高其性能在这些场景中。

NITEC: Versatile Hand-Annotated Eye Contact Dataset for Ego-Vision Interaction

  • paper_url: http://arxiv.org/abs/2311.04505
  • repo_url: https://github.com/thohemp/nitec
  • paper_authors: Thorsten Hempel, Magnus Jung, Ahmed A. Abdelrahman, Ayoub Al-Hamadi
  • for: The paper is written for advancing ego-vision-based eye contact research, specifically in the fields of computer vision, human-computer interaction, and social robotics.
  • methods: The paper presents a hand-annotated eye contact dataset called NITEC, which exceeds existing datasets in size and variety of demographics, social contexts, and lighting conditions.
  • results: The paper demonstrates strong cross-dataset performance of NITEC, emphasizing its effectiveness and adaptability in various scenarios, and makes the dataset publicly available for further exploration and reproducibility.
    Abstract Eye contact is a crucial non-verbal interaction modality and plays an important role in our everyday social life. While humans are very sensitive to eye contact, the capabilities of machines to capture a person's gaze are still mediocre. We tackle this challenge and present NITEC, a hand-annotated eye contact dataset for ego-vision interaction. NITEC exceeds existing datasets for ego-vision eye contact in size and variety of demographics, social contexts, and lighting conditions, making it a valuable resource for advancing ego-vision-based eye contact research. Our extensive evaluations on NITEC demonstrate strong cross-dataset performance, emphasizing its effectiveness and adaptability in various scenarios, that allows seamless utilization to the fields of computer vision, human-computer interaction, and social robotics. We make our NITEC dataset publicly available to foster reproducibility and further exploration in the field of ego-vision interaction. https://github.com/thohemp/nitec
    摘要 眼接触是非常重要的非语言交互方式,在我们每天的社交生活中扮演着重要的角色。然而,机器人的眼接触捕捉能力仍然很差。我们解决这个挑战,并提供了 NITEC,一个手动标注的眼接触数据集 для egovision交互。NITEC 的大小和多样性都超过了现有的 egovision 眼接触数据集,包括不同的人种、社会背景和照明条件,这使得它成为了 egovision 研究中的一个非常有价值的资源。我们对 NITEC 进行了广泛的评估,并证明了它在多个场景中的强大横跨数据集表现,表明它在计算机视觉、人机交互和社交机器人等领域可以无缝应用。我们将 NITEC 数据集公开提供,以便重现和进一步探索 egovision 交互领域。更多信息请参考 https://github.com/thohemp/nitec。

PRED: Pre-training via Semantic Rendering on LiDAR Point Clouds

  • paper_url: http://arxiv.org/abs/2311.04501
  • repo_url: None
  • paper_authors: Hao Yang, Haiyang Wang, Di Dai, Liwei Wang
  • for: The paper is written for outdoor point cloud pre-training, addressing the issue of incompleteness in point clouds and incorporating images for improved performance.
  • methods: The paper proposes a novel image-assisted pre-training framework called PRED, which uses a Birds-Eye-View feature map conditioned semantic rendering and point-wise masking with a high mask ratio (95%) to enhance the model’s performance.
  • results: The paper demonstrates the superiority of PRED over prior point cloud pre-training methods, achieving significant improvements on various large-scale datasets for 3D perception tasks.
    Abstract Pre-training is crucial in 3D-related fields such as autonomous driving where point cloud annotation is costly and challenging. Many recent studies on point cloud pre-training, however, have overlooked the issue of incompleteness, where only a fraction of the points are captured by LiDAR, leading to ambiguity during the training phase. On the other hand, images offer more comprehensive information and richer semantics that can bolster point cloud encoders in addressing the incompleteness issue inherent in point clouds. Yet, incorporating images into point cloud pre-training presents its own challenges due to occlusions, potentially causing misalignments between points and pixels. In this work, we propose PRED, a novel image-assisted pre-training framework for outdoor point clouds in an occlusion-aware manner. The main ingredient of our framework is a Birds-Eye-View (BEV) feature map conditioned semantic rendering, leveraging the semantics of images for supervision through neural rendering. We further enhance our model's performance by incorporating point-wise masking with a high mask ratio (95%). Extensive experiments demonstrate PRED's superiority over prior point cloud pre-training methods, providing significant improvements on various large-scale datasets for 3D perception tasks. Codes will be available at https://github.com/PRED4pc/PRED.
    摘要 <>转换文本到简化中文。<>在3D相关领域,如自动驾驶,前期训练是非常重要的。然而,许多最近的点云预训练研究却忽视了点云的不完整性问题,导致训练阶段的模糊性。相比之下,图像具有更广泛的信息和更丰富的 semantics,可以增强点云编码器对点云不完整性的应对。然而,将图像与点云进行结合存在其自身的挑战,因为可能存在遮挡,导致点云和像素之间的不一致。在这个工作中,我们提出了一种新的图像助け预训练框架,称为PRED(图像协助预训练)。我们的框架的主要组成部分是基于 bird's eye view(BEV)的Semantic Feature Map Conditioned Neural Rendering,利用图像semantics来为预训练提供超vision。此外,我们还增强了我们的模型性能,通过点wise掩蔽(mask ratio为95%)。广泛的实验证明PRED的优越性,在多个大规模数据集上提供了3D感知任务的显著改进。代码将在https://github.com/PRED4pc/PRED上提供。

PersonMAE: Person Re-Identification Pre-Training with Masked AutoEncoders

  • paper_url: http://arxiv.org/abs/2311.04496
  • repo_url: None
  • paper_authors: Hezhen Hu, Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Lu Yuan, Dong Chen, Houqiang Li
  • for: This paper is written for the task of Person Re-identification (ReID), specifically to learn generic feature representation for this task.
  • methods: The paper proposes a simple yet effective pre-training framework called PersonMAE, which involves two core designs in masked autoencoders to better serve the task of Person Re-ID. The framework generates two regions from the given image, corrupts one region with block-wise masking to mimic common occlusion in ReID, and then predicts the whole other region at both pixel level and semantic feature level.
  • results: The paper achieves state-of-the-art performance on four downstream ReID tasks, including supervised (holistic and occluded setting), and unsupervised (UDA and USL setting). Specifically, with the ViT-B backbone, the paper achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, respectively, surpassing the previous state-of-the-art by a large margin of +8.0 mAP and +5.3 mAP, respectively.
    Abstract Pre-training is playing an increasingly important role in learning generic feature representation for Person Re-identification (ReID). We argue that a high-quality ReID representation should have three properties, namely, multi-level awareness, occlusion robustness, and cross-region invariance. To this end, we propose a simple yet effective pre-training framework, namely PersonMAE, which involves two core designs into masked autoencoders to better serve the task of Person Re-ID. 1) PersonMAE generates two regions from the given image with RegionA as the input and \textit{RegionB} as the prediction target. RegionA is corrupted with block-wise masking to mimic common occlusion in ReID and its remaining visible parts are fed into the encoder. 2) Then PersonMAE aims to predict the whole RegionB at both pixel level and semantic feature level. It encourages its pre-trained feature representations with the three properties mentioned above. These properties make PersonMAE compatible with downstream Person ReID tasks, leading to state-of-the-art performance on four downstream ReID tasks, i.e., supervised (holistic and occluded setting), and unsupervised (UDA and USL setting). Notably, on the commonly adopted supervised setting, PersonMAE with ViT-B backbone achieves 79.8% and 69.5% mAP on the MSMT17 and OccDuke datasets, surpassing the previous state-of-the-art by a large margin of +8.0 mAP, and +5.3 mAP, respectively.
    摘要 预训练在人识别(ReID)中扮演着日益重要的角色。我们认为高质量的ReID表示应具有三种性质,即多级意识、遮挡鲁棒性和跨区域一致性。为此,我们提出了一个简单而有效的预训练框架,即PersonMAE,该框架在掩码自编码器中引入两个核心设计:1) PersonMAE将给定图像分成两个区域,RegionA 作为输入,RegionB 作为预测目标。RegionA 会被块状遮盖,以模拟ReID中常见的遮挡,其可见部分送入编码器处理。2) PersonMAE 在像素级和语义特征级同时预测整个 RegionB。这种设计使得 PersonMAE 的预训练特征表示具有上述三种性质,与下游 ReID 任务相匹配,在四个下游 ReID 任务中达到 state-of-the-art 性能,包括有监督(整体和 occluded 设置)和无监督(UDA 和 USL 设置)。特别是,在通常采用的有监督设置下,PersonMAE 搭配 ViT-B 骨干在 MSMT17 和 OccDuke 数据集上分别达到 79.8% 和 69.5% mAP,分别以 +8.0 mAP 和 +5.3 mAP 的较大幅度超越了之前的 state-of-the-art。
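
The block-wise masking used to corrupt RegionA can be sketched as below; the block size, mask ratio, and region split are illustrative assumptions, and the encoder/decoder that would consume these tensors are omitted.

```python
import torch

def blockwise_mask(image: torch.Tensor, block: int = 32, mask_ratio: float = 0.5) -> torch.Tensor:
    """Zero out randomly chosen block x block squares of an image to mimic occlusion.

    image: (C, H, W); H and W are assumed to be multiples of `block` for simplicity.
    """
    C, H, W = image.shape
    keep = torch.rand(H // block, W // block) >= mask_ratio   # True = keep this block visible
    mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
    return image * mask.to(image.dtype)

# Toy pedestrian crop split into two regions: RegionA (top half, corrupted) and RegionB (bottom half).
person = torch.rand(3, 256, 128)
region_a, region_b = person[:, :128, :], person[:, 128:, :]
corrupted_a = blockwise_mask(region_a, block=32, mask_ratio=0.5)
# The encoder would see `corrupted_a`; the decoder is trained to predict `region_b`
# at both the pixel level and a feature level.
print(corrupted_a.shape, region_b.shape)
```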

Non-Rigid Shape Registration via Deep Functional Maps Prior

  • paper_url: http://arxiv.org/abs/2311.04494
  • repo_url: https://github.com/rqhuang88/DFR
  • paper_authors: Puhua Jiang, Mingze Sun, Ruqi Huang
  • for: 非RIGID shape registration without correspondence supervision
  • methods: 使用学习基于的框架,通过高维空间映射学习得到非RIGID shape registration
  • results: 可以处理同时存在大幅度外在与内在变形的形状配准,在多个非刚性点云匹配基准上达到最先进水平,并能为未见过的困难形状对提供高质量的对应关系。
    Abstract In this paper, we propose a learning-based framework for non-rigid shape registration without correspondence supervision. Traditional shape registration techniques typically rely on correspondences induced by extrinsic proximity, therefore can fail in the presence of large intrinsic deformations. Spectral mapping methods overcome this challenge by embedding shapes into, geometric or learned, high-dimensional spaces, where shapes are easier to align. However, due to the dependency on abstract, non-linear embedding schemes, the latter can be vulnerable with respect to perturbed or alien input. In light of this, our framework takes the best of both worlds. Namely, we deform source mesh towards the target point cloud, guided by correspondences induced by high-dimensional embeddings learned from deep functional maps (DFM). In particular, the correspondences are dynamically updated according to the intermediate registrations and filtered by consistency prior, which prominently robustify the overall pipeline. Moreover, in order to alleviate the requirement of extrinsically aligned input, we train an orientation regressor on a set of aligned synthetic shapes independent of the training shapes for DFM. Empirical results show that, with as few as dozens of training shapes of limited variability, our pipeline achieves state-of-the-art results on several benchmarks of non-rigid point cloud matching, but also delivers high-quality correspondences between unseen challenging shape pairs that undergo both significant extrinsic and intrinsic deformations, in which case neither traditional registration methods nor intrinsic methods work. The code is available at https://github.com/rqhuang88/DFR.
    摘要 在这篇论文中,我们提出了一种学习基于的非定制形状匹配框架,不需要对匹配得到超vision。传统的形状匹配技术通常靠 extrinsic proximity 引起的匹配,因此在大规模内部扭变的情况下失败。spectral mapping 方法可以将形状嵌入高维空间中,使形状更容易匹配。然而,由于依赖于抽象的非线性嵌入方案,后者可能对输入数据进行敏感操作。为了解决这个问题,我们的框架结合了两者的优点。具体来说,我们将源网格弯曲到目标点云,以高维空间中学习的深度函数映射(DFM)中的匹配为导向。特别是,匹配在中间registrations 更新和consistency prior 的筛选下进行动态更新,以提高整体的稳定性。此外,为了避免外部对齐的需求,我们在独立于训练形状的synthetic shapes 上训练了一个旋转回归器。实验结果表明,只需几十个有限的训练形状,我们的管道可以在多个非定制点云匹配 benchmark 上达到领先的Result,并且能够在未看到的挑战性形状对应中提供高质量的匹配。代码可以在 https://github.com/rqhuang88/DFR 上获取。

All-Optical Phase Conjugation Using Diffractive Wavefront Processing

  • paper_url: http://arxiv.org/abs/2311.04473
  • repo_url: None
  • paper_authors: Che-Yung Shen, Jingxi Li, Tianyi Gan, Mona Jarrahi, Aydogan Ozcan
  • for: 用于Counteracting wavefront distortions,包括Imaging和Beam focusing。
  • methods: 使用Deep learning优化Passive diffractive layers,实现All-optical phase conjugation操作。
  • results: 通过实验验证,Diffractive wavefront processor可以成功地对phas aberrations进行OPC操作,并且可以在不同的电磁波谱中实现Cost-effective wavefront engineering解决方案。
    Abstract Optical phase conjugation (OPC) is a nonlinear technique used for counteracting wavefront distortions, with various applications ranging from imaging to beam focusing. Here, we present the design of a diffractive wavefront processor to approximate all-optical phase conjugation operation for input fields with phase aberrations. Leveraging deep learning, a set of passive diffractive layers was optimized to all-optically process an arbitrary phase-aberrated coherent field from an input aperture, producing an output field with a phase distribution that is the conjugate of the input wave. We experimentally validated the efficacy of this wavefront processor by 3D fabricating diffractive layers trained using deep learning and performing OPC on phase distortions never seen by the diffractive processor during its training. Employing terahertz radiation, our physical diffractive processor successfully performed the OPC task through a shallow spatially-engineered volume that axially spans tens of wavelengths. In addition to this transmissive OPC configuration, we also created a diffractive phase-conjugate mirror by combining deep learning-optimized diffractive layers with a standard mirror. Given its compact, passive and scalable nature, our diffractive wavefront processor can be used for diverse OPC-related applications, e.g., turbidity suppression and aberration correction, and is also adaptable to different parts of the electromagnetic spectrum, especially those where cost-effective wavefront engineering solutions do not exist.
    摘要 光学相位共轭(OPC)是一种非线性技术,用于校正波前畸变,应用范围从成像到光束聚焦。在这里,我们设计了一种衍射波前处理器,用于对带有相位像差的输入场近似实现全光学相位共轭操作。通过深度学习,我们优化了一组无源衍射层,使其能够对来自输入孔径的任意相位畸变相干场进行全光学处理,产生相位分布为输入波共轭的输出场。我们通过3D打印经深度学习训练的衍射层,并对处理器在训练中从未见过的相位畸变执行OPC,实验验证了该波前处理器的有效性。利用太赫兹辐射,我们的物理衍射处理器通过一个轴向仅跨越数十个波长的浅层空间工程结构成功完成了OPC任务。除这种透射式OPC配置外,我们还将深度学习优化的衍射层与标准反射镜相结合,构建了衍射相位共轭镜。由于其紧凑、无源且可扩展的特性,我们的衍射波前处理器可用于多种与OPC相关的应用,例如浊度抑制和像差校正,并可适用于电磁波谱的不同波段,特别是那些尚缺乏低成本波前工程解决方案的波段。
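
Independent of the diffractive implementation, the phase-conjugation operation itself is easy to illustrate: the ideal OPC output has the same amplitude as the input field but the negated phase, so sending it back through the same distortion cancels the aberration. The snippet below demonstrates this with a synthetic aberrated field; it does not model the diffractive layers or their training.

```python
import numpy as np

# A toy aberrated coherent field: unit amplitude with a random smooth phase distortion.
rng = np.random.default_rng(0)
n = 64
phase_aberration = np.cumsum(np.cumsum(rng.normal(scale=0.02, size=(n, n)), axis=0), axis=1)
field_in = np.exp(1j * phase_aberration)

# Ideal optical phase conjugation: same amplitude, negated phase (complex conjugate).
field_opc = np.conj(field_in)

# Passing the conjugated field back through the same distortion cancels the aberration.
after_round_trip = field_opc * np.exp(1j * phase_aberration)
residual_phase = np.angle(after_round_trip)
print("max residual phase error (rad):", float(np.abs(residual_phase).max()))  # ~0
```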

Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning

  • paper_url: http://arxiv.org/abs/2311.04464
  • repo_url: None
  • paper_authors: Yao Zhu, Yuefeng Chen, Wei Wang, Xiaofeng Mao, Xiu Yan, Yue Wang, Zhigang Li, Wang lu, Jindong Wang, Xiangyang Ji
  • for: 本研究的目的是提高深度神经网络在具有限制样本数的低资源场景中表现,通过修改CLIP预训练模型的特定部分来适应不同的少shot任务。
  • methods: 我们对CLIP预训练模型视觉编码器中的注意力池化层进行微调,使其关注不同少样本(few-shot)任务各自的任务相关语义。训练过程中仅调整该注意力池化层的参数;推理阶段,我们对微调后与原始注意力池化层所汇聚的特征进行残差融合,以同时利用少样本知识和预训练CLIP的先验知识。
  • results: 我们的方法可以增强传统的少shot CLIP,并且与现有的adapter方法(SAFE-A)兼容。我们的方法可以更好地适应不同的少shot任务,并且在测试阶段的性能得到了提升。
    Abstract Learning generalized representations from limited training samples is crucial for applying deep neural networks in low-resource scenarios. Recently, methods based on Contrastive Language-Image Pre-training (CLIP) have exhibited promising performance in few-shot adaptation tasks. To avoid catastrophic forgetting and overfitting caused by few-shot fine-tuning, existing works usually freeze the parameters of CLIP pre-trained on large-scale datasets, overlooking the possibility that some parameters might not be suitable for downstream tasks. To this end, we revisit CLIP's visual encoder with a specific focus on its distinctive attention pooling layer, which performs a spatial weighted-sum of the dense feature maps. Given that dense feature maps contain meaningful semantic information, and different semantics hold varying importance for diverse downstream tasks (such as prioritizing semantics like ears and eyes in pet classification tasks rather than side mirrors), using the same weighted-sum operation for dense features across different few-shot tasks might not be appropriate. Hence, we propose fine-tuning the parameters of the attention pooling layer during the training process to encourage the model to focus on task-specific semantics. In the inference process, we perform residual blending between the features pooled by the fine-tuned and the original attention pooling layers to incorporate both the few-shot knowledge and the pre-trained CLIP's prior knowledge. We term this method as Semantic-Aware FinE-tuning (SAFE). SAFE is effective in enhancing the conventional few-shot CLIP and is compatible with the existing adapter approach (termed SAFE-A).
    摘要 学习通用表示法从有限的训练样本中学习是深度神经网络在低资源场景中应用的关键。最近,基于对比语言图像预训练(CLIP)的方法在几架适应任务中表现出色。为了避免几架适应过拟合和忘记,现有的工作通常将CLIP预训练在大规模数据集上的参数冻结,忽略了可能一些参数不适合下游任务。为此,我们重新审视CLIP的视觉Encoder,尤其是其独特的注意力池化层,该层通过在权重总和中进行空间权重的积分来实现。由于稠密特征图包含有意义的语义信息,而不同语义在不同下游任务中具有不同的重要性(例如在宠物分类任务中更重视耳朵和眼睛而非侧镜),因此在不同几架适应任务中使用同样的权重积分操作可能不合适。因此,我们提议在训练过程中细化注意力池化层的参数,以便让模型强调任务特有的语义。在推理过程中,我们通过将细化后的注意力池化层和原始注意力池化层的特征进行差分融合来 incorporate 两者的知识。我们称这种方法为 Semantic-Aware FinE-tuning(SAFE)。SAFE 有效地提高了传统的几架 CLIP,并且与现有的适配器方法(称为 SAFE-A)兼容。
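
The inference-time residual blending described above can be sketched as follows: features pooled by the fine-tuned attention-pooling layer are mixed with features pooled by the frozen original layer. The minimal pooling module and the blending weight here are placeholders, not CLIP's actual attention pooling or the paper's hyper-parameters.

```python
import copy
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Minimal stand-in for CLIP's attention pooling: a learned spatial weighted sum."""
    def __init__(self, tokens: int):
        super().__init__()
        self.weight_logits = nn.Parameter(torch.zeros(tokens))

    def forward(self, dense_feats):                 # dense_feats: (B, tokens, dim)
        w = torch.softmax(self.weight_logits, dim=0)
        return (dense_feats * w[None, :, None]).sum(dim=1)

dim, tokens = 512, 49
frozen_pool = AttentionPool(tokens)                 # weights from the pre-trained model, kept fixed
tuned_pool = copy.deepcopy(frozen_pool)             # the copy fine-tuned on the few-shot task

# ... few-shot training would update only `tuned_pool` (and possibly a classifier) ...

dense = torch.randn(4, tokens, dim)                 # dense feature map from the visual encoder
alpha = 0.5                                         # blending weight (a hyper-parameter here)
with torch.no_grad():
    blended = alpha * tuned_pool(dense) + (1 - alpha) * frozen_pool(dense)
print(blended.shape)                                # (4, 512) features combining both sources
```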

Retargeting video with an end-to-end framework

  • paper_url: http://arxiv.org/abs/2311.04458
  • repo_url: None
  • paper_authors: Thi-Ngoc-Hanh Le, HuiGuang Huang, Yi-Ru Chen, Tong-Yee Lee
  • For: 这个研究旨在为 Computer Graphics 应用程序提供影片重定向功能,以增强用户观赏体验。* Methods: 本研究使用了一个终端到终端的 RETVI 方法,具有两个模组:内容特征分析器 (CFA) 和适应型扩展估计器 (ADE),以解决旧有方法的计算瓶颈和限制。* Results: 实验和评估结果显示,我们的系统在质量和运行时间上具有明显的优势,超越了先前的工作。更多结果可以在 http://graphics.csie.ncku.edu.tw/RETVI 网站上获取。
    Abstract Video holds significance in computer graphics applications. Because of the heterogeneous of digital devices, retargeting videos becomes an essential function to enhance user viewing experience in such applications. In the research of video retargeting, preserving the relevant visual content in videos, avoiding flicking, and processing time are the vital challenges. Extending image retargeting techniques to the video domain is challenging due to the high running time. Prior work of video retargeting mainly utilizes time-consuming preprocessing to analyze frames. Plus, being tolerant of different video content, avoiding important objects from shrinking, and the ability to play with arbitrary ratios are the limitations that need to be resolved in these systems requiring investigation. In this paper, we present an end-to-end RETVI method to retarget videos to arbitrary aspect ratios. We eliminate the computational bottleneck in the conventional approaches by designing RETVI with two modules, content feature analyzer (CFA) and adaptive deforming estimator (ADE). The extensive experiments and evaluations show that our system outperforms previous work in quality and running time. Visit our project website for more results at http://graphics.csie.ncku.edu.tw/RETVI.
    摘要 视频具有计算机图形应用中的重要意义。由于数字设备的多样性,对视频进行重定向变得非常重要,以提高用户视觉体验。在研究视频重定向方面,保持视频中相关的视觉内容,避免抖动、处理时间和视频内容的多样性是核心挑战。由于视频重定向技术的运行时间较长,将图像重定向技术应用于视频领域是挑战。现有的视频重定向方法主要通过时间consuming的预处理分析帧来解决这些挑战。此外,保持重要对象不减小、避免抖动和处理时间也是需要解决的问题。在这篇论文中,我们提出了一种终端到终端的视频重定向方法(RETVI),以解决以上问题。我们通过设计内容特征分析器(CFA)和自适应扭转估计器(ADE)两个模块来消除传统方法的计算瓶颈。我们的系统在质量和运行时间方面胜过先前的工作。更多结果可以在我们项目网站上找到:http://graphics.csie.ncku.edu.tw/RETVI。

SS-MAE: Spatial-Spectral Masked Auto-Encoder for Multi-Source Remote Sensing Image Classification

  • paper_url: http://arxiv.org/abs/2311.04442
  • repo_url: None
  • paper_authors: Junyan Lin, Feng Gao, Xiaocheng Shi, Junyu Dong, Qian Du
  • for: 该研究旨在提出一种空间-光谱掩码自编码器(SS-MAE),用于高光谱与LiDAR/SAR等多源遥感数据的联合分类任务。
  • methods: 该模型包括空间分支和光谱分支:空间分支随机掩码图像块并重建缺失的像素,光谱分支随机掩码光谱通道并重建缺失的通道;训练阶段还加入两个轻量级CNN以补充局部特征。
  • results: 实验结果显示,与多个state-of-the-art基线相比,SS-MAE在三个公开数据集上表现更优,能够充分利用输入数据的空间和光谱表示。
    Abstract Masked image modeling (MIM) is a highly popular and effective self-supervised learning method for image understanding. Existing MIM-based methods mostly focus on spatial feature modeling, neglecting spectral feature modeling. Meanwhile, existing MIM-based methods use Transformer for feature extraction, some local or high-frequency information may get lost. To this end, we propose a spatial-spectral masked auto-encoder (SS-MAE) for HSI and LiDAR/SAR data joint classification. Specifically, SS-MAE consists of a spatial-wise branch and a spectral-wise branch. The spatial-wise branch masks random patches and reconstructs missing pixels, while the spectral-wise branch masks random spectral channels and reconstructs missing channels. Our SS-MAE fully exploits the spatial and spectral representations of the input data. Furthermore, to complement local features in the training stage, we add two lightweight CNNs for feature extraction. Both global and local features are taken into account for feature modeling. To demonstrate the effectiveness of the proposed SS-MAE, we conduct extensive experiments on three publicly available datasets. Extensive experiments on three multi-source datasets verify the superiority of our SS-MAE compared with several state-of-the-art baselines. The source codes are available at \url{https://github.com/summitgao/SS-MAE}.
    摘要 高清晰自适应模型(MIM)是一种非常受欢迎且有效的无监督学习方法,用于图像理解。现有的MIM基于方法主要关注空间特征模型化,忽略spectral特征模型化。另外,现有的MIM基于方法使用Transformer来EXTRACT特征,可能会导致一些本地或高频信息丢失。为了解决这个问题,我们提议一种具有空间特征和spectral特征的掩码自适应编码器(SS-MAE),用于混合高光谱和LiDAR/SAR数据的分类。具体来说,SS-MAE包括一个空间特征分支和一个spectral特征分支。空间特征分支掩码随机质点并重建缺失像素,而spectral特征分支掩码随机spectral通道并重建缺失通道。我们的SS-MAE完全利用输入数据的空间和spectral表示。此外,为了补偿本地特征在训练阶段的不足,我们添加了两个轻量级CNN来EXTRACT特征。我们的方法同时利用全球特征和本地特征来模型特征。为了证明我们提议的SS-MAE的效果,我们在三个公共可用的数据集上进行了广泛的实验。我们的实验结果表明,SS-MAE在多源数据集上的表现明显超过了一些状态的基eline。我们的代码可以在github上找到:https://github.com/summitgao/SS-MAE。
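
The two masking operations at the core of SS-MAE can be sketched as below: the spatial branch masks random patches of the hyperspectral cube, while the spectral branch masks random channels; patch size, mask ratios, and cube dimensions are assumptions for illustration.

```python
import numpy as np

def spatial_patch_mask(cube: np.ndarray, patch: int = 4, ratio: float = 0.5, seed: int = 0):
    """Zero out random patch x patch spatial blocks of an HSI cube of shape (H, W, C)."""
    rng = np.random.default_rng(seed)
    H, W, _ = cube.shape
    keep = rng.random((H // patch, W // patch)) >= ratio
    mask = np.kron(keep, np.ones((patch, patch), dtype=bool))
    return cube * mask[:H, :W, None], mask

def spectral_channel_mask(cube: np.ndarray, ratio: float = 0.5, seed: int = 0):
    """Zero out a random subset of spectral channels of an HSI cube of shape (H, W, C)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(cube.shape[-1]) >= ratio
    return cube * keep[None, None, :], keep

hsi = np.random.rand(32, 32, 100)                   # toy hyperspectral patch, 100 bands
spatial_in, spatial_kept = spatial_patch_mask(hsi)
spectral_in, spectral_kept = spectral_channel_mask(hsi)
# Each branch would encode its masked input and be trained to reconstruct the missing
# pixels (spatial branch) or the missing channels (spectral branch).
print(spatial_kept.mean(), spectral_kept.mean())
```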

Blurry Video Compression: A Trade-off between Visual Enhancement and Data Compression

  • paper_url: http://arxiv.org/abs/2311.04430
  • repo_url: None
  • paper_authors: Dawit Mureja Argaw, Junsik Kim, In So Kweon
  • for: 本研究旨在提高视频压缩(VC)方法的 universality,使其在不同的时间前提下能够维持视频质量。
  • methods: 本研究使用了一种基于可让渡的最小化最大化优化方法,通过利用视频压缩和图像增强之间的自然质量补做,提高视频质量。
  • results: 对多个标准数据集进行了广泛的实验,证明了我们的方法在比较于现有的VC方法之上具有更高的效果。
    Abstract Existing video compression (VC) methods primarily aim to reduce the spatial and temporal redundancies between consecutive frames in a video while preserving its quality. In this regard, previous works have achieved remarkable results on videos acquired under specific settings such as instant (known) exposure time and shutter speed which often result in sharp videos. However, when these methods are evaluated on videos captured under different temporal priors, which lead to degradations like motion blur and low frame rate, they fail to maintain the quality of the contents. In this work, we tackle the VC problem in a general scenario where a given video can be blurry due to predefined camera settings or dynamics in the scene. By exploiting the natural trade-off between visual enhancement and data compression, we formulate VC as a min-max optimization problem and propose an effective framework and training strategy to tackle the problem. Extensive experimental results on several benchmark datasets confirm the effectiveness of our method compared to several state-of-the-art VC approaches.
    摘要 现有的视频压缩(VC)方法主要目标是减少视频中的空间和时间重复性,以保持视频质量。在这种情况下,先前的工作已经实现了在特定的曝光时间和闭合速度下拍摄的视频中获得了出色的结果。然而,当这些方法应用于不同的时间优先顺序下拍摄的视频时,它们无法保持视频内容的质量。在这种情况下,我们解决了视频压缩问题,充分利用了视频增强和数据压缩之间的自然负荷关系,并提出了一种有效的框架和训练策略。对多个标准数据集进行了广泛的实验,证明了我们的方法与许多现有的VC方法相比,有更高的效果。

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.04942
  • repo_url: https://github.com/al3x-o-o-hung/csam
  • paper_authors: Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao, Xiaoxi Du, Kaifeng Pang, Qi Miao, Steven S. Raman, Demetri Terzopoulos, Kyunghyun Sung
  • for: This paper aims to address the problem of anisotropic volumetric medical data in deep learning-based segmentation, specifically in magnetic resonance imaging (MRI) data.
  • methods: The proposed method is a 2.5D approach that combines 2D convolution with volumetric information, using a Cross-Slice Attention Module (CSAM) to capture information across all slices in the volume. The CSAM module applies semantic, positional, and slice attention on deep feature maps at different scales.
  • results: The proposed method was extensively tested using different network architectures and tasks, and the results demonstrate the usefulness and generalizability of CSAM. The code for the proposed method is available at https://github.com/aL3x-O-o-Hung/CSAM.
    Abstract A large portion of volumetric medical data, especially magnetic resonance imaging (MRI) data, is anisotropic, as the through-plane resolution is typically much lower than the in-plane resolution. Both 3D and purely 2D deep learning-based segmentation methods are deficient in dealing with such volumetric data since the performance of 3D methods suffers when confronting anisotropic data, and 2D methods disregard crucial volumetric information. Insufficient work has been done on 2.5D methods, in which 2D convolution is mainly used in concert with volumetric information. These models focus on learning the relationship across slices, but typically have many parameters to train. We offer a Cross-Slice Attention Module (CSAM) with minimal trainable parameters, which captures information across all the slices in the volume by applying semantic, positional, and slice attention on deep feature maps at different scales. Our extensive experiments using different network architectures and tasks demonstrate the usefulness and generalizability of CSAM. Associated code is available at https://github.com/aL3x-O-o-Hung/CSAM.
    摘要 大部分体积医学数据,尤其是磁共振成像(MRI)数据,具有各向异性:层间(穿层)分辨率通常远低于层内分辨率。纯三维和纯二维的深度学习分割方法都难以处理这类体积数据:三维方法在面对各向异性数据时性能下降,而二维方法则忽略了关键的体积信息。关于2.5维方法(主要将二维卷积与体积信息结合使用)的研究仍然不足,这些模型侧重于学习切片之间的关系,但通常需要训练大量参数。我们提出了跨切片注意力模块(CSAM),其可训练参数极少,通过在不同尺度的深层特征图上施加语义、位置和切片注意力,捕捉整个体积中所有切片之间的信息。我们使用不同网络架构和任务进行的大量实验证明了 CSAM 的有用性和泛化能力。相关代码可在 GitHub 上找到:https://github.com/aL3x-O-o-Hung/CSAM。
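
A minimal cross-slice attention layer in the spirit of CSAM is sketched below: each slice of an anisotropic volume is summarized into a token, the tokens attend to one another, and the mixed context is injected back into every slice. The real module additionally applies semantic, positional, and slice attention at multiple scales, which this toy layer does not reproduce.

```python
import torch
import torch.nn as nn

class SliceAttention(nn.Module):
    """Self-attention across the slice axis of per-slice feature maps (B, S, C, H, W)."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feats):
        B, S, C, H, W = feats.shape
        # Summarize each slice by global average pooling -> one token per slice.
        tokens = feats.mean(dim=(3, 4))                      # (B, S, C)
        mixed, _ = self.attn(tokens, tokens, tokens)         # slices exchange information
        # Re-inject the cross-slice context into every spatial location of its slice.
        return feats + mixed[:, :, :, None, None]

csam_like = SliceAttention(channels=64)
volume_feats = torch.randn(1, 20, 64, 96, 96)                # 20 thick slices, 2D-encoded features
out = csam_like(volume_feats)
print(out.shape)
```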

Learning the What and How of Annotation in Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2311.04414
  • repo_url: https://github.com/thanosDelatolas/eva-vos
  • paper_authors: Thanos Delatolas, Vicky Kalogeiton, Dim P. Papadopoulos
  • for: 提高视频对象分割(VOS)模型训练效率,减少人工标注成本。
  • methods: 提出了一种人工 Loop(HITL)注意力机制,通过预测哪些帧(”What”)和哪种注意力类型(”How”)进行标注,以提高标注效率。
  • results: 对MOSE和DAVIS数据集进行实验,比较了EVA-VOS和标准 annotating 方法,结果表明:EVA-VOS可以在3.5倍快的速度达到与人类一致的准确率;选择帧performanced状态的方法得到了状元性的表现;EVA-VOS在标注时间方面具有显著的提升。
    Abstract Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de-facto traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame. This annotation process, however, is tedious and time-consuming. To reduce this annotation cost, in this paper, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce an agent that predicts iteratively both which frame ("What") to annotate and which annotation type ("How") to use. Then, the annotator annotates only the selected frame that is used to update a VOS module, leading to significant gains in annotation time. We conduct experiments on the MOSE and the DAVIS datasets and we show that: (a) EVA-VOS leads to masks with accuracy close to the human agreement 3.5x faster than the standard way of annotating videos; (b) our frame selection achieves state-of-the-art performance; (c) EVA-VOS yields significant performance gains in terms of annotation time compared to all other methods and baselines.
    摘要 视频对象分割(VOS)是许多应用程序中的关键技术,从视频编辑到视频数据生成。训练VOS模型需要大量的手动标注视频。传统的Annotation Way是要求人类 manually draw detailed segmentation masks on the target objects at each video frame。然而,这个Annotation process是费时的和费力的。在这篇论文中,我们提出了EVA-VOS,一个人类在Loop的标注框架 для视频对象分割。与传统方法不同,我们引入了一个代理人,该代理人预测iteratively Which frame ("What") to annotate和Which annotation type ("How") to use。然后,标注者仅标注选择的帧,并将其用于更新VOS模块,从而实现了显著的标注时间缩短。我们在MOSE和DAVIS datasets上进行了实验,并显示了以下结果:(a)EVA-VOS可以在3.5倍 faster than the standard way of annotating videos 实现masks with accuracy close to human agreement。(b)我们的帧选择性能在State-of-the-art level。(c)EVA-VOS比所有方法和基准值具有显著的标注时间缩短。
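
The human-in-the-loop annotation loop can be sketched as below: at each round the frame with the lowest predicted mask quality is selected ("What"), an annotation is requested ("How" is fixed to a full mask here), and the model would then be updated. The quality estimator, budget, and update step are placeholders rather than the paper's learned agent.

```python
import random

def predicted_mask_quality(frame_id: int, annotated: set) -> float:
    """Placeholder quality estimator: frames near already-annotated ones look better."""
    dist = min((abs(frame_id - a) for a in annotated), default=10)
    return max(0.0, 1.0 - 0.1 * dist) + random.uniform(0.0, 0.05)

def annotation_loop(num_frames: int = 60, budget: int = 5, seed: int = 0):
    random.seed(seed)
    annotated = set()
    for round_idx in range(budget):
        # "What": pick the frame where the current VOS module is expected to be worst.
        qualities = {f: predicted_mask_quality(f, annotated) for f in range(num_frames)}
        worst = min(qualities, key=qualities.get)
        # "How": a second decision would choose mask vs. clicks vs. box; here we always
        # request a full mask, after which the VOS module would be retrained/updated.
        annotated.add(worst)
        print(f"round {round_idx}: annotate frame {worst} "
              f"(predicted quality {qualities[worst]:.2f})")
    return annotated

annotation_loop()
```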