paper_authors: Xin Yu, Qi Yang, Yucheng Tang, Riqiang Gao, Shunxing Bao, Leon Y. Cai, Ho Hin Lee, Yuankai Huo, Ann Zenobia Moore, Luigi Ferrucci, Bennett A. Landman
results: Our experiments show that C-SliceGen generates high-quality images that are realistic and similar to the target. We further show that the method reduces slice positional variance in abdominal CT, evaluated on 1033 participants from the Baltimore Longitudinal Study of Aging (BLSA) dataset.
Abstract
Two-dimensional single-slice abdominal computed tomography (CT) provides a detailed tissue map with high resolution allowing quantitative characterization of relationships between health conditions and aging. However, longitudinal analysis of body composition changes using these scans is difficult due to positional variation between slices acquired in different years, which leads to different organs/tissues being captured. To address this issue, we propose C-SliceGen, which takes an arbitrary axial slice in the abdominal region as a condition and generates a pre-defined vertebral level slice by estimating structural changes in the latent space. Our experiments on 2608 volumetric CT data from two in-house datasets and 50 subjects from the 2015 Multi-Atlas Abdomen Labeling (BTCV) Challenge dataset demonstrate that our model can generate high-quality images that are realistic and similar. We further evaluate our method's capability to harmonize longitudinal positional variation on 1033 subjects from the Baltimore Longitudinal Study of Aging (BLSA) dataset, which contains longitudinal single abdominal slices, and confirm that our method can harmonize the slice positional variance in terms of visceral fat area. This approach provides a promising direction for mapping slices from different vertebral levels to a target slice and reducing positional variance for single-slice longitudinal analysis. The source code is available at: https://github.com/MASILab/C-SliceGen.
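As a rough illustration of the conditional-generation idea above (an arbitrary axial slice conditions the synthesis of a pre-defined vertebral-level slice through a latent space), here is a minimal conditional-VAE sketch in PyTorch; the layer sizes, image resolution, and loss weighting are assumptions for illustration, not the authors' C-SliceGen architecture.

```python
# Minimal conditional-generation sketch (NOT the authors' C-SliceGen architecture).
# Shapes, layer sizes, and loss weights below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondSliceVAE(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # Encoder sees the conditioning slice and the target slice stacked as 2 channels.
        self.enc = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.to_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.to_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        # Decoder combines the latent code with features of the conditioning slice.
        self.cond_feat = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64 + latent_dim, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),
        )

    def forward(self, cond, target):
        h = self.enc(torch.cat([cond, target], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        c = self.cond_feat(cond)                                  # (B, 64, 16, 16)
        z_map = z[:, :, None, None].expand(-1, -1, c.shape[2], c.shape[3])
        recon = self.dec(torch.cat([c, z_map], dim=1))
        return recon, mu, logvar

model = CondSliceVAE()
cond = torch.randn(2, 1, 64, 64)     # arbitrary axial slice (condition)
target = torch.randn(2, 1, 64, 64)   # pre-defined vertebral-level slice
recon, mu, logvar = model(cond, target)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = F.mse_loss(recon, target) + 1e-3 * kl
loss.backward()
```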
A Critical Analysis of Internal Reliability for Uncertainty Quantification of Dense Image Matching in Multi-View Stereo
results: The study shows that several internal matching metrics can be used to assess the internal reliability of photogrammetrically derived point clouds in a multi-view setup, which is especially useful when LiDAR reference data are not available.
Abstract
Nowadays, photogrammetrically derived point clouds are widely used in many civilian applications due to their low cost and flexibility in acquisition. Typically, photogrammetric point clouds are assessed through reference data such as LiDAR point clouds. However, when reference data are not available, the assessment of photogrammetric point clouds may be challenging. Since these point clouds are algorithmically derived, their accuracies and precisions vary greatly with the camera networks, scene complexity, and dense image matching (DIM) algorithms, and there is no standard error metric to determine per-point errors. The theory of internal reliability of camera networks has been well studied through first-order error estimation of Bundle Adjustment (BA), which is used to understand the errors of 3D points assuming known measurement errors. However, the measurement errors of the DIM algorithms are intricate to the extent that every single point may have its error function determined by factors such as pixel intensity, texture entropy, and surface smoothness. Despite the complexity, there exist a few common metrics that may aid the process of estimating the posterior reliability of the derived points, especially in a multi-view stereo (MVS) setup when redundancies are present. In this paper, by using an aerial oblique photogrammetric block with LiDAR reference data, we analyze several internal matching metrics within a common MVS framework, including statistics on ray convergence, intersection angles, DIM energy, etc.
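To make one of the metrics named above concrete, the following sketch computes the pairwise intersection (ray convergence) angles for a triangulated point; the camera centers and point coordinates are invented for the example, and the real analysis aggregates such statistics over an entire photogrammetric block.

```python
# Illustrative computation of one internal metric mentioned above: the pairwise
# intersection (convergence) angles of the viewing rays that triangulate a 3D point.
# Camera centers and the point are made up for the example.
import numpy as np
import itertools

def intersection_angles(point, camera_centers):
    """Return all pairwise angles (degrees) between rays from cameras to the point."""
    rays = [(point - c) / np.linalg.norm(point - c) for c in camera_centers]
    angles = []
    for r1, r2 in itertools.combinations(rays, 2):
        cos_a = np.clip(np.dot(r1, r2), -1.0, 1.0)
        angles.append(np.degrees(np.arccos(cos_a)))
    return np.array(angles)

point = np.array([2.0, 1.0, 10.0])                 # triangulated 3D point
centers = [np.array([0.0, 0.0, 0.0]),
           np.array([1.5, 0.0, 0.0]),
           np.array([3.0, 0.5, 0.0])]              # camera projection centers
angles = intersection_angles(point, centers)
print("pairwise intersection angles (deg):", np.round(angles, 2))
print("max convergence angle (deg):", angles.max())  # small angles => weak geometry
```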
MOVIN: Real-time Motion Capture using a Single LiDAR
paper_authors: Deok-Kyeong Jang, Dongseok Yang, Deok-Yun Jang, Byeoli Choi, Taeil Jin, Sung-Hee Lee
for: This paper addresses the problem that existing full-body tracking systems are too expensive, require special skills to operate, or are uncomfortable to wear, by providing a data-driven generative method for real-time full-body tracking.
methods: The method acquires 3D point cloud data from a single LiDAR sensor and uses an autoregressive conditional variational autoencoder (CVAE) to learn the distribution of full-body poses.
results: The method accurately predicts the performer's 3D global information and local joint details while accounting for temporally coherent movements.
Abstract
Recent advancements in technology have brought forth new forms of interactive applications, such as the social metaverse, where end users interact with each other through their virtual avatars. In such applications, precise full-body tracking is essential for an immersive experience and a sense of embodiment with the virtual avatar. However, current motion capture systems are not easily accessible to end users due to their high cost, the requirement for special skills to operate them, or the discomfort associated with wearable devices. In this paper, we present MOVIN, a data-driven generative method for real-time motion capture with global tracking, using a single LiDAR sensor. Our autoregressive conditional variational autoencoder (CVAE) model learns the distribution of pose variations conditioned on the given 3D point cloud from LiDAR. As a central factor for high-accuracy motion capture, we propose a novel feature encoder to learn the correlation between the historical 3D point cloud data and global, local pose features, resulting in effective learning of the pose prior. Global pose features include root translation, rotation, and foot contacts, while local features comprise joint positions and rotations. Subsequently, a pose generator takes into account the sampled latent variable along with the features from the previous frame to generate a plausible current pose. Our framework accurately predicts the performer's 3D global information and local joint details while effectively considering temporally coherent movements across frames. We demonstrate the effectiveness of our architecture through quantitative and qualitative evaluations, comparing it against state-of-the-art methods. Additionally, we implement a real-time application to showcase our method in real-world scenarios. The MOVIN dataset is available at \url{https://movin3d.github.io/movin_pg2023/}.
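A minimal sketch of the autoregressive, point-cloud-conditioned generation step described above is given below; the PointNet-style encoder, joint count, and dimensions are placeholders rather than the MOVIN architecture.

```python
# Rough sketch of an autoregressive, point-cloud-conditioned pose generator
# (not the authors' MOVIN network; all dimensions and module names are assumed).
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style encoder: per-point MLP followed by a max-pool over points."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, out_dim))
    def forward(self, pts):                       # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values    # (B, out_dim)

class PoseGenerator(nn.Module):
    def __init__(self, n_joints=24, latent_dim=32, feat_dim=128):
        super().__init__()
        pose_dim = n_joints * 3                   # joint positions only, for simplicity
        self.net = nn.Sequential(
            nn.Linear(latent_dim + feat_dim + pose_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim),
        )
    def forward(self, z, cloud_feat, prev_pose):
        return self.net(torch.cat([z, cloud_feat, prev_pose], dim=-1))

enc, gen = PointCloudEncoder(), PoseGenerator()
points = torch.randn(1, 256, 3)          # one LiDAR frame (subsampled)
prev_pose = torch.zeros(1, 24 * 3)       # pose predicted at the previous frame
z = torch.randn(1, 32)                   # latent sample (the CVAE prior in the paper)
pose = gen(z, enc(points), prev_pose)    # current-frame pose prediction
print(pose.shape)                        # torch.Size([1, 72])
```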
Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention
results: The study shows that the proposed method mitigates the bias in text-video retrieval models and achieves leading results on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets.
Abstract
Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features on action recognition or temporal object co-occurrences on video scene graph generation could induce spurious correlations. In this work, we present a unique and systematic study of a temporal bias due to frame length discrepancy between training and test sets of trimmed video clips, which is the first such attempt for a text-video retrieval task, to the best of our knowledge. We first hypothesise and verify the bias on how it would affect the model illustrated with a baseline study. Then, we propose a causal debiasing approach and perform extensive experiments and ablation studies on the Epic-Kitchens-100, YouCook2, and MSR-VTT datasets. Our model overpasses the baseline and SOTA on nDCG, a semantic-relevancy-focused evaluation metric which proves the bias is mitigated, as well as on the other conventional metrics.
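Since the reported gains are measured with nDCG, a minimal sketch of that metric is included below; the relevance grades and the retrieved ranking are invented for illustration.

```python
# Minimal nDCG computation, the semantic-relevancy-focused metric referenced above.
# The relevance grades and the retrieved ranking below are invented for illustration.
import numpy as np

def dcg(relevances):
    ranks = np.arange(1, len(relevances) + 1)
    return np.sum((2.0 ** np.asarray(relevances) - 1.0) / np.log2(ranks + 1))

def ndcg(retrieved_rel, k=None):
    rel = np.asarray(retrieved_rel, dtype=float)[:k]
    ideal = np.sort(np.asarray(retrieved_rel, dtype=float))[::-1][:k]
    return dcg(rel) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Graded relevance of the videos returned for one text query, in ranked order.
retrieved = [3, 0, 2, 1, 0]
print(f"nDCG@5 = {ndcg(retrieved, k=5):.3f}")
```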
UGC: Unified GAN Compression for Efficient Image-to-Image Translation
results: Experimental results show that UGC achieves strong performance on image-to-image translation tasks while being more computation- and data-efficient than conventional GAN models.
Abstract
Recent years have witnessed the prevailing progress of Generative Adversarial Networks (GANs) in image-to-image translation. However, the success of these GAN models hinges on ponderous computational costs and labor-expensive training data. Current efficient GAN learning techniques often fall into two orthogonal aspects: i) model slimming via reduced calculation costs; ii) data/label-efficient learning with fewer training data/labels. To combine the best of both worlds, we propose a new learning paradigm, Unified GAN Compression (UGC), with a unified optimization objective to seamlessly prompt the synergy of model-efficient and label-efficient learning. UGC sets up semi-supervised-driven network architecture search and adaptive online semi-supervised distillation stages sequentially, which formulates a heterogeneous mutual learning scheme to obtain an architecture-flexible, label-efficient, and performance-excellent model.
Effective Image Tampering Localization via Enhanced Transformer and Co-attention Fusion
results: Experimental results show that the proposed scheme achieves state-of-the-art generalization ability and robustness on multiple benchmark datasets. Code will be released at https://github.com/multimediaFor/EITLNet.
Abstract
Powerful manipulation techniques have made digital image forgeries be easily created and widespread without leaving visual anomalies. The blind localization of tampered regions becomes quite significant for image forensics. In this paper, we propose an effective image tampering localization network (EITLNet) based on a two-branch enhanced transformer encoder with attention-based feature fusion. Specifically, a feature enhancement module is designed to enhance the feature representation ability of the transformer encoder. The features extracted from RGB and noise streams are fused effectively by the coordinate attention-based fusion module at multiple scales. Extensive experimental results verify that the proposed scheme achieves the state-of-the-art generalization ability and robustness in various benchmark datasets. Code will be public at https://github.com/multimediaFor/EITLNet.
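The fusion module above builds on coordinate attention; the sketch below shows a generic coordinate-attention block applied to concatenated RGB and noise features, with the reduction ratio and fusion-by-concatenation chosen as illustrative assumptions rather than the exact EITLNet design.

```python
# Sketch of a coordinate-attention block (Hou et al., CVPR 2021) of the kind the
# fusion module above builds on; the reduction ratio and the simple concatenation
# of RGB and noise features are assumptions for illustration.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)          # (B, C, H, 1): pooled along width
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w

rgb_feat = torch.randn(1, 32, 64, 64)    # features from the RGB stream
noise_feat = torch.randn(1, 32, 64, 64)  # features from the noise stream
fused = CoordinateAttention(64)(torch.cat([rgb_feat, noise_feat], dim=1))
print(fused.shape)                        # torch.Size([1, 64, 64, 64])
```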
RenderIH: A Large-scale Synthetic Dataset for 3D Interacting Hand Pose Estimation
results: Experiments show that pretraining on RenderIH data significantly reduces the error from 6.76mm to 5.79mm, and TransHand surpasses contemporary methods.
Abstract
The current interacting hand (IH) datasets are relatively simplistic in terms of background and texture, with hand joints being annotated by a machine annotator, which may result in inaccuracies, and the diversity of pose distribution is limited. However, the variability of background, pose distribution, and texture can greatly influence the generalization ability. Therefore, we present a large-scale synthetic dataset RenderIH for interacting hands with accurate and diverse pose annotations. The dataset contains 1M photo-realistic images with varied backgrounds, perspectives, and hand textures. To generate natural and diverse interacting poses, we propose a new pose optimization algorithm. Additionally, for better pose estimation accuracy, we introduce a transformer-based pose estimation network, TransHand, to leverage the correlation between interacting hands and verify the effectiveness of RenderIH in improving results. Our dataset is model-agnostic and can improve the accuracy of any hand pose estimation method more than other real or synthetic datasets. Experiments have shown that pretraining on our synthetic data can significantly decrease the error from 6.76mm to 5.79mm, and our TransHand surpasses contemporary methods. Our dataset and code are available at https://github.com/adwardlee/RenderIH.
Chasing Day and Night: Towards Robust and Efficient All-Day Object Detection Guided by an Event Camera
results: EOLO performs well under all lighting conditions, outperforming the state-of-the-art method (RENet) by 3.74% mAP50. We also build two new datasets, E-MSCOCO and E-VOC, to further validate and improve our method.
Abstract
The ability to detect objects in all lighting (i.e., normal-, over-, and under-exposed) conditions is crucial for real-world applications, such as self-driving. Traditional RGB-based detectors often fail under such varying lighting conditions. Therefore, recent works utilize novel event cameras to supplement or guide the RGB modality; however, these methods typically adopt asymmetric network structures that rely predominantly on the RGB modality, resulting in limited robustness for all-day detection. In this paper, we propose EOLO, a novel object detection framework that achieves robust and efficient all-day detection by fusing both RGB and event modalities. Our EOLO framework is built based on a lightweight spiking neural network (SNN) to efficiently leverage the asynchronous property of events. Buttressed by it, we first introduce an Event Temporal Attention (ETA) module to learn the high temporal information from events while preserving crucial edge information. Secondly, as different modalities exhibit varying levels of importance under diverse lighting conditions, we propose a novel Symmetric RGB-Event Fusion (SREF) module to effectively fuse RGB-Event features without relying on a specific modality, thus ensuring a balanced and adaptive fusion for all-day detection. In addition, to compensate for the lack of paired RGB-Event datasets for all-day training and evaluation, we propose an event synthesis approach based on the randomized optical flow that allows for directly generating the event frame from a single exposure image. We further build two new datasets, E-MSCOCO and E-VOC, based on the popular benchmarks MSCOCO and PASCAL VOC. Extensive experiments demonstrate that our EOLO outperforms the state-of-the-art detectors, e.g., RENet, by a substantial margin (+3.74% mAP50) in all lighting conditions. Our code and datasets will be available at https://vlislab22.github.io/EOLO/
LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation
for: The paper develops a framework for generating co-speech gestures that are semantically aligned with the speech content, and it aims to provide several control handles for various applications.
methods: The proposed framework consists of two stages: script-based gesture generation and audio-guided rhythm refinement. The script-based gesture generation uses pre-trained CLIP text embeddings as guidance, while the audio-guided rhythm refinement uses a simple but effective diffusion-based gesture generation backbone conditioned on audio signals.
results: The proposed framework outperforms competing methods in terms of semantic awareness and rhythm alignment, and it also achieves state-of-the-art performance on two benchmarks. Additionally, the framework enables several applications such as changing the gesticulation style, editing co-speech gestures via textual prompting, and controlling semantic awareness and rhythm alignment with guided diffusion.
Abstract
Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate speech rhythm-synchronized gestures, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are indeed the key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. In particular, our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement. Specifically, the script-based gesture generation leverages the pre-trained CLIP text embeddings as the guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a simple but effective diffusion-based gesture generation backbone simply using pure MLPs, that is conditioned on only audio signals and learns to gesticulate with realistic motions. We utilize such powerful prior to rhyme the script-guided gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style, editing the co-speech gestures via textual prompting, and controlling the semantic awareness and rhythm alignment with guided diffusion. Extensive experiments demonstrate the advantages of the proposed framework over competing methods. In addition, our core diffusion-based generative model also achieves state-of-the-art performance on two benchmarks. The code and model will be released to facilitate future research.
MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification
results: Experimental results show that the proposed MVP method performs well across different settings (various numbers of ways and shots) and also generalizes well in cross-domain adaptation.
Abstract
Vision Transformer (ViT) models have recently emerged as powerful and versatile models for various visual tasks. Recently, a work called PMF has achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models. However, PMF employs full fine-tuning for learning the downstream tasks, leading to significant overfitting and storage issues, especially in the remote sensing domain. In order to tackle these issues, we turn to the recently proposed parameter-efficient tuning methods, such as VPT, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen. Inspired by VPT, we propose the Meta Visual Prompt Tuning (MVP) method. Specifically, we integrate the VPT method into the meta-learning framework and tailor it to the remote sensing domain, resulting in an efficient framework for Few-Shot Remote Sensing Scene Classification (FS-RSSC). Furthermore, we introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes. Experiment results on the FS-RSSC benchmark demonstrate the superior performance of the proposed MVP over existing methods in various settings, such as various-way-various-shot, various-way-one-shot, and cross-domain adaptation.
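A minimal sketch of the VPT-style prompt tuning that MVP builds on is shown below: learnable prompt tokens are prepended to the patch tokens while the backbone stays frozen. The toy transformer encoder, token counts, and pooling are placeholders, not the pre-trained ViT or the meta-learning loop used in the paper.

```python
# Sketch of VPT-style prompt tuning: learnable prompt tokens are prepended to the
# patch tokens while the pre-trained backbone stays frozen. The toy encoder and all
# sizes are placeholders, not the pre-trained ViT or MVP's meta-learning setup.
import torch
import torch.nn as nn

class PromptTunedClassifier(nn.Module):
    def __init__(self, embed_dim=192, n_prompts=10, n_classes=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # stands in for a pre-trained ViT
        for p in self.backbone.parameters():
            p.requires_grad = False                                  # frozen backbone
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, n_classes)                  # trainable task head

    def forward(self, patch_tokens):                 # (B, N, D) patch embeddings
        b = patch_tokens.shape[0]
        tokens = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        feats = self.backbone(tokens)
        return self.head(feats.mean(dim=1))          # simple pooled classification

model = PromptTunedClassifier()
logits = model(torch.randn(2, 196, 192))             # 14x14 patches of a remote sensing scene
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(logits.shape, trainable)                       # only prompts + head are updated
```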
LiDAR Data Synthesis with Denoising Diffusion Probabilistic Models
results: Our method outperforms the baselines on the generation task of the KITTI-360 and KITTI-Raw datasets and on the upsampling task of the KITTI-360 dataset. Our code and pre-trained weights will be available at https://github.com/kazuto1011/r2dm.
Abstract
Generative modeling of 3D LiDAR data is an emerging task with promising applications for autonomous mobile robots, such as scalable simulation, scene manipulation, and sparse-to-dense completion of LiDAR point clouds. Existing approaches have shown the feasibility of image-based LiDAR data generation using deep generative models while still struggling with the fidelity of generated data and training instability. In this work, we present R2DM, a novel generative model for LiDAR data that can generate diverse and high-fidelity 3D scene point clouds based on the image representation of range and reflectance intensity. Our method is based on the denoising diffusion probabilistic models (DDPMs), which have demonstrated impressive results among generative model frameworks and have been significantly progressing in recent years. To effectively train DDPMs on the LiDAR domain, we first conduct an in-depth analysis regarding data representation, training objective, and spatial inductive bias. Based on our designed model R2DM, we also introduce a flexible LiDAR completion pipeline using the powerful properties of DDPMs. We demonstrate that our method outperforms the baselines on the generation task of KITTI-360 and KITTI-Raw datasets and the upsampling task of KITTI-360 datasets. Our code and pre-trained weights will be available at https://github.com/kazuto1011/r2dm.
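The sketch below illustrates the denoising-diffusion training objective that such a model rests on: closed-form forward noising of a range/reflectance image followed by an epsilon-prediction loss. The tiny CNN denoiser (without timestep conditioning) and the schedule are placeholders, not the R2DM architecture.

```python
# Minimal DDPM-style training step on a LiDAR range/reflectance image, illustrating
# the denoising-diffusion objective; the tiny CNN denoiser omits timestep conditioning
# for brevity and is not the R2DM architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                      # predicts the noise added to x_t
    nn.Conv2d(2, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 2, 3, padding=1),
)

x0 = torch.rand(4, 2, 64, 256)                 # range + reflectance image, in [0, 1]
x0 = x0 * 2 - 1                                # scale to [-1, 1]
t = torch.randint(0, T, (x0.shape[0],))
a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)

noise = torch.randn_like(x0)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # closed-form forward diffusion
loss = F.mse_loss(denoiser(x_t), noise)        # epsilon-prediction objective
loss.backward()
print(float(loss))
```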
Convex Latent-Optimized Adversarial Regularizers for Imaging Inverse Problems
results: The study shows that the CLEAR-informed regularization model operates on the real data manifold and reconstructs images stably even in the presence of measurement interference. Moreover, the method outperforms conventional data-driven techniques and traditional regularization approaches in both reconstruction quality and robustness.
Abstract
Recently, data-driven techniques have demonstrated remarkable effectiveness in addressing challenges related to MR imaging inverse problems. However, these methods still exhibit certain limitations in terms of interpretability and robustness. In response, we introduce Convex Latent-Optimized Adversarial Regularizers (CLEAR), a novel and interpretable data-driven paradigm. CLEAR represents a fusion of deep learning (DL) and variational regularization. Specifically, we employ a latent optimization technique to adversarially train an input convex neural network, and its set of minima can fully represent the real data manifold. We utilize it as a convex regularizer to formulate a CLEAR-informed variational regularization model that guides the solution of the imaging inverse problem on the real data manifold. Leveraging its inherent convexity, we have established the convergence of the projected subgradient descent algorithm for the CLEAR-informed regularization model. This convergence guarantees the attainment of a unique solution to the imaging inverse problem, subject to certain assumptions. Furthermore, we have demonstrated the robustness of our CLEAR-informed model, explicitly showcasing its capacity to achieve stable reconstruction even in the presence of measurement interference. Finally, we illustrate the superiority of our approach using MRI reconstruction as an example. Our method consistently outperforms conventional data-driven techniques and traditional regularization approaches, excelling in both reconstruction quality and robustness.
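The convexity that CLEAR exploits can be illustrated with an input-convex neural network (ICNN), whose output is convex in its input because hidden-to-hidden weights are kept nonnegative and activations are convex and non-decreasing. The sketch below shows only this mechanism, with assumed sizes, not the adversarial latent-optimization training or the projected subgradient solver.

```python
# Sketch of an input-convex neural network (ICNN): the output is convex in the input
# because hidden-to-hidden weights are constrained nonnegative and the activations are
# convex and non-decreasing. Sizes are assumed; this is not the CLEAR training procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    def __init__(self, in_dim, hidden=64, depth=3):
        super().__init__()
        self.Wx = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(depth)])
        self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(depth - 1)])
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.softplus(self.Wx[0](x))
        for wx, wz in zip(self.Wx[1:], self.Wz):
            # Re-parameterize hidden-to-hidden weights to be nonnegative.
            z = F.softplus(wx(x) + F.linear(z, F.softplus(wz.weight)))
        return F.linear(z, F.softplus(self.out.weight))   # scalar regularizer value

reg = ICNN(in_dim=16)
x = torch.randn(8, 16, requires_grad=True)
r = reg(x).sum()
grad = torch.autograd.grad(r, x)[0]    # (sub)gradient usable inside a variational solver
print(grad.shape)                      # torch.Size([8, 16])
```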
LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking
results: The fastest model achieves 65.2% AO on the GOT-10k test set, surpassing all previous lightweight trackers, and runs at over 100 fps with ONNX on the Jetson Orin NX edge device. In addition, a larger variant reaches 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet while running at 171 fps on an NVIDIA 2080Ti GPU.
Abstract
The recent advancements in transformer-based visual trackers have led to significant progress, attributed to their strong modeling capabilities. However, as performance improves, running latency correspondingly increases, presenting a challenge for real-time robotics applications, especially on edge devices with computational constraints. In response to this, we introduce LiteTrack, an efficient transformer-based tracking model optimized for high-speed operations across various devices. It achieves a more favorable trade-off between accuracy and efficiency than other lightweight trackers. The main innovations of LiteTrack encompass: 1) asynchronous feature extraction and interaction between the template and search region for better feature fusion and cutting redundant computation, and 2) pruning encoder layers from a heavy tracker to refine the balance between performance and speed. As an example, our fastest variant, LiteTrack-B4, achieves 65.2% AO on the GOT-10k benchmark, surpassing all preceding efficient trackers, while running over 100 fps with ONNX on the Jetson Orin NX edge device. Moreover, our LiteTrack-B9 reaches competitive 72.2% AO on GOT-10k and 82.4% AUC on TrackingNet, and operates at 171 fps on an NVIDIA 2080Ti GPU. The code and demo materials will be available at https://github.com/TsingWei/LiteTrack.
Image-level supervision and self-training for transformer-based cross-modality tumor segmentation
paper_authors: Malo de Boisredon, Eugene Vorontsov, William Trung Le, Samuel Kadoury
for: This work aims to improve automated medical image segmentation, particularly in cross-modality settings.
methods: A new semi-supervised training strategy called MoDATTS is proposed for accurate cross-modality 3D tumor segmentation on unpaired bi-modal datasets. An image-to-image translation strategy converts images between modalities into annotated pseudo-target volumes to improve generalization to the unannotated target modality, and an iterative self-training procedure further closes the domain gap between modalities.
results: MoDATTS achieves the top reported Dice score of 0.87+/-0.04 in the CrossMoDA 2022 challenge, higher than the methods of the other participating teams. It also shows consistent improvements on a cross-modality brain tumor segmentation task, with Dice scores more than 10% above the baselines, and reaches 95% of the performance of a target-supervised model, which can be raised further with additional annotated data.
Abstract
Deep neural networks are commonly used for automated medical image segmentation, but models will frequently struggle to generalize well across different imaging modalities. This issue is particularly problematic due to the limited availability of annotated data, making it difficult to deploy these models on a larger scale. To overcome these challenges, we propose a new semi-supervised training strategy called MoDATTS. Our approach is designed for accurate cross-modality 3D tumor segmentation on unpaired bi-modal datasets. An image-to-image translation strategy between imaging modalities is used to produce annotated pseudo-target volumes and improve generalization to the unannotated target modality. We also use powerful vision transformer architectures and introduce an iterative self-training procedure to further close the domain gap between modalities. MoDATTS additionally allows the possibility to extend the training to unannotated target data by exploiting image-level labels with an unsupervised objective that encourages the model to perform 3D diseased-to-healthy translation by disentangling tumors from the background. The proposed model achieves superior performance compared to other methods from participating teams in the CrossMoDA 2022 challenge, as evidenced by its reported top Dice score of 0.87+/-0.04 for the VS segmentation. MoDATTS also yields consistent improvements in Dice scores over baselines on a cross-modality brain tumor segmentation task composed of four different contrasts from the BraTS 2020 challenge dataset, where 95% of a target supervised model performance is reached. We report that 99% and 100% of this maximum performance can be attained if 20% and 50% of the target data is additionally annotated, which further demonstrates that MoDATTS can be leveraged to reduce the annotation burden.
results: The method achieves high accuracies of 91.3% and 98.4% on two public datasets, surpassing the previous best methods by more than 10%.
Abstract
Skeletal Action recognition from an egocentric view is important for applications such as interfaces in AR/VR glasses and human-robot interaction, where the device has limited resources. Most of the existing skeletal action recognition approaches use 3D coordinates of hand joints and 8-corner rectangular bounding boxes of objects as inputs, but they do not capture how the hands and objects interact with each other within the spatial context. In this paper, we present a new framework called Contact-aware Skeletal Action Recognition (CaSAR). It uses novel representations of hand-object interaction that encompass spatial information: 1) contact points where the hand joints meet the objects, 2) distant points where the hand joints are far away from the object and nearly not involved in the current action. Our framework is able to learn how the hands touch or stay away from the objects for each frame of the action sequence, and use this information to predict the action class. We demonstrate that our approach achieves the state-of-the-art accuracy of 91.3% and 98.4% on two public datasets, H2O and FPHA, respectively.
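A toy version of the contact/distant split described above is sketched below: each hand joint is assigned to the contact set or the distant set by its distance to the nearest corner of the object's bounding box. The coordinates and the distance threshold are made up for illustration.

```python
# Toy illustration of splitting hand joints into "contact" and "distant" points with
# respect to an object's 8-corner bounding box; coordinates and threshold are made up.
import numpy as np

def contact_and_distant(hand_joints, object_corners, threshold=0.02):
    """hand_joints: (J, 3), object_corners: (8, 3); threshold in meters (assumed)."""
    # Distance from each joint to its nearest object corner.
    d = np.linalg.norm(hand_joints[:, None, :] - object_corners[None, :, :], axis=-1)
    nearest = d.min(axis=1)
    contact_idx = np.where(nearest <= threshold)[0]
    distant_idx = np.where(nearest > threshold)[0]
    return contact_idx, distant_idx

joints = np.random.rand(21, 3) * 0.2                   # 21 hand joints
corners = np.array([[x, y, z] for x in (0.0, 0.1)
                               for y in (0.0, 0.1)
                               for z in (0.0, 0.1)])   # 8-corner bounding box
contact, distant = contact_and_distant(joints, corners)
print("contact joints:", contact)
print("distant joints:", distant)
```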
CryoAlign: feature-based method for global and local 3D alignment of EM density maps
paper_authors: Bintao He, Fa Zhang, Chenjie Feng, Jianyi Yang, Xin Gao, Renmin Han
for: Alignment and comparison of density maps to interpret structural information, such as conformational heterogeneity analysis and atomic model assembly.
methods: Local density feature descriptors are used to capture spatial structure similarities, enabling rapid establishment of point-pair correspondences and robust estimation of alignment parameters.
results: In experimental evaluations, CryoAlign achieves higher alignment accuracy and speed, outperforming existing methods.
Abstract
Advances on cryo-electron imaging technologies have led to a rapidly increasing number of density maps. Alignment and comparison of density maps play a crucial role in interpreting structural information, such as conformational heterogeneity analysis using global alignment and atomic model assembly through local alignment. Here, we propose a fast and accurate global and local cryo-electron microscopy density map alignment method CryoAlign, which leverages local density feature descriptors to capture spatial structure similarities. CryoAlign is the first feature-based EM map alignment tool, in which the employment of feature-based architecture enables the rapid establishment of point pair correspondences and robust estimation of alignment parameters. Extensive experimental evaluations demonstrate the superiority of CryoAlign over the existing methods in both alignment accuracy and speed.
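As a generic illustration of the correspondence-then-rigid-fit step that feature-based alignment relies on, the sketch below estimates a rigid transform from matched point pairs with the Kabsch algorithm; the synthetic points are invented, and this is not the CryoAlign descriptor or its full pipeline.

```python
# Generic correspondence-then-rigid-fit step: given matched point pairs, estimate the
# rigid transform with the Kabsch algorithm. Synthetic data only; not CryoAlign itself.
import numpy as np

def kabsch(src, dst):
    """Rigid transform (R, t) minimizing ||R @ src_i + t - dst_i|| for matched points (N, 3)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))                       # sampled points of the moving map
theta = np.radians(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([1.0, -2.0, 0.5])     # corresponding points of the fixed map
R, t = kabsch(src, dst)
print("rotation error:", np.linalg.norm(R - R_true))  # ~0 for noise-free correspondences
```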
All-optical image denoising using a diffractive visual processor
methods: all-optical and non-iterative, using a deep learning-enabled analog diffractive image denoiser
results: efficiently removes salt and pepper noise and image rendering-related spatial artifacts, with an output power efficiency of ~30-40%
Abstract
Image denoising, one of the essential inverse problems, targets to remove noise/artifacts from input images. In general, digital image denoising algorithms, executed on computers, present latency due to several iterations implemented in, e.g., graphics processing units (GPUs). While deep learning-enabled methods can operate non-iteratively, they also introduce latency and impose a significant computational burden, leading to increased power consumption. Here, we introduce an analog diffractive image denoiser to all-optically and non-iteratively clean various forms of noise and artifacts from input images - implemented at the speed of light propagation within a thin diffractive visual processor. This all-optical image denoiser comprises passive transmissive layers optimized using deep learning to physically scatter the optical modes that represent various noise features, causing them to miss the output image Field-of-View (FoV) while retaining the object features of interest. Our results show that these diffractive denoisers can efficiently remove salt and pepper noise and image rendering-related spatial artifacts from input phase or intensity images while achieving an output power efficiency of ~30-40%. We experimentally demonstrated the effectiveness of this analog denoiser architecture using a 3D-printed diffractive visual processor operating at the terahertz spectrum. Owing to their speed, power-efficiency, and minimal computational overhead, all-optical diffractive denoisers can be transformative for various image display and projection systems, including, e.g., holographic displays.
Neural Gradient Learning and Optimization for Oriented Point Normal Estimation
paper_authors: Qing Li, Huifang Feng, Kanle Shi, Yi Fang, Yu-Shen Liu, Zhizhong Han
for: Learning gradient vectors with consistent orientation from 3D point clouds for normal estimation.
methods: A deep learning approach parameterizes an objective function to produce gradients at points of the cloud, and an angular distance field learned from local plane geometry refines the coarse gradient vectors.
results: The method provides robust and detail-preserving normal estimation that is resilient to noise, outliers, and variations in point density. Compared with previous works, it improves the accuracy and generalization ability of normal estimation.
Abstract
We propose Neural Gradient Learning (NGL), a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. It has excellent gradient approximation properties for the underlying geometry of the data. We utilize a simple neural network to parameterize the objective function to produce gradients at points using a global implicit representation. However, the derived gradients usually drift away from the ground-truth oriented normals due to the lack of local detail descriptions. Therefore, we introduce Gradient Vector Optimization (GVO) to learn an angular distance field based on local plane geometry to refine the coarse gradient vectors. Finally, we formulate our method with a two-phase pipeline of coarse estimation followed by refinement. Moreover, we integrate two weighting functions, i.e., anisotropic kernel and inlier score, into the optimization to improve the robust and detail-preserving performance. Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description. This leads to a state-of-the-art normal estimator that is robust to noise, outliers and point density variations. Extensive evaluations show that our method outperforms previous works in both unoriented and oriented normal estimation on widely used benchmarks. The source code and pre-trained models are available at https://github.com/LeoQLi/NGLO.
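The core idea of taking gradients of a learned scalar field as normal directions can be sketched as below; the small network, query points, and normalization are illustrative only and do not reproduce the NGL training objective or the GVO refinement.

```python
# Minimal illustration of the underlying idea: parameterize a scalar implicit function
# with a small network and take its gradient w.r.t. the input points as the normal
# direction. Illustrative only; not the NGL/GVO training or refinement procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F

implicit_f = nn.Sequential(          # scalar field f(x): R^3 -> R
    nn.Linear(3, 64), nn.Softplus(),
    nn.Linear(64, 64), nn.Softplus(),
    nn.Linear(64, 1),
)

points = torch.rand(128, 3, requires_grad=True)   # query points from the cloud
values = implicit_f(points)
# Gradient of the scalar field at each point; normalized, it serves as the normal estimate.
grads = torch.autograd.grad(values.sum(), points, create_graph=True)[0]
normals = F.normalize(grads, dim=-1)
print(normals.shape)                 # torch.Size([128, 3])
```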
Differentiable SLAM Helps Deep Learning-based LiDAR Perception Tasks
results: Experimental results show that using differentiable SLAM architectures improves the performance of two deep learning applications: ground level estimation and dynamic-to-static LiDAR translation. Overall, these findings provide important insights for improving the performance of LiDAR-based navigation systems.
Abstract
We investigate a new paradigm that uses differentiable SLAM architectures in a self-supervised manner to train end-to-end deep learning models in various LiDAR based applications. To the best of our knowledge there does not exist any work that leverages SLAM as a training signal for deep learning based models. We explore new ways to improve the efficiency, robustness, and adaptability of LiDAR systems with deep learning techniques. We focus on the potential benefits of differentiable SLAM architectures for improving performance of deep learning tasks such as classification, regression as well as SLAM. Our experimental results demonstrate a non-trivial increase in the performance of two deep learning applications - Ground Level Estimation and Dynamic to Static LiDAR Translation, when used with differentiable SLAM architectures. Overall, our findings provide important insights that enhance the performance of LiDAR based navigation systems. We demonstrate that this new paradigm of using SLAM Loss signal while training LiDAR based models can be easily adopted by the community.
Efficient Pyramid Channel Attention Network for Pathological Myopia Detection
results: Extensive experiments on three datasets show that EPCA-Net outperforms existing methods in detecting PM. In addition, we explore the pretraining-and-finetuning paradigm and show that our method achieves competitive performance with fewer tunable parameters than traditional fine-tuning.
Abstract
Pathological myopia (PM) is the leading ocular disease for impaired vision and blindness worldwide. The key to detecting PM as early as possible is to detect informative features in global and local lesion regions, such as fundus tessellation, atrophy and maculopathy. However, applying classical convolutional neural networks (CNNs) to efficiently highlight global and local lesion context information in feature maps is quite challenging. To tackle this issue, we aim to fully leverage the potential of global and local lesion information with attention module design. Based on this, we propose an efficient pyramid channel attention (EPCA) module, which dynamically explores the relative importance of global and local lesion context information in feature maps. Then we combine the EPCA module with the backbone network to construct EPCA-Net for automatic PM detection based on fundus images. In addition, we construct a PM dataset termed PM-fundus by collecting fundus images of PM from publicly available datasets (e.g., the PALM dataset and ODIR dataset). The comprehensive experiments are conducted on three datasets, demonstrating that our EPCA-Net outperforms state-of-the-art methods in detecting PM. Furthermore, motivated by the recent pretraining-and-finetuning paradigm, we attempt to adapt pre-trained natural image models for PM detection by freezing them and treating the EPCA module and other attention modules as the adapters. The results show that our method with the pretraining-and-finetuning paradigm achieves competitive performance through comparisons to part of methods with traditional fine-tuning methods with fewer tunable parameters.
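A rough sketch of channel attention computed over a pyramid of pooled spatial scales and merged into a single channel weighting is given below; the pooling sizes, shared MLP, and averaging are assumptions for illustration, not the exact EPCA module.

```python
# Rough sketch of pyramid-style channel attention: channel descriptors are pooled at
# several spatial scales, passed through a shared MLP, and merged into one channel
# weighting. Pooling sizes and averaging are assumptions, not the exact EPCA design.
import torch
import torch.nn as nn

class PyramidChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.mlp = nn.Sequential(                     # shared across pyramid levels
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, _, _ = x.shape
        weights = []
        for s in self.pool_sizes:
            pooled = nn.functional.adaptive_avg_pool2d(x, s)    # (B, C, s, s)
            # Treat each pooled cell as a channel descriptor and average their attentions.
            desc = pooled.flatten(2).transpose(1, 2)            # (B, s*s, C)
            weights.append(self.mlp(desc).mean(dim=1))          # (B, C)
        attn = torch.sigmoid(torch.stack(weights, dim=0).mean(dim=0))
        return x * attn.view(b, c, 1, 1)

feat = torch.randn(2, 64, 56, 56)      # fundus feature map from the backbone
out = PyramidChannelAttention(64)(feat)
print(out.shape)                        # torch.Size([2, 64, 56, 56])
```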
CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation
results: Experiments show that CLIPUNetr improves boundary and structure measurements by an average of 120% and successfully assists real-world UIBVS control.
Abstract
The classical human-robot interface in uncalibrated image-based visual servoing (UIBVS) relies on either human annotations or semantic segmentation with categorical labels. Both methods fail to match natural human communication and convey rich semantics in manipulation tasks as effectively as natural language expressions. In this paper, we tackle this problem by using referring expression segmentation, which is a prompt-based approach, to provide more in-depth information for robot perception. To generate high-quality segmentation predictions from referring expressions, we propose CLIPUNetr - a new CLIP-driven referring expression segmentation network. CLIPUNetr leverages CLIP's strong vision-language representations to segment regions from referring expressions, while utilizing its ``U-shaped'' encoder-decoder architecture to generate predictions with sharper boundaries and finer structures. Furthermore, we propose a new pipeline to integrate CLIPUNetr into UIBVS and apply it to control robots in real-world environments. In experiments, our method improves boundary and structure measurements by an average of 120% and can successfully assist real-world UIBVS control in an unstructured manipulation environment.