cs.CV - 2023-11-17

Closely-Spaced Object Classification Using MuyGPyS

  • paper_url: http://arxiv.org/abs/2311.10904
  • repo_url: None
  • paper_authors: Kerianne Pruett, Nathan McNaughton, Michael Schneider
  • for: This paper aims to improve the accuracy of ground-based space domain awareness (SDA) algorithms in detecting and classifying closely-spaced space objects.
  • methods: The paper classifies closely-spaced objects with the Gaussian process Python package MuyGPyS and examines classification accuracy as a function of angular separation and magnitude difference between the simulated satellites.
  • results: The Gaussian process classifier improves classification accuracy under more challenging conditions and outperforms traditional machine learning methods.
    Abstract Accurately detecting rendezvous and proximity operations (RPO) is crucial for understanding how objects are behaving in the space domain. However, detecting closely-spaced objects (CSO) is challenging for ground-based optical space domain awareness (SDA) algorithms as two objects close together along the line-of-sight can appear blended as a single object within the point-spread function (PSF) of the optical system. Traditional machine learning methods can be useful for differentiating between singular objects and closely-spaced objects, but many methods require large training sample sizes or high signal-to-noise conditions. The quality and quantity of realistic data make probabilistic classification methods a superior approach, as they are better suited to handle these data inadequacies. We present CSO classification results using the Gaussian process python package, MuyGPyS, and examine classification accuracy as a function of angular separation and magnitude difference between the simulated satellites. This orbit-independent analysis is done on highly accurate simulated SDA images that emulate realistic ground-based commercial-of-the-shelf (COTS) optical sensor observations of CSOs. We find that MuyGPyS outperforms traditional machine learning methods, especially under more challenging circumstances.
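
A minimal sketch of the underlying idea, classifying single objects versus blended closely-spaced objects from image-derived features with a Gaussian process classifier. scikit-learn's GaussianProcessClassifier stands in for MuyGPyS (whose nearest-neighbors-based API differs), and the toy features below are assumptions, not the paper's simulated-SDA pipeline.

```python
# Illustrative only: GP classification of single vs. closely-spaced objects (CSO)
# from toy image-derived features. The paper uses MuyGPyS on simulated SDA images;
# here scikit-learn's GaussianProcessClassifier stands in, and the "features"
# (PSF width, ellipticity proxy) are synthetic assumptions, not the paper's pipeline.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Toy features: blended CSOs tend to look wider and more elongated than single PSFs.
single = np.column_stack([rng.normal(1.0, 0.1, n), rng.normal(0.05, 0.03, n)])
cso = np.column_stack([rng.normal(1.3, 0.2, n), rng.normal(0.25, 0.10, n)])
X = np.vstack([single, cso])
y = np.concatenate([np.zeros(n), np.ones(n)])  # 0 = single object, 1 = CSO

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=0.5), random_state=0)
clf.fit(X_tr, y_tr)

print("accuracy:", clf.score(X_te, y_te))
print("class probabilities for first test sample:", clf.predict_proba(X_te[:1]))
```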

OCT2Confocal: 3D CycleGAN based Translation of Retinal OCT Images to Confocal Microscopy

  • paper_url: http://arxiv.org/abs/2311.10902
  • repo_url: None
  • paper_authors: Xin Tian, Nantheera Anantrasirichai, Lindsay Nicholson, Alin Achim
  • for: bridging the gap between in vivo OCT and ex vivo confocal microscopy imaging
  • methods: developed a 3D CycleGAN framework for unsupervised translation of in vivo OCT to ex vivo confocal microscopy images
  • results: effectively translates between 3D medical data domains, capturing vascular, textural, and cellular details with precision, outperforming existing methods despite limited data.
    Abstract Optical coherence tomography (OCT) and confocal microscopy are pivotal in retinal imaging, each presenting unique benefits and limitations. In vivo OCT offers rapid, non-invasive imaging but can be hampered by clarity issues and motion artifacts. Ex vivo confocal microscopy provides high-resolution, cellular detailed color images but is invasive and poses ethical concerns and potential tissue damage. To bridge these modalities, we developed a 3D CycleGAN framework for unsupervised translation of in vivo OCT to ex vivo confocal microscopy images. Applied to our OCT2Confocal dataset, this framework effectively translates between 3D medical data domains, capturing vascular, textural, and cellular details with precision. This marks the first attempt to exploit the inherent 3D information of OCT and translate it into the rich, detailed color domain of confocal microscopy. Assessed through quantitative and qualitative metrics, the 3D CycleGAN framework demonstrates commendable image fidelity and quality, outperforming existing methods despite the constraints of limited data. This non-invasive generation of retinal confocal images has the potential to further enhance diagnostic and monitoring capabilities in ophthalmology.
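
A minimal sketch of the cycle-consistency idea behind a 3D CycleGAN, using tiny 3D-convolutional generators on random volumes. The actual OCT2Confocal generators, discriminators, losses, and data handling are not reproduced; channel counts and volume sizes are illustrative assumptions.

```python
# Minimal sketch of 3D CycleGAN cycle-consistency with toy 3D-conv generators.
# The paper's generators/discriminators and training schedule are more elaborate;
# shapes and channel counts here are illustrative assumptions.
import torch
import torch.nn as nn

class TinyGenerator3D(nn.Module):
    """Toy volume-to-volume generator (e.g. OCT -> confocal or back)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 16, 3, padding=1), nn.InstanceNorm3d(16), nn.ReLU(),
            nn.Conv3d(16, 16, 3, padding=1), nn.InstanceNorm3d(16), nn.ReLU(),
            nn.Conv3d(16, out_ch, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

G_oct2conf = TinyGenerator3D(in_ch=1, out_ch=3)   # OCT (1 channel) -> confocal (RGB)
G_conf2oct = TinyGenerator3D(in_ch=3, out_ch=1)   # confocal (RGB) -> OCT

l1 = nn.L1Loss()
oct_vol = torch.randn(2, 1, 16, 64, 64)       # (batch, channels, depth, H, W)
conf_vol = torch.randn(2, 3, 16, 64, 64)

# Cycle consistency: translating forth and back should recover the input volume.
cycle_oct = l1(G_conf2oct(G_oct2conf(oct_vol)), oct_vol)
cycle_conf = l1(G_oct2conf(G_conf2oct(conf_vol)), conf_vol)
cycle_loss = cycle_oct + cycle_conf           # adversarial/identity terms omitted
print("cycle-consistency loss:", float(cycle_loss))
```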

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Autoencoder

  • paper_url: http://arxiv.org/abs/2311.10887
  • repo_url: None
  • paper_authors: Zhimin Chen, Yingwei Li, Longlong Jing, Liang Yang, Bing Li
  • for: This work proposes a 3D self-supervised learning method that fully exploits the multi-view attributes of point clouds for a deeper understanding of 3D structures.
  • methods: A 3D-to-multi-view masked autoencoder uses the encoded tokens from masked point clouds to reconstruct the original point cloud and multi-view depth images across various poses.
  • results: The method performs strongly across tasks and settings, outperforming state-of-the-art counterparts by a large margin on downstream tasks including 3D object classification, few-shot learning, part segmentation, and 3D object detection.
    Abstract In recent years, the field of 3D self-supervised learning has witnessed significant progress, resulting in the emergence of Multi-Modality Masked AutoEncoders (MAE) methods that leverage both 2D images and 3D point clouds for pre-training. However, a notable limitation of these approaches is that they do not fully utilize the multi-view attributes inherent in 3D point clouds, which is crucial for a deeper understanding of 3D structures. Building upon this insight, we introduce a novel approach employing a 3D to multi-view masked autoencoder to fully harness the multi-modal attributes of 3D point clouds. To be specific, our method uses the encoded tokens from 3D masked point clouds to generate original point clouds and multi-view depth images across various poses. This approach not only enriches the model's comprehension of geometric structures but also leverages the inherent multi-modal properties of point clouds. Our experiments illustrate the effectiveness of the proposed method for different tasks and under different settings. Remarkably, our method outperforms state-of-the-art counterparts by a large margin in a variety of downstream tasks, including 3D object classification, few-shot learning, part segmentation, and 3D object detection. Code will be available at: https://github.com/Zhimin-C/Multiview-MAE
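
A minimal sketch of the masking-and-reconstruction idea: point-patch tokens are randomly masked, the visible tokens are encoded with a transformer, and two heads reconstruct point coordinates and multi-view depth targets. The paper's token construction, decoders, and pose handling are not reproduced; all shapes below are illustrative assumptions.

```python
# Sketch of a masked autoencoder over point-cloud tokens with two reconstruction
# heads (points and multi-view depth). Real 3D-to-multi-view MAE details
# (patch grouping, positional embeddings, per-view decoders) are omitted.
import torch
import torch.nn as nn

B, N, D = 2, 64, 128        # batch, number of point-patch tokens, token dim
V, H, W = 4, 16, 16         # number of depth views and their resolution
mask_ratio = 0.6

tokens = torch.randn(B, N, D)                      # assume tokens from a point encoder
num_keep = int(N * (1 - mask_ratio))
keep_idx = torch.rand(B, N).argsort(dim=1)[:, :num_keep]
visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
latent = encoder(visible)                          # encode only visible tokens
pooled = latent.mean(dim=1)                        # simple global summary

point_head = nn.Linear(D, N * 3)                   # reconstruct all N points (xyz)
depth_head = nn.Linear(D, V * H * W)               # reconstruct V depth images

pred_points = point_head(pooled).view(B, N, 3)
pred_depths = depth_head(pooled).view(B, V, H, W)

gt_points = torch.randn(B, N, 3)                   # placeholder ground truth
gt_depths = torch.randn(B, V, H, W)
loss = nn.functional.mse_loss(pred_points, gt_points) + \
       nn.functional.mse_loss(pred_depths, gt_depths)
print("reconstruction loss:", float(loss))
```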

A Video-Based Activity Classification of Human Pickers in Agriculture

  • paper_url: http://arxiv.org/abs/2311.10885
  • repo_url: None
  • paper_authors: Abhishesh Pal, Antonio C. Leite, Jon G. O. Gjevestad, Pål J. From
  • for: This paper aims to improve the efficiency and productivity of harvesting operations in farming systems by developing an intelligent robotic system that can monitor human behavior, identify ongoing activities, and anticipate the worker’s needs.
  • methods: The proposed solution uses a combination of Mask Region-based Convolutional Neural Network (Mask R-CNN) for object detection, optical flow for motion estimation, and newly added statistical attributes of flow motion descriptors (Correlation Sensitivity, CS) to classify human activities in different agricultural scenarios.
  • results: The proposed framework is tested on in-house collected datasets from various crop fields, including strawberry polytunnels and apple tree orchards, and shows satisfactory results amidst challenges such as lighting variation, blur, and occlusions. The framework is evaluated using sensitivity, specificity, and accuracy measures, and the results demonstrate the effectiveness of the proposed approach.
    Abstract In farming systems, harvesting operations are tedious, time- and resource-consuming tasks. Based on this, deploying a fleet of autonomous robots to work alongside farmworkers may provide vast productivity and logistics benefits. Then, an intelligent robotic system should monitor human behavior, identify the ongoing activities and anticipate the worker's needs. In this work, the main contribution consists of creating a benchmark model for video-based human pickers detection, classifying their activities to serve in harvesting operations for different agricultural scenarios. Our solution uses the combination of a Mask Region-based Convolutional Neural Network (Mask R-CNN) for object detection and optical flow for motion estimation with newly added statistical attributes of flow motion descriptors, named as Correlation Sensitivity (CS). A classification criterion is defined based on the Kernel Density Estimation (KDE) analysis and K-means clustering algorithm, which are implemented upon in-house collected dataset from different crop fields like strawberry polytunnels and apple tree orchards. The proposed framework is quantitatively analyzed using sensitivity, specificity, and accuracy measures and shows satisfactory results amidst various dataset challenges such as lighting variation, blur, and occlusions.
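
A small sketch of the classification criterion described in the abstract, clustering per-clip motion descriptors with K-means and scoring them with kernel density estimation. The descriptor values below are random placeholders; the paper's Correlation Sensitivity features computed from Mask R-CNN detections and optical flow are not reproduced.

```python
# Sketch of a KDE + K-means classification criterion over per-clip motion
# descriptors (e.g. optical-flow statistics). The feature values are random
# placeholders; the paper's Correlation Sensitivity descriptors are not computed here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
# Pretend each row summarises the flow inside a detected person box for one clip:
# [mean flow magnitude, std of flow magnitude, flow direction coherence]
picking = rng.normal([0.2, 0.1, 0.8], 0.05, size=(100, 3))
walking = rng.normal([1.0, 0.4, 0.3], 0.10, size=(100, 3))
X = np.vstack([picking, walking])

# Unsupervised grouping of activities.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Per-cluster KDE gives a density-based score used as the classification criterion.
kdes = [KernelDensity(bandwidth=0.1).fit(X[km.labels_ == c]) for c in range(2)]
new_clip = rng.normal([0.95, 0.38, 0.32], 0.05, size=(1, 3))
scores = [kde.score_samples(new_clip)[0] for kde in kdes]   # log-densities
print("assigned cluster:", int(np.argmax(scores)), "log-densities:", scores)
```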

Pre- to Post-Contrast Breast MRI Synthesis for Enhanced Tumour Segmentation

  • paper_url: http://arxiv.org/abs/2311.10879
  • repo_url: https://github.com/richardobi/pre_post_synthesis
  • paper_authors: Richard Osuala, Smriti Joshi, Apostolia Tsirikoglou, Lidia Garrucho, Walter H. L. Pinaya, Oliver Diaz, Karim Lekadir
  • for: This paper aims to explore the feasibility of producing synthetic contrast enhancements for dynamic contrast-enhanced MRI (DCE-MRI) using a generative adversarial network (GAN).
  • methods: The authors use a GAN to translate pre-contrast T1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI sequence. They also introduce a Scaled Aggregate Measure (SAMe) to evaluate the quality of synthetic data.
  • results: The generated DCE-MRI data are assessed using quantitative image quality metrics and applied to the downstream task of 3D breast tumour segmentation. The results show that the synthetic data can enhance the robustness of breast tumour segmentation models via data augmentation.
    Abstract Despite its benefits for tumour detection and treatment, the administration of contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated with a range of issues, including their invasiveness, bioaccumulation, and a risk of nephrogenic systemic fibrosis. This study explores the feasibility of producing synthetic contrast enhancements by translating pre-contrast T1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI sequence leveraging the capabilities of a generative adversarial network (GAN). Additionally, we introduce a Scaled Aggregate Measure (SAMe) designed for quantitatively evaluating the quality of synthetic data in a principled manner and serving as a basis for selecting the optimal generative model. We assess the generated DCE-MRI data using quantitative image quality metrics and apply them to the downstream task of 3D breast tumour segmentation. Our results highlight the potential of post-contrast DCE-MRI synthesis in enhancing the robustness of breast tumour segmentation models via data augmentation. Our code is available at https://github.com/RichardObi/pre_post_synthesis.
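
A minimal sketch of paired image-to-image GAN training for pre-contrast to post-contrast translation, using an adversarial plus L1 reconstruction objective in the pix2pix style. The paper's actual generator and discriminator architectures, loss weighting, and the SAMe metric are not reproduced here; all shapes and weights are illustrative assumptions.

```python
# Sketch of a paired pre-contrast -> post-contrast translation objective:
# adversarial loss + L1 reconstruction, as in pix2pix-style GANs. Architectures,
# weights, and the paper's SAMe evaluation metric are illustrative assumptions.
import torch
import torch.nn as nn

generator = nn.Sequential(                       # toy 2D generator on MRI slices
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1), nn.Tanh())
discriminator = nn.Sequential(                   # toy patch discriminator
    nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1))

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

pre = torch.randn(4, 1, 64, 64)                  # pre-contrast T1w slice (toy)
post = torch.randn(4, 1, 64, 64)                 # corresponding DCE-MRI slice (toy)

fake_post = generator(pre)
d_fake = discriminator(torch.cat([pre, fake_post], dim=1))
adv_loss = bce(d_fake, torch.ones_like(d_fake))  # generator wants a "real" verdict
rec_loss = l1(fake_post, post)
g_loss = adv_loss + 100.0 * rec_loss             # pix2pix-style weighting (assumed)
print("generator loss:", float(g_loss))
```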

Multi-entity Video Transformers for Fine-Grained Video Representation Learning

  • paper_url: http://arxiv.org/abs/2311.10873
  • repo_url: https://github.com/facebookresearch/video_rep_learning
  • paper_authors: Matthew Walmer, Rose Kanjirathinkal, Kai Sheng Tai, Keyur Muzumdar, Taipeng Tian, Abhinav Shrivastava
  • for: This work advances temporally fine-grained video representation learning, generating frame-by-frame representations for temporally dense tasks.
  • methods: A self-supervised Multi-entity Video Transformer (MV-Former) improves the integration of spatial information in the temporal pipeline by representing each frame as a group of entities or tokens. It builds on features from self-supervised ViTs and applies several strategies, including a Learnable Spatial Token Pooling strategy that extracts features for multiple salient regions per frame, to maximize the utility of those features without fine-tuning the ViT backbone.
  • results: MV-Former outperforms prior self-supervised methods and even some prior work that uses additional supervision or training data; adding Kinetics-400 pre-training data yields a further performance boost. The MV-Former code is available on GitHub.
    Abstract The area of temporally fine-grained video representation learning aims to generate frame-by-frame representations for temporally dense tasks. In this work, we advance the state-of-the-art for this area by re-examining the design of transformer architectures for video representation learning. A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline by representing multiple entities per frame. Prior works use late fusion architectures that reduce frames to a single dimensional vector before any cross-frame information is shared, while our method represents each frame as a group of entities or tokens. Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks. MV-Former leverages image features from self-supervised ViTs, and employs several strategies to maximize the utility of the extracted features while also avoiding the need to fine-tune the complex ViT backbone. This includes a Learnable Spatial Token Pooling strategy, which is used to identify and extract features for multiple salient regions per frame. Our experiments show that MV-Former not only outperforms previous self-supervised methods, but also surpasses some prior works that use additional supervision or training data. When combined with additional pre-training data from Kinetics-400, MV-Former achieves a further performance boost. The code for MV-Former is available at https://github.com/facebookresearch/video_rep_learning.
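
A minimal sketch of the idea behind learnable spatial token pooling: a small set of learnable queries cross-attends to frozen per-frame ViT patch tokens, producing several entity tokens per frame. The exact MV-Former pooling mechanism, backbone, and temporal transformer are not reproduced; dimensions are assumptions.

```python
# Sketch of learnable spatial token pooling: K learnable queries cross-attend to the
# patch tokens of each frame, producing K "entity" tokens per frame. MV-Former's
# actual pooling, backbone, and temporal transformer are not reproduced here.
import torch
import torch.nn as nn

class LearnableSpatialTokenPooling(nn.Module):
    def __init__(self, dim=256, num_entities=4, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_entities, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (batch*frames, num_patches, dim) from a frozen ViT (assumed)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        entities, attn_weights = self.attn(q, patch_tokens, patch_tokens)
        return entities, attn_weights      # (BF, K, dim), attention over patches

B, T, P, D = 2, 8, 196, 256               # batch, frames, patches per frame, dim
patch_tokens = torch.randn(B * T, P, D)   # stand-in for frozen ViT features
pool = LearnableSpatialTokenPooling(dim=D, num_entities=4)
entities, _ = pool(patch_tokens)
entities = entities.reshape(B, T, 4, D)   # multiple entity tokens per frame
print(entities.shape)                     # torch.Size([2, 8, 4, 256])
```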

Zero-Shot Digital Rock Image Segmentation with a Fine-Tuned Segment Anything Model

  • paper_url: http://arxiv.org/abs/2311.10865
  • repo_url: None
  • paper_authors: Zhaoyang Ma, Xupeng He, Shuyu Sun, Bicheng Yan, Hyung Kwak, Jun Gao
  • for: This work improves the segmentation of digital rock images, enabling more detailed reservoir models that support oil and gas extraction efficiency and advance digital rock physics.
  • methods: Meta AI's Segment Anything Model (SAM) is fine-tuned for rock CT/SEM image segmentation, with parameters optimized and large-scale, low-contrast images handled explicitly to improve accuracy.
  • results: Experiments show that the fine-tuned model (RockSAM) generates high-quality masks for rock CT/SEM images, enabling efficient and accurate digital rock image analysis without extensive training or complex labelling.
    Abstract Accurate image segmentation is crucial in reservoir modelling and material characterization, enhancing oil and gas extraction efficiency through detailed reservoir models. This precision offers insights into rock properties, advancing digital rock physics understanding. However, creating pixel-level annotations for complex CT and SEM rock images is challenging due to their size and low contrast, lengthening analysis time. This has spurred interest in advanced semi-supervised and unsupervised segmentation techniques in digital rock image analysis, promising more efficient, accurate, and less labour-intensive methods. Meta AI's Segment Anything Model (SAM) revolutionized image segmentation in 2023, offering interactive and automated segmentation with zero-shot capabilities, essential for digital rock physics with limited training data and complex image features. Despite its advanced features, SAM struggles with rock CT/SEM images due to their absence in its training set and the low-contrast nature of grayscale images. Our research fine-tunes SAM for rock CT/SEM image segmentation, optimizing parameters and handling large-scale images to improve accuracy. Experiments on rock CT and SEM images show that fine-tuning significantly enhances SAM's performance, enabling high-quality mask generation in digital rock image analysis. Our results demonstrate the feasibility and effectiveness of the fine-tuned SAM model (RockSAM) for rock images, offering segmentation without extensive training or complex labelling.

WATUNet: A Deep Neural Network for Segmentation of Volumetric Sweep Imaging Ultrasound

  • paper_url: http://arxiv.org/abs/2311.10857
  • repo_url: None
  • paper_authors: Donya Khaledyan, Thomas J. Marini, Avice OConnell, Steven Meng, Jonah Kan, Galen Brennan, Yu Zhao, Timothy M. Baran, Kevin J. Parker
  • for: This work aims to improve breast cancer diagnosis from ultrasound imaging.
  • methods: Volume sweep imaging (VSI), which allows operators without specialized training to capture high-quality ultrasound images, is combined with a new segmentation model, Wavelet_Attention_UNet (WATUNet), which places wavelet gates (WGs) and attention gates (AGs) between the encoder and decoder to overcome limitations of the standard UNet.
  • results: On two datasets, the model outperforms other deep networks, reaching a Dice coefficient of 0.94 and an F1 score of 0.94 on the VSI dataset and 0.93 and 0.94 on the public BUSI dataset.
    Abstract Objective. Limited access to breast cancer diagnosis globally leads to delayed treatment. Ultrasound, an effective yet underutilized method, requires specialized training for sonographers, which hinders its widespread use. Approach. Volume sweep imaging (VSI) is an innovative approach that enables untrained operators to capture high-quality ultrasound images. Combined with deep learning, like convolutional neural networks (CNNs), it can potentially transform breast cancer diagnosis, enhancing accuracy, saving time and costs, and improving patient outcomes. The widely used UNet architecture, known for medical image segmentation, has limitations, such as vanishing gradients and a lack of multi-scale feature extraction and selective region attention. In this study, we present a novel segmentation model known as Wavelet_Attention_UNet (WATUNet). In this model, we incorporate wavelet gates (WGs) and attention gates (AGs) between the encoder and decoder instead of a simple connection to overcome the limitations mentioned, thereby improving model performance. Main results. Two datasets are utilized for the analysis. The public "Breast Ultrasound Images" (BUSI) dataset of 780 images and a VSI dataset of 3818 images. Both datasets contained segmented lesions categorized into three types: no mass, benign mass, and malignant mass. Our segmentation results show superior performance compared to other deep networks. The proposed algorithm attained a Dice coefficient of 0.94 and an F1 score of 0.94 on the VSI dataset and scored 0.93 and 0.94 on the public dataset, respectively.
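
A minimal sketch of an additive attention gate of the kind placed on skip connections in attention U-Nets. The wavelet gates and the full WATUNet architecture are not reproduced; channel sizes are illustrative assumptions.

```python
# Sketch of an additive attention gate on a skip connection (Attention U-Net style).
# WATUNet combines such attention gates with wavelet gates; only the attention-gate
# idea is illustrated here, with made-up channel sizes.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, enc_ch, dec_ch, inter_ch):
        super().__init__()
        self.w_enc = nn.Conv2d(enc_ch, inter_ch, 1)   # project encoder skip features
        self.w_dec = nn.Conv2d(dec_ch, inter_ch, 1)   # project decoder gating signal
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, enc_feat, dec_feat):
        # enc_feat and dec_feat are assumed to share spatial size here.
        attn = self.psi(torch.relu(self.w_enc(enc_feat) + self.w_dec(dec_feat)))
        return enc_feat * attn            # suppress irrelevant regions in the skip

gate = AttentionGate(enc_ch=64, dec_ch=64, inter_ch=32)
enc_feat = torch.randn(1, 64, 56, 56)     # encoder skip features (toy)
dec_feat = torch.randn(1, 64, 56, 56)     # upsampled decoder features (toy)
print(gate(enc_feat, dec_feat).shape)     # torch.Size([1, 64, 56, 56])
```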

Domain Generalization of 3D Object Detection by Density-Resampling

  • paper_url: http://arxiv.org/abs/2311.10845
  • repo_url: None
  • paper_authors: Shuangzhi Li, Lei Ma, Xingyu Li
  • for: This work improves the generalizability of 3D object detection to unseen target domains under single-domain generalization (SDG).
  • methods: A physical-aware, density-based data augmentation (PDDA) method mitigates the performance loss caused by diverse point densities, and a multi-task learning strategy adds an auxiliary self-supervised 3D scene restoration task during source training to strengthen the encoder's comprehension of background and foreground details; a test-time adaptation method built on this auxiliary task further adjusts the encoder's parameters to unseen target domains during testing.
  • results: On detection tasks covering "Car", "Pedestrian", and "Cyclist", the method outperforms state-of-the-art SDG methods and, in some cases, unsupervised domain adaptation methods. The code will be released publicly.
    Abstract Point-cloud-based 3D object detection suffers from performance degradation when encountering data with novel domain gaps. To tackle it, the single-domain generalization (SDG) aims to generalize the detection model trained in a limited single source domain to perform robustly on unexplored domains. In this paper, we propose an SDG method to improve the generalizability of 3D object detection to unseen target domains. Unlike prior SDG works for 3D object detection solely focusing on data augmentation, our work introduces a novel data augmentation method and contributes a new multi-task learning strategy in the methodology. Specifically, from the perspective of data augmentation, we design a universal physical-aware density-based data augmentation (PDDA) method to mitigate the performance loss stemming from diverse point densities. From the learning methodology viewpoint, we develop a multi-task learning for 3D object detection: during source training, besides the main standard detection task, we leverage an auxiliary self-supervised 3D scene restoration task to enhance the comprehension of the encoder on background and foreground details for better recognition and detection of objects. Furthermore, based on the auxiliary self-supervised task, we propose the first test-time adaptation method for domain generalization of 3D object detection, which efficiently adjusts the encoder's parameters to adapt to unseen target domains during testing time, to further bridge domain gaps. Extensive cross-dataset experiments covering "Car", "Pedestrian", and "Cyclist" detections, demonstrate our method outperforms state-of-the-art SDG methods and even overpass unsupervised domain adaptation methods under some circumstances. The code will be made publicly available.
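
A minimal sketch of density-style point cloud augmentation: points are dropped with a probability that grows with range, emulating a sparser sensor. This only illustrates the general idea; the paper's physical-aware PDDA is formulated differently.

```python
# Sketch of density-based point cloud resampling for augmentation: drop points with
# a range-dependent probability to emulate a sparser sensor. This only illustrates
# the general idea; the paper's physical-aware PDDA is formulated differently.
import numpy as np

def range_dependent_downsample(points, keep_at_origin=1.0, keep_at_far=0.3,
                               max_range=70.0, rng=None):
    """points: (N, 3+) array; returns a subsampled copy with distance-dependent density."""
    rng = rng or np.random.default_rng()
    dist = np.linalg.norm(points[:, :3], axis=1)
    # Linearly interpolate the keep probability between near and far ranges.
    keep_prob = keep_at_origin + (keep_at_far - keep_at_origin) * np.clip(
        dist / max_range, 0.0, 1.0)
    mask = rng.random(points.shape[0]) < keep_prob
    return points[mask]

rng = np.random.default_rng(0)
cloud = rng.uniform(-70, 70, size=(100_000, 3))       # toy LiDAR sweep
sparse = range_dependent_downsample(cloud, rng=rng)
print(cloud.shape[0], "->", sparse.shape[0], "points after density resampling")
```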

SelfEval: Leveraging the discriminative nature of generative models for evaluation

  • paper_url: http://arxiv.org/abs/2311.10708
  • repo_url: None
  • paper_authors: Sai Saketh Rambhatla, Ishan Misra
  • for: This paper presents an automated way to evaluate text-to-image generative models on multimodal text-image discriminative tasks.
  • methods: SelfEval uses the generative model itself to compute the likelihood of real images given text prompts, making the generative model directly applicable to discriminative tasks.
  • results: SelfEval automatically assesses performance on attribute binding, color recognition, counting, shape recognition, and spatial understanding, agrees closely with gold-standard human evaluations of text faithfulness, and shows that generative models achieve competitive performance on the Winoground image-score task.
    Abstract In this work, we show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities in a completely automated manner. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, making the generative model directly applicable to discriminative tasks. Using SelfEval, we repurpose standard datasets created for evaluating multimodal text-image discriminative models to evaluate generative models in a fine-grained manner: assessing their performance on attribute binding, color recognition, counting, shape recognition, spatial understanding. To the best of our knowledge SelfEval is the first automated metric to show a high degree of agreement for measuring text-faithfulness with the gold-standard human evaluations across multiple models and benchmarks. Moreover, SelfEval enables us to evaluate generative models on challenging tasks such as Winoground image-score where they demonstrate competitive performance to discriminative models. We also show severe drawbacks of standard automated metrics such as CLIP-score to measure text faithfulness on benchmarks such as DrawBench, and how SelfEval sidesteps these issues. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.

Multimodal Representation Learning by Alternating Unimodal Adaptation

  • paper_url: http://arxiv.org/abs/2311.10707
  • repo_url: None
  • paper_authors: Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, Huaxiu Yao
  • for: addressing the challenge of dominant modalities in multimodal learning, improving performance in scenarios with complete and missing modalities
  • methods: alternating unimodal learning, shared head with continuous optimization, gradient modification mechanism for preventing information loss, test-time uncertainty-based model fusion
  • results: superior performance compared to prior approaches in extensive experiments on five diverse datasets
    Abstract Multimodal learning, which integrates data from diverse sensory modes, plays a pivotal role in artificial intelligence. However, existing multimodal learning methods often struggle with challenges where some modalities appear more dominant than others during multimodal learning, resulting in suboptimal performance. To address this challenge, we propose MLA (Multimodal Learning with Alternating Unimodal Adaptation). MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process, thereby minimizing interference between modalities. Simultaneously, it captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. This optimization process is controlled by a gradient modification mechanism to prevent the shared head from losing previously acquired information. During the inference phase, MLA utilizes a test-time uncertainty-based model fusion mechanism to integrate multimodal information. Extensive experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities. These experiments demonstrate the superiority of MLA over competing prior approaches.

SplatArmor: Articulated Gaussian splatting for animatable humans from monocular RGB videos

  • paper_url: http://arxiv.org/abs/2311.10812
  • repo_url: None
  • paper_authors: Rohit Jena, Ganesh Subramanian Iyer, Siddharth Choudhary, Brandon Smith, Pratik Chaudhari, James Gee
  • for: This paper proposes a method for recovering detailed, animatable human models from monocular RGB videos for controllable human synthesis.
  • methods: The human is parameterized as a set of 3D Gaussians in a canonical space, with articulation defined by extending the skinning of the underlying SMPL geometry to arbitrary locations in that space; an SE(3) field captures the pose-dependent location and anisotropy of the Gaussians, and a neural color field provides color regularization and 3D supervision for their precise positioning.
  • results: The method produces compelling results on the ZJU MoCap and People Snapshot datasets, showing that Gaussian splatting is an attractive alternative to neural rendering: it leverages a rasterization primitive with forward skinning and avoids the non-differentiability, optimization challenges, and inverse-skinning ambiguities of such approaches.
    Abstract We propose SplatArmor, a novel approach for recovering detailed and animatable human models by `armoring' a parameterized body model with 3D Gaussians. Our approach represents the human as a set of 3D Gaussians within a canonical space, whose articulation is defined by extending the skinning of the underlying SMPL geometry to arbitrary locations in the canonical space. To account for pose-dependent effects, we introduce a SE(3) field, which allows us to capture both the location and anisotropy of the Gaussians. Furthermore, we propose the use of a neural color field to provide color regularization and 3D supervision for the precise positioning of these Gaussians. We show that Gaussian splatting provides an interesting alternative to neural rendering based methods by leverging a rasterization primitive without facing any of the non-differentiability and optimization challenges typically faced in such approaches. The rasterization paradigms allows us to leverage forward skinning, and does not suffer from the ambiguities associated with inverse skinning and warping. We show compelling results on the ZJU MoCap and People Snapshot datasets, which underscore the effectiveness of our method for controllable human synthesis.

SpACNN-LDVAE: Spatial Attention Convolutional Latent Dirichlet Variational Autoencoder for Hyperspectral Pixel Unmixing

  • paper_url: http://arxiv.org/abs/2311.10701
  • repo_url: None
  • paper_authors: Soham Chitnis, Kiran Mantripragada, Faisal Z. Qureshi
  • for: The paper proposes a method for hyperspectral unmixing, which aims to separate the pure spectral signals of underlying materials (endmembers) and their proportions (abundances) in a hyperspectral image (HSI).
  • methods: The proposed method builds upon the Latent Dirichlet Variational Autoencoder (LDVAE) and incorporates an isotropic convolutional neural network (CNN) encoder with spatial attention to leverage the spatial information present in the HSI.
  • results: The proposed method was evaluated on four datasets (Samson, Hydice Urban, Cuprite, and OnTech-HSI-Syn-21) and showed improvement in endmember extraction and abundance estimation by incorporating spatial information. The model was also trained on synthetic data and evaluated on real-world data for the Cuprite dataset, demonstrating the transfer learning paradigm.
    Abstract The Hyperspectral Unxming problem is to find the pure spectral signal of the underlying materials (endmembers) and their proportions (abundances). The proposed method builds upon the recently proposed method, Latent Dirichlet Variational Autoencoder (LDVAE). It assumes that abundances can be encoded as Dirichlet Distributions while mixed pixels and endmembers are represented by Multivariate Normal Distributions. However, LDVAE does not leverage spatial information present in an HSI; we propose an Isotropic CNN encoder with spatial attention to solve the hyperspectral unmixing problem. We evaluated our model on Samson, Hydice Urban, Cuprite, and OnTech-HSI-Syn-21 datasets. Our model also leverages the transfer learning paradigm for Cuprite Dataset, where we train the model on synthetic data and evaluate it on real-world data. We are able to observe the improvement in the results for the endmember extraction and abundance estimation by incorporating the spatial information. Code can be found at https://github.com/faisalqureshi/cnn-ldvae

Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation

  • paper_url: http://arxiv.org/abs/2311.10696
  • repo_url: None
  • paper_authors: Xiaoyang Chen, Hao Zheng, Yuemeng Li, Yuncong Ma, Liang Ma, Hongming Li, Yong Fan
  • for: This work develops a versatile medical image segmentation model applicable to imaging data collected with diverse equipment and protocols, easing model deployment and maintenance.
  • methods: A cost-efficient approach harnesses readily available data with partially or even sparsely annotated segmentation labels, using model self-disambiguation, prior knowledge incorporation, and imbalance mitigation to handle inconsistently labeled data from multiple sources.
  • results: On a multi-modal abdominal organ segmentation dataset compiled from eight sources, the method outperforms alternative state-of-the-art methods, showing its potential for making better use of existing annotated data and reducing the annotation effort needed for new data.
    Abstract A versatile medical image segmentation model applicable to imaging data collected with diverse equipment and protocols can facilitate model deployment and maintenance. However, building such a model typically requires a large, diverse, and fully annotated dataset, which is rarely available due to the labor-intensive and costly data curation. In this study, we develop a cost-efficient method by harnessing readily available data with partially or even sparsely annotated segmentation labels. We devise strategies for model self-disambiguation, prior knowledge incorporation, and imbalance mitigation to address challenges associated with inconsistently labeled data from various sources, including label ambiguity and imbalances across modalities, datasets, and segmentation labels. Experimental results on a multi-modal dataset compiled from eight different sources for abdominal organ segmentation have demonstrated our method's effectiveness and superior performance over alternative state-of-the-art methods, highlighting its potential for optimizing the use of existing annotated data and reducing the annotation efforts for new data to further enhance model capability.

3D-TexSeg: Unsupervised Segmentation of 3D Texture using Mutual Transformer Learning

  • paper_url: http://arxiv.org/abs/2311.10651
  • repo_url: None
  • paper_authors: Iyyakutti Iyappan Ganapathi, Fayaz Ali, Sajid Javed, Syed Sadaf Ali, Naoufel Werghi
  • for: This work proposes an unsupervised method for segmenting 3D texture on the mesh manifold.
  • methods: A mutual transformer-based system comprising a label generator and a cleaner takes geometric image representations of mesh surface facets and labels them as textured or non-textured through an iterative mutual learning scheme.
  • results: On three publicly available datasets with diverse texture patterns, the framework outperforms standard and state-of-the-art unsupervised techniques and competes reasonably with supervised methods.
    Abstract Analysis of the 3D Texture is indispensable for various tasks, such as retrieval, segmentation, classification, and inspection of sculptures, knitted fabrics, and biological tissues. A 3D texture is a locally repeated surface variation independent of the surface's overall shape and can be determined using the local neighborhood and its characteristics. Existing techniques typically employ computer vision techniques that analyze a 3D mesh globally, derive features, and then utilize the obtained features for retrieval or classification. Several traditional and learning-based methods exist in the literature, however, only a few are on 3D texture, and nothing yet, to the best of our knowledge, on the unsupervised schemes. This paper presents an original framework for the unsupervised segmentation of the 3D texture on the mesh manifold. We approach this problem as binary surface segmentation, partitioning the mesh surface into textured and non-textured regions without prior annotation. We devise a mutual transformer-based system comprising a label generator and a cleaner. The two models take geometric image representations of the surface mesh facets and label them as texture or non-texture across an iterative mutual learning scheme. Extensive experiments on three publicly available datasets with diverse texture patterns demonstrate that the proposed framework outperforms standard and SOTA unsupervised techniques and competes reasonably with supervised methods.

Self-trained Panoptic Segmentation

  • paper_url: http://arxiv.org/abs/2311.10648
  • repo_url: None
  • paper_authors: Shourya Verma
  • for: This work develops an embedding-based self-supervised panoptic segmentation framework that leverages synthetic and unlabelled data through self-training.
  • methods: Pseudo-labels are generated by self-training in a synthetic-to-real domain adaptation setting, avoiding the computationally expensive, task-specific proposal-based transformer architectures used by existing self-supervised panoptic segmentation methods.
  • results: The approach improves the performance of panoptic segmentation models in both the synthetic and real domains and adapts effectively between them.
    Abstract Panoptic segmentation is an important computer vision task which combines semantic and instance segmentation. It plays a crucial role in domains of medical image analysis, self-driving vehicles, and robotics by providing a comprehensive understanding of visual environments. Traditionally, deep learning panoptic segmentation models have relied on dense and accurately annotated training data, which is expensive and time consuming to obtain. Recent advancements in self-supervised learning approaches have shown great potential in leveraging synthetic and unlabelled data to generate pseudo-labels using self-training to improve the performance of instance and semantic segmentation models. The three available methods for self-supervised panoptic segmentation use proposal-based transformer architectures which are computationally expensive, complicated and engineered for specific tasks. The aim of this work is to develop a framework to perform embedding-based self-supervised panoptic segmentation using self-training in a synthetic-to-real domain adaptation problem setting.

Astronomical Images Quality Assessment with Automated Machine Learning

  • paper_url: http://arxiv.org/abs/2311.10617
  • repo_url: None
  • paper_authors: Olivier Parisot, Pierrick Bruneau, Patrik Hitzelberger
  • for: This paper explores image quality assessment for electronically assisted astronomy.
  • methods: Automated Machine Learning is used to build a dedicated model for rating the quality of astronomical images.
  • results: The model automatically rates the quality of captured deep-sky images.
    Abstract Electronically Assisted Astronomy consists in capturing deep sky images with a digital camera coupled to a telescope to display views of celestial objects that would have been invisible through direct observation. This practice generates a large quantity of data, which may then be enhanced with dedicated image editing software after observation sessions. In this study, we show how Image Quality Assessment can be useful for automatically rating astronomical images, and we also develop a dedicated model by using Automated Machine Learning.

CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification

  • paper_url: http://arxiv.org/abs/2311.10605
  • repo_url: None
  • paper_authors: Yiyu Chen, Zheyi Fan, Zhaoru Chen, Yixuan Zhu
  • for: Person re-identification (re-ID) is a challenging task that aims to learn discriminative features for person retrieval.
  • methods: A novel camera-aware (CA) Jaccard distance leverages camera information to make the Jaccard distance more reliable: camera-aware k-reciprocal nearest neighbors (CKRNNs) are found on intra-camera and inter-camera ranking lists to improve the reliability of relevant neighbors and guarantee the contribution of inter-camera samples, while camera-aware local query expansion (CLQE) exploits camera variation as a strong constraint to mine reliable samples and assign them higher weights in the overlap.
  • results: CA-Jaccard is a simple yet effective distance metric that improves the reliability of person re-ID methods at low computational cost; experiments demonstrate its effectiveness.
    Abstract Person re-identification (re-ID) is a challenging task that aims to learn discriminative features for person retrieval. In person re-ID, Jaccard distance is a widely used distance metric, especially in re-ranking and clustering scenarios. However, we discover that camera variation has a significant negative impact on the reliability of Jaccard distance. In particular, Jaccard distance calculates the distance based on the overlap of relevant neighbors. Due to camera variation, intra-camera samples dominate the relevant neighbors, which reduces the reliability of the neighbors by introducing intra-camera negative samples and excluding inter-camera positive samples. To overcome this problem, we propose a novel camera-aware Jaccard (CA-Jaccard) distance that leverages camera information to enhance the reliability of Jaccard distance. Specifically, we introduce camera-aware k-reciprocal nearest neighbors (CKRNNs) to find k-reciprocal nearest neighbors on the intra-camera and inter-camera ranking lists, which improves the reliability of relevant neighbors and guarantees the contribution of inter-camera samples in the overlap. Moreover, we propose a camera-aware local query expansion (CLQE) to exploit camera variation as a strong constraint to mine reliable samples in relevant neighbors and assign these samples higher weights in overlap to further improve the reliability. Our CA-Jaccard distance is simple yet effective and can serve as a general distance metric for person re-ID methods with high reliability and low computational cost. Extensive experiments demonstrate the effectiveness of our method.
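
A minimal sketch of the plain k-reciprocal-neighbor Jaccard distance that CA-Jaccard builds on: two samples are close when their k-reciprocal neighbor sets overlap strongly. The camera-aware components (CKRNNs and CLQE) are not implemented here, and the features are toy data.

```python
# Sketch of the plain k-reciprocal Jaccard distance that CA-Jaccard extends.
# The camera-aware parts (CKRNNs, CLQE) are not implemented; this only shows the
# neighbor-set overlap idea on toy features.
import numpy as np

def k_reciprocal_sets(features, k=5):
    """Return the k-reciprocal neighbor set of every sample."""
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    knn = np.argsort(dist, axis=1)[:, :k + 1]          # includes self
    sets = []
    for i in range(len(features)):
        recip = [j for j in knn[i] if i in knn[j]]     # keep mutual neighbors only
        sets.append(set(recip))
    return sets

def jaccard_distance(sets):
    n = len(sets)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            inter = len(sets[i] & sets[j])
            union = len(sets[i] | sets[j])
            d[i, j] = 1.0 - inter / union if union else 1.0
    return d

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (10, 16)), rng.normal(3, 1, (10, 16))])  # 2 identities
dist = jaccard_distance(k_reciprocal_sets(feats, k=5))
print("mean intra-identity distance:", dist[:10, :10].mean())
print("mean inter-identity distance:", dist[:10, 10:].mean())
```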

Multimodal Indoor Localization Using Crowdsourced Radio Maps

  • paper_url: http://arxiv.org/abs/2311.10601
  • repo_url: None
  • paper_authors: Zhaoguang Yi, Xiangyu Wen, Qiyue Xia, Peize Li, Francisco Zampella, Firas Alsehly, Chris Xiaoxuan Lu
  • for: This paper replaces the building floor plans used by traditional indoor positioning systems (IPS) with crowdsourced radio maps generated by smartphones and WiFi-enabled robots.
  • methods: A new framework combines an uncertainty-aware neural network model for WiFi localization with a bespoke Bayesian fusion technique to cope with the inaccuracies and sparse coverage of crowdsourced radio maps.
  • results: Extensive evaluations on multiple real-world sites show roughly a 25% improvement over the best baseline.
    Abstract Indoor Positioning Systems (IPS) traditionally rely on odometry and building infrastructures like WiFi, often supplemented by building floor plans for increased accuracy. However, the limitation of floor plans in terms of availability and timeliness of updates challenges their wide applicability. In contrast, the proliferation of smartphones and WiFi-enabled robots has made crowdsourced radio maps - databases pairing locations with their corresponding Received Signal Strengths (RSS) - increasingly accessible. These radio maps not only provide WiFi fingerprint-location pairs but encode movement regularities akin to the constraints imposed by floor plans. This work investigates the possibility of leveraging these radio maps as a substitute for floor plans in multimodal IPS. We introduce a new framework to address the challenges of radio map inaccuracies and sparse coverage. Our proposed system integrates an uncertainty-aware neural network model for WiFi localization and a bespoken Bayesian fusion technique for optimal fusion. Extensive evaluations on multiple real-world sites indicate a significant performance enhancement, with results showing ~ 25% improvement over the best baseline
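
A minimal baseline sketch of radio-map-based localization: k-nearest-neighbor regression from WiFi RSS fingerprints to 2D position. The paper's uncertainty-aware neural network and Bayesian fusion are not reproduced; the radio map below is synthesized from a simple path-loss model purely for illustration.

```python
# Baseline sketch of WiFi fingerprinting with a crowdsourced radio map: kNN regression
# from RSS vectors to 2D positions. The paper's uncertainty-aware network and Bayesian
# fusion are not shown; the radio map below is synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_points, n_aps = 500, 8
positions = rng.uniform(0, 50, size=(n_points, 2))          # crowdsourced locations (m)
ap_xy = rng.uniform(0, 50, size=(n_aps, 2))                  # toy access-point layout

# Simple log-distance path-loss model to synthesise RSS fingerprints (dBm).
d = np.linalg.norm(positions[:, None] - ap_xy[None, :], axis=-1) + 1.0
rss = -40 - 20 * np.log10(d) + rng.normal(0, 2, size=d.shape)

model = KNeighborsRegressor(n_neighbors=5, weights="distance")
model.fit(rss[:400], positions[:400])                        # build the "radio map"

pred = model.predict(rss[400:])                              # localize held-out queries
err = np.linalg.norm(pred - positions[400:], axis=1)
print("median localization error (m):", float(np.median(err)))
```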

Détection d’objets célestes dans des images astronomiques par IA explicable

  • paper_url: http://arxiv.org/abs/2311.10592
  • repo_url: None
  • paper_authors: Olivier Parisot, Mahmoud Jaziri
  • for: This study automatically detects the presence and position of targeted celestial objects (such as galaxies, nebulae, or globular clusters) in captured deep-sky images.
  • methods: The study uses an approach based on explainable Artificial Intelligence to detect the captured objects.
  • results: The approach automatically detects whether targeted celestial objects are present and where they are located, with explainable outputs.
    Abstract Amateur and professional astronomers can easily capture a large number of deep sky images with recent smart telescopes. However, afterwards verification is still required to check whether the celestial objects targeted are actually visible in the images produced. Depending on the magnitude of the targets, the observation conditions and the time during which the data is captured, it is possible that only stars are present in the images. In this study, we propose an approach based on explainable Artificial Intelligence to automatically detect the presence and position of captured objects.

Human motion trajectory prediction using the Social Force Model for real-time and low computational cost applications

  • paper_url: http://arxiv.org/abs/2311.10582
  • repo_url: None
  • paper_authors: Oscar Gil, Alberto Sanfeliu
  • for: This paper proposes a new human motion trajectory prediction model, the Social Force Generative Adversarial Network (SoFGAN), for human-robot collaboration tasks such as accompanying, guiding, or approaching people.
  • methods: A Generative Adversarial Network (GAN) is combined with the Social Force Model (SFM) to generate different plausible people trajectories while reducing collisions in a scene, and a Conditional Variational Autoencoder (CVAE) module is added to emphasize destination learning.
  • results: The method makes more accurate predictions on the UCY and BIWI datasets than most state-of-the-art models and reduces collisions compared with other approaches; real-life experiments show it runs in real time without a GPU, producing good-quality predictions at low computational cost.
    Abstract Human motion trajectory prediction is a very important functionality for human-robot collaboration, specifically in accompanying, guiding, or approaching tasks, but also in social robotics, self-driving vehicles, or security systems. In this paper, a novel trajectory prediction model, Social Force Generative Adversarial Network (SoFGAN), is proposed. SoFGAN uses a Generative Adversarial Network (GAN) and Social Force Model (SFM) to generate different plausible people trajectories reducing collisions in a scene. Furthermore, a Conditional Variational Autoencoder (CVAE) module is added to emphasize the destination learning. We show that our method is more accurate in making predictions in UCY or BIWI datasets than most of the current state-of-the-art models and also reduces collisions in comparison to other approaches. Through real-life experiments, we demonstrate that the model can be used in real-time without GPU's to perform good quality predictions with a low computational cost.
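
A minimal sketch of one Social Force Model update: each pedestrian is attracted toward its goal and repelled by nearby pedestrians, and velocities are integrated forward. SoFGAN wraps forces of this kind inside a GAN/CVAE pipeline, none of which is shown here; the force parameters are illustrative assumptions.

```python
# Sketch of one Social Force Model (SFM) step: goal attraction + pedestrian repulsion.
# SoFGAN combines such forces with a GAN and a CVAE; only the SFM update is shown,
# with illustrative parameter values.
import numpy as np

def social_force_step(pos, vel, goals, dt=0.1, desired_speed=1.3,
                      tau=0.5, A=2.0, B=0.3):
    """pos, vel, goals: (N, 2) arrays. Returns updated (pos, vel)."""
    # Attractive force: relax velocity toward the desired velocity pointing at the goal.
    to_goal = goals - pos
    desired_vel = desired_speed * to_goal / (np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-9)
    force = (desired_vel - vel) / tau

    # Repulsive forces between every pair of pedestrians (exponential in distance).
    diff = pos[:, None] - pos[None, :]                      # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1) + 1e-9
    repulsion = A * np.exp(-dist / B)[:, :, None] * diff / dist[:, :, None]
    np.fill_diagonal(repulsion[:, :, 0], 0.0)               # no self-repulsion
    np.fill_diagonal(repulsion[:, :, 1], 0.0)
    force = force + repulsion.sum(axis=1)

    vel = vel + force * dt
    return pos + vel * dt, vel

pos = np.array([[0.0, 0.0], [0.5, 0.1]])
vel = np.zeros_like(pos)
goals = np.array([[10.0, 0.0], [10.0, 1.0]])
for _ in range(5):
    pos, vel = social_force_step(pos, vel, goals)
print("positions after 5 steps:\n", pos)
```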

SSB: Simple but Strong Baseline for Boosting Performance of Open-Set Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2311.10572
  • repo_url: None
  • paper_authors: Yue Fan, Anna Kukleva, Dengxin Dai, Bernt Schiele
  • for: This paper studies open-set semi-supervised learning (SSL), where unlabeled data contain outliers from novel categories, with the goal of both correctly classifying inliers and detecting outliers.
  • methods: The proposed Simple but Strong Baseline (SSB) improves inlier classification by incorporating high-confidence pseudo-labeled data, uses non-linear transformations to separate the features used for inlier classification and outlier detection within a multi-task learning framework, and adds pseudo-negative mining to further boost outlier detection.
  • results: SSB greatly improves both inlier classification and outlier detection, outperforming existing methods by a large margin.
    Abstract Semi-supervised learning (SSL) methods effectively leverage unlabeled data to improve model generalization. However, SSL models often underperform in open-set scenarios, where unlabeled data contain outliers from novel categories that do not appear in the labeled set. In this paper, we study the challenging and realistic open-set SSL setting, where the goal is to both correctly classify inliers and to detect outliers. Intuitively, the inlier classifier should be trained on inlier data only. However, we find that inlier classification performance can be largely improved by incorporating high-confidence pseudo-labeled data, regardless of whether they are inliers or outliers. Also, we propose to utilize non-linear transformations to separate the features used for inlier classification and outlier detection in the multi-task learning framework, preventing adverse effects between them. Additionally, we introduce pseudo-negative mining, which further boosts outlier detection performance. The three ingredients lead to what we call Simple but Strong Baseline (SSB) for open-set SSL. In experiments, SSB greatly improves both inlier classification and outlier detection performance, outperforming existing methods by a large margin. Our code will be released at https://github.com/YUE-FAN/SSB.

Phase Guided Light Field for Spatial-Depth High Resolution 3D Imaging

  • paper_url: http://arxiv.org/abs/2311.10568
  • repo_url: None
  • paper_authors: Geyou Zhang, Ce Zhu, Kai Liu, Yipeng Liu
  • for: This work improves both the spatial resolution and the depth accuracy of single-shot light field cameras.
  • methods: An optical projector projects a group of high-frequency phase-shifted sinusoid patterns, and a phase-guided light field algorithm is applied: a deformed cone model calibrates the axial aberrations of the main lens, a phase-guided sum of absolute differences robustly matches neighboring lenslets over the wrapped phases, and a reorganization strategy based on a virtual camera reconstructs 3D point clouds with high spatial and depth resolution.
  • results: Compared with state-of-the-art active light field methods, the approach reconstructs 3D point clouds at a spatial resolution of 1280×720, a 10× increase, while maintaining the same high depth resolution and needing only a single group of high-frequency patterns.
    Abstract On 3D imaging, light field cameras typically are of single shot, and however, they heavily suffer from low spatial resolution and depth accuracy. In this paper, by employing an optical projector to project a group of single high-frequency phase-shifted sinusoid patterns, we propose a phase guided light field algorithm to significantly improve both the spatial and depth resolutions for off-the-shelf light field cameras. First, for correcting the axial aberrations caused by the main lens of our light field camera, we propose a deformed cone model to calibrate our structured light field system. Second, over wrapped phases computed from patterned images, we propose a stereo matching algorithm, i.e. phase guided sum of absolute difference, to robustly obtain the correspondence for each pair of neighbored two lenslets. Finally, by introducing a virtual camera according to the basic geometrical optics of light field imaging, we propose a reorganization strategy to reconstruct 3D point clouds with spatial-depth high resolution. Experimental results show that, compared with the state-of-the-art active light field methods, the proposed reconstructs 3D point clouds with a spatial resolution of 1280$\times$720 with factors 10$\times$ increased, while maintaining the same high depth resolution and needing merely a single group of high-frequency patterns.
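
A minimal sketch of N-step phase shifting: given N images of a sinusoid pattern captured with known phase shifts, the wrapped phase at each pixel is recovered with an arctangent. The deformed cone calibration, phase-guided lenslet matching, and point-cloud reorganization of the paper are not shown.

```python
# Sketch of N-step phase-shifting profilometry: recover the wrapped phase per pixel
# from N images of a phase-shifted sinusoid. The paper's calibration, phase-guided
# lenslet matching, and point-cloud reorganization are not reproduced here.
import numpy as np

def wrapped_phase(images, shifts):
    """images: (N, H, W) captured intensities; shifts: (N,) phase shifts in radians."""
    images = np.asarray(images, dtype=float)
    shifts = np.asarray(shifts, dtype=float)
    num = (images * np.sin(shifts)[:, None, None]).sum(axis=0)
    den = (images * np.cos(shifts)[:, None, None]).sum(axis=0)
    return np.arctan2(-num, den)          # wrapped phase in (-pi, pi]

# Simulate a 4-step acquisition of a fringe pattern with a known phase map.
H, W = 64, 64
true_phase = np.tile(np.linspace(0, 6 * np.pi, W), (H, 1))   # several fringe periods
shifts = np.array([0, np.pi / 2, np.pi, 3 * np.pi / 2])
images = [0.5 + 0.4 * np.cos(true_phase + s) for s in shifts]

est = wrapped_phase(images, shifts)
err = np.angle(np.exp(1j * (est - true_phase)))              # compare modulo 2*pi
print("max abs phase error (rad):", float(np.abs(err).max()))
```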

Archtree: on-the-fly tree-structured exploration for latency-aware pruning of deep neural networks

  • paper_url: http://arxiv.org/abs/2311.10549
  • repo_url: https://github.com/KBerghaus/class_der
  • paper_authors: Rémi Ouazan Reboul, Edouard Yvinec, Arnaud Dapogny, Kevin Bailly
  • for: This paper proposes a latency-aware structured pruning method for deep neural networks (DNNs) under hardware latency constraints.
  • methods: Archtree explores multiple candidate pruned sub-models in parallel in a tree-like fashion and estimates latency on the fly on the target hardware, so the pruned model tracks the specified latency budget more closely.
  • results: Across several DNN architectures and target hardware, Archtree preserves the original model's accuracy better while fitting the latency budget more tightly than existing state-of-the-art methods.
    Abstract Deep neural networks (DNNs) have become ubiquitous in addressing a number of problems, particularly in computer vision. However, DNN inference is computationally intensive, which can be prohibitive e.g. when considering edge devices. To solve this problem, a popular solution is DNN pruning, and more so structured pruning, where coherent computational blocks (e.g. channels for convolutional networks) are removed: as an exhaustive search of the space of pruned sub-models is intractable in practice, channels are typically removed iteratively based on an importance estimation heuristic. Recently, promising latency-aware pruning methods were proposed, where channels are removed until the network reaches a target budget of wall-clock latency pre-emptively estimated on specific hardware. In this paper, we present Archtree, a novel method for latency-driven structured pruning of DNNs. Archtree explores multiple candidate pruned sub-models in parallel in a tree-like fashion, allowing for a better exploration of the search space. Furthermore, it involves on-the-fly latency estimation on the target hardware, accounting for closer latencies as compared to the specified budget. Empirical results on several DNN architectures and target hardware show that Archtree better preserves the original model accuracy while better fitting the latency budget as compared to existing state-of-the-art methods.
    摘要 Recently, promising latency-aware pruning methods have been proposed, where channels are removed until the network reaches a target budget of wall-clock latency pre-emptively estimated on specific hardware. In this paper, we present Archtree, a novel method for latency-driven structured pruning of DNNs. Archtree explores multiple candidate pruned sub-models in parallel in a tree-like fashion, allowing for a better exploration of the search space. Furthermore, it involves on-the-fly latency estimation on the target hardware, accounting for closer latencies as compared to the specified budget.Empirical results on several DNN architectures and target hardware show that Archtree better preserves the original model accuracy while better fitting the latency budget as compared to existing state-of-the-art methods.

Joint covariance property under geometric image transformations for spatio-temporal receptive fields according to the generalized Gaussian derivative model for visual receptive fields

  • paper_url: http://arxiv.org/abs/2311.10543
  • repo_url: None
  • paper_authors: Tony Lindeberg
  • for: 本研究探讨了自然图像变换对视觉操作的影响,尤其是在计算机视觉和生物视觉中。
  • methods: 本文使用了geometry image transformations的covariance性质来表达图像操作的稳定性和高层次视觉操作的 invariancy。
  • results: 本文提出了一种joint covariance property,该性质可以描述不同类型的图像变换如何相互交互,并且提供了匹配输出与下游spatio-temporal receptive fields的参数需要如何变换以实现图像操作的稳定性。
    Abstract The influence of natural image transformations on receptive field responses is crucial for modelling visual operations in computer vision and biological vision. In this regard, covariance properties with respect to geometric image transformations in the earliest layers of the visual hierarchy are essential for expressing robust image operations and for formulating invariant visual operations at higher levels. This paper defines and proves a joint covariance property under compositions of spatial scaling transformations, spatial affine transformations, Galilean transformations and temporal scaling transformations, which makes it possible to characterize how different types of image transformations interact with each other. Specifically, the derived relations show how the receptive field parameters need to be transformed, in order to match the output from spatio-temporal receptive fields with the underlying spatio-temporal image transformations.
    摘要 自然图像变换对感受野响应的影响,是计算机视觉与生物视觉中建模视觉运算的关键问题。在这一背景下,视觉层级最早阶段对几何图像变换的协变性质,对于表达稳健的图像运算以及在更高层级构建不变的视觉运算至关重要。本文定义并证明了在空间尺度变换、空间仿射变换、伽利略变换与时间尺度变换的复合作用下的联合协变性质,从而能够刻画不同类型的图像变换如何相互作用。具体而言,所导出的关系说明了感受野参数需要如何变换,才能使时空感受野的输出与底层的时空图像变换相匹配。
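
As a small worked illustration of what such a covariance property looks like, the block below spells out only the purely spatial scaling case for Gaussian smoothing: rescaling the image by a factor S is matched by rescaling the scale parameter by S^2. The paper's result is considerably more general, covering compositions with spatial affine, Galilean and temporal scaling transformations, which are not reproduced here.

```latex
% Spatial scaling case only, as an illustration of the covariance idea:
% if the image is rescaled, f'(x') = f(x) with x' = S x for a scaling factor S > 0,
% then Gaussian-smoothed representations match provided the scale parameter
% is transformed as s' = S^2 s:
\begin{align}
  L'(x'; s') &= \int_{\mathbb{R}^2} f'(x' - \xi')\, g(\xi'; s')\, d\xi' = L(x; s), \\
  \text{with}\quad g(\xi; s) &= \frac{1}{2\pi s}\, e^{-\|\xi\|^2 / (2 s)},
  \qquad x' = S x, \quad s' = S^2 s .
\end{align}
```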

Segment Anything Model with Uncertainty Rectification for Auto-Prompting Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.10529
  • repo_url: https://github.com/YichiZhang98/UR-SAM
  • paper_authors: Yichi Zhang, Shiyao Hu, Chen Jiang, Yuan Cheng, Yuan Qi
  • for: 提高自动提示医疗图像分割的可靠性和Robustness
  • methods: 使用提高分割性能的指示map和不确定性修正模块
  • results: 在两个公共的3D医疗图像数据集上,无需额外训练或精度调整,our方法可以进一步提高分割性能,达到最高达10.7%和13.8%的 dice相似度,表明our方法具有效果和广泛的应用前提。
    Abstract The introduction of the Segment Anything Model (SAM) has marked a significant advancement in prompt-driven image segmentation. However, SAM's application to medical image segmentation requires manual prompting of target structures to obtain acceptable performance, which is still labor-intensive. Despite attempts of auto-prompting to turn SAM into a fully automatic manner, it still exhibits subpar performance and lacks of reliability in the field of medical imaging. In this paper, we propose UR-SAM, an uncertainty rectified SAM framework to enhance the robustness and reliability for auto-prompting medical image segmentation. Our method incorporates a prompt augmentation module to estimate the distribution of predictions and generate uncertainty maps, and an uncertainty-based rectification module to further enhance the performance of SAM. Extensive experiments on two public 3D medical datasets covering the segmentation of 35 organs demonstrate that without supplementary training or fine-tuning, our method further improves the segmentation performance with up to 10.7 % and 13.8 % in dice similarity coefficient, demonstrating efficiency and broad capabilities for medical image segmentation without manual prompting.
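
A rough sketch of the uncertainty-rectification idea follows: several perturbed copies of a prompt are fed to the segmentation model, the per-pixel variance of the resulting soft masks serves as an uncertainty map, and pixels above an uncertainty threshold are suppressed. The prompt perturbation, the variance-based uncertainty and the thresholding rule are assumptions made for illustration; UR-SAM's actual modules may differ.

```python
import numpy as np

def augmented_prompts(box, n=8, jitter=5, rng=None):
    """Perturb a box prompt (x0, y0, x1, y1) to obtain several prompts."""
    rng = rng or np.random.default_rng(0)
    return [np.asarray(box) + rng.integers(-jitter, jitter + 1, size=4) for _ in range(n)]

def uncertainty_rectified_mask(predict_fn, image, box, tau=0.25):
    """Toy uncertainty rectification: average masks from augmented prompts,
    estimate per-pixel uncertainty as their variance, and suppress pixels whose
    uncertainty exceeds `tau`.  `predict_fn(image, box)` is an assumed callable
    returning a soft foreground probability map in [0, 1] (e.g. from SAM)."""
    probs = np.stack([predict_fn(image, b) for b in augmented_prompts(box)])
    mean, uncertainty = probs.mean(0), probs.var(0)
    rectified = np.where(uncertainty > tau, 0.0, mean)   # drop unreliable pixels
    return (rectified > 0.5).astype(np.uint8), uncertainty
```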

Removing Adverse Volumetric Effects From Trained Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.10523
  • repo_url: None
  • paper_authors: Andreas L. Teigen, Mauhing Yip, Victor P. Hamran, Vegard Skui, Annette Stahl, Rudolf Mester
  • for: 本文探讨了神经辐射场(NeRF)在雾天环境中的应用,并提出了一种去除雾气的方法。
  • methods: 本文提出了一种基于场景全局对比度的方法,可以在生成新视角时去除雾气;此外还引入了一个新的数据集,用于测试NeRF在雾天环境中的性能。
  • results: 本文通过视频结果展示了使用NeRF渲染雾天环境中目标物体清晰视图的效果。
    Abstract While the use of neural radiance fields (NeRFs) in different challenging settings has been explored, only very recently have there been any contributions that focus on the use of NeRF in foggy environments. We argue that the traditional NeRF models are able to replicate scenes filled with fog and propose a method to remove the fog when synthesizing novel views. By calculating the global contrast of a scene, we can estimate a density threshold that, when applied, removes all visible fog. This makes it possible to use NeRF as a way of rendering clear views of objects of interest located in fog-filled environments. Additionally, to benchmark performance on such scenes, we introduce a new dataset that expands some of the original synthetic NeRF scenes through the addition of fog and natural environments. The code, dataset, and video results can be found on our project page: https://vegardskui.com/fognerf/
    摘要 尽管神经辐射场(NeRF)已在多种困难场景下得到探索,但直到最近才出现针对雾天环境的相关工作。我们认为传统的NeRF模型能够重建充满雾气的场景,并提出一种在合成新视角时去除雾气的方法:通过计算场景的全局对比度,可以估计一个密度阈值,应用该阈值即可去除所有可见雾气,从而使NeRF能够渲染出位于雾天环境中的目标物体的清晰视图。此外,为评估此类场景下的性能,我们引入了一个新的数据集,通过加入雾气与自然环境扩展了部分原始合成NeRF场景。代码、数据集与视频结果见项目主页:https://vegardskui.com/fognerf/
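
A minimal sketch of the defogging idea described above: densities below a threshold are treated as fog and zeroed out before volume rendering, and the threshold is chosen from the global contrast of the rendered view. The selection rule (maximising RMS contrast over a candidate list) is an assumption for illustration, not the paper's exact procedure, and `render_fn` is a hypothetical re-rendering callback.

```python
import numpy as np

def global_contrast(image):
    """RMS contrast of a rendered grayscale view with values in [0, 1]."""
    return float(np.asarray(image).std())

def defogged_density(sigma, threshold):
    """Suppress low-density samples (assumed to be fog) before volume rendering."""
    return np.where(sigma < threshold, 0.0, sigma)

def pick_threshold(render_fn, candidate_thresholds):
    """Choose the density threshold that maximises global contrast of the view;
    `render_fn(t)` is assumed to re-render the scene with densities below t removed."""
    scores = [global_contrast(render_fn(t)) for t in candidate_thresholds]
    return candidate_thresholds[int(np.argmax(scores))]
```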

Mind the map! Accounting for existing map information when estimating online HDMaps from sensor data

  • paper_url: http://arxiv.org/abs/2311.10517
  • repo_url: https://github.com/hustvl/maptr
  • paper_authors: Rémy Sun, Li Yang, Diane Lingrand, Frédéric Precioso
  • for: 提高在感知器上的高清晰地图(HDMap)估计,以便减轻自动驾驶系统中HDMap的手动获取成本,并可能扩展其使用范围。
  • methods: 利用已有地图改进在线HDMap估计。提出3种有用的现有地图类型(极简、含噪、过时),并介绍MapEX框架:该框架将地图元素编码为查询token,并改进了用于训练经典查询式地图估计模型的匹配算法。
  • results: 在nuScenes数据集上实现了显著的改进,比如使用噪声地图时,MapEX比MapTRv2探测器提高38%,比当前SOTA提高16%。
    Abstract Online High Definition Map (HDMap) estimation from sensors offers a low-cost alternative to manually acquired HDMaps. As such, it promises to lighten costs for already HDMap-reliant Autonomous Driving systems, and potentially even spread their use to new systems. In this paper, we propose to improve online HDMap estimation by accounting for already existing maps. We identify 3 reasonable types of useful existing maps (minimalist, noisy, and outdated). We also introduce MapEX, a novel online HDMap estimation framework that accounts for existing maps. MapEX achieves this by encoding map elements into query tokens and by refining the matching algorithm used to train classic query based map estimation models. We demonstrate that MapEX brings significant improvements on the nuScenes dataset. For instance, MapEX - given noisy maps - improves by 38% over the MapTRv2 detector it is based on and by 16% over the current SOTA.
    摘要 从传感器在线估计高清地图(HDMap)为人工采集HDMap提供了一种低成本替代方案,有望降低依赖HDMap的自动驾驶系统的成本,甚至将其推广到新系统。本文提出在估计在线HDMap时考虑已有地图信息,并归纳了三类有用的现有地图(极简、含噪、过时)。我们引入MapEX这一新的在线HDMap估计框架:它将地图元素编码为查询token,并改进了用于训练经典查询式地图估计模型的匹配算法。实验表明MapEX在nuScenes数据集上带来显著提升,例如在含噪地图条件下,MapEX较其所基于的MapTRv2检测器提升38%,较当前SOTA提升16%。

A Framework of Landsat-8 Band Selection based on UMDA for Deforestation Detection

  • paper_url: http://arxiv.org/abs/2311.10513
  • repo_url: None
  • paper_authors: Eduardo B. Neto, Paulo R. C. Pedro, Alvaro Fazenda, Fabio A. Faria
  • for: 这项研究旨在提出一种新的框架,用于监测热带雨林。
  • methods: 该研究使用分布估计算法(UMDA)从Landsat-8影像中选择光谱波段,以更好地识别毁林区域。
  • results: 实验表明,使用最佳组合(651)可以达到90%以上的准确率,并且比其他所有组合相比,效率和效果更高。
    Abstract The conservation of tropical forests is a current subject of social and ecological relevance due to their crucial role in the global ecosystem. Unfortunately, millions of hectares are deforested and degraded each year. Therefore, government or private initiatives are needed for monitoring tropical forests. In this sense, this work proposes a novel framework, which uses of distribution estimation algorithm (UMDA) to select spectral bands from Landsat-8 that yield a better representation of deforestation areas to guide a semantic segmentation architecture called DeepLabv3+. In performed experiments, it was possible to find several compositions that reach balanced accuracy superior to 90% in segment classification tasks. Furthermore, the best composition (651) found by UMDA algorithm fed the DeepLabv3+ architecture and surpassed in efficiency and effectiveness all compositions compared in this work.
    摘要 保护热带雨林是当前社会和生态领域的热点话题,因为它们在全球生态系统中扮演了关键角色。然而,每年仍有数百万公顷的雨林被毁灭和侵蚀。因此,政府或私人的倡议是必要的,以监测热带雨林。在这种情况下,本工作提出了一个新的框架,使用分布Estimation算法(UMDA)选择LandSat-8遥感器中的spectral Band,以更好地表示Deforestation区域,并用DeepLabv3+ semanticsegmentation架构进行分类。在实验中,能够找到许多compositions,其中balanced accuracy超过90%的分类任务。此外,最佳Composition(651)由UMDA算法选择,并将DeepLabv3+架构feed,在效率和效果方面超过了所有相比的Compositions。
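
The band-selection step lends itself to a compact sketch: a Univariate Marginal Distribution Algorithm (UMDA) samples binary band masks, scores them with a user-supplied `fitness` callable (standing in for training/evaluating DeepLabv3+ on the selected Landsat-8 bands), and re-estimates per-band selection probabilities from the elite individuals. Population size, elite fraction and probability clipping are illustrative choices, not the paper's settings.

```python
import numpy as np

def umda_band_selection(fitness, n_bands=7, pop_size=30, elite_frac=0.5,
                        generations=20, seed=0):
    """Univariate Marginal Distribution Algorithm over binary band masks.

    `fitness(mask)` is an assumed helper that evaluates the segmentation model
    on the bands selected by the boolean `mask` and returns balanced accuracy.
    """
    rng = np.random.default_rng(seed)
    p = np.full(n_bands, 0.5)                       # marginal selection probabilities
    best_mask, best_fit = None, -np.inf
    for _ in range(generations):
        pop = rng.random((pop_size, n_bands)) < p   # sample a population of masks
        fits = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(fits)[::-1][: int(elite_frac * pop_size)]]
        p = np.clip(elite.mean(axis=0), 0.05, 0.95)  # re-estimate the marginals
        if fits.max() > best_fit:
            best_fit, best_mask = float(fits.max()), pop[int(np.argmax(fits))]
    return best_mask, best_fit
```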

A Relay System for Semantic Image Transmission based on Shared Feature Extraction and Hyperprior Entropy Compression

  • paper_url: http://arxiv.org/abs/2311.10492
  • repo_url: None
  • paper_authors: Wannian An, Zhicheng Bao, Haotai Liang, Chen Dong, Xiaodong
  • for: 提高图像重建和修复的质量
  • methods: 使用共享特征提取技术和超先验熵压缩(HEC)技术
  • results: 相比其他最新研究方法,提出的系统具有较低的传输开销和更高的语义图像传输性能;特别是在相同条件下,其多尺度结构相似度(MS-SSIM)比对比方法高出约0.2。
    Abstract Nowadays, the need for high-quality image reconstruction and restoration is more and more urgent. However, most image transmission systems may suffer from image quality degradation or transmission interruption in the face of interference such as channel noise and link fading. To solve this problem, a relay communication network for semantic image transmission based on shared feature extraction and hyperprior entropy compression (HEC) is proposed, where the shared feature extraction technology based on Pearson correlation is proposed to eliminate partial shared feature of extracted semantic latent feature. In addition, the HEC technology is used to resist the effect of channel noise and link fading and carried out respectively at the source node and the relay node. Experimental results demonstrate that compared with other recent research methods, the proposed system has lower transmission overhead and higher semantic image transmission performance. Particularly, under the same conditions, the multi-scale structural similarity (MS-SSIM) of this system is superior to the comparison method by approximately 0.2.
    摘要 当前,对高质量图像重建与恢复的需求日益迫切。然而,面对信道噪声与链路衰落等干扰,大多数图像传输系统可能出现图像质量下降或传输中断。为解决该问题,本文提出一种基于共享特征提取与超先验熵压缩(HEC)的语义图像传输中继通信网络:其中基于皮尔逊相关性的共享特征提取技术用于消除所提取语义隐特征中的部分共享特征;同时在源节点与中继节点分别采用HEC技术,以抵抗信道噪声与链路衰落的影响。实验结果表明,与近期其他研究方法相比,所提系统具有更低的传输开销和更高的语义图像传输性能;特别是在相同条件下,其多尺度结构相似度(MS-SSIM)比对比方法高出约0.2。

FRCSyn Challenge at WACV 2024: Face Recognition Challenge in the Era of Synthetic Data

  • paper_url: http://arxiv.org/abs/2311.10476
  • repo_url: https://github.com/ndido98/frcsyn
  • paper_authors: Pietro Melzi, Ruben Tolosana, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Ivan DeAndres-Tame, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Weisong Zhao, Xiangyu Zhu, Zheyu Yan, Xiao-Yu Zhang, Jinlin Wu, Zhen Lei, Suvidha Tripathi, Mahak Kothari, Md Haider Zama, Debayan Deb, Bernardo Biesseck, Pedro Vidal, Roger Granada, Guilherme Fickel, Gustavo Führ, David Menotti, Alexander Unnervik, Anjith George, Christophe Ecabert, Hatef Otroshi Shahreza, Parsa Rahimi, Sébastien Marcel, Ioannis Sarridis, Christos Koutlis, Georgia Baltsou, Symeon Papadopoulos, Christos Diou, Nicolò Di Domenico, Guido Borghi, Lorenzo Pellegrini, Enrique Mas-Candela, Ángela Sánchez-Pérez, Andrea Atzori, Fadi Boutros, Naser Damer, Gianni Fenu, Mirko Marras
  • for: 这篇论文旨在探讨面 recognition技术在假数据 Era中的挑战,以及如何通过使用假数据来解决现有技术的限制。
  • methods: 这篇论文使用了一个国际性的 Face Recognition Challenge in the Era of Synthetic Data (FRCSyn),以探讨假数据在面 recognition技术中的应用。
  • results: 根据这篇论文的结果,使用假数据可以有效地解决面 recognition技术中的数据隐私问题、人种偏见、未经见过的场景推理、以及面部pose和 occlusion 等挑战。
    Abstract Despite the widespread adoption of face recognition technology around the world, and its remarkable performance on current benchmarks, there are still several challenges that must be covered in more detail. This paper offers an overview of the Face Recognition Challenge in the Era of Synthetic Data (FRCSyn) organized at WACV 2024. This is the first international challenge aiming to explore the use of synthetic data in face recognition to address existing limitations in the technology. Specifically, the FRCSyn Challenge targets concerns related to data privacy issues, demographic biases, generalization to unseen scenarios, and performance limitations in challenging scenarios, including significant age disparities between enrollment and testing, pose variations, and occlusions. The results achieved in the FRCSyn Challenge, together with the proposed benchmark, contribute significantly to the application of synthetic data to improve face recognition technology.
    摘要 尽管人脸识别技术已在全球广泛应用并在现有基准上表现出色,仍有一些挑战需要更细致地研究。本文概述了在WACV 2024举办的合成数据时代人脸识别挑战赛(FRCSyn)。这是首个旨在探索利用合成数据解决人脸识别技术现有局限的国际挑战赛。具体而言,FRCSyn关注数据隐私、人群偏差、对未见场景的泛化,以及在困难场景(包括注册与测试之间的显著年龄差异、姿态变化与遮挡)下的性能局限。FRCSyn挑战赛的结果及所提出的基准,对利用合成数据改进人脸识别技术具有重要意义。

End-to-end autoencoding architecture for the simultaneous generation of medical images and corresponding segmentation masks

  • paper_url: http://arxiv.org/abs/2311.10472
  • repo_url: None
  • paper_authors: Aghiles Kebaili, Jérôme Lapuyade-Lahorgue, Pierre Vera, Su Ruan
  • for: 这篇论文的目的是提出一种基于哈密顿变分自编码器(HVAE)的端到端架构,用于同时生成医学图像及其对应的分割掩码,以缓解训练数据不足的问题。
  • methods: 这篇论文使用的方法是基于HVAE的终端架构,实现更好的 posterior distribution 推测,并且与传统的Variational Autoencoders(VAE)相比,具有更高的图像生成质量。
  • results: 这篇论文的结果显示,在拥有少量数据的情况下,这个方法可以超越对抗性模型,实现更好的图像质量和精确的肿瘤标识生成。实验结果显示,这个方法在不同的医疗影像模式下都具有良好的效果。
    Abstract Despite the increasing use of deep learning in medical image segmentation, acquiring sufficient training data remains a challenge in the medical field. In response, data augmentation techniques have been proposed; however, the generation of diverse and realistic medical images and their corresponding masks remains a difficult task, especially when working with insufficient training sets. To address these limitations, we present an end-to-end architecture based on the Hamiltonian Variational Autoencoder (HVAE). This approach yields an improved posterior distribution approximation compared to traditional Variational Autoencoders (VAE), resulting in higher image generation quality. Our method outperforms generative adversarial architectures under data-scarce conditions, showcasing enhancements in image quality and precise tumor mask synthesis. We conduct experiments on two publicly available datasets, MICCAI's Brain Tumor Segmentation Challenge (BRATS), and Head and Neck Tumor Segmentation Challenge (HECKTOR), demonstrating the effectiveness of our method on different medical imaging modalities.
    摘要 尽管深度学习在医学图像分割中得到了广泛应用,但获取充足的训练数据仍然是医疗领域的挑战。为应对这些限制,数据增强技术被提出,但生成真实和多样化的医学图像和其相对应的掩码仍然是一项困难任务,特别是在训练集较少的情况下。为解决这些限制,我们提出了基于希尔伯特变量自动机(HVAE)的端到端架构。这种方法可以在训练集较少的情况下提供更好的 posterior distribution 近似,从而提高图像生成质量。我们的方法在数据缺乏情况下比generative adversarial网络(GAN)表现出色,展现出了图像质量和精准肿瘤掩码生成的改进。我们在公共数据集 BRATS 和 HECKTOR 上进行了实验,证明了我们的方法在不同的医学成像模式下的效果。

Correlation-Distance Graph Learning for Treatment Response Prediction from rs-fMRI

  • paper_url: http://arxiv.org/abs/2311.10463
  • repo_url: https://github.com/summerwings/cdgin
  • paper_authors: Xiatian Zhang, Sisi Zheng, Hubert P. H. Shum, Haozheng Zhang, Nan Song, Mingkang Song, Hongxiao Jia
  • for: 该研究旨在提高resting-state fMRI(rs-fMRI)功能连接分析的严格应用,以推断药物反应。
  • methods: 该研究提出了一种图学习框架,通过对相似度和距离基于的神经相似度进行集成,以实现更加准确地捕捉脑动态特征,并且可以更好地预测药物反应。
  • results: 实验结果表明,该方法在 Chronic pain 和 depersonalization disorder 数据集上都有出色的表现,并且超过了当前方法的表现。
    Abstract Resting-state fMRI (rs-fMRI) functional connectivity (FC) analysis provides valuable insights into the relationships between different brain regions and their potential implications for neurological or psychiatric disorders. However, specific design efforts to predict treatment response from rs-fMRI remain limited due to difficulties in understanding the current brain state and the underlying mechanisms driving the observed patterns, which limited the clinical application of rs-fMRI. To overcome that, we propose a graph learning framework that captures comprehensive features by integrating both correlation and distance-based similarity measures under a contrastive loss. This approach results in a more expressive framework that captures brain dynamic features at different scales and enables more accurate prediction of treatment response. Our experiments on the chronic pain and depersonalization disorder datasets demonstrate that our proposed method outperforms current methods in different scenarios. To the best of our knowledge, we are the first to explore the integration of distance-based and correlation-based neural similarity into graph learning for treatment response prediction.
    摘要 静息态功能磁共振成像(rs-fMRI)的功能连接(FC)分析,为不同脑区之间的关系及其与神经或精神疾病的潜在联系提供了宝贵信息。然而,由于难以理解当前脑状态及驱动所观察模式的内在机制,专门针对利用rs-fMRI预测治疗反应的设计仍然有限,这制约了rs-fMRI的临床应用。为此,我们提出一种图学习框架,在对比损失下同时整合基于相关性与基于距离的相似性度量,从而获得更具表达力的框架,在不同尺度上刻画脑动态特征,并更准确地预测治疗反应。在慢性疼痛与人格解体障碍数据集上的实验表明,所提方法在不同场景下均优于现有方法。据我们所知,这是首个将基于距离与基于相关性的神经相似性整合进图学习以预测治疗反应的工作。

DeepClean: Machine Unlearning on the Cheap by Resetting Privacy Sensitive Weights using the Fisher Diagonal

  • paper_url: http://arxiv.org/abs/2311.10448
  • repo_url: None
  • paper_authors: Jiaeli Shi, Najah Ghalyan, Kostis Gourgoulias, John Buford, Sean Moran
  • for: 保护隐私信息,避免机器学习模型意外吸收和泄露敏感信息。
  • methods: 使用 Fisher Information Matrix (FIM) 实现选择性忘记,而不需要全面重训练或大量矩阵逆函数计算。
  • results: 实验表明,我们的算法可以成功忘记任意选择的训练数据subset,并且可以在不同的神经网络架构上实现。
    Abstract Machine learning models trained on sensitive or private data can inadvertently memorize and leak that information. Machine unlearning seeks to retroactively remove such details from model weights to protect privacy. We contribute a lightweight unlearning algorithm that leverages the Fisher Information Matrix (FIM) for selective forgetting. Prior work in this area requires full retraining or large matrix inversions, which are computationally expensive. Our key insight is that the diagonal elements of the FIM, which measure the sensitivity of log-likelihood to changes in weights, contain sufficient information for effective forgetting. Specifically, we compute the FIM diagonal over two subsets -- the data to retain and forget -- for all trainable weights. This diagonal representation approximates the complete FIM while dramatically reducing computation. We then use it to selectively update weights to maximize forgetting of the sensitive subset while minimizing impact on the retained subset. Experiments show that our algorithm can successfully forget any randomly selected subsets of training data across neural network architectures. By leveraging the FIM diagonal, our approach provides an interpretable, lightweight, and efficient solution for machine unlearning with practical privacy benefits.
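
The two ingredients named in the abstract, the FIM diagonal over retain/forget subsets and a selective weight update, can be sketched in PyTorch as follows. The squared-gradient estimate of the Fisher diagonal is standard; the forgetting rule shown here (perturbing weights whose forget-to-retain Fisher ratio lies in the top quantile) is an illustrative assumption, since the paper's exact update is not reproduced.

```python
import torch

def fisher_diagonal(model, loader, loss_fn, device="cpu"):
    """Diagonal of the Fisher Information Matrix: mean squared gradient of the
    log-likelihood (approximated by the task loss) over a data subset."""
    fim = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    n_batches = 0
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x.to(device)), y.to(device)).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fim[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: v / max(n_batches, 1) for n, v in fim.items()}

@torch.no_grad()
def selective_forget(model, fim_forget, fim_retain, quantile=0.99, noise_scale=0.1):
    """Toy forgetting step: perturb weights that are much more informative about
    the forget set than about the retain set (illustrative rule, not DeepClean's)."""
    for n, p in model.named_parameters():
        if n not in fim_forget:
            continue
        ratio = fim_forget[n] / (fim_retain[n] + 1e-12)
        mask = ratio > torch.quantile(ratio.flatten(), quantile)
        p[mask] += noise_scale * torch.randn_like(p)[mask]
```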

DUA-DA: Distillation-based Unbiased Alignment for Domain Adaptive Object Detection

  • paper_url: http://arxiv.org/abs/2311.10437
  • repo_url: None
  • paper_authors: Yongchao Feng, Shiwei Li, Yingjie Gao, Ziyue Huang, Yanan Zhang, Qingjie Liu, Yunhong Wang
  • for: 提升域自适应目标检测(DAOD)的精度与跨域一致性
  • methods: 采用基于蒸馏的无偏对齐框架(借助预训练教师模型蒸馏源域特征)以及目标相关目标定位网络(TROLN),缓解源域偏置,并通过域感知一致性增强策略协调分类与定位
  • results: 在跨域场景下显著提升了检测精度与分类-定位一致性,并大幅超越现有的基于对齐的方法
    Abstract Though feature-alignment based Domain Adaptive Object Detection (DAOD) have achieved remarkable progress, they ignore the source bias issue, i.e. the aligned features are more favorable towards the source domain, leading to a sub-optimal adaptation. Furthermore, the presence of domain shift between the source and target domains exacerbates the problem of inconsistent classification and localization in general detection pipelines. To overcome these challenges, we propose a novel Distillation-based Unbiased Alignment (DUA) framework for DAOD, which can distill the source features towards a more balanced position via a pre-trained teacher model during the training process, alleviating the problem of source bias effectively. In addition, we design a Target-Relevant Object Localization Network (TROLN), which can mine target-related knowledge to produce two classification-free metrics (IoU and centerness). Accordingly, we implement a Domain-aware Consistency Enhancing (DCE) strategy that utilizes these two metrics to further refine classification confidences, achieving a harmonization between classification and localization in cross-domain scenarios. Extensive experiments have been conducted to manifest the effectiveness of this method, which consistently improves the strong baseline by large margins, outperforming existing alignment-based works.
    摘要 尽管基于特征对齐的域自适应目标检测(DAOD)已取得显著进展,但其忽略了源域偏置问题,即对齐后的特征更偏向源域,导致自适应效果次优。此外,源域与目标域之间的域偏移还加剧了一般检测流程中分类与定位不一致的问题。为克服这些挑战,我们提出一种新颖的基于蒸馏的无偏对齐(DUA)框架:在训练过程中,借助预训练的教师模型将源域特征蒸馏到更均衡的位置,有效缓解源域偏置问题。此外,我们设计了目标相关目标定位网络(TROLN),挖掘目标相关知识并产生两个与类别无关的度量(IoU与中心度);据此实施域感知一致性增强(DCE)策略,利用这两个度量进一步校准分类置信度,在跨域场景下实现分类与定位的协调。大量实验表明,该方法持续以较大幅度提升强基线,并优于现有基于对齐的方法。

Deep Residual CNN for Multi-Class Chest Infection Diagnosis

  • paper_url: http://arxiv.org/abs/2311.10430
  • repo_url: None
  • paper_authors: Ryan Donghan Kwon, Dohyun Lim, Yoonha Lee, Seung Won Lee
  • for: 这篇论文旨在开发和评估一种深度卷积神经网络(CNN),用于多类诊断胸部感染病,基于胸部X射影像。
  • methods: 该模型采用了深度卷积神经网络,通过对不同来源的数据集进行训练和验证,实现了robust的总准确率达93%。
  • results: 研究发现,不同类别之间存在微妙的差异,尤其是 fibrosis 类别,这反映了自动医疗图像诊断的复杂性和挑战。这些发现可以帮助未来的研究,增强模型在识别图像中更加细腻和复杂的特征方面的性能,以及优化和改进模型的架构和训练过程。
    Abstract The advent of deep learning has significantly propelled the capabilities of automated medical image diagnosis, providing valuable tools and resources in the realm of healthcare and medical diagnostics. This research delves into the development and evaluation of a Deep Residual Convolutional Neural Network (CNN) for the multi-class diagnosis of chest infections, utilizing chest X-ray images. The implemented model, trained and validated on a dataset amalgamated from diverse sources, demonstrated a robust overall accuracy of 93%. However, nuanced disparities in performance across different classes, particularly Fibrosis, underscored the complexity and challenges inherent in automated medical image diagnosis. The insights derived pave the way for future research, focusing on enhancing the model's proficiency in classifying conditions that present more subtle and nuanced visual features in the images, as well as optimizing and refining the model architecture and training process. This paper provides a comprehensive exploration into the development, implementation, and evaluation of the model, offering insights and directions for future research and development in the field.
    摘要 深度学习的出现对医疗图像诊断自动化技术带来了 significiant 的推动,提供了valuable 的工具和资源在医疗和医疗诊断领域。这项研究探讨了一种使用深度差分卷积神经网络(CNN)进行多类医疗图像诊断,使用了胸部X射线图像。实施的模型,在基于多个来源的数据集上进行训练和验证,表现了93%的总准确率。然而,不同类型的疾病之间存在了细微的差异,这反映了自动医疗图像诊断的复杂性和挑战。这些发现可以为未来的研究提供方向,例如增强模型对疾病表现更加细微的图像特征的识别能力,以及优化和改进模型的架构和训练过程。本文对模型的开发、实现和评估进行了全面的探讨,为未来的研究和发展提供了新的想法和方向。

Deep Learning based CNN Model for Classification and Detection of Individuals Wearing Face Mask

  • paper_url: http://arxiv.org/abs/2311.10408
  • repo_url: None
  • paper_authors: R. Chinnaiyan, Iyyappan M, Al Raiyan Shariff A, Kondaveeti Sai, Mallikarjunaiah B M, P Bharath
  • for: 防止 COVID-19 流行病的扩散,提高安全性,特别是在敏感区域。
  • methods: 使用深度学习创建一个实时视频和图像中检测面具的模型,包括面部检测和物体检测。
  • results: 实验结果表明模型在测试数据上具有优秀的准确率。
    Abstract In response to the global COVID-19 pandemic, there has been a critical demand for protective measures, with face masks emerging as a primary safeguard. The approach involves a two-fold strategy: first, recognizing the presence of a face by detecting faces, and second, identifying masks on those faces. This project utilizes deep learning to create a model that can detect face masks in real-time streaming video as well as images. Face detection, a facet of object detection, finds applications in diverse fields such as security, biometrics, and law enforcement. Various detector systems worldwide have been developed and implemented, with convolutional neural networks chosen for their superior performance accuracy and speed in object detection. Experimental results attest to the model's excellent accuracy on test data. The primary focus of this research is to enhance security, particularly in sensitive areas. The research paper proposes a rapid image pre-processing method with masks centred on faces. Employing feature extraction and Convolutional Neural Network, the system classifies and detects individuals wearing masks. The research unfolds in three stages: image pre-processing, image cropping, and image classification, collectively contributing to the identification of masked faces. Continuous surveillance through webcams or CCTV cameras ensures constant monitoring, triggering a security alert if a person is detected without a mask.
    摘要 因应全球 COVID-19 大流行,有一个急需的保护措施,面具出现为主要防范手段。该方法包括两个方面策略:首先,识别面具,然后,识别面具上的面。这个项目利用深度学习创建一个在实时流动视频和图像中检测面具的模型。面部检测是对象检测的一个方面,在安全、生物特征、和刑事调查等领域有广泛的应用。全球各地已经开发和实施了多种检测系统, convolutional neural networks(CNN)因其对对象检测的高性能精度和速度而被选择。实验结果证明模型在测试数据上具有优秀的准确率。本研究的主要目标是增强安全,特别是在敏感区域。研究论文提议一种快速的图像预处理方法,将面具围绕面进行中心。通过特征提取和 Convolutional Neural Network,系统可以识别和检测戴着面具的人。研究分三个阶段:图像预处理、图像裁剪和图像分类,共同帮助识别面具。通过持续的网络或 CCTV 摄像头监测,确保不断监测,如果检测到没有面具的人,就触发安全警报。

Optimized Deep Learning Models for AUV Seabed Image Analysis

  • paper_url: http://arxiv.org/abs/2311.10399
  • repo_url: None
  • paper_authors: Rajesh Sharma R, Akey Sungheetha, Chinnaiyan R
  • for: 本研究旨在提供最新的AUV图像处理技术和工具,以帮助更好地理解海底的特征和结构。
  • methods: 本研究使用了最新的计算机和算法技术,包括图像处理和分析方法,以提高AUV图像的质量和准确性。
  • results: 研究发现,使用新的AUV图像处理技术和工具,可以提高海底图像的质量和准确性,并且可以更好地理解海底的特征和结构。
    Abstract Using autonomous underwater vehicles, or AUVs, has completely changed how we gather data from the ocean floor. AUV innovation has advanced significantly, especially in the analysis of images, due to the increasing need for accurate and efficient seafloor mapping. This blog post provides a detailed summary and comparison of the most current advancements in AUV seafloor image processing. We will go into the realm of undersea technology, covering everything through computer and algorithmic advancements to advances in sensors and cameras. After reading this page through to the end, you will have a solid understanding of the most up-to-date techniques and tools for using AUVs to process seabed photos and how they could further our comprehension of the ocean floor

Two-Factor Authentication Approach Based on Behavior Patterns for Defeating Puppet Attacks

  • paper_url: http://arxiv.org/abs/2311.10389
  • repo_url: None
  • paper_authors: Wenhao Wang, Guyue Li, Zhiming Chu, Haobo Li, Daniele Faccio
  • for: 这篇论文是为了防止傀儡攻击(puppet attacks)而设计的。
  • methods: 该方法基于用户行为特征,具体来说是在身份验证过程中两次 consecutively 使用不同的手指点击设备。
  • results: 该方法可以实现97.87%的准确率和1.89%的false positive rate(FPR)。而且,对于增强傀儡攻击的抵抗力, combining 图像特征和时间特征在PUPGUARD中具有突出的优势。
    Abstract Fingerprint traits are widely recognized for their unique qualities and security benefits. Despite their extensive use, fingerprint features can be vulnerable to puppet attacks, where attackers manipulate a reluctant but genuine user into completing the authentication process. Defending against such attacks is challenging due to the coexistence of a legitimate identity and an illegitimate intent. In this paper, we propose PUPGUARD, a solution designed to guard against puppet attacks. This method is based on user behavioral patterns, specifically, the user needs to press the capture device twice successively with different fingers during the authentication process. PUPGUARD leverages both the image features of fingerprints and the timing characteristics of the pressing intervals to establish two-factor authentication. More specifically, after extracting image features and timing characteristics, and performing feature selection on the image features, PUPGUARD fuses these two features into a one-dimensional feature vector, and feeds it into a one-class classifier to obtain the classification result. This two-factor authentication method emphasizes dynamic behavioral patterns during the authentication process, thereby enhancing security against puppet attacks. To assess PUPGUARD's effectiveness, we conducted experiments on datasets collected from 31 subjects, including image features and timing characteristics. Our experimental results demonstrate that PUPGUARD achieves an impressive accuracy rate of 97.87% and a remarkably low false positive rate (FPR) of 1.89%. Furthermore, we conducted comparative experiments to validate the superiority of combining image features and timing characteristics within PUPGUARD for enhancing resistance against puppet attacks.
    摘要 指纹特征因其独特性与安全优势而被广泛应用。然而,指纹认证可能受到傀儡攻击:攻击者操纵不情愿但真实的用户完成身份验证过程。由于合法身份与非法意图并存,防御此类攻击十分困难。本文提出PUPGUARD方案以防御傀儡攻击。该方法基于用户行为模式,具体而言,用户需在认证过程中先后用两根不同的手指按压采集设备。PUPGUARD同时利用指纹图像特征与两次按压间隔的时间特征构成双因素认证:在提取图像特征与时间特征并对图像特征进行特征选择后,将二者融合为一维特征向量,送入单类分类器得到分类结果。该双因素认证方法强调认证过程中的动态行为模式,从而增强了对傀儡攻击的安全性。为评估PUPGUARD的效果,我们在31名受试者采集的数据集上进行了实验,结果表明PUPGUARD达到97.87%的准确率与仅1.89%的误报率(FPR)。此外,对比实验验证了在PUPGUARD中结合图像特征与时间特征对于增强抗傀儡攻击能力的优越性。
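
The fusion-and-classification step is simple enough to sketch directly: image features and the press-interval timing are concatenated into one vector, a one-class classifier is fitted on the legitimate user's enrolment samples, and authentication accepts only inliers. scikit-learn's OneClassSVM stands in for whichever one-class model the paper uses, and feature extraction is assumed to happen upstream.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def fuse_features(image_feats, press_interval):
    """Concatenate selected fingerprint image features with the timing
    characteristic (interval between the two presses) into one vector."""
    return np.concatenate([np.asarray(image_feats, dtype=float),
                           [float(press_interval)]])

def enroll(legit_samples):
    """Fit a one-class classifier on the legitimate user's samples only.
    `legit_samples` is assumed to be a list of (image_feats, press_interval) pairs."""
    X = np.stack([fuse_features(f, t) for f, t in legit_samples])
    return OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X)

def authenticate(clf, image_feats, press_interval):
    """Accept only if the fused two-factor feature vector looks like the enrolled
    behaviour; a puppet attack with atypical timing is rejected as an outlier."""
    x = fuse_features(image_feats, press_interval).reshape(1, -1)
    return clf.predict(x)[0] == 1
```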

Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2311.10382
  • repo_url: None
  • paper_authors: Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang, Jinjun Wang, Nanning Zheng
  • for: 本研究旨在提高多目标跟踪(MOT)的精度和可靠性,使其在视频序列中更好地跟踪目标。
  • methods: 提出一种简单而有效的两阶段特征学习范式,联合学习单帧(single-shot)与多帧(multi-shot)特征,以实现稳健的数据关联:对于未被关联的检测,设计单帧特征学习模块提取每个检测的判别特征,从而在相邻帧之间高效关联目标;对于丢失多帧的轨迹,设计多帧特征学习模块提取每条轨迹的判别特征,以便在较长时间后准确找回丢失的目标。
  • results: 该方法在MOT17和MOT20数据集上取得显著提升,并在DanceTrack数据集上达到当前最佳性能。
    Abstract Multi-Object Tracking (MOT) remains a vital component of intelligent video analysis, which aims to locate targets and maintain a consistent identity for each target throughout a video sequence. Existing works usually learn a discriminative feature representation, such as motion and appearance, to associate the detections across frames, which are easily affected by mutual occlusion and background clutter in practice. In this paper, we propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets, so as to achieve robust data association in the tracking process. For the detections without being associated, we design a novel single-shot feature learning module to extract discriminative features of each detection, which can efficiently associate targets between adjacent frames. For the tracklets being lost several frames, we design a novel multi-shot feature learning module to extract discriminative features of each tracklet, which can accurately refind these lost targets after a long period. Once equipped with a simple data association logic, the resulting VisualTracker can perform robust MOT based on the single-shot and multi-shot feature representations. Extensive experimental results demonstrate that our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.

MSE-Nets: Multi-annotated Semi-supervised Ensemble Networks for Improving Segmentation of Medical Image with Ambiguous Boundaries

  • paper_url: http://arxiv.org/abs/2311.10380
  • repo_url: None
  • paper_authors: Shuai Wang, Tengjin Weng, Jingyi Wang, Yang Shen, Zhidong Zhao, Yixiu Liu, Pengfei Jiao, Zhiming Cheng, Yaqi Wang
  • for: 这个研究是为了解决医学影像分类标注 exhibit 对于专家而存在变化,因为医学影像中的物体和背景的分类Boundaries是不明确的。
  • methods: 我们提出了Multi-annotated Semi-supervised Ensemble Networks (MSE-Nets),用于从有限多标注和充沛无标注数据中学习分类。我们还引入了网络组别实现增强 (NPCE) 模组和多网络伪素指导 (MNPS) 模组,以便更好地处理多标注数据。
  • results: 我们的方法可以大大减少需要多标注数据的需求,仅需97.75%的标注数据,并且与最佳对照方法的差距只有Jaccard指数4%。此外,我们的方法在医学影像分类 tasks 上的实验结果表明,与其他仅使用单一标注或合并融合方法相比,我们的方法在医学影像分类 tasks 上表现更好。
    Abstract Medical image segmentation annotations exhibit variations among experts due to the ambiguous boundaries of segmented objects and backgrounds in medical images. Although using multiple annotations for each image in the fully-supervised has been extensively studied for training deep models, obtaining a large amount of multi-annotated data is challenging due to the substantial time and manpower costs required for segmentation annotations, resulting in most images lacking any annotations. To address this, we propose Multi-annotated Semi-supervised Ensemble Networks (MSE-Nets) for learning segmentation from limited multi-annotated and abundant unannotated data. Specifically, we introduce the Network Pairwise Consistency Enhancement (NPCE) module and Multi-Network Pseudo Supervised (MNPS) module to enhance MSE-Nets for the segmentation task by considering two major factors: (1) to optimize the utilization of all accessible multi-annotated data, the NPCE separates (dis)agreement annotations of multi-annotated data at the pixel level and handles agreement and disagreement annotations in different ways, (2) to mitigate the introduction of imprecise pseudo-labels, the MNPS extends the training data by leveraging consistent pseudo-labels from unannotated data. Finally, we improve confidence calibration by averaging the predictions of base networks. Experiments on the ISIC dataset show that we reduced the demand for multi-annotated data by 97.75\% and narrowed the gap with the best fully-supervised baseline to just a Jaccard index of 4\%. Furthermore, compared to other semi-supervised methods that rely only on a single annotation or a combined fusion approach, the comprehensive experimental results on ISIC and RIGA datasets demonstrate the superior performance of our proposed method in medical image segmentation with ambiguous boundaries.
    摘要 医学图像分割注释存在专家间的差异,这是因为医学图像中对象和背景的分割Boundaries是不确定的。虽然使用多个注释来训练深度模型已经得到了广泛的研究,但获得大量多注释数据是困难的,因为分割注释需要大量的时间和人力资源,导致大多数图像无法得到任何注释。为解决这个问题,我们提出了多注释 semi-supervised ensemble networks (MSE-Nets),用于从有限多注释和丰富无注释数据中学习分割。 Specifically, we introduce the network pairwise consistency enhancement (NPCE) module and multi-network pseudo-supervised (MNPS) module to enhance MSE-Nets for the segmentation task by considering two major factors: (1) to optimize the utilization of all accessible multi-annotated data, the NPCE separates (dis)agreement annotations of multi-annotated data at the pixel level and handles agreement and disagreement annotations in different ways, (2) to mitigate the introduction of imprecise pseudo-labels, the MNPS extends the training data by leveraging consistent pseudo-labels from unannotated data. Finally, we improve confidence calibration by averaging the predictions of base networks. 实验结果表明,我们可以降低需要多注释数据的数量为97.75%,并将与最佳完全监督基eline之间的差距降低到只是Jaccard指数4%。此外,与其他半监督方法相比,我们的提议方法在医学图像分割中的边界是不确定的情况下表现出了superior performance。

Breaking Temporal Consistency: Generating Video Universal Adversarial Perturbations Using Image Models

  • paper_url: http://arxiv.org/abs/2311.10366
  • repo_url: None
  • paper_authors: Hee-Seon Kim, Minji Son, Minbeom Kim, Myung-Joon Kwon, Changick Kim
  • for: 防御深度学习模型受到攻击的安全性问题在视频分析中变得更加紧迫。特别是Universal Adversarial Perturbation(UAP)对深度学习模型 pose a significant threat, as a single perturbation can mislead deep learning models on entire datasets.
  • methods: 我们提出了一种新的视频UAP使用图像数据和图像模型。这使得我们可以利用图像数据和图像模型基础的研究来进行视频应用。然而,图像模型对视频中的时间方面的分析有限,这是成功视频攻击的关键。为解决这个挑战,我们引入了Breaking Temporal Consistency(BTC)方法,这是在图像模型中 incorporate temporal information into video attacks 的第一个尝试。我们想要生成攻击视频,其具有与原始视频相反的模式。具体来说,BTC-UAP minimizes the feature similarity between neighboring frames in videos。
  • results: 我们的方法比现有方法更有效,可以在不同的数据集上达到高效率。其中包括ImageNet、UCF-101和Kinetics-400等数据集。此外,我们的方法适用于视频的不同长度和对时间偏移的抗衡性。
    Abstract As video analysis using deep learning models becomes more widespread, the vulnerability of such models to adversarial attacks is becoming a pressing concern. In particular, Universal Adversarial Perturbation (UAP) poses a significant threat, as a single perturbation can mislead deep learning models on entire datasets. We propose a novel video UAP using image data and image model. This enables us to take advantage of the rich image data and image model-based studies available for video applications. However, there is a challenge that image models are limited in their ability to analyze the temporal aspects of videos, which is crucial for a successful video attack. To address this challenge, we introduce the Breaking Temporal Consistency (BTC) method, which is the first attempt to incorporate temporal information into video attacks using image models. We aim to generate adversarial videos that have opposite patterns to the original. Specifically, BTC-UAP minimizes the feature similarity between neighboring frames in videos. Our approach is simple but effective at attacking unseen video models. Additionally, it is applicable to videos of varying lengths and invariant to temporal shifts. Our approach surpasses existing methods in terms of effectiveness on various datasets, including ImageNet, UCF-101, and Kinetics-400.
    摘要 为了使深度学习模型更加普及,对于这些模型的攻击性质也变得越来越重要。特别是对于Universal Adversarial Perturbation(UAP)而言,单个杂散可以诱导深度学习模型对整个数据集进行误导。我们提出了一种新的视频UAP,使用图像数据和图像模型。这使得我们可以利用图像数据和图像模型基础的研究,对于视频应用有更多的优势。然而,图像模型对视频中的时间方面有限制,这是成功视频攻击的关键。为解决这个挑战,我们引入了Breaking Temporal Consistency(BTC)方法,这是在图像模型中引入时间信息的第一次尝试。我们想要生成一些与原始视频相反的攻击视频。特别是,BTC-UAP减少了邻域帧视频特征之间的相似性。我们的方法简单而有效,可以让未看过视频模型进行攻击。此外,它适用于视频的不同长度和不同的时间偏移。我们的方法在不同的数据集上表现出色,包括ImageNet、UCF-101和Kinetics-400等。
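
The core objective, minimising feature similarity between neighbouring perturbed frames, can be written down in a few lines of PyTorch. Here `feature_extractor` is any image model returning one pooled feature vector per frame, the perturbation is kept in an L-infinity ball by clamping, and one clip per batch is assumed; these are illustrative simplifications rather than the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def btc_loss(feature_extractor, frames, uap):
    """Breaking-Temporal-Consistency objective (sketch): minimise the cosine
    similarity between features of neighbouring perturbed frames.
    frames: (T, C, H, W); uap: (C, H, W) universal perturbation."""
    feats = feature_extractor(frames + uap)             # (T, D) image-model features
    feats = F.normalize(feats, dim=1)
    return (feats[:-1] * feats[1:]).sum(dim=1).mean()   # mean neighbour similarity

def optimise_uap(feature_extractor, video_loader, epsilon=8 / 255, steps=100, lr=1e-2):
    frames0, _ = next(iter(video_loader))                # (B, T, C, H, W)
    uap = torch.zeros_like(frames0[0, 0], requires_grad=True)
    opt = torch.optim.Adam([uap], lr=lr)
    for _ in range(steps):
        for frames, _ in video_loader:
            loss = btc_loss(feature_extractor, frames[0], uap)  # one clip per batch
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                uap.clamp_(-epsilon, epsilon)             # keep the perturbation small
    return uap.detach()
```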

Video-based Sequential Bayesian Homography Estimation for Soccer Field Registration

  • paper_url: http://arxiv.org/abs/2311.10361
  • repo_url: None
  • paper_authors: Paul J. Claasen, J. P. de Villiers
  • for: 提高视频摄像机动态Homography的推断精度
  • methods: 使用两stage kalman滤波器与tracked keypoints进行BAYESIAN Homography推断
  • results: 与现有方法相比,提高了homography评估指标的精度,且可以使用现有的键点检测方法进行增强
    Abstract A novel Bayesian framework is proposed, which explicitly relates the homography of one video frame to the next through an affine transformation while explicitly modelling keypoint uncertainty. The literature has previously used differential homography between subsequent frames, but not in a Bayesian setting. In cases where Bayesian methods have been applied, camera motion is not adequately modelled, and keypoints are treated as deterministic. The proposed method, Bayesian Homography Inference from Tracked Keypoints (BHITK), employs a two-stage Kalman filter and significantly improves existing methods. Existing keypoint detection methods may be easily augmented with BHITK. It enables less sophisticated and less computationally expensive methods to outperform the state-of-the-art approaches in most homography evaluation metrics. Furthermore, the homography annotations of the WorldCup and TS-WorldCup datasets have been refined using a custom homography annotation tool released for public use. The refined datasets are consolidated and released as the consolidated and refined WorldCup (CARWC) dataset.
    摘要 提出了一种新的 bayesian 框架,将一帧视频与下一帧视频之间的投影关系通过 affine 变换进行显式关联,并且明确模糊关键点的不确定性。过去的文献中使用了差分投影 между 后续帧,但是没有在 bayesian Setting 中使用。在摄像机运动不充分模型和关键点 treated 为确定的情况下,提出的方法 bayesian homography inference from tracked keypoints (BHITK) 使用了两 stage kalman filter,并有所提高现有方法。现有的关键点检测方法可以轻松地增强 BHITK。这种方法使得不太复杂和计算成本较低的方法能够在大多数投影评价指标中超越现有的状态艺术方法。此外,世界杯和 TS-WorldCup 数据集中的 homography 标注已经通过自定义 homography 标注工具进行了精细化,并将其整合成 consolidated and refined WorldCup (CARWC) 数据集,并将其公开发布。
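
A constant-velocity Kalman filter over a tracked keypoint is the kind of building block the method relies on; the sketch below shows one predict/update cycle and the covariance that a downstream homography fit can use as a weight. The paper's BHITK additionally relates consecutive homographies through an affine transformation in a two-stage filter, which is not reproduced here.

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity Kalman filter matrices for one tracked keypoint
    (state = [x, y, vx, vy], measurement = [x, y])."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
    Q, R = q * np.eye(4), r * np.eye(2)
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle; returns the filtered keypoint state and its
    covariance, which downstream homography fitting can use as a weight."""
    x, P = F @ x, F @ P @ F.T + Q                       # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                      # Kalman gain
    x = x + K @ (z - H @ x)                             # update with measured keypoint z
    P = (np.eye(4) - K @ H) @ P
    return x, P
```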

Garment Recovery with Shape and Deformation Priors

  • paper_url: http://arxiv.org/abs/2311.10356
  • repo_url: None
  • paper_authors: Ren Li, Corentin Dumery, Benoît Guillard, Pascal Fua
  • for: 提出一种从真实图像中恢复逼真服装模型的方法,无论服装的形状或形变如何。
  • methods: 利用从合成数据中学习到的形状与形变先验,准确捕捉服装的形状与形变,包括大幅度的形变。
  • results: 该方法不仅能准确恢复服装几何,还能生成可直接用于动画与仿真的服装模型。
    Abstract While modeling people wearing tight-fitting clothing has made great strides in recent years, loose-fitting clothing remains a challenge. We propose a method that delivers realistic garment models from real-world images, regardless of garment shape or deformation. To this end, we introduce a fitting approach that utilizes shape and deformation priors learned from synthetic data to accurately capture garment shapes and deformations, including large ones. Not only does our approach recover the garment geometry accurately, it also yields models that can be directly used by downstream applications such as animation and simulation.
    摘要 近年来,对穿着紧身服装的人体进行建模已取得长足进步,但宽松服装的建模仍然是一个挑战。我们提出一种方法,能够从真实图像中恢复逼真的服装模型,无论服装形状或形变如何。为此,我们引入一种拟合方法,利用从合成数据中学习到的形状与形变先验,准确捕捉服装的形状与形变,包括大幅度的形变。我们的方法不仅能精确恢复服装几何,所得到的模型还可直接用于动画与仿真等下游应用。

Pseudo Label-Guided Data Fusion and Output Consistency for Semi-Supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.10349
  • repo_url: https://github.com/ortonwang/plgdf
  • paper_authors: Tao Wang, Yuanbin Chen, Xinlin Zhang, Yuanbo Zhou, Junlin Lan, Bizhe Bai, Tao Tan, Min Du, Qinquan Gao, Tong Tong
  • for: 这个研究是为了提出一个基于Convolutional Neural Networks的 semi-supervised learning架构,以便医疗图像分类任务中使用更少的标签数据。
  • methods: 这个研究使用了mean teacher network,并提出了一个新的伪标签使用方案,将标签和无标签数据结合以增加资料集。此外,研究者还强制了解释层之间的一致性,并提出了适合评估一致性的损失函数。最后,研究者还加入了一个锋化操作以进一步提高分类的精度。
  • results: 实验结果显示,PLGDF架构可以将更少的标签数据用于医疗图像分类任务中,同时与六种州分类学习方法进行比较,PLGDF的性能较高。codes这个研究的数据可以在https://github.com/ortonwang/PLGDF上获得。
    Abstract Supervised learning algorithms based on Convolutional Neural Networks have become the benchmark for medical image segmentation tasks, but their effectiveness heavily relies on a large amount of labeled data. However, annotating medical image datasets is a laborious and time-consuming process. Inspired by semi-supervised algorithms that use both labeled and unlabeled data for training, we propose the PLGDF framework, which builds upon the mean teacher network for segmenting medical images with less annotation. We propose a novel pseudo-label utilization scheme, which combines labeled and unlabeled data to augment the dataset effectively. Additionally, we enforce the consistency between different scales in the decoder module of the segmentation network and propose a loss function suitable for evaluating the consistency. Moreover, we incorporate a sharpening operation on the predicted results, further enhancing the accuracy of the segmentation. Extensive experiments on three publicly available datasets demonstrate that the PLGDF framework can largely improve performance by incorporating the unlabeled data. Meanwhile, our framework yields superior performance compared to six state-of-the-art semi-supervised learning methods. The codes of this study are available at https://github.com/ortonwang/PLGDF.
    摘要 基于卷积神经网络的监督学习算法已成为医学图像分割任务的基准,但其效果严重依赖大量标注数据,而标注医学图像数据集是一项费时费力的工作。受同时利用标注与未标注数据进行训练的半监督算法启发,我们提出PLGDF框架,该框架基于mean teacher网络,用更少的标注实现医学图像分割。我们提出一种新的伪标签利用方案,将标注与未标注数据结合以有效扩充数据集;同时在分割网络的解码模块中约束不同尺度之间的一致性,并提出适合评估该一致性的损失函数;此外,对预测结果引入锐化操作,进一步提升分割精度。在三个公开数据集上的大量实验表明,PLGDF通过引入未标注数据显著提升了性能,并优于六种最新的半监督学习方法。本研究代码见 https://github.com/ortonwang/PLGDF。

Enhancing Student Engagement in Online Learning through Facial Expression Analysis and Complex Emotion Recognition using Deep Learning

  • paper_url: http://arxiv.org/abs/2311.10343
  • repo_url: None
  • paper_authors: Rekha R Nair, Tina Babu, Pavithra K
  • for: 该论文目的是提出一种基于深度学习技术的 facial expression 评估方法,以评估在线学习 sessio 中学生的参与度。
  • methods: 该方法使用了深度学习模型,通过分析学生表情来评估学生的参与度。
  • results: 实验结果显示,该方法可以准确地分类学生的基本情感状态,并且达到了95%的准确率。
    Abstract In response to the COVID-19 pandemic, traditional physical classrooms have transitioned to online environments, necessitating effective strategies to ensure sustained student engagement. A significant challenge in online teaching is the absence of real-time feedback from teachers on students learning progress. This paper introduces a novel approach employing deep learning techniques based on facial expressions to assess students engagement levels during online learning sessions. Human emotions cannot be adequately conveyed by a student using only the basic emotions, including anger, disgust, fear, joy, sadness, surprise, and neutrality. To address this challenge, proposed a generation of four complex emotions such as confusion, satisfaction, disappointment, and frustration by combining the basic emotions. These complex emotions are often experienced simultaneously by students during the learning session. To depict these emotions dynamically,utilized a continuous stream of image frames instead of discrete images. The proposed work utilized a Convolutional Neural Network (CNN) model to categorize the fundamental emotional states of learners accurately. The proposed CNN model demonstrates strong performance, achieving a 95% accuracy in precise categorization of learner emotions.

A2XP: Towards Private Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.10339
  • repo_url: None
  • paper_authors: Geunhyeok Yu, Hyoseok Hwang
  • for: 本研究旨在解决深度神经网络(DNNs)在不同领域数据中表现不佳的问题,特别是 Computer Vision 领域。
  • methods: 本文提出了一种新的方法,即 Attend to eXpert Prompts(A2XP),以解决 DNNs 在不同领域数据中的领域泛化问题。A2XP 包括两个阶段:专家适应和领域泛化。在第一阶段,每个源领域的提问都被优化,以引导模型向优化的方向发展。在第二阶段,两个嵌入器网络被训练,以有效地混合这些专家提问,以达到最佳输出。
  • results: 我们的广泛实验表明,A2XP 可以与现有的非私有领域泛化方法相比,达到最新的结果。实验结果表明,提出的方法不仅可以解决 DNNs 中的领域泛化问题,还提供了一种隐私保护、高效的解决方案,对 Computer Vision 领域的更广泛应用。
    Abstract Deep Neural Networks (DNNs) have become pivotal in various fields, especially in computer vision, outperforming previous methodologies. A critical challenge in their deployment is the bias inherent in data across different domains, such as image style, and environmental conditions, leading to domain gaps. This necessitates techniques for learning general representations from biased training data, known as domain generalization. This paper presents Attend to eXpert Prompts (A2XP), a novel approach for domain generalization that preserves the privacy and integrity of the network architecture. A2XP consists of two phases: Expert Adaptation and Domain Generalization. In the first phase, prompts for each source domain are optimized to guide the model towards the optimal direction. In the second phase, two embedder networks are trained to effectively amalgamate these expert prompts, aiming for an optimal output. Our extensive experiments demonstrate that A2XP achieves state-of-the-art results over existing non-private domain generalization methods. The experimental results validate that the proposed approach not only tackles the domain generalization challenge in DNNs but also offers a privacy-preserving, efficient solution to the broader field of computer vision.
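
A toy version of the prompt-mixing step can be written as a small PyTorch module: two embedders score how relevant each frozen source-domain expert prompt is to the current input, and the prompts are blended with softmax attention weights before being added to the input. The embedder architecture, the additive prompting and the dimensions are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn.functional as F

class ExpertPromptMixer(torch.nn.Module):
    """Toy sketch of attending to expert prompts (not the authors' A2XP code):
    two small embedders produce attention weights over K frozen expert prompts,
    whose weighted mixture is added to the input image."""
    def __init__(self, expert_prompts, feat_dim=128):
        super().__init__()
        self.prompts = torch.nn.Parameter(expert_prompts, requires_grad=False)  # (K, C, H, W)
        self.embed_x = torch.nn.LazyLinear(feat_dim)     # embeds the input image
        self.embed_p = torch.nn.LazyLinear(feat_dim)     # embeds each expert prompt

    def forward(self, x):                                # x: (B, C, H, W)
        qx = self.embed_x(x.flatten(1))                  # (B, D)
        qp = self.embed_p(self.prompts.flatten(1))       # (K, D)
        attn = F.softmax(qx @ qp.t(), dim=1)             # (B, K) expert weights
        mixed = torch.einsum("bk,kchw->bchw", attn, self.prompts)
        return x + mixed                                 # prompted input for the frozen network
```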

Cooperative Perception with Learning-Based V2V communications

  • paper_url: http://arxiv.org/abs/2311.10336
  • repo_url: None
  • paper_authors: Chenguang Liu, Yunfei Chen, Jianjun Chen, Ryan Payton, Michael Riley, Shuang-Hua Yang
  • for: 本文旨在探讨协同感知在自动驾驶中的应用,以缓解单个自动车辆感知的局限性。
  • methods: 本文使用了不同的融合方法和通信频率损害来评估协同感知的性能。同时,一种新的晚期融合方案被提出,以利用中间特征的稳定性。
  • results: numerical results表明,中间融合在频率损害较大的情况下比早期融合和晚期融合更加稳定,当SNR大于0dB时。此外,提议的融合方案也超过了使用检测输出的传统晚期融合。 autoencoder也提供了一个好的平衡点,在检测准确率和带宽使用之间。
    Abstract Cooperative perception has been widely used in autonomous driving to alleviate the inherent limitation of single automated vehicle perception. To enable cooperation, vehicle-to-vehicle (V2V) communication plays an indispensable role. This work analyzes the performance of cooperative perception accounting for communications channel impairments. Different fusion methods and channel impairments are evaluated. A new late fusion scheme is proposed to leverage the robustness of intermediate features. In order to compress the data size incurred by cooperation, a convolution neural network-based autoencoder is adopted. Numerical results demonstrate that intermediate fusion is more robust to channel impairments than early fusion and late fusion, when the SNR is greater than 0 dB. Also, the proposed fusion scheme outperforms the conventional late fusion using detection outputs, and autoencoder provides a good compromise between detection accuracy and bandwidth usage.
    摘要 合作感知在自动驾驶中广泛应用以减轻单个自动车辆感知的内在限制。为实现合作,车辆间通信(V2V)在无可或缺。这项工作分析了帐户通信频率的影响,并评估不同的混合方法和通信频率。我们提出了一种新的晚期混合方案,以利用中间特征的可靠性。为减少合作所带来的数据大小,我们采用了一种基于卷积神经网络的自适应编码器。数值结果表明,中间混合比早期混合和晚期混合更加鲁棒,当SNR大于0dB时。此外,我们的混合方案比传统的晚期混合使用检测输出更高效,而自适应编码器提供了一个好的平衡 между检测精度和带宽使用。

Leveraging Multimodal Fusion for Enhanced Diagnosis of Multiple Retinal Diseases in Ultra-wide OCTA

  • paper_url: http://arxiv.org/abs/2311.10331
  • repo_url: https://github.com/hwei-hw/m3octa
  • paper_authors: Hao Wei, Peilun Shi, Guitao Bai, Minqing Zhang, Shuangle Li, Wu Yuan
  • for: 该论文旨在提供 Exceptionally wide scanning range of up to 24 x 20 $mm^{2}$,覆盖 anterior和 posterior regions of the retina 的 Ultra-wide optical coherence tomography angiography (UW-OCTA) 技术。
  • methods: 该论文提出了一种 cross-modal fusion 框架,利用 multi-modal information for diagnosing multiple diseases。
  • results: 通过对 M3OCTA 数据集进行了extensive experiments,证明了该方法的效iveness和超越性, both in fixed and varying modalities settings。
    Abstract Ultra-wide optical coherence tomography angiography (UW-OCTA) is an emerging imaging technique that offers significant advantages over traditional OCTA by providing an exceptionally wide scanning range of up to 24 x 20 $mm^{2}$, covering both the anterior and posterior regions of the retina. However, the currently accessible UW-OCTA datasets suffer from limited comprehensive hierarchical information and corresponding disease annotations. To address this limitation, we have curated the pioneering M3OCTA dataset, which is the first multimodal (i.e., multilayer), multi-disease, and widest field-of-view UW-OCTA dataset. Furthermore, the effective utilization of multi-layer ultra-wide ocular vasculature information from UW-OCTA remains underdeveloped. To tackle this challenge, we propose the first cross-modal fusion framework that leverages multi-modal information for diagnosing multiple diseases. Through extensive experiments conducted on our openly available M3OCTA dataset, we demonstrate the effectiveness and superior performance of our method, both in fixed and varying modalities settings. The construction of the M3OCTA dataset, the first multimodal OCTA dataset encompassing multiple diseases, aims to advance research in the ophthalmic image analysis community.
    摘要 “ULTRA-WIDE Optical coherence tomography angiography(UW-OCTA)是一种emerging imaging技术,具有优先的优点,包括提供exceptionally wide scanning range,覆盖 anterior和 posterior retina regions,但是目前可用的UW-OCTA数据集存在limited comprehensive hierarchical information和相应的疾病标识。为解决这个限制,我们已经组装了创新的M3OCTA数据集,是第一个多模式(即多层)、多疾病、最宽的field-of-view UW-OCTA数据集。此外,使用ultra-wide ocular vasculature信息的有效利用方法仍然受到挑战。为了解决这个问题,我们提出了首个 Cross-modal fusion框架,利用多modal信息进行疾病诊断。经过了我们公开提供的M3OCTA数据集的广泛实验,我们证明了我们的方法的有效性和superior performance,包括固定和变化modal settings。M3OCTA数据集的建立,是为了进展医疗影像分析社区的研究。”

TransONet: Automatic Segmentation of Vasculature in Computed Tomographic Angiograms Using Deep Learning

  • paper_url: http://arxiv.org/abs/2311.10328
  • repo_url: None
  • paper_authors: Alireza Bagheri Rajeoni, Breanna Pederson, Ali Firooz, Hamed Abdollahi, Andrew K. Smith, Daniel G. Clair, Susan M. Lessner, Homayoun Valafar
  • for: 这个研究旨在提高Computed Tomographic Angiography(CTA)图像中血管系统的诊断速度和准确率,以便帮助医生更好地诊断和治疗 peripheral arterial disease(PAD)。
  • methods: 这个研究使用了深度学习技术来分类CTA图像中的血管系统,包括从下颈部分到股骨分支和从下颈部分到膝盖的两个部分。
  • results: 研究结果表明,使用深度学习技术可以准确地分类CTA图像中的血管系统,其中最高的Dice准确率达到93.5%和80.64%。这些结果表明深度学习技术在诊断血管系统中具有高度的潜在价值和优势。
    Abstract Pathological alterations in the human vascular system underlie many chronic diseases, such as atherosclerosis and aneurysms. However, manually analyzing diagnostic images of the vascular system, such as computed tomographic angiograms (CTAs) is a time-consuming and tedious process. To address this issue, we propose a deep learning model to segment the vascular system in CTA images of patients undergoing surgery for peripheral arterial disease (PAD). Our study focused on accurately segmenting the vascular system (1) from the descending thoracic aorta to the iliac bifurcation and (2) from the descending thoracic aorta to the knees in CTA images using deep learning techniques. Our approach achieved average Dice accuracies of 93.5% and 80.64% in test dataset for (1) and (2), respectively, highlighting its high accuracy and potential clinical utility. These findings demonstrate the use of deep learning techniques as a valuable tool for medical professionals to analyze the health of the vascular system efficiently and accurately. Please visit the GitHub page for this paper at https://github.com/pip-alireza/TransOnet.

Learning transformer-based heterogeneously salient graph representation for multimodal fusion classification of hyperspectral image and LiDAR data

  • paper_url: http://arxiv.org/abs/2311.10320
  • repo_url: None
  • paper_authors: Jiaqi Yang, Bo Du, Liangpei Zhang
  • for: Improve the accuracy of multi-source remote sensing image classification.
  • methods: Propose a transformer-based heterogeneously salient graph representation (THSGR) that encodes heterogeneous multimodal features with a graph encoder and models long-range dependencies with a self-attention-free multi-convolutional modulator, improving robustness and generalization.
  • results: Experiments and analyses on three benchmark datasets show that the method improves classification accuracy across modalities and is competitive with other SOTA methods.
    Abstract Data collected by different modalities can provide a wealth of complementary information, such as hyperspectral image (HSI) to offer rich spectral-spatial properties, synthetic aperture radar (SAR) to provide structural information about the Earth's surface, and light detection and ranging (LiDAR) to cover altitude information about ground elevation. Therefore, a natural idea is to combine multimodal images for refined and accurate land-cover interpretation. Although many efforts have been attempted to achieve multi-source remote sensing image classification, there are still three issues as follows: 1) indiscriminate feature representation without sufficiently considering modal heterogeneity, 2) abundant features and complex computations associated with modeling long-range dependencies, and 3) overfitting phenomenon caused by sparsely labeled samples. To overcome the above barriers, a transformer-based heterogeneously salient graph representation (THSGR) approach is proposed in this paper. First, a multimodal heterogeneous graph encoder is presented to encode distinctively non-Euclidean structural features from heterogeneous data. Then, a self-attention-free multi-convolutional modulator is designed for effective and efficient long-term dependency modeling. Finally, a mean forward is put forward in order to avoid overfitting. Based on the above structures, the proposed model is able to break through modal gaps to obtain differentiated graph representation with competitive time cost, even for a small fraction of training samples. Experiments and analyses on three benchmark datasets with various state-of-the-art (SOTA) methods show the performance of the proposed approach.
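As a rough illustration of feeding heterogeneous HSI and LiDAR features into a graph-based classifier, the sketch below builds per-pixel node features from both modalities and a k-nearest-neighbour adjacency. The actual THSGR encoder, modulator, and mean-forward components are considerably more involved and are not reproduced here; the normalization and graph construction are generic stand-ins.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_multimodal_graph(hsi: np.ndarray, lidar: np.ndarray, k: int = 8):
    """hsi: (H, W, B) hyperspectral cube; lidar: (H, W) elevation raster.
    Returns node features of shape (H*W, B+1) and a sparse kNN adjacency matrix."""
    h, w, b = hsi.shape
    spectral = hsi.reshape(h * w, b)
    elevation = lidar.reshape(h * w, 1)
    # z-score each modality separately before concatenating, since their
    # scales (reflectance vs. metres) are not comparable
    spectral = (spectral - spectral.mean(0)) / (spectral.std(0) + 1e-8)
    elevation = (elevation - elevation.mean()) / (elevation.std() + 1e-8)
    nodes = np.concatenate([spectral, elevation], axis=1)
    adjacency = kneighbors_graph(nodes, n_neighbors=k, mode="connectivity")
    return nodes, adjacency

nodes, adj = build_multimodal_graph(np.random.rand(16, 16, 30), np.random.rand(16, 16))
print(nodes.shape, adj.shape)   # (256, 31) (256, 256)
```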

Nonparametric Teaching for Multiple Learners

  • paper_url: http://arxiv.org/abs/2311.10318
  • repo_url: https://github.com/chen2hang/mint_nonparametricteaching
  • paper_authors: Chen Zhang, Xiaofeng Cao, Weiyang Liu, Ivor Tsang, James Kwok
  • for: This work studies nonparametric iterative teaching of multiple learners simultaneously, motivated by the gap between the single-learner setting and real-world instruction, where a teacher typically imparts knowledge to many students.
  • methods: A new framework, Multi-learner Nonparametric Teaching (MINT), is proposed in which the teacher instructs multiple learners, each focusing on a scalar-valued target model; the problem is recast as teaching a single vector-valued target model, extending the scalar-valued reproducing kernel Hilbert space used in the single-learner setting to a vector-valued space.
  • results: MINT delivers significant teaching speed-ups over repeated single-learner teaching, particularly when the learners can communicate with each other; extensive experiments validate its practicality and efficiency.
    Abstract We study the problem of teaching multiple learners simultaneously in the nonparametric iterative teaching setting, where the teacher iteratively provides examples to the learner for accelerating the acquisition of a target concept. This problem is motivated by the gap between current single-learner teaching setting and the real-world scenario of human instruction where a teacher typically imparts knowledge to multiple students. Under the new problem formulation, we introduce a novel framework -- Multi-learner Nonparametric Teaching (MINT). In MINT, the teacher aims to instruct multiple learners, with each learner focusing on learning a scalar-valued target model. To achieve this, we frame the problem as teaching a vector-valued target model and extend the target model space from a scalar-valued reproducing kernel Hilbert space used in single-learner scenarios to a vector-valued space. Furthermore, we demonstrate that MINT offers significant teaching speed-up over repeated single-learner teaching, particularly when the multiple learners can communicate with each other. Lastly, we conduct extensive experiments to validate the practicality and efficiency of MINT.
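For readers unfamiliar with nonparametric iterative teaching, the single-learner update that MINT generalizes can be written as functional gradient descent in an RKHS; the multi-learner case replaces the scalar-valued function and kernel with vector-valued counterparts. The display below is a schematic reconstruction from the abstract, not the paper's exact formulation:

$$
f_{t+1} \;=\; f_t \;-\; \eta_t\,\partial_1 \ell\big(f_t(x_t),\, y_t\big)\, K(x_t, \cdot),
$$

where $(x_t, y_t)$ is the example selected by the teacher at round $t$, $\ell$ is the learner's loss, and $K$ is the reproducing kernel. With $m$ learners, $f_t$ becomes an $\mathbb{R}^m$-valued function in a vector-valued RKHS and $K(x_t,\cdot)$ an operator-valued kernel applied to the stacked per-learner gradients.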

MPSeg : Multi-Phase strategy for coronary artery Segmentation

  • paper_url: http://arxiv.org/abs/2311.10306
  • repo_url: None
  • paper_authors: Jonghoe Ku, Yong-Hee Lee, Junsup Shin, In Kyu Lee, Hyun-Woo Kim
  • for: This paper presents a new multi-phase strategy for automatic coronary artery segmentation to better support the assessment of cardiovascular disease.
  • methods: Vessels are first separated into the Left Coronary Artery (LCA) and Right Coronary Artery (RCA), and specialized ensemble models then perform segmentation for each category; because the LCA is more complex, a refinement model corrects its initial class predictions.
  • results: The approach performed exceptionally well in the Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) Segmentation Detection Algorithm challenge at MICCAI 2023.
    Abstract Accurate segmentation of coronary arteries is a pivotal process in assessing cardiovascular diseases. However, the intricate structure of the cardiovascular system presents significant challenges for automatic segmentation, especially when utilizing methodologies like the SYNTAX Score, which relies extensively on detailed structural information for precise risk stratification. To address these difficulties and cater to this need, we present MPSeg, an innovative multi-phase strategy designed for coronary artery segmentation. Our approach specifically accommodates these structural complexities and adheres to the principles of the SYNTAX Score. Initially, our method segregates vessels into two categories based on their unique morphological characteristics: Left Coronary Artery (LCA) and Right Coronary Artery (RCA). Specialized ensemble models are then deployed for each category to execute the challenging segmentation task. Due to LCA's higher complexity over RCA, a refinement model is utilized to scrutinize and correct initial class predictions on segmented areas. Notably, our approach demonstrated exceptional effectiveness when evaluated in the Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) Segmentation Detection Algorithm challenge at MICCAI 2023.
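The multi-phase routing described above can be sketched at a high level: classify the angiogram as LCA or RCA, dispatch it to the corresponding ensemble, and refine the LCA predictions. The models here are placeholders for any trained segmentation networks, not the authors' released components.

```python
import numpy as np

def mpseg_predict(image, side_classifier, lca_ensemble, rca_ensemble, lca_refiner):
    """Illustrative two-phase pipeline: classify the vessel side, segment with the
    matching ensemble, refine only the harder LCA predictions."""
    side = side_classifier(image)                       # "LCA" or "RCA"
    ensemble = lca_ensemble if side == "LCA" else rca_ensemble
    probs = np.mean([m.predict(image) for m in ensemble], axis=0)  # average ensemble probabilities
    mask = probs.argmax(axis=-1)                        # per-pixel class labels
    if side == "LCA":
        mask = lca_refiner(image, mask)                 # correct initial LCA predictions
    return side, mask

class Stub:
    """Stand-in for a trained segmentation model with 3 vessel classes."""
    def predict(self, img):
        return np.random.rand(*img.shape, 3)

side, mask = mpseg_predict(
    np.zeros((128, 128)),
    side_classifier=lambda img: "LCA",
    lca_ensemble=[Stub(), Stub()],
    rca_ensemble=[Stub()],
    lca_refiner=lambda img, m: m,
)
print(side, mask.shape)   # LCA (128, 128)
```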

Semi-supervised ViT knowledge distillation network with style transfer normalization for colorectal liver metastases survival prediction

  • paper_url: http://arxiv.org/abs/2311.10305
  • repo_url: None
  • paper_authors: Mohamed El Amine Elforaici, Emmanuel Montagnon, Francisco Perdigon Romero, William Trung Le, Feryel Azzi, Dominique Trudel, Bich Nguyen, Simon Turcotte, An Tang, Samuel Kadoury
  • for: Predict survival and disease progression for patients with colorectal liver metastases (CLM) based on their response to systemic chemotherapy, in support of precision medicine.
  • methods: An end-to-end approach that normalizes histology slides with a generative adversarial network (GAN), classifies tissue with a semi-supervised model trained from sparse annotations, and uses an attention mechanism to weigh the importance of different slide regions.
  • results: On a clinical dataset the method outperforms comparable approaches, with c-indexes of 0.804 (0.014) for OS and 0.733 (0.014) for TTR, and reaches 86.9%-90.3% accuracy for TRG dichotomization and 78.5%-82.1% for the 3-class TRG task.
    Abstract Colorectal liver metastases (CLM) significantly impact colon cancer patients, influencing survival based on systemic chemotherapy response. Traditional methods like tumor grading scores (e.g., tumor regression grade - TRG) for prognosis suffer from subjectivity, time constraints, and expertise demands. Current machine learning approaches often focus on radiological data, yet the relevance of histological images for survival predictions, capturing intricate tumor microenvironment characteristics, is gaining recognition. To address these limitations, we propose an end-to-end approach for automated prognosis prediction using histology slides stained with H&E and HPS. We first employ a Generative Adversarial Network (GAN) for slide normalization to reduce staining variations and improve the overall quality of the images that are used as input to our prediction pipeline. We propose a semi-supervised model to perform tissue classification from sparse annotations, producing feature maps. We use an attention-based approach that weighs the importance of different slide regions in producing the final classification results. We exploit the extracted features for the metastatic nodules and surrounding tissue to train a prognosis model. In parallel, we train a vision Transformer (ViT) in a knowledge distillation framework to replicate and enhance the performance of the prognosis prediction. In our evaluation on a clinical dataset of 258 patients, our approach demonstrates superior performance with c-indexes of 0.804 (0.014) for OS and 0.733 (0.014) for TTR. Achieving 86.9% to 90.3% accuracy in predicting TRG dichotomization and 78.5% to 82.1% accuracy for the 3-class TRG classification task, our approach outperforms comparative methods. Our proposed pipeline can provide automated prognosis for pathologists and oncologists, and can greatly promote precision medicine progress in managing CLM patients.
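The concordance index (c-index) reported above measures how often the model ranks a pair of patients' risks in the same order as their observed outcomes. A minimal implementation for right-censored survival data is shown below (tie handling simplified; this is a generic metric, not the authors' evaluation code).

```python
import numpy as np

def concordance_index(times, events, risks):
    """times: observed times; events: 1 if the event occurred, 0 if censored;
    risks: predicted risk scores (higher = earlier expected event)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable when subject i has an observed event before time j
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

times  = np.array([5., 8., 12., 20.])
events = np.array([1, 1, 0, 1])
risks  = np.array([0.9, 0.7, 0.4, 0.2])
print(concordance_index(times, events, risks))   # 1.0: ranking matches outcomes
```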

BiHRNet: A Binary high-resolution network for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.10296
  • repo_url: None
  • paper_authors: Zhicheng Zhang, Xueyao Sun, Yonghao Dang, Jianqin Yin
  • for: Propose a lightweight binary human pose estimator (BiHRNet) suited to real-time applications on resource-limited devices.
  • methods: A binary neural network (BNN) variant of HRNet with two kinds of techniques to reduce the accuracy drop caused by binarization: a new loss function combining KL-divergence and AWing losses, and binarization-friendly structures, namely an information reconstruction bottleneck (IR Bottleneck) and a multi-scale basic block (MS-Block).
  • results: BiHRNet achieves a PCKh of 87.9 on MPII and 70.8 mAP on COCO, outperforming all binary pose estimation networks and most of the tested lightweight full-precision networks.
    Abstract Human Pose Estimation (HPE) plays a crucial role in computer vision applications. However, it is difficult to deploy state-of-the-art models on resouce-limited devices due to the high computational costs of the networks. In this work, a binary human pose estimator named BiHRNet(Binary HRNet) is proposed, whose weights and activations are expressed as $\pm$1. BiHRNet retains the keypoint extraction ability of HRNet, while using fewer computing resources by adapting binary neural network (BNN). In order to reduce the accuracy drop caused by network binarization, two categories of techniques are proposed in this work. For optimizing the training process for binary pose estimator, we propose a new loss function combining KL divergence loss with AWing loss, which makes the binary network obtain more comprehensive output distribution from its real-valued counterpart to reduce information loss caused by binarization. For designing more binarization-friendly structures, we propose a new information reconstruction bottleneck called IR Bottleneck to retain more information in the initial stage of the network. In addition, we also propose a multi-scale basic block called MS-Block for information retention. Our work has less computation cost with few precision drop. Experimental results demonstrate that BiHRNet achieves a PCKh of 87.9 on the MPII dataset, which outperforms all binary pose estimation networks. On the challenging of COCO dataset, the proposed method enables the binary neural network to achieve 70.8 mAP, which is better than most tested lightweight full-precision networks.
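Binarizing weights and activations to ±1 is typically done with a sign function in the forward pass and a straight-through estimator (STE) in the backward pass. The PyTorch sketch below shows that generic mechanism only; it is not BiHRNet's exact blocks, nor its KL + AWing loss.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: sign(x) in {-1, +1}. Backward: pass the gradient through
    unchanged where |x| <= 1 (the classic straight-through estimator)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).float()

class BinaryLinear(torch.nn.Linear):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)      # binarized weights
        a_bin = BinarizeSTE.apply(x)                # binarized activations
        return torch.nn.functional.linear(a_bin, w_bin, self.bias)

layer = BinaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()     # gradients flow to the real-valued latent weights
print(out.shape, layer.weight.grad.shape)   # torch.Size([2, 4]) torch.Size([4, 8])
```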

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

  • paper_url: http://arxiv.org/abs/2311.10794
  • repo_url: None
  • paper_authors: Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan
  • for: Finetune Latent Diffusion Models (LDMs) for image generation in a distinct domain (sticker generation) with high visual quality, prompt alignment, and scene diversity.
  • methods: Finetune the Emu text-to-image model on weakly supervised sticker-like images and on Human-in-the-Loop (HITL) Alignment and Style datasets, using the proposed Style Tailoring recipe to improve prompt alignment and style alignment.
  • results: Style Tailoring improves visual quality by 14%, prompt alignment by 16.2%, and scene diversity by 15.3%, compared to prompt-engineering the base Emu model for sticker generation.
    Abstract We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models (LDMs) in a distinct domain with high visual quality, prompt alignment and scene diversity. We choose sticker image generation as the target domain, as the images significantly differ from photorealistic samples typically generated by large-scale LDMs. We start with a competent text-to-image model, like Emu, and show that relying on prompt engineering with a photorealistic model to generate stickers leads to poor prompt alignment and scene diversity. To overcome these drawbacks, we first finetune Emu on millions of sticker-like images collected using weak supervision to elicit diversity. Next, we curate human-in-the-loop (HITL) Alignment and Style datasets from model generations, and finetune to improve prompt alignment and style alignment respectively. Sequential finetuning on these datasets poses a tradeoff between better style alignment and prompt alignment gains. To address this tradeoff, we propose a novel fine-tuning method called Style Tailoring, which jointly fits the content and style distribution and achieves best tradeoff. Evaluation results show our method improves visual quality by 14%, prompt alignment by 16.2% and scene diversity by 15.3%, compared to prompt engineering the base Emu model for stickers generation.
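Style Tailoring's key idea, avoiding the sequential-finetuning tradeoff by fitting the prompt-alignment and style targets together, can be caricatured as letting each parameter update draw on batches from both HITL datasets. The loop below is a generic mixed-dataset finetuning sketch with toy stand-ins for the model and diffusion loss; it is not the authors' training recipe.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins: a tiny "denoiser" and an MSE surrogate for the diffusion loss.
model = nn.Linear(16, 16)
def diffusion_loss(net, batch):
    x, = batch
    noise = torch.randn_like(x)
    return ((net(x + noise) - x) ** 2).mean()

align_loader = DataLoader(TensorDataset(torch.randn(32, 16)), batch_size=8, shuffle=True)
style_loader = DataLoader(TensorDataset(torch.randn(32, 16)), batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def joint_finetune_step(w_style=1.0):
    """Both HITL batches contribute to one update, rather than two sequential stages."""
    loss = diffusion_loss(model, next(iter(align_loader))) \
         + w_style * diffusion_loss(model, next(iter(style_loader)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(joint_finetune_step())
```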

Hierarchical Pruning of Deep Ensembles with Focal Diversity

  • paper_url: http://arxiv.org/abs/2311.10293
  • repo_url: None
  • paper_authors: Yanzhao Wu, Ka-Ho Chow, Wenqi Wei, Ling Liu
  • for: This paper proposes a novel deep neural network ensemble pruning method that improves generalizability and robustness while substantially reducing the time and space cost of ensemble execution.
  • methods: The approach rests on three new ensemble pruning techniques: (1) focal diversity metrics that accurately capture the complementary capacity of each member network and guide pruning; (2) a focal-diversity-based hierarchical pruning procedure that iteratively finds low-cost, high-accuracy deep ensembles; (3) a focal diversity consensus method that fuses multiple focal diversity metrics to refine the pruning results.
  • results: On popular benchmark datasets, the proposed hierarchical ensemble pruning approach effectively identifies high-quality deep ensembles with better generalizability while being more time- and space-efficient in ensemble decision-making.
    Abstract Deep neural network ensembles combine the wisdom of multiple deep neural networks to improve the generalizability and robustness over individual networks. It has gained increasing popularity to study deep ensemble techniques in the deep learning community. Some mission-critical applications utilize a large number of deep neural networks to form deep ensembles to achieve desired accuracy and resilience, which introduces high time and space costs for ensemble execution. However, it still remains a critical challenge whether a small subset of the entire deep ensemble can achieve the same or better generalizability and how to effectively identify these small deep ensembles for improving the space and time efficiency of ensemble execution. This paper presents a novel deep ensemble pruning approach, which can efficiently identify smaller deep ensembles and provide higher ensemble accuracy than the entire deep ensemble of a large number of member networks. Our hierarchical ensemble pruning approach (HQ) leverages three novel ensemble pruning techniques. First, we show that the focal diversity metrics can accurately capture the complementary capacity of the member networks of an ensemble, which can guide ensemble pruning. Second, we design a focal diversity based hierarchical pruning approach, which will iteratively find high quality deep ensembles with low cost and high accuracy. Third, we develop a focal diversity consensus method to integrate multiple focal diversity metrics to refine ensemble pruning results, where smaller deep ensembles can be effectively identified to offer high accuracy, high robustness and high efficiency. Evaluated using popular benchmark datasets, we demonstrate that the proposed hierarchical ensemble pruning approach can effectively identify high quality deep ensembles with better generalizability while being more time and space efficient in ensemble decision making.
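A much-simplified version of diversity-guided ensemble pruning is sketched below: score candidate sub-ensembles by a diversity measure over their disagreeing predictions plus validation accuracy, and keep the best small subset. The paper's focal diversity metrics, hierarchical search, and consensus method are more elaborate; the pairwise-disagreement score and brute-force search here are only stand-ins.

```python
import numpy as np
from itertools import combinations

def disagreement(preds_a, preds_b):
    """Fraction of samples on which two members predict different labels."""
    return float(np.mean(preds_a != preds_b))

def prune_ensemble(member_preds, labels, size=3, w_div=0.5):
    """member_preds: (M, N) predicted labels of M members on N validation samples.
    Returns the indices of the `size`-member subset with the best
    accuracy + diversity score (brute force; fine for small M)."""
    best_score, best_subset = -np.inf, None
    for subset in combinations(range(len(member_preds)), size):
        votes = member_preds[list(subset)]
        # majority vote of the candidate sub-ensemble
        majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
        acc = np.mean(majority == labels)
        div = np.mean([disagreement(votes[i], votes[j])
                       for i, j in combinations(range(size), 2)])
        score = acc + w_div * div
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=200)
members = np.stack([np.where(rng.random(200) < 0.7, labels, rng.integers(0, 10, 200))
                    for _ in range(6)])
print(prune_ensemble(members, labels))
```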

SSASS: Semi-Supervised Approach for Stenosis Segmentation

  • paper_url: http://arxiv.org/abs/2311.10281
  • repo_url: None
  • paper_authors: In Kyu Lee, Junsup Shin, Yong-Hee Lee, Jonghoe Ku, Hyun-Woo Kim
  • for: Help medical practitioners evaluate a patient's condition more accurately by precisely identifying coronary artery stenosis in Coronary Angiography (CAG).
  • methods: A semi-supervised approach that combines data augmentation tailored to coronary artery structure with pseudo-label-based learning, to cope with the complex vessel anatomy and the inherent noise of X-ray images.
  • results: The method performed exceptionally well in the Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) Stenosis Detection Algorithm challenge while using a single model rather than an ensemble, demonstrating an automated and efficient way to assess stenosis severity.
    Abstract Coronary artery stenosis is a critical health risk, and its precise identification in Coronary Angiography (CAG) can significantly aid medical practitioners in accurately evaluating the severity of a patient's condition. The complexity of coronary artery structures combined with the inherent noise in X-ray images poses a considerable challenge to this task. To tackle these obstacles, we introduce a semi-supervised approach for cardiovascular stenosis segmentation. Our strategy begins with data augmentation, specifically tailored to replicate the structural characteristics of coronary arteries. We then apply a pseudo-label-based semi-supervised learning technique that leverages the data generated through our augmentation process. Impressively, our approach demonstrated an exceptional performance in the Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) Stenosis Detection Algorithm challenge by utilizing a single model instead of relying on an ensemble of multiple models. This success emphasizes our method's capability and efficiency in providing an automated solution for accurately assessing stenosis severity from medical imaging data.
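Pseudo-label-based semi-supervised training, as used above, alternates supervised updates on labeled images with updates on unlabeled images whose targets are the model's own confident predictions. The compact PyTorch sketch below illustrates that pattern; the confidence threshold, losses, and toy segmenter are illustrative choices, not the authors' configuration.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(model, optimizer, labeled, unlabeled, tau=0.9, w_u=1.0):
    """labeled: (images, masks); unlabeled: images only.
    Pixels whose predicted probability exceeds `tau` provide pseudo-labels."""
    images, masks = labeled
    loss_sup = F.cross_entropy(model(images), masks)

    with torch.no_grad():
        probs = F.softmax(model(unlabeled), dim=1)
        conf, pseudo = probs.max(dim=1)              # per-pixel confidence and label
    loss_unsup = F.cross_entropy(model(unlabeled), pseudo, reduction="none")
    loss_unsup = (loss_unsup * (conf > tau)).mean()  # zero out unconfident pixels

    loss = loss_sup + w_u * loss_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a 1x1-conv "segmenter" and random tensors.
model = torch.nn.Conv2d(1, 2, kernel_size=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
labeled = (torch.randn(2, 1, 32, 32), torch.randint(0, 2, (2, 32, 32)))
unlabeled = torch.randn(4, 1, 32, 32)
print(semi_supervised_step(model, opt, labeled, unlabeled))
```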

Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.10261
  • repo_url: None
  • paper_authors: Yizhou Wang, Jen-Hao Cheng, Jui-Te Huang, Sheng-Yao Kuan, Qiqian Fu, Chiming Ni, Shengyu Hao, Gaoang Wang, Guanbin Xing, Hui Liu, Jenq-Neng Hwang
  • for: Propose a camera-radar 3D object perception benchmark for autonomous driving, exploiting the combination of the two sensors for accurate and robust perception.
  • methods: Introduces the CRUW3D dataset of 66K synchronized and well-calibrated camera, radar, and LiDAR frames across various driving scenarios; the radar data is provided as radio frequency (RF) tensors that carry 3D location as well as spatio-temporal semantic information.
  • results: Fusing camera and radar information can potentially yield more reliable 3D object perception, remaining robust across different lighting and all-weather driving conditions thanks to the mmWave radar, while the camera contributes rich semantic information.
    Abstract Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR. However, the collaboration between camera and radar is significantly under-exploited. The incorporation of rich semantic information from the camera, and reliable 3D information from the radar can potentially achieve an efficient, cheap, and portable solution for 3D object perception tasks. It can also be robust to different lighting or all-weather driving scenarios due to the capability of mmWave radars. In this paper, we introduce the CRUW3D dataset, including 66K synchronized and well-calibrated camera, radar, and LiDAR frames in various driving scenarios. Unlike other large-scale autonomous driving datasets, our radar data is in the format of radio frequency (RF) tensors that contain not only 3D location information but also spatio-temporal semantic information. This kind of radar format can enable machine learning models to generate more reliable object perception results after interacting and fusing the information or features between the camera and radar.

UniMOS: A Universal Framework For Multi-Organ Segmentation Over Label-Constrained Datasets

  • paper_url: http://arxiv.org/abs/2311.10251
  • repo_url: https://github.com/lw8807001/unimos
  • paper_authors: Can Li, Sheng Shao, Junyi Qu, Shuchao Pang, Mehmet A. Orgun
  • for: This paper aims to provide a universal framework for medical image segmentation tasks, which can utilize fully and partially labeled images as well as unlabeled images.
  • methods: The proposed framework, called UniMOS, uses a Multi-Organ Segmentation (MOS) module over fully/partially labeled data as the basenet, and incorporates a semi-supervised training module that combines consistent regularization and pseudolabeling techniques on unlabeled data.
  • results: Experiments show that the UniMOS framework exhibits excellent performance on several medical image segmentation tasks compared with other advanced methods, while significantly improving data utilization and reducing annotation cost.
    Abstract Machine learning models for medical images can help physicians diagnose and manage diseases. However, due to the fact that medical image annotation requires a great deal of manpower and expertise, as well as the fact that clinical departments perform image annotation based on task orientation, there is the problem of having fewer medical image annotation data with more unlabeled data and having many datasets that annotate only a single organ. In this paper, we present UniMOS, the first universal framework for achieving the utilization of fully and partially labeled images as well as unlabeled images. Specifically, we construct a Multi-Organ Segmentation (MOS) module over fully/partially labeled data as the basenet and designed a new target adaptive loss. Furthermore, we incorporate a semi-supervised training module that combines consistent regularization and pseudolabeling techniques on unlabeled data, which significantly improves the segmentation of unlabeled data. Experiments show that the framework exhibits excellent performance in several medical image segmentation tasks compared to other advanced methods, and also significantly improves data utilization and reduces annotation cost. Code and models are available at: https://github.com/lw8807001/UniMOS.
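The consistency-regularization half of UniMOS's semi-supervised module can be illustrated by penalizing disagreement between predictions for two differently perturbed views of the same unlabeled image. The perturbation (Gaussian noise), the KL distance, and the toy segmenter below are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, unlabeled, noise_std=0.1):
    """Encourage the segmenter to give matching per-pixel distributions for
    two noisy views of the same unlabeled scan (weak/strong perturbations or
    spatial augmentations would be used in practice)."""
    view_a = unlabeled + noise_std * torch.randn_like(unlabeled)
    view_b = unlabeled + noise_std * torch.randn_like(unlabeled)
    p_a = F.log_softmax(model(view_a), dim=1)
    with torch.no_grad():                      # one branch acts as the target
        p_b = F.softmax(model(view_b), dim=1)
    return F.kl_div(p_a, p_b, reduction="batchmean")

model = torch.nn.Conv2d(1, 3, kernel_size=1)   # stand-in multi-organ segmenter
scans = torch.randn(2, 1, 32, 32)
print(consistency_loss(model, scans).item())
```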

Segment Anything in Defect Detection

  • paper_url: http://arxiv.org/abs/2311.10245
  • repo_url: https://github.com/bozhenhhu/DefectSAM
  • paper_authors: Bozhen Hu, Bin Gao, Cheng Tan, Tongle Wu, Stan Z. Li
  • for: Improve defect detection accuracy in infrared non-destructive testing systems, which offer non-contact, safe, and efficient inspection.
  • methods: Proposes DefectSAM, a new approach built on the Segment Anything (SAM) model that leverages a meticulously curated dataset from labor-intensive lab experiments together with prompts from experienced experts, surpassing existing state-of-the-art segmentation algorithms.
  • results: DefectSAM improves defect detection rates, excels at finding weaker and smaller defects on complex and irregular surfaces, and yields more accurate defect size estimates; experiments on various materials validate its reliability and effectiveness.
    Abstract Defect detection plays a crucial role in infrared non-destructive testing systems, offering non-contact, safe, and efficient inspection capabilities. However, challenges such as low resolution, high noise, and uneven heating in infrared thermal images hinder comprehensive and accurate defect detection. In this study, we propose DefectSAM, a novel approach for segmenting defects on highly noisy thermal images based on the widely adopted model, Segment Anything (SAM)\cite{kirillov2023segany}. Harnessing the power of a meticulously curated dataset generated through labor-intensive lab experiments and valuable prompts from experienced experts, DefectSAM surpasses existing state-of-the-art segmentation algorithms and achieves significant improvements in defect detection rates. Notably, DefectSAM excels in detecting weaker and smaller defects on complex and irregular surfaces, reducing the occurrence of missed detections and providing more accurate defect size estimations. Experimental studies conducted on various materials have validated the effectiveness of our solutions in defect detection, which hold significant potential to expedite the evolution of defect detection tools, enabling enhanced inspection capabilities and accuracy in defect identification.
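For readers who want to see what prompt-driven segmentation looks like in practice, the sketch below runs Meta's base SAM predictor on a thermal frame rendered as an RGB image. The checkpoint path, the random stand-in image, and the single expert point prompt are assumptions for illustration; this is not DefectSAM's released interface or weights.

```python
# pip install segment-anything; the ViT-B checkpoint must be downloaded separately.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # assumed local path
predictor = SamPredictor(sam)

# A thermal frame converted to a 3-channel uint8 image (random stand-in data here).
thermal_rgb = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
predictor.set_image(thermal_rgb)

# An expert point prompt on a suspected defect (foreground label = 1).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]   # pick the highest-scoring candidate mask
print(best_mask.shape, scores)
```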