cs.CV - 2023-09-04

NLLB-CLIP – train performant multilingual image retrieval model on a budget

  • paper_url: http://arxiv.org/abs/2309.01859
  • repo_url: None
  • paper_authors: Alexander Visheratin
  • for: investigate whether someone without access to massive computing resources can make a valuable scientific contribution in the field of multilingual image retrieval.
  • methods: trained a CLIP model with a text encoder from the NLLB model on an automatically created dataset of 106,246 good-quality images with captions in 201 languages, using image and text encoders of various sizes and freezing different parts of the model during training.
  • results: NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
    Abstract Today, the exponential rise of large models developed by academic and industrial institutions with the help of massive computing resources raises the question of whether someone without access to such resources can make a valuable scientific contribution. To explore this, we tried to solve the challenging task of multilingual image retrieval having a limited budget of $1,000. As a result, we present NLLB-CLIP - CLIP model with a text encoder from the NLLB model. To train the model, we used an automatically created dataset of 106,246 good-quality images with captions in 201 languages derived from the LAION COCO dataset. We trained multiple models using image and text encoders of various sizes and kept different parts of the model frozen during the training. We thoroughly analyzed the trained models using existing evaluation datasets and newly created XTD200 and Flickr30k-200 datasets. We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
    Note: "NLLB" stands for "No Language Left Behind", Meta AI's massively multilingual translation model family. "CLIP" stands for "Contrastive Language-Image Pre-training".
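    NLLB-CLIP is trained with a CLIP-style contrastive objective between image embeddings and NLLB-encoded caption embeddings. A minimal sketch of that loss follows; the encoder wiring, projection layer, and temperature are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)   # cosine similarity via unit vectors
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; classify in both directions and average.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: image_encoder is a CLIP ViT, text_proj(nllb_encoder(...)) maps the
# NLLB encoder output into the shared space; either encoder may be partially frozen.
# loss = clip_contrastive_loss(image_encoder(images), text_proj(nllb_encoder(captions)))
```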

Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations

  • paper_url: http://arxiv.org/abs/2309.01858
  • repo_url: None
  • paper_authors: Nikolaos-Antonios Ypsilantis, Kaifeng Chen, Bingyi Cao, Mário Lipovský, Pelin Dogan-Schönberger, Grzegorz Makosa, Boris Bluntschli, Mojtaba Seyedhosseini, Ondřej Chum, André Araujo
  • for: The paper aims to address the problem of universal image embedding, where a single universal model is trained and used in multiple domains.
  • methods: The paper proposes a new large-scale public benchmark for the evaluation of universal image embeddings, with 241k query images, 1.4M index images, and 2.8M training images across 8 different domains and 349k classes. The authors also provide a comprehensive experimental evaluation on the new dataset and conduct a public research competition to foster future research in this area.
  • results: The paper shows that existing approaches and simplistic extensions lead to worse performance than an assembly of models trained for each domain separately. Additionally, the public research competition attracted the participation of more than 1k teams worldwide and generated many interesting research ideas and findings.
    Abstract Fine-grained and instance-level recognition methods are commonly trained and evaluated on specific domains, in a model per domain scenario. Such an approach, however, is impractical in real large-scale applications. In this work, we address the problem of universal image embedding, where a single universal model is trained and used in multiple domains. First, we leverage existing domain-specific datasets to carefully construct a new large-scale public benchmark for the evaluation of universal image embeddings, with 241k query images, 1.4M index images and 2.8M training images across 8 different domains and 349k classes. We define suitable metrics, training and evaluation protocols to foster future research in this area. Second, we provide a comprehensive experimental evaluation on the new dataset, demonstrating that existing approaches and simplistic extensions lead to worse performance than an assembly of models trained for each domain separately. Finally, we conducted a public research competition on this topic, leveraging industrial datasets, which attracted the participation of more than 1k teams worldwide. This exercise generated many interesting research ideas and findings which we present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/

SMPLitex: A Generative Model and Dataset for 3D Human Texture Estimation from Single Image

  • paper_url: http://arxiv.org/abs/2309.01855
  • repo_url: None
  • paper_authors: Dan Casas, Marc Comino-Trinidad
  • for: estimating and manipulating the complete 3D appearance of humans captured from a single image.
  • methods: builds on recently proposed generative models for 2D images and extends them to the 3D domain through pixel-to-surface correspondences computed on the input image.
  • results: quantitative and qualitative evaluation on 3 publicly available datasets shows that SMPLitex significantly outperforms existing methods for human texture estimation while allowing a wider variety of tasks such as editing, synthesis, and manipulation.
    Abstract We propose SMPLitex, a method for estimating and manipulating the complete 3D appearance of humans captured from a single image. SMPLitex builds upon the recently proposed generative models for 2D images, and extends their use to the 3D domain through pixel-to-surface correspondences computed on the input image. To this end, we first train a generative model for complete 3D human appearance, and then fit it into the input image by conditioning the generative model to the visible parts of the subject. Furthermore, we propose a new dataset of high-quality human textures built by sampling SMPLitex conditioned on subject descriptions and images. We quantitatively and qualitatively evaluate our method in 3 publicly available datasets, demonstrating that SMPLitex significantly outperforms existing methods for human texture estimation while allowing for a wider variety of tasks such as editing, synthesis, and manipulation

Uncertainty in AI: Evaluating Deep Neural Networks on Out-of-Distribution Images

  • paper_url: http://arxiv.org/abs/2309.01850
  • repo_url: None
  • paper_authors: Jamiu Idowu, Ahmed Almasoud
  • for: investigates the uncertainty of deep neural networks (ResNet-50, VGG16, DenseNet121, AlexNet, and GoogleNet) when exposed to out-of-distribution (OOD) or perturbed data.
  • methods: three experiments: first, pretrained models classify OOD images to assess their performance; second, an ensemble of the models' predictions is built using probabilistic averaging for consensus, with its uncertainty quantified by average probability, variance, and entropy metrics; third, perturbations (filters, rotations, etc.) are added to new DALL-E-generated or real-world images to test model robustness.
  • results: ResNet-50 was the most accurate single model on OOD images, but the ensemble performed even better, classifying all images correctly. After perturbation, ResNet-50, which had classified 4 of 5 unperturbed images correctly, misclassified all of them, revealing a significant vulnerability; these misclassifications are obvious to human observers, highlighting the limitations of AI models. Saliency maps were used to identify the image regions the model considered important for its decisions.
    Abstract As AI models are increasingly deployed in critical applications, ensuring the consistent performance of models when exposed to unusual situations such as out-of-distribution (OOD) or perturbed data, is important. Therefore, this paper investigates the uncertainty of various deep neural networks, including ResNet-50, VGG16, DenseNet121, AlexNet, and GoogleNet, when dealing with such data. Our approach includes three experiments. First, we used the pretrained models to classify OOD images generated via DALL-E to assess their performance. Second, we built an ensemble from the models' predictions using probabilistic averaging for consensus due to its advantages over plurality or majority voting. The ensemble's uncertainty was quantified using average probabilities, variance, and entropy metrics. Our results showed that while ResNet-50 was the most accurate single model for OOD images, the ensemble performed even better, correctly classifying all images. Third, we tested model robustness by adding perturbations (filters, rotations, etc.) to new epistemic images from DALL-E or real-world captures. ResNet-50 was chosen for this being the best performing model. While it classified 4 out of 5 unperturbed images correctly, it misclassified all of them post-perturbation, indicating a significant vulnerability. These misclassifications, which are clear to human observers, highlight AI models' limitations. Using saliency maps, we identified regions of the images that the model considered important for their decisions.
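    The ensemble and uncertainty metrics described above reduce to a few lines of arithmetic over per-model softmax outputs. A minimal sketch, with illustrative numbers standing in for the per-model predictions:

```python
import numpy as np

def ensemble_uncertainty(prob_list):
    """prob_list: list of per-model softmax vectors, each of shape (num_classes,)."""
    probs = np.stack(prob_list)                  # (num_models, num_classes)
    mean_prob = probs.mean(axis=0)               # probabilistic averaging for consensus
    prediction = int(mean_prob.argmax())
    variance = probs.var(axis=0).mean()          # disagreement between models
    entropy = -np.sum(mean_prob * np.log(mean_prob + 1e-12))  # uncertainty of the consensus
    return prediction, mean_prob, variance, entropy

# Example with three hypothetical models voting over 5 classes:
p = [np.array([0.7, 0.1, 0.1, 0.05, 0.05]),
     np.array([0.6, 0.2, 0.1, 0.05, 0.05]),
     np.array([0.4, 0.3, 0.2, 0.05, 0.05])]
print(ensemble_uncertainty(p))
```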

StereoFlowGAN: Co-training for Stereo and Flow with Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.01842
  • repo_url: None
  • paper_authors: Zhexiao Xiong, Feng Qiao, Yu Zhang, Nathan Jacobs
  • for: proposes a new training strategy, based on image-to-image translation between synthetic and real image domains, for stereo matching and optical flow estimation that performs well on real images.
  • methods: translates images between the real and synthetic domains so that models can be trained for real-image scenarios using only synthetic ground truth, and introduces a bidirectional feature warping module that handles both left-right and forward-backward directions.
  • results: experimental results show competitive performance over previous domain translation-based methods, substantiating the efficacy of the proposed framework.
    Abstract We introduce a novel training strategy for stereo matching and optical flow estimation that utilizes image-to-image translation between synthetic and real image domains. Our approach enables the training of models that excel in real image scenarios while relying solely on ground-truth information from synthetic images. To facilitate task-agnostic domain adaptation and the training of task-specific components, we introduce a bidirectional feature warping module that handles both left-right and forward-backward directions. Experimental results show competitive performance over previous domain translation-based methods, which substantiate the efficacy of our proposed framework, effectively leveraging the benefits of unsupervised domain adaptation, stereo matching, and optical flow estimation.

On the fly Deep Neural Network Optimization Control for Low-Power Computer Vision

  • paper_url: http://arxiv.org/abs/2309.01824
  • repo_url: None
  • paper_authors: Ishmeet Kaur, Adwaita Janardhan Jadhav
  • for: This paper aims to improve the deployability of state-of-the-art computer vision techniques on resource-constrained edge devices.
  • methods: The paper proposes a novel technique called AdaptiveActivation, which dynamically adjusts the sparsity and precision of a DNN’s activation function during run-time to improve accuracy and energy consumption.
  • results: The authors conduct experiments on popular edge devices and show that their approach achieves accuracy within 1.5% of the baseline while requiring 10%–38% less memory, providing more accuracy-efficiency tradeoff options.
    Abstract Processing visual data on mobile devices has many applications, e.g., emergency response and tracking. State-of-the-art computer vision techniques rely on large Deep Neural Networks (DNNs) that are usually too power-hungry to be deployed on resource-constrained edge devices. Many techniques improve the efficiency of DNNs by using sparsity or quantization. However, the accuracy and efficiency of these techniques cannot be adapted for diverse edge applications with different hardware constraints and accuracy requirements. This paper presents a novel technique to allow DNNs to adapt their accuracy and energy consumption during run-time, without the need for any re-training. Our technique called AdaptiveActivation introduces a hyper-parameter that controls the output range of the DNNs' activation function to dynamically adjust the sparsity and precision in the DNN. AdaptiveActivation can be applied to any existing pre-trained DNN to improve their deployability in diverse edge environments. We conduct experiments on popular edge devices and show that the accuracy is within 1.5% of the baseline. We also show that our approach requires 10%--38% less memory than the baseline techniques leading to more accuracy-efficiency tradeoff options
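    The abstract does not spell out the exact form of AdaptiveActivation, but the idea of a run-time hyper-parameter that bounds the activation's output range (and thereby its sparsity and effective precision) can be sketched as a clipped ReLU. The parameter names and clipping scheme below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class BoundedReLU(nn.Module):
    """Illustrative activation whose output range is controlled at run time.

    `max_value` caps activations (fewer distinct levels after quantization);
    `threshold` zeroes small responses (higher sparsity). Both can be changed
    between inferences without retraining, which is the spirit of AdaptiveActivation.
    """
    def __init__(self, max_value=6.0, threshold=0.0):
        super().__init__()
        self.max_value = max_value
        self.threshold = threshold

    def forward(self, x):
        x = torch.clamp(x, min=0.0, max=self.max_value)
        return torch.where(x > self.threshold, x, torch.zeros_like(x))

# A deployed model could swap its ReLUs for BoundedReLU and tighten
# max_value/threshold whenever the device needs to save memory or energy.
```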

Multi-dimension unified Swin Transformer for 3D Lesion Segmentation in Multiple Anatomical Locations

  • paper_url: http://arxiv.org/abs/2309.01823
  • repo_url: None
  • paper_authors: Shaoyan Pan, Yiqiao Liu, Sarah Halek, Michal Tomaszewski, Shubing Wang, Richard Baumgartner, Jianda Yuan, Gregory Goldmacher, Antong Chen
  • for: improving the accuracy of 3D lesion segmentation from CT scans to support lesion growth kinetics modeling and radiomics studies.
  • methods: proposes a multi-dimension unified Swin transformer (MDU-ST) that accepts both 2D and 3D inputs, and a three-stage framework that first learns the underlying pattern of lesion anatomy from large amounts of unlabeled 3D lesion volumes via self-supervised pretext tasks, then fine-tunes on 2D RECIST slices, and finally fine-tunes on labeled 3D volumes.
  • results: the proposed model significantly improves 3D lesion segmentation over competing models and can be used for automated lesion segmentation to assist radiomics and tumor growth modeling studies.
    Abstract In oncology research, accurate 3D segmentation of lesions from CT scans is essential for the modeling of lesion growth kinetics. However, following the RECIST criteria, radiologists routinely only delineate each lesion on the axial slice showing the largest transverse area, and delineate a small number of lesions in 3D for research purposes. As a result, we have plenty of unlabeled 3D volumes and labeled 2D images, and scarce labeled 3D volumes, which makes training a deep-learning 3D segmentation model a challenging task. In this work, we propose a novel model, denoted a multi-dimension unified Swin transformer (MDU-ST), for 3D lesion segmentation. The MDU-ST consists of a Shifted-window transformer (Swin-transformer) encoder and a convolutional neural network (CNN) decoder, allowing it to adapt to 2D and 3D inputs and learn the corresponding semantic information in the same encoder. Based on this model, we introduce a three-stage framework: 1) leveraging large amount of unlabeled 3D lesion volumes through self-supervised pretext tasks to learn the underlying pattern of lesion anatomy in the Swin-transformer encoder; 2) fine-tune the Swin-transformer encoder to perform 2D lesion segmentation with 2D RECIST slices to learn slice-level segmentation information; 3) further fine-tune the Swin-transformer encoder to perform 3D lesion segmentation with labeled 3D volumes. The network's performance is evaluated by the Dice similarity coefficient (DSC) and Hausdorff distance (HD) using an internal 3D lesion dataset with 593 lesions extracted from multiple anatomical locations. The proposed MDU-ST demonstrates significant improvement over the competing models. The proposed method can be used to conduct automated 3D lesion segmentation to assist radiomics and tumor growth modeling studies. This paper has been accepted by the IEEE International Symposium on Biomedical Imaging (ISBI) 2023.

Instant Continual Learning of Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2309.01811
  • repo_url: None
  • paper_authors: Ryan Po, Zhengyang Dong, Alexander W. Bergman, Gordon Wetzstein
  • for: novel-view synthesis and 3D scene reconstruction
  • methods: replay-based methods combined with a hybrid explicit–implicit scene representation
  • results: higher reconstruction quality and faster training than previous methods
    Abstract Neural radiance fields (NeRFs) have emerged as an effective method for novel-view synthesis and 3D scene reconstruction. However, conventional training methods require access to all training views during scene optimization. This assumption may be prohibitive in continual learning scenarios, where new data is acquired in a sequential manner and a continuous update of the NeRF is desired, as in automotive or remote sensing applications. When naively trained in such a continual setting, traditional scene representation frameworks suffer from catastrophic forgetting, where previously learned knowledge is corrupted after training on new data. Prior works in alleviating forgetting with NeRFs suffer from low reconstruction quality and high latency, making them impractical for real-world application. We propose a continual learning framework for training NeRFs that leverages replay-based methods combined with a hybrid explicit--implicit scene representation. Our method outperforms previous methods in reconstruction quality when trained in a continual setting, while having the additional benefit of being an order of magnitude faster.

Accuracy and Consistency of Space-based Vegetation Height Maps for Forest Dynamics in Alpine Terrain

  • paper_url: http://arxiv.org/abs/2309.01797
  • repo_url: None
  • paper_authors: Yuchang Jiang, Marius Rüetschi, Vivien Sainte Fare Garnot, Mauro Marty, Konrad Schindler, Christian Ginzler, Jan D. Wegner
  • for: increasing the temporal resolution of the Swiss National Forest Inventory's countrywide vegetation height mapping.
  • methods: uses Sentinel-2 satellite imagery and deep learning to generate large-scale vegetation height maps.
  • results: produces annual, countrywide vegetation height maps at 10 m ground sampling distance for 2017-2020 and shows the maps can be used for change detection, indicating changes as small as 250 m².
    Abstract Monitoring and understanding forest dynamics is essential for environmental conservation and management. This is why the Swiss National Forest Inventory (NFI) provides countrywide vegetation height maps at a spatial resolution of 0.5 m. Its long update time of 6 years, however, limits the temporal analysis of forest dynamics. This can be improved by using spaceborne remote sensing and deep learning to generate large-scale vegetation height maps in a cost-effective way. In this paper, we present an in-depth analysis of these methods for operational application in Switzerland. We generate annual, countrywide vegetation height maps at a 10-meter ground sampling distance for the years 2017 to 2020 based on Sentinel-2 satellite imagery. In comparison to previous works, we conduct a large-scale and detailed stratified analysis against a precise Airborne Laser Scanning reference dataset. This stratified analysis reveals a close relationship between the model accuracy and the topology, especially slope and aspect. We assess the potential of deep learning-derived height maps for change detection and find that these maps can indicate changes as small as 250 $m^2$. Larger-scale changes caused by a winter storm are detected with an F1-score of 0.77. Our results demonstrate that vegetation height maps computed from satellite imagery with deep learning are a valuable, complementary, cost-effective source of evidence to increase the temporal resolution for national forest assessments.

Safe and Robust Watermark Injection with a Single OoD Image

  • paper_url: http://arxiv.org/abs/2309.01786
  • repo_url: None
  • paper_authors: Shuyang Yu, Junyuan Hong, Haobo Zhang, Haotao Wang, Zhangyang Wang, Jiayu Zhou
  • for: protecting the intellectual property and commercial ownership of deep neural network models.
  • methods: uses a single out-of-distribution (OoD) image as a secret key and randomly perturbs model parameters during watermark injection to defend against common watermark removal attacks.
  • results: proposes a safe and robust watermark injection technique that is time- and sample-efficient, requires no training data, and remains robust against fine-tuning, pruning, and model extraction.
    Abstract Training a high-performance deep neural network requires large amounts of data and computational resources. Protecting the intellectual property (IP) and commercial ownership of a deep model is challenging yet increasingly crucial. A major stream of watermarking strategies implants verifiable backdoor triggers by poisoning training samples, but these are often unrealistic due to data privacy and safety concerns and are vulnerable to minor model changes such as fine-tuning. To overcome these challenges, we propose a safe and robust backdoor-based watermark injection technique that leverages the diverse knowledge from a single out-of-distribution (OoD) image, which serves as a secret key for IP verification. The independence of training data makes it agnostic to third-party promises of IP security. We induce robustness via random perturbation of model parameters during watermark injection to defend against common watermark removal attacks, including fine-tuning, pruning, and model extraction. Our experimental results demonstrate that the proposed watermarking approach is not only time- and sample-efficient without training data, but also robust against the watermark removal attacks above.
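    Verification in backdoor-style watermarking schemes of this kind typically amounts to checking whether a suspect model still responds to the secret key with the designated label. The sketch below illustrates that generic check only; the key-image augmentation, target label, and threshold are assumptions, not the paper's protocol.

```python
import torch

@torch.no_grad()
def verify_watermark(suspect_model, key_images, target_label, threshold=0.9):
    """Return (is_watermarked, success_rate) for a suspect model.

    key_images: tensor of shape (N, C, H, W) -- e.g. augmented copies of the single
    out-of-distribution key image used during injection (illustrative setup).
    """
    suspect_model.eval()
    preds = suspect_model(key_images).argmax(dim=1)
    success_rate = (preds == target_label).float().mean().item()
    return success_rate >= threshold, success_rate
```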

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

  • paper_url: http://arxiv.org/abs/2309.01770
  • repo_url: None
  • paper_authors: Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo
  • for: proposes a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs and produces an output image in a single pass, without training a separate LoRA for each style.
  • methods: uses two components, a two-path cross-attention module (TPCA) and three decoupling strategies, which let the model process the prompt and style reference features separately and reduce the strong coupling between semantic and style information in the style references.
  • results: experiments show that the method generates high-quality images that match the prompt content and adopt the style of the references, including unseen styles, in a single pass, making it more flexible and efficient than previous methods.
    Abstract This paper presents a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs and produces an output image in a single pass. Unlike existing methods that rely on training a separate LoRA for each style, our method can adapt to various styles with a unified model. However, this poses two challenges: 1) the prompt loses controllability over the generated content, and 2) the output image inherits both the semantic and style features of the style reference image, compromising its content fidelity. To address these challenges, we introduce StyleAdapter, a model that comprises two components: a two-path cross-attention module (TPCA) and three decoupling strategies. These components enable our model to process the prompt and style reference features separately and reduce the strong coupling between the semantic and style information in the style references. StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods. Experiments have been conducted to demonstrate the superiority of our method over previous works.

BLiSS: Bootstrapped Linear Shape Space

  • paper_url: http://arxiv.org/abs/2309.01765
  • repo_url: None
  • paper_authors: Sanjeev Muralikrishnan, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra
  • for: creating morphable human shape models, which requires establishing dense correspondences across raw scans.
  • methods: bootstraps from a small set of manually registered scans and solves shape-space construction and dense correspondence jointly and progressively, using a non-linear deformation model to capture details missed by the low-dimensional shape space.
  • results: BLiSS automatically brings new unregistered scans into correspondence while progressively enriching the shape space.
    Abstract Morphable models are fundamental to numerous human-centered processes as they offer a simple yet expressive shape space. Creating such morphable models, however, is both tedious and expensive. The main challenge is establishing dense correspondences across raw scans that capture sufficient shape variation. This is often addressed using a mix of significant manual intervention and non-rigid registration. We observe that creating a shape space and solving for dense correspondence are tightly coupled -- while dense correspondence is needed to build shape spaces, an expressive shape space provides a reduced dimensional space to regularize the search. We introduce BLiSS, a method to solve both progressively. Starting from a small set of manually registered scans to bootstrap the process, we enrich the shape space and then use that to get new unregistered scans into correspondence automatically. The critical component of BLiSS is a non-linear deformation model that captures details missed by the low-dimensional shape space, thus allowing progressive enrichment of the space.

Multispectral Indices for Wildfire Management

  • paper_url: http://arxiv.org/abs/2309.01751
  • repo_url: None
  • paper_authors: Afonso Oliveira, João P. Matos-Carvalho, Filipe Moutinho, Nuno Fachada
  • for: summarizing the most important multispectral indices and associated methodologies for wildfire prevention and management.
  • methods: surveys the fields where multispectral indices align with wildfire prevention and management, including vegetation and soil attribute extraction, water feature mapping, artificial structure identification, and post-fire burnt area estimation.
  • results: highlights the versatility and effectiveness of multispectral indices for wildfire management and suggests concrete indices for each task, such as the NDVI and the NDWI; to improve accuracy and address the limitations of individual indices, it recommends integrating complementary processing solutions and additional data sources such as high-resolution imagery and ground-based measurements.
    Abstract This paper highlights and summarizes the most important multispectral indices and associated methodologies for fire management. Various fields of study are examined where multispectral indices align with wildfire prevention and management, including vegetation and soil attribute extraction, water feature mapping, artificial structure identification, and post-fire burnt area estimation. The versatility and effectiveness of multispectral indices in addressing specific issues in wildfire management are emphasized. Fundamental insights for optimizing data extraction are presented. Concrete indices for each task, including the NDVI and the NDWI, are suggested. Moreover, to enhance accuracy and address inherent limitations of individual index applications, the integration of complementary processing solutions and additional data sources like high-resolution imagery and ground-based measurements is recommended. This paper aims to be an immediate and comprehensive reference for researchers and stakeholders working on multispectral indices related to the prevention and management of fires.
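    The two indices suggested above are simple normalized band ratios. A minimal sketch of how they are computed from co-registered reflectance bands; the example values are synthetic.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index: high for healthy vegetation."""
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-6):
    """Normalized Difference Water Index (McFeeters): high for open water."""
    return (green - nir) / (green + nir + eps)

# Example on synthetic 2x2 reflectance patches:
nir   = np.array([[0.45, 0.50], [0.10, 0.08]])
red   = np.array([[0.08, 0.07], [0.09, 0.07]])
green = np.array([[0.10, 0.11], [0.30, 0.28]])
print(ndvi(nir, red))    # vegetation pixels (top row) score around +0.7
print(ndwi(green, nir))  # water-like pixels (bottom row) score positive
```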

Generative-based Fusion Mechanism for Multi-Modal Tracking

  • paper_url: http://arxiv.org/abs/2309.01728
  • repo_url: https://github.com/zhangyong-tang/gmmt
  • paper_authors: Zhangyong Tang, Tianyang Xu, Xuefeng Zhu, Xiao-Jun Wu, Josef Kittler
  • for: investigates how generative model techniques can address the key challenge of information fusion in multi-modal tracking.
  • methods: studies two prominent generative techniques, Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs); instead of feeding each modality's features directly into the fusion block, the multi-modal features are conditioned with random noise in the generative framework, effectively turning the original training samples into harder instances.
  • results: extensive experiments show that the generative-based fusion mechanism improves multi-modal tracking and sets new records on LasHeR and RGBD1K.
    Abstract Generative models (GMs) have received increasing research interest for their remarkable capacity to achieve comprehensive understanding. However, their potential application in the domain of multi-modal tracking has remained relatively unexplored. In this context, we seek to uncover the potential of harnessing generative techniques to address the critical challenge, information fusion, in multi-modal tracking. In this paper, we delve into two prominent GM techniques, namely, Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs). Different from the standard fusion process where the features from each modality are directly fed into the fusion block, we condition these multi-modal features with random noise in the GM framework, effectively transforming the original training samples into harder instances. This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance. To quantitatively gauge the effectiveness of our approach, we conduct extensive experiments across two multi-modal tracking tasks, three baseline methods, and three challenging benchmarks. The experimental results demonstrate that the proposed generative-based fusion mechanism achieves state-of-the-art performance, setting new records on LasHeR and RGBD1K.

SAF-IS: a Spatial Annotation Free Framework for Instance Segmentation of Surgical Tools

  • paper_url: http://arxiv.org/abs/2309.01723
  • repo_url: None
  • paper_authors: Luca Sestini, Benoit Rosa, Elena De Momi, Giancarlo Ferrigno, Nicolas Padoy
  • for: This paper aims to develop a framework for instance segmentation of surgical instruments without requiring expensive pixel-level annotations.
  • methods: The proposed solution uses binary tool masks and binary tool presence labels to train a tool instance classifier, leveraging unsupervised binary segmentation models to obtain the masks.
  • results: The approach outperforms several state-of-the-art fully-supervised segmentation methods and is completely free from spatial annotations.
    Abstract Instance segmentation of surgical instruments is a long-standing research problem, crucial for the development of many applications for computer-assisted surgery. This problem is commonly tackled via fully-supervised training of deep learning models, requiring expensive pixel-level annotations to train. In this work, we develop a framework for instance segmentation not relying on spatial annotations for training. Instead, our solution only requires binary tool masks, obtainable using recent unsupervised approaches, and binary tool presence labels, freely obtainable in robot-assisted surgery. Based on the binary mask information, our solution learns to extract individual tool instances from single frames, and to encode each instance into a compact vector representation, capturing its semantic features. Such representations guide the automatic selection of a tiny number of instances (8 only in our experiments), displayed to a human operator for tool-type labelling. The gathered information is finally used to match each training instance with a binary tool presence label, providing an effective supervision signal to train a tool instance classifier. We validate our framework on the EndoVis 2017 and 2018 segmentation datasets. We provide results using binary masks obtained either by manual annotation or as predictions of an unsupervised binary segmentation model. The latter solution yields an instance segmentation approach completely free from spatial annotations, outperforming several state-of-the-art fully-supervised segmentation approaches.

ControlMat: A Controlled Generative Approach to Material Capture

  • paper_url: http://arxiv.org/abs/2309.01700
  • repo_url: None
  • paper_authors: Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, Tamy Boubekeur
  • for: proposes a controlled generative approach that produces plausible, tileable, physically-based digital materials from a single photograph.
  • methods: conditions a diffusion model with multi-channel outputs for controlled synthesis, adapts the sampling process to fuse multi-scale information, and introduces rolled diffusion to enable both tileability and patched diffusion for high-resolution outputs.
  • results: outperforms recent inference and latent-space-optimization methods, and the design choices of the diffusion process are carefully validated.
    Abstract Material reconstruction from a photograph is a key component of 3D content creation democratization. We propose to formulate this ill-posed problem as a controlled synthesis one, leveraging the recent progress in generative deep networks. We present ControlMat, a method which, given a single photograph with uncontrolled illumination as input, conditions a diffusion model to generate plausible, tileable, high-resolution physically-based digital materials. We carefully analyze the behavior of diffusion models for multi-channel outputs, adapt the sampling process to fuse multi-scale information and introduce rolled diffusion to enable both tileability and patched diffusion for high-resolution outputs. Our generative approach further permits exploration of a variety of materials which could correspond to the input image, mitigating the unknown lighting conditions. We show that our approach outperforms recent inference and latent-space-optimization methods, and carefully validate our diffusion process design choices. Supplemental materials and additional details are available at: https://gvecchio.com/controlmat/.

Mask-Attention-Free Transformer for 3D Instance Segmentation

  • paper_url: http://arxiv.org/abs/2309.01692
  • repo_url: https://github.com/dvlab-research/mask-attention-free-transformer
  • paper_authors: Xin Lai, Yuhui Yuan, Ruihang Chu, Yukang Chen, Han Hu, Jiaya Jia
  • for: speeding up convergence and improving accuracy in 3D instance segmentation, where mask attention suffers from low-recall initial instance masks.
  • methods: abandons the mask-attention design in favor of an auxiliary center regression task, learning a dense spatial distribution of 3D positions as initial queries and using relative position encoding for cross-attention with iterative refinement.
  • results: converges 4x faster than existing work, sets a new state of the art on the ScanNetv2 3D instance segmentation benchmark, and shows superior performance across various datasets.
    Abstract Recently, transformer-based methods have dominated 3D instance segmentation, where mask attention is commonly involved. Specifically, object queries are guided by the initial instance masks in the first cross-attention, and then iteratively refine themselves in a similar manner. However, we observe that the mask-attention pipeline usually leads to slow convergence due to low-recall initial instance masks. Therefore, we abandon the mask attention design and resort to an auxiliary center regression task instead. Through center regression, we effectively overcome the low-recall issue and perform cross-attention by imposing positional prior. To reach this goal, we develop a series of position-aware designs. First, we learn a spatial distribution of 3D locations as the initial position queries. They spread over the 3D space densely, and thus can easily capture the objects in a scene with a high recall. Moreover, we present relative position encoding for the cross-attention and iterative refinement for more accurate position queries. Experiments show that our approach converges 4x faster than existing work, sets a new state of the art on ScanNetv2 3D instance segmentation benchmark, and also demonstrates superior performance across various datasets. Code and models are available at https://github.com/dvlab-research/Mask-Attention-Free-Transformer.

Prior Knowledge Guided Network for Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.01682
  • repo_url: None
  • paper_authors: Zhewen Deng, Dongyue Chen, Shizhuo Deng
  • for: video anomaly detection (VAD)
  • methods: incorporates an auto-encoder into a teacher-student architecture to learn two proxy tasks (future frame prediction and teacher network imitation) for better generalization, and applies knowledge distillation on proper feature blocks to increase multi-scale detection ability; prediction error and teacher-student feature inconsistency are combined to score anomalies.
  • results: experiments on three public benchmarks validate the effectiveness and accuracy of the method, which surpasses recent state of the art.
    Abstract Video Anomaly Detection (VAD) involves detecting anomalous events in videos, presenting a significant and intricate task within intelligent video surveillance. Existing studies often concentrate solely on features acquired from limited normal data, disregarding the latent prior knowledge present in extensive natural image datasets. To address this constraint, we propose a Prior Knowledge Guided Network(PKG-Net) for the VAD task. First, an auto-encoder network is incorporated into a teacher-student architecture to learn two designated proxy tasks: future frame prediction and teacher network imitation, which can provide better generalization ability on unknown samples. Second, knowledge distillation on proper feature blocks is also proposed to increase the multi-scale detection ability of the model. In addition, prediction error and teacher-student feature inconsistency are combined to evaluate anomaly scores of inference samples more comprehensively. Experimental results on three public benchmarks validate the effectiveness and accuracy of our method, which surpasses recent state-of-the-arts.
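    The scoring rule in the abstract combines frame-prediction error with teacher-student feature inconsistency. The exact weighting and distance functions are not given here, so the sketch below simply takes a weighted sum of the two terms; the cosine distance and the weight alpha are assumptions.

```python
import torch
import torch.nn.functional as F

def anomaly_score(pred_frame, true_frame, student_feats, teacher_feats, alpha=0.5):
    """Higher score = more anomalous. Combines prediction error with
    teacher-student feature inconsistency (illustrative weighting)."""
    pred_err = F.mse_loss(pred_frame, true_frame)
    # One inconsistency term per compared feature block, averaged over blocks.
    inconsist = torch.stack([
        1.0 - F.cosine_similarity(s.flatten(1), t.flatten(1), dim=1).mean()
        for s, t in zip(student_feats, teacher_feats)
    ]).mean()
    return alpha * pred_err + (1.0 - alpha) * inconsist
```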

Building Footprint Extraction in Dense Areas using Super Resolution and Frame Field Learning

  • paper_url: http://arxiv.org/abs/2309.01656
  • repo_url: None
  • paper_authors: Vuong Nguyen, Anh Ho, Duc-Anh Vu, Nguyen Thi Ngoc Anh, Tran Ngoc Thang
  • for: extracting accurate polygonal building footprints in dense areas, where challenging properties and limited data cause current methods to fail.
  • methods: applies super resolution to enhance the spatial resolution of aerial imagery, then feeds it to a multitask learning module with a segmentation head and a frame field learning head to handle irregular building structures, supervised with adaptive loss weighting.
  • results: experiments on a slum area in India that mimics a dense area show the approach significantly outperforms current state-of-the-art methods by a large margin.
    Abstract Despite notable results on standard aerial datasets, current state-of-the-arts fail to produce accurate building footprints in dense areas due to challenging properties posed by these areas and limited data availability. In this paper, we propose a framework to address such issues in polygonal building extraction. First, super resolution is employed to enhance the spatial resolution of aerial image, allowing for finer details to be captured. This enhanced imagery serves as input to a multitask learning module, which consists of a segmentation head and a frame field learning head to effectively handle the irregular building structures. Our model is supervised by adaptive loss weighting, enabling extraction of sharp edges and fine-grained polygons which is difficult due to overlapping buildings and low data quality. Extensive experiments on a slum area in India that mimics a dense area demonstrate that our proposed approach significantly outperforms the current state-of-the-art methods by a large margin.

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

  • paper_url: http://arxiv.org/abs/2309.03350
  • repo_url: https://github.com/THUDM/RelayDiffusion
  • paper_authors: Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang
  • for: high-resolution image generation with diffusion models.
  • methods: proposes the Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one via blurring diffusion and block noise, so the diffusion process can continue seamlessly at any new resolution or in any new model without restarting from pure noise or low-resolution conditioning.
  • results: achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin.
    Abstract Diffusion models achieved great success in image synthesis, but still face challenges in high-resolution generation. Through the lens of discrete cosine transformation, we find the main reason is that \emph{the same noise level on a higher resolution results in a higher Signal-to-Noise Ratio in the frequency domain}. In this work, we present Relay Diffusion Model (RDM), which transfers a low-resolution image or noise into an equivalent high-resolution one for diffusion model via blurring diffusion and block noise. Therefore, the diffusion process can continue seamlessly in any new resolution or model without restarting from pure noise or low-resolution conditioning. RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256, surpassing previous works such as ADM, LDM and DiT by a large margin. All the codes and checkpoints are open-sourced at \url{https://github.com/THUDM/RelayDiffusion}.

ReLoc-PDR: Visual Relocalization Enhanced Pedestrian Dead Reckoning via Graph Optimization

  • paper_url: http://arxiv.org/abs/2309.01646
  • repo_url: None
  • paper_authors: Zongyang Chen, Xianfei Pan, Changhao Chen
  • for: accurately and reliably positioning pedestrians in satellite-denied conditions using low-cost inertial sensors.
  • methods: fuses pedestrian dead reckoning (PDR) with visual relocalization via graph optimization, leveraging time-correlated visual observations and learned descriptors, with a Tukey-kernel fusion mechanism that corrects cumulative errors and mitigates abnormal visual observations.
  • results: real-world experiments show ReLoc-PDR surpasses representative methods in accuracy and robustness, achieving accurate and robust pedestrian positioning with only a smartphone in challenging environments such as less-textured corridors and dark nighttime scenarios.
    Abstract Accurately and reliably positioning pedestrians in satellite-denied conditions remains a significant challenge. Pedestrian dead reckoning (PDR) is commonly employed to estimate pedestrian location using low-cost inertial sensors. However, PDR is susceptible to drift due to sensor noise, incorrect step detection, and inaccurate stride length estimation. This work proposes ReLoc-PDR, a fusion framework combining PDR and visual relocalization using graph optimization. ReLoc-PDR leverages time-correlated visual observations and learned descriptors to achieve robust positioning in visually-degraded environments. A graph optimization-based fusion mechanism with the Tukey kernel effectively corrects cumulative errors and mitigates the impact of abnormal visual observations. Real-world experiments demonstrate that our ReLoc-PDR surpasses representative methods in accuracy and robustness, achieving accurate and robust pedestrian positioning results using only a smartphone in challenging environments such as less-textured corridors and dark nighttime scenarios.
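    The Tukey kernel mentioned in the fusion mechanism is the standard Tukey biweight robust loss: it behaves quadratically for small residuals but saturates beyond a cutoff c, so gross visual-relocalization outliers contribute only a bounded cost to the pose graph. A minimal sketch, where the cutoff is the conventional choice for normalized residuals rather than necessarily the paper's value:

```python
import numpy as np

def tukey_biweight(residual, c=4.685):
    """Tukey biweight (bisquare) robust loss.

    Quadratic near zero, but constant at c**2 / 6 for |residual| >= c,
    so large outlier residuals cannot dominate the graph optimization.
    """
    r = np.abs(residual)
    rho_inlier = (c**2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return np.where(r <= c, rho_inlier, c**2 / 6.0)

print(tukey_biweight(np.array([0.5, 2.0, 10.0])))  # the 10.0 outlier saturates
```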

Cross-Consistent Deep Unfolding Network for Adaptive All-In-One Video Restoration

  • paper_url: http://arxiv.org/abs/2309.01627
  • repo_url: None
  • paper_authors: Yuanshuo Cheng, Mingwen Shao, Yecong Wan, Lixu Zhang, Wangmeng Zuo, Deyu Meng
  • for: improving the practicality of video restoration (VR), where existing methods require a separate model for each kind of adverse-weather degradation.
  • methods: proposes a Cross-consistent Deep Unfolding Network (CDUN) that removes diverse degradations with a single model, using a Sequence-wise Adaptive Degradation Estimator (SADE) to estimate degradation features and a window-based inter-frame fusion strategy to enlarge the temporal receptive field.
  • results: extensive experiments show the proposed method achieves state-of-the-art performance in all-in-one video restoration.
    Abstract Existing Video Restoration (VR) methods always necessitate the individual deployment of models for each adverse weather to remove diverse adverse weather degradations, lacking the capability for adaptive processing of degradations. Such limitation amplifies the complexity and deployment costs in practical applications. To overcome this deficiency, in this paper, we propose a Cross-consistent Deep Unfolding Network (CDUN) for All-In-One VR, which enables the employment of a single model to remove diverse degradations for the first time. Specifically, the proposed CDUN accomplishes a novel iterative optimization framework, capable of restoring frames corrupted by corresponding degradations according to the degradation features given in advance. To empower the framework for eliminating diverse degradations, we devise a Sequence-wise Adaptive Degradation Estimator (SADE) to estimate degradation features for the input corrupted video. By orchestrating these two cascading procedures, CDUN achieves adaptive processing for diverse degradation. In addition, we introduce a window-based inter-frame fusion strategy to utilize information from more adjacent frames. This strategy involves the progressive stacking of temporal windows in multiple iterations, effectively enlarging the temporal receptive field and enabling each frame's restoration to leverage information from distant frames. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance in All-In-One VR.

AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion

  • paper_url: http://arxiv.org/abs/2309.01624
  • repo_url: None
  • paper_authors: Dongyue Chen, Tingxuan Huang, Zhimin Song, Shizhuo Deng, Tong Jia
  • for: improving the quality of depth maps acquired by RGB-D cameras, which inevitably contain invalid data such as weak reflections, boundary shadows, and artifacts.
  • methods: proposes an Attention Guided Gated-convolutional Network (AGG-Net) with a UNet-like architecture of parallel depth and color branches, an Attention Guided Gated-Convolution (AG-GConv) module that fuses depth and color features at different scales, and an Attention Guided Skip Connection (AG-SC) module that avoids introducing depth-irrelevant features into the reconstruction.
  • results: outperforms state-of-the-art methods on the NYU-Depth V2, DIML, and SUN RGB-D benchmarks.
    Abstract Recently, stereo vision based on lightweight RGBD cameras has been widely used in various fields. However, limited by the imaging principles, the commonly used RGB-D cameras based on TOF, structured light, or binocular vision acquire some invalid data inevitably, such as weak reflection, boundary shadows, and artifacts, which may bring adverse impacts to the follow-up work. In this paper, we propose a new model for depth image completion based on the Attention Guided Gated-convolutional Network (AGG-Net), through which more accurate and reliable depth images can be obtained from the raw depth maps and the corresponding RGB images. Our model employs a UNet-like architecture which consists of two parallel branches of depth and color features. In the encoding stage, an Attention Guided Gated-Convolution (AG-GConv) module is proposed to realize the fusion of depth and color features at different scales, which can effectively reduce the negative impacts of invalid depth data on the reconstruction. In the decoding stage, an Attention Guided Skip Connection (AG-SC) module is presented to avoid introducing too many depth-irrelevant features to the reconstruction. The experimental results demonstrate that our method outperforms the state-of-the-art methods on the popular benchmarks NYU-Depth V2, DIML, and SUN RGB-D.
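    AG-GConv itself is not reproduced here, but its gated-convolution backbone, a feature branch modulated by a learned soft gate originally proposed for image inpainting, can be sketched as follows; the attention guidance that injects color cues into the gate is the paper's contribution and is omitted.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Plain gated convolution: output = phi(features) * sigmoid(gate).

    The gate learns to suppress unreliable locations (e.g. invalid depth
    pixels), which is the building block AG-GConv extends with attention
    guidance from the color branch.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))
```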

Hindering Adversarial Attacks with Multiple Encrypted Patch Embeddings

  • paper_url: http://arxiv.org/abs/2309.01620
  • repo_url: None
  • paper_authors: AprilPyone MaungMaung, Isao Echizen, Hitoshi Kiya
  • for: proposes a new key-based defense against adversarial examples that targets both efficiency and robustness.
  • methods: builds on previous key-based defenses with two major improvements, efficient training and optional randomization; the defense uses one or more secret patch embeddings and classifier heads with a pre-trained isotropic network, and enables randomization at inference when multiple secret embeddings are used.
  • results: on the ImageNet dataset, against an arsenal of state-of-the-art attacks including adaptive ones, the defense achieves high robust accuracy and clean accuracy comparable to the previous key-based defense.
    Abstract In this paper, we propose a new key-based defense focusing on both efficiency and robustness. Although the previous key-based defense seems effective in defending against adversarial examples, carefully designed adaptive attacks can bypass the previous defense, and it is difficult to train the previous defense on large datasets like ImageNet. We build upon the previous defense with two major improvements: (1) efficient training and (2) optional randomization. The proposed defense utilizes one or more secret patch embeddings and classifier heads with a pre-trained isotropic network. When more than one secret embeddings are used, the proposed defense enables randomization on inference. Experiments were carried out on the ImageNet dataset, and the proposed defense was evaluated against an arsenal of state-of-the-art attacks, including adaptive ones. The results show that the proposed defense achieves a high robust accuracy and a comparable clean accuracy compared to the previous key-based defense.
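    The secret patch embedding is not specified in detail here; earlier key-based defenses in this line transform inputs block-wise with a secret key so that only a model adapted to the same key interprets them correctly. The sketch below shows a generic keyed, per-patch pixel permutation as a stand-in; it is an assumption for illustration, not the paper's exact embedding.

```python
import torch

def keyed_patch_shuffle(images, key, patch=16):
    """Shuffle pixels inside each non-overlapping patch with a secret key.

    Illustrative stand-in for a secret patch embedding: a fixed permutation,
    derived from the key, is applied identically to every patch, so only a
    classifier head trained behind the same keyed embedding works correctly.
    """
    b, c, h, w = images.shape
    g = torch.Generator().manual_seed(key)                       # secret key -> fixed permutation
    perm = torch.randperm(c * patch * patch, generator=g)
    # Split into patches, permute inside every patch, reassemble.
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)
    x = x[:, :, perm]
    x = x.reshape(b, h // patch, w // patch, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, c, h, w)
```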

On the Query Strategies for Efficient Online Active Distillation

  • paper_url: http://arxiv.org/abs/2309.01612
  • repo_url: None
  • paper_authors: Michele Boldo, Enrico Martini, Mirco De Marchi, Stefano Aldegheri, Nicola Bombieri
  • for: improving training efficiency and real-time adaptation of human pose estimation (HPE) models through active learning and online distillation.
  • methods: evaluates a set of query strategies for selecting frames during training, using two approaches: a classical offline method and an online evaluation through a continual learning approach employing knowledge distillation, on a popular state-of-the-art HPE dataset.
  • results: demonstrates the possibility of training lightweight models at the edge, adapting them effectively to new contexts in real time while keeping computational cost low.
    Abstract Deep Learning (DL) requires lots of time and data, resulting in high computational demands. Recently, researchers employ Active Learning (AL) and online distillation to enhance training efficiency and real-time model adaptation. This paper evaluates a set of query strategies to achieve the best training results. It focuses on Human Pose Estimation (HPE) applications, assessing the impact of selected frames during training using two approaches: a classical offline method and a online evaluation through a continual learning approach employing knowledge distillation, on a popular state-of-the-art HPE dataset. The paper demonstrates the possibility of enabling training at the edge lightweight models, adapting them effectively to new contexts in real-time.
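    A common query strategy in this setting scores incoming frames by student-teacher disagreement and distills the teacher into the lightweight student only on the highest-scoring frames. The sketch below illustrates that idea under the assumption of heatmap-style pose outputs; the paper compares several query strategies, not this exact one.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def query_score(student, teacher, frame):
    """Disagreement between student and teacher outputs; higher = more informative."""
    return F.mse_loss(student(frame), teacher(frame)).item()

def online_distill_step(student, teacher, frames, optimizer, budget=4):
    """Select the `budget` most informative frames and distill the teacher into the student."""
    scores = [query_score(student, teacher, f) for f in frames]
    chosen = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:budget]
    for i in chosen:
        with torch.no_grad():
            target = teacher(frames[i])          # teacher heatmaps act as soft labels
        loss = F.mse_loss(student(frames[i]), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```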

Segmentation of 3D pore space from CT images using curvilinear skeleton: application to numerical simulation of microbial decomposition

  • paper_url: http://arxiv.org/abs/2309.01611
  • repo_url: None
  • paper_authors: Olivier Monga, Zakaria Belghali, Mouad Klai, Lucie Druoton, Dominique Michelucci, Valerie Pot
  • for: presenting a new method for describing the pore space of soil using the curvilinear skeleton, to improve the accuracy and efficiency of numerical simulations of microbial decomposition and diffusion processes.
  • methods: extracts the pore space from 3D X-ray CT scanner images and segments it into geometrically relevant connected regions computed from the curvilinear skeleton, comparing the results with other geometric representations of pore space such as balls and voxels.
  • results: validates the simulation outputs against other pore space geometrical representations, showing that the curvilinear skeleton-based description supports accurate and efficient simulation of microbial decomposition and diffusion in soil.
    Abstract Recent advances in 3D X-ray Computed Tomographic (CT) sensors have stimulated research efforts to unveil the extremely complex micro-scale processes that control the activity of soil microorganisms. Voxel-based description (up to hundreds millions voxels) of the pore space can be extracted, from grey level 3D CT scanner images, by means of simple image processing tools. Classical methods for numerical simulation of biological dynamics using mesh of voxels, such as Lattice Boltzmann Model (LBM), are too much time consuming. Thus, the use of more compact and reliable geometrical representations of pore space can drastically decrease the computational cost of the simulations. Several recent works propose basic analytic volume primitives (e.g. spheres, generalized cylinders, ellipsoids) to define a piece-wise approximation of pore space for numerical simulation of draining, diffusion and microbial decomposition. Such approaches work well but the drawback is that it generates approximation errors. In the present work, we study another alternative where pore space is described by means of geometrically relevant connected subsets of voxels (regions) computed from the curvilinear skeleton. Indeed, many works use the curvilinear skeleton (3D medial axis) for analyzing and partitioning 3D shapes within various domains (medicine, material sciences, petroleum engineering, etc.) but only a few ones in soil sciences. Within the context of soil sciences, most studies dealing with 3D medial axis focus on the determination of pore throats. Here, we segment pore space using curvilinear skeleton in order to achieve numerical simulation of microbial decomposition (including diffusion processes). We validate simulation outputs by comparison with other methods using different pore space geometrical representations (balls, voxels).
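A minimal version of the skeleton-based pipeline described above can be sketched with standard Python imaging tools (this is not the authors' code, and the threshold choice and nearest-branch assignment are our own simplifying assumptions): binarize the CT volume, extract the 3D curvilinear skeleton, and label connected voxel regions that would serve as the compact geometric primitives for simulation.

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize_3d

def pore_space_regions(ct_volume, threshold):
    """Sketch of a skeleton-based pore-space segmentation.
    ct_volume : 3D grey-level CT array; threshold separates pores from solid."""
    pores = ct_volume < threshold                 # binary pore space (assumption: dark = pore)
    skeleton = skeletonize_3d(pores)              # curvilinear skeleton (3D medial axis)
    # Label connected skeleton branches; each label seeds one pore region.
    labeled_skel, n_branches = ndimage.label(skeleton)
    # Assign every pore voxel to its nearest skeleton branch (Euclidean approximation).
    _, nearest = ndimage.distance_transform_edt(labeled_skel == 0, return_indices=True)
    regions = labeled_skel[tuple(nearest)] * pores
    return regions, n_branches

# Example with a synthetic volume:
# vol = np.random.rand(64, 64, 64)
# regions, n = pore_space_regions(vol, threshold=0.2)
```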

Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models

  • paper_url: http://arxiv.org/abs/2309.01590
  • repo_url: https://github.com/kdst-team/probablistic_precision_recall
  • paper_authors: Dogyun Park, Suhyun Kim
  • for: This paper focuses on evaluating the fidelity and diversity of generative models, specifically addressing the limitations of existing k-Nearest Neighbor (kNN) based precision-recall metrics.
  • methods: The authors propose novel metrics, P-precision and P-recall (PP&PR), based on a probabilistic approach, to address the oversimplified assumptions and undesirable properties of kNN.
  • results: Through extensive toy experiments and state-of-the-art generative models, the authors show that PP&PR provide more reliable estimates for comparing fidelity and diversity than existing metrics.
    Abstract Assessing the fidelity and diversity of the generative model is a difficult but important issue for technological advancement. So, recent papers have introduced k-Nearest Neighbor ($k$NN) based precision-recall metrics to break down the statistical distance into fidelity and diversity. While they provide an intuitive method, we thoroughly analyze these metrics and identify oversimplified assumptions and undesirable properties of kNN that result in unreliable evaluation, such as susceptibility to outliers and insensitivity to distributional changes. Thus, we propose novel metrics, P-precision and P-recall (PP\&PR), based on a probabilistic approach that address the problems. Through extensive investigations on toy experiments and state-of-the-art generative models, we show that our PP\&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics. The codes are available at \url{https://github.com/kdst-team/Probablistic_precision_recall}.
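For context, the kNN-based precision that this paper critiques counts how many generated samples fall inside the union of kNN-balls around real samples, and recall swaps the roles of the two sets. A compact sketch of that baseline, which the proposed probabilistic PP&PR replaces with soft, distance-weighted membership, is shown below; it is not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_precision_recall(real, fake, k=3):
    """Baseline kNN precision/recall over feature sets.
    real, fake : (N, D) feature arrays. Returns (precision, recall)."""
    def knn_radii(x, k):
        d = cdist(x, x)                        # pairwise distances within the set
        return np.sort(d, axis=1)[:, k]        # distance to k-th neighbour (index 0 is self)

    def coverage(samples, manifold, radii):
        # fraction of `samples` falling inside at least one kNN-ball of `manifold`
        d = cdist(samples, manifold)           # (n_samples, n_manifold)
        return float(np.mean(np.any(d <= radii[None, :], axis=1)))

    precision = coverage(fake, real, knn_radii(real, k))   # fake inside real manifold
    recall = coverage(real, fake, knn_radii(fake, k))      # real inside fake manifold
    return precision, recall

# precision, recall = knn_precision_recall(real_feats, fake_feats, k=3)
```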

SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices

  • paper_url: http://arxiv.org/abs/2309.01587
  • repo_url: None
  • paper_authors: Alexander Montgomerie-Corcoran, Petros Toupas, Zhewen Yu, Christos-Savvas Bouganis
  • for: Tackling object detection for modern intelligent vision and image-processing tasks, enabling real-world applications ranging from autonomous driving to medical imaging.
  • methods: Uses a streaming architecture and an automated toolflow to accelerate YOLO models, addressing the challenge of deploying state-of-the-art object detection models onto FPGA devices.
  • results: The generated FPGA accelerators demonstrate performance and energy characteristics competitive with GPU devices and outperform current state-of-the-art FPGA accelerators.
    Abstract AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of applications in real-life scenarios, from autonomous vehicles to medical imaging. Many of those applications require efficient object detection algorithms and complementary real-time, low latency hardware to perform inference of these algorithms. The YOLO family of models is considered the most efficient for object detection, having only a single model pass. Despite this, the complexity and size of YOLO models can be too computationally demanding for current edge-based platforms. To address this, we present SATAY: a Streaming Architecture Toolflow for Accelerating YOLO. This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low latency applications, enabling real-time, edge-based object detection. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. These accelerators are generated using an automated toolflow, and can target a range of suitable FPGA devices. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources. Our toolflow is able to generate accelerator designs which demonstrate competitive performance and energy characteristics to GPU devices, and which outperform current state-of-the-art FPGA accelerators.

Improving Visual Quality and Transferability of Adversarial Attacks on Face Recognition Simultaneously with Adversarial Restoration

  • paper_url: http://arxiv.org/abs/2309.01582
  • repo_url: None
  • paper_authors: Fengfan Zhou, Hefei Ling, Yuxuan Shi, Jiazhong Chen, Ping Li
  • for: Improving both the visual quality and the transferability of adversarial face examples.
  • methods: Proposes a new adversarial attack technique, Adversarial Restoration (AdvRestore), which leverages a face Restoration Latent Diffusion Model (RLDM) prior to enhance the visual quality and transferability of adversarial face examples.
  • results: Experimental results show that the proposed attack substantially improves both the transferability and the visual quality of adversarial face examples.
    Abstract Adversarial face examples possess two critical properties: Visual Quality and Transferability. However, existing approaches rarely address these properties simultaneously, leading to subpar results. To address this issue, we propose a novel adversarial attack technique known as Adversarial Restoration (AdvRestore), which enhances both visual quality and transferability of adversarial face examples by leveraging a face restoration prior. In our approach, we initially train a Restoration Latent Diffusion Model (RLDM) designed for face restoration. Subsequently, we employ the inference process of RLDM to generate adversarial face examples. The adversarial perturbations are applied to the intermediate features of RLDM. Additionally, by treating RLDM face restoration as a sibling task, the transferability of the generated adversarial face examples is further improved. Our experimental results validate the effectiveness of the proposed attack method.

DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion

  • paper_url: http://arxiv.org/abs/2309.01575
  • repo_url: None
  • paper_authors: Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, Patrick Pérez
  • for: Proposes DiffHPE, a new 3D human pose estimation method that leverages diffusion models to improve the accuracy, robustness, and coherence of pose estimates.
  • methods: Combines diffusion models with existing supervised pose-lifting models to refine their predictions.
  • results: Diffusion improves the time-coherence and sagittal symmetry of predictions and yields more robust estimates under occlusion; on the Human3.6M dataset the method performs strongly and remains stable even when occlusion patterns differ between training and inference.
    Abstract We present an innovative approach to 3D Human Pose Estimation (3D-HPE) by integrating cutting-edge diffusion models, which have revolutionized diverse fields, but are relatively unexplored in 3D-HPE. We show that diffusion models enhance the accuracy, robustness, and coherence of human pose estimations. We introduce DiffHPE, a novel strategy for harnessing diffusion models in 3D-HPE, and demonstrate its ability to refine standard supervised 3D-HPE. We also show how diffusion models lead to more robust estimations in the face of occlusions, and improve the time-coherence and the sagittal symmetry of predictions. Using the Human3.6M dataset, we illustrate the effectiveness of our approach and its superiority over existing models, even under adverse situations where the occlusion patterns in training do not match those in inference. Our findings indicate that while standalone diffusion models provide commendable performance, their accuracy is even better in combination with supervised models, opening exciting new avenues for 3D-HPE research.

Raw Data Is All You Need: Virtual Axle Detector with Enhanced Receptive Field

  • paper_url: http://arxiv.org/abs/2309.01574
  • repo_url: None
  • paper_authors: Henik Riedel, Robert Steven Lorenzen, Clemens Hübler
  • for: Developing a new axle detection method that enables real-time application of Bridge Weigh-In-Motion (BWIM) systems without dedicated axle detectors.
  • methods: Adapts the Virtual Axle Detector (VAD) model to process raw acceleration data, which allows the receptive field to be enlarged.
  • results: Compared with the state-of-the-art VAD, the proposed Virtual Axle Detector with Enhanced Receptive field (VADER) improves the F1 score by 73% and spatial accuracy by 39% while cutting computational and memory costs by 99%; with a representative training set and functional sensors, VADER reaches an F1 score of 99.4% and a spatial error of 4.13 cm. The paper also proposes an object-size-driven receptive-field rule for designing CNN architectures, and the results suggest that models using raw data can outperform spectrogram-based ones, supporting raw data as input.
    Abstract Rising maintenance costs of ageing infrastructure necessitate innovative monitoring techniques. This paper presents a new approach for axle detection, enabling real-time application of Bridge Weigh-In-Motion (BWIM) systems without dedicated axle detectors. The proposed method adapts the Virtual Axle Detector (VAD) model to handle raw acceleration data, which allows the receptive field to be increased. The proposed Virtual Axle Detector with Enhanced Receptive field (VADER) improves the F1 score by 73% and spatial accuracy by 39%, while cutting computational and memory costs by 99% compared to the state-of-the-art VAD. VADER reaches an F1 score of 99.4% and a spatial error of 4.13 cm when using a representative training set and functional sensors. We also introduce a novel receptive field (RF) rule for an object-size driven design of Convolutional Neural Network (CNN) architectures. Based on this rule, our results suggest that models using raw data could achieve better performance than those using spectrograms, offering a compelling reason to consider raw data as input.
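The receptive-field rule mentioned above hinges on how the RF of a convolutional stack grows with kernel size, stride, and dilation. A standard way to compute the theoretical RF of a stack of conv/pool layers, useful when sizing a network to the expected object (here, the axle signature in the raw acceleration signal), is sketched below; it is a generic calculator, not the authors' exact rule, and the example configuration is hypothetical.

```python
def receptive_field(layers):
    """Compute the theoretical receptive field of a stack of conv/pool layers.
    layers: list of (kernel_size, stride, dilation) tuples, input-to-output order."""
    rf, jump = 1, 1                          # RF and cumulative stride at the input
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf

# Example: three dilated 1D convolutions over a raw acceleration signal
# (hypothetical configuration, not the VADER architecture):
print(receptive_field([(7, 1, 1), (7, 2, 2), (7, 2, 4)]))   # -> 67 samples
```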

Locality-Aware Hyperspectral Classification

  • paper_url: http://arxiv.org/abs/2309.01561
  • repo_url: https://github.com/zhoufangqin/hylite
  • paper_authors: Fangqin Zhou, Mert Kilickaya, Joaquin Vanschoren
  • for: Improving hyperspectral image classification accuracy by automating classification with Vision Transformers.
  • methods: Introduces the Hyperspectral Locality-aware Image TransformEr (HyLITE), which models both local and spectral information, together with a novel regularization function that promotes the integration of local-to-global information.
  • results: The proposed approach outperforms competing baselines by a significant margin, with accuracy gains of up to 10%.
    Abstract Hyperspectral image classification is gaining popularity for high-precision vision tasks in remote sensing, thanks to their ability to capture visual information available in a wide continuum of spectra. Researchers have been working on automating Hyperspectral image classification, with recent efforts leveraging Vision-Transformers. However, most research models only spectra information and lacks attention to the locality (i.e., neighboring pixels), which may be not sufficiently discriminative, resulting in performance limitations. To address this, we present three contributions: i) We introduce the Hyperspectral Locality-aware Image TransformEr (HyLITE), a vision transformer that models both local and spectral information, ii) A novel regularization function that promotes the integration of local-to-global information, and iii) Our proposed approach outperforms competing baselines by a significant margin, achieving up to 10% gains in accuracy. The trained models and the code are available at HyLITE.

TSTTC: A Large-Scale Dataset for Time-to-Contact Estimation in Driving Scenarios

  • paper_url: http://arxiv.org/abs/2309.01539
  • repo_url: https://github.com/tusen-ai/TSTTC
  • paper_authors: Yuheng Shi, Zehao Huang, Yan Yan, Naiyan Wang, Xiaojie Guo
  • for: Providing a large-scale, object-oriented Time-to-Contact (TTC) dataset in driving scenes to promote research on TTC estimation with a monocular camera.
  • methods: Mines thousands of hours of driving data to select over 200K sequences with a preset data distribution, and uses recent neural rendering methods to augment the number of small-TTC cases.
  • results: Releases the large-scale TTC dataset together with several simple yet effective TTC estimation baselines, which are evaluated extensively on the proposed dataset to demonstrate their effectiveness.
    Abstract Time-to-Contact (TTC) estimation is a critical task for assessing collision risk and is widely used in various driver assistance and autonomous driving systems. The past few decades have witnessed development of related theories and algorithms. The prevalent learning-based methods call for a large-scale TTC dataset in real-world scenarios. In this work, we present a large-scale object oriented TTC dataset in the driving scene for promoting the TTC estimation by a monocular camera. To collect valuable samples and make data with different TTC values relatively balanced, we go through thousands of hours of driving data and select over 200K sequences with a preset data distribution. To augment the quantity of small TTC cases, we also generate clips using the latest Neural rendering methods. Additionally, we provide several simple yet effective TTC estimation baselines and evaluate them extensively on the proposed dataset to demonstrate their effectiveness. The proposed dataset is publicly available at https://open-dataset.tusen.ai/TSTTC.
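Monocular TTC baselines typically exploit the fact that, under a constant closing speed and a pinhole camera, the image-plane scale of an object grows hyperbolically as contact approaches: if an object's bounding-box height grows by a factor s between two frames Δt apart, then TTC ≈ Δt / (s − 1). The snippet below illustrates this classical relation; it is not one of the paper's learned baselines.

```python
def ttc_from_scale(h_prev, h_curr, dt):
    """Estimate time-to-contact from the scale change of an object's bounding box.
    h_prev, h_curr : box heights (pixels) in consecutive frames; dt : frame interval (s).
    Assumes a pinhole camera and constant relative closing speed."""
    s = h_curr / h_prev            # scale ratio; s > 1 means the object is approaching
    if s <= 1.0:
        return float("inf")        # receding or constant distance: no contact expected
    return dt / (s - 1.0)

# Example: a box growing from 80 px to 84 px over 0.1 s -> TTC = 0.1 / 0.05 = 2.0 s
print(ttc_from_scale(80, 84, 0.1))
```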

On the use of Mahalanobis distance for out-of-distribution detection with neural networks for medical imaging

  • paper_url: http://arxiv.org/abs/2309.01488
  • repo_url: https://github.com/harryanthony/mahalanobis-ood-detection
  • paper_authors: Harry Anthony, Konstantinos Kamnitsas
  • for: Investigating how to detect when input data differs significantly from the training data when deploying neural networks in medical applications, in order to avoid unreliable predictions.
  • methods: Uses distance-based out-of-distribution (OOD) detection, in particular the Mahalanobis distance, applied to features from different layers of the network.
  • results: Shows that applying the Mahalanobis distance at a single fixed layer is not a one-fits-all solution: the optimal layer (or combination of layers) changes with the type of OOD pattern. Separating the OOD detector into multiple detectors at different depths of the network improves robustness to different OOD patterns. The findings are validated on real-world OOD tasks using CheXpert chest X-rays, with unseen pacemakers (50% of CheXpert was manually labelled for this research) and unseen sex as OOD cases.
    Abstract Implementing neural networks for clinical use in medical applications necessitates the ability for the network to detect when input data differs significantly from the training data, with the aim of preventing unreliable predictions. The community has developed several methods for out-of-distribution (OOD) detection, within which distance-based approaches - such as Mahalanobis distance - have shown potential. This paper challenges the prevailing community understanding that there is an optimal layer, or combination of layers, of a neural network for applying Mahalanobis distance for detection of any OOD pattern. Using synthetic artefacts to emulate OOD patterns, this paper shows the optimum layer to apply Mahalanobis distance changes with the type of OOD pattern, showing there is no one-fits-all solution. This paper also shows that separating this OOD detector into multiple detectors at different depths of the network can enhance the robustness for detecting different OOD patterns. These insights were validated on real-world OOD tasks, training models on CheXpert chest X-rays with no support devices, then using scans with unseen pacemakers (we manually labelled 50% of CheXpert for this research) and unseen sex as OOD cases. The results inform best-practices for the use of Mahalanobis distance for OOD detection. The manually annotated pacemaker labels and the project's code are available at: https://github.com/HarryAnthony/Mahalanobis-OOD-detection.
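The Mahalanobis OOD score studied here is computed from features at a chosen network layer: fit per-class means and a shared covariance on in-distribution training features, then score a test feature by its minimum Mahalanobis distance to any class mean. A minimal sketch of that standard recipe is below (feature extraction from the network layer is assumed to be done elsewhere; this is not the authors' code):

```python
import numpy as np

class MahalanobisOOD:
    """Per-layer Mahalanobis OOD detector (standard formulation).
    Fit on in-distribution features; higher score = more out-of-distribution."""

    def fit(self, feats, labels, eps=1e-6):
        # feats: (N, D) features from one layer, labels: (N,) class ids
        self.means = {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}
        centered = np.concatenate([feats[labels == c] - m for c, m in self.means.items()])
        cov = centered.T @ centered / len(centered)            # shared (tied) covariance
        self.prec = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
        return self

    def score(self, feats):
        # minimum squared Mahalanobis distance to any class mean
        dists = [np.einsum("nd,dk,nk->n", feats - m, self.prec, feats - m)
                 for m in self.means.values()]
        return np.min(np.stack(dists, axis=0), axis=0)

# detector = MahalanobisOOD().fit(train_feats, train_labels)
# ood_scores = detector.score(test_feats)   # threshold, or combine detectors fit at several depths
```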

GenSelfDiff-HIS: Generative Self-Supervision Using Diffusion for Histopathological Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.01487
  • repo_url: None
  • paper_authors: Vishnuvardhan Purma, Suhas Srinath, Seshan Srirangarajan, Aanchal Kakkar, Prathosh A. P
  • for: Proposing a self-supervised learning approach to histopathological image segmentation, to ease the burden of manual analysis by experienced pathologists.
  • methods: Uses a generative diffusion model on unannotated data as the pretext task and fine-tunes for the downstream segmentation task with a multi-loss function.
  • results: The method achieves good results on two publicly available datasets as well as on a newly proposed head and neck (HN) cancer dataset.
    Abstract Histopathological image segmentation is a laborious and time-intensive task, often requiring analysis from experienced pathologists for accurate examinations. To reduce this burden, supervised machine-learning approaches have been adopted using large-scale annotated datasets for histopathological image analysis. However, in several scenarios, the availability of large-scale annotated data is a bottleneck while training such models. Self-supervised learning (SSL) is an alternative paradigm that provides some respite by constructing models utilizing only the unannotated data which is often abundant. The basic idea of SSL is to train a network to perform one or many pseudo or pretext tasks on unannotated data and use it subsequently as the basis for a variety of downstream tasks. It is seen that the success of SSL depends critically on the considered pretext task. While there have been many efforts in designing pretext tasks for classification problems, there haven't been many attempts on SSL for histopathological segmentation. Motivated by this, we propose an SSL approach for segmenting histopathological images via generative diffusion models in this paper. Our method is based on the observation that diffusion models effectively solve an image-to-image translation task akin to a segmentation task. Hence, we propose generative diffusion as the pretext task for histopathological image segmentation. We also propose a multi-loss function-based fine-tuning for the downstream task. We validate our method using several metrics on two publicly available datasets along with a newly proposed head and neck (HN) cancer dataset containing hematoxylin and eosin (H&E) stained images along with annotations. Codes will be made public at https://github.com/PurmaVishnuVardhanReddy/GenSelfDiff-HIS.git.

CA2: Class-Agnostic Adaptive Feature Adaptation for One-class Classification

  • paper_url: http://arxiv.org/abs/2309.01483
  • repo_url: None
  • paper_authors: Zilong Zhang, Zhibin Zhao, Deyu Meng, Xingwu Zhang, Xuefeng Chen
  • for: Enabling one-class classification (OCC) for deploying machine learning models in the real world.
  • methods: Adapts pre-trained features to the target dataset with a class-agnostic, center-based objective that generalizes to an unknown number of classes.
  • results: Consistently improves OCC performance across training data spanning from 1 to 1024 classes, outperforming current state-of-the-art methods.
    Abstract One-class classification (OCC), i.e., identifying whether an example belongs to the same distribution as the training data, is essential for deploying machine learning models in the real world. Adapting the pre-trained features on the target dataset has proven to be a promising paradigm for improving OCC performance. Existing methods are constrained by assumptions about the number of classes. This contradicts the real scenario where the number of classes is unknown. In this work, we propose a simple class-agnostic adaptive feature adaptation method (CA2). We generalize the center-based method to unknown classes and optimize this objective based on the prior existing in the pre-trained network, i.e., pre-trained features that belong to the same class are adjacent. CA2 is validated to consistently improve OCC performance across a spectrum of training data classes, spanning from 1 to 1024, outperforming current state-of-the-art methods. Code is available at https://github.com/zhangzilongc/CA2.
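The center-based objective that CA2 generalizes can be pictured with a DeepSVDD-style adaptation: a lightweight head is trained so that adapted pre-trained features of the target training set stay close to learnable centers, and at test time the distance to the nearest center is the anomaly score. This is a schematic under our own simplifications, not the paper's exact loss or architecture.

```python
import torch
import torch.nn as nn

class CenterAdaptation(nn.Module):
    """Adapt frozen pre-trained features toward k learnable centers (one-class use).
    k can be set generously since the true number of classes is unknown."""
    def __init__(self, feat_dim, k=16, proj_dim=128):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, proj_dim), nn.ReLU(),
                                  nn.Linear(proj_dim, proj_dim))
        self.centers = nn.Parameter(torch.randn(k, proj_dim))

    def forward(self, feats):                  # feats: (B, feat_dim), frozen backbone output
        z = self.head(feats)                   # (B, proj_dim)
        d = torch.cdist(z, self.centers)       # (B, k) distances to all centers
        return d.min(dim=1).values             # distance to nearest center

# Training loop sketch: pull every training feature toward its nearest center.
# model = CenterAdaptation(feat_dim=768)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for feats in train_feature_loader:           # normal-only training features
#     loss = model(feats).pow(2).mean()
#     opt.zero_grad(); loss.backward(); opt.step()
# At test time, model(test_feats) is the anomaly score.
```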

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

  • paper_url: http://arxiv.org/abs/2309.01479
  • repo_url: None
  • paper_authors: Qiong Wu, Wei Yu, Yiyi Zhou, Shubin Huang, Xiaoshuai Sun, Rongrong Ji
  • for: Proposing Parameter and Computation Efficient Transfer Learning (PCETL) to adapt Vision-Language Pre-trained (VLP) models to downstream tasks more efficiently.
  • methods: Introduces Dynamic Architecture Skipping (DAS), which uses reinforcement learning (RL) to observe how significant each VLP module is to the downstream task and then skips redundant modules with lightweight adapters, limiting the number of trainable parameters while preserving downstream performance.
  • results: Experiments show that DAS reduces computational complexity (e.g., -11.97% FLOPs for METER on VQA2.0) and remains competitive with existing PETL methods in terms of parameter scale and performance.
    Abstract With ever increasing parameters and computation, vision-language pre-trained (VLP) models exhibit prohibitive expenditure in downstream task adaption. Recent endeavors mainly focus on parameter efficient transfer learning (PETL) for VLP models by only updating a small number of parameters. However, excessive computational overhead still plagues the application of VLPs. In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for VLP models. In particular, PCETL not only needs to limit the number of trainable parameters in VLP models, but also to reduce the computational redundancy during inference, thus enabling a more efficient transfer. To approach this target, we propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL. Instead of directly optimizing the intrinsic architectures of VLP models, DAS first observes the significances of their modules to downstream tasks via a reinforcement learning (RL) based process, and then skips the redundant ones with lightweight networks, i.e., adapters, according to the obtained rewards. In this case, the VLP model can well maintain the scale of trainable parameters while speeding up its inference on downstream tasks. To validate DAS, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a bunch of VL tasks. The experimental results not only show the great advantages of DAS in reducing computational complexity, e.g. -11.97% FLOPs of METER on VQA2.0, but also confirm its competitiveness against existing PETL methods in terms of parameter scale and performance. Our source code is given in our appendix.

Defect Detection in Synthetic Fibre Ropes using Detectron2 Framework

  • paper_url: http://arxiv.org/abs/2309.01469
  • repo_url: None
  • paper_authors: Anju Rani, Daniel O. Arroyo, Petar Durdevic
  • for: This paper aims to develop an automated and efficient method for detecting defects in synthetic fibre ropes (SFRs) using deep learning (DL) models, specifically the Detectron2 library with Mask R-CNN architecture.
  • methods: The study uses an experimentally obtained dataset of high-dimensional images of SFRs, with seven damage classes, to train and test Mask R-CNN with various backbone configurations.
  • results: The use of Detectron2 and Mask R-CNN with different backbone configurations can effectively detect defects in SFRs, enhancing the inspection process and ensuring the safety of the fibre ropes.
    Abstract Fibre ropes with the latest technology have emerged as an appealing alternative to steel ropes for offshore industries due to their lightweight and high tensile strength. At the same time, frequent inspection of these ropes is essential to ensure the proper functioning and safety of the entire system. The development of deep learning (DL) models in condition monitoring (CM) applications offers a simpler and more effective approach for defect detection in synthetic fibre ropes (SFRs). The present paper investigates the performance of Detectron2, a state-of-the-art library for defect detection and instance segmentation. Detectron2 with Mask R-CNN architecture is used for segmenting defects in SFRs. Mask R-CNN with various backbone configurations has been trained and tested on an experimentally obtained dataset comprising 1,803 high-dimensional images containing seven damage classes (loop high, loop medium, loop low, compression, core out, abrasion, and normal respectively) for SFRs. By leveraging the capabilities of Detectron2, this study aims to develop an automated and efficient method for detecting defects in SFRs, enhancing the inspection process, and ensuring the safety of the fibre ropes.
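Setting up Detectron2's Mask R-CNN for a custom 7-class defect dataset follows the library's standard recipe; a minimal configuration sketch is below. The dataset name, annotation paths, and hyperparameters are placeholders and illustrative rather than the values used in the paper, and the backbone can be swapped for the other configurations the study compares.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the synthetic-fibre-rope dataset (COCO-format annotations assumed).
register_coco_instances("sfr_train", {}, "sfr/annotations_train.json", "sfr/images_train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))      # backbone is configurable
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")       # COCO-pretrained weights
cfg.DATASETS.TRAIN = ("sfr_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 7   # loop high/medium/low, compression, core out, abrasion, normal
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.BASE_LR = 2.5e-4
cfg.SOLVER.MAX_ITER = 20000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```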

Toward Defensive Letter Design

  • paper_url: http://arxiv.org/abs/2309.01452
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Rentaro Kataoka, Akisato Kimura, Seiichi Uchida
  • for: Making letter images more defensive against adversarial attacks.
  • methods: Measures the defensibility of letters with the Iterative Fast Gradient Sign Method (I-FGSM), builds a deep regression model to estimate the defensibility of a given letter image before attacks, and proposes a two-step method based on a generative adversarial network (GAN) to generate letter images with higher defensibility.
  • results: The experiments show that letter images can be analyzed and controlled to be measurably more defensive against adversarial attacks.
    Abstract A major approach for defending against adversarial attacks aims at controlling only image classifiers to be more resilient, and it does not care about visual objects, such as pandas and cars, in images. This means that visual objects themselves cannot take any defensive actions, and they are still vulnerable to adversarial attacks. In contrast, letters are artificial symbols, and we can freely control their appearance unless losing their readability. In other words, we can make the letters more defensive to the attacks. This paper poses three research questions related to the adversarial vulnerability of letter images: (1) How defensive are the letters against adversarial attacks? (2) Can we estimate how defensive a given letter image is before attacks? (3) Can we control the letter images to be more defensive against adversarial attacks? For answering the first and second questions, we measure the defensibility of letters by employing Iterative Fast Gradient Sign Method (I-FGSM) and then build a deep regression model for estimating the defensibility of each letter image. We also propose a two-step method based on a generative adversarial network (GAN) for generating character images with higher defensibility, which solves the third research question.
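The defensibility measurement relies on the standard Iterative FGSM attack: starting from the clean letter image, repeatedly step in the sign of the input gradient and clip the perturbation to an ε-ball; the larger the ε (or the more iterations) needed to flip the classifier, the more defensive the letter. A generic PyTorch sketch of I-FGSM, not the authors' code, is below.

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Iterative FGSM: untargeted L-inf attack on classifier `model`.
    x : clean images in [0, 1]; y : true labels. Returns adversarial images."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)       # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                      # keep a valid image
    return x_adv.detach()

# A simple defensibility proxy: the attack success rate over a set of letter images,
# e.g. 1 - accuracy(model(ifgsm(model, letters, labels)), labels).
```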

Large Separable Kernel Attention: Rethinking the Large Kernel Attention Design in CNN

  • paper_url: http://arxiv.org/abs/2309.01439
  • repo_url: None
  • paper_authors: Kin Wai Lau, Lai-Man Po, Yasar Abbas Ur Rehman
  • for: Improving the performance of Visual Attention Networks (VAN) on vision-based tasks while reducing computational and memory footprints.
  • methods: Proposes the Large Separable Kernel Attention (LSKA) module, which decomposes the 2D convolutional kernel of the depth-wise convolutional layer into cascaded horizontal and vertical 1D kernels, avoiding the need for extra blocks.
  • results: Benchmarks LKA and LSKA in VAN, ViTs and ConvNeXt on object recognition, object detection, semantic segmentation and robustness tests; LSKA matches the performance of the standard LKA module while reducing computational complexity and memory footprint as the kernel size grows.
    Abstract Visual Attention Networks (VAN) with Large Kernel Attention (LKA) modules have been shown to provide remarkable performance, that surpasses Vision Transformers (ViTs), on a range of vision-based tasks. However, the depth-wise convolutional layer in these LKA modules incurs a quadratic increase in the computational and memory footprints with increasing convolutional kernel size. To mitigate these problems and to enable the use of extremely large convolutional kernels in the attention modules of VAN, we propose a family of Large Separable Kernel Attention modules, termed LSKA. LSKA decomposes the 2D convolutional kernel of the depth-wise convolutional layer into cascaded horizontal and vertical 1-D kernels. In contrast to the standard LKA design, the proposed decomposition enables the direct use of the depth-wise convolutional layer with large kernels in the attention module, without requiring any extra blocks. We demonstrate that the proposed LSKA module in VAN can achieve comparable performance with the standard LKA module and incur lower computational complexity and memory footprints. We also find that the proposed LSKA design biases the VAN more toward the shape of the object than the texture with increasing kernel size. Additionally, we benchmark the robustness of the LKA and LSKA in VAN, ViTs, and the recent ConvNeXt on the five corrupted versions of the ImageNet dataset that are largely unexplored in the previous works. Our extensive experimental results show that the proposed LSKA module in VAN provides a significant reduction in computational complexity and memory footprints with increasing kernel size while outperforming ViTs, ConvNeXt, and providing similar performance compared to the LKA module in VAN on object recognition, object detection, semantic segmentation, and robustness tests.
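The core of LSKA is replacing a k×k depth-wise convolution inside large-kernel attention with a cascade of 1×k and k×1 depth-wise convolutions, turning the quadratic cost in k into a linear one. A simplified PyTorch sketch of that decomposition is below; the full LSKA also factorises the dilated depth-wise branch of LKA, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class SeparableLargeKernelDW(nn.Module):
    """Depth-wise k x k convolution factorised into (1 x k) then (k x 1) depth-wise convs,
    as in the LSKA decomposition; parameters/FLOPs grow linearly instead of quadratically in k."""
    def __init__(self, dim, k=35):
        super().__init__()
        self.conv_h = nn.Conv2d(dim, dim, kernel_size=(1, k),
                                padding=(0, k // 2), groups=dim)   # horizontal 1-D kernel
        self.conv_v = nn.Conv2d(dim, dim, kernel_size=(k, 1),
                                padding=(k // 2, 0), groups=dim)   # vertical 1-D kernel

    def forward(self, x):
        return self.conv_v(self.conv_h(x))

class LSKAttention(nn.Module):
    """Attention-style gating: the separable large-kernel branch produces a spatial
    map that modulates the input (simplified from LKA/LSKA)."""
    def __init__(self, dim, k=35):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, dim, 1)
        self.spatial = SeparableLargeKernelDW(dim, k)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.proj_out(self.spatial(self.proj_in(x)))
        return x * attn

# y = LSKAttention(dim=64, k=35)(torch.randn(2, 64, 56, 56))   # output keeps the spatial size
```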

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

  • paper_url: http://arxiv.org/abs/2309.01430
  • repo_url: https://github.com/leaplabthu/dat
  • paper_authors: Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, Gao Huang
  • for: Proposing an effective and efficient vision backbone that resolves the tension between global attention and the ability to adaptively focus on relevant regions.
  • methods: Introduces a deformable multi-head attention module in which the positions of key and value pairs in self-attention are allocated in a data-dependent way, enabling the model to dynamically focus on relevant regions while retaining the representation power of global attention.
  • results: The proposed DAT++ achieves state-of-the-art results on several visual recognition benchmarks: 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
    Abstract Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the region of interests. On the other hand, the handcrafted attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long-range relations. To solve this dilemma, we propose a novel deformable multi-head attention module, where the positions of key and value pairs in self-attention are adaptively allocated in a data-dependent way. This flexible scheme enables the proposed deformable attention to dynamically focus on relevant regions while maintains the representation power of global attention. On this basis, we present Deformable Attention Transformer (DAT), a general vision backbone efficient and effective for visual recognition. We further build an enhanced version DAT++. Extensive experiments show that our DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.

Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2309.01429
  • repo_url: https://github.com/ggsding/sam-cd
  • paper_authors: Lei Ding, Kun Zhu, Daifeng Peng, Hao Tang, Haitao Guo
  • for: Adapting vision foundation models (VFMs) to change detection in high-resolution remote sensing images (RSIs).
  • methods: Uses the visual encoder of FastSAM to extract visual representations of RS scenes, a convolutional adaptor to aggregate task-oriented change information, and a task-agnostic semantic learning branch to model the semantic latent space of bi-temporal RSIs.
  • results: Achieves higher accuracy than state-of-the-art methods and exhibits sample-efficient learning comparable to semi-supervised CD methods; to the authors' knowledge, this is the first work to adapt VFMs to change detection of HR RSIs.
    Abstract Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) allow zero-shot or interactive segmentation of visual contents, thus they are quickly applied in a variety of visual scenes. However, their direct use in many Remote Sensing (RS) applications is often unsatisfactory due to the special imaging characteristics of RS images. In this work, we aim to utilize the strong visual recognition capabilities of VFMs to improve the change detection of high-resolution Remote Sensing Images (RSIs). We employ the visual encoder of FastSAM, an efficient variant of the SAM, to extract visual representations in RS scenes. To adapt FastSAM to focus on some specific ground objects in the RS scenes, we propose a convolutional adaptor to aggregate the task-oriented change information. Moreover, to utilize the semantic representations that are inherent to SAM features, we introduce a task-agnostic semantic learning branch to model the semantic latent in bi-temporal RSIs. The resulting method, SAMCD, obtains superior accuracy compared to the SOTA methods and exhibits a sample-efficient learning ability that is comparable to semi-supervised CD methods. To the best of our knowledge, this is the first work that adapts VFMs for the CD of HR RSIs.

Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

  • paper_url: http://arxiv.org/abs/2309.01420
  • repo_url: None
  • paper_authors: Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang
  • for: 本研究旨在提高文本到图像人重识别(T2I-ReID)任务的性能,因为存在两种基础问题:数据不一致和训练不一致。
  • methods: 我们提出了一个新的统一预训策略(UniPT),包括建立大规模的文本标注人像数据集“LUPerson-T”,并使用简单的视觉语言预训策略来对图像和文本模态的特征空间进行Alignment。
  • results: 我们的UniPT可以在不需要任何辅助工具的情况下达到竞争性的排名1精度(68.50%、60.09%和51.85%)在CUHK-PEDES、ICFG-PEDES和RSTPReid等三个任务上。
    Abstract The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task. However, there are two underlying inconsistencies between these two tasks that may impact the performance; i) Data inconsistency. A large domain gap exists between the generic images/texts used in public pre-trained models and the specific person data in the T2I-ReID task. This gap is especially severe for texts, as general textual data are usually unable to describe specific people in fine-grained detail. ii) Training inconsistency. The processes of pre-training of images and texts are independent, despite cross-modality learning being critical to T2I-ReID. To address the above issues, we present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task. We first build a large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual descriptions of images are automatically generated by the CLIP paradigm using a divide-conquer-combine strategy. Benefiting from this dataset, we then utilize a simple vision-and-language pre-training framework to explicitly align the feature space of the image and text modalities during pre-training. In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both data and training levels. Without the need for any bells and whistles, our UniPT achieves competitive Rank-1 accuracy of, ie, 68.50%, 60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES and RSTPReid, respectively. Both the LUPerson-T dataset and code are available at https;//github.com/ZhiyinShao-H/UniPT.

Implicit Neural Image Stitching With Enhanced and Blended Feature Reconstruction

  • paper_url: http://arxiv.org/abs/2309.01409
  • repo_url: https://github.com/minshu-kim/NIS
  • paper_authors: Minsu Kim, Jaewon Lee, Byeonghun Lee, Sunghoon Im, Kyong Hwan Jin
  • for: Improving the quality of image stitching by addressing the blurry artifacts and the disparities in illumination and depth that affect existing frameworks.
  • methods: An implicit neural image stitching method that estimates Fourier coefficients of images for quality-enhancing warps, blends color mismatches and misalignment in the latent space, and decodes the features into the RGB values of the stitched image.
  • results: Resolves the low-definition output of previous deep image stitching more effectively and combines well with accelerated image-enhancing methods for higher-quality stitched images.
    Abstract Existing frameworks for image stitching often provide visually reasonable stitchings. However, they suffer from blurry artifacts and disparities in illumination, depth level, etc. Although the recent learning-based stitchings relax such disparities, the required methods impose sacrifice of image qualities failing to capture high-frequency details for stitched images. To address the problem, we propose a novel approach, implicit Neural Image Stitching (NIS) that extends arbitrary-scale super-resolution. Our method estimates Fourier coefficients of images for quality-enhancing warps. Then, the suggested model blends color mismatches and misalignment in the latent space and decodes the features into RGB values of stitched images. Our experiments show that our approach achieves improvement in resolving the low-definition imaging of the previous deep image stitching with favorable accelerated image-enhancing methods. Our source code is available at https://github.com/minshu-kim/NIS.

Leveraging Self-Supervised Vision Transformers for Neural Transfer Function Design

  • paper_url: http://arxiv.org/abs/2309.01408
  • repo_url: None
  • paper_authors: Dominik Engel, Leon Sick, Timo Ropinski
  • for: Classifying structures of interest and assigning optical properties (transfer functions) for volume rendering.
  • methods: Defines transfer functions by leveraging the feature extraction capabilities of self-supervised pre-trained vision transformers: users select structures of interest in a slice viewer, and similar structures are selected automatically from the extracted high-level features.
  • results: Reduces the amount of required annotation and computation time while improving segmentation accuracy and user experience; transfer functions can be designed within seconds instead of minutes.
    Abstract In volume rendering, transfer functions are used to classify structures of interest, and to assign optical properties such as color and opacity. They are commonly defined as 1D or 2D functions that map simple features to these optical properties. As the process of designing a transfer function is typically tedious and unintuitive, several approaches have been proposed for their interactive specification. In this paper, we present a novel method to define transfer functions for volume rendering by leveraging the feature extraction capabilities of self-supervised pre-trained vision transformers. To design a transfer function, users simply select the structures of interest in a slice viewer, and our method automatically selects similar structures based on the high-level features extracted by the neural network. Contrary to previous learning-based transfer function approaches, our method does not require training of models and allows for quick inference, enabling an interactive exploration of the volume data. Our approach reduces the amount of necessary annotations by interactively informing the user about the current classification, so they can focus on annotating the structures of interest that still require annotation. In practice, this allows users to design transfer functions within seconds, instead of minutes. We compare our method to existing learning-based approaches in terms of annotation and compute time, as well as with respect to segmentation accuracy. Our accompanying video showcases the interactivity and effectiveness of our method.

Learning Residual Elastic Warps for Image Stitching under Dirichlet Boundary Condition

  • paper_url: http://arxiv.org/abs/2309.01406
  • repo_url: https://github.com/minshu-kim/REwarp
  • paper_authors: Minsu Kim, Yongjun Lee, Woo Kyoung Han, Kyong Hwan Jin
  • for: Addressing the occasional holes and discontinuities caused by large parallax errors in deep image stitching, improving both accuracy and efficiency.
  • methods: Uses a Dirichlet boundary condition and recurrent residual learning for misalignment correction, predicting a homography and a Thin-plate Spline (TPS) under the boundary constraint for discontinuity- and hole-free stitching.
  • results: Experiments show that REwarp achieves favorable alignments and competitive computational cost compared with existing stitching methods.
    Abstract Trendy suggestions for learning-based elastic warps enable the deep image stitchings to align images exposed to large parallax errors. Despite the remarkable alignments, the methods struggle with occasional holes or discontinuity between overlapping and non-overlapping regions of a target image as the applied training strategy mostly focuses on overlap region alignment. As a result, they require additional modules such as seam finder and image inpainting for hiding discontinuity and filling holes, respectively. In this work, we suggest Recurrent Elastic Warps (REwarp) that address the problem with Dirichlet boundary condition and boost performances by residual learning for recurrent misalign correction. Specifically, REwarp predicts a homography and a Thin-plate Spline (TPS) under the boundary constraint for discontinuity and hole-free image stitching. Our experiments show the favorable aligns and the competitive computational costs of REwarp compared to the existing stitching methods. Our source code is available at https://github.com/minshu-kim/REwarp.

SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations

  • paper_url: http://arxiv.org/abs/2309.01391
  • repo_url: None
  • paper_authors: Tanvir Mahmud, Chun-Hao Liu, Burhaneddin Yaman, Diana Marculescu
  • for: Proposing a semi-supervised video object detection method that addresses the heavy reliance of existing approaches on large numbers of annotated frames.
  • methods: Uses flow-warped predictions from nearby frames to select robust pseudo-labels on large-scale unannotated frames, with cross-IoU based selection for bounding boxes, cross-divergence based selection for class labels, and a confidence-threshold combination of hard and soft pseudo-labels.
  • results: Achieves significant improvements over existing methods, with reported mAP gains of 10.3% on ImageNet-VID, 13.1% on Epic-KITCHENS, and 11.4% on YouTube-VIS.
    Abstract Despite significant progress in semi-supervised learning for image object detection, several key issues are yet to be addressed for video object detection: (1) Achieving good performance for supervised video object detection greatly depends on the availability of annotated frames. (2) Despite having large inter-frame correlations in a video, collecting annotations for a large number of frames per video is expensive, time-consuming, and often redundant. (3) Existing semi-supervised techniques on static images can hardly exploit the temporal motion dynamics inherently present in videos. In this paper, we introduce SSVOD, an end-to-end semi-supervised video object detection framework that exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations. To selectively assemble robust pseudo-labels across groups of frames, we introduce \textit{flow-warped predictions} from nearby frames for temporal-consistency estimation. In particular, we introduce cross-IoU and cross-divergence based selection methods over a set of estimated predictions to include robust pseudo-labels for bounding boxes and class labels, respectively. To strike a balance between confirmation bias and uncertainty noise in pseudo-labels, we propose confidence threshold based combination of hard and soft pseudo-labels. Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets. Code and pre-trained models will be released.
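The cross-IoU selection can be pictured as keeping only those pseudo-boxes on an unlabeled frame that are confirmed by flow-warped predictions from a neighbouring frame. A schematic version of that filter is below (using torchvision's IoU utility; the flow warping is assumed to have been applied to the neighbour's boxes already, and the thresholds are illustrative, not the paper's values).

```python
import torch
from torchvision.ops import box_iou

def select_consistent_pseudo_boxes(curr_boxes, curr_scores, warped_boxes,
                                   iou_thr=0.6, score_thr=0.7):
    """Keep pseudo-label boxes on the current frame that (i) are confident and
    (ii) overlap a flow-warped prediction from a nearby frame (cross-IoU check).
    curr_boxes, warped_boxes : (N, 4) and (M, 4) tensors in xyxy format."""
    if len(curr_boxes) == 0 or len(warped_boxes) == 0:
        return torch.empty(0, dtype=torch.long)
    ious = box_iou(curr_boxes, warped_boxes)                 # (N, M)
    temporally_confirmed = ious.max(dim=1).values >= iou_thr
    confident = curr_scores >= score_thr
    return torch.nonzero(temporally_confirmed & confident).squeeze(1)

# keep = select_consistent_pseudo_boxes(boxes_t, scores_t, boxes_warped_from_neighbour)
# pseudo_labels = boxes_t[keep]
```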

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering

  • paper_url: http://arxiv.org/abs/2309.01380
  • repo_url: None
  • paper_authors: Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
  • for: Examining two recently introduced datasets, NewsVideoQA and M4-ViteVQA, for answering video questions based on textual content.
  • methods: Experiments with BERT-QA, a text-only model, and finds that it performs comparably to the original methods on both datasets, exposing shortcomings in the datasets' formulation.
  • results: Training on M4-ViteVQA does not transfer directly to NewsVideoQA (and vice versa), highlighting the challenges and potential benefits of out-of-domain training and the need for adaptation.
    Abstract Researchers have extensively studied the field of vision and language, discovering that both visual and textual content is crucial for understanding scenes effectively. Particularly, comprehending text in videos holds great significance, requiring both scene text understanding and temporal reasoning. This paper focuses on exploring two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. The NewsVideoQA dataset contains question-answer pairs related to the text in news videos, while M4-ViteVQA comprises question-answer pairs from diverse categories like vlogging, traveling, and shopping. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions. Additionally, the study includes experimentation with BERT-QA, a text-only model, which demonstrates comparable performance to the original methods on both datasets, indicating the shortcomings in the formulation of these datasets. Furthermore, we also look into the domain adaptation aspect by examining the effectiveness of training on M4-ViteVQA and evaluating on NewsVideoQA and vice-versa, thereby shedding light on the challenges and potential benefits of out-of-domain training.

ImmersiveNeRF: Hybrid Radiance Fields for Unbounded Immersive Light Field Reconstruction

  • paper_url: http://arxiv.org/abs/2309.01374
  • repo_url: None
  • paper_authors: Xiaohang Yu, Haoxiang Wang, Yuqi Han, Lei Yang, Tao Yu, Qionghai Dai
  • for: This paper proposes a method for unbounded immersive light field reconstruction, which supports high-quality rendering and aggressive view extrapolation.
  • methods: The method uses a hybrid radiance field representation, with separate radiance fields for the foreground and background, and adaptive sampling and segmentation regularization to improve performance.
  • results: The proposed method achieves strong performance for unbounded immersive light field reconstruction, and contributes a novel dataset for further research and applications in the immersive light field domain.
    Abstract This paper proposes a hybrid radiance field representation for unbounded immersive light field reconstruction which supports high-quality rendering and aggressive view extrapolation. The key idea is to first formally separate the foreground and the background and then adaptively balance learning of them during the training process. To fulfill this goal, we represent the foreground and background as two separate radiance fields with two different spatial mapping strategies. We further propose an adaptive sampling strategy and a segmentation regularizer for more clear segmentation and robust convergence. Finally, we contribute a novel immersive light field dataset, named THUImmersive, with the potential to achieve much larger space 6DoF immersive rendering effects compared with existing datasets, by capturing multiple neighboring viewpoints for the same scene, to stimulate the research and AR/VR applications in the immersive light field domain. Extensive experiments demonstrate the strong performance of our method for unbounded immersive light field reconstruction.

DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion

  • paper_url: http://arxiv.org/abs/2309.01372
  • repo_url: https://github.com/axdfhj/mdd
  • paper_authors: Yunhong Lou, Linchao Zhu, Yaxiong Wang, Xiaohan Wang, Yi Yang
  • for: Generating high-quality human motions conditioned on textual descriptions while preserving motion diversity.
  • methods: Introduces a Hierarchical Semantic Aggregation (HSA) module to capture fine-grained text semantics and a Motion Discrete Diffusion (MDD) framework to balance motion quality and diversity.
  • results: Experiments on HumanML3D and KIT-ML show that the method achieves state-of-the-art motion quality and competitive motion diversity.
    Abstract We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions while preserving motion diversity. Despite the recent significant progress in text-based human motion generation, existing methods often prioritize fitting training motions at the expense of action diversity. Consequently, striking a balance between motion quality and diversity remains an unresolved challenge. This problem is compounded by two key factors: 1) the lack of diversity in motion-caption pairs in existing benchmarks and 2) the unilateral and biased semantic understanding of the text prompt, focusing primarily on the verb component while neglecting the nuanced distinctions indicated by other words. In response to the first issue, we construct a large-scale Wild Motion-Caption dataset (WMC) to extend the restricted action boundary of existing well-annotated datasets, enabling the learning of diverse motions through a more extensive range of actions. To this end, a motion BLIP is trained upon a pretrained vision-language model, and we then automatically generate diverse motion captions for the collected motion sequences. As a result, we finally build a dataset comprising 8,888 motions coupled with 141k texts. To comprehensively understand the text command, we propose a Hierarchical Semantic Aggregation (HSA) module to capture the fine-grained semantics. Finally, we incorporate the above two designs into an effective Motion Discrete Diffusion (MDD) framework to strike a balance between motion quality and diversity. Extensive experiments on HumanML3D and KIT-ML show that our DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity. The dataset, code, and pretrained models will be released to reproduce all of our results.

Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion

  • paper_url: http://arxiv.org/abs/2309.01369
  • repo_url: None
  • paper_authors: Ryota Yoshihashi, Yuya Otsuka, Kenji Doi, Tomohiro Tanaka
  • for: Proposing a semantic segmentation training method that uses neither real images nor manual annotation.
  • methods: Uses images generated by a text-to-image diffusion model together with its internal text-to-image cross-attention as supervisory pseudo-masks.
  • results: Experiments show that attn2mask achieves promising results on PASCAL VOC without using any real training data, scales up to the more-class ImageNet segmentation scenario, and adapts to a distant domain (Cityscapes) via LoRA-based fine-tuning.
    Abstract Although recent advancements in diffusion models enabled high-fidelity and diverse image generation, training of discriminative models largely depends on collections of massive real images and their manual annotation. Here, we present a training method for semantic segmentation that neither relies on real images nor manual annotation. The proposed method attn2mask utilizes images generated by a text-to-image diffusion model in combination with its internal text-to-image cross-attention as supervisory pseudo-masks. Since the text-to-image generator is trained with image-caption pairs but without pixel-wise labels, attn2mask can be regarded as a weakly supervised segmentation method overall. Experiments show that attn2mask achieves promising results in PASCAL VOC for not using real training data for segmentation at all, and it is also useful to scale up segmentation to a more-class scenario, i.e., ImageNet segmentation. It also shows adaptation ability with LoRA-based fine-tuning, which enables the transfer to a distant domain i.e., Cityscapes.
    摘要 尽管最近的扩散模型可以生成高精度和多样的图像,但训练推理模型大多依赖于庞大的真实图像和手动注释。在这里,我们提出了一种不需要真实图像和手动注释的 semantic segmentation 训练方法。我们称之为“attn2mask”,它利用由文本到图像扩散模型生成的图像,并与其内部的文本到图像交叉注意力作为超级vision pseudo-mask。由于文本到图像生成器在没有像素级标注的情况下训练,可以视为一种弱supervised segmentation方法。实验表明,attn2mask 在 PASCAL VOC 上表现出色,不使用实际训练数据进行 segmentation 的情况下,并且在更多类enario中也表现出了好的扩展能力。它还表现出了 LoRA-based fine-tuning 的适应能力,可以在远程领域 i.e., Cityscapes 中进行转移。
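
A minimal sketch of how a cross-attention map for one prompt token can be turned into a supervisory pseudo-mask. The function name, the shape convention for `cross_attn`, and the fixed threshold are assumptions for illustration; attn2mask extracts its attention maps from inside the diffusion model.

```python
import numpy as np

def attention_to_pseudo_mask(cross_attn, token_idx, out_hw, thresh=0.5):
    """Turn a text-to-image cross-attention map for one prompt token into a binary
    pseudo-mask. `cross_attn` has shape (n_tokens, h, w), averaged over heads/layers."""
    amap = cross_attn[token_idx]
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)   # normalize to [0, 1]
    H, W = out_hw
    ys = (np.arange(H) * amap.shape[0] / H).astype(int)             # nearest-neighbour upsample
    xs = (np.arange(W) * amap.shape[1] / W).astype(int)
    up = amap[np.ix_(ys, xs)]
    return (up > thresh).astype(np.uint8)

attn = np.random.rand(8, 16, 16)                 # toy maps for an 8-token caption
mask = attention_to_pseudo_mask(attn, token_idx=3, out_hw=(512, 512))
```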

High Frequency, High Accuracy Pointing onboard Nanosats using Neuromorphic Event Sensing and Piezoelectric Actuation

  • paper_url: http://arxiv.org/abs/2309.01361
  • repo_url: None
  • paper_authors: Yasir Latif, Peter Anastasiou, Yonhon Ng, Zebb Prime, Tien-Fu Lu, Matthew Tetlow, Robert Mahony, Tat-Jun Chin
  • for: improving the pointing stability of nanosatellites, which space domain awareness (SDA) tasks require at a higher accuracy than current nanosats offer.
  • methods: a novel payload that pairs a neuromorphic event sensor with a piezoelectric stage in a closed loop for high-frequency, highly accurate relative attitude estimation and active attitude correction.
  • results: experiments in a controlled setting show stable pointing with an accuracy of 1-5 arcseconds at operating frequencies of up to 50 Hz.
    Abstract As satellites become smaller, the ability to maintain stable pointing decreases as external forces acting on the satellite come into play. At the same time, reaction wheels used in the attitude determination and control system (ADCS) introduce high frequency jitter which can disrupt pointing stability. For space domain awareness (SDA) tasks that track objects tens of thousands of kilometres away, the pointing accuracy offered by current nanosats, typically in the range of 10 to 100 arcseconds, is not sufficient. In this work, we develop a novel payload that utilises a neuromorphic event sensor (for high frequency and highly accurate relative attitude estimation) paired in a closed loop with a piezoelectric stage (for active attitude corrections) to provide highly stable sensor-specific pointing. Event sensors are especially suited for space applications due to their desirable characteristics of low power consumption, asynchronous operation, and high dynamic range. We use the event sensor to first estimate a reference background star field from which instantaneous relative attitude is estimated at high frequency. The piezoelectric stage works in a closed control loop with the event sensor to perform attitude corrections based on the discrepancy between the current and desired attitude. Results in a controlled setting show that we can achieve a pointing accuracy in the range of 1-5 arcseconds using our novel payload at an operating frequency of up to 50Hz using a prototype built from commercial-off-the-shelf components. Further details can be found at https://ylatif.github.io/ultrafinestabilisation
    摘要 为了提高小型卫星的稳定性,我们开发了一种新的 payload,它使用神经元事件传感器(高频和高精度相对姿态估计)和一个 piezoelectric stage(用于活动姿态 corrections)。事件传感器在空间应用中特别有利,因为它们具有低功耗、异步操作和高动态范围等极佳特点。我们使用事件传感器来估计背景星场,并根据差异来使用 piezoelectric stage 进行姿态 corrections。在控制的环境中,我们可以使用我们的新型payload实现1-5弧矩度精度的指向稳定,并且可以在50Hz的运行频率下达到这个精度。更多细节可以在https://ylatif.github.io/ultrafinestabilisation 查看。
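
A toy closed-loop simulation of the sensor-actuator idea: a PI controller turns high-rate attitude-error measurements into piezo-stage corrections. The gains, drift rate, and noise level are made-up values for illustration, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, kp, ki = 1.0 / 50.0, 0.6, 4.0
drift = 2.0                      # arcsec/s of uncorrected platform drift
attitude_err, integral = 30.0, 0.0

for _ in range(200):             # 4 seconds of operation at 50 Hz
    measured = attitude_err + rng.normal(0.0, 0.5)   # noisy event-sensor attitude estimate
    integral += measured * dt
    command = kp * measured + ki * integral          # correction commanded to the piezo stage
    attitude_err += drift * dt - command
print(round(attitude_err, 2))    # settles to the few-arcsecond level
```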

Adapting Classifiers To Changing Class Priors During Deployment

  • paper_url: http://arxiv.org/abs/2309.01357
  • repo_url: None
  • paper_authors: Natnael Daba, Bruce McIntosh, Abhijit Mahalanobis
  • for: using general-purpose classifiers in deployment scenarios where only a subset of the known classes occurs and the class priors differ from training.
  • methods: estimates the class priors from the classifier's own outputs.
  • results: incorporating the estimated class priors into the overall decision scheme increases the classifier's run-time accuracy in its deployment scenario.
    Abstract Conventional classifiers are trained and evaluated using balanced data sets in which all classes are equally present. Classifiers are now trained on large data sets such as ImageNet, and are now able to classify hundreds (if not thousands) of different classes. On one hand, it is desirable to train such general-purpose classifier on a very large number of classes so that it performs well regardless of the settings in which it is deployed. On the other hand, it is unlikely that all classes known to the classifier will occur in every deployment scenario, or that they will occur with the same prior probability. In reality, only a relatively small subset of the known classes may be present in a particular setting or environment. For example, a classifier will encounter mostly animals if its deployed in a zoo or for monitoring wildlife, aircraft and service vehicles at an airport, or various types of automobiles and commercial vehicles if it is used for monitoring traffic. Furthermore, the exact class priors are generally unknown and can vary over time. In this paper, we explore different methods for estimating the class priors based on the output of the classifier itself. We then show that incorporating the estimated class priors in the overall decision scheme enables the classifier to increase its run-time accuracy in the context of its deployment scenario.
    摘要 传统的分类器通常在具有平衡数据集的情况下训练和评估。现在,分类器被训练在大量数据集如ImageNet上,能够分类百计以上不同的类别。一方面,悉心地训练这种通用分类器,以便它在不同的部署enario中都能表现出色。另一方面,实际情况下,分类器可能只会遇到部分已知的类别,而且这些类别的发生概率可能不同,甚至随着时间的推移而变化。例如,如果把分类器部署在动物园或野生动物监测中,它就会遇到大量的动物类别。 similarly, if it is used for monitoring traffic, it will encounter mostly automobiles and commercial vehicles. 为了解决这个问题,我们在这篇论文中研究了不同的方法来估算类别概率,基于分类器的输出。然后,我们表明,在部署scenario中,通过包含估算后的类别概率在总决策方案中,可以使分类器在运行时准确性提高。
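
One standard way to estimate deployment-time class priors from the classifier's own outputs is EM re-estimation (Saerens et al., 2002). The sketch below shows that approach; it is in the spirit of, but not necessarily identical to, the methods compared in the paper.

```python
import numpy as np

def estimate_deployment_priors(posteriors, train_priors, n_iter=50):
    """EM re-estimation of class priors from soft classifier outputs.
    `posteriors` is an (N, K) array of softmax outputs produced under `train_priors`."""
    new_priors = train_priors.copy()
    for _ in range(n_iter):
        # E-step: re-weight each posterior by the prior ratio, renormalize per sample
        adjusted = posteriors * (new_priors / train_priors)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        # M-step: new priors are the average adjusted responsibilities
        new_priors = adjusted.mean(axis=0)
    return new_priors, adjusted

train_priors = np.full(3, 1 / 3)
posteriors = np.random.dirichlet([5, 2, 1], size=1000)   # stand-in deployment outputs
priors, corrected = estimate_deployment_priors(posteriors, train_priors)
preds = corrected.argmax(axis=1)                          # prior-adjusted decisions
```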

Real-time pedestrian recognition on low computational resources

  • paper_url: http://arxiv.org/abs/2309.01353
  • repo_url: None
  • paper_authors: Guifan Weng
  • for: real-time pedestrian recognition on small mobile devices, for applications such as security and autonomous driving.
  • methods: three approaches: improved Local Binary Pattern (LBP) features with an AdaBoost classifier, an optimized Histogram of Oriented Gradients (HOG) descriptor with a Support Vector Machine, and a fast Convolutional Neural Network (CNN).
  • results: all three methods achieve real-time pedestrian recognition on a small physical-size platform with accuracy above 95% and speed above 5 fps, and they port easily to small mobile devices with high compatibility and generality.
    Abstract Pedestrian recognition has successfully been applied to security, autonomous cars, Aerial photographs. For most applications, pedestrian recognition on small mobile devices is important. However, the limitations of the computing hardware make this a challenging task. In this work, we investigate real-time pedestrian recognition on small physical-size computers with low computational resources for faster speed. This paper presents three methods that work on the small physical size CPUs system. First, we improved the Local Binary Pattern (LBP) features and Adaboost classifier. Second, we optimized the Histogram of Oriented Gradients (HOG) and Support Vector Machine. Third, We implemented fast Convolutional Neural Networks (CNNs). The results demonstrate that the three methods achieved real-time pedestrian recognition at an accuracy of more than 95% and a speed of more than 5 fps on a small physical size computational platform with a 1.8 GHz Intel i5 CPU. Our methods can be easily applied to small mobile devices with high compatibility and generality.
    摘要 人体识别已成功应用于安全、自动驾驶、航空图像等领域。大多数应用中,人体识别在小型移动设备上是非常重要。然而,计算硬件的限制使得这成为一项挑战。在这项工作中,我们调查了小型物理尺寸计算机上的实时人体识别方法。本文提出了三种方法,它们在小型物理尺寸计算机系统上实现了实时人体识别,并且具有高准确率和快速速度。首先,我们改进了本地二进制特征(LBP)和权重融合分类器。其次,我们优化了梯度图 histogram 和支持向量机。最后,我们实现了快速的卷积神经网络(CNNs)。结果表明,三种方法在一个小型物理尺寸计算平台上实现了实时人体识别,准确率高于 95%,速度高于 5 fps。我们的方法可以轻松应用于小型移动设备,具有高兼容性和通用性。
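
A minimal sketch of the classic HOG + linear SVM pipeline named as one of the three methods, using scikit-image and scikit-learn with toy data; the paper's optimized feature and classifier settings are not reproduced here.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(img):
    # 9-bin HOG over 8x8 cells, the usual pedestrian-detection configuration
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# toy data standing in for 64x128 pedestrian / background crops
rng = np.random.default_rng(0)
pos = rng.random((50, 128, 64))
neg = rng.random((50, 128, 64))
X = np.array([hog_features(im) for im in np.concatenate([pos, neg])])
y = np.array([1] * 50 + [0] * 50)

clf = LinearSVC().fit(X, y)          # a linear SVM keeps inference cheap on a small CPU
print(clf.predict(X[:5]))
```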

Adv3D: Generating 3D Adversarial Examples in Driving Scenarios with NeRF

  • paper_url: http://arxiv.org/abs/2309.01351
  • repo_url: None
  • paper_authors: Leheng Li, Qing Lian, Ying-Cong Chen
  • for: probing the susceptibility of DNN-based autonomous driving stacks (i.e., 3D object detection) to adversarial examples.
  • methods: models adversarial examples as Neural Radiance Fields (NeRFs) and proposes primitive-aware sampling and semantic-guided regularization to generate physically realizable adversarial examples.
  • results: the trained adversarial NeRF generalizes across poses, scenes, and 3D detectors and causes large performance drops; a defense based on adversarial training through data augmentation is also provided.
    Abstract Deep neural networks (DNNs) have been proven extremely susceptible to adversarial examples, which raises special safety-critical concerns for DNN-based autonomous driving stacks (i.e., 3D object detection). Although there are extensive works on image-level attacks, most are restricted to 2D pixel spaces, and such attacks are not always physically realistic in our 3D world. Here we present Adv3D, the first exploration of modeling adversarial examples as Neural Radiance Fields (NeRFs). Advances in NeRF provide photorealistic appearances and 3D accurate generation, yielding a more realistic and realizable adversarial example. We train our adversarial NeRF by minimizing the surrounding objects' confidence predicted by 3D detectors on the training set. Then we evaluate Adv3D on the unseen validation set and show that it can cause a large performance reduction when rendering NeRF in any sampled pose. To generate physically realizable adversarial examples, we propose primitive-aware sampling and semantic-guided regularization that enable 3D patch attacks with camouflage adversarial texture. Experimental results demonstrate that the trained adversarial NeRF generalizes well to different poses, scenes, and 3D detectors. Finally, we provide a defense method to our attacks that involves adversarial training through data augmentation. Project page: https://len-li.github.io/adv3d-web
    摘要 深度神经网络(DNN)已经被证明非常易受到敌意示例的影响,这引发了特别的安全关注,特别是在基于DNN的自动驾驶堆栈中(即3D对象检测)。虽然有大量的图像级别攻击工作,但大多数都是在2D像素空间中进行,这些攻击并不总是物理上真实的在我们的3D世界中。在这里,我们提出了模型敌意示例为神经辐射场(NeRF)的首次探索。NeRF的进步提供了真实的外观和准确的3D生成,从而导致更真实和可能的敌意示例。我们在训练敌意NeRF时,将培育周围对象的信任预测值作为3D检测器的训练集中的一部分。然后,我们在无法见验证集上评估Adv3D,并证明它可以在任意抽象 pose 中进行3D patch攻击。为生成物理可能的敌意示例,我们提出了基于元素的sampling和semantic-guided regularization,允许3D质量攻击。实验结果表明,我们的训练敌意NeRF可以在不同的 pose、场景和3D检测器上进行广泛的应用。最后,我们提出了防御方法,通过数据增强来进行对敌意示例的防御。项目页面:https://len-li.github.io/adv3d-web
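
A 2D stand-in for the attack loop: optimizing a patch texture by gradient descent to minimize a frozen surrogate's "object confidence". The surrogate model, patch placement, and loss are illustrative assumptions; Adv3D instead renders an adversarial NeRF into driving scenes and minimizes the confidence reported by 3D detectors.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
surrogate = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 1))
for p in surrogate.parameters():
    p.requires_grad_(False)                              # the "detector" stays fixed

scene = torch.rand(1, 3, 32, 32)                         # rendered background
texture = torch.zeros(1, 3, 8, 8, requires_grad=True)    # adversarial texture parameters
mask = F.pad(torch.ones(1, 1, 8, 8), (0, 24, 0, 24))     # paste location (top-left corner)
opt = torch.optim.Adam([texture], lr=0.05)

for _ in range(200):
    patch = F.pad(torch.sigmoid(texture), (0, 24, 0, 24))
    composite = scene * (1 - mask) + patch * mask        # differentiable compositing
    confidence = torch.sigmoid(surrogate(composite)).mean()
    opt.zero_grad()
    confidence.backward()                                # minimize predicted confidence
    opt.step()
print(float(confidence))
```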

Adaptive Parametric Prototype Learning for Cross-Domain Few-Shot Classification

  • paper_url: http://arxiv.org/abs/2309.01342
  • repo_url: None
  • paper_authors: Marzi Heidari, Abdullah Alchihabi, Qing En, Yuhong Guo
  • for: cross-domain few-shot classification.
  • methods: proposes Adaptive Parametric Prototype Learning (APPL) under the meta-learning convention; unlike existing prototypical few-shot methods, class prototypes are learned parametrically from the concatenated features of the support set, with prototype-based regularization enforced on the query set.
  • results: experiments on multiple cross-domain few-shot benchmark datasets show that APPL outperforms many state-of-the-art cross-domain few-shot learning methods.
    Abstract Cross-domain few-shot classification induces a much more challenging problem than its in-domain counterpart due to the existence of domain shifts between the training and test tasks. In this paper, we develop a novel Adaptive Parametric Prototype Learning (APPL) method under the meta-learning convention for cross-domain few-shot classification. Different from existing prototypical few-shot methods that use the averages of support instances to calculate the class prototypes, we propose to learn class prototypes from the concatenated features of the support set in a parametric fashion and meta-learn the model by enforcing prototype-based regularization on the query set. In addition, we fine-tune the model in the target domain in a transductive manner using a weighted-moving-average self-training approach on the query instances. We conduct experiments on multiple cross-domain few-shot benchmark datasets. The empirical results demonstrate that APPL yields superior performance than many state-of-the-art cross-domain few-shot learning methods.
    摘要 跨领域少量分类问题比其内领域对应的问题更加挑战性,这是因为训练和测试任务之间存在领域偏移。在这篇论文中,我们开发了一种名为 Adaptive Parametric Prototype Learning(APPL)的新方法,该方法基于元学习准则进行跨领域少量分类。与现有的概率性几何方法不同,我们在支持集合的 concatenated 特征上学习类prototype,并在元学习中加入prototype-based regularization。此外,我们在目标领域中进行了权重移动平均自适应更新方法,以便在查询集中进行微调。我们在多个跨领域少量分类 benchmark 数据集上进行了实验,结果表明,APPL 的性能较多现状顶尖跨领域少量分类方法优秀。
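
A sketch of the core idea of learning class prototypes parametrically from concatenated support features rather than averaging them; the layer sizes, episode setup, and loss below are illustrative guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParametricPrototypes(nn.Module):
    """Maps the concatenated support features of each class to a class prototype,
    instead of simply averaging them."""
    def __init__(self, feat_dim=64, n_shot=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim * n_shot, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, support):                 # support: (n_way, n_shot, feat_dim)
        return self.net(support.flatten(1))     # (n_way, feat_dim) prototypes

n_way, n_shot, feat_dim = 5, 5, 64
proto_net = ParametricPrototypes(feat_dim, n_shot)
support = torch.randn(n_way, n_shot, feat_dim)        # backbone features of the support set
query = torch.randn(30, feat_dim)                     # backbone features of the query set
protos = proto_net(support)
logits = -torch.cdist(query, protos)                  # nearest-prototype classification
loss = F.cross_entropy(logits, torch.randint(0, n_way, (30,)))   # prototype-based episode loss
```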

MDSC: Towards Evaluating the Style Consistency Between Music and Dance

  • paper_url: http://arxiv.org/abs/2309.01340
  • repo_url: https://github.com/zixiangzhou916/mdsc
  • paper_authors: Zixiang Zhou, Baoyuan Wang
  • for: assessing the degree to which music and dance styles match.
  • methods: pre-trains a music encoder and a motion encoder, then learns to map and align the motion and music embeddings in a joint space, modelling the evaluation as a clustering problem.
  • results: proposes a new evaluation metric, Music-Dance Style Consistency (MDSC), and validates through user studies that it robustly measures the style correlation between music and generated dance motion.
    Abstract We propose MDSC(Music-Dance-Style Consistency), the first evaluation metric which assesses to what degree the dance moves and music match. Existing metrics can only evaluate the fidelity and diversity of motion and the degree of rhythmic matching between music and motion. MDSC measures how stylistically correlated the generated dance motion sequences and the conditioning music sequences are. We found that directly measuring the embedding distance between motion and music is not an optimal solution. We instead tackle this through modelling it as a clustering problem. Specifically, 1) we pre-train a music encoder and a motion encoder, then 2) we learn to map and align the motion and music embedding in joint space by jointly minimizing the intra-cluster distance and maximizing the inter-cluster distance, and 3) for evaluation purpose, we encode the dance moves into embedding and measure the intra-cluster and inter-cluster distances, as well as the ratio between them. We evaluate our metric on the results of several music-conditioned motion generation methods, combined with user study, we found that our proposed metric is a robust evaluation metric in measuring the music-dance style correlation. The code is available at: https://github.com/zixiangzhou916/MDSC.
    摘要 我们提出了MDSC(音乐舞蹈风格一致性),第一个评估预测metric,评估音乐和舞蹈动作之间的匹配程度。现有的metric只能评估动作和音乐的精度和多样性,以及音乐和动作的节奏匹配程度。而MDSC则测量了生成的舞蹈动作序列和条件音乐序列之间的风格相关性。我们发现直接测量动作和音乐的嵌入距离并不是最佳解决方案。我们相反地通过模elling它为一个聚类问题来解决。具体来说,我们先预训一个音乐编码器和一个动作编码器,然后学习将动作和音乐嵌入在共同空间中进行对应。最后,我们用计算内层距离和外层距离,以及内层和外层之间的比率来评估。我们在一些音乐条件动作生成方法的结果上进行了评估,并结合用户研究发现,我们的提出的metric是一种可靠的评估metric,可以量化音乐舞蹈风格相关性。代码可以在以下链接中找到:https://github.com/zixiangzhou916/MDSC。
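
A simplified version of the intra-/inter-cluster distance statistic that this kind of clustering-based evaluation rests on, assuming style labels are available for toy embeddings; the paper's exact formulation and trained encoders are not reproduced.

```python
import numpy as np

def style_consistency(motion_emb, music_emb, styles):
    """Ratio of mean intra-cluster to mean inter-cluster distance in the joint space;
    lower values indicate stronger music-dance style consistency."""
    joint = np.concatenate([motion_emb, music_emb], axis=0)
    labels = np.concatenate([styles, styles])
    intra, inter = [], []
    for i in range(len(joint)):
        for j in range(i + 1, len(joint)):
            d = np.linalg.norm(joint[i] - joint[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra) / (np.mean(inter) + 1e-8)

motion = np.random.randn(20, 32)          # embeddings from the pre-trained motion encoder
music = np.random.randn(20, 32)           # embeddings from the pre-trained music encoder
print(style_consistency(motion, music, styles=np.repeat(np.arange(4), 5)))
```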

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2309.01331
  • repo_url: None
  • paper_authors: Yiwen Cao, Yukun Su, Wenjun Wang, Yanxia Liu, Qingyao Wu
  • for: addressing the partial activation problem in weakly supervised object localization (WSOL), where objects must be localized using only image-level supervision.
  • methods: uses a Vision Transformer to acquire long-range feature dependencies through self-attention, together with a local patch shuffle strategy that disrupts local patches while guaranteeing global consistency and a semantic-constraint matching module that mines the co-object parts from coarse class activation maps.
  • results: achieves new state-of-the-art performance on the CUB-200-2011 and ILSVRC datasets, outperforming previous methods by a large margin.
    Abstract Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision. Due to the local receptive fields generated by convolution operations, previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope. Benefiting from the capability of the self-attention mechanism to acquire long-range feature dependencies, Vision Transformer has been recently applied to alleviate the local activation drawbacks. However, since the transformer lacks the inductive localization bias that are inherent in CNNs, it may cause a divergent activation problem resulting in an uncertain distinction between foreground and background. In this work, we proposed a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation. Specifically, we first propose a local patch shuffle strategy to construct the image pairs, disrupting local patches while guaranteeing global consistency. The paired images that contain the common object in spatial are then fed into the Siamese network encoder. We further design a semantic-constraint matching module, which aims to mine the co-object part by matching the coarse class activation maps (CAMs) extracted from the pair images, thus implicitly guiding and calibrating the transformer network to alleviate the divergent activation. Extensive experimental results conducted on two challenging benchmarks, including CUB-200-2011 and ILSVRC datasets show that our method can achieve the new state-of-the-art performance and outperform the previous method by a large margin.
    摘要 我们提出了一种新的半监督物体地址 localization(WSOL)方法,它的目的是通过只有图像级别的指导来学习地址物体。由于图像 convolution 操作生成的局部感知范围,前一代 CNN 基于方法容易出现部分活跃问题,即将对象的特征部分进行激活而不是整个实体范围。在应用自注意机制可以获得长范围特征依赖关系的情况下,我们使用了 Vision Transformer 来缓解本地活动问题。然而,由于 transformer 缺乏 CNN 中带有适应性的 inductive 地址偏好,可能会导致不确定的背景和前景分割。为了解决这个问题,我们提出了一种新的 Semantic-Constraint Matching Network(SCMN),通过 transformer 来实现对分歧的激活的控制。具体来说,我们首先提出了一种本地小块洗版策略,用于构建图像对。这种策略可以在保证全局一致性的情况下,对图像进行局部洗版。然后,我们将这些包含共同物体的图像对feed到 Siamese 网络Encoder 中。我们还设计了一种 semantic-constraint 匹配模块,该模块的目的是通过匹配 CAMs 提取自对图像对中的共同部分,从而隐式地引导和调整 transformer 网络,以缓解分歧的激活。我们在 CUB-200-2011 和 ILSVRC 数据集上进行了广泛的实验,结果表明,我们的方法可以达到新的状态码性能,并将之前的方法超越。
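
A minimal sketch of the local patch shuffle strategy used to build image pairs: patches are permuted so that local structure is disrupted while global content is preserved. The grid size and the pairing convention are assumptions.

```python
import numpy as np

def local_patch_shuffle(img, grid=4, seed=0):
    """Split an HxWxC image into a grid of patches and permute them."""
    H, W, _ = img.shape
    ph, pw = H // grid, W // grid
    patches = [img[r*ph:(r+1)*ph, c*pw:(c+1)*pw] for r in range(grid) for c in range(grid)]
    order = np.random.default_rng(seed).permutation(len(patches))
    out = img.copy()
    for k, idx in enumerate(order):
        r, c = divmod(k, grid)
        out[r*ph:(r+1)*ph, c*pw:(c+1)*pw] = patches[idx]
    return out

img = np.random.rand(224, 224, 3)
pair = (img, local_patch_shuffle(img))   # the paired inputs fed to the Siamese encoder
```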

SKoPe3D: A Synthetic Dataset for Vehicle Keypoint Perception in 3D from Traffic Monitoring Cameras

  • paper_url: http://arxiv.org/abs/2309.01324
  • repo_url: None
  • paper_authors: Himanshu Pahadia, Duo Lu, Bharatesh Chakravarthi, Yezhou Yang
  • for: SKoPe3D is a synthetic vehicle keypoint dataset generated with the CARLA simulator from a roadside perspective, addressing the challenges of vehicle keypoint detection in vision-based vehicle monitoring for ITS.
  • methods: the dataset provides generated images with bounding boxes, tracking IDs, and 33 keypoints per vehicle, spanning over 25k images across 28 scenes with over 150k vehicle instances and 4.9 million keypoints.
  • results: a keypoint R-CNN model trained on the dataset as a baseline and thoroughly evaluated demonstrates the dataset's applicability and the potential for knowledge transfer between synthetic and real-world data.
    Abstract Intelligent transportation systems (ITS) have revolutionized modern road infrastructure, providing essential functionalities such as traffic monitoring, road safety assessment, congestion reduction, and law enforcement. Effective vehicle detection and accurate vehicle pose estimation are crucial for ITS, particularly using monocular cameras installed on the road infrastructure. One fundamental challenge in vision-based vehicle monitoring is keypoint detection, which involves identifying and localizing specific points on vehicles (such as headlights, wheels, taillights, etc.). However, this task is complicated by vehicle model and shape variations, occlusion, weather, and lighting conditions. Furthermore, existing traffic perception datasets for keypoint detection predominantly focus on frontal views from ego vehicle-mounted sensors, limiting their usability in traffic monitoring. To address these issues, we propose SKoPe3D, a unique synthetic vehicle keypoint dataset generated using the CARLA simulator from a roadside perspective. This comprehensive dataset includes generated images with bounding boxes, tracking IDs, and 33 keypoints for each vehicle. Spanning over 25k images across 28 scenes, SKoPe3D contains over 150k vehicle instances and 4.9 million keypoints. To demonstrate its utility, we trained a keypoint R-CNN model on our dataset as a baseline and conducted a thorough evaluation. Our experiments highlight the dataset's applicability and the potential for knowledge transfer between synthetic and real-world data. By leveraging the SKoPe3D dataset, researchers and practitioners can overcome the limitations of existing datasets, enabling advancements in vehicle keypoint detection for ITS.
    摘要 现代交通基础设施中的智能交通系统(ITS)已经革命化了现代交通基础设施,提供了重要的功能,如交通监测、路安全评估、减压和法律执法。在视觉基础上,精准的车辆检测和车辆位置估计是ITS的关键,特别是使用路边安装的单目camera。车辆特征和形态变化、遮挡、天气和照明条件会增加车辆检测的复杂度。此外,现有的交通感知数据集主要集中在前视角,限制了它们的应用范围。为解决这些问题,我们提出了SKoPe3D数据集,这是一个基于CARLA simulate器生成的路边视角的唯一的车辆关键点数据集。这个全面的数据集包括生成的图像、 bounding box、跟踪ID和33个关键点,涵盖了25000多张图像、28个场景,共计150000辆车辆和4900万个关键点。为证明其可用性,我们在我们的数据集上训练了一个关键点R-CNN模型,并进行了系统性的评估。我们的实验表明,SKoPe3D数据集可以应用于车辆关键点检测,并且可以在实际数据上传递知识。通过利用SKoPe3D数据集,研究人员和实践者可以超越现有数据集的限制,促进车辆关键点检测的进步,以推动ITS的发展。
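
A sketch of a keypoint R-CNN baseline using torchvision, configured for 33 keypoints per vehicle as in the dataset description; the box coordinates and tensors are placeholder values, and the paper's actual training setup is not reproduced.

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# background + vehicle, 33 keypoints per instance; trained from scratch in this sketch
model = keypointrcnn_resnet50_fpn(weights=None, weights_backbone=None,
                                  num_classes=2, num_keypoints=33)
model.train()

images = [torch.rand(3, 600, 800)]
keypoints = torch.cat([                              # (x, y) inside the box, visibility = 1
    torch.rand(1, 33, 2) * torch.tensor([300.0, 200.0]) + torch.tensor([100.0, 150.0]),
    torch.ones(1, 33, 1)], dim=-1)
targets = [{
    "boxes": torch.tensor([[100.0, 150.0, 400.0, 350.0]]),
    "labels": torch.tensor([1]),
    "keypoints": keypoints,
}]
loss_dict = model(images, targets)                   # detection + keypoint losses
loss = sum(loss_dict.values())
loss.backward()
```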

FAU-Net: An Attention U-Net Extension with Feature Pyramid Attention for Prostate Cancer Segmentation

  • paper_url: http://arxiv.org/abs/2309.01322
  • repo_url: None
  • paper_authors: Pablo Cesar Quihui-Rubio, Daniel Flores-Araiza, Miguel Gonzalez-Mendoza, Christian Mata, Gilberto Ochoa-Ruiz
  • for: a deep-learning method for segmenting prostate zones in MRI, aimed at improving the prostate cancer detection and diagnosis workflow.
  • methods: a U-Net combined with additive and feature pyramid attention modules to improve segmentation accuracy.
  • results: compared against seven U-Net-based architectures, the proposed model achieves a mean DSC of 84.15% and an IoU of 76.9% on the test set, outperforming most of the studied models; only the R2U-Net and attention R2U-Net architectures score higher.
    Abstract This contribution presents a deep learning method for the segmentation of prostate zones in MRI images based on U-Net using additive and feature pyramid attention modules, which can improve the workflow of prostate cancer detection and diagnosis. The proposed model is compared to seven different U-Net-based architectures. The automatic segmentation performance of each model of the central zone (CZ), peripheral zone (PZ), transition zone (TZ) and Tumor were evaluated using Dice Score (DSC), and the Intersection over Union (IoU) metrics. The proposed alternative achieved a mean DSC of 84.15% and IoU of 76.9% in the test set, outperforming most of the studied models in this work except from R2U-Net and attention R2U-Net architectures.
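
A sketch of an additive attention gate of the kind used in attention U-Net variants; the channel sizes are arbitrary and, unlike a full decoder, the gating signal here is assumed to already match the skip connection's spatial size.

```python
import torch
import torch.nn as nn

class AdditiveAttentionGate(nn.Module):
    """Additive attention gate: the decoder (gating) signal decides which
    skip-connection features are allowed to pass through."""
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.wg = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.wx = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, g, x):                       # g: gating signal, x: skip features
        a = self.psi(torch.relu(self.wg(g) + self.wx(x)))   # attention coefficients in [0, 1]
        return x * a                               # suppress irrelevant skip activations

gate = AdditiveAttentionGate(g_ch=256, x_ch=128, inter_ch=64)
out = gate(torch.randn(1, 256, 32, 32), torch.randn(1, 128, 32, 32))
```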

An FPGA smart camera implementation of segmentation models for drone wildfire imagery

  • paper_url: http://arxiv.org/abs/2309.01318
  • repo_url: None
  • paper_authors: Eduardo Guarduño-Martinez, Jorge Ciprian-Sanchez, Gerardo Valente, Vazquez-Garcia, Gerardo Rodriguez-Hernandez, Adriana Palacios-Rosas, Lucile Rossi-Tisson, Gilberto Ochoa-Ruiz
  • for: developing a feasible, low-power onboard computing architecture for wildfire detection and assessment from drones.
  • methods: a smart camera based on low-power field-programmable gate arrays (FPGAs) combined with binarized neural networks (BNNs) for efficient execution at the edge.
  • results: by pruning and quantizing the original segmentation model, the parameter count drops by 90% and throughput rises from 8 frames per second (FPS) to 33.63 FPS without loss in segmentation performance: the model scores 0.912 in Matthews correlation coefficient (MCC), 0.915 in F1 score, and 0.870 in Hafiane quality index (HAF), with qualitative results comparable to the original full-precision model.
    Abstract Wildfires represent one of the most relevant natural disasters worldwide, due to their impact on various societal and environmental levels. Thus, a significant amount of research has been carried out to investigate and apply computer vision techniques to address this problem. One of the most promising approaches for wildfire fighting is the use of drones equipped with visible and infrared cameras for the detection, monitoring, and fire spread assessment in a remote manner but in close proximity to the affected areas. However, implementing effective computer vision algorithms on board is often prohibitive since deploying full-precision deep learning models running on GPU is not a viable option, due to their high power consumption and the limited payload a drone can handle. Thus, in this work, we posit that smart cameras, based on low-power consumption field-programmable gate arrays (FPGAs), in tandem with binarized neural networks (BNNs), represent a cost-effective alternative for implementing onboard computing on the edge. Herein we present the implementation of a segmentation model applied to the Corsican Fire Database. We optimized an existing U-Net model for such a task and ported the model to an edge device (a Xilinx Ultra96-v2 FPGA). By pruning and quantizing the original model, we reduce the number of parameters by 90%. Furthermore, additional optimizations enabled us to increase the throughput of the original model from 8 frames per second (FPS) to 33.63 FPS without loss in the segmentation performance: our model obtained 0.912 in Matthews correlation coefficient (MCC),0.915 in F1 score and 0.870 in Hafiane quality index (HAF), and comparable qualitative segmentation results when contrasted to the original full-precision model. The final model was integrated into a low-cost FPGA, which was used to implement a neural network accelerator.
    摘要 野火是全球最重要的自然灾害之一,它对社会和环境层次产生了深远的影响。因此,许多研究已经进行了,以应用计算机见识技术来解决这个问题。一种最具吸引力的方法是使用具有可见光和红外线摄像头的无人机,以无人机遥测、监控和评估野火传播的方式进行远程监控,但是在邻近灾区进行这些操作。然而,实现有效的计算机见识算法在无人机上是经常不可能的,因为将全精度深度学习模型在GPU上运行是不可避免的,因为它们的电力消耗量太高,无人机的载重量也是有限的。因此,在这个工作中,我们认为智能相机,基于低功耗的场程可程式阵列(FPGAs),与二进制神经网络(BNNs)共同构成了一个可行的选择。我们在这里显示了对 corsica 火灾数据库进行分类模型的实现。我们修改了现有的 U-Net 模型,并将模型转移到边缘设备(Xilinx Ultra96-v2 FPGA)上。通过剪裁和数值化原始模型,我们缩减了模型的参数数量,从8帧/秒降至33.63帧/秒,并维持了分类性能的稳定。我们的模型在 Matthews 相互关联系数(MCC)、F1 分数(F1)和 Hafiane 质量指数(HAF)中获得了0.912、0.915和0.870的数据,并且在与原始全精度模型进行比较时,获得了相似的分类结果。最终模型被集成到低成本 FPGA 上,实现了一个神经网络加速器。
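
A rough sketch of the pruning-plus-binarization step that shrinks a layer before FPGA deployment: magnitude pruning removes roughly 90% of the weights and the survivors are reduced to a single scaled sign value. The keep ratio and scaling are illustrative, not the paper's exact quantization scheme.

```python
import torch
import torch.nn as nn

def binarize_and_prune(conv: nn.Conv2d, keep_ratio: float = 0.1) -> None:
    """Magnitude-prune, then binarize the surviving weights of a conv layer in place."""
    w = conv.weight.data
    thresh = w.abs().flatten().kthvalue(int((1 - keep_ratio) * w.numel())).values
    mask = (w.abs() > thresh).float()                       # keep only the largest weights
    conv.weight.data = torch.sign(w) * w.abs().mean() * mask   # {-a, 0, +a} weights

layer = nn.Conv2d(64, 64, 3, padding=1)
binarize_and_prune(layer, keep_ratio=0.1)        # ~90% of parameters zeroed, rest binarized
print((layer.weight == 0).float().mean())        # fraction of pruned weights
```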

Enhancing Automated and Early Detection of Alzheimer’s Disease Using Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2309.01312
  • repo_url: None
  • paper_authors: Audrey Paleczny, Shubham Parab, Maxwell Zhang
  • for: early detection of Alzheimer's disease in people aged 65 and older, so that diagnosis and treatment can begin sooner.
  • methods: deep learning models (Convolutional Neural Networks, CNNs) applied to Magnetic Resonance Imaging (MRI) for diagnosis, combined with supervised Random Forest classifiers on segmented brain volumes.
  • results: applying out-of-distribution (OOD) detection reduces false-positive diagnoses and improves reliability; the CNN-based model reaches 98% detection and 95% classification accuracy, surpassing the segmented-volume model's 93% and 87%.
    Abstract More than 10.7% of people aged 65 and older are affected by Alzheimer's disease. Early diagnosis and treatment are crucial as most Alzheimer's patients are unaware of having it until the effects become detrimental. AI has been known to use magnetic resonance imaging (MRI) to diagnose Alzheimer's. However, models which produce low rates of false diagnoses are critical to prevent unnecessary treatments. Thus, we trained supervised Random Forest models with segmented brain volumes and Convolutional Neural Network (CNN) outputs to classify different Alzheimer's stages. We then applied out-of-distribution (OOD) detection to the CNN model, enabling it to report OOD if misclassification is likely, thereby reducing false diagnoses. With an accuracy of 98% for detection and 95% for classification, our model based on CNN results outperformed our segmented volume model, which had detection and classification accuracies of 93% and 87%, respectively. Applying OOD detection to the CNN model enabled it to flag brain tumor images as OOD with 96% accuracy and minimal overall accuracy reduction. By using OOD detection to enhance the reliability of MRI classification using CNNs, we lowered the rate of false positives and eliminated a significant disadvantage of using Machine Learning models for healthcare tasks. Source code available upon request.
    摘要 更多于10.7%的人年龄在65岁及以上有患阿尔茨海默病。早期诊断和治疗是非常重要,因为大多数阿尔茨海默病患者并不知道自己患病 until the effects become detrimental。人工智能可以使用磁共振成像(MRI)进行诊断。然而,模型生成低False Positive率是非常重要,以避免不必要的治疗。因此,我们使用了监督式Random Forest模型,并将Convolutional Neural Network(CNN)输出与分割的脑部volume进行类别。我们然后将CNN模型应用到OOD检测,以便如果误分类可能,则报告OOD,从而降低了False Positive率。使用OOD检测可以提高MRI类别的可靠性,并且使用CNN模型进行健康任务的应用中,消除了一个重要的缺点。代码可以在请求时提供。
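
One common way to add OOD reporting on top of a trained CNN is to threshold the maximum softmax probability, as sketched below; the paper's OOD detector may differ, and the threshold here is an assumed value.

```python
import numpy as np

def flag_ood(softmax_probs, threshold=0.9):
    """Report 'OOD' instead of a class label whenever the model is not confident enough.
    `softmax_probs` is an (N, K) array of class probabilities."""
    confidence = softmax_probs.max(axis=1)
    labels = softmax_probs.argmax(axis=1).astype(object)
    labels[confidence < threshold] = "OOD"       # e.g. a brain-tumor scan fed to the AD model
    return labels

probs = np.array([[0.97, 0.02, 0.01],            # confident -> class 0
                  [0.40, 0.35, 0.25]])           # uncertain -> flagged as OOD
print(flag_ood(probs))
```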

EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity

  • paper_url: http://arxiv.org/abs/2309.01296
  • repo_url: None
  • paper_authors: Zijie Jiang, Masatoshi Okutomi
  • for: improving the accuracy of self-supervised monocular scene flow estimation by borrowing network-design advantages from supervised learning and exploiting ego-motion rigidity while suppressing the influence of dynamic regions.
  • methods: the proposed EMR-MSF model imposes explicit and robust geometric constraints through an ego-motion aggregation module with a rigidity soft mask that filters out dynamic regions, complemented by a motion consistency loss and a mask regularization loss to fully exploit static regions.
  • results: on the KITTI scene flow benchmark, the method improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% and catches up to supervised methods on sub-tasks such as depth and visual odometry.
    Abstract Self-supervised monocular scene flow estimation, aiming to understand both 3D structures and 3D motions from two temporally consecutive monocular images, has received increasing attention for its simple and economical sensor setup. However, the accuracy of current methods suffers from the bottleneck of less-efficient network architecture and lack of motion rigidity for regularization. In this paper, we propose a superior model named EMR-MSF by borrowing the advantages of network architecture design under the scope of supervised learning. We further impose explicit and robust geometric constraints with an elaborately constructed ego-motion aggregation module where a rigidity soft mask is proposed to filter out dynamic regions for stable ego-motion estimation using static regions. Moreover, we propose a motion consistency loss along with a mask regularization loss to fully exploit static regions. Several efficient training strategies are integrated including a gradient detachment technique and an enhanced view synthesis process for better performance. Our proposed method outperforms the previous self-supervised works by a large margin and catches up to the performance of supervised methods. On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% and demonstrates superior performance across sub-tasks including depth and visual odometry, amongst other self-supervised single-task or multi-task methods.
    摘要 自我监督单目场景流估算,寻求从两个 consecutively temporally 单目图像中理解三维结构和三维运动。由于现有方法的精度受到网络架构的瓶颈和运动稳定性的限制,因此在这篇论文中,我们提出了一种高效的模型 named EMR-MSF。我们采用了指导了supervised学习中网络架构的优点,并在 elaborate 构建了自身运动汇集模块,在这里我们提出了一种坚定性软面罩来过滤动态区域,以确保稳定的自身运动估算。此外,我们还提出了运动一致损失和面罩规范损失,以便充分利用静止区域。我们还 интегрирова了一些高效的训练策略,包括梯度分离技术和改进的视图合成过程,以提高表现。根据 KITTI 场景流标准测试集,我们的方法在自我监督单目方法中提高了 SF-all 指标44%,并在深度和视觉速度等子任务中表现出色,在其他自我监督单任务或多任务方法中也达到了类似的水平。
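
A sketch of a rigidity soft mask: it compares the predicted scene flow with the flow implied by the estimated ego-motion and down-weights dynamic regions. The exponential form and the constant `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def rigidity_soft_mask(scene_flow, rigid_flow, alpha=10.0):
    """Close to 1 where the predicted scene flow agrees with the ego-motion-induced flow
    (static regions) and decays toward 0 in dynamic regions."""
    residual = (scene_flow - rigid_flow).norm(dim=1, keepdim=True)   # (B, 1, H, W)
    return torch.exp(-alpha * residual)

scene_flow = torch.randn(2, 3, 64, 64)   # per-pixel 3D scene flow from the network
rigid_flow = torch.randn(2, 3, 64, 64)   # flow implied by ego-motion + depth for a static scene
mask = rigidity_soft_mask(scene_flow, rigid_flow)
# static pixels (mask ~ 1) dominate the ego-motion aggregation; dynamic ones are filtered out
```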