cs.CV - 2023-09-26

Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks

  • paper_url: http://arxiv.org/abs/2309.16499
  • repo_url: None
  • paper_authors: Danfeng Hong, Bing Zhang, Hao Li, Yuxuan Li, Jing Yao, Chenyu Li, Martin Werner, Jocelyn Chanussot, Alexander Zipf, Xiao Xiang Zhu
  • for: This paper targets high-resolution cross-city semantic segmentation, aiming to improve the generalization ability of artificial intelligence (AI) models in multimodal remote sensing (RS) applications.
  • methods: It proposes a high-resolution domain adaptation network (HighDAN) to improve generalization across multi-city environments. HighDAN fuses representations in a parallel high-to-low resolution fashion and uses adversarial learning to close the gap between RS image representations of different cities; a Dice loss is adopted to alleviate the class imbalance caused by cross-city factors.
  • results: Experiments on the C2Seg dataset show that HighDAN outperforms state-of-the-art competitors in both segmentation performance and generalization ability.
    Abstract Artificial intelligence (AI) approaches nowadays have gained remarkable success in single-modality-dominated remote sensing (RS) applications, especially with an emphasis on individual urban environments (e.g., single cities or regions). Yet these AI models tend to meet the performance bottleneck in the case studies across cities or regions, due to the lack of diverse RS information and cutting-edge solutions with high generalization ability. To this end, we build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task (called C2Seg dataset), which consists of two cross-city scenes, i.e., Berlin-Augsburg (in Germany) and Beijing-Wuhan (in China). Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN for short, to promote the AI model's generalization ability from the multi-city environments. HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion but also closing the gap derived from enormous differences of RS image representations between different cities by means of adversarial learning. In addition, the Dice loss is considered in HighDAN to alleviate the class imbalance issue caused by factors across cities. Extensive experiments conducted on the C2Seg dataset show the superiority of our HighDAN in terms of segmentation performance and generalization ability, compared to state-of-the-art competitors. The C2Seg dataset and the semantic segmentation toolbox (involving the proposed HighDAN) will be available publicly at https://github.com/danfenghong.
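
To make the Dice loss mentioned in the methods concrete, here is a minimal PyTorch sketch of a soft multi-class Dice loss (not the authors' implementation; the class count and smoothing constant are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, num_classes, eps=1e-6):
    """Soft multi-class Dice loss.

    logits:  (B, C, H, W) raw network outputs
    targets: (B, H, W) integer class labels
    """
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                 # sum over batch and spatial dims
    intersection = (probs * onehot).sum(dims)
    cardinality = probs.sum(dims) + onehot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()                         # average over classes

# usage sketch with 8 illustrative land-cover classes
logits = torch.randn(2, 8, 64, 64)
labels = torch.randint(0, 8, (2, 64, 64))
loss = dice_loss(logits, labels, num_classes=8)
```

Because the loss is normalized per class, rare classes contribute as much as frequent ones, which is why it is a common remedy for the class imbalance described above.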

Conversion of single-energy computed tomography to parametric maps of dual-energy computed tomography using convolutional neural network

  • paper_url: http://arxiv.org/abs/2309.15314
  • repo_url: None
  • paper_authors: Sangwook Kim, Jimin Lee, Jungye Kim, Bitbyeol Kim, Chang Heon Choi, Seongmoon Jung
  • for: The goal is to directly convert single-energy CT (SECT) images into three parametric maps of dual-energy CT (DECT): virtual monochromatic images (VMI), effective atomic number (EAN), and relative electron density (RED).
  • methods: A deep learning (DL) multi-task framework based on convolutional neural networks (CNNs) is proposed, with three networks (VMI-Net, EAN-Net, and RED-Net) converting SECT images into the respective parametric maps. The model was trained and validated on data from 67 patients collected between 2019 and 2020.
  • results: The model converts 120 kVp SECT images into high-quality VMI, EAN, and RED maps. The converted VMIs show an absolute difference (AD) of 9.02 Hounsfield units and a relative difference (RD) of 0.41% against the ground-truth VMIs; the converted EAN and RED have ADs of 0.29 and 0.96 and RDs of 1.99% and 0.50%, respectively.
    Abstract Objectives: We propose a deep learning (DL) multi-task learning framework using convolutional neural network (CNN) for a direct conversion of single-energy CT (SECT) to three different parametric maps of dual-energy CT (DECT): Virtual-monochromatic image (VMI), effective atomic number (EAN), and relative electron density (RED). Methods: We propose VMI-Net for conversion of SECT to 70, 120, and 200 keV VMIs. In addition, EAN-Net and RED-Net were also developed to convert SECT to EAN and RED. We trained and validated our model using 67 patients collected between 2019 and 2020. SECT images with 120 kVp acquired by the DECT (IQon spectral CT, Philips) were used as input, while the VMIs, EAN, and RED acquired by the same device were used as target. The performance of the DL framework was evaluated by absolute difference (AD) and relative difference (RD). Results: The VMI-Net converted 120 kVp SECT to the VMIs with AD of 9.02 Hounsfield Unit, and RD of 0.41% compared to the ground truth VMIs. The ADs of the converted EAN and RED were 0.29 and 0.96, respectively, while the RDs were 1.99% and 0.50% for the converted EAN and RED, respectively. Conclusions: SECT images were directly converted to the three parametric maps of DECT (i.e., VMIs, EAN, and RED). By using this model, one can generate the parametric information from SECT images without DECT device. Our model can help investigate the parametric information from SECT retrospectively. Advances in knowledge: Deep learning framework enables converting SECT to various high-quality parametric maps of DECT.
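
The evaluation metrics are simple voxel-wise statistics. A hedged NumPy sketch of how the absolute difference (AD) and relative difference (RD) might be computed (the paper's exact averaging and normalization may differ):

```python
import numpy as np

def absolute_difference(pred, gt):
    """Mean absolute difference between predicted and ground-truth parametric maps."""
    return np.mean(np.abs(pred - gt))

def relative_difference(pred, gt, eps=1e-6):
    """Mean relative difference in percent; eps guards against division by zero."""
    return 100.0 * np.mean(np.abs(pred - gt) / (np.abs(gt) + eps))

# usage sketch on a synthetic VMI pair (values in Hounsfield units, purely illustrative)
gt_vmi = np.random.uniform(0, 1000, size=(512, 512))
pred_vmi = gt_vmi + np.random.normal(0, 9, size=gt_vmi.shape)
print(absolute_difference(pred_vmi, gt_vmi), relative_difference(pred_vmi, gt_vmi))
```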

M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding

  • paper_url: http://arxiv.org/abs/2309.15313
  • repo_url: None
  • paper_authors: Muhammad Abdullah Jamal, Omid Mohareri
  • for: This work proposes a new pre-training strategy, M$^{3}$3D (Multi-Modal Masked 3D), which leverages 3D priors and learned cross-modal representations in RGB-D data through a combination of self-supervised learning frameworks.
  • methods: Two major self-supervised frameworks, Masked Image Modeling (MIM) and contrastive learning, are integrated to embed masked 3D priors and modality-complementary features and to strengthen the correspondence between modalities.
  • results: Experiments show that M$^{3}$3D outperforms existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101, and OR-AR, including a +1.3% mIoU improvement over Mask3D on ScanNet semantic segmentation; the method is also shown to be more data-efficient in low-data regimes.
    Abstract We present a new pre-training strategy called M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$) built based on Multi-modal masked autoencoders that can leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks; Masked Image Modeling (MIM) and contrastive learning; aiming to effectively embed masked 3D priors and modality complementary features to enhance the correspondence between modalities. In contrast to recent approaches which are either focusing on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is ubiquitous, enabling improved representation learning that can transfer into improved performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation and depth estimation. Experiments show that M$^{3}$3D outperforms the existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR, particularly with an improvement of +1.3\% mIoU against Mask3D on ScanNet semantic segmentation. We further evaluate our method on low-data regime and demonstrate its superior data efficiency compared to current state-of-the-art approaches.
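
As a rough illustration of the masked-image-modeling side of such pre-training, the sketch below masks a random subset of image patches that an autoencoder would then reconstruct. Patch size and masking ratio are assumptions, not values from the paper:

```python
import torch

def random_patch_mask(images, patch_size=16, mask_ratio=0.75):
    """Return masked images and a boolean mask over non-overlapping patches.

    images: (B, C, H, W); H and W must be divisible by patch_size.
    """
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    num_patches = ph * pw
    num_masked = int(mask_ratio * num_patches)

    # sample a random subset of patches to mask for every image in the batch
    noise = torch.rand(b, num_patches)
    ids = noise.argsort(dim=1)
    mask = torch.zeros(b, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids[:, :num_masked], True)

    # zero out masked patches; a masked autoencoder would reconstruct them
    mask_img = mask.view(b, 1, ph, pw).repeat_interleave(patch_size, 2).repeat_interleave(patch_size, 3)
    return images.masked_fill(mask_img, 0.0), mask

masked, mask = random_patch_mask(torch.randn(4, 3, 224, 224))
```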

SEPT: Towards Efficient Scene Representation Learning for Motion Prediction

  • paper_url: http://arxiv.org/abs/2309.15289
  • repo_url: None
  • paper_authors: Zhiqian Lan, Yuxuan Jiang, Yao Mu, Chen Chen, Shengbo Eben Li, Hang Zhao, Keqiang Li
  • for: This paper aims to help autonomous vehicles operate safely in complex traffic environments by extracting effective spatiotemporal relationships among traffic elements for accurate motion prediction.
  • methods: It proposes SEPT, a self-supervised modeling framework that pretrains a scene encoder with masking-reconstruction tasks on agent trajectories and the road network, capturing trajectory kinematics, the spatial structure of the road network, and road-agent interactions; the pretrained encoder is then finetuned for forecasting.
  • results: Extensive experiments show that SEPT achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
    Abstract Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the successful practice of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful spatiotemporal understanding for complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs including agents' trajectories and road network, pretraining the scene encoder to capture kinematics within trajectory, spatial structure of road network, and interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
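
One of the masking-reconstruction pretext tasks operates on agent trajectories. Below is a minimal sketch of that idea, masking random waypoints and reconstructing them with an MSE loss; the encoder interface, mask ratio, and loss are assumptions, not the released SEPT code:

```python
import torch
import torch.nn as nn

class TrajectoryMaskedReconstruction(nn.Module):
    """Mask random waypoints of each trajectory and reconstruct them."""

    def __init__(self, encoder: nn.Module, d_model: int, mask_ratio: float = 0.5):
        super().__init__()
        self.encoder = encoder                       # any module mapping (B, T, 2) -> (B, T, d_model)
        self.head = nn.Linear(d_model, 2)            # predict an (x, y) waypoint per timestep
        self.mask_token = nn.Parameter(torch.zeros(2))
        self.mask_ratio = mask_ratio

    def forward(self, traj):                         # traj: (B, T, 2) agent positions
        b, t, _ = traj.shape
        mask = torch.rand(b, t, device=traj.device) < self.mask_ratio
        corrupted = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(traj), traj)
        recon = self.head(self.encoder(corrupted))
        # reconstruction loss only on the masked waypoints
        return ((recon - traj) ** 2)[mask].mean()
```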

Boosting High Resolution Image Classification with Scaling-up Transformers

  • paper_url: http://arxiv.org/abs/2309.15277
  • repo_url: https://github.com/wangyi111/cvppa2023-dnd-challenge
  • paper_authors: Yi Wang
  • for: This paper presents a holistic approach to high-resolution image classification that won second place in the ICCV/CVPPA2023 Deep Nutrient Deficiency Challenge.
  • methods: The pipeline consists of six steps: 1) data distribution analysis to check for potential domain shift, 2) selection of a strong baseline backbone that scales to high-resolution input, 3) transfer learning from published pretrained models with continuous fine-tuning on small sub-datasets, 4) data augmentation to diversify the training data and prevent overfitting, 5) test-time augmentation to improve prediction robustness, and 6) "data soups", i.e., cross-fold averaging of model predictions for smoother final test results.
  • results: The approach placed second in the ICCV/CVPPA2023 Deep Nutrient Deficiency Challenge and performs strongly on high-resolution image classification.
    Abstract We present a holistic approach for high resolution image classification that won second place in the ICCV/CVPPA2023 Deep Nutrient Deficiency Challenge. The approach consists of a full pipeline of: 1) data distribution analysis to check potential domain shift, 2) backbone selection for a strong baseline model that scales up for high resolution input, 3) transfer learning that utilizes published pretrained models and continuous fine-tuning on small sub-datasets, 4) data augmentation for the diversity of training data and to prevent overfitting, 5) test-time augmentation to improve the prediction's robustness, and 6) "data soups" that conducts cross-fold model prediction average for smoothened final test results.
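
A minimal sketch of the last two steps, test-time augmentation and cross-fold "data soup" averaging; the flip augmentations and model interface are illustrative assumptions:

```python
import torch

@torch.no_grad()
def tta_predict(model, image):
    """Average softmax predictions over simple test-time augmentations (horizontal/vertical flips)."""
    views = [image, torch.flip(image, dims=[-1]), torch.flip(image, dims=[-2])]
    probs = torch.stack([torch.softmax(model(v), dim=-1) for v in views])
    return probs.mean(dim=0)

@torch.no_grad()
def data_soup_predict(fold_models, image):
    """'Data soup': average TTA predictions over models trained on different folds, then take argmax."""
    probs = torch.stack([tta_predict(m, image) for m in fold_models])
    return probs.mean(dim=0).argmax(dim=-1)
```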

A Topological Machine Learning Pipeline for Classification

  • paper_url: http://arxiv.org/abs/2309.15276
  • repo_url: None
  • paper_authors: Francesco Conti, Davide Moroni, Maria Antonietta Pascali
  • for: This work develops a pipeline that associates persistence diagrams with digital data for classification.
  • methods: A grid search determines the most appropriate filtration, representation method, and parameters for the data at hand.
  • results: The pipeline turns persistence diagrams into representations suitable for machine learning; its performance is assessed and the different representation methods are compared on popular benchmark datasets.
    Abstract In this work, we develop a pipeline that associates Persistence Diagrams to digital data via the most appropriate filtration for the type of data considered. Using a grid search approach, this pipeline determines optimal representation methods and parameters. The development of such a topological pipeline for Machine Learning involves two crucial steps that strongly affect its performance: firstly, digital data must be represented as an algebraic object with a proper associated filtration in order to compute its topological summary, the Persistence Diagram. Secondly, the persistence diagram must be transformed with suitable representation methods in order to be introduced in a Machine Learning algorithm. We assess the performance of our pipeline, and in parallel, we compare the different representation methods on popular benchmark datasets. This work is a first step toward both an easy and ready-to-use pipeline for data classification using persistent homology and Machine Learning, and to understand the theoretical reasons why, given a dataset and a task to be performed, a pair (filtration, topological representation) is better than another.
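
A schematic sketch of the central loop, a grid search over (filtration, diagram representation) pairs scored by cross-validated accuracy. The `filtrations` and `representations` dictionaries, and the SVC with 5-fold cross-validation, are placeholders rather than the paper's exact setup:

```python
from itertools import product

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def grid_search_topological_pipeline(X_raw, y, filtrations, representations):
    """Pick the (filtration, representation) pair with the best cross-validated accuracy.

    filtrations:     name -> function mapping a raw sample to a persistence diagram
    representations: name -> function mapping a diagram to a fixed-length feature vector
    """
    best = (None, -1.0)
    for (f_name, filt), (r_name, vec) in product(filtrations.items(), representations.items()):
        diagrams = [filt(x) for x in X_raw]          # topological summaries of the data
        features = [vec(d) for d in diagrams]        # vectorized diagrams usable by a classifier
        score = cross_val_score(SVC(), features, y, cv=5).mean()
        if score > best[1]:
            best = ((f_name, r_name), score)
    return best
```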

DECO: Dense Estimation of 3D Human-Scene Contact In The Wild

  • paper_url: http://arxiv.org/abs/2309.15273
  • repo_url: None
  • paper_authors: Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, Michael J. Black
  • for: This work aims to infer dense 3D contact between the full human body surface and objects in in-the-wild images.
  • methods: The newly collected DAMON dataset provides dense vertex-level contact annotations paired with RGB images of complex human-object and human-scene contact; on it, a 3D contact detector named DECO is trained using novel body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body.
  • results: Extensive evaluations on DAMON as well as the RICH and BEHAVE datasets show significant gains over existing SOTA methods across all benchmarks, and DECO generalizes well to diverse and challenging real-world human interactions in natural images.
    Abstract Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.
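
Since DECO predicts contact at the level of SMPL vertices, the supervision reduces to a per-vertex binary classification problem. A minimal sketch of such a loss is shown below; the positive-class weighting and the detector interface are assumptions, and the actual DECO training uses its attention-based architecture and possibly different loss terms:

```python
import torch
import torch.nn as nn

NUM_SMPL_VERTICES = 6890  # number of vertices on the SMPL body mesh

class VertexContactLoss(nn.Module):
    """Binary cross-entropy over per-vertex contact probabilities."""

    def __init__(self, pos_weight: float = 5.0):
        super().__init__()
        # contact vertices are rare, so weight the positive class more heavily (illustrative value)
        self.bce = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(pos_weight))

    def forward(self, vertex_logits, contact_labels):
        # vertex_logits, contact_labels: (B, NUM_SMPL_VERTICES)
        return self.bce(vertex_logits, contact_labels.float())

loss_fn = VertexContactLoss()
logits = torch.randn(2, NUM_SMPL_VERTICES)
labels = torch.randint(0, 2, (2, NUM_SMPL_VERTICES))
print(loss_fn(logits, labels))
```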

ObVi-SLAM: Long-Term Object-Visual SLAM

  • paper_url: http://arxiv.org/abs/2309.15268
  • repo_url: https://github.com/ut-amrl/obvi-slam
  • paper_authors: Amanda Adkins, Taijing Chen, Joydeep Biswas
  • for: This paper addresses localization for robots operating over long time scales, i.e., keeping localization consistent and scalable under geometric, viewpoint, and appearance changes.
  • methods: ObVi-SLAM combines low-level visual features for high-quality short-term visual odometry with an uncertainty-aware long-term map of persistent objects that is updated after every deployment.
  • results: Experiments on data from 16 deployment sessions spanning different weather and lighting conditions show that ObVi-SLAM produces accurate localization estimates that remain consistent over long time scales despite varying appearance conditions.
    Abstract Robots responsible for tasks over long time scales must be able to localize consistently and scalably amid geometric, viewpoint, and appearance changes. Existing visual SLAM approaches rely on low-level feature descriptors that are not robust to such environmental changes and result in large map sizes that scale poorly over long-term deployments. In contrast, object detections are robust to environmental variations and lead to more compact representations, but most object-based SLAM systems target short-term indoor deployments with close objects. In this paper, we introduce ObVi-SLAM to overcome these challenges by leveraging the best of both approaches. ObVi-SLAM uses low-level visual features for high-quality short-term visual odometry; and to ensure global, long-term consistency, ObVi-SLAM builds an uncertainty-aware long-term map of persistent objects and updates it after every deployment. By evaluating ObVi-SLAM on data from 16 deployment sessions spanning different weather and lighting conditions, we empirically show that ObVi-SLAM generates accurate localization estimates consistent over long-time scales in spite of varying appearance conditions.

SLIQ: Quantum Image Similarity Networks on Noisy Quantum Computers

  • paper_url: http://arxiv.org/abs/2309.15259
  • repo_url: https://github.com/silverengineered/sliq
  • paper_authors: Daniel Silver, Tirthak Patel, Aditya Ranjan, Harshitta Gandhi, William Cutler, Devesh Tiwari
  • for: This work addresses the challenge of running unsupervised similarity detection tasks on quantum computers by proposing SLIQ, the first open-sourced, resource-efficient quantum similarity detection network.
  • methods: SLIQ is built with practical and effective quantum learning and variance-reducing algorithms.
  • results: SLIQ enables effective unsupervised similarity detection under the resource constraints of noisy quantum computers.
    Abstract Exploration into quantum machine learning has grown tremendously in recent years due to the ability of quantum computers to speed up classical programs. However, these efforts have yet to solve unsupervised similarity detection tasks due to the challenge of porting them to run on quantum computers. To overcome this challenge, we propose SLIQ, the first open-sourced work for resource-efficient quantum similarity detection networks, built with practical and effective quantum learning and variance-reducing algorithms.

APIS: A paired CT-MRI dataset for ischemic stroke segmentation challenge

  • paper_url: http://arxiv.org/abs/2309.15243
  • repo_url: None
  • paper_authors: Santiago Gómez, Daniel Mantilla, Gustavo Garzón, Edgar Rangel, Andrés Ortiz, Franklin Sierra-Jerez, Fabio Martínez
  • for: This work provides a public dataset of paired NCCT and ADC studies so that researchers can characterize brain lesions in stroke patients.
  • methods: The APIS dataset contains paired NCCT and ADC studies of acute ischemic stroke patients and was presented as a challenge at the 20th IEEE International Symposium on Biomedical Imaging 2023, inviting researchers to propose computational strategies that leverage the paired data for lesion segmentation over CT sequences.
  • results: Although all participating teams employed specialized deep learning tools, the results indicate that segmenting ischemic stroke lesions from NCCT remains challenging. The annotated dataset is publicly accessible upon registration, inviting the scientific community to tackle stroke characterization from NCCT guided by paired DWI information.
    Abstract Stroke is the second leading cause of mortality worldwide. Immediate attention and diagnosis play a crucial role regarding patient prognosis. The key to diagnosis consists in localizing and delineating brain lesions. Standard stroke examination protocols include the initial evaluation from a non-contrast CT scan to discriminate between hemorrhage and ischemia. However, non-contrast CTs may lack sensitivity in detecting subtle ischemic changes in the acute phase. As a result, complementary diffusion-weighted MRI studies are captured to provide valuable insights, allowing to recover and quantify stroke lesions. This work introduced APIS, the first paired public dataset with NCCT and ADC studies of acute ischemic stroke patients. APIS was presented as a challenge at the 20th IEEE International Symposium on Biomedical Imaging 2023, where researchers were invited to propose new computational strategies that leverage paired data and deal with lesion segmentation over CT sequences. Despite all the teams employing specialized deep learning tools, the results suggest that the ischemic stroke segmentation task from NCCT remains challenging. The annotated dataset remains accessible to the public upon registration, inviting the scientific community to deal with stroke characterization from NCCT but guided with paired DWI information.

CLRmatchNet: Enhancing Curved Lane Detection with Deep Matching Process

  • paper_url: http://arxiv.org/abs/2309.15204
  • repo_url: None
  • paper_authors: Sapir Kontente, Roy Orfaig, Ben-Zion Bobrovsky
  • for: Improve lane detection accuracy to provide more reliable data for safe autonomous navigation.
  • methods: A deep learning sub-module network (MatchNet) replaces the conventional label assignment process, integrated into a state-of-the-art lane detection network (CLRNet).
  • results: Detection accuracy improves significantly on curved lanes across all backbones, while comparable results are maintained or improved in the other sections.
    Abstract Lane detection plays a crucial role in autonomous driving by providing vital data to ensure safe navigation. Modern algorithms rely on anchor-based detectors, which are then followed by a label assignment process to categorize training detections as positive or negative instances based on learned geometric attributes. The current methods, however, have limitations and might not be optimal since they rely on predefined classical cost functions that are based on a low-dimensional model. Our research introduces MatchNet, a deep learning sub-module-based approach aimed at enhancing the label assignment process. Integrated into a state-of-the-art lane detection network like the Cross Layer Refinement Network for Lane Detection (CLRNet), MatchNet replaces the conventional label assignment process with a sub-module network. This integration results in significant improvements in scenarios involving curved lanes, with remarkable improvement across all backbones of +2.8% for ResNet34, +2.3% for ResNet101, and +2.96% for DLA34. In addition, it maintains or even improves comparable results in other sections. Our method boosts the confidence level in lane detection, allowing an increase in the confidence threshold. The code will be available soon: https://github.com/sapirkontente/CLRmatchNet.git

GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes

  • paper_url: http://arxiv.org/abs/2309.16019
  • repo_url: https://github.com/zxcqlf/gasmono
  • paper_authors: Chaoqiang Zhao, Matteo Poggi, Fabio Tosi, Lei Zhou, Qiyu Sun, Yang Tang, Stefano Mattoccia
  • for: This paper aims to improve self-supervised monocular depth estimation in indoor scenes, addressing challenges such as large rotation and low texture.
  • methods: The proposed method uses multi-view geometry to obtain coarse camera poses, and refines them through rotation and translation/scale optimization. It also combines global reasoning with an overfitting-aware, iterative self-distillation mechanism to improve depth estimation.
  • results: The proposed method achieves state-of-the-art performance on four datasets (NYUv2, ScanNet, 7scenes, and KITTI), with outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono.
    Abstract This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture. We ease the learning process by obtaining coarse camera poses from monocular sequences through multi-view geometry to deal with the former. However, we found that limited by the scale ambiguity across different scenes in the training dataset, a na\"ive introduction of geometric coarse poses cannot play a positive role in performance improvement, which is counter-intuitive. To address this problem, we propose to refine those poses during training through rotation and translation/scale optimization. To soften the effect of the low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism, providing more accurate depth guidance coming from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the effectiveness of each component in our framework, which sets a new state-of-the-art for indoor self-supervised monocular depth estimation, as well as outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono

Generating Visual Scenes from Touch

  • paper_url: http://arxiv.org/abs/2309.15117
  • repo_url: https://github.com/roopeshukla/Drishti-For_Blind
  • paper_authors: Fengyu Yang, Jiacheng Zhang, Andrew Owens
  • for: This paper aims to generate plausible images from touch.
  • methods: It draws on recent advances in latent diffusion to build a model that synthesizes images from tactile signals (and vice versa) and applies it to several visuo-tactile synthesis tasks.
  • results: The model significantly outperforms prior work on tactile-driven stylization (manipulating an image to match a touch signal) and is the first to generate images from touch without additional information about the scene. It is also used to solve two novel synthesis problems: generating images that do not contain the touch sensor or the hand holding it, and estimating an image's shading from its reflectance and touch.
    Abstract An emerging line of work has sought to generate plausible imagery from touch. Existing approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and lag significantly behind the quality of cross-modal synthesis methods in other domains. We draw on recent advances in latent diffusion to create a model for synthesizing images from tactile signals (and vice versa) and apply it to a number of visuo-tactile synthesis tasks. Using this model, we significantly outperform prior work on the tactile-driven stylization problem, i.e., manipulating an image to match a touch signal, and we are the first to successfully generate images from touch without additional sources of information about the scene. We also successfully use our model to address two novel synthesis problems: generating images that do not contain the touch sensor or the hand holding it, and estimating an image's shading from its reflectance and touch.

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

  • paper_url: http://arxiv.org/abs/2309.15112
  • repo_url: https://github.com/internlm/internlm-xcomposer
  • paper_authors: Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
  • for: This paper proposes a vision-language large model called InternLM-XComposer, which enables advanced image-text comprehension and composition.
  • methods: The model supports interleaved text-image composition and image-text comprehension empowered by rich multilingual knowledge, trained on extensive multi-modal multilingual concepts.
  • results: The model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models.
    Abstract We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a title, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, and CCBench (Chinese Cultural Benchmark). Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.

DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation

  • paper_url: http://arxiv.org/abs/2309.15109
  • repo_url: https://github.com/qcraftai/distill-bev
  • paper_authors: Zeyu Wang, Dingwen Li, Chenxu Luo, Cihang Xie, Xiaodong Yang
  • for: Improve multi-camera bird's-eye-view (BEV) based 3D object detection, since cameras are cost-effective for mass production in the autonomous driving industry.
  • methods: A multi-camera BEV student detector is trained to imitate the features of a well-trained LiDAR-based teacher detector; an effective balancing strategy focuses the student on the crucial features, and knowledge transfer is generalized to multi-scale layers with temporal fusion.
  • results: Extensive evaluations on multiple representative multi-camera BEV models show significant improvements over the student models, leading to state-of-the-art performance on nuScenes.
    Abstract 3D perception based on the representations learned from multi-camera bird's-eye-view (BEV) is trending as cameras are cost-effective for mass production in autonomous driving industry. However, there exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection. One key reason is that LiDAR captures accurate depth and other geometry measurements, while it is notoriously challenging to infer such 3D information from merely image input. In this work, we propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector. We propose effective balancing strategy to enforce the student to focus on learning the crucial features from the teacher, and generalize knowledge transfer to multi-scale layers with temporal fusion. We conduct extensive evaluations on multiple representative models of multi-camera BEV. Experiments reveal that our approach renders significant improvement over the student models, leading to the state-of-the-art performance on the popular benchmark nuScenes.
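
The core of this cross-modal distillation is making student BEV features imitate the teacher's (LiDAR) BEV features. A minimal sketch of such an imitation loss with a foreground-balancing mask follows; the channel adapter, foreground weight, and overall weighting scheme are illustrative, and the paper's balancing strategy is more elaborate:

```python
import torch
import torch.nn as nn

class BEVFeatureDistillation(nn.Module):
    """MSE imitation of teacher BEV features, re-weighted toward foreground cells."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # 1x1 conv aligns the student's channel dimension with the teacher's
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat, fg_mask, fg_weight: float = 4.0):
        # student_feat: (B, Cs, H, W), teacher_feat: (B, Ct, H, W), fg_mask: (B, 1, H, W) in {0, 1}
        student_feat = self.adapt(student_feat)
        weight = 1.0 + (fg_weight - 1.0) * fg_mask            # foreground cells count more
        return (weight * (student_feat - teacher_feat.detach()) ** 2).mean()
```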

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.15103
  • repo_url: None
  • paper_authors: Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu
  • for: This work builds a high-quality text-to-video (T2V) generative model on top of a pretrained text-to-image (T2I) model.
  • methods: The proposed LaVie framework operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Simple temporal self-attention coupled with rotary positional encoding lets LaVie capture the temporal correlations inherent in video data.
  • results: Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively, and pretrained LaVie models prove versatile in long video generation and personalized video synthesis applications.
    Abstract This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
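
For reference, here is one common formulation of rotary positional encoding applied along the temporal axis (the split-halves variant); LaVie's exact implementation may differ, and the token and head dimensions below are illustrative:

```python
import torch

def rotary_embedding(x, base: float = 10000.0):
    """Apply rotary positional encoding along the temporal axis.

    x: (B, T, D) per-frame tokens with D even. Channel pairs are rotated by an angle
    that grows with the frame index, so attention scores become a function of relative
    temporal position.
    """
    b, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)   # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None]   # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# queries and keys of a temporal self-attention layer would be rotated like this
q = rotary_embedding(torch.randn(2, 16, 64))   # 16 video frames, 64-dim tokens
```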

Case Study: Ensemble Decision-Based Annotation of Unconstrained Real Estate Images

  • paper_url: http://arxiv.org/abs/2309.15097
  • repo_url: None
  • paper_authors: Miroslav Despotovic, Zedong Zhang, Eric Stumpe, Matthias Zeppelzauer
  • for: This study is a proof-of-concept for annotating real estate images.
  • methods: Simple iterative rule-based semi-supervised learning is used to annotate unconstrained real estate images.
  • results: The study yields important insights into the content characteristics and uniqueness of the individual image classes, as well as essential requirements for a practical implementation.
    Abstract We describe a proof-of-concept for annotating real estate images using simple iterative rule-based semi-supervised learning. In this study, we have gained important insights into the content characteristics and uniqueness of individual image classes as well as essential requirements for a practical implementation.
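
A schematic sketch of what simple iterative rule-based semi-supervised labeling can look like in practice: rules seed the labels, a classifier is retrained, and confident predictions are folded back in. All function names and thresholds below are illustrative assumptions, not the study's implementation:

```python
def iterative_rule_based_labeling(images, apply_rules, train_classifier,
                                  confidence_threshold=0.9, rounds=3):
    """Iteratively grow a labeled set from rule hits and confident model predictions."""
    labeled = {}                                     # image id -> class label
    for img_id, img in images.items():               # 1) seed labels from hand-written rules
        label = apply_rules(img)                      #    e.g. metadata or visual heuristics
        if label is not None:
            labeled[img_id] = label

    for _ in range(rounds):                           # 2) alternate training and pseudo-labeling
        model = train_classifier(labeled, images)
        for img_id, img in images.items():
            if img_id in labeled:
                continue
            label, confidence = model.predict(img)
            if confidence >= confidence_threshold:    # 3) only keep confident predictions
                labeled[img_id] = label
    return labeled
```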

Video-adverb retrieval with compositional adverb-action embeddings

  • paper_url: http://arxiv.org/abs/2309.15086
  • repo_url: https://github.com/ExplainableML/ReGaDa
  • paper_authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata
  • for: This work proposes a framework for video-to-adverb retrieval (and vice versa) as a step towards fine-grained video understanding.
  • methods: Video embeddings are aligned with compositional adverb-action text embeddings in a joint embedding space; the text embedding is learned with a residual gating mechanism and a novel training objective combining triplet losses with a regression target.
  • results: The method achieves state-of-the-art performance on five recent video-adverb retrieval benchmarks. New dataset splits are also introduced for retrieving unseen adverb-action compositions, where the framework outperforms all prior work.
    Abstract Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism, along with a novel training objective consisting of triplet losses and a regression target. Our method achieves state-of-the-art performance on five recent benchmarks for video-adverb retrieval. Furthermore, we introduce dataset splits to benchmark video-adverb retrieval for unseen adverb-action compositions on subsets of the MSR-VTT Adverbs and ActivityNet Adverbs datasets. Our proposed framework outperforms all prior works for the generalisation task of retrieving adverbs from videos for unseen adverb-action compositions. Code and dataset splits are available at https://hummelth.github.io/ReGaDa/.
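
A minimal sketch of a combined triplet-plus-regression objective of the kind described above, operating on the joint embedding space; the margin, loss weight, and the choice of cosine similarity as the regression target are assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def video_adverb_objective(video_emb, pos_text_emb, neg_text_emb, margin=0.2, reg_weight=1.0):
    """Triplet loss pulling a video toward its matching adverb-action text embedding,
    plus a regression term pushing their cosine similarity toward 1."""
    d_pos = 1.0 - F.cosine_similarity(video_emb, pos_text_emb)   # distance to the correct pairing
    d_neg = 1.0 - F.cosine_similarity(video_emb, neg_text_emb)   # distance to a mismatched pairing
    triplet = F.relu(d_pos - d_neg + margin).mean()
    regression = F.mse_loss(F.cosine_similarity(video_emb, pos_text_emb),
                            torch.ones(video_emb.shape[0]))
    return triplet + reg_weight * regression
```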

The Surveillance AI Pipeline

  • paper_url: http://arxiv.org/abs/2309.15084
  • repo_url: None
  • paper_authors: Pratyusha Ria Kalluri, William Agnew, Myra Cheng, Kentrell Owens, Luca Soldaini, Abeba Birhane
  • for: This paper documents how computer vision research powers mass surveillance.
  • methods: The authors analyze three decades of computer vision research papers and downstream patents, more than 40,000 documents in total, to trace the pipeline from computer vision research to surveillance.
  • results: The large majority of annotated computer vision papers and patents self-report that their technology enables extracting data about humans, primarily human bodies and body parts. Institutions, nations, and subfields heavily involved in computer vision research, including elite universities and "big tech" corporations, frequently have their work cited in surveillance patents. Overall, the number of papers with downstream surveillance patents grew more than five-fold between the 1990s and the 2010s, with computer vision research now used in more than 11,000 surveillance patents; many documents also use language that obfuscates the extent of surveillance.
    Abstract A rapidly growing number of voices argue that AI research, and computer vision in particular, is powering mass surveillance. Yet the direct path from computer vision research to surveillance has remained obscured and difficult to assess. Here, we reveal the Surveillance AI pipeline by analyzing three decades of computer vision research papers and downstream patents, more than 40,000 documents. We find the large majority of annotated computer vision papers and patents self-report their technology enables extracting data about humans. Moreover, the majority of these technologies specifically enable extracting data about human bodies and body parts. We present both quantitative and rich qualitative analysis illuminating these practices of human data extraction. Studying the roots of this pipeline, we find that institutions that prolifically produce computer vision research, namely elite universities and "big tech" corporations, are subsequently cited in thousands of surveillance patents. Further, we find consistent evidence against the narrative that only these few rogue entities are contributing to surveillance. Rather, we expose the fieldwide norm that when an institution, nation, or subfield authors computer vision papers with downstream patents, the majority of these papers are used in surveillance patents. In total, we find the number of papers with downstream surveillance patents increased more than five-fold between the 1990s and the 2010s, with computer vision research now having been used in more than 11,000 surveillance patents. Finally, in addition to the high levels of surveillance we find documented in computer vision papers and patents, we unearth pervasive patterns of documents using language that obfuscates the extent of surveillance. Our analysis reveals the pipeline by which computer vision research has powered the ongoing expansion of surveillance.

RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation

  • paper_url: http://arxiv.org/abs/2309.15082
  • repo_url: https://github.com/danqu130/RPEFlow
  • paper_authors: Zhexiong Wan, Yuxin Mao, Jing Zhang, Yuchao Dai
  • for: This work jointly estimates 2D optical flow and 3D scene flow via multimodal fusion of RGB images, point clouds, and events.
  • methods: A multi-stage multimodal fusion model, RPEFlow, is proposed; it uses an attention fusion module with a cross-attention mechanism for the 2D and 3D branches, together with a mutual information regularization term that models the complementary information of the three modalities.
  • results: Experiments show that the model outperforms the existing state of the art by a wide margin on both synthetic and real datasets, and a new synthetic dataset is contributed to encourage further research on multimodal perception.
    Abstract Recently, the RGB images and point clouds fusion methods have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly-dynamic scenes. By contrast, the event camera can asynchronously capture the intensity changes with a very high temporal resolution, providing complementary dynamic information of the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset is available at https://npucvr.github.io/RPEFlow.
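
A rough sketch of a cross-attention fusion block of the kind described above, in which one modality's tokens query another's; the feature dimension, head count, and residual/normalization layout are assumptions, and the actual RPEFlow module is more involved:

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Let tokens of one modality attend to tokens of another and fuse the result."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x: (B, N, D) e.g. image/point features; context: (B, M, D) e.g. event features
        fused, _ = self.attn(query=x, key=context, value=context)
        return self.norm(x + fused)        # residual keeps the querying modality's own features

fusion = CrossModalAttentionFusion()
out = fusion(torch.randn(2, 100, 128), torch.randn(2, 50, 128))
```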

Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding

  • paper_url: http://arxiv.org/abs/2309.15065
  • repo_url: None
  • paper_authors: Christina Kassab, Matias Mattamala, Lintong Zhang, Maurice Fallon
  • for: This paper introduces a real-time indoor Simultaneous Localization and Mapping (SLAM) system that can understand and interact with its surroundings.
  • methods: The system uses Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition, combining visual-inertial odometry with Contrastive Language-Image Pretraining (CLIP) features embedded in a topological SLAM graph.
  • results: The system successfully categorizes rooms with varying layouts and dimensions, outperforming the state of the art (SOTA), and achieves performance equivalent to the SOTA on place recognition and trajectory estimation. It also demonstrates potential for planning.
    Abstract Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, serving as a basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. Our proposed system is evaluated using both public, simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA). For place recognition and trajectory estimation tasks we achieve equivalent performance to the SOTA, all also utilizing the same pre-trained model. Lastly, we demonstrate the system's potential for planning.
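
The room-level reasoning rests on comparing CLIP image features (stored in the SLAM graph nodes) against text prompts for room categories. A hedged sketch of that comparison using the open-source `clip` package; the prompt wording and room labels are illustrative, and LEXIS integrates this within its SLAM graph rather than classifying standalone images:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

room_labels = ["kitchen", "office", "bedroom", "corridor", "bathroom"]
text_tokens = clip.tokenize([f"a photo of a {r}" for r in room_labels]).to(device)

@torch.no_grad()
def classify_room(image_path: str) -> str:
    """Assign an open-vocabulary room label to a keyframe image via CLIP similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    # cosine similarity between the keyframe feature and each room prompt
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)
    return room_labels[int(sims.argmax())]
```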

HPCR: Holistic Proxy-based Contrastive Replay for Online Continual Learning

  • paper_url: http://arxiv.org/abs/2309.15038
  • repo_url: https://github.com/felixhuiweilin/pcr
  • paper_authors: Huiwei Lin, Shanshan Feng, Baoquan Zhang, Xutao Li, Yew-soon Ong, Yunming Ye
  • for: This paper aims to address the catastrophic forgetting issue in online continual learning (OCL) by proposing a novel replay-based method called proxy-based contrastive replay (PCR) and a more advanced method named holistic proxy-based contrastive replay (HPCR).
  • methods: The methods use a contrastive loss with anchor-to-proxy pairs instead of anchor-to-sample pairs to alleviate forgetting; HPCR consists of three components: a contrastive component, a temperature component, and a distillation component.
  • results: The proposed methods are evaluated on four datasets and consistently demonstrate superior performance over various state-of-the-art methods.
    Abstract Online continual learning (OCL) aims to continuously learn new data from a single pass over the online data stream. It generally suffers from the catastrophic forgetting issue. Existing replay-based methods effectively alleviate this issue by replaying part of old data in a proxy-based or contrastive-based replay manner. In this paper, we conduct a comprehensive analysis of these two replay manners and find they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR), which replaces anchor-to-sample pairs with anchor-to-proxy pairs in the contrastive-based loss to alleviate the phenomenon of forgetting. Based on PCR, we further develop a more advanced method named holistic proxy-based contrastive replay (HPCR), which consists of three components. The contrastive component conditionally incorporates anchor-to-sample pairs to PCR, learning more fine-grained semantic information with a large training batch. The second is a temperature component that decouples the temperature coefficient into two parts based on their impacts on the gradient and sets different values for them to learn more novel knowledge. The third is a distillation component that constrains the learning process to keep more historical knowledge. Experiments on four datasets consistently demonstrate the superiority of HPCR over various state-of-the-art methods.
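
The key idea of replacing anchor-to-sample pairs with anchor-to-proxy pairs amounts to contrasting each feature against learnable class proxies. A minimal sketch of such an anchor-to-proxy contrastive loss; the temperature and the proxy parameterization are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxyContrastiveLoss(nn.Module):
    """Anchor-to-proxy contrastive loss: each feature is pulled toward its class proxy
    and pushed away from the proxies of the other classes seen so far."""

    def __init__(self, feat_dim: int, num_classes: int, temperature: float = 0.09):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.temperature = temperature

    def forward(self, features, labels):
        # features: (B, D) anchor embeddings, labels: (B,) integer class ids
        features = F.normalize(features, dim=1)
        proxies = F.normalize(self.proxies, dim=1)
        logits = features @ proxies.T / self.temperature      # (B, num_classes)
        return F.cross_entropy(logits, labels)
```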

Nuclear Morphometry using a Deep Learning-based Algorithm has Prognostic Relevance for Canine Cutaneous Mast Cell Tumors

  • paper_url: http://arxiv.org/abs/2309.15031
  • repo_url: None
  • paper_authors: Andreas Haghofer, Eda Parlak, Alexander Bartel, Taryn A. Donovan, Charles-Antoine Assenmacher, Pompei Bolfa, Michael J. Dark, Andrea Fuchs-Baumgartinger, Andrea Klang, Kathrin Jäger, Robert Klopfleisch, Sophie Merz, Barbara Richter, F. Yvonne Schulman, Jonathan Ganz, Josef Scharinger, Marc Aubreville, Stephan M. Winkler, Matti Kiupel, Christof A. Bertram
  • for: To evaluate the prognostic value of fully automated nuclear morphometry in canine cutaneous mast cell tumors.
  • methods: A deep learning-based algorithm automatically measures nuclear size and shape; the results are compared with manual nuclear morphometry, pathologists' karyomegaly estimates, and the mitotic count.
  • results: Algorithmic morphometry had the highest prognostic value: in ROC analysis, the SD of nuclear area reached an AUC of 0.943 (95% CI: 0.889-0.996), above both manual morphometry and the mitotic count, while avoiding the highly variable sensitivity/specificity seen across individual pathologists' estimates.
    Abstract Variation in nuclear size and shape is an important criterion of malignancy for many tumor types; however, categorical estimates by pathologists have poor reproducibility. Measurements of nuclear characteristics (morphometry) can improve reproducibility, but manual methods are time consuming. In this study, we evaluated fully automated morphometry using a deep learning-based algorithm in 96 canine cutaneous mast cell tumors with information on patient survival. Algorithmic morphometry was compared with karyomegaly estimates by 11 pathologists, manual nuclear morphometry of 12 cells by 9 pathologists, and the mitotic count as a benchmark. The prognostic value of automated morphometry was high with an area under the ROC curve regarding the tumor-specific survival of 0.943 (95% CI: 0.889 - 0.996) for the standard deviation (SD) of nuclear area, which was higher than manual morphometry of all pathologists combined (0.868, 95% CI: 0.737 - 0.991) and the mitotic count (0.885, 95% CI: 0.765 - 1.00). At the proposed thresholds, the hazard ratio for algorithmic morphometry (SD of nuclear area $\geq 9.0 \mu m^2$) was 18.3 (95% CI: 5.0 - 67.1), for manual morphometry (SD of nuclear area $\geq 10.9 \mu m^2$) 9.0 (95% CI: 6.0 - 13.4), for karyomegaly estimates 7.6 (95% CI: 5.7 - 10.1), and for the mitotic count 30.5 (95% CI: 7.8 - 118.0). Inter-rater reproducibility for karyomegaly estimates was fair ($\kappa$ = 0.226) with highly variable sensitivity/specificity values for the individual pathologists. Reproducibility for manual morphometry (SD of nuclear area) was good (ICC = 0.654). This study supports the use of algorithmic morphometry as a prognostic test to overcome the limitations of estimates and manual measurements.
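
A small sketch of how the headline statistic (SD of nuclear area) and the proposed 9.0 µm² cutoff could be computed from per-nucleus segmentation output; the pixel spacing and variable names are assumptions, not part of the paper.

```python
# Sketch: compute the SD of nuclear area from per-nucleus pixel counts and apply
# the reported 9.0 um^2 cutoff. Pixel spacing (um/px) is an assumed input.
import numpy as np

def nuclear_area_sd(nucleus_pixel_counts, um_per_pixel):
    areas_um2 = np.asarray(nucleus_pixel_counts, dtype=float) * um_per_pixel ** 2
    return areas_um2.std(ddof=1)          # sample SD of nuclear area

sd = nuclear_area_sd([612, 580, 790, 655, 720], um_per_pixel=0.25)
high_risk = sd >= 9.0                     # threshold reported for algorithmic morphometry
print(f"SD of nuclear area: {sd:.2f} um^2, high-risk: {high_risk}")
```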

IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging

  • paper_url: http://arxiv.org/abs/2309.15019
  • repo_url: None
  • paper_authors: Hailing Wang, Wei Li, Yuanyuan Xi, Jie Hu, Hanting Chen, Longyu Li, Yunhe Wang
  • for: Ghost-free multi-frame HDR reconstruction with photo-realistic detail from content-complementary but spatially misaligned LDR images.
  • methods: An Image Fusion Transformer (IFT) comprising a Fast Global Patch Searching (FGPS) module and a Self-Cross Fusion (SCF) module. FGPS searches the supporting frames for the patches most closely related to each reference-frame patch to model long-range dependencies; SCF then performs intra-frame and inter-frame feature fusion on these patches with linear complexity in the input resolution.
  • results: Extensive experiments on multiple benchmarks show state-of-the-art performance.
    Abstract Multi-frame high dynamic range (HDR) imaging aims to reconstruct ghost-free images with photo-realistic details from content-complementary but spatially misaligned low dynamic range (LDR) images. Existing HDR algorithms are prone to producing ghosting artifacts as their methods fail to capture long-range dependencies between LDR frames with large motion in dynamic scenes. To address this issue, we propose a novel image fusion transformer, referred to as IFT, which presents a fast global patch searching (FGPS) module followed by a self-cross fusion module (SCF) for ghost-free HDR imaging. The FGPS searches the patches from supporting frames that have the closest dependency to each patch of the reference frame for long-range dependency modeling, while the SCF conducts intra-frame and inter-frame feature fusion on the patches obtained by the FGPS with linear complexity to input resolution. By matching similar patches between frames, objects with large motion ranges in dynamic scenes can be aligned, which can effectively alleviate the generation of artifacts. In addition, the proposed FGPS and SCF can be integrated into various deep HDR methods as efficient plug-in modules. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance both quantitatively and qualitatively.
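
The following is a rough sketch of global patch matching between a reference and a supporting frame using unfold plus cosine similarity; it illustrates the matching idea behind FGPS, not the paper's actual module.

```python
# Rough sketch of global patch search: for every patch of the reference frame,
# find the most similar patch in a supporting frame by cosine similarity.
import torch
import torch.nn.functional as F

def match_patches(ref, sup, patch=8):
    """ref, sup: (1, C, H, W) feature maps; returns the index of the best
    supporting-frame patch for each reference-frame patch."""
    ref_p = F.unfold(ref, kernel_size=patch, stride=patch)   # (1, C*p*p, Nr)
    sup_p = F.unfold(sup, kernel_size=patch, stride=patch)   # (1, C*p*p, Ns)
    ref_p = F.normalize(ref_p.squeeze(0).t(), dim=1)         # (Nr, C*p*p)
    sup_p = F.normalize(sup_p.squeeze(0).t(), dim=1)         # (Ns, C*p*p)
    sim = ref_p @ sup_p.t()                                  # (Nr, Ns)
    return sim.argmax(dim=1)                                 # best match per reference patch

idx = match_patches(torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64))
```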

Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features

  • paper_url: http://arxiv.org/abs/2309.14999
  • repo_url: https://github.com/zeniSoida/pl1
  • paper_authors: Hila Levi, Guy Heller, Dan Levi, Ethan Fetaya
  • for: Efficient open-vocabulary, object-centric image retrieval, i.e., retrieving images that contain a specified object of interest.
  • methods: Dense CLIP embeddings are extracted per image and aggregated into a compact, object-aware representation used for retrieval.
  • results: Outperforms global-feature baselines by up to 15 mAP points on three datasets while remaining scalable and interpretable.
    Abstract The task of open-vocabulary object-centric image retrieval involves the retrieval of images containing a specified object of interest, delineated by an open-set text query. As working on large image datasets becomes standard, solving this task efficiently has gained significant practical importance. Applications include targeted performance analysis of retrieved images using ad-hoc queries and hard example mining during training. Recent advancements in contrastive-based open vocabulary systems have yielded remarkable breakthroughs, facilitating large-scale open vocabulary image retrieval. However, these approaches use a single global embedding per image, thereby constraining the system's ability to retrieve images containing relatively small object instances. Alternatively, incorporating local embeddings from detection pipelines faces scalability challenges, making it unsuitable for retrieval from large databases. In this work, we present a simple yet effective approach to object-centric open-vocabulary image retrieval. Our approach aggregates dense embeddings extracted from CLIP into a compact representation, essentially combining the scalability of image retrieval pipelines with the object identification capabilities of dense detection methods. We show the effectiveness of our scheme to the task by achieving significantly better results than global feature approaches on three datasets, increasing accuracy by up to 15 mAP points. We further integrate our scheme into a large scale retrieval framework and demonstrate our method's advantages in terms of scalability and interpretability.
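
Below is a minimal sketch of the overall recipe: dense patch embeddings are aggregated into a handful of prototype vectors per image, and a gallery is ranked by the best prototype-to-query similarity. The k-means aggregation and all names are assumptions standing in for the paper's aggregation scheme.

```python
# Minimal sketch: aggregate dense patch embeddings into a few prototype vectors
# per image (k-means is an assumed choice), then rank images by the best match
# between the text-query embedding and any prototype.
import numpy as np
from sklearn.cluster import KMeans

def aggregate(patch_embs, k=8):
    """patch_embs: (P, D) dense embeddings of one image -> (k, D) prototypes."""
    k = min(k, len(patch_embs))
    protos = KMeans(n_clusters=k, n_init=4).fit(patch_embs).cluster_centers_
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def rank_gallery(text_emb, gallery_protos):
    """Score each image by its best prototype-to-query cosine similarity."""
    q = text_emb / np.linalg.norm(text_emb)
    scores = [float((protos @ q).max()) for protos in gallery_protos]
    return np.argsort(scores)[::-1]       # indices of the most relevant images first
```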

An Ensemble Model for Distorted Images in Real Scenarios

  • paper_url: http://arxiv.org/abs/2309.14998
  • repo_url: None
  • paper_authors: Boyuan Ji, Jianchang Huang, Wenzhuo Huang, Shuke He
  • for: Image acquisition conditions and environments strongly affect high-level computer vision tasks, and most algorithms trained on distortion-free data degrade in the wild; hardware upgrades and deep learning alone cannot cope with variable real-world conditions.
  • methods: The YOLOv7 detector is applied to distorted images from the CDCOCO dataset, together with carefully designed optimizations: data augmentation, detection-box ensembling, a denoiser ensemble, super-resolution models, and transfer learning.
  • results: The model performs excellently on the CDCOCO test set; the denoising detection model can denoise and repair distorted images, making it useful in a variety of real-world scenarios and environments.
    Abstract Image acquisition conditions and environments can significantly affect high-level tasks in computer vision, and the performance of most computer vision algorithms will be limited when trained on distortion-free datasets. Even with updates in hardware such as sensors and deep learning methods, it will still not work in the face of variable conditions in real-world applications. In this paper, we apply the object detector YOLOv7 to detect distorted images from the dataset CDCOCO. Through carefully designed optimizations including data enhancement, detection box ensemble, denoiser ensemble, super-resolution models, and transfer learning, our model achieves excellent performance on the CDCOCO test set. Our denoising detection model can denoise and repair distorted images, making the model useful in a variety of real-world scenarios and environments.

IAIFNet: An Illumination-Aware Infrared and Visible Image Fusion Network

  • paper_url: http://arxiv.org/abs/2309.14997
  • repo_url: None
  • paper_authors: Qiao Yang, Yu Zhang, Jian Zhang, Zijing Zhao, Shunli Zhang, Jinqiao Wang, Junzhe Chen
  • for: Improving the quality of infrared and visible image fusion under low-light conditions.
  • methods: An illumination enhancement network combined with an adaptive differential fusion module (ADFM) and a salient target aware module (STAM).
  • results: Experiments show the method outperforms five state-of-the-art fusion approaches in low-light scenes.
    Abstract Infrared and visible image fusion (IVIF) is used to generate fusion images with comprehensive features of both images, which is beneficial for downstream vision tasks. However, current methods rarely consider the illumination condition in low-light environments, and the targets in the fused images are often not prominent. To address the above issues, we propose an Illumination-Aware Infrared and Visible Image Fusion Network, named as IAIFNet. In our framework, an illumination enhancement network first estimates the incident illumination maps of input images. Afterwards, with the help of proposed adaptive differential fusion module (ADFM) and salient target aware module (STAM), an image fusion network effectively integrates the salient features of the illumination-enhanced infrared and visible images into a fusion image of high visual quality. Extensive experimental results verify that our method outperforms five state-of-the-art methods of fusing infrared and visible images.

Robust Sequential DeepFake Detection

  • paper_url: http://arxiv.org/abs/2309.14991
  • repo_url: https://github.com/rshaojimmy/seqdeepfake
  • paper_authors: Rui Shao, Tianxing Wu, Ziwei Liu
  • for: This paper studies a new DeepFake threat, multi-step facial manipulation (Sequential DeepFake), and formulates a new research problem: detecting sequential DeepFake manipulation (Seq-DeepFake).
  • methods: A new Seq-DeepFake dataset is constructed and the detection problem is cast as an image-to-sequence task. The authors propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer) and a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) to better detect Seq-DeepFake manipulation.
  • results: Both SeqFakeFormer and SeqFakeFormer++ perform strongly on the Seq-DeepFake dataset and on the more challenging perturbed variant (Seq-DeepFake-P), demonstrating their effectiveness at detecting sequential manipulations.
    Abstract Since photorealistic faces can be readily generated by facial manipulation technologies nowadays, potential malicious abuse of these technologies has drawn great concerns. Numerous deepfake detection methods are thus proposed. However, existing methods only focus on detecting one-step facial manipulation. As the emergence of easy-accessible facial editing applications, people can easily manipulate facial components using multi-step operations in a sequential manner. This new threat requires us to detect a sequence of facial manipulations, which is vital for both detecting deepfake media and recovering original faces afterwards. Motivated by this observation, we emphasize the need and propose a novel research problem called Detecting Sequential DeepFake Manipulation (Seq-DeepFake). Unlike the existing deepfake detection task only demanding a binary label prediction, detecting Seq-DeepFake manipulation requires correctly predicting a sequential vector of facial manipulation operations. To support a large-scale investigation, we construct the first Seq-DeepFake dataset, where face images are manipulated sequentially with corresponding annotations of sequential facial manipulation vectors. Based on this new dataset, we cast detecting Seq-DeepFake manipulation as a specific image-to-sequence task and propose a concise yet effective Seq-DeepFake Transformer (SeqFakeFormer). To better reflect real-world deepfake data distributions, we further apply various perturbations on the original Seq-DeepFake dataset and construct the more challenging Sequential DeepFake dataset with perturbations (Seq-DeepFake-P). To exploit deeper correlation between images and sequences when facing Seq-DeepFake-P, a dedicated Seq-DeepFake Transformer with Image-Sequence Reasoning (SeqFakeFormer++) is devised, which builds stronger correspondence between image-sequence pairs for more robust Seq-DeepFake detection.

MoCaE: Mixture of Calibrated Experts Significantly Improves Object Detection

  • paper_url: http://arxiv.org/abs/2309.14976
  • repo_url: None
  • paper_authors: Kemal Oksuz, Selim Kuzucu, Tom Joy, Puneet K. Dokania
  • for: Improving the accuracy and robustness of object detection.
  • methods: A Mixture of Calibrated Experts: each individual detector is first calibrated against a target calibration function, and the predictions from the different detectors are then filtered and refined before being combined.
  • results: Improves object detection on COCO test-dev by 2.4 AP, instance segmentation on the long-tailed LVIS dataset by 2.3 AP, and rotated object detection on DOTA to 82.62 AP50, all new state-of-the-art results.
    Abstract We propose an extremely simple and highly effective approach to faithfully combine different object detectors to obtain a Mixture of Experts (MoE) that has a superior accuracy to the individual experts in the mixture. We find that naively combining these experts in a similar way to the well-known Deep Ensembles (DEs), does not result in an effective MoE. We identify the incompatibility between the confidence score distribution of different detectors to be the primary reason for such failure cases. Therefore, to construct the MoE, our proposal is to first calibrate each individual detector against a target calibration function. Then, filter and refine all the predictions from different detectors in the mixture. We term this approach as MoCaE and demonstrate its effectiveness through extensive experiments on object detection, instance segmentation and rotated object detection tasks. Specifically, MoCaE improves (i) three strong object detectors on COCO test-dev by $2.4$ $\mathrm{AP}$ by reaching $59.0$ $\mathrm{AP}$; (ii) instance segmentation methods on the challenging long-tailed LVIS dataset by $2.3$ $\mathrm{AP}$; and (iii) all existing rotated object detectors by reaching $82.62$ $\mathrm{AP_{50}}$ on DOTA dataset, establishing a new state-of-the-art (SOTA). Code will be made public.
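
As a hedged sketch of the overall recipe, the snippet below passes each detector's raw confidences through its own pre-fitted calibration function before pooling and de-duplicating the detections with NMS; the calibrator callables stand in for the paper's target calibration function and are not its implementation.

```python
# Sketch of mixing calibrated experts: each detector's raw scores go through its
# own calibration function, then the pooled detections are merged with NMS.
import torch
from torchvision.ops import nms

def mix_calibrated_experts(detections, calibrators, iou_thr=0.65):
    """detections: list of (boxes (N,4), scores (N,)) tuples, one per detector;
    calibrators: list of callables mapping raw scores -> calibrated scores."""
    boxes = torch.cat([b for b, _ in detections])
    scores = torch.cat([cal(s) for (_, s), cal in zip(detections, calibrators)])
    keep = nms(boxes, scores, iou_thr)     # merged, calibrated mixture of experts
    return boxes[keep], scores[keep]
```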

A novel approach for holographic 3D content generation without depth map

  • paper_url: http://arxiv.org/abs/2309.14967
  • repo_url: None
  • paper_authors: Hakdong Kim, Minkyu Jee, Yurim Lee, Kyudam Choi, MinSung Yoon, Cheongwon Kim
  • for: Generating computer-generated holograms (CGHs) with the FFT algorithm when depth maps are unavailable.
  • methods: A deep learning method that estimates a depth map from the input RGB image alone and then generates the CGH sequentially.
  • results: The volumetric holograms produced by the proposed model are more accurate than those of competing models when only RGB colour data is available.
    Abstract In preparation for observing holographic 3D content, acquiring a set of RGB color and depth map images per scene is necessary to generate computer-generated holograms (CGHs) when using the fast Fourier transform (FFT) algorithm. However, in real-world situations, these paired formats of RGB color and depth map images are not always fully available. We propose a deep learning-based method to synthesize the volumetric digital holograms using only the given RGB image, so that we can overcome environments where RGB color and depth map images are partially provided. The proposed method uses only the input of RGB image to estimate its depth map and then generate its CGH sequentially. Through experiments, we demonstrate that the volumetric hologram generated through our proposed model is more accurate than that of competitive models, under the situation that only RGB color data can be provided.

GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

  • paper_url: http://arxiv.org/abs/2309.14962
  • repo_url: None
  • paper_authors: Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang
  • for: Recognising the structure of unconstrained tables.
  • methods: GridFormer, a DETR-style recognizer that interprets table structure by predicting the vertices and edges of an MxN grid.
  • results: Competitive performance against other methods on five challenging benchmarks.
    Abstract All tables can be represented as grids. Based on this observation, we propose GridFormer, a novel approach for interpreting unconstrained table structures by predicting the vertex and edge of a grid. First, we propose a flexible table representation in the form of an MXN grid. In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table. Then, we introduce a DETR-style table structure recognizer to efficiently predict this multi-objective information of the grid in a single shot. Specifically, given a set of learned row and column queries, the recognizer directly outputs the vertexes and edges information of the corresponding rows and columns. Extensive experiments on five challenging benchmarks which include wired, wireless, multi-merge-cell, oriented, and distorted tables demonstrate the competitive performance of our model over other methods.

Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization

  • paper_url: http://arxiv.org/abs/2309.14949
  • repo_url: https://github.com/gorilla-lab-scut/tribe
  • paper_authors: Yongyi Su, Xun Xu, Kui Jia
  • for: Real-world test-time adaptation, i.e., adapting a source-domain model at inference time to unseen corruptions.
  • methods: The existing real-world TTA protocol is complemented with a globally class-imbalanced testing set, under which existing methods are shown to fail. A balanced batch-normalization layer replaces regular batch norm at inference so that adaptation does not become biased towards majority classes. Inspired by the success of self-training (ST), ST is adapted for test-time adaptation; because ST alone tends to over-adapt under continual domain shift, model updates are regularized with an anchored loss.
  • results: The resulting tri-net model with balanced batch-norm layers, TRIBE, achieves state-of-the-art results across four real-world TTA settings and multiple evaluation protocols.
    Abstract Test-Time Adaptation aims to adapt source domain model to testing data at inference stage with success demonstrated in adapting to unseen corruptions. However, these attempts may fail under more challenging real-world scenarios. Existing works mainly consider real-world test-time adaptation under non-i.i.d. data stream and continual domain shift. In this work, we first complement the existing real-world TTA protocol with a globally class imbalanced testing set. We demonstrate that combining all settings together poses new challenges to existing methods. We argue the failure of state-of-the-art methods is first caused by indiscriminately adapting normalization layers to imbalanced testing data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap out the regular batchnorm at inference stage. The new batchnorm layer is capable of adapting without biasing towards majority classes. We are further inspired by the success of self-training~(ST) in learning from unlabeled data and adapt ST for test-time adaptation. However, ST alone is prone to over adaption which is responsible for the poor performance under continual domain shift. Hence, we propose to improve self-training under continual domain shift by regularizing model updates with an anchored loss. The final TTA model, termed as TRIBE, is built upon a tri-net architecture with balanced batchnorm layers. We evaluate TRIBE on four datasets representing real-world TTA settings. TRIBE consistently achieves the state-of-the-art performance across multiple evaluation protocols. The code is available at \url{https://github.com/Gorilla-Lab-SCUT/TRIBE}.
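
The snippet below is a very rough sketch of the class-balanced normalization idea: per-class statistics are computed from pseudo-labels and averaged with equal weight, so majority classes in an imbalanced test stream do not dominate. It is a simplification for illustration, not the paper's Balanced BatchNorm layer.

```python
# Rough sketch of class-balanced normalization statistics. Assumes at least one
# class is observed in the batch; this simplifies the paper's Balanced BatchNorm.
import torch

def balanced_normalize(feats, pseudo_labels, num_classes, eps=1e-5):
    """feats: (N, C) features, pseudo_labels: (N,) -> features normalized with
    statistics averaged equally over the classes observed in the batch."""
    means, vars_ = [], []
    for c in range(num_classes):
        fc = feats[pseudo_labels == c]
        if len(fc) > 0:                        # skip classes absent from the batch
            means.append(fc.mean(0))
            vars_.append(fc.var(0, unbiased=False))
    mean = torch.stack(means).mean(0)          # equal weight per observed class
    var = torch.stack(vars_).mean(0)
    return (feats - mean) / torch.sqrt(var + eps)
```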

FEC: Three Finetuning-free Methods to Enhance Consistency for Real Image Editing

  • paper_url: http://arxiv.org/abs/2309.14934
  • repo_url: None
  • paper_authors: Songyan Chen, Jiancheng Huang
  • for: Accurate real-image editing, addressing the failure of existing methods to faithfully reconstruct the source image.
  • methods: Three finetuning-free sampling methods, each suited to different editing types and settings, that ensure successful reconstruction and improve editing quality when paired with existing editing methods.
  • results: The methods preserve the texture and features of the original image while reducing memory and computation, since no diffusion-model finetuning or large-scale training is required.
    Abstract Text-conditional image editing is a very useful task that has recently emerged with immeasurable potential. Most current real image editing methods first need to complete the reconstruction of the image, and then editing is carried out by various methods based on the reconstruction. Most methods use DDIM Inversion for reconstruction, however, DDIM Inversion often fails to guarantee reconstruction performance, i.e., it fails to produce results that preserve the original image content. To address the problem of reconstruction failure, we propose FEC, which consists of three sampling methods, each designed for different editing types and settings. Our three methods of FEC achieve two important goals in image editing task: 1) ensuring successful reconstruction, i.e., sampling to get a generated result that preserves the texture and features of the original real image. 2) these sampling methods can be paired with many editing methods and greatly improve the performance of these editing methods to accomplish various editing tasks. In addition, none of our sampling methods require fine-tuning of the diffusion model or time-consuming training on large-scale datasets. Hence the cost of time as well as the use of computer memory and computation can be significantly reduced.

Addressing Data Misalignment in Image-LiDAR Fusion on Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2309.14932
  • repo_url: None
  • paper_authors: Wei Jong Yang, Guan Cheng Lee
  • for: Improving autonomous-driving perception, in particular the fusion of camera and LiDAR data.
  • methods: An analysis of data misalignment in multi-sensor fusion models: projected LiDAR points often misalign with the corresponding image, and fusion models struggle to segment these misaligned points correctly.
  • results: A careful study on the nuScenes dataset and the SOTA fusion model 2DPASS, with possible solutions and directions for improvement.
    Abstract With the advent of advanced multi-sensor fusion models, there has been a notable enhancement in the performance of perception tasks within in terms of autonomous driving. Despite these advancements, the challenges persist, particularly in the fusion of data from cameras and LiDAR sensors. A critial concern is the accurate alignment of data from these disparate sensors. Our observations indicate that the projected positions of LiDAR points often misalign on the corresponding image. Furthermore, fusion models appear to struggle in accurately segmenting these misaligned points. In this paper, we would like to address this problem carefully, with a specific focus on the nuScenes dataset and the SOTA of fusion models 2DPASS, and providing the possible solutions or potential improvements.

Noise-Tolerant Unsupervised Adapter for Vision-Language Models

  • paper_url: http://arxiv.org/abs/2309.14928
  • repo_url: None
  • paper_authors: Eman Ali, Dayan Guan, Shijian Lu, Abdulmotaleb Elsaddik
  • for: Learning strong target classification models without target labels, improving the scalability of visual recognition systems.
  • methods: NtUA, a Noise-tolerant Unsupervised Adapter that learns from a few unlabelled target samples. It has two complementary designs: adaptive cache formation, which weights key-value pairs (visual features and pseudo-labels) by prediction confidence, and pseudo-label rectification, which corrects pseudo-labels and cache weights via knowledge distillation from large-scale vision-language models.
  • results: Experiments show superior performance consistently across multiple widely adopted benchmarks.
    Abstract Recent advances in large-scale vision-language models have achieved very impressive performance in various zero-shot image classification tasks. While prior studies have demonstrated significant improvements by introducing few-shot labelled target samples, they still require labelling of target samples, which greatly degrades their scalability while handling various visual recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that allows learning superior target models with few-shot unlabelled target samples. NtUA works as a key-value cache that formulates visual features and predicted pseudo-labels of the few-shot unlabelled target samples as key-value pairs. It consists of two complementary designs. The first is adaptive cache formation that combats pseudo-label noises by weighting the key-value pairs according to their prediction confidence. The second is pseudo-label rectification, which corrects both pair values (i.e., pseudo-labels) and cache weights by leveraging knowledge distillation from large-scale vision language models. Extensive experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
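
Below is a sketch of a confidence-weighted key-value cache classifier in the spirit of the adaptive cache formation described above; the weighting scheme, hyperparameters, and the rectification step are simplified assumptions rather than NtUA's implementation.

```python
# Sketch of a confidence-weighted key-value cache: keys are features of unlabelled
# target samples, values are one-hot pseudo-labels scaled by prediction confidence.
# Test logits mix the zero-shot (CLIP) logits with cache affinities.
import torch
import torch.nn.functional as F

def cache_logits(test_feat, keys, pseudo_labels, confidence, clip_logits,
                 num_classes, beta=5.0, alpha=1.0):
    """test_feat: (B, D); keys: (M, D); pseudo_labels: (M,); confidence: (M,)."""
    values = F.one_hot(pseudo_labels, num_classes).float() * confidence[:, None]
    affinity = F.normalize(test_feat, dim=1) @ F.normalize(keys, dim=1).t()  # (B, M)
    cache = torch.exp(-beta * (1.0 - affinity)) @ values                     # (B, num_classes)
    return clip_logits + alpha * cache
```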

PHRIT: Parametric Hand Representation with Implicit Template

  • paper_url: http://arxiv.org/abs/2309.14916
  • repo_url: None
  • paper_authors: Zhisheng Huang, Yujin Chen, Di Kang, Jinlu Zhang, Zhigang Tu
  • for: Parametric hand mesh modeling with an implicit template.
  • methods: Uses signed distance fields (SDFs) with part-based shape priors, deforming the canonical template using a deformation field.
  • results: Realistic and immersive hand modeling with state-of-the-art performance, demonstrated through multiple downstream tasks such as skeleton-driven hand reconstruction, shapes from point clouds, and single-view 3D reconstruction.
    Abstract We propose PHRIT, a novel approach for parametric hand mesh modeling with an implicit template that combines the advantages of both parametric meshes and implicit representations. Our method represents deformable hand shapes using signed distance fields (SDFs) with part-based shape priors, utilizing a deformation field to execute the deformation. The model offers efficient high-fidelity hand reconstruction by deforming the canonical template at infinite resolution. Additionally, it is fully differentiable and can be easily used in hand modeling since it can be driven by the skeleton and shape latent codes. We evaluate PHRIT on multiple downstream tasks, including skeleton-driven hand reconstruction, shapes from point clouds, and single-view 3D reconstruction, demonstrating that our approach achieves realistic and immersive hand modeling with state-of-the-art performance.

Face Cartoonisation For Various Poses Using StyleGAN

  • paper_url: http://arxiv.org/abs/2309.14908
  • repo_url: None
  • paper_authors: Kushal Jain, Ankith Varun J, Anoop Namboodiri
  • for: A new approach to face cartoonisation that preserves the original identity and accommodates various poses; unlike previous conditional-GAN methods, it leverages the expressive latent space of StyleGAN.
  • methods: An encoder captures both pose and identity information from the image and produces an embedding in the StyleGAN latent space, which is then passed through a pre-trained generator to obtain the cartoonised output.
  • results: Extensive experiments show the encoder adapts the StyleGAN output to better preserve identity across poses; unlike other StyleGAN-based approaches, no dedicated, fine-tuned StyleGAN model is required.
    Abstract This paper presents an innovative approach to achieve face cartoonisation while preserving the original identity and accommodating various poses. Unlike previous methods in this field that relied on conditional-GANs, which posed challenges related to dataset requirements and pose training, our approach leverages the expressive latent space of StyleGAN. We achieve this by introducing an encoder that captures both pose and identity information from images and generates a corresponding embedding within the StyleGAN latent space. By subsequently passing this embedding through a pre-trained generator, we obtain the desired cartoonised output. While many other approaches based on StyleGAN necessitate a dedicated and fine-tuned StyleGAN model, our method stands out by utilizing an already-trained StyleGAN designed to produce realistic facial images. We show by extensive experimentation how our encoder adapts the StyleGAN output to better preserve identity when the objective is cartoonisation.

Pre-training-free Image Manipulation Localization through Non-Mutually Exclusive Contrastive Learning

  • paper_url: http://arxiv.org/abs/2309.14900
  • repo_url: https://github.com/knightzjz/ncl-iml
  • paper_authors: Jizhe Zhou, Xiaochen Ma, Xia Du, Ahmed Y. Alhammadi, Wentao Feng
  • for: To address the data-insufficiency problem in deep Image Manipulation Localization (IML) models by proposing a Non-mutually exclusive Contrastive Learning (NCL) framework.
  • methods: A pivot structure with dual branches constantly switches the role of contour patches between positives and negatives during training, together with a pivot-consistent loss to avoid spatial corruption.
  • results: NCL achieves state-of-the-art performance on all five benchmarks without any pre-training and is more robust on unseen real-life samples.
    Abstract Deep Image Manipulation Localization (IML) models suffer from training data insufficiency and thus heavily rely on pre-training. We argue that contrastive learning is more suitable to tackle the data insufficiency problem for IML. Crafting mutually exclusive positives and negatives is the prerequisite for contrastive learning. However, when adopting contrastive learning in IML, we encounter three categories of image patches: tampered, authentic, and contour patches. Tampered and authentic patches are naturally mutually exclusive, but contour patches containing both tampered and authentic pixels are non-mutually exclusive to them. Simply abnegating these contour patches results in a drastic performance loss since contour patches are decisive to the learning outcomes. Hence, we propose the Non-mutually exclusive Contrastive Learning (NCL) framework to rescue conventional contrastive learning from the above dilemma. In NCL, to cope with the non-mutually exclusivity, we first establish a pivot structure with dual branches to constantly switch the role of contour patches between positives and negatives while training. Then, we devise a pivot-consistent loss to avoid spatial corruption caused by the role-switching process. In this manner, NCL both inherits the self-supervised merits to address the data insufficiency and retains a high manipulation localization accuracy. Extensive experiments verify that our NCL achieves state-of-the-art performance on all five benchmarks without any pre-training and is more robust on unseen real-life samples. The code is available at: https://github.com/Knightzjz/NCL-IML.

FDLS: A Deep Learning Approach to Production Quality, Controllable, and Retargetable Facial Performances

  • paper_url: http://arxiv.org/abs/2309.14897
  • repo_url: None
  • paper_authors: Wan-Duo Kurt Ma, Muhammad Ghifary, J. P. Lewis, Byungkuk Choi, Haekwang Eom
  • for: Solving facial performances for film visual effects, including creating realistic digital humans and retargeting actors' performances to humanoid characters such as aliens and monsters.
  • methods: The Facial Deep Learning Solver (FDLS) adopts a coarse-to-fine, human-in-the-loop strategy so that a solved performance can be verified and edited at several stages of the solving process.
  • results: FDLS delivers production-quality animated performances with little or no manual effort in many cases, handles small day-to-day changes in the actor's face, and has been used on major films.
    Abstract Visual effects commonly requires both the creation of realistic synthetic humans as well as retargeting actors' performances to humanoid characters such as aliens and monsters. Achieving the expressive performances demanded in entertainment requires manipulating complex models with hundreds of parameters. Full creative control requires the freedom to make edits at any stage of the production, which prohibits the use of a fully automatic ``black box'' solution with uninterpretable parameters. On the other hand, producing realistic animation with these sophisticated models is difficult and laborious. This paper describes FDLS (Facial Deep Learning Solver), which is Weta Digital's solution to these challenges. FDLS adopts a coarse-to-fine and human-in-the-loop strategy, allowing a solved performance to be verified and edited at several stages in the solving process. To train FDLS, we first transform the raw motion-captured data into robust graph features. Secondly, based on the observation that the artists typically finalize the jaw pass animation before proceeding to finer detail, we solve for the jaw motion first and predict fine expressions with region-based networks conditioned on the jaw position. Finally, artists can optionally invoke a non-linear finetuning process on top of the FDLS solution to follow the motion-captured virtual markers as closely as possible. FDLS supports editing if needed to improve the results of the deep learning solution and it can handle small daily changes in the actor's face shape. FDLS permits reliable and production-quality performance solving with minimal training and little or no manual effort in many cases, while also allowing the solve to be guided and edited in unusual and difficult cases. The system has been under development for several years and has been used in major movies.

Nearest Neighbor Guidance for Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2309.14888
  • repo_url: https://github.com/jingkang50/openood
  • paper_authors: Jaewoo Park, Yoon Gyo Jung, Andrew Beng Jin Teoh
  • for: Improving classifier-based out-of-distribution (OOD) detection for machine learning models deployed in open-world environments.
  • methods: Nearest Neighbor Guidance (NNGuide) guides the classifier-based score to respect the boundary geometry of the data manifold, reducing overconfidence on OOD samples while preserving the fine-grained capability of the classifier-based score.
  • results: On ImageNet OOD detection benchmarks, including a setting with natural distribution shift of the ID data, NNGuide substantially improves base detection scores and achieves state-of-the-art AUROC, FPR95, and AUPR.
    Abstract Detecting out-of-distribution (OOD) samples are crucial for machine learning models deployed in open-world environments. Classifier-based scores are a standard approach for OOD detection due to their fine-grained detection capability. However, these scores often suffer from overconfidence issues, misclassifying OOD samples distant from the in-distribution region. To address this challenge, we propose a method called Nearest Neighbor Guidance (NNGuide) that guides the classifier-based score to respect the boundary geometry of the data manifold. NNGuide reduces the overconfidence of OOD samples while preserving the fine-grained capability of the classifier-based score. We conduct extensive experiments on ImageNet OOD detection benchmarks under diverse settings, including a scenario where the ID data undergoes natural distribution shift. Our results demonstrate that NNGuide provides a significant performance improvement on the base detection scores, achieving state-of-the-art results on both AUROC, FPR95, and AUPR metrics. The code is given at \url{https://github.com/roomo7time/nnguide}.
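
A minimal sketch of the guidance idea: the classifier-based confidence is scaled by the sample's similarity to its nearest training features, so samples far from the data manifold cannot remain overconfident. The value of k and the exact combination rule are assumptions.

```python
# Sketch of nearest-neighbour guidance for OOD scoring: scale the classifier-based
# confidence by the mean cosine similarity to the k nearest ID training features.
import torch
import torch.nn.functional as F

def nn_guided_score(feat, base_score, train_feats, k=10):
    """feat: (B, D) test features; base_score: (B,) classifier confidence;
    train_feats: (N, D) in-distribution training features."""
    sim = F.normalize(feat, dim=1) @ F.normalize(train_feats, dim=1).t()  # (B, N)
    guidance = sim.topk(k, dim=1).values.mean(dim=1)                      # (B,)
    return base_score * guidance          # higher score = more in-distribution
```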

Locality-preserving Directions for Interpreting the Latent Space of Satellite Image GANs

  • paper_url: http://arxiv.org/abs/2309.14883
  • repo_url: None
  • paper_authors: Georgia Kourmouli, Nikos Kostagiolas, Yannis Panagakis, Mihalis A. Nicolaou
  • for: Interpreting the latent space of wavelet-based GANs so as to capture the large spatial and spectral variability characteristic of satellite imagery.
  • methods: A locality-preserving method decomposes the weight space of pre-trained GANs and recovers interpretable directions corresponding to high-level semantic concepts (urbanization, structure density, flora presence) that can be used for guided synthesis.
  • results: Compared with global, PCA-based directions, the locality-preserving directions are more robust to artifacts, better preserve class information, and outperform baseline geometric augmentations for data augmentation in satellite scene classification.
    Abstract We present a locality-aware method for interpreting the latent space of wavelet-based Generative Adversarial Networks (GANs), that can well capture the large spatial and spectral variability that is characteristic to satellite imagery. By focusing on preserving locality, the proposed method is able to decompose the weight-space of pre-trained GANs and recover interpretable directions that correspond to high-level semantic concepts (such as urbanization, structure density, flora presence) - that can subsequently be used for guided synthesis of satellite imagery. In contrast to typically used approaches that focus on capturing the variability of the weight-space in a reduced dimensionality space (i.e., based on Principal Component Analysis, PCA), we show that preserving locality leads to vectors with different angles, that are more robust to artifacts and can better preserve class information. Via a set of quantitative and qualitative examples, we further show that the proposed approach can outperform both baseline geometric augmentations, as well as global, PCA-based approaches for data synthesis in the context of data augmentation for satellite scene classification.

ITEM3D: Illumination-Aware Directional Texture Editing for 3D Models

  • paper_url: http://arxiv.org/abs/2309.14872
  • repo_url: None
  • paper_authors: Shengqi Liu, Zhuo Chen, Jingnan Gao, Yichao Yan, Wenhan Zhu, Xiaobo Li, Ke Gao, Jiangjing Lyu, Xiaokang Yang
  • for: automatic 3D object editing according to text prompts
  • methods: diffusion models, differentiable rendering, noise difference optimization objective
  • results: outperforms state-of-the-art methods, explicit control over lighting
    Abstract Texture editing is a crucial task in 3D modeling that allows users to automatically manipulate the surface materials of 3D models. However, the inherent complexity of 3D models and the ambiguous text description lead to the challenge in this task. To address this challenge, we propose ITEM3D, an illumination-aware model for automatic 3D object editing according to the text prompts. Leveraging the diffusion models and the differentiable rendering, ITEM3D takes the rendered images as the bridge of text and 3D representation, and further optimizes the disentangled texture and environment map. Previous methods adopt the absolute editing direction namely score distillation sampling (SDS) as the optimization objective, which unfortunately results in the noisy appearance and text inconsistency. To solve the problem caused by the ambiguous text, we introduce a relative editing direction, an optimization objective defined by the noise difference between the source and target texts, to release the semantic ambiguity between the texts and images. Additionally, we gradually adjust the direction during optimization to further address the unexpected deviation in the texture domain. Qualitative and quantitative experiments show that our ITEM3D outperforms the state-of-the-art methods on various 3D objects. We also perform text-guided relighting to show explicit control over lighting.

Cross-Dataset-Robust Method for Blind Real-World Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2309.14868
  • repo_url: None
  • paper_authors: Yuan Chen, Zhiliang Ma, Yang Zhao
  • for: A robust blind image quality assessment (BIQA) method that generalises to arbitrary real-world images across datasets.
  • methods: Three ingredients: a robust training strategy, a large-scale real-world dataset, and a powerful backbone. Individual Swin-Transformer models trained on different real-world BIQA datasets jointly generate pseudo-labels, expressed as the probability of relative quality between two random images, and a dataset of 1,000,000 image pairs built this way trains the final cross-dataset-robust model.
  • results: On cross-dataset tests the method even outperforms some state-of-the-art methods trained directly on those datasets, demonstrating its robustness and generalization.
    Abstract Although many effective models and real-world datasets have been presented for blind image quality assessment (BIQA), recent BIQA models usually tend to fit specific training set. Hence, it is still difficult to accurately and robustly measure the visual quality of an arbitrary real-world image. In this paper, a robust BIQA method, is designed based on three aspects, i.e., robust training strategy, large-scale real-world dataset, and powerful backbone. First, many individual models based on popular and state-of-the-art (SOTA) Swin-Transformer (SwinT) are trained on different real-world BIQA datasets respectively. Then, these biased SwinT-based models are jointly used to generate pseudo-labels, which adopts the probability of relative quality of two random images instead of fixed quality score. A large-scale real-world image dataset with 1,000,000 image pairs and pseudo-labels is then proposed for training the final cross-dataset-robust model. Experimental results on cross-dataset tests show that the performance of the proposed method is even better than some SOTA methods that are directly trained on these datasets, thus verifying the robustness and generalization of our method.
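
As a hedged sketch of the pairwise pseudo-labelling step, the snippet below lets an ensemble of (biased) quality models vote on the probability that one image is better than another; the sigmoid aggregation of score differences is an assumption, not the paper's exact formula.

```python
# Sketch of pairwise pseudo-labelling: several quality models score two random
# images, and the pseudo-label is the ensemble probability that the first image
# has the higher quality. The sigmoid aggregation is an assumed choice.
import numpy as np

def relative_quality_label(models, img_a, img_b, scale=1.0):
    diffs = np.array([m(img_a) - m(img_b) for m in models])   # per-model score gaps
    probs = 1.0 / (1.0 + np.exp(-diffs / scale))               # per-model P(A better than B)
    return float(probs.mean())                                  # ensemble pseudo-label

# Usage (hypothetical): label = relative_quality_label(swint_models, img1, img2)
```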

Unsupervised Reconstruction of 3D Human Pose Interactions From 2D Poses Alone

  • paper_url: http://arxiv.org/abs/2309.14865
  • repo_url: None
  • paper_authors: Peter Hardy, Hansung Kim
  • for: Resolving the perspective ambiguity that prevents unsupervised 2D-3D human pose estimation from working in multi-person scenes, by predicting the camera elevation angle relative to the subjects' pelvis.
  • methods: Each person's 2D pose is lifted to 3D independently and combined in a shared 3D coordinate system; the predicted elevation angle is then used to rotate the poses level with the ground plane and to estimate the vertical offset between individuals, yielding an accurate 3D reconstruction of the interaction.
  • results: Accurate 3D reconstructions on the CHI3D dataset, for which three new quantitative metrics are introduced, establishing a benchmark for future research.
    Abstract Current unsupervised 2D-3D human pose estimation (HPE) methods do not work in multi-person scenarios due to perspective ambiguity in monocular images. Therefore, we present one of the first studies investigating the feasibility of unsupervised multi-person 2D-3D HPE from just 2D poses alone, focusing on reconstructing human interactions. To address the issue of perspective ambiguity, we expand upon prior work by predicting the cameras' elevation angle relative to the subjects' pelvis. This allows us to rotate the predicted poses to be level with the ground plane, while obtaining an estimate for the vertical offset in 3D between individuals. Our method involves independently lifting each subject's 2D pose to 3D, before combining them in a shared 3D coordinate system. The poses are then rotated and offset by the predicted elevation angle before being scaled. This by itself enables us to retrieve an accurate 3D reconstruction of their poses. We present our results on the CHI3D dataset, introducing its use for unsupervised 2D-3D pose estimation with three new quantitative metrics, and establishing a benchmark for future research.
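
A short sketch of the levelling step described above: a lifted 3D pose is rotated by the predicted camera elevation angle and shifted by a predicted vertical offset. The axis convention (x horizontal, y up) is an assumption for illustration.

```python
# Sketch: rotate a lifted 3D pose about the x-axis by the predicted elevation
# angle so it is level with the ground plane, then apply a vertical offset.
import numpy as np

def level_pose(pose_3d, elevation_rad, y_offset=0.0):
    """pose_3d: (J, 3) joint positions; returns the levelled, offset pose."""
    c, s = np.cos(elevation_rad), np.sin(elevation_rad)
    rot_x = np.array([[1, 0, 0],
                      [0, c, -s],
                      [0, s,  c]])
    levelled = pose_3d @ rot_x.T
    levelled[:, 1] += y_offset            # vertical offset between individuals
    return levelled
```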

Generalization of pixel-wise phase estimation by CNN and improvement of phase-unwrapping by MRF optimization for one-shot 3D scan

  • paper_url: http://arxiv.org/abs/2309.14824
  • repo_url: None
  • paper_authors: Hiroto Harada, Michihiro Mikamo, Ryo Furukawa, Ryushuke Sagawa, Hiroshi Kawasaki
  • for: Improving the accuracy and stability of one-shot 3D scanning, which is of wide interest for medical and industrial applications.
  • methods: A pixel-wise interpolation technique based on a U-Net pre-trained on CG data with an efficient data augmentation algorithm, plus a robust correspondence-finding algorithm based on Markov random field (MRF) optimization and a shape refinement step using b-spline and Gaussian kernel interpolation.
  • results: Experiments show the proposed method effectively improves the accuracy and stability of one-shot 3D scans and copes with real data containing strong noise and textures.
    Abstract Active stereo technique using single pattern projection, a.k.a. one-shot 3D scan, have drawn a wide attention from industry, medical purposes, etc. One severe drawback of one-shot 3D scan is sparse reconstruction. In addition, since spatial pattern becomes complicated for the purpose of efficient embedding, it is easily affected by noise, which results in unstable decoding. To solve the problems, we propose a pixel-wise interpolation technique for one-shot scan, which is applicable to any types of static pattern if the pattern is regular and periodic. This is achieved by U-net which is pre-trained by CG with efficient data augmentation algorithm. In the paper, to further overcome the decoding instability, we propose a robust correspondence finding algorithm based on Markov random field (MRF) optimization. We also propose a shape refinement algorithm based on b-spline and Gaussian kernel interpolation using explicitly detected laser curves. Experiments are conducted to show the effectiveness of the proposed method using real data with strong noises and textures.

Three-dimensional Tracking of a Large Number of High Dynamic Objects from Multiple Views using Current Statistical Model

  • paper_url: http://arxiv.org/abs/2309.14820
  • repo_url: None
  • paper_authors: Nianhao Xie
  • for: Multi-view, multi-object 3D tracking, particularly for studies of biological collective behaviour that require precise trajectories.
  • methods: Within a Bayesian tracking-while-reconstruction framework, a current statistical model predicts object states and state covariances to improve particle-sampling efficiency, and a Kalman filter suppresses measurement noise.
  • results: Simulations and a real fruit-fly swarm experiment show improved tracking integrity, continuity, and precision over the constant-velocity particle filter (CVPF) baseline.
    Abstract Three-dimensional tracking of multiple objects from multiple views has a wide range of applications, especially in the study of bio-cluster behavior which requires precise trajectories of research objects. However, there are significant temporal-spatial association uncertainties when the objects are similar to each other, frequently maneuver, and cluster in large numbers. Aiming at such a multi-view multi-object 3D tracking scenario, a current statistical model based Kalman particle filter (CSKPF) method is proposed following the Bayesian tracking-while-reconstruction framework. The CSKPF algorithm predicts the objects' states and estimates the objects' state covariance by the current statistical model to importance particle sampling efficiency, and suppresses the measurement noise by the Kalman filter. The simulation experiments prove that the CSKPF method can improve the tracking integrity, continuity, and precision compared with the existing constant velocity based particle filter (CVPF) method. The real experiment on fruitfly clusters also confirms the effectiveness of the CSKPF method.
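
For orientation, the snippet below is a generic linear Kalman predict/update step showing where the motion model's state transition and the noisy 3D measurement enter; the paper's current statistical model is a maneuvering-target model whose transition and process-noise matrices differ from these placeholders.

```python
# Generic linear Kalman predict/update sketch. F, Q, H, R are placeholder model
# matrices; the paper's "current statistical model" defines its own F and Q.
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """x: state, P: state covariance, z: measurement."""
    # Predict with the motion model
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the (noisy) 3D measurement
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```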

Discrepancy Matters: Learning from Inconsistent Decoder Features for Consistent Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.14819
  • repo_url: https://github.com/maxwell0027/lefed
  • paper_authors: Qingjie Zeng, Yutong Xie, Zilin Lu, Mengkang Lu, Yong Xia
  • for: Semi-supervised learning with limited labelled data, particularly for volumetric medical image segmentation.
  • methods: LeFeD, a new semi-supervised method that feeds the feature-level discrepancy between two differentiated decoders back to the encoder as a learning signal, enlarging the difference by training differentiated decoders and then learning from the inconsistent information iteratively.
  • results: LeFeD outperforms eight state-of-the-art methods on three public datasets without extra uncertainty estimation or strong constraints, setting a new state of the art for semi-supervised medical image segmentation.
    Abstract Semi-supervised learning (SSL) has been proven beneficial for mitigating the issue of limited labeled data especially on the task of volumetric medical image segmentation. Unlike previous SSL methods which focus on exploring highly confident pseudo-labels or developing consistency regularization schemes, our empirical findings suggest that inconsistent decoder features emerge naturally when two decoders strive to generate consistent predictions. Based on the observation, we first analyze the treasure of discrepancy in learning towards consistency, under both pseudo-labeling and consistency regularization settings, and subsequently propose a novel SSL method called LeFeD, which learns the feature-level discrepancy obtained from two decoders, by feeding the discrepancy as a feedback signal to the encoder. The core design of LeFeD is to enlarge the difference by training differentiated decoders, and then learn from the inconsistent information iteratively. We evaluate LeFeD against eight state-of-the-art (SOTA) methods on three public datasets. Experiments show LeFeD surpasses competitors without any bells and whistles such as uncertainty estimation and strong constraints, as well as setting a new state-of-the-art for semi-supervised medical image segmentation. Code is available at \textcolor{cyan}{https://github.com/maxwell0027/LeFeD}
    摘要 半监督学习(SSL)已被证明有助于缓解标注数据有限的问题,尤其是在体数据医疗影像分割任务上。与以往专注于挖掘高置信度伪标签或设计一致性正则化方案的 SSL 方法不同,我们的实证发现表明,当两个解码器努力生成一致的预测时,会自然地出现不一致的解码器特征。基于这一观察,我们首先在伪标签和一致性正则化两种设置下分析了这种差异对一致性学习的价值,随后提出了一种新的 SSL 方法 LeFeD:将两个解码器得到的特征层差异作为反馈信号输入编码器进行学习。LeFeD 的核心设计是通过训练差异化的解码器来扩大差异,然后迭代地从不一致信息中学习。我们在三个公开数据集上将 LeFeD 与八种最先进(SOTA)方法进行比较,实验表明 LeFeD 在不借助不确定性估计和强约束等技巧的情况下超越了竞争方法,并为半监督医疗影像分割树立了新的 SOTA。代码可在 https://github.com/maxwell0027/LeFeD 获取。
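
A hedged sketch of the feature-level discrepancy term that LeFeD feeds back to the encoder. The random tensors below stand in for the multi-stage feature maps of two differentiated decoders sharing one encoder; the exact distance and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def discrepancy_feedback_loss(feats_a, feats_b):
    """Average feature-level discrepancy between two decoders; its gradient flows
    back into the shared encoder and acts as the feedback signal."""
    return sum(F.mse_loss(a, b) for a, b in zip(feats_a, feats_b)) / len(feats_a)

# random tensors stand in for multi-stage decoder feature maps of shape (B, C, H, W)
feats_a = [torch.randn(2, 16, 32, 32, requires_grad=True) for _ in range(3)]
feats_b = [torch.randn(2, 16, 32, 32, requires_grad=True) for _ in range(3)]

loss_unsup = discrepancy_feedback_loss(feats_a, feats_b)
# total = supervised_seg_loss + lambda_d * loss_unsup   # combined with the labeled-data loss
loss_unsup.backward()
```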

A Comparative Study of Population-Graph Construction Methods and Graph Neural Networks for Brain Age Regression

  • paper_url: http://arxiv.org/abs/2309.14816
  • repo_url: https://github.com/bintsi/brain-age-population-graphs
  • paper_authors: Kyriaki-Margarita Bintsi, Tamara T. Mueller, Sophie Starck, Vasileios Baltatzis, Alexander Hammers, Daniel Rueckert
  • for: 这个研究的目的是提高脑年龄估计的精度,使其可在临床环境中作为神经退行性疾病的生物标志物与诊断参考。
  • methods: 这个研究使用人口图(population graph),结合多种成像与非成像数据并捕捉人群中个体之间的关联,比较了不同的人口图构建方法及其对 GNN 性能的影响。
  • results: 研究发现,GCN 和 GAT 等对图结构高度敏感的架构在低同质性(homophily)图上表现不佳,而 GraphSage 和 Chebyshev 等架构在不同同质性比例下更为稳健;静态图构建方法可能不足以胜任脑年龄估计任务。
    Abstract The difference between the chronological and biological brain age of a subject can be an important biomarker for neurodegenerative diseases, thus brain age estimation can be crucial in clinical settings. One way to incorporate multimodal information into this estimation is through population graphs, which combine various types of imaging data and capture the associations among individuals within a population. In medical imaging, population graphs have demonstrated promising results, mostly for classification tasks. In most cases, the graph structure is pre-defined and remains static during training. However, extracting population graphs is a non-trivial task and can significantly impact the performance of Graph Neural Networks (GNNs), which are sensitive to the graph structure. In this work, we highlight the importance of a meaningful graph construction and experiment with different population-graph construction methods and their effect on GNN performance on brain age estimation. We use the homophily metric and graph visualizations to gain valuable quantitative and qualitative insights on the extracted graph structures. For the experimental evaluation, we leverage the UK Biobank dataset, which offers many imaging and non-imaging phenotypes. Our results indicate that architectures highly sensitive to the graph structure, such as Graph Convolutional Network (GCN) and Graph Attention Network (GAT), struggle with low homophily graphs, while other architectures, such as GraphSage and Chebyshev, are more robust across different homophily ratios. We conclude that static graph construction approaches are potentially insufficient for the task of brain age estimation and make recommendations for alternative research directions.
    摘要 个体的实际年龄与生物学脑年龄之间的差异可以作为神经退行性疾病的重要生物标志物,因此脑年龄估计在临床环境中可能至关重要。将多模态信息纳入估计的一种方式是使用人口图(population graph),它结合多种成像数据并捕捉人群中个体之间的关联。在医学影像领域,人口图已展示出有前景的结果,主要用于分类任务。在大多数情况下,图结构是预先定义的,并在训练期间保持不变。然而,构建人口图并非易事,并且会显著影响对图结构敏感的图神经网络(GNN)的性能。在这项工作中,我们强调了有意义的图构建的重要性,并实验了不同的人口图构建方法及其对 GNN 在脑年龄估计上性能的影响。我们使用同质性(homophily)度量和图可视化来获得关于所构建图结构的定量与定性见解。实验评估使用了提供大量成像与非成像表型的 UK Biobank 数据集。结果表明,对图结构高度敏感的架构,如图卷积网络(GCN)和图注意力网络(GAT),在低同质性图上表现不佳,而 GraphSage 和 Chebyshev 等架构在不同同质性比例下更为稳健。我们的结论是,静态图构建方法可能不足以胜任脑年龄估计任务,并对后续研究方向给出了建议。
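
An illustrative computation of the edge-homophily metric used to analyze the extracted population graphs. Since brain age is continuous, the labels here are a hypothetical binning of age into decades; the paper's exact homophily definition may differ.

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same label."""
    same = sum(int(labels[u] == labels[v]) for u, v in edges)
    return same / len(edges)

# toy population graph: nodes = subjects, labels = brain age binned by decade (hypothetical)
ages = np.array([62, 64, 71, 58, 80, 79])
labels = ages // 10
edges = [(0, 1), (2, 5), (0, 3), (1, 2), (3, 4)]
print(edge_homophily(edges, labels))   # 0.4: only two edges connect same-decade subjects
```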

ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios

  • paper_url: http://arxiv.org/abs/2309.14809
  • repo_url: https://github.com/syscv/sam-hq
  • paper_authors: Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Claudia Bonanno, Rosario Scavo, Antonino Furnari, Giovanni Maria Farinella
  • for: The paper is written for studying human-object interactions in industrial scenarios.
  • methods: The paper uses a new dataset called ENIGMA-51, which is densely annotated with labels to enable the systematic study of human-object interactions.
  • results: The baseline results show that the ENIGMA-51 dataset poses a challenging benchmark for studying human-object interactions in industrial scenarios.
    Abstract ENIGMA-51 is a new egocentric dataset acquired in a real industrial domain by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., electric screwdriver) and electronic instruments (e.g., oscilloscope). The 51 sequences are densely annotated with a rich set of labels that enable the systematic study of human-object interactions in the industrial domain. We provide benchmarks on four tasks related to human-object interactions: 1) untrimmed action detection, 2) egocentric human-object interaction detection, 3) short-term object interaction anticipation and 4) natural language understanding of intents and entities. Baseline results show that the ENIGMA-51 dataset poses a challenging benchmark to study human-object interactions in industrial scenarios. We publicly release the dataset at: https://iplab.dmi.unict.it/ENIGMA-51/.
    摘要 ENIGMA-51 是一个在真实工业场景中采集的新的第一人称(egocentric)数据集,由 19 名受试者按照指示使用工业工具(如电动螺丝刀)和电子仪器(如示波器)完成电路板的维修。51 个序列配有密集的丰富标注,可用于系统性研究工业领域中的人-物交互。我们在四个与人-物交互相关的任务上提供了基准:1)未剪辑动作检测,2)第一人称人-物交互检测,3)短期物体交互预测,4)意图与实体的自然语言理解。基线结果表明,ENIGMA-51 数据集为研究工业场景中的人-物交互提供了一个具有挑战性的基准。我们在以下链接公开发布了数据集:https://iplab.dmi.unict.it/ENIGMA-51/。

3D printed realistic finger vein phantoms

  • paper_url: http://arxiv.org/abs/2309.14806
  • repo_url: None
  • paper_authors: Luuk Spreeuwers, Rasmus van der Grift, Pesigrihastamadya Normakristagaluh
  • for: 该论文旨在提出一种创新的指静脉仿真方法,以获得逼真的指静脉图像和精确已知的静脉图案。
  • methods: 该方法使用 3D 打印技术,通过不同的打印材料和参数,模拟手指内部骨骼、静脉和软组织等不同组织的光学特性。
  • results: 该方法能够制作出可产生逼真指静脉图像且静脉图案精确已知的手指仿真体,可用于开发和评估指静脉提取与识别方法;此外,这些仿真体还可用于欺骗指静脉识别系统。
    Abstract Finger vein pattern recognition is an emerging biometric with a good resistance to presentation attacks and low error rates. One problem is that it is hard to obtain ground truth finger vein patterns from live fingers. In this paper we propose an advanced method to create finger vein phantoms using 3D printing where we mimic the optical properties of the various tissues inside the fingers, like bone, veins and soft tissues using different printing materials and parameters. We demonstrate that we are able to create finger phantoms that result in realistic finger vein images and precisely known vein patterns. These phantoms can be used to develop and evaluate finger vein extraction and recognition methods. In addition, we show that the finger vein phantoms can be used to spoof a finger vein recognition system. This paper is based on the Master's thesis of Rasmus van der Grift.
    摘要 指静脉识别是一种新兴的生物特征识别技术,对呈现攻击具有良好的抵御能力且错误率低。然而,一个问题是很难从真实手指获得指静脉图案的真值数据。在这篇论文中,我们提出了一种先进的方法,使用 3D 打印技术制作指静脉仿真体(phantom),通过不同的打印材料和参数模拟手指内部不同组织(如骨骼、静脉和软组织)的光学特性。我们展示了可以制作出能产生逼真指静脉图像且静脉图案精确已知的手指仿真体。这些仿真体可用于开发和评估指静脉提取与识别方法。此外,我们还表明指静脉仿真体可用于欺骗(spoof)指静脉识别系统。这篇论文基于 Rasmus van der Grift 的硕士论文。

3D Density-Gradient based Edge Detection on Neural Radiance Fields (NeRFs) for Geometric Reconstruction

  • paper_url: http://arxiv.org/abs/2309.14800
  • repo_url: None
  • paper_authors: Miriam Jäger, Boris Jutzi
  • for: 可以从Neural Radiance Fields(NeRF)中生成高质量的3D几何重建。
  • methods: 采用基于密度梯度的方法,具体使用一阶与二阶导数的 3D 边缘检测算子,即 Sobel、Canny 和高斯拉普拉斯(LoG),梯度基于各方向相邻体素的相对密度值计算。
  • results: 能够在物体表面实现高几何精度的 3D 重建,并获得出色的物体完整性;其中 Canny 算子能有效消除空隙,得到均匀的点密度,并在正确性与完整性之间取得良好平衡。
    Abstract Generating geometric 3D reconstructions from Neural Radiance Fields (NeRFs) is of great interest. However, accurate and complete reconstructions based on the density values are challenging. The network output depends on input data, NeRF network configuration and hyperparameter. As a result, the direct usage of density values, e.g. via filtering with global density thresholds, usually requires empirical investigations. Under the assumption that the density increases from non-object to object area, the utilization of density gradients from relative values is evident. As the density represents a position-dependent parameter it can be handled anisotropically, therefore processing of the voxelized 3D density field is justified. In this regard, we address geometric 3D reconstructions based on density gradients, whereas the gradients result from 3D edge detection filters of the first and second derivatives, namely Sobel, Canny and Laplacian of Gaussian. The gradients rely on relative neighboring density values in all directions, thus are independent from absolute magnitudes. Consequently, gradient filters are able to extract edges along a wide density range, almost independent from assumptions and empirical investigations. Our approach demonstrates the capability to achieve geometric 3D reconstructions with high geometric accuracy on object surfaces and remarkable object completeness. Notably, Canny filter effectively eliminates gaps, delivers a uniform point density, and strikes a favorable balance between correctness and completeness across the scenes.
    摘要 基于神经辐射场(NeRF)生成三维几何重建备受关注。然而,基于密度值实现准确而完整的重建仍然困难:网络输出取决于输入数据、NeRF 网络配置和超参数,因此直接使用密度值(例如通过全局密度阈值滤波)通常需要经验性调参。在"密度从非物体区域到物体区域递增"的假设下,利用相对密度值的梯度便显得顺理成章。由于密度是与位置相关的参数,可以各向异性地处理,因此对体素化的三维密度场进行处理是合理的。在此基础上,我们研究基于密度梯度的三维几何重建,其中梯度来自一阶与二阶导数的三维边缘检测算子,即 Sobel、Canny 和高斯拉普拉斯(LoG)。这些梯度依赖于各方向相邻体素的相对密度值,因而与绝对量级无关;梯度滤波因此能够在很宽的密度范围内提取边缘,几乎不依赖假设与经验性调参。我们的方法能够在物体表面实现高几何精度的三维重建,并获得出色的物体完整性。值得注意的是,Canny 算子能够有效消除空隙,提供均匀的点密度,并在各场景中于正确性与完整性之间取得良好平衡。
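
A small sketch of density-gradient edge detection on a voxelized 3D density field using a first-derivative (Sobel) operator, one of the filters discussed above. The toy sphere stands in for densities queried from a trained NeRF; the threshold value and the Canny/LoG variants are omitted.

```python
import numpy as np
from scipy import ndimage

# toy voxelized density field: a solid sphere of high density in empty space
n = 64
zz, yy, xx = np.meshgrid(*(np.arange(n),) * 3, indexing="ij")
density = ((xx - 32) ** 2 + (yy - 32) ** 2 + (zz - 32) ** 2 < 15 ** 2).astype(float)

# 3D Sobel gradients along each axis; the magnitude highlights density transitions (surfaces)
gx = ndimage.sobel(density, axis=2)
gy = ndimage.sobel(density, axis=1)
gz = ndimage.sobel(density, axis=0)
grad_mag = np.sqrt(gx ** 2 + gy ** 2 + gz ** 2)

# keep voxels with large relative gradient magnitude -> candidate surface points
surface_voxels = np.argwhere(grad_mag > 0.5 * grad_mag.max())
print(surface_voxels.shape)
```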

Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.14786
  • repo_url: https://github.com/suhwan-cho/tmo
  • paper_authors: Suhwan Cho, Minhyeok Lee, Jungho Lee, MyeongAh Cho, Sangyoun Lee
  • for: 提高视频对象分割(VOS)任务中对象检测的稳定性和可靠性,不受外部指导或干扰。
  • methods: 将运动线索视为可选项:训练时随机向运动编码器输入 RGB 图像而非光流图,从而隐式降低网络对运动线索的依赖;由于运动编码器可同时处理 RGB 图像和光流图,测试时可根据所用的运动输入生成两种不同的预测,并通过自适应输出选择算法选取最优结果。
  • results: 在所有公共测试数据集上达到了最佳性能水平,并在实时推理速度下保持稳定性。
    Abstract Unsupervised video object segmentation (VOS) is a task that aims to detect the most salient object in a video without external guidance about the object. To leverage the property that salient objects usually have distinctive movements compared to the background, recent methods collaboratively use motion cues extracted from optical flow maps with appearance cues extracted from RGB images. However, as optical flow maps are usually very relevant to segmentation masks, the network is easy to be learned overly dependent on the motion cues during network training. As a result, such two-stream approaches are vulnerable to confusing motion cues, making their prediction unstable. To relieve this issue, we design a novel motion-as-option network by treating motion cues as optional. During network training, RGB images are randomly provided to the motion encoder instead of optical flow maps, to implicitly reduce motion dependency of the network. As the learned motion encoder can deal with both RGB images and optical flow maps, two different predictions can be generated depending on which source information is used as motion input. In order to fully exploit this property, we also propose an adaptive output selection algorithm to adopt optimal prediction result at test time. Our proposed approach affords state-of-the-art performance on all public benchmark datasets, even maintaining real-time inference speed.
    摘要 无监督视频目标分割(VOS)旨在不借助关于目标的外部指导,在视频中检测最显著的目标。为了利用显著目标通常相对背景具有独特运动这一特性,近期方法将从光流图中提取的运动线索与从 RGB 图像中提取的外观线索协同使用。然而,由于光流图往往与分割掩膜高度相关,网络在训练中容易过度依赖运动线索,使得这类双流方法容易被具有误导性的运动线索干扰,预测不稳定。为缓解这一问题,我们设计了一种新颖的"运动作为可选项"网络:训练时随机向运动编码器输入 RGB 图像而非光流图,从而隐式降低网络对运动的依赖。由于学习到的运动编码器可以同时处理 RGB 图像和光流图,测试时可根据所用的运动输入生成两种不同的预测;为充分利用这一特性,我们还提出了自适应输出选择算法,在测试时选取最优的预测结果。所提方法在所有公开基准数据集上达到了最先进的性能,同时保持了实时推理速度。
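
A minimal sketch of the "motion-as-option" training trick: with some probability the motion encoder receives the RGB frame instead of the optical-flow map. The probability value and the 3-channel flow representation are assumptions made here for illustration.

```python
import random
import torch
import torch.nn as nn

class MotionAsOptionInput(nn.Module):
    """During training, randomly substitute the RGB frame for the optical-flow map
    before the motion encoder, so the network does not become overly motion-dependent."""
    def __init__(self, p_rgb=0.5):
        super().__init__()
        self.p_rgb = p_rgb   # assumed probability of feeding RGB instead of flow

    def forward(self, rgb, flow):
        if self.training and random.random() < self.p_rgb:
            return rgb       # motion encoder sees appearance only for this sample
        return flow          # otherwise it sees the flow map as usual

selector = MotionAsOptionInput(p_rgb=0.5)
rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 3, 224, 224)   # flow rendered as a 3-channel image (an assumption)
motion_input = selector(rgb, flow)   # passed on to the motion encoder
```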

Frugal Satellite Image Change Detection with Deep-Net Inversion

  • paper_url: http://arxiv.org/abs/2309.14781
  • repo_url: None
  • paper_authors: Hichem Sahbi, Sebastien Deschamps
  • for: 这篇论文的目的是提出一种基于主动学习的变化检测算法,用于检测卫星图像中的目标变化。
  • methods: 该算法基于问答模型,仅就一小部分关键图像(称为虚拟样本,virtual exemplars)向用户(即 oracle)询问变化的相关性,并根据 oracle 的回答更新深度神经网络(DNN)分类器;此外,算法利用一种新的对抗模型学习最具代表性、多样性和不确定性的虚拟样本,从而提升主动学习的效果。
  • results: 实验表明,所提出的深度网络反演方法优于相关的已有工作。
    Abstract Change detection in satellite imagery seeks to find occurrences of targeted changes in a given scene taken at different instants. This task has several applications ranging from land-cover mapping, to anthropogenic activity monitory as well as climate change and natural hazard damage assessment. However, change detection is highly challenging due to the acquisition conditions and also to the subjectivity of changes. In this paper, we devise a novel algorithm for change detection based on active learning. The proposed method is based on a question and answer model that probes an oracle (user) about the relevance of changes only on a small set of critical images (referred to as virtual exemplars), and according to oracle's responses updates deep neural network (DNN) classifiers. The main contribution resides in a novel adversarial model that allows learning the most representative, diverse and uncertain virtual exemplars (as inverted preimages of the trained DNNs) that challenge (the most) the trained DNNs, and this leads to a better re-estimate of these networks in the subsequent iterations of active learning. Experiments show the out-performance of our proposed deep-net inversion against the related work.
    摘要 Change detection in satellite imagery aims to identify targeted changes in a scene captured at different times. This task has numerous applications, including land-cover mapping, monitoring anthropogenic activities, and assessing climate change and natural hazard damage. However, change detection is highly challenging due to acquisition conditions and the subjectivity of changes. In this paper, we propose a novel algorithm for change detection based on active learning. Our method uses a question-and-answer model to probe an oracle (user) about the relevance of changes on a small set of critical images (referred to as virtual exemplars), and updates deep neural network (DNN) classifiers according to the oracle's responses. The main contribution is a novel adversarial model that learns the most representative, diverse, and uncertain virtual exemplars (as inverted preimages of the trained DNNs) that challenge the trained DNNs, leading to a better re-estimate of these networks in subsequent active learning iterations. Experimental results show the outperformance of our proposed deep-net inversion compared to related work.

Multi-Label Feature Selection Using Adaptive and Transformed Relevance

  • paper_url: http://arxiv.org/abs/2309.14768
  • repo_url: https://github.com/sadegh28/atr
  • paper_authors: Sadegh Eskandari, Sahar Ghassabi
  • for: 本研究旨在提出一种基于信息理论的多标签特征选择方法,以便在多标签数据中提取有用的特征。
  • methods: 本方法基于一种新的启发函数,结合了算法适应和问题转换方法。它考虑每个标签的个别powers和抽象标签空间的权重,以选择最佳的特征。
  • results: 我们在12个benchmark上进行了实验,与10种现有的信息理论筛选方法进行比较。结果显示,我们的方法在6个评价指标中均显示出优异性,并且在特征和标签空间相对较大的benchmark中保持稳定性。代码可以在https://github.com/Sadegh28/ATR上获取。
    Abstract Multi-label learning has emerged as a crucial paradigm in data analysis, addressing scenarios where instances are associated with multiple class labels simultaneously. With the growing prevalence of multi-label data across diverse applications, such as text and image classification, the significance of multi-label feature selection has become increasingly evident. This paper presents a novel information-theoretical filter-based multi-label feature selection, called ATR, with a new heuristic function. Incorporating a combinations of algorithm adaptation and problem transformation approaches, ATR ranks features considering individual labels as well as abstract label space discriminative powers. Our experimental studies encompass twelve benchmarks spanning various domains, demonstrating the superiority of our approach over ten state-of-the-art information-theoretical filter-based multi-label feature selection methods across six evaluation metrics. Furthermore, our experiments affirm the scalability of ATR for benchmarks characterized by extensive feature and label spaces. The codes are available at https://github.com/Sadegh28/ATR
    摘要 多标签学习已成为数据分析中的重要方法,用于处理同时具有多个分类标签的实例。随着多标签数据在不同应用领域的普及,如文本和图像分类等,多标签特征选择的重要性得到了更加明显的证明。本文提出了一种基于信息理论滤波器的新型多标签特征选择方法,即ATR,它通过结合算法适应和问题转换技术来评估特征的个别标签和抽象标签空间的探索力。我们的实验包括12个标准 benchmark,覆盖多个领域,表明我们的方法在6个评价指标中超过了10种现有的信息理论滤波器基于多标签特征选择方法。此外,我们的实验还证明了ATR在特征和标签空间较大的 benchmark 上的扩展性。代码可以在https://github.com/Sadegh28/ATR 中获取。
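
A simplified information-theoretic filter for multi-label feature selection: each feature is scored by its summed mutual information with the individual labels. This is only a stand-in for ATR's heuristic, which additionally weights a transformed (abstract) label space.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_multilabel(X, Y):
    """Score each feature by its summed mutual information with every label column,
    then return feature indices sorted best-first (a simplified filter, not ATR itself)."""
    scores = np.zeros(X.shape[1])
    for j in range(Y.shape[1]):
        scores += mutual_info_classif(X, Y[:, j], random_state=0)
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# two synthetic labels that depend on features 0 and 3+4 respectively
Y = np.stack([(X[:, 0] > 0).astype(int), (X[:, 3] + X[:, 4] > 0).astype(int)], axis=1)
print(rank_features_multilabel(X, Y)[:5])   # informative features should rank near the top
```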

InvKA: Gait Recognition via Invertible Koopman Autoencoder

  • paper_url: http://arxiv.org/abs/2309.14764
  • repo_url: None
  • paper_authors: Fan Li, Dong Liang, Jing Lian, Qidong Liu, Hegui Zhu, Jizhao Liu
  • for: 提高步态识别方法的可解性和计算效率
  • methods: 基于 Koopman 算子理论提取步态特征,使用可逆自编码器缩小模型规模,并去除卷积层以压缩网络深度,从而降低计算成本
  • results: 在多个数据集上实现了计算成本减少至1%,同时保持步态识别精度高达98%(非遮挡数据集)
    Abstract Most current gait recognition methods suffer from poor interpretability and high computational cost. To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory. The transition matrix in this space captures complex kinematic features of gait cycles, namely the Koopman operator. The diagonal elements of the operator matrix can represent the overall motion trend, providing a physically meaningful descriptor. To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers to compress its depth, resulting in fewer floating-point operations. Experimental results on multiple datasets show that our method reduces computational cost to 1% compared to state-of-the-art methods while achieving competitive recognition accuracy 98% on non-occlusion datasets.
    摘要 现有的步态识别方法大多存在可解释性差和计算成本高的问题。为提高可解释性,我们基于 Koopman 算子理论研究嵌入空间中的步态特征。该空间中的状态转移矩阵捕捉了步态周期的复杂运动学特征,即 Koopman 算子;算子矩阵的对角元素可以表征整体运动趋势,提供具有物理意义的描述子。为降低算法的计算成本,我们使用可逆自编码器缩小模型规模,并去除卷积层以压缩网络深度,从而减少浮点运算次数。多个数据集上的实验结果显示,我们的方法在非遮挡数据集上达到 98% 的识别精度,而计算成本仅为最先进方法的 1%。
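
A DMD-style least-squares sketch of estimating a finite-dimensional Koopman operator from embedded gait states, whose diagonal entries summarize the overall motion trend as described above. The 4-D latent trajectory below is synthetic; in InvKA the embedding would come from the invertible autoencoder.

```python
import numpy as np

def estimate_koopman(states):
    """Least-squares (DMD-style) estimate of a Koopman operator K with z_{t+1} ~ K z_t.
    `states` has one embedded state per column."""
    X, Y = states[:, :-1], states[:, 1:]
    return Y @ np.linalg.pinv(X)

# toy embedded gait cycle: a noisy rotation in a 4-D latent space
rng = np.random.default_rng(0)
theta = 2 * np.pi / 30
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
R = np.kron(np.eye(2), rot)

z, traj = rng.normal(size=4), []
for _ in range(60):
    z = R @ z + 0.01 * rng.normal(size=4)
    traj.append(z)

K = estimate_koopman(np.array(traj).T)     # shape (4, 60): states as columns
print(np.round(np.diag(K), 3))             # diagonal entries describe the overall motion trend
```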

Diffusion-based Holistic Texture Rectification and Synthesis

  • paper_url: http://arxiv.org/abs/2309.14759
  • repo_url: https://github.com/Dominoer/siggraph_asia_2023_holistic_texture
  • paper_authors: Guoqing Hao, Satoshi Iizuka, Kensho Hara, Edgar Simo-Serra, Hirokatsu Kataoka, Kazuhiro Fukui
  • for: 本文提出了一种新的框架,用于修正自然图像中的遮挡和扭曲。
  • methods: 该框架使用一种条件的潜在扩散模型(LDM),并使用一种新的遮挡意识的潜在变换器来编码图像特征。
  • results: 实验结果表明,该框架在定量和定性评估上均显著优于现有方法,并通过全面的消融研究和用户感知研究验证了各组件的有效性。
    Abstract We present a novel framework for rectifying occlusions and distortions in degraded texture samples from natural images. Traditional texture synthesis approaches focus on generating textures from pristine samples, which necessitate meticulous preparation by humans and are often unattainable in most natural images. These challenges stem from the frequent occlusions and distortions of texture samples in natural images due to obstructions and variations in object surface geometry. To address these issues, we propose a framework that synthesizes holistic textures from degraded samples in natural images, extending the applicability of exemplar-based texture synthesis techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM) with a novel occlusion-aware latent transformer. This latent transformer not only effectively encodes texture features from partially-observed samples necessary for the generation process of the LDM, but also explicitly captures long-range dependencies in samples with large occlusions. To train our model, we introduce a method for generating synthetic data by applying geometric transformations and free-form mask generation to clean textures. Experimental results demonstrate that our framework significantly outperforms existing methods both quantitatively and quantitatively. Furthermore, we conduct comprehensive ablation studies to validate the different components of our proposed framework. Results are corroborated by a perceptual user study which highlights the efficiency of our proposed approach.
    摘要 我们提出了一种新的框架,用于修复自然图像中退化纹理样本的遮挡与扭曲。传统的纹理合成方法侧重于从完好样本生成纹理,这类样本通常需要人工精心准备,而且在大多数自然图像中难以获得。这些困难源于自然图像中的纹理样本常因遮挡和物体表面几何变化而发生退化。为解决这些问题,我们提出了一个可从自然图像中的退化样本合成完整纹理的框架,扩展了基于范例的纹理合成技术的适用范围。我们的框架使用条件潜在扩散模型(LDM)以及一种新颖的遮挡感知潜在变换器。该潜在变换器不仅能有效编码生成过程所需的、来自部分可见样本的纹理特征,还能显式捕捉大面积遮挡样本中的长程依赖关系。为训练模型,我们提出了一种合成数据生成方法:对干净纹理施加几何变换并生成自由形状掩膜。实验结果表明,我们的框架在定量与定性上均显著优于现有方法。此外,我们进行了全面的消融研究以验证框架各组成部分的作用,并通过用户感知研究进一步证实了所提方法的有效性。

On quantifying and improving realism of images generated with diffusion

  • paper_url: http://arxiv.org/abs/2309.14756
  • repo_url: None
  • paper_authors: Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
  • for: 这个论文主要是为了解决生成模型中的图像真实性评估问题。
  • methods: 该论文提出了一种基于统计方法的图像真实性评估指标,即图像真实性分数(IRS),并通过实验表明其可以有效地评估生成模型中的图像真实性。
  • results: 该论文的实验结果表明,图像真实性分数可以准确地分辨真实图像和假图像,并且可以用于评估生成模型的性能。此外,通过修改生成损失函数以采用图像真实性分数,可以提高生成模型中的图像质量。
    Abstract Recent advances in diffusion models have led to a quantum leap in the quality of generative visual content. However, quantification of realism of the content is still challenging. Existing evaluation metrics, such as Inception Score and Fr\'echet inception distance, fall short on benchmarking diffusion models due to the versatility of the generated images. Moreover, they are not designed to quantify realism of an individual image. This restricts their application in forensic image analysis, which is becoming increasingly important in the emerging era of generative models. To address that, we first propose a metric, called Image Realism Score (IRS), computed from five statistical measures of a given image. This non-learning based metric not only efficiently quantifies realism of the generated images, it is readily usable as a measure to classify a given image as real or fake. We experimentally establish the model- and data-agnostic nature of the proposed IRS by successfully detecting fake images generated by Stable Diffusion Model (SDM), Dalle2, Midjourney and BigGAN. We further leverage this attribute of our metric to minimize an IRS-augmented generative loss of SDM, and demonstrate a convenient yet considerable quality improvement of the SDM-generated content with our modification. Our efforts have also led to Gen-100 dataset, which provides 1,000 samples for 100 classes generated by four high-quality models. We will release the dataset and code.
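
An illustrative, non-learned realism score assembled from simple per-image statistics and thresholded to classify an image as real or fake. The five statistical measures and weights used by IRS are not specified in the abstract, so the quantities below are placeholders, not the paper's definition.

```python
import numpy as np

def image_realism_score(img):
    """Toy non-learned realism score built from generic image statistics (placeholders)."""
    gray = img.mean(axis=2)                           # HxWx3 float image in [0, 1]
    gy, gx = np.gradient(gray)
    grad_mag = np.hypot(gx, gy)
    hist, _ = np.histogram(gray, bins=64, range=(0.0, 1.0), density=True)
    hist = hist + 1e-12
    entropy = -np.sum(hist * np.log(hist))            # tonal diversity
    stats = np.array([grad_mag.mean(), grad_mag.std(), gray.std(),
                      entropy, np.abs(gray - 0.5).mean()])
    weights = np.array([1.0, 1.0, 1.0, 0.5, -0.5])    # hypothetical weighting
    return float(weights @ stats)

def classify_real_vs_fake(img, threshold=1.0):
    """Threshold the score to flag an image as real (True) or fake (False)."""
    return image_realism_score(img) >= threshold

img = np.random.rand(256, 256, 3)
print(image_realism_score(img), classify_real_vs_fake(img))
```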

Image Denoising via Style Disentanglement

  • paper_url: http://arxiv.org/abs/2309.14755
  • repo_url: None
  • paper_authors: Jingwei Niu, Jun Cheng, Shan Tan
  • for: 提出了一种新的图像去噪方法,可以同时提供清晰的去噪机制和好的性能。
  • methods: 视噪为图像的一种风格,通过提取噪音样本和噪音自由样本来除噪。设计了新的损失函数和网络模块,在特征空间中分离噪音和内容特征。
  • results: 通过Synthetic Noise Removal和Real-world Image Denoising dataset(SIDD和DND)的广泛实验,证明了方法的效果,PSNR和SSIM指标均有显著提高。此外,方法也具有良好的解释性。
    Abstract Image denoising is a fundamental task in low-level computer vision. While recent deep learning-based image denoising methods have achieved impressive performance, they are black-box models and the underlying denoising principle remains unclear. In this paper, we propose a novel approach to image denoising that offers both clear denoising mechanism and good performance. We view noise as a type of image style and remove it by incorporating noise-free styles derived from clean images. To achieve this, we design novel losses and network modules to extract noisy styles from noisy images and noise-free styles from clean images. The noise-free style induces low-response activations for noise features and high-response activations for content features in the feature space. This leads to the separation of clean contents from noise, effectively denoising the image. Unlike disentanglement-based image editing tasks that edit semantic-level attributes using styles, our main contribution lies in editing pixel-level attributes through global noise-free styles. We conduct extensive experiments on synthetic noise removal and real-world image denoising datasets (SIDD and DND), demonstrating the effectiveness of our method in terms of both PSNR and SSIM metrics. Moreover, we experimentally validate that our method offers good interpretability.
    摘要 图像去噪是底层计算机视觉的一项基础任务。近年来基于深度学习的图像去噪方法取得了令人印象深刻的性能,但它们是黑盒模型,其去噪原理并不清晰。在本文中,我们提出了一种新的图像去噪方法,既有清晰的去噪机制,又有良好的性能。我们将噪声视为图像的一种"风格",并通过融入从干净图像中提取的无噪声风格来去除噪声。为此,我们设计了新的损失函数和网络模块,从含噪图像中提取含噪风格、从干净图像中提取无噪声风格。无噪声风格使噪声特征在特征空间中产生低响应激活、使内容特征产生高响应激活,从而将干净内容与噪声分离,实现有效去噪。与利用风格编辑语义级属性的解耦式图像编辑任务不同,我们的主要贡献在于通过全局无噪声风格编辑像素级属性。我们在合成噪声去除和真实图像去噪数据集(SIDD 和 DND)上进行了大量实验,结果表明该方法在 PSNR 和 SSIM 指标上均表现出色。此外,实验还验证了该方法具有良好的可解释性。

Advanced Volleyball Stats for All Levels: Automatic Setting Tactic Detection and Classification with a Single Camera

  • paper_url: http://arxiv.org/abs/2309.14753
  • repo_url: https://github.com/volleyIEEE/VolleyStats
  • paper_authors: Haotian Xia, Rhys Tracy, Yun Zhao, Yuqing Wang, Yuan-Fang Wang, Weining Shen
  • for: 提供高级战术分类方法的单视图计算机视觉框架
  • methods: 结合传球(setting)球轨迹识别与新颖的传球轨迹分类器,生成全面且高级的统计数据
  • results: 在各种比赛条件下的传球战术分类性能超越基线方法,并能稳健地处理复杂比赛情况和不同摄像头角度
    Abstract This paper presents PathFinder and PathFinderPlus, two novel end-to-end computer vision frameworks designed specifically for advanced setting strategy classification in volleyball matches from a single camera view. Our frameworks combine setting ball trajectory recognition with a novel set trajectory classifier to generate comprehensive and advanced statistical data. This approach offers a fresh perspective for in-game analysis and surpasses the current level of granularity in volleyball statistics. In comparison to existing methods used in our baseline PathFinder framework, our proposed ball trajectory detection methodology in PathFinderPlus exhibits superior performance for classifying setting tactics under various game conditions. This robustness is particularly advantageous in handling complex game situations and accommodating different camera angles. Additionally, our study introduces an innovative algorithm for automatic identification of the opposing team's right-side (opposite) hitter's current row (front or back) during gameplay, providing critical insights for tactical analysis. The successful demonstration of our single-camera system's feasibility and benefits makes high-level technical analysis accessible to volleyball enthusiasts of all skill levels and resource availability. Furthermore, the computational efficiency of our system allows for real-time deployment, enabling in-game strategy analysis and on-the-spot gameplan adjustments.

Text-image guided Diffusion Model for generating Deepfake celebrity interactions

  • paper_url: http://arxiv.org/abs/2309.14751
  • repo_url: None
  • paper_authors: Yunzhuo Chen, Nur Al Hasan Haldar, Naveed Akhtar, Ajmal Mian
  • for: The paper aims to explore the use of diffusion models for generating realistic and controllable Deepfake images, with a focus on creating forged content for celebrity interactions.
  • methods: The paper modifies a popular stable diffusion model to generate high-quality Deepfake images with text and image prompts, and adds the input anchor image’s latent at the beginning of inferencing to improve the generation of images with multiple persons. Additionally, the paper uses Dreambooth to enhance the realism of the fake images.
  • results: The paper demonstrates that the devised scheme can create fake visual content with alarming realism, such as images of meetings between powerful political figures, which could be used to spread rumors or misinformation.
    Abstract Deepfake images are fast becoming a serious concern due to their realism. Diffusion models have recently demonstrated highly realistic visual content generation, which makes them an excellent potential tool for Deepfake generation. To curb their exploitation for Deepfakes, it is imperative to first explore the extent to which diffusion models can be used to generate realistic content that is controllable with convenient prompts. This paper devises and explores a novel method in that regard. Our technique alters the popular stable diffusion model to generate a controllable high-quality Deepfake image with text and image prompts. In addition, the original stable model lacks severely in generating quality images that contain multiple persons. The modified diffusion model is able to address this problem, it add input anchor image's latent at the beginning of inferencing rather than Gaussian random latent as input. Hence, we focus on generating forged content for celebrity interactions, which may be used to spread rumors. We also apply Dreambooth to enhance the realism of our fake images. Dreambooth trains the pairing of center words and specific features to produce more refined and personalized output images. Our results show that with the devised scheme, it is possible to create fake visual content with alarming realism, such that the content can serve as believable evidence of meetings between powerful political figures.
    摘要 深度伪造(Deepfake)图像因其逼真程度正迅速成为一个严重的问题。扩散模型近来展现出高度逼真的视觉内容生成能力,使其成为生成深度伪造内容的潜在工具。为遏制其被滥用,首先需要探索扩散模型在便捷提示控制下生成逼真内容的能力范围。本文为此设计并探索了一种新方法:我们修改了流行的稳定扩散模型,使其能够在文本与图像提示下生成可控的高质量深度伪造图像。此外,原始的稳定扩散模型在生成包含多个人物的高质量图像方面存在明显不足;改进后的模型在推理开始时加入输入锚定图像的潜变量,而非高斯随机潜变量,从而缓解了这一问题。因此,我们专注于生成名人互动的伪造内容,这类内容可能被用于散布谣言。我们还使用 Dreambooth 来增强伪造图像的真实感:Dreambooth 通过训练中心词与特定特征的配对,生成更精细、更个性化的输出图像。结果表明,利用所设计的方案可以创造出真实感惊人的伪造视觉内容,例如可被当作可信证据的政要会面图像。
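
A hedged sketch of initializing diffusion inference from the anchor image's latent instead of a pure Gaussian latent, which is the modification described above for multi-person generation. The latent shape and the blending rule are assumptions; the actual integration into a Stable-Diffusion sampler is omitted.

```python
import torch

def init_latent_from_anchor(anchor_latent, noise_strength=0.6, generator=None):
    """Start inference from the anchor image's latent mixed with noise (rather than
    pure Gaussian noise), so the layout of the multi-person anchor is preserved."""
    noise = torch.randn(anchor_latent.shape, generator=generator)
    return (1.0 - noise_strength) * anchor_latent + noise_strength * noise

# toy 4x64x64 latent as produced by a typical Stable-Diffusion-style VAE encoder (an assumption)
anchor_latent = torch.randn(1, 4, 64, 64)
z0 = init_latent_from_anchor(anchor_latent, noise_strength=0.6)
# z0 would then be handed to the sampler in place of the usual random initial latent
```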

SSPFusion: A Semantic Structure-Preserving Approach for Infrared and Visible Image Fusion

  • paper_url: http://arxiv.org/abs/2309.14745
  • repo_url: None
  • paper_authors: Qiao Yang, Yu Zhang, Jian Zhang, Zijing Zhao, Shunli Zhang, Jinqiao Wang, Junzhe Chen
  • for: 提高计算机视觉任务的性能,即提高图像融合后的对象检测和识别等任务的性能。
  • methods: 提出了一种 semantic structure-preserving 方法,即 SSPFusion,该方法包括 Structural Feature Extractor (SFE) 和 multi-scale Structure-Preserving Fusion (SPF) 两个模块。
  • results: 对三个标准数据集进行了实验,证明了 SSPFusion 方法可以生成高质量的融合图像,并且在对象检测和识别等计算机视觉任务中提高了性能。
    Abstract Most existing learning-based infrared and visible image fusion (IVIF) methods exhibit massive redundant information in the fusion images, i.e., yielding edge-blurring effect or unrecognizable for object detectors. To alleviate these issues, we propose a semantic structure-preserving approach for IVIF, namely SSPFusion. At first, we design a Structural Feature Extractor (SFE) to extract the structural features of infrared and visible images. Then, we introduce a multi-scale Structure-Preserving Fusion (SPF) module to fuse the structural features of infrared and visible images, while maintaining the consistency of semantic structures between the fusion and source images. Owing to these two effective modules, our method is able to generate high-quality fusion images from pairs of infrared and visible images, which can boost the performance of downstream computer-vision tasks. Experimental results on three benchmarks demonstrate that our method outperforms eight state-of-the-art image fusion methods in terms of both qualitative and quantitative evaluations. The code for our method, along with additional comparison results, will be made available at: https://github.com/QiaoYang-CV/SSPFUSION.
    摘要 大多数现有的基于学习的红外与可见光图像融合(IVIF)方法会在融合图像中产生大量冗余信息,导致边缘模糊或目标检测器难以识别。为缓解这些问题,我们提出了一种保持语义结构的 IVIF 方法,即 SSPFusion。首先,我们设计了结构特征提取器(SFE),用于提取红外与可见光图像的结构特征;然后,我们引入多尺度结构保持融合(SPF)模块,在融合两类结构特征的同时,保持融合图像与源图像之间语义结构的一致性。得益于这两个有效模块,我们的方法能够从红外-可见光图像对生成高质量的融合图像,从而提升下游计算机视觉任务的性能。在三个基准数据集上的实验表明,我们的方法在定性与定量评估上均优于八种最先进的图像融合方法。我们的代码及更多对比结果将公开于:https://github.com/QiaoYang-CV/SSPFUSION。

ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.14744
  • repo_url: None
  • paper_authors: Zizhang Wu, Zhuozheng Li, Zhi-Gang Fan, Yunzhe Wu, Xiaoquan Wang, Rui Tang, Jian Pu
  • for: 提高单目深度估计精度:利用左右图像对训练教师网络,并将其学到的三维几何感知知识迁移给单目学生网络。
  • methods: 提出了名为 ADU-Depth 的知识蒸馏框架,将训练充分的教师网络的知识传递给单目学生网络。在训练阶段,应用注意力自适应的特征蒸馏和焦点深度自适应的响应蒸馏,以便在不同预测难度下实现有效的知识迁移;同时显式建模深度估计的不确定性,在特征空间和结果空间中共同引导蒸馏。
  • results: 在真实深度估计数据集 KITTI 和 DrivingStereo 上进行了广泛的实验,并在具有挑战性的 KITTI 在线基准上排名第一。
    Abstract Monocular depth estimation is challenging due to its inherent ambiguity and ill-posed nature, yet it is quite important to many applications. While recent works achieve limited accuracy by designing increasingly complicated networks to extract features with limited spatial geometric cues from a single RGB image, we intend to introduce spatial cues by training a teacher network that leverages left-right image pairs as inputs and transferring the learned 3D geometry-aware knowledge to the monocular student network. Specifically, we present a novel knowledge distillation framework, named ADU-Depth, with the goal of leveraging the well-trained teacher network to guide the learning of the student network, thus boosting the precise depth estimation with the help of extra spatial scene information. To enable domain adaptation and ensure effective and smooth knowledge transfer from teacher to student, we apply both attention-adapted feature distillation and focal-depth-adapted response distillation in the training stage. In addition, we explicitly model the uncertainty of depth estimation to guide distillation in both feature space and result space to better produce 3D-aware knowledge from monocular observations and thus enhance the learning for hard-to-predict image regions. Our extensive experiments on the real depth estimation datasets KITTI and DrivingStereo demonstrate the effectiveness of the proposed method, which ranked 1st on the challenging KITTI online benchmark.
    摘要 单眼深度估计是一个挑战性的任务,因为它具有自然的歧义和不确定性,但它对许多应用程序非常重要。Recent works 通过设计越来越复杂的网络来提取具有有限空间几何指标的单眼照片,实现有限的精度。我们则是通过将左右两个照片作为输入,训练一个教师网络,以便传递到单眼学生网络中的3D几何意识。我们称之为ADU-Depth的知识传授框架。我们希望通过将教师网络的学习知识转移到学生网络中,以提高单眼深度估计的精度。为了实现领域适应和确保专业转移,我们在训练阶段使用了注意力适应的特征传授和聚焦深度适应的回应传授。此外,我们Explicitly 模型深度估计的不确定性,以导引传授在特征空间和结果空间中。我们的实验结果显示,我们的提案可以在真实的深度估计数据集KITTI和DrivingStereo上得到高效的性能,并在挑战性的KITTI online排名中排名第一。

Volumetric Semantically Consistent 3D Panoptic Mapping

  • paper_url: http://arxiv.org/abs/2309.14737
  • repo_url: https://github.com/y9miao/consistentpanopticslam
  • paper_authors: Yang Miao, Iro Armeni, Marc Pollefeys, Daniel Barath
  • for: 生成自适应、准确、高效的semantic 3D地图,用于无结构环境中的自主机器人。
  • methods: 基于 Voxel-TSDF 表示,提出了融合语义预测置信度、生成语义与实例一致的 3D 区域,以及基于图优化的语义标注与实例细化等新方法。
  • results: 在公共大规模数据集上达到了state-of-the-art精度水平,提高了许多常用的指标,并且指出了现有研究评价中的一个缺陷:使用真实轨迹而不是SLAM估计的轨迹作为输入,会导致评价结果与实际数据之间存在很大差距。
    Abstract We introduce an online 2D-to-3D semantic instance mapping algorithm aimed at generating comprehensive, accurate, and efficient semantic 3D maps suitable for autonomous agents in unstructured environments. The proposed approach is based on a Voxel-TSDF representation used in recent algorithms. It introduces novel ways of integrating semantic prediction confidence during mapping, producing semantic and instance-consistent 3D regions. Further improvements are achieved by graph optimization-based semantic labeling and instance refinement. The proposed method achieves accuracy superior to the state of the art on public large-scale datasets, improving on a number of widely used metrics. We also highlight a downfall in the evaluation of recent studies: using the ground truth trajectory as input instead of a SLAM-estimated one substantially affects the accuracy, creating a large gap between the reported results and the actual performance on real-world data.
    摘要 我们介绍一种在线2D-to-3D语义实例映射算法,旨在生成全面、准确和高效的语义3D地图,适用于无结构环境中的自主机器人。该方法基于最近的Voxel-TSDF表示方式,并提出了新的语义预测信任级别的集成方法,生成语义和实例一致的3D区域。此外,我们还提出了基于图优化的语义标签和实例细化方法,进一步提高方法的准确性。我们的方法在大规模公共数据集上实现了比STATE-OF-THE-ART更高的准确度,提高了一些广泛使用的指标。此外,我们还指出了评估最近研究的缺点:使用真实的轨迹作为输入而不是SLAM估计的轨迹,会导致评估结果与实际数据中的性能存在大差异。
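
A simplified per-voxel integration step combining TSDF fusion with accumulation of semantic prediction confidence, illustrating the kind of update a Voxel-TSDF panoptic mapping pipeline performs. Truncation distance, weights, and class count are assumptions; instance handling and the graph-optimization refinement are omitted.

```python
import numpy as np

class SemanticTSDFVoxel:
    """Running TSDF fusion plus per-class semantic confidence accumulation for one voxel.
    A simplified stand-in for the paper's integration scheme."""
    def __init__(self, num_classes=21):
        self.tsdf, self.weight = 0.0, 0.0
        self.class_conf = np.zeros(num_classes)

    def integrate(self, sdf, semantic_probs, obs_weight=1.0, trunc=0.05):
        d = np.clip(sdf / trunc, -1.0, 1.0)                        # truncated signed distance
        self.tsdf = (self.weight * self.tsdf + obs_weight * d) / (self.weight + obs_weight)
        self.weight += obs_weight
        self.class_conf += obs_weight * semantic_probs             # confidence-weighted label evidence

    def label(self):
        return int(np.argmax(self.class_conf))

v = SemanticTSDFVoxel(num_classes=5)
v.integrate(sdf=0.01, semantic_probs=np.array([0.1, 0.7, 0.1, 0.05, 0.05]))
v.integrate(sdf=0.02, semantic_probs=np.array([0.2, 0.6, 0.1, 0.05, 0.05]), obs_weight=0.5)
print(round(v.tsdf, 3), v.label())
```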

Explaining Deep Face Algorithms through Visualization: A Survey

  • paper_url: http://arxiv.org/abs/2309.14715
  • repo_url: None
  • paper_authors: Thrupthi Ann John, Vineeth N Balasubramanian, C. V. Jawahar
  • for: 本研究旨在 bridging the gap between deep face models 和 human understanding, by conducting a meta-analysis of explainability algorithms in the face domain.
  • methods: 本研究使用了一系列的普适可视化算法,并对各种面部模型进行计算可视化。
  • results: 研究发现了面部网络结构和层次结构的细节,以及可视化算法的设计考虑事项。此外,通过用户研究,确定了实用的可视化算法,以便对 AI 专业人员提供可读性的可视化工具。
    Abstract Although current deep models for face tasks surpass human performance on some benchmarks, we do not understand how they work. Thus, we cannot predict how it will react to novel inputs, resulting in catastrophic failures and unwanted biases in the algorithms. Explainable AI helps bridge the gap, but currently, there are very few visualization algorithms designed for faces. This work undertakes a first-of-its-kind meta-analysis of explainability algorithms in the face domain. We explore the nuances and caveats of adapting general-purpose visualization algorithms to the face domain, illustrated by computing visualizations on popular face models. We review existing face explainability works and reveal valuable insights into the structure and hierarchy of face networks. We also determine the design considerations for practical face visualizations accessible to AI practitioners by conducting a user study on the utility of various explainability algorithms.
    摘要 尽管当前的深度人脸模型在某些基准上已超越人类表现,但我们并不了解它们的工作机制,因此无法预测它们对新输入的反应,这会导致灾难性失败和算法中不期望的偏差。可解释人工智能有助于弥合这一差距,但目前针对人脸设计的可视化算法非常少。本工作对人脸领域的可解释性算法开展了首次元分析。我们探讨了将通用可视化算法适配到人脸领域时的细微差别与注意事项,并通过在流行的人脸模型上计算可视化加以说明。我们回顾了现有的人脸可解释性工作,揭示了关于人脸网络结构与层次的宝贵见解。最后,我们通过对各种可解释性算法实用性的用户研究,确定了便于 AI 从业者使用的实用人脸可视化的设计考虑。

Bootstrap Diffusion Model Curve Estimation for High Resolution Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2309.14709
  • repo_url: None
  • paper_authors: Jiancheng Huang, Yifan Liu, Shifeng Chen
  • for: 本研究旨在提出一种基于学习的低光照图像增强方法,以解决现有方法的两大问题:高分辨率图像的计算成本高,以及同时增强与去噪的效果不足。
  • methods: 本方法使用自助(bootstrap)扩散模型,学习曲线参数的分布,而非直接学习正常光照图像本身。具体来说,我们采用曲线估计的方式处理高分辨率图像,其中曲线参数由自助扩散模型估计;此外,我们在每次曲线调整后应用去噪模块,对每轮增强的中间结果进行去噪。
  • results: 我们在常用的基准数据集上进行了广泛的实验,证明 BDCE 在定性和定量上均达到了领先水平。
    Abstract Learning-based methods have attracted a lot of research attention and led to significant improvements in low-light image enhancement. However, most of them still suffer from two main problems: expensive computational cost in high resolution images and unsatisfactory performance in simultaneous enhancement and denoising. To address these problems, we propose BDCE, a bootstrap diffusion model that exploits the learning of the distribution of the curve parameters instead of the normal-light image itself. Specifically, we adopt the curve estimation method to handle the high-resolution images, where the curve parameters are estimated by our bootstrap diffusion model. In addition, a denoise module is applied in each iteration of curve adjustment to denoise the intermediate enhanced result of each iteration. We evaluate BDCE on commonly used benchmark datasets, and extensive experiments show that it achieves state-of-the-art qualitative and quantitative performance.
    摘要 基于学习的方法吸引了大量研究关注,并显著推动了低光照图像增强的进展。然而,大多数方法仍存在两个主要问题:高分辨率图像的计算成本高,以及同时增强与去噪的效果不足。为解决这些问题,我们提出 BDCE,一种自助扩散模型,它学习曲线参数的分布,而非正常光照图像本身。具体地,我们采用曲线估计方法处理高分辨率图像,曲线参数由自助扩散模型估计;同时,在每次曲线调整中引入去噪模块,对每轮增强的中间结果进行去噪。我们在常用的基准数据集上进行了评估,大量实验表明 BDCE 达到了最先进的定性和定量性能。
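
A toy version of iterative curve-based enhancement with per-iteration denoising. The quadratic curve is the standard Zero-DCE form, used here as a stand-in for the paper's curve estimation; in BDCE the curve-parameter maps would be sampled from the bootstrap diffusion model rather than set to constants, and the Gaussian blur below is only a placeholder denoiser.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_with_curves(img, alphas, denoise_sigma=0.5):
    """Iteratively apply pixel-wise quadratic enhancement curves,
    denoising the intermediate result after every adjustment."""
    out = img.copy()
    for alpha in alphas:                          # one curve-parameter map per iteration
        out = out + alpha * out * (1.0 - out)     # brightens dark pixels, keeps the [0,1] range
        out = gaussian_filter(out, sigma=(denoise_sigma, denoise_sigma, 0))  # stand-in denoiser
    return np.clip(out, 0.0, 1.0)

low_light = np.random.rand(128, 128, 3) * 0.2                    # toy dark image
alphas = [0.8 * np.ones_like(low_light) for _ in range(4)]       # BDCE would predict these maps
enhanced = enhance_with_curves(low_light, alphas)
print(low_light.mean(), enhanced.mean())                         # mean brightness increases
```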

Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer

  • paper_url: http://arxiv.org/abs/2309.14704
  • repo_url: None
  • paper_authors: Zhihao Zhang, Yiwei Chen, Weizhan Zhang, Caixia Yan, Qinghua Zheng, Qi Wang, Wangdu Chen
  • for: 这篇论文提出一种基于多模态融合 Transformer 的视窗预测方法,以解决现有基于轨迹的方法缺乏鲁棒性、且过度简化不同模态输入的信息构建与融合过程的问题。
  • methods: 该方法使用基于 Transformer 的网络提取每种模态内部的长程依赖关系,再挖掘模态内与模态间的关系,以捕捉视频内容和用户历史输入对未来视窗选择的共同影响。此外,该方法将未来的分块分为用户感兴趣与不感兴趣两类,并选择包含最多用户感兴趣分块的区域作为未来视窗。
  • results: 在两个广泛使用的 PVS-HM 和 Xu-Gaze 数据集上,MFTR 在平均预测精度和重叠率上均优于最先进方法,同时具有有竞争力的计算效率。
    Abstract Viewport prediction is a crucial aspect of tile-based 360 video streaming system. However, existing trajectory based methods lack of robustness, also oversimplify the process of information construction and fusion between different modality inputs, leading to the error accumulation problem. In this paper, we propose a tile classification based viewport prediction method with Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes transformer-based networks to extract the long-range dependencies within each modality, then mine intra- and inter-modality relations to capture the combined impact of user historical inputs and video contents on future viewport selection. In addition, MFTR categorizes future tiles into two categories: user interested or not, and selects future viewport as the region that contains most user interested tiles. Comparing with predicting head trajectories, choosing future viewport based on tile's binary classification results exhibits better robustness and interpretability. To evaluate our proposed MFTR, we conduct extensive experiments on two widely used PVS-HM and Xu-Gaze dataset. MFTR shows superior performance over state-of-the-art methods in terms of average prediction accuracy and overlap ratio, also presents competitive computation efficiency.
    摘要 视窗预测是基于分块的 360 度视频流系统中的关键环节。然而,现有基于轨迹的方法缺乏鲁棒性,并且过度简化了不同模态输入之间的信息构建与融合过程,导致误差累积问题。本文提出了一种基于多模态融合 Transformer 的分块分类视窗预测方法,称为 MFTR。具体来说,MFTR 利用基于 Transformer 的网络提取每个模态内部的长程依赖关系,再挖掘模态内与模态间的关系,以捕捉用户历史输入和视频内容对未来视窗选择的共同影响。此外,MFTR 将未来的分块分为用户感兴趣与不感兴趣两类,并选择包含最多用户感兴趣分块的区域作为未来视窗。与预测头部轨迹相比,基于分块二分类结果选择未来视窗具有更好的鲁棒性和可解释性。为评估 MFTR,我们在两个广泛使用的 PVS-HM 和 Xu-Gaze 数据集上进行了大量实验。MFTR 在平均预测精度和重叠率方面优于最先进方法,同时具有有竞争力的计算效率。

Structure Invariant Transformation for better Adversarial Transferability

  • paper_url: http://arxiv.org/abs/2309.14700
  • repo_url: https://github.com/xiaosen-wang/sit
  • paper_authors: Xiaosen Wang, Zeliang Zhang, Jianping Zhang
  • for: This paper aims to improve the effectiveness of black-box adversarial attacks on deep neural networks (DNNs) by proposing a novel input transformation-based attack called Structure Invariant Attack (SIA).
  • methods: The SIA attack applies random image transformations to each image block to create a diverse set of images for gradient calculation, improving the transferability of the attack compared to existing methods.
  • results: The proposed SIA attack exhibits better transferability than existing state-of-the-art (SOTA) input transformation-based attacks on both CNN-based and transformer-based models, as demonstrated through extensive experiments on the ImageNet dataset.
    Abstract Given the severe vulnerability of Deep Neural Networks (DNNs) against adversarial examples, there is an urgent need for an effective adversarial attack to identify the deficiencies of DNNs in security-sensitive applications. As one of the prevalent black-box adversarial attacks, the existing transfer-based attacks still cannot achieve comparable performance with the white-box attacks. Among these, input transformation based attacks have shown remarkable effectiveness in boosting transferability. In this work, we find that the existing input transformation based attacks transform the input image globally, resulting in limited diversity of the transformed images. We postulate that the more diverse transformed images result in better transferability. Thus, we investigate how to locally apply various transformations onto the input image to improve such diversity while preserving the structure of image. To this end, we propose a novel input transformation based attack, called Structure Invariant Attack (SIA), which applies a random image transformation onto each image block to craft a set of diverse images for gradient calculation. Extensive experiments on the standard ImageNet dataset demonstrate that SIA exhibits much better transferability than the existing SOTA input transformation based attacks on CNN-based and transformer-based models, showing its generality and superiority in boosting transferability. Code is available at https://github.com/xiaosen-wang/SIT.
    摘要 鉴于深度神经网络(DNN)对对抗样本的严重脆弱性,迫切需要有效的对抗攻击来识别 DNN 在安全敏感应用中的缺陷。作为主流的黑盒攻击之一,现有的基于迁移的攻击仍无法达到与白盒攻击相当的性能。其中,基于输入变换的攻击在提升迁移性方面表现突出,但现有方法只对输入图像做全局变换,导致变换图像的多样性受限。我们认为,更多样的变换图像能带来更好的迁移性,因此研究如何在保持图像结构的同时,对输入图像局部施加多种变换以提高多样性。为此,我们提出了一种新的基于输入变换的攻击,即结构不变攻击(SIA):对每个图像块随机应用一种图像变换,从而构造一组多样的图像用于梯度计算。在标准 ImageNet 数据集上的大量实验表明,SIA 在基于 CNN 和基于 Transformer 的模型上都比现有最先进的输入变换攻击具有更好的迁移性,显示了其在提升迁移性方面的通用性与优越性。代码可在 https://github.com/xiaosen-wang/SIT 获取。
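
A minimal sketch of a structure-invariant transformation: the image is split into blocks and each block receives an independent random transform, keeping the global layout intact. The transform pool and block count here are illustrative, not the paper's exact set; in the attack, gradients would be averaged over several such transformed copies.

```python
import random
import torch
import torchvision.transforms.functional as TF

def structure_invariant_transform(img, blocks=3):
    """Split a CHW image into a blocks x blocks grid and apply an independent
    random transformation to each block, preserving the overall structure."""
    _, h, w = img.shape
    out = img.clone()
    ops = [lambda x: x,
           TF.hflip,
           TF.vflip,
           lambda x: TF.adjust_brightness(x, 0.8 + 0.4 * random.random()),
           lambda x: x + 0.03 * torch.randn_like(x)]
    hs, ws = h // blocks, w // blocks
    for i in range(blocks):
        for j in range(blocks):
            ys, xs = i * hs, j * ws
            ye = h if i == blocks - 1 else ys + hs
            xe = w if j == blocks - 1 else xs + ws
            out[:, ys:ye, xs:xe] = random.choice(ops)(out[:, ys:ye, xs:xe])
    return out.clamp(0, 1)

img = torch.rand(3, 224, 224)
copies = torch.stack([structure_invariant_transform(img) for _ in range(5)])
```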

DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch

  • paper_url: http://arxiv.org/abs/2309.14685
  • repo_url: https://github.com/SS47816/DriveSceneGen
  • paper_authors: Shuo Sun, Zekai Gu, Tianchen Sun, Jiawei Sun, Chengran Yuan, Yuhang Han, Dongen Li, Marcelo H. Ang Jr
  • for: This paper is written for the development and validation of autonomous driving systems, specifically to address the lack of diverse and realistic traffic scenarios in large quantities.
  • methods: The paper introduces DriveSceneGen, a data-driven driving scenario generation method that learns from real-world driving datasets and generates entire dynamic driving scenarios from scratch.
  • results: The experimental results on 5,000 generated scenarios show that DriveSceneGen can generate novel driving scenarios with high fidelity and diversity, and is able to generate scenarios that align with real-world data distributions.
    Abstract Realistic and diverse traffic scenarios in large quantities are crucial for the development and validation of autonomous driving systems. However, owing to numerous difficulties in the data collection process and the reliance on intensive annotations, real-world datasets lack sufficient quantity and diversity to support the increasing demand for data. This work introduces DriveSceneGen, a data-driven driving scenario generation method that learns from the real-world driving dataset and generates entire dynamic driving scenarios from scratch. DriveSceneGen is able to generate novel driving scenarios that align with real-world data distributions with high fidelity and diversity. Experimental results on 5k generated scenarios highlight the generation quality, diversity, and scalability compared to real-world datasets. To the best of our knowledge, DriveSceneGen is the first method that generates novel driving scenarios involving both static map elements and dynamic traffic participants from scratch.
    摘要 现实生活中的交通情况具有广泛的多样性和复杂性,这些情况是自动驾驶系统的开发和验证中非常重要的。然而,由于数据收集过程中的多个问题和对于数据的依赖,现实世界的数据缺乏量和多样性,无法满足增长中的数据需求。本研究提出了 DriveSceneGen,一种基于数据的驾驶情况生成方法,从真实世界的驾驶数据中学习,并从零生成完整的动态驾驶情况。DriveSceneGen 能够生成高匹配度和多样性的驾驶情况,并且可以与实际世界数据的分布相互适应。实验结果显示,DriveSceneGen 能够生成5000个高品质和多样性的驾驶情况,并且可以与实际世界数据的分布相互适应。根据我们所知,DriveSceneGen 是首个从零生成包含静止地图元素和动态交通参与者的驾驶情况的方法。

DONNAv2 – Lightweight Neural Architecture Search for Vision tasks

  • paper_url: http://arxiv.org/abs/2309.14670
  • repo_url: None
  • paper_authors: Sweta Priyadarshi, Tianyu Jiang, Hsin-Pai Cheng, Sendil Krishna, Viswanath Ganapathy, Chirag Patel
  • for: 这项研究旨在为视觉任务提出新一代计算高效的神经架构蒸馏方法(DONNAv2),以便将视觉模型部署到边缘设备上。
  • methods: 该方法省去了训练精度预测器这一计算量大的阶段,改用组成网络的各个块的损失作为搜索阶段中采样模型的性能代理指标。
  • results: 在较大的数据集上,DONNAv2 将 DONNA 的计算成本降低了 10 倍,并在 Samsung Galaxy S10 移动平台上完成了硬件在环实验;此外,DONNAv2 还利用块级知识蒸馏过滤器移除推理成本高的块,以提升 NAS 搜索空间的质量。
    Abstract With the growing demand for vision applications and deployment across edge devices, the development of hardware-friendly architectures that maintain performance during device deployment becomes crucial. Neural architecture search (NAS) techniques explore various approaches to discover efficient architectures for diverse learning tasks in a computationally efficient manner. In this paper, we present the next-generation neural architecture design for computationally efficient neural architecture distillation - DONNAv2 . Conventional NAS algorithms rely on a computationally extensive stage where an accuracy predictor is learned to estimate model performance within search space. This building of accuracy predictors helps them predict the performance of models that are not being finetuned. Here, we have developed an elegant approach to eliminate building the accuracy predictor and extend DONNA to a computationally efficient setting. The loss metric of individual blocks forming the network serves as the surrogate performance measure for the sampled models in the NAS search stage. To validate the performance of DONNAv2 we have performed extensive experiments involving a range of diverse vision tasks including classification, object detection, image denoising, super-resolution, and panoptic perception network (YOLOP). The hardware-in-the-loop experiments were carried out using the Samsung Galaxy S10 mobile platform. Notably, DONNAv2 reduces the computational cost of DONNA by 10x for the larger datasets. Furthermore, to improve the quality of NAS search space, DONNAv2 leverages a block knowledge distillation filter to remove blocks with high inference costs.
    摘要 随着视觉应用需求的增长和边缘设备部署的普及,开发在部署时仍能保持性能的硬件友好架构变得至关重要。神经架构搜索(NAS)技术以计算高效的方式探索多种方法,为不同的学习任务发现高效架构。本文提出了面向计算高效神经架构蒸馏的新一代神经架构设计方法 DONNAv2。传统的 NAS 算法依赖一个计算量很大的阶段来学习精度预测器,以估计搜索空间内未微调模型的性能。我们提出了一种简洁的方法来省去精度预测器的构建,将 DONNA 扩展到计算高效的设置:用组成网络的各个块的损失作为搜索阶段中采样模型的性能代理。为验证 DONNAv2 的性能,我们在图像分类、目标检测、图像去噪、超分辨率和全景感知网络(YOLOP)等多种视觉任务上进行了广泛实验,硬件在环实验使用 Samsung Galaxy S10 移动平台。值得注意的是,在较大的数据集上,DONNAv2 将 DONNA 的计算成本降低了 10 倍;此外,为提升 NAS 搜索空间的质量,DONNAv2 利用块级知识蒸馏过滤器移除推理成本高的块。

ZiCo-BC: A Bias Corrected Zero-Shot NAS for Vision Tasks

  • paper_url: http://arxiv.org/abs/2309.14666
  • repo_url: None
  • paper_authors: Kartikeya Bhardwaj, Hsin-Pai Cheng, Sweta Priyadarshi, Zhuojin Li
  • for: 这个论文主要是为了研究零shot neural architecture search(NAS)方法的可靠性和广泛应用性。
  • methods: 该论文使用了现有的零shot proxy ZiCo,并对其进行了修正以解决偏见问题。
  • results: 该论文通过对多种视觉任务(图像分类、目标检测和语义分割)的广泛实验,证明了该方法可以在 Samsung Galaxy S10 设备上搜索到精度更高、延迟显著更低的架构。
    Abstract Zero-Shot Neural Architecture Search (NAS) approaches propose novel training-free metrics called zero-shot proxies to substantially reduce the search time compared to the traditional training-based NAS. Despite the success on image classification, the effectiveness of zero-shot proxies is rarely evaluated on complex vision tasks such as semantic segmentation and object detection. Moreover, existing zero-shot proxies are shown to be biased towards certain model characteristics which restricts their broad applicability. In this paper, we empirically study the bias of state-of-the-art (SOTA) zero-shot proxy ZiCo across multiple vision tasks and observe that ZiCo is biased towards thinner and deeper networks, leading to sub-optimal architectures. To solve the problem, we propose a novel bias correction on ZiCo, called ZiCo-BC. Our extensive experiments across various vision tasks (image classification, object detection and semantic segmentation) show that our approach can successfully search for architectures with higher accuracy and significantly lower latency on Samsung Galaxy S10 devices.
    摘要 零样本神经架构搜索(NAS)方法提出了无需训练的新型度量(零样本代理),与传统基于训练的 NAS 相比可大幅缩短搜索时间。尽管在图像分类上取得了成功,零样本代理在语义分割和目标检测等复杂视觉任务上的有效性却鲜有评估。此外,现有的零样本代理被证明偏向某些模型特性,限制了其广泛适用性。在本文中,我们在多种视觉任务上对最先进(SOTA)的零样本代理 ZiCo 的偏差进行了实证研究,发现 ZiCo 偏向更窄、更深的网络,导致搜索到次优架构。为解决这一问题,我们提出了对 ZiCo 的偏差校正方法 ZiCo-BC。我们在图像分类、目标检测和语义分割等多种视觉任务上的大量实验表明,该方法能够成功搜索到在 Samsung Galaxy S10 设备上精度更高且延迟显著更低的架构。

Probabilistic 3D Multi-Object Cooperative Tracking for Autonomous Driving via Differentiable Multi-Sensor Kalman Filter

  • paper_url: http://arxiv.org/abs/2309.14655
  • repo_url: None
  • paper_authors: Hsu-kuang Chiu, Chien-Yi Wang, Min-Hung Chen, Stephen F. Smith
  • for: This paper aims to improve the reliability and accuracy of multi-object cooperative tracking in autonomous driving by leveraging vehicle-to-vehicle (V2V) communication and a differentiable multi-sensor Kalman Filter.
  • methods: The proposed method uses a differentiable multi-sensor Kalman Filter to estimate the measurement uncertainty of each detection from different connected autonomous vehicles (CAVs), which enables better utilization of the theoretical optimality property of Kalman Filter-based tracking algorithms.
  • results: The experimental results show that the proposed algorithm improves the tracking accuracy by 17% with only 0.037x communication costs compared to the state-of-the-art method in V2V4Real.
    Abstract Current state-of-the-art autonomous driving vehicles mainly rely on each individual sensor system to perform perception tasks. Such a framework's reliability could be limited by occlusion or sensor failure. To address this issue, more recent research proposes using vehicle-to-vehicle (V2V) communication to share perception information with others. However, most relevant works focus only on cooperative detection and leave cooperative tracking an underexplored research field. A few recent datasets, such as V2V4Real, provide 3D multi-object cooperative tracking benchmarks. However, their proposed methods mainly use cooperative detection results as input to a standard single-sensor Kalman Filter-based tracking algorithm. In their approach, the measurement uncertainty of different sensors from different connected autonomous vehicles (CAVs) may not be properly estimated to utilize the theoretical optimality property of Kalman Filter-based tracking algorithms. In this paper, we propose a novel 3D multi-object cooperative tracking algorithm for autonomous driving via a differentiable multi-sensor Kalman Filter. Our algorithm learns to estimate measurement uncertainty for each detection that can better utilize the theoretical property of Kalman Filter-based tracking methods. The experiment results show that our algorithm improves the tracking accuracy by 17% with only 0.037x communication costs compared with the state-of-the-art method in V2V4Real.
    摘要 目前最先进的自动驾驶车辆主要依赖各自独立的传感器系统完成感知任务，这种框架的可靠性可能受到遮挡或传感器故障的限制。为解决这一问题，近期研究提出利用车对车（V2V）通信与其他车辆共享感知信息。然而，大多数相关工作仅关注协同检测，协同追踪仍是探索不足的研究领域。少数近期数据集（如 V2V4Real）提供了 3D 多目标协同追踪基准，但其方法主要将协同检测结果输入标准的单传感器卡尔曼滤波追踪算法，来自不同联网自动驾驶车辆（CAVs）的不同传感器的测量不确定性可能未被恰当估计，因而无法充分利用基于卡尔曼滤波器的追踪算法的理论最优性。本文提出了一种基于可微多传感器卡尔曼滤波器的新型 3D 多目标协同追踪算法，该算法学习估计每个检测的测量不确定性，从而更好地利用卡尔曼滤波追踪方法的理论性质。实验结果表明，与 V2V4Real 中的最先进方法相比，我们的算法将追踪精度提高了 17%，而通信成本仅为其 0.037 倍。
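
For readers unfamiliar with how a Kalman filter can be made "differentiable" in this setting, the sketch below shows the general pattern: a small network predicts a per-detection measurement covariance R, which then enters a standard Kalman update, so gradients flow end-to-end. The module names, the 2-D measurement model and the sequential fusion over vehicles are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch: learned per-detection measurement uncertainty inside a
# differentiable Kalman measurement update, fusing detections from several CAVs.
import torch
import torch.nn as nn

class UncertaintyNet(nn.Module):
    """Predicts a diagonal measurement covariance R from detection features."""
    def __init__(self, feat_dim: int, meas_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(),
                                 nn.Linear(32, meas_dim))
    def forward(self, feat):                          # feat: [feat_dim]
        return torch.diag(torch.exp(self.net(feat)))  # positive-definite R

def kalman_update(x, P, z, H, R):
    """Standard (and fully differentiable) Kalman measurement update."""
    S = H @ P @ H.T + R                               # innovation covariance
    K = P @ H.T @ torch.linalg.inv(S)                 # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (torch.eye(P.shape[0]) - K @ H) @ P
    return x_new, P_new

def fuse_detections(x, P, detections, unet: UncertaintyNet, H):
    """Sequentially fuse detections of one object coming from several vehicles."""
    for z, feat in detections:                        # z: [2] position, feat: features
        R = unet(feat)                                # learned per-detection uncertainty
        x, P = kalman_update(x, P, z, H, R)
    return x, P
```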

Free Discontinuity Design: With an Application to the Economic Effects of Internet Shutdowns

  • paper_url: http://arxiv.org/abs/2309.14630
  • repo_url: https://github.com/davidvandijcke/fdd
  • paper_authors: Florian Gunsilius, David Van Dijcke
  • for: 这篇论文研究处理分配（treatment assignment）中的阈值如何在结果中产生不连续性，并以互联网关停的经济影响为应用实例。
  • methods: 论文提出了一种非参数方法，通过将回归面分割为平滑部分与不连续部分来估计这些不连续性。该方法基于 Mumford-Shah 泛函的凸松弛，并证明了其可识别性与收敛性。
  • results: 应用该方法，作者估计印度的一次互联网关停使经济活动减少超过 50%，远超以往估计，为全球数字经济中此类关停的真实代价提供了新的见解。
    Abstract Thresholds in treatment assignments can produce discontinuities in outcomes, revealing causal insights. In many contexts, like geographic settings, these thresholds are unknown and multivariate. We propose a non-parametric method to estimate the resulting discontinuities by segmenting the regression surface into smooth and discontinuous parts. This estimator uses a convex relaxation of the Mumford-Shah functional, for which we establish identification and convergence. Using our method, we estimate that an internet shutdown in India resulted in a reduction of economic activity by over 50%, greatly surpassing previous estimates and shedding new light on the true cost of such shutdowns for digital economies globally.
    摘要 处理分配中的阈值可能在结果中产生不连续性，从而揭示因果关系。在许多情境（如地理设定）中，这些阈值是未知的且是多变量的。我们提出一种非参数方法，通过将回归面分割为平滑部分与不连续部分来估计由此产生的不连续性。该估计量采用 Mumford-Shah 泛函的凸松弛，我们为其建立了可识别性与收敛性。应用该方法，我们估计印度的一次互联网关停使经济活动减少超过 50%，远超以往估计，为全球数字经济中此类关停的真实代价提供了新的见解。
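
As a toy illustration of estimating an unknown discontinuity by splitting a regression function into smooth and jump parts, the one-dimensional sketch below fits a total-variation-penalised curve (a convex surrogate in the same spirit as the Mumford-Shah relaxation used in the paper) and reads off the largest estimated jump. The IRLS solver, penalty weight and synthetic data are illustrative assumptions; the actual method is multivariate and uses a different relaxation.

```python
# Toy 1-D "free discontinuity" fit: piecewise-smooth regression via a convex
# total-variation penalty, then locate the largest jump in the fitted curve.
import numpy as np

def tv_fit(y, lam=1.0, iters=50, eps=1e-6):
    """Minimise ||f - y||^2 + lam * sum|f[i+1] - f[i]| with a simple IRLS scheme."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)                  # (n-1, n) first-difference matrix
    f = y.copy()
    for _ in range(iters):
        w = 1.0 / (2.0 * np.sqrt(np.diff(f) ** 2 + eps))   # IRLS weights
        A = np.eye(n) + lam * D.T @ (w[:, None] * D)
        f = np.linalg.solve(A, y)
    return f

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
truth = np.sin(2 * np.pi * x) - 0.8 * (x > 0.6)     # smooth trend plus a jump at 0.6
y = truth + 0.1 * rng.standard_normal(x.size)

f_hat = tv_fit(y)
jumps = np.diff(f_hat)
k = int(np.argmax(np.abs(jumps)))
print(f"estimated break at x={x[k]:.2f}, size={jumps[k]:+.2f}")  # break near x = 0.60
```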

Text-to-Image Generation for Abstract Concepts

  • paper_url: http://arxiv.org/abs/2309.14623
  • repo_url: None
  • paper_authors: Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, Dongmei Zhang
  • for: 这项研究旨在提升大规模模型（如自然语言处理和计算机视觉模型）表达抽象概念的能力。
  • methods: 该研究基于三层艺术创作理论，先将抽象概念澄清为带有明确定义的意图，再通过大语言模型（LLM）将其转换为语义相关的物理对象，最后从 LLM 提取的形式模式集中检索依赖于概念的表现形式，并整合三方面信息生成文生图模型的提示词。
  • results: 人类评估和新设计的概念分数（concept score）指标均表明，该框架能够生成充分表达抽象概念的图像。
    Abstract Recent years have witnessed the substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort, which results from their intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and tend to fail to visualize abstract concepts. Inspired by the three-layer artwork theory that identifies critical factors, intent, object and form during artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantic-related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects will be integrated to generate prompts for T2I models via LLM. Evaluation results from human assessments and our newly designed metric concept score demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts.
    摘要 近年来，大规模模型在自然语言处理和计算机视觉等多个领域取得了长足进展，促进了具体概念的表达。与通常直接对应物理对象的具体概念不同，用自然语言表达抽象概念需要相当大的努力，这源于其复杂的语义与内涵。一种替代方案是利用图像传递丰富的视觉信息作为补充。然而，现有的文生图（T2I）模型主要在具体物理对象上训练，往往难以将抽象概念可视化。受识别艺术创作中意图、对象与形式三个关键因素的三层艺术作品理论启发，我们提出了面向抽象概念的文生图生成框架（TIAC）。该框架先将抽象概念澄清为带有详细定义的明确意图以避免歧义；随后由大语言模型（LLM）将其转换为语义相关的物理对象，并从 LLM 提取的形式模式集中检索依赖于概念的表现形式；最后整合这三方面信息，经由 LLM 生成供 T2I 模型使用的提示词。人类评估结果以及我们新设计的概念分数（concept score）指标表明，该框架能够生成充分表达抽象概念的图像。
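
The pipeline boils down to three LLM calls followed by prompt assembly; the sketch below shows that control flow with a stand-in `ask_llm` callable. All prompt templates and the dummy stub are placeholders, not the authors' prompts or form-pattern set.

```python
# Schematic intent -> objects -> form -> T2I prompt pipeline.
from typing import Callable, List

def build_t2i_prompt(concept: str, ask_llm: Callable[[str], str],
                     form_patterns: List[str]) -> str:
    # 1) Intent: pin the abstract concept down to an unambiguous definition.
    intent = ask_llm(f"Give a one-sentence, unambiguous definition of '{concept}'.")
    # 2) Object: map the intent to concrete, drawable physical objects.
    objects = ask_llm(f"List 3 concrete physical objects that visually convey: {intent}")
    # 3) Form: pick a concept-dependent visual form from a pre-extracted pattern set.
    form = ask_llm("Choose the single most fitting style from this list "
                   f"for '{concept}': {', '.join(form_patterns)}")
    # Assemble the final prompt for any off-the-shelf text-to-image model.
    return f"{objects}, arranged to express {concept} ({intent}), in {form} style"

# Usage with a dummy LLM stub (replace with a real LLM client):
if __name__ == "__main__":
    stub = lambda q: ("freedom: absence of constraint" if "definition" in q
                      else "open sky, broken chain, soaring bird" if "objects" in q
                      else "minimalist line art")
    print(build_t2i_prompt("freedom", stub,
                           ["oil painting", "minimalist line art", "photorealism"]))
```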

NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space

  • paper_url: http://arxiv.org/abs/2309.14616
  • repo_url: https://github.com/Jiawei-Yao0812/NDCScene
  • paper_authors: Jiawei Yao, Chuming Li, Keqiang Sun, Yingjie Cai, Hao Li, Wanli Ouyang, Hongsheng Li
  • for: 提高单目 3D 语义场景补全（SSC）的精度与效率，解决当前最先进方法存在的特征歧义、姿态歧义和计算不均衡等问题。
  • methods: 提出了一种归一化设备坐标场景补全网络（NDC-Scene），通过反卷积操作逐步恢复深度维度，将 2D 特征图直接扩展到归一化设备坐标（NDC）空间而非世界空间；同时设计了深度自适应双解码器，对 2D 和 3D 特征图同时进行上采样与融合，进一步提升整体性能。
  • results: 实验证明，所提方法在 SemanticKITTI 和 NYUv2 数据集上均优于当前最先进方法。代码可在 https://github.com/Jiawei-Yao0812/NDCScene 获取。
    Abstract Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. In this paper, we identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Computation Imbalance in the 3D convolution across different depth levels. To address these problems, we devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to a Normalized Device Coordinates (NDC) space, rather than to the world space directly, through progressive restoration of the dimension of depth with deconvolution operations. Experiment results demonstrate that transferring the majority of computation from the target 3D space to the proposed normalized device coordinates space benefits monocular SSC tasks. Additionally, we design a Depth-Adaptive Dual Decoder to simultaneously upsample and fuse the 2D and 3D feature maps, further improving overall performance. Our extensive experiments confirm that the proposed method consistently outperforms state-of-the-art methods on both outdoor SemanticKITTI and indoor NYUv2 datasets. Our code are available at https://github.com/Jiawei-Yao0812/NDCScene.
    摘要 单目 3D 语义场景补全（SSC）近年来受到广泛关注，因为它无需 3D 输入，即可从单张图像预测复杂的语义和几何形状。本文指出了当前最先进方法中的若干关键问题，包括 2D 特征沿射线投影到 3D 空间时的特征歧义、3D 卷积的姿态歧义，以及 3D 卷积在不同深度层级上的计算不均衡。为解决这些问题，我们设计了一种新型的归一化设备坐标场景补全网络（NDC-Scene），通过反卷积操作逐步恢复深度维度，将 2D 特征图直接扩展到归一化设备坐标（NDC）空间而非世界空间。实验结果表明，将大部分计算从目标 3D 空间转移到所提出的归一化设备坐标空间有利于单目 SSC 任务。此外，我们还设计了深度自适应双解码器，对 2D 和 3D 特征图同时进行上采样与融合，进一步提升整体性能。大量实验证实，所提方法在室外 SemanticKITTI 与室内 NYUv2 数据集上均一致优于当前最先进方法。代码见 https://github.com/Jiawei-Yao0812/NDCScene。
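
The core operation, lifting a 2D feature map into an NDC volume by progressively restoring the depth dimension with deconvolutions, can be sketched in a few lines of PyTorch. Channel counts, the number of stages and kernel sizes below are illustrative assumptions, not the NDC-Scene configuration.

```python
# Minimal sketch: seed a 3D volume from image-plane features and progressively
# double its depth resolution with 3D deconvolutions, keeping H and W fixed.
import torch
import torch.nn as nn

class DepthRestoration(nn.Module):
    def __init__(self, channels: int = 64, stages: int = 4):
        super().__init__()
        # Each stage doubles the depth resolution (1 -> 2 -> 4 -> 8 -> 16).
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose3d(channels, channels, kernel_size=(2, 3, 3),
                                   stride=(2, 1, 1), padding=(0, 1, 1)),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
            ) for _ in range(stages)
        ])

    def forward(self, feat2d: torch.Tensor) -> torch.Tensor:
        # feat2d: [B, C, H, W] image-plane features -> [B, C, 1, H, W] volume seed.
        vol = feat2d.unsqueeze(2)
        for stage in self.stages:
            vol = stage(vol)          # depth doubles; H and W stay fixed
        return vol                    # [B, C, 2**stages, H, W] in NDC space

x = torch.randn(1, 64, 24, 32)
print(DepthRestoration()(x).shape)    # torch.Size([1, 64, 16, 24, 32])
```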

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline

  • paper_url: http://arxiv.org/abs/2309.14611
  • repo_url: https://github.com/wangxiao5791509/VisEvent_SOT_Benchmark
  • paper_authors: Xiao Wang, Shiao Wang, Chuanming Tang, Lin Zhu, Bo Jiang, Yonghong Tian, Jin Tang
  • for: 这篇论文旨在提出一种基于仿生事件相机的高速低延迟视觉跟踪方法，在训练阶段利用多模态/多视图信息进行知识迁移，使测试阶段仅使用事件信号即可实现高性能、低延迟的跟踪。
  • methods: 论文先以 RGB 帧和事件流同时作为输入训练一个基于 Transformer 的教师多模态跟踪网络，再提出一种层次知识蒸馏策略，包含成对相似性、特征表示和响应图三个层面的蒸馏，用以引导仅以事件流为输入的学生 Transformer 网络的学习。
  • results: 论文在三个低分辨率数据集（FE240hz、VisEvent、COESOT）上进行了广泛实验，并提出了首个大规模高分辨率数据集 EventVOT，包含 1141 个视频，覆盖行人、车辆、无人机、乒乓球等多种类别。结果验证了所提方法的有效性。
    Abstract Tracking using bio-inspired event cameras has drawn more and more attention in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The first category needs more cost for inference and the second one may be easily influenced by noisy events or sparse spatial resolution. In this paper, we propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer, enabling us to achieve high-speed and low-latency visual tracking during testing by using only event signals. Specifically, a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then, we design a new hierarchical knowledge distillation strategy which includes pairwise similarity, feature representation, and response maps-based knowledge distillation to guide the learning of the student Transformer network. Moreover, since existing event-based tracking datasets are all low-resolution ($346 \times 260$), we propose the first large-scale high-resolution ($1280 \times 720$) dataset named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc. Extensive experiments on both low-resolution (FE240hz, VisEvent, COESOT), and our newly proposed high-resolution EventVOT dataset fully validated the effectiveness of our proposed method. The dataset, evaluation toolkit, and source code are available on \url{https://github.com/Event-AHU/EventVOT_Benchmark}
    摘要 近年来，基于仿生事件相机的跟踪吸引了越来越多的关注。现有工作要么同时利用对齐的 RGB 与事件数据实现精确跟踪，要么直接学习基于事件的跟踪器。前者推理成本更高，后者则容易受到噪声事件或稀疏空间分辨率的影响。本文提出了一种新的层次知识蒸馏框架，在训练阶段充分利用多模态/多视图信息促进知识迁移，使测试阶段仅使用事件信号即可实现高速、低延迟的视觉跟踪。具体而言，首先以 RGB 帧和事件流同时作为输入训练一个基于 Transformer 的教师多模态跟踪框架；随后设计了一种新的层次知识蒸馏策略，包含成对相似性、特征表示和响应图三个层面的蒸馏，以引导学生 Transformer 网络的学习。此外，由于现有事件跟踪数据集均为低分辨率（$346 \times 260$），我们提出了首个大规模高分辨率（$1280 \times 720$）数据集 EventVOT，包含 1141 个视频，覆盖行人、车辆、无人机、乒乓球等多种类别。在低分辨率数据集（FE240hz、VisEvent、COESOT）以及我们新提出的高分辨率 EventVOT 数据集上的大量实验充分验证了所提方法的有效性。数据集、评测工具箱和源代码可在 https://github.com/Event-AHU/EventVOT_Benchmark 获取。
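
The three distillation terms named in the abstract (pairwise similarity, feature representation, response map) can be written as simple losses between a frozen teacher and an event-only student; the sketch below shows one plausible formulation. Loss weights, feature shapes and the use of a softened KL term for the response maps are illustrative assumptions.

```python
# Sketch of hierarchical knowledge-distillation losses for an event-only student.
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(f_s, f_t):
    """Match the batch-wise similarity structure; f_*: [B, D] pooled features."""
    def sim(f):
        f = F.normalize(f, dim=1)
        return f @ f.t()                       # [B, B] cosine-similarity matrix
    return F.mse_loss(sim(f_s), sim(f_t))

def feature_loss(f_s, f_t):
    """Directly align (already same-sized) student and teacher features."""
    return F.mse_loss(f_s, f_t)

def response_map_loss(r_s, r_t, tau: float = 2.0):
    """Soften and match tracking response maps; r_*: [B, H, W]."""
    p_s = F.log_softmax(r_s.flatten(1) / tau, dim=1)
    p_t = F.softmax(r_t.flatten(1) / tau, dim=1)
    return F.kl_div(p_s, p_t, reduction="batchmean") * tau * tau

def distillation_loss(student_out, teacher_out, w=(1.0, 1.0, 1.0)):
    f_s, r_s = student_out                     # (pooled feature, response map)
    f_t, r_t = (t.detach() for t in teacher_out)
    return (w[0] * pairwise_similarity_loss(f_s, f_t)
            + w[1] * feature_loss(f_s, f_t)
            + w[2] * response_map_loss(r_s, r_t))
```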

Progressive Text-to-3D Generation for Automatic 3D Prototyping

  • paper_url: http://arxiv.org/abs/2309.14600
  • repo_url: https://github.com/Texaser/MTN
  • paper_authors: Han Yi, Zhedong Zheng, Xiangyu Xu, Tat-seng Chua
  • for: 本研究旨在提出一种基于自然语言描述的文本-三维生成方法,以减少手动设计三维模型的工作量,并提供一种更自然的用户交互方式。
  • methods: 我们提出了一种多尺度三平面网络（MTN）和一种新的渐进式学习策略，以解决有效恢复细粒度细节和高效优化大尺寸 3D 输出这两个难题。MTN 由四组从低分辨率到高分辨率过渡的三平面组成，低分辨率三平面可作为高分辨率三平面的初始形状，降低优化难度。渐进式学习策略则明确要求网络将注意力从简单的粗粒度模式转移到困难的细粒度模式。
  • results: 实验表明，所提方法优于现有方法；即使对于大多数现有方法难以生成可行形状的最具挑战性的描述，我们的方法也能稳定地生成高质量的 3D 模型。
    Abstract Text-to-3D generation is to craft a 3D object according to a natural language description. This can significantly reduce the workload for manually designing 3D models and provide a more natural way of interaction for users. However, this problem remains challenging in recovering the fine-grained details effectively and optimizing a large-size 3D output efficiently. Inspired by the success of progressive learning, we propose a Multi-Scale Triplane Network (MTN) and a new progressive learning strategy. As the name implies, the Multi-Scale Triplane Network consists of four triplanes transitioning from low to high resolution. The low-resolution triplane could serve as an initial shape for the high-resolution ones, easing the optimization difficulty. To further enable the fine-grained details, we also introduce the progressive learning strategy, which explicitly demands the network to shift its focus of attention from simple coarse-grained patterns to difficult fine-grained patterns. Our experiment verifies that the proposed method performs favorably against existing methods. For even the most challenging descriptions, where most existing methods struggle to produce a viable shape, our proposed method consistently delivers. We aspire for our work to pave the way for automatic 3D prototyping via natural language descriptions.
    摘要 文本到 3D 生成旨在根据自然语言描述构建 3D 物体，这能显著减少人工设计 3D 模型的工作量，并为用户提供更自然的交互方式。然而，如何有效恢复细粒度细节并高效优化大尺寸 3D 输出仍然具有挑战性。受渐进式学习成功的启发，我们提出了多尺度三平面网络（MTN）和一种新的渐进式学习策略。顾名思义，多尺度三平面网络由四组从低分辨率到高分辨率过渡的三平面组成，低分辨率三平面可作为高分辨率三平面的初始形状，从而降低优化难度。为进一步呈现细粒度细节，我们还引入了渐进式学习策略，明确要求网络将注意力从简单的粗粒度模式转移到困难的细粒度模式。实验验证所提方法优于现有方法；即使对于大多数现有方法难以生成可行形状的最具挑战性的描述，我们的方法也能稳定地给出结果。我们希望这项工作能为基于自然语言描述的自动 3D 原型设计铺平道路。
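
To make the multi-scale triplane idea concrete, the sketch below stores three axis-aligned feature planes at each of several resolutions, samples them bilinearly at query points and sums them coarse-to-fine; a coarse-only pass can serve as the initial shape before finer levels are enabled. Resolutions, channel count and the `active_levels` schedule are illustrative assumptions, not the exact MTN design.

```python
# Sketch of a multi-scale triplane feature field with a coarse-to-fine schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTriplane(nn.Module):
    def __init__(self, channels=16, resolutions=(32, 64, 128, 256)):
        super().__init__()
        # For each scale: 3 planes (xy, xz, yz), each of shape [C, R, R].
        self.planes = nn.ParameterList([
            nn.Parameter(0.01 * torch.randn(3, channels, r, r)) for r in resolutions
        ])

    def forward(self, pts, active_levels=None):
        """pts: [N, 3] coordinates in [-1, 1]; returns [N, C] features."""
        active = len(self.planes) if active_levels is None else active_levels
        xy, xz, yz = pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]
        feat = 0.0
        for level in list(self.planes)[:active]:        # coarse planes first
            for plane, coords in zip(level, (xy, xz, yz)):
                grid = coords.view(1, -1, 1, 2)         # [1, N, 1, 2]
                sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                        align_corners=True)   # [1, C, N, 1]
                feat = feat + sampled.squeeze(0).squeeze(-1).t()  # [N, C]
        return feat

# Progressive schedule: optimise with only the coarse level first, then enable
# finer levels, e.g. tri(pts, active_levels=1) -> tri(pts, active_levels=2) -> tri(pts).
tri = MultiScaleTriplane()
print(tri(torch.rand(5, 3) * 2 - 1, active_levels=2).shape)   # torch.Size([5, 16])
```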

Applications of Sequential Learning for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2309.14591
  • repo_url: None
  • paper_authors: Sohaib Naim, Brian Caffo, Haris I Sair, Craig K Jones
  • for: 这项研究旨在开发一个可对小批量医学影像数据进行持续训练的神经网络训练框架，并建立在没有留出验证集或测试集时评估训练效果的启发式方法。
  • methods: 我们提出了一种回顾性的顺序学习方法，随时间推移在医学影像的小批量数据上持续训练并更新 PyTorch 卷积神经网络（CNN）。我们利用公开的 Medical MNIST 和 NIH 胸部 X 光数据集，处理顺序学习中的过拟合、灾难性遗忘和概念漂移等问题。我们首先比较有无短暂预训练两种顺序训练方式，随后比较两种独特的训练与验证数据招募方法，以在避免过拟合的前提下估计信息的完整提取。
  • results: 在第一个实验中，两种方式都成功达到约 95% 的准确率阈值，但短暂的预训练步骤使顺序训练的准确率在更少的步数内趋于稳定。第二个实验比较的两种方法中，第二种表现更好，能更早越过约 90% 的准确率阈值。第三个实验表明带有预训练步骤的方式略有优势，使 CNN 能比无预训练时更早越过约 60% 的准确率阈值。
    Abstract Purpose: The aim of this work is to develop a neural network training framework for continual training of small amounts of medical imaging data and create heuristics to assess training in the absence of a hold-out validation or test set. Materials and Methods: We formulated a retrospective sequential learning approach that would train and consistently update a model on mini-batches of medical images over time. We address problems that impede sequential learning such as overfitting, catastrophic forgetting, and concept drift through PyTorch convolutional neural networks (CNN) and publicly available Medical MNIST and NIH Chest X-Ray imaging datasets. We begin by comparing two methods for a sequentially trained CNN with and without base pre-training. We then transition to two methods of unique training and validation data recruitment to estimate full information extraction without overfitting. Lastly, we consider an example of real-life data that shows how our approach would see mainstream research implementation. Results: For the first experiment, both approaches successfully reach a ~95% accuracy threshold, although the short pre-training step enables sequential accuracy to plateau in fewer steps. The second experiment comparing two methods showed better performance with the second method which crosses the ~90% accuracy threshold much sooner. The final experiment showed a slight advantage with a pre-training step that allows the CNN to cross ~60% threshold much sooner than without pre-training. Conclusion: We have displayed sequential learning as a serviceable multi-classification technique statistically comparable to traditional CNNs that can acquire data in small increments feasible for clinically realistic scenarios.
    摘要 目的：本工作旨在开发一个可对小批量医学影像数据进行持续训练的神经网络训练框架，并建立在缺少留出验证集或测试集时评估训练效果的启发式方法。材料和方法：我们提出了一种回顾性的顺序学习方法，随时间推移在医学影像的小批量数据上持续训练并更新模型。我们使用 PyTorch 卷积神经网络（CNN）以及公开可用的 Medical MNIST 和 NIH 胸部 X 光数据集，处理顺序学习中的过拟合、灾难性遗忘和概念漂移等问题。我们首先比较有无基础预训练的两种顺序训练 CNN 的方式，随后比较两种独特的训练与验证数据招募方法，以在避免过拟合的前提下估计信息的完整提取。最后，我们以一个真实数据示例说明该方法如何用于主流研究场景。结果：在第一个实验中，两种方式都达到约 95% 的准确率阈值，但短暂的预训练步骤使顺序训练的准确率在更少的步数内趋于稳定。第二个实验中，第二种方法表现更好，能更早越过约 90% 的准确率阈值。第三个实验表明带有预训练步骤的方式略有优势，使 CNN 能比无预训练时更早越过约 60% 的准确率阈值。结论：我们展示了顺序学习是一种可用的多分类技术，其统计性能与传统 CNN 相当，并能够以适合临床现实场景的小增量方式获取数据。
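
The training pattern described above (updating a CNN on small, time-ordered batches and judging progress without a fixed hold-out set) can be sketched as a simple "test-then-train" loop. The tiny model, optimiser settings and the running-accuracy heuristic below are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of sequential training on time-ordered mini-batches with a
# "test-then-train" running accuracy in place of a fixed hold-out set.
import torch
import torch.nn as nn

def sequential_train(model: nn.Module, batch_stream, lr=1e-4, steps_per_batch=3):
    """batch_stream yields (images, labels) mini-batches in arrival order."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    running_acc = []
    for x, y in batch_stream:
        # Evaluate on the new batch before learning from it.
        model.eval()
        with torch.no_grad():
            acc = (model(x).argmax(1) == y).float().mean().item()
        running_acc.append(acc)
        # Then take a few small gradient steps on this batch only.
        model.train()
        for _ in range(steps_per_batch):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return running_acc   # heuristic learning curve without a hold-out set

# Example with a tiny CNN and a synthetic stream of 28x28 grayscale batches.
tiny = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6))
stream = ((torch.randn(16, 1, 28, 28), torch.randint(0, 6, (16,))) for _ in range(5))
print(sequential_train(tiny, stream)[-3:])
```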

DifAttack: Query-Efficient Black-Box Attack via Disentangled Feature Space

  • paper_url: http://arxiv.org/abs/2309.14585
  • repo_url: https://github.com/csjunjun/difattack
  • paper_authors: Liu Jun, Zhou Jiantao, Zeng Jiandian, Jinyu Tian
  • for: This paper aims to propose an efficient score-based black-box adversarial attack method with a high Attack Success Rate (ASR) and good generalizability.
  • methods: The proposed method, called DifAttack, uses a disentangled feature space to differentiate an image’s latent feature into an adversarial feature and a visual feature. The method trains an autoencoder for the disentanglement using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods.
  • results: The proposed method achieves significant improvements in ASR and query efficiency simultaneously, especially in the targeted attack and open-set scenarios. The code will be available at https://github.com/csjunjun/DifAttack.git soon.
    Abstract This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (ASR) and good generalizability. We design a novel attack method based on a Disentangled Feature space, called DifAttack, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack firstly disentangles an image's latent feature into an adversarial feature and a visual feature, where the former dominates the adversarial capability of an image, while the latter largely determines its visual appearance. We train an autoencoder for the disentanglement by using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods. Eventually, DifAttack iteratively optimizes the adversarial feature according to the query feedback from the victim model until a successful AE is generated, while keeping the visual feature unaltered. In addition, due to the avoidance of using surrogate models' gradient information when optimizing AEs for black-box models, our proposed DifAttack inherently possesses better attack capability in the open-set scenario, where the training dataset of the victim model is unknown. Extensive experimental results demonstrate that our method achieves significant improvements in ASR and query efficiency simultaneously, especially in the targeted attack and open-set scenarios. The code will be available at https://github.com/csjunjun/DifAttack.git soon.
    摘要 本文研究具有高攻击成功率（ASR）和良好泛化能力的高效基于得分的黑盒对抗攻击。我们设计了一种基于解耦特征空间的新型攻击方法 DifAttack，它与现有在整个特征空间上操作的方法有显著不同。具体而言，DifAttack 首先将图像的潜在特征解耦为对抗特征和视觉特征：前者主导图像的对抗能力，后者在很大程度上决定其视觉外观。我们利用干净图像及其由可用替代模型经白盒攻击方法生成的对抗样本（AE）对来训练一个自编码器以实现该解耦。最终，DifAttack 根据受害模型的查询反馈迭代优化对抗特征，直至生成成功的对抗样本，同时保持视觉特征不变。此外，由于在为黑盒模型优化对抗样本时避免了使用替代模型的梯度信息，所提出的 DifAttack 在受害模型训练数据未知的开集场景中天然具有更强的攻击能力。大量实验结果表明，我们的方法同时在 ASR 和查询效率上取得显著提升，尤其是在定向攻击和开集场景中。代码将很快发布于 https://github.com/csjunjun/DifAttack.git。
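
The attack reduces to (1) splitting an image's latent code into adversarial and visual halves and (2) optimising only the adversarial half from score feedback; the sketch below illustrates that loop with a simple random search. The autoencoder architecture (assuming 32x32 RGB inputs), the search strategy and the untargeted objective are illustrative simplifications, not the DifAttack implementation.

```python
# Sketch: disentangled-latent autoencoder plus a score-based black-box search
# that perturbs only the adversarial half of the latent code.
import torch
import torch.nn as nn

class DisentangledAE(nn.Module):
    def __init__(self, dim=256):                      # assumes 32x32 RGB inputs
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                                 nn.Flatten(), nn.Linear(64 * 8 * 8, dim))
        self.dec = nn.Sequential(nn.Linear(dim, 64 * 8 * 8), nn.ReLU(),
                                 nn.Unflatten(1, (64, 8, 8)),
                                 nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())
        self.dim = dim

    def split(self, x):
        z = self.enc(x)
        return z[:, : self.dim // 2], z[:, self.dim // 2 :]   # adversarial, visual

    def rebuild(self, z_adv, z_vis):
        return self.dec(torch.cat([z_adv, z_vis], dim=1))

@torch.no_grad()
def score_based_attack(ae, victim_scores, x, true_label, queries=500, sigma=0.1):
    """victim_scores(img) -> class scores; only score feedback is used."""
    z_adv, z_vis = ae.split(x)
    best = z_adv.clone()
    best_margin = victim_scores(ae.rebuild(best, z_vis))[0, true_label]
    for _ in range(queries):                          # simple random search
        cand = best + sigma * torch.randn_like(best)
        margin = victim_scores(ae.rebuild(cand, z_vis))[0, true_label]
        if margin < best_margin:                      # lower true-class score is better
            best, best_margin = cand, margin
    return ae.rebuild(best, z_vis)                    # candidate adversarial example
```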