cs.CV - 2023-10-23

Towards contrast-agnostic soft segmentation of the spinal cord

  • paper_url: http://arxiv.org/abs/2310.15402
  • repo_url: https://github.com/sct-pipeline/contrast-agnostic-softseg-spinalcord
  • paper_authors: Sandrine Bédard, Naga Karthik Enamundram, Charidimos Tsagkas, Emanuele Pravatà, Cristina Granziera, Andrew Smith, Kenneth Arnold Weber II, Julien Cohen-Adad
  • for: The paper proposes a deep learning-based spinal cord segmentation method that reduces the variability of spinal cord cross-sectional area (CSA) across MRI contrasts, improving the precision and reproducibility of spinal cord segmentation.
  • methods: Binary segmentations from multiple contrasts are averaged into participant-wise soft segmentations (soft ground truth, GT); these soft GT and a regression-based loss function are then used to train a UNet model.
  • results: Compared with previous methods, the approach reduces CSA variability ($p < 0.05$, Wilcoxon signed-rank test) and generalizes better across unseen datasets, vendors, contrasts, and pathologies (compression, lesions).
    Abstract Spinal cord segmentation is clinically relevant and is notably used to compute spinal cord cross-sectional area (CSA) for the diagnosis and monitoring of cord compression or neurodegenerative diseases such as multiple sclerosis. While several semi and automatic methods exist, one key limitation remains: the segmentation depends on the MRI contrast, resulting in different CSA across contrasts. This is partly due to the varying appearance of the boundary between the spinal cord and the cerebrospinal fluid that depends on the sequence and acquisition parameters. This contrast-sensitive CSA adds variability in multi-center studies where protocols can vary, reducing the sensitivity to detect subtle atrophies. Moreover, existing methods enhance the CSA variability by training one model per contrast, while also producing binary masks that do not account for partial volume effects. In this work, we present a deep learning-based method that produces soft segmentations of the spinal cord. Using the Spine Generic Public Database of healthy participants ($\text{n}=267$; $\text{contrasts}=6$), we first generated participant-wise soft ground truth (GT) by averaging the binary segmentations across all 6 contrasts. These soft GT, along with a regression-based loss function, were then used to train a UNet model for spinal cord segmentation. We evaluated our model against state-of-the-art methods and performed ablation studies involving different GT mask types, loss functions, and contrast-specific models. Our results show that using the soft average segmentations along with a regression loss function reduces CSA variability ($p < 0.05$, Wilcoxon signed-rank test). The proposed spinal cord segmentation model generalizes better than the state-of-the-art contrast-specific methods amongst unseen datasets, vendors, contrasts, and pathologies (compression, lesions), while accounting for partial volume effects.
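
The core recipe (averaging per-contrast binary masks into a soft ground truth and regressing it) can be sketched as follows. This is a minimal illustration with assumed tensor shapes, and plain MSE stands in for the paper's regression loss; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def make_soft_gt(binary_masks: torch.Tensor) -> torch.Tensor:
    """Average per-contrast binary masks (C, H, W) into one soft mask in [0, 1]."""
    return binary_masks.float().mean(dim=0)

def regression_loss(pred_logits: torch.Tensor, soft_gt: torch.Tensor) -> torch.Tensor:
    """Regress the soft mask directly instead of classifying each voxel;
    MSE is used here only as a placeholder regression loss."""
    return F.mse_loss(torch.sigmoid(pred_logits), soft_gt)

# Example: six binary contrast-wise segmentations of one participant.
masks = torch.rand(6, 64, 64) > 0.5
soft_gt = make_soft_gt(masks)       # values in {0, 1/6, ..., 1}
pred = torch.randn(64, 64)          # stand-in for UNet output logits
loss = regression_loss(pred, soft_gt)
```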

Remote Heart Rate Monitoring in Smart Environments from Videos with Self-supervised Pre-training

  • paper_url: http://arxiv.org/abs/2310.15388
  • repo_url: None
  • paper_authors: Divij Gupta, Ali Etemad
  • for: The paper proposes a remote heart rate estimation method based on self-supervised learning, reducing the need for large amounts of labeled data while improving performance.
  • methods: Self-supervised contrastive learning is used with 3 spatial and 3 temporal augmentations to pre-train an encoder, whose late-intermediate embeddings are then used for remote PPG and heart rate estimation.
  • results: Experiments on two public datasets show that the method outperforms several related works and supervised learning baselines, approaching the state of the art.
    Abstract Recent advances in deep learning have made it increasingly feasible to estimate heart rate remotely in smart environments by analyzing videos. However, a notable limitation of deep learning methods is their heavy reliance on extensive sets of labeled data for effective training. To address this issue, self-supervised learning has emerged as a promising avenue. Building on this, we introduce a solution that utilizes self-supervised contrastive learning for the estimation of remote photoplethysmography (PPG) and heart rate monitoring, thereby reducing the dependence on labeled data and enhancing performance. We propose the use of 3 spatial and 3 temporal augmentations for training an encoder through a contrastive framework, followed by utilizing the late-intermediate embeddings of the encoder for remote PPG and heart rate estimation. Our experiments on two publicly available datasets showcase the improvement of our proposed approach over several related works as well as supervised learning baselines, as our results approach the state-of-the-art. We also perform thorough experiments to showcase the effects of using different design choices such as the video representation learning method, the augmentations used in the pre-training stage, and others. We also demonstrate the robustness of our proposed method over the supervised learning approaches on reduced amounts of labeled data.
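
As a rough illustration of the contrastive pre-training stage, the sketch below applies a generic NT-Xent loss to two augmented views of the same clip. The embedding size, batch size, and temperature are assumptions; the paper's specific spatial and temporal augmentations and encoder are not reproduced here.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Generic NT-Xent contrastive loss between two augmented views, each (B, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2B, D)
    sim = z @ z.t() / temperature                             # pairwise cosine similarity
    n = z1.shape[0]
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])    # index of each positive
    return F.cross_entropy(sim, targets)

# Two views of the same video clip, e.g. after spatial/temporal augmentations.
emb_view1, emb_view2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent(emb_view1, emb_view2)
```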

Deep Integrated Explanations

  • paper_url: http://arxiv.org/abs/2310.15368
  • repo_url: https://github.com/dix-cikm23/dix
  • paper_authors: Oren Barkan, Yehonatan Elisha, Jonathan Weill, Yuval Asher, Amit Eshel, Noam Koenigstein
  • for: The paper presents Deep Integrated Explanations (DIX), a universal method for explaining vision models.
  • methods: DIX generates explanation maps by integrating the model's intermediate representations with their corresponding gradients.
  • results: Extensive evaluations across diverse tasks, datasets, and model configurations show that DIX produces faithful and accurate explanation maps, surpassing current state-of-the-art methods.
    Abstract This paper presents Deep Integrated Explanations (DIX) - a universal method for explaining vision models. DIX generates explanation maps by integrating information from the intermediate representations of the model, coupled with their corresponding gradients. Through an extensive array of both objective and subjective evaluations spanning diverse tasks, datasets, and model configurations, we showcase the efficacy of DIX in generating faithful and accurate explanation maps, while surpassing current state-of-the-art methods.
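
A simplified, Grad-CAM-like stand-in for building an explanation map from intermediate representations and their gradients is sketched below; DIX's actual integration scheme differs, and the choice of layer4 of a ResNet-18 is purely illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
acts, grads = {}, {}
layer = model.layer4
layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)
model(x)[0].max().backward()          # gradient of the top logit w.r.t. the layer

# Element-wise activation x gradient, summed over channels, as a crude saliency map.
attr = (acts["a"] * grads["g"]).sum(dim=1, keepdim=True).clamp(min=0)
heatmap = F.interpolate(attr, size=(224, 224), mode="bilinear", align_corners=False)
```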

DeepVox and SAVE-CT: a contrast- and dose-independent 3D deep learning approach for thoracic aorta segmentation and aneurysm prediction using computed tomography scans

  • paper_url: http://arxiv.org/abs/2310.15328
  • repo_url: None
  • paper_authors: Matheus del-Valle, Lariza Laura de Oliveira, Henrique Cursino Vieira, Henrique Min Ho Lee, Lucas Lembrança Pinheiro, Maria Fernanda Portugal, Newton Shydeo Brandão Miyoshi, Nelson Wolosker
  • for: The study proposes a reliable and automated approach for thoracic aortic aneurysm (TAA) detection, aiming to reduce TAA mortality and the time specialists spend on assessment.
  • methods: A novel segmentation model (DeepVox) and a TAA classification model (SAVE-CT) automatically segment the aorta and classify TAA, handling a variable number of slices and both thoracic and thoracoabdominal sequences.
  • results: DeepVox and SAVE-CT together enable fully automated, contrast- and dose-independent TAA detection, improving assessment efficiency and accuracy; this may help decrease TAA mortality and reduce the radiologists' workload.
    Abstract Thoracic aortic aneurysm (TAA) is a fatal disease which potentially leads to dissection or rupture through progressive enlargement of the aorta. It is usually asymptomatic and screening recommendations are limited. The gold-standard evaluation is performed by computed tomography angiography (CTA) with a time-consuming assessment by radiologists. Scans acquired for other indications could help with this screening; however, if acquired without contrast enhancement or with a low-dose protocol, they can make the clinical evaluation difficult, besides increasing the number of scans for the radiologists. In this study, 587 unique CT scans were selected, including control and TAA patients, acquired with low- and standard-dose protocols, with or without contrast enhancement. A novel segmentation model, DeepVox, exhibited Dice score coefficients of 0.932 and 0.897 for the development and test sets, respectively, with faster training speed in comparison to models reported in the literature. The novel TAA classification model, SAVE-CT, presented accuracies of 0.930 and 0.922 for the development and test sets, respectively, using only the binary segmentation mask from DeepVox as input, without hand-engineered features. Together, these two models are a potential approach for TAA screening, as they can handle a variable number of slices as input and both thoracic and thoracoabdominal sequences in a fully automated, contrast- and dose-independent evaluation. This may help decrease TAA mortality and prioritize the evaluation queue of patients for radiologists.
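
For reference, the reported Dice score coefficient can be computed from a predicted and a ground-truth binary mask as in the generic snippet below; this is standard metric code, not the authors' implementation.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient between two binary volumes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

pred = np.random.rand(64, 64, 64) > 0.5   # stand-in predicted segmentation
gt = np.random.rand(64, 64, 64) > 0.5     # stand-in ground-truth segmentation
print(f"Dice: {dice_score(pred, gt):.3f}")
```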

Videoprompter: an ensemble of foundational models for zero-shot video understanding

  • paper_url: http://arxiv.org/abs/2310.15324
  • repo_url: None
  • paper_authors: Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah
  • for: The paper proposes a framework for improving zero-shot performance in video understanding.
  • methods: The framework combines pre-trained discriminative vision-language models (VLMs) with pre-trained generative video-to-text and text-to-text models, with two key modifications: language-guided visual feature enhancement, where a video-to-text model converts the query video into its descriptive form, and video-specific prompting techniques that generate more meaningful descriptions to enrich the class label representations.
  • results: The framework is evaluated in three zero-shot settings: 1) video action recognition, 2) video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. It yields consistent improvements across multiple benchmarks and with various VLMs; the code will be made publicly available.
    Abstract Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query visual features are not considered. In this paper, we propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models. We introduce two key modifications to the standard zero-shot setting. First, we propose language-guided visual feature enhancement and employ a video-to-text model to convert the query video to its descriptive form. The resulting descriptions contain vital visual cues of the query video, such as what objects are present and their spatio-temporal interactions. These descriptive cues provide additional semantic knowledge to VLMs to enhance their zero-shot performance. Second, we propose video-specific prompts to LLMs to generate more meaningful descriptions to enrich class label representations. Specifically, we introduce prompt techniques to create a Tree Hierarchy of Categories for class names, offering a higher-level action context for additional visual cues. We demonstrate the effectiveness of our approach in video understanding across three different zero-shot settings: 1) video action recognition, 2) video-to-text and text-to-video retrieval, and 3) time-sensitive video tasks. Consistent improvements across multiple benchmarks and with various VLMs demonstrate the effectiveness of our proposed framework. Our code will be made publicly available.
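
The zero-shot classification step, scoring a query video embedding against class labels enriched with several LLM-generated descriptions, can be sketched as follows. The embedding dimension, the number of descriptions per class, and averaging them before scoring are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(video_emb: torch.Tensor, class_text_embs: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one video embedding (D,) and per-class text
    embeddings (num_classes, num_descriptions, D); descriptions are averaged."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1).mean(dim=1)   # pool enriched descriptions
    t = F.normalize(t, dim=-1)
    return t @ v                                            # (num_classes,)

video_emb = torch.randn(512)
class_text_embs = torch.randn(10, 4, 512)  # 10 classes, 4 LLM-generated descriptions each
pred_class = zero_shot_scores(video_emb, class_text_embs).argmax().item()
```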

SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

  • paper_url: http://arxiv.org/abs/2310.15308
  • repo_url: None
  • paper_authors: Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, Hadi Pouransari
  • for: This paper aims to create a unified model that combines the strengths of two pre-trained vision foundation models (VFMs): the Segment Anything Model (SAM) and Contrastive Language-Image Pre-training (CLIP).
  • methods: A simple recipe integrating multi-task learning, continual learning techniques, and teacher-student distillation is used to efficiently merge SAM and CLIP into a single backbone.
  • results: The resulting model, SAM-CLIP, learns richer visual representations equipped with both localization and semantic features, improving performance on several head-probing tasks and zero-shot semantic segmentation, with new state-of-the-art results on 5 benchmarks.
    Abstract The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that assimilates their expertise. Our proposed method integrates multi-task learning, continual learning techniques, and teacher-student distillation. This strategy entails significantly less computational cost compared to traditional multi-task training from scratch. Additionally, it only demands a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that amalgamates the strengths of SAM and CLIP into a single backbone, making it apt for edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. SAM-CLIP obtains improved performance on several head probing tasks when compared with SAM and CLIP. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
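
A hedged sketch of the merging idea, distilling both teachers into a single student backbone, is given below. The specific losses (cosine for CLIP embeddings, L2 for SAM dense features), their weights, and the tensor shapes are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def merge_distillation_loss(student_clip_emb, student_sam_feat,
                            teacher_clip_emb, teacher_sam_feat,
                            w_clip: float = 1.0, w_sam: float = 1.0) -> torch.Tensor:
    """Distill both teachers into one student backbone: a cosine loss against the
    CLIP image embedding and an L2 loss against SAM's dense features."""
    clip_loss = 1 - F.cosine_similarity(student_clip_emb, teacher_clip_emb, dim=-1).mean()
    sam_loss = F.mse_loss(student_sam_feat, teacher_sam_feat)
    return w_clip * clip_loss + w_sam * sam_loss

# Stand-in tensors: (B, D) global embeddings and (B, C, H, W) dense features.
loss = merge_distillation_loss(torch.randn(4, 512), torch.randn(4, 256, 64, 64),
                               torch.randn(4, 512), torch.randn(4, 256, 64, 64))
```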

SyncFusion: Multimodal Onset-synchronized Video-to-Audio Foley Synthesis

  • paper_url: http://arxiv.org/abs/2310.15247
  • repo_url: None
  • paper_authors: Marco Comunità, Riccardo F. Gramaccioni, Emilian Postolache, Emanuele Rodolà, Danilo Comminiello, Joshua D. Reiss
  • for: The work aims to make sound design more efficient and automated while keeping the process creative and flexible.
  • methods: A system extracts the onsets of repetitive actions from a video; these onsets, together with audio or text embeddings, condition a diffusion model trained to generate a new synchronized sound-effect track.
  • results: Experiments show that the method accurately detects the onsets of repetitive actions in a video and generates sound-effect tracks that match it. Moreover, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sound design process.
    Abstract Sound design involves creatively selecting, recording, and editing sound effects for various media like cinema, video games, and virtual/augmented reality. One of the most time-consuming steps when designing sound is synchronizing audio with video. In some cases, environmental recordings from video shoots are available, which can aid in the process. However, in video games and animations, no reference audio exists, requiring manual annotation of event timings from the video. We propose a system to extract repetitive action onsets from a video, which are then used, in conjunction with audio or textual embeddings, to condition a diffusion model trained to generate a new synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sonification process. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
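
A crude stand-in for the onset-extraction step is sketched below: rising-edge detection on a per-frame motion-energy signal. The threshold, the 24 fps assumption, and the synthetic energy signal are placeholders for the paper's learned onset detector.

```python
import numpy as np

def action_onsets(motion_energy: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return frame indices where a per-frame motion-energy signal crosses the
    threshold from below (a simple proxy for repetitive-action onsets)."""
    above = motion_energy > threshold
    rising = np.logical_and(above[1:], ~above[:-1])
    return np.flatnonzero(rising) + 1

# Example: frame-wise motion energy, e.g. mean absolute frame difference.
energy = np.abs(np.sin(np.linspace(0, 20, 240)))   # 240 frames of a synthetic signal
onset_frames = action_onsets(energy, threshold=0.9)
onset_times = onset_frames / 24.0                  # assuming 24 fps
```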

RoboDepth: Robust Out-of-Distribution Depth Estimation under Corruptions

  • paper_url: http://arxiv.org/abs/2310.15171
  • repo_url: https://github.com/ldkong1205/robodepth
  • paper_authors: Lingdong Kong, Shaoyuan Xie, Hanjiang Hu, Lai Xing Ng, Benoit R. Cottereau, Wei Tsang Ooi
  • for: The paper addresses out-of-distribution (OoD) situations in depth estimation from monocular images, which is crucial for real-world visual perception systems such as autonomous driving.
  • methods: The authors introduce RoboDepth, a comprehensive robustness test suite with 18 corruptions across three categories: weather and lighting conditions, sensor failures and movement, and data processing anomalies. They benchmark 42 depth estimation models across indoor and outdoor scenes to assess their resilience to these corruptions.
  • results: Many leading depth estimation models are susceptible to typical corruptions, highlighting the need for more robust models. The authors discuss design considerations for crafting such models, including pre-training, augmentation, modality, model capacity, and learning paradigms.
    Abstract Depth estimation from monocular images is pivotal for real-world visual perception systems. While current learning-based depth estimation models train and test on meticulously curated data, they often overlook out-of-distribution (OoD) situations. Yet, in practical settings -- especially safety-critical ones like autonomous driving -- common corruptions can arise. Addressing this oversight, we introduce a comprehensive robustness test suite, RoboDepth, encompassing 18 corruptions spanning three categories: i) weather and lighting conditions; ii) sensor failures and movement; and iii) data processing anomalies. We subsequently benchmark 42 depth estimation models across indoor and outdoor scenes to assess their resilience to these corruptions. Our findings underscore that, in the absence of a dedicated robustness evaluation framework, many leading depth estimation models may be susceptible to typical corruptions. We delve into design considerations for crafting more robust depth estimation models, touching upon pre-training, augmentation, modality, model capacity, and learning paradigms. We anticipate our benchmark will establish a foundational platform for advancing robust OoD depth estimation.
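
The benchmarking protocol boils down to corrupting inputs and re-measuring standard depth metrics; the sketch below shows one illustrative corruption (additive Gaussian noise) and the absolute-relative error. The corruption choice, severity, and stand-in data are assumptions, not the suite's exact definitions.

```python
import numpy as np

def gaussian_noise(img: np.ndarray, severity: float = 0.1) -> np.ndarray:
    """One illustrative corruption: additive Gaussian noise on an image in [0, 1]."""
    return np.clip(img + np.random.normal(0, severity, img.shape), 0, 1)

def abs_rel_error(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    """Standard absolute-relative depth error over valid pixels."""
    valid = gt_depth > 0
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))

clean = np.random.rand(192, 640, 3)
corrupted = gaussian_noise(clean, severity=0.2)
gt = np.random.rand(192, 640) * 80.0                  # stand-in ground-truth depth (m)
pred = gt * (1 + 0.05 * np.random.randn(192, 640))    # stand-in prediction
print(f"abs rel: {abs_rel_error(pred, gt):.3f}")      # compare clean vs. corrupted runs
```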

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

  • paper_url: http://arxiv.org/abs/2310.15169
  • repo_url: https://github.com/arthur-qiu/longercrafter
  • paper_authors: Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu
  • for: The work extends text-driven video generation to produce longer, high-quality videos at inference time.
  • methods: The authors first analyze the impact of initial noise in video diffusion models. Based on this observation, they propose a tuning-free and time-efficient method that enhances the generative capabilities of pre-trained video diffusion models while preserving content consistency: a sequence of noises is rescheduled for long-range correlation and temporal attention over them is performed with a window-based function. A motion injection method additionally supports generation conditioned on multiple text prompts.
  • results: Compared with the previous best-performing method, which adds about 255% extra time cost, the proposed method incurs only a negligible time cost of approximately 17%. Generated video samples are available at http://haonanqiu.com/projects/FreeNoise.html.
    Abstract With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. Furthermore, these models only support single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Then building upon the observation of noise, we propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them by window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that compared with the previous best-performing method which brought about 255% extra time cost, our method incurs only negligible time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
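
The noise-rescheduling idea, reusing a base window of noise frames in shuffled order so that distant frames stay correlated, can be sketched as follows. This is a simplified reading of the method; the window size and latent shape are assumptions.

```python
import torch

def reschedule_noise(base_noise: torch.Tensor, total_frames: int) -> torch.Tensor:
    """Build noise for a long clip by reusing the base window's frames in shuffled
    order, so distant frames share noise statistics (long-range correlation)."""
    window = base_noise.shape[0]
    frames = [base_noise]
    while sum(f.shape[0] for f in frames) < total_frames:
        perm = torch.randperm(window)
        frames.append(base_noise[perm])
    return torch.cat(frames, dim=0)[:total_frames]

base = torch.randn(16, 4, 40, 64)          # (frames, channels, H, W) latent noise
long_noise = reschedule_noise(base, total_frames=64)
```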

Ghost on the Shell: An Expressive Representation of General 3D Shapes

  • paper_url: http://arxiv.org/abs/2310.15168
  • repo_url: https://github.com/lzzcd001/GShell
  • paper_authors: Zhen Liu, Yao Feng, Yuliang Xiu, Weiyang Liu, Liam Paull, Michael J. Black, Bernhard Schölkopf
  • for: The paper aims to model accurate 3D surface geometry for a wide range of objects, a prerequisite for creating photorealistic virtual worlds.
  • methods: Open surfaces are parameterized by defining a manifold signed distance field on watertight templates, yielding a grid-based and differentiable representation for both watertight and non-watertight meshes of arbitrary topology.
  • results: Experiments show state-of-the-art performance on non-watertight mesh reconstruction and generation, effective handling of watertight meshes, and fast rendering with realistic materials and lighting.
    Abstract The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, any 3D representation must be able to model solid, watertight, shapes as well as thin, open, surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parameterize open surfaces by defining a manifold signed distance field on watertight templates. With this parameterization, we further develop a grid-based and differentiable representation that parameterizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.

SAM-Med3D

  • paper_url: http://arxiv.org/abs/2310.15161
  • repo_url: https://github.com/uni-medical/sam-med3d
  • paper_authors: Haoyu Wang, Sizheng Guo, Jin Ye, Zhongying Deng, Junlong Cheng, Tianbin Li, Jianpin Chen, Yanzhou Su, Ziyan Huang, Yiqing Shen, Bin Fu, Shaoting Zhang, Junjun He, Yu Qiao
  • for: This paper aims to improve the performance of the Segment Anything Model (SAM) in 3D volumetric medical image segmentation.
  • methods: The authors reformulate SAM as a 3D architecture and train it on a comprehensively processed large-scale volumetric medical dataset, then provide a comprehensive evaluation of its performance.
  • results: SAM-Med3D excels at capturing 3D spatial information and exhibits competitive performance with significantly fewer prompt points than the top-performing fine-tuned SAM in the medical domain, showing enhanced efficiency and broad segmentation capabilities for 3D volumetric medical images.
    Abstract Although the Segment Anything Model (SAM) has demonstrated impressive performance in 2D natural image segmentation, its application to 3D volumetric medical images reveals significant shortcomings, namely suboptimal performance and unstable prediction, necessitating an excessive number of prompt points to attain the desired outcomes. These issues can hardly be addressed by fine-tuning SAM on medical data because the original 2D structure of SAM neglects 3D spatial information. In this paper, we introduce SAM-Med3D, the most comprehensive study to modify SAM for 3D medical images. Our approach is characterized by its comprehensiveness in two primary aspects: firstly, by comprehensively reformulating SAM to a thorough 3D architecture trained on a comprehensively processed large-scale volumetric medical dataset; and secondly, by providing a comprehensive evaluation of its performance. Specifically, we train SAM-Med3D with over 131K 3D masks and 247 categories. Our SAM-Med3D excels at capturing 3D spatial information, exhibiting competitive performance with significantly fewer prompt points than the top-performing fine-tuned SAM in the medical domain. We then evaluate its capabilities across 15 datasets and analyze it from multiple perspectives, including anatomical structures, modalities, targets, and generalization abilities. Our approach, compared with SAM, showcases pronouncedly enhanced efficiency and broad segmentation capabilities for 3D volumetric medical images. Our code is released at https://github.com/uni-medical/SAM-Med3D.
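
A minimal example of what reformulating SAM into a 3D architecture entails at the input side is a 3D patch embedding, sketched below. The patch size, channel count, and embedding dimension are assumptions; this is not the released architecture.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """3D analogue of a 2D ViT patch embedding: a volume (B, 1, D, H, W) is split
    into non-overlapping 3D patches and projected to embedding tokens."""
    def __init__(self, patch: int = 16, in_ch: int = 1, dim: int = 384):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                        # (B, dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)     # (B, tokens, dim)

vol = torch.randn(1, 1, 128, 128, 128)          # a CT/MRI sub-volume
tokens = PatchEmbed3D()(vol)                    # (1, 512, 384)
```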

FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models

  • paper_url: http://arxiv.org/abs/2310.15160
  • repo_url: https://github.com/LiheYoung/FreeMask
  • paper_authors: Lihe Yang, Xiaogang Xu, Bingyi Kang, Yinghuan Shi, Hengshuang Zhao
  • for: To improve the training of semantic segmentation models so that they segment the objects in an image more accurately.
  • methods: Synthetic training images are generated by a generative model conditioned on the semantic masks of real datasets, yielding extra well-aligned image-mask training pairs; a robust filtering principle suppresses incorrectly synthesized regions, and harder semantic masks are prioritized by sampling more synthetic images for them.
  • results: Training solely on synthetic images already matches training on real ones (e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff); with the filtered and re-sampled synthetic images, joint training or pre-training further boosts segmentation models, e.g., from 48.7 to 52.0 mIoU on ADE20K.
    Abstract Semantic segmentation has witnessed tremendous progress due to the proposal of various advanced network architectures. However, they are extremely hungry for delicate annotations to train, and the acquisition is laborious and unaffordable. Therefore, we present FreeMask in this work, which resorts to synthetic images from generative models to ease the burden of both data collection and annotation procedures. Concretely, we first synthesize abundant training images conditioned on the semantic masks provided by realistic datasets. This yields extra well-aligned image-mask training pairs for semantic segmentation models. We surprisingly observe that, solely trained with synthetic images, we already achieve comparable performance with real ones (e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we investigate the role of synthetic images by joint training with real images, or pre-training for real images. Meantime, we design a robust filtering principle to suppress incorrectly synthesized regions. In addition, we propose to inequally treat different semantic masks to prioritize those harder ones and sample more corresponding synthetic images for them. As a result, either jointly trained or pre-trained with our filtered and re-sampled synthesized images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on ADE20K. Code is available at https://github.com/LiheYoung/FreeMask.
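
The filtering principle, dropping synthesized regions that a model trained on real data finds implausible, can be approximated as below by re-labeling high-loss pixels with an ignore index. The fixed threshold and the 19-class setup are assumptions; the paper's actual rule is more involved.

```python
import torch
import torch.nn.functional as F

def filter_synthetic_regions(logits: torch.Tensor, mask: torch.Tensor,
                             threshold: float = 2.0, ignore_index: int = 255) -> torch.Tensor:
    """Mark synthetic pixels whose loss under a real-data-trained model is too high
    as ignore_index, so badly synthesized regions do not supervise training."""
    pixel_loss = F.cross_entropy(logits, mask, reduction="none")   # (B, H, W)
    filtered = mask.clone()
    filtered[pixel_loss > threshold] = ignore_index
    return filtered

logits = torch.randn(2, 19, 64, 64)              # model trained on real images
mask = torch.randint(0, 19, (2, 64, 64))         # mask used to synthesize the image
clean_mask = filter_synthetic_regions(logits, mask)
```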

Online Detection of AI-Generated Images

  • paper_url: http://arxiv.org/abs/2310.15150
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: David C. Epstein, Ishan Jain, Oliver Wang, Richard Zhang
  • for: This work studies the detection of AI-generated images in a realistic setting where new generators are released continuously.
  • methods: Detectors are trained on N generators and tested on the next N+k, following the historical release dates of well-known generation methods; the approach is also extended to pixel-level prediction using automatically generated inpainted data.
  • results: Pixel-level prediction achieves strong performance, and the study evaluates whether pixel detectors can be trained solely on whole synthetic images when commercial models are not available for automatic data generation.
    Abstract With advancements in AI-generated images coming on a continuous basis, it is increasingly difficult to distinguish traditionally-sourced images (e.g., photos, artwork) from AI-generated ones. Previous detection methods study the generalization from a single generator to another in isolation. However, in reality, new generators are released on a streaming basis. We study generalization in this setting, training on N models and testing on the next (N+k), following the historical release dates of well-known generation methods. Furthermore, images increasingly consist of both real and generated components, for example through image inpainting. Thus, we extend this approach to pixel prediction, demonstrating strong performance using automatically-generated inpainted data. In addition, for settings where commercial models are not publicly available for automatic data generation, we evaluate if pixel detectors can be trained solely on whole synthetic images.
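
The streaming evaluation protocol amounts to a chronological split over generators; a minimal sketch with hypothetical generator names and release dates is shown below.

```python
from datetime import date

# Hypothetical generators and release dates, used only to illustrate the protocol.
generators = {
    "gen_a": date(2021, 6, 1), "gen_b": date(2022, 1, 15),
    "gen_c": date(2022, 9, 1), "gen_d": date(2023, 3, 1),
}

def chronological_split(release_dates: dict, n_train: int):
    """Train on the first N generators by release date, test on the later ones."""
    ordered = sorted(release_dates, key=release_dates.get)
    return ordered[:n_train], ordered[n_train:]

train_gens, test_gens = chronological_split(generators, n_train=2)
```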

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design

  • paper_url: http://arxiv.org/abs/2310.15144
  • repo_url: https://github.com/design-bench/design-bench.github.io
  • paper_authors: Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Lijuan Wang
  • for: The paper studies and benchmarks the potential of text-to-image (T2I) generation models in visual design scenarios.
  • methods: The authors build DEsignBench, a T2I generation benchmark whose test samples assess models along two dimensions, "design technical capability" and "design application scenario", each supported by a diverse set of design categories.
  • results: DALL-E 3 and other leading T2I models are run on DEsignBench to build a side-by-side visual gallery. Human evaluations cover image-text alignment, visual aesthetics, and design creativity, alongside specialized design capabilities; an automatic evaluator powered by GPT-4V provides ratings that align well with human judgments while being easily replicable and cost-efficient.
    Abstract We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored for visual design scenarios. Recent T2I models like DALL-E 3 and others, have demonstrated remarkable capabilities in generating photorealistic images that align closely with textual inputs. While the allure of creating visually captivating images is undeniable, our emphasis extends beyond mere aesthetic pleasure. We aim to investigate the potential of using these powerful models in authentic design contexts. In pursuit of this goal, we develop DEsignBench, which incorporates test samples designed to assess T2I models on both "design technical capability" and "design application scenario." Each of these two dimensions is supported by a diverse set of specific design categories. We explore DALL-E 3 together with other leading T2I models on DEsignBench, resulting in a comprehensive visual gallery for side-by-side comparisons. For DEsignBench benchmarking, we perform human evaluations on generated images in DEsignBench gallery, against the criteria of image-text alignment, visual aesthetic, and design creativity. Our evaluation also considers other specialized design capabilities, including text rendering, layout composition, color harmony, 3D design, and medium style. In addition to human evaluations, we introduce the first automatic image generation evaluator powered by GPT-4V. This evaluator provides ratings that align well with human judgments, while being easily replicable and cost-efficient. A high-resolution version is available at https://github.com/design-bench/design-bench.github.io/raw/main/designbench.pdf?download=

Fusion-Driven Tree Reconstruction and Fruit Localization: Advancing Precision in Agriculture

  • paper_url: http://arxiv.org/abs/2310.15138
  • repo_url: None
  • paper_authors: Kaiming Fu, Peng Wei, Juan Villacres, Zhaodan Kong, Stavros G. Vougioukas, Brian N. Bailey
  • for: This study aims to improve the precision of guidance for agricultural robotics and automation systems by analyzing fruit distribution in orchards.
  • methods: The study uses a fusion of RGB imagery, LiDAR, and IMU data to reconstruct trees and locate fruits with high precision.
  • results: The experiments conducted in both a controlled environment and an actual peach orchard demonstrate the robustness and efficacy of the proposed methodology, highlighting its potential for transforming agricultural robotics and precision farming.
    Abstract Fruit distribution is pivotal in shaping the future of both agriculture and agricultural robotics, paving the way for a streamlined supply chain. This study introduces an innovative methodology that harnesses the synergy of RGB imagery, LiDAR, and IMU data, to achieve intricate tree reconstructions and the pinpoint localization of fruits. Such integration not only offers insights into the fruit distribution, which enhances the precision of guidance for agricultural robotics and automation systems, but also sets the stage for simulating synthetic fruit patterns across varied tree architectures. To validate this approach, experiments have been carried out in both a controlled environment and an actual peach orchard. The results underscore the robustness and efficacy of this fusion-driven methodology, highlighting its potential as a transformative tool for future agricultural robotics and precision farming.

Novel-View Acoustic Synthesis from 3D Reconstructed Rooms

  • paper_url: http://arxiv.org/abs/2310.15130
  • repo_url: https://github.com/apple/ml-nvas3d
  • paper_authors: Byeongjoo Ahn, Karren Yang, Brian Hamilton, Jonathan Sheaffer, Anurag Ranjan, Miguel Sarabia, Oncel Tuzel, Jen-Hao Rick Chang
  • for: The work investigates combining blind audio recordings with 3D scene information for novel-view acoustic synthesis.
  • methods: Given audio recordings from 2-4 microphones and the 3D geometry and materials of a scene containing multiple unknown sound sources, the method estimates the sound anywhere in the scene, incorporating room impulse responses (RIRs) derived from the 3D reconstructed rooms.
  • results: The method jointly tackles sound source localization, separation, and dereverberation and outperforms existing methods designed for the individual tasks. In a simulated study on the Matterport3D-NVAS dataset, the model achieves near-perfect source localization accuracy, a PSNR of 26.44 dB and an SDR of 14.23 dB for source separation and dereverberation, and a PSNR of 25.55 dB and an SDR of 14.20 dB for novel-view acoustic synthesis.
    Abstract We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis. Given audio recordings from 2-4 microphones and the 3D geometry and material of a scene containing multiple unknown sound sources, we estimate the sound anywhere in the scene. We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation. While naively training an end-to-end network fails to produce high-quality results, we show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks. Our method outperforms existing methods designed for the individual tasks, demonstrating its effectiveness at utilizing 3D visual information. In a simulated study on the Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source localization, a PSNR of 26.44 dB and a SDR of 14.23 dB for source separation and dereverberation, resulting in a PSNR of 25.55 dB and a SDR of 14.20 dB on novel-view acoustic synthesis. Code, pretrained model, and video results are available on the project webpage (https://github.com/apple/ml-nvas3d).
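
Once sources are separated and dereverberated, rendering them at a novel listener position reduces to convolving each dry source with the corresponding room impulse response. The generic sketch below is not the paper's network; random signals stand in for real audio and RIRs.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_at_listener(dry_sources, rirs):
    """Render audio at a novel listener position by convolving each separated dry
    source with the RIR from that source to the listener, then summing."""
    rendered = [fftconvolve(s, h, mode="full") for s, h in zip(dry_sources, rirs)]
    n = max(len(r) for r in rendered)
    return sum(np.pad(r, (0, n - len(r))) for r in rendered)

sr = 16000
dry = [np.random.randn(sr), np.random.randn(sr)]        # two separated 1 s sources
rirs = [0.01 * np.random.randn(sr // 4) for _ in dry]   # RIRs from the 3D room model
mix_at_new_view = render_at_listener(dry, rirs)
```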

Projected Stochastic Gradient Descent with Quantum Annealed Binary Gradients

  • paper_url: http://arxiv.org/abs/2310.15128
  • repo_url: None
  • paper_authors: Maximilian Krahn, Michelle Sasdelli, Fengyi Yang, Vladislav Golyanik, Juho Kannala, Tat-Jun Chin, Tolga Birdal
  • for: The paper presents a novel layer-wise stochastic optimizer for training neural networks with binary weights (binary neural networks, BNNs) on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy, but training them in practice remains an open challenge.
  • methods: The optimizer, QP-SBGD, approximately maps the gradient onto binary variables by solving a quadratic constrained binary optimization problem, an NP-hard projection that can be executed on an adiabatic quantum annealer; a projected version of the update rule is also introduced.
  • results: On the Rosenbrock function, BNNs, and binary graph neural networks, QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD, and ProxQuant. Under practically reasonable assumptions, the update rule converges at a rate of $\mathcal{O}(1/\sqrt{T})$, and the projected variant converges to a fixed point in the binary variable space if one exists.
    Abstract We present, QP-SBGD, a novel layer-wise stochastic optimiser tailored towards training neural networks with binary weights, known as binary neural networks (BNNs), on quantum hardware. BNNs reduce the computational requirements and energy consumption of deep learning models with minimal loss in accuracy. However, training them in practice remains to be an open challenge. Most known BNN-optimisers either rely on projected updates or binarise weights post-training. Instead, QP-SBGD approximately maps the gradient onto binary variables, by solving a quadratic constrained binary optimisation. Under practically reasonable assumptions, we show that this update rule converges with a rate of $\mathcal{O}(1 / \sqrt{T})$. Moreover, we show how the $\mathcal{NP}$-hard projection can be effectively executed on an adiabatic quantum annealer, harnessing recent advancements in quantum computation. We also introduce a projected version of this update rule and prove that if a fixed point exists in the binary variable space, the modified updates will converge to it. Last but not least, our algorithm is implemented layer-wise, making it suitable to train larger networks on resource-limited quantum hardware. Through extensive evaluations, we show that QP-SBGD outperforms or is on par with competitive and well-established baselines such as BinaryConnect, signSGD and ProxQuant when optimising the Rosenbrock function, training BNNs as well as binary graph neural networks.
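
The per-layer subproblem, choosing binary weights that best follow the gradient under a quadratic model, has the general QUBO form b^T Q b + q^T b over b in {-1, +1}^n. The toy sketch below solves it by brute force for a tiny layer; the construction of Q and q is purely illustrative and differs from QP-SBGD's, and on quantum hardware this subproblem would be handed to an annealer.

```python
import itertools
import numpy as np

def solve_qubo_bruteforce(Q: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Exhaustively minimize b^T Q b + q^T b over b in {-1, +1}^n (only viable for
    tiny n; on quantum hardware this subproblem is handed to an annealer)."""
    best_b, best_val = None, np.inf
    for bits in itertools.product([-1.0, 1.0], repeat=len(q)):
        b = np.array(bits)
        val = b @ Q @ b + q @ b
        if val < best_val:
            best_b, best_val = b, val
    return best_b

# Toy per-layer subproblem built from the current gradient g; the exact Q, q
# construction in QP-SBGD differs from this illustration.
rng = np.random.default_rng(0)
g = rng.normal(size=6)
Q = 0.1 * np.eye(6)          # illustrative curvature term
w_new = solve_qubo_bruteforce(Q, g)
```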

SpVOS: Efficient Video Object Segmentation with Triple Sparse Convolution

  • paper_url: http://arxiv.org/abs/2310.15115
  • repo_url: None
  • paper_authors: Weihao Lin, Tao Chen, Chong Yu
  • for: The paper studies semi-supervised video object segmentation (Semi-VOS), which requires annotating only the first frame of a video in order to segment subsequent frames.
  • methods: A sparse baseline named SpVOS is proposed, built around a novel triple sparse convolution that reduces the computation cost of the overall VOS framework. A triple gate, taking both spatial and temporal redundancy between adjacent frames into account, adaptively decides per pixel how to apply the sparse convolution, and a mixed sparse training strategy with a sparsity-constrained objective balances segmentation performance and computation cost.
  • results: On two mainstream VOS datasets, SpVOS outperforms other state-of-the-art sparse methods and remains comparable to the typical non-sparse baseline, e.g., an overall score of 83.04% (79.29%) on the DAVIS-2017 (YouTube-VOS) validation set versus 82.88% (80.36%) for the non-sparse baseline, while saving up to 42% FLOPs.
    Abstract Semi-supervised video object segmentation (Semi-VOS), which requires only annotating the first frame of a video to segment future frames, has received increased attention recently. Among existing pipelines, the memory-matching-based one is becoming the main research stream, as it can fully utilize the temporal sequence information to obtain high-quality segmentation results. Even though this type of method has achieved promising performance, the overall framework still suffers from heavy computation overhead, mainly caused by the per-frame dense convolution operations between high-resolution feature maps and each kernel filter. Therefore, we propose a sparse baseline of VOS named SpVOS in this work, which develops a novel triple sparse convolution to reduce the computation costs of the overall VOS framework. The designed triple gate, taking full consideration of both spatial and temporal redundancy between adjacent video frames, adaptively makes a triple decision to decide how to apply the sparse convolution on each pixel to control the computation overhead of each layer, while maintaining sufficient discrimination capability to distinguish similar objects and avoid error accumulation. A mixed sparse training strategy, coupled with a designed objective considering the sparsity constraint, is also developed to balance the VOS segmentation performance and computation costs. Experiments are conducted on two mainstream VOS datasets, including DAVIS and Youtube-VOS. Results show that, the proposed SpVOS achieves superior performance over other state-of-the-art sparse methods, and even maintains comparable performance, e.g., an 83.04% (79.29%) overall score on the DAVIS-2017 (Youtube-VOS) validation set, with the typical non-sparse VOS baseline (82.88% for DAVIS-2017 and 80.36% for Youtube-VOS) while saving up to 42% FLOPs, showing its application potential for resource-constrained scenarios.
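
A heavily simplified, two-way stand-in for the gating idea is sketched below: pixels whose features barely changed since the previous frame reuse the cached output, while the rest are recomputed. This toy version still evaluates a dense convolution everywhere; an actual sparse implementation would restrict computation to the gated positions, and the real triple gate makes a three-way decision.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Toy gate: reuse the previous frame's output where features are nearly unchanged."""
    def __init__(self, ch: int, tau: float = 0.1):
        super().__init__()
        self.conv, self.tau = nn.Conv2d(ch, ch, 3, padding=1), tau

    def forward(self, feat, prev_feat, prev_out):
        change = (feat - prev_feat).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
        gate = (change > self.tau).float()
        return gate * self.conv(feat) + (1 - gate) * prev_out

layer = GatedConv(ch=64)
f_t, f_prev, out_prev = (torch.randn(1, 64, 32, 32) for _ in range(3))
out_t = layer(f_t, f_prev, out_prev)
```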

Matryoshka Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.15111
  • repo_url: None
  • paper_authors: Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly
  • for: The paper targets high-resolution image and video synthesis with a framework called Matryoshka Diffusion Models (MDM).
  • methods: A diffusion process denoises inputs at multiple resolutions jointly, using a NestedUNet architecture in which features and parameters for small-scale inputs are nested within those of large scales; a progressive training schedule from lower to higher resolutions further improves optimization for high-resolution generation.
  • results: The approach is effective on class-conditioned image generation, high-resolution text-to-image, and text-to-video benchmarks. Remarkably, a single pixel-space model can be trained at resolutions of up to 1024x1024 pixels, with strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
    Abstract Diffusion models are the de facto approach for generating high-quality images and videos, but learning high-dimensional models remains a formidable task due to computational and optimization challenges. Existing methods often resort to training cascaded models in pixel space or using a downsampled latent space of a separately trained auto-encoder. In this paper, we introduce Matryoshka Diffusion Models(MDM), an end-to-end framework for high-resolution image and video synthesis. We propose a diffusion process that denoises inputs at multiple resolutions jointly and uses a NestedUNet architecture where features and parameters for small-scale inputs are nested within those of large scales. In addition, MDM enables a progressive training schedule from lower to higher resolutions, which leads to significant improvements in optimization for high-resolution generation. We demonstrate the effectiveness of our approach on various benchmarks, including class-conditioned image generation, high-resolution text-to-image, and text-to-video applications. Remarkably, we can train a single pixel-space model at resolutions of up to 1024x1024 pixels, demonstrating strong zero-shot generalization using the CC12M dataset, which contains only 12 million images.
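
The joint multi-resolution corruption at the heart of MDM can be sketched as follows: each level of an image pyramid receives the usual q(x_t | x_0) noise at the same timestep. The pyramid scales and the noise schedule are illustrative assumptions; the NestedUNet itself is not shown.

```python
import torch
import torch.nn.functional as F

def multi_res_noisy_inputs(x0: torch.Tensor, alphas_cumprod: torch.Tensor,
                           t: int, scales=(1, 2, 4)):
    """Jointly noise an image pyramid for one diffusion step: every level receives
    the usual q(x_t | x_0) corruption at the same timestep t."""
    a = alphas_cumprod[t]
    noisy = []
    for s in scales:
        x_s = F.avg_pool2d(x0, s) if s > 1 else x0
        noisy.append(a.sqrt() * x_s + (1 - a).sqrt() * torch.randn_like(x_s))
    return noisy

x0 = torch.rand(2, 3, 256, 256) * 2 - 1                     # clean images in [-1, 1]
alphas_cumprod = torch.linspace(0.9999, 0.02, 1000)         # illustrative schedule
levels = multi_res_noisy_inputs(x0, alphas_cumprod, t=500)  # 256-, 128-, 64-res inputs
```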

Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

  • paper_url: http://arxiv.org/abs/2310.15110
  • repo_url: https://github.com/sudo-ai-3d/zero123plus
  • paper_authors: Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, Hao Su
  • for: Generating 3D-consistent multi-view images from a single input view.
  • methods: An image-conditioned diffusion model with conditioning and training schemes designed to minimize the effort of fine-tuning off-the-shelf 2D diffusion models such as Stable Diffusion.
  • results: The model produces high-quality, consistent multi-view images from a single image, overcoming issues such as texture degradation and geometric misalignment; a ControlNet can additionally be trained on Zero123++ for enhanced control over the generation process.
    Abstract We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view. To take full advantage of pretrained 2D generative priors, we develop various conditioning and training schemes to minimize the effort of finetuning from off-the-shelf image diffusion models such as Stable Diffusion. Zero123++ excels in producing high-quality, consistent multi-view images from a single image, overcoming common issues like texture degradation and geometric misalignment. Furthermore, we showcase the feasibility of training a ControlNet on Zero123++ for enhanced control over the generation process. The code is available at https://github.com/SUDO-AI-3D/zero123plus.

FD-Align: Feature Discrimination Alignment for Fine-tuning Pre-Trained Models in Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2310.15105
  • repo_url: https://github.com/skingorz/fd-align
  • paper_authors: Kun Song, Huimin Ma, Bochao Zou, Huishuai Zhang, Weiran Huang
  • for: Improving the downstream-task performance of pre-trained models, especially under distribution shift.
  • methods: A fine-tuning approach named Feature Discrimination Alignment (FD-Align) that preserves the consistency of spurious features across the fine-tuning process to strengthen the model's generalizability.
  • results: Experiments show improved performance on both ID and OOD tasks, and the fine-tuned model integrates readily with existing methods for further gains.
    Abstract Due to the limited availability of data, existing few-shot learning methods trained from scratch fail to achieve satisfactory performance. In contrast, large-scale pre-trained models such as CLIP demonstrate remarkable few-shot and zero-shot capabilities. To enhance the performance of pre-trained models for downstream tasks, fine-tuning the model on downstream data is frequently necessary. However, fine-tuning the pre-trained model leads to a decrease in its generalizability in the presence of distribution shift, while the limited number of samples in few-shot learning makes the model highly susceptible to overfitting. Consequently, existing methods for fine-tuning few-shot learning primarily focus on fine-tuning the model's classification head or introducing additional structure. In this paper, we introduce a fine-tuning approach termed Feature Discrimination Alignment (FD-Align). Our method aims to bolster the model's generalizability by preserving the consistency of spurious features across the fine-tuning process. Extensive experimental results validate the efficacy of our approach for both ID and OOD tasks. Once fine-tuned, the model can seamlessly integrate with existing methods, leading to performance improvements. Our code can be found in https://github.com/skingorz/FD-Align.
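
A minimal sketch of the kind of consistency regularizer the abstract describes, assuming a frozen copy of the pre-trained encoder and a hypothetical set of spurious_prototypes (e.g., context/background prototypes); the exact construction used in FD-Align may differ.

```python
import torch
import torch.nn.functional as F

def fd_align_style_loss(feats_tuned, feats_frozen, spurious_prototypes,
                        labels, classifier, alpha=1.0):
    """Task loss on the fine-tuned branch plus a consistency term that keeps
    the distribution over 'spurious' prototypes unchanged w.r.t. the frozen
    pre-trained encoder (illustrative sketch, not the official FD-Align code)."""
    ce = F.cross_entropy(classifier(feats_tuned), labels)

    # Similarity distributions over the (hypothetical) spurious prototypes.
    p_tuned = F.softmax(F.normalize(feats_tuned, dim=-1) @
                        F.normalize(spurious_prototypes, dim=-1).T, dim=-1)
    p_frozen = F.softmax(F.normalize(feats_frozen, dim=-1) @
                         F.normalize(spurious_prototypes, dim=-1).T, dim=-1)

    # KL consistency: fine-tuning should not redistribute spurious evidence.
    consistency = F.kl_div(p_tuned.log(), p_frozen, reduction="batchmean")
    return ce + alpha * consistency
```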

On the Detection of Image-Scaling Attacks in Machine Learning

  • paper_url: http://arxiv.org/abs/2310.15085
  • repo_url: https://github.com/equiw/2023-detection-scalingattacks
  • paper_authors: Erwin Quiring, Andreas Müller, Konrad Rieck
  • for: Studying how to detect image-scaling attacks so that they can be reliably spotted and defended against in practice.
  • methods: Two general detection paradigms are identified and novel detection methods are derived from them; the methods are simple in design yet significantly outperform previous work.
  • results: In a comprehensive evaluation across all major learning platforms and scaling algorithms, the methods reliably detect image-scaling attacks: attacks that modify the entire scaled image are detected even under an adaptive adversary, and strong detection performance is maintained when only minor parts of the image are manipulated.
    Abstract Image scaling is an integral part of machine learning and computer vision systems. Unfortunately, this preprocessing step is vulnerable to so-called image-scaling attacks where an attacker makes unnoticeable changes to an image so that it becomes a new image after scaling. This opens up new ways for attackers to control the prediction or to improve poisoning and backdoor attacks. While effective techniques exist to prevent scaling attacks, their detection has not been rigorously studied yet. Consequently, it is currently not possible to reliably spot these attacks in practice. This paper presents the first in-depth systematization and analysis of detection methods for image-scaling attacks. We identify two general detection paradigms and derive novel methods from them that are simple in design yet significantly outperform previous work. We demonstrate the efficacy of these methods in a comprehensive evaluation with all major learning platforms and scaling algorithms. First, we show that image-scaling attacks modifying the entire scaled image can be reliably detected even under an adaptive adversary. Second, we find that our methods provide strong detection performance even if only minor parts of the image are manipulated. As a result, we can introduce a novel protection layer against image-scaling attacks.
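
One simple detection paradigm in this line of work compares the regular downscaling result with a downscaling made robust by median filtering: scaling attacks hide their payload in sparse pixels that a median filter largely removes, so the two outputs diverge under attack. The snippet below is a hedged illustration of that idea with placeholder thresholds and file names, not the exact detectors proposed in the paper.

```python
import numpy as np
from PIL import Image, ImageFilter

def scaling_attack_score(img: Image.Image, target_size=(224, 224), kernel=5):
    """Mean absolute difference between a plain downscale and a downscale of
    the median-filtered image; a high score suggests an image-scaling attack."""
    scaled = np.asarray(img.resize(target_size, Image.NEAREST), dtype=np.float32)
    robust = np.asarray(img.filter(ImageFilter.MedianFilter(kernel))
                           .resize(target_size, Image.NEAREST), dtype=np.float32)
    return float(np.mean(np.abs(scaled - robust)))

if __name__ == "__main__":
    image = Image.open("example.jpg").convert("RGB")   # placeholder file name
    score = scaling_attack_score(image)
    print("attack suspected" if score > 20.0 else "looks benign", score)
```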

E4S: Fine-grained Face Swapping via Editing With Regional GAN Inversion

  • paper_url: http://arxiv.org/abs/2310.15081
  • repo_url: https://github.com/e4s2023/e4s2023
  • paper_authors: Maomao Li, Ge Yuan, Cairong Wang, Zhian Liu, Yong Zhang, Yongwei Nie, Jue Wang, Dong Xu
  • for: Fine-grained face swapping, framed as "editing for swapping" (E4S); traditional face swapping methods rely on global feature extraction and often fail to preserve the source identity.
  • methods: A Regional GAN Inversion (RGI) method that explicitly disentangles shape and texture: face swapping is performed in the latent space of a pretrained StyleGAN, where a multi-scale mask-guided encoder projects the texture of each facial component into regional style codes and a mask-guided injection module manipulates feature maps with those codes, reducing face swapping to style and mask swapping.
  • results: E4S preserves texture, shape, and lighting better than previous methods; a face inpainting network is added as post-processing to handle potential mismatch areas during mask exchange. The implementation is available at https://github.com/e4s2023/E4S2023.
    Abstract This paper proposes a novel approach to face swapping from the perspective of fine-grained facial editing, dubbed "editing for swapping" (E4S). The traditional face swapping methods rely on global feature extraction and often fail to preserve the source identity. In contrast, our framework proposes a Regional GAN Inversion (RGI) method, which allows the explicit disentanglement of shape and texture. Specifically, our E4S performs face swapping in the latent space of a pretrained StyleGAN, where a multi-scale mask-guided encoder is applied to project the texture of each facial component into regional style codes and a mask-guided injection module then manipulates feature maps with the style codes. Based on this disentanglement, face swapping can be simplified as style and mask swapping. Besides, since reconstructing the source face in the target image may lead to disharmony lighting, we propose to train a re-coloring network to make the swapped face maintain the lighting condition on the target face. Further, to deal with the potential mismatch area during mask exchange, we designed a face inpainting network as post-processing. The extensive comparisons with state-of-the-art methods demonstrate that our E4S outperforms existing methods in preserving texture, shape, and lighting. Our implementation is available at https://github.com/e4s2023/E4S2023.

RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented Reality in Dynamic Environments

  • paper_url: http://arxiv.org/abs/2310.15072
  • repo_url: None
  • paper_authors: Jinyu Li, Xiaokun Pan, Gan Huang, Ziyang Zhang, Nan Wang, Hujun Bao, Guofeng Zhang
  • for: A visual-inertial odometry (VIO) system that handles dynamic scenes and pure-rotation motion for mobile augmented reality.
  • methods: The proposed RD-VIO system uses an IMU-PARSAC algorithm to robustly detect and match keypoints in two stages: landmarks are first matched to new keypoints using visual and IMU measurements, and statistics collected from that matching then guide intra-keypoint matching in the second stage. Pure-rotational frames are detected and treated as special subframes with a deferred-triangulation technique, providing additional constraints on pure-rotational motion in the visual-inertial bundle adjustment.
  • results: Experiments on public datasets show that RD-VIO has clear advantages over other methods in dynamic environments.
    Abstract It is typically challenging for visual or visual-inertial odometry systems to handle the problems of dynamic scenes and pure rotation. In this work, we design a novel visual-inertial odometry (VIO) system called RD-VIO to handle both of these two problems. Firstly, we propose an IMU-PARSAC algorithm which can robustly detect and match keypoints in a two-stage process. In the first state, landmarks are matched with new keypoints using visual and IMU measurements. We collect statistical information from the matching and then guide the intra-keypoint matching in the second stage. Secondly, to handle the problem of pure rotation, we detect the motion type and adapt the deferred-triangulation technique during the data-association process. We make the pure-rotational frames into the special subframes. When solving the visual-inertial bundle adjustment, they provide additional constraints to the pure-rotational motion. We evaluate the proposed VIO system on public datasets. Experiments show the proposed RD-VIO has obvious advantages over other methods in dynamic environments.

DREAM+: Efficient Dataset Distillation by Bidirectional Representative Matching

  • paper_url: http://arxiv.org/abs/2310.15052
  • repo_url: https://github.com/lyq312318224/dream
  • paper_authors: Yanqing Liu, Jianyang Gu, Kai Wang, Zheng Zhu, Kaipeng Zhang, Wei Jiang, Yang You
  • for: Creating compact distilled datasets that reduce storage and training costs.
  • methods: A novel matching strategy based on bidirectional representative matching (DREAM+) that selects representative original images instead of relying on naive random sampling, applicable to mainstream dataset distillation frameworks.
  • results: Reduces the number of distillation iterations by more than 15 times without affecting performance and, given sufficient training time, further improves performance to state-of-the-art results.
    Abstract Dataset distillation plays a crucial role in creating compact datasets with similar training performance compared with original large-scale ones. This is essential for addressing the challenges of data storage and training costs. Prevalent methods facilitate knowledge transfer by matching the gradients, embedding distributions, or training trajectories of synthetic images with those of the sampled original images. Although there are various matching objectives, currently the strategy for selecting original images is limited to naive random sampling. We argue that random sampling overlooks the evenness of the selected sample distribution, which may result in noisy or biased matching targets. Besides, the sample diversity is also not constrained by random sampling. Additionally, current methods predominantly focus on single-dimensional matching, where information is not fully utilized. To address these challenges, we propose a novel matching strategy called Dataset Distillation by Bidirectional REpresentAtive Matching (DREAM+), which selects representative original images for bidirectional matching. DREAM+ is applicable to a variety of mainstream dataset distillation frameworks and significantly reduces the number of distillation iterations by more than 15 times without affecting performance. Given sufficient training time, DREAM+ can further improve the performance and achieve state-of-the-art results. We have released the code at github.com/NUS-HPC-AI-Lab/DREAM+.
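
A minimal sketch of representative (rather than random) sample selection by clustering per-class feature embeddings and keeping the samples nearest to each centroid, which is the general idea behind representative matching; the bidirectional matching and the feature extractor used by DREAM+ are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(features, labels, per_class=10, seed=0):
    """Return indices of representative samples: cluster each class's features
    and take the sample closest to every centroid, so the selection covers the
    class distribution more evenly than naive random sampling."""
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        feats = features[idx]
        km = KMeans(n_clusters=min(per_class, len(idx)), n_init=10,
                    random_state=seed).fit(feats)
        for center in km.cluster_centers_:
            selected.append(int(idx[np.argmin(np.linalg.norm(feats - center, axis=1))]))
    return np.array(sorted(set(selected)))
```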

CalibrationPhys: Self-supervised Video-based Heart and Respiratory Rate Measurements by Calibrating Between Multiple Cameras

  • paper_url: http://arxiv.org/abs/2310.15043
  • repo_url: None
  • paper_authors: Yusuke Akamatsu, Terumi Umematsu, Hitoshi Imaoka
  • for: Video-based heart and respiratory rate measurement from facial videos, which is more useful and user-friendly than traditional contact-based sensors.
  • methods: CalibrationPhys, a self-supervised method that calibrates between multiple cameras: contrastive learning on facial videos captured simultaneously by several cameras removes the need for costly ground-truth pulse and respiratory waves, and data augmentation plus a pre-trained camera-specific model improve robustness.
  • results: Experiments on two datasets show that the method outperforms state-of-the-art approaches and makes it easy to use arbitrary cameras for heart and respiratory rate measurement.
    Abstract Video-based heart and respiratory rate measurements using facial videos are more useful and user-friendly than traditional contact-based sensors. However, most of the current deep learning approaches require ground-truth pulse and respiratory waves for model training, which are expensive to collect. In this paper, we propose CalibrationPhys, a self-supervised video-based heart and respiratory rate measurement method that calibrates between multiple cameras. CalibrationPhys trains deep learning models without supervised labels by using facial videos captured simultaneously by multiple cameras. Contrastive learning is performed so that the pulse and respiratory waves predicted from the synchronized videos using multiple cameras are positive and those from different videos are negative. CalibrationPhys also improves the robustness of the models by means of a data augmentation technique and successfully leverages a pre-trained model for a particular camera. Experimental results utilizing two datasets demonstrate that CalibrationPhys outperforms state-of-the-art heart and respiratory rate measurement methods. Since we optimize camera-specific models using only videos from multiple cameras, our approach makes it easy to use arbitrary cameras for heart and respiratory rate measurements.
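
A hedged sketch of the cross-camera calibration idea: waveforms predicted from synchronized clips of the same subject by different cameras are treated as positives, while waveforms from different clips are negatives. The InfoNCE-style formulation below is an assumption for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_camera_contrastive_loss(waves_cam_a, waves_cam_b, temperature=0.1):
    """waves_cam_a / waves_cam_b: [B, T] waveforms predicted from two cameras;
    row i of each tensor comes from the same subject and time window, so the
    diagonal pairs are positives and all other pairs are negatives."""
    a = F.normalize(waves_cam_a, dim=-1)
    b = F.normalize(waves_cam_b, dim=-1)
    logits = a @ b.T / temperature                     # [B, B] similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy pulls both cameras toward agreeing predictions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```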

Manipulation Mask Generator: High-Quality Image Manipulation Mask Generation Method Based on Modified Total Variation Noise Reduction

  • paper_url: http://arxiv.org/abs/2310.15041
  • repo_url: None
  • paper_authors: Xinyu Yang, Jizhe Zhou
  • for: Generating high-quality tampered-image datasets to improve the performance of deep learning models for tamper detection.
  • methods: Original and tampered images crawled from the Baidu PS Bar are subtracted to highlight tampered areas; a modified total variation noise reduction method suppresses the remaining noise, while MSER (Maximally Stable Extremal Regions) and NMS (Non-maximum Suppression) extract text information that would otherwise be lost.
  • results: The pipeline yields images with little noise while largely retaining text information; datasets generated this way can be fed to deep learning models and help them achieve better results.
    Abstract In artificial intelligence, any model that wants to achieve a good result is inseparable from a large number of high-quality data. It is especially true in the field of tamper detection. This paper proposes a modified total variation noise reduction method to acquire high-quality tampered images. We automatically crawl original and tampered images from the Baidu PS Bar. Baidu PS Bar is a website where net friends post countless tampered images. Subtracting the original image with the tampered image can highlight the tampered area. However, there is also substantial noise on the final print, so these images can't be directly used in the deep learning model. Our modified total variation noise reduction method is aimed at solving this problem. Because a lot of text is slender, it is easy to lose text information after the opening and closing operation. We use MSER (Maximally Stable Extremal Regions) and NMS (Non-maximum Suppression) technology to extract text information. And then use the modified total variation noise reduction technology to process the subtracted image. Finally, we can obtain an image with little noise by adding the image and text information. And the idea also largely retains the text information. Datasets generated in this way can be used in deep learning models, and they will help the model achieve better results.
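
A rough sketch of the pipeline described above: difference image, total-variation denoising, and MSER-based recovery of text pixels. The standard Chambolle TV denoiser from scikit-image stands in for the paper's modified total variation step, and the thresholds are placeholders.

```python
import cv2
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def manipulation_mask(original_bgr, tampered_bgr, tv_weight=0.1, thresh=15):
    """Highlight tampered regions by differencing and TV denoising, then add
    back slender text strokes found by MSER so they are not erased."""
    diff = cv2.absdiff(original_bgr, tampered_bgr)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0

    # Stand-in for the paper's modified total-variation noise reduction.
    denoised = denoise_tv_chambolle(gray, weight=tv_weight)
    mask = (denoised * 255 > thresh).astype(np.uint8) * 255

    # Recover text pixels with MSER on the tampered image.
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(cv2.cvtColor(tampered_bgr, cv2.COLOR_BGR2GRAY))
    text_mask = np.zeros_like(mask)
    for pts in regions:                        # pts: (N, 2) pixel coordinates
        text_mask[pts[:, 1], pts[:, 0]] = 255

    # Keep only text pixels that also show up in the raw difference image.
    raw = (gray * 255 > max(1, thresh // 3)).astype(np.uint8) * 255
    return cv2.bitwise_or(mask, cv2.bitwise_and(text_mask, raw))
```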

P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.15025
  • repo_url: None
  • paper_authors: Mohammed A. M. Elhassan, Changjun Zhou, Amina Benabid, Abuzar B. M. Adam
  • for: A real-time semantic segmentation architecture to improve scene-understanding accuracy in real-time tasks such as autonomous driving.
  • methods: The proposed Pyramid Pooling Axial Transformer (P2AT) combines a CNN encoder, a pyramid pooling axial transformer, a Bidirectional Fusion (BiF) module, and a decoder block.
  • results: P2AT variants achieve state-of-the-art results on three challenging scene-understanding datasets: 80.5%, 81.0%, and 81.1% on CamVid for P2AT-S, P2AT-M, and P2AT-L, respectively, and 78.7% on Cityscapes for P2AT-M, with further experiments on Pascal VOC 2012.
    Abstract Recently, Transformer-based models have achieved promising results in various vision tasks, due to their ability to model long-range dependencies. However, transformers are computationally expensive, which limits their applications in real-time tasks such as autonomous driving. In addition, an efficient local and global feature selection and fusion are vital for accurate dense prediction, especially driving scene understanding tasks. In this paper, we propose a real-time semantic segmentation architecture named Pyramid Pooling Axial Transformer (P2AT). The proposed P2AT takes a coarse feature from the CNN encoder to produce scale-aware contextual features, which are then combined with the multi-level feature aggregation scheme to produce enhanced contextual features. Specifically, we introduce a pyramid pooling axial transformer to capture intricate spatial and channel dependencies, leading to improved performance on semantic segmentation. Then, we design a Bidirectional Fusion module (BiF) to combine semantic information at different levels. Meanwhile, a Global Context Enhancer is introduced to compensate for the inadequacy of concatenating different semantic levels. Finally, a decoder block is proposed to help maintain a larger receptive field. We evaluate P2AT variants on three challenging scene-understanding datasets. In particular, our P2AT variants achieve state-of-art results on the Camvid dataset 80.5%, 81.0%, 81.1% for P2AT-S, P2ATM, and P2AT-L, respectively. Furthermore, our experiment on Cityscapes and Pascal VOC 2012 have demonstrated the efficiency of the proposed architecture, with results showing that P2AT-M, achieves 78.7% on Cityscapes. The source code will be available at

SONIC: Sonar Image Correspondence using Pose Supervised Learning for Imaging Sonars

  • paper_url: http://arxiv.org/abs/2310.15023
  • repo_url: None
  • paper_authors: Samiran Gode, Akshay Hinduja, Michael Kaess
  • for: Addressing the data association problem in underwater SLAM through a new learned-feature approach to sonar image correspondence.
  • methods: SONIC (SONar Image Correspondence), a pose-supervised network that yields feature correspondences robust to viewpoint variations.
  • results: Generates significantly better correspondences for sonar images, paving the way for more accurate loop-closure constraints and sonar-based place recognition in underwater SLAM.
    Abstract In this paper, we address the challenging problem of data association for underwater SLAM through a novel method for sonar image correspondence using learned features. We introduce SONIC (SONar Image Correspondence), a pose-supervised network designed to yield robust feature correspondence capable of withstanding viewpoint variations. The inherent complexity of the underwater environment stems from the dynamic and frequently limited visibility conditions, restricting vision to a few meters of often featureless expanses. This makes camera-based systems suboptimal in most open water application scenarios. Consequently, multibeam imaging sonars emerge as the preferred choice for perception sensors. However, they too are not without their limitations. While imaging sonars offer superior long-range visibility compared to cameras, their measurements can appear different from varying viewpoints. This inherent variability presents formidable challenges in data association, particularly for feature-based methods. Our method demonstrates significantly better performance in generating correspondences for sonar images which will pave the way for more accurate loop closure constraints and sonar-based place recognition. Code as well as simulated and real-world datasets will be made public to facilitate further development in the field.

Wonder3D: Single Image to 3D using Cross-Domain Diffusion

  • paper_url: http://arxiv.org/abs/2310.15008
  • repo_url: None
  • paper_authors: Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang
  • for: Efficiently generating high-fidelity textured meshes from single-view images, improving the quality, consistency, and efficiency of image-to-3D tasks.
  • methods: A cross-domain multi-view diffusion model that generates multi-view normal maps and the corresponding color images, with a multi-view cross-domain attention mechanism to exchange information across views and modalities, and a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations.
  • results: Extensive evaluations show high-quality reconstruction results, robust generalization, and reasonably good efficiency compared with prior work.
    Abstract In this work, we introduce Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images.Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure consistency, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and reasonably good efficiency compared to prior works.

StenUNet: Automatic Stenosis Detection from X-ray Coronary Angiography

  • paper_url: http://arxiv.org/abs/2310.14961
  • repo_url: https://github.com/huilin0220/stenunet
  • paper_authors: Hui Lin, Tom Liu, Aggelos Katsaggelos, Adrienne Kline
  • for: Automatic stenosis detection from X-ray coronary angiography, the primary modality for diagnosing coronary artery disease (CAD).
  • methods: The StenUNet architecture and algorithm, combining machine learning with other computer vision techniques.
  • results: An F1 score of 0.5348 on the test set, 0.0005 lower than 2nd place, ranking 3rd among all teams in the ARCADE challenge.
    Abstract Coronary angiography continues to serve as the primary method for diagnosing coronary artery disease (CAD), which is the leading global cause of mortality. The severity of CAD is quantified by the location, degree of narrowing (stenosis), and number of arteries involved. In current practice, this quantification is performed manually using visual inspection and thus suffers from poor inter- and intra-rater reliability. The MICCAI grand challenge: Automatic Region-based Coronary Artery Disease diagnostics using the X-ray angiography imagEs (ARCADE) curated a dataset with stenosis annotations, with the goal of creating an automated stenosis detection algorithm. Using a combination of machine learning and other computer vision techniques, we propose the architecture and algorithm StenUNet to accurately detect stenosis from X-ray Coronary Angiography. Our submission to the ARCADE challenge placed 3rd among all teams. We achieved an F1 score of 0.5348 on the test set, 0.0005 lower than the 2nd place.

Learning Real-World Image De-Weathering with Imperfect Supervision

  • paper_url: http://arxiv.org/abs/2310.14958
  • repo_url: https://github.com/1180300419/imperfect-deweathering
  • paper_authors: Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chaoyu Feng, Xiaotao Wang, LEI LEI, Wangmeng Zuo
  • for: Improving real-world image de-weathering, where inconsistent illumination, position, and textures between ground-truth and degraded images in existing datasets degrade the training of learning-based methods.
  • methods: A Unified Inconsistency Addressing (UIA) approach: a Consistent Label Constructor (CLC), inspired by information bottleneck theory, generates pseudo-labels (enhanced with multiple adjacent frames of the input), and an Information Allocation Strategy (IAS) combines the original imperfect labels with the pseudo-labels to jointly supervise the de-weathering model.
  • results: On two real-world de-weathering datasets the method helps existing de-weathering models achieve better performance. Code is available at https://github.com/1180300419/imperfect-deweathering.
    Abstract Real-world image de-weathering aims at removing various undesirable weather-related artifacts. Owing to the impossibility of capturing image pairs concurrently, existing real-world de-weathering datasets often exhibit inconsistent illumination, position, and textures between the ground-truth images and the input degraded images, resulting in imperfect supervision. Such non-ideal supervision negatively affects the training process of learning-based de-weathering methods. In this work, we attempt to address the problem with a unified solution for various inconsistencies. Specifically, inspired by information bottleneck theory, we first develop a Consistent Label Constructor (CLC) to generate a pseudo-label as consistent as possible with the input degraded image while removing most weather-related degradations. In particular, multiple adjacent frames of the current input are also fed into CLC to enhance the pseudo-label. Then we combine the original imperfect labels and pseudo-labels to jointly supervise the de-weathering model by the proposed Information Allocation Strategy (IAS). During testing, only the de-weathering model is used for inference. Experiments on two real-world de-weathering datasets show that our method helps existing de-weathering models achieve better performance. Codes are available at https://github.com/1180300419/imperfect-deweathering.

Robust Depth Linear Error Decomposition with Double Total Variation and Nuclear Norm for Dynamic MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2310.14934
  • repo_url: None
  • paper_authors: Junpeng Tan, Chunmei Qing, Xiangmin Xu
  • for: Improving the speed and accuracy of dynamic MRI reconstruction.
  • methods: A novel robust low-rank dynamic MRI reconstruction optimization model, the Robust Depth Linear Error Decomposition Model (RDLEDM), operating on highly under-sampled data with the Discrete Fourier Transform (DFT); it combines linear image-domain error analysis, double Total Variation (TV), and double Nuclear Norm (NN) regularizations, optimized with a fast primal-dual algorithm.
  • results: Compared with five state-of-the-art methods, extensive experiments on dynamic MRI data demonstrate superior reconstruction accuracy and time complexity.
    Abstract Compressed Sensing (CS) significantly speeds up Magnetic Resonance Image (MRI) processing and achieves accurate MRI reconstruction from under-sampled k-space data. According to the current research, there are still several problems with dynamic MRI k-space reconstruction based on CS. 1) There are differences between the Fourier domain and the Image domain, and the differences between MRI processing of different domains need to be considered. 2) As three-dimensional data, dynamic MRI has its spatial-temporal characteristics, which need to calculate the difference and consistency of surface textures while preserving structural integrity and uniqueness. 3) Dynamic MRI reconstruction is time-consuming and computationally resource-dependent. In this paper, we propose a novel robust low-rank dynamic MRI reconstruction optimization model via highly under-sampled and Discrete Fourier Transform (DFT) called the Robust Depth Linear Error Decomposition Model (RDLEDM). Our method mainly includes linear decomposition, double Total Variation (TV), and double Nuclear Norm (NN) regularizations. By adding linear image domain error analysis, the noise is reduced after under-sampled and DFT processing, and the anti-interference ability of the algorithm is enhanced. Double TV and NN regularizations can utilize both spatial-temporal characteristics and explore the complementary relationship between different dimensions in dynamic MRI sequences. In addition, Due to the non-smoothness and non-convexity of TV and NN terms, it is difficult to optimize the unified objective model. To address this issue, we utilize a fast algorithm by solving a primal-dual form of the original problem. Compared with five state-of-the-art methods, extensive experiments on dynamic MRI data demonstrate the superior performance of the proposed method in terms of both reconstruction accuracy and time complexity.

Converting Depth Images and Point Clouds for Feature-based Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.14924
  • repo_url: https://github.com/rlsch/depth-conversions
  • paper_authors: Robert Lösch, Mark Sastuba, Jonas Toth, Bernhard Jung
  • for: Converting depth data into images that visualize spatial details that are essentially hidden in traditional depth images.
  • methods: After noise removal, the difference between two normal vectors formed by a neighborhood of points is encoded into the new "flexion" conversion.
  • results: Compared with Bearing Angle images, the new conversion yields brighter, higher-contrast images with more visible contours and details. In a visual odometry task and RGB-D SLAM, flexion images perform better for all tested features (AKAZE, ORB, SIFT, SURF), showing great potential to bridge depth data and classical computer vision.
    Abstract In recent years, depth sensors have become more and more affordable and have found their way into a growing amount of robotic systems. However, mono- or multi-modal sensor registration, often a necessary step for further processing, faces many challenges on raw depth images or point clouds. This paper presents a method of converting depth data into images capable of visualizing spatial details that are basically hidden in traditional depth images. After noise removal, a neighborhood of points forms two normal vectors whose difference is encoded into this new conversion. Compared to Bearing Angle images, our method yields brighter, higher-contrast images with more visible contours and more details. We tested feature-based pose estimation of both conversions in a visual odometry task and RGB-D SLAM. For all tested features, AKAZE, ORB, SIFT, and SURF, our new Flexion images yield better results than Bearing Angle images and show great potential to bridge the gap between depth data and classical computer vision. Source code is available here: https://rlsch.github.io/depth-flexion-conversion.
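
An interpretation-level sketch of encoding the difference between two normal vectors formed by a pixel's neighborhood, here one normal from the horizontal/vertical neighbors and one from the diagonal neighbors, compared through their dot product. The exact flexion definition is given in the paper and repository, so treat the details below as assumptions.

```python
import numpy as np

def flexion_image(depth, fx, fy, cx, cy):
    """Encode, per pixel, the agreement of two surface normals estimated from
    different neighborhoods of the back-projected depth map; bright pixels
    indicate a locally consistent surface (illustrative sketch only)."""
    depth = depth.astype(np.float32)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float32),
                       np.arange(h, dtype=np.float32))
    pts = np.dstack(((u - cx) * depth / fx, (v - cy) * depth / fy, depth))

    def shift(dy, dx):
        return np.roll(np.roll(pts, -dy, axis=0), -dx, axis=1)

    def unit_normal(a, b):
        n = np.cross(a, b)
        return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-9)

    # Normal from horizontal/vertical neighbors vs. normal from diagonals.
    n1 = unit_normal(shift(0, 1) - shift(0, -1), shift(1, 0) - shift(-1, 0))
    n2 = unit_normal(shift(1, 1) - shift(-1, -1), shift(-1, 1) - shift(1, -1))

    flexion = np.abs(np.sum(n1 * n2, axis=-1))         # 1.0 = same plane
    return (flexion * 255).astype(np.uint8)
```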

GRLib: An Open-Source Hand Gesture Detection and Recognition Python Library

  • paper_url: http://arxiv.org/abs/2310.14919
  • repo_url: https://github.com/mikhail-vlasenko/grlib
  • paper_authors: Jan Warchocki, Mikhail Vlasenko, Yke Bauke Eisma
  • for: An open-source Python library for detecting and classifying static and dynamic hand gestures.
  • methods: Frames from an RGB camera feed are augmented and passed to MediaPipe Hands for hand-landmark detection; the landmarks are then classified into gesture classes, with trajectories and keyframe extraction supporting dynamic gestures, and the library can be trained on existing data for improved robustness.
  • results: On three diverse real-world datasets the library outperforms another publicly available HGR system, MediaPipe Solutions.
    Abstract Hand gesture recognition systems provide a natural way for humans to interact with computer systems. Although various algorithms have been designed for this task, a host of external conditions, such as poor lighting or distance from the camera, make it difficult to create an algorithm that performs well across a range of environments. In this work, we present GRLib: an open-source Python library able to detect and classify static and dynamic hand gestures. Moreover, the library can be trained on existing data for improved classification robustness. The proposed solution utilizes a feed from an RGB camera. The retrieved frames are then subjected to data augmentation and passed on to MediaPipe Hands to perform hand landmark detection. The landmarks are then classified into their respective gesture class. The library supports dynamic hand gestures through trajectories and keyframe extraction. It was found that the library outperforms another publicly available HGR system - MediaPipe Solutions, on three diverse, real-world datasets. The library is available at https://github.com/mikhail-vlasenko/grlib and can be installed with pip.
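
A minimal stand-alone sketch of the same kind of pipeline (MediaPipe Hands landmarks plus a simple classifier on the flattened landmark vector). It does not use GRLib's actual API; the k-NN classifier and the file names are placeholders assumed to point at images containing a detectable hand.

```python
import cv2
import numpy as np
import mediapipe as mp
from sklearn.neighbors import KNeighborsClassifier

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def landmark_vector(bgr_image):
    """Return a flattened (63,) vector of 21 hand landmarks, or None if no hand."""
    result = hands.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32).ravel()

# Tiny placeholder training set: one labelled image per gesture class.
train_X = np.stack([landmark_vector(cv2.imread(p)) for p in ["fist.jpg", "palm.jpg"]])
train_y = np.array(["fist", "palm"])
clf = KNeighborsClassifier(n_neighbors=1).fit(train_X, train_y)

query = landmark_vector(cv2.imread("query.jpg"))
if query is not None:
    print("predicted gesture:", clf.predict(query[None])[0])
```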

Object Pose Estimation Annotation Pipeline for Multi-view Monocular Camera Systems in Industrial Settings

  • paper_url: http://arxiv.org/abs/2310.14914
  • repo_url: None
  • paper_authors: Hazem Youssef, Frederik Polachowski, Jérôme Rutinowski, Moritz Roidl, Christopher Reining
  • for: Object localization and pose estimation for material-flow operations in large industrial spaces such as warehouses, without installing artificial artifacts or excessively expensive equipment.
  • methods: An annotation pipeline that leverages existing cameras: the cameras are localized in space and unified with a motion capture system, and a set of linear mappings projects 3D models of the objects of interest at their ground-truth 6D poses, producing annotations for deep-learning-based pose estimation without manual labor.
  • results: On a custom dataset collected from eight cameras in an industrial setting, the pipeline produced consistent, high-quality annotations for 26,482 object instances in a fraction of the time required by human annotators.
    Abstract Object localization, and more specifically object pose estimation, in large industrial spaces such as warehouses and production facilities, is essential for material flow operations. Traditional approaches rely on artificial artifacts installed in the environment or excessively expensive equipment, that is not suitable at scale. A more practical approach is to utilize existing cameras in such spaces in order to address the underlying pose estimation problem and to localize objects of interest. In order to leverage state-of-the-art methods in deep learning for object pose estimation, large amounts of data need to be collected and annotated. In this work, we provide an approach to the annotation of large datasets of monocular images without the need for manual labor. Our approach localizes cameras in space, unifies their location with a motion capture system, and uses a set of linear mappings to project 3D models of objects of interest at their ground truth 6D pose locations. We test our pipeline on a custom dataset collected from a system of eight cameras in an industrial setting that mimics the intended area of operation. Our approach was able to provide consistent quality annotations for our dataset with 26, 482 object instances at a fraction of the time required by human annotators.
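
The projection step reduces to the standard pinhole model: once the camera is localized in the motion-capture (world) frame, every 3D model point posed at its ground-truth 6D pose can be mapped to pixel coordinates. The sketch below shows that mapping with hypothetical names for the calibration inputs.

```python
import numpy as np

def project_model_points(model_pts, R_obj, t_obj, R_cam, t_cam, K):
    """model_pts: (N, 3) points of the object's 3D model in its own frame.
    R_obj, t_obj: ground-truth object pose in the world (mocap) frame.
    R_cam, t_cam: world-to-camera transform; K: 3x3 camera intrinsics.
    Returns (N, 2) pixel coordinates of the projected model points."""
    pts_world = model_pts @ R_obj.T + t_obj        # object frame -> world frame
    pts_cam = pts_world @ R_cam.T + t_cam          # world frame  -> camera frame
    uvw = pts_cam @ K.T                            # camera frame -> image plane
    return uvw[:, :2] / uvw[:, 2:3]
```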

Orientation-Aware Leg Movement Learning for Action-Driven Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2310.14907
  • repo_url: None
  • paper_authors: Chunzhi Gu, Chao Zhang, Shigeru Kuriyama
  • for: Action-driven human motion prediction: forecasting future human motion from an observed sequence while respecting a given action label.
  • methods: An action-conditioned in-betweening (ACB) learning task that encourages natural transitions by incorporating realistic leg movements under orientation changes; it is applied only to a few action classes with active gait motions such as Walk or Run. A two-stage strategy first generates the target motion with a motion diffusion model for the specified future action and then produces the in-betweening that smoothly connects observation and prediction.
  • results: Extensive evaluation on three benchmark datasets shows state-of-the-art performance in visual quality, prediction accuracy, and action faithfulness, and the in-betweening model trained on one dataset generalizes to two unseen large-scale motion datasets.
    Abstract The task of action-driven human motion prediction aims to forecast future human motion from the observed sequence while respecting the given action label. It requires modeling not only the stochasticity within human motion but the smooth yet realistic transition between multiple action labels. However, the fact that most of the datasets do not contain such transition data complicates this task. Existing work tackles this issue by learning a smoothness prior to simply promote smooth transitions, yet doing so can result in unnatural transitions especially when the history and predicted motions differ significantly in orientations. In this paper, we argue that valid human motion transitions should incorporate realistic leg movements to handle orientation changes, and cast it as an action-conditioned in-betweening (ACB) learning task to encourage transition naturalness. Because modeling all possible transitions is virtually unreasonable, our ACB is only performed on very few selected action classes with active gait motions, such as Walk or Run. Specifically, we follow a two-stage forecasting strategy by first employing the motion diffusion model to generate the target motion with a specified future action, and then producing the in-betweening to smoothly connect the observation and prediction to eventually address motion prediction. Our method is completely free from the labeled motion transition data during training. To show the robustness of our approach, we generalize our trained in-betweening learning model on one dataset to two unseen large-scale motion datasets to produce natural transitions. Extensive methods on three benchmark datasets demonstrate that our method yields the state-of-the-art performance in terms of visual quality, prediction accuracy, and action faithfulness.

Deep learning denoiser assisted roughness measurements extraction from thin resists with low Signal-to-Noise Ratio(SNR) SEM images: analysis with SMILE

  • paper_url: http://arxiv.org/abs/2310.14815
  • repo_url: None
  • paper_authors: Sara Sacchi, Bappaditya Dey, Iacopo Mochi, Sandip Halder, Philippe Leray
  • for: With High NA EUVL driving research on thin photoresists (below 30 nm), SEM images suffer from reduced contrast and low signal-to-noise ratio (SNR); the goal is to enhance the SNR with a deep learning denoiser and enable robust extraction of unbiased line edge roughness (uLER) and line width roughness (uLWR) from thin resists.
  • methods: SEM images of line-space patterns with a chemically amplified resist at different thicknesses (15, 20, 25, 30 nm), underlayers (Spin-On-Glass, Organic Underlayer), and averaging frames (4, 8, 16, 32, 64) are denoised with a deep learning denoiser and systematically analyzed with the open-source metrology software SMILE 2.3.2.
  • results: Denoised images with few integration frames keep their critical dimensions (CDs) unaltered, show enhanced SNR (especially at low frame counts), and yield uLER/uLWR measurements as accurate as noisy images acquired with many more frames; images with SNR < 2 can thus be denoised successfully, improving metrology throughput while keeping roughness measurements reliable.
    Abstract The technological advance of High Numerical Aperture Extreme Ultraviolet Lithography (High NA EUVL) has opened the gates to extensive researches on thinner photoresists (below 30nm), necessary for the industrial implementation of High NA EUVL. Consequently, images from Scanning Electron Microscopy (SEM) suffer from reduced imaging contrast and low Signal-to-Noise Ratio (SNR), impacting the measurement of unbiased Line Edge Roughness (uLER) and Line Width Roughness (uLWR). Thus, the aim of this work is to enhance the SNR of SEM images by using a Deep Learning denoiser and enable robust roughness extraction of the thin resist. For this study, we acquired SEM images of Line-Space (L/S) patterns with a Chemically Amplified Resist (CAR) with different thicknesses (15nm, 20nm, 25nm, 30nm), underlayers (Spin-On-Glass-SOG, Organic Underlayer-OUL) and frames of averaging (4, 8, 16, 32, and 64 Fr). After denoising, a systematic analysis has been carried out on both noisy and denoised images using an open-source metrology software, SMILE 2.3.2, for investigating mean CD, SNR improvement factor, biased and unbiased LWR/LER Power Spectral Density (PSD). Denoised images with lower number of frames present unaltered Critical Dimensions (CDs), enhanced SNR (especially for low number of integration frames), and accurate measurements of uLER and uLWR, with the same accuracy as for noisy images with a consistent higher number of frames. Therefore, images with a small number of integration frames and with SNR < 2 can be successfully denoised, and advantageously used in improving metrology throughput while maintaining reliable roughness measurements for the thin resist.

DocTrack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading

  • paper_url: http://arxiv.org/abs/2310.14802
  • repo_url: https://github.com/hint-lab/doctrack
  • paper_authors: Hao Wang, Qingxuan Wang, Yue Li, Changqing Wang, Chenhui Chu, Rui Wang
  • for: Providing a visually-rich document dataset aligned with human eye-movement information, to push research and development of Document AI models that read documents the way humans do.
  • methods: Eye-tracking technology records human reading paths over visually-rich documents, and the fixations are aligned with document elements to build the DocTrack dataset; the impact of human reading order on document understanding tasks, and of making a machine read in the same order as a human, is also examined.
  • results: Although Document AI models have made significant progress, they still cannot read visually-rich documents as accurately, continuously, and flexibly as humans do, which has implications for future model development. The data is available at https://github.com/hint-lab/doctrack.
    Abstract The use of visually-rich documents (VRDs) in various fields has created a demand for Document AI models that can read and comprehend documents like humans, which requires the overcoming of technical, linguistic, and cognitive barriers. Unfortunately, the lack of appropriate datasets has significantly hindered advancements in the field. To address this issue, we introduce \textsc{DocTrack}, a VRD dataset really aligned with human eye-movement information using eye-tracking technology. This dataset can be used to investigate the challenges mentioned above. Additionally, we explore the impact of human reading order on document understanding tasks and examine what would happen if a machine reads in the same order as a human. Our results suggest that although Document AI models have made significant progress, they still have a long way to go before they can read VRDs as accurately, continuously, and flexibly as humans do. These findings have potential implications for future research and development of Document AI models. The data is available at \url{https://github.com/hint-lab/doctrack}.

SAMCLR: Contrastive pre-training on complex scenes using SAM for view sampling

  • paper_url: http://arxiv.org/abs/2310.14736
  • repo_url: None
  • paper_authors: Benjamin Missaoui, Chongbin Yuan
  • for: Improving self-supervised contrastive learning on complex scenes that contain multiple objects.
  • methods: SAMCLR, an add-on to SimCLR that uses SAM to segment the image into semantic regions and then samples the two views from the same region.
  • results: When pre-training on Cityscapes and ADE20K and evaluating classification on CIFAR-10, STL10, and ImageNette, SAMCLR performs at least on par with, and most often significantly better than, SimCLR, DINO, and MoCo.
    Abstract In Computer Vision, self-supervised contrastive learning enforces similar representations between different views of the same image. The pre-training is most often performed on image classification datasets, like ImageNet, where images mainly contain a single class of objects. However, when dealing with complex scenes with multiple items, it becomes very unlikely for several views of the same image to represent the same object category. In this setting, we propose SAMCLR, an add-on to SimCLR which uses SAM to segment the image into semantic regions, then sample the two views from the same region. Preliminary results show empirically that when pre-training on Cityscapes and ADE20K, then evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs at least on par with, and most often significantly outperforms not only SimCLR, but also DINO and MoCo.
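
A hedged sketch of the view-sampling idea using the public segment-anything package: generate SAM masks, pick one sufficiently large region, and take both crops from inside its bounding box before applying the usual SimCLR augmentations. The checkpoint path and crop sizes are placeholders, and SAMCLR's actual sampling rules may differ.

```python
import random
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam)

def sample_two_views(image, crop=224):
    """Return two random crops taken from the same SAM region of an (H, W, 3)
    uint8 RGB image, falling back to full-image crops if no region is large
    enough. A sketch of region-constrained sampling, not SAMCLR itself."""
    masks = [m for m in mask_generator.generate(image) if m["area"] >= crop * crop]
    if masks:
        x, y, w, h = random.choice(masks)["bbox"]          # region bounding box
    else:
        (x, y), (h, w) = (0, 0), image.shape[:2]

    def random_crop():
        cx = random.randint(x, max(x, x + w - crop))
        cy = random.randint(y, max(y, y + h - crop))
        return image[cy:cy + crop, cx:cx + crop]

    return random_crop(), random_crop()
```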

MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion

  • paper_url: http://arxiv.org/abs/2310.14729
  • repo_url: None
  • paper_authors: Roy Kapon, Guy Tevet, Daniel Cohen-Or, Amit H. Bermano
  • for: Generating consistent multi-view 2D samples of a motion sequence so that its 3D counterpart can be created.
  • methods: Multi-view Ancestral Sampling (MAS), built on a diffusion model trained solely on 2D data; multiple 2D motion sequences depicting the same motion from different angles are denoised simultaneously, and a consistency block combines the individual generations into a unified 3D sequence that is projected back to each view at every diffusion step.
  • results: Across motion domains where 3D capture is arduous (professional basketball maneuvers, rhythmic gymnastics with a ball apparatus, horse obstacle-course races), MAS generates diverse and realistic 3D motion sequences without textual conditioning; the ancestral-sampling approach integrates more naturally with the diffusion framework than denoising-optimization strategies and avoids out-of-domain sampling, missing details, and mode collapse.
    Abstract We introduce Multi-view Ancestral Sampling (MAS), a method for generating consistent multi-view 2D samples of a motion sequence, enabling the creation of its 3D counterpart. MAS leverages a diffusion model trained solely on 2D data, opening opportunities to exciting and diverse fields of motion previously under-explored as 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing the same motion from different angles. Our consistency block ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence, and projecting it back to the original views for the next iteration. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastic performances featuring a ball apparatus, and horse obstacle course races. In each of these domains, 3D motion capture is arduous, and yet, MAS generates diverse and realistic 3D sequences without textual conditioning. As we demonstrate, our ancestral sampling-based approach offers a more natural integration with the diffusion framework compared to popular denoising optimization-based approaches, and avoids common issues such as out-of-domain sampling, lack of details and mode-collapse. https://guytevet.github.io/mas-page/
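
A hedged sketch of what "combining the individual generations into a unified 3D sequence and projecting it back" can look like for a single joint: least-squares (DLT) triangulation from the per-view 2D predictions followed by reprojection. Camera projection matrices are assumed known; MAS's actual consistency block operates inside the diffusion sampler.

```python
import numpy as np

def triangulate_and_reproject(points_2d, proj_mats):
    """points_2d: (V, 2) pixel coordinates of one joint in V views.
    proj_mats: (V, 3, 4) camera projection matrices.
    Returns the unified 3D point and its reprojection into every view."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])               # DLT constraints per view
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    X = X / X[3]                                   # homogeneous -> Euclidean

    reproj = proj_mats @ X                         # (V, 3) homogeneous pixels
    return X[:3], reproj[:, :2] / reproj[:, 2:3]
```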

Rethinking Scale Imbalance in Semi-supervised Object Detection for Aerial Images

  • paper_url: http://arxiv.org/abs/2310.14718
  • repo_url: None
  • paper_authors: Ruixiang Zhang, Chang Xu, Fang Xu, Wen Yang, Guangjun He, Huai Yu, Gui-Song Xia
  • for: Semi-supervised object detection (SSOD) in aerial images, where objects are small and numerous and a large proportion of small objects causes a drastic performance drop.
  • methods: A novel Scale-discriminative Semi-Supervised Object Detection (S^3OD) learning pipeline with three key components that address the scale bias: Size-aware Adaptive Thresholding (SAT), Size-rebalanced Label Assignment (SLA), and Teacher-guided Negative Learning (TNL).
  • results: Extensive experiments on the DOTA-v1.5 benchmark show that the proposed method outperforms state-of-the-art competitors.
    Abstract This paper focuses on the scale imbalance problem of semi-supervised object detection(SSOD) in aerial images. Compared to natural images, objects in aerial images show smaller sizes and larger quantities per image, increasing the difficulty of manual annotation. Meanwhile, the advanced SSOD technique can train superior detectors by leveraging limited labeled data and massive unlabeled data, saving annotation costs. However, as an understudied task in aerial images, SSOD suffers from a drastic performance drop when facing a large proportion of small objects. By analyzing the predictions between small and large objects, we identify three imbalance issues caused by the scale bias, i.e., pseudo-label imbalance, label assignment imbalance, and negative learning imbalance. To tackle these issues, we propose a novel Scale-discriminative Semi-Supervised Object Detection (S^3OD) learning pipeline for aerial images. In our S^3OD, three key components, Size-aware Adaptive Thresholding (SAT), Size-rebalanced Label Assignment (SLA), and Teacher-guided Negative Learning (TNL), are proposed to warrant scale unbiased learning. Specifically, SAT adaptively selects appropriate thresholds to filter pseudo-labels for objects at different scales. SLA balances positive samples of objects at different scales through resampling and reweighting. TNL alleviates the imbalance in negative samples by leveraging information generated by a teacher model. Extensive experiments conducted on the DOTA-v1.5 benchmark demonstrate the superiority of our proposed methods over state-of-the-art competitors. Codes will be released soon.
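
A hedged sketch of the size-aware thresholding idea: instead of a single global confidence threshold for pseudo-labels, keep a separate threshold per object-size group so small objects are not filtered out by a cut-off tuned on large ones. Deriving each group's threshold from a score quantile is an assumption for illustration; the actual SAT rule is defined in the paper.

```python
import numpy as np

def size_aware_filter(boxes, scores, quantile=0.5,
                      edges=(0, 32 ** 2, 96 ** 2, float("inf"))):
    """boxes: (N, 4) xyxy pseudo-label boxes; scores: (N,) confidences.
    Keep boxes whose score passes the threshold of their own size group
    (small / medium / large by area)."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = np.zeros(len(boxes), dtype=bool)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_group = (areas >= lo) & (areas < hi)
        if in_group.any():
            thr = np.quantile(scores[in_group], quantile)   # per-group threshold
            keep |= in_group & (scores >= thr)
    return boxes[keep], scores[keep]
```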

Interaction-Driven Active 3D Reconstruction with Object Interiors

  • paper_url: http://arxiv.org/abs/2310.14700
  • repo_url: https://github.com/Salingo/Interaction-Driven-Reconstruction
  • paper_authors: Zihao Yan, Fubao Su, Mingyang Wang, Ruizhen Hu, Hao Zhang, Hui Huang
  • for: An active 3D reconstruction method that integrates visual perception, robot-object interaction, and 3D scanning to recover both the exterior and the unexposed interior geometry of a target 3D object.
  • methods: A Fetch robot with built-in RGBD sensors analyzes the interactability of the object's parts and manipulates them to expose occluded regions, iterating between interaction analysis and interaction-driven reconstruction; articulated-part detection and mesh reconstruction are carried out by neural networks, and the remaining non-articulated parts and exposed interior structures are reconstructed in a final step.
  • results: The reconstruction runs fully automatically and recovers part articulations together with complete geometry, as demonstrated by qualitative and quantitative evaluation, ablation studies, comparisons to alternatives, and experiments in a real environment.
    Abstract We introduce an active 3D reconstruction method which integrates visual perception, robot-object interaction, and 3D scanning to recover both the exterior and interior, i.e., unexposed, geometries of a target 3D object. Unlike other works in active vision which focus on optimizing camera viewpoints to better investigate the environment, the primary feature of our reconstruction is an analysis of the interactability of various parts of the target object and the ensuing part manipulation by a robot to enable scanning of occluded regions. As a result, an understanding of part articulations of the target object is obtained on top of complete geometry acquisition. Our method operates fully automatically by a Fetch robot with built-in RGBD sensors. It iterates between interaction analysis and interaction-driven reconstruction, scanning and reconstructing detected moveable parts one at a time, where both the articulated part detection and mesh reconstruction are carried out by neural networks. In the final step, all the remaining, non-articulated parts, including all the interior structures that had been exposed by prior part manipulations and subsequently scanned, are reconstructed to complete the acquisition. We demonstrate the performance of our method via qualitative and quantitative evaluation, ablation studies, comparisons to alternatives, as well as experiments in a real environment.

CAwa-NeRF: Instant Learning of Compression-Aware NeRF Features

  • paper_url: http://arxiv.org/abs/2310.14695
  • repo_url: None
  • paper_authors: Omnia Mahmoud, Théo Ladune, Matthieu Gendrin
  • for: Reducing the storage cost of Neural Radiance Fields (NeRF) that model 3D scenes with volumetric feature grids, while keeping their quality and fast training.
  • methods: Instant learning of compression-aware NeRF features (CAwa-NeRF): building on Instant-NGP's multi-resolution hash encoding of trainable feature grids, it exports zip-compressed feature grids at the end of training with negligible extra time, without changing the storage architecture or the parameters of the original INGP setup; the approach can also be adapted to other models.
  • results: On various static scenes (single-object masked-background scenes and real-life scenes captured in a studio), CAwa-NeRF compresses the feature grids down to 6% (1.2 MB) of the original size without any loss in PSNR (33 dB), or down to 2.4% (0.53 MB) with a slight loss (32.31 dB).
    Abstract Modeling 3D scenes by volumetric feature grids is one of the promising directions of neural approximations to improve Neural Radiance Fields (NeRF). Instant-NGP (INGP) introduced multi-resolution hash encoding from a lookup table of trainable feature grids which enabled learning high-quality neural graphics primitives in a matter of seconds. However, this improvement came at the cost of higher storage size. In this paper, we address this challenge by introducing instant learning of compression-aware NeRF features (CAwa-NeRF), that allows exporting the zip compressed feature grids at the end of the model training with a negligible extra time overhead without changing neither the storage architecture nor the parameters used in the original INGP paper. Nonetheless, the proposed method is not limited to INGP but could also be adapted to any model. By means of extensive simulations, our proposed instant learning pipeline can achieve impressive results on different kinds of static scenes such as single object masked background scenes and real-life scenes captured in our studio. In particular, for single object masked background scenes CAwa-NeRF compresses the feature grids down to 6% (1.2 MB) of the original size without any loss in the PSNR (33 dB) or down to 2.4% (0.53 MB) with a slight virtual loss (32.31 dB).

On Partial Shape Correspondence and Functional Maps

  • paper_url: http://arxiv.org/abs/2310.14692
  • repo_url: None
  • paper_authors: Amit Bracha, Thomas Dagès, Ron Kimmel
  • for: Matching shapes to their parts (partial shape correspondence).
  • methods: Examines the functional-map formulation, which translates shape matching into "convenient" spaces where correspondence is computed algebraically by solving a least-squares problem, and argues that it introduces errors under partiality.
  • results: Proposes a new method that establishes direct correspondence between partial and full shapes through feature matching with a Gromov-distance-based loss; on SHREC'16 it outperforms existing unsupervised methods and achieves state-of-the-art results on the SHREC'16 HOLES benchmark.
    Abstract While dealing with matching shapes to their parts, we often utilize an instrument known as functional maps. The idea is to translate the shape matching problem into ``convenient'' spaces by which matching is performed algebraically by solving a least squares problem. Here, we argue that such formulations, though popular in this field, introduce errors in the estimated match when partiality is invoked. Such errors are unavoidable even when considering advanced feature extraction networks, and they can be shown to escalate with increasing degrees of shape partiality, adversely affecting the learning capability of such systems. To circumvent these limitations, we propose a novel approach for partial shape matching. Our study of functional maps led us to a novel method that establishes direct correspondence between partial and full shapes through feature matching bypassing the need for functional map intermediate spaces. The Gromov distance between metric spaces leads to the construction of the first part of our loss functions. For regularization we use two options: a term based on the area preserving property of the mapping, and a relaxed version of it without the need to compute a functional map. The proposed approach shows superior performance on the SHREC'16 dataset, outperforming existing unsupervised methods for partial shape matching. In particular, it achieves state-of-the-art result on the SHREC'16 HOLES benchmark, superior also compared to supervised methods.
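A hedged sketch of the direct-matching idea described above, under assumed shapes and a softmax-based soft correspondence; the exact loss used in the paper may differ.

```python
# Illustrative sketch: build a soft correspondence from learned point features and
# penalize metric distortion between the partial and the full shape, in the spirit
# of a Gromov-type distance (all names and sizes are placeholders).
import torch

def soft_correspondence(feat_partial, feat_full, tau=0.07):
    # feat_*: (N, d) and (M, d) L2-normalized descriptors
    sim = feat_partial @ feat_full.T / tau
    return torch.softmax(sim, dim=1)               # (N, M) row-stochastic map

def gromov_distortion_loss(P, D_partial, D_full):
    # D_partial: (N, N) distances on the part; D_full: (M, M) distances on the full shape
    mapped = P @ D_full @ P.T                       # pull the full-shape metric back
    return ((mapped - D_partial) ** 2).mean()

N, M, d = 128, 256, 32
fp = torch.nn.functional.normalize(torch.randn(N, d), dim=1)
ff = torch.nn.functional.normalize(torch.randn(M, d), dim=1)
Dp = torch.cdist(torch.randn(N, 3), torch.randn(N, 3))
Df = torch.cdist(torch.randn(M, 3), torch.randn(M, 3))
print(float(gromov_distortion_loss(soft_correspondence(fp, ff), Dp, Df)))
```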

Online Out-of-Domain Detection for Automated Driving

  • paper_url: http://arxiv.org/abs/2310.14675
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Timo Sämann, Horst-Michael Groß
  • for: Ensuring safety in automated driving, particularly in detecting distributional shifts in Deep Neural Networks (DNNs)
  • methods: Proof of concept for a safety mechanism that detects leaving of the training domain online (at runtime) using the Synthia data set
  • results: Achieved 100% correct detection of whether the input data is inside or outside the domain
    Abstract Ensuring safety in automated driving is a major challenge for the automotive industry. Special attention is paid to artificial intelligence, in particular to Deep Neural Networks (DNNs), which is considered a key technology in the realization of highly automated driving. DNNs learn from training data, which means that they only achieve good accuracy within the underlying data distribution of the training data. When leaving the training domain, a distributional shift is caused, which can lead to a drastic reduction of accuracy. In this work, we present a proof of concept for a safety mechanism that can detect the leaving of the domain online, i.e. at runtime. In our experiments with the Synthia data set we can show that a 100 % correct detection of whether the input data is inside or outside the domain is achieved. The ability to detect when the vehicle leaves the domain can be an important requirement for certification.
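The paper's concrete detection mechanism is not spelled out in this digest; as a stand-in, the sketch below shows one common way to flag leaving the training domain at runtime, a Mahalanobis-distance check on in-domain feature statistics (a generic technique, not necessarily the authors' method).

```python
# Generic runtime out-of-domain check: fit Gaussian statistics on in-domain
# features and flag inputs whose Mahalanobis distance exceeds a calibrated threshold.
import numpy as np

class FeatureOODDetector:
    def fit(self, feats: np.ndarray):
        self.mu = feats.mean(axis=0)
        self.prec = np.linalg.inv(np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1]))
        self.threshold = np.quantile(self._dist(feats), 0.99)   # accept 99% of in-domain data
        return self

    def _dist(self, feats):
        diff = feats - self.mu
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, self.prec, diff))

    def is_out_of_domain(self, feat: np.ndarray) -> bool:
        return bool(self._dist(feat[None])[0] > self.threshold)

train_feats = np.random.default_rng(0).normal(size=(5000, 64))
det = FeatureOODDetector().fit(train_feats)
print(det.is_out_of_domain(np.zeros(64)), det.is_out_of_domain(np.full(64, 8.0)))
```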

Inject Semantic Concepts into Image Tagging for Open-Set Recognition

  • paper_url: http://arxiv.org/abs/2310.15200
  • repo_url: https://github.com/xinyu1205/recognize-anything
  • paper_authors: Xinyu Huang, Yi-Jie Huang, Youcai Zhang, Weiwei Tian, Rui Feng, Yuejie Zhang, Yanchun Xie, Yaqian Li, Lei Zhang
  • for: Proposes a foundational image recognition model with strong open-set recognition capability by injecting semantic concepts into the image tagging training framework.
  • methods: Unifies image-text alignment and image tagging in a fine-grained interaction framework based on image-tags-text triplets, and employs large language models (LLMs) to generate diverse visual tag descriptions.
  • results: On multiple image recognition benchmarks, RAM++ exceeds existing foundational image recognition models, particularly for open-set recognition. For predefined common tag categories, RAM++ improves over CLIP by 10.2 mAP on OpenImages and 15.4 mAP on ImageNet; for open-set categories it improves over CLIP and RAM by 5 mAP and 6.4 mAP on OpenImages; for human-object interaction phrases it gains 7.8 mAP and 4.7 mAP on HICO.
    Abstract In this paper, we introduce the Recognize Anything Plus Model~(RAM++), a fundamental image recognition model with strong open-set recognition capabilities, by injecting semantic concepts into image tagging training framework. Previous approaches are either image tagging models constrained by limited semantics, or vision-language models with shallow interaction for suboptimal performance in multi-tag recognition. In contrast, RAM++ integrates image-text alignment and image-tagging within a unified fine-grained interaction framework based on image-tags-text triplets. This design enables RAM++ not only excel in identifying predefined categories, but also significantly augment the recognition ability in open-set categories. Moreover, RAM++ employs large language models~(LLMs) to generate diverse visual tag descriptions, pioneering the integration of LLM's knowledge into image tagging training. This approach empowers RAM++ to integrate visual description concepts for open-set recognition during inference. Evaluations on comprehensive image recognition benchmarks demonstrate RAM++ exceeds existing state-of-the-art (SOTA) fundamental image recognition models on most aspects. Specifically, for predefined common-used tag categories, RAM++ showcases 10.2 mAP and 15.4 mAP enhancements over CLIP on OpenImages and ImageNet. For open-set categories beyond predefined, RAM++ records improvements of 5 mAP and 6.4 mAP over CLIP and RAM respectively on OpenImages. For diverse human-object interaction phrases, RAM++ achieves 7.8 mAP and 4.7 mAP improvements on the HICO benchmark. Code, datasets and pre-trained models are available at \url{https://github.com/xinyu1205/recognize-anything}.
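A minimal sketch of scoring an image against LLM-generated tag descriptions, assuming precomputed embeddings and a max-over-descriptions pooling rule; names and shapes are illustrative, not RAM++'s actual architecture.

```python
# Score an image embedding against embeddings of several generated descriptions per
# tag, then train with a multi-label tagging loss.
import torch
import torch.nn.functional as F

def tag_logits(image_emb, tag_desc_emb, scale=100.0):
    # image_emb: (B, d); tag_desc_emb: (T, K, d) = T tags x K generated descriptions
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(tag_desc_emb, dim=-1)
    sim = torch.einsum("bd,tkd->btk", img, txt)    # similarity to every description
    return scale * sim.max(dim=-1).values          # best description represents the tag

B, T, K, d = 4, 10, 3, 64
logits = tag_logits(torch.randn(B, d), torch.randn(T, K, d))
targets = torch.randint(0, 2, (B, T)).float()      # multi-hot tag labels
print(float(F.binary_cross_entropy_with_logits(logits, targets)))
```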

Invariant Feature Regularization for Fair Face Recognition

  • paper_url: http://arxiv.org/abs/2310.14652
  • repo_url: https://github.com/panasonicconnect/invreg
  • paper_authors: Jiali Ma, Zhongqi Yue, Kagaya Tomoyuki, Suzuki Tomoki, Karlekar Jayashree, Sugiri Pranata, Hanwang Zhang
  • for: The paper aims to address the issue of bias in face recognition systems due to the imbalanced demographic attributes in the training data.
  • methods: The proposed method, called Invariant Feature Regularization (INV-REG), uses unsupervised data partitioning to generate diverse partitions that act as self-annotated confounders, allowing the model to deconfound and learn invariant features that generalize well across different demographic groups.
  • results: The proposed method improves face recognition performance on a variety of demographic groups, achieving new state-of-the-art results when combined with two strong baselines (Arcface and CIFP).
    Abstract Fair face recognition is all about learning invariant feature that generalizes to unseen faces in any demographic group. Unfortunately, face datasets inevitably capture the imbalanced demographic attributes that are ubiquitous in real-world observations, and the model learns biased feature that generalizes poorly in the minority group. We point out that the bias arises due to the confounding demographic attributes, which mislead the model to capture the spurious demographic-specific feature. The confounding effect can only be removed by causal intervention, which requires the confounder annotations. However, such annotations can be prohibitively expensive due to the diversity of the demographic attributes. To tackle this, we propose to generate diverse data partitions iteratively in an unsupervised fashion. Each data partition acts as a self-annotated confounder, enabling our Invariant Feature Regularization (INV-REG) to deconfound. INV-REG is orthogonal to existing methods, and combining INV-REG with two strong baselines (Arcface and CIFP) leads to new state-of-the-art that improves face recognition on a variety of demographic groups. Code is available at https://github.com/PanasonicConnect/InvReg.
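The exact form of the invariance regularizer is not given here; the sketch below uses an IRM-style penalty across the self-annotated partitions as one plausible instantiation (an assumption, not necessarily the paper's formulation).

```python
# Treat each unsupervised data partition as an "environment" and penalize features
# whose optimal classifier varies across environments (IRMv1-style penalty).
import torch

def irm_penalty(logits, labels):
    scale = torch.ones(1, requires_grad=True, device=logits.device)
    loss = torch.nn.functional.cross_entropy(logits * scale, labels)
    (grad,) = torch.autograd.grad(loss, scale, create_graph=True)
    return (grad ** 2).sum()

def invariant_loss(logits_per_partition, labels_per_partition, lam=1.0):
    erm = sum(torch.nn.functional.cross_entropy(l, y)
              for l, y in zip(logits_per_partition, labels_per_partition))
    pen = sum(irm_penalty(l, y)
              for l, y in zip(logits_per_partition, labels_per_partition))
    n = len(logits_per_partition)
    return erm / n + lam * pen / n

parts = [torch.randn(16, 10, requires_grad=True) for _ in range(3)]   # toy per-partition logits
labels = [torch.randint(0, 10, (16,)) for _ in range(3)]
print(float(invariant_loss(parts, labels)))
```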

Relit-NeuLF: Efficient Relighting and Novel View Synthesis via Neural 4D Light Field

  • paper_url: http://arxiv.org/abs/2310.14642
  • repo_url: https://github.com/oppo-us-research/relitneulf
  • paper_authors: Zhong Li, Liangchen Song, Zhang Chen, Xiangyu Du, Lele Chen, Junsong Yuan, Yi Xu
  • for: Addresses simultaneous relighting and novel view synthesis of complex scenes from multi-view images with a limited number of light sources, using an analysis-synthesis approach called Relit-NeuLF.
  • methods: Relit-NeuLF first uses a two-plane light field representation to parameterize each ray in a 4D coordinate system, enabling efficient learning and inference; it then recovers the spatially-varying BRDF (SVBRDF) in a self-supervised manner, with a DecomposeNet mapping each ray to albedo, normal, and roughness, and a RenderNet synthesizing ray colors from the decomposed BRDF components and conditioning light directions.
  • results: Experiments show the method is efficient and effective, achieving state-of-the-art results on both synthetic data and real human-face data while learning the scene SVBRDF self-supervisedly. Code is released at https://github.com/oppo-us-research/RelitNeuLF.
    Abstract In this paper, we address the problem of simultaneous relighting and novel view synthesis of a complex scene from multi-view images with a limited number of light sources. We propose an analysis-synthesis approach called Relit-NeuLF. Following the recent neural 4D light field network (NeuLF), Relit-NeuLF first leverages a two-plane light field representation to parameterize each ray in a 4D coordinate system, enabling efficient learning and inference. Then, we recover the spatially-varying bidirectional reflectance distribution function (SVBRDF) of a 3D scene in a self-supervised manner. A DecomposeNet learns to map each ray to its SVBRDF components: albedo, normal, and roughness. Based on the decomposed BRDF components and conditioning light directions, a RenderNet learns to synthesize the color of the ray. To self-supervise the SVBRDF decomposition, we encourage the predicted ray color to be close to the physically-based rendering result using the microfacet model. Comprehensive experiments demonstrate that the proposed method is efficient and effective on both synthetic data and real-world human face data, and outperforms the state-of-the-art results. We publicly released our code on GitHub. You can find it here: https://github.com/oppo-us-research/RelitNeuLF
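A minimal sketch of the two-plane (u, v, s, t) ray parameterization mentioned above; the plane depths are illustrative assumptions.

```python
# Parameterize a ray by its intersections with two parallel planes z = z0 and z = z1.
import numpy as np

def ray_to_uvst(origin, direction, z0=0.0, z1=1.0):
    """Return the 4D light-field coordinates (u, v, s, t) of a ray."""
    direction = direction / np.linalg.norm(direction)
    t0 = (z0 - origin[2]) / direction[2]
    t1 = (z1 - origin[2]) / direction[2]
    u, v = (origin + t0 * direction)[:2]   # hit point on the first plane
    s, t = (origin + t1 * direction)[:2]   # hit point on the second plane
    return np.array([u, v, s, t])

print(ray_to_uvst(np.array([0.2, -0.1, -2.0]), np.array([0.1, 0.05, 1.0])))
```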

Semantic-Aware Adversarial Training for Reliable Deep Hashing Retrieval

  • paper_url: http://arxiv.org/abs/2310.14637
  • repo_url: https://github.com/xandery-geek/SAAT
  • paper_authors: Xu Yuan, Zheng Zhang, Xunguang Wang, Lin Wu
  • for: Improves the adversarial robustness of deep hashing models by proposing Semantic-Aware Adversarial Training (SAAT).
  • methods: Proposes Discriminative Mainstay Features Learning (DMFL) to construct reliable mainstay (semantic representative) features that guide adversarial learning in deep hashing; DMFL is optimized in a discriminative learning manner that jointly considers discriminative and semantic properties, adversarial examples are crafted by maximizing the Hamming distance between their hash codes and the mainstay features, and adversarial training is formalized as a unified minimax optimization guided by the generated mainstay codes.
  • results: Experiments on benchmark datasets show superb attack performance against state-of-the-art algorithms, while the proposed adversarial training effectively eliminates adversarial perturbations for trustworthy deep-hashing-based retrieval.
    Abstract Deep hashing has been intensively studied and successfully applied in large-scale image retrieval systems due to its efficiency and effectiveness. Recent studies have recognized that the existence of adversarial examples poses a security threat to deep hashing models, that is, adversarial vulnerability. Notably, it is challenging to efficiently distill reliable semantic representatives for deep hashing to guide adversarial learning, and thereby it hinders the enhancement of adversarial robustness of deep hashing-based retrieval models. Moreover, current researches on adversarial training for deep hashing are hard to be formalized into a unified minimax structure. In this paper, we explore Semantic-Aware Adversarial Training (SAAT) for improving the adversarial robustness of deep hashing models. Specifically, we conceive a discriminative mainstay features learning (DMFL) scheme to construct semantic representatives for guiding adversarial learning in deep hashing. Particularly, our DMFL with the strict theoretical guarantee is adaptively optimized in a discriminative learning manner, where both discriminative and semantic properties are jointly considered. Moreover, adversarial examples are fabricated by maximizing the Hamming distance between the hash codes of adversarial samples and mainstay features, the efficacy of which is validated in the adversarial attack trials. Further, we, for the first time, formulate the formalized adversarial training of deep hashing into a unified minimax optimization under the guidance of the generated mainstay codes. Extensive experiments on benchmark datasets show superb attack performance against the state-of-the-art algorithms, meanwhile, the proposed adversarial training can effectively eliminate adversarial perturbations for trustworthy deep hashing-based retrieval. Our code is available at https://github.com/xandery-geek/SAAT.
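A hedged sketch of crafting an adversarial query by pushing a relaxed hash code away from a mainstay code (PGD-style); the toy model, step sizes, and budget below are placeholders rather than the paper's settings.

```python
# Maximize the Hamming distance between the query's relaxed hash code and a target
# "mainstay" code, which is equivalent to minimizing their inner product.
import torch

def hamming_attack(model, x, mainstay_code, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        code = torch.tanh(model(x_adv))                 # relaxed hash code in (-1, 1)
        loss = (code * mainstay_code).sum()             # inner product with mainstay code
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() - alpha * grad.sign()    # descend on the inner product
        x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 48))  # toy hashing head
x = torch.rand(1, 3, 32, 32)
mainstay = torch.sign(torch.randn(1, 48))
x_adv = hamming_attack(model, x, mainstay)
print((x_adv - x).abs().max().item())   # stays within the eps budget
```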

Multilevel Perception Boundary-guided Network for Breast Lesion Segmentation in Ultrasound Images

  • paper_url: http://arxiv.org/abs/2310.14636
  • repo_url: None
  • paper_authors: Xing Yang, Jian Zhang, Qijian Chen, Li Wang, Lihui Wang
  • for: Breast tumor segmentation from ultrasound images.
  • methods: A multilevel global perception module (MGPM), a boundary guided module (BGM), and a multi-level boundary-enhanced segmentation (BS) loss.
  • results: Improved segmentation of tumor boundaries, outperforming state-of-the-art methods on both qualitative and quantitative evaluation metrics; Dice score, Jaccard coefficient, Specificity, and HD95 improved by 0.70%, 1.1%, 0.1%, and 2.5%, respectively.
    Abstract Automatic segmentation of breast tumors from the ultrasound images is essential for the subsequent clinical diagnosis and treatment plan. Although the existing deep learning-based methods have achieved significant progress in automatic segmentation of breast tumor, their performance on tumors with similar intensity to the normal tissues is still not pleasant, especially for the tumor boundaries. To address this issue, we propose a PBNet composed by a multilevel global perception module (MGPM) and a boundary guided module (BGM) to segment breast tumors from ultrasound images. Specifically, in MGPM, the long-range spatial dependence between the voxels in a single level feature maps are modeled, and then the multilevel semantic information is fused to promote the recognition ability of the model for non-enhanced tumors. In BGM, the tumor boundaries are extracted from the high-level semantic maps using the dilation and erosion effects of max pooling, such boundaries are then used to guide the fusion of low and high-level features. Moreover, to improve the segmentation performance for tumor boundaries, a multi-level boundary-enhanced segmentation (BS) loss is proposed. The extensive comparison experiments on both publicly available dataset and in-house dataset demonstrate that the proposed PBNet outperforms the state-of-the-art methods in terms of both qualitative visualization results and quantitative evaluation metrics, with the Dice score, Jaccard coefficient, Specificity and HD95 improved by 0.70%, 1.1%, 0.1% and 2.5% respectively. In addition, the ablation experiments validate that the proposed MGPM is indeed beneficial for distinguishing the non-enhanced tumors and the BGM as well as the BS loss are also helpful for refining the segmentation contours of the tumor.
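The dilation/erosion trick used by the boundary guided module can be sketched with max pooling; the kernel size below is an assumption.

```python
# Extract a soft boundary from a segmentation probability map: dilation minus erosion,
# both implemented with max pooling.
import torch
import torch.nn.functional as F

def boundary_from_prob(prob, k=5):
    """prob: (B, 1, H, W) in [0, 1]; returns a soft boundary map of the same shape."""
    pad = k // 2
    dilated = F.max_pool2d(prob, kernel_size=k, stride=1, padding=pad)
    eroded = -F.max_pool2d(-prob, kernel_size=k, stride=1, padding=pad)
    return dilated - eroded            # non-zero only near the region contour

prob = torch.zeros(1, 1, 64, 64)
prob[:, :, 20:44, 20:44] = 1.0         # a square "tumor"
print(boundary_from_prob(prob).sum().item())
```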

Pre-Training LiDAR-Based 3D Object Detectors Through Colorization

  • paper_url: http://arxiv.org/abs/2310.14592
  • repo_url: None
  • paper_authors: Tai-Yu Pan, Chenyang Ma, Tianle Chen, Cheng Perng Phoo, Katie Z Luo, Yurong You, Mark Campbell, Kilian Q. Weinberger, Bharath Hariharan, Wei-Lun Chao
  • for: Improves 3D object detection and scene understanding for self-driving cars when only limited labeled data is available.
  • methods: An innovative pre-training approach, Grounded Point Colorization (GPC), which teaches the model to colorize LiDAR point clouds, equipping it with semantic cues; ground-truth colors are provided as hints ("context") to handle color variation and selection bias.
  • results: Experiments on KITTI and Waymo show that GPC significantly improves accuracy and data efficiency; notably, with only 20% of the KITTI data, GPC outperforms training from scratch on the entire dataset.
    Abstract Accurate 3D object detection and understanding for self-driving cars heavily relies on LiDAR point clouds, necessitating large amounts of labeled data to train. In this work, we introduce an innovative pre-training approach, Grounded Point Colorization (GPC), to bridge the gap between data and labels by teaching the model to colorize LiDAR point clouds, equipping it with valuable semantic cues. To tackle challenges arising from color variations and selection bias, we incorporate color as "context" by providing ground-truth colors as hints during colorization. Experimental results on the KITTI and Waymo datasets demonstrate GPC's remarkable effectiveness. Even with limited labeled data, GPC significantly improves fine-tuning performance; notably, on just 20% of the KITTI dataset, GPC outperforms training from scratch with the entire dataset. In sum, we introduce a fresh perspective on pre-training for 3D object detection, aligning the objective with the model's intended role and ultimately advancing the accuracy and efficiency of 3D object detection for autonomous vehicles.

Tensor Decomposition Based Attention Module for Spiking Neural Networks

  • paper_url: http://arxiv.org/abs/2310.14576
  • repo_url: None
  • paper_authors: Haoyu Deng, Ruijie Zhu, Xuerui Qiu, Yule Duan, Malu Zhang, Liangjian Deng
  • for: Improving the performance of spiking neural networks (SNNs).
  • methods: Using tensor decomposition to implement a tensor-aware attention mechanism, the projected full attention (PFA) module, composed of a linear projection of spike tensor (LPST) module and an attention map composing (AMC) module.
  • results: Achieving state-of-the-art performance on both static and dynamic benchmark datasets, outperforming existing SNN models with Transformer-based and CNN-based backbones.
    Abstract The attention mechanism has been proven to be an effective way to improve spiking neural network (SNN). However, based on the fact that the current SNN input data flow is split into tensors to process on GPUs, none of the previous works consider the properties of tensors to implement an attention module. This inspires us to rethink current SNN from the perspective of tensor-relevant theories. Using tensor decomposition, we design the \textit{projected full attention} (PFA) module, which demonstrates excellent results with linearly growing parameters. Specifically, PFA is composed by the \textit{linear projection of spike tensor} (LPST) module and \textit{attention map composing} (AMC) module. In LPST, we start by compressing the original spike tensor into three projected tensors using a single property-preserving strategy with learnable parameters for each dimension. Then, in AMC, we exploit the inverse procedure of the tensor decomposition process to combine the three tensors into the attention map using a so-called connecting factor. To validate the effectiveness of the proposed PFA module, we integrate it into the widely used VGG and ResNet architectures for classification tasks. Our method achieves state-of-the-art performance on both static and dynamic benchmark datasets, surpassing the existing SNN models with Transformer-based and CNN-based backbones.
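A rough, rank-1 simplification of the tensor-decomposition attention idea (the dimensions, rank, and composing rule below are assumptions, not PFA itself): project the spike tensor along each axis and recombine the factors into a full-size attention map.

```python
# One learnable factor per tensor axis, recombined by an outer product.
import torch
import torch.nn as nn

class AxisFactorAttention(nn.Module):
    def __init__(self, C, H, W):
        super().__init__()
        self.fc, self.fh, self.fw = nn.Linear(C, C), nn.Linear(H, H), nn.Linear(W, W)

    def forward(self, x):                                # x: (B, C, H, W) spike tensor
        a = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))   # (B, C) channel factor
        b = torch.sigmoid(self.fh(x.mean(dim=(1, 3))))   # (B, H) height factor
        c = torch.sigmoid(self.fw(x.mean(dim=(1, 2))))   # (B, W) width factor
        attn = torch.einsum("bc,bh,bw->bchw", a, b, c)   # compose the attention map
        return x * attn

x = (torch.rand(2, 16, 8, 8) > 0.7).float()              # binary spike-like input
print(AxisFactorAttention(16, 8, 8)(x).shape)
```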

DICE: Diverse Diffusion Model with Scoring for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2310.14570
  • repo_url: None
  • paper_authors: Younwoo Choi, Ray Coden Mercurius, Soheil Mohamad Alizadeh Shabestary, Amir Rasouli
  • for: Predicting future trajectories of road users (e.g., pedestrians and vehicles) to improve the safety and effectiveness of autonomous driving.
  • methods: A diffusion-model-based framework with an efficient sampling mechanism that maximizes the number of sampled trajectories while keeping inference real-time, together with a scoring mechanism that ranks trajectories to select the most plausible ones.
  • results: State-of-the-art performance on several subsets and metrics of the UCY/ETH and nuScenes benchmarks.
    Abstract Road user trajectory prediction in dynamic environments is a challenging but crucial task for various applications, such as autonomous driving. One of the main challenges in this domain is the multimodal nature of future trajectories stemming from the unknown yet diverse intentions of the agents. Diffusion models have shown to be very effective in capturing such stochasticity in prediction tasks. However, these models involve many computationally expensive denoising steps and sampling operations that make them a less desirable option for real-time safety-critical applications. To this end, we present a novel framework that leverages diffusion models for predicting future trajectories in a computationally efficient manner. To minimize the computational bottlenecks in iterative sampling, we employ an efficient sampling mechanism that allows us to maximize the number of sampled trajectories for improved accuracy while maintaining inference time in real time. Moreover, we propose a scoring mechanism to select the most plausible trajectories by assigning relative ranks. We show the effectiveness of our approach by conducting empirical evaluations on common pedestrian (UCY/ETH) and autonomous driving (nuScenes) benchmark datasets on which our model achieves state-of-the-art performance on several subsets and metrics.

F$^2$AT: Feature-Focusing Adversarial Training via Disentanglement of Natural and Perturbed Patterns

  • paper_url: http://arxiv.org/abs/2310.14561
  • repo_url: None
  • paper_authors: Yaguan Qian, Chenyu Zhao, Zhaoquan Gu, Bin Wang, Shouling Ji, Wei Wang, Boyang Zhou, Pan Zhou
  • for: Defending deep neural networks (DNNs) against adversarial examples in critical applications such as self-driving cars, surveillance security, and medical diagnosis.
  • methods: Proposes Feature-Focusing Adversarial Training (F$^2$AT), which disentangles natural and perturbed patterns by bit-plane slicing, so the model focuses on core features from natural patterns and reduces the impact of spurious features from perturbed patterns.
  • results: F$^2$AT outperforms previous methods in both clean accuracy and adversarial robustness.
    Abstract Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by well-designed perturbations. This could lead to disastrous results on critical applications such as self-driving cars, surveillance security, and medical diagnosis. At present, adversarial training is one of the most effective defenses against adversarial examples. However, traditional adversarial training makes it difficult to achieve a good trade-off between clean accuracy and robustness since spurious features are still learned by DNNs. The intrinsic reason is that traditional adversarial training makes it difficult to fully learn core features from adversarial examples when adversarial noise and clean examples cannot be disentangled. In this paper, we disentangle the adversarial examples into natural and perturbed patterns by bit-plane slicing. We assume the higher bit-planes represent natural patterns and the lower bit-planes represent perturbed patterns, respectively. We propose a Feature-Focusing Adversarial Training (F$^2$AT), which differs from previous work in that it enforces the model to focus on the core features from natural patterns and reduce the impact of spurious features from perturbed patterns. The experimental results demonstrated that F$^2$AT outperforms state-of-the-art methods in clean accuracy and adversarial robustness.
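Bit-plane slicing itself is straightforward to sketch; the split into the top four planes as "natural" patterns and the bottom four as "perturbed" patterns is an illustrative assumption.

```python
# Split an 8-bit image into high bit-planes and low bit-planes.
import numpy as np

def split_bitplanes(img_uint8: np.ndarray, high_planes: int = 4):
    planes = [((img_uint8 >> b) & 1) << b for b in range(8)]
    low = sum(planes[: 8 - high_planes]).astype(np.uint8)    # lower bit-planes
    high = sum(planes[8 - high_planes:]).astype(np.uint8)    # higher bit-planes
    return high, low                                          # high + low == img_uint8

img = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
high, low = split_bitplanes(img)
assert np.array_equal(high + low, img)
print(high.max(), low.max())
```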

Polyhedral Surface: Self-supervised Point Cloud Reconstruction Based on Polyhedral Surface

  • paper_url: http://arxiv.org/abs/2310.14560
  • repo_url: None
  • paper_authors: Hui Tian, Kai Xu
  • for: Point cloud reconstruction from raw point clouds, in particular building a local geometry that fits the local surface, a long-standing problem in computer graphics.
  • methods: A novel polyhedral-surface representation of the local surface that is more flexible for representing sharp features and surface boundaries on open surfaces and requires no local coordinate system; the polyhedral surface is constructed from normals, using dihedral and trihedral surfaces built from 2 and 3 normals, respectively.
  • results: State-of-the-art results on three commonly used datasets (ShapeNetCore, ABC, and ScanNet).
    Abstract Point cloud reconstruction from raw point cloud has been an important topic in computer graphics for decades, especially due to its high demand in modeling and rendering applications. An important way to solve this problem is establishing a local geometry to fit the local curve. However, previous methods build either a local plane or polynomial curve. Local plane brings the loss of sharp feature and the boundary artefacts on open surface. Polynomial curve is hard to combine with neural network due to the local coordinate consistent problem. To address this, we propose a novel polyhedral surface to represent local surface. This method provides more flexible to represent sharp feature and surface boundary on open surface. It does not require any local coordinate system, which is important when introducing neural networks. Specifically, we use normals to construct the polyhedral surface, including both dihedral and trihedral surfaces using 2 and 3 normals, respectively. Our method achieves state-of-the-art results on three commonly used datasets (ShapeNetCore, ABC, and ScanNet). Code will be released upon acceptance.

S3Aug: Segmentation, Sampling, and Shift for Action Recognition

  • paper_url: http://arxiv.org/abs/2310.14556
  • repo_url: None
  • paper_authors: Taiki Sugiura, Toru Tamaki
  • for: Proposes S3Aug, a video data augmentation method for improving action recognition.
  • methods: Generates new videos from a single training video through segmentation and label-to-image transformation, samples certain categories of label images to produce a variety of videos, and shifts intermediate features to enhance temporal coherency between generated frames.
  • results: Experiments on UCF101, HMDB51, and Mimetics show the method is effective, particularly for out-of-context videos in the Mimetics dataset.
    Abstract Action recognition is a well-established area of research in computer vision. In this paper, we propose S3Aug, a video data augmenatation for action recognition. Unlike conventional video data augmentation methods that involve cutting and pasting regions from two videos, the proposed method generates new videos from a single training video through segmentation and label-to-image transformation. Furthermore, the proposed method modifies certain categories of label images by sampling to generate a variety of videos, and shifts intermediate features to enhance the temporal coherency between frames of the generate videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets demonstrate the effectiveness of the proposed method, paricularlly for out-of-context videos of the Mimetics dataset.

Practical Deep Dispersed Watermarking with Synchronization and Fusion

  • paper_url: http://arxiv.org/abs/2310.14532
  • repo_url: https://github.com/bytedance/dwsf
  • paper_authors: Hengchang Guo, Qilong Zhang, Junwei Luo, Feng Guo, Wenbin Zhang, Xiaodong Su, Minglei Li
  • for: Proposes a practical deep watermarking scheme that handles arbitrary-resolution images, especially today's widespread high-resolution images.
  • methods: A dispersed embedding scheme that sparsely and randomly selects several fixed small-size cover blocks in which a well-trained encoder embeds a consistent watermark; at extraction, a watermark synchronization module locates and rectifies the encoded blocks, and a similarity-based message fusion strategy combines the decoded messages into a reliable one.
  • results: Compared with state-of-the-art approaches, the blind watermarking improves bit accuracy by 5.28% and 5.93% on average against single and combined attacks, respectively, with smaller file-size increase and better visual quality.
    Abstract Deep learning based blind watermarking works have gradually emerged and achieved impressive performance. However, previous deep watermarking studies mainly focus on fixed low-resolution images while paying less attention to arbitrary resolution images, especially widespread high-resolution images nowadays. Moreover, most works usually demonstrate robustness against typical non-geometric attacks (\textit{e.g.}, JPEG compression) but ignore common geometric attacks (\textit{e.g.}, Rotate) and more challenging combined attacks. To overcome the above limitations, we propose a practical deep \textbf{D}ispersed \textbf{W}atermarking with \textbf{S}ynchronization and \textbf{F}usion, called \textbf{\proposed}. Specifically, given an arbitrary-resolution cover image, we adopt a dispersed embedding scheme which sparsely and randomly selects several fixed small-size cover blocks to embed a consistent watermark message by a well-trained encoder. In the extraction stage, we first design a watermark synchronization module to locate and rectify the encoded blocks in the noised watermarked image. We then utilize a decoder to obtain messages embedded in these blocks, and propose a message fusion strategy based on similarity to make full use of the consistency among messages, thus determining a reliable message. Extensive experiments conducted on different datasets convincingly demonstrate the effectiveness of our proposed {\proposed}. Compared with state-of-the-art approaches, our blind watermarking can achieve better performance: averagely improve the bit accuracy by 5.28\% and 5.93\% against single and combined attacks, respectively, and show less file size increment and better visual quality. Our code is available at https://github.com/bytedance/DWSF.
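A sketch of key-seeded dispersed block selection so that the embedder and the extractor agree on block positions; the block size and count are illustrative, not the paper's settings.

```python
# Randomly but reproducibly pick small embedding blocks inside an arbitrary-resolution
# cover image, seeded by a shared secret key.
import numpy as np

def select_blocks(height, width, key, block=128, num_blocks=8):
    rng = np.random.default_rng(key)                 # shared secret key
    ys = rng.integers(0, height - block, size=num_blocks)
    xs = rng.integers(0, width - block, size=num_blocks)
    return list(zip(ys.tolist(), xs.tolist()))       # top-left corners of embed blocks

embed_positions = select_blocks(2160, 3840, key=42)
extract_positions = select_blocks(2160, 3840, key=42)
assert embed_positions == extract_positions          # embedder and extractor agree
print(embed_positions[:3])
```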

Poster: Real-Time Object Substitution for Mobile Diminished Reality with Edge Computing

  • paper_url: http://arxiv.org/abs/2310.14511
  • repo_url: None
  • paper_authors: Hongyu Ke, Haoxin Wang
  • for: This research aims to provide a reliable, real-time diminished reality (DR) architecture for mobile devices, enabling high-quality real-time scene construction.
  • methods: The research uses edge computing to achieve real-time object substitution, and proposes an end-to-end architecture to facilitate high-quality real-time scene construction.
  • results: The research obtains a reliable, real-time DR architecture and demonstrates its high quality and reliability.
    Abstract Diminished Reality (DR) is considered as the conceptual counterpart to Augmented Reality (AR), and has recently gained increasing attention from both industry and academia. Unlike AR which adds virtual objects to the real world, DR allows users to remove physical content from the real world. When combined with object replacement technology, it presents an further exciting avenue for exploration within the metaverse. Although a few researches have been conducted on the intersection of object substitution and DR, there is no real-time object substitution for mobile diminished reality architecture with high quality. In this paper, we propose an end-to-end architecture to facilitate immersive and real-time scene construction for mobile devices with edge computing.

ADoPT: LiDAR Spoofing Attack Detection Based on Point-Level Temporal Consistency

  • paper_url: http://arxiv.org/abs/2310.14504
  • repo_url: None
  • paper_authors: Minkyoung Cho, Yulong Cao, Zixiang Zhou, Z. Morley Mao
  • for: Addresses the challenge of LiDAR spoofing attacks on autonomous vehicles by proposing a novel framework called ADoPT.
  • methods: Uses point-level temporal consistency to identify abnormal objects based on the coherency of point clusters across consecutive frames.
  • results: Evaluation on the nuScenes dataset shows the algorithm effectively counters various LiDAR spoofing attacks with a low false positive ratio (< 10%) and a high true positive ratio (> 85%), outperforming existing state-of-the-art defenses (CARLO and 3D-TC2).
    Abstract Deep neural networks (DNNs) are increasingly integrated into LiDAR (Light Detection and Ranging)-based perception systems for autonomous vehicles (AVs), requiring robust performance under adversarial conditions. We aim to address the challenge of LiDAR spoofing attacks, where attackers inject fake objects into LiDAR data and fool AVs to misinterpret their environment and make erroneous decisions. However, current defense algorithms predominantly depend on perception outputs (i.e., bounding boxes) thus face limitations in detecting attackers given the bounding boxes are generated by imperfect perception models processing limited points, acquired based on the ego vehicle's viewpoint. To overcome these limitations, we propose a novel framework, named ADoPT (Anomaly Detection based on Point-level Temporal consistency), which quantitatively measures temporal consistency across consecutive frames and identifies abnormal objects based on the coherency of point clusters. In our evaluation using the nuScenes dataset, our algorithm effectively counters various LiDAR spoofing attacks, achieving a low (< 10%) false positive ratio (FPR) and high (> 85%) true positive ratio (TPR), outperforming existing state-of-the-art defense methods, CARLO and 3D-TC2. Furthermore, our evaluation demonstrates the promising potential for accurate attack detection across various road environments.
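A toy sketch of point-level temporal consistency: keep an object proposal only if its point cluster overlaps a cluster from the previous frame (ego-motion compensation omitted; thresholds are assumptions, not the paper's settings).

```python
# Check whether a current-frame point cluster is temporally consistent with the
# clusters observed in the previous frame, using axis-aligned 3D box IoU.
import numpy as np

def cluster_iou_3d(box_a, box_b):
    """IoU between two axis-aligned boxes given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda b: np.prod(b[1] - b[0])
    return inter / (vol(box_a) + vol(box_b) - inter + 1e-9)

def temporally_consistent(curr_cluster, prev_clusters, iou_thresh=0.3):
    box = (curr_cluster.min(axis=0), curr_cluster.max(axis=0))
    return any(cluster_iou_3d(box, (p.min(axis=0), p.max(axis=0))) > iou_thresh
               for p in prev_clusters)

prev = [np.random.default_rng(0).uniform(0, 1, (50, 3)) + np.array([10, 5, 0])]
genuine = np.random.default_rng(1).uniform(0, 1, (60, 3)) + np.array([10.1, 5.0, 0])
spoofed = np.random.default_rng(2).uniform(0, 1, (60, 3)) + np.array([30, -4, 0])
print(temporally_consistent(genuine, prev), temporally_consistent(spoofed, prev))
```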

MSFormer: A Skeleton-multiview Fusion Method For Tooth Instance Segmentation

  • paper_url: http://arxiv.org/abs/2310.14489
  • repo_url: None
  • paper_authors: Yuan Li, Huan Liu, Yubo Tao, Xiangyang He, Haifeng Li, Xiaohu Guo, Hai Lin
  • for: Improves deep-learning-based tooth segmentation so that high-precision segmentation can be achieved with limited training data.
  • methods: A 2D-3D joint perception approach that uses skeletons as lightweight 3D inputs; MSFormer adds two lightweight modules to existing multiview-based models, a 3D-skeleton perception module and a skeleton-image contrastive learning module that fuses multiview and skeleton perceptions.
  • results: Combined with large pre-trained multiview models, MSFormer achieves state-of-the-art tooth segmentation with only 100 training meshes, and segmentation accuracy improves by 2.4%-5.5% as the volume of training data increases.
    Abstract Recently, deep learning-based tooth segmentation methods have been limited by the expensive and time-consuming processes of data collection and labeling. Achieving high-precision segmentation with limited datasets is critical. A viable solution to this entails fine-tuning pre-trained multiview-based models, thereby enhancing performance with limited data. However, relying solely on two-dimensional (2D) images for three-dimensional (3D) tooth segmentation can produce suboptimal outcomes because of occlusion and deformation, i.e., incomplete and distorted shape perception. To improve this fine-tuning-based solution, this paper advocates 2D-3D joint perception. The fundamental challenge in employing 2D-3D joint perception with limited data is that the 3D-related inputs and modules must follow a lightweight policy instead of using huge 3D data and parameter-rich modules that require extensive training data. Following this lightweight policy, this paper selects skeletons as the 3D inputs and introduces MSFormer, a novel method for tooth segmentation. MSFormer incorporates two lightweight modules into existing multiview-based models: a 3D-skeleton perception module to extract 3D perception from skeletons and a skeleton-image contrastive learning module to obtain the 2D-3D joint perception by fusing both multiview and skeleton perceptions. The experimental results reveal that MSFormer paired with large pre-trained multiview models achieves state-of-the-art performance, requiring only 100 training meshes. Furthermore, the segmentation accuracy is improved by 2.4%-5.5% with the increasing volume of training data.

Player Re-Identification Using Body Part Appearences

  • paper_url: http://arxiv.org/abs/2310.14469
  • repo_url: https://github.com/abhinine4/Soccerplayer_Reidentification
  • paper_authors: Mahesh Bhosale, Abhishek Kumar, David Doermann
  • for: Proposes a neural network architecture that learns body-part appearances for soccer player re-identification.
  • methods: A two-stream network (one stream for appearance map extraction, the other for body part map extraction) with a bilinear-pooling layer that generates and spatially pools the body part map; each local feature of the body part map is obtained by a bilinear mapping of the corresponding local appearance and body-part descriptors.
  • results: The model outperforms state-of-the-art models such as OsNet and InceptionNet on the SoccerNet-V3 dataset.
    Abstract We propose a neural network architecture that learns body part appearances for soccer player re-identification. Our model consists of a two-stream network (one stream for appearance map extraction and the other for body part map extraction) and a bilinear-pooling layer that generates and spatially pools the body part map. Each local feature of the body part map is obtained by a bilinear mapping of the corresponding local appearance and body part descriptors. Our novel representation yields a robust image-matching feature map, which results from combining the local similarities of the relevant body parts with the weighted appearance similarity. Our model does not require any part annotation on the SoccerNet-V3 re-identification dataset to train the network. Instead, we use a sub-network of an existing pose estimation network (OpenPose) to initialize the part substream and then train the entire network to minimize the triplet loss. The appearance stream is pre-trained on the ImageNet dataset, and the part stream is trained from scratch for the SoccerNet-V3 dataset. We demonstrate the validity of our model by showing that it outperforms state-of-the-art models such as OsNet and InceptionNet.
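A minimal sketch of bilinear pooling of the appearance stream with the body-part stream; channel counts and the softmax over parts are assumptions, not the paper's exact configuration.

```python
# Pool local appearance features weighted by each body-part response into one
# part-aware image-matching descriptor.
import torch

def bilinear_part_pooling(appearance, parts):
    # appearance: (B, C, H, W) features; parts: (B, P, H, W) body-part maps
    B, C, H, W = appearance.shape
    P = parts.shape[1]
    feat = torch.einsum("bchw,bphw->bpc", appearance, parts.softmax(dim=1)) / (H * W)
    return feat.reshape(B, P * C)          # (B, P*C) descriptor

desc = bilinear_part_pooling(torch.randn(2, 256, 16, 8), torch.randn(2, 6, 16, 8))
print(desc.shape)                           # torch.Size([2, 1536])
```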