cs.CV - 2023-07-16

Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector

  • paper_url: http://arxiv.org/abs/2307.08076
  • repo_url: None
  • paper_authors: Shuo-Yen Lin, Ernie Chu, Che-Hsien Lin, Jun-Cheng Chen, Jia-Ching Wang
  • For: The paper aims to address the issue of poor-quality physical adversarial patches for protecting personal privacy from malicious monitoring using object detectors.
  • Methods: The proposed method uses diffusion models (DM) to generate naturalistic adversarial patches that are high-quality and stable, without suffering from mode collapse.
  • Results: The proposed approach achieves better-quality and more naturalistic adversarial patches than other state-of-the-art patch generation methods, with acceptable attack performance and various generation trade-offs under different conditions.
    Abstract Many physical adversarial patch generation methods are widely proposed to protect personal privacy from malicious monitoring using object detectors. However, they usually fail to generate satisfactory patch images in terms of both stealthiness and attack performance without making huge efforts on careful hyperparameter tuning. To address this issue, we propose a novel naturalistic adversarial patch generation method based on the diffusion models (DM). Through sampling the optimal image from the DM model pretrained upon natural images, it allows us to stably craft high-quality and naturalistic physical adversarial patches to humans without suffering from serious mode collapse problems as other deep generative models. To the best of our knowledge, we are the first to propose DM-based naturalistic adversarial patch generation for object detectors. With extensive quantitative, qualitative, and subjective experiments, the results demonstrate the effectiveness of the proposed approach to generate better-quality and more naturalistic adversarial patches while achieving acceptable attack performance than other state-of-the-art patch generation methods. We also show various generation trade-offs under different conditions.
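The approach above optimizes a patch, decoded from a pretrained diffusion model, so that pasting it into a scene suppresses an object detector's confidence. Below is a minimal sketch of that optimization loop under strong simplifications: `decode_patch` and `toy_detector` are tiny stand-ins for the frozen diffusion decoder and the attacked detector (both assumptions, not the paper's components), and the patch location is fixed.

```python
# Hedged sketch: optimize a latent code so the decoded patch lowers a detector's
# objectness score when pasted into an image. decode_patch and toy_detector are
# toy stand-ins, not the paper's pretrained diffusion model or detector.
import torch
import torch.nn as nn
import torch.nn.functional as F

decode_patch = nn.Sequential(nn.Linear(64, 3 * 32 * 32), nn.Sigmoid())   # stand-in decoder
toy_detector = nn.Sequential(                                            # stand-in detector
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
for p in list(decode_patch.parameters()) + list(toy_detector.parameters()):
    p.requires_grad_(False)                       # only the latent code is optimized

z = torch.randn(1, 64, requires_grad=True)        # latent code being optimized
image = torch.rand(1, 3, 128, 128)                # scene the patch is pasted into
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(200):
    patch = decode_patch(z).view(1, 3, 32, 32)
    pad = (48, 48, 48, 48)                        # place the 32x32 patch at the image center
    canvas = F.pad(patch, pad)
    mask = F.pad(torch.ones_like(patch), pad)
    attacked = image * (1 - mask) + canvas        # differentiable patch placement
    loss = toy_detector(attacked).sigmoid().mean()  # detection confidence to suppress
    opt.zero_grad(); loss.backward(); opt.step()
```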

Dense Multitask Learning to Reconfigure Comics

  • paper_url: http://arxiv.org/abs/2307.08071
  • repo_url: None
  • paper_authors: Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann
  • for: To develop a MultiTask Learning (MTL) model that produces dense predictions for comic panels, facilitating the transfer of comics from one publication channel to another.
  • methods: A commonly used strategy, unsupervised image-to-image translation, is leveraged to exploit a large corpus of real-world annotations; the translation results are then used to build a multitask approach based on a vision transformer backbone and a domain-transferable attention module.
  • results: The MTL method successfully identifies the semantic units in comic panels as well as the embedded notion of 3D. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the author's creative process.
    Abstract In this paper, we develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels to, in turn, facilitate the transfer of comics from one publication channel to another by assisting authors in the task of reconfiguring their narratives. Our MTL method can successfully identify the semantic units as well as the embedded notion of 3D in comic panels. This is a significantly challenging problem because comics comprise disparate artistic styles, illustrations, layouts, and object scales that depend on the authors creative process. Typically, dense image-based prediction techniques require a large corpus of data. Finding an automated solution for dense prediction in the comics domain, therefore, becomes more difficult with the lack of ground-truth dense annotations for the comics images. To address these challenges, we develop the following solutions: 1) we leverage a commonly-used strategy known as unsupervised image-to-image translation, which allows us to utilize a large corpus of real-world annotations; 2) we utilize the results of the translations to develop our multitasking approach that is based on a vision transformer backbone and a domain transferable attention module; 3) we study the feasibility of integrating our MTL dense-prediction method with an existing retargeting method, thereby reconfiguring comics.

MaGNAS: A Mapping-Aware Graph Neural Architecture Search Framework for Heterogeneous MPSoC Deployment

  • paper_url: http://arxiv.org/abs/2307.08065
  • repo_url: None
  • paper_authors: Mohanad Odema, Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Mohammad Abdullah Al Faruque
  • For: The paper targets vision-based applications that require efficient processing on heterogeneous MPSoC platforms.
  • Methods: The paper proposes a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms, including a mapping-aware Graph Neural Architecture Search (MaGNAS) framework.
  • Results: The proposed MaGNAS framework achieves a 1.57x latency speedup and is 3.38x more energy-efficient for several vision datasets executed on the Xavier MPSoC compared to a GPU-only deployment, while sustaining an average 0.11% accuracy reduction from the baseline.
    Abstract Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unified design-mapping approach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimal GNNs and mapping pairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide 1.57x latency speedup and is 3.38x more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average 0.11% accuracy reduction from the baseline.
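MaGNAS couples a GNN architecture space with mapping options on a heterogeneous SoC and runs a two-tier evolutionary search over (architecture, mapping) pairs. The loop below only illustrates that two-tier search structure; the candidate encoding, the `fitness` trade-off, and all constants are invented for illustration and are not the paper's objective or search space.

```python
# Hedged sketch of a two-tier search over (architecture, mapping) pairs;
# the encodings and the fitness function below are illustrative stand-ins.
import random

DEPTHS, WIDTHS, UNITS = [4, 8, 12], [64, 128, 256], ["gpu", "dla"]

def random_arch():
    return {"depth": random.choice(DEPTHS), "width": random.choice(WIDTHS)}

def random_mapping(arch):
    return [random.choice(UNITS) for _ in range(arch["depth"])]   # one compute unit per block

def fitness(arch, mapping):
    acc = 0.80 + 0.005 * arch["depth"] + 0.0001 * arch["width"]   # toy accuracy proxy
    latency = arch["width"] / 64.0 * sum(1.0 if u == "gpu" else 1.5 for u in mapping)
    return acc - 0.01 * latency                                   # toy accuracy/latency trade-off

population = [random_arch() for _ in range(8)]
for generation in range(10):                     # outer tier: search over architectures
    scored = []
    for arch in population:
        candidates = [random_mapping(arch) for _ in range(6)]
        best_map = max(candidates, key=lambda m: fitness(arch, m))  # inner tier: best sampled mapping
        scored.append((fitness(arch, best_map), arch))
    scored.sort(key=lambda t: t[0], reverse=True)
    population = [a for _, a in scored[:4]] + [random_arch() for _ in range(4)]  # keep elites, refill

best_fit, best_arch = scored[0]
print(best_fit, best_arch)
```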

LafitE: Latent Diffusion Model with Feature Editing for Unsupervised Multi-class Anomaly Detection

  • paper_url: http://arxiv.org/abs/2307.08059
  • repo_url: None
  • paper_authors: Haonan Yin, Guanlong Jiao, Qianhui Wu, Borje F. Karlsson, Biqing Huang, Chin Yew Lin
  • for: This work targets unsupervised multi-class anomaly detection for flexible manufacturing systems: detecting anomalies in objects belonging to multiple classes when only normal data is accessible.
  • methods: A generative approach that uses a latent diffusion model for reconstruction to mitigate the "identity shortcut" issue, together with a feature editing strategy that further alleviates identity shortcuts and improves the reconstruction quality of normal regions.
  • results: Extensive experiments on the MVTec-AD and MPDD datasets show that the proposed LafitE outperforms existing methods by a significant margin in average AUROC; hyperparameters selected via the proposed pseudo validation set are well matched to the real test set.
    Abstract In the context of flexible manufacturing systems that are required to produce different types and quantities of products with minimal reconfiguration, this paper addresses the problem of unsupervised multi-class anomaly detection: develop a unified model to detect anomalies from objects belonging to multiple classes when only normal data is accessible. We first explore the generative-based approach and investigate latent diffusion models for reconstruction to mitigate the notorious ``identity shortcut'' issue in auto-encoder based methods. We then introduce a feature editing strategy that modifies the input feature space of the diffusion model to further alleviate ``identity shortcuts'' and meanwhile improve the reconstruction quality of normal regions, leading to fewer false positive predictions. Moreover, we are the first who pose the problem of hyperparameter selection in unsupervised anomaly detection, and propose a solution of synthesizing anomaly data for a pseudo validation set to address this problem. Extensive experiments on benchmark datasets MVTec-AD and MPDD show that the proposed LafitE, \ie, Latent Diffusion Model with Feature Editing, outperforms state-of-art methods by a significant margin in terms of average AUROC. The hyperparamters selected via our pseudo validation set are well-matched to the real test set.
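LafitE scores anomalies from reconstruction residuals and, notably, selects hyperparameters (such as the decision threshold) on a pseudo validation set built from synthesized anomalies. The sketch below shows that generic recipe with a Gaussian blur standing in for the latent diffusion reconstruction; the feature-editing step and the paper's actual anomaly-synthesis procedure are not represented.

```python
# Hedged sketch: reconstruction-residual anomaly scoring with a threshold chosen
# on a pseudo validation set of synthesized anomalies. The blur-based
# "reconstruction" is a stand-in for the latent diffusion model.
import numpy as np
from scipy.ndimage import gaussian_filter

def reconstruct(img):
    return gaussian_filter(img, sigma=3)          # stand-in reconstruction of normal content

def anomaly_map(img):
    return np.abs(img - reconstruct(img))         # per-pixel residual as anomaly score

def synthesize_anomaly(img):
    out = img.copy()
    out[8:16, 8:16] += 0.8                        # paste a synthetic defect
    return out

rng = np.random.default_rng(0)
normal = [rng.random((32, 32)) * 0.1 for _ in range(20)]
pseudo_val = [(x, 0) for x in normal[:10]] + \
             [(synthesize_anomaly(x), 1) for x in normal[10:]]

# pick the image-level threshold that best separates the pseudo validation set
scores = np.array([anomaly_map(x).max() for x, _ in pseudo_val])
labels = np.array([y for _, y in pseudo_val])
thresholds = np.linspace(scores.min(), scores.max(), 50)
acc = [((scores > t).astype(int) == labels).mean() for t in thresholds]
best_t = thresholds[int(np.argmax(acc))]
print("chosen threshold:", best_t)
```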

TransNuSeg: A Lightweight Multi-Task Transformer for Nuclei Segmentation

  • paper_url: http://arxiv.org/abs/2307.08051
  • repo_url: https://github.com/zhenqi-he/transnuseg
  • paper_authors: Zhenqi He, Mathias Unberath, Jing Ke, Yiqing Shen
  • for: To propose a Transformer-based method for automatic nuclei segmentation that improves both segmentation accuracy and efficiency.
  • methods: A multi-task learning strategy decouples nuclei segmentation into three sub-tasks: nuclei instance segmentation, nuclei edge segmentation, and clustered edge segmentation; an attention sharing scheme shares self-attention heads across the branches.
  • results: Experiments on two datasets of different modalities show higher nuclei segmentation accuracy than state-of-the-art counterparts such as CA2.5-Net, while reducing the number of model parameters and thus improving computational efficiency.
    Abstract Nuclei appear small in size, yet, in real clinical practice, the global spatial information and correlation of the color or brightness contrast between nuclei and background, have been considered a crucial component for accurate nuclei segmentation. However, the field of automatic nuclei segmentation is dominated by Convolutional Neural Networks (CNNs), meanwhile, the potential of the recently prevalent Transformers has not been fully explored, which is powerful in capturing local-global correlations. To this end, we make the first attempt at a pure Transformer framework for nuclei segmentation, called TransNuSeg. Different from prior work, we decouple the challenging nuclei segmentation task into an intrinsic multi-task learning task, where a tri-decoder structure is employed for nuclei instance, nuclei edge, and clustered edge segmentation respectively. To eliminate the divergent predictions from different branches in previous work, a novel self distillation loss is introduced to explicitly impose consistency regulation between branches. Moreover, to formulate the high correlation between branches and also reduce the number of parameters, an efficient attention sharing scheme is proposed by partially sharing the self-attention heads amongst the tri-decoders. Finally, a token MLP bottleneck replaces the over-parameterized Transformer bottleneck for a further reduction in model complexity. Experiments on two datasets of different modalities, including MoNuSeg have shown that our methods can outperform state-of-the-art counterparts such as CA2.5-Net by 2-3% Dice with 30% fewer parameters. In conclusion, TransNuSeg confirms the strength of Transformer in the context of nuclei segmentation, which thus can serve as an efficient solution for real clinical practice. Code is available at https://github.com/zhenqi-he/transnuseg.
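TransNuSeg trains three decoders (nuclei instance, nuclei edge, clustered edge) and adds a self-distillation loss enforcing consistency between branches. Below is a minimal sketch of one plausible pairwise consistency term on toy decoder outputs; the exact pairing, distance, and weighting used in the paper may differ.

```python
# Hedged sketch of a pairwise consistency ("self-distillation") term between
# decoder branches; the real TransNuSeg pairing and weighting may differ.
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b):
    """Symmetric KL between the soft predictions of two branches."""
    pa, pb = F.log_softmax(logits_a, dim=1), F.log_softmax(logits_b, dim=1)
    return 0.5 * (F.kl_div(pa, pb.exp(), reduction="batchmean") +
                  F.kl_div(pb, pa.exp(), reduction="batchmean"))

# toy outputs of the three decoders, shaped (batch, classes, H, W)
inst, edge, clustered = (torch.randn(2, 2, 64, 64) for _ in range(3))
loss = (consistency_loss(inst, edge) +
        consistency_loss(inst, clustered) +
        consistency_loss(edge, clustered))
```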

A Novel SLCA-UNet Architecture for Automatic MRI Brain Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2307.08048
  • repo_url: None
  • paper_authors: Tejashwini P S, Thriveni J, Venugopal K R
  • for: To automate brain tumor segmentation and identification in MRI via deep learning.
  • methods: A modified UNet architecture, SLCA-UNet, whose modules include residual dense blocks, layered attention, channel attention, and stacked convolution, enabling it to capture both coarse and fine feature information in brain MRI.
  • results: On the BraTS 2020 dataset, the method achieves average Dice, Sensitivity, Specificity, and Hausdorff95 scores of 0.845, 0.845, 0.999, and 8.1, respectively.
    Abstract Brain tumor is deliberated as one of the severe health complications which lead to decrease in life expectancy of the individuals and is also considered as a prominent cause of mortality worldwide. Therefore, timely detection and prediction of brain tumors can be helpful to prevent death rates due to brain tumors. Biomedical image analysis is a widely known solution to diagnose brain tumor. Although MRI is the current standard method for imaging tumors, its clinical usefulness is constrained by the requirement of manual segmentation which is time-consuming. Deep learning-based approaches have emerged as a promising solution to develop automated biomedical image exploration tools and the UNet architecture is commonly used for segmentation. However, the traditional UNet has limitations in terms of complexity, training, accuracy, and contextual information processing. As a result, the modified UNet architecture, which incorporates residual dense blocks, layered attention, and channel attention modules, in addition to stacked convolution, can effectively capture both coarse and fine feature information. The proposed SLCA UNet approach achieves good performance on the freely accessible Brain Tumor Segmentation (BraTS) dataset, with an average performance of 0.845, 0.845, 0.999, and 8.1 in terms of Dice, Sensitivity, Specificity, and Hausdorff95 for BraTS 2020 dataset, respectively.

Planting a SEED of Vision in Large Language Model

  • paper_url: http://arxiv.org/abs/2307.08041
  • repo_url: https://github.com/ailab-cvc/seed
  • paper_authors: Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
  • for: To provide an image tokenizer that enables Large Language Models (LLMs) to see and draw at the same time.
  • methods: A new image tokenizer architecture that produces image tokens with a 1D causal dependency rather than tying them to 2D patch positions, and optimizes them during tokenizer training so that they capture high-level semantics.
  • results: With this image tokenizer, an off-the-shelf LLM can perform both image-to-text and text-to-image generation through efficient LoRA tuning. SEED was trained in 5.7 days using 64 V100 GPUs and 5M publicly available image-text pairs.
    Abstract We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite the limitations, we remain confident in its natural capacity to unify visual and textual representations, facilitating scalable multimodal training with LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, the off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating our SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.

Multi-Object Discovery by Low-Dimensional Object Motion

  • paper_url: http://arxiv.org/abs/2307.08027
  • repo_url: https://github.com/sadrasafa/multi-object-segmentation
  • paper_authors: Sadra Safadoust, Fatma Güney
  • for: To improve motion reconstruction from a single image, despite the inherent ambiguity of predicting motion without access to the next frame.
  • methods: Pixel-wise geometry and object motion are modeled to remove the ambiguity in reconstructing flow from a single image.
  • results: The approach achieves state-of-the-art multi-object segmentation results on synthetic and real-world datasets, and the predicted depth maps show reliable monocular depth estimation performance.
    Abstract Recent work in unsupervised multi-object segmentation shows impressive results by predicting motion from a single image despite the inherent ambiguity in predicting motion without the next image. On the other hand, the set of possible motions for an image can be constrained to a low-dimensional space by considering the scene structure and moving objects in it. We propose to model pixel-wise geometry and object motion to remove ambiguity in reconstructing flow from a single image. Specifically, we divide the image into coherently moving regions and use depth to construct flow bases that best explain the observed flow in each region. We achieve state-of-the-art results in unsupervised multi-object segmentation on synthetic and real-world datasets by modeling the scene structure and object motion. Our evaluation of the predicted depth maps shows reliable performance in monocular depth estimation.

Analysing Gender Bias in Text-to-Image Models using Object Detection

  • paper_url: http://arxiv.org/abs/2307.08025
  • repo_url: https://github.com/harveymannering/text-to-image-bias
  • paper_authors: Harvey Mannering
  • for: To measure bias in text-to-image models.
  • methods: Paired prompts such as "a man/woman holding an item" are used to examine whether certain objects are associated with a particular gender.
  • results: The analysis shows that masculine prompts more frequently generate objects such as ties, knives, trucks, baseball bats, and bicycles, while feminine prompts more frequently generate objects such as handbags, umbrellas, bowls, bottles, and cups.
    Abstract This work presents a novel strategy to measure bias in text-to-image models. Using paired prompts that specify gender and vaguely reference an object (e.g. "a man/woman holding an item") we can examine whether certain objects are associated with a certain gender. In analysing results from Stable Diffusion, we observed that male prompts generated objects such as ties, knives, trucks, baseball bats, and bicycles more frequently. On the other hand, female prompts were more likely to generate objects such as handbags, umbrellas, bowls, bottles, and cups. We hope that the method outlined here will be a useful tool for examining bias in text-to-image models.
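The bias measurement above is essentially a counting experiment: generate images from paired prompts such as "a man/woman holding an item", detect objects in the generations, and compare per-gender object frequencies. The sketch below shows only that bookkeeping; `generate_image` and `detect_objects` are placeholders for the text-to-image model (e.g., Stable Diffusion) and the object detector, and the random "detections" are purely illustrative.

```python
# Hedged sketch of the paired-prompt counting experiment; generate_image and
# detect_objects are placeholders for the generator and the object detector.
from collections import Counter
import random

def generate_image(prompt):                # placeholder for a text-to-image model
    return prompt                          # we only need something to "detect" on

def detect_objects(image):                 # placeholder for an object detector
    pool = ["tie", "handbag", "cup", "bicycle", "umbrella", "knife"]
    return random.sample(pool, k=2)        # fake detections for illustration

counts = {"man": Counter(), "woman": Counter()}
for gender in counts:
    for _ in range(100):                   # many generations per prompt template
        img = generate_image(f"a {gender} holding an item")
        counts[gender].update(detect_objects(img))

for obj in sorted(set(counts["man"]) | set(counts["woman"])):
    m, w = counts["man"][obj], counts["woman"][obj]
    print(f"{obj:10s} man={m:3d} woman={w:3d} ratio={(m + 1) / (w + 1):.2f}")
```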

Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer

  • paper_url: http://arxiv.org/abs/2307.08015
  • repo_url: None
  • paper_authors: Yujiao Shi, Fei Wu, Akhil Perincherry, Ankit Vora, Hongdong Li
  • for: To improve the accuracy of ground camera pose estimation, particularly when the satellite image database has limited sampling density.
  • methods: The relative rotation and translation between a ground-level image and its matched satellite image are estimated via (1) a geometry-guided cross-view transformer that maps the ground view to an overhead view, (2) a neural pose optimizer that estimates the relative rotation between the synthesized overhead view and the satellite image, and (3) an uncertainty-guided spatial correlation that produces a probability map of vehicle locations.
  • results: Experiments on the cross-view KITTI dataset show significant improvements over state-of-the-art methods. Notably, the likelihood of restricting the vehicle's lateral pose to within 1m of ground truth improves from 35.54% to 76.44%, and the likelihood of restricting the vehicle's orientation to within 1° of ground truth improves from 19.64% to 99.10%.
    Abstract Image retrieval-based cross-view localization methods often lead to very coarse camera pose estimation, due to the limited sampling density of the database satellite images. In this paper, we propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image. Our approach designs a geometry-guided cross-view transformer that combines the benefits of conventional geometry and learnable cross-view transformers to map the ground-view observations to an overhead view. Given the synthesized overhead view and observed satellite feature maps, we construct a neural pose optimizer with strong global information embedding ability to estimate the relative rotation between them. After aligning their rotations, we develop an uncertainty-guided spatial correlation to generate a probability map of the vehicle locations, from which the relative translation can be determined. Experimental results demonstrate that our method significantly outperforms the state-of-the-art. Notably, the likelihood of restricting the vehicle lateral pose to be within 1m of its Ground Truth (GT) value on the cross-view KITTI dataset has been improved from $35.54\%$ to $76.44\%$, and the likelihood of restricting the vehicle orientation to be within $1^{\circ}$ of its GT value has been improved from $19.64\%$ to $99.10\%$.

Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks

  • paper_url: http://arxiv.org/abs/2307.08013
  • repo_url: None
  • paper_authors: Haobo Song, Soumajit Majumder, Tao Lin
  • for: This paper aims to revisit the line of implicit models, specifically weight-tied models, and evaluate their effectiveness, stability, and efficiency on vision tasks.
  • methods: The paper uses weight-tied models as the basis for its study, and proposes the use of distinct sparse masks to improve the model capacity.
  • results: The paper finds that weight-tied models are more effective, stable, and efficient on vision tasks compared to DEQ variants, and provides design guidelines for practitioners regarding the selection of depth, width, and sparsity.
    Abstract Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models with elegant solution-finding procedures and constant memory footprint. However, despite several attempts, these methods are heavily constrained by model inefficiency and optimization instability. Furthermore, fair benchmarking across relevant methods for vision tasks is missing. In this work, we revisit the line of implicit models and trace them back to the original weight-tied models. Surprisingly, we observe that weight-tied models are more effective, stable, as well as efficient on vision tasks, compared to the DEQ variants. Through the lens of these simple-yet-clean weight-tied models, we further study the fundamental limits in the model capacity of such models and propose the use of distinct sparse masks to improve the model capacity. Finally, for practitioners, we offer design guidelines regarding the depth, width, and sparsity selection for weight-tied models, and demonstrate the generalizability of our insights to other learning paradigms.
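The weight-tied models revisited here apply one shared block repeatedly instead of stacking distinct layers (and, unlike DEQs, unroll it for a fixed number of iterations rather than solving for a fixed point). A minimal sketch under that reading is below; the sparse-mask variant proposed in the paper is not included, and the sizes are arbitrary.

```python
# Hedged sketch of a weight-tied network: one set of weights applied for a
# fixed number of iterations (contrast with DEQs, which solve for a fixed point).
import torch
import torch.nn as nn

class WeightTiedNet(nn.Module):
    def __init__(self, dim=128, n_iters=12):
        super().__init__()
        self.inject = nn.Linear(784, dim)                          # input injection
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # shared weights
        self.head = nn.Linear(dim, 10)
        self.n_iters = n_iters

    def forward(self, x):
        u = self.inject(x)
        z = torch.zeros_like(u)
        for _ in range(self.n_iters):          # the same block is reused at every "layer"
            z = self.block(z + u)
        return self.head(z)

logits = WeightTiedNet()(torch.randn(4, 784))
```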

Householder Projector for Unsupervised Latent Semantics Discovery

  • paper_url: http://arxiv.org/abs/2307.08012
  • repo_url: https://github.com/kingjamessong/householdergan
  • paper_authors: Yue Song, Jichao Zhang, Nicu Sebe, Wei Wang
  • for: To explore the intrinsic latent structure of Generative Adversarial Networks (GANs) so that the image generation process can be better understood and controlled.
  • methods: The authors propose the Householder Projector, a low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix, so that traversing the latent code uncovers disentangled, interpretable semantic attributes.
  • results: Integrated into pre-trained StyleGAN2/StyleGAN3 and evaluated on several benchmarks, the Householder Projector helps StyleGANs discover more disentangled and precise semantic attributes without sacrificing image fidelity, using only 1% of the original training steps for fine-tuning.
    Abstract Generative Adversarial Networks (GANs), especially the recent style-based generators (StyleGANs), have versatile semantics in the structured latent space. Latent semantics discovery methods emerge to move around the latent code such that only one factor varies during the traversal. Recently, an unsupervised method proposed a promising direction to directly use the eigenvectors of the projection matrix that maps latent codes to features as the interpretable directions. However, one overlooked fact is that the projection matrix is non-orthogonal and the number of eigenvectors is too large. The non-orthogonality would entangle semantic attributes in the top few eigenvectors, and the large dimensionality might result in meaningless variations among the directions even if the matrix is orthogonal. To avoid these issues, we propose Householder Projector, a flexible and general low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix. The orthogonality guarantees that the eigenvectors correspond to disentangled interpretable semantics, while the low-rank property encourages that each identified direction has meaningful variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks. Within only $1\%$ of the original training steps for fine-tuning, our projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity.
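The Householder Projector parameterizes the projection matrix as a product of Householder reflections, which is orthogonal by construction. The NumPy sketch below only demonstrates that building block, composing reflections into an orthogonal matrix; the low-rank truncation and the StyleGAN integration are omitted.

```python
# Hedged sketch: build an orthogonal matrix as a product of Householder
# reflections H_v = I - 2 v v^T / ||v||^2; orthogonality holds by construction.
import numpy as np

def householder(v):
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def householder_orthogonal(vectors):
    """Product of Householder reflections, one per row of `vectors`."""
    Q = np.eye(vectors.shape[1])
    for v in vectors:
        Q = Q @ householder(v)
    return Q

rng = np.random.default_rng(0)
V = rng.standard_normal((8, 16))            # 8 reflection vectors in R^16
Q = householder_orthogonal(V)
print(np.allclose(Q.T @ Q, np.eye(16)))     # True: Q is orthogonal
```

Because each reflection is orthogonal, the product satisfies QᵀQ = I for any choice of reflection vectors, which is the property the paper relies on to keep the discovered directions disentangled.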

LUCYD: A Feature-Driven Richardson-Lucy Deconvolution Network

  • paper_url: http://arxiv.org/abs/2307.07998
  • repo_url: https://github.com/ctom2/lucyd-deconvolution
  • paper_authors: Tomáš Chobola, Gesine Müller, Veit Dausmann, Anton Theileis, Jan Taucher, Jan Huisken, Tingying Peng
  • for: To improve the quality and interpretability of microscopy images.
  • methods: LUCYD combines the Richardson-Lucy deconvolution formula with deep features from a fully convolutional network, enhancing image quality while reducing computational cost and maintaining interpretability.
  • results: LUCYD outperforms existing methods on both synthetic and real microscopy images and handles various microscopy modalities and imaging conditions.
    Abstract The process of acquiring microscopic images in life sciences often results in image degradation and corruption, characterised by the presence of noise and blur, which poses significant challenges in accurately analysing and interpreting the obtained data. This paper proposes LUCYD, a novel method for the restoration of volumetric microscopy images that combines the Richardson-Lucy deconvolution formula and the fusion of deep features obtained by a fully convolutional network. By integrating the image formation process into a feature-driven restoration model, the proposed approach aims to enhance the quality of the restored images whilst reducing computational costs and maintaining a high degree of interpretability. Our results demonstrate that LUCYD outperforms the state-of-the-art methods in both synthetic and real microscopy images, achieving superior performance in terms of image quality and generalisability. We show that the model can handle various microscopy modalities and different imaging conditions by evaluating it on two different microscopy datasets, including volumetric widefield and light-sheet microscopy. Our experiments indicate that LUCYD can significantly improve resolution, contrast, and overall quality of microscopy images. Therefore, it can be a valuable tool for microscopy image restoration and can facilitate further research in various microscopy applications. We made the source code for the model accessible under https://github.com/ctom2/lucyd-deconvolution.
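LUCYD builds on the Richardson-Lucy deconvolution formula, which its learned features then refine. For reference, the classical Richardson-Lucy iteration alone is sketched below for a known point spread function; the toy PSF and image are illustrative, and LUCYD's feature-driven fusion is not represented.

```python
# Hedged sketch of classical Richardson-Lucy deconvolution for a known PSF;
# the learned feature fusion of LUCYD is not included here.
import numpy as np
from scipy.signal import convolve2d

def richardson_lucy(observed, psf, n_iters=30, eps=1e-8):
    estimate = np.full_like(observed, observed.mean())
    psf_flipped = psf[::-1, ::-1]
    for _ in range(n_iters):
        blurred = convolve2d(estimate, psf, mode="same", boundary="symm")
        ratio = observed / (blurred + eps)
        estimate *= convolve2d(ratio, psf_flipped, mode="same", boundary="symm")
    return estimate

rng = np.random.default_rng(0)
psf = np.outer(np.hanning(9), np.hanning(9))
psf /= psf.sum()                                            # toy blur kernel
sharp = rng.random((64, 64))
observed = convolve2d(sharp, psf, mode="same", boundary="symm")
restored = richardson_lucy(observed, psf)
```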

Enforcing Topological Interaction between Implicit Surfaces via Uniform Sampling

  • paper_url: http://arxiv.org/abs/2307.08716
  • repo_url: None
  • paper_authors: Hieu Le, Nicolas Talabot, Jiancheng Yang, Pascal Fua
  • for: To propose a method that refines 3D object surface representations so that their topological interactions follow a given prior.
  • methods: A stochastic approach: the statistics of signed distances from a large number of random points to the object surfaces reflect their interaction, so a chosen set of anchor points is used to refine the surfaces.
  • results: Experiments show accurate 3D reconstruction of human hearts with proper topological connectivity between components; the method can also simulate various ways a hand interacts with an arbitrary object.
    Abstract Objects interact with each other in various ways, including containment, contact, or maintaining fixed distances. Ensuring these topological interactions is crucial for accurate modeling in many scenarios. In this paper, we propose a novel method to refine 3D object representations, ensuring that their surfaces adhere to a topological prior. Our key observation is that the object interaction can be observed via a stochastic approximation method: the statistic of signed distances between a large number of random points to the object surfaces reflect the interaction between them. Thus, the object interaction can be indirectly manipulated by using choosing a set of points as anchors to refine the object surfaces. In particular, we show that our method can be used to enforce two objects to have a specific contact ratio while having no surface intersection. The conducted experiments show that our proposed method enables accurate 3D reconstruction of human hearts, ensuring proper topological connectivity between components. Further, we show that our proposed method can be used to simulate various ways a hand can interact with an arbitrary object.
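The key observation above is that statistics of signed distances from uniformly sampled points to the implicit surfaces reveal how the objects interact (interpenetration vs. contact). The sketch below estimates an intersection penalty and a rough contact ratio for two analytic sphere SDFs standing in for learned implicit surfaces; the refinement optimization that actually enforces the prior is omitted, and the 0.02 band width is an arbitrary choice.

```python
# Hedged sketch: probe the interaction of two implicit surfaces via the signed
# distances of uniformly sampled points. Analytic sphere SDFs stand in for
# learned implicit surfaces; the surface-refinement step is omitted.
import numpy as np

def sphere_sdf(points, center, radius):
    return np.linalg.norm(points - center, axis=1) - radius

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(200_000, 3))       # uniform probe points

d1 = sphere_sdf(points, np.array([0.0, 0.0, 0.0]), 1.0)
d2 = sphere_sdf(points, np.array([1.9, 0.0, 0.0]), 1.0)

eps = 0.02
inside_both = (d1 < 0) & (d2 < 0)                         # interpenetration region
near_contact = (np.abs(d1) < eps) & (np.abs(d2) < eps)    # shared surface band

intersection_penalty = inside_both.mean()                 # should be driven toward 0
contact_ratio = near_contact.sum() / max((np.abs(d1) < eps).sum(), 1)
print(intersection_penalty, contact_ratio)
```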

Integrating Human Parsing and Pose Network for Human Action Recognition

  • paper_url: http://arxiv.org/abs/2307.07977
  • repo_url: https://github.com/liujf69/ipp-net-parsing
  • paper_authors: Runwei Ding, Yuhang Wen, Jinfu Liu, Nan Dai, Fanyang Meng, Mengyuan Liu
  • for: To improve human action recognition accuracy by using human pose and human parsing feature maps as input modalities.
  • methods: An Integrating Human Parsing and Pose Network (IPP-Net) that combines skeletons and human parsing feature maps in a dual-branch approach: the pose branch feeds compact skeletal representations into a graph convolutional network to model pose features, while the parsing branch extracts multi-frame body-part parsing features with a human detector and parser and learns them with a convolutional backbone.
  • results: Experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks show that IPP-Net achieves higher accuracy than existing action recognition methods. Code is available at https://github.com/liujf69/IPP-Net-Parsing.
    Abstract Human skeletons and RGB sequences are both widely-adopted input modalities for human action recognition. However, skeletons lack appearance features and color data suffer large amount of irrelevant depiction. To address this, we introduce human parsing feature map as a novel modality, since it can selectively retain spatiotemporal features of the body parts, while filtering out noises regarding outfits, backgrounds, etc. We propose an Integrating Human Parsing and Pose Network (IPP-Net) for action recognition, which is the first to leverage both skeletons and human parsing feature maps in dual-branch approach. The human pose branch feeds compact skeletal representations of different modalities in graph convolutional network to model pose features. In human parsing branch, multi-frame body-part parsing features are extracted with human detector and parser, which is later learnt using a convolutional backbone. A late ensemble of two branches is adopted to get final predictions, considering both robust keypoints and rich semantic body-part features. Extensive experiments on NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed IPP-Net, which outperforms the existing action recognition methods. Our code is publicly available at https://github.com/liujf69/IPP-Net-Parsing .

HRHD-HK: A benchmark dataset of high-rise and high-density urban scenes for 3D semantic segmentation of photogrammetric point clouds

  • paper_url: http://arxiv.org/abs/2307.07976
  • repo_url: https://github.com/luzaijiaoxial/hrhd-hk
  • paper_authors: Maosu Li, Yijie Wu, Anthony G. O. Yeh, Fan Xue
  • for: To quantitatively assess existing 3D semantic segmentation methods and their performance in diversified real-world urban scenes.
  • methods: Eight popular semantic segmentation methods are comprehensively evaluated on the proposed HRHD-HK benchmark.
  • results: Experimental results confirm plenty of room for improving current 3D semantic segmentation of point clouds in high-rise, high-density urban areas, especially for city objects with small volumes.
    Abstract Many existing 3D semantic segmentation methods, deep learning in computer vision notably, claimed to achieve desired results on urban point clouds, in which the city objects are too many and diverse for people to judge qualitatively. Thus, it is significant to assess these methods quantitatively in diversified real-world urban scenes, encompassing high-rise, low-rise, high-density, and low-density urban areas. However, existing public benchmark datasets primarily represent low-rise scenes from European cities and cannot assess the methods comprehensively. This paper presents a benchmark dataset of high-rise urban point clouds, namely High-Rise, High-Density urban scenes of Hong Kong (HRHD-HK), which has been vacant for a long time. HRHD-HK arranged in 150 tiles contains 273 million colorful photogrammetric 3D points from diverse urban settings. The semantic labels of HRHD-HK include building, vegetation, road, waterbody, facility, terrain, and vehicle. To the best of our knowledge, HRHD-HK is the first photogrammetric dataset that focuses on HRHD urban areas. This paper also comprehensively evaluates eight popular semantic segmentation methods on the HRHD-HK dataset. Experimental results confirmed plenty of room for enhancing the current 3D semantic segmentation of point clouds, especially for city objects with small volumes. Our dataset is publicly available at: https://github.com/LuZaiJiaoXiaL/HRHD-HK.

Towards Viewpoint-Invariant Visual Recognition via Adversarial Training

  • paper_url: http://arxiv.org/abs/2307.10235
  • repo_url: None
  • paper_authors: Shouwei Ruan, Yinpeng Dong, Hang Su, Jianteng Peng, Ning Chen, Xingxing Wei
  • for: To improve the viewpoint invariance of image classifiers so that they remain accurate under different viewing directions.
  • methods: An adversarial-training-based method, Viewpoint-Invariant Adversarial Training (VIAT), that regards viewpoint transformation as an attack and formulates a minimax optimization problem to train a viewpoint-invariant classifier.
  • results: Experiments show that VIAT effectively improves the viewpoint robustness of various image classifiers, and sharing adversarial viewpoint distributions across objects further improves generalization.
    Abstract Visual recognition models are not invariant to viewpoint changes in the 3D world, as different viewing directions can dramatically affect the predictions given the same object. Although many efforts have been devoted to making neural networks invariant to 2D image translations and rotations, viewpoint invariance is rarely investigated. As most models process images in the perspective view, it is challenging to impose invariance to 3D viewpoint changes based only on 2D inputs. Motivated by the success of adversarial training in promoting model robustness, we propose Viewpoint-Invariant Adversarial Training (VIAT) to improve viewpoint robustness of common image classifiers. By regarding viewpoint transformation as an attack, VIAT is formulated as a minimax optimization problem, where the inner maximization characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on a new attack GMVFool, while the outer minimization trains a viewpoint-invariant classifier by minimizing the expected loss over the worst-case adversarial viewpoint distributions. To further improve the generalization performance, a distribution sharing strategy is introduced leveraging the transferability of adversarial viewpoints across objects. Experiments validate the effectiveness of VIAT in improving the viewpoint robustness of various image classifiers based on the diversity of adversarial viewpoints generated by GMVFool.

Dual-level Interaction for Domain Adaptive Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.07972
  • repo_url: https://github.com/rainjamesy/dida
  • paper_authors: Dongyu Yao, Boheng Li
  • for: This paper targets domain adaptive semantic segmentation and proposes a dual-level interaction method (DIDA) to improve model robustness and accuracy.
  • methods: Augmented views of the same pixel interact at both the semantic level (similar class predictions) and the instance level (similar relationships to other pixels); a labeled instance bank with dynamic updating strategies selectively stores informative instance features.
  • results: Experiments show clear advantages on confusing and long-tailed classes; compared with existing methods, the approach improves accuracy and robustness for domain adaptive semantic segmentation.
    Abstract Self-training approach recently secures its position in domain adaptive semantic segmentation, where a model is trained with target domain pseudo-labels. Current advances have mitigated noisy pseudo-labels resulting from the domain gap. However, they still struggle with erroneous pseudo-labels near the boundaries of the semantic classifier. In this paper, we tackle this issue by proposing a dual-level interaction for domain adaptation (DIDA) in semantic segmentation. Explicitly, we encourage the different augmented views of the same pixel to have not only similar class prediction (semantic-level) but also akin similarity relationship with respect to other pixels (instance-level). As it's impossible to keep features of all pixel instances for a dataset, we, therefore, maintain a labeled instance bank with dynamic updating strategies to selectively store the informative features of instances. Further, DIDA performs cross-level interaction with scattering and gathering techniques to regenerate more reliable pseudo-labels. Our method outperforms the state-of-the-art by a notable margin, especially on confusing and long-tailed classes. Code is available at \href{https://github.com/RainJamesY/DIDA}
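DIDA asks two augmented views of the same pixels to agree on class predictions (semantic level) and on their similarity relations to other pixel instances (instance level). A minimal version of those two consistency terms on toy tensors is sketched below; the instance bank, pseudo-label regeneration, and the paper's exact loss forms are not included.

```python
# Hedged sketch of semantic-level and instance-level consistency between two
# augmented views; the instance bank and pseudo-label machinery are omitted.
import torch
import torch.nn.functional as F

def semantic_consistency(logits_a, logits_b):
    # the two views should predict the same class distribution per pixel
    return F.kl_div(F.log_softmax(logits_a, dim=1),
                    F.softmax(logits_b, dim=1), reduction="batchmean")

def instance_consistency(feat_a, feat_b):
    # the two views should induce the same pixel-to-pixel similarity structure
    fa = F.normalize(feat_a, dim=1)
    fb = F.normalize(feat_b, dim=1)
    return F.mse_loss(fa @ fa.t(), fb @ fb.t())

# toy tensors: N pixel instances with C-class logits and D-dim features
logits_a, logits_b = torch.randn(64, 19), torch.randn(64, 19)
feat_a, feat_b = torch.randn(64, 256), torch.randn(64, 256)
loss = semantic_consistency(logits_a, logits_b) + instance_consistency(feat_a, feat_b)
```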

EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes

  • paper_url: http://arxiv.org/abs/2307.07961
  • repo_url: None
  • paper_authors: Jingyuan Yang, Qirui Huang, Tingting Ding, Dani Lischinski, Daniel Cohen-Or, Hui Huang
  • for: To introduce EmoSet, a large-scale visual emotion dataset with rich annotations, to support research in visual emotion analysis and understanding.
  • methods: The dataset is constructed by collecting images from social networks and artistic sources; 118,102 of its 3.3 million images are carefully labeled by human annotators with an emotion category and a set of describable emotion attributes, including brightness, colorfulness, scene type, object class, facial expression, and human action.
  • results: EmoSet is five times larger than the largest existing dataset and is well balanced between different emotion categories, providing a valuable resource for researchers in affective computing.
    Abstract Visual Emotion Analysis (VEA) aims at predicting people's emotional responses to visual stimuli. This is a promising, yet challenging, task in affective computing, which has drawn increasing attention in recent years. Most of the existing work in this area focuses on feature design, while little attention has been paid to dataset construction. In this work, we introduce EmoSet, the first large-scale visual emotion dataset annotated with rich attributes, which is superior to existing datasets in four aspects: scale, annotation richness, diversity, and data balance. EmoSet comprises 3.3 million images in total, with 118,102 of these images carefully labeled by human annotators, making it five times larger than the largest existing dataset. EmoSet includes images from social networks, as well as artistic images, and it is well balanced between different emotion categories. Motivated by psychological studies, in addition to emotion category, each image is also annotated with a set of describable emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action, which can help understand visual emotions in a precise and interpretable way. The relevance of these emotion attributes is validated by analyzing the correlations between them and visual emotion, as well as by designing an attribute module to help visual emotion recognition. We believe EmoSet will bring some key insights and encourage further research in visual emotion analysis and understanding. Project page: https://vcc.tech/EmoSet.

Accurate 3D Prediction of Missing Teeth in Diverse Patterns for Precise Dental Implant Planning

  • paper_url: http://arxiv.org/abs/2307.07953
  • repo_url: None
  • paper_authors: Lei Ma, Peng Xue, Yuning Gu, Yue Zhao, Min Zhu, Zhongxiang Ding, Dinggang Shen
  • For: To provide a framework that accurately predicts missing teeth in diverse patterns, supporting better planning and placement of dental implants.
  • Methods: Point-to-point correspondence is estimated among dental mesh models reconstructed from CBCT images of healthy subjects, and position and shape information for each tooth type is encoded into tooth dictionaries. Sparse coefficients learned from the teeth adjacent to the missing ones are then applied to the dictionaries of the missing teeth to generate accurate predictions of their positions and shapes.
  • Results: The framework achieves an average prediction error of 1.04mm for a single missing tooth and 1.33mm for 14 missing teeth, demonstrating accurate prediction of missing teeth in various patterns.
    Abstract In recent years, the demand for dental implants has surged, driven by their high success rates and esthetic advantages. However, accurate prediction of missing teeth for precise digital implant planning remains a challenge due to the intricate nature of dental structures and the variability in tooth loss patterns. This study presents a novel framework for accurate prediction of missing teeth in different patterns, facilitating digital implant planning. The proposed framework begins by estimating point-to-point correspondence among a dataset of dental mesh models reconstructed from CBCT images of healthy subjects. Subsequently, tooth dictionaries are constructed for each tooth type, encoding their position and shape information based on the established point-to-point correspondence. To predict missing teeth in a given dental mesh model, sparse coefficients are learned by sparsely representing adjacent teeth of the missing teeth using the corresponding tooth dictionaries. These coefficients are then applied to the dictionaries of the missing teeth to generate accurate predictions of their positions and shapes. The evaluation results on real subjects shows that our proposed framework achieves an average prediction error of 1.04mm for predictions of single missing tooth and an average prediction error of 1.33mm for the prediction of 14 missing teeth, which demonstrates its capability of accurately predicting missing teeth in various patterns. By accurately predicting missing teeth, dental professionals can improve the planning and placement of dental implants, leading to better esthetic and functional outcomes for patients undergoing dental implant procedures.
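The framework above encodes each tooth type as a dictionary and predicts a missing tooth by sparse-coding its neighbors over their dictionaries, then reusing the coefficients with the missing tooth's dictionary. A toy version of that coefficient transfer is sketched below with random dictionaries and Orthogonal Matching Pursuit as the sparse solver (an assumption; the paper may use a different solver); the point-correspondence construction from CBCT meshes is not shown.

```python
# Hedged sketch: sparse-code a neighboring tooth over its dictionary, then apply
# the same coefficients to the missing tooth's dictionary. Random dictionaries
# stand in for the ones built from CBCT mesh correspondences.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_points, n_atoms = 300, 40                              # points per tooth, dictionary atoms
D_neighbor = rng.standard_normal((n_points, n_atoms))    # dictionary of the neighbor tooth
D_missing = rng.standard_normal((n_points, n_atoms))     # dictionary of the missing tooth

# observed neighbor tooth (flattened coordinates) generated from a sparse code
true_code = np.zeros(n_atoms)
true_code[[3, 7, 21]] = [1.0, -0.5, 0.8]
neighbor_obs = D_neighbor @ true_code + 0.01 * rng.standard_normal(n_points)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
omp.fit(D_neighbor, neighbor_obs)                        # sparse coefficients of the neighbor
coeffs = omp.coef_

missing_pred = D_missing @ coeffs                        # transfer coefficients to the missing tooth
print(np.count_nonzero(coeffs), missing_pred.shape)
```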

Accelerating Distributed ML Training via Selective Synchronization

  • paper_url: http://arxiv.org/abs/2307.07950
  • repo_url: None
  • paper_authors: Sahil Tyagi, Martin Swany
  • for: To present a practical, low-overhead method for training deep neural networks (DNNs) that improves the efficiency of distributed training.
  • methods: The approach (1) decides at each step whether gradient aggregation is necessary, avoiding the overhead of high-communication-cost aggregation when it is not; (2) adapts the aggregation frequency to the training scenario to achieve the best trade-off in training time; and (3) introduces several optimizations to improve convergence in semi-synchronous training.
  • results: SelSync converges to the same or better accuracy than BSP training while reducing training time by up to 14x.
    Abstract In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not linearly scale-out due to high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step either by calling the aggregation op or applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
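SelSync decides at every training step whether to aggregate updates across workers or apply them locally, based on their significance. The simulation below illustrates that control flow for two workers with a simple relative-norm significance test; the threshold, the toy gradients, and the criterion itself are illustrative assumptions, not the paper's exact rule.

```python
# Hedged sketch: per-step choice between averaging updates across workers and
# applying them locally, keyed on a simple significance test (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, threshold = 2, 1000, 0.05
params = [np.zeros(dim) for _ in range(n_workers)]

def local_gradient(w):
    return 0.01 * params[w] + 0.001 * rng.standard_normal(dim)   # toy gradient

for step in range(100):
    updates = [-0.1 * local_gradient(w) for w in range(n_workers)]
    significance = [np.linalg.norm(u) / (np.linalg.norm(params[w]) + 1e-8)
                    for w, u in enumerate(updates)]
    if max(significance) > threshold:
        # significant step: synchronize (all-reduce / average) the updates
        avg = sum(updates) / n_workers
        params = [p + avg for p in params]
    else:
        # insignificant step: skip communication, apply local updates only
        params = [p + u for p, u in zip(params, updates)]
```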

Language Conditioned Traffic Generation

  • paper_url: http://arxiv.org/abs/2307.07947
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Shuhan Tan, Boris Ivanovic, Xinshuo Weng, Marco Pavone, Philipp Kraehenbuehl
  • for: To address the simulation problem in modern self-driving development: creating realistic, scalable, and interesting traffic scenarios.
  • methods: Language is used as a source of supervision for traffic scene generation; the model combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps and produces an initial traffic distribution as well as the dynamics of each vehicle.
  • results: Compared to prior work, LCTGen shows higher realism and fidelity in both unconditional and conditional traffic scene generation.
    Abstract Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. Code and video will be available at https://ariostgx.github.io/lctgen.

Surface Geometry Processing: An Efficient Normal-based Detail Representation

  • paper_url: http://arxiv.org/abs/2307.07945
  • repo_url: None
  • paper_authors: Wuyuan Xie, Miaohui Wang, Di Lin, Boxin Shi, Jianmin Jiang
  • for: To propose an efficient surface detail processing framework that avoids the large memory and computation cost of traditional methods in high-resolution 3D vision applications.
  • methods: A surface detail processing approach in the 2D normal domain that extracts a new normal feature representation as the carrier of micro-geometry structures; three important properties of this representation, detail separability, detail transferability, and detail idempotence, are demonstrated both theoretically and empirically.
  • results: Compared with the state of the art on the latest benchmark dataset, the proposed normal-based representation is effective and versatile: it accepts 30 times as many input surface vertices while taking only 6.5% of the memory cost and 14.0% of the running time of existing competing algorithms.
    Abstract With the rapid development of high-resolution 3D vision applications, the traditional way of manipulating surface detail requires considerable memory and computing time. To address these problems, we introduce an efficient surface detail processing framework in 2D normal domain, which extracts new normal feature representations as the carrier of micro geometry structures that are illustrated both theoretically and empirically in this article. Compared with the existing state of the arts, we verify and demonstrate that the proposed normal-based representation has three important properties, including detail separability, detail transferability and detail idempotence. Finally, three new schemes are further designed for geometric surface detail processing applications, including geometric texture synthesis, geometry detail transfer, and 3D surface super-resolution. Theoretical analysis and experimental results on the latest benchmark dataset verify the effectiveness and versatility of our normal-based representation, which accepts 30 times of the input surface vertices but at the same time only takes 6.5% memory cost and 14.0% running time in comparison with existing competing algorithms.
    摘要 随着高分辨率3D视觉应用的快速发展,传统的表面细节处理方式需要大量的内存和计算时间。为了解决这些问题,我们提出了一种在2D法线域中进行的高效表面细节处理框架,该框架提取新的法线特征表示作为微观几何结构的载体,并在本文中从理论和实验两方面加以说明。与现有最先进方法相比,我们验证并展示了所提出的基于法线的表示具有三个重要性质:细节可分离性、细节可迁移性和细节幂等性。最后,我们为几何表面细节处理应用设计了三种新方案,包括几何纹理合成、几何细节迁移和3D表面超分辨率。在最新的基准数据集上的理论分析和实验结果验证了该表示的有效性和通用性:在接受30倍输入表面顶点的情况下,其内存开销仅为现有竞争算法的6.5%,运行时间仅为14.0%。
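The detail separability and transferability properties lend themselves to a small illustration. The sketch below is only a toy analogue of working in the 2D normal domain: Gaussian smoothing stands in for whatever base/detail decomposition the paper actually uses, and the shapes and sigma are assumptions.

```python
# Split a normal map into a smooth base and a detail residual, transfer the residual
# onto another base, and check that a smooth base carries almost no residual.
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize(n):
    # re-normalize per-pixel normals to unit length
    return n / np.clip(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8, None)

def separate_detail(normal_map, sigma=3.0):
    base = normalize(gaussian_filter(normal_map, sigma=(sigma, sigma, 0)))
    detail = normal_map - base                      # residual carrying micro-geometry
    return base, detail

def transfer_detail(target_base, detail):
    return normalize(target_base + detail)          # paste the residual onto another base

H, W = 64, 64
src = normalize(np.dstack([0.1 * np.random.randn(H, W),
                           0.1 * np.random.randn(H, W),
                           np.ones((H, W))]))
base, detail = separate_detail(src)
flat = np.dstack([np.zeros((H, W)), np.zeros((H, W)), np.ones((H, W))])
textured = transfer_detail(flat, detail)            # flat surface now carries src's detail
_, residual = separate_detail(base)                 # "idempotence" in spirit: tiny residual
print(np.abs(detail).mean(), np.abs(residual).mean())
```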

CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion

  • paper_url: http://arxiv.org/abs/2307.07938
  • repo_url: None
  • paper_authors: Haotian Dong, Enhui Ma, Lubo Wang, Miaohui Wang, Wuyuan Xie, Qing Guo, Ping Li, Lingyu Liang, Kairui Yang, Di Lin
  • for: CVSformer is proposed to improve semantic scene completion by learning cross-view object relationships.
  • methods: CVSformer consists of Multi-View Feature Synthesis and Cross-View Transformer to learn cross-view object relationships.
  • results: CVSformer achieves state-of-the-art results on public datasets.
    Abstract Semantic scene completion (SSC) requires an accurate understanding of the geometric and semantic relationships between the objects in the 3D scene for reasoning the occluded objects. The popular SSC methods voxelize the 3D objects, allowing the deep 3D convolutional network (3D CNN) to learn the object relationships from the complex scenes. However, the current networks lack the controllable kernels to model the object relationship across multiple views, where appropriate views provide the relevant information for suggesting the existence of the occluded objects. In this paper, we propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships. In the multi-view feature synthesis, we use a set of 3D convolutional kernels rotated differently to compute the multi-view features for each voxel. In the cross-view transformer, we employ the cross-view fusion to comprehensively learn the cross-view relationships, which form useful information for enhancing the features of individual views. We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels. We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.
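The multi-view feature synthesis step, as described in the abstract, convolves the voxel volume with differently rotated copies of one 3D kernel. Below is a minimal sketch under assumed shapes (the rotation set, kernel size, and channel counts are illustrative, not the released model), meant only to show the mechanics of rotating the kernel rather than the volume.

```python
# One 3D kernel, several rotations, one feature volume per "view".
import torch
import torch.nn.functional as F

def multi_view_features(volume, weight, bias=None):
    # volume: (B, C_in, D, H, W); weight: (C_out, C_in, k, k, k)
    views = []
    for k90 in range(4):                                   # 0/90/180/270 degrees about one axis
        w = torch.rot90(weight, k=k90, dims=(3, 4))        # rotate the kernel, not the volume
        views.append(F.conv3d(volume, w, bias=bias, padding=weight.shape[-1] // 2))
    return torch.stack(views, dim=1)                       # (B, n_views, C_out, D, H, W)

vol = torch.randn(1, 8, 16, 16, 16)
w = torch.randn(12, 8, 3, 3, 3)
print(multi_view_features(vol, w).shape)                   # torch.Size([1, 4, 12, 16, 16, 16])
```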

S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality

  • paper_url: http://arxiv.org/abs/2307.07935
  • repo_url: None
  • paper_authors: Jinlong Li, Runsheng Xu, Xinyu Liu, Baolu Li, Qin Zou, Jiaqi Ma, Hongkai Yu
  • for: Addresses the degraded real-world performance of multi-agent cooperative perception models trained on simulated sensor data, caused by the domain gap between simulated and real data.
  • methods: Proposes the first simulation-to-reality transfer learning framework for multi-agent cooperative perception, based on a novel Vision Transformer (S2R-ViT) that considers both the Implementation Gap and the Feature Gap; an uncertainty-aware vision transformer relieves the Implementation Gap, and an agent-based feature adaptation module with inter-agent and ego-agent discriminators reduces the Feature Gap.
  • results: Extensive experiments on the public cooperative perception datasets OPV2V and V2V4Real show that S2R-ViT effectively bridges the simulation-to-reality gap and significantly outperforms other methods for point cloud-based 3D object detection.
    Abstract Due to the lack of real multi-agent data and time-consuming of labeling, existing multi-agent cooperative perception algorithms usually select the simulated sensor data for training and validating. However, the perception performance is degraded when these simulation-trained models are deployed to the real world, due to the significant domain gap between the simulated and real data. In this paper, we propose the first Simulation-to-Reality transfer learning framework for multi-agent cooperative perception using a novel Vision Transformer, named as S2R-ViT, which considers both the Implementation Gap and Feature Gap between simulated and real data. We investigate the effects of these two types of domain gaps and propose a novel uncertainty-aware vision transformer to effectively relief the Implementation Gap and an agent-based feature adaptation module with inter-agent and ego-agent discriminators to reduce the Feature Gap. Our intensive experiments on the public multi-agent cooperative perception datasets OPV2V and V2V4Real demonstrate that the proposed S2R-ViT can effectively bridge the gap from simulation to reality and outperform other methods significantly for point cloud-based 3D object detection.
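The agent-based feature adaptation described above relies on discriminators that try to separate simulated from real features. A common way to implement that kind of adversarial adaptation is a gradient-reversal layer in front of a small domain classifier; the sketch below uses that formulation as an assumed stand-in, not necessarily the paper's exact design.

```python
# Domain discriminator with gradient reversal: the classifier learns sim-vs-real,
# while reversed gradients push the feature extractor to make the domains indistinguishable.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class DomainDiscriminator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
    def forward(self, feat, lam=1.0):
        return self.net(GradReverse.apply(feat, lam))       # logits: simulated vs. real

disc = DomainDiscriminator()
sim_feat, real_feat = torch.randn(4, 256), torch.randn(4, 256)
logits = torch.cat([disc(sim_feat), disc(real_feat)])
labels = torch.cat([torch.zeros(4, 1), torch.ones(4, 1)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()   # reversed gradients would flow into whatever produced the features
```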

Contrastive Multi-Task Dense Prediction

  • paper_url: http://arxiv.org/abs/2307.07934
  • repo_url: https://github.com/USTCPCS/CVPR2018_attention
  • paper_authors: Siwei Yang, Hanrong Ye, Dan Xu
  • for: This paper addresses the problem of multi-task dense prediction, aiming to achieve simultaneous learning and inference on multiple dense prediction tasks in a single framework.
  • methods: The paper introduces feature-wise contrastive consistency to model cross-task interactions, which effectively boosts representation learning for different sub-tasks without extra expensive distillation modules.
  • results: The proposed multi-task contrastive learning approach achieves superior performance on two challenging datasets (NYUD-v2 and Pascal-Context), establishing new state-of-the-art results for dense predictions.
    Abstract This paper targets the problem of multi-task dense prediction which aims to achieve simultaneous learning and inference on a bunch of multiple dense prediction tasks in a single framework. A core objective in design is how to effectively model cross-task interactions to achieve a comprehensive improvement on different tasks based on their inherent complementarity and consistency. Existing works typically design extra expensive distillation modules to perform explicit interaction computations among different task-specific features in both training and inference, bringing difficulty in adaptation for different task sets, and reducing efficiency due to clearly increased size of multi-task models. In contrast, we introduce feature-wise contrastive consistency into modeling the cross-task interactions for multi-task dense prediction. We propose a novel multi-task contrastive regularization method based on the consistency to effectively boost the representation learning of the different sub-tasks, which can also be easily generalized to different multi-task dense prediction frameworks, and costs no additional computation in the inference. Extensive experiments on two challenging datasets (i.e. NYUD-v2 and Pascal-Context) clearly demonstrate the superiority of the proposed multi-task contrastive learning approach for dense predictions, establishing new state-of-the-art performances.
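A feature-wise contrastive consistency term of the kind described above can be written compactly as a symmetric InfoNCE loss over features sampled at the same pixels from two task branches. The temperature, sampling strategy, and feature dimensionality below are assumptions for illustration.

```python
# Corresponding locations across two task heads are positives; other locations are negatives.
import torch
import torch.nn.functional as F

def cross_task_contrastive(feat_a, feat_b, temperature=0.1):
    # feat_a, feat_b: (N, C) features from two task heads at the same N sampled pixels
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric InfoNCE: each location should match its counterpart in the other task
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

seg_feat = torch.randn(128, 64)     # e.g. sampled semantic-segmentation features
depth_feat = torch.randn(128, 64)   # e.g. sampled depth features at the same pixels
print(cross_task_contrastive(seg_feat, depth_feat).item())
```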

Holistic Prototype Attention Network for Few-Shot VOS

  • paper_url: http://arxiv.org/abs/2307.07933
  • repo_url: https://github.com/nust-machine-intelligence-laboratory/hpan
  • paper_authors: Yin Tang, Tao Chen, Xiruo Jiang, Yazhou Yao, Guo-Sen Xie, Heng-Tao Shen
  • for: Improves few-shot video object segmentation (FSVOS), i.e., segmenting dynamic objects of unseen classes given only a small set of support images with pixel-level object annotations.
  • methods: Proposes a Holistic Prototype Attention Network (HPAN) consisting of a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), which transfer informative knowledge from seen to unseen classes.
  • results: Extensive experiments on YouTube-FSVOS demonstrate the effectiveness and superiority of the proposed HPAN method.
    Abstract Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images that contain pixel-level object annotations. Existing methods have demonstrated that the domain agent-based attention mechanism is effective in FSVOS by learning the correlation between support images and query frames. However, the agent frame contains redundant pixel information and background noise, resulting in inferior segmentation performance. Moreover, existing methods tend to ignore inter-frame correlations in query videos. To alleviate the above dilemma, we propose a holistic prototype attention network (HPAN) for advancing FSVOS. Specifically, HPAN introduces a prototype graph attention module (PGAM) and a bidirectional prototype attention module (BPAM), transferring informative knowledge from seen to unseen classes. PGAM generates local prototypes from all foreground features and then utilizes their internal correlations to enhance the representation of the holistic prototypes. BPAM exploits the holistic information from support images and video frames by fusing co-attention and self-attention to achieve support-query semantic consistency and inner-frame temporal consistency. Extensive experiments on YouTube-FSVOS have been provided to demonstrate the effectiveness and superiority of our proposed HPAN method.
    摘要 “几帧影像物类分割(FSVOS)目的是将无法见的类别中的动态物类分割,通过一小集支持影像,其中包含像素级别物类标注。现有方法已经证明,对FSVOS使用域间代理机制可以将支持影像和询问帧之间建立相互关联。然而,代理帧中含有重复的像素信息和背景噪音,导致分割性能不佳。此外,现有方法往往忽略了询问影像之间的相互关联。为解决以上问题,我们提出了整体原型注意网络(HPAN),以提高FSVOS的性能。具体来说,HPAN包括一个原型图像注意模组(PGAM)和一个双向原型注意模组(BPAM),将有用的知识传递自见到未见的类别。PGAM从所有前景特征中生成本地区prototype,然后利用这些内部相关性来强化整体prototype的表现。BPAM利用支持影像和询问影像之间的共同关联和自我关联,实现支持询问semantic一致和内部时间一致。我们在YouTube-FSVOS上进行了广泛的实验,以证明我们提出的HPAN方法的有效性和superiority。”
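To ground the prototype idea, here is a minimal sketch of how foreground prototypes can be pooled from support features and matched to query features with cosine attention. Shapes and the max-over-prototypes scoring are assumptions; the released HPAN modules (PGAM/BPAM) are considerably richer.

```python
# Masked average pooling builds a prototype; cosine similarity scores query pixels.
import torch
import torch.nn.functional as F

def masked_prototype(feat, mask):
    # feat: (C, H, W); mask: (H, W) binary foreground mask
    m = mask.flatten().float()
    f = feat.flatten(1)                               # (C, H*W)
    return (f * m).sum(1) / m.sum().clamp(min=1.0)    # (C,) foreground prototype

def prototype_attention(query_feat, prototypes):
    # query_feat: (C, H, W); prototypes: (K, C)
    q = F.normalize(query_feat.flatten(1), dim=0)     # (C, H*W)
    p = F.normalize(prototypes, dim=1)                # (K, C)
    sim = p @ q                                       # (K, H*W) cosine similarities
    return sim.max(0).values.view(query_feat.shape[1:])  # best-prototype score per pixel

support_feat = torch.randn(64, 32, 32)
support_mask = (torch.rand(32, 32) > 0.7)
proto = masked_prototype(support_feat, support_mask)
scores = prototype_attention(torch.randn(64, 32, 32), proto.unsqueeze(0))
print(scores.shape)   # torch.Size([32, 32])
```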

DocTr: Document Transformer for Structured Information Extraction in Documents

  • paper_url: http://arxiv.org/abs/2307.07929
  • repo_url: None
  • paper_authors: Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan
  • for: Proposes a new formulation for structured information extraction (SIE) from visually rich documents.
  • methods: Inspired by anchor-based object detectors in vision, an entity is represented as an anchor word plus a bounding box, and entity linking is represented as associations between anchor words; a simple pre-training strategy helps learn entity detection in the context of language.
  • results: Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.
    Abstract We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a DOCument TRansformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.
    摘要 我们提出了一种新的结构化信息抽取(SIE)方法,用于从视觉丰富的文档中提取结构化信息。该方法旨在解决现有IOB标注或基于图的建模方式的局限:它们要么过度依赖输入文本的正确顺序,要么难以解码复杂的图结构。受视觉领域基于锚点的目标检测器启发,我们将实体表示为一个锚点词和一个边界框,并将实体链接表示为锚点词之间的关联。这种表示对文本顺序更加鲁棒,并为实体链接保持了紧凑的图结构。在此基础上,我们提出:1)DOCument TRansformer(DocTr),用于在视觉丰富文档中检测并关联实体边界框;2)一种简单的预训练策略,帮助在语言上下文中学习实体检测。在三个SIE基准上的评估显示了所提方法的有效性,整体方法优于现有方案。
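The anchor-word formulation itself is easy to picture as a data structure: an entity is an anchor word with a bounding box and a label, and linking is a score over pairs of anchors. The tiny sketch below is purely illustrative; the distance-based score is a hypothetical placeholder, not DocTr's learned association head.

```python
# An entity as (anchor word, bbox, label); linking as a pairwise association score.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entity:
    anchor_word: str
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1)
    label: str                                # e.g. "key", "value"

def link_score(a: Entity, b: Entity) -> float:
    # toy association score: closer anchor boxes are more likely to be linked
    ax, ay = (a.bbox[0] + a.bbox[2]) / 2, (a.bbox[1] + a.bbox[3]) / 2
    bx, by = (b.bbox[0] + b.bbox[2]) / 2, (b.bbox[1] + b.bbox[3]) / 2
    return 1.0 / (1.0 + ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5)

entities: List[Entity] = [
    Entity("Total", (400, 700, 460, 715), "key"),
    Entity("$12.50", (480, 700, 540, 715), "value"),
]
print(link_score(entities[0], entities[1]))
```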

Reinforced Disentanglement for Face Swapping without Skip Connection

  • paper_url: http://arxiv.org/abs/2307.07928
  • repo_url: https://github.com/alaist/RD-FS
  • paper_authors: Xiaohang Ren, Xingyu Chen, Pengfei Yao, Heung-Yeung Shum, Baoyuan Wang
  • for: Addresses the problem in SOTA face swap models where either the target identity (shape) is leaked or the target non-identity attributes (background, hair) are not fully preserved.
  • methods: Introduces a new face swap framework, 'WSC-swap', that eliminates skip connections and uses two target encoders to respectively capture pixel-level non-facial region attributes and semantic non-identity attributes in the face region; it further employs an identity removal loss via adversarial training and a non-identity preservation loss via prior 3DMM models.
  • results: Extensive experiments on FaceForensics++ and CelebA-HQ show that the results significantly outperform previous works on a rich set of metrics, including a new metric for measuring identity consistency that had previously been neglected.
    Abstract The SOTA face swap models still suffer the problem of either target identity (i.e., shape) being leaked or the target non-identity attributes (i.e., background, hair) failing to be fully preserved in the final results. We show that this insufficient disentanglement is caused by two flawed designs that were commonly adopted in prior models: (1) counting on only one compressed encoder to represent both the semantic-level non-identity facial attributes(i.e., pose) and the pixel-level non-facial region details, which is contradictory to satisfy at the same time; (2) highly relying on long skip-connections between the encoder and the final generator, leaking a certain amount of target face identity into the result. To fix them, we introduce a new face swap framework called 'WSC-swap' that gets rid of skip connections and uses two target encoders to respectively capture the pixel-level non-facial region attributes and the semantic non-identity attributes in the face region. To further reinforce the disentanglement learning for the target encoder, we employ both identity removal loss via adversarial training (i.e., GAN) and the non-identity preservation loss via prior 3DMM models like [11]. Extensive experiments on both FaceForensics++ and CelebA-HQ show that our results significantly outperform previous works on a rich set of metrics, including one novel metric for measuring identity consistency that was completely neglected before.
    摘要 现有最先进的人脸交换模型仍然存在两个问题:要么目标身份(即脸型)发生泄露,要么目标的非身份属性(如背景、头发)无法在最终结果中得到完整保留。我们指出,这种不充分的解耦源于先前模型普遍采用的两个有缺陷的设计:(1)仅依靠一个压缩编码器同时表示语义层面的非身份人脸属性(如姿态)和像素层面的非人脸区域细节,这两者难以同时满足;(2)高度依赖编码器与最终生成器之间的长跳跃连接,从而将一定程度的目标人脸身份泄露到结果中。为了解决这些问题,我们提出了一个新的人脸交换框架'WSC-swap',它去除了跳跃连接,并使用两个目标编码器分别捕获像素层面的非人脸区域属性和人脸区域内语义层面的非身份属性。为进一步加强目标编码器的解耦学习,我们同时采用基于对抗训练(GAN)的身份去除损失和基于先验3DMM模型(如[11])的非身份保留损失。在FaceForensics++和CelebA-HQ上的大量实验表明,我们的结果在一系列指标上显著优于以往工作,其中包括一个此前被完全忽视的、用于衡量身份一致性的新指标。
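The two disentanglement objectives described in the abstract can be sketched over precomputed embeddings: an adversarial identity-removal term that penalizes any identity information a probe can recover from the non-identity code, and a preservation term on 3DMM-style attribute coefficients. The probe, dimensions, and weights below are assumptions, and only the encoder-side loss is shown.

```python
# Encoder-side losses: suppress identity leakage measured by an adversarial probe,
# and keep 3DMM-style non-identity attributes close to the target's.
import torch
import torch.nn as nn
import torch.nn.functional as F

id_probe = nn.Linear(256, 512)   # stand-in adversarial probe: non-identity code -> identity embedding

def disentangle_losses(non_id_code, target_id_embed, pred_3dmm, target_3dmm, w_id=1.0, w_attr=1.0):
    # identity removal: minimize the similarity the probe can achieve with the target identity
    probed = F.normalize(id_probe(non_id_code), dim=-1)
    id_leak = F.cosine_similarity(probed, F.normalize(target_id_embed, dim=-1)).mean()
    # non-identity preservation: pose/expression-style coefficients should be kept
    attr_loss = F.l1_loss(pred_3dmm, target_3dmm)
    return w_id * id_leak + w_attr * attr_loss

loss = disentangle_losses(torch.randn(4, 256), torch.randn(4, 512),
                          torch.randn(4, 64), torch.randn(4, 64))
print(loss.item())
```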

RayMVSNet++: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2307.10233
  • repo_url: None
  • paper_authors: Yifei Shi, Junhua Xi, Dewen Hu, Zhiping Cai, Kai Xu
  • for: Proposes a learning-based multi-view stereo (MVS) method that improves accuracy and efficiency by avoiding full cost-volume optimization.
  • methods: Directly optimizes the depth value along each camera ray, mimicking the range finding of a laser scanner; a 1D implicit field is predicted sequentially along each ray using transformer features, with the zero-crossing point indicating scene depth, essentially learning the epipolar line search of traditional multi-view stereo.
  • results: Ranks top over all previous learning-based methods, with an overall reconstruction score of 0.33mm on DTU and an F-score of 59.48% on Tanks & Temples, producing high-quality depth and point clouds in challenging scenes; the extended RayMVSNet++, which adds an attentional gating unit over neighboring rays within the local frustum, achieves state-of-the-art performance on ScanNet, including an AbsRel of 0.058m and accurate results on the textureless and large-depth-variation subsets.
    Abstract Learning-based multi-view stereo (MVS) has by far centered around 3D convolution on cost volumes. Due to the high computation and memory consumption of 3D CNN, the resolution of output depth is often considerably limited. Different from most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization which is much more light-weight than full cost volume optimization. In particular, we propose RayMVSNet which learns sequential prediction of a 1D implicit field along each camera ray with the zero-crossing point indicating scene depth. This sequential modeling, conducted based on transformer features, essentially learns the epipolar line search in traditional multi-view stereo. We devise a multi-task learning for better optimization convergence and depth accuracy. We found the monotonicity property of the SDFs along each ray greatly benefits the depth estimation. Our method ranks top on both the DTU and the Tanks & Temples datasets over all previous learning-based methods, achieving an overall reconstruction score of 0.33mm on DTU and an F-score of 59.48% on Tanks & Temples. It is able to produce high-quality depth estimation and point cloud reconstruction in challenging scenarios such as objects/scenes with non-textured surface, severe occlusion, and highly varying depth range. Further, we propose RayMVSNet++ to enhance contextual feature aggregation for each ray through designing an attentional gating unit to select semantically relevant neighboring rays within the local frustum around that ray. RayMVSNet++ achieves state-of-the-art performance on the ScanNet dataset. In particular, it attains an AbsRel of 0.058m and produces accurate results on the two subsets of textureless regions and large depth variation.
    摘要 学习基于多视图涂抹(MVS)的方法主要集中在3D convolution中的成本量和存储占用。由于3D CNN的计算和存储占用非常高,输出深度的分辨率经常受到限制。与大多数现有的成本量优化方法不同,我们直接优化摄像头方向上的深度值,模拟激光雷达扫描器的范围找寻。这将MVS问题降低到折线基于深度优化,与全成本量优化相比许多轻量级。特别是,我们提出了RayMVSNet,它通过学习每个摄像头方向上的1D隐函数来预测折线上的深度值,并在扫描器范围内查找零交叉点。这种顺序模型化,基于变换器特征,实际上学习了传统多视图涂抹中的epipolar线搜索。我们设计了多任务学习来改进优化的吞吐量和深度准确率。我们发现折线上SDF的 monotonicity 性帮助深度估计。我们的方法在DTU和Tanks & Temples数据集上至今为止的所有学习基于方法中排名第一,实现了总重建分数为0.33mm(DTU)和59.48%(Tanks & Temples)。它能够在物体/场景中的非杂表面、严重遮挡和高度变化的深度范围中生成高质量的深度估计和点云重建。此外,我们提出了RayMVSNet++,它通过设计了注意力闭合单元来选择当地frustum中的相互 relevante的折线,从而增强每个折线的上下文特征汇集。RayMVSNet++在ScanNet数据集上达到了状态的最佳性能,其中包括Textureless Regions和大深度变化两个子集。
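The core ray-based idea reduces depth estimation to locating the zero-crossing of a 1D field predicted along each camera ray. The sketch below shows only that last step, with made-up field values; in the actual pipeline the values come from the transformer-based sequential prediction.

```python
# Find the first positive-to-negative zero-crossing of a sampled 1D field along a ray.
import numpy as np

def zero_crossing_depth(depths, field):
    # depths: (N,) increasing sample depths along one ray; field: (N,) predicted signed values
    sign_change = np.where(np.diff(np.sign(field)) < 0)[0]   # positive -> negative crossing
    if sign_change.size == 0:
        return None
    i = sign_change[0]
    t = field[i] / (field[i] - field[i + 1])                 # linear interpolation weight
    return depths[i] + t * (depths[i + 1] - depths[i])

depths = np.linspace(0.5, 2.0, 16)
field = 1.3 - depths                                         # toy field crossing zero at depth 1.3
print(zero_crossing_depth(depths, field))                    # ~1.3
```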

On the Robustness of Split Learning against Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2307.07916
  • repo_url: https://github.com/fmy266/SplitADV
  • paper_authors: Mingyuan Fan, Cen Chen, Chengyu Wang, Wenmeng Zhou, Jun Huang
  • for: Evaluates whether split learning, which protects data privacy and model security by avoiding direct sharing of raw data and full model details, is actually robust against adversarial attacks.
  • methods: In split learning, the server and clients hold only partial sub-networks and exchange intermediate computations; to probe this setting, a tailored attack called SPADV is developed, consisting of shadow model training (requiring only a few unlabeled non-IID samples) and a local adversarial attack that perturbs the intermediate output of natural samples.
  • results: Even when untrusted servers only have access to intermediate layers, the attack is cheap yet empirically highly effective, revealing a surprising vulnerability of split learning to adversarial attacks.
    Abstract Split learning enables collaborative deep learning model training while preserving data privacy and model security by avoiding direct sharing of raw data and model details (i.e., sever and clients only hold partial sub-networks and exchange intermediate computations). However, existing research has mainly focused on examining its reliability for privacy protection, with little investigation into model security. Specifically, by exploring full models, attackers can launch adversarial attacks, and split learning can mitigate this severe threat by only disclosing part of models to untrusted servers.This paper aims to evaluate the robustness of split learning against adversarial attacks, particularly in the most challenging setting where untrusted servers only have access to the intermediate layers of the model.Existing adversarial attacks mostly focus on the centralized setting instead of the collaborative setting, thus, to better evaluate the robustness of split learning, we develop a tailored attack called SPADV, which comprises two stages: 1) shadow model training that addresses the issue of lacking part of the model and 2) local adversarial attack that produces adversarial examples to evaluate.The first stage only requires a few unlabeled non-IID data, and, in the second stage, SPADV perturbs the intermediate output of natural samples to craft the adversarial ones. The overall cost of the proposed attack process is relatively low, yet the empirical attack effectiveness is significantly high, demonstrating the surprising vulnerability of split learning to adversarial attacks.
    摘要 拆分学习(split learning)通过避免直接共享原始数据和完整的模型细节(服务器和客户端各自只持有部分子网络,并交换中间计算结果),在保护数据隐私和模型安全的同时实现协作式深度学习模型训练。然而,现有研究主要关注其在隐私保护方面的可靠性,对模型安全的研究很少。具体来说,攻击者在掌握完整模型时可以发起对抗攻击,而拆分学习由于只向不可信服务器暴露部分模型,有望缓解这一严重威胁。本文旨在评估拆分学习对对抗攻击的鲁棒性,特别是在最具挑战性的设定下:不可信服务器仅能访问模型的中间层。现有的对抗攻击大多针对集中式而非协作式设定,因此,为更好地评估拆分学习的鲁棒性,我们设计了一种定制攻击SPADV,它包含两个阶段:1)影子模型训练,解决缺少部分模型的问题;2)本地对抗攻击,生成对抗样本用于评估。第一阶段只需要少量无标注的非独立同分布数据;在第二阶段,SPADV对自然样本的中间输出施加扰动以构造对抗样本。整个攻击过程的总成本相对较低,但实证攻击效果却非常显著,表明拆分学习对对抗攻击出人意料地脆弱。
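The second stage of the attack, as described, perturbs the intermediate output that the client would send to the server, guided by a shadow model standing in for the unknown server part. The sketch below assumes toy network shapes, an L-infinity budget on the features, and PGD-style updates; none of these details are claimed to match the released SPADV code.

```python
# Perturb the client's intermediate output against a shadow head that mimics the server side.
import torch
import torch.nn as nn

client_part = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten())
shadow_head = nn.Sequential(nn.Linear(8 * 32 * 32, 10))   # stands in for the unseen server sub-network

def spadv_like_attack(x, y, eps=0.05, alpha=0.01, steps=10):
    with torch.no_grad():
        z = client_part(x)                                  # intermediate output the client would send
    z_adv = z.clone()
    for _ in range(steps):
        z_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(shadow_head(z_adv), y)
        (grad,) = torch.autograd.grad(loss, z_adv)
        z_adv = (z_adv + alpha * grad.sign()).detach()      # ascend on the shadow model's loss
        z_adv = torch.min(torch.max(z_adv, z - eps), z + eps)
    return z_adv

x, y = torch.rand(2, 3, 32, 32), torch.tensor([1, 3])
z_adv = spadv_like_attack(x, y)
print((z_adv - client_part(x)).abs().max().item())          # stays within the feature budget
```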

Predicting mechanical properties of Carbon Nanotube (CNT) images Using Multi-Layer Synthetic Finite Element Model Simulations

  • paper_url: http://arxiv.org/abs/2307.07912
  • repo_url: None
  • paper_authors: Kaveh Safavigerdini, Koundinya Nouduri, Ramakrishna Surya, Andrew Reinhard, Zach Quinlan, Filiz Bunyak, Matthew R. Maschmann, Kannappan Palaniappan
  • For: The paper predicts mechanical properties of vertically-oriented carbon nanotube (CNT) forest images using a deep learning model for artificial intelligence (AI)-based materials discovery.
  • Methods: The paper uses an innovative data augmentation technique based on multi-layer synthetic (MLS), or quasi-2.5D, images generated by blending 2D synthetic images; a physics-based model estimates mechanical properties such as stiffness and buckling load for the MLS images. The proposed deep learning architecture, CNTNeXt, builds upon the previous CNTNet neural network, using a ResNeXt feature representation followed by a random forest regression estimator.
  • Results: The proposed machine learning approach is expected to outperform single-synthetic-image-based learning when predicting mechanical properties of real scanning electron microscopy images, which has the potential to accelerate understanding and control of CNT forest self-assembly for diverse applications.
    Abstract We present a pipeline for predicting mechanical properties of vertically-oriented carbon nanotube (CNT) forest images using a deep learning model for artificial intelligence (AI)-based materials discovery. Our approach incorporates an innovative data augmentation technique that involves the use of multi-layer synthetic (MLS) or quasi-2.5D images which are generated by blending 2D synthetic images. The MLS images more closely resemble 3D synthetic and real scanning electron microscopy (SEM) images of CNTs but without the computational cost of performing expensive 3D simulations or experiments. Mechanical properties such as stiffness and buckling load for the MLS images are estimated using a physics-based model. The proposed deep learning architecture, CNTNeXt, builds upon our previous CNTNet neural network, using a ResNeXt feature representation followed by random forest regression estimator. Our machine learning approach for predicting CNT physical properties by utilizing a blended set of synthetic images is expected to outperform single synthetic image-based learning when it comes to predicting mechanical properties of real scanning electron microscopy images. This has the potential to accelerate understanding and control of CNT forest self-assembly for diverse applications.
    摘要 我们提出了一个管道,用于预测纵向碳纳米管(CNT)森林图像中的机械性能,使用深度学习模型,以实现人工智能(AI)基于材料发现。我们的方法包括一种创新的数据增强技术,使用多层合成(MLS)或 quasi-2.5D 图像,这些图像由混合2D 合成图像来生成。MLS 图像更加closely resemble 3D 合成和实验室扫描电子镜像(SEM)图像,但没有 computationally expensive 3D simulations 或实验的成本。机械性能,如刚性和塌笔荷,对 MLS 图像进行估算,使用物理基础模型。我们提出的深度学习架构,CNTNeXt,基于我们之前的 CNTNet 神经网络,使用 ResNeXt 特征表示,然后使用随机森林回归估计器。我们的机器学习方法,通过使用混合的合成图像来预测 CNT 物理性能,对于实验室扫描电子镜像图像的机械性能预测,具有更高的性能,相比单个合成图像基于学习。这有助于加速理解和控制 CNT 森林自组装的应用。
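The multi-layer synthetic (MLS) images are described as blends of 2D synthetic images. A minimal sketch of such a blend is shown below; the depth-decaying weights are an assumption used purely to illustrate the quasi-2.5D idea.

```python
# Alpha-blend several 2D synthetic layers, front to back, into one quasi-2.5D image.
import numpy as np

def blend_layers(layers, decay=0.6):
    # layers: list of (H, W) grayscale synthetic images, ordered front to back
    weights = np.array([decay ** i for i in range(len(layers))])
    weights /= weights.sum()
    return sum(w * img for w, img in zip(weights, layers))

layers = [np.random.rand(128, 128) for _ in range(4)]
mls = blend_layers(layers)
print(mls.shape, float(mls.min()), float(mls.max()))
```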

Multitemporal SAR images change detection and visualization using RABASAR and simplified GLR

  • paper_url: http://arxiv.org/abs/2307.07892
  • repo_url: None
  • paper_authors: Weiying Zhao, Charles-Alban Deledalle, Loïc Denis, Henri Maître, Jean-Marie Nicolas, Florence Tupin
  • for: Detecting different kinds of land-surface changes from multitemporal SAR images to better support land surface monitoring.
  • methods: Proposes a simplified generalized likelihood ratio (SGLR) test assuming that corresponding temporal pixels have the same equivalent number of looks (ENL), applied to images denoised with the ratio-based multitemporal method RABASAR; also develops a new change magnitude index, an improved spectral-clustering-based change classification, and detection of the maximum change magnitude time and the change starting and ending times, visualized with an adaptation of the REACTIV method.
  • results: Experiments on simulated and real SAR images, with comparisons to classical techniques, demonstrate good performance in detecting farmland, building, harbour, and flooding area changes.
    Abstract Understanding the state of changed areas requires that precise information be given about the changes. Thus, detecting different kinds of changes is important for land surface monitoring. SAR sensors are ideal to fulfil this task, because of their all-time and all-weather capabilities, with good accuracy of the acquisition geometry and without effects of atmospheric constituents for amplitude data. In this study, we propose a simplified generalized likelihood ratio ($S_{GLR}$) method assuming that corresponding temporal pixels have the same equivalent number of looks (ENL). Thanks to the denoised data provided by a ratio-based multitemporal SAR image denoising method (RABASAR), we successfully applied this similarity test approach to compute the change areas. A new change magnitude index method and an improved spectral clustering-based change classification method are also developed. In addition, we apply the simplified generalized likelihood ratio to detect the maximum change magnitude time, and the change starting and ending times. Then, we propose to use an adaptation of the REACTIV method to visualize the detection results vividly. The effectiveness of the proposed methods is demonstrated through the processing of simulated and SAR images, and the comparison with classical techniques. In particular, numerical experiments proved that the developed method has good performances in detecting farmland area changes, building area changes, harbour area changes and flooding area changes.
    摘要 理解改变区域的状态需要提供精确的改变信息。因此,检测不同类型的改变是重要的 для地面监测。SAR探测器是理想的选择,因为它们在任何时间和天气条件下都有good accuracy的获取geometry和无 atmospheric constituents的影响。在这项研究中,我们提出了一种简化了通用类比比率(SGLR)方法,假设同一时间的批量像素具有相同的等效数量looks(ENL)。经过了 ratio-based multitemporal SAR图像减噪方法(RABASAR)提供的净化数据,我们成功地应用了这种相似测试方法来计算改变区域。此外,我们还开发了一种改进的光谱分类法来分类改变,以及一种最大变化强度时间、改变开始和结束时间的检测方法。最后,我们使用了一种基于REACTIV方法的修改来可见化检测结果。我们通过处理 simulated和SAR图像,以及与传统技术进行比较,证明了我们提出的方法的有效性。具体来说,数值实验表明,我们的方法在检测农田改变、建筑改变、港口改变和洪涝改变方面具有良好的表现。
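A pixelwise likelihood-ratio change statistic between two denoised intensity images sharing the same ENL can be sketched as below. The particular statistic (an ENL-scaled log ratio of the arithmetic to the geometric mean) is one standard GLR form used here as a stand-in; the paper's simplified GLR and the thresholding details may differ.

```python
# Pixelwise likelihood-ratio change statistic for two SAR intensity images with equal ENL.
import numpy as np

def glr_change_statistic(i1, i2, enl=10.0, eps=1e-10):
    # i1, i2: co-registered, denoised intensity images
    i1, i2 = np.maximum(i1, eps), np.maximum(i2, eps)
    return 2.0 * enl * (np.log((i1 + i2) / 2.0) - 0.5 * (np.log(i1) + np.log(i2)))

rng = np.random.default_rng(0)
img_t1 = rng.gamma(shape=10, scale=1.0 / 10, size=(64, 64))        # stable area, mean intensity 1
img_t2 = img_t1.copy()
img_t2[20:40, 20:40] = rng.gamma(10, 3.0 / 10, size=(20, 20))      # changed block, mean intensity 3
stat = glr_change_statistic(img_t1, img_t2)
print(stat[20:40, 20:40].mean(), stat[:10, :10].mean())            # changed block scores well above ~0
```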

Why Does Little Robustness Help? Understanding Adversarial Transferability From Surrogate Training

  • paper_url: http://arxiv.org/abs/2307.07873
  • repo_url: None
  • paper_authors: Yechao Zhang, Shengshan Hu, Leo Yu Zhang, Junyu Shi, Minghui Li, Xiaogeng Liu, Wei Wan, Hai Jin
  • for: Aims to understand the transferability of adversarial examples (AEs) across DNNs, with a particular focus on the surrogate model used to craft them.
  • methods: Through a series of theoretical and empirical analyses, attributes the "little robustness" phenomenon to a trade-off between model smoothness and gradient similarity, and conjectures that the data distribution shift introduced by adversarial training explains the degradation of gradient similarity; the joint effects of the two factors are further studied under data augmentation and gradient regularization.
  • results: The trade-off persists across various training mechanisms, and jointly optimizing model smoothness and gradient similarity (e.g., combining input gradient regularization with sharpness-aware minimization, SAM) yields better surrogates and stronger transfer attacks, as validated by extensive experiments.
    Abstract Adversarial examples (AEs) for DNNs have been shown to be transferable: AEs that successfully fool white-box surrogate models can also deceive other black-box models with different architectures. Although a bunch of empirical studies have provided guidance on generating highly transferable AEs, many of these findings lack explanations and even lead to inconsistent advice. In this paper, we take a further step towards understanding adversarial transferability, with a particular focus on surrogate aspects. Starting from the intriguing little robustness phenomenon, where models adversarially trained with mildly perturbed adversarial samples can serve as better surrogates, we attribute it to a trade-off between two predominant factors: model smoothness and gradient similarity. Our investigations focus on their joint effects, rather than their separate correlations with transferability. Through a series of theoretical and empirical analyses, we conjecture that the data distribution shift in adversarial training explains the degradation of gradient similarity. Building on these insights, we explore the impacts of data augmentation and gradient regularization on transferability and identify that the trade-off generally exists in the various training mechanisms, thus building a comprehensive blueprint for the regulation mechanism behind transferability. Finally, we provide a general route for constructing better surrogates to boost transferability which optimizes both model smoothness and gradient similarity simultaneously, e.g., the combination of input gradient regularization and sharpness-aware minimization (SAM), validated by extensive experiments. In summary, we call for attention to the united impacts of these two factors for launching effective transfer attacks, rather than optimizing one while ignoring the other, and emphasize the crucial role of manipulating surrogate models.
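As a concrete reading of the suggested recipe, the sketch below trains a surrogate with an input-gradient penalty (toward gradient-related regularity) plus a SAM-style sharpness-aware weight perturbation (toward smoothness). The tiny model, rho, and lambda are assumptions, and the ascent direction reuses the regularized gradient rather than exactly reproducing SAM.

```python
# One training step: input-gradient penalty + SAM-style perturb / re-evaluate / restore / step.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def surrogate_step(x, y, rho=0.05, lam=0.1):
    opt.zero_grad()
    # input-gradient regularization on the clean weights
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    (gx,) = torch.autograd.grad(loss, x, create_graph=True)
    total = loss + lam * gx.pow(2).sum(dim=1).mean()
    total.backward()

    # SAM-style ascent: move weights toward the local worst case, re-evaluate, restore
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))
    opt.zero_grad()
    nn.functional.cross_entropy(model(x.detach()), y).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))
    opt.step()                      # descend using gradients taken at the perturbed point

surrogate_step(torch.randn(8, 32), torch.randint(0, 10, (8,)))
```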

Unified Adversarial Patch for Cross-modal Attacks in the Physical World

  • paper_url: http://arxiv.org/abs/2307.07859
  • repo_url: None
  • paper_authors: Xingxing Wei, Yao Huang, Yitong Sun, Jie Yu
  • for: This paper aims to demonstrate the potential risks of physical adversarial attacks on object detectors that use both visible and infrared sensors.
  • methods: The authors propose a unified adversarial patch that can fool both visible and infrared object detectors simultaneously, using a single patch. They design a novel boundary-limited shape optimization method to achieve compact and smooth shapes, and propose a score-aware iterative evaluation to balance the fooling degree between the two sensors.
  • results: The authors achieve an Attack Success Rate (ASR) of 73.33% and 69.17% against one-stage (YOLOv3) and two-stage (Faster RCNN) object detectors, respectively. They also verify the effectiveness of the attacks in the physical world under various settings, such as different angles, distances, postures, and scenes.
    Abstract Recently, physical adversarial attacks have been presented to evade DNNs-based object detectors. To ensure the security, many scenarios are simultaneously deployed with visible sensors and infrared sensors, leading to the failures of these single-modal physical attacks. To show the potential risks under such scenes, we propose a unified adversarial patch to perform cross-modal physical attacks, i.e., fooling visible and infrared object detectors at the same time via a single patch. Considering different imaging mechanisms of visible and infrared sensors, our work focuses on modeling the shapes of adversarial patches, which can be captured in different modalities when they change. To this end, we design a novel boundary-limited shape optimization to achieve the compact and smooth shapes, and thus they can be easily implemented in the physical world. In addition, to balance the fooling degree between visible detector and infrared detector during the optimization process, we propose a score-aware iterative evaluation, which can guide the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. We finally test our method against the one-stage detector: YOLOv3 and the two-stage detector: Faster RCNN. Results show that our unified patch achieves an Attack Success Rate (ASR) of 73.33% and 69.17%, respectively. More importantly, we verify the effective attacks in the physical world when visible and infrared sensors shoot the objects under various settings like different angles, distances, postures, and scenes.
    摘要 最近,物理攻击被提出以逃脱基于DNN的物体检测器。为确保安全,许多场景同时使用可见感知器和红外感知器,导致单模态物理攻击失败。为了表明这些场景下的风险,我们提议一种横跨模态物理攻击,即通过单个贴图 Fooled 可见和红外检测器。considering 不同的感知机制,我们的工作专注于模型适应器形状的设计,这些形状可以在不同的感知器下被捕捉。为此,我们设计了一种 novel 边界限定的形状优化方法,以实现紧凑和平滑的形状,这些形状可以轻松地在物理世界中实现。此外,为保证可见检测器和红外检测器在优化过程中的攻击度差异,我们提议一种分数感知迭代评估,可以导引攻击贴图在迭代过程中逐渐减少多模检测器的预测分数。最后,我们对 YOLOv3 和 Faster RCNN 进行测试,结果显示我们的横跨模态贴图可以在不同的角度、距离、姿态和场景下实现73.33% 和 69.17% 的攻击成功率。更重要的是,我们在物理世界中验证了这些攻击的有效性,当可见和红外感知器在不同的设置下拍摄对象时。
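The score-aware balancing can be read as a re-weighted objective: whichever detector still assigns a high score to the target gets a larger share of the gradient. The sketch below shows only that weighting over scalar scores; the detector calls are replaced by arbitrary differentiable stand-ins, and none of the constants are from the paper.

```python
# Score-aware weighting over two modality scores, minimized with respect to patch parameters.
import torch

def score_aware_loss(vis_score, ir_score):
    # vis_score / ir_score: max objectness of the target person under each modality
    scores = torch.stack([vis_score, ir_score])
    weights = torch.softmax(scores.detach(), dim=0)    # the less-fooled modality gets more weight
    return (weights * scores).sum()                    # minimizing this suppresses both detectors

shape_params = torch.randn(16, requires_grad=True)     # e.g. boundary-limited patch shape parameters
opt = torch.optim.Adam([shape_params], lr=0.05)
for _ in range(5):
    vis_score = shape_params.sigmoid().mean()          # differentiable stand-in for the visible detector
    ir_score = shape_params.cos().abs().mean()         # differentiable stand-in for the infrared detector
    loss = score_aware_loss(vis_score, ir_score)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```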

Neural Video Recovery for Cloud Gaming

  • paper_url: http://arxiv.org/abs/2307.07847
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Zhaoyuan He, Yifan Yang, Shuozhe Li, Diyuan Dai, Lili Qiu
  • for: Improve the accuracy and efficiency of video frame recovery for cloud gaming.
  • methods: Use game states to significantly enhance recovery accuracy and use partially decoded frames to recover the lost portions of video frames.
  • results: Implementations on an iPhone 12 and a laptop demonstrate the utility of game states for game video recovery and the effectiveness of the overall design.
    Abstract Cloud gaming is a multi-billion dollar industry. A client in cloud gaming sends its movement to the game server on the Internet, which renders and transmits the resulting video back. In order to provide a good gaming experience, a latency below 80 ms is required. This means that video rendering, encoding, transmission, decoding, and display have to finish within that time frame, which is especially challenging to achieve due to server overload, network congestion, and losses. In this paper, we propose a new method for recovering lost or corrupted video frames in cloud gaming. Unlike traditional video frame recovery, our approach uses game states to significantly enhance recovery accuracy and utilizes partially decoded frames to recover lost portions. We develop a holistic system that consists of (i) efficiently extracting game states, (ii) modifying H.264 video decoder to generate a mask to indicate which portions of video frames need recovery, and (iii) designing a novel neural network to recover either complete or partial video frames. Our approach is extensively evaluated using iPhone 12 and laptop implementations, and we demonstrate the utility of game states in the game video recovery and the effectiveness of our overall design.
    摘要 云台游戏是一个多百亿美元的业态。客户端在云台游戏中将其运动发送到游戏服务器上,服务器在互联网上渲染和传输结果,并将视频传输回客户端。为提供良好的游戏体验,云台游戏中的延迟必须低于80ms。这意味着视频渲染、编码、传输、解码和显示必须在这个时间段内完成,这是特别困难的因为服务器过载、网络压力和损失。在这篇论文中,我们提出了一种新的视频帧恢复方法,与传统视频帧恢复方法不同,我们的方法使用游戏状态进行明显提高恢复精度,并使用部分解码的帧来恢复丢失的部分。我们设计了一个整体系统,包括(i)高效地提取游戏状态,(ii)修改H.264视频解码器生成一个抑制器,用于指示需要恢复的视频帧部分,以及(iii)设计一种新的神经网络来恢复完整或部分的视频帧。我们的方法在iPhone 12和笔记机实现中进行了广泛的评估,并证明游戏状态在游戏视频恢复中的重要性以及我们的总体设计的有效性。
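A minimal sketch of the recovery step described above: the modified decoder supplies a mask of lost regions, a game-state vector is broadcast as extra channels, and a small network repaints only the masked pixels of the partially decoded frame. The network, state encoding, and shapes are illustrative assumptions, not the paper's architecture.

```python
# Fill only the decoder-flagged lost pixels, conditioning on a game-state vector.
import torch
import torch.nn as nn

class ToyRecoveryNet(nn.Module):
    def __init__(self, state_dim=16):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, 8)
        self.conv = nn.Sequential(nn.Conv2d(3 + 8 + 1, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, partial_frame, mask, game_state):
        B, _, H, W = partial_frame.shape
        s = self.state_proj(game_state).view(B, 8, 1, 1).expand(B, 8, H, W)
        pred = self.conv(torch.cat([partial_frame, s, mask], dim=1))
        return partial_frame * (1 - mask) + pred * mask       # only lost regions are replaced

net = ToyRecoveryNet()
frame = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.8).float()               # 1 where the decoder flagged loss
out = net(frame, mask, torch.randn(1, 16))
print(out.shape)   # torch.Size([1, 3, 64, 64])
```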