cs.CV - 2023-10-19

A Car Model Identification System for Streamlining the Automobile Sales Process

  • paper_url: http://arxiv.org/abs/2310.13198
  • repo_url: None
  • paper_authors: Said Togru, Marco Moldovan
  • for: automating car model and make identification from images to improve online car-selling platform efficiency
  • methods: employing various efficient network architectures (CNNs, ViTs, hybrid models) and refining performance through data augmentation, fine-tuning pretrained models, and hyperparameter tuning
  • results: achieving an accuracy of 81.97% with the EfficientNet (V2 b2) architecture, promising enhanced user experiences across car-selling websites
    Abstract This project presents an automated solution for the efficient identification of car models and makes from images, aimed at streamlining the vehicle listing process on online car-selling platforms. Through a thorough exploration encompassing various efficient network architectures including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and hybrid models, we achieved a notable accuracy of 81.97% employing the EfficientNet (V2 b2) architecture. To refine performance, a combination of strategies, including data augmentation, fine-tuning pretrained models, and extensive hyperparameter tuning, were applied. The trained model offers the potential for automating information extraction, promising enhanced user experiences across car-selling websites.
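    A minimal fine-tuning sketch in the spirit of the approach above: a pretrained EfficientNetV2-B2 backbone adapted to car make/model classification. The timm model name, class count, and hyperparameters are illustrative assumptions, not values reported by the paper.

```python
import torch
import torch.nn as nn
import timm  # assumed dependency; provides pretrained EfficientNetV2 variants

NUM_CLASSES = 196  # assumed number of make/model classes
model = timm.create_model("tf_efficientnetv2_b2", pretrained=True, num_classes=NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised fine-tuning step on a batch of (augmented) car images."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)            # (B, NUM_CLASSES)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```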

LeTFuser: Light-weight End-to-end Transformer-Based Sensor Fusion for Autonomous Driving with Multi-Task Learning

  • paper_url: http://arxiv.org/abs/2310.13135
  • repo_url: https://github.com/pagand/e2etransfuser
  • paper_authors: Pedram Agand, Mohammad Mahdavian, Manolis Savva, Mo Chen
  • for: addressing end-to-end autonomous driving settings in which existing sensor fusion techniques for imitation learning cannot cope with scenes containing numerous dynamic agents
  • methods: LeTFuser, a transformer-based algorithm that fuses multiple RGB-D camera representations and performs perception and control simultaneously via multi-task learning; the perception module encodes the RGB-D observations and carries out semantic segmentation, semantic depth cloud mapping (SDC), and traffic light state recognition, while the control module decodes the encoded features together with supplementary data (a rough simulator of static and dynamic environments plus various measurements) to predict waypoints and vehicle controls (a PID-based sketch of the waypoint-following branch is given below)
  • results: evaluation on the CARLA simulator against recent models shows that LeTFuser delivers higher performance and robustness across scenarios ranging from normal to adversarial conditions
    Abstract In end-to-end autonomous driving, the utilization of existing sensor fusion techniques for imitation learning proves inadequate in challenging situations that involve numerous dynamic agents. To address this issue, we introduce LeTFuser, a transformer-based algorithm for fusing multiple RGB-D camera representations. To perform perception and control tasks simultaneously, we utilize multi-task learning. Our model comprises of two modules, the first being the perception module that is responsible for encoding the observation data obtained from the RGB-D cameras. It carries out tasks such as semantic segmentation, semantic depth cloud mapping (SDC), and traffic light state recognition. Our approach employs the Convolutional vision Transformer (CvT) \cite{wu2021cvt} to better extract and fuse features from multiple RGB cameras due to local and global feature extraction capability of convolution and transformer modules, respectively. Following this, the control module undertakes the decoding of the encoded characteristics together with supplementary data, comprising a rough simulator for static and dynamic environments, as well as various measurements, in order to anticipate the waypoints associated with a latent feature space. We use two methods to process these outputs and generate the vehicular controls (e.g. steering, throttle, and brake) levels. The first method uses a PID algorithm to follow the waypoints on the fly, whereas the second one directly predicts the control policy using the measurement features and environmental state. We evaluate the model and conduct a comparative analysis with recent models on the CARLA simulator using various scenarios, ranging from normal to adversarial conditions, to simulate real-world scenarios. Our code is available at \url{https://github.com/pagand/e2etransfuser/tree/cvpr-w} to facilitate future studies.
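    A minimal sketch of the waypoint-following control branch described above: a PID controller turns the predicted waypoints into steering and throttle levels. The gains, error definitions, and coordinate convention are illustrative assumptions, not the authors' tuned controller.

```python
import numpy as np

class PID:
    """Textbook PID controller; gains are placeholders to be tuned."""
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err: float, dt: float = 0.05) -> float:
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def follow_waypoint(waypoint_xy, speed, target_speed, steer_pid: PID, speed_pid: PID):
    """waypoint_xy: next predicted waypoint in the ego frame (x forward, y left)."""
    heading_err = np.arctan2(waypoint_xy[1], waypoint_xy[0])  # angle toward the waypoint
    steer = float(np.clip(steer_pid.step(heading_err), -1.0, 1.0))
    throttle = float(np.clip(speed_pid.step(target_speed - speed), 0.0, 1.0))
    return steer, throttle
```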

RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering

  • paper_url: http://arxiv.org/abs/2310.13120
  • repo_url: None
  • paper_authors: Yuduo Wang, Pedram Ghamisi
  • for: improving runtime and parameter efficiency for Remote Sensing (RS) Visual Question Answering (VQA), especially when adapting transformer-based multimodal models
  • methods: RSAdapter, which combines a Parallel Adapter with an additional linear transformation layer inserted after each fully connected (FC) layer; this improves adaptation to pretrained multimodal models, and at inference time the parameters of each linear transformation layer can be merged into the preceding FC layer to reduce inference cost (see the sketch below)
  • results: extensive experiments on three different RS-VQA datasets, achieving state-of-the-art results on all three
    Abstract In recent years, with the rapid advancement of transformer models, transformer-based multimodal architectures have found wide application in various downstream tasks, including but not limited to Image Captioning, Visual Question Answering (VQA), and Image-Text Generation. However, contemporary approaches to Remote Sensing (RS) VQA often involve resource-intensive techniques, such as full fine-tuning of large models or the extraction of image-text features from pre-trained multimodal models, followed by modality fusion using decoders. These approaches demand significant computational resources and time, and a considerable number of trainable parameters are introduced. To address these challenges, we introduce a novel method known as RSAdapter, which prioritizes runtime and parameter efficiency. RSAdapter comprises two key components: the Parallel Adapter and an additional linear transformation layer inserted after each fully connected (FC) layer within the Adapter. This approach not only improves adaptation to pre-trained multimodal models but also allows the parameters of the linear transformation layer to be integrated into the preceding FC layers during inference, reducing inference costs. To demonstrate the effectiveness of RSAdapter, we conduct an extensive series of experiments using three distinct RS-VQA datasets and achieve state-of-the-art results on all three datasets. The code for RSAdapter will be available online at https://github.com/Y-D-Wang/RSAdapter.
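    A minimal sketch of the inference-time re-parameterization described above: a linear transformation layer placed after an FC layer can be folded into that FC layer, so it adds no inference cost. Shapes and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

fc = nn.Linear(256, 128)     # an FC layer inside the adapter (sizes assumed)
extra = nn.Linear(128, 128)  # the additional linear transformation layer

def merge(fc: nn.Linear, extra: nn.Linear) -> nn.Linear:
    """Fold `extra` into `fc` so that merged(x) == extra(fc(x))."""
    merged = nn.Linear(fc.in_features, extra.out_features)
    with torch.no_grad():
        merged.weight.copy_(extra.weight @ fc.weight)            # W2 @ W1
        merged.bias.copy_(extra.weight @ fc.bias + extra.bias)   # W2 @ b1 + b2
    return merged

x = torch.randn(4, 256)
assert torch.allclose(extra(fc(x)), merge(fc, extra)(x), atol=1e-5)
```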

DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation

  • paper_url: http://arxiv.org/abs/2310.13119
  • repo_url: https://github.com/ybbbbt/DreamSpace
  • paper_authors: Bangbang Yang, Wenqi Dong, Lin Ma, Wenbo Hu, Xiao Liu, Zhaopeng Cui, Yuewen Ma
  • for: scene-level mesh texturing of 3D indoor scenes for XR/VR applications, where textures must hold up under immersive free-viewpoint rendering
  • methods: a novel indoor scene texturing framework that delivers text-driven texture generation with rich detail and authentic spatial coherence: a stylized 360° panoramic texture is first imagined from the scene's central viewpoint and then propagated to the remaining areas; a coarse-to-fine panoramic texture generation approach with dual texture alignment considers both geometry and texture cues, and a separated inpainting/imitating strategy handles occluded, cluttered, and tiny structural regions
  • results: experiments and an immersive VR application on real-world indoor scenes demonstrate high-quality generated textures and an engaging experience on VR headsets
    Abstract Diffusion-based methods have achieved prominent success in generating 2D media. However, accomplishing similar proficiencies for scene-level mesh texturing in 3D spatial applications, e.g., XR/VR, remains constrained, primarily due to the intricate nature of 3D geometry and the necessity for immersive free-viewpoint rendering. In this paper, we propose a novel indoor scene texturing framework, which delivers text-driven texture generation with enchanting details and authentic spatial coherence. The key insight is to first imagine a stylized 360{\deg} panoramic texture from the central viewpoint of the scene, and then propagate it to the rest areas with inpainting and imitating techniques. To ensure meaningful and aligned textures to the scene, we develop a novel coarse-to-fine panoramic texture generation approach with dual texture alignment, which both considers the geometry and texture cues of the captured scenes. To survive from cluttered geometries during texture propagation, we design a separated strategy, which conducts texture inpainting in confidential regions and then learns an implicit imitating network to synthesize textures in occluded and tiny structural areas. Extensive experiments and the immersive VR application on real-world indoor scenes demonstrate the high quality of the generated textures and the engaging experience on VR headsets. Project webpage: https://ybbbbt.com/publication/dreamspace

Streamlining Brain Tumor Classification with Custom Transfer Learning in MRI Images

  • paper_url: http://arxiv.org/abs/2310.13108
  • repo_url: None
  • paper_authors: Javed Hossain, Md. Touhidul Islam, Md. Taufiqul Haque Khan Tusar
  • for: classifying brain tumors from MRI images using custom transfer learning networks
  • methods: a custom, lightweight convolutional architecture that extends VGG-19 with additional hidden layers, reducing the complexity of the base architecture while improving computational efficiency (see the sketch below)
  • results: a classification accuracy of 96.42%
    Abstract Brain tumors are increasingly prevalent, characterized by the uncontrolled spread of aberrant tissues in the brain, with almost 700,000 new cases diagnosed globally each year. Magnetic Resonance Imaging (MRI) is commonly used for the diagnosis of brain tumors and accurate classification is a critical clinical procedure. In this study, we propose an efficient solution for classifying brain tumors from MRI images using custom transfer learning networks. While several researchers have employed various pre-trained architectures such as RESNET-50, ALEXNET, VGG-16, and VGG-19, these methods often suffer from high computational complexity. To address this issue, we present a custom and lightweight model using a Convolutional Neural Network-based pre-trained architecture with reduced complexity. Specifically, we employ the VGG-19 architecture with additional hidden layers, which reduces the complexity of the base architecture but improves computational efficiency. The objective is to achieve high classification accuracy using a novel approach. Finally, the result demonstrates a classification accuracy of 96.42%.
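    A minimal sketch of a VGG-19 backbone with a small custom head, in the spirit of the "additional hidden layers" described above. The frozen base, layer sizes, and four-class output are illustrative assumptions.

```python
import torch.nn as nn
from torchvision import models

backbone = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for p in backbone.features.parameters():
    p.requires_grad = False  # keep the pretrained convolutional base fixed

# Replace VGG-19's large classifier with lighter fully connected layers.
backbone.classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(256, 4),  # e.g. glioma / meningioma / pituitary / no tumor (assumed classes)
)
```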

PatchCURE: Improving Certifiable Robustness, Model Utility, and Computation Efficiency of Adversarial Patch Defenses

  • paper_url: http://arxiv.org/abs/2310.13076
  • repo_url: None
  • paper_authors: Chong Xiang, Tong Wu, Sihui Dai, Jonathan Petit, Suman Jana, Prateek Mittal
  • for: defending against adversarial patch attacks while balancing certifiable robustness, model utility, and computation efficiency
  • methods: PatchCURE, a defense framework that exposes tunable "knobs" for trading defense performance against computational and utility requirements, yielding a family of defenses
  • results: PatchCURE achieves state-of-the-art robustness and utility across all efficiency levels; its most robust instances match existing state-of-the-art defenses, while its most efficient instances run at nearly the same inference cost as undefended models
    Abstract State-of-the-art defenses against adversarial patch attacks can now achieve strong certifiable robustness with a marginal drop in model utility. However, this impressive performance typically comes at the cost of 10-100x more inference-time computation compared to undefended models -- the research community has witnessed an intense three-way trade-off between certifiable robustness, model utility, and computation efficiency. In this paper, we propose a defense framework named PatchCURE to approach this trade-off problem. PatchCURE provides sufficient "knobs" for tuning defense performance and allows us to build a family of defenses: the most robust PatchCURE instance can match the performance of any existing state-of-the-art defense (without efficiency considerations); the most efficient PatchCURE instance has similar inference efficiency as undefended models. Notably, PatchCURE achieves state-of-the-art robustness and utility performance across all different efficiency levels, e.g., 16-23% absolute clean accuracy and certified robust accuracy advantages over prior defenses when requiring computation efficiency to be close to undefended models. The family of PatchCURE defenses enables us to flexibly choose appropriate defenses to satisfy given computation and/or utility constraints in practice.

Using Logic Programming and Kernel-Grouping for Improving Interpretability of Convolutional Neural Networks

  • paper_url: http://arxiv.org/abs/2310.13073
  • repo_url: None
  • paper_authors: Parth Padalkar, Gopal Gupta
  • for: a neurosymbolic framework, NeSyFOLD-G, that makes the knowledge encoded in a CNN's last-layer kernels interpretable
  • methods: generating a symbolic rule-set from the CNN's last-layer kernels with the FOLD-SE-M algorithm; groups of similar kernels are first found in the CNN, each kernel group's output is binarized, and the resulting binarization table serves as input to FOLD-SE-M (a sketch of the grouping and binarization steps follows the entry)
  • results: grouping similar kernels significantly reduces the size of the rule-set generated by FOLD-SE-M, improving interpretability; the rule-set symbolically encapsulates the CNN's knowledge, and each predicate is mapped to a human-understandable concept using semantic segmentation masks
    Abstract Within the realm of deep learning, the interpretability of Convolutional Neural Networks (CNNs), particularly in the context of image classification tasks, remains a formidable challenge. To this end we present a neurosymbolic framework, NeSyFOLD-G that generates a symbolic rule-set using the last layer kernels of the CNN to make its underlying knowledge interpretable. What makes NeSyFOLD-G different from other similar frameworks is that we first find groups of similar kernels in the CNN (kernel-grouping) using the cosine-similarity between the feature maps generated by various kernels. Once such kernel groups are found, we binarize each kernel group's output in the CNN and use it to generate a binarization table which serves as input data to FOLD-SE-M which is a Rule Based Machine Learning (RBML) algorithm. FOLD-SE-M then generates a rule-set that can be used to make predictions. We present a novel kernel grouping algorithm and show that grouping similar kernels leads to a significant reduction in the size of the rule-set generated by FOLD-SE-M, consequently, improving the interpretability. This rule-set symbolically encapsulates the connectionist knowledge of the trained CNN. The rule-set can be viewed as a normal logic program wherein each predicate's truth value depends on a kernel group in the CNN. Each predicate in the rule-set is mapped to a concept using a few semantic segmentation masks of the images used for training, to make it human-understandable. The last layers of the CNN can then be replaced by this rule-set to obtain the NeSy-G model which can then be used for the image classification task. The goal directed ASP system s(CASP) can be used to obtain the justification of any prediction made using the NeSy-G model. We also propose a novel algorithm for labeling each predicate in the rule-set with the semantic concept(s) that its corresponding kernel group represents.
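    A minimal sketch of the kernel-grouping and binarization steps described above: last-layer kernels are grouped by cosine similarity of their feature maps, and each group's pooled activation is thresholded per image to build the table fed to FOLD-SE-M. The greedy grouping and thresholds are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def group_kernels(feat_maps: torch.Tensor, sim_thresh: float = 0.8):
    """feat_maps: (K, H*W) average feature map per last-layer kernel."""
    K = feat_maps.shape[0]
    sims = F.cosine_similarity(feat_maps.unsqueeze(1), feat_maps.unsqueeze(0), dim=-1)
    groups, assigned = [], torch.zeros(K, dtype=torch.bool)
    for k in range(K):
        if assigned[k]:
            continue
        members = torch.where((sims[k] >= sim_thresh) & ~assigned)[0]
        assigned[members] = True
        groups.append(members.tolist())
    return groups

def binarization_table(acts: torch.Tensor, groups, act_thresh: float = 0.5):
    """acts: (N, K) pooled kernel activations for N images -> (N, G) 0/1 table."""
    cols = [(acts[:, g].mean(dim=1) > act_thresh).long() for g in groups]
    return torch.stack(cols, dim=1)
```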

Putting the Object Back into Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2310.12982
  • repo_url: https://github.com/hkchengrex/Cutie
  • paper_authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing
  • for: improving the accuracy and efficiency of video object segmentation (VOS), especially on challenging datasets
  • methods: a query-based object transformer that performs object-level memory reading, putting the object representation from memory back into the video object segmentation result
  • results: on the challenging MOSE dataset, Cutie improves over XMem by 8.7 J&F at a similar running time and over DeAOT by 4.2 J&F while running three times as fast
    Abstract We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance in more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries for restructuring and interacting with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while running three times as fast. Code is available at: https://hkchengrex.github.io/Cutie

HumanTOMATO: Text-aligned Whole-body Motion Generation

  • paper_url: http://arxiv.org/abs/2310.12978
  • repo_url: https://github.com/IDEA-Research/HumanTOMATO
  • paper_authors: Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei Zhang, Heung-Yeung Shum
  • for: a novel text-driven whole-body motion generation task, which takes a textual description as input and aims to generate high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously
  • methods: HumanTOMATO, a Text-aligned whOle-body Motion generATiOn framework and, to the authors' knowledge, the first attempt at applicable holistic motion generation in this area; it combines (1) a Holistic Hierarchical VQ-VAE (H$^2$VQ) and a Hierarchical-GPT for fine-grained body and hand motion reconstruction and generation with two structured codebooks, and (2) a pretrained text-motion-alignment model that explicitly aligns generated motion with the input text
  • results: the model shows significant advantages in both the quality of generated motions and their alignment with text
    Abstract This work targets a novel text-driven whole-body motion generation task, which takes a given textual description as input and aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously. Previous works on text-driven motion generation tasks mainly have two limitations: they ignore the key role of fine-grained hand and face controlling in vivid whole-body motion generation, and lack a good alignment between text and motion. To address such limitations, we propose a Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which is the first attempt to our knowledge towards applicable holistic motion generation in this research area. To tackle this challenging task, our solution includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H$^2$VQ) and a Hierarchical-GPT for fine-grained body and hand motion reconstruction and generation with two structured codebooks; and (2) a pre-trained text-motion-alignment model to help generated motion align with the input textual description explicitly. Comprehensive experiments verify that our model has significant advantages in both the quality of generated motions and their alignment with text.

On the Hidden Waves of Image

  • paper_url: http://arxiv.org/abs/2310.12976
  • repo_url: https://github.com/rprokap/pset-9
  • paper_authors: Yinpeng Chen, Dongdong Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Zicheng Liu, Youzuo Lin
  • for: describing an intriguing phenomenon: images can be successfully reconstructed from a set of one-way wave equations with hidden, learnable speeds, where each image corresponds to a unique initial condition computed from the original image with a visual encoder (e.g., a convolutional neural network)
  • methods: reconstructing images via one-way wave equations whose speeds are latent and learnable; each image's initial condition is obtained from a visual encoder (an illustrative rendering of the equations follows the entry)
  • results: the phenomenon, termed "hidden waves", shows that each image's solution decomposes into a collection of special solutions of the same one-way wave equations that are first-order autoregressive with shared coefficient matrices, and that the product of these coefficient matrices is a diagonal matrix whose entries are the wave speeds; although the speeds and coefficient matrices are latent, they are learnable and shared across images, offering a new mathematical perspective on images
    Abstract In this paper, we introduce an intriguing phenomenon-the successful reconstruction of images using a set of one-way wave equations with hidden and learnable speeds. Each individual image corresponds to a solution with a unique initial condition, which can be computed from the original image using a visual encoder (e.g., a convolutional neural network). Furthermore, the solution for each image exhibits two noteworthy mathematical properties: (a) it can be decomposed into a collection of special solutions of the same one-way wave equations that are first-order autoregressive, with shared coefficient matrices for autoregression, and (b) the product of these coefficient matrices forms a diagonal matrix with the speeds of the wave equations as its diagonal elements. We term this phenomenon hidden waves, as it reveals that, although the speeds of the set of wave equations and autoregressive coefficient matrices are latent, they are both learnable and shared across images. This represents a mathematical invariance across images, providing a new mathematical perspective to understand images.
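    An illustrative rendering of the structure described in the abstract, with assumed symbols (encoder $E_\phi$, coefficient matrices $A_k$, speeds $\lambda_i$); the paper's exact formulation may differ.

```latex
\begin{align}
  \frac{\partial u_i}{\partial t} + \lambda_i \frac{\partial u_i}{\partial x} &= 0,
  \qquad u(x, 0) = E_\phi(\text{image}) \\
  u^{(t+1)} &= A_k\, u^{(t)} \qquad \text{(first-order autoregression, shared } A_k\text{)} \\
  A_1 A_2 \cdots A_m &= \operatorname{diag}(\lambda_1, \ldots, \lambda_n)
\end{align}
```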

FSD: Fast Self-Supervised Single RGB-D to Categorical 3D Objects

  • paper_url: http://arxiv.org/abs/2310.12974
  • repo_url: None
  • paper_authors: Mayank Lunayach, Sergey Zakharov, Dian Chen, Rares Ambrus, Zsolt Kira, Muhammad Zubair Irshad
  • for: 3D object recognition without relying on real-world 3D labeled data
  • methods: multi-stage training pipeline with synthetic and real-world data, combining 2D and 3D supervised losses and 2D self-supervised loss
  • results: outperforms existing self-supervised 6D pose and size estimation baselines on the NOCS test-set with a 16.4% absolute improvement in mAP for 6D pose estimation, running in near real-time at 5 Hz
    Abstract In this work, we address the challenging task of 3D object recognition without the reliance on real-world 3D labeled data. Our goal is to predict the 3D shape, size, and 6D pose of objects within a single RGB-D image, operating at the category level and eliminating the need for CAD models during inference. While existing self-supervised methods have made strides in this field, they often suffer from inefficiencies arising from non-end-to-end processing, reliance on separate models for different object categories, and slow surface extraction during the training of implicit reconstruction models; thus hindering both the speed and real-world applicability of the 3D recognition process. Our proposed method leverages a multi-stage training pipeline, designed to efficiently transfer synthetic performance to the real-world domain. This approach is achieved through a combination of 2D and 3D supervised losses during the synthetic domain training, followed by the incorporation of 2D supervised and 3D self-supervised losses on real-world data in two additional learning stages. By adopting this comprehensive strategy, our method successfully overcomes the aforementioned limitations and outperforms existing self-supervised 6D pose and size estimation baselines on the NOCS test-set with a 16.4% absolute improvement in mAP for 6D pose estimation while running in near real-time at 5 Hz.

Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey

  • paper_url: http://arxiv.org/abs/2310.13039
  • repo_url: None
  • paper_authors: Lijuan Zhou, Xiang Meng, Zhihuan Liu, Mengqi Wu, Zhimin Gao, Pichao Wang
  • for: surveying deep-learning-based human pose analysis, covering pose estimation, pose tracking, and action recognition
  • methods: a comprehensive review spanning single-person to multi-person pose estimation, 2D to 3D pose estimation, single images to video, the gradual mining of temporal context for pose tracking, and tracking to pose-based action recognition
  • results: the survey discusses the strengths and limitations of existing techniques, emphasizes methodologies for integrating the three tasks into a unified framework within video sequences, and outlines open challenges and directions for future research
    Abstract Human pose analysis has garnered significant attention within both the research community and practical applications, owing to its expanding array of uses, including gaming, video surveillance, sports performance analysis, and human-computer interactions, among others. The advent of deep learning has significantly improved the accuracy of pose capture, making pose-based applications increasingly practical. This paper presents a comprehensive survey of pose-based applications utilizing deep learning, encompassing pose estimation, pose tracking, and action recognition.Pose estimation involves the determination of human joint positions from images or image sequences. Pose tracking is an emerging research direction aimed at generating consistent human pose trajectories over time. Action recognition, on the other hand, targets the identification of action types using pose estimation or tracking data. These three tasks are intricately interconnected, with the latter often reliant on the former. In this survey, we comprehensively review related works, spanning from single-person pose estimation to multi-person pose estimation, from 2D pose estimation to 3D pose estimation, from single image to video, from mining temporal context gradually to pose tracking, and lastly from tracking to pose-based action recognition. As a survey centered on the application of deep learning to pose analysis, we explicitly discuss both the strengths and limitations of existing techniques. Notably, we emphasize methodologies for integrating these three tasks into a unified framework within video sequences. Additionally, we explore the challenges involved and outline potential directions for future research.

Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding

  • paper_url: http://arxiv.org/abs/2310.12970
  • repo_url: https://github.com/zhejz/hptr
  • paper_authors: Zhejun Zhang, Alexander Liniger, Christos Sakaridis, Fisher Yu, Luc Van Gool
  • for: real-time, scalable motion prediction for autonomous driving, where the future trajectories of surrounding traffic participants must be predicted on-board
  • methods: K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism that lets Transformers use pairwise-relative representations, and the Heterogeneous Polyline Transformer with Relative pose encoding (HPTR), a hierarchical framework enabling asynchronous token updates during online inference (a simplified sketch of KNN attention follows the entry)
  • results: experiments on the Waymo and Argoverse-2 datasets show that HPTR is as efficient as scene-centric methods while performing on par with state-of-the-art agent-centric methods, achieving superior performance among end-to-end methods that do not rely on expensive post-processing or model ensembling
    Abstract The real-world deployment of an autonomous driving system requires its components to run on-board and in real-time, including the motion prediction module that predicts the future trajectories of surrounding traffic participants. Existing agent-centric methods have demonstrated outstanding performance on public benchmarks. However, they suffer from high computational overhead and poor scalability as the number of agents to be predicted increases. To address this problem, we introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers. Then, based on KNARPE we present the Heterogeneous Polyline Transformer with Relative pose encoding (HPTR), a hierarchical framework enabling asynchronous token update during the online inference. By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods. Experiments on Waymo and Argoverse-2 datasets show that HPTR achieves superior performance among end-to-end methods that do not apply expensive post-processing or model ensembling. The code is available at https://github.com/zhejz/HPTR.
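    A simplified sketch of K-nearest-neighbour attention with a relative-pose term, in the spirit of KNARPE as described above. The distance-based bias and single-head formulation are placeholders, not the authors' exact encoding.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, pos, K: int = 16):
    """q, k, v: (N, D) token features; pos: (N, 2) token positions."""
    N, D = q.shape
    dist = torch.cdist(pos, pos)               # (N, N) pairwise distances
    idx = dist.topk(K, largest=False).indices  # indices of the K nearest neighbours
    k_n, v_n = k[idx], v[idx]                  # (N, K, D) gathered neighbours
    rel = pos[idx] - pos.unsqueeze(1)          # (N, K, 2) relative poses
    rel_bias = rel.norm(dim=-1, keepdim=True)  # placeholder relative-pose term
    scores = (q.unsqueeze(1) * k_n).sum(-1, keepdim=True) / D ** 0.5 - rel_bias
    w = F.softmax(scores, dim=1)               # attend only over the K neighbours
    return (w * v_n).sum(dim=1)                # (N, D) updated tokens
```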

3D-GPT: Procedural 3D Modeling with Large Language Models

  • paper_url: http://arxiv.org/abs/2310.12945
  • repo_url: None
  • paper_authors: Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould
  • for: automating procedural 3D content creation
  • methods: 3D-GPT, which uses large language models (LLMs) for instruction-driven 3D modeling by decomposing complex procedural modeling tasks into accessible segments and delegating each to a suitable agent (task dispatch, conceptualization, and modeling agents)
  • results: the framework reliably extracts parameter values from enriched text for effortless integration with 3D software, collaborates effectively with human designers, and integrates seamlessly with Blender, unlocking expanded manipulation possibilities
    Abstract In the pursuit of efficient automated content creation, procedural generation, leveraging modifiable parameters and rule-based systems, emerges as a promising approach. Nonetheless, it could be a demanding endeavor, given its intricate nature necessitating a deep understanding of rules, algorithms, and parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task. 3D-GPT integrates three core agents: the task dispatch agent, the conceptualization agent, and the modeling agent. They collaboratively achieve two objectives. First, it enhances concise initial scene descriptions, evolving them into detailed forms while dynamically adapting the text based on subsequent instructions. Second, it integrates procedural generation, extracting parameter values from enriched text to effortlessly interface with 3D software for asset creation. Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results but also collaborates effectively with human designers. Furthermore, it seamlessly integrates with Blender, unlocking expanded manipulation possibilities. Our work highlights the potential of LLMs in 3D modeling, offering a basic framework for future advancements in scene generation and animation.

Unsupervised Object Localization in the Era of Self-Supervised ViTs: A Survey

  • paper_url: http://arxiv.org/abs/2310.12904
  • repo_url: https://github.com/valeoai/awesome-unsupervised-object-localization
  • paper_authors: Oriane Siméoni, Éloi Zablocki, Spyros Gidaris, Gilles Puy, Patrick Pérez
  • for: surveying unsupervised object localization methods that discover objects in images without any manual annotation of what objects are present or where they are
  • methods: approaches that exploit self-supervised pre-trained features (notably from self-supervised ViTs) to perform class-agnostic object discovery and localization
  • results: these methods can find objects in images and videos without requiring any manual annotation; links to the discussed methods are gathered in the accompanying repository
    Abstract The recent enthusiasm for open-world vision systems show the high interest of the community to perform perception tasks outside of the closed-vocabulary benchmark setups which have been so popular until now. Being able to discover objects in images/videos without knowing in advance what objects populate the dataset is an exciting prospect. But how to find objects without knowing anything about them? Recent works show that it is possible to perform class-agnostic unsupervised object localization by exploiting self-supervised pre-trained features. We propose here a survey of unsupervised object localization methods that discover objects in images without requiring any manual annotation in the era of self-supervised ViTs. We gather links of discussed methods in the repository https://github.com/valeoai/Awesome-Unsupervised-Object-Localization.

Perceptual Assessment and Optimization of High Dynamic Range Image Rendering

  • paper_url: http://arxiv.org/abs/2310.12877
  • repo_url: None
  • paper_authors: Peibei Cao, Rafal K. Mantiuk, Kede Ma
  • for: This paper focuses on developing a family of high dynamic range (HDR) image quality assessment (IQA) models that can accurately evaluate the quality of HDR images.
  • methods: The proposed HDR IQA models use a simple inverse display model to decompose an HDR image into a set of low dynamic range (LDR) images with different exposures, which are then assessed by existing LDR quality models. The local quality scores of each exposure are aggregated using a well-exposedness measure, and the overall quality score is obtained by weighting the exposures.
  • results: The proposed HDR IQA models outperform existing IQA methods, including the HDR-VDP family, in evaluating the quality of HDR images. Additionally, the models demonstrate strengths in perceptual optimization of HDR novel view synthesis.
    Abstract High dynamic range (HDR) imaging has gained increasing popularity for its ability to faithfully reproduce the luminance levels in natural scenes. Accordingly, HDR image quality assessment (IQA) is crucial but has been superficially treated. The majority of existing IQA models are developed for and calibrated against low dynamic range (LDR) images, which have been shown to be poorly correlated with human perception of HDR image quality. In this work, we propose a family of HDR IQA models by transferring the recent advances in LDR IQA. The key step in our approach is to specify a simple inverse display model that decomposes an HDR image to a set of LDR images with different exposures, which will be assessed by existing LDR quality models. The local quality scores of each exposure are then aggregated with the help of a simple well-exposedness measure into a global quality score for each exposure, which will be further weighted across exposures to obtain the overall quality score. When assessing LDR images, the proposed HDR quality models reduce gracefully to the original LDR ones with the same performance. Experiments on four human-rated HDR image datasets demonstrate that our HDR quality models are consistently better than existing IQA methods, including the HDR-VDP family. Moreover, we demonstrate their strengths in perceptual optimization of HDR novel view synthesis.
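    A minimal sketch of the pipeline described above: decompose an HDR image into simulated LDR exposures with a simple inverse display model, score each exposure with an existing LDR quality model, and aggregate with a well-exposedness weight. The gamma model, weight form, and `ldr_iqa` callable are assumptions.

```python
import numpy as np

def simulated_exposures(hdr, stops=(-2.0, 0.0, 2.0), gamma=2.2):
    """hdr: float array of linear luminance; returns LDR renderings in [0, 1]."""
    return [np.clip((hdr * 2.0 ** s) ** (1.0 / gamma), 0.0, 1.0) for s in stops]

def well_exposedness(ldr, sigma=0.2):
    """Higher when pixels sit near mid-grey, i.e. neither crushed nor clipped."""
    return float(np.exp(-((ldr - 0.5) ** 2) / (2 * sigma ** 2)).mean())

def hdr_quality(hdr_ref, hdr_dist, ldr_iqa):
    """ldr_iqa(ref, dist) -> scalar score from any existing LDR quality model."""
    refs, dists = simulated_exposures(hdr_ref), simulated_exposures(hdr_dist)
    scores = np.array([ldr_iqa(r, d) for r, d in zip(refs, dists)])
    weights = np.array([well_exposedness(r) for r in refs])
    return float((weights * scores).sum() / weights.sum())
```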

EMIT-Diff: Enhancing Medical Image Segmentation via Text-Guided Diffusion Model

  • paper_url: http://arxiv.org/abs/2310.12868
  • repo_url: None
  • paper_authors: Zheyuan Zhang, Lanhong Yao, Bin Wang, Debesh Jha, Elif Keles, Alpay Medetalibeyoglu, Ulas Bagci
  • for: addressing the scarcity of high-quality labeled data for medical deep learning models by proposing EMIT-Diff, a novel approach to medical image synthesis
  • methods: leveraging recent diffusion probabilistic models to generate realistic and diverse synthetic medical images that preserve the essential characteristics of the originals; edge information guides the synthesis process, and the synthesized samples are constrained to adhere to medically relevant constraints
  • results: significant improvements in medical image segmentation on multiple datasets, including Ultrasound breast, CT spleen, and MRI prostate, demonstrating the effectiveness of a first-ever text-guided diffusion model for general medical image segmentation and the feasibility of using synthetic data for these tasks
    Abstract Large-scale, big-variant, and high-quality data are crucial for developing robust and successful deep-learning models for medical applications since they potentially enable better generalization performance and avoid overfitting. However, the scarcity of high-quality labeled data always presents significant challenges. This paper proposes a novel approach to address this challenge by developing controllable diffusion models for medical image synthesis, called EMIT-Diff. We leverage recent diffusion probabilistic models to generate realistic and diverse synthetic medical image data that preserve the essential characteristics of the original medical images by incorporating edge information of objects to guide the synthesis process. In our approach, we ensure that the synthesized samples adhere to medically relevant constraints and preserve the underlying structure of imaging data. Due to the random sampling process by the diffusion model, we can generate an arbitrary number of synthetic images with diverse appearances. To validate the effectiveness of our proposed method, we conduct an extensive set of medical image segmentation experiments on multiple datasets, including Ultrasound breast (+13.87%), CT spleen (+0.38%), and MRI prostate (+7.78%), achieving significant improvements over the baseline segmentation methods. For the first time, to our best knowledge, the promising results demonstrate the effectiveness of our EMIT-Diff for medical image segmentation tasks and show the feasibility of introducing a first-ever text-guided diffusion model for general medical image segmentation tasks. With carefully designed ablation experiments, we investigate the influence of various data augmentation ratios, hyper-parameter settings, patch size for generating random merging mask settings, and combined influence with different network architectures.

Neural Degradation Representation Learning for All-In-One Image Restoration

  • paper_url: http://arxiv.org/abs/2310.12848
  • repo_url: https://github.com/mdyao/NDR-Restore
  • paper_authors: Mingde Yao, Ruikang Xu, Yuanshen Guan, Jie Huang, Zhiwei Xiong
  • for: This paper aims to propose an all-in-one image restoration network that can handle multiple degradations, such as noise, haze, rain, and downsampling.
  • methods: The proposed method learns a neural degradation representation (NDR) that captures the underlying characteristics of various degradations, and uses a degradation query module and a degradation injection module to recognize and utilize the specific degradation based on NDR. The method also employs a bidirectional optimization strategy to effectively drive NDR to learn the degradation representation.
  • results: The proposed method is demonstrated to be effective and generalizable on representative types of degradations, including noise, haze, rain, and downsampling, through comprehensive experiments.
    Abstract Existing methods have demonstrated effective performance on a single degradation type. In practical applications, however, the degradation is often unknown, and the mismatch between the model and the degradation will result in a severe performance drop. In this paper, we propose an all-in-one image restoration network that tackles multiple degradations. Due to the heterogeneous nature of different types of degradations, it is difficult to process multiple degradations in a single network. To this end, we propose to learn a neural degradation representation (NDR) that captures the underlying characteristics of various degradations. The learned NDR decomposes different types of degradations adaptively, similar to a neural dictionary that represents basic degradation components. Subsequently, we develop a degradation query module and a degradation injection module to effectively recognize and utilize the specific degradation based on NDR, enabling the all-in-one restoration ability for multiple degradations. Moreover, we propose a bidirectional optimization strategy to effectively drive NDR to learn the degradation representation by optimizing the degradation and restoration processes alternately. Comprehensive experiments on representative types of degradations (including noise, haze, rain, and downsampling) demonstrate the effectiveness and generalization capability of our method.
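    A minimal sketch of a learnable degradation dictionary with a soft query over its components, in the spirit of the neural degradation representation described above. Dimensions and the dot-product query are assumptions, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class DegradationQuery(nn.Module):
    """Soft lookup into a small dictionary of learned degradation components."""
    def __init__(self, num_components: int = 8, dim: int = 64):
        super().__init__()
        self.ndr = nn.Parameter(torch.randn(num_components, dim))  # degradation "dictionary"

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        """feat: (B, dim) pooled features of the degraded input -> (B, dim) degradation code."""
        weights = torch.softmax(feat @ self.ndr.t(), dim=-1)  # (B, num_components)
        return weights @ self.ndr                             # adaptive mix of components
```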

OODRobustBench: benchmarking and analyzing adversarial robustness under distribution shift

  • paper_url: http://arxiv.org/abs/2310.12793
  • repo_url: None
  • paper_authors: Lin Li, Yifei Wang, Chawin Sitawarin, Michael Spratling
  • for: assessing how adversarial robustness generalizes under input distribution shifts (out-of-distribution testing), rather than only under the in-distribution testing used by existing robust training work
  • methods: OODRobustBench, a benchmark that comprehensively evaluates OOD adversarial robustness using 23 dataset-wise (naturalistic) shifts and 6 threat-wise (unforeseen adversarial) shifts
  • results: a large-scale analysis of 706 robust models with 60.7K adversarial evaluations shows that adversarial robustness suffers from a severe OOD generalization issue, that ID robustness correlates with OOD robustness in a positive linear way (enabling prediction of OOD robustness from ID robustness, as sketched below), and that extra data, data augmentation, advanced model architectures, and particular regularization approaches can improve OOD robustness
    Abstract Existing works have made great progress in improving adversarial robustness, but typically test their method only on data from the same distribution as the training data, i.e. in-distribution (ID) testing. As a result, it is unclear how such robustness generalizes under input distribution shifts, i.e. out-of-distribution (OOD) testing. This is a concerning omission as such distribution shifts are unavoidable when methods are deployed in the wild. To address this issue we propose a benchmark named OODRobustBench to comprehensively assess OOD adversarial robustness using 23 dataset-wise shifts (i.e. naturalistic shifts in input distribution) and 6 threat-wise shifts (i.e., unforeseen adversarial threat models). OODRobustBench is used to assess 706 robust models using 60.7K adversarial evaluations. This large-scale analysis shows that: 1) adversarial robustness suffers from a severe OOD generalization issue; 2) ID robustness correlates strongly with OOD robustness, in a positive linear way, under many distribution shifts. The latter enables the prediction of OOD robustness from ID robustness. Based on this, we are able to predict the upper limit of OOD robustness for existing robust training schemes. The results suggest that achieving OOD robustness requires designing novel methods beyond the conventional ones. Last, we discover that extra data, data augmentation, advanced model architectures and particular regularization approaches can improve OOD robustness. Noticeably, the discovered training schemes, compared to the baseline, exhibit dramatically higher robustness under threat shift while keeping high ID robustness, demonstrating new promising solutions for robustness against both multi-attack and unforeseen attacks.
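    A minimal sketch of exploiting the reported positive linear ID-to-OOD relationship: fit a line over evaluated models and use it to predict OOD robustness from ID robustness. The fitted coefficients are placeholders, not values reported by the paper.

```python
import numpy as np

def fit_id_to_ood(id_acc: np.ndarray, ood_acc: np.ndarray):
    """Least-squares fit ood ≈ a * id + b over a set of evaluated robust models."""
    a, b = np.polyfit(id_acc, ood_acc, deg=1)
    return a, b

def predict_ood(id_acc: float, a: float, b: float) -> float:
    """Predicted OOD robust accuracy for a model with the given ID robust accuracy."""
    return a * id_acc + b
```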

Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2310.12790
  • repo_url: https://github.com/mala-lab/ahl
  • paper_authors: Jiawen Zhu, Choubo Ding, Yu Tian, Guansong Pang
  • for: improving open-set supervised anomaly detection (OSAD), i.e., using a few anomaly examples seen during training to detect unseen anomaly classes while still identifying the seen ones
  • methods: Anomaly Heterogeneity Learning (AHL), which simulates a diverse set of heterogeneous (seen and unseen) anomaly distributions from the limited anomaly examples and uses them to learn a unified heterogeneous abnormality model; AHL is a generic framework into which existing OSAD models can be plugged
  • results: experiments on nine real-world anomaly detection datasets show that AHL 1) substantially enhances different state-of-the-art (SOTA) OSAD models in detecting both seen and unseen anomalies, achieving new SOTA performance on a large set of datasets, and 2) effectively generalizes to unseen anomalies in new target domains
    Abstract Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly detection area - aims at utilizing a few samples of anomaly classes seen during training to detect unseen anomalies (i.e., samples from open-set anomaly classes), while effectively identifying the seen anomalies. Benefiting from the prior knowledge illustrated by the seen anomalies, current OSAD methods can often largely reduce false positive errors. However, these methods treat the anomaly examples as from a homogeneous distribution, rendering them less effective in generalizing to unseen anomalies that can be drawn from any distribution. In this paper, we propose to learn heterogeneous anomaly distributions using the limited anomaly examples to address this issue. To this end, we introduce a novel approach, namely Anomaly Heterogeneity Learning (AHL), that simulates a diverse set of heterogeneous (seen and unseen) anomaly distributions and then utilizes them to learn a unified heterogeneous abnormality model. Further, AHL is a generic framework that existing OSAD models can plug and play for enhancing their abnormality modeling. Extensive experiments on nine real-world anomaly detection datasets show that AHL can 1) substantially enhance different state-of-the-art (SOTA) OSAD models in detecting both seen and unseen anomalies, achieving new SOTA performance on a large set of datasets, and 2) effectively generalize to unseen anomalies in new target domains.

DT/MARS-CycleGAN: Improved Object Detection for MARS Phenotyping Robot

  • paper_url: http://arxiv.org/abs/2310.12787
  • repo_url: None
  • paper_authors: David Liu, Zhengkun Li, Zihao Wu, Changying Li
  • for: improving the crop object detection of the Modular Agricultural Robotic System (MARS) against complex and variable backgrounds
  • methods: a Digital-Twin (DT) MARS-CycleGAN model for image augmentation that adds a DT-MARS loss to the CycleGAN cycle-consistency losses, penalizing inconsistency between real crop images captured by MARS and images synthesized by the digital twin; the realistic synthesized images are then used to fine-tune object detectors such as YOLOv8 (a sketch of the combined loss follows the entry)
  • results: compared with a conventional CycleGAN baseline, the DT/MARS-CycleGAN framework adapts better to complex and variable crop appearances and significantly boosts MARS's crop object/row detection performance
    Abstract Robotic crop phenotyping has emerged as a key technology to assess crops' morphological and physiological traits at scale. These phenotypical measurements are essential for developing new crop varieties with the aim of increasing productivity and dealing with environmental challenges such as climate change. However, developing and deploying crop phenotyping robots face many challenges such as complex and variable crop shapes that complicate robotic object detection, dynamic and unstructured environments that baffle robotic control, and real-time computing and managing big data that challenge robotic hardware/software. This work specifically tackles the first challenge by proposing a novel Digital-Twin(DT)MARS-CycleGAN model for image augmentation to improve our Modular Agricultural Robotic System (MARS)'s crop object detection from complex and variable backgrounds. Our core idea is that in addition to the cycle consistency losses in the CycleGAN model, we designed and enforced a new DT-MARS loss in the deep learning model to penalize the inconsistency between real crop images captured by MARS and synthesized images sensed by DT MARS. Therefore, the generated synthesized crop images closely mimic real images in terms of realism, and they are employed to fine-tune object detectors such as YOLOv8. Extensive experiments demonstrated that our new DT/MARS-CycleGAN framework significantly boosts our MARS' crop object/row detector's performance, contributing to the field of robotic crop phenotyping.
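    A rough sketch of how an extra consistency term could be added to a CycleGAN generator objective, mirroring the DT-MARS loss described above. The LSGAN adversarial term, the L1 form of the DT-MARS term, the loss weights, and the direct pairing of synthesized and real images are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def generator_loss(G, F_inv, D_real, sim_batch, real_batch,
                   lam_cyc: float = 10.0, lam_dt: float = 5.0):
    """G: sim -> real-like, F_inv: real-like -> sim, D_real: discriminator on real images."""
    fake_real = G(sim_batch)
    pred = D_real(fake_real)
    adv = F.mse_loss(pred, torch.ones_like(pred))   # LSGAN adversarial term
    cyc = F.l1_loss(F_inv(fake_real), sim_batch)    # cycle-consistency term
    dt_mars = F.l1_loss(fake_real, real_batch)      # penalize sim/real inconsistency
    return adv + lam_cyc * cyc + lam_dt * dt_mars
```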

Mixing Histopathology Prototypes into Robust Slide-Level Representations for Cancer Subtyping

  • paper_url: http://arxiv.org/abs/2310.12769
  • repo_url: https://github.com/butkej/protomixer
  • paper_authors: Joshua Butke, Noriaki Hashimoto, Ichiro Takeuchi, Hiroaki Miyoshi, Koichi Ohshima, Jun Sakuma
  • for: whole-slide image analysis for computational pathology, with the goal of an efficient and effective method for processing large-scale datasets
  • methods: The proposed method uses a combination of feature embedding and clustering to preprocess the full whole-slide image into a reduced prototype representation, which is then fed into a suitable MLP-Mixer architecture.
  • results: The proposed method achieves comparable performance to current state-of-the-art methods while achieving lower training costs in terms of computational time and memory load, as demonstrated through experiments on two public benchmarks and one in-house malignant lymphoma dataset.
    Abstract Whole-slide image analysis via the means of computational pathology often relies on processing tessellated gigapixel images with only slide-level labels available. Applying multiple instance learning-based methods or transformer models is computationally expensive as, for each image, all instances have to be processed simultaneously. The MLP-Mixer is an under-explored alternative model to common vision transformers, especially for large-scale datasets. Due to the lack of a self-attention mechanism, they have linear computational complexity to the number of input patches but achieve comparable performance on natural image datasets. We propose a combination of feature embedding and clustering to preprocess the full whole-slide image into a reduced prototype representation which can then serve as input to a suitable MLP-Mixer architecture. Our experiments on two public benchmarks and one inhouse malignant lymphoma dataset show comparable performance to current state-of-the-art methods, while achieving lower training costs in terms of computational time and memory load. Code is publicly available at https://github.com/butkej/ProtoMixer.
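    A minimal sketch of the preprocessing described above: embed all patches of a whole-slide image, cluster the embeddings, and keep the centroids as a fixed-size prototype representation for a downstream MLP-Mixer. The prototype count and the use of k-means are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def slide_to_prototypes(patch_embeddings: np.ndarray, n_prototypes: int = 64) -> np.ndarray:
    """patch_embeddings: (num_patches, dim) features from a pretrained encoder."""
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0)
    km.fit(patch_embeddings)
    return km.cluster_centers_  # (n_prototypes, dim) slide-level input for the Mixer
```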

Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.12755
  • repo_url: https://github.com/ydhonghit/plainseg
  • paper_authors: Yuanduo Hong, Jue Wang, Weichao Sun, Huihui Pan
  • for: to provide simple and efficient baselines for practical semantic segmentation with plain ViT models.
  • methods: the model is built from only three 3$\times$3 convolution layers in addition to the transformer layers (encoder or decoder); extensive experiments reveal two underlying principles for high-performance segmentation: (1) high-resolution features are crucial even with simple upsampling techniques, and (2) the slim transformer decoder requires a much larger learning rate than the wide one.
  • results: high performance and efficiency on four popular benchmarks; the models can also serve as tools for assessing the transfer ability of base models in semantic segmentation.
    Abstract In the wake of Masked Image Modeling (MIM), a diverse range of plain, non-hierarchical Vision Transformer (ViT) models have been pre-trained with extensive datasets, offering new paradigms and significant potential for semantic segmentation. Current state-of-the-art systems incorporate numerous inductive biases and employ cumbersome decoders. Building upon the original motivations of plain ViTs, which are simplicity and generality, we explore high-performance `minimalist' systems to this end. Our primary purpose is to provide simple and efficient baselines for practical semantic segmentation with plain ViTs. Specifically, we first explore the feasibility and methodology for achieving high-performance semantic segmentation using the last feature map. As a result, we introduce the PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers (either encoder or decoder). In this process, we offer insights into two underlying principles: (i) high-resolution features are crucial to high performance in spite of employing simple up-sampling techniques and (ii) the slim transformer decoder requires a much larger learning rate than the wide transformer decoder. On this basis, we further present the PlainSeg-Hier, which allows for the utilization of hierarchical features. Extensive experiments on four popular benchmarks demonstrate the high performance and efficiency of our methods. They can also serve as powerful tools for assessing the transfer ability of base models in semantic segmentation. Code is available at \url{https://github.com/ydhongHIT/PlainSeg}.
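A minimal sketch of a "plain" head in the spirit described: only 3$\times$3 convolutions on top of the last ViT feature map, with simple bilinear upsampling to high resolution. The layer count, channel width, and upsampling placement are assumptions, not the released PlainSeg code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalPlainHead(nn.Module):
    """Segmentation head with only 3x3 convolutions on the last ViT feature map."""

    def __init__(self, in_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_dim, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(hidden, num_classes, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, out_size) -> torch.Tensor:
        # feat: (B, C, h, w) last feature map of a plain ViT reshaped to 2D.
        # Bilinear upsampling before the later convolutions reflects the principle
        # that high-resolution features matter even with simple upsampling.
        x = F.relu(self.conv1(feat))
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        x = F.relu(self.conv2(x))
        return self.conv3(x)

head = MinimalPlainHead(in_dim=768, num_classes=19)
logits = head(torch.randn(1, 768, 32, 32), out_size=(512, 512))
print(logits.shape)  # torch.Size([1, 19, 512, 512])
```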

ExtSwap: Leveraging Extended Latent Mapper for Generating High Quality Face Swapping

  • paper_url: http://arxiv.org/abs/2310.12736
  • repo_url: https://github.com/aravinda27/extswap
  • paper_authors: Aravinda Reddy PN, K. Sreenivasa Rao, Raghavendra Ramachandra, Pabitra mitra
  • for: a new face swapping method built on the progressively growing structure of a pretrained StyleGAN.
  • methods: previous approaches use different encoder-decoder structures and embedding integration networks to produce high-quality results, but their quality suffers from entangled representations; the proposed method disentangles semantics by deriving identity and attribute features separately and mapping their concatenation into the extended latent space.
  • results: extensive experiments show that the proposed method successfully disentangles identity and attribute features and outperforms many state-of-the-art face swapping methods, both qualitatively and quantitatively.
    Abstract We present a novel face swapping method using the progressively growing structure of a pre-trained StyleGAN. Previous methods use different encoder-decoder structures and embedding integration networks to produce high-quality results, but their quality suffers from entangled representations. We disentangle semantics by deriving identity and attribute features separately. By learning to map the concatenated features into the extended latent space, we leverage the state-of-the-art quality and its rich semantic extended latent space. Extensive experiments suggest that the proposed method successfully disentangles identity and attribute features and outperforms many state-of-the-art face swapping methods, both qualitatively and quantitatively.
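A minimal sketch of the mapping step described in the abstract: separately derived identity and attribute features are concatenated and mapped into StyleGAN's extended latent space (commonly W+, i.e., one style vector per synthesis layer). The MLP mapper architecture and its dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ExtendedLatentMapper(nn.Module):
    """Map concatenated identity + attribute features into W+ (n_styles x style_dim)."""

    def __init__(self, id_dim=512, attr_dim=512, n_styles=18, style_dim=512):
        super().__init__()
        self.n_styles, self.style_dim = n_styles, style_dim
        self.mlp = nn.Sequential(
            nn.Linear(id_dim + attr_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, n_styles * style_dim),
        )

    def forward(self, id_feat, attr_feat):
        # Concatenate the disentangled features and predict one code per style layer.
        w_plus = self.mlp(torch.cat([id_feat, attr_feat], dim=-1))
        return w_plus.view(-1, self.n_styles, self.style_dim)  # feed to StyleGAN synthesis

mapper = ExtendedLatentMapper()
w_plus = mapper(torch.randn(2, 512), torch.randn(2, 512))
print(w_plus.shape)  # torch.Size([2, 18, 512])
```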

Multiscale Motion-Aware and Spatial-Temporal-Channel Contextual Coding Network for Learned Video Compression

  • paper_url: http://arxiv.org/abs/2310.12733
  • repo_url: None
  • paper_authors: Yiming Wang, Qian Huang, Bin Tang, Huashan Sun, Xing Li
  • for: improving the efficiency and quality of learned video compression.
  • methods: a motion-aware and spatial-temporal-channel contextual coding based video compression network (MASTC-VC) that uses multiscale motion prediction information, in a coarse-to-fine way, to estimate consistent motion vectors, and a spatial-temporal-channel contextual module that exploits correlations in the latent representation to reduce the bit rate.
  • results: extensive experiments on three public benchmark datasets show that MASTC-VC outperforms previous methods, with average 10.15% BD-rate savings against H.265/HEVC under the PSNR metric and average 23.93% BD-rate savings against H.266/VVC under the MS-SSIM metric.
    Abstract Recently, learned video compression has achieved exciting performance. Following the traditional hybrid prediction coding framework, most learned methods generally adopt the motion estimation motion compensation (MEMC) method to remove inter-frame redundancy. However, inaccurate motion vectors (MVs) usually lead to distortion of the reconstructed frame. In addition, most approaches ignore the spatial and channel redundancy. To solve the above problems, we propose a motion-aware and spatial-temporal-channel contextual coding based video compression network (MASTC-VC), which learns the latent representation and uses variational autoencoders (VAEs) to capture the characteristics of intra-frame pixels and inter-frame motion. Specifically, we design a multiscale motion-aware module (MS-MAM) to estimate spatial-temporal-channel consistent motion vectors by utilizing the multiscale motion prediction information in a coarse-to-fine way. On top of it, we further propose a spatial-temporal-channel contextual module (STCCM), which explores the correlation of the latent representation to reduce bit consumption from the spatial, temporal and channel aspects respectively. Comprehensive experiments show that our proposed MASTC-VC is superior to previous state-of-the-art (SOTA) methods on three public benchmark datasets. More specifically, our method brings average 10.15\% BD-rate savings against H.265/HEVC (HM-16.20) in the PSNR metric and average 23.93\% BD-rate savings against H.266/VVC (VTM-13.2) in the MS-SSIM metric.

Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding

  • paper_url: http://arxiv.org/abs/2310.12724
  • repo_url: None
  • paper_authors: Yuanxing Xu, Yuting Wei, Bin Wu
  • for: This paper aims to address the problem of holistically analyzing long videos and extracting useful knowledge to solve different types of queries.
  • methods: The proposed method uses an image-language pretrained model to select frames pertinent to queries, obviating the need for a complete movie-level knowledge graph.
  • results: The approach achieved first and fourth positions for two groups of movie-level queries, demonstrating its effectiveness and robustness.
    Abstract The surge in video and social media content underscores the need for a deeper understanding of multimedia data. Most of the existing mature video understanding techniques perform well with short formats and content that requires only shallow understanding, but do not perform well with long format videos that require deep understanding and reasoning. The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics to address the problem of holistically analyzing long videos and extracting useful knowledge to solve different types of queries. This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an image-language pretrained model. This model adeptly selects frames pertinent to queries, obviating the need for a complete movie-level knowledge graph. Our approach achieved first and fourth positions for two groups of movie-level queries. Sufficient experiments and final rankings demonstrate its effectiveness and robustness.
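A minimal sketch of query-aware frame selection with an off-the-shelf image-language model: each frame is scored against the query text and the top-k frames are kept for downstream relation discrimination. The Hugging Face CLIP checkpoint and the top-k rule are assumptions about tooling, not the authors' exact pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames: list, query: str, top_k: int = 8) -> list:
    """Return indices of the frames most relevant to a natural-language query.

    frames: list of PIL.Image.Image objects sampled from a long video.
    """
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_image.squeeze(-1)   # one image-text similarity per frame
    return torch.topk(scores, k=min(top_k, len(frames))).indices.tolist()
```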

Generating Robust Adversarial Examples against Online Social Networks (OSNs)

  • paper_url: http://arxiv.org/abs/2310.12708
  • repo_url: https://github.com/csjunjun/robustosnattack
  • paper_authors: Jun Liu, Jiantao Zhou, Haiwei Wu, Weiwei Sun, Jinyu Tian
  • for: designing robust adversarial examples (AEs) that retain their attack capability after being transmitted over Online Social Networks (OSNs).
  • methods: an optimization framework built on a differentiable Simulated OSN (SIO) network, which consists of a differentiable JPEG layer and an encoder-decoder subnetwork that together mimic the lossy operations performed by an OSN; AEs are generated by enforcing that model outputs are misled both with and without passing through the SIO.
  • results: extensive experiments on Facebook, WeChat and QQ show that the attack produces more robust AEs than existing methods, especially under small distortion constraints, with gains of more than 60% in Attack Success Rate (ASR); a public dataset of more than 10,000 AE pairs processed by Facebook, WeChat or QQ is also released to facilitate future research.
    Abstract Online Social Networks (OSNs) have blossomed into prevailing transmission channels for images in the modern era. Adversarial examples (AEs) deliberately designed to mislead deep neural networks (DNNs) are found to be fragile against the inevitable lossy operations conducted by OSNs. As a result, the AEs would lose their attack capabilities after being transmitted over OSNs. In this work, we aim to design a new framework for generating robust AEs that can survive the OSN transmission; namely, the AEs before and after the OSN transmission both possess strong attack capabilities. To this end, we first propose a differentiable network termed SImulated OSN (SIO) to simulate the various operations conducted by an OSN. Specifically, the SIO network consists of two modules: 1) a differentiable JPEG layer for approximating the ubiquitous JPEG compression and 2) an encoder-decoder subnetwork for mimicking the remaining operations. Based upon the SIO network, we then formulate an optimization framework to generate robust AEs by enforcing model outputs with and without passing through the SIO to be both misled. Extensive experiments conducted over Facebook, WeChat and QQ demonstrate that our attack methods produce more robust AEs than existing approaches, especially under small distortion constraints; the performance gain in terms of Attack Success Rate (ASR) could be more than 60%. Furthermore, we build a public dataset containing more than 10,000 pairs of AEs processed by Facebook, WeChat or QQ, facilitating future research in the robust AEs generation. The dataset and code are available at https://github.com/csjunjun/RobustOSNAttack.git.
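A minimal sketch of the core optimization idea: the attack loss is applied to the adversarial image both directly and after a differentiable simulation of the OSN's lossy processing, so the perturbation is encouraged to survive transmission. The `simulated_osn` stand-in below (smoothing plus noise) is a placeholder assumption for the paper's SIO network, and the PGD-style solver is one common choice rather than the authors' exact method.

```python
import torch
import torch.nn.functional as F

def simulated_osn(x):
    """Stand-in for a differentiable OSN simulator; the paper's SIO network uses
    a differentiable JPEG layer plus an encoder-decoder subnetwork instead."""
    x = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)  # mild smoothing
    return (x + 0.01 * torch.randn_like(x)).clamp(0, 1)      # mild sensor-like noise

def robust_attack(model, x, y_true, eps=8 / 255, alpha=1 / 255, steps=40):
    """Untargeted PGD whose loss is applied before *and* after the simulated OSN."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true) \
             + F.cross_entropy(model(simulated_osn(x_adv)), y_true)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend both loss terms
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back to the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```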

Recoverable Privacy-Preserving Image Classification through Noise-like Adversarial Examples

  • paper_url: http://arxiv.org/abs/2310.12707
  • repo_url: https://github.com/csjunjun/ric
  • paper_authors: Jun Liu, Jiantao Zhou, Jinyu Tian, Weiwei Sun
  • for: protecting data privacy in cloud-based image services such as classification.
  • methods: a privacy-preserving image classification scheme in which classifiers trained in the plaintext domain are applied directly to encrypted images without retraining a dedicated classifier; a feature extractor and an encoder mask the plaintext image with a newly designed Noise-like Adversarial Example (NAE), and a Symmetric Residual Learning (SRL) framework decrypts the image back to its original form with a secret key.
  • results: the plaintext-domain classifier keeps the same accuracy in both the ciphertext and plaintext domains; encrypted images are recovered with high fidelity, and the system shows satisfactory generalization across datasets and a high level of security.
    Abstract With the increasing prevalence of cloud computing platforms, ensuring data privacy during the cloud-based image related services such as classification has become crucial. In this study, we propose a novel privacypreserving image classification scheme that enables the direct application of classifiers trained in the plaintext domain to classify encrypted images, without the need of retraining a dedicated classifier. Moreover, encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key. Specifically, our proposed scheme involves utilizing a feature extractor and an encoder to mask the plaintext image through a newly designed Noise-like Adversarial Example (NAE). Such an NAE not only introduces a noise-like visual appearance to the encrypted image but also compels the target classifier to predict the ciphertext as the same label as the original plaintext image. At the decoding phase, we adopt a Symmetric Residual Learning (SRL) framework for restoring the plaintext image with minimal degradation. Extensive experiments demonstrate that 1) the classification accuracy of the classifier trained in the plaintext domain remains the same in both the ciphertext and plaintext domains; 2) the encrypted images can be recovered into their original form with an average PSNR of up to 51+ dB for the SVHN dataset and 48+ dB for the VGGFace2 dataset; 3) our system exhibits satisfactory generalization capability on the encryption, decryption and classification tasks across datasets that are different from the training one; and 4) a high-level of security is achieved against three potential threat models. The code is available at https://github.com/csjunjun/RIC.git.

Exploiting Low-confidence Pseudo-labels for Source-free Object Detection

  • paper_url: http://arxiv.org/abs/2310.12705
  • repo_url: None
  • paper_authors: Zhihong Chen, Zilei Wang, Yixin Zhang
  • for: adapting a source-trained detector to an unlabeled target domain without access to the labeled source data (source-free object detection, SFOD).
  • methods: a new approach that fully exploits pseudo-labels via high and low confidence thresholds; pseudo-labels above the high threshold are used conventionally, while those between the low and high thresholds are exploited through a Low-confidence Pseudo-labels Utilization (LPU) module composed of Local Spatial Contrastive Learning (LSCL) and Proposal Soft Training (PST).
  • results: extensive experiments on five cross-domain object detection benchmarks show that the method outperforms existing SFOD approaches and achieves state-of-the-art performance.
    Abstract Source-free object detection (SFOD) aims to adapt a source-trained detector to an unlabeled target domain without access to the labeled source data. Current SFOD methods utilize a threshold-based pseudo-label approach in the adaptation phase, which is typically limited to high-confidence pseudo-labels and results in a loss of information. To address this issue, we propose a new approach to take full advantage of pseudo-labels by introducing high and low confidence thresholds. Specifically, the pseudo-labels with confidence scores above the high threshold are used conventionally, while those between the low and high thresholds are exploited using the Low-confidence Pseudo-labels Utilization (LPU) module. The LPU module consists of Proposal Soft Training (PST) and Local Spatial Contrastive Learning (LSCL). PST generates soft labels of proposals for soft training, which can mitigate the label mismatch problem. LSCL exploits the local spatial relationship of proposals to improve the model's ability to differentiate between spatially adjacent proposals, thereby optimizing representational features further. Combining the two components overcomes the challenges faced by traditional methods in utilizing low-confidence pseudo-labels. Extensive experiments on five cross-domain object detection benchmarks demonstrate that our proposed method outperforms the previous SFOD methods, achieving state-of-the-art performance.
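A minimal sketch of the dual-threshold routing described above: detections above the high threshold become conventional pseudo-labels, those between the two thresholds feed the LPU-style branch, and the rest are discarded. The threshold values and the detection format are assumptions.

```python
def split_pseudo_labels(detections, low_thr: float = 0.3, high_thr: float = 0.8):
    """Split detector outputs into confident pseudo-labels and low-confidence proposals.

    detections: iterable of dicts like {"box": [x1, y1, x2, y2], "label": int, "score": float}
    """
    confident, low_confidence = [], []
    for det in detections:
        if det["score"] >= high_thr:
            confident.append(det)        # used as conventional pseudo-labels
        elif det["score"] >= low_thr:
            low_confidence.append(det)   # routed to soft training / contrastive learning
        # detections below low_thr are discarded
    return confident, low_confidence

dets = [{"box": [0, 0, 10, 10], "label": 1, "score": 0.9},
        {"box": [5, 5, 20, 20], "label": 2, "score": 0.5},
        {"box": [1, 1, 3, 3], "label": 0, "score": 0.1}]
print([len(s) for s in split_pseudo_labels(dets)])  # [1, 1]
```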

Representation Learning via Consistent Assignment of Views over Random Partitions

  • paper_url: http://arxiv.org/abs/2310.12692
  • repo_url: https://github.com/sthalles/carp
  • paper_authors: Thalles Silva, Adín Ramírez Rivera
  • for: Consistent Assignment of Views over Random Partitions (CARP), a self-supervised clustering method for representation learning of visual features.
  • methods: CARP learns prototypes end-to-end with gradient descent, without additional non-differentiable modules for the cluster assignment problem; a new pretext task based on random partitions of the prototypes regularizes the model and enforces consistency between the assignments of different views.
  • results: representations learned by CARP are evaluated on 17 datasets across many standard protocols and compared with 11 existing self-supervised methods; CARP achieves the best average performance on transfer learning tasks, even against methods trained for longer.
    Abstract We present Consistent Assignment of Views over Random Partitions (CARP), a self-supervised clustering method for representation learning of visual features. CARP learns prototypes in an end-to-end online fashion using gradient descent without additional non-differentiable modules to solve the cluster assignment problem. CARP optimizes a new pretext task based on random partitions of prototypes that regularizes the model and enforces consistency between views' assignments. Additionally, our method improves training stability and prevents collapsed solutions in joint-embedding training. Through an extensive evaluation, we demonstrate that CARP's representations are suitable for learning downstream tasks. We evaluate CARP's representation capabilities on 17 datasets across many standard protocols, including linear evaluation, few-shot classification, k-NN, k-means, image retrieval, and copy detection. We compare CARP's performance to 11 existing self-supervised methods. We extensively ablate our method and demonstrate that our proposed random partition pretext task improves the quality of the learned representations by devising multiple random classification tasks. In transfer learning tasks, CARP achieves the best performance on average against many SSL methods trained for a longer time.
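A minimal sketch of one plausible instantiation of the random-partition pretext task: prototypes are randomly split into blocks and, within each block, the assignment of one view serves as a soft target for the other. The block count, temperature, and stop-gradient soft target are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def random_partition_consistency(z1, z2, prototypes, n_blocks=4, temperature=0.1):
    """Consistency of cluster assignments between two views over random prototype partitions.

    z1, z2:      (B, D) L2-normalized embeddings of two augmented views
    prototypes:  (K, D) L2-normalized prototype vectors
    """
    K = prototypes.shape[0]
    blocks = torch.randperm(K).chunk(n_blocks)   # random partition of prototype indices
    loss = 0.0
    for idx in blocks:
        logits1 = z1 @ prototypes[idx].t() / temperature
        logits2 = z2 @ prototypes[idx].t() / temperature
        target = F.softmax(logits2, dim=-1).detach()   # view 2 assignment as soft target
        loss = loss - (target * F.log_softmax(logits1, dim=-1)).sum(-1).mean()
    return loss / n_blocks

z1 = F.normalize(torch.randn(8, 128), dim=-1)
z2 = F.normalize(torch.randn(8, 128), dim=-1)
protos = F.normalize(torch.randn(64, 128), dim=-1)
print(random_partition_consistency(z1, z2, protos).item())
```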

TapMo: Shape-aware Motion Generation of Skeleton-free Characters

  • paper_url: http://arxiv.org/abs/2310.12678
  • repo_url: None
  • paper_authors: Jiaxu Zhang, Shaoli Huang, Zhigang Tu, Xin Chen, Xiaohang Zhan, Gang Yu, Ying Shan
  • for: generating motion for a broad spectrum of skeleton-free 3D characters.
  • methods: two main components: a Mesh Handle Predictor that predicts skinning weights and clusters mesh vertices into adaptive deformation handles, and a Shape-aware Motion Diffusion module that synthesizes mesh-specific motion from text-guided motions and the shape features extracted in the first stage.
  • results: rigorous qualitative and quantitative experiments show that TapMo consistently outperforms existing auto-animation methods, delivering higher-quality animations for both seen and unseen heterogeneous 3D characters.
    Abstract Previous motion generation methods are limited to the pre-rigged 3D human model, hindering their applications in the animation of various non-rigged characters. In this work, we present TapMo, a Text-driven Animation Pipeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters. The pivotal innovation in TapMo is its use of shape deformation-aware features as a condition to guide the diffusion model, thereby enabling the generation of mesh-specific motions for various characters. Specifically, TapMo comprises two main components - Mesh Handle Predictor and Shape-aware Diffusion Module. Mesh Handle Predictor predicts the skinning weights and clusters mesh vertices into adaptive handles for deformation control, which eliminates the need for traditional skeletal rigging. Shape-aware Motion Diffusion synthesizes motion with mesh-specific adaptations. This module employs text-guided motions and mesh features extracted during the first stage, preserving the geometric integrity of the animations by accounting for the character's shape and deformation. Trained in a weakly-supervised manner, TapMo can accommodate a multitude of non-human meshes, both with and without associated text motions. We demonstrate the effectiveness and generalizability of TapMo through rigorous qualitative and quantitative experiments. Our results reveal that TapMo consistently outperforms existing auto-animation methods, delivering superior-quality animations for both seen or unseen heterogeneous 3D characters.

Weakly Supervised Learning for Breast Cancer Prediction on Mammograms in Realistic Settings

  • paper_url: http://arxiv.org/abs/2310.12677
  • repo_url: None
  • paper_authors: Shreyasi Pathak, Jörg Schlötterer, Jeroen Geerdink, Onno Dirk Vijlbrief, Maurice van Keulen, Christin Seifert
  • for: automatic early detection of breast cancer on mammography can significantly decrease mortality, but broad uptake in hospitals is currently hindered because existing methods impose too many constraints.
  • methods: existing methods assume annotations for single images or even regions-of-interest (ROIs) and a fixed number of images per patient; relaxing these assumptions yields a weakly supervised setting with only case-level labels, addressed here with a two-level multi-instance learning (MIL) approach and a domain-specific MIL pooling variant that exploits the fact that breast cancer is usually present in only one side.
  • results: evaluated on two public datasets (1.6k and 5k cases) and an in-house dataset of 21k cases, the two-level MIL approach is shown to be applicable in realistic clinical settings where only case labels and a variable number of images per patient are available; the authors argue that future research should focus on unsupervised ROI extraction.
    Abstract Automatic methods for early detection of breast cancer on mammography can significantly decrease mortality. Broad uptake of those methods in hospitals is currently hindered because the methods have too many constraints. They assume annotations available for single images or even regions-of-interest (ROIs), and a fixed number of images per patient. Both assumptions do not hold in a general hospital setting. Relaxing those assumptions results in a weakly supervised learning setting, where labels are available per case, but not for individual images or ROIs. Not all images taken for a patient contain malignant regions and the malignant ROIs cover only a tiny part of an image, whereas most image regions represent benign tissue. In this work, we investigate a two-level multi-instance learning (MIL) approach for case-level breast cancer prediction on two public datasets (1.6k and 5k cases) and an in-house dataset of 21k cases. Observing that breast cancer is usually only present in one side, while images of both breasts are taken as a precaution, we propose a domain-specific MIL pooling variant. We show that two-level MIL can be applied in realistic clinical settings where only case labels, and a variable number of images per patient are available. Data in realistic settings scales with continuous patient intake, while manual annotation efforts do not. Hence, research should focus in particular on unsupervised ROI extraction, in order to improve breast cancer prediction for all patients.
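A minimal sketch of a two-level MIL predictor for a case with a variable number of mammograms: attention pooling over instances within each image, then a case-level pooling across images. The attention form and the max pooling at case level (reflecting that cancer usually appears on one side) are assumptions about one plausible variant, not the paper's exact model.

```python
import torch
import torch.nn as nn

class TwoLevelMIL(nn.Module):
    """Image-level attention pooling followed by case-level pooling."""

    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, case):
        # case: list of (n_patches_i, feat_dim) tensors, one per image of the patient
        image_embeddings = []
        for patches in case:
            a = torch.softmax(self.attn(patches), dim=0)       # (n_patches, 1) weights
            image_embeddings.append((a * patches).sum(dim=0))  # weighted instance pooling
        image_logits = self.classifier(torch.stack(image_embeddings)).squeeze(-1)
        # Case-level pooling: malignancy usually shows in one breast/image, so take the max.
        return image_logits.max()

model = TwoLevelMIL()
case = [torch.randn(30, 512), torch.randn(25, 512), torch.randn(40, 512)]
print(model(case))  # single case-level logit
```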

TRUSTED: The Paired 3D Transabdominal Ultrasound and CT Human Data for Kidney Segmentation and Registration Research

  • paper_url: http://arxiv.org/abs/2310.12646
  • repo_url: None
  • paper_authors: William Ndzimbong, Cyril Fourniol, Loic Themyr, Nicolas Thome, Yvonne Keeza, Beniot Sauer, Pierre-Thierry Piechaud, Arnaud Mejean, Jacques Marescaux, Daniel George, Didier Mutter, Alexandre Hostettler, Toby Collins
  • for: The paper is written for researchers to develop and validate new image segmentation and image-modality registration methods using abdominal ultrasound (US) data.
  • methods: The paper uses a dataset of paired transabdominal 3DUS and CT kidney images, with segmentation and anatomical landmark annotations, to evaluate the performance of different deep learning models and image registration methods.
  • results: The paper reports the results of benchmarking five deep learning models for automatic kidney segmentation, with average DICE scores ranging from 83.2% to 89.1% for CT images and 61.9% to 79.4% for US images, and of benchmarking three image registration methods, with Coherent Point Drift performing best with an average Target Registration Error of 4.53mm.
    Abstract Inter-modal image registration (IMIR) and image segmentation with abdominal Ultrasound (US) data have many important clinical applications, including image-guided surgery, automatic organ measurement and robotic navigation. However, research is severely limited by the lack of public datasets. We propose TRUSTED (the Tridimensional Renal Ultra Sound TomodEnsitometrie Dataset), comprising paired transabdominal 3DUS and CT kidney images from 48 human patients (96 kidneys), including segmentation, and anatomical landmark annotations by two experienced radiographers. Inter-rater segmentation agreement was over 94 (Dice score), and gold-standard segmentations were generated using the STAPLE algorithm. Seven anatomical landmarks were annotated, important for IMIR systems development and evaluation. To validate the dataset's utility, 5 competitive Deep Learning models for automatic kidney segmentation were benchmarked, yielding average DICE scores from 83.2% to 89.1% for CT, and 61.9% to 79.4% for US images. Three IMIR methods were benchmarked, and Coherent Point Drift performed best with an average Target Registration Error of 4.53mm. The TRUSTED dataset may be used freely by researchers to develop and validate new segmentation and IMIR methods.

SIRe-IR: Inverse Rendering for BRDF Reconstruction with Shadow and Illumination Removal in High-Illuminance Scenes

  • paper_url: http://arxiv.org/abs/2310.13030
  • repo_url: https://github.com/ingra14m/sire-ir
  • paper_authors: Ziyi Yang, Yanzhen Chen, Xinyu Gao, Yazhen Yuan, Yu Wu, Xiaowei Zhou, Xiaogang Jin
  • for: inverse rendering in strongly illuminated scenes, where existing implicit neural inverse rendering methods are affected by shadows and indirect illumination, leading to an inaccurate understanding of scene geometry and making precise factorization difficult.
  • methods: SIRe-IR uses non-linear mapping and regularized visibility estimation to decompose the scene into an environment map, albedo, and roughness; by jointly modeling the indirect radiance field, normals, visibility, and direct light, it removes both shadows and indirect illumination from the recovered materials without imposing strict constraints on the scene.
  • results: even in the presence of intense illumination, the method recovers high-quality albedo and roughness free of shadow interference, outperforming existing methods both quantitatively and qualitatively.
    Abstract Implicit neural representation has opened up new possibilities for inverse rendering. However, existing implicit neural inverse rendering methods struggle to handle strongly illuminated scenes with significant shadows and indirect illumination. The existence of shadows and reflections can lead to an inaccurate understanding of scene geometry, making precise factorization difficult. To this end, we present SIRe-IR, an implicit neural inverse rendering approach that uses non-linear mapping and regularized visibility estimation to decompose the scene into environment map, albedo, and roughness. By accurately modeling the indirect radiance field, normal, visibility, and direct light simultaneously, we are able to remove both shadows and indirect illumination in materials without imposing strict constraints on the scene. Even in the presence of intense illumination, our method recovers high-quality albedo and roughness with no shadow interference. SIRe-IR outperforms existing methods in both quantitative and qualitative evaluations.

FUSC: Fetal Ultrasound Semantic Clustering of Second Trimester Scans Using Deep Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2310.12600
  • repo_url: None
  • paper_authors: Hussain Alasmawi, Leanne Bricker, Mohammad Yaqub
  • for: automatically clustering fetal ultrasound images into a large range of fetal views, reducing or eliminating the need for manual labeling.
  • methods: the Fetal Ultrasound Semantic Clustering (FUSC) method is developed with deep self-supervised learning on a large dataset of 88,063 images.
  • results: on an additional unseen test dataset of 8,187 images, FUSC achieves over 92% clustering purity.
    Abstract Ultrasound is the primary imaging modality in clinical practice during pregnancy. More than 140M fetuses are born yearly, resulting in numerous scans. The availability of a large volume of fetal ultrasound scans presents the opportunity to train robust machine learning models. However, the abundance of scans also has its challenges, as manual labeling of each image is needed for supervised methods. Labeling is typically labor-intensive and requires expertise to annotate the images accurately. This study presents an unsupervised approach for automatically clustering ultrasound images into a large range of fetal views, reducing or eliminating the need for manual labeling. Our Fetal Ultrasound Semantic Clustering (FUSC) method is developed using a large dataset of 88,063 images and further evaluated on an additional unseen dataset of 8,187 images, achieving over 92% clustering purity. The results of our investigation hold the potential to significantly impact the field of fetal ultrasound imaging and pave the way for more advanced automated labeling solutions. Finally, we make the code and the experimental setup publicly available to help advance the field.
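The 92% figure above refers to clustering purity. A minimal sketch of the metric (each cluster is credited with its most frequent ground-truth view) follows; the toy labels are placeholders for illustration only.

```python
import numpy as np

def clustering_purity(cluster_ids: np.ndarray, true_labels: np.ndarray) -> float:
    """Purity = (1/N) * sum over clusters of the count of the majority true label."""
    total = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()
    return total / len(true_labels)

# toy example: 8 images, 3 discovered clusters vs. ground-truth fetal views
clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
views    = np.array([0, 0, 1, 1, 1, 2, 2, 0])
print(clustering_purity(clusters, views))  # 0.75
```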

PrivacyGAN: robust generative image privacy

  • paper_url: http://arxiv.org/abs/2310.12590
  • repo_url: None
  • paper_authors: Mariia Zameshina, Marlene Careil, Olivier Teytaud, Laurent Najman
  • for: protecting the privacy of facial images; classical approaches are either data-poisoning methods (e.g., Fawkes) that introduce subtle perturbations, or anonymization methods that preserve only a few characteristics such as gender, ethnicity, or facial expression, whereas this work uses image generation techniques to safeguard privacy while keeping images usable, particularly for social media applications.
  • methods: drawing inspiration from Fawkes, the original image is shifted within the embedding space of image generators such as VQGAN and StyleGAN towards a decoy image.
  • results: effectiveness is demonstrated with privacy metrics on traditional and novel facial image datasets, with new criteria for robustness against unknown image recognition techniques, and in unknown embedding transfer scenarios; a human evaluation further shows that the modified images remain recognizable as the same person by friends and family.
    Abstract Classical techniques for protecting facial image privacy typically fall into two categories: data-poisoning methods, exemplified by Fawkes, which introduce subtle perturbations to images, or anonymization methods that generate images resembling the original only in several characteristics, such as gender, ethnicity, or facial expression. In this study, we introduce a novel approach, PrivacyGAN, that uses the power of image generation techniques, such as VQGAN and StyleGAN, to safeguard privacy while maintaining image usability, particularly for social media applications. Drawing inspiration from Fawkes, our method entails shifting the original image within the embedding space towards a decoy image. We evaluate our approach using privacy metrics on traditional and novel facial image datasets. Additionally, we propose new criteria for evaluating the robustness of privacy-protection methods against unknown image recognition techniques, and we demonstrate that our approach is effective even in unknown embedding transfer scenarios. We also provide a human evaluation that further proves that the modified image preserves its utility as it remains recognisable as an image of the same person by friends and family.
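A minimal sketch of the decoy-shift idea: a latent is optimized so the decoded image stays visually close to the original while its identity embedding moves toward a decoy. The generator and face-encoder interfaces, the loss weights, and the optimizer are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def protect_image(generator, face_encoder, z_orig, x_orig, x_decoy,
                  steps=200, lr=0.05, lam=1.0):
    """Shift a latent toward a decoy identity while preserving visual content.

    generator:    maps a latent z to an image (e.g., a StyleGAN/VQGAN decoder)
    face_encoder: maps an image to an identity embedding (a proxy recognition model)
    """
    z = z_orig.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    decoy_emb = face_encoder(x_decoy).detach()
    for _ in range(steps):
        x = generator(z)
        visual_loss = F.l1_loss(x, x_orig)                    # stay close to the original pixels
        privacy_loss = F.mse_loss(face_encoder(x), decoy_emb) # push identity toward the decoy
        loss = visual_loss + lam * privacy_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach()
```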

Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2310.12583
  • repo_url: None
  • paper_authors: Mariia Zameshina, Olivier Teytaud, Laurent Najman
  • for: improving the diversity of images produced by text-to-image generation models, so that generated images are more realistic and varied.
  • methods: a general unsupervised technique, applicable to existing text-to-image models, that searches the Stable Diffusion latent space for vectors that are distant from each other, sampling until a set meets the desired distance requirements and batch size.
  • results: experiments on color diversity, the LPIPS metric, and ethnicity/gender representation in images of humans demonstrate the effectiveness of the approach and offer insights for improving text-to-image models.
    Abstract Latent diffusion models excel at producing high-quality images from text. Yet, concerns appear about the lack of diversity in the generated imagery. To tackle this, we introduce Diverse Diffusion, a method for boosting image diversity beyond gender and ethnicity, spanning into richer realms, including color diversity. Diverse Diffusion is a general unsupervised technique that can be applied to existing text-to-image models. Our approach focuses on finding vectors in the Stable Diffusion latent space that are distant from each other. We generate multiple vectors in the latent space until we find a set of vectors that meets the desired distance requirements and the required batch size. To evaluate the effectiveness of our diversity methods, we conduct experiments examining various characteristics, including color diversity, the LPIPS metric, and ethnicity/gender representation in images featuring humans. The results of our experiments emphasize the significance of diversity in generating realistic and varied images, offering valuable insights for improving text-to-image models. Through the enhancement of image diversity, our approach contributes to the creation of more inclusive and representative AI-generated art.
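A minimal sketch of the sampling loop described: latent vectors are drawn and kept only if they are far enough from every previously accepted one, until the batch is full. The Euclidean distance and the threshold value are assumptions; the threshold has to be tuned to the latent dimensionality.

```python
import torch

def sample_diverse_latents(batch_size=4, latent_shape=(4, 64, 64),
                           min_dist=180.0, max_tries=10_000):
    """Rejection-sample Stable-Diffusion-style initial latents with pairwise distance constraints.

    min_dist depends on the latent dimensionality; for a standard-normal latent of this
    shape the typical pairwise distance is around 181, so 180 filters the closest pairs.
    """
    accepted = []
    for _ in range(max_tries):
        z = torch.randn(latent_shape)
        if all(torch.dist(z, prev) >= min_dist for prev in accepted):
            accepted.append(z)
        if len(accepted) == batch_size:
            break
    return torch.stack(accepted)  # pass these as initial latents to the diffusion sampler

latents = sample_diverse_latents()
print(latents.shape)  # torch.Size([4, 4, 64, 64])
```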

A reproducible 3D convolutional neural network with dual attention module (3D-DAM) for Alzheimer’s disease classification

  • paper_url: http://arxiv.org/abs/2310.12574
  • repo_url: None
  • paper_authors: Gia Minh Hoang, Youngjoo Lee, Jae Gwan Kim
  • for: a reproducible model for Alzheimer's disease classification.
  • methods: a 3D convolutional neural network with a dual attention module, trained on the ADNI database and validated on two independent datasets (AIBL and OASIS1).
  • results: 91.94% accuracy for MCI progression classification and 96.30% for Alzheimer's disease classification on ADNI, with good generalizability: 86.37% accuracy on AIBL and 83.42% on OASIS1.
    Abstract Alzheimer's disease is one of the most common types of neurodegenerative disease, characterized by the accumulation of amyloid-beta plaque and tau tangles. Recently, deep learning approaches have shown promise in Alzheimer's disease diagnosis. In this study, we propose a reproducible model that utilizes a 3D convolutional neural network with a dual attention module for Alzheimer's disease classification. We trained the model in the ADNI database and verified the generalizability of our method in two independent datasets (AIBL and OASIS1). Our method achieved state-of-the-art classification performance, with an accuracy of 91.94% for MCI progression classification and 96.30% for Alzheimer's disease classification on the ADNI dataset. Furthermore, the model demonstrated good generalizability, achieving an accuracy of 86.37% on the AIBL dataset and 83.42% on the OASIS1 dataset. These results indicate that our proposed approach has competitive performance and generalizability when compared to recent studies in the field.

DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.12570
  • repo_url: https://github.com/sun-1024/da-transunet
  • paper_authors: Guanqun Sun, Yizhi Pan, Weikun Kong, Zichang Xu, Jianhua Ma, Teeradaj Racharak, Le-Minh Nguyen, Junyi Xin
  • for: DA-TransUNet, a deep medical image segmentation framework that aims to improve segmentation performance by incorporating a transformer and dual attention blocks into the traditional U-shaped architecture.
  • methods: the model combines the attention mechanism of the transformer with the multifaceted feature extraction of the DA-Block to efficiently fuse global, local, and multi-scale features; dual attention blocks are added before the Transformer layer and in the skip connections to enhance feature extraction and transfer.
  • results: experiments across various medical image segmentation benchmarks show that DA-TransUNet significantly outperforms state-of-the-art methods.
    Abstract Great progress has been made in automatic medical image segmentation due to powerful deep representation learning. The influence of the transformer has led to research into its variants and large-scale replacement of traditional CNN modules. However, such a trend often overlooks the intrinsic feature extraction capabilities of the transformer and potential refinements to both the model and the transformer module through minor adjustments. This study proposes a novel deep medical image segmentation framework, called DA-TransUNet, aiming to introduce the Transformer and dual attention block into the encoder and decoder of the traditional U-shaped architecture. Unlike prior transformer-based solutions, our DA-TransUNet utilizes the attention mechanism of the transformer and the multifaceted feature extraction of the DA-Block, which can efficiently combine global, local, and multi-scale features to enhance medical image segmentation. Experimental results also show that adding a dual attention block before the Transformer layer facilitates feature extraction in the U-net structure. Furthermore, incorporating dual attention blocks in skip connections can enhance feature transfer to the decoder, thereby improving image segmentation performance. Experimental results across various benchmarks of medical image segmentation reveal that DA-TransUNet significantly outperforms the state-of-the-art methods. The codes and parameters of our model will be publicly available at https://github.com/SUN-1024/DA-TransUnet.

Click on Mask: A Labor-efficient Annotation Framework with Level Set for Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2310.12562
  • repo_url: https://github.com/li-haoqing/com
  • paper_authors: Haoqing Li, Jinfu Yang, Yifei Xu, Runshi Wang
  • for: reducing the manual annotation burden of infrared small target detection, where the small size of targets makes labeling costly and limits data-driven methods.
  • methods: a labor-efficient and cursory annotation framework based on level sets that produces a high-quality pseudo mask from a single cursory click; a variational level set formulation with an expectation difference energy functional keeps the zero level contour from vanishing during the level set evolution.
  • results: experiments on the NUAA-SIRST and IRSTD-1k datasets show superior performance.
    Abstract Infrared Small Target Detection is a challenging task that aims to separate small targets from cluttered infrared backgrounds. Recently, deep learning paradigms have achieved promising results. However, these data-driven methods need plenty of manual annotation. Due to the small size of infrared targets, manual annotation consumes more resources and restricts the development of this field. This letter proposes a labor-efficient and cursory annotation framework with level set, which obtains a high-quality pseudo mask with only one cursory click. A variational level set formulation with an expectation difference energy functional is designed, in which the zero level contour is intrinsically maintained during the level set evolution. It solves the issue of the zero level contour disappearing due to small target size and excessive regularization. Experiments on the NUAA-SIRST and IRSTD-1k datasets reveal that our approach achieves superior performance. Code is available at https://github.com/Li-Haoqing/COM.

Explanation-Based Training with Differentiable Insertion/Deletion Metric-Aware Regularizers

  • paper_url: http://arxiv.org/abs/2310.12553
  • repo_url: None
  • paper_authors: Yuya Yoshikawa, Tomoharu Iwata
  • for: improving the quality of explanations for complex machine learning predictors, as measured by insertion and deletion metrics that assess how faithfully explanations reflect the predictor's behavior.
  • methods: insertion/deletion metric-aware explanation-based optimization (ID-ExpO) optimizes differentiable predictors to improve both the insertion and deletion scores of explanations while keeping predictive accuracy; since the original metrics are not differentiable with respect to the explanations, they are extended to differentiable forms and used as regularizers.
  • results: experiments on image and tabular datasets show that deep neural network predictors fine-tuned with ID-ExpO enable popular post-hoc explainers to produce more faithful and easier-to-interpret explanations while keeping high predictive accuracy.
    Abstract The quality of explanations for the predictions of complex machine learning predictors is often measured using insertion and deletion metrics, which assess the faithfulness of the explanations, i.e., how correctly the explanations reflect the predictor's behavior. To improve the faithfulness, we propose insertion/deletion metric-aware explanation-based optimization (ID-ExpO), which optimizes differentiable predictors to improve both the insertion and deletion scores of the explanations while keeping their predictive accuracy. Since the original insertion and deletion metrics are non-differentiable with respect to the explanations and therefore not directly usable for gradient-based optimization, we extend the metrics to be differentiable and use them to formalize insertion and deletion metric-based regularizers. The experimental results on image and tabular datasets show that the deep neural network-based predictors fine-tuned using ID-ExpO enable popular post-hoc explainers to produce more faithful and easy-to-interpret explanations while keeping high predictive accuracy.
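A minimal sketch of one way to make a deletion-style metric differentiable: the most-attributed pixels are softly masked out with a sigmoid relaxation, and the remaining confidence in the target class is returned as a penalty (a faithful explanation should make this drop). The quantile threshold, temperature, and exact surrogate form are assumptions about one plausible instantiation, not the paper's formulation.

```python
import torch

def soft_deletion_regularizer(model, x, attribution, target_class,
                              frac=0.2, temperature=0.05):
    """Differentiable surrogate of the deletion metric.

    attribution: (B, 1, H, W) explanation scores for x (kept in the autograd graph).
    Softly deleting the top `frac` attributed pixels should reduce the confidence
    in `target_class`; the remaining confidence is returned as a penalty.
    """
    b = x.shape[0]
    flat = attribution.view(b, -1)
    thresh = torch.quantile(flat, 1.0 - frac, dim=1, keepdim=True)
    # keep_mask ~ 0 for highly attributed pixels, ~ 1 elsewhere (soft and differentiable)
    keep_mask = torch.sigmoid((thresh - flat) / temperature).view_as(attribution)
    x_deleted = x * keep_mask
    probs = torch.softmax(model(x_deleted), dim=-1)
    return probs.gather(1, target_class.view(-1, 1)).mean()
```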

PGA: Personalizing Grasping Agents with Single Human-Robot Interaction

  • paper_url: http://arxiv.org/abs/2310.12547
  • repo_url: None
  • paper_authors: Junghyun Kim, Gi-Cheon Kang, Jaein Kim, Seoyun Yang, Minjoon Jung, Byoung-Tak Zhang
  • for: develop robots that ground and grasp objects based on natural language instructions
  • methods: learn personal objects by propagating user-given information through a Reminiscence, a collection of raw images from the user's environment
  • results: significantly outperforms baseline methods in both offline and online settings, demonstrating its effectiveness and the applicability of personalization to real-world scenarios
    Abstract Language-Conditioned Robotic Grasping (LCRG) aims to develop robots that ground and grasp objects based on natural language instructions. While robots capable of recognizing personal objects like "my wallet" can interact more naturally with non-expert users, current LCRG systems primarily limit robots to understanding only generic expressions. To this end, we introduce a task scenario GraspMine with a novel dataset that aims to locate and grasp personal objects given personal indicators via learning from a single human-robot interaction. To address GraspMine, we propose Personalized Grasping Agent (PGA), that learns personal objects by propagating user-given information through a Reminiscence-a collection of raw images from the user's environment. Specifically, PGA acquires personal object information by a user presenting a personal object with its associated indicator, followed by PGA inspecting the object by rotating it. Based on the acquired information, PGA pseudo-labels objects in the Reminiscence by our proposed label propagation algorithm. Harnessing the information acquired from the interactions and the pseudo-labeled objects in the Reminiscence, PGA adapts the object grounding model to grasp personal objects. Experiments on GraspMine show that PGA significantly outperforms baseline methods both in offline and online settings, signifying its effectiveness and personalization applicability on real-world scenarios. Finally, qualitative analysis shows the effectiveness of PGA through a detailed investigation of results in each phase.

Weakly-Supervised Semantic Segmentation with Image-Level Labels: from Traditional Models to Foundation Models

  • paper_url: http://arxiv.org/abs/2310.13026
  • repo_url: None
  • paper_authors: Zhaozheng Chen, Qianru Sun
  • for: a survey of weakly-supervised semantic segmentation (WSSS) with image-level labels, the most challenging form of WSSS, which relies on image-level annotations instead of per-pixel labels.
  • methods: traditional methods are categorized into four groups according to where they operate: pixel-wise, image-wise, cross-image, and external data.
  • results: the applicability of visual foundation models such as the Segment Anything Model (SAM) is examined in text-prompting and zero-shot scenarios, with a discussion of the potential and challenges of deploying such models for WSSS.
    Abstract The rapid development of deep learning has driven significant progress in the field of image semantic segmentation - a fundamental task in computer vision. Semantic segmentation algorithms often depend on the availability of pixel-level labels (i.e., masks of objects), which are expensive, time-consuming, and labor-intensive. Weakly-supervised semantic segmentation (WSSS) is an effective solution to avoid such labeling. It utilizes only partial or incomplete annotations and provides a cost-effective alternative to fully-supervised semantic segmentation. In this paper, we focus on the WSSS with image-level labels, which is the most challenging form of WSSS. Our work has two parts. First, we conduct a comprehensive survey on traditional methods, primarily focusing on those presented at premier research conferences. We categorize them into four groups based on where their methods operate: pixel-wise, image-wise, cross-image, and external data. Second, we investigate the applicability of visual foundation models, such as the Segment Anything Model (SAM), in the context of WSSS. We scrutinize SAM in two intriguing scenarios: text prompting and zero-shot learning. We provide insights into the potential and challenges associated with deploying visual foundational models for WSSS, facilitating future developments in this exciting research area.

Machine Learning for Leaf Disease Classification: Data, Techniques and Applications

  • paper_url: http://arxiv.org/abs/2310.12509
  • repo_url: None
  • paper_authors: Jianping Yao, Son N. Tran, Samantha Sawyer, Saurabh Garg
  • for: giving researchers, engineers, managers, and entrepreneurs a comprehensive view of recent machine learning technologies and applications for leaf disease detection in smart agriculture.
  • methods: the survey starts with publicly available datasets and then summarizes common machine learning techniques, including traditional (shallow) learning, deep learning, and augmented learning, along with their applications to leaf disease classification.
  • results: the paper provides useful resources and materials for future study and application of machine learning in smart agriculture in general and leaf disease classification in particular.
    Abstract The growing demand for sustainable development brings a series of information technologies to help agriculture production. Especially, the emergence of machine learning applications, a branch of artificial intelligence, has shown multiple breakthroughs which can enhance and revolutionize plant pathology approaches. In recent years, machine learning has been adopted for leaf disease classification in both academic research and industrial applications. Therefore, it is enormously beneficial for researchers, engineers, managers, and entrepreneurs to have a comprehensive view about the recent development of machine learning technologies and applications for leaf disease detection. This study will provide a survey in different aspects of the topic including data, techniques, and applications. The paper will start with publicly available datasets. After that, we summarize common machine learning techniques, including traditional (shallow) learning, deep learning, and augmented learning. Finally, we discuss related applications. This paper would provide useful resources for future study and application of machine learning for smart agriculture in general and leaf disease classification in particular.

Enhancing High-Resolution 3D Generation through Pixel-wise Gradient Clipping

  • paper_url: http://arxiv.org/abs/2310.12474
  • repo_url: https://github.com/fudan-zvg/pgc-3d
  • paper_authors: Zijie Pan, Jiachen Lu, Xiatian Zhu, Li Zhang
  • for: high-resolution 3D object generation, which remains challenging mainly because comprehensive annotated training data is scarce.
  • methods: image generative models pretrained on large curated web datasets are leveraged through knowledge transfer techniques such as Score Distillation Sampling (SDS), typically with latent representation-based models such as the Latent Diffusion Model (LDM); the unregulated gradients backpropagated through the frozen image model are identified as harming texture quality.
  • results: a Pixel-wise Gradient Clipping (PGC) operation efficiently controls the magnitude of pixel-wise stochastic gradients while preserving crucial texture-related gradient directions, and integrates seamlessly into existing 3D generative models to improve their high-resolution rendering quality.
    Abstract High-resolution 3D object generation remains a challenging task primarily due to the limited availability of comprehensive annotated training data. Recent advancements have aimed to overcome this constraint by harnessing image generative models, pretrained on extensive curated web datasets, using knowledge transfer techniques like Score Distillation Sampling (SDS). Efficiently addressing the requirements of high-resolution rendering often necessitates the adoption of latent representation-based models, such as the Latent Diffusion Model (LDM). In this framework, a significant challenge arises: To compute gradients for individual image pixels, it is necessary to backpropagate gradients from the designated latent space through the frozen components of the image model, such as the VAE encoder used within LDM. However, this gradient propagation pathway has never been optimized, remaining uncontrolled during training. We find that the unregulated gradients adversely affect the 3D model's capacity in acquiring texture-related information from the image generative model, leading to poor quality appearance synthesis. To address this overarching challenge, we propose an innovative operation termed Pixel-wise Gradient Clipping (PGC) designed for seamless integration into existing 3D generative models, thereby enhancing their synthesis quality. Specifically, we control the magnitude of stochastic gradients by clipping the pixel-wise gradients efficiently, while preserving crucial texture-related gradient directions. Despite this simplicity and minimal extra cost, extensive experiments demonstrate the efficacy of our PGC in enhancing the performance of existing 3D generative models for high-resolution object rendering.
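A minimal sketch of a pixel-wise gradient clipping operation in the spirit described: the per-pixel gradient magnitude (norm over the channel dimension) is clipped while the gradient direction is preserved, attached as a backward hook on the rendered image. The norm choice and the threshold value are assumptions, not the paper's exact operation.

```python
import torch

def pixel_wise_gradient_clip(grad: torch.Tensor, max_norm: float = 1e-2) -> torch.Tensor:
    """Clip the per-pixel gradient norm of a (B, C, H, W) image gradient,
    preserving the gradient direction at every pixel."""
    pixel_norm = grad.norm(dim=1, keepdim=True)                 # (B, 1, H, W)
    scale = (max_norm / (pixel_norm + 1e-12)).clamp(max=1.0)    # shrink only large pixels
    return grad * scale

# usage: register the clip as a backward hook on the rendered image so that gradients
# flowing back from the (frozen) image-model guidance are bounded per pixel
rendered = torch.rand(1, 3, 512, 512, requires_grad=True)
rendered.register_hook(pixel_wise_gradient_clip)
loss = rendered.sum()        # stand-in for an SDS-style guidance loss
loss.backward()
print(rendered.grad.abs().max())  # each pixel's gradient norm is at most max_norm
```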

RecolorCloud: A Point Cloud Tool for Recoloring, Segmentation, and Conversion

  • paper_url: http://arxiv.org/abs/2310.12470
  • repo_url: None
  • paper_authors: Esteban Segarra Martinez, Ryan P. McMahan
  • for: a tool that automatically corrects color artifacts in point clouds to improve the photo-realistic quality of large point clouds
  • methods: automated recoloring to resolve point cloud color errors caused by environmental interference; users only need to specify bounding box regions to affect colors
  • results: experiments show a substantial improvement in the photo-realistic quality of large point clouds, and users can quickly recolor a point cloud with preset semantic segmentation colors
    Abstract Point clouds are a 3D representation of an environment recorded with a high-precision laser scanner. These scanners can suffer from environmental interference such as surface shading, texturing, and reflections, so point clouds may be contaminated with fake or incorrect colors. Current open-source and proprietary tools offer limited or no support for correcting these visual errors automatically. RecolorCloud is a tool developed to resolve these color conflicts through automated recoloring. It can delete or recolor outlier points automatically, with users only needing to specify bounding box regions to affect colors. Results show a vast improvement in the photo-realistic quality of large point clouds. Additionally, users can quickly recolor a point cloud with preset semantic segmentation colors.
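A bounding-box-driven recolor step of the kind described could look roughly like the NumPy sketch below; the function name, array layout, and color values are illustrative and not part of RecolorCloud's actual interface.

```python
import numpy as np

def recolor_in_box(points: np.ndarray, colors: np.ndarray,
                   box_min, box_max, new_color) -> np.ndarray:
    """Recolor every point whose XYZ lies inside an axis-aligned bounding box.

    points: (N, 3) coordinates; colors: (N, 3) RGB values in [0, 255].
    """
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    out = colors.copy()
    out[inside] = new_color
    return out

# Example: paint a region with a preset semantic-segmentation color (hypothetical values).
points = np.random.rand(10_000, 3) * 50.0
colors = np.random.randint(0, 256, size=(10_000, 3))
recolored = recolor_in_box(points, colors,
                           box_min=[0.0, 0.0, 0.0], box_max=[10.0, 10.0, 5.0],
                           new_color=[34, 139, 34])
```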

WeedCLR: Weed Contrastive Learning through Visual Representations with Class-Optimized Loss in Long-Tailed Datasets

  • paper_url: http://arxiv.org/abs/2310.12465
  • repo_url: None
  • paper_authors: Alzayat Saleh, Alex Olsen, Jake Wood, Bronson Philippa, Mostafa Rahimi Azghadi
  • for: weed classification in long-tailed datasets, addressing the limitations of existing datasets and enabling broader deployment of deep learning models
  • methods: a self-supervised learning approach that uses a class-optimized loss with the Von Neumann entropy of deep representations to learn rich and robust visual features without labels
  • results: evaluated on two public weed datasets, WeedCLR achieves accuracy improvements of 4.3% and 5.6% over existing methods, and shows better generalization and robustness under different environmental conditions
    Abstract Image classification is a crucial task in modern weed management and crop intervention technologies. However, the limited size, diversity, and balance of existing weed datasets hinder the development of deep learning models for generalizable weed identification. In addition, the expensive labelling requirements of mainstream fully-supervised weed classifiers make them cost- and time-prohibitive to deploy widely, for new weed species, and in site-specific weed management. This paper proposes a novel method for Weed Contrastive Learning through visual Representations (WeedCLR) that uses a class-optimized loss with the Von Neumann entropy of deep representations for weed classification in long-tailed datasets. WeedCLR leverages self-supervised learning to learn rich and robust visual features without any labels and applies a class-optimized loss function to address the class imbalance problem in long-tailed datasets. WeedCLR is evaluated on two public weed datasets: CottonWeedID15, containing 15 weed species, and DeepWeeds, containing 8 weed species. WeedCLR achieves an average accuracy improvement of 4.3% on CottonWeedID15 and 5.6% on DeepWeeds over previous methods. It also demonstrates better generalization ability and robustness to different environmental conditions than existing methods without the need for expensive and time-consuming human annotations. These significant improvements make WeedCLR an effective tool for weed classification in long-tailed datasets and allow for more rapid and widespread deployment of site-specific weed management and crop intervention technologies.
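One plausible reading of the "Von Neumann entropy of deep representation" ingredient is an entropy computed over the eigenvalue spectrum of the embedding covariance, which, when maximized, discourages representations from collapsing onto a few directions. The sketch below implements that quantity as a regularizer under this reading; it is not the paper's exact class-optimized loss.

```python
import torch
import torch.nn.functional as F

def von_neumann_entropy(z: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy of the eigenvalue spectrum of the covariance of embeddings z: (N, D)."""
    z = F.normalize(z, dim=1)
    cov = z.T @ z / z.shape[0]                       # (D, D) density-matrix-like covariance
    eigvals = torch.linalg.eigvalsh(cov).clamp(min=eps)
    p = eigvals / eigvals.sum()                      # normalize to a probability spectrum
    return -(p * p.log()).sum()

# Used as a regularizer added to a contrastive loss (hypothetical weight of 0.1):
z = torch.randn(256, 128, requires_grad=True)        # a batch of embeddings
reg = -0.1 * von_neumann_entropy(z)                  # negative sign: maximize the entropy
reg.backward()
```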

Lidar Panoptic Segmentation and Tracking without Bells and Whistles

  • paper_url: http://arxiv.org/abs/2310.12464
  • repo_url: https://github.com/abhinavagarwalla/most-lps
  • paper_authors: Abhinav Agarwalla, Xuhua Huang, Jason Ziglar, Francesco Ferroni, Laura Leal-Taixé, James Hays, Aljoša Ošep, Deva Ramanan
  • for: a surprisingly simple yet effective detection-centric network for 3D/4D lidar panoptic segmentation and tracking
  • methods: a modular design optimized for all aspects of the panoptic segmentation and tracking tasks; a core component is the object instance detection branch, trained with point-level (modal) annotations, and in the absence of amodal (cuboid) annotations, trajectory-level supervision is used to regress object size
  • results: evaluated on several 3D/4D lidar panoptic segmentation and tracking benchmarks, the model establishes a new state of the art among open-sourced models, outperforming recent query-based models
    Abstract State-of-the-art lidar panoptic segmentation (LPS) methods follow a bottom-up, segmentation-centric fashion wherein they build upon semantic segmentation networks and utilize clustering to obtain object instances. In this paper, we re-think this approach and propose a surprisingly simple yet effective detection-centric network for both LPS and tracking. Our network is modular by design and optimized for all aspects of both the panoptic segmentation and tracking tasks. One of the core components of our network is the object instance detection branch, which we train using point-level (modal) annotations, as available in segmentation-centric datasets. In the absence of amodal (cuboid) annotations, we regress modal centroids and object extent using trajectory-level supervision that provides information about object size, which cannot be inferred from single scans due to occlusions and the sparse nature of the lidar data. We obtain fine-grained instance segments by learning to associate lidar points with detected centroids. We evaluate our method on several 3D/4D LPS benchmarks and observe that our model establishes a new state of the art among open-sourced models, outperforming recent query-based models.
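The final association step described above — linking lidar points to detected instance centroids to form instance segments — can be approximated by a simple nearest-centroid assignment, sketched below; the paper learns this association, so the distance threshold and the -1 "unassigned" label here are purely illustrative.

```python
import torch

def associate_points_to_centroids(points: torch.Tensor, centroids: torch.Tensor,
                                  max_dist: float = 2.0) -> torch.Tensor:
    """Assign each point (N, 3) to its nearest centroid (M, 3); -1 marks unassigned points."""
    dists = torch.cdist(points, centroids)            # (N, M) pairwise Euclidean distances
    min_dist, idx = dists.min(dim=1)
    return torch.where(min_dist < max_dist, idx, torch.full_like(idx, -1))

points = torch.rand(5000, 3) * 50.0                   # a toy lidar scan
centroids = torch.rand(12, 3) * 50.0                  # detected instance centroids
instance_ids = associate_points_to_centroids(points, centroids)
```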

Not Just Learning from Others but Relying on Yourself: A New Perspective on Few-Shot Segmentation in Remote Sensing

  • paper_url: http://arxiv.org/abs/2310.12452
  • repo_url: https://github.com/hanbobizl/dmnet
  • paper_authors: Hanbo Bi, Yingchao Feng, Zhiyuan Yan, Yongqiang Mao, Wenhui Diao, Hongqi Wang, Xian Sun
  • for: a new few-shot segmentation (FSS) method that segments unknown-class targets from only a few annotated samples
  • methods: a Dual-Mining network (DMNet) that no longer learns semantics solely from the support images but also mines semantics from the query image itself; a Class-public Region Mining (CPRM) module suppresses irrelevant feature pollution, and a new Known-class Meta Suppressor (KMS) module reduces the activation of known-class objects
  • results: extensive experiments on the iSAID and LoveDA remote sensing datasets show state-of-the-art performance under both 1-shot and 5-shot settings; with a ResNet-50 backbone, the model reaches 49.58% and 51.34% mIoU on iSAID under the 1-shot and 5-shot settings, outperforming the previous state of the art by 1.8% and 1.12%, respectively. Code: https://github.com/HanboBizl/DMNet
    Abstract Few-shot segmentation (FSS) is proposed to segment unknown class targets with just a few annotated samples. Most current FSS methods follow the paradigm of mining the semantics from the support images to guide the query image segmentation. However, such a pattern of 'learning from others' struggles to handle the extreme intra-class variation, preventing FSS from being directly generalized to remote sensing scenes. To bridge the gap of intra-class variance, we develop a Dual-Mining network named DMNet for cross-image mining and self-mining, meaning that it no longer focuses solely on support images but pays more attention to the query image itself. Specifically, we propose a Class-public Region Mining (CPRM) module to effectively suppress irrelevant feature pollution by capturing the common semantics between the support-query image pair. The Class-specific Region Mining (CSRM) module is then proposed to continuously mine the class-specific semantics of the query image itself in a 'filtering' and 'purifying' manner. In addition, to prevent the co-existence of multiple classes in remote sensing scenes from exacerbating the collapse of FSS generalization, we also propose a new Known-class Meta Suppressor (KMS) module to suppress the activation of known-class objects in the sample. Extensive experiments on the iSAID and LoveDA remote sensing datasets have demonstrated that our method sets the state-of-the-art with a minimum number of model parameters. Significantly, our model with the backbone of Resnet-50 achieves the mIoU of 49.58% and 51.34% on iSAID under 1-shot and 5-shot settings, outperforming the state-of-the-art method by 1.8% and 1.12%, respectively. The code is publicly available at https://github.com/HanboBizl/DMNet.
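A minimal sketch of the cross-image mining idea — pooling a class prototype from the masked support features and correlating it with the query features to obtain a common-semantics prior map — is given below, assuming masked average pooling and cosine similarity; the actual CPRM/CSRM modules are more involved than this.

```python
import torch
import torch.nn.functional as F

def class_public_prior(query_feat: torch.Tensor, support_feat: torch.Tensor,
                       support_mask: torch.Tensor) -> torch.Tensor:
    """query_feat/support_feat: (B, C, H, W); support_mask: (B, 1, H, W) binary foreground mask."""
    denom = support_mask.sum(dim=(2, 3)).clamp(min=1e-6)             # (B, 1)
    proto = (support_feat * support_mask).sum(dim=(2, 3)) / denom    # (B, C) masked average pooling
    prior = F.cosine_similarity(query_feat, proto[:, :, None, None], dim=1)
    return prior.clamp(min=0).unsqueeze(1)                           # (B, 1, H, W) prior map

query_feat = torch.randn(2, 256, 64, 64)
support_feat = torch.randn(2, 256, 64, 64)
support_mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
prior = class_public_prior(query_feat, support_feat, support_mask)
```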

Segment Anything Meets Universal Adversarial Perturbation

  • paper_url: http://arxiv.org/abs/2310.12431
  • repo_url: None
  • paper_authors: Dongshen Han, Sheng Zheng, Chaoning Zhang
  • for: investigate whether it is possible to attack SAM with image-agnostic Universal Adversarial Perturbation (UAP)
  • methods: propose a novel perturbation-centric framework based on self-supervised contrastive learning (CL) to generate UAP
  • results: validate the effectiveness of the proposed method with both quantitative and qualitative results, and perform an ablation study to understand the various components of the method.
    Abstract As the Segment Anything Model (SAM) becomes a popular foundation model in computer vision, its adversarial robustness has become a concern that cannot be ignored. This work investigates whether it is possible to attack SAM with an image-agnostic Universal Adversarial Perturbation (UAP). In other words, we seek a single perturbation that can fool SAM into predicting invalid masks for most (if not all) images. We demonstrate that the conventional image-centric attack framework is effective for image-dependent attacks but fails for universal adversarial attacks. To this end, we propose a novel perturbation-centric framework that results in a UAP generation method based on self-supervised contrastive learning (CL), where the UAP is set to the anchor sample and the positive sample is augmented from the UAP. The representations of negative samples are obtained from the image encoder in advance and saved in a memory bank. The effectiveness of our proposed CL-based UAP generation method is validated by both quantitative and qualitative results. In addition to an ablation study of the various components of our method, we shed light on the roles of positive and negative samples in making the generated UAP effective for attacking SAM.
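The perturbation-centric objective described above — the UAP as anchor, an augmented view of the UAP as positive, and pre-computed image features from a memory bank as negatives — could be written as an InfoNCE-style loss, sketched below; the stand-in encoder, augmentation, and temperature are placeholders rather than the paper's exact components.

```python
import torch
import torch.nn.functional as F

def uap_contrastive_loss(encoder, uap: torch.Tensor, augment, neg_bank: torch.Tensor,
                         tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: anchor = f(uap), positive = f(augment(uap)), negatives from a (K, D) bank."""
    anchor = F.normalize(encoder(uap), dim=-1)                   # (1, D)
    positive = F.normalize(encoder(augment(uap)), dim=-1)        # (1, D)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True)    # (1, 1)
    neg_logits = anchor @ F.normalize(neg_bank, dim=-1).T        # (1, K)
    logits = torch.cat([pos_logit, neg_logits], dim=-1) / tau
    target = torch.zeros(logits.shape[0], dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits, target)

# Toy usage with a stand-in encoder (the real setting would use a frozen image encoder).
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))
uap = torch.zeros(1, 3, 64, 64, requires_grad=True)
augment = lambda x: x + 0.01 * torch.randn_like(x)
neg_bank = torch.randn(512, 128)
loss = uap_contrastive_loss(encoder, uap, augment, neg_bank)
loss.backward()
```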

LoMAE: Low-level Vision Masked Autoencoders for Low-dose CT Denoising

  • paper_url: http://arxiv.org/abs/2310.12405
  • repo_url: None
  • paper_authors: Dayang Wang, Yongshun Xu, Shuo Han, Zhan Wu, Li Zhou, Bahareh Morovati, Hengyong Yu
  • for: improving the image quality of low-dose computed tomography (LDCT)
  • methods: a transformer model combined with a masked autoencoder (MAE) for label-free self-pretraining, tailored to low-level vision (LoMAE)
  • results: experiments show that the proposed LoMAE enhances the transformer's denoising performance, greatly reduces dependence on clean ground-truth data, and demonstrates strong robustness and generalizability
    Abstract Low-dose computed tomography (LDCT) offers reduced X-ray radiation exposure but at the cost of compromised image quality, characterized by increased noise and artifacts. Recently, transformer models emerged as a promising avenue to enhance LDCT image quality. However, the success of such models relies on a large amount of paired noisy and clean images, which are often scarce in clinical settings. In the fields of computer vision and natural language processing, masked autoencoders (MAE) have been recognized as an effective label-free self-pretraining method for transformers, due to their exceptional feature representation ability. However, the original pretraining and fine-tuning design fails to work in low-level vision tasks like denoising. In response to this challenge, we redesign the classical encoder-decoder learning model and facilitate a simple yet effective low-level vision MAE, referred to as LoMAE, tailored to address the LDCT denoising problem. Moreover, we introduce an MAE-GradCAM method to shed light on the latent learning mechanisms of the MAE/LoMAE. Additionally, we explore the LoMAE's robustness and generalizability across a variety of noise levels. Experiment results show that the proposed LoMAE can enhance the transformer's denoising performance and greatly relieve the dependence on the ground truth clean data. It also demonstrates remarkable robustness and generalizability over a spectrum of noise levels.
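The masking half of the MAE-style self-pretraining described above — hiding random patches of the input so the model must reconstruct them — could be sketched as follows for image-shaped CT slices; the patch size, masking ratio, and loss weighting are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def random_patch_mask(x: torch.Tensor, patch: int = 16, ratio: float = 0.5):
    """Zero out a random subset of non-overlapping patches of x: (B, C, H, W)."""
    b, _, h, w = x.shape
    ph, pw = h // patch, w // patch
    keep = (torch.rand(b, 1, ph, pw, device=x.device) > ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, mask                              # masked input and the binary keep-mask

x = torch.randn(4, 1, 256, 256)                        # e.g. a batch of single-channel CT slices
masked, mask = random_patch_mask(x)
recon = masked                                         # stand-in for the autoencoder's output
loss = ((recon - x) ** 2 * (1 - mask)).mean()          # reconstruction scored on hidden patches
```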

Deep Learning Techniques for Video Instance Segmentation: A Survey

  • paper_url: http://arxiv.org/abs/2310.12393
  • repo_url: None
  • paper_authors: Chenhao Xu, Chang-Tsun Li, Yongjian Hu, Chee Peng Lim, Douglas Creighton
  • for: an in-depth analysis and review of the video instance segmentation problem and of the deep learning methods proposed to address it
  • methods: a survey of deep learning techniques for video instance segmentation, covering various architectural designs and auxiliary techniques
  • results: comparisons and analyses of the functional performance, model complexity, and computational overhead of various deep learning models, together with a discussion of techniques for improving video instance segmentation performance
    Abstract Video instance segmentation, also known as multi-object tracking and segmentation, is an emerging computer vision research area introduced in 2019, aiming at detecting, segmenting, and tracking instances in videos simultaneously. By tackling the video instance segmentation tasks through effective analysis and utilization of visual information in videos, a range of computer vision-enabled applications (e.g., human action recognition, medical image processing, autonomous vehicle navigation, surveillance, etc) can be implemented. As deep-learning techniques take a dominant role in various computer vision areas, a plethora of deep-learning-based video instance segmentation schemes have been proposed. This survey offers a multifaceted view of deep-learning schemes for video instance segmentation, covering various architectural paradigms, along with comparisons of functional performance, model complexity, and computational overheads. In addition to the common architectural designs, auxiliary techniques for improving the performance of deep-learning models for video instance segmentation are compiled and discussed. Finally, we discuss a range of major challenges and directions for further investigations to help advance this promising research field.
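Many tracking-by-detection pipelines covered by such surveys associate per-frame instance masks across time by maximizing mask IoU; the sketch below shows that common baseline with Hungarian matching, as a generic illustration not tied to any particular surveyed model.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(prev_masks, curr_masks, iou_thresh: float = 0.5):
    """Match boolean instance masks across adjacent frames by mask IoU (Hungarian assignment)."""
    iou = np.zeros((len(prev_masks), len(curr_masks)))
    for i, p in enumerate(prev_masks):
        for j, c in enumerate(curr_masks):
            inter = np.logical_and(p, c).sum()
            union = np.logical_or(p, c).sum()
            iou[i, j] = inter / union if union > 0 else 0.0
    rows, cols = linear_sum_assignment(-iou)           # maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_thresh]

prev = [np.random.rand(120, 160) > 0.7 for _ in range(3)]   # toy masks from frame t-1
curr = [np.random.rand(120, 160) > 0.7 for _ in range(4)]   # toy masks from frame t
matches = match_instances(prev, curr)                       # list of (prev_idx, curr_idx) pairs
```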