cs.CV - 2023-10-16

Filling the Holes on 3D Heritage Object Surface based on Automatic Segmentation Algorithm

  • paper_url: http://arxiv.org/abs/2310.10875
  • repo_url: None
  • paper_authors: Sinh Van Nguyen, Son Thanh Le, Minh Khai Tran, Le Thanh Sach
  • for: This paper proposes an improved method for filling holes on 3D object surfaces, aiming to raise the accuracy of 3D object reconstruction and processing in computer graphics, image processing, and computer vision.
  • methods: The approach combines computational geometry and deep learning models with image-processing-based machine learning; the hole is first detected and segmented automatically from its local curvature, and each segment is then filled to match the local curvature shape.
  • results: Compared with existing methods, the proposed approach reconstructs 3D objects more accurately and applies to multiple 3D data types, including point clouds and triangular meshes.
    Abstract Reconstructing and processing 3D objects are popular activities in the research fields of computer graphics, image processing and computer vision. The 3D objects are processed with methods like geometric modeling, a branch of applied mathematics and computational geometry, or machine learning algorithms based on image processing. The computation of geometrical objects includes processing curves and surfaces, subdivision, simplification, meshing, hole filling, reconstructing, and refining 3D surface objects on both point cloud data and triangular meshes, while the machine learning methods are developed using deep learning models. With the support of 3D laser scan devices and Lidar techniques, the obtained dataset is close to the original shape of the real objects. Besides, photography and its applications based on modern techniques in recent years help us collect data and process 3D models more precisely. This article proposes an improved method for filling holes on the 3D object surface based on automatic segmentation. Instead of filling the hole directly as existing methods do, we subdivide the hole before filling it. The hole is first determined and segmented automatically based on the computation of its local curvature. Each part of the hole is then filled to match its local curvature shape. The method can work on both 3D point cloud surfaces and triangular mesh surfaces. Compared to the state-of-the-art methods, our proposed method obtains higher accuracy of the reconstructed 3D objects.
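    The curvature-driven segmentation step described above can be pictured with a small, self-contained sketch. This is a toy illustration, not the authors' algorithm: it estimates a surface-variation curvature proxy at each point of a point cloud via PCA over local neighborhoods and then splits an (assumed already extracted and ordered) hole-boundary loop wherever the curvature jumps.

```python
import numpy as np
from scipy.spatial import cKDTree

def curvature_proxy(points, k=12):
    """Surface-variation curvature proxy per point:
    lambda_min / (lambda_0 + lambda_1 + lambda_2) of the local covariance."""
    tree = cKDTree(points)
    curv = np.empty(len(points))
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=k)
        nbrs = points[idx] - points[idx].mean(axis=0)
        eigvals = np.linalg.eigvalsh(nbrs.T @ nbrs)  # ascending order
        curv[i] = eigvals[0] / max(eigvals.sum(), 1e-12)
    return curv

def segment_boundary(boundary_idx, curv, jump=0.05):
    """Split an ordered hole-boundary loop wherever the curvature proxy jumps."""
    segments, current = [], [boundary_idx[0]]
    for a, b in zip(boundary_idx[:-1], boundary_idx[1:]):
        if abs(curv[b] - curv[a]) > jump:
            segments.append(current)
            current = []
        current.append(b)
    segments.append(current)
    return segments

# toy usage: a noisy, nearly planar cloud and a placeholder ordered "hole boundary"
pts = np.random.rand(500, 3) * [1.0, 1.0, 0.02]
loop = list(range(20))                 # placeholder ordered boundary loop of point indices
parts = segment_boundary(loop, curvature_proxy(pts))
print(f"{len(parts)} boundary segments to fill separately")
```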

Approximation properties of slice-matching operators

  • paper_url: http://arxiv.org/abs/2310.10869
  • repo_url: None
  • paper_authors: Shiying Li, Caroline Moosmueller
  • for: This paper is written for the purpose of exploring approximation properties of iterative slice-matching procedures for transferring a source measure to a target measure, particularly in high dimensions.
  • methods: The paper uses slice-matching operators, which depend on the source and target measures and slicing directions, to examine the approximation properties of iterative slice-matching schemes.
  • results: The paper demonstrates invariance and equivariance properties of the slice-matching operator with respect to the source and target measures, respectively, and establishes error bounds for approximating the target measure using one step of the slice-matching scheme. Additionally, the paper investigates connections to affine registration problems and extensions to the invariance and equivariance properties of the slice-matching operator.
    Abstract Iterative slice-matching procedures are efficient schemes for transferring a source measure to a target measure, especially in high dimensions. These schemes have been successfully used in applications such as color transfer and shape retrieval, and are guaranteed to converge under regularity assumptions. In this paper, we explore approximation properties related to a single step of such iterative schemes by examining an associated slice-matching operator, depending on a source measure, a target measure, and slicing directions. In particular, we demonstrate an invariance property with respect to the source measure, an equivariance property with respect to the target measure, and Lipschitz continuity concerning the slicing directions. We furthermore establish error bounds corresponding to approximating the target measure by one step of the slice-matching scheme and characterize situations in which the slice-matching operator recovers the optimal transport map between two measures. We also investigate connections to affine registration problems with respect to (sliced) Wasserstein distances. These connections can also be viewed as extensions to the invariance and equivariance properties of the slice-matching operator and illustrate the extent to which slice-matching schemes incorporate affine effects.
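    A single slice-matching step on empirical measures has a compact form: project source and target samples onto each slicing direction, match sorted projections (the 1D optimal transport map), and average the resulting displacements. The sketch below is a hedged illustration of that step, not the paper's operator definition or its theoretical setting.

```python
import numpy as np

def slice_matching_step(source, target, directions):
    """One iteration of (sliced) matching on empirical measures.

    source, target: (n, d) sample arrays; directions: (k, d) unit vectors.
    Averages, over the slicing directions, the 1D displacement given by
    matching sorted projections (the 1D optimal transport map)."""
    displacement = np.zeros_like(source)
    for theta in directions:
        s_proj = source @ theta
        t_proj = target @ theta
        order_s = np.argsort(s_proj)
        # i-th smallest source projection moves to the i-th smallest target projection
        shift = np.sort(t_proj) - s_proj[order_s]
        displacement[order_s] += np.outer(shift, theta)
    return source + displacement / len(directions)

rng = np.random.default_rng(0)
src = rng.normal(size=(256, 2))
tgt = rng.normal(loc=[3.0, -1.0], size=(256, 2))
dirs = rng.normal(size=(8, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(slice_matching_step(src, tgt, dirs).mean(axis=0))  # shifted toward the target mean
```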

The Invisible Map: Visual-Inertial SLAM with Fiducial Markers for Smartphone-based Indoor Navigation

  • paper_url: http://arxiv.org/abs/2310.10862
  • repo_url: None
  • paper_authors: Paul Ruvolo, Ayush Chakraborty, Rucha Dave, Richard Li, Duncan Mazza, Xierui Shen, Raiyan Siddique, Krishna Suresh
  • for: Creating building-scale, easily navigable 3D maps using mainstream smartphones.
  • methods: The 3D-mapping problem is formulated as an instance of Graph SLAM, jointly estimating the positions of building landmarks (fiducial markers) and navigable paths through the environment (phone poses).
  • results: The system creates accurate 3D maps; in addition, a novel technique is provided for tuning the mapping hyperparameters so the algorithm adapts to new environments.
    Abstract We present a system for creating building-scale, easily navigable 3D maps using mainstream smartphones. In our approach, we formulate the 3D-mapping problem as an instance of Graph SLAM and infer the position of both building landmarks (fiducial markers) and navigable paths through the environment (phone poses). Our results demonstrate the system's ability to create accurate 3D maps. Further, we highlight the importance of careful selection of mapping hyperparameters and provide a novel technique for tuning these hyperparameters to adapt our algorithm to new environments.

SoybeanNet: Transformer-Based Convolutional Neural Network for Soybean Pod Counting from Unmanned Aerial Vehicle (UAV) Images

  • paper_url: http://arxiv.org/abs/2310.10861
  • repo_url: https://github.com/jiajiali04/soybean-pod-counting-from-uav-images
  • paper_authors: Jiajia Li, Raju Thada Magar, Dong Chen, Feng Lin, Dechun Wang, Xiang Yin, Weichao Zhuang, Zhaojian Li
  • for: This paper aims to support soybean production by counting soybean pods from unmanned aerial vehicle (UAV) images.
  • methods: A novel point-based counting network, SoybeanNet, uses a powerful transformer backbone to perform soybean pod counting and localization simultaneously.
  • results: Tested on real UAV images and compared against five state-of-the-art methods, the model achieves a counting accuracy of 84.51%.
    Abstract Soybeans are a critical source of food, protein and oil, and thus have received extensive research aimed at enhancing their yield, refining cultivation practices, and advancing soybean breeding techniques. Within this context, soybean pod counting plays an essential role in understanding and optimizing production. Despite recent advancements, the development of a robust pod-counting algorithm capable of performing effectively in real-field conditions remains a significant challenge. This paper presents a pioneering work of accurate soybean pod counting utilizing unmanned aerial vehicle (UAV) images captured from actual soybean fields in Michigan, USA. Specifically, this paper presents SoybeanNet, a novel point-based counting network that harnesses powerful transformer backbones for simultaneous soybean pod counting and localization with high accuracy. In addition, a new dataset of UAV-acquired images for soybean pod counting was created and open-sourced, consisting of 113 drone images with more than 260k manually annotated soybean pods captured under natural lighting conditions. Through comprehensive evaluations, SoybeanNet demonstrated superior performance over five state-of-the-art approaches when tested on the collected images. Remarkably, SoybeanNet achieved a counting accuracy of $84.51\%$ when tested on the testing dataset, attesting to its efficacy in real-world scenarios. The publication also provides both the source code (\url{https://github.com/JiajiaLi04/Soybean-Pod-Counting-from-UAV-Images}) and the labeled soybean dataset (\url{https://www.kaggle.com/datasets/jiajiali/uav-based-soybean-pod-images}), offering a valuable resource for future research endeavors in soybean pod counting and related fields.

Provable Probabilistic Imaging using Score-Based Generative Priors

  • paper_url: http://arxiv.org/abs/2310.10835
  • repo_url: None
  • paper_authors: Yu Sun, Zihui Wu, Yifan Chen, Berthy T. Feng, Katherine L. Bouman
  • for: This paper proposes a principled framework for estimating high-quality images while simultaneously quantifying their uncertainty in ill-posed inverse problems.
  • methods: The proposed plug-and-play Monte Carlo (PMC) framework combines expressive score-based generative priors for high-quality reconstruction with posterior sampling for uncertainty quantification; two PMC algorithms are introduced that can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms.
  • results: Experiments on multiple representative inverse problems show that the PMC algorithms significantly improve reconstruction quality and enable high-fidelity uncertainty quantification.
    Abstract Estimating high-quality images while also quantifying their uncertainty are two desired features in an image reconstruction algorithm for solving ill-posed inverse problems. In this paper, we propose plug-and-play Monte Carlo (PMC) as a principled framework for characterizing the space of possible solutions to a general inverse problem. PMC is able to incorporate expressive score-based generative priors for high-quality image reconstruction while also performing uncertainty quantification via posterior sampling. In particular, we introduce two PMC algorithms which can be viewed as the sampling analogues of the traditional plug-and-play priors (PnP) and regularization by denoising (RED) algorithms. We also establish a theoretical analysis for characterizing the convergence of the PMC algorithms. Our analysis provides non-asymptotic stationarity guarantees for both algorithms, even in the presence of non-log-concave likelihoods and imperfect score networks. We demonstrate the performance of the PMC algorithms on multiple representative inverse problems with both linear and nonlinear forward models. Experimental results show that PMC significantly improves reconstruction quality and enables high-fidelity uncertainty quantification.
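    The flavor of a sampling analogue of PnP/RED can be conveyed by a Langevin-type update that combines the data-fidelity gradient with a score term. In the toy sketch below the learned score network is replaced by a standard-normal prior score so the code stays self-contained; it is an illustrative stand-in, not the paper's PMC algorithms or their convergence analysis.

```python
import numpy as np

def langevin_posterior_samples(y, A, sigma, score, n_steps=1000, step=1e-3, n_chains=8):
    """Approximate posterior samples for y = A x + noise via unadjusted Langevin dynamics.

    score(x) stands in for grad log p(x); a learned score network would go here."""
    rng = np.random.default_rng(1)
    x = rng.normal(size=(n_chains, A.shape[1]))
    for _ in range(n_steps):
        grad_lik = (y - x @ A.T) @ A / sigma**2          # gradient of the Gaussian log-likelihood
        x = x + step * (grad_lik + score(x)) + np.sqrt(2 * step) * rng.normal(size=x.shape)
    return x

# toy linear inverse problem: 8 measurements of a 16-dimensional signal
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16)) / 4.0
x_true = rng.normal(size=16)
y = A @ x_true + 0.1 * rng.normal(size=8)
samples = langevin_posterior_samples(y, A, sigma=0.1, score=lambda x: -x)  # standard-normal prior score
print("posterior mean error:", float(np.linalg.norm(samples.mean(0) - x_true)))
print("average per-coordinate std (uncertainty):", float(samples.std(0).mean()))
```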

Vision and Language Navigation in the Real World via Online Visual Language Mapping

  • paper_url: http://arxiv.org/abs/2310.10822
  • repo_url: None
  • paper_authors: Chengguang Xu, Hieu T. Nguyen, Christopher Amato, Lawson L. S. Wong
  • for: Improving the navigation efficiency of mobile robots in unseen environments by enabling them to follow natural-language instructions.
  • methods: A new navigation framework with four key components: an LLM-based instruction parser, an online visual-language mapper, a language-indexing-based localizer, and a DD-PPO-based local controller.
  • results: Evaluated in an unseen lab environment without any fine-tuning, the framework significantly outperforms the state-of-the-art VLN baseline.
    Abstract Navigating in unseen environments is crucial for mobile robots. Enhancing them with the ability to follow instructions in natural language will further improve navigation efficiency in unseen cases. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation, neglecting the complex and noisy real world. Directly transferring SOTA navigation policies trained in simulation to the real world is challenging due to the visual domain gap and the absence of prior knowledge about unseen environments. In this work, we propose a novel navigation framework to address the VLN task in the real world. Utilizing the powerful foundation models, the proposed framework includes four key components: (1) an LLMs-based instruction parser that converts the language instruction into a sequence of pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a real-time visual-language map to maintain a spatial and semantic understanding of the unseen environment, (3) a language indexing-based localizer that grounds each macro-action description into a waypoint location on the map, and (4) a DD-PPO-based local controller that predicts the action. We evaluate the proposed pipeline on an Interbotix LoCoBot WX250 in an unseen lab environment. Without any fine-tuning, our pipeline significantly outperforms the SOTA VLN baseline in the real world.

Convolutional Neural Network Model for Diabetic Retinopathy Feature Extraction and Classification

  • paper_url: http://arxiv.org/abs/2310.10806
  • repo_url: https://github.com/s21sharan/cnn_dr_detection
  • paper_authors: Sharan Subramanian, Leilani H. Gilpin
  • for: Applying artificial intelligence in medicine, in particular for detecting silently progressing diseases such as Diabetic Retinopathy (DR).
  • methods: A convolutional neural network (CNN) model that, given a fundus image, identifies four known DR features (micro-aneurysms, cotton wools, exudates, and hemorrhages) through convolutional layers and grades DR severity without additional user input.
  • results: The model achieves a sensitivity of 97% and an accuracy of 71%, and is designed to be interpretable and robust to overfitting.
    Abstract The application of Artificial Intelligence in the medical market brings up increasing concerns but aids in more timely diagnosis of silent progressing diseases like Diabetic Retinopathy. In order to diagnose Diabetic Retinopathy (DR), ophthalmologists use color fundus images, or pictures of the back of the retina, to identify small distinct features through a difficult and time-consuming process. Our work creates a novel CNN model and identifies the severity of DR through fundus image input. We classified 4 known DR features, including micro-aneurysms, cotton wools, exudates, and hemorrhages, through convolutional layers and were able to provide an accurate diagnostic without additional user input. The proposed model is more interpretable and robust to overfitting. We present initial results with a sensitivity of 97% and an accuracy of 71%. Our contribution is an interpretable model with similar accuracy to more complex models. With that, our model advances the field of DR detection and proves to be a key step towards AI-focused medical diagnosis.
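    As a rough illustration of the kind of compact CNN classifier the abstract describes, here is a minimal PyTorch sketch; the architecture, input size, and number of severity classes are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn

class SmallDRNet(nn.Module):
    """Toy fundus-image classifier: a few conv blocks followed by severity logits."""
    def __init__(self, n_classes=5):  # e.g. DR severity grades 0-4 (assumed)
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):
        return self.head(self.features(x))

model = SmallDRNet()
logits = model(torch.randn(2, 3, 224, 224))   # a batch of 2 fundus images
print(logits.shape)                            # torch.Size([2, 5])
```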

LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

  • paper_url: http://arxiv.org/abs/2310.10769
  • repo_url: https://github.com/RQ-Wu/LAMP
  • paper_authors: Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu Zhang
  • for: LAMP is designed to learn motion patterns in text-to-video generation, with a focus on few-shot learning and efficient use of resources.
  • methods: The LAMP framework uses a first-frame-conditioned pipeline with an off-the-shelf text-to-image model for content generation, and expands the pretrained 2D convolution layers to temporal-spatial motion learning layers. Shared-noise sampling is used to improve stability and flexibility.
  • results: Extensive experiments show that LAMP can effectively learn motion patterns on limited data and generate high-quality videos, and it can also be applied to real-world image animation and video editing. The code and models are available online.
    Abstract With the impressive progress in diffusion-based text-to-image generation, extending such powerful generative ability to text-to-video raises enormous attention. Existing methods either require large-scale text-video pairs and a large number of training resources or learn motions that are precisely aligned with template videos. It is non-trivial to balance a trade-off between the degree of generation freedom and the resource costs for video generation. In our study, we present a few-shot-based tuning framework, LAMP, which enables text-to-image diffusion model Learn A specific Motion Pattern with 8~16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation so that our tuned video diffusion model mainly focuses on motion learning. The well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which highly improves video quality and generation freedom. To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which can improve the stability of videos with computational costs. Our method can also be flexibly applied to other tasks, e.g. real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern on limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP.
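    One generic way to extend pretrained 2D convolutions toward temporal-spatial learning, in the spirit of what the abstract describes, is to keep the spatial convolution and add a lightweight depthwise convolution over the frame axis, initialized to the identity so training starts from the pretrained behavior. The layer design below is a stand-in sketch, not LAMP's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalSpatialConv(nn.Module):
    """Wrap a pretrained 2D conv with an extra depthwise 1D conv along the frame axis.

    Input: (batch, frames, channels, H, W). The temporal conv starts as the identity
    so the wrapped model initially reproduces the pretrained 2D behavior."""
    def __init__(self, conv2d: nn.Conv2d):
        super().__init__()
        self.spatial = conv2d
        c = conv2d.out_channels
        self.temporal = nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c)
        with torch.no_grad():                    # identity initialization: kernel [0, 1, 0]
            self.temporal.weight.zero_()
            self.temporal.weight[:, :, 1] = 1.0
            self.temporal.bias.zero_()

    def forward(self, x):
        b, t, c, h, w = x.shape
        y = self.spatial(x.reshape(b * t, c, h, w))
        _, c2, h2, w2 = y.shape
        y = y.reshape(b, t, c2, h2, w2).permute(0, 3, 4, 2, 1)   # (b, h, w, c, t)
        y = self.temporal(y.reshape(b * h2 * w2, c2, t))
        return y.reshape(b, h2, w2, c2, t).permute(0, 4, 3, 1, 2)

layer = TemporalSpatialConv(nn.Conv2d(3, 8, 3, padding=1))
print(layer(torch.randn(1, 16, 3, 32, 32)).shape)   # torch.Size([1, 16, 8, 32, 32])
```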

Deep Conditional Shape Models for 3D cardiac image segmentation

  • paper_url: http://arxiv.org/abs/2310.10756
  • repo_url: None
  • paper_authors: Athira J Jacob, Puneet Sharma, Daniel Ruckert
  • for: Accurate delineation of anatomical structures, often the first step of medical image analysis workflows.
  • methods: A new segmentation algorithm built around Deep Conditional Shape Models (DCSMs): a deep implicit shape representation learns a modality-agnostic shape model for any anatomy of interest, the shape is fitted to the image by conditioning on anatomic landmarks (automatically detected or provided by the user), and a modality-dependent, lightweight refinement network captures fine details not represented by the implicit function.
  • results: For cardiac left ventricle (LV) segmentation from multiple 3D modalities (contrast-enhanced CT, non-contrasted CT, 3D echocardiography), the automatic DCSM outperforms the baseline on non-contrasted CT without the local refinement and, with refinement, on contrasted CT and 3DE, with particularly large improvements in Hausdorff distance. The semi-automatic DCSM with user-input landmarks, though trained only on contrasted CT, achieves more than 92% Dice on all modalities and matches or exceeds inter-user variability.
    Abstract Delineation of anatomical structures is often the first step of many medical image analysis workflows. While convolutional neural networks achieve high performance, these do not incorporate anatomical shape information. We introduce a novel segmentation algorithm that uses Deep Conditional Shape models (DCSMs) as a core component. Using deep implicit shape representations, the algorithm learns a modality-agnostic shape model that can generate the signed distance functions for any anatomy of interest. To fit the generated shape to the image, the shape model is conditioned on anatomic landmarks that can be automatically detected or provided by the user. Finally, we add a modality-dependent, lightweight refinement network to capture any fine details not represented by the implicit function. The proposed DCSM framework is evaluated on the problem of cardiac left ventricle (LV) segmentation from multiple 3D modalities (contrast-enhanced CT, non-contrasted CT, 3D echocardiography-3DE). We demonstrate that the automatic DCSM outperforms the baseline for non-contrasted CT without the local refinement, and with the refinement for contrasted CT and 3DE, especially with significant improvement in the Hausdorff distance. The semi-automatic DCSM with user-input landmarks, while only trained on contrasted CT, achieves greater than 92% Dice for all modalities. Both automatic DCSM with refinement and semi-automatic DCSM achieve equivalent or better performance compared to inter-user variability for these modalities.

IDRNet: Intervention-Driven Relation Network for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.10755
  • repo_url: https://github.com/segmentationblwx/sssegmentation
  • paper_authors: Zhenchao Jin, Xiaowei Hu, Lingting Zhu, Luchuan Song, Li Yuan, Lequan Yu
  • for: Improving contextual information aggregation in dense prediction tasks such as semantic segmentation.
  • methods: Contextual relations among pixels are modeled through a deletion diagnostics procedure, and a feature enhancement module further improves the distinguishability of the grouped semantic-level representations.
  • results: The intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on standard benchmarks.
    Abstract Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks, which inspires the development of numerous context modeling paradigms, \emph{e.g.}, multi-scale-driven and similarity-driven context schemes. Despite the impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to reliance on large amounts of predetermined priors. To alleviate the issues, we propose a novel \textbf{I}ntervention-\textbf{D}riven \textbf{R}elation \textbf{Net}work (\textbf{IDRNet}), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations are utilized to augment original pixel-level representations for final predictions. Extensive experiments are conducted to validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes. Code is available at \url{https://github.com/SegmentationBLWX/sssegmentation}.

HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending

  • paper_url: http://arxiv.org/abs/2310.10651
  • repo_url: https://github.com/wty-ustc/hairclipv2
  • paper_authors: Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Weiming Zhang, Gang Hua, Nenghai Yu
  • for: A unified framework for hair editing driven by text descriptions or reference images while also supporting interaction modes such as sketches and masks.
  • methods: Using cross-modal models (e.g., CLIP), all hair editing tasks are converted into hair transfer tasks, with the editing conditions converted into different proxies; editing effects are added to the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces.
  • results: Compared with the original HairCLIP, HairCLIPv2 better preserves irrelevant attributes (e.g., identity, background) while supporting unseen text descriptions and diverse interaction modes; quantitative and qualitative experiments show clear advantages in editing effects, irrelevant attribute preservation, and visual naturalness.
    Abstract Hair editing has made tremendous progress in recent years. Early hair editing methods use well-drawn sketches or masks to specify the editing conditions. Even though they can enable very fine-grained local control, such interaction modes are inefficient for the editing conditions that can be easily specified by language descriptions or reference images. Thanks to the recent breakthrough of cross-modal models (e.g., CLIP), HairCLIP is the first work that enables hair editing based on text descriptions or reference images. However, such text-driven and reference-driven interaction modes make HairCLIP unable to support fine-grained controls specified by sketch or mask. In this paper, we propose HairCLIPv2, aiming to support all the aforementioned interactions with one unified framework. Simultaneously, it improves upon HairCLIP with better irrelevant attributes (e.g., identity, background) preservation and unseen text descriptions support. The key idea is to convert all the hair editing tasks into hair transfer tasks, with editing conditions converted into different proxies accordingly. The editing effects are added upon the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces. Besides the unprecedented user interaction mode support, quantitative and qualitative experiments demonstrate the superiority of HairCLIPv2 in terms of editing effects, irrelevant attribute preservation and visual naturalness. Our code is available at \url{https://github.com/wty-ustc/HairCLIPv2}.

TraM-NeRF: Tracing Mirror and Near-Perfect Specular Reflections through Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.10650
  • repo_url: https://github.com/Rubikalubi/TraM-NeRF
  • paper_authors: Leif Van Holland, Ruben Bliersbach, Jan U. Müller, Patrick Stotko, Reinhard Klein
  • for: Photorealistic rendering of complex scenes that contain mirrors and other near-perfectly specular reflectors.
  • methods: Reflection behavior is modeled explicitly with physically plausible materials inside the NeRF volume rendering formulation, and the reflected radiance is estimated with Monte Carlo methods, yielding efficient importance sampling and transmittance computation from only a few samples.
  • results: The method learns consistent representations of these challenging scenes and achieves superior results compared with previous state-of-the-art approaches.
    Abstract Implicit representations like Neural Radiance Fields (NeRF) showed impressive results for photorealistic rendering of complex scenes with fine details. However, ideal or near-perfectly specular reflecting objects such as mirrors, which are often encountered in various indoor scenes, impose ambiguities and inconsistencies in the representation of the reconstructed scene leading to severe artifacts in the synthesized renderings. In this paper, we present a novel reflection tracing method tailored for the involved volume rendering within NeRF that takes these mirror-like objects into account while avoiding the cost of straightforward but expensive extensions through standard path tracing. By explicitly modeling the reflection behavior using physically plausible materials and estimating the reflected radiance with Monte-Carlo methods within the volume rendering formulation, we derive efficient strategies for importance sampling and the transmittance computation along rays from only few samples. We show that our novel method enables the training of consistent representations of such challenging scenes and achieves superior results in comparison to previous state-of-the-art approaches.

TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

  • paper_url: http://arxiv.org/abs/2310.10644
  • repo_url: None
  • paper_authors: Yukai Shi, Jianan Wang, He Cao, Boshi Tang, Xianbiao Qi, Tianyu Yang, Yukun Huang, Shilong Liu, Lei Zhang, Heung-Yeung Shum
  • for: This paper addresses the under-constrained nature of single-view novel view synthesis (NVS) by using text as high-level semantic guidance.
  • methods: Text is used to constrain the NVS solution space; the method fine-tunes a text-to-image Stable Diffusion model pretrained on large-scale text-image pairs, introduces modules tailored to image and camera pose conditioning, and adds dedicated training for pose correctness and preservation of fine details.
  • results: Compared with Zero-1-to-3, the proposed TOSS produces more plausible, controllable, and multiview-consistent NVS results, and comprehensive ablations confirm the effectiveness of the introduced semantic guidance and architecture design.
    Abstract In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the challengingly under-constrained nature of single-view NVS: the process lacks means of explicit user control and often results in implausible NVS generations. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes text-to-image Stable Diffusion pre-trained on large-scale text-image pairs and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments are conducted with results showing that our proposed TOSS outperforms Zero-1-to-3 with more plausible, controllable and multiview-consistent NVS results. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design.

Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2310.10642
  • repo_url: https://github.com/fudan-zvg/4d-gaussian-splatting
  • paper_authors: Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, Li Zhang
  • for: Reconstructing complex dynamic 3D scenes and rendering views of them at arbitrary times.
  • methods: The underlying spatio-temporal 4D volume of a dynamic scene is approximated by optimizing a collection of 4D primitives with explicit geometry and appearance modeling, including view-dependent and time-evolved appearance.
  • results: The approach is simple, supports variable-length video and end-to-end training, captures complex dynamic scene motions, and achieves superior visual quality and efficiency with real-time rendering in the experiments.
    Abstract Reconstructing dynamic 3D scenes from 2D images and generating diverse views over time is challenging due to scene complexity and temporal dynamics. Despite advancements in neural implicit models, limitations persist: (i) Inadequate Scene Structure: Existing methods struggle to reveal the spatial and temporal structure of dynamic scenes from directly learning the complex 6D plenoptic function. (ii) Scaling Deformation Modeling: Explicitly modeling scene element deformation becomes impractical for complex dynamics. To address these issues, we consider the spacetime as an entirety and propose to approximate the underlying spatio-temporal 4D volume of a dynamic scene by optimizing a collection of 4D primitives, with explicit geometry and appearance modeling. Learning to optimize the 4D primitives enables us to synthesize novel views at any desired time with our tailored rendering routine. Our model is conceptually simple, consisting of a 4D Gaussian parameterized by anisotropic ellipses that can rotate arbitrarily in space and time, as well as view-dependent and time-evolved appearance represented by the coefficient of 4D spherindrical harmonics. This approach offers simplicity, flexibility for variable-length video and end-to-end training, and efficient real-time rendering, making it suitable for capturing complex dynamic scene motions. Experiments across various benchmarks, including monocular and multi-view scenarios, demonstrate our 4DGS model's superior visual quality and efficiency.
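    A 4D Gaussian primitive of the kind described, an anisotropic Gaussian over (x, y, z, t) with opacity and appearance coefficients, can be written down compactly. The sketch below only shows the parameterization and a density query; the splatting/rasterization pipeline and the 4D spherindrical-harmonic appearance model are well beyond a few lines, and the field names are illustrative.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian4D:
    mean: np.ndarray       # (4,) center in (x, y, z, t)
    cov: np.ndarray        # (4, 4) anisotropic covariance (can rotate in space AND time)
    opacity: float
    sh_coeffs: np.ndarray  # placeholder for view/time-dependent appearance coefficients

    def density(self, xyzt: np.ndarray) -> np.ndarray:
        """Unnormalized Gaussian density at query points of shape (n, 4)."""
        d = xyzt - self.mean
        return self.opacity * np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, np.linalg.inv(self.cov), d))

g = Gaussian4D(
    mean=np.array([0.0, 0.0, 0.0, 0.5]),
    cov=np.diag([0.1, 0.1, 0.1, 0.05]),
    opacity=0.8,
    sh_coeffs=np.zeros(16),
)
# the same primitive queried at two timestamps: its contribution evolves over time
print(g.density(np.array([[0.0, 0.0, 0.0, 0.5], [0.0, 0.0, 0.0, 0.9]])))
```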

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

  • paper_url: http://arxiv.org/abs/2310.10640
  • repo_url: https://github.com/hananshafi/llmblueprint
  • paper_authors: Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka
  • for: Generating images from lengthy, intricate text prompts that describe complex scenes with multiple objects.
  • methods: A large language model extracts the key components of the prompt, including bounding box coordinates for foreground objects, detailed descriptions of individual objects, and a succinct background context; these drive a two-phase layout-to-image model with an iterative refinement scheme that recomposes objects to match their textual descriptions.
  • results: On complex prompts featuring multiple objects, the approach substantially improves recall over baseline diffusion models, which is further validated by a user study.
    Abstract Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in generating images from short, single-object descriptions, these models often struggle to faithfully capture all the nuanced details within longer and more elaborate textual inputs. In response, we present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs.

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

  • paper_url: http://arxiv.org/abs/2310.10624
  • repo_url: None
  • paper_authors: Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, Yuchao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou
  • for: This paper proposes a human-centric video editing method based on dynamic Neural Radiance Fields (NeRF) to overcome the limitations current methods face with long videos and large motion and view changes.
  • methods: A dynamic NeRF is used as the human-centric video representation, turning video editing into a 3D-space editing task; the image-based 3D-space editing pipeline combines multi-view multi-pose Score Distillation Sampling (SDS) from 2D personalized and 3D diffusion priors, reconstruction losses on the reference image, text-guided local parts super-resolution, and style transfer for the 3D background space.
  • results: On two challenging datasets, the method outperforms competing approaches by a large margin of 50%~95% in terms of human preference; video comparisons are provided on the project page https://showlab.github.io/DynVideo-E/.
    Abstract Despite remarkable research advances in diffusion-based video editing, existing methods are limited to short-length videos due to the contradiction between long-range consistency and frame-wise editing. Recent approaches attempt to tackle this challenge by introducing video-2D representations to degrade video editing to image editing. However, they encounter significant difficulties in handling large-scale motion- and view-change videos especially for human-centric videos. This motivates us to introduce the dynamic Neural Radiance Fields (NeRF) as the human-centric video representation to ease the video editing problem to a 3D space editing task. As such, editing can be performed in the 3D spaces and propagated to the entire video via the deformation field. To provide finer and direct controllable editing, we propose the image-based 3D space editing pipeline with a set of effective designs. These include multi-view multi-pose Score Distillation Sampling (SDS) from both 2D personalized diffusion priors and 3D diffusion priors, reconstruction losses on the reference image, text-guided local parts super-resolution, and style transfer for 3D background space. Extensive experiments demonstrate that our method, dubbed as DynVideo-E, significantly outperforms SOTA approaches on two challenging datasets by a large margin of 50% ~ 95% in terms of human preference. Compelling video comparisons are provided in the project page https://showlab.github.io/DynVideo-E/. Our code and data will be released to the community.

Interpreting and Controlling Vision Foundation Models via Text Explanations

  • paper_url: http://arxiv.org/abs/2310.10591
  • repo_url: https://github.com/tonychenxyz/vit-interpret
  • paper_authors: Haozhe Chen, Junfeng Yang, Carl Vondrick, Chengzhi Mao
  • for: Understanding and controlling the predictions of large-scale pre-trained vision foundation models such as CLIP.
  • methods: A framework that interprets a vision transformer's latent tokens with natural language: a token's semantic information is retained to the final layer using the transformer's local operations, and the closest text is retrieved as an explanation, without additional model training or data collection.
  • results: The interpretations expose the model's visual reasoning procedure and enable model editing that controls reasoning behavior and improves robustness against biases and spurious correlations.
    Abstract Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the underlying rules behind these models' predictions and controlling model behaviors have remained open challenges. We present a framework for interpreting vision transformer's latent tokens with natural language. Given a latent token, our framework retains its semantic information to the final layer using transformer's local operations and retrieves the closest text for explanation. Our approach enables understanding of model visual reasoning procedure without needing additional model training or data collection. Based on the obtained interpretations, our framework allows for model editing that controls model reasoning behaviors and improves model robustness against biases and spurious correlations.
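    The retrieval step, finding the text closest to a latent token once it has been carried into the joint image-text embedding space, reduces to a cosine-similarity lookup. The snippet below shows only that final lookup with made-up embeddings; how the token's semantics are propagated through the transformer's local operations is the paper's contribution and is not reproduced here.

```python
import numpy as np

def nearest_text_explanations(token_embedding, text_embeddings, vocabulary, top_k=3):
    """Return the candidate phrases whose embeddings are closest (cosine) to a latent token.

    token_embedding: (d,) latent token already projected into the joint space (assumption);
    text_embeddings: (n, d) embeddings of candidate explanation phrases."""
    t = token_embedding / np.linalg.norm(token_embedding)
    T = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = T @ t
    order = np.argsort(-sims)[:top_k]
    return [(vocabulary[i], float(sims[i])) for i in order]

rng = np.random.default_rng(0)
vocab = ["a dog's ear", "grass texture", "sky", "car wheel"]
text_emb = rng.normal(size=(len(vocab), 512))
token = text_emb[1] + 0.1 * rng.normal(size=512)      # a token that "means" grass texture
print(nearest_text_explanations(token, text_emb, vocab))
```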

Matching the Neuronal Representations of V1 is Necessary to Improve Robustness in CNNs with V1-like Front-ends

  • paper_url: http://arxiv.org/abs/2310.10575
  • repo_url: https://github.com/dicarlolab/vonenet
  • paper_authors: Ruxandra Barbulescu, Tiago Marques, Arlindo L. Oliveira
  • for: Improving the robustness of CNNs to common image corruptions.
  • methods: A CNN front-end simulating computations in the primate primary visual cortex (V1), built in two variants: one sampling receptive-field properties uniformly and the other sampling from empirical biological distributions.
  • results: The model sampled from empirical biological distributions is considerably more robust to image corruptions (a relative difference of 8.72%); although similar neuronal sub-populations in the two variants have similar response properties and learn similar downstream weights, their impact on downstream processing is strikingly different.
    Abstract While some convolutional neural networks (CNNs) have achieved great success in object recognition, they struggle to identify objects in images corrupted with different types of common noise patterns. Recently, it was shown that simulating computations in early visual areas at the front of CNNs leads to improvements in robustness to image corruptions. Here, we further explore this result and show that the neuronal representations that emerge from precisely matching the distribution of RF properties found in primate V1 is key for this improvement in robustness. We built two variants of a model with a front-end modeling the primate primary visual cortex (V1): one sampling RF properties uniformly and the other sampling from empirical biological distributions. The model with the biological sampling has a considerably higher robustness to image corruptions that the uniform variant (relative difference of 8.72%). While similar neuronal sub-populations across the two variants have similar response properties and learn similar downstream weights, the impact on downstream processing is strikingly different. This result sheds light on the origin of the improvements in robustness observed in some biologically-inspired models, pointing to the need of precisely mimicking the neuronal representations found in the primate brain.
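    The difference between the two variants comes down to how receptive-field parameters are drawn before a fixed Gabor filter bank is built. The sketch below contrasts a uniform draw with a skewed, long-tailed draw used here only as a stand-in for the empirical V1 distributions; the actual parameter ranges and distributions come from the neurophysiology data behind the VOneNet front-end, not from this code.

```python
import numpy as np

def gabor(size, sf, theta, sigma, phase):
    """A single 2D Gabor receptive field (spatial frequency sf in cycles/pixel)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * sf * xr + phase)

def sample_filter_bank(n_filters, biological=True, seed=0):
    rng = np.random.default_rng(seed)
    filters = []
    for _ in range(n_filters):
        if biological:
            # skewed, long-tailed draws as a stand-in for empirical V1 distributions
            sf = rng.lognormal(mean=np.log(0.08), sigma=0.6)
            sigma = rng.lognormal(mean=np.log(4.0), sigma=0.4)
        else:
            sf = rng.uniform(0.02, 0.3)       # uniform-sampling variant
            sigma = rng.uniform(2.0, 8.0)
        theta = rng.uniform(0, np.pi)
        phase = rng.uniform(0, 2 * np.pi)
        filters.append(gabor(size=25, sf=sf, theta=theta, sigma=sigma, phase=phase))
    return np.stack(filters)

print(sample_filter_bank(16).shape)   # (16, 25, 25)
```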

RefConv: Re-parameterized Refocusing Convolution for Powerful ConvNets

  • paper_url: http://arxiv.org/abs/2310.10563
  • repo_url: https://github.com/Aiolus-X/RefConv
  • paper_authors: Zhicheng Cai, Xiaohan Ding, Qiu Shen, Xun Cao
  • for: Improving the performance of CNN models without altering the original model structure or adding any inference cost.
  • methods: A trainable Refocusing Transformation is applied to the basis kernels inherited from a pretrained model, establishing connections among the parameters of different kernels so that they refocus on parts of the model they have never attended to, rather than on the input features only.
  • results: RefConv improves multiple CNN-based models by a clear margin on image classification (up to 1.47% higher top-1 accuracy on ImageNet), object detection, and semantic segmentation, with no extra inference cost and no change to the original model structure.
    Abstract We propose Re-parameterized Refocusing Convolution (RefConv) as a replacement for regular convolutional layers, which is a plug-and-play module to improve the performance without any inference costs. Specifically, given a pre-trained model, RefConv applies a trainable Refocusing Transformation to the basis kernels inherited from the pre-trained model to establish connections among the parameters. For example, a depth-wise RefConv can relate the parameters of a specific channel of convolution kernel to the parameters of the other kernel, i.e., make them refocus on the other parts of the model they have never attended to, rather than focus on the input features only. From another perspective, RefConv augments the priors of existing model structures by utilizing the representations encoded in the pre-trained parameters as the priors and refocusing on them to learn novel representations, thus further enhancing the representational capacity of the pre-trained model. Experimental results validated that RefConv can improve multiple CNN-based models by a clear margin on image classification (up to 1.47% higher top-1 accuracy on ImageNet), object detection and semantic segmentation without introducing any extra inference costs or altering the original model structure. Further studies demonstrated that RefConv can reduce the redundancy of channels and smooth the loss landscape, which explains its effectiveness.
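    Read literally, the described re-parameterization keeps the pretrained basis kernels fixed and learns a transformation over them. The sketch below is one possible interpretation, with the refocusing transformation assumed to be a small 1x1 convolution acting on the kernel tensor; consult the released code for the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefConvSketch(nn.Module):
    """Refocus a frozen pretrained kernel with a trainable transform, then convolve.

    The pretrained weight (out_c, in_c, k, k) stays fixed; a small learnable 1x1
    convolution mixes its entries so each filter's parameters can depend on the others."""
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.register_buffer("basis", pretrained_conv.weight.detach().clone())
        out_c, in_c, k, _ = self.basis.shape
        self.refocus = nn.Conv2d(in_c, in_c, kernel_size=1, bias=False)
        with torch.no_grad():                         # start as the identity transform
            self.refocus.weight.copy_(torch.eye(in_c).view(in_c, in_c, 1, 1))
        self.stride, self.padding = pretrained_conv.stride, pretrained_conv.padding

    def forward(self, x):
        w = self.refocus(self.basis)                  # transformed kernels, same shape as basis
        return F.conv2d(x, w, stride=self.stride, padding=self.padding)

layer = RefConvSketch(nn.Conv2d(16, 32, 3, padding=1))
print(layer(torch.randn(1, 16, 24, 24)).shape)        # torch.Size([1, 32, 24, 24])
```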

InfoGCN++: Learning Representation by Predicting the Future for Online Human Skeleton-based Action Recognition

  • paper_url: http://arxiv.org/abs/2310.10547
  • repo_url: https://github.com/stnoah1/sode
  • paper_authors: Seunggeun Chi, Hyung-gun Chi, Qixing Huang, Karthik Ramani
  • for: online skeleton-based action recognition
  • methods: InfoGCN++, a novel extension of InfoGCN that enables real-time categorization of action types without complete observation sequences
  • results: exceptional performance in online action recognition, consistently matching or exceeding existing techniques
    Abstract Skeleton-based action recognition has made significant advancements recently, with models like InfoGCN showcasing remarkable accuracy. However, these models exhibit a key limitation: they necessitate complete action observation prior to classification, which constrains their applicability in real-time situations such as surveillance and robotic systems. To overcome this barrier, we introduce InfoGCN++, an innovative extension of InfoGCN, explicitly developed for online skeleton-based action recognition. InfoGCN++ augments the abilities of the original InfoGCN model by allowing real-time categorization of action types, independent of the observation sequence's length. It transcends conventional approaches by learning from current and anticipated future movements, thereby creating a more thorough representation of the entire sequence. Our approach to prediction is managed as an extrapolation issue, grounded on observed actions. To enable this, InfoGCN++ incorporates Neural Ordinary Differential Equations, a concept that lets it effectively model the continuous evolution of hidden states. Following rigorous evaluations on three skeleton-based action recognition benchmarks, InfoGCN++ demonstrates exceptional performance in online action recognition. It consistently equals or exceeds existing techniques, highlighting its significant potential to reshape the landscape of real-time action recognition applications. Consequently, this work represents a major leap forward from InfoGCN, pushing the limits of what's possible in online, skeleton-based action recognition. The code for InfoGCN++ is publicly available at https://github.com/stnoah1/infogcn2 for further exploration and validation.
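    The anticipation mechanism rests on Neural ODEs: the hidden state is integrated forward in continuous time so the model can reason about movements it has not yet observed. A hedged sketch using the torchdiffeq package (assumed installed; the dynamics network, state size, and 60-class head are placeholders, not InfoGCN++'s components):

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint   # pip install torchdiffeq

class HiddenDynamics(nn.Module):
    """dh/dt = f(h): a small MLP modeling the continuous evolution of the hidden state."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, h):          # torchdiffeq expects f(t, h)
        return self.net(h)

dim = 32
dynamics = HiddenDynamics(dim)
h_now = torch.randn(4, dim)                       # hidden states from the observed frames
t = torch.linspace(0.0, 1.0, steps=5)             # extrapolate a short horizon into the future
h_future = odeint(dynamics, h_now, t)             # (5, 4, dim): predicted future hidden states
logits = nn.Linear(dim, 60)(h_future[-1])         # classify from the anticipated state
print(h_future.shape, logits.shape)
```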

Label-efficient Segmentation via Affinity Propagation

  • paper_url: http://arxiv.org/abs/2310.10533
  • repo_url: https://github.com/circleradon/apro
  • paper_authors: Wentong Li, Yuqian Yuan, Song Wang, Wenyu Liu, Dongqi Tang, Jian Liu, Jianke Zhu, Lei Zhang
  • for: Reducing the cost of laborious pixel-wise annotation through label-efficient sparse annotations.
  • methods: Affinity modeling is formulated as an affinity propagation process with a local and a global pairwise affinity term that generate accurate soft pseudo labels, together with an efficient algorithm that significantly reduces the computational cost.
  • results: The approach brings substantial performance gains on three typical label-efficient segmentation tasks: box-supervised instance segmentation, point/scribble-supervised semantic segmentation, and CLIP-guided semantic segmentation.
    Abstract Weakly-supervised segmentation with label-efficient sparse annotations has attracted increasing research attention to reduce the cost of laborious pixel-wise labeling process, while the pairwise affinity modeling techniques play an essential role in this task. Most of the existing approaches focus on using the local appearance kernel to model the neighboring pairwise potentials. However, such a local operation fails to capture the long-range dependencies and ignores the topology of objects. In this work, we formulate the affinity modeling as an affinity propagation process, and propose a local and a global pairwise affinity terms to generate accurate soft pseudo labels. An efficient algorithm is also developed to reduce significantly the computational cost. The proposed approach can be conveniently plugged into existing segmentation networks. Experiments on three typical label-efficient segmentation tasks, i.e. box-supervised instance segmentation, point/scribble-supervised semantic segmentation and CLIP-guided semantic segmentation, demonstrate the superior performance of the proposed approach.
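    The local pairwise affinity term can be pictured as a color-similarity kernel over neighboring pixels that diffuses sparse seed labels into soft pseudo labels. The snippet below is a toy, NumPy-only propagation with 4-connected neighbors; the paper's formulation, including the global affinity term and the efficient algorithm, is considerably more involved.

```python
import numpy as np

def propagate_labels(image, seeds, n_iters=50, bandwidth=0.1):
    """Diffuse sparse seed labels over an image using local color affinities.

    image: (H, W, 3) floats in [0, 1]; seeds: (H, W) with -1 = unlabeled, else class id."""
    h, w, _ = image.shape
    n_classes = seeds.max() + 1
    soft = np.zeros((h, w, n_classes))
    soft[seeds >= 0, :] = np.eye(n_classes)[seeds[seeds >= 0]]
    shifts = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    for _ in range(n_iters):
        acc = np.zeros_like(soft)
        norm = np.zeros((h, w, 1))
        for dy, dx in shifts:
            nb_img = np.roll(image, (dy, dx), axis=(0, 1))
            nb_soft = np.roll(soft, (dy, dx), axis=(0, 1))
            aff = np.exp(-np.sum((image - nb_img) ** 2, axis=-1, keepdims=True) / bandwidth)
            acc += aff * nb_soft
            norm += aff
        soft = acc / norm
        soft[seeds >= 0, :] = np.eye(n_classes)[seeds[seeds >= 0]]   # keep seeds fixed
    return soft.argmax(-1)

img = np.zeros((32, 32, 3)); img[:, 16:] = 1.0           # two-region toy image
seeds = -np.ones((32, 32), dtype=int); seeds[16, 4] = 0; seeds[16, 28] = 1
print(np.unique(propagate_labels(img, seeds)))            # [0 1]
```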

Distribution prediction for image compression: An experimental re-compressor for JPEG images

  • paper_url: http://arxiv.org/abs/2310.10517
  • repo_url: None
  • paper_authors: Maxim Koroteev, Yaroslav Borisov, Pavel Frolov
  • for: Improving the compression ratio of JPEG images.
  • methods: Taking a JPEG image as input, the algorithm partially decodes the signal to obtain the quantized DCT coefficients and then re-compresses them in a more effective way.
  • results: The result is a lossless re-compression of JPEG images.
    Abstract We propose a new scheme to re-compress JPEG images in a lossless way. Using a JPEG image as an input the algorithm partially decodes the signal to obtain quantized DCT coefficients and then re-compress them in a more effective way.
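    For context, the quantized DCT coefficients that the method re-encodes are the per-block integers produced by the standard JPEG transform step. The snippet below reproduces that step for a single 8x8 block using the standard luminance quantization table; the entropy re-coding that the paper actually proposes is not shown.

```python
import numpy as np
from scipy.fft import dctn, idctn

# standard JPEG luminance quantization table (Annex K)
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float) - 128   # level-shifted 8x8 block

coeffs = np.rint(dctn(block, norm="ortho") / Q).astype(int)      # quantized DCT coefficients
print("nonzero coefficients:", np.count_nonzero(coeffs), "of 64")

# a lossless re-compressor must reproduce exactly these integers; dequantizing and
# inverting the DCT shows the usual (lossy) quantization error of the original JPEG step
recon = idctn(coeffs * Q.astype(float), norm="ortho") + 128
print("max quantization error for this (noisy) block:", float(np.abs(recon - (block + 128)).max()))
```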

Unifying Image Processing as Visual Prompting Question Answering

  • paper_url: http://arxiv.org/abs/2310.10513
  • repo_url: None
  • paper_authors: Yihao Liu, Xiangyu Chen, Xianzheng Ma, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong
  • for: Handling diverse image processing tasks, such as improving image quality and extracting visual features, with one model instead of task-specific models.
  • methods: A large-scale model is pretrained and applied via in-context learning: the input-output image pair is treated as a structured question-answer sentence, reprogramming image processing as a visual prompting question answering problem (PromptGIP).
  • results: The resulting universal model covers a range of image processing tasks, including image restoration, image enhancement, and image feature extraction, driven by visual prompts and without task-specific fine-tuning.
    Abstract Image processing is a fundamental task in computer vision, which aims at enhancing image quality and extracting essential features for subsequent vision applications. Traditionally, task-specific models are developed for individual tasks and designing such models requires distinct expertise. Building upon the success of large language models (LLMs) in natural language processing (NLP), there is a similar trend in computer vision, which focuses on developing large-scale models through pretraining and in-context learning. This paradigm shift reduces the reliance on task-specific models, yielding a powerful unified model to deal with various tasks. However, these advances have predominantly concentrated on high-level vision tasks, with less attention paid to low-level vision tasks. To address this issue, we propose a universal model for general image processing that covers image restoration, image enhancement, image feature extraction tasks, \textit{etc}. Our proposed framework, named PromptGIP, unifies these diverse image processing tasks within a universal framework. Inspired by NLP question answering (QA) techniques, we employ a visual prompting question answering paradigm. Specifically, we treat the input-output image pair as a structured question-answer sentence, thereby reprogramming the image processing task as a prompting QA problem. PromptGIP can undertake diverse \textbf{cross-domain} tasks using provided visual prompts, eliminating the need for task-specific finetuning. Our methodology offers a universal and adaptive solution to general image processing. While PromptGIP has demonstrated a certain degree of out-of-domain task generalization capability, further research is expected to fully explore its more powerful emergent generalization.
    摘要 图像处理是计算机视觉中的一项基础任务,旨在提升图像质量并提取供后续视觉应用使用的关键特征。传统上,需要为各个任务分别开发特定模型,而设计这些模型需要专门的技能。借鉴大语言模型(LLM)在自然语言处理(NLP)中的成功,计算机视觉领域也出现了类似趋势,即通过预训练和上下文学习构建大规模模型。这种范式转变降低了对特定任务模型的依赖,得到一个能够处理多种任务的强大统一模型。然而,这些进展主要集中在高层视觉任务上,对低层视觉任务的关注较少。为了解决这一问题,我们提出了一个面向通用图像处理的统一模型,涵盖图像复原、图像增强、图像特征提取等任务。我们提出的框架名为 PromptGIP,将这些多样的图像处理任务统一到一个通用框架中。受 NLP 问答(QA)技术的启发,我们采用视觉提示问答范式:将输入-输出图像对视为一个结构化的问答句,从而把图像处理任务重构为提示问答问题。PromptGIP 能够利用给定的视觉提示完成多种跨域任务,无需针对具体任务进行微调。我们的方法为通用图像处理提供了一种统一且自适应的解决方案。虽然 PromptGIP 已经表现出一定的域外任务泛化能力,但仍需进一步研究以充分挖掘其更强大的涌现泛化能力。

Evaluation and improvement of Segment Anything Model for interactive histopathology image segmentation

  • paper_url: http://arxiv.org/abs/2310.10493
  • repo_url: None
  • paper_authors: SeungKyu Kim, Hyun-Jic Oh, Seonghui Min, Won-Ki Jeong
  • for: This paper focuses on evaluating the performance of the Segment Anything Model (SAM) in interactive segmentation of histopathology data, and comparing it with other state-of-the-art interactive models.
  • methods: The paper uses the SAM model as a foundational model for image segmentation, and evaluates its performance in zero-shot and fine-tuned scenarios on histopathology data. The authors also propose a modification of SAM’s decoder to improve its local refinement ability and stability.
  • results: The experimental results show that SAM exhibits weaknesses in segmentation performance compared to other models, but demonstrates relative strengths in inference time and generalization capability. The proposed modification of SAM’s decoder is effective in improving its performance for interactive histology image segmentation.
    Abstract With the emergence of the Segment Anything Model (SAM) as a foundational model for image segmentation, its application has been extensively studied across various domains, including the medical field. However, its potential in the context of histopathology data, specifically in region segmentation, has received relatively limited attention. In this paper, we evaluate SAM's performance in zero-shot and fine-tuned scenarios on histopathology data, with a focus on interactive segmentation. Additionally, we compare SAM with other state-of-the-art interactive models to assess its practical potential and evaluate its generalization capability with domain adaptability. In the experimental results, SAM exhibits a weakness in segmentation performance compared to other models while demonstrating relative strengths in terms of inference time and generalization capability. To improve SAM's limited local refinement ability and to enhance prompt stability while preserving its core strengths, we propose a modification of SAM's decoder. The experimental results suggest that the proposed modification is effective to make SAM useful for interactive histology image segmentation. The code is available at \url{https://github.com/hvcl/SAM_Interactive_Histopathology}
    摘要 随着Segment Anything Model(SAM)作为图像分割基本模型的出现,其应用在不同领域得到了广泛的研究,但在 histopathology 数据中的区域分割方面却收到了相对有限的关注。在这篇论文中,我们评估了 SAM 在 zero-shot 和 fine-tuned 场景中对 histopathology 数据的性能,强调交互分割。此外,我们与其他当前领先的交互模型进行比较,以评估 SAM 在实际应用中的实用性和适应性。在实验结果中,SAM 在分割性能方面表现较弱,但在推理时间和适应性方面表现出了相对的优势。为了改进 SAM 的局部精度修正能力并保持其核心优势,我们提议一种修改 SAM 的解码器。实验结果表明,该修改是有效的,使得 SAM 在交互式 histology 图像分割中变得有用。代码可以在 \url{https://github.com/hvcl/SAM_Interactive_Histopathology} 上获取。

On the Transferability of Learning Models for Semantic Segmentation for Remote Sensing Data

  • paper_url: http://arxiv.org/abs/2310.10490
  • repo_url: https://github.com/gdaosu/transferability-remote-sensing
  • paper_authors: Rongjun Qin, Guixiang Zhang, Yang Tang
  • for: 本研究旨在考察遥感(RS)语义分割/分类任务中学习模型的可迁移性,以及领域适应(DA)方法在多大程度上能够提升深度学习(DL)模型的可迁移性。
  • methods: 本研究使用四个差异很大的 RS 数据集,在使用与不使用三种 DA 策略的条件下训练六个模型,以定量分析它们在这些数据集之间的可迁移性;此外还提出了一种以光谱指数为媒介的简单方法,可在无标签的情况下评估模型在目标领域的可迁移性。
  • results: 实验表明 DL 模型在不同领域之间的原始可迁移性较差,而 DA 策略能够有效提升其可迁移性;同时得到了若干此前较少报道的关于原始与适应后可迁移性的重要观察。所提出的无标签可迁移性评估方法也被验证优于基于模型后验置信度的评估。
    Abstract Recent deep learning-based methods outperform traditional learning methods on remote sensing (RS) semantic segmentation/classification tasks. However, they require large training datasets and are generally known for lack of transferability due to the highly disparate RS image content across different geographical regions. Yet, there is no comprehensive analysis of their transferability, i.e., to which extent a model trained on a source domain can be readily applicable to a target domain. Therefore, in this paper, we aim to investigate the raw transferability of traditional and deep learning (DL) models, as well as the effectiveness of domain adaptation (DA) approaches in enhancing the transferability of the DL models (adapted transferability). By utilizing four highly diverse RS datasets, we train six models with and without three DA approaches to analyze their transferability between these datasets quantitatively. Furthermore, we developed a straightforward method to quantify the transferability of a model using the spectral indices as a medium and have demonstrated its effectiveness in evaluating the model transferability at the target domain when the labels are unavailable. Our experiments yield several generally important yet not well-reported observations regarding the raw and adapted transferability. Moreover, our proposed label-free transferability assessment method is validated to be better than posterior model confidence. The findings can guide the future development of generalized RS learning models. The trained models are released under this link: https://github.com/GDAOSU/Transferability-Remote-Sensing
    摘要 现代深度学习方法在遥感(RS)语义分割/分类任务上表现出色,但它们需要大量训练数据,并且由于不同地理区域的 RS 图像内容差异极大而通常缺乏可迁移性。然而,目前尚无对其可迁移性的系统分析,即在源领域训练的模型能在多大程度上直接应用于目标领域。因此,本文旨在考察传统模型和深度学习(DL)模型的原始可迁移性,以及领域适应(DA)方法在提升 DL 模型可迁移性(适应后可迁移性)方面的有效性。我们利用四个差异很大的 RS 数据集,在使用与不使用三种 DA 策略的条件下训练六个模型,以定量分析它们在这些数据集之间的可迁移性。此外,我们还提出了一种以光谱指数为媒介、量化模型可迁移性的简单方法,并证明了其在目标领域无标签情况下评估模型可迁移性的有效性。我们的实验得到了若干具有普遍意义但此前较少报道的关于原始与适应后可迁移性的观察;所提出的无标签可迁移性评估方法也被验证优于基于模型后验置信度的评估。这些发现可以指导未来通用 RS 学习模型的发展。训练好的模型可在以下链接获取:https://github.com/GDAOSU/Transferability-Remote-Sensing
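
The abstract mentions using spectral indices as a medium for label-free transferability assessment. As a hedged illustration (not the paper's exact score), the sketch below compares the distribution of one standard index, NDVI, between two domains; the band values and the overlap measure are illustrative assumptions.

```python
# Compare the distribution of a spectral index (NDVI) between two domains as a
# rough, label-free proxy for domain gap. Illustration only, not the paper's metric.
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index, in [-1, 1]."""
    return (nir - red) / (nir + red + eps)

def histogram_overlap(a, b, bins=50, value_range=(-1.0, 1.0)):
    """Overlap of two normalized histograms (1.0 = identical distributions)."""
    ha, _ = np.histogram(a, bins=bins, range=value_range, density=True)
    hb, _ = np.histogram(b, bins=bins, range=value_range, density=True)
    width = (value_range[1] - value_range[0]) / bins
    return float(np.sum(np.minimum(ha, hb)) * width)

rng = np.random.default_rng(0)
src = ndvi(rng.uniform(0.3, 0.8, 10000), rng.uniform(0.05, 0.3, 10000))  # "source" pixels
tgt = ndvi(rng.uniform(0.2, 0.6, 10000), rng.uniform(0.1, 0.4, 10000))   # "target" pixels

print(f"index-distribution overlap: {histogram_overlap(src, tgt):.2f}")
```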

Combating Label Noise With A General Surrogate Model For Sample Selection

  • paper_url: http://arxiv.org/abs/2310.10463
  • repo_url: None
  • paper_authors: Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang
  • for: 减少标签噪音,提高深度学习系统的性能。
  • methods: 利用视觉-语言代理模型 CLIP 自动筛选噪声样本,并设计 margin 自适应损失来校正 CLIP 引入的选择偏差。
  • results: 在实际和Synthetic噪音数据集上实现了显著改进,无需CLIP在推断阶段参与。
    Abstract Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small loss criterion where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, freeing from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples with its ability of text-image alignment. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without CLIP involved during the inference stage.
    摘要 现代深度学习系统对数据的需求巨大。利用网络数据进行学习是一种可行的方案,但不可避免地会引入标签噪声,从而影响深度神经网络的性能。样本选择是处理标签噪声的有效方法,其关键在于依据某种准则分离出干净样本。以往的方法更多关注小损失准则,即将损失较小的样本视为干净样本。然而,这种策略依赖于每个数据实例的学习动态;由于频繁出现的损坏学习模式,一些噪声样本仍会被模型记忆。为解决这一问题,更可取的是使用无需训练的代理模型,从而摆脱记忆效应的影响。在本工作中,我们提出利用视觉-语言代理模型 CLIP 自动过滤噪声样本:CLIP 借助其图文对齐能力带来外部知识,促进干净样本的选择。此外,我们设计了一种 margin 自适应损失,用于校正 CLIP 引入的选择偏差,增强对标签噪声的鲁棒性。我们在真实与合成噪声数据集上验证了所提方法的有效性,取得了显著提升,且推理阶段无需使用 CLIP。
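
A hedged sketch of the surrogate-model selection step: given precomputed, L2-normalized image and class-text embeddings from a vision-language model, a sample is kept when its annotated class wins the image-text similarity by a margin. The embeddings, the fixed margin, and the function names are illustrative assumptions; the paper's margin-adaptive loss and full training loop are not reproduced here.

```python
# Select likely-clean samples by text-image similarity from a surrogate model.
# Embeddings below are random placeholders for real CLIP features.
import numpy as np

def select_clean(img_emb, txt_emb, labels, margin=0.05):
    """Boolean mask: True where the annotated class wins by at least `margin`."""
    sims = img_emb @ txt_emb.T                        # (N, C) cosine similarities
    rows = np.arange(len(labels))
    given = sims[rows, labels]                        # similarity to annotated class
    sims_masked = sims.copy()
    sims_masked[rows, labels] = -np.inf
    best_other = sims_masked.max(axis=1)              # strongest competing class
    return given - best_other >= margin

rng = np.random.default_rng(0)
img_emb = rng.standard_normal((8, 512)); img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb = rng.standard_normal((10, 512)); txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)
labels = rng.integers(0, 10, size=8)

print(select_clean(img_emb, txt_emb, labels))
```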

Model Selection of Anomaly Detectors in the Absence of Labeled Validation Data

  • paper_url: http://arxiv.org/abs/2310.10461
  • repo_url: None
  • paper_authors: Clement Fung, Chen Qiu, Aodong Li, Maja Rudolph
  • for: 这个论文旨在提出一种通用的框架,用于评估基于图像的异常检测器。
  • methods: 该方法假设有一小支持集(support set)的正常图像,通过预训练的扩散模型进行处理,生成了人工异常样本。当混合到正常样本集中时,这些人工异常样本创建了一个适用于异常检测器评估的验证框架。
  • results: 在广泛的实验研究中,我们发现,使用我们的生成的验证数据可以选择同样的模型和超参数,与使用真实的验证集一样。此外,我们发现,使用我们的方法选择的提示(prompts)在CLIP基于异常检测中表现出色,超过其他所有提示策略,并在挑战性的MVTec-AD数据集上达到最佳检测精度。
    Abstract Anomaly detection requires detecting abnormal samples in large unlabeled datasets. While progress in deep learning and the advent of foundation models has produced powerful unsupervised anomaly detection methods, their deployment in practice is often hindered by the lack of labeled data -- without it, the detection accuracy of an anomaly detector cannot be evaluated reliably. In this work, we propose a general-purpose framework for evaluating image-based anomaly detectors with synthetically generated validation data. Our method assumes access to a small support set of normal images which are processed with a pre-trained diffusion model (our proposed method requires no training or fine-tuning) to produce synthetic anomalies. When mixed with normal samples from the support set, the synthetic anomalies create detection tasks that compose a validation framework for anomaly detection evaluation and model selection. In an extensive empirical study, ranging from natural images to industrial applications, we find that our synthetic validation framework selects the same models and hyper-parameters as selection with a ground-truth validation set. In addition, we find that prompts selected by our method for CLIP-based anomaly detection outperforms all other prompt selection strategies, and leads to the overall best detection accuracy, even on the challenging MVTec-AD dataset.
    摘要 异常检测需要在大规模无标签数据中发现异常样本。尽管深度学习的进步与基础模型的出现催生了强大的无监督异常检测方法,它们在实践中的部署仍常常受限于标注数据的缺乏:没有标注数据,就无法可靠地评估异常检测器的检测精度。在这项工作中,我们提出一个通用框架,利用合成生成的验证数据来评估基于图像的异常检测器。我们的方法假设可以获得一小组正常图像作为支持集,并用预训练的扩散模型对其进行处理(无需任何训练或微调),从而生成合成异常样本。将这些合成异常与支持集中的正常样本混合后,便构成了用于异常检测评估与模型选择的验证框架。在从自然图像到工业应用的大量实验中,我们发现该合成验证框架所选出的模型和超参数与使用真实标注验证集所选出的一致。此外,由我们的方法选出的提示(prompts)在基于 CLIP 的异常检测中优于其他所有提示选择策略,即使在具有挑战性的 MVTec-AD 数据集上也能达到整体最佳的检测精度。
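
The validation idea above reduces to ranking candidate detectors by their AUROC on a mix of support-set normals and synthetic anomalies. The sketch below assumes the anomaly scores have already been computed and simulates them with random numbers; only the selection step is shown.

```python
# Model selection with a synthetic validation set: pick the detector whose
# scores best separate normals from synthetic anomalies. Scores are simulated.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(100), np.ones(100)])   # 0 = normal, 1 = synthetic anomaly

def fake_detector(shift):
    """Stand-in for an anomaly detector: higher score = more anomalous."""
    return np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(shift, 1.0, 100)])

candidates = {"detector_A": fake_detector(0.5),
              "detector_B": fake_detector(1.5),
              "detector_C": fake_detector(3.0)}

aucs = {name: roc_auc_score(y, scores) for name, scores in candidates.items()}
best = max(aucs, key=aucs.get)
print(aucs, "-> selected:", best)
```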

Object Detection in Aerial Images in Scarce Data Regimes

  • paper_url: http://arxiv.org/abs/2310.10433
  • repo_url: None
  • paper_authors: Pierre Le Jeune
  • for: 本论文旨在提升小样本目标检测(FSOD)的性能,并评估其在不同类型图像(尤其是航空影像)上的可迁移性。
  • methods: 论文设计了一种专门的注意力机制来改进小目标的检测性能,并提出一种尺度自适应的边框相似度准则以改进 FSOD 方法的训练与评估;此外还提出了基于度量学习和基于微调的两种通用 FSOD 方法。
  • results: 论文在小目标检测上取得了显著提升,并在跨域 FSOD 方向获得了有前景的初步结果。此外,论文还解决了在 COSE 系统中部署检测模型的工程问题,借助 TensorRT 等优化工具,在超过 1 亿像素的图像上以有限算力实现了实时检测。
    Abstract Most contributions on Few-Shot Object Detection (FSOD) evaluate their methods on natural images only, yet the transferability of the announced performance is not guaranteed for applications on other kinds of images. We demonstrate this with an in-depth analysis of existing FSOD methods on aerial images and observed a large performance gap compared to natural images. Small objects, more numerous in aerial images, are the cause for the apparent performance gap between natural and aerial images. As a consequence, we improve FSOD performance on small objects with a carefully designed attention mechanism. In addition, we also propose a scale-adaptive box similarity criterion, that improves the training and evaluation of FSOD methods, particularly for small objects. We also contribute to generic FSOD with two distinct approaches based on metric learning and fine-tuning. Impressive results are achieved with the fine-tuning method, which encourages tackling more complex scenarios such as Cross-Domain FSOD. We conduct preliminary experiments in this direction and obtain promising results. Finally, we address the deployment of the detection models inside COSE's systems. Detection must be done in real-time in extremely large images (more than 100 megapixels), with limited computation power. Leveraging existing optimization tools such as TensorRT, we successfully tackle this engineering challenge.
    摘要 大多数小样本目标检测(FSOD)方面的工作仅在自然图像上评估其方法,但所宣称的性能并不保证能迁移到其他类型的图像上。我们通过对现有 FSOD 方法在航空影像上的深入分析说明了这一点,并观察到其与自然图像之间存在明显的性能差距;航空影像中更为常见的小目标正是造成这一差距的原因。因此,我们设计了一种精心构造的注意力机制来提升小目标上的 FSOD 性能。此外,我们还提出一种尺度自适应的边框相似度准则,以改进 FSOD 方法(尤其是针对小目标)的训练与评估。我们还分别基于度量学习和微调提出了两种通用 FSOD 方法;微调方法取得了令人瞩目的结果,并鼓励我们进一步研究更复杂的跨域 FSOD 场景,初步实验获得了有前景的结果。最后,我们解决了在 COSE 系统中部署检测模型的工程问题:需要在超过 1 亿像素的超大图像上、以有限的算力进行实时检测。借助 TensorRT 等现有优化工具,我们成功解决了这一工程挑战。
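
On the deployment side, one common way to run a detector over images of more than 100 megapixels is tiled inference with overlap; this is an assumed reading, not necessarily COSE's exact pipeline. `run_detector` below is a hypothetical per-tile detector, and the final cross-tile NMS is omitted.

```python
# Tiled inference on very large aerial images: detect per overlapping tile,
# then shift boxes back to global coordinates.
import numpy as np

def tiled_detect(image, run_detector, tile=1024, overlap=128):
    h, w = image.shape[:2]
    step = tile - overlap
    boxes = []
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            patch = image[y0:y0 + tile, x0:x0 + tile]
            for (x1, y1, x2, y2, score) in run_detector(patch):
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score))
    return boxes

# Dummy detector that "finds" one box per tile, just to exercise the loop.
dummy = lambda patch: [(10, 10, 50, 50, 0.9)]
big_image = np.zeros((4096, 4096, 3), dtype=np.uint8)
print(len(tiled_detect(big_image, dummy)), "boxes from tiles")
```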

DANAA: Towards transferable attacks with double adversarial neuron attribution

  • paper_url: http://arxiv.org/abs/2310.10427
  • repo_url: https://github.com/Davidjinzb/DANAA
  • paper_authors: Zhibo Jin, Zhiyu Zhu, Xinyi Wang, Jiayu Zhang, Jun Shen, Huaming Chen
  • for: 本文旨在提出一种基于中间层的双重对抗神经元归因攻击方法,以获得更准确的特征重要性估计,并提升对抗样本在不同模型间的可迁移性。
  • methods: 本文提出基于对抗非线性路径的双重对抗神经元归因方法(DANAA),将模型输出归因到中间层,通过衡量各个神经元的贡献来估计特征重要性,并保留对可迁移性更重要的特征。
  • results: 对多个基准数据集进行了广泛的实验,并证明了我们的方法可以达到当今最佳性能。我们的代码可以在 GitHub 上找到:https://github.com/Davidjinzb/DANAA
    Abstract While deep neural networks have excellent results in many fields, they are susceptible to interference from attacking samples resulting in erroneous judgments. Feature-level attacks are one of the effective attack types, which targets the learnt features in the hidden layers to improve its transferability across different models. Yet it is observed that the transferability has been largely impacted by the neuron importance estimation results. In this paper, a double adversarial neuron attribution attack method, termed `DANAA', is proposed to obtain more accurate feature importance estimation. In our method, the model outputs are attributed to the middle layer based on an adversarial non-linear path. The goal is to measure the weight of individual neurons and retain the features that are more important towards transferability. We have conducted extensive experiments on the benchmark datasets to demonstrate the state-of-the-art performance of our method. Our code is available at: https://github.com/Davidjinzb/DANAA
    摘要 深度神经网络虽然在许多领域取得了出色的成绩,但容易受到攻击样本的干扰而做出错误判断。特征级攻击是一类有效的攻击方式,它针对隐藏层学到的特征,以提升对抗样本在不同模型间的可迁移性。然而我们观察到,这种可迁移性在很大程度上受神经元重要性估计结果的影响。本文提出一种双重对抗神经元归因攻击方法(DANAA),以获得更准确的特征重要性估计。在我们的方法中,模型输出沿一条对抗非线性路径被归因到中间层,其目标是衡量单个神经元的权重,并保留对可迁移性更重要的特征。我们在多个基准数据集上进行了大量实验,验证了所提方法的最先进性能。代码见:https://github.com/Davidjinzb/DANAA
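
For intuition, the sketch below computes a generic integrated-gradients-style attribution of a model output to middle-layer neurons along a straight-line path between a baseline and the input. DANAA's adversarial non-linear path is replaced by this linear path, so treat it as a simplified stand-in rather than the paper's estimator; model, layer, and step count are illustrative.

```python
# Attribute an output to middle-layer neurons by integrating (grad x activation)
# over interpolated inputs. A generic layer-attribution sketch, not DANAA itself.
import torch
import torch.nn as nn

def middle_layer_attribution(model, layer, x, baseline, target, steps=16):
    acts = {}
    def hook(_m, _inp, out):
        out.retain_grad()          # keep gradients of this intermediate tensor
        acts["a"] = out
    h = layer.register_forward_hook(hook)
    total = None
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        model.zero_grad()
        xi = baseline + alpha * (x - baseline)
        out = model(xi)
        out[:, target].sum().backward()
        contrib = acts["a"].grad * acts["a"].detach()
        total = contrib if total is None else total + contrib
    h.remove()
    return (total / steps).squeeze(0)      # one importance score per neuron

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x, baseline = torch.randn(1, 8), torch.zeros(1, 8)
imp = middle_layer_attribution(model, layer=model[1], x=x, baseline=baseline, target=0)
print(imp.shape)   # torch.Size([16])
```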

A Novel Benchmarking Paradigm and a Scale- and Motion-Aware Model for Egocentric Pedestrian Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2310.10424
  • repo_url: None
  • paper_authors: Amir Rasouli
  • for: 本研究旨在提高智能驾驶系统中对行人行为预测的精度。
  • methods: 本文提出了一种新的第一人称(egocentric)行人轨迹预测算法评估范式:基于多种上下文信息提取有意义且系统化的驾驶场景,并提出一种更有效的指标用于场景化评估中的模型排名;同时提出了一种以分步层级方式高效融合多模态数据、并带有两个辅助任务的新型轨迹预测模型。
  • results: 对现有模型进行了大量实证研究,揭示了不同方法的缺陷与优势;在具有挑战性的场景中,我们的方法相比以往方法最高可带来 40% 的提升。
    Abstract Predicting pedestrian behavior is one of the main challenges for intelligent driving systems. In this paper, we present a new paradigm for evaluating egocentric pedestrian trajectory prediction algorithms. Based on various contextual information, we extract driving scenarios for a meaningful and systematic approach to identifying challenges for prediction models. In this regard, we also propose a new metric for more effective ranking within the scenario-based evaluation. We conduct extensive empirical studies of existing models on these scenarios to expose shortcomings and strengths of different approaches. The scenario-based analysis highlights the importance of using multimodal sources of information and challenges caused by inadequate modeling of ego-motion and scale of pedestrians. To this end, we propose a novel egocentric trajectory prediction model that benefits from multimodal sources of data fused in an effective and efficient step-wise hierarchical fashion and two auxiliary tasks designed to learn more robust representation of scene dynamics. We show that our approach achieves significant improvement by up to 40% in challenging scenarios compared to the past arts via empirical evaluation on common benchmark datasets.
    摘要 预测行人行为是智能驾驶系统面临的主要挑战之一。本文提出了一种评估第一人称行人轨迹预测算法的新范式:基于多种上下文信息提取驾驶场景,从而以有意义且系统化的方式识别预测模型面临的挑战;在此基础上,我们还提出了一种在场景化评估中更有效的排名指标。我们对现有模型在这些场景上进行了大量实证研究,以揭示不同方法的缺陷与优势。场景化分析凸显了利用多模态信息源的重要性,以及对自身运动和行人尺度建模不足所带来的困难。为此,我们提出了一种新的第一人称轨迹预测模型,它以高效的分步层级方式融合多模态数据,并设计了两个辅助任务以学习更鲁棒的场景动态表示。在常用基准数据集上的实证评估表明,我们的方法在具有挑战性的场景中相比以往方法最高可提升 40%。

YOLOv7 for Mosquito Breeding Grounds Detection and Tracking

  • paper_url: http://arxiv.org/abs/2310.10423
  • repo_url: None
  • paper_authors: Camila Laranjeira, Daniel Andrade, Jefersson A. dos Santos
  • for: 随着气候变化威胁的逼近,登革热、寨卡和基孔肯雅热等被忽视的热带疾病可能成为更严重的全球性问题。
  • methods: 本文利用先进且计算高效的 YOLOv7 检测方法,在无人机拍摄的视频中自动检测、定位并跟踪蚊子繁殖地点,以便地方机构能够及时干预。
  • results: 我们在 ICIP 2023 “Automatic Detection of Mosquito Breeding Grounds” 挑战赛公开的数据集上进行了实验,结果表明 YOLOv7 可直接用于检测水池、轮胎和水箱等较大的繁殖点(foci)类别,并且对逐帧检测结果进行简单聚合即可在跟踪过程中引入时间一致性。
    Abstract With the looming threat of climate change, neglected tropical diseases such as dengue, zika, and chikungunya have the potential to become an even greater global concern. Remote sensing technologies can aid in controlling the spread of Aedes Aegypti, the transmission vector of such diseases, by automating the detection and mapping of mosquito breeding sites, such that local entities can properly intervene. In this work, we leverage YOLOv7, a state-of-the-art and computationally efficient detection approach, to localize and track mosquito foci in videos captured by unmanned aerial vehicles. We experiment on a dataset released to the public as part of the ICIP 2023 grand challenge entitled Automatic Detection of Mosquito Breeding Grounds. We show that YOLOv7 can be directly applied to detect larger foci categories such as pools, tires, and water tanks and that a cheap and straightforward aggregation of frame-by-frame detection can incorporate time consistency into the tracking process.
    摘要 随着气候变化威胁的逼近,登革热、寨卡和基孔肯雅热等被忽视的热带疾病可能成为更严重的全球性问题。遥感技术可以通过自动检测和测绘蚊子繁殖地点,帮助控制这些疾病的传播媒介埃及伊蚊,使地方机构能够及时干预。在这项工作中,我们利用先进且计算高效的检测方法 YOLOv7,在无人机拍摄的视频中定位并跟踪蚊子繁殖点。我们在 ICIP 2023 “Automatic Detection of Mosquito Breeding Grounds” 挑战赛公开的数据集上进行了实验,结果表明 YOLOv7 可直接用于检测水池、轮胎和水箱等较大的繁殖点类别,并且对逐帧检测结果进行简单、低成本的聚合即可在跟踪过程中引入时间一致性。
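
The "cheap and straightforward aggregation of frame-by-frame detection" can be read as simple IoU-based track linking. The sketch below is one such greedy linker under that assumed reading; the threshold and the box format are illustrative.

```python
# Greedy temporal aggregation of per-frame detections into tracks via IoU.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def link_tracks(frames, thr=0.3):
    tracks = []                                   # each track: list of (frame_idx, box)
    for t, dets in enumerate(frames):
        for box in dets:
            best = max(tracks, key=lambda tr: iou(tr[-1][1], box), default=None)
            if best is not None and iou(best[-1][1], box) >= thr:
                best.append((t, box))             # extend the overlapping track
            else:
                tracks.append([(t, box)])         # start a new track
    return tracks

frames = [[(10, 10, 60, 60)], [(14, 12, 64, 62)], [(200, 200, 240, 240)]]
print(len(link_tracks(frames)), "tracks")         # expect 2
```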

LMT: Longitudinal Mixing Training, a Framework to Predict Disease Progression from a Single Image

  • paper_url: http://arxiv.org/abs/2310.10420
  • repo_url: None
  • paper_authors: Rachid Zeghlache, Pierre-Henri Conze, Mostafa El Habib Daho, Yihao Li, Hugo Le boite, Ramin Tadayoni, Pascal Massin, Béatrice Cochener, Ikram Brahim, Gwenolé Quellec, Mathieu Lamard
  • for: 本论文旨在仅凭单张图像检测并预测糖尿病视网膜病变(DR)的进展。
  • methods: 该论文将混合(mixup)训练、带纵向上下文的前置任务与神经常微分方程(NODE)相结合,提出纵向混合训练(LMT)框架。
  • results: 在 OPHDIAT 纵向彩色眼底照片(CFP)数据集上,该方法仅凭单张图像即可预测患眼在下一次就诊时是否会发展为重度 DR,AUC 达 0.798,明显高于 0.641 的基线结果。
    Abstract Longitudinal imaging is able to capture both static anatomical structures and dynamic changes in disease progression toward earlier and better patient-specific pathology management. However, conventional approaches rarely take advantage of longitudinal information for detection and prediction purposes, especially for Diabetic Retinopathy (DR). In the past years, Mix-up training and pretext tasks with longitudinal context have effectively enhanced DR classification results and captured disease progression. In the meantime, a novel type of neural network named Neural Ordinary Differential Equation (NODE) has been proposed for solving ordinary differential equations, with a neural network treated as a black box. By definition, NODE is well suited for solving time-related problems. In this paper, we propose to combine these three aspects to detect and predict DR progression. Our framework, Longitudinal Mixing Training (LMT), can be considered both as a regularizer and as a pretext task that encodes the disease progression in the latent space. Additionally, we evaluate the trained model weights on a downstream task with a longitudinal context using standard and longitudinal pretext tasks. We introduce a new way to train time-aware models using $t_{mix}$, a weighted average time between two consecutive examinations. We compare our approach to standard mixing training on DR classification using OPHDIAT a longitudinal retinal Color Fundus Photographs (CFP) dataset. We were able to predict whether an eye would develop a severe DR in the following visit using a single image, with an AUC of 0.798 compared to baseline results of 0.641. Our results indicate that our longitudinal pretext task can learn the progression of DR disease and that introducing $t_{mix}$ augmentation is beneficial for time-aware models.
    摘要 纵向影像既能捕捉静态解剖结构,也能刻画疾病进展中的动态变化,有助于更早、更个体化的病理管理。然而,传统方法很少利用纵向信息进行检测和预测,对糖尿病视网膜病变(DR)尤其如此。近年来,混合(mixup)训练和带纵向上下文的前置任务有效提升了 DR 分类效果并捕捉了疾病进展。与此同时,一类名为神经常微分方程(NODE)的新型神经网络被提出用于求解常微分方程,将神经网络视为黑盒;按其定义,NODE 非常适合处理与时间相关的问题。本文提出将这三方面结合起来,用于检测和预测 DR 的进展。我们的框架,即纵向混合训练(LMT),既可以视为一种正则化手段,也可以视为一种将疾病进展编码到潜在空间中的前置任务。此外,我们在带纵向上下文的下游任务上,分别使用标准前置任务和纵向前置任务评估训练得到的模型权重。我们引入了一种训练时间感知模型的新方式,使用 $t_{mix}$,即两次相邻检查时间的加权平均。我们在 OPHDIAT 纵向彩色眼底照片(CFP)数据集的 DR 分类任务上,将我们的方法与标准混合训练进行比较:仅凭单张图像即可预测患眼在下一次就诊时是否会发展为重度 DR,AUC 达 0.798,而基线结果为 0.641。结果表明,我们的纵向前置任务能够学习 DR 疾病的进展,并且引入 $t_{mix}$ 增强对时间感知模型是有益的。
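
A hedged sketch of the $t_{mix}$ idea: two consecutive examinations of the same eye are blended mixup-style and paired with the weighted-average examination time. The NODE backbone, pretext losses, and the rest of the LMT framework are not shown, and the way the mixing weight is sampled is an assumption.

```python
# Longitudinal mixing with a time-aware label t_mix (illustration only).
import numpy as np

def longitudinal_mix(img_t1, img_t2, t1, t2, lam):
    """lam in [0, 1]: mixing weight toward the earlier examination."""
    x_mix = lam * img_t1 + (1.0 - lam) * img_t2
    t_mix = lam * t1 + (1.0 - lam) * t2          # weighted average examination time
    return x_mix, t_mix

rng = np.random.default_rng(0)
img_a, img_b = rng.random((256, 256, 3)), rng.random((256, 256, 3))
x, t = longitudinal_mix(img_a, img_b, t1=0.0, t2=12.0, lam=rng.beta(0.4, 0.4))
print(x.shape, f"t_mix = {t:.2f} months")
```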

Prior-Free Continual Learning with Unlabeled Data in the Wild

  • paper_url: http://arxiv.org/abs/2310.10417
  • repo_url: https://github.com/visiontao/pfcl
  • paper_authors: Tao Zhuo, Zhiyong Cheng, Hehe Fan, Mohan Kankanhalli
  • for: 本研究旨在解决持续学习(CL)中,当真实应用场景下任务先验未知时的灾难性遗忘问题。
  • methods: 我们提出了一种无先验持续学习(PFCL)方法,既不需要任务标识,也不需要任何旧任务数据。具体而言:其一,采用固定的单头架构,从而无需依据任务标识选择任务专用输出头;其二,采用基于正则化的一致性策略,约束新旧模型给出一致的预测,避免重访旧样本。由于仅靠该策略在类增量场景(尤其是任务序列很长时)表现不佳,我们进一步引入辅助无标签数据来增强模型一致性,并设计了可靠样本选择策略以获得稳定的性能提升。
  • results: 我们在多个图像分类基准数据集上进行了大量实验,结果表明 PFCL 在全部三种学习场景中都能显著缓解遗忘;与近期基于重放(rehearsal)少量旧样本的方法相比,PFCL 也取得了相当的准确率。
    Abstract Continual Learning (CL) aims to incrementally update a trained model on new tasks without forgetting the acquired knowledge of old ones. Existing CL methods usually reduce forgetting with task priors, \ie using task identity or a subset of previously seen samples for model training. However, these methods would be infeasible when such priors are unknown in real-world applications. To address this fundamental but seldom-studied problem, we propose a Prior-Free Continual Learning (PFCL) method, which learns new tasks without knowing the task identity or any previous data. First, based on a fixed single-head architecture, we eliminate the need for task identity to select the task-specific output head. Second, we employ a regularization-based strategy for consistent predictions between the new and old models, avoiding revisiting previous samples. However, using this strategy alone often performs poorly in class-incremental scenarios, particularly for a long sequence of tasks. By analyzing the effectiveness and limitations of conventional regularization-based methods, we propose enhancing model consistency with an auxiliary unlabeled dataset additionally. Moreover, since some auxiliary data may degrade the performance, we further develop a reliable sample selection strategy to obtain consistent performance improvement. Extensive experiments on multiple image classification benchmark datasets show that our PFCL method significantly mitigates forgetting in all three learning scenarios. Furthermore, when compared to the most recent rehearsal-based methods that replay a limited number of previous samples, PFCL achieves competitive accuracy. Our code is available at: https://github.com/visiontao/pfcl
    摘要 持续学习(CL)旨在让已训练的模型在新任务上不断更新,同时不遗忘旧任务中获得的知识。现有 CL 方法通常借助任务先验来减少遗忘,即在训练时利用任务标识或一部分先前见过的样本。然而,在真实应用中这些先验往往是未知的,使这类方法难以适用。针对这一重要但少有研究的问题,我们提出了无先验持续学习(PFCL)方法,在不知道任务标识、也不使用任何旧数据的情况下学习新任务。首先,基于固定的单头架构,我们去除了依据任务标识选择任务专用输出头的需求。其次,我们采用基于正则化的策略,使新旧模型给出一致的预测,从而避免重访旧样本。然而,单独使用该策略在类增量场景中(尤其是任务序列较长时)往往表现不佳。通过分析传统正则化方法的有效性与局限性,我们进一步提出利用辅助无标签数据来增强模型一致性。此外,由于部分辅助数据可能降低性能,我们还设计了可靠的样本选择策略,以获得稳定一致的性能提升。在多个图像分类基准数据集上的大量实验表明,PFCL 在全部三种学习场景中都能显著缓解遗忘;与近期重放少量旧样本的方法相比,PFCL 也取得了相当的准确率。代码见:https://github.com/visiontao/pfcl
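
The regularization-based strategy can be illustrated as a knowledge-distillation-style consistency term between the frozen old model and the current model on auxiliary unlabeled images. The toy models, the KL form, and the temperature below are illustrative assumptions; the paper's reliable-sample selection step is omitted.

```python
# Consistency regularization on auxiliary unlabeled data (sketch).
import torch
import torch.nn.functional as F

def consistency_loss(new_model, old_model, unlabeled_x, temperature=2.0):
    with torch.no_grad():
        old_logits = old_model(unlabeled_x)           # frozen previous model
    new_logits = new_model(unlabeled_x)
    p_old = F.softmax(old_logits / temperature, dim=1)
    log_p_new = F.log_softmax(new_logits / temperature, dim=1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2

old_model = torch.nn.Linear(32, 10)
new_model = torch.nn.Linear(32, 10)
new_model.load_state_dict(old_model.state_dict())    # start from the old weights
x_aux = torch.randn(16, 32)                           # auxiliary unlabeled batch
print(consistency_loss(new_model, old_model, x_aux).item())   # 0 at the start
```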

Style transfer between Microscopy and Magnetic Resonance Imaging via Generative Adversarial Network in small sample size settings

  • paper_url: http://arxiv.org/abs/2310.10414
  • repo_url: None
  • paper_authors: Monika Pytlarz, Adrian Onicas, Alessandro Crimi
  • for: 这项研究的目的是使用条件 GAN 架构将 MRI 图像转换为组织学图像,以期避免侵入性的活检程序。
  • methods: 研究采用条件生成对抗网络(cGAN)架构,在成对的 MRI 与显微组织学图像上训练图像翻译模型。
  • results: 研究表明,即使组织学图像分辨率高、样本量小,cGAN 架构也能够将脑部 MRI 可靠地转换为同一样本的组织学图像。
    Abstract Cross-modal augmentation of Magnetic Resonance Imaging (MRI) and microscopic imaging based on the same tissue samples is promising because it can allow histopathological analysis in the absence of an underlying invasive biopsy procedure. Here, we tested a method for generating microscopic histological images from MRI scans of the corpus callosum using conditional generative adversarial network (cGAN) architecture. To our knowledge, this is the first multimodal translation of the brain MRI to histological volumetric representation of the same sample. The technique was assessed by training paired image translation models taking sets of images from MRI scans and microscopy. The use of cGAN for this purpose is challenging because microscopy images are large in size and typically have low sample availability. The current work demonstrates that the framework reliably synthesizes histology images from MRI scans of corpus callosum, emphasizing the network's ability to train on high resolution histologies paired with relatively lower-resolution MRI scans. With the ultimate goal of avoiding biopsies, the proposed tool can be used for educational purposes.
    摘要 基于同一组织样本,将磁共振成像(MRI)与显微成像进行跨模态互补是很有前景的,因为它可以在无需侵入性活检的情况下进行组织病理学分析。在这里,我们测试了一种利用条件生成对抗网络(cGAN)架构、从胼胝体 MRI 扫描生成显微组织学图像的方法。据我们所知,这是首次将脑部 MRI 多模态地转换为同一样本的组织学体数据表示。我们通过训练成对的图像翻译模型(输入为来自 MRI 扫描与显微成像的图像对)来评估该技术。将 cGAN 用于这一目的具有挑战性,因为显微图像尺寸很大且样本量通常很少。本工作表明,该框架能够可靠地从胼胝体 MRI 扫描合成组织学图像,凸显了网络在高分辨率组织学图像与相对低分辨率 MRI 扫描配对数据上训练的能力。以最终避免活检为目标,所提出的工具也可用于教学用途。
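
For reference, a paired MRI-to-histology translator of this kind is typically trained with a pix2pix-style conditional GAN objective; the authors' exact loss may differ, so the following is only the standard formulation:

```latex
\min_G \max_D \;
\mathbb{E}_{x,y}\big[\log D(x,y)\big]
+ \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big]
+ \lambda\,\mathbb{E}_{x,y}\big[\lVert y - G(x)\rVert_1\big]
```

where $x$ is the MRI slice, $y$ the paired histology image, and $\lambda$ weights the reconstruction term.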

Image super-resolution via dynamic network

  • paper_url: http://arxiv.org/abs/2310.10413
  • repo_url: https://github.com/hellloxiaotian/dsrnet
  • paper_authors: Chunwei Tian, Xuanyu Zhang, Qi Zhang, Mingming Yang, Zhaojie Ju
  • for: 这篇论文旨在提出一种动态网络 для图像超解像(DSRNet),以提高图像超解像的准确率和复杂场景下的应用性。
  • methods: 该网络使用了差异增强块、宽增强块、特征细化块和结构块等多种块来提高图像超解像的精度和可靠性。
  • results: 实验结果表明,与传统方法相比,DSRNet能够更好地处理复杂场景下的图像超解像问题,同时具有较低的计算量和可扩展性,适用于移动设备上的实时应用。
    Abstract Convolutional neural networks (CNNs) depend on deep network architectures to extract accurate information for image super-resolution. However, obtained information of these CNNs cannot completely express predicted high-quality images for complex scenes. In this paper, we present a dynamic network for image super-resolution (DSRNet), which contains a residual enhancement block, wide enhancement block, feature refinement block and construction block. The residual enhancement block is composed of a residual enhanced architecture to facilitate hierarchical features for image super-resolution. To enhance robustness of obtained super-resolution model for complex scenes, a wide enhancement block achieves a dynamic architecture to learn more robust information to enhance applicability of an obtained super-resolution model for varying scenes. To prevent interference of components in a wide enhancement block, a refinement block utilizes a stacked architecture to accurately learn obtained features. Also, a residual learning operation is embedded in the refinement block to prevent long-term dependency problem. Finally, a construction block is responsible for reconstructing high-quality images. Designed heterogeneous architecture can not only facilitate richer structural information, but also be lightweight, which is suitable for mobile digital devices. Experimental results shows that our method is more competitive in terms of performance and recovering time of image super-resolution and complexity. The code of DSRNet can be obtained at https://github.com/hellloxiaotian/DSRNet.
    摘要 卷积神经网络(CNN)依靠深度网络架构来提取图像超分辨率所需的准确信息。然而,对于复杂场景,这些 CNN 获取的信息无法完全表达预测的高质量图像。在这篇论文中,我们提出了用于图像超分辨率的动态网络(DSRNet),它包含残差增强块、宽增强块、特征细化块和构建块。残差增强块由残差增强架构组成,以便为图像超分辨率提供层次化特征。为了增强所得超分辨率模型在复杂场景下的鲁棒性,宽增强块采用动态架构学习更鲁棒的信息,以提升模型对不同场景的适用性。为了避免宽增强块中各组件之间的相互干扰,细化块使用堆叠架构来准确学习所获特征;同时在细化块中嵌入残差学习操作,以避免长期依赖问题。最后,构建块负责重建高质量图像。所设计的异构架构不仅能够提供更丰富的结构信息,而且足够轻量,适用于移动数字设备。实验结果表明,我们的方法在图像超分辨率的性能、恢复时间和复杂度方面更具竞争力。DSRNet 的代码可以在 https://github.com/hellloxiaotian/DSRNet 上获取。

Loci-Segmented: Improving Scene Segmentation Learning

  • paper_url: http://arxiv.org/abs/2310.10410
  • repo_url: https://github.com/CognitiveModeling/Loci-Segmented
  • paper_authors: Manuel Traub, Frederic Becker, Adrian Sauter, Sebastian Otte, Martin V. Butz
  • for: 本研究旨在提高场景表示的分割能力,并提出了一种基于槽的处理方法。
  • methods: 本方法提出名为 Loci-Segmented(Loci-s)的场景分割神经网络,它在 Loci(Traub 等,ICLR 2023)框架的基础上有三大改进:(1)加入预训练的动态背景模块;(2)采用以对象为中心的超卷积(hyper-convolution)编码模块;(3)采用级联解码模块,依次生成对象掩码、掩码深度图,以及由深度图引导的掩码 RGB 重建。
  • results: Loci-s 在 MOVi 数据集和另一个已有数据集集合上表现出更优的分割性能,并在 MOVi-E 上将交并比(IoU)较此前最佳方法提升 32%。此外,Loci-s 还能生成可解释性良好的潜在表示,这些表示可作为类似基础模型的可解释基础,用于语言接地以及上下文和目标条件下的事件处理等下游任务。
    Abstract Slot-oriented processing approaches for compositional scene representation have recently undergone a tremendous development. We present Loci-Segmented (Loci-s), an advanced scene segmentation neural network that extends the slot-based location and identity tracking architecture Loci (Traub et al., ICLR 2023). The main advancements are (i) the addition of a pre-trained dynamic background module; (ii) a hyper-convolution encoder module, which enables object-focused bottom-up processing; and (iii) a cascaded decoder module, which successively generates object masks, masked depth maps, and masked, depth-map-informed RGB reconstructions. The background module features the learning of both a foreground identifying module and a background re-generator. We further improve performance via (a) the integration of depth information as well as improved slot assignments via (b) slot-location-entity regularization and (b) a prior segmentation network. Even without these latter improvements, the results reveal superior segmentation performance in the MOVi datasets and in another established dataset collection. With all improvements, Loci-s achieves a 32% better intersection over union (IoU) score in MOVi-E than the previous best. We furthermore show that Loci-s generates well-interpretable latent representations. We believe that these representations may serve as a foundation-model-like interpretable basis for solving downstream tasks, such as grounding language and context- and goal-conditioned event processing.
    摘要 面向组合式场景表示的槽(slot)处理方法近来取得了巨大进展。我们提出 Loci-Segmented(Loci-s),这是一种先进的场景分割神经网络,扩展了基于槽的位置与身份跟踪架构 Loci(Traub 等,ICLR 2023)。主要改进包括:(i)加入了预训练的动态背景模块;(ii)采用以对象为中心、自底向上处理的超卷积编码模块;(iii)采用级联解码模块,依次生成对象掩码、掩码深度图以及由深度图引导的掩码 RGB 重建。背景模块同时学习前景识别模块和背景再生成模块。我们还通过(a)融合深度信息,以及(b)槽-位置-实体正则化和先验分割网络改进槽分配,进一步提升性能。即使不使用后面这些改进,结果也显示其在 MOVi 数据集和另一个已有数据集集合上具有更优的分割性能。加入全部改进后,Loci-s 在 MOVi-E 上的交并比(IoU)比此前最佳方法提高了 32%。我们进一步表明 Loci-s 能生成可解释性良好的潜在表示。我们相信,这些表示可以作为类似基础模型的可解释基础,用于解决语言接地以及上下文和目标条件下的事件处理等下游任务。

A cross Transformer for image denoising

  • paper_url: http://arxiv.org/abs/2310.10408
  • repo_url: https://github.com/hellloxiaotian/ctnet
  • paper_authors: Chunwei Tian, Menghua Zheng, Wangmeng Zuo, Shichao Zhang, Yanning Zhang, Chia-Wen Ling
  • for: 提升复杂场景下的图像去噪效果
  • methods: 使用交叉 Transformer 去噪卷积网络(CTNet),包括串行块(SB)、并行块(PB)和残差块(RB),以获取有效的结构信息,并通过多种交互提升对复杂场景的适应性。
  • results: 在真实与合成图像去噪任务上,CTNet 表现出色,超过了一些流行的去噪方法,并适用于手机等移动数字设备。
    Abstract Deep convolutional neural networks (CNNs) depend on feedforward and feedback ways to obtain good performance in image denoising. However, how to obtain effective structural information via CNNs to efficiently represent given noisy images is key for complex scenes. In this paper, we propose a cross Transformer denoising CNN (CTNet) with a serial block (SB), a parallel block (PB), and a residual block (RB) to obtain clean images for complex scenes. A SB uses an enhanced residual architecture to deeply search structural information for image denoising. To avoid loss of key information, PB uses three heterogeneous networks to implement multiple interactions of multi-level features to broadly search for extra information for improving the adaptability of an obtained denoiser for complex scenes. Also, to improve denoising performance, Transformer mechanisms are embedded into the SB and PB to extract complementary salient features for effectively removing noise in terms of pixel relations. Finally, a RB is applied to acquire clean images. Experiments illustrate that our CTNet is superior to some popular denoising methods in terms of real and synthetic image denoising. It is suitable to mobile digital devices, i.e., phones. Codes can be obtained at https://github.com/hellloxiaotian/CTNet.
    摘要 深度卷积神经网络(CNN)依靠前馈与反馈机制在图像去噪中取得良好性能。然而,如何通过 CNN 获得有效的结构信息、以高效地表示给定的含噪图像,是复杂场景下的关键问题。在这篇论文中,我们提出了一种交叉 Transformer 去噪 CNN(CTNet),包含串行块(SB)、并行块(PB)和残差块(RB),用于为复杂场景获得干净图像。SB 使用增强的残差架构深入挖掘图像去噪所需的结构信息。为了避免关键信息的丢失,PB 使用三个异构网络实现多层级特征的多重交互,广泛搜寻额外信息,以提升所得去噪器对复杂场景的适应性。此外,为了提高去噪性能,SB 和 PB 中嵌入了 Transformer 机制,从像素关系的角度提取互补的显著特征,有效去除噪声。最后,RB 用于重建干净图像。实验表明,我们的 CTNet 在真实与合成图像去噪上优于一些流行的去噪方法,并适用于手机等移动数字设备。代码可以在 https://github.com/hellloxiaotian/CTNet 获取。

LLM4SGG: Large Language Model for Weakly Supervised Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2310.10404
  • repo_url: https://github.com/rlqja1107/torch-LLM4SGG
  • paper_authors: Kibum Kim, Kanghoon Yoon, Jaehyeong Jeon, Yeonjun In, Jinyoung Moon, Donghyun Kim, Chanyoung Park
  • for: 本研究旨在提出一种新的、基于语言模型的弱监督Scene Graph生成方法(LLM4SGG),以解决现有WSSGG方法中的两个问题:1)语义过度简化问题,2)低密度场景图问题。
  • methods: 我们提出一种新的方法,即使用语言模型的语言理解和推理能力来提取caption中的 triplets,并将entity/ predicate类与目标数据进行对齐。为了更好地利用语言模型,我们采用了链式思维和在Context few-shot learning策略。
  • results: 我们在Visual Genome和GQA datasets上进行了广泛的实验,并显示了与现有WSSGG方法相比的显著提高,包括Recall@K和mean Recall@K的提高。此外,LLM4SGG还具有数据效率的优势,可以通过小量的训练图像进行效果iveness的模型训练。
    Abstract Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard, studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. However, they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions, where fine-grained predicates in captions are undesirably converted into coarse-grained predicates, resulting in a long-tailed predicate distribution, and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest, where many triplets are discarded and not used in training, leading to insufficient supervision. To tackle the two issues, we propose a new approach, i.e., Large Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes, we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG, we conduct extensive experiments on Visual Genome and GQA datasets, showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient, enabling effective model training with a small amount of training images.
    摘要 弱监督场景图生成(WSSGG)研究近来兴起,作为严重依赖昂贵标注的全监督方法的替代方案。在这方面,现有 WSSGG 研究利用图像描述获取未定位的三元组,并主要关注如何将这些未定位三元组在图像区域上进行定位。然而,它们忽视了从描述中构造三元组时存在的两个问题:1)语义过度简化问题,即从描述中抽取三元组时,细粒度谓词会被不当地转化为粗粒度谓词,导致谓词分布呈长尾形态;2)低密度场景图问题,即在将描述中的三元组与感兴趣的实体/谓词类别对齐时,大量三元组被丢弃而未用于训练,导致监督不足。为了解决这两个问题,我们提出了一种新方法,即用于弱监督 SGG 的大语言模型(LLM4SGG):在从描述中抽取三元组以及将实体/谓词类别与目标数据对齐的过程中,借助 LLM 对语言的深入理解与推理能力来缓解上述问题。为了进一步发挥 LLM 的作用,我们采用了思维链(Chain-of-Thought)思想与上下文内少样本学习策略。为验证 LLM4SGG 的有效性,我们在 Visual Genome 和 GQA 数据集上进行了大量实验,结果显示其在 Recall@K 与 mean Recall@K 上均显著优于现有最先进的 WSSGG 方法。另一个优点是 LLM4SGG 具有数据高效性,仅用少量训练图像即可有效训练模型。

Enhanced Edge-Perceptual Guided Image Filtering

  • paper_url: http://arxiv.org/abs/2310.10387
  • repo_url: None
  • paper_authors: Jinyu Li
  • for: 解决引导图像滤波及其改进版本中存在的光晕伪影,以及引导图与输入图结构不一致时边缘保持能力退化的问题
  • methods: 提出一种融合显式一阶边缘保护约束与显式残差约束的新型引导图像滤波器
  • results: 在单幅图像细节增强、多尺度曝光融合和高光谱图像分类等典型应用中,所提滤波器展现出更强的边缘保持能力,并得到理论分析与实验结果的双重验证
    Abstract Due to the powerful edge-preserving ability and low computational complexity, Guided image filter (GIF) and its improved versions has been widely applied in computer vision and image processing. However, all of them are suffered halo artifacts to some degree, as the regularization parameter increase. In the case of inconsistent structure of guidance image and input image, edge-preserving ability degradation will also happen. In this paper, a novel guided image filter is proposed by integrating an explicit first-order edge-protect constraint and an explicit residual constraint which will improve the edge-preserving ability in both cases. To illustrate the efficiency of the proposed filter, the performances are shown in some typical applications, which are single image detail enhancement, multi-scale exposure fusion, hyper spectral images classification. Both theoretical analysis and experimental results prove that the powerful edge-preserving ability of the proposed filter.
    摘要 由于具有强大的边缘保持能力和较低的计算复杂度,引导图像滤波(GIF)及其改进版本已被广泛应用于计算机视觉和图像处理。然而,随着正则化参数的增大,它们都会在不同程度上产生光晕伪影;当引导图像与输入图像的结构不一致时,边缘保持能力也会退化。本文提出一种新的引导图像滤波器,通过引入显式的一阶边缘保护约束和显式的残差约束,在上述两种情况下均能提升边缘保持能力。为了说明所提滤波器的有效性,我们在单幅图像细节增强、多尺度曝光融合和高光谱图像分类等典型应用中展示了其性能。理论分析与实验结果均证明了所提滤波器强大的边缘保持能力。
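
For context, the baseline guided image filter (He et al.) that the enhanced edge-perceptual variant builds on can be written in a few lines. The sketch below is that standard formulation with a cumulative-sum box filter, not the paper's new first-order edge-protect or residual constraints.

```python
# Classical guided image filter: q = mean(a) * I + mean(b), with
# a = cov(I, p) / (var(I) + eps), b = mean(p) - a * mean(I).
import numpy as np

def box(x, r):
    """Mean filter with a (2r+1)^2 window via summed-area tables (edge-padded)."""
    xp = np.pad(x, r, mode="edge")
    c = np.cumsum(np.cumsum(xp, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))               # zero row/col so sums are easy
    h, w = x.shape
    s = (c[2*r+1:2*r+1+h, 2*r+1:2*r+1+w] - c[:h, 2*r+1:2*r+1+w]
         - c[2*r+1:2*r+1+h, :w] + c[:h, :w])
    return s / (2*r + 1) ** 2

def guided_filter(I, p, r=8, eps=1e-3):
    mean_I, mean_p = box(I, r), box(p, r)
    cov_Ip = box(I * p, r) - mean_I * mean_p
    var_I = box(I * I, r) - mean_I ** 2
    a = cov_Ip / (var_I + eps)                    # eps is the regularization parameter
    b = mean_p - a * mean_I
    return box(a, r) * I + box(b, r)

I = np.random.default_rng(0).random((128, 128))
q = guided_filter(I, I + 0.05 * np.random.default_rng(1).standard_normal((128, 128)))
print(q.shape)
```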

Looping LOCI: Developing Object Permanence from Videos

  • paper_url: http://arxiv.org/abs/2310.10372
  • repo_url: None
  • paper_authors: Manuel Traub, Frederic Becker, Sebastian Otte, Martin V. Butz
  • for: 本研究旨在改进组合式场景表示学习中的分割与跟踪方法,使其能够更好地处理部分可见乃至完全遮挡的对象,并通过直觉物理测试。
  • methods: 本研究提出 Loci-Looped 算法:在 Loci 神经网络架构中加入内部处理循环,自适应地将像素空间信息与预测信息融合为感知活动,并学习单个对象动力学及对象间交互动力学的组合式表示。
  • results: Loci-Looped 能在对象被长时间遮挡时持续跟踪对象,模拟其隐藏轨迹并预测其重新出现,而无需显式的历史缓存;在出现对象遮挡或感知数据暂时中断时,它在 ADEPT 和 CLEVRER 数据集上优于现有最先进模型,表明其能够以完全无监督的方式自发学习物体恒存性与惯性等物理概念。
    Abstract Recent compositional scene representation learning models have become remarkably good in segmenting and tracking distinct objects within visual scenes. Yet, many of these models require that objects are continuously, at least partially, visible. Moreover, they tend to fail on intuitive physics tests, which infants learn to solve over the first months of their life. Our goal is to advance compositional scene representation algorithms with an embedded algorithm that fosters the progressive learning of intuitive physics, akin to infant development. As a fundamental component for such an algorithm, we introduce Loci-Looped, which advances a recently published unsupervised object location, identification, and tracking neural network architecture (Loci, Traub et al., ICLR 2023) with an internal processing loop. The loop is designed to adaptively blend pixel-space information with anticipations yielding information-fused activities as percepts. Moreover, it is designed to learn compositional representations of both individual object dynamics and between-objects interaction dynamics. We show that Loci-Looped learns to track objects through extended periods of object occlusions, indeed simulating their hidden trajectories and anticipating their reappearance, without the need for an explicit history buffer. We even find that Loci-Looped surpasses state-of-the-art models on the ADEPT and the CLEVRER dataset, when confronted with object occlusions or temporary sensory data interruptions. This indicates that Loci-Looped is able to learn the physical concepts of object permanence and inertia in a fully unsupervised emergent manner. We believe that even further architectural advancements of the internal loop - also in other compositional scene representation learning models - can be developed in the near future.
    摘要 近来的组合式场景表示学习模型在分割与跟踪视觉场景中的不同对象方面已经相当出色。然而,其中许多模型要求对象持续地(至少部分地)可见,并且往往无法通过婴儿在出生后最初几个月就能学会的直觉物理测试。我们的目标是为组合式场景表示算法引入一种内嵌机制,促使其像婴儿发展那样逐步学习直觉物理。作为这种算法的基础组件,我们提出 Loci-Looped:它在最近发表的无监督对象定位、识别与跟踪神经网络架构 Loci(Traub 等,ICLR 2023)中加入了内部处理循环。该循环被设计为自适应地融合像素空间信息与预测信息,产生信息融合后的感知活动,并学习单个对象动力学及对象间交互动力学的组合式表示。我们展示了 Loci-Looped 能在对象长时间被遮挡的情况下持续跟踪对象,模拟其隐藏轨迹并预测其重新出现,而无需显式的历史缓存。我们甚至发现,在面对对象遮挡或感知数据暂时中断时,Loci-Looped 在 ADEPT 和 CLEVRER 数据集上超越了现有最先进模型。这表明 Loci-Looped 能够以完全无监督、自发涌现的方式学习物体恒存性与惯性等物理概念。我们相信,对内部循环的进一步架构改进(也包括其他组合式场景表示学习模型)有望在不久的将来实现。

Camera-LiDAR Fusion with Latent Contact for Place Recognition in Challenging Cross-Scenes

  • paper_url: http://arxiv.org/abs/2310.10371
  • repo_url: None
  • paper_authors: Yan Pan, Jiapeng Xie, Jiajie Wu, Bo Zhou
  • for: 本文是为了解决在视角变化、季节变化和场景变换等环境下实现地点认知而写的。
  • methods: 本文使用了一种新的三通道地点描述器,包括图像、点云和融合分支。图像和点云之间的相互关系被利用,以实现信息互动和融合。
  • results: 在 KITTI、NCLT、USVInland 和校园数据集上的大量实验表明,所提出的地点描述符达到了最先进水平,并在具有挑战性的场景中表现出良好的鲁棒性与泛化能力。
    Abstract Although significant progress has been made, achieving place recognition in environments with perspective changes, seasonal variations, and scene transformations remains challenging. Relying solely on perception information from a single sensor is insufficient to address these issues. Recognizing the complementarity between cameras and LiDAR, multi-modal fusion methods have attracted attention. To address the information waste in existing multi-modal fusion works, this paper introduces a novel three-channel place descriptor, which consists of a cascade of image, point cloud, and fusion branches. Specifically, the fusion-based branch employs a dual-stage pipeline, leveraging the correlation between the two modalities with latent contacts, thereby facilitating information interaction and fusion. Extensive experiments on the KITTI, NCLT, USVInland, and the campus dataset demonstrate that the proposed place descriptor stands as the state-of-the-art approach, confirming its robustness and generality in challenging scenarios.
    摘要 尽管已经取得了很大进步,在视角变化、季节变化和场景变换等环境下实现地点识别仍然充满挑战。仅依赖单一传感器的感知信息不足以解决这些问题。鉴于相机与激光雷达之间的互补性,多模态融合方法受到了关注。为了解决现有多模态融合工作中的信息浪费问题,本文提出了一种新颖的三通道地点描述符,由图像分支、点云分支和融合分支级联组成。其中,融合分支采用两阶段流程,利用两种模态之间的潜在关联,促进信息交互与融合。在 KITTI、NCLT、USVInland 和校园数据集上的大量实验表明,所提出的地点描述符达到了最先进水平,在具有挑战性的场景中展现了鲁棒性与泛化能力。

Multimodal Object Query Initialization for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.10353
  • repo_url: None
  • paper_authors: Mathijs R. van Geerenstein, Felicia Ruppel, Klaus Dietmayer, Dariu M. Gavrila
  • for: 本研究旨在改进基于 LiDAR 与摄像头的对象查询初始化,以提升 3D 物体检测模型的性能。
  • methods: 我们提出了一种高效、模块化、多模态的对象查询初始化方法,使查询可以基于多种传感器输入进行初始化,并结合“模态均衡”的 Transformer 解码器。
  • results: 我们在 nuScenes 基准上进行实验并与现有方法比较,结果显示我们的方法在 LiDAR-相机输入下取得了更高的性能,且比现有的 LiDAR-相机初始化方案更高效;该方法还可适用于任意传感器输入组合。
    Abstract 3D object detection models that exploit both LiDAR and camera sensor features are top performers in large-scale autonomous driving benchmarks. A transformer is a popular network architecture used for this task, in which so-called object queries act as candidate objects. Initializing these object queries based on current sensor inputs is a common practice. For this, existing methods strongly rely on LiDAR data however, and do not fully exploit image features. Besides, they introduce significant latency. To overcome these limitations we propose EfficientQ3M, an efficient, modular, and multimodal solution for object query initialization for transformer-based 3D object detection models. The proposed initialization method is combined with a "modality-balanced" transformer decoder where the queries can access all sensor modalities throughout the decoder. In experiments, we outperform the state of the art in transformer-based LiDAR object detection on the competitive nuScenes benchmark and showcase the benefits of input-dependent multimodal query initialization, while being more efficient than the available alternatives for LiDAR-camera initialization. The proposed method can be applied with any combination of sensor modalities as input, demonstrating its modularity.
    摘要 三维物体探测模型,利用激光和相机感知器件特点,在大规模自动驾驶benchmark中表现出色。 transformer是一种广泛使用的网络架构,在这种情况下,被称为“对象查询”的对象被当作候选对象。现有方法通常基于现有的激光数据进行初始化,但是不充分利用图像特征。此外,它们也会增加显著的延迟。为了解决这些限制,我们提出了高效的EfficientQ3M方法,用于初始化转换器基于三维对象探测模型中的对象查询。我们的初始化方法与“多感器均衡”转换器解码器结合使用,使得查询可以在解码器中访问所有感知模式。在实验中,我们超越了现有的转换器基于LiDAR对象探测模型的状态,在competitive nuScenes benchmark上表现出色,并示出了输入具有multimodal查询初始化的优势,同时更高效于现有的LiDAR-camera初始化方法。该方法可以针对任何感知模式进行输入,表明其模块性。

Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

  • paper_url: http://arxiv.org/abs/2310.10352
  • repo_url: https://github.com/cha15yq/MRC-Crowd
  • paper_authors: Yifei Qian, Xiaopeng Hong, Ognjen Arandjelović, Zhongliang Guo, Carl R. Donovan
  • for: 增强人群计数模型的可靠性和准确性,提高模型在受限数据量时的泛化能力。
  • methods: 基于 mean teacher 框架,对无标签数据进行掩码(masking)处理,引导模型依靠整体线索推断被遮挡区域,从而像人类认知那样对人群场景形成整体理解;同时引入细粒度密度分类任务以辅助特征学习。
  • results: 模型在具有挑战性的基准(如 ShanghaiTech A 和 UCF-QNRF)上表现出色,大幅超越此前方法;并且表现出类似“速算”(subitizing)的行为:对低密度区域“一瞥”即可准确计数,对高密度区域则结合局部细节进行计数。
    Abstract To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn't have strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits a 'subitizing'-like behavior. It accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.
    摘要 为了减轻训练可靠人群计数模型所需的繁重标注负担,使模型能够受益于更多数据、从而更实用且更准确,本文提出了一种基于 mean teacher 框架的新型半监督方法。在标注数据稀缺时,模型容易对局部图块过拟合;在这种情况下,仅依靠无标签数据提升局部图块预测精度的传统做法并不充分。因此,我们提出了一种更细致的思路:培养模型内在的“速算”(subitizing)能力,使其能够借助对人群场景的整体理解来估计区域内的人数,类似于人类的认知过程。为此,我们对无标签数据施加掩码,引导模型依据整体线索预测被掩盖图块的内容。此外,为了辅助特征学习,我们还引入了细粒度密度分类任务。我们的方法具有通用性,可应用于大多数现有的人群计数方法,因为它没有严格的结构或损失约束。另外,我们观察到使用该框架训练的模型表现出类似“速算”的行为:对低密度区域“一瞥”即可准确预测,而对高密度区域则结合局部细节进行预测。我们的方法在 ShanghaiTech A 和 UCF-QNRF 等具有挑战性的基准上取得了最先进的性能,大幅超越此前方法。代码见:https://github.com/cha15yq/MRC-Crowd。
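
A hedged sketch of the masking step on unlabeled crowd images: a random subset of patches is hidden so that the model must predict their content from holistic context. Only the masking utility is shown; the patch size and masking ratio are illustrative assumptions.

```python
# Randomly mask square patches of an unlabeled image (masking utility only).
import numpy as np

def mask_patches(img, patch=32, ratio=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    out, mask = img.copy(), np.zeros(img.shape[:2], dtype=bool)
    gh, gw = img.shape[0] // patch, img.shape[1] // patch
    ids = rng.choice(gh * gw, size=int(ratio * gh * gw), replace=False)
    for k in ids:
        y, x = (k // gw) * patch, (k % gw) * patch
        out[y:y + patch, x:x + patch] = 0     # hide the patch
        mask[y:y + patch, x:x + patch] = True
    return out, mask

img = np.random.default_rng(0).random((256, 256, 3))
masked, m = mask_patches(img)
print(masked.shape, f"{m.mean():.0%} of pixels masked")
```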

ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion

  • paper_url: http://arxiv.org/abs/2310.10343
  • repo_url: https://github.com/jiayuyang/consistnet
  • paper_authors: Jiayu Yang, Ziang Cheng, Yunfei Duan, Pan Ji, Hongdong Li
  • for: 这篇论文的目的是提出一种能够生成多个不同视角的图像,同时保持3D(多视图)一致性的方法。
  • methods: 该方法的核心是一个多视图一致性模块,它依据多视图几何原理,在多个单视图扩散(去噪)过程之间交换信息,从而协调多个单视图特征。
  • results: 该方法可以轻松地插入预训练的LDMs(卷积神经网络),不需要显式的像素对应关系或深度预测。实验显示,该方法可以在40秒内在单个A100 GPU上生成16个不同视角的图像,并且能够有效地学习3D一致性。
    Abstract Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that is able to generate multiple images of the same object, as if seen they are captured from different viewpoints, while the 3D (multi-view) consistencies among those multiple generated images are effectively exploited. Central to our method is a multi-view consistency block which enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model, and consists of two sub-modules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infer consistency, and (b) a ray aggregation module that samples and aggregate 3D consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped-in pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU. Our code will be made available on https://github.com/JiayuYANG/ConsistNet
    摘要 给定一张 3D 物体的单张图像,本文提出了一种名为 ConsistNet 的新方法,能够生成同一物体在不同视角下的多张图像,并有效利用这些生成图像之间的 3D(多视图)一致性。方法的核心是一个多视图一致性模块,它基于多视图几何原理,使多个单视图扩散过程之间能够交换信息。ConsistNet 是对标准潜在扩散模型的扩展,包含两个子模块:(a)视图聚合模块,将多视图特征反投影到全局 3D 体中并推断一致性;(b)射线聚合模块,对 3D 一致特征进行采样并聚合回各个视图,以强制一致性。与以往多视图图像生成方法不同,我们的方法可以直接嵌入预训练的潜在扩散模型,无需显式的像素对应或深度预测。实验表明,该方法能够在冻结的 Zero123 骨干网络上有效学习 3D 一致性,并可在单块 A100 GPU 上于 40 秒内生成物体周围的 16 个视图。代码将在 https://github.com/JiayuYANG/ConsistNet 上公开。

Scene Graph Conditioning in Latent Diffusion

  • paper_url: http://arxiv.org/abs/2310.10338
  • repo_url: https://github.com/frankfundel/sgcond
  • paper_authors: Frank Fundel
  • for: 这篇论文旨在为基于文本提示的扩散模型提供更精细的语义控制,从而更精确地生成图像。
  • methods: 论文利用 ControlNet 与 Gated Self-Attention 等多种方法,以场景图作为条件对大规模扩散模型进行微调,以实现精确的图像生成。
  • results: 实验表明,使用所提方法能够从场景图生成质量更高的图像,超过了此前的方法。
    Abstract Diffusion models excel in image generation but lack detailed semantic control using text prompts. Additional techniques have been developed to address this limitation. However, conditioning diffusion models solely on text-based descriptions is challenging due to ambiguity and lack of structure. In contrast, scene graphs offer a more precise representation of image content, making them superior for fine-grained control and accurate synthesis in image generation models. The amount of image and scene-graph data is sparse, which makes fine-tuning large diffusion models challenging. We propose multiple approaches to tackle this problem using ControlNet and Gated Self-Attention. We were able to show that using out proposed methods it is possible to generate images from scene graphs with much higher quality, outperforming previous methods. Our source code is publicly available on https://github.com/FrankFundel/SGCond
    摘要 扩散模型在图像生成方面表现出色,但仅凭文本提示难以实现细粒度的语义控制。为了解决这一限制,已有一些附加技术被提出。然而,仅以文本描述作为条件来约束扩散模型仍然困难,因为文本存在歧义且缺乏结构。相比之下,场景图能够更精确地表示图像内容,因而更适合在图像生成模型中实现细粒度控制和精确合成。同时,图像与场景图配对数据十分稀缺,这使得微调大型扩散模型颇具挑战。我们提出了基于 ControlNet 和 Gated Self-Attention 的多种方法来解决这一问题。我们证明,使用所提方法能够从场景图生成质量远高于此前方法的图像。源代码公开在 https://github.com/FrankFundel/SGCond。

Towards image compression with perfect realism at ultra-low bitrates

  • paper_url: http://arxiv.org/abs/2310.10325
  • repo_url: None
  • paper_authors: Marlène Careil, Matthew J. Muckley, Jakob Verbeek, Stéphane Lathuilière
  • for: 提升压缩图像的感知质量,并降低重建质量对比特率的依赖
  • methods: 使用迭代扩散模型代替以 MSE 或 LPIPS 失真训练的前馈解码器,并以向量量化的图像表示和全局文本描述作为条件提供额外上下文
  • results: 在超低比特率(低至 0.003 比特/像素)下仍可重建真实感很强的图像,在 FID 与 KID 指标上达到最先进的视觉质量,且视觉质量对比特率的依赖低于此前方法
    Abstract Image codecs are typically optimized to trade-off bitrate vs, distortion metrics. At low bitrates, this leads to compression artefacts which are easily perceptible, even when training with perceptual or adversarial losses. To improve image quality, and to make it less dependent on the bitrate, we propose to decode with iterative diffusion models, instead of feed-forward decoders trained using MSE or LPIPS distortions used in most neural codecs. In addition to conditioning the model on a vector-quantized image representation, we also condition on a global textual image description to provide additional context. We dub our model PerCo for 'perceptual compression', and compare it to state-of-the-art codecs at rates from 0.1 down to 0.003 bits per pixel. The latter rate is an order of magnitude smaller than those considered in most prior work. At this bitrate a 512x768 Kodak image is encoded in less than 153 bytes. Despite this ultra-low bitrate, our approach maintains the ability to reconstruct realistic images. We find that our model leads to reconstructions with state-of-the-art visual quality as measured by FID and KID, and that the visual quality is less dependent on the bitrate than previous methods.
    摘要 图像编码器通常是进行比特率vs扭曲指标的优化的。在低比特率下,这会导致压缩artefacts,即使在使用感知或敌对损失进行训练。为了改进图像质量并使其不受比特率的影响,我们提议使用迭代扩散模型进行解码,而不是使用MSE或LPIPS损失来训练Feed-forward decoder。此外,我们还conditioning the model on a vector-quantized image representation和global文本描述来提供额外的 контекст。我们称我们的模型为PerCo,用于'感知压缩',并与当前的编码器进行比较。我们的模型在比特率从0.1下到0.003比特每像素进行比较,其中0.003比特每像素是在大多数先前工作中考虑的一个次数。在这个比特率下,我们可以将512x768像素的Kodak图像编码为 less than 153字节。尽管我们的比特率非常低,但我们的方法可以保持实际的图像重建。我们发现我们的模型可以在FID和KID指标下达到状态泰ometer的视觉质量,并且这种视觉质量与比特率相对较少受到影响。

Multi-Body Neural Scene Flow

  • paper_url: http://arxiv.org/abs/2310.10301
  • repo_url: https://github.com/kavisha725/MBNSF
  • paper_authors: Kavisha Vidanapathirana, Shin-Fang Chng, Xueqian Li, Simon Lucey
  • for: 本研究旨在提高Scene Flow的测试时优化,使其能够更好地处理实际世界数据中的多体刚体运动。
  • methods: 作者提出了一种基于坐标网络的神经网络优化方法,通过正则化Scene Flow预测中的流体平滑性来捕捉通用运动。此外,作者还引入了一种基于流体尺度的正则项,以便在多体刚体运动中保持流体场的连续性。
  • results: 作者在实际数据上进行了广泛的实验,并证明了他们的方法能够超过当前最佳的3D Scene Flow和长期点云轨迹预测。Code available at https://github.com/kavisha725/MBNSF.
    Abstract The test-time optimization of scene flow - using a coordinate network as a neural prior - has gained popularity due to its simplicity, lack of dataset bias, and state-of-the-art performance. We observe, however, that although coordinate networks capture general motions by implicitly regularizing the scene flow predictions to be spatially smooth, the neural prior by itself is unable to identify the underlying multi-body rigid motions present in real-world data. To address this, we show that multi-body rigidity can be achieved without the cumbersome and brittle strategy of constraining the $SE(3)$ parameters of each rigid body as done in previous works. This is achieved by regularizing the scene flow optimization to encourage isometry in flow predictions for rigid bodies. This strategy enables multi-body rigidity in scene flow while maintaining a continuous flow field, hence allowing dense long-term scene flow integration across a sequence of point clouds. We conduct extensive experiments on real-world datasets and demonstrate that our approach outperforms the state-of-the-art in 3D scene flow and long-term point-wise 4D trajectory prediction. The code is available at: \href{https://github.com/kavisha725/MBNSF}{https://github.com/kavisha725/MBNSF}.
    摘要 scene flow 测试时优化 - 使用坐标网络作为神经网络先验 - 在过去几年中变得越来越流行,这是因为它的简单性、不受数据偏见和现在的表现水平都很高。但我们发现,即使坐标网络可以捕捉一般运动的概念,但神经网络本身无法直接捕捉真实世界数据中的多体刚性运动。为解决这个问题,我们表明了一种不需要干扰和脆弱的策略,即在场景流优化中规范化流预测以促进刚性。这种策略允许场景流中的多体刚性,同时保持连续的流场,因此允许长期场景流集成。我们对实际数据进行了广泛的实验,并证明了我们的方法在3D场景流和长期点云轨迹预测中超越了现有的状态艺术。代码可以在:上下载。
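The key regularisation idea, encouraging the predicted flow to be isometric rather than constraining per-body SE(3) parameters, can be sketched as a penalty on changes in pairwise point distances. The pair-sampling strategy and weighting below are placeholders (real implementations typically restrict pairs to spatial neighbours so that only local, per-body rigidity is enforced); this is a sketch, not the paper's exact loss.

```python
import torch

def isometry_loss(points, flow, k_pairs=2048):
    """Encourage rigid (isometric) flow: distances between randomly sampled
    point pairs should be preserved after adding the predicted flow.
    points, flow: (N, 3) tensors. In practice pairs would be restricted to
    nearby points so that only local rigidity is enforced."""
    n = points.shape[0]
    i = torch.randint(0, n, (k_pairs,))
    j = torch.randint(0, n, (k_pairs,))
    d_before = (points[i] - points[j]).norm(dim=-1)
    warped = points + flow
    d_after = (warped[i] - warped[j]).norm(dim=-1)
    return (d_before - d_after).abs().mean()

# toy usage: a pure translation is perfectly isometric, so the loss is ~0
pts = torch.randn(5000, 3)
print(isometry_loss(pts, torch.ones_like(pts) * 0.5))   # ~0
print(isometry_loss(pts, torch.randn_like(pts) * 0.1))  # > 0
```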

Effortless Cross-Platform Video Codec: A Codebook-Based Method

  • paper_url: http://arxiv.org/abs/2310.10292
  • repo_url: None
  • paper_authors: Kuan Tian, Yonghang Guan, Jinxi Xiang, Jun Zhang, Xiao Han, Wei Yang
  • for: 提高视频编码器的环境灵活性和计算效率
  • methods: 基于码ebook的视频编码框架,使用Conditional Cross-Attention模块取得帧之间的上下文
  • results: 实验结果显示,我们的方法可以超越传统的H.265(中)编码器,无需任何Entropy约束,同时具备跨平台性
    Abstract Under certain circumstances, advanced neural video codecs can surpass the most complex traditional codecs in their rate-distortion (RD) performance. One of the main reasons for the high performance of existing neural video codecs is the use of the entropy model, which can provide more accurate probability distribution estimations for compressing the latents. This also implies the rigorous requirement that entropy models running on different platforms should use consistent distribution estimations. However, in cross-platform scenarios, entropy models running on different platforms usually yield inconsistent probability distribution estimations due to floating point computation errors that are platform-dependent, which can cause the decoding side to fail in correctly decoding the compressed bitstream sent by the encoding side. In this paper, we propose a cross-platform video compression framework based on codebooks, which avoids autoregressive entropy modeling and achieves video compression by transmitting the index sequence of the codebooks. Moreover, instead of using optical flow for context alignment, we propose to use the conditional cross-attention module to obtain the context between frames. Due to the absence of autoregressive modeling and optical flow alignment, we can design an extremely minimalist framework that can greatly benefit computational efficiency. Importantly, our framework no longer contains any distribution estimation modules for entropy modeling, and thus computations across platforms are not necessarily consistent. Experimental results show that our method can outperform the traditional H.265 (medium) even without any entropy constraints, while achieving the cross-platform property intrinsically.
    摘要 在某些情况下,高级神经视频编码器可以超越最复杂的传统编码器在比特率-损失(RD)性能方面。主要的原因是使用Entropy模型,可以提供更准确的概率分布估计,用于压缩缓冲。然而,在跨平台场景下,运行于不同平台的Entropy模型通常会产生不一致的概率分布估计,因为计算机中的浮点数计算错误是平台相关的,这会导致解码器无法正确地解码编码器发送的压缩位流。在本文中,我们提出了基于codebooks的跨平台视频压缩框架,不使用潮流模型和相关适应模块,而是通过传输编码器序列的index来实现压缩。另外,我们提出了基于条件cross-attention模块来获取帧之间的上下文。由于不使用潮流模型和相关适应模块,我们可以设计一个极其简洁的框架,可以大幅提高计算效率。重要的是,我们的框架不再包含任何分布估计模块,因此在不同平台上的计算是不一致的。实验结果表明,我们的方法可以在不使用Entropy约束下,超越传统H.265(中)的RD性能,同时实现跨平台性特性。
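A minimal sketch of the context-gathering step described above: tokens of the current frame attend to tokens of a reference frame with standard cross-attention, replacing explicit optical-flow alignment. The dimensions and the residual fusion are illustrative assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class FrameCrossAttention(nn.Module):
    """Current-frame features attend to reference-frame features to gather
    temporal context, standing in for explicit optical-flow alignment."""
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, cur_tokens, ref_tokens):
        # cur_tokens: (B, N, dim), ref_tokens: (B, M, dim)
        ctx, _ = self.attn(self.norm_q(cur_tokens),
                           self.norm_kv(ref_tokens),
                           self.norm_kv(ref_tokens))
        return cur_tokens + ctx  # residual fusion of temporal context

cur = torch.randn(1, 24 * 24, 256)   # e.g. a flattened 24x24 feature map
ref = torch.randn(1, 24 * 24, 256)
print(FrameCrossAttention()(cur, ref).shape)  # torch.Size([1, 576, 256])
```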

Towards Open-World Co-Salient Object Detection with Generative Uncertainty-aware Group Selective Exchange-Masking

  • paper_url: http://arxiv.org/abs/2310.10264
  • repo_url: https://github.com/wuyang98/CoSOD
  • paper_authors: Yang Wu, Shenglong Hu, Huihui Song, Kaihua Zhang, Bo Liu, Dong Liu
  • for: 提高CoSOD模型在开放世界场景下的Robustness。
  • methods: 引入集选择交换掩码(GSEM)方法,使用混合度量选择图像,并使用变量量生成器和CoSOD变换分支模型。
  • results: 提出了一种基于变量量生成器和CoSOD变换分支模型的Robust CoSOD方法,并在三个开放世界 benchmark dataset上进行了实验,证明了方法的有效性和实用性。
    Abstract The traditional definition of co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images. This definition is based on an assumption of group consensus consistency that is not always reasonable in the open-world setting, which results in robustness issue in the model when dealing with irrelevant images in the inputting image group under the open-word scenarios. To tackle this problem, we introduce a group selective exchange-masking (GSEM) approach for enhancing the robustness of the CoSOD model. GSEM takes two groups of images as input, each containing different types of salient objects. Based on the mixed metric we designed, GSEM selects a subset of images from each group using a novel learning-based strategy, then the selected images are exchanged. To simultaneously consider the uncertainty introduced by irrelevant images and the consensus features of the remaining relevant images in the group, we designed a latent variable generator branch and CoSOD transformer branch. The former is composed of a vector quantised-variational autoencoder to generate stochastic global variables that model uncertainty. The latter is designed to capture correlation-based local features that include group consensus. Finally, the outputs of the two branches are merged and passed to a transformer-based decoder to generate robust predictions. Taking into account that there are currently no benchmark datasets specifically designed for open-world scenarios, we constructed three open-world benchmark datasets, namely OWCoSal, OWCoSOD, and OWCoCA, based on existing datasets. By breaking the group-consistency assumption, these datasets provide effective simulations of real-world scenarios and can better evaluate the robustness and practicality of models.
    摘要 传统上,co-salient object detection(CoSOD)任务的定义是将相同的焦点对象在多个相关图像中分割。这个定义基于了群体一致性的假设,这并不总是在开放世界场景下合理的,这会导致模型在处理无关图像时出现Robustness问题。为解决这个问题,我们提出了群选择交换掩码(GSEM)方法,用于增强CoSOD模型的Robustness。GSEM使用两组图像作为输入,每组图像含有不同类型的焦点对象。基于我们定义的混合度量,GSEM选择每组图像的一部分图像,然后将这些图像交换。为同时考虑无关图像引入的不确定性和剩下相关图像的协同特征,我们设计了隐藏变量生成分支和CoSOD变换分支。前者由vector quantized-variational autoencoder组成,用于生成随机全球变量,模型不确定性。后者是为了捕捉协同特征,包括群体一致性。最后,两个分支的输出被 merge,并传递到基于变换器的解码器,以生成Robust的预测。考虑到目前没有特定于开放世界场景的准确数据集,我们构建了三个开放世界数据集,namely OWCoSal、OWCoSOD和OWCoCA,基于现有数据集。由于这些数据集破坏了群体一致性假设,它们可以更好地模拟实际场景,并且可以更好地评估模型的Robustness和实用性。

Long-term Dependency for 3D Reconstruction of Freehand Ultrasound Without External Tracker

  • paper_url: http://arxiv.org/abs/2310.10248
  • repo_url: https://github.com/ucl-candi/freehand
  • paper_authors: Qi Li, Ziyi Shen, Qian Li, Dean C. Barratt, Thomas Dowrick, Matthew J. Clarkson, Tom Vercauteren, Yipeng Hu
  • for: 定义新的方法来嵌入长期依赖性,并评估其性能。
  • methods: 使用序列模型与多个变数预测来编码长期依赖性,并提出两个依赖因子(体部图像内容和扫描协议)以推广精准重建。
  • results: 1) 添加长期依赖性可以提高重建精度,并且随序列长度、变数间隔和扫描协议而变化。2) 对于训练中的体部或协议方差的降低,对重建精度产生负面影响。
    Abstract Objective: Reconstructing freehand ultrasound in 3D without any external tracker has been a long-standing challenge in ultrasound-assisted procedures. We aim to define new ways of parameterising long-term dependencies, and evaluate the performance. Methods: First, long-term dependency is encoded by transformation positions within a frame sequence. This is achieved by combining a sequence model with a multi-transformation prediction. Second, two dependency factors are proposed, anatomical image content and scanning protocol, for contributing towards accurate reconstruction. Each factor is quantified experimentally by reducing respective training variances. Results: 1) The added long-term dependency up to 400 frames at 20 frames per second (fps) indeed improved reconstruction, with an up to 82.4% lowered accumulated error, compared with the baseline performance. The improvement was found to be dependent on sequence length, transformation interval and scanning protocol and, unexpectedly, not on the use of recurrent networks with long-short term modules; 2) Decreasing either anatomical or protocol variance in training led to poorer reconstruction accuracy. Interestingly, greater performance was gained from representative protocol patterns, than from representative anatomical features. Conclusion: The proposed algorithm uses hyperparameter tuning to effectively utilise long-term dependency. The proposed dependency factors are of practical significance in collecting diverse training data, regulating scanning protocols and developing efficient networks. Significance: The proposed new methodology with publicly available volunteer data and code for parametersing the long-term dependency, experimentally shown to be valid sources of performance improvement, which could potentially lead to better model development and practical optimisation of the reconstruction application.
    摘要 目标:无需外部跟踪器,在ultrasound-assisted程序中自由手写三维重建问题已经是长期的挑战。我们想要定义新的方法来parameterize长期依赖关系,并评估其性能。方法:首先,通过将长期依赖关系编码为帧序列中的变换位置,使用序列模型和多变换预测结合。其次,我们提出了两个依赖因素,一是解剖学图像内容,二是扫描协议。每个因素都是通过实验量化训练方差来评估。结果:1)在400帧内的20帧/秒(fps)加入长期依赖关系后,重建精度显著提高,相比基eline性能,下降82.4%的累积错误。这种改进与序列长度、变换间隔和扫描协议有关,不同于使用循环网络long-short term模块。2)在训练中降低解剖学或协议方差可以得到更差的重建精度。意外地,更多的表现是由代表协议模式获得的,而不是由解剖学特征获得。结论:我们的算法使用了适当的hyperparameter调整,以利用长期依赖关系。我们提出的依赖因素对于收集多样化的训练数据、调整扫描协议和开发高效的网络是实际上的有用。意义:我们的新方法ологи在公共可用的志愿者数据和代码中实现了参数化长期依赖关系,实验证明了这些方法的有效性,这可能会导致更好的模型开发和实用优化重建应用。

Mask wearing object detection algorithm based on improved YOLOv5

  • paper_url: http://arxiv.org/abs/2310.10245
  • repo_url: None
  • paper_authors: Peng Wen, Junhu Zhang, Haitao Li
  • for: 本研究旨在提出一种基于YOLOv5l的面Mask检测模型,以提高公共场合人员戴Mask的检测精度。
  • methods: 本研究使用Multi-Head Attentional Self-Convolution和Swin Transformer Block以提高模型的敏捷度和准确率。此外,我们还提出了I-CBAM模块以提高目标检测精度。
  • results: 在MASK数据集上进行实验,我们的模型比YOLOv5l模型提高了1.1%的mAP(0.5)和1.3%的mAP(0.5:0.95)。这表明我们的提案可以显著提高面Mask检测的精度。
    Abstract Wearing a mask is one of the important measures to prevent infectious diseases. However, it is difficult to detect people's mask-wearing situation in public places with high traffic flow. To address the above problem, this paper proposes a mask-wearing face detection model based on YOLOv5l. Firstly, Multi-Head Attentional Self-Convolution not only improves the convergence speed of the model but also enhances the accuracy of the model detection. Secondly, the introduction of Swin Transformer Block is able to extract more useful feature information, enhance the detection ability of small targets, and improve the overall accuracy of the model. Our designed I-CBAM module can improve target detection accuracy. In addition, using enhanced feature fusion enables the model to better adapt to object detection tasks of different scales. In the experimentation on the MASK dataset, the results show that the model proposed in this paper achieved a 1.1% improvement in mAP(0.5) and a 1.3% improvement in mAP(0.5:0.95) compared to the YOLOv5l model. Our proposed method significantly enhances the detection capability of mask-wearing.
    摘要 穿戴口罩是预防感染疾病的一种重要措施。然而,在高流量的公共场所中探测人们穿戴口罩的情况很困难。为解决这个问题,本文提出了基于YOLOv5l的口罩穿戴面部检测模型。首先,多头注意力自适应卷积不仅提高模型的融合速度,也提高了模型的检测精度。其次,将Swin卷积层引入可以提取更多有用的特征信息,提高小目标的检测能力,并提高模型的总精度。我们设计的I-CBAM模块可以提高标的检测精度。此外,使用强化的特征融合可以让模型更好地适应不同的物体检测任务。在MASK dataset上的实验结果显示,提案的模型与YOLOv5l模型相比,在mAP(0.5)和mAP(0.5:0.95)中实现了1.1%和1.3%的提升。我们的提案方法可以优化口罩穿戴的检测能力。
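For reference, a plain CBAM-style block (channel attention followed by spatial attention) is sketched below; the paper's I-CBAM is a modified variant whose exact design is not reproduced here, so this is only a baseline illustration of the mechanism.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Plain CBAM-style block: channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # channel attention from avg pool
        mx = self.mlp(x.amax(dim=(2, 3)))      # ... and from max pool
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))   # spatial attention map

print(CBAM(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```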

Generalizing Medical Image Representations via Quaternion Wavelet Networks

  • paper_url: http://arxiv.org/abs/2310.10224
  • repo_url: https://github.com/ispamm/QWT
  • paper_authors: Luigi Sigillo, Eleonora Grassucci, Aurelio Uncini, Danilo Comminiello
  • for: 提高医疗图像处理领域的神经网络通用性,适应不同数据来源和任务的研究。
  • methods: 提出一种新的、通用、数据和任务无关的框架,可以从医疗图像中提取突出的特征。该框架基于四元波峰变换,可以与现有的医疗图像分析或生成任务集成,并且可以与实数、四元数或复数值模型混合使用。
  • results: 经过广泛的实验评估,包括不同的数据集和任务,如重建、分割和模态翻译等,结果显示提出的框架可以提高网络性能,同时具有广泛适用的通用性。
    Abstract Neural network generalizability is becoming a broad research field due to the increasing availability of datasets from different sources and for various tasks. This issue is even wider when processing medical data, where a lack of methodological standards causes large variations being provided by different imaging centers or acquired with various devices and cofactors. To overcome these limitations, we introduce a novel, generalizable, data- and task-agnostic framework able to extract salient features from medical images. The proposed quaternion wavelet network (QUAVE) can be easily integrated with any pre-existing medical image analysis or synthesis task, and it can be involved with real, quaternion, or hypercomplex-valued models, generalizing their adoption to single-channel data. QUAVE first extracts different sub-bands through the quaternion wavelet transform, resulting in both low-frequency/approximation bands and high-frequency/fine-grained features. Then, it weighs the most representative set of sub-bands to be involved as input to any other neural model for image processing, replacing standard data samples. We conduct an extensive experimental evaluation comprising different datasets, diverse image analysis, and synthesis tasks including reconstruction, segmentation, and modality translation. We also evaluate QUAVE in combination with both real and quaternion-valued models. Results demonstrate the effectiveness and the generalizability of the proposed framework that improves network performance while being flexible to be adopted in manifold scenarios.
    摘要 QUAVE首先使用四元波лет变换提取不同的子带,包括低频/抽象带和高频/细化特征。然后,它对最有代表性的子带进行权重,将其作为任务模型的输入,取代标准数据样本。我们进行了广泛的实验评估,包括不同的数据集和多种图像分析和生成任务,如重建、分割和模式翻译。我们还在QUAVE与实数和四元数值模型结合使用时进行了评估。结果表明我们提出的框架能够提高网络性能,同时具有通用的优势,能够在多种场景中适用。
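QUAVE extracts low-frequency/approximation and high-frequency/detail sub-bands with the quaternion wavelet transform. As a simpler stand-in, the snippet below shows a single-level 2D Haar-style decomposition, which splits an image into one approximation band (LL) and three detail bands (LH, HL, HH); the quaternion-valued transform used in the paper is not implemented here, and the scaling is illustrative.

```python
import numpy as np

def haar_subbands(img):
    """Single-level 2D Haar-style decomposition of an (H, W) image with even H, W.
    Returns (LL, LH, HL, HH): approximation + horizontal/vertical/diagonal detail."""
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0   # local average: low-frequency content
    lh = (a - b + c - d) / 4.0   # horizontal detail
    hl = (a + b - c - d) / 4.0   # vertical detail
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

img = np.random.rand(256, 256)
for name, band in zip("LL LH HL HH".split(), haar_subbands(img)):
    print(name, band.shape)  # each band is (128, 128)
```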

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2310.10221
  • repo_url: https://github.com/longkukuhi/armbench
  • paper_authors: Zijun Long, George Killick, Richard McCreadie, Gerardo Aragon Camarasa
  • for: This paper is written for robotic vision applications, specifically to address the challenges of object detection, segmentation, and identification in real-world warehouse scenarios.
  • methods: The paper proposes the use of Multimodal Large Language Models (MLLMs) as a novel backbone for various downstream tasks, leveraging the pre-training capabilities of MLLMs to create a simplified framework and mitigate the need for task-specific encoders.
  • results: The paper introduces the RoboLLM framework, equipped with a BEiT-3 backbone, which outperforms existing baselines and substantially reduces the engineering burden associated with model selection and tuning, as demonstrated in the ARMBench challenge.
    Abstract Robotic vision applications often necessitate a wide range of visual perception tasks, such as object detection, segmentation, and identification. While there have been substantial advances in these individual tasks, integrating specialized models into a unified vision pipeline presents significant engineering challenges and costs. Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. We argue that leveraging the pre-training capabilities of MLLMs enables the creation of a simplified framework, thus mitigating the need for task-specific encoders. Specifically, the large-scale pretrained knowledge in MLLMs allows for easier fine-tuning to downstream robotic vision tasks and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge-a large-scale robotic manipulation dataset about real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.
    摘要 robotic 视觉应用经常需要各种视觉识别任务,如物体检测、分割和识别。 DESPITE 这些任务的 SUBSTANTIAL ADVANCES,整合特殊模型到一个统一的视觉管道中存在 significant engineering challenges 和 Costs。 Recently, Multimodal Large Language Models (MLLMs) have emerged as novel backbones for various downstream tasks. WE ARGUE that leveraging the pre-training capabilities of MLLMs enables the creation of a simplified framework, thus mitigating the need for task-specific encoders. Specifically, the large-scale pretrained knowledge in MLLMs allows for easier fine-tuning to downstream robotic vision tasks and yields superior performance. We introduce the RoboLLM framework, equipped with a BEiT-3 backbone, to address all visual perception tasks in the ARMBench challenge-a large-scale robotic manipulation dataset about real-world warehouse scenarios. RoboLLM not only outperforms existing baselines but also substantially reduces the engineering burden associated with model selection and tuning. The source code is publicly available at https://github.com/longkukuhi/armbench.

Self-supervised Fetal MRI 3D Reconstruction Based on Radiation Diffusion Generation Model

  • paper_url: http://arxiv.org/abs/2310.10209
  • repo_url: None
  • paper_authors: Junpeng Tan, Xin Zhang, Yao Lv, Xiangmin Xu, Gang Li
  • for: 这种论文是为了解决妊娠Magnetic Resonance Imaging(MRI)中的精度恢复问题而写的。
  • methods: 这种方法使用了基于卷积的射线辐射场(NeRF)和基于超分解的扩展生成器(CINR)等技术来解决区域性灵敏度不均匀和全局一致性问题。
  • results: 实验结果表明,这种方法可以在实际世界妊娠MRI核心中实现高质量超分解重建。
    Abstract Although the use of multiple stacks can handle slice-to-volume motion correction and artifact removal problems, there are still several problems: 1) The slice-to-volume method usually uses slices as input, which cannot solve the problem of uniform intensity distribution and complementarity in regions of different fetal MRI stacks; 2) The integrity of 3D space is not considered, which adversely affects the discrimination and generation of globally consistent information in fetal MRI; 3) Fetal MRI with severe motion artifacts in the real-world cannot achieve high-quality super-resolution reconstruction. To address these issues, we propose a novel fetal brain MRI high-quality volume reconstruction method, called the Radiation Diffusion Generation Model (RDGM). It is a self-supervised generation method, which incorporates the idea of Neural Radiation Field (NeRF) based on the coordinate generation and diffusion model based on super-resolution generation. To solve regional intensity heterogeneity in different directions, we use a pre-trained transformer model for slice registration, and then, a new regionally Consistent Implicit Neural Representation (CINR) network sub-module is proposed. CINR can generate the initial volume by combining a coordinate association map of two different coordinate mapping spaces. To enhance volume global consistency and discrimination, we introduce the Volume Diffusion Super-resolution Generation (VDSG) mechanism. The global intensity discriminant generation from volume-to-volume is carried out using the idea of diffusion generation, and CINR becomes the deviation intensity generation network of the volume-to-volume diffusion model. Finally, the experimental results on real-world fetal brain MRI stacks demonstrate the state-of-the-art performance of our method.

MoConVQ: Unified Physics-Based Motion Control via Scalable Discrete Representations

  • paper_url: http://arxiv.org/abs/2310.10198
  • repo_url: None
  • paper_authors: Heyuan Yao, Zhenhua Song, Yuyang Zhou, Tenglong Ao, Baoquan Chen, Libin Liu
  • for: 本文提出了一种新的物理学基的动作控制框架,即MoConVQ,可以有效地从大量、无结构的动作示例中学习动作嵌入。
  • methods: 该方法基于量化变换自动编码器(VQ-VAE)和模型基于奖励学习,可以从大量动作示例中学习动作嵌入,并且可以Capture多样化的动作技巧。
  • results: 研究人员通过多种应用场景来证明MoConVQ的可行性,包括通用跟踪控制、交互式人物控制、物理学基的动作生成等。此外,研究人员还证明了MoConVQ可以与大型语言模型(LLMs)集成,以解决复杂和抽象的任务。
    Abstract In this work, we present MoConVQ, a novel unified framework for physics-based motion control leveraging scalable discrete representations. Building upon vector quantized variational autoencoders (VQ-VAE) and model-based reinforcement learning, our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples. The resultant motion representation not only captures diverse motion skills but also offers a robust and intuitive interface for various applications. We demonstrate the versatility of MoConVQ through several applications: universal tracking control from various motion sources, interactive character control with latent motion representations using supervised learning, physics-based motion generation from natural language descriptions using the GPT framework, and, most interestingly, seamless integration with large language models (LLMs) with in-context learning to tackle complex and abstract tasks.
    摘要 在这项工作中,我们介绍了MoConVQ,一种新的物理基于运动控制框架,利用可扩展的字符串表示法。我们的方法基于vector量化自适应学习(VQ-VAE)和基于模型的奖励学习,从大量、无结构的运动示例中学习出高质量的运动嵌入。这种运动表示不仅捕捉了多样化的运动技巧,还提供了一种稳定和直观的界面,可以应用于多种应用程序。我们在这篇论文中展示了MoConVQ的多种应用,包括从不同运动源的跟踪控制、使用监督学习的潜在运动表示进行交互人物控制、基于自然语言描述的物理运动生成、以及与大语言模型(LLM)集成,以解决复杂和抽象任务。
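The discretisation at the heart of a VQ-VAE-style motion representation can be sketched as a nearest-neighbour codebook lookup with a straight-through gradient, so downstream components only ever see the integer token sequence. Codebook size, latent dimensions, and the function name are illustrative assumptions, not the authors' code.

```python
import torch

def vector_quantize(latents, codebook):
    """latents:  (B, T, D) continuous motion embeddings
    codebook: (K, D) learnable code vectors.
    Returns the quantized embeddings and the discrete index sequence."""
    expanded = codebook.unsqueeze(0).expand(latents.shape[0], -1, -1)
    dists = torch.cdist(latents, expanded)    # (B, T, K) pairwise distances
    indices = dists.argmin(dim=-1)            # (B, T) discrete motion tokens
    quantized = codebook[indices]             # (B, T, D)
    # straight-through estimator so gradients flow back to the encoder
    quantized = latents + (quantized - latents).detach()
    return quantized, indices

codebook = torch.randn(512, 64)               # K=512 codes of dimension 64
z = torch.randn(2, 120, 64)                   # 2 clips, 120 frames of latents
q, idx = vector_quantize(z, codebook)
print(q.shape, idx.shape)                     # (2, 120, 64) (2, 120)
```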

The Road to On-board Change Detection: A Lightweight Patch-Level Change Detection Network via Exploring the Potential of Pruning and Pooling

  • paper_url: http://arxiv.org/abs/2310.10166
  • repo_url: None
  • paper_authors: Lihui Xue, Zhihao Wang, Xueqian Wang, Gang Li
  • for: 这个论文主要是为了提高大规模卫星遥感帧测变检测(CD)方法的效率和可靠性,尤其是在具有限制性的计算和内存资源的edge Computing平台上。
  • methods: 本论文提出了一个轻量级的帧级CD网络(LPCDNet),以快速除去大量无变帧,以提高后续像素级CD过程的效率和内存成本。LPCDNet使用了一个感度指导的通道剔除方法,删除无重要通道,并建立轻量级的后门网络基于ResNet18网络。此外,本文还提出了一个多层特征压缩(MLFC)模组,用于压缩和融合两个时间点的帧级特征信息。
  • results: 根据实验结果,LPCDNet在两个CD资料集上可以每秒逐帧检测1000帧以上,比已有方法高得多,而且不会对CD性能造成明显的损失。此外,LPCDNet还可以降低后续像Pixel-level CD过程的内存成本超过60%。
    Abstract Existing satellite remote sensing change detection (CD) methods often crop original large-scale bi-temporal image pairs into small patch pairs and then use pixel-level CD methods to fairly process all the patch pairs. However, due to the sparsity of change in large-scale satellite remote sensing images, existing pixel-level CD methods suffer from a waste of computational cost and memory resources on lots of unchanged areas, which reduces the processing efficiency of on-board platform with extremely limited computation and memory resources. To address this issue, we propose a lightweight patch-level CD network (LPCDNet) to rapidly remove lots of unchanged patch pairs in large-scale bi-temporal image pairs. This is helpful to accelerate the subsequent pixel-level CD processing stage and reduce its memory costs. In our LPCDNet, a sensitivity-guided channel pruning method is proposed to remove unimportant channels and construct the lightweight backbone network on basis of ResNet18 network. Then, the multi-layer feature compression (MLFC) module is designed to compress and fuse the multi-level feature information of bi-temporal image patch. The output of MLFC module is fed into the fully-connected decision network to generate the predicted binary label. Finally, a weighted cross-entropy loss is utilized in the training process of network to tackle the change/unchange class imbalance problem. Experiments on two CD datasets demonstrate that our LPCDNet achieves more than 1000 frames per second on an edge computation platform, i.e., NVIDIA Jetson AGX Orin, which is more than 3 times that of the existing methods without noticeable CD performance loss. In addition, our method reduces more than 60% memory costs of the subsequent pixel-level CD processing stage.
    摘要 现有的卫星遥感变化检测(CD)方法 oftentimes crop original large-scale bi-temporal image pairs into small patch pairs and then use pixel-level CD methods to fairly process all the patch pairs. However, due to the sparsity of change in large-scale satellite remote sensing images, existing pixel-level CD methods suffer from a waste of computational cost and memory resources on lots of unchanged areas, which reduces the processing efficiency of on-board platform with extremely limited computation and memory resources. To address this issue, we propose a lightweight patch-level CD network (LPCDNet) to rapidly remove lots of unchanged patch pairs in large-scale bi-temporal image pairs. This is helpful to accelerate the subsequent pixel-level CD processing stage and reduce its memory costs. In our LPCDNet, a sensitivity-guided channel pruning method is proposed to remove unimportant channels and construct the lightweight backbone network on basis of ResNet18 network. Then, the multi-layer feature compression (MLFC) module is designed to compress and fuse the multi-level feature information of bi-temporal image patch. The output of MLFC module is fed into the fully-connected decision network to generate the predicted binary label. Finally, a weighted cross-entropy loss is utilized in the training process of network to tackle the change/unchange class imbalance problem. Experiments on two CD datasets demonstrate that our LPCDNet achieves more than 1000 frames per second on an edge computation platform, i.e., NVIDIA Jetson AGX Orin, which is more than 3 times that of the existing methods without noticeable CD performance loss. In addition, our method reduces more than 60% memory costs of the subsequent pixel-level CD processing stage.
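The overall pipeline shape, cheaply scoring patch pairs and forwarding only the likely-changed ones to pixel-level change detection, can be illustrated with a naive difference-based pre-filter. In LPCDNet the score comes from a pruned ResNet18-based classifier with multi-layer feature compression, not from a pixel difference; the patch size and threshold below are placeholders.

```python
import numpy as np

def prefilter_patches(img_t1, img_t2, patch=256, thresh=0.05):
    """Split a bi-temporal image pair into patch pairs and keep only those
    whose (placeholder) change score exceeds a threshold; only the kept
    patches go on to the expensive pixel-level CD stage."""
    h, w = img_t1.shape[:2]
    kept = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p1 = img_t1[y:y + patch, x:x + patch].astype(np.float32)
            p2 = img_t2[y:y + patch, x:x + patch].astype(np.float32)
            score = np.abs(p1 - p2).mean()     # stand-in for the learned classifier
            if score > thresh:
                kept.append((y, x))
    return kept

t1 = np.random.rand(1024, 1024)
t2 = t1.copy(); t2[:256, :256] += 0.5          # simulate change in one patch
print(prefilter_patches(t1, t2))               # [(0, 0)]
```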

A Search for Prompts: Generating Structured Answers from Contracts

  • paper_url: http://arxiv.org/abs/2310.10141
  • repo_url: None
  • paper_authors: Adam Roegiest, Radha Chitta, Jonathan Donnelly, Maya Lash, Alexandra Vtyurina, François Longtin
  • for: 法律问题自动回答,帮助自动化人类审查或标识特定条件(例如,自动续订警示)。
  • methods: 使用 OpenAI 的 \textit{GPT-3.5-Turbo} 进行不结构化生成问题回答,并对问题回答提供了审查和改进。
  • results: 相比 semantic matching 方法,我们的模板提问方法更加准确,并且通过Context learning和提问修改,我们的方法可以进一步提高性能。
    Abstract In many legal processes being able to action on the concrete implication of a legal question can be valuable to automating human review or signalling certain conditions (e.g., alerts around automatic renewal). To support such tasks, we present a form of legal question answering that seeks to return one (or more) fixed answers for a question about a contract clause. After showing that unstructured generative question answering can have questionable outcomes for such a task, we discuss our exploration methodology for legal question answering prompts using OpenAI's \textit{GPT-3.5-Turbo} and provide a summary of insights. Using insights gleaned from our qualitative experiences, we compare our proposed template prompts against a common semantic matching approach and find that our prompt templates are far more accurate despite being less reliable in the exact response return. With some additional tweaks to prompts and the use of in-context learning, we are able to further improve the performance of our proposed strategy while maximizing the reliability of responses as best we can.
    摘要 在许多法律程序中,能够对法律问题的具体实施有益于自动化人类审查或标识特定条件(例如,续约提醒)。为支持这些任务,我们提出了一种法律问题回答方法,该方法可以为一个合同条款的问题返回一个或多个固定答案。在显示了无结构生成问题回答可能导致问able的结果后,我们讲述了我们的探索方法ологи,使用OpenAI的GPT-3.5-Turbo进行问题回答提示。我们提供了一个摘要的感想,并与常见Semantic Matching方法进行比较。我们发现,我们的提案的模板提示比Semantic Matching方法更准确,尽管它们可能不那么可靠地返回具体的回答。通过对提示和使用上下文学习进行一些调整,我们能够进一步改进我们的提案的性能,同时最大化回答的可靠性。
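An illustrative template in the spirit described, posing a clause-level question with a fixed answer set so the response can be consumed programmatically, might look like the following. The wording, the answer options, and the `ask_llm` stub are hypothetical and are not the prompts evaluated in the paper (which used GPT-3.5-Turbo).

```python
# Illustrative prompt construction for fixed-answer contract questions.
ANSWER_OPTIONS = ["Yes", "No", "Not specified"]   # hypothetical fixed answer set

def build_prompt(clause_text: str, question: str) -> str:
    options = ", ".join(ANSWER_OPTIONS)
    return (
        "You are reviewing a contract clause.\n"
        "Clause:\n"
        f"{clause_text}\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {options}."
    )

def ask_llm(prompt: str) -> str:
    # Stub: plug in your preferred chat-completion client here.
    raise NotImplementedError

prompt = build_prompt(
    "This Agreement shall automatically renew for successive one-year terms "
    "unless either party gives 60 days' written notice.",
    "Does this clause contain an automatic renewal provision?",
)
print(prompt)
```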

PELA: Learning Parameter-Efficient Models with Low-Rank Approximation

  • paper_url: http://arxiv.org/abs/2310.10700
  • repo_url: https://github.com/guoyang9/pela
  • paper_authors: Yangyang Guo, Guangzhi Wang, Mohan Kankanhalli
  • for: 提高预训练模型中参数效率,以适应资源受限的下游任务。
  • methods: Introducing an intermediate pre-training stage, using low-rank approximation to compress the original large model, and devising a feature distillation module and weight perturbation regularization module to enhance the low-rank model.
  • results: 提高预训练模型的参数效率,同时保持与基本架构相似的性能水平,减少参数大小一半至二分之一。
    Abstract Applying a pre-trained large model to downstream tasks is prohibitive under resource-constrained conditions. Recent dominant approaches for addressing efficiency issues involve adding a few learnable parameters to the fixed backbone model. This strategy, however, leads to more challenges in loading large models for downstream fine-tuning with limited resources. In this paper, we propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage. To this end, we first employ low-rank approximation to compress the original large model and then devise a feature distillation module and a weight perturbation regularization module. These modules are specifically designed to enhance the low-rank model. Concretely, we update only the low-rank model while freezing the backbone parameters during pre-training. This allows for direct and efficient utilization of the low-rank model for downstream tasks. The proposed method achieves both efficiencies in terms of required parameters and computation time while maintaining comparable results with minimal modifications to the base architecture. Specifically, when applied to three vision-only and one vision-language Transformer models, our approach often demonstrates a $\sim$0.6 point decrease in performance while reducing the original parameter size by 1/3 to 2/3.
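The basic compression step, replacing a dense weight matrix with a rank-r factorisation, can be sketched with a truncated SVD that turns one Linear layer into two thinner ones. The feature-distillation and weight-perturbation-regularisation modules from the paper are not shown, and the rank below is an arbitrary choice.

```python
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate `layer` with two linear maps of rank `rank` via truncated SVD:
    W (out_f x in_f) ~ B @ A, where A = Vh[:r] and B = U[:, :r] * S[:r]."""
    W = layer.weight.data                      # (out_f, in_f)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = Vh[:rank, :]                           # (rank, in_f)
    B = U[:, :rank] * S[:rank]                 # (out_f, rank)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(A)
    second.weight.data.copy_(B)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

dense = nn.Linear(768, 768)
compact = low_rank_linear(dense, rank=128)     # ~3x fewer weight parameters
x = torch.randn(4, 768)
print((dense(x) - compact(x)).abs().max())     # approximation error
```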

3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding

  • paper_url: http://arxiv.org/abs/2310.10131
  • repo_url: https://github.com/seonokkim/3dyoga90
  • paper_authors: Seonok Kim
  • For: 这个研究是为了开发一个更大、更完整的人工智能训练用运动视频集,具体来说是3D Yoga901数据集,用于习习poses和瑜伽动作识别。* Methods: 这个研究使用了一个专门为这个目标而制作的数据集,其包括90个姿势的RGB视频和3D骨骼序列,同时还有一个三级标签层次结构。* Results: 这个研究创造了一个更大、更完整的公共数据集,包括RGB视频和3D骨骼序列,这对于人工智能训练和瑜伽动作识别具有广泛的应用前景。
    Abstract The increasing popularity of exercises including yoga and Pilates has created a greater demand for professional exercise video datasets in the realm of artificial intelligence. In this study, we developed 3DYoga901, which is organized within a three-level label hierarchy. We have expanded the number of poses from an existing state-of-the-art dataset, increasing it from 82 to 90 poses. Our dataset includes meticulously curated RGB yoga pose videos and 3D skeleton sequences. This dataset was created by a dedicated team of six individuals, including yoga instructors. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources. This contribution has the potential to significantly advance the field of yoga action recognition and pose assessment. Additionally, we conducted experiments to evaluate the practicality of our proposed dataset. We employed three different model variants for benchmarking purposes.
    摘要 随着瑜伽和PILATES等运动的流行,人工智能领域的训练数据需求增加。在这项研究中,我们开发了3DYoga901,这是一个三级标签层次结构下的组织方式。我们将现有状态艺术数据集中的82姿势提高到90姿势。我们的数据集包括仔细挑选的RGB瑜伽姿势视频和3D骨架序列。这个数据集由6名专业人员,包括瑜伽教练,共同创建。它是公共可用资源中最完整的开放数据集,拥有最大的RGB视频和3D骨架序列收集。这一贡献有可能在瑜伽动作识别和姿势评估领域取得重要进步。此外,我们还进行了实验来评估我们的提案的实用性。我们使用了三种不同的模型变体进行比较。

Few-shot Action Recognition with Captioning Foundation Models

  • paper_url: http://arxiv.org/abs/2310.10125
  • repo_url: None
  • paper_authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang
  • for: 本研究旨在将预训练的视觉语言知识转移到多个下游任务中,以提高实验效率和准确性。
  • methods: 本研究使用了一个名为CapFSAR的弹性插件架构,具有自动生成对应的视觉描述和文本嵌入的能力,以扩展预训练的视觉语言知识。
  • results: 实验结果显示,CapFSAR在多个标准几个阶段训练 benchmark 上表现出色,与现有方法比较,具有更高的准确性和更好的一致性。
    Abstract Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR can benefit from powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
    摘要 使用预训练多modal基础模型的视觉语言知识传递到多个下游任务是一个有前途的方向。然而,当前大多数几 shot动作识别方法仍然只接受单一视觉输入,因为对附加文本描述的标注成本高昂。在这篇论文中,我们开发了一个有效的插件式框架called CapFSAR,以利用预训练多modal模型的知识而不需要手动标注文本。具体来说,我们首先利用一个captioning基础模型(即BLIP)来提取视觉特征并自动生成相关的描述文本用于输入视频。然后,我们应用一个文本编码器来对生成的文本嵌入获得代表性的文本嵌入。最后,我们设计了一个基于Transformer的视觉-文本聚合模块,以便在低shot情况下利用多modal视觉语言的补充信息进行可靠的匹配。这样,CapFSAR可以从预训练多modal模型中获得强大的知识,在低shot情况下实现更全面的分类。我们的实验结果表明,提议的CapFSAR在多个标准few-shot基准上表现出色,并达到了最先进的性能。代码将公开。

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

  • paper_url: http://arxiv.org/abs/2310.10123
  • repo_url: None
  • paper_authors: Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, Jinwei Gu
  • for: solves complex real-world image restoration situations with multiple unknown degradations
  • methods: uses an all-in-one image restoration framework with latent diffusion, including a Blind Image Quality Assessment Module (BIQA) and an All-in-One Image Refinement (AIR) Module, as well as a Structure Correction Module (SCM)
  • results: outperforms state-of-the-art approaches with superior restoration results and supports a wider range of tasks, including real-scenario images with multiple unknown degradations.
    Abstract In this paper, we aim to solve complex real-world image restoration situations, in which, one image may have a variety of unknown degradations. To this end, we propose an all-in-one image restoration framework with latent diffusion (AutoDIR), which can automatically detect and address multiple unknown degradations. Our framework first utilizes a Blind Image Quality Assessment Module (BIQA) to automatically detect and identify the unknown dominant image degradation type of the image. Then, an All-in-One Image Refinement (AIR) Module handles multiple kinds of degradation image restoration with the guidance of BIQA. Finally, a Structure Correction Module (SCM) is proposed to recover the image details distorted by AIR. Our comprehensive evaluation demonstrates that AutoDIR outperforms state-of-the-art approaches by achieving superior restoration results while supporting a wider range of tasks. Notably, AutoDIR is also the first method to automatically handle real-scenario images with multiple unknown degradations.
    摘要 在这篇论文中,我们目标是解决复杂的真实世界图像恢复问题,在这个问题中,一个图像可能具有多种未知的降低效应。为此,我们提议一个整合性图像恢复框架——自适应扩散图像修复(AutoDIR),可以自动检测和解决多种未知降低效应。我们的框架首先利用一个隐藏影像质量评估模块(BIQA)来自动检测和识别图像的未知主要降低类型。然后,一个全面修复(AIR)模块处理多种降低效应的图像修复,以BIQA的指导。最后,我们提出一个结构修复模块(SCM)来恢复图像细节,受到AIR的扭曲影响。我们的全面评估表明,AutoDIR在恢复Result中显示出优于当前方法,同时支持更广泛的任务范围。尤其是,AutoDIR是第一个自动处理真实世界图像中的多种未知降低的方法。

KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training

  • paper_url: http://arxiv.org/abs/2310.10102
  • repo_url: https://github.com/TruongThaoNguyen/kakurenbo
  • paper_authors: Truong Thao Nguyen, Balazs Gerofi, Edgar Josafat Martinez-Noriega, François Trahay, Mohamed Wahib
  • for: 提高深度神经网络训练效率,减少训练成本。
  • methods: 利用训练损失和预测信任度信息, dynamically 排除训练样本,不归一化影响精度。
  • results: 在多个大规模数据集和模型上,与基eline相比,我们的方法可以降低训练时间,仅带来0.4%的精度下降。可以在https://github.com/TruongThaoNguyen/kakurenbo 获取代码。
    Abstract This paper proposes a method for hiding the least-important samples during the training of deep neural networks to increase efficiency, i.e., to reduce the cost of training. Using information about the loss and prediction confidence during training, we adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process, without significantly degrading accuracy. We explore the converge properties when accounting for the reduction in the number of SGD updates. Empirical results on various large-scale datasets and models used directly in image classification and segmentation show that while the with-replacement importance sampling algorithm performs poorly on large datasets, our method can reduce total training time by up to 22% impacting accuracy only by 0.4% compared to the baseline. Code available at https://github.com/TruongThaoNguyen/kakurenbo
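The gist of adaptively hiding samples can be sketched as follows: after each epoch, rank examples by their recorded loss and exclude a fraction of the lowest-loss ("least important") ones from the next epoch's sampler. The fixed hide fraction, and the omission of the prediction-confidence signal and other corrections used in the paper, are simplifications for illustration.

```python
import numpy as np
from torch.utils.data import SubsetRandomSampler

def visible_indices(per_sample_loss, hide_fraction=0.2):
    """Return indices of samples to keep for the next epoch: the lowest-loss
    `hide_fraction` of the dataset is hidden (skipped) this time around."""
    n = len(per_sample_loss)
    n_hide = int(hide_fraction * n)
    order = np.argsort(per_sample_loss)        # ascending: easiest samples first
    return order[n_hide:]                      # drop the easiest n_hide samples

# toy usage: losses recorded during the previous epoch for a 10-sample dataset
losses = np.array([0.01, 2.3, 0.02, 1.1, 0.8, 0.03, 0.9, 1.5, 0.05, 0.7])
keep = visible_indices(losses, hide_fraction=0.3)
sampler = SubsetRandomSampler(keep.tolist())   # pass to DataLoader(..., sampler=sampler)
print(sorted(keep.tolist()))                   # [1, 3, 4, 6, 7, 8, 9]
```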

A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images

  • paper_url: http://arxiv.org/abs/2310.10095
  • repo_url: None
  • paper_authors: Yangfan Ni, Duo Zhang, Gege Ma, Lijun Lu, Zhongke Huang, Wentao Zhu
  • For: 这个研究旨在提高核心心脏成像中的左心室(LV)重orientation和分割的精度,以便进行多modal的量化分析。* Methods: 该研究提出了一种结合多尺度空间变换网络(MSSTN)和多尺度UNet(MSUNet)模块的端到端模型,用于同时重orientation和分割LV区域从核心心脏成像图像中。* Results: 实验结果表明,该提出的方法可以显著提高重orientation和分割性能。这种结合学习框架可以促进重orientation和分割任务之间的互补性,从而实现高性能和高效的图像处理 workflow。
    Abstract Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named as multi-scale spatial transformer UNet (MS-ST-UNet), that involves the multi-scale spatial transformer network (MSSTN) and multi-scale UNet (MSUNet) modules to perform simultaneous reorientation and segmentation of LV region from nuclear cardiac images. The proposed method is trained and tested using two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features with different scales. Our experimental results demonstrate that the proposed method significantly improves the reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between reorientation and segmentation tasks, leading to cutting edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists.
    摘要 要实现多Modal量子分析的协助,我们提出了一种结合多级空间变换网络(MSSTN)和多级UNet(MSUNet)模块的端到端模型,用于自动将核心心脏图像重定向和分割为标准短轴扁平图像。左心室(LV)区域检测和各个患者的各种心脏结构带来了重orientation和分割任务的挑战。我们的方法通过一种多级strategy来生成和提取图像特征,从而提高重定向和分割性能。我们对13N-氨基酸PET和99mTc-氯丙胺SPECT两种核心心脏图像模式进行训练和测试,并取得了显著提高重定向和分割性能的结果。这种结合学习框架可以减少心脏图像手动分割的劳动,从而为物理学家提供多模态量子分析的协助。
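The spatial-transformer component amounts to resampling the input through a predicted affine transform. Below is a 2D PyTorch sketch with a fixed rotation standing in for the network-predicted parameters; the paper operates on 3D volumes with learned transforms, so this is only illustrative.

```python
import math
import torch
import torch.nn.functional as F

def reorient(image, angle_rad):
    """Rotate a (B, C, H, W) image with an affine spatial transformer.
    In a learned setting the 2x3 (or 3x4 for volumes) matrix would be
    predicted by the network rather than fixed."""
    b = image.shape[0]
    cos, sin = math.cos(angle_rad), math.sin(angle_rad)
    theta = torch.tensor([[cos, -sin, 0.0],
                          [sin,  cos, 0.0]]).repeat(b, 1, 1)
    grid = F.affine_grid(theta, size=image.shape, align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

img = torch.zeros(1, 1, 64, 64)
img[:, :, 28:36, 10:54] = 1.0                  # a horizontal bar
rotated = reorient(img, math.pi / 2)           # becomes a vertical bar
print(rotated.sum().item() > 0)                # True
```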

PUCA: Patch-Unshuffle and Channel Attention for Enhanced Self-Supervised Image Denoising

  • paper_url: http://arxiv.org/abs/2310.10088
  • repo_url: https://github.com/HyemiEsme/PUCA
  • paper_authors: Hyemi Jang, Junsung Park, Dahuin Jung, Jaihyun Lew, Ho Bae, Sungroh Yoon
  • for: 自动化图像干扰除(image denoising)
  • methods: 使用自适应卷积(dilated attention blocks)和质patch-unshuffle/shuffle来扩大感知场和保持J-不变性
  • results: 实验结果显示,PUCA在自主学习图像干扰除中实现了状态的最佳性能,超过了现有方法的性能
    Abstract Although supervised image denoising networks have shown remarkable performance on synthesized noisy images, they often fail in practice due to the difference between real and synthesized noise. Since clean-noisy image pairs from the real world are extremely costly to gather, self-supervised learning, which utilizes noisy input itself as a target, has been studied. To prevent a self-supervised denoising model from learning identical mapping, each output pixel should not be influenced by its corresponding input pixel; This requirement is known as J-invariance. Blind-spot networks (BSNs) have been a prevalent choice to ensure J-invariance in self-supervised image denoising. However, constructing variations of BSNs by injecting additional operations such as downsampling can expose blinded information, thereby violating J-invariance. Consequently, convolutions designed specifically for BSNs have been allowed only, limiting architectural flexibility. To overcome this limitation, we propose PUCA, a novel J-invariant U-Net architecture, for self-supervised denoising. PUCA leverages patch-unshuffle/shuffle to dramatically expand receptive fields while maintaining J-invariance and dilated attention blocks (DABs) for global context incorporation. Experimental results demonstrate that PUCA achieves state-of-the-art performance, outperforming existing methods in self-supervised image denoising.
    摘要 尽管监督的图像噪声网络已经在合成噪声图像上展现出惊人的表现,但在实际应用中它们经常失败,因为实际噪声和合成噪声之间存在差异。由于干净图像和噪声图像的对应对不容易获得,自动学习,即使用噪声输入本身作为目标,已经被研究。为保证自动学习噪声除掉模型不学习同一个映射,每个输出像素不能受到其对应的输入像素的影响,这种要求被称为J-不变性。盲区网络(BSN)在保证J-不变性方面广泛应用。然而,通过在BSN中注射额外操作,如下采样,可能会暴露盲区信息,从而违反J-不变性。因此,特制的BSN材料只能被允许,限制了建筑的创新性。为了突破这一限制,我们提出了PUCA,一种新的J-不变的U-Net架构,用于自动学习噪声除掉。PUCA利用质心不排序/排序来巨大地扩大接收场,同时保持J-不变性,并采用扩展的注意块(DABs)来 incorporate global context。实验结果表明,PUCA可以达到状态之内的表现,比存在的方法在自动学习噪声除掉中高效。
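The patch-unshuffle/shuffle operations behave much like PyTorch's `pixel_unshuffle`/`pixel_shuffle`: spatial neighbourhoods are folded into channels, enlarging the effective receptive field of subsequent blocks, and later folded back losslessly. A toy round trip is shown below; it illustrates the rearrangement only, not the PUCA architecture.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 64, 64)

# fold 2x2 spatial neighbourhoods into channels: (1, 3, 64, 64) -> (1, 12, 32, 32)
down = F.pixel_unshuffle(x, downscale_factor=2)
print(down.shape)

# ... dilated attention / convolution blocks would operate here ...

# fold channels back to the original resolution: (1, 12, 32, 32) -> (1, 3, 64, 64)
up = F.pixel_shuffle(down, upscale_factor=2)
print(torch.allclose(up, x))   # True: the round trip is lossless
```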

Expression Domain Translation Network for Cross-domain Head Reenactment

  • paper_url: http://arxiv.org/abs/2310.10073
  • repo_url: None
  • paper_authors: Taewoong Kang, Jeongsik Oh, Jaeseong Lee, Sunghyun Park, Jaegul Choo
  • for: The paper aims to improve cross-domain head reenactment, specifically transferring human motions to cartoon characters.
  • methods: The paper introduces a novel expression domain translation network to transform human expressions into anime expressions. The network uses a 3D geometric-aware loss function to ensure geometric consistency between the input and output expressions.
  • results: The proposed method outperforms existing methods in both qualitative and quantitative analysis, demonstrating a significant advancement in cross-domain head reenactment.
    Abstract Despite the remarkable advancements in head reenactment, the existing methods face challenges in cross-domain head reenactment, which aims to transfer human motions to domains outside the human, including cartoon characters. It is still difficult to extract motion from out-of-domain images due to the distinct appearances, such as large eyes. Recently, previous work introduced a large-scale anime dataset called AnimeCeleb and a cross-domain head reenactment model, including an optimization-based mapping function to translate the human domain's expressions to the anime domain. However, we found that the mapping function, which relies on a subset of expressions, imposes limitations on the mapping of various expressions. To solve this challenge, we introduce a novel expression domain translation network that transforms human expressions into anime expressions. Specifically, to maintain the geometric consistency of expressions between the input and output of the expression domain translation network, we employ a 3D geometric-aware loss function that reduces the distances between the vertices in the 3D mesh of the human and anime. By doing so, it forces high-fidelity and one-to-one mapping with respect to two cross-expression domains. Our method outperforms existing methods in both qualitative and quantitative analysis, marking a significant advancement in the field of cross-domain head reenactment.
    摘要 尽管有很多HEADreenactment的进步,现有的方法在跨domain HEADreenactment中遇到了困难,即将人类动作转移到不同的领域,如漫画人物。因为这些领域的外观有很大的不同,例如大的眼睛,提取动作从不同领域的图像仍然是很困难的。在最近,之前的工作已经介绍了一个大规模的漫画数据集called AnimeCeleb和一种跨领域HEADreenactment模型,包括一种优化基于映射函数,将人类领域的表情翻译到漫画领域。然而,我们发现这种映射函数,它基于一 subset of 表情,强制了表情的限制。为解决这个挑战,我们引入了一种新的表情频率翻译网络,可以将人类表情翻译成漫画表情。在我们的方法中,我们采用了一种3D геометрически感知的损失函数,以保持表情的几何一致性。这使得我们的方法可以具有高精度和一对一的映射性,对于两个跨表情频率的跨领域映射。我们的方法在质量和量度分析中都超过了现有的方法,标志着跨领域HEADreenactment领域的重要进步。

ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking

  • paper_url: http://arxiv.org/abs/2310.10071
  • repo_url: https://github.com/kou-99/zoomtrack
  • paper_authors: Yutong Kou, Jin Gao, Bing Li, Gang Wang, Weiming Hu, Yizheng Wang, Liang Li
  • for: 本研究旨在实现高速追踪,并获得高性能的tracking results,而不是专注于性能和速度之间的 compromise.
  • methods: 本文使用非对称resize的方法,将cropped image的input size变小,并保持目标的视觉资讯。这个方法可以通过quadratic programming (QP) efficiently solve,并可以与大多数的crop-based local tracker naturally integrate.
  • results: 在五个挑战性的dataset上,本文的方法可以获得了consistent improvement,并在speed-oriented版本的OSTrack上even outperform its performance-oriented counterpart by 0.6% AUC on TNL2K,并且在50% faster和save over 55% MACs的情况下实现。
    Abstract Recently, the transformer has enabled the speed-oriented trackers to approach state-of-the-art (SOTA) performance with high-speed thanks to the smaller input size or the lighter feature extraction backbone, though they still substantially lag behind their corresponding performance-oriented versions. In this paper, we demonstrate that it is possible to narrow or even close this gap while achieving high tracking speed based on the smaller input size. To this end, we non-uniformly resize the cropped image to have a smaller input size while the resolution of the area where the target is more likely to appear is higher and vice versa. This enables us to solve the dilemma of attending to a larger visual field while retaining more raw information for the target despite a smaller input size. Our formulation for the non-uniform resizing can be efficiently solved through quadratic programming (QP) and naturally integrated into most of the crop-based local trackers. Comprehensive experiments on five challenging datasets based on two kinds of transformer trackers, \ie, OSTrack and TransT, demonstrate consistent improvements over them. In particular, applying our method to the speed-oriented version of OSTrack even outperforms its performance-oriented counterpart by 0.6% AUC on TNL2K, while running 50% faster and saving over 55% MACs. Codes and models are available at https://github.com/Kou-99/ZoomTrack.
    摘要 最近,transformer已经使得速度强调跟踪器可以达到状态之Art(SOTA)性能,并且具有更高的速度,即使使用更小的输入大小或更轻量级的特征提取核心。然而,它们仍然较相对落后于其对应的性能强调版本。在这篇论文中,我们展示了可以减少或甚至消除这个差距,而且可以在更小的输入大小下实现高速跟踪。为此,我们非均匀地缩放cropped图像,使得target的可能出现的区域的分辨率更高,而其他区域的分辨率相对较低。这样可以解决在更小的输入大小下尚可以保留更多的原始信息来搜寻target的问题。我们的非均匀缩放的形式可以通过quadratic programming(QP)有效地解决,并且自然地整合到大多数的crop-based本地跟踪器中。我们在五个复杂的数据集上进行了广泛的实验,并示出了一致的改进。特别是,对于速度强调版本的OSTrack,我们的方法可以在TNL2K上提高0.6%的AUC,并且在50%的速度下运行,并且占用了55%的MACs。代码和模型可以在https://github.com/Kou-99/ZoomTrack上获取。

Generalizable Person Search on Open-world User-Generated Video Content

  • paper_url: http://arxiv.org/abs/2310.10068
  • repo_url: None
  • paper_authors: Junjie Li, Guanshuo Wang, Yichao Yan, Fufu Yu, Qiong Jia, Jie Qin, Shouhong Ding, Xiaokang Yang
  • for: 实现人寻找任务中的扩展性能,尤其是在不同的摄像头和环境中。
  • methods: 提出了一个通用框架,包括两种水平的普遍化:对于特征水平,引入多任务抽象型普通批量对顶推对顶推标准差,对于数据水平,则是通过通道宽度ID相关特征装饰策略来实现普遍化。
  • results: 在两个挑战人寻找测试 bencmarks 上获得了可靠的表现,无需使用任何人工标注或目标领域的样本。
    Abstract Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. However, collecting and annotating training samples for each scene is often difficult due to the limitation of resources and the labor cost. Moreover, large-scale intra-domain data for training are generally not legally available for common developers, due to the regulation of privacy and public security. Leveraging easily accessible large-scale User Generated Video Contents (\emph{i.e.} UGC videos) to train person search models can fit the open-world distribution, but still suffering a performance gap from the domain difference to surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in open-world training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain.
    摘要 人体搜索是一项复杂的任务,它涉及到从大量未修剪场景图像中检测和检索人体。现有的人体搜索应用程序都是在同一个来源场景中训练和部署的。然而,收集和标注训练样本的成本是非常高,尤其是在资源和劳动力方面。此外,大规模内域数据 для训练通常不可得,因为隐私和公共安全的法规限制。我们利用可达性高的用户生成内容(i.e., UGC视频)来训练人体搜索模型,以适应开放世界分布。然而,这些模型仍然受到频率域不同的问题带来的性能差。在这种情况下,我们提出了一种通用的框架,以便在无法预测的情况下进行下游任务。我们主要关注于学习域外 invariant 表示,包括检测和 ReID 领域的学习。我们引入多任务prototype-based域特定批处理,以及通道级 ID 相关特征修饰策略。我们还识别和解决常见的开放世界训练帧中的噪声源,包括不准确的 bounding box、缺失标签和 cross-camera 数据缺失。我们的框架在两个复杂的人体搜索标准准点上达到了无需使用人类标注或目标域样本的承诺性能。

  • paper_url: http://arxiv.org/abs/2310.10061
  • repo_url: https://github.com/rachelfheaton/CASPER-model
  • paper_authors: Rachel F. Heaton
  • for: This paper aims to understand the nature of human visual representations and processes through the study of visual search.
  • methods: The paper presents a theory of visual search based on empirical findings and instantiated in a computational model called CASPER (Concurrent Attention: Serial and Parallel Evaluation with Relations).
  • results: The paper describes seven experiments that test CASPER's predictions about relational search, and shows that CASPER can account for negative acceleration in search functions for relational stimuli.
    Abstract The following is a dissertation aimed at understanding what the various phenomena in visual search teach us about the nature of human visual representations and processes. I first review some of the major empirical findings in the study of visual search. I next present a theory of visual search in terms of what I believe these findings suggest about the representations and processes underlying ventral visual processing. These principles are instantiated in a computational model called CASPER (Concurrent Attention: Serial and Parallel Evaluation with Relations), originally developed by Hummel, that I have adapted to account for a range of phenomena in visual search. I then describe an extension of the CASPER model to account for our ability to search for visual items defined not simply by the features composing those items but by the spatial relations among those features. Seven experiments (four main experiments and three replications) are described that test CASPER's predictions about relational search. Finally, I evaluate the fit between CASPER's predictions and the empirical findings and show with three additional simulations that CASPER can account for negative acceleration in search functions for relational stimuli if one postulates that the visual system is leveraging an emergent feature that bypasses relational processing.
    摘要 这是一篇关于视觉搜寻的论文,旨在了解人类视觉表示和过程中的本质。我首先介绍了视觉搜寻的一些主要实验发现,然后提出了基于这些发现的视觉搜寻理论。这些原则通过我修改了由哔哔(Hummel)开发的计算模型CASPER(同时注意力:串行和平行评估与关系)来实现。我然后描述了一种扩展CASPER模型,以便解释我们在视觉搜寻中搜寻视觉物体的空间关系。然后,我描述了七个实验(四个主要实验和三个复现),用于测试CASPER模型的预测。最后,我评估了CASPER模型的预测与实验发现的Compatibility,并通过三个额外的仿真显示CASPER模型可以解释视觉系统在搜寻关系性 stimulus 时的负加速。

EAR-Net: Pursuing End-to-End Absolute Rotations from Multi-View Images

  • paper_url: http://arxiv.org/abs/2310.10051
  • repo_url: None
  • paper_authors: Yuzhen Liu, Qiulei Dong
  • for: 提供一种结构为深度神经网络的综合方法,用于从多视图图像中估计绝对旋转。
  • methods: 使用深度神经网络建立 epipolar confidence graph,并使用 confidence-aware rotation averaging 模块来预测绝对旋转。
  • results: 在三个公共数据集上,EAR-Net 比现有方法提高了准确性和速度。
    Abstract Absolute rotation estimation is an important topic in 3D computer vision. Existing works in literature generally employ a multi-stage (at least two-stage) estimation strategy where multiple independent operations (feature matching, two-view rotation estimation, and rotation averaging) are implemented sequentially. However, such a multi-stage strategy inevitably leads to the accumulation of the errors caused by each involved operation, and degrades its final estimation on global rotations accordingly. To address this problem, we propose an End-to-end method for estimating Absolute Rotations from multi-view images based on deep neural Networks, called EAR-Net. The proposed EAR-Net consists of an epipolar confidence graph construction module and a confidence-aware rotation averaging module. The epipolar confidence graph construction module is explored to simultaneously predict pairwise relative rotations among the input images and their corresponding confidences, resulting in a weighted graph (called epipolar confidence graph). Based on this graph, the confidence-aware rotation averaging module, which is differentiable, is explored to predict the absolute rotations. Thanks to the introduced confidences of the relative rotations, the proposed EAR-Net could effectively handle outlier cases. Experimental results on three public datasets demonstrate that EAR-Net outperforms the state-of-the-art methods by a large margin in terms of accuracy and speed.
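To make the confidence-aware rotation averaging idea concrete, here is a minimal, hypothetical sketch (not the authors' EAR-Net code): given pairwise relative rotations and per-edge confidences, absolute rotations are recovered by minimizing a confidence-weighted chordal residual with gradient descent. The axis-angle parametrization, function names, and optimizer settings are illustrative assumptions.

```python
# Sketch (assumption, not the paper's implementation) of confidence-weighted rotation
# averaging: recover absolute rotations R_i from noisy relative rotations R_ij ~ R_j R_i^T,
# weighting each edge by its confidence. Solutions are defined up to one global rotation.
import torch

def rotvec_to_matrix(v):
    """Rodrigues' formula: axis-angle vectors (N, 3) -> rotation matrices (N, 3, 3)."""
    theta = v.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = v / theta
    K = torch.zeros(v.shape[0], 3, 3, device=v.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=v.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return I + s * K + (1.0 - c) * (K @ K)

def average_rotations(pairs, rel_R, conf, n_cams, iters=500, lr=1e-2):
    """pairs: (M, 2) long tensor of camera indices (i, j); rel_R: (M, 3, 3); conf: (M,)."""
    v = (0.01 * torch.randn(n_cams, 3)).requires_grad_()   # axis-angle per camera
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(iters):
        R = rotvec_to_matrix(v)
        pred = R[pairs[:, 1]] @ R[pairs[:, 0]].transpose(1, 2)   # predicted R_ij
        residual = ((pred - rel_R) ** 2).sum(dim=(1, 2))         # squared chordal distance
        loss = (conf * residual).mean()                          # confidence-weighted
        opt.zero_grad()
        loss.backward()
        opt.step()
    return rotvec_to_matrix(v).detach()
```

In EAR-Net the confidences are predicted by the epipolar confidence graph construction module; in this sketch they would simply be supplied by any outlier-scoring heuristic.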

Hyperspectral Image Fusion via Logarithmic Low-rank Tensor Ring Decomposition

  • paper_url: http://arxiv.org/abs/2310.10044
  • repo_url: None
  • paper_authors: Jun Zhang, Lipeng Zhu, Chao Wang, Shutao Li
  • for: To improve the fusion of a low-spatial-resolution hyperspectral image (LR-HSI) with a high-spatial-resolution multispectral image (HR-MSI) so as to obtain a high-resolution hyperspectral image (HR-HSI).
  • methods: Tensor ring (TR) decomposition, with a mode-2 logarithmic tensor nuclear norm (LTNN) regularization on each TR factor to preserve high-dimensional low-rank structure, plus a weighted total variation term, solved by a proximal alternating minimization algorithm.
  • results: Experiments show improved visual quality and gains over state-of-the-art fusion methods on various quantitative metrics.
    Abstract Integrating a low-spatial-resolution hyperspectral image (LR-HSI) with a high-spatial-resolution multispectral image (HR-MSI) is recognized as a valid method for acquiring HR-HSI. Among the current fusion approaches, the tensor ring (TR) decomposition-based method has received growing attention owing to its superior performance on preserving the spatial-spectral correlation. Furthermore, the low-rank property in some TR factors has been exploited via the matrix nuclear norm regularization along mode-2. On the other hand, the tensor nuclear norm (TNN)-based approaches have recently demonstrated to be more efficient on keeping high-dimensional low-rank structures in tensor recovery. Here, we study the low-rankness of TR factors from the TNN perspective and consider the mode-2 logarithmic TNN (LTNN) on each TR factor. A novel fusion model is proposed by incorporating this LTNN regularization and the weighted total variation which is to promote the continuity of HR-HSI in the spatial-spectral domain. Meanwhile, we have devised a highly efficient proximal alternating minimization algorithm to solve the proposed model. The experimental results indicate that our method improves the visual quality and exceeds the existing state-of-the-art fusion approaches with respect to various quantitative metrics.
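As a rough illustration of the kind of low-rank regularizer discussed above, the following is a small, hypothetical NumPy sketch of a logarithmic nuclear-norm surrogate evaluated on the mode-2 unfolding of a third-order tensor-ring factor. The exact LTNN definition, weighting, and solver in the paper may differ; the unfolding convention and the epsilon are assumptions.

```python
# Hedged sketch: a log-style surrogate of the nuclear norm on the mode-2 unfolding of a
# TR factor G (shape r1 x n x r2). Compared with the plain nuclear norm (sum of singular
# values), the logarithm penalizes large singular values less aggressively, encouraging
# low-rank structure along mode 2.
import numpy as np

def mode2_unfold(G):
    """(r1, n, r2) -> (n, r1 * r2): rows indexed by the mode-2 dimension."""
    return np.moveaxis(G, 1, 0).reshape(G.shape[1], -1)

def log_nuclear_norm(M, eps=1e-3):
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(np.log(s + eps)))

G = np.random.randn(4, 64, 4)          # toy third-order TR factor
print(log_nuclear_norm(mode2_unfold(G)))
```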

Evading Detection Actively: Toward Anti-Forensics against Forgery Localization

  • paper_url: http://arxiv.org/abs/2310.10036
  • repo_url: None
  • paper_authors: Long Zhuo, Shenghai Luo, Shunquan Tan, Han Chen, Bin Li, Jiwu Huang
  • for: To study anti-forensics against forgery localization, i.e., concealing tampering traces so that pixel-level forgery detectors mis-locate forged regions.
  • methods: SEAR, a self-supervised and adversarial training algorithm: a pretext task reconstructs perturbations, while a forgery localization model acts as supervisor so that a deep-learning concealer learns to erase the corresponding tampering traces.
  • results: SEAR successfully deceives state-of-the-art forgery localization methods across diverse datasets and overcomes the three defects of traditional adversarial attacks identified in the paper.
    Abstract Anti-forensics seeks to eliminate or conceal traces of tampering artifacts. Typically, anti-forensic methods are designed to deceive binary detectors and persuade them to misjudge the authenticity of an image. However, to the best of our knowledge, no attempts have been made to deceive forgery detectors at the pixel level and mis-locate forged regions. Traditional adversarial attack methods cannot be directly used against forgery localization due to the following defects: 1) they tend to just naively induce the target forensic models to flip their pixel-level pristine or forged decisions; 2) their anti-forensics performance tends to be severely degraded when faced with the unseen forensic models; 3) they lose validity once the target forensic models are retrained with the anti-forensics images generated by them. To tackle the three defects, we propose SEAR (Self-supErvised Anti-foRensics), a novel self-supervised and adversarial training algorithm that effectively trains deep-learning anti-forensic models against forgery localization. SEAR sets a pretext task to reconstruct perturbation for self-supervised learning. In adversarial training, SEAR employs a forgery localization model as a supervisor to explore tampering features and constructs a deep-learning concealer to erase corresponding traces. We have conducted largescale experiments across diverse datasets. The experimental results demonstrate that, through the combination of self-supervised learning and adversarial learning, SEAR successfully deceives the state-of-the-art forgery localization methods, as well as tackle the three defects regarding traditional adversarial attack methods mentioned above.
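The adversarial half of the training described above can be pictured with the following minimal sketch, assuming a residual "concealer" network and a frozen forgery-localization network. The toy architectures, loss weighting, and the choice of an L1 fidelity term are illustrative assumptions rather than the authors' design.

```python
# Hedged sketch: train a concealer so that a frozen forgery localizer predicts "pristine"
# everywhere on the concealed image, while an L1 term keeps the image close to the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

concealer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 3, 3, padding=1))               # toy residual predictor
localizer = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())  # stand-in detector
for p in localizer.parameters():
    p.requires_grad_(False)                                             # supervisor stays fixed
opt = torch.optim.Adam(concealer.parameters(), lr=1e-4)

def anti_forensic_step(tampered, lam=10.0):
    concealed = tampered + concealer(tampered)        # add a learned residual perturbation
    pred_mask = localizer(concealed)                  # (B, 1, H, W) forgery probability map
    evade = F.binary_cross_entropy(pred_mask, torch.zeros_like(pred_mask))
    fidelity = F.l1_loss(concealed, tampered)         # keep the image visually unchanged
    loss = evade + lam * fidelity
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(anti_forensic_step(torch.rand(2, 3, 64, 64)))
```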

Deep Unfolding Network for Image Compressed Sensing by Content-adaptive Gradient Updating and Deformation-invariant Non-local Modeling

  • paper_url: http://arxiv.org/abs/2310.10033
  • repo_url: None
  • paper_authors: Wenxue Cui, Xiaopeng Fan, Jian Zhang, Debin Zhao
  • for: To improve deep unfolding networks (DUNs) for image compressed sensing (CS).
  • methods: A novel DUN (dubbed DUN-CSNet) built on the classical Proximal Gradient Descent (PGD) algorithm, targeting two weaknesses of existing DUNs: 1) most hyperparameters are independent of the input content, which limits adaptability; 2) the plain convolutional network used in each iteration perceives little wider context, which weakens expressiveness.
  • results: Experiments show that DUN-CSNet outperforms existing state-of-the-art CS methods by large margins.
    Abstract Inspired by certain optimization solvers, the deep unfolding network (DUN) has attracted much attention in recent years for image compressed sensing (CS). However, there still exist the following two issues: 1) In existing DUNs, most hyperparameters are usually content independent, which greatly limits their adaptability for different input contents. 2) In each iteration, a plain convolutional neural network is usually adopted, which weakens the perception of wider context prior and therefore depresses the expressive ability. In this paper, inspired by the traditional Proximal Gradient Descent (PGD) algorithm, a novel DUN for image compressed sensing (dubbed DUN-CSNet) is proposed to solve the above two issues. Specifically, for the first issue, a novel content adaptive gradient descent network is proposed, in which a well-designed step size generation sub-network is developed to dynamically allocate the corresponding step sizes for different textures of input image by generating a content-aware step size map, realizing a content-adaptive gradient updating. For the second issue, considering the fact that many similar patches exist in an image but have undergone a deformation, a novel deformation-invariant non-local proximal mapping network is developed, which can adaptively build the long-range dependencies between the nonlocal patches by deformation-invariant non-local modeling, leading to a wider perception on context priors. Extensive experiments manifest that the proposed DUN-CSNet outperforms existing state-of-the-art CS methods by large margins.
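For context, the classical PGD iteration that deep unfolding networks of this kind unroll can be written as below; in the unfolded network the step size and the proximal operator become learned modules (in DUN-CSNet, per the abstract, a content-aware step-size map and a deformation-invariant non-local proximal mapping network). The notation is the standard PGD form, not the paper's exact equations.

```latex
% Standard proximal gradient descent step for compressed sensing with measurements y = \Phi x:
x^{(k+1)} \;=\; \operatorname{prox}_{\lambda^{(k)} \mathcal{R}}\!\left( x^{(k)} - \rho^{(k)} \, \Phi^{\top}\!\left( \Phi x^{(k)} - y \right) \right)
```

Here \Phi is the sensing matrix, \rho^{(k)} the step size at stage k, and \mathcal{R} the image prior; the k-th network stage replaces \rho^{(k)} with a per-pixel, content-adaptive step-size map and replaces the proximal operator with a trainable (here, non-local) sub-network.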

RoomDesigner: Encoding Anchor-latents for Style-consistent and Shape-compatible Indoor Scene Generation

  • paper_url: http://arxiv.org/abs/2310.10027
  • repo_url: https://github.com/zhao-yiqun/roomdesigner
  • paper_authors: Yiqun Zhao, Zibo Zhao, Jing Li, Sixun Dong, Shenghua Gao
  • for: To generate indoor scenes with shape-compatible, style-consistent furniture arrangements within a spatially reasonable layout, for more realistic and believable interior design and planning.
  • methods: A two-stage model: furniture pieces are first encoded as anchor-latents via discrete vector quantization, and a transformer then predicts the indoor scene autoregressively; incorporating the anchor-latent representation lets the generator produce shape-compatible and style-consistent furniture layouts.
  • results: On the 3D-Front dataset the method generates more consistent and compatible indoor scenes than existing approaches, even without shape retrieval; extensive ablation studies confirm the effectiveness of the design choices.
    Abstract Indoor scene generation aims at creating shape-compatible, style-consistent furniture arrangements within a spatially reasonable layout. However, most existing approaches primarily focus on generating plausible furniture layouts without incorporating specific details related to individual furniture pieces. To address this limitation, we propose a two-stage model integrating shape priors into the indoor scene generation by encoding furniture as anchor latent representations. In the first stage, we employ discrete vector quantization to encode furniture pieces as anchor-latents. Based on the anchor-latents representation, the shape and location information of the furniture was characterized by a concatenation of location, size, orientation, class, and our anchor latent. In the second stage, we leverage a transformer model to predict indoor scenes autoregressively. Thanks to incorporating the proposed anchor-latents representations, our generative model produces shape-compatible and style-consistent furniture arrangements and synthesis furniture in diverse shapes. Furthermore, our method facilitates various human interaction applications, such as style-consistent scene completion, object mismatch correction, and controllable object-level editing. Experimental results on the 3D-Front dataset demonstrate that our approach can generate more consistent and compatible indoor scenes compared to existing methods, even without shape retrieval. Additionally, extensive ablation studies confirm the effectiveness of our design choices in the indoor scene generation model.
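A minimal sketch of the per-furniture representation described in the abstract: each object token concatenates location, size, orientation, a class embedding, and a quantized anchor-latent looked up from a learned codebook. All dimensions, field names, and the nearest-neighbour quantization shown here are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: build a furniture token as [location | size | orientation | class | anchor-latent].
import torch
import torch.nn as nn

class FurnitureTokenizer(nn.Module):
    def __init__(self, n_classes=30, codebook_size=512, latent_dim=64, class_dim=16):
        super().__init__()
        self.class_emb = nn.Embedding(n_classes, class_dim)
        self.codebook = nn.Embedding(codebook_size, latent_dim)   # anchor-latent codebook

    def quantize(self, shape_feat):
        # nearest-codebook-entry lookup (the discrete vector quantization step)
        d = torch.cdist(shape_feat, self.codebook.weight)         # (N, codebook_size)
        return self.codebook(d.argmin(dim=-1))

    def forward(self, loc, size, yaw, cls_id, shape_feat):
        # loc (N,3), size (N,3), yaw (N,1), cls_id (N,), shape_feat (N, latent_dim)
        anchor = self.quantize(shape_feat)
        return torch.cat([loc, size, yaw, self.class_emb(cls_id), anchor], dim=-1)

tok = FurnitureTokenizer()
token = tok(torch.randn(5, 3), torch.rand(5, 3), torch.rand(5, 1),
            torch.randint(0, 30, (5,)), torch.randn(5, 64))
print(token.shape)   # (5, 3 + 3 + 1 + 16 + 64) = (5, 87)
```

A transformer would then consume these tokens autoregressively to predict the next object in the scene.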

An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition

  • paper_url: http://arxiv.org/abs/2310.10022
  • repo_url: None
  • paper_authors: Ling Zhou, Mingpei Wang, Xiaohua Huang, Wenming Zheng, Qirong Mao, Guoying Zhao
  • for: To study micro-expression recognition (MER) in low-resolution (LR) scenarios, which matter for practical applications such as group MER in crowded environments.
  • methods: Benchmark experiments pairing seven state-of-the-art (SOTA) MER techniques with samples generated by 13 SOTA super-resolution (SR) techniques, guided by an in-depth literature survey.
  • results: The empirical study uncovers the primary challenges of SR-assisted MER and identifies avenues for improving it by leveraging recent advances in both SR and MER.
    Abstract Micro-expression recognition (MER) in low-resolution (LR) scenarios presents an important and complex challenge, particularly for practical applications such as group MER in crowded environments. Despite considerable advancements in super-resolution techniques for enhancing the quality of LR images and videos, few study has focused on investigate super-resolution for improving LR MER. The scarcity of investigation can be attributed to the inherent difficulty in capturing the subtle motions of micro-expressions, even in original-resolution MER samples, which becomes even more challenging in LR samples due to the loss of distinctive features. Furthermore, a lack of systematic benchmarking and thorough analysis of super-resolution-assisted MER methods has been noted. This paper tackles these issues by conducting a series of benchmark experiments that integrate both super-resolution (SR) and MER methods, guided by an in-depth literature survey. Specifically, we employ seven cutting-edge state-of-the-art (SOTA) MER techniques and evaluate their performance on samples generated from 13 SOTA SR techniques, thereby addressing the problem of super-resolution in MER. Through our empirical study, we uncover the primary challenges associated with SR-assisted MER and identify avenues to tackle these challenges by leveraging recent advancements in both SR and MER methodologies. Our analysis provides insights for progressing toward more efficient SR-assisted MER.
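The benchmarking protocol above amounts to a cross-product evaluation of SR and MER methods; the following is a hypothetical sketch of that loop. The model registries, the `evaluate` callback, and the per-clip interfaces are placeholder assumptions, not the authors' code.

```python
# Hedged sketch: evaluate every (SR, MER) pairing on low-resolution clips.
from itertools import product

def benchmark(sr_models: dict, mer_models: dict, lr_clips, labels, evaluate):
    """Return an accuracy (or other score) for each (SR method, MER method) combination."""
    results = {}
    for (sr_name, sr), (mer_name, mer) in product(sr_models.items(), mer_models.items()):
        restored = [sr(clip) for clip in lr_clips]    # super-resolve each LR clip
        preds = [mer(clip) for clip in restored]      # recognize micro-expressions
        results[(sr_name, mer_name)] = evaluate(preds, labels)
    return results
```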

Black-box Targeted Adversarial Attack on Segment Anything (SAM)

  • paper_url: http://arxiv.org/abs/2310.10010
  • repo_url: None
  • paper_authors: Sheng Zheng, Chaoning Zhang
  • for: To realize a targeted adversarial attack (TAA) on the Segment Anything Model (SAM) in the black-box setting, and thereby better understand SAM's robustness under adversarial attack.
  • methods: A simple yet effective approach that attacks only the image encoder, removing the dependence on prompts; a novel regularization loss further boosts cross-model transferability by increasing the feature dominance of adversarial images over random natural images.
  • results: Extensive experiments verify that these simple techniques achieve a successful black-box TAA on SAM.
    Abstract Deep recognition models are widely vulnerable to adversarial examples, which change the model output by adding quasi-imperceptible perturbation to the image input. Recently, Segment Anything Model (SAM) has emerged to become a popular foundation model in computer vision due to its impressive generalization to unseen data and tasks. Realizing flexible attacks on SAM is beneficial for understanding the robustness of SAM in the adversarial context. To this end, this work aims to achieve a targeted adversarial attack (TAA) on SAM. Specifically, under a certain prompt, the goal is to make the predicted mask of an adversarial example resemble that of a given target image. The task of TAA on SAM has been realized in a recent arXiv work in the white-box setup by assuming access to prompt and model, which is thus less practical. To address the issue of prompt dependence, we propose a simple yet effective approach by only attacking the image encoder. Moreover, we propose a novel regularization loss to enhance the cross-model transferability by increasing the feature dominance of adversarial images over random natural images. Extensive experiments verify the effectiveness of our proposed simple techniques to conduct a successful black-box TAA on SAM.
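A minimal sketch of an encoder-level targeted attack of the kind described above: a PGD-style loop perturbs the source image so that its embedding from a surrogate image encoder approaches the target image's embedding, under an L-infinity budget. The surrogate encoder, step sizes, and the plain MSE objective (without the paper's proposed regularization loss) are assumptions for illustration.

```python
# Hedged sketch: push the source image's encoder features toward the target image's features.
import torch
import torch.nn.functional as F

def targeted_encoder_attack(encoder, src, tgt, eps=8 / 255, alpha=2 / 255, steps=40):
    encoder.eval()
    with torch.no_grad():
        tgt_feat = encoder(tgt)                       # target embedding to imitate
    adv = src.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.mse_loss(encoder(adv), tgt_feat)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()           # step toward the target features
            adv = src + (adv - src).clamp(-eps, eps)  # project back onto the L_inf ball
            adv = adv.clamp(0, 1)
    return adv.detach()

# usage (with a surrogate encoder): adv = targeted_encoder_attack(surrogate, src_img, tgt_img)
```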

Assessing Encoder-Decoder Architectures for Robust Coronary Artery Segmentation

  • paper_url: http://arxiv.org/abs/2310.10002
  • repo_url: None
  • paper_authors: Shisheng Zhang, Ramtin Gharleghi, Sonit Singh, Arcot Sowmya, Susann Beier
  • for: To enable timely and accurate diagnosis of coronary artery disease through precise coronary artery segmentation, improving patient outcomes.
  • methods: Convolutional neural networks based on the U-Net architecture, benchmarking 25 distinct encoder-decoder combinations.
  • results: On the 40 cases of the public ASOCA dataset, the EfficientNet-LinkNet combination achieves a Dice coefficient of 0.882 and a 95th-percentile Hausdorff distance of 4.753, outperforming the models presented at the MICCAI 2020 challenge.
    Abstract Coronary artery diseases are among the leading causes of mortality worldwide. Timely and accurate diagnosis, facilitated by precise coronary artery segmentation, is pivotal in changing patient outcomes. In the realm of biomedical imaging, convolutional neural networks, especially the U-Net architecture, have revolutionised segmentation processes. However, one of the primary challenges remains the lack of benchmarking datasets specific to coronary arteries. However through the use of the recently published public dataset ASOCA, the potential of deep learning for accurate coronary segmentation can be improved. This paper delves deep into examining the performance of 25 distinct encoder-decoder combinations. Through analysis of the 40 cases provided to ASOCA participants, it is revealed that the EfficientNet-LinkNet combination, serving as encoder and decoder, stands out. It achieves a Dice coefficient of 0.882 and a 95th percentile Hausdorff distance of 4.753. These findings not only underscore the superiority of our model in comparison to those presented at the MICCAI 2020 challenge but also set the stage for future advancements in coronary artery segmentation, opening doors to enhanced diagnostic and treatment strategies.
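For reference, the two metrics quoted above can be computed as in the following sketch: a plain NumPy/SciPy implementation of the Dice coefficient and the 95th-percentile symmetric surface distance, not the challenge's official evaluation code; isotropic voxel spacing and non-empty masks are assumed.

```python
# Hedged sketch of the evaluation metrics: Dice overlap and 95th-percentile Hausdorff distance.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def surface(mask):
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)               # boundary voxels of the mask

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th percentile of symmetric surface-to-surface distances between two binary masks."""
    sp, sg = surface(pred), surface(gt)
    d_to_sg = distance_transform_edt(~sg, sampling=spacing)   # distance to the GT surface
    d_to_sp = distance_transform_edt(~sp, sampling=spacing)   # distance to the predicted surface
    dists = np.concatenate([d_to_sg[sp], d_to_sp[sg]])
    return float(np.percentile(dists, 95))

pred = np.zeros((32, 32, 32), bool); pred[8:20, 8:20, 8:20] = True
gt = np.zeros((32, 32, 32), bool);   gt[10:22, 10:22, 10:22] = True
print(dice(pred, gt), hd95(pred, gt))
```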

SeUNet-Trans: A Simple yet Effective UNet-Transformer Model for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.09998
  • repo_url: None
  • paper_authors: Tan-Hanh Pham, Xianqi Li, Kim-Doang Nguyen
  • for: To propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation.
  • methods: A UNet serves as a feature extractor that produces multiple feature maps from the input image; these maps are passed through a bridge layer into a Transformer, which uses pixel-level embeddings without positional-embedding vectors and spatial-reduction attention to lower the computational and memory cost of self-attention.
  • results: Evaluated on five medical image segmentation datasets, including polyp segmentation, seUNet-Trans outperforms several state-of-the-art segmentation models.
    Abstract Automated medical image segmentation is becoming increasingly crucial in modern clinical practice, driven by the growing demand for precise diagnoses, the push towards personalized treatment plans, and advancements in machine learning algorithms, especially the incorporation of deep learning methods. While convolutional neural networks (CNNs) have been prevalent among these methods, the remarkable potential of Transformer-based models for computer vision tasks is gaining more acknowledgment. To harness the advantages of both CNN-based and Transformer-based models, we propose a simple yet effective UNet-Transformer (seUNet-Trans) model for medical image segmentation. In our approach, the UNet model is designed as a feature extractor to generate multiple feature maps from the input images, and these maps are propagated into a bridge layer, which sequentially connects the UNet and the Transformer. In this stage, we employ the pixel-level embedding technique without position embedding vectors to make the model more efficient. Moreover, we applied spatial-reduction attention in the Transformer to reduce the computational/memory overhead. By leveraging the UNet architecture and the self-attention mechanism, our model not only preserves both local and global context information but also captures long-range dependencies between input elements. The proposed model is extensively experimented on five medical image segmentation datasets, including polyp segmentation, to demonstrate its efficacy. A comparison with several state-of-the-art segmentation models on these datasets shows the superior performance of seUNet-Trans.
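To illustrate the spatial-reduction attention mentioned above, here is a minimal sketch in which keys and values are spatially downsampled by a strided convolution before multi-head attention, shrinking the attention cost roughly by the square of the reduction ratio. The layer sizes and the single-convolution reduction are assumptions; the paper's exact module may differ.

```python
# Hedged sketch: spatial-reduction attention over pixel-level tokens from a bridge layer.
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim=256, heads=8, sr_ratio=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)  # downsample K, V
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, h, w):
        # x: (B, H*W, C) pixel-level tokens; h, w: spatial size of the feature map
        b, n, c = x.shape
        kv = x.transpose(1, 2).reshape(b, c, h, w)
        kv = self.sr(kv).flatten(2).transpose(1, 2)        # (B, H*W / sr_ratio^2, C)
        out, _ = self.attn(x, kv, kv, need_weights=False)  # queries stay at full resolution
        return out

tokens = torch.randn(2, 32 * 32, 256)
print(SpatialReductionAttention()(tokens, 32, 32).shape)   # torch.Size([2, 1024, 256])
```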

A Survey of Graph and Attention Based Hyperspectral Image Classification Methods for Remote Sensing Data

  • paper_url: http://arxiv.org/abs/2310.09994
  • repo_url: None
  • paper_authors: Aryan Vats, Manan Suri
  • for: To survey deep learning methods for hyperspectral image (HSI) classification and their performance on remote sensing and aerial HSI data.
  • methods: The survey covers graph-based representations and attention mechanisms for handling the band dimensionality of HSI, as well as graph convolutional networks that exploit node features for spectral-spatial feature extraction.
  • results: Graph-based and attention-based methods improve HSI classification on remote sensing and aerial imagery; the relevant benchmark datasets and processing techniques are also summarized.
    Abstract The use of Deep Learning techniques for classification in Hyperspectral Imaging (HSI) is rapidly growing and achieving improved performances. Due to the nature of the data captured by sensors that produce HSI images, a common issue is the dimensionality of the bands that may or may not contribute to the label class distinction. Due to the widespread nature of class labels, Principal Component Analysis is a common method used for reducing the dimensionality. However,there may exist methods that incorporate all bands of the Hyperspectral image with the help of the Attention mechanism. Furthermore, to yield better spectral spatial feature extraction, recent methods have also explored the usage of Graph Convolution Networks and their unique ability to use node features in prediction, which is akin to the pixel spectral makeup. In this survey we present a comprehensive summary of Graph based and Attention based methods to perform Hyperspectral Image Classification for remote sensing and aerial HSI images. We also summarize relevant datasets on which these techniques have been evaluated and benchmark the processing techniques.
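As a concrete example of the graph-based direction the survey covers, the sketch below builds a k-nearest-neighbour graph over pixel spectra and applies one graph-convolution propagation step. The graph construction, layer sizes, and toy data (103 bands, e.g. as in the Pavia University scene) are illustrative assumptions, not taken from any specific surveyed method.

```python
# Hedged sketch: pixels become graph nodes with their spectra as features; one GCN step
# propagates information over a symmetrically normalized adjacency matrix.
import numpy as np

def knn_adjacency(spectra, k=10):
    """Symmetric k-NN adjacency over pixel spectra of shape (N, bands)."""
    d = np.linalg.norm(spectra[:, None, :] - spectra[None, :, :], axis=-1)
    A = np.zeros_like(d)
    nn_idx = np.argsort(d, axis=1)[:, 1:k + 1]            # skip self (column 0)
    rows = np.repeat(np.arange(len(spectra)), k)
    A[rows, nn_idx.ravel()] = 1.0
    return np.maximum(A, A.T)                             # symmetrize

def gcn_layer(A, H, W):
    """One propagation step: relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

spectra = np.random.rand(200, 103)                        # 200 pixels, 103 spectral bands
W = 0.1 * np.random.randn(103, 32)
print(gcn_layer(knn_adjacency(spectra), spectra, W).shape)  # (200, 32)
```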