cs.CV - 2023-07-23

ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

  • paper_url: http://arxiv.org/abs/2307.12349
  • repo_url: https://github.com/lartpang/comptr
  • paper_authors: Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu
  • for: Developing a single, general model that can handle diverse bi-source dense prediction tasks, rather than over-specializing deep learning (DL) architectures for individual tasks.
  • methods: Proposes ComPtr, a complementary transformer built on the basic dependence on information complementarity. Its two components, consistency enhancement and difference awareness, let ComPtr collect important visual semantic cues from the different image sources and turn them into useful information for diverse tasks. (A minimal fusion sketch follows the abstract.)
  • results: ComPtr consistently obtains favorable performance across several representative vision tasks, including remote sensing change detection, RGB-T crowd counting, RGB-D/T salient object detection, and RGB-D semantic segmentation.
    Abstract Deep learning (DL) has advanced the field of dense prediction, while gradually dissolving the inherent barriers between different tasks. However, most existing works focus on designing architectures and constructing visual cues only for the specific task, which ignores the potential uniformity introduced by the DL paradigm. In this paper, we attempt to construct a novel \underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse bi-source dense prediction tasks. Specifically, unlike existing methods that over-specialize in a single task or a subset of tasks, ComPtr starts from the more general concept of bi-source dense prediction. Based on the basic dependence on information complementarity, we propose consistency enhancement and difference awareness components with which ComPtr can evacuate and collect important visual semantic cues from different image sources for diverse tasks, respectively. ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer. This task-generic design provides a smooth foundation for constructing the unified model that can simultaneously deal with various bi-source information. In extensive experiments across several representative vision tasks, i.e. remote sensing change detection, RGB-T crowd counting, RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed method consistently obtains favorable performance. The code will be available at \url{https://github.com/lartpang/ComPtr}.
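A minimal, hypothetical PyTorch sketch of the general idea behind treating two sources equally in a sequence-to-sequence transformer: symmetric cross-attention between token sequences from the two images. The module name, shapes, and design here are illustrative assumptions, not ComPtr's actual consistency enhancement and difference awareness components.

```python
import torch
import torch.nn as nn

class BiSourceFusion(nn.Module):
    """Symmetric bi-source fusion: each source attends to the other."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # one cross-attention block per direction, so both inputs are peers
        self.a_from_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, tok_a, tok_b):
        # tok_a, tok_b: (batch, num_tokens, dim) token sequences from the
        # two image sources (e.g. RGB and depth/thermal)
        fused_a, _ = self.a_from_b(tok_a, tok_b, tok_b)
        fused_b, _ = self.b_from_a(tok_b, tok_a, tok_a)
        return self.norm_a(tok_a + fused_a), self.norm_b(tok_b + fused_b)

tokens_rgb = torch.randn(2, 196, 256)    # e.g. 14x14 patch tokens
tokens_depth = torch.randn(2, 196, 256)
out_rgb, out_depth = BiSourceFusion()(tokens_rgb, tokens_depth)
print(out_rgb.shape, out_depth.shape)
```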

ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting

  • paper_url: http://arxiv.org/abs/2307.12348
  • repo_url: https://github.com/zsyoaoa/resshift
  • paper_authors: Zongsheng Yue, Jianyi Wang, Chen Change Loy
  • for: Improving the inference speed of diffusion-based image super-resolution (SR), which existing methods limit through hundreds or even thousands of sampling steps.
  • methods: Proposes a novel and efficient diffusion model that sharply reduces the number of diffusion steps, eliminating the need for post-acceleration at inference and the performance deterioration it brings. (A sketch of the residual-shifting forward process follows the abstract.)
  • results: Experiments show the method achieves performance superior or at least comparable to current state-of-the-art methods on both synthetic and real-world datasets, with only 15 sampling steps.
    Abstract Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed due to the requirements of hundreds or even thousands of sampling steps. Existing acceleration sampling techniques inevitably sacrifice performance to some extent, leading to over-blurry SR results. To address this issue, we propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps, thereby eliminating the need for post-acceleration during inference and its associated performance deterioration. Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual between them, substantially improving the transition efficiency. Additionally, an elaborate noise schedule is developed to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experiments demonstrate that the proposed method obtains superior or at least comparable performance to current state-of-the-art methods on both synthetic and real-world datasets, even only with 15 sampling steps. Our code and model are available at https://github.com/zsyOAOA/ResShift.
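The core of the method is a Markov chain that moves from the high-resolution image toward the low-resolution one by shifting the residual between them. Below is a hedged numpy sketch of one plausible form of that forward process; the geometric shift schedule and the kappa noise factor are assumptions, and the paper's exact parameterization may differ.

```python
import numpy as np

def eta_schedule(T=15, eta_1=1e-3, eta_T=0.999):
    # monotonically increasing shift schedule over T steps (assumed form)
    return np.geomspace(eta_1, eta_T, T)

def forward_shift(x0, y, t, etas, kappa=1.0, rng=np.random.default_rng(0)):
    """Sample x_t ~ N(x0 + eta_t * (y - x0), kappa^2 * eta_t * I)."""
    eta_t = etas[t]
    mean = x0 + eta_t * (y - x0)          # shifted toward the LR image
    noise = kappa * np.sqrt(eta_t) * rng.standard_normal(x0.shape)
    return mean + noise

x0 = np.random.rand(64, 64)   # stand-in HR image
y = np.random.rand(64, 64)    # stand-in (upsampled) LR image
etas = eta_schedule(T=15)
x_last = forward_shift(x0, y, t=14, etas=etas)
# at t = T-1 the sample is centered almost exactly on the LR image
print(np.abs(x_last.mean() - y.mean()) < 0.5)
```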

Rapid detection of soil carbonates by means of NIR spectroscopy, deep learning methods and phase quantification by powder Xray diffraction

  • paper_url: http://arxiv.org/abs/2307.12341
  • repo_url: None
  • paper_authors: Lykourgos Chiniadis, Petros Tamvakis
  • for: Improving agricultural production and the analysis of soil properties, key prerequisites for agroecological balance and environmental sustainability, by rapidly predicting soil carbonate content.
  • methods: Uses FT-NIR reflectance spectroscopy together with deep learning: an MLP regressor and a CNN are compared with traditional ML algorithms such as PLSR, Cubist, and SVM. (A minimal MLP-regressor sketch follows the abstract.)
  • results: FT-NIR reflectance spectroscopy with deep learning predicts soil carbonate content rapidly and efficiently on unseen soil samples; MLP predictions agree with X-ray diffraction quantification within about 5% relative error, suggesting deep learning models can serve as fast prediction tools when no volumetric method is available.
    Abstract Soil NIR spectral absorbance/reflectance libraries are utilized towards improving agricultural production and analysis of soil properties which are key prerequisite for agroecological balance and environmental sustainability. Carbonates in particular, represent a soil property which is mostly affected even by mild, let alone extreme, changes of environmental conditions during climate change. In this study we propose a rapid and efficient way to predict carbonates content in soil by means of FT NIR reflectance spectroscopy and by use of deep learning methods. We exploited multiple machine learning methods, such as: 1) a MLP Regressor and 2) a CNN and compare their performance with other traditional ML algorithms such as PLSR, Cubist and SVM on the combined dataset of two NIR spectral libraries: KSSL (USDA), a dataset of soil samples reflectance spectra collected nationwide, and LUCAS TopSoil (European Soil Library) which contains soil sample absorbance spectra from all over the European Union, and use them to predict carbonate content on never before seen soil samples. Soil samples in KSSL and in TopSoil spectral libraries were acquired in the spectral region of visNIR, however in this study, only the NIR spectral region was utilized. Quantification of carbonates by means of Xray Diffraction is in good agreement with the volumetric method and the MLP prediction. Our work contributes to rapid carbonates content prediction in soil samples in cases where: 1) no volumetric method is available and 2) only NIR spectra absorbance data are available. Up till now and to the best of our knowledge, there exists no other study, that presents a prediction model trained on such an extensive dataset with such promising results on unseen data, undoubtedly supporting the notion that deep learning models present excellent prediction tools for soil carbonates content.
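For the MLP-regressor baseline named above, a minimal scikit-learn sketch on synthetic stand-in spectra is shown below. The layer sizes, standardization step, and train/test split are assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.random((500, 200))       # 500 samples, 200 NIR absorbance bands
w = rng.random(200)
y = X @ w + 0.1 * rng.standard_normal(500)   # synthetic carbonate content

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(
    StandardScaler(),            # spectra are typically standardized first
    MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=1000, random_state=0),
)
model.fit(X_tr, y_tr)
print("MAE on held-out samples:", mean_absolute_error(y_te, model.predict(X_te)))
```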

Learning Navigational Visual Representations with Semantic Map Supervision

  • paper_url: http://arxiv.org/abs/2307.12335
  • repo_url: https://github.com/yiconghong/ego2map-navit
  • paper_authors: Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, Hao Tan
  • for: Improving the visual navigation ability of household robots, especially in indoor environments.
  • methods: Proposes a navigation-specific visual representation learning method that contrasts the agent's egocentric views with semantic maps (Ego$^2$-Map). (A sketch of the contrastive objective follows the abstract.)
  • results: Agents using the learned representations outperform recent visual pre-training methods on object-goal navigation and achieve new state-of-the-art results on vision-and-language navigation in continuous environments.
    Abstract Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the behavior that humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper, we propose a novel navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps (Ego$^2$-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.
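A hedged sketch of the core contrastive objective: pull an egocentric-view embedding toward its paired semantic-map embedding and push it away from the other maps in the batch (standard InfoNCE). The encoders are omitted and the temperature is an assumption; the paper's actual losses and architecture may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(view_emb, map_emb, temperature=0.07):
    # view_emb, map_emb: (batch, dim); row i of each is a positive pair
    view_emb = F.normalize(view_emb, dim=1)
    map_emb = F.normalize(map_emb, dim=1)
    logits = view_emb @ map_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(view_emb.size(0))        # diagonal = positives
    return F.cross_entropy(logits, targets)

views = torch.randn(8, 256)   # stand-in egocentric-view features
maps = torch.randn(8, 256)    # stand-in semantic-map features
print(info_nce(views, maps))
```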

ES2Net: An Efficient Spectral-Spatial Network for Hyperspectral Image Change Detection

  • paper_url: http://arxiv.org/abs/2307.12327
  • repo_url: None
  • paper_authors: Qingren Yao, Yuan Zhou, Wei Xiang
  • for: Hyperspectral image change detection (HSI-CD), i.e., identifying differences in bitemporal HSIs.
  • methods: An end-to-end efficient spectral-spatial change detection network (ES2Net) with a learnable band selection module and a cluster-wise spatial attention mechanism. (A band-selection sketch follows the abstract.)
  • results: Experiments on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of the method over other state-of-the-art methods.
    Abstract Hyperspectral image change detection (HSI-CD) aims to identify the differences in bitemporal HSIs. To mitigate spectral redundancy and improve the discriminativeness of changing features, some methods introduced band selection technology to select bands conducive for CD. However, these methods are limited by the inability to end-to-end training with the deep learning-based feature extractor and lack considering the complex nonlinear relationship among bands. In this paper, we propose an end-to-end efficient spectral-spatial change detection network (ES2Net) to address these issues. Specifically, we devised a learnable band selection module to automatically select bands conducive to CD. It can be jointly optimized with a feature extraction network and capture the complex nonlinear relationships among bands. Moreover, considering the large spatial feature distribution differences among different bands, we design the cluster-wise spatial attention mechanism that assigns a spatial attention factor to each individual band to individually improve the feature discriminativeness for each band. Experiments on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of this method compared with other state-of-the-art methods.
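One standard way to realize a band selection module that can be trained end-to-end with the feature extractor is a Gumbel-softmax relaxation over per-band logits, sketched below in PyTorch. This is an illustrative assumption; the paper's module may be built differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BandSelector(nn.Module):
    def __init__(self, num_bands, num_selected):
        super().__init__()
        # one logit vector per selected band, over all input bands
        self.logits = nn.Parameter(torch.zeros(num_selected, num_bands))

    def forward(self, hsi, tau=1.0):
        # hsi: (batch, num_bands, H, W)
        weights = F.gumbel_softmax(self.logits, tau=tau, hard=True)
        # each selected "band" is a (nearly) one-hot mixture of input bands,
        # so gradients flow to the selection logits during training
        return torch.einsum("kb,nbhw->nkhw", weights, hsi)

hsi = torch.randn(2, 154, 64, 64)        # stand-in 154-band patch
selected = BandSelector(154, 16)(hsi)    # (2, 16, 64, 64)
print(selected.shape)
```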

Development of pericardial fat count images using a combination of three different deep-learning models

  • paper_url: http://arxiv.org/abs/2307.12316
  • repo_url: None
  • paper_authors: Takaaki Matsunaga, Atsushi Kono, Hidetoshi Matsuo, Kaoru Kitagawa, Mizuho Nishio, Hiromi Hashimura, Yu Izawa, Takayoshi Toba, Kazuki Ishikawa, Akie Katsuki, Kazuyuki Ohmura, Takamichi Murakami
  • for: The paper aims to generate pericardial fat count images (PFCIs) from chest radiographs (CXRs) using a dedicated deep-learning model, in order to evaluate pericardial fat (PF) and its potential role in the development of coronary artery disease.
  • methods: The proposed method uses three different deep-learning models, including CycleGAN, to generate PFCIs from CXRs. The method first projects the three-dimensional CT images onto a two-dimensional plane, and then uses the deep-learning models to generate PFCIs from the projected images. The performance of the proposed method is evaluated using structural similarity index measure (SSIM), mean squared error (MSE), and mean absolute error (MAE).
  • results: The results show that the PFCIs generated using the proposed method outperform those generated using a single CycleGAN-based model, as measured by SSIM, MSE, and MAE, and suggest that PF could be evaluated without the need for CT scans. (A metric-computation sketch follows the abstract.)
    Abstract Rationale and Objectives: Pericardial fat (PF), the thoracic visceral fat surrounding the heart, promotes the development of coronary artery disease by inducing inflammation of the coronary arteries. For evaluating PF, this study aimed to generate pericardial fat count images (PFCIs) from chest radiographs (CXRs) using a dedicated deep-learning model. Materials and Methods: The data of 269 consecutive patients who underwent coronary computed tomography (CT) were reviewed. Patients with metal implants, pleural effusion, history of thoracic surgery, or that of malignancy were excluded. Thus, the data of 191 patients were used. PFCIs were generated from the projection of three-dimensional CT images, where fat accumulation was represented by a high pixel value. Three different deep-learning models, including CycleGAN, were combined in the proposed method to generate PFCIs from CXRs. A single CycleGAN-based model was used to generate PFCIs from CXRs for comparison with the proposed method. To evaluate the image quality of the generated PFCIs, structural similarity index measure (SSIM), mean squared error (MSE), and mean absolute error (MAE) of (i) the PFCI generated using the proposed method and (ii) the PFCI generated using the single model were compared. Results: The mean SSIM, MSE, and MAE were as follows: 0.856, 0.0128, and 0.0357, respectively, for the proposed model; and 0.762, 0.0198, and 0.0504, respectively, for the single CycleGAN-based model. Conclusion: PFCIs generated from CXRs with the proposed model showed better performance than those with the single model. PFCI evaluation without CT may be possible with the proposed method.
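A short sketch of the reported image-quality metrics (SSIM, MSE, MAE) computed between a generated PFCI and its CT-derived reference, using scikit-image on random stand-in images.

```python
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

rng = np.random.default_rng(0)
reference = rng.random((256, 256))   # CT-projected PFCI (stand-in)
generated = reference + 0.05 * rng.standard_normal((256, 256))

ssim = structural_similarity(
    reference, generated, data_range=generated.max() - generated.min()
)
mse = mean_squared_error(reference, generated)
mae = np.abs(reference - generated).mean()
print(f"SSIM={ssim:.3f}  MSE={mse:.4f}  MAE={mae:.4f}")
```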

Building Extraction from Remote Sensing Images via an Uncertainty-Aware Network

  • paper_url: http://arxiv.org/abs/2307.12309
  • repo_url: https://github.com/henryjiepanli/uncertainty-aware-network
  • paper_authors: Wei He, Jiepan Li, Weinan Cao, Liangpei Zhang, Hongyan Zhang
  • for: Accurate building extraction from remote sensing images, reducing the omission and commission errors caused by uncertain predictions.
  • methods: A novel and straightforward Uncertainty-Aware Network (UANet).
  • results: Outperforms other state-of-the-art methods by a large margin on three public building datasets (WHU, Massachusetts, and Inria).
    Abstract Building extraction aims to segment building pixels from remote sensing images and plays an essential role in many applications, such as city planning and urban dynamic monitoring. Over the past few years, deep learning methods with encoder-decoder architectures have achieved remarkable performance due to their powerful feature representation capability. Nevertheless, due to the varying scales and styles of buildings, conventional deep learning models always suffer from uncertain predictions and cannot accurately distinguish the complete footprints of the building from the complex distribution of ground objects, leading to a large degree of omission and commission. In this paper, we realize the importance of uncertain prediction and propose a novel and straightforward Uncertainty-Aware Network (UANet) to alleviate this problem. To verify the performance of our proposed UANet, we conduct extensive experiments on three public building datasets, including the WHU building dataset, the Massachusetts building dataset, and the Inria aerial image dataset. Results demonstrate that the proposed UANet outperforms other state-of-the-art algorithms by a large margin.

RANSAC-NN: Unsupervised Image Outlier Detection using RANSAC

  • paper_url: http://arxiv.org/abs/2307.12301
  • repo_url: https://github.com/mxtsai/ransac-nn
  • paper_authors: Chen-Han Tsai, Yu-Shao Peng
  • for: This paper proposes an unsupervised outlier detection algorithm specifically designed for image data, called RANSAC-NN.
  • methods: The proposed algorithm uses a RANSAC-based approach to compare images and predict the outlier score of each image without additional training or label information. (A loose schematic follows the abstract.)
  • results: The proposed algorithm consistently performs favorably against state-of-the-art outlier detection algorithms on 15 diverse datasets without any hyperparameter tuning, and it has potential applications in mislabeled-image detection.
    Abstract Image outlier detection (OD) is crucial for ensuring the quality and accuracy of image datasets used in computer vision tasks. The majority of OD algorithms, however, have not been targeted toward image data. Consequently, the results of applying such algorithms to images are often suboptimal. In this work, we propose RANSAC-NN, a novel unsupervised OD algorithm specifically designed for images. By comparing images in a RANSAC-based approach, our algorithm automatically predicts the outlier score of each image without additional training or label information. We evaluate RANSAC-NN against state-of-the-art OD algorithms on 15 diverse datasets. Without any hyperparameter tuning, RANSAC-NN consistently performs favorably in contrast to other algorithms in almost every dataset category. Furthermore, we provide a detailed analysis to understand each RANSAC-NN component, and we demonstrate its potential applications in image mislabeled detection. Code for RANSAC-NN is provided at https://github.com/mxtsai/ransac-nn
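A loose, hedged schematic of RANSAC-style outlier scoring on image features: repeatedly sample a small random subset, fit a trivial consensus model (here just the subset mean), and accumulate each image's disagreement with it. This conveys only the general flavor; the subset size, iteration count, and consensus model are assumptions, and the repository implements the real algorithm.

```python
import numpy as np

def ransac_outlier_scores(features, iters=200, subset=8,
                          rng=np.random.default_rng(0)):
    n = features.shape[0]
    scores = np.zeros(n)
    for _ in range(iters):
        idx = rng.choice(n, size=subset, replace=False)
        center = features[idx].mean(axis=0)        # trivial consensus model
        scores += np.linalg.norm(features - center, axis=1)
    return scores / iters                          # higher = more outlying

feats = np.random.default_rng(1).standard_normal((100, 512))
feats[:5] += 4.0                                   # plant 5 outliers
print(np.argsort(ransac_outlier_scores(feats))[-5:])   # should recover them
```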

Hybrid-CSR: Coupling Explicit and Implicit Shape Representation for Cortical Surface Reconstruction

  • paper_url: http://arxiv.org/abs/2307.12299
  • repo_url: None
  • paper_authors: Shanlin Sun, Thanh-Tung Le, Chenyu You, Hao Tang, Kun Han, Haoyu Ma, Deying Kong, Xiangyi Yan, Xiaohui Xie
  • for: cortical surface reconstruction
  • methods: geometric deep-learning model combining explicit and implicit shape representations, mesh-based deformation module, optimization-based diffeomorphic surface registration
  • results: surpasses existing implicit and explicit cortical surface reconstruction methods in numeric metrics, including accuracy, regularity, and consistency.
    Abstract We present Hybrid-CSR, a geometric deep-learning model that combines explicit and implicit shape representations for cortical surface reconstruction. Specifically, Hybrid-CSR begins with explicit deformations of template meshes to obtain coarsely reconstructed cortical surfaces, based on which the oriented point clouds are estimated for the subsequent differentiable poisson surface reconstruction. By doing so, our method unifies explicit (oriented point clouds) and implicit (indicator function) cortical surface reconstruction. Compared to explicit representation-based methods, our hybrid approach is more friendly to capture detailed structures, and when compared with implicit representation-based methods, our method can be topology aware because of end-to-end training with a mesh-based deformation module. In order to address topology defects, we propose a new topology correction pipeline that relies on optimization-based diffeomorphic surface registration. Experimental results on three brain datasets show that our approach surpasses existing implicit and explicit cortical surface reconstruction methods in numeric metrics in terms of accuracy, regularity, and consistency.

Simultaneous temperature estimation and nonuniformity correction from multiple frames

  • paper_url: http://arxiv.org/abs/2307.12297
  • repo_url: None
  • paper_authors: Navot Oz, Omri Berman, Nir Sochen, David Mendelovich, Iftach Klapp
  • for: Simultaneous temperature estimation and nonuniformity correction, improving the accuracy and usability of low-cost infrared cameras across a range of applications.
  • methods: A deep-learning kernel estimation network (KPN) that leverages the camera's physical image acquisition model, plus a novel offset block that incorporates the ambient temperature. (An offset-block sketch follows the abstract.)
  • results: On real data, the method achieves accurate and efficient temperature estimation and nonuniformity correction, with a significant improvement over a vanilla KPN.
    Abstract Infrared (IR) cameras are widely used for temperature measurements in various applications, including agriculture, medicine, and security. Low-cost IR camera have an immense potential to replace expansive radiometric cameras in these applications, however low-cost microbolometer-based IR cameras are prone to spatially-variant nonuniformity and to drift in temperature measurements, which limits their usability in practical scenarios. To address these limitations, we propose a novel approach for simultaneous temperature estimation and nonuniformity correction from multiple frames captured by low-cost microbolometer-based IR cameras. We leverage the physical image acquisition model of the camera and incorporate it into a deep learning architecture called kernel estimation networks (KPN), which enables us to combine multiple frames despite imperfect registration between them. We also propose a novel offset block that incorporates the ambient temperature into the model and enables us to estimate the offset of the camera, which is a key factor in temperature estimation. Our findings demonstrate that the number of frames has a significant impact on the accuracy of temperature estimation and nonuniformity correction. Moreover, our approach achieves a significant improvement in performance compared to vanilla KPN, thanks to the offset block. The method was tested on real data collected by a low-cost IR camera mounted on a UAV, showing only a small average error of $0.27^\circ C-0.54^\circ C$ relative to costly scientific-grade radiometric cameras. Our method provides an accurate and efficient solution for simultaneous temperature estimation and nonuniformity correction, which has important implications for a wide range of practical applications.
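A hedged sketch of the offset-block idea: a small head maps the ambient temperature to a per-image offset that is added to the network's per-pixel temperature estimate. The shapes and the two-layer head are placeholders; the paper's actual KPN architecture is more involved.

```python
import torch
import torch.nn as nn

class OffsetBlock(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, temp_map, ambient):
        # temp_map: (batch, 1, H, W) raw estimate; ambient: (batch, 1) in deg C
        offset = self.mlp(ambient)                   # (batch, 1)
        return temp_map + offset[:, :, None, None]   # broadcast over pixels

raw = torch.randn(4, 1, 120, 160)    # stand-in multi-frame fusion output
ambient = torch.tensor([[25.0], [30.0], [18.0], [22.0]])
print(OffsetBlock()(raw, ambient).shape)   # torch.Size([4, 1, 120, 160])
```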

TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering

  • paper_url: http://arxiv.org/abs/2307.12291
  • repo_url: https://github.com/pansanity666/TransHuman
  • paper_authors: Xiao Pan, Zongxin Yang, Jianxin Ma, Chang Zhou, Yi Yang
  • for: Generalizable neural human rendering: training conditional Neural Radiance Fields (NeRF) from multi-view videos of different characters so that unseen people can be rendered consistently.
  • methods: A brand-new framework named TransHuman that learns the painted SMPL under the canonical space and captures the global relationships between human parts with transformers; it is composed of Transformer-based Human Encoding (TransHE), Deformable Partial Radiance Fields (DPaRF), and Fine-grained Detail Integration (FDI).
  • results: Experiments show that TransHuman achieves a significantly new state-of-the-art on the ZJU-MoCap and H36M datasets with high efficiency. Project page: https://pansanity666.github.io/TransHuman/
    Abstract In this paper, we focus on the task of generalizable neural human rendering which trains conditional Neural Radiance Fields (NeRF) from multi-view videos of different characters. To handle the dynamic human motion, previous methods have primarily used a SparseConvNet (SPC)-based human representation to process the painted SMPL. However, such SPC-based representation i) optimizes under the volatile observation space which leads to the pose-misalignment between training and inference stages, and ii) lacks the global relationships among human parts that is critical for handling the incomplete painted SMPL. Tackling these issues, we present a brand-new framework named TransHuman, which learns the painted SMPL under the canonical space and captures the global relationships between human parts with transformers. Specifically, TransHuman is mainly composed of Transformer-based Human Encoding (TransHE), Deformable Partial Radiance Fields (DPaRF), and Fine-grained Detail Integration (FDI). TransHE first processes the painted SMPL under the canonical space via transformers for capturing the global relationships between human parts. Then, DPaRF binds each output token with a deformable radiance field for encoding the query point under the observation space. Finally, the FDI is employed to further integrate fine-grained information from reference images. Extensive experiments on ZJU-MoCap and H36M show that our TransHuman achieves a significantly new state-of-the-art performance with high efficiency. Project page: https://pansanity666.github.io/TransHuman/

Downstream-agnostic Adversarial Examples

  • paper_url: http://arxiv.org/abs/2307.12280
  • repo_url: https://github.com/cgcl-codes/advencoder
  • paper_authors: Ziqi Zhou, Shengshan Hu, Ruizhi Zhao, Qian Wang, Leo Yu Zhang, Junhui Hou, Hai Jin
  • for: Proposing an attack framework, based on pre-trained encoders, that generates downstream-agnostic universal adversarial examples against the downstream tasks inheriting those encoders.
  • methods: Uses high-frequency component information of images to guide the generation of adversarial examples, then designs a generative attack framework that learns the distribution of the attack surrogate dataset to improve attack success rates and transferability. (A minimal attack-objective sketch follows the abstract.)
  • results: An attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset; results against four tailored defenses further demonstrate AdvEncoder's attack ability.
    Abstract Self-supervised learning usually uses a large amount of unlabeled data to pre-train an encoder which can be used as a general-purpose feature extractor, such that downstream users only need to perform fine-tuning operations to enjoy the benefit of "large model". Despite this promising prospect, the security of pre-trained encoder has not been thoroughly investigated yet, especially when the pre-trained encoder is publicly available for commercial use. In this paper, we propose AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on the pre-trained encoder. AdvEncoder aims to construct a universal adversarial perturbation or patch for a set of natural images that can fool all the downstream tasks inheriting the victim pre-trained encoder. Unlike traditional adversarial example works, the pre-trained encoder only outputs feature vectors rather than classification labels. Therefore, we first exploit the high frequency component information of the image to guide the generation of adversarial examples. Then we design a generative attack framework to construct adversarial perturbations/patches by learning the distribution of the attack surrogate dataset to improve their attack success rates and transferability. Our results show that an attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset. We also tailor four defenses for pre-trained encoders, the results of which further prove the attack ability of AdvEncoder.
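A hedged sketch of the basic attack objective: optimize a single universal perturbation that pushes a frozen encoder's features away from the clean features for every image in a surrogate batch. The encoder below is a random stand-in, and the paper's high-frequency guidance and generative framework are omitted for brevity.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(                 # frozen stand-in encoder
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
)
for p in encoder.parameters():
    p.requires_grad_(False)

delta = torch.zeros(1, 3, 32, 32, requires_grad=True)  # universal perturbation
opt = torch.optim.Adam([delta], lr=1e-2)
images = torch.rand(16, 3, 32, 32)                     # surrogate images
eps = 8 / 255                                          # L-inf budget

for _ in range(50):
    clean = encoder(images)
    adv = encoder((images + delta).clamp(0, 1))
    loss = -nn.functional.mse_loss(adv, clean)   # maximize feature distortion
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                  # keep perturbation bounded
print(float(-loss))                              # final feature distortion
```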

FDCT: Fast Depth Completion for Transparent Objects

  • paper_url: http://arxiv.org/abs/2307.12274
  • repo_url: https://github.com/nonmy/fdct
  • paper_authors: Tianan Li, Zhehan Chen, Huan Liu, Chen Wang
  • for: A fast depth completion framework for transparent objects in RGB-D images.
  • methods: A new fusion branch and shortcuts to exploit low-level features, together with a loss function that suppresses overfitting.
  • results: Compared with previous methods, it recovers depth more accurately at about 70 FPS and improves pose estimation in object-grasping tasks.
    Abstract Depth completion is crucial for many robotic tasks such as autonomous driving, 3-D reconstruction, and manipulation. Despite the significant progress, existing methods remain computationally intensive and often fail to meet the real-time requirements of low-power robotic platforms. Additionally, most methods are designed for opaque objects and struggle with transparent objects due to the special properties of reflection and refraction. To address these challenges, we propose a Fast Depth Completion framework for Transparent objects (FDCT), which also benefits downstream tasks like object pose estimation. To leverage local information and avoid overfitting issues when integrating it with global information, we design a new fusion branch and shortcuts to exploit low-level features and a loss function to suppress overfitting. This results in an accurate and user-friendly depth rectification framework which can recover dense depth estimation from RGB-D images alone. Extensive experiments demonstrate that FDCT can run about 70 FPS with a higher accuracy than the state-of-the-art methods. We also demonstrate that FDCT can improve pose estimation in object grasping tasks. The source code is available at https://github.com/Nonmy/FDCT

Context Perception Parallel Decoder for Scene Text Recognition

  • paper_url: http://arxiv.org/abs/2307.12270
  • repo_url: None
  • paper_authors: Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Chenxia Li, Yuning Du, Yu-Gang Jiang
  • for: Achieving both high accuracy and fast inference in scene text recognition (STR).
  • methods: Presents an empirical study of autoregressive (AR) decoding, finding that its success lies not only in language modeling but also in guiding visual context perception; accordingly proposes the Context Perception Parallel Decoder (CPPD), whose character counting and character ordering modules supply contextual information about which characters occur and where, so the character sequence can be inferred accurately in a single parallel pass.
  • results: CPPD models achieve highly competitive accuracy on English and Chinese benchmarks while running about 7x faster than their AR counterparts, placing them among the fastest recognizers. Code will be released soon.
    Abstract Scene text recognition (STR) methods have struggled to attain high accuracy and fast inference speed. Autoregressive (AR)-based STR model uses the previously recognized characters to decode the next character iteratively. It shows superiority in terms of accuracy. However, the inference speed is slow also due to this iteration. Alternatively, parallel decoding (PD)-based STR model infers all the characters in a single decoding pass. It has advantages in terms of inference speed but worse accuracy, as it is difficult to build a robust recognition context in such a pass. In this paper, we first present an empirical study of AR decoding in STR. In addition to constructing a new AR model with the top accuracy, we find out that the success of AR decoder lies also in providing guidance on visual context perception rather than language modeling as claimed in existing studies. As a consequence, we propose Context Perception Parallel Decoder (CPPD) to decode the character sequence in a single PD pass. CPPD devises a character counting module and a character ordering module. Given a text instance, the former infers the occurrence count of each character, while the latter deduces the character reading order and placeholders. Together with the character prediction task, they construct a context that robustly tells what the character sequence is and where the characters appear, well mimicking the context conveyed by AR decoding. Experiments on both English and Chinese benchmarks demonstrate that CPPD models achieve highly competitive accuracy. Moreover, they run approximately 7x faster than their AR counterparts, and are also among the fastest recognizers. The code will be released soon.

ResWCAE: Biometric Pattern Image Denoising Using Residual Wavelet-Conditioned Autoencoder

  • paper_url: http://arxiv.org/abs/2307.12255
  • repo_url: None
  • paper_authors: Youzhi Liang, Wen Liang
  • for: A lightweight and robust deep learning architecture for denoising fingerprint pattern images on compact IoT devices.
  • methods: A Residual Wavelet-Conditioned Convolutional Autoencoder (Res-WCAE) with Kullback-Leibler divergence (KLD) regularization, comprising an image encoder, a wavelet encoder, and one decoder; residual connections between the image encoder and decoder preserve fine-grained spatial features, while the bottleneck is conditioned on compressed features from the wavelet encoder.
  • results: Res-WCAE outperforms several state-of-the-art denoising methods, particularly for heavily degraded fingerprint images at high noise levels, showing promise for biometric authentication systems in compact IoT devices.
    Abstract The utilization of biometric authentication with pattern images is increasingly popular in compact Internet of Things (IoT) devices. However, the reliability of such systems can be compromised by image quality issues, particularly in the presence of high levels of noise. While state-of-the-art deep learning algorithms designed for generic image denoising have shown promise, their large number of parameters and lack of optimization for unique biometric pattern retrieval make them unsuitable for these devices and scenarios. In response to these challenges, this paper proposes a lightweight and robust deep learning architecture, the Residual Wavelet-Conditioned Convolutional Autoencoder (Res-WCAE) with a Kullback-Leibler divergence (KLD) regularization, designed specifically for fingerprint image denoising. Res-WCAE comprises two encoders - an image encoder and a wavelet encoder - and one decoder. Residual connections between the image encoder and decoder are leveraged to preserve fine-grained spatial features, where the bottleneck layer conditioned on the compressed representation of features obtained from the wavelet encoder using approximation and detail subimages in the wavelet-transform domain. The effectiveness of Res-WCAE is evaluated against several state-of-the-art denoising methods, and the experimental results demonstrate that Res-WCAE outperforms these methods, particularly for heavily degraded fingerprint images in the presence of high levels of noise. Overall, Res-WCAE shows promise as a solution to the challenges faced by biometric authentication systems in compact IoT devices.

Explainable Depression Detection via Head Motion Patterns

  • paper_url: http://arxiv.org/abs/2307.12241
  • repo_url: None
  • paper_authors: Monika Gahalawat, Raul Fernandez Rojas, Tanaya Guha, Ramanathan Subramanian, Roland Goecke
  • for: Detecting depressive symptoms from head motion data using fundamental head-motion units (kinemes) and machine learning methods.
  • methods: Two approaches: (a) discovering kinemes from the head motion of both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls and computing statistics derived from reconstruction errors for both classes; depression classification is evaluated on the BlackDog and AVEC2013 datasets. (A kineme-discovery sketch follows the abstract.)
  • results: Head motion patterns are effective biomarkers for detecting depressive symptoms, and explanatory kineme patterns consistent with prior findings are observed for the two classes; peak F1 scores reach 0.79 (BlackDog) and 0.82 (AVEC2013) for binary classification over episodic thin-slices, and 0.72 over videos for AVEC2013.
    Abstract While depression has been studied via multimodal non-verbal behavioural cues, head motion behaviour has not received much attention as a biomarker. This study demonstrates the utility of fundamental head-motion units, termed \emph{kinemes}, for depression detection by adopting two distinct approaches, and employing distinctive features: (a) discovering kinemes from head motion data corresponding to both depressed patients and healthy controls, and (b) learning kineme patterns only from healthy controls, and computing statistics derived from reconstruction errors for both the patient and control classes. Employing machine learning methods, we evaluate depression classification performance on the \emph{BlackDog} and \emph{AVEC2013} datasets. Our findings indicate that: (1) head motion patterns are effective biomarkers for detecting depressive symptoms, and (2) explanatory kineme patterns consistent with prior findings can be observed for the two classes. Overall, we achieve peak F1 scores of 0.79 and 0.82, respectively, over BlackDog and AVEC2013 for binary classification over episodic \emph{thin-slices}, and a peak F1 of 0.72 over videos for AVEC2013.
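A hedged sketch of kineme discovery: slice head-pose time series (e.g. pitch/yaw/roll per frame) into short windows and cluster them, so each cluster center acts as one elementary head-motion unit. The window length, features, and number of clusters are assumptions; the paper's pipeline may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pose = rng.standard_normal((3000, 3)).cumsum(axis=0)  # stand-in pitch/yaw/roll

win = 30                                              # ~1 s at 30 fps
windows = np.stack([pose[i:i + win].ravel()
                    for i in range(0, len(pose) - win, win)])
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(windows)

# each video then becomes a sequence of kineme labels; statistics over
# these labels can feed a downstream depression classifier
kineme_sequence = kmeans.predict(windows)
print(kineme_sequence[:10])
```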

Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation

  • paper_url: http://arxiv.org/abs/2307.12239
  • repo_url: https://github.com/bytedance/dq-det
  • paper_authors: Yiming Cui, Linjie Yang, Haichao Yu
  • for: Improving the performance of DETR-based models across multiple tasks, including object detection, instance segmentation, panoptic segmentation, and video instance segmentation.
  • methods: DETR-based models use a list of learned detection queries to retrieve information from the transformer network and predict the location and category of one object per query; this work learns convex combinations of those queries with dynamic coefficients driven by the high-level semantics of the image, producing modulated queries that better capture the prior over object locations and categories. (A sketch of the modulation follows the abstract.)
  • results: Equipped with the modulated queries, a wide range of DETR-based models achieve consistent and superior performance across the tasks above.
    Abstract Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network and learn to predict the location and category of one specific object from each query. We empirically find that random convex combinations of the learned queries are still good for the corresponding models. We then propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image. The generated dynamic queries, named modulated queries, better capture the prior of object locations and categories in the different images. Equipped with our modulated queries, a wide range of DETR-based models achieve consistent and superior performance across multiple tasks including object detection, instance segmentation, panoptic segmentation, and video instance segmentation.
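A hedged PyTorch sketch of modulated queries: several sets of learned base queries are mixed by convex (softmax) coefficients predicted from pooled image features, so each image receives its own query set. The dimensions, the number of base sets, and the linear coefficient head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModulatedQueries(nn.Module):
    def __init__(self, num_queries=100, num_bases=4, dim=256, feat_dim=256):
        super().__init__()
        # num_bases independent sets of learned queries to mix between
        self.bases = nn.Parameter(torch.randn(num_bases, num_queries, dim))
        self.coef_head = nn.Linear(feat_dim, num_bases)

    def forward(self, image_feat):
        # image_feat: (batch, feat_dim) pooled high-level features
        coef = self.coef_head(image_feat).softmax(dim=-1)   # convex weights
        # (batch, num_bases) x (num_bases, Q, dim) -> (batch, Q, dim)
        return torch.einsum("bk,kqd->bqd", coef, self.bases)

feat = torch.randn(2, 256)
queries = ModulatedQueries()(feat)
print(queries.shape)   # torch.Size([2, 100, 256])
```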

Multi-Modal Machine Learning for Assessing Gaming Skills in Online Streaming: A Case Study with CS:GO

  • paper_url: http://arxiv.org/abs/2307.12236
  • repo_url: None
  • paper_authors: Longxiang Zhang, Wenping Wang
  • for: Assessing gaming skills from video streams so that streaming service providers can discover talented gamers and offer customized recommendation and service promotion.
  • methods: Proposes several variants of the latest end-to-end models to learn a joint representation of multiple modalities; the authors first identify flaws in the dataset and clean it manually.
  • results: Extensive experiments demonstrate the efficacy of the proposals, but the models are found to be prone to identifying users rather than learning meaningful representations, an issue left to future work.
    Abstract Online streaming is an emerging market that attracts much attention. Assessing gaming skills from videos is an important task for streaming service providers to discover talented gamers. Service providers require the information to offer customized recommendation and service promotion to their customers. Meanwhile, this is also an important multi-modal machine learning task, since online streaming combines vision, audio and text modalities. In this study we begin by identifying flaws in the dataset and proceed to clean it manually. Then we propose several variants of the latest end-to-end models to learn a joint representation of multiple modalities. Through our extensive experimentation, we demonstrate the efficacy of our proposals. Moreover, we identify that our proposed models are prone to identifying users instead of learning meaningful representations. We propose future work to address this issue.

EchoGLAD: Hierarchical Graph Neural Networks for Left Ventricle Landmark Detection on Echocardiograms

  • paper_url: http://arxiv.org/abs/2307.12229
  • repo_url: https://github.com/masoudmo/echoglad
  • paper_authors: Masoud Mokhtari, Mobina Mahdavi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang, Renjie Liao
  • for: Automating the detection of four landmark locations and the measurement of the internal dimension of the left ventricle and the approximate mass of the surrounding muscle, using machine learning.
  • methods: An echocardiogram-based, hierarchical graph neural network (GNN) for left ventricle landmark detection, combining a hierarchical graph representation learning framework for multi-resolution landmark detection via GNNs with induced hierarchical supervision at different levels of granularity through a multi-level loss.
  • results: State-of-the-art mean absolute errors (MAEs) of 1.46 mm and 1.86 mm on two datasets under the in-distribution (ID) setting, and better out-of-distribution (OOD) generalization than prior works with a testing MAE of 4.3 mm.
    Abstract The functional assessment of the left ventricle chamber of the heart requires detecting four landmark locations and measuring the internal dimension of the left ventricle and the approximate mass of the surrounding muscle. The key challenge of automating this task with machine learning is the sparsity of clinical labels, i.e., only a few landmark pixels in a high-dimensional image are annotated, leading many prior works to heavily rely on isotropic label smoothing. However, such a label smoothing strategy ignores the anatomical information of the image and induces some bias. To address this challenge, we introduce an echocardiogram-based, hierarchical graph neural network (GNN) for left ventricle landmark detection (EchoGLAD). Our main contributions are: 1) a hierarchical graph representation learning framework for multi-resolution landmark detection via GNNs; 2) induced hierarchical supervision at different levels of granularity using a multi-level loss. We evaluate our model on a public and a private dataset under the in-distribution (ID) and out-of-distribution (OOD) settings. For the ID setting, we achieve the state-of-the-art mean absolute errors (MAEs) of 1.46 mm and 1.86 mm on the two datasets. Our model also shows better OOD generalization than prior works with a testing MAE of 4.3 mm.

The identification of garbage dumps in the rural areas of Cyprus through the application of deep learning to satellite imagery

  • paper_url: http://arxiv.org/abs/2308.02502
  • repo_url: None
  • paper_authors: Andrew Keith Wilkinson
  • for: Investigating the use of artificial intelligence techniques and satellite imagery to identify illegal garbage dumps in rural areas of Cyprus.
  • methods: A novel dataset of images categorized as containing or not containing garbage, enlarged with data augmentation, and a convolutional neural network trained to recognize the presence or absence of garbage in new images. (A pipeline sketch follows the abstract.)
  • results: The resulting deep learning model correctly identifies images containing garbage in approximately 90% of cases and could form the basis of a future system that systematically analyzes the entire landscape of Cyprus to build a comprehensive "garbage" map of the island.
    Abstract Garbage disposal is a challenging problem throughout the developed world. In Cyprus, as elsewhere, illegal ``fly-tipping" is a significant issue, especially in rural areas where few legal garbage disposal options exist. However, there is a lack of studies that attempt to measure the scale of this problem, and few resources available to address it. A method of automating the process of identifying garbage dumps would help counter this and provide information to the relevant authorities. The aim of this study was to investigate the degree to which artificial intelligence techniques, together with satellite imagery, can be used to identify illegal garbage dumps in the rural areas of Cyprus. This involved collecting a novel dataset of images that could be categorised as either containing, or not containing, garbage. The collection of such datasets in sufficient raw quantities is time consuming and costly. Therefore a relatively modest baseline set of images was collected, then data augmentation techniques used to increase the size of this dataset to a point where useful machine learning could occur. From this set of images an artificial neural network was trained to recognise the presence or absence of garbage in new images. A type of neural network especially suited to this task known as ``convolutional neural networks" was used. The efficacy of the resulting model was evaluated using an independently collected dataset of test images. The result was a deep learning model that could correctly identify images containing garbage in approximately 90\% of cases. It is envisaged that this model could form the basis of a future system that could systematically analyse the entire landscape of Cyprus to build a comprehensive ``garbage" map of the island.
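A hedged sketch of the described pipeline: augment a modest image set and train a small binary CNN to decide whether a satellite patch contains garbage. The architecture, augmentations, and training loop are illustrative, not the study's exact setup.

```python
import torch
import torch.nn as nn
from torchvision import transforms

augment = transforms.Compose([           # enlarge a small dataset
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(20),
    transforms.ColorJitter(brightness=0.2),
])

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 16 * 16, 1),   # logit: "contains garbage"
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

images = torch.rand(8, 3, 64, 64)            # stand-in satellite patches
labels = torch.randint(0, 2, (8, 1)).float()
for _ in range(5):
    batch = torch.stack([augment(img) for img in images])
    loss = loss_fn(model(batch), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```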

ASCON: Anatomy-aware Supervised Contrastive Learning Framework for Low-dose CT Denoising

  • paper_url: http://arxiv.org/abs/2307.12225
  • repo_url: https://github.com/hao1635/ASCON
  • paper_authors: Zhihao Chen, Qi Gao, Yi Zhang, Hongming Shan
  • for: Low-dose computed tomography (CT) image denoising.
  • methods: Proposes a novel Anatomy-aware Supervised CONtrastive learning framework (ASCON) that exploits the anatomical semantics of CT images for denoising while providing anatomical interpretability; it combines an efficient self-attention-based U-Net (ESAU-Net) with a multi-scale anatomical contrastive network (MAC-Net).
  • results: Extensive experiments on two public low-dose CT denoising datasets demonstrate that ASCON outperforms state-of-the-art models and, for the first time, provides anatomical interpretability for low-dose CT denoising.
    Abstract While various deep learning methods have been proposed for low-dose computed tomography (CT) denoising, most of them leverage the normal-dose CT images as the ground-truth to supervise the denoising process. These methods typically ignore the inherent correlation within a single CT image, especially the anatomical semantics of human tissues, and lack the interpretability on the denoising process. In this paper, we propose a novel Anatomy-aware Supervised CONtrastive learning framework, termed ASCON, which can explore the anatomical semantics for low-dose CT denoising while providing anatomical interpretability. The proposed ASCON consists of two novel designs: an efficient self-attention-based U-Net (ESAU-Net) and a multi-scale anatomical contrastive network (MAC-Net). First, to better capture global-local interactions and adapt to the high-resolution input, an efficient ESAU-Net is introduced by using a channel-wise self-attention mechanism. Second, MAC-Net incorporates a patch-wise non-contrastive module to capture inherent anatomical information and a pixel-wise contrastive module to maintain intrinsic anatomical consistency. Extensive experimental results on two public low-dose CT denoising datasets demonstrate superior performance of ASCON over state-of-the-art models. Remarkably, our ASCON provides anatomical interpretability for low-dose CT denoising for the first time. Source code is available at https://github.com/hao1635/ASCON.
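
The abstract's ESAU-Net relies on a channel-wise self-attention mechanism to keep attention affordable on high-resolution inputs. Below is a minimal sketch of such a block, where attention is computed over the C channel dimension rather than the H*W spatial positions, so cost scales with C^2 instead of (H*W)^2; the projection layout and normalization here are assumptions, not ASCON's exact design.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Self-attention over channels instead of spatial positions."""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        q = torch.nn.functional.normalize(q, dim=-1)
        k = torch.nn.functional.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (b, c, c), not (hw, hw)
        out = (attn.softmax(dim=-1) @ v).reshape(b, c, h, w)
        return x + self.proj(out)                        # residual connection

x = torch.randn(2, 64, 256, 256)                         # high-res CT slice
print(ChannelSelfAttention(64)(x).shape)                 # torch.Size([2, 64, 256, 256])
```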

LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference

  • paper_url: http://arxiv.org/abs/2307.12217
  • repo_url: None
  • paper_authors: Cong Wang, Yu-Ping Wang, Dinesh Manocha
  • for: Proposes LoLep, a method that regresses locally-learned planes from a single RGB image to represent scenes accurately and generate better novel views.
  • methods: To regress appropriate plane locations without depth information, the disparity space is pre-partitioned into bins and a disparity sampler regresses local offsets for multiple planes in each bin; two optimizing strategies tied to the disparity distributions of different datasets and an occlusion-aware reprojection loss provide simple yet effective geometric supervision, and a Block-Sampling Self-Attention (BS-SA) module improves occlusion inference on large feature maps.
  • results: Produces accurate scene representations and state-of-the-art results on different datasets; compared to MINE, LPIPS is reduced by 4.8%-9.0% and RV by 73.9%-83.5%, with benefits also demonstrated on real-world images.
    Abstract We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without the depth information, regressing appropriate plane locations is a challenging problem. To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, using such a sampler alone prevents the network from converging; we further propose two optimizing strategies that combine with different disparity distributions of datasets and propose an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and generate state-of-the-art results on different datasets. Compared to MINE, our approach has an LPIPS reduction of 4.8%-9.0% and an RV reduction of 73.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits.
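
A hedged sketch of the bin-partitioned disparity idea: the disparity range is split into uniform bins and a network head predicts a normalized offset that places a plane inside each bin (the paper regresses several planes per bin; one per bin here for brevity). The disparity range and bin layout are assumptions.

```python
import torch

def plane_disparities(offsets: torch.Tensor, d_min=0.001, d_max=1.0):
    """Place one plane inside each pre-partitioned disparity bin.
    offsets: (batch, n_bins) in [0, 1], predicted by a sampler head;
    the learned offset shifts each plane within its own bin."""
    b, n_bins = offsets.shape
    edges = torch.linspace(d_min, d_max, n_bins + 1)   # uniform bin edges
    lo, hi = edges[:-1], edges[1:]                      # per-bin ranges
    return lo + offsets * (hi - lo)                     # (batch, n_bins)

offsets = torch.sigmoid(torch.randn(2, 32))   # stand-in for a sampler head
print(plane_disparities(offsets).shape)        # torch.Size([2, 32])
```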

LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2307.12194
  • repo_url: None
  • paper_authors: Mohammad Samiul Arshad, William J. Beksi
  • for: Reconstructing the geometric and topological structure of a 3D object from a single 2D image.
  • methods: A novel neural architecture that leverages local and global image features: global 2D features predict a coarse shape of the target object, which serves as the base for higher-resolution reconstruction, and an implicit predictor estimates the signed distance between an arbitrary point and the target surface.
  • results: Reconstructs the geometry and topology of 3D objects more accurately than existing methods, without requiring camera estimation or pixel alignment.
    Abstract Accurate reconstruction of both the geometric and topological details of a 3D object from a single 2D image embodies a fundamental challenge in computer vision. Existing explicit/implicit solutions to this problem struggle to recover self-occluded geometry and/or faithfully reconstruct topological shape structures. To resolve this dilemma, we introduce LIST, a novel neural architecture that leverages local and global image features to accurately reconstruct the geometric and topological structure of a 3D object from a single image. We utilize global 2D features to predict a coarse shape of the target object and then use it as a base for higher-resolution reconstruction. By leveraging both local 2D features from the image and 3D features from the coarse prediction, we can predict the signed distance between an arbitrary point and the target surface via an implicit predictor with great accuracy. Furthermore, our model does not require camera estimation or pixel alignment. It provides an uninfluenced reconstruction from the input-view direction. Through qualitative and quantitative analysis, we show the superiority of our model in reconstructing 3D objects from both synthetic and real-world images against the state of the art.
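
The implicit predictor can be pictured as an MLP that, for each query point, consumes the 3D coordinate plus local 2D image features and coarse-shape 3D features sampled at that point and regresses a signed distance. A minimal sketch follows; feature dimensions and MLP depth are assumptions, not LIST's exact architecture.

```python
import torch
import torch.nn as nn

class ImplicitSDFHead(nn.Module):
    """Regress the signed distance to the surface for each query point."""
    def __init__(self, local_dim=256, coarse_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + local_dim + coarse_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, local_feat, coarse_feat):
        # points: (B, N, 3); local_feat: (B, N, local_dim);
        # coarse_feat: (B, N, coarse_dim), gathered from the coarse prediction
        return self.mlp(torch.cat([points, local_feat, coarse_feat], dim=-1)).squeeze(-1)

head = ImplicitSDFHead()
sdf = head(torch.randn(2, 1024, 3), torch.randn(2, 1024, 256), torch.randn(2, 1024, 64))
print(sdf.shape)  # torch.Size([2, 1024]) signed distances
```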

An X3D Neural Network Analysis for Runner’s Performance Assessment in a Wild Sporting Environment

  • paper_url: http://arxiv.org/abs/2307.12183
  • repo_url: None
  • paper_authors: David Freire-Obregón, Javier Lorenzo-Navarro, Oliverio J. Santana, Daniel Hernández-Sosa, Modesto Castrillón-Santana
  • for: A transfer learning analysis of expanded 3D (X3D) neural networks in a wild sporting environment.
  • methods: Uses an action recognition network to estimate athletes' cumulative race time (CRT) during an ultra-distance competition.
  • results: X3D delivers remarkable performance for short input footage, with a mean absolute error of 12 and a half minutes when estimating the CRT of runners who have been active for 8 to 20 hours, while requiring almost seven times less memory than previous work to achieve better precision.
    Abstract We present a transfer learning analysis on a sporting environment of the expanded 3D (X3D) neural networks. Inspired by action quality assessment methods in the literature, our method uses an action recognition network to estimate athletes' cumulative race time (CRT) during an ultra-distance competition. We evaluate the performance considering the X3D, a family of action recognition networks that expand a small 2D image classification architecture along multiple network axes, including space, time, width, and depth. We demonstrate that the resulting neural network can provide remarkable performance for short input footage, with a mean absolute error of 12 and a half minutes when estimating the CRT for runners who have been active from 8 to 20 hours. Our most significant discovery is that X3D achieves state-of-the-art performance while requiring almost seven times less memory to achieve better precision than previous work.
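
In transfer-learning terms, the recipe amounts to taking a pretrained action recognition backbone and replacing its classification head with a single-output regression head for CRT. The sketch below loads an X3D model from the public PyTorchVideo hub; the module layout, frame count, and the choice to predict CRT in minutes are assumptions about how such a setup could look, not the paper's exact pipeline.

```python
import torch
import torch.nn as nn

# Pretrained X3D backbone from the public PyTorchVideo model zoo.
model = torch.hub.load('facebookresearch/pytorchvideo', 'x3d_m', pretrained=True)

# Swap the classification head for a 1-output regression head (CRT in minutes)
# and drop the softmax activation the head applies for classification.
head = model.blocks[-1]
head.proj = nn.Linear(head.proj.in_features, 1)
head.activation = None

clip = torch.randn(1, 3, 16, 224, 224)   # (batch, C, T, H, W) short footage
print(model(clip).shape)                  # torch.Size([1, 1]) predicted CRT
```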

Prototype-Driven and Multi-Expert Integrated Multi-Modal MR Brain Tumor Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.12180
  • repo_url: https://github.com/linzy0227/pdminet
  • paper_authors: Yafei Zhang, Zhiyuan Li, Huafeng Li, Dapeng Tao
  • for: Multi-modal magnetic resonance (MR) brain tumor image segmentation, i.e., determining the category and location of tumor sub-regions.
  • methods: Extracts features from the input images, then uses learned tumor prototypes to guide and fuse features from the different modalities so that the features of each tumor sub-region are highlighted; a mutual transmission mechanism exchanges information across modalities, and a multi-expert integration strategy fuses features from different network layers with the prototype-highlighted features.
  • results: Experimental results on three competition brain tumor segmentation datasets prove the superiority of the proposed method.
    Abstract For multi-modal magnetic resonance (MR) brain tumor image segmentation, current methods usually directly extract the discriminative features from input images for tumor sub-region category determination and localization. However, the impact of information aliasing caused by the mutual inclusion of tumor sub-regions is often ignored. Moreover, existing methods usually do not take tailored efforts to highlight the single tumor sub-region features. To this end, a multi-modal MR brain tumor segmentation method with tumor prototype-driven and multi-expert integration is proposed. It could highlight the features of each tumor sub-region under the guidance of tumor prototypes. Specifically, to obtain the prototypes with complete information, we propose a mutual transmission mechanism to transfer different modal features to each other to address the issues raised by insufficient information on single-modal features. Furthermore, we devise a prototype-driven feature representation and fusion method with the learned prototypes, which implants the prototypes into tumor features and generates corresponding activation maps. With the activation maps, the sub-region features consistent with the prototype category can be highlighted. A key information enhancement and fusion strategy with multi-expert integration is designed to further improve the segmentation performance. The strategy can integrate the features from different layers of the extra feature extraction network and the features highlighted by the prototypes. Experimental results on three competition brain tumor segmentation datasets prove the superiority of the proposed method.
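
The prototype-guided highlighting can be illustrated as matching each spatial feature vector against one learned prototype per tumor sub-region, yielding per-prototype activation maps. This is a generic prototype-matching sketch under assumed shapes, not PDMInet's exact fusion rule.

```python
import torch
import torch.nn.functional as F

def prototype_activation_maps(features, prototypes):
    """Cosine similarity between spatial features and sub-region prototypes.
    features:   (B, C, H, W) fused multi-modal features
    prototypes: (K, C) one learned vector per tumor sub-region
    returns:    (B, K, H, W) activation maps in [-1, 1]"""
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    return torch.einsum('bchw,kc->bkhw', f, p)

maps = prototype_activation_maps(torch.randn(2, 64, 96, 96), torch.randn(3, 64))
print(maps.shape)  # torch.Size([2, 3, 96, 96])
```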

Leveraging Knowledge Graphs for Zero-Shot Object-agnostic State Classification

  • paper_url: http://arxiv.org/abs/2307.12179
  • repo_url: None
  • paper_authors: Filipos Gouidis, Theodore Patkos, Antonis Argyros, Dimitris Plexousakis
  • for: Object State Classification (OSC) posed as a zero-shot learning problem, i.e., predicting the state of an object without relying on knowledge or estimation of the object class.
  • methods: Proposes the first Object-agnostic State Classification (OaSC) method, which capitalizes on Knowledge Graphs (KGs) for structuring and organizing knowledge; combined with visual information, this enables inference of the states of object/state pairs not encountered in the training set.
  • results: Experiments show that knowledge of an object's class is not decisive for predicting its state, and the proposed OaSC method outperforms existing methods on all datasets and benchmarks by a great margin.
    Abstract We investigate the problem of Object State Classification (OSC) as a zero-shot learning problem. Specifically, we propose the first Object-agnostic State Classification (OaSC) method that infers the state of a certain object without relying on the knowledge or the estimation of the object class. In that direction, we capitalize on Knowledge Graphs (KGs) for structuring and organizing knowledge, which, in combination with visual information, enable the inference of the states of objects in object/state pairs that have not been encountered in the method's training set. A series of experiments investigate the performance of the proposed method in various settings, against several hypotheses and in comparison with state of the art approaches for object attribute classification. The experimental results demonstrate that the knowledge of an object class is not decisive for the prediction of its state. Moreover, the proposed OaSC method outperforms existing methods in all datasets and benchmarks by a great margin.
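
One way to picture object-agnostic zero-shot scoring: embed the image with a visual encoder and each candidate state with knowledge-graph-derived embeddings, then score compatibility in a shared space. Both encoders and the cosine scoring rule are assumptions for illustration, not the paper's exact head.

```python
import torch
import torch.nn.functional as F

def zero_shot_state_scores(image_emb, state_embs):
    """Score each candidate state by cosine similarity between an image
    embedding and KG-derived state embeddings (e.g., from a graph
    embedding method), assuming both live in a shared D-dim space.
    image_emb: (B, D); state_embs: (S, D) -> (B, S) scores"""
    return F.normalize(image_emb, dim=-1) @ F.normalize(state_embs, dim=-1).T

scores = zero_shot_state_scores(torch.randn(4, 300), torch.randn(10, 300))
print(scores.argmax(dim=-1))  # predicted state index per image
```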

Challenges for Monocular 6D Object Pose Estimation in Robotics

  • paper_url: http://arxiv.org/abs/2307.12172
  • repo_url: None
  • paper_authors: Stefan Thalhammer, Dominik Bauer, Peter Hönig, Jean-Baptiste Weibel, José García-Rodríguez, Markus Vincze
  • for: A survey of monocular 6D object pose estimation, a core perception task for robotics applications.
  • methods: Provides a unified view of recent publications from both robotics and computer vision; monocular approaches suit robotics thanks to widely available, inexpensive, high-resolution RGB sensors and CNNs that allow fast inference, yet prior surveys' broad scope hinders identification of challenges specific to the monocular setting.
  • results: Identifies occlusion handling, novel pose representations, and formalizing and improving category-level pose estimation as fundamental challenges highly relevant for robotics; large object sets, novel objects, refractive materials, and uncertainty estimates remain central, largely unsolved open challenges.
    Abstract Object pose estimation is a core perception task that enables, for example, object grasping and scene understanding. The widely available, inexpensive and high-resolution RGB sensors and CNNs that allow for fast inference based on this modality make monocular approaches especially well suited for robotics applications. We observe that previous surveys on object pose estimation establish the state of the art for varying modalities, single- and multi-view settings, and datasets and metrics that consider a multitude of applications. We argue, however, that those works' broad scope hinders the identification of open challenges that are specific to monocular approaches and the derivation of promising future challenges for their application in robotics. By providing a unified view on recent publications from both robotics and computer vision, we find that occlusion handling, novel pose representations, and formalizing and improving category-level pose estimation are still fundamental challenges that are highly relevant for robotics. Moreover, to further improve robotic performance, large object sets, novel objects, refractive materials, and uncertainty estimates are central, largely unsolved open challenges. In order to address them, ontological reasoning, deformability handling, scene-level reasoning, realistic datasets, and the ecological footprint of algorithms need to be improved.

Facial Point Graphs for Amyotrophic Lateral Sclerosis Identification

  • paper_url: http://arxiv.org/abs/2307.12159
  • repo_url: None
  • paper_authors: Nícolas Barbosa Gomes, Arissa Yoshida, Mateus Roder, Guilherme Camargo de Oliveira, João Paulo Papa
  • for: Early diagnosis of amyotrophic lateral sclerosis (ALS), which helps establish the beginning of treatment, enriches the outlook, and enhances the overall well-being of affected individuals.
  • methods: Proposes Facial Point Graphs, which learn from the geometry of facial images to identify ALS automatically from a patient's facial expressions.
  • results: Experimental outcomes on the Toronto Neuroface dataset show the proposed approach outperforms state-of-the-art results, fostering promising developments in the area.
    Abstract Identifying Amyotrophic Lateral Sclerosis (ALS) in its early stages is essential for establishing the beginning of treatment, enriching the outlook, and enhancing the overall well-being of those affected individuals. However, early diagnosis and detecting the disease's signs is not straightforward. A simpler and cheaper way arises by analyzing the patient's facial expressions through computational methods. When a patient with ALS engages in specific actions, e.g., opening their mouth, the movement of specific facial muscles differs from that observed in a healthy individual. This paper proposes Facial Point Graphs to learn information from the geometry of facial images to identify ALS automatically. The experimental outcomes in the Toronto Neuroface dataset show the proposed approach outperformed state-of-the-art results, fostering promising developments in the area.
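
The facial point graph idea can be sketched as turning detected 2D facial landmarks into a graph whose nodes are points and whose edges encode geometric proximity, ready for a graph network. The k-NN connectivity and 68-point layout below are assumptions, not necessarily the paper's construction.

```python
import torch

def facial_point_graph(landmarks, k=5):
    """Build a k-nearest-neighbour graph over 2D facial landmarks, in the
    COO edge-index format used by libraries such as PyTorch Geometric.
    landmarks: (N, 2) -> node features (N, 2), edge_index (2, N*k)"""
    dists = torch.cdist(landmarks, landmarks)          # (N, N) pairwise
    dists.fill_diagonal_(float('inf'))                 # no self-loops
    nbrs = dists.topk(k, largest=False).indices        # (N, k) nearest points
    src = torch.arange(landmarks.size(0)).repeat_interleave(k)
    return landmarks, torch.stack([src, nbrs.reshape(-1)])

nodes, edge_index = facial_point_graph(torch.rand(68, 2))  # 68 dlib-style points
print(edge_index.shape)  # torch.Size([2, 340])
```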

Real-Time Neural Video Recovery and Enhancement on Mobile Devices

  • paper_url: http://arxiv.org/abs/2307.12152
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Zhaoyuan He, Yifan Yang, Lili Qiu, Kyoungjun Park
  • for: Improving the video streaming experience on mobile devices.
  • methods: Proposes a novel video frame recovery scheme, a new super-resolution algorithm, and a receiver-enhancement-aware video bit rate adaptation algorithm.
  • results: Implemented on an iPhone 12 with support for 30 frames per second (FPS); evaluations across WiFi, 3G, 4G, and 5G networks show real-time enhancement and a significant increase in video QoE (Quality of Experience) of 24%-82%.
    Abstract As mobile devices become increasingly popular for video streaming, it's crucial to optimize the streaming experience for these devices. Although deep learning-based video enhancement techniques are gaining attention, most of them cannot support real-time enhancement on mobile devices. Additionally, many of these techniques are focused solely on super-resolution and cannot handle partial or complete loss or corruption of video frames, which is common on the Internet and wireless networks. To overcome these challenges, we present a novel approach in this paper. Our approach consists of (i) a novel video frame recovery scheme, (ii) a new super-resolution algorithm, and (iii) a receiver enhancement-aware video bit rate adaptation algorithm. We have implemented our approach on an iPhone 12, and it can support 30 frames per second (FPS). We have evaluated our approach in various networks such as WiFi, 3G, 4G, and 5G networks. Our evaluation shows that our approach enables real-time enhancement and results in a significant increase in video QoE (Quality of Experience) of 24\% - 82\% in our video streaming system.
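
The receiver-enhancement-aware adaptation can be illustrated with a toy rate selector that maximizes expected quality after on-device enhancement rather than raw bitrate. All quality numbers and enhancement gains below are made-up placeholders; a real system would measure both.

```python
def pick_bitrate(throughput_kbps, ladder, gain):
    """Choose the feasible bitrate maximizing post-enhancement quality.
    ladder: {bitrate_kbps: base quality score}
    gain:   {bitrate_kbps: quality boost from on-device super-resolution}"""
    feasible = [b for b in ladder if b <= throughput_kbps]
    return max(feasible, key=lambda b: ladder[b] + gain[b])

ladder = {400: 55.0, 1200: 70.0, 3000: 82.0}   # hypothetical VMAF-like scores
gain   = {400: 25.0, 1200: 9.0,  3000: 2.0}    # SR helps low bitrates most
print(pick_bitrate(1500, ladder, gain))        # -> 400: low rate + strong SR wins
```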

Does color modalities affect handwriting recognition? An empirical study on Persian handwritings using convolutional neural networks

  • paper_url: http://arxiv.org/abs/2307.12150
  • repo_url: None
  • paper_authors: Abbas Zohrevand, Zahra Imani, Javad Sadri, Ching Y. Suen
  • for: Investigating whether the color modality of handwritten digits and words affects recognition accuracy or speed.
  • methods: Uses Convolutional Neural Networks (CNNs) as an eye simulator, tested on a novel Persian handwritten database that provides all three color modalities.
  • results: CNNs trained on black-and-white (BW) digit and word images perform better than those trained on the other two color modalities, although overall the accuracy differences are not significant; comparing training times shows that recognition of BW images with a CNN is much more efficient.
    Abstract Most of the methods on handwritten recognition in the literature are focused and evaluated on Black and White (BW) image databases. In this paper we try to answer a fundamental question in document recognition: using Convolutional Neural Networks (CNNs) as an eye simulator, we investigate whether the color modalities of handwritten digits and words affect their recognition accuracy or speed. To the best of our knowledge, so far this question has not been answered due to the lack of handwritten databases that have all three color modalities of handwritings. To answer this question, we selected 13,330 isolated digits and 62,500 words from a novel Persian handwritten database, which have three different color modalities and are unique in terms of size and variety. Our selected datasets are divided into training, validation, and testing sets. Afterwards, similar conventional CNN models are trained with the training samples. While the experimental results on the testing set show that CNN on the BW digit and word images has a higher performance compared to the other two color modalities, in general there are no significant differences for network accuracy in different color modalities. Also, comparisons of training times in three color modalities show that recognition of handwritten digits and words in BW images using CNN is much more efficient.
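
The experimental design can be sketched as training the same conventional CNN for each color modality, varying only the number of input channels, so that any accuracy or training-time difference is attributable to the modality. Layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

def make_cnn(in_channels: int, n_classes: int) -> nn.Sequential:
    """Same conventional CNN for every modality; only the input channel
    count changes (1 for BW/grayscale, 3 for RGB)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.LazyLinear(n_classes),
    )

bw_net, rgb_net = make_cnn(1, 10), make_cnn(3, 10)
print(bw_net(torch.randn(8, 1, 64, 64)).shape)   # torch.Size([8, 10])
print(rgb_net(torch.randn(8, 3, 64, 64)).shape)  # torch.Size([8, 10])
```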

Learned Gridification for Efficient Point Cloud Processing

  • paper_url: http://arxiv.org/abs/2307.14354
  • repo_url: https://github.com/computri/gridifier
  • paper_authors: Putri A. van der Linden, David W. Romero, Erik J. Bekkers
  • for: Transforming point clouds into compact, regular grids so that neighborhood-based operations scale better in memory and time.
  • methods: Proposes learnable gridification as the first step of a point cloud processing pipeline, after which subsequent layers can use operations defined on regular grids, such as Conv3D; for point cloud to point cloud tasks (e.g., segmentation), a learnable de-gridification step at the end of the pipeline maps the compact, regular grid back to its original point cloud form.
  • results: Theoretical and empirical analysis shows that gridified networks scale better in memory and time than networks applied directly to raw point cloud data, while achieving competitive results.
    Abstract Neural operations that rely on neighborhood information are much more expensive when deployed on point clouds than on grid data due to the irregular distances between points in a point cloud. In a grid, on the other hand, we can compute the kernel only once and reuse it for all query positions. As a result, operations that rely on neighborhood information scale much worse for point clouds than for grid data, especially for large inputs and large neighborhoods. In this work, we address the scalability issue of point cloud methods by tackling its root cause: the irregularity of the data. We propose learnable gridification as the first step in a point cloud processing pipeline to transform the point cloud into a compact, regular grid. Thanks to gridification, subsequent layers can use operations defined on regular grids, e.g., Conv3D, which scale much better than native point cloud methods. We then extend gridification to point cloud to point cloud tasks, e.g., segmentation, by adding a learnable de-gridification step at the end of the point cloud processing pipeline to map the compact, regular grid back to its original point cloud form. Through theoretical and empirical analysis, we show that gridified networks scale better in terms of memory and time than networks directly applied on raw point cloud data, while being able to achieve competitive results. Our code is publicly available at https://github.com/computri/gridifier.
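
To see why gridification pays off, here is a non-learned stand-in: scatter-averaging point features into a regular voxel grid so that subsequent layers can apply ordinary Conv3D. The paper's gridifier is learned; this fixed scatter-mean, and the grid size, are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def voxelize(points, feats, grid=16):
    """Average point features into a regular grid^3 voxel grid.
    points: (N, 3) coordinates in [0, 1]; feats: (N, C)
    returns: (1, C, grid, grid, grid) dense feature volume"""
    idx = (points.clamp(0, 1 - 1e-6) * grid).long()
    flat = idx[:, 0] * grid * grid + idx[:, 1] * grid + idx[:, 2]
    c = feats.size(1)
    vox = torch.zeros(grid**3, c).index_add_(0, flat, feats)
    cnt = torch.zeros(grid**3, 1).index_add_(0, flat, torch.ones(len(feats), 1))
    vox = vox / cnt.clamp(min=1)                       # mean per occupied cell
    return vox.T.reshape(1, c, grid, grid, grid)

pts, feats = torch.rand(2048, 3), torch.randn(2048, 8)
grid_feats = voxelize(pts, feats)
out = nn.Conv3d(8, 16, 3, padding=1)(grid_feats)       # regular-grid operation
print(out.shape)  # torch.Size([1, 16, 16, 16, 16])
```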

A Vision for Cleaner Rivers: Harnessing Snapshot Hyperspectral Imaging to Detect Macro-Plastic Litter

  • paper_url: http://arxiv.org/abs/2307.12145
  • repo_url: https://github.com/river-lab/hyperspectral_macro_plastic_detection
  • paper_authors: Nathaniel Hanson, Ahmet Demirkaya, Deniz Erdoğmuş, Aron Stubbins, Taşkın Padır, Tales Imbiriba
  • for: Monitoring mismanaged plastic waste entering rivers, which harms local ecosystems and causes negative ecological and economic impacts.
  • methods: Analyzes the feasibility of macro-plastic litter detection in river-like scenarios using snapshot Visible-Shortwave Infrared hyperspectral imaging combined with machine learning classification, enabling near-real-time tracking of partially submerged plastics.
  • results: Experiments indicate that these imaging strategies achieve high detection accuracy even in challenging scenarios, especially when leveraging hyperspectral data and nonlinear classifiers.
    Abstract Plastic waste entering riverine environments harms local ecosystems, leading to negative ecological and economic impacts. Large parcels of plastic waste are transported from inland to oceans, leading to a global-scale problem of floating debris fields. In this context, efficient and automated monitoring of mismanaged plastic waste is paramount. To address this problem, we analyze the feasibility of macro-plastic litter detection using computational imaging approaches in river-like scenarios. We enable near-real-time tracking of partially submerged plastics by using snapshot Visible-Shortwave Infrared hyperspectral imaging. Our experiments indicate that imaging strategies associated with machine learning classification approaches can lead to high detection accuracy even in challenging scenarios, especially when leveraging hyperspectral data and nonlinear classifiers. All code, data, and models are available online: https://github.com/RIVeR-Lab/hyperspectral_macro_plastic_detection.
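
Pixel-wise hyperspectral classification can be sketched by treating every pixel as a vector of per-band reflectances and fitting a nonlinear classifier such as an RBF-kernel SVM. The cube shape, band count, and random labels below are synthetic stand-ins for the paper's VIS-SWIR data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a hyperspectral cube and its plastic/background mask.
H, W, BANDS = 64, 64, 128
cube = np.random.rand(H, W, BANDS).astype(np.float32)
labels = np.random.randint(0, 2, size=(H, W))           # 1 = plastic

# Each pixel becomes one training vector of per-band reflectances.
X = cube.reshape(-1, BANDS)
y = labels.reshape(-1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel='rbf', gamma='scale').fit(X_tr, y_tr)  # nonlinear classifier
print(f'pixel accuracy: {clf.score(X_te, y_te):.3f}')   # ~0.5 on random noise
```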

SCPAT-GAN: Structural Constrained and Pathology Aware Convolutional Transformer-GAN for Virtual Histology Staining of Human Coronary OCT images

  • paper_url: http://arxiv.org/abs/2307.12138
  • repo_url: None
  • paper_authors: Xueshen Li, Hongshan Liu, Xiaoyu Song, Brigitta C. Brott, Silvio H. Litovsky, Yu Gan
  • for: Providing virtual histological information from coronary optical coherence tomography (OCT) images to better guide the diagnosis and treatment of coronary artery disease.
  • methods: Proposes SCPAT-GAN, a structural constrained, pathology aware, transformer generative adversarial network that generates virtual stained H&E histology from OCT images.
  • results: Generates virtual histological information without requiring a large pixel-wise paired training dataset, and imposes pathological guidance on structural layers via a transformer-based design.
    Abstract There is a significant need for the generation of virtual histological information from coronary optical coherence tomography (OCT) images to better guide the treatment of coronary artery disease. However, existing methods either require a large pixel-wise paired training dataset or have limited capability to map pathological regions. To address these issues, we proposed a structural constrained, pathology aware, transformer generative adversarial network, namely SCPAT-GAN, to generate virtual stained H&E histology from OCT images. The proposed SCPAT-GAN advances existing methods via a novel design to impose pathological guidance on structural layers using a transformer-based network.

Improving temperature estimation in low-cost infrared cameras using deep neural networks

  • paper_url: http://arxiv.org/abs/2307.12130
  • repo_url: None
  • paper_authors: Navot Oz, Nir Sochen, David Mendelovich, Iftach Klapp
  • for: Improving the temperature accuracy of low-cost thermal cameras and rectifying the space-variant nonuniformity across their detectors.
  • methods: Develops a nonuniformity simulator that accounts for the ambient temperature, and an end-to-end neural network that estimates the object's temperature and corrects the nonuniformity from a single image plus the ambient temperature measured by the camera itself.
  • results: Lowers the mean temperature error by approximately $1^\circ C$ compared to previous works; applying a physical constraint on the network lowers the error by an additional $4\%$. The mean temperature error over an extensive validation dataset is $0.37^\circ C$, with equivalent results on real field data.
    Abstract Low-cost thermal cameras are inaccurate (usually $\pm 3^\circ C$) and have space-variant nonuniformity across their detector. Both inaccuracy and nonuniformity are dependent on the ambient temperature of the camera. The main goal of this work was to improve the temperature accuracy of low-cost cameras and rectify the nonuniformity. A nonuniformity simulator that accounts for the ambient temperature was developed. An end-to-end neural network that incorporates the ambient temperature at image acquisition was introduced. The neural network was trained with the simulated nonuniformity data to estimate the object's temperature and correct the nonuniformity, using only a single image and the ambient temperature measured by the camera itself. Results show that the proposed method lowered the mean temperature error by approximately $1^\circ C$ compared to previous works. In addition, applying a physical constraint on the network lowered the error by an additional $4\%$. The mean temperature error over an extensive validation dataset was $0.37^\circ C$. The method was verified on real data in the field and produced equivalent results.
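
Conditioning the correction network on ambient temperature can be sketched by broadcasting the scalar reading into an extra input channel next to the raw thermal frame. The layer sizes and this conditioning scheme are assumptions; the paper's architecture and physical constraint are not reproduced here.

```python
import torch
import torch.nn as nn

class AmbientConditionedNet(nn.Module):
    """Correct a thermal frame given the camera's own ambient reading."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, frame, t_ambient):
        # frame: (B, 1, H, W) raw intensities; t_ambient: (B,) in Celsius,
        # broadcast to a constant map and stacked as a second channel.
        t_map = t_ambient.view(-1, 1, 1, 1).expand_as(frame)
        return self.net(torch.cat([frame, t_map], dim=1))

net = AmbientConditionedNet()
out = net(torch.randn(4, 1, 120, 160), torch.full((4,), 25.0))
print(out.shape)  # torch.Size([4, 1, 120, 160]) corrected temperatures
```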

InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing

  • paper_url: http://arxiv.org/abs/2308.00135
  • repo_url: https://github.com/infusion-zero-edit/InFusion
  • paper_authors: Anant Khandelwal
  • for: A framework for zero-shot, text-based video editing with temporal consistency, requiring no training, built on large pre-trained text-to-image diffusion models.
  • methods: Injects the difference between features obtained with source and edit prompts from U-Net residual blocks of the decoder layers; combined with injected attention features, this supports editing multiple concepts with pixel-level control, while mask extraction and attention fusion cut the edited part from the source and paste it into the denoising pipeline for fine-grained control.
  • results: Produces high-quality, temporally consistent video edits, demonstrated with a generalised image model (Stable Diffusion v1.5) using LoRA, and is compatible with existing image diffusion techniques.
    Abstract Large text-to-image diffusion models have achieved remarkable success in generating diverse, high-quality images. Additionally, these models have been successfully leveraged to edit input images by just changing the text prompt. But when these models are applied to videos, the main challenge is to ensure temporal consistency and coherence across frames. In this paper, we propose InFusion, a framework for zero-shot text-based video editing leveraging large pre-trained image diffusion models. Our framework specifically supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt. Specifically, we inject the difference in features obtained with source and edit prompts from U-Net residual blocks of decoder layers. When these are combined with injected attention features, it becomes feasible to query the source contents and scale edited concepts along with the injection of unedited parts. The editing is further controlled in a fine-grained manner with mask extraction and attention fusion, which cut the edited part from the source and paste it into the denoising pipeline for the editing prompt. Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training. We demonstrated complex concept editing with a generalised image model (Stable Diffusion v1.5) using LoRA. Adaptation is compatible with all the existing image diffusion techniques. Extensive experimental results demonstrate the effectiveness of existing methods in rendering high-quality and temporally consistent videos.
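
The cut-and-paste step behind mask extraction and attention fusion can be illustrated as a masked blend of source and edited features, so unedited regions pass through the denoising pipeline untouched. Feature shapes and where in the U-Net this applies are assumptions, not InFusion's exact mechanism.

```python
import torch

def mask_attention_fusion(src_feat, edit_feat, mask):
    """Keep source features where the edit mask is off and edited features
    where it is on, preserving unedited regions through denoising.
    src_feat/edit_feat: (B, C, H, W); mask: (B, 1, H, W) in {0, 1}"""
    return mask * edit_feat + (1 - mask) * src_feat

fused = mask_attention_fusion(torch.randn(1, 320, 64, 64),
                              torch.randn(1, 320, 64, 64),
                              (torch.rand(1, 1, 64, 64) > 0.5).float())
print(fused.shape)  # torch.Size([1, 320, 64, 64])
```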

Synthesis of Batik Motifs using a Diffusion – Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2307.12122
  • repo_url: https://github.com/octadion/diffusion-stylegan2-ada-pytorch
  • paper_authors: One Octadion, Novanto Yudistira, Diva Kurnianingtyas
  • for: Helping batik designers or craftsmen produce unique, high-quality batik motifs with efficient production time and costs.
  • methods: Uses StyleGAN2-Ada, a GAN variant that separates the style and content aspects of an image, together with diffusion techniques that introduce random noise into the data, plus adjustments to the model architecture and a well-curated batik dataset.
  • results: Based on qualitative and quantitative evaluations, the tested model produces authentic, high-quality batik patterns with finer details and rich artistic variation.
    Abstract Batik, a unique blend of art and craftsmanship, is a distinct artistic and technological creation for Indonesian society. Research on batik motifs is primarily focused on classification. However, further studies may extend to the synthesis of batik patterns. Generative Adversarial Networks (GANs) have been an important deep learning model for generating synthetic data, but often face challenges in the stability and consistency of results. This research focuses on the use of StyleGAN2-Ada and Diffusion techniques to produce realistic and high-quality synthetic batik patterns. StyleGAN2-Ada is a variation of the GAN model that separates the style and content aspects in an image, whereas diffusion techniques introduce random noise into the data. In the context of batik, StyleGAN2-Ada and Diffusion are used to produce realistic synthetic batik patterns. This study also made adjustments to the model architecture and used a well-curated batik dataset. The main goal is to assist batik designers or craftsmen in producing unique and quality batik motifs with efficient production time and costs. Based on qualitative and quantitative evaluations, the results show that the model tested is capable of producing authentic and quality batik patterns, with finer details and rich artistic variations. The dataset and code can be accessed here:https://github.com/octadion/diffusion-stylegan2-ada-pytorch

Pyramid Semantic Graph-based Global Point Cloud Registration with Low Overlap

  • paper_url: http://arxiv.org/abs/2307.12116
  • repo_url: https://github.com/hkust-aerial-robotics/pagor
  • paper_authors: Zhijian Qiao, Zehuan Yu, Huan Yin, Shaojie Shen
  • for: Global point cloud registration, essential for robotics tasks such as loop closing and relocalization, where registration often suffers from low overlap between point clouds due to occlusion and viewpoint change.
  • methods: A graph-theoretic framework: a consistency graph enables robust data association and graduated non-convexity (GNC) provides reliable pose estimation; semantic cues scale down the dense point clouds, a pyramid graph with multi-level consistency thresholds resolves threshold ambiguity, a cascaded gradient ascent method solves the resulting densest clique problems, and fast geometric verification selects the optimal estimate among the pose candidates.
  • results: Experiments on a self-collected indoor dataset and the public KITTI dataset show the highest success rate despite low point cloud overlap and low semantic quality; code is open-sourced at https://github.com/HKUST-Aerial-Robotics/Pagor.
    Abstract Global point cloud registration is essential in many robotics tasks like loop closing and relocalization. Unfortunately, the registration often suffers from the low overlap between point clouds, a frequent occurrence in practical applications due to occlusion and viewpoint change. In this paper, we propose a graph-theoretic framework to address the problem of global point cloud registration with low overlap. To this end, we construct a consistency graph to facilitate robust data association and employ graduated non-convexity (GNC) for reliable pose estimation, following the state-of-the-art (SoTA) methods. Unlike previous approaches, we use semantic cues to scale down the dense point clouds, thus reducing the problem size. Moreover, we address the ambiguity arising from the consistency threshold by constructing a pyramid graph with multi-level consistency thresholds. Then we propose a cascaded gradient ascent method to solve the resulting densest clique problem and obtain multiple pose candidates for every consistency threshold. Finally, fast geometric verification is employed to select the optimal estimation from multiple pose candidates. Our experiments, conducted on a self-collected indoor dataset and the public KITTI dataset, demonstrate that our method achieves the highest success rate despite the low overlap of point clouds and low semantic quality. We have open-sourced our code https://github.com/HKUST-Aerial-Robotics/Pagor for this project.
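
The consistency-graph idea at a single threshold level can be sketched with networkx: nodes are putative correspondences, edges link pairs whose inter-point distances agree (rigid motions preserve distances), and the mutually consistent inliers form a large clique. Pagor builds a pyramid of such thresholds and uses a cascaded solver; the brute-force clique search below is for illustration only.

```python
import numpy as np
import networkx as nx

def consistency_graph(src, dst, tau=0.1):
    """One node per putative correspondence (src[i] <-> dst[i]); an edge
    joins two correspondences whose pairwise distances agree within tau."""
    g = nx.Graph()
    g.add_nodes_from(range(len(src)))
    for i in range(len(src)):
        for j in range(i + 1, len(src)):
            d_src = np.linalg.norm(src[i] - src[j])
            d_dst = np.linalg.norm(dst[i] - dst[j])
            if abs(d_src - d_dst) < tau:       # rigid motion preserves distances
                g.add_edge(i, j)
    return g

rng = np.random.default_rng(0)
src = rng.random((30, 3))
dst = src + np.array([1.0, 0.0, 0.0])          # pure translation: consistent
dst[:10] = rng.random((10, 3)) + 2.0           # 10 outlier correspondences
inliers = max(nx.find_cliques(consistency_graph(src, dst)), key=len)
print(sorted(inliers))                          # mostly indices 10..29
```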