cs.CV - 2023-08-03

An End-to-end Food Portion Estimation Framework Based on Shape Reconstruction from Monocular Image

  • paper_url: http://arxiv.org/abs/2308.01810
  • repo_url: None
  • paper_authors: Zeman Shao, Gautham Vinod, Jiangpeng He, Fengqing Zhu
  • for: This work aims to provide an automated dietary assessment solution by using deep learning to estimate food energy values.
  • methods: An end-to-end deep learning framework infers the 3D shape of the food from a food image in order to estimate its energy value.
  • results: Evaluated on the Nutrition5k food image dataset, the method achieves a Mean Absolute Error (MAE) of 40.05 kCal and a Mean Absolute Percentage Error (MAPE) of 11.47%. It uses only RGB images as input and is competitive with existing methods that require both RGB and depth information.
    Abstract Dietary assessment is a key contributor to monitoring health status. Existing self-report methods are tedious and time-consuming with substantial biases and errors. Image-based food portion estimation aims to estimate food energy values directly from food images, showing great potential for automated dietary assessment solutions. Existing image-based methods either use a single-view image or incorporate multi-view images and depth information to estimate the food energy, which either has limited performance or creates user burdens. In this paper, we propose an end-to-end deep learning framework for food energy estimation from a monocular image through 3D shape reconstruction. We leverage a generative model to reconstruct the voxel representation of the food object from the input image to recover the missing 3D information. Our method is evaluated on the publicly available food image dataset Nutrition5k, resulting in a Mean Absolute Error (MAE) of 40.05 kCal and a Mean Absolute Percentage Error (MAPE) of 11.47% for food energy estimation. Our method uses the RGB image as the only input at the inference stage and achieves competitive results compared to the existing method requiring both RGB and depth information.

QUEST: Query Stream for Vehicle-Infrastructure Cooperative Perception

  • paper_url: http://arxiv.org/abs/2308.01804
  • repo_url: None
  • paper_authors: Siqi Fan, Haibao Yu, Wenxian Yang, Jirui Yuan, Zaiqing Nie
  • for: This paper proposes a cooperative perception framework, named QUEST, to enable interpretable, instance-level flexible feature interaction.
  • methods: Building on existing cooperation paradigms such as result cooperation and feature cooperation, the paper proposes a new query cooperation paradigm in which query streams flow among agents for interaction.
  • results: Experiments show that QUEST effectively improves cooperative perception performance, and in a practical application scenario (camera-based vehicle-infrastructure perception) it offers better transmission flexibility and robustness to packet dropout.
    Abstract Cooperative perception can effectively enhance individual perception performance by providing additional viewpoints and expanding the sensing field. Existing cooperation paradigms are either interpretable (result cooperation) or flexible (feature cooperation). In this paper, we propose the concept of query cooperation to enable interpretable instance-level flexible feature interaction. To specifically explain the concept, we propose a cooperative perception framework, termed QUEST, which lets query streams flow among agents. The cross-agent queries interact via fusion for co-aware instances and via complementation for individually unaware instances. Taking camera-based vehicle-infrastructure perception as a typical practical application scene, the experimental results on the real-world dataset, DAIR-V2X-Seq, demonstrate the effectiveness of QUEST and further reveal the advantage of the query cooperation paradigm in transmission flexibility and robustness to packet dropout. We hope our work can further facilitate cross-agent representation interaction for better cooperative perception in practice.

RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension

  • paper_url: http://arxiv.org/abs/2308.02299
  • repo_url: https://github.com/mightyzau/regionblip
  • paper_authors: Qiang Zhou, Chaohui Yu, Shaofeng Zhang, Sitong Wu, Zhibing Wang, Fan Wang
  • for: This work aims to extend the comprehension of multi-modal large language models (MLLMs) to regional objects.
  • methods: Regional features are extracted as soft prompts for the LLM, so that no MLLM fine-tuning is needed; a novel position-assisted feature extraction module is proposed to effectively extract regional features from regular image features and irregular point cloud features.
  • results: Experiments show that the RegionBLIP framework preserves the image comprehension capability of BLIP-2 while gaining comprehension of the newly introduced point cloud modality and of regional objects.
    Abstract In this work, we investigate extending the comprehension of Multi-modal Large Language Models (MLLMs) to regional objects. To this end, we propose to extract features corresponding to regional objects as soft prompts for the LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning. To effectively extract regional features from regular image features and irregular point cloud features, we present a novel and unified position-assisted feature extraction module. Furthermore, training an MLLM from scratch is highly time-consuming. Thus, we propose incrementally extending existing pre-trained MLLMs to comprehend more modalities and the regional objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2, an impressive MLLM, and optimize the modality-specific LoRA parameters in the Q-Former and LLM for each newly introduced modality. The freezing of the Q-Former eliminates the need for extensive pre-training on massive image-text data. The frozen Q-Former pre-trained on massive image-text data is also beneficial for the pre-training on image-region-text data. We name our framework RegionBLIP. We pre-train RegionBLIP on image-region-text, point-cloud-text, and point-cloud-region-text data. Experimental results verify that RegionBLIP can preserve the image comprehension capability of BLIP-2 and further gain a comprehension of the newly introduced point cloud modality and regional objects. The Data, Code, and Pre-trained models will be available at https://github.com/mightyzau/RegionBLIP.
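
The "modality-specific LoRA parameters" mentioned above refer to low-rank adaptation, in which small trainable matrices are attached to frozen linear layers. A minimal generic sketch of such a layer follows; the rank and scaling are illustrative choices, not the paper's settings, and this is not the authors' implementation.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update: y = Wx + (alpha/r) * B(A x)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # keep the pre-trained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # the update starts at zero, preserving the base model
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# toy usage: wrap a projection layer and train only the LoRA parameters
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = [p for p in layer.parameters() if p.requires_grad]   # only lora_a / lora_b
out = layer(torch.randn(2, 768))
```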

Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport

  • paper_url: http://arxiv.org/abs/2308.01779
  • repo_url: https://github.com/liwentomng/point2mask
  • paper_authors: Wentong Li, Yuqian Yuan, Song Wang, Jianke Zhu, Jianshu Li, Jian Liu, Lei Zhang
  • for: Weakly-supervised panoptic image segmentation that avoids costly pixel-wise annotation.
  • methods: Proposes an effective method, Point2Mask, which achieves high-quality panoptic prediction using only a single random point annotation per target.
  • results: Experiments on Pascal VOC and COCO show that Point2Mask achieves strong panoptic prediction performance without requiring dense annotation.
    Abstract Weakly-supervised image segmentation has recently attracted increasing research attention, aiming to avoid the expensive pixel-wise labeling. In this paper, we present an effective method, namely Point2Mask, to achieve high-quality panoptic prediction using only a single random point annotation per target for training. Specifically, we formulate the panoptic pseudo-mask generation as an Optimal Transport (OT) problem, where each ground-truth (gt) point label and pixel sample are defined as the label supplier and consumer, respectively. The transportation cost is calculated by the introduced task-oriented maps, which focus on the category-wise and instance-wise differences among the various thing and stuff targets. Furthermore, a centroid-based scheme is proposed to set the accurate unit number for each gt point supplier. Hence, the pseudo-mask generation is converted into finding the optimal transport plan at a globally minimal transportation cost, which can be solved via the Sinkhorn-Knopp Iteration. Experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed Point2Mask approach to point-supervised panoptic segmentation. Source code is available at: https://github.com/LiWentomng/Point2Mask.
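
The optimal transport plan in Point2Mask is solved with the Sinkhorn-Knopp iteration, a standard algorithm for entropic OT. The sketch below illustrates the iteration itself; the cost matrix, supply, and demand are toy placeholders rather than the paper's task-oriented maps.

```python
import numpy as np

def sinkhorn_knopp(cost, supply, demand, eps=0.05, n_iters=200):
    """Entropic optimal transport via Sinkhorn-Knopp scaling iterations.

    cost:   (m, n) transportation cost matrix
    supply: (m,) mass available at each supplier (e.g. gt point labels)
    demand: (n,) mass required by each consumer (e.g. pixel samples)
    Returns the (m, n) transport plan.
    """
    K = np.exp(-cost / eps)               # Gibbs kernel
    u = np.ones_like(supply)
    v = np.ones_like(demand)
    for _ in range(n_iters):              # alternating row/column scaling
        u = supply / (K @ v)
        v = demand / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

# toy example: 3 point suppliers, 100 pixel consumers with balanced total mass
rng = np.random.default_rng(0)
cost = rng.random((3, 100))
plan = sinkhorn_knopp(cost, supply=np.full(3, 100 / 3), demand=np.ones(100))
print(plan.sum(axis=0)[:5])               # each pixel receives ~1 unit of label mass
```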

Deep Learning-based Prediction of Stress and Strain Maps in Arterial Walls for Improved Cardiovascular Risk Assessment

  • paper_url: http://arxiv.org/abs/2308.01771
  • repo_url: None
  • paper_authors: Yasin Shokrollahi, Pengfei Dong, Xianqi Li, Linxia Gu
  • for: This study aims to replace the finite element method (FEM) with deep learning tools to predict stress and strain fields in arterial walls more efficiently.
  • methods: A U-Net based fully convolutional neural network (CNN) is proposed to predict the von Mises stress and strain fields in arterial wall cross-sections, and a conditional generative adversarial network (cGAN) is developed to further improve prediction accuracy.
  • results: The models predict the von Mises stress and strain fields with high accuracy, with SSIM scores of 0.854 and 0.830 and mean squared errors of 0.017 and 0.018, respectively; ensemble and transfer learning techniques are also employed to further improve model performance.
    Abstract This study investigated the potential of end-to-end deep learning tools as a more effective substitute for FEM in predicting stress-strain fields within 2D cross sections of arterial wall. We first proposed a U-Net based fully convolutional neural network (CNN) to predict the von Mises stress and strain distribution based on the spatial arrangement of calcification within arterial wall cross-sections. Further, we developed a conditional generative adversarial network (cGAN) to enhance, particularly from the perceptual perspective, the prediction accuracy of stress and strain field maps for arterial walls with various calcification quantities and spatial configurations. On top of U-Net and cGAN, we also proposed their ensemble approaches, respectively, to further improve the prediction accuracy of field maps. Our dataset, consisting of input and output images, was generated by implementing boundary conditions and extracting stress-strain field maps. The trained U-Net models can accurately predict von Mises stress and strain fields, with structural similarity index scores (SSIM) of 0.854 and 0.830 and mean squared errors of 0.017 and 0.018 for stress and strain, respectively, on a reserved test set. Meanwhile, the cGAN models in a combination of ensemble and transfer learning techniques demonstrate high accuracy in predicting von Mises stress and strain fields, as evidenced by SSIM scores of 0.890 for stress and 0.803 for strain. Additionally, mean squared errors of 0.008 for stress and 0.017 for strain further support the model's performance on a designated test set. Overall, this study developed a surrogate model for finite element analysis, which can accurately and efficiently predict stress-strain fields of arterial walls regardless of complex geometries and boundary conditions.
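
The SSIM and mean-squared-error figures reported above are standard image-similarity metrics. A minimal evaluation sketch using scikit-image is given below; the arrays are random placeholders standing in for the predicted and FEM ground-truth field maps.

```python
import numpy as np
from skimage.metrics import structural_similarity, mean_squared_error

# placeholders for a ground-truth FEM field map and a network prediction,
# both assumed to be normalized to [0, 1]
rng = np.random.default_rng(0)
stress_true = rng.random((256, 256))
stress_pred = np.clip(stress_true + 0.05 * rng.standard_normal((256, 256)), 0, 1)

ssim = structural_similarity(stress_true, stress_pred, data_range=1.0)
mse = mean_squared_error(stress_true, stress_pred)
print(f"SSIM = {ssim:.3f}, MSE = {mse:.4f}")
```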

Focus on Content not Noise: Improving Image Generation for Nuclei Segmentation by Suppressing Steganography in CycleGAN

  • paper_url: http://arxiv.org/abs/2308.01769
  • repo_url: None
  • paper_authors: Jonas Utz, Tobias Weise, Maja Schlereth, Fabian Wagner, Mareike Thies, Mingxuan Gu, Stefan Uderhardt, Katharina Breininger
  • for: This paper describes a generative network for synthesizing microscopy images from masks, providing more faithful synthetic datasets for nuclei segmentation.
  • methods: A CycleGAN generator is used, and hidden shortcut information (steganography) is removed from the generated images with a DCT-based low-pass filter, improving the consistency between generated images and cycled masks.
  • results: Compared to a vanilla CycleGAN, the method improves the F1-score of a downstream nuclei segmentation task by 5.4 percentage points. The study further suggests that integrating advanced regularization techniques into the CycleGAN architecture may mitigate steganography-related issues and produce more accurate synthetic datasets.
    Abstract Annotating nuclei in microscopy images for the training of neural networks is a laborious task that requires expert knowledge and suffers from inter- and intra-rater variability, especially in fluorescence microscopy. Generative networks such as CycleGAN can inverse the process and generate synthetic microscopy images for a given mask, thereby building a synthetic dataset. However, past works report content inconsistencies between the mask and generated image, partially due to CycleGAN minimizing its loss by hiding shortcut information for the image reconstruction in high frequencies rather than encoding the desired image content and learning the target task. In this work, we propose to remove the hidden shortcut information, called steganography, from generated images by employing a low pass filtering based on the DCT. We show that this increases coherence between generated images and cycled masks and evaluate synthetic datasets on a downstream nuclei segmentation task. Here we achieve an improvement of 5.4 percentage points in the F1-score compared to a vanilla CycleGAN. Integrating advanced regularization techniques into the CycleGAN architecture may help mitigate steganography-related issues and produce more accurate synthetic datasets for nuclei segmentation.
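
The DCT-based low-pass filtering used to suppress the high-frequency shortcut signal can be sketched with SciPy as follows; the cutoff fraction is an illustrative choice, not the paper's setting.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_lowpass(image, keep_fraction=0.25):
    """Suppress high-frequency content of a 2D image via the DCT.

    Coefficients outside the lowest `keep_fraction` of frequencies in each
    dimension are zeroed, removing the band where shortcut information hides.
    """
    coeffs = dctn(image, norm="ortho")
    h, w = image.shape
    mask = np.zeros_like(coeffs)
    mask[: int(h * keep_fraction), : int(w * keep_fraction)] = 1.0
    return idctn(coeffs * mask, norm="ortho")

# toy usage on a random single-channel image
img = np.random.default_rng(0).random((128, 128))
filtered = dct_lowpass(img, keep_fraction=0.25)
```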

Multidimensional Data Analysis Based on Block Convolutional Tensor Decomposition

  • paper_url: http://arxiv.org/abs/2308.01768
  • repo_url: None
  • paper_authors: Mahdi Molavi, Mansoor Rezghi, Tayyebeh Saeedi
  • for: This paper focuses on developing a new tensor-tensor product called the $\star_c{}\text{-Product}$, based on block convolution with reflective boundary conditions, and using it to improve tensor decomposition for analyzing high-dimensional data.
  • methods: The paper builds on the t-product of tensors and block convolution with reflective boundary conditions to define the $\star_c{}\text{-Product}$, and introduces a tensor decomposition based on this product for arbitrary-order tensors.
  • results: The proposed $\star_c{}\text{-Product}$ has lower complexity than t-SVD and yields higher-quality results in applications such as classification and compression.
    Abstract Tensor decompositions are powerful tools for analyzing multi-dimensional data in their original format. Besides tensor decompositions like Tucker and CP, Tensor SVD (t-SVD), which is based on the t-product of tensors, is another extension of SVD to tensors that was recently developed and has found numerous applications in analyzing high-dimensional data. This paper offers a new insight into the t-product and shows that this product is a block convolution of two tensors with periodic boundary conditions. Based on this viewpoint, we propose a new tensor-tensor product called the $\star_c{}\text{-Product}$ based on block convolution with reflective boundary conditions. Using a tensor framework, this product can be easily extended to tensors of arbitrary order. Additionally, we introduce a tensor decomposition based on our $\star_c{}\text{-Product}$ for arbitrary-order tensors. Compared to t-SVD, our new decomposition has lower complexity, and experiments show that it yields higher-quality results in applications such as classification and compression.
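
For context, the classical t-product that the $\star_c{}\text{-Product}$ generalizes is a block circular convolution along the third mode, computed as facewise matrix products in the FFT domain. The sketch below shows that baseline; the paper's variant would replace the circular (FFT) transform with a reflective-boundary, DCT-like one, and this is an illustration rather than the authors' code.

```python
import numpy as np

def t_product(A, B):
    """Classical t-product of third-order tensors (block circular convolution).

    A: (n1, n2, n3), B: (n2, n4, n3)  ->  C: (n1, n4, n3)
    Computed as frontal-slice matrix products in the FFT domain along mode 3.
    """
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.einsum("ijk,jlk->ilk", Af, Bf)    # multiply matching frequency slices
    return np.real(np.fft.ifft(Cf, axis=2))

# toy example
rng = np.random.default_rng(0)
A = rng.random((4, 3, 5))
B = rng.random((3, 2, 5))
C = t_product(A, B)
print(C.shape)   # (4, 2, 5)
```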

PoissonNet: Resolution-Agnostic 3D Shape Reconstruction using Fourier Neural Operators

  • paper_url: http://arxiv.org/abs/2308.01766
  • repo_url: https://github.com/arsenal9971/poissonnet
  • paper_authors: Hector Andrade-Loarca, Julius Hege, Aras Bacho, Gitta Kutyniok
  • for: This paper addresses reconstructing 3D shapes from point clouds, where traditional deep neural networks face computational complexity issues at higher resolutions.
  • methods: The authors use a Fourier Neural Operator (FNO) to solve the Poisson equation and reconstruct a mesh from oriented point cloud measurements.
  • results: The method surpasses existing approaches in reconstruction quality, running time, and resolution flexibility, while also supporting one-shot super-resolution and being differentiable.
    Abstract We introduce PoissonNet, an architecture for shape reconstruction that addresses the challenge of recovering 3D shapes from points. Traditional deep neural networks face challenges with common 3D shape discretization techniques due to their computational complexity at higher resolutions. To overcome this, we leverage Fourier Neural Operators (FNOs) to solve the Poisson equation and reconstruct a mesh from oriented point cloud measurements. PoissonNet exhibits two main advantages. First, it enables efficient training on low-resolution data while achieving comparable performance at high-resolution evaluation, thanks to the resolution-agnostic nature of FNOs. This feature allows for one-shot super-resolution. Second, our method surpasses existing approaches in reconstruction quality while being differentiable. Overall, our proposed method not only improves upon the limitations of classical deep neural networks in shape reconstruction but also achieves superior results in terms of reconstruction quality, running time, and resolution flexibility. Furthermore, we demonstrate that the Poisson surface reconstruction problem is well-posed in the limit case by showing a universal approximation theorem for the solution operator of the Poisson equation with distributional data utilizing the Fourier Neural Operator, which provides a theoretical foundation for our numerical results. The code to reproduce the experiments is available on: \url{https://github.com/arsenal9971/PoissonNet}.
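
The PDE that the Fourier Neural Operator learns to solve above is the Poisson equation. For intuition, a classical FFT-based spectral solver on a periodic 2D grid is sketched below; this is the textbook method, not the paper's neural operator, and the right-hand side is a random placeholder.

```python
import numpy as np

def solve_poisson_periodic(f, length=1.0):
    """Solve -Laplacian(u) = f on a periodic 2D grid via the FFT.

    f: (n, n) right-hand side (e.g. the divergence of an oriented-normal field).
    Returns the zero-mean solution u.
    """
    n = f.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                       # avoid division by zero for the mean mode
    u_hat = np.fft.fft2(f) / k2
    u_hat[0, 0] = 0.0                    # fix the free constant (zero-mean solution)
    return np.real(np.fft.ifft2(u_hat))

# toy usage
f = np.random.default_rng(0).standard_normal((64, 64))
f -= f.mean()                            # solvability condition on the periodic domain
u = solve_poisson_periodic(f)
```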

NuInsSeg: A Fully Annotated Dataset for Nuclei Instance Segmentation in H&E-Stained Histological Images

  • paper_url: http://arxiv.org/abs/2308.01760
  • repo_url: https://github.com/masih4/nuinsseg
  • paper_authors: Amirreza Mahbod, Christine Polak, Katharina Feldmann, Rumsha Khan, Katharina Gelles, Georg Dorffner, Ramona Woitek, Sepideh Hatamikia, Isabella Ellinger
  • for: This paper is written for the task of automatic nuclei instance segmentation in whole slide image analysis, specifically using supervised deep learning methods.
  • methods: The paper uses a fully manually annotated dataset called NuInsSeg, which contains 665 image patches with over 30,000 manually segmented nuclei from 31 human and mouse organs. Additionally, the paper provides ambiguous area masks for the entire dataset, which represent parts of the images where precise manual annotations are impossible.
  • results: The paper releases one of the biggest fully manually annotated datasets of nuclei in Hematoxylin and Eosin (H&E)-stained histological images, called NuInsSeg, which can be used to train and evaluate supervised deep learning models for nuclei instance segmentation.
    Abstract In computational pathology, automatic nuclei instance segmentation plays an essential role in whole slide image analysis. While many computerized approaches have been proposed for this task, supervised deep learning (DL) methods have shown superior segmentation performances compared to classical machine learning and image processing techniques. However, these models need fully annotated datasets for training which is challenging to acquire, especially in the medical domain. In this work, we release one of the biggest fully manually annotated datasets of nuclei in Hematoxylin and Eosin (H&E)-stained histological images, called NuInsSeg. This dataset contains 665 image patches with more than 30,000 manually segmented nuclei from 31 human and mouse organs. Moreover, for the first time, we provide additional ambiguous area masks for the entire dataset. These vague areas represent the parts of the images where precise and deterministic manual annotations are impossible, even for human experts. The dataset and detailed step-by-step instructions to generate related segmentation masks are publicly available at https://www.kaggle.com/datasets/ipateam/nuinsseg and https://github.com/masih4/NuInsSeg, respectively.

Neural Collapse Terminus: A Unified Solution for Class Incremental Learning and Its Variants

  • paper_url: http://arxiv.org/abs/2308.01746
  • repo_url: https://github.com/neuralcollapseapplications/unicil
  • paper_authors: Yibo Yang, Haobo Yuan, Xiangtai Li, Jianlong Wu, Lefei Zhang, Zhouchen Lin, Philip Torr, Dacheng Tao, Bernard Ghanem
  • for: To handle class incremental learning (CIL), long-tail class incremental learning (LTCIL), and few-shot class incremental learning (FSCIL) within a single framework.
  • methods: Propose a unified solution called neural collapse terminus, which is a fixed structure with the maximal equiangular inter-class separation for the whole label space. Also, propose a prototype evolving scheme to drive the backbone features into the neural collapse terminus smoothly.
  • results: The method is effective in all three tasks and can handle data imbalance and data scarcity. Theoretical analysis indicates that the method holds the neural collapse optimality in an incremental fashion. Extensive experiments with multiple datasets demonstrate the effectiveness of the unified solution and the generalized case.
    Abstract How to enable learnability for new classes while keeping the capability well on old classes has been a crucial challenge for class incremental learning. Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning are also proposed to consider the data imbalance and data scarcity, respectively, which are common in real-world implementations and further exacerbate the well-known problem of catastrophic forgetting. Existing methods are specifically proposed for one of the three tasks. In this paper, we offer a unified solution to the misalignment dilemma in the three tasks. Concretely, we propose neural collapse terminus that is a fixed structure with the maximal equiangular inter-class separation for the whole label space. It serves as a consistent target throughout the incremental training to avoid dividing the feature space incrementally. For CIL and LTCIL, we further propose a prototype evolving scheme to drive the backbone features into our neural collapse terminus smoothly. Our method also works for FSCIL with only minor adaptations. Theoretical analysis indicates that our method holds the neural collapse optimality in an incremental fashion regardless of data imbalance or data scarcity. We also design a generalized case where we do not know the total number of classes and whether the data distribution is normal, long-tail, or few-shot for each coming session, to test the generalizability of our method. Extensive experiments with multiple datasets are conducted to demonstrate the effectiveness of our unified solution to all the three tasks and the generalized case.
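
The "fixed structure with the maximal equiangular inter-class separation" described above corresponds to the simplex equiangular tight frame (ETF) from the neural-collapse literature. A minimal construction sketch follows; the dimensions are illustrative.

```python
import numpy as np

def simplex_etf(num_classes, feat_dim, seed=0):
    """Construct a K-class simplex equiangular tight frame in R^d (d >= K).

    Columns are unit-norm class prototypes whose pairwise cosine similarity is
    -1/(K-1), the maximal equiangular separation achievable for K classes.
    """
    K = num_classes
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((feat_dim, K)))   # orthonormal columns
    M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
    return M

proto = simplex_etf(num_classes=10, feat_dim=64)
cos = proto.T @ proto
print(round(cos[0, 1], 3))   # approximately -1/9 for K = 10
```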

Enhancing Visibility in Nighttime Haze Images Using Guided APSF and Gradient Adaptive Convolution

  • paper_url: http://arxiv.org/abs/2308.01738
  • repo_url: https://github.com/jinyeying/nighttime_dehaze
  • paper_authors: Yeying Jin, Beibei Lin, Wending Yan, Wei Ye, Yuan Yuan, Robby T. Tan
  • for: Improving visibility in nighttime haze scenes by suppressing glow and enhancing low-light regions.
  • methods: A light-source-aware network detects light sources in night images, followed by APSF-guided glow rendering to suppress glow effects; gradient-adaptive convolution captures edges and textures in hazy scenes to enhance contrast, and a learned attention map combined with gamma correction boosts low-light regions.
  • results: Extensive evaluation on real nighttime haze images achieves a PSNR of 30.38 dB, outperforming prior methods by 13%. Data and code are available at: \url{https://github.com/jinyeying/nighttime_dehaze}.
    Abstract Visibility in hazy nighttime scenes is frequently reduced by multiple factors, including low light, intense glow, light scattering, and the presence of multicolored light sources. Existing nighttime dehazing methods often struggle with handling glow or low-light conditions, resulting in either excessively dark visuals or unsuppressed glow outputs. In this paper, we enhance the visibility from a single nighttime haze image by suppressing glow and enhancing low-light regions. To handle glow effects, our framework learns from the rendered glow pairs. Specifically, a light source aware network is proposed to detect light sources of night images, followed by the APSF (Angular Point Spread Function)-guided glow rendering. Our framework is then trained on the rendered images, resulting in glow suppression. Moreover, we utilize gradient-adaptive convolution, to capture edges and textures in hazy scenes. By leveraging extracted edges and textures, we enhance the contrast of the scene without losing important structural details. To boost low-light intensity, our network learns an attention map, then adjusted by gamma correction. This attention has high values on low-light regions and low values on haze and glow regions. Extensive evaluation on real nighttime haze images, demonstrates the effectiveness of our method. Our experiments demonstrate that our method achieves a PSNR of 30.38dB, outperforming state-of-the-art methods by 13$\%$ on GTA5 nighttime haze dataset. Our data and code is available at: \url{https://github.com/jinyeying/nighttime_dehaze}.
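
The attention-guided gamma correction step is only described at a high level in the abstract, so the blending below is an assumed formulation, shown purely to illustrate how an attention map with high values on low-light regions could modulate a brightening gamma curve.

```python
import numpy as np

def attention_gamma_boost(img, attention, gamma=0.5):
    """Brighten low-light regions guided by an attention map (assumed formulation).

    img:       (H, W, 3) image in [0, 1]
    attention: (H, W) map in [0, 1], high on low-light regions, low on haze/glow
    gamma < 1 brightens; the attention map blends corrected and original pixels.
    """
    corrected = np.power(np.clip(img, 1e-6, 1.0), gamma)   # plain gamma correction
    a = attention[..., None]                                # broadcast over channels
    return a * corrected + (1.0 - a) * img

# toy usage with random data
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3)) * 0.3        # a dark image
attn = rng.random((64, 64))
out = attention_gamma_boost(img, attn, gamma=0.45)
```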

Quantification of Predictive Uncertainty via Inference-Time Sampling

  • paper_url: http://arxiv.org/abs/2308.01731
  • repo_url: None
  • paper_authors: Katarína Tóthová, Ľubor Ladický, Daniel Thul, Marc Pollefeys, Ender Konukoglu
  • for: This work proposes a post-hoc sampling strategy for estimating predictive uncertainty that accounts for prediction variability caused by data ambiguity.
  • methods: The method requires no dedicated modeling components or training mechanisms and can be applied to any feed-forward deterministic network without changes to the architecture or training procedure.
  • results: Experiments show that the method generates diverse, multi-modal predictive distributions whose estimated uncertainty correlates well with the prediction error.
    Abstract Predictive variability due to data ambiguities has typically been addressed via construction of dedicated models with built-in probabilistic capabilities that are trained to predict uncertainty estimates as variables of interest. These approaches require distinct architectural components and training mechanisms, may include restrictive assumptions and exhibit overconfidence, i.e., high confidence in imprecise predictions. In this work, we propose a post-hoc sampling strategy for estimating predictive uncertainty accounting for data ambiguity. The method can generate different plausible outputs for a given input and does not assume parametric forms of predictive distributions. It is architecture agnostic and can be applied to any feed-forward deterministic network without changes to the architecture or training procedure. Experiments on regression tasks on imaging and non-imaging input data show the method's ability to generate diverse and multi-modal predictive distributions, and a desirable correlation of the estimated uncertainty with the prediction error.

Weakly Supervised 3D Instance Segmentation without Instance-level Annotations

  • paper_url: http://arxiv.org/abs/2308.01721
  • repo_url: None
  • paper_authors: Shichao Dong, Guosheng Lin
  • for: Reducing the manual annotation cost of 3D semantic scene understanding tasks.
  • methods: Proposes the first weakly-supervised 3D instance segmentation method that requires only categorical semantic labels as supervision, without any instance-level annotations.
  • results: Experiments show the method achieves results comparable to recent fully supervised methods, and it can also help existing methods learn 3D instance segmentation at reduced annotation cost.
    Abstract 3D semantic scene understanding tasks have achieved great success with the emergence of deep learning, but often require a huge amount of manually annotated training data. To alleviate the annotation cost, we propose the first weakly-supervised 3D instance segmentation method that only requires categorical semantic labels as supervision, and we do not need instance-level labels. The required semantic annotations can be either dense or extreme sparse (e.g. 0.02% of total points). Even without having any instance-related ground-truth, we design an approach to break point clouds into raw fragments and find the most confident samples for learning instance centroids. Furthermore, we construct a recomposed dataset using pseudo instances, which is used to learn our defined multilevel shape-aware objectness signal. An asymmetrical object inference algorithm is followed to process core points and boundary points with different strategies, and generate high-quality pseudo instance labels to guide iterative training. Experiments demonstrate that our method can achieve comparable results with recent fully supervised methods. By generating pseudo instance labels from categorical semantic labels, our designed approach can also assist existing methods for learning 3D instance segmentation at reduced annotation cost.

Balanced Destruction-Reconstruction Dynamics for Memory-replay Class Incremental Learning

  • paper_url: http://arxiv.org/abs/2308.01698
  • repo_url: https://github.com/zyuh/bdr-main
  • paper_authors: Yuhang Zhou, Jiangchao Yao, Feng Hong, Ya Zhang, Yanfeng Wang
  • for: Improving stability and generalization in class incremental learning (CIL) by balancing the destruction and reconstruction of old knowledge to alleviate catastrophic forgetting.
  • methods: Proposes a Balanced Destruction-Reconstruction (BDR) module that balances the destruction and reconstruction of old knowledge by accounting for differences in training status across classes and the quantity imbalance between current-phase samples and memory samples, thereby improving knowledge reconstruction.
  • results: Experiments show that, as a lightweight plug-and-play module, BDR significantly improves the performance of existing state-of-the-art methods with good generalization.
    Abstract Class incremental learning (CIL) aims to incrementally update a trained model with the new classes of samples (plasticity) while retaining previously learned ability (stability). To address the most challenging issue in this goal, i.e., catastrophic forgetting, the mainstream paradigm is memory-replay CIL, which consolidates old knowledge by replaying a small number of old classes of samples saved in the memory. Despite effectiveness, the inherent destruction-reconstruction dynamics in memory-replay CIL are an intrinsic limitation: if the old knowledge is severely destructed, it will be quite hard to reconstruct the lossless counterpart. Our theoretical analysis shows that the destruction of old knowledge can be effectively alleviated by balancing the contribution of samples from the current phase and those saved in the memory. Motivated by this theoretical finding, we propose a novel Balanced Destruction-Reconstruction module (BDR) for memory-replay CIL, which can achieve better knowledge reconstruction by reducing the degree of maximal destruction of old knowledge. Specifically, to achieve a better balance between old knowledge and new classes, the proposed BDR module takes into account two factors: the variance in training status across different classes and the quantity imbalance of samples from the current phase and memory. By dynamically manipulating the gradient during training based on these factors, BDR can effectively alleviate knowledge destruction and improve knowledge reconstruction. Extensive experiments on a range of CIL benchmarks have shown that as a lightweight plug-and-play module, BDR can significantly improve the performance of existing state-of-the-art methods with good generalization.

BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout

  • paper_url: http://arxiv.org/abs/2308.01661
  • repo_url: None
  • paper_authors: Kairui Yang, Enhui Ma, Jibin Peng, Qing Guo, Di Lin, Kaicheng Yu
  • for: Improving the performance of perception models in computer vision systems by using synthesized images to augment training.
  • methods: Proposes a two-stage generative method, named BEVControl, that can generate accurate foreground and background content; it also supports sketch-style input, which is more convenient for humans to edit.
  • results: Compared with BEVGen, BEVControl improves foreground segmentation mIoU by a significant margin, from 5.89 to 26.80; training a downstream perception model on its generated images yields an average improvement of 1.29 in NDS score.
    Abstract Using synthesized images to boost the performance of perception models is a long-standing research challenge in computer vision. It becomes more eminent in visual-centric autonomous driving systems with multi-view cameras as some long-tail scenarios can never be collected. Guided by the BEV segmentation layouts, the existing generative networks seem to synthesize photo-realistic street-view images when evaluated solely on scene-level metrics. However, once zoom-in, they usually fail to produce accurate foreground and background details such as heading. To this end, we propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background contents. In contrast to segmentation-like input, it also supports sketch style input, which is more flexible for humans to edit. In addition, we propose a comprehensive multi-level evaluation protocol to fairly compare the quality of the generated scene, foreground object, and background geometry. Our extensive experiments show that our BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin, from 5.89 to 26.80 on foreground segmentation mIoU. In addition, we show that using images generated by BEVControl to train the downstream perception model, it achieves on average 1.29 improvement in NDS score.

DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.01655
  • repo_url: None
  • paper_authors: Jianxin Lin, Peng Xiao, Yijun Wang, Rongju Zhang, Xiangxiang Zeng
  • for: Improving the fidelity and diversity of automatic or reference-based image colorization, with a particular focus on object-level color control.
  • methods: Builds on pre-trained diffusion models to propose DiffColor, which generates colorized images conditioned on a text prompt without any additional inputs; DiffColor consists of two stages: colorization with a generative color prior, and in-context controllable colorization.
  • results: DiffColor produces vivid and diverse colors within a few iterations while keeping the structure and background intact and aligning colors with the target language guidance. It also enables in-context colorization, producing different results by modifying the prompt text without any fine-tuning, and achieves object-level controllable colorization. Extensive experiments and user studies show that DiffColor outperforms prior works in visual quality, color fidelity, and diversity of colorization options.
    Abstract Recent data-driven image colorization methods have enabled automatic or reference-based colorization, while still suffering from unsatisfactory and inaccurate object-level color control. To address these issues, we propose a new method called DiffColor that leverages the power of pre-trained diffusion models to recover vivid colors conditioned on a prompt text, without any additional inputs. DiffColor mainly contains two stages: colorization with generative color prior and in-context controllable colorization. Specifically, we first fine-tune a pre-trained text-to-image model to generate colorized images using a CLIP-based contrastive loss. Then we try to obtain an optimized text embedding aligning the colorized image and the text prompt, and a fine-tuned diffusion model enabling high-quality image reconstruction. Our method can produce vivid and diverse colors with a few iterations, and keep the structure and background intact while having colors well-aligned with the target language guidance. Moreover, our method allows for in-context colorization, i.e., producing different colorization results by modifying prompt texts without any fine-tuning, and can achieve object-level controllable colorization results. Extensive experiments and user studies demonstrate that DiffColor outperforms previous works in terms of visual quality, color fidelity, and diversity of colorization options.

Multi-scale Cross-restoration Framework for Electrocardiogram Anomaly Detection

  • paper_url: http://arxiv.org/abs/2308.01639
  • repo_url: https://github.com/mediabrain-sjtu/ecgad
  • paper_authors: Aofan Jiang, Chaoqin Huang, Qing Cao, Shuang Wu, Zi Zeng, Kang Chen, Ya Zhang, Yanfeng Wang
  • for: This paper aims to improve the sensitivity of the electrocardiogram (ECG) as a diagnostic tool for detecting cardiac conditions.
  • methods: An anomaly-detection-based method for ECG anomaly detection and localization is proposed, using multi-scale cross-restoration of whole-ECG and heartbeat-level features to improve detection accuracy.
  • results: The method achieves state-of-the-art performance on a newly proposed benchmark dataset as well as on two other widely used ECG datasets.
    Abstract Electrocardiogram (ECG) is a widely used diagnostic tool for detecting heart conditions. Rare cardiac diseases may be underdiagnosed using traditional ECG analysis, considering that no training dataset can exhaust all possible cardiac disorders. This paper proposes using anomaly detection to identify any unhealthy status, with normal ECGs solely for training. However, detecting anomalies in ECG can be challenging due to significant inter-individual differences and anomalies present in both global rhythm and local morphology. To address this challenge, this paper introduces a novel multi-scale cross-restoration framework for ECG anomaly detection and localization that considers both local and global ECG characteristics. The proposed framework employs a two-branch autoencoder to facilitate multi-scale feature learning through a masking and restoration process, with one branch focusing on global features from the entire ECG and the other on local features from heartbeat-level details, mimicking the diagnostic process of cardiologists. Anomalies are identified by their high restoration errors. To evaluate the performance on a large number of individuals, this paper introduces a new challenging benchmark with signal point-level ground truths annotated by experienced cardiologists. The proposed method demonstrates state-of-the-art performance on this benchmark and two other well-known ECG datasets. The benchmark dataset and source code are available at: \url{https://github.com/MediaBrain-SJTU/ECGAD}
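
The detection rule above, flagging signals whose restoration error is high, can be illustrated with a generic masked-restoration scoring sketch. The restorer, mask ratio, and threshold below are placeholders, not the paper's two-branch architecture.

```python
import numpy as np

def anomaly_scores(signals, restore_fn, mask_ratio=0.3, seed=0):
    """Score ECG segments by their restoration error under random masking.

    signals:    (N, T) array of fixed-length ECG segments
    restore_fn: callable mapping a masked (N, T) batch to its restoration
    Returns the per-segment mean squared restoration error (higher = more anomalous).
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(signals.shape) < mask_ratio
    masked = np.where(mask, 0.0, signals)            # zero out the masked samples
    restored = restore_fn(masked)
    return np.mean((restored - signals) ** 2, axis=1)

# toy usage with an identity "restorer" standing in for a trained model
segments = np.random.default_rng(1).standard_normal((8, 500))
scores = anomaly_scores(segments, restore_fn=lambda x: x)
flagged = scores > np.percentile(scores, 95)          # illustrative threshold
```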

Disentangling Multi-view Representations Beyond Inductive Bias

  • paper_url: http://arxiv.org/abs/2308.01634
  • repo_url: https://github.com/guanzhou-ke/dmrib
  • paper_authors: Guanzhou Ke, Yang Yu, Guoqing Chao, Xiaoli Wang, Chenyang Xu, Shengfeng He
  • for: Proposes a novel multi-view representation disentangling method that goes beyond inductive biases and ensures both interpretability and generalizability of the resulting representations.
  • methods: The method is based on discovering multi-view consistency in advance, which determines the disentangling information boundary, by maximizing transformation invariance and clustering consistency between views. It consists of two stages: obtaining multi-view consistency by training a consistent encoder, and disentangling specificity from comprehensive representations by minimizing the upper bound of mutual information.
  • results: The method outperforms 12 comparison methods in clustering and classification performance on four multi-view datasets, and the extracted consistency and specificity are compact and interpretable.
    Abstract Multi-view (or -modality) representation learning aims to understand the relationships between different view representations. Existing methods disentangle multi-view representations into consistent and view-specific representations by introducing strong inductive biases, which can limit their generalization ability. In this paper, we propose a novel multi-view representation disentangling method that aims to go beyond inductive biases, ensuring both interpretability and generalizability of the resulting representations. Our method is based on the observation that discovering multi-view consistency in advance can determine the disentangling information boundary, leading to a decoupled learning objective. We also found that the consistency can be easily extracted by maximizing the transformation invariance and clustering consistency between views. These observations drive us to propose a two-stage framework. In the first stage, we obtain multi-view consistency by training a consistent encoder to produce semantically-consistent representations across views as well as their corresponding pseudo-labels. In the second stage, we disentangle specificity from comprehensive representations by minimizing the upper bound of mutual information between consistent and comprehensive representations. Finally, we reconstruct the original data by concatenating pseudo-labels and view-specific representations. Our experiments on four multi-view datasets demonstrate that our proposed method outperforms 12 comparison methods in terms of clustering and classification performance. The visualization results also show that the extracted consistency and specificity are compact and interpretable. Our code can be found at \url{https://github.com/Guanzhou-Ke/DMRIB}.

Erasure-based Interaction Network for RGBT Video Object Detection and A Unified Benchmark

  • paper_url: http://arxiv.org/abs/2308.01630
  • repo_url: None
  • paper_authors: Zhengzheng Tu, Qishun Wang, Hongshun Wang, Kunpeng Wang, Chenglong Li
  • for: Improving video object detection (VOD) performance under adverse illumination conditions by introducing the thermal modality, while keeping the computational complexity low.
  • methods: Introduces a new computer vision task called RGB-thermal (RGBT) VOD, designs a novel network named EINet, and builds the VT-VOD50 dataset of 50 RGBT video pairs with complex backgrounds, various objects, and different illumination conditions.
  • results: Extensive experiments on VT-VOD50 demonstrate the effectiveness and efficiency of the proposed method compared with existing mainstream VOD methods.
    Abstract Recently, many breakthroughs are made in the field of Video Object Detection (VOD), but the performance is still limited due to the imaging limitations of RGB sensors in adverse illumination conditions. To alleviate this issue, this work introduces a new computer vision task called RGB-thermal (RGBT) VOD by introducing the thermal modality that is insensitive to adverse illumination conditions. To promote the research and development of RGBT VOD, we design a novel Erasure-based Interaction Network (EINet) and establish a comprehensive benchmark dataset (VT-VOD50) for this task. Traditional VOD methods often leverage temporal information by using many auxiliary frames, and thus have large computational burden. Considering that thermal images exhibit less noise than RGB ones, we develop a negative activation function that is used to erase the noise of RGB features with the help of thermal image features. Furthermore, with the benefits from thermal images, we rely only on a small temporal window to model the spatio-temporal information to greatly improve efficiency while maintaining detection accuracy. VT-VOD50 dataset consists of 50 pairs of challenging RGBT video sequences with complex backgrounds, various objects and different illuminations, which are collected in real traffic scenarios. Extensive experiments on VT-VOD50 dataset demonstrate the effectiveness and efficiency of our proposed method against existing mainstream VOD methods. The code of EINet and the dataset will be released to the public for free academic usage.

A Multidimensional Analysis of Social Biases in Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.01948
  • repo_url: https://github.com/jannik-brinkmann/social-biases-in-vision-transformers
  • paper_authors: Jannik Brinkmann, Paul Swoboda, Christian Bartelt
  • for: Investigating the social biases present in Vision Transformers (ViTs).
  • methods: Measures the impact of training data, model architecture, and training objectives on the social biases in the learned representations of ViTs.
  • results: Counterfactual augmentation training using diffusion-based image editing can mitigate social biases but does not eliminate them. Larger models are less biased than smaller models, and models trained with discriminative objectives are less biased than those trained with generative objectives. ViTs trained on the same dataset with different self-supervised objectives can exhibit opposite biases. These findings shed light on the origins of social biases and suggest that fairness can be substantially improved through model design choices.
    Abstract The embedding spaces of image models have been shown to encode a range of social biases such as racism and sexism. Here, we investigate specific factors that contribute to the emergence of these biases in Vision Transformers (ViT). Therefore, we measure the impact of training data, model architecture, and training objectives on social biases in the learned representations of ViTs. Our findings indicate that counterfactual augmentation training using diffusion-based image editing can mitigate biases, but does not eliminate them. Moreover, we find that larger models are less biased than smaller models, and that models trained using discriminative objectives are less biased than those trained using generative objectives. In addition, we observe inconsistencies in the learned social biases. To our surprise, ViTs can exhibit opposite biases when trained on the same data set using different self-supervised objectives. Our findings give insights into the factors that contribute to the emergence of social biases and suggests that we could achieve substantial fairness improvements based on model design choices.

A Novel Convolutional Neural Network Architecture with a Continuous Symmetry

  • paper_url: http://arxiv.org/abs/2308.01621
  • repo_url: https://github.com/liuyao12/ConvNets-PDE-perspective
  • paper_authors: Yao Liu, Hang Shao, Bing Bai
  • for: This paper explores a convolutional neural network (ConvNet) architecture inspired by partial differential equations (PDEs) and demonstrates comparable performance on image classification tasks.
  • methods: The architecture is derived from a class of PDEs called quasi-linear hyperbolic systems and admits a continuous symmetry that allows the weights to be modified via a continuous group.
  • results: The proposed ConvNet achieves performance comparable to traditional models on image classification while possessing an internal continuous symmetry.
    Abstract This paper introduces a new Convolutional Neural Network (ConvNet) architecture inspired by a class of partial differential equations (PDEs) called quasi-linear hyperbolic systems. With comparable performance on the image classification task, it allows for the modification of the weights via a continuous group of symmetry. This is a significant shift from traditional models where the architecture and weights are essentially fixed. We wish to promote the (internal) symmetry as a new desirable property for a neural network, and to draw attention to the PDE perspective in analyzing and interpreting ConvNets in the broader Deep Learning community.

A Survey on Deep Learning-based Spatio-temporal Action Detection

  • paper_url: http://arxiv.org/abs/2308.01618
  • repo_url: None
  • paper_authors: Peng Wang, Fanwei Zeng, Yuntao Qian
  • for: This survey reviews deep learning methods for spatio-temporal action detection, which classifies the actions present in a video and localizes them in space and time.
  • methods: A taxonomy is developed to classify and organize deep learning-based methods, and linking algorithms, which associate frame- or clip-level detection results into action tubes, are reviewed in detail.
  • results: The paper provides a comprehensive review of state-of-the-art methods, introduces the commonly used benchmark datasets and evaluation metrics, compares model performance, and concludes with potential future research directions.
    Abstract Spatio-temporal action detection (STAD) aims to classify the actions present in a video and localize them in space and time. It has become a particularly active area of research in computer vision because of its explosively emerging real-world applications, such as autonomous driving, visual surveillance, entertainment, etc. Many efforts have been devoted in recent years to building a robust and effective framework for STAD. This paper provides a comprehensive review of the state-of-the-art deep learning-based methods for STAD. Firstly, a taxonomy is developed to organize these methods. Next, the linking algorithms, which aim to associate the frame- or clip-level detection results together to form action tubes, are reviewed. Then, the commonly used benchmark datasets and evaluation metrics are introduced, and the performance of state-of-the-art models is compared. At last, this paper is concluded, and a set of potential research directions of STAD are discussed.

Real-time Light Estimation and Neural Soft Shadows for AR Indoor Scenarios

  • paper_url: http://arxiv.org/abs/2308.01613
  • repo_url: None
  • paper_authors: Alexander Sommer, Ulrich Schwanecke, Elmar Schömer
  • for: This paper presents a rendering pipeline for realistically embedding virtual objects into footage of indoor scenes in real-time AR applications.
  • methods: The pipeline consists of two main components: a deep-neural-network-based light estimator and a neural soft shadow generator. The light estimator determines the main light direction, light color, ambient color, and an opacity parameter for the shadow texture; the soft shadow method encodes object-based realistic soft shadows as light-direction-dependent textures in a small MLP.
  • results: The pipeline integrates objects into real-time AR scenes with a new level of realism. The models are small enough to run on current mobile devices, achieving 9 ms for light estimation and 5 ms for neural soft shadows on an iPhone 11 Pro.
    Abstract We present a pipeline for realistic embedding of virtual objects into footage of indoor scenes with focus on real-time AR applications. Our pipeline consists of two main components: A light estimator and a neural soft shadow texture generator. Our light estimation is based on deep neural nets and determines the main light direction, light color, ambient color and an opacity parameter for the shadow texture. Our neural soft shadow method encodes object-based realistic soft shadows as light direction dependent textures in a small MLP. We show that our pipeline can be used to integrate objects into AR scenes in a new level of realism in real-time. Our models are small enough to run on current mobile devices. We achieve runtimes of 9ms for light estimation and 5ms for neural shadows on an iPhone 11 Pro.

IndoHerb: Indonesia Medicinal Plants Recognition using Transfer Learning and Deep Learning

  • paper_url: http://arxiv.org/abs/2308.01604
  • repo_url: None
  • paper_authors: Muhammad Salman Ikrar Musyaffa, Novanto Yudistira, Muhammad Arif Rahman
  • for: This study aims to use computer vision to recognize Indonesian medicinal plants.
  • methods: Transfer learning with Convolutional Neural Network (CNN) models is applied to classify Indonesian medicinal plants; image data were collected independently through the Google Images search engine, then preprocessed and classified.
  • results: Testing shows that the DenseNet121 model reaches an accuracy of 87.4%, while a model trained from scratch reaches 43.53%.
    Abstract Herbal plants are nutritious plants that can be used as an alternative to traditional disease healing. In Indonesia there are various types of herbal plants. But with the development of the times, the existence of herbal plants as traditional medicines began to be forgotten so that not everyone could recognize them. Having the ability to identify herbal plants can have many positive impacts. However, there is a problem where identifying plants can take a long time because it requires in-depth knowledge and careful examination of plant criteria. So that the application of computer vision can help identify herbal plants. Previously, research had been conducted on the introduction of herbal plants from Vietnam using several algorithms, but from these research the accuracy was not high enough. Therefore, this study intends to implement transfer learning from the Convolutional Neural Network (CNN) algorithm to classify types of herbal plants from Indonesia. This research was conducted by collecting image data of herbal plants from Indonesia independently through the Google Images search engine. After that, it will go through the data preprocessing, classification using the transfer learning method from CNN, and analysis will be carried out. The CNN transfer learning models used are ResNet34, DenseNet121, and VGG11_bn. Based on the test results of the three models, it was found that DenseNet121 was the model with the highest accuracy, which was 87.4%. In addition, testing was also carried out using the scratch model and obtained an accuracy of 43.53%. The Hyperparameter configuration used in this test is the ExponentialLR scheduler with a gamma value of 0.9; learning rate 0.001; Cross Entropy Loss function; Adam optimizer; and the number of epochs is 50. Indonesia Medicinal Plant Dataset can be accessed at the following link https://github.com/Salmanim20/indo_medicinal_plant

Reference-Free Isotropic 3D EM Reconstruction using Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.01594
  • repo_url: None
  • paper_authors: Kyungryun Lee, Won-Ki Jeong
  • for: This paper addresses the anisotropic axial resolution of electron microscopy (EM) images to improve the accuracy and efficiency of analysis and downstream tasks.
  • methods: The approach is based on diffusion models and reconstructs 3D volumes without reference data or prior knowledge of the degradation process. It is well suited to highly downsampled data and, in experiments, shows advantages in robustness over supervised learning methods.
  • results: The method can restore a single anisotropic volume in a self-supervised manner without any training data. Extensive experiments on two public datasets demonstrate its robustness and reliability.
    Abstract Electron microscopy (EM) images exhibit anisotropic axial resolution due to the characteristics inherent to the imaging modality, presenting challenges in analysis and downstream tasks.In this paper, we propose a diffusion-model-based framework that overcomes the limitations of requiring reference data or prior knowledge about the degradation process. Our approach utilizes 2D diffusion models to consistently reconstruct 3D volumes and is well-suited for highly downsampled data. Extensive experiments conducted on two public datasets demonstrate the robustness and superiority of leveraging the generative prior compared to supervised learning methods. Additionally, we demonstrate our method's feasibility for self-supervised reconstruction, which can restore a single anisotropic volume without any training data.

Consistency Regularization for Generalizable Source-free Domain Adaptation

  • paper_url: http://arxiv.org/abs/2308.01587
  • repo_url: None
  • paper_authors: Longxiang Tang, Kai Li, Chunming He, Yulun Zhang, Xiu Li
  • for: This paper targets source-free domain adaptation (SFDA): adapting a well-trained source model to an unlabelled target domain without access to the source dataset, which makes it applicable to a wide range of real-world scenarios.
  • methods: A consistency regularization framework is proposed to build a more generalizable SFDA method that improves performance on both the target training and testing sets. Soft pseudo-labels generated from weakly augmented images supervise strongly augmented images, guiding training and strengthening generalization. A sampling-based pseudo-label selection strategy favors samples with more severe domain shift to obtain more potentially useful supervision, and global-oriented calibration exploits global class distribution and feature cluster information to further improve adaptation. A minimal sketch of the consistency term appears after the abstract below.
  • results: Extensive experiments on several SFDA benchmarks show that the method achieves strong performance and generalization without accessing the source dataset, and remains robust on unseen testing data.
    Abstract Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset, making it applicable in a variety of real-world scenarios. Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets. This oversight leads to overfitting issues and constrains the model's generalization ability. In this paper, we propose a consistency regularization framework to develop a more generalizable SFDA method, which simultaneously boosts model performance on both target training and testing datasets. Our method leverages soft pseudo-labels generated from weakly augmented images to supervise strongly augmented images, facilitating the model training process and enhancing the generalization ability of the adapted model. To leverage more potentially useful supervision, we present a sampling-based pseudo-label selection strategy, taking samples with severer domain shift into consideration. Moreover, global-oriented calibration methods are introduced to exploit global class distribution and feature cluster information, further improving the adaptation process. Extensive experiments demonstrate our method achieves state-of-the-art performance on several SFDA benchmarks, and exhibits robustness on unseen testing datasets.
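A minimal sketch of the core consistency idea, assuming a standard soft-label cross-entropy between weak-view pseudo-labels and strong-view predictions; the paper's pseudo-label selection and calibration components are omitted here.

```python
# Soft pseudo-labels from a weakly augmented image supervise the prediction
# on a strongly augmented view of the same image (illustrative only).
import torch
import torch.nn.functional as F

def consistency_loss(model, weak_img, strong_img, temperature=1.0):
    with torch.no_grad():
        soft_pseudo = F.softmax(model(weak_img) / temperature, dim=1)
    log_prob_strong = F.log_softmax(model(strong_img), dim=1)
    # Cross-entropy between soft pseudo-labels and strong-view predictions.
    return -(soft_pseudo * log_prob_strong).sum(dim=1).mean()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
weak = torch.randn(4, 3, 32, 32)
strong = weak + 0.5 * torch.randn_like(weak)  # stand-in for strong augmentation
print(consistency_loss(model, weak, strong).item())
```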

MVFlow: Deep Optical Flow Estimation of Compressed Videos with Motion Vector Prior

  • paper_url: http://arxiv.org/abs/2308.01568
  • repo_url: None
  • paper_authors: Shili Zhou, Xuhao Jiang, Weimin Tan, Ruian He, Bo Yan
  • for: This paper aims to improve the speed and accuracy of optical flow estimation for compressed videos.
  • methods: The method exploits motion vectors that are already present in compressed video streams, combining this compression-domain information with the video frames to improve the accuracy and speed of flow estimation.
  • results: Experiments show that the proposed MVFlow reduces AEPE by 1.09 compared with existing models, or matches their accuracy while saving 52% of the computation time.
    Abstract In recent years, many deep learning-based methods have been proposed to tackle the problem of optical flow estimation and achieved promising results. However, they hardly consider that most videos are compressed and thus ignore the pre-computed information in compressed video streams. Motion vectors, one of the compression information, record the motion of the video frames. They can be directly extracted from the compression code stream without computational cost and serve as a solid prior for optical flow estimation. Therefore, we propose an optical flow model, MVFlow, which uses motion vectors to improve the speed and accuracy of optical flow estimation for compressed videos. In detail, MVFlow includes a key Motion-Vector Converting Module, which ensures that the motion vectors can be transformed into the same domain of optical flow and then be utilized fully by the flow estimation module. Meanwhile, we construct four optical flow datasets for compressed videos containing frames and motion vectors in pairs. The experimental results demonstrate the superiority of our proposed MVFlow, which can reduce the AEPE by 1.09 compared to existing models or save 52% time to achieve similar accuracy to existing models.
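The Motion-Vector Converting Module itself is learned, but the underlying idea of lifting block-level motion vectors into an optical-flow-like dense field can be approximated as below. The block size and the use of plain nearest-neighbor upsampling are assumptions for illustration only.

```python
# Illustrative sketch only: lift per-macroblock motion vectors (e.g., one
# vector per 16x16 block, extracted for free from the code stream) into a
# dense flow-shaped tensor that a flow network could consume as a prior.
import torch
import torch.nn.functional as F

def motion_vectors_to_dense_flow(mv, block_size=16):
    # mv: (B, 2, H/block, W/block) motion vectors in pixels per block
    return F.interpolate(mv, scale_factor=block_size, mode="nearest")

mv = torch.randn(1, 2, 4, 6)            # 4x6 grid of macroblock motion vectors
dense = motion_vectors_to_dense_flow(mv)
print(dense.shape)                       # torch.Size([1, 2, 64, 96])
```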

Dynamic Token-Pass Transformers for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.01944
  • repo_url: None
  • paper_authors: Yuang Liu, Qiang Zhou, Jing Wang, Fan Wang, Jun Wang, Wei Zhang
  • for: This paper aims to improve the efficiency and speed of semantic segmentation.
  • methods: The method, dynamic token-pass vision transformers (DoViT), adapts to the complexity of each image by gradually stopping easy tokens from self-attention computation while keeping hard tokens, reducing inference cost.
  • results: The method cuts roughly 40%-60% of FLOPs while remaining hardware-friendly, with an mIoU drop within 0.8%; the throughput and inference speed of ViT-L/B increase to more than 2x on Cityscapes.
    Abstract Vision transformers (ViT) usually extract features via forwarding all the tokens in the self-attention layers from top to toe. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which can adaptively reduce the inference cost for images with different complexity. DoViT gradually stops partial easy tokens from self-attention calculation and keeps the hard tokens forwarding until meeting the stopping criteria. We employ lightweight auxiliary heads to make the token-pass decision and divide the tokens into keeping/stopping parts. With a token separate calculation, the self-attention layers are speeded up with sparse tokens and still work friendly with hardware. A token reconstruction module is built to collect and reset the grouped tokens to their original position in the sequence, which is necessary to predict correct semantic masks. We conduct extensive experiments on two common semantic segmentation tasks, and demonstrate that our method greatly reduces about 40% $\sim$ 60% FLOPs and the drop of mIoU is within 0.8% for various segmentation transformers. The throughput and inference speed of ViT-L/B are increased to more than 2$\times$ on Cityscapes.
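A conceptual sketch (not the DoViT implementation) of the token-pass mechanism: a lightweight auxiliary head decides which tokens keep flowing through self-attention, and the processed hard tokens are later scattered back to their original positions for mask prediction.

```python
# Conceptual token-pass sketch: score tokens, forward only the "hard" subset
# through the next attention block, then reconstruct the full token sequence.
import torch
import torch.nn as nn

dim, num_tokens = 64, 100
tokens = torch.randn(1, num_tokens, dim)
aux_head = nn.Linear(dim, 1)                      # token-pass decision head
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

keep = aux_head(tokens).squeeze(-1) > 0.0         # True = still "hard"
hard = tokens[:, keep[0]]                         # sparse subset of tokens
hard = block(hard)                                # attention on hard tokens only

# Token reconstruction: scatter processed tokens back to their positions.
out = tokens.clone()
out[:, keep[0]] = hard
print(keep.sum().item(), "tokens forwarded of", num_tokens)
```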

Get the Best of Both Worlds: Improving Accuracy and Transferability by Grassmann Class Representation

  • paper_url: http://arxiv.org/abs/2308.01547
  • repo_url: None
  • paper_authors: Haoqi Wang, Zhizhong Li, Wayne Zhang
  • for: This paper aims to improve both the accuracy and the feature transferability of deep models.
  • methods: Class vectors are generalized to linear subspaces (i.e., points on the Grassmann manifold), and Riemannian SGD is integrated into the deep learning framework so that class subspaces are optimized jointly with the remaining model parameters.
  • results: On ImageNet-1K, GCR reduces the top-1 error of ResNet50-D, ResNeXt50, Swin-T, and Deit3-S by 5.6%, 4.5%, 3.0%, and 3.5%, respectively. Subspaces also give features more freedom to vary, which improves downstream quality; for ResNet50-D, the average linear transfer accuracy across 6 datasets improves from 77.98% to 79.70%.
    Abstract We generalize the class vectors found in neural networks to linear subspaces (i.e.~points in the Grassmann manifold) and show that the Grassmann Class Representation (GCR) enables the simultaneous improvement in accuracy and feature transferability. In GCR, each class is a subspace and the logit is defined as the norm of the projection of a feature onto the class subspace. We integrate Riemannian SGD into deep learning frameworks such that class subspaces in a Grassmannian are jointly optimized with the rest model parameters. Compared to the vector form, the representative capability of subspaces is more powerful. We show that on ImageNet-1K, the top-1 error of ResNet50-D, ResNeXt50, Swin-T and Deit3-S are reduced by 5.6%, 4.5%, 3.0% and 3.5%, respectively. Subspaces also provide freedom for features to vary and we observed that the intra-class feature variability grows when the subspace dimension increases. Consequently, we found the quality of GCR features is better for downstream tasks. For ResNet50-D, the average linear transfer accuracy across 6 datasets improves from 77.98% to 79.70% compared to the strong baseline of vanilla softmax. For Swin-T, it improves from 81.5% to 83.4% and for Deit3, it improves from 73.8% to 81.4%. With these encouraging results, we believe that more applications could benefit from the Grassmann class representation. Code is released at https://github.com/innerlee/GCR.
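A minimal sketch of how a Grassmann class logit can be computed under our reading of the abstract: each class is a k-dimensional subspace with an orthonormal basis, and the logit is the norm of the feature's projection onto that subspace. Dimensions and the random bases are placeholders; the paper optimizes the bases with Riemannian SGD.

```python
# Grassmann class logits: logit_k = || projection of feature onto class k ||.
import torch

feat_dim, subspace_dim, num_classes = 512, 8, 10

# Random orthonormal bases, one (feat_dim x subspace_dim) basis per class.
bases = torch.linalg.qr(torch.randn(num_classes, feat_dim, subspace_dim)).Q

def grassmann_logits(features, bases):
    # features: (B, feat_dim); coefficients: (B, num_classes, subspace_dim)
    coeffs = torch.einsum("bd,cdk->bck", features, bases)
    return coeffs.norm(dim=-1)               # (B, num_classes)

x = torch.nn.functional.normalize(torch.randn(4, feat_dim), dim=-1)
print(grassmann_logits(x, bases).shape)       # torch.Size([4, 10])
```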

DMDC: Dynamic-mask-based dual camera design for snapshot Hyperspectral Imaging

  • paper_url: http://arxiv.org/abs/2308.01541
  • repo_url: https://github.com/caizeyu1992/dmdc
  • paper_authors: Zeyu Cai, Chengqian Jin, Feipeng Da
  • for: Improve the performance of deep learning methods for coded aperture snapshot spectral imaging (CASSI).
  • methods: A dynamic-mask-based dual-camera design (an RGB camera and a CASSI system running in parallel) with a multimodal reconstruction network, DMDC-net: the system first learns the spatial feature distribution of the scene from the RGB image, instructs the SLM to encode the scene accordingly, and then feeds both the RGB and CASSI measurements into the network for reconstruction.
  • results: Extensive experiments on multiple datasets show that the method improves PSNR by more than 9 dB over the previous state of the art (SOTA).
    Abstract Deep learning methods are developing rapidly in coded aperture snapshot spectral imaging (CASSI). The number of parameters and FLOPs of existing state-of-the-art methods (SOTA) continues to increase, but the reconstruction accuracy improves slowly. Current methods still face two problems: 1) The performance of the spatial light modulator (SLM) is not fully developed due to the limitation of fixed Mask coding. 2) The single input limits the network performance. In this paper we present a dynamic-mask-based dual camera system, which consists of an RGB camera and a CASSI system running in parallel. First, the system learns the spatial feature distribution of the scene based on the RGB images, then instructs the SLM to encode each scene, and finally sends both RGB and CASSI images to the network for reconstruction. We further designed the DMDC-net, which consists of two separate networks, a small-scale CNN-based dynamic mask network for dynamic adjustment of the mask and a multimodal reconstruction network for reconstruction using RGB and CASSI measurements. Extensive experiments on multiple datasets show that our method achieves more than 9 dB improvement in PSNR over the SOTA. (https://github.com/caizeyu1992/DMDC)

MFIM: Megapixel Facial Identity Manipulation

  • paper_url: http://arxiv.org/abs/2308.01536
  • repo_url: None
  • paper_authors: Sanghyeon Na
  • for: This paper proposes a face-swapping framework, Megapixel Facial Identity Manipulation (MFIM), with two goals: generating high-quality images, and transforming the identity of a given image into that of another person while preserving identity-irrelevant attributes.
  • methods: A pretrained StyleGAN is exploited in a GAN-inversion manner to generate megapixel images, and a 3DMM is used to capture various facial attributes and to explicitly supervise the generation of face-swapped images with the desired attributes.
  • results: Extensive experiments show that the model achieves state-of-the-art performance. The paper also proposes a new ID-mixing operation that creates a new identity by semantically mixing the identities of several people, allowing users to customize it.
    Abstract Face swapping is a task that changes a facial identity of a given image to that of another person. In this work, we propose a novel face-swapping framework called Megapixel Facial Identity Manipulation (MFIM). The face-swapping model should achieve two goals. First, it should be able to generate a high-quality image. We argue that a model which is proficient in generating a megapixel image can achieve this goal. However, generating a megapixel image is generally difficult without careful model design. Therefore, our model exploits pretrained StyleGAN in the manner of GAN-inversion to effectively generate a megapixel image. Second, it should be able to effectively transform the identity of a given image. Specifically, it should be able to actively transform ID attributes (e.g., face shape and eyes) of a given image into those of another person, while preserving ID-irrelevant attributes (e.g., pose and expression). To achieve this goal, we exploit 3DMM that can capture various facial attributes. Specifically, we explicitly supervise our model to generate a face-swapped image with the desirable attributes using 3DMM. We show that our model achieves state-of-the-art performance through extensive experiments. Furthermore, we propose a new operation called ID mixing, which creates a new identity by semantically mixing the identities of several people. It allows the user to customize the new identity.

Multimodal Adaptation of CLIP for Few-Shot Action Recognition

  • paper_url: http://arxiv.org/abs/2308.01532
  • repo_url: None
  • paper_authors: Jiazheng Xing, Mengmeng Wang, Xiaojun Hou, Guang Dai, Jingdong Wang, Yong Liu
  • for: This paper proposes a new way to apply the large-scale pre-trained vision-language model CLIP to few-shot action recognition, improving both performance and efficiency.
  • methods: The "pre-training, fine-tuning" paradigm avoids training a network from scratch, saving time and resources, but has two drawbacks: the limited labeled data in few-shot action recognition requires minimizing the number of tunable parameters to mitigate over-fitting, and the extra temporal dimension of videos challenges effective temporal modeling since pre-trained visual models are usually image models. The proposed Multimodal Adaptation of CLIP (MA-CLIP) addresses both issues.
  • results: MA-CLIP adapts quickly to few-shot action recognition and transfers across tasks without training from scratch, using lightweight adapters (see the sketch after the abstract below). An attention-based text-guided prototype construction module fully exploits video-text multimodal information to enhance the representation of video prototypes.
    Abstract Applying large-scale pre-trained visual models like CLIP to few-shot action recognition tasks can benefit performance and efficiency. Utilizing the "pre-training, fine-tuning" paradigm makes it possible to avoid training a network from scratch, which can be time-consuming and resource-intensive. However, this method has two drawbacks. First, limited labeled samples for few-shot action recognition necessitate minimizing the number of tunable parameters to mitigate over-fitting, also leading to inadequate fine-tuning that increases resource consumption and may disrupt the generalized representation of models. Second, the video's extra-temporal dimension challenges few-shot recognition's effective temporal modeling, while pre-trained visual models are usually image models. This paper proposes a novel method called Multimodal Adaptation of CLIP (MA-CLIP) to address these issues. It adapts CLIP for few-shot action recognition by adding lightweight adapters, which can minimize the number of learnable parameters and enable the model to transfer across different tasks quickly. The adapters we design can combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and has low training costs. Additionally, based on the attention mechanism, we design a text-guided prototype construction module that can fully utilize video-text information to enhance the representation of video prototypes. Our MA-CLIP is plug-and-play, which can be used in any different few-shot action recognition temporal alignment metric.
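A hedged sketch of the kind of lightweight bottleneck adapter the abstract describes, inserted on top of frozen CLIP features; the dimensions, activation, and placement are assumptions rather than MA-CLIP's actual design.

```python
# Lightweight residual bottleneck adapter over frozen backbone features.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the frozen backbone's features intact;
        # only the small down/up projections are trainable.
        return x + self.up(self.act(self.down(x)))

frozen_feat = torch.randn(2, 8, 197, 768)   # (batch, frames, tokens, dim)
adapter = Adapter()
print(adapter(frozen_feat).shape)
```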

Data Augmentation for Human Behavior Analysis in Multi-Person Conversations

  • paper_url: http://arxiv.org/abs/2308.01526
  • repo_url: None
  • paper_authors: Kun Li, Dan Guo, Guoliang Chen, Feiyang Liu, Meng Wang
  • for: This paper presents team HFUT-VUT's solution for the MultiMediate Grand Challenge at ACM Multimedia 2023, which covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction.
  • methods: Swin Transformer is chosen as the baseline, with data augmentation strategies for all three tasks: raw videos are cropped to remove noise from irrelevant regions, and augmentation improves the model's generalization.
  • results: On the test sets, the solution achieves the best bodily behavior recognition result (mean average precision of 0.6262), the highest eye contact detection accuracy (0.7771), and a comparable next speaker prediction result (unweighted average recall of 0.5281).
    Abstract In this paper, we present the solution of our team HFUT-VUT for the MultiMediate Grand Challenge 2023 at ACM Multimedia 2023. The solution covers three sub-challenges: bodily behavior recognition, eye contact detection, and next speaker prediction. We select Swin Transformer as the baseline and exploit data augmentation strategies to address the above three tasks. Specifically, we crop the raw video to remove the noise from other parts. At the same time, we utilize data augmentation to improve the generalization of the model. As a result, our solution achieves the best results of 0.6262 for bodily behavior recognition in terms of mean average precision and the accuracy of 0.7771 for eye contact detection on the corresponding test set. In addition, our approach also achieves comparable results of 0.5281 for the next speaker prediction in terms of unweighted average recall.

VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception

  • paper_url: http://arxiv.org/abs/2308.01525
  • repo_url: https://github.com/jiyounglee-0523/visalign
  • paper_authors: Jiyoung Lee, Seungho Kim, Seunghyun Won, Joonseok Lee, Marzyeh Ghassemi, James Thorne, Jaeseok Choi, O-Kil Kwon, Edward Choi
  • for: This work addresses AI safety by measuring how closely AI models align with human goals, preferences, or ethical principles, focusing on visual perception.
  • methods: The paper proposes a new image classification dataset for measuring AI-human visual alignment.
  • results: Using this dataset, the authors analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods.
    Abstract AI alignment refers to models acting towards human-intended goals, preferences, or ethical principles. Given that most large-scale deep learning models act as black boxes and cannot be manually controlled, analyzing the similarity between models and humans can be a proxy measure for ensuring AI safety. In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment. Specifically, we propose a new dataset for measuring AI-human visual alignment in terms of image classification, a fundamental task in machine perception. In order to evaluate AI-human visual alignment, a dataset should encompass samples with various scenarios that may arise in the real world and have gold human perception labels. Our dataset consists of three groups of samples, namely Must-Act (i.e., Must-Classify), Must-Abstain, and Uncertain, based on the quantity and clarity of visual information in an image and further divided into eight categories. All samples have a gold human perception label; even Uncertain (severely blurry) sample labels were obtained via crowd-sourcing. The validity of our dataset is verified by sampling theory, statistical theories related to survey design, and experts in the related fields. Using our dataset, we analyze the visual alignment and reliability of five popular visual perception models and seven abstention methods. Our code and data is available at \url{https://github.com/jiyounglee-0523/VisAlign}.

PPI-NET: End-to-End Parametric Primitive Inference

  • paper_url: http://arxiv.org/abs/2308.01521
  • repo_url: None
  • paper_authors: Liang Wang, Xiaogang Wang
  • for: This work aims to make the design-model creation process more efficient and accurate, avoiding the inefficiency and error accumulation of auto-regressive models when inferring parametric primitives from hand-drawn sketch images.
  • methods: An efficient and accurate end-to-end method directly infers precise parametric representations of primitives from hand-drawn sketch images.
  • results: The model's outputs match the representation format of standard CAD software, so they can be imported into CAD software for solving, editing, and downstream design tasks.
    Abstract In engineering applications, line, circle, arc, and point are collectively referred to as primitives, and they play a crucial role in path planning, simulation analysis, and manufacturing. When designing CAD models, engineers typically start by sketching the model's orthographic view on paper or a whiteboard and then translate the design intent into a CAD program. Although this design method is powerful, it often involves challenging and repetitive tasks, requiring engineers to perform numerous similar operations in each design. To address this conversion process, we propose an efficient and accurate end-to-end method that avoids the inefficiency and error accumulation issues associated with using auto-regressive models to infer parametric primitives from hand-drawn sketch images. Since our model samples match the representation format of standard CAD software, they can be imported into CAD software for solving, editing, and applied to downstream design tasks.

Contrastive Multi-FaceForensics: An End-to-end Bi-grained Contrastive Learning Approach for Multi-face Forgery Detection

  • paper_url: http://arxiv.org/abs/2308.01520
  • repo_url: None
  • paper_authors: Cong Zhang, Honggang Qi, Yuezun Li, Siwei Lyu
  • for: This work aims to improve multi-face forgery detection.
  • methods: The method is based on contrastive learning at two granularities: coarse-grained contrastive learning over proposals and fine-grained, pixel-level contrastive learning.
  • results: The proposed Contrastive Multi-FaceForensics outperforms other methods on the OpenForensics dataset by about 18.5%.
    Abstract DeepFakes have raised serious societal concerns, leading to a great surge in detection-based forensics methods in recent years. Face forgery recognition is the conventional detection method that usually follows a two-phase pipeline: it extracts the face first and then determines its authenticity by classification. Since DeepFakes in the wild usually contain multiple faces, using face forgery detection methods is merely practical as they have to process faces in a sequel, i.e., only one face is processed at the same time. One straightforward way to address this issue is to integrate face extraction and forgery detection in an end-to-end fashion by adapting advanced object detection architectures. However, as these object detection architectures are designed to capture the semantic information of different object categories rather than the subtle forgery traces among the faces, the direct adaptation is far from optimal. In this paper, we describe a new end-to-end framework, Contrastive Multi-FaceForensics (COMICS), to enhance multi-face forgery detection. The core of the proposed framework is a novel bi-grained contrastive learning approach that explores effective face forgery traces at both the coarse- and fine-grained levels. Specifically, the coarse-grained level contrastive learning captures the discriminative features among positive and negative proposal pairs in multiple scales with the instruction of the proposal generator, and the fine-grained level contrastive learning captures the pixel-wise discrepancy between the forged and original areas of the same face and the pixel-wise content inconsistency between different faces. Extensive experiments on the OpenForensics dataset demonstrate our method outperforms other counterparts by a large margin (~18.5%) and shows great potential for integration into various architectures.
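As a stand-in for the coarse-grained contrastive term, the sketch below computes a standard InfoNCE-style loss over proposal embeddings; the paper's actual pairing strategy and multi-scale details may differ.

```python
# InfoNCE-style contrastive loss: pull an anchor proposal toward its positive
# pair and push it away from negatives (illustrative, not the paper's exact loss).
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    # anchor, positive: (D,); negatives: (N, D)
    a = F.normalize(anchor, dim=-1)
    candidates = F.normalize(torch.cat([positive[None], negatives]), dim=-1)
    logits = candidates @ a / temperature          # (1 + N,), positive first
    return F.cross_entropy(logits[None], torch.zeros(1, dtype=torch.long))

loss = info_nce(torch.randn(128), torch.randn(128), torch.randn(16, 128))
print(loss.item())
```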

Circumventing Concept Erasure Methods For Text-to-Image Generative Models

  • paper_url: http://arxiv.org/abs/2308.01508
  • repo_url: https://github.com/nyu-dice-lab/circumventing-concept-erasure
  • paper_authors: Minh Pham, Kelly O. Marshall, Chinmay Hegde
  • for: This study examines five recently proposed concept-erasure methods for text-to-image generative models to determine whether they fully remove the targeted concepts.
  • methods: The five post hoc concept-erasure methods are probed by learning special word embeddings that query the sanitized models, without altering their weights.
  • results: None of the methods fully excises the targeted concepts: the learned embeddings can retrieve "erased" concepts from the sanitized models. These results highlight the brittleness of post hoc concept erasure and call its use in the algorithmic toolkit for AI safety into question.
    Abstract Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.

TSMD: A Database for Static Color Mesh Quality Assessment Study

  • paper_url: http://arxiv.org/abs/2308.01940
  • repo_url: None
  • paper_authors: Qi Yang, Joel Jung, Haiqiang Wang, Xiaozhong Xu, Shan Liu
  • for: This paper is written for the study of static mesh compression algorithms and objective quality metrics.
  • methods: The paper uses a large-scale, crowdsourcing-based, subjective experiment to collect subjective scores from 74 viewers, and analyzes the dataset to validate its sample diversity and Mean Opinion Scores (MOS) accuracy.
  • results: The paper reports Pearson and Spearman correlations around 0.75, demonstrating the need for further development of more robust metrics.
    Abstract Static meshes with texture map are widely used in modern industrial and manufacturing sectors, attracting considerable attention in the mesh compression community due to its huge amount of data. To facilitate the study of static mesh compression algorithm and objective quality metric, we create the Tencent - Static Mesh Dataset (TSMD) containing 42 reference meshes with rich visual characteristics. 210 distorted samples are generated by the lossy compression scheme developed for the Call for Proposals on polygonal static mesh coding, released on June 23 by the Alliance for Open Media Volumetric Visual Media group. Using processed video sequences, a large-scale, crowdsourcing-based, subjective experiment was conducted to collect subjective scores from 74 viewers. The dataset undergoes analysis to validate its sample diversity and Mean Opinion Scores (MOS) accuracy, establishing its heterogeneous nature and reliability. State-of-the-art objective metrics are evaluated on the new dataset. Pearson and Spearman correlations around 0.75 are reported, deviating from results typically observed on less heterogeneous datasets, demonstrating the need for further development of more robust metrics. The TSMD, including meshes, PVSs, bitstreams, and MOS, is made publicly available at the following location: https://multimedia.tencent.com/resources/tsmd.
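The reported numbers are the usual Pearson (PLCC) and Spearman (SROCC) correlations between an objective metric and the MOS; the snippet below shows how such correlations are computed, with random placeholder scores rather than TSMD data.

```python
# Correlating an objective quality metric with Mean Opinion Scores (MOS).
import numpy as np
from scipy import stats

mos = np.random.uniform(1, 5, size=210)          # subjective MOS (placeholder)
metric = mos + np.random.normal(0, 0.8, 210)     # hypothetical objective scores

pearson, _ = stats.pearsonr(metric, mos)
spearman, _ = stats.spearmanr(metric, mos)
print(f"PLCC={pearson:.3f}  SROCC={spearman:.3f}")
```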

TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations

  • paper_url: http://arxiv.org/abs/2308.01499
  • repo_url: None
  • paper_authors: Qi Yang, Joel Jung, Timon Deschamps, Xiaozhong Xu, Shan Liu
  • for: The goal is to develop objective quality metrics for dynamic colored meshes (DCM) and to study how typical distortions affect their perception.
  • methods: The Tencent - Dynamic Colored Mesh Database (TDMD) is built from eight reference DCM objects under six typical distortions. Processed video sequences (PVS) derived from the DCMs are used in a large-scale subjective experiment, yielding mean opinion scores for 303 distorted DCM samples.
  • results: The database enables studying how different distortion types affect human perception and provides recommendations for DCM compression and related tasks. Three kinds of state-of-the-art objective metrics (image-based, point-based, and video-based) are evaluated on TDMD; the results highlight the strengths and weaknesses of each metric and suggest how to select metrics in practical DCM applications. TDMD will be made publicly available at https://multimedia.tencent.com/resources/tdmd.
    Abstract Dynamic colored meshes (DCM) are widely used in various applications; however, these meshes may undergo different processes, such as compression or transmission, which can distort them and degrade their quality. To facilitate the development of objective metrics for DCMs and study the influence of typical distortions on their perception, we create the Tencent - dynamic colored mesh database (TDMD) containing eight reference DCM objects with six typical distortions. Using processed video sequences (PVS) derived from the DCM, we have conducted a large-scale subjective experiment that resulted in 303 distorted DCM samples with mean opinion scores, making the TDMD the largest available DCM database to our knowledge. This database enabled us to study the impact of different types of distortion on human perception and offer recommendations for DCM compression and related tasks. Additionally, we have evaluated three types of state-of-the-art objective metrics on the TDMD, including image-based, point-based, and video-based metrics, on the TDMD. Our experimental results highlight the strengths and weaknesses of each metric, and we provide suggestions about the selection of metrics in practical DCM applications. The TDMD will be made publicly available at the following location: https://multimedia.tencent.com/resources/tdmd.

Efficient neural supersampling on a novel gaming dataset

  • paper_url: http://arxiv.org/abs/2308.01483
  • repo_url: None
  • paper_authors: Antoine Mercier, Ruan Erasmus, Yashesh Savani, Manik Dhingra, Fatih Porikli, Guillaume Berger
  • for: Improve real-time rendering of game content, which increasingly demands higher resolutions, framerates, and photorealism.
  • methods: A neural supersampling algorithm for rendered content that is four times more efficient than existing methods while maintaining the same level of accuracy.
  • results: The paper introduces a new dataset providing auxiliary modalities such as motion vectors and depth, generated using rendering features like viewport jittering and mipmap biasing at different resolutions. The authors believe this dataset fills a gap in the current dataset landscape and can serve as a valuable resource for measuring progress in super-resolution techniques for gaming content.
    Abstract Real-time rendering for video games has become increasingly challenging due to the need for higher resolutions, framerates and photorealism. Supersampling has emerged as an effective solution to address this challenge. Our work introduces a novel neural algorithm for supersampling rendered content that is 4 times more efficient than existing methods while maintaining the same level of accuracy. Additionally, we introduce a new dataset which provides auxiliary modalities such as motion vectors and depth generated using graphics rendering features like viewport jittering and mipmap biasing at different resolutions. We believe that this dataset fills a gap in the current dataset landscape and can serve as a valuable resource to help measure progress in the field and advance the state-of-the-art in super-resolution techniques for gaming content.

HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions

  • paper_url: http://arxiv.org/abs/2308.01477
  • repo_url: None
  • paper_authors: Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Tremblay, Stephen Tyree, Jeffrey Smith, Stan Birchfield
  • for: This paper provides a dataset for category-level object pose estimation and affordance prediction, focused on robot-manipulable objects of suitable size and shape for functional grasping, such as pliers, utensils, and screwdrivers.
  • methods: A streamlined annotation pipeline uses a single off-the-shelf camera and semi-automated processing to produce high-quality 3D annotations without crowd-sourcing.
  • results: The dataset contains 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories, along with 3D reconstructed meshes; the paper describes its usefulness for 6-DoF category-level pose+scale estimation and the remaining bottlenecks for collecting datasets like it.
    Abstract We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline some of the bottlenecks to be addressed for democratizing the collection of datasets like this one.

DLSIA: Deep Learning for Scientific Image Analysis

  • paper_url: http://arxiv.org/abs/2308.02559
  • repo_url: None
  • paper_authors: Eric J Roberts, Tanny Chavez, Alexander Hexemer, Petrus H. Zwart
  • for: Scientific image analysis.
  • methods: A Python library providing customizable convolutional neural network (CNN) architectures, including autoencoders, tunable U-Nets, parameter-lean mixed-scale dense networks (MSDNets), and sparse mixed-scale networks (SMSNets).
  • results: DLSIA offers accessible CNN construction and abstracts away CNN complexity, helping scientists tailor their machine learning approaches, accelerate discovery, foster interdisciplinary collaboration, and advance research in scientific image analysis.
    Abstract We introduce DLSIA (Deep Learning for Scientific Image Analysis), a Python-based machine learning library that empowers scientists and researchers across diverse scientific domains with a range of customizable convolutional neural network (CNN) architectures for a wide variety of tasks in image analysis to be used in downstream data processing, or for experiment-in-the-loop computing scenarios. DLSIA features easy-to-use architectures such as autoencoders, tunable U-Nets, and parameter-lean mixed-scale dense networks (MSDNets). Additionally, we introduce sparse mixed-scale networks (SMSNets), generated using random graphs and sparse connections. As experimental data continues to grow in scale and complexity, DLSIA provides accessible CNN construction and abstracts CNN complexities, allowing scientists to tailor their machine learning approaches, accelerate discoveries, foster interdisciplinary collaboration, and advance research in scientific image analysis.

COVID-VR: A Deep Learning COVID-19 Classification Model Using Volume-Rendered Computer Tomography

  • paper_url: http://arxiv.org/abs/2308.01433
  • repo_url: None
  • paper_authors: Noemi Maritza L. Romero, Ricco Vasconcellos, Mariana R. Mendoza, João L. D. Comba
  • for: This work develops a pulmonary disease classification method based on volume-rendered (VR) CT images captured from multiple angles, providing a comprehensive view of the entire lung and improving recognition accuracy.
  • methods: A deep learning model takes volume-rendered views of CT scans as input and classifies pulmonary conditions.
  • results: Compared with slice-based approaches, evaluated on both private data from partner hospitals and a public dataset, the method effectively identifies pulmonary lesions and performs competitively.
    Abstract The COVID-19 pandemic presented numerous challenges to healthcare systems worldwide. Given that lung infections are prevalent among COVID-19 patients, chest Computer Tomography (CT) scans have frequently been utilized as an alternative method for identifying COVID-19 conditions and various other types of pulmonary diseases. Deep learning architectures have emerged to automate the identification of pulmonary disease types by leveraging CT scan slices as inputs for classification models. This paper introduces COVID-VR, a novel approach for classifying pulmonary diseases based on volume rendering images of the lungs captured from multiple angles, thereby providing a comprehensive view of the entire lung in each image. To assess the effectiveness of our proposal, we compared it against competing strategies utilizing both private data obtained from partner hospitals and a publicly available dataset. The results demonstrate that our approach effectively identifies pulmonary lesions and performs competitively when compared to slice-based methods.

LiDAR View Synthesis for Robust Vehicle Navigation Without Expert Labels

  • paper_url: http://arxiv.org/abs/2308.01424
  • repo_url: https://github.com/jonathsch/lidar-synthesis
  • paper_authors: Jonathan Schmidt, Qadeer Khan, Daniel Cremers
  • for: This work generates additional LiDAR training data so that vehicles can navigate public roads safely and robustly.
  • methods: Additional LiDAR point clouds are synthesized from novel viewpoints via mesh reconstruction and ray casting, without physically driving to dangerous positions. A deep learning model takes a LiDAR scan as input and predicts the future trajectory, and a waypoint controller converts the predicted trajectory into throttle and steering labels for the ego-vehicle; labels are inferred from LiDAR odometry rather than expert driving annotations.
  • results: A comprehensive online evaluation and a comparison with concurrent work demonstrate the effectiveness of the approach, particularly in terms of model robustness. Project page: https://jonathsch.github.io/lidar-synthesis/
    Abstract Deep learning models for self-driving cars require a diverse training dataset to manage critical driving scenarios on public roads safely. This includes having data from divergent trajectories, such as the oncoming traffic lane or sidewalks. Such data would be too dangerous to collect in the real world. Data augmentation approaches have been proposed to tackle this issue using RGB images. However, solutions based on LiDAR sensors are scarce. Therefore, we propose synthesizing additional LiDAR point clouds from novel viewpoints without physically driving at dangerous positions. The LiDAR view synthesis is done using mesh reconstruction and ray casting. We train a deep learning model, which takes a LiDAR scan as input and predicts the future trajectory as output. A waypoint controller is then applied to this predicted trajectory to determine the throttle and steering labels of the ego-vehicle. Our method neither requires expert driving labels for the original nor the synthesized LiDAR sequence. Instead, we infer labels from LiDAR odometry. We demonstrate the effectiveness of our approach in a comprehensive online evaluation and with a comparison to concurrent work. Our results show the importance of synthesizing additional LiDAR point clouds, particularly in terms of model robustness. Project page: https://jonathsch.github.io/lidar-synthesis/

Harder synthetic anomalies to improve OoD detection in Medical Images

  • paper_url: http://arxiv.org/abs/2308.01412
  • repo_url: https://github.com/snavalm/mood22
  • paper_authors: Sergio Naval Marimont, Giacomo Tarroni
  • for: This work aims to improve the generalization of medical image segmentation networks so that they remain accurate on unseen types of anomalies.
  • methods: Building on the Synthetic Local Anomaly (SLA) approach of the 2020 MOOD challenge winners, the synthetic anomaly generation process is made more heterogeneous and challenging by using random shapes instead of squares and by smoothing the interpolation edges of the anomalies; a sketch of both changes follows the abstract below.
  • results: The method achieved first place in both the sample-wise and pixel-wise tasks of the 2022 Medical Out-of-Distribution (MOOD) challenge at MICCAI, demonstrating its effectiveness.
    Abstract Our method builds upon previous Medical Out-of-Distribution (MOOD) challenge winners that empirically show that synthetic local anomalies generated copying / interpolating foreign patches are useful to train segmentation networks able to generalize to unseen types of anomalies. In terms of the synthetic anomaly generation process, our contributions makes synthetic anomalies more heterogeneous and challenging by 1) using random shapes instead of squares and 2) smoothing the interpolation edge of anomalies so networks cannot rely on the high gradient between image - foreign patch to identify anomalies. Our experiments using the validation set of 2020 MOOD winners show that both contributions improved substantially the method performance. We used a standard 3D U-Net architecture as segmentation network, trained patch-wise in both brain and abdominal datasets. Our final challenge submission consisted of 10 U-Nets trained across 5 data folds with different configurations of the anomaly generation process. Our method achieved first position in both sample-wise and pixel-wise tasks in the 2022 edition of the Medical Out-of-Distribution held at MICCAI.
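A hedged sketch of the two modifications described above, random (non-square) anomaly shapes and a smoothed interpolation edge, using thresholded smoothed noise as the mask; the exact shape distribution and blending parameters are assumptions.

```python
# Synthetic local anomaly with a random blobby mask and a soft blending edge,
# so the network cannot rely on a sharp image/patch gradient (2D illustration).
import numpy as np
from scipy.ndimage import gaussian_filter

def add_synthetic_anomaly(image, foreign_patch, sigma=4.0, rng=np.random):
    h, w = image.shape
    # Random blobby mask: threshold smoothed noise instead of pasting a square.
    noise = gaussian_filter(rng.rand(h, w), sigma=8.0)
    mask = (noise > np.percentile(noise, 97)).astype(float)
    # Smooth the mask edge so the blend has no sharp boundary.
    alpha = gaussian_filter(mask, sigma=sigma)
    alpha /= alpha.max() + 1e-8
    return (1 - alpha) * image + alpha * foreign_patch, alpha

img, patch = np.random.rand(128, 128), np.random.rand(128, 128)
corrupted, label = add_synthetic_anomaly(img, patch)
```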

Follow the Soldiers with Optimized Single-Shot Multibox Detection and Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2308.01389
  • repo_url: None
  • paper_authors: Jumman Hossain, Maliha Momtaz
  • for: Build an autonomous system that follows a specific person (here, a soldier) as they move in any direction.
  • methods: The system uses a DeepRacer platform with an optimized Single-Shot Multibox Detection (SSD) object detector and a reinforcement learning model.
  • results: SSD Lite gives the best performance among the compared variants, with a roughly 2-3x boost in inference speed without compromising accuracy.
    Abstract Nowadays, autonomous cars are gaining traction due to their numerous potential applications on battlefields and in resolving a variety of other real-world challenges. The main goal of our project is to build an autonomous system using DeepRacer which will follow a specific person (for our project, a soldier) when they will be moving in any direction. Two main components to accomplish this project is an optimized Single-Shot Multibox Detection (SSD) object detection model and a Reinforcement Learning (RL) model. We accomplished the task using SSD Lite instead of SSD and at the end, compared the results among SSD, SSD with Neural Computing Stick (NCS), and SSD Lite. Experimental results show that SSD Lite gives better performance among these three techniques and exhibits a considerable boost in inference speed (~2-3 times) without compromising accuracy.

Computational Long Exposure Mobile Photography

  • paper_url: http://arxiv.org/abs/2308.01379
  • repo_url: None
  • paper_authors: Eric Tabellion, Nikhil Karnad, Noa Glaser, Ben Weiss, David E. Jacobs, Yael Pritch
  • for: This paper describes a computational burst photography system that produces long-exposure effects with motion blur.
  • methods: The system detects and segments the salient subject, tracks scene motion across multiple frames, aligns the images, predicts inter-frame motion and synthesizes motion blur to fill temporal gaps, and finally composites the blurred result with a sharp regular exposure.
  • results: The system runs fully automatically in a hand-held smartphone camera app at the tap of the shutter button, producing high-resolution, high-dynamic-range (HDR) photographs with motion blur while preserving the sharpness of faces and barely moving regions.
    Abstract Long exposure photography produces stunning imagery, representing moving elements in a scene with motion-blur. It is generally employed in two modalities, producing either a foreground or a background blur effect. Foreground blur images are traditionally captured on a tripod-mounted camera and portray blurred moving foreground elements, such as silky water or light trails, over a perfectly sharp background landscape. Background blur images, also called panning photography, are captured while the camera is tracking a moving subject, to produce an image of a sharp subject over a background blurred by relative motion. Both techniques are notoriously challenging and require additional equipment and advanced skills. In this paper, we describe a computational burst photography system that operates in a hand-held smartphone camera app, and achieves these effects fully automatically, at the tap of the shutter button. Our approach first detects and segments the salient subject. We track the scene motion over multiple frames and align the images in order to preserve desired sharpness and to produce aesthetically pleasing motion streaks. We capture an under-exposed burst and select the subset of input frames that will produce blur trails of controlled length, regardless of scene or camera motion velocity. We predict inter-frame motion and synthesize motion-blur to fill the temporal gaps between the input frames. Finally, we composite the blurred image with the sharp regular exposure to protect the sharpness of faces or areas of the scene that are barely moving, and produce a final high resolution and high dynamic range (HDR) photograph. Our system democratizes a capability previously reserved to professionals, and makes this creative style accessible to most casual photographers. More information and supplementary material can be found on our project webpage: https://motion-mode.github.io/

ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders

  • paper_url: http://arxiv.org/abs/2308.01317
  • repo_url: None
  • paper_authors: Shawn Xu, Lin Yang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, Attila Kiraly, Sahar Kazemzadeh, Zakkai Melamed, Jungyeon Park, Patricia Strachan, Yun Liu, Chuck Lau, Preeti Singh, Christina Chen, Mozziyar Etemadi, Sreenivasa Raju Kalidindi, Yossi Matias, Katherine Chou, Greg S. Corrado, Shravya Shetty, Daniel Tse, Shruthi Prabhakara, Daniel Golden, Rory Pilgrim, Krish Eswaran, Andrew Sellergren
  • for: This work aims at a general-purpose X-ray AI system, ELIXR (Embeddings for Language/Image-aligned X-Rays), for chest X-ray (CXR) classification, data-efficient classification, semantic search, and related tasks.
  • methods: A language-aligned image encoder is combined with, or grafted onto, a fixed large language model, PaLM 2, in a lightweight adapter architecture trained on images paired with free-text radiology reports from the MIMIC-CXR dataset.
  • results: ELIXR achieves state-of-the-art zero-shot CXR classification (mean AUC of 0.850 across 13 findings), strong data-efficient classification (mean AUCs of 0.893 and 0.898 across five findings with 1% (~2,200 images) and 10% (~22,000 images) of the training data), and semantic search with 0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them. It also shows promise on CXR vision-language tasks, with overall accuracies of 58.7% on visual question answering and 62.5% on report quality assurance.
    Abstract Our approach, which we call Embeddings for Language/Image-aligned X-Rays, or ELIXR, leverages a language-aligned image encoder combined or grafted onto a fixed LLM, PaLM 2, to perform a broad range of tasks. We train this lightweight adapter architecture using images paired with corresponding free-text radiology reports from the MIMIC-CXR dataset. ELIXR achieved state-of-the-art performance on zero-shot chest X-ray (CXR) classification (mean AUC of 0.850 across 13 findings), data-efficient CXR classification (mean AUCs of 0.893 and 0.898 across five findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) for 1% (~2,200 images) and 10% (~22,000 images) training data), and semantic search (0.76 normalized discounted cumulative gain (NDCG) across nineteen queries, including perfect retrieval on twelve of them). Compared to existing data-efficient methods including supervised contrastive learning (SupCon), ELIXR required two orders of magnitude less data to reach similar performance. ELIXR also showed promise on CXR vision-language tasks, demonstrating overall accuracies of 58.7% and 62.5% on visual question answering and report quality assurance tasks, respectively. These results suggest that ELIXR is a robust and versatile approach to CXR AI.

Patched Denoising Diffusion Models For High-Resolution Image Synthesis

  • paper_url: http://arxiv.org/abs/2308.01316
  • repo_url: None
  • paper_authors: Zheng Ding, Mengqi Zhang, Jiajun Wu, Zhuowen Tu
  • for: Generate high-resolution images (e.g., 1024×512).
  • methods: A denoising diffusion model is trained on small image patches (e.g., 64×64) and uses a new feature collage strategy to avoid boundary artifacts when synthesizing large images.
  • results: Patch-DM produces high-quality results on a newly collected dataset of 1024×512 nature images and on standard 256×256 benchmarks (LSUN-Bedroom, LSUN-Church, FFHQ), achieving state-of-the-art FID scores on all four datasets while reducing memory complexity compared with classic diffusion models.
    Abstract We propose an effective denoising diffusion model for generating high-resolution images (e.g., 1024$\times$512), trained on small-size image patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new feature collage strategy is designed to avoid the boundary artifact when synthesizing large-size images. Feature collage systematically crops and combines partial features of the neighboring patches to predict the features of a shifted image patch, allowing the seamless generation of the entire image due to the overlap in the patch feature space. Patch-DM produces high-quality image synthesis results on our newly collected dataset of nature images (1024$\times$512), as well as on standard benchmarks of smaller sizes (256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our method with previous patch-based generation methods and achieve state-of-the-art FID scores on all four datasets. Further, Patch-DM also reduces memory complexity compared to the classic diffusion models.

Revisiting DETR Pre-training for Object Detection

  • paper_url: http://arxiv.org/abs/2308.01300
  • repo_url: None
  • paper_authors: Yan Ma, Weicong Liang, Yiduo Hao, Bohan Chen, Xiangyu Yue, Chao Zhang, Yuhui Yuan
  • for: This study investigates whether self-supervised pre-training of the Transformer (with the backbone frozen) still improves DETR-based detectors, without changing their underlying architecture.
  • methods: Thorough experiments on COCO object detection examine the influence of the choice of pre-training datasets and of the localization and classification target generation schemes, covering representative self-supervised approaches such as DETReg; synthetic pre-training datasets are also generated by combining image-to-text captioning (LLaVA) and text-to-image generation (SDXL) models.
  • results: DETReg fails to boost strong DETR-based detectors such as H-Deformable-DETR in the full-data regime, but combining a more accurate box predictor with the Objects365 benchmark significantly improves results, reaching an AP of 59.3% on the COCO val set and surpassing H-Deformable-DETR + Swin-L by 1.4%; pre-training on the synthetic datasets also yields notable gains in detection performance.
    Abstract Motivated by that DETR-based approaches have established new records on COCO detection and segmentation benchmarks, many recent endeavors show increasing interest in how to further improve DETR-based approaches by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies already claimed significant improvements in accuracy. In this paper, we take a closer look at their experimental methodology and check if their approaches are still effective on the very recent state-of-the-art such as $\mathcal{H}$-Deformable-DETR. We conduct thorough experiments on COCO object detection tasks to study the influence of the choice of pre-training datasets, localization, and classification target generation schemes. Unfortunately, we find the previous representative self-supervised approach such as DETReg, fails to boost the performance of the strong DETR-based approaches on full data regimes. We further analyze the reasons and find that simply combining a more accurate box predictor and Objects$365$ benchmark can significantly improve the results in follow-up experiments. We demonstrate the effectiveness of our approach by achieving strong object detection results of AP=$59.3\%$ on COCO val set, which surpasses $\mathcal{H}$-Deformable-DETR + Swin-L by +$1.4\%$. Last, we generate a series of synthetic pre-training datasets by combining the very recent image-to-text captioning models (LLaVA) and text-to-image generative models (SDXL). Notably, pre-training on these synthetic datasets leads to notable improvements in object detection performance. Looking ahead, we anticipate substantial advantages through the future expansion of the synthetic pre-training dataset.
    摘要 基于DETR的方法在COCO检测和 segmentation bencmarks 上设置新的纪录,许多最近的尝试表示越来越关注如何进一步提高DETR基于的方法,而不是固定背景。一些研究已经提出了显著改进的精度。在这篇文章中,我们坚持更加仔细地检查这些实验方法,并查看它们是否在最新的state-of-the-art 中如 $\mathcal{H}$-Deformable-DETR 中保持有效。我们在COCO对象检测任务中进行了系统的实验,以研究预训练数据集的选择、本地化和分类目标生成方案的影响。不幸地,我们发现以前的代表性自我超vised 方法DETReg,在全数据场景下不能提高强大 DE TR-based 方法的性能。我们进一步分析了原因,并发现可以通过结合更高精度的包Predictor和Objects$365$ benchmark来显著提高结果。我们证明了我们的方法的效果,通过在COCO验证集上达到 AP = $59.3\%$ 的强大对象检测结果,超过 $\mathcal{H}$-Deformable-DETR + Swin-L 的 + $1.4\%$。最后,我们生成了一系列的Synthetic pre-training datasets,通过结合最近的图文描述模型(LLaVA)和文本到图生成模型(SDXL)。不凡地,预训练在这些Synthetic datasets上显著提高了对象检测性能。looking ahead,我们预计将来的扩展将带来重要的优势。
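The synthetic pre-training recipe mentioned at the end of the abstract (caption real images with LLaVA, then regenerate images from those captions with SDXL) could look roughly like the sketch below. The captioning step is a stand-in function, and the SDXL call assumes the public Hugging Face diffusers checkpoint named in the code; none of this is taken from the paper's actual pipeline.

```python
# Hedged sketch: real images -> captions (stand-in for LLaVA) -> synthetic images (SDXL).
import os
import torch
from diffusers import StableDiffusionXLPipeline

def caption_image(image_path: str) -> str:
    # Stand-in for an image-to-text model; the paper uses LLaVA for this step.
    return "a photo of everyday objects on a city street"

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def synthesize(image_paths, out_dir="synthetic_pretrain"):
    os.makedirs(out_dir, exist_ok=True)
    for i, path in enumerate(image_paths):
        prompt = caption_image(path)              # real image -> text description
        image = pipe(prompt=prompt).images[0]     # text -> synthetic training image
        image.save(os.path.join(out_dir, f"{i:06d}.png"))
```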

A vision transformer-based framework for knowledge transfer from multi-modal to mono-modal lymphoma subtyping models

  • paper_url: http://arxiv.org/abs/2308.01328
  • repo_url: None
  • paper_authors: Bilel Guetarni, Feryal Windal, Halim Benhabiles, Marianne Petit, Romain Dubois, Emmanuelle Leteurtre, Dominique Collard
  • for: Proposing a vision transformer-based framework for classifying Diffuse Large B-Cell Lymphoma (DLBCL) cancer subtypes from high-resolution whole slide images.
  • methods: A multi-modal architecture is trained as a classifier over several whole-slide-image modalities, and this model is then exploited through a knowledge distillation mechanism to efficiently drive the learning of a mono-modal classifier.
  • results: In an experimental study on 157 patients, the mono-modal classifier outperforms six recent state-of-the-art cancer classification methods; a power-law curve estimated on the experimental data suggests that a reasonable number of additional training patients could bring the model to the diagnostic accuracy of IHC technologies.
    Abstract Determining lymphoma subtypes is a crucial step for better patients treatment targeting to potentially increase their survival chances. In this context, the existing gold standard diagnosis method, which is based on gene expression technology, is highly expensive and time-consuming making difficult its accessibility. Although alternative diagnosis methods based on IHC (immunohistochemistry) technologies exist (recommended by the WHO), they still suffer from similar limitations and are less accurate. WSI (Whole Slide Image) analysis by deep learning models showed promising new directions for cancer diagnosis that would be cheaper and faster than existing alternative methods. In this work, we propose a vision transformer-based framework for distinguishing DLBCL (Diffuse Large B-Cell Lymphoma) cancer subtypes from high-resolution WSIs. To this end, we propose a multi-modal architecture to train a classifier model from various WSI modalities. We then exploit this model through a knowledge distillation mechanism for efficiently driving the learning of a mono-modal classifier. Our experimental study conducted on a dataset of 157 patients shows the promising performance of our mono-modal classification model, outperforming six recent methods from the state-of-the-art dedicated for cancer classification. Moreover, the power-law curve, estimated on our experimental data, shows that our classification model requires a reasonable number of additional patients for its training to potentially reach identical diagnosis accuracy as IHC technologies.
    摘要 确定淋巴癌 subclass 是诊断患者治疗的关键步骤,以提高生存可能性。然而,现有的黄金标准诊断方法,基于基因表达技术,是非常昂贵和时间consuming,使其Difficult to access。尽管现有基于 IHC(免疫抗体技术)的诊断方法存在,但它们仍然受到限制,并且精度较低。WSI(整个板块图像)分析by deep learning模型显示了新的方向 для肿瘤诊断,这将比现有的alternative方法更便宜和更快。在这项工作中,我们提出了基于视Transformer的框架,用于从高分辨率 WSI 中分类Diffuse Large B-Cell Lymphoma(淋巴癌)亚型。为此,我们提出了一种多modal architecture,用于训练一个分类模型。然后,我们利用知识储存机制,将这个模型转化为一个简单的单modal分类器。我们的实验研究,在一个包含 157 名病人的数据集上进行,显示了我们的单modal分类模型在诊断性能方面的优秀表现,比六个最新的state-of-the-art肿瘤分类方法更高。此外,我们在实验数据上计算的力量律曲线,表明我们的分类模型需要一个合理的数量的更多病人来进行训练,以达到与 IHC 技术相同的诊断精度。
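The knowledge-distillation mechanism mentioned above, a mono-modal student guided by a multi-modal teacher, commonly takes the form sketched below. The temperature, weighting, and loss composition are generic assumptions for illustration, not the paper's reported settings.

```python
# Generic distillation loss: the student matches the teacher's softened predictions
# while still being supervised by the subtype labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: standard cross-entropy against the subtype labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```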

Incorporating Season and Solar Specificity into Renderings made by a NeRF Architecture using Satellite Images

  • paper_url: http://arxiv.org/abs/2308.01262
  • repo_url: https://github.com/enterprisecv-6/season-nerf
  • paper_authors: Michael Gableman, Avinash Kak
  • for: A NeRF-based rendering framework, trained on satellite images, that accounts for solar angle and viewing angle when rendering a scene from novel viewpoints.
  • methods: A Neural Radiance Field (NeRF) models the scene's lighting and shadows, with an additional input variable (time of the year) that teaches the network to render seasonal features, plus extra loss terms that discourage the network from using seasonal features to account for shadows.
  • results: Evaluated on eight Areas of Interest imaged by the Maxar WorldView-3 satellite, the framework accurately renders novel views, generates height maps, and predicts shadows, with seasonal features specified independently from shadows; ablation studies justify the network design parameters.
    Abstract As a result of Shadow NeRF and Sat-NeRF, it is possible to take the solar angle into account in a NeRF-based framework for rendering a scene from a novel viewpoint using satellite images for training. Our work extends those contributions and shows how one can make the renderings season-specific. Our main challenge was creating a Neural Radiance Field (NeRF) that could render seasonal features independently of viewing angle and solar angle while still being able to render shadows. We teach our network to render seasonal features by introducing one more input variable -- time of the year. However, the small training datasets typical of satellite imagery can introduce ambiguities in cases where shadows are present in the same location for every image of a particular season. We add additional terms to the loss function to discourage the network from using seasonal features for accounting for shadows. We show the performance of our network on eight Areas of Interest containing images captured by the Maxar WorldView-3 satellite. This evaluation includes tests measuring the ability of our framework to accurately render novel views, generate height maps, predict shadows, and specify seasonal features independently from shadows. Our ablation studies justify the choices made for network design parameters.
    摘要 这是由于阴影NeRF和Sat-NeRF而可以将太阳角度考虑到NeRF基础框架中,以便从不同观点测量场景。我们的工作延伸了这些贡献,并显示了如何使渲染为季节特定。我们的主要挑战是创建一个能够独立地考虑观察角度和太阳角度的Neural Radiance Field(NeRF),并且仍能正确地显示阴影。我们教育我们的网络以时间年份为输入变量,以便在不同季节中显示季节特定的特征。然而,对于具有阴影的几何形状的实际测试数据可能会导致歧义。我们添加了额外的损失函数来防止网络使用季节特定的特征来计算阴影。我们在八个Area of Interest中展示了我们的网络,包括量测系统在不同观点下的渲染新视野、生成高度图、预测阴影和季节特定的特征独立于阴影。我们的ablation研究证明了我们的网络设计选择的正确性。
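A heavily simplified sketch of the architectural idea, conditioning the radiance prediction on solar direction and time of year in addition to position and viewing direction, is given below. Encoding dimensions and layer widths are assumptions; the authors' actual network and its shadow-specific loss terms are not reproduced.

```python
# Illustrative seasonal radiance field: density from position only, color conditioned
# on viewing direction, solar direction, and time of year.
import torch
import torch.nn as nn

class SeasonalNeRF(nn.Module):
    def __init__(self, d_pos=63, d_view=27, d_sun=27, d_time=8, hidden=256):
        super().__init__()
        self.sigma_net = nn.Sequential(            # density + feature from position
            nn.Linear(d_pos, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden + 1),
        )
        self.color_net = nn.Sequential(            # color from feature + view/sun/time
            nn.Linear(hidden + d_view + d_sun + d_time, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, view_enc, sun_enc, time_enc):
        h = self.sigma_net(pos_enc)
        sigma, feat = h[..., :1], h[..., 1:]
        rgb = self.color_net(torch.cat([feat, view_enc, sun_enc, time_enc], dim=-1))
        return sigma, rgb
```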

Learning Spatial Distribution of Long-Term Trackers Scores

  • paper_url: http://arxiv.org/abs/2308.01256
  • repo_url: None
  • paper_authors: Vincenzo Mariano Scarrica, Antonino Staiano
  • for: Improving long-term tracking performance.
  • methods: A fusion strategy that takes an arbitrary number of baseline trackers as input and uses a learning phase to understand how their outcomes correlate, even when no target is present.
  • results: A recall of 0.738 on the LTB-50 dataset when learning from VOT-LT2022, and 0.619 when the two datasets are reversed; both results are strongly competitive with the state of the art.
    Abstract Long-Term tracking is a hot topic in Computer Vision. In this context, competitive models are presented every year, showing a constant growth rate in performances, mainly measured in standardized protocols as Visual Object Tracking (VOT) and Object Tracking Benchmark (OTB). Fusion-trackers strategy has been applied over last few years for overcoming the known re-detection problem, turning out to be an important breakthrough. Following this approach, this work aims to generalize the fusion concept to an arbitrary number of trackers used as baseline trackers in the pipeline, leveraging a learning phase to better understand how outcomes correlate with each other, even when no target is present. A model and data independence conjecture will be evidenced in the manuscript, yielding a recall of 0.738 on LTB-50 dataset when learning from VOT-LT2022, and 0.619 by reversing the two datasets. In both cases, results are strongly competitive with state-of-the-art and recall turns out to be the first on the podium.
    摘要 长期跟踪是计算机视觉领域热点话题。在这个上下文中,每年都有竞争力强的模型被推出,表现得越来越好,主要根据标准化协议进行评估,如视觉 объекtracking(VOT)和物体跟踪benchmark(OTB)。遗传跟踪策略在过去几年得到应用,并被视为重要的突破。基于这种方法,本研究旨在普适化融合概念,使得任意数量的基线跟踪器可以在管道中使用,并通过学习阶段更好地理解不同跟踪器之间的结果相关性,即使target不存在。 manuscript中会证明模型和数据独立性 conjecture,在LTB-50 dataset上取得0.738的回归率,并在反向两个dataset上取得0.619的回归率。在两个情况下,结果强烈竞争与状态机器人,并且回归率处于第一名。
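The abstract does not spell out the fusion model, so the following is a purely hypothetical sketch of one way to learn how baseline trackers' scores combine: a small network maps the per-frame confidence scores of the N baseline trackers to fusion weights.

```python
# Hypothetical score-fusion module; architecture and inputs are assumptions only.
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    def __init__(self, num_trackers: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_trackers, hidden), nn.ReLU(),
            nn.Linear(hidden, num_trackers),
        )

    def forward(self, scores):           # scores: [B, num_trackers] per-frame confidences
        weights = torch.softmax(self.net(scores), dim=-1)
        return weights                   # one weight per baseline tracker

# Usage sketch: the fused box is the weight-averaged box of the baseline trackers,
# and a low maximum weight can signal that no target is present in the frame.
```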

A Hyper-pixel-wise Contrastive Learning Augmented Segmentation Network for Old Landslide Detection Using High-Resolution Remote Sensing Images and Digital Elevation Model Data

  • paper_url: http://arxiv.org/abs/2308.01251
  • repo_url: None
  • paper_authors: Yiming Zhou, Yuexing Peng, Wei Li, Junchuan Yu, Daqing Ge, Wei Xiang
  • for: old landslide detection
  • methods: hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) and global hyper-pixel-wise sample pair queues-based contrastive learning method
  • results: improved reliability of old landslide detection compared to previous models, with increased mIoU, Landslide IoU, and F1-score metrics
    Abstract As a hazardous disaster, landslide often brings tremendous losses to humanity, so it's necessary to achieve reliable detection of landslide. However, the problems of visual blur and small-sized dataset cause great challenges for old landslide detection task when using remote sensing data. To reliably extract semantic features, a hyper-pixel-wise contrastive learning augmented segmentation network (HPCL-Net) is proposed, which augments the local salient feature extraction from the boundaries of landslides through HPCL and fuses the heterogeneous information in the semantic space from High-Resolution Remote Sensing Images and Digital Elevation Model data. For full utilization of the precious samples, a global hyper-pixel-wise sample pair queues-based contrastive learning method, which includes the construction of global queues that store hyper-pixel-wise samples and the updating scheme of a momentum encoder, is developed, reliably enhancing the extraction ability of semantic features. The proposed HPCL-Net is evaluated on a Loess Plateau old landslide dataset and experiment results show that the model greatly improves the reliability of old landslide detection compared to the previous old landslide segmentation model, where mIoU metric is increased from 0.620 to 0.651, Landslide IoU metric is increased from 0.334 to 0.394 and F1-score metric is increased from 0.501 to 0.565.
    摘要 翻译文本作为危险灾害,山崩常会对人类造成巨大的损害,因此需要实现可靠的山崩检测。然而,使用遥感数据时,视觉模糊和小样本集的问题会导致古老山崩检测任务中的巨大挑战。为了可靠地提取semantic特征,我们提议了一种基于hyper-pixel-wise对比学习增强segmentation网络(HPCL-Net),该网络通过在山崩边界的本地精重特征提取方法和高分辨率遥感图像和数字高程模型数据的异化信息进行semantic空间的笔记卷积。为了充分利用珍贵的样本,我们开发了一种全球hyper-pixel-wise对比学习方法,该方法包括建立全球队列,并且在批处理队列中进行快速更新的批处理编码器。实验结果表明,提议的HPCL-Net模型在中国Loess Plateau古老山崩数据集上进行检测比前一代古老山崩分割模型更高度可靠,其mIoU指标从0.620提高到0.651,山崩指标从0.334提高到0.394,F1-score指标从0.501提高到0.565。
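The global queue plus momentum-encoder component described above follows the general pattern of queue-based contrastive learning; a rough sketch is given below. Queue size, feature dimension, momentum, and temperature are illustrative assumptions rather than the paper's settings.

```python
# Queue-based contrastive learning sketch: hyper-pixel features from a momentum (key)
# encoder fill a global negative queue, and an InfoNCE-style loss pulls matching
# query/key features together.
import torch
import torch.nn.functional as F

queue = F.normalize(torch.randn(65536, 128), dim=1)   # global queue of hyper-pixel features

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # EMA update of the key encoder from the query encoder.
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.data.mul_(m).add_(pq.data, alpha=1.0 - m)

def info_nce(q, k, queue, temperature=0.07):
    # q, k: [B, 128] L2-normalized features of matching hyper-pixels from the two encoders.
    l_pos = (q * k).sum(dim=1, keepdim=True)            # [B, 1] positive similarities
    l_neg = q @ queue.t()                                # [B, K] similarities to the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)    # positives sit at index 0
    return F.cross_entropy(logits, labels)
```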

A Hybrid Approach To Real-Time Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2308.01248
  • repo_url: None
  • paper_authors: Vincenzo Mariano Scarrica, Ciro Panariello, Alessio Ferone, Antonino Staiano
  • for: A real-time multi-object tracking method for human-crowd tracking systems that combines deep learning with classical algorithms.
  • methods: Combines a traditional optical flow algorithm with a deep learning architecture to strike a trade-off between tracking precision and computational cost.
  • results: Across different settings, the method reaches a MOTA of 0.608 versus the compared state-of-the-art 0.549, with roughly half the running time once the optical flow phase is introduced, at almost the same accuracy.
    Abstract Multi-Object Tracking, also known as Multi-Target Tracking, is a significant area of computer vision that has many uses in a variety of settings. The development of deep learning, which has encouraged researchers to propose more and more work in this direction, has significantly impacted the scientific advancement around the study of tracking as well as many other domains related to computer vision. In fact, all of the solutions that are currently state-of-the-art in the literature and in the tracking industry, are built on top of deep learning methodologies that produce exceptionally good results. Deep learning is enabled thanks to the ever more powerful technology researchers can use to handle the significant computational resources demanded by these models. However, when real-time is a main requirement, developing a tracking system without being constrained by expensive hardware support with enormous computational resources is necessary to widen tracking applications in real-world contexts. To this end, a compromise is to combine powerful deep strategies with more traditional approaches to favor considerably lower processing solutions at the cost of less accurate tracking results even though suitable for real-time domains. Indeed, the present work goes in that direction, proposing a hybrid strategy for real-time multi-target tracking that combines effectively a classical optical flow algorithm with a deep learning architecture, targeted to a human-crowd tracking system exhibiting a desirable trade-off between performance in tracking precision and computational costs. The developed architecture was experimented with different settings, and yielded a MOTA of 0.608 out of the compared state-of-the-art 0.549 results, and about half the running time when introducing the optical flow phase, achieving almost the same performance in terms of accuracy.
    摘要 多目标跟踪(也称多Target tracking)是计算机视觉领域的一个重要领域,它在各种场景中有很多应用。深度学习的发展,使研究人员们能够更加勇敢地提出更多的工作,对跟踪领域以及其他计算机视觉领域的科学进步产生了深远的影响。实际上,现有literature和industry中的所有state-of-the-art解决方案都基于深度学习方法,其Result exceptionally good。然而,当实时是主要要求时,建立一个不受昂贵硬件支持的跟踪系统是必要的,以拓宽跟踪应用在真实世界中。为此,可以通过结合强大的深度策略和传统方法来达成一个折衔,以提高跟踪精度的同时,降低计算成本。本工作就在这个方向上进行了尝试,提出了一种hybrid策略,将经典的光流算法与深度学习架构相结合,用于人群跟踪系统,实现了精度和计算成本之间的折衔。实验结果显示,与比较state-of-the-art的0.549结果相比,该系统的MOTA得分为0.608,运行时间缩短了约一半。
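One way to realize the described trade-off, running the deep detector sparsely and bridging the gaps with classical optical flow, is sketched below with OpenCV's Lucas-Kanade tracker. This is an illustration of the hybrid idea under assumed details, not the authors' exact pipeline.

```python
# Propagate target positions between (expensive) deep-detector frames using
# sparse Lucas-Kanade optical flow.
import cv2
import numpy as np

def propagate_points(prev_frame, frame, points):
    """points: float32 array [N, 1, 2] of target centers in prev_frame coordinates.
    Returns the successfully tracked centers in the new frame."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    return new_pts[status.ravel() == 1]

# Usage sketch: every N-th frame, refresh `points` from the deep detector's boxes;
# on intermediate frames, call propagate_points(prev_frame, frame, points) instead.
```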

Tirtha – An Automated Platform to Crowdsource Images and Create 3D Models of Heritage Sites

  • paper_url: http://arxiv.org/abs/2308.01246
  • repo_url: https://github.com/smlab-niser/tirtha-public
  • paper_authors: Jyotirmaya Shivottam, Subhankar Mishra
  • for: Digital preservation of Cultural Heritage (CH) sites, protecting them against damage from natural disasters or human activities.
  • methods: A web platform that crowdsources images of CH sites and creates their 3D models using state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques.
  • results: A modular, extensible, and cost-effective platform, demonstrated by creating 3D models of temples in Odisha, India, from crowdsourced images; the models and images are openly available to support research and public engagement.
    Abstract Digital preservation of Cultural Heritage (CH) sites is crucial to protect them against damage from natural disasters or human activities. Creating 3D models of CH sites has become a popular method of digital preservation thanks to advancements in computer vision and photogrammetry. However, the process is time-consuming, expensive, and typically requires specialized equipment and expertise, posing challenges in resource-limited developing countries. Additionally, the lack of an open repository for 3D models hinders research and public engagement with their heritage. To address these issues, we propose Tirtha, a web platform for crowdsourcing images of CH sites and creating their 3D models. Tirtha utilizes state-of-the-art Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques. It is modular, extensible and cost-effective, allowing for the incorporation of new techniques as photogrammetry advances. Tirtha is accessible through a web interface at https://tirtha.niser.ac.in and can be deployed on-premise or in a cloud environment. In our case studies, we demonstrate the pipeline's effectiveness by creating 3D models of temples in Odisha, India, using crowdsourced images. These models are available for viewing, interaction, and download on the Tirtha website. Our work aims to provide a dataset of crowdsourced images and 3D reconstructions for research in computer vision, heritage conservation, and related domains. Overall, Tirtha is a step towards democratizing digital preservation, primarily in resource-limited developing countries.
    摘要 针对文化遗产(CH)场景的数字保存是非常重要,以保护它们免受自然灾害或人类活动的损害。创建CH场景的3D模型已成为数字保存的流行方法,感谢计算机视觉和光学测量的进步。然而,这个过程需要较长的时间,高昂的成本,通常需要专业设备和技能,这会对发展中国家 pose 挑战。此外,缺乏开放的3D模型存储库,限制了研究和公众对遗产的参与。为解决这些问题,我们提出了Tirtha,一个基于网络的平台,用于协同上传CH场景的图像。Tirtha利用当前最佳的结构从动(SfM)和多视图镜像(MVS)技术。它是可扩展的,可cost-effective,可以适应计算机视觉的进步。Tirtha通过Web界面提供,可以在本地部署或云端环境中部署。在我们的案例研究中,我们示例了在奥里萨(India)的寺庐场景中使用拍摄的图像创建3D模型。这些模型通过Tirtha网站上的浏览、互动和下载。我们的工作目标是提供一个由众所共同拍摄的图像和3D重建的数据集,用于计算机视觉、遗产保护和相关领域的研究。总之,Tirtha是一步向数字保存的民主化,特别是在发展中国家。
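The SfM stage that such a platform relies on can be driven by open-source tools; the sketch below uses the COLMAP command-line interface as one common choice. Whether Tirtha itself uses COLMAP is not stated above, so treat this strictly as a generic illustration of the feature extraction, matching, and sparse reconstruction steps.

```python
# Generic SfM sketch using the COLMAP CLI (an assumption; not necessarily Tirtha's stack).
import os
import subprocess

def sparse_reconstruction(image_dir="images", work_dir="work"):
    os.makedirs(os.path.join(work_dir, "sparse"), exist_ok=True)
    db = os.path.join(work_dir, "database.db")
    # 1) Detect local features in every crowdsourced image.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", db, "--image_path", image_dir], check=True)
    # 2) Match features across image pairs.
    subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)
    # 3) Incremental SfM: recover camera poses and a sparse point cloud.
    subprocess.run(["colmap", "mapper",
                    "--database_path", db, "--image_path", image_dir,
                    "--output_path", os.path.join(work_dir, "sparse")], check=True)
    # A dense MVS stage would follow to produce the final 3D model of the site.
```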