cs.CV - 2023-07-18

Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

  • paper_url: http://arxiv.org/abs/2307.09368
  • repo_url: None
  • paper_authors: Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel
  • for: The goal of this paper is an audio-driven talking face generation method that improves both the visual quality of the generated face and audio-visual synchronization.
  • methods: The paper identifies several problems in the synchronization schemes of recent approaches, such as unintended flow of lip and pose information from the reference to the generated image and instabilities during training. To address them, it introduces a silent-lip reference image generator, an adaptive triplet loss, and a stabilized synchronization loss.
  • results: Experiments show state-of-the-art performance on LRS2 and LRW, with clear improvements in both visual quality and audio-visual synchronization; ablation studies confirm the individual contributions of the components and their complementary effects.
    Abstract Audio-driven talking face generation is the task of creating a lip-synchronized, realistic face video from given audio and reference frames. This involves two major challenges: overall visual quality of generated images on the one hand, and audio-visual synchronization of the mouth part on the other hand. In this paper, we start by identifying several problematic aspects of synchronization methods in recent audio-driven talking face generation approaches. Specifically, this involves unintended flow of lip and pose information from the reference to the generated image, as well as instabilities during model training. Subsequently, we propose various techniques for obviating these issues: First, a silent-lip reference image generator prevents leaking of lips from the reference to the generated image. Second, an adaptive triplet loss handles the pose leaking problem. Finally, we propose a stabilized formulation of synchronization loss, circumventing aforementioned training instabilities while additionally further alleviating the lip leaking issue. Combining the individual improvements, we present state-of-the art performance on LRS2 and LRW in both synchronization and visual quality. We further validate our design in various ablation experiments, confirming the individual contributions as well as their complementary effects.
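The adaptive triplet loss and the stabilized synchronization loss are described only at a high level in the abstract. Below is a minimal sketch of a margin-based triplet loss on audio/lip embeddings with a simple adaptive margin; the embedding names and the adaptation rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_triplet_loss(anchor, positive, negative, base_margin=0.2, scale=0.5):
    """Triplet loss whose margin grows when the negative is already close to the
    anchor (illustrative adaptation rule, not the paper's exact one)."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)   # should be small
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)   # should be large
    margin = base_margin + scale * torch.clamp(base_margin - d_neg, min=0.0)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# toy usage with hypothetical embedding names
audio_emb   = torch.randn(8, 512)   # anchor: audio features
lip_emb_pos = torch.randn(8, 512)   # positive: lips from the same time step
lip_emb_neg = torch.randn(8, 512)   # negative: lips leaked from the reference frame
print(adaptive_triplet_loss(audio_emb, lip_emb_pos, lip_emb_neg).item())
```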

An Evaluation of Zero-Cost Proxies – from Neural Architecture Performance to Model Robustness

  • paper_url: http://arxiv.org/abs/2307.09365
  • repo_url: None
  • paper_authors: Jovita Lukasik, Michael Moeller, Margret Keuper
  • for: Studies the uses and capabilities of zero-cost proxies, in particular for neural architecture search that targets well-performing and robust architectures.
  • methods: Uses zero-cost proxies as performance predictors, both for the single task of predicting robustness and for the joint multi-objective of clean and robust accuracy.
  • results: Analyzes the predictive ability of common zero-cost proxies and finds that predicting model robustness is considerably harder, so several proxies have to be combined to predict it.
    Abstract Zero-cost proxies are nowadays frequently studied and used to search for neural architectures. They show an impressive ability to predict the performance of architectures by making use of their untrained weights. These techniques allow for immense search speed-ups. So far the joint search for well-performing and robust architectures has received much less attention in the field of NAS. Therefore, the main focus of zero-cost proxies is the clean accuracy of architectures, whereas the model robustness should play an evenly important part. In this paper, we analyze the ability of common zero-cost proxies to serve as performance predictors for robustness in the popular NAS-Bench-201 search space. We are interested in the single prediction task for robustness and the joint multi-objective of clean and robust accuracy. We further analyze the feature importance of the proxies and show that predicting the robustness makes the prediction task from existing zero-cost proxies more challenging. As a result, the joint consideration of several proxies becomes necessary to predict a model's robustness while the clean accuracy can be regressed from a single such feature.
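To make the setting concrete, the sketch below computes one simple zero-cost proxy (the loss-gradient norm at initialization) and then regresses a target metric from several proxy values, mirroring the finding that robustness needs a combination of proxies. The proxy choice, the linear regressor, and all values are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LinearRegression

def grad_norm_proxy(model, x, y):
    """Zero-cost proxy: L2 norm of the loss gradient at initialization (untrained weights)."""
    model.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

# toy architecture and a single mini-batch
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
print("grad-norm proxy:", grad_norm_proxy(model, x, y))

# regressing robustness from several proxies (placeholder values)
proxy_matrix = np.random.rand(50, 4)   # 50 architectures x 4 zero-cost proxies
robust_acc = np.random.rand(50)        # measured robust accuracy
reg = LinearRegression().fit(proxy_matrix, robust_acc)
print("fit R^2:", reg.score(proxy_matrix, robust_acc))
```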

Disentangle then Parse: Night-time Semantic Segmentation with Illumination Disentanglement

  • paper_url: http://arxiv.org/abs/2307.09362
  • repo_url: None
  • paper_authors: Zhixiang Wei, Lin Chen, Tao Tu, Huaian Chen, Pengyang Ling, Yi Jin
  • for: Improving night-time semantic segmentation performance.
  • methods: Proposes a new night-time semantic segmentation paradigm, disentangle then parse (DTP), which separates night-time images into light-invariant reflectance and light-specific illumination components and recognizes semantics based on their adaptive fusion.
  • results: DTP performs strongly on night-time segmentation, achieving higher accuracy and better consistency than state-of-the-art methods.
    Abstract Most prior semantic segmentation methods have been developed for day-time scenes, while typically underperforming in night-time scenes due to insufficient and complicated lighting conditions. In this work, we tackle this challenge by proposing a novel night-time semantic segmentation paradigm, i.e., disentangle then parse (DTP). DTP explicitly disentangles night-time images into light-invariant reflectance and light-specific illumination components and then recognizes semantics based on their adaptive fusion. Concretely, the proposed DTP comprises two key components: 1) Instead of processing lighting-entangled features as in prior works, our Semantic-Oriented Disentanglement (SOD) framework enables the extraction of reflectance component without being impeded by lighting, allowing the network to consistently recognize the semantics under cover of varying and complicated lighting conditions. 2) Based on the observation that the illumination component can serve as a cue for some semantically confused regions, we further introduce an Illumination-Aware Parser (IAParser) to explicitly learn the correlation between semantics and lighting, and aggregate the illumination features to yield more precise predictions. Extensive experiments on the night-time segmentation task with various settings demonstrate that DTP significantly outperforms state-of-the-art methods. Furthermore, with negligible additional parameters, DTP can be directly used to benefit existing day-time methods for night-time segmentation.
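The SOD framework learns its disentanglement end to end; as rough intuition for the reflectance/illumination split it builds on, the sketch below shows a classical single-scale Retinex-style decomposition. This is an illustrative stand-in under its own assumptions, not the paper's learned module.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_decompose(img, sigma=15.0, eps=1e-6):
    """Split an RGB image (H, W, 3) in [0, 1] into a smooth illumination map
    and a light-invariant reflectance component (classical Retinex heuristic)."""
    luminance = img.max(axis=2)                       # rough illumination estimate
    illumination = gaussian_filter(luminance, sigma)  # assume illumination varies smoothly
    reflectance = img / (illumination[..., None] + eps)
    return reflectance, illumination

# toy usage on a random "night-time" image
img = np.random.rand(256, 256, 3).astype(np.float32)
R, L = retinex_decompose(img)
print(R.shape, L.shape)  # (256, 256, 3) (256, 256)
```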

OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2307.09356
  • repo_url: https://github.com/wudongming97/onlinerefer
  • paper_authors: Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, Jianbing Shen
  • for: This work aims to improve the accuracy and efficiency of referring video object segmentation (RVOS), i.e., segmenting an object in a video following a language instruction.
  • methods: We propose a simple yet effective online model, OnlineRefer, that uses explicit query propagation to improve referring predictions for the current frame, and further generalize it into a semi-online framework compatible with video-based backbones.
  • results: We evaluate the method on four benchmarks: Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, OnlineRefer with a Swin-L backbone achieves 63.5 J&F on Refer-Youtube-VOS and 64.8 J&F on Refer-DAVIS17, outperforming all offline methods.
    Abstract Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present that the offline pattern is necessary for RVOS, yet model limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods.

SphereNet: Learning a Noise-Robust and General Descriptor for Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2307.09351
  • repo_url: None
  • paper_authors: Guiyu Zhao, Zhentao Guo, Xin Wang, Hongbin Ma
  • for: This work proposes a robust and generalizable point cloud registration method for accurately aligning point clouds collected from different viewpoints.
  • methods: We introduce SphereNet, a learning-based descriptor for point cloud registration: a spheroid generator builds a geometric domain via spherical voxelization to encode initial features, spherical interpolation provides robustness against noise, and a new spherical convolutional neural network with spherical integrity padding extracts the descriptors, reducing feature loss and fully capturing geometric structure.
  • results: Extensive experiments were carried out on indoor and outdoor datasets. Under high-intensity noise, SphereNet improves feature matching recall by more than 25 percentage points on the new 3DMatch-noise benchmark; it also reaches 93.5% and 75.6% registration recall on 3DMatch and 3DLoMatch, respectively, and shows the best generalization to unseen datasets.
    Abstract Point cloud registration is to estimate a transformation to align point clouds collected in different perspectives. In learning-based point cloud registration, a robust descriptor is vital for high-accuracy registration. However, most methods are susceptible to noise and have poor generalization ability on unseen datasets. Motivated by this, we introduce SphereNet to learn a noise-robust and unseen-general descriptor for point cloud registration. In our method, first, the spheroid generator builds a geometric domain based on spherical voxelization to encode initial features. Then, the spherical interpolation of the sphere is introduced to realize robustness against noise. Finally, a new spherical convolutional neural network with spherical integrity padding completes the extraction of descriptors, which reduces the loss of features and fully captures the geometric features. To evaluate our methods, a new benchmark 3DMatch-noise with strong noise is introduced. Extensive experiments are carried out on both indoor and outdoor datasets. Under high-intensity noise, SphereNet increases the feature matching recall by more than 25 percentage points on 3DMatch-noise. In addition, it sets a new state-of-the-art performance for the 3DMatch and 3DLoMatch benchmarks with 93.5% and 75.6% registration recall and also has the best generalization ability on unseen datasets.

Visual Validation versus Visual Estimation: A Study on the Average Value in Scatterplots

  • paper_url: http://arxiv.org/abs/2307.09330
  • repo_url: None
  • paper_authors: Daniel Braun, Ashley Suh, Remco Chang, Michael Gleicher, Tatiana von Landesberger
  • for: This paper investigates whether people can visually validate statistical models in terms of their fit to the data.
  • methods: A study across two populations (crowdsourced and volunteers) in which participants both visually estimated (drew) and visually validated (accepted or rejected) the frequently studied model of averages.
  • results: Participants' validation and estimation were unbiased, and their natural critical point between accepting and rejecting a given mean value lies close to the boundary of its 95% confidence interval, indicating that the visually perceived confidence interval corresponds to a common statistical standard.
    Abstract We investigate the ability of individuals to visually validate statistical models in terms of their fit to the data. While visual model estimation has been studied extensively, visual model validation remains under-investigated. It is unknown how well people are able to visually validate models, and how their performance compares to visual and computational estimation. As a starting point, we conducted a study across two populations (crowdsourced and volunteers). Participants had to both visually estimate (i.e, draw) and visually validate (i.e., accept or reject) the frequently studied model of averages. Across both populations, the level of accuracy of the models that were considered valid was lower than the accuracy of the estimated models. We find that participants' validation and estimation were unbiased. Moreover, their natural critical point between accepting and rejecting a given mean value is close to the boundary of its 95% confidence interval, indicating that the visually perceived confidence interval corresponds to a common statistical standard. Our work contributes to the understanding of visual model validation and opens new research opportunities.
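The finding that the accept/reject critical point sits near the 95% confidence interval boundary can be made concrete with a short computation; the sample values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=5.0, scale=1.5, size=40)   # y-values of 40 scatterplot points

mean = values.mean()
sem = values.std(ddof=1) / np.sqrt(len(values))    # standard error of the mean
lo, hi = mean - 1.96 * sem, mean + 1.96 * sem      # normal-approximation 95% CI

candidate_mean = 5.6                                # a mean value shown to a participant
accept = lo <= candidate_mean <= hi                 # the boundary the study's participants approximated
print(f"95% CI = [{lo:.2f}, {hi:.2f}], accept {candidate_mean}? {accept}")
```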

Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

  • paper_url: http://arxiv.org/abs/2307.09329
  • repo_url: https://github.com/kaavyarekanar/towards-a-performance-analysis-on-pre-trained-vqa-models-for-autonomous-driving
  • paper_authors: Kaavya Rekanar, Ciarán Eising, Ganesh Sistu, Martin Hayes
  • for: This short paper presents a preliminary analysis of three popular Visual Question Answering models (ViLBERT, ViLT, LXMERT) on questions about driving scenarios.
  • methods: The paper analyzes how transformers are used in multimodal architectures and evaluates the models by comparing the similarity of their responses to reference answers provided by computer vision experts.
  • results: Models that incorporate cross-modal attention and late fusion show promising potential for generating better answers from a driving perspective, setting the stage for a comprehensive comparison of nine VQA models in self-driving scenarios.
    Abstract This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions relating to driving scenarios. The performance of these models is evaluated by comparing the similarity of responses to reference answers provided by computer vision experts. Model selection is predicated on the analysis of transformer utilization in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion techniques exhibit promising potential for generating improved answers within a driving perspective. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the scene for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.

Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis

  • paper_url: http://arxiv.org/abs/2307.09323
  • repo_url: https://github.com/fictionarry/er-nerf
  • paper_authors: Jiahe Li, Jiawei Zhang, Xiao Bai, Jun Zhou, Lin Gu
  • for: Talking portrait synthesis with high fidelity, fast rendering, and small model size.
  • methods: Builds ER-NeRF, a conditional Neural Radiance Fields (NeRF) architecture with an explicit region-based design, introducing a Tri-Plane Hash Representation, a Region Attention Module, and an Adaptive Pose Encoding to improve the accuracy and efficiency of talking portrait modeling.
  • results: Compared with previous methods, ER-NeRF renders higher-fidelity, audio-lip synchronized talking portrait videos with higher efficiency and a smaller model size.
    Abstract This paper presents ER-NeRF, a novel conditional Neural Radiance Fields (NeRF) based architecture for talking portrait synthesis that can concurrently achieve fast convergence, real-time rendering, and state-of-the-art performance with small model size. Our idea is to explicitly exploit the unequal contribution of spatial regions to guide talking portrait modeling. Specifically, to improve the accuracy of dynamic head reconstruction, a compact and expressive NeRF-based Tri-Plane Hash Representation is introduced by pruning empty spatial regions with three planar hash encoders. For speech audio, we propose a Region Attention Module to generate region-aware condition feature via an attention mechanism. Different from existing methods that utilize an MLP-based encoder to learn the cross-modal relation implicitly, the attention mechanism builds an explicit connection between audio features and spatial regions to capture the priors of local motions. Moreover, a direct and fast Adaptive Pose Encoding is introduced to optimize the head-torso separation problem by mapping the complex transformation of the head pose into spatial coordinates. Extensive experiments demonstrate that our method renders better high-fidelity and audio-lips synchronized talking portrait videos, with realistic details and high efficiency compared to previous methods.

Towards Automated Semantic Segmentation in Mammography Images

  • paper_url: http://arxiv.org/abs/2307.10296
  • repo_url: None
  • paper_authors: Cesar A. Sierra-Franco, Jan Hurtado, Victor de A. Thomaz, Leonardo C. da Cruz, Santiago V. Silva, Alberto B. Raposo
  • for: Detecting non-palpable breast lesions in mammography, supporting diagnosis and the assessment of image adequacy.
  • methods: A deep learning framework that automatically segments the nipple, the pectoral muscle, the fibroglandular tissue, and the fatty tissue in standard-view mammography images.
  • results: Accurate segmentation on varied and challenging cases, indicating the framework can be integrated into clinical practice.
    Abstract Mammography images are widely used to detect non-palpable breast lesions or nodules, preventing cancer and providing the opportunity to plan interventions when necessary. The identification of some structures of interest is essential to make a diagnosis and evaluate image adequacy. Thus, computer-aided detection systems can be helpful in assisting medical interpretation by automatically segmenting these landmark structures. In this paper, we propose a deep learning-based framework for the segmentation of the nipple, the pectoral muscle, the fibroglandular tissue, and the fatty tissue on standard-view mammography images. We introduce a large private segmentation dataset and extensive experiments considering different deep-learning model architectures. Our experiments demonstrate accurate segmentation performance on variate and challenging cases, showing that this framework can be integrated into clinical practice.

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2307.09316
  • repo_url: https://github.com/cvmi-lab/mars3d
  • paper_authors: Jiahui Liu, Chirui Chang, Jianhui Liu, Xiaoyang Wu, Lan Ma, Xiaojuan Qi
  • for: This work aims to improve semantic segmentation on multi-scan large-scale point clouds, a capability that matters for autonomous driving systems.
  • methods: It proposes MarS3D, a plug-and-play motion-aware module that equips single-scan models with multi-scan perception, built on two key designs: a Cross-Frame Feature Embedding module and a Motion-Aware Feature Learning module.
  • results: Experiments show that MarS3D improves the baseline models by a large margin. Code is available at https://github.com/CVMI-Lab/MarS3D.
    Abstract 3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lacking of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D.

EigenTrajectory: Low-Rank Descriptors for Multi-Modal Trajectory Forecasting

  • paper_url: http://arxiv.org/abs/2307.09306
  • repo_url: https://github.com/inhwanbae/eigentrajectory
  • paper_authors: Inhwan Bae, Jean Oh, Hae-Gon Jeon
  • for: A method for forecasting pedestrian trajectories.
  • methods: A novel trajectory descriptor maps pedestrian movements into a compact space (the $\mathbb{ET}$ space) in place of Euclidean space, and a low-rank approximation reduces the descriptor's complexity.
  • results: The approach improves both the prediction accuracy and reliability of existing trajectory forecasting models, and an anchor-based refinement method covers all possible futures.
    Abstract Capturing high-dimensional social interactions and feasible futures is essential for predicting trajectories. To address this complex nature, several attempts have been devoted to reducing the dimensionality of the output variables via parametric curve fitting such as the B\'ezier curve and B-spline function. However, these functions, which originate in computer graphics fields, are not suitable to account for socially acceptable human dynamics. In this paper, we present EigenTrajectory ($\mathbb{ET}$), a trajectory prediction approach that uses a novel trajectory descriptor to form a compact space, known here as $\mathbb{ET}$ space, in place of Euclidean space, for representing pedestrian movements. We first reduce the complexity of the trajectory descriptor via a low-rank approximation. We transform the pedestrians' history paths into our $\mathbb{ET}$ space represented by spatio-temporal principle components, and feed them into off-the-shelf trajectory forecasting models. The inputs and outputs of the models as well as social interactions are all gathered and aggregated in the corresponding $\mathbb{ET}$ space. Lastly, we propose a trajectory anchor-based refinement method to cover all possible futures in the proposed $\mathbb{ET}$ space. Extensive experiments demonstrate that our EigenTrajectory predictor can significantly improve both the prediction accuracy and reliability of existing trajectory forecasting models on public benchmarks, indicating that the proposed descriptor is suited to represent pedestrian behaviors. Code is publicly available at https://github.com/inhwanbae/EigenTrajectory .
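A minimal sketch of the low-rank idea: stack observed trajectories, take a truncated SVD to obtain spatio-temporal principal components, and represent each trajectory by its coefficients in that basis (a stand-in for the paper's $\mathbb{ET}$ space). The dimensions and toy data are illustrative assumptions.

```python
import numpy as np

# N pedestrians, T observed time steps, 2D coordinates -> flatten to (N, 2T)
N, T, K = 200, 8, 4
trajs = np.random.randn(N, T, 2).cumsum(axis=1)      # toy random-walk trajectories
X = trajs.reshape(N, -1)

# truncated SVD: rows of Vt[:K] are the spatio-temporal principal components
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
basis = Vt[:K]                                        # (K, 2T) low-rank descriptor basis

coeffs = X_centered @ basis.T                         # each trajectory as K coefficients
recon = coeffs @ basis + X.mean(axis=0)               # back-projection to coordinates
print("reconstruction error:", np.abs(recon - X).mean())
```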

Conformal prediction under ambiguous ground truth

  • paper_url: http://arxiv.org/abs/2307.09302
  • repo_url: None
  • paper_authors: David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, Arnaud Doucet
  • for: This work addresses uncertainty quantification for safety-critical classification tasks, providing confidence sets that contain the true class with a user-specified probability.
  • methods: It uses conformal prediction, which provides such confidence sets but normally requires a held-out calibration set with ground truth labels; the paper develops a conformal prediction framework for ambiguous ground truth settings, relying on an approximation of the posterior distribution of labels given inputs.
  • results: Experiments on synthetic and real datasets, including a case study of skin condition classification in dermatology, show that the method provides better uncertainty quantification in practical applications and generalizes across datasets.
    Abstract In safety-critical classification tasks, conformal prediction allows to perform rigorous uncertainty quantification by providing confidence sets including the true class with a user-specified probability. This generally assumes the availability of a held-out calibration set with access to ground truth labels. Unfortunately, in many domains, such labels are difficult to obtain and usually approximated by aggregating expert opinions. In fact, this holds true for almost all datasets, including well-known ones such as CIFAR and ImageNet. Applying conformal prediction using such labels underestimates uncertainty. Indeed, when expert opinions are not resolvable, there is inherent ambiguity present in the labels. That is, we do not have ``crisp'', definitive ground truth labels and this uncertainty should be taken into account during calibration. In this paper, we develop a conformal prediction framework for such ambiguous ground truth settings which relies on an approximation of the underlying posterior distribution of labels given inputs. We demonstrate our methodology on synthetic and real datasets, including a case study of skin condition classification in dermatology.
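For context, the sketch below shows the standard split conformal procedure with crisp labels that the paper starts from; the paper's contribution, calibrating against a posterior over plausible labels when the ground truth is ambiguous, is not reproduced here.

```python
import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Standard split conformal prediction with the 1 - p(true class) score."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]          # nonconformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n                # finite-sample correction
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    return test_probs >= 1.0 - qhat                             # boolean prediction sets

# toy calibration and test data (placeholders)
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=500)
cal_labels = rng.integers(0, 5, size=500)
test_probs = rng.dirichlet(np.ones(5), size=10)
print(split_conformal_sets(cal_probs, cal_labels, test_probs).sum(axis=1))  # set sizes
```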

RepViT: Revisiting Mobile CNN From ViT Perspective

  • paper_url: http://arxiv.org/abs/2307.09283
  • repo_url: https://github.com/jameslahm/RepViT
  • paper_authors: Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, Guiguang Ding
  • for: This work examines the performance and latency of lightweight Vision Transformers (ViTs) versus lightweight convolutional neural networks (CNNs) on mobile devices.
  • methods: It introduces RepViT, a new family of pure lightweight CNNs obtained by integrating the efficient architectural choices of lightweight ViTs into MobileNetV3, achieving higher accuracy at lower latency.
  • results: Experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs across various vision tasks; on ImageNet it reaches over 80% top-1 accuracy with nearly 1 ms latency on an iPhone 12, a first for a lightweight model, and the largest variant, RepViT-M3, obtains 81.4% accuracy at 1.3 ms.
    Abstract Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80% top-1 accuracy with nearly 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M3, obtains 81.4% accuracy with only 1.3ms latency. The code and trained models are available at https://github.com/jameslahm/RepViT.

Regression-free Blind Image Quality Assessment

  • paper_url: http://arxiv.org/abs/2307.09279
  • repo_url: https://github.com/XiaoqiWang/regression-free-iqa
  • paper_authors: Xiaoqi Wang, Jian Xiong, Hao Gao, Weisi Lin
  • for: Improving the accuracy of blind image quality assessment by mitigating the bias introduced by biased training samples.
  • methods: A regression-free framework that evaluates image quality by retrieving similar instances, combining semantic and distortion features.
  • results: Compared with state-of-the-art regression-based models, the proposed model markedly improves the accuracy of image quality assessment.
    Abstract Regression-based blind image quality assessment (IQA) models are susceptible to biased training samples, leading to a biased estimation of model parameters. To mitigate this issue, we propose a regression-free framework for image quality evaluation, which is founded upon retrieving similar instances by incorporating semantic and distortion features. The motivation behind this approach is rooted in the observation that the human visual system (HVS) has analogous visual responses to semantically similar image contents degraded by the same distortion. The proposed framework comprises two classification-based modules: semantic-based classification (SC) module and distortion-based classification (DC) module. Given a test image and an IQA database, the SC module retrieves multiple pristine images based on semantic similarity. The DC module then retrieves instances based on distortion similarity from the distorted images that correspond to each retrieved pristine image. Finally, the predicted quality score is derived by aggregating the subjective quality scores of multiple retrieved instances. Experimental results on four benchmark databases validate that the proposed model can remarkably outperform the state-of-the-art regression-based models.
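A minimal sketch of the regression-free idea: retrieve database images that are close in feature space and aggregate their subjective scores. The single joint feature space and the distance weighting are simplifications of the paper's two-stage semantic/distortion retrieval, and the features and MOS values below are placeholders.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# IQA database: precomputed features and mean opinion scores (placeholders)
db_features = np.random.rand(1000, 256)      # semantic + distortion features
db_mos = np.random.uniform(1, 5, size=1000)  # subjective quality scores

knn = NearestNeighbors(n_neighbors=5).fit(db_features)

def predict_quality(test_feature):
    """Regression-free prediction: weighted average of the MOS of retrieved neighbours."""
    dist, idx = knn.kneighbors(test_feature[None, :])
    weights = 1.0 / (dist[0] + 1e-6)                 # closer instances count more
    return np.average(db_mos[idx[0]], weights=weights)

print(predict_quality(np.random.rand(256)))
```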

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

  • paper_url: http://arxiv.org/abs/2307.09267
  • repo_url: None
  • paper_authors: Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
  • for: This work proposes a weakly supervised 3D visual grounding method that localizes a target object in a 3D scene from a simple sentence query.
  • methods: We learn the grounding model from coarse scene-sentence correspondences with a semantic matching model that links object proposals and sentences in a coarse-to-fine manner, and distill this matching knowledge into a standard two-stage 3D visual grounding model to improve performance and reduce inference cost.
  • results: Extensive experiments on ScanRefer, Nr3D, and Sr3D demonstrate the effectiveness of the proposed method.
    Abstract 3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are both time-consuming and expensive. To address the problem that fine-grained annotated data is difficult to obtain, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we reconstruct the masked keywords of the sentence using each candidate one by one, and the reconstructed accuracy finely reflects the semantic similarity of each candidate to the query. Additionally, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference costs and improves performance by taking full advantage of the well-studied structure of the existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the effectiveness of our proposed method.

Knowledge Distillation for Object Detection: from generic to remote sensing datasets

  • paper_url: http://arxiv.org/abs/2307.09264
  • repo_url: None
  • paper_authors: Hoàng-Ân Lê, Minh-Tan Pham
  • for: Evaluating, in a remote sensing context, object detection knowledge distillation methods originally developed on generic computer vision datasets.
  • methods: Both logit mimicking and feature imitation approaches are applied to vehicle detection, with extensive experiments on the xView and VEDAI datasets.
  • results: The experiments show high variation across methods on remote sensing datasets and confirm the importance of result aggregation and cross-validation.
    Abstract Knowledge distillation, a well-known model compression technique, is an active research area in both computer vision and remote sensing communities. In this paper, we evaluate in a remote sensing context various off-the-shelf object detection knowledge distillation methods which have been originally developed on generic computer vision datasets such as Pascal VOC. In particular, methods covering both logit mimicking and feature imitation approaches are applied for vehicle detection using the well-known benchmarks such as xView and VEDAI datasets. Extensive experiments are performed to compare the relative performance and interrelationships of the methods. Experimental results show high variations and confirm the importance of result aggregation and cross validation on remote sensing datasets.
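Logit mimicking in its simplest form is a temperature-scaled KL divergence between teacher and student class logits. The sketch below shows that generic loss (in a detector it would be applied to per-region classification logits); it is not any of the specific detection-KD variants benchmarked in the paper, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def logit_mimicking_loss(student_logits, teacher_logits, T=4.0):
    """Hinton-style distillation loss: KL(teacher || student) on temperature-softened logits."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# toy usage: 32 region proposals, 6 vehicle classes (placeholders)
student_logits = torch.randn(32, 6, requires_grad=True)
teacher_logits = torch.randn(32, 6)
loss = logit_mimicking_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```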

Neuromorphic spintronics simulated using an unconventional data-driven Thiele equation approach

  • paper_url: http://arxiv.org/abs/2307.09262
  • repo_url: None
  • paper_authors: Anatole Moureaux, Simon de Wergifosse, Chloé Chopin, Flavio Abreu Araujo
  • for: This work develops a quantitative model of the dynamics of spin-torque vortex nano-oscillators (STVOs) to speed up the design of STVO-based neuromorphic computing devices and reduce their computational cost.
  • methods: It uses an unconventional data-driven model combining the Thiele equation approach (TEA) with data from micromagnetic simulations (MMS), which accelerates the simulations by nine orders of magnitude while reaching the same level of accuracy.
  • results: The model is showcased by simulating an STVO-based neural network on a classification task and assessing its performance with respect to input signal current intensity and noise, showing that the approach can accelerate the design of STVO-based neuromorphic computing devices while keeping the accuracy of full micromagnetic simulations.
    Abstract In this study, we developed a quantitative description of the dynamics of spin-torque vortex nano-oscillators (STVOs) through an unconventional model based on the combination of the Thiele equation approach (TEA) and data from micromagnetic simulations (MMS). Solving the STVO dynamics with our analytical model allows to accelerate the simulations by 9 orders of magnitude compared to MMS while reaching the same level of accuracy. Here, we showcase our model by simulating a STVO-based neural network for solving a classification task. We assess its performance with respect to the input signal current intensity and the level of noise that might affect such a system. Our approach is promising for accelerating the design of STVO-based neuromorphic computing devices while decreasing drastically its computational cost.
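As background, the Thiele equation approach reduces the vortex-core dynamics to a two-dimensional ODE for the core position. The toy integration below uses a harmonic confining force and illustrative parameters and sign conventions, and leaves out the data-driven corrections from micromagnetic simulations that the paper's model adds.

```python
import numpy as np
from scipy.integrate import solve_ivp

G, alphaD, k = 1.0, 0.05, 1.0    # gyrovector, damping, confinement (illustrative units)

def thiele_rhs(t, X):
    """Solve G (z x v) - alphaD * v + F(X) = 0 for the core velocity v."""
    F = -k * np.asarray(X)                       # harmonic confining force (assumption)
    A = np.array([[-alphaD, -G],
                  [G, -alphaD]])                 # coefficient matrix of (vx, vy)
    return np.linalg.solve(A, -F)

sol = solve_ivp(thiele_rhs, (0.0, 50.0), [1.0, 0.0], max_step=0.05)
print("final core position:", sol.y[:, -1])      # the core spirals toward the centre
```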

Adaptive Topological Feature via Persistent Homology: Filtration Learning for Point Clouds

  • paper_url: http://arxiv.org/abs/2307.09259
  • repo_url: None
  • paper_authors: Naoki Nishikawa, Yuichi Ike, Kenji Yamanishi
  • for: Improving the accuracy of machine learning methods on point clouds by incorporating global topological features.
  • methods: A framework that learns the filtration for persistent homology adaptively with neural networks, using an architecture designed to keep the resulting features isometry-invariant.
  • results: Experiments on several classification tasks demonstrate the effectiveness of the framework.
    Abstract Machine learning for point clouds has been attracting much attention, with many applications in various fields, such as shape recognition and material science. To enhance the accuracy of such machine learning methods, it is known to be effective to incorporate global topological features, which are typically extracted by persistent homology. In the calculation of persistent homology for a point cloud, we need to choose a filtration for the point clouds, an increasing sequence of spaces. Because the performance of machine learning methods combined with persistent homology is highly affected by the choice of a filtration, we need to tune it depending on data and tasks. In this paper, we propose a framework that learns a filtration adaptively with the use of neural networks. In order to make the resulting persistent homology isometry-invariant, we develop a neural network architecture with such invariance. Additionally, we theoretically show a finite-dimensional approximation result that justifies our architecture. Experimental results demonstrated the efficacy of our framework in several classification tasks.
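The persistence computation that the framework differentiates through can be illustrated with an off-the-shelf library. The sketch below uses a plain (non-learned) Vietoris-Rips filtration via the ripser package; it stands in for, and does not implement, the paper's learned filtration.

```python
import numpy as np
from ripser import ripser  # pip install ripser

# toy point cloud: a noisy circle, whose H1 diagram should show one long-lived loop
theta = np.random.uniform(0, 2 * np.pi, 200)
points = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(200, 2)

diagrams = ripser(points, maxdim=1)["dgms"]    # [H0 diagram, H1 diagram]
h1 = diagrams[1]
persistence = h1[:, 1] - h1[:, 0]              # death minus birth of each 1-cycle
print("most persistent 1-cycle lives for", persistence.max())
```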

Generation of High Spatial Resolution Terrestrial Surface from Low Spatial Resolution Elevation Contour Maps via Hierarchical Computation of Median Elevation Regions

  • paper_url: http://arxiv.org/abs/2307.09239
  • repo_url: None
  • paper_authors: Geetika Barman, B. S. Daya Sagar
  • for: Converting a sparse digital elevation model (DEM) into a dense one.
  • methods: A morphological approach based on median contours: the sparse contour map is decomposed into the maximum possible Threshold Elevation Regions (TERs); non-negative, non-weighted Median Elevation Regions (MERs) are computed hierarchically between successive TERs; and the gradients of the TERs and MERs yield predicted intermediate elevation contours at higher spatial resolution.
  • results: The approach generates high-resolution terrain surfaces with good fidelity. It uses the geometric information of existing contours and interpolates elevation contours over new regions of the topographic surface until no further contours are needed, and it is low-cost and robust because it works directly with elevation contours.
    Abstract We proposed a simple yet effective morphological approach to convert a sparse Digital Elevation Model (DEM) to a dense Digital Elevation Model. The conversion is similar to that of the generation of high-resolution DEM from its low-resolution DEM. The approach involves the generation of median contours to achieve the purpose. It is a sequential step of the I) decomposition of the existing sparse Contour map into the maximum possible Threshold Elevation Region (TERs). II) Computing all possible non-negative and non-weighted Median Elevation Region (MER) hierarchically between the successive TER decomposed from a sparse contour map. III) Computing the gradient of all TER, and MER computed from previous steps would yield the predicted intermediate elevation contour at a higher spatial resolution. We presented this approach initially with some self-made synthetic data to show how the contour prediction works and then experimented with the available contour map of Washington, NH to justify its usefulness. This approach considers the geometric information of existing contours and interpolates the elevation contour at a new spatial region of a topographic surface until no elevation contours are necessary to generate. This novel approach is also very low-cost and robust as it uses elevation contours.

Fusing Hand and Body Skeletons for Human Action Recognition in Assembly

  • paper_url: http://arxiv.org/abs/2307.09238
  • repo_url: None
  • paper_authors: Dustin Aganian, Mona Köhler, Benedict Stephan, Markus Eisenbach, Horst-Michael Gross
  • for: This paper aims to make human-robot collaboration in assembly more effective.
  • methods: It combines less detailed body skeletons with highly detailed hand skeletons and investigates CNNs and transformers, the latter being particularly adept at extracting and combining important information from both skeleton types using attention.
  • results: Experiments show that fusing body and hand skeletons improves the recognition of workers' actions in assembly scenarios.
    Abstract As collaborative robots (cobots) continue to gain popularity in industrial manufacturing, effective human-robot collaboration becomes crucial. Cobots should be able to recognize human actions to assist with assembly tasks and act autonomously. To achieve this, skeleton-based approaches are often used due to their ability to generalize across various people and environments. Although body skeleton approaches are widely used for action recognition, they may not be accurate enough for assembly actions where the worker's fingers and hands play a significant role. To address this limitation, we propose a method in which less detailed body skeletons are combined with highly detailed hand skeletons. We investigate CNNs and transformers, the latter of which are particularly adept at extracting and combining important information from both skeleton types using attention. This paper demonstrates the effectiveness of our proposed approach in enhancing action recognition in assembly scenarios.

Augmenting CLIP with Improved Visio-Linguistic Reasoning

  • paper_url: http://arxiv.org/abs/2307.09233
  • repo_url: None
  • paper_authors: Samyadeep Basu, Maziar Sanjabi, Daniela Massiceti, Shell Xu Hu, Soheil Feizi
  • for: Improving the compositional visio-linguistic reasoning capabilities of CLIP.
  • methods: SDS-CLIP, a sample-efficient, lightweight method that fine-tunes CLIP with differentiable image parameterizations and a distillation objective from large text-to-image generative models such as Stable-Diffusion, which are relatively good at visio-linguistic reasoning tasks.
  • results: On the Winoground and ARO datasets, the method improves the absolute visio-linguistic performance of different CLIP models by up to 7% and 3%, respectively, and also marginally improves CLIP's zero-shot performance on a variety of downstream tasks.
    Abstract Image-text contrastive models such as CLIP are useful for a variety of downstream applications including zero-shot classification, image-text retrieval and transfer learning. However, these contrastively trained vision-language models often fail on compositional visio-linguistic tasks such as Winoground with performance equivalent to random chance. In our paper, we address this issue and propose a sample-efficient light-weight method called SDS-CLIP to improve the compositional visio-linguistic reasoning capabilities of CLIP. The core idea of our method is to use differentiable image parameterizations to fine-tune CLIP with a distillation objective from large text-to-image generative models such as Stable-Diffusion which are relatively good at visio-linguistic reasoning tasks. On the challenging Winoground compositional reasoning benchmark, our method improves the absolute visio-linguistic performance of different CLIP models by up to 7%, while on the ARO dataset, our method improves the visio-linguistic performance by upto 3%. As a byproduct of inducing visio-linguistic reasoning into CLIP, we also find that the zero-shot performance improves marginally on a variety of downstream datasets. Our method reinforces that carefully designed distillation objectives from generative models can be leveraged to extend existing contrastive image-text models with improved visio-linguistic reasoning capabilities.

A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

  • paper_url: http://arxiv.org/abs/2307.09220
  • repo_url: None
  • paper_authors: Chaoyang Zhu, Long Chen
  • for: This survey provides a comprehensive review of the past and recent development of open-vocabulary detection (OVD) and open-vocabulary segmentation (OVS).
  • methods: It develops a taxonomy organized by task and methodology, grouping approaches into visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation-based, and transfer learning-based methods.
  • results: The survey covers object detection, semantic/instance/panoptic segmentation, and 3D scene and video understanding; for each task it discusses the main principles, key challenges, development routes, strengths, and weaknesses, benchmarks the methods and their vital components, and outlines several promising directions for future research.
    Abstract As the most fundamental tasks of computer vision, object detection and segmentation have made tremendous progress in the deep learning era. Due to the expensive manual labeling, the annotated categories in existing datasets are often small-scale and pre-defined, i.e., state-of-the-art detectors and segmentors fail to generalize beyond the closed-vocabulary. To resolve this limitation, the last few years have witnessed increasing attention toward Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we provide a comprehensive review on the past and recent development of OVD and OVS. To this end, we develop a taxonomy according to the type of task and methodology. We find that the permission and usage of weak supervision signals can well discriminate different methodologies, including: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation-based, and transfer learning-based. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D scene and video understanding. In each category, its main principles, key challenges, development routes, strengths, and weaknesses are thoroughly discussed. In addition, we benchmark each task along with the vital components of each method. Finally, several promising directions are provided to stimulate future research.

You’ve Got Two Teachers: Co-evolutionary Image and Report Distillation for Semi-supervised Anatomical Abnormality Detection in Chest X-ray

  • paper_url: http://arxiv.org/abs/2307.09184
  • repo_url: None
  • paper_authors: Jinghan Sun, Dong Wei, Zhe Xu, Donghuan Lu, Hong Liu, Liansheng Wang, Yefeng Zheng
  • for: Assisting the detection and characterization of radiological findings of respiratory and cardiovascular disease in chest X-ray images.
  • methods: A semi-supervised approach that couples image-based abnormality detection with report-based text classification, letting the two modalities refine each other's pseudo labels to improve detection accuracy.
  • results: Experiments show that the method performs strongly on the MIMIC-CXR dataset, outperforming previous weakly and semi-supervised methods.
    Abstract Chest X-ray (CXR) anatomical abnormality detection aims at localizing and characterising cardiopulmonary radiological findings in the radiographs, which can expedite clinical workflow and reduce observational oversights. Most existing methods attempted this task in either fully supervised settings which demanded costly mass per-abnormality annotations, or weakly supervised settings which still lagged badly behind fully supervised methods in performance. In this work, we propose a co-evolutionary image and report distillation (CEIRD) framework, which approaches semi-supervised abnormality detection in CXR by grounding the visual detection results with text-classified abnormalities from paired radiology reports, and vice versa. Concretely, based on the classical teacher-student pseudo label distillation (TSD) paradigm, we additionally introduce an auxiliary report classification model, whose prediction is used for report-guided pseudo detection label refinement (RPDLR) in the primary vision detection task. Inversely, we also use the prediction of the vision detection model for abnormality-guided pseudo classification label refinement (APCLR) in the auxiliary report classification task, and propose a co-evolution strategy where the vision and report models mutually promote each other with RPDLR and APCLR performed alternatively. To this end, we effectively incorporate the weak supervision by reports into the semi-supervised TSD pipeline. Besides the cross-modal pseudo label refinement, we further propose an intra-image-modal self-adaptive non-maximum suppression, where the pseudo detection labels generated by the teacher vision model are dynamically rectified by high-confidence predictions by the student. Experimental results on the public MIMIC-CXR benchmark demonstrate CEIRD's superior performance to several up-to-date weakly and semi-supervised methods.

Pixel-wise Graph Attention Networks for Person Re-identification

  • paper_url: http://arxiv.org/abs/2307.09183
  • repo_url: https://github.com/wenyu1009/pganet
  • paper_authors: Wenyu Zhang, Qing Ding, Jian Hu, Yi Ma, Mingzhe Lu
  • for: This work explores the use of graph convolutional networks (GCN) and graph attention networks (GAT) for image feature extraction to improve recognition performance.
  • methods: It proposes a novel graph generation algorithm that converts images into graphs via matrix transformation (an order of magnitude faster than a KNN-based algorithm) and updates node features with GAT; together these form a pixel-wise graph attention module (PGA) that can be combined with conventional CNN feature extraction. Based on PGA and ResNet, a pixel-wise graph attention network (PGANet) is designed for person re-identification.
  • results: On Market1501, DukeMTMC-reID, and Occluded-DukeMTMC, PGANet achieves state-of-the-art performance, outperforming the previous best by 0.8%, 1.1%, and 11% in mAP, respectively.
    Abstract Graph convolutional networks (GCN) is widely used to handle irregular data since it updates node features by using the structure information of graph. With the help of iterated GCN, high-order information can be obtained to further enhance the representation of nodes. However, how to apply GCN to structured data (such as pictures) has not been deeply studied. In this paper, we explore the application of graph attention networks (GAT) in image feature extraction. First of all, we propose a novel graph generation algorithm to convert images into graphs through matrix transformation. It is one magnitude faster than the algorithm based on K Nearest Neighbors (KNN). Then, GAT is used on the generated graph to update the node features. Thus, a more robust representation is obtained. These two steps are combined into a module called pixel-wise graph attention module (PGA). Since the graph obtained by our graph generation algorithm can still be transformed into a picture after processing, PGA can be well combined with CNN. Based on these two modules, we consulted the ResNet and design a pixel-wise graph attention network (PGANet). The PGANet is applied to the task of person re-identification in the datasets Market1501, DukeMTMC-reID and Occluded-DukeMTMC (outperforms state-of-the-art by 0.8%, 1.1% and 11% respectively, in mAP scores). Experiment results show that it achieves the state-of-the-art performance. The code is available at https://github.com/wenyu1009/PGANet.

Jean-Luc Picard at Touché 2023: Comparing Image Generation, Stance Detection and Feature Matching for Image Retrieval for Arguments

  • paper_url: http://arxiv.org/abs/2307.09172
  • repo_url: None
  • paper_authors: Max Moebius, Maximilian Enderling, Sarah T. Bachinger
  • for: Our participation in the shared task "Image Retrieval for Arguments", in which we submitted four runs with different pipeline layouts and compared them to the given baseline.
  • methods: Different image retrieval pipelines combining image generation, stance detection, preselection, and feature matching.
  • results: Our pipelines perform similarly to the baseline.
    Abstract Participating in the shared task "Image Retrieval for arguments", we used different pipelines for image retrieval containing Image Generation, Stance Detection, Preselection and Feature Matching. We submitted four different runs with different pipeline layout and compare them to given baseline. Our pipelines perform similarly to the baseline.

ECSIC: Epipolar Cross Attention for Stereo Image Compression

  • paper_url: http://arxiv.org/abs/2307.10284
  • repo_url: None
  • paper_authors: Matthias Wödlinger, Jan Kotera, Manuel Keglevic, Jan Xu, Robert Sablatnig
  • for: This paper presents ECSIC, a novel learned method for stereo image compression.
  • methods: The method compresses the left and right images jointly using a novel stereo cross attention (SCA) module and two stereo context modules. The SCA module performs cross-attention restricted to the corresponding epipolar lines of the two images and processes them in parallel; the stereo context modules use the first image as context to improve the entropy estimation for the second encoded image.
  • results: ECSIC achieves state-of-the-art performance among stereo image compression models on the Cityscapes and InStereo2k datasets while allowing fast encoding and decoding, making it practical for real-time applications.
    Abstract In this paper, we present ECSIC, a novel learned method for stereo image compression. Our proposed method compresses the left and right images in a joint manner by exploiting the mutual information between the images of the stereo image pair using a novel stereo cross attention (SCA) module and two stereo context modules. The SCA module performs cross-attention restricted to the corresponding epipolar lines of the two images and processes them in parallel. The stereo context modules improve the entropy estimation of the second encoded image by using the first image as a context. We conduct an extensive ablation study demonstrating the effectiveness of the proposed modules and a comprehensive quantitative and qualitative comparison with existing methods. ECSIC achieves state-of-the-art performance among stereo image compression models on the two popular stereo image datasets Cityscapes and InStereo2k while allowing for fast encoding and decoding, making it highly practical for real-time applications.
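For rectified stereo pairs, corresponding epipolar lines are simply matching image rows, so cross-attention can be restricted to row pairs. The sketch below shows that restriction with a generic multi-head attention layer; the feature sizes and the attention layer are assumptions, and this is a simplification of the SCA module rather than its implementation.

```python
import torch
import torch.nn as nn

class RowCrossAttention(nn.Module):
    """Cross-attention restricted to matching rows of rectified left/right feature maps."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat_l, feat_r):
        # feat_*: (B, C, H, W) -> treat every row as an independent sequence of length W
        B, C, H, W = feat_l.shape
        q = feat_l.permute(0, 2, 3, 1).reshape(B * H, W, C)   # queries from the left view
        kv = feat_r.permute(0, 2, 3, 1).reshape(B * H, W, C)  # keys/values from the right view
        out, _ = self.attn(q, kv, kv)
        return out.reshape(B, H, W, C).permute(0, 3, 1, 2)

left, right = torch.randn(2, 64, 16, 32), torch.randn(2, 64, 16, 32)
print(RowCrossAttention(64)(left, right).shape)  # torch.Size([2, 64, 16, 32])
```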

Towards Trustworthy Dataset Distillation

  • paper_url: http://arxiv.org/abs/2307.09165
  • repo_url: None
  • paper_authors: Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang
  • for: Trustworthy Dataset Distillation (TrustDD) aims to reduce training costs and enhance model trustworthiness in real-world applications by distilling both in-distribution (InD) samples and outliers.
  • methods: The proposed method uses dataset distillation (DD) to reduce the training dataset to a tiny synthetic dataset while simultaneously considering in-distribution (InD) classification and out-of-distribution (OOD) detection. To generate pseudo-outliers, InD samples are corrupted, introducing Pseudo-Outlier Exposure (POE).
  • results: Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and the proposed POE surpasses the state-of-the-art Outlier Exposure (OE) method. TrustDD is more trustworthy and applicable to real open-world scenarios than previous dataset distillation methods.
    Abstract Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data and make OOD detection more practical, we further propose to corrupt InD samples to generate pseudo-outliers and introduce Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and the proposed POE surpasses the state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to real open-world scenarios. Our code will be publicly available.
    摘要 效率和可信性是深度学习在实际应用中的两大永恒追求。在效率方面,数据集蒸馏(DD)通过将大规模数据集蒸馏为一个很小的合成数据集来降低训练成本;然而,现有方法仅关注封闭世界设定下的分布内(InD)分类,忽略了分布外(OOD)样本。另一方面,OOD检测旨在提升模型的可信性,但在全量数据设定下的实现通常效率不高。我们首次同时考虑这两个问题,提出了一种新范式,即可信数据集蒸馏(TrustDD):通过同时蒸馏InD样本和异常样本,蒸馏得到的数据集能够训练出在InD分类和OOD检测上都表现良好的模型。为了减少对真实异常数据的需求、使OOD检测更加实用,我们进一步提出对InD样本进行损坏以生成伪异常样本,即伪异常暴露(POE)。在多种设定下的全面实验表明了TrustDD的有效性,且所提出的POE优于最先进的异常暴露(OE)方法。与此前的DD相比,TrustDD更加可信,也更适用于真实的开放世界场景。我们的代码将公开。
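A rough sketch of Pseudo-Outlier Exposure as described above: in-distribution images are corrupted (here by heavy Gaussian noise or patch shuffling, chosen only as example corruptions) to obtain pseudo-outliers, and an outlier-exposure term pushes the classifier towards a uniform prediction on them. The specific corruptions and the loss weight are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def make_pseudo_outliers(x: torch.Tensor) -> torch.Tensor:
    """Corrupt a batch of InD images (B, C, H, W), H and W divisible by 4."""
    noisy = (x + 0.5 * torch.randn_like(x)).clamp(0, 1)
    b, c, h, w = x.shape
    # patch-shuffle: split into a 4x4 grid of patches and permute them
    patches = x.unfold(2, h // 4, h // 4).unfold(3, w // 4, w // 4)
    patches = patches.reshape(b, c, 16, h // 4, w // 4)
    perm = torch.randperm(16, device=x.device)
    shuffled = patches[:, :, perm].reshape(b, c, 4, 4, h // 4, w // 4)
    shuffled = shuffled.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
    half = b // 2
    return torch.cat([noisy[:half], shuffled[half:]], dim=0)

def trustdd_style_loss(model, x_ind, y_ind, lam: float = 0.5) -> torch.Tensor:
    ce = F.cross_entropy(model(x_ind), y_ind)               # InD classification term
    logits_out = model(make_pseudo_outliers(x_ind))
    # outlier exposure: cross-entropy against the uniform distribution
    oe = -F.log_softmax(logits_out, dim=1).mean()
    return ce + lam * oe
```

The appeal of POE is visible here: the outlier branch needs no extra data source, only a corruption of the batch already in hand, which is what makes it compatible with distilled datasets.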

CG-fusion CAM: Online segmentation of laser-induced damage on large-aperture optics

  • paper_url: http://arxiv.org/abs/2307.09161
  • repo_url: None
  • paper_authors: Yueyue Han, Yingyan Huang, Hangcheng Dong, Fengdong Chen, Fa Zeng, Zhitao Peng, Qihua Zhu, Guodong Liu
  • for: 这篇论文旨在解决高功率激光装置中大口径光学元件上激光诱导损伤的在线分割问题。
  • methods: 这篇论文使用了一种弱监督语义分割算法,即连续梯度CAM及其非线性多尺度融合(CG-fusion CAM)。该算法仅需图像级标签,即可生成更精细、激活程度合适的类激活图。
  • results: 实验结果表明,该算法可以达到与全监督算法相当的分割性能。
    Abstract Online segmentation of laser-induced damage on large-aperture optics in high-power laser facilities is challenged by complicated damage morphology, uneven illumination and stray light interference. Fully supervised semantic segmentation algorithms have achieved state-of-the-art performance, but rely on plenty of pixel-level labels, which are time-consuming and labor-consuming to produce. LayerCAM, an advanced weakly supervised semantic segmentation algorithm, can generate pixel-accurate results using only image-level labels, but its scattered and partially under-activated class activation regions degrade segmentation performance. In this paper, we propose a weakly supervised semantic segmentation method with Continuous Gradient CAM and its nonlinear multi-scale fusion (CG-fusion CAM). The method redesigns the way of back-propagating gradients and non-linearly activates the multi-scale fused heatmaps to generate more fine-grained class activation maps with appropriate activation degree for different sizes of damage sites. Experiments on our dataset show that the proposed method can achieve segmentation performance comparable to that of fully supervised algorithms.
    摘要 在高功率激光装置中,大口径光学元件上激光诱导损伤的在线分割面临损伤形态复杂、照明不均匀以及杂散光干扰等挑战。全监督语义分割算法已经达到了最先进的性能,但它们依赖大量像素级标签,而标注这些标签既耗时又费力。LayerCAM是一种先进的弱监督语义分割算法,仅用图像级标签即可生成像素级精度的结果,但其类激活区域零散且部分激活不足,会降低分割性能。在这篇论文中,我们提出了一种基于连续梯度CAM及其非线性多尺度融合(CG-fusion CAM)的弱监督语义分割方法。该方法重新设计了梯度反向传播的方式,并对多尺度融合的热力图进行非线性激活,以生成更精细的类激活图,并为不同尺寸的损伤点提供合适的激活程度。在我们的数据集上的实验表明,所提方法可以达到与全监督算法相当的分割性能。

Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells

  • paper_url: http://arxiv.org/abs/2307.09160
  • repo_url: https://github.com/dive128/dmvsnet
  • paper_authors: Xinyi Ye, Weiyue Zhao, Tianqi Liu, Zihao Huang, Zhiguo Cao, Xin Li
  • for: 本研究旨在提高多视图深度(MVS)方法的准确性和完整性,通过适合的深度几何来提高深度估计的精度。
  • methods: 我们提出了一种基于学习的多视图深度方法,即DUAL-MVSNet,它可以生成oscillating深度平面。技术上,我们预测每个像素两个深度值(双深度),并提出了一种新的损失函数和检查板形式的选择策略来限制预测的深度几何。
  • results: 与现有方法相比,DUAL-MVSNet在DTU benchmark上得到了高排名,并在复杂的场景下(如坦克和寺庐)达到了最高性能,这表明了我们的方法具有强大的表现和泛化能力。我们的方法还指出了考虑深度几何在MVS方面的新研究方向。
    Abstract Learning-based multi-view stereo (MVS) methods deal with predicting accurate depth maps to achieve an accurate and complete 3D representation. Despite the excellent performance, existing methods ignore the fact that a suitable depth geometry is also critical in MVS. In this paper, we demonstrate that different depth geometries have significant performance gaps, even using the same depth prediction error. Therefore, we introduce an ideal depth geometry composed of Saddle-Shaped Cells, whose predicted depth map oscillates upward and downward around the ground-truth surface, rather than maintaining a continuous and smooth depth plane. To achieve it, we develop a coarse-to-fine framework called Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane. Technically, we predict two depth values for each pixel (Dual-Depth), and propose a novel loss function and a checkerboard-shaped selecting strategy to constrain the predicted depth geometry. Compared to existing methods, DMVSNet achieves a high rank on the DTU benchmark and obtains the top performance on challenging scenes of Tanks and Temples, demonstrating its strong performance and generalization ability. Our method also points to a new research direction for considering depth geometry in MVS.
    摘要 基于学习的多视图立体(MVS)方法通过预测精确的深度图来获得准确而完整的三维表示。尽管性能出色,现有方法忽略了合适的深度几何在MVS中同样至关重要这一事实。本文证明,即使深度预测误差相同,不同的深度几何之间也存在显著的性能差距。因此,我们引入一种由鞍形单元组成的理想深度几何:其预测的深度图围绕真实表面上下振荡,而不是保持连续而平滑的深度平面。为此,我们设计了一个由粗到细的框架Dual-MVSNet(DMVSNet),能够生成振荡的深度平面。在技术上,我们为每个像素预测两个深度值(双深度),并提出一种新的损失函数和棋盘状的选择策略来约束预测的深度几何。与现有方法相比,DMVSNet在DTU基准上取得了较高的排名,并在Tanks and Temples的挑战性场景中取得了最佳性能,显示出强大的性能和泛化能力。我们的方法也为在MVS中考虑深度几何指出了新的研究方向。
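A small sketch of the dual-depth idea: the network predicts two depth hypotheses per pixel and a checkerboard pattern decides which hypothesis is used at each location, so the composed map can oscillate around the true surface. DMVSNet's actual selection rule and loss are more involved; this only illustrates the checkerboard composition step under simplifying assumptions.

```python
import torch

def checkerboard_compose(depth_a: torch.Tensor, depth_b: torch.Tensor) -> torch.Tensor:
    """depth_a, depth_b: (B, H, W) dual depth predictions -> composed (B, H, W) map."""
    b, h, w = depth_a.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mask = ((yy + xx) % 2 == 0).to(depth_a.dtype).to(depth_a.device)  # checkerboard
    return mask * depth_a + (1.0 - mask) * depth_b

# Toy usage: two hypotheses slightly above / below a flat ground-truth plane at depth 1.0
gt = torch.ones(1, 8, 8)
upper, lower = gt + 0.02, gt - 0.02
composed = checkerboard_compose(upper, lower)   # oscillates around the surface
print((composed - gt).abs().max())              # 0.02 everywhere, alternating in sign
```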

Class-relation Knowledge Distillation for Novel Class Discovery

  • paper_url: http://arxiv.org/abs/2307.09158
  • repo_url: https://github.com/kleinzcy/cr-kd-ncd
  • paper_authors: Gu Peiyan, Zhang Chuyu, Xu Ruiji, He Xuming
  • for: 本研究目标是无监督学习新类,通过已知类数据来学习未知类。
  • methods: 我们引入了一种基于已知类模型预测的类分布的类关系表示,并使用知识蒸馏框架来正则化新类的学习。我们还开发了一种可学习的加权函数,为新类中的每个数据点自适应地促进知识迁移。
  • results: 我们在多个 benchmark 上进行了广泛的实验,并证明了我们的方法可以与之前的状态时间比对较高。
    Abstract We tackle the problem of novel class discovery, which aims to learn novel classes without supervision based on labeled data from known classes. A key challenge lies in transferring the knowledge in the known-class data to the learning of novel classes. Previous methods mainly focus on building a shared representation space for knowledge transfer and often ignore modeling class relations. To address this, we introduce a class relation representation for the novel classes based on the predicted class distribution of a model trained on known classes. Empirically, we find that such class relation becomes less informative during typical discovery training. To prevent such information loss, we propose a novel knowledge distillation framework, which utilizes our class-relation representation to regularize the learning of novel classes. In addition, to enable a flexible knowledge distillation scheme for each data point in novel classes, we develop a learnable weighting function for the regularization, which adaptively promotes knowledge transfer based on the semantic similarity between the novel and known classes. To validate the effectiveness and generalization of our method, we conduct extensive experiments on multiple benchmarks, including CIFAR100, Stanford Cars, CUB, and FGVC-Aircraft datasets. Our results demonstrate that the proposed method outperforms the previous state-of-the-art methods by a significant margin on almost all benchmarks. Code is available at \href{https://github.com/kleinzcy/Cr-KD-NCD}{here}.
    摘要 我们研究新类发现问题,即基于已知类的标注数据,在无监督条件下学习新类。其关键挑战在于如何将已知类数据中的知识迁移到新类的学习中。以往方法主要集中在构建用于知识迁移的共享表示空间,往往忽略了对类关系的建模。为此,我们基于在已知类上训练的模型所预测的类分布,为新类引入一种类关系表示。我们在实验中发现,在常规的发现训练过程中,这种类关系所包含的信息会逐渐减少。为了避免这种信息损失,我们提出了一种新的知识蒸馏框架,利用类关系表示来正则化新类的学习。此外,我们为正则化项设计了一个可学习的加权函数,根据新类与已知类之间的语义相似度,为新类中的每个数据点自适应地促进知识迁移。为了验证方法的有效性和泛化能力,我们在CIFAR100、Stanford Cars、CUB和FGVC-Aircraft等多个基准上进行了广泛的实验。结果表明,所提方法在几乎所有基准上都以明显优势超越了之前的最先进方法。代码可以在\href{https://github.com/kleinzcy/Cr-KD-NCD}{这里}找到。
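A minimal sketch of the class-relation distillation idea: the known-class head of a pretrained model assigns each novel-class sample a distribution over known classes, and a KL term keeps the current model's known-class predictions for that sample close to it, weighted per sample. The weighting below (cosine similarity between novel features and known-class prototypes) is an illustrative assumption, not the paper's learned weighting function.

```python
import torch
import torch.nn.functional as F

def class_relation_kd(student_known_logits, teacher_known_logits,
                      novel_feat, known_prototypes, tau: float = 1.0):
    """Logits: (B, K_known); novel_feat: (B, D); known_prototypes: (K_known, D)."""
    p_teacher = F.softmax(teacher_known_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_known_logits / tau, dim=1)
    kd = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)  # per sample
    # per-sample weight: similarity of the novel sample to its closest known class
    sim = F.normalize(novel_feat, dim=1) @ F.normalize(known_prototypes, dim=1).t()
    weight = sim.max(dim=1).values.clamp(min=0)
    return (weight * kd).mean()

# Toy usage: 4 samples, 10 known classes, 32-d features
loss = class_relation_kd(torch.randn(4, 10), torch.randn(4, 10),
                         torch.randn(4, 32), torch.randn(10, 32))
```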

MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection

  • paper_url: http://arxiv.org/abs/2307.09155
  • repo_url: None
  • paper_authors: Zewei Lin, Yanqing Shen, Sanping Zhou, Shitao Chen, Nanning Zheng
  • for: 这 paper 的目的是提出一种高性能的跨模态3D物体检测方法,以便更好地利用图像中的信息。
  • methods: 这 paper 使用了多级融合网络(MLF-DET),包括特征级融合和决策级融合两个部分。特征级融合使用多scale voxel图像融合模块(MVI),决策级融合使用轻量级特征引导修正模块(FCR)。此外,paper 还提出了一种有效的数据采样策略,即遮挡对准GT采样(OGS),以增加训练场景中的样本数量,从而降低过拟合。
  • results: EXTENSIVE experiments 表明,我们的方法在 KITTI 数据集上达到了 82.89% 的中等 AP 值,并在不具备特殊功能的情况下达到了领先的性能。
    Abstract In this paper, we propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection, which integrates both the feature-level fusion and decision-level fusion to fully utilize the information in the image. For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features. For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module which further exploits image semantics to rectify the confidence of detection candidates. Besides, we design an effective data augmentation strategy termed Occlusion-aware GT Sampling (OGS) to reserve more sampled objects in the training scenes, so as to reduce overfitting. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our method. Notably, on the extremely competitive KITTI car 3D object detection benchmark, our method reaches 82.89% moderate AP and achieves state-of-the-art performance without bells and whistles.
    摘要 在这篇论文中,我们提出了一种新颖而有效的多级融合网络MLF-DET,用于高性能的跨模态3D目标检测,它结合特征级融合与决策级融合,以充分利用图像中的信息。在特征级融合方面,我们提出了多尺度体素-图像融合模块(MVI),将多尺度体素特征与图像特征进行稠密对齐;在决策级融合方面,我们提出了轻量级的特征引导置信度修正模块(FCR),进一步利用图像语义来修正检测候选框的置信度。此外,我们设计了一种有效的数据增强策略,即遮挡感知GT采样(OGS),在训练场景中保留更多的采样目标,从而减轻过拟合。在KITTI数据集上的大量实验证明了我们方法的有效性:在竞争激烈的KITTI汽车3D目标检测基准上,我们的方法在不借助额外技巧的情况下达到了82.89%的moderate AP,实现了最先进的性能。

OPHAvatars: One-shot Photo-realistic Head Avatars

  • paper_url: http://arxiv.org/abs/2307.09153
  • repo_url: https://github.com/lsx0101/ophavatars
  • paper_authors: Shaoxu Li
  • for: 该paper的目的是创建一种从单个肖像图像 synthesize photo-realistic digital avatars的方法。
  • methods: 该方法使用驱动关键点特征来生成一个粗糙的说话头视频,然后使用扭曲神经辐射场来生成粗糙的说话头模型。通过更新优化的图像,该方法可以重新训练更高质量的模型。
  • results: 在多个对象上的定量和定性研究中,该方法均优于最先进的方法,能够合成照片级真实感的可动画3D神经头部虚拟形象。
    Abstract We propose a method for synthesizing photo-realistic digital avatars from only one portrait as the reference. Given a portrait, our method synthesizes a coarse talking head video using driving keypoint features. With the coarse video, our method then synthesizes a coarse talking head avatar with a deforming neural radiance field. Using rendered images of the coarse avatar, our method updates the low-quality images with a blind face restoration model, and with the updated images we retrain the avatar for higher quality. After several iterations, our method can synthesize a photo-realistic animatable 3D neural head avatar. The motivation of our method is that a deformable neural radiance field can eliminate the unnatural distortion caused by the image2video method. Our method outperforms state-of-the-art methods in quantitative and qualitative studies on various subjects.
    摘要 我们提出一种方法,可以从单一的肖像中生成真实的数字人物。给定一个肖像,我们的方法可以生成一个驱动关键点特征的粗糙说话头视频。然后,我们的方法可以使用扭曲神经辐射场来生成粗糙说话头人物。使用渲染出的粗糙人物图像,我们的方法可以通过盲人脸修复模型来更新低质量图像。经过多次迭代,我们的方法可以生成一个真实的渲染3D神经头人物。我们的方法的动机是使用扭曲神经辐射场可以消除图像2视频方法中的不自然的扭曲。与现状的方法进行比较,我们的方法在量化和质量上都有较高的表现。

Semi-supervised Cycle-GAN for face photo-sketch translation in the wild

  • paper_url: http://arxiv.org/abs/2307.10281
  • repo_url: https://github.com/chaofengc/Face-Sketch-SCG
  • paper_authors: Chaofeng Chen, Wei Liu, Xiao Tan, Kwan-Yee K. Wong
  • for: 本研究旨在提高面图翻译效果,使用深度神经网络和GAN方法。
  • methods: 我们提出了一种带噪声注入策略的半监督方法,称为半循环GAN(SCG)。我们首先利用一个小型照片-素描参考集,为每张输入照片构造伪素描特征表示,并用由此得到的伪配对来监督照片到素描的生成器 $G_{p2s}$。然后,$G_{p2s}$ 的输出可用于以自监督方式训练素描到照片的生成器 $G_{s2p}$。这样,只需一个小型照片-素描配对参考集和一个大规模人脸照片数据集(无需真实素描)即可训练 $G_{p2s}$ 和 $G_{s2p}$。
  • results: 实验结果表明,SCG 可以在公共 bencmark 上达到竞争性的性能,并在野外的图像上得到更加有reasonable的素描-to-图片结果,具有较少的过拟合问题。
    Abstract The performance of face photo-sketch translation has improved a lot thanks to deep neural networks. GAN based methods trained on paired images can produce high-quality results under laboratory settings. Such paired datasets are, however, often very small and lack diversity. Meanwhile, Cycle-GANs trained with unpaired photo-sketch datasets suffer from the \emph{steganography} phenomenon, which makes them not effective to face photos in the wild. In this paper, we introduce a semi-supervised approach with a noise-injection strategy, named Semi-Cycle-GAN (SCG), to tackle these problems. For the first problem, we propose a {\em pseudo sketch feature} representation for each input photo composed from a small reference set of photo-sketch pairs, and use the resulting {\em pseudo pairs} to supervise a photo-to-sketch generator $G_{p2s}$. The outputs of $G_{p2s}$ can in turn help to train a sketch-to-photo generator $G_{s2p}$ in a self-supervised manner. This allows us to train $G_{p2s}$ and $G_{s2p}$ using a small reference set of photo-sketch pairs together with a large face photo dataset (without ground-truth sketches). For the second problem, we show that the simple noise-injection strategy works well to alleviate the \emph{steganography} effect in SCG and helps to produce more reasonable sketch-to-photo results with less overfitting than fully supervised approaches. Experiments show that SCG achieves competitive performance on public benchmarks and superior results on photos in the wild.
    摘要 得益于深度神经网络,人脸照片-素描转换的性能有了很大提升。基于GAN并在成对图像上训练的方法能够在实验室环境下生成高质量的结果,然而这类成对数据集通常很小且缺乏多样性;而使用非成对照片-素描数据训练的Cycle-GAN则受到"隐写"现象的影响,难以有效处理野外的人脸照片。本文提出了一种带噪声注入策略的半监督方法,称为半循环GAN(SCG),以解决这些问题。针对第一个问题,我们利用一个小型照片-素描参考集,为每张输入照片构造"伪素描特征"表示,并用由此得到的"伪配对"来监督照片到素描的生成器 $G_{p2s}$;$G_{p2s}$ 的输出又可用于以自监督方式训练素描到照片的生成器 $G_{s2p}$。这样,只需一个小型照片-素描配对参考集和一个大规模人脸照片数据集(无需真实素描)即可共同训练 $G_{p2s}$ 和 $G_{s2p}$。针对第二个问题,我们证明简单的噪声注入策略能够有效缓解SCG中的"隐写"效应,并比全监督方法产生更合理、过拟合更少的素描到照片结果。实验表明,SCG在公开基准上取得了有竞争力的表现,并在野外照片上取得了更优的结果。

PRO-Face S: Privacy-preserving Reversible Obfuscation of Face Images via Secure Flow

  • paper_url: http://arxiv.org/abs/2307.09146
  • repo_url: None
  • paper_authors: Lin Yuan, Kai Liang, Xiao Pu, Yan Zhang, Jiaxu Leng, Tao Wu, Nannan Wang, Xinbo Gao
  • for: 防止面部隐私泄露
  • methods: 使用逆向神经网络(INN)进行隐私保护,并在网络中注入密钥以确保原始图像仅可以通过同一模型和正确的密钥进行恢复。
  • results: 对于多个公共图像集进行了广泛的实验,证明提议的框架在对多种现有方法的比较中具有超越性。
    Abstract This paper proposes a novel paradigm for facial privacy protection that unifies multiple characteristics including anonymity, diversity, reversibility and security within a single lightweight framework. We name it PRO-Face S, short for Privacy-preserving Reversible Obfuscation of Face images via Secure flow-based model. In the framework, an Invertible Neural Network (INN) is utilized to process the input image along with its pre-obfuscated form, and generate the privacy protected image that visually approximates the pre-obfuscated one, thus ensuring privacy. The pre-obfuscation applied can take diversified forms with different strengths and styles specified by users. Alongside protection, a secret key is injected into the network such that the original image can only be recovered from the protected image via the same model given the correct key. Two modes of image recovery are devised to deal with malicious recovery attempts in different scenarios. Finally, extensive experiments conducted on three public image datasets demonstrate the superiority of the proposed framework over multiple state-of-the-art approaches.
    摘要 本文提出了一种新的人脸隐私保护范式,在单一的轻量级框架内统一了匿名性、多样性、可逆性和安全性等多种特性,我们将其命名为PRO-Face S(基于安全流模型的人脸图像隐私保护可逆混淆)。在该框架中,利用可逆神经网络(INN)处理输入图像及其预混淆版本,生成在视觉上接近预混淆图像的隐私保护图像,从而保证隐私。预混淆可以采用由用户指定的不同强度和风格的多种形式。在保护的同时,向网络中注入一个密钥,只有在提供正确密钥的情况下,才能通过同一模型从保护图像中恢复原始图像。针对不同场景下的恶意恢复企图,我们设计了两种图像恢复模式。最后,在三个公开图像数据集上的大量实验表明,所提框架优于多种最先进方法。

MVA2023 Small Object Detection Challenge for Spotting Birds: Dataset, Methods, and Results

  • paper_url: http://arxiv.org/abs/2307.09143
  • repo_url: https://github.com/iim-ttij/mva2023smallobjectdetection4spottingbirds
  • paper_authors: Yuki Kondo, Norimichi Ukita, Takayuki Yamaguchi, Hao-Yu Hou, Mu-Yi Shen, Chia-Chi Hsu, En-Ming Huang, Yu-Chen Huang, Yu-Cheng Xia, Chien-Yao Wang, Chun-Yi Lee, Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama, Takahiro Komamizu, Ichiro Ide, Yosuke Shinya, Xinyao Liu, Guang Liang, Syusuke Yasui
  • for: 本研究旨在提供一个新的小对象检测数据集,用于鸟类检测。
  • methods: 本研究使用了一种新的小对象检测方法,包括提出了一个新的数据集。
  • results: 本研究的实验结果显示,这种新的检测方法可以准确地检测到远距离的鸟类。
    Abstract Small Object Detection (SOD) is an important machine vision topic because (i) a variety of real-world applications require object detection for distant objects and (ii) SOD is a challenging task due to the noisy, blurred, and less-informative image appearances of small objects. This paper proposes a new SOD dataset consisting of 39,070 images including 137,121 bird instances, which is called the Small Object Detection for Spotting Birds (SOD4SB) dataset. The detail of the challenge with the SOD4SB dataset is introduced in this paper. In total, 223 participants joined this challenge. This paper briefly introduces the award-winning methods. The dataset, the baseline code, and the website for evaluation on the public testset are publicly available.
    摘要 小目标检测(SOD)是机器视觉中的重要课题,因为(i)许多现实世界应用需要检测远距离物体,(ii)SOD是一项具有挑战性的任务,小目标在图像中往往带有噪声、模糊且信息量不足。本文提出了一个新的SOD数据集,包括39,070张图像和137,121个鸟类实例,称为用于鸟类检测的小目标检测(SOD4SB)数据集。本文介绍了基于SOD4SB数据集的挑战赛的细节。总共有223名参与者参加了这项挑战。本文还简要介绍了获奖方法。数据集、基线代码以及用于在公开测试集上评估的网站均已公开。

Light-Weight Vision Transformer with Parallel Local and Global Self-Attention

  • paper_url: http://arxiv.org/abs/2307.09120
  • repo_url: None
  • paper_authors: Nikolas Ebert, Laurenz Reichardt, Didier Stricker, Oliver Wasenmüller
  • for: 这个研究的目的是将现代computer vision中的Transformer架构重新设计为适合具有有限资源的硬件上进行自动驾驶任务,并且可以在实时性要求下执行。
  • methods: 我们在这项研究中提出了多种改进PLG-ViT架构的方法,以减少其参数量和浮点运算量:我们识别出原始PLG-ViT架构中计算代价较高的模块,并提出若干重新设计以减少这些模块的参数量和浮点运算量。
  • results: 我们的研究取得了以下结果:我们将PLG-ViT架构缩小到原来的1/5,性能仅有适度下降;仅用500万个参数即可在ImageNet-1K分类基准上取得79.5%的Top-1准确率;我们的网络在COCO实例分割等通用视觉基准上也表现出色。此外,我们还进行了一系列实验,展示了该方法在自动驾驶和交通领域各类任务中的应用潜力。
    Abstract While transformer architectures have dominated computer vision in recent years, these models cannot easily be deployed on hardware with limited resources for autonomous driving tasks that require real-time performance. Their computational complexity and memory requirements limit their use, especially for applications with high-resolution inputs. In our work, we redesign the powerful state-of-the-art Vision Transformer PLG-ViT into a much more compact and efficient architecture that is suitable for such tasks. We identify computationally expensive blocks in the original PLG-ViT architecture and propose several redesigns aimed at reducing the number of parameters and floating-point operations. As a result of our redesign, we are able to reduce PLG-ViT in size by a factor of 5, with a moderate drop in performance. We propose two variants, optimized for the best trade-offs between parameter count and runtime as well as parameter count and accuracy. With only 5 million parameters, we achieve 79.5$\%$ top-1 accuracy on the ImageNet-1K classification benchmark. Our networks demonstrate great performance on general vision benchmarks like COCO instance segmentation. In addition, we conduct a series of experiments, demonstrating the potential of our approach in solving various tasks specifically tailored to the challenges of autonomous driving and transportation.
    摘要 尽管transformer架构近年来在计算机视觉领域占据主导地位,但这些模型难以部署在资源受限的硬件上,无法满足需要实时性能的自动驾驶任务。它们的计算复杂度和内存需求限制了其使用,特别是在高分辨率输入的应用中。在我们的工作中,我们将强大的最先进视觉transformer PLG-ViT重新设计为更加紧凑高效、适合此类任务的架构。我们识别出原始PLG-ViT架构中计算代价较高的模块,并提出多种改进以减少参数量和浮点运算量。经过重新设计,我们将PLG-ViT的规模缩小了5倍,性能仅有适度下降。我们提出了两个变体,分别针对参数量与运行时间、参数量与精度之间的最佳权衡进行优化。仅用500万个参数,我们在ImageNet-1K分类基准上取得了79.5%的Top-1准确率。我们的网络在COCO实例分割等通用视觉基准上也表现出色。此外,我们还进行了一系列实验,展示了我们的方法在解决针对自动驾驶与交通领域挑战的各类任务中的潜力。

NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF

  • paper_url: http://arxiv.org/abs/2307.09112
  • repo_url: None
  • paper_authors: Stefan Lionar, Xiangyu Xu, Min Lin, Gim Hee Lee
  • for: 单视图RGB-D输入的3D重建 tasks(3D reconstruction from single-view RGB-D inputs)
  • methods: 提出了一种新方法 called NU-MCC,包括两个关键创新:一个邻域解码器和一个排斥未分配距离函数(Repulsive UDF)
  • results: 实验结果表明,NU-MCC能够学习强大的3D表示,显著推进了单视图3D重建的最新水平;与MCC相比,NU-MCC在CO3D-v2数据集上的F1分数提高9.7%,运行速度超过MCC的5倍。
    Abstract Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC is the current state-of-the-art method in this field, which achieves unprecedented success by combining vision Transformers with large-scale training. However, we identified two key limitations of MCC: 1) The Transformer decoder is inefficient in handling large number of query points; 2) The 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy of input visual features, allowing each query point to only attend to a small neighborhood. This design not only results in much faster inference speed but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs that suffer from holes in results, our proposed Repulsive UDF can achieve more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. Particularly, it outperforms MCC by 9.7% in terms of the F1-score on the CO3D-v2 dataset with more than 5x faster running speed.
    摘要 从单视角RGB-D输入进行3D重建已经取得了显著进展。MCC是该领域当前最先进的方法,通过将视觉Transformer与大规模训练相结合取得了前所未有的成功。然而,我们发现MCC存在两个关键局限:1)Transformer解码器在处理大量查询点时效率不高;2)其3D表示难以恢复高保真的细节。在这篇论文中,我们提出了一种名为NU-MCC的新方法,它包含两项关键创新:邻域解码器和排斥性无符号距离函数(Repulsive UDF)。首先,邻域解码器引入中心点作为输入视觉特征的高效代理,使每个查询点只需关注一个小邻域。这一设计不仅大幅提升了推理速度,还能利用更细粒度的视觉特征来改善3D纹理的恢复。其次,Repulsive UDF是MCC所用占据场的一种新型替代方案,显著提升了3D物体重建的质量:相比于结果中容易出现空洞的标准UDF,我们提出的Repulsive UDF能够实现更完整的表面重建。实验结果显示,NU-MCC能够学习强大的3D表示,显著推进了单视角3D重建的最新水平;特别地,它在CO3D-v2数据集上的F1分数比MCC高出9.7%,运行速度快5倍以上。

Mining of Single-Class by Active Learning for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.09109
  • repo_url: None
  • paper_authors: Hugues Lambert, Emma Slade
  • for: 这 paper 是为了提出一种新的活动学习(AL)策略,帮助寻找最有用的样本,并增加模型的性能。
  • methods: 这 paper 使用了深度强化学习来构建一个 AL 策略,并利用量精度相关性来建立高性能模型。
  • results: 这 paper 的结果表明,MiSiCAL 能够在 COCO10k 中的 150 个类中超越随机策略,而最强的基eline 只能在 101 个类中超越随机策略。
    Abstract Several Active Learning (AL) policies require retraining a target model several times in order to identify the most informative samples and rarely offer the option to focus on the acquisition of samples from underrepresented classes. Here the Mining of Single-Class by Active Learning (MiSiCAL) paradigm is introduced where an AL policy is constructed through deep reinforcement learning and exploits quantity-accuracy correlations to build datasets on which high-performance models can be trained with regards to specific classes. MiSiCAL is especially helpful in the case of very large batch sizes since it does not require repeated model training sessions as is common in other AL methods. This is thanks to its ability to exploit fixed representations of the candidate data points. We find that MiSiCAL is able to outperform a random policy on 150 out of 171 COCO10k classes, while the strongest baseline only outperforms random on 101 classes.
    摘要 许多主动学习(AL)策略需要多次重新训练目标模型以确定最有信息量的样本,而且很少提供侧重于采集代表性不足类别样本的选项。在这里,我们提出了基于主动学习的单类挖掘(MiSiCAL)范式:通过深度强化学习构建AL策略,并利用数量-精度相关性来构建数据集,使得针对特定类别能够在其上训练出高性能模型。MiSiCAL在批量非常大的情况下尤其有用,因为它不像其他AL方法那样需要反复的模型训练过程,这得益于它能够利用候选数据点的固定表示。我们发现,MiSiCAL在COCO10k的171个类别中有150个优于随机策略,而最强的基线仅在101个类别上优于随机策略。

Division Gets Better: Learning Brightness-Aware and Detail-Sensitive Representations for Low-Light Image Enhancement

  • paper_url: http://arxiv.org/abs/2307.09104
  • repo_url: None
  • paper_authors: Huake Wang, Xiaoyang Yan, Xingsong Hou, Junhui Li, Yujie Dun, Kaibing Zhang
  • for: 提高低光照图像的对比度、颜色和纹理的Restoration
  • methods: 提出了一种基于两个分支网络的新方法LCDBNet,其中一个分支网络负责调整亮度,另一个分支网络负责修复颜色和纹理。
  • results: 对七个标准测试集进行了广泛的实验,结果显示LCDBNet的表现比其他当前领先方法更佳,并且在多种参考/非参考质量评价指标中获得了更高的分数。
    Abstract Low-light image enhancement strives to improve the contrast, adjust the visibility, and restore the distortion in color and texture. Existing methods usually pay more attention to improving the visibility and contrast via increasing the lightness of low-light images, while disregarding the significance of color and texture restoration for high-quality images. Against above issue, we propose a novel luminance and chrominance dual branch network, termed LCDBNet, for low-light image enhancement, which divides low-light image enhancement into two sub-tasks, e.g., luminance adjustment and chrominance restoration. Specifically, LCDBNet is composed of two branches, namely luminance adjustment network (LAN) and chrominance restoration network (CRN). LAN takes responsibility for learning brightness-aware features leveraging long-range dependency and local attention correlation. While CRN concentrates on learning detail-sensitive features via multi-level wavelet decomposition. Finally, a fusion network is designed to blend their learned features to produce visually impressive images. Extensive experiments conducted on seven benchmark datasets validate the effectiveness of our proposed LCDBNet, and the results manifest that LCDBNet achieves superior performance in terms of multiple reference/non-reference quality evaluators compared to other state-of-the-art competitors. Our code and pretrained model will be available.
    摘要 低光照图像增强旨在提高对比度、改善可见度,并恢复颜色和纹理方面的失真。现有方法通常更关注通过提升亮度来提高低光照图像的可见度和对比度,而忽视了颜色和纹理恢复对高质量图像的重要性。针对这一问题,我们提出了一种新的亮度与色度双分支网络(LCDBNet),用于低光照图像增强,它将低光照图像增强划分为亮度调整和色度恢复两个子任务。具体来说,LCDBNet由两个分支组成:亮度调整网络(LAN)和色度恢复网络(CRN)。LAN利用长程依赖和局部注意力相关性学习亮度感知特征,而CRN则通过多级小波分解学习细节敏感特征。最后,我们设计了一个融合网络,将两者学到的特征加以融合,生成视觉效果出色的图像。我们在七个基准数据集上进行了广泛的实验,结果表明LCDBNet在多种有参考/无参考质量评价指标上均优于其他最先进方法。我们的代码和预训练模型将会公开。
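A compact sketch of the dual-branch idea: the input is split into a luminance channel and two chrominance channels, one sub-network adjusts brightness on the luminance map, another restores colour and detail on the chrominance maps, and a fusion head recombines them. The RGB-to-YCbCr split, the tiny convolutional branches and the fusion layer are illustrative stand-ins for LAN, CRN and the fusion network, not the actual architecture.

```python
import torch
import torch.nn as nn

def rgb_to_ycbcr(x: torch.Tensor) -> torch.Tensor:
    r, g, b = x[:, 0:1], x[:, 1:2], x[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b          # BT.601 luminance
    cb = 0.5 + (b - y) * 0.564
    cr = 0.5 + (r - y) * 0.713
    return torch.cat([y, cb, cr], dim=1)

def small_branch(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(32, out_ch, 3, padding=1))

class DualBranchEnhancer(nn.Module):
    def __init__(self):
        super().__init__()
        self.lum_branch = small_branch(1, 1)    # brightness adjustment
        self.chrom_branch = small_branch(2, 2)  # colour / detail restoration
        self.fuse = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        ycc = rgb_to_ycbcr(rgb)
        y_hat = self.lum_branch(ycc[:, 0:1]) + ycc[:, 0:1]    # residual luminance
        c_hat = self.chrom_branch(ycc[:, 1:3]) + ycc[:, 1:3]  # residual chrominance
        return torch.sigmoid(self.fuse(torch.cat([y_hat, c_hat], dim=1)))

out = DualBranchEnhancer()(torch.rand(1, 3, 64, 64))  # enhanced image in [0, 1]
```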

Reclaiming the Horizon: Novel Visualization Designs for Time-Series Data with Large Value Ranges

  • paper_url: http://arxiv.org/abs/2307.10278
  • repo_url: None
  • paper_authors: Daniel Braun, Rita Borgo, Max Sondag, Tatiana von Landesberger
  • for: 支持实践者在具有很大值范围(跨越多个数量级)的时间序列数据上完成辨识和区分任务。
  • methods: 提出了两种新的可视化设计:第一种是数量级地平线图,它是经典地平线图的扩展;第二种是数量级折线图,它改编自对数折线图。这两种新设计能够让实践者更好地可视化很大的值范围。
  • results: 在时间序列分析和大值范围可视化常用的四类任务(辨识、区分、估计和趋势检测)中,新的数量级地平线图表现优于或不逊于所有其他设计;仅在趋势检测任务中,更传统的地平线图表现更好。结果与具体领域无关,只要求时间序列数据具有很大的值范围。
    Abstract We introduce two novel visualization designs to support practitioners in performing identification and discrimination tasks on large value ranges (i.e., several orders of magnitude) in time-series data: (1) The order of magnitude horizon graph, which extends the classic horizon graph; and (2) the order of magnitude line chart, which adapts the log-line chart. These new visualization designs visualize large value ranges by explicitly splitting the mantissa m and exponent e of a value v = m * 10e . We evaluate our novel designs against the most relevant state-of-the-art visualizations in an empirical user study. It focuses on four main tasks commonly employed in the analysis of time-series and large value ranges visualization: identification, discrimination, estimation, and trend detection. For each task we analyse error, confidence, and response time. The new order of magnitude horizon graph performs better or equal to all other designs in identification, discrimination, and estimation tasks. Only for trend detection tasks, the more traditional horizon graphs reported better performance. Our results are domain-independent, only requiring time-series data with large value ranges.
    摘要 我们提出了两种新的可视化设计,用于支持实践者在具有很大值范围(跨越多个数量级)的时间序列数据上完成辨识和区分任务:(1)数量级地平线图,它是经典地平线图的扩展;(2)数量级折线图,它改编自对数折线图。这两种新设计通过将每个值 v = m * 10^e 显式地拆分为尾数 m 和指数 e 来可视化很大的值范围。我们在一项实验性用户研究中,将新设计与最相关的现有可视化进行了对比,研究涵盖时间序列分析与大值范围可视化中常用的四类任务:辨识、区分、估计和趋势检测。对每个任务,我们分析了错误率、置信度和响应时间。新的数量级地平线图在辨识、区分和估计任务中表现优于或不逊于所有其他设计;仅在趋势检测任务中,更传统的地平线图表现更好。我们的结果与具体领域无关,只要求时间序列数据具有很大的值范围。
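A short worked example of the value decomposition v = m * 10^e that both designs visualize explicitly: the exponent gives the order of magnitude, the mantissa the position within it. This is just the arithmetic, independent of any plotting library; the helper name is illustrative.

```python
import math

def split_order_of_magnitude(v: float) -> tuple[float, int]:
    if v == 0:
        return 0.0, 0
    e = math.floor(math.log10(abs(v)))
    return v / 10 ** e, e           # mantissa in [1, 10), signed like v

for v in [3.2, 47_000, 0.00091, -520.0]:
    m, e = split_order_of_magnitude(v)
    print(f"{v:>12g} = {m:.3f} * 10^{e}")
```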

PixelHuman: Animatable Neural Radiance Fields from Few Images

  • paper_url: http://arxiv.org/abs/2307.09070
  • repo_url: None
  • paper_authors: Gyumin Shim, Jaeseong Lee, Junha Hyung, Jaegul Choo
  • for: 这 paper 是为了实现从几个人像中生成可动人景的 novel 模型。
  • methods: 该方法使用 neural radiance field 和 pose-aware pixel-aligned features,通过数据驱动的方式学习了折叠场景,以实现从几个不同的视图和姿势中生成可动人景。
  • results: 实验结果显示,该方法可以在多视图和新姿势synthesis中实现state-of-the-art表现,只需要几个图像来训练。
    Abstract In this paper, we propose PixelHuman, a novel human rendering model that generates animatable human scenes from a few images of a person with unseen identity, views, and poses. Previous works have demonstrated reasonable performance in novel view and pose synthesis, but they rely on a large number of images to train and are trained per scene from videos, which requires a significant amount of time to produce animatable scenes from unseen human images. Our method differs from existing methods in that it can generalize to any input image for animatable human synthesis. Given a random pose sequence, our method synthesizes each target scene using a neural radiance field that is conditioned on a canonical representation and pose-aware pixel-aligned features, both of which can be obtained through deformation fields learned in a data-driven manner. Our experiments show that our method achieves state-of-the-art performance in multiview and novel pose synthesis from few-shot images.
    摘要 在这篇论文中,我们提出了PixelHuman,一种新的人体渲染模型,它可以从某个人物的几张图像出发,生成可动画的人体场景,即使该人物的身份、视角和姿势都未曾见过。先前的工作已经在新视角和新姿势合成方面取得了不错的性能,但它们依赖大量图像进行训练,且需要基于视频逐场景训练,因此要从未见过的人物图像生成可动画场景需要花费大量时间。我们的方法与现有方法不同,能够对任意输入图像进行可动画人体合成。给定一个随机姿势序列,我们的方法使用神经辐射场来合成每个目标场景,该辐射场以规范表示和姿态感知的像素对齐特征为条件,而这两者都可以通过以数据驱动方式学习的形变场获得。实验表明,我们的方法在基于少量图像的多视角与新姿势合成中达到了最先进的性能。

Evaluate Fine-tuning Strategies for Fetal Head Ultrasound Image Segmentation with U-Net

  • paper_url: http://arxiv.org/abs/2307.09067
  • repo_url: https://github.com/13204942/ft_methods_for_fetal_head_segmentation
  • paper_authors: Fangyijie Wang, Guénolé Silvestre, Kathleen M. Curran
  • for: 这项研究的目的是提高胎头围(HC)测量的效率,以便更好地监测胎儿生长。
  • methods: 我们提出了一种以轻量级MobileNet作为编码器、对U-Net网络进行微调(FT)的方法来完成胎头分割,只需训练有限的参数,即可获得与从头训练的网络模型相当的性能。
  • results: 我们发现,该FT方法在可训练参数量减少85.8%的情况下,取得了与从头训练相当的分割性能,并且在可训练参数量低于440万的各种策略中表现最佳。这表明我们的FT方法能够满足胎头分割的需求,并适应实际应用中的资源限制。
    Abstract Fetal head segmentation is a crucial step in measuring the fetal head circumference (HC) during gestation, an important biometric in obstetrics for monitoring fetal growth. However, manual biometry generation is time-consuming and results in inconsistent accuracy. To address this issue, convolutional neural network (CNN) models have been utilized to improve the efficiency of medical biometry. However, training a CNN from scratch is a challenging task, so we propose a Transfer Learning (TL) method. Our approach involves fine-tuning (FT) a U-Net network with a lightweight MobileNet as the encoder to perform segmentation on a set of fetal head ultrasound (US) images with limited effort. This method addresses the challenges associated with training a CNN from scratch. Our results suggest that the proposed FT strategy yields segmentation performance comparable to training from scratch while reducing the number of trainable parameters by 85.8%, and that it outperforms other strategies whose trainable parameter sizes are below 4.4 million. Thus, we contend that it can serve as a dependable FT approach for reducing model size in medical image analysis. Our key findings highlight the importance of the balance between model performance and size when developing Artificial Intelligence (AI) applications with TL methods. Code is available at https://github.com/13204942/FT_Methods_for_Fetal_Head_Segmentation.
    摘要 胎头分割是测量孕期胎头围(HC)的关键步骤,而胎头围是产科中监测胎儿生长的重要生物测量指标。然而,人工测量既耗时,准确性也不稳定。为了解决这一问题,卷积神经网络(CNN)模型已被用于提高医学生物测量的效率,但从零开始训练CNN网络是一项困难的任务,因此我们提出了迁移学习(TL)方法。我们的方法以轻量级MobileNet作为编码器,对U-Net网络进行微调(FT),从而以较小的代价在一组胎头超声(US)图像上完成分割,解决了从零开始训练CNN所面临的挑战。结果表明,我们提出的FT策略在可训练参数量减少85.8%的情况下,取得了与从零训练相当的分割性能,并且优于其他可训练参数量低于440万的策略。因此,我们认为它可以作为医学图像分析中缩减模型规模的一种可靠FT方法。我们的关键发现强调了在用TL方法开发人工智能应用时,模型性能与规模之间平衡的重要性。代码可以在https://github.com/13204942/FT_Methods_for_Fetal_Head_Segmentation中找到。
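A minimal sketch of the fine-tuning setup described above: a pretrained MobileNetV2 serves as the encoder of a small encoder-decoder segmentation network, the encoder weights are frozen, and only the lightweight decoder is trained. The decoder below is a deliberately simple stand-in (the paper fine-tunes a U-Net-style decoder with skip connections), so treat this as an illustration of the transfer-learning strategy only; it also assumes a recent torchvision.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetSeg(nn.Module):
    def __init__(self, num_classes: int = 1):
        super().__init__()
        # pretrained encoder; output is (B, 1280, H/32, W/32)
        self.encoder = mobilenet_v2(weights="DEFAULT").features
        self.decoder = nn.Sequential(
            nn.Conv2d(1280, 256, 1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
            nn.Conv2d(256, num_classes, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MobileNetSeg()
for p in model.encoder.parameters():          # freeze the pretrained encoder
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.2f} M")   # only the decoder is trained

optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
logits = model(torch.randn(2, 3, 224, 224))   # (2, 1, 224, 224) head-mask logits
```

Freezing the encoder is what keeps the trainable parameter count small; unfreezing the last encoder blocks is the usual next step when a little more capacity is needed.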

PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification

  • paper_url: http://arxiv.org/abs/2307.09066
  • repo_url: https://github.com/keepgoingjkg/patchct
  • paper_authors: Miaoge Li, Dongsheng Wang, Xinyang Liu, Zequn Zeng, Ruiying Lu, Bo Chen, Mingyuan Zhou
  • for: 这个论文目的是提出一种基于 Conditional Transport (CT) 理论的多标签图像分类方法,以实现更好地利用图像和标签Semantic Space的互动。
  • methods: 该方法使用 CT 理论来衔接图像与标签域的语义空间,并通过定义前向和后向导航器来学习并对齐这两个语义集合。
  • results: 根据实验结果,提出的方法在三个公共图像benchmark上 consistently outperform了之前的方法。
    Abstract Multi-label image classification is a prediction task that aims to identify more than one label from a given image. This paper considers the semantic consistency of the latent space between the visual patch and linguistic label domains and introduces the conditional transport (CT) theory to bridge the acknowledged gap. While recent cross-modal attention-based studies have attempted to align such two representations and achieved impressive performance, they required carefully-designed alignment modules and extra complex operations in the attention computation. We find that by formulating the multi-label classification as a CT problem, we can exploit the interactions between the image and label efficiently by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into the modality-specific encoders, we view each image as a mixture of patch embeddings and a mixture of label embeddings, which capture the local region features and the class prototypes, respectively. CT is then employed to learn and align those two semantic sets by defining the forward and backward navigators. Importantly, the defined navigators in CT distance model the similarities between patches and labels, which provides an interpretable tool to visualize the learned prototypes. Extensive experiments on three public image benchmarks show that the proposed model consistently outperforms the previous methods.
    摘要 多标签图像分类是一项旨在从给定图像中识别多个标签的预测任务。本文考虑了视觉图像块与语言标签两个域的潜在空间之间的语义一致性,并引入条件传输(CT)理论来弥合二者之间公认的差距。尽管最近基于跨模态注意力的研究尝试对齐这两种表示并取得了出色的表现,但它们需要精心设计的对齐模块以及注意力计算中额外的复杂操作。我们发现,将多标签分类建模为CT问题,可以通过最小化双向CT代价来高效利用图像与标签之间的交互。具体来说,在将图像和文本标签送入各自模态的编码器后,我们将每幅图像视为一组图像块嵌入的混合和一组标签嵌入的混合,二者分别刻画局部区域特征和类别原型。随后,通过定义前向和后向导航器,利用CT来学习并对齐这两个语义集合。重要的是,CT距离中定义的导航器建模了图像块与标签之间的相似性,为可视化所学原型提供了可解释的工具。我们在三个公开图像基准上进行了广泛的实验,结果显示所提方法持续优于之前的方法。
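A rough sketch of a bidirectional conditional-transport cost between a set of patch embeddings and a set of label embeddings: the forward navigator assigns each patch a softmax distribution over the image's labels, the backward navigator does the reverse, and both weight a point-wise cost (here 1 minus cosine similarity). This is a simplified reading of the CT objective; the temperature, the cost choice and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_ct_cost(patches: torch.Tensor, labels: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """patches: (N, D) patch embeddings; labels: (M, D) embeddings of the present labels."""
    p = F.normalize(patches, dim=1)
    l = F.normalize(labels, dim=1)
    sim = p @ l.t()                               # (N, M) similarities
    cost = 1.0 - sim                              # point-wise transport cost
    forward_plan = F.softmax(sim / tau, dim=1)    # each patch -> distribution over labels
    backward_plan = F.softmax(sim / tau, dim=0)   # each label -> distribution over patches
    forward_cost = (forward_plan * cost).sum(dim=1).mean()
    backward_cost = (backward_plan * cost).sum(dim=0).mean()
    return forward_cost + backward_cost

loss = bidirectional_ct_cost(torch.randn(49, 256), torch.randn(3, 256))
```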

Learning Adaptive Neighborhoods for Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09065
  • repo_url: None
  • paper_authors: Avishkar Saha, Oscar Mendez, Chris Russell, Richard Bowden
  • for: 本 paper 是为了提出一种可 diferenciable 图结构生成器,帮助GCNs在图结构数据上进行端到端学习。
  • methods: 本 paper 使用了一种novel end-to-end differentiable graph generator,该模块可以将图结构学习到GCNs中,并且可以将每个节点的邻居和大小选择为其自己。
  • results: 本 paper 的实验结果表明,该模块可以在多种dataset和GCN背景下提高结果的准确率,并且可以与其他结构学习方法相比。
    Abstract Graph convolutional networks (GCNs) enable end-to-end learning on graph structured data. However, many works assume a given graph structure. When the input graph is noisy or unavailable, one approach is to construct or learn a latent graph structure. These methods typically fix the choice of node degree for the entire graph, which is suboptimal. Instead, we propose a novel end-to-end differentiable graph generator which builds graph topologies where each node selects both its neighborhood and its size. Our module can be readily integrated into existing pipelines involving graph convolution operations, replacing the predetermined or existing adjacency matrix with one that is learned, and optimized, as part of the general objective. As such it is applicable to any GCN. We integrate our module into trajectory prediction, point cloud classification and node classification pipelines resulting in improved accuracy over other structure-learning methods across a wide range of datasets and GCN backbones.
    摘要 图卷积网络(GCNs)使得在图结构数据上进行端到端学习成为可能。然而,许多工作假设图结构是预先给定的;当输入图带有噪声或不可获得时,一种做法是构建或学习潜在的图结构。这些方法通常为整张图固定节点度的选择,而这并非最优。与此不同,我们提出了一种新颖的端到端可微分图生成器,在构建图拓扑时让每个节点同时选择自己的邻域及其大小。我们的模块可以方便地集成到任何包含图卷积操作的现有流程中,用一个作为总体目标的一部分被学习和优化的邻接矩阵取代预先确定或已有的邻接矩阵,因此适用于任何GCN。我们将该模块集成到轨迹预测、点云分类和节点分类流程中,在广泛的数据集和GCN骨干网络上取得了优于其他结构学习方法的准确率。

Deep learning for unsupervised domain adaptation in medical imaging: Recent advancements and future perspectives

  • paper_url: http://arxiv.org/abs/2308.01265
  • repo_url: None
  • paper_authors: Suruchi Kumari, Pravendra Singh
  • for: 本研究写作的目的是对医疗影像领域内的深度学习方法进行评论和概述,尤其是在过去几年内的不监督领域适应(UDA)技术发展。
  • methods: 本研究主要探讨了医疗影像领域内的不监督领域适应技术,包括特征对焦、影像转换、自我监督、分离表示方法等。
  • results: 本研究给出了医疗影像领域内不监督领域适应技术的综观和评论,包括六种不同类型的方法,以及各自的数据集使用情况。
    Abstract Deep learning has demonstrated remarkable performance across various tasks in medical imaging. However, these approaches primarily focus on supervised learning, assuming that the training and testing data are drawn from the same distribution. Unfortunately, this assumption may not always hold true in practice. To address these issues, unsupervised domain adaptation (UDA) techniques have been developed to transfer knowledge from a labeled domain to a related but unlabeled domain. In recent years, significant advancements have been made in UDA, resulting in a wide range of methodologies, including feature alignment, image translation, self-supervision, and disentangled representation methods, among others. In this paper, we provide a comprehensive literature review of recent deep UDA approaches in medical imaging from a technical perspective. Specifically, we categorize current UDA research in medical imaging into six groups and further divide them into finer subcategories based on the different tasks they perform. We also discuss the respective datasets used in the studies to assess the divergence between the different domains. Finally, we discuss emerging areas and provide insights and discussions on future research directions to conclude this survey.
    摘要 深度学习在医疗影像的多种任务中表现出色。然而,这些方法主要基于监督学习,假设训练数据和测试数据来自同一分布,而这一假设在实践中往往并不成立。为解决这一问题,无监督领域自适应(UDA)技术被用于将知识从有标注的域迁移到相关但无标注的域。近年来,UDA取得了显著进展,涌现出特征对齐、图像翻译、自监督、解耦表示等多种方法。在这篇论文中,我们从技术角度对医疗影像领域近期的深度UDA方法进行了全面的文献综述。具体而言,我们将当前医疗影像UDA研究分为六个类别,并根据其执行的不同任务进一步划分为更细的子类别。我们还讨论了各研究所使用的数据集,以评估不同域之间的差异。最后,我们讨论了新兴方向,并就未来研究方向给出见解与讨论,以此结束本综述。

Outlier-Robust Tensor Low-Rank Representation for Data Clustering

  • paper_url: http://arxiv.org/abs/2307.09055
  • repo_url: None
  • paper_authors: Tong Wu
  • for: 本文针对受到异常值或样本特有损坏污染的张量数据,研究其恢复与聚类分析。
  • methods: 本文提出了一种基于张量奇异值分解(t-SVD)代数框架的对异常值鲁棒的张量低秩表示(OR-TLRR)方法,用于同时进行异常值检测和张量数据聚类。
  • results: 在合成数据和真实数据上的实验结果显示,OR-TLRR能够有效地恢复受异常值污染的张量数据并完成聚类,其扩展版本还能处理部分数据缺失的情形。
    Abstract Low-rank tensor analysis has received widespread attention with many practical applications. However, the tensor data are often contaminated by outliers or sample-specific corruptions. How to recover the tensor data that are corrupted by outliers and perform data clustering remains a challenging problem. This paper develops an outlier-robust tensor low-rank representation (OR-TLRR) method for simultaneous outlier detection and tensor data clustering based on the tensor singular value decomposition (t-SVD) algebraic framework. It is motivated by the recently proposed tensor-tensor product induced by invertible linear transforms that satisfy certain conditions. For tensor observations with arbitrary outlier corruptions, OR-TLRR has provable performance guarantee for exactly recovering the row space of clean data and detecting outliers under mild conditions. Moreover, an extension of OR-TLRR is also proposed to handle the case when parts of the data are missing. Finally, extensive experimental results on both synthetic and real data demonstrate the effectiveness of the proposed algorithms.
    摘要 低秩张量分析已受到广泛关注,并有许多实际应用。然而,张量数据经常受到异常值或样本特有损坏的污染;如何恢复被异常值污染的张量数据并进行聚类,仍是一个具有挑战性的问题。本文基于张量奇异值分解(t-SVD)代数框架,提出了一种对异常值鲁棒的张量低秩表示(OR-TLRR)方法,用于同时进行异常值检测和张量数据聚类。该方法受最近提出的、由满足特定条件的可逆线性变换诱导的张量-张量积的启发。对于带有任意异常值污染的张量观测,OR-TLRR在较温和的条件下具有可证明的性能保证,能够准确恢复干净数据的行空间并检测异常值。此外,我们还提出了OR-TLRR的扩展,以处理部分数据缺失的情形。最后,在合成数据和真实数据上的大量实验验证了所提算法的有效性。
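A short sketch of the tensor-tensor product (t-product) underlying the t-SVD framework mentioned above, in its classical FFT form: transform both tensors along the third mode, multiply the frontal slices face by face, and transform back. OR-TLRR builds on generalizations of this product to other invertible linear transforms; this snippet only shows the standard FFT instance and one common definition of the tensor nuclear norm used as a tubal-rank surrogate.

```python
import numpy as np

def t_product(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """t-product of A (n1, n2, n3) and B (n2, n4, n3) -> (n1, n4, n3)."""
    Af = np.fft.fft(A, axis=2)
    Bf = np.fft.fft(B, axis=2)
    Cf = np.einsum("ijk,jlk->ilk", Af, Bf)        # frontal slices multiplied face-wise
    return np.real(np.fft.ifft(Cf, axis=2))

def identity_tensor(n: int, n3: int) -> np.ndarray:
    I = np.zeros((n, n, n3))
    I[:, :, 0] = np.eye(n)                         # identity of the t-product algebra
    return I

def tensor_nuclear_norm(A: np.ndarray) -> float:
    """Sum of singular values of all Fourier-domain frontal slices, divided by n3."""
    Af = np.fft.fft(A, axis=2)
    return sum(np.linalg.svd(Af[:, :, k], compute_uv=False).sum()
               for k in range(A.shape[2])) / A.shape[2]

A = np.random.randn(5, 4, 3)
print(np.allclose(t_product(A, identity_tensor(4, 3)), A))  # True: I acts as the identity
print(tensor_nuclear_norm(A))
```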

Connections between Operator-splitting Methods and Deep Neural Networks with Applications in Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.09052
  • repo_url: None
  • paper_authors: Hao Liu, Xue-Cheng Tai, Raymond Chan
  • for: 这篇论文的目的是为了提供深度神经网络的数学解释,以及将深度神经网络与数学算法联系起来的方法。
  • methods: 这篇论文利用算子分裂策略和多重网格方法来解释深度神经网络,并提出了两种基于算子分裂策略的网络方案,用于解决图像分割问题。
  • results: 实验结果表明,这两种网络方案具有良好的性能,可以有效地解决图像分割问题。
    Abstract Deep neural networks are a powerful tool for many tasks. Understanding why they are so successful and providing a mathematical explanation is an important problem and has been a popular research direction in recent years. In the literature on the mathematical analysis of deep neural networks, many works are dedicated to establishing representation theories. How to make connections between deep neural networks and mathematical algorithms is still under development. In this paper, we give an algorithmic explanation for deep neural networks, especially in their connection with operator splitting and multigrid methods. We show that with certain splitting strategies, operator-splitting methods have the same structure as networks. Utilizing this connection and the Potts model for image segmentation, two networks inspired by operator-splitting methods are proposed. The two networks are essentially two operator-splitting algorithms solving the Potts model. Numerical experiments are presented to demonstrate the effectiveness of the proposed networks.
    摘要 深度神经网络是处理许多任务的强大工具。理解其成功的原因并给出数学解释是一个重要的问题,也是近年来备受关注的研究方向。在深度神经网络的数学分析文献中,许多研究致力于建立表示理论,而如何将深度神经网络与数学算法联系起来仍在发展之中。在这篇论文中,我们给出了深度神经网络的算法解释,特别是它们与算子分裂方法和多重网格方法之间的联系。我们表明,在某些拆分策略下,算子分裂方法与网络具有相同的结构。利用这一联系以及用于图像分割的Potts模型,我们提出了两种受算子分裂方法启发的网络,它们本质上是求解Potts模型的两种算子分裂算法。数值实验证明了所提网络的有效性。
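A small worked example of operator splitting, the algorithmic device the paper links to sequential network layers: to advance du/dt = A(u) + B(u) one step, first advance with A alone, then with B alone (Lie splitting). Here A is 1-D diffusion and B a linear decay, an assumption-based toy problem rather than the paper's Potts model; the splitting error per step is O(dt).

```python
import numpy as np

def step_diffusion(u: np.ndarray, dt: float, nu: float = 1.0) -> np.ndarray:
    lap = np.roll(u, 1) - 2 * u + np.roll(u, -1)       # periodic Laplacian, dx = 1
    return u + dt * nu * lap                           # explicit Euler sub-step

def step_decay(u: np.ndarray, dt: float, k: float = 0.5) -> np.ndarray:
    return u * np.exp(-k * dt)                         # exact solve of du/dt = -k u

def lie_splitting(u0: np.ndarray, dt: float, n_steps: int) -> np.ndarray:
    u = u0.copy()
    for _ in range(n_steps):
        u = step_diffusion(u, dt)                      # sub-step 1: operator A
        u = step_decay(u, dt)                          # sub-step 2: operator B
    return u

u0 = np.exp(-0.5 * np.linspace(-5, 5, 101) ** 2)       # a Gaussian bump
u_final = lie_splitting(u0, dt=0.1, n_steps=50)
```

The analogy the paper draws is that each sub-step plays the role of one layer-like operation, so a full splitting scheme composes into a network-shaped computation.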

PottsMGNet: A Mathematical Explanation of Encoder-Decoder Based Neural Networks

  • paper_url: http://arxiv.org/abs/2307.09039
  • repo_url: None
  • paper_authors: Xue-Cheng Tai, Hao Liu, Raymond Chan
  • for: 这篇论文主要是为了解释基于编码器-解码器架构的效果神经网络,以及其在图像分割领域的应用。
  • methods: 该论文以用于图像分割的两相Potts模型为例来解释编码器-解码器架构,并使用多重网格方法和算子分裂格式(PottsMGNet)来离散化连续控制模型。
  • results: 研究发现,将软阈值动力学作为正则项引入PottsMGNet后,该网络对网络宽度、深度等参数具有良好的鲁棒性,并在噪声很大的数据集上取得了出色的性能;在几乎所有实验中,新网络在精度和Dice分数上都优于或不逊于现有的图像分割网络。
    Abstract For problems in image processing and many other fields, a large class of effective neural networks has encoder-decoder-based architectures. Although these networks have achieved impressive performance, mathematical explanations of their architectures are still underdeveloped. In this paper, we study the encoder-decoder-based network architecture from the algorithmic perspective and provide a mathematical explanation. We use the two-phase Potts model for image segmentation as an example for our explanations. We associate the segmentation problem with a control problem in the continuous setting. Then, a multigrid method and an operator-splitting scheme, the PottsMGNet, are used to discretize the continuous control model. We show that the resulting discrete PottsMGNet is equivalent to an encoder-decoder-based network. With minor modifications, it is shown that a number of the popular encoder-decoder-based neural networks are just instances of the proposed PottsMGNet. By incorporating the Soft-Threshold-Dynamics into the PottsMGNet as a regularizer, the PottsMGNet has been shown to be robust to network parameters such as width and depth, and achieves remarkable performance on datasets with very large noise. In nearly all our experiments, the new network performs as well as or better than existing image segmentation networks in terms of accuracy and Dice score.
    摘要 对于图像处理及许多其他领域的问题,一大类有效的神经网络采用基于编码器-解码器的架构。尽管这些网络取得了令人瞩目的性能,其架构的数学解释仍不完善。本文从算法的角度研究基于编码器-解码器的网络架构,并给出数学解释。我们以用于图像分割的两相Potts模型为例展开说明:先将分割问题与连续设置下的一个控制问题联系起来,再利用多重网格方法和算子分裂格式(PottsMGNet)对该连续控制模型进行离散化。我们证明,由此得到的离散PottsMGNet等价于一个基于编码器-解码器的网络;稍作修改后,还可证明许多流行的编码器-解码器网络都只是所提PottsMGNet的特例。通过将软阈值动力学作为正则项引入PottsMGNet,该网络对网络宽度、深度等参数表现出良好的鲁棒性,并在噪声很大的数据集上取得了出色的性能。在几乎所有实验中,新网络在精度和Dice分数上都优于或不逊于现有的图像分割网络。

Online Self-Supervised Thermal Water Segmentation for Aerial Vehicles

  • paper_url: http://arxiv.org/abs/2307.09027
  • repo_url: https://github.com/connorlee77/uav-thermal-water-segmentation
  • paper_authors: Connor Lee, Jonathan Gustafsson Frennert, Lu Gan, Matthew Anderson, Soon-Jo Chung
  • for: 这篇论文的目的是提出一种新方法,利用在线自监督,以纹理和运动线索作为监督信号,使基于RGB训练的水体分割网络适应目标域的航拍热红外图像。
  • methods: 该方法采用在线自监督,将基于RGB训练的水体分割网络迁移到目标域航拍热红外图像上,并以纹理和运动线索作为监督信号。
  • results: 这一新的热红外能力使现有的自主飞行机器人能够在缺乏近岸热红外训练数据的情况下,于夜间近岸环境中执行视觉导航、水深测量和水流追踪等任务;此外,该方法可实时运行,并在Nvidia Jetson嵌入式计算平台上得到了验证。
    Abstract We present a new method to adapt an RGB-trained water segmentation network to target-domain aerial thermal imagery using online self-supervision by leveraging texture and motion cues as supervisory signals. This new thermal capability enables current autonomous aerial robots operating in near-shore environments to perform tasks such as visual navigation, bathymetry, and flow tracking at night. Our method overcomes the problem of scarce and difficult-to-obtain near-shore thermal data that prevents the application of conventional supervised and unsupervised methods. In this work, we curate the first aerial thermal near-shore dataset, show that our approach outperforms fully-supervised segmentation models trained on limited target-domain thermal data, and demonstrate real-time capabilities onboard an Nvidia Jetson embedded computing platform. Code and datasets used in this work will be available at: https://github.com/connorlee77/uav-thermal-water-segmentation.
    摘要 我们提出了一种新方法,通过在线自监督,以纹理和运动线索作为监督信号,使基于RGB训练的水体分割网络适应目标域的航拍热红外图像。这一新的热红外能力使现有的自主飞行机器人能够在夜间的近岸环境中执行视觉导航、水深测量和水流追踪等任务。我们的方法克服了近岸热红外数据稀缺且难以获取、从而使传统监督与无监督方法难以应用的问题。在这项工作中,我们构建了首个航拍近岸热红外数据集,证明我们的方法优于在有限目标域热红外数据上训练的全监督分割模型,并在 Nvidia Jetson 嵌入式计算平台上实现了实时运行。代码和数据集将在:https://github.com/connorlee77/uav-thermal-water-segmentation 上提供。

ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting

  • paper_url: http://arxiv.org/abs/2307.09026
  • repo_url: None
  • paper_authors: Hongwei Zheng, Han Li, Bowen Shi, Wenrui Dai, Botao Wan, Yu Sun, Min Guo, Hongkai Xiong
  • for: 提高视频基于2D-to-3D人姿估计(HPE)的性能,解决深度歧义问题。
  • methods: 提出了一个名为动作提示模块(APM)的插件模块,可以有效地挖掘不同类型的动作准则,以提高3D HPE的性能。
  • results: 实验表明,APM可以大幅提高大多数视频基于2D-to-3D HPE框架的性能。
    Abstract Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency across sequences to alleviate the depth ambiguity problem but ignores the action-related prior knowledge hidden in the pose sequence. In this paper, we propose a plug-and-play module named Action Prompt Module (APM) that effectively mines different kinds of action clues for 3D HPE. Notably, the mining scheme of APM can be widely adapted to different frameworks and brings consistent benefits. Specifically, we first present a novel Action-related Text Prompt module (ATP) that directly embeds action labels and transfers the rich language information in the label to the pose sequence. Besides, we further introduce an Action-specific Pose Prompt module (APP) to mine the position-aware pose pattern of each action, and exploit the correlation between the mined patterns and the input pose sequence for further pose refinement. Experiments show that APM can improve the performance of most video-based 2D-to-3D HPE frameworks by a large margin.
    摘要 最近的2D-to-3D人体姿态估计(HPE)方法利用序列间的时间一致性来缓解深度歧义问题,但忽略了姿态序列中隐藏的与动作相关的先验知识。在这篇论文中,我们提出了一个名为动作提示模块(APM)的即插即用模块,可以有效地挖掘多种类型的动作线索用于3D HPE,其挖掘方案能够广泛适配不同框架并带来一致的收益。具体而言,我们首先提出一种新颖的动作相关文本提示模块(ATP),直接嵌入动作标签,将标签中丰富的语言信息传递到姿态序列中。此外,我们还引入动作特定的姿态提示模块(APP),挖掘每种动作的位置感知姿态模式,并利用所挖掘模式与输入姿态序列之间的相关性进一步精炼姿态。实验显示,APM能够大幅提升大多数基于视频的2D-to-3D HPE框架的性能。

LA-Net: Landmark-Aware Learning for Reliable Facial Expression Recognition under Label Noise

  • paper_url: http://arxiv.org/abs/2307.09023
  • repo_url: None
  • paper_authors: Zhiyu Wu, Jinshi Cui
  • for: 提高人脸表情识别(FER)的性能,解决实际应用中的标签噪声问题。
  • methods: 利用人脸特征点(landmark)来减少标签噪声的影响,从两个角度进行处理:首先,使用landmark信息来抑制表情空间中的uncertainty,并通过邻域聚合来提高每个样本的训练指导质量;其次,将landmark信息integrated到表情表示中,使表情特征提取器更加不敏感于标签噪声。
  • results: 对于在野外 dataset和synthetic noisy dataset的广泛实验,我们示出了LA-Net可以达到领先的性能水平。
    Abstract Facial expression recognition (FER) remains a challenging task due to the ambiguity of expressions. The derived noisy labels significantly harm the performance in real-world scenarios. To address this issue, we present a new FER model named Landmark-Aware Net~(LA-Net), which leverages facial landmarks to mitigate the impact of label noise from two perspectives. Firstly, LA-Net uses landmark information to suppress the uncertainty in expression space and constructs the label distribution of each sample by neighborhood aggregation, which in turn improves the quality of training supervision. Secondly, the model incorporates landmark information into expression representations using the devised expression-landmark contrastive loss. The enhanced expression feature extractor can be less susceptible to label noise. Our method can be integrated with any deep neural network for better training supervision without introducing extra inference costs. We conduct extensive experiments on both in-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net achieves state-of-the-art performance.
    摘要 由于表情本身的歧义性,面部表情识别(FER)仍然是一项具有挑战性的任务,由此产生的噪声标签会严重损害实际应用场景中的性能。为解决这个问题,我们提出了一种新的FER模型Landmark-Aware Net(LA-Net),从两个角度利用面部特征点来缓解标签噪声的影响。首先,LA-Net利用特征点信息抑制表情空间中的不确定性,并通过邻域聚合构造每个样本的标签分布,从而提高训练监督的质量。其次,模型通过我们设计的表情-特征点对比损失,将特征点信息融入表情表示中,使表情特征提取器更不易受噪声标签的影响。我们的方法可以与任何深度神经网络结合以获得更好的训练监督,且不引入额外的推理开销。我们在真实场景数据集和合成噪声数据集上进行了广泛的实验,证明LA-Net达到了当前最佳性能。
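A small sketch of the neighbourhood-aggregation idea: each sample's (possibly noisy) one-hot label is smoothed into a label distribution by averaging the labels of its nearest neighbours in a landmark-aware feature space, and that distribution becomes the training target. The k-NN construction, the uniform averaging and the 0.5/0.5 mixing are illustrative choices, not LA-Net's exact formulation.

```python
import torch
import torch.nn.functional as F

def neighborhood_label_distribution(feats: torch.Tensor, labels: torch.Tensor,
                                    num_classes: int, k: int = 5) -> torch.Tensor:
    """feats: (N, D) landmark-aware features; labels: (N,) noisy hard labels."""
    onehot = F.one_hot(labels, num_classes).float()             # (N, C)
    sim = F.normalize(feats, dim=1) @ F.normalize(feats, dim=1).t()
    knn = sim.topk(k + 1, dim=1).indices[:, 1:]                 # drop the sample itself
    neighbor_dist = onehot[knn].mean(dim=1)                     # (N, C) aggregated labels
    return 0.5 * onehot + 0.5 * neighbor_dist                   # keep part of the own label

feats = torch.randn(100, 64)
labels = torch.randint(0, 7, (100,))
targets = neighborhood_label_distribution(feats, labels, num_classes=7)
loss = F.kl_div(F.log_softmax(torch.randn(100, 7), dim=1), targets, reduction="batchmean")
```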

Face-PAST: Facial Pose Awareness and Style Transfer Networks

  • paper_url: http://arxiv.org/abs/2307.09020
  • repo_url: None
  • paper_authors: Sunder Ali Khowaja, Ghulam Mujtaba, Jiseok Yoon, Ik Hyun Lee
  • for: 提出一种基于 StyleGAN 的 facial style transfer 网络,以保持 facial 图像的细节和结构,并生成高质量的样式化图像。
  • methods: 使用预训练的样式生成网络、循环优化器和门控制单元,以及 facial 结构、身份和分割损失来保持 facial 细节和结构。
  • results: 通过对存在较少数据的 facial 图像进行样式转移,并且可以生成高质量的样式化图像,而不会过拟合样式或添加artefacts。
    Abstract Facial style transfer has been quite popular among researchers due to the rise of emerging technologies such as eXtended Reality (XR), Metaverse, and Non-Fungible Tokens (NFTs). Furthermore, StyleGAN methods along with transfer-learning strategies have reduced the problem of limited data to some extent. However, most of the StyleGAN methods overfit the styles while adding artifacts to facial images. In this paper, we propose a facial pose awareness and style transfer (Face-PAST) network that preserves facial details and structures while generating high-quality stylized images. Dual StyleGAN inspires our work, but in contrast, our work uses a pre-trained style generation network in an external style pass with a residual modulation block instead of a transform coding block. Furthermore, we use the gated mapping unit and facial structure, identity, and segmentation losses to preserve the facial structure and details. This enables us to train the network with a very limited amount of data while generating high-quality stylized images. Our training process adapts curriculum learning strategy to perform efficient and flexible style mixing in the generative space. We perform extensive experiments to show the superiority of Face-PAST in comparison to existing state-of-the-art methods.
    摘要 随着扩展现实(XR)、元宇宙和非同质化代币(NFTs)等新兴技术的兴起,人脸风格迁移受到研究人员的广泛关注。此外,StyleGAN方法与迁移学习策略在一定程度上缓解了数据有限的问题。然而,大多数StyleGAN方法会对风格过拟合,并在人脸图像中引入伪影。在这篇论文中,我们提出了一个名为Face-PAST的人脸姿态感知与风格迁移网络,能够在生成高质量风格化图像的同时保留人脸的细节和结构。我们的工作受到Dual StyleGAN的启发,但与之不同,我们在外部风格通路中使用预训练的风格生成网络,并以残差调制模块取代变换编码模块。此外,我们利用门控映射单元以及人脸结构、身份和分割损失来保留人脸结构与细节,这使得我们能够在数据非常有限的情况下训练网络并生成高质量的风格化图像。我们的训练过程采用课程学习策略,以在生成空间中实现高效而灵活的风格混合。我们进行了广泛的实验,证明Face-PAST相对于现有最先进方法的优越性。

U-shaped Transformer: Retain High Frequency Context in Time Series Analysis

  • paper_url: http://arxiv.org/abs/2307.09019
  • repo_url: None
  • paper_authors: Qingkui Chen, Yiqin Zhang
  • for: Proposes a transformer-based time-series forecasting model that uses skip-layer connections and patch merge/split operations to improve accuracy and efficiency.
  • methods: Builds on the traditional transformer backbone and adds Unet-style skip-layer connections plus patch merge and split operations to extract features at different scales.
  • results: Experiments show advanced forecasting performance across multiple datasets at relatively low cost compared with conventional transformer models.
    Abstract Time series prediction plays a crucial role in various industrial fields. In recent years, neural networks with a transformer backbone have achieved remarkable success in many domains, including computer vision and NLP. In time series analysis domain, some studies have suggested that even the simplest MLP networks outperform advanced transformer-based networks on time series forecast tasks. However, we believe these findings indicate there to be low-rank properties in time series sequences. In this paper, we consider the low-pass characteristics of transformers and try to incorporate the advantages of MLP. We adopt skip-layer connections inspired by Unet into traditional transformer backbone, thus preserving high-frequency context from input to output, namely U-shaped Transformer. We introduce patch merge and split operation to extract features with different scales and use larger datasets to fully make use of the transformer backbone. Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.
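The patch merge and split operations mentioned above can be pictured as 1-D analogues of Swin-style patch merging applied along the time axis. The sketch below is an illustrative guess at such operations; the class names, the pairing of adjacent tokens, and the linear projections are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PatchMerge1D(nn.Module):
    """Halve the temporal length by concatenating adjacent token pairs and
    projecting back to the model dimension."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                      # x: (B, L, D), L assumed even
        B, L, D = x.shape
        return self.proj(x.reshape(B, L // 2, 2 * D))

class PatchSplit1D(nn.Module):
    """Double the temporal length again; together with skip connections this
    lets high-frequency detail from the encoder reach the decoder."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * dim)

    def forward(self, x):                      # x: (B, L, D)
        B, L, D = x.shape
        return self.proj(x).reshape(B, 2 * L, D)
```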

Survey on Controllable Image Synthesis with Deep Learning

  • paper_url: http://arxiv.org/abs/2307.10275
  • repo_url: https://github.com/Aryia-Behroziuan/References
  • paper_authors: Shixiong Zhang, Jiao Li, Lu Yang
  • for: Surveys the low-level controllable image synthesis problem, which underpins fine-grained image rendering and editing tasks.
  • methods: Reviews deep-learning-based approaches, especially generative models, for controllable image synthesis.
  • results: Reviews 3D controllable image synthesis together with its datasets and evaluation metrics, and briefly summarizes related applications, products, and resources for practitioners.
    Abstract Image synthesis has attracted emerging research interests in academic and industry communities. Deep learning technologies especially the generative models greatly inspired controllable image synthesis approaches and applications, which aim to generate particular visual contents with latent prompts. In order to further investigate low-level controllable image synthesis problem which is crucial for fine image rendering and editing tasks, we present a survey of some recent works on 3D controllable image synthesis using deep learning. We first introduce the datasets and evaluation indicators for 3D controllable image synthesis. Then, we review the state-of-the-art research for geometrically controllable image synthesis in two aspects: 1) Viewpoint/pose-controllable image synthesis; 2) Structure/shape-controllable image synthesis. Furthermore, the photometrically controllable image synthesis approaches are also reviewed for 3D re-lighting researches. While the emphasis is on 3D controllable image synthesis algorithms, the related applications, products and resources are also briefly summarized for practitioners.

Soft-IntroVAE for Continuous Latent space Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2307.09008
  • repo_url: None
  • paper_authors: Zhi-Song Liu, Zijia Wang, Zhen Jia
  • for: Proposes a Variational-AutoEncoder-based continuous image super-resolution (SR) method for practical and flexible image scaling across displays.
  • methods: Uses a local implicit image representation with Variational-AutoEncoder-based latent space interpolation, together with a novel latent-space adversarial training scheme for photo-realistic restoration.
  • results: The proposed Soft-IntroVAE-SR (SVAE-SR) improves image quality over competing methods and generalizes to denoising and real-image super-resolution.
    Abstract Continuous image super-resolution (SR) recently receives a lot of attention from researchers, for its practical and flexible image scaling for various displays. Local implicit image representation is one of the methods that can map the coordinates and 2D features for latent space interpolation. Inspired by Variational AutoEncoder, we propose a Soft-introVAE for continuous latent space image super-resolution (SVAE-SR). A novel latent space adversarial training is achieved for photo-realistic image restoration. To further improve the quality, a positional encoding scheme is used to extend the original pixel coordinates by aggregating frequency information over the pixel areas. We show the effectiveness of the proposed SVAE-SR through quantitative and qualitative comparisons, and further, illustrate its generalization in denoising and real-image super-resolution.
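The positional encoding scheme described in the abstract extends pixel coordinates with frequency information. A common way to do this is with Fourier features, sketched below; the function name, the number of frequencies, and the exact frequency schedule are assumptions and may differ from the paper's scheme.

```python
import math
import torch

def fourier_encode(coords, num_freqs=8):
    """Encode normalised pixel coordinates in [-1, 1] with sine/cosine features
    at several frequencies, so a decoder can recover high-frequency detail at
    arbitrary query positions.

    coords: (N, 2) tensor of (x, y) query positions
    returns (N, 2 + 4 * num_freqs) encoded coordinates
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=coords.dtype)   # 1, 2, 4, ...
    scaled = coords.unsqueeze(-1) * freqs * math.pi              # (N, 2, F)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return torch.cat([coords, enc.flatten(1)], dim=1)
```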

Frequency-mixed Single-source Domain Generalization for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2307.09005
  • repo_url: https://github.com/liamheng/non-iid_medical_image_segmentation
  • paper_authors: Heng Li, Haojin Li, Wei Zhao, Huazhu Fu, Xiuyun Su, Yan Hu, Jiang Liu
  • for: Improves the generalizability of medical image segmentation models, especially when annotated data are scarce.
  • methods: Proposes the Frequency-mixed Single-source Domain Generalization method (FreeSDG), which analyzes how frequency affects domain discrepancy, augments the single source domain with a mixed frequency spectrum, and adds self-supervision to learn robust context-aware representations for segmentation.
  • results: Experiments show that FreeSDG outperforms prior methods and improves the generalizability of medical image segmentation models when labels are scarce.
    Abstract The annotation scarcity of medical image segmentation poses challenges in collecting sufficient training data for deep learning models. Specifically, models trained on limited data may not generalize well to other unseen data domains, resulting in a domain shift issue. Consequently, domain generalization (DG) is developed to boost the performance of segmentation models on unseen domains. However, the DG setup requires multiple source domains, which impedes the efficient deployment of segmentation algorithms in clinical scenarios. To address this challenge and improve the segmentation model's generalizability, we propose a novel approach called the Frequency-mixed Single-source Domain Generalization method (FreeSDG). By analyzing the frequency's effect on domain discrepancy, FreeSDG leverages a mixed frequency spectrum to augment the single-source domain. Additionally, self-supervision is constructed in the domain augmentation to learn robust context-aware representations for the segmentation task. Experimental results on five datasets of three modalities demonstrate the effectiveness of the proposed algorithm. FreeSDG outperforms state-of-the-art methods and significantly improves the segmentation model's generalizability. Therefore, FreeSDG provides a promising solution for enhancing the generalization of medical image segmentation models, especially when annotated data is scarce. The code is available at https://github.com/liamheng/Non-IID_Medical_Image_Segmentation.
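Frequency-mixed augmentation of a single source domain is often implemented by swapping part of the low-frequency amplitude spectrum between two images while keeping the phase. The sketch below shows that generic recipe only as an illustration; the band size `beta`, the mixing weight `alpha`, and the function name are assumptions and may differ from FreeSDG's actual procedure.

```python
import numpy as np

def frequency_mix(img_a, img_b, alpha=0.5, beta=0.1):
    """Blend the low-frequency amplitude spectrum of img_b into img_a while
    keeping img_a's phase, producing a domain-shifted view of img_a.

    img_a, img_b: float arrays of identical shape (H, W)
    alpha: interpolation weight for the mixed amplitudes
    beta:  fraction of the (centred) spectrum that is mixed
    """
    fa, fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    amp_a, pha_a = np.abs(fa), np.angle(fa)
    amp_b = np.abs(fb)

    # Mix only a small centred low-frequency band.
    amp_a_sh, amp_b_sh = np.fft.fftshift(amp_a), np.fft.fftshift(amp_b)
    h, w = img_a.shape
    bh, bw = int(h * beta), int(w * beta)
    ch, cw = h // 2, w // 2
    band = (slice(ch - bh, ch + bh), slice(cw - bw, cw + bw))
    amp_a_sh[band] = (1 - alpha) * amp_a_sh[band] + alpha * amp_b_sh[band]

    mixed = np.fft.ifftshift(amp_a_sh) * np.exp(1j * pha_a)
    return np.real(np.fft.ifft2(mixed))
```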

TractCloud: Registration-free tractography parcellation with a novel local-global streamline point cloud representation

  • paper_url: http://arxiv.org/abs/2307.09000
  • repo_url: https://github.com/SlicerDMRI/TractCloud
  • paper_authors: Tengfei Xue, Yuqian Chen, Chaoyi Zhang, Alexandra J. Golby, Nikos Makris, Yogesh Rathi, Weidong Cai, Fan Zhang, Lauren J. O’Donnell
  • for: Proposes a registration-free tractography parcellation method that operates directly in individual subject space and scales efficiently to large datasets.
  • methods: Uses a novel, learnable local-global streamline representation that leverages neighboring and whole-brain streamlines to describe local anatomy and the global pose of the brain.
  • results: Tested on five independently acquired datasets across populations and health conditions, it significantly outperforms previous state-of-the-art parcellation methods.
    Abstract Diffusion MRI tractography parcellation classifies streamlines into anatomical fiber tracts to enable quantification and visualization for clinical and scientific applications. Current tractography parcellation methods rely heavily on registration, but registration inaccuracies can affect parcellation and the computational cost of registration is high for large-scale datasets. Recently, deep-learning-based methods have been proposed for tractography parcellation using various types of representations for streamlines. However, these methods only focus on the information from a single streamline, ignoring geometric relationships between the streamlines in the brain. We propose TractCloud, a registration-free framework that performs whole-brain tractography parcellation directly in individual subject space. We propose a novel, learnable, local-global streamline representation that leverages information from neighboring and whole-brain streamlines to describe the local anatomy and global pose of the brain. We train our framework on a large-scale labeled tractography dataset, which we augment by applying synthetic transforms including rotation, scaling, and translations. We test our framework on five independently acquired datasets across populations and health conditions. TractCloud significantly outperforms several state-of-the-art methods on all testing datasets. TractCloud achieves efficient and consistent whole-brain white matter parcellation across the lifespan (from neonates to elderly subjects, including brain tumor patients) without the need for registration. The robustness and high inference speed of TractCloud make it suitable for large-scale tractography data analysis. Our project page is available at https://tractcloud.github.io/.

Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond

  • paper_url: http://arxiv.org/abs/2307.08996
  • repo_url: None
  • paper_authors: Yang Zhao, Tingbo Hou, Yu-Chuan Su, Xuhui Jia, Yandong Li, Matthias Grundmann
  • for: This paper aims to propose an authentic face restoration system that can generate high-quality and realistic faces from low-quality ones, which is important in various computer vision applications such as image enhancement, video communication, and portrait photography.
  • methods: The proposed method, called $\textbf{IDM}$, is based on denoising diffusion models (DDMs) and uses iterative learning to achieve authentic face restoration. The method has two aspects of intrinsic iterative refinement and extrinsic iterative enhancement to preserve the content and gradually refine the high-quality details.
  • results: The proposed method demonstrates superior performance on blind face restoration tasks and can also clean the data to improve the restoration task. Additionally, the authentically cleaned data generated by the proposed method is found to be helpful for image generation tasks, achieving better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models without modifying the models.
    Abstract An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portrait. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose $\textbf{IDM}$, an $\textbf{I}$teratively learned face restoration system based on denoising $\textbf{D}$iffusion $\textbf{M}$odels (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models.

Revisiting Latent Space of GAN Inversion for Real Image Editing

  • paper_url: http://arxiv.org/abs/2307.08995
  • repo_url: None
  • paper_authors: Kai Katsumata, Duc Minh Vo, Bei Liu, Hideki Nakayama
  • for: Addresses the trade-off between reconstruction quality and editing quality in GAN inversion for real-image editing with StyleGANs.
  • methods: Revisits StyleGAN's hyperspherical prior $\mathcal{Z}$ and combines it with a highly capable intermediate feature space to build the combined space $\mathcal{F}/\mathcal{Z}^{+}$, which faithfully inverts real images while maintaining editing quality.
  • results: Experiments show that $\mathcal{Z}^{+}$ can replace the commonly used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality and enabling semantic editing with reduced distortion.
    Abstract The exploration of the latent space in StyleGANs and GAN inversion exemplify impressive real-world image editing, yet the trade-off between reconstruction quality and editing quality remains an open problem. In this study, we revisit StyleGANs' hyperspherical prior $\mathcal{Z}$ and combine it with highly capable latent spaces to build combined spaces that faithfully invert real images while maintaining the quality of edited images. More specifically, we propose $\mathcal{F}/\mathcal{Z}^{+}$ space consisting of two subspaces: $\mathcal{F}$ space of an intermediate feature map of StyleGANs enabling faithful reconstruction and $\mathcal{Z}^{+}$ space of an extended StyleGAN prior supporting high editing quality. We project the real images into the proposed space to obtain the inverted codes, by which we then move along $\mathcal{Z}^{+}$, enabling semantic editing without sacrificing image quality. Comprehensive experiments show that $\mathcal{Z}^{+}$ can replace the most commonly-used $\mathcal{W}$, $\mathcal{W}^{+}$, and $\mathcal{S}$ spaces while preserving reconstruction quality, resulting in reduced distortion of edited images.

Human Action Recognition in Still Images Using ConViT

  • paper_url: http://arxiv.org/abs/2307.08994
  • repo_url: None
  • paper_authors: Seyed Rohollah Hosseyni, Hasan Taheri, Sanaz Seyedin, Ali Ahmad Rahmani
  • for: Aims to improve human action recognition in still images by modeling the relationships between different regions of an image.
  • methods: Proposes a module that functions like a convolutional layer but uses a Vision Transformer (ViT): a deep convolutional network extracts high-level spatial features, and the ViT then captures the relationships between image regions from the resulting feature map.
  • results: Achieves 95.5% mAP on Stanford40 and 91.5% mAP on the PASCAL VOC 2012 action dataset, promising results compared with other state-of-the-art methods.
    Abstract Understanding the relationship between different parts of the image plays a crucial role in many visual recognition tasks. Despite the fact that Convolutional Neural Networks (CNNs) have demonstrated impressive results in detecting single objects, they lack the capability to extract the relationship between various regions of an image, which is a crucial factor in human action recognition. To address this problem, this paper proposes a new module that functions like a convolutional layer using Vision Transformer (ViT). The proposed action recognition model comprises two components: the first part is a deep convolutional network that extracts high-level spatial features from the image, and the second component of the model utilizes a Vision Transformer that extracts the relationship between various regions of the image using the feature map generated by the CNN output. The proposed model has been evaluated on the Stanford40 and PASCAL VOC 2012 action datasets and has achieved 95.5% mAP and 91.5% mAP results, respectively, which are promising compared to other state-of-the-art methods.
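A minimal version of this two-stage design (CNN feature extractor followed by a Transformer over spatial locations) could look like the sketch below. The backbone choice, layer sizes, and pooling are placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torchvision

class CNNViTActionClassifier(nn.Module):
    """Illustrative CNN + Transformer model: the CNN yields a spatial feature
    map, and a small Transformer encoder models relations between its spatial
    locations before classification."""
    def __init__(self, num_classes=40, dim=512, depth=4, heads=8):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # (B, 2048, h, w)
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, x):                          # x: (B, 3, H, W)
        fmap = self.proj(self.backbone(x))         # (B, dim, h, w)
        tokens = fmap.flatten(2).transpose(1, 2)   # one token per spatial region
        tokens = self.encoder(tokens)              # relations between regions
        return self.cls(tokens.mean(dim=1))        # average-pool tokens -> logits
```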

Arbitrary point cloud upsampling via Dual Back-Projection Network

  • paper_url: http://arxiv.org/abs/2307.08992
  • repo_url: None
  • paper_authors: Zhi-Song Liu, Zijia Wang, Zhen Jia
  • for: Targets the reconstruction of sparse, noisy point clouds by increasing point density and recovering detailed geometric information.
  • methods: Proposes a Dual Back-Projection network (DBPnet), formulated in an up-down-up manner, that back-projects both feature residues and coordinate residues to reduce upsampling reconstruction error.
  • results: Achieves the lowest point-set matching losses under arbitrary upsampling factors (e.g., 4x, 5.5x) and handles both uniform and non-uniform sparse point clouds.
    Abstract Point clouds acquired from 3D sensors are usually sparse and noisy. Point cloud upsampling is an approach to increase the density of the point cloud so that detailed geometric information can be restored. In this paper, we propose a Dual Back-Projection network for point cloud upsampling (DBPnet). A Dual Back-Projection is formulated in an up-down-up manner for point cloud upsampling. It not only back projects feature residues but also coordinates residues so that the network better captures the point correlations in the feature and space domains, achieving lower reconstruction errors on both uniform and non-uniform sparse point clouds. Our proposed method is also generalizable for arbitrary upsampling tasks (e.g. 4x, 5.5x). Experimental results show that the proposed method achieves the lowest point set matching losses with respect to the benchmark. In addition, the success of our approach demonstrates that generative networks are not necessarily needed for non-uniform point clouds.
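The up-down-up back-projection idea can be illustrated with a toy refinement step: upsample the points, project them back to the input resolution, and use the coordinate residue to correct the upsampled set. The sketch below is only that illustration; the MLP sub-networks, the mean-pooling "back-projection", and the class name are assumptions, and the feature-residue branch of the paper is omitted.

```python
import torch
import torch.nn as nn

class BackProjectionRefine(nn.Module):
    """Toy up-down-up refinement: expand each point into `ratio` offsets,
    back-project the dense set to the input resolution, and use the residual
    to correct the dense points."""
    def __init__(self, ratio=4, dim=64):
        super().__init__()
        self.ratio = ratio
        self.expand = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 3 * ratio))
        self.correct = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, 3 * ratio))

    def forward(self, pts):                              # pts: (B, N, 3)
        B, N, _ = pts.shape
        up = pts.unsqueeze(2) + self.expand(pts).view(B, N, self.ratio, 3)
        down = up.mean(dim=2)                            # crude back-projection, (B, N, 3)
        residual = pts - down                            # coordinate residue
        up = up + self.correct(residual).view(B, N, self.ratio, 3)
        return up.reshape(B, N * self.ratio, 3)          # (B, ratio*N, 3) dense output
```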

EgoVM: Achieving Precise Ego-Localization using Lightweight Vectorized Maps

  • paper_url: http://arxiv.org/abs/2307.08991
  • repo_url: None
  • paper_authors: Yuzhe He, Shuang Liang, Xiaofei Rui, Chengying Cai, Guowei Wan
  • for: Provides accurate yet lightweight ego-localization for autonomous driving.
  • methods: Uses lightweight vectorized maps instead of heavy point-based maps: BEV features are extracted from multi-view images and LiDAR point clouds, map elements are encoded with learnable semantic embeddings supervised by semantic segmentation, map queries (embeddings plus element coordinates) are matched to BEV features with a transformer decoder, and a robust histogram-based pose solver exhaustively searches candidate poses for the optimum.
  • results: Achieves centimeter-level localization accuracy on nuScenes and a newly collected dataset, outperforming existing vectorized-map methods by a large margin, and has been extensively tested on a large fleet of autonomous vehicles in challenging urban scenes.
    Abstract Accurate and reliable ego-localization is critical for autonomous driving. In this paper, we present EgoVM, an end-to-end localization network that achieves comparable localization accuracy to prior state-of-the-art methods, but uses lightweight vectorized maps instead of heavy point-based maps. To begin with, we extract BEV features from online multi-view images and LiDAR point cloud. Then, we employ a set of learnable semantic embeddings to encode the semantic types of map elements and supervise them with semantic segmentation, to make their feature representation consistent with BEV features. After that, we feed map queries, composed of learnable semantic embeddings and coordinates of map elements, into a transformer decoder to perform cross-modality matching with BEV features. Finally, we adopt a robust histogram-based pose solver to estimate the optimal pose by searching exhaustively over candidate poses. We comprehensively validate the effectiveness of our method using both the nuScenes dataset and a newly collected dataset. The experimental results show that our method achieves centimeter-level localization accuracy, and outperforms existing methods using vectorized maps by a large margin. Furthermore, our model has been extensively tested in a large fleet of autonomous vehicles under various challenging urban scenes.

In Defense of Clip-based Video Relation Detection

  • paper_url: http://arxiv.org/abs/2307.08984
  • repo_url: None
  • paper_authors: Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann
  • for: Aims to improve the accuracy and efficiency of video visual relation detection (VidVRD), which detects visual relationship triplets using spatial bounding boxes and temporal boundaries.
  • methods: Revisits the clip-based paradigm and proposes a Hierarchical Context Model (HCM) that enriches object-based spatial context and relation-based temporal context over clip tubelets.
  • results: Extensive experiments on two VidVRD benchmarks show that clip tubelets can outperform video-tubelet methods, offer more flexibility in model design, and alleviate the long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression.
    Abstract Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.

Learned Scalable Video Coding For Humans and Machines

  • paper_url: http://arxiv.org/abs/2307.08978
  • repo_url: None
  • paper_authors: Hadi Hadizadeh, Ivan V. Bajić
  • for: Supports automatic video analytics performed by machines while retaining the option of human viewing, rather than coding video for human viewing alone.
  • methods: Builds an end-to-end learnable scalable codec on deep neural networks (DNNs) whose base layer supports a machine vision task and whose enhancement layer reconstructs the input for humans, using conditional coding for better compression gains.
  • results: Experiments on four standard video datasets show that the base layer outperforms state-of-the-art learned and conventional codecs, while the enhancement layer maintains comparable performance on the human vision task.
    Abstract Video coding has traditionally been developed to support services such as video streaming, videoconferencing, digital TV, and so on. The main intent was to enable human viewing of the encoded content. However, with the advances in deep neural networks (DNNs), encoded video is increasingly being used for automatic video analytics performed by machines. In applications such as automatic traffic monitoring, analytics such as vehicle detection, tracking and counting, would run continuously, while human viewing could be required occasionally to review potential incidents. To support such applications, a new paradigm for video coding is needed that will facilitate efficient representation and compression of video for both machine and human use in a scalable manner. In this manuscript, we introduce the first end-to-end learnable video codec that supports a machine vision task in its base layer, while its enhancement layer supports input reconstruction for human viewing. The proposed system is constructed based on the concept of conditional coding to achieve better compression gains. Comprehensive experimental evaluations conducted on four standard video datasets demonstrate that our framework outperforms both state-of-the-art learned and conventional video codecs in its base layer, while maintaining comparable performance on the human vision task in its enhancement layer. We will provide the implementation of the proposed system at www.github.com upon completion of the review process.

Deep Physics-Guided Unrolling Generalization for Compressed Sensing

  • paper_url: http://arxiv.org/abs/2307.08950
  • repo_url: https://github.com/guaishou74851/prl
  • paper_authors: Bin Chen, Jiechong Song, Jingfen Xie, Jian Zhang
  • for: Proposes a new deep learning approach to image compressed sensing.
  • methods: Generalizes the traditional iterative recovery model from the image domain to a high-dimensional feature domain and develops a compact multiscale unrolling architecture that enhances network capacity while keeping real-time inference speed.
  • results: Experiments show that the proposed PRL networks outperform other state-of-the-art methods in both performance and efficiency, with large potential for further improvement and for application to other inverse imaging problems or optimization models.
    Abstract By absorbing the merits of both the model- and data-driven methods, deep physics-engaged learning scheme achieves high-accuracy and interpretable image reconstruction. It has attracted growing attention and become the mainstream for inverse imaging tasks. Focusing on the image compressed sensing (CS) problem, we find the intrinsic defect of this emerging paradigm, widely implemented by deep algorithm-unrolled networks, in which more plain iterations involving real physics will bring enormous computation cost and long inference time, hindering their practical application. A novel deep $\textbf{P}$hysics-guided un$\textbf{R}$olled recovery $\textbf{L}$earning ($\textbf{PRL}$) framework is proposed by generalizing the traditional iterative recovery model from image domain (ID) to the high-dimensional feature domain (FD). A compact multiscale unrolling architecture is then developed to enhance the network capacity and keep real-time inference speeds. Taking two different perspectives of optimization and range-nullspace decomposition, instead of building an algorithm-specific unrolled network, we provide two implementations: $\textbf{PRL-PGD}$ and $\textbf{PRL-RND}$. Experiments exhibit the significant performance and efficiency leading of PRL networks over other state-of-the-art methods with a large potential for further improvement and real application to other inverse imaging problems or optimization models.

Experimental Security Analysis of DNN-based Adaptive Cruise Control under Context-Aware Perception Attacks

  • paper_url: http://arxiv.org/abs/2307.08939
  • repo_url: None
  • paper_authors: Xugui Zhou, Anqi Chen, Maxfield Kouzel, Haotian Ren, Morgan McCarty, Cristina Nita-Rotaru, Homa Alemzadeh
  • for: Evaluates the security of deep neural network (DNN)-based adaptive cruise control (ACC) systems against stealthy perception attacks that inject perturbations into camera data to cause forward collisions.
  • methods: Proposes a combined knowledge-and-data-driven approach that selects the most critical times to trigger the attack and an optimization-based method that adaptively generates image perturbations at run time.
  • results: On a real driving dataset and a realistic simulation platform, the attack achieves a 142.9x higher success rate in causing accidents than random attacks and is mitigated 89.6% less by safety features such as AEB and FCW, while remaining stealthy and robust to real-world factors; the study also examines the role of driver interventions and basic safety features in defending against such attacks.
    Abstract Adaptive Cruise Control (ACC) is a widely used driver assistance feature for maintaining desired speed and safe distance to the leading vehicles. This paper evaluates the security of the deep neural network (DNN) based ACC systems under stealthy perception attacks that strategically inject perturbations into camera data to cause forward collisions. We present a combined knowledge-and-data-driven approach to design a context-aware strategy for the selection of the most critical times for triggering the attacks and a novel optimization-based method for the adaptive generation of image perturbations at run-time. We evaluate the effectiveness of the proposed attack using an actual driving dataset and a realistic simulation platform with the control software from a production ACC system and a physical-world driving simulator while considering interventions by the driver and safety features such as Automatic Emergency Braking (AEB) and Forward Collision Warning (FCW). Experimental results show that the proposed attack achieves 142.9x higher success rate in causing accidents than random attacks and is mitigated 89.6% less by the safety features while being stealthy and robust to real-world factors and dynamic changes in the environment. This study provides insights into the role of human operators and basic safety interventions in preventing attacks.

CSSL-RHA: Contrastive Self-Supervised Learning for Robust Handwriting Authentication

  • paper_url: http://arxiv.org/abs/2307.11100
  • repo_url: None
  • paper_authors: Jingyao Wang, Luntian Mou, Changwen Zheng, Wen Gao
  • for: Handwriting authentication for applications such as fraud prevention and cultural heritage protection.
  • methods: Proposes a Contrastive Self-Supervised Learning framework for Robust Handwriting Authentication (CSSL-RHA) that learns complex yet important features and accurately predicts writer identities, using an information-theoretic pre-processing filter and an adaptive patch-based matching scheme.
  • results: Extensive experiments on five benchmark datasets and the self-annotated EN-HA dataset show that CSSL-RHA outperforms baselines and remains effective even under abnormal conditions such as data falsification and corruption.
    Abstract Handwriting authentication is a valuable tool used in various fields, such as fraud prevention and cultural heritage protection. However, it remains a challenging task due to the complex features, severe damage, and lack of supervision. In this paper, we propose a novel Contrastive Self-Supervised Learning framework for Robust Handwriting Authentication (CSSL-RHA) to address these issues. It can dynamically learn complex yet important features and accurately predict writer identities. Specifically, to remove the negative effects of imperfections and redundancy, we design an information-theoretic filter for pre-processing and propose a novel adaptive matching scheme to represent images as patches of local regions dominated by more important features. Through online optimization at inference time, the most informative patch embeddings are identified as the "most important" elements. Furthermore, we employ contrastive self-supervised training with a momentum-based paradigm to learn more general statistical structures of handwritten data without supervision. We conduct extensive experiments on five benchmark datasets and our manually annotated dataset EN-HA, which demonstrate the superiority of our CSSL-RHA compared to baselines. Additionally, we show that our proposed model can still effectively achieve authentication even under abnormal circumstances, such as data falsification and corruption.

Learning to Sample Tasks for Meta Learning

  • paper_url: http://arxiv.org/abs/2307.08924
  • repo_url: https://github.com/ZJLAB-AMMI/HS-OMRL
  • paper_authors: Jingyao Wang, Zeen Song, Xingzhe Su, Lingyu Si, Hongwei Dong, Wenwen Qiang, Changwen Zheng
  • for: Through experiments on various meta-learning methods, task samplers, and few-shot learning tasks, this paper arrives at three conclusions.
  • methods: The three conclusions: first, no universal task sampling strategy guarantees the performance of meta-learning models; second, task diversity can cause models to either underfit or overfit during training; third, generalization performance is influenced by task divergence, task entropy, and task difficulty.
  • results: In response, the paper proposes the Adaptive Sampler (ASr), a plug-and-play task sampler that samples tasks according to task divergence, task entropy, and task difficulty, together with a simple and general meta-learning algorithm to optimize it; extensive experiments demonstrate its effectiveness.
    Abstract Through experiments on various meta-learning methods, task samplers, and few-shot learning tasks, this paper arrives at three conclusions. Firstly, there are no universal task sampling strategies to guarantee the performance of meta-learning models. Secondly, task diversity can cause the models to either underfit or overfit during training. Lastly, the generalization performance of the models are influenced by task divergence, task entropy, and task difficulty. In response to these findings, we propose a novel task sampler called Adaptive Sampler (ASr). ASr is a plug-and-play task sampler that takes task divergence, task entropy, and task difficulty to sample tasks. To optimize ASr, we rethink and propose a simple and general meta-learning algorithm. Finally, a large number of empirical experiments demonstrate the effectiveness of the proposed ASr.
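One simple way to read the Adaptive Sampler is as a weighted softmax over per-task divergence, entropy, and difficulty scores that yields task-sampling probabilities. The sketch below shows that reading; the scoring inputs, the fixed weights, and the temperature are placeholders, whereas the paper learns how to combine them.

```python
import numpy as np

def adaptive_task_probs(divergence, entropy, difficulty,
                        weights=(1.0, 1.0, 1.0), tau=1.0):
    """Turn per-task divergence, entropy and difficulty scores (1-D arrays of
    length num_tasks) into sampling probabilities via a weighted softmax."""
    score = (weights[0] * divergence
             + weights[1] * entropy
             + weights[2] * difficulty) / tau
    score = score - score.max()          # numerical stability
    probs = np.exp(score)
    return probs / probs.sum()

# Usage sketch: draw the next meta-training task from these probabilities.
# probs = adaptive_task_probs(div_scores, ent_scores, diff_scores)
# task_id = np.random.default_rng(0).choice(len(probs), p=probs)
```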

Accuracy versus time frontiers of semi-supervised and self-supervised learning on medical images

  • paper_url: http://arxiv.org/abs/2307.08919
  • repo_url: https://github.com/tufts-ml/ssl-vs-ssl-benchmark
  • paper_authors: Zhe Huang, Ruijie Jiang, Shuchin Aeron, Michael C. Hughes
  • for: Provides a reliable benchmark that helps practitioners maximize classifier performance given limited labeled data and a limited training-time budget.
  • methods: Compares the two main research directions: self-supervised learning, which first pretrains useful representations on unlabeled data and then fine-tunes on the labeled set, and semi-supervised learning, which trains a classifier on labeled and unlabeled data simultaneously.
  • results: Comparing 6 semi-supervised and 5 self-supervised methods against strong labeled-only baselines on 3 medical image datasets shows that additional unlabeled data improves classifier performance, with MixMatch, SimCLR, and BYOL the strongest choices; the study also publishes settings that let strong methods perform well on new medical tasks within a few hours.
    Abstract For many applications of classifiers to medical images, a trustworthy label for each image can be difficult or expensive to obtain. In contrast, images without labels are more readily available. Two major research directions both promise that additional unlabeled data can improve classifier performance: self-supervised learning pretrains useful representations on unlabeled data only, then fine-tunes a classifier on these representations via the labeled set; semi-supervised learning directly trains a classifier on labeled and unlabeled data simultaneously. Recent methods from both directions have claimed significant gains on non-medical tasks, but do not systematically assess medical images and mostly compare only to methods in the same direction. This study contributes a carefully-designed benchmark to help answer a practitioner's key question: given a small labeled dataset and a limited budget of hours to spend on training, what gains from additional unlabeled images are possible and which methods best achieve them? Unlike previous benchmarks, ours uses realistic-sized validation sets to select hyperparameters, assesses runtime-performance tradeoffs, and bridges two research fields. By comparing 6 semi-supervised methods and 5 self-supervised methods to strong labeled-only baselines on 3 medical datasets with 30-1000 labels per class, we offer insights to resource-constrained, results-focused practitioners: MixMatch, SimCLR, and BYOL represent strong choices that were not surpassed by more recent methods. After much effort selecting hyperparameters on one dataset, we publish settings that enable strong methods to perform well on new medical tasks within a few hours, with further search over dozens of hours delivering modest additional gains.

Towards the Sparseness of Projection Head in Self-Supervised Learning

  • paper_url: http://arxiv.org/abs/2307.08913
  • repo_url: None
  • paper_authors: Zeen Song, Xingzhe Su, Jingyao Wang, Wenwen Qiang, Changwen Zheng, Fuchun Sun
  • for: Studies contrastive learning, a successful self-supervised learning (SSL) approach, focusing on the internal mechanism of the parameterized projection head and its relationship with the phenomenon of dimensional collapse.
  • methods: Through empirical analysis and theoretical investigation, examines how the projection head affects representation quality and proposes the hypothesis that only a subset of features is needed to minimize the contrastive loss of a mini-batch.
  • results: Experiments show that a sparse projection head, regularized by the proposed SparseHead term, enhances contrastive learning performance and can be seamlessly integrated with existing SSL methods.
    Abstract In recent years, self-supervised learning (SSL) has emerged as a promising approach for extracting valuable representations from unlabeled data. One successful SSL method is contrastive learning, which aims to bring positive examples closer while pushing negative examples apart. Many current contrastive learning approaches utilize a parameterized projection head. Through a combination of empirical analysis and theoretical investigation, we provide insights into the internal mechanisms of the projection head and its relationship with the phenomenon of dimensional collapse. Our findings demonstrate that the projection head enhances the quality of representations by performing contrastive loss in a projected subspace. Therefore, we propose an assumption that only a subset of features is necessary when minimizing the contrastive loss of a mini-batch of data. Theoretical analysis further suggests that a sparse projection head can enhance generalization, leading us to introduce SparseHead - a regularization term that effectively constrains the sparsity of the projection head, and can be seamlessly integrated with any self-supervised learning (SSL) approaches. Our experimental results validate the effectiveness of SparseHead, demonstrating its ability to improve the performance of existing contrastive methods.
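One concrete reading of SparseHead is an L1 penalty on the projection head added to a standard contrastive (NT-Xent/InfoNCE) objective. The sketch below follows that reading; the penalty target (the head's parameters), the coefficient `lam`, and the temperature are assumptions, and the paper's exact regularizer may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Standard NT-Xent loss over two augmented views (z1, z2 of shape (B, D))."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                                # (2B, D)
    sim = z @ z.t() / temperature                                 # pairwise similarities
    mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, -float("inf"))                    # drop self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                          # positives are the other view

def sparsehead_loss(h1, h2, head: nn.Module, lam=1e-4):
    """Contrastive loss on projected features plus an L1 penalty encouraging a
    sparse projection head (illustrative coefficient, not from the paper)."""
    loss = info_nce(head(h1), head(h2))
    l1 = sum(p.abs().sum() for p in head.parameters())
    return loss + lam * l1
```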

What Can Simple Arithmetic Operations Do for Temporal Modeling?

  • paper_url: http://arxiv.org/abs/2307.08908
  • repo_url: https://github.com/whwu95/ATM
  • paper_authors: Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang
  • for: Explores temporal modeling of video content using only four simple arithmetic operations to build temporal relations.
  • methods: Extracts auxiliary temporal cues by computing addition, subtraction, multiplication, and division between pairs of frame features, extracts corresponding features from these cues, and feeds them back into the temporally-agnostic backbone; the resulting Arithmetic Temporal Module (ATM) is plug-and-play with both CNN- and ViT-based architectures.
  • results: Extensive ablations show strong temporal modeling at low computational cost, reaching top-1 accuracies of 65.6%, 74.6%, and 89.4% on Something-Something V1, V2, and Kinetics-400, respectively. Code is available at https://github.com/whwu95/ATM.
    Abstract Temporal modeling plays a crucial role in understanding video content. To tackle this problem, previous studies built complicated temporal relations through time sequence thanks to the development of computationally powerful devices. In this work, we explore the potential of four simple arithmetic operations for temporal modeling. Specifically, we first capture auxiliary temporal cues by computing addition, subtraction, multiplication, and division between pairs of extracted frame features. Then, we extract corresponding features from these cues to benefit the original temporal-irrespective domain. We term such a simple pipeline as an Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone with a plug-and-play style. We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost. Moreover, the ATM is compatible with both CNNs- and ViTs-based architectures. Our results show that ATM achieves superior performance over several popular video benchmarks. Specifically, on Something-Something V1, V2 and Kinetics-400, we reach top-1 accuracy of 65.6%, 74.6%, and 89.4% respectively. The code is available at https://github.com/whwu95/ATM.
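The core of the module is easy to sketch: compute addition, subtraction, multiplication, and division between each frame's features and those of its temporal neighbour, project each cue, and add the result back to the per-frame features. The sketch below is illustrative only; the neighbour pairing via `torch.roll`, the linear projections, and the clamped division are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class ArithmeticTemporalModule(nn.Module):
    """Sketch of pairwise arithmetic temporal cues between neighbouring frames."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.eps = eps

    def forward(self, x):                        # x: (B, T, D) per-frame features
        prev = torch.roll(x, shifts=1, dims=1)   # pair each frame with its predecessor
        cues = [
            x + prev,
            x - prev,
            x * prev,
            x / prev.clamp(min=self.eps),        # safe division; assumes non-negative
        ]                                        # (e.g. post-ReLU) activations
        out = x
        for cue, proj in zip(cues, self.proj):
            out = out + proj(cue)                # fuse each arithmetic cue back in
        return out                               # temporally-enriched features
```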

Modular Neural Network Approaches for Surgical Image Recognition

  • paper_url: http://arxiv.org/abs/2307.08880
  • repo_url: None
  • paper_authors: Nosseiba Ben Salem, Younes Bennani, Joseph Karkazan, Abir Barbara, Charles Dacheux, Thomas Gregory
  • for: Proposes deep modular learning approaches for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification.
  • methods: Uses self-training (a semi-supervised approach) together with modular learning, which decomposes a complex problem into simpler sub-tasks.
  • results: Experiments show that modular learning improves classification performance over non-modular systems, and the weighted modular variant achieves almost perfect classification.
    Abstract Deep learning-based applications have seen a lot of success in recent years. Text, audio, image, and video have all been explored with great success using deep learning approaches. The use of convolutional neural networks (CNN) in computer vision, in particular, has yielded reliable results. In order to achieve these results, a large amount of data is required. However, the dataset cannot always be accessible. Moreover, annotating data can be difficult and time-consuming. Self-training is a semi-supervised approach that managed to alleviate this problem and achieve state-of-the-art performances. Theoretical analysis even proved that it may result in a better generalization than a normal classifier. Another problem neural networks can face is the increasing complexity of modern problems, requiring a high computational and storage cost. One way to mitigate this issue, a strategy that has been inspired by human cognition known as modular learning, can be employed. The principle of the approach is to decompose a complex problem into simpler sub-tasks. This approach has several advantages, including faster learning, better generalization, and enables interpretability. In the first part of this paper, we introduce and evaluate different architectures of modular learning for Dorsal Capsulo-Scapholunate Septum (DCSS) instability classification. Our experiments have shown that modular learning improves performances compared to non-modular systems. Moreover, we found that weighted modular, that is to weight the output using the probabilities from the gating module, achieved an almost perfect classification. In the second part, we present our approach for data labeling and segmentation with self-training applied on shoulder arthroscopy images.
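The "weighted modular" variant can be pictured as a mixture-of-experts-style combination: a gating network predicts per-module relevance and the final prediction is the probability-weighted sum of the modules' outputs. The sketch below shows that idea with placeholder linear experts and gate; it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class WeightedModularClassifier(nn.Module):
    """Gate-weighted combination of expert modules over shared features."""
    def __init__(self, feat_dim, num_classes, num_modules):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_modules)])
        self.gate = nn.Linear(feat_dim, num_modules)

    def forward(self, feats):                              # feats: (B, feat_dim)
        gate_p = torch.softmax(self.gate(feats), dim=1)    # (B, M) module relevance
        expert_p = torch.stack(
            [torch.softmax(e(feats), dim=1) for e in self.experts], dim=1)  # (B, M, C)
        return (gate_p.unsqueeze(-1) * expert_p).sum(dim=1)   # (B, C) weighted probabilities
```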

LiDAR-BEVMTN: Real-Time LiDAR Bird’s-Eye View Multi-Task Perception Network for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2307.08850
  • repo_url: None
  • paper_authors: Sambit Mohapatra, Senthil Yogamani, Varun Ravi Kumar, Stefan Milz, Heinrich Gotzig, Patrick Mäder
  • for: Presents a real-time multi-task deep learning network for LiDAR-based 3D scene perception in autonomous driving.
  • methods: Uses a shared encoder with task-specific decoders for joint representation learning and proposes a novel Semantic Weighting and Guidance (SWAG) module that selectively transfers semantic features to improve object detection.
  • results: Achieves 3 ms latency on the embedded NVIDIA Xavier platform, state-of-the-art results for semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection.
    Abstract LiDAR is crucial for robust 3D scene perception in autonomous driving. LiDAR perception has the largest body of literature after camera perception. However, multi-task learning across tasks like detection, segmentation, and motion estimation using LiDAR remains relatively unexplored, especially on automotive-grade embedded platforms. We present a real-time multi-task convolutional neural network for LiDAR-based object detection, semantics, and motion segmentation. The unified architecture comprises a shared encoder and task-specific decoders, enabling joint representation learning. We propose a novel Semantic Weighting and Guidance (SWAG) module to transfer semantic features for improved object detection selectively. Our heterogeneous training scheme combines diverse datasets and exploits complementary cues between tasks. The work provides the first embedded implementation unifying these key perception tasks from LiDAR point clouds achieving 3ms latency on the embedded NVIDIA Xavier platform. We achieve state-of-the-art results for two tasks, semantic and motion segmentation, and close to state-of-the-art performance for 3D object detection. By maximizing hardware efficiency and leveraging multi-task synergies, our method delivers an accurate and efficient solution tailored for real-world automated driving deployment. Qualitative results can be seen at https://youtu.be/H-hWRzv2lIY.

DARTS: Double Attention Reference-based Transformer for Super-resolution

  • paper_url: http://arxiv.org/abs/2307.08837
  • repo_url: https://github.com/bia006/darts
  • paper_authors: Masoomeh Aslahishahri, Jordan Ubbens, Ian Stavness
  • for: Enhances the content of low-resolution images using high-resolution reference images.
  • methods: A transformer model learns joint representations of the two image distributions and enhances low-resolution inputs through correspondences matched against high-resolution references, adapting a double attention block that processes the two visual streams separately and combines self-attention and cross-attention through a gating attention strategy.
  • results: Achieves state-of-the-art results on the SUN80 dataset with PSNR/SSIM of 29.83 / 0.809, showing that attention alone suffices for reference-based super-resolution without purpose-built sub-networks, knowledge distillation, or multi-stage training.
    Abstract We present DARTS, a transformer model for reference-based image super-resolution. DARTS learns joint representations of two image distributions to enhance the content of low-resolution input images through matching correspondences learned from high-resolution reference images. Current state-of-the-art techniques in reference-based image super-resolution are based on a multi-network, multi-stage architecture. In this work, we adapt the double attention block from the GAN literature, processing the two visual streams separately and combining self-attention and cross-attention blocks through a gating attention strategy. Our work demonstrates how the attention mechanism can be adapted for the particular requirements of reference-based image super-resolution, significantly simplifying the architecture and training pipeline. We show that our transformer-based model performs competitively with state-of-the-art models, while maintaining a simpler overall architecture and training process. In particular, we obtain state-of-the-art on the SUN80 dataset, with a PSNR/SSIM of 29.83 / .809. These results show that attention alone is sufficient for the RSR task, without multiple purpose-built subnetworks, knowledge distillation, or multi-stage training.
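The double attention block with gating can be sketched as self-attention on the low-resolution stream, cross-attention to the reference stream, and a learned gate that fuses the two. Everything below (layer sizes, the sigmoid gate, the residual fusion) is an illustrative assumption, not the released model.

```python
import torch
import torch.nn as nn

class GatedDoubleAttention(nn.Module):
    """Self-attention on the LR stream, cross-attention to the reference
    stream, and a per-token gate that decides how to mix the two."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, lr_tokens, ref_tokens):      # (B, N, D), (B, M, D)
        s, _ = self.self_attn(lr_tokens, lr_tokens, lr_tokens)
        c, _ = self.cross_attn(lr_tokens, ref_tokens, ref_tokens)
        g = self.gate(torch.cat([s, c], dim=-1))   # per-token mixing weights in (0, 1)
        return lr_tokens + g * c + (1 - g) * s     # gated fusion of the two streams
```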
    摘要 我们介绍了 DARTS,一种用于参考型图像超分辨率的 transformer 模型。DARTS 学习两个图像分布的联合表示,通过从高分辨率参考图像中学习匹配对应关系来增强低分辨率输入图像的内容。当前参考型图像超分辨率的领先技术采用多网络、多阶段的架构。在本工作中,我们借鉴 GAN 文献中的双注意力块,分别处理两个视觉流,并通过门控注意力策略将自注意力块与交叉注意力块组合在一起。我们的工作展示了注意力机制如何针对参考型图像超分辨率的特定需求进行适配,从而大大简化架构和训练流程。我们的基于 transformer 的模型与最先进模型性能相当,同时保持更简单的整体架构和训练过程。特别地,我们在 SUN80 数据集上取得了最先进的结果,PSNR/SSIM 为 29.83 / .809。这些结果表明,仅靠注意力机制就足以完成参考型超分辨率任务,无需多个专门设计的子网络、知识蒸馏或多阶段训练。
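
As a rough illustration of the gating attention idea described above (self-attention on the low-resolution stream combined with cross-attention to the reference stream), here is a minimal PyTorch sketch; the layer sizes and the residual mixing are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedDoubleAttention(nn.Module):
    """Minimal sketch: self-attention within LR tokens, cross-attention to reference
    tokens, and a learned gate that mixes the two streams."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, lr_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        s, _ = self.self_attn(lr_tokens, lr_tokens, lr_tokens)     # within the LR image
        c, _ = self.cross_attn(lr_tokens, ref_tokens, ref_tokens)  # match the HR reference
        g = self.gate(torch.cat([s, c], dim=-1))                   # per-token mixing weight
        return lr_tokens + g * s + (1 - g) * c

lr = torch.randn(1, 256, 64)    # (batch, tokens, dim)
ref = torch.randn(1, 1024, 64)
print(GatedDoubleAttention(64)(lr, ref).shape)                      # torch.Size([1, 256, 64])
```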

Harnessing the Power of AI based Image Generation Model DALLE 2 in Agricultural Settings

  • paper_url: http://arxiv.org/abs/2307.08789
  • repo_url: None
  • paper_authors: Ranjan Sapkota
  • for: exploring how artificial intelligence (AI) can enhance visualization processes in agriculture, using the AI image generator DALLE 2
  • methods: combining the natural language processing capability of chatGPT with the generative DALLE 2 model to turn textual descriptors into realistic visual content
  • results: AI-generated images, evaluated with MSE, PSNR, and FSIM, improve the quality and accuracy of agricultural visualization, supporting more informed decision-making and better resource distribution, and pointing to an imminent AI-led transformation in precision agriculture.
    Abstract This study investigates the potential impact of artificial intelligence (AI) on the enhancement of visualization processes in the agricultural sector, using the advanced AI image generator, DALLE 2, developed by OpenAI. By synergistically utilizing the natural language processing proficiency of chatGPT and the generative prowess of the DALLE 2 model, which employs a Generative Adversarial Networks (GANs) framework, our research offers an innovative method to transform textual descriptors into realistic visual content. Our rigorously assembled datasets include a broad spectrum of agricultural elements such as fruits, plants, and scenarios differentiating crops from weeds, maintained for AI-generated versus original images. The quality and accuracy of the AI-generated images were evaluated via established metrics including mean squared error (MSE), peak signal-to-noise ratio (PSNR), and feature similarity index (FSIM). The results underline the significant role of the DALLE 2 model in enhancing visualization processes in agriculture, aiding in more informed decision-making, and improving resource distribution. The outcomes of this research highlight the imminent rise of an AI-led transformation in the realm of precision agriculture.
    摘要 本研究利用 OpenAI 开发的先进 AI 图像生成器 DALLE 2,探讨人工智能(AI)对农业领域可视化过程的潜在提升作用。通过协同利用 chatGPT 的自然语言处理能力与基于生成对抗网络(GANs)框架的 DALLE 2 模型的生成能力,我们的研究提供了一种将文本描述转化为逼真视觉内容的创新方法。我们精心构建的数据集涵盖了广泛的农业元素,例如水果、植物以及区分作物与杂草的场景,并分别保存 AI 生成图像与原始图像。AI 生成图像的质量与准确性通过既定指标进行评估,包括均方误差(MSE)、峰值信噪比(PSNR)和特征相似性指数(FSIM)。结果强调了 DALLE 2 模型在提升农业可视化、辅助更明智的决策以及改善资源分配方面的重要作用。本研究的结果预示着精准农业领域即将迎来由 AI 引领的变革。
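
Two of the evaluation metrics named in the abstract, MSE and PSNR, are straightforward to reproduce; the snippet below is a generic implementation (FSIM requires a dedicated implementation and is omitted). The synthetic images are placeholders.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    m = mse(a, b)
    return float("inf") if m == 0 else 10.0 * np.log10(max_val ** 2 / m)

# placeholder "original" vs. "AI-generated" images
orig = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
gen = np.clip(orig + np.random.normal(0, 5, orig.shape), 0, 255).astype(np.uint8)
print(f"MSE={mse(orig, gen):.2f}  PSNR={psnr(orig, gen):.2f} dB")
```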

The FathomNet2023 Competition Dataset

  • paper_url: http://arxiv.org/abs/2307.08781
  • repo_url: https://github.com/fathomnet/fgvc-comp-2023
  • paper_authors: Eric Orenstein, Kevin Barnard, Lonny Lundsten, Geneviève Patterson, Benjamin Woodward, Kakani Katija
  • for: study marine organisms and environmental monitoring
  • methods: automatic processing of visual data
  • results: recognition of new organisms and assessment of out-of-sample data
    Abstract Ocean scientists have been collecting visual data to study marine organisms for decades. These images and videos are extremely valuable both for basic science and environmental monitoring tasks. There are tools for automatically processing these data, but none that are capable of handling the extreme variability in sample populations, image quality, and habitat characteristics that are common in visual sampling of the ocean. Such distribution shifts can occur over very short physical distances and in narrow time windows. Creating models that are able to recognize when an image or video sequence contains a new organism, an unusual collection of animals, or is otherwise out-of-sample is critical to fully leverage visual data in the ocean. The FathomNet2023 competition dataset presents a realistic scenario where the set of animals in the target data differs from the training data. The challenge is both to identify the organisms in a target image and assess whether it is out-of-sample.
    摘要 海洋科学家数十年来一直在收集视觉数据以研究海洋生物。这些图像和视频无论对于基础科学研究还是环境监测任务都极具价值。虽然已有一些自动处理这类数据的工具,但没有一种能够应对海洋视觉采样中常见的样本种群、图像质量和栖息地特征的巨大差异。这类分布偏移可能在很短的物理距离和狭窄的时间窗口内发生。构建能够识别图像或视频序列中出现了新生物、异常动物组合或其他分布外(out-of-sample)情况的模型,对于充分利用海洋视觉数据至关重要。FathomNet2023 竞赛数据集提供了一个贴近真实的场景,其中目标数据中的动物集合与训练数据不同。挑战在于既要识别目标图像中的生物,又要判断其是否属于分布外样本。
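
The competition asks participants both to identify organisms and to flag out-of-sample images. As a point of reference only (a generic baseline, not the challenge's method), the maximum-softmax-probability score is a common starting point for the second part:

```python
import numpy as np

def max_softmax_ood_score(logits: np.ndarray) -> float:
    """Higher score = lower peak confidence = more likely out-of-sample."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    return 1.0 - float(probs.max())

confident = np.array([8.0, 0.5, 0.2])   # peaked prediction -> likely in-distribution
uncertain = np.array([1.1, 1.0, 0.9])   # flat prediction -> flag as out-of-sample
print(max_softmax_ood_score(confident) < max_softmax_ood_score(uncertain))  # True
```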

Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation

  • paper_url: http://arxiv.org/abs/2307.08779
  • repo_url: https://github.com/Red-Fairy/ZeroShotDayNightDA
  • paper_authors: Rundong Luo, Wenjing Wang, Wenhan Yang, Jiaying Liu
  • for: addressing the degraded performance of vision models on nighttime tasks such as recognition and segmentation under low-light conditions
  • methods: a unified similarity min-max framework for zero-shot day-night domain adaptation: images are darkened to minimize feature similarity (enlarging the domain gap), and the model is then adapted by maximizing feature similarity between the darkened images and their normal-light counterparts
  • results: improved model generalizability, with significant gains across nighttime vision tasks including classification, semantic segmentation, visual place recognition, and video action recognition.
    Abstract Low-light conditions not only hamper human visual experience but also degrade the model's performance on downstream vision tasks. While existing works make remarkable progress on day-night domain adaptation, they rely heavily on domain knowledge derived from the task-specific nighttime dataset. This paper challenges a more complicated scenario with border applicability, i.e., zero-shot day-night domain adaptation, which eliminates reliance on any nighttime data. Unlike prior zero-shot adaptation approaches emphasizing either image-level translation or model-level adaptation, we propose a similarity min-max paradigm that considers them under a unified framework. On the image level, we darken images towards minimum feature similarity to enlarge the domain gap. Then on the model level, we maximize the feature similarity between the darkened images and their normal-light counterparts for better model adaptation. To the best of our knowledge, this work represents the pioneering effort in jointly optimizing both aspects, resulting in a significant improvement of model generalizability. Extensive experiments demonstrate our method's effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Code and pre-trained models are available at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.
    摘要 低光照条件不仅影响人类的视觉体验,也会降低模型在下游视觉任务中的性能。现有工作在日夜域适应方面取得了显著进展,但它们严重依赖从特定任务的夜间数据集中获得的领域知识。本文挑战一个适用范围更广、也更复杂的场景,即零样本日夜域适应,完全不依赖任何夜间数据。与以往强调图像级转换或模型级适应的零样本适应方法不同,我们提出了一种相似度最小-最大(min-max)范式,在统一框架下同时考虑两者。在图像层面,我们将图像变暗以使特征相似度最小化,从而扩大域间差距;在模型层面,我们最大化变暗图像与其正常光照对应图像之间的特征相似度,以获得更好的模型适应。据我们所知,本工作是首次联合优化这两个方面,显著提升了模型的泛化能力。大量实验证明了我们方法在各种夜间视觉任务上的有效性和广泛适用性,包括分类、语义分割、视觉地点识别和视频动作识别。代码和预训练模型可在 https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/ 获取。
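
The min-max objective described above can be summarized compactly: the darkening module is trained to minimize feature similarity between an image and its darkened version, while the task model is trained to maximize it. The sketch below uses stand-in modules for the encoder and the darkening operator; it illustrates the two losses, not the paper's actual networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-in networks; the real method uses a task model and a learnable darkening module
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
darkener = nn.Sequential(nn.Conv2d(3, 3, 3, 1, 1), nn.Sigmoid())  # per-pixel dimming in (0, 1)

def feature_similarity(images: torch.Tensor) -> torch.Tensor:
    dark = images * darkener(images)                    # synthesized low-light counterpart
    return F.cosine_similarity(encoder(images), encoder(dark)).mean()

x = torch.rand(4, 3, 32, 32)
loss_darken = feature_similarity(x)    # image level: update darkener to MINIMIZE similarity
loss_adapt = -feature_similarity(x)    # model level: update encoder to MAXIMIZE similarity
print(float(loss_darken), float(loss_adapt))
```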

UPSCALE: Unconstrained Channel Pruning

  • paper_url: http://arxiv.org/abs/2307.08771
  • repo_url: https://github.com/apple/ml-upscale
  • paper_authors: Alvin Wan, Hanxiang Hao, Kaushik Patnaik, Yueyang Xu, Omer Hadad, David Güera, Zhile Ren, Qi Shan
  • for: speeding up inference of convolutional neural networks through channel pruning
  • methods: channel pruning of multi-branch segments can introduce inference-time memory copies; conventional pruners constrain certain channels to be pruned together to eliminate these copies, which significantly impairs accuracy
  • results: reordering channels at export time reduces memory copies and removes the constraints; the resulting UPSCALE algorithm handles any pruning pattern, improving ImageNet accuracy of post-training pruned models by 2.1 points on average and inference speed by up to 2x over a baseline export.
    Abstract As neural networks grow in size and complexity, inference speeds decline. To combat this, one of the most effective compression techniques -- channel pruning -- removes channels from weights. However, for multi-branch segments of a model, channel removal can introduce inference-time memory copies. In turn, these copies increase inference latency -- so much so that the pruned model can be slower than the unpruned model. As a workaround, pruners conventionally constrain certain channels to be pruned together. This fully eliminates memory copies but, as we show, significantly impairs accuracy. We now have a dilemma: Remove constraints but increase latency, or add constraints and impair accuracy. In response, our insight is to reorder channels at export time, (1) reducing latency by reducing memory copies and (2) improving accuracy by removing constraints. Using this insight, we design a generic algorithm UPSCALE to prune models with any pruning pattern. By removing constraints from existing pruners, we improve ImageNet accuracy for post-training pruned models by 2.1 points on average -- benefiting DenseNet (+16.9), EfficientNetV2 (+7.9), and ResNet (+6.2). Furthermore, by reordering channels, UPSCALE improves inference speeds by up to 2x over a baseline export.
    摘要 随着神经网络规模和复杂度的增长,推理速度不断下降。为此,最有效的压缩技术之一——通道剪枝——会从权重中移除通道。然而,对于模型中的多分支段,移除通道可能在推理时引入内存拷贝,这些拷贝会增加推理延迟,甚至使剪枝后的模型比未剪枝的模型更慢。作为变通办法,剪枝器通常约束某些通道必须一起被剪除,这样虽然完全消除了内存拷贝,但如我们所示,会显著损害精度。于是出现了两难:去掉约束会增加延迟,加上约束又会损害精度。对此,我们的洞见是在导出时对通道重新排序:(1) 通过减少内存拷贝来降低延迟;(2) 通过去除约束来提升精度。基于这一洞见,我们设计了通用算法 UPSCALE,可对任意剪枝模式的模型进行剪枝。通过去除现有剪枝器的约束,我们将训练后剪枝模型在 ImageNet 上的精度平均提升 2.1 个百分点,DenseNet (+16.9)、EfficientNetV2 (+7.9) 和 ResNet (+6.2) 均受益。此外,通过重新排序通道,UPSCALE 将推理速度相比基线导出提升最多 2 倍。
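
The core export-time trick — reorder channels so the kept ones form a contiguous block, then slice instead of gathering — can be illustrated on a pair of convolution weight tensors. This is a simplified sketch (no branches, batch norm, or graph tracing), not the UPSCALE implementation.

```python
import torch

def bake_reorder(conv_w, conv_b, next_w, keep_idx):
    """Permute the producer's output channels so kept channels form a contiguous
    prefix, permute the consumer's input channels to match, then slice.
    At inference this avoids a memory copy (gather) entirely."""
    drop_idx = [i for i in range(conv_w.shape[0]) if i not in set(keep_idx)]
    perm = torch.tensor(list(keep_idx) + drop_idx)
    conv_w, conv_b = conv_w[perm], conv_b[perm]   # reorder producer outputs
    next_w = next_w[:, perm]                      # keep the consumer consistent
    n = len(keep_idx)
    return conv_w[:n], conv_b[:n], next_w[:, :n]

w1, b1 = torch.randn(8, 3, 3, 3), torch.randn(8)   # producer conv: 8 output channels
w2 = torch.randn(16, 8, 3, 3)                      # consumer conv: 8 input channels
pw1, pb1, pw2 = bake_reorder(w1, b1, w2, keep_idx=[1, 4, 6])
print(pw1.shape, pw2.shape)   # torch.Size([3, 3, 3, 3]) torch.Size([16, 3, 3, 3])
```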

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

  • paper_url: http://arxiv.org/abs/2307.08763
  • repo_url: None
  • paper_authors: Kumar Ashutosh, Santhosh Kumar Ramakrishnan, Triantafyllos Afouras, Kristen Grauman
  • for: improving machine understanding of human keysteps in instructional videos, e.g., for DIY fix-it tasks and recipes
  • methods: automatically mining a task graph from how-to videos that probabilistically represents how people tend to execute keysteps, and using this graph to regularize keystep recognition in novel videos
  • results: more reliable zero-shot keystep localization and improved video representation learning on multiple real-world instructional video datasets, exceeding the state of the art.
    Abstract Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state -- such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation of this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
    摘要 过程性活动理解需要将人类行为放在更广泛的任务语境中进行感知:为了达到最终目标状态,多个关键步骤需要在长视频中按顺序执行,例如菜谱的步骤或 DIY 修理任务。以往工作大多将关键步骤识别与这一更广泛的结构割裂开来,或者僵硬地要求关键步骤与预先定义的顺序脚本对齐。我们提出从教学视频中自动挖掘任务图,以概率方式表示人们通常如何执行关键步骤,然后利用该图来规范新视频中的关键步骤识别。在多个真实教学视频数据集上,我们展示了其效果:更可靠的零样本关键步骤定位和更好的视频表示学习,超越了当前最佳水平。
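
To make the idea concrete, a task graph can be represented as a matrix of keystep transition probabilities mined from training sequences and used as a prior when re-scoring per-frame predictions. The toy example below is only a schematic of that mechanism; the paper's actual mining and regularization are more involved.

```python
import numpy as np

def mine_task_graph(sequences, num_steps, alpha=1.0):
    """Estimate P(next keystep | current keystep) with Laplace smoothing."""
    counts = np.full((num_steps, num_steps), alpha)
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def regularize(frame_probs, graph, weight=0.5):
    """Blend per-frame keystep scores with the transition prior from the previous frame."""
    out = frame_probs.copy()
    for t in range(1, len(frame_probs)):
        prior = graph[out[t - 1].argmax()]
        out[t] = (1 - weight) * frame_probs[t] + weight * prior
    return out

graph = mine_task_graph([[0, 1, 2], [0, 2, 1], [0, 1, 2]], num_steps=3)
preds = np.random.dirichlet(np.ones(3), size=5)    # stand-in per-frame keystep scores
print(regularize(preds, graph).argmax(axis=1))
```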

Diffusion Models Beat GANs on Image Classification

  • paper_url: http://arxiv.org/abs/2307.08702
  • repo_url: None
  • paper_authors: Soumik Mukhopadhyay, Matthew Gwilliam, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Srinidhi Hegde, Tianyi Zhou, Abhinav Shrivastava
  • for: exploring a unified representation learner that addresses both generative and discriminative tasks simultaneously, with diffusion models as a prime candidate
  • methods: a U-Net-based diffusion model trained for image generation, denoising, inpainting, super-resolution, and manipulation, whose intermediate feature maps are extracted and pooled as embeddings for classification
  • results: with careful feature selection and pooling, diffusion features outperform comparable generative-discriminative methods such as BigBiGAN on ImageNet classification and achieve promising results on several fine-grained visual classification datasets in the transfer setting.
    Abstract While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which uses a single pre-training stage to address both families of tasks simultaneously. We identify diffusion models as a prime candidate. Diffusion models have risen to prominence as a state-of-the-art method for image generation, denoising, inpainting, super-resolution, manipulation, etc. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high fidelity, diverse, novel images. The U-Net architecture, as a convolution-based architecture, generates a diverse set of feature representations in the form of intermediate feature maps. We present our findings that these embeddings are useful beyond the noise prediction task, as they contain discriminative information and can also be leveraged for classification. We explore optimal methods for extracting and using these embeddings for classification tasks, demonstrating promising results on the ImageNet classification task. We find that with careful feature selection and pooling, diffusion models outperform comparable generative-discriminative methods such as BigBiGAN for classification tasks. We investigate diffusion models in the transfer learning regime, examining their performance on several fine-grained visual classification datasets. We compare these embeddings to those generated by competing architectures and pre-trainings for classification tasks.
    摘要 多数无监督学习模型都专注于一类任务,即生成式或判别式,而我们探索一种统一的表示学习器:一种通过单一预训练阶段同时解决这两类任务的模型。我们认为扩散模型是最佳候选。扩散模型已在图像生成、去噪、修补、超分辨率、编辑等任务中成为最先进的方法。这类模型通过训练 U-Net 来迭代地预测并去除噪声,训练出的模型能够合成高保真、多样且新颖的图像。U-Net 架构是一种基于卷积的架构,会以中间特征图的形式产生多样的特征表示。我们发现这些嵌入不仅对噪声预测任务有用,还包含判别性信息,可用于分类。我们探索了提取和利用这些嵌入进行分类任务的最优方法,并在 ImageNet 分类任务上取得了可观的结果。我们发现,通过精心的特征选择和池化,扩散模型在分类任务上超过了可比的生成-判别方法,如 BigBiGAN。我们还研究了扩散模型在迁移学习设置下的性能,在多个细粒度视觉分类数据集上进行了评测,并将这些嵌入与其他架构和预训练方式所产生的嵌入在分类任务上进行了比较。
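
The recipe described above — noise an image, run the frozen diffusion U-Net, pool an intermediate feature map, and fit a linear head — can be sketched as follows. The `dummy_unet`, the noising step, and the feature dimension are placeholders; the paper studies which block, timestep, and pooling work best.

```python
import torch
import torch.nn as nn

class DiffusionLinearProbe(nn.Module):
    def __init__(self, unet, feat_dim: int, num_classes: int, t: int = 100):
        super().__init__()
        self.unet, self.t = unet, t
        self.head = nn.Linear(feat_dim, num_classes)

    @torch.no_grad()
    def features(self, x: torch.Tensor) -> torch.Tensor:
        x_t = x + 0.1 * torch.randn_like(x)   # stand-in for the true forward noising q(x_t | x_0)
        feats = self.unet(x_t, self.t)        # assumed to expose an intermediate feature map
        return feats.mean(dim=(2, 3))         # global average pooling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))    # only the linear head is trained

dummy_unet = lambda x, t: nn.Conv2d(3, 64, 3, padding=1)(x)   # placeholder backbone
probe = DiffusionLinearProbe(dummy_unet, feat_dim=64, num_classes=10)
print(probe(torch.randn(2, 3, 32, 32)).shape)                  # torch.Size([2, 10])
```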

Flow Matching in Latent Space

  • paper_url: http://arxiv.org/abs/2307.08698
  • repo_url: https://github.com/vinairesearch/lfm
  • paper_authors: Quan Dao, Hao Phung, Binh Nguyen, Anh Tran
  • for: a flow-matching-based generative model for high-resolution image synthesis that can be trained on constrained computational resources and handles a range of conditional generation tasks
  • methods: flow matching trained in the latent space of a pretrained autoencoder, which improves computational efficiency and scalability for high-resolution image synthesis
  • results: strong quantitative and qualitative results on conditional generation tasks (label-conditioned generation, inpainting, semantic-to-image) and on datasets such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet, together with a theoretical result showing the Wasserstein-2 distance between the reconstructed latent flow distribution and the true data distribution is upper-bounded by the latent flow matching objective.
    Abstract Flow matching is a recent framework to train generative models that exhibits impressive empirical performance while being relatively easier to train compared with diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computing and a large number of function evaluations of off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, this particular model type remains underexplored in this area. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining their quality and flexibility. Additionally, our work stands as a pioneering contribution in the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and true data distribution, showing it is upper-bounded by the latent flow matching objective. Our code will be available at https://github.com/VinAIResearch/LFM.git.
    摘要 流匹配(flow matching)是近来提出的一种生成模型训练框架,其经验性能令人印象深刻,同时相比基于扩散的模型更容易训练。尽管具有这些优势,以往方法仍面临像素空间中计算开销大、现成求解器需要大量函数求值的挑战。此外,尽管基于隐空间的生成方法近年来取得了巨大成功,这类模型在该方向上仍然缺乏探索。在本工作中,我们提出在预训练自编码器的隐空间中应用流匹配,从而为高分辨率图像合成提供更高的计算效率和可扩展性。这使得在受限的计算资源下也能进行流匹配训练,同时保持其质量和灵活性。此外,我们的工作率先将多种条件整合进流匹配以完成条件生成任务,包括标签条件图像生成、图像修补以及语义图到图像生成。大量实验表明,我们的方法在 CelebA-HQ、FFHQ、LSUN Church & Bedroom 和 ImageNet 等多个数据集上均取得了良好的定量与定性结果。我们还给出了重建的隐空间流分布与真实数据分布之间 Wasserstein-2 距离的理论控制,证明其上界为隐空间流匹配目标。我们的代码将在 https://github.com/VinAIResearch/LFM.git 公开。
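
The training objective itself is compact: encode images with a frozen autoencoder, draw a noise latent, interpolate along a straight line, and regress the constant velocity. The sketch below uses toy stand-ins for the encoder and velocity network, and omits the conditioning inputs described above.

```python
import torch
import torch.nn as nn

encode = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))    # stand-in frozen encoder
velocity_net = nn.Sequential(nn.Linear(64 + 1, 128), nn.SiLU(), nn.Linear(128, 64))

def latent_flow_matching_loss(images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        z1 = encode(images)                    # data latents
    z0 = torch.randn_like(z1)                  # noise latents
    t = torch.rand(z1.shape[0], 1)
    zt = (1 - t) * z0 + t * z1                 # straight-line probability path
    target_v = z1 - z0                         # its constant velocity
    pred_v = velocity_net(torch.cat([zt, t], dim=1))
    return ((pred_v - target_v) ** 2).mean()

print(float(latent_flow_matching_loss(torch.randn(8, 3, 32, 32))))
```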

Neural Video Depth Stabilizer

  • paper_url: http://arxiv.org/abs/2307.08695
  • repo_url: https://github.com/raymondwang987/nvds
  • paper_authors: Yiran Wang, Min Shi, Jiaqi Li, Zihao Huang, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin
  • for: temporally consistent video depth estimation, improving the stability of per-frame depth predictions
  • methods: a plug-and-play Neural Video Depth Stabilizer (NVDS) framework that stabilizes inconsistent depth estimates and can be applied to different single-image depth models without extra effort or test-time finetuning
  • results: evaluated on the large-scale Video Depth in the Wild (VDW) dataset (14,203 videos, over two million frames) and two public benchmarks, showing significant improvements in consistency, accuracy, and efficiency over previous approaches.
    Abstract Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.
    摘要 视频深度估计的目标是推断时间上一致的深度。一些方法在测试时利用几何和重投影约束对单张图像深度模型进行微调来获得时间一致性,但这种做法效率低且不够鲁棒。另一种思路是从数据中学习如何保证时间一致性,但这需要精心设计的模型和足够的视频深度数据。为了解决这些挑战,我们提出了一个名为神经视频深度稳定器(NVDS)的即插即用框架,它能够稳定不一致的深度估计,并且无需额外工作即可应用于不同的单张图像深度模型。我们还推出了一个大规模数据集——视频深度在野(VDW),包含 14,203 个视频、超过 200 万帧,据我们所知是目前最大的自然场景视频深度数据集。我们在 VDW 数据集以及两个公开基准上评估了我们的方法,相比以往方法在一致性、精度和效率上均有显著提升。我们的工作可作为坚实的基线,并为基于学习的视频深度模型提供数据基础。我们将公开数据集和代码以供未来研究。

SEMI-DiffusionInst: A Diffusion Model Based Approach for Semiconductor Defect Classification and Segmentation

  • paper_url: http://arxiv.org/abs/2307.08693
  • repo_url: None
  • paper_authors: Vic De Ridder, Bappaditya Dey, Sandip Halder, Bartel Van Waeyenberge
  • for: a new semiconductor defect inspection framework, SEMI-DiffusionInst, evaluated against previous frameworks
  • methods: a diffusion-model-based detector and segmenter; different feature extractor backbones and data sampling strategies are investigated to balance precision and computational efficiency
  • results: improved overall mAP and segmentation mAP, performing better than or on par with previous work for most defect classes; detection precision on line collapse and thin bridge defects improves by roughly 15%, and tuning inference hyperparameters significantly reduces inference time without compromising model precision.
    Abstract With continuous progression of Moore's Law, integrated circuit (IC) device complexity is also increasing. Scanning Electron Microscope (SEM) image based extensive defect inspection and accurate metrology extraction are two main challenges in advanced node (2 nm and beyond) technology. Deep learning (DL) algorithm based computer vision approaches gained popularity in semiconductor defect inspection over last few years. In this research work, a new semiconductor defect inspection framework "SEMI-DiffusionInst" is investigated and compared to previous frameworks. To the best of the authors' knowledge, this work is the first demonstration to accurately detect and precisely segment semiconductor defect patterns by using a diffusion model. Different feature extractor networks as backbones and data sampling strategies are investigated towards achieving a balanced trade-off between precision and computing efficiency. Our proposed approach outperforms previous work on overall mAP and performs comparatively better or as per for almost all defect classes (per class APs). The bounding box and segmentation mAPs achieved by the proposed SEMI-DiffusionInst model are improved by 3.83% and 2.10%, respectively. Among individual defect types, precision on line collapse and thin bridge defects are improved approximately 15\% on detection task for both defect types. It has also been shown that by tuning inference hyperparameters, inference time can be improved significantly without compromising model precision. Finally, certain limitations and future work strategy to overcome them are discussed.
    摘要 随着摩尔定律的持续推进,集成电路(IC)器件的复杂度也在不断增加。基于扫描电子显微镜(SEM)图像的大规模缺陷检测和精确量测提取,是先进制程节点(2 nm 及以下)技术面临的两大挑战。近年来,基于深度学习(DL)算法的计算机视觉方法在半导体缺陷检测中得到了广泛应用。在本研究中,我们提出并考察了一种新的半导体缺陷检测框架 SEMI-DiffusionInst,并与以往框架进行了比较。据作者所知,这是首次使用扩散模型准确检测并精确分割半导体缺陷图案的工作。我们研究了不同的特征提取骨干网络和数据采样策略,以在精度和计算效率之间取得平衡。我们提出的方法在总体 mAP 上优于以往工作,并且在几乎所有缺陷类别(逐类 AP)上表现更好或相当。SEMI-DiffusionInst 模型的边界框 mAP 和分割 mAP 分别提升了 3.83% 和 2.10%。在各缺陷类型中,线坍塌(line collapse)和细桥(thin bridge)缺陷在检测任务上的精度提升了约 15%。我们还表明,通过调整推理超参数,可以在不损害模型精度的前提下显著缩短推理时间。最后,我们讨论了方法的一些局限性以及克服这些局限性的未来工作方向。

Semantic Counting from Self-Collages

  • paper_url: http://arxiv.org/abs/2307.08727
  • repo_url: https://github.com/lukasknobel/selfcollages
  • paper_authors: Lukas Knobel, Tengda Han, Yuki M. Asano
  • for: learning object counting without any manual annotations
  • methods: training on self-constructed "SelfCollages" (images with various pasted objects) as supervision, building on existing unsupervised representation and segmentation techniques
  • results: outperforms simple baselines and generic models such as FasterRCNN, and matches supervised counting models in some domains.
    Abstract While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets, they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose Unsupervised Counter (UnCo), a model that can learn this task without requiring any manual annotations. To this end, we construct "SelfCollages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate the ability to count objects without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN, but also matches the performance of supervised counting models in some domains.
    摘要 尽管近期基于参考的有监督目标计数方法在基准数据集上的性能持续提升,但由于人工标注图像中数十个目标的成本高昂,它们只能依赖小规模数据集。我们提出了无监督计数器(Unsupervised Counter,UnCo),一种无需任何人工标注即可学习该任务的模型。为此,我们构建了 “SelfCollages”——粘贴了各种目标的图像——作为训练样本,它们提供了覆盖任意目标类型与数量的丰富学习信号。我们的方法建立在现有的无监督表示与分割技术之上,成功展示了无需人工监督即可进行目标计数的能力。实验表明,我们的方法不仅优于简单基线和 FasterRCNN 等通用模型,还在部分领域达到了有监督计数模型的性能。
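
The SelfCollage idea — paste a known number of object crops onto a background and use that number as the counting target — can be illustrated with a toy generator. The real pipeline obtains object crops via unsupervised segmentation and representations; the naive pasting below is only a schematic.

```python
import numpy as np

def make_selfcollage(background, crops, rng, max_objects=20):
    """Paste a random number of crops onto a copy of the background;
    the number of pastes is the counting label."""
    canvas = background.copy()
    n = int(rng.integers(1, max_objects + 1))
    h, w, _ = canvas.shape
    for _ in range(n):
        crop = crops[rng.integers(len(crops))]
        ch, cw, _ = crop.shape
        y, x = rng.integers(h - ch), rng.integers(w - cw)
        canvas[y:y + ch, x:x + cw] = crop        # naive paste; no blending or masking
    return canvas, n

rng = np.random.default_rng(0)
bg = np.zeros((128, 128, 3), dtype=np.uint8)
objs = [np.full((16, 16, 3), 255, dtype=np.uint8)]   # placeholder "object" crop
img, count = make_selfcollage(bg, objs, rng)
print(img.shape, count)
```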

Implementation of a perception system for autonomous vehicles using a detection-segmentation network in SoC FPGA

  • paper_url: http://arxiv.org/abs/2307.08682
  • repo_url: https://github.com/vision-agh/mt_kria
  • paper_authors: Maciej Baczmanski, Mateusz Wasala, Tomasz Kryjak
  • for: an efficient, real-time, energy-efficient perception system for autonomous vehicles
  • methods: based on the MultiTaskV3 detection-segmentation network, trained, quantized, and accelerated in parallel on the AMD Xilinx Kria KV260 Vision AI embedded platform
  • results: above 97% mAP for object detection and above 90% mIoU for image segmentation when tested in a mock-up city environment, with low power consumption (about 5 W on average) and a small form factor.
    Abstract Perception and control systems for autonomous vehicles are an active area of scientific and industrial research. These solutions should be characterised by high efficiency in recognising obstacles and other environmental elements in different road conditions, real-time capability, and energy efficiency. Achieving such functionality requires an appropriate algorithm and a suitable computing platform. In this paper, we have used the MultiTaskV3 detection-segmentation network as the basis for a perception system that can perform both functionalities within a single architecture. It was appropriately trained, quantised, and implemented on the AMD Xilinx Kria KV260 Vision AI embedded platform. By using this device, it was possible to parallelise and accelerate the computations. Furthermore, the whole system consumes relatively little power compared to a CPU-based implementation (an average of 5 watts, compared to the minimum of 55 watts for weaker CPUs, and the small size (119mm x 140mm x 36mm) of the platform allows it to be used in devices where the amount of space available is limited. It also achieves an accuracy higher than 97% of the mAP (mean average precision) for object detection and above 90% of the mIoU (mean intersection over union) for image segmentation. The article also details the design of the Mecanum wheel vehicle, which was used to test the proposed solution in a mock-up city.
    摘要 自动驾驶车辆的感知与控制系统是当前科研与工业界的活跃研究方向。此类解决方案应当能够在不同路况下高效识别障碍物及其他环境要素,同时具备实时性和能效性。实现这样的功能需要合适的算法和合适的计算平台。在本文中,我们以 MultiTaskV3 检测-分割网络为基础,构建了一个可在单一架构内同时完成这两类功能的感知系统。该网络经过了相应的训练和量化,并部署在 AMD Xilinx Kria KV260 Vision AI 嵌入式平台上。借助该设备,计算得以并行化和加速。此外,与基于 CPU 的实现相比,整个系统功耗相对较低(平均约 5 瓦,而较弱的 CPU 至少需要 55 瓦),且平台体积小巧(119mm x 140mm x 36mm),适用于可用空间有限的设备。该系统在目标检测上达到了高于 97% 的 mAP(平均精度均值),在图像分割上达到了高于 90% 的 mIoU(平均交并比)。文章还详细介绍了用于在模拟城市中测试所提方案的麦克纳姆轮小车的设计。

CohortFinder: an open-source tool for data-driven partitioning of biomedical image cohorts to yield robust machine learning models

  • paper_url: http://arxiv.org/abs/2307.08673
  • repo_url: None
  • paper_authors: Fan Fan, Georgia Martinez, Thomas Desilvio, John Shin, Yijiang Chen, Bangchen Wang, Takaya Ozeki, Maxime W. Lafarge, Viktor H. Koelzer, Laura Barisoni, Anant Madabhushi, Satish E. Viswanath, Andrew Janowczyk
  • for: mitigating batch effects (BEs) in machine learning (ML) models through data-driven cohort partitioning
  • methods: an open-source tool, CohortFinder, that partitions biomedical image cohorts in a data-driven manner to reduce BEs
  • results: CohortFinder improves ML model performance in downstream medical image processing tasks.
    Abstract Batch effects (BEs) refer to systematic technical differences in data collection unrelated to biological variations whose noise is shown to negatively impact machine learning (ML) model generalizability. Here we release CohortFinder, an open-source tool aimed at mitigating BEs via data-driven cohort partitioning. We demonstrate CohortFinder improves ML model performance in downstream medical image processing tasks. CohortFinder is freely available for download at cohortfinder.com.
    摘要 批次效应(Batch effects, BEs)指与生物学变异无关的数据采集过程中的系统性技术差异,其带来的噪声已被证明会损害机器学习(ML)模型的泛化能力。我们在此发布 CohortFinder,一个旨在通过数据驱动的队列划分来缓解批次效应的开源工具。我们展示了 CohortFinder 能够提升 ML 模型在下游医学影像处理任务中的性能。CohortFinder 可在 cohortfinder.com 免费下载。
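
One plausible reading of data-driven cohort partitioning is: cluster images by acquisition-related statistics and stratify the train/test split across those clusters so no batch-effect group lands entirely in one partition. The sketch below uses per-channel color means and k-means purely as stand-ins; it is not CohortFinder's actual feature set or algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def partition_cohort(images, n_clusters=3, test_size=0.3, seed=0):
    # crude acquisition signature: mean color per image (stand-in for stain/scanner cues)
    feats = np.stack([img.reshape(-1, 3).mean(axis=0) for img in images])
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(feats)
    idx = np.arange(len(images))
    train_idx, test_idx = train_test_split(
        idx, test_size=test_size, stratify=clusters, random_state=seed)
    return train_idx, test_idx, clusters

# toy images drawn from three simulated "scanner" color profiles
imgs = [np.clip(np.random.rand(64, 64, 3) * 0.2 + shift, 0, 1)
        for shift in (0.1, 0.4, 0.7) for _ in range(10)]
tr, te, cl = partition_cohort(imgs)
print(len(tr), len(te), np.bincount(cl))
```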

PolyGNN: Polyhedron-based Graph Neural Network for 3D Building Reconstruction from Point Clouds

  • paper_url: http://arxiv.org/abs/2307.08636
  • repo_url: https://github.com/chenzhaiyu/polygnn
  • paper_authors: Zhaiyu Chen, Yilei Shi, Liangliang Nan, Zhitong Xiong, Xiao Xiang Zhu
  • for: a polyhedron-based graph neural network for 3D building reconstruction from point clouds
  • methods: primitives obtained by polyhedral decomposition are assembled via graph node classification; three sampling strategies select representative points as polyhedron-wise queries for efficient occupancy inference, and inter-polyhedron adjacency is incorporated to strengthen node classification
  • results: fast, high-quality large-scale reconstruction, evaluated on a synthetic dataset of over 500k buildings and validated by a transferability analysis across cities and on real-world point clouds.
    Abstract We present PolyGNN, a polyhedron-based graph neural network for 3D building reconstruction from point clouds. PolyGNN learns to assemble primitives obtained by polyhedral decomposition via graph node classification, achieving a watertight, compact, and weakly semantic reconstruction. To effectively represent arbitrary-shaped polyhedra in the neural network, we propose three different sampling strategies to select representative points as polyhedron-wise queries, enabling efficient occupancy inference. Furthermore, we incorporate the inter-polyhedron adjacency to enhance the classification of the graph nodes. We also observe that existing city-building models are abstractions of the underlying instances. To address this abstraction gap and provide a fair evaluation of the proposed method, we develop our method on a large-scale synthetic dataset covering 500k+ buildings with well-defined ground truths of polyhedral class labels. We further conduct a transferability analysis across cities and on real-world point clouds. Both qualitative and quantitative results demonstrate the effectiveness of our method, particularly its efficiency for large-scale reconstructions. The source code and data of our work are available at https://github.com/chenzhaiyu/polygnn.
    摘要 我们提出 PolyGNN,一种基于多面体的图神经网络,用于从点云重建 3D 建筑。PolyGNN 通过图节点分类学习组装由多面体分解得到的基元,从而获得水密、紧凑且带有弱语义的重建结果。为了在神经网络中有效表示任意形状的多面体,我们提出了三种不同的采样策略来选取代表性点作为逐多面体的查询,从而实现高效的占据推断。此外,我们引入多面体间的邻接关系来增强图节点分类。我们还注意到现有的城市建筑模型是对底层实例的抽象。为弥合这一抽象差距并对所提方法进行公平评估,我们在一个覆盖 50 万余栋建筑、带有明确多面体类别标签真值的大规模合成数据集上开发了我们的方法,并进一步开展了跨城市以及真实点云上的可迁移性分析。定性与定量结果均证明了方法的有效性,尤其是其在大规模重建中的效率。我们的代码和数据可在 https://github.com/chenzhaiyu/polygnn 获取。
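
A polyhedron-wise sampling strategy must, in essence, turn an arbitrary convex cell into a fixed-size set of representative query points. A simple way to prototype this for a cell given as an intersection of half-spaces is rejection sampling inside a bounding box; the strategy below is illustrative and not necessarily one of the paper's three.

```python
import numpy as np

def sample_polyhedron_points(A, b, n_points=32, rng=None, batch=1024, box=(-1.0, 1.0)):
    """Draw candidate points in a bounding box and keep those with A @ x <= b."""
    if rng is None:
        rng = np.random.default_rng(0)
    kept = []
    while sum(len(k) for k in kept) < n_points:
        cand = rng.uniform(box[0], box[1], size=(batch, A.shape[1]))
        inside = np.all(cand @ A.T <= b, axis=1)     # half-space membership test
        kept.append(cand[inside])
    return np.concatenate(kept)[:n_points]

# unit cube centered at the origin, expressed as 6 half-spaces
A = np.vstack([np.eye(3), -np.eye(3)])
b = 0.5 * np.ones(6)
pts = sample_polyhedron_points(A, b)
print(pts.shape)        # (32, 3)
```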

Deficiency-Aware Masked Transformer for Video Inpainting

  • paper_url: http://arxiv.org/abs/2307.08629
  • repo_url: https://github.com/yeates/dmt
  • paper_authors: Yongsheng Yu, Heng Fan, Libo Zhang
  • for: video inpainting, i.e., filling in corrupted regions of a video, including deficiency cases where the masked content never appears in other frames
  • methods: a Deficiency-aware Masked Transformer (DMT) framework with three key ingredients: an image inpainting model DMT_img pretrained as a prior for distilling the video model DMT_vid (improving hallucination in deficiency cases), a self-attention module that selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals, and a simple yet effective Receptive Field Contextualizer
  • results: DMT_vid significantly outperforms previous solutions on the YouTube-VOS and DAVIS datasets.
    Abstract Recent video inpainting methods have made remarkable progress by utilizing explicit guidance, such as optical flow, to propagate cross-frame pixels. However, there are cases where cross-frame recurrence of the masked video is not available, resulting in a deficiency. In such situation, instead of borrowing pixels from other frames, the focus of the model shifts towards addressing the inverse problem. In this paper, we introduce a dual-modality-compatible inpainting framework called Deficiency-aware Masked Transformer (DMT), which offers three key advantages. Firstly, we pretrain a image inpainting model DMT_img serve as a prior for distilling the video model DMT_vid, thereby benefiting the hallucination of deficiency cases. Secondly, the self-attention module selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals. Thirdly, a simple yet effective Receptive Field Contextualizer is integrated into DMT, further improving performance. Extensive experiments conducted on YouTube-VOS and DAVIS datasets demonstrate that DMT_vid significantly outperforms previous solutions. The code and video demonstrations can be found at github.com/yeates/DMT.
    摘要 近期的视频修补方法通过利用光流等显式引导来传播跨帧像素,取得了显著进展。然而,在某些情况下,被遮挡区域在其他帧中并不存在可借用的内容,从而造成信息缺失。在这种情况下,模型不再从其他帧借用像素,而是将重心转向求解这一逆问题。在本文中,我们提出了一个双模态兼容的修补框架——缺失感知掩码 Transformer(DMT),它具有三大优势:首先,我们预训练图像修补模型 DMT_img,将其作为蒸馏视频模型 DMT_vid 的先验,从而有利于缺失情形下的内容生成;其次,自注意力模块选择性地纳入时空 token,以加速推理并去除噪声信号;第三,我们在 DMT 中集成了一个简单而有效的感受野上下文器(Receptive Field Contextualizer),进一步提升性能。在 YouTube-VOS 和 DAVIS 数据集上的大量实验表明,DMT_vid 显著优于以往方案。代码和视频演示见 github.com/yeates/DMT。

Benchmarking fixed-length Fingerprint Representations across different Embedding Sizes and Sensor Types

  • paper_url: http://arxiv.org/abs/2307.08615
  • repo_url: https://github.com/tim-rohwedder/fixed-length-fingerprint-extractors
  • paper_authors: Tim Rohwedder, Daile Osorio-Roig, Christian Rathgeb, Christoph Busch
  • for: improving the computational efficiency of fingerprint recognition by reducing the dimensionality of texture information while preserving high biometric performance
  • methods: deep neural networks that extract fixed-length fingerprint embeddings
  • results: an embedding size of 512 feature elements is optimal for the texture-based part of fixed-length fingerprint representations, and performance differences between optical and capacitive sensor types can be observed.
    Abstract Traditional minutiae-based fingerprint representations consist of a variable-length set of minutiae. This necessitates a more complex comparison causing the drawback of high computational cost in one-to-many comparison. Recently, deep neural networks have been proposed to extract fixed-length embeddings from fingerprints. In this paper, we explore to what extent fingerprint texture information contained in such embeddings can be reduced in terms of dimension while preserving high biometric performance. This is of particular interest since it would allow to reduce the number of operations incurred at comparisons. We also study the impact in terms of recognition performance of the fingerprint textural information for two sensor types, i.e. optical and capacitive. Furthermore, the impact of rotation and translation of fingerprint images on the extraction of fingerprint embeddings is analysed. Experimental results conducted on a publicly available database reveal an optimal embedding size of 512 feature elements for the texture-based embedding part of fixed-length fingerprint representations. In addition, differences in performance between sensor types can be perceived.
    摘要 传统的基于细节点(minutiae)的指纹表示由数量可变的细节点集合构成,这使得比对过程更为复杂,在一对多比对中带来较高计算开销的缺点。近年来,研究者提出利用深度神经网络从指纹中提取定长嵌入。本文探讨在保持高生物特征识别性能的前提下,此类嵌入中所包含的指纹纹理信息在维度上可以压缩到何种程度;这一点尤为重要,因为它可以减少比对时的运算量。我们还研究了指纹纹理信息对两类传感器(光学与电容式)识别性能的影响,并分析了指纹图像的旋转与平移对定长嵌入提取的影响。在一个公开数据库上的实验结果表明,定长指纹表示中基于纹理的嵌入部分的最优大小为 512 个特征元素;此外,还可以观察到不同传感器类型之间的性能差异。
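
The efficiency argument is easy to see in code: with fixed-length embeddings, a one-to-many search collapses into a single matrix-vector product over L2-normalized vectors. The 512-dimensional size below mirrors the reported optimum; in practice the network would be trained to emit embeddings of that size, and the random vectors here are placeholders.

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

dim = 512                                                # reported optimal texture embedding size
gallery = l2_normalize(np.random.randn(10_000, dim))     # enrolled templates (placeholders)
probe = l2_normalize(np.random.randn(1, dim))            # query fingerprint embedding
scores = gallery @ probe.T                               # cosine scores, one-to-many in one pass
best = int(scores.argmax())
print(best, float(scores[best]))
```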