cs.CV - 2023-11-02

Idempotent Generative Network

  • paper_url: http://arxiv.org/abs/2311.01462
  • repo_url: None
  • paper_authors: Assaf Shocher, Amil Dravid, Yossi Gandelsman, Inbar Mosseri, Michael Rubinstein, Alexei A. Efros
  • for: This work proposes a generative model based on training a neural network to be idempotent; the model can generate an output in a single step, maintains a consistent latent space, and allows sequential application for refinement.
  • methods: The network is trained to act as an idempotent operator whose goal is to map a source distribution (e.g., Gaussian noise) onto a target distribution (e.g., realistic images).
  • results: The resulting model generates outputs in one step while keeping a consistent latent space, and it can also take inputs from both the source and target distributions and project corrupted or modified data back onto the target manifold.
    Abstract We propose a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely $f(f(z))=f(z)$. The proposed model $f$ is trained to map a source distribution (e.g., Gaussian noise) to a target distribution (e.g., realistic images) using the following objectives: (1) Instances from the target distribution should map to themselves, namely $f(x)=x$. We define the target manifold as the set of all instances that $f$ maps to themselves. (2) Instances from the source distribution should map onto the defined target manifold. This is achieved by optimizing the idempotence term, $f(f(z))=f(z)$, which encourages the range of $f(z)$ to be on the target manifold. Under ideal assumptions such a process provably converges to the target distribution. This strategy results in a model capable of generating an output in one step, maintaining a consistent latent space, while also allowing sequential applications for refinement. Additionally, we find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back to the target manifold. This work is a first step towards a "global projector" that enables projecting any input into a target data distribution.
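Below is a minimal, hedged sketch of the two training objectives stated in the abstract (self-mapping on real data and idempotence on generated samples), written in PyTorch. The L1 loss, the equal loss weighting, and the detach on the inner output are illustrative assumptions; the paper's full recipe is not reproduced here.

```python
import torch
import torch.nn.functional as F

def ign_losses(f, x_real, z_noise):
    """Minimal sketch of the two objectives described in the abstract.

    f       -- the generative network, mapping images to images
    x_real  -- a batch drawn from the target distribution (real images)
    z_noise -- a batch drawn from the source distribution (Gaussian noise)
    """
    # (1) Real instances should map to themselves: f(x) = x.
    rec_loss = F.l1_loss(f(x_real), x_real)

    # (2) Idempotence on generated samples: f(f(z)) = f(z).
    fz = f(z_noise)
    idem_loss = F.l1_loss(f(fz), fz.detach())  # detaching the target is an assumption

    return rec_loss + idem_loss
```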

Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization

  • paper_url: http://arxiv.org/abs/2311.01459
  • repo_url: https://github.com/jameelhassan/PromptAlign
  • paper_authors: Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan
  • For: Addresses a key difficulty in prompt learning for vision-language models: poor generalization to unseen domains caused by distribution shift.
  • Methods: A test-time prompt tuning method that aligns the feature statistics of the out-of-distribution test sample with those of the source data, using a single test sample to adapt multi-modal prompts at test time.
  • Results: Improves zero-shot top-1 accuracy of vision-language models, with a 3.08% improvement over the baseline MaPLe, and improves consistently across all 10 datasets in cross-dataset generalization with unseen categories.
    Abstract The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption using prompt learning for numerous downstream tasks. Previous works have shown test-time prompt tuning using entropy minimization to adapt text prompts for unseen domains. While effective, this overlooks the key cause for performance degradation to unseen domains -- distribution shift. In this work, we explicitly handle this problem by aligning the out-of-distribution (OOD) test sample statistics to those of the source data using prompt tuning. We use a single test sample to adapt multi-modal prompts at test time by minimizing the feature distribution shift to bridge the gap in the test domain. Evaluating against the domain generalization benchmark, our method improves zero-shot top-1 accuracy beyond existing prompt-learning techniques, with a 3.08% improvement over the baseline MaPLe. In cross-dataset generalization with unseen categories across 10 datasets, our method improves consistently across all datasets compared to the existing state-of-the-art. Our source code and models are available at https://jameelhassan.github.io/promptalign.
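As a rough illustration of the alignment idea in the abstract -- matching the feature statistics of a single test sample to pre-computed source statistics -- the following sketch shows one plausible alignment loss. The use of mean/variance matching with an L1 distance is an assumption, not the paper's exact objective.

```python
import torch

def distribution_alignment_loss(test_feats, src_mean, src_var):
    """Hedged sketch of test-time distribution alignment: match the statistics of
    features from (augmented views of) a single test sample to pre-computed
    source-data statistics. Matching only means and variances with an L1
    distance is a simplifying assumption.

    test_feats -- (num_views, feat_dim) features of the augmented test sample
    src_mean   -- (feat_dim,) mean of source-data features
    src_var    -- (feat_dim,) variance of source-data features
    """
    mu = test_feats.mean(dim=0)
    var = test_feats.var(dim=0, unbiased=False)
    return (mu - src_mean).abs().mean() + (var - src_var).abs().mean()
```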

Detecting Deepfakes Without Seeing Any

  • paper_url: http://arxiv.org/abs/2311.01458
  • repo_url: https://github.com/talreiss/factor
  • paper_authors: Tal Reiss, Bar Cavia, Yedid Hoshen
  • for: Defending against deepfake attacks, the malicious manipulation of media containing people, which poses a serious threat to society.
  • methods: Introduces the concept of "fact checking", adapted from fake news detection: the facts claimed about the fake media (e.g., the person's identity) are verified against the observed media in order to detect deepfakes, including zero-day attack types.
  • results: Proposes FACTOR, a practical recipe for deepfake fact checking that performs strongly in critical attack settings, including face swapping and audio-visual synthesis; it is training-free, relies exclusively on off-the-shelf features, is easy to implement, and never sees any deepfakes, yet achieves better than state-of-the-art accuracy.
    Abstract Deepfake attacks, malicious manipulation of media containing people, are a serious concern for society. Conventional deepfake detection methods train supervised classifiers to distinguish real media from previously encountered deepfakes. Such techniques can only detect deepfakes similar to those previously seen, but not zero-day (previously unseen) attack types. As current deepfake generation techniques are changing at a breathtaking pace, new attack types are proposed frequently, making this a major issue. Our main observations are that: i) in many effective deepfake attacks, the fake media must be accompanied by false facts i.e. claims about the identity, speech, motion, or appearance of the person. For instance, when impersonating Obama, the attacker explicitly or implicitly claims that the fake media show Obama; ii) current generative techniques cannot perfectly synthesize the false facts claimed by the attacker. We therefore introduce the concept of "fact checking", adapted from fake news detection, for detecting zero-day deepfake attacks. Fact checking verifies that the claimed facts (e.g. identity is Obama), agree with the observed media (e.g. is the face really Obama's?), and thus can differentiate between real and fake media. Consequently, we introduce FACTOR, a practical recipe for deepfake fact checking and demonstrate its power in critical attack settings: face swapping and audio-visual synthesis. Although it is training-free, relies exclusively on off-the-shelf features, is very easy to implement, and does not see any deepfakes, it achieves better than state-of-the-art accuracy.
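A hedged sketch of the fact-checking idea for the identity claim: score how well the observed face agrees with reference images of the claimed identity using an off-the-shelf face encoder. The encoder, the cosine-similarity scoring, and any decision threshold are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def identity_truth_score(face_encoder, claimed_reference_imgs, observed_face_imgs):
    """Compare an off-the-shelf face-recognition embedding of the observed
    (possibly fake) face against embeddings of reference images of the claimed
    identity. A low similarity suggests the claimed fact ("this is person X")
    does not hold, flagging a likely deepfake.
    """
    with torch.no_grad():
        ref = F.normalize(face_encoder(claimed_reference_imgs), dim=-1).mean(dim=0)
        obs = F.normalize(face_encoder(observed_face_imgs), dim=-1)
    return obs @ ref  # one truth score per observed frame; low => likely fake
```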

UltraLiDAR: Learning Compact Representations for LiDAR Completion and Generation

  • paper_url: http://arxiv.org/abs/2311.01448
  • repo_url: None
  • paper_authors: Yuwen Xiong, Wei-Chiu Ma, Jingkang Wang, Raquel Urtasun
  • for: Improving the density and coverage of LiDAR point clouds to boost the performance of self-driving perception systems.
  • methods: UltraLiDAR, a data-driven framework built around a compact, discrete representation of point-cloud geometry, enabling scene-level LiDAR completion, generation, and manipulation.
  • results: Trained on real-world point clouds, the method densifies sparse point clouds as if they were captured by a high-density LiDAR, and generates more realistic and plausible LiDAR point clouds than prior methods.
    Abstract LiDAR provides accurate geometric measurements of the 3D world. Unfortunately, dense LiDARs are very expensive and the point clouds captured by low-beam LiDAR are often sparse. To address these issues, we present UltraLiDAR, a data-driven framework for scene-level LiDAR completion, LiDAR generation, and LiDAR manipulation. The crux of UltraLiDAR is a compact, discrete representation that encodes the point cloud's geometric structure, is robust to noise, and is easy to manipulate. We show that by aligning the representation of a sparse point cloud to that of a dense point cloud, we can densify the sparse point clouds as if they were captured by a real high-density LiDAR, drastically reducing the cost. Furthermore, by learning a prior over the discrete codebook, we can generate diverse, realistic LiDAR point clouds for self-driving. We evaluate the effectiveness of UltraLiDAR on sparse-to-dense LiDAR completion and LiDAR generation. Experiments show that densifying real-world point clouds with our approach can significantly improve the performance of downstream perception systems. Compared to prior art on LiDAR generation, our approach generates much more realistic point clouds. According to A/B test, over 98.5% of the time human participants prefer our results over those of previous methods.

CADSim: Robust and Scalable in-the-wild 3D Reconstruction for Controllable Sensor Simulation

  • paper_url: http://arxiv.org/abs/2311.01447
  • repo_url: None
  • paper_authors: Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, Raquel Urtasun
  • For: The development of realistic sensor simulation for self-driving vehicles, focusing on automatically reconstructing vehicle geometry and appearance from in-the-wild sensor data.
  • Methods: A new method called CADSim, which combines part-aware object-class priors from a small set of CAD models with differentiable rendering to automatically reconstruct vehicle geometry, including articulated wheels, with high-quality appearance.
  • Results: CADSim recovers more accurate shapes from sparse data than existing approaches and trains and renders efficiently; the reconstructed vehicles are demonstrated in several applications, including accurate testing of autonomy perception systems.
    Abstract Realistic simulation is key to enabling safe and scalable development of self-driving vehicles. A core component is simulating the sensors so that the entire autonomy system can be tested in simulation. Sensor simulation involves modeling traffic participants, such as vehicles, with high quality appearance and articulated geometry, and rendering them in real time. The self-driving industry has typically employed artists to build these assets. However, this is expensive, slow, and may not reflect reality. Instead, reconstructing assets automatically from sensor data collected in the wild would provide a better path to generating a diverse and large set with good real-world coverage. Nevertheless, current reconstruction approaches struggle on in-the-wild sensor data, due to its sparsity and noise. To tackle these issues, we present CADSim, which combines part-aware object-class priors via a small set of CAD models with differentiable rendering to automatically reconstruct vehicle geometry, including articulated wheels, with high-quality appearance. Our experiments show our method recovers more accurate shapes from sparse data compared to existing approaches. Importantly, it also trains and renders efficiently. We demonstrate our reconstructed vehicles in several applications, including accurate testing of autonomy perception systems.

Adv3D: Generating Safety-Critical 3D Objects through Closed-Loop Simulation

  • paper_url: http://arxiv.org/abs/2311.01446
  • repo_url: None
  • paper_authors: Jay Sarva, Jingkang Wang, James Tu, Yuwen Xiong, Sivabalan Manivasagam, Raquel Urtasun
  • for: Rigorously testing self-driving vehicles (SDVs) across a wide range of scenarios to ensure safe deployment.
  • methods: Proposes Adv3D, a framework that takes real-world scenarios and performs closed-loop sensor simulation to evaluate the full autonomy system, optimizing a low-dimensional vehicle shape representation to make scenarios more challenging.
  • results: The framework finds realistic vehicle shape variations that degrade autonomy performance and cause uncomfortable SDV maneuvers, and the variations found in closed-loop are much more effective than those found in open-loop.
    Abstract Self-driving vehicles (SDVs) must be rigorously tested on a wide range of scenarios to ensure safe deployment. The industry typically relies on closed-loop simulation to evaluate how the SDV interacts on a corpus of synthetic and real scenarios and verify it performs properly. However, they primarily only test the system's motion planning module, and only consider behavior variations. It is key to evaluate the full autonomy system in closed-loop, and to understand how variations in sensor data based on scene appearance, such as the shape of actors, affect system performance. In this paper, we propose a framework, Adv3D, that takes real world scenarios and performs closed-loop sensor simulation to evaluate autonomy performance, and finds vehicle shapes that make the scenario more challenging, resulting in autonomy failures and uncomfortable SDV maneuvers. Unlike prior works that add contrived adversarial shapes to vehicle roof-tops or roadside to harm perception only, we optimize a low-dimensional shape representation to modify the vehicle shape itself in a realistic manner to degrade autonomy performance (e.g., perception, prediction, and motion planning). Moreover, we find that the shape variations found with Adv3D optimized in closed-loop are much more effective than those in open-loop, demonstrating the importance of finding scene appearance variations that affect autonomy in the interactive setting.

LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds

  • paper_url: http://arxiv.org/abs/2311.01444
  • repo_url: None
  • paper_authors: Anqi Joyce Yang, Sergio Casas, Nikita Dvornik, Sean Segal, Yuwen Xiong, Jordan Sir Kwang Hu, Carter Fang, Raquel Urtasun
  • for: Proposes a simple, efficient, and effective trajectory-level refinement approach for auto-labelling, reducing the human annotation required to train self-driving perception systems.
  • methods: Auto-labels are commonly produced in two stages: objects are first detected and tracked over time, then each object trajectory is passed to a learned refinement model. LabelFormer first encodes each frame's observations separately, then exploits self-attention to reason about the trajectory with full temporal context, and finally decodes the refined object size and per-frame poses.
  • results: Evaluation on both urban and highway datasets shows that LabelFormer outperforms existing works by a large margin, and training on a dataset augmented with its auto-labels improves downstream detection performance. See https://waabi.ai/labelformer for details.
    Abstract A major bottleneck to scaling-up training of self-driving perception systems are the human annotations required for supervision. A promising alternative is to leverage "auto-labelling" offboard perception models that are trained to automatically generate annotations from raw LiDAR point clouds at a fraction of the cost. Auto-labels are most commonly generated via a two-stage approach -- first objects are detected and tracked over time, and then each object trajectory is passed to a learned refinement model to improve accuracy. Since existing refinement models are overly complex and lack advanced temporal reasoning capabilities, in this work we propose LabelFormer, a simple, efficient, and effective trajectory-level refinement approach. Our approach first encodes each frame's observations separately, then exploits self-attention to reason about the trajectory with full temporal context, and finally decodes the refined object size and per-frame poses. Evaluation on both urban and highway datasets demonstrates that LabelFormer outperforms existing works by a large margin. Finally, we show that training on a dataset augmented with auto-labels generated by our method leads to improved downstream detection performance compared to existing methods. Please visit the project website for details https://waabi.ai/labelformer
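The following is a schematic PyTorch sketch of a LabelFormer-style refinement module as described in the abstract: per-frame encoding, self-attention over the full trajectory, and decoding of per-frame poses plus a single object size. All dimensions, the 7-D per-frame observation, and the output heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryRefiner(nn.Module):
    """Encode each frame's observation separately, reason over the whole
    trajectory with self-attention, then decode per-frame pose corrections
    and one refined object size per trajectory."""

    def __init__(self, obs_dim=7, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.frame_encoder = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)
        self.pose_head = nn.Linear(d_model, 4)   # per-frame (x, y, z, yaw) refinement
        self.size_head = nn.Linear(d_model, 3)   # single (l, w, h) per trajectory

    def forward(self, traj_obs):                   # traj_obs: (B, T, obs_dim)
        tokens = self.frame_encoder(traj_obs)      # per-frame encoding
        tokens = self.temporal(tokens)             # full temporal context via self-attention
        poses = self.pose_head(tokens)             # (B, T, 4)
        size = self.size_head(tokens.mean(dim=1))  # pooled trajectory feature -> (B, 3)
        return size, poses
```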

Transformation Decoupling Strategy based on Screw Theory for Deterministic Point Cloud Registration with Gravity Prior

  • paper_url: http://arxiv.org/abs/2311.01432
  • repo_url: None
  • paper_authors: Xinyi Li, Zijian Ma, Yinlong Liu, Walter Zimmer, Hu Cao, Feihu Zhang, Alois Knoll
  • for: Solving correspondence-based point cloud registration under heavy outlier contamination, with a gravity prior that often arises in practice: gravity directions obtained from inertial measurement units (IMUs) reduce the rotational degrees of freedom from 3 to 1.
  • methods: A transformation decoupling strategy based on screw theory that decomposes the original 4-DOF problem into three sub-problems with 1 DOF, 2 DOF, and 1 DOF, improving computational efficiency. The first 1-DOF sub-problem, the translation along the rotation axis, is solved with an interval stabbing-based method; the 2-DOF sub-problem, the pole (an auxiliary variable in screw theory), is solved with branch-and-bound; and the last 1-DOF rotation angle is estimated with a global voting method.
  • results: The method performs registration efficiently and deterministically, even at outlier rates exceeding 99%; extensive experiments show it is more efficient and robust than state-of-the-art methods.
    Abstract Point cloud registration is challenging in the presence of heavy outlier correspondences. This paper focuses on addressing the robust correspondence-based registration problem with gravity prior that often arises in practice. The gravity directions are typically obtained by inertial measurement units (IMUs) and can reduce the degree of freedom (DOF) of rotation from 3 to 1. We propose a novel transformation decoupling strategy by leveraging screw theory. This strategy decomposes the original 4-DOF problem into three sub-problems with 1-DOF, 2-DOF, and 1-DOF, respectively, thereby enhancing the computation efficiency. Specifically, the first 1-DOF represents the translation along the rotation axis and we propose an interval stabbing-based method to solve it. The second 2-DOF represents the pole which is an auxiliary variable in screw theory and we utilize a branch-and-bound method to solve it. The last 1-DOF represents the rotation angle and we propose a global voting method for its estimation. The proposed method sequentially solves three consensus maximization sub-problems, leading to efficient and deterministic registration. In particular, it can even handle the correspondence-free registration problem due to its significant robustness. Extensive experiments on both synthetic and real-world datasets demonstrate that our method is more efficient and robust than state-of-the-art methods, even when dealing with outlier rates exceeding 99%.
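For readers unfamiliar with the interval stabbing primitive mentioned above, the sketch below shows the generic idea: given one interval per correspondence, find the value covered by the most intervals (the consensus-maximizing 1-DOF estimate). How the intervals are derived from the correspondences is paper-specific and assumed to be done beforehand.

```python
def interval_stabbing(intervals):
    """Return a value covered by the maximum number of (lo, hi) intervals,
    together with that count, via a sorted endpoint sweep."""
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))   # interval opens
        events.append((hi, -1))   # interval closes
    # Process openings before closings at identical coordinates.
    events.sort(key=lambda e: (e[0], -e[1]))

    best_count, best_value, count = 0, None, 0
    for value, delta in events:
        count += delta
        if count > best_count:
            best_count, best_value = count, value
    return best_value, best_count

# Example: two consistent 1-DOF hypotheses and one outlier.
print(interval_stabbing([(0.9, 1.1), (0.95, 1.05), (3.0, 3.2)]))  # -> (0.95, 2)
```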

Efficient Vision Transformer for Accurate Traffic Sign Detection

  • paper_url: http://arxiv.org/abs/2311.01429
  • repo_url: None
  • paper_authors: Javad Mirzapour Kaleybar, Hooman Khaloo, Avaz Naghipour
  • for: Traffic sign detection for self-driving vehicles and driver assistance systems, where reliable and highly accurate algorithms are needed for widespread adoption in diverse real-life scenarios.
  • methods: Applies Transformer models, in particular Vision Transformer variants, to traffic sign detection; the Transformer's attention mechanism, originally designed for natural language processing, offers improved parallel efficiency, and the proposed strategy further integrates a locality inductive bias through an Efficient Convolution Block and a Local Transformer Block.
  • results: Experimental evaluations show significant gains in both detection speed and accuracy, particularly on the GTSDB dataset.
    Abstract This research paper addresses the challenges associated with traffic sign detection in self-driving vehicles and driver assistance systems. The development of reliable and highly accurate algorithms is crucial for the widespread adoption of traffic sign recognition and detection (TSRD) in diverse real-life scenarios. However, this task is complicated by suboptimal traffic images affected by factors such as camera movement, adverse weather conditions, and inadequate lighting. This study specifically focuses on traffic sign detection methods and introduces the application of the Transformer model, particularly the Vision Transformer variants, to tackle this task. The Transformer's attention mechanism, originally designed for natural language processing, offers improved parallel efficiency. Vision Transformers have demonstrated success in various domains, including autonomous driving, object detection, healthcare, and defense-related applications. To enhance the efficiency of the Transformer model, the research proposes a novel strategy that integrates a locality inductive bias and a transformer module. This includes the introduction of the Efficient Convolution Block and the Local Transformer Block, which effectively capture short-term and long-term dependency information, thereby improving both detection speed and accuracy. Experimental evaluations demonstrate the significant advancements achieved by this approach, particularly when applied to the GTSDB dataset.

Exploring Deep Learning Techniques for Glaucoma Detection: A Comprehensive Review

  • paper_url: http://arxiv.org/abs/2311.01425
  • repo_url: None
  • paper_authors: Aized Amin Soofi, Fazal-e-Amin
  • for: To provide a comprehensive overview of deep learning methods used for the detection and diagnosis of glaucoma.
  • methods: Reviews deep learning techniques for segmentation, classification, and detection that aim to improve the accuracy and efficiency of glaucoma detection from retinal fundus images.
  • results: The analyzed literature indicates that deep learning methods can significantly improve the efficacy, accuracy, and reliability of glaucoma detection, while limitations and challenges remain that call for further research.
    Abstract Glaucoma is one of the primary causes of vision loss around the world, necessitating accurate and efficient detection methods. Traditional manual detection approaches have limitations in terms of cost, time, and subjectivity. Recent developments in deep learning approaches demonstrate potential in automating glaucoma detection by detecting relevant features from retinal fundus images. This article provides a comprehensive overview of cutting-edge deep learning methods used for the segmentation, classification, and detection of glaucoma. By analyzing recent studies, the effectiveness and limitations of these techniques are evaluated, key findings are highlighted, and potential areas for further research are identified. The use of deep learning algorithms may significantly improve the efficacy, usefulness, and accuracy of glaucoma detection. The findings from this research contribute to the ongoing advancements in automated glaucoma detection and have implications for improving patient outcomes and reducing the global burden of glaucoma.

CenterRadarNet: Joint 3D Object Detection and Tracking Framework using 4D FMCW Radar

  • paper_url: http://arxiv.org/abs/2311.01423
  • repo_url: None
  • paper_authors: Jen-Hao Cheng, Sheng-Yao Kuan, Hugo Latapie, Gaowen Liu, Jenq-Neng Hwang
  • for: Improving the safety of autonomous and assisted driving by making radar-based perception more stable and reliable.
  • methods: Proposes CenterRadarNet, an efficient joint architecture that learns high-resolution representations from 4D (Doppler-range-azimuth-elevation) radar data for 3D object detection and re-identification (re-ID), and builds an online tracker on the learned appearance embeddings.
  • results: Achieves state-of-the-art results on the K-Radar 3D object detection benchmark, presents the first 3D object tracking results using radar on the K-Radar dataset V2, and shows consistent, robust performance across diverse driving scenarios, underlining its broad applicability.
    Abstract Robust perception is a vital component for ensuring safe autonomous and assisted driving. Automotive radar (77 to 81 GHz), which offers weather-resilient sensing, provides a complementary capability to the vision- or LiDAR-based autonomous driving systems. Raw radio-frequency (RF) radar tensors contain rich spatiotemporal semantics besides 3D location information. The majority of previous methods take in 3D (Doppler-range-azimuth) RF radar tensors, allowing prediction of an object's location, heading angle, and size in bird's-eye-view (BEV). However, they lack the ability to at the same time infer objects' size, orientation, and identity in the 3D space. To overcome this limitation, we propose an efficient joint architecture called CenterRadarNet, designed to facilitate high-resolution representation learning from 4D (Doppler-range-azimuth-elevation) radar data for 3D object detection and re-identification (re-ID) tasks. As a single-stage 3D object detector, CenterRadarNet directly infers the BEV object distribution confidence maps, corresponding 3D bounding box attributes, and appearance embedding for each pixel. Moreover, we build an online tracker utilizing the learned appearance embedding for re-ID. CenterRadarNet achieves the state-of-the-art result on the K-Radar 3D object detection benchmark. In addition, we present the first 3D object-tracking result using radar on the K-Radar dataset V2. In diverse driving scenarios, CenterRadarNet shows consistent, robust performance, emphasizing its wide applicability.

The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing

  • paper_url: http://arxiv.org/abs/2311.01410
  • repo_url: None
  • paper_authors: Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, Chongxuan Li
  • for: Proposes a unified probabilistic formulation for diffusion-based image editing, in which a latent variable is edited in a task-specific manner and generally deviates from the marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE).
  • methods: Defines corresponding editing SDEs and ODEs, provides SDE counterparts for widely used ODE baselines, and proposes SDE-Drag, a simple yet effective SDE-based method for point-based content dragging.
  • results: SDE shows clear advantages and versatility across tasks such as inpainting and image-to-image translation, and in a user study on the proposed DragBench benchmark, SDE-Drag significantly outperforms the ODE baseline, existing diffusion-based methods, and DragGAN.
    Abstract We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.

Learning to See Physical Properties with Active Sensing Motor Policies

  • paper_url: http://arxiv.org/abs/2311.01405
  • repo_url: None
  • paper_authors: Gabriel B. Margolis, Xiang Fu, Yandong Ji, Pulkit Agrawal
  • for: Helping robots plan locomotion more effectively by inferring the physical properties of terrain from color images.
  • methods: Self-supervised labeling of images captured during real-world traversal, using physical-parameter estimators trained in simulation; in addition, Active Sensing Motor Policies (ASMP) are trained to explore locomotion behaviors that increase the accuracy of physical parameter estimation.
  • results: The method accurately predicts physical parameters and works across different cameras and robots, even with overhead images captured by a drone, despite being trained on data from cameras mounted on a quadruped walking on the ground.
    Abstract Knowledge of terrain's physical properties inferred from color images can aid in making efficient robotic locomotion plans. However, unlike image classification, it is unintuitive for humans to label image patches with physical properties. Without labeled data, building a vision system that takes as input the observed terrain and predicts physical properties remains challenging. We present a method that overcomes this challenge by self-supervised labeling of images captured by robots during real-world traversal with physical property estimators trained in simulation. To ensure accurate labeling, we introduce Active Sensing Motor Policies (ASMP), which are trained to explore locomotion behaviors that increase the accuracy of estimating physical parameters. For instance, the quadruped robot learns to swipe its foot against the ground to estimate the friction coefficient accurately. We show that the visual system trained with a small amount of real-world traversal data accurately predicts physical parameters. The trained system is robust and works even with overhead images captured by a drone despite being trained on data collected by cameras attached to a quadruped robot walking on the ground.

Learning Realistic Traffic Agents in Closed-loop

  • paper_url: http://arxiv.org/abs/2311.01394
  • repo_url: None
  • paper_authors: Chris Zhang, James Tu, Lunjun Zhang, Kelvin Wong, Simon Suo, Raquel Urtasun
  • for: Developing self-driving software safely and scalably by learning realistic traffic agents for closed-loop simulation prior to real-world deployment.
  • methods: Combines imitation learning (IL) and reinforcement learning (RL) through Reinforcing Traffic Rules (RTR), a holistic closed-loop learning objective that matches expert demonstrations under a traffic compliance constraint, trained in closed-loop simulations of both nominal real-world scenarios and procedurally generated long-tail scenarios.
  • results: Experiments show that RTR learns more realistic and generalizable traffic simulation policies, achieving significantly better trade-offs between human-like driving and traffic compliance in both nominal and long-tail scenarios; used as a data generation tool for training prediction models, it also improves downstream prediction metrics.
    Abstract Realistic traffic simulation is crucial for developing self-driving software in a safe and scalable manner prior to real-world deployment. Typically, imitation learning (IL) is used to learn human-like traffic agents directly from real-world observations collected offline, but without explicit specification of traffic rules, agents trained from IL alone frequently display unrealistic infractions like collisions and driving off the road. This problem is exacerbated in out-of-distribution and long-tail scenarios. On the other hand, reinforcement learning (RL) can train traffic agents to avoid infractions, but using RL alone results in unhuman-like driving behaviors. We propose Reinforcing Traffic Rules (RTR), a holistic closed-loop learning objective to match expert demonstrations under a traffic compliance constraint, which naturally gives rise to a joint IL + RL approach, obtaining the best of both worlds. Our method learns in closed-loop simulations of both nominal scenarios from real-world datasets as well as procedurally generated long-tail scenarios. Our experiments show that RTR learns more realistic and generalizable traffic simulation policies, achieving significantly better tradeoffs between human-like driving and traffic compliance in both nominal and long-tail scenarios. Moreover, when used as a data generation tool for training prediction models, our learned traffic policy leads to considerably improved downstream prediction metrics compared to baseline traffic agents. For more information, visit the project website: https://waabi.ai/rtr
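A conceptual sketch of a joint IL + RL objective in the spirit of RTR is given below: imitate expert demonstrations while penalizing traffic infractions observed in closed-loop rollouts. The L2 imitation term, the REINFORCE-style surrogate, and the weighting `lam` are assumptions; the abstract does not specify the exact losses.

```python
import torch

def rtr_style_objective(policy_actions, expert_actions, log_probs, infraction_cost, lam=1.0):
    """Match expert demonstrations while pushing down the probability of
    closed-loop action sequences that incur traffic infractions.

    policy_actions  -- actions produced by the learned traffic policy
    expert_actions  -- corresponding actions from logged human demonstrations
    log_probs       -- log-probabilities of the rolled-out closed-loop actions
    infraction_cost -- per-step cost (e.g., collision / off-road indicators)
    """
    imitation_loss = ((policy_actions - expert_actions) ** 2).mean()
    # Policy-gradient surrogate for minimizing expected infraction cost.
    rl_loss = (log_probs * infraction_cost.detach()).mean()
    return imitation_loss + lam * rl_loss
```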

Sim2Real Bilevel Adaptation for Object Surface Classification using Vision-Based Tactile Sensors

  • paper_url: http://arxiv.org/abs/2311.01380
  • repo_url: https://github.com/hsp-iit/sim2real-surface-classification
  • paper_authors: Gabriele M. Caddeo, Andrea Maracani, Paolo D. Alfano, Nicola A. Piga, Lorenzo Rosasco, Lorenzo Natale
  • for: bridging the Sim2Real gap in vision-based tactile sensors for classifying object surfaces
  • methods: training a Diffusion Model using a small dataset of real-world images and aligning features of the two domains using an adversarial procedure
  • results: a total accuracy of 81.9%, a significant improvement compared to the 34.7% achieved by the classifier trained solely on simulated images
    Abstract In this paper, we address the Sim2Real gap in the field of vision-based tactile sensors for classifying object surfaces. We train a Diffusion Model to bridge this gap using a relatively small dataset of real-world images randomly collected from unlabeled everyday objects via the DIGIT sensor. Subsequently, we employ a simulator to generate images by uniformly sampling the surface of objects from the YCB Model Set. These simulated images are then translated into the real domain using the Diffusion Model and automatically labeled to train a classifier. During this training, we further align features of the two domains using an adversarial procedure. Our evaluation is conducted on a dataset of tactile images obtained from a set of ten 3D printed YCB objects. The results reveal a total accuracy of 81.9%, a significant improvement compared to the 34.7% achieved by the classifier trained solely on simulated images. This demonstrates the effectiveness of our approach. We further validate our approach using the classifier on a 6D object pose estimation task from tactile data.
    摘要 在这篇论文中,我们处理视觉基于感觉器的Surface classification问题中的Sim2Real gap。我们使用一个扩散模型来跨越这个差距,使用一个相对较小的实际世界图像集来训练。然后,我们使用一个模拟器生成图像,通过对物体表面进行均匀采样,从YCB模型集中获取的图像。这些模拟图像然后通过扩散模型进行翻译,并自动将其标注为训练一个分类器。在这个训练过程中,我们还使用一种对抗性方法来对两个领域的特征进行对齐。我们的评估是基于一组从3D打印YCB对象中获取的感觉图像集。结果表明,我们的方法可以达到81.9%的总准确率,与 solely 在模拟图像上训练的分类器(34.7%)相比,这表明我们的方法的效iveness。我们进一步验证了我们的方法,使用感觉数据进行6D对象姿态估计任务。

Robust Identity Perceptual Watermark Against Deepfake Face Swapping

  • paper_url: http://arxiv.org/abs/2311.01357
  • repo_url: None
  • paper_authors: Tianyi Wang, Mengxiao Huang, Harry Cheng, Bin Ma, Yinglong Wang
  • for: Addressing the privacy threats posed by Deepfake face swapping.
  • methods: Proactively embeds invisible, identity-aware watermark signals into face images for both detection and source tracing, using an unpredictable and unreversible chaotic encryption system for watermark confidentiality and an encoder-decoder framework trained jointly with adversarial image manipulations.
  • results: Achieves detection and source tracing against Deepfake face swapping, with state-of-the-art performance under both cross-dataset and cross-manipulation settings.
    Abstract Notwithstanding offering convenience and entertainment to society, Deepfake face swapping has caused critical privacy issues with the rapid development of deep generative models. Due to imperceptible artifacts in high-quality synthetic images, passive detection models against face swapping in recent years usually suffer performance damping regarding the generalizability issue. Therefore, several studies have been attempted to proactively protect the original images against malicious manipulations by inserting invisible signals in advance. However, the existing proactive defense approaches demonstrate unsatisfactory results with respect to visual quality, detection accuracy, and source tracing ability. In this study, we propose the first robust identity perceptual watermarking framework that concurrently performs detection and source tracing against Deepfake face swapping proactively. We assign identity semantics regarding the image contents to the watermarks and devise an unpredictable and unreversible chaotic encryption system to ensure watermark confidentiality. The watermarks are encoded and recovered by jointly training an encoder-decoder framework along with adversarial image manipulations. Extensive experiments demonstrate state-of-the-art performance against Deepfake face swapping under both cross-dataset and cross-manipulation settings.

Deep learning based Image Compression for Microscopy Images: An Empirical Study

  • paper_url: http://arxiv.org/abs/2311.01352
  • repo_url: None
  • paper_authors: Yu Zhou, Jan Sollman, Jianxu Chen
  • for: To analyze classical and deep learning based image compression methods and their impact on deep learning based image processing models.
  • methods: Compares multiple classical lossy image compression techniques against several AI-based compression models provided by and trained with the CompressAI toolbox, evaluating compression ratio, several image similarity measures, and the prediction accuracy of downstream label-free prediction models on compressed images.
  • results: AI-based compression techniques largely outperform the classical ones and have minimal impact on the downstream label-free prediction task in 2D cases.
    Abstract With the fast development of modern microscopes and bioimaging techniques, an unprecedentedly large amount of imaging data are being generated, stored, analyzed, and even shared through networks. The size of the data poses great challenges for current data infrastructure. One common way to reduce the data size is by image compression. This present study analyzes classic and deep learning based image compression methods, and their impact on deep learning based image processing models. Deep learning based label-free prediction models (i.e., predicting fluorescent images from bright field images) are used as an example application for comparison and analysis. Effective image compression methods could help reduce the data size significantly without losing necessary information, and therefore reduce the burden on data management infrastructure and permit fast transmission through the network for data sharing or cloud computing. To compress images in such a wanted way, multiple classical lossy image compression techniques are compared to several AI-based compression models provided by and trained with the CompressAI toolbox using python. These different compression techniques are compared in compression ratio, multiple image similarity measures and, most importantly, the prediction accuracy from label-free models on compressed images. We found that AI-based compression techniques largely outperform the classic ones and will minimally affect the downstream label-free task in 2D cases. In the end, we hope the present study could shed light on the potential of deep learning based image compression and the impact of image compression on downstream deep learning based image analysis models.
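To make the kind of evaluation described above concrete, here is a small sketch that compresses an image, records the compressed size, and compares the decoded result to the original with PSNR. It uses classical JPEG via Pillow as a stand-in; the study additionally evaluates learned codecs from the CompressAI toolbox, which are not reproduced here.

```python
import io
import numpy as np
from PIL import Image

def jpeg_rate_and_psnr(img: Image.Image, quality: int):
    """Compress `img` as JPEG at the given quality, then return the compressed
    size in bytes and the PSNR of the decoded image against the original."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    n_bytes = buf.tell()

    buf.seek(0)
    decoded = np.asarray(Image.open(buf).convert("RGB"), dtype=np.float64)
    original = np.asarray(img.convert("RGB"), dtype=np.float64)
    mse = np.mean((decoded - original) ** 2)
    psnr = float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)
    return n_bytes, psnr
```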

Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly

  • paper_url: http://arxiv.org/abs/2311.01323
  • repo_url: None
  • paper_authors: Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen
  • for: To provide a standardized benchmark for comparing transfer-based attacks on black-box DNN models systematically, fairly, and practically.
  • methods: Establishes TA-Bench, a transfer-based attack benchmark implementing more than 30 methods, covering different attack modes and evaluation protocols.
  • results: A comprehensive evaluation on 25 popular substitute/victim models on ImageNet yields new insights into the effectiveness of these methods and provides guidelines for future evaluations.
    Abstract The adversarial vulnerability of deep neural networks (DNNs) has drawn great attention due to the security risk of applying these models in real-world applications. Based on transferability of adversarial examples, an increasing number of transfer-based methods have been developed to fool black-box DNN models whose architecture and parameters are inaccessible. Although tremendous effort has been exerted, there still lacks a standardized benchmark that could be taken advantage of to compare these methods systematically, fairly, and practically. Our investigation shows that the evaluation of some methods needs to be more reasonable and more thorough to verify their effectiveness, to avoid, for example, unfair comparison and insufficient consideration of possible substitute/victim models. Therefore, we establish a transfer-based attack benchmark (TA-Bench) which implements 30+ methods. In this paper, we evaluate and compare them comprehensively on 25 popular substitute/victim models on ImageNet. New insights about the effectiveness of these methods are gained and guidelines for future evaluations are provided. Code at: https://github.com/qizhangli/TA-Bench.
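As a minimal illustration of what a transfer-based attack evaluation involves, the sketch below crafts single-step FGSM adversarial examples on a white-box substitute model and measures how often they fool a separate victim model. TA-Bench covers 30+ (mostly iterative and stronger) methods; FGSM is used here only for brevity, and the [0, 1] image range is an assumption.

```python
import torch
import torch.nn.functional as F

def fgsm_transfer(substitute, victim, x, y, eps=8 / 255):
    """Craft adversarial examples on the substitute model with FGSM, then
    report the fooling rate on the black-box victim model."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(substitute(x), y)
    loss.backward()
    x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()

    with torch.no_grad():
        fooled = (victim(x_adv).argmax(dim=1) != y).float().mean()
    return x_adv, fooled.item()  # transfer success rate on this batch
```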

Hybrid-Fusion Transformer for Multisequence MRI

  • paper_url: http://arxiv.org/abs/2311.01308
  • repo_url: None
  • paper_authors: Jihoon Cho, Jinah Park
  • for: Improving the accuracy of multi-sequence (multimodal) MRI image segmentation.
  • methods: Proposes a hybrid fusion Transformer (HFTrans) that exploits the differences among multimodal MRI sequences, using Transformer layers to integrate the features extracted from each modality as well as the features of the early-fused modalities.
  • results: Experiments on two public datasets, BraTS2020 and MRBrainS18, show that the proposed hybrid-fusion method outperforms previous state-of-the-art methods on 3D brain tumor segmentation and brain structure segmentation.
    Abstract Medical segmentation has grown exponentially through the advent of a fully convolutional network (FCN), and we have now reached a turning point through the success of Transformer. However, the different characteristics of the modality have not been fully integrated into Transformer for medical segmentation. In this work, we propose the novel hybrid fusion Transformer (HFTrans) for multisequence MRI image segmentation. We take advantage of the differences among multimodal MRI sequences and utilize the Transformer layers to integrate the features extracted from each modality as well as the features of the early fused modalities. We validate the effectiveness of our hybrid-fusion method in three-dimensional (3D) medical segmentation. Experiments on two public datasets, BraTS2020 and MRBrainS18, show that the proposed method outperforms previous state-of-the-art methods on the task of brain tumor segmentation and brain structure segmentation.

DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning

  • paper_url: http://arxiv.org/abs/2311.01295
  • repo_url: https://github.com/wenxuan-bao/dp-mix
  • paper_authors: Wenxuan Bao, Francesco Pittaluga, Vijay Kumar B G, Vincent Bindschaedler
  • for: Improving the generalization of computer vision models trained with differential privacy, especially when training data is limited.
  • methods: Analyzes why naive applications of multi-sample data augmentation such as mixup fail under differentially private learning, and proposes two augmentation techniques designed for its constraints: DP-Mix_Self, which performs mixup on self-augmented data, and DP-Mix_Diff, which additionally incorporates synthetic data from a pre-trained diffusion model into the mixup process.
  • results: The proposed techniques achieve state-of-the-art classification performance across a range of datasets and settings.
    Abstract Data augmentation techniques, such as simple image transformations and combinations, are highly effective at improving the generalization of computer vision models, especially when training data is limited. However, such techniques are fundamentally incompatible with differentially private learning approaches, due to the latter's built-in assumption that each training image's contribution to the learned model is bounded. In this paper, we investigate why naive applications of multi-sample data augmentation techniques, such as mixup, fail to achieve good performance and propose two novel data augmentation techniques specifically designed for the constraints of differentially private learning. Our first technique, DP-Mix_Self, achieves SoTA classification performance across a range of datasets and settings by performing mixup on self-augmented data. Our second technique, DP-Mix_Diff, further improves performance by incorporating synthetic data from a pre-trained diffusion model into the mixup process. We open-source the code at https://github.com/wenxuan-Bao/DP-Mix.
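A minimal sketch of mixup on self-augmented data, the core idea behind DP-Mix_Self as stated in the abstract, is shown below. Mixing an example only with augmented views of itself plausibly keeps each record's contribution bounded, as DP training requires, but that rationale, the Beta-distributed mixing weight, and the omission of DP-SGD details (per-example clipping, noise addition) are simplifications and assumptions.

```python
import torch

def mixup_self_augmented(views, alpha=0.2):
    """Mix augmented views of ONE example with each other.

    views -- tensor of shape (num_views, C, H, W): augmentations of one example,
             which all share the same label, so no label mixing is needed.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(views.size(0))
    return lam * views + (1 - lam) * views[perm]
```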

Joint 3D Shape and Motion Estimation from Rolling Shutter Light-Field Images

  • paper_url: http://arxiv.org/abs/2311.01292
  • repo_url: None
  • paper_authors: Hermes McGriff, Renato Martins, Nicolas Andreff, Cédric Demonceaux
  • for: Addresses the problem of 3D reconstruction of scenes from a single image captured by a light-field camera equipped with a rolling shutter sensor.
  • methods: Leverages the 3D information cues present in the light-field and the motion information provided by the rolling shutter effect, with a generic model for the imaging process and a two-stage algorithm that minimizes the re-projection error.
  • results: Provides an instantaneous 3D shape-and-pose-and-velocity sensing paradigm, with a new benchmark dataset and several experiments conducted for different scenes and types of motions to demonstrate the effectiveness and advantages of the approach.
    Abstract In this paper, we propose an approach to address the problem of 3D reconstruction of scenes from a single image captured by a light-field camera equipped with a rolling shutter sensor. Our method leverages the 3D information cues present in the light-field and the motion information provided by the rolling shutter effect. We present a generic model for the imaging process of this sensor and a two-stage algorithm that minimizes the re-projection error while considering the position and motion of the camera in a motion-shape bundle adjustment estimation strategy. Thereby, we provide an instantaneous 3D shape-and-pose-and-velocity sensing paradigm. To the best of our knowledge, this is the first study to leverage this type of sensor for this purpose. We also present a new benchmark dataset composed of different light-fields showing rolling shutter effects, which can be used as a common base to improve the evaluation and tracking the progress in the field. We demonstrate the effectiveness and advantages of our approach through several experiments conducted for different scenes and types of motions. The source code and dataset are publicly available at: https://github.com/ICB-Vision-AI/RSLF

Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition

  • paper_url: http://arxiv.org/abs/2311.01283
  • repo_url: None
  • paper_authors: Hamid Ahmadabadi, Omid Nejati Manzari, Ahmad Ayatollahi
  • for: Improving the performance and efficiency of human action recognition through knowledge distillation and the combination of CNN and ViT models.
  • methods: A Transformer vision network serves as the student model and a convolutional network (ConvNeXt) as the teacher; the teacher extracts local image features while the student focuses on global features through an attention mechanism, and advanced ViT variants (PVT, ConViT, MViT, Swin Transformer, and Twins) are also discussed.
  • results: On the Stanford 40 dataset for human action recognition, student models trained with knowledge distillation achieve significantly better accuracy and mAP than those trained under regular settings, showing the benefit of combining local and global features.
    Abstract This paper presents a study on improving human action recognition through the utilization of knowledge distillation, and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models. The proposed method employs a Transformer vision network as the student model, while a convolutional network serves as the teacher model. The teacher model extracts local image features, whereas the student model focuses on global features using an attention mechanism. The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images. Additionally, advanced variants of ViT, namely PVT, Convit, MVIT, Swin Transformer, and Twins, are discussed, highlighting their contributions to computer vision tasks. The ConvNeXt model is introduced as a teacher model, known for its efficiency and effectiveness in computer vision. The paper presents performance results for human action recognition on the Stanford 40 dataset, comparing the accuracy and mAP of student models trained with and without knowledge distillation. The findings illustrate that the suggested approach significantly improves the accuracy and mAP when compared to training networks under regular settings. These findings emphasize the potential of combining local and global features in action recognition tasks.
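Below is a generic soft-label distillation loss of the kind typically used when training a Transformer student against a CNN teacher. The temperature, mixing weight, and KL formulation are standard choices assumed for illustration; the paper's exact objective is not spelled out in the abstract.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine a softened KL term against the teacher's predictions with the
    usual cross-entropy against the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```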

Exploring Deep Learning Image Super-Resolution for Iris Recognition

  • paper_url: http://arxiv.org/abs/2311.01241
  • repo_url: None
  • paper_authors: Eduardo Ribeiro, Andreas Uhl, Fernando Alonso-Fernandez, Reuben A. Farrugia
  • for: To test the ability of deep learning methods to provide an end-to-end mapping between low- and high-resolution images, applied to the iris recognition problem.
  • methods: Two deep learning single-image super-resolution approaches, Stacked Auto-Encoders (SAE) and Convolutional Neural Networks (CNN), with the most lightweight structure possible so as to achieve fast speed, preserve local information, and reduce artifacts at the same time.
  • results: Quality assessment and recognition experiments on a database of 1,872 near-infrared iris images show the superiority of the deep learning approaches over the compared algorithms.
    Abstract In this work we test the ability of deep learning methods to provide an end-to-end mapping between low and high resolution images applying it to the iris recognition problem. Here, we propose the use of two deep learning single-image super-resolution approaches: Stacked Auto-Encoders (SAE) and Convolutional Neural Networks (CNN) with the most possible lightweight structure to achieve fast speed, preserve local information and reduce artifacts at the same time. We validate the methods with a database of 1.872 near-infrared iris images with quality assessment and recognition experiments showing the superiority of deep learning approaches over the compared algorithms.

Log-Likelihood Score Level Fusion for Improved Cross-Sensor Smartphone Periocular Recognition

  • paper_url: http://arxiv.org/abs/2311.01237
  • repo_url: None
  • paper_authors: Fernando Alonso-Fernandez, Kiran B. Raja, Christoph Busch, Josef Bigun
  • for: Improving interoperability and recognition rates when periocular images captured with different smartphone cameras are compared.
  • methods: Fusion of several comparators using a probabilistic framework based on linear logistic regression, which maps same-sensor and cross-sensor score distributions to a common probabilistic domain of log-likelihood ratios.
  • results: The fusion improves periocular performance across heterogeneous devices, reducing cross-sensor EER by up to 40%, and allows Bayes thresholds to replace sensor-specific thresholds.
    Abstract The proliferation of cameras and personal devices results in a wide variability of imaging conditions, producing large intra-class variations and a significant performance drop when images from heterogeneous environments are compared. However, many applications require to deal with data from different sources regularly, thus needing to overcome these interoperability problems. Here, we employ fusion of several comparators to improve periocular performance when images from different smartphones are compared. We use a probabilistic fusion framework based on linear logistic regression, in which fused scores tend to be log-likelihood ratios, obtaining a reduction in cross-sensor EER of up to 40% due to the fusion. Our framework also provides an elegant and simple solution to handle signals from different devices, since same-sensor and cross-sensor score distributions are aligned and mapped to a common probabilistic domain. This allows the use of Bayes thresholds for optimal decision-making, eliminating the need of sensor-specific thresholds, which is essential in operational conditions because the threshold setting critically determines the accuracy of the authentication process in many applications.
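The fusion rule amounts to fitting a linear logistic regression on comparator scores so that the fused score behaves like a log-likelihood ratio, after which a single Bayes threshold replaces sensor-specific ones. A hedged sketch with synthetic scores is shown below; the score distributions, priors, and number of comparators are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# scores: one column per comparator, rows are comparison trials;
# labels: 1 = genuine (same person), 0 = impostor. Synthetic data for illustration.
rng = np.random.default_rng(0)
genuine = rng.normal(loc=[1.5, 1.0, 2.0], scale=1.0, size=(500, 3))
impostor = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(500, 3))
scores = np.vstack([genuine, impostor])
labels = np.hstack([np.ones(500), np.zeros(500)])

# Linear logistic regression learns weights so that the fused score behaves like a
# log-likelihood ratio (the training prior log-odds are absorbed in the intercept).
clf = LogisticRegression().fit(scores, labels)
fused_llr = scores @ clf.coef_.ravel() + clf.intercept_[0]

# With calibrated LLRs, one Bayes threshold replaces sensor-specific thresholds.
prior_genuine = 0.5
bayes_threshold = np.log((1 - prior_genuine) / prior_genuine)  # = 0 for equal priors
decisions = fused_llr > bayes_threshold
```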

Robust Feature Learning and Global Variance-Driven Classifier Alignment for Long-Tail Class Incremental Learning

  • paper_url: http://arxiv.org/abs/2311.01227
  • repo_url: https://github.com/JAYATEJAK/GVAlign
  • paper_authors: Jayateja Kalla, Soma Biswas
  • for: Enhancing long-tail class incremental learning, so a model can progressively learn new classes while mitigating catastrophic forgetting under long-tailed data distributions.
  • methods: A two-stage framework that uses global variance as an informative measure together with class prototypes in the second stage to achieve classifier alignment, capturing class properties without data balancing or additional layer tuning; mixup classes are added to the first-stage losses to learn robust feature representations with smoother boundaries.
  • results: Extensive experiments on CIFAR-100 and ImageNet-Subset demonstrate superiority over state-of-the-art techniques across various long-tail class incremental learning settings.
    Abstract This paper introduces a two-stage framework designed to enhance long-tail class incremental learning, enabling the model to progressively learn new classes, while mitigating catastrophic forgetting in the context of long-tailed data distributions. Addressing the challenge posed by the under-representation of tail classes in long-tail class incremental learning, our approach achieves classifier alignment by leveraging global variance as an informative measure and class prototypes in the second stage. This process effectively captures class properties and eliminates the need for data balancing or additional layer tuning. Alongside traditional class incremental learning losses in the first stage, the proposed approach incorporates mixup classes to learn robust feature representations, ensuring smoother boundaries. The proposed framework can seamlessly integrate as a module with any class incremental learning method to effectively handle long-tail class incremental learning scenarios. Extensive experimentation on the CIFAR-100 and ImageNet-Subset datasets validates the approach's efficacy, showcasing its superiority over state-of-the-art techniques across various long-tail CIL settings.
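One plausible reading of the second stage, sketched below purely under stated assumptions, is to estimate class prototypes, borrow a single global variance estimate, and re-tune only the classifier on class-balanced Gaussian pseudo-features. The feature dimension, step counts, and sampling scheme are illustrative and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def align_classifier(features, labels, classifier, num_classes, steps=100, lr=0.01):
    """Second-stage alignment sketch: estimate class prototypes, share one global
    variance across classes, and tune only the classifier on Gaussian pseudo-features
    drawn around each prototype (one per class per step, hence class-balanced)."""
    protos = torch.stack([features[labels == c].mean(0) for c in range(num_classes)])
    global_std = features.std(0, keepdim=True)           # shared (global) variance estimate
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    for _ in range(steps):
        noise = torch.randn(num_classes, features.size(1)) * global_std
        pseudo = protos + noise
        loss = F.cross_entropy(classifier(pseudo), torch.arange(num_classes))
        opt.zero_grad()
        loss.backward()
        opt.step()

# Illustrative use with random 64-D backbone features and 10 classes.
feats = torch.randn(200, 64)
labels = torch.arange(200) % 10
clf = torch.nn.Linear(64, 10)
align_classifier(feats, labels, clf, num_classes=10)
```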

Optimal Transport-Guided Conditional Score-Based Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.01226
  • repo_url: https://github.com/xjtu-xgu/otcs
  • paper_authors: Xiang Gu, Liwei Yang, Jian Sun, Zongben Xu
  • for: Conditional generation of target data when the conditioning data are only partially paired or fully unpaired with the target.
  • methods: An Optimal Transport-guided Conditional Score-based diffusion model (OTCS) that builds the coupling between unpaired or partially paired data via $L_2$-regularized unsupervised or semi-supervised optimal transport, and trains the conditional score-based model with a resampling-by-compatibility strategy based on the estimated coupling.
  • results: Effective training on unpaired super-resolution and semi-paired image-to-image translation, with a theoretical bound showing that OTCS realizes the data transport prescribed by optimal transport. Code: \url{https://github.com/XJTU-XGU/OTCS}.
    Abstract Conditional score-based diffusion model (SBDM) is for conditional generation of target data with paired data as condition, and has achieved great success in image translation. However, it requires the paired data as condition, and there would be insufficient paired data provided in real-world applications. To tackle the applications with partially paired or even unpaired dataset, we propose a novel Optimal Transport-guided Conditional Score-based diffusion model (OTCS) in this paper. We build the coupling relationship for the unpaired or partially paired dataset based on $L_2$-regularized unsupervised or semi-supervised optimal transport, respectively. Based on the coupling relationship, we develop the objective for training the conditional score-based model for unpaired or partially paired settings, which is based on a reformulation and generalization of the conditional SBDM for paired setting. With the estimated coupling relationship, we effectively train the conditional score-based model by designing a ``resampling-by-compatibility'' strategy to choose the sampled data with high compatibility as guidance. Extensive experiments on unpaired super-resolution and semi-paired image-to-image translation demonstrated the effectiveness of the proposed OTCS model. From the viewpoint of optimal transport, OTCS provides an approach to transport data across distributions, which is a challenge for OT on large-scale datasets. We theoretically prove that OTCS realizes the data transport in OT with a theoretical bound. Code is available at \url{https://github.com/XJTU-XGU/OTCS}.
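The coupling-then-resampling idea can be illustrated compactly. The sketch below uses entropic Sinkhorn iterations as a simple stand-in for the $L_2$-regularized optimal transport used in the paper, then draws a target sample for each source sample with probability proportional to the coupling, which is the spirit of the ``resampling-by-compatibility'' strategy. The toy 2-D Gaussians and regularization strength are assumptions.

```python
import numpy as np

def sinkhorn_coupling(C, reg=0.1, iters=200):
    """Entropic-OT coupling between uniform marginals given cost matrix C.
    (Stand-in for the L2-regularized unsupervised OT used in the paper.)"""
    n, m = C.shape
    K = np.exp(-C / reg)
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # coupling pi, rows index source samples

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(64, 2))   # e.g. degraded / unpaired inputs
target = rng.normal(3.0, 1.0, size=(64, 2))   # e.g. clean target data
cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
cost /= cost.max()                            # rescale to avoid underflow in exp(-C/reg)
pi = sinkhorn_coupling(cost)

# "Resampling-by-compatibility": for source sample i, draw a target sample j with
# probability proportional to pi[i, j] and use (x_i -> y_j) as a conditioning pair.
i = 7
p = pi[i] / pi[i].sum()
j = rng.choice(len(target), p=p)
pair = (source[i], target[j])
```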

Convergent plug-and-play with proximal denoiser and unconstrained regularization parameter

  • paper_url: http://arxiv.org/abs/2311.01216
  • repo_url: None
  • paper_authors: Samuel Hurault, Antonin Chambolle, Arthur Leclaire, Nicolas Papadakis
  • for: Providing new convergence proofs for Plug-and-Play (PnP) algorithms used to solve image inverse problems.
  • methods: PnP schemes that plug a pretrained proximal denoiser into Proximal Gradient Descent (PGD) or Douglas-Rachford Splitting (DRS); the paper gives a new convergence proof for PnP-DRS without restrictions on the regularization parameter, and a relaxed PGD variant that converges over a broader range of regularization parameters.
  • results: Experiments on deblurring and super-resolution show that both remedies improve the accuracy of image restoration.
    Abstract In this work, we present new proofs of convergence for Plug-and-Play (PnP) algorithms. PnP methods are efficient iterative algorithms for solving image inverse problems where regularization is performed by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD) or Douglas-Rachford Splitting (DRS). Recent research has explored convergence by incorporating a denoiser that writes exactly as a proximal operator. However, the corresponding PnP algorithm has then to be run with stepsize equal to $1$. The stepsize condition for nonconvex convergence of the proximal algorithm in use then translates to restrictive conditions on the regularization parameter of the inverse problem. This can severely degrade the restoration capacity of the algorithm. In this paper, we present two remedies for this limitation. First, we provide a novel convergence proof for PnP-DRS that does not impose any restrictions on the regularization parameter. Second, we examine a relaxed version of the PGD algorithm that converges across a broader range of regularization parameters. Our experimental study, conducted on deblurring and super-resolution experiments, demonstrate that both of these solutions enhance the accuracy of image restoration.
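For readers unfamiliar with PnP schemes, the sketch below shows a generic plug-and-play proximal gradient iteration for a linear inverse problem y = Ax + n, with a simple smoothing function standing in for the pretrained proximal denoiser. The toy operator, step-size choice, and denoiser are assumptions; the paper's relaxed variant modifies this scheme to widen the admissible range of the regularization parameter.

```python
import numpy as np

def pnp_pgd(y, A, denoiser, step, iters=50):
    """Plug-and-Play proximal gradient descent for min_x 0.5*||Ax - y||^2 + phi(x),
    where the proximal map of phi is replaced by a (pretrained) denoiser."""
    x = A.T @ y                               # simple initialization
    for _ in range(iters):
        grad = A.T @ (A @ x - y)              # gradient of the data-fidelity term
        x = denoiser(x - step * grad)         # the denoiser plays the role of the prox step
    return x

# Toy example: a mildly perturbed identity operator; the "denoiser" is a slight
# shrinkage toward the local mean, a placeholder for a learned proximal denoiser.
rng = np.random.default_rng(0)
x_true = rng.normal(size=128)
A = np.eye(128) + 0.01 * rng.normal(size=(128, 128))
y = A @ x_true + 0.01 * rng.normal(size=128)
smooth = lambda x: 0.9 * x + 0.1 * np.convolve(x, np.ones(5) / 5, mode="same")
x_hat = pnp_pgd(y, A, smooth, step=1.0 / np.linalg.norm(A, 2) ** 2)
```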

High-Quality Animatable Dynamic Garment Reconstruction from Monocular Videos

  • paper_url: http://arxiv.org/abs/2311.01214
  • repo_url: None
  • paper_authors: Xiongzheng Li, Jinsong Zhang, Yu-Kun Lai, Jingyu Yang, Kun Li
  • for: reconstruction of high-quality animatable dynamic garments from monocular videos
  • methods: learnable garment deformation network, multi-hypothesis deformation module
  • results: high-quality dynamic garments with coherent surface details, can be easily animated under unseen poses
    Abstract Much progress has been made in reconstructing garments from an image or a video. However, none of existing works meet the expectations of digitizing high-quality animatable dynamic garments that can be adjusted to various unseen poses. In this paper, we propose the first method to recover high-quality animatable dynamic garments from monocular videos without depending on scanned data. To generate reasonable deformations for various unseen poses, we propose a learnable garment deformation network that formulates the garment reconstruction task as a pose-driven deformation problem. To alleviate the ambiguity estimating 3D garments from monocular videos, we design a multi-hypothesis deformation module that learns spatial representations of multiple plausible deformations. Experimental results on several public datasets demonstrate that our method can reconstruct high-quality dynamic garments with coherent surface details, which can be easily animated under unseen poses. The code will be provided for research purposes.

Semantic Scene Graph Generation Based on an Edge Dual Scene Graph and Message Passing Neural Network

  • paper_url: http://arxiv.org/abs/2311.01192
  • repo_url: None
  • paper_authors: Hyeongjin Kim, Sangwon Kim, Jong Taek Lee, Byoung Chul Ko
  • for: Improving scene graph generation (SGG) so it more accurately captures the complex relationships and interactions between objects in an image.
  • methods: Edge dual scene graph generation (EdgeSGG), built on an edge dual scene graph and a Dual Message Passing Neural Network (DualMPNN) that learns both object- and relation-centric features to capture rich contextual interactions and support fine-grained relational updates between objects.
  • results: Compared with state-of-the-art methods on two public datasets, the model shows substantial performance gains across three SGG subtasks and six metrics, and experiments on long-tailed distributions show that modeling inter-object relationships effectively mitigates existing long-tail problems.
    Abstract Along with generative AI, interest in scene graph generation (SGG), which comprehensively captures the relationships and interactions between objects in an image and creates a structured graph-based representation, has significantly increased in recent years. However, relying on object-centric and dichotomous relationships, existing SGG methods have a limited ability to accurately predict detailed relationships. To solve these problems, a new approach to modeling multi-object relationships, called edge dual scene graph generation (EdgeSGG), is proposed herein. EdgeSGG is based on an edge dual scene graph and a Dual Message Passing Neural Network (DualMPNN), which can capture rich contextual interactions between unconstrained objects. To facilitate the learning of edge dual scene graphs with a symmetric graph structure, the proposed DualMPNN learns both object- and relation-centric features for more accurately predicting relation-aware contexts and allows fine-grained relational updates between objects. A comparative experiment with state-of-the-art (SoTA) methods was conducted using two public datasets for SGG operations and six metrics for three subtasks. Compared with SoTA approaches, the proposed model exhibited substantial performance improvements across all SGG subtasks. Furthermore, experiments on long-tail distributions revealed that incorporating the relationships between objects effectively mitigates existing long-tail problems.

Terrain-Informed Self-Supervised Learning: Enhancing Building Footprint Extraction from LiDAR Data with Limited Annotations

  • paper_url: http://arxiv.org/abs/2311.01188
  • repo_url: None
  • paper_authors: Anuja Vats, David Völgyes, Martijn Vermeer, Marius Pedersen, Kiran Raja, Daniele S. M. Fantin, Jacob Alexander Hay
  • for: Extracting accurate building footprint maps from remote sensing (LiDAR) data when only limited annotations are available.
  • methods: Terrain-aware self-supervised learning on digital elevation models derived from LiDAR data, training a model to differentiate bare earth from superimposed structures so the network implicitly learns domain-relevant features without extensive pixel-level annotations.
  • results: With only 1% of the labels (about 25 labeled examples), the method outperforms ImageNet pre-training, with the advantage most pronounced in few-shot settings; it also generalizes to data with substantial distribution shifts and labeling errors and consistently beats other baselines, including more complex architectures.
    Abstract Estimating building footprint maps from geospatial data is of paramount importance in urban planning, development, disaster management, and various other applications. Deep learning methodologies have gained prominence in building segmentation maps, offering the promise of precise footprint extraction without extensive post-processing. However, these methods face challenges in generalization and label efficiency, particularly in remote sensing, where obtaining accurate labels can be both expensive and time-consuming. To address these challenges, we propose terrain-aware self-supervised learning, tailored to remote sensing, using digital elevation models from LiDAR data. We propose to learn a model to differentiate between bare Earth and superimposed structures enabling the network to implicitly learn domain-relevant features without the need for extensive pixel-level annotations. We test the effectiveness of our approach by evaluating building segmentation performance on test datasets with varying label fractions. Remarkably, with only 1% of the labels (equivalent to 25 labeled examples), our method improves over ImageNet pre-training, showing the advantage of leveraging unlabeled data for feature extraction in the domain of remote sensing. The performance improvement is more pronounced in few-shot scenarios and gradually closes the gap with ImageNet pre-training as the label fraction increases. We test on a dataset characterized by substantial distribution shifts and labeling errors to demonstrate the generalizability of our approach. When compared to other baselines, including ImageNet pretraining and more complex architectures, our approach consistently performs better, demonstrating the efficiency and effectiveness of self-supervised terrain-aware feature learning.
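The abstract does not spell out the exact pretext task; one plausible reading, sketched below purely as an assumption, is to derive "structure above bare earth" pseudo-labels from the difference between a LiDAR surface model (DSM) and terrain model (DTM) and train the encoder against them. The height threshold and tile shapes are illustrative.

```python
import numpy as np

def structure_pseudo_labels(dsm, dtm, min_height=2.0):
    """Derive self-supervision targets from LiDAR-based elevation models: the
    normalized surface (DSM - DTM) separates bare earth from superimposed structures
    such as buildings, without any manual pixel annotations."""
    ndsm = dsm - dtm                                   # height above bare earth, in metres
    return (ndsm > min_height).astype(np.uint8)        # 1 = structure, 0 = bare earth

# Toy tile: flat terrain with one raised block standing in for a building.
dtm = np.zeros((64, 64), dtype=np.float32)
dsm = dtm.copy()
dsm[20:40, 20:40] += 6.0
labels = structure_pseudo_labels(dsm, dtm)             # used as targets for the depth encoder
```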

Learning Intra and Inter-Camera Invariance for Isolated Camera Supervised Person Re-identification

  • paper_url: http://arxiv.org/abs/2311.01155
  • repo_url: None
  • paper_authors: Menglin Wang, Xiaojin Gong
  • for: Studying person re-identification under the isolated camera supervised (ISCS) setting, where each person is captured by only a single camera.
  • methods: Instead of generating fake cross-camera features, the method makes efficient use of the variation in the training data and learns both intra- and inter-camera invariance in a unified framework: style-consistent environments are built via clustering and prototypical contrastive learning is performed within each environment, strongly augmented images are contrasted with their prototypes to enforce intra-camera augmentation invariance, and an improved multi-camera negative loss optimizes distances over multi-level negatives.
  • results: Extensive experiments on multiple benchmarks validate the effectiveness and superiority of the proposed method.
    Abstract Supervised person re-identification assumes that a person has images captured under multiple cameras. However when cameras are placed in distance, a person rarely appears in more than one camera. This paper thus studies person re-ID under such isolated camera supervised (ISCS) setting. Instead of trying to generate fake cross-camera features like previous methods, we explore a novel perspective by making efficient use of the variation in training data. Under ISCS setting, a person only has limited images from a single camera, so the camera bias becomes a critical issue confounding ID discrimination. Cross-camera images are prone to being recognized as different IDs simply by camera style. To eliminate the confounding effect of camera bias, we propose to learn both intra- and inter-camera invariance under a unified framework. First, we construct style-consistent environments via clustering, and perform prototypical contrastive learning within each environment. Meanwhile, strongly augmented images are contrasted with original prototypes to enforce intra-camera augmentation invariance. For inter-camera invariance, we further design a much improved variant of multi-camera negative loss that optimizes the distance of multi-level negatives. The resulting model learns to be invariant to both subtle and severe style variation within and cross-camera. On multiple benchmarks, we conduct extensive experiments and validate the effectiveness and superiority of the proposed method. Code will be available at https://github.com/Terminator8758/IICI.
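The intra-environment objective can be sketched as a prototypical contrastive loss: each (augmented) feature is pulled toward its cluster prototype and pushed away from the others. The temperature, dimensions, and prototype count below are assumptions, and the paper's multi-camera negative loss is not reproduced.

```python
import torch
import torch.nn.functional as F

def prototypical_contrastive_loss(features, proto_ids, prototypes, temperature=0.05):
    """Pull each (augmented) feature toward its cluster prototype within a
    style-consistent environment and push it away from the other prototypes."""
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.t() / temperature          # [N, num_prototypes]
    return F.cross_entropy(logits, proto_ids)

# Illustrative use: 128-D features, 16 cluster prototypes in one environment.
feats = torch.randn(32, 128, requires_grad=True)
protos = torch.randn(16, 128)
ids = torch.randint(0, 16, (32,))
loss = prototypical_contrastive_loss(feats, ids, protos)
loss.backward()
```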

AeroPath: An airway segmentation benchmark dataset with challenging pathology

  • paper_url: http://arxiv.org/abs/2311.01138
  • repo_url: https://github.com/raidionics/aeropath
  • paper_authors: Karen-Helene Støverud, David Bouget, Andre Pedersen, Håkon Olav Leira, Thomas Langø, Erlend Fagertun Hofstad
  • for: Early diagnosis and treatment of pulmonary diseases such as lung cancer rely on CT analysis, and high-quality airway tree segmentation is required for intervention planning and live guidance during bronchoscopy; existing benchmarks contain few patients with severe pathologies affecting airway anatomy.
  • methods: A new public benchmark dataset (AeroPath) of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with trachea and bronchi annotations, together with a multiscale fusion design for automatic airway segmentation trained on the ATM'22 dataset.
  • results: The proposed model predicted topologically correct segmentations for all AeroPath patients and is robust to various anomalies down to at least the fifth airway generation; an open web application is provided for testing the model on new data.
    Abstract To improve the prognosis of patients suffering from pulmonary diseases, such as lung cancer, early diagnosis and treatment are crucial. The analysis of CT images is invaluable for diagnosis, whereas high quality segmentation of the airway tree are required for intervention planning and live guidance during bronchoscopy. Recently, the Multi-domain Airway Tree Modeling (ATM'22) challenge released a large dataset, both enabling training of deep-learning based models and bringing substantial improvement of the state-of-the-art for the airway segmentation task. However, the ATM'22 dataset includes few patients with severe pathologies affecting the airway tree anatomy. In this study, we introduce a new public benchmark dataset (AeroPath), consisting of 27 CT images from patients with pathologies ranging from emphysema to large tumors, with corresponding trachea and bronchi annotations. Second, we present a multiscale fusion design for automatic airway segmentation. Models were trained on the ATM'22 dataset, tested on the AeroPath dataset, and further evaluated against competitive open-source methods. The same performance metrics as used in the ATM'22 challenge were used to benchmark the different considered approaches. Lastly, an open web application is developed, to easily test the proposed model on new data. The results demonstrated that our proposed architecture predicted topologically correct segmentations for all the patients included in the AeroPath dataset. The proposed method is robust and able to handle various anomalies, down to at least the fifth airway generation. In addition, the AeroPath dataset, featuring patients with challenging pathologies, will contribute to development of new state-of-the-art methods. The AeroPath dataset and the web application are made openly available.

A deep learning experiment for semantic segmentation of overlapping characters in palimpsests

  • paper_url: http://arxiv.org/abs/2311.01130
  • repo_url: None
  • paper_authors: Michela Perino, Michele Ginolfi, Anna Candida Felici, Michela Rosellini
  • for: Proposing a deep-learning-based semantic segmentation method for identifying and segmenting individual letters within overlapping characters in palimpsests.
  • methods: Multispectral imaging combined with deep-learning semantic segmentation to disentangle complex nodes of overlapping letters; the Ars Grammatica palimpsest by Priscian serves as a proof-of-concept case study.
  • results: The experiment indicates that individual letters in overlapping characters can be segmented; the caveats and prospects of combining the approach with multispectral imaging are also discussed.
    Abstract Palimpsests refer to historical manuscripts where erased writings have been partially covered by the superimposition of a second writing. By employing imaging techniques, e.g., multispectral imaging, it becomes possible to identify features that are imperceptible to the naked eye, including faded and erased inks. When dealing with overlapping inks, Artificial Intelligence techniques can be utilized to disentangle complex nodes of overlapping letters. In this work, we propose deep learning-based semantic segmentation as a method for identifying and segmenting individual letters in overlapping characters. The experiment was conceived as a proof of concept, focusing on the palimpsests of the Ars Grammatica by Prisciano as a case study. Furthermore, caveats and prospects of our approach combined with multispectral imaging are also discussed.

Cheating Depth: Enhancing 3D Surface Anomaly Detection via Depth Simulation

  • paper_url: http://arxiv.org/abs/2311.01117
  • repo_url: https://github.com/vitjanz/3dsr
  • paper_authors: Vitjan Zavrtanik, Matej Kristan, Danijel Skočaj
  • for: Improving the accuracy and processing speed of surface anomaly detection by complementing RGB with 3D (depth) information.
  • methods: A Depth-Aware Discrete Autoencoder (DADA) architecture that learns a general discrete latent space jointly modeling RGB and 3D data for 3D surface anomaly detection, plus a simulation process for learning informative depth features that compensates for the scarcity of diverse industrial depth datasets.
  • results: The resulting method, 3DSR, outperforms all existing state-of-the-art methods on the challenging MVTec3D anomaly detection benchmark in both accuracy and processing speed.
    Abstract RGB-based surface anomaly detection methods have advanced significantly. However, certain surface anomalies remain practically invisible in RGB alone, necessitating the incorporation of 3D information. Existing approaches that employ point-cloud backbones suffer from suboptimal representations and reduced applicability due to slow processing. Re-training RGB backbones, designed for faster dense input processing, on industrial depth datasets is hindered by the limited availability of sufficiently large datasets. We make several contributions to address these challenges. (i) We propose a novel Depth-Aware Discrete Autoencoder (DADA) architecture, that enables learning a general discrete latent space that jointly models RGB and 3D data for 3D surface anomaly detection. (ii) We tackle the lack of diverse industrial depth datasets by introducing a simulation process for learning informative depth features in the depth encoder. (iii) We propose a new surface anomaly detection method 3DSR, which outperforms all existing state-of-the-art on the challenging MVTec3D anomaly detection benchmark, both in terms of accuracy and processing speed. The experimental results validate the effectiveness and efficiency of our approach, highlighting the potential of utilizing depth information for improved surface anomaly detection.

H-NeXt: The next step towards roto-translation invariant networks

  • paper_url: http://arxiv.org/abs/2311.01111
  • repo_url: https://github.com/karellat/h-next
  • paper_authors: Tomas Karella, Filip Sroubek, Jan Flusser, Jan Blazek, Vasek Kosik
  • for: Building a parameter-efficient network that is invariant to rotations and translations without requiring augmented training images.
  • methods: H-NeXt, consisting of an equivariant backbone for learning roto-translation independent features, an invariant pooling layer that discards roto-translation information, and a classification layer.
  • results: Trained without a single augmented image, H-NeXt outperforms the state of the art on unaugmented training sets and augmented test sets of MNIST and CIFAR-10.
    Abstract The widespread popularity of equivariant networks underscores the significance of parameter efficient models and effective use of training data. At a time when robustness to unseen deformations is becoming increasingly important, we present H-NeXt, which bridges the gap between equivariance and invariance. H-NeXt is a parameter-efficient roto-translation invariant network that is trained without a single augmented image in the training set. Our network comprises three components: an equivariant backbone for learning roto-translation independent features, an invariant pooling layer for discarding roto-translation information, and a classification layer. H-NeXt outperforms the state of the art in classification on unaugmented training sets and augmented test sets of MNIST and CIFAR-10.
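The division of labor between the equivariant backbone and the invariant pooling layer can be illustrated as follows: given a feature map with an explicit orientation axis, pooling over that axis discards rotation information and global spatial pooling discards translation. The tensor layout, pooling operators, and sizes below are assumptions, not H-NeXt's actual layers.

```python
import torch

def invariant_pool(equivariant_feats):
    """Collapse an equivariant feature map [B, C, n_orientations, H, W] into a
    roto-translation invariant descriptor [B, C]: max over the orientation axis
    discards rotation information, global average pooling discards translation."""
    orientation_pooled = equivariant_feats.max(dim=2).values   # [B, C, H, W]
    return orientation_pooled.mean(dim=(2, 3))                 # [B, C]

feats = torch.randn(4, 64, 8, 28, 28)        # 8 discrete orientations in the group axis
descriptor = invariant_pool(feats)           # fed to the final classification layer
classifier = torch.nn.Linear(64, 10)
logits = classifier(descriptor)
```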

Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

  • paper_url: http://arxiv.org/abs/2311.01092
  • repo_url: https://github.com/medhk23/omnifm-dr
  • paper_authors: Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang
  • for: Exploring multi-modal deep learning for clinical applications by unifying multiple chest-radiograph interpretation tasks in a single transformer model to improve diagnostic interpretability.
  • methods: A multi-task transformer trained with unified and customized instruction tuning on a dataset of 13.4 million instruction/ground-truth pairs (about one million radiographs), covering both image- and pixel-level tasks within one training framework with homogeneous model inputs and outputs.
  • results: The model outperforms prior art on various chest X-ray benchmarks across multiple tasks in both direct inference and fine-tuning settings, and evaluation by three radiologists confirms the enhanced explainability of the generated reports.
    Abstract The emergence of multi-modal deep learning models has made significant impacts on clinical applications in the last decade. However, the majority of models are limited to single-tasking, without considering disease diagnosis is indeed a multi-task procedure. Here, we demonstrate a unified transformer model specifically designed for multi-modal clinical tasks by incorporating customized instruction tuning. We first compose a multi-task training dataset comprising 13.4 million instruction and ground-truth pairs (with approximately one million radiographs) for the customized tuning, involving both image- and pixel-level tasks. Thus, we can unify the various vision-intensive tasks in a single training framework with homogeneous model inputs and outputs to increase clinical interpretability in one reading. Finally, we demonstrate the overall superior performance of our model compared to prior arts on various chest X-ray benchmarks across multi-tasks in both direct inference and finetuning settings. Three radiologists further evaluate the generated reports against the recorded ones, which also exhibit the enhanced explainability of our multi-task model.

Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding

  • paper_url: http://arxiv.org/abs/2311.01091
  • repo_url: None
  • paper_authors: Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, Si Liu
  • for: Improving panoptic narrative grounding, i.e., the visual-linguistic interaction between an image and the noun phrases of its narrative caption.
  • methods: A Phrase-Pixel-Object Transformer Decoder (PPO-TD) that enriches phrase features with both fine-grained pixel contexts and coarse-grained object contexts, plus a Phrase-Object Contrastive Loss (POCL) that pulls matched phrase-object pairs closer and pushes unmatched ones apart to aggregate more precise object contexts.
  • results: Extensive experiments on the PNG benchmark show new state-of-the-art performance with large margins over previous methods.
    Abstract Panoptic narrative grounding (PNG) aims to segment things and stuff objects in an image described by noun phrases of a narrative caption. As a multimodal task, an essential aspect of PNG is the visual-linguistic interaction between image and caption. The previous two-stage method aggregates visual contexts from offline-generated mask proposals to phrase features, which tend to be noisy and fragmentary. The recent one-stage method aggregates only pixel contexts from image features to phrase features, which may incur semantic misalignment due to lacking object priors. To realize more comprehensive visual-linguistic interaction, we propose to enrich phrases with coupled pixel and object contexts by designing a Phrase-Pixel-Object Transformer Decoder (PPO-TD), where both fine-grained part details and coarse-grained entity clues are aggregated to phrase features. In addition, we also propose a PhraseObject Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push away unmatched ones for aggregating more precise object contexts from more phrase-relevant object tokens. Extensive experiments on the PNG benchmark show our method achieves new state-of-the-art performance with large margins.

Infusion: Internal Diffusion for Video Inpainting

  • paper_url: http://arxiv.org/abs/2311.01090
  • repo_url: None
  • paper_authors: Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson
  • for: Video inpainting, i.e., filling a desired region of a video in a visually convincing manner.
  • methods: A diffusion model trained internally on the video to be inpainted, exploiting the highly auto-similar nature of videos; this internal-learning approach greatly reduces the network size, and the diffusion process is split into learning intervals that simplify training and inference.
  • results: The method reaches state-of-the-art performance, particularly on dynamic backgrounds and textures, without requiring supporting elements such as optical flow estimation.
    Abstract Video inpainting is the task of filling a desired region in a video in a visually convincing manner. It is a very challenging task due to the high dimensionality of the signal and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Diffusion models remain nonetheless very expensive to train and perform inference with, which strongly restricts their application to video. We show that in the case of video inpainting, thanks to the highly auto-similar nature of videos, the training of a diffusion model can be restricted to the video to inpaint and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows for a greatly reduced network size. We call our approach "Infusion": an internal learning algorithm for video inpainting through diffusion. Due to our frugal network, we are able to propose the first video inpainting approach based purely on diffusion. Other methods require supporting elements such as optical flow estimation, which limits their performance in the case of dynamic textures for example. We introduce a new method for efficient training and inference of diffusion models in the context of internal learning. We split the diffusion process into different learning intervals which greatly simplifies the learning steps. We show qualitative and quantitative results, demonstrating that our method reaches state-of-the-art performance, in particular in the case of dynamic backgrounds and textures.

Dynamic Multimodal Information Bottleneck for Multimodality Classification

  • paper_url: http://arxiv.org/abs/2311.01066
  • repo_url: https://github.com/bii-wushuang/dmib
  • paper_authors: Yingying Fang, Shuang Wu, Sheng Zhang, Chaoyan Huang, Tieyong Zeng, Xiaodan Xing, Simon Walsh, Guang Yang
  • for: Making better use of multimodal data (images, laboratory tests, clinical information) to improve the accuracy and robustness of AI-based medical diagnosis and prognosis.
  • methods: A dynamic multimodal information bottleneck framework whose bottleneck module filters task-irrelevant information and noise out of the fused features, with a sufficiency loss that prevents task-relevant prediction information from being dropped.
  • results: On an in-house and a public COVID-19 dataset for mortality prediction and two public biomedical datasets for diagnostic tasks, the method surpasses the state of the art and is the only method to maintain performance when large-scale noisy modality channels are present.
    Abstract Effectively leveraging multimodal data such as various images, laboratory tests and clinical information is gaining traction in a variety of AI-based medical diagnosis and prognosis tasks. Most existing multi-modal techniques only focus on enhancing their performance by leveraging the differences or shared features from various modalities and fusing feature across different modalities. These approaches are generally not optimal for clinical settings, which pose the additional challenges of limited training data, as well as being rife with redundant data or noisy modality channels, leading to subpar performance. To address this gap, we study the robustness of existing methods to data redundancy and noise and propose a generalized dynamic multimodal information bottleneck framework for attaining a robust fused feature representation. Specifically, our information bottleneck module serves to filter out the task-irrelevant information and noises in the fused feature, and we further introduce a sufficiency loss to prevent dropping of task-relevant information, thus explicitly preserving the sufficiency of prediction information in the distilled feature. We validate our model on an in-house and a public COVID19 dataset for mortality prediction as well as two public biomedical datasets for diagnostic tasks. Extensive experiments show that our method surpasses the state-of-the-art and is significantly more robust, being the only method to remain performance when large-scale noisy channels exist. Our code is publicly available at https://github.com/BII-wushuang/DMIB.

Novel View Synthesis from a Single RGBD Image for Indoor Scenes

  • paper_url: http://arxiv.org/abs/2311.01065
  • repo_url: None
  • paper_authors: Congrui Hetang, Yuping Wang
  • for: Synthesizing novel view images from a single RGBD input.
  • methods: The RGBD image is converted into a point cloud and rendered from a different viewpoint, turning novel view synthesis into an image translation problem solved with generative adversarial networks for style transfer; both unsupervised (CycleGAN) and supervised (Pix2Pix) learning are explored.
  • results: The method produces results similar to photographs taken from the new perspective, circumventing the limitations of traditional multi-image techniques such as NeRF and multi-view stereo and holding promise for practical, real-time applications.
    Abstract In this paper, we propose an approach for synthesizing novel view images from a single RGBD (Red Green Blue-Depth) input. Novel view synthesis (NVS) is an interesting computer vision task with extensive applications. Methods using multiple images has been well-studied, exemplary ones include training scene-specific Neural Radiance Fields (NeRF), or leveraging multi-view stereo (MVS) and 3D rendering pipelines. However, both are either computationally intensive or non-generalizable across different scenes, limiting their practical value. Conversely, the depth information embedded in RGBD images unlocks 3D potential from a singular view, simplifying NVS. The widespread availability of compact, affordable stereo cameras, and even LiDARs in contemporary devices like smartphones, makes capturing RGBD images more accessible than ever. In our method, we convert an RGBD image into a point cloud and render it from a different viewpoint, then formulate the NVS task into an image translation problem. We leveraged generative adversarial networks to style-transfer the rendered image, achieving a result similar to a photograph taken from the new perspective. We explore both unsupervised learning using CycleGAN and supervised learning with Pix2Pix, and demonstrate the qualitative results. Our method circumvents the limitations of traditional multi-image techniques, holding significant promise for practical, real-time applications in NVS.
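The geometric half of the pipeline (RGBD to point cloud to a re-rendered viewpoint) is sketched below with NumPy; the camera intrinsics, depth map, and new viewpoint are assumptions, and the GAN-based refinement that turns the warped result into a photorealistic image is not shown.

```python
import numpy as np

def reproject_rgbd(depth, K, R, t):
    """Back-project an RGBD depth map to a point cloud with intrinsics K, move the
    camera by (R, t), and project the points into the new view. Returns the new pixel
    coordinates of every source pixel plus the transformed 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)     # 3D points, source camera frame
    pts_new = R @ pts + t.reshape(3, 1)                         # points in the new camera frame
    proj = K @ pts_new
    uv_new = (proj[:2] / np.clip(proj[2], 1e-6, None)).T        # pixel coords in the new view
    return uv_new.reshape(h, w, 2), pts_new.T.reshape(h, w, 3)

K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])     # assumed pinhole intrinsics
depth = np.full((240, 320), 2.0)                                # toy 2 m planar scene
theta = np.deg2rad(5)                                           # small yaw of the new viewpoint
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.05, 0.0, 0.0])
uv_new, pts_new = reproject_rgbd(depth, K, R, t)                # colors splatted at uv_new, holes
                                                                # are then filled by the GAN stage
```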

Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

  • paper_url: http://arxiv.org/abs/2311.01064
  • repo_url: None
  • paper_authors: Zalan Fabian, Zhongqi Miao, Chunyuan Li, Yuanhan Zhang, Ziwei Liu, Andrés Hernández, Andrés Montes-Rojas, Rafael Escucha, Laura Siabatto, Andrés Link, Pablo Arbeláez, Rahul Dodhia, Juan Lavista Ferres
  • for: Developing large-scale wildlife monitoring from camera trap imagery with markedly less human labeling effort.
  • methods: WildMatch, a zero-shot species classification framework that instruction-tunes vision-language models to generate expert-style visual descriptions of camera trap images and matches the generated captions against an external knowledge base of species descriptions; a novel knowledge augmentation technique enhances caption quality.
  • results: The approach is demonstrated on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
    Abstract Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife is crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery, however training such techniques requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential in developing large-scale wildlife tracking solutions with markedly less human labor. In this work we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction tune vision-language models to generate detailed visual descriptions of camera trap images using similar terminology to experts. Then, we match the generated caption to an external knowledge base of descriptions in order to determine the species in a zero-shot manner. We investigate techniques to build instruction tuning datasets for detailed animal description generation and propose a novel knowledge augmentation technique to enhance caption quality. We demonstrate the performance of WildMatch on a new camera trap dataset collected in the Magdalena Medio region of Colombia.
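The zero-shot matching step can be illustrated with any text-similarity measure; the sketch below uses TF-IDF cosine similarity purely as a stand-in, with a hypothetical three-entry knowledge base and caption. The paper's actual matching procedure and knowledge base are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical expert-style descriptions forming an external knowledge base.
knowledge_base = {
    "ocelot": "medium-sized spotted cat, dark rosettes on tawny fur, long tail",
    "collared peccary": "pig-like body, coarse grey-black fur, pale collar across shoulders",
    "lowland tapir": "large stocky herbivore, dark brown coat, short flexible proboscis",
}

# Caption produced by the instruction-tuned vision-language model for one image (made up).
generated_caption = "a medium sized cat with tawny fur covered in dark rosettes walking at night"

vectorizer = TfidfVectorizer().fit(list(knowledge_base.values()) + [generated_caption])
kb_vecs = vectorizer.transform(knowledge_base.values())
cap_vec = vectorizer.transform([generated_caption])
scores = cosine_similarity(cap_vec, kb_vecs).ravel()
predicted_species = list(knowledge_base)[scores.argmax()]   # zero-shot species match
```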

Learning to Adapt CLIP for Few-Shot Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2311.01034
  • repo_url: None
  • paper_authors: Xueting Hu, Ce Zhang, Yi Zhang, Bowen Hai, Ke Yu, Zhihai He
  • for: Proposing a few-shot approach to monocular depth estimation that improves the generalization of vision-language models such as CLIP.
  • methods: CLIP-based depth estimation in which depth bins are assigned per scene and selected by the model at inference, combined with learnable prompts that convert human-readable text into model-friendly vectors; only one image per scene is used for training.
  • results: Extensive experiments on NYU V2 and KITTI show that the method outperforms the previous state of the art by up to 10.6% in terms of MARE.
    Abstract Pre-trained Vision-Language Models (VLMs), such as CLIP, have shown enhanced performance across a range of tasks that involve the integration of visual and linguistic modalities. When CLIP is used for depth estimation tasks, the patches, divided from the input images, can be combined with a series of semantic descriptions of the depth information to obtain similarity results. The coarse estimation of depth is then achieved by weighting and summing the depth values, called depth bins, corresponding to the predefined semantic descriptions. The zero-shot approach circumvents the computational and time-intensive nature of traditional fully-supervised depth estimation methods. However, this method, utilizing fixed depth bins, may not effectively generalize as images from different scenes may exhibit distinct depth distributions. To address this challenge, we propose a few-shot-based method which learns to adapt the VLMs for monocular depth estimation to balance training costs and generalization capabilities. Specifically, it assigns different depth bins for different scenes, which can be selected by the model during inference. Additionally, we incorporate learnable prompts to preprocess the input text to convert the easily human-understood text into easily model-understood vectors and further enhance the performance. With only one image per scene for training, our extensive experiment results on the NYU V2 and KITTI dataset demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6\% in terms of MARE.
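The bin-weighting step behind this family of methods is easy to state: patch-text similarities are softmaxed and used to weight predefined depth values. The sketch below uses random placeholder embeddings in place of the frozen CLIP encoders, and the bin values, temperature, and patch grid are assumptions; the per-scene bin selection and learnable prompts are not shown.

```python
import torch
import torch.nn.functional as F

def depth_from_bins(patch_embeds, text_embeds, bin_values, temperature=0.01):
    """CLIP-style coarse depth: the similarity between each image patch embedding and
    the embeddings of depth-describing prompts is softmaxed and used to weight the
    predefined depth-bin values."""
    sim = F.normalize(patch_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
    weights = F.softmax(sim / temperature, dim=-1)        # [num_patches, num_bins]
    return weights @ bin_values                           # [num_patches]

# Placeholder embeddings (512-D, as in CLIP); in practice they come from the frozen
# CLIP image and text encoders, with learnable prompts prepended to the text.
patch_embeds = torch.randn(196, 512)                      # 14x14 image patches
text_embeds = torch.randn(7, 512)                         # e.g. "giant", ..., "close" prompts
bin_values = torch.tensor([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])   # assumed bin depths (metres)
depth = depth_from_bins(patch_embeds, text_embeds, bin_values).reshape(14, 14)
```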

Nonnegative/Binary Matrix Factorization for Image Classification using Quantum Annealing

  • paper_url: http://arxiv.org/abs/2311.01028
  • repo_url: None
  • paper_authors: Hinako Asaoka, Kazue Kudo
  • for: Solving image classification with matrix factorization implemented on a quantum annealer.
  • methods: Nonnegative/binary matrix factorization (NBMF), originally introduced as a generative model, extended to a multiclass classification model; features of handwritten digit images are extracted with NBMF and used for classification, with the binary subproblem solved by quantum annealing.
  • results: When the amount of data, features, and epochs is small, models trained with NBMF are more accurate than classical machine-learning methods such as neural networks, and training with a quantum annealing solver significantly reduces computation time.
    Abstract Classical computing has borne witness to the development of machine learning. The integration of quantum technology into this mix will lead to unimaginable benefits and be regarded as a giant leap forward in mankind's ability to compute. Demonstrating the benefits of this integration now becomes essential. With the advance of quantum computing, several machine-learning techniques have been proposed that use quantum annealing. In this study, we implement a matrix factorization method using quantum annealing for image classification and compare the performance with traditional machine-learning methods. Nonnegative/binary matrix factorization (NBMF) was originally introduced as a generative model, and we propose a multiclass classification model as an application. We extract the features of handwritten digit images using NBMF and apply them to solve the classification problem. Our findings show that when the amount of data, features, and epochs is small, the accuracy of models trained by NBMF is superior to classical machine-learning methods, such as neural networks. Moreover, we found that training models using a quantum annealing solver significantly reduces computation time. Under certain conditions, there is a benefit to using quantum annealing technology with machine learning.
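NBMF alternates between a nonnegative update of one factor and a binary update of the other; in the paper the binary subproblem is posed as a QUBO and handed to the quantum annealer. The sketch below enumerates all 2^k binary codes per column as a classical stand-in, which is only feasible for small rank k; the rank, iteration count, and update rules are assumptions.

```python
import itertools
import numpy as np

def nbmf(V, k=6, iters=10, seed=0):
    """Nonnegative/binary matrix factorization V ~ W @ H with W >= 0 and H in {0, 1}.
    The binary step enumerates all 2^k codes per column -- a classical stand-in for
    the QUBO that the paper hands to a quantum annealer."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    H = rng.integers(0, 2, size=(k, m))
    codes = np.array(list(itertools.product([0, 1], repeat=k))).T     # [k, 2^k]
    for _ in range(iters):
        # Nonnegative update of W: least squares followed by projection onto W >= 0.
        W = np.clip(V @ H.T @ np.linalg.pinv(H @ H.T), 0.0, None)
        # Binary update of H: pick the best binary code for each column of V.
        errs = ((W @ codes)[:, :, None] - V[:, None, :]) ** 2          # [n, 2^k, m]
        H = codes[:, errs.sum(axis=0).argmin(axis=0)]                  # [k, m]
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(32, 20)))             # toy nonnegative data
W, H = nbmf(V)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))                    # relative reconstruction error
```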

Incorporating Language-Driven Appearance Knowledge Units with Visual Cues in Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2311.01025
  • repo_url: None
  • paper_authors: Sungjune Park, Hyunjun Kim, Yong Man Ro
  • for: Leveraging the contextual and semantic appearance knowledge of large language models (LLMs) to improve pedestrian detection.
  • methods: A description corpus of narratives covering various pedestrian appearances is fed through an LLM to extract appearance knowledge sets; a task-prompting process then derives appearance knowledge units relevant to the pedestrian detection task, which are integrated with visual cues in the detector.
  • results: Comprehensive experiments with various pedestrian detectors show noticeable performance gains and state-of-the-art detection performance.
    Abstract Large language models (LLMs) have shown their capability in understanding contextual and semantic information regarding appearance knowledge of instances. In this paper, we introduce a novel approach to utilize the strength of an LLM in understanding contextual appearance variations and to leverage its knowledge into a vision model (here, pedestrian detection). While pedestrian detection is considered one of crucial tasks directly related with our safety (e.g., intelligent driving system), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-driven appearance knowledge units and incorporate them with visual cues in pedestrian detection. To this end, we establish description corpus which includes numerous narratives describing various appearances of pedestrians and others. By feeding them through an LLM, we extract appearance knowledge sets that contain the representations of appearance variations. After that, we perform a task-prompting process to obtain appearance knowledge units which are representative appearance knowledge guided to be relevant to a downstream pedestrian detection task. Finally, we provide plentiful appearance information by integrating the language-driven knowledge units with visual cues. Through comprehensive experiments with various pedestrian detectors, we verify the effectiveness of our method showing noticeable performance gains and achieving state-of-the-art detection performance.

Expanding Expressiveness of Diffusion Models with Limited Data via Self-Distillation based Fine-Tuning

  • paper_url: http://arxiv.org/abs/2311.01018
  • repo_url: None
  • paper_authors: Jiwan Hur, Jaehyun Choi, Gyojin Han, Dong-Jae Lee, Junmo Kim
  • for: Improving the expressiveness and generation capability of diffusion models trained on limited datasets, which otherwise yield unsatisfactory results in downstream tasks such as domain translation and text-guided image manipulation.
  • methods: Self-Distillation for Fine-Tuning diffusion models (SDFT), which distills general features (shape, colors) rather than domain-specific ones (texture, fine details) from a source model pretrained on a large dataset, enabling knowledge transfer without disturbing training on the target dataset; the method is architecture-agnostic.
  • results: Experiments show that SDFT enhances the expressiveness of diffusion models trained on limited data, improving generation capabilities across various downstream tasks.
    Abstract Training diffusion models on limited datasets poses challenges in terms of limited generation capacity and expressiveness, leading to unsatisfactory results in various downstream tasks utilizing pretrained diffusion models, such as domain translation and text-guided image manipulation. In this paper, we propose Self-Distillation for Fine-Tuning diffusion models (SDFT), a methodology to address these challenges by leveraging diverse features from diffusion models pretrained on large source datasets. SDFT distills more general features (shape, colors, etc.) and less domain-specific features (texture, fine details, etc) from the source model, allowing successful knowledge transfer without disturbing the training process on target datasets. The proposed method is not constrained by the specific architecture of the model and thus can be generally adopted to existing frameworks. Experimental results demonstrate that SDFT enhances the expressiveness of the diffusion model with limited datasets, resulting in improved generation capabilities across various downstream tasks.

Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning

  • paper_url: http://arxiv.org/abs/2311.01016
  • repo_url: None
  • paper_authors: Yiran Li, Junpeng Wang, Prince Aboagye, Michael Yeh, Yan Zheng, Liang Wang, Wei Zhang, Kwan-Liu Ma
  • for: Helping analysts efficiently explore large-scale image datasets, identify potential data biases, and evaluate and steer the caption generation of language-image models.
  • methods: A coordinated visual analytics system built on pre-trained large-scale language-image models: automatically generated captions are examined visually to reveal the semantic structure of the visual contents and potential dataset biases, and an interactive interface depicting the association between visual contents and captions is used to steer caption generation.
  • results: Case studies with domain practitioners on large-scale image datasets validate the effectiveness of the system, showing that it surfaces entrenched data biases and exposes weaknesses in the captioning capability of pre-trained language-image models.
    Abstract Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension, offering a significant leap forward. These breakthroughs have proven particularly instrumental in addressing long-standing challenges that were previously daunting. Leveraging these innovative techniques, this paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process. On the one hand, by visually examining the captions automatically generated from language-image models for an image dataset, we gain deeper insights into the semantic underpinnings of the visual contents, unearthing data biases that may be entrenched within the dataset. On the other hand, by depicting the association between visual contents and textual captions, we expose the weaknesses of pre-trained language-image models in their captioning capability and propose an interactive interface to steer caption generation. The two parts have been coalesced into a coordinated visual analytics system, fostering mutual enrichment of visual and textual elements. We validate the effectiveness of the system with domain practitioners through concrete case studies with large-scale image datasets.

Act As You Wish: Fine-Grained Control of Motion Diffusion Model with Hierarchical Semantic Graphs

  • paper_url: http://arxiv.org/abs/2311.01015
  • repo_url: https://github.com/jpthu17/graphmotion
  • paper_authors: Peng Jin, Yang Wu, Yanbo Fan, Zhongqian Sun, Yang Wei, Li Yuan
  • for: fine-grained control over human motion generation
  • methods: hierarchical semantic graphs, text-to-motion diffusion process
  • results: superior performance on two benchmark datasets, ability to continuously refine generated motion
    Abstract Most text-driven human motion generation methods employ sequential modeling approaches, e.g., transformer, to extract sentence-level text representations automatically and implicitly for human motion synthesis. However, these compact text representations may overemphasize the action names at the expense of other important properties and lack fine-grained details to guide the synthesis of subtly distinct motion. In this paper, we propose hierarchical semantic graphs for fine-grained control over motion generation. Specifically, we disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics. Such global-to-local structures facilitate a comprehensive understanding of motion description and fine-grained control of motion generation. Correspondingly, to leverage the coarse-to-fine topology of hierarchical semantic graphs, we decompose the text-to-motion diffusion process into three semantic levels, which correspond to capturing the overall motion, local actions, and action specifics. Extensive experiments on two benchmark human motion datasets, including HumanML3D and KIT, with superior performances, justify the efficacy of our method. More encouragingly, by modifying the edge weights of hierarchical semantic graphs, our method can continuously refine the generated motion, which may have a far-reaching impact on the community. Code and pre-training weights are available at https://github.com/jpthu17/GraphMotion.
    摘要 大多数文本驱动人体动作生成方法采用顺序模型(如 transformer)自动且隐式地提取句子级文本表达,用于人体动作合成。然而,这些紧凑的文本表达可能会过分强调动作名称,而忽略其他重要属性,并且缺乏细粒度的细节来指导细微差异动作的合成。在这篇论文中,我们提议使用层次语义图来实现对动作生成的细粒度控制。具体来说,我们将动作描述分解成包含整体运动、动作和动作细节三个层级的层次语义图。这种从全局到局部的结构可以帮助我们更好地理解动作描述,并为动作生成提供细粒度控制。与此同时,为了利用层次语义图由粗到细的拓扑结构,我们将文本到动作的扩散过程分解成三个语义层次,分别捕捉整体动作、局部动作和动作细节。实验表明,我们的方法在 HumanML3D 和 KIT 两个人体动作数据集上表现出色;并且通过修改层次语义图的边权重,可以不断细化生成的动作,这可能会对社区产生深远的影响。代码和预训练权重可以在 https://github.com/jpthu17/GraphMotion 上获取。

Exploring Unified Perspective For Fast Shapley Value Estimation

  • paper_url: http://arxiv.org/abs/2311.01010
  • repo_url: https://github.com/user-tian/simshap
  • paper_authors: Borui Zhang, Baotong Tian, Wenzhao Zheng, Jie Zhou, Jiwen Lu
  • for: Addressing the black-box problem of deep neural network models, with Shapley values as a trustworthy, axiomatically grounded tool.
  • methods: Analyzes existing acceleration approaches such as ApproSemivalue, KernelSHAP, and FastSHAP under a unified view of stochastic estimators as importance sampling over feature subsets.
  • results: Proposes a simple and efficient amortized estimator, SimSHAP, and validates its effectiveness through extensive experiments on tabular and image data.
    Abstract Shapley values have emerged as a widely accepted and trustworthy tool, grounded in theoretical axioms, for addressing challenges posed by black-box models like deep neural networks. However, computing Shapley values encounters exponential complexity in the number of features. Various approaches, including ApproSemivalue, KernelSHAP, and FastSHAP, have been explored to expedite the computation. We analyze the consistency of existing works and conclude that stochastic estimators can be unified as the linear transformation of importance sampling of feature subsets. Based on this, we investigate the possibility of designing simple amortized estimators and propose a straightforward and efficient one, SimSHAP, by eliminating redundant techniques. Extensive experiments conducted on tabular and image datasets validate the effectiveness of our SimSHAP, which significantly accelerates the computation of accurate Shapley values.
    摘要 Shapley值已成为一种被广泛接受且可信赖的工具,它以理论公理为基础,用于应对深度神经网络等黑盒模型带来的挑战。然而,Shapley值的计算复杂度随特征数量呈指数增长。为加速计算,人们探索了多种方法,包括 ApproSemivalue、KernelSHAP 和 FastSHAP。我们分析了现有工作的一致性,并得出结论:随机估计器可以统一为对特征子集进行重要性采样的线性变换。在此基础上,我们研究了设计简单的摊销估计器的可能性,并通过去除冗余技术提出了一种简单高效的估计器 SimSHAP。在表格和图像数据集上进行的大量实验验证了 SimSHAP 的有效性,它显著加速了精确 Shapley 值的计算。
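To make the exponential-cost problem and the sampling-based remedy concrete, here is a minimal permutation-sampling Monte Carlo estimator of Shapley values on a toy linear model. This is the classical stochastic estimator that the paper unifies (as a linear transformation of importance sampling over feature subsets), not the SimSHAP estimator itself; the value function, baseline, and sample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def value(model, x, baseline, subset):
    """Evaluate the model with features in `subset` taken from x and the rest from a baseline."""
    z = baseline.copy()
    z[list(subset)] = x[list(subset)]
    return model(z)

def shapley_permutation(model, x, baseline, n_perm=2000):
    """Monte Carlo Shapley estimate via random feature permutations."""
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perm):
        perm = rng.permutation(d)
        prev = value(model, x, baseline, [])
        chosen = []
        for j in perm:
            chosen.append(j)
            cur = value(model, x, baseline, chosen)
            phi[j] += cur - prev          # marginal contribution of feature j in this ordering
            prev = cur
    return phi / n_perm

# Toy linear model: exact Shapley values are w * (x - baseline).
w = np.array([1.0, -2.0, 0.5])
model = lambda z: float(w @ z)
x, baseline = np.array([1.0, 1.0, 1.0]), np.zeros(3)
print(shapley_permutation(model, x, baseline))   # ~ [1.0, -2.0, 0.5]
```

For a linear model the estimate matches the closed-form Shapley values w_i (x_i - b_i), which makes this a convenient sanity check before moving to real models.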

VCISR: Blind Single Image Super-Resolution with Video Compression Synthetic Data

  • paper_url: http://arxiv.org/abs/2311.00996
  • repo_url: https://github.com/kiteretsu77/vcisr-official
  • paper_authors: Boyang Wang, Bowen Liu, Shiyu Liu, Fengyu Yang
  • for: Blind single image super-resolution (SISR) when the input is a single video frame affected by video compression.
  • methods: A video compression-based degradation model for synthesizing low-resolution training data, widely applicable to existing image datasets while retaining training efficiency.
  • results: State-of-the-art no-reference image quality assessment scores and better visual quality across multiple datasets; the SISR network trained with this degradation model matches or exceeds architectures designed specifically for VSR, showing the strategy generalizes to more complex compression artifacts.
    Abstract In the blind single image super-resolution (SISR) task, existing works have been successful in restoring image-level unknown degradations. However, when a single video frame becomes the input, these works usually fail to address degradations caused by video compression, such as mosquito noise, ringing, blockiness, and staircase noise. In this work, we, for the first time, present a video compression-based degradation model to synthesize low-resolution image data in the blind SISR task. Our proposed image synthesizing method is widely applicable to existing image datasets, so that a single degraded image can contain distortions caused by lossy video compression algorithms. This overcomes the lack of feature diversity in video data and thus retains the training efficiency. By introducing video coding artifacts to SISR degradation models, neural networks can super-resolve images with the ability to restore video compression degradations, and achieve better results on restoring generic distortions caused by image compression as well. Our proposed approach achieves superior performance in SOTA no-reference Image Quality Assessment, and shows better visual quality on various datasets. In addition, we evaluate the SISR neural network trained with our degradation model on video super-resolution (VSR) datasets. Compared to architectures specifically designed for the VSR purpose, our method exhibits similar or better performance, evidencing that the presented strategy of infusing video-based degradation is generalizable to address more complicated compression artifacts even without temporal cues.
    摘要 在盲单图像超分辨率(SISR)任务中,现有工作已能成功恢复图像级的未知退化。然而,当输入为单个视频帧时,这些工作通常无法处理由视频压缩引起的退化,如蚊子噪声、振铃效应、块效应和阶梯噪声。在本工作中,我们首次提出了一种基于视频压缩的退化模型,用于在盲SISR任务中合成低分辨率图像数据。我们提出的图像合成方法可以广泛应用于现有的图像数据集,使得单张退化图像中可以包含有损视频压缩算法引起的失真。这克服了视频数据特征多样性不足的问题,从而保持了训练效率。通过将视频编码伪影引入SISR退化模型,神经网络在超分辨率重建时能够恢复视频压缩造成的退化,并且在恢复图像压缩造成的一般失真上也取得更好的结果。我们的方法在最先进的无参考图像质量评估中表现优异,并在多个数据集上展现出更好的视觉质量。此外,我们还在视频超分辨率(VSR)数据集上评估了使用该退化模型训练的SISR神经网络:与专为VSR任务设计的架构相比,我们的方法表现相当甚至更好,证明了所提出的引入视频退化的策略即使没有时间线索,也能推广到更复杂的压缩伪影。
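As a loose, much-simplified stand-in for the paper's degradation model, the sketch below downsamples an image and runs it through repeated lossy JPEG re-encoding at random quality factors, which introduces blocking/ringing artifacts of a broadly similar flavor. The actual method synthesizes artifacts with lossy video codecs, and the scale factor, quality range, and number of passes here are arbitrary assumptions.

```python
import io
import random
from PIL import Image

def degrade(img: Image.Image, scale: int = 4, passes: int = 2) -> Image.Image:
    """Very rough stand-in for a compression-based degradation model:
    bicubic downsampling followed by repeated lossy (JPEG) re-encoding."""
    w, h = img.size
    lr = img.resize((w // scale, h // scale), Image.BICUBIC)
    for _ in range(passes):
        q = random.randint(20, 50)                 # random quality -> varying artifact strength
        buf = io.BytesIO()
        lr.save(buf, format="JPEG", quality=q)
        buf.seek(0)
        lr = Image.open(buf).convert("RGB")
    return lr

if __name__ == "__main__":
    hr = Image.new("RGB", (256, 256), (128, 64, 200))  # placeholder for a real HR training image
    lr = degrade(hr)
    print(lr.size)                                      # (64, 64) degraded LR input for SISR training
```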

A Chronological Survey of Theoretical Advancements in Generative Adversarial Networks for Computer Vision

  • paper_url: http://arxiv.org/abs/2311.00995
  • repo_url: None
  • paper_authors: Hrishikesh Sharma
  • for: This paper aims to provide a chronological overview of the development of Generative Adversarial Networks (GANs) in the research field of computer vision, highlighting the key challenges and solutions in the evolution of GAN models.
  • methods: The paper uses a chronological approach to present the landmark research works on GANs, focusing on the theoretical advancements and applications of GANs in computer vision.
  • results: The paper highlights the significant improvements in the training of GAN models over time, and the various applications of GANs in computer vision tasks such as image generation, image-to-image translation, and image synthesis.
    Abstract Generative Adversarial Networks (GANs) have been the workhorse generative models for many years now, especially in the research field of computer vision. Accordingly, there have been many significant advancements in the theory and application of GAN models, which are notoriously hard to train, but produce good results if trained well. There have been many surveys on GANs, organizing the vast GAN literature from various focuses and perspectives. However, none of the surveys brings out the important chronological aspect: how the multiple challenges of employing GAN models were solved one by one over time, across multiple landmark research works. This survey intends to bridge that gap and present some of the landmark research works on the theory and application of GANs, in chronological order.
    摘要 生成对抗网络(GAN)多年来一直是计算机视觉领域的主力生成模型,相应地,GAN模型的理论和应用方面取得了许多重要进展。GAN以难以训练著称,但训练得当便能产生良好的效果。尽管已有许多关于GAN的综述文章,从不同的侧重点和视角整理了庞大的GAN文献,但没有一篇突出了重要的时间维度:应用GAN模型所面临的多个挑战是如何随着时间推移、在多个里程碑式研究工作中被逐一解决的。本综述旨在弥补这一空白,并按时间顺序介绍GAN理论与应用方面的一些里程碑式研究工作。

LaughTalk: Expressive 3D Talking Head Generation with Laughter

  • paper_url: http://arxiv.org/abs/2311.00994
  • repo_url: None
  • paper_authors: Kim Sung-Bin, Lee Hyun, Da Hye Hong, Suekyeong Nam, Janghoon Ju, Tae-Hyun Oh
  • for: Generating 3D talking heads capable of both articulate speech and authentic laughter.
  • methods: A new dataset of 2D laughter videos with pseudo-annotated, human-validated 3D FLAME parameters and vertices, plus a two-stage training scheme that first learns to talk and then learns to express laughter.
  • results: Outperforms existing methods in both talking-head generation and laughter expression, and can be used to rig realistic avatars.
    Abstract Laughter is a unique expression, essential to affirmative social interactions of humans. Although current 3D talking head generation methods produce convincing verbal articulations, they often fail to capture the vitality and subtleties of laughter and smiles despite their importance in social context. In this paper, we introduce a novel task to generate 3D talking heads capable of both articulate speech and authentic laughter. Our newly curated dataset comprises 2D laughing videos paired with pseudo-annotated and human-validated 3D FLAME parameters and vertices. Given our proposed dataset, we present a strong baseline with a two-stage training scheme: the model first learns to talk and then acquires the ability to express laughter. Extensive experiments demonstrate that our method performs favorably compared to existing approaches in both talking head generation and expressing laughter signals. We further explore potential applications on top of our proposed method for rigging realistic avatars.
    摘要 笑声是人类积极社交互动中不可或缺的独特表达方式。尽管现有的3D讲话头生成方法能够生成令人信服的语音口型,但它们往往无法捕捉笑声和微笑的生动与细微之处,而这些在社交语境中十分重要。在这篇论文中,我们提出了一个新任务:生成既能清晰说话、又能真实表达笑声的3D讲话头。我们新构建的数据集包括2D笑声视频,以及伪标注并经人工验证的3D FLAME参数和顶点。基于该数据集,我们提出了一个两阶段训练的强基线:模型首先学习说话,然后学习表达笑声。大量实验表明,我们的方法在讲话头生成和笑声表达两方面均优于现有方法。我们还进一步探讨了基于该方法制作逼真虚拟形象等潜在应用。

IR-UWB Radar-based Situational Awareness System for Smartphone-Distracted Pedestrians

  • paper_url: http://arxiv.org/abs/2311.00991
  • repo_url: None
  • paper_authors: Jamsheed Manja Ppallan, Ruchi Pandey, Yellappa Damam, Vijay Narayan Tiwari, Karthikeyan Arunachalam, Antariksha Ray
  • for: Improving the road safety of smartphone-distracted pedestrians.
  • methods: IR-UWB radar embedded in the smartphone plus an artificial neural network for real-time obstacle detection and warning.
  • results: Up to 97% obstacle detection accuracy and 95% obstacle classification accuracy with an inference delay of 26.8 ms.
    Abstract With the widespread adoption of smartphones, ensuring pedestrian safety on roads has become a critical concern due to smartphone distraction. This paper proposes a novel and real-time assistance system called UWB-assisted Safe Walk (UASW) for obstacle detection and warns users about real-time situations. The proposed method leverages Impulse Radio Ultra-Wideband (IR-UWB) radar embedded in the smartphone, which provides excellent range resolution and high noise resilience using short pulses. We implemented UASW specifically for Android smartphones with IR-UWB connectivity. The framework uses complex Channel Impulse Response (CIR) data to integrate rule-based obstacle detection with artificial neural network (ANN) based obstacle classification. The performance of the proposed UASW system is analyzed using real-time collected data. The results show that the proposed system achieves an obstacle detection accuracy of up to 97% and obstacle classification accuracy of up to 95% with an inference delay of 26.8 ms. The results highlight the effectiveness of UASW in assisting smartphone-distracted pedestrians and improving their situational awareness.
    摘要 随着智能手机的普及,由智能手机带来的分心使得保障道路上的行人安全成为一个重要问题。本文提出了一种新的实时辅助系统,名为UWB-assisted Safe Walk(UASW),用于障碍物检测并实时警示用户周围情况。该方法利用嵌入在智能手机中的脉冲无线电超宽带(IR-UWB)雷达,其通过短脉冲提供出色的距离分辨率和较强的抗噪能力。我们专门为具备IR-UWB连接能力的Android智能手机实现了UASW。该框架使用复数信道冲激响应(CIR)数据,将基于规则的障碍物检测与基于人工神经网络(ANN)的障碍物分类相结合。我们利用实时采集的数据分析了UASW系统的性能,结果显示该系统的障碍物检测精度可达97%,障碍物分类精度可达95%,推理延迟为26.8毫秒。结果表明,UASW能有效帮助被智能手机分心的行人提高情境意识。
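A hedged sketch of the learning-based half of the pipeline: a small fully connected network that classifies obstacles from CIR magnitude bins. The number of range bins, the class count, and the layer sizes are assumptions; the paper combines such an ANN with rule-based detection on complex CIR data.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 128 range bins of CIR magnitude, 3 obstacle classes.
N_BINS, N_CLASSES = 128, 3

class ObstacleClassifier(nn.Module):
    """Small ANN over |CIR| features; the real system also applies rule-based detection first."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_BINS, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, N_CLASSES),
        )

    def forward(self, cir_mag):
        return self.net(cir_mag)

model = ObstacleClassifier()
cir = torch.randn(4, N_BINS).abs()        # stand-in for a batch of CIR magnitude frames
logits = model(cir)
pred = logits.argmax(dim=1)               # predicted obstacle class per frame
print(pred.shape)                         # torch.Size([4])
```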

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning

  • paper_url: http://arxiv.org/abs/2311.00990
  • repo_url: None
  • paper_authors: Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Wenwu Zhu
  • for: Customized multi-subject text-to-video generation, i.e., generating text-guided videos that preserve the visual features of multiple user-given subjects (the VideoDreamer framework).
  • methods: Built on pretrained Stable Diffusion with latent-code motion dynamics and temporal cross-frame attention, customized via Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning to tackle the attribute-binding problem of multi-subject generation.
  • results: Generates temporally consistent, multi-subject text-guided videos with new content such as new events and backgrounds, evaluated on the proposed MultiStudioBench benchmark.
    Abstract Customized text-to-video generation aims to generate text-guided videos with customized user-given subjects, which has gained increasing attention recently. However, existing works are primarily limited to generating videos for a single subject, leaving the more challenging problem of customized multi-subject text-to-video generation largely unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework. VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects. Specifically, VideoDreamer leverages the pretrained Stable Diffusion with latent-code motion dynamics and temporal cross-frame attention as the base video generator. The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, which can tackle the attribute binding problem of multi-subject generation. We also introduce MultiStudioBench, a benchmark for evaluating customized multi-subject text-to-video generation models. Extensive experiments demonstrate the remarkable ability of VideoDreamer to generate videos with new content such as new events and backgrounds, tailored to the customized multiple subjects. Our project page is available at https://videodreamer23.github.io/.
    摘要 自定义文本到视频生成旨在根据用户给定的主体生成文本引导的视频,近来受到越来越多的关注。然而,现有工作主要局限于为单个主体生成视频,更具挑战性的自定义多主体文本到视频生成问题在很大程度上仍未被探索。在这篇论文中,我们填补了这一空白,提出了一个名为VideoDreamer的新框架。VideoDreamer可以生成时间上一致、且忠实保留多个给定主体视觉特征的文本引导视频。具体来说,VideoDreamer以预训练的 Stable Diffusion 为基础视频生成器,结合隐码运动动态与时间跨帧注意力;并通过提出的 Disen-Mix 微调与人在回路再微调策略针对给定的多个主体进行定制,以解决多主体生成中的属性绑定问题。我们还提出了 MultiStudioBench,用于评估自定义多主体文本到视频生成模型的基准。大量实验表明,VideoDreamer 能够为自定义的多个主体生成包含新事件、新背景等新内容的视频。我们的项目页面见 https://videodreamer23.github.io/。

CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation

  • paper_url: http://arxiv.org/abs/2311.00987
  • repo_url: None
  • paper_authors: Yiming Cui, Cheng Han, Dongfang Liu
  • for: An effective framework for instance-level visual analysis on video frames that simultaneously performs object detection, instance segmentation, and multi-object tracking.
  • methods: Collaborative multi-task learning realized through a novel structure of associative connections among the detection, segmentation, and tracking task heads in an end-to-end learnable CNN.
  • results: Encouraging performance on the KITTI MOTS and MOTS Challenge datasets for multi-object tracking and segmentation.
    Abstract The advancement of computer vision has pushed visual analysis tasks from still images to the video domain. In recent years, video instance segmentation, which aims to track and segment multiple objects in video frames, has drawn much attention for its potential applications in various emerging areas such as autonomous driving, intelligent transportation, and smart retail. In this paper, we propose an effective framework for instance-level visual analysis on video frames, which can simultaneously conduct object detection, instance segmentation, and multi-object tracking. The core idea of our method is collaborative multi-task learning which is achieved by a novel structure, named associative connections among detection, segmentation, and tracking task heads in an end-to-end learnable CNN. These additional connections allow information propagation across multiple related tasks, so as to benefit these tasks simultaneously. We evaluate the proposed method extensively on KITTI MOTS and MOTS Challenge datasets and obtain quite encouraging results.
    摘要 计算机视觉的发展使得视觉分析任务从静止图像扩展到视频领域。近年来,视频实例分割(即在视频帧中跟踪并分割多个对象)因其在自动驾驶、智能交通和智能零售等新兴领域的潜在应用而备受关注。在这篇论文中,我们提出了一种高效的视频帧实例级分析框架,可以同时进行目标检测、实例分割和多目标跟踪。我们方法的核心思想是协同多任务学习,通过在端到端可学习的CNN中于检测、分割和跟踪任务头之间建立关联连接来实现。这些额外连接使相关任务之间的信息得以传递,从而同时提升这些任务的性能。我们在KITTI MOTS和MOTS Challenge数据集上对所提方法进行了广泛评估,取得了相当令人鼓舞的结果。
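A minimal PyTorch sketch of the "associative connections" idea: three task heads over a shared backbone feature map, where the detection head also consumes a projection of the segmentation features and the tracking head consumes a projection of the detection features. Channel widths, head output dimensions, and the additive fusion are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

C = 256  # assumed channel width of the shared backbone feature

class Head(nn.Module):
    def __init__(self, out_dim):
        super().__init__()
        self.proj = nn.Conv2d(C, C, 1)
        self.out = nn.Conv2d(C, out_dim, 1)

    def forward(self, feat, extra=None):
        h = self.proj(feat)
        if extra is not None:              # associative connection: fuse features from a sibling task
            h = h + extra
        return h, self.out(h)

class CMLHeads(nn.Module):
    """Detection / segmentation / tracking heads with cross-task feature exchange."""
    def __init__(self):
        super().__init__()
        self.det, self.seg, self.trk = Head(4), Head(1), Head(128)
        self.seg_to_det = nn.Conv2d(C, C, 1)
        self.det_to_trk = nn.Conv2d(C, C, 1)

    def forward(self, feat):
        h_seg, seg_out = self.seg(feat)
        h_det, det_out = self.det(feat, self.seg_to_det(h_seg))
        _, trk_out = self.trk(feat, self.det_to_trk(h_det))
        return det_out, seg_out, trk_out

x = torch.randn(1, C, 64, 64)              # stand-in for a backbone feature map
det, seg, trk = CMLHeads()(x)
print(det.shape, seg.shape, trk.shape)
```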

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

  • paper_url: http://arxiv.org/abs/2311.00986
  • repo_url: None
  • paper_authors: Hang Zhang
  • for: A multi-view 3D object detection network using camera-only data and a Bird's-Eye-View map, targeting the key challenges of domain adaptation and visual data transfer.
  • methods: A transfer-learning scheme and Transformer construction with a 3D anchor-query detection head, enabling data migration and efficient detection.
  • results: Competitive results on the new target domain using multi-dataset training, a small amount of source data, and existing large pretrained weights, with 3D information used as semantic cues fused with 2D multi-view image features in a visual-language transfer design.
    Abstract In this research, I propose a network structure for multi-view 3D object detection using camera-only data and a Bird's-Eye-View map. My work is based on the current key challenges of domain adaptation and visual data transfer. Although many excellent camera-only 3D object detection methods have been continuously proposed, much of this work risks a dramatic performance drop when the networks are trained on the source domain but tested on a different target domain. I found it very surprising that predictions of bounding boxes and classes still rely on 2D networks. Based on the domain gap assumption on various 3D datasets, I found that they still share similar data extraction given the same BEV map size and camera data transfer. Therefore, to analyze the influence of the domain gap on current methods and to make good use of 3D spatial information shared between the datasets and the real world, I propose a transfer learning method and a Transformer construction to study 3D object detection on NuScenes-mini and Lyft. Through multi-dataset training and a detection head from the Transformer, the network demonstrates good data migration and efficient detection performance by using 3D anchor queries and 3D positional information. Relying on only a small amount of source data and existing large-model pre-training weights, the efficient network manages to achieve competitive results on the new target domain. Moreover, my study blends 3D information, used as available semantic information, and 2D multi-view image features into the visual-language transfer design. In the final 3D anchor box prediction and object classification, my network achieves good results on standard metrics of 3D object detection without any fine-tuning, unlike dataset-specific models tied to each training domain.
    摘要 在这项研究中,我提出了一种仅使用摄像头数据和鸟瞰图(BEV)的多视图3D物体检测网络结构。我的工作针对当前的关键挑战:领域自适应与视觉数据迁移。尽管不断有出色的纯摄像头3D物体检测方法被提出,但许多工作在源域上训练、在不同目标域上测试时会出现显著的性能下降。令人意外的是,边界框和类别的预测仍然依赖于2D网络。基于对各个3D数据集领域差异的分析,我发现它们在相同的BEV图尺寸和摄像头数据迁移方式下仍共享相似的数据提取方式。因此,为了分析领域差异对现有方法的影响,并充分利用数据集与真实世界之间共享的3D空间信息,我提出了一种迁移学习方法和Transformer结构,在NuScenes-mini和Lyft上研究3D物体检测。通过多数据集训练和Transformer检测头,网络借助3D锚点查询和3D位置信息,展现出良好的数据迁移能力和高效的检测性能。仅依靠少量源数据和现有大模型的预训练权重,该高效网络就能在新的目标域上取得有竞争力的结果。此外,我的研究将3D信息作为可用的语义信息,与2D多视图图像特征一起融合到视觉-语言迁移设计中。在最终的3D锚框预测和目标分类上,我的网络无需任何微调就在标准3D物体检测指标上取得了良好结果,这不同于绑定在各自训练域上的专用模型。

MAAIG: Motion Analysis And Instruction Generation

  • paper_url: http://arxiv.org/abs/2311.00980
  • repo_url: None
  • paper_authors: Wei-Hsin Yeh, Pei Hsin Lin, Yu-An Su, Wen Hsiang Cheng, Lun-Wei Ku
  • for: Automated home sports-training guidance that helps users improve their skills and avoid injuries.
  • methods: The MAAIG framework generates per-frame embedding vectors from user-provided sports action videos, associates them with the 3D skeleton of each frame, and feeds them into a pretrained T5 model to generate coach-like instructions.
  • results: Identifies potential issues and provides real-time guidance akin to a professional coach, helping users improve their technique and avoid injuries.
    Abstract Many people engage in self-directed sports training at home but lack the real-time guidance of professional coaches, making them susceptible to injuries or the development of incorrect habits. In this paper, we propose a novel application framework called MAAIG(Motion Analysis And Instruction Generation). It can generate embedding vectors for each frame based on user-provided sports action videos. These embedding vectors are associated with the 3D skeleton of each frame and are further input into a pretrained T5 model. Ultimately, our model utilizes this information to generate specific sports instructions. It has the capability to identify potential issues and provide real-time guidance in a manner akin to professional coaches, helping users improve their sports skills and avoid injuries.
    摘要 很多人在家中自己进行体育训练,但是缺乏专业教练的实时指导,导致他们容易受伤或形成错误的习惯。在这篇论文中,我们提出了一种新的应用框架,即Motion Analysis And Instruction Generation(MAAIG)。它可以基于用户提供的体育动作视频生成嵌入向量,这些嵌入向量与每帧3D骨架相关,然后输入到预训练的T5模型中。最终,我们的模型可以利用这些信息生成特定的体育指导。它可以识别用户的可能问题,并在专业教练的方式下提供实时指导,帮助用户提高体育技巧,避免伤害。
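A hedged sketch of the per-frame embedding step: a small MLP maps each frame's 3D skeleton to an embedding vector; in the actual framework these embeddings condition a pretrained T5 model, which is omitted here. The joint count, embedding width, and MLP layout are assumptions.

```python
import torch
import torch.nn as nn

J, D = 17, 512   # assumed joint count and embedding width; the paper does not fix these here

class FrameEmbedder(nn.Module):
    """Maps one frame's 3D skeleton (J joints x 3 coords) to an embedding vector."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(J * 3, 256), nn.ReLU(), nn.Linear(256, D))

    def forward(self, skel):                 # skel: (T, J, 3) per-frame 3D poses
        return self.mlp(skel.flatten(1))     # (T, D) one embedding per frame

skeleton_seq = torch.randn(30, J, 3)          # stand-in for a 30-frame action clip
frame_emb = FrameEmbedder()(skeleton_seq)
print(frame_emb.shape)                        # torch.Size([30, 512]); these would condition a pretrained T5
```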

Overhead Line Defect Recognition Based on Unsupervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.00979
  • repo_url: None
  • paper_authors: Weixi Wang, Xichen Zhong, Xin Li, Sizhe Li, Xun Ma
  • for: automatic defect recognition in overhead lines
  • methods: Faster RCNN network + unsupervised semantic segmentation
  • results: improved accuracy and adaptability in identifying equipment issues
    Abstract Overhead line inspection greatly benefits from defect recognition using visible light imagery. Addressing the limitations of existing feature extraction techniques and the heavy data dependency of deep learning approaches, this paper introduces a novel defect recognition framework. This is built on the Faster RCNN network and complemented by unsupervised semantic segmentation. The approach involves identifying the type and location of the target equipment, utilizing semantic segmentation to differentiate between the device and its backdrop, and finally employing similarity measures and logical rules to categorize the type of defect. Experimental results indicate that this methodology focuses more on the equipment rather than the defects when identifying issues in overhead lines. This leads to a notable enhancement in accuracy and exhibits impressive adaptability. Thus, offering a fresh perspective for automating the inspection of distribution network equipment.
    摘要 Overhead line inspection 受到缺陷识别使用可见光影像的巨大 beneficial 影响。现有的特征提取技术和深度学习方法存在局限性,这篇文章提出了一种新的缺陷识别框架。这基于Faster RCNN网络,并且 complemented by 无supervised semantic segmentation。该方法包括:首先,确定目标设备的类型和位置;其次,使用semantic segmentation来分 differentiate 设备和背景;最后,使用相似度度量和逻辑规则来分类缺陷类型。实验结果表明,该方法更强调设备而不是缺陷,从而提高了准确性。此外,它具有出色的适应性。因此,这种方法可以提供一种新的自动化分配网络设备检查的新 perspectives。

Lightweight super resolution network for point cloud geometry compression

  • paper_url: http://arxiv.org/abs/2311.00970
  • repo_url: https://github.com/lidq92/lsrn-pcgc
  • paper_authors: Wei Zhang, Dingquan Li, Ge Li, Wen Gao
  • for: Point cloud geometry compression built on a lightweight super-resolution network.
  • methods: Decomposes a point cloud into a base point cloud and the interpolation patterns needed to reconstruct the original; the base cloud is compressed with any lossless codec, while a lightweight super-resolution network learns the interpolation patterns through overfitting and its parameters are transmitted to the decoder.
  • results: Obtains more accurate interpolation patterns than lookup-table-based methods by accessing a broader range of neighboring voxels at an acceptable computational cost, achieving notable compression performance on the MPEG Cat1 (Solid) and Cat2 datasets.
    Abstract This paper presents an approach for compressing point cloud geometry by leveraging a lightweight super-resolution network. The proposed method involves decomposing a point cloud into a base point cloud and the interpolation patterns for reconstructing the original point cloud. While the base point cloud can be efficiently compressed using any lossless codec, such as Geometry-based Point Cloud Compression, a distinct strategy is employed for handling the interpolation patterns. Rather than directly compressing the interpolation patterns, a lightweight super-resolution network is utilized to learn this information through overfitting. Subsequently, the network parameter is transmitted to assist in point cloud reconstruction at the decoder side. Notably, our approach differentiates itself from lookup table-based methods, allowing us to obtain more accurate interpolation patterns by accessing a broader range of neighboring voxels at an acceptable computational cost. Experiments on MPEG Cat1 (Solid) and Cat2 datasets demonstrate the remarkable compression performance achieved by our method.
    摘要 这篇论文提出了一种利用轻量级超分辨率网络进行点云几何压缩的方法。该方法将点云分解为基础点云以及用于重建原始点云的插值模式。基础点云可以使用任何无损编码器(如基于几何的点云压缩)高效压缩;而对插值模式,则不直接压缩,而是利用一个轻量级超分辨率网络通过过拟合来学习这部分信息,随后将网络参数传输到解码端以辅助点云重建。值得注意的是,与基于查找表的方法不同,我们的方法能够在可接受的计算开销下访问更广泛的邻近体素,从而获得更准确的插值模式。在 MPEG Cat1 (Solid) 和 Cat2 数据集上的实验表明,我们的方法实现了出色的压缩性能。
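A toy sketch of the base-cloud side of the decomposition: voxel downsampling keeps one representative point per occupied voxel, producing the coarse cloud that would be compressed losslessly; the lightweight super-resolution network that overfits the interpolation patterns is not reproduced here, and the voxel size is an arbitrary assumption.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel: float) -> np.ndarray:
    """Build a coarse 'base' point cloud by keeping one representative point per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

rng = np.random.default_rng(0)
dense = rng.uniform(0, 1, size=(20000, 3))        # stand-in for an input point cloud
base = voxel_downsample(dense, voxel=0.05)        # the paper compresses this losslessly, e.g. with G-PCC
print(dense.shape, base.shape)
# A lightweight SR network would then be overfitted to predict the finer structure around each
# base point, and its parameters transmitted alongside the base cloud to the decoder.
```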

Detecting Generated Images by Real Images Only

  • paper_url: http://arxiv.org/abs/2311.00962
  • repo_url: https://github.com/molyswu/hand_detection
  • paper_authors: Xiuli Bi, Bo Liu, Fan Yang, Bin Xiao, Weisheng Li, Gao Huang, Pamela C. Cosman
  • for: Detecting generated images by starting from real images: finding what real images have in common, mapping them to a dense subspace in feature space, and treating anything projected outside that subspace as generated.
  • methods: A detector trained only on real images, requiring no generated samples and 99.9% less training data than other deep-learning-based methods, while remaining efficient at inference.
  • results: Competes with state-of-the-art detectors, generalizes to emerging generative models, and is robust to various post-processing, making it practical for real-world use.
    Abstract As deep learning technology continues to evolve, the images yielded by generative models are becoming more and more realistic, triggering people to question the authenticity of images. Existing generated image detection methods detect visual artifacts in generated images or learn discriminative features from both real and generated images by massive training. This learning paradigm will result in efficiency and generalization issues, making detection methods always lag behind generation methods. This paper approaches the generated image detection problem from a new perspective: Start from real images. By finding the commonality of real images and mapping them to a dense subspace in feature space, the goal is that generated images, regardless of their generative model, are then projected outside the subspace. As a result, images from different generative models can be detected, solving some long-existing problems in the field. Experimental results show that although our method was trained only by real images and uses 99.9\% less training data than other deep learning-based methods, it can compete with state-of-the-art methods and shows excellent performance in detecting emerging generative models with high inference efficiency. Moreover, the proposed method shows robustness against various post-processing. These advantages allow the method to be used in real-world scenarios.
    摘要 随着深度学习技术的不断发展,生成模型产生的图像越来越逼真,使人们开始质疑图像的真实性。现有的生成图像检测方法或是检测生成图像中的视觉伪影,或是通过大规模训练从真实与生成图像中学习判别特征。这种学习范式会带来效率和泛化问题,使检测方法总是落后于生成方法。本文从一个新的角度看待生成图像检测问题:从真实图像出发。通过寻找真实图像的共性,并将它们映射到特征空间中一个稠密的子空间,目标是让生成图像无论来自何种生成模型,都被投影到该子空间之外。因此,不同生成模型产生的图像都可以被检测出来,从而解决了该领域一些长期存在的问题。实验结果表明,尽管我们的方法仅使用真实图像进行训练,且所用训练数据比其他基于深度学习的方法少99.9%,它仍可与最先进的方法相竞争,在检测新兴生成模型方面表现出色,并具有较高的推理效率。此外,所提方法对各种后处理表现出良好的鲁棒性。这些优点使该方法能够应用于真实场景。
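A deliberately simplified stand-in for the "start from real images" idea: fit a Gaussian to features extracted from real images only and flag test samples whose Mahalanobis distance to that fitted distribution exceeds a threshold calibrated on real data. The random features, Gaussian assumption, and 99th-percentile threshold are placeholders; the paper learns the mapping to a dense subspace rather than assuming a parametric form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features; in practice these would come from a frozen image encoder.
real_feats = rng.normal(0.0, 1.0, size=(500, 64))
test_feats = np.vstack([rng.normal(0.0, 1.0, size=(5, 64)),     # real-like samples
                        rng.normal(3.0, 1.0, size=(5, 64))])    # off-manifold, "generated"-like samples

mu = real_feats.mean(axis=0)
cov = np.cov(real_feats, rowvar=False) + 1e-3 * np.eye(64)      # regularized covariance of real features
cov_inv = np.linalg.inv(cov)

def mahalanobis(f):
    d = f - mu
    return float(np.sqrt(d @ cov_inv @ d))

real_scores = [mahalanobis(f) for f in real_feats]
threshold = np.quantile(real_scores, 0.99)                      # calibrated on real data only
scores = np.array([mahalanobis(f) for f in test_feats])
print((scores > threshold).astype(int))    # 1 = flagged as off the real-image subspace, i.e. generated
```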

Concatenated Masked Autoencoders as Spatial-Temporal Learner

  • paper_url: http://arxiv.org/abs/2311.00961
  • repo_url: https://github.com/minhoooo1/catmae
  • paper_authors: Zhouqiang Jiang, Bowen Wang, Tong Xiang, Zhaofeng Niu, Hong Tang, Guangshun Li, Liangzhi Li
  • for: Self-supervised video representation learning that captures continuous motion and visual correspondences between frames.
  • methods: Concatenated Masked Autoencoders (CatMAE): the first frame is kept intact while 95% of the patches in subsequent frames are masked; the encoder encodes visible patches per frame, and the decoder reconstructs each masked frame from visible patches of both previous and current frames. A Video-Reverse (ViRe) augmentation uses reversed frames as reconstruction targets.
  • results: Leading performance compared with prior state-of-the-art pre-training methods on video segmentation and action recognition tasks.
    Abstract Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for self-supervised video representation learning. For the input sequence of video frames, CatMAE keeps the initial frame unchanged while applying substantial masking (95%) to subsequent frames. The encoder in CatMAE is responsible for encoding visible patches for each frame individually; subsequently, for each masked frame, the decoder leverages visible patches from both previous and current frames to reconstruct the original image. Our proposed method enables the model to estimate the motion information between visible patches, match the correspondences between preceding and succeeding frames, and ultimately learn the evolution of scenes. Furthermore, we propose a new data augmentation strategy, Video-Reverse (ViRe), which uses reversed video frames as the model's reconstruction targets. This further encourages the model to utilize continuous motion details and correspondences to complete the reconstruction, thereby enhancing the model's capabilities. Compared to the most advanced pre-training methods, CatMAE achieves a leading level in video segmentation tasks and action recognition tasks.
    摘要 从视频中学习表示需要理解连续的运动以及帧间的视觉对应关系。在这篇论文中,我们提出了串联掩码自编码器(Concatenated Masked Autoencoders, CatMAE),作为自监督视频表示学习的空间-时间学习器。对于输入的视频帧序列,CatMAE保持初始帧不变,而对后续帧施加高比例(95%)的掩码。CatMAE的编码器分别对每帧的可见图像块进行编码;对于每个被掩码的帧,解码器则利用前一帧和当前帧的可见图像块来重建原始图像。我们提出的方法使模型能够估计可见图像块之间的运动信息,匹配前后帧之间的对应关系,并最终学习场景的演化。此外,我们还提出了一种新的数据增强策略,名为视频反向(ViRe),它使用反转的视频帧作为模型的重建目标。这进一步促使模型利用连续的运动细节和对应关系来完成重建,从而增强模型的能力。与最先进的预训练方法相比,CatMAE在视频分割任务和动作识别任务中均达到了领先水平。
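A small sketch of CatMAE's masking setup: the first frame stays fully visible while roughly 95% of the patches in every subsequent frame are masked at random. Patch count and frame count are assumptions; the encoder and decoder themselves are omitted.

```python
import torch

def catmae_masks(n_frames: int, n_patches: int, ratio: float = 0.95, device="cpu"):
    """Boolean masks per frame: True = masked patch. The first frame is left fully visible,
    later frames have `ratio` of their patches masked at random (a sketch of CatMAE's setup)."""
    masks = torch.zeros(n_frames, n_patches, dtype=torch.bool, device=device)
    n_mask = int(round(ratio * n_patches))
    for t in range(1, n_frames):
        idx = torch.randperm(n_patches, device=device)[:n_mask]
        masks[t, idx] = True
    return masks

m = catmae_masks(n_frames=4, n_patches=196)          # e.g. 14x14 patches per frame
print(m.float().mean(dim=1))                         # ~[0.00, 0.95, 0.95, 0.95]
```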

Optimal Noise pursuit for Augmenting Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.00949
  • repo_url: None
  • paper_authors: Shijie Ma, Huayi Xu, Mengjian Li, Weidong Geng, Meng Wang, Yaxiong Wang
  • for: Improving the stability and quality of text-to-video diffusion models, which are sensitive to the noise fed at inference time.
  • methods: Approximates the optimal noise for a text prompt via a search-and-inversion pipeline (retrieve a related video from a candidate pool, then invert it into the noise space), plus a semantic-preserving rewriter that enriches the prompt with reference-guided rewriting and a hybrid-semantics denoising strategy.
  • results: Clear improvements over baseline text-to-video models on the WebVid-10M benchmark without introducing any optimization burden.
    Abstract Despite the remarkable progress in text-to-video generation, existing diffusion-based models often exhibit instability in terms of noise during inference. Specifically, when different noises are fed for the given text, these models produce videos that differ significantly in terms of both frame quality and temporal consistency. With this observation, we posit that there exists an optimal noise matched to each textual input; however, the widely adopted strategies of random noise sampling often fail to capture it. In this paper, we argue that the optimal noise can be approached through inverting the groundtruth video using the established noise-video mapping derived from the diffusion model. Nevertheless, the groundtruth video for the text prompt is not available during inference. To address this challenge, we propose to approximate the optimal noise via a search and inversion pipeline. Given a text prompt, we initially search for a video from a predefined candidate pool that closely relates to the text prompt. Subsequently, we invert the searched video into the noise space, which serves as an improved noise prompt for the textual input. In addition to addressing noise, we also observe that the text prompt with richer details often leads to higher-quality videos. Motivated by this, we further design a semantic-preserving rewriter to enrich the text prompt, where a reference-guided rewriting is devised for reasonable details compensation, and a denoising with a hybrid semantics strategy is proposed to preserve the semantic consistency. Extensive experiments on the WebVid-10M benchmark show that our proposed method can improve the text-to-video models with a clear margin, while introducing no optimization burden.
    摘要 尽管文本到视频生成技术已经取得了非常出色的进步,但现有的基于扩散的模型在推理过程中仍然存在噪声不稳定的问题。具体来说,对同一文本输入不同的噪声时,这些模型生成的视频在帧质量和时间一致性方面差异显著。基于这一观察,我们认为存在一个与每个文本输入匹配的最佳噪声,但通常采用的随机噪声采样策略往往无法捕捉到它。在这篇论文中,我们认为可以利用扩散模型确定的噪声-视频映射,对真实视频进行反演来逼近最佳噪声。然而,推理时无法获得与文本提示对应的真实视频。为解决这一挑战,我们提出了一种搜索加反演的流程来近似最佳噪声:给定一个文本提示,我们首先从预定义的候选池中搜索一个与文本提示高度相关的视频,然后将搜索到的视频反演到噪声空间,作为该文本输入的改进噪声提示。此外,我们还发现细节更丰富的文本提示往往会带来更高质量的视频。受此启发,我们进一步设计了一种语义保持的重写器来丰富文本提示,其中采用参考引导的重写来补充合理的细节,并提出混合语义的去噪策略以保持语义一致性。在WebVid-10M基准上的大量实验表明,我们的方法能够明显提升文本到视频模型,且不会引入额外的优化负担。
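A hedged sketch of the inversion half of the pipeline: an approximate DDIM inversion loop that maps a retrieved clean sample back to the noise that would deterministically regenerate it, which then serves as the improved noise prompt. The `eps_model` below is a dummy placeholder for the pretrained text-to-video model's noise predictor, and the schedule length and tensor shapes are arbitrary.

```python
import torch

def ddim_invert(x0, eps_model, alphas_cumprod, cond):
    """Approximate DDIM inversion: map a clean sample x0 back toward the noise that would
    deterministically regenerate it. `eps_model(x, t, cond)` stands in for the pretrained
    diffusion model's noise predictor; `alphas_cumprod` holds the (decreasing) schedule."""
    x = x0
    T = alphas_cumprod.shape[0]
    for t in range(T - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t, cond)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # model's current estimate of x0
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps   # step "forward" toward pure noise
    return x                                                      # improved noise prompt for generation

# Toy check with a dummy noise predictor (real use: the pretrained text-to-video model).
alphas = torch.linspace(0.9999, 0.02, steps=50)
dummy_eps = lambda x, t, c: torch.zeros_like(x)
noise = ddim_invert(torch.randn(1, 4, 8, 8), dummy_eps, alphas, cond=None)
print(noise.shape)
```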

SatBird: Bird Species Distribution Modeling with Remote Sensing and Citizen Science Data

  • paper_url: http://arxiv.org/abs/2311.00936
  • repo_url: https://github.com/rolnicklab/satbird
  • paper_authors: Mélisande Teng, Amna Elmustafa, Benjamin Akera, Yoshua Bengio, Hager Radi Abdelwahed, Hugo Larochelle, David Rolnick
  • for: Scaling up biodiversity monitoring and ecosystem modeling by predicting bird species encounter rates from satellite imagery.
  • methods: Combines remote sensing imagery with citizen-science observations (eBird), along with environmental covariates and species range maps for each location.
  • results: Introduces the SatBird dataset of US locations (summer and winter seasons) plus a low-data Kenya dataset, and benchmarks a set of baselines including SOTA remote-sensing models.
    Abstract Biodiversity is declining at an unprecedented rate, impacting ecosystem services necessary to ensure food, water, and human health and well-being. Understanding the distribution of species and their habitats is crucial for conservation policy planning. However, traditional methods in ecology for species distribution models (SDMs) generally focus either on narrow sets of species or narrow geographical areas and there remain significant knowledge gaps about the distribution of species. A major reason for this is the limited availability of data traditionally used, due to the prohibitive amount of effort and expertise required for traditional field monitoring. The wide availability of remote sensing data and the growing adoption of citizen science tools to collect species observations data at low cost offer an opportunity for improving biodiversity monitoring and enabling the modelling of complex ecosystems. We introduce a novel task for mapping bird species to their habitats by predicting species encounter rates from satellite images, and present SatBird, a satellite dataset of locations in the USA with labels derived from presence-absence observation data from the citizen science database eBird, considering summer (breeding) and winter seasons. We also provide a dataset in Kenya representing low-data regimes. We additionally provide environmental data and species range maps for each location. We benchmark a set of baselines on our dataset, including SOTA models for remote sensing tasks. SatBird opens up possibilities for scalably modelling properties of ecosystems worldwide.
    摘要 生物多样性正在以前所未有的速度下降,影响着保障食物、水以及人类健康与福祉所必需的生态系统服务。了解物种及其栖息地的分布对保护政策的制定至关重要。然而,生态学中传统的物种分布模型(SDM)通常只关注少数物种或狭小的地理区域,关于物种分布仍存在大量知识空白。其主要原因之一是传统野外监测所需的人力和专业知识成本高昂,导致可用数据有限。遥感数据的广泛可用性以及以低成本收集物种观测数据的公民科学工具的普及,为改进生物多样性监测、建模复杂生态系统提供了机会。我们提出了一个新任务,即通过卫星图像预测物种遇见率来将鸟类物种与其栖息地对应起来,并发布了SatBird数据集:其位置覆盖美国,标签来自公民科学数据库eBird的存在-缺失观测数据,并涵盖夏季(繁殖期)和冬季两个季节。我们还提供了一个代表低数据情形的肯尼亚数据集,以及每个位置的环境数据和物种分布范围图。我们在该数据集上对一组基线方法进行了评测,其中包括用于遥感任务的最先进模型。SatBird为在全球范围内可扩展地建模生态系统特性开启了可能性。
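A toy sketch of the task format: a small CNN maps a multi-band satellite patch to per-species encounter rates in [0, 1], trained with binary cross-entropy against eBird-derived targets. The species count, band count, and architecture are assumptions and not the paper's baselines.

```python
import torch
import torch.nn as nn

N_SPECIES = 670   # assumed; the real label dimension depends on the eBird species list used

class EncounterRateModel(nn.Module):
    """Toy CNN mapping a satellite patch to per-species encounter rates in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),   # 4 bands, e.g. RGB + NIR
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, N_SPECIES)

    def forward(self, x):
        return torch.sigmoid(self.head(self.backbone(x)))

model = EncounterRateModel()
patch = torch.randn(2, 4, 64, 64)                  # stand-in for two satellite patches
target = torch.rand(2, N_SPECIES)                  # encounter rates derived from eBird checklists
loss = nn.functional.binary_cross_entropy(model(patch), target)
loss.backward()
print(float(loss))
```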

Towards High-quality HDR Deghosting with Conditional Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.00932
  • repo_url: None
  • paper_authors: Qingsen Yan, Tao Hu, Yuan Sun, Hao Tang, Yu Zhu, Wei Dong, Luc Van Gool, Yanning Zhang
  • for: Reconstructing high dynamic range (HDR) images from multiple low dynamic range (LDR) images while suppressing the ghosting artifacts that hinder real-world use.
  • methods: A conditional diffusion model with a feature condition generator (attention plus Domain Feature Alignment layers to transform intermediate features and avoid ghosting) and a noise predictor that generates HDR images through stochastic iterative denoising; a sliding-window noise estimator samples smooth noise patch-wise to reduce semantic confusion from saturated LDR regions.
  • results: State-of-the-art performance on HDR imaging benchmarks and good generalization to real-world images.
    Abstract High Dynamic Range (HDR) images can be recovered from several Low Dynamic Range (LDR) images by existing Deep Neural Networks (DNNs) techniques. Despite the remarkable progress, DNN-based methods still generate ghosting artifacts when LDR images have saturation and large motion, which hinders potential applications in real-world scenarios. To address this challenge, we formulate the HDR deghosting problem as an image generation that leverages LDR features as the diffusion model's condition, consisting of the feature condition generator and the noise predictor. Feature condition generator employs attention and Domain Feature Alignment (DFA) layer to transform the intermediate features to avoid ghosting artifacts. With the learned features as conditions, the noise predictor leverages a stochastic iterative denoising process for diffusion models to generate an HDR image by steering the sampling process. Furthermore, to mitigate semantic confusion caused by the saturation problem of LDR images, we design a sliding window noise estimator to sample smooth noise in a patch-based manner. In addition, an image space loss is proposed to avoid the color distortion of the estimated HDR results. We empirically evaluate our model on benchmark datasets for HDR imaging. The results demonstrate that our approach achieves state-of-the-art performances and well generalization to real-world images.
    摘要 利用现有的深度神经网络(DNN)技术,可以从多张低动态范围(LDR)图像中恢复高动态范围(HDR)图像。尽管进展显著,但当LDR图像存在过饱和与大幅运动时,基于DNN的方法仍会产生鬼影伪影,阻碍了其在真实场景中的潜在应用。为应对这一挑战,我们将HDR去鬼影问题表述为一个以LDR特征作为扩散模型条件的图像生成问题,由特征条件生成器和噪声预测器两部分组成。特征条件生成器利用注意力机制和域特征对齐(DFA)层来变换中间特征,以避免鬼影伪影;以学习到的特征为条件,噪声预测器通过扩散模型的随机迭代去噪过程来引导采样,从而生成HDR图像。此外,为缓解LDR图像饱和问题带来的语义混淆,我们设计了滑动窗口噪声估计器,以基于图像块的方式采样平滑噪声;同时提出了一种图像空间损失,以避免估计得到的HDR结果出现颜色失真。我们在HDR成像基准数据集上对模型进行了实验评估,结果表明我们的方法达到了最先进的性能,并能很好地泛化到真实图像。
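A hedged sketch of the sliding-window idea: sample Gaussian noise per overlapping window and average the overlaps, which yields spatially smoother noise than i.i.d. sampling. Window size, stride, and the plain averaging rule are assumptions about the general idea, not the paper's exact estimator.

```python
import torch

def sliding_window_noise(shape, patch=32, stride=16, generator=None):
    """Sample Gaussian noise per overlapping window and average the overlaps,
    yielding spatially smoother noise than i.i.d. sampling (a sketch of the idea)."""
    _, _, H, W = shape
    noise = torch.zeros(shape)
    weight = torch.zeros(1, 1, H, W)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            noise[..., y:y + patch, x:x + patch] += torch.randn(
                shape[0], shape[1], patch, patch, generator=generator)
            weight[..., y:y + patch, x:x + patch] += 1.0
    return noise / weight.clamp(min=1.0)

n = sliding_window_noise((1, 3, 128, 128))
print(n.shape, float(n.std()))      # std < 1 in overlapped regions, i.e. smoother than i.i.d. noise
```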

RPCANet: Deep Unfolding RPCA Based Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2311.00917
  • repo_url: None
  • paper_authors: Fengyi Wu, Tianfang Zhang, Lei Li, Yian Huang, Zhenming Peng
  • for: Improving both the accuracy and the interpretability of infrared small target detection (ISTD).
  • methods: An interpretable deep unfolding network (RPCANet) that formulates detection as sparse target extraction, low-rank background estimation, and image reconstruction in a relaxed Robust PCA model, replacing costly matrix computations with theory-guided neural modules.
  • results: Outperforms baseline methods in both qualitative and quantitative evaluations while detecting targets with clear interpretability and preserving intrinsic image features.
    Abstract Deep learning (DL) networks have achieved remarkable performance in infrared small target detection (ISTD). However, these structures exhibit a deficiency in interpretability and are widely regarded as black boxes, as they disregard domain knowledge in ISTD. To alleviate this issue, this work proposes an interpretable deep network for detecting infrared dim targets, dubbed RPCANet. Specifically, our approach formulates the ISTD task as sparse target extraction, low-rank background estimation, and image reconstruction in a relaxed Robust Principle Component Analysis (RPCA) model. By unfolding the iterative optimization updating steps into a deep-learning framework, time-consuming and complex matrix calculations are replaced by theory-guided neural networks. RPCANet detects targets with clear interpretability and preserves the intrinsic image feature, instead of directly transforming the detection task into a matrix decomposition problem. Extensive experiments substantiate the effectiveness of our deep unfolding framework and demonstrate its trustworthy results, surpassing baseline methods in both qualitative and quantitative evaluations.
    摘要 深度学习(DL)网络在红外小目标检测(ISTD)中已经取得了很好的表现。然而,这些结构缺乏可解释性,通常被视为黑盒子,因为它们忽略了ISTD中的领域知识。为缓解这一问题,本研究提出了一种可解释的深度网络,称为RPCANet,用于检测红外暗弱目标。具体来说,我们的方法将ISTD任务建模为一个松弛的鲁棒主成分分析(RPCA)模型中的稀疏目标提取、低秩背景估计和图像重建。通过将迭代优化的更新步骤展开为深度学习框架,耗时且复杂的矩阵计算被理论引导的神经网络所取代。RPCANet能够以清晰的可解释性检测目标,并保留图像的内在特征,而不是直接将检测任务转化为矩阵分解问题。大量实验证明了我们的深度展开框架的有效性,其结果可信,在定性与定量评估中均超越了基线方法。
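To ground the "unfolding" idea, here is the kind of plain, non-learned alternating update that RPCA-style small-target detectors iterate, and that RPCANet unrolls into network layers with learned, theory-guided modules: a truncated-SVD step estimates the low-rank background and a soft-thresholding step extracts the sparse residual. The rank, threshold, and iteration count below are fixed toy choices rather than the paper's learned parameters.

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def low_rank_sparse_split(D, rank=1, tau=0.5, n_iter=10):
    """Simplified alternating scheme behind RPCA-style small-target detection:
    L <- best rank-`rank` approximation of D - S (background),
    S <- soft-threshold of D - L (sparse residual holding the dim targets).
    RPCANet unrolls updates of this kind into layers with learned thresholds."""
    S = np.zeros_like(D)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(D - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        S = soft_threshold(D - L, tau)
    return L, S

rng = np.random.default_rng(0)
bg = np.outer(rng.normal(size=64), rng.normal(size=64))    # rank-1 "background" frame
D = bg.copy()
D[30, 30] += 5.0                                           # one bright small target
L, S = low_rank_sparse_split(D)
print(np.unravel_index(np.argmax(np.abs(S)), S.shape))     # brightest entry of S: the target at (30, 30)
```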