results: On the DAVIS 2017 val/test-dev and YouTube-VOS 2018/2019 val benchmarks, our method achieves new state-of-the-art performance (89.7% and 87.6%, and 87.0% and 87.0%, respectively), outperforming existing methods by a large margin.
Abstract
Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting their features. On one hand, this decoupled modeling restricts target information propagation to the high-level feature space. On the other hand, pixel-wise matching leads to a lack of holistic understanding of the targets. To overcome these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling the three elements of feature, correspondence, and a compressed memory. The core design is the Joint Block, which exploits the flexibility of attention to simultaneously extract features and propagate target information to the current tokens and the compressed memory token. This scheme allows extensive information propagation and discriminative feature learning. To incorporate long-term temporal target information, we also devise a customized online updating mechanism for the compressed memory token, which promotes information flow along the temporal dimension and thus improves the global modeling capability. Under this design, our method achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing works by a large margin.
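The abstract describes joint feature extraction and target propagation within a single attention block over current-frame tokens, reference-frame tokens, and a compressed memory token. Below is a minimal, hedged sketch of that idea in PyTorch; it is not the authors' implementation, and every module and variable name is an assumption.

```python
# Illustrative sketch (not the authors' code): one "joint" attention block in which
# current-frame tokens, reference-frame tokens, and a single compressed memory token
# are concatenated and attend to each other, so feature extraction and target
# propagation happen in the same operation.
import torch
import torch.nn as nn

class JointBlockSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cur_tokens, ref_tokens, mem_token):
        # cur_tokens: (B, Nc, C), ref_tokens: (B, Nr, C), mem_token: (B, 1, C)
        x = torch.cat([cur_tokens, ref_tokens, mem_token], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)          # joint self-attention over all tokens
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # split back so the caller can update the compressed memory token online
        Nc, Nr = cur_tokens.shape[1], ref_tokens.shape[1]
        return x[:, :Nc], x[:, Nc:Nc + Nr], x[:, Nc + Nr:]
```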
A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance
results: A2Q can train QNNs for deep learning-based computer vision tasks using low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline. In our evaluations, we consider the impact of A2Q on both general-purpose platforms and programmable hardware, but we primarily target model deployment on FPGAs because they can fully exploit custom accumulator bit widths. Our experiments show that accumulator bit width significantly affects the resource utilization of FPGA-based accelerators. Across our benchmarks, A2Q reduces resource utilization by up to 2.3x on average compared with 32-bit accumulators while maintaining 99.2% of the floating-point model accuracy.
Abstract
We present accumulator-aware quantization (A2Q), a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow when using low-precision accumulators during inference. A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds that we derive. Thus, in training QNNs for low-precision accumulation, A2Q also inherently promotes unstructured weight sparsity to guarantee overflow avoidance. We apply our method to deep learning-based computer vision tasks to show that A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline. In our evaluations, we consider the impact of A2Q on both general-purpose platforms and programmable hardware. However, we primarily target model deployment on FPGAs because they can be programmed to fully exploit custom accumulator bit widths. Our experimentation shows accumulator bit width significantly impacts the resource efficiency of FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a 2.3x reduction in resource utilization over 32-bit accumulator counterparts with 99.2% of the floating-point model accuracy.
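The key mechanism is a weight-normalization-style reparameterization that caps the L1 norm of each output channel's weights at a bound implied by the target accumulator bit width. The sketch below illustrates the general idea under stated assumptions; it is not the A2Q release, the exact bound derived in the paper is not reproduced here, and `l1_bound` is a placeholder.

```python
# Minimal sketch (an assumption, not the A2Q release): a weight-normalization-style
# parameterization that caps the per-output-channel L1 norm of the weights at a bound
# derived from the accumulator bit width, so accumulation cannot overflow.
import torch
import torch.nn as nn

class L1ConstrainedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, l1_bound: float):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.01)  # direction
        self.g = nn.Parameter(torch.zeros(out_features))                      # log-scale
        self.l1_bound = l1_bound  # per-channel L1 budget implied by the accumulator width

    def forward(self, x):
        # Rescale each output channel so ||w||_1 = min(exp(g), l1_bound); shrinking the
        # norm this way tends to push small weights toward zero (unstructured sparsity).
        l1 = self.v.abs().sum(dim=1, keepdim=True).clamp_min(1e-12)
        scale = torch.clamp(self.g.exp(), max=self.l1_bound).unsqueeze(1)
        w = self.v / l1 * scale
        return nn.functional.linear(x, w)
```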
Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers
results: The method is evaluated on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100), where it achieves a 2-4x reduction in computational cost with only a minor loss in accuracy.
Abstract
Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy.
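The core idea is to cache the previous frame's tokens and module outputs, recompute an expensive module only for tokens whose change exceeds a threshold, and reuse cached outputs elsewhere. A minimal sketch of such a gate is shown below; it is illustrative only, and the thresholding rule and names are assumptions rather than the released Eventful Transformers code.

```python
# Hedged sketch of the general idea: keep a cache of the previous frame's tokens and
# per-token outputs, recompute only for tokens that changed significantly, and reuse
# the cached outputs for the rest.
import torch

class TokenGate:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.prev_tokens = None   # (N, C) tokens from the previous frame
        self.prev_outputs = None  # (N, C) cached module outputs

    def __call__(self, tokens: torch.Tensor, module) -> torch.Tensor:
        if self.prev_tokens is None:
            out = module(tokens)  # first frame: process everything
        else:
            delta = (tokens - self.prev_tokens).norm(dim=-1)   # per-token change
            changed = delta > self.threshold
            out = self.prev_outputs.clone()
            if changed.any():
                out[changed] = module(tokens[changed])         # recompute only changed tokens
        self.prev_tokens, self.prev_outputs = tokens.detach(), out.detach()
        return out
```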
Unlocking the Performance of Proximity Sensors by Utilizing Transient Histograms
results: The paper captures 3,800 measurements of eight planar surfaces from a wide range of viewpoints and shows that the method outperforms the proprietary-distance-estimate baseline by an order of magnitude in most scenarios. It also demonstrates a simple robotics application that uses the method to sense the distance to and slope of a planar surface from a sensor mounted on the end effector of a robot arm.
Abstract
We provide methods which recover planar scene geometry by utilizing the transient histograms captured by a class of close-range time-of-flight (ToF) distance sensor. A transient histogram is a one dimensional temporal waveform which encodes the arrival time of photons incident on the ToF sensor. Typically, a sensor processes the transient histogram using a proprietary algorithm to produce distance estimates, which are commonly used in several robotics applications. Our methods utilize the transient histogram directly to enable recovery of planar geometry more accurately than is possible using only proprietary distance estimates, and consistent recovery of the albedo of the planar surface, which is not possible with proprietary distance estimates alone. This is accomplished via a differentiable rendering pipeline, which simulates the transient imaging process, allowing direct optimization of scene geometry to match observations. To validate our methods, we capture 3,800 measurements of eight planar surfaces from a wide range of viewpoints, and show that our method outperforms the proprietary-distance-estimate baseline by an order of magnitude in most scenarios. We demonstrate a simple robotics application which uses our method to sense the distance to and slope of a planar surface from a sensor mounted on the end effector of a robot arm.
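To make the differentiable-rendering idea concrete, the toy sketch below renders a transient histogram from a distance and an albedo (a single Gaussian return with inverse-square falloff) and fits those parameters to an observed histogram by gradient descent. This is a deliberately simplified stand-in for the paper's pipeline; the forward model and all names are assumptions.

```python
# Toy sketch under assumptions (not the authors' pipeline): differentiably render a
# transient histogram from plane parameters and fit them to an observed histogram.
import torch

def render_histogram(dist, albedo, n_bins=64, bin_width=0.05, pulse_sigma=0.08):
    # Place a Gaussian pulse at the bin corresponding to the surface distance,
    # scaled by albedo and an inverse-square falloff.
    bin_centers = (torch.arange(n_bins) + 0.5) * bin_width
    amplitude = albedo / dist.clamp_min(1e-3) ** 2
    return amplitude * torch.exp(-0.5 * ((bin_centers - dist) / pulse_sigma) ** 2)

def fit_plane(observed, steps=500, lr=1e-2):
    dist = torch.tensor(1.0, requires_grad=True)    # initial distance guess (meters)
    albedo = torch.tensor(0.5, requires_grad=True)  # initial albedo guess
    opt = torch.optim.Adam([dist, albedo], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((render_histogram(dist, albedo) - observed) ** 2).mean()
        loss.backward()
        opt.step()
    return dist.item(), albedo.item()
```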
A Fast Minimization Algorithm for the Euler Elastica Model Based on a Bilinear Decomposition
methods: The algorithm is based on a bilinear decomposition of the gradient of the underlying image and comprises three sub-minimization problems, each of which is solved either in closed form or by a fast solver.
results: Compared with other state-of-the-art algorithms, the new algorithm solves the EE model faster and more stably and performs well across a series of numerical experiments. For example, its average running time is at most one-quarter of that of the fast operator-splitting-based Deng-Glowinski-Tai algorithm.
Abstract
The Euler Elastica (EE) model with surface curvature can generate artifact-free results compared with the traditional total variation regularization model in image processing. However, strong nonlinearity and singularity due to the curvature term in the EE model pose a great challenge for one to design fast and stable algorithms for the EE model. In this paper, we propose a new, fast, hybrid alternating minimization (HALM) algorithm for the EE model based on a bilinear decomposition of the gradient of the underlying image and prove the global convergence of the minimizing sequence generated by the algorithm under mild conditions. The HALM algorithm comprises three sub-minimization problems and each is either solved in the closed form or approximated by fast solvers making the new algorithm highly accurate and efficient. We also discuss the extension of the HALM strategy to deal with general curvature-based variational models, especially with a Lipschitz smooth functional of the curvature. A host of numerical experiments are conducted to show that the new algorithm produces good results with much-improved efficiency compared to other state-of-the-art algorithms for the EE model. As one of the benchmarks, we show that the average running time of the HALM algorithm is at most one-quarter of that of the fast operator-splitting-based Deng-Glowinski-Tai algorithm.
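For reference, the Euler Elastica regularized restoration problem is commonly written as follows (a standard statement of the model; the paper's notation and data term may differ):

$$
\min_{u} \int_{\Omega} \Big( a + b \Big( \nabla \cdot \frac{\nabla u}{|\nabla u|} \Big)^{2} \Big) |\nabla u| \, dx + \frac{\eta}{2} \int_{\Omega} (u - f)^{2} \, dx,
$$

where $f$ is the observed image, $u$ is the restored image, $\nabla \cdot (\nabla u / |\nabla u|)$ is the curvature of the level lines of $u$, and $a, b, \eta > 0$ are weighting parameters. The curvature term inside the integrand is the source of the nonlinearity and singularity that the HALM algorithm is designed to handle.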
RestNet: Boosting Cross-Domain Few-Shot Segmentation with Residual Transformation Network
results: Experiments show that RestNet achieves state-of-the-art performance on the ISIC, Chest X-ray, and FSS-1000 datasets without requiring additional fine-tuning.
Abstract
Cross-domain few-shot segmentation (CD-FSS) aims to achieve semantic segmentation in previously unseen domains with a limited number of annotated samples. Although existing CD-FSS models focus on cross-domain feature transformation, relying exclusively on inter-domain knowledge transfer may lead to the loss of critical intra-domain information. To this end, we propose a novel residual transformation network (RestNet) that facilitates knowledge transfer while retaining the intra-domain support-query feature information. Specifically, we propose a Semantic Enhanced Anchor Transform (SEAT) module that maps features to a stable domain-agnostic space using advanced semantics. Additionally, an Intra-domain Residual Enhancement (IRE) module is designed to maintain the intra-domain representation of the original discriminant space in the new space. We also propose a mask prediction strategy based on prototype fusion to help the model gradually learn how to segment. Our RestNet can transfer cross-domain knowledge from both inter-domain and intra-domain without requiring additional fine-tuning. Extensive experiments on ISIC, Chest X-ray, and FSS-1000 show that our RestNet achieves state-of-the-art performance. Our code will be available soon.
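As context for the prototype-fusion mask prediction the abstract mentions, the sketch below shows a generic prototype-based prediction step of the kind used in few-shot segmentation: a foreground prototype is pooled from masked support features and compared with query features by cosine similarity. This is a hedged illustration, not the authors' code; the function name and temperature are assumptions.

```python
# Generic prototype-based mask prediction sketch (not the RestNet implementation).
import torch
import torch.nn.functional as F

def predict_mask(support_feat, support_mask, query_feat, temperature=20.0):
    # support_feat, query_feat: (B, C, H, W); support_mask: (B, 1, H, W) in {0, 1}
    mask = F.interpolate(support_mask, size=support_feat.shape[-2:], mode="nearest")
    proto = (support_feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp_min(1e-6)  # (B, C)
    sim = F.cosine_similarity(query_feat, proto[..., None, None], dim=1, eps=1e-6)        # (B, H, W)
    return torch.sigmoid(temperature * sim).unsqueeze(1)  # soft foreground mask (B, 1, H, W)
```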