cs.CV - 2023-11-05

MirrorCalib: Utilizing Human Pose Information for Mirror-based Virtual Camera Calibration

  • paper_url: http://arxiv.org/abs/2311.02791
  • repo_url: None
  • paper_authors: Longyun Liao, Andrew Mitchell, Rong Zheng
  • for: Estimates the extrinsic parameters of a virtual camera, i.e., the relative pose of the mirrored view with respect to the real camera, using a single fixed planar mirror.
  • methods: Uses prior knowledge of the human body and 2D joint locations to estimate the camera extrinsics: a modified eight-point algorithm provides an initial estimate, the 2D joint locations are then refined subject to human body constraints, and a RANSAC algorithm removes outliers.
  • results: Evaluated on synthetic and real datasets, MirrorCalib achieves a rotation error of 0.62°/1.82° and a translation error of 37.33/69.51 mm, outperforming the state-of-the-art method.
    Abstract In this paper, we present the novel task of estimating the extrinsic parameters of a virtual camera with respect to a real camera with one single fixed planar mirror. This task poses a significant challenge in cases where objects captured lack overlapping views from both real and mirrored cameras. To address this issue, prior knowledge of a human body and 2D joint locations are utilized to estimate the camera extrinsic parameters when a person is in front of a mirror. We devise a modified eight-point algorithm to obtain an initial estimation from 2D joint locations. The 2D joint locations are then refined subject to human body constraints. Finally, a RANSAC algorithm is employed to remove outliers by comparing their epipolar distances to a predetermined threshold. MirrorCalib is evaluated on both synthetic and real datasets and achieves a rotation error of 0.62°/1.82° and a translation error of 37.33/69.51 mm on the synthetic/real dataset, which outperforms the state-of-the-art method.
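As a rough illustration of the outlier-removal step described above, the sketch below scores 2D joint correspondences by their Sampson (epipolar) distance inside a RANSAC loop and keeps the largest consensus set. This is not the authors' code: `estimate_F` stands in for their modified eight-point solver, the Sampson distance is one common choice of epipolar distance, and the threshold and iteration count are placeholders.

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """Sampson (first-order epipolar) distance for homogeneous correspondences
    x1 <-> x2 of shape (3, N) under a fundamental matrix F."""
    Fx1 = F @ x1                # epipolar lines in the second view
    Ftx2 = F.T @ x2             # epipolar lines in the first view
    num = np.sum(x2 * Fx1, axis=0) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den

def ransac_epipolar_inliers(joints1, joints2, estimate_F, threshold=1.0, iters=500, seed=None):
    """Keep joint correspondences (3, N homogeneous) whose epipolar distance under a
    model fitted to a random minimal sample stays below `threshold`."""
    rng = np.random.default_rng(seed)
    n = joints1.shape[1]
    best_mask = np.zeros(n, dtype=bool)
    for _ in range(iters):
        sample = rng.choice(n, size=8, replace=False)     # eight-point minimal sample
        F = estimate_F(joints1[:, sample], joints2[:, sample])
        mask = sampson_distance(F, joints1, joints2) < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```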

MuSHRoom: Multi-Sensor Hybrid Room Dataset for Joint 3D Reconstruction and Novel View Synthesis

  • paper_url: http://arxiv.org/abs/2311.02778
  • repo_url: None
  • paper_authors: Xuqian Ren, Wenjia Wang, Dingding Cai, Tuuli Tuominen, Juho Kannala, Esa Rahtu
  • for: Aims to advance accurate, real-time, and immersive modeling on consumer-grade hardware for Metaverse technologies, serving both non-human perception (e.g., drone/robot/autonomous car navigation) and immersive technologies such as AR/VR.
  • methods: Introduces the real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom) and benchmarks several well-known pipelines on it for joint 3D mesh reconstruction and novel view synthesis.
  • results: Proposes a new method that achieves a good trade-off between 3D reconstruction and high-quality rendering in a computationally efficient fashion, showing strong performance on the MuSHRoom dataset.
    Abstract Metaverse technologies demand accurate, real-time, and immersive modeling on consumer-grade hardware for both non-human perception (e.g., drone/robot/autonomous car navigation) and immersive technologies like AR/VR, requiring both structural accuracy and photorealism. However, there exists a knowledge gap in how to apply geometric reconstruction and photorealism modeling (novel view synthesis) in a unified framework. To address this gap and promote the development of robust and immersive modeling and rendering with consumer-grade devices, first, we propose a real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom). Our dataset presents exciting challenges and requires state-of-the-art methods to be cost-effective, robust to noisy data and devices, and can jointly learn 3D reconstruction and novel view synthesis, instead of treating them as separate tasks, making them ideal for real-world applications. Second, we benchmark several famous pipelines on our dataset for joint 3D mesh reconstruction and novel view synthesis. Finally, in order to further improve the overall performance, we propose a new method that achieves a good trade-off between the two tasks. Our dataset and benchmark show great potential in promoting the improvements for fusing 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion.

Fast Sparse 3D Convolution Network with VDB

  • paper_url: http://arxiv.org/abs/2311.02762
  • repo_url: None
  • paper_authors: Fangjun Zhou, Anyong Mao, Eftychios Sifakis
  • for: Proposes a new convolutional neural network implementation for efficient inference on sparse 3D data.
  • methods: Uses NanoVDB as the data structure for storing the sparse tensor, keeping the memory footprint small while maintaining high performance.
  • results: The architecture is around 20 times faster than the state-of-the-art dense CNN model on a high-resolution 3D object classification network.
    Abstract We propose a new Convolutional Neural Network implementation optimized for sparse 3D data inference. This implementation uses NanoVDB as the data structure to store the sparse tensor. It leaves a relatively small memory footprint while maintaining high performance. We demonstrate that this architecture is around 20 times faster than the state-of-the-art dense CNN model on a high-resolution 3D object classification network.
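The abstract does not include code, but the core idea of sparse 3D convolution, visiting only occupied voxels instead of a dense grid, can be sketched as follows. The snippet stores active voxels in a plain Python dictionary rather than NanoVDB and ignores submanifold tricks, so it only illustrates why the cost scales with occupancy.

```python
import numpy as np
from collections import defaultdict

def sparse_conv3d(active, weights):
    """Naive sparse 3D convolution. `active` maps (x, y, z) voxel coordinates to
    feature vectors of length c_in; `weights` has shape (k, k, k, c_in, c_out).
    Only neighborhoods of occupied voxels are visited, so the cost scales with
    the number of active voxels rather than the dense grid volume."""
    r = weights.shape[0] // 2
    out = defaultdict(lambda: np.zeros(weights.shape[-1]))
    for (x, y, z), feat in active.items():
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    # scatter this voxel's contribution to the output site it influences
                    out[(x + dx, y + dy, z + dz)] += feat @ weights[dx + r, dy + r, dz + r]
    return dict(out)

# Toy usage: two occupied voxels, a 3x3x3 kernel, 4 input and 8 output channels.
voxels = {(0, 0, 0): np.random.rand(4), (5, 2, 1): np.random.rand(4)}
result = sparse_conv3d(voxels, np.random.rand(3, 3, 3, 4, 8))
```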

Fast Point-cloud to Mesh Reconstruction for Deformable Object Tracking

  • paper_url: http://arxiv.org/abs/2311.02749
  • repo_url: None
  • paper_authors: Elham Amin Mansour, Hehui Zheng, Robert K. Katzschmann
  • for: Robotic hands that control soft objects need online state feedback of the deforming object.
  • methods: Proposes a method that creates deforming meshes for different object categories and tracks their deformation, built on a point cloud auto-encoder and a Real-NVP architecture that deforms a template mesh to match a deformed point cloud.
  • results: Achieves mesh reconstruction and tracking at 58 Hz for deformations of six YCB categories; the output can feed closed-loop grasp control for robotic hands and supports marker-free system identification of deforming objects.
    Abstract The world around us is full of soft objects that we as humans learn to perceive and deform with dexterous hand movements from a young age. In order for a robotic hand to be able to control soft objects, it needs to acquire online state feedback of the deforming object. While RGB-D cameras can collect occluded information at a rate of 30 Hz, the latter does not represent a continuously trackable object surface. Hence, in this work, we developed a method that can create deforming meshes of deforming point clouds at a speed of above 50 Hz for different categories of objects. The reconstruction of meshes from point clouds has long been studied in the field of computer graphics under 3D reconstruction and 4D reconstruction; however, both lack the speed and generalizability needed for robotics applications. Our model is designed using a point cloud auto-encoder and a Real-NVP architecture. The latter is a continuous flow neural network with manifold-preservation properties. Our model takes a template mesh, which is the mesh of an object in its canonical state, and then deforms the template mesh to match a deformed point cloud of the object. Our method can perform mesh reconstruction and tracking at a rate of 58 Hz for deformations of six different YCB categories. An instance of a downstream application can be the control algorithm for a robotic hand that requires online feedback from the state of a manipulated object, which would allow online grasp adaptation in a closed-loop manner. Furthermore, the tracking capacity that our method provides can help in the system identification of deforming objects in a marker-free approach. In future work, we will extend our method to more categories of objects and real-world deforming point clouds.

Attention Modules Improve Image-Level Anomaly Detection for Industrial Inspection: A DifferNet Case Study

  • paper_url: http://arxiv.org/abs/2311.02747
  • repo_url: https://github.com/andreluizbvs/insplad
  • paper_authors: André Luiz Buarque Vieira e Silva, Francisco Simões, Danny Kowerko, Tobias Schlosser, Felipe Battisti, Veronica Teichrieb
  • for: Targets learning-based methods for (semi-)automated visual industrial inspection, which must handle small defect patterns in high-resolution imagery.
  • methods: Proposes AttentDifferNet, a DifferNet-based solution enhanced with attention modules, improving image-level detection and classification on three visual anomaly detection datasets: InsPLAD-fault, MVTec AD, and Semiconductor Wafer.
  • results: Compared to DifferNet, AttentDifferNet improves overall AUROC by an average of 1.77 ± 0.25 percentage points across the three datasets and reaches state-of-the-art results on InsPLAD-fault, an in-the-wild industrial inspection dataset.
    Abstract Within (semi-)automated visual industrial inspection, learning-based approaches for assessing visual defects, including deep neural networks, enable the processing of otherwise small defect patterns in pixel size on high-resolution imagery. The emergence of these often rarely occurring defect patterns explains the general need for labeled data corpora. To alleviate this issue and advance the current state of the art in unsupervised visual inspection, this work proposes a DifferNet-based solution enhanced with attention modules: AttentDifferNet. It improves image-level detection and classification capabilities on three visual anomaly detection datasets for industrial inspection: InsPLAD-fault, MVTec AD, and Semiconductor Wafer. In comparison to the state of the art, AttentDifferNet achieves improved results, which are, in turn, highlighted throughout our quali-quantitative study. Our quantitative evaluation shows an average improvement - compared to DifferNet - of 1.77 +/- 0.25 percentage points in overall AUROC considering all three datasets, reaching SOTA results in InsPLAD-fault, an industrial inspection in-the-wild dataset. As our variants to AttentDifferNet show great prospects in the context of currently investigated approaches, a baseline is formulated, emphasizing the importance of attention for industrial anomaly detection both in the wild and in controlled environments.
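The abstract does not specify which attention modules are used, so the sketch below shows a generic squeeze-and-excitation-style channel attention block wrapped around a feature-extractor stage, as one example of the kind of module meant; all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: global-pool the feature
    map, pass it through a small bottleneck MLP, and rescale each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # excite: reweight channels

# Wrapping an illustrative feature-extractor stage with the attention block.
backbone_stage = nn.Conv2d(64, 64, kernel_size=3, padding=1)
block = nn.Sequential(backbone_stage, ChannelAttention(64))
features = block(torch.randn(2, 64, 32, 32))
```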

Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion

  • paper_url: http://arxiv.org/abs/2311.02738
  • repo_url: None
  • paper_authors: Ethan Pronovost, Meghana Reddy Ganesina, Noureldin Hendy, Zeyu Wang, Andres Morales, Kai Wang, Nicholas Roy
  • for: Automated generation of synthetic traffic scenarios for validating the safety of autonomous vehicles (AVs).
  • methods: Combines latent diffusion, object detection, and trajectory regression to simultaneously generate distributions of synthetic agent poses, orientations, and trajectories, conditioned on a map and sets of tokens describing the desired scenario.
  • results: Shows sufficient expressive capacity to model diverse traffic patterns and generalizes to different geographical regions.
    Abstract Automated creation of synthetic traffic scenarios is a key part of validating the safety of autonomous vehicles (AVs). In this paper, we propose Scenario Diffusion, a novel diffusion-based architecture for generating traffic scenarios that enables controllable scenario generation. We combine latent diffusion, object detection and trajectory regression to generate distributions of synthetic agent poses, orientations and trajectories simultaneously. To provide additional control over the generated scenario, this distribution is conditioned on a map and sets of tokens describing the desired scenario. We show that our approach has sufficient expressive capacity to model diverse traffic patterns and generalizes to different geographical regions.
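A minimal sketch of conditional diffusion sampling, in the spirit of the latent-diffusion component described above: the reverse process starts from noise and is denoised step by step while conditioning on a map encoding and scenario tokens. The `denoiser` interface, schedule, and shapes are assumptions for illustration, not the paper's architecture.

```python
import torch

@torch.no_grad()
def sample_scenario(denoiser, map_feats, tokens, betas, shape):
    """Minimal DDPM-style reverse process. `denoiser(x_t, t, map_feats, tokens)`
    predicts the noise added at step t, conditioned on the map and scenario tokens."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # start from pure noise in latent space
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t, map_feats, tokens)              # conditional noise prediction
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x                                                 # decoded downstream into agent poses/trajectories

# Illustrative call (denoiser, map_feats, tokens are placeholders):
# betas = torch.linspace(1e-4, 0.02, steps=100)
# latent = sample_scenario(denoiser, map_feats, tokens, betas, shape=(1, 4, 64, 64))
```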

JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds

  • paper_url: http://arxiv.org/abs/2311.02736
  • repo_url: None
  • paper_authors: Saeed Saadatnejad, Yang Gao, Hamid Rezatofighi, Alexandre Alahi
  • for: Predicting future trajectories is critical in autonomous navigation, especially for preventing accidents involving humans, where anticipating agents' motion in advance matters most.
  • methods: Introduces a new end-to-end trajectory forecasting dataset (an extension of JRDB) for evaluating models in realistic scenarios that include imperfect preceding modules such as tracking.
  • results: The dataset provides raw sensory input from the robot's perspective, including the locations of all agents, scene images, and point clouds, with the goal of predicting agents' future positions relative to the robot; a new metric is also introduced for evaluation when ground-truth identities are inaccessible.
    Abstract Predicting future trajectories is critical in autonomous navigation, especially in preventing accidents involving humans, where a predictive agent's ability to anticipate in advance is of utmost importance. Trajectory forecasting models, employed in fields such as robotics, autonomous vehicles, and navigation, face challenges in real-world scenarios, often due to the isolation of model components. To address this, we introduce a novel dataset for end-to-end trajectory forecasting, facilitating the evaluation of models in scenarios involving less-than-ideal preceding modules such as tracking. This dataset, an extension of the JRDB dataset, provides comprehensive data, including the locations of all agents, scene images, and point clouds, all from the robot's perspective. The objective is to predict the future positions of agents relative to the robot using raw sensory input data. It bridges the gap between isolated models and practical applications, promoting a deeper understanding of navigation dynamics. Additionally, we introduce a novel metric for assessing trajectory forecasting models in real-world scenarios where ground-truth identities are inaccessible, addressing issues related to undetected or over-detected agents. Researchers are encouraged to use our benchmark for model evaluation and benchmarking.
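The abstract does not spell out the new metric for scenarios without ground-truth identities; for context, the conventional displacement errors that trajectory forecasting benchmarks typically report can be computed as below, with shapes and names chosen for illustration.

```python
import numpy as np

def displacement_errors(pred, gt):
    """Average and final displacement errors (ADE / FDE) for trajectory forecasting.
    `pred` and `gt` have shape (num_agents, horizon, 2), holding future x/y
    positions relative to the robot."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # per-agent, per-timestep error
    ade = dist.mean()                           # mean over all agents and timesteps
    fde = dist[:, -1].mean()                    # mean error at the final timestep
    return ade, fde
```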

ISAR: A Benchmark for Single- and Few-Shot Object Instance Segmentation and Re-Identification

  • paper_url: http://arxiv.org/abs/2311.02734
  • repo_url: https://github.com/nicogorlo/isar_wacv24
  • paper_authors: Nicolas Gorlo, Kenneth Blomqvist, Francesco Milano, Roland Siegwart
  • for: Aims to advance single- and few-shot object detection, instance segmentation, and re-identification.
  • methods: Proposes ISAR, a benchmark and baseline method, together with a semi-synthetic dataset of video sequences with ground-truth semantic annotations and a standardized evaluation pipeline.
  • results: The benchmark, dataset, and evaluation pipeline are intended to accelerate the development of algorithms that can robustly detect, segment, and re-identify objects from a single or a few sparse training examples.
    Abstract Most object-level mapping systems in use today make use of an upstream learned object instance segmentation model. If we want to teach them about a new object or segmentation class, we need to build a large dataset and retrain the system. To build spatial AI systems that can quickly be taught about new objects, we need to effectively solve the problem of single-shot object detection, instance segmentation and re-identification. So far there is neither a method fulfilling all of these requirements in unison nor a benchmark that could be used to test such a method. Addressing this, we propose ISAR, a benchmark and baseline method for single- and few-shot object Instance Segmentation And Re-identification, in an effort to accelerate the development of algorithms that can robustly detect, segment, and re-identify objects from a single or a few sparse training examples. We provide a semi-synthetic dataset of video sequences with ground-truth semantic annotations, a standardized evaluation pipeline, and a baseline method. Our benchmark aligns with the emerging research trend of unifying Multi-Object Tracking, Video Object Segmentation, and Re-identification.

Uncertainty Estimation for Safety-critical Scene Segmentation via Fine-grained Reward Maximization

  • paper_url: http://arxiv.org/abs/2311.02719
  • repo_url: https://github.com/med-air/fgrm
  • paper_authors: Hongzheng Yang, Cheng Chen, Yueyao Chen, Markus Scheppach, Hon Chi Yip, Qi Dou
  • for: Improves the reliability of deep segmentation models in safety-critical scenarios, particularly medical applications, through better uncertainty estimation.
  • methods: Proposes a fine-grained reward maximization (FGRM) framework that directly optimizes an uncertainty-calibration-related reward with reinforcement-learning-based model tuning, weighting each parameter's update by its importance as quantified by the Fisher information matrix.
  • results: On two large safety-critical surgical scene segmentation datasets, the method outperforms state-of-the-art approaches by a clear margin on all uncertainty-calibration metrics with a single real-time forward pass at inference, while maintaining high segmentation accuracy. Code is available at https://github.com/med-air/FGRM.
    Abstract Uncertainty estimation plays an important role for future reliable deployment of deep segmentation models in safety-critical scenarios such as medical applications. However, existing methods for uncertainty estimation have been limited by the lack of explicit guidance for calibrating the prediction risk and model confidence. In this work, we propose a novel fine-grained reward maximization (FGRM) framework, to address uncertainty estimation by directly utilizing an uncertainty metric related reward function with a reinforcement learning based model tuning algorithm. This would benefit the model uncertainty estimation through direct optimization guidance for model calibration. Specifically, our method designs a new uncertainty estimation reward function using the calibration metric, which is maximized to fine-tune an evidential learning pre-trained segmentation model for calibrating prediction risk. Importantly, we innovate an effective fine-grained parameter update scheme, which imposes fine-grained reward-weighting of each network parameter according to the parameter importance quantified by the Fisher information matrix. To the best of our knowledge, this is the first work exploring reward optimization for model uncertainty estimation in safety-critical vision tasks. The effectiveness of our method is demonstrated on two large safety-critical surgical scene segmentation datasets under two different uncertainty estimation settings. With real-time one forward pass at inference, our method outperforms state-of-the-art methods by a clear margin on all the calibration metrics of uncertainty estimation, while maintaining a high task accuracy for the segmentation results. Code is available at https://github.com/med-air/FGRM.
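A conceptual sketch of the two ingredients named in the abstract, a diagonal Fisher-information estimate and a per-parameter, importance-weighted update, is given below. It assumes a differentiable surrogate reward for brevity, whereas the paper uses a reinforcement-learning-based tuning algorithm; the actual FGRM update rule is in the released code.

```python
import torch

def fisher_information_diag(model, data_loader, loss_fn):
    """Diagonal Fisher estimate: average squared loss gradients over a calibration
    set, one non-negative scalar per parameter."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for images, targets in data_loader:
        model.zero_grad()
        loss_fn(model(images), targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(data_loader) for n, f in fisher.items()}

def reward_weighted_step(model, reward, fisher, lr=1e-4):
    """Fine-grained update: scale each parameter's reward gradient by its normalized
    Fisher importance before taking a gradient-ascent step (surrogate, not the paper's RL rule)."""
    reward.backward()
    with torch.no_grad():
        for n, p in model.named_parameters():
            if p.grad is None:
                continue
            w = fisher[n] / (fisher[n].max() + 1e-12)   # per-parameter importance in [0, 1]
            p += lr * w * p.grad                        # ascend the calibration reward
    model.zero_grad()
```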

CycleCL: Self-supervised Learning for Periodic Videos

  • paper_url: http://arxiv.org/abs/2311.03402
  • repo_url: None
  • paper_authors: Matteo Destro, Michael Gygli
  • for: Periodic video sequences, such as those from automatic production systems, remote sensing, medical applications, or physical training.
  • methods: CycleCL, a self-supervised learning method designed for periodic data that uses a triplet loss to learn representations sensitive to the phase of a cycle but invariant to the exact repetition.
  • results: Significantly outperforms previous video-based self-supervised learning methods on all tasks, evaluated on an industrial dataset and multiple human action datasets.
    Abstract Analyzing periodic video sequences is a key topic in applications such as automatic production systems, remote sensing, medical applications, or physical training. An example is counting repetitions of a physical exercise. Due to the distinct characteristics of periodic data, self-supervised methods designed for standard image datasets do not capture changes relevant to the progression of the cycle and fail to ignore unrelated noise. They thus do not work well on periodic data. In this paper, we propose CycleCL, a self-supervised learning method specifically designed to work with periodic data. We start from the insight that a good visual representation for periodic data should be sensitive to the phase of a cycle, but be invariant to the exact repetition, i.e. it should generate identical representations for a specific phase throughout all repetitions. We exploit the repetitions in videos to design a novel contrastive learning method based on a triplet loss that optimizes for these desired properties. Our method uses pre-trained features to sample pairs of frames from approximately the same phase and negative pairs of frames from different phases. Then, we iterate between optimizing a feature encoder and resampling triplets, until convergence. By optimizing a model this way, we are able to learn features that have the mentioned desired properties. We evaluate CycleCL on an industrial and multiple human actions datasets, where it significantly outperforms previous video-based self-supervised learning methods on all tasks.
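The phase-sensitive, repetition-invariant objective can be illustrated with a simple triplet loss in which the positive comes from the same phase of another repetition and the negative from a different phase. In the sketch below, integer phase bins are assumed to be given; CycleCL instead derives phase correspondences from pre-trained features and iterates between encoder training and triplet resampling.

```python
import torch
import torch.nn.functional as F

def phase_triplet_loss(encoder, frames, phases, margin=0.2):
    """Triplet loss over video frames: anchor and positive share a phase bin
    (from different repetitions), the negative comes from a different phase.
    `frames`: (N, C, H, W); `phases`: (N,) integer tensor of phase bins."""
    emb = F.normalize(encoder(frames), dim=1)            # (N, D), unit-norm embeddings
    losses = []
    for i in range(len(phases)):
        same = (phases == phases[i]).nonzero(as_tuple=True)[0]
        diff = (phases != phases[i]).nonzero(as_tuple=True)[0]
        same = same[same != i]
        if len(same) == 0 or len(diff) == 0:
            continue
        pos = same[torch.randint(len(same), (1,))]
        neg = diff[torch.randint(len(diff), (1,))]
        d_pos = 1 - (emb[i] * emb[pos]).sum()            # cosine distance to positive
        d_neg = 1 - (emb[i] * emb[neg]).sum()            # cosine distance to negative
        losses.append(F.relu(d_pos - d_neg + margin))
    return torch.stack(losses).mean()
```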

Benchmarking a Benchmark: How Reliable is MS-COCO?

  • paper_url: http://arxiv.org/abs/2311.02709
  • repo_url: None
  • paper_authors: Eric Zimmermann, Justin Szeto, Jerome Pasquero, Frederic Ratle
  • for: Questions what is actually learned from benchmark datasets by using Sama-COCO, a re-annotation of MS-COCO, to discover potential annotation biases.
  • methods: Leverages a shape analysis pipeline and trains and evaluates a model on both datasets to examine the impact of different annotation conditions.
  • results: Annotation style has a significant impact on model performance, and annotation pipelines should closely consider the task of interest. The dataset is publicly available at https://www.sama.com/sama-coco-dataset/.
    Abstract Benchmark datasets are used to profile and compare algorithms across a variety of tasks, ranging from image classification to segmentation, and also play a large role in image pretraining algorithms. Emphasis is placed on results with little regard to the actual content within the dataset. It is important to question what kind of information is being learned from these datasets and what are the nuances and biases within them. In the following work, Sama-COCO, a re-annotation of MS-COCO, is used to discover potential biases by leveraging a shape analysis pipeline. A model is trained and evaluated on both datasets to examine the impact of different annotation conditions. Results demonstrate that annotation styles are important and that annotation pipelines should closely consider the task of interest. The dataset is made publicly available at https://www.sama.com/sama-coco-dataset/ .

An Empirical Study of Uncertainty in Polygon Annotation and the Impact of Quality Assurance

  • paper_url: http://arxiv.org/abs/2311.02707
  • repo_url: None
  • paper_authors: Eric Zimmermann, Justin Szeto, Frederic Ratle
  • for: Studies the inherent uncertainty of polygon annotations and the role of quality assurance in minimizing its effect.
  • methods: Analyzes multi-rater polygon annotations for several objects from the MS-COCO dataset.
  • results: The reliability of a polygon annotation depends on the reviewing procedure as well as on scene and shape complexity.
    Abstract Polygons are a common annotation format used for quickly annotating objects in instance segmentation tasks. However, many real-world annotation projects request near pixel-perfect labels. While strict pixel guidelines may appear to be the solution to a successful project, practitioners often fail to assess the feasibility of the work requested, and overlook common factors that may challenge the notion of quality. This paper aims to examine and quantify the inherent uncertainty for polygon annotations and the role that quality assurance plays in minimizing its effect. To this end, we conduct an analysis on multi-rater polygon annotations for several objects from the MS-COCO dataset. The results demonstrate that the reliability of a polygon annotation is dependent on a reviewing procedure, as well as the scene and shape complexity.

A Generative Multi-Resolution Pyramid and Normal-Conditioning 3D Cloth Draping

  • paper_url: http://arxiv.org/abs/2311.02700
  • repo_url: https://github.com/hunorlaczko/pyramid-drape
  • paper_authors: Hunor Laczkó, Meysam Madadi, Sergio Escalera, Jordi Gonzalez
  • for: 3D garment generation and draping
  • methods: conditional variational autoencoder with pyramid network and surface normal UV maps
  • results: robust, controllable, and state-of-the-art results with high generalization to unseen garments, poses, and shapes
    Abstract RGB cloth generation has been deeply studied in the related literature, however, 3D garment generation remains an open problem. In this paper, we build a conditional variational autoencoder for 3D garment generation and draping. We propose a pyramid network to add garment details progressively in a canonical space, i.e. unposing and unshaping the garments w.r.t. the body. We study conditioning the network on surface normal UV maps, as an intermediate representation, which is an easier problem to optimize than 3D coordinates. Our results on two public datasets, CLOTH3D and CAPE, show that our model is robust, controllable in terms of detail generation by the use of multi-resolution pyramids, and achieves state-of-the-art results that can highly generalize to unseen garments, poses, and shapes even when training with small amounts of data.

ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.02692
  • repo_url: https://github.com/openlamm/lamm
  • paper_authors: Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao
  • for: Provides a standardized and holistic evaluation framework for understanding the capabilities and limitations of multimodal large language models (MLLMs).
  • methods: Proposes ChEF, a Comprehensive Evaluation Framework built from four modular components: Scenario (scalable multimodal datasets), Instruction (flexible instruction-retrieving formulae), Inferencer (reliable question-answering strategies), and Metric (task-specific score functions); new evaluations are built by designing new Recipes, i.e., systematic selections of these four components.
  • results: A large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata yields over 20 valuable observations about the generalizability of MLLMs across scenarios and the composite capabilities required for multimodal interactions.
    Abstract Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks. However, even though a list of benchmarks has been proposed, the capabilities and limitations of MLLMs are still not comprehensively understood, due to a lack of a standardized and holistic evaluation framework. To this end, we present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs. First, we structure ChEF as four modular components, i.e., Scenario as scalable multimodal datasets, Instruction as flexible instruction retrieving formulae, Inferencer as reliable question answering strategies, and Metric as indicative task-specific score functions. Based on them, ChEF facilitates versatile evaluations in a standardized framework, and new evaluations can be built by designing new Recipes (systematic selection of these four components). Notably, current MLLM benchmarks can be readily summarized as recipes of ChEF. Second, we introduce 6 new recipes to quantify competent MLLMs' desired capabilities (or called desiderata, i.e., calibration, in-context learning, instruction following, language performance, hallucination, and robustness) as reliable agents that can perform real-world multimodal interactions. Third, we conduct a large-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata. Our evaluation summarized over 20 valuable observations concerning the generalizability of MLLMs across various scenarios and the composite capability of MLLMs required for multimodal interactions. We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.
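The "Recipe" idea, a systematic selection of the four modular components, can be pictured as a small configuration object. The sketch below is not ChEF's actual API (which lives in the linked LAMM repository); component names and the toy metric are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Recipe:
    """A ChEF-style recipe: one concrete choice per modular component."""
    scenario: str                           # which multimodal dataset to evaluate on
    instruction: Callable[[dict], str]      # maps a sample to a prompt/instruction
    inferencer: str                         # question-answering strategy, e.g. "direct" or "ppl"
    metric: Callable[[list, list], float]   # task-specific score over predictions vs. references

def exact_match(preds, refs):
    return sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)

caption_recipe = Recipe(
    scenario="coco_caption",
    instruction=lambda sample: f"Describe the image. Hint: {sample.get('hint', '')}",
    inferencer="direct",
    metric=exact_match,
)
```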

Octavius: Mitigating Task Interference in MLLMs via MoE

  • paper_url: http://arxiv.org/abs/2311.02684
  • repo_url: None
  • paper_authors: Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
  • for: Studies how LLMs extend zero-shot generalization to multimodal learning through instruction tuning, and how negative conflicts and interference across modalities and downstream tasks degrade performance.
  • methods: Proposes a novel, extensible framework, called Octavius, for comprehensive study of multimodal learning with MLLMs; it combines mixture-of-experts (MoE) with LoRA, a representative PEFT technique, to design a new LLM-based decoder, LoRA-MoE.
  • results: Experimental results (about 20% improvement) demonstrate the effectiveness and versatility of the design on various 2D and 3D downstream tasks.
    Abstract Recent studies have demonstrated Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, negative conflicts and interference may have a worse impact on performance. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) and one of the representative PEFT techniques, i.e., LoRA, designing a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. The experimental results (about 20% improvement) have shown the effectiveness and versatility of our design in various 2D and 3D downstream tasks. Code and corresponding dataset will be available soon.
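A minimal sketch of what a LoRA-MoE layer can look like, a frozen linear layer plus a routed mixture of low-rank adapters, is shown below. Routing granularity, expert count, and gating details in Octavius may differ; this only illustrates how separate adapters can reduce interference between modalities and tasks.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """One low-rank adapter (LoRA): a rank-r update B @ A added to a frozen weight."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        return (x @ self.A.T) @ self.B.T * self.scale

class LoRAMoELinear(nn.Module):
    """A frozen linear layer augmented with a softly routed mixture of LoRA experts,
    so different modalities/tasks can rely on different adapters."""
    def __init__(self, base: nn.Linear, num_experts=4, r=8):
        super().__init__()
        self.base = base.requires_grad_(False)            # frozen pretrained weight
        self.experts = nn.ModuleList(
            [LoRAExpert(base.in_features, base.out_features, r) for _ in range(num_experts)])
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x):                                 # x: (B, T, d_in)
        gates = torch.softmax(self.router(x), dim=-1)     # (B, T, E)
        delta = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, T, d_out, E)
        return self.base(x) + (delta * gates.unsqueeze(-2)).sum(-1)
```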

Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones

  • paper_url: http://arxiv.org/abs/2311.02665
  • repo_url: https://github.com/kitamoto-lab/digital-typhoon
  • paper_authors: Asanobu Kitamoto, Jared Hwang, Bastien Vuillod, Lucas Gautier, Yingtao Tian, Tarin Clanuwat
  • for: Presents the official release of the Digital Typhoon dataset, a long-term spatio-temporal satellite image dataset spanning 40+ years for benchmarking machine learning models in the context of tropical cyclones.
  • methods: A workflow creates infrared typhoon-centered images for cropping using Lambert azimuthal equal-area projection referring to the best track data, and data quality issues such as inter-satellite calibration are addressed to create a homogeneous dataset.
  • results: Benchmarking results on the analysis, forecasting, and reanalysis of intensity suggest that the dataset is challenging for recent deep learning models, with many choices affecting the performance of various models.
    Abstract This paper presents the official release of the Digital Typhoon dataset, the longest typhoon satellite image dataset for 40+ years aimed at benchmarking machine learning models for long-term spatio-temporal data. To build the dataset, we developed a workflow to create an infrared typhoon-centered image for cropping using Lambert azimuthal equal-area projection referring to the best track data. We also address data quality issues such as inter-satellite calibration to create a homogeneous dataset. To take advantage of the dataset, we organized machine learning tasks by the types and targets of inference, with other tasks for meteorological analysis, societal impact, and climate change. The benchmarking results on the analysis, forecasting, and reanalysis for the intensity suggest that the dataset is challenging for recent deep learning models, due to many choices that affect the performance of various models. This dataset reduces the barrier for machine learning researchers to meet large-scale real-world events called tropical cyclones and develop machine learning models that may contribute to advancing scientific knowledge on tropical cyclones as well as solving societal and sustainability issues such as disaster reduction and climate change. The dataset is publicly available at http://agora.ex.nii.ac.jp/digital-typhoon/dataset/ and https://github.com/kitamoto-lab/digital-typhoon/.

Enhanced adaptive cross-layer scheme for low latency HEVC streaming over Vehicular Ad-hoc Networks (VANETs)

  • paper_url: http://arxiv.org/abs/2311.02664
  • repo_url: None
  • paper_authors: Mohamed Aymen Labiod, Mohamed Gharbi, François-Xavier Coudoux, Patrick Corlay, Noureddine Doghmane
  • for: Low-latency, high-quality video delivery over Vehicular Ad-hoc Networks (VANETs), which have variable channel quality and limited bandwidth.
  • methods: Proposes a low-complexity cross-layer mechanism that assigns each transmitted video packet to the most appropriate Access Category (AC) queue at the Medium Access Control (MAC) layer, considering the temporal prediction structure of the HEVC encoding, the importance of the frame, and the network traffic load.
  • results: For different low-delay video communication scenarios, the mechanism offers significant improvements in received video quality and end-to-end delay compared to the Enhanced Distributed Channel Access (EDCA) of 802.11p; Quality of Service (QoS) and Quality of Experience (QoE) evaluations further validate the approach.
    Abstract Vehicular communication has become a reality guided by various applications. Among those, high video quality delivery with low latency constraints required by real-time applications constitutes a very challenging task. By dint of its never-before-achieved compression level, the new High-Efficiency Video Coding (HEVC) is very promising for real-time video streaming through Vehicular Ad-hoc Networks (VANET). However, these networks have variable channel quality and limited bandwidth. Therefore, ensuring satisfactory video quality on such networks is a major challenge. In this work, a low complexity cross-layer mechanism is proposed to improve end-to-end performances of HEVC video streaming in VANET under low delay constraints. The idea is to assign to each packet of the transmitted video the most appropriate Access Category (AC) queue on the Medium Access Control (MAC) layer, considering the temporal prediction structure of the video encoding process, the importance of the frame and the state of the network traffic load. Simulation results demonstrate that for different targeted low-delay video communication scenarios, the proposed mechanism offers significant improvements regarding video quality at the reception and end-to-end delay compared to the Enhanced Distributed Channel Access (EDCA) adopted in the 802.11p. Both Quality of Service (QoS) and Quality of Experience (QoE) evaluations have been also carried out to validate the proposed approach.
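The packet-to-queue mapping can be pictured as a small decision rule over frame importance and queue load. The sketch below is illustrative only: the access categories follow IEEE 802.11p EDCA, but the specific rules and thresholds are placeholders, not the mapping defined in the paper.

```python
from enum import IntEnum

class AC(IntEnum):
    """IEEE 802.11p EDCA access categories, highest priority first."""
    VO = 3   # voice
    VI = 2   # video
    BE = 1   # best effort
    BK = 0   # background

def map_packet_to_ac(frame_type, temporal_layer, queue_load):
    """Illustrative cross-layer mapping: packets from frames that many others
    predict from (I-frames, low temporal layers) get higher-priority queues,
    demoted when the chosen queue is already congested."""
    if frame_type == "I" or temporal_layer == 0:
        ac = AC.VO
    elif temporal_layer == 1:
        ac = AC.VI
    else:
        ac = AC.BE
    # back off one priority level if the chosen queue is heavily loaded
    if queue_load.get(ac, 0.0) > 0.8 and ac > AC.BK:
        ac = AC(ac - 1)
    return ac

print(map_packet_to_ac("B", temporal_layer=2, queue_load={AC.BE: 0.2}))  # AC.BE
```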

CCMR: High Resolution Optical Flow Estimation via Coarse-to-Fine Context-Guided Motion Reasoning

  • paper_url: http://arxiv.org/abs/2311.02661
  • repo_url: https://github.com/cv-stuttgart/CCMR
  • paper_authors: Azin Jahedi, Maximilian Luz, Marc Rivinius, Andrés Bruhn
  • for: High-resolution, multi-scale optical flow estimation.
  • methods: A coarse-to-fine approach built on attention-based motion grouping: a hierarchical two-step attention-based context-motion grouping strategy first computes global multi-scale context features and then uses them to guide the actual motion grouping, with cross-covariance image transformers adapted for an efficient realization.
  • results: Produces highly detailed flow fields with strong improvements in both occluded and non-occluded regions, outperforming the corresponding single-scale attention-based and multi-scale attention-free baselines by up to 23.0% and 21.6%, respectively, and achieving state-of-the-art results: first on KITTI 2015 and second on MPI Sintel Clean and Final.
    Abstract Attention-based motion aggregation concepts have recently shown their usefulness in optical flow estimation, in particular when it comes to handling occluded regions. However, due to their complexity, such concepts have been mainly restricted to coarse-resolution single-scale approaches that fail to provide the detailed outcome of high-resolution multi-scale networks. In this paper, we hence propose CCMR: a high-resolution coarse-to-fine approach that leverages attention-based motion grouping concepts to multi-scale optical flow estimation. CCMR relies on a hierarchical two-step attention-based context-motion grouping strategy that first computes global multi-scale context features and then uses them to guide the actual motion grouping. As we iterate both steps over all coarse-to-fine scales, we adapt cross covariance image transformers to allow for an efficient realization while maintaining scale-dependent properties. Experiments and ablations demonstrate that our efforts of combining multi-scale and attention-based concepts pay off. By providing highly detailed flow fields with strong improvements in both occluded and non-occluded regions, our CCMR approach not only outperforms both the corresponding single-scale attention-based and multi-scale attention-free baselines by up to 23.0% and 21.6%, respectively, it also achieves state-of-the-art results, ranking first on KITTI 2015 and second on MPI Sintel Clean and Final. Code and trained models are available at https://github.com/cv-stuttgart/CCMR.

Region of Interest (ROI) based adaptive cross-layer system for real-time video streaming over Vehicular Ad-hoc NETworks (VANETs)

  • paper_url: http://arxiv.org/abs/2311.02656
  • repo_url: None
  • paper_authors: Mohamed Aymen Labiod, Mohamed Gharbi, François-Xavier Coudoux, Patrick Corlay
  • for: Improving vehicular video transmission quality to strengthen perception of the driving environment.
  • methods: Applies an adaptive cross-layer mapping of Region-of-Interest (ROI) visual data packets at the IEEE 802.11p MAC layer, giving highest priority to the scene regions on which perception of the driving environment is based.
  • results: Realistic VANET simulations show that, for HEVC-compressed video communications, the proposed system offers PSNR gains of up to 11 dB on the ROI part.
    Abstract Nowadays, real-time vehicle applications increasingly rely on video acquisition and processing to detect or even identify vehicles and obstacles in the driving environment. In this letter, we propose an algorithm that allows reinforcing these operations by improving end-to-end video transmission quality in a vehicular context. The proposed low complexity solution gives highest priority to the scene regions of interest (ROI) on which the perception of the driving environment is based on. This is done by applying an adaptive cross-layer mapping of the ROI visual data packets at the IEEE 802.11p MAC layer. Realistic VANET simulation results demonstrate that for HEVC compressed video communications, the proposed system offers PSNR gains up to 11dB on the ROI part.

Generative Face Video Coding Techniques and Standardization Efforts: A Review

  • paper_url: http://arxiv.org/abs/2311.02649
  • repo_url: None
  • paper_authors: Bolin Chen, Jie Chen, Shiqi Wang, Yan Ye
  • for: Surveys recent advances in generative face video coding (GFVC) techniques and standardization efforts, targeting high-quality face video communication in ultra-low-bandwidth scenarios.
  • methods: Generalizes GFVC systems within one coding framework, summarizes different GFVC algorithms with their corresponding visual representations, and reviews the standardization activities specified with supplemental enhancement information messages.
  • results: Discusses fundamental challenges, broad applications, and standardization potential of GFVC techniques, and envisions their future trends.
    Abstract Generative Face Video Coding (GFVC) techniques can exploit the compact representation of facial priors and the strong inference capability of deep generative models, achieving high-quality face video communication in ultra-low bandwidth scenarios. This paper conducts a comprehensive survey on the recent advances of the GFVC techniques and standardization efforts, which could be applicable to ultra low bitrate communication, user-specified animation/filtering and metaverse-related functionalities. In particular, we generalize GFVC systems within one coding framework and summarize different GFVC algorithms with their corresponding visual representations. Moreover, we review the GFVC standardization activities that are specified with supplemental enhancement information messages. Finally, we discuss fundamental challenges and broad applications on GFVC techniques and their standardization potentials, as well as envision their future trends. The project page can be found at https://github.com/Berlin0610/Awesome-Generative-Face-Video-Coding.

An Approach for Multi-Object Tracking with Two-Stage Min-Cost Flow

  • paper_url: http://arxiv.org/abs/2311.02642
  • repo_url: None
  • paper_authors: Huining Li, Yalong Jiang, Xianlin Zeng, Feng Li, Zhipeng Wang
  • for: Proposes a two-stage tracking pipeline that accurately tracks multiple targets in video while reducing the impact of occlusions.
  • methods: Uses a minimum network flow algorithm in two stages: the first stage takes high-confidence detections as input to obtain candidate tracklets and uses an intersection mask built from tracklet intersections to locate their inaccurate parts; the second stage uses low-confidence detections, which may stem from occlusions, to correct those tracklets via a second round of min-cost flow.
  • results: Achieves 78.4 MOTA on the MOT16 test set, 79.2 on MOT17, and 76.4 on MOT20, demonstrating the effectiveness of the proposed method.
    Abstract The minimum network flow algorithm is widely used in multi-target tracking. However, the majority of the present methods concentrate exclusively on minimizing cost functions whose values may not indicate accurate solutions under occlusions. In this paper, by exploiting the properties of tracklets intersections and low-confidence detections, we develop a two-stage tracking pipeline with an intersection mask that can accurately locate inaccurate tracklets which are corrected in the second stage. Specifically, we employ the minimum network flow algorithm with high-confidence detections as input in the first stage to obtain the candidate tracklets that need correction. Then we leverage the intersection mask to accurately locate the inaccurate parts of candidate tracklets. The second stage utilizes low-confidence detections that may be attributed to occlusions for correcting inaccurate tracklets. This process constructs a graph of nodes in inaccurate tracklets and low-confidence nodes and uses it for the second round of minimum network flow calculation. We perform sufficient experiments on popular MOT benchmark datasets and achieve 78.4 MOTA on the test set of MOT16, 79.2 on MOT17, and 76.4 on MOT20, which shows that the proposed method is effective.
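As context for the first stage, a toy min-cost-flow data association over high-confidence detections can be set up as below using networkx: each detection becomes a unit-capacity in/out node pair, and source-to-sink flow paths correspond to tracklets. The graph construction, costs, intersection mask, and second-stage correction used in the paper are not reproduced here.

```python
import networkx as nx

def min_cost_flow_association(detections, link_cost, use_reward=-10, num_tracks=2):
    """Toy data association via min-cost network flow.
    `detections`: list of (frame_idx, box) tuples; `link_cost(a, b)`: integer cost of
    linking detection a to detection b in the next frame; `use_reward`: negative weight
    rewarding coverage of a detection; `num_tracks`: units of source->sink flow."""
    G = nx.DiGraph()
    for i, (frame, _) in enumerate(detections):
        G.add_edge(("in", i), ("out", i), capacity=1, weight=use_reward)  # observation edge
        G.add_edge("S", ("in", i), capacity=1, weight=0)                  # a tracklet may start here
        G.add_edge(("out", i), "T", capacity=1, weight=0)                 # a tracklet may end here
    for i, a in enumerate(detections):
        for j, b in enumerate(detections):
            if b[0] == a[0] + 1:                                          # transition to the next frame
                G.add_edge(("out", i), ("in", j), capacity=1, weight=link_cost(a, b))
    G.add_node("S", demand=-num_tracks)
    G.add_node("T", demand=num_tracks)
    flow = nx.min_cost_flow(G)
    return [(u[1], v[1]) for u, nbrs in flow.items() for v, f in nbrs.items()
            if f > 0 and isinstance(u, tuple) and isinstance(v, tuple)
            and u[0] == "out" and v[0] == "in"]          # pairs (i, j): detection i links to j

# Toy example: two detections per frame over three frames, cost = squared id gap.
dets = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print(min_cost_flow_association(dets, lambda a, b: (a[1] - b[1]) ** 2))
```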

The Background Also Matters: Background-Aware Motion-Guided Objects Discovery

  • paper_url: http://arxiv.org/abs/2311.02633
  • repo_url: None
  • paper_authors: Sandra Kara, Hejer Ammar, Florian Chabot, Quoc-Cuong Pham
  • for: Improving object discovery in video data by handling the background properly.
  • methods: Leverages masks of moving objects extracted from optical flow and a learning mechanism that extends them to the true foreground (both moving and static objects), isolating the background so that object discovery and object/non-object separation are learned jointly.
  • results: Experiments on synthetic and real-world datasets show that integrating the proposed background handling with various cutting-edge methods considerably improves object discovery performance and establishes a strong baseline for object/non-object separation.
    Abstract Recent works have shown that objects discovery can largely benefit from the inherent motion information in video data. However, these methods lack a proper background processing, resulting in an over-segmentation of the non-object regions into random segments. This is a critical limitation given the unsupervised setting, where object segments and noise are not distinguishable. To address this limitation we propose BMOD, a Background-aware Motion-guided Objects Discovery method. Concretely, we leverage masks of moving objects extracted from optical flow and design a learning mechanism to extend them to the true foreground composed of both moving and static objects. The background, a complementary concept of the learned foreground class, is then isolated in the object discovery process. This enables a joint learning of the objects discovery task and the object/non-object separation. The conducted experiments on synthetic and real-world datasets show that integrating our background handling with various cutting-edge methods brings each time a considerable improvement. Specifically, we improve the objects discovery performance with a large margin, while establishing a strong baseline for object/non-object separation.

Neural Networks Are Implicit Decision Trees: The Hierarchical Simplicity Bias

  • paper_url: http://arxiv.org/abs/2311.02622
  • repo_url: None
  • paper_authors: Zhehang Du
  • for: Investigates the simplicity bias of neural networks, which rely on simpler features while ignoring more complex ones, even when the complex features are equally predictive.
  • methods: Introduces imbalanced label coupling, a novel approach for studying scenarios where simple and complex features exhibit different levels of predictive power, and trains neural networks on these scenarios to analyze how predictions track the ascending complexity of input features.
  • results: Trained networks make predictions that align with the ascending complexity of input features according to how they correlate with the label in the training set, regardless of the underlying predictive power; even when simple spurious features distort predictions in CIFAR-10, the networks still learn core features. Last-layer retraining with the target data distribution is effective yet insufficient to fully recover core features when spurious features are perfectly correlated with the target labels in the synthetic dataset.
    Abstract Neural networks exhibit simplicity bias; they rely on simpler features while ignoring equally predictive but more complex features. In this work, we introduce a novel approach termed imbalanced label coupling to investigate scenarios where simple and complex features exhibit different levels of predictive power. In these cases, complex features still contribute to predictions. The trained networks make predictions in alignment with the ascending complexity of input features according to how they correlate with the label in the training set, irrespective of the underlying predictive power. For instance, even when simple spurious features distort predictions in CIFAR-10, most cats are predicted to be dogs, and most trucks are predicted to be automobiles! This observation provides direct evidence that the neural network learns core features in the presence of spurious features. We empirically show that last-layer retraining with target data distribution is effective, yet insufficient to fully recover core features when spurious features are perfectly correlated with the target labels in our synthetic dataset. We hope our research contributes to a deeper understanding of the implicit bias of neural networks.

TFNet: Tuning Fork Network with Neighborhood Pixel Aggregation for Improved Building Footprint Extraction

  • paper_url: http://arxiv.org/abs/2311.02617
  • repo_url: None
  • paper_authors: Muhammad Ahmad Waseem, Muhammad Tahir, Zubair Khalid, Momin Uppal
  • for: Extracting building footprints from satellite imagery, a task critical for many urban planning and decision-making applications.
  • methods: Proposes the Tuning Fork Network (TFNet), a deep semantic segmentation design that performs well on both widely spaced and closely packed buildings; TFNet uses a single encoder followed by two parallel decoders that separately reconstruct the building footprint and the building edge, coupled with a novel methodology for incorporating neighborhood information at tile boundaries during training.
  • results: On SpaceNet2, WHU, and a dataset from Lahore, Pakistan that captures closely connected buildings, the proposed method significantly outperforms benchmark methods.
    Abstract This paper considers the problem of extracting building footprints from satellite imagery -- a task that is critical for many urban planning and decision-making applications. While recent advancements in deep learning have made great strides in automated detection of building footprints, state-of-the-art methods available in existing literature often generate erroneous results for areas with densely connected buildings. Moreover, these methods do not incorporate the context of neighborhood images during training thus generally resulting in poor performance at image boundaries. In light of these gaps, we propose a novel Tuning Fork Network (TFNet) design for deep semantic segmentation that not only performs well for widely-spaced building but also has good performance for buildings that are closely packed together. The novelty of TFNet architecture lies in a a single encoder followed by two parallel decoders to separately reconstruct the building footprint and the building edge. In addition, the TFNet design is coupled with a novel methodology of incorporating neighborhood information at the tile boundaries during the training process. This methodology further improves performance, especially at the tile boundaries. For performance comparisons, we utilize the SpaceNet2 and WHU datasets, as well as a dataset from an area in Lahore, Pakistan that captures closely connected buildings. For all three datasets, the proposed methodology is found to significantly outperform benchmark methods.
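The "tuning fork" layout, one shared encoder feeding two parallel decoders for the footprint and the edge, can be sketched as below. Depth, channel counts, and layer choices are illustrative and much simpler than TFNet; the neighborhood-aggregation scheme at tile boundaries is not shown.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TuningForkSegmenter(nn.Module):
    """Minimal 'tuning fork' layout: one shared encoder, two parallel decoders that
    predict the building-footprint mask and the building-edge mask from the same features."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(in_ch, width), nn.MaxPool2d(2),
                                     conv_block(width, width * 2))
        def decoder():
            return nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                 conv_block(width * 2, width), nn.Conv2d(width, 1, 1))
        self.footprint_head = decoder()   # reconstructs the building footprint
        self.edge_head = decoder()        # reconstructs the building edge

    def forward(self, x):
        feats = self.encoder(x)
        return self.footprint_head(feats), self.edge_head(feats)

model = TuningForkSegmenter()
footprint_logits, edge_logits = model(torch.randn(1, 3, 256, 256))
```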

Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.02612
  • repo_url: https://github.com/zhangzjn/GPT-4V-AD
  • paper_authors: Jiangning Zhang, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu
  • for: Explores the grounding potential of VQA-oriented GPT-4V for zero-shot visual anomaly detection (AD), with qualitative and quantitative evaluations on the MVTec AD and VisA datasets.
  • methods: The proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation, covering both image- and pixel-level assessment.
  • results: Through the VQA paradigm, GPT-4V achieves image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6 on MVTec AD and VisA in the zero-shot AD task, but a gap remains to state-of-the-art zero-shot methods such as WinCLIP and CLIP-AD, calling for further research.
    Abstract Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V in the recently popular visual Anomaly Detection (AD) and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Considering that this task requires both image-/pixel-level evaluations, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, 3) Text2Segmentation for easy quantitative evaluation, and makes several different attempts for comparative analysis. The results show that GPT-4V can achieve certain results in the zero-shot AD task through a VQA paradigm, such as achieving image-level 77.1/88.0 and pixel-level 68.0/76.6 AU-ROCs on MVTec AD and VisA datasets, respectively. However, its performance still has a certain gap compared to the state-of-the-art zero-shot methods, e.g., WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for the research of VQA-oriented LMM in the zero-shot AD task, and we also post several possible future works. Code is available at \url{https://github.com/zhangzjn/GPT-4V-AD}.
    摘要 大型多模态模型(LMM)GPT-4V(ision) 为 GPT-4 赋予了视觉基础能力,使其可以通过视觉问答(VQA)范式处理某些任务。本文探讨 VQA 导向的 GPT-4V 在最近受关注的视觉异常检测(AD)任务中的潜力,并首次在流行的 MVTec AD 和 VisA 数据集上进行定性与定量评估。由于该任务需要图像级与像素级评估,提出的 GPT-4V-AD 框架包含三个组件:1) 粒度区域划分,2) 提示设计,3) Text2Segmentation,以便于定量评估,并进行了一些不同的对比性尝试。结果显示,GPT-4V 可以通过 VQA 范式在零样本 AD 任务中取得一定效果,例如在 MVTec AD 和 VisA 数据集上分别达到图像级 77.1/88.0 和像素级 68.0/76.6 的 AU-ROC。然而,其性能与最先进的零样本方法(如 WinCLIP 和 CLIP-AD)仍有一定差距,需要进一步研究。本研究为 VQA 导向的 LMM 在零样本 AD 任务中的研究提供了基线参考,并提出了一些可能的未来工作。代码可以在 \url{https://github.com/zhangzjn/GPT-4V-AD} 获取。
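
A hypothetical sketch of the three-stage VQA-style pipeline (region division, prompting, Text2Segmentation) might look as follows; `query_gpt4v` is only a stand-in for whatever multimodal API is actually used, and the grid granularity and prompt wording are assumptions.

```python
from typing import List, Dict
import numpy as np

def divide_regions(h: int, w: int, grid: int = 4) -> List[Dict]:
    """1) Granular Region Division: split the image into a labelled grid of regions."""
    regions = []
    for i in range(grid):
        for j in range(grid):
            regions.append({
                "id": i * grid + j,
                "box": (i * h // grid, j * w // grid, (i + 1) * h // grid, (j + 1) * w // grid),
            })
    return regions

def build_prompt(regions: List[Dict]) -> str:
    """2) Prompt Designing: ask which numbered regions look anomalous (wording is assumed)."""
    return (
        "The image is divided into numbered regions "
        f"{[r['id'] for r in regions]}. For each region, say whether it contains a defect, "
        "and list the anomalous region ids."
    )

def text2segmentation(answer_ids: List[int], regions: List[Dict], h: int, w: int) -> np.ndarray:
    """3) Text2Segmentation: turn the textual answer back into a pixel-level mask."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for r in regions:
        if r["id"] in answer_ids:
            y0, x0, y1, x1 = r["box"]
            mask[y0:y1, x0:x1] = 1
    return mask

# Usage with a stubbed model response:
h, w = 224, 224
regions = divide_regions(h, w)
prompt = build_prompt(regions)
# answer_ids = query_gpt4v(image, prompt)   # hypothetical multimodal call, not a real client API
answer_ids = [5, 6]                          # stubbed answer for illustration
anomaly_mask = text2segmentation(answer_ids, regions, h, w)
print(anomaly_mask.sum())
```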

Deep Learning-based 3D Point Cloud Classification: A Systematic Survey and Outlook

  • paper_url: http://arxiv.org/abs/2311.02608
  • repo_url: None
  • paper_authors: Huang Zhang, Changshuo Wang, Shengwei Tian, Baoli Lu, Liping Zhang, Xin Ning, Xiao Bai
  • for: 本研究的目的是为相关领域的研究人员提供点云分类的最新研究进展和未来趋势。
  • methods: 本文回顾了点云数据的获取、特点和挑战,然后介绍了常用的3D数据表示方法、存储格式、常用数据集以及基于深度学习的点云分类方法。
  • results: 本文对主要方法的性能进行了比较和分析,并讨论了点云分类面临的一些挑战和未来方向。
    Abstract In recent years, point cloud representation has become one of the research hotspots in the field of computer vision, and has been widely used in many fields, such as autonomous driving, virtual reality, robotics, etc. Although deep learning techniques have achieved great success in processing regular structured 2D grid image data, there are still great challenges in processing irregular, unstructured point cloud data. Point cloud classification is the basis of point cloud analysis, and many deep learning-based methods have been widely used in this task. Therefore, the purpose of this paper is to provide researchers in this field with the latest research progress and future trends. First, we introduce point cloud acquisition, characteristics, and challenges. Second, we review 3D data representations, storage formats, and commonly used datasets for point cloud classification. We then summarize deep learning-based methods for point cloud classification and complement recent research work. Next, we compare and analyze the performance of the main methods. Finally, we discuss some challenges and future directions for point cloud classification.
    摘要 近年来,点云表示已成为计算机视觉领域的研究热点之一,在自动驾驶、虚拟现实、机器人等领域都有广泛的应用。虽然深度学习技术在处理规则结构的2D网格图像数据上已经取得了很大的成功,但在处理不规则、无结构的点云数据时仍然存在很大的挑战。点云分类是点云分析的基础,许多基于深度学习的方法已被广泛用于这一任务。因此,本文的目的是为这个领域的研究人员提供最新的研究进展和未来趋势。首先,我们介绍点云的获取、特点和挑战。其次,我们回顾3D数据表示、存储格式和常用的点云分类数据集。然后,我们总结了基于深度学习的点云分类方法,并补充最近的研究工作。接着,我们比较和分析主要方法的性能。最后,我们讨论了点云分类的一些挑战和未来方向。

Optimizing Implicit Neural Representations from Point Clouds via Energy-Based Models

  • paper_url: http://arxiv.org/abs/2311.02601
  • repo_url: None
  • paper_authors: Ryutaro Yamauchi, Jinya Sakurai, Ryo Furukawa, Tatsushi Matsubayashi
  • for: 从无方向(unoriented)的3D点云重建连续表面
  • methods: 使用基于能量的模型(EBM)优化隐式神经表示(INR)
  • results: 提高了对点云噪声的鲁棒性
    Abstract Reconstructing a continuous surface from an unoriented 3D point cloud is a fundamental task in 3D shape processing. In recent years, several methods have been proposed to address this problem using implicit neural representations (INRs). In this study, we propose a method to optimize INRs using energy-based models (EBMs). By employing the absolute value of the coordinate-based neural networks as the energy function, the INR can be optimized through the estimation of the point cloud distribution by the EBM. In addition, appropriate parameter settings of the EBM enable the model to consider the magnitude of point cloud noise. Our experiments confirmed that the proposed method is more robust against point cloud noise than conventional surface reconstruction methods.
    摘要 从无方向的3D点云重建连续表面是3D形状处理中的一项基本任务。近年来,一些方法提出使用隐式神经表示(INR)来解决这一问题。在本研究中,我们提出一种使用基于能量的模型(EBM)来优化 INR 的方法。通过将基于坐标的神经网络输出的绝对值作为能量函数,可以借助 EBM 对点云分布的估计来优化 INR。此外,适当的 EBM 参数设置使模型能够考虑点云噪声的大小。实验证实,所提方法比传统的表面重建方法对点云噪声更加鲁棒。
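
The core idea, treating the absolute value of a coordinate MLP as an energy and lowering it on observed points, could be sketched roughly as below; the contrastive surrogate loss, noise scale, and network sizes are assumptions and likely differ from the paper's actual EBM objective and sampler.

```python
import torch
import torch.nn as nn

# Sketch: a coordinate MLP f(x) whose absolute value |f(x)| acts as an energy
# (an unsigned-distance-like field), pushed low on observed surface points and
# kept away from zero on perturbed negative samples.

class CoordMLP(nn.Module):
    def __init__(self, hidden=256, layers=4):
        super().__init__()
        dims = [3] + [hidden] * layers + [1]
        mods = []
        for a, b in zip(dims[:-1], dims[1:]):
            mods += [nn.Linear(a, b), nn.Softplus(beta=100)]
        self.net = nn.Sequential(*mods[:-1])  # drop the last activation; output is a scalar

    def forward(self, x):
        return self.net(x)

def energy(f, x):
    return f(x).abs()  # |f(x)| used as the energy / unsigned-distance proxy

f = CoordMLP()
opt = torch.optim.Adam(f.parameters(), lr=1e-4)
points = torch.rand(2048, 3) * 2 - 1           # stand-in for an unoriented point cloud

for step in range(100):
    pos = points[torch.randint(len(points), (512,))]
    neg = pos + 0.05 * torch.randn_like(pos)    # negatives drawn near the data (assumed noise scale)
    loss = energy(f, pos).mean() + torch.relu(0.05 - energy(f, neg)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```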

Learning Class and Domain Augmentations for Single-Source Open-Domain Generalization

  • paper_url: http://arxiv.org/abs/2311.02599
  • repo_url: None
  • paper_authors: Prathmesh Bele, Valay Bundele, Avigyan Bhattacharya, Ankit Jha, Gemma Roig, Biplab Banerjee
  • for: 解决单源开放域泛化(SS-ODG)问题:训练时只有带标注的单一源域,测试时面对包含未知类别的无标注新目标域。
  • methods: 我们提出了一个名为 SODG-Net 的新框架,它使用基于学习的目标函数同时合成新的域并生成伪开放样本,而不是文献中常见的临时(ad-hoc)混合策略。该方法通过新的度量准则使已知类别样本的风格多样化,并生成多样的伪开放样本,训练出能够同时处理开放集与封闭集数据的统一且自信的多类分类器。
  • results: 在多个 benchmark 上进行的大量实验评估一致表明,SODG-Net 的性能优于文献中的现有方法。
    Abstract Single-source open-domain generalization (SS-ODG) addresses the challenge of labeled source domains with supervision during training and unlabeled novel target domains during testing. The target domain includes both known classes from the source domain and samples from previously unseen classes. Existing techniques for SS-ODG primarily focus on calibrating source-domain classifiers to identify open samples in the target domain. However, these methods struggle with visually fine-grained open-closed data, often misclassifying open samples as closed-set classes. Moreover, relying solely on a single source domain restricts the model's ability to generalize. To overcome these limitations, we propose a novel framework called SODG-Net that simultaneously synthesizes novel domains and generates pseudo-open samples using a learning-based objective, in contrast to the ad-hoc mixing strategies commonly found in the literature. Our approach enhances generalization by diversifying the styles of known class samples using a novel metric criterion and generates diverse pseudo-open samples to train a unified and confident multi-class classifier capable of handling both open and closed-set data. Extensive experimental evaluations conducted on multiple benchmarks consistently demonstrate the superior performance of SODG-Net compared to the literature.
    摘要 单源开放域泛化(SS-ODG)解决的是训练时只有带监督标注的源域、测试时面对无标注新目标域的挑战。目标域既包含源域中的已知类别,也包含来自此前未见类别的样本。现有的 SS-ODG 技术主要通过校准源域分类器来识别目标域中的开放样本,但这些方法在视觉上细粒度的开放-封闭数据上表现不佳,常把开放样本误分类为封闭集类别。此外,仅依赖单一源域也限制了模型的泛化能力。为克服这些局限,我们提出了名为 SODG-Net 的新框架,它使用基于学习的目标函数同时合成新的域并生成伪开放样本,而非文献中常见的临时混合策略。我们的方法通过新的度量准则使已知类别样本的风格多样化以增强泛化能力,并生成多样的伪开放样本,训练出能够同时处理开放集与封闭集数据的统一且自信的多类分类器。在多个 benchmark 上进行的大量实验评估一致表明,SODG-Net 的性能优于文献中的方法。

Synthetic Tumor Manipulation: With Radiomics Features

  • paper_url: http://arxiv.org/abs/2311.02586
  • repo_url: None
  • paper_authors: Inye Na, Jonghun Kim, Hyunjin Park
  • for: 用于生成可精确控制、并可对肿瘤子区域进行个别操控的合成肿瘤
  • methods: 使用生成对抗网络、基于 radiomics 特征的条件化(conditioning)和多任务学习
  • results: 能够生成多样、逼真的肿瘤图像,并可针对特定 radiomics 特征(如 'Pixel Surface' 和 'Shape Sphericity')进行细致调整
    Abstract We introduce RadiomicsFill, a synthetic tumor generator conditioned on radiomics features, enabling detailed control and individual manipulation of tumor subregions. This conditioning leverages conventional high-dimensional features of the tumor (i.e., radiomics features) and thus is biologically well-grounded. Our model combines generative adversarial networks, radiomics-feature conditioning, and multi-task learning. Through experiments with glioma patients, RadiomicsFill demonstrated its capability to generate diverse, realistic tumors and its fine-tuning ability for specific radiomics features like 'Pixel Surface' and 'Shape Sphericity'. The ability of RadiomicsFill to generate an unlimited number of realistic synthetic tumors offers notable prospects for both advancing medical imaging research and potential clinical applications.
    摘要 我们介绍 RadiomicsFill,一种以 radiomics 特征为条件的合成肿瘤生成器,能够对肿瘤子区域进行细致控制和个别操控。这种条件化利用了肿瘤的传统高维特征(即 radiomics 特征),因此具有良好的生物学依据。我们的模型结合了生成对抗网络、radiomics 特征条件化和多任务学习。通过在胶质瘤患者上的实验,RadiomicsFill 展示了生成多样、逼真肿瘤的能力,以及针对 'Pixel Surface' 和 'Shape Sphericity' 等特定 radiomics 特征进行微调的能力。RadiomicsFill 能够生成数量不限的逼真合成肿瘤,这为推进医学影像研究和潜在的临床应用提供了重要前景。

SSL-DG: Rethinking and Fusing Semi-supervised Learning and Domain Generalization in Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.02583
  • repo_url: https://github.com/yezanting/ssl-dg
  • paper_authors: Zanting Ye
  • for: 这篇论文旨在提出一种基于深度学习的医学影像分割方法,以应对标注数据有限以及域偏移(domain shift)的问题。
  • methods: 本论文融合了半监督学习(SSL)与域泛化(DG)两种方法,具体来说是使用类别级表示将未见目标数据表示为源数据的线性组合,并通过数据增强实现跨域泛化。
  • results: 实验结果显示,与最先进方法相比,本论文的方法在两个具有挑战性的 DG 任务中表现出色,并具有更好的一致性和可靠性。
    Abstract Deep learning-based medical image segmentation is an essential yet challenging task in clinical practice, which arises from restricted access to annotated data coupled with the occurrence of domain shifts. Previous attempts have focused on isolated solutions, while disregarding their inter-connectedness. In this paper, we rethink the relationship between semi-supervised learning (SSL) and domain generalization (DG), which are the cutting-edge approaches to address the annotated data-driven constraints and the domain shift issues. Inspired by class-level representation, we show that unseen target data can be represented by a linear combination of source data, which can be achieved by simple data augmentation. The augmented data enrich domain distributions while having semantic consistency, aligning with the principles of consistency-based SSL. Accordingly, we propose SSL-DG, fusing DG and SSL, to achieve cross-domain generalization with limited annotations. Specifically, the global and focal region augmentation, together with an augmentation scale-balancing mechanism, are used to construct a mask-based domain diffusion augmentation module to significantly enrich domain diversity. In order to obtain consistent predictions for the same source data in different networks, we use uncertainty estimation and a deep mutual learning strategy to enforce the consistent constraint. Extensive experiments including ablation studies are designed to validate the proposed SSL-DG. The results demonstrate that our SSL-DG significantly outperforms state-of-the-art solutions in two challenging DG tasks with limited annotations. Code is available at https://github.com/yezanting/SSL-DG.
    摘要 基于深度学习的医学图像分割是临床实践中必不可少却又具有挑战性的任务,其难点来自标注数据获取受限以及域偏移(domain shift)的出现。先前的尝试大多采用彼此孤立的解决方案,而忽视了它们之间的内在联系。在这篇论文中,我们重新思考了半监督学习(SSL)与域泛化(DG)之间的关系,这两者分别是应对标注数据受限和域偏移问题的前沿方法。受类别级表示的启发,我们表明未见的目标数据可以表示为源数据的线性组合,而这可以通过简单的数据增强实现。增强后的数据在保持语义一致性的同时丰富了域分布,符合基于一致性的 SSL 原理。据此,我们提出 SSL-DG,将 DG 与 SSL 融合,以在有限标注下实现跨域泛化。具体来说,我们利用全局与焦点区域增强以及增强尺度平衡机制,构建基于掩码的域扩散增强模块,以显著丰富域的多样性。为了使同一源数据在不同网络中获得一致的预测,我们使用不确定性估计和深度相互学习策略来施加一致性约束。我们设计了包括消融研究在内的大量实验来验证所提出的 SSL-DG。结果表明,在两个具有挑战性的有限标注 DG 任务中,SSL-DG 显著优于当前最先进的方法。代码可在 https://github.com/yezanting/SSL-DG 获取。
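
One simple way to realize a mask-based, linear-combination-style augmentation in the spirit of the description is sketched below; the rectangular mask, the crude per-channel style statistic, and the mixing weight are assumptions, and the paper's global/focal augmentation with scale balancing is more involved.

```python
import numpy as np

def mask_mix(img_a: np.ndarray, img_b: np.ndarray, alpha: float = 0.5):
    """Blend appearance statistics of img_b into a masked region of img_a,
    keeping img_a's structure (and hence its segmentation labels) intact."""
    h, w = img_a.shape[:2]
    mask = np.zeros((h, w, 1), dtype=np.float32)
    y0, x0 = np.random.randint(0, h // 2), np.random.randint(0, w // 2)
    mask[y0:y0 + h // 2, x0:x0 + w // 2] = 1.0             # focal region to restyle
    style = img_b.mean(axis=(0, 1), keepdims=True)          # crude appearance statistic of img_b
    mixed = img_a * (1 - alpha * mask) + style * (alpha * mask)
    return mixed.astype(img_a.dtype)

img_a = np.random.rand(128, 128, 3).astype(np.float32)     # source image (labels kept)
img_b = np.random.rand(128, 128, 3).astype(np.float32)     # another source image providing "style"
aug = mask_mix(img_a, img_b)
print(aug.shape)  # (128, 128, 3)
```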

Group Testing for Accurate and Efficient Range-Based Near Neighbor Search : An Adaptive Binary Splitting Approach

  • paper_url: http://arxiv.org/abs/2311.02573
  • repo_url: None
  • paper_authors: Kashish Mittal, Harsh Shah, Ajit Rajwade
  • for: 这篇论文针对高维近邻搜索(Near Neighbor Search, NNS)问题提出了一个自适应分组测试(group testing)框架。
  • methods: 该方法基于余弦距离阈值将库中每个元素标记为近邻或非近邻,而无需穷举搜索。它采用多阶段自适应分组测试(二分拆分)算法,将待搜索集合逐步对半拆分,并在越来越小的子集上进行点积测试,其中许多子集可以被整体剪枝,从而节省时间。
  • results: 实验结果显示,该方法在多个大型数据集上相对穷举搜索可提速超过 10 倍,且精度与穷举搜索相同。此外,论文还给出了理论分析,刻画了每次查询的期望距离计算次数以及含有一定数量成员的 pool 被剪枝的概率。
    Abstract This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. The proposed method detects high-similarity vectors from an extensive collection of high dimensional vectors, where each vector represents an image descriptor. Our method efficiently marks each item in the collection as neighbor or non-neighbor on the basis of a cosine distance threshold without exhaustive search. Like other methods in the domain of large scale retrieval, our approach exploits the assumption that most of the items in the collection are unrelated to the query. Unlike other methods, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following the procedure of binary splitting, a multi-stage adaptive group testing algorithm, we split the set of items to be searched into half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We experimentally show that our method achieves a speed-up over exhaustive search by a factor of more than ten with an accuracy same as that of exhaustive search, on a variety of large datasets. We present a theoretical analysis of the expected number of distance computations per query and the probability that a pool with a certain number of members will be pruned. In this way, our method exploits very useful and practical distributional properties unlike other methods. In our method, all required data structures are created purely offline. Moreover, our method does not impose any strong assumptions on the number of true near neighbors, is adaptable to streaming settings where new vectors are dynamically added to the database, and does not require any parameter tuning.
    摘要 本文针对基于范围的高维近邻搜索问题提出了一个自适应分组测试框架。所提方法可以从大规模高维向量集合(每个向量代表一个图像描述子)中检测出高相似度向量,并基于余弦距离阈值将集合中的每个元素标记为近邻或非近邻,而无需穷举搜索。与大规模检索领域的其他方法一样,我们的方法利用了集合中大多数元素与查询无关这一假设;与其他方法不同,它并不假设查询向量与最不相关近邻的余弦相似度和与最不相关非近邻的余弦相似度之间存在很大差距。按照二分拆分(一种多阶段自适应分组测试算法)的流程,我们在每一步将待搜索集合对半拆分,并在越来越小的子集上进行点积测试,其中许多子集可以被直接剪枝。实验表明,在多个大型数据集上,我们的方法相对穷举搜索提速超过 10 倍,且精度与穷举搜索相同。我们还给出了理论分析,刻画了每次查询的期望距离计算次数以及含有一定数量成员的 pool 被剪枝的概率。由此,我们的方法利用了非常有用且实际的分布性质。该方法所需的所有数据结构都可以完全离线构建;它对真实近邻数量不做任何强假设,可适应向数据库动态加入新向量的流式场景,并且无需参数调优。
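
The adaptive binary-splitting idea can be sketched as follows: the query's dot product with the sum of a pool of normalized, non-negative vectors upper-bounds every member's individual similarity, so a pool whose pooled score falls below the threshold can be pruned as a whole; the non-negativity assumption and the plain recursion are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def group_test_search(query, database, threshold):
    """Range-based near neighbor search via adaptive binary splitting of pools."""
    query = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    neighbors = []

    def recurse(indices):
        pooled = db[indices].sum(axis=0)
        if query @ pooled < threshold:          # pooled score too low: prune the whole pool
            return
        if len(indices) == 1:                   # a single surviving item is a neighbor
            neighbors.append(indices[0])
            return
        mid = len(indices) // 2                 # binary splitting
        recurse(indices[:mid])
        recurse(indices[mid:])

    recurse(list(range(len(db))))
    return neighbors

# Usage with non-negative descriptors (e.g. ReLU features / histograms):
rng = np.random.default_rng(0)
database = rng.random((1000, 64))
query = database[42] + 0.01 * rng.random(64)
print(group_test_search(query, database, threshold=0.95))
```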

Multiple Object Tracking based on Occlusion-Aware Embedding Consistency Learning

  • paper_url: http://arxiv.org/abs/2311.02572
  • repo_url: None
  • paper_authors: Yaoqi Hu, Axi Niu, Yu Zhu, Qingsen Yan, Jinqiu Sun, Yanning Zhang
  • for: 多目标跟踪中由遮挡(occlusion)导致的跟踪中断问题
  • methods: 利用视觉嵌入的一致性来解决遮挡导致的跟踪中断
  • results: 在不同遮挡场景下均实现了较高的跟踪性能
    Abstract The Joint Detection and Embedding (JDE) framework has achieved remarkable progress for multiple object tracking. Existing methods often employ extracted embeddings to re-establish associations between new detections and previously disrupted tracks. However, the reliability of embeddings diminishes when the region of the occluded object frequently contains adjacent objects or clutters, especially in scenarios with severe occlusion. To alleviate this problem, we propose a novel multiple object tracking method based on visual embedding consistency, mainly including: 1) Occlusion Prediction Module (OPM) and 2) Occlusion-Aware Association Module (OAAM). The OPM predicts occlusion information for each true detection, facilitating the selection of valid samples for consistency learning of the track's visual embedding. The OAAM leverages occlusion cues and visual embeddings to generate two separate embeddings for each track, guaranteeing consistency in both unoccluded and occluded detections. By integrating these two modules, our method is capable of addressing track interruptions caused by occlusion in online tracking scenarios. Extensive experimental results demonstrate that our approach achieves promising performance levels in both unoccluded and occluded tracking scenarios.
    摘要 联合检测与嵌入(JDE)框架在多目标跟踪中取得了显著进展。现有方法通常利用提取的嵌入来重新建立新检测与先前中断的轨迹之间的关联。然而,当被遮挡目标所在区域频繁包含邻近目标或杂乱背景时,嵌入的可靠性会下降,在严重遮挡的场景中尤为明显。为缓解这一问题,我们提出了一种基于视觉嵌入一致性的新型多目标跟踪方法,主要包括:1) 遮挡预测模块(OPM)和 2) 遮挡感知关联模块(OAAM)。OPM 为每个真实检测预测遮挡信息,便于为轨迹视觉嵌入的一致性学习选择有效样本;OAAM 则利用遮挡线索和视觉嵌入为每条轨迹生成两个独立的嵌入,保证在未遮挡和遮挡两种检测情形下的一致性。通过整合这两个模块,我们的方法能够在在线跟踪场景中应对由遮挡引起的跟踪中断。大量实验结果表明,我们的方法在未遮挡和遮挡跟踪场景中均取得了令人满意的性能。
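
A minimal sketch of occlusion-aware association in the spirit of OPM/OAAM is given below: each track keeps separate embeddings for occluded and unoccluded detections, and the matcher picks one according to a (stubbed) occlusion prediction. The cost threshold and the Hungarian matcher are assumptions, not the paper's exact design.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def associate(tracks, detections, occluded_flags, max_cost=0.6):
    # tracks: dicts holding 'emb_clear' / 'emb_occ'; detections: embedding vectors
    cost = np.zeros((len(tracks), len(detections)))
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            emb = trk["emb_occ"] if occluded_flags[j] else trk["emb_clear"]  # occlusion-aware choice
            cost[i, j] = 1.0 - cosine(emb, det)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]

rng = np.random.default_rng(0)
tracks = [{"emb_clear": rng.normal(size=64), "emb_occ": rng.normal(size=64)} for _ in range(3)]
detections = [tracks[0]["emb_clear"] + 0.05 * rng.normal(size=64),
              tracks[1]["emb_clear"] + 0.05 * rng.normal(size=64),
              tracks[2]["emb_occ"] + 0.05 * rng.normal(size=64)]   # third detection is occluded
print(associate(tracks, detections, occluded_flags=[False, False, True]))  # [(0, 0), (1, 1), (2, 2)]
```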

Rotation Invariant Transformer for Recognizing Object in UAVs

  • paper_url: http://arxiv.org/abs/2311.02559
  • repo_url: None
  • paper_authors: Shuoyi Chen, Mang Ye, Bo Du
  • for: 本研究的目标是提高 UAV 场景下的目标重识别精度,特别是在存在较大旋转变化的情况下。
  • methods: 本研究提出了一种新的旋转不变视觉 Transformer(RotTrans),在 patch 特征层面模拟旋转操作以获得旋转不变性。此外,我们还设计了不变性约束,以建立原始特征与旋转后特征之间的关系,从而获得更强的旋转不变性。
  • results: 所提出的 RotTrans 在最新的 UAV 数据集上显著超越了当前最佳方法,mAP 和 Rank1 分别提高了 5.9% 和 4.8%。此外,该模型在传统城市摄像头的行人重识别任务中也具有竞争力;特别地,我们的方案在 ICCV 2021 举办的 Multi-Modal Video Reasoning and Analyzing Competition 中获得了基于 UAV 的行人重识别赛道第一名。
    Abstract Recognizing a target of interest from the UAVs is much more challenging than the existing object re-identification tasks across multiple city cameras. The images taken by the UAVs usually suffer from significant size difference when generating the object bounding boxes and uncertain rotation variations. Existing methods are usually designed for city cameras, incapable of handling the rotation issue in UAV scenarios. A straightforward solution is to perform the image-level rotation augmentation, but it would cause loss of useful information when inputting the powerful vision transformer as patches. This motivates us to simulate the rotation operation at the patch feature level, proposing a novel rotation invariant vision transformer (RotTrans). This strategy builds on high-level features with the help of the specificity of the vision transformer structure, which enhances the robustness against large rotation differences. In addition, we design an invariance constraint to establish the relationship between the original feature and the rotated features, achieving stronger rotation invariance. Our proposed transformer tested on the latest UAV datasets greatly outperforms the current state-of-the-arts, which is 5.9\% and 4.8\% higher than the highest mAP and Rank1. Notably, our model also performs competitively for the person re-identification task on traditional city cameras. In particular, our solution wins the first place in the UAV-based person re-recognition track in the Multi-Modal Video Reasoning and Analyzing Competition held in ICCV 2021. Code is available at https://github.com/whucsy/RotTrans.
    摘要 从无人机(UAV)视角识别感兴趣目标,比现有的跨多个城市摄像头的目标重识别任务更具挑战性。UAV 拍摄的图像在生成目标边界框时通常存在显著的尺寸差异以及不确定的旋转变化。现有方法大多针对城市摄像头设计,无法处理 UAV 场景中的旋转问题。一种直接的解决方案是进行图像级旋转增强,但当以 patch 形式输入强大的视觉 Transformer 时会造成有用信息的丢失。这促使我们在 patch 特征层面模拟旋转操作,提出一种新的旋转不变视觉 Transformer(RotTrans)。该策略借助视觉 Transformer 结构的特性在高层特征上构建,增强了对较大旋转差异的鲁棒性。此外,我们设计了不变性约束,以建立原始特征与旋转后特征之间的关系,从而实现更强的旋转不变性。所提出的 Transformer 在最新的 UAV 数据集上大幅超越当前最佳方法,比最高的 mAP 和 Rank1 分别高出 5.9% 和 4.8%。值得注意的是,我们的模型在传统城市摄像头的行人重识别任务中也具有竞争力。特别地,我们的方案在 ICCV 2021 举办的 Multi-Modal Video Reasoning and Analyzing Competition 中获得了基于 UAV 的行人重识别赛道第一名。代码见 https://github.com/whucsy/RotTrans。
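
A rough sketch of feature-level rotation with an invariance constraint might look like this; the tiny ViT, the 90-degree grid rotation, and the MSE constraint are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Sketch: rotate the patch-token grid at the feature level (not the input pixels)
# and tie the global feature of rotated tokens to that of the original tokens.

class TinyViT(nn.Module):
    def __init__(self, dim=128, depth=2, grid=8):
        super().__init__()
        self.grid = grid
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # 128x128 -> 8x8 tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def tokens(self, img):
        x = self.patch_embed(img)                         # (B, C, g, g)
        return x.flatten(2).transpose(1, 2)               # (B, g*g, C)

    def rotate_tokens(self, tok, k=1):
        B, N, C = tok.shape
        grid = tok.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        grid = torch.rot90(grid, k, dims=(2, 3))          # rotate the token grid
        return grid.flatten(2).transpose(1, 2)

    def forward(self, img):
        tok = self.tokens(img)
        feat = self.blocks(tok).mean(dim=1)               # global feature of original tokens
        feat_rot = self.blocks(self.rotate_tokens(tok)).mean(dim=1)
        invariance_loss = nn.functional.mse_loss(feat, feat_rot)
        return feat, invariance_loss

model = TinyViT()
feat, inv_loss = model(torch.randn(2, 3, 128, 128))
print(feat.shape, inv_loss.item())
```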

Multi-Agent 3D Map Reconstruction and Change Detection in Microgravity with Free-Flying Robots

  • paper_url: http://arxiv.org/abs/2311.02558
  • repo_url: None
  • paper_authors: Holly Dinkel, Julia Di, Jamie Santos, Keenan Albee, Paulo Borges, Marina Moreira, Oleg Alexandrov, Brian Coltin, Trey Smith
  • for: 这篇论文的目标是让自由飞行机器人能够自主地维护和监测未来的载人空间站。
  • methods: 这篇论文提出了多智能体协作建图与变化检测框架:其中一个智能体根据图像序列及对应的深度信息重建空间站环境的3D模型;另一个智能体定期扫描环境,并与该3D模型进行比对以发现不一致之处。
  • results: 这篇论文使用 Astrobee 机器人在地面测试环境以及国际空间站微重力环境中采集的真实图像和位姿数据,验证了变化检测的有效性。
    Abstract Assistive free-flyer robots autonomously caring for future crewed outposts -- such as NASA's Astrobee robots on the International Space Station (ISS) -- must be able to detect day-to-day interior changes to track inventory, detect and diagnose faults, and monitor the outpost status. This work presents a framework for multi-agent cooperative mapping and change detection to enable robotic maintenance of space outposts. One agent is used to reconstruct a 3D model of the environment from sequences of images and corresponding depth information. Another agent is used to periodically scan the environment for inconsistencies against the 3D model. Change detection is validated after completing the surveys using real image and pose data collected by Astrobee robots in a ground testing environment and from microgravity aboard the ISS. This work outlines the objectives, requirements, and algorithmic modules for the multi-agent reconstruction system, including recommendations for its use by assistive free-flyers aboard future microgravity outposts.
    摘要 自主照护未来载人前哨站的辅助自由飞行机器人 -- 例如 NASA 在国际空间站(ISS)上的 Astrobee 机器人 -- 必须能够检测日常的内部变化,以跟踪库存、检测并诊断故障,并监测前哨站状态。本文提出了多智能体协作建图与变化检测框架,以实现空间前哨站的机器人维护。一个智能体根据图像序列及对应的深度信息重建环境的3D模型;另一个智能体定期扫描环境,并与该3D模型进行比对以发现不一致之处。变化检测的有效性通过 Astrobee 机器人在地面测试环境以及国际空间站微重力环境中采集的真实图像和位姿数据进行了验证。本文概述了多智能体重建系统的目标、要求和算法模块,并就辅助自由飞行机器人在未来微重力前哨站上的使用给出了建议。

IPVNet: Learning Implicit Point-Voxel Features for Open-Surface 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2311.02552
  • repo_url: None
  • paper_authors: Mohammad Samiul Arshad, William J. Beksi
  • for: 重建三维开放表面(例如非水密网格)是计算机视觉中一个尚未被充分探索的领域。
  • methods: 我们提出了一种基于学习的隐式点-体素模型(IPVNet),它同时利用原始点云数据及其离散化的体素表示来预测查询点到表面的无符号距离,可在任意分辨率下重建目标并减少伪影与离群点。
  • results: 在合成与真实数据集上的实验表明,IPVNet 超越了当前最佳方法,同时在重建结果中产生的离群点也少得多。
    Abstract Reconstruction of 3D open surfaces (e.g., non-watertight meshes) is an underexplored area of computer vision. Recent learning-based implicit techniques have removed previous barriers by enabling reconstruction in arbitrary resolutions. Yet, such approaches often rely on distinguishing between the inside and outside of a surface in order to extract a zero level set when reconstructing the target. In the case of open surfaces, this distinction often leads to artifacts such as the artificial closing of surface gaps. However, real-world data may contain intricate details defined by salient surface gaps. Implicit functions that regress an unsigned distance field have shown promise in reconstructing such open surfaces. Nonetheless, current unsigned implicit methods rely on a discretized representation of the raw data. This not only bounds the learning process to the representation's resolution, but it also introduces outliers in the reconstruction. To enable accurate reconstruction of open surfaces without introducing outliers, we propose a learning-based implicit point-voxel model (IPVNet). IPVNet predicts the unsigned distance between a surface and a query point in 3D space by leveraging both raw point cloud data and its discretized voxel counterpart. Experiments on synthetic and real-world public datasets demonstrates that IPVNet outperforms the state of the art while producing far fewer outliers in the resulting reconstruction.
    摘要 从无方向的三维点云重建开放表面(例如非水密网格)是计算机视觉中一个尚未被充分探索的领域。近年来,基于学习的隐式技术通过支持任意分辨率的重建消除了以往的障碍。然而,这类方法在重建目标时通常依赖区分表面的内部与外部来提取零水平集;对于开放表面,这种区分常常导致诸如表面缝隙被人为闭合之类的伪影,而真实世界数据可能恰恰包含由显著表面缝隙所定义的复杂细节。回归无符号距离场的隐式函数在重建此类开放表面方面已展现出潜力,但现有的无符号隐式方法依赖原始数据的离散化表示,这不仅将学习过程限制在该表示的分辨率上,还会在重建中引入离群点。为了在不引入离群点的前提下准确重建开放表面,我们提出了基于学习的隐式点-体素模型(IPVNet)。IPVNet 同时利用原始点云数据及其离散化的体素表示,预测三维空间中查询点到表面的无符号距离。在合成与真实公开数据集上的实验表明,IPVNet 超越了当前最佳方法,同时在重建结果中产生的离群点也少得多。
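
A minimal sketch of combining raw point features with interpolated voxel features to regress an unsigned distance is shown below; the grid resolution, neighbor count, and network sizes are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: a query point gathers a local feature from its k nearest raw points and an
# interpolated feature from a coarse voxel grid; an MLP regresses the unsigned distance.

class IPVSketch(nn.Module):
    def __init__(self, res=32, k=8, dim=64):
        super().__init__()
        self.res, self.k = res, k
        self.voxel_net = nn.Conv3d(1, dim, 3, padding=1)         # features on the occupancy grid
        self.point_net = nn.Linear(3, dim)                       # features of raw neighbor offsets
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, points, queries):
        # points, queries in [-1, 1]^3; shapes (N, 3) and (Q, 3)
        occ = torch.zeros(1, 1, self.res, self.res, self.res)
        idx = ((points + 1) / 2 * (self.res - 1)).long().clamp(0, self.res - 1)
        occ[0, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0          # coarse voxelization
        voxel_feat = self.voxel_net(occ)                          # (1, C, R, R, R)
        grid = queries.view(1, -1, 1, 1, 3)
        v = F.grid_sample(voxel_feat, grid, align_corners=True)   # trilinear lookup at queries
        v = v.view(voxel_feat.size(1), -1).t()                    # (Q, C)

        d = torch.cdist(queries, points)                          # (Q, N)
        knn = d.topk(self.k, largest=False).indices               # k nearest raw points per query
        offsets = points[knn] - queries.unsqueeze(1)              # (Q, k, 3)
        p = self.point_net(offsets).max(dim=1).values             # (Q, C) pooled point feature

        return F.softplus(self.head(torch.cat([v, p], dim=1)))    # non-negative unsigned distance

model = IPVSketch()
udf = model(torch.rand(2048, 3) * 2 - 1, torch.rand(16, 3) * 2 - 1)
print(udf.shape)  # torch.Size([16, 1])
```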

3D-Aware Talking-Head Video Motion Transfer

  • paper_url: http://arxiv.org/abs/2311.02549
  • repo_url: None
  • paper_authors: Haomiao Ni, Jiachen Liu, Yuan Xue, Sharon X. Huang
  • for: 生成一个新视频,使其具有主体视频的外观和驱动视频的运动模式。
  • methods: 使用 3D 感知的说话人头像视频运动迁移网络(Head3D),通过循环网络从 2D 主体帧生成可视化解释的 3D 规范头部,充分利用主体视频中的多视图外观特征;并借助自监督的 3D 头部几何学习模块和基于注意力的融合网络来生成合成视频。
  • results: 在两个公开的说话人头像视频数据集上进行了广泛实验,结果表明 Head3D 在实际的跨身份(cross-identity)设定下优于 2D 和 3D 的现有方法,并且可以很容易地适应姿态可控的新视角合成任务。
    Abstract Motion transfer of talking-head videos involves generating a new video with the appearance of a subject video and the motion pattern of a driving video. Current methodologies primarily depend on a limited number of subject images and 2D representations, thereby neglecting to fully utilize the multi-view appearance features inherent in the subject video. In this paper, we propose a novel 3D-aware talking-head video motion transfer network, Head3D, which fully exploits the subject appearance information by generating a visually-interpretable 3D canonical head from the 2D subject frames with a recurrent network. A key component of our approach is a self-supervised 3D head geometry learning module, designed to predict head poses and depth maps from 2D subject video frames. This module facilitates the estimation of a 3D head in canonical space, which can then be transformed to align with driving video frames. Additionally, we employ an attention-based fusion network to combine the background and other details from subject frames with the 3D subject head to produce the synthetic target video. Our extensive experiments on two public talking-head video datasets demonstrate that Head3D outperforms both 2D and 3D prior arts in the practical cross-identity setting, with evidence showing it can be readily adapted to the pose-controllable novel view synthesis task.
    摘要 说话人头像视频的运动迁移是指生成一个新视频,其外观来自主体视频,而运动模式来自驱动视频。现有方法主要依赖有限数量的主体图像和 2D 表示,因此未能充分利用主体视频中固有的多视图外观特征。在本文中,我们提出了一种新的 3D 感知说话人头像视频运动迁移网络 Head3D,它通过循环网络从 2D 主体帧生成可视化解释的 3D 规范头部,从而充分利用主体的外观信息。我们方法的一个关键组成部分是自监督的 3D 头部几何学习模块,用于从 2D 主体视频帧预测头部姿态和深度图;该模块便于在规范空间中估计 3D 头部,并将其变换以与驱动视频帧对齐。此外,我们采用基于注意力的融合网络,将主体帧中的背景及其他细节与 3D 主体头部相结合,生成合成目标视频。在两个公开的说话人头像视频数据集上的大量实验表明,Head3D 在实际的跨身份设定下优于 2D 和 3D 的现有方法,并且可以很容易地适应姿态可控的新视角合成任务。

VR-NeRF: High-Fidelity Virtualized Walkable Spaces

  • paper_url: http://arxiv.org/abs/2311.02542
  • repo_url: https://github.com/facebookresearch/EyefulTower
  • paper_authors: Linning Xu, Vasu Agrawal, William Laney, Tony Garcia, Aayush Bansal, Changil Kim, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Aljaž Božič, Dahua Lin, Michael Zollhöfer, Christian Richardt
  • for: 这篇论文的目的是构建一个端到端系统,用于对可行走空间进行高保真捕捉、模型重建和虚拟现实中的实时渲染。
  • methods: 这篇论文使用了一套自制的多摄像头采集装置,以前所未有的质量和密度对可行走空间进行高保真、多视图高动态范围(HDR)图像捕捉;并在即时神经图形基元的基础上引入了一种新的感知颜色空间来学习准确的 HDR 外观,以及一种高效的 mip-mapping 机制来实现带抗锯齿的细节层次渲染。
  • results: 这篇论文的结果表明,其多 GPU 渲染器能够在自定义演示机器上以双 2K$\times$2K 的完整 VR 分辨率和 36Hz 的帧率对神经辐射场模型进行高保真体渲染。此外,论文还发布了一个高保真数据集,并将其方法与现有基准进行了比较。
    Abstract We present an end-to-end system for the high-fidelity capture, model reconstruction, and real-time rendering of walkable spaces in virtual reality using neural radiance fields. To this end, we designed and built a custom multi-camera rig to densely capture walkable spaces in high fidelity and with multi-view high dynamic range images in unprecedented quality and density. We extend instant neural graphics primitives with a novel perceptual color space for learning accurate HDR appearance, and an efficient mip-mapping mechanism for level-of-detail rendering with anti-aliasing, while carefully optimizing the trade-off between quality and speed. Our multi-GPU renderer enables high-fidelity volume rendering of our neural radiance field model at the full VR resolution of dual 2K$\times$2K at 36 Hz on our custom demo machine. We demonstrate the quality of our results on our challenging high-fidelity datasets, and compare our method and datasets to existing baselines. We release our dataset on our project website.
    摘要 我们提出了一个端到端系统,利用神经辐射场对可行走空间进行高保真捕捉、模型重建以及虚拟现实中的实时渲染。为此,我们设计并搭建了一套自制的多摄像头采集装置,以前所未有的质量和密度对可行走空间进行稠密的高保真、多视图高动态范围图像采集。我们在即时神经图形基元的基础上扩展了一种新的感知颜色空间,用于学习准确的 HDR 外观,以及一种高效的 mip-mapping 机制,用于带抗锯齿的细节层次渲染,同时仔细权衡质量与速度。我们的多 GPU 渲染器能够在自定义演示机器上以双 2K$\times$2K 的完整 VR 分辨率和 36Hz 的帧率对神经辐射场模型进行高保真体渲染。我们在具有挑战性的高保真数据集上展示了结果的质量,并将我们的方法和数据集与现有基准进行了比较。数据集已发布在我们的项目网站上。

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

  • paper_url: http://arxiv.org/abs/2311.02536
  • repo_url: https://github.com/amzn/augment-the-pairs-wacv2024
  • paper_authors: Jingru Yi, Burak Uzkent, Oana Ignat, Zili Li, Amanmeet Garg, Xiang Yu, Linda Liu
  • for: 提升基于 grounding 的视觉语言模型的表现,具体来说是准确定位文本描述中提到的物体。
  • methods: 使用文本条件化与非文本条件化的数据增强策略,包括文本条件化的颜色抖动和水平翻转,以保持图像与文本之间的语义一致性;此外,受掩码信号重建的启发,我们还提出了一种新的数据增强方式,即像素级掩码。
  • results: 在 Flickr30k、referring expressions 和 GQA 三个常用数据集上的大量实验表明,我们的方法在多种指标上优于现有最佳方法;结合在大规模图像-语言数据集上预训练的图像编码器(如 CLIP)还可进一步提升表现。
    Abstract Grounding-based vision and language models have been successfully applied to low-level vision tasks, aiming to precisely locate objects referred in captions. The effectiveness of grounding representation learning heavily relies on the scale of the training dataset. Despite being a useful data enrichment strategy, data augmentation has received minimal attention in existing vision and language tasks as augmentation for image-caption pairs is non-trivial. In this study, we propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations. Specifically, we apply text-conditioned color jittering and horizontal flipping to ensure semantic consistency between images and captions. To guarantee image-caption correspondence in the training samples, we modify the captions according to pre-defined keywords when applying horizontal flipping. Additionally, inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation. While we demonstrate our data augmentation method with MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks. Finally, we show that image encoder pretrained on large-scale image and language datasets (such as CLIP) can further improve the results. Through extensive experiments on three commonly applied datasets: Flickr30k, referring expressions and GQA, our method demonstrates advanced performance over the state-of-the-arts with various metrics. Code can be found in https://github.com/amzn/augment-the-pairs-wacv2024.
    摘要 基于 grounding 的视觉语言模型已被成功应用于低层视觉任务,旨在准确定位文本描述中提到的物体。grounding 表示学习的效果在很大程度上取决于训练数据集的规模。尽管数据增强是一种有用的数据扩充策略,但由于图像-文本对的增强并不容易,它在现有的视觉语言任务中很少受到关注。在本研究中,我们提出了一种使用文本条件化与非文本条件化数据增强训练的鲁棒短语定位模型。具体来说,我们采用文本条件化的颜色抖动和水平翻转,以保证图像与描述之间的语义一致性;为确保训练样本中图像与描述的对应关系,我们在进行水平翻转时根据预定义的关键词对描述进行修改。此外,受最近掩码信号重建的启发,我们提出将像素级掩码作为一种新的数据增强形式。虽然我们以 MDETR 框架演示了所提出的数据增强方法,但该方法同样适用于其他框架下常见的基于 grounding 的视觉语言任务。最后,我们表明在大规模图像-语言数据集(如 CLIP)上预训练的图像编码器可以进一步提升结果。在 Flickr30k、referring expressions 和 GQA 三个常用数据集上的大量实验表明,我们的方法在多种指标上优于现有最佳方法。代码可以在 https://github.com/amzn/augment-the-pairs-wacv2024 找到。
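
A small sketch of the semantics-preserving flip (with caption keywords swapped) plus pixel-level masking could look like this; the keyword table, masking ratio, and patch size are assumptions, not the paper's exact settings.

```python
import random
import numpy as np

FLIP_KEYWORDS = {"left": "right", "right": "left"}   # assumed keyword table

def flip_image_and_caption(image: np.ndarray, caption: str):
    """Horizontal flip with caption keywords swapped so the text still matches the image."""
    flipped = image[:, ::-1].copy()                   # (H, W, C) horizontal flip
    words = [FLIP_KEYWORDS.get(w.lower(), w) for w in caption.split()]
    return flipped, " ".join(words)

def pixel_mask(image: np.ndarray, ratio: float = 0.3, patch: int = 16):
    """Zero out a random fraction of patches (pixel-level masking)."""
    out = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if random.random() < ratio:
                out[y:y + patch, x:x + patch] = 0
    return out

image = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
caption = "a red car parked to the left of the building"
aug_img, aug_cap = flip_image_and_caption(image, caption)
aug_img = pixel_mask(aug_img)
print(aug_cap)   # "a red car parked to the right of the building"
```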

TokenMotion: Motion-Guided Vision Transformer for Video Camouflaged Object Detection Via Learnable Token Selection

  • paper_url: http://arxiv.org/abs/2311.02535
  • repo_url: None
  • paper_authors: Zifan Yu, Erfan Bank Tavakoli, Meida Chen, Suya You, Raghuveer Rao, Sanjeev Agarwal, Fengbo Ren
  • for: 提高视频伪装目标检测(VCOD)的性能,解决目标与背景纹理相似以及目标和相机运动带来的难题。
  • methods: 使用基于 Transformer 的模型,通过可学习的 token 选择提取运动引导特征,以提升 VCOD 性能。
  • results: 在 MoCA-Mask 数据集上评估,TMNet 取得了 VCOD 领域最先进的性能,相比现有最佳方法,加权 F-测度提高 12.8%,S-测度提高 8.4%,平均 IoU 提高 10.7%。
    Abstract The area of Video Camouflaged Object Detection (VCOD) presents unique challenges in the field of computer vision due to texture similarities between target objects and their surroundings, as well as irregular motion patterns caused by both objects and camera movement. In this paper, we introduce TokenMotion (TMNet), which employs a transformer-based model to enhance VCOD by extracting motion-guided features using a learnable token selection. Evaluated on the challenging MoCA-Mask dataset, TMNet achieves state-of-the-art performance in VCOD. It outperforms the existing state-of-the-art method by a 12.8% improvement in weighted F-measure, an 8.4% enhancement in S-measure, and a 10.7% boost in mean IoU. The results demonstrate the benefits of utilizing motion-guided features via learnable token selection within a transformer-based framework to tackle the intricate task of VCOD.
    摘要 视频伪装目标检测(VCOD)领域在计算机视觉中面临独特挑战,主要原因是目标与周围环境的纹理相似,以及目标和相机运动导致的不规则运动模式。本文提出了 TokenMotion(TMNet),利用基于 Transformer 的模型,通过可学习的 token 选择提取运动引导特征来增强 VCOD。在具有挑战性的 MoCA-Mask 数据集上评估,TMNet 取得了 VCOD 的最先进性能,相比现有最佳方法,加权 F-测度提高 12.8%,S-测度提高 8.4%,平均 IoU 提高 10.7%。结果表明,在基于 Transformer 的框架中通过可学习的 token 选择利用运动引导特征,有助于解决 VCOD 这一复杂任务。
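
A rough sketch of motion-guided learnable token selection is given below; deriving motion tokens from a frame difference, the scoring head, and the hard top-k selection (instead of a differentiable relaxation) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch: a per-token motion score (from frame differences) drives a top-k selection
# of patch tokens before they are fed to a transformer encoder.

class MotionTokenSelector(nn.Module):
    def __init__(self, dim=128, k=32, patch=16):
        super().__init__()
        self.k = k
        self.embed = nn.Conv2d(3, dim, patch, patch)            # appearance tokens
        self.motion_embed = nn.Conv2d(3, dim, patch, patch)     # tokens from frame difference
        self.score = nn.Linear(dim, 1)                          # learnable token scoring
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frame_t, frame_prev):
        tok = self.embed(frame_t).flatten(2).transpose(1, 2)                  # (B, N, C)
        motion = self.motion_embed(frame_t - frame_prev).flatten(2).transpose(1, 2)
        scores = self.score(tok + motion).squeeze(-1)                         # (B, N)
        idx = scores.topk(self.k, dim=1).indices                              # keep k motion-salient tokens
        selected = torch.gather(tok + motion, 1, idx.unsqueeze(-1).expand(-1, -1, tok.size(-1)))
        return self.encoder(selected)                                         # (B, k, C)

model = MotionTokenSelector()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 32, 128])
```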