cs.CV - 2023-12-04

Unsupervised Change Detection for Space Habitats Using 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2312.02396
  • repo_url: https://github.com/nasa/isaac
  • paper_authors: Jamie Santos, Holly Dinkel, Julia Di, Paulo V. K. Borges, Marina Moreira, Oleg Alexandrov, Brian Coltin, Trey Smith
  • for: This work develops a scene change detection algorithm to enable autonomous robotic caretaking of future space habitats.
  • methods: The algorithm takes raw, unlabeled point clouds as input, first applies modified Expectation-Maximization Gaussian Mixture Model (GMM) clustering, and then detects changes by comparing the GMMs using the Earth Mover's Distance (see the sketch after the abstract).
  • results: The method is validated quantitatively and qualitatively on test data collected in the NASA Ames Granite Lab, detecting scene changes quickly and accurately; runtimes are analyzed, and the source code is publicly released to promote further development.
    Abstract This work presents an algorithm for scene change detection from point clouds to enable autonomous robotic caretaking in future space habitats. Autonomous robotic systems will help maintain future deep-space habitats, such as the Gateway space station, which will be uncrewed for extended periods. Existing scene analysis software used on the International Space Station (ISS) relies on manually-labeled images for detecting changes. In contrast, the algorithm presented in this work uses raw, unlabeled point clouds as inputs. The algorithm first applies modified Expectation-Maximization Gaussian Mixture Model (GMM) clustering to two input point clouds. It then performs change detection by comparing the GMMs using the Earth Mover's Distance. The algorithm is validated quantitatively and qualitatively using a test dataset collected by an Astrobee robot in the NASA Ames Granite Lab comprising single frame depth images taken directly by Astrobee and full-scene reconstructed maps built with RGB-D and pose data from Astrobee. The runtimes of the approach are also analyzed in depth. The source code is publicly released to promote further development.
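The GMM-plus-Earth-Mover's-Distance pipeline described above can be illustrated with a short sketch. This is a minimal approximation, not the released ISAAC code: it fits standard scikit-learn GMMs (standing in for the modified EM clustering) and compares them with the POT library by treating each component mean as a point mass weighted by its mixture weight; component counts and thresholds are assumptions.

```python
# Minimal sketch (not the released ISAAC implementation): fit a GMM to each
# point cloud and compare the mixtures with an Earth Mover's Distance between
# their component means, weighted by the mixture weights.
import numpy as np
import ot  # Python Optimal Transport (POT)
from sklearn.mixture import GaussianMixture


def fit_gmm(points: np.ndarray, n_components: int = 32) -> GaussianMixture:
    """Cluster an (N, 3) point cloud with EM-based GMM fitting."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=0).fit(points)


def gmm_emd(gmm_a: GaussianMixture, gmm_b: GaussianMixture) -> float:
    """Approximate EMD between two GMMs using component means as point masses."""
    cost = ot.dist(gmm_a.means_, gmm_b.means_, metric="euclidean")
    return ot.emd2(gmm_a.weights_, gmm_b.weights_, cost)


if __name__ == "__main__":
    before = np.random.rand(5000, 3)                # stand-in for a "before" scan
    after = before + np.array([0.0, 0.0, 0.05])     # shifted copy as a toy "change"
    score = gmm_emd(fit_gmm(before), fit_gmm(after))
    print(f"change score (EMD between GMMs): {score:.4f}")
```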

MEDPSeg: End-to-end segmentation of pulmonary structures and lesions in computed tomography

  • paper_url: http://arxiv.org/abs/2312.02365
  • repo_url: https://github.com/miclab-unicamp/medpseg
  • paper_authors: Diedre S. Carmo, Jean Ribeiro, Alejandro P. Comellas, Joseph M. Reinhardt, Sarah E. Gerard, Letícia Rittner, Roberto A. Lotufo
  • for: This work aims to improve automated segmentation of computed tomography (CT) images for the diagnosis and assessment of lung disease.
  • methods: The method combines multi-scale and multitask learning with polymorphic training, using a fixed number of output channels to optimize multiple hierarchical anatomic structures, indirectly optimizing more complex labels with simpler annotations (a minimal sketch of the channel-aggregation idea follows the abstract).
  • results: The approach achieves state-of-the-art performance on multiple targets, particularly the segmentation of ground glass opacities and consolidations, a challenging problem with limited manual annotation availability.
    Abstract The COVID-19 pandemic response highlighted the potential of deep learning methods in facilitating the diagnosis and prognosis of lung diseases through automated segmentation of normal and abnormal tissue in computed tomography (CT). Such methods not only have the potential to aid in clinical decision-making but also contribute to the comprehension of novel diseases. In light of the labor-intensive nature of manual segmentation for large chest CT cohorts, there is a pressing need for reliable automated approaches that enable efficient analysis of chest CT anatomy in vast research databases, especially in more scarcely annotated targets such as pneumonia consolidations. A limiting factor for the development of such methods is that most current models optimize a fixed annotation format per network output. To tackle this problem, polymorphic training is used to optimize a network with a fixed number of output channels to represent multiple hierarchical anatomic structures, indirectly optimizing more complex labels with simpler annotations. We combined over 6000 volumetric CT scans containing varying formats of manual and automated labels from different sources, and used polymorphic training along with multitask learning to develop MEDPSeg, an end-to-end method for the segmentation of lungs, airways, pulmonary artery, and lung lesions with separation of ground glass opacities, and parenchymal consolidations, all in a single forward prediction. We achieve state-of-the-art performance in multiple targets, particularly in the segmentation of ground glass opacities and consolidations, a challenging problem with limited manual annotation availability. In addition, we provide an open-source implementation with a graphical user interface at https://github.com/MICLab-Unicamp/medpseg.
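The polymorphic-training idea, supervising fine-grained output channels with a coarser annotation, can be sketched as channel aggregation before the loss. The channel layout and hierarchy below are illustrative assumptions, not the actual MEDPSeg label configuration.

```python
# Minimal sketch of polymorphic training: a network with fine-grained output
# channels is supervised with a coarser annotation by summing the probabilities
# of the fine channels that make up each coarse class. Channel layout is an
# illustrative assumption, not the actual MEDPSeg configuration.
import torch
import torch.nn.functional as F

# assumed fine channels: 0 background, 1 healthy lung, 2 ground glass opacity,
# 3 consolidation; assumed coarse class 2 = "lesion" = GGO + consolidation
FINE_TO_COARSE = {0: [0], 1: [1], 2: [2, 3]}


def polymorphic_loss(fine_logits: torch.Tensor, coarse_target: torch.Tensor) -> torch.Tensor:
    """fine_logits: (B, 4, H, W); coarse_target: (B, H, W) with values in {0, 1, 2}."""
    fine_probs = F.softmax(fine_logits, dim=1)
    coarse_probs = torch.stack(
        [fine_probs[:, idx].sum(dim=1) for idx in FINE_TO_COARSE.values()], dim=1
    )
    return F.nll_loss(torch.log(coarse_probs.clamp_min(1e-8)), coarse_target)


if __name__ == "__main__":
    logits = torch.randn(2, 4, 64, 64, requires_grad=True)
    target = torch.randint(0, 3, (2, 64, 64))
    loss = polymorphic_loss(logits, target)
    loss.backward()
    print(float(loss))
```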

PointNeRF++: A multi-scale, point-based Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2312.02362
  • repo_url: None
  • paper_authors: Weiwei Sun, Eduard Trulls, Yang-Che Tseng, Sneha Sambandam, Gopal Sharma, Andrea Tagliasacchi, Kwang Moo Yi
  • for: This paper aims to improve point-cloud-based neural rendering, especially when the point cloud quality is low (e.g., sparse or incomplete).
  • methods: The method aggregates the point cloud at multiple scale levels with sparse 3D voxel grids at different resolutions. It averages only across the scale levels that are valid, i.e., that have enough neighboring points near the pixel ray, and adds a global voxel at the coarsest scale to unify "classical" and point-based NeRF formulations (see the sketch after the abstract).
  • results: The method is validated on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, outperforming the state of the art by a significant margin.
    Abstract Point clouds offer an attractive source of information to complement images in neural scene representations, especially when few images are available. Neural rendering methods based on point clouds do exist, but they do not perform well when the point cloud quality is low -- e.g., sparse or incomplete, which is often the case with real-world data. We overcome these problems with a simple representation that aggregates point clouds at multiple scale levels with sparse voxel grids at different resolutions. To deal with point cloud sparsity, we average across multiple scale levels -- but only among those that are valid, i.e., that have enough neighboring points in proximity to the ray of a pixel. To help model areas without points, we add a global voxel at the coarsest scale, thus unifying "classical" and point-based NeRF formulations. We validate our method on the NeRF Synthetic, ScanNet, and KITTI-360 datasets, outperforming the state of the art by a significant margin.
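The validity-masked averaging across scales, with a global coarsest-scale feature that is always included, can be sketched as below. Tensor shapes, the neighbor-count threshold, and the helper name are assumptions for illustration, not the PointNeRF++ implementation.

```python
# Minimal sketch of the multi-scale aggregation idea: average per-scale features
# only over "valid" scales (enough neighboring points near the ray sample) and
# always include a global, coarsest-scale feature.
import torch


def aggregate_scales(scale_feats: torch.Tensor, neighbor_counts: torch.Tensor,
                     global_feat: torch.Tensor, min_neighbors: int = 3) -> torch.Tensor:
    """
    scale_feats:     (S, N, C) features for S scale levels at N ray samples
    neighbor_counts: (S, N)    number of points near each sample at each scale
    global_feat:     (C,)      coarsest-scale "global voxel" feature
    returns:         (N, C)    aggregated per-sample features
    """
    valid = (neighbor_counts >= min_neighbors).float().unsqueeze(-1)  # (S, N, 1)
    summed = (scale_feats * valid).sum(dim=0) + global_feat           # global is always valid
    count = valid.sum(dim=0) + 1.0
    return summed / count


if __name__ == "__main__":
    feats = torch.randn(4, 1024, 32)          # 4 scales, 1024 ray samples, 32-dim features
    counts = torch.randint(0, 8, (4, 1024))
    g = torch.randn(32)
    print(aggregate_scales(feats, counts, g).shape)  # torch.Size([1024, 32])
```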

Calibrated Uncertainties for Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2312.02350
  • repo_url: None
  • paper_authors: Niki Amini-Naieni, Tomas Jakab, Andrea Vedaldi, Ronald Clark
  • for: This paper proposes a method for obtaining accurately calibrated uncertainties from NeRF predictions, even in the sparse-view setting where no held-out data is available for fitting a calibrator.
  • methods: Two techniques eliminate the need for held-out data: one based on patch sampling, which trains two NeRF models per scene, and a novel meta-calibrator that requires training only a single NeRF model.
  • results: Both techniques achieve state-of-the-art uncertainty in the sparse-view setting while maintaining image quality, and they are shown to benefit applications such as view enhancement and next-best view selection.
    Abstract Neural Radiance Fields have achieved remarkable results for novel view synthesis but still lack a crucial component: precise measurement of uncertainty in their predictions. Probabilistic NeRF methods have tried to address this, but their output probabilities are not typically accurately calibrated, and therefore do not capture the true confidence levels of the model. Calibration is a particularly challenging problem in the sparse-view setting, where additional held-out data is unavailable for fitting a calibrator that generalizes to the test distribution. In this paper, we introduce the first method for obtaining calibrated uncertainties from NeRF models. Our method is based on a robust and efficient metric to calculate per-pixel uncertainties from the predictive posterior distribution. We propose two techniques that eliminate the need for held-out data. The first, based on patch sampling, involves training two NeRF models for each scene. The second is a novel meta-calibrator that only requires the training of one NeRF model. Our proposed approach for obtaining calibrated uncertainties achieves state-of-the-art uncertainty in the sparse-view setting while maintaining image quality. We further demonstrate our method's effectiveness in applications such as view enhancement and next-best view selection.

CLIPDrawX: Primitive-based Explanations for Text Guided Sketch Synthesis

  • paper_url: http://arxiv.org/abs/2312.02345
  • repo_url: None
  • paper_authors: Nityanand Mathur, Shyam Marjit, Abhra Chaudhuri, Anjan Dutta
  • for: Understanding the visual concepts that CLIP associates with text prompts.
  • methods: Visualizes the latent space of CLIP using only simple geometric primitives such as straight lines and circles, constraining the possible outputs to linear transformations on these primitives.
  • results: The proposed CLIPDrawX algorithm provides significantly better visualizations of CLIP text embeddings, and its synthesis process can be tracked end-to-end, with each visual concept explained exclusively in terms of primitives.
    Abstract With the goal of understanding the visual concepts that CLIP associates with text prompts, we show that the latent space of CLIP can be visualized solely in terms of linear transformations on simple geometric primitives like circles and straight lines. Although existing approaches achieve this by sketch-synthesis-through-optimization, they do so on the space of Bézier curves, which exhibit a wastefully large set of structures that they can evolve into, as most of them are non-essential for generating meaningful sketches. We present CLIPDrawX, an algorithm that provides significantly better visualizations for CLIP text embeddings, using only simple primitive shapes like straight lines and circles. This constrains the set of possible outputs to linear transformations on these primitives, thereby exhibiting an inherently simpler mathematical form. The synthesis process of CLIPDrawX can be tracked end-to-end, with each visual concept being explained exclusively in terms of primitives. Implementation will be released upon acceptance. Project Page: https://clipdrawx.github.io/.

STEREOFOG – Computational DeFogging via Image-to-Image Translation on a real-world Dataset

  • paper_url: http://arxiv.org/abs/2312.02344
  • repo_url: https://github.com/apoll2000/stereofog
  • paper_authors: Anton Pollak, Rajesh Menon
  • for: The paper explores the potential of image-to-image translation (I2I) for fog removal, using a real-world dataset called STEREOFOG.
  • methods: The pix2pix I2I machine learning (ML) framework is applied and optimized for the STEREOFOG dataset.
  • results: The final model achieves an average Complex Wavelet-Structural Similarity (CW-SSIM) score of $0.76$, demonstrating the suitability of the technique for fog removal.
    Abstract Image-to-Image translation (I2I) is a subtype of Machine Learning (ML) that has tremendous potential in applications where two domains of images and the need for translation between the two exist, such as the removal of fog. For example, this could be useful for autonomous vehicles, which currently struggle with adverse weather conditions like fog. However, datasets for I2I tasks are not abundant and typically hard to acquire. Here, we introduce STEREOFOG, a dataset comprised of $10,067$ paired fogged and clear images, captured using a custom-built device, with the purpose of exploring I2I's potential in this domain. It is the only real-world dataset of this kind to the best of our knowledge. Furthermore, we apply and optimize the pix2pix I2I ML framework to this dataset. With the final model achieving an average Complex Wavelet-Structural Similarity (CW-SSIM) score of $0.76$, we prove the technique's suitability for the problem.

Cable Slack Detection for Arresting Gear Application using Machine Vision

  • paper_url: http://arxiv.org/abs/2312.02320
  • repo_url: None
  • paper_authors: Ari Goodman, Glenn Shevach, Sean Zabriskie, Dr. Chris Thajudeen
  • for: This paper presents a machine-vision-based slack detection system for the cable interface of aircraft arresting gear, so that slack can be identified and removed before continued operations, improving safety and reliability.
  • methods: Video from a situational awareness camera is processed with machine vision algorithms, including bilateral image filters, least squares polynomial fits, Canny edge detection, K-Means clustering, Gaussian Mixture-based background/foreground segmentation, Hough circle transforms, and Hough line transforms, to reduce noise, remove background clutter, focus on regions of interest, and detect slack formations (see the sketch after the abstract).
  • results: Validated on shipboard footage, the system accurately identifies slack with minimal false positives, and its user interface lets operators redefine regions of interest and tune the methods for specific locations.
    Abstract The cable-based arrestment systems are integral to the launch and recovery of aircraft onboard carriers and on expeditionary land-based installations. These modern arrestment systems rely on various mechanisms to absorb energy from an aircraft during an arrestment cycle to bring the aircraft to a full stop. One of the primary components of this system is the cable interface to the engine. The formation of slack in the cable at this interface can result in reduced efficiency and drives maintenance efforts to remove the slack prior to continued operations. In this paper, a machine vision based slack detection system is presented. A situational awareness camera is utilized to collect video data of the cable interface region, machine vision algorithms are applied to reduce noise, remove background clutter, focus on regions of interest, and detect changes in the image representative of slack formations. Some algorithms employed in this system include bilateral image filters, least squares polynomial fit, Canny Edge Detection, K-Means clustering, Gaussian Mixture-based Background/Foreground Segmentation for background subtraction, Hough Circle Transforms, and Hough line Transforms. The resulting detections are filtered and highlighted to create an indication to the shipboard operator of the presence of slack and a need for a maintenance action. A user interface was designed to provide operators with an easy method to redefine regions of interest and adjust the methods to specific locations. The algorithms were validated on shipboard footage and were able to accurately identify slack with minimal false positives.
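Several of the classical vision steps listed above (bilateral filtering, MOG2 background subtraction, Canny edges, Hough line detection) map directly onto standard OpenCV calls. The sketch below is a generic illustration of such a pipeline, not the deployed shipboard system; all thresholds and the region of interest are assumed values.

```python
# Minimal sketch of a classical slack-detection style pipeline with OpenCV:
# denoise, subtract background, find edges, and extract line segments in a
# region of interest.
import cv2
import numpy as np

bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)


def detect_cable_lines(frame: np.ndarray, roi: tuple) -> np.ndarray:
    x, y, w, h = roi
    patch = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
    smooth = cv2.bilateralFilter(gray, 9, 75, 75)          # edge-preserving denoise
    fg_mask = bg_subtractor.apply(smooth)                  # MOG2 background subtraction
    edges = cv2.Canny(cv2.bitwise_and(smooth, smooth, mask=fg_mask), 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=40,
                            minLineLength=30, maxLineGap=10)
    return np.empty((0, 4)) if lines is None else lines.reshape(-1, 4)


if __name__ == "__main__":
    dummy = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    print(detect_cable_lines(dummy, roi=(100, 100, 400, 200)).shape)
```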

MoE-AMC: Enhancing Automatic Modulation Classification Performance Using Mixture-of-Experts

  • paper_url: http://arxiv.org/abs/2312.02298
  • repo_url: None
  • paper_authors: Jiaxin Gao, Qinglong Cao, Yuntian Chen
  • for: This paper proposes a novel Mixture-of-Experts (MoE) based model, MoE-AMC, to address Automatic Modulation Classification (AMC) in a well-balanced manner across varying Signal-to-Noise Ratio (SNR) conditions.
  • methods: MoE-AMC combines the strengths of two existing models, LSRM (a Transformer-based model) for low SNR signals and HSRM (a ResNet-based model) for high SNR signals, within the MoE framework, capturing distinctive signal features under diverse SNR scenarios (see the sketch after the abstract).
  • results: Using the RML2018.01a dataset, MoE-AMC achieved an average classification accuracy of 71.76% across different SNR levels, surpassing previous state-of-the-art (SOTA) models by nearly 10%.
    Abstract Automatic Modulation Classification (AMC) plays a vital role in time series analysis, such as signal classification and identification within wireless communications. Deep learning-based AMC models have demonstrated significant potential in this domain. However, current AMC models inadequately consider the disparities in handling signals under conditions of low and high Signal-to-Noise Ratio (SNR), resulting in an unevenness in their performance. In this study, we propose MoE-AMC, a novel Mixture-of-Experts (MoE) based model specifically crafted to address AMC in a well-balanced manner across varying SNR conditions. Utilizing the MoE framework, MoE-AMC seamlessly combines the strengths of LSRM (a Transformer-based model) for handling low SNR signals and HSRM (a ResNet-based model) for high SNR signals. This integration empowers MoE-AMC to achieve leading performance in modulation classification, showcasing its efficacy in capturing distinctive signal features under diverse SNR scenarios. We conducted experiments using the RML2018.01a dataset, where MoE-AMC achieved an average classification accuracy of 71.76% across different SNR levels, surpassing the performance of previous SOTA models by nearly 10%. This study represents a pioneering application of MoE techniques in the realm of AMC, offering a promising avenue for elevating signal classification accuracy within wireless communication systems.
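The core mixture-of-experts idea, a gate that softly combines a low-SNR expert and a high-SNR expert per input, can be sketched as follows. The expert bodies are small placeholders standing in for LSRM/HSRM, and the flattened 1024-sample I/Q input format is an assumption; 24 output classes match the RML2018.01a label set.

```python
# Minimal sketch of the mixture-of-experts idea behind MoE-AMC: a gating network
# softly weights a low-SNR expert and a high-SNR expert per input. The expert
# bodies are placeholders, not the actual LSRM (Transformer) / HSRM (ResNet).
import torch
import torch.nn as nn


class MoEClassifier(nn.Module):
    def __init__(self, in_dim: int = 2 * 1024, n_classes: int = 24):
        super().__init__()
        self.low_snr_expert = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                            nn.Linear(256, n_classes))
        self.high_snr_expert = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                             nn.Linear(256, n_classes))
        self.gate = nn.Sequential(nn.Linear(in_dim, 2), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(x)                                        # (B, 2) expert weights
        logits = torch.stack([self.low_snr_expert(x),
                              self.high_snr_expert(x)], dim=1)  # (B, 2, C)
        return (w.unsqueeze(-1) * logits).sum(dim=1)            # weighted combination


if __name__ == "__main__":
    iq = torch.randn(8, 2 * 1024)      # assumed flattened I/Q frames
    print(MoEClassifier()(iq).shape)   # torch.Size([8, 24])
```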

You Can Run but not Hide: Improving Gait Recognition with Intrinsic Occlusion Type Awareness

  • paper_url: http://arxiv.org/abs/2312.02290
  • repo_url: None
  • paper_authors: Ayush Gupta, Rama Chellappa
  • for: This work addresses the occlusion problem in gait recognition, particularly for recognition from uncontrolled outdoor video sequences at range.
  • methods: The learned occlusion type is exploited while extracting identity features from videos, allowing intrinsic occlusion awareness to be modeled into potentially any state-of-the-art gait recognition method.
  • results: Experiments on the GREW and BRIAR datasets show that networks enhanced with this occlusion awareness outperform counterparts trained on similar occlusions.
    Abstract While gait recognition has seen many advances in recent years, the occlusion problem has largely been ignored. This problem is especially important for gait recognition from uncontrolled outdoor sequences at range - since any small obstruction can affect the recognition system. Most current methods assume the availability of complete body information while extracting the gait features. When parts of the body are occluded, these methods may hallucinate and output a corrupted gait signature as they try to look for body parts which are not present in the input at all. To address this, we exploit the learned occlusion type while extracting identity features from videos. Thus, in this work, we propose an occlusion aware gait recognition method which can be used to model intrinsic occlusion awareness into potentially any state-of-the-art gait recognition method. Our experiments on the challenging GREW and BRIAR datasets show that networks enhanced with this occlusion awareness perform better at recognition tasks than their counterparts trained on similar occlusions.

PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

  • paper_url: http://arxiv.org/abs/2312.02284
  • repo_url: https://github.com/zhyever/PatchFusion
  • paper_authors: Zhenyu Li, Shariq Farooq Bhat, Peter Wonka
  • for: High-resolution monocular metric depth estimation from a single image.
  • methods: A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer tiled predictions, a Global-to-Local (G2L) module, and Consistency-Aware Training and Inference (the tile-stitching bookkeeping is sketched after the abstract).
  • results: Generates high-resolution depth maps with intricate details, improving RMSE by 17.3% and 29.4% on UnrealStereo4K and MVS-Synth, respectively.
    Abstract Single image depth estimation is a foundational task in computer vision and generative modeling. However, prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise, but they often face limitations, ranging from error propagation to the loss of high-frequency details. We present PatchFusion, a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer, inconsistent tiled predictions via high-level feature guidance, (2) A Global-to-Local (G2L) module that adds vital context to the fusion network, discarding the need for patch selection heuristics, and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach, emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K, MVS-Synth, and Middleburry 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably, our framework built on top of SOTA ZoeDepth brings improvements for a total of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth, respectively.
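The tile bookkeeping behind a patch-wise approach, running a model on overlapping crops and averaging the overlaps back into a full-resolution canvas, can be sketched as below. This covers only the stitching logic under assumed tile sizes; it is not the learned PatchFusion fusion network or the G2L module.

```python
# Minimal sketch of tile-based inference bookkeeping: run a depth model on
# overlapping patches and average overlapping predictions into a full map.
import numpy as np


def _positions(size: int, tile: int, stride: int):
    """Tile start offsets that cover [0, size), including a final edge-aligned tile."""
    if size <= tile:
        return [0]
    pos = list(range(0, size - tile + 1, stride))
    if pos[-1] != size - tile:
        pos.append(size - tile)
    return pos


def tiled_depth(image: np.ndarray, predict, tile: int = 384, stride: int = 256) -> np.ndarray:
    h, w = image.shape[:2]
    acc = np.zeros((h, w))
    weight = np.zeros((h, w))
    for y in _positions(h, tile, stride):
        for x in _positions(w, tile, stride):
            acc[y:y + tile, x:x + tile] += predict(image[y:y + tile, x:x + tile])
            weight[y:y + tile, x:x + tile] += 1.0
    return acc / weight


if __name__ == "__main__":
    img = np.random.rand(1024, 1536, 3)
    fake_model = lambda p: p.mean(axis=-1)     # stand-in for a per-patch depth network
    print(tiled_depth(img, fake_model).shape)  # (1024, 1536)
```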

Mesh-Guided Neural Implicit Field Editing

  • paper_url: http://arxiv.org/abs/2312.02157
  • repo_url: https://github.com/cassiePython/MNeuEdit
  • paper_authors: Can Wang, Mingming He, Menglei Chai, Dongdong Chen, Jing Liao
  • for: Improving the editability of neural implicit fields so that users can perform fine-grained edits through simple operations.
  • methods: A differentiable marching tetrahedra method extracts a polygonal mesh from the neural implicit field, and a differentiable color extractor assigns colors obtained from volume rendering to the extracted mesh.
  • results: Gradient back-propagation from the colored mesh to the implicit field enables fine-grained edits, including object additions, component removals, specific area deformations, and adjustments to local and global colors.
    Abstract Neural implicit fields have emerged as a powerful 3D representation for reconstructing and rendering photo-realistic views, yet they possess limited editability. Conversely, explicit 3D representations, such as polygonal meshes, offer ease of editing but may not be as suitable for rendering high-quality novel views. To harness the strengths of both representations, we propose a new approach that employs a mesh as a guiding mechanism in editing the neural radiance field. We first introduce a differentiable method using marching tetrahedra for polygonal mesh extraction from the neural implicit field and then design a differentiable color extractor to assign colors obtained from the volume renderings to this extracted mesh. This differentiable colored mesh allows gradient back-propagation from the explicit mesh to the implicit fields, empowering users to easily manipulate the geometry and color of neural implicit fields. To enhance user control from coarse-grained to fine-grained levels, we introduce an octree-based structure into its optimization. This structure prioritizes the edited regions and the surface part, making our method achieve fine-grained edits to the neural implicit field and accommodate various user modifications, including object additions, component removals, specific area deformations, and adjustments to local and global colors. Through extensive experiments involving diverse scenes and editing operations, we have demonstrated the capabilities and effectiveness of our method. Our project page is: \url{https://cassiepython.github.io/MNeuEdit/}

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

  • paper_url: http://arxiv.org/abs/2312.02155
  • repo_url: https://github.com/ShunyuanZheng/GPS-Gaussian
  • paper_authors: Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, Yebin Liu
  • for: Real-time novel view synthesis of human subjects.
  • methods: Builds on Gaussian Splatting but requires no per-subject optimization: Gaussian parameter maps are defined on the source views, and Gaussian Splatting properties are regressed directly for instant novel view synthesis, with a jointly trained depth estimation module lifting the 2D parameter maps to 3D.
  • results: Experiments show 2K-resolution rendering under a sparse-view camera setting, outperforming state-of-the-art methods while achieving a higher rendering speed.
    Abstract We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in a real-time manner. The proposed method enables 2K-resolution rendering under a sparse-view camera setting. Unlike the original Gaussian Splatting or neural implicit rendering methods that necessitate per-subject optimizations, we introduce Gaussian parameter maps defined on the source views and regress directly Gaussian Splatting properties for instant novel view synthesis without any fine-tuning or optimization. To this end, we train our Gaussian parameter regression module on a large amount of human scan data, jointly with a depth estimation module to lift 2D parameter maps to 3D space. The proposed framework is fully differentiable and experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while achieving an exceeding rendering speed.

Aligning and Prompting Everything All at Once for Universal Visual Perception

  • paper_url: http://arxiv.org/abs/2312.02153
  • repo_url: https://github.com/shenyunhang/ape
  • paper_authors: Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji
  • for: Building a universal visual perception system, addressing heavy cross-modality interaction and the annotation granularity gap between tasks.
  • methods: An instance-level sentence-object matching framework that reformulates language-guided grounding as open-vocabulary detection, and equalizes semantic and panoptic segmentation to proxy instance learning.
  • results: Experiments on over 160 datasets show that, with a single suit of weights, APE matches or outperforms state-of-the-art models without task-specific fine-tuning.
    Abstract Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one-suit of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.

Steerers: A framework for rotation equivariant keypoint descriptors

  • paper_url: http://arxiv.org/abs/2312.02152
  • repo_url: https://github.com/georg-bn/rotation-steerers
  • paper_authors: Georg Bökman, Johan Edstedt, Michael Felsberg, Fredrik Kahl
  • for: Image keypoint descriptions that remain discriminative and matchable over large viewpoint changes, which is vital for 3D reconstruction.
  • methods: Rather than relying on data augmentation (which degrades performance on upright images) or test-time augmentation (which increases runtime), a linear transform in description space, called a steerer, is learned that encodes rotations of the input image, so descriptions can be transformed as if the image were rotated (see the sketch after the abstract).
  • results: Steerers can be optimized for a fixed descriptor, jointly with a descriptor, or a descriptor can be optimized for a fixed steerer; all three settings yield state-of-the-art results on the rotation-invariant image matching benchmarks AIMS and Roto-360, with code and model weights released on GitHub.
    Abstract Image keypoint descriptions that are discriminative and matchable over large changes in viewpoint are vital for 3D reconstruction. However, descriptions output by learned descriptors are typically not robust to camera rotation. While they can be made more robust by, e.g., data augmentation, this degrades performance on upright images. Another approach is test-time augmentation, which incurs a significant increase in runtime. We instead learn a linear transform in description space that encodes rotations of the input image. We call this linear transform a steerer since it allows us to transform the descriptions as if the image was rotated. From representation theory we know all possible steerers for the rotation group. Steerers can be optimized (A) given a fixed descriptor, (B) jointly with a descriptor or (C) we can optimize a descriptor given a fixed steerer. We perform experiments in all of these three settings and obtain state-of-the-art results on the rotation invariant image matching benchmarks AIMS and Roto-360. We publish code and model weights at github.com/georg-bn/rotation-steerers.
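The steerer idea, a learned linear map S in description space such that describing a rotated image matches applying S to the original description, can be written as a short training-loss sketch. The descriptor network below is a toy placeholder producing one vector per image (real descriptors are per-keypoint), and restricting to 90-degree rotations is an assumption for illustration.

```python
# Minimal sketch of learning a "steerer": a linear map S on descriptors such
# that descriptors of a rotated image match S applied to descriptors of the
# original image. The descriptor network is a placeholder.
import torch
import torch.nn as nn


class ToyDescriptor(nn.Module):
    """Stand-in for a descriptor network producing one D-dim vector per image."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))

    def forward(self, x):
        return self.net(x)


descriptor = ToyDescriptor()
steerer = nn.Linear(128, 128, bias=False)   # the learned steerer S
optim = torch.optim.Adam(list(descriptor.parameters()) + list(steerer.parameters()), lr=1e-3)

for step in range(10):                       # toy training loop on random images
    imgs = torch.rand(8, 3, 64, 64)
    rotated = torch.rot90(imgs, k=1, dims=(2, 3))   # 90-degree rotation of the input
    loss = ((descriptor(rotated) - steerer(descriptor(imgs))) ** 2).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
print(float(loss))
```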

Readout Guidance: Learning Control from Diffusion Features

  • paper_url: http://arxiv.org/abs/2312.02150
  • repo_url: None
  • paper_authors: Grace Luo, Trevor Darrell, Oliver Wang, Dan B Goldman, Aleksander Holynski
  • for: Controlling text-to-image diffusion models with learned signals.
  • methods: Readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. The readouts can encode single-image properties such as pose, depth, and edges, or higher-order properties relating multiple images, such as correspondence and appearance similarity. Comparing the readout estimates to a user-defined target and back-propagating the gradient through the readout head guides the sampling process (see the sketch after the abstract).
  • results: Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a simple recipe for reproducing different forms of conditional control within a single framework, architecture, and sampling procedure, demonstrated on drag-based manipulation, identity-consistent generation, and spatially aligned control.
    Abstract We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep. These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity. Furthermore, by comparing the readout estimates to a user-defined target, and back-propagating the gradient through the readout head, these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation, Readout Guidance requires significantly fewer added parameters and training samples, and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework, with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation, identity-consistent generation, and spatially aligned control. Project page: https://readout-guidance.github.io.
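The guidance mechanism, comparing a readout-head estimate against a user target and pushing the resulting gradient back into the denoising latent, can be sketched generically. The feature extractor, readout head, denoising step, and guidance scale below are placeholders and assumptions; this is not the paper's released code.

```python
# Minimal sketch of readout-style guidance at one denoising step: a lightweight
# readout head maps frozen diffusion features to a property estimate, and the
# gradient of the mismatch with a user target nudges the latent.
import torch


def guided_step(latent: torch.Tensor, t: int, extract_features, readout_head,
                target: torch.Tensor, denoise_step, guidance_scale: float = 1.0):
    latent = latent.detach().requires_grad_(True)
    feats = extract_features(latent, t)             # features from the frozen diffusion model
    estimate = readout_head(feats)                  # e.g., a pose/depth/edge readout
    loss = torch.nn.functional.mse_loss(estimate, target)
    (grad,) = torch.autograd.grad(loss, latent)
    guided_latent = latent - guidance_scale * grad  # steer the latent toward the target
    return denoise_step(guided_latent.detach(), t)  # continue regular sampling


if __name__ == "__main__":
    # toy stand-ins to show the call pattern
    extract = lambda z, t: z.mean(dim=1, keepdim=True)
    head = lambda f: f
    step = lambda z, t: z * 0.99
    z = torch.randn(1, 4, 64, 64)
    z = guided_step(z, t=50, extract_features=extract, readout_head=head,
                    target=torch.zeros(1, 1, 64, 64), denoise_step=step)
    print(z.shape)
```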

Rejuvenating image-GPT as Strong Visual Representation Learners

  • paper_url: http://arxiv.org/abs/2312.02147
  • repo_url: https://github.com/oliverrensu/d-igpt
  • paper_authors: Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie
  • for: Enhancing image-GPT (iGPT) so that it learns stronger visual representations.
  • methods: Two simple yet essential changes: first, the prediction target is shifted from raw pixels to semantic tokens, enabling a higher-level understanding of visual content; second, the model is trained to predict not only the next tokens but also the visible tokens. The pipeline is particularly effective when the semantic tokens are encoded by a discriminatively trained model such as CLIP.
  • results: The resulting D-iGPT reaches 89.5% top-1 accuracy on ImageNet-1K with a vanilla ViT-Large model trained on publicly available data, and shows strong generalization on downstream tasks and robustness to out-of-distribution samples.
    Abstract This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement of D-iGPT is its compelling performance on the ImageNet-1K dataset -- by training on publicly available datasets, D-iGPT achieves 89.5\% top-1 accuracy with a vanilla ViT-Large model. This model also shows strong generalization on the downstream task and robustness on out-of-distribution samples. Code is avaiable at \href{https://github.com/OliverRensu/D-iGPT}{https://github.com/OliverRensu/D-iGPT}.

Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2312.02145
  • repo_url: https://github.com/prs-eth/marigold
  • paper_authors: Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler
  • for: This paper proposes an affine-invariant monocular depth estimation method derived from a generative diffusion model, improving the accuracy and generalization of single-image depth estimation.
  • methods: The method, Marigold, is derived from Stable Diffusion and retains its rich prior knowledge; it can be fine-tuned in a couple of days on a single GPU using only synthetic training data.
  • results: The method achieves state-of-the-art performance across a wide range of datasets, including performance gains of over 20% in specific cases.
    Abstract Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.

Optimizing Camera Configurations for Multi-View Pedestrian Detection

  • paper_url: http://arxiv.org/abs/2312.02144
  • repo_url: None
  • paper_authors: Yunzhong Hou, Xingjian Leng, Tom Gedeon, Liang Zheng
  • for: This work develops a transformer-based camera configuration generator for multi-view pedestrian detection.
  • methods: Using reinforcement learning, the generator autonomously explores the space of camera configurations (locations, directions, and fields-of-view) and searches for configurations that give the highest detection accuracy on the training dataset, learning techniques such as maximizing coverage, minimizing occlusion, and promoting collaboration.
  • results: Across multiple simulation scenarios, the generated configurations consistently outperform random search, heuristic-based methods, and configurations designed by human experts.
    Abstract Jointly considering multiple camera views (multi-view) is very effective for pedestrian detection under occlusion. For such multi-view systems, it is critical to have well-designed camera configurations, including camera locations, directions, and fields-of-view (FoVs). Usually, these configurations are crafted based on human experience or heuristics. In this work, we present a novel solution that features a transformer-based camera configuration generator. Using reinforcement learning, this generator autonomously explores vast combinations within the action space and searches for configurations that give the highest detection accuracy according to the training dataset. The generator learns advanced techniques like maximizing coverage, minimizing occlusion, and promoting collaboration. Across multiple simulation scenarios, the configurations generated by our transformer-based model consistently outperform random search, heuristic-based methods, and configurations designed by human experts, shedding light on future camera layout optimization.

Object Recognition as Next Token Prediction

  • paper_url: http://arxiv.org/abs/2312.02142
  • repo_url: https://github.com/kaiyuyue/nxtp
  • paper_authors: Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim
  • for: This paper frames object recognition as next token prediction, applying a language decoder that auto-regressively predicts text tokens from image embeddings to form labels.
  • methods: A customized non-causal attention mask with two key features: tokens from different labels are treated as independent, and image tokens are treated as a prefix (see the sketch after the abstract). This masking enables one-shot sampling of multiple labels in parallel during inference. In addition, a compact decoder is built by simply discarding the intermediate blocks of a pretrained language model.
  • results: The compact decoder matches the full model's performance while being notably more efficient.
    Abstract We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
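The customized attention mask, image tokens as a fully visible prefix, causal attention within each label, and no attention across different labels, can be built in a few lines. The token counts below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the non-causal attention mask described above: image tokens
# form a fully visible prefix, label tokens are causal within a label, and
# tokens from different labels cannot attend to each other.
import torch


def build_mask(num_image_tokens: int, label_lengths: list) -> torch.Tensor:
    """Returns a boolean (T, T) mask where True means 'may attend'."""
    total = num_image_tokens + sum(label_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :num_image_tokens] = True          # every token sees the image prefix
    start = num_image_tokens
    for length in label_lengths:               # each label: causal, isolated block
        block = torch.tril(torch.ones(length, length, dtype=torch.bool))
        mask[start:start + length, start:start + length] = block
        start += length
    return mask


if __name__ == "__main__":
    m = build_mask(num_image_tokens=4, label_lengths=[3, 2])
    print(m.int())
```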

iMatching: Imperative Correspondence Learning

  • paper_url: http://arxiv.org/abs/2312.02141
  • repo_url: None
  • paper_authors: Zitong Zhan, Dasong Gao, Yun-Jou Lin, Youjie Xia, Chen Wang
  • for: This work addresses feature correspondence learning, a foundational task in computer vision with downstream applications such as visual odometry and 3D reconstruction.
  • methods: A new self-supervised scheme, imperative learning (IL), formulates correspondence learning as a bilevel optimization that uses the reprojection error from bundle adjustment as the supervisory signal, enabling training on arbitrary uninterrupted videos without camera pose or depth labels; implicit gradients are back-propagated through bundle adjustment via the stationary point to avoid large memory and computation overhead.
  • results: Extensive experiments on feature matching and pose estimation show an average accuracy gain of 30% over state-of-the-art matching models.
    Abstract Learning feature correspondence is a foundational task in computer vision, holding immense importance for downstream applications such as visual odometry and 3D reconstruction. Despite recent progress in data-driven models, feature correspondence learning is still limited by the lack of accurate per-pixel correspondence labels. To overcome this difficulty, we introduce a new self-supervised scheme, imperative learning (IL), for training feature correspondence. It enables correspondence learning on arbitrary uninterrupted videos without any camera pose or depth labels, heralding a new era for self-supervised correspondence learning. Specifically, we formulated the problem of correspondence learning as a bilevel optimization, which takes the reprojection error from bundle adjustment as a supervisory signal for the model. To avoid large memory and computation overhead, we leverage the stationary point to effectively back-propagate the implicit gradients through bundle adjustment. Through extensive experiments, we demonstrate superior performance on tasks including feature matching and pose estimation, in which we obtained an average of 30% accuracy gain over the state-of-the-art matching models.

MANUS: Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians

  • paper_url: http://arxiv.org/abs/2312.02137
  • repo_url: None
  • paper_authors: Chandradeep Pokhariya, Ishaan N Shah, Angela Xing, Zekun Li, Kefan Chen, Avinash Sharma, Srinath Sridhar
  • for: Modeling contact between hands and objects during grasping, with applications in robotics and mixed reality.
  • methods: MANUS, a method for markerless hand-object grasp capture, extends 3D Gaussian splatting into a novel articulated 3D Gaussian representation for high-fidelity modeling of articulating hands, enabling efficient and accurate estimation of hand-object contacts.
  • results: The method estimates contacts accurately and outperforms other methods on a quantitative contact evaluation based on paint transfer; the work also introduces MANUS-Grasps, a dataset of grasps viewed from 53 cameras across 30+ scenes and 3 subjects, comprising over 7M frames.
    Abstract Understanding how we grasp objects with our hands has important applications in areas like robotics and mixed reality. However, this challenging problem requires accurate modeling of the contact between hands and objects. To capture grasps, existing methods use skeletons, meshes, or parametric models that can cause misalignments resulting in inaccurate contacts. We present MANUS, a method for Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians. We build a novel articulated 3D Gaussians representation that extends 3D Gaussian splatting for high-fidelity representation of articulating hands. Since our representation uses Gaussian primitives, it enables us to efficiently and accurately estimate contacts between the hand and the object. For the most accurate results, our method requires tens of camera views that current datasets do not provide. We therefore build MANUS-Grasps, a new dataset that contains hand-object grasps viewed from 53 cameras across 30+ scenes, 3 subjects, and comprising over 7M frames. In addition to extensive qualitative results, we also show that our method outperforms others on a quantitative contact evaluation method that uses paint transfer from the object to the hand.

BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation

  • paper_url: http://arxiv.org/abs/2312.02136
  • repo_url: https://github.com/zqh0253/BerfScene
  • paper_authors: Qihang Zhang, Yinghao Xu, Yujun Shen, Bo Dai, Bolei Zhou, Ceyuan Yang
  • for: A practical and efficient representation for generating large-scale 3D scenes with complex spatial configurations and objects at varying scales.
  • methods: An equivariant radiance field guided by a bird's-eye view (BEV) map; objects in the synthesized scene can be manipulated simply by steering the corresponding BEV map.
  • results: Incorporating positional encoding and low-pass filters into the generator makes the representation equivariant to the given BEV map, enabling large-scale, even infinite-scale, 3D scenes to be produced by synthesizing local scenes and stitching them with smooth consistency.
    Abstract Generating large-scale 3D scenes cannot simply apply existing 3D object synthesis technique since 3D scenes usually hold complex spatial configurations and consist of a number of objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, objects of synthesized 3D scenes could be easily manipulated through steering the corresponding BEV maps. Moreover, by adequately incorporating positional encoding and low-pass filters into the generator, the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and then stitching them with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at https://zqh0253.github.io/BerfScene/.

Re-Nerfing: Enforcing Geometric Constraints on Neural Radiance Fields through Novel Views Synthesis

  • paper_url: http://arxiv.org/abs/2312.02255
  • repo_url: None
  • paper_authors: Felix Tristram, Stefano Gasperini, Federico Tombari, Nassir Navab, Benjamin Busam
  • for: Improving NeRF view synthesis in large-scale, unbounded scenes, including optimization from sparse views.
  • methods: A simple and general multi-stage approach that leverages NeRF's own view synthesis: a NeRF is trained on the available views, pseudo-views are synthesized next to the originals to simulate a stereo or trifocal setup, and a second NeRF is trained on both original and pseudo views while enforcing structural, epipolar constraints.
  • results: Extensive experiments on the mip-NeRF 360 dataset show improvements over the state-of-the-art Zip-NeRF across denser and sparser input scenarios, even when trained with all views.
    Abstract Neural Radiance Fields (NeRFs) have shown remarkable novel view synthesis capabilities even in large-scale, unbounded scenes, albeit requiring hundreds of views or introducing artifacts in sparser settings. Their optimization suffers from shape-radiance ambiguities wherever only a small visual overlap is available. This leads to erroneous scene geometry and artifacts. In this paper, we propose Re-Nerfing, a simple and general multi-stage approach that leverages NeRF's own view synthesis to address these limitations. With Re-Nerfing, we increase the scene's coverage and enhance the geometric consistency of novel views as follows: First, we train a NeRF with the available views. Then, we use the optimized NeRF to synthesize pseudo-views next to the original ones to simulate a stereo or trifocal setup. Finally, we train a second NeRF with both original and pseudo views while enforcing structural, epipolar constraints via the newly synthesized images. Extensive experiments on the mip-NeRF 360 dataset show the effectiveness of Re-Nerfing across denser and sparser input scenarios, bringing improvements to the state-of-the-art Zip-NeRF, even when trained with all views.

Fast View Synthesis of Casual Videos

  • paper_url: http://arxiv.org/abs/2312.02135
  • repo_url: None
  • paper_authors: Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu
  • for: Efficient, high-quality novel view synthesis from an in-the-wild monocular video, despite scene dynamics and limited parallax.
  • methods: Static and dynamic video content are treated separately: a global static scene model uses an extended plane-based representation, augmented with spherical harmonics and displacement maps to capture view-dependent effects and non-planar surface geometry, while dynamic content is represented as per-frame point clouds for efficiency.
  • results: The hybrid representation can be estimated quickly and renders novel views in real time, matching the quality of state-of-the-art methods on in-the-wild videos while being 100x faster in training.
    Abstract Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians

  • paper_url: http://arxiv.org/abs/2312.02134
  • repo_url: None
  • paper_authors: Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang, Boyao Zhou, Boning Liu, Shengping Zhang, Liqiang Nie
  • for: Creating realistic human avatars with dynamic 3D appearance from a single video.
  • methods: Represents the human body across poses and clothing styles with animatable 3D Gaussians, augmented with dynamic properties (a dynamic appearance network plus an optimizable feature tensor) to support pose-dependent appearance modeling.
  • results: Validated on a public dataset and a self-collected dataset, demonstrating superior appearance quality and rendering efficiency compared with other methods.
    Abstract We present GaussianAvatar, an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling, where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover, by leveraging the differentiable motion condition, our method enables a joint optimization of motions and appearances during avatar modeling, which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both the public dataset and our collected dataset, demonstrating its superior performances in terms of appearance quality and rendering efficiency.

Style Aligned Image Generation via Shared Attention

  • paper_url: http://arxiv.org/abs/2312.02133
  • repo_url: https://github.com/google/style-aligned
  • paper_authors: Amir Hertz, Andrey Voynov, Shlomi Fruchter, Daniel Cohen-Or
  • for: Improving style control in text-to-image (T2I) models so that creative applications can produce outputs with a consistent style.
  • methods: Proposes StyleAligned, which applies minimal attention sharing during the diffusion process to keep style consistent across generated images; a reference style can be matched through a straightforward inversion operation.
  • results: Evaluations across diverse styles and text prompts show high-quality, style-consistent synthesis with strong fidelity.
    Abstract Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields, generating visually compelling outputs from textual prompts. However, controlling these models to ensure consistent style remains challenging, with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process, our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity, underscoring its efficacy in achieving consistent style across various inputs.
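As a rough illustration of the attention-sharing idea, the sketch below augments a self-attention call so that every image in a batch also attends to the keys and values of a shared reference image; the shapes and the choice of index 0 as the reference are my assumptions, and this is not the released StyleAligned code.

```python
# Minimal sketch: batch-wide attention sharing with a reference image at index 0.
import torch
import torch.nn.functional as F

def shared_attention(q, k, v):
    """q, k, v: (batch, heads, tokens, dim). Image 0 acts as the style reference."""
    b = q.shape[0]
    # Broadcast the reference keys/values to every batch element and append them,
    # so each image attends to its own tokens plus the reference tokens.
    k_ref = k[:1].expand(b, -1, -1, -1)
    v_ref = v[:1].expand(b, -1, -1, -1)
    k_shared = torch.cat([k, k_ref], dim=2)
    v_shared = torch.cat([v, v_ref], dim=2)
    return F.scaled_dot_product_attention(q, k_shared, v_shared)

# Toy usage with random features standing in for diffusion U-Net activations.
q = torch.randn(4, 8, 64, 32)
k = torch.randn(4, 8, 64, 32)
v = torch.randn(4, 8, 64, 32)
print(shared_attention(q, k, v).shape)  # torch.Size([4, 8, 64, 32])
```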

Can we truly transfer an actor’s genuine happiness to avatars? An investigation into virtual, real, posed and spontaneous faces

  • paper_url: http://arxiv.org/abs/2312.02128
  • repo_url: None
  • paper_authors: Vitor Miguel Xavier Peres, Greice Pinho Dal Molin, Soraia Raupp Musse
  • for: Evaluating Ekman's action units (AUs) on datasets of real and computer-graphics (CG) faces, comparing posed and spontaneous expressions and considering actor gender.
  • methods: Uses posed and spontaneous real-face datasets, CG faces obtained by transferring real faces, and a case study of specific movie characters such as SheHulk and Genius.
  • results: AU intensities are greater for posed than for spontaneous datasets, regardless of gender; when a real face is transformed into CG, intensity is smoothed by up to 80% for AU6 and 45% for AU12.
    Abstract A look is worth a thousand words is a popular phrase. And why is a simple look enough to portray our feelings about something or someone? Behind this question are the theoretical foundations of the field of psychology regarding social cognition and the studies of psychologist Paul Ekman. Facial expressions, as a form of non-verbal communication, are the primary way to transmit emotions between human beings. The set of movements and expressions of facial muscles that convey some emotional state of the individual to their observers are targets of studies in many areas. Our research aims to evaluate Ekman's action units in datasets of real human faces, posed and spontaneous, and virtual human faces resulting from transferring real faces into Computer Graphics faces. In addition, we also conducted a case study with specific movie characters, such as SheHulk and Genius. We intend to find differences and similarities in facial expressions between real and CG datasets, posed and spontaneous faces, and also to consider the actors' genders in the videos. This investigation can help several areas of knowledge, whether using real or virtual human beings, in education, health, entertainment, games, security, and even legal matters. Our results indicate that AU intensities are greater for posed than spontaneous datasets, regardless of gender. Furthermore, there is a smoothing of intensity up to 80 percent for AU6 and 45 percent for AU12 when a real face is transformed into CG.

VerA: Versatile Anonymization Fit for Clinical Facial Images

  • paper_url: http://arxiv.org/abs/2312.02124
  • repo_url: None
  • paper_authors: Majed El Helou, Doruk Cetin, Petar Stamenkovic, Fabio Zund
  • for: Anonymizing clinical facial images, where certain semantic regions must be preserved to show the results of medical interventions.
  • methods: Proposes VerA, a versatile facial image anonymization method that preserves the required semantic regions and supports paired anonymization for before-and-after images.
  • results: VerA outperforms or matches state-of-the-art methods in de-identification and photorealism on regular images, and is validated on single and paired clinical images with extensive quantitative and qualitative evaluation.
    Abstract The escalating legislative demand for data privacy in facial image dissemination has underscored the significance of image anonymization. Recent advancements in the field surpass traditional pixelation or blur methods, yet they predominantly address regular single images. This leaves clinical image anonymization -- a necessity for illustrating medical interventions -- largely unaddressed. We present VerA, a versatile facial image anonymization that is fit for clinical facial images where: (1) certain semantic areas must be preserved to show medical intervention results, and (2) anonymizing image pairs is crucial for showing before-and-after results. VerA outperforms or is on par with state-of-the-art methods in de-identification and photorealism for regular images. In addition, we validate our results on paired anonymization, and on the anonymization of both single and paired clinical images with extensive quantitative and qualitative evaluation.

Mathematical Supplement for the $\texttt{gsplat}$ Library

  • paper_url: http://arxiv.org/abs/2312.02121
  • repo_url: https://github.com/nerfstudio-project/gsplat
  • paper_authors: Vickie Ye, Angjoo Kanazawa
  • for: Documenting the mathematical details of the gsplat library, a modular toolbox for efficient differentiable Gaussian splatting.
  • methods: Provides a self-contained reference for the computations involved in the forward and backward passes of differentiable Gaussian splatting.
  • results: A user-friendly Python API exposing each component of the forward and backward rasterization passes is available at github.com/nerfstudio-project/gsplat.
    Abstract This report provides the mathematical details of the gsplat library, a modular toolbox for efficient differentiable Gaussian splatting, as proposed by Kerbl et al. It provides a self-contained reference for the computations involved in the forward and backward passes of differentiable Gaussian splatting. To facilitate practical usage and development, we provide a user friendly Python API that exposes each component of the forward and backward passes in rasterization at github.com/nerfstudio-project/gsplat .
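The core computation the report documents is the EWA-style projection of a 3D Gaussian into a 2D screen-space Gaussian. The sketch below implements that math generically in NumPy; it is meant to illustrate the equations and does not reproduce the library's actual API.

```python
# Generic sketch of the splatting math: build a 3D covariance from rotation/scale,
# then project it to a 2D screen-space covariance via the perspective Jacobian.
import numpy as np

def covariance_3d(quat, scale):
    """Sigma = R S S^T R^T from a quaternion (w, x, y, z) and per-axis scales."""
    w, x, y, z = quat / np.linalg.norm(quat)
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    M = R @ np.diag(scale)
    return M @ M.T

def project_gaussian(mean_world, cov3d, R_wc, t_wc, fx, fy):
    """2D covariance J W Sigma W^T J^T for a pinhole camera (world-to-camera R_wc, t_wc)."""
    t = R_wc @ mean_world + t_wc                      # Gaussian center in camera space
    J = np.array([[fx / t[2], 0.0, -fx * t[0] / t[2]**2],
                  [0.0, fy / t[2], -fy * t[1] / t[2]**2]])
    return J @ R_wc @ cov3d @ R_wc.T @ J.T

cov3d = covariance_3d(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.05, 0.02, 0.01]))
cov2d = project_gaussian(np.array([0.1, -0.2, 2.0]), cov3d,
                         np.eye(3), np.zeros(3), fx=500.0, fy=500.0)
print(cov2d)
```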

GIVT: Generative Infinite-Vocabulary Transformers

  • paper_url: http://arxiv.org/abs/2312.02116
  • repo_url: None
  • paper_authors: Michael Tschannen, Cian Eastwood, Fabian Mentzer
  • for: Proposing generative infinite-vocabulary transformers (GIVT), which generate sequences of real-valued vectors instead of discrete tokens from a finite vocabulary.
  • methods: Two simple modifications to decoder-only transformers: at the input, the finite-vocabulary lookup table is replaced with a linear projection of the input vectors; at the output, the logits prediction (usually mapped to a categorical distribution) is replaced with the parameters of a multivariate Gaussian mixture model.
  • results: GIVT achieves competitive results on class-conditional image generation with iterative masked modeling and outperforms VQ-GAN and MaskGIT in the causal modeling setting; applied within the UViM framework, it also delivers competitive panoptic segmentation and depth estimation.
    Abstract We introduce generative infinite-vocabulary transformers (GIVT) which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. To this end, we propose two surprisingly simple modifications to decoder-only transformers: 1) at the input, we replace the finite-vocabulary lookup table with a linear projection of the input vectors; and 2) at the output, we replace the logits prediction (usually mapped to a categorical distribution) with the parameters of a multivariate Gaussian mixture model. Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, where transformers are used to model the discrete latent sequences of a VQ-VAE, we use GIVT to model the unquantized real-valued latent sequences of a VAE. When applying GIVT to class-conditional image generation with iterative masked modeling, we show competitive results with MaskGIT, while our approach outperforms both VQ-GAN and MaskGIT when using it for causal modeling. Finally, we obtain competitive results outside of image generation when applying our approach to panoptic segmentation and depth estimation with a VAE-based variant of the UViM framework.
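A minimal sketch of the two modifications described in the abstract, under my own parameterization choices (component count, diagonal covariances): a linear projection replaces the token-embedding lookup, and the output head predicts Gaussian-mixture parameters trained with a negative log-likelihood.

```python
# Sketch of a GIVT-style head: real-valued inputs in, Gaussian-mixture parameters out.
import math
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    def __init__(self, d_model: int, latent_dim: int, n_components: int = 16):
        super().__init__()
        self.k, self.d = n_components, latent_dim
        # mixture logits + per-component means and diagonal log-variances
        self.proj = nn.Linear(d_model, n_components * (1 + 2 * latent_dim))

    def forward(self, h):
        p = self.proj(h)                                   # (B, T, K * (1 + 2D))
        logits = p[..., : self.k]                          # mixture weights (as logits)
        mu, log_var = p[..., self.k :].chunk(2, dim=-1)
        return logits, mu.unflatten(-1, (self.k, self.d)), log_var.unflatten(-1, (self.k, self.d))

def gmm_nll(logits, mu, log_var, target):
    """Negative log-likelihood of real-valued targets under the predicted mixture."""
    target = target.unsqueeze(-2)                          # (B, T, 1, D) broadcast against (B, T, K, D)
    comp_logp = -0.5 * (((target - mu) ** 2) / log_var.exp() + log_var + math.log(2 * math.pi)).sum(-1)
    logp = torch.logsumexp(torch.log_softmax(logits, dim=-1) + comp_logp, dim=-1)
    return -logp.mean()

d_model, latent_dim = 512, 32
input_proj = nn.Linear(latent_dim, d_model)      # replaces the finite-vocabulary lookup table
head = GMMHead(d_model, latent_dim)
x = torch.randn(2, 10, latent_dim)               # unquantized VAE latents (toy values here)
loss = gmm_nll(*head(input_proj(x)), target=x)   # projected input stands in for transformer features
print(loss.item())
```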

ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation

  • paper_url: http://arxiv.org/abs/2312.02109
  • repo_url: None
  • paper_authors: Dar-Yen Chen, Hamish Tennent, Ching-Wen Hsu
  • for: 这篇论文旨在推出一种文本到图像(T2I)风格传输框架,以超越传统颜色、笔触和物体形状的限制,捕捉高级风格元素,如作品结构和独特艺术表达。
  • methods: 该框架具有多级风格编码器和我们提出的显式适应机制,使得ArtAdapter实现了无前例的风格传输精度,以确保与文本描述保持高度一致。此外,ACA也有效地分离内容和风格,解决风格参照中的内容借鉴问题。
  • results: 对比当前状态艺术方法,ArtAdapter的评价表明其在风格传输中具有无 precedent 的精度和灵活性,并且可以在零批训练情况下进行快速跑通。
    Abstract This work introduces ArtAdapter, a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color, brushstrokes, and object shape, capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer, ensuring close alignment with textual descriptions. Additionally, the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style, alleviating the borrowing of content from style references. Moreover, our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods.

Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection

  • paper_url: http://arxiv.org/abs/2312.02103
  • repo_url: None
  • paper_authors: Sunghun Kang, Junbum Cha, Jonghwan Mun, Byungseok Roh, Chang D. Yoo
  • for: This paper focuses on open-vocabulary object detection (OVOD) and aims to directly learn region-text alignment for arbitrary concepts.
  • methods: The proposed method, called Pseudo-Labeling for Arbitrary Concepts (PLAC), uses a simple yet effective approach to learn arbitrary image-to-text mapping for pseudo-labeling of arbitrary concepts.
  • results: The proposed method shows competitive performance on the standard OVOD benchmark for noun concepts and a large improvement on referring expression comprehension benchmark for arbitrary concepts.
    Abstract Open-vocabulary object detection (OVOD) has recently gained significant attention as a crucial step toward achieving human-like visual intelligence. Existing OVOD methods extend target vocabulary from pre-defined categories to open-world by transferring knowledge of arbitrary concepts from vision-language pre-training models to the detectors. While previous methods have shown remarkable successes, they suffer from indirect supervision or limited transferable concepts. In this paper, we propose a simple yet effective method to directly learn region-text alignment for arbitrary concepts. Specifically, the proposed method aims to learn arbitrary image-to-text mapping for pseudo-labeling of arbitrary concepts, named Pseudo-Labeling for Arbitrary Concepts (PLAC). The proposed method shows competitive performance on the standard OVOD benchmark for noun concepts and a large improvement on referring expression comprehension benchmark for arbitrary concepts.

Large Language Models as Consistent Story Visualizers

  • paper_url: http://arxiv.org/abs/2312.02252
  • repo_url: https://github.com/xiaoqian-shen/StoryGPT-V
  • paper_authors: Xiaoqian Shen, Mohamed Elhoseiny
  • for: Addressing a key challenge in story visualization: resolving ambiguous pronoun references and keeping characters and backgrounds consistent across frames.
  • methods: Proposes StoryGPT-V, which combines the strengths of a latent diffusion model (LDM) and a large language model (LLM) to generate images grounded in the given story descriptions while keeping characters and backgrounds consistent.
  • results: Reports superior quantitative results on two visual story visualization benchmarks, with low memory consumption and high-quality character generation.
    Abstract Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts. Nevertheless, a significant challenge remains in applying these models for the more intricate task of story visualization. Since it requires resolving pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution, and ensuring consistent characters and background synthesis across frames. Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce \textbf{StoryGPT-V}, which leverages the merits of the latent diffusion (LDM) and LLM to produce images with consistent and high-quality characters grounded on given story descriptions. First, we train a character-aware LDM, which takes character-augmented semantic embedding as input and includes the supervision of the cross-attention map using character segmentation masks, aiming to enhance character generation accuracy and faithfulness. In the second stage, we enable an alignment between the output of LLM and the character-augmented embedding residing in the input space of the first-stage model. This harnesses the reasoning ability of LLM to address ambiguous references and the comprehension capability to memorize the context. We conduct comprehensive experiments on two visual story visualization benchmarks. Our model reports superior quantitative results and consistently generates accurate characters of remarkable quality with low memory consumption. Our code will be made publicly available.

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

  • paper_url: http://arxiv.org/abs/2312.02087
  • repo_url: https://github.com/showlab/VideoSwap
  • paper_authors: Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang
  • for: Enabling shape-changing video edits via customized video subject swapping: replacing the main subject in a source video with a target subject that has a distinct identity and potentially a different shape.
  • methods: The VideoSwap framework relies on sparse semantic point correspondences, rather than dense correspondences, to align the subject's motion trajectory and modify its shape, and introduces user-point interactions (e.g., removing and dragging points) to handle varied correspondence cases.
  • results: Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
    Abstract Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However, these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change, we explore customized video subject swapping in this work, where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. We also introduce various user-point interactions (\eg, removing points and dragging points) to address various semantic point correspondence. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.

GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians

  • paper_url: http://arxiv.org/abs/2312.02069
  • repo_url: None
  • paper_authors: Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, Matthias Nießner
  • for: Creating highly controllable, photorealistic head avatar models
  • methods: Rigs 3D Gaussian splats to a parametric morphable face model for a dynamic 3D representation, enabling precise animation control through expression transfer and parameter adjustment
  • results: Outstanding performance in multiple challenging scenarios, such as reenacting driving videos, surpassing existing methods
    Abstract We introduce GaussianAvatars, a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying parametric model, e.g., through expression transfer from a driving sequence or by manually changing the morphable model parameters. We parameterize each splat by a local coordinate frame of a triangle and optimize for explicit displacement offset to obtain a more accurate geometric representation. During avatar reconstruction, we jointly optimize for the morphable model parameters and Gaussian splat parameters in an end-to-end fashion. We demonstrate the animation capabilities of our photorealistic avatar in several challenging scenarios. For instance, we show reenactments from a driving video, where our method outperforms existing works by a significant margin.
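To illustrate the rigging idea, the sketch below parameterizes a splat by an offset in a triangle's local coordinate frame, so the splat follows the triangle as the morphable model deforms; the frame construction and the API are my own simplifications, not the authors' code.

```python
# Sketch: a Gaussian splat rigged to a mesh triangle via a local-frame offset.
import numpy as np

def triangle_frame(v0, v1, v2):
    """Local frame of a triangle: origin at the centroid, axes from an edge and the normal."""
    origin = (v0 + v1 + v2) / 3.0
    e1 = v1 - v0
    normal = np.cross(e1, v2 - v0)
    x = e1 / np.linalg.norm(e1)
    z = normal / np.linalg.norm(normal)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=1)          # columns are the local axes
    return origin, R

def splat_world_position(tri_verts, local_offset):
    """Map a splat's learned local-frame offset to world space for the current mesh pose."""
    origin, R = triangle_frame(*tri_verts)
    return origin + R @ local_offset

# Toy example: the same learned offset follows the triangle as the mesh animates.
offset = np.array([0.01, 0.02, 0.005])       # optimized displacement in the local frame
rest_pose = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
deformed = [v + np.array([0.0, 0.0, 0.3]) for v in rest_pose]
print(splat_world_position(rest_pose, offset))
print(splat_world_position(deformed, offset))
```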

DUCK: Distance-based Unlearning via Centroid Kinematics

  • paper_url: http://arxiv.org/abs/2312.02052
  • repo_url: https://github.com/ocram17/duck
  • paper_authors: Marco Cotogni, Jacopo Bonato, Luigi Sabetta, Francesco Pelosin, Alessandro Nicolosi
  • for: Ensuring privacy in modern artificial intelligence models by eradicating residual influence of specific data subsets.
  • methods: Distance-based Unlearning via Centroid Kinematics (DUCK) algorithm using metric learning to remove samples matching the nearest incorrect centroid in the embedding space.
  • results: State-of-the-art performance in class removal and homogeneous sampling removal scenarios, with a novel metric (Adaptive Unlearning Score) to evaluate the unlearning process and a novel membership inference attack to assess the algorithm’s capacity to erase previously acquired knowledge.
    Abstract Machine Unlearning is rising as a new field, driven by the pressing necessity of ensuring privacy in modern artificial intelligence models. This technique primarily aims to eradicate any residual influence of a specific subset of data from the knowledge acquired by a neural model during its training. This work introduces a novel unlearning algorithm, denoted as Distance-based Unlearning via Centroid Kinematics (DUCK), which employs metric learning to guide the removal of samples matching the nearest incorrect centroid in the embedding space. Evaluation of the algorithm's performance is conducted across various benchmark datasets in two distinct scenarios, class removal, and homogeneous sampling removal, obtaining state-of-the-art performance. We introduce a novel metric, called Adaptive Unlearning Score (AUS), encompassing not only the efficacy of the unlearning process in forgetting target data but also quantifying the performance loss relative to the original model. Moreover, we propose a novel membership inference attack to assess the algorithm's capacity to erase previously acquired knowledge, designed to be adaptable to future methodologies.
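A compact sketch of the centroid-kinematics idea as I read the abstract (not the released DUCK implementation): embeddings of samples to be forgotten are pulled toward the nearest incorrect class centroid in the embedding space.

```python
# Sketch: forget-set samples are matched to their nearest *incorrect* class centroid.
import torch
import torch.nn.functional as F

def nearest_incorrect_centroid(embeddings, labels, centroids):
    """For each sample, pick the closest centroid that does NOT belong to its true class."""
    dists = torch.cdist(embeddings, centroids)             # (N, C)
    dists.scatter_(1, labels.unsqueeze(1), float("inf"))   # mask out the correct class
    return centroids[dists.argmin(dim=1)]

def duck_forget_loss(embeddings, labels, centroids):
    targets = nearest_incorrect_centroid(embeddings, labels, centroids)
    return (1.0 - F.cosine_similarity(embeddings, targets, dim=1)).mean()

# Toy usage: class centroids would normally come from the retained training data.
torch.manual_seed(0)
centroids = F.normalize(torch.randn(10, 128), dim=1)       # 10 classes, 128-d embeddings
emb = F.normalize(torch.randn(32, 128), dim=1)             # forget-set embeddings
labels = torch.randint(0, 10, (32,))
print(duck_forget_loss(emb, labels, centroids).item())
```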

Implicit Learning of Scene Geometry from Poses for Global Localization

  • paper_url: http://arxiv.org/abs/2312.02029
  • repo_url: None
  • paper_authors: Mohammad Altillawi, Shile Li, Sai Manoj Prakhya, Ziyuan Liu, Joan Serrat
  • for: Estimating the absolute 6 DoF camera pose from a single image in a previously mapped area, enabling robotics and augmented/virtual reality applications.
  • methods: Rather than directly regressing the pose, the method uses the available pose labels and rigid alignment to learn two 3D geometric representations of the scene (X, Y, Z coordinates in the camera frame and in the global frame), with additional 3D alignment and 2D re-projection constraints during training; at inference, the two predicted representations are rigidly aligned to recover the pose.
  • results: Estimates the scene geometry and the aligned pose in real time, exceeding the pose accuracy of state-of-the-art regression methods on three common visual localization datasets.
    Abstract Global visual localization estimates the absolute pose of a camera using a single image, in a previously mapped area. Obtaining the pose from a single image enables many robotics and augmented/virtual reality applications. Inspired by latest advances in deep learning, many existing approaches directly learn and regress 6 DoF pose from an input image. However, these methods do not fully utilize the underlying scene geometry for pose regression. The challenge in monocular relocalization is the minimal availability of supervised training data, which is just the corresponding 6 DoF poses of the images. In this paper, we propose to utilize these minimal available labels (.i.e, poses) to learn the underlying 3D geometry of the scene and use the geometry to estimate the 6 DoF camera pose. We present a learning method that uses these pose labels and rigid alignment to learn two 3D geometric representations (\textit{X, Y, Z coordinates}) of the scene, one in camera coordinate frame and the other in global coordinate frame. Given a single image, it estimates these two 3D scene representations, which are then aligned to estimate a pose that matches the pose label. This formulation allows for the active inclusion of additional learning constraints to minimize 3D alignment errors between the two 3D scene representations, and 2D re-projection errors between the 3D global scene representation and 2D image pixels, resulting in improved localization accuracy. During inference, our model estimates the 3D scene geometry in camera and global frames and aligns them rigidly to obtain pose in real-time. We evaluate our work on three common visual localization datasets, conduct ablation studies, and show that our method exceeds state-of-the-art regression methods' pose accuracy on all datasets.
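The rigid alignment step that recovers the pose from the two predicted point sets can be illustrated with a standard Kabsch/Procrustes solve; the sketch below is that generic solver, not the paper's training code.

```python
# Sketch: recover R, t aligning globally-predicted 3D points to camera-frame points.
import numpy as np

def rigid_align(points_cam, points_global):
    """Least-squares R, t such that R @ points_global + t ~= points_cam. Points: (N, 3)."""
    mu_c = points_cam.mean(axis=0)
    mu_g = points_global.mean(axis=0)
    H = (points_global - mu_g).T @ (points_cam - mu_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_c - R @ mu_g
    return R, t

# Toy check: recover a known pose from noiseless correspondences.
rng = np.random.default_rng(0)
pts_global = rng.normal(size=(100, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
pts_cam = pts_global @ R_true.T + t_true
R_est, t_est = rigid_align(pts_cam, pts_global)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```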

A multi-channel cycleGAN for CBCT to CT synthesis

  • paper_url: http://arxiv.org/abs/2312.02017
  • repo_url: None
  • paper_authors: Chelsea A. H. Sargeant, Edward G. A. Henderson, Dónal M. McSweeney, Aaron G. Rankin, Denis Page
  • for: Improving CBCT image quality and enabling accurate dose computation for a CBCT-based adaptive radiotherapy workflow.
  • methods: A multi-channel cycleGAN whose input channels emphasize specific image features, together with an auxiliary fusion network that enhances the fidelity of the generated sCT images.
  • results: Effectively addresses some of the challenges inherent in CBCT imaging while restoring the contrast necessary for accurate visualisation of the patient's anatomy.
    Abstract Image synthesis is used to generate synthetic CTs (sCTs) from on-treatment cone-beam CTs (CBCTs) with a view to improving image quality and enabling accurate dose computation to facilitate a CBCT-based adaptive radiotherapy workflow. As this area of research gains momentum, developments in sCT generation methods are difficult to compare due to the lack of large public datasets and sizeable variation in training procedures. To compare and assess the latest advancements in sCT generation, the SynthRAD2023 challenge provides a public dataset and evaluation framework for both MR and CBCT to sCT synthesis. Our contribution focuses on the second task, CBCT-to-sCT synthesis. By leveraging a multi-channel input to emphasize specific image features, our approach effectively addresses some of the challenges inherent in CBCT imaging, whilst restoring the contrast necessary for accurate visualisation of patients' anatomy. Additionally, we introduce an auxiliary fusion network to further enhance the fidelity of generated sCT images.

ColonNeRF: Neural Radiance Fields for High-Fidelity Long-Sequence Colonoscopy Reconstruction

  • paper_url: http://arxiv.org/abs/2312.02015
  • repo_url: None
  • paper_authors: Yufei Shi, Beijia Lu, Jia-Wei Liu, Ming Li, Mike Zheng Shou
  • for: colonoscopy reconstruction for diagnosing colorectal cancer
  • methods: neural radiance field (NeRF) with region division and integration, multi-level fusion, and DensiNet for dense camera pose guidance
  • results: Outperforms existing methods on two benchmarks over four evaluation metrics, with a substantial increase of about 67%-85% in LPIPS-ALEX scores on the SimCol-to-3D dataset, plus clearer textures and more accurate geometric details in reconstruction visualizations.
    Abstract Colonoscopy reconstruction is pivotal for diagnosing colorectal cancer. However, accurate long-sequence colonoscopy reconstruction faces three major challenges: (1) dissimilarity among segments of the colon due to its meandering and convoluted shape; (2) co-existence of simple and intricately folded geometry structures; (3) sparse viewpoints due to constrained camera trajectories. To tackle these challenges, we introduce a new reconstruction framework based on neural radiance field (NeRF), named ColonNeRF, which leverages neural rendering for novel view synthesis of long-sequence colonoscopy. Specifically, to reconstruct the entire colon in a piecewise manner, our ColonNeRF introduces a region division and integration module, effectively reducing shape dissimilarity and ensuring geometric consistency in each segment. To learn both the simple and complex geometry in a unified framework, our ColonNeRF incorporates a multi-level fusion module that progressively models the colon regions from easy to hard. Additionally, to overcome the challenges from sparse views, we devise a DensiNet module for densifying camera poses under the guidance of semantic consistency. We conduct extensive experiments on both synthetic and real-world datasets to evaluate our ColonNeRF. Quantitatively, our ColonNeRF outperforms existing methods on two benchmarks over four evaluation metrics. Notably, our LPIPS-ALEX scores exhibit a substantial increase of about 67%-85% on the SimCol-to-3D dataset. Qualitatively, our reconstruction visualizations show much clearer textures and more accurate geometric details. These sufficiently demonstrate our superior performance over the state-of-the-art methods.

SRTransGAN: Image Super-Resolution using Transformer based Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2312.01999
  • repo_url: None
  • paper_authors: Neeraj Baghel, Shiv Ram Dubey, Satish Kumar Singh
  • for: Synthesizing high-resolution images from low-resolution inputs, for applications such as low-resolution object recognition and medical image enhancement.
  • methods: A transformer-based GAN: a transformer encoder-decoder generator produces 2x and 4x images, and a vision-transformer discriminator treats the image as a sequence of patches, exploiting self-attention to use global information more effectively than CNNs.
  • results: SRTransGAN improves average PSNR and SSIM scores by 4.38% over existing methods; saliency maps are analyzed to understand what the model learns.
    Abstract Image super-resolution aims to synthesize high-resolution image from a low-resolution image. It is an active area to overcome the resolution limitations in several applications like low-resolution object-recognition, medical image enhancement, etc. The generative adversarial network (GAN) based methods have been the state-of-the-art for image super-resolution by utilizing the convolutional neural networks (CNNs) based generator and discriminator networks. However, the CNNs are not able to exploit the global information very effectively in contrast to the transformers, which are the recent breakthrough in deep learning by exploiting the self-attention mechanism. Motivated from the success of transformers in language and vision applications, we propose a SRTransGAN for image super-resolution using transformer based GAN. Specifically, we propose a novel transformer-based encoder-decoder network as a generator to generate 2x images and 4x images. We design the discriminator network using vision transformer which uses the image as sequence of patches and hence useful for binary classification between synthesized and real high-resolution images. The proposed SRTransGAN outperforms the existing methods by 4.38 % on an average of PSNR and SSIM scores. We also analyze the saliency map to understand the learning ability of the proposed method.

Language-only Efficient Training of Zero-shot Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2312.01998
  • repo_url: https://github.com/navervision/lincir
  • paper_authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun
  • for: Making composed image retrieval (CIR) training efficient and scalable by using language as the only training data (zero-shot CIR).
  • methods: Proposes LinCIR, a CIR framework trained only on text via a novel self-supervision called self-masking projection (SMP): the text latent embedding is projected into the token embedding space, keyword tokens of the original text are replaced, and the new and original texts are constrained to share the same latent embedding.
  • results: LinCIR with a CLIP ViT-G backbone trains in 48 minutes and achieves the best zero-shot CIR performance on four benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ.
    Abstract Composed image retrieval (CIR) task takes a composed query of image and text, aiming to search relative images for both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However, the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework, only using language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then, we let the new and original texts have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performances on four different CIR benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
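A rough sketch of the self-masking projection objective as described in the abstract, with several simplifications of my own (a toy mean-pooling encoder, a cosine alignment loss); see the official repository for the actual implementation.

```python
# Sketch of the SMP idea: project the text latent into token space, swap it in for
# keyword tokens, and push the masked and original texts toward the same latent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMPTrainer(nn.Module):
    def __init__(self, text_encoder, latent_dim=768, token_dim=768):
        super().__init__()
        self.text_encoder = text_encoder            # frozen text encoder (interface assumed below)
        self.phi = nn.Linear(latent_dim, token_dim) # the trainable projection

    def forward(self, token_embeds, keyword_mask):
        """token_embeds: (B, T, token_dim); keyword_mask: (B, T) bool, True at keyword tokens."""
        z = self.text_encoder(token_embeds)                     # (B, latent_dim)
        replacement = self.phi(z).unsqueeze(1)                  # projected latent as a pseudo token
        masked = torch.where(keyword_mask.unsqueeze(-1), replacement, token_embeds)
        z_masked = self.text_encoder(masked)
        return 1.0 - F.cosine_similarity(z, z_masked, dim=-1).mean()

# Toy stand-in encoder (mean pooling) just to make the sketch runnable end to end.
encoder = lambda e: F.normalize(e.mean(dim=1), dim=-1)
trainer = SMPTrainer(encoder)
tokens = torch.randn(4, 16, 768)
mask = torch.zeros(4, 16, dtype=torch.bool)
mask[:, 3] = True                                               # pretend token 3 is the keyword
print(trainer(tokens, mask).item())
```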

A Generative Self-Supervised Framework using Functional Connectivity in fMRI Data

  • paper_url: http://arxiv.org/abs/2312.01994
  • repo_url: None
  • paper_authors: Jungwon Choi, Seongho Keum, EungGu Yun, Byung-Hoon Kim, Juho Lee
  • for: Developing a self-supervised learning method for functional connectivity (FC) networks extracted from fMRI data that exploits their time-varying properties to improve prediction accuracy and interpretability.
  • methods: Learns FC networks with graph neural networks (GNNs) and proposes a generative self-supervised framework that harnesses both spatial and temporal information, avoiding the drawbacks of contrastive approaches.
  • results: Experiments on large-scale (>50,000) fMRI datasets show that the learned representations are valuable and yield accurate, robust models when fine-tuned for downstream tasks.
    Abstract Deep neural networks trained on Functional Connectivity (FC) networks extracted from functional Magnetic Resonance Imaging (fMRI) data have gained popularity due to the increasing availability of data and advances in model architectures, including Graph Neural Network (GNN). Recent research on the application of GNN to FC suggests that exploiting the time-varying properties of the FC could significantly improve the accuracy and interpretability of the model prediction. However, the high cost of acquiring high-quality fMRI data and corresponding phenotypic labels poses a hurdle to their application in real-world settings, such that a model na\"ively trained in a supervised fashion can suffer from insufficient performance or a lack of generalization on a small number of data. In addition, most Self-Supervised Learning (SSL) approaches for GNNs to date adopt a contrastive strategy, which tends to lose appropriate semantic information when the graph structure is perturbed or does not leverage both spatial and temporal information simultaneously. In light of these challenges, we propose a generative SSL approach that is tailored to effectively harness spatio-temporal information within dynamic FC. Our empirical results, experimented with large-scale (>50,000) fMRI datasets, demonstrate that our approach learns valuable representations and enables the construction of accurate and robust models when fine-tuned for downstream tasks.

Bootstrapping SparseFormers from Vision Foundation Models

  • paper_url: http://arxiv.org/abs/2312.01987
  • repo_url: https://github.com/showlab/sparseformer
  • paper_authors: Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou
  • for: Bootstrapping SparseFormer architectures, which cut computational cost by using far fewer visual tokens with adjustable RoIs, from large-scale pre-trained vision foundation models instead of training them from scratch.
  • methods: Inherits and largely freezes weights from pre-trained vision transformers (most SparseFormer blocks are standard transformer blocks), training only the lightweight focusing transformer that adjusts token RoIs and fine-tuning a few early blocks to align the final token representation, using a small amount of data (e.g., IN-1K) without labels or captions in just a few hours.
  • results: A unimodal SparseFormer bootstrapped from AugReg-ViT-L/16-384 reaches 84.9% accuracy on IN-1K with only 49 tokens; SparseFormers bootstrapped from CLIP show notable zero-shot performance at greatly reduced cost without seeing any caption, and can serve as efficient vision encoders in multimodal large language models.
    Abstract The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and without labels or captions within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code will be publicly available at https://github.com/showlab/sparseformer
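The bootstrapping recipe can be summarized as "inherit, freeze, and train a small remainder". The sketch below shows that pattern in PyTorch; the attribute names focusing_transformer and blocks are hypothetical placeholders for whatever the real model exposes.

```python
# Sketch: inherit ViT weights, freeze most parameters, train only the SparseFormer-specific parts.
import torch.nn as nn

def bootstrap_sparseformer(sparseformer: nn.Module, pretrained_vit: nn.Module,
                           n_trainable_early_blocks: int = 2) -> nn.Module:
    # 1) Inherit whatever weights line up with the pre-trained ViT (standard transformer blocks).
    sparseformer.load_state_dict(pretrained_vit.state_dict(), strict=False)

    # 2) Freeze everything by default.
    for p in sparseformer.parameters():
        p.requires_grad = False

    # 3) Unfreeze only the lightweight focusing transformer that adjusts token RoIs,
    #    plus a few early blocks (attribute names here are placeholders).
    for p in sparseformer.focusing_transformer.parameters():
        p.requires_grad = True
    for block in sparseformer.blocks[:n_trainable_early_blocks]:
        for p in block.parameters():
            p.requires_grad = True

    trainable = sum(p.numel() for p in sparseformer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in sparseformer.parameters())
    print(f"training {trainable / max(total, 1):.1%} of {total} parameters")
    return sparseformer
```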

UniGS: Unified Representation for Image Generation and Segmentation

  • paper_url: http://arxiv.org/abs/2312.01985
  • repo_url: https://github.com/qqlu/entity
  • paper_authors: Lu Qi, Lehan Yang, Weidong Guo, Yu Xu, Bo Du, Varun Jampani, Ming-Hsuan Yang
  • for: A unified diffusion-based representation for image generation and segmentation.
  • methods: Represents entity-level masks with a colormap and introduces two modules, a location-aware color palette and a progressive dichotomy module, to keep colors consistent with entity locations and to decode the synthesized colormap into high-quality masks via a depth-first binary search without knowing the number of clusters; an inpainting pipeline compensates for the lack of large-scale segmentation training data.
  • results: The approach is flexible across multiple tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation, with mask quality comparable to the state of the art.
    Abstract This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically, we use a colormap to represent entity-level masks, addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules, including the location-aware color palette and progressive dichotomy module, are proposed to support our mask representation. On the one hand, a location-aware palette guarantees the colors' consistency to entities' locations. On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data, we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks, including inpainting, image synthesis, referring segmentation, and entity segmentation. Comprehensive experiments validate the efficiency of our approach, demonstrating comparable segmentation mask quality to state-of-the-art and adaptability to multiple tasks. The code will be released at \href{https://github.com/qqlu/Entity}{https://github.com/qqlu/Entity}.

Semantics-aware Motion Retargeting with Vision-Language Models

  • paper_url: http://arxiv.org/abs/2312.01964
  • repo_url: None
  • paper_authors: Haodong Zhang, ZhiKe Chen, Haocheng Xu, Lei Hao, Xiaofei Wu, Songcen Xu, Zhensong Zhang, Yue Wang, Rong Xiong
  • for: motion retargeting between animation characters, capturing and preserving motion semantics
  • methods: utilizes vision-language models to extract and maintain meaningful motion semantics, incorporates high-level motion semantics into the motion retargeting process, and adopts a two-stage pipeline with skeleton-aware pre-training and fine-tuning
  • results: produces high-quality motion retargeting results while accurately preserving motion semantics, as demonstrated through experimental results.
    Abstract Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most of the previous works neglect the semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. Then the high-level motion semantics are incorporated into the motion retargeting process by feeding the vision-language model with the rendered images and aligning the extracted semantic embeddings. To ensure the preservation of fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics. Project page can be found at https://sites.google.com/view/smtnet.
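As an illustration of the semantics-preservation idea, the sketch below embeds renders of the source and retargeted motions with a frozen vision-language image encoder and penalizes drift between the two embeddings; the loss form and the toy encoder are my assumptions, not the authors' pipeline.

```python
# Sketch: keep motion semantics by aligning VLM embeddings of source and retargeted renders.
import torch
import torch.nn.functional as F

def semantic_consistency_loss(vlm_image_encoder, source_frames, retargeted_frames):
    """Frames: (B, 3, H, W) renders of the source and the retargeted motion."""
    with torch.no_grad():                                    # the vision-language encoder stays frozen
        e_src = F.normalize(vlm_image_encoder(source_frames), dim=-1)
    e_tgt = F.normalize(vlm_image_encoder(retargeted_frames), dim=-1)
    return 1.0 - (e_src * e_tgt).sum(dim=-1).mean()          # penalize drift in semantic embedding space

# Toy stand-in for an image encoder so the sketch runs; in practice this would be e.g. CLIP.
W = torch.randn(3 * 64 * 64, 512)
toy_encoder = lambda x: x.flatten(1) @ W
src = torch.rand(2, 3, 64, 64)
tgt = torch.rand(2, 3, 64, 64, requires_grad=True)           # differentiable renders of the retargeted motion
loss = semantic_consistency_loss(toy_encoder, src, tgt)
loss.backward()
print(loss.item(), tgt.grad.shape)
```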

Instance-guided Cartoon Editing with a Large-scale Dataset

  • paper_url: http://arxiv.org/abs/2312.01943
  • repo_url: None
  • paper_authors: Jian Lin, Chengze Li, Xueting Liu, Zhongping Ge
  • for: Providing automated instance segmentation of cartoon characters to support cartoon editing applications such as visual style editing, motion decomposition and transfer, and stereoscopic depth computation.
  • methods: Introduces a high-quality cartoon-dedicated dataset of over 100k paired high-resolution images and instance masks, together with an instance-aware segmentation model for high-resolution character extraction.
  • results: Enables a range of segmentation-dependent applications, including 3D Ken Burns parallax effects, text-guided cartoon style editing, and puppet animation from illustrations and manga.
    Abstract Cartoon editing, appreciated by both professional illustrators and hobbyists, allows extensive creative freedom and the development of original narratives within the cartoon domain. However, the existing literature on cartoon editing is complex and leans heavily on manual operations, owing to the challenge of automatic identification of individual character instances. Therefore, an automated segmentation of these elements becomes imperative to facilitate a variety of cartoon editing applications such as visual style editing, motion decomposition and transfer, and the computation of stereoscopic depths for an enriched visual experience. Unfortunately, most current segmentation methods are designed for natural photographs, failing to recognize from the intricate aesthetics of cartoon subjects, thus lowering segmentation quality. The major challenge stems from two key shortcomings: the rarity of high-quality cartoon dedicated datasets and the absence of competent models for high-resolution instance extraction on cartoons. To address this, we introduce a high-quality dataset of over 100k paired high-resolution cartoon images and their instance labeling masks. We also present an instance-aware image segmentation model that can generate accurate, high-resolution segmentation masks for characters in cartoon images. We present that the proposed approach enables a range of segmentation-dependent cartoon editing applications like 3D Ken Burns parallax effects, text-guided cartoon style editing, and puppet animation from illustrations and manga.

COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction

  • paper_url: http://arxiv.org/abs/2312.01919
  • repo_url: None
  • paper_authors: Qihang Ma, Xin Tan, Yanyun Qu, Lizhuang Ma, Zhizhong Zhang, Yuan Xie
  • for: Strengthening vision-based 3D occupancy prediction for autonomous driving, improving 3D spatial understanding and general object recognition.
  • methods: Proposes the Compact Occupancy TRansformer (COTR), combining a geometry-aware occupancy encoder (efficient explicit-implicit view transformation) with a semantic-aware group decoder (coarse-to-fine semantic grouping) to build a compact 3D occupancy representation.
  • results: COTR clearly outperforms multiple baselines, with relative improvements of 8%-15%, demonstrating the superiority of the method.
    Abstract The autonomous driving community has shown significant interest in 3D occupancy prediction, driven by its exceptional geometric perception and general object recognition capabilities. To achieve this, current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from the Bird-Eye-View perception. However, compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but reducant computational costs. To address the above limitations, we propose Compact Occupancy TRansformer (COTR), with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. Then, the occupancy decoder further enhances the semantic discriminability of the compact OCC representation by a coarse-to-fine semantic grouping strategy. Empirical experiments show that there are evident performance gains across multiple baselines, e.g., COTR outperforms baselines with a relative improvement of 8%-15%, demonstrating the superiority of our method.

A Reliable Representation with Bidirectional Transition Model for Visual Reinforcement Learning Generalization

  • paper_url: http://arxiv.org/abs/2312.01915
  • repo_url: None
  • paper_authors: Xiaobo Hu, Youfang Lin, Yue Liu, Jinwen Wang, Shuo Wang, Hehe Fan, Kai Lv
  • for: Learning reliable and generalizable representations for control tasks with high-dimensional visual observations.
  • methods: Inspired by the human thought process, the Bidirectional Transition (BiT) model predicts environmental transitions both forward and backward in time to extract reliable representations.
  • results: Achieves competitive generalization and sample efficiency on two settings of the DeepMind Control suite, with wider applicability demonstrated on robotic manipulation and the CARLA simulator.
    Abstract Visual reinforcement learning has proven effective in solving control tasks with high-dimensional observations. However, extracting reliable and generalizable representations from vision-based observations remains a central challenge. Inspired by the human thought process, when the representation extracted from the observation can predict the future and trace history, the representation is reliable and accurate in comprehending the environment. Based on this concept, we introduce a Bidirectional Transition (BiT) model, which leverages the ability to bidirectionally predict environmental transitions both forward and backward to extract reliable representations. Our model demonstrates competitive generalization performance and sample efficiency on two settings of the DeepMind Control suite. Additionally, we utilize robotic manipulation and CARLA simulators to demonstrate the wide applicability of our method.
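A minimal sketch of a bidirectional transition objective in the spirit of BiT, using low-dimensional state vectors instead of images for brevity: one head predicts the next latent from the current latent and action, another predicts the previous latent, and both errors supervise the encoder. This is my simplification, not the paper's architecture.

```python
# Sketch: forward and backward transition prediction as a representation-learning signal.
import torch
import torch.nn as nn

class BidirectionalTransition(nn.Module):
    def __init__(self, obs_dim, action_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.forward_model = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                           nn.Linear(128, latent_dim))
        self.backward_model = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                            nn.Linear(128, latent_dim))

    def loss(self, obs_t, action_t, obs_next):
        z_t, z_next = self.encoder(obs_t), self.encoder(obs_next)
        pred_next = self.forward_model(torch.cat([z_t, action_t], dim=-1))      # predict the future
        pred_prev = self.backward_model(torch.cat([z_next, action_t], dim=-1))  # trace the history
        forward_err = (pred_next - z_next.detach()).pow(2).mean()
        backward_err = (pred_prev - z_t.detach()).pow(2).mean()
        return forward_err + backward_err

model = BidirectionalTransition(obs_dim=32, action_dim=4)
obs_t, obs_next = torch.randn(16, 32), torch.randn(16, 32)
action = torch.randn(16, 4)
print(model.loss(obs_t, action, obs_next).item())
```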

Unsupervised Anomaly Detection using Aggregated Normative Diffusion

  • paper_url: http://arxiv.org/abs/2312.01904
  • repo_url: https://github.com/alexanderfrotscher/andi
  • paper_authors: Alexander Frotscher, Jaivardhan Kapoor, Thomas Wolfers, Christian F. Baumgartner
  • for: Early detection of anomalies in medical images such as brain MRI, which is highly relevant for the diagnosis and treatment of many conditions.
  • methods: Uses unsupervised anomaly detection (UAD) rather than supervised learning limited to a few well-labeled pathologies; the proposed Aggregated Normative Diffusion (ANDi) aggregates differences between predicted denoising steps and ground-truth backward transitions in DDPMs trained on pyramidal Gaussian noise.
  • results: Existing state-of-the-art UAD approaches do not generalize well to diverse anomalies in realistic multi-modal MR data; ANDi substantially surpasses three recent baselines across three brain MRI datasets, with improvements of up to 178% in AUPRC for multiple sclerosis (MS) lesions.
    Abstract Early detection of anomalies in medical images such as brain MRI is highly relevant for diagnosis and treatment of many conditions. Supervised machine learning methods are limited to a small number of pathologies where there is good availability of labeled data. In contrast, unsupervised anomaly detection (UAD) has the potential to identify a broader spectrum of anomalies by spotting deviations from normal patterns. Our research demonstrates that existing state-of-the-art UAD approaches do not generalise well to diverse types of anomalies in realistic multi-modal MR data. To overcome this, we introduce a new UAD method named Aggregated Normative Diffusion (ANDi). ANDi operates by aggregating differences between predicted denoising steps and ground truth backwards transitions in Denoising Diffusion Probabilistic Models (DDPMs) that have been trained on pyramidal Gaussian noise. We validate ANDi against three recent UAD baselines, and across three diverse brain MRI datasets. We show that ANDi, in some cases, substantially surpasses these baselines and shows increased robustness to varying types of anomalies. Particularly in detecting multiple sclerosis (MS) lesions, ANDi achieves improvements of up to 178% in terms of AUPRC.
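A simplified reading of the scoring procedure, sketched below: for several noise levels, the normative DDPM's predicted denoising step is compared against the known backward transition of the input, and the pixel-wise differences are aggregated into an anomaly map. The noising convention and aggregation are my assumptions, not the released ANDi code.

```python
# Sketch: aggregate per-timestep denoising discrepancies into a pixel-wise anomaly map.
import torch

def andi_anomaly_map(denoiser, image, scheduler_alphas, timesteps):
    """image: (B, C, H, W); scheduler_alphas: cumulative alpha-bar values per timestep."""
    score = torch.zeros_like(image)
    for t in timesteps:
        a_bar = scheduler_alphas[t]
        noise = torch.randn_like(image)
        x_t = a_bar.sqrt() * image + (1 - a_bar).sqrt() * noise   # forward diffusion of the input
        pred_noise = denoiser(x_t, t)                             # normative model's denoising prediction
        # The ground-truth backward transition would remove exactly `noise`; anomalous
        # regions make the normative model predict something different.
        score += (pred_noise - noise).abs()
    return (score / len(timesteps)).mean(dim=1, keepdim=True)     # (B, 1, H, W) anomaly map

# Toy usage with a dummy denoiser so the sketch runs end to end.
dummy_denoiser = lambda x, t: torch.zeros_like(x)
alphas = torch.linspace(0.99, 0.5, 1000)
img = torch.rand(1, 4, 64, 64)                                    # e.g. 4 MR modalities
amap = andi_anomaly_map(dummy_denoiser, img, alphas, timesteps=[100, 300, 500])
print(amap.shape)
```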
    摘要 在脑MRI等医学图像中尽早检测异常,对许多疾病的诊断和治疗非常重要。有监督机器学习方法仅适用于少数具有充足标注数据的病种。相比之下,无监督异常检测(UAD)有潜力通过发现偏离正常模式的情况来识别更广泛的异常。我们的研究表明,现有的最先进UAD方法难以在真实的多模态MR数据上对多种异常类型泛化。为了解决这个问题,我们提出了一种新的UAD方法,名为Aggregated Normative Diffusion(ANDi)。ANDi将去噪扩散概率模型(DDPM)预测的去噪步骤与真实的反向转移之间的差异进行聚合,这些DDPM是在金字塔式高斯噪声上训练的。我们将ANDi与三种最新的UAD基线方法在三个不同的脑MRI数据集上进行了比较。结果表明,ANDi在某些情况下可以大幅超越这些基线方法,并对不同类型的异常表现出更强的鲁棒性。特别是在检测多发性硬化(MS)病灶时,ANDi在AUPRC指标上取得了最高178%的提升。
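
As a rough illustration of the aggregation idea, the sketch below forward-diffuses an input with known noise at a few timesteps and accumulates the residual between a normative DDPM's predicted noise and the true noise; regions the model cannot denoise well accumulate a high score. The noise predictor `eps_model`, the linear beta schedule, and the chosen timesteps are assumptions, and the actual ANDi formulation (pyramidal Gaussian noise, ground-truth backward transitions) is richer.

```python
# Simplified sketch of a normative-diffusion anomaly map: aggregate, over several
# timesteps, the discrepancy between the model's predicted noise and the known noise
# used to diffuse the input. `eps_model` is assumed to be a noise predictor trained
# on healthy anatomy only.
import torch

def anomaly_map(x0, eps_model, timesteps=(50, 100, 150, 200), T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    score = torch.zeros_like(x0)
    for t in timesteps:
        a_bar = alphas_bar[t]
        noise = torch.randn_like(x0)
        # Forward-diffuse the input to step t with known noise.
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        t_batch = torch.full((x0.shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x_t, t_batch)
        # In-distribution regions are denoised accurately; anomalies leave residuals.
        score += (eps_hat - noise).abs()
    return score / len(timesteps)
```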

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

  • paper_url: http://arxiv.org/abs/2312.01897
  • repo_url: None
  • paper_authors: Min Yang, Huan Gao, Ping Guo, Limin Wang
  • for: 这个论文是为了解决temporal action detection(TAD)问题,即在不修剪的视频中检测动作的问题。
  • methods: 这个论文使用了预训练的vision transformer(ViT)模型,并设计了一种新的机制来适应这些预训练模型,以便在更广泛的时间上下文中更好地捕捉动作之间的关系。这个机制包括内部归一化层和后归一化层。
  • results: 实验结果显示,使用这种新机制可以在THUMOS14、ActivityNet-1.3和FineAction三个数据集上分别达到69.0、37.12和17.20的平均mAP。
    Abstract Vision transformer (ViT) has shown high potential in video recognition, owing to its flexible design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet, it still remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. The existing works treat them as off-the-shelf feature extractors for each short trimmed snippet without capturing the fine-grained relation among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer to fully unleash its modeling power in capturing inter-snippet relation, while still keeping low computation overhead and memory consumption for efficient TAD. To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets from two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) yields a very competitive performance to previous temporal action detectors, riching up to 69.0 average mAP on THUMOS14, 37.12 average mAP on ActivityNet-1.3 and 17.20 average mAP on FineAction.
    摘要 视觉Transformer(ViT)凭借其灵活的设计、自适应的自注意机制以及掩码预训练的有效性,在视频识别方面表现出了很高的潜力。然而,如何将这些预训练的短时ViT模型适配到未剪辑视频中的时序动作检测(TAD)仍不清楚。现有工作通常将它们视为现成的特征提取器,对每个剪出的短片段独立提取特征,而没有在更广的时间上下文中捕捉不同片段之间的细粒度关系。为此,本文着重设计一种新机制,将这些预训练ViT模型适配为统一的长视频Transformer,以充分释放其建模片段间关系的能力,同时保持较低的计算开销和内存占用,实现高效的TAD。为此,我们设计了有效的跨片段传播模块,从两个层面逐步交换不同片段之间的短时视频信息:在主干内部的信息传播上,我们引入跨片段传播策略,使多个片段的时序特征在主干内部进行交互;在主干之后的信息传播上,我们提出时序Transformer层进行进一步的片段级建模。使用经VideoMAE预训练的普通ViT-B,我们的端到端时序动作检测器(ViT-TAD)取得了与先前时序动作检测器相比非常有竞争力的性能,在THUMOS14、ActivityNet-1.3和FineAction上分别达到69.0、37.12和17.20的平均mAP。

InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models

  • paper_url: http://arxiv.org/abs/2312.01886
  • repo_url: None
  • paper_authors: Xunguang Wang, Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang
  • for: 这篇论文旨在攻击大型视觉语言模型(LVLM),即使攻击者只能访问受害模型的视觉编码器,而不知道其提示和底层大语言模型的详细信息。
  • methods: 我们提出了一种新的实用攻击方法:在攻击者仅能访问受害模型视觉编码器的情况下,先用文本到图像生成模型将目标响应逆向为目标图像,再使用GPT-4从目标响应中推断出合理的指令,随后利用本地代理模型提取指令感知特征,并通过最小化对抗样本与目标图像这两组特征之间的距离来优化对抗样本。
  • results: 我们提出的方法在目标攻击性能和迁移性方面具有显著优势。
    Abstract Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical gray-box attack scenario that the adversary can only access the visual encoder of the victim LVLM, without the knowledge of its prompts (which are often proprietary for service providers and not publicly available) and its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attack, which aims to confuse the LVLM to output a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed InstructTA) to deliver the targeted adversarial attack on LVLMs with high transferability. Initially, we utilize a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same visual encoder with the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and minimize the distance between these two features to optimize the adversarial example. To further improve the transferability, we augment the instruction $\boldsymbol{p}^\prime$ with instructions paraphrased from an LLM. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability.
    摘要 大型视觉语言模型(LVLM)已经展现出惊人的图像理解和回复生成能力。然而,这种丰富的视觉交互也使得LVLM容易受到对抗样本的攻击。在这篇论文中,我们提出了一种新颖且实用的灰盒攻击场景:攻击者只能访问受害LVLM的视觉编码器,而不知道它的提示(这些提示通常为服务提供者专有,不会公开)以及其底层大语言模型(LLM)。这种实际设置给目标对抗攻击的跨提示和跨模型迁移性带来了挑战,该攻击旨在让LVLM输出与攻击者选定的目标文本语义相近的回复。为此,我们提出了一种指令调优的目标攻击(称为InstructTA),能够以高迁移性对LVLM实施目标对抗攻击。首先,我们使用公开的文本到图像生成模型将目标回复"逆向"为目标图像,并使用GPT-4从目标回复中推断出一个合理的指令 $\boldsymbol{p}'$。随后,我们构建一个与受害LVLM共享同一视觉编码器的本地代理模型,提取对抗样本图像与目标图像的指令感知特征,并通过最小化这两组特征之间的距离来优化对抗样本。为进一步提高迁移性,我们用LLM改写得到的指令来扩充指令 $\boldsymbol{p}'$。大量实验表明,我们提出的方法在目标攻击性能和迁移性方面具有优势。
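
The core optimization step, matching surrogate features of the adversarial image to those of the target image, can be sketched as a simple PGD-style loop. The surrogate `encoder`, the L2 feature distance, and the step/budget values are assumptions; the full InstructTA pipeline additionally conditions on GPT-4-inferred and LLM-paraphrased instructions.

```python
# PGD-style sketch of the feature-matching step: optimize an adversarial image so that
# a local surrogate vision encoder maps it close to the target image's features.
# `encoder`, eps, alpha, and steps are illustrative assumptions.
import torch

def feature_matching_attack(x, x_target, encoder, eps=8/255, alpha=1/255, steps=100):
    with torch.no_grad():
        feat_target = encoder(x_target)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = torch.nn.functional.mse_loss(encoder(x_adv), feat_target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()                       # descend on feature distance
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps).clamp(0, 1)  # project to L-inf ball
    return x_adv.detach()
```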

FeaInfNet: Diagnosis in Medical Image with Feature-Driven Inference and Visual Explanations

  • paper_url: http://arxiv.org/abs/2312.01871
  • repo_url: None
  • paper_authors: Yitao Peng, Lianghua He, Die Hu, Yihang Liu, Longzhen Yang, Shaohua Shang
  • for: 这篇论文的目的是提出一种可解释深度学习模型,以解决医疗影像诊断中的问题。
  • methods: 这篇论文提出了两个关键的创新:首先,提出了一个基于特征的网络推理结构,用于比较每个子区域图像块与该区域可能出现的疾病模板和正常模板的相似度,最后将各子区域的比较结果组合得到最终诊断。其次,提出了局部特征掩码(LFM)以提高特征表达能力,并提出了自适应动态掩码(Adaptive-DM)以将特征向量和原型解释为人类可理解的图像块。
  • results: 在多个公开医疗影像数据集(RSNA、iChallenge-PM、Covid-19、ChinaCXRSet和MontgomerySet)上的实验结果显示,我们的方法在医疗影像诊断中取得了优于基线方法的最先进分类精度和可解释性。我们还对每个提出的组件进行了消融研究以证明其有效性。
    Abstract Interpretable deep learning models have received widespread attention in the field of image recognition. Due to the unique multi-instance learning of medical images and the difficulty in identifying decision-making regions, many interpretability models that have been proposed still have problems of insufficient accuracy and interpretability in medical image disease diagnosis. To solve these problems, we propose feature-driven inference network (FeaInfNet). Our first key innovation involves proposing a feature-based network reasoning structure, which is applied to FeaInfNet. The network of this structure compares the similarity of each sub-region image patch with the disease templates and normal templates that may appear in the region, and finally combines the comparison of each sub-region to make the final diagnosis. It simulates the diagnosis process of doctors to make the model interpretable in the reasoning process, while avoiding the misleading caused by the participation of normal areas in reasoning. Secondly, we propose local feature masks (LFM) to extract feature vectors in order to provide global information for these vectors, thus enhancing the expressive ability of the FeaInfNet. Finally, we propose adaptive dynamic masks (Adaptive-DM) to interpret feature vectors and prototypes into human-understandable image patches to provide accurate visual interpretation. We conducted qualitative and quantitative experiments on multiple publicly available medical datasets, including RSNA, iChallenge-PM, Covid-19, ChinaCXRSet, and MontgomerySet. The results of our experiments validate that our method achieves state-of-the-art performance in terms of classification accuracy and interpretability compared to baseline methods in medical image diagnosis. Additional ablation studies verify the effectiveness of each of our proposed components.
    摘要 可解释深度学习模型在图像识别领域受到了广泛关注。由于医学图像特有的多实例学习以及决策区域难以识别,许多已提出的可解释模型在医学图像疾病诊断中仍存在精度和可解释性不足的问题。为解决这些问题,我们提出了特征驱动推理网络(FeaInfNet)。我们的首要创新是提出基于特征的网络推理结构:该结构将每个子区域图像块与该区域可能出现的疾病模板和正常模板进行相似度比较,并最终组合各子区域的比较结果得到诊断。这种结构模拟医生的诊断过程,使模型的推理过程可解释,同时避免正常区域参与推理所带来的误导。其次,我们提出了局部特征掩码(LFM)来提取特征向量,为这些向量提供全局信息,从而增强FeaInfNet的表达能力。最后,我们提出了自适应动态掩码(Adaptive-DM),将特征向量和原型解释为人类可理解的图像块,以提供准确的视觉解释。我们在多个公开医学影像数据集(RSNA、iChallenge-PM、Covid-19、ChinaCXRSet和MontgomerySet)上进行了定性和定量实验。实验结果表明,与基线方法相比,我们的方法在医学图像诊断的分类精度和可解释性上均达到了最先进水平。额外的消融研究也证明了我们提出的每个组件的有效性。

Unveiling Objects with SOLA: An Annotation-Free Image Search on the Object Level for Automotive Data Sets

  • paper_url: http://arxiv.org/abs/2312.01860
  • repo_url: None
  • paper_authors: Philipp Rigoll, Jacob Langner, Eric Sax
  • for: 本研究的目的是发展自动驾驶系统的视觉感知系统,需要大量的图像数据集来训练Robust的神经网络。
  • methods: 本研究使用了现代神经网络来搜寻图像中的特定物件,并使用自然语言描述查询。
  • results: 我们的方法可以实现时间和努力的减少,并且在汽车数据集上进行评估,获得了良好的成绩。
    Abstract Huge image data sets are the fundament for the development of the perception of automated driving systems. A large number of images is necessary to train robust neural networks that can cope with diverse situations. A sufficiently large data set contains challenging situations and objects. For testing the resulting functions, it is necessary that these situations and objects can be found and extracted from the data set. While it is relatively easy to record a large amount of unlabeled data, it is far more difficult to find demanding situations and objects. However, during the development of perception systems, it must be possible to access challenging data without having to perform lengthy and time-consuming annotations. A developer must therefore be able to search dynamically for specific situations and objects in a data set. Thus, we designed a method which is based on state-of-the-art neural networks to search for objects with certain properties within an image. For the ease of use, the query of this search is described using natural language. To determine the time savings and performance gains, we evaluated our method qualitatively and quantitatively on automotive data sets.
    摘要 巨大的图像数据集是自动驾驶系统感知能力发展的基础。训练能够应对多样化情况的鲁棒神经网络需要大量图像,而一个足够大的数据集应包含具有挑战性的场景和物体。为了测试所得到的功能,必须能够在数据集中找到并提取这些场景和物体。虽然录制大量未标注数据相对容易,但找到具有挑战性的场景和物体却困难得多。然而,在感知系统的开发过程中,必须能够在无需进行漫长且耗时的标注的情况下访问这些具有挑战性的数据,因此开发者需要能够在数据集中动态地搜索特定的场景和物体。为此,我们设计了一种基于最先进神经网络的方法,用于在图像中搜索具有特定属性的物体;为了便于使用,搜索查询以自然语言描述。为了评估节省的时间和性能收益,我们在汽车数据集上对该方法进行了定性和定量评估。
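
The underlying retrieval step can be sketched as ranking object crops against a natural-language query in a shared embedding space. `embed_image` / `embed_text` stand in for any CLIP-style encoder pair, and the object-crop extraction stage is assumed to exist upstream.

```python
# Sketch of annotation-free object-level search: rank pre-extracted object crops by
# their similarity to a natural-language query. The encoder pair and crop extraction
# are assumptions, not the paper's exact pipeline.
import torch

def search_objects(query: str, crops: list, embed_image, embed_text, top_k=10):
    with torch.no_grad():
        text_feat = embed_text([query])                          # (1, D)
        img_feats = torch.cat([embed_image(c) for c in crops])   # (N, D)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ text_feat.T).squeeze(-1)               # cosine similarity
    top = scores.topk(min(top_k, len(crops)))
    return top.indices.tolist(), top.values.tolist()

# Usage: crops could come from class-agnostic proposals on unlabeled automotive frames;
# the query might be "a cyclist wearing a red jacket".
```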

Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing

  • paper_url: http://arxiv.org/abs/2312.01853
  • repo_url: None
  • paper_authors: Ying Yuan, Haichuan Che, Yuzhe Qin, Binghao Huang, Zhao-Heng Yin, Kang-Won Lee, Yi Wu, Soo-Chul Lim, Xiaolong Wang
  • for: 这篇论文研究如何通过融合视觉与触觉反馈来完成需要丰富接触的灵巧手内操作任务,提高机器人的操控能力。
  • methods: 该论文提出了一种基于点云的触觉表示方法,称为Robot Synesthesia(机器人联觉),这种方法可以同时无缝融合视觉和触觉输入,提供更丰富的空间信息,并使机器人更好地理解自己的动作。
  • results: 该论文在仿真环境中训练并将方法部署到真实机器人上,并进行了详细的消融实验,证明在多种手内物体旋转任务中,融合视觉与触觉反馈可以提升强化学习和Sim2Real性能。
    Abstract Executing contact-rich manipulation tasks necessitates the fusion of tactile and visual feedback. However, the distinct nature of these modalities poses significant challenges. In this paper, we introduce a system that leverages visual and tactile sensory inputs to enable dexterous in-hand manipulation. Specifically, we propose Robot Synesthesia, a novel point cloud-based tactile representation inspired by human tactile-visual synesthesia. This approach allows for the simultaneous and seamless integration of both sensory inputs, offering richer spatial information and facilitating better reasoning about robot actions. The method, trained in a simulated environment and then deployed to a real robot, is applicable to various in-hand object rotation tasks. Comprehensive ablations are performed on how the integration of vision and touch can improve reinforcement learning and Sim2Real performance. Our project page is available at https://yingyuan0414.github.io/visuotactile/ .
    摘要 执行需要丰富接触的操作任务时,必须融合触觉和视觉反馈。然而,这两种模态的本质差异带来了很大的挑战。在这篇论文中,我们介绍了一种利用视觉和触觉输入实现灵巧手内操作的系统。具体来说,我们提出了Robot Synesthesia(机器人联觉),这是一种受人类视触联觉启发的、基于点云的新型触觉表示。该方法允许同时无缝地整合两种感官输入,提供更充足的空间信息,使机器人更好地推理自己的动作。该方法在仿真环境中训练后部署到真实机器人上,适用于多种手内物体旋转任务。我们进行了详细的消融实验,证明视觉与触觉的融合可以提升强化学习和Sim2Real性能。项目页面可在 https://yingyuan0414.github.io/visuotactile/ 查看。

Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.01850
  • repo_url: None
  • paper_authors: Joshua Niemeijer, Manuel Schwonberg, Jan-Aike Termöhlen, Nico M. Schmidt, Tim Fingscheidt
  • for: 这篇论文是为了解决模型在对不同于训练数据的图像进行应用时,表现下降的问题。
  • methods: 这篇论文提出了基于扩散模型的领域扩展技术(DIDEX),利用扩散模型和多样化的文本提示生成一个伪目标领域,从而控制生成图像的风格和内容并引入高度多样性。
  • results: 实验结果显示,通过向这个伪目标领域进行适应来训练泛化模型,可以在不使用任何真实数据的情况下大幅提升泛化性能。特别是从GTA5和SYNTHIA出发的泛化实验中,与先前方法相比,mIoU分别平均提升3.8%和11.8%。代码可以在 https://github.com/JNiemeijer/DIDEX 上找到。
    Abstract When models, e.g., for semantic segmentation, are applied to images that are vastly different from training data, the performance will drop significantly. Domain adaptation methods try to overcome this issue, but need samples from the target domain. However, this might not always be feasible for various reasons and therefore domain generalization methods are useful as they do not require any target data. We present a new diffusion-based domain extension (DIDEX) method and employ a diffusion model to generate a pseudo-target domain with diverse text prompts. In contrast to existing methods, this allows to control the style and content of the generated images and to introduce a high diversity. In a second step, we train a generalizing model by adapting towards this pseudo-target domain. We outperform previous approaches by a large margin across various datasets and architectures without using any real data. For the generalization from GTA5, we improve state-of-the-art mIoU performance by 3.8% absolute on average and for SYNTHIA by 11.8% absolute, marking a big step for the generalization performance on these benchmarks. Code is available at https://github.com/JNiemeijer/DIDEX
    摘要 当模型(例如语义分割模型)应用于与训练数据差异很大的图像时,性能会大幅下降。领域自适应方法可以缓解这个问题,但它们需要目标领域的样本,而这在很多情况下并不可行,因此不需要任何目标数据的领域泛化方法十分有用。我们提出了一种新的基于扩散的领域扩展(DIDEX)方法,利用扩散模型和多样化的文本提示生成一个伪目标领域。与现有方法不同,这允许我们控制生成图像的风格和内容,并引入高度多样性。在第二步中,我们通过向这个伪目标领域进行适应来训练泛化模型。在不使用任何真实数据的情况下,我们在多种数据集和架构上大幅超越了先前的方法:从GTA5出发的泛化中,我们将最先进的mIoU性能平均提升了3.8%(绝对值),从SYNTHIA出发提升了11.8%(绝对值),在这些基准的泛化性能上迈出了一大步。代码可以在 https://github.com/JNiemeijer/DIDEX 上找到。

Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding

  • paper_url: http://arxiv.org/abs/2312.02244
  • repo_url: None
  • paper_authors: Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi
  • for: Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs).
  • methods: Our approach introduces the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred VLMs. It operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning.
  • results: Our approach achieves new state-of-the-art results in all benchmarks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world, and indoor/outdoor scenarios.
    Abstract Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map Vision-Language Models from 2D pixels of rendered or captured views to 3D points, overlooking the inherent and expressible point cloud geometric structure. Geometrically similar or close regions can be exploited for bolstering point cloud understanding as they are likely to share semantic information. To this end, we introduce the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred Vision-Language Models. Our approach operates iteratively, performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks, including classification, part segmentation, and semantic segmentation, with a variety of datasets representing both synthetic/real-world, and indoor/outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks. We will release the source code publicly.
    摘要 零样本3D点云理解可以通过2D视觉语言模型(VLM)实现。现有策略直接将视觉语言模型从渲染或采集视图的2D像素映射到3D点,忽视了点云固有且可利用的3D几何结构。几何上相似或邻近的区域很可能共享语义信息,因此可以被用来增强点云理解。为此,我们提出了首个无需训练的聚合技术,利用点云的3D几何结构来提高迁移后视觉语言模型的质量。我们的方法以迭代方式运行,基于几何与语义的点级推理进行由局部到全局的聚合。我们在分类、部件分割和语义分割三个下游任务上进行了测试,所用数据集涵盖合成/真实世界以及室内/室外等多种场景。我们的方法在所有基准上都取得了新的最先进结果。我们将公开发布源代码。
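
A single local aggregation step of the kind described above might look like the following: per-point features transferred from the 2D VLM are smoothed over geometric k-NN neighborhoods, weighted by semantic affinity. The iterative local-to-global schedule and the exact weighting of the paper are omitted; k, the temperature, and the dense pairwise distance matrix are simplifying assumptions.

```python
# One local aggregation step as a simplified sketch: smooth per-point features
# transferred from a 2D VLM over geometric k-NN neighborhoods, weighted by the
# semantic similarity between a point and each of its neighbours.
import torch

def local_aggregation(xyz, feats, k=16, tau=0.1):
    # xyz: (N, 3) point coordinates, feats: (N, D) per-point VLM features.
    dist = torch.cdist(xyz, xyz)                                 # (N, N); fine for small N
    knn_idx = dist.topk(k + 1, largest=False).indices[:, 1:]     # exclude the point itself
    neigh = feats[knn_idx]                                       # (N, k, D)
    f = torch.nn.functional.normalize(feats, dim=-1)
    n = torch.nn.functional.normalize(neigh, dim=-1)
    # Semantic affinity between each point and its geometric neighbours.
    w = torch.softmax((n @ f.unsqueeze(-1)).squeeze(-1) / tau, dim=-1)   # (N, k)
    return (w.unsqueeze(-1) * neigh).sum(dim=1)                  # aggregated features, (N, D)
```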

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

  • paper_url: http://arxiv.org/abs/2312.01841
  • repo_url: https://github.com/HumanAIGC/VividTalk
  • paper_authors: Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao
  • for: 提出一种两stage generic框架,用于生成高视质量的说话头视频,包括各种表情和自然的头部动作。
  • methods: 将运动学习分为非刚性表情运动和刚性头部运动两部分,并采用新的可学习头部姿态codebook和两阶段训练机制。
  • results: 对比前一代模型,提出的 VividTalk 能够生成高视质量的说话头视频,同时具有各种表情和自然的头部动作,并且在对比中得到了明显的提升。
    Abstract Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that supports generating high-visual quality talking head videos with all the above properties. Specifically, in the first stage, we map the audio to mesh by learning two motions, including non-rigid expression motion and rigid head motion. For expression motion, both blendshape and vertex are adopted as the intermediate representation to maximize the representation ability of the model. For natural head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we proposed a dual branch motion-vae and a generator to transform the meshes into dense motion and synthesize high-quality video frame-by-frame. Extensive experiments show that the proposed VividTalk can generate high-visual quality talking head videos with lip-sync and realistic enhanced by a large margin, and outperforms previous state-of-the-art works in objective and subjective comparisons.
    摘要 音频驱动的说话人头像生成近年来引起了广泛关注,研究者在唇形同步、富有表现力的面部表情、自然的头部姿态生成以及高视频质量等方面做了大量工作。然而,由于音频与运动之间存在一对多的映射,目前还没有任何模型能在所有这些指标上领先或持平。在这篇论文中,我们提出了VividTalk,一个两阶段的通用框架,能够生成兼具上述所有特性的高视觉质量说话人头像视频。具体来说,在第一阶段,我们通过学习两种运动(非刚性表情运动和刚性头部运动)将音频映射到网格。对于表情运动,我们同时采用blendshape和顶点作为中间表示,以最大化模型的表达能力;对于自然的头部运动,我们提出了一种新的可学习头部姿态codebook以及两阶段训练机制。在第二阶段,我们提出了双分支motion-VAE和生成器,将网格转换为稠密运动并逐帧合成高质量视频。大量实验表明,VividTalk能够生成唇形同步且逼真的高视觉质量说话人头像视频,并在客观和主观比较中显著优于先前的最先进工作。

Few Clicks Suffice: Active Test-Time Adaptation for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.01835
  • repo_url: None
  • paper_authors: Longhui Yuan, Shuang Li, Zhuo He, Binhui Xie
  • for: 这个研究的目的是提出一种在测试阶段引入人机协同的主动测试时自适应方法,以提升语义分割模型的性能。
  • methods: 这个方法借鉴主动学习的思想,在测试阶段以在线方式查询极少量标注,对测试数据进行适应,从而提高模型的性能。
  • results: 实验结果显示,这个方法仅需极少的标注量即可取得与有监督方法相当的平均交并比(mIoU);在ACDC基准上,即使每张图像仅需一次点击标注,也能超越已知SOTA TTA方法的平均mIoU表现。
    Abstract Test-time adaptation (TTA) adapts the pre-trained models during inference using unlabeled test data and has received a lot of research attention due to its potential practical value. Unfortunately, without any label supervision, existing TTA methods rely heavily on heuristic or empirical studies. Where to update the model always falls into suboptimal or brings more computational resource consumption. Meanwhile, there is still a significant performance gap between the TTA approaches and their supervised counterparts. Motivated by active learning, in this work, we propose the active test-time adaptation for semantic segmentation setup. Specifically, we introduce the human-in-the-loop pattern during the testing phase, which queries very few labels to facilitate predictions and model updates in an online manner. To do so, we propose a simple but effective ATASeg framework, which consists of two parts, i.e., model adapter and label annotator. Extensive experiments demonstrate that ATASeg bridges the performance gap between TTA methods and their supervised counterparts with only extremely few annotations, even one click for labeling surpasses known SOTA TTA methods by 2.6% average mIoU on ACDC benchmark. Empirical results imply that progress in either the model adapter or the label annotator will bring improvements to the ATASeg framework, giving it large research and reality potential.
    摘要 测试时自适应(TTA)在推理阶段使用无标注测试数据来适应预训练模型,由于其潜在的实用价值而受到大量研究关注。然而,在没有任何标签监督的情况下,现有的TTA方法严重依赖启发式或经验性研究,模型更新的位置往往不够理想,或带来更多的计算资源消耗。同时,TTA方法与其有监督版本之间仍存在明显的性能差距。受主动学习的启发,本文提出了面向语义分割的主动测试时自适应设置。具体来说,我们在测试阶段引入人机协同模式,以在线方式查询极少量标签来辅助预测和模型更新。为此,我们提出了简单而有效的ATASeg框架,它由模型适应器和标签标注器两部分组成。大量实验表明,ATASeg仅需极少标注即可弥合TTA方法与其有监督版本之间的性能差距;在ACDC基准上,即使每张图像仅需一次点击标注,也能以2.6%的平均mIoU超越已知的SOTA TTA方法。实验结果还表明,模型适应器或标签标注器任一方面的进步都会带来ATASeg框架的提升,使其具有很大的研究和应用潜力。
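
One step of such an active test-time adaptation loop can be sketched as: predict, query labels for a tiny budget of the most uncertain pixels (standing in for user clicks), and take one supervised update. The entropy-based selection rule and the per-image budget are assumptions; ATASeg's actual model adapter and label annotator designs are more elaborate.

```python
# Simplified sketch of an active test-time adaptation step for segmentation.
# The entropy-based pixel selection and the click budget are illustrative assumptions.
import torch
import torch.nn.functional as F

def active_tta_step(model, optimizer, image, oracle_labels, budget=16):
    logits = model(image)                                           # (1, C, H, W)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)     # (1, H, W)
    # Query the `budget` most uncertain pixels from the (simulated) human annotator.
    idx = entropy.flatten().topk(budget).indices
    target = oracle_labels.flatten()[idx]                           # labels only for queried pixels
    pred = logits.permute(0, 2, 3, 1).reshape(-1, logits.shape[1])[idx]
    loss = F.cross_entropy(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.argmax(dim=1)                                     # updated prediction for this image
```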

Equivariant plug-and-play image reconstruction

  • paper_url: http://arxiv.org/abs/2312.01831
  • repo_url: None
  • paper_authors: Matthieu Terris, Thomas Moreau, Nelly Pustelnik, Julian Tachella
  • for: 解决 inverse imaging 问题,使用 implicit definition of an image prior via a denoiser
  • methods: 使用 plug-and-play 算法,但是这些算法可能会显示不稳定的行为,影响 reconstruction 的质量
  • results: 通过 enforcing equivariance to certain groups of transformations (rotations, reflections, and/or translations) on the denoiser,可以提高算法的稳定性和 reconstruction 的质量,并提供了一种简单的算法来实现这一点
    Abstract Plug-and-play algorithms constitute a popular framework for solving inverse imaging problems that rely on the implicit definition of an image prior via a denoiser. These algorithms can leverage powerful pre-trained denoisers to solve a wide range of imaging tasks, circumventing the necessity to train models on a per-task basis. Unfortunately, plug-and-play methods often show unstable behaviors, hampering their promise of versatility and leading to suboptimal quality of reconstructed images. In this work, we show that enforcing equivariance to certain groups of transformations (rotations, reflections, and/or translations) on the denoiser strongly improves the stability of the algorithm as well as its reconstruction quality. We provide a theoretical analysis that illustrates the role of equivariance on better performance and stability. We present a simple algorithm that enforces equivariance on any existing denoiser by simply applying a random transformation to the input of the denoiser and the inverse transformation to the output at each iteration of the algorithm. Experiments on multiple imaging modalities and denoising networks show that the equivariant plug-and-play algorithm improves both the reconstruction performance and the stability compared to their non-equivariant counterparts.
    摘要 即插即用(plug-and-play)算法是求解逆成像问题的一类流行框架,它通过去噪器隐式地定义图像先验。这些算法可以利用强大的预训练去噪器来解决各类成像任务,免除了为每个任务单独训练模型的需要。然而,即插即用方法经常表现出不稳定的行为,削弱了其通用性的承诺,并导致重建图像质量欠佳。在这项工作中,我们证明了对去噪器施加对某些变换群(旋转、反射和/或平移)的等变性约束,可以显著提高算法的稳定性及其重建质量。我们给出了理论分析,说明等变性对性能和稳定性的作用。我们还提出了一种简单的算法,只需在算法的每次迭代中对去噪器的输入施加随机变换、并对其输出施加相应的逆变换,即可让任何现有的去噪器具有等变性。在多种成像模态和去噪网络上的实验表明,等变的即插即用算法相比非等变版本在重建性能和稳定性上均有提升。
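
The equivariance-enforcing wrapper described in the abstract is simple enough to sketch directly: draw a random group element, transform the denoiser input, and invert the transform on its output. The sketch below uses the group of 90-degree rotations and horizontal flips as an example choice of transformation group.

```python
# Sketch of the equivariant denoiser wrapper: random transform in, inverse transform out.
# The specific group (90-degree rotations + horizontal flips) is an example choice.
import random
import torch

def equivariant_denoise(denoiser, x):
    k = random.randint(0, 3)                    # rotation by k * 90 degrees
    flip = random.random() < 0.5
    y = torch.rot90(x, k, dims=(-2, -1))
    if flip:
        y = torch.flip(y, dims=(-1,))
    y = denoiser(y)
    if flip:                                    # invert the transforms in reverse order
        y = torch.flip(y, dims=(-1,))
    return torch.rot90(y, -k, dims=(-2, -1))

# Example PnP-style iteration (the data-fidelity gradient `grad_f` is problem-specific):
#   x = equivariant_denoise(denoiser, x - step * grad_f(x))
```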

Collaborative Neural Painting

  • paper_url: http://arxiv.org/abs/2312.01800
  • repo_url: None
  • paper_authors: Nicola Dall’Asen, Willi Menapace, Elia Peruzzo, Enver Sangineto, Yiming Wang, Elisa Ricci
  • for: 这篇论文旨在提出一种新的艺术生成任务,即与人类合作的神经笔画(CNP),以促进人机共同创作绘画。
  • methods: 该任务使用一种基于参数化笔划序列的绘画表示,并在 transformer 架构中引入新的注意力机制来建模输入笔划与待补全笔划之间的关系。此外,还提出了一种新的掩码方案来反映CNP的交互特性,并采用扩散模型作为基础学习过程。
  • results: 作者们提出了一种新的评价协议来评价 CNP 任务,并在一个新的绘画对象集上进行了验证。结果表明,该方法可以生成具有艺术性和合理性的绘画作品,并且可以在不同的绘画风格和主题下进行可编辑和可复制的创作。
    Abstract The process of painting fosters creativity and rational planning. However, existing generative AI mostly focuses on producing visually pleasant artworks, without emphasizing the painting process. We introduce a novel task, Collaborative Neural Painting (CNP), to facilitate collaborative art painting generation between humans and machines. Given any number of user-input brushstrokes as the context or just the desired object class, CNP should produce a sequence of strokes supporting the completion of a coherent painting. Importantly, the process can be gradual and iterative, so allowing users' modifications at any phase until the completion. Moreover, we propose to solve this task using a painting representation based on a sequence of parametrized strokes, which makes it easy both editing and composition operations. These parametrized strokes are processed by a Transformer-based architecture with a novel attention mechanism to model the relationship between the input strokes and the strokes to complete. We also propose a new masking scheme to reflect the interactive nature of CNP and adopt diffusion models as the basic learning process for its effectiveness and diversity in the generative field. Finally, to develop and validate methods on the novel task, we introduce a new dataset of painted objects and an evaluation protocol to benchmark CNP both quantitatively and qualitatively. We demonstrate the effectiveness of our approach and the potential of the CNP task as a promising avenue for future research.
    摘要 绘画过程能够激发创造力和理性规划。然而,现有的生成式AI大多专注于产出视觉上令人愉悦的作品,而忽略了绘画过程本身。我们提出了一个新任务——协作神经绘画(CNP),以促进人与机器协作生成绘画。给定任意数量的用户输入笔划作为上下文,或仅给定期望的物体类别,CNP应生成一个笔划序列,支持完成一幅连贯的绘画。重要的是,这个过程可以是渐进且迭代的,允许用户在完成之前的任意阶段进行修改。此外,我们提议采用基于参数化笔划序列的绘画表示,这使得编辑和组合操作都变得容易。这些参数化笔划由基于Transformer的架构处理,其中的新注意力机制用于建模输入笔划与待补全笔划之间的关系。我们还提出了一种新的掩码方案来反映CNP的交互特性,并采用在生成领域兼具有效性与多样性的扩散模型作为基础学习过程。最后,为了在这个新任务上开发和验证方法,我们引入了一个新的绘画物体数据集以及一套定量与定性的评估协议。我们展示了所提方法的有效性,并证明CNP任务是未来研究的一个有前景的方向。

Exploring Multi-Modal Fusion for Image Manipulation Detection and Localization

  • paper_url: http://arxiv.org/abs/2312.01790
  • repo_url: https://github.com/idt-iti/mmfusion-iml
  • paper_authors: Konstantinos Triaridis, Vasileios Mezaris
  • for: 本研究旨在提高图像修改检测和定位技术的精度和效果。
  • methods: 本研究使用了多种常用的噪声敏感滤波器(包括SRM和Bayar卷积)来提取取证痕迹,用于图像篡改的检测与定位。
  • results: 研究发现不同的滤波器擅长揭示不同类型的篡改并提供互补的取证痕迹,将这些滤波器的输出融合可以提高检测和定位的精度。两种融合方式(晚期融合和早期融合)都取得了具有竞争力的表现,在多个数据集上超越了最先进模型。
    Abstract Recent image manipulation localization and detection techniques usually leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM and Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of merging the outputs of such filters and aim to leverage the complementary nature of the artifacts produced to perform image manipulation localization and detection (IMLD). We propose two distinct methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces early combined features (this is referred to as early fusion). We demonstrate that both approaches achieve competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several datasets.
    摘要 近来的图像篡改定位与检测技术通常利用噪声敏感滤波器(如SRM和Bayar卷积)产生的取证伪迹和痕迹。在本文中,我们展示了这类方法中常用的不同滤波器各自擅长揭示不同类型的篡改,并提供互补的取证痕迹。因此,我们探讨了融合这些滤波器输出的方式,旨在利用其产生伪迹的互补性来完成图像篡改定位与检测(IMLD)。我们提出了两种不同的方法:一种先从每个取证滤波器生成独立特征再进行融合(称为晚期融合),另一种对不同模态输出进行早期混合并产生早期联合特征(称为早期融合)。我们证明这两种方法在图像篡改定位与检测上都取得了具有竞争力的性能,在多个数据集上超越了最先进模型。
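
A minimal late-fusion sketch is shown below: two localization branches, each preceded by a different fixed noise-sensitive filter, with their per-pixel logits averaged. The simple high-pass kernels and toy heads are stand-ins for illustration only, not the SRM/Bayar filters or backbones used in the paper.

```python
# Late-fusion sketch: fixed forensic-style filter -> branch head, repeated per filter,
# then average the per-pixel manipulation logits. Kernels and heads are stand-ins.
import torch
import torch.nn as nn

class FilterBranch(nn.Module):
    def __init__(self, kernel):
        super().__init__()
        k = torch.tensor(kernel, dtype=torch.float32).view(1, 1, 3, 3).repeat(3, 1, 1, 1)
        self.filt = nn.Conv2d(3, 3, 3, padding=1, groups=3, bias=False)
        self.filt.weight.data.copy_(k)
        self.filt.weight.requires_grad_(False)                 # keep the filter fixed
        self.head = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, 1, 1))          # per-pixel manipulation logit

    def forward(self, x):
        return self.head(self.filt(x))

laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
edge      = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
branches = nn.ModuleList([FilterBranch(laplacian), FilterBranch(edge)])

def late_fusion(x):
    # Average the per-pixel logits of the filter-specific branches.
    return torch.stack([b(x) for b in branches]).mean(dim=0)

print(late_fusion(torch.rand(2, 3, 64, 64)).shape)   # torch.Size([2, 1, 64, 64])
```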

Two-stage optimized unified adversarial patch for attacking visible-infrared cross-modal detectors in the physical world

  • paper_url: http://arxiv.org/abs/2312.01789
  • repo_url: None
  • paper_authors: Chengyin Hu, Weiwen Shi
  • for: 本研究旨在评估跨模态检测器的安全性,并提出一种实际应用中可行的攻击方法。
  • methods: 本研究采用两阶段优化:先用粒子群优化(PSO)生成不规则多边形红外贴片以攻击红外检测器,再以该贴片的形状为掩码优化彩色QR码,得到攻击可见光检测器的贴片。
  • results: 实验结果表明,提出的攻击方法可以在数字和物理环境中有效地攻击可见光-红外跨模态检测器,且性能超过基线。
    Abstract Currently, many studies have addressed security concerns related to visible and infrared detectors independently. In practical scenarios, utilizing cross-modal detectors for tasks proves more reliable than relying on single-modal detectors. Despite this, there is a lack of comprehensive security evaluations for cross-modal detectors. While existing research has explored the feasibility of attacks against cross-modal detectors, the implementation of a robust attack remains unaddressed. This work introduces the Two-stage Optimized Unified Adversarial Patch (TOUAP) designed for performing attacks against visible-infrared cross-modal detectors in real-world, black-box settings. The TOUAP employs a two-stage optimization process: firstly, PSO optimizes an irregular polygonal infrared patch to attack the infrared detector; secondly, the color QR code is optimized, and the shape information of the infrared patch from the first stage is used as a mask. The resulting irregular polygon visible modal patch executes an attack on the visible detector. Through extensive experiments conducted in both digital and physical environments, we validate the effectiveness and robustness of the proposed method. As the TOUAP surpasses baseline performance, we advocate for its widespread attention.
    摘要 目前,许多研究分别针对可见光探测器和红外探测器的安全问题。在实际场景中,使用跨模态探测器执行任务比依赖单模态探测器更可靠。尽管如此,目前仍缺乏对跨模态探测器的全面安全评估。现有研究虽已探索了针对跨模态探测器进行攻击的可行性,但稳健攻击的实现仍未得到解决。本文提出了两阶段优化统一对抗贴片(TOUAP),用于在真实世界的黑盒设置下攻击可见光-红外跨模态探测器。TOUAP采用两阶段优化过程:首先,使用粒子群优化(PSO)优化不规则多边形红外贴片来攻击红外探测器;其次,优化彩色QR码,并以第一阶段红外贴片的形状信息作为掩码,由此得到的不规则多边形可见光贴片用于攻击可见光探测器。通过在数字和物理环境中进行的大量实验,我们验证了所提方法的有效性和鲁棒性。鉴于TOUAP超越了基线性能,我们呼吁对其给予广泛关注。

IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

  • paper_url: http://arxiv.org/abs/2312.01771
  • repo_url: https://github.com/xvjiarui/IMProv
  • paper_authors: Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang
  • for: 这篇论文的目的是探讨一种基于多Modal prompts的 generative模型,能够在测试时基于文本描述学习视觉任务。
  • methods: 这篇论文使用了一种masked generative transformer,并在一个新的图像-文本数据集上进行了训练。在推理时,使用文本和/或图像任务示例,让模型进行填充输出。
  • results: 研究发现,通过文本控制和扩大数据集大小,可以提高计算机视觉任务的在Context learning性能,包括Foreground Segmentation的AP提高了+10%,Single Object Detection的AP提高了+5%,Colorization的LPIPS下降了大约20%。这些实验结果表明,视觉和语言提示是补做的,使用两者可以实现更好的在Context learning性能。
    Abstract In-context learning allows adapting a model to new tasks given a task description at test time. In this paper, we present IMProv - a generative model that is able to in-context learn visual tasks from multimodal prompts. Given a textual description of a visual task (e.g. "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input. We train a masked generative transformer on a new dataset of figures from computer vision papers and their associated captions, together with a captioned large-scale image-text dataset. During inference time, we prompt the model with text and/or image task example(s) and have the model inpaint the corresponding output. We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10\% AP for Foreground Segmentation, over +5\% gains in AP for Single Object Detection, and almost 20\% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance. Project page is available at https://jerryxu.net/IMProv .
    摘要 上下文学习允许模型在测试时根据任务描述适应新任务。在这篇论文中,我们介绍了IMProv——一种能够从多模态提示中进行上下文学习视觉任务的生成模型。给定一个视觉任务的文本描述(例如"左:输入图像,右:前景分割")、少量输入-输出视觉示例,或两者兼有,模型便能通过上下文学习为新的测试输入求解该任务。我们在一个由计算机视觉论文中的图表及其标题组成的新数据集,以及一个带标题的大规模图像-文本数据集上,训练了一个掩码生成Transformer。在推理时,我们用文本和/或图像任务示例来提示模型,让模型补全相应的输出。我们发现,使用文本条件进行训练并扩大数据集规模,可以提升计算机视觉任务的上下文学习性能:前景分割的AP提升超过10%,单目标检测的AP提升超过5%,着色任务的LPIPS降低近20%。我们的实验结果表明,视觉提示与语言提示是互补的,同时使用两者可以获得更好的上下文学习性能。项目页面:https://jerryxu.net/IMProv 。

Few-Shot Anomaly Detection with Adversarial Loss for Robust Feature Representations

  • paper_url: http://arxiv.org/abs/2312.03005
  • repo_url: None
  • paper_authors: Jae Young Lee, Wonjun Lee, Jaehyun Choi, Yongkwi Lee, Young Seog Yoon
  • for: 这篇论文的目的是提出一种少样本异常检测方法,以应对实际应用中(如量产前)可用样本有限的检测问题。
  • methods: 针对一类一模型方法在少样本场景下的局限,论文引入对抗训练损失来获得更稳健、更具泛化性的特征表示,并将其应用于本应具有相似特性的特征(例如孪生网络并行分支中同一层的特征,或基于重建方法的输入-输出特征对)。
  • results: 实验结果显示,提出的方法在少样本异常检测任务中使用对抗训练损失时通常能获得更好的性能。
    Abstract Anomaly detection is a critical and challenging task that aims to identify data points deviating from normal patterns and distributions within a dataset. Various methods have been proposed using a one-class-one-model approach, but these techniques often face practical problems such as memory inefficiency and the requirement of sufficient data for training. In particular, few-shot anomaly detection presents significant challenges in industrial applications, where limited samples are available before mass production. In this paper, we propose a few-shot anomaly detection method that integrates adversarial training loss to obtain more robust and generalized feature representations. We utilize the adversarial loss previously employed in domain adaptation to align feature distributions between source and target domains, to enhance feature robustness and generalization in few-shot anomaly detection tasks. We hypothesize that adversarial loss is effective when applied to features that should have similar characteristics, such as those from the same layer in a Siamese network's parallel branches or input-output pairs of reconstruction-based methods. Experimental results demonstrate that the proposed method generally achieves better performance when utilizing the adversarial loss.
    摘要 异常检测是一项关键且具有挑战性的任务,旨在识别数据集中偏离正常模式和分布的数据点。已有多种采用一类一模型思路的方法被提出,但这些技术常面临内存效率低、需要足够训练数据等实际问题。特别是在工业应用中,量产前可用样本有限,少样本异常检测面临重大挑战。在这篇论文中,我们提出了一种融合对抗训练损失的少样本异常检测方法,以获得更稳健、更具泛化性的特征表示。我们利用先前用于领域自适应中对齐源域与目标域特征分布的对抗损失,来增强少样本异常检测任务中特征的鲁棒性与泛化性。我们假设,当对抗损失应用于本应具有相似特性的特征(例如孪生网络并行分支中同一层的特征,或基于重建方法的输入-输出特征对)时最为有效。实验结果表明,所提方法在使用对抗损失时通常能取得更好的性能。

Localizing and Assessing Node Significance in Default Mode Network using Sub-Community Detection in Mild Cognitive Impairment

  • paper_url: http://arxiv.org/abs/2312.01768
  • repo_url: None
  • paper_authors: Ameiy Acharya, Chakka Sai Pradeep, Neelam Sinha
  • for: 本研究使用fMRI技术,借助一种新的节点显著性得分(Node Significance Score, NSS),识别轻度认知障碍(MCI)患者默认模式网络(DMN)中受影响的脑区。
  • methods: 研究人员使用 partial correlation 技术构建每个参与者的DMN图,并应用四种流行的社区探测算法(Clique Percolation Method(CPM)、Louvain算法、Greedy Modularity和Leading Eigenvectors)来确定最大子社区。然后,对每个ROI在所有参与者中计算NSS分数,考虑(I)在最大子社区内的频率,以及(II)根据所有四种方法的最大值。
  • results: 计算NSS分数后,研究人员发现10个DMN节点在健康与MCI组之间的分数差超过20%,其中PCC和Fusiform的差异最大,分别达到45.69%和43.08%。这与现有医学文献相一致,同时提供了一个可对受影响脑区排序的量化度量,可能为诊断和治疗提供有价值的指导。
    Abstract Our study aims to utilize fMRI to identify the affected brain regions within the Default Mode Network (DMN) in subjects with Mild Cognitive Impairment (MCI), using a novel Node Significance Score (NSS). We construct subject-specific DMN graphs by employing partial correlation of Regions of Interest (ROIs) that make-up the DMN. For the DMN graph, ROIs are the nodes and edges are determined based on partial correlation. Four popular community detection algorithms (Clique Percolation Method (CPM), Louvain algorithm, Greedy Modularity and Leading Eigenvectors) are applied to determine the largest sub-community. NSS ratings are derived for each node, considering (I) frequency in the largest sub-community within a class across all subjects and (II) occurrence in the largest sub-community according to all four methods. After computing the NSS of each ROI in both healthy and MCI subjects, we quantify the score disparity to identify nodes most impacted by MCI. The results reveal a disparity exceeding 20% for 10 DMN nodes, maximally for PCC and Fusiform, showing 45.69% and 43.08% disparity. This aligns with existing medical literature, additionally providing a quantitative measure that enables the ordering of the affected ROIs. These findings offer valuable insights and could lead to treatment strategies aggressively targeting the affected nodes.
    摘要 本研究旨在利用fMRI,通过一种新的节点显著性得分(NSS),识别轻度认知障碍(MCI)受试者默认模式网络(DMN)中受影响的脑区。我们利用构成DMN的感兴趣区(ROI)之间的偏相关,为每位受试者构建DMN图:ROI作为节点,边由偏相关确定。随后应用四种常用的社区检测算法(团渗透方法CPM、Louvain算法、贪婪模块度和主特征向量法)来确定最大子社区。每个节点的NSS评分综合考虑:(I)在同一类别所有受试者中进入最大子社区的频率;(II)按四种方法均进入最大子社区的情况。在计算健康组与MCI组中每个ROI的NSS后,我们量化两组得分的差异,以识别受MCI影响最大的节点。结果显示有10个DMN节点的差异超过20%,其中PCC和Fusiform最大,分别为45.69%和43.08%。这与现有医学文献一致,并额外提供了可对受影响ROI进行排序的量化度量。这些发现提供了有价值的见解,并可能推动针对受影响节点的治疗策略。
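
A per-subject version of this pipeline can be sketched as follows: estimate partial correlations via the inverse covariance matrix, threshold them into a graph, and count how often each ROI falls in the largest detected sub-community. Using only two of the four community detectors, the edge threshold, and the pinv-based partial correlation are simplifying assumptions; `louvain_communities` needs a reasonably recent networkx.

```python
# Sketch of the per-subject NSS-style pipeline: partial-correlation DMN graph, largest
# sub-community via two community detectors, and a membership count per ROI.
import numpy as np
import networkx as nx
from networkx.algorithms import community as nx_comm

def partial_correlation(ts):                       # ts: (timepoints, n_rois)
    prec = np.linalg.pinv(np.cov(ts, rowvar=False))
    d = np.sqrt(np.diag(prec))
    return -prec / np.outer(d, d)

def largest_subcommunity_counts(ts, threshold=0.2):
    pc = partial_correlation(ts)
    n = pc.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pc[i, j]) > threshold:          # edge if the partial correlation is strong
                G.add_edge(i, j, weight=abs(pc[i, j]))
    counts = np.zeros(n)
    for detect in (nx_comm.greedy_modularity_communities, nx_comm.louvain_communities):
        biggest = max(detect(G), key=len)
        for roi in biggest:
            counts[roi] += 1                       # membership in the largest sub-community
    return counts

# NSS per ROI is then obtained by accumulating these counts across subjects of a class
# and normalizing; the healthy-vs-MCI disparity is the difference of the two scores.
```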

Dynamic Erasing Network Based on Multi-Scale Temporal Features for Weakly Supervised Video Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.01764
  • repo_url: https://github.com/arielzc/de-net
  • paper_authors: Chen Zhang, Guorong Li, Yuankai Qi, Hanhua Ye, Laiyun Qing, Ming-Hsuan Yang, Qingming Huang
  • for: 这个研究的目的是为弱监督视频异常检测学习一个检测模型,使其仅使用视频级标注数据进行训练。
  • methods: 我们提出了一个名为 Dynamic Erasing Network(DE-Net)的方法,用于学习多尺度时间特征。在这个方法中,我们首先提出了一个多尺度时间建模模块,可以从不同长度的片段中提取特征,并在不同时间尺度上捕捉局部与全局的视觉信息。然后,我们设计了一个动态擦除策略,动态评估已检测异常的完整性,并擦除显著的异常片段,以促使模型发现较不明显的异常片段。
  • results: 我们的方法在三个数据集上(XD-Violence、TAD和UCF-Crime)取得了良好的性能,与一些现有的方法相比。
    Abstract The goal of weakly supervised video anomaly detection is to learn a detection model using only video-level labeled data. However, prior studies typically divide videos into fixed-length segments without considering the complexity or duration of anomalies. Moreover, these studies usually just detect the most abnormal segments, potentially overlooking the completeness of anomalies. To address these limitations, we propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection, which learns multi-scale temporal features. Specifically, to handle duration variations of abnormal events, we first propose a multi-scale temporal modeling module, capable of extracting features from segments of varying lengths and capturing both local and global visual information across different temporal scales. Then, we design a dynamic erasing strategy, which dynamically assesses the completeness of the detected anomalies and erases prominent abnormal segments in order to encourage the model to discover gentle abnormal segments in a video. The proposed method obtains favorable performance compared to several state-of-the-art approaches on three datasets: XD-Violence, TAD, and UCF-Crime. Code will be made available at https://github.com/ArielZc/DE-Net.
    摘要 弱监督视频异常检测的目标是仅使用视频级标注数据来学习检测模型。然而,先前的研究通常将视频划分为固定长度的片段,而不考虑异常事件的复杂性或持续时间;并且往往只检测最异常的片段,可能忽略异常的完整性。为了解决这些限制,我们提出了用于弱监督视频异常检测的动态擦除网络(DE-Net),以学习多尺度时间特征。具体来说,为了应对异常事件持续时间的变化,我们首先提出了多尺度时间建模模块,能够从不同长度的片段中提取特征,并在不同时间尺度上捕捉局部与全局的视觉信息。随后,我们设计了动态擦除策略,动态评估已检测异常的完整性,并擦除显著的异常片段,以鼓励模型发现视频中较不明显的异常片段。所提方法在XD-Violence、TAD和UCF-Crime三个数据集上取得了优于多种最先进方法的良好性能。代码将在 https://github.com/ArielZc/DE-Net 公开。

Light Field Imaging in the Restrictive Object Space based on Flexible Angular Plane

  • paper_url: http://arxiv.org/abs/2312.01761
  • repo_url: None
  • paper_authors: Ping Zhou, Nuo Chen, Yuda Xu, Chengcai Xu
  • for: 这篇论文主要针对 restrictive object space (ROS) 中的光场图像系统,描述了ROS中光场图像的扭曲问题,以及解决这个问题的方法。
  • methods: 该论文提出了一种flexible angular plane方法,以及一种microlens image non-distortion principle,用于解决ROS中光场图像的扭曲问题。
  • results: 该论文通过设计了一个ROS-LF simulate系统,并对其进行了Calibration,验证了在ROS中光场图像的扭曲问题可以通过flexible angular plane和microlens image non-distortion principle来解决。
    Abstract In some applications, the object space of light field imaging system is restrictive, such as industrial and medical endoscopes. If the traditional light field imaging system is used in the restrictive object space (ROS) directly but without any specific considerations, the ROS will lead to severe microlens image distortions and then affects light field decoding, calibration and 3D reconstruction. The light field imaging in restrictive object space (ROS-LF) is complicated but significant. In this paper, we first deduce that the reason of the microlens image deviation is the position variation of the angular plane, then we propose the flexible angular plane for ROS-LF, while in the traditional light field the angular plane always coincides with the main lens plane. Subsequently, we propose the microlens image non-distortion principle for ROS-LF and introduce the ROS-LF imaging principle. We demonstrate that the difference is an aperture constant term between the ROS-LF and traditional light field imaging models. At last, we design a ROS-LF simulated system and calibrate it to verify principles proposed in this paper.
    摘要 在一些应用中,光场成像系统的物方空间是受限的,例如工业和医疗内窥镜。如果不做任何针对性考虑而直接在受限物方空间(ROS)中使用传统光场成像系统,ROS将导致严重的微透镜图像畸变,进而影响光场解码、标定和3D重建。受限物方空间中的光场成像(ROS-LF)复杂但意义重大。在本文中,我们首先推导出微透镜图像偏移的原因是角平面位置的变化,进而为ROS-LF提出了灵活角平面,而在传统光场中角平面始终与主透镜平面重合。随后,我们提出了ROS-LF的微透镜图像无畸变原则,并介绍了ROS-LF成像原理。我们证明ROS-LF与传统光场成像模型之间的差异仅在于一个孔径常数项。最后,我们设计了一套ROS-LF仿真系统并对其进行标定,以验证本文提出的原理。

Long-Tail Learning with Rebalanced Contrastive Loss

  • paper_url: http://arxiv.org/abs/2312.01753
  • repo_url: None
  • paper_authors: Charika De Alvis, Dishanika Denipitiyage, Suranga Seneviratne
  • for: 提高长尾类别准确率
  • methods: 在具有权重平衡的Balanced Contrastive Learning(BCL)框架中结合有监督对比损失,并引入基于类频率的SoftMax损失平衡
  • results: 从三个主要方面提升长尾分类准确率:1. 特征空间均衡,2. 类内紧凑性,3. 正则化(为尾部类施加更大间隔以减少过拟合)。实验结果表明,RCL可以提供更丰富的特征表示并提高BCL框架的top-1准确率;同时,RCL作为独立损失函数也能达到state-of-the-art级别的准确率。
    Abstract Integrating supervised contrastive loss to cross entropy-based communication has recently been proposed as a solution to address the long-tail learning problem. However, when the class imbalance ratio is high, it requires adjusting the supervised contrastive loss to support the tail classes, as the conventional contrastive learning is biased towards head classes by default. To this end, we present Rebalanced Contrastive Learning (RCL), an efficient means to increase the long tail classification accuracy by addressing three main aspects: 1. Feature space balancedness - Equal division of the feature space among all the classes, 2. Intra-Class compactness - Reducing the distance between same-class embeddings, 3. Regularization - Enforcing larger margins for tail classes to reduce overfitting. RCL adopts class frequency-based SoftMax loss balancing to supervised contrastive learning loss and exploits scalar multiplied features fed to the contrastive learning loss to enforce compactness. We implement RCL on the Balanced Contrastive Learning (BCL) Framework, which has the SOTA performance. Our experiments on three benchmark datasets demonstrate the richness of the learnt embeddings and increased top-1 balanced accuracy RCL provides to the BCL framework. We further demonstrate that the performance of RCL as a standalone loss also achieves state-of-the-art level accuracy.
    摘要 近来,有研究提出将有监督对比损失与基于交叉熵的损失相结合,以解决长尾学习问题。然而,当类别不平衡比例很高时,由于传统对比学习默认偏向头部类,需要对有监督对比损失进行调整以支持尾部类。为此,我们提出了再平衡对比学习(RCL),一种通过解决以下三个主要方面来提升长尾分类准确率的高效方法:1. 特征空间均衡——将特征空间在所有类别之间平均划分;2. 类内紧凑性——减小同类嵌入之间的距离;3. 正则化——为尾部类施加更大的间隔以减少过拟合。RCL将基于类频率的SoftMax损失平衡引入有监督对比学习损失,并利用标量缩放后的特征输入对比损失来增强紧凑性。我们在具有SOTA性能的Balanced Contrastive Learning(BCL)框架上实现了RCL。在三个基准数据集上的实验表明,RCL能为BCL框架带来更丰富的学习嵌入以及更高的top-1平衡准确率。我们进一步证明,RCL作为独立损失同样可以达到最先进水平的准确率。
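
One plausible reading of the class-frequency-based SoftMax balancing is a balanced-softmax style offset inside a supervised contrastive loss, sketched below: each candidate contrast is offset by the log frequency of its class so head classes stop dominating the denominator. This is an illustrative interpretation, not the paper's exact formulation, and the scalar-multiplied feature trick is omitted.

```python
# Hedged sketch of class-frequency rebalancing inside a supervised contrastive loss.
# Each candidate j contributes with an extra log(n_{y_j}) offset (balanced-softmax style).
import torch
import torch.nn.functional as F

def rebalanced_supcon(feats, labels, class_counts, tau=0.1):
    # feats: (B, D) embeddings, labels: (B,) long, class_counts: (C,) training frequencies.
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.T / tau                                   # (B, B) similarity logits
    prior = torch.log(class_counts.float())[labels]               # (B,) log n_y per sample
    logits = sim + prior.unsqueeze(0)                             # offset each column by its class prior
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    logits = logits.masked_fill(self_mask, float('-inf'))
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = torch.where(pos, log_prob, torch.zeros_like(log_prob)).sum(dim=1)
    pos_counts = pos.sum(dim=1)
    valid = pos_counts > 0                                        # anchors with at least one positive
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()
```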

Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation

  • paper_url: http://arxiv.org/abs/2312.01746
  • repo_url: https://github.com/dqiaole/flowdiffusion_pytorch
  • paper_authors: Qiaole Dong, Bo Zhao, Yanwei Fu
  • for: 这篇论文旨在复现并开源DDVM——一种无需RAFT等专门设计、即可用于光流估计的通用图像到图像扩散模型。
  • methods: 该论文复现了DDVM,研究了多种设计选择并找出其中的关键因素,从而实现开源的DDVM模型。
  • results: 仅使用4块GPU在4万条公开数据上训练,复现模型即达到了与闭源DDVM相当的性能。代码和模型已在 https://github.com/DQiaole/FlowDiffusion_pytorch 发布。
    Abstract Recently, Google proposes DDVM which for the first time demonstrates that a general diffusion model for image-to-image translation task works impressively well on optical flow estimation task without any specific designs like RAFT. However, DDVM is still a closed-source model with the expensive and private Palette-style pretraining. In this technical report, we present the first open-source DDVM by reproducing it. We study several design choices and find those important ones. By training on 40k public data with 4 GPUs, our reproduction achieves comparable performance to the closed-source DDVM. The code and model have been released in https://github.com/DQiaole/FlowDiffusion_pytorch.
    摘要 最近,Google提出了DDVM,首次证明了用于图像到图像翻译任务的通用扩散模型在光流估计任务上表现出色,而无需RAFT之类的专门设计。然而,DDVM仍是闭源模型,且依赖昂贵的私有Palette式预训练。在这份技术报告中,我们通过复现DDVM,发布了首个开源版本。我们研究了多种设计选择并找出其中的关键因素。仅使用4块GPU在4万条公开数据上训练,我们的复现便达到了与闭源DDVM相当的性能。代码和模型已在 https://github.com/DQiaole/FlowDiffusion_pytorch 发布。

Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval

  • paper_url: http://arxiv.org/abs/2312.01745
  • repo_url: None
  • paper_authors: Dixuan Lin, Yixing Peng, Jingke Meng, Wei-Shi Zheng
  • for: 在文本描述和图像之间建立关联,以实现基于文本的行人重识别。
  • methods: 提出了一种 Cross-Modal Adaptive Dual Association(CADA)方法,包括 Association of text Tokens to image Patches(ATP)和 Association of image Regions to text Attributes(ARA)两部分。
  • results: 实验结果表明,CADA方法可以准确地建立图像和文本之间的双向细粒度关联,性能超过现有方法。
    Abstract Text-to-image person re-identification (ReID) aims to retrieve images of a person based on a given textual description. The key challenge is to learn the relations between detailed information from visual and textual modalities. Existing works focus on learning a latent space to narrow the modality gap and further build local correspondences between two modalities. However, these methods assume that image-to-text and text-to-image associations are modality-agnostic, resulting in suboptimal associations. In this work, we show the discrepancy between image-to-text association and text-to-image association and propose CADA: Cross-Modal Adaptive Dual Association that finely builds bidirectional image-text detailed associations. Our approach features a decoder-based adaptive dual association module that enables full interaction between visual and textual modalities, allowing for bidirectional and adaptive cross-modal correspondence associations. Specifically, the paper proposes a bidirectional association mechanism: Association of text Tokens to image Patches (ATP) and Association of image Regions to text Attributes (ARA). We adaptively model the ATP based on the fact that aggregating cross-modal features based on mistaken associations will lead to feature distortion. For modeling the ARA, since the attributes are typically the first distinguishing cues of a person, we propose to explore the attribute-level association by predicting the masked text phrase using the related image region. Finally, we learn the dual associations between texts and images, and the experimental results demonstrate the superiority of our dual formulation. Codes will be made publicly available.
    摘要 文本到图像行人重识别(ReID)的目标是根据给定的文本描述检索对应行人的图像。其关键挑战在于学习视觉与文本模态中细节信息之间的关系。现有方法侧重学习一个潜在空间来缩小模态差距,并进一步建立两种模态之间的局部对应。然而,这些方法假设图像到文本和文本到图像的关联是模态无关的,导致关联并不理想。在这项工作中,我们揭示了图像到文本关联与文本到图像关联之间的差异,并提出了CADA:跨模态自适应双重关联,用于精细地建立双向的图像-文本细节关联。我们的方法包含一个基于解码器的自适应双重关联模块,使视觉与文本模态能够充分交互,从而实现双向且自适应的跨模态对应关联。具体而言,论文提出了双向关联机制:文本Token到图像Patch的关联(ATP)和图像区域到文本属性的关联(ARA)。基于"按照错误关联聚合跨模态特征会导致特征失真"这一事实,我们对ATP进行自适应建模。在建模ARA时,由于属性通常是区分行人的首要线索,我们提出通过利用相关图像区域预测被掩码的文本短语来探索属性级关联。最后,我们学习文本与图像之间的双重关联,实验结果证明了这种双重建模的优越性。代码将公开发布。

Singular Regularization with Information Bottleneck Improves Model’s Adversarial Robustness

  • paper_url: http://arxiv.org/abs/2312.02237
  • repo_url: None
  • paper_authors: Guanlin Li, Naishan Zheng, Man Zhou, Jie Zhang, Tianwei Zhang
  • for: 本研究旨在填补深度学习模型对抗攻击研究中对对抗信息与扰动分析的缺失,以便更好地理解对抗样本。
  • methods: 本研究使用奇异值分解,将图像分解为多个矩阵,以分析不同攻击下的对抗信息;基于分析结果,提出了一个新模块来正则化对抗信息,并结合信息瓶颈理论从理论上约束中间表示。这种方法可解释,且只需少量额外参数。
  • results: 在两个主流数据集上使用两种流行的模型结构,针对多种对抗攻击进行评估。结果表明,我们的方法可以显著提高鲁棒精度;同时,该方法只需少量额外参数,并可以通过区域忠实性分析进行解释。
    Abstract Adversarial examples are one of the most severe threats to deep learning models. Numerous works have been proposed to study and defend adversarial examples. However, these works lack analysis of adversarial information or perturbation, which cannot reveal the mystery of adversarial examples and lose proper interpretation. In this paper, we aim to fill this gap by studying adversarial information as unstructured noise, which does not have a clear pattern. Specifically, we provide some empirical studies with singular value decomposition, by decomposing images into several matrices, to analyze adversarial information for different attacks. Based on the analysis, we propose a new module to regularize adversarial information and combine information bottleneck theory, which is proposed to theoretically restrict intermediate representations. Therefore, our method is interpretable. Moreover, the fashion of our design is a novel principle that is general and unified. Equipped with our new module, we evaluate two popular model structures on two mainstream datasets with various adversarial attacks. The results indicate that the improvement in robust accuracy is significant. On the other hand, we prove that our method is efficient with only a few additional parameters and able to be explained under regional faithfulness analysis.
    摘要 对抗样本是深度学习模型面临的最严重威胁之一。已有大量工作研究并防御对抗样本,但这些工作缺乏对对抗信息或扰动的分析,因而无法揭示对抗样本的本质,也缺乏恰当的解释。在这篇论文中,我们尝试填补这一空白,将对抗信息视为没有明显模式的非结构化噪声来研究。具体来说,我们利用奇异值分解,将图像分解为多个矩阵,对不同攻击下的对抗信息进行了一些实证分析。基于这些分析,我们提出了一个新模块来正则化对抗信息,并结合信息瓶颈理论,从理论上约束中间表示,因此我们的方法是可解释的。此外,我们的设计思路是一种新颖、通用且统一的原则。借助这个新模块,我们在两个主流数据集上对两种流行的模型结构进行了多种对抗攻击下的评估。结果表明,鲁棒精度得到了显著提升。另一方面,我们证明了我们的方法只需少量额外参数,并且可以通过区域忠实性分析进行解释。
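
The kind of empirical SVD analysis described above can be sketched as: decompose clean and adversarial images channel-by-channel and check how much of the perturbation's spectral change sits in the leading singular values. This covers only the analysis side, not the proposed regularization module or the information-bottleneck objective.

```python
# Sketch of the SVD-based analysis: compare singular spectra of a clean image and its
# adversarial counterpart to see how "unstructured" the perturbation is.
import torch

def singular_spectrum(img):                       # img: (C, H, W) in [0, 1]
    # torch.linalg.svdvals batches over the channel dimension, sorted descending.
    return torch.linalg.svdvals(img)              # (C, min(H, W))

def spectrum_shift(clean, adv, top_k=20):
    s_clean, s_adv = singular_spectrum(clean), singular_spectrum(adv)
    delta = (s_adv - s_clean).abs()
    # Fraction of the spectral change carried by the leading singular values, per channel.
    return delta[..., :top_k].sum(dim=-1) / delta.sum(dim=-1).clamp_min(1e-8)

clean = torch.rand(3, 224, 224)
adv = (clean + 0.03 * torch.randn_like(clean)).clamp(0, 1)   # stand-in perturbation
print(spectrum_shift(clean, adv))
```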

Fully Spiking Denoising Diffusion Implicit Models

  • paper_url: http://arxiv.org/abs/2312.01742
  • repo_url: None
  • paper_authors: Ryo Watanabe, Yusuke Mukuta, Tatsuya Harada
  • for: 这篇论文旨在基于脉冲神经网络(SNN)构建扩散模型,以实现高速、低能耗的生成模型。
  • methods: 该方法利用SNN的高速与低能耗特性,通过突触电流学习(SCL)完全在SNN内完成扩散模型的生成过程。
  • results: 与之前的全脉冲生成模型相比,该方法表现更好,并且能够实现高速和低能耗的生成。
    Abstract Spiking neural networks (SNNs) have garnered considerable attention owing to their ability to run on neuromorphic devices with super-high speeds and remarkable energy efficiencies. SNNs can be used in conventional neural network-based time- and energy-consuming applications. However, research on generative models within SNNs remains limited, despite their advantages. In particular, diffusion models are a powerful class of generative models, whose image generation quality surpass that of the other generative models, such as GANs. However, diffusion models are characterized by high computational costs and long inference times owing to their iterative denoising feature. Therefore, we propose a novel approach fully spiking denoising diffusion implicit model (FSDDIM) to construct a diffusion model within SNNs and leverage the high speed and low energy consumption features of SNNs via synaptic current learning (SCL). SCL fills the gap in that diffusion models use a neural network to estimate real-valued parameters of a predefined probabilistic distribution, whereas SNNs output binary spike trains. The SCL enables us to complete the entire generative process of diffusion models exclusively using SNNs. We demonstrate that the proposed method outperforms the state-of-the-art fully spiking generative model.
    摘要 脉冲神经网络(SNN)因能够在神经形态器件上以极高速度和出色的能效运行而受到广泛关注,可用于替代传统神经网络中耗时耗能的应用。然而,尽管具有这些优势,SNN上的生成模型研究仍然有限。特别地,扩散模型是一类强大的生成模型,其图像生成质量超过GAN等其他生成模型,但由于其迭代去噪的特性,扩散模型具有高计算成本和较长的推理时间。因此,我们提出了一种全脉冲去噪扩散隐式模型(FSDDIM),在SNN内构建扩散模型,并通过突触电流学习(SCL)利用SNN高速、低能耗的特性。SCL弥补了如下差距:扩散模型需要用神经网络估计预定义概率分布的实值参数,而SNN输出的是二值脉冲序列。SCL使我们能够完全使用SNN完成扩散模型的整个生成过程。我们证明所提方法超越了现有最先进的全脉冲生成模型。

SRSNetwork: Siamese Reconstruction-Segmentation Networks based on Dynamic-Parameter Convolution

  • paper_url: http://arxiv.org/abs/2312.01741
  • repo_url: https://github.com/fidshu/srsnet
  • paper_authors: Bingkun Nian, Fenghe Tang, Jianrui Ding, Pingping Zhang, Jie Yang, S. Kevin Zhou, Wei Liu
  • for: 这篇论文旨在提出一种用于弱目标图像分割的高性能深度神经网络,适用于医学图像分割和红外图像分割。
  • methods: 该论文分析了现有的动态卷积并提出动态参数卷积(DPConv),并从DPConv的角度重新审视重建任务与分割任务之间的关系,提出了一种名为孪生重建-分割网络(SRSNet)的双网络模型。
  • results: 在七个数据集(包括五个医学数据集和两个红外图像数据集)上,我们的SRSNet始终取得了最佳的分割结果。
    Abstract In this paper, we present a high-performance deep neural network for weak target image segmentation, including medical image segmentation and infrared image segmentation. To this end, this work analyzes the existing dynamic convolutions and proposes dynamic parameter convolution (DPConv). Furthermore, it reevaluates the relationship between reconstruction tasks and segmentation tasks from the perspective of DPConv, leading to the proposal of a dual-network model called the Siamese Reconstruction-Segmentation Network (SRSNet). The proposed model is not only a universal network but also enhances the segmentation performance without altering its structure, leveraging the reconstruction task. Additionally, as the amount of training data for the reconstruction network increases, the performance of the segmentation network also improves synchronously. On seven datasets including five medical datasets and two infrared image datasets, our SRSNet consistently achieves the best segmentation results. The code is released at https://github.com/fidshu/SRSNet.
    摘要 在这篇论文中,我们提出了一种用于弱目标图像分割的高性能深度神经网络,适用于医学图像分割和红外图像分割。为此,本工作分析了现有的动态卷积,并提出了动态参数卷积(DPConv)。此外,我们从DPConv的角度重新审视了重建任务与分割任务之间的关系,进而提出了一种名为孪生重建-分割网络(SRSNet)的双网络模型。所提出的模型不仅是一个通用网络,而且能够借助重建任务在不改变结构的情况下提升分割性能。此外,随着重建网络训练数据量的增加,分割网络的性能也会同步提升。在七个数据集(包括五个医学数据集和两个红外图像数据集)上,我们的SRSNet始终取得了最佳的分割结果。代码发布于 https://github.com/fidshu/SRSNet 。
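
A minimal dynamic-parameter convolution, in the spirit of DPConv, can be sketched as a small hypernetwork that maps pooled guidance features to per-sample depthwise kernels applied via grouped convolution. How SRSNet actually conditions DPConv (e.g., on the reconstruction branch) and its kernel layout are richer; the pooling choice and shapes here are assumptions.

```python
# Minimal dynamic-parameter convolution sketch: per-sample depthwise kernels generated
# from globally pooled guidance features, applied with a grouped convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicParamConv(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.channels, self.k = channels, k
        self.kernel_gen = nn.Linear(channels, channels * k * k)   # hypernetwork

    def forward(self, x, guidance):
        b, c, h, w = x.shape
        ctx = F.adaptive_avg_pool2d(guidance, 1).flatten(1)       # (B, C) pooled guidance
        kernels = self.kernel_gen(ctx).view(b * c, 1, self.k, self.k)
        # Fold the batch into groups so every sample gets its own depthwise kernels.
        out = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)

x = torch.randn(2, 16, 32, 32)
guidance = torch.randn(2, 16, 32, 32)
print(DynamicParamConv(16)(x, guidance).shape)                    # torch.Size([2, 16, 32, 32])
```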

MobileUtr: Revisiting the relationship between light-weight CNN and Transformer for efficient medical image segmentation

  • paper_url: http://arxiv.org/abs/2312.01740
  • repo_url: https://github.com/fenghetan9/mobileutr
  • paper_authors: Fenghe Tang, Bingkun Nian, Jianrui Ding, Quan Quan, Jie Yang, Wei Liu, S. Kevin Zhou
  • for: 这篇论文旨在为高效的医学图像分割设计轻量级视觉Transformer,在降低计算成本的同时保持精度。
  • methods: 该论文提出了一种名为 ConvUtr 的类Transformer轻量级卷积块,作为ViT的patch embeddings,并引入自适应的Local-Global-Local(LGL)块以实现高效的局部-全局信息流交换。
  • results: 与最先进方法相比,该论文的模型 MobileUtr 在涵盖三种模态的五个公共医学图像数据集上表现出色,同时具有更轻的权重和更低的计算成本。
    Abstract Due to the scarcity and specific imaging characteristics in medical images, light-weighting Vision Transformers (ViTs) for efficient medical image segmentation is a significant challenge, and current studies have not yet paid attention to this issue. This work revisits the relationship between CNNs and Transformers in lightweight universal networks for medical image segmentation, aiming to integrate the advantages of both worlds at the infrastructure design level. In order to leverage the inductive bias inherent in CNNs, we abstract a Transformer-like lightweight CNNs block (ConvUtr) as the patch embeddings of ViTs, feeding Transformer with denoised, non-redundant and highly condensed semantic information. Moreover, an adaptive Local-Global-Local (LGL) block is introduced to facilitate efficient local-to-global information flow exchange, maximizing Transformer's global context information extraction capabilities. Finally, we build an efficient medical image segmentation model (MobileUtr) based on CNN and Transformer. Extensive experiments on five public medical image datasets with three different modalities demonstrate the superiority of MobileUtr over the state-of-the-art methods, while boasting lighter weights and lower computational cost. Code is available at https://github.com/FengheTan9/MobileUtr.

Effective Adapter for Face Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2312.01734
  • repo_url: None
  • paper_authors: Yunhao Liu, Lu Qi, Yu-Ju Tsai, Xiangtai Li, Kelvin C. K. Chan, Ming-Hsuan Yang
  • for: Improving face recognition accuracy in the wild, where images suffer from low quality and real-world distortions.
  • methods: Proposes an effective adapter that augments face recognition models trained on high-quality data by processing both the unrefined and the restoration-enhanced images through two similar branches, one frozen and one trainable.
  • results: In zero-shot settings it surpasses the baselines by about 3%, 4%, and 7% on three datasets, demonstrating the effectiveness of the method.
    Abstract In this paper, we tackle the challenge of face recognition in the wild, where images often suffer from low quality and real-world distortions. Traditional heuristic approaches-either training models directly on these degraded images or their enhanced counterparts using face restoration techniques-have proven ineffective, primarily due to the degradation of facial features and the discrepancy in image domains. To overcome these issues, we propose an effective adapter for augmenting existing face recognition models trained on high-quality facial datasets. The key of our adapter is to process both the unrefined and the enhanced images by two similar structures where one is fixed and the other trainable. Such design can confer two benefits. First, the dual-input system minimizes the domain gap while providing varied perspectives for the face recognition model, where the enhanced image can be regarded as a complex non-linear transformation of the original one by the restoration model. Second, both two similar structures can be initialized by the pre-trained models without dropping the past knowledge. The extensive experiments in zero-shot settings show the effectiveness of our method by surpassing baselines of about 3%, 4%, and 7% in three datasets. Our code will be publicly available at https://github.com/liuyunhaozz/FaceAdapter/.

Likelihood-Aware Semantic Alignment for Full-Spectrum Out-of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2312.01732
  • repo_url: https://github.com/lufan31/lsa
  • paper_authors: Fan Lu, Kai Zhu, Kecheng Zheng, Wei Zhai, Yang Cao
  • for: Improving full-spectrum out-of-distribution (OOD) detection, where semantic and covariate shifts are encountered simultaneously in real-world deployments.
  • methods: Proposes a Likelihood-Aware Semantic Alignment (LSA) framework that promotes image-text correspondence into semantically high-likelihood regions, combining an offline Gaussian sampling strategy that draws semantic-relevant visual embeddings from class-conditional Gaussians with a bidirectional prompt customization mechanism that adjusts both ID-related and negative context for a discriminative ID/OOD boundary.
  • results: LSA excels on the intractable Near-OOD setting, surpassing existing methods by margins of 15.26% and 18.88% on two F-OOD benchmarks, respectively.
    Abstract Full-spectrum out-of-distribution (F-OOD) detection aims to accurately recognize in-distribution (ID) samples while encountering semantic and covariate shifts simultaneously. However, existing out-of-distribution (OOD) detectors tend to overfit the covariance information and ignore intrinsic semantic correlation, inadequate for adapting to complex domain transformations. To address this issue, we propose a Likelihood-Aware Semantic Alignment (LSA) framework to promote the image-text correspondence into semantically high-likelihood regions. LSA consists of an offline Gaussian sampling strategy which efficiently samples semantic-relevant visual embeddings from the class-conditional Gaussian distribution, and a bidirectional prompt customization mechanism that adjusts both ID-related and negative context for discriminative ID/OOD boundary. Extensive experiments demonstrate the remarkable OOD detection performance of our proposed LSA especially on the intractable Near-OOD setting, surpassing existing methods by a margin of $15.26\%$ and $18.88\%$ on two F-OOD benchmarks, respectively.
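
As a rough illustration of the offline Gaussian sampling stage described in the abstract above, the sketch below fits a per-class Gaussian to ID visual embeddings and draws semantic-relevant samples from it. The diagonal covariance, the sample count, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def sample_class_conditional(features, labels, n_per_class=64):
    # Fit a Gaussian to the ID embeddings of each class and sample from it
    # (offline sampling stage). Assumes several embeddings per class and a
    # diagonal covariance for simplicity.
    samples, sample_labels = [], []
    for c in labels.unique():
        feats_c = features[labels == c]
        mu, std = feats_c.mean(0), feats_c.std(0) + 1e-6
        samples.append(mu + std * torch.randn(n_per_class, feats_c.size(1)))
        sample_labels.append(torch.full((n_per_class,), int(c)))
    return torch.cat(samples), torch.cat(sample_labels)
```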

Simultaneous Alignment and Surface Regression Using Hybrid 2D-3D Networks for 3D Coherent Layer Segmentation of Retinal OCT Images with Full and Sparse Annotations

  • paper_url: http://arxiv.org/abs/2312.01726
  • repo_url: https://github.com/ccarliu/Retinal-OCT-LayerSeg
  • paper_authors: Hong Liu, Dong Wei, Donghuan Lu, Xiaoying Tang, Liansheng Wang, Yefeng Zheng
  • for: This paper aims to develop a novel framework for 3D retinal layer segmentation in volumetric OCT images based on hybrid 2D-3D convolutional neural networks (CNNs).
  • methods: An encoder of 2D convolutions extracts 2D features from individual B-scans, followed by two 3D decoders coupled via a spatial transformer module that produce the alignment displacement vectors and the layer segmentation. The framework is trained end-to-end with two losses that exploit the retinal layers' natural smoothness for B-scan alignment and layer segmentation.
  • results: The framework achieves superior layer segmentation accuracy and cross-B-scan 3D continuity compared to state-of-the-art 2D deep learning methods in both fully and semi-supervised settings, and effectively aligns the B-scans for potential motion correction, offering more clinical value than previous works.
    Abstract Layer segmentation is important to quantitative analysis of retinal optical coherence tomography (OCT). Recently, deep learning based methods have been developed to automate this task and yield remarkable performance. However, due to the large spatial gap and potential mismatch between the B-scans of an OCT volume, all of them were based on 2D segmentation of individual B-scans, which may lose the continuity and diagnostic information of the retinal layers in 3D space. Besides, most of these methods required dense annotation of the OCT volumes, which is labor-intensive and expertise-demanding. This work presents a novel framework based on hybrid 2D-3D convolutional neural networks (CNNs) to obtain continuous 3D retinal layer surfaces from OCT volumes, which works well with both full and sparse annotations. The 2D features of individual B-scans are extracted by an encoder consisting of 2D convolutions. These 2D features are then used to produce the alignment displacement vectors and layer segmentation by two 3D decoders coupled via a spatial transformer module. Two losses are proposed to utilize the retinal layers' natural property of being smooth for B-scan alignment and layer segmentation, respectively, and are the key to the semi-supervised learning with sparse annotation. The entire framework is trained end-to-end. To the best of our knowledge, this is the first work that attempts 3D retinal layer segmentation in volumetric OCT images based on CNNs. Experiments on a synthetic dataset and three public clinical datasets show that our framework can effectively align the B-scans for potential motion correction, and achieves superior performance to state-of-the-art 2D deep learning methods in terms of both layer segmentation accuracy and cross-B-scan 3D continuity in both fully and semi-supervised settings, thus offering more clinical values than previous works.
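
The smoothness prior mentioned in the abstract can be written down directly; the sketch below penalizes height differences of the predicted layer surfaces between neighboring A-scans and neighboring B-scans. The tensor layout and the plain L1 form are assumptions; the paper's two losses are more specific than this.

```python
import torch

def surface_smoothness_loss(surface):
    # surface: (B, L, S, A) boundary heights for L layers over S B-scans and
    # A A-scan columns. Retinal layers are assumed to vary smoothly, so large
    # jumps between neighbors are penalized; this is what makes training with
    # sparse annotations feasible.
    d_ascan = (surface[..., 1:] - surface[..., :-1]).abs().mean()
    d_bscan = (surface[..., 1:, :] - surface[..., :-1, :]).abs().mean()
    return d_ascan + d_bscan
```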

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

  • paper_url: http://arxiv.org/abs/2312.01725
  • repo_url: https://github.com/rlawjdghek/stableviton
  • paper_authors: Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, Jaegul Choo
  • for: Extending a pre-trained diffusion model so that it can be used independently for the image-based virtual try-on task.
  • methods: Proposes StableVITON, which learns the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model. Zero cross-attention blocks preserve clothing details while exploiting the pre-trained model's knowledge during warping, and a novel attention total variation loss plus augmentation yields sharper attention maps and a more precise representation of clothing details (see the sketch after the abstract).
  • results: StableVITON generates high-fidelity virtual try-on images, shows promising quality on arbitrary person images, and outperforms the baselines in both qualitative and quantitative evaluations.
    Abstract Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task.The main challenge is to preserve the clothing details while effectively utilizing the robust generative capability of the pre-trained model. In order to tackle these issues, we propose StableVITON, learning the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed novel attention total variation loss and applying augmentation, we achieve the sharp attention map, resulting in a more precise representation of clothing details. StableVITON outperforms the baselines in qualitative and quantitative evaluation, showing promising quality in arbitrary person images. Our code is available at https://github.com/rlawjdghek/StableVITON.
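
One plausible reading of the attention total variation loss is a TV penalty on the cross-attention maps of neighboring query locations; the sketch below implements only that reading and is not the paper's exact definition (the normalization and the layers it applies to are unspecified here).

```python
import torch

def attention_total_variation(attn):
    # attn: (B, Hq, Wq, L) attention of each spatial query over the garment tokens.
    # Penalizing differences between neighboring queries encourages sharp,
    # spatially coherent garment-to-body correspondence.
    tv_h = (attn[:, 1:, :, :] - attn[:, :-1, :, :]).abs().mean()
    tv_w = (attn[:, :, 1:, :] - attn[:, :, :-1, :]).abs().mean()
    return tv_h + tv_w
```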

Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection

  • paper_url: http://arxiv.org/abs/2312.01713
  • repo_url: None
  • paper_authors: Xubin Zhong, Changxing Ding, Yupeng Hu, Dacheng Tao
  • for: Improving one-stage HOI detectors by enabling them to extract disentangled interaction representations.
  • methods: Proposes Shunted Cross-Attention (SCA) to extract human appearance, object appearance, and global context features with different cross-attention heads, and introduces an Interaction-aware Pose Estimation (IPE) task to learn interaction-relevant human pose features with a disentangled decoder.
  • results: Experiments show the approach can be readily applied to existing one-stage HOI detectors and achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks.
    Abstract Human-Object Interaction (HOI) detection is a core task for human-centric image understanding. Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction; however, the interaction representations obtained using this method are entangled and lack interpretability. In contrast, traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner. In this paper, we improve the performance of one-stage methods by enabling them to extract disentangled interaction representations. First, we propose Shunted Cross-Attention (SCA) to extract human appearance, object appearance, and global context features using different cross-attention heads. This is achieved by imposing different masks on the cross-attention maps produced by the different heads. Second, we introduce the Interaction-aware Pose Estimation (IPE) task to learn interaction-relevant human pose features using a disentangled decoder. This is achieved with a novel attention module that accurately captures the human keypoints relevant to the current interaction category. Finally, our approach fuses the appearance feature and pose feature via element-wise addition to form the interaction representation. Experimental results show that our approach can be readily applied to existing one-stage HOI detectors. Moreover, we achieve state-of-the-art performance on two benchmarks: HICO-DET and V-COCO.

Regressor-Segmenter Mutual Prompt Learning for Crowd Counting

  • paper_url: http://arxiv.org/abs/2312.01711
  • repo_url: None
  • paper_authors: Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, Yaowei Wang, Qixiang Ye
  • for: Improving crowd counting accuracy by resolving the density-map bias and context inaccuracy caused by annotation variance.
  • methods: Proposes mutual prompt learning (mPrompt), in which a regressor and a segmenter guide each other: point annotations tune the segmenter to predict pseudo head masks, and the predicted segmentation masks act as a spatial constraint that rectifies biased point annotations.
  • results: Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating its potential as a general framework for downstream vision tasks.
    Abstract Crowd counting has achieved significant progress by training regressors to predict instance positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which causes density map bias and context information inaccuracy. In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, solving bias and inaccuracy caused by annotation variance while distinguishing foreground from background. In specific, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks in a way of point prompt learning. It then uses the predicted segmentation masks, which serve as spatial constraint, to rectify biased point annotations as context prompt learning. mPrompt defines a way of mutual information maximization from prompt learning, mitigating the impact of annotation variance while improving model accuracy. Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating the potential to be general framework for down-stream vision tasks.

GenEM: Physics-Informed Generative Cryo-Electron Microscopy

  • paper_url: http://arxiv.org/abs/2312.02235
  • repo_url: None
  • paper_authors: Jiakai Zhang, Qihe Chen, Yan Zeng, Wenyuan Gao, Xuming He, Zhijie Liu, Jingyi Yu
  • for: Improving high-resolution single-particle cryo-electron microscopy (cryo-EM) reconstruction, e.g. of near-atomic 3D structures such as the SARS-COV-2 spike protein.
  • methods: Combines physics-based cryo-EM simulation with a generative unpaired noise translation (contrastive learning with a novel mask-guided sampling scheme) to produce physically correct synthetic cryo-EM datasets with realistic noise.
  • results: Experiments show that GenEM generates realistic cryo-EM images, and the generated datasets improve particle picking and pose estimation models, ultimately raising the reconstruction resolution.
    Abstract In the past decade, deep conditional generative models have revolutionized the generation of realistic images, extending their application from entertainment to scientific domains. Single-particle cryo-electron microscopy (cryo-EM) is crucial in resolving near-atomic resolution 3D structures of proteins, such as the SARS-COV-2 spike protein. To achieve high-resolution reconstruction, AI models for particle picking and pose estimation have been adopted. However, their performance is still limited as they lack high-quality annotated datasets. To address this, we introduce physics-informed generative cryo-electron microscopy (GenEM), which for the first time integrates physical-based cryo-EM simulation with a generative unpaired noise translation to generate physically correct synthetic cryo-EM datasets with realistic noises. Initially, GenEM simulates the cryo-EM imaging process based on a virtual specimen. To generate realistic noises, we leverage an unpaired noise translation via contrastive learning with a novel mask-guided sampling scheme. Extensive experiments show that GenEM is capable of generating realistic cryo-EM images. The generated dataset can further enhance particle picking and pose estimation models, eventually improving the reconstruction resolution. We will release our code and annotated synthetic datasets.

BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

  • paper_url: http://arxiv.org/abs/2312.01696
  • repo_url: None
  • paper_authors: Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu
  • for: Addressing the drawbacks of dense BEV (bird's-eye view) methods for camera-based 3D object detection and proposing a "modernized" dense BEV framework, BEVNeXt.
  • methods: Introduces three enhanced components: a CRF-modulated depth estimation module, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding.
  • results: On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art 64.2 NDS on the nuScenes test set.
    Abstract Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components, including a CRF-modulated depth estimation module enforcing object-level consistencies, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set.

Fast and accurate sparse-view CBCT reconstruction using meta-learned neural attenuation field and hash-encoding regularization

  • paper_url: http://arxiv.org/abs/2312.01689
  • repo_url: None
  • paper_authors: Heejun Shin, Taehee Kim, Jongho Lee, Seyoung Chun, Seungryung Cho, Dongmyung Shin
  • for: Improving CBCT reconstruction quality and optimization speed when images are reconstructed from only a small number of views (fewer than 50).
  • methods: Meta-trains a neural attenuation field and a hash encoder on a few scans (15) and introduces a new regularization technique to reconstruct the details of anatomical structures.
  • results: Across CBCT scans of different body parts (chest, head, abdomen) and CT vendors, FACT delivers better reconstruction quality and faster optimization than conventional algorithms.
    Abstract Cone beam computed tomography (CBCT) is an emerging medical imaging technique to visualize the internal anatomical structures of patients. During a CBCT scan, several projection images of different angles or views are collectively utilized to reconstruct a tomographic image. However, reducing the number of projections in a CBCT scan while preserving the quality of a reconstructed image is challenging due to the nature of an ill-posed inverse problem. Recently, a neural attenuation field (NAF) method was proposed by adopting a neural radiance field algorithm as a new way for CBCT reconstruction, demonstrating fast and promising results using only 50 views. However, decreasing the number of projections is still preferable to reduce potential radiation exposure, and a faster reconstruction time is required considering a typical scan time. In this work, we propose a fast and accurate sparse-view CBCT reconstruction (FACT) method to provide better reconstruction quality and faster optimization speed in the minimal number of view acquisitions ($<$ 50 views). In the FACT method, we meta-trained a neural network and a hash-encoder using a few scans (= 15), and a new regularization technique is utilized to reconstruct the details of an anatomical structure. In conclusion, we have shown that the FACT method produced better, and faster reconstruction results over the other conventional algorithms based on CBCT scans of different body parts (chest, head, and abdomen) and CT vendors (Siemens, Phillips, and GE).

Adversarial Medical Image with Hierarchical Feature Hiding

  • paper_url: http://arxiv.org/abs/2312.01679
  • repo_url: https://github.com/qsyao/hierarchical_feature_constraint
  • paper_authors: Qingsong Yao, Zecheng He, Yuexiang Li, Yi Lin, Kai Ma, Yefeng Zheng, S. Kevin Zhou
  • for: Investigating adversarial examples (AEs) against deep learning models for medical images and reassessing the reliability of reactive defenses against them.
  • methods: Analyzes the characteristics of conventional attacks such as PGD and proposes a hierarchical feature constraint (HFC), a novel add-on to white-box attacks that hides the adversarial feature within the target feature distribution.
  • results: Experiments on 2D and 3D medical datasets show that HFC hides medical AEs more efficiently than competing adaptive attacks and bypasses an array of state-of-the-art medical AE detectors, exposing the deficiencies of current medical reactive defenses and motivating more robust ones.
    Abstract Deep learning based methods for medical images can be easily compromised by adversarial examples (AEs), posing a great security flaw in clinical decision-making. It has been discovered that conventional adversarial attacks like PGD which optimize the classification logits, are easy to distinguish in the feature space, resulting in accurate reactive defenses. To better understand this phenomenon and reassess the reliability of the reactive defenses for medical AEs, we thoroughly investigate the characteristic of conventional medical AEs. Specifically, we first theoretically prove that conventional adversarial attacks change the outputs by continuously optimizing vulnerable features in a fixed direction, thereby leading to outlier representations in the feature space. Then, a stress test is conducted to reveal the vulnerability of medical images, by comparing with natural images. Interestingly, this vulnerability is a double-edged sword, which can be exploited to hide AEs. We then propose a simple-yet-effective hierarchical feature constraint (HFC), a novel add-on to conventional white-box attacks, which assists to hide the adversarial feature in the target feature distribution. The proposed method is evaluated on three medical datasets, both 2D and 3D, with different modalities. The experimental results demonstrate the superiority of HFC, i.e., it bypasses an array of state-of-the-art adversarial medical AE detectors more efficiently than competing adaptive attacks, which reveals the deficiencies of medical reactive defense and allows to develop more robust defenses in future.
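
The feature-hiding idea can be sketched as a standard PGD attack with an extra feature-space term that pulls the adversarial example's intermediate features toward a clean target statistic. This is a simplified, single-layer stand-in for the hierarchical constraint; `feat_fn`, `target_feat`, and the hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pgd_with_feature_hiding(model, feat_fn, x, y, target_feat,
                            eps=8 / 255, alpha=2 / 255, steps=10, lam=1.0):
    # Maximize the classification loss while keeping the intermediate features
    # close to `target_feat`, so the AE is no longer an outlier in feature space.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        cls_loss = F.cross_entropy(model(x_adv), y)
        hide_loss = (feat_fn(x_adv) - target_feat).pow(2).mean()
        grad = torch.autograd.grad(cls_loss - lam * hide_loss, x_adv)[0]
        x_adv = (x_adv.detach() + alpha * grad.sign()).clamp(x - eps, x + eps).clamp(0, 1)
    return x_adv
```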

Multi-task Image Restoration Guided By Robust DINO Features

  • paper_url: http://arxiv.org/abs/2312.01677
  • repo_url: None
  • paper_authors: Xin Lin, Chao Ren, Kelvin C. K. Chan, Lu Qi, Jinshan Pan, Ming-Hsuan Yang
  • for: Proposing a multi-task image restoration method that improves efficiency and versatility over single-task counterparts.
  • methods: Leverages robust features from DINOv2 and designs dedicated components: a multi-layer semantic fusion module, a DINO-Restore adaptation and fusion module, and a DINO perception contrastive loss.
  • results: Experiments across diverse restoration tasks show that DINO-IR outperforms existing multi-task image restoration approaches by a large margin.
    Abstract Multi-task image restoration has gained significant interest due to its inherent versatility and efficiency compared to its single-task counterpart. Despite its potential, performance degradation is observed with an increase in the number of tasks, primarily attributed to the distinct nature of each restoration task. Addressing this challenge, we introduce DINO-IR, a novel multi-task image restoration approach leveraging robust features extracted from DINOv2. Our empirical analysis shows that while shallow features of DINOv2 capture rich low-level image characteristics, the deep features ensure a robust semantic representation insensitive to degradations while preserving high-frequency contour details. Building on these features, we devise specialized components, including multi-layer semantic fusion module, DINO-Restore adaption and fusion module, and DINO perception contrastive loss, to integrate DINOv2 features into the restoration paradigm. Equipped with the aforementioned components, our DINO-IR performs favorably against existing multi-task image restoration approaches in various tasks by a large margin, indicating the superiority and necessity of reinforcing the robust features for multi-task image restoration.

MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model

  • paper_url: http://arxiv.org/abs/2312.02233
  • repo_url: None
  • paper_authors: Ling Yang, Zhanyu Wang, Luping Zhou
  • for: Proposing a unified multimodal large model that supports seamless interaction between medical assistants and users.
  • methods: Covers three key functionalities: CXR (chest X-ray)-to-report generation, CXR-based visual question answering (VQA), and text-to-CXR synthesis.
  • results: MedXChat shows strong cross-task adaptability, outperforming benchmark models on the MIMIC dataset. Its text-to-CXR synthesis exploits the instruction-following ability of the Stable Diffusion (SD) architecture without extra parameters, preserving SD's generative strength while rendering fine-grained medical images with high fidelity.
    Abstract Despite the success of Large Language Models (LLMs) in general image tasks, a gap persists in the medical field for a multimodal large model adept at handling the nuanced diversity of medical images. Addressing this, we propose MedXChat, a unified multimodal large model designed for seamless interactions between medical assistants and users. MedXChat encompasses three key functionalities: CXR(Chest X-ray)-to-Report generation, CXR-based visual question-answering (VQA), and Text-to-CXR synthesis. Our contributions are as follows. Firstly, our model showcases exceptional cross-task adaptability, displaying adeptness across all three defined tasks and outperforming the benchmark models on the MIMIC dataset in medical multimodal applications. Secondly, we introduce an innovative Text-to-CXR synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework, requiring no extra parameters, thereby maintaining the SD's generative strength while also bestowing upon it the capacity to render fine-grained medical images with high fidelity. Comprehensive experiments validate MedXChat's synergistic enhancement across all tasks. Our instruction data and model will be open-sourced.

Multimodality-guided Image Style Transfer using Cross-modal GAN Inversion

  • paper_url: http://arxiv.org/abs/2312.01671
  • repo_url: None
  • paper_authors: Hanyu Wang, Pengxiang Wu, Kevin Dela Rosa, Chen Wang, Abhinav Shrivastava
  • for: Proposing a text-guided image style transfer method that improves the flexibility and quality of style transfer.
  • methods: Uses a novel cross-modal GAN inversion method to generate style representations consistent with styles specified from multiple sources and modalities, enabling MultiModality-guided Image Style Transfer (MMIST).
  • results: Large-scale experiments and user studies show state-of-the-art performance on text-guided style transfer, and comprehensive qualitative results confirm effectiveness on MMIST and cross-modal style interpolation.
    Abstract Image Style Transfer (IST) is an interdisciplinary topic of computer vision and art that continuously attracts researchers' interests. Different from traditional Image-guided Image Style Transfer (IIST) methods that require a style reference image as input to define the desired style, recent works start to tackle the problem in a text-guided manner, i.e., Text-guided Image Style Transfer (TIST). Compared to IIST, such approaches provide more flexibility with text-specified styles, which are useful in scenarios where the style is hard to define with reference images. Unfortunately, many TIST approaches produce undesirable artifacts in the transferred images. To address this issue, we present a novel method to achieve much improved style transfer based on text guidance. Meanwhile, to offer more flexibility than IIST and TIST, our method allows style inputs from multiple sources and modalities, enabling MultiModality-guided Image Style Transfer (MMIST). Specifically, we realize MMIST with a novel cross-modal GAN inversion method, which generates style representations consistent with specified styles. Such style representations facilitate style transfer and in principle generalize any IIST methods to MMIST. Large-scale experiments and user studies demonstrate that our method achieves state-of-the-art performance on TIST task. Furthermore, comprehensive qualitative results confirm the effectiveness of our method on MMIST task and cross-modal style interpolation.

HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses

  • paper_url: http://arxiv.org/abs/2312.02232
  • repo_url: None
  • paper_authors: Caoyuan Ma, Yu-Lun Liu, Zhixiang Wang, Wu Liu, Xinchen Liu, Zheng Wang
  • for: Synthesizing images of diverse novel poses from simple, few-shot input.
  • methods: Combines explicit and implicit human representations, using a frozen general mapping together with a point-specific mapping to avoid overfitting and improve pose generalization.
  • results: Generates images under arbitrary poses from only a few inputs, improves pose generalization over existing HumanNeRF studies, and speeds up synthesis 15-fold by reducing computational complexity without any acceleration modules.
    Abstract We present HumanNeRF-SE, which can synthesize diverse novel pose images with simple input. Previous HumanNeRF studies require large neural networks to fit the human appearance and prior knowledge. Subsequent methods build upon this approach with some improvements. Instead, we reconstruct this approach, combining explicit and implicit human representations with both general and specific mapping processes. Our key insight is that explicit shape can filter the information used to fit implicit representation, and frozen general mapping combined with point-specific mapping can effectively avoid overfitting and improve pose generalization performance. Our explicit and implicit human represent combination architecture is extremely effective. This is reflected in our model's ability to synthesize images under arbitrary poses with few-shot input and increase the speed of synthesizing images by 15 times through a reduction in computational complexity without using any existing acceleration modules. Compared to the state-of-the-art HumanNeRF studies, HumanNeRF-SE achieves better performance with fewer learnable parameters and less training time (see Figure 1).

RiskBench: A Scenario-based Benchmark for Risk Identification

  • paper_url: http://arxiv.org/abs/2312.01659
  • repo_url: None
  • paper_authors: Chi-Hsi Kung, Chieh-Chi Yang, Pang-Yuan Pao, Shu-Wei Lu, Pin-Lun Chen, Hsin-Cheng Lu, Yi-Ting Chen
  • for: Advancing safety performance toward zero-collision mobility by enabling systematic evaluation of risk identification.
  • methods: Introduces a scenario-based benchmark with a scenario taxonomy and augmentation pipeline for systematically collecting ground-truth risks under diverse scenarios.
  • results: Evaluates the ability of ten algorithms to detect and locate risks, anticipate risks, and facilitate decision-making through extensive experiments, and summarizes directions for future risk identification research.
    Abstract Intelligent driving systems aim to achieve a zero-collision mobility experience, requiring interdisciplinary efforts to enhance safety performance. This work focuses on risk identification, the process of identifying and analyzing risks stemming from dynamic traffic participants and unexpected events. While significant advances have been made in the community, the current evaluation of different risk identification algorithms uses independent datasets, leading to difficulty in direct comparison and hindering collective progress toward safety performance enhancement. To address this limitation, we introduce \textbf{RiskBench}, a large-scale scenario-based benchmark for risk identification. We design a scenario taxonomy and augmentation pipeline to enable a systematic collection of ground truth risks under diverse scenarios. We assess the ability of ten algorithms to (1) detect and locate risks, (2) anticipate risks, and (3) facilitate decision-making. We conduct extensive experiments and summarize future research on risk identification. Our aim is to encourage collaborative endeavors in achieving a society with zero collisions. We have made our dataset and benchmark toolkit publicly on the project page: https://hcis-lab.github.io/RiskBench/

Adaptive Confidence Threshold for ByteTrack in Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2312.01650
  • repo_url: https://github.com/linh-gist/AdaptConfByteTrack
  • paper_authors: Linh Van Ma, Muhammad Ishfaq Hussain, JongHyun Park, Jeongbae Kim, Moongu Jeon
  • for: Multi-object tracking.
  • methods: Builds on the ByteTrack algorithm with an adaptive confidence-threshold technique applied at inference time, without any prior training.
  • results: Improves tracking performance over the fixed-threshold ByteTrack baseline while maintaining comparable running time.
    Abstract We investigate the application of ByteTrack in the realm of multiple object tracking. ByteTrack, a simple tracking algorithm, enables the simultaneous tracking of multiple objects by strategically incorporating detections with a low confidence threshold. Conventionally, objects are initially associated with high confidence threshold detections. When the association between objects and detections becomes ambiguous, ByteTrack extends the association to lower confidence threshold detections. One notable drawback of the existing ByteTrack approach is its reliance on a fixed threshold to differentiate between high and low-confidence detections. In response to this limitation, we introduce a novel and adaptive approach. Our proposed method entails a dynamic adjustment of the confidence threshold, leveraging insights derived from overall detections. Through experimentation, we demonstrate the effectiveness of our adaptive confidence threshold technique while maintaining running time compared to ByteTrack.
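
A minimal sketch of the adaptive-threshold idea: instead of ByteTrack's fixed cut between high- and low-confidence detections, the split is derived per frame from the overall detection scores. The mean/std rule below is an assumption for illustration; the paper's adaptation rule may differ.

```python
import numpy as np

def split_detections(scores, boxes, k=0.5):
    # Adaptive high/low-confidence split from the frame's score statistics.
    scores = np.asarray(scores, dtype=float)
    thr_high = scores.mean()
    thr_low = max(0.1, thr_high - k * scores.std())
    high = [b for b, s in zip(boxes, scores) if s >= thr_high]
    low = [b for b, s in zip(boxes, scores) if thr_low <= s < thr_high]
    return high, low

# As in ByteTrack, tracks are first associated with `high`; unmatched tracks are
# then matched against `low` in a second association stage.
high, low = split_detections([0.92, 0.85, 0.40, 0.22], [[0, 0, 10, 10]] * 4)
```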

TMSR: Tiny Multi-path CNNs for Super Resolution

  • paper_url: http://arxiv.org/abs/2312.01644
  • repo_url: None
  • paper_authors: Chia-Hung Liu, Tzu-Hsin Hsieh, Kuan-Yu Huang, Pei-Yin Chen
  • for: Proposing a tiny multi-path CNN-based super-resolution (SR) method, TMSR, to improve image resolution.
  • methods: Builds on tiny CNN-based SR methods with fewer than 5k parameters, contributing improved multi-path learning and a self-defined activation function.
  • results: Experiments show that TMSR obtains image quality (PSNR and SSIM) competitive with related methods under 5k parameters.
    Abstract In this paper, we proposed a tiny multi-path CNN-based Super-Resolution (SR) method, called TMSR. We mainly refer to some tiny CNN-based SR methods, under 5k parameters. The main contribution of the proposed method is the improved multi-path learning and self-defined activated function. The experimental results show that TMSR obtains competitive image quality (i.e. PSNR and SSIM) compared to the related works under 5k parameters.

SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm

  • paper_url: http://arxiv.org/abs/2312.01640
  • repo_url: https://github.com/wangxiao5791509/Pedestrian-Attribute-Recognition-Paper-List
  • paper_authors: Jiandong Jin, Xiao Wang, Chenglong Li, Lili Huang, Jin Tang
  • for: Improving the accuracy and robustness of pedestrian attribute recognition for practical applications.
  • methods: Proposes a sequence generation paradigm: pedestrian features are extracted with a pre-trained CLIP model, the attribute set is embedded into query tokens under the guidance of text prompts, and a Transformer decoder generates the attributes. A masked multi-head attention layer prevents the model from memorizing the next attribute during training (see the sketch after the abstract).
  • results: Extensive experiments on multiple widely used pedestrian attribute recognition datasets validate the effectiveness of the proposed SequencePAR.
    Abstract Current pedestrian attribute recognition (PAR) algorithms are developed based on multi-label or multi-task learning frameworks, which aim to discriminate the attributes using specific classification heads. However, these discriminative models are easily influenced by imbalanced data or noisy samples. Inspired by the success of generative models, we rethink the pedestrian attribute recognition scheme and believe the generative models may perform better on modeling dependencies and complexity between human attributes. In this paper, we propose a novel sequence generation paradigm for pedestrian attribute recognition, termed SequencePAR. It extracts the pedestrian features using a pre-trained CLIP model and embeds the attribute set into query tokens under the guidance of text prompts. Then, a Transformer decoder is proposed to generate the human attributes by incorporating the visual features and attribute query tokens. The masked multi-head attention layer is introduced into the decoder module to prevent the model from remembering the next attribute while making attribute predictions during training. Extensive experiments on multiple widely used pedestrian attribute recognition datasets fully validated the effectiveness of our proposed SequencePAR. The source code and pre-trained models will be released at https://github.com/Event-AHU/OpenPAR.
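
A small sketch of the masked attribute decoding step: a causal mask keeps each attribute query from attending to later ground-truth attributes during teacher-forced training, while cross-attention reads the CLIP visual tokens. Layer sizes, counts, and token dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

def causal_mask(length):
    # Block attention to future attribute positions so the decoder cannot simply
    # memorize the next ground-truth attribute during training.
    return torch.triu(torch.full((length, length), float("-inf")), diagonal=1)

d_model, n_attr, n_patches = 256, 12, 196
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
attr_queries = torch.randn(1, n_attr, d_model)      # embedded attribute prompts
visual_tokens = torch.randn(1, n_patches, d_model)  # CLIP image features
out = decoder(attr_queries, visual_tokens, tgt_mask=causal_mask(n_attr))
```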

J-Net: Improved U-Net for Terahertz Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2312.01638
  • repo_url: None
  • paper_authors: Woon-Ha Yeo, Seung-Hwan Jung, Seung Jae Oh, Inhee Maeng, Eui Su Lee, Han-Cheol Ryu
  • for: Improving the resolution of terahertz (THz) images with a new U-Net-based network architecture called J-Net.
  • methods: J-Net uses simple baseline blocks that efficiently extract low-resolution (LR) image features and learn the mapping from LR to high-resolution (HR) images.
  • results: Trained on the DIV2K+Flickr2K dataset, J-Net reaches 32.52 dB PSNR, surpassing other THz image super-resolution methods by more than 1 dB, and also shows better PSNR and visual quality on real THz images.
    Abstract Terahertz (THz) waves are electromagnetic waves in the 0.1 to 10 THz frequency range, and THz imaging is utilized in a range of applications, including security inspections, biomedical fields, and the non-destructive examination of materials. However, THz images have low resolution due to the long wavelength of THz waves. Therefore, improving the resolution of THz images is one of the current hot research topics. We propose a novel network architecture called J-Net which is improved version of U-Net to solve the THz image super-resolution. It employs the simple baseline blocks which can extract low resolution (LR) image features and learn the mapping of LR images to highresolution (HR) images efficiently. All training was conducted using the DIV2K+Flickr2K dataset, and we employed the peak signal-to-noise ratio (PSNR) for quantitative comparison. In our comparisons with other THz image super-resolution methods, JNet achieved a PSNR of 32.52 dB, surpassing other techniques by more than 1 dB. J-Net also demonstrates superior performance on real THz images compared to other methods. Experiments show that the proposed J-Net achieves better PSNR and visual improvement compared with other THz image super-resolution methods.

GaussianHead: Impressive 3D Gaussian-based Head Avatars with Dynamic Hybrid Neural Field

  • paper_url: http://arxiv.org/abs/2312.01632
  • repo_url: https://github.com/chiehwangs/gaussian-head
  • paper_authors: Jie Wang, Xianyan Li, Jiucheng Xie, Feng Xu, Hao Gao
  • for: Proposing a head-avatar algorithm based on anisotropic 3D Gaussian primitives that improves efficiency and visual quality.
  • methods: Represents dynamic scenes with canonical Gaussians, parameterizes head geometry with an explicit "dynamic" tri-plane to obtain canonical factors aligned with the underlying geometry, decodes those factors with a tiny MLP into opacity and spherical-harmonic coefficients, and renders with an efficient differentiable Gaussian rasterizer.
  • results: Compared with state-of-the-art techniques, it achieves optimal visual results on self-reconstruction, novel view synthesis, and cross-identity reenactment while keeping rendering efficient (0.12 s per frame); even the pores around the nose are clearly visible in some cases.
    Abstract Previous head avatar methods have mostly relied on fixed explicit primitives (mesh, point) or implicit surfaces (Sign Distance Function) and volumetric neural radiance field, it challenging to strike a balance among high fidelity, training speed, and resource consumption. The recent popularity of hybrid field has brought novel representation, but is limited by relying on parameterization factors obtained through fixed mappings. We propose GaussianHead: an head avatar algorithm based on anisotropic 3D gaussian primitives. We leverage canonical gaussians to represent dynamic scenes. Using explicit "dynamic" tri-plane as an efficient container for parameterized head geometry, aligned well with factors in the underlying geometry and tri-plane, we obtain aligned canonical factors for the canonical gaussians. With a tiny MLP, factors are decoded into opacity and spherical harmonic coefficients of 3D gaussian primitives. Finally, we use efficient differentiable gaussian rasterizer for rendering. Our approach benefits significantly from our novel representation based on 3D gaussians, and the proper alignment transformation of underlying geometry structures and factors in tri-plane eliminates biases introduced by fixed mappings. Compared to state-of-the-art techniques, we achieve optimal visual results in tasks such as self-reconstruction, novel view synthesis, and cross-identity reenactment while maintaining high rendering efficiency (0.12s per frame). Even the pores around the nose are clearly visible in some cases. Code and additional video can be found on the project homepage.

CLAMP: Contrastive LAnguage Model Prompt-tuning

  • paper_url: http://arxiv.org/abs/2312.01629
  • repo_url: None
  • paper_authors: Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim
  • for: Investigating whether modern large language models (LLMs) can be adapted to classify an image into a set of categories.
  • methods: Lightly fine-tunes the LLM with the same contrastive image-caption matching objective used by CLIP, rather than training it for generative visual tasks alone.
  • results: With this light fine-tuning, LLMs achieve good image classification performance, beating state-of-the-art multimodal LLMs by 13% and slightly outperforming contrastive learning with a custom text model, while retaining the LLM's generative abilities; LLM initialization particularly helps in domains under-represented in the visual pre-training data.
    Abstract Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set of categories. First, we evaluate multimodal LLMs that are tuned for generative tasks on zero-shot image classification and find that their performance is far below that of specialized models like CLIP. We then propose an approach for light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP. Our results show that LLMs can, indeed, achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model, while also retaining the LLM's generative abilities. LLM initialization appears to particularly help classification in domains under-represented in the visual pre-training data.
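
A minimal sketch of the CLIP-style objective used for the light fine-tuning, with the text tower replaced by a pooled, projected LLM representation of the class name; the pooling, projection head, and temperature are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over an (image, class-name caption) batch, as in CLIP.
    # `text_emb` is assumed to come from the (mostly frozen) LLM plus a small
    # trainable projection head.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```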

Universal Segmentation at Arbitrary Granularity with Language Instruction

  • paper_url: http://arxiv.org/abs/2312.01623
  • repo_url: https://github.com/workforai/UniLSeg
  • paper_authors: Yong Liu, Cairong Zhang, Yitong Wang, Jiahao Wang, Yujiu Yang, Yansong Tang
  • for: Achieving universal segmentation at arbitrary semantic granularity.
  • methods: The model performs segmentation under the guidance of language instructions, reorganizing diverse tasks into a unified data format (an image plus text describing the segmentation target as input, the corresponding mask as output) and exploiting large amounts of unlabeled data via an automatic annotation engine.
  • results: The model performs excellently across a variety of tasks and settings, surpassing both specialist and unified segmentation models.
    Abstract This paper aims to achieve universal segmentation of arbitrary semantic level. Despite significant progress in recent years, specialist segmentation approaches are limited to specific tasks and data distribution. Retraining a new model for adaptation to new scenarios or settings takes expensive computation and time cost, which raises the demand for versatile and universal segmentation model that can cater to various granularity. Although some attempts have been made for unifying different segmentation tasks or generalization to various scenarios, limitations in the definition of paradigms and input-output spaces make it difficult for them to achieve accurate understanding of content at arbitrary granularity. To this end, we present UniLSeg, a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions. For training UniLSeg, we reorganize a group of tasks from original diverse distributions into a unified data format, where images with texts describing segmentation targets as input and corresponding masks are output. Combined with a automatic annotation engine for utilizing numerous unlabeled data, UniLSeg achieves excellent performance on various tasks and settings, surpassing both specialist and unified segmentation models.

SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System

  • paper_url: http://arxiv.org/abs/2312.01616
  • repo_url: None
  • paper_authors: Yunfei Fan, Tianyu Zhao, Guidong Wang
  • for: Improving the accuracy and computational efficiency of Visual Inertial Navigation Systems (VINS) so that high-precision localization is possible on resource-constrained devices.
  • methods: Proposes a novel filter-based framework, SchurVINS, that explicitly models the gradient, Hessian, and observation covariance of the full residual model, uses the Schur complement to decompose it into an ego-motion residual model and a landmark residual model, and performs efficient EKF updates in both (see the sketch after the abstract).
  • results: Experiments on the EuRoC and TUM-VI datasets show that the method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity.
    Abstract Accuracy and computational efficiency are the most important metrics to Visual Inertial Navigation System (VINS). The existing VINS algorithms with either high accuracy or low computational complexity, are difficult to provide the high precision localization in resource-constrained devices. To this end, we propose a novel filter-based VINS framework named SchurVINS, which could guarantee both high accuracy by building a complete residual model and low computational complexity with Schur complement. Technically, we first formulate the full residual model where Gradient, Hessian and observation covariance are explicitly modeled. Then Schur complement is employed to decompose the full model into ego-motion residual model and landmark residual model. Finally, Extended Kalman Filter (EKF) update is implemented in these two models with high efficiency. Experiments on EuRoC and TUM-VI datasets show that our method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity. We will open source our experimental code to benefit the community.
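
The Schur-complement reduction itself is standard linear algebra; the numeric sketch below marginalizes a toy landmark block out of a joint (ego-motion, landmark) system and back-substitutes, which is the decomposition the abstract refers to. The VINS-specific residual models and EKF bookkeeping are omitted, and the toy sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((20, 9))             # stacked Jacobian: 6 pose + 3 landmark columns
H = J.T @ J + 1e-3 * np.eye(9)               # Gauss-Newton Hessian approximation
b = J.T @ rng.standard_normal(20)
Hxx, Hxl, Hll = H[:6, :6], H[:6, 6:], H[6:, 6:]
bx, bl = b[:6], b[6:]

Hll_inv = np.linalg.inv(Hll)
S = Hxx - Hxl @ Hll_inv @ Hxl.T              # Schur complement: reduced ego-motion system
dx = np.linalg.solve(S, bx - Hxl @ Hll_inv @ bl)
dl = Hll_inv @ (bl - Hxl.T @ dx)             # back-substitute for the landmark update
```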

TextAug: Test time Text Augmentation for Multimodal Person Re-identification

  • paper_url: http://arxiv.org/abs/2312.01605
  • repo_url: None
  • paper_authors: Mulham Fawakherji, Eduard Vazquez, Pasquale Giampa, Binod Bhattarai
  • for: Improving the performance of multimodal person re-identification.
  • methods: Adapts the image-domain augmentations cutout and cutmix to text, merging them into a single strategy, CutMixOut, applied at inference time without any prior training (see the sketch after the abstract).
  • results: The simple yet effective text augmentation improves performance on multiple multimodal person re-identification benchmarks.
    Abstract Multimodal Person Reidentification is gaining popularity in the research community due to its effectiveness compared to counter-part unimodal frameworks. However, the bottleneck for multimodal deep learning is the need for a large volume of multimodal training examples. Data augmentation techniques such as cropping, flipping, rotation, etc. are often employed in the image domain to improve the generalization of deep learning models. Augmenting in other modalities than images, such as text, is challenging and requires significant computational resources and external data sources. In this study, we investigate the effectiveness of two computer vision data augmentation techniques: cutout and cutmix, for text augmentation in multi-modal person re-identification. Our approach merges these two augmentation strategies into one strategy called CutMixOut which involves randomly removing words or sub-phrases from a sentence (Cutout) and blending parts of two or more sentences to create diverse examples (CutMix) with a certain probability assigned to each operation. This augmentation was implemented at inference time without any prior training. Our results demonstrate that the proposed technique is simple and effective in improving the performance on multiple multimodal person re-identification benchmarks.
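
The two text operations are simple enough to sketch directly: cutout drops random words from a caption, cutmix splices parts of two captions, and CutMixOut picks one of the two with some probability. The probabilities below are placeholders, not the paper's values.

```python
import random

def cutout(sentence, p=0.2):
    # Text analogue of image cutout: randomly drop words.
    words = sentence.split()
    kept = [w for w in words if random.random() > p]
    return " ".join(kept) if kept else sentence

def cutmix(sent_a, sent_b):
    # Text analogue of image cutmix: front of one caption, back of another.
    a, b = sent_a.split(), sent_b.split()
    cut = random.randint(1, max(1, len(a) - 1))
    return " ".join(a[:cut] + b[cut:])

def cutmixout(sent_a, sent_b, p_cutout=0.5):
    # Apply one of the two operations at inference time, no training required.
    return cutout(sent_a) if random.random() < p_cutout else cutmix(sent_a, sent_b)

print(cutmixout("a man wearing a red jacket and jeans",
                "a woman in a blue coat carrying a backpack"))
```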

Good Questions Help Zero-Shot Image Reasoning

  • paper_url: http://arxiv.org/abs/2312.01598
  • repo_url: https://github.com/kai-wen-yang/qvix
  • paper_authors: Kaiwen Yang, Tao Shen, Xinmei Tian, Xiubo Geng, Chongyang Tao, Dacheng Tao, Tianyi Zhou
  • for: Strengthening semantic understanding and fine-grained detail coverage in zero-shot image reasoning tasks.
  • methods: Uses Question-Driven Visual Exploration (QVix): an LLM generates input-exploratory questions with more detail than the original query, guiding the LVLM to explore image content more comprehensively and uncover subtle or peripheral details.
  • results: On challenging zero-shot vision-language benchmarks such as ScienceQA and fine-grained visual classification, QVix significantly outperforms existing methods.
    Abstract Aligning the recent large language models (LLMs) with computer vision models leads to large vision-language models (LVLMs), which have paved the way for zero-shot image reasoning tasks. However, LVLMs are usually trained on short high-level captions only referring to sparse focus regions in images. Such a ``tunnel vision'' limits LVLMs to exploring other relevant contexts in complex scenes. To address this challenge, we introduce Question-Driven Visual Exploration (QVix), a novel prompting strategy that enhances the exploratory capabilities of LVLMs in zero-shot reasoning tasks. QVix leverages LLMs' strong language prior to generate input-exploratory questions with more details than the original query, guiding LVLMs to explore visual content more comprehensively and uncover subtle or peripheral details. QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment. Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods, highlighting its effectiveness in bridging the gap between complex visual data and LVLMs' exploratory abilities.
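
Schematically, the prompting flow looks like the sketch below; `llm.generate` and `lvlm.answer` are placeholder interfaces and the prompt wording is an assumption, not the paper's template.

```python
def qvix_answer(llm, lvlm, image, query, n_questions=3):
    # 1) The LLM expands the query into finer-grained exploratory questions.
    prompt = (f"Write {n_questions} short questions that would help answer "
              f"'{query}' about an image, covering objects, attributes and context.")
    questions = llm.generate(prompt).splitlines()[:n_questions]
    # 2) The LVLM answers them against the image, surfacing peripheral details.
    findings = [lvlm.answer(image, q) for q in questions]
    # 3) The LVLM answers the original query conditioned on those findings.
    context = " ".join(f"Q: {q} A: {a}" for q, a in zip(questions, findings))
    return lvlm.answer(image, f"{context}\nUsing the observations above, answer: {query}")
```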

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

  • paper_url: http://arxiv.org/abs/2312.01597
  • repo_url: None
  • paper_authors: Feng Wang, Jieru Mei, Alan Yuille
  • for: Unlocking CLIP's latent semantic segmentation ability so that it can deliver accurate pixel-level predictions in zero-shot settings
  • methods: Proposes a novel Correlative Self-Attention (CSA) mechanism that replaces the conventional self-attention block in the last layer of the CLIP vision encoder, adapting it to zero-shot semantic segmentation
  • results: With this training-free adaptation of CLIP, achieves a 38.2% average zero-shot mIoU across eight semantic segmentation benchmarks, significantly outperforming the existing SoTA (33.9%) and vanilla CLIP (14.1%)
    Abstract Recent advances in contrastive language-image pretraining (CLIP) have demonstrated strong capabilities in zero-shot classification by aligning visual representations with target text embeddings in an image level. However, in dense prediction tasks, CLIP often struggles to localize visual features within an image and fails to give accurate pixel-level predictions, which prevents it from functioning as a generalized visual foundation model. In this work, we aim to enhance CLIP's potential for semantic segmentation with minimal modifications to its pretrained models. By rethinking self-attention, we surprisingly find that CLIP can adapt to dense prediction tasks by simply introducing a novel Correlative Self-Attention (CSA) mechanism. Specifically, we replace the traditional self-attention block of CLIP vision encoder's last layer by our CSA module and reuse its pretrained projection matrices of query, key, and value, leading to a training-free adaptation approach for CLIP's zero-shot semantic segmentation. Extensive experiments show the advantage of CSA: we obtain a 38.2% average zero-shot mIoU across eight semantic segmentation benchmarks highlighted in this paper, significantly outperforming the existing SoTA's 33.9% and the vanilla CLIP's 14.1%.
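The sketch below shows one plausible PyTorch form of a correlative self-attention block, assuming the scores come from query-query and key-key correlation instead of query-key, with the pretrained q/k/v projections reused; the exact formulation in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelativeSelfAttention(nn.Module):
    """Minimal CSA-style block (assumed form): tokens attend to tokens with
    similar query/key projections, reusing a ViT attention layer's weights."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # In practice these would be loaded from CLIP's last attention layer.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        def split(t):  # (B, N, C) -> (B, heads, N, head_dim)
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Correlative scores: q-q and k-k similarity instead of q-k.
        attn = F.softmax(q @ q.transpose(-2, -1) * self.scale, dim=-1) \
             + F.softmax(k @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.out_proj(out)

if __name__ == "__main__":
    x = torch.randn(1, 196, 512)                   # 14x14 patch tokens
    print(CorrelativeSelfAttention(512)(x).shape)  # torch.Size([1, 196, 512])
```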

PixelLM: Pixel Reasoning with Large Multimodal Model

  • paper_url: http://arxiv.org/abs/2312.02228
  • repo_url: None
  • paper_authors: Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin
  • for: Presents an effective and efficient large multimodal model (LMM) for pixel-level reasoning over multiple open-world targets in image understanding tasks
  • methods: Uses a novel lightweight pixel decoder together with a comprehensive segmentation codebook to produce high-quality pixel-level masks, and introduces a target refinement loss to sharpen the model's ability to distinguish multiple targets
  • results: Outperforms well-established methods on multiple benchmarks, including MUSE and single- and multi-referring segmentation; comprehensive ablations confirm the efficacy of each proposed component
    Abstract While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM is a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore, we propose a target refinement loss to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
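To make the decoder idea concrete, here is an illustrative PyTorch sketch in which each segmentation codebook token embedding is projected and correlated with the visual feature map to yield one mask logit map per target. The module structure and dimensions are assumptions, not PixelLM's published architecture.

```python
import torch
import torch.nn as nn

class LightweightPixelDecoder(nn.Module):
    """Illustrative sketch: codebook token hidden states from the LMM are
    projected and correlated with image features to produce per-target masks."""

    def __init__(self, token_dim: int, feat_dim: int):
        super().__init__()
        self.token_proj = nn.Linear(token_dim, feat_dim)
        self.feat_conv = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(feat_dim, feat_dim, 1),
        )

    def forward(self, seg_tokens: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # seg_tokens: (B, T, token_dim) hidden states of codebook tokens
        # feat:       (B, feat_dim, H, W) visual features
        f = self.feat_conv(feat)                      # (B, C, H, W)
        q = self.token_proj(seg_tokens)               # (B, T, C)
        return torch.einsum("btc,bchw->bthw", q, f)   # (B, T, H, W) mask logits

if __name__ == "__main__":
    dec = LightweightPixelDecoder(token_dim=4096, feat_dim=256)
    tokens = torch.randn(1, 3, 4096)   # three target tokens
    feat = torch.randn(1, 256, 64, 64)
    print(dec(tokens, feat).shape)     # torch.Size([1, 3, 64, 64])
```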

Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

  • paper_url: http://arxiv.org/abs/2312.02226
  • repo_url: None
  • paper_authors: Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han, Mengmeng Wang, Guang Dai, Sizhe Dang, Jingdong Wang
  • for: Open-vocabulary video action recognition, i.e., recognizing previously unseen actions from arbitrary category sets
  • methods: Blends video models with large language models (LLMs) to devise Action-conditioned Prompts, and introduces a multi-modal action knowledge alignment mechanism to align concepts in video with the textual knowledge encapsulated in the prompts
  • results: Experiments show the method not only sets new SOTA performance on various video benchmarks but also offers excellent interpretability
    Abstract Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.
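A hedged sketch of the overall recipe: an LLM expands each action name into descriptive sentences, a text encoder embeds and averages them into a class prototype, and videos are classified by cosine similarity. Here `llm` and `encode_text` are placeholder callables, and the prompt wording is assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def action_prompt(action: str, n: int = 4) -> str:
    # Ask an LLM for sentences that spell out visually distinctive cues.
    return (
        f"Describe the action '{action}' in {n} short sentences, each naming "
        "visually distinctive cues (body movement, objects involved, scene)."
    )

@torch.no_grad()
def zero_shot_scores(video_emb, action_names, llm, encode_text):
    """llm(text) -> list[str] and encode_text(list[str]) -> (n, d) tensor are
    placeholders for the language model and the text encoder."""
    class_embs = []
    for name in action_names:
        sentences = llm(action_prompt(name))
        emb = F.normalize(encode_text(sentences), dim=-1).mean(0)  # prototype
        class_embs.append(F.normalize(emb, dim=-1))
    class_embs = torch.stack(class_embs)           # (num_actions, d)
    video_emb = F.normalize(video_emb, dim=-1)
    return video_emb @ class_embs.T                # cosine similarity scores
```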

Learning Efficient Unsupervised Satellite Image-based Building Damage Detection

  • paper_url: http://arxiv.org/abs/2312.01576
  • repo_url: https://github.com/fzmi/ubdd
  • paper_authors: Yiyun Zhang, Zijian Wang, Yadan Luo, Xin Yu, Zi Huang
  • for: Proposes Unsupervised Building Damage Detection (U-BDD), in which only unlabelled pre- and post-disaster satellite image pairs are available
  • methods: Builds a baseline on pretrained vision-language foundation models (Grounding DINO, SAM, and CLIP), and further proposes a novel self-supervised framework, U-BDD++, to address the domain-specific issues of satellite imagery
  • results: Experiments show the method performs unsupervised building damage detection effectively while removing the need for manual labelling
    Abstract Existing Building Damage Detection (BDD) methods always require labour-intensive pixel-level annotations of buildings and their conditions, hence largely limiting their applications. In this paper, we investigate a challenging yet practical scenario of BDD, Unsupervised Building Damage Detection (U-BDD), where only unlabelled pre- and post-disaster satellite image pairs are provided. As a pilot study, we have first proposed an advanced U-BDD baseline that leverages pre-trained vision-language foundation models (i.e., Grounding DINO, SAM and CLIP) to address the U-BDD task. However, the apparent domain gap between satellite and generic images causes low confidence in the foundation models used to identify buildings and their damages. In response, we further present a novel self-supervised framework, U-BDD++, which improves upon the U-BDD baseline by addressing domain-specific issues associated with satellite imagery. Furthermore, the new Building Proposal Generation (BPG) module and the CLIP-enabled noisy Building Proposal Selection (CLIP-BPS) module in U-BDD++ ensure high-quality self-training. Extensive experiments on the widely used building damage assessment benchmark demonstrate the effectiveness of the proposed method for unsupervised building damage detection. The presented annotation-free and foundation model-based paradigm ensures an efficient learning phase. This study opens a new direction for real-world BDD and sets a strong baseline for future research.
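The following is a schematic sketch of such a foundation-model baseline, with `detect_buildings`, `segment`, and `clip_score` as placeholder callables standing in for Grounding DINO, SAM, and CLIP. The text prompts, threshold, and damage decision rule are illustrative assumptions rather than the published U-BDD or U-BDD++ pipeline.

```python
def u_bdd_baseline(pre_img, post_img, detect_buildings, segment, clip_score):
    """Sketch with placeholder callables:
    detect_buildings(img, prompt) -> list of (x0, y0, x1, y1) boxes,
    segment(img, box) -> binary mask,
    clip_score(crop, texts) -> list of class probabilities.
    pre_img / post_img are co-registered (H, W, 3) arrays."""
    results = []
    for box in detect_buildings(pre_img, prompt="building"):
        mask = segment(pre_img, box)
        x0, y0, x1, y1 = box
        pre_crop = pre_img[y0:y1, x0:x1]
        post_crop = post_img[y0:y1, x0:x1]
        texts = ["an intact building", "a damaged or destroyed building"]
        p_pre = clip_score(pre_crop, texts)
        p_post = clip_score(post_crop, texts)
        # Flag a building when the "damaged" probability rises after the event.
        damaged = (p_post[1] - p_pre[1]) > 0.2   # threshold is illustrative
        results.append({"box": box, "mask": mask, "damaged": bool(damaged)})
    return results
```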

Survey on deep learning in multimodal medical imaging for cancer detection

  • paper_url: http://arxiv.org/abs/2312.01573
  • repo_url: None
  • paper_authors: Yan Tian, Zhaocheng Xu, Yujun Ma, Weiping Ding, Ruili Wang, Zhihong Gao, Guohua Cheng, Linyang He, Xuran Zhao
  • for: The paper is written for researchers and practitioners working in the field of cancer detection and diagnosis, specifically those interested in multimodal cancer detection using deep learning.
  • methods: The paper focuses on the use of deep learning-based object detection for multimodal cancer detection, specifically investigating over 150 papers in recent years. It discusses various challenges such as data annotation, variance between classes, small-scale lesions, and occlusion, and provides an overview of the advantages and drawbacks of each approach.
  • results: The paper provides an overview of the current state of the art in multimodal cancer detection using deep learning, including the datasets and the solutions proposed to address the challenges in this field, and it discusses the current scope of work and directions for future development.
    Abstract The task of multimodal cancer detection is to determine the locations and categories of lesions by using different imaging techniques, which is one of the key research methods for cancer diagnosis. Recently, deep learning-based object detection has made significant developments due to its strength in semantic feature extraction and nonlinear function fitting. However, multimodal cancer detection remains challenging due to morphological differences in lesions, interpatient variability, difficulty in annotation, and imaging artifacts. In this survey, we mainly investigate over 150 papers in recent years with respect to multimodal cancer detection using deep learning, with a focus on datasets and solutions to various challenges such as data annotation, variance between classes, small-scale lesions, and occlusion. We also provide an overview of the advantages and drawbacks of each approach. Finally, we discuss the current scope of work and provide directions for the future development of multimodal cancer detection.

Multi-View Person Matching and 3D Pose Estimation with Arbitrary Uncalibrated Camera Networks

  • paper_url: http://arxiv.org/abs/2312.01561
  • repo_url: None
  • paper_authors: Yan Xu, Kris Kitani
  • for: Solving cross-view person matching and 3D human pose estimation in multi-camera networks whose cameras are extrinsically uncalibrated
  • methods: Proposes PME, which requires neither camera poses nor 3D annotations: cross-view person matching is cast as a clustering problem with each person as a cluster center, correspondences are read off the person matches, and 3D human poses are estimated via multi-view triangulation and bundle adjustment
  • results: Extensive evaluation on three open datasets and two indoor and outdoor datasets captured with arbitrarily placed cameras shows the method outperforms alternatives by a large margin on cross-view person matching and reaches SOTA 3D human pose estimation performance without camera poses or 3D training data
    Abstract Cross-view person matching and 3D human pose estimation in multi-camera networks are particularly difficult when the cameras are extrinsically uncalibrated. Existing efforts generally require large amounts of 3D data for training neural networks or known camera poses for geometric constraints to solve the problem. However, camera poses and 3D data annotation are usually expensive and not always available. We present a method, PME, that solves the two tasks without requiring either information. Our idea is to address cross-view person matching as a clustering problem using each person as a cluster center, then obtain correspondences from person matches, and estimate 3D human poses through multi-view triangulation and bundle adjustment. We solve the clustering problem by introducing a "size constraint" using the number of cameras and a "source constraint" using the fact that two people from the same camera view should not match, to narrow the solution space to a small feasible region. The 2D human poses used in clustering are obtained through a pre-trained 2D pose detector, so our method does not require expensive 3D training data for each new scene. We extensively evaluate our method on three open datasets and two indoor and outdoor datasets collected using arbitrarily set cameras. Our method outperforms other methods by a large margin on cross-view person matching, reaches SOTA performance on 3D human pose estimation without using either camera poses or 3D training data, and shows good generalization ability across five datasets of various environment settings.
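To illustrate the size and source constraints, here is a greedy NumPy sketch of constrained agglomerative clustering over per-view person embeddings: merges are forbidden whenever two detections share a camera, which also caps cluster size at the number of cameras. The similarity measure, threshold, and greedy merge order are assumptions, not the paper's optimization.

```python
import numpy as np

def constrained_person_clustering(features, cam_ids, sim_thresh=0.5):
    """Greedy sketch of cross-view matching as constrained clustering:
    a cluster may hold at most one detection per camera ('source constraint'),
    hence at most num_cameras members ('size constraint').
    features: (N, D) appearance embeddings; cam_ids: length-N camera indices."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T
    n = len(cam_ids)
    # Candidate pairs sorted by descending similarity.
    order = np.dstack(np.unravel_index(np.argsort(-sim, axis=None), sim.shape))[0]
    cluster_of = list(range(n))               # start: every detection alone
    members = {i: {i} for i in range(n)}
    for i, j in order:
        if i >= j or sim[i, j] < sim_thresh:
            continue
        ci, cj = cluster_of[i], cluster_of[j]
        if ci == cj:
            continue
        cams_i = {cam_ids[m] for m in members[ci]}
        cams_j = {cam_ids[m] for m in members[cj]}
        if cams_i & cams_j:                   # source constraint violated
            continue
        for m in members[cj]:                 # merge cj into ci
            cluster_of[m] = ci
        members[ci] |= members.pop(cj)
    return cluster_of

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 8))
    cams = [0, 0, 1, 1, 2, 2]
    print(constrained_person_clustering(feats, cams))
```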

Hyperspectral Image Compression Using Sampling and Implicit Neural Representations

  • paper_url: http://arxiv.org/abs/2312.01558
  • repo_url: None
  • paper_authors: Shima Rezasoltani, Faisal Z. Qureshi
  • for: Proposes a hyperspectral image compression method based on implicit neural representations, aiming for better rate-distortion performance and faster compression
  • methods: A multilayer perceptron (MLP) with sinusoidal activations learns to map each pixel location to its spectral intensities, so the trained network weights serve as a compressed encoding of the image; a sampling scheme controlled by window size and sampling rate reduces compression time
  • results: Evaluated with PSNR and SSIM on four benchmarks (Indian Pines, Jasper Ridge, Pavia University, and Cuprite), the method achieves better compression than JPEG, JPEG2000, and PCA-DCT at low bitrates, and the sampled variant improves both speed and performance over the unsampled one
    Abstract Hyperspectral images, which record the electromagnetic spectrum for a pixel in the image of a scene, often store hundreds of channels per pixel and contain an order of magnitude more information than a similarly-sized RBG color image. Consequently, concomitant with the decreasing cost of capturing these images, there is a need to develop efficient techniques for storing, transmitting, and analyzing hyperspectral images. This paper develops a method for hyperspectral image compression using implicit neural representations where a multilayer perceptron network F with sinusoidal activation functions "learns" to map pixel locations to pixel intensities for a given hyperspectral image I. F thus acts as a compressed encoding of this image, and the original image is reconstructed by evaluating F at each pixel location. We use a sampling method with two factors: window size and sampling rate to reduce the compression time. We have evaluated our method on four benchmarks -- Indian Pines, Jasper Ridge, Pavia University, and Cuprite using PSNR and SSIM -- and we show that the proposed method achieves better compression than JPEG, JPEG2000, and PCA-DCT at low bitrates. Besides, we compare our results with the learning-based methods like PCA+JPEG2000, FPCA+JPEG2000, 3D DCT, 3D DWT+SVR, and WSRC and show the corresponding results in the "Compression Results" section. We also show that our methods with sampling achieve better speed and performance than our method without sampling.
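The sketch below shows the general idea of such an implicit representation in PyTorch: a small MLP with sinusoidal activations is fit to map normalized (x, y) coordinates to the C-channel spectrum, and each step regresses only a random sample of coordinates. The layer sizes, frequency factor, and sampling scheme are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    # Linear layer followed by a sinusoidal activation (SIREN-style).
    def __init__(self, in_f, out_f, w0=30.0):
        super().__init__()
        self.linear, self.w0 = nn.Linear(in_f, out_f), w0
    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class SpectralINR(nn.Module):
    """Maps (x, y) pixel coordinates to a C-channel spectrum; the trained
    weights themselves act as the compressed code of the image."""
    def __init__(self, channels, hidden=64, layers=3):
        super().__init__()
        net = [SineLayer(2, hidden)] + [SineLayer(hidden, hidden) for _ in range(layers - 1)]
        net += [nn.Linear(hidden, channels)]
        self.net = nn.Sequential(*net)
    def forward(self, coords):
        return self.net(coords)

def fit(image, steps=2000, lr=1e-4, sample_frac=0.25):
    # image: (H, W, C) tensor in [0, 1]; each step fits a random coordinate sample.
    H, W, C = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], -1).reshape(-1, 2)
    target = image.reshape(-1, C)
    model = SpectralINR(C)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n_sample = int(sample_frac * coords.shape[0])
    for _ in range(steps):
        idx = torch.randint(0, coords.shape[0], (n_sample,))
        loss = nn.functional.mse_loss(model(coords[idx]), target[idx])
        opt.zero_grad(); loss.backward(); opt.step()
    return model  # evaluating model(coords) reconstructs the image
```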

Digital Histopathology with Graph Neural Networks: Concepts and Explanations for Clinicians

  • paper_url: http://arxiv.org/abs/2312.02225
  • repo_url: None
  • paper_authors: Alessandro Farace di Villaforesta, Lucie Charlotte Magister, Pietro Barbiero, Pietro Liò
  • for: Providing explainable deep learning models for clinical settings
  • methods: Combines GCExplainer, an automated concept discovery solution, with Logic Explained Networks to provide global explanations for graph neural networks
  • results: Using HoVer-Net for panoptic segmentation and Graph Convolution Networks for cancer prediction, trained on H&E slides of breast cancer, the approach shows promising results toward explainable and trustworthy AI tools for clinicians
    Abstract To address the challenge of the ``black-box" nature of deep learning in medical settings, we combine GCExplainer - an automated concept discovery solution - along with Logic Explained Networks to provide global explanations for Graph Neural Networks. We demonstrate this using a generally applicable graph construction and classification pipeline, involving panoptic segmentation with HoVer-Net and cancer prediction with Graph Convolution Networks. By training on H&E slides of breast cancer, we show promising results in offering explainable and trustworthy AI tools for clinicians.
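For readers unfamiliar with the cell-graph setup, the sketch below (assuming PyTorch Geometric) shows a generic two-layer GCN over a nuclei graph whose pooled embedding drives a slide-level prediction. The feature dimensions and architecture are assumptions, not the paper's exact pipeline; the GCExplainer concept-discovery step would operate on the node embeddings produced by such a network.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class CellGraphGCN(torch.nn.Module):
    """Graph classifier over a cell graph: nodes are nuclei (e.g. from
    HoVer-Net) with morphology/appearance features, edges connect nearby
    nuclei; the pooled embedding predicts the slide-level label."""

    def __init__(self, in_dim, hidden=64, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))   # per-node embeddings for concept mining
        g = global_mean_pool(x, batch)          # graph-level embedding
        return self.head(g)                     # cancer / non-cancer logits
```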