cs.CV - 2023-11-30

PyNeRF: Pyramidal Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2312.00252
  • repo_url: https://github.com/hturki/pynerf
  • paper_authors: Haithem Turki, Michael Zollhöfer, Christian Richardt, Deva Ramanan
  • for: Improving the rendering speed and quality of NeRFs.
  • methods: Trains model heads at different spatial grid resolutions; at render time, coarser grids are used to render samples that cover larger volumes.
  • results: Reduces error rates (e.g., by 20% compared to Mip-NeRF) with minimal performance overhead and faster training, since each model head is quick to evaluate.
    Abstract Neural Radiance Fields (NeRFs) can be dramatically accelerated by spatial grid representations. However, they do not explicitly reason about scale and so introduce aliasing artifacts when reconstructing scenes captured at different camera distances. Mip-NeRF and its extensions propose scale-aware renderers that project volumetric frustums rather than point samples but such approaches rely on positional encodings that are not readily compatible with grid methods. We propose a simple modification to grid-based models by training model heads at different spatial grid resolutions. At render time, we simply use coarser grids to render samples that cover larger volumes. Our method can be easily applied to existing accelerated NeRF methods and significantly improves rendering quality (reducing error rates by 20-90% across synthetic and unbounded real-world scenes) while incurring minimal performance overhead (as each model head is quick to evaluate). Compared to Mip-NeRF, we reduce error rates by 20% while training over 60x faster.
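As a rough illustration of the pyramid-of-heads idea, the sketch below routes each point sample to a resolution level based on its footprint (pixel radius scaled by distance). It is a toy stand-in rather than the released PyNeRF code: the real model builds on accelerated grid/hash encodings, and all names (`PyramidalField`, `level_for`) and sizes here are our own.

```python
import torch
import torch.nn as nn

class PyramidalField(nn.Module):
    """Toy pyramid of per-resolution heads; the real model uses grid/hash encodings."""
    def __init__(self, num_levels=4, base_resolution=16, hidden=64):
        super().__init__()
        self.resolutions = [base_resolution * (2 ** i) for i in range(num_levels)]
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 4))
            for _ in self.resolutions
        )

    def level_for(self, sample_radius):
        # Pick the finest level whose grid-cell size still exceeds the sample's footprint,
        # so samples covering large volumes are routed to coarser heads.
        cell_sizes = torch.tensor([1.0 / r for r in self.resolutions])  # coarse -> fine
        level = (cell_sizes > sample_radius.unsqueeze(-1)).sum(dim=-1) - 1
        return level.clamp(0, len(self.resolutions) - 1)

    def forward(self, xyz, pixel_radius, distance):
        sample_radius = pixel_radius * distance        # cone footprint grows with distance
        levels = self.level_for(sample_radius)
        out = torch.zeros(xyz.shape[0], 4)             # density + RGB per sample
        for lvl, head in enumerate(self.heads):
            mask = levels == lvl
            if mask.any():
                out[mask] = head(xyz[mask])
        return out

field = PyramidalField()
xyz = torch.rand(1024, 3)
outputs = field(xyz, pixel_radius=torch.tensor(2e-3), distance=torch.rand(1024) * 10)
print(outputs.shape)  # torch.Size([1024, 4])
```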

  • paper_url: http://arxiv.org/abs/2312.00250
  • repo_url: None
  • paper_authors: Zhuoran Zheng, Boxue Xiao
  • for: Improving the visual quality of Ultra-High-Definition (UHD) images.
  • methods: Surveys existing UHD image enhancement techniques and their application fields, covering degradations caused by environmental noise and equipment jitter and the algorithms proposed to address them.
  • results: Summarizes the current state of UHD image enhancement from the application and technology perspectives, and briefly discusses its trends.
    Abstract Currently, to further improve visual enjoyment, Ultra-High-Definition (UHD) images are catching wide attention. Here, UHD images are usually referred to as having a resolution greater than or equal to $3840 \times 2160$. However, since the imaging equipment is subject to environmental noise or equipment jitter, UHD images are prone to contrast degradation, blurring, low dynamic range, etc. To address these issues, a large number of algorithms for UHD image enhancement have been proposed. In this paper, we introduce the current state of UHD image enhancement from two perspectives, one is the application field and the other is the technology. In addition, we briefly explore its trends.

AV-RIR: Audio-Visual Room Impulse Response Estimation

  • paper_url: http://arxiv.org/abs/2312.00834
  • repo_url: None
  • paper_authors: Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha
  • for: Accurately estimating the Room Impulse Response (RIR), which captures an environment's acoustic properties, for speech processing and AR/VR applications.
  • methods: AV-RIR, a novel multi-modal multi-task learning approach that estimates the RIR from a reverberant speech signal and visual cues of the corresponding environment. It builds on a neural-codec-based architecture that captures environment geometry and material properties, and solves speech dereverberation as an auxiliary task via multi-task learning.
  • results: AV-RIR improves RIR estimation by 36%-63% across acoustic metrics compared to previous audio-only and visual-only approaches, and achieves higher preference scores in human evaluation. As a side benefit, its dereverberated speech is competitive with the state of the art on various spoken language processing tasks.
    Abstract Accurate estimation of Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and materials properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features that augment material information into visual cues and CRIP that improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. Additionally, it also achieves higher preference scores in human evaluation. As an auxiliary benefit, dereverbed speech from AV-RIR shows competitive performance with the state-of-the-art in various spoken language processing tasks and outperforms reverberation time error score in the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech can be found at https://www.youtube.com/watch?v=tTsKhviukAE.

Lasagna: Layered Score Distillation for Disentangled Object Relighting

  • paper_url: http://arxiv.org/abs/2312.00833
  • repo_url: https://github.com/dbash/lasagna
  • paper_authors: Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko
  • for: Providing intuitive text-guided relighting control so that artists, photographers, and other visual content creators can more easily control lighting in their images.
  • methods: Learns a lighting prior via score distillation sampling from a diffusion model finetuned on synthetic relighting data; a new synthetic dataset, ReLiT, is curated to train Lasagna.
  • results: Lasagna relights real-world images while preserving other aspects of the input and outperforms state-of-the-art text-guided image editing methods; it is preferred by humans in over 91% of cases on natural images and digital art. The learning objective also extends to colorization.
    Abstract Professional artists, photographers, and other visual content creators use object relighting to establish their photo's desired effect. Unfortunately, manual tools that allow relighting have a steep learning curve and are difficult to master. Although generative editing methods now enable some forms of image editing, relighting is still beyond today's capabilities; existing methods struggle to keep other aspects of the image -- colors, shapes, and textures -- consistent after the edit. We propose Lasagna, a method that enables intuitive text-guided relighting control. Lasagna learns a lighting prior by using score distillation sampling to distill the prior of a diffusion model, which has been finetuned on synthetic relighting data. To train Lasagna, we curate a new synthetic dataset ReLiT, which contains 3D object assets re-lit from multiple light source locations. Despite training on synthetic images, quantitative results show that Lasagna relights real-world images while preserving other aspects of the input image, outperforming state-of-the-art text-guided image editing methods. Lasagna enables realistic and controlled results on natural images and digital art pieces and is preferred by humans over other methods in over 91% of cases. Finally, we demonstrate the versatility of our learning objective by extending it to allow colorization, another form of image editing.
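The core training signal described above is score distillation sampling (SDS) against a relighting-finetuned diffusion prior. The sketch below shows a bare-bones SDS step under that reading; the layered composition, the ReLiT-finetuned model, and the loss weighting are omitted, and `toy_eps_model` is a placeholder, so treat this as a conceptual sketch rather than the authors' implementation.

```python
import torch

def sds_grad(eps_model, latent, text_emb, alphas_cumprod):
    """One score-distillation-sampling step against a diffusion prior.
    eps_model(z_t, t, cond) -> predicted noise; here it stands in for the
    relighting-finetuned diffusion model described in the paper."""
    t = torch.randint(20, 980, (1,))
    a = alphas_cumprod[t]
    noise = torch.randn_like(latent)
    z_t = a.sqrt() * latent + (1 - a).sqrt() * noise       # forward-diffuse the edit
    with torch.no_grad():
        eps_pred = eps_model(z_t, t, text_emb)
    return eps_pred - noise    # SDS gradient (timestep weighting omitted for brevity)

# Usage sketch: optimise a layered edit (e.g. a lighting layer) under a text prompt.
edit_layer = torch.zeros(1, 4, 64, 64)
alphas = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
toy_eps_model = lambda z, t, c: torch.randn_like(z)         # placeholder network
for _ in range(10):
    edit_layer -= 0.1 * sds_grad(toy_eps_model, edit_layer, None, alphas)
```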

Brainformer: Modeling MRI Brain Functions to Machine Vision

  • paper_url: http://arxiv.org/abs/2312.00236
  • repo_url: None
  • paper_authors: Xuan-Bac Nguyen, Xin Li, Samee U. Khan, Khoa Luu
  • for: Exploring how the human visual system processes information in order to bridge the gap between human visual perception and computer vision models.
  • methods: Brainformer, a Transformer-based framework that analyzes patterns of fMRI brain activity, together with a mechanism that uses fMRI signals, as a representation of human brain activity, to supervise machine vision models.
  • results: Experiments show that leveraging fMRI information lets the machine vision model reach performance comparable to current state-of-the-art methods on various image recognition tasks.
    Abstract "Perception is reality". Human perception plays a vital role in forming beliefs and understanding reality. Exploring how the human brain works in the visual system facilitates bridging the gap between human visual perception and computer vision models. However, neuroscientists study the brain via Neuroimaging, i.e., Functional Magnetic Resonance Imaging (fMRI), to discover the brain's functions. These approaches face interpretation challenges where fMRI data can be complex and require expertise. Therefore, neuroscientists make inferences about cognitive processes based on patterns of brain activities, which can lead to potential misinterpretation or limited functional understanding. In this work, we first present a simple yet effective Brainformer approach, a novel Transformer-based framework, to analyze the patterns of fMRI in the human perception system from the machine learning perspective. Secondly, we introduce a novel mechanism incorporating fMRI, which represents the human brain activities, as the supervision for the machine vision model. This work also introduces a novel perspective on transferring knowledge from human perception to neural networks. Through our experiments, we demonstrated that by leveraging fMRI information, the machine vision model can achieve potential results compared to the current State-of-the-art methods in various image recognition tasks.

Convolutional Neural Networks for Segmentation of Malignant Pleural Mesothelioma: Analysis of Probability Map Thresholds (CALGB 30901, Alliance)

  • paper_url: http://arxiv.org/abs/2312.00223
  • repo_url: None
  • paper_authors: Mena Shenouda, Eyjólfur Gudmundsson, Feng Li, Christopher M. Straus, Hedy L. Kindler, Arkadiusz Z. Dudek, Thomas Stinchcombe, Xiaofei Wang, Adam Starkey, Samuel G. Armato III
  • for: Evaluating the impact of probability map thresholds on deep learning-based automated segmentation of malignant pleural mesothelioma (MPM) tumor.
  • methods: A VGG16/U-Net convolutional neural network (CNN) produced automated segmentations; a radiologist modified the contours generated at a 0.5 probability threshold to serve as the reference.
  • results: Reducing the probability threshold from 0.5 to 0.1 decreased the average absolute percent volume difference, but no single output threshold was optimal for both tumor volume and DSC.
    Abstract Malignant pleural mesothelioma (MPM) is the most common form of mesothelioma. To assess response to treatment, tumor measurements are acquired and evaluated based on a patient's longitudinal computed tomography (CT) scans. Tumor volume, however, is the more accurate metric for assessing tumor burden and response. Automated segmentation methods using deep learning can be employed to acquire volume, which otherwise is a tedious task performed manually. The deep learning-based tumor volume and contours can then be compared with a standard reference to assess the robustness of the automated segmentations. The purpose of this study was to evaluate the impact of probability map threshold on MPM tumor delineations generated using a convolutional neural network (CNN). Eighty-eight CT scans from 21 MPM patients were segmented by a VGG16/U-Net CNN. A radiologist modified the contours generated at a 0.5 probability threshold. Percent difference of tumor volume and overlap using the Dice Similarity Coefficient (DSC) were compared between the standard reference provided by the radiologist and CNN outputs for thresholds ranging from 0.001 to 0.9. CNN annotations consistently yielded smaller tumor volumes than radiologist contours. Reducing the probability threshold from 0.5 to 0.1 decreased the absolute percent volume difference, on average, from 43.96% to 24.18%. Median and mean DSC ranged from 0.58 to 0.60, with a peak at a threshold of 0.5; no distinct threshold was found for percent volume difference. No single output threshold in the CNN probability maps was optimal for both tumor volume and DSC. This work underscores the need to assess tumor volume and spatial overlap when evaluating CNN performance. While automated segmentations may yield comparable tumor volumes to that of the reference standard, the spatial region delineated by the CNN at a specific threshold is equally important.
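The threshold analysis above is essentially a sweep over the CNN probability map, scoring each binarization against the radiologist reference with the Dice Similarity Coefficient and percent volume difference. A minimal sketch follows; the arrays are random stand-ins for a real probability map and reference mask.

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

def percent_volume_difference(pred, ref):
    """Absolute percent difference in segmented volume (voxel counts)."""
    return 100.0 * abs(int(pred.sum()) - int(ref.sum())) / (ref.sum() + 1e-8)

# prob_map: CNN output in [0, 1]; ref_mask: radiologist-modified reference contours.
prob_map = np.random.rand(64, 64, 64)        # random stand-in for one case
ref_mask = prob_map > 0.45                   # random stand-in reference

for threshold in [0.001, 0.1, 0.3, 0.5, 0.7, 0.9]:
    pred_mask = prob_map >= threshold
    print(f"t={threshold:<6} DSC={dice(pred_mask, ref_mask):.3f}  "
          f"|dV|={percent_volume_difference(pred_mask, ref_mask):.1f}%")
```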

SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.00206
  • repo_url: None
  • paper_authors: Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, Achuta Kadambi
  • for: Synthesizing high-quality novel views from sparse training views, especially for 360-degree scenes.
  • methods: Integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints.
  • results: Outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.
    Abstract The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few shot settings, similar to NeRF, 3DGS tends to overfit to training views, causing background collapse and excessive floaters, especially as the number of training views are reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360 scenes from sparse training views. We find that using naive depth priors is not sufficient and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.
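One simple way to inject a monocular depth prior into the optimization, in the spirit of the depth constraints described above, is a scale/shift-invariant correlation loss between rendered and prior depth. The paper's full recipe additionally uses generative and explicit floater-pruning constraints, so the sketch below covers only the depth-prior ingredient, and the names are ours.

```python
import torch

def depth_correlation_loss(rendered_depth, prior_depth):
    """Scale/shift-invariant agreement between rendered depth and a monocular
    depth prior: 0 when the two depth maps are perfectly correlated."""
    r = rendered_depth.flatten()
    p = prior_depth.flatten()
    r = (r - r.mean()) / (r.std() + 1e-8)
    p = (p - p.mean()) / (p.std() + 1e-8)
    return 1.0 - (r * p).mean()

rendered = torch.rand(1, 128, 128, requires_grad=True)  # depth rendered from the 3DGS model
prior = torch.rand(1, 128, 128)                         # e.g. from an off-the-shelf depth network
loss = depth_correlation_loss(rendered, prior)
loss.backward()
print(rendered.grad.shape)
```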

DNS SLAM: Dense Neural Semantic-Informed SLAM

  • paper_url: http://arxiv.org/abs/2312.00204
  • repo_url: None
  • paper_authors: Kunyi Li, Michael Niemeyer, Nassir Navab, Federico Tombari
  • for: This paper is written for the task of Simultaneous Localization and Mapping (SLAM) in real-world environments, specifically addressing the challenge of oversmoothed reconstructions in neural implicit representations.
  • methods: The paper proposes a novel neural RGB-D semantic SLAM approach featuring a hybrid representation, which integrates multi-view geometry constraints with image-based feature extraction to improve appearance details and output color, density, and semantic class information. The method also introduces a lightweight coarse scene representation trained in a self-supervised manner in latent space for real-time tracking.
  • results: The paper achieves state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. The method outputs class-wise decomposed reconstructions with better texture capturing appearance and geometric details.
    Abstract In recent years, coordinate-based neural implicit representations have shown promising results for the task of Simultaneous Localization and Mapping (SLAM). While achieving impressive performance on small synthetic scenes, these methods often suffer from oversmoothed reconstructions, especially for complex real-world scenes. In this work, we introduce DNS SLAM, a novel neural RGB-D semantic SLAM approach featuring a hybrid representation. Relying only on 2D semantic priors, we propose the first semantic neural SLAM method that trains class-wise scene representations while providing stable camera tracking at the same time. Our method integrates multi-view geometry constraints with image-based feature extraction to improve appearance details and to output color, density, and semantic class information, enabling many downstream applications. To further enable real-time tracking, we introduce a lightweight coarse scene representation which is trained in a self-supervised manner in latent space. Our experimental results achieve state-of-the-art performance on both synthetic data and real-world data tracking while maintaining a commendable operational speed on off-the-shelf hardware. Further, our method outputs class-wise decomposed reconstructions with better texture capturing appearance and geometric details.

Raising the Bar of AI-generated Image Detection with CLIP

  • paper_url: http://arxiv.org/abs/2312.00195
  • repo_url: None
  • paper_authors: Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, Luisa Verdoliva
  • for: Exploring the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
  • methods: The authors develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
  • results: The CLIP-based detector exhibits a surprising generalization ability and high robustness across several different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. It matches the state of the art on in-distribution data and improves significantly on out-of-distribution data (+6% AUC) and on impaired/laundered data (+13%).
    Abstract Aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images. We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios. We find that, unlike previous belief, it is neither necessary nor convenient to use a large domain-specific dataset for training. On the contrary, by using only a handful of example images from a single generative model, a CLIP-based detector exhibits a surprising generalization ability and high robustness across several different architectures, including recent commercial tools such as Dalle-3, Midjourney v5, and Firefly. We match the SoTA on in-distribution data, and improve largely above it in terms of generalization to out-of-distribution data (+6% in terms of AUC) and robustness to impaired/laundered data (+13%). Our project is available at https://grip-unina.github.io/ClipBased-SyntheticImageDetection/
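In practice the described strategy amounts to frozen CLIP features plus a very light classifier fit on a handful of examples from a single generator. A hedged sketch using `open_clip` and scikit-learn follows; the backbone choice, the logistic-regression classifier, and the file names are our assumptions, not the authors' released pipeline.

```python
import torch
import open_clip                                   # pip install open_clip_torch
from PIL import Image
from sklearn.linear_model import LogisticRegression

# Assumption: a ViT-L/14 CLIP backbone; the paper's exact backbone and classifier may differ.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model.eval()

@torch.no_grad()
def clip_features(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# Only a handful of examples from ONE generator are claimed to be needed.
real_paths = ["real_0.png", "real_1.png"]          # placeholder file names
fake_paths = ["fake_0.png", "fake_1.png"]
X = clip_features(real_paths + fake_paths)
y = [0] * len(real_paths) + [1] * len(fake_paths)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(clip_features(["suspect.png"]))[:, 1])   # P(AI-generated)
```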

Benchmarking and Enhancing Disentanglement in Concept-Residual Models

  • paper_url: http://arxiv.org/abs/2312.00192
  • repo_url: None
  • paper_authors: Renos Zabounidis, Ini Oguntola, Konghao Zhao, Joseph Campbell, Simon Stepputtis, Katia Sycara
  • for: Balancing model interpretability and performance when the engineered concept set is incomplete.
  • methods: Proposes three novel approaches that disentangle concepts from the residual to mitigate information leakage, and investigates the balance between interpretability and task performance.
  • results: Extensive experiments on the CUB, OAI, and CIFAR 100 datasets assess each disentanglement method, provide insights into when each works best, and show how each affects the ability to intervene on concepts and the resulting task performance.
    Abstract Concept bottleneck models (CBMs) are interpretable models that first predict a set of semantically meaningful features, i.e., concepts, from observations that are subsequently used to condition a downstream task. However, the model's performance strongly depends on the engineered features and can severely suffer from incomplete sets of concepts. Prior works have proposed a side channel -- a residual -- that allows for unconstrained information flow to the downstream task, thus improving model performance but simultaneously introducing information leakage, which is undesirable for interpretability. This work proposes three novel approaches to mitigate information leakage by disentangling concepts and residuals, investigating the critical balance between model performance and interpretability. Through extensive empirical analysis on the CUB, OAI, and CIFAR 100 datasets, we assess the performance of each disentanglement method and provide insights into when they work best. Further, we show how each method impacts the ability to intervene over the concepts and their subsequent impact on task performance.
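Below is a minimal sketch of a concept-residual model with one possible disentanglement penalty: a cross-covariance term that discourages the residual from re-encoding concept information. The paper compares several disentanglement strategies, so this is an illustrative baseline rather than any of its specific methods; all layer sizes are made up.

```python
import torch
import torch.nn as nn

class ConceptResidualModel(nn.Module):
    """Concept bottleneck plus an unconstrained residual side channel."""
    def __init__(self, feat_dim=512, num_concepts=32, residual_dim=16, num_classes=200):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, num_concepts)    # supervised by concept labels
        self.residual_head = nn.Linear(feat_dim, residual_dim)   # free side channel
        self.task_head = nn.Linear(num_concepts + residual_dim, num_classes)

    def forward(self, features):
        c = torch.sigmoid(self.concept_head(features))
        r = self.residual_head(features)
        return self.task_head(torch.cat([c, r], dim=-1)), c, r

def decorrelation_penalty(c, r):
    # Penalise cross-covariance between concepts and residual to limit leakage.
    c = c - c.mean(dim=0)
    r = r - r.mean(dim=0)
    cross_cov = (c.T @ r) / c.shape[0]
    return cross_cov.pow(2).mean()

model = ConceptResidualModel()
logits, c, r = model(torch.rand(64, 512))
loss = (nn.functional.cross_entropy(logits, torch.randint(0, 200, (64,)))
        + nn.functional.binary_cross_entropy(c, torch.randint(0, 2, (64, 32)).float())
        + 0.1 * decorrelation_penalty(c, r))
loss.backward()
```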

Galaxy Classification: A machine learning approach for classifying shapes using numerical data

  • paper_url: http://arxiv.org/abs/2312.00184
  • repo_url: None
  • paper_authors: Anusha Guruprasad
  • for: Classifying galaxies with a machine learning model in order to improve our understanding of galaxy formation and evolution.
  • methods: A convolutional neural network architecture extracts features from galaxy images and classifies galaxies as spirals or ellipticals.
  • results: The model achieves high accuracy on a subset of the Galaxy Zoo data when compared against human classifications, suggesting it can help advance our understanding of galaxy formation and evolution.
    Abstract The classification of galaxies as spirals or ellipticals is a crucial task in understanding their formation and evolution. With the arrival of large-scale astronomical surveys, such as the Sloan Digital Sky Survey (SDSS), astronomers now have access to images of a vast number of galaxies. However, the visual inspection of these images is an impossible task for humans due to the sheer number of galaxies to be analyzed. To solve this problem, the Galaxy Zoo project was created to engage thousands of citizen scientists to classify the galaxies based on their visual features. In this paper, we present a machine learning model for galaxy classification using numerical data from the Galaxy Zoo[5] project. Our model utilizes a convolutional neural network architecture to extract features from galaxy images and classify them into spirals or ellipticals. We demonstrate the effectiveness of our model by comparing its performance with that of human classifiers using a subset of the Galaxy Zoo dataset. Our results show that our model achieves high accuracy in classifying galaxies and has the potential to significantly enhance our understanding of the formation and evolution of galaxies.
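A minimal PyTorch sketch of the kind of convolutional classifier described: image in, two-way spiral/elliptical decision out. Layer sizes, the 64x64 input, and the training snippet are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GalaxyCNN(nn.Module):
    """Small CNN for binary galaxy morphology classification."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 2)   # 0 = spiral, 1 = elliptical

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = GalaxyCNN()
images = torch.rand(8, 3, 64, 64)            # stand-in for Galaxy Zoo cutouts
labels = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
```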

Fool the Hydra: Adversarial Attacks against Multi-view Object Detection Systems

  • paper_url: http://arxiv.org/abs/2312.00173
  • repo_url: None
  • paper_authors: Bilel Tarchoun, Quazi Mishkatul Alam, Nael Abu-Ghazaleh, Ihsen Alouani
  • for: Investigating the robustness of multiview object detection systems against adversarial patches in real-world scenarios.
  • methods: The authors use a multiview object detection framework and conduct experiments on the Wildtrack benchmark with off-the-shelf adversarial patches. They also propose two new attacks to challenge the robustness of the system.
  • results: The multiview system shows promising robustness against off-the-shelf adversarial patches, but the proposed attacks significantly degrade its detection performance: the first attack reaches an attack success rate of 73%, while the second reduces the performance of the target detector by 62%.
    Abstract Adversarial patches exemplify the tangible manifestation of the threat posed by adversarial attacks on Machine Learning (ML) models in real-world scenarios. Robustness against these attacks is of the utmost importance when designing computer vision applications, especially for safety-critical domains such as CCTV systems. In most practical situations, monitoring open spaces requires multi-view systems to overcome acquisition challenges such as occlusion handling. Multiview object systems are able to combine data from multiple views, and reach reliable detection results even in difficult environments. Despite its importance in real-world vision applications, the vulnerability of multiview systems to adversarial patches is not sufficiently investigated. In this paper, we raise the following question: Does the increased performance and information sharing across views offer as a by-product robustness to adversarial patches? We first conduct a preliminary analysis showing promising robustness against off-the-shelf adversarial patches, even in an extreme setting where we consider patches applied to all views by all persons in Wildtrack benchmark. However, we challenged this observation by proposing two new attacks: (i) In the first attack, targeting a multiview CNN, we maximize the global loss by proposing gradient projection to the different views and aggregating the obtained local gradients. (ii) In the second attack, we focus on a Transformer-based multiview framework. In addition to the focal loss, we also maximize the transformer-specific loss by dissipating its attention blocks. Our results show a large degradation in the detection performance of victim multiview systems with our first patch attack reaching an attack success rate of 73% , while our second proposed attack reduced the performance of its target detector by 62%

Universal Backdoor Attacks

  • paper_url: http://arxiv.org/abs/2312.00157
  • repo_url: https://github.com/ain-soph/trojanzoo
  • paper_authors: Benjamin Schneider, Nils Lukas, Florian Kerschbaum
  • for: Data poisoning attacks that backdoor deep image classifiers, where controlled poison samples let an attacker cause misclassification from any source class into any target class.
  • methods: A trigger-based data poisoning technique that generates triggers with salient features the model can learn, exploiting inter-class poison transferability to control misclassifications into arbitrary target classes.
  • results: Experiments show the attack controls models with up to 6,000 classes while poisoning only 0.15% of the training dataset.
    Abstract Web-scraped datasets are vulnerable to data poisoning, which can be used for backdooring deep image classifiers during training. Since training on large datasets is expensive, a model is trained once and re-used many times. Unlike adversarial examples, backdoor attacks often target specific classes rather than any class learned by the model. One might expect that targeting many classes through a naive composition of attacks vastly increases the number of poison samples. We show this is not necessarily true and more efficient, universal data poisoning attacks exist that allow controlling misclassifications from any source class into any target class with a small increase in poison samples. Our idea is to generate triggers with salient characteristics that the model can learn. The triggers we craft exploit a phenomenon we call inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. We demonstrate the effectiveness and robustness of our universal backdoor attacks by controlling models with up to 6,000 classes while poisoning only 0.15% of the training dataset.

A Unified Framework for Connecting Noise Modeling to Boost Noise Detection

  • paper_url: http://arxiv.org/abs/2312.00827
  • repo_url: https://github.com/sunnysiqi/combo
  • paper_authors: Siqi Wang, Chau Pham, Bryan A. Plummer
  • for: Learning with noisy labels, which can impair model performance; the paper asks whether the two conventional approaches, noise modeling and noise detection, can work together.
  • methods: Proposes an interconnected structure with three crucial blocks: noise modeling, source knowledge identification, and enhanced noise detection that integrates noise-source knowledge. The collaboration helps discriminate hard negatives and preserve genuinely clean labels that might look suspiciously noisy.
  • results: Experiments on four datasets featuring three types of noise and different block combinations show up to a 10% increase in top-1 classification accuracy on synthesized-noise datasets and 3-5% on real-world noisy datasets, with each component contributing differently across noise scenarios.
    Abstract Noisy labels can impair model performance, making the study of learning with noisy labels an important topic. Two conventional approaches are noise modeling and noise detection. However, these two methods are typically studied independently, and there has been limited work on their collaboration. In this work, we explore the integration of these two approaches, proposing an interconnected structure with three crucial blocks: noise modeling, source knowledge identification, and enhanced noise detection using noise source-knowledge-integration methods. This collaboration structure offers advantages such as discriminating hard negatives and preserving genuinely clean labels that might be suspiciously noisy. Our experiments on four datasets, featuring three types of noise and different combinations of each block, demonstrate the efficacy of these components' collaboration. Our collaborative structure methods achieve up to a 10% increase in top-1 classification accuracy in synthesized noise datasets and 3-5% in real-world noisy datasets. The results also suggest that these components make distinct contributions to overall performance across various noise scenarios. These findings provide valuable insights for designing noisy label learning methods customized for specific noise scenarios in the future. Our code is accessible to the public.
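A toy sketch of the collaboration idea: a detection pass flags suspect samples, a crude noise-modeling pass estimates how noisy each labeled class is, and that estimate feeds back into a class-wise second detection pass. This is not the authors' pipeline (which adds source-knowledge identification); it only illustrates how the two blocks can inform each other.

```python
import numpy as np

def detect_noisy(losses, threshold):
    """Noise detection block: high-loss samples are suspected of having wrong labels."""
    return losses > threshold

def per_class_noise_rate(labels, suspected, num_classes):
    """Noise modeling block (crude stand-in): estimate how noisy each labeled class is."""
    return np.array([suspected[labels == c].mean() if (labels == c).any() else 0.0
                     for c in range(num_classes)])

def refine_detection(losses, labels, rates):
    """Feed the modeled noise back into detection: classes estimated to be noisier
    get a more aggressive per-class loss cutoff."""
    flags = np.zeros_like(losses, dtype=bool)
    for c, rate in enumerate(rates):
        idx = labels == c
        if idx.any() and rate > 0:
            cutoff = np.quantile(losses[idx], 1.0 - rate)
            flags[idx] = losses[idx] > cutoff
    return flags

rng = np.random.default_rng(0)
losses = rng.random(1000)                       # stand-in per-sample training losses
labels = rng.integers(0, 10, size=1000)
first_pass = detect_noisy(losses, threshold=np.quantile(losses, 0.8))
rates = per_class_noise_rate(labels, first_pass, num_classes=10)
refined = refine_detection(losses, labels, rates)
print("first pass flagged:", first_pass.sum(), " refined flagged:", refined.sum())
```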

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

  • paper_url: http://arxiv.org/abs/2312.00093
  • repo_url: None
  • paper_authors: Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, Bernhard Schölkopf
  • for: Generating compositional 3D scenes with better disentanglement between the objects described in a text prompt.
  • methods: Uses scene graphs, whose node and edge information makes better use of pre-trained text-to-image diffusion models; objects are represented with signed distance fields, with a constraint to avoid inter-penetration, and ChatGPT is prompted to generate scene graphs from text inputs.
  • results: Qualitative and quantitative experiments validate that GraphDreamer generates high-fidelity compositional 3D scenes with disentangled object entities and reduced interference between objects.
    Abstract As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects, because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships. Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, we propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges. By exploiting node and edge information in scene graphs, our method makes better use of the pretrained text-to-image diffusion model and is able to fully disentangle different objects without image-level supervision. To facilitate modeling of object-wise relationships, we use signed distance fields as representation and impose a constraint to avoid inter-penetration of objects. To avoid manual scene graph creation, we design a text prompt for ChatGPT to generate scene graphs based on text inputs. We conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.
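The object-as-SDF representation with a non-penetration constraint can be pictured with the small sketch below, where two analytic sphere SDFs stand in for the learned per-node fields of the scene graph; the penalty term is our own simple formulation of "avoid inter-penetration", not necessarily the paper's.

```python
import torch

def scene_sdf(points, object_sdfs):
    """Compose per-object SDFs into a scene: the minimum over objects gives the
    scene surface; overlapping negative (inside) regions signal inter-penetration."""
    vals = torch.stack([sdf(points) for sdf in object_sdfs], dim=-1)   # N x num_objects
    return vals.min(dim=-1).values, vals

def penetration_penalty(per_object_sdf):
    inside = torch.relu(-per_object_sdf)        # > 0 where a point is inside an object
    top2 = inside.topk(2, dim=-1).values        # penalise being inside two objects at once
    return (top2[:, 0] * top2[:, 1]).mean()

# Two toy sphere SDFs standing in for learned per-node fields of the scene graph.
sphere = lambda center, radius: (lambda p: (p - center).norm(dim=-1) - radius)
sdfs = [sphere(torch.tensor([0.0, 0.0, 0.0]), 0.5),
        sphere(torch.tensor([0.3, 0.0, 0.0]), 0.5)]
pts = torch.rand(4096, 3) * 2 - 1
scene_vals, per_obj = scene_sdf(pts, sdfs)
print("penetration penalty:", penetration_penalty(per_obj).item())
```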

TrafficMOT: A Challenging Dataset for Multi-Object Tracking in Complex Traffic Scenarios

  • paper_url: http://arxiv.org/abs/2311.18839
  • repo_url: None
  • paper_authors: Lihao Liu, Yanqi Cheng, Zhongying Deng, Shujun Wang, Dongdong Chen, Xiaowei Hu, Pietro Liò, Carola-Bibiane Schönlieb, Angelica Aviles-Rivero
  • for: Improving traffic monitoring accuracy and road safety measures through multi-object tracking in traffic videos with advanced machine learning algorithms.
  • methods: Introduces TrafficMOT, an extensive dataset covering diverse traffic situations and complex scenarios, to drive progress in multi-object tracking.
  • results: Empirical studies under three settings (fully-supervised, semi-supervised, and the zero-shot foundation model Tracking Anything Model) highlight the dataset's inherent complexity and its value for advancing traffic monitoring and multi-object tracking.
    Abstract Multi-object tracking in traffic videos is a crucial research area, offering immense potential for enhancing traffic monitoring accuracy and promoting road safety measures through the utilisation of advanced machine learning algorithms. However, existing datasets for multi-object tracking in traffic videos often feature limited instances or focus on single classes, which cannot well simulate the challenges encountered in complex traffic scenarios. To address this gap, we introduce TrafficMOT, an extensive dataset designed to encompass diverse traffic situations with complex scenarios. To validate the complexity and challenges presented by TrafficMOT, we conducted comprehensive empirical studies using three different settings: fully-supervised, semi-supervised, and a recent powerful zero-shot foundation model Tracking Anything Model (TAM). The experimental results highlight the inherent complexity of this dataset, emphasising its value in driving advancements in the field of traffic monitoring and multi-object tracking.

Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living

  • paper_url: http://arxiv.org/abs/2311.18840
  • repo_url: https://github.com/dominickrei/pi-vit
  • paper_authors: Dominick Reilly, Srijan Das
  • for: Activities of Daily Living (ADL), where video transformers' exclusive reliance on the RGB modality limits their adoption.
  • methods: Augments the RGB representations learned by video transformers with 2D and 3D human pose information via plug-in skeleton induction modules.
  • results: Achieves state-of-the-art performance on three prominent ADL datasets without requiring poses or additional computational overhead at inference.
    Abstract Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions, or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows $\pi$-ViT to discard the modules during inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference.
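The plug-in idea reads as: keep the RGB video transformer, attach a pose-aware auxiliary head during training so the tokens absorb skeleton information, and drop that head at inference. A hedged sketch follows; the dimensions, the L1 pose loss, and the 0.5 weighting are our illustrative choices, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class SkeletonInductionModule(nn.Module):
    """Plug-in head used only during training: regresses joints from the video
    tokens so pose information is induced into the RGB features."""
    def __init__(self, dim=256, num_joints=17, joint_dim=3):
        super().__init__()
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_joints * joint_dim))
        self.num_joints, self.joint_dim = num_joints, joint_dim

    def forward(self, tokens, gt_joints):
        pred = self.head(tokens.mean(dim=1)).view(-1, self.num_joints, self.joint_dim)
        return nn.functional.l1_loss(pred, gt_joints)

backbone = nn.Sequential(nn.Linear(768, 256), nn.GELU())    # stand-in video transformer
action_head = nn.Linear(256, 60)                            # main ADL recognition head
aux = SkeletonInductionModule()

tokens = backbone(torch.rand(4, 196, 768))                  # B x tokens x dim
logits = action_head(tokens.mean(dim=1))
loss = (nn.functional.cross_entropy(logits, torch.randint(0, 60, (4,)))
        + 0.5 * aux(tokens, torch.rand(4, 17, 3)))          # aux term dropped at inference
loss.backward()
```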

PoseGPT: Chatting about 3D Human Pose

  • paper_url: http://arxiv.org/abs/2311.18836
  • repo_url: None
  • paper_authors: Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black
  • for: Understanding and reasoning about 3D human poses from images or textual descriptions.
  • methods: Uses large language models (LLMs), embedding SMPL poses as a distinct signal token within a multi-modal LLM so that 3D body poses can be generated directly from textual and visual inputs.
  • results: PoseGPT outperforms existing multimodal LLMs and task-specific methods on the newly proposed tasks, and can understand and generate 3D human poses based on complex reasoning.
    Abstract We introduce PoseGPT, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation methods, whether image-based or text-based, often lack holistic scene comprehension and nuanced reasoning, leading to a disconnect between visual data and its real-world implications. PoseGPT addresses these limitations by embedding SMPL poses as a distinct signal token within a multi-modal LLM, enabling direct generation of 3D body poses from both textual and visual inputs. This approach not only simplifies pose prediction but also empowers LLMs to apply their world knowledge in reasoning about human poses, fostering two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that PoseGPT outperforms existing multimodal LLMs and task-sepcific methods on these newly proposed tasks. Furthermore, PoseGPT's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis.

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

  • paper_url: http://arxiv.org/abs/2311.18835
  • repo_url: https://github.com/rongyaofang/instructseq
  • paper_authors: Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li
  • for: A multi-modal modeling framework that unifies diverse vision tasks through natural language instructions, as a step toward more capable and general artificial intelligence.
  • methods: A multimodal transformer architecture spanning visual, language, and sequential modeling: a visual encoder extracts image features, a text encoder encodes instructions, and an autoregressive transformer fuses the representations to generate sequential task outputs. Training with LLM-generated natural language instructions gives InstructSeq a strong comprehension of free-form instructions for specifying visual tasks.
  • results: Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning; the flexible control and multi-task unification give the model more human-like versatility and generalizability.
    Abstract Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

  • paper_url: http://arxiv.org/abs/2312.00116
  • repo_url: None
  • paper_authors: Or Greenberg, Eran Kishon, Dani Lischinski
  • for: High-fidelity image-to-image translation (I2IT) that maintains the fundamental content of the image.
  • methods: Proposes S2ST, a framework for global I2IT of complex photorealistic images. S2ST operates within the seed space of a latent diffusion model, leveraging the powerful image priors learned by that model.
  • results: S2ST surpasses state-of-the-art GAN-based I2IT methods as well as diffusion-based approaches on complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains, and it obviates the need to train domain-specific translation networks.
    Abstract Image-to-image translation (I2IT) refers to the process of transforming images from a source domain to a target domain while maintaining a fundamental connection in terms of image content. In the past few years, remarkable advancements in I2IT were achieved by Generative Adversarial Networks (GANs), which nevertheless struggle with translations requiring high precision. Recently, Diffusion Models have established themselves as the engine of choice for image generation. In this paper we introduce S2ST, a novel framework designed to accomplish global I2IT in complex photorealistic images, such as day-to-night or clear-to-rain translations of automotive scenes. S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter. We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST obviates the necessity for training domain-specific translation networks.

ART$\boldsymbol{\cdot}$V: Auto-Regressive Text-to-Video Generation with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.18834
  • repo_url: None
  • paper_authors: Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong
  • for: Generating videos with natural motions, rich details, and a high level of aesthetic quality.
  • methods: Generates a single frame at a time with diffusion models, conditioned on previous frames, thereby avoiding the modeling of complex long-range motions; a masked diffusion model reduces the inconsistent appearances caused by network prediction errors that lead to drifting, and conditioning on the low-noise initial frame further improves coherence.
  • results: Can generate arbitrarily long videos conditioned on a variety of prompts, including compositions of multiple text prompts, with natural motion and high aesthetic quality after training for only two weeks on four GPUs.
    Abstract We present ART$\boldsymbol{\cdot}$V, an efficient framework for auto-regressive video generation with diffusion models. Unlike existing methods that generate entire videos in one-shot, ART$\boldsymbol{\cdot}$V generates a single frame at a time, conditioned on the previous ones. The framework offers three distinct advantages. First, it only learns simple continual motions between adjacent frames, therefore avoiding modeling complex long-range motions that require huge training data. Second, it preserves the high-fidelity generation ability of the pre-trained image diffusion models by making only minimal network modifications. Third, it can generate arbitrarily long videos conditioned on a variety of prompts such as text, image or their combinations, making it highly versatile and flexible. To combat the common drifting issue in AR models, we propose masked diffusion model which implicitly learns which information can be drawn from reference images rather than network predictions, in order to reduce the risk of generating inconsistent appearances that cause drifting. Moreover, we further enhance generation coherence by conditioning it on the initial frame, which typically contains minimal noise. This is particularly useful for long video generation. When trained for only two weeks on four GPUs, ART$\boldsymbol{\cdot}$V already can generate videos with natural motions, rich details and a high level of aesthetic quality. Besides, it enables various appealing applications, e.g., composing a long video from multiple text prompts.
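The auto-regressive rollout described above can be summarized as: generate one frame at a time, each conditioned on the previous frame plus the low-noise initial frame to curb drifting. The sketch below shows only that outer loop; `toy_sampler` stands in for the masked-diffusion per-frame sampler, which is where the actual model lives.

```python
import torch

def generate_video(single_frame_model, first_frame, text_emb, num_frames=16):
    """Frame-by-frame auto-regressive rollout: each new frame is conditioned on
    the previous frame and the initial (anchor) frame."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        next_frame = single_frame_model(prev=frames[-1], anchor=first_frame, text=text_emb)
        frames.append(next_frame)
    return torch.stack(frames, dim=1)           # B x T x C x H x W

# Toy sampler: real use would run a masked-diffusion denoising loop per frame.
toy_sampler = lambda prev, anchor, text: 0.9 * prev + 0.1 * anchor
video = generate_video(toy_sampler, torch.rand(1, 3, 64, 64), text_emb=None)
print(video.shape)                               # torch.Size([1, 16, 3, 64, 64])
```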

Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction

  • paper_url: http://arxiv.org/abs/2311.18832
  • repo_url: https://github.com/shinying/dmp
  • paper_authors: Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
  • for: Addressing the domain gap that keeps off-the-shelf semantic predictors from handling the imaginative content produced by recent text-to-image (T2I) diffusion models.
  • methods: Uses a pre-trained T2I model as a prior and reformulates the diffusion process through a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions; low-rank adaptation fine-tunes the pre-trained model to preserve generalizability.
  • results: Extensive experiments across five tasks demonstrate the effectiveness of the proposed method; despite limited-domain training data, it yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
    Abstract Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf property semantic predictors to estimate due to the immitigable domain gap. We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for pixel-level semantic prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models, we reformulate the diffusion process through a sequence of interpolations, establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability, we use low-rank adaptation to fine-tune pre-trained models. Extensive experiments across five tasks, including 3D property estimation, semantic segmentation, and intrinsic image decomposition, showcase the efficacy of the proposed method. Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.

MotionEditor: Editing Video Motion via Content-Aware Diffusion

  • paper_url: http://arxiv.org/abs/2311.18830
  • repo_url: https://github.com/Francis-Rings/MotionEditor
  • paper_authors: Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, Yu-Gang Jiang
  • for: 现有的基于扩散的视频编辑模型难以在操控视频运动信息的同时保持原始主角外观和背景不变,本研究旨在解决这一问题。
  • methods: 提出MotionEditor,一种用于视频运动编辑的扩散模型。它在ControlNet中加入内容感知的运动适配器,以捕捉时间上的运动对应关系。ControlNet虽然可以直接基于骨架姿态生成内容,但在反转噪声中修改源运动时会遇到噪声(源)与条件(参考)信号相互矛盾的问题;我们的适配器通过引入源内容来无缝传递适配后的控制信号。此外,我们构建了双分支架构(重建分支与编辑分支),并设计高保真注意力注入机制促进分支间交互,使编辑分支能够以解耦的方式从重建分支查询键和值,从而保留原始背景与主角外观;我们还提出骨架对齐算法来解决姿态大小与位置的差异。
  • results: 定性与定量实验均表明MotionEditor具有出色的运动编辑能力。
    Abstract Existing diffusion-based video editing models have made gorgeous advances for editing attributes of a source video over time but struggle to manipulate the motion information while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses, it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Further, we build up a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism facilitating branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner, making the editing branch retain the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address the discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor, both qualitatively and quantitatively.
    摘要 现有的基于扩散的视频编辑模型在随时间编辑源视频属性方面已取得亮眼进展,但难以在保持原始主角外观和背景的同时操控运动信息。为此,我们提出MotionEditor,一种用于视频运动编辑的扩散模型。MotionEditor在ControlNet中引入一种新颖的内容感知运动适配器,以捕捉时间上的运动对应关系。虽然ControlNet能够直接基于骨架姿态进行生成,但在反转噪声中修改源运动时,噪声(源)与条件(参考)之间的信号相互矛盾会带来困难;我们的适配器通过引入源内容对ControlNet加以补充,从而无缝传递适配后的控制信号。进一步地,我们构建了双分支架构(重建分支与编辑分支),并配以高保真注意力注入机制促进分支间交互:编辑分支以解耦的方式从重建分支查询键和值,从而保留原始背景与主角外观。我们还提出骨架对齐算法,以解决姿态大小与位置的差异。实验在定性与定量两方面都展示了MotionEditor良好的运动编辑能力。

Un-EvMoSeg: Unsupervised Event-based Independent Motion Segmentation

  • paper_url: http://arxiv.org/abs/2312.00114
  • repo_url: None
  • paper_authors: Ziyun Wang, Jinyuan Guo, Kostas Daniilidis
  • for: 本研究旨在提高事件摄像头在独立移动对象(IMO)分割 task 的性能,并提出了一种基于几何约束的无监督方法。
  • methods: 本方法使用事件摄像头捕获视频序列,并通过几何约束生成 IMO pseudo-标签。
  • results: 在 EVIMO dataset 上测试,本方法与监督学习方法相比,表现很接近,并且可以处理无限多个不确定的对象。
    Abstract Event cameras are a novel type of biologically inspired vision sensor known for their high temporal resolution, high dynamic range, and low power consumption. Because of these properties, they are well-suited for processing fast motions that require rapid reactions. Although event cameras have recently shown competitive performance in unsupervised optical flow estimation, performance in detecting independently moving objects (IMOs) is lacking behind, although event-based methods would be suited for this task based on their low latency and HDR properties. Previous approaches to event-based IMO segmentation have been heavily dependent on labeled data. However, biological vision systems have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event framework that generates IMO pseudo-labels using geometric constraints. Due to its unsupervised nature, our method can handle an arbitrary number of not predetermined objects and is easily scalable to datasets where expensive IMO labels are not readily available. We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.
    摘要 Previous methods for event-based IMO segmentation have relied heavily on labeled data. But biological vision systems have developed the ability to avoid moving objects through daily tasks without being given explicit labels. In this work, we propose the first event framework that generates IMO pseudo-labels using geometric constraints. Because our method is unsupervised, it can handle an arbitrary number of not predetermined objects and is easily scalable to datasets where expensive IMO labels are not readily available.We evaluate our approach on the EVIMO dataset and show that it performs competitively with supervised methods, both quantitatively and qualitatively.

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.18829
  • repo_url: None
  • paper_authors: Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai, Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Xiaoyan Sun, Chong Luo, Baining Guo
  • for: 本研究旨在提出一种简单而有效的文本到视频生成框架,以生成高质量且时序一致的视频。
  • methods: 该框架采用分而治之策略,将文本到视频生成拆分为两个阶段:文本到图像生成,以及图像&文本到视频生成。这一策略有两个优势:一是可以充分利用最新的文本到图像模型(如Stable Diffusion、Midjourney和DALL-E)生成高质量、细节丰富的图像;二是借助生成的图像,模型可以减少对细粒度外观细节的关注,从而更高效地学习运动动态。
  • results: 实验表明,该框架能够在文本提示引导下生成运动精确的高质量视频。特别地,MicroCinema在零样本设置下取得了SOTA的FVD分数:UCF-101上为342.86,MSR-VTT上为377.40。
    Abstract We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly, MicroCinema introduces a Divide-and-Conquer strategy which divides the text-to-video into a two-stage process: text-to-image generation and image\&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models, such as Stable Diffusion, Midjourney, and DALLE, to generate photorealistic and highly detailed images. b) Leveraging the generated image, the model can allocate less focus to fine-grained appearance details, prioritizing the efficient learning of motion dynamics. To implement this strategy effectively, we introduce two core designs. First, we propose the Appearance Injection Network, enhancing the preservation of the appearance of the given image. Second, we introduce the Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion, guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. See https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.
    摘要 我们介绍MicroCinema,一种简单而有效的文本到视频生成框架。与直接将文本提示与视频对齐的现有方法不同,MicroCinema采用分而治之策略,将文本到视频生成拆分为两个阶段:文本到图像生成,以及图像&文本到视频生成。这一策略带来两个重要优势:a) 可以充分利用最新的文本到图像模型(如Stable Diffusion、Midjourney和DALL-E)生成照片级真实且细节丰富的图像;b) 借助生成的图像,模型可以减少对细粒度外观细节的关注,优先高效地学习运动动态。为有效实施该策略,我们提出两项核心设计:一是外观注入网络,增强对给定图像外观的保留;二是外观噪声先验,一种旨在保持预训练2D扩散模型能力的新机制。这些设计使MicroCinema能够在文本提示引导下生成运动精确的高质量视频。大量实验证明了该框架的优越性:MicroCinema在零样本设置下取得SOTA的FVD分数,UCF-101上为342.86,MSR-VTT上为377.40。视频示例请参见 https://wangyanhui666.github.io/MicroCinema.github.io/ 。

Event-based Continuous Color Video Decompression from Single Frames

  • paper_url: http://arxiv.org/abs/2312.00113
  • repo_url: None
  • paper_authors: Ziyun Wang, Friedhelm Hamann, Kenneth Chaney, Wen Jiang, Guillermo Gallego, Kostas Daniilidis
  • for: generate a continuous video from a single static RGB image
  • methods: event-based continuous color video decompression, combining long-range motion modeling and feature-plane-based synthesis neural integration
  • results: significantly outperforms event- and image-based baselines in the proposed task
    Abstract We present ContinuityCam, a novel approach to generate a continuous video from a single static RGB image, using an event camera. Conventional cameras struggle with high-speed motion capture due to bandwidth and dynamic range limitations. Event cameras are ideal sensors to solve this problem because they encode compressed change information at high temporal resolution. In this work, we propose a novel task called event-based continuous color video decompression, pairing single static color frames and events to reconstruct temporally continuous videos. Our approach combines continuous long-range motion modeling with a feature-plane-based synthesis neural integration model, enabling frame prediction at arbitrary times within the events. Our method does not rely on additional frames except for the initial image, increasing, thus, the robustness to sudden light changes, minimizing the prediction latency, and decreasing the bandwidth requirement. We introduce a novel single objective beamsplitter setup that acquires aligned images and events and a novel and challenging Event Extreme Decompression Dataset (E2D2) that tests the method in various lighting and motion profiles. We thoroughly evaluate our method through benchmarking reconstruction as well as various downstream tasks. Our approach significantly outperforms the event- and image- based baselines in the proposed task.
    摘要 我们介绍ContinuityCam,一种利用事件相机从单张静态RGB图像生成连续视频的新方法。传统相机受带宽和动态范围限制,难以捕捉高速运动;事件相机以高时间分辨率编码压缩的变化信息,是解决该问题的理想传感器。在这项工作中,我们提出一个新任务:基于事件的连续彩色视频解压,即将单张静态彩色帧与事件配对,以重建时间上连续的视频。我们的方法将连续长程运动建模与基于特征平面的神经积分合成模型相结合,能够在事件覆盖的任意时刻预测帧。该方法除初始图像外不依赖其他帧,从而增强了对光照突变的鲁棒性、降低了预测延迟并减少了带宽需求。我们还提出一种新颖的单目标分光镜装置以采集对齐的图像和事件,并构建了一个具有挑战性的事件极端解压数据集(E2D2),在多种光照与运动条件下测试该方法。我们通过重建基准测试及多个下游任务对方法进行了全面评估,结果显示我们的方法在该任务上显著优于基于事件和基于图像的基线。

One-step Diffusion with Distribution Matching Distillation

  • paper_url: http://arxiv.org/abs/2311.18828
  • repo_url: None
  • paper_authors: Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, Taesung Park
  • for: 将 diffusion models 蒸馏为一步图像生成器,在几乎不损失图像质量的前提下大幅降低计算量。
  • methods: 我们提出了 Distribution Matching Distillation (DMD) 方法,即在 diffusion model 的分布水平匹配一步图像生成器,以实现图像质量的保持和计算量的减少。我们使用了两个 diffusion models,一个是目标分布,另一个是生成的分布,通过两个分布模型的梯度来计算 KL 差异,从而实现分布匹配。此外,我们还使用了一个简单的回归损失来匹配多步 diffusion 输出的大规模结构。
  • results: 我们的方法在 ImageNet 64x64 上取得 2.62 FID,在 zero-shot COCO-30k 上取得 11.49 FID,与 Stable Diffusion 相当,但计算开销降低了几个数量级;使用 FP16 推理时,模型在现代硬件上能以 20 FPS 生成图像。
    Abstract Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce the one-step image generator match the diffusion model at distribution level, by minimizing an approximate KL divergence whose gradient can be expressed as the difference between 2 score functions, one of the target distribution and the other of the synthetic distribution being produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs, our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference, our model generates images at 20 FPS on modern hardware.
    摘要 扩散模型可以生成高质量图像,但需要数十次前向传播。我们提出分布匹配蒸馏(DMD),将扩散模型转化为一步图像生成器,且对图像质量的影响极小。我们在分布层面强制一步生成器与扩散模型匹配:最小化一个近似KL散度,其梯度可以表示为两个得分函数之差,一个对应目标分布,另一个对应一步生成器产生的合成分布;这两个得分函数分别由在各自分布上训练的扩散模型参数化。再结合一个匹配多步扩散输出大尺度结构的简单回归损失,我们的方法超越了所有已发表的少步扩散方法,在 ImageNet 64x64 上达到 2.62 FID,在 zero-shot COCO-30k 上达到 11.49 FID,与 Stable Diffusion 相当,却快了几个数量级。使用 FP16 推理,我们的模型在现代硬件上能以 20 FPS 生成图像。
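
The distribution-matching gradient can be illustrated with a 1-D toy in which both score functions are analytic. In the actual method each score would be estimated by a separately trained diffusion model and the generator is a network; the Gaussian setup, learning rates and names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target ("real") distribution: N(2.0, 0.5^2). Its score is analytic here;
# in DMD it would be approximated by a pretrained diffusion model.
mu_real, sig_real = 2.0, 0.5
score_real = lambda x: -(x - mu_real) / sig_real**2

# One-step generator: x = a*z + b with z ~ N(0,1), so the "fake" distribution
# is N(b, a^2). Its score is also analytic in this toy; DMD instead trains a
# second diffusion model on generated samples to estimate it.
a, b = 1.0, -1.0
lr, batch = 0.05, 4096

for step in range(400):
    z = rng.standard_normal(batch)
    x = a * z + b
    score_fake = -(x - b) / a**2
    # Pathwise (per-sample) gradient of KL(fake || real): difference of scores.
    g = score_fake - score_real(x)
    grad_a = np.mean(g * z)      # chain rule through x = a*z + b
    grad_b = np.mean(g)
    a -= lr * grad_a
    b -= lr * grad_b

print(f"learned a={a:.3f} (target 0.5), b={b:.3f} (target 2.0)")
```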

DynMF: Neural Motion Factorization for Real-time Dynamic View Synthesis with 3D Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.00112
  • repo_url: https://github.com/agelosk/dynmf
  • paper_authors: Agelos Kratimenos, Jiahui Lei, Kostas Daniilidis
  • for: This paper aims to address the challenges of modeling dynamic scenes and motions, and to provide a compact and efficient representation for dynamic scene rendering.
  • methods: The paper proposes a neural representation called DynMF, which decomposes a dynamic scene into a few neural trajectories. The representation is based on a carefully designed neural framework that consists of a tiny set of learned basis queries in time, and it adequately constrains the motion field of the scene to enable effective and fast optimization.
  • results: The paper demonstrates that the proposed method can reach state-of-the-art render quality within just 5 minutes of training, and it can synthesize novel views of dynamic scenes with superior photorealistic quality in less than half an hour. The representation is interpretable, efficient, and expressive enough to offer real-time view synthesis of complex dynamic scene motions in monocular and multi-view scenarios.
    Abstract Accurately and efficiently modeling dynamic scenes and motions is considered so challenging a task due to temporal dynamics and motion complexity. To address these challenges, we propose DynMF, a compact and efficient representation that decomposes a dynamic scene into a few neural trajectories. We argue that the per-point motions of a dynamic scene can be decomposed into a small set of explicit or learned trajectories. Our carefully designed neural framework consisting of a tiny set of learned basis queried only in time allows for rendering speed similar to 3D Gaussian Splatting, surpassing 120 FPS, while at the same time, requiring only double the storage compared to static scenes. Our neural representation adequately constrains the inherently underconstrained motion field of a dynamic scene leading to effective and fast optimization. This is done by biding each point to motion coefficients that enforce the per-point sharing of basis trajectories. By carefully applying a sparsity loss to the motion coefficients, we are able to disentangle the motions that comprise the scene, independently control them, and generate novel motion combinations that have never been seen before. We can reach state-of-the-art render quality within just 5 minutes of training and in less than half an hour, we can synthesize novel views of dynamic scenes with superior photorealistic quality. Our representation is interpretable, efficient, and expressive enough to offer real-time view synthesis of complex dynamic scene motions, in monocular and multi-view scenarios.
    摘要 由于时间动态与运动的复杂性,准确且高效地建模动态场景和运动被认为极具挑战。为此,我们提出DynMF,一种紧凑而高效的表示,将动态场景分解为少量神经轨迹。我们认为动态场景中每个点的运动都可以分解为一小组显式或学习得到的轨迹。我们精心设计的神经框架只包含一组很小的、仅以时间为查询的学习基,渲染速度可与3D Gaussian Splatting相当、超过120 FPS,而存储开销仅为静态场景的两倍。该神经表示通过将每个点绑定到运动系数、强制各点共享基轨迹,对动态场景中本就欠约束的运动场施加了充分约束,从而实现高效且快速的优化。通过对运动系数施加稀疏性损失,我们能够解耦构成场景的各个运动、独立控制它们,并生成前所未见的全新运动组合。我们的方法只需5分钟训练即可达到最先进的渲染质量,在不到半小时内即可合成具有出色照片级真实感的动态场景新视角。该表示可解释、高效且表达力强,足以在单目和多视角场景下对复杂的动态场景运动进行实时视角合成。
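
A tiny numpy sketch of the factorization idea described above: each point's trajectory is a per-point linear combination of a small set of shared basis trajectories, and an L1 penalty on the coefficients encourages each point to use only a few of them. Shapes and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, T = 1000, 8, 60                            # points, basis trajectories, time steps
x0 = rng.uniform(-1, 1, (N, 3))                  # canonical positions at t = 0
coeff = 0.01 * rng.standard_normal((N, K))       # learnable per-point motion coefficients
basis = 0.01 * rng.standard_normal((K, T, 3))    # learnable shared basis trajectories

def positions_at(t):
    """x_i(t) = x_i(0) + sum_k coeff[i, k] * basis[k](t)."""
    return x0 + coeff @ basis[:, t, :]           # (N, 3)

def sparsity_loss(coeff, lam=1e-3):
    """L1 penalty that pushes each point to use only a few basis trajectories,
    which is what allows individual motions to be disentangled and recombined."""
    return lam * np.abs(coeff).sum()

print(positions_at(10).shape, sparsity_loss(coeff))
```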

CAST: Cross-Attention in Space and Time for Video Action Recognition

  • paper_url: http://arxiv.org/abs/2311.18825
  • repo_url: https://github.com/khu-vll/cast
  • paper_authors: Dongho Lee, Jongseo Lee, Jinwoo Choi
  • for: 这篇论文的目的是提出一种新的动作识别方法,以解决现有的动作识别模型缺乏视觉空间时间理解的问题。
  • methods: 提出一种新颖的双流架构 Cross-Attention in Space and Time (CAST),仅使用RGB输入即可实现对视频的均衡时空理解;其中的瓶颈交叉注意力机制使空间专家模型与时间专家模型能够交换信息并进行协同预测,从而提升性能。
  • results: 作者通过了多个公共的测试 datasets(EPIC-KITCHENS-100、Something-Something-V2、Kinetics-400)的实验,证明了该方法在不同的dataset特点下都能够显示出优异的表现,而现有方法在不同的dataset上表现不稳定。
    Abstract Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
    摘要 识别视频中的人类动作需要空间与时间两方面的理解,而大多数现有动作识别模型缺乏对视频均衡的时空理解。在这项工作中,我们提出一种新颖的双流架构 Cross-Attention in Space and Time (CAST),仅使用RGB输入即可实现对视频均衡的时空理解。我们提出的瓶颈交叉注意力机制使空间专家模型与时间专家模型能够交换信息并进行协同预测,从而提升性能。我们在特性各异的公开基准(EPIC-KITCHENS-100、Something-Something-V2 和 Kinetics-400)上进行了大量实验:我们的方法在这些数据集上始终表现良好,而现有方法的性能会随数据集特性而波动。
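
A minimal numpy sketch of one bottleneck cross-attention step in the spirit described above: tokens from one expert form the queries and tokens from the other expert supply keys/values through a low-dimensional bottleneck. The projection matrices, dimensions and residual update are illustrative assumptions, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, d_bottleneck=64, seed=0):
    """Queries come from one expert (e.g. the spatial stream), keys/values
    from the other (e.g. the temporal stream), through a small bottleneck
    projection so the information exchange stays cheap."""
    rng = np.random.default_rng(seed)
    d_q, d_kv = q_tokens.shape[-1], kv_tokens.shape[-1]
    Wq = rng.standard_normal((d_q, d_bottleneck)) / np.sqrt(d_q)
    Wk = rng.standard_normal((d_kv, d_bottleneck)) / np.sqrt(d_kv)
    Wv = rng.standard_normal((d_kv, d_q)) / np.sqrt(d_kv)

    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_bottleneck), axis=-1)
    return q_tokens + attn @ V            # residual update of the querying stream

spatial = np.random.randn(196, 768)       # e.g. patch tokens from the spatial expert
temporal = np.random.randn(32, 512)       # e.g. tokens from the temporal expert
print(cross_attention(spatial, temporal).shape)   # (196, 768)
```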

DEVIAS: Learning Disentangled Video Representations of Action and Scene for Holistic Video Understanding

  • paper_url: http://arxiv.org/abs/2312.00826
  • repo_url: None
  • paper_authors: Kyungho Bae, Geo Ahn, Youngrae Kim, Jinwoo Choi
  • for: 提高视频理解能力,尤其是在不同场景下进行视频分类和识别任务。
  • methods: 使用插槽注意力学习分离动作和场景表示,并通过辅助任务进一步引导插槽注意力。
  • results: 在不同数据集上,与基线方法相比,所提方法取得了更好的视频理解能力,尤其是在不同场景下进行视频分类和识别任务时。
    Abstract When watching a video, humans can naturally extract human actions from the surrounding scene context, even when action-scene combinations are unusual. However, unlike humans, video action recognition models often learn scene-biased action representations from the spurious correlation in training data, leading to poor performance in out-of-context scenarios. While scene-debiased models achieve improved performance in out-of-context scenarios, they often overlook valuable scene information in the data. Addressing this challenge, we propose Disentangled VIdeo representations of Action and Scene (DEVIAS), which aims to achieve holistic video understanding. Disentangled action and scene representations with our method could provide flexibility to adjust the emphasis on action or scene information depending on downstream task and dataset characteristics. Disentangled action and scene representations could be beneficial for both in-context and out-of-context video understanding. To this end, we employ slot attention to learn disentangled action and scene representations with a single model, along with auxiliary tasks that further guide slot attention. We validate the proposed method on both in-context datasets: UCF-101 and Kinetics-400, and out-of-context datasets: SCUBA and HAT. Our proposed method shows favorable performance across different datasets compared to the baselines, demonstrating its effectiveness in diverse video understanding scenarios.
    摘要 当观看视频时,即使动作与场景的组合并不常见,人类也能自然地从周围场景上下文中提取人体动作。然而与人类不同,视频动作识别模型经常从训练数据中的虚假相关里学到带场景偏置的动作表示,导致在脱离上下文(out-of-context)的情形下表现不佳。场景去偏的模型虽然能改善脱离上下文时的表现,却往往忽略了数据中有价值的场景信息。为了解决这一挑战,我们提出动作与场景解耦的视频表示(DEVIAS),旨在实现全面的视频理解。解耦的动作与场景表示可以根据下游任务和数据集的特性灵活调整对动作或场景信息的侧重,对上下文内和脱离上下文的视频理解都有益处。为此,我们利用槽注意力在单一模型中学习解耦的动作与场景表示,并通过辅助任务进一步引导槽注意力。我们在上下文内数据集(UCF-101、Kinetics-400)和脱离上下文数据集(SCUBA、HAT)上验证了所提方法,其在不同数据集上相比基线均表现良好,证明了其在多种视频理解场景中的有效性。
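
A simplified numpy sketch of slot attention, the mechanism used above to pull apart action and scene information: slots compete for input tokens via a softmax over the slot axis and are then set to the weighted mean of the tokens they claim (the GRU/MLP update of the full algorithm is omitted). Two slots loosely stand for one action slot and one scene slot; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=2, iters=3, seed=0):
    """Simplified slot attention: softmax over slots makes slots compete for
    tokens; each slot becomes the attention-weighted mean of its tokens."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.standard_normal((num_slots, d))
    for _ in range(iters):
        logits = inputs @ slots.T / np.sqrt(d)        # (n, num_slots)
        attn = softmax(logits, axis=1)                # competition across slots
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ inputs                    # weighted mean per slot
    return slots, attn

tokens = np.random.randn(196, 256)                    # e.g. video patch features
slots, attn = slot_attention(tokens)
print(slots.shape, attn.shape)                        # (2, 256) (196, 2)
```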

Initializing Models with Larger Ones

  • paper_url: http://arxiv.org/abs/2311.18823
  • repo_url: https://github.com/oscarxzq/weight-selection
  • paper_authors: Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, Zhuang Liu
  • for: 这篇论文目的是提出一种Weight Selection方法,将大型模型中的一部分 weights 选择到小型模型中,以将大型模型中的知识转移到小型模型中。
  • methods: 这篇论文使用Weight Selection方法,将大型预训练模型中的一部分权重选择出来用于初始化小型模型,并可与知识蒸馏相结合。
  • results: 实验结果显示,Weight Selection方法可以明显提升小型模型的性能,并缩短训练时间;此外,论文还证明了Weight Selection可以与知识蒸馏结合使用,取得更好的效果。
    Abstract Weight initialization plays an important role in neural network training. Widely used initialization methods are proposed and evaluated for networks that are trained from scratch. However, the growing number of pretrained models now offers new opportunities for tackling this classical problem of weight initialization. In this work, we introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model. This enables the transfer of knowledge from pretrained weights to smaller models. Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time. Notably, it can also be used together with knowledge distillation. Weight selection offers a new approach to leverage the power of pretrained models in resource-constrained settings, and we hope it can be a useful tool for training small models in the large-model era. Code is available at https://github.com/OscarXZQ/weight-selection.
    摘要 权重初始化在神经网络训练中起着重要作用。目前广泛使用的初始化方法是针对从零开始训练的网络提出并评估的;然而,数量不断增长的预训练模型为解决这一经典的权重初始化问题提供了新的机会。在这项工作中,我们提出权重选择(weight selection),通过从一个更大的预训练模型中选取一部分权重来初始化较小的模型,从而把预训练权重中的知识迁移到小模型上。实验表明,权重选择可以显著提升小模型的性能并缩短其训练时间;值得注意的是,它还可以与知识蒸馏配合使用。权重选择为在资源受限场景下利用预训练模型的能力提供了一条新途径,我们希望它能成为大模型时代训练小模型的有用工具。代码可在 https://github.com/OscarXZQ/weight-selection 获取。
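
A minimal numpy sketch of the weight-selection idea: a smaller layer is initialized with a subset of a larger pretrained layer's weights. Picking uniformly spaced indices along each dimension is one plausible selection rule used here for illustration; it is not necessarily the exact rule of the paper.

```python
import numpy as np

def select_weights(big_w, small_shape):
    """Initialize a small weight tensor by picking uniformly spaced slices of
    a larger pretrained tensor along every dimension."""
    idx = tuple(
        np.round(np.linspace(0, big_dim - 1, small_dim)).astype(int)
        for big_dim, small_dim in zip(big_w.shape, small_shape)
    )
    return big_w[np.ix_(*idx)]

# Pretend this is a pretrained layer from a larger model (e.g. a 768 -> 3072 MLP).
big_layer = np.random.randn(3072, 768)
# Initialize the corresponding layer of a smaller model (e.g. 384 -> 1536).
small_layer = select_weights(big_layer, (1536, 384))
print(small_layer.shape)        # (1536, 384)
```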

ElasticDiffusion: Training-free Arbitrary Size Image Generation

  • paper_url: http://arxiv.org/abs/2311.18822
  • repo_url: https://github.com/moayedhajiali/elasticdiffusion-official
  • paper_authors: Moayed Haji-Ali, Guha Balakrishnan, Vicente Ordonez
  • for: 这篇论文是为了解决当前的图像生成模型受限于几种尺寸和比例的问题而提出的。
  • methods: 该论文提出了一种名为ElasticDiffusion的免训练解码方法,可以让预训练的文本到图像扩散模型生成不同尺寸的图像。ElasticDiffusion将预训练模型的生成轨迹解耦为局部信号和全局信号:局部信号控制低层像素信息,可以在局部小块上估计;全局信号用于保持整体结构一致,借助参考图像进行估计。
  • results: 我们在CelebA-HQ(人脸)和LAION-COCO(物体/室内/户外场景)上进行了实验和定性评估,结果表明ElasticDiffusion在不同长宽比下的图像连贯性优于MultiDiffusion以及Stable Diffusion的标准解码策略。代码:https://github.com/MoayedHajiAli/ElasticDiffusion-official.git
    Abstract Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official.git
    摘要 近年来,扩散模型为图像生成带来了革命性进展,但它们仍局限于少数几种尺寸和长宽比。我们提出ElasticDiffusion,一种新颖的免训练解码方法,使预训练的文本到图像扩散模型能够生成多种尺寸的图像。ElasticDiffusion尝试将预训练模型的生成轨迹解耦为局部信号和全局信号:局部信号控制低层像素信息,可以在局部小块上估计;全局信号用于保持整体结构一致,并借助参考图像进行估计。我们在CelebA-HQ(人脸)和LAION-COCO(物体/室内/户外场景)上测试了该方法,实验和定性结果显示,其在不同长宽比下的图像连贯性优于MultiDiffusion以及Stable Diffusion的标准解码策略。代码:https://github.com/MoayedHajiAli/ElasticDiffusion-official.git

LucidDreaming: Controllable Object-Centric 3D Generation

  • paper_url: http://arxiv.org/abs/2312.00588
  • repo_url: None
  • paper_authors: Zhaoning Wang, Ming Li, Chen Chen
  • for: 本研究旨在提供精细控制3D生成的有效管道,即LucidDreaming。
  • methods: 该方法仅需极少的3D包围盒输入,而这些包围盒可以借助大语言模型从文本提示中推理得到。具体而言,我们提出裁剪射线采样(clipped ray sampling),以便按照用户指定分别渲染和优化各个物体;此外,我们还引入以物体为中心的密度团偏置(object-centric density blob bias),促进生成物体之间的分离。
  • results: 我们的方法无论是从零生成3D内容,还是在预训练的NeRF场景中进行编辑,都表现出色,在3D内容对齐方面超过基线方法。此外,我们还提供了一个带有3D包围盒的提示数据集,用于评测3D空间可控性。
    Abstract With the recent development of generative models, Text-to-3D generations have also seen significant growth. Nonetheless, achieving precise control over 3D generation continues to be an arduous task, as using text to control often leads to missing objects and imprecise locations. Contemporary strategies for enhancing controllability in 3D generation often entail the introduction of additional parameters, such as customized diffusion models. This often induces hardness in adapting to different diffusion models or creating distinct objects. In this paper, we present LucidDreaming as an effective pipeline capable of fine-grained control over 3D generation. It requires only minimal input of 3D bounding boxes, which can be deduced from a simple text prompt using a Large Language Model. Specifically, we propose clipped ray sampling to separately render and optimize objects with user specifications. We also introduce object-centric density blob bias, fostering the separation of generated objects. With individual rendering and optimizing of objects, our method excels not only in controlled content generation from scratch but also within the pre-trained NeRF scenes. In such scenarios, existing generative approaches often disrupt the integrity of the original scene, and current editing methods struggle to synthesize new content in empty spaces. We show that our method exhibits remarkable adaptability across a spectrum of mainstream Score Distillation Sampling-based 3D generation frameworks, and achieves superior alignment of 3D content when compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes, benchmarking 3D spatial controllability.
    摘要 Recent developments in generative models have also led to significant growth in text-to-3D generation. However, achieving precise control over 3D generation remains a challenging task, as using text to control often results in missing objects and imprecise locations. To address this issue, many current strategies involve introducing additional parameters, such as customized diffusion models, which can make it difficult to adapt to different diffusion models or create distinct objects.In this paper, we propose LucidDreaming, an effective pipeline that provides fine-grained control over 3D generation with minimal input. Our method requires only a simple text prompt and a Large Language Model to deduce 3D bounding boxes. We use clipped ray sampling to separately render and optimize objects with user specifications, and introduce object-centric density blob bias to ensure the separation of generated objects.Our method excels not only in generating content from scratch but also in existing pre-trained NeRF scenes, where other generative approaches often disrupt the integrity of the original scene and current editing methods struggle to synthesize new content in empty spaces. We demonstrate the remarkable adaptability of our method across a range of mainstream Score Distillation Sampling-based 3D generation frameworks and show superior alignment of 3D content compared to baseline approaches. We also provide a dataset of prompts with 3D bounding boxes for benchmarking 3D spatial controllability.
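
A small numpy sketch of the ray/box clipping implied by clipped ray sampling: a slab test returns the interval where a camera ray passes through a user-provided 3D bounding box, so samples for that object can be drawn and optimized only inside the box. Function names and the sampling step are illustrative assumptions.

```python
import numpy as np

def clip_ray_to_box(origin, direction, box_min, box_max):
    """Slab-method ray/AABB intersection. Returns (t_near, t_far), or None
    if the ray misses the box; object samples are then drawn only from
    the returned interval."""
    direction = np.where(np.abs(direction) < 1e-9, 1e-9, direction)
    t0 = (box_min - origin) / direction
    t1 = (box_max - origin) / direction
    t_near = np.maximum.reduce(np.minimum(t0, t1))
    t_far = np.minimum.reduce(np.maximum(t0, t1))
    if t_far < max(t_near, 0.0):
        return None
    return max(t_near, 0.0), t_far

origin = np.array([0.0, 0.0, -3.0])
direction = np.array([0.0, 0.0, 1.0])
box = (np.array([-0.5, -0.5, -0.5]), np.array([0.5, 0.5, 0.5]))

interval = clip_ray_to_box(origin, direction, *box)
if interval is not None:
    t_near, t_far = interval
    samples = origin + np.linspace(t_near, t_far, 32)[:, None] * direction
    print(t_near, t_far, samples.shape)    # 2.5 3.5 (32, 3)
```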

IMMA: Immunizing text-to-image Models against Malicious Adaptation

  • paper_url: http://arxiv.org/abs/2311.18815
  • repo_url: https://github.com/zhengyjzoe/imma
  • paper_authors: Yijia Zheng, Raymond A. Yeh
  • for: 保护文本至图模型免受恶意修改
  • methods: 提出一种新的保护策略,即使模型参数难以在恶意修改时被篡改
  • results: 对三种恶意修改方法(LoRA、Textual-Inversion、DreamBooth)进行实验,证明 IMMA 的效果可以减少恶意修改的风险
    Abstract Advancements in text-to-image models and fine-tuning methods have led to the increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful unauthorized content. Recent works, e.g., Glaze or MIST, have developed data-poisoning techniques which protect the data against adaptation methods. In this work, we consider an alternative paradigm for protection. We propose to ``immunize'' the model by learning model parameters that are difficult for the adaptation methods when fine-tuning malicious content; in short IMMA. Empirical results show IMMA's effectiveness against malicious adaptations, including mimicking the artistic style and learning of inappropriate/unauthorized content, over three adaptation methods: LoRA, Textual-Inversion, and DreamBooth.
    摘要 Recent advances in text-to-image models and fine-tuning methods have led to an increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful and unauthorized content. To address this issue, some recent works, such as Glaze or MIST, have proposed data-poisoning techniques to protect the data against adaptation methods. However, we propose an alternative paradigm for protection, which is to "immunize" the model by learning model parameters that are difficult for the adaptation methods when fine-tuning malicious content. In short, we call this approach IMMA (Immune Model against Malicious Adaptation). Our empirical results show that IMMA is effective against malicious adaptations, including mimicking artistic styles and learning inappropriate/unauthorized content, over three adaptation methods: LoRA, Textual-Inversion, and DreamBooth.

Is Underwater Image Enhancement All Object Detectors Need?

  • paper_url: http://arxiv.org/abs/2311.18814
  • repo_url: https://github.com/bigwangyudong/lqit
  • paper_authors: Yudong Wang, Jichang Guo, Wanru He, Huan Gao, Huihui Yue, Zenan Zhang, Chongyi Li
  • for: 本研究的目的是回答“底层水下图像增强对水下对象检测是否有益?”以及“底层水下图像增强如何对水下对象检测?”这两个问题。
  • methods: 本研究使用18种当前最佳水下图像增强算法,包括传统、CNN基于和GAN基于算法,对水下对象检测数据进行预处理。然后,使用不同算法对水下对象检测数据进行增强,并使用7种深度学习基于对象检测模型进行重新训练。
  • results: 本研究使用133种模型进行全面分析底层水下图像增强对水下对象检测的影响。结果表明,底层水下图像增强可以提高水下对象检测的准确率和F1值。此外,研究还发现不同的增强算法对水下对象检测的影响不同。
    Abstract Underwater object detection is a crucial and challenging problem in marine engineering and aquatic robot. The difficulty is partly because of the degradation of underwater images caused by light selective absorption and scattering. Intuitively, enhancing underwater images can benefit high-level applications like underwater object detection. However, it is still unclear whether all object detectors need underwater image enhancement as pre-processing. We therefore pose the questions "Does underwater image enhancement really improve underwater object detection?" and "How does underwater image enhancement contribute to underwater object detection?". With these two questions, we conduct extensive studies. Specifically, we use 18 state-of-the-art underwater image enhancement algorithms, covering traditional, CNN-based, and GAN-based algorithms, to pre-process underwater object detection data. Then, we retrain 7 popular deep learning-based object detectors using the corresponding results enhanced by different algorithms, obtaining 126 underwater object detection models. Coupled with 7 object detection models retrained using raw underwater images, we employ these 133 models to comprehensively analyze the effect of underwater image enhancement on underwater object detection. We expect this study can provide sufficient exploration to answer the aforementioned questions and draw more attention of the community to the joint problem of underwater image enhancement and underwater object detection. The pre-trained models and results are publicly available and will be regularly updated. Project page: https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/uw_enhancement_affect_detection.
    摘要 水下物体检测是marine engineering和水下机器人中的一项重要和挑战性问题。这种困难的一部分是由于光选择性吸收和散射所导致的水下图像的劣化。直觉上,提高水下图像可能会有助于高级应用程序 like underwater object detection。然而,是否所有的物体检测器需要水下图像提高为先processing仍然是不清楚的。我们因此提出了以下两个问题:“水下图像提高确实改善水下物体检测吗?”和“水下图像提高如何影响水下物体检测?”。我们采取了广泛的研究。具体来说,我们使用18种当前最佳水下图像提高算法,涵盖传统、CNN基于和GAN基于算法,对水下物体检测数据进行预处理。然后,我们使用不同算法对应的结果进行重新训练7种深度学习基于的物体检测器,共获得126个水下物体检测模型。与7种原始水下图像重新训练的物体检测模型相比,我们使用这133个模型进行全面分析水下图像提高对水下物体检测的影响。我们希望这项研究可以提供充分的探索,回答上述问题,并吸引社区更多关注水下图像提高和水下物体检测的共同问题。我们的预训练模型和结果将在 GitHub 上公开,并定期更新。项目页面:https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/uw_enhancement_affect_detection。

MD-Splatting: Learning Metric Deformation from 4D Gaussians in Highly Deformable Scenes

  • paper_url: http://arxiv.org/abs/2312.00583
  • repo_url: None
  • paper_authors: Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Mike Zheng Shou, Shuran Song, Jeffrey Ichnowski
  • for: 这个论文旨在提高在高度变形的场景中进行高精度3D跟踪和新视图生成,以便应用于机器人、增强现实和生成AI等领域。
  • methods: 该论文提出了MD-Splatting方法,它基于 Gaussian splatting 方法,通过视频捕捉的多个相机pose拍摄的场景进行同时3D跟踪和新视图生成。MD-Splatting 学习了一个填充函数,将一组 Gaussian 的非 metric 属性映射到几何空间中。
  • results: 该论文实现了在高度变形的场景中高精度3D跟踪和新视图生成,并且与现有方法相比,提高了3D跟踪的平均误差率23.9%。在具有足够的 текстура的场景中,如场景6,MD-Splatting 实现了1 x 1米的布料上的 median 跟踪误差为3.39毫米。
    Abstract Accurate 3D tracking in highly deformable scenes with occlusions and shadows can facilitate new applications in robotics, augmented reality, and generative AI. However, tracking under these conditions is extremely challenging due to the ambiguity that arises with large deformations, shadows, and occlusions. We introduce MD-Splatting, an approach for simultaneous 3D tracking and novel view synthesis, using video captures of a dynamic scene from various camera poses. MD-Splatting builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel view synthesis. MD-Splatting learns a deformation function to project a set of Gaussians with non-metric, thus canonical, properties into metric space. The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. We enforce physics-inspired regularization terms based on local rigidity, conservation of momentum, and isometry, which leads to trajectories with smaller trajectory errors. MD-Splatting achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions. Compared to state-of-the-art, we improve 3D tracking by an average of 23.9 %, while simultaneously achieving high-quality novel view synthesis. With sufficient texture such as in scene 6, MD-Splatting achieves a median tracking error of 3.39 mm on a cloth of 1 x 1 meters in size. Project website: https://md-splatting.github.io/.
    摘要 在具有遮挡和阴影的高度可变形场景中进行准确的3D跟踪,可以推动机器人、增强现实和生成式AI等新应用。然而,大形变、阴影和遮挡带来的歧义使这种条件下的跟踪极其困难。我们提出MD-Splatting,一种利用多个相机位姿拍摄的动态场景视频、同时进行3D跟踪与新视角合成的方法。MD-Splatting建立在Gaussian splatting这一能够实现最先进且快速新视角合成的方法之上,学习一个变形函数,将一组具有非度量(即规范)属性的高斯投射到度量空间中。该变形函数采用神经体素编码和多层感知机(MLP)来推断高斯的位置、旋转以及一个阴影标量。我们还施加了基于局部刚性、动量守恒和等距性的物理启发正则项,从而得到轨迹误差更小的跟踪结果。MD-Splatting在带有阴影和遮挡的高度可变形场景上实现了高质量3D跟踪:与现有最佳方法相比,3D跟踪性能平均提升23.9%,同时实现高质量的新视角合成;在纹理充足的场景(如场景6)中,对1 x 1米的布料取得了3.39毫米的中位跟踪误差。项目网站:https://md-splatting.github.io/ 。
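
A toy numpy forward pass for a deformation function of the kind described above: canonical Gaussian centers plus a time encoding go through a small MLP that outputs a position offset, a quaternion update and a shadow scalar. The sinusoidal time encoding stands in for the neural-voxel features, and all layer sizes are simplified assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, widths, seed=0):
    """Tiny fully connected network with ReLU hidden layers (random weights)."""
    r = np.random.default_rng(seed)
    h = x
    for i, (din, dout) in enumerate(zip([x.shape[-1]] + widths[:-1], widths)):
        W = r.standard_normal((din, dout)) / np.sqrt(din)
        h = h @ W
        if i < len(widths) - 1:
            h = np.maximum(h, 0.0)
    return h

def deform(canonical_xyz, t, num_freq=4):
    """Map canonical Gaussian centers + time to (deformed position,
    quaternion delta, shadow scalar)."""
    t_enc = np.concatenate([f(2.0**k * np.pi * t) for k in range(num_freq)
                            for f in (np.sin, np.cos)])
    feats = np.concatenate([canonical_xyz,
                            np.tile(t_enc, (canonical_xyz.shape[0], 1))], axis=1)
    out = mlp(feats, [64, 64, 3 + 4 + 1])
    d_xyz, d_quat, shadow = out[:, :3], out[:, 3:7], out[:, 7:]
    return canonical_xyz + d_xyz, d_quat, 1.0 / (1.0 + np.exp(-shadow))

xyz = rng.uniform(-1, 1, (2048, 3))
pos_t, quat_t, shadow_t = deform(xyz, t=np.array([0.25]))
print(pos_t.shape, quat_t.shape, shadow_t.shape)   # (2048, 3) (2048, 4) (2048, 1)
```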

Convergence of Nonconvex PnP-ADMM with MMSE Denoisers

  • paper_url: http://arxiv.org/abs/2311.18810
  • repo_url: https://github.com/wustl-cig/camsap2023
  • paper_authors: Chicago Park, Shirin Shoushtari, Weijie Gan, Ulugbek S. Kamilov
  • for: 这篇论文旨在解释使用卷积神经网络(CNN)去噪器的 Plug-and-Play Alternating Direction Method of Multipliers(PnP-ADMM)在实践中的稳定性。
  • methods: 该论文分析PnP-ADMM算法,并将其中作为先验的CNN去噪器解释为最小均方误差(MMSE)去噪器。
  • results: 该论文给出了理论解释,说明PnP-ADMM即便使用扩张型CNN也能保持经验上的稳定性,并通过数值实验比较了非扩张的DnCNN去噪器与扩张的DRUNet去噪器之间的性能差距。
    Abstract Plug-and-Play Alternating Direction Method of Multipliers (PnP-ADMM) is a widely-used algorithm for solving inverse problems by integrating physical measurement models and convolutional neural network (CNN) priors. PnP-ADMM has been theoretically proven to converge for convex data-fidelity terms and nonexpansive CNNs. It has however been observed that PnP-ADMM often empirically converges even for expansive CNNs. This paper presents a theoretical explanation for the observed stability of PnP-ADMM based on the interpretation of the CNN prior as a minimum mean-squared error (MMSE) denoiser. Our explanation parallels a similar argument recently made for the iterative shrinkage/thresholding algorithm variant of PnP (PnP-ISTA) and relies on the connection between MMSE denoisers and proximal operators. We also numerically evaluate the performance gap between PnP-ADMM using a nonexpansive DnCNN denoiser and expansive DRUNet denoiser, thus motivating the use of expansive CNNs.
    摘要 即插即用交替方向乘子法(PnP-ADMM)是一种广泛使用的算法,通过结合物理测量模型与卷积神经网络(CNN)先验来求解逆问题。理论上已证明,当数据保真项为凸且CNN为非扩张算子时,PnP-ADMM收敛;然而实践中经常观察到,即便CNN是扩张的,PnP-ADMM往往也能收敛。本文基于将CNN先验解释为最小均方误差(MMSE)去噪器,对这一观察到的稳定性给出理论解释。我们的论证与近期针对PnP的迭代收缩/阈值算法变体(PnP-ISTA)提出的类似论证相呼应,并依赖于MMSE去噪器与邻近算子之间的联系。我们还通过数值实验比较了使用非扩张DnCNN去噪器与扩张DRUNet去噪器时PnP-ADMM的性能差距,从而支持使用扩张CNN。
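
A compact numpy sketch of the PnP-ADMM iteration for a simple denoising-type data term, with a box blur standing in for the learned denoiser; in the paper this slot is filled by DnCNN or DRUNet and the data term can involve a general forward operator. The setup below is purely illustrative.

```python
import numpy as np

def box_blur(img, k=3):
    """Very simple denoiser stand-in: k x k mean filter with edge padding.
    In PnP-ADMM this slot is taken by a learned denoiser (e.g. DnCNN/DRUNet)."""
    p = k // 2
    padded = np.pad(img, p, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def pnp_admm(y, denoiser, rho=1.0, iters=30):
    """PnP-ADMM for 0.5*||x - y||^2 with the prior's proximal step replaced
    by a plug-in denoiser."""
    x, v, u = y.copy(), y.copy(), np.zeros_like(y)
    for _ in range(iters):
        x = (y + rho * (v - u)) / (1.0 + rho)   # data-fidelity proximal step
        v = denoiser(x + u)                     # prior step = plug-in denoiser
        u = u + x - v                           # dual (running residual) update
    return x

rng = np.random.default_rng(0)
clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
noisy = clean + 0.3 * rng.standard_normal(clean.shape)
restored = pnp_admm(noisy, box_blur)
# The restored MSE should be noticeably lower than the noisy MSE.
print(np.mean((noisy - clean) ** 2), np.mean((restored - clean) ** 2))
```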

FoundPose: Unseen Object Pose Estimation with Foundation Features

  • paper_url: http://arxiv.org/abs/2311.18809
  • repo_url: None
  • paper_authors: Evin Pınar Örnek, Yann Labbé, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, Tomas Hodan
  • for: 这篇论文提出一种从单张RGB图像估计未见刚体物体6D姿态的方法。
  • methods: 该方法基于DINOv2视觉基础模型,无需针对特定物体进行训练。给定带有物体分割掩码的查询图像,它先用基于DINOv2的词袋(bag-of-words)方法检索少量外观相似的渲染模板,再通过匹配查询图像与模板之间的DINOv2块特征建立2D-3D对应并生成姿态假设,最后用特征度量(featuremetric)精化对姿态进行优化。
  • results: 该方法可以处理多种物体,包括具有对称性和无纹理的物体,在标准BOP基准上的粗姿态估计精度与速度均显著超过现有RGB方法;结合特征度量精化以及额外的MegaPose精化后,其性能超过所有RGB竞争方法。
    Abstract We propose FoundPose, a method for 6D pose estimation of unseen rigid objects from a single RGB image. The method assumes that 3D models of the objects are available but does not require any object-specific training. This is achieved by building upon DINOv2, a recent vision foundation model with impressive generalization capabilities. An online pose estimation stage is supported by a minimal object representation that is built during a short onboarding stage from DINOv2 patch features extracted from rendered object templates. Given a query image with an object segmentation mask, FoundPose first rapidly retrieves a handful of similarly looking templates by a DINOv2-based bag-of-words approach. Pose hypotheses are then generated from 2D-3D correspondences established by matching DINOv2 patch features between the query image and a retrieved template, and finally optimized by featuremetric refinement. The method can handle diverse objects, including challenging ones with symmetries and without any texture, and noticeably outperforms existing RGB methods for coarse pose estimation in both accuracy and speed on the standard BOP benchmark. With the featuremetric and additional MegaPose refinement, which are demonstrated complementary, the method outperforms all RGB competitors. Source code is at: evinpinar.github.io/foundpose.
    摘要 我们提出FoundPose,一种从单张RGB图像对未见刚体物体进行6D姿态估计的方法。该方法假设物体的3D模型可用,但不需要任何针对特定物体的训练,这得益于DINOv2这一具有出色泛化能力的视觉基础模型。在线姿态估计阶段依赖一个极简的物体表示:在简短的准备阶段,从渲染的物体模板中提取DINOv2块特征并构建该表示。给定带有物体分割掩码的查询图像,FoundPose首先用基于DINOv2的词袋方法快速检索少量外观相似的模板;随后通过匹配查询图像与检索到的模板之间的DINOv2块特征建立2D-3D对应,生成姿态假设,最后用特征度量精化进行优化。该方法可以处理多种物体,包括具有对称性和无纹理的物体,在标准BOP基准上的粗姿态估计精度与速度均显著优于现有RGB方法;结合特征度量精化和额外的MegaPose精化(两者被证明是互补的),该方法超过了所有RGB竞争方法。源代码见:evinpinar.github.io/foundpose 。
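
A small numpy sketch of the bag-of-words template retrieval step: patch descriptors are assigned to visual words, each image becomes a normalized word histogram, and templates are ranked by cosine similarity. Descriptor extraction (DINOv2), vocabulary building and the later 2D-3D/featuremetric stages are omitted; all names and sizes are illustrative.

```python
import numpy as np

def bow_histogram(patch_feats, words):
    """Assign each patch descriptor to its nearest visual word and return an
    L2-normalized word-frequency histogram."""
    d2 = ((patch_feats[:, None, :] - words[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    hist = np.bincount(assign, minlength=len(words)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

def retrieve(query_feats, template_feats_list, words, top_k=3):
    q = bow_histogram(query_feats, words)
    sims = [q @ bow_histogram(t, words) for t in template_feats_list]
    return np.argsort(sims)[::-1][:top_k]

rng = np.random.default_rng(0)
words = rng.standard_normal((128, 32))                 # visual vocabulary (e.g. k-means centers)
templates = [rng.standard_normal((400, 32)) for _ in range(50)]   # per-template patch features
query = templates[7] + 0.05 * rng.standard_normal((400, 32))      # query resembling template 7
print(retrieve(query, templates, words))               # template 7 should rank first
```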

CLIP-QDA: An Explainable Concept Bottleneck Model

  • paper_url: http://arxiv.org/abs/2312.00110
  • repo_url: None
  • paper_authors: Rémi Kazmierczak, Eloïse Berthier, Goran Frehse, Gianni Franchi
  • for: 本研究旨在提出一种可解释的图像分类算法,基于多Modal基础模型,具有快速和可解释的特点。
  • methods: 本方法基于CLIP-based Concept Bottleneck Models (CBMs),创造了一个含义层次结构,每个神经元与特定的单词相关联。使用 Mixture of Gaussians (MoG) formalism,提高了这个含义层次结构的解释性。此外,我们还提出了一种基于统计量的分类器,即CLIP-QDA。
  • results: 我们的实验结果显示,当MoG假设成立时,CLIP-QDA可以与当前最佳方法CBMs准确率相似。我们的解释方法与现有XAI方法竞争,而计算速度更快。
    Abstract In this paper, we introduce an explainable algorithm designed from a multi-modal foundation model, that performs fast and explainable image classification. Drawing inspiration from CLIP-based Concept Bottleneck Models (CBMs), our method creates a latent space where each neuron is linked to a specific word. Observing that this latent space can be modeled with simple distributions, we use a Mixture of Gaussians (MoG) formalism to enhance the interpretability of this latent space. Then, we introduce CLIP-QDA, a classifier that only uses statistical values to infer labels from the concepts. In addition, this formalism allows for both local and global explanations. These explanations come from the inner design of our architecture, our work is part of a new family of greybox models, combining performances of opaque foundation models and the interpretability of transparent models. Our empirical findings show that in instances where the MoG assumption holds, CLIP-QDA achieves similar accuracy with state-of-the-art methods CBMs. Our explanations compete with existing XAI methods while being faster to compute.
    摘要 在这篇论文中,我们介绍一种基于多模态基础模型设计的可解释算法,能够进行快速且可解释的图像分类。受基于CLIP的概念瓶颈模型(CBM)启发,我们的方法构造了一个潜在空间,其中每个神经元都与一个特定的单词相关联。我们发现该潜在空间可以用简单的分布来建模,因此使用混合高斯(MoG)形式化来增强其可解释性。随后,我们提出CLIP-QDA,一种仅利用统计量即可由概念推断标签的分类器。此外,这一形式化还同时支持局部和全局解释。这些解释源自我们架构的内在设计;我们的工作属于一类新的灰盒模型,兼具不透明基础模型的性能与透明模型的可解释性。实验结果表明,在MoG假设成立的情形下,CLIP-QDA的准确率与最先进的CBM方法相当;我们的解释方法与现有XAI方法相比也具有竞争力,且计算速度更快。
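
A minimal numpy sketch of a QDA classifier over concept-activation vectors, matching the Mixture-of-Gaussians view above: each class is fit with a Gaussian (mean and covariance) over the concept scores, and a sample is assigned to the class with the highest Gaussian log-likelihood plus log prior. The CLIP concept-activation extraction is omitted and the toy data below is synthetic.

```python
import numpy as np

class QDA:
    """Quadratic discriminant analysis over concept-activation vectors."""

    def fit(self, X, y, reg=1e-3):
        self.classes = np.unique(y)
        self.means, self.covs, self.priors = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.means[c] = Xc.mean(axis=0)
            self.covs[c] = np.cov(Xc, rowvar=False) + reg * np.eye(X.shape[1])
            self.priors[c] = len(Xc) / len(X)
        return self

    def log_scores(self, X):
        scores = []
        for c in self.classes:
            diff = X - self.means[c]
            cov_inv = np.linalg.inv(self.covs[c])
            _, logdet = np.linalg.slogdet(self.covs[c])
            maha = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
            scores.append(-0.5 * (maha + logdet) + np.log(self.priors[c]))
        return np.stack(scores, axis=1)

    def predict(self, X):
        return self.classes[self.log_scores(X).argmax(axis=1)]

rng = np.random.default_rng(0)
# Fake "concept activations" (e.g. CLIP similarities to concept words) for 2 classes.
X0 = rng.multivariate_normal([1.0, 0.2, 0.0], 0.1 * np.eye(3), size=200)
X1 = rng.multivariate_normal([0.0, 0.8, 0.5], 0.2 * np.eye(3), size=200)
X, y = np.vstack([X0, X1]), np.array([0] * 200 + [1] * 200)
clf = QDA().fit(X, y)
print((clf.predict(X) == y).mean())      # high training accuracy on this toy
```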

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

  • paper_url: http://arxiv.org/abs/2311.18773
  • repo_url: None
  • paper_authors: Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun
  • for: 这个论文的目的是提出一个新的视频学习基准任务,即从视频中学习人类演示的技能,并能够泛化到新的领域和多种模态。
  • methods: 基准要求视频-语言模型获得结构化的理解,例如将演示在时间上分割为一系列动作与技能,并将这种理解泛化到新的领域;任务包含步骤识别和视频内检索两部分,数据来自国际空间站舱外活动录像。
  • results: 研究发现,现有的方法在这个新的挑战任务中表现不佳,这表明将来需要开发新的方法来解决这些任务。
    Abstract Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a high temporal context window; and (3) multimodal (text + video) domains. This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes the high difficulty of task recognition and segmentation. We find that state-of-the-art methods perform poorly on our benchmark, demonstrating that the goal of generalizable procedural video understanding models is far out and underscoring the need to develop new approaches to these tasks. Data, model, and code will be publicly released.
    摘要 从视频中学习是一个新兴的研究领域,使机器人能够从人类示范(例如过程类视频)中习得技能。为此,视频-语言模型必须能够获得结构化的理解,例如将一次示范在时间上分割为一系列动作与技能,并将这种理解泛化到新的领域。为实现这一目标,我们提出Spacewalk-18基准,基于国际空间站舱外活动录像中经过时间分割并标注的任务数据,包含两个任务:(1)步骤识别和(2)视频内检索。这两个任务共同衡量模型利用以下能力的程度:(1)域外视觉信息;(2)较长的时间上下文窗口;(3)多模态(文本+视频)信息。这不同于现有的过程类视频理解基准,后者通常只涉及较短的上下文且可以用单一模态解决。Spacewalk-18固有的多模态与长视频复杂性暴露了任务识别与分割的高难度:我们发现最先进的方法在该基准上表现不佳,说明可泛化的过程类视频理解模型仍有很长的路要走,亟需发展新的方法。数据、模型和代码将公开发布。

Semi-supervised Semantic Segmentation via Boosting Uncertainty on Unlabeled Data

  • paper_url: http://arxiv.org/abs/2311.18758
  • repo_url: None
  • paper_authors: Daoan Zhang, Yunhao Luo, Jianguo Zhang
  • for: The paper is written for improving the performance of semi-supervised semantic segmentation models by addressing the distribution gap between labeled and unlabeled datasets.
  • methods: The paper proposes two strategies and designs an uncertainty booster algorithm to appropriately boost uncertainty on unlabeled data, which helps minimize the distribution gap and benefits the generalization of the model.
  • results: The proposed algorithm and strategies are experimentally proven to be effective in promoting performance in semi-supervised semantic segmentation, achieving state-of-the-art results on popular benchmarks such as Cityscapes and PASCAL VOC 2012 with different train settings.
    Abstract We bring a new perspective to semi-supervised semantic segmentation by providing an analysis on the labeled and unlabeled distributions in training datasets. We first figure out that the distribution gap between labeled and unlabeled datasets cannot be ignored, even though the two datasets are sampled from the same distribution. To address this issue, we theoretically analyze and experimentally prove that appropriately boosting uncertainty on unlabeled data can help minimize the distribution gap, which benefits the generalization of the model. We propose two strategies and design an uncertainty booster algorithm, specially for semi-supervised semantic segmentation. Extensive experiments are carried out based on these theories, and the results confirm the efficacy of the algorithm and strategies. Our plug-and-play uncertainty booster is tiny, efficient, and robust to hyperparameters but can significantly promote performance. Our approach achieves state-of-the-art performance in our experiments compared to the current semi-supervised semantic segmentation methods on the popular benchmarks: Cityscapes and PASCAL VOC 2012 with different train settings.
    摘要 我们从训练数据中有标签与无标签分布的分析出发,为半监督语义分割带来一个新的视角。我们首先发现,即使有标签和无标签数据集采样自同一分布,两者之间的分布差距也不可忽略。为了解决这一问题,我们从理论上分析并通过实验证明:适当提升无标签数据上的不确定性有助于缩小分布差距,从而提升模型的泛化能力。我们为半监督语义分割提出了两种策略,并设计了一个不确定性增强(uncertainty booster)算法。基于这些理论开展的大量实验证实了该算法与策略的有效性。我们的即插即用不确定性增强模块小巧、高效、对超参数鲁棒,却能显著提升性能。在常用基准Cityscapes和PASCAL VOC 2012上的不同训练设置下,我们的方法相比当前半监督语义分割方法取得了最先进的性能。

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

  • paper_url: http://arxiv.org/abs/2312.00109
  • repo_url: https://github.com/city-super/Scaffold-GS
  • paper_authors: Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, Bo Dai
  • for: 提高3D场景真实渲染质量和速度,应用于学术和业务领域。
  • methods: 使用 anchor points 分布本地3D Gaussian,基于视角方向和距离在视场范围内预测Attributes。
  • results: 减少重复的 Gaussian,提高场景覆盖率,同时保持高质量渲染和速度。
    Abstract Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications. The recent 3D Gaussian Splatting method has achieved the state-of-the-art rendering quality and speed combining the benefits of both primitive-based representations and volumetric representations. However, it often leads to heavily redundant Gaussians that try to fit every training view, neglecting the underlying scene geometry. Consequently, the resulting model becomes less robust to significant view changes, texture-less area and lighting effects. We introduce Scaffold-GS, which uses anchor points to distribute local 3D Gaussians, and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. Anchor growing and pruning strategies are developed based on the importance of neural Gaussians to reliably improve the scene coverage. We show that our method effectively reduces redundant Gaussians while delivering high-quality rendering. We also demonstrates an enhanced capability to accommodate scenes with varying levels-of-detail and view-dependent observations, without sacrificing the rendering speed.
    摘要 We introduce Scaffold-GS, which uses anchor points to distribute local 3D Gaussians and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. Anchor growing and pruning strategies are developed based on the importance of neural Gaussians to reliably improve the scene coverage. We show that our method effectively reduces redundant Gaussians while delivering high-quality rendering. We also demonstrate an enhanced capability to accommodate scenes with varying levels-of-detail and view-dependent observations, without sacrificing the rendering speed.

Merlin:Empowering Multimodal LLMs with Foresight Minds

  • paper_url: http://arxiv.org/abs/2312.00589
  • repo_url: https://github.com/Ahnsun/merlin
  • paper_authors: En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Wenbing Tao
  • for: 这篇论文是为了探讨如何使现有的多Modal大型语言模型(MLLMs)具备预测未来的能力,以便更好地理解事物的基本原理和行为意图。
  • methods: 该论文提出了两种新的方法来帮助 MLLMs 具备预测能力:Foresight Pre-Training(FPT)和 Foresight Instruction-Tuning(FIT)。FPT 是在不同任务中培养 MLLMs 能够注意和预测整个轨迹的方法,而 FIT 则需要 MLLMs 先预测对象的轨迹,然后根据这些轨迹来理解未来事件的可能性。
  • results: 实验结果表明,通过使用 FPT 和 FIT,建立了一个名为 Merlin 的新的多图像输入 MLLM,可以具备出色的未来预测能力和视觉理解能力。
    Abstract Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term as foresight minds. However, this capability remains largely under explored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind the observed subjects. To address this issue, we introduce the integration of future modeling into the existing learning frameworks of MLLMs. By utilizing the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective, we aim to bridge the gap between the past and the future. We propose two innovative methods to empower MLLMs with foresight minds, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), which are inspired by the modern learning paradigm of LLMs. Specifically, FPT jointly training various tasks centered on trajectories, enabling MLLMs to learn how to attend and predict entire trajectories from a given initial observation. Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them. Aided by FPT and FIT, we build a novel and unified MLLM named Merlin that supports multi-images input and analysis about potential actions of multiple objects for the future reasoning. Experimental results show Merlin powerful foresight minds with impressive performance on both future reasoning and visual comprehension tasks.
    摘要 人类具有预测未来的能力,即叫做前视能力。然而,这种能力在现有的多模态大型自然语言模型(MLLM)中尚未得到充分发挥,这限制了MLLM的学习基本原理和行为目的。为解决这个问题,我们提出将未来预测纳入现有MLLM的学习框架中。通过使用行为轨迹,一种高度结构化的帧序列表示,作为学习目标,我们希望 bridge the gap between the past and the future。我们提出了两种创新的方法,即前视预训练(FPT)和前视指导调整(FIT),它们是基于现代学习 paradigm of LLMs。specifically, FPT通过合并不同任务中心于轨迹,使MLLMs学习如何从给定的初始观察到整个轨迹的attend和预测。然后,FIT要求MLLMs先预测相关对象的轨迹,然后根据它们来理解可能的未来事件。帮助了FPT和FIT,我们构建了一个名为Merlin的新的和统一的MLLM,可以处理多张图像输入和多个对象的未来预测。实验结果表明Merlin具有强大的前视 minds,在未来预测和视觉理解任务中表现出色。

Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

  • paper_url: http://arxiv.org/abs/2311.18729
  • repo_url: https://github.com/YuDeng/Portrait-4D
  • paper_authors: Yu Deng, Duomin Wang, Xiaohang Ren, Xingyu Chen, Baoyuan Wang
  • for: 一shot 4D head synthesis
  • methods: 通过大规模合成数据进行学习:首先利用对抗学习从单目图像学习一个分部件的4D生成模型,用来合成多视角、多身份且包含完整运动的图像作为训练数据;随后利用基于transformer的可动画三平面重建器,在合成数据上学习4D头部重建。
  • results: 实验表明我们的方法优于现有方法。
    Abstract Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction, yet the latter is evenly challenging which restricts them from reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning, to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art.
    摘要 现有的单样本4D头部合成方法通常借助3DMM重建从单目视频中学习,但3DMM重建本身就颇具挑战,限制了合理的4D头部合成。我们提出一种通过大规模合成数据学习单样本4D头部合成的方法:其关键在于先利用对抗学习从单目图像学习一个分部件的4D生成模型,以合成多视角、多身份且包含完整运动的图像作为训练数据;再利用基于transformer的可动画三平面重建器,在合成数据上学习4D头部重建。我们还采用一种新的学习策略,将3D重建与重演(reenactment)的学习过程解耦,以增强对真实图像的泛化能力。实验表明我们的方法优于先前方法。

Improving the Robustness of Quantized Deep Neural Networks to White-Box Attacks using Stochastic Quantization and Information-Theoretic Ensemble Training

  • paper_url: http://arxiv.org/abs/2312.00105
  • repo_url: None
  • paper_authors: Saurabh Farkya, Aswin Raghavan, Avi Ziskind
  • for: The paper aims to improve the robustness of quantized deep neural networks (DNNs) to white-box adversarial attacks.
  • methods: The paper introduces a differentiable Stochastic Quantizer (SQ) to tackle the limitation of deterministic quantization to fixed “bins”. The authors also explore the idea of using different quantizations to collectively improve robustness and learn diverse representations of the input image.
  • results: The paper demonstrates substantial improvement in robustness against $L_\infty$ attacks, with > 50% accuracy to PGD(5/255) on CIFAR10 without adversarial training, compared to vanilla DNNs and existing ensembles of quantized DNNs. The authors also extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP).
    Abstract Most real-world applications that employ deep neural networks (DNNs) quantize them to low precision to reduce the compute needs. We present a method to improve the robustness of quantized DNNs to white-box adversarial attacks. We first tackle the limitation of deterministic quantization to fixed ``bins'' by introducing a differentiable Stochastic Quantizer (SQ). We explore the hypothesis that different quantizations may collectively be more robust than each quantized DNN. We formulate a training objective to encourage different quantized DNNs to learn different representations of the input image. The training objective captures diversity and accuracy via mutual information between ensemble members. Through experimentation, we demonstrate substantial improvement in robustness against $L_\infty$ attacks even if the attacker is allowed to backpropagate through SQ (e.g., > 50\% accuracy to PGD(5/255) on CIFAR10 without adversarial training), compared to vanilla DNNs as well as existing ensembles of quantized DNNs. We extend the method to detect attacks and generate robustness profiles in the adversarial information plane (AIP), towards a unified analysis of different threat models by correlating the MI and accuracy.
    摘要 大多数实际应用在使用深度神经网络(DNN)时,会将其量化为低精度以降低计算需求。我们提出了一种方法,以提高量化 DNN 对白盒对抗攻击的鲁棒性。我们首先通过引入可微分的随机量化器(SQ),解决确定性量化只能映射到固定“分箱”的局限。我们探讨了这样一个假设:不同的量化方式组合起来可能比单个量化 DNN 更加鲁棒。为此,我们构建了一个训练目标,鼓励不同的量化 DNN 学习输入图像的不同表示;该目标通过集成成员之间的互信息同时刻画多样性与准确度。实验表明,即使允许攻击者对 SQ 进行反向传播,我们的方法对 $L_\infty$ 攻击的鲁棒性也有大幅提升(例如在 CIFAR10 上、不使用对抗训练的情况下,对 PGD(5/255) 的准确率超过 50%),优于普通 DNN 以及现有的量化 DNN 集成。我们进一步将该方法扩展到攻击检测,并在对抗信息平面(AIP)中生成鲁棒性曲线,通过关联互信息与准确率,迈向对不同威胁模型的统一分析。
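
A minimal sketch of one way to realize a differentiable stochastic quantizer: stochastic rounding combined with a straight-through gradient estimator. The bit width, value range, and exact formulation here are illustrative assumptions, not necessarily the paper's SQ.

```python
# A differentiable stochastic quantizer via stochastic rounding + straight-through
# estimator (STE). num_bits, value range, and the STE choice are assumptions.
import torch


class StochasticQuantizer(torch.nn.Module):
    def __init__(self, num_bits: int = 4, x_min: float = -1.0, x_max: float = 1.0):
        super().__init__()
        self.levels = 2 ** num_bits - 1
        self.x_min, self.x_max = x_min, x_max

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Map the clamped input onto [0, levels]
        scaled = (x.clamp(self.x_min, self.x_max) - self.x_min) \
                 / (self.x_max - self.x_min) * self.levels
        floor = scaled.floor()
        # Round up with probability equal to the fractional part (stochastic rounding)
        q = floor + (torch.rand_like(scaled) < (scaled - floor)).float()
        # De-quantize back to the original range
        deq = q / self.levels * (self.x_max - self.x_min) + self.x_min
        # Straight-through estimator: forward uses deq, backward passes gradients to x
        return x + (deq - x).detach()


if __name__ == "__main__":
    x = torch.randn(4, requires_grad=True)
    y = StochasticQuantizer(num_bits=3)(x).sum()
    y.backward()
    print(x.grad)  # identity gradient thanks to the STE
```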

Meta-Prior: Meta learning for Adaptive Inverse Problem Solvers

  • paper_url: http://arxiv.org/abs/2311.18710
  • repo_url: None
  • paper_authors: Matthieu Terris, Thomas Moreau
  • for: The paper is written for addressing imaging inverse problems using deep neural networks, specifically in the absence of ground truth data.
  • methods: The proposed method is based on meta-learning, which trains a meta-model on a diverse set of imaging tasks and allows for efficient fine-tuning for specific tasks with few fine-tuning steps. The method uses a bilevel formulation with an outer supervised loss and an inner loss that can be either supervised or unsupervised, relying only on the measurement operator.
  • results: The proposed method is effective in recovering the Bayes optimal estimator in simple settings and demonstrates improved performance on various imaging tasks, including image processing and magnetic resonance imaging.
    Abstract Deep neural networks have become a foundational tool for addressing imaging inverse problems. They are typically trained for a specific task, with a supervised loss to learn a mapping from the observations to the image to recover. However, real-world imaging challenges often lack ground truth data, rendering traditional supervised approaches ineffective. Moreover, for each new imaging task, a new model needs to be trained from scratch, wasting time and resources. To overcome these limitations, we introduce a novel approach based on meta-learning. Our method trains a meta-model on a diverse set of imaging tasks that allows the model to be efficiently fine-tuned for specific tasks with few fine-tuning steps. We show that the proposed method extends to the unsupervised setting, where no ground truth data is available. In its bilevel formulation, the outer level uses a supervised loss, that evaluates how well the fine-tuned model performs, while the inner loss can be either supervised or unsupervised, relying only on the measurement operator. This allows the meta-model to leverage a few ground truth samples for each task while being able to generalize to new imaging tasks. We show that in simple settings, this approach recovers the Bayes optimal estimator, illustrating the soundness of our approach. We also demonstrate our method's effectiveness on various tasks, including image processing and magnetic resonance imaging.
    摘要 深度神经网络已成为解决成像逆问题的基本工具。它们通常针对特定任务训练,使用监督损失学习从观测到待恢复图像的映射。然而,现实中的成像挑战通常缺乏真实的 ground truth 数据,使传统的监督方法失效。此外,每个新的成像任务都需要从零开始训练一个新模型,浪费时间和资源。为了解决这些限制,我们提出了一种基于元学习(meta-learning)的新方法:在多种成像任务上训练一个元模型,使其只需几步微调即可高效适应特定任务。我们表明,该方法可以扩展到无监督设定,即没有任何 ground truth 数据可用的情况。在其双层(bilevel)形式中,外层使用监督损失来评估微调后模型的表现,而内层损失可以是监督的或无监督的,仅依赖测量算子。这使得元模型既能利用每个任务的少量 ground truth 样本,又能泛化到新的成像任务。我们表明,在简单的设定下,该方法能够恢复 Bayes 最优估计器,证明了方法的合理性。我们还在图像处理和磁共振成像等多种任务上验证了方法的有效性。
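
A hedged, first-order sketch of the bilevel idea described above: the inner loss is an unsupervised measurement-consistency term that relies only on the forward operator, while the outer loss is supervised on a few ground-truth samples. The model, operator, and optimization details are toy placeholders, not the paper's implementation.

```python
# First-order bilevel sketch: inner = unsupervised measurement consistency
# ||A(x_hat) - y||^2, outer = supervised loss on a few ground-truth pairs.
# Model, operator A, data, and learning rates are toy assumptions.
import copy
import torch


def inner_loss(model, y, forward_op):
    # Data-fidelity term that only needs the measurement operator (unsupervised)
    return ((forward_op(model(y)) - y) ** 2).mean()


def outer_loss(model, y, x_true):
    # Supervised loss evaluating the fine-tuned model
    return ((model(y) - x_true) ** 2).mean()


def meta_step(meta_model, tasks, inner_steps=3, inner_lr=1e-2, meta_lr=1e-3):
    meta_opt = torch.optim.SGD(meta_model.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    for forward_op, y, x_true in tasks:
        # Inner loop: fine-tune a copy of the meta-model on this task
        task_model = copy.deepcopy(meta_model)
        opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            opt.zero_grad()
            inner_loss(task_model, y, forward_op).backward()
            opt.step()
        # Outer loss; first-order approximation copies its gradients to the meta-model
        grads = torch.autograd.grad(outer_loss(task_model, y, x_true),
                                    task_model.parameters())
        for p, g in zip(meta_model.parameters(), grads):
            p.grad = g.clone() if p.grad is None else p.grad + g
    meta_opt.step()


if __name__ == "__main__":
    meta_model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                                     torch.nn.Linear(32, 16))
    A = torch.eye(16) + 0.1 * torch.randn(16, 16)      # toy linear forward operator
    tasks = [(lambda x, A=A: x @ A.T, torch.randn(8, 16), torch.randn(8, 16))
             for _ in range(4)]
    meta_step(meta_model, tasks)
    print("meta update applied")
```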

Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction

  • paper_url: http://arxiv.org/abs/2311.18695
  • repo_url: None
  • paper_authors: Cheng Sun, Wei-En Tai, Yu-Lin Shih, Kuan-Wei Chen, Yong-Jing Syu, Kent Selwyn The, Yu-Chiang Frank Wang, Hwann-Tzong Chen
  • for: 该 paper 的目的是 reconstruction of single-view 360-degree room layout.
  • methods: 该 paper 使用了一种 differentiable 和 occlusion-aware 的方法,将 2D 的layout segmentation转换为 1D 的 layout depth regression.
  • results: 该 paper 的模型在 benchmarking 中表现出色,significantly outperforms previous arts.
    Abstract State-of-the-art single-view 360-degree room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. On the other hand, traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions, but it requires complex post-processing for the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to render 1D layout depth regression from the 2D segmentation map in a differentiable and occlusion-aware way, marrying the merits of both sides. Specifically, our model predicts floor-plan density for the input equirectangular 360-degree image. Formulating the 2D layout representation as a density field enables us to employ `flattened' volume rendering to form 1D layout depth regression. In addition, we propose a novel 3D warping augmentation on layout to improve generalization. Finally, we re-implement recent room layout reconstruction methods into our codebase for benchmarking and explore modern backbones and training techniques to serve as the strong baseline. Our model significantly outperforms previous arts. The code will be made available upon publication.
    摘要 当前最先进的单视图 360 度房间布局重建方法将该问题表述为高层次的一维(按列)回归任务。另一方面,传统的低层次二维布局分割更容易学习,并且能够表示被遮挡的区域,但它需要复杂的后处理才能得到目标布局多边形,并因此牺牲了精度。我们提出 Seg2Reg,以可微分且考虑遮挡的方式,从二维分割图生成一维布局深度回归,兼具两者的优点。具体来说,我们的模型为输入的等距柱状(equirectangular)360 度图像预测平面布局密度。将二维布局表示为密度场,使我们能够采用“压平”的体渲染来形成一维布局深度回归。此外,我们提出了一种新的布局 3D 扭曲(warping)数据增强方法以提升泛化能力。最后,我们将近期的房间布局重建方法重新实现到我们的代码库中用于基准比较,并探索现代骨干网络与训练技术以建立强基线。我们的模型显著优于先前的方法。代码将在论文发表后公布。
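
A hedged sketch of how a per-column density prediction can be turned into a differentiable 1D layout depth regression via "flattened" volume rendering, in the spirit described above; the sampling scheme and density parameterization are assumptions.

```python
# "Flattened" volume rendering of a per-column density field into a 1D layout
# depth regression. The per-column sampling and the density source are assumptions.
import torch


def render_column_depth(density: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
    """density: (W, N) non-negative densities for N samples along each image column.
    depths:  (N,)   depth value associated with each sample.
    Returns the expected layout depth per column, shape (W,)."""
    alpha = 1.0 - torch.exp(-density)                              # opacity per sample
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), as in volume rendering
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1
    )[:, :-1]
    weights = trans * alpha                                        # (W, N)
    return (weights * depths[None, :]).sum(dim=1)                  # (W,)


if __name__ == "__main__":
    W, N = 1024, 64
    density = torch.rand(W, N, requires_grad=True)   # e.g., from a 2D segmentation head
    depths = torch.linspace(0.1, 10.0, N)
    d = render_column_depth(density, depths)
    d.sum().backward()                               # differentiable w.r.t. the density map
    print(d.shape)
```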

Cascaded Interaction with Eroded Deep Supervision for Salient Object Detection

  • paper_url: http://arxiv.org/abs/2311.18675
  • repo_url: None
  • paper_authors: Hewen Xiao, Jie Mei, Guangfu Ma, Weiren Wu
  • for: 提高显著目标检测(salient object detection)的精度
  • methods: 从网络的两个方向入手:一是带有全局-局部对齐注意力(GAA)引导模块的级联交互网络,二是基于边缘腐蚀的深度监督策略
  • results: 在五个流行的数据集上进行了广泛的实验,显示了方法的优越性
    Abstract Deep convolutional neural networks have been widely applied in salient object detection and have achieved remarkable results in this field. However, existing models suffer from information distortion caused by interpolation during up-sampling and down-sampling. In response to this drawback, this article starts from two directions in the network: feature and label. On the one hand, a novel cascaded interaction network with a guidance module named global-local aligned attention (GAA) is designed to reduce the negative impact of interpolation on the feature side. On the other hand, a deep supervision strategy based on edge erosion is proposed to reduce the negative guidance of label interpolation on lateral output. Extensive experiments on five popular datasets demonstrate the superiority of our method.
    摘要 深度卷积神经网络在突出对象检测领域广泛应用,并取得了显著的成果。然而,现有模型受到 interpolate 操作导致的信息扭曲问题。为了解决这个缺点,本文从网络两个方向进行改进:特征和标签。一方面,我们提出了一种新的卷积交互网络,名为全球本地对齐注意力(GAA),以减少 interpolate 操作对特征的负面影响。另一方面,我们提出了一种深度监视策略,基于边缘腐蚀,以减少标签 interpolate 操作对横向输出的负面指导。我们在五个流行的数据集上进行了广泛的实验,并证明了我们的方法的优越性。

Action Recognition in Video Recordings from Gynecologic Laparoscopy

  • paper_url: http://arxiv.org/abs/2311.18666
  • repo_url: None
  • paper_authors: Sahar Nasirihaghighi, Negin Ghamsarian, Daniela Stefanics, Klaus Schoeffmann, Heinrich Husslein
  • for: 这篇论文是为了自动识别 Laparoscopic 手术中的动作而写的。
  • methods: 这篇论文使用了一种 CNN-RNN 架构和一种适应性训练-推理框架来解决 Laparoscopic 手术中的动作识别挑战。
  • results: 该方法在对 Laparoscopic 手术中的动作识别 task 进行了广泛的实验,并证明了它的超越性。
    Abstract Action recognition is a prerequisite for many applications in laparoscopic video analysis including but not limited to surgical training, operation room planning, follow-up surgery preparation, post-operative surgical assessment, and surgical outcome estimation. However, automatic action recognition in laparoscopic surgeries involves numerous challenges such as (I) cross-action and intra-action duration variation, (II) relevant content distortion due to smoke, blood accumulation, fast camera motions, organ movements, object occlusion, and (III) surgical scene variations due to different illuminations and viewpoints. Besides, action annotations in laparoscopy surgeries are limited and expensive due to requiring expert knowledge. In this study, we design and evaluate a CNN-RNN architecture as well as a customized training-inference framework to deal with the mentioned challenges in laparoscopic surgery action recognition. Using stacked recurrent layers, our proposed network takes advantage of inter-frame dependencies to negate the negative effect of content distortion and variation in action recognition. Furthermore, our proposed frame sampling strategy effectively manages the duration variations in surgical actions to enable action recognition with high temporal resolution. Our extensive experiments confirm the superiority of our proposed method in action recognition compared to static CNNs.
    摘要 腹腔镜(laparoscopic)视频分析中的动作识别是许多应用的先决条件,包括但不限于手术培训、手术室规划、后续手术准备、术后手术评估和手术结果预测。然而,在腹腔镜手术中自动识别动作存在许多挑战,包括(I)动作之间和动作内部的时长变化,(II)由烟雾、血液堆积、快速摄像头运动、器官运动和物体遮挡等引起的相关内容失真,以及(III)不同的照明和视角引起的手术场景变化。此外,腹腔镜手术中的动作标注有限且昂贵,因为需要专家知识。在本研究中,我们设计并评估了一种 CNN-RNN 架构以及一种定制的训练-推理框架,以解决腹腔镜手术动作识别中的上述挑战。我们使用堆叠的循环层,利用帧之间的依赖关系,抵消内容失真和变化对动作识别的负面影响。此外,我们提出的帧采样策略有效地处理手术动作的时长变化,从而实现高时间分辨率的动作识别。大量实验表明,我们提出的方法在动作识别上优于静态 CNN。

Pose Estimation and Tracking for ASIST

  • paper_url: http://arxiv.org/abs/2311.18665
  • repo_url: None
  • paper_authors: Ari Goodman, Gurpreet Singh, Ryan O’Shea, Peter Teague, James Hing
  • for: 该研究旨在提高ASIST系统操作者的 Situational awareness和减少操作者对飞机位置的不确定性,以提高安全降落区域的可能性。
  • methods: 该研究使用了现代计算机视觉算法,如Faster R-CNN和HRNet,来估算飞机的姿态,以及传统的编码器-解码器来估算飞机的方向。
  • results: 研究人员制造出了一个可以跟踪飞机与RSD之间的位置的原型系统,并通过使用现代计算机视觉算法和传统的编码器-解码器来确认飞机的姿态和方向。
    Abstract Aircraft Ship Integrated Secure and Traverse (ASIST) is a system designed to arrest helicopters safely and efficiently on ships. Originally, a precision Helicopter Position Sensing Equipment (HPSE) tracked and monitored the position of the helicopter relative to the Rapid Securing Device (RSD). However, using the HPSE component was determined to be infeasible in the transition of the ASIST system due to the hardware installation requirements. As a result, sailors track the position of the helicopters with their eyes with no sensor or artificially intelligent decision aid. Manually tracking the helicopter takes additional time and makes recoveries more difficult, especially at high sea states. Performing recoveries without the decision aid leads to higher uncertainty and cognitive load. PETA (Pose Estimation and Tracking for ASIST) is a research effort to create a helicopter tracking system prototype without hardware installation requirements for ASIST system operators. Its overall goal is to improve situational awareness and reduce operator uncertainty with respect to the aircrafts position relative to the RSD, and consequently increase the allowable landing area. The authors produced a prototype system capable of tracking helicopters with respect to the RSD. The software included a helicopter pose estimation component, camera pose estimation component, and a user interface component. PETA demonstrated the potential for state-of-the-art computer vision algorithms Faster R-CNN and HRNet (High-Resolution Network) to be used to estimate the pose of helicopters in real-time, returning ASIST to its originally intended capability. PETA also demonstrated that traditional methods of encoder-decoders could be used to estimate the orientation of the helicopter and could be used to confirm the output from HRNet.
    摘要 飞机-舰船集成安全横移系统(ASIST)用于在舰船上安全、高效地固定回收直升机。最初,系统使用精密的直升机位置感知设备(HPSE)跟踪并监测直升机相对于快速固定装置(RSD)的位置。但由于硬件安装要求,HPSE 组件在 ASIST 系统的过渡中被认定为不可行。因此,水手只能靠目视跟踪直升机的位置,没有任何传感器或人工智能决策辅助。手动跟踪直升机需要额外时间,使回收更加困难,特别是在高海况下;没有决策辅助的回收也会带来更高的不确定性和认知负担。PETA(Pose Estimation and Tracking for ASIST)是一项研究工作,旨在为 ASIST 系统操作员创建一个无需硬件安装的直升机跟踪系统原型,其总体目标是提高态势感知、降低操作员对直升机相对于 RSD 位置的不确定性,从而扩大可允许的降落区域。作者们制作了一个能够跟踪直升机相对于 RSD 位置的原型系统,软件包括直升机姿态估计组件、相机姿态估计组件和用户界面组件。PETA 表明,先进的计算机视觉算法 Faster R-CNN 和 HRNet(High-Resolution Network)可用于实时估计直升机的姿态,使 ASIST 恢复其最初设计的能力;PETA 还表明,传统的编码器-解码器方法可用于估计直升机的朝向,并可用来验证 HRNet 的输出。

Learning Part Segmentation from Synthetic Animals

  • paper_url: http://arxiv.org/abs/2311.18661
  • repo_url: None
  • paper_authors: Jiawei Peng, Ju He, Prakhar Kaushik, Zihao Xiao, Jiteng Mu, Alan Yuille
  • for: 本文主要旨在学习动物部件分割,利用 Skinned Multi-Animal Linear(SMAL)模型扩充由 Computer-Aided Design(CAD)动物模型生成的现有合成数据。
  • methods: 本文首次提出了Synthetic Animal Parts(SAP)数据集,并对Syn-to-Real动物部分 segmentation进行了 benchmarking,包括使用现有的semantic segmentation域 adaptation方法和提出了一种Class-Balanced Fourier Data Mixing(CB-FDM)方法来解决Syn-to-Real任务之间的本质差异问题。
  • results: 研究发现,CB-FDM 方法可以在 SynRealPart 任务中显著提升性能,并且从合成老虎和马学习到的部件可以迁移到 PartImageNet 中的所有四足动物类别。
    Abstract Semantic part segmentation provides an intricate and interpretable understanding of an object, thereby benefiting numerous downstream tasks. However, the need for exhaustive annotations impedes its usage across diverse object types. This paper focuses on learning part segmentation from synthetic animals, leveraging the Skinned Multi-Animal Linear (SMAL) models to scale up existing synthetic data generated by computer-aided design (CAD) animal models. Compared to CAD models, SMAL models generate data with a wider range of poses observed in real-world scenarios. As a result, our first contribution is to construct a synthetic animal dataset of tigers and horses with more pose diversity, termed Synthetic Animal Parts (SAP). We then benchmark Syn-to-Real animal part segmentation from SAP to PartImageNet, namely SynRealPart, with existing semantic segmentation domain adaptation methods and further improve them as our second contribution. Concretely, we examine three Syn-to-Real adaptation methods but observe relative performance drop due to the innate difference between the two tasks. To address this, we propose a simple yet effective method called Class-Balanced Fourier Data Mixing (CB-FDM). Fourier Data Mixing aligns the spectral amplitudes of synthetic images with real images, thereby making the mixed images have more similar frequency content to real images. We further use Class-Balanced Pseudo-Label Re-Weighting to alleviate the imbalanced class distribution. We demonstrate the efficacy of CB-FDM on SynRealPart over previous methods with significant performance improvements. Remarkably, our third contribution is to reveal that the learned parts from synthetic tiger and horse are transferable across all quadrupeds in PartImageNet, further underscoring the utility and potential applications of animal part segmentation.
    摘要 语义部件分割能够提供对物体细致且可解释的理解,从而有利于许多下游任务。然而,对详尽标注的需求限制了其在不同物体类型上的应用。这篇论文着眼于从合成动物中学习部件分割,利用 Skinned Multi-Animal Linear(SMAL)模型来扩充由计算机辅助设计(CAD)动物模型生成的现有合成数据。相比 CAD 模型,SMAL 模型生成的数据涵盖了真实场景中观察到的更广泛的姿态。因此,我们的第一个贡献是构建一个姿态更加多样的合成动物数据集,包含老虎和马,称为 Synthetic Animal Parts(SAP)。随后,我们以 SAP 到 PartImageNet 的设置(即 SynRealPart)对合成到真实(Syn-to-Real)动物部件分割进行基准评测,并采用现有的语义分割域适应方法作为对比。我们考察了三种 Syn-to-Real 适应方法,但由于两类任务之间的本质差异,其性能出现相对下降。为此,我们提出了一种简单而有效的方法:类别均衡傅里叶数据混合(Class-Balanced Fourier Data Mixing, CB-FDM)。Fourier Data Mixing 将合成图像的频谱幅值与真实图像对齐,使混合图像的频率内容更接近真实图像;我们进一步使用类别均衡的伪标签重加权来缓解类别分布不均衡的问题。我们在 SynRealPart 上验证了 CB-FDM 相比此前方法带来的显著性能提升。尤为重要的是,我们的第三个贡献揭示了从合成老虎和马学习到的部件可以迁移到 PartImageNet 中的所有四足动物,进一步印证了动物部件分割的实用性与潜在应用。

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

  • paper_url: http://arxiv.org/abs/2311.18651
  • repo_url: https://github.com/open3da/ll3da
  • paper_authors: Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen
  • for: 这篇论文旨在提出一种能够理解、计划和回答在复杂多个Modal中的人机交互应用,具体来说是使用点云作为直接输入,并通过语言模型和计算机视觉模型结合来解决多模态场景中的歧义和干扰问题。
  • methods: 该论文提出了一种名为 LL3DA(Large Language 3D Assistant)的方法,该方法可以将点云作为直接输入,并通过语言模型和计算机视觉模型的结合来回答both textual-instructions和visual-prompts。
  • results: 实验表明,LL3DA可以达到很高的表现,并超过了多种3D视觉语言模型在3D dense captioning和3D问题回答等领域的表现。
    Abstract Recent advances in Large Multimodal Models (LMM) have made it possible for various applications in human-machine interactions. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud 3D representations of the 3D scene. Existing works seek help from multi-view images, and project 2D features to 3D space as 3D scene representations. This, however, leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point cloud as direct input and respond to both textual-instructions and visual-prompts. This help LMMs better comprehend human interactions and further help to remove the ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
    摘要 现代大型多Modal模型(LMM)的进步使得人机交互应用得到了推动。然而,开发能够理解、计划和分析复杂多种3D环境的LMM仍然是一个挑战,尤其是在考虑 permutation-invariant点云3D场景表示时。现有的工作寻求帮助于多视图图像,将2D特征项项目到3D空间作为3D场景表示。这会导致巨大的计算开销和性能下降。在这篇论文中,我们介绍LL3DA,一个大型语言3D助手,可以直接处理点云并响应文本指令和视觉提示。这有助于LMM更好地理解人类交互,并帮助去除拥塞的3D场景中的歧义。实验结果显示,LL3DA达到了很出色的结果,并在3D密集描述和3D问答上超越了多种3D视力语言模型。

Simple Semantic-Aided Few-Shot Learning

  • paper_url: http://arxiv.org/abs/2311.18649
  • repo_url: None
  • paper_authors: Hai Zhang, Junzhe Xu, Shanlin Jiang, Zhenan He
  • for: 本研究旨在提高少量数据下的计算机视觉任务,即几 shot learning 的性能。
  • methods: 本文提出了一种自动生成高质量 semantics 的方法,并使用了一种简单的两层网络(Semantic Alignment Network)将 semantics 和视觉特征转换为robust的类原型。
  • results: 实验结果表明,我们的框架在五个benchmark上都超过了之前的方法, demonstrating 一种简单的网络可以在几 shot classification 任务中击败复杂的多Modal模块。
    Abstract Learning from a limited amount of data, namely Few-Shot Learning, stands out as a challenging computer vision task. Several works exploit semantics and design complicated semantic fusion mechanisms to compensate for rare representative features within restricted data. However, relying on naive semantics such as class names introduces biases due to their brevity, while acquiring extensive semantics from external knowledge takes a huge time and effort. This limitation severely constrains the potential of semantics in few-shot learning. In this paper, we design an automatic way called Semantic Evolution to generate high-quality semantics. The incorporation of high-quality semantics alleviates the need for complex network structures and learning algorithms used in previous works. Hence, we employ a simple two-layer network termed Semantic Alignment Network to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. The experimental results show our framework outperforms all previous methods on five benchmarks, demonstrating a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks.
    摘要 In this paper, we propose an automatic way called Semantic Evolution to generate high-quality semantics. By incorporating these high-quality semantics, we can alleviate the need for complex network structures and learning algorithms used in previous works. We employ a simple two-layer network, called the Semantic Alignment Network, to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification.Experimental results show that our framework outperforms all previous methods on five benchmarks, demonstrating that a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks.
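
A minimal sketch of fusing class-level semantic embeddings with few-shot visual prototypes through a simple two-layer network and classifying queries against the fused prototypes. Dimensions and fusion details are illustrative assumptions, not the paper's exact Semantic Alignment Network.

```python
# Two-layer fusion of class semantics and visual prototypes, then cosine-similarity
# classification of queries. Feature dimensions and the fusion design are assumptions.
import torch
import torch.nn.functional as F


class SemanticAlignmentNet(torch.nn.Module):
    def __init__(self, vis_dim: int = 512, sem_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(vis_dim + sem_dim, hidden),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, vis_dim),
        )

    def forward(self, visual_proto: torch.Tensor, semantic_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate per-class visual prototype and semantic embedding, map to a prototype
        return self.net(torch.cat([visual_proto, semantic_emb], dim=-1))


def classify(query_feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    sims = F.normalize(query_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
    return sims.argmax(dim=-1)


if __name__ == "__main__":
    # Toy 5-way 1-shot episode with random features
    support = torch.randn(5, 512)      # one visual feature per class
    semantics = torch.randn(5, 512)    # class-level semantic embeddings (e.g., from text)
    queries = torch.randn(20, 512)
    protos = SemanticAlignmentNet()(support, semantics)
    print(classify(queries, protos))   # predicted class index per query
```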

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

  • paper_url: http://arxiv.org/abs/2311.18635
  • repo_url: None
  • paper_authors: Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
  • for: 生成高品质3D人物头像,提供直观的姿势和表情控制。
  • methods: 使用扩散基于神经网络的渲染器,利用通用2D规范生成有趣的人脸图像。
  • results: 能为人物的新姿势和表情生成时序一致且视觉效果出色的头像视频,优于现有方法。
    Abstract DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose, we render a neural parametric head model (NPHM) from the target viewpoint, which acts as a proxy geometry of the person. Additionally, to enhance the modeling of intricate facial expressions, we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally, to synthesize consistent surface details across different viewpoints and expressions, we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and corresponding tracked NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person, outperforming existing approaches.
    摘要 DiffusionAvatars 合成了一个高品质的3D头像人,提供了直观的控制方式来调整姿势和表情。我们提议使用扩散基于的神经渲染器,利用通用2D先验来生成吸引人的面孔图像。为了提供粗略的表情和头姿指导,我们从目标视点中渲染了神经 parametric 头部模型(NPHM),作为人体的代理几何体。此外,为了增强表情的细部模拟,我们将DiffusionAvatars 直接通过 NPHM 的表情代码和cross-attention进行条件Rendering。最后,为了在不同视点和表情下保持表面详细的一致性,我们通过 TriPlane lookup 来学习可变的表面特征,并将其绑定到人头的Surface上。我们在RGB视频和对应的跟踪 NPHM 的人体三维模型上训练DiffusionAvatars,并在自reenactment和动画场景中测试其生成的头像人。我们的实验表明,DiffusionAvatars 可以生成新姿势和表情的人头像人,并且在视觉上具有满意的效果,比较出色于现有的方法。

A Lightweight Clustering Framework for Unsupervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.18628
  • repo_url: None
  • paper_authors: Yau Shing Jonathan Cheung, Xi Chen, Lihe Yang, Hengshuang Zhao
  • for: Unsupervised semantic segmentation of images, specifically using a lightweight clustering framework that does not require annotated data.
  • methods: Uses attention features from the self-supervised vision transformer to cluster image patches into distinct groupings, and then extracts patch-level binary pseudo-masks through multilevel clustering consistency.
  • results: Achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
    Abstract Unsupervised semantic segmentation aims to label each pixel of an image to a corresponding class without the use of annotated data. It is a widely researched area as obtaining labeled datasets are expensive. While previous works in the field demonstrated a gradual improvement in segmentation performance, most of them required neural network training. This made segmentation equally expensive, especially when dealing with large-scale datasets. We thereby propose a lightweight clustering framework for unsupervised semantic segmentation. Attention features of the self-supervised vision transformer exhibit strong foreground-background differentiability. By clustering these features into a small number of clusters, we could separate foreground and background image patches into distinct groupings. In our clustering framework, we first obtain attention features from the self-supervised vision transformer. Then we extract Dataset-level, Category-level and Image-level masks by clustering features within the same dataset, category and image. We further ensure multilevel clustering consistency across the three levels and this allows us to extract patch-level binary pseudo-masks. Finally, the pseudo-mask is upsampled, refined and class assignment is performed according to the CLS token of object regions. Our framework demonstrates great promise in unsupervised semantic segmentation and achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
    摘要 Unsupervised semantic segmentation aims to assign each pixel of an image to a corresponding class without using annotated data. This area has been widely researched as obtaining labeled datasets is expensive. Previous works in the field have shown a gradual improvement in segmentation performance, but most of them require neural network training, which can be costly, especially when dealing with large-scale datasets. We propose a lightweight clustering framework for unsupervised semantic segmentation. The attention features of the self-supervised vision transformer are strong in foreground-background differentiability, and by clustering these features into a small number of clusters, we can separate foreground and background image patches into distinct groupings.In our clustering framework, we first obtain attention features from the self-supervised vision transformer. Then, we extract dataset-level, category-level, and image-level masks by clustering features within the same dataset, category, and image. We ensure multilevel clustering consistency across the three levels, which allows us to extract patch-level binary pseudo-masks. Finally, the pseudo-mask is upsampled, refined, and class assignment is performed according to the CLS token of object regions. Our framework shows great promise in unsupervised semantic segmentation and achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
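
A hedged sketch of the patch-level clustering idea: cluster self-supervised ViT patch features into a small number of groups and derive a binary foreground/background pseudo-mask. The feature source and the rule for picking the foreground cluster are simplifying assumptions.

```python
# Cluster ViT patch features into a few groups and form a binary pseudo-mask.
# The "smaller cluster = foreground" rule is a simplifying heuristic/assumption.
import numpy as np
from sklearn.cluster import KMeans


def patch_pseudo_mask(patch_feats: np.ndarray, grid_hw: tuple, n_clusters: int = 2) -> np.ndarray:
    """patch_feats: (H*W, D) per-patch features from a self-supervised ViT.
    Returns a binary (H, W) foreground pseudo-mask."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(patch_feats)
    # Heuristic: treat the smaller cluster as foreground (objects cover fewer patches)
    fg_label = int(np.argmin(np.bincount(labels, minlength=n_clusters)))
    return (labels == fg_label).astype(np.uint8).reshape(grid_hw)


if __name__ == "__main__":
    H, W, D = 14, 14, 384               # e.g., ViT-S/16 attention features on a 224x224 image
    feats = np.random.randn(H * W, D)   # placeholder for real patch features
    mask = patch_pseudo_mask(feats, (H, W))
    print(mask.shape, int(mask.sum()))
```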

JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation

  • paper_url: http://arxiv.org/abs/2311.18618
  • repo_url: None
  • paper_authors: Shishir Muralidhara, Sravan Kumar Jagadeesh, René Schuster, Didier Stricker
  • for: 提供场景的多粒度语义理解,包括语义区域、物体实例和语义部件的同时预测。
  • methods: 提出了 Joint Panoptic Part Fusion(JPPF)方法,通过有效地组合三个独立的分割结果来获得 panoptic-part segmentation。
  • results: 在Cityscapes Panoptic Parts(CPP)和Pascal Panoptic Parts(PPP)数据集上进行了广泛的实验,并证明了我们的公平 fusion的重要性,特别是在可以进一步 segmentation的区域上。无需 fine-tuning,我们的设计在5个额外数据集上也有良好的普适性。
    Abstract Part-aware panoptic segmentation is a problem of computer vision that aims to provide a semantic understanding of the scene at multiple levels of granularity. More precisely, semantic areas, object instances, and semantic parts are predicted simultaneously. In this paper, we present our Joint Panoptic Part Fusion (JPPF) that combines the three individual segmentations effectively to obtain a panoptic-part segmentation. Two aspects are of utmost importance for this: First, a unified model for the three problems is desired that allows for mutually improved and consistent representation learning. Second, balancing the combination so that it gives equal importance to all individual results during fusion. Our proposed JPPF is parameter-free and dynamically balances its input. The method is evaluated and compared on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in terms of PartPQ and Part-Whole Quality (PWQ). In extensive experiments, we verify the importance of our fair fusion, highlight its most significant impact for areas that can be further segmented into parts, and demonstrate the generalization capabilities of our design without fine-tuning on 5 additional datasets.
    摘要 部件感知全景分割(part-aware panoptic segmentation)是计算机视觉中的一个问题,旨在以多个粒度层次对场景进行语义理解。更准确地说,需要同时预测语义区域、物体实例和语义部件。在本文中,我们提出了 Joint Panoptic Part Fusion(JPPF),它能有效地融合这三种分割结果,以获得全景-部件分割。其中两个方面至关重要:第一,需要一个针对三个问题的统一模型,以实现相互促进且一致的表示学习;第二,在融合时需要平衡,使所有单项结果获得同等的重要性。我们提出的 JPPF 无需额外参数,并能动态地平衡其输入。我们在 Cityscapes Panoptic Parts(CPP)和 Pascal Panoptic Parts(PPP)数据集上以 PartPQ 和 Part-Whole Quality(PWQ)为指标进行了评估和比较。在大量实验中,我们验证了公平融合的重要性,强调其对可进一步分解为部件的区域影响最大,并在无需微调的情况下,在 5 个额外数据集上证明了我们设计的泛化能力。

Anatomy and Physiology of Artificial Intelligence in PET Imaging

  • paper_url: http://arxiv.org/abs/2311.18614
  • repo_url: None
  • paper_authors: Tyler J. Bradshaw, Alan B. McMillan
  • for: 本文旨在为核医学领域内的人工智能应用提供一份图文导论,帮助读者了解现代AI的核心原则,特别是在PET成像中可能遇到的部分。
  • methods: 本文使用的方法包括卷积神经网络、算法训练和U-Net Segmentation和图像生成的组件。
  • results: 本文通过图文导论的方式,帮助读者了解现代AI的核心原则,并提供了PET成像中可能遇到的AI应用的示例。
    Abstract The influence of artificial intelligence (AI) within the field of nuclear medicine has been rapidly growing. Many researchers and clinicians are seeking to apply AI within PET, and clinicians will soon find themselves engaging with AI-based applications all along the chain of molecular imaging, from image reconstruction to enhanced reporting. This expanding presence of AI in PET imaging will result in greater demand for educational resources for those unfamiliar with AI. The objective of this article to is provide an illustrated guide to the core principles of modern AI, with specific focus on aspects that are most likely to be encountered in PET imaging. We describe convolutional neural networks, algorithm training, and explain the components of the commonly used U-Net for segmentation and image synthesis.
    摘要 人工智能(AI)在核医学领域的影响正在快速增长。许多研究人员和临床医生正在尝试将AI应用于PET影像,而且未来,临床医生会在分子成像链中与AI应用程序互动,从图像重建到增强报告。随着AI在PET影像领域的扩大存在,需要更多的教育资源来帮助不熟悉AI的人士了解这些技术。本文的目标是提供一份图文导论,概述现代AI的核心原理,特别是在PET影像领域最可能遇到的方面。我们介绍了卷积神经网络、算法训练和通用的U-Netsegmentation和图像生成组件。

Cancer-Net PCa-Gen: Synthesis of Realistic Prostate Diffusion Weighted Imaging Data via Anatomic-Conditional Controlled Latent Diffusion

  • paper_url: http://arxiv.org/abs/2311.18612
  • repo_url: None
  • paper_authors: Aditya Sridhar, Chi-en Amy Tai, Hayden Gunraj, Yuhao Chen, Alexander Wong
  • for: The paper aims to generate realistic prostate diffusion-weighted imaging (DWI) data to aid in the diagnosis, prognosis, and treatment planning of prostate cancer.
  • methods: The authors propose an anatomic-conditional controlled latent diffusion strategy, called Cancer-Net PCa-Gen, to generate diverse prostate images with controllable tumor locations and improved anatomical and textural fidelity.
  • results: The proposed method enhances the synthesis of diverse prostate images, which can be used to augment real patient data and train neural networks on a more diverse and comprehensive data distribution. The Cancer-Net PCa-Gen framework and sample images have been made publicly available at https://www.kaggle.com/datasets/deetsadi/cancer-net-pca-gen-dataset for further research and development.
    Abstract In Canada, prostate cancer is the most common form of cancer in men and accounted for 20% of new cancer cases for this demographic in 2022. Due to recent successes in leveraging machine learning for clinical decision support, there has been significant interest in the development of deep neural networks for prostate cancer diagnosis, prognosis, and treatment planning using diffusion weighted imaging (DWI) data. A major challenge hindering widespread adoption in clinical use is poor generalization of such networks due to scarcity of large-scale, diverse, balanced prostate imaging datasets for training such networks. In this study, we explore the efficacy of latent diffusion for generating realistic prostate DWI data through the introduction of an anatomic-conditional controlled latent diffusion strategy. To the best of the authors' knowledge, this is the first study to leverage conditioning for synthesis of prostate cancer imaging. Experimental results show that the proposed strategy, which we call Cancer-Net PCa-Gen, enhances synthesis of diverse prostate images through controllable tumour locations and better anatomical and textural fidelity. These crucial features make it well-suited for augmenting real patient data, enabling neural networks to be trained on a more diverse and comprehensive data distribution. The Cancer-Net PCa-Gen framework and sample images have been made publicly available at https://www.kaggle.com/datasets/deetsadi/cancer-net-pca-gen-dataset as a part of a global open-source initiative dedicated to accelerating advancement in machine learning to aid clinicians in the fight against cancer.
    摘要 在加拿大,前列腺癌是男性中最常见的癌症,占 2022 年该人群新发癌症病例的 20%。由于近期机器学习在临床决策支持方面取得的成功,人们对利用扩散加权成像(DWI)数据开发深度神经网络用于前列腺癌诊断、预后评估和治疗规划产生了浓厚兴趣。然而,由于缺乏大规模、多样且均衡的前列腺影像数据集用于训练,这类网络的泛化能力较差,阻碍了其在临床中的广泛应用。在本研究中,我们通过引入一种解剖条件控制的潜在扩散策略,探索利用潜在扩散生成逼真前列腺 DWI 数据的有效性。据作者所知,这是首个利用条件生成来合成前列腺癌影像的研究。实验结果表明,我们提出的策略(称为 Cancer-Net PCa-Gen)通过可控的肿瘤位置以及更好的解剖与纹理保真度,提升了多样前列腺图像的合成。这些关键特性使其非常适合扩充真实患者数据,使神经网络能够在更多样、更全面的数据分布上进行训练。Cancer-Net PCa-Gen 框架与样例图像已在 https://www.kaggle.com/datasets/deetsadi/cancer-net-pca-gen-dataset 公开发布,作为一项旨在加速机器学习进展、帮助临床医生对抗癌症的全球开源计划的一部分。

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

  • paper_url: http://arxiv.org/abs/2311.18610
  • repo_url: None
  • paper_authors: Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai
  • for: 从RGB图像中探测3D结构,以实现场景中3D物体基于图像的有效、高效表示。
  • methods: 我们提出了DiffCAD,首个弱监督概率方法,可以从RGB图像中提取CAD模型。我们将这视为一个Conditional生成任务,通过填充来学习潜在的probabilistic模型,捕捉图像中CAD对象的形状、orientation和Scale。这使得我们可以生成多种可能性的CAD重建,只需要几个假设来捕捉深度/比例和形状匹配的不确定性。
  • results: 我们的方法可以在不同的Target域上进行零基础适应,并且可以超过基于精心标注的supervised状态的前景。我们的多个假设方法可以在Scan2CAD数据集上提高了5.9%。
    Abstract Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.
    Abstract Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task -- both in depth-scale ambiguity in monocular perception, as well as inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.

Learning Triangular Distribution in Visual World

  • paper_url: http://arxiv.org/abs/2311.18605
  • repo_url: None
  • paper_authors: Ping Chen, Xingpeng Zhang, Chengtao Zhou, Dichao Fan, Peng Tu, Le Zhang, Yanlin Qian
  • for: 本研究旨在解决 Label Distribution Learning 中的问题,包括如何准确地将特征与标签之间的差异映射到标签之间的差异中。
  • methods: 本研究使用了一种名为 Triangular Distribution Transform (TDT) 的普遍和简单的框架,以建立特征和标签之间的唯一函数关系,使得任何对称的特征差异 linearly 反映标签之间的差异。
  • results: 在 Facial Age Recognition、Illumination Chromaticity Estimation 和 Aesthetics assessment 等任务上,TDT 可以与先前艺术达到或更好的结果。
    Abstract Convolution neural network is successful in pervasive vision tasks, including label distribution learning, which usually takes the form of learning an injection from the non-linear visual features to the well-defined labels. However, how the discrepancy between features is mapped to the label discrepancy is ambient, and its correctness is not guaranteed. To address these problems, we study the mathematical connection between feature and its label, presenting a general and simple framework for label distribution learning. We propose a so-called Triangular Distribution Transform (TDT) to build an injective function between feature and label, guaranteeing that any symmetric feature discrepancy linearly reflects the difference between labels. The proposed TDT can be used as a plug-in in mainstream backbone networks to address different label distribution learning tasks. Experiments on Facial Age Recognition, Illumination Chromaticity Estimation, and Aesthetics assessment show that TDT achieves on-par or better results than the prior arts.
    摘要 卷积神经网络在许多视觉任务中取得了成功,其中包括标签分布学习,后者通常表现为学习一个从非线性视觉特征到明确定义标签的映射。然而,特征之间的差异如何映射为标签之间的差异并不明确,其正确性也无法得到保证。为了解决这些问题,我们研究了特征与其标签之间的数学联系,并提出了一个通用且简单的标签分布学习框架。我们提出了所谓的三角分布变换(TDT),用以在特征与标签之间建立单射函数,保证任何对称的特征差异线性地反映标签之间的差异。所提出的 TDT 可以作为插件用于主流骨干网络,以处理不同的标签分布学习任务。在人脸年龄识别、光照色度估计和美学评估上的实验表明,TDT 取得了与先前方法相当或更好的结果。

Identifying tourist destinations from movie scenes using Deep Learning

  • paper_url: http://arxiv.org/abs/2312.00098
  • repo_url: None
  • paper_authors: Mahendran Narayanan
  • for: 这个论文的目的是探讨电影对旅游业的影响,并提出一种基于深度学习的方法来识别电影中出现的旅游景点。
  • methods: 该论文提出了一种基于深度学习的方法,通过训练一个大型旅游景点世界各地的数据集,以识别电影中出现的旅游景点。
  • results: 该论文的研究目标是帮助观众通过电影场景中的地标建议旅游经验,这种方法可以帮助旅游业增加收益。
    Abstract Movies wield significant influence in our lives, playing a pivotal role in the tourism industry of any country. The inclusion of picturesque landscapes, waterfalls, and mountains as backdrops in films serves to enhance the allure of specific scenarios. Recognizing the impact of movies on tourism, this paper introduces a method for identifying tourist destinations featured in films. We propose the development of a deep learning model capable of recognizing these locations during movie viewing. The model is trained on a dataset comprising major tourism destinations worldwide. Through this research, the goal is to enable viewers to identify the real-world locations depicted in movie scenes, offering a novel way to connect cinema with global travel experiences.
    摘要 电影在我们生活中具有很大的影响力,对任何国家的旅游业发挥着重要作用。电影中的美丽景色、瀑布和山峰作为背景,可以增强特定情境的吸引力。我们认为电影对旅游业的影响,因此我们提出了一种可以在电影观看过程中识别旅游景点的深入学习模型的想法。这个模型通过训练包括全球主要旅游景点的数据集来培育。通过这项研究,我们希望能够让观众在电影场景中识别真实的地方,为电影与旅游经验提供一种新的连接点。

Overcoming Label Noise for Source-free Unsupervised Video Domain Adaptation

  • paper_url: http://arxiv.org/abs/2311.18572
  • repo_url: None
  • paper_authors: Avijit Dasgupta, C. V. Jawahar, Karteek Alahari
  • for: addressing the challenge of source-dependent video domain adaptation by developing a self-training based source-free approach.
  • methods: using the source pre-trained model to generate pseudo-labels for the target domain samples, treating the problem as learning from noisy labels, and leveraging a teacher-student framework to improve adaptation performance.
  • results: achieving state-of-the-art results on various open datasets, outperforming existing approaches.
    Abstract Despite the progress seen in classification methods, current approaches for handling videos with distribution shifts in source and target domains remain source-dependent as they require access to the source data during the adaptation stage. In this paper, we present a self-training based source-free video domain adaptation approach to address this challenge by bridging the gap between the source and the target domains. We use the source pre-trained model to generate pseudo-labels for the target domain samples, which are inevitably noisy. Thus, we treat the problem of source-free video domain adaptation as learning from noisy labels and argue that the samples with correct pseudo-labels can help us in adaptation. To this end, we leverage the cross-entropy loss as an indicator of the correctness of the pseudo-labels and use the resulting small-loss samples from the target domain for fine-tuning the model. We further enhance the adaptation performance by implementing a teacher-student framework, in which the teacher, which is updated gradually, produces reliable pseudo-labels. Meanwhile, the student undergoes fine-tuning on the target domain videos using these generated pseudo-labels to improve its performance. Extensive experimental evaluations show that our methods, termed as CleanAdapt, CleanAdapt + TS, achieve state-of-the-art results, outperforming the existing approaches on various open datasets. Our source code is publicly available at https://avijit9.github.io/CleanAdapt.
    摘要 尽管分类方法已取得进展,但现有处理源域与目标域分布偏移的视频方法仍依赖源数据,因为它们在适应阶段需要访问源数据。本文提出一种基于自训练的无源视频域适应方法,通过弥合源域与目标域之间的差距来应对这一挑战。我们使用源域预训练模型为目标域样本生成伪标签,而这些标签不可避免地带有噪声。因此,我们将无源视频域适应问题视为从噪声标签中学习,并认为带有正确伪标签的样本能够帮助我们完成适应。为此,我们利用交叉熵损失作为伪标签正确性的指标,并使用目标域中损失较小的样本对模型进行微调。我们还通过教师-学生框架进一步提升适应性能:教师模型逐步更新并生成可靠的伪标签,学生模型则利用这些伪标签在目标域视频上微调以提升性能。大量实验评估表明,我们的方法(CleanAdapt、CleanAdapt + TS)在多个公开数据集上取得了最先进的结果,超越了现有方法。我们的源代码公开于 https://avijit9.github.io/CleanAdapt。
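
A minimal sketch of the small-loss selection idea described above: keep only the target-domain samples whose cross-entropy under the current model is lowest, treating their pseudo-labels as likely correct. The selection ratio is an assumed hyperparameter, and the teacher-student part is omitted.

```python
# Small-loss selection over pseudo-labeled target samples: keep the fraction of
# samples with the lowest cross-entropy. keep_ratio is an assumed hyperparameter.
import torch
import torch.nn.functional as F


def select_small_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor,
                      keep_ratio: float = 0.5) -> torch.Tensor:
    """logits: (N, C) model outputs, pseudo_labels: (N,). Returns kept sample indices."""
    losses = F.cross_entropy(logits, pseudo_labels, reduction="none")
    k = max(1, int(keep_ratio * losses.numel()))
    return torch.topk(-losses, k).indices      # indices of the smallest losses


if __name__ == "__main__":
    logits = torch.randn(100, 12)              # e.g., 12 action classes on target videos
    pseudo = logits.argmax(dim=1)              # pseudo-labels from a source-pretrained model
    kept = select_small_loss(logits, pseudo, keep_ratio=0.3)
    print(kept.shape)                          # these samples would be used for fine-tuning
```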

Seam-guided local alignment and stitching for large parallax images

  • paper_url: http://arxiv.org/abs/2311.18564
  • repo_url: https://github.com/tlliao/Seam-guided-local-alignment
  • paper_authors: Tianli Liao, Chenyang Zhao, Lei Li, Heling Cao
  • for: 这篇论文研究大视差图像拼接中由接缝质量评估引导的局部对齐与融合。
  • methods: 该方法首先使用现有的图像对齐与接缝切割方法计算初始接缝,并评估接缝上像素的质量;然后,对于低质量像素,在已对齐图像中分离其所在的图像块,并通过 SIFT flow 提取修正后的稠密对应关系进行局部对齐;最后,通过接缝切割合成对齐后的图像块,并将其合并到初始对齐结果中,生成最终的拼接图像。
  • results: 与当前最先进的接缝切割方法相比,该方法的结果更加可信,伪影更少。
    Abstract Seam-cutting methods have been proven effective in the composition step of image stitching, especially for images with parallax. However, the effectiveness of seam-cutting usually depends on that images can be roughly aligned such that there exists a local region where a plausible seam can be found. For images with large parallax, current alignment methods often fall short of expectations. In this paper, we propose a local alignment and stitching method guided by seam quality evaluation. First, we use existing image alignment and seam-cutting methods to calculate an initial seam and evaluate the quality of pixels along the seam. Then, for pixels with low qualities, we separate their enclosing patches in the aligned images and locally align them by extracting modified dense correspondences via SIFT flow. Finally, we composite the aligned patches via seam-cutting and merge them into the original aligned result to generate the final mosaic. Experiments show that compared with the state-of-the-art seam-cutting methods, our result is more plausible and with fewer artifacts. The code will be available at https://github.com/tlliao/Seam-guided-local-alignment.
    摘要 接缝切割方法已被证明在图像拼接的合成步骤中十分有效,尤其是对存在视差的图像。然而,接缝切割的效果通常依赖于图像能够被大致对齐,从而存在一个可以找到合理接缝的局部区域。对于大视差图像,现有的对齐方法往往难以达到预期。本文提出一种由接缝质量评估引导的局部对齐与拼接方法。首先,我们使用现有的图像对齐与接缝切割方法计算初始接缝,并评估接缝上像素的质量;然后,对于质量较低的像素,我们在已对齐图像中分离其所在的图像块,并通过 SIFT flow 提取修正后的稠密对应关系进行局部对齐;最后,我们通过接缝切割合成对齐后的图像块,并将其合并到初始对齐结果中,生成最终的全景拼接图。实验表明,与当前最先进的接缝切割方法相比,我们的结果更加可信、伪影更少。代码将在 https://github.com/tlliao/Seam-guided-local-alignment 公开。

Periodic Vibration Gaussian: Dynamic Urban Scene Reconstruction and Real-time Rendering

  • paper_url: http://arxiv.org/abs/2311.18561
  • repo_url: https://github.com/fudan-zvg/PVG
  • paper_authors: Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, Li Zhang
  • for: The paper aims to address the challenges of modeling dynamic, large-scale urban scenes, which are characterized by highly intricate geometric structures and unconstrained dynamics in both space and time.
  • methods: The proposed method, called Periodic Vibration Gaussian (PVG), builds upon the efficient 3D Gaussian splatting technique and introduces periodic vibration-based temporal dynamics to capture the synergistic interactions between objects and elements in dynamic urban scenes. Additionally, the method includes a novel flow-based temporal smoothing mechanism and a position-aware adaptive control strategy to enhance temporally coherent representation learning with sparse training data.
  • results: Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes, without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG achieves 50/6000-fold acceleration in training/rendering over the best alternative.
    Abstract Modeling dynamic, large-scale urban scenes is challenging due to their highly intricate geometric structures and unconstrained dynamics in both space and time. Prior methods often employ high-level architectural priors, separating static and dynamic elements, resulting in suboptimal capture of their synergistic interactions. To address this challenge, we present a unified representation model, called Periodic Vibration Gaussian (PVG). PVG builds upon the efficient 3D Gaussian splatting technique, originally designed for static scene representation, by introducing periodic vibration-based temporal dynamics. This innovation enables PVG to elegantly and uniformly represent the characteristics of various objects and elements in dynamic urban scenes. To enhance temporally coherent representation learning with sparse training data, we introduce a novel flow-based temporal smoothing mechanism and a position-aware adaptive control strategy. Extensive experiments on Waymo Open Dataset and KITTI benchmarks demonstrate that PVG surpasses state-of-the-art alternatives in both reconstruction and novel view synthesis for both dynamic and static scenes. Notably, PVG achieves this without relying on manually labeled object bounding boxes or expensive optical flow estimation. Moreover, PVG exhibits 50/6000-fold acceleration in training/rendering over the best alternative.
    摘要 建模动态、大规模的城市场景是一项挑战,因为其几何结构高度复杂,并且在空间和时间上都存在不受约束的动态变化。先前的方法通常采用高层的结构先验,将静态和动态元素分离开来,导致无法充分刻画二者之间的协同作用。为了解决这一挑战,我们提出了一种统一表示模型,即周期振动高斯(PVG)。PVG 基于原本为静态场景表示设计的高效 3D 高斯泼溅技术,通过引入基于周期振动的时间动态对其进行扩展。这一创新使得 PVG 能够优雅且统一地表示动态城市场景中各种物体与元素的特性。为了在稀疏训练数据下增强时序一致的表示学习,我们引入了一种新的基于流的时间平滑机制和一种位置感知的自适应控制策略。在 Waymo Open Dataset 和 KITTI 基准上的大量实验表明,PVG 在动态与静态场景的重建和新视角合成方面均超越了当前最先进的替代方法,而且无需手动标注的物体包围框或昂贵的光流估计。此外,PVG 在训练/渲染上相比最优替代方法实现了 50/6000 倍的加速。
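
A hedged sketch of giving each 3D Gaussian a periodic, vibration-like motion over time: its center oscillates around a mean position with learnable amplitude, period, and phase. This is one illustrative parameterization, not the paper's exact model, and the splatting/rendering pipeline is omitted.

```python
# Each Gaussian center oscillates around a mean position: mu + A * sin(2*pi*(t - phi)/T).
# The parameterization and initialization are illustrative assumptions.
import torch


class PeriodicGaussianCenters(torch.nn.Module):
    def __init__(self, num_gaussians: int):
        super().__init__()
        self.mean = torch.nn.Parameter(torch.randn(num_gaussians, 3))
        self.amplitude = torch.nn.Parameter(0.01 * torch.randn(num_gaussians, 3))
        self.period = torch.nn.Parameter(torch.ones(num_gaussians, 1))
        self.phase = torch.nn.Parameter(torch.zeros(num_gaussians, 1))

    def forward(self, t: float) -> torch.Tensor:
        """Return the Gaussian centers at time t, shape (N, 3)."""
        angle = 2.0 * torch.pi * (t - self.phase) / self.period.clamp(min=1e-3)
        return self.mean + self.amplitude * torch.sin(angle)


if __name__ == "__main__":
    centers = PeriodicGaussianCenters(num_gaussians=1000)
    print(centers(0.0).shape, centers(0.5).shape)   # centers move periodically over time
```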

FediOS: Decoupling Orthogonal Subspaces for Personalization in Feature-skew Federated Learning

  • paper_url: http://arxiv.org/abs/2311.18559
  • repo_url: None
  • paper_authors: Lingzhi Gao, Zexi Li, Yang Lu, Chao Wu
  • for: 提高个性化本地模型的能力
  • methods: 提出一种新的 Architecture Decoupling 设计,使用两个特征提取器(一个普适特征提取器和一个个性化特征提取器)和一个共享预测头来实现异构特征分解。
  • results: 在四个视觉数据集上进行了广泛的实验,并在特征偏移(feature skew)异质情况下达到了最先进的个性化联邦学习性能。
    Abstract Personalized federated learning (pFL) enables collaborative training among multiple clients to enhance the capability of customized local models. In pFL, clients may have heterogeneous (also known as non-IID) data, which poses a key challenge in how to decouple the data knowledge into generic knowledge for global sharing and personalized knowledge for preserving local personalization. A typical way of pFL focuses on label distribution skew, and they adopt a decoupling scheme where the model is split into a common feature extractor and two prediction heads (generic and personalized). However, such a decoupling scheme cannot solve the essential problem of feature skew heterogeneity, because a common feature extractor cannot decouple the generic and personalized features. Therefore, in this paper, we rethink the architecture decoupling design for feature-skew pFL and propose an effective pFL method called FediOS. In FediOS, we reformulate the decoupling into two feature extractors (generic and personalized) and one shared prediction head. Orthogonal projections are used for clients to map the generic features into one common subspace and scatter the personalized features into different subspaces to achieve decoupling for them. In addition, a shared prediction head is trained to balance the importance of generic and personalized features during inference. Extensive experiments on four vision datasets demonstrate our method reaches state-of-the-art pFL performances under feature skew heterogeneity.
    摘要 To address this issue, we propose FediOS, a pFL method that rethinks the architecture decoupling design for feature-skew pFL. FediOS uses two feature extractors (generic and personalized) and one shared prediction head to decouple the features. Orthogonal projections are used for clients to map generic features into one common subspace and scatter personalized features into different subspaces, achieving decoupling. Additionally, the shared prediction head is trained to balance the importance of generic and personalized features during inference.Our extensive experiments on four vision datasets demonstrate that FediOS reaches state-of-the-art pFL performances under feature skew heterogeneity.
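
A hedged sketch of decoupling features into a shared ("generic") subspace and its orthogonal complement ("personalized") via orthogonal projection, which is the geometric idea referenced above. How the basis is learned and shared across clients in the actual method is not shown; the fixed random basis here is an assumption.

```python
# Split features into a shared subspace (generic) and its orthogonal complement
# (personalized) with a projector P = B @ B.T built from an orthonormal basis B.
import torch


def orthonormal_basis(dim: int, k: int, seed: int = 0) -> torch.Tensor:
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(dim, k, generator=g))
    return q                                        # (dim, k), orthonormal columns


def decouple(features: torch.Tensor, basis: torch.Tensor):
    """features: (N, dim). Returns (generic, personalized); generic lies in span(basis)."""
    proj = basis @ basis.T                          # projector onto the shared subspace
    generic = features @ proj
    personalized = features - generic               # orthogonal complement
    return generic, personalized


if __name__ == "__main__":
    feats = torch.randn(8, 256)
    B = orthonormal_basis(256, 64)
    g, p = decouple(feats, B)
    # The two parts are (numerically) orthogonal for every sample
    print(torch.abs((g * p).sum(dim=1)).max())
```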

Heterogeneous Graph-based Trajectory Prediction using Local Map Context and Social Interactions

  • paper_url: http://arxiv.org/abs/2311.18553
  • repo_url: None
  • paper_authors: Daniel Grimm, Maximilian Zipfl, Felix Hertlein, Alexander Naumann, Jürgen Lüttin, Steffen Thoma, Stefan Schmid, Lavdim Halilaj, Achim Rettinger, J. Marius Zöllner
  • for: 预测周围交通参与者的未来轨迹,以便实现自动驾驶。
  • methods: 使用基于向量的方法,通过语义场景图建模交通参与者之间的关系,提取以智能体为中心的基于图像的局部地图特征,并生成锚定路径,以约束多模态预测中的策略仅输出允许的轨迹。
  • results: 相比基eline模型HoliGraph,该方法显示出较好的性能。
    Abstract Precisely predicting the future trajectories of surrounding traffic participants is a crucial but challenging problem in autonomous driving, due to complex interactions between traffic agents, map context and traffic rules. Vector-based approaches have recently shown to achieve among the best performances on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but don't distinguish between relation-type and attributes like their distance along the road. Furthermore, they represent lanes only by sequences of vectors representing center lines and ignore context information like lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information: First, we model interactions between traffic agents by a semantic scene graph, that accounts for the nature and important features of their relation. Second, we extract agent-centric image-based map features to model the local map context. Finally, we generate anchor paths to enforce the policy in multi-modal prediction to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.
    摘要 精确预测周围交通参与者的未来轨迹是自动驾驶中一个重要而具有挑战性的问题,因为交通参与者之间的交互、地图环境和交通规则都十分复杂。基于向量的方法近期在轨迹预测基准上取得了最好的表现之一。这些方法能建模交通参与者之间的简单交互,但无法区分关系类型及其属性(例如它们沿道路的距离)。此外,它们仅用表示中心线的向量序列来表示车道,忽略了车道分隔线等其他道路元素的上下文信息。我们提出了一种新的基于向量的轨迹预测方法,通过利用三类关键信息来弥补上述不足:第一,我们使用语义场景图建模交通参与者之间的交互,刻画其关系的性质与重要特征;第二,我们提取以智能体为中心的基于图像的地图特征,以建模局部地图环境;第三,我们生成锚定路径,以约束多模态预测中的策略仅输出允许的轨迹。每一项改进都相对于基线模型 HoliGraph 展现出优势。

SparseDC: Depth Completion from sparse and non-uniform inputs

  • paper_url: http://arxiv.org/abs/2312.00097
  • repo_url: https://github.com/whu-usi3dv/sparsedc
  • paper_authors: Chen Long, Wenxiao Zhang, Zhe Chen, Haiping Wang, Yuan Liu, Zhen Cao, Zhen Dong, Bisheng Yang
  • for: This paper is written for completing depth maps with poor quality in real-world usage.
  • methods: The paper proposes a two-branch feature embedder with an uncertainty-based fusion module to handle sparse and non-uniform depth inputs.
  • results: The proposed method, SparseDC, demonstrates robustness in handling sparse and non-uniform depth inputs and outperforms previous methods in real-world scenarios.
    Abstract We propose SparseDC, a model for Depth Completion of Sparse and non-uniform depth inputs. Unlike previous methods focusing on completing fixed distributions on benchmark datasets (e.g., NYU with 500 points, KITTI with 64 lines), SparseDC is specifically designed to handle depth maps with poor quality in real usage. The key contributions of SparseDC are two-fold. First, we design a simple strategy, called SFFM, to improve the robustness under sparse input by explicitly filling the unstable depth features with stable image features. Second, we propose a two-branch feature embedder to predict both the precise local geometry of regions with available depth values and accurate structures in regions with no depth. The key of the embedder is an uncertainty-based fusion module called UFFM to balance the local and long-term information extracted by CNNs and ViTs. Extensive indoor and outdoor experiments demonstrate the robustness of our framework when facing sparse and non-uniform input depths. The pre-trained model and code are available at https://github.com/WHU-USI3DV/SparseDC.
    摘要 我们提出 SparseDC,一种针对稀疏且非均匀深度输入的深度补全模型。与以往专注于在基准数据集上补全固定分布(例如 NYU 的 500 个点、KITTI 的 64 线)的方法不同,SparseDC 专为处理实际使用中质量较差的深度图而设计。SparseDC 的贡献主要有两点:第一,我们设计了一种简单的策略 SFFM,通过用稳定的图像特征显式填充不稳定的深度特征,提升稀疏输入下的鲁棒性;第二,我们提出一种双分支特征嵌入器,既能预测有深度值区域的精确局部几何,也能预测无深度值区域的准确结构。该嵌入器的关键是一个基于不确定性的融合模块 UFFM,用于平衡 CNN 与 ViT 提取的局部信息和长程信息。大量室内外实验表明,我们的框架在面对稀疏且非均匀的输入深度时具有良好的鲁棒性。预训练模型和代码可在 https://github.com/WHU-USI3DV/SparseDC 获取。
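The two-branch embedder described in the abstract above hinges on an uncertainty-based fusion of local CNN features and long-range ViT features. The snippet below is a minimal PyTorch sketch of such a fusion gate; the layer sizes, the sigmoid-based uncertainty heads and the normalization scheme are illustrative assumptions rather than the released UFFM implementation.

```python
# Minimal sketch of an uncertainty-weighted fusion of local (CNN) and
# long-range (ViT-like) features, in the spirit of SparseDC's UFFM.
# Shapes, layer sizes and the gating design are illustrative assumptions.
import torch
import torch.nn as nn

class UncertaintyFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel uncertainty (0..1) for each branch.
        self.unc_cnn = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.unc_vit = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_cnn: torch.Tensor, feat_vit: torch.Tensor) -> torch.Tensor:
        # Lower uncertainty -> higher weight; weights are normalized per pixel.
        conf_cnn = 1.0 - self.unc_cnn(feat_cnn)
        conf_vit = 1.0 - self.unc_vit(feat_vit)
        total = conf_cnn + conf_vit + 1e-6
        fused = (conf_cnn / total) * feat_cnn + (conf_vit / total) * feat_vit
        return self.out(fused)

if __name__ == "__main__":
    fusion = UncertaintyFusion(channels=64)
    local_feat = torch.randn(2, 64, 60, 80)   # CNN branch features
    global_feat = torch.randn(2, 64, 60, 80)  # ViT branch features (same resolution)
    print(fusion(local_feat, global_feat).shape)  # torch.Size([2, 64, 60, 80])
```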

OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

  • paper_url: http://arxiv.org/abs/2312.00096
  • repo_url: https://github.com/tomchen-ctj/OST
  • paper_authors: Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen
  • for: 本研究旨在提高视频识别的普适性,通过强调文本知识的改进来解决视频数据的扩展性和缺乏文本描述所带来的限制。
  • methods: 我们提出了一种新的方法,即使用大型自然语言模型(LLM)来增强动作类名称,并将其转换为空间时间描述符(STD),以填充文本缺乏的问题。此外,我们还提出了一种优化描述符解决方案(Optimal Descriptor Solver),以匹配最佳描述符与视频实例。
  • results: 我们的方法在零shot、几shot和完全监督视频识别中进行了广泛的评估,并取得了出色的效果。我们最佳模型在Kinetics-600上达到了75.1%的零shot准确率,创下了state-of-the-art记录。
    Abstract Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names, leading to less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors with different video instances, we propose Optimal Descriptor Solver, forming the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
    摘要 由于在大规模视频数据上训练视觉-语言模型的开销巨大,大多数研究都着眼于将预训练的图像-语言模型迁移到视频领域。主流方案通过附加的时序学习模块来弥补视觉差异,却忽视了网络规模的描述性文本与简短的动作类别名称之间的巨大差异,导致语义空间区分度不足,限制了性能。在本工作中,我们优先改进文本知识,以促进可泛化的视频识别。针对类别名称语义空间区分度不足的问题,我们利用大语言模型(LLM)将动作类别名称扩充为时空描述符(Spatio-Temporal Descriptors),从而弥合文本差异,并作为通用识别的知识库。此外,为了给不同的视频实例分配最合适的描述符,我们提出最优描述符求解器(Optimal Descriptor Solver),将视频识别问题建模为帧级表示与描述符之间的最优匹配流求解。在零样本、少样本和全监督视频识别上的全面评估验证了该方法的有效性。我们的最佳模型在 Kinetics-600 上取得 75.1% 的零样本准确率,刷新了最新记录。
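The Optimal Descriptor Solver described above casts recognition as an optimal matching flow between frame-level representations and descriptors. A common way to realize such a matching is entropic optimal transport solved with Sinkhorn iterations; the sketch below follows that route under assumed cosine costs and uniform marginals, and is not taken from the released code.

```python
# Sketch of an optimal matching flow between frame features and spatio-temporal
# descriptors via entropic optimal transport (Sinkhorn iterations). The cost
# definition, marginals and hyper-parameters are illustrative assumptions.
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """cost: (T, D) cost between T frames and D descriptors -> transport plan (T, D)."""
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    u = torch.ones(cost.size(0)) / cost.size(0)     # uniform marginal over frames
    v = torch.ones(cost.size(1)) / cost.size(1)     # uniform marginal over descriptors
    a, b = torch.ones_like(u), torch.ones_like(v)
    for _ in range(iters):
        a = u / (K @ b + 1e-9)
        b = v / (K.t() @ a + 1e-9)
    return a.unsqueeze(1) * K * b.unsqueeze(0)      # transport plan

if __name__ == "__main__":
    frames = torch.nn.functional.normalize(torch.randn(8, 512), dim=-1)        # 8 frame embeddings
    descriptors = torch.nn.functional.normalize(torch.randn(12, 512), dim=-1)  # 12 descriptor embeddings
    cost = 1.0 - frames @ descriptors.t()           # cosine distance as the matching cost
    plan = sinkhorn(cost)
    score = (plan * (frames @ descriptors.t())).sum()  # matching score used for recognition
    print(plan.shape, float(score))
```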

Match me if you can: Semantic Correspondence Learning with Unpaired Images

  • paper_url: http://arxiv.org/abs/2311.18540
  • repo_url: None
  • paper_authors: Jiwon Kim, Byeongho Heo, Sangdoo Yun, Seungryong Kim, Dongyoon Han
  • for: 本文提出了一种简单 yet effective的方法,用于提高 semantic correspondence 的性能。这种方法不需要Extra labeled keypoints 或 trainable modules,可以使用 unlabeled pairs 进行训练。
  • methods: 本文提出了一种 teacher-student 框架,通过机器监督来提供可靠的 pseudo correspondences 给学生网络。此外,本文还提出了一种迭代训练方法,使用学生网络来生成高精度的标签,并重新训练一个新的学生网络。
  • results: 根据实验结果,本文的模型可以超越当今领先方法,包括 state-of-the-art 方法在 semantic correspondence benchmarks 上。
    Abstract Recent approaches for semantic correspondence have focused on obtaining high-quality correspondences using a complicated network, refining the ambiguous or noisy matching points. Despite their performance improvements, they remain constrained by the limited training pairs due to costly point-level annotations. This paper proposes a simple yet effective method that performs training with unlabeled pairs to complement both limited image pairs and sparse point pairs, requiring neither extra labeled keypoints nor trainable modules. We fundamentally extend the data quantity and variety by augmenting new unannotated pairs not primitively provided as training pairs in benchmarks. Using a simple teacher-student framework, we offer reliable pseudo correspondences to the student network via machine supervision. Finally, the performance of our network is steadily improved by the proposed iterative training, putting back the student as a teacher to generate refined labels and train a new student repeatedly. Our models outperform the milestone baselines, including state-of-the-art methods on semantic correspondence benchmarks.
    摘要 近期的语义对应方法侧重于通过复杂的网络获得高质量对应,并对模糊或含噪的匹配点进行细化。尽管性能有所提升,但由于点级标注成本高昂,它们仍受限于有限的训练样本对。本文提出一种简单而有效的方法,利用无标注的图像对进行训练,以同时弥补有限的图像对和稀疏的点对,既不需要额外标注的关键点,也不需要可训练的模块。我们通过引入基准中原本未作为训练对提供的新的无标注图像对,从根本上扩充了数据的数量与多样性。借助简单的师生框架,我们通过机器监督为学生网络提供可靠的伪对应。最后,通过所提出的迭代训练,将学生重新作为教师生成更精细的标签并训练新的学生,模型性能得以持续提升。我们的模型在语义对应基准上超越了包括 state-of-the-art 方法在内的里程碑式基线。

MaXTron: Mask Transformer with Trajectory Attention for Video Panoptic Segmentation

  • paper_url: http://arxiv.org/abs/2311.18537
  • repo_url: https://github.com/tacju/maxtron
  • paper_authors: Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
  • for: 这篇论文面向视频全景分割任务,即在视频中对物体与背景进行一致的分割,并随时间跟踪物体的变化。
  • methods: 论文提出 MaXTron 方法,利用 Mask XFormer 与轨迹注意力来解决视频全景分割问题。MaXTron 借助轨迹注意力提升时间一致性,并在片段内(within-clip)与跨片段(cross-clip)模块中高效地使用轨迹注意力。
  • results: 实验结果表明,MaXTron 无需任何附加技巧即可在视频分割基准上达到 state-of-the-art 性能。
    Abstract Video panoptic segmentation requires consistently segmenting (for both `thing' and `stuff' classes) and tracking objects in a video over time. In this work, we present MaXTron, a general framework that exploits Mask XFormer with Trajectory Attention to tackle the task. MaXTron enriches an off-the-shelf mask transformer by leveraging trajectory attention. The deployed mask transformer takes as input a short clip consisting of only a few frames and predicts the clip-level segmentation. To enhance the temporal consistency, MaXTron employs within-clip and cross-clip tracking modules, efficiently utilizing trajectory attention. Originally designed for video classification, trajectory attention learns to model the temporal correspondences between neighboring frames and aggregates information along the estimated motion paths. However, it is nontrivial to directly extend trajectory attention to the per-pixel dense prediction tasks due to its quadratic dependency on input size. To alleviate the issue, we propose to adapt the trajectory attention for both the dense pixel features and object queries, aiming to improve the short-term and long-term tracking results, respectively. Particularly, in our within-clip tracking module, we propose axial-trajectory attention that effectively computes the trajectory attention for tracking dense pixels sequentially along the height- and width-axes. The axial decomposition significantly reduces the computational complexity for dense pixel features. In our cross-clip tracking module, since the object queries in mask transformer are learned to encode the object information, we are able to capture the long-term temporal connections by applying trajectory attention to object queries, which learns to track each object across different clips. Without bells and whistles, MaXTron demonstrates state-of-the-art performances on video segmentation benchmarks.
    摘要 视频全景分割要求对视频中的 'thing' 与 'stuff' 类别进行一致的分割,并随时间跟踪物体。在本工作中,我们提出 MaXTron,一个利用 Mask XFormer 与轨迹注意力来解决该任务的通用框架。MaXTron 通过轨迹注意力增强现成的掩码 Transformer:该掩码 Transformer 以仅包含少量帧的短片段为输入,预测片段级分割结果。为了增强时间一致性,MaXTron 采用片段内与跨片段跟踪模块,高效地利用轨迹注意力。轨迹注意力最初为视频分类而设计,用于建模相邻帧之间的时间对应关系,并沿估计的运动路径聚合信息;但由于其计算量随输入规模呈二次增长,直接将其扩展到逐像素的稠密预测任务并不容易。为此,我们分别针对稠密像素特征与物体查询适配轨迹注意力,以改善短期与长期跟踪效果。具体而言,在片段内跟踪模块中,我们提出轴向轨迹注意力(axial-trajectory attention),沿高度轴与宽度轴依次计算稠密像素的轨迹注意力,轴向分解显著降低了稠密像素特征的计算复杂度;在跨片段跟踪模块中,由于掩码 Transformer 中的物体查询已学习编码物体信息,我们对物体查询施加轨迹注意力,从而捕捉长期时间关联,实现跨片段的物体跟踪。在不依赖任何附加技巧的情况下,MaXTron 在视频分割基准上取得了 state-of-the-art 的性能。

Revisiting Proposal-based Object Detection

  • paper_url: http://arxiv.org/abs/2311.18512
  • repo_url: None
  • paper_authors: Aritra Bhowmik, Martin R. Oswald, Pascal Mettes, Cees G. M. Snoek
  • For: This paper revisits the pipeline for detecting objects in images with proposals, with the goal of improving the accuracy and efficiency of object detection methods.
  • Methods: The paper proposes a simple yet effective alternative to the common approach of directly maximizing the overlap between proposal and ground truth boxes. Instead, the paper regresses to the area of intersection between proposal and ground truth, allowing each proposal to specify which part contains the object without needing to regress beyond its visual scope.
  • Results: The paper shows that the proposed intersection-based regression and grouping approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of this alternative approach.
    Abstract This paper revisits the pipeline for detecting objects in images with proposals. For any object detector, the obtained box proposals or queries need to be classified and regressed towards ground truth boxes. The common solution for the final predictions is to directly maximize the overlap between each proposal and the ground truth box, followed by a winner-takes-all ranking or non-maximum suppression. In this work, we propose a simple yet effective alternative. For proposal regression, we solve a simpler problem where we regress to the area of intersection between proposal and ground truth. In this way, each proposal only specifies which part contains the object, avoiding a blind inpainting problem where proposals need to be regressed beyond their visual scope. In turn, we replace the winner-takes-all strategy and obtain the final prediction by taking the union over the regressed intersections of a proposal group surrounding an object. Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method. We show that our approach directly improves canonical object detection and instance segmentation architectures, highlighting the utility of intersection-based regression and grouping.
    摘要 本文重新审视了基于候选框(proposal)的图像目标检测流程。对于任何目标检测器,得到的候选框或查询都需要被分类,并向真实框回归。常见做法是直接最大化每个候选框与真实框的重叠度,再配合"赢者通吃"的排序或非极大值抑制得到最终预测。本文提出一种简单而有效的替代方案:在候选框回归中,我们改为回归候选框与真实框的交集区域。这样,每个候选框只需指明其中包含物体的部分,避免了需要回归到其视野之外的"盲目补全"问题。相应地,我们不再采用赢者通吃策略,而是对围绕同一物体的一组候选框,取其回归得到的交集区域的并集作为最终预测。该改进对检测流程的改动极小,可以直接嵌入任何现有方法。实验表明,我们的方法能够直接提升经典的目标检测与实例分割架构,体现了基于交集的回归与分组的价值。
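The two ideas in the abstract above, regressing each proposal to its intersection with the ground truth and taking the union over a proposal group's regressed intersections, can be illustrated with plain box arithmetic. The NumPy sketch below uses (x1, y1, x2, y2) boxes and idealized regression outputs purely for illustration; it is not the authors' implementation.

```python
# Sketch of (i) intersection-based regression targets and (ii) forming the final
# prediction as the union (enclosing box) of the regressed intersections of a
# proposal group. Boxes are (x1, y1, x2, y2).
import numpy as np

def intersection_box(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Intersection of two boxes; collapses to a degenerate box if they do not overlap."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return np.array([x1, y1, max(x1, x2), max(y1, y2)], dtype=float)

def union_box(boxes: np.ndarray) -> np.ndarray:
    """Smallest box enclosing all (regressed) intersections of a proposal group."""
    return np.array([boxes[:, 0].min(), boxes[:, 1].min(),
                     boxes[:, 2].max(), boxes[:, 3].max()])

if __name__ == "__main__":
    gt = np.array([10, 10, 50, 60], dtype=float)
    proposals = np.array([[5, 5, 30, 40],      # covers the top-left part of the object
                          [25, 20, 70, 70],    # covers the bottom-right part
                          [12, 15, 48, 55]], dtype=float)
    # Training target for each proposal: its intersection with the ground truth.
    targets = np.stack([intersection_box(p, gt) for p in proposals])
    # At test time, assume the regressed intersections equal the targets here.
    prediction = union_box(targets)
    print(targets)
    print(prediction)   # approaches the ground-truth box as group coverage improves
```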

DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution

  • paper_url: http://arxiv.org/abs/2311.18508
  • repo_url: None
  • paper_authors: Axi Niu, Kang Zhang, Joshua Tian Jin Tee, Trung X. Pham, Jinqiu Sun, Chang D. Yoo, In So Kweon, Yanning Zhang
  • for: 提高基于 GAN 的单图像超分辨(SR)方法的图像质量
  • methods: 使用 diffusion 风格的数据增强策略改善判别器的校准
  • results: 相比现有方法取得了更高的 SR 性能,并且可以即插即用地应用于现有的基于 GAN 的 SR 方法
  • for: To improve the image quality of GAN-based SR methods
  • methods: Using a diffusion-style data augmentation scheme to improve the calibration of the discriminator
  • results: Obtained higher SR performance compared to existing methods, and can be applied to existing GAN-based SR methods
    Abstract It is well known that the adversarial optimization of GAN-based image super-resolution (SR) methods makes the preceding SR model generate unpleasant and undesirable artifacts, leading to large distortion. We attribute the cause of such distortions to the poor calibration of the discriminator, which hampers its ability to provide meaningful feedback to the generator for learning high-quality images. To address this problem, we propose a simple but non-trivial diffusion-style data augmentation scheme for current GAN-based SR methods, known as DifAugGAN. It involves adapting the diffusion process in generative diffusion models for improving the calibration of the discriminator during training, motivated by the successes of data augmentation schemes in the field to achieve good calibration. Our DifAugGAN can be a Plug-and-Play strategy for current GAN-based SISR methods to improve the calibration of the discriminator and thus improve SR performance. Extensive experimental evaluations demonstrate the superiority of DifAugGAN over state-of-the-art GAN-based SISR methods across both synthetic and real-world datasets, showcasing notable advancements in both qualitative and quantitative results.
    摘要 众所周知,基于 GAN 的图像超分辨(SR)方法中的对抗优化会使前置的 SR 模型产生令人不适的伪影,造成较大的失真。我们将这种失真归因于判别器校准不佳,使其无法为生成器学习高质量图像提供有意义的反馈。为了解决这一问题,我们为现有基于 GAN 的 SR 方法提出了一种简单而不平凡的 diffusion 风格数据增强方案,称为 DifAugGAN。受数据增强在改善校准方面成功经验的启发,该方案借鉴生成扩散模型中的扩散过程,在训练中改善判别器的校准。DifAugGAN 可作为即插即用策略应用于现有基于 GAN 的单图像超分辨方法,以改善判别器校准并进而提升 SR 性能。大量实验评估表明,DifAugGAN 在合成与真实数据集上均优于 state-of-the-art 的基于 GAN 的单图像超分辨方法,在定性与定量结果上都有显著提升。
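One way to realize a diffusion-style augmentation for the discriminator, as described above, is to perturb both real HR images and generated SR images with the DDPM forward process at a randomly drawn (small) timestep before they are scored. The sketch below assumes a linear beta schedule and a cap on the timestep; neither is claimed to match the paper's exact settings.

```python
# Sketch of diffusion-style augmentation for discriminator inputs: real and fake
# images are perturbed with q(x_t | x_0) at the same random timestep per sample.
# The schedule and timestep cap are illustrative assumptions.
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffuse(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Apply q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I) per sample."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

def discriminator_inputs(real_hr: torch.Tensor, fake_sr: torch.Tensor, t_max: int = 200):
    """Perturb real and fake images with the same randomly drawn small timesteps."""
    t = torch.randint(0, t_max, (real_hr.size(0),))
    return diffuse(real_hr, t), diffuse(fake_sr, t)

if __name__ == "__main__":
    real = torch.rand(4, 3, 64, 64) * 2 - 1   # images scaled to [-1, 1]
    fake = torch.rand(4, 3, 64, 64) * 2 - 1
    real_aug, fake_aug = discriminator_inputs(real, fake)
    print(real_aug.shape, fake_aug.shape)
```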

Accurate Segmentation of Optic Disc And Cup from Multiple Pseudo-labels by Noise-Aware Learning

  • paper_url: http://arxiv.org/abs/2311.18496
  • repo_url: https://github.com/wwwtttjjj/mpnn
  • paper_authors: Tengjin Weng, Yang Shen, Zhidong Zhao, Zhiming Cheng, Shuai Wang
  • for: 这篇论文针对青光眼筛查与诊断中的视盘(optic disc)与视杯(cup)分割问题提出了新的解决方案。
  • methods: 论文提出了一种创新的标签去噪方法,名为多伪标签噪声感知网络(Multiple Pseudo-labels Noise-aware Network,MPNN)。它先用多个不同初始化的网络在真实标签上训练并生成多个伪标签,再利用从这些伪标签中提取的像素级一致性信息来区分干净像素与噪声像素,并在师生框架下分别从两类像素中学习分割。
  • results: 实验结果表明,MPNN 在视盘与视杯分割任务上优于其他标签去噪方法,具有出色的分割性能和显著的去噪能力。
    Abstract Optic disc and cup segmentation play a crucial role in automating the screening and diagnosis of optic glaucoma. While data-driven convolutional neural networks (CNNs) show promise in this area, the inherent ambiguity of segmenting object and background boundaries in the task of optic disc and cup segmentation leads to noisy annotations that impact model performance. To address this, we propose an innovative label-denoising method of Multiple Pseudo-labels Noise-aware Network (MPNN) for accurate optic disc and cup segmentation. Specifically, the Multiple Pseudo-labels Generation and Guided Denoising (MPGGD) module generates pseudo-labels by multiple different initialization networks trained on true labels, and the pixel-level consensus information extracted from these pseudo-labels guides to differentiate clean pixels from noisy pixels. The training framework of the MPNN is constructed by a teacher-student architecture to learn segmentation from clean pixels and noisy pixels. Particularly, such a framework adeptly leverages (i) reliable and fundamental insights from clean pixels and (ii) the supplementary knowledge within noisy pixels via multiple perturbation-based unsupervised consistency. Compared to other label-denoising methods, comprehensive experimental results on the RIGA dataset demonstrate our method's excellent performance and significant denoising ability.
    摘要 视盘与视杯分割在青光眼的自动筛查与诊断中起着关键作用。尽管数据驱动的卷积神经网络(CNN)在该领域展现出潜力,但在视盘与视杯分割任务中,物体与背景边界的固有模糊性会带来含噪标注,从而影响模型性能。为此,我们提出一种创新的标签去噪方法——多伪标签噪声感知网络(MPNN),用于精确的视盘与视杯分割。具体而言,多伪标签生成与引导去噪(MPGGD)模块利用多个在真实标签上训练、初始化各不相同的网络生成伪标签,并从这些伪标签中提取像素级一致性信息,以区分干净像素与噪声像素。MPNN 的训练框架采用师生结构,分别从干净像素与噪声像素中学习分割:一方面利用干净像素提供的可靠而基础的信息,另一方面通过基于多重扰动的无监督一致性挖掘噪声像素中的补充知识。与其他标签去噪方法相比,在 RIGA 数据集上的全面实验结果表明,我们的方法性能优异且去噪能力显著。
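The pixel-level consensus idea above can be sketched as a simple voting rule over the pseudo-labels produced by differently initialized networks: pixels where the predictions (nearly) agree are treated as clean, the rest as noisy. The agreement threshold and the number of pseudo-labels below are illustrative assumptions, not the paper's exact rule.

```python
# Sketch of consensus-based clean / noisy pixel splitting from multiple pseudo-labels.
import torch

def consensus_split(pseudo_labels: torch.Tensor, agree_ratio: float = 1.0):
    """pseudo_labels: (K, H, W) integer maps from K differently initialized networks.
    Returns a boolean clean-pixel mask and the majority-vote label map."""
    majority, _ = torch.mode(pseudo_labels, dim=0)                       # per-pixel majority vote
    agreement = (pseudo_labels == majority.unsqueeze(0)).float().mean(dim=0)
    clean_mask = agreement >= agree_ratio                                # unanimous by default
    return clean_mask, majority

if __name__ == "__main__":
    torch.manual_seed(0)
    labels = torch.randint(0, 3, (4, 64, 64))   # 4 pseudo-labels, 3 classes (disc / cup / background)
    clean, vote = consensus_split(labels, agree_ratio=0.75)
    print(clean.float().mean().item(), vote.shape)
```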

Improving Adversarial Transferability via Model Alignment

  • paper_url: http://arxiv.org/abs/2311.18495
  • repo_url: None
  • paper_authors: Avery Ma, Amir-massoud Farahmand, Yangchen Pan, Philip Torr, Jindong Gu
  • for: 提高给定源模型生成可迁移对抗扰动的能力
  • methods: 使用模型对齐技术微调源模型的参数,以最小化其与另一个独立训练的见证模型(witness model)在预测上的差异
  • results: 在 ImageNet 数据集和多种模型架构上,经过对齐的源模型所生成的对抗扰动显示出显著更高的可迁移性。
    Abstract Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
    摘要 神经网络容易受到可在不同模型之间迁移的对抗扰动的影响。本文提出一种新颖的模型对齐技术,旨在提升给定源模型生成可迁移对抗扰动的能力。在对齐过程中,源模型的参数被微调,以最小化一个对齐损失;该损失衡量源模型与另一个独立训练的模型(称为见证模型)在预测上的差异。为了理解模型对齐的作用,我们对由此带来的损失景观变化进行了几何分析。在 ImageNet 数据集上针对多种模型架构的大量实验表明,由对齐后的源模型生成的扰动比原始源模型的扰动具有显著更高的可迁移性。
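A minimal sketch of the alignment objective described above: the source model is fine-tuned so that its predictive distribution moves toward that of a fixed, independently trained witness model. The KL direction, optimizer and the toy linear classifiers below are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of fine-tuning a source model to match a frozen witness model's predictions.
import torch
import torch.nn.functional as F

def alignment_loss(source_logits: torch.Tensor, witness_logits: torch.Tensor) -> torch.Tensor:
    """KL(witness || source) averaged over the batch; the witness is held fixed."""
    log_p_source = F.log_softmax(source_logits, dim=-1)
    p_witness = F.softmax(witness_logits.detach(), dim=-1)
    return F.kl_div(log_p_source, p_witness, reduction="batchmean")

if __name__ == "__main__":
    source = torch.nn.Linear(32, 10)                 # stand-ins for the two classifiers
    witness = torch.nn.Linear(32, 10)
    opt = torch.optim.SGD(source.parameters(), lr=0.1)
    x = torch.randn(16, 32)
    for _ in range(100):
        loss = alignment_loss(source(x), witness(x))
        opt.zero_grad(); loss.backward(); opt.step()
    print(float(loss))   # decreases as the source predictions align with the witness
```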

PRS: Sharp Feature Priors for Resolution-Free Surface Remeshing

  • paper_url: http://arxiv.org/abs/2311.18494
  • repo_url: https://github.com/artonson/prs
  • paper_authors: Natalia Soboleva, Olga Gorbunova, Maria Ivanova, Evgeny Burnaev, Matthias Nießner, Denis Zorin, Alexey Artemov
  • for: 面向保留几何特征的曲面重建这一具有挑战性的计算机视觉任务。
  • methods: 采用数据驱动的方法进行特征检测与重新网格化(remeshing),仅需以粗糙、带锯齿的网格作为输入,并可扩展到任意分辨率的重建。
  • results: 在 ABC 数据集的高分辨率形状重建上,相比 state-of-the-art 方法,法向 F-score 提升 26%,感知指标 $\text{RMSE}_{\text{v}}$ 提升 42%。
    Abstract Surface reconstruction with preservation of geometric features is a challenging computer vision task. Despite significant progress in implicit shape reconstruction, state-of-the-art mesh extraction methods often produce aliased, perceptually distorted surfaces and lack scalability to high-resolution 3D shapes. We present a data-driven approach for automatic feature detection and remeshing that requires only a coarse, aliased mesh as input and scales to arbitrary resolution reconstructions. We define and learn a collection of surface-based fields to (1) capture sharp geometric features in the shape with an implicit vertexwise model and (2) approximate improvements in normals alignment obtained by applying edge-flips with an edgewise model. To support scaling to arbitrary complexity shapes, we learn our fields using local triangulated patches, fusing estimates on complete surface meshes. Our feature remeshing algorithm integrates the learned fields as sharp feature priors and optimizes vertex placement and mesh connectivity for maximum expected surface improvement. On a challenging collection of high-resolution shape reconstructions in the ABC dataset, our algorithm improves over state-of-the-art by 26% normals F-score and 42% perceptual $\text{RMSE}_{\text{v}}$.
    摘要 在保留几何特征的前提下进行曲面重建是一项具有挑战性的计算机视觉任务。尽管隐式形状重建取得了显著进展,现有的网格提取方法仍常常产生带锯齿、感知失真的曲面,且难以扩展到高分辨率三维形状。我们提出一种数据驱动的自动特征检测与重新网格化方法,只需一个粗糙、带锯齿的网格作为输入,即可扩展到任意分辨率的重建。我们定义并学习一组基于曲面的场:(1)通过隐式的逐顶点模型捕捉形状中的尖锐几何特征;(2)通过逐边模型近似执行边翻转(edge-flip)所带来的法向对齐改进。为了支持任意复杂度的形状,我们在局部三角化的面片上学习这些场,并在完整曲面网格上融合估计结果。我们的特征重网格化算法将学习到的场作为尖锐特征先验,并优化顶点位置与网格连接关系,以最大化期望的曲面改进。在 ABC 数据集中具有挑战性的高分辨率形状重建集合上,我们的算法相比 state-of-the-art 方法将法向 F-score 提升 26%,感知指标 $\text{RMSE}_{\text{v}}$ 提升 42%。

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

  • paper_url: http://arxiv.org/abs/2311.18482
  • repo_url: None
  • paper_authors: Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, Shao-Hua Guan
  • for: scene understanding tasks such as object localization and segmentation
  • methods: novel embedding procedure and dedicated quantization scheme
  • results: best visual quality and language querying accuracy, while maintaining real-time rendering frame rates
    Abstract Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high accuracy query, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
    摘要 在三维空间中进行开放词汇查询虽具挑战,却是物体定位与分割等场景理解任务所必需的。语言嵌入式场景表示通过将语言特征引入三维空间取得了进展,但其效果严重依赖于训练和渲染开销巨大的神经网络。尽管近期的 3D 高斯能够实现高效且高质量的新视角合成,直接在其上嵌入语言特征会带来过高的显存占用并降低性能。在本工作中,我们提出语言嵌入式 3D 高斯(Language Embedded 3D Gaussians),一种面向开放词汇查询任务的新型场景表示。我们并不在 3D 高斯上直接嵌入高维的原始语义特征,而是提出一种专门的量化方案以大幅降低显存需求,并设计一种新的嵌入流程,在应对多视角特征不一致与点状表示的高频归纳偏置的同时,实现更平滑且高精度的查询。全面的实验表明,在现有语言嵌入式表示中,我们的方法取得了最佳的视觉质量与语言查询精度,同时在单张桌面级 GPU 上保持实时渲染帧率。

Mixture of Gaussian-distributed Prototypes with Generative Modelling for Interpretable Image Classification

  • paper_url: http://arxiv.org/abs/2312.00092
  • repo_url: None
  • paper_authors: Chong Wang, Yuanhong Chen, Fengbei Liu, Davis James McCarthy, Helen Frazer, Gustavo Carneiro
  • for: 提高分类预测的可解释性:以生成式方法学习各类别的原型分布,并将分类预测与原型分布相关联,从而提供直观的决策解释。
  • methods: 提出生成式学习方法——高斯分布原型混合模型(Mixture of Gaussian-distributed Prototypes,MGProto),用高斯混合模型(GMM)表示每个类别的原型分布,使每个学习到的原型拥有一定的变化度量,自然缓解稀疏性;同时在 GMM 优化中加入原型多样性目标以减少冗余,并对先验较低的高斯原型进行裁剪以提升紧凑性。
  • results: 在 CUB-200-2011、Stanford Cars、Stanford Dogs 和 Oxford-IIIT Pets 等数据集上,MGProto 取得了 state-of-the-art 的分类与分布外(OoD)检测性能,同时具有令人满意的可解释性。
    Abstract Prototypical-part interpretable methods, e.g., ProtoPNet, enhance interpretability by connecting classification predictions to class-specific training prototypes, thereby offering an intuitive insight into their decision-making. Current methods rely on a discriminative classifier trained with point-based learning techniques that provide specific values for prototypes. Such prototypes have relatively low representation power due to their sparsity and potential redundancy, with each prototype containing no variability measure. In this paper, we present a new generative learning of prototype distributions, named Mixture of Gaussian-distributed Prototypes (MGProto), which are represented by Gaussian mixture models (GMM). Such an approach enables the learning of more powerful prototype representations since each learned prototype will own a measure of variability, which naturally reduces the sparsity given the spread of the distribution around each prototype, and we also integrate a prototype diversity objective function into the GMM optimisation to reduce redundancy. Incidentally, the generative nature of MGProto offers a new and effective way for detecting out-of-distribution samples. To improve the compactness of MGProto, we further propose to prune Gaussian-distributed prototypes with a low prior. Experiments on CUB-200-2011, Stanford Cars, Stanford Dogs, and Oxford-IIIT Pets datasets show that MGProto achieves state-of-the-art classification and OoD detection performances with encouraging interpretability results.
    摘要 基于原型部件的可解释方法(如 ProtoPNet)通过将分类预测与特定类别的训练原型相关联来增强可解释性,从而为其决策提供直观的洞察。现有方法依赖于以点估计方式训练的判别式分类器,为原型给出具体取值;这类原型由于稀疏且可能冗余,且每个原型不包含变化度量,表示能力相对有限。本文提出一种新的原型分布生成式学习方法——高斯分布原型混合模型(MGProto),用高斯混合模型(GMM)表示原型分布。这种方式能够学习到表达能力更强的原型表示:每个学习到的原型都拥有变化度量,分布在原型周围的扩散自然缓解了稀疏性;我们还在 GMM 优化中加入原型多样性目标函数以减少冗余。此外,MGProto 的生成式特性也为检测分布外样本提供了一种新的有效途径。为了提升 MGProto 的紧凑性,我们进一步提出裁剪先验较低的高斯原型。在 CUB-200-2011、Stanford Cars、Stanford Dogs 和 Oxford-IIIT Pets 数据集上的实验表明,MGProto 取得了 state-of-the-art 的分类与分布外检测性能,并获得了令人鼓舞的可解释性结果。
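Classifying with Gaussian-distributed prototypes can be sketched by treating each class as a small Gaussian mixture over the feature space and scoring an input feature by its mixture log-likelihood. The diagonal covariances, dimensions and component counts below are illustrative assumptions and do not reproduce the MGProto training procedure.

```python
# Sketch of scoring a feature against per-class Gaussian-mixture prototypes.
import torch

def class_log_likelihood(feat, means, log_vars, log_weights):
    """feat: (D,); means/log_vars: (M, D); log_weights: (M,) -> log p(feat | class)."""
    diff = feat.unsqueeze(0) - means                                   # (M, D)
    log_norm = -0.5 * (log_vars + diff.pow(2) / log_vars.exp()).sum(-1)
    log_norm = log_norm - 0.5 * means.size(1) * torch.log(torch.tensor(2 * torch.pi))
    return torch.logsumexp(log_weights + log_norm, dim=0)              # mixture log-likelihood

if __name__ == "__main__":
    torch.manual_seed(0)
    D, M, C = 128, 3, 5            # feature dim, mixture components per class, classes
    means = torch.randn(C, M, D)
    log_vars = torch.zeros(C, M, D)                       # unit diagonal covariances
    log_weights = torch.log_softmax(torch.zeros(C, M), dim=-1)
    feat = torch.randn(D)
    scores = torch.stack([class_log_likelihood(feat, means[c], log_vars[c], log_weights[c])
                          for c in range(C)])
    print(scores.argmax().item(), scores)                 # predicted class and class scores
```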

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

  • paper_url: http://arxiv.org/abs/2311.18448
  • repo_url: https://github.com/zc-alexfan/hold
  • paper_authors: Zicong Fan, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, Otmar Hilliges
  • for: 用于从视频中重建交互中的手与物体,以理解和建模人类行为
  • methods: 从单目交互视频中联合重建关节化的手和物体,采用组合式的关节化隐式模型,从二维图像中重建解耦的 3D 手与物体,并进一步引入手-物约束以改进姿态与重建质量
  • results: 在不依赖 3D 手-物标注的情况下,该方法在实验室与真实野外场景中均超越了全监督基线,重建质量高,并在野外视频上表现出良好的鲁棒性。代码:https://github.com/zc-alexfan/hold
    Abstract Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos. Code: https://github.com/zc-alexfan/hold
    摘要 由于人们每天都会与各种物体交互,完整地三维捕捉这些交互对于理解和建模人类行为十分重要。然而,现有的基于 RGB 的手-物重建方法大多假设已有预扫描的物体模板,或严重依赖有限的 3D 手-物数据,限制了其在更不受约束的交互场景中的扩展与泛化能力。为此,我们提出 HOLD——首个与类别无关、能够从单目交互视频中联合重建关节化的手与物体的方法。我们设计了一种组合式的关节化隐式模型,可从二维图像中重建解耦的 3D 手与物体;并进一步引入手-物约束,改进手-物姿态并提升重建质量。我们的方法不依赖任何 3D 手-物标注,却在实验室与具有挑战性的野外场景中均超越了全监督基线。此外,我们还定性展示了其在野外视频重建中的鲁棒性。代码:https://github.com/zc-alexfan/hold

VTimeLLM: Empower LLM to Grasp Video Moments

  • paper_url: http://arxiv.org/abs/2311.18445
  • repo_url: https://github.com/huangb23/vtimellm
  • paper_authors: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu
  • for: 这个论文是为了解决现有视频大语言模型(Video LLM)无法准确捕捉视频中特定事件的时间边界问题。
  • methods: 该论文提出了一种名为VTimeLLM的新型视频大语言模型,该模型采用了边界意识的三个阶段训练策略,包括图像文本对Alignment、多个事件视频以增强时间边界意识,以及高质量视频指导调整以进一步提高时间理解能力和人类意图的Alignment。
  • results: 经过广泛的实验表明,在细化时间相关理解任务中(如Temporal Video Grounding和Dense Video Captioning),VTimeLLM在与现有视频LLMs进行比较时显著超越它们。此外,由于VTimeLLM具有细化时间理解能力,它还在视频对话 benchark中超越了现有视频LLMs,表明它在多模态理解和逻辑能力方面具有优异表现。
    Abstract Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data for comprehending visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundary of specific events. In this paper, we solve this issue via proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundary. Specifically, our VTimeLLM adopts a boundary-aware three-stage training strategy, which respectively utilizes image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Besides, benefits from the fine-grained temporal understanding of the videos further enable VTimeLLM to beat existing Video LLMs in video dialogue benchmark, showing its superior cross-modal understanding and reasoning abilities.
    摘要 大语言模型(LLM)展现出卓越的文本理解能力,并已被扩展为视频 LLM 以处理视频数据、理解视觉细节。然而,现有的视频 LLM 只能对整段视频给出粗略描述,无法捕捉特定事件精确的起止时间边界。本文提出 VTimeLLM,一种面向细粒度视频时刻理解与时间边界推理的新型视频 LLM。具体而言,VTimeLLM 采用边界感知的三阶段训练策略:先利用图像-文本对进行特征对齐,再利用多事件视频增强时间边界感知,最后通过高质量的视频指令微调进一步提升时间理解能力并与人类意图对齐。大量实验表明,在时序视频定位(Temporal Video Grounding)与稠密视频描述(Dense Video Captioning)等细粒度时间相关的视频理解任务上,VTimeLLM 显著优于现有的视频 LLM。此外,得益于细粒度的时间理解能力,VTimeLLM 在视频对话基准上同样超越了现有的视频 LLM,展现出优越的跨模态理解与推理能力。

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

  • paper_url: http://arxiv.org/abs/2311.18435
  • repo_url: None
  • paper_authors: Zipeng Qi, Guoxi Huang, Zebin Huang, Qin Guo, Jinwen Chen, Junyu Han, Jian Wang, Gang Zhang, Lufei Liu, Errui Ding, Jingdong Wang
  • for: 提升基于文本查询的扩散模型的空间可控性,从而提高图像生成的准确性与效率。
  • methods: 提出两项关键创新:视觉引导(Vision Guidance)与分层渲染扩散(LRDiff)框架。视觉引导作为一种空间布局条件,在扰动分布中充当线索,大幅缩小搜索空间,使采样过程遵循空间布局要求;LRDiff 框架则采用多层渲染机制,逐层利用视觉引导估计单个物体的去噪方向,有效避免概念混合与错配问题,实现更连贯、更符合上下文的图像合成。
  • results: 通过实验表明,提出的方法比现有技术更加高效和准确,并在三个实际应用中(矩形框-到-图像、semantic mask-to-image和图像修改)得到了更好的效果。
    Abstract This paper introduces innovative solutions to enhance spatial controllability in diffusion models reliant on text queries. We present two key innovations: Vision Guidance and the Layered Rendering Diffusion (LRDiff) framework. Vision Guidance, a spatial layout condition, acts as a clue in the perturbed distribution, greatly narrowing down the search space, to focus on the image sampling process adhering to the spatial layout condition. The LRDiff framework constructs an image-rendering process with multiple layers, each of which applies the vision guidance to instructively estimate the denoising direction for a single object. Such a layered rendering strategy effectively prevents issues like unintended conceptual blending or mismatches, while allowing for more coherent and contextually accurate image synthesis. The proposed method provides a more efficient and accurate means of synthesising images that align with specific spatial and contextual requirements. We demonstrate through our experiments that our method provides better results than existing techniques both quantitatively and qualitatively. We apply our method to three practical applications: bounding box-to-image, semantic mask-to-image and image editing.
    摘要 本文针对依赖文本查询的扩散模型,提出了增强空间可控性的创新方案,包含两项关键设计:视觉引导(Vision Guidance)与分层渲染扩散(LRDiff)框架。视觉引导作为一种空间布局条件,在扰动分布中充当线索,大幅缩小搜索空间,使图像采样过程遵循空间布局约束。LRDiff 框架将图像渲染过程构建为多个图层,每一层都利用视觉引导指示性地估计单个物体的去噪方向;这种分层渲染策略有效避免了意外的概念混合或错配等问题,同时带来更连贯、更符合上下文的图像合成。所提方法为合成满足特定空间与上下文要求的图像提供了更高效、更准确的途径。实验表明,我们的方法在定量与定性上均优于现有技术。我们将该方法应用于三个实际场景:边界框到图像、语义掩码到图像以及图像编辑。

E2PNet: Event to Point Cloud Registration with Spatio-Temporal Representation Learning

  • paper_url: http://arxiv.org/abs/2311.18433
  • repo_url: https://github.com/xmu-qcj/e2pnet
  • paper_authors: Xiuhong Lin, Changjie Qiu, Zhipeng Cai, Siqi Shen, Yu Zang, Weiquan Liu, Xuesheng Bian, Matthias Müller, Cheng Wang
  • for: 本研究旨在提出一种基于学习的event-to-point cloud registration方法,以解决2D图像与3D点云之间的匹配问题。
  • methods: 该方法使用了一种新的特征表示网络 called Event-Points-to-Tensor (EP2T),该网络可以将事件数据转换为2D网格形特征图,以便使用现有的RGB基于的框架进行匹配。EP2T使用了新的采样和信息聚合模块,以处理事件输入的不均衡的空间和时间维度。
  • results: 实验结果表明,E2PNet比手工制定和其他学习基于方法更加稳定和高效,并且在极端照明或快速运动的情况下具有更高的Robustness。此外,EP2T还可以用于其他视觉任务,如流速估计、事件-to-图像重建和物体识别。
    Abstract Event cameras have emerged as a promising vision sensor in recent years due to their unparalleled temporal resolution and dynamic range. While registration of 2D RGB images to 3D point clouds is a long-standing problem in computer vision, no prior work studies 2D-3D registration for event cameras. To this end, we propose E2PNet, the first learning-based method for event-to-point cloud registration. The core of E2PNet is a novel feature representation network called Event-Points-to-Tensor (EP2T), which encodes event data into a 2D grid-shaped feature tensor. This grid-shaped feature enables matured RGB-based frameworks to be easily used for event-to-point cloud registration, without changing hyper-parameters and the training procedure. EP2T treats the event input as spatio-temporal point clouds. Unlike standard 3D learning architectures that treat all dimensions of point clouds equally, the novel sampling and information aggregation modules in EP2T are designed to handle the inhomogeneity of the spatial and temporal dimensions. Experiments on the MVSEC and VECtor datasets demonstrate the superiority of E2PNet over hand-crafted and other learning-based methods. Compared to RGB-based registration, E2PNet is more robust to extreme illumination or fast motion due to the use of event data. Beyond 2D-3D registration, we also show the potential of EP2T for other vision tasks such as flow estimation, event-to-image reconstruction and object recognition. The source code can be found at: https://github.com/Xmu-qcj/E2PNet.
    摘要 Event 摄像机在最近几年内得到了广泛应用的潜在可能性,主要是因为它们的时间分辨率和动态范围无与伦比。然而,在计算机视觉中,2D RGB 图像与3D点云的注册问题仍然是一个长期不解的问题。为了解决这个问题,我们提出了 E2PNet,这是首个基于学习的2D-3D注册方法。E2PNet 的核心是一种新的特征表示网络,叫做事件点云向量(EP2T),它将事件数据转换成2D 网格形状的特征张量。这个网格形状的特征张量使得现成的 RGB 基于的框架可以轻松地用于事件-点云注册,无需更改超参数和训练过程。EP2T 对事件输入进行了特殊的采样和信息聚合处理,以处理事件数据的不均衡性。与标准的3D 学习架构不同,EP2T 的采样和信息聚合模块可以有效地处理事件数据中的空间和时间维度的不均衡性。实验结果表明,E2PNet 在 MVSEC 和 VECtor 数据集上超过了手工设计和其他学习基于方法。相比 RGB 注册,E2PNet 更加Robust 对极端的照明或快速运动,这是因为使用事件数据。此外,我们还展示了 EP2T 在其他视觉任务中的潜在应用,如流场估计、事件-图像重建和物体识别。软件代码可以在 GitHub 上找到:https://github.com/Xmu-qcj/E2PNet。
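A simple way to see how event data can be exposed to RGB-style backbones, as in the grid-shaped feature tensor above, is to rasterize the event stream into per-pixel channels. The channel layout below (positive/negative counts plus latest normalized timestamps) is an illustrative stand-in, not the exact EP2T design.

```python
# Sketch of converting an event stream into a 2D grid-shaped tensor.
import numpy as np

def events_to_grid(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """events: (N, 4) rows of (x, y, t, polarity in {-1, +1}) -> (4, H, W) tensor."""
    grid = np.zeros((4, height, width), dtype=np.float32)
    t_min, t_max = events[:, 2].min(), events[:, 2].max()
    t_norm = (events[:, 2] - t_min) / max(t_max - t_min, 1e-9)
    for (x, y, _, p), tn in zip(events, t_norm):
        xi, yi = int(x), int(y)
        if p > 0:
            grid[0, yi, xi] += 1                                # positive event count
            grid[2, yi, xi] = max(grid[2, yi, xi], tn)          # latest positive timestamp
        else:
            grid[1, yi, xi] += 1                                # negative event count
            grid[3, yi, xi] = max(grid[3, yi, xi], tn)          # latest negative timestamp
    return grid

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5000
    ev = np.stack([rng.integers(0, 240, n), rng.integers(0, 180, n),
                   rng.uniform(0, 0.05, n), rng.choice([-1, 1], n)], axis=1)
    print(events_to_grid(ev, 180, 240).shape)   # (4, 180, 240)
```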

TeG-DG: Textually Guided Domain Generalization for Face Anti-Spoofing

  • paper_url: http://arxiv.org/abs/2311.18420
  • repo_url: None
  • paper_authors: Lianrui Mu, Jianhong Bai, Xiaoxuan He, Jiangnan Ye, Xiaoyu Liang, Yuchen Yang, Jiedong Zhuang, Haoji Hu
  • for: 提升人脸活体检测(FAS)技术的领域泛化性能
  • methods: 利用文本信息进行跨域对齐,提出文本引导的领域泛化(Textually Guided Domain Generalization,TeG-DG)框架,并设计了层次注意力融合(HAF)模块与文本增强视觉判别器(TEVD)
  • results: 与先前方法相比,TeG-DG 在源域数据极为有限的情况下展现出显著的少样本性能提升(HTER 与 AUC 分别提升约 14% 与 12%)
    Abstract Enhancing the domain generalization performance of Face Anti-Spoofing (FAS) techniques has emerged as a research focus. Existing methods are dedicated to extracting domain-invariant features from various training domains. Despite the promising performance, the extracted features inevitably contain residual style feature bias (e.g., illumination, capture device), resulting in inferior generalization performance. In this paper, we propose an alternative and effective solution, the Textually Guided Domain Generalization (TeG-DG) framework, which can effectively leverage text information for cross-domain alignment. Our core insight is that text, as a more abstract and universal form of expression, can capture the commonalities and essential characteristics across various attacks, bridging the gap between different image domains. Contrary to existing vision-language models, the proposed framework is elaborately designed to enhance the domain generalization ability of the FAS task. Concretely, we first design a Hierarchical Attention Fusion (HAF) module to enable adaptive aggregation of visual features at different levels; Then, a Textual-Enhanced Visual Discriminator (TEVD) is proposed for not only better alignment between the two modalities but also to regularize the classifier with unbiased text features. TeG-DG significantly outperforms previous approaches, especially in situations with extremely limited source domain data (~14% and ~12% improvements on HTER and AUC respectively), showcasing impressive few-shot performance.
    摘要 提升人脸活体检测(FAS)技术的领域泛化性能已成为研究热点。现有方法致力于从多个训练域中提取域不变特征,尽管表现可观,但提取出的特征不可避免地残留风格偏差(如光照、采集设备),导致泛化性能欠佳。本文提出一种有效的替代方案——文本引导的领域泛化(TeG-DG)框架,利用文本信息实现跨域对齐。我们的核心洞察是:文本作为一种更抽象、更通用的表达形式,能够捕捉各类攻击之间的共性与本质特征,从而弥合不同图像域之间的差距。与现有的视觉-语言模型不同,该框架专门针对提升 FAS 任务的领域泛化能力而设计。具体而言,我们首先设计层次注意力融合(HAF)模块,实现不同层级视觉特征的自适应聚合;随后提出文本增强视觉判别器(TEVD),不仅促进两种模态之间更好的对齐,还利用无偏的文本特征对分类器进行正则化。TeG-DG 显著优于先前方法,尤其是在源域数据极为有限的情况下(HTER 与 AUC 分别提升约 14% 与 12%),展现出令人瞩目的少样本性能。

CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.18405
  • repo_url: https://github.com/zengjianhao/cat-dm
  • paper_authors: Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, Anan Liu
  • for: 提出一种可控且加速的基于扩散模型的图像虚拟试穿方法,以兼顾试穿效果与生成速度。
  • methods: 利用 ControlNet 引入额外控制条件,并加强对服装图像的特征提取;在采样阶段,以预训练的基于 GAN 的模型生成的隐式分布启动逆向去噪过程,从而减少采样步数。
  • results: 与此前基于扩散模型的试穿方法相比,CAT-DM 既能保留店内服装图像的图案与纹理细节,又能在不降低生成质量的前提下减少采样步数。
    Abstract Image-based virtual try-on enables users to virtually try on different garments by altering original clothes in their photographs. Generative Adversarial Networks (GANs) dominate the research field in image-based virtual try-on, but have not resolved problems such as unnatural deformation of garments and the blurry generation quality. Recently, diffusion models have emerged with surprising performance across various image generation tasks. While the generative quality of diffusion models is impressive, achieving controllability poses a significant challenge when applying it to virtual try-on tasks and multiple denoising iterations limit its potential for real-time applications. In this paper, we propose Controllable Accelerated virtual Try-on with Diffusion Model called CAT-DM. To enhance the controllability, a basic diffusion-based virtual try-on network is designed, which utilizes ControlNet to introduce additional control conditions and improves the feature extraction of garment images. In terms of acceleration, CAT-DM initiates a reverse denoising process with an implicit distribution generated by a pre-trained GAN-based model. Compared with previous try-on methods based on diffusion models, CAT-DM not only retains the pattern and texture details of the in-shop garment but also reduces the sampling steps without compromising generation quality. Extensive experiments demonstrate the superiority of CAT-DM against both GAN-based and diffusion-based methods in producing more realistic images and accurately reproducing garment patterns. Our code and models will be publicly released.
    摘要 基于图像的虚拟试穿允许用户通过替换照片中原有的衣物来虚拟试穿不同的服装。生成对抗网络(GAN)长期主导基于图像的虚拟试穿研究,但尚未解决服装形变不自然、生成结果模糊等问题。近来,扩散模型在各类图像生成任务中表现惊人;然而,尽管其生成质量出色,在虚拟试穿任务中实现可控性仍是重大挑战,且多次去噪迭代也限制了其实时应用潜力。本文提出可控加速的虚拟试穿扩散模型 CAT-DM。为增强可控性,我们设计了一个基础的基于扩散的虚拟试穿网络,利用 ControlNet 引入额外控制条件,并改进服装图像的特征提取;在加速方面,CAT-DM 以预训练的基于 GAN 的模型生成的隐式分布启动逆向去噪过程。与此前基于扩散模型的试穿方法相比,CAT-DM 不仅保留了店内服装的图案与纹理细节,还在不牺牲生成质量的情况下减少了采样步数。大量实验表明,CAT-DM 在生成更逼真的图像和准确还原服装图案方面优于基于 GAN 与基于扩散模型的方法。我们的代码与模型将公开发布。

MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition

  • paper_url: http://arxiv.org/abs/2311.18402
  • repo_url: None
  • paper_authors: Dan Song, Xinwei Fu, Weizhi Nie, Wenhui Li, Anan Liu
  • for: 提高Language-Image Pre-training模型在3D形状识别任务中的信任度
  • methods: 使用视图选择和层次提示来提高Language-Image Pre-training模型的泛化能力
  • results: 在没有额外训练的情况下,模型实现了84.44%、91.51%和66.17%的零shot 3D分类精度在ModelNet40、ModelNet10和ShapeNet Core55中。
    Abstract Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios. Due to the lack of comparable pre-trained models for 3D shapes, recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition. However, due to the modality gap, pretrained language-image models are not confident enough in the generalization to 3D shape recognition. Consequently, this paper aims to improve the confidence with view selection and hierarchical prompts. Leveraging the CLIP model as an example, we employ view selection on the vision side by identifying views with high prediction confidence from multiple rendered views of a 3D shape. On the textual side, the strategy of hierarchical prompts is proposed for the first time. The first layer prompts several classification candidates with traditional class-level descriptions, while the second layer refines the prediction based on function-level descriptions or further distinctions between the candidates. Remarkably, without the need for additional training, our proposed method achieves impressive zero-shot 3D classification accuracies of 84.44\%, 91.51\%, and 66.17\% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. Furthermore, we will make the code publicly available to facilitate reproducibility and further research in this area.
    摘要 大规模预训练模型在视觉和语言任务中表现出色,但因为3D形的预训练模型缺乏相似模型,现有方法利用语言-图像预训练实现零shot 3D形认识。然而,由于模式差距,预训练语言-图像模型在总体化到3D形认识中不够自信。因此,本文目的是提高自信度,使用视选择和层次提示。基于CLIP模型为例,我们在视觉方面使用多个渲染视图中的高预测信任度视选择。在文本方面,我们提出了层次提示策略,第一层提出多个分类候选者的传统分类描述,第二层根据功能级描述或进一步的分类差异于候选者进行细化预测。很Remarkably,无需额外训练,我们提议的方法可以在零shot情况下实现84.44%、91.51%和66.17%的3D分类精度在ModelNet40、ModelNet10和ShapeNet Core55上。此外,我们将代码公开发布,以便促进复现和此领域进一步研究。

RainAI – Precipitation Nowcasting from Satellite Data

  • paper_url: http://arxiv.org/abs/2311.18398
  • repo_url: https://github.com/rafapablos/w4c23-rainai
  • paper_authors: Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Ira Assent
  • for: 这篇论文的目标是利用较低分辨率的卫星辐射图像,在提前 8 小时的预报时效下预测高分辨率降水。
  • methods: 该论文提出了一种简单 yet effective的spatiotemporal特征学习方法,使用2D U-Net模型,并超过官方3D U-Net基线模型的性能和效率。 论文还强调数据准备和重要抽样技术的重要性,并证明这些技术对性能有较大的影响。
  • results: 论文表明,使用 conditional lead time 和 learned upsampling方法可以提高预测性能,并且可以生成高分辨率预测结果。
    Abstract This paper presents a solution to the Weather4Cast 2023 competition, where the goal is to forecast high-resolution precipitation with an 8-hour lead time using lower-resolution satellite radiance images. We propose a simple, yet effective method for spatiotemporal feature learning using a 2D U-Net model, that outperforms the official 3D U-Net baseline in both performance and efficiency. We place emphasis on refining the dataset, through importance sampling and dataset preparation, and show that such techniques have a significant impact on performance. We further study an alternative cross-entropy loss function that improves performance over the standard mean squared error loss, while also enabling models to produce probabilistic outputs. Additional techniques are explored regarding the generation of predictions at different lead times, specifically through Conditioning Lead Time. Lastly, to generate high-resolution forecasts, we evaluate standard and learned upsampling methods. The code and trained parameters are available at https://github.com/rafapablos/w4c23-rainai.
    摘要 本文介绍了 Weather4Cast 2023 竞赛的一种解决方案,其目标是利用较低分辨率的卫星辐射图像,在提前 8 小时的预报时效下预测高分辨率降水。我们提出一种简单而有效的时空特征学习方法,采用 2D U-Net 模型,在性能与效率上均优于官方的 3D U-Net 基线。我们着重通过重要性采样与数据准备来完善数据集,并表明这些技术对性能有显著影响。我们进一步研究了一种替代的交叉熵损失函数,它在标准均方误差损失之上提升了性能,同时使模型能够输出概率化预测。我们还探索了在不同预报时效下生成预测的技术,特别是预报时效条件化(Conditioning Lead Time)。最后,为了生成高分辨率预报,我们评估了标准上采样与可学习上采样方法。代码与训练参数见 https://github.com/rafapablos/w4c23-rainai。

On Exact Inversion of DPM-Solvers

  • paper_url: http://arxiv.org/abs/2311.18387
  • repo_url: https://github.com/smhongok/inv-dpm
  • paper_authors: Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun
  • for: 这项研究旨在探讨 DPM-solver 的精确反演(exact inversion)方法,即从给定图像还原初始噪声,以提升图像重建与数字水印等应用的质量。
  • methods: 研究者针对 DPM-solver 中的每一个显式去噪步骤,采用梯度下降或前向步方法等隐式方法来实现精确反演。
  • results: 实验结果表明,所提出的精确反演方法能够显著降低图像与噪声的重建误差,大幅提升对不可见水印的识别能力,并在图像编辑过程中有效避免背景的意外改变。
    Abstract Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly, but have posed challenges to find the exact inverse (i.e., finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers, we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions, greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing. Project page: \url{https://smhongok.github.io/inv-dpm.html}.
    摘要 扩散概率模型(DPM)是现代生成模型的关键组成部分。DPM-solver 显著降低了延迟并提升了质量,但其精确反演(即从给定图像还原初始噪声)仍具挑战。本文研究 DPM-solver 的精确反演问题,并针对一阶及高阶 DPM-solver 生成的样本提出了相应的反演算法。对于 DPM-solver 中的每一个显式去噪步骤,我们采用梯度下降或前向步方法等隐式方法来构造反演,从而在较大的无分类器引导(classifier-free guidance)下仍保持鲁棒性,优于此前基于不动点迭代的做法。实验结果表明,所提出的精确反演方法显著降低了图像与噪声的重建误差,大幅提升了对不可见水印的识别能力,并在图像编辑过程中持续避免了背景的意外改变。项目页面:https://smhongok.github.io/inv-dpm.html。
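The implicit-inversion idea above can be sketched on a toy denoising step: instead of fixed-point iteration, recover x_t from x_{t-1} by gradient descent on the reconstruction residual of the step function. The toy step map, optimizer and iteration count below are assumptions for illustration, not the paper's solvers.

```python
# Sketch of inverting one explicit denoising step by minimizing the residual
# ||step(x_t) - x_prev||^2 with gradient descent.
import torch

def toy_denoise_step(x_t: torch.Tensor) -> torch.Tensor:
    """Stand-in for one DPM-solver update (a smooth, non-linear map)."""
    return 0.98 * x_t - 0.05 * torch.tanh(x_t)

def invert_step(x_prev: torch.Tensor, iters: int = 500, lr: float = 0.1) -> torch.Tensor:
    """Solve step(x_t) = x_prev for x_t by minimizing the squared residual."""
    x_t = x_prev.clone().requires_grad_(True)        # initialize at the step output
    opt = torch.optim.Adam([x_t], lr=lr)
    for _ in range(iters):
        loss = (toy_denoise_step(x_t) - x_prev).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return x_t.detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    x_true = torch.randn(1, 3, 8, 8)
    x_prev = toy_denoise_step(x_true)
    x_rec = invert_step(x_prev)
    print(float((x_rec - x_true).abs().max()))       # small reconstruction error
```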

  • paper_url: http://arxiv.org/abs/2311.18373
  • repo_url: None
  • paper_authors: Jiaxin Mei, Tao Zhou, Kaiwen Huang, Yizhe Zhang, Yi Zhou, Ye Wu, Huazhu Fu
  • for: 这篇综述主要旨在为结直肠癌(CRC)的检测与评估提供有效的辅助手段,即息肉的精准定位与分割。
  • methods: 文中系统梳理了多类息肉分割算法,既包括基于手工提取特征的传统方法,也包括基于深度学习网络的方法。
  • results: 文中对近期深度学习模型及其结果进行了深入的评估与比较,并结合研究痛点、网络结构差异以及不同尺寸的息肉进行了分析。
    Abstract Early detection and assessment of polyps play a crucial role in the prevention and treatment of colorectal cancer (CRC). Polyp segmentation provides an effective solution to assist clinicians in accurately locating and segmenting polyp regions. In the past, people often relied on manually extracted lower-level features such as color, texture, and shape, which often had issues capturing global context and lacked robustness to complex scenarios. With the advent of deep learning, more and more outstanding medical image segmentation algorithms based on deep learning networks have emerged, making significant progress in this field. This paper provides a comprehensive review of polyp segmentation algorithms. We first review some traditional algorithms based on manually extracted features and deep segmentation algorithms, then detail benchmark datasets related to the topic. Specifically, we carry out a comprehensive evaluation of recent deep learning models and results based on polyp sizes, considering the pain points of research topics and differences in network structures. Finally, we discuss the challenges of polyp segmentation and future trends in this field. The models, benchmark datasets, and source code links we collected are all published at https://github.com/taozh2017/Awesome-Polyp-Segmentation.
    摘要 息肉的早期检测与评估在结直肠癌(CRC)的预防与治疗中起着关键作用。息肉分割为临床医生精确定位和分割息肉区域提供了有效手段。过去,人们往往依赖手工提取的低层特征(如颜色、纹理和形状),这类特征通常难以捕捉全局上下文,对复杂场景的鲁棒性也不足。随着深度学习的发展,越来越多基于深度网络的优秀医学图像分割算法涌现,使该领域取得了显著进展。本文对息肉分割算法进行了全面综述:首先回顾了基于手工特征的传统算法与深度分割算法,随后介绍了相关的基准数据集;特别地,我们结合研究痛点与网络结构差异,按息肉尺寸对近期深度学习模型及其结果进行了全面评估;最后讨论了息肉分割面临的挑战与未来趋势。我们整理的模型、基准数据集与源代码链接均发布于 https://github.com/taozh2017/Awesome-Polyp-Segmentation。

Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.18363
  • repo_url: None
  • paper_authors: Ziyang Chen, Yiwen Ye, Mengkang Lu, Yongsheng Pan, Yong Xia
  • for: 这篇论文旨在解决医学图像中的分布偏移问题:不同医疗中心采集的图像之间普遍存在分布偏移,这给预训练模型在实际应用中的部署带来严重阻碍。
  • methods: 该方法在测试阶段进行适应,同时冻结预训练模型不做更新,以避免误差累积与灾难性遗忘。我们提出基于视觉提示的测试时适应方法(Visual Prompt-based Test-Time Adaptation,VPTTA),为每张测试图像训练一个专属的提示,使其统计量与批归一化层中的源域统计量对齐。
  • results: 实验结果表明,VPTTA 在两个医学图像分割基准任务上优于其他现有方法,取得了更高的性能。
    Abstract Distribution shift widely exists in medical images acquired from different medical centres and poses a significant obstacle to deploying the pre-trained semantic segmentation model in real-world applications. Test-time adaptation has proven its effectiveness in tackling the cross-domain distribution shift during inference. However, most existing methods achieve adaptation by updating the pre-trained models, rendering them susceptible to error accumulation and catastrophic forgetting when encountering a series of distribution shifts (i.e., under the continual test-time adaptation setup). To overcome these challenges caused by updating the models, in this paper, we freeze the pre-trained model and propose the Visual Prompt-based Test-Time Adaptation (VPTTA) method to train a specific prompt for each test image to align the statistics in the batch normalization layers. Specifically, we present the low-frequency prompt, which is lightweight with only a few parameters and can be effectively trained in a single iteration. To enhance prompt initialization, we equip VPTTA with a memory bank to benefit the current prompt from previous ones. Additionally, we design a warm-up mechanism, which mixes source and target statistics to construct warm-up statistics, thereby facilitating the training process. Extensive experiments demonstrate the superiority of our VPTTA over other state-of-the-art methods on two medical image segmentation benchmark tasks. The code and weights of pre-trained source models are available at https://github.com/Chen-Ziyang/VPTTA.
    摘要 不同医疗中心采集的医学图像之间普遍存在分布偏移,这给预训练语义分割模型在实际应用中的部署带来了巨大障碍。测试时适应已被证明能够在推理阶段有效应对跨域分布偏移;然而,大多数现有方法通过更新预训练模型来实现适应,当面对一系列分布偏移(即连续测试时适应设定)时,容易出现误差累积与灾难性遗忘。为克服由更新模型带来的上述问题,本文冻结预训练模型,并提出基于视觉提示的测试时适应方法(VPTTA),为每张测试图像训练一个专属提示,以对齐批归一化层中的统计量。具体而言,我们提出低频提示,其参数量极少、十分轻量,可在单次迭代中完成有效训练;为改进提示的初始化,我们为 VPTTA 配备记忆库,使当前提示能够受益于历史提示;此外,我们设计了预热机制,将源域与目标域统计量混合构成预热统计量,从而促进训练过程。大量实验表明,VPTTA 在两个医学图像分割基准任务上优于其他 state-of-the-art 方法。预训练源模型的代码与权重见 https://github.com/Chen-Ziyang/VPTTA。
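A rough sketch of the prompt-and-align idea, under several assumptions: the low-frequency prompt is modeled as a learnable scaling of the centred low-frequency band of the image spectrum, and the adaptation loss compares the batch statistics induced by the prompted image with the frozen model's BN running statistics. The frequency band size, the loss form and the single-iteration update are illustrative, not the released implementation.

```python
# Sketch of a per-image low-frequency prompt trained to align BN statistics.
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_low_freq_prompt(img: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
    """Scale the centred low-frequency band of img's spectrum by (1 + prompt)."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    ph, pw = prompt.shape[-2:]
    pad = (w // 2 - pw // 2, w - w // 2 - (pw + 1) // 2,
           h // 2 - ph // 2, h - h // 2 - (ph + 1) // 2)
    scale = 1.0 + F.pad(prompt, pad)          # identity outside the low-frequency band
    spec = spec * scale
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

def bn_alignment_loss(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Compare per-layer feature statistics of x with the stored BN running statistics."""
    losses, hooks = [], []
    def make_hook(bn):
        def hook(_, inputs, __):
            feat = inputs[0]
            mu = feat.mean(dim=(0, 2, 3))
            var = feat.var(dim=(0, 2, 3), unbiased=False)
            losses.append((mu - bn.running_mean).abs().mean()
                          + (var - bn.running_var).abs().mean())
        return hook
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(make_hook(m)))
    model(x)
    for h in hooks:
        h.remove()
    return torch.stack(losses).sum()

if __name__ == "__main__":
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                          nn.Conv2d(8, 8, 3, padding=1), nn.BatchNorm2d(8)).eval()
    img = torch.rand(1, 3, 64, 64)
    prompt = torch.zeros(1, 3, 8, 8, requires_grad=True)      # low-frequency prompt
    opt = torch.optim.Adam([prompt], lr=0.05)
    loss = bn_alignment_loss(model, apply_low_freq_prompt(img, prompt))
    opt.zero_grad(); loss.backward(); opt.step()               # a single adaptation iteration
    print(float(loss))
```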

Automating lookahead planning using site appearance and space utilization

  • paper_url: http://arxiv.org/abs/2311.18361
  • repo_url: None
  • paper_authors: Eyob Mengiste, Borja Garcia de Soto, Timo Hartmann
  • for: 这项研究旨在实现前瞻计划(lookahead planning)编制的自动化,以提升建筑施工的效率与质量。
  • methods: 该方法利用建筑材料的外观状态与施工现场的空间利用情况预测任务完成率,并使用基于 GRU 的 RNN 模型在一段施工项目时间线上进行训练,进而给出数据感知的前瞻计划。
  • results: 示例项目的结果表明,该方法有助于自动生成前瞻计划,将施工计划与施工现场的实际情况联系起来,从而扩展了传统的进度计划技术,并把更广泛的现场空间约束纳入前瞻计划之中。
    Abstract This study proposes a method to automate the development of lookahead planning. The proposed method uses construction material conditions (i.e., appearances) and site space utilization to predict task completion rates. A Gated Recurrent Unit (GRU) based Recurrent Neural Network (RNN) model was trained using a segment of a construction project timeline to estimate completion rates of tasks and propose data-aware lookahead plans. The proposed method was evaluated in a sample construction project involving finishing works such as plastering, painting, and installing electrical fixtures. The results show that the proposed method can assist with developing automated lookahead plans. In doing so, this study links construction planning with actual events at the construction site. It extends the traditional scheduling techniques and integrates a broader spectrum of site spatial constraints into lookahead planning.
    摘要 本研究提出一种实现前瞻计划编制自动化的方法。该方法利用建筑材料状态(即外观)与现场空间利用情况来预测任务完成率。我们在施工项目时间线的一个片段上训练了基于门控循环单元(GRU)的循环神经网络(RNN)模型,用以估计任务完成率并给出数据感知的前瞻计划。该方法在一个包含抹灰、涂漆和安装电气设备等装修工序的示例施工项目中进行了评估。结果表明,该方法能够辅助自动编制前瞻计划。借此,本研究将施工计划与施工现场的实际事件联系起来,扩展了传统的进度计划技术,并将更广泛的现场空间约束纳入前瞻计划之中。
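The GRU-based estimator described above can be sketched as a small sequence model that maps a window of daily site observations (material-appearance and space-utilization indicators) to predicted completion rates per task. Feature dimensions, the number of tasks and the sigmoid output head are assumptions for illustration.

```python
# Sketch of a GRU-based regressor for task completion rates.
import torch
import torch.nn as nn

class CompletionRateGRU(nn.Module):
    def __init__(self, n_features: int = 6, hidden: int = 32, n_tasks: int = 3):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, n_tasks), nn.Sigmoid())  # rates in [0, 1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h_last = self.gru(x)            # x: (batch, days, n_features)
        return self.head(h_last[-1])       # (batch, n_tasks) predicted completion rates

if __name__ == "__main__":
    model = CompletionRateGRU()
    history = torch.rand(4, 14, 6)         # 4 sequences, 14 days of site observations each
    rates = model(history)
    loss = nn.functional.mse_loss(rates, torch.rand(4, 3))   # supervise with observed rates
    print(rates.shape, float(loss))
```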

TIDE: Test Time Few Shot Object Detection

  • paper_url: http://arxiv.org/abs/2311.18358
  • repo_url: https://github.com/deku-0621/tide
  • paper_authors: Weikai Li, Hongfeng Wei, Yanlai Wu, Jie Yang, Yudi Ruan, Yuan Li, Ying Tang
  • for: 本研究面向工业 5.0 中的实时配置需求与黑盒环境,提出了一种新的少样本目标检测任务——测试时少样本检测(Test TIme Few Shot DEtection,TIDE),以突破现有方法在此类环境中的应用限制。
  • methods: 本研究提出一种非对称架构,通过学习由支持实例引导的动态类别分类器来完成该任务,并引入交叉注意力模块与多尺度缩放器(multi-scale resizer)以进一步提升模型性能。
  • results: 实验结果表明,所提出的 TIDE 方法在多个少样本目标检测基准上显著优于现有的同类方法。代码见 https://github.com/deku-0621/TIDE。
    Abstract Few-shot object detection (FSOD) aims to extract semantic knowledge from limited object instances of novel categories within a target domain. Recent advances in FSOD focus on fine-tuning the base model based on a few objects via meta-learning or data augmentation. Despite their success, the majority of them are grounded with parametric readjustment to generalize on novel objects, which face considerable challenges in Industry 5.0, such as (i) a certain amount of fine-tuning time is required, and (ii) the parameters of the constructed model being unavailable due to the privilege protection, making the fine-tuning fail. Such constraints naturally limit its application in scenarios with real-time configuration requirements or within black-box settings. To tackle the challenges mentioned above, we formalize a novel FSOD task, referred to as Test TIme Few Shot DEtection (TIDE), where the model is un-tuned in the configuration procedure. To that end, we introduce an asymmetric architecture for learning a support-instance-guided dynamic category classifier. Further, a cross-attention module and a multi-scale resizer are provided to enhance the model performance. Experimental results on multiple few-shot object detection platforms reveal that the proposed TIDE significantly outperforms existing contemporary methods. The implementation codes are available at https://github.com/deku-0621/TIDE
    摘要 少样本目标检测(FSOD)旨在从目标域中少量新类别的物体实例中提取语义知识。近期进展大多通过元学习或数据增强,在少量目标上微调基础模型。尽管取得了成功,这些方法大多依赖参数调整来泛化到新类别,在工业 5.0 场景下面临明显的挑战:(一)微调需要一定的时间;(二)由于权限保护,构建好的模型参数不可获取,导致微调无法进行。这些约束天然地限制了其在有实时配置需求或黑盒设定场景中的应用。为应对上述挑战,我们形式化了一个新的 FSOD 任务——测试时少样本检测(TIDE),其中模型在配置过程中无需微调。为此,我们提出一种非对称架构,用于学习由支持实例引导的动态类别分类器,并进一步引入交叉注意力模块与多尺度缩放器以提升模型性能。在多个少样本目标检测平台上的实验结果表明,所提出的 TIDE 显著优于现有的同类方法。实现代码见 https://github.com/deku-0621/TIDE。

DSeg: Direct Line Segments Detection

  • paper_url: http://arxiv.org/abs/2311.18344
  • repo_url: None
  • paper_authors: Berger Cyrille, Lacroix Simon
  • for: 检测图像线段
  • methods: 使用线性卡尔曼滤波器估计支持直线的参数及其方差,在梯度图像上增量式地检测线段。
  • results: 该方法快速且鲁棒,能够检测比数据驱动方法更长的线段,且无需繁琐的参数调节;引入金字塔式策略的扩展可进一步提升结果质量。
    Abstract This paper presents a model-driven approach to detect image line segments. The approach incrementally detects segments on the gradient image using a linear Kalman filter that estimates the supporting line parameters and their associated variances. The algorithm is fast and robust with respect to image noise and illumination variations, it allows the detection of longer line segments than data-driven approaches, and does not require any tedious parameters tuning. An extension of the algorithm that exploits a pyramidal approach to enhance the quality of results is proposed. Results with varying scene illumination and comparisons to classic existing approaches are presented.
    摘要 本文提出一种模型驱动的图像线段检测方法。该方法借助线性卡尔曼滤波器估计支持直线的参数及其方差,在梯度图像上增量式地检测线段。该算法快速,对图像噪声和光照变化具有鲁棒性,能够检测比数据驱动方法更长的线段,且无需繁琐的参数调节。文中还提出一种利用金字塔式策略提升结果质量的扩展,并给出了不同场景光照下的结果以及与经典方法的对比。
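The incremental line estimation above can be sketched with a textbook linear Kalman filter whose state is the supporting line's slope and intercept and whose observations are successive gradient points. Noise levels and initialization below are illustrative assumptions, not DSeg's exact formulation.

```python
# Sketch of estimating a supporting line y = a*x + b with a linear Kalman filter.
import numpy as np

def kalman_line(points, obs_var=1.0, init_var=100.0):
    state = np.zeros(2)                        # (a, b)
    P = np.eye(2) * init_var                   # parameter covariance
    for x, y in points:
        H = np.array([[x, 1.0]])               # measurement model: y = H @ state
        S = H @ P @ H.T + obs_var              # innovation variance (1x1)
        K = P @ H.T / S                        # Kalman gain (2x1)
        state = state + (K * (y - H @ state)).ravel()
        P = (np.eye(2) - K @ H) @ P
    return state, P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = np.arange(0, 50, dtype=float)
    ys = 0.5 * xs + 3.0 + rng.normal(0, 0.5, xs.size)   # noisy points along a line
    (a, b), cov = kalman_line(zip(xs, ys))
    print(round(a, 3), round(b, 3))            # close to (0.5, 3.0)
    print(np.diag(cov))                        # small variances -> confident estimate
```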

Multilevel Saliency-Guided Self-Supervised Learning for Image Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.18332
  • repo_url: None
  • paper_authors: Jianjian Qin, Chunzhi Gu, Jun Yu, Chao Zhang
  • for: Improving image anomaly detection by guiding self-supervised augmentation with visual saliency.
  • methods: Uses LayerCAM to extract multi-level saliency maps, clusters them to obtain centroids, and selects a patch pair from the cluster with the highest centroid saliency; swapping the two patches produces a subtle yet realistic negative sample (a minimal sketch of the swap step follows this entry).
  • results: Extensive experiments and ablations show state-of-the-art anomaly detection performance on two mainstream benchmark datasets.
    Abstract Anomaly detection (AD) is a fundamental task in computer vision. It aims to identify incorrect image data patterns which deviate from the normal ones. Conventional methods generally address AD by preparing augmented negative samples to enforce self-supervised learning. However, these techniques typically do not consider semantics during augmentation, leading to the generation of unrealistic or invalid negative samples. Consequently, the feature extraction network can be hindered from embedding critical features. In this study, inspired by visual attention learning approaches, we propose CutSwap, which leverages saliency guidance to incorporate semantic cues for augmentation. Specifically, we first employ LayerCAM to extract multilevel image features as saliency maps and then perform clustering to obtain multiple centroids. To fully exploit saliency guidance, on each map, we select a pixel pair from the cluster with the highest centroid saliency to form a patch pair. Such a patch pair includes highly similar context information with dense semantic correlations. The resulting negative sample is created by swapping the locations of the patch pair. Compared to prior augmentation methods, CutSwap generates more subtle yet realistic negative samples to facilitate quality feature learning. Extensive experimental and ablative evaluations demonstrate that our method achieves state-of-the-art AD performance on two mainstream AD benchmark datasets.
    摘要 “异常检测(AD)是计算机视觉中的基本任务。它目的是找出图像数据中的错误模式,这些模式与正常模式不同。传统方法通常通过强制自我超级学习来实现AD,但这些技术通常不考虑 semantics during augmentation,导致生成的负样本不真实、无效。这会使特征提取网络受到限制,从而降低AD性能。在本研究中,我们提出了CutSwap方法,它利用视觉注意力学习方法来汇入Semantic cues for augmentation。具体来说,我们首先使用LayerCAM来提取图像的多层特征图,然后使用聚类来获得多个中心点。为了充分利用注意力指导,我们在每个图中选择一个最高中心点精度的像素对来组成一个patch pair。这个patch pair包含高度相似的上下文信息和密集的semantic correlations。通过交换这个patch pair的位置,我们可以生成更加细致 yet realistic的负样本,从而促进特征学习。我们的方法在两个主流AD benchmark datasets上实现了状态的AD性能。”

Anisotropic Neural Representation Learning for High-Quality Neural Rendering

  • paper_url: http://arxiv.org/abs/2311.18311
  • repo_url: None
  • paper_authors: Y. Wang, J. Xu, Y. Zeng, Y. Gong
  • for: Improving the quality and fidelity of NeRF view synthesis.
  • methods: Models the volumetric function with spherical-harmonic-guided, view-dependent anisotropic features parameterized by MLPs, and regularizes the energy of the anisotropic features during training to avoid overfitting (a minimal sketch of SH-modulated features follows this entry).
  • results: Plugged into various NeRF frameworks, the representation boosts rendering quality and achieves state-of-the-art performance on both synthetic and real-world scenes.
    Abstract Neural radiance fields (NeRFs) have achieved impressive view synthesis results by learning an implicit volumetric representation from multi-view images. To project the implicit representation into an image, NeRF employs volume rendering that approximates the continuous integrals of rays as an accumulation of the colors and densities of the sampled points. Although this approximation enables efficient rendering, it ignores the direction information in point intervals, resulting in ambiguous features and limited reconstruction quality. In this paper, we propose an anisotropic neural representation learning method that utilizes learnable view-dependent features to improve scene representation and reconstruction. We model the volumetric function as spherical harmonic (SH)-guided anisotropic features, parameterized by multilayer perceptrons, facilitating ambiguity elimination while preserving the rendering efficiency. To achieve robust scene reconstruction without anisotropy overfitting, we regularize the energy of the anisotropic features during training. Our method is flexiable and can be plugged into NeRF-based frameworks. Extensive experiments show that the proposed representation can boost the rendering quality of various NeRFs and achieve state-of-the-art rendering performance on both synthetic and real-world scenes.
    摘要 NeRF 通过从多视图图像中学习隐式体积表示,取得了出色的视图合成效果。为了将隐式表示投影为图像,NeRF 采用体积渲染,把光线上的连续积分近似为采样点颜色和密度的累加。这种近似虽然带来了高效的渲染,却忽略了采样区间内的方向信息,导致特征含糊、重建质量受限。本文提出一种各向异性神经表示学习方法,利用可学习的视角相关特征来改进场景表示与重建。我们将体积函数建模为由球谐函数(SH)引导的各向异性特征,并用多层感知机进行参数化,在消除歧义的同时保持渲染效率。为避免各向异性过拟合,我们在训练中对各向异性特征的能量进行正则化。该方法可灵活地嵌入基于 NeRF 的框架。大量实验表明,所提表示能提升多种 NeRF 的渲染质量,并在合成与真实场景上达到最先进的渲染性能。
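The following sketch illustrates one plausible reading of "SH-guided anisotropic features": an MLP predicts per-point spherical-harmonic coefficients for each feature channel, the view direction selects a view-dependent feature, and the energy of the non-constant (anisotropic) coefficients is penalized. The network sizes, feature dimensions, and regularizer form are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def sh_basis_deg2(dirs: torch.Tensor) -> torch.Tensor:
    """Real spherical-harmonic basis up to degree 2 (9 values) for unit
    view directions of shape (N, 3)."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return torch.stack([
        torch.full_like(x, 0.282095),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], dim=-1)                                                     # (N, 9)

class AnisotropicFeatures(nn.Module):
    """Illustrative SH-guided anisotropic features with an energy regularizer."""
    def __init__(self, in_dim: int = 63, feat_dim: int = 16, hidden: int = 128):
        super().__init__()
        self.feat_dim = feat_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim * 9),
        )

    def forward(self, point_enc, view_dirs):
        coeffs = self.mlp(point_enc).view(-1, self.feat_dim, 9)    # (N, F, 9)
        basis = sh_basis_deg2(view_dirs).unsqueeze(1)              # (N, 1, 9)
        feats = (coeffs * basis).sum(-1)                           # (N, F) view-dependent
        aniso_energy = coeffs[..., 1:].pow(2).mean()               # regularization term
        return feats, aniso_energy

model = AnisotropicFeatures()
dirs = nn.functional.normalize(torch.randn(1024, 3), dim=-1)
feats, reg = model(torch.randn(1024, 63), dirs)
```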

Categorical Traffic Transformer: Interpretable and Diverse Behavior Prediction with Tokenized Latent

  • paper_url: http://arxiv.org/abs/2311.18307
  • repo_url: None
  • paper_authors: Yuxiao Chen, Sander Tonkens, Marco Pavone
  • for: Provides a traffic model for autonomous vehicles (AVs) that serves both planning and closed-loop simulation.
  • methods: Introduces the Categorical Traffic Transformer (CTT), which outputs continuous trajectory predictions together with tokenized categorical predictions (lane modes, homotopies, etc.); its fully interpretable latent space allows direct supervision of the latent variables from ground truth during training, avoiding mode collapse entirely.
  • results: CTT generates diverse behaviors conditioned on latent modes with semantic meanings while beating the state of the art on prediction accuracy, and its token interface enables integration with LLMs for common-sense reasoning and zero-shot generalization.
    Abstract Adept traffic models are critical to both planning and closed-loop simulation for autonomous vehicles (AV), and key design objectives include accuracy, diverse multimodal behaviors, interpretability, and downstream compatibility. Recently, with the advent of large language models (LLMs), an additional desirable feature for traffic models is LLM compatibility. We present Categorical Traffic Transformer (CTT), a traffic model that outputs both continuous trajectory predictions and tokenized categorical predictions (lane modes, homotopies, etc.). The most outstanding feature of CTT is its fully interpretable latent space, which enables direct supervision of the latent variable from the ground truth during training and avoids mode collapse completely. As a result, CTT can generate diverse behaviors conditioned on different latent modes with semantic meanings while beating SOTA on prediction accuracy. In addition, CTT's ability to input and output tokens enables integration with LLMs for common-sense reasoning and zero-shot generalization.
    摘要 优秀的交通模型是自动驾驶车辆(AV)规划和关闭环境 simulate 中的关键设计目标之一,其中包括准确性、多样化的多种行为、可解释性和下游兼容性。在大语言模型(LLM)的出现以后,另一个感兴趣的特征为交通模型是 LLm 兼容性。我们介绍了ategorical Traffic Transformer(CTT),一种输出连续轨迹预测和 Token 化分类预测(车道模式、同征等)的交通模型。 CTt 的最 distinguishing feature 是它的完全可解释的潜在空间,允许在训练时直接监督潜在变量并完全避免模式混合。因此,CTT 可以根据不同的潜在模式生成多样化的行为,同时保持高精度预测。此外, CTt 的 Token 输入和输出功能可以与 LLm 集成,实现常识理解和零容量泛化。

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2312.00085
  • repo_url: https://github.com/xmu-xiaoma666/X-Dreamer
  • paper_authors: Yiwei Ma, Yijun Fan, Jiayi Ji, Haowei Wang, Xiaoshuai Sun, Guannan Jiang, Annan Shu, Rongrong Ji
  • for: High-quality text-to-3D content creation that bridges the domain gap between text-to-2D and text-to-3D generation.
  • methods: Introduces Camera-Guided Low-Rank Adaptation (CG-LoRA), which injects camera information into the pretrained diffusion model through camera-dependent trainable parameters, and an Attention-Mask Alignment (AMA) loss that guides the model's attention maps with the binary mask of the 3D object so that the foreground object is prioritized.
  • results: X-Dreamer produces higher-quality and more accurate 3D content than existing text-to-3D methods.
    Abstract In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: https://xmuxiaoma666.github.io/Projects/X-Dreamer .
    摘要 现代时期,自动文本到3D内容创建已经取得了 significativeloop gain,由于2D扩散模型的发展。现有的文本到3D方法通常是优化3D表示以使得rendered图像与给定的文本保持良好的对齐,如由2D扩散模型进行评估。然而,2D图像和3D资产之间存在很大的领域差异,主要归结于摄像头相关的特性和独特的前景对象。因此,直接使用2D扩散模型来优化3D表示可能会导致不佳的结果。为解决这问题,我们提出了X-Dreamer,一种高质量文本到3D内容创建方法,可以有效bridging 2D和3D synthesis之间的领域差异。X-Dreamer的关键组件包括两个创新的设计:Camera-Guided Low-Rank Adaptation(CG-LoRA)和Attention-Mask Alignment(AMA)损失。CG-LoRA在预训练的扩散模型中动态包含摄像头信息,通过使用可训练参数的摄像头依赖生成。这种结合使得生成的3D资产与摄像头的观点保持更好的对齐。AMA损失引导预训练扩散模型的注意地图使用3D对象的二进制掩码,以便强调生成的前景对象的细节和准确性。这个模块确保了模型对生成的前景对象进行精细和准确的生成。我们的项目页面:https://xmuxiaoma666.github.io/Projects/X-Dreamer。

Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?

  • paper_url: http://arxiv.org/abs/2312.00084
  • repo_url: None
  • paper_authors: Zhengyue Zhao, Jinhao Duan, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu
  • for: This paper aims to evaluate the effectiveness of using perturbations to protect images in a practical threat model, and to introduce a purification method that removes protective perturbations while preserving the original image structure.
  • methods: The paper uses a Stable Diffusion model fine-tuned with personalized concepts, adds imperceptible adversarial perturbations to images to prevent unauthorized exploitation and infringement, and introduces a purification method to remove the protective perturbations while preserving the original image structure.
  • results: The results suggest that perturbation-based protection may not be sufficient to safeguard image privacy and copyright effectively, that the purification method can remove protective perturbations while preserving image structure, and that Stable Diffusion can effectively learn from purified images across all protective methods.
    Abstract Stable Diffusion has established itself as a foundation model in generative AI artistic applications, receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However, these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies, researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images, it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper, we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore, we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods.
    摘要 stable diffusion 已成为生成 AI 艺术领域的基础模型,广泛应用和研究。一些最近的细化方法使得个人可以将个性化想法嵌入到基础的 stable diffusion 模型中,使得计算成本减少到最小限度。然而,这些创新也产生了面部隐私伪造和艺术版权侵犯的问题。在最近的研究中,研究人员通过添加不可见的对抗扰动到图像来防止可能的未经授权使用和侵犯。虽然这些研究表明了保护图像的能力,但是需要考虑这些方法在实际场景中可能并不适用。在这篇论文中,我们系统地评估了使用扰动来保护图像的使用情况。结果表明,这些方法可能无法有效地保护图像隐私和版权。此外,我们介绍了一种纯化方法,可以从保护后的图像中除去保护扰动,并保持图像的原始结构最大限度。实验表明,Stable Diffusion 可以从纯化后的图像中学习,而且比所有保护方法都要更好。

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

  • paper_url: http://arxiv.org/abs/2312.00083
  • repo_url: https://github.com/Pilhyeon/BAM-DETR
  • paper_authors: Pilhyeon Lee, Hyeran Byun
  • for: Addresses the center-misalignment problem in temporal sentence grounding and improves the accuracy of localized moments.
  • methods: Introduces a boundary-oriented moment formulation in which the model predicts an anchor point plus onset/offset distances instead of a precise center, and designs a dual-pathway decoder that refines the anchor and boundaries with global and boundary-focused attention, respectively (a minimal sketch of the boundary-oriented parameterization follows this entry); a quality-based ranking scheme prioritizes well-localized proposals.
  • results: Extensive experiments confirm the benefits of the formulation, setting new state-of-the-art results on three benchmarks.
    Abstract Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have shown notable progress by decoding the center and length of a target moment from learnable queries. However, they suffer from the issue of center misalignment raised by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we introduce a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center but instead suffices to predict any anchor point within the interval, from which the onset and offset are directly estimated. Based on this idea, we design a Boundary-Aligned Moment Detection Transformer (BAM-DETR), equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries within parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization qualities are prioritized over incomplete ones. Extensive experiments verify the advantages of our methods, where our model records new state-of-the-art results on three benchmarks. Code is at https://github.com/Pilhyeon/BAM-DETR.
    摘要 时序句子定位旨在根据语言描述在视频中定位相关的时刻。近期,类 DETR 方法通过从可学习查询中解码目标时刻的中心与长度取得了显著进展。然而,由于时刻中心本身存在歧义,这类方法会受到中心不对齐问题的影响,导致预测不准确。为了解决这一问题,我们提出了一种面向边界的时刻表示:模型不再需要找到精确的中心,只需预测区间内的任意锚点,并由此直接估计起始与结束时刻。基于这一思想,我们设计了 Boundary-Aligned Moment Detection Transformer(BAM-DETR),它采用双路径解码过程,分别使用全局注意力和聚焦边界的注意力在并行路径中细化锚点与边界,从而实现对时刻预测的精确修正。此外,我们提出了基于质量的排序方法,优先保留定位质量高的候选而排除不完整的候选。大量实验验证了我们方法的优势,模型在三个基准上取得了新的最先进结果。代码见 https://github.com/Pilhyeon/BAM-DETR。
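The boundary-oriented formulation can be summarized as predicting an anchor point plus distances to the onset and offset, rather than a (center, width) pair. The minimal sketch below only contrasts the two parameterizations on normalized timestamps; it is not the paper's decoder.

```python
import torch

def anchor_to_segment(anchor: torch.Tensor, onset_dist: torch.Tensor,
                      offset_dist: torch.Tensor) -> torch.Tensor:
    """Boundary-oriented parameterization (illustrative): each proposal is an
    anchor point in [0, 1] plus positive distances to the onset and offset.
    Returns (start, end) clamped to the video extent; tensors share shape (..., 1)."""
    start = (anchor - onset_dist).clamp(min=0.0)
    end = (anchor + offset_dist).clamp(max=1.0)
    return torch.cat([start, end], dim=-1)

def center_to_segment(center: torch.Tensor, width: torch.Tensor) -> torch.Tensor:
    """Conventional (center, width) parameterization, for comparison."""
    return torch.cat([center - width / 2, center + width / 2], dim=-1).clamp(0.0, 1.0)

# toy comparison: the anchor need not coincide with the true center of the moment
anchor = torch.tensor([[0.40]])
seg = anchor_to_segment(anchor, torch.tensor([[0.10]]), torch.tensor([[0.35]]))
print(seg)   # tensor([[0.3000, 0.7500]]) -- the center is 0.525, not the anchor
```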

OmniMotionGPT: Animal Motion Generation with Limited Data

  • paper_url: http://arxiv.org/abs/2311.18303
  • repo_url: None
  • paper_authors: Zhangsihao Yang, Mingyuan Zhou, Mengyi Shan, Bingbing Wen, Ziwei Xuan, Mitch Hill, Junjie Bai, Guo-Jun Qi, Yalin Wang
  • for: Generating diverse and realistic animal motion sequences from textual descriptions without a large-scale animal text-motion dataset.
  • methods: Adopts a GPT-style architecture that transfers motion knowledge learned from human data to the animal domain, jointly training motion autoencoders for human and animal motions while optimizing similarity scores among human motion encodings, animal motion encodings, and text CLIP embeddings.
  • results: Generates animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming human-motion-generation baselines trained on animal data; also introduces AnimalML3D, the first text-animal motion dataset, with 1240 animation sequences spanning 36 animal identities.
    Abstract Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions, without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked, it remains challenging to transfer this success to other skeleton structures with limited data. In this work, we design a model architecture that imitates Generative Pretraining Transformer (GPT), utilizing prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and at the same time optimize through the similarity scores among human motion encoding, animal motion encoding, and text CLIP embedding. Presenting the first solution to this problem, we are able to generate animal motions with high diversity and fidelity, quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally, we introduce AnimalML3D, the first text-animal motion dataset with 1240 animation sequences spanning 36 different animal identities. We hope this dataset would mediate the data scarcity problem in text-driven animal motion generation, providing a new playground for the research community.
    摘要 我们的论文目标是从文本描述生成多样化和现实的动物运动序列,无需大规模的动物文本运动数据集。在人体动物学已经广泛研究和评估的基础上,我们设计了一个模型架构,借鉴生成预训练变换器(GPT)的优势,在动物骨架结构中转移知识。我们同时培养动物和人体动物的运动自动编码器,并通过文本CLIP编码的相似性分数来优化。这是首次解决这个问题,我们能够生成高多样化和准确性的动物运动序列,量化和质量上超过了将人体动物基eline训练数据集应用于动物数据的结果。此外,我们介绍了AnimalsML3D,第一个文本-动物运动数据集,包括1240个动画序列,涵盖36种不同的动物标识。我们希望这个数据集能够解决动物运动生成数据的不足问题,为研究者提供一个新的玩家场。

Reconstructing the normal and shape at specularities in endoscopy

  • paper_url: http://arxiv.org/abs/2311.18299
  • repo_url: None
  • paper_authors: Karim Makki, Adrien Bartoli
  • for: Exploits specular highlights in endoscopic images, usually discarded as nuisance, as cues for estimating the 3D orientation and shape of the observed tissue.
  • methods: From a single image, reconstructs the tissue's normal direction and curvature at each specularity.
  • results: Results on simulated and real interventional images show that normal and shape can be recovered at specularities.
    Abstract Specularities are numerous in endoscopic images. They occur as many white small elliptic spots, which are generally ruled out as nuisance in image analysis and computer vision methods. Instead, we propose to use specularities as cues for 3D perception. Specifically, we propose a new method to reconstruct, at each specularity, the observed tissue's normal direction (i.e., its orientation) and shape (i.e., its curvature) from a single image. We show results on simulated and real interventional images.
    摘要 照片中的specularities很多。它们出现为白色小圆形斑点,通常被视为图像分析和计算机视觉方法中的干扰。然而,我们提议使用specularities作为3D见解的cue。具体来说,我们提出了一种使用单张图像来重建观察到的组织表层方向(即方向)和形状(即弯曲)的新方法。我们在模拟和实际手术图像上显示了结果。

TLDR: Text Based Last-layer Retraining for Debiasing Image Classifiers

  • paper_url: http://arxiv.org/abs/2311.18291
  • repo_url: https://github.com/tmlabonte/last-layer-retraining
  • paper_authors: Juhyeon Park, Seokhyeon Jeong, Taesup Moon
  • for: This paper aims to mitigate the spurious correlation of classifiers by using text datasets built with large language models for a general image classifier.
  • methods: The proposed method, called TLDR, uses generated texts to train the final layer in the embedding space of an arbitrary image classifier, and filters out noisy, imprecise words to reduce the effort of inspecting each word (a minimal sketch of the last-layer retraining step follows this entry).
  • results: TLDR achieves performance comparable to LLR methods that utilize group-balanced image datasets for retraining, and outperforms other baselines that train the last linear layer without a group-annotated dataset.
    Abstract A classifier may depend on incidental features stemming from a strong correlation between the feature and the classification target in the training dataset. Recently, Last Layer Retraining (LLR) with group-balanced datasets is known to be efficient in mitigating the spurious correlation of classifiers. However, the acquisition of group-balanced datasets is costly, which hinders the applicability of the LLR method. In this work, we propose to perform LLR based on text datasets built with large language models for a general image classifier. We demonstrate that text can be a proxy for its corresponding image beyond the image-text joint embedding space, such as CLIP. Based on this, we use generated texts to train the final layer in the embedding space of the arbitrary image classifier. In addition, we propose a method of filtering the generated words to get rid of noisy, imprecise words, which reduces the effort of inspecting each word. We dub these procedures as TLDR (\textbf{T}ext-based \textbf{L}ast layer retraining for \textbf{D}ebiasing image classifie\textbf{R}s) and show our method achieves the performance that is comparable to those of the LLR methods that also utilize group-balanced image dataset for retraining. Furthermore, TLDR outperforms other baselines that involve training the last linear layer without a group annotated dataset.
    摘要 一个分类器可能会受到意外的特征所影响,这些特征来自于训练数据集中的强相关性。近期,最后层重新训练(LLR)与归一化数据集的使用已知能够有效地消除分类器的偶极相关性。然而,获得归一化数据集的成本很高,这限制了LLR方法的应用性。在这种情况下,我们提议使用文本数据集,由大型自然语言模型生成,来对一个通用的图像分类器进行LLR。我们示出了文本可以作为其对应的图像的代表,例如CLIP。基于这一点,我们使用生成的文本来训练图像分类器的最后一层。此外,我们提出了一种过滤生成的词语的方法,以避免干扰和不精确的词语,从而降低了每个词语的检查成本。我们称这些过程为TLDR(文本基于的最后层重新训练 для减降图像分类器的偏见),并证明我们的方法与使用归一化图像数据集进行LLR的方法相当。此外,TLDR也超越了没有使用归一化数据集进行训练的最后 Linear 层的基eline。
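The key step is retraining only the final linear layer on text embeddings that act as proxies for images in a joint image-text space such as CLIP, then applying the same head to image embeddings at test time. The sketch below assumes precomputed, L2-normalized embeddings and uses placeholder data; it is a generic illustration rather than the paper's training recipe.

```python
import torch
import torch.nn as nn

def retrain_last_layer(text_embs: torch.Tensor, labels: torch.Tensor,
                       num_classes: int, epochs: int = 100, lr: float = 1e-3) -> nn.Linear:
    """Fit only the final linear head on (generated) text embeddings. Assumes the
    embeddings live in an image-text joint space (e.g., CLIP) and are normalized,
    so the head transfers to image embeddings at test time."""
    head = nn.Linear(text_embs.shape[1], num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(text_embs), labels)
        loss.backward()
        opt.step()
    return head

# toy usage with random stand-ins for CLIP text / image features
torch.manual_seed(0)
text_embs = nn.functional.normalize(torch.randn(200, 512), dim=-1)
labels = torch.randint(0, 2, (200,))
head = retrain_last_layer(text_embs, labels, num_classes=2)
image_embs = nn.functional.normalize(torch.randn(10, 512), dim=-1)
preds = head(image_embs).argmax(dim=-1)   # classify images with the text-trained head
```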

CosAvatar: Consistent and Animatable Portrait Video Tuning with Text Prompt

  • paper_url: http://arxiv.org/abs/2311.18288
  • repo_url: None
  • paper_authors: Haiyao Xiao, Chenglai Zhong, Xuan Gao, Yudong Guo, Juyong Zhang
  • for: High-quality, user-friendly text-guided portrait video editing with both temporal and 3D consistency.
  • methods: Represents the head and torso with a dynamic NeRF-based 3D portrait, alternates between editing the video-frame dataset and updating the underlying 3D portrait until the edited frames reach 3D consistency, and integrates semantic portrait priors so that edits can be restricted to specified semantic regions.
  • results: Accurately edits portrait styles or local attributes from text instructions and supports expressive animation driven by a source video.
    Abstract Recently, text-guided digital portrait editing has attracted more and more attentions. However, existing methods still struggle to maintain consistency across time, expression, and view or require specific data prerequisites. To solve these challenging problems, we propose CosAvatar, a high-quality and user-friendly framework for portrait tuning. With only monocular video and text instructions as input, we can produce animatable portraits with both temporal and 3D consistency. Different from methods that directly edit in the 2D domain, we employ a dynamic NeRF-based 3D portrait representation to model both the head and torso. We alternate between editing the video frames' dataset and updating the underlying 3D portrait until the edited frames reach 3D consistency. Additionally, we integrate the semantic portrait priors to enhance the edited results, allowing precise modifications in specified semantic areas. Extensive results demonstrate that our proposed method can not only accurately edit portrait styles or local attributes based on text instructions but also support expressive animation driven by a source video.

Dispersed Structured Light for Hyperspectral 3D Imaging

  • paper_url: http://arxiv.org/abs/2311.18287
  • repo_url: None
  • paper_authors: Suhyun Shin, Seokjun Choi, Felix Heide, Seung-Hwan Baek
  • for: Accurate, compact, and low-cost hyperspectral 3D imaging that captures both depth and spectral information.
  • methods: Places a sub-millimeter-thick diffraction grating film in front of a conventional projector so that the structured light is dispersed by wavelength, and devises a dispersive projection image-formation model together with a per-pixel hyperspectral 3D reconstruction method.
  • results: A compact prototype achieves 18.8 nm spectral FWHM and 1 mm depth error, outperforming prior practical hyperspectral 3D imaging methods.
    Abstract Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this work, we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film front of the projector. The grating disperses structured light based on light wavelength. To utilize the dispersed structured light, we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype. DSL achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth error of 1mm. We demonstrate that DSL outperforms prior work on practical hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains, including computer vision and graphics, cultural heritage, geology, and biology.

SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2311.18286
  • repo_url: None
  • paper_authors: Lingyi Hong, Wei Zhang, Shuyong Gao, Hong Lu, WenQiang Zhang
  • for: Efficient unsupervised video object segmentation that detects the primary objects in a video sequence without human intervention.
  • methods: SimulFlow performs feature extraction and target identification simultaneously; a SimulFlow Attention mechanism bridges appearance and motion, with coarse masks predicted at each stage constraining the attention to the mask area and suppressing noise, so no hand-designed fusion module is needed and only a light decoder produces the final prediction.
  • results: Achieves state-of-the-art results on several benchmarks, including 87.4% J&F on DAVIS-16 at the highest speed (63.7 FPS on a 3090) with the fewest parameters (13.7 M), plus competitive results on video salient object detection datasets.
    Abstract Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human interposing. Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks. However, this pipeline is computationally expensive and can lead to suboptimal performance due to the difficulty of fusing the two modalities properly. In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. Concretely, we design a novel SimulFlow Attention mechanism to bridege the image and motion by utilizing the flexibility of attention operation, where coarse masks predicted from fused feature at each stage are used to constrain the attention operation within the mask area and exclude the impact of noise. Because of the bidirectional information flow between visual and optical flow features in SimulFlow Attention, no extra hand-designed fusing module is required and we only adopt a light decoder to obtain the final prediction. We evaluate our method on several benchmark datasets and achieve state-of-the-art results. Our proposed approach not only outperforms existing methods but also addresses the computational complexity and fusion difficulties caused by two-stream architectures. Our models achieve 87.4% J & F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the lowest parameters (13.7 M). Our SimulFlow also obtains competitive results on video salient object detection datasets.
    摘要 无监督视频目标分割(UVOS)旨在无需人工干预地检测给定视频序列中的主要目标。现有方法大多采用双流架构,分别编码外观与运动信息,再融合二者以识别目标并生成目标掩码。然而,这种流程计算开销大,且两种模态难以恰当融合,可能导致性能欠佳。本文提出一种新的 UVOS 模型 SimulFlow,它同时进行特征提取与目标识别,实现高效且有效的无监督视频目标分割。具体而言,我们设计了 SimulFlow 注意力机制,利用注意力操作的灵活性桥接图像与运动信息:各阶段由融合特征预测的粗糙掩码用于将注意力限制在掩码区域内,从而排除噪声影响。由于 SimulFlow 注意力中视觉与光流特征之间的信息双向流动,无需额外的手工融合模块,仅用一个轻量解码器即可获得最终预测。我们在多个基准数据集上评估了该方法并取得最先进的结果:在 DAVIS-16 上达到 87.4% 的 J&F,同时具有最高速度(3090 上 63.7 FPS)和最少参数量(13.7 M);SimulFlow 在视频显著目标检测数据集上也取得了有竞争力的结果。

Utilizing Radiomic Feature Analysis For Automated MRI Keypoint Detection: Enhancing Graph Applications

  • paper_url: http://arxiv.org/abs/2311.18281
  • repo_url: None
  • paper_authors: Sahar Almahfouz Nasser, Shashwat Pathak, Keshav Singhal, Mohit Meena, Nihar Gupte, Ananya Chinmaya, Prateek Garg, Amit Sethi
  • for: Explores graph neural networks (GNNs) for image data by converting images into graphs through keypoint detection.
  • methods: Proposes radiomic-feature-based keypoint detection for medical (brain MRI) images, after SIFT and LoFTR failed to find consistent initial keypoints, and uses the detected keypoints as ground truth for a Super-Retina-style detector (LK-SuperRetina).
  • results: The detected keypoints are anatomically meaningful and improve registration, and GNN-based image matching yields more good matches with higher confidence scores, setting the stage for GNN applications in classification, segmentation, and registration.
    Abstract Graph neural networks (GNNs) present a promising alternative to CNNs and transformers in certain image processing applications due to their parameter-efficiency in modeling spatial relationships. Currently, a major area of research involves the converting non-graph input data for GNN-based models, notably in scenarios where the data originates from images. One approach involves converting images into nodes by identifying significant keypoints within them. Super-Retina, a semi-supervised technique, has been utilized for detecting keypoints in retinal images. However, its limitations lie in the dependency on a small initial set of ground truth keypoints, which is progressively expanded to detect more keypoints. Having encountered difficulties in detecting consistent initial keypoints in brain images using SIFT and LoFTR, we proposed a new approach: radiomic feature-based keypoint detection. Demonstrating the anatomical significance of the detected keypoints was achieved by showcasing their efficacy in improving registration processes guided by these keypoints. Subsequently, these keypoints were employed as the ground truth for the keypoint detection method (LK-SuperRetina). Furthermore, the study showcases the application of GNNs in image matching, highlighting their superior performance in terms of both the number of good matches and confidence scores. This research sets the stage for expanding GNN applications into various other applications, including but not limited to image classification, segmentation, and registration.
    摘要 图 neural network (GNN) 提供了一种有前途的替代方案,尤其在图像处理领域中,因为它们可以快速和效率地模型图像之间的关系。目前,研究的焦点在于将非图数据转换为 GNN 模型中使用,尤其在图像来源于图像的场景下。一种方法是将图像转换为节点,通过在图像中标识重要的关键点进行识别。Super-Retina 是一种半监督的技术,可以在眼球图像中检测关键点。然而,它的限制在于依赖于小的初始集成真实标点,这会逐渐扩展到检测更多的关键点。我们在脑图像中使用 radiomic 特征来检测关键点,并证明了这些关键点在注射过程中的生物学意义。然后,这些关键点被用作ground truth,以便使用 LK-SuperRetina 方法进行关键点检测。此外,研究还展示了 GNN 在图像匹配中的突出表现,包括更多的好匹配和更高的信任分数。这项研究为扩展 GNN 应用领域的基础 lay。

HKUST at SemEval-2023 Task 1: Visual Word Sense Disambiguation with Context Augmentation and Visual Assistance

  • paper_url: http://arxiv.org/abs/2311.18273
  • repo_url: https://github.com/thomas-yin/semeval-2023-task1
  • paper_authors: Zhuohao Yin, Xin Huang
  • for: Proposes a multi-modal retrieval framework for Visual Word Sense Disambiguation (VWSD), which selects, from a batch of candidate images, the one that best matches the target word's meaning within a limited context.
  • methods: Combines pretrained vision-language models with open knowledge bases and datasets through four components: (1) gloss matching, where a pretrained bi-encoder matches contexts with proper senses of the target word; (2) prompting, which folds matched glosses and other textual information such as synonyms into a template; (3) image retrieval from large open datasets using the prompts as queries; (4) modality fusion of the contextual information for the final prediction.
  • results: Although not the most competitive at SemEval-2023 Task 1, the system beats roughly half of the teams, and the experiments surface useful insights for WSD and multi-modal learning; code is available on GitHub.
    Abstract Visual Word Sense Disambiguation (VWSD) is a multi-modal task that aims to select, among a batch of candidate images, the one that best entails the target word's meaning within a limited context. In this paper, we propose a multi-modal retrieval framework that maximally leverages pretrained Vision-Language models, as well as open knowledge bases and datasets. Our system consists of the following key components: (1) Gloss matching: a pretrained bi-encoder model is used to match contexts with proper senses of the target words; (2) Prompting: matched glosses and other textual information, such as synonyms, are incorporated using a prompting template; (3) Image retrieval: semantically matching images are retrieved from large open datasets using prompts as queries; (4) Modality fusion: contextual information from different modalities are fused and used for prediction. Although our system does not produce the most competitive results at SemEval-2023 Task 1, we are still able to beat nearly half of the teams. More importantly, our experiments reveal acute insights for the field of Word Sense Disambiguation (WSD) and multi-modal learning. Our code is available on GitHub.
    摘要 <>Visual Word Sense Disambiguation (VWSD) 是一个多模态任务,旨在在限定的上下文中选择最能体现target词的意思的图像。在这篇论文中,我们提出了一个多模态检索框架,最大限度地利用预训练的视力语言模型,以及开放的知识库和数据集。我们的系统包括以下关键组件:1. 字义匹配:使用预训练的二元编码器模型来匹配上下文中的target词之字义。2. 提示:匹配到的字义和其他文本信息,如同义词,通过提示模板进行 incorporation。3. 图像检索:使用提示作为查询来从大量开放数据集中检索semantic上相似的图像。4. 模式融合:从不同模式中的信息进行融合,用于预测。虽然我们的系统在SemEval-2023任务1上并不是最竞争力强,但我们仍能击败大约一半的队伍。更重要的是,我们的实验发现了Word Sense Disambiguation(WSD)和多模态学习的锐意见。我们的代码可以在GitHub上找到。

Beyond Entropy: Style Transfer Guided Single Image Continual Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2311.18270
  • repo_url: None
  • paper_authors: Younggeol Cho, Youngrae Kim, Dongman Lee
  • for: Enables continual test-time adaptation of models to continually changing real-world environments under tight computational budgets.
  • methods: BESTTA is a style-transfer-guided, single-image continual test-time adaptation method built on a simple yet powerful normalization layer (BeIN) and style-guided losses, transferring the style of the input image toward the source style.
  • results: Adapts effectively to continually changing target environments on semantic segmentation and image classification using only a single image, outperforming state-of-the-art methods while training just two parameters in a BeIN layer with minimal memory.
    Abstract Continual test-time adaptation (cTTA) methods are designed to facilitate the continual adaptation of models to dynamically changing real-world environments where computational resources are limited. Due to this inherent limitation, existing approaches fail to simultaneously achieve accuracy and efficiency. In detail, when using a single image, the instability caused by batch normalization layers and entropy loss significantly destabilizes many existing methods in real-world cTTA scenarios. To overcome these challenges, we present BESTTA, a novel single image continual test-time adaptation method guided by style transfer, which enables stable and efficient adaptation to the target environment by transferring the style of the input image to the source style. To implement the proposed method, we devise BeIN, a simple yet powerful normalization method, along with the style-guided losses. We demonstrate that BESTTA effectively adapts to the continually changing target environment, leveraging only a single image on both semantic segmentation and image classification tasks. Remarkably, despite training only two parameters in a BeIN layer consuming the least memory, BESTTA outperforms existing state-of-the-art methods in terms of performance.

Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning

  • paper_url: http://arxiv.org/abs/2311.18266
  • repo_url: https://github.com/kerrydrx/escort
  • paper_authors: Ruxiao Duan, Yaoyao Liu, Jieneng Chen, Adam Kortylewski, Alan Yuille
  • for: Proposes a new exemplar super-compression and regeneration method to relieve the memory limits and forgetting problems of replay-based class-incremental learning (CIL).
  • methods: Instead of storing past images, ESCORT compresses each exemplar into visual and textual prompts (e.g., edge maps and class tags), cutting per-exemplar memory to 1/24 of the original, and later regenerates diverse high-resolution exemplars from the prompts with a pre-trained ControlNet; partial compression and diffusion-based data augmentation reduce the domain gap without fine-tuning the diffusion model.
  • results: Significantly improves performance on multiple CIL benchmarks, e.g., 5.0 percentage points over the previous state of the art on the 10-phase Caltech-256 dataset.
    Abstract Replay-based methods in class-incremental learning (CIL) have attained remarkable success, as replaying the exemplars of old classes can significantly mitigate catastrophic forgetting. Despite their effectiveness, the inherent memory restrictions of CIL result in saving a limited number of exemplars with poor diversity, leading to data imbalance and overfitting issues. In this paper, we introduce a novel exemplar super-compression and regeneration method, ESCORT, which substantially increases the quantity and enhances the diversity of exemplars. Rather than storing past images, we compress images into visual and textual prompts, e.g., edge maps and class tags, and save the prompts instead, reducing the memory usage of each exemplar to 1/24 of the original size. In subsequent learning phases, diverse high-resolution exemplars are generated from the prompts by a pre-trained diffusion model, e.g., ControlNet. To minimize the domain gap between generated exemplars and real images, we propose partial compression and diffusion-based data augmentation, allowing us to utilize an off-the-shelf diffusion model without fine-tuning it on the target dataset. Therefore, the same diffusion model can be downloaded whenever it is needed, incurring no memory consumption. Comprehensive experiments demonstrate that our method significantly improves model performance across multiple CIL benchmarks, e.g., 5.0 percentage points higher than the previous state-of-the-art on 10-phase Caltech-256 dataset.
    摘要 循环学习(Class-Incremental Learning,CIL)中的回忆性方法已经取得了很大的成功,因为重新播放过去的类型的示例可以减轻忘记性。尽管它们的效iveness,CIL中的内置的内存限制会导致只保留一小部分的示例,从而导致数据不均衡和预测问题。在这篇论文中,我们介绍了一种新的示例超压缩和重生方法,名为ESCORT。而不是保存过去的图像,我们将图像压缩成视觉和文本的提示,例如边极图和类标签,并将提示存储在内存中。在后续的学习阶段,我们使用预训练的扩散模型,例如ControlNet,生成了多样化的高分辨率示例。为了减少生成示例和真实图像之间的领域差距,我们提出了部分压缩和扩散基于的数据增强技术。因此,我们可以在需要时下载相同的扩散模型,不需要特性化 fine-tuning。全面的实验表明,我们的方法可以在多个 CIL 标准准的测试集上提高模型性能,比如10-phase Caltech-256 数据集上的5.0%。

MCI Detection using fMRI time series embeddings of Recurrence plots

  • paper_url: http://arxiv.org/abs/2311.18265
  • repo_url: https://github.com/blackpearl006/ISBI-2024
  • paper_authors: Ninad Aithal, Chakka Sai Pradeep, Neelam Sinha
  • for: Uses resting-state fMRI time series at selected regions of interest (ROIs) to study underlying brain dynamics and to distinguish healthy controls from Mild Cognitive Impairment (MCI) subjects.
  • methods: Considers 6 brain networks spanning 160 ROIs from the Dosenbach template, converts the representative time series at each ROI into a recurrence plot, and condenses the plots into low-dimensional feature embeddings with autoencoders (a minimal sketch of recurrence-plot construction follows this entry).
  • results: On fMRI volumes of 100 subjects (balanced data) from the public ADNI dataset, the method reaches a peak classification accuracy of 93% among the 6 networks and a mean accuracy of 89.3%.
    Abstract The human brain can be conceptualized as a dynamical system. Utilizing resting state fMRI time series imaging, we can study the underlying dynamics at ear-marked Regions of Interest (ROIs) to understand structure or lack thereof. This differential behavior could be key to understanding the neurodegeneration and also to classify between healthy and Mild Cognitive Impairment (MCI) subjects. In this study, we consider 6 brain networks spanning over 160 ROIs derived from Dosenbach template, where each network consists of 25-30 ROIs. Recurrence plot, extensively used to understand evolution of time series, is employed. Representative time series at each ROI is converted to its corresponding recurrence plot visualization, which is subsequently condensed to low-dimensional feature embeddings through Autoencoders. The performance of the proposed method is shown on fMRI volumes of 100 subjects (balanced data), taken from publicly available ADNI dataset. Results obtained show peak classification accuracy of 93% among the 6 brain networks, mean accuracy of 89.3% thereby illustrating promise in the proposed approach.
    摘要 人脑可以理解为动态系统。通过使用休息状态fMRI时间序列成像,我们可以研究下标注的Region of Interest(ROIs)下的内在动态,以解释结构或缺失。这种差异性可能是理解脑化退化的关键,以及分类健康和轻度认知障碍(MCI)者的关键。在这个研究中,我们考虑了6个脑网络,涵盖了160个ROIs,每个网络由25-30个ROIs组成。我们使用了时间序列演化的重要工具——回忆图,将每个ROI的代表时间序列转化为其对应的回忆图视化。然后,我们使用自适应Encoder将这些视化转化为低维度特征嵌入。研究结果显示,我们的方法在100名参与者的fMRI数据上达到了93%的峰分类精度,以及89.3%的平均精度,这表明了我们的方法的承诺。

A Compact Implicit Neural Representation for Efficient Storage of Massive 4D Functional Magnetic Resonance Imaging

  • paper_url: http://arxiv.org/abs/2312.00082
  • repo_url: None
  • paper_authors: Ruoran Li, Runzhao Yang, Wenxin Xiang, Yuxiao Cheng, Tingxiong Xiao, Jinli Suo
  • for: Proposes a compression paradigm tailored to 4D functional MRI (fMRI) data to improve compression ratio and fidelity.
  • methods: Builds on implicit neural representations (INR): it models spatial correlation of intra-region dynamics, decomposes reusable neuronal activation patterns, and describes inter-region similarity through proper initialization and nonlinear fusion (a minimal INR sketch follows this entry).
  • results: Experiments on publicly available datasets show the method outperforms state-of-the-art algorithms on conventional image-quality metrics and on fMRI downstream tasks, enabling low-bandwidth, high-fidelity sharing of massive fMRI data.
    Abstract Functional Magnetic Resonance Imaging (fMRI) data is a kind of widely used four-dimensional biomedical data, demanding effective compression but presenting unique challenges for compression due to its intricate temporal dynamics, low signal-to-noise ratio, and complicated underlying redundancies. This paper reports a novel compression paradigm specifically tailored for fMRI data based on Implicit Neural Representation (INR). The proposed approach focuses on removing the various redundancies among the time series, including (i) conducting spatial correlation modeling for intra-region dynamics, (ii) decomposing reusable neuronal activation patterns, and using proper initialization together with nonlinear fusion to describe the inter-region similarity. The above scheme properly incorporates the unique features of fMRI data, and experimental results on publicly available datasets demonstrate the effectiveness of the proposed method, surpassing state-of-the-art algorithms in both conventional image quality evaluation metrics and fMRI downstream tasks. This work in this paper paves the way for sharing massive fMRI data at low bandwidth and high fidelity.
    摘要 功能磁共振成像(fMRI)数据是一种被广泛使用的四维生物医学数据,亟需有效压缩,但其复杂的时间动态、低信噪比和复杂的内在冗余给压缩带来了独特挑战。本文提出了一种专为 fMRI 数据设计、基于隐式神经表示(INR)的新型压缩范式。该方法着重去除时间序列之间的各类冗余,包括:(i)对区域内动态进行空间相关性建模;(ii)分解可复用的神经激活模式;(iii)通过恰当的初始化与非线性融合刻画区域间相似性。该方案充分利用了 fMRI 数据的特点,在公开数据集上的实验结果表明,所提方法在常规图像质量评价指标和 fMRI 下游任务上均超越现有最优算法。本工作为以低带宽、高保真方式共享海量 fMRI 数据铺平了道路。
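At its core, an INR-based compressor stores network weights instead of voxels: a coordinate network maps (x, y, z, t) to signal intensity and is fit to the volume. The sketch below is a minimal Fourier-feature MLP with placeholder data; the paper's region-wise correlation modeling, activation-pattern decomposition, and nonlinear fusion are not reproduced here.

```python
import torch
import torch.nn as nn

class FourierINR(nn.Module):
    """Minimal implicit representation: (x, y, z, t) -> signal intensity,
    using random Fourier features and a small MLP."""
    def __init__(self, num_freqs: int = 16, hidden: int = 128):
        super().__init__()
        self.register_buffer("B", torch.randn(4, num_freqs) * 10.0)
        self.net = nn.Sequential(
            nn.Linear(2 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):                    # coords: (N, 4) in [0, 1]
        proj = 2 * torch.pi * coords @ self.B
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.net(feats).squeeze(-1)

# fitting-loop sketch: sample (coordinate, intensity) pairs from the 4D volume
model = FourierINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
coords = torch.rand(4096, 4)                      # placeholder batch of voxel coordinates
target = torch.rand(4096)                         # placeholder intensities
for _ in range(10):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(coords), target)
    loss.backward()
    opt.step()
```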

Diffusion Models Without Attention

  • paper_url: http://arxiv.org/abs/2311.18257
  • repo_url: None
  • paper_authors: Jing Nathan Yan, Jiatao Gu, Alexander M. Rush
  • for: Scaling high-fidelity diffusion-based image generation to higher resolutions.
  • methods: Replaces attention with a more scalable state space model backbone (DiffuSSM), handling high resolutions without global compression such as patchification and thereby preserving detailed image representation throughout the diffusion process.
  • results: On ImageNet and LSUN at two resolutions, DiffuSSMs match or surpass attention-based diffusion models on FID and Inception Score while using substantially fewer total FLOPs.
    Abstract In recent advancements in high-fidelity image generation, Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However, their application at high resolutions presents significant computational challenges. Current methods, such as patchifying, expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression, thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage.

Automatic Detection of Alzheimer’s Disease with Multi-Modal Fusion of Clinical MRI Scans

  • paper_url: http://arxiv.org/abs/2311.18245
  • repo_url: None
  • paper_authors: Long Chen, Liben Chen, Binfeng Xu, Wenxin Zhang, Narges Razavian
  • for: Predicting the stage of Alzheimer's disease - Cognitively Normal (CN), Mild Cognitive Impairment (MCI), or Alzheimer's Disease (AD) - from two different types of clinical brain MRI scans.
  • methods: An AlexNet-based deep learning model that learns the synergy of complementary information from T1 and FLAIR MRI scans (a minimal two-branch fusion sketch follows this entry).
  • results: Fusing the two MRI modalities improves the accuracy of automatic diagnosis.
    Abstract The aging population of the U.S. drives the prevalence of Alzheimer's disease. Brookmeyer et al. forecasts approximately 15 million Americans will have either clinical AD or mild cognitive impairment by 2060. In response to this urgent call, methods for early detection of Alzheimer's disease have been developed for prevention and pre-treatment. Notably, literature on the application of deep learning in the automatic detection of the disease has been proliferating. This study builds upon previous literature and maintains a focus on leveraging multi-modal information to enhance automatic detection. We aim to predict the stage of the disease - Cognitively Normal (CN), Mildly Cognitive Impairment (MCI), and Alzheimer's Disease (AD), based on two different types of brain MRI scans. We design an AlexNet-based deep learning model that learns the synergy of complementary information from both T1 and FLAIR MRI scans.
    摘要 美国老龄化人口导致阿尔ц海病的流行性。布鲁克梅耶等人预测到2060年,美国约有1500万人将出现轻度智能障碍或严重智能障碍。为应对这一紧迫的呼吁,旨在早期检测阿尔ц海病的方法已经开发出来。文献显示,深度学习在自动检测阿尔ц海病方面的应用已经激增。本研究基于先前的文献,并继续利用多Modal信息进行增强自动检测。我们目标是根据T1和FLAIR两种脑MRI扫描图片,预测疾病的stage,包括脑功能正常(CN)、轻度智能障碍(MCI)和阿尔ц海病(AD)。我们设计了AlexNet基于深度学习模型,利用两种MRI扫描图片的补做性信息来学习疾病的同时性。

DKiS: Decay weight invertible image steganography with private key

  • paper_url: http://arxiv.org/abs/2311.18243
  • repo_url: https://github.com/yanghangai/dkis
  • paper_authors: Hang Yang, Yitian Xu, Xuhua Liu
  • for: a private key-based image steganography technique
  • methods: a novel approach that requires the corresponding private key for access, together with a decay weight that controls information transfer from the secret to the host pipeline
  • results: experimental evidence demonstrating effectiveness and real-world applicability, addressing the transfer of non-essential ("garbage") information in invertible image steganography via the decay weight method
    Abstract Image steganography, the practice of concealing information within another image, traditionally faces security challenges when its methods become publicly known. To counteract this, we introduce a novel private key-based image steganography technique. This approach ensures the security of hidden information, requiring a corresponding private key for access, irrespective of the public knowledge of the steganography method. We present experimental evidence demonstrating our method's effectiveness, showcasing its real-world applicability. Additionally, we identified a critical challenge in the invertible image steganography process: the transfer of non-essential, or `garbage', information from the secret to the host pipeline. To address this, we introduced the decay weight to control the information transfer, filtering out irrelevant data and enhancing the performance of image steganography. Our code is publicly accessible at https://github.com/yanghangAI/DKiS, and a practical demonstration is available at http://yanghang.site/hidekey.
    摘要 Image 隐藏技术,即将信息隐藏在另一个图像中,传统上面临安全挑战,因为其方法的公开可能会被披露。为了解决这个问题,我们提出了一种新的私钥基于的图像隐藏技术。这种方法确保隐藏的信息的安全,无论公开了隐藏技术的方法,都需要对应的私钥访问。我们提供了实验证明我们的方法的有效性,并在实际应用中展示了其可行性。此外,我们发现了逆向图像隐藏过程中的一个关键挑战:将不必要的、或“垃圾”信息从秘密管道传递到主机管道。为了解决这个问题,我们引入了衰减因子,控制信息传输,过滤掉不必要的数据,并提高图像隐藏的性能。我们的代码可以在https://github.com/yanghangAI/DKiS中获取,并在http://yanghang.site/hidekey中进行实际示例。

Label-efficient Training of Small Task-specific Models by Leveraging Vision Foundation Models

  • paper_url: http://arxiv.org/abs/2311.18237
  • repo_url: None
  • paper_authors: Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, Oncel Tuzel
  • for: How to leverage a pretrained large Vision Foundation Model (VFM) to train a small task-specific model for a new target task with limited labeled data.
  • methods: A simple and highly effective task-oriented knowledge transfer approach that distills the VFM into the small model on a target-task-relevant transfer set (a minimal feature-distillation sketch follows this entry), plus an image-retrieval-based strategy for curating effective transfer sets.
  • results: On four target tasks with limited labels, the approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining, and supervised ImageNet pretraining by 1-10.5%, 2-22%, and 2-14%, respectively; the choice of transfer dataset has a significant effect on final target-task performance.
    Abstract Large Vision Foundation Models (VFMs) pretrained on massive datasets exhibit impressive performance on various downstream tasks, especially with limited labeled target data. However, due to their high memory and compute requirements, these models cannot be deployed in resource constrained settings. This raises an important question: How can we utilize the knowledge from a large VFM to train a small task-specific model for a new target task with limited labeled training data? In this work, we answer this question by proposing a simple and highly effective task-oriented knowledge transfer approach to leverage pretrained VFMs for effective training of small task-specific models. Our experimental results on four target tasks under limited labeled data settings show that the proposed knowledge transfer approach outperforms task-agnostic VFM distillation, web-scale CLIP pretraining and supervised ImageNet pretraining by 1-10.5%, 2-22% and 2-14%, respectively. We also show that the dataset used for transferring knowledge has a significant effect on the final target task performance, and propose an image retrieval-based approach for curating effective transfer sets.
    摘要 大型视野基本模型(VFM)在庞大数据集上预训练后表现出色,特别是在有限的标注数据上下降流程中。然而,由于它们的高内存和计算需求,这些模型无法在资源受限的环境中部署。这提出了一个重要问题:如何使用大型VFM中的知识来训练一个新的目标任务上的小任务特定模型,即使只有有限的标注数据available?在这项工作中,我们回答这个问题,提出了一种简单高效的任务指向知识传递方法,以利用预训练的VFM来有效地训练小任务特定模型。我们的实验结果表明,我们的知识传递方法在四个目标任务下的有限标注数据设置下表现出优于任务无关VFM涅泊、web级CLIP预训练和supervised ImageNet预训练by 1-10.5%, 2-22%和2-14%, respectively。我们还发现了转移知识的数据集对最终目标任务的表现有显著的影响,并提出了基于图像检索的方法来筛选有效的转移集。

TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model

  • paper_url: http://arxiv.org/abs/2311.18231
  • repo_url: https://github.com/htyao89/textual-based_class-aware_prompt_tuning
  • paper_authors: Hantao Yao, Rui Zhang, Changsheng Xu
  • for: Adapting pretrained vision-language models (VLMs) to diverse downstream tasks via prompt tuning.
  • methods: Proposes Textual-based Class-aware Prompt tuning (TCP), whose Textual Knowledge Embedding (TKE) maps class-level textual knowledge into class-aware textual tokens that are injected into the Text Encoder, yielding a dynamic class-aware classifier with better discriminability on unseen domains (a minimal sketch of TKE follows this entry).
  • results: TKE works as a plug-and-play module that combines easily with existing methods, and TCP consistently achieves superior performance while requiring less training time.
    Abstract Prompt tuning represents a valuable technique for adapting pre-trained visual-language models (VLM) to various downstream tasks. Recent advancements in CoOp-based methods propose a set of learnable domain-shared or image-conditional textual tokens to facilitate the generation of task-specific textual classifiers. However, those textual tokens have a limited generalization ability regarding unseen domains, as they cannot dynamically adjust to the distribution of testing classes. To tackle this issue, we present a novel Textual-based Class-aware Prompt tuning(TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability. The critical concept of TCP involves leveraging Textual Knowledge Embedding (TKE) to map the high generalizability of class-level textual knowledge into class-aware textual tokens. By seamlessly integrating these class-aware prompts into the Text Encoder, a dynamic class-aware classifier is generated to enhance discriminability for unseen domains. During inference, TKE dynamically generates class-aware prompts related to the unseen classes. Comprehensive evaluations demonstrate that TKE serves as a plug-and-play module effortlessly combinable with existing methods. Furthermore, TCP consistently achieves superior performance while demanding less training time.
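One way to picture TKE is as a small network that turns each class's text embedding into a few class-aware prompt tokens. The sketch below shows only that mapping and a CoOp-style concatenation with shared context tokens; the token counts, dimensions, and the exact injection point into the text encoder are assumptions.

```python
import torch
import torch.nn as nn

class TextualKnowledgeEmbedding(nn.Module):
    """Illustrative TKE: map a class-level text embedding to M class-aware
    prompt tokens (how they enter the text encoder is assumed here)."""
    def __init__(self, text_dim: int = 512, token_dim: int = 512, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, text_dim // 2), nn.ReLU(),
            nn.Linear(text_dim // 2, num_tokens * token_dim),
        )

    def forward(self, class_text_emb):            # (C, text_dim)
        tokens = self.mlp(class_text_emb)
        return tokens.view(-1, self.num_tokens, tokens.shape[-1] // self.num_tokens)

# toy usage: 10 classes, CLIP-sized 512-d class text embeddings
tke = TextualKnowledgeEmbedding()
class_emb = torch.randn(10, 512)
class_tokens = tke(class_emb)                      # (10, 4, 512) class-aware prompts
shared_prompts = nn.Parameter(torch.randn(4, 512)) # CoOp-style shared context tokens
per_class_prompt = torch.cat(
    [class_tokens, shared_prompts.unsqueeze(0).expand(10, -1, -1)], dim=1)  # (10, 8, 512)
```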

FS-BAND: A Frequency-Sensitive Banding Detector

  • paper_url: http://arxiv.org/abs/2311.18216
  • repo_url: None
  • paper_authors: Zijian Chen, Wei Sun, Zicheng Zhang, Ru Huang, Fangfang Lu, Xiongkuo Min, Guangtao Zhai, Wenjun Zhang
  • for: Detecting banding artifacts (staircase-like contours) that arise in compression, transmission, and similar scenarios and degrade users' quality of experience (QoE).
  • methods: Studies banding from the frequency perspective and proposes a no-reference, frequency-sensitive banding detector (FS-BAND) that produces a pixel-wise banding map together with a perception-correlated quality score.
  • results: FS-BAND classifies banding artifacts more accurately than state-of-the-art image quality assessment (IQA) methods.
    Abstract Banding artifact, as known as staircase-like contour, is a common quality annoyance that happens in compression, transmission, etc. scenarios, which largely affects the user's quality of experience (QoE). The banding distortion typically appears as relatively small pixel-wise variations in smooth backgrounds, which is difficult to analyze in the spatial domain but easily reflected in the frequency domain. In this paper, we thereby study the banding artifact from the frequency aspect and propose a no-reference banding detection model to capture and evaluate banding artifacts, called the Frequency-Sensitive BANding Detector (FS-BAND). The proposed detector is able to generate a pixel-wise banding map with a perception correlated quality score. Experimental results show that the proposed FS-BAND method outperforms state-of-the-art image quality assessment (IQA) approaches with higher accuracy in banding classification task.
    摘要 带宽artefact,也称为台阶样式损害,是压缩、传输等场景中常见的质量折扣,对用户体验质量(QoE)产生重大影响。带宽扭曲通常在平额背景中出现为小规模的像素级别变化,在空间领域难以分析,但在频率领域表现更为明显。在这篇论文中,我们从频率方面研究带宽artefact,并提出了一种无参亮度损害检测模型,called the Frequency-Sensitive BANding Detector (FS-BAND)。该检测器能够生成一个像素级别的带宽地图,并与感知相关的质量分数相关。实验结果显示,提出的FS-BAND方法在带宽分类任务中高度超越了现有的图像质量评估(IQA)方法。

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

  • paper_url: http://arxiv.org/abs/2312.00081
  • repo_url: https://github.com/wjpoom/spec
  • paper_authors: Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu
  • for: Evaluating vision-language models (VLMs) on fine-grained visual-linguistic understanding.
  • methods: Builds a progressive data pipeline that synthesizes images varying in a single attribute while keeping everything else consistent, and uses it to construct the SPEC benchmark for diagnosing comprehension of object size, position, existence, and count; also proposes a simple yet effective optimization approach for fine-grained understanding.
  • results: Four leading VLMs perform close to random guessing on SPEC; the proposed optimization yields significant gains on SPEC without sacrificing zero-shot performance and transfers consistently to two additional fine-grained benchmarks.
    Abstract Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simply yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach.
    摘要 视觉语言模型(VLM)在多种下游任务上表现出色,但理解细腻的视觉语言概念,如特征和物体之间关系,仍然是一项重要挑战。虽有多个benchmark旨在评估VLM的细化方面,但它们主要关注语言方面,忽视视觉维度。我们认为评估VLM的文本和视觉方面都是重要的。我们提出了一种逐步管道来生成具有特定特征的图像,保证所有其他方面具有一致性。利用这个数据引擎,我们计划了一个benchmark,SPEC,以诊断VLM的对象大小、位置、存在和数量的理解。然后,我们对四个领先的VLM进行了严格的评估。 result显示,它们在SPEC上的表现几乎与随机猜测一样,揭示了重要的局限性。针对这一点,我们提出了一种简单又有效的方法来优化VLM,实现了在SPEC上显著提高表现,同时无需妥协零基eline性。此外,我们还对两个额外的细化benchmark进行了进一步的评估,并得到了一致的 validate the transferability of our approach。

Perception of Misalignment States for Sky Survey Telescopes with the Digital Twin and the Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2311.18214
  • repo_url: None
  • paper_authors: Miao Zhang, Peng Jia, Zhengyang Li, Wennan Xiang, Jiameng Lv, Rui Sun
  • for: providing prior information for active optics systems and optical system alignment in sky survey telescopes.
  • methods: proposes a deep neural network that extracts misalignment states from continuously varying point spread functions across different fields of view, uses a digital twin to generate sufficient and diverse training data, and introduces a state graph to store misalignment data and explore the relationships between misalignment states and the corresponding point spread functions (a schematic regression sketch follows the abstract below).
  • results: once trained, the network estimates misalignment states accurately from observation data and is robust to atmospheric turbulence, noise, and the limited spatial sampling rate of the detector.
    Abstract Sky survey telescopes play a critical role in modern astronomy, but misalignment of their optical elements can introduce significant variations in point spread functions, leading to reduced data quality. To address this, we need a method to obtain misalignment states, aiding in the reconstruction of accurate point spread functions for data processing methods or facilitating adjustments of optical components for improved image quality. Since sky survey telescopes consist of many optical elements, they result in a vast array of potential misalignment states, some of which are intricately coupled, posing detection challenges. However, by continuously adjusting the misalignment states of optical elements, we can disentangle coupled states. Based on this principle, we propose a deep neural network to extract misalignment states from continuously varying point spread functions in different field of views. To ensure sufficient and diverse training data, we recommend employing a digital twin to obtain data for neural network training. Additionally, we introduce the state graph to store misalignment data and explore complex relationships between misalignment states and corresponding point spread functions, guiding the generation of training data from experiments. Once trained, the neural network estimates misalignment states from observation data, regardless of the impacts caused by atmospheric turbulence, noise, and limited spatial sampling rates in the detector. The method proposed in this paper could be used to provide prior information for the active optics system and the optical system alignment.
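A schematic sketch of the regression idea referenced above, under the assumption that PSF cutouts from several fields of view are stacked as input channels and the network outputs a vector of misalignment parameters; the architecture and dimensions are illustrative, not the paper's network.

```python
# Toy regressor: PSF cutouts from several fields of view stacked as channels -> K misalignment parameters.
import torch
import torch.nn as nn

class PSFMisalignmentNet(nn.Module):
    def __init__(self, num_fields: int = 9, num_params: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(num_fields, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_params)

    def forward(self, psf_stack):                 # psf_stack: (B, num_fields, H, W)
        return self.head(self.features(psf_stack).flatten(1))

# Training would minimize an L2 loss against misalignment labels produced by the digital twin:
# loss = ((model(psf_stack) - true_misalignment) ** 2).mean()
```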

SMaRt: Improving GANs with Score Matching Regularity

  • paper_url: http://arxiv.org/abs/2311.18208
  • repo_url: None
  • paper_authors: Mengfei Xia, Yujun Shen, Ceyuan Yang, Ran Yi, Wenping Wang, Yong-jin Liu
  • for: improving the ability of generative adversarial networks (GANs) to learn from highly diverse data whose underlying manifold is complex.
  • methods: proposes Score Matching Regularity (SMaRt), which augments the adversarial objective with a score-matching term that persistently pushes generated data points toward the real data manifold (a hedged loss-combination sketch follows the abstract below).
  • results: SMaRt consistently boosts the synthesis performance of various state-of-the-art GANs; for example, training Aurora on ImageNet 64x64 improves FID from 8.87 to 7.11, on par with the one-step consistency model.
    Abstract Generative adversarial networks (GANs) usually struggle to learn from highly diverse data whose underlying manifold is complex. In this work, we revisit the mathematical foundations of GANs and theoretically reveal that the native adversarial loss for GAN training is insufficient to fix the problem that subsets of the generated data manifold with positive Lebesgue measure lie outside the real data manifold. Instead, we find that score matching serves as a valid solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We thereby propose to improve the optimization of GANs with score matching regularity (SMaRt). Regarding the empirical evidence, we first design a toy example to show that training GANs with the aid of a ground-truth score function can help reproduce the real data distribution more accurately, and then confirm that our approach can consistently boost the synthesis performance of various state-of-the-art GANs on real-world datasets, with pre-trained diffusion models acting as the approximate score function. For instance, when training Aurora on the ImageNet 64x64 dataset, we improve FID from 8.87 to 7.11, on par with the performance of a one-step consistency model. The source code will be made public.
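A heavily hedged sketch of how a score-matching-style regularizer could be folded into a generator update, assuming access to a pretrained score function `score_fn(x, sigma)` approximating the score of the noised real-data distribution; the surrogate term below (whose gradient pushes samples along the score direction) is one simple choice among several and is not the exact SMaRt objective.

```python
# Generator step with an auxiliary score-guided regularizer (illustrative only).
import torch

def generator_step(G, D, score_fn, z, opt, sigma=0.1, lam=0.1):
    fake = G(z)
    adv_loss = torch.nn.functional.softplus(-D(fake)).mean()   # non-saturating GAN loss
    # Nudge fake samples along the (approximate) score of the real distribution:
    noised = fake + sigma * torch.randn_like(fake)
    with torch.no_grad():
        score = score_fn(noised, sigma)          # ~ grad_x log p_sigma(x), from a pretrained model
    # Surrogate whose gradient w.r.t. the generator moves samples toward higher real-data density.
    sm_loss = -(score * noised).flatten(1).sum(1).mean()
    loss = adv_loss + lam * sm_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```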

Hy-Tracker: A Novel Framework for Enhancing Efficiency and Accuracy of Object Tracking in Hyperspectral Videos

  • paper_url: http://arxiv.org/abs/2311.18199
  • repo_url: None
  • paper_authors: Mohammad Aminul Islam, Wangzhi Xing, Jun Zhou, Yongsheng Gao, Kuldip K. Paliwal
  • for: object tracking in hyperspectral videos, exploiting the rich material information carried by the many spectral bands.
  • methods: proposes Hy-Tracker, a framework that couples YOLOv7 with a refined tracking module and a Kalman filter to handle scale variation and occlusion (a basic Kalman-filter sketch follows the abstract below).
  • results: experiments on hyperspectral benchmark datasets show that Hy-Tracker tracks objects accurately across frames, including under scale variation and occlusion.
    Abstract Hyperspectral object tracking has recently emerged as a topic of great interest in the remote sensing community. The hyperspectral image, with its many bands, provides a rich source of material information about an object that can be effectively used for object tracking. While most hyperspectral trackers are based on detection-based techniques, no one has yet attempted to employ YOLO for detecting and tracking objects. This is due to the presence of multiple spectral bands, the scarcity of annotated hyperspectral videos, and YOLO's performance limitations in managing occlusions and distinguishing objects in cluttered backgrounds. Therefore, in this paper, we propose a novel framework called Hy-Tracker, which aims to bridge the gap between hyperspectral data and state-of-the-art object detection methods to leverage the strengths of YOLOv7 for object tracking in hyperspectral videos. Hy-Tracker not only introduces YOLOv7 but also innovatively incorporates a refined tracking module on top of YOLOv7. The tracker refines the initial detections produced by YOLOv7, leading to improved object-tracking performance. Furthermore, we incorporate a Kalman filter into the tracker, which addresses the challenges posed by scale variation and occlusion. The experimental results on hyperspectral benchmark datasets demonstrate the effectiveness of Hy-Tracker in accurately tracking objects across frames.
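A minimal constant-velocity Kalman filter over a bounding-box center, of the kind commonly layered on per-frame detections to smooth scale variation and ride out short occlusions; the state, noise covariances, and update schedule are illustrative assumptions rather than Hy-Tracker's exact filter.

```python
# Constant-velocity Kalman filter over (cx, cy): predict every frame, update when a detection exists.
import numpy as np

class BoxCenterKalman:
    def __init__(self, cx, cy, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])             # state: position and velocity
        self.P = np.eye(4)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)    # constant-velocity motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)    # only position is observed
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                  # predicted center (useful when occluded)

    def update(self, z):                                   # z = detected (cx, cy)
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```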

S-T CRF: Spatial-Temporal Conditional Random Field for Human Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2311.18198
  • repo_url: None
  • paper_authors: Pengqian Han, Jiamou Liu, Jialing He, Zeyu Zhang, Song Yang, Yanni Tang, Partha Roop
  • for: improving the accuracy of pedestrian trajectory prediction so that autonomous vehicles and robots can plan their motion more reliably.
  • methods: proposes the S-T CRF (Spatial-Temporal Conditional Random Field), which combines spatial-temporal trajectory information with explicit intention information; a Conditional Random Field (CRF) generates a representation of future intentions, and newly devised spatial and temporal CRF losses enhance interaction constraints and temporal dynamics, respectively.
  • results: experiments on the ETH/UCY and SDD datasets show that S-T CRF surpasses existing baseline approaches.
    Abstract Trajectory prediction is of significant importance in computer vision. Accurate pedestrian trajectory prediction benefits autonomous vehicles and robots in planning their motion. Pedestrians' trajectories are greatly influenced by their intentions. Prior studies have introduced various deep learning methods but pay attention only to the spatial and temporal information of the trajectory, overlooking explicit intention information. In this study, we introduce a novel model, termed the \textbf{S-T CRF}: \textbf{S}patial-\textbf{T}emporal \textbf{C}onditional \textbf{R}andom \textbf{F}ield, which judiciously incorporates intention information besides the spatial and temporal information of the trajectory. This model uses a Conditional Random Field (CRF) to generate a representation of future intentions, greatly improving the prediction of subsequent trajectories when combined with the spatial-temporal representation. Furthermore, the study innovatively devises a space CRF loss and a time CRF loss, meticulously designed to enhance interaction constraints and temporal dynamics, respectively. Extensive experimental evaluations on the ETH/UCY and SDD datasets demonstrate that the proposed method surpasses existing baseline approaches.

Persistent Test-time Adaptation in Episodic Testing Scenarios

  • paper_url: http://arxiv.org/abs/2311.18193
  • repo_url: None
  • paper_authors: Trung-Hieu Hoang, Duc Minh Vo, Minh N. Do
  • for: examining whether test-time adaptation (TTA) models accumulate errors when repeatedly exposed to recurring testing environments.
  • methods: introduces an episodic TTA setting and, by simulating the TTA process on a simple yet representative $\epsilon$-perturbed Gaussian Mixture Model classifier, theoretically derives the dataset- and algorithm-dependent factors that cause TTA methods to gradually degenerate over time.
  • results: TTA methods do accumulate errors under repeated exposure; the proposed persistent TTA (PeTTA) senses how far the model has drifted toward collapse and adjusts the adaptation strategy accordingly, remaining stable across episodic TTA scenarios on various benchmarks (a hedged sketch of divergence-aware adaptation follows the abstract below).
    Abstract Current test-time adaptation (TTA) approaches aim to adapt to environments that change continuously. Yet, when the environments not only change but also recur in a correlated manner over time, such as in the case of day-night surveillance cameras, it is unclear whether the adaptability of these methods is sustained after a long run. This study aims to examine the error accumulation of TTA models when they are repeatedly exposed to previous testing environments, proposing a novel testing setting called episodic TTA. To study this phenomenon, we design a simulation of the TTA process on a simple yet representative $\epsilon$-perturbed Gaussian Mixture Model Classifier and derive theoretical findings revealing the dataset- and algorithm-dependent factors that contribute to the gradual degeneration of TTA methods over time. Our investigation leads us to propose a method, named persistent TTA (PeTTA). PeTTA senses the model's divergence towards collapse and adjusts the adaptation strategy of TTA, striking a balance between two primary objectives: adaptation and preventing model collapse. The stability of PeTTA in the face of episodic TTA scenarios has been demonstrated through a set of comprehensive experiments on various benchmarks.
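A minimal sketch of divergence-aware test-time adaptation as referenced above, assuming entropy minimization as the adaptation objective and the relative parameter distance to the source model as a crude collapse signal used to damp updates; both choices are illustrative assumptions, not PeTTA's actual mechanism.

```python
# Entropy-minimization TTA step whose update is damped as parameters drift from the source model.
import copy
import torch
import torch.nn.functional as F

def make_tta_state(model):
    return {"source": copy.deepcopy(model.state_dict()),
            "opt": torch.optim.SGD(model.parameters(), lr=1e-3)}

def param_drift(model, source_state):
    cur = torch.cat([p.detach().flatten() for p in model.parameters()])
    src = torch.cat([source_state[name].flatten() for name, _ in model.named_parameters()])
    return torch.norm(cur - src) / (torch.norm(src) + 1e-8)

def tta_step(model, batch, state, max_drift=0.05):
    logits = model(batch)
    entropy = -(F.softmax(logits, 1) * F.log_softmax(logits, 1)).sum(1).mean()
    drift = param_drift(model, state["source"])
    scale = max(0.0, 1.0 - float(drift) / max_drift)   # shrink updates as the model drifts toward collapse
    state["opt"].zero_grad()
    (scale * entropy).backward()
    state["opt"].step()
    return logits
```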

Quantification of cardiac capillarization in single-immunostained myocardial slices using weakly supervised instance segmentation

  • paper_url: http://arxiv.org/abs/2311.18173
  • repo_url: None
  • paper_authors: Zhao Zhang, Xiwen Chen, William Richardson, Bruce Z. Gao, Abolfazl Razi, Tong Ye
  • for: providing an automated tool for quantifying myocardial capillarization in single-immunostained myocardial slices.
  • methods: AutoQC uses a weakly supervised instance segmentation algorithm that leverages a pre-trained segmentation model through prompt engineering to identify and segment cardiomyocytes (CMs) and capillaries in collagen type IV immunofluorescence images; common capillarization-related measurements are then derived from the segmentation masks (a hedged measurement sketch follows the abstract below).
  • results: AutoQC outperforms YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment, and its training requires only a small dataset with bounding-box annotations rather than pixel-wise annotations, greatly reducing the annotation workload.
    Abstract Decreased myocardial capillary density has been reported as an important histopathological feature associated with various heart disorders. Quantitative assessment of cardiac capillarization typically involves double immunostaining of cardiomyocytes (CMs) and capillaries in myocardial slices. In contrast, single immunostaining of basement membrane components is a straightforward approach to simultaneously label CMs and capillaries, presenting fewer challenges in background staining. However, subsequent image analysis always requires manual work in identifying and segmenting CMs and capillaries. Here, we developed an image analysis tool, AutoQC, to automatically identify and segment CMs and capillaries in immunofluorescence images of collagen type IV, a predominant basement membrane protein within the myocardium. In addition, commonly used capillarization-related measurements can be derived from segmentation masks. AutoQC features a weakly supervised instance segmentation algorithm by leveraging the power of a pre-trained segmentation model via prompt engineering. AutoQC outperformed YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment. Furthermore, the training of AutoQC required only a small dataset with bounding box annotations instead of pixel-wise annotations, leading to a reduced workload during network training. AutoQC provides an automated solution for quantifying cardiac capillarization in basement-membrane-immunostained myocardial slices, eliminating the need for manual image analysis once it is trained.
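A small sketch of capillarization-related measurements that can be derived from instance masks, such as the capillary-to-cardiomyocyte ratio and capillary density per tissue area; the mask format, pixel size, and metric definitions here are illustrative assumptions, not AutoQC's exact outputs.

```python
# Given labeled instance masks (0 = background, 1..N = instances), derive simple capillarization metrics.
import numpy as np

def capillarization_metrics(capillary_mask, cm_mask, um_per_pixel=0.5):
    """capillary_mask / cm_mask: 2-D integer arrays of instance labels."""
    n_capillaries = len(np.unique(capillary_mask)) - (1 if (capillary_mask == 0).any() else 0)
    n_cms = len(np.unique(cm_mask)) - (1 if (cm_mask == 0).any() else 0)
    # Use the cardiomyocyte-covered area as a proxy for tissue area (illustrative choice).
    tissue_area_mm2 = (cm_mask > 0).sum() * (um_per_pixel ** 2) / 1e6
    return {
        "capillary_to_CM_ratio": n_capillaries / max(n_cms, 1),
        "capillary_density_per_mm2": n_capillaries / max(tissue_area_mm2, 1e-9),
    }
```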

Few-shot Image Generation via Style Adaptation and Content Preservation

  • paper_url: http://arxiv.org/abs/2311.18169
  • repo_url: None
  • paper_authors: Xiaosheng He, Fan Yang, Fayao Liu, Guosheng Lin
  • for: training generative adversarial networks (GANs) with very limited data (e.g., 10 images), where naive fine-tuning of a pre-trained GAN easily overfits.
  • methods: proposes a paired image reconstruction approach for content preservation: an image translation module is added to GAN transfer, teaching the generator to separate style from content, while the generator in return provides training data for the translation module.
  • results: qualitative and quantitative experiments show that the method consistently surpasses state-of-the-art methods in the few-shot setting.
    Abstract Training a generative model with limited data (e.g., 10 images) is a very challenging task. Many works propose to fine-tune a pre-trained GAN model. However, this can easily result in overfitting: they manage to adapt the style but fail to preserve the content, where \textit{style} denotes the specific properties that define a domain while \textit{content} denotes the domain-irrelevant information that represents diversity. Recent works try to maintain a pre-defined correspondence to preserve the content; however, the diversity is still insufficient, which may affect style adaptation. In this work, we propose a paired image reconstruction approach for content preservation. We introduce an image translation module into GAN transfer, where the module teaches the generator to separate style and content, and the generator provides training data to the translation module in return. Qualitative and quantitative experiments show that our method consistently surpasses state-of-the-art methods in the few-shot setting.

Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

  • paper_url: http://arxiv.org/abs/2311.18168
  • repo_url: None
  • paper_authors: Karren D. Yang, Anurag Ranjan, Jen-Hao Rick Chang, Raviteja Vemulapalli, Oncel Tuzel
  • for: animating 3D facial geometry from speech. Existing methods are largely deterministic, learning a one-to-one mapping from speech to 3D face meshes on small datasets with few speakers; they achieve high-quality lip articulation for training-set speakers but cannot capture the full, diverse distribution of real-world 3D facial motion that accompanies speech.
  • methods: identifies the lack of suitable datasets and metrics as key obstacles, proposes large-scale benchmark datasets and metrics for probabilistic modeling, and develops a probabilistic model that generates diverse 3D facial motion while remaining faithful to the speech conditioning signal.
  • results: the model achieves both diversity and fidelity to speech, outperforming other methods on the proposed benchmarks; it can generate diverse speech-driven 3D facial motion matching unseen speaker styles extracted from reference clips, and its synthetic meshes improve the performance of downstream audio-visual models.
    Abstract We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them, as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models.

A-Scan2BIM: Assistive Scan to Building Information Modeling

  • paper_url: http://arxiv.org/abs/2311.18166
  • repo_url: https://github.com/weiliansong/A-Scan2BIM
  • paper_authors: Weilian Song, Jieliang Luo, Dale Zhao, Yan Fu, Chin-Yi Cheng, Yasutaka Furukawa
  • for: assisting architects in the Scan-to-BIM process, which converts scan data into a Building Information Modeling (BIM) representation and otherwise requires many hours of manual work even for a single floor; the goal is to assist architects rather than replace them.
  • methods: the proposed assistive system takes the raw sensor data and the edit history (including the current BIM model) and auto-regressively predicts a sequence of model editing operations expressed as APIs of professional BIM software (Autodesk Revit).
  • results: the paper presents a building-scale Scan2BIM dataset covering 16 scenes and over 35,000 m^2 with 89 hours of modeling by professional architects, introduces a novel metric for how natural the order of reconstructed operations is, and shows that a simple modification to the reconstruction module improves reconstruction quality, with the method far surpassing two baselines on the order metric.
    Abstract This paper proposes an assistive system for architects that converts a large-scale point cloud into a standardized digital representation of a building for Building Information Modeling (BIM) applications. The process is known as Scan-to-BIM, which requires many hours of manual work even for a single building floor by a professional architect. Given its challenging nature, the paper focuses on helping architects on the Scan-to-BIM process, instead of replacing them. Concretely, we propose an assistive Scan-to-BIM system that takes the raw sensor data and edit history (including the current BIM model), then auto-regressively predicts a sequence of model editing operations as APIs of a professional BIM software (i.e., Autodesk Revit). The paper also presents the first building-scale Scan2BIM dataset that contains a sequence of model editing operations as the APIs of Autodesk Revit. The dataset contains 89 hours of Scan2BIM modeling processes by professional architects over 16 scenes, spanning over 35,000 m^2. We report our system's reconstruction quality with standard metrics, and we introduce a novel metric that measures how natural the order of reconstructed operations is. A simple modification to the reconstruction module helps improve performance, and our method is far superior to two other baselines in the order metric. We will release data, code, and models at a-scan2bim.github.io.

Compact3D: Compressing Gaussian Splat Radiance Field Models with Vector Quantization

  • paper_url: http://arxiv.org/abs/2311.18159
  • repo_url: https://github.com/ucdvision/compact3d
  • paper_authors: KL Navaneet, Kossar Pourahmadi Meibodi, Soroush Abbasi Koohpayegani, Hamed Pirsiavash
  • for: 3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that achieves faster learning and rendering time compared to SOTA NeRF methods, but with a larger storage demand.
  • methods: The paper introduces a simple vector quantization method based on the k-means algorithm to quantize the Gaussian parameters, storing a small codebook along with the index of the code for each Gaussian; the indices are further compressed with a method similar to run-length encoding (a brief quantization sketch follows the abstract below).
  • results: The paper shows that the proposed method can reduce the storage cost for the original 3D Gaussian Splatting method by a factor of almost $20\times$ with a very small drop in the quality of rendered images, on standard benchmarks as well as a new benchmark that is an order of magnitude larger than the standard benchmarks.
    Abstract 3D Gaussian Splatting is a new method for modeling and rendering 3D radiance fields that achieves much faster learning and rendering time compared to SOTA NeRF methods. However, it comes with the drawback of a much larger storage demand compared to NeRF methods, since it needs to store the parameters for many 3D Gaussians. We notice that many Gaussians may share similar parameters, so we introduce a simple vector quantization method based on the k-means algorithm to quantize the Gaussian parameters. Then, we store the small codebook along with the index of the code for each Gaussian. Moreover, we compress the indices further by sorting them and using a method similar to run-length encoding. We conduct extensive experiments on standard benchmarks as well as a new benchmark which is an order of magnitude larger than the standard benchmarks. We show that our simple yet effective method can reduce the storage cost of the original 3D Gaussian Splatting method by a factor of almost $20\times$ with a very small drop in the quality of rendered images.
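A brief sketch of the quantization pipeline described above, assuming the per-Gaussian parameters are stacked into an (N, D) matrix: k-means builds a shared codebook, each Gaussian keeps only a code index, and sorted indices are run-length encoded. The codebook size and the scikit-learn backend are illustrative choices, not the paper's implementation.

```python
# Vector-quantize per-Gaussian parameter vectors with k-means, then run-length-encode sorted indices.
import numpy as np
from sklearn.cluster import KMeans

def quantize_gaussians(params: np.ndarray, codebook_size: int = 4096):
    """params: (N, D) array of per-Gaussian parameters (e.g., color/covariance features)."""
    km = KMeans(n_clusters=codebook_size, n_init=1, random_state=0).fit(params)
    codebook = km.cluster_centers_.astype(np.float16)      # small table shared by all Gaussians
    indices = km.labels_.astype(np.int32)                  # one code id per Gaussian
    return codebook, indices

def run_length_encode(sorted_indices: np.ndarray):
    """Encode a sorted index array as (value, count) pairs."""
    values, counts = np.unique(sorted_indices, return_counts=True)
    return list(zip(values.tolist(), counts.tolist()))

# Usage sketch: the sorting permutation must be stored as well so the original per-Gaussian order
# can be restored at load time.
# order = np.argsort(indices); rle = run_length_encode(indices[order])
```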

HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

  • paper_url: http://arxiv.org/abs/2311.18158
  • repo_url: None
  • paper_authors: Yifan Zhang, Bryan Hooi
  • for: enabling one-step text-to-image diffusion, improving both generation speed and quality.
  • methods: trains one-step, low-rank adapters that specifically enhance the under-represented high-frequency abilities of advanced diffusion models (a hedged sketch of a high-frequency-promoting loss follows the abstract below).
  • results: compared with progressive distillation, HiPA achieves better one-step text-to-image generation (FID-5k on MS-COCO 2017 improves from 37.3 to 23.8) with a 28.6x training speed-up (108.8 to 3.8 A100 GPU days) and only 0.04% of the training parameters (3.3 million vs. 7,740 million); it also delivers high-quality one-step results for text-guided image editing, inpainting, and super-resolution.
    Abstract Diffusion models have revolutionized text-to-image generation, but their real-world applications are hampered by the extensive time needed for hundreds of diffusion steps. Although progressive distillation has been proposed to speed up diffusion sampling to 2-8 steps, it still falls short in one-step generation, and necessitates training multiple student models, which is highly parameter-extensive and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to enable one-step text-to-image diffusion. Grounded in the insight that high-frequency information is essential but highly lacking in one-step diffusion, HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors empower these diffusion models to generate high-quality images in just a single step. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation (37.3 $\rightarrow$ 23.8 in FID-5k on MS-COCO 2017) and 28.6x training speed-up (108.8 $\rightarrow$ 3.8 A100 GPU days), requiring only 0.04% training parameters (7,740 million $\rightarrow$ 3.3 million). We also demonstrate HiPA's effectiveness in text-guided image editing, inpainting and super-resolution tasks, where our adapted models consistently deliver high-quality outputs in just one diffusion step. The source code will be released.
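A hedged sketch of two generic ingredients suggested by the description above: a low-rank adapter wrapping a frozen linear layer, and an FFT-based high-pass loss that compares the high-frequency content of generated images against references. Both are illustrations under stated assumptions and do not reflect HiPA's actual adapter placement or objective.

```python
# (1) A minimal low-rank adapter around a frozen linear layer.
# (2) An FFT high-pass loss penalizing high-frequency differences between prediction and target.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)            # frozen pretrained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                    # start as an identity adaptation

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

def high_frequency_loss(pred, target, cutoff: float = 0.25):
    """pred/target: (B, C, H, W). Penalize differences only in the high-frequency band."""
    B, C, H, W = pred.shape
    fy = torch.fft.fftfreq(H, device=pred.device).abs().view(1, 1, H, 1)
    fx = torch.fft.fftfreq(W, device=pred.device).abs().view(1, 1, 1, W)
    high_pass = ((fy ** 2 + fx ** 2).sqrt() > cutoff).float()
    diff = torch.fft.fft2(pred) - torch.fft.fft2(target)
    return (diff.abs() * high_pass).mean()
```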