2023-12-02

cs.CV

cs.CV - 2023-12-02

Disentangling the Effects of Data Augmentation and Format Transform in Self-Supervised Learning of Image Representations

paper_url: http://arxiv.org/abs/2312.02205
repo_url: None
paper_authors: Neha Kalibhat, Warren Morningstar, Alex Bijamov, Luyang Liu, Karan Singhal, Philip Mansfield
for: 本文研究了自动学习（SSL）在视觉领域中的应用，尤其是对数据增强和格式变换的影响。
methods: 本文使用了数据增强和格式变换两种方法，分别是频域增强（FDA）和图像增强。
results: 研究发现，将频域增强与图像增强结合使用可以提高下游分类精度，最高提升1.3%。此外，研究还发现格式变换可以在没有增强情况下提高学习的表示质量。

Abstract
Self-Supervised Learning (SSL) enables training performant models using limited labeled data. One of the pillars underlying vision SSL is the use of data augmentations/perturbations of the input which do not significantly alter its semantic content. For audio and other temporal signals, augmentations are commonly used alongside format transforms such as Fourier transforms or wavelet transforms. Unlike augmentations, format transforms do not change the information contained in the data; rather, they express the same information in different coordinates. In this paper, we study the effects of format transforms and augmentations both separately and together on vision SSL. We define augmentations in frequency space called Fourier Domain Augmentations (FDA) and show that training SSL models on a combination of these and image augmentations can improve the downstream classification accuracy by up to 1.3% on ImageNet-1K. We also show improvements against SSL baselines in few-shot and transfer learning setups using FDA. Surprisingly, we also observe that format transforms can improve the quality of learned representations even without augmentations; however, the combination of the two techniques yields better quality.

摘要

Motion Informed Needle Segmentation in Ultrasound Images

paper_url: http://arxiv.org/abs/2312.01239
repo_url: None
paper_authors: Raghavv Goel, Cecilia Morales, Manpreet Singh, Artur Dubrawski, John Galeotti, Howie Choset
for: 本研究旨在提高ultrasound影像中针的分割精度，解决针的运动 artifacts、噪声和隐藏现象等问题。
methods: 本研究提出了一种新的方法，结合了传统的Kalman滤波技术和数据驱动学习，同时考虑了针的运动特征和数据有限性。
results: 本研究实现了一种新的Convolutional Neural Networks (CNN) 基于 Kalman Filter (KF) 块，与现有的状态艺术模型相比，提供了15%的像素精度提高和8%的长度错误减少。此外，本研究还是首次在需le segmentation中实现了学习滤波，以提高针的分割精度。

Abstract
Segmenting a moving needle in ultrasound images is challenging due to the presence of artifacts, noise, and needle occlusion. This task becomes even more demanding in scenarios where data availability is limited. Convolutional Neural Networks (CNNs) have been successful in many computer vision applications, but struggle to accurately segment needles without considering their motion. In this paper, we present a novel approach for needle segmentation that combines classical Kalman Filter (KF) techniques with data-driven learning, incorporating both needle features and needle motion. Our method offers two key contributions. First, we propose a compatible framework that seamlessly integrates into commonly used encoder-decoder style architectures. Second, we demonstrate superior performance compared to recent state-of-the-art needle segmentation models using our novel convolutional neural network (CNN) based KF-inspired block, achieving a 15\% reduction in pixel-wise needle tip error and an 8\% reduction in length error. Third, to our knowledge we are the first to implement a learnable filter to incorporate non-linear needle motion for improving needle segmentation.

摘要
segmenting a moving needle in ultrasound images is challenging due to the presence of artifacts, noise, and needle occlusion. this task becomes even more demanding in scenarios where data availability is limited. convolutional neural networks (cnn) have been successful in many computer vision applications, but struggle to accurately segment needles without considering their motion. in this paper, we present a novel approach for needle segmentation that combines classical kalman filter (kf) techniques with data-driven learning, incorporating both needle features and needle motion. our method offers two key contributions. first, we propose a compatible framework that seamlessly integrates into commonly used encoder-decoder style architectures. second, we demonstrate superior performance compared to recent state-of-the-art needle segmentation models using our novel convolutional neural network (cnn) based kf-inspired block, achieving a 15% reduction in pixel-wise needle tip error and an 8% reduction in length error. third, to our knowledge we are the first to implement a learnable filter to incorporate non-linear needle motion for improving needle segmentation.

Volumetric Rendering with Baked Quadrature Fields

paper_url: http://arxiv.org/abs/2312.02202
repo_url: None
paper_authors: Gopal Sharma, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi
for: 这个论文旨在解决非透明场景中Neural Radiance Field（NeRF）的快速推理问题，使得NeRF可以更好地利用现代图形硬件。
methods: 该论文提出了一种新的NeRF表示方法，利用文本化多边形来实现快速推理。该方法包括将场景模型为多边形，然后通过 marching cubes算法获取多边形的零十字点，并使用这些点来模型volume rendering效果。最后，通过纹理fragment shader来获取最终的颜色图像。
results: 该论文的实验结果表明，该方法可以快速地生成高质量的新视图图像，并且可以轻松地与现有的图形框架集成。此外，该方法还可以利用现代图形硬件来提高渲染速度。

Abstract
We propose a novel Neural Radiance Field (NeRF) representation for non-opaque scenes that allows fast inference by utilizing textured polygons. Despite the high-quality novel view rendering that NeRF provides, a critical limitation is that it relies on volume rendering that can be computationally expensive and does not utilize the advancements in modern graphics hardware. Existing methods for this problem fall short when it comes to modelling volumetric effects as they rely purely on surface rendering. We thus propose to model the scene with polygons, which can then be used to obtain the quadrature points required to model volumetric effects, and also their opacity and colour from the texture. To obtain such polygonal mesh, we train a specialized field whose zero-crossings would correspond to the quadrature points when volume rendering, and perform marching cubes on this field. We then rasterize the polygons and utilize the fragment shaders to obtain the final colour image. Our method allows rendering on various devices and easy integration with existing graphics frameworks while keeping the benefits of volume rendering alive.

摘要
我们提出了一种新的神经频谱场（NeRF）表示方法，用于非透明场景，具有快速推理功能，通过使用文本化多边形。虽然NeRF提供了高质量的新视角渲染，但存在一个重要的限制，即它依赖于体积渲染，可能会占用大量计算资源，而不利用现代图形硬件的进步。现有的方法对这个问题不够，因为它们仅仅基于表面渲染，无法模拟体积效果。我们因此提议使用多边形来模拟场景，然后使用这些多边形来获取需要模拟体积效果的 quadrature 点，以及它们的透射度和颜色从文本中获取。我们然后在这个场景中训练特殊的场，其零 crossing 点对应于体积渲染中的 quadrature 点，并使用 Marching Cubes 算法来生成多边形。然后，我们折射多边形，并使用 Fragment Shader 来获取最终的颜色图像。我们的方法允许在不同的设备上进行渲染，并与现有的图形框架集成，保留体积渲染的优点。

ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation

paper_url: http://arxiv.org/abs/2312.02201
repo_url: None
paper_authors: Peng Wang, Yichun Shi
for: 这个论文是为了开发一种基于图像激活的多视角扩散模型，用于三维对象生成。
methods: 该方法使用一种幂等相机坐标系，以提高图像视觉几何精度。它还使用多级控制，以在输入图像基础上Shape the overall object layout和 fine-tune image details。
results: 该方法可以生成比现有状态艺术图像 conditioned 方法更高质量的三维模型。经过广泛的评估，ImageDream 表现出色，并且可以在不同的推荐列表上进行更多的应用。

Abstract
We introduce "ImageDream," an innovative image-prompt, multi-view diffusion model for 3D object generation. ImageDream stands out for its ability to produce 3D models of higher quality compared to existing state-of-the-art, image-conditioned methods. Our approach utilizes a canonical camera coordination for the objects in images, improving visual geometry accuracy. The model is designed with various levels of control at each block inside the diffusion model based on the input image, where global control shapes the overall object layout and local control fine-tunes the image details. The effectiveness of ImageDream is demonstrated through extensive evaluations using a standard prompt list. For more information, visit our project page at https://Image-Dream.github.io.

摘要
我们介绍“图像之 dream”，一种创新的图像提示、多视图扩散模型，用于3D объек的生成。“图像之 dream”在现有状态的图像条件方法中比较出色，能够生成高质量的3D模型。我们的方法利用图像中对象的 canonical 摄像机坐标，改善视觉几何精度。模型内部各块的控制水平根据输入图像进行设定，全局控制定义整体对象布局，本地控制细化图像细节。我们通过了广泛的评估，包括标准提示列表，证明了图像之 dream 的有效性。更多信息，请访问我们项目页面：https://Image-Dream.github.io。

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation

paper_url: http://arxiv.org/abs/2312.01220
repo_url: None
paper_authors: Zhipeng Du, Miaojing Shi, Jiankang Deng
for: 提高适应低光照场景的物体检测精度
methods: 采用零批量日夜频谱适应，学习Retinex频谱免疫，并提高Retinex图像分解过程
results: 在ExDark、DARK FACE和CODaN等 datasets上实现了强大的低光照场景物体检测精度

Abstract
Detecting objects in low-light scenarios presents a persistent challenge, as detectors trained on well-lit data exhibit significant performance degradation on low-light data due to the low visibility. Previous methods mitigate this issue by investigating image enhancement or object detection techniques using low-light image datasets. However, the progress is impeded by the inherent difficulties associated with collecting and annotating low-light images. To address this challenge, we propose to boost low-light object detection with zero-shot day-night domain adaptation, which aims to generalize a detector from well-lit scenarios to low-light ones without requiring real low-light data. We first design a reflectance representation learning module to learn Retinex-based illumination invariance in images with a carefully designed illumination invariance reinforcement strategy. Next, an interchange-redecomposition-coherence procedure is introduced to improve over the vanilla Retinex image decomposition process by performing two sequential image decompositions and introducing a redecomposition cohering loss. Extensive experiments on ExDark, DARK FACE and CODaN datasets show strong low-light generalizability of our method.

摘要
检测low-light场景中的对象存在一个持续的挑战，因为由于低可见度而训练的检测器在low-light数据上会表现出显著的性能下降。先前的方法通过调查图像增强或对象检测技术使用low-light图像集来 Mitigate this issue.然而，这种进步受到了收集和标注low-light图像的自然Difficulties。为了解决这个挑战，我们提议使用零截日夜频谱适应，将检测器从Well-lit场景中generalize到low-light场景，无需实际low-light数据。我们首先设计了反射表示学习模块，通过一种特殊的照明不变性强制策略来学习Retinex基于图像中的照明不变性。然后，我们引入了一种交换重构相互关联过程，以改进vanilla Retinex图像分解过程，并在图像分解和重建过程中添加一个重建相互关联损失。我们在ExDark、DARK FACE和CODaN数据集上进行了广泛的实验，并显示了我们的方法在low-light场景中具有强大的普适性。

RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction

paper_url: http://arxiv.org/abs/2312.01215
repo_url: None
paper_authors: Baptiste Brument, Robin Bruneau, Yvain Quéau, Jean Mélou, François Bernard Lauze, Jean-Denis, Jean-Denis Durou, Lilian Calvet
for: 本研究旨在整合多视图反射和法向图，通过光度射频来获取。
methods: 我们的方法是对每个像素进行 JOINT 重parameterization，将反射和法向图视为光度渲染下的向量。
results: 我们的方法在多视图光度射频（MVPS）测试中，与state-of-the-art方法相比，在 F-score、Chamfer 距离和平均角误度指标中表现出色，特别是在高曲率或低可见性的区域中进行详细3D重建。

Abstract
This paper introduces a versatile paradigm for integrating multi-view reflectance and normal maps acquired through photometric stereo. Our approach employs a pixel-wise joint re-parameterization of reflectance and normal, considering them as a vector of radiances rendered under simulated, varying illumination. This re-parameterization enables the seamless integration of reflectance and normal maps as input data in neural volume rendering-based 3D reconstruction while preserving a single optimization objective. In contrast, recent multi-view photometric stereo (MVPS) methods depend on multiple, potentially conflicting objectives. Despite its apparent simplicity, our proposed approach outperforms state-of-the-art approaches in MVPS benchmarks across F-score, Chamfer distance, and mean angular error metrics. Notably, it significantly improves the detailed 3D reconstruction of areas with high curvature or low visibility.

摘要

A Comparative Analysis Towards Melanoma Classification Using Transfer Learning by Analyzing Dermoscopic Images

paper_url: http://arxiv.org/abs/2312.01212
repo_url: None
paper_authors: Md. Fahim Uddin, Nafisa Tafshir, Mohammad Monirujjaman Khan
for: 这个论文的目的是提出一种基于深度学习技术和已有转移学习方法的皮肤癌诊断系统，以便从输入的德朗托图像中准确地预测皮肤癌。
methods: 这个论文使用了 convolutional neural networks (CNNs) 技术来分类皮肤癌图像为良性和肉瘤图像。研究人员使用了 ‘深度学习’ 技术来训练一个庞大的参数量的深度神经网络，以达到最佳的预测结果。
results: 这个论文的结果表明，使用 transfer learning 技术和 CNNs 技术可以建立一个高度准确的皮肤癌诊断系统。在多个预测模型的比较分析中， denseNet 模型表现最佳，其验证精度为 96.64%，验证损失为 9.43%，并且在测试集上达到了 99.63% 的准确率。

Abstract
Melanoma is a sort of skin cancer that starts in the cells known as melanocytes. It is more dangerous than other types of skin cancer because it can spread to other organs. Melanoma can be fatal if it spreads to other parts of the body. Early detection is the key to cure, but it requires the skills of skilled doctors to diagnose it. This paper presents a system that combines deep learning techniques with established transfer learning methods to enable skin lesions classification and diagnosis of melanoma skin lesions. Using Convolutional Neural Networks, it presents a method for categorizing melanoma images into benign and malignant images in this research (CNNs). Researchers used 'Deep Learning' techniques to train an expansive number of photos & essentially to get the expected result deep neural networks to need to be trained with a huge number of parameters as dermoscopic images are sensitive & very hard to classify. This paper, has been emphasized building models with less complexity and comparatively better accuracy with limited datasets & partially fewer deep networks so that the system can predict Melanoma at ease from input dermoscopic images as correctly as possible within devices with less computational power. The dataset has been obtained from ISIC Archive. Multiple pre-trained models ResNet101, DenseNet, EfficientNet, InceptionV3 have been implemented using transfer learning techniques to complete the comparative analysis & every model achieved good accuracy. Before training the models, the data has been augmented by multiple parameters to improve the accuracy. Moreover, the results are better than the previous state-of-the-art approaches & adequate to predict melanoma. Among these architectures, DenseNet performed better than the others which gives a validation accuracy of 96.64%, validation loss of 9.43% & test set accuracy of 99.63%.

摘要
melanoma 是一种皮肤癌，起源于皮肤细胞 called melanocytes。它比其他类型的皮肤癌更加危险，因为它可以扩散到其他器官。如果 melanoma 扩散到其他部分的body，可能会导致死亡。早期发现是治疗的关键，但需要经验丰富的医生来诊断。这篇论文提出了一种结合深度学习技术与成熔传输学习方法的系统，用于皮肤损伤分类和诊断 melanoma 皮肤损伤。使用卷积神经网络（CNNs），这篇论文提出了一种方法，将 melanoma 图像分类为无害和有害图像。研究人员使用深度学习技术来训练大量照片，并实际地达到了预期的结果。因为 dermoscopic 图像敏感度高，需要训练大量深度神经网络，以获得预期的结果。这篇论文强调建立模型，具有较低的复杂性和较高的准确率，使系统可以从输入 dermoscopic 图像中准确预测 melanoma。数据集来自 ISIC Archive。多个预测模型，包括 ResNet101、DenseNet、EfficientNet 和 InceptionV3，通过传输学习技术来实现比较分析。每个模型都达到了良好的准确率。在训练模型之前，数据已经被多个参数进行增强，以提高准确率。此外，结果较前一个状态的方法更好，并且充分以预测 melanoma。其中，DenseNet 表现最好，得到了验证准确率 96.64%、验证损失 9.43% 和测试集准确率 99.63%。

Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction

paper_url: http://arxiv.org/abs/2312.01196
repo_url: None
paper_authors: Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, Jan Eric Lenssen
for: 解决非常受约束的动态物体重建问题，提供高质量的新视图。
methods: 使用神经网络准 parametric Gaussian（NPG），采用两个阶段approach：首先，使用低级神经变形模型进行准确的匹配，然后使用非RIGID reconstruction。
results: 实现高质量的重建，在不同视角下维持3D一致性，特别在具有少量多视角cue的场景下表现优于之前的works。

Abstract
Reconstructing dynamic objects from monocular videos is a severely underconstrained and challenging problem, and recent work has approached it in various directions. However, owing to the ill-posed nature of this problem, there has been no solution that can provide consistent, high-quality novel views from camera positions that are significantly different from the training views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on this challenge by imposing a two-stage approach: first, we fit a low-rank neural deformation model, which then is used as regularization for non-rigid reconstruction in the second stage. The first stage learns the object's deformations such that it preserves consistency in novel views. The second stage obtains high reconstruction quality by optimizing 3D Gaussians that are driven by the coarse model. To this end, we introduce a local 3D Gaussian representation, where temporally shared Gaussians are anchored in and deformed by local oriented volumes. The resulting combined model can be rendered as radiance fields, resulting in high-quality photo-realistic reconstructions of the non-rigidly deforming objects, maintaining 3D consistency across novel views. We demonstrate that NPGs achieve superior results compared to previous works, especially in challenging scenarios with few multi-view cues.

摘要
<>Translate the given text into Simplified Chinese.<>重构动态物体从单视镜像视频是一个严重地下定制和挑战性的问题，现有的工作都是通过不同的方向来解决。然而，由于这个问题的不定性，没有一个解决方案可以提供高质量的新视图。在这个工作中，我们引入神经Parametric Gaussians（NPGs）来解决这个挑战，我们采用了两个阶段的方法：第一阶段是使用神经变换模型，然后用这个模型作为非RIGID reconstruction的 regularization。第一阶段学习物体的变形，使其在新视图中保持一致性。第二阶段通过使用高精度的3D Gaussians来实现高质量的重建，这些Gaussians是由局部方向体积所驱动的。我们引入了本地3D Gaussian表示，其中共享的Gaussians是 anchored in和被局部方向体积所扭曲。结果是一个组合的模型，可以被渲染为光度场，从而实现高品质的非RIGID重建，保持3D一致性在新视图中。我们示示了NPGs在具有少量多视图cue的情况下比前一代方法更高的成果。

Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning

paper_url: http://arxiv.org/abs/2312.01191
repo_url: https://github.com/yangcong356/bita
paper_authors: Cong Yang, Zuchao Li, Lefei Zhang
for: This paper focuses on the task of remote sensing image captioning, specifically addressing the challenge of aligning image and text features in a fine-grained manner.
methods: The proposed approach, called BITA, utilizes a two-stage vision-language pre-training-based framework, which consists of a lightweight interactive Fourier Transformer and a prefix causal language model. The interactive Fourier Transformer extracts multi-scale features of remote sensing images in the frequency domain, while the prefix causal language model guides the text generation process using visual features.
results: The experimental results on the UCM-caption, RSICD, and NWPU-caption datasets demonstrate that BITA outperforms other advanced comparative approaches, indicating its effectiveness in aligning image-text features and generating accurate captions for remote sensing images.

Abstract
Recently, remote sensing image captioning has gained significant attention in the remote sensing community. Due to the significant differences in spatial resolution of remote sensing images, existing methods in this field have predominantly concentrated on the fine-grained extraction of remote sensing image features, but they cannot effectively handle the semantic consistency between visual features and textual features. To efficiently align the image-text, we propose a novel two-stage vision-language pre-training-based approach to bootstrap interactive image-text alignment for remote sensing image captioning, called BITA, which relies on the design of a lightweight interactive Fourier Transformer to better align remote sensing image-text features. The Fourier layer in the interactive Fourier Transformer is capable of extracting multi-scale features of remote sensing images in the frequency domain, thereby reducing the redundancy of remote sensing visual features. Specifically, the first stage involves preliminary alignment through image-text contrastive learning, which aligns the learned multi-scale remote sensing features from the interactive Fourier Transformer with textual features. In the second stage, the interactive Fourier Transformer connects the frozen image encoder with a large language model. Then, prefix causal language modeling is utilized to guide the text generation process using visual features. Ultimately, across the UCM-caption, RSICD, and NWPU-caption datasets, the experimental results clearly demonstrate that BITA outperforms other advanced comparative approaches. The code is available at https://github.com/yangcong356/BITA.

摘要
近期，远程感知图像captioning已经吸引了远程感知社区的广泛关注。由于远程感知图像的空间分辨率存在显著差异，现有的方法主要集中在细致EXTRACT remote sensing image features，但是它们无法有效地处理图像和文本特征之间的semantic consistency。为了高效地对齐图像和文本，我们提出了一种基于视力语言预训练的新方法，称为BITA，它利用了一种轻量级的互动 Fourier Transformer来更好地对齐远程感知图像和文本特征。Fourier层在互动Fourier Transformer中可以提取远程感知图像的多尺度特征，从而降低远程感知视觉特征的重复性。具体来说，首先是通过图像和文本对比学习来进行先期对齐，将从互动Fourier Transformer学习的多尺度远程感知特征与文本特征进行对齐。然后，互动Fourier Transformer将冻结图像编码器连接到大型自然语言模型。接着，使用预后 causal语言模型进行文本生成，使用视觉特征来导航文本生成过程。最终，在UCM-caption、RSICD和NWPU-caption datasets上，实验结果显示BITA的性能明显超过了其他先进的相关方法。代码可以在https://github.com/yangcong356/BITA上下载。

Efficient Expansion and Gradient Based Task Inference for Replay Free Incremental Learning

paper_url: http://arxiv.org/abs/2312.01188
repo_url: None
paper_authors: Soumya Roy, Vinay K Verma, Deepak Gupta
for: 本研究提出了一种简单 yet高效的扩展基于模型，用于不断学习。现有的特征变换、遮盖和分解基于方法可以高效地增长模型，但是它们只会增长全局或共享参数，因此不能充分利用之前学习的信息，导致知识传递能力有限。
methods: 我们提出了一种简单的筛子和通道扩展基于方法，它会增长模型在前一个任务的参数上，而不仅仅是全局参数。因此，它可以完全利用之前学习的信息，而不会忘记。此外，我们还提出了一种可靠的任务预测方法，使用ENTropy weighted数据增强和模型的梯度使用pseudo标签。
results: 我们的模型在不同的dataset和架构上进行了extensive的测试，并在不断学习（TIL）、类增长学习（CIL）和生成 continual learning setting中达到了state-of-the-art的结果。我们的广泛的ablation研究表明了提出的组件的有效性。

Abstract
This paper proposes a simple but highly efficient expansion-based model for continual learning. The recent feature transformation, masking and factorization-based methods are efficient, but they grow the model only over the global or shared parameter. Therefore, these approaches do not fully utilize the previously learned information because the same task-specific parameter forgets the earlier knowledge. Thus, these approaches show limited transfer learning ability. Moreover, most of these models have constant parameter growth for all tasks, irrespective of the task complexity. Our work proposes a simple filter and channel expansion based method that grows the model over the previous task parameters and not just over the global parameter. Therefore, it fully utilizes all the previously learned information without forgetting, which results in better knowledge transfer. The growth rate in our proposed model is a function of task complexity; therefore for a simple task, the model has a smaller parameter growth while for complex tasks, the model requires more parameters to adapt to the current task. Recent expansion based models show promising results for task incremental learning (TIL). However, for class incremental learning (CIL), prediction of task id is a crucial challenge; hence, their results degrade rapidly as the number of tasks increase. In this work, we propose a robust task prediction method that leverages entropy weighted data augmentations and the models gradient using pseudo labels. We evaluate our model on various datasets and architectures in the TIL, CIL and generative continual learning settings. The proposed approach shows state-of-the-art results in all these settings. Our extensive ablation studies show the efficacy of the proposed components.

摘要
In Simplified Chinese:这篇论文提出了一种简单 yet 高效的扩展基于模型 для持续学习。先前的特征变换、masking 和因子化基于方法都是高效的，但它们只会将模型发展到全局或共享参数上，因此它们不能完全利用先前学习的所有信息，而且会忘记之前的知识。因此，这些方法在转移学习能力方面表现有限。此外，大多数这些模型具有一个常数参数增长率，不关系任务复杂度。我们的工作提出了一种简单的筛子和通道扩展基于方法，它会在先前任务参数上增长模型，而不是只是在全局参数上增长。因此，它可以完全利用所有先前学习的信息，而不会忘记。我们的模型在不同的dataset和架构上进行了广泛的实验，并在TIL、CIL和生成持续学习设定中达到了国际先进水平。我们的广泛的减少研究也证明了我们提出的组件的有效性。

SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer

paper_url: http://arxiv.org/abs/2312.01187
repo_url: None
paper_authors: Renan A. Rojas-Gomez, Karan Singhal, Ali Etemad, Alex Bijamov, Warren R. Morningstar, Philip Andrew Mansfield
For: 提高自动学习抽象的表示能力，使用自然图像结构来增强下游性能。* Methods: 基于神经网络风格传输的新型增强技术，分离图像的 semantics 和 style 属性，并且只对样本的 style 进行增强，保持原始内容不变。* Results: 在 ImageNet 上实现了更高的 top-1 分类性能，与 MoCo v2 的表现提高了 más de 2%；在五种多样化的数据集上进行了大幅的 Transfer learning 表现提高，最高提高达 3.75%。

Abstract
Self-supervised learning relies heavily on data augmentation to extract meaningful representations from unlabeled images. While existing state-of-the-art augmentation pipelines incorporate a wide range of primitive transformations, these often disregard natural image structure. Thus, augmented samples can exhibit degraded semantic information and low stylistic diversity, affecting downstream performance of self-supervised representations. To overcome this, we propose SASSL: Style Augmentations for Self Supervised Learning, a novel augmentation technique based on Neural Style Transfer. The method decouples semantic and stylistic attributes in images and applies transformations exclusively to the style while preserving content, generating diverse augmented samples that better retain their semantic properties. Experimental results show our technique achieves a top-1 classification performance improvement of more than 2% on ImageNet compared to the well-established MoCo v2. We also measure transfer learning performance across five diverse datasets, observing significant improvements of up to 3.75%. Our experiments indicate that decoupling style from content information and transferring style across datasets to diversify augmentations can significantly improve downstream performance of self-supervised representations.

摘要
自我指导学习依赖数据增强以EXTRACT meaningful representation from unlabeled images. 现有的状态天算法 pipelines 通常忽略自然图像结构，导致增强样本具有强度降低的semantic information和低的艺术性多样性，这会影响下游自我指导表示的性能。为了解决这个问题，我们提出了 SASSL：Style Augmentations for Self Supervised Learning，这是一种基于神经网络Style Transfer的新的增强技术。该方法将图像中的semantic和艺术属性解 couple，并且 exclusiveapply 样式变换，保持内容不变，从而生成多样化的增强样本，这些样本 better retain their semantic properties。我们的实验结果表明，我们的方法可以在 ImageNet 上 achieved top-1 分类性能提升超过 2% compared to well-established MoCo v2。我们还测试了 across five diverse datasets， Observation significant improvements of up to 3.75%. 我们的实验表明，解 couple style and content information,并将style transfer across datasets to diversify augmentations可以 Significantly improve downstream performance of self-supervised representations.

IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical Character Recognition

paper_url: http://arxiv.org/abs/2312.01177
repo_url: None
paper_authors: Fatemeh Asadi-zeydabadi, Ali Afkari-Fahandari, Amin Faraji, Elham Shabaninia, Hossein Nezamabadi-pour
for: 本研究旨在提供一个大规模的数据集，以便进行波斯语印刷文本识别。
methods: 本研究使用了CRNN和Vision Transformertwo种架构来评估数据集的有效性。
results: 研究获得了78.49%的基准率和97.72%的normalized edit distance，而Vision Transformer架构则获得了81.32%的基准率和98.74%的normalized edit distance。

Abstract
Optical Character Recognition is a technique that converts document images into searchable and editable text, making it a valuable tool for processing scanned documents. While the Farsi language stands as a prominent and official language in Asia, efforts to develop efficient methods for recognizing Farsi printed text have been relatively limited. This is primarily attributed to the languages distinctive features, such as cursive form, the resemblance between certain alphabet characters, and the presence of numerous diacritics and dot placement. On the other hand, given the substantial training sample requirements of deep-based architectures for effective performance, the development of such datasets holds paramount significance. In light of these concerns, this paper aims to present a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition. The dataset comprises 2003541 images featuring a wide variety of fonts, styles, and sizes. This dataset is an extension of the previously introduced IDPL-PFOD dataset, offering a substantial increase in both volume and diversity. Furthermore, the datasets effectiveness is assessed through the utilization of both CRNN-based and Vision Transformer architectures. The CRNN-based model achieves a baseline accuracy rate of 78.49% and a normalized edit distance of 97.72%, while the Vision Transformer architecture attains an accuracy of 81.32% and a normalized edit distance of 98.74%.

摘要
仪器字符识别是一种技术，将文档图像转换为搜索和编辑文本，从而使其成为读取扫描文档的有价值工具。然而，在亚洲，波斯语作为官方语言，对波斯印刷文本的可读性和编辑性的研究相对落后。这主要归结于波斯语的特殊特征，如 cursive 字形、字母字符之间的相似性、和多个标点和括号的存在。相反，由于深度基础架构需要大量的训练样本来实现有效性， therefore， the development of such datasets is of great significance. 在这些问题的背景下，本文报道了一个新的大规模数据集，IDPL-PFOD2，适用于波斯印刷文本识别。该数据集包含2003541个图像，展示了各种字体、风格和大小。这是 IDPL-PFOD 数据集的扩展，提供了更大的体量和多样性。此外，数据集的效果通过使用 CRNN 和 Vision Transformer 架构进行评估。CRNN 模型实现了基准精度率 78.49% 和normalized edit distance 97.72%，而 Vision Transformer 架构实现了精度 81.32% 和 normalized edit distance 98.74%。

Virtual Category Learning: A Semi-Supervised Learning Method for Dense Prediction with Extremely Limited Labels

paper_url: http://arxiv.org/abs/2312.01169
repo_url: https://github.com/geoffreychen777/vc
paper_authors: Changrui Chen, Jungong Han, Kurt Debattista
for: 提高 dense vision 任务中的模型Optimization，尤其是只有很少标签available。
methods: 使用 pseudo labeling 技术，即虚拟类（VC）分配给困惑样本，以便无需准确标签仍然能够贡献到模型优化。
results: 在 semantic segmentation 和 object detection 两个主流 dense prediction 任务中，提出的 VC learning 方法显示出比州前方法更高的性能，特别是只有很少标签available 时。

Abstract
Due to the costliness of labelled data in real-world applications, semi-supervised learning, underpinned by pseudo labelling, is an appealing solution. However, handling confusing samples is nontrivial: discarding valuable confusing samples would compromise the model generalisation while using them for training would exacerbate the issue of confirmation bias caused by the resulting inevitable mislabelling. To solve this problem, this paper proposes to use confusing samples proactively without label correction. Specifically, a Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation even without a concrete label. This provides an upper bound for inter-class information sharing capacity, which eventually leads to a better embedding space. Extensive experiments on two mainstream dense prediction tasks -- semantic segmentation and object detection, demonstrate that the proposed VC learning significantly surpasses the state-of-the-art, especially when only very few labels are available. Our intriguing findings highlight the usage of VC learning in dense vision tasks.

摘要
Translation notes:* "costliness" is translated as "高额的" (gāo yǐng de), which means "high cost" or "expensive" in Chinese.* "semi-supervised learning" is translated as "半指导学习" (bàn zhǐ dào xué xí), which means "semi-supervised learning" in Chinese.* "pseudo labeling" is translated as "假标注" ( Jiǎ bāo zhù), which means "pseudo labeling" in Chinese.* "confusing samples" is translated as "困惑样本" (jīn huò yàng bǎn), which means "confusing samples" in Chinese.* "inter-class information sharing capacity" is translated as "类间信息共享能力" (lèi jiān xìn xiāng gòng jì), which means "inter-class information sharing capacity" in Chinese.* "embedding space" is translated as "嵌入空间" (fù rù kōng jì), which means "embedding space" in Chinese.* "dense prediction tasks" is translated as "密集预测任务" (mì jí qián jian yì), which means "dense prediction tasks" in Chinese.* "Virtual Category" is translated as "虚拟类别" (xū yì lèi bèi), which means "Virtual Category" in Chinese.

Meta-Learned Attribute Self-Interaction Network for Continual and Generalized Zero-Shot Learning

paper_url: http://arxiv.org/abs/2312.01167
repo_url: None
paper_authors: Vinay K Verma, Nikhil Mehta, Kevin J Liang, Aakansha Mishra, Lawrence Carin
for: 本研究旨在提出一种基于自交互学习的 continual zero-shot learning（ZSL）方法，以便在不使用未测试类特征的情况下，模型能够通过自身特征来推理未测试类。methods: 本研究使用了 attribute self-interaction 学习和 inverse regularization 来实现 continual ZSL。具体来说，我们使用了 meta-learning 来训练 attribute self-interaction，并对 attribute encoder 进行 inverse regularization。results: 我们在五个标准 ZSL 数据集（CUB、aPY、AWA1、AWA2 和 SUN）上进行了实验，并证明了我们的方法可以在 generalized zero-shot learning 和 continual（固定/动态） zero-shot learning Settings 中占据领先地位，而且可以在 Training 时间上 saves >100x 的时间。

Abstract
Zero-shot learning (ZSL) is a promising approach to generalizing a model to categories unseen during training by leveraging class attributes, but challenges remain. Recently, methods using generative models to combat bias towards classes seen during training have pushed state of the art, but these generative models can be slow or computationally expensive to train. Also, these generative models assume that the attribute vector of each unseen class is available a priori at training, which is not always practical. Additionally, while many previous ZSL methods assume a one-time adaptation to unseen classes, in reality, the world is always changing, necessitating a constant adjustment of deployed models. Models unprepared to handle a sequential stream of data are likely to experience catastrophic forgetting. We propose a Meta-learned Attribute self-Interaction Network (MAIN) for continual ZSL. By pairing attribute self-interaction trained using meta-learning with inverse regularization of the attribute encoder, we are able to outperform state-of-the-art results without leveraging the unseen class attributes while also being able to train our models substantially faster (>100x) than expensive generative-based approaches. We demonstrate this with experiments on five standard ZSL datasets (CUB, aPY, AWA1, AWA2, and SUN) in the generalized zero-shot learning and continual (fixed/dynamic) zero-shot learning settings. Extensive ablations and analyses demonstrate the efficacy of various components proposed.

摘要
zero-shot learning（ZSL）是一种有前途的方法，通过利用类属性来泛化模型，但是还存在一些挑战。最近，使用生成模型来减少训练时间中的偏袋偏好的方法得到了状态之巅，但是这些生成模型可能需要较长的训练时间。此外，这些生成模型假设在训练时都可以得到每个未看到的类的特征向量，这并不是实际情况。此外，许多前一代ZSL方法假设在训练时只需要一次适应未看到的类，但在实际情况下，世界是在不断变化的，因此需要不断地调整已经部署的模型。如果模型无法处理流动数据流，它们可能会经历悬崖效应。我们提出了一种基于元学习的自交互特征网络（MAIN），用于持续ZSL。通过将特征自交互训练使用元学习，并对特征编码器进行逆向正则化，我们能够超越状态之巅的结果，而不需要利用未看到的类的特征。我们通过在五个标准ZSL数据集（CUB、aPY、AWA1、AWA2和SUN）进行实验，证明了我们的方法的可行性。广泛的ablation和分析也证明了我们提出的各个组件的有效性。

A New Learning Paradigm for Foundation Model-based Remote Sensing Change Detection

paper_url: http://arxiv.org/abs/2312.01163
repo_url: https://github.com/likyoo/ban
paper_authors: Kaiyu Li, Xiangyong Cao, Deyu Meng
for:* The paper is written to propose a new framework for change detection (CD) using universal foundation models.methods:* The proposed framework is called Bi-Temporal Adapter Network (BAN), which combines a frozen foundation model (e.g., CLIP) with a bitemporal adapter branch (Bi-TAB) and bridging modules.* The Bi-TAB can be either an existing arbitrary CD model or some hand-crafted stacked blocks.results:* The proposed BAN framework can improve the performance of existing CD methods with only a few additional learnable parameters.* The successful practices show the potential of foundation models for remote sensing CD.

Abstract
Change detection (CD) is a critical task to observe and analyze dynamic processes of land cover. Although numerous deep learning-based CD models have performed excellently, their further performance improvements are constrained by the limited knowledge extracted from the given labelled data. On the other hand, the foundation models that emerged recently contain a huge amount of knowledge by scaling up across data modalities and proxy tasks. In this paper, we propose a Bi-Temporal Adapter Network (BAN), which is a universal foundation model-based CD adaptation framework aiming to extract the knowledge of foundation models for CD. The proposed BAN contains three parts, i.e. frozen foundation model (e.g., CLIP), bitemporal adapter branch (Bi-TAB), and bridging modules between them. Specifically, the Bi-TAB can be either an existing arbitrary CD model or some hand-crafted stacked blocks. The bridging modules are designed to align the general features with the task/domain-specific features and inject the selected general knowledge into the Bi-TAB. To our knowledge, this is the first universal framework to adapt the foundation model to the CD task. Extensive experiments show the effectiveness of our BAN in improving the performance of existing CD methods (e.g., up to 4.08\% IoU improvement) with only a few additional learnable parameters. More importantly, these successful practices show us the potential of foundation models for remote sensing CD. The code is available at \url{https://github.com/likyoo/BAN} and will be supported in our Open-CD \url{https://github.com/likyoo/open-cd}.

摘要
优化变换检测（CD）是观察和分析地表覆盖过程的关键任务。虽然许多深度学习基于的CD模型表现出色，但它们的进一步性能提高受到限制于现有标注数据中的知识有限。然而，最近出现的基础模型具有大量的知识，可以通过数据模式和代理任务扩大。在这篇论文中，我们提出了一个Bi-Temporal Adapter Network（BAN），它是一个基于基础模型的CD适应框架，旨在提取基础模型中的知识。BAN包括三部分：固定的基础模型（例如CLIP）、双时间适应分支（Bi-TAB）和将这两者连接的桥接模块。具体来说，Bi-TAB可以是现有的任意CD模型或一些手工设计的堆叠块。桥接模块是用于将通用特征与任务/领域特定特征进行对应，并将选择的通用知识注入到Bi-TAB中。我们知道，这是首个基于基础模型的CD适应框架。我们进行了广泛的实验，并证明了BAN可以提高现有CD方法的性能（例如准确率提高4.08%），只需要增加一些可学习的参数。更重要的是，这些成功的实践表明了基础模型的潜在力量在CD领域。代码可以在GitHub上获取，地址为\url{https://github.com/likyoo/BAN}，并将在我们的Open-CD项目\url{https://github.com/likyoo/open-cd}中支持。

Ultra-Resolution Cascaded Diffusion Model for Gigapixel Image Synthesis in Histopathology

paper_url: http://arxiv.org/abs/2312.01152
repo_url: https://github.com/lucidrains/imagen-pytorch
paper_authors: Sarah Cechnicka, Hadrien Reynaud, James Ball, Naomi Simmonds, Catherine Horsfield, Andrew Smith, Candice Roufosse, Bernhard Kainz
for: 本研究旨在提高整个报告图像的诊断精度，特别是在低分辨率下。
methods: 本研究使用ultra-resolution cascaded diffusion models (URCDMs) Synthesize high-resolution images that are realistic at all magnification levels, with a focus on long-distance spatial coherence.
results: 对比exist的方法，本研究提高了pFID-50k分数的110.63到39.52。此外，人工专家评估研究显示， Lower Resolution Diffusion Models的Weighted Mean Absolute Error (MAE)为0.11，URCDM的Weighted MAE为0.22。

Abstract
Diagnoses from histopathology images rely on information from both high and low resolutions of Whole Slide Images. Ultra-Resolution Cascaded Diffusion Models (URCDMs) allow for the synthesis of high-resolution images that are realistic at all magnification levels, focusing not only on fidelity but also on long-distance spatial coherency. Our model beats existing methods, improving the pFID-50k [2] score by 110.63 to 39.52 pFID-50k. Additionally, a human expert evaluation study was performed, reaching a weighted Mean Absolute Error (MAE) of 0.11 for the Lower Resolution Diffusion Models and a weighted MAE of 0.22 for the URCDM.

摘要
诊断从历史 PATH 图像取决于高和低分辨率整体扫描图像信息。我们的极高分辨率协同扩散模型（URCDM）可以生成高分辨率图像，具有所有倍率水平的实际感。不仅保持准确性，还保持长距离空间一致性。与现有方法相比，我们的模型提高了 pFID-50k 分数的110.63到39.52 pFID-50k。此外，我们进行了人工专家评估， weighted Mean Absolute Error（MAE）为0.11 для Lower Resolution Diffusion Models，weighted MAE 为0.22 для URCDM。

Has Anything Changed? 3D Change Detection by 2D Segmentation Masks

paper_url: http://arxiv.org/abs/2312.01148
repo_url: None
paper_authors: Aikaterini Adam, Konstantinos Karantzalos, Lazaros Grammatikopoulos, Torsten Sattler
for: 该论文为了掌握indoor场景中 объек的变化和添加情况，提出了一种无监督的物体发现方法。
methods: 该方法结合了3D变化检测和2D分割任务，使用通用的2D分割mask进行3D变化检测的补充。
results: 实验表明，该方法在3Rscan数据集上表现出优于比较基eline的基eline，实现了SoTA的结果。

Abstract
As capturing devices become common, 3D scans of interior spaces are acquired on a daily basis. Through scene comparison over time, information about objects in the scene and their changes is inferred. This information is important for robots and AR and VR devices, in order to operate in an immersive virtual experience. We thus propose an unsupervised object discovery method that identifies added, moved, or removed objects without any prior knowledge of what objects exist in the scene. We model this problem as a combination of a 3D change detection and a 2D segmentation task. Our algorithm leverages generic 2D segmentation masks to refine an initial but incomplete set of 3D change detections. The initial changes, acquired through render-and-compare likely correspond to movable objects. The incomplete detections are refined through graph optimization, distilling the information of the 2D segmentation masks in the 3D space. Experiments on the 3Rscan dataset prove that our method outperforms competitive baselines, with SoTA results.

摘要
“随着捕捉设备的普及，日常获得的3D扫描内部空间的数据越来越多。通过时间比较场景，从数据中推导出内部物品的变化信息。这些信息对于机器人和AR/VR设备非常重要，以建立实际的虚拟经验。我们因此提出一种不需要先知的物品探测方法，可以找到增加、移动或 removing 的物品。我们将这个问题视为3D变化探测和2D分割任务的组合。我们的算法利用通用的2D分割面给3D变化探测进行精确化。初始变化，通过render和比较获得的变化，很可能是可移动的物品。 incomplete 的探测通过 графа优化，将2D分割面的信息在3D空间中传递。实验证明我们的方法在3Rscan数据集上比竞争性基准的表现更好，得到了顶尖的成果。”

Exploiting Diffusion Priors for All-in-One Image Restoration

paper_url: http://arxiv.org/abs/2312.02197
repo_url: None
paper_authors: Yuanbiao Gou, Haiyu Zhao, Boyun Li, Xinyan Xiao, Xi Peng
for: 解决多种图像恢复任务，包括锐化、噪声、抖抖、震荡等，在一个模型中进行全面的图像恢复。
methods: 利用预训练的扩散模型中的图像约束，通过两个挑战：一是模拟干扰过程，二是导引扩散模型生成相应的清洁图像。
results: 提出了一个零shot框架ZeroAIR，通过在反推 sampling中 alternatively进行测试时间降低模型（TDM）和三 stage扩散指导（TDG），实现了一stop图像恢复。经过广泛的实验，我们证明了ZeroAIR可以与任务特定方法相比或者超越它们的性能。代码将在Github上公开。

Abstract
All-in-one aims to solve various tasks of image restoration in a single model. To this end, we present a feasible way of exploiting the image priors captured by the pretrained diffusion model, through addressing the two challenges, i.e., degradation modeling and diffusion guidance. The former aims to simulate the process of the clean image degenerated by certain degradations, and the latter aims at guiding the diffusion model to generate the corresponding clean image. With the motivations, we propose a zero-shot framework for all-in-one image restoration, termed ZeroAIR, which alternatively performs the test-time degradation modeling (TDM) and the three-stage diffusion guidance (TDG) at each timestep of the reverse sampling. To be specific, TDM exploits the diffusion priors to learn a degradation model from a given degraded image, and TDG divides the timesteps into three stages for taking full advantage of the varying diffusion priors. Thanks to their degradation-agnostic property, the all-in-one image restoration could be achieved in a zero-shot way by ZeroAIR. Through extensive experiments, we show that our ZeroAIR achieves comparable even better performance than those task-specific methods. The code will be available on Github.

摘要
全功能尝试解决图像恢复中的多个任务在单一模型中。为此，我们提出了利用预训练的扩散模型所捕捉的图像约束的可行方法，通过解决两个挑战：即退化模型和扩散指导。前者通过模拟受到某种退化后的清晰图像的过程来模拟清晰图像的退化，而后者则是通过指导扩散模型生成相应的清晰图像。基于这些动机，我们提出了一个零shot框架，称为ZeroAIR，它在反排 sampling 中交替执行测试时退化模型化（TDM）和三个阶段扩散指导（TDG）。具体来说，TDM 利用扩散约束来从给定的退化图像中学习退化模型，而 TDG 将时间步分为三个阶段，以便在不同的扩散约束下完全利用扩散约束。由于它们具有退化不可见的性质，ZeroAIR 可以在零shot 的情况下实现全功能的图像恢复。经过广泛的实验，我们表明 ZeroAIR 可以与专门针对任务的方法相比或者更好的性能。代码将在 GitHub 上发布。

Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors

paper_url: http://arxiv.org/abs/2312.02196
repo_url: None
paper_authors: Yu Zhang, Songpengcheng Xia, Lei Chu, Jiarui Yang, Qi Wu, Ling Pei
for: 这篇论文是为了提出一种基于稀缺启动传感器的人体姿势估计方法，以解决前一些方法所采用的人工数据的缺陷。
methods: 这种方法使用了多种真实的启动传感器数据，包括不同的骨架格式，以提高运动多样性和模型泛化。它包括两个创新的组件：一个 Pseudo-velocity regression 模型，用于在启动传感器上进行动态运动捕捉，以及一个分为三个区域的部分模型，每个区域都专注于它们独特的特征。
results: 这种方法在五个公共数据集上比前一些状态静态模型表现出色，尤其是在 DIP-IMU 数据集上，减少了姿势误差 by 19%，这表明了该方法在启动传感器上的人体姿势估计方法具有显著的改进。我们将在未来公开该模型的实现。

Abstract
This paper introduces a novel human pose estimation approach using sparse inertial sensors, addressing the shortcomings of previous methods reliant on synthetic data. It leverages a diverse array of real inertial motion capture data from different skeleton formats to improve motion diversity and model generalization. This method features two innovative components: a pseudo-velocity regression model for dynamic motion capture with inertial sensors, and a part-based model dividing the body and sensor data into three regions, each focusing on their unique characteristics. The approach demonstrates superior performance over state-of-the-art models across five public datasets, notably reducing pose error by 19\% on the DIP-IMU dataset, thus representing a significant improvement in inertial sensor-based human pose estimation. We will make the implementation of our model available for public use.

摘要
这篇论文介绍了一种新的人姿估算方法，使用稀疏的吸引力传感器，解决了过去的方法对于真实数据的依赖。它利用了不同的真实吸引力运动捕捉数据，以提高运动多样性和模型泛化。这种方法包括两个创新组件：一个假速度回归模型，用于在吸引力传感器上进行动态运动捕捉，以及一个分割body和传感器数据的部分模型，每个部分专注于其特有特征。该方法在五个公共数据集上达到了比较好的性能，比特定状态的模型减少人姿错误率19％，表示一种重要的吸引力传感器基于人姿估算的改进。我们将在公共使用的实现方法。

ControlDreamer: Stylized 3D Generation with Multi-View ControlNet

paper_url: http://arxiv.org/abs/2312.01129
repo_url: https://github.com/oyt9306/ControlDreamer
paper_authors: Yeongtak Oh, Jooyoung Choi, Yongsung Kim, Minjun Park, Chaehun Shin, Sungroh Yoon
for: 本研究旨在 Addressing the limitations of current text-to-3D generation methods 提高 3D 模型的创作 geometry 和 style 的自动化和民生化。
methods: 我们引入了一种新的 multi-view ControlNet 模型，该模型是基于生成的 dataset 从一个仔细挑选的 100K 文本 corpus 进行训练的 depth-aware multi-view diffusion model。我们还将这种模型integrated into our two-stage pipeline ControlDreamer，以实现文本引导的 3D 模型生成。
results: 我们的比较分析表明，这种新的 pipeline 在质量上表现出色，与现有的 text-to-3D 方法相比，通过qualitative comparisons 和 CLIP 分数指标进行评估。

Abstract
Recent advancements in text-to-3D generation have significantly contributed to the automation and democratization of 3D content creation. Building upon these developments, we aim to address the limitations of current methods in generating 3D models with creative geometry and styles. We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on generated datasets from a carefully curated 100K text corpus. Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models. Additionally, we present a comprehensive benchmark for 3D style editing, encompassing a broad range of subjects, including objects, animals, and characters, to further facilitate diverse 3D generation. Our comparative analysis reveals that this new pipeline outperforms existing text-to-3D methods as evidenced by qualitative comparisons and CLIP score metrics.

摘要

SPEEDNet: Salient Pyramidal Enhancement Encoder-Decoder Network for Colonoscopy Images

paper_url: http://arxiv.org/abs/2312.01128
repo_url: None
paper_authors: Tushir Sahu, Vidhi Bhatt, Sai Chandra Teja R, Sparsh Mittal, Nagesh Kumar S
for: 这篇论文目的是精确地分类医疗影像中的区域，例如肿瘤或损伤。
methods: 这篇论文提出了一个名为SPEEDNet的新架构，用于精确地分类医疗影像中的肿瘤。SPEEDNet使用了一个名为DIPC（扩展凝合层对抗）的新封页，将特征图从多个层次转换为一个紧密的空间，以提高学习表现。
results: 在EBHISeg数据集上，SPEEDNet比三个前一代网络（UNet、FeedNet和AttesResDUNet）表现更好，具体的 dice分数为0.952，回应率为0.971。

Abstract
Accurate identification and precise delineation of regions of significance, such as tumors or lesions, is a pivotal goal in medical imaging analysis. This paper proposes SPEEDNet, a novel architecture for precisely segmenting lesions within colonoscopy images. SPEEDNet uses a novel block named Dilated-Involutional Pyramidal Convolution Fusion (DIPC). A DIPC block combines the dilated involution layers pairwise into a pyramidal structure to convert the feature maps into a compact space. This lowers the total number of parameters while improving the learning of representations across an optimal receptive field, thereby reducing the blurring effect. On the EBHISeg dataset, SPEEDNet outperforms three previous networks: UNet, FeedNet, and AttesResDUNet. Specifically, SPEEDNet attains an average dice score of 0.952 and a recall of 0.971. Qualitative results and ablation studies provide additional insights into the effectiveness of SPEEDNet. The model size of SPEEDNet is 9.81 MB, significantly smaller than that of UNet (22.84 MB), FeedNet(185.58 MB), and AttesResDUNet (140.09 MB).

摘要
医学图像分析中精准识别和定义区域的目标是非常重要的。这篇论文提出了SPEEDNet，一种新的架构，用于精准地分类colonoscopy图像中的肿瘤。SPEEDNet使用一种新的块named Dilated-Involutional Pyramidal Convolution Fusion（DIPC）。一个DIPC块将叠加pyramidal结构中的扩展减少层pairwise，将特征图转换为一个紧凑的空间。这有助于降低总参数数量，同时改善表示学习的范围，从而降低模糊效应。在EBHISeg数据集上，SPEEDNet比三个前一个网络：UNet、FeedNet和AttesResDUNet表现更好。特别是，SPEEDNet实现了平均的 dice 分数为0.952，和回归率为0.971。Qualitative results和ablation studies提供了更多关于SPEEDNet的效果的视角。SPEEDNet的模型大小为9.81 MB，远小于UNet（22.84 MB）、FeedNet（185.58 MB）和AttesResDUNet（140.09 MB）。

Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning

paper_url: http://arxiv.org/abs/2312.01118
repo_url: None
paper_authors: Jiantao Wu, Shentong Mo, Sara Atito, Josef Kittler, Zhenhua Feng, Muhammad Awais
for: 本研究旨在提供一个大规模的benchmark，用于评估自动学习的度量学习模型。
methods: 本研究使用了ImageNet-21K和WordNet建立了一个具有多样性和细化的类别分布的benchmark，并提出了一些新的评估指标，包括overlap和aSTD。
results: 研究发现了supervised学习和self-supervised学习模型中存在类别偏见和缺乏泛化能力的问题，并提供了未来模型改进的可能性。

Abstract
Recently, self-supervised metric learning has raised attention for the potential to learn a generic distance function. It overcomes the limitations of conventional supervised one, e.g., scalability and label biases. Despite progress in this domain, current benchmarks, incorporating a narrow scope of classes, stop the nuanced evaluation of semantic representations. To bridge this gap, we introduce a large-scale benchmark with diversity and granularity of classes, Statistical Metric Learning Benchmark (SMLB) built upon ImageNet-21K and WordNet. SMLB is designed to rigorously evaluate the discriminative discernment and generalizability across more than 14M images, 20K classes, and 16K taxonomic nodes. Alongside, we propose novel evaluation metrics -- `overlap' for separability and `aSTD' for consistency -- to measure distance statistical information, which are efficient and robust to the change of class number. Our benchmark offers a novel perspective of evaluating the quality of representations beyond accuracy. Our findings reveal the limitations of supervised learning and the class bias inherent in SSL models, offering insights into potential areas for future model enhancement.

摘要

Paved2Paradise: Cost-Effective and Scalable LiDAR Simulation by Factoring the Real World

paper_url: http://arxiv.org/abs/2312.01117
repo_url: https://github.com/airalcorn2/paved2paradise
paper_authors: Michael A. Alcorn, Noah Schwartz
for: 这个论文的目的是提出一种可靠、成本低的方法来生成激活点云数据集，以便加速点云模型的开发。
methods: 该方法基于分离实际世界的思想，通过收集独立的背景和对象数据集，智能组合它们来生成具有多样性和真实性的训练集。该方法包括四个步骤：收集庞大背景数据、记录感兴趣的对象类型的人员在隔离环境中表现出不同的行为、自动生成对象数据集的标签、将对象放置在背景中的随机位置来生成样本。
results: 该方法可以生成高质量的人体检测数据集，包括在树枝干扰下的人员检测。量化结果表明，使用Paved2Paradise数据集训练的模型在实际数据集上表现相当于使用原始数据集进行训练。这些结果表明，Paved2Paradise可以帮助加速点云模型的开发，特别是在购买点云数据集的成本和时间上存在限制的情况下。

Abstract
To achieve strong real world performance, neural networks must be trained on large, diverse datasets; however, obtaining and annotating such datasets is costly and time-consuming, particularly for 3D point clouds. In this paper, we describe Paved2Paradise, a simple, cost-effective approach for generating fully labeled, diverse, and realistic lidar datasets from scratch, all while requiring minimal human annotation. Our key insight is that, by deliberately collecting separate "background" and "object" datasets (i.e., "factoring the real world"), we can intelligently combine them to produce a combinatorially large and diverse training set. The Paved2Paradise pipeline thus consists of four steps: (1) collecting copious background data, (2) recording individuals from the desired object class(es) performing different behaviors in an isolated environment (like a parking lot), (3) bootstrapping labels for the object dataset, and (4) generating samples by placing objects at arbitrary locations in backgrounds. To demonstrate the utility of Paved2Paradise, we generated synthetic datasets for two tasks: (1) human detection in orchards (a task for which no public data exists) and (2) pedestrian detection in urban environments. Qualitatively, we find that a model trained exclusively on Paved2Paradise synthetic data is highly effective at detecting humans in orchards, including when individuals are heavily occluded by tree branches. Quantitatively, a model trained on Paved2Paradise data that sources backgrounds from KITTI performs comparably to a model trained on the actual dataset. These results suggest the Paved2Paradise synthetic data pipeline can help accelerate point cloud model development in sectors where acquiring lidar datasets has previously been cost-prohibitive.

摘要
要实现强大的实际性能，神经网络需要在大量、多样的数据上训练；然而，获取和标注这些数据可以是成本和时间consuming的，特别是 для3D点云。在这篇论文中，我们描述了Paved2Paradise，一种简单、成本效果的方法，可以从scratch生成完全标注的、多样化和真实的激光数据集，而无需大量的人工标注。我们的关键思想是，通过意图地收集“背景”和“对象”数据集（即“分解实际世界”），我们可以智能地将它们组合起来生成一个 combinatorially 大量和多样化的训练集。Paved2Paradise管道因此包括四个步骤：（1）收集充沛的背景数据，（2）记录目标对象类型的个体在隔离环境（如停车场）中进行不同的行为，（3）为对象数据集提供杂志标签，（4）通过将对象放置在背景中的任意位置来生成样本。为了证明Paved2Paradise的实用性，我们生成了两个任务的 sintetic 数据集：（1）人员检测在园林（无公共数据存在）和（2）人员检测在城市环境中。我们发现，一个专门使用Paved2Paradise sintetic数据集训练的模型可以在园林中快速和高效地检测人员，包括在树枝叶下部分遮挡的情况下。量化来说，使用Paved2Paradise数据集源自KITTI的背景时，模型与实际数据集相比表现相似。这些结果表明Paved2Paradise sintetic数据集管道可以帮助加速点云模型在拥有高成本的激光数据集的领域进行更快的发展。

Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning

paper_url: http://arxiv.org/abs/2312.02194
repo_url: https://github.com/utkutpcgl/vitfreeze
paper_authors: Utku Mert Topcuoglu, Erdem Akagündüz
for: 这个论文旨在提出一种自然语言处理方法，用于减少 ViT 模型的训练时间，同时保持或提高模型的学习能力。
methods: 该方法基于local masked image modeling和进行逐步层冻结，以提高初始层训练的效率和速度。具体来说，通过系统地冻结特定层在训练过程中的某些点，可以降低计算负担，同时保持或提高模型的学习能力。
results: 该方法可以在训练 ViT 模型时大幅减少训练时间（约12.5%），同时减少模型精度（top-1准确率下降0.6%）。实验结果显示，该方法可以在计算资源和时间限制下实现高效的自然语言处理。

Abstract
In this paper, we present an innovative approach to self-supervised learning for Vision Transformers (ViTs), integrating local masked image modeling with progressive layer freezing. This method focuses on enhancing the efficiency and speed of initial layer training in ViTs. By systematically freezing specific layers at strategic points during training, we reduce computational demands while maintaining or improving learning capabilities. Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers and enhances semantic comprehension across scales. The results demonstrate a substantial reduction in training time (~12.5\%) with a minimal impact on model accuracy (decrease in top-1 accuracy by 0.6\%). Our method achieves top-1 and top-5 accuracies of 82.6\% and 96.2\%, respectively, underscoring its potential in scenarios where computational resources and time are critical. This work marks an advancement in the field of self-supervised learning for computer vision. The implementation of our approach is available at our project's GitHub repository: github.com/utkutpcgl/ViTFreeze.

摘要
在这篇论文中，我们提出了一种自动学习方法，把地方填充模型与进行逐层冻结结合在一起，以提高初始层次训练的效率和速度。这种方法通过在训练过程中系统地冻结特定层次，从而降低计算成本而不会影响学习能力。我们的方法使用了一种新的多尺度重建过程，以便高效地在初始层次上学习和提高 semantic 理解的能力。结果表明，使用我们的方法可以大幅减少训练时间（约12.5%），同时保持或提高模型准确率（降低顶层准确率0.6%）。我们的方法实现了顶层准确率82.6%和顶层5准确率96.2%，这表明它在计算资源和时间紧张的情况下具有潜在的应用价值。这项研究对计算机视觉领域的自动学习做出了一项进步。我们的实现可以在我们项目的 GitHub 存储库中找到：github.com/utkutpcgl/ViTFreeze。

S2P3: Self-Supervised Polarimetric Pose Prediction

paper_url: http://arxiv.org/abs/2312.01105
repo_url: None
paper_authors: Patrick Ruhkamp, Daoyi Gao, Nassir Navab, Benjamin Busam
for: 这篇论文是关于从多modalRGB+普拉拉度图像中预测6D对象pose的首次自我超vised学习方法。
methods: 该方法包括1)物理模型提取普拉拉度图像的geometry信息，2)教师生物知识传授方案和3)自我超vised损失函数通过可导渲染和反转物理约束。两个网络都利用物理特性来学习坚实的geometry表示，通过编码形状假设和普拉拉度特征来学习物理特性。
results: 这篇论文report了对透明、折射和反射表面的物体进行预测 pose 的性能提升，特别是在照明挑战性较高的情况下。

Abstract
This paper proposes the first self-supervised 6D object pose prediction from multimodal RGB+polarimetric images. The novel training paradigm comprises 1) a physical model to extract geometric information of polarized light, 2) a teacher-student knowledge distillation scheme and 3) a self-supervised loss formulation through differentiable rendering and an invertible physical constraint. Both networks leverage the physical properties of polarized light to learn robust geometric representations by encoding shape priors and polarization characteristics derived from our physical model. Geometric pseudo-labels from the teacher support the student network without the need for annotated real data. Dense appearance and geometric information of objects are obtained through a differentiable renderer with the predicted pose for self-supervised direct coupling. The student network additionally features our proposed invertible formulation of the physical shape priors that enables end-to-end self-supervised training through physical constraints of derived polarization characteristics compared against polarimetric input images. We specifically focus on photometrically challenging objects with texture-less or reflective surfaces and transparent materials for which the most prominent performance gain is reported.

摘要

A physical model to extract geometric information of polarized light.2. A teacher-student knowledge distillation scheme.3. A self-supervised loss formulation through differentiable rendering and an invertible physical constraint.The proposed method leverages the physical properties of polarized light to learn robust geometric representations by encoding shape priors and polarization characteristics derived from the physical model. The teacher network provides geometric pseudo-labels to support the student network without the need for annotated real data. The student network also features an invertible formulation of the physical shape priors, which enables end-to-end self-supervised training through physical constraints of derived polarization characteristics compared against polarimetric input images.The proposed method is specifically designed to handle photometrically challenging objects with texture-less or reflective surfaces and transparent materials, and it reports the most prominent performance gain in these cases.

QPoser: Quantized Explicit Pose Prior Modeling for Controllable Pose Generation

paper_url: http://arxiv.org/abs/2312.01104
repo_url: None
paper_authors: Yumeng Li, Yaoxiang Ding, Zhong Ren, Kun Zhou
for: 本研究旨在提出一种高度可控的显式姿势先骨模型，以确保正确性、表达性和可控性三大特性均能满足。
methods: 该模型使用多头 вектор量编码器（MS-VQVAE）获取表达性和分布式姿势表示，同时使用全局本地特征融合机制（GLIF-AE）将潜在表示融合到本地关节特征中。
results: 实验结果显示，QPoser对于表达性和正确姿势表示有显著优势，同时可以轻松地根据参考姿势和指示信息进行详细的条件生成。

Abstract
Explicit pose prior models compress human poses into latent representations for using in pose-related downstream tasks. A desirable explicit pose prior model should satisfy three desirable abilities: 1) correctness, i.e. ensuring to generate physically possible poses; 2) expressiveness, i.e. ensuring to preserve details in generation; 3) controllability, meaning that generation from reference poses and explicit instructions should be convenient. Existing explicit pose prior models fail to achieve all of three properties, in special controllability. To break this situation, we propose QPoser, a highly controllable explicit pose prior model which guarantees correctness and expressiveness. In QPoser, a multi-head vector quantized autoencoder (MS-VQVAE) is proposed for obtaining expressive and distributed pose representations. Furthermore, a global-local feature integration mechanism (GLIF-AE) is utilized to disentangle the latent representation and integrate full-body information into local-joint features. Experimental results show that QPoser significantly outperforms state-of-the-art approaches in representing expressive and correct poses, meanwhile is easily to be used for detailed conditional generation from reference poses and prompting instructions.

摘要
<>定型人体 pose 模型可以压缩人体 pose 到 latent 表示，以便在人体 pose 相关的下游任务中使用。一个理想的明确 pose 模型应满足以下三种能力：1）正确性，即生成物理可能的 pose；2）表现力，即保留生成中的细节；3）可控性，即从参考 pose 和显式指令中生成 conveniences。现有的明确 pose 模型都不能同时满足这三个属性，尤其是可控性。为了破坏这种情况，我们提出 QPoser，一种高度可控的明确 pose 模型， guarantees correctness 和表现力。在 QPoser 中，一种多头 вектор量编码器（MS-VQVAE）被提议用于获取表达性和分布式 pose 表示。此外，一种全球-本地特征融合机制（GLIF-AE）被利用来分离 latent 表示并将全身信息融合到本地关节特征中。实验结果表明，QPoser 与状态机制比较出色地表达表达性和正确 pose，同时是容易用于详细的 conditional 生成从参考 pose 和激励指令中。

Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Bag-Level Classifier is a Good Instance-Level Teacher

paper_url: http://arxiv.org/abs/2312.01099
repo_url: https://github.com/dootmaan/icmil
paper_authors: Hongyi Wang, Luyang Luo, Fang Wang, Ruofeng Tong, Yen-Wei Chen, Hongjie Hu, Lanfen Lin, Hao Chen
for: 本研究旨在提高多例学习（MIL）在整个扫描图像（WSI）分类中的性能，并解决高计算成本问题。
methods: 我们提出了一种叫做趋同多例学习（ICMIL）的方法，它利用一个质量更高的Bag类别器作为每个幕例的教师，以便在低成本下进行幕例表示的迭代更新。在这个过程中，我们还引入了一种教师-学生框架，以有效地传递分类知识从Bag类别器中。
results: 我们在四个不同的数据集上进行了严格的实验，并证明了ICMIL在现有MIL脊梁上的性能有所提高，达到了现有最佳结果。

Abstract
Multiple Instance Learning (MIL) has demonstrated promise in Whole Slide Image (WSI) classification. However, a major challenge persists due to the high computational cost associated with processing these gigapixel images. Existing methods generally adopt a two-stage approach, comprising a non-learnable feature embedding stage and a classifier training stage. Though it can greatly reduce the memory consumption by using a fixed feature embedder pre-trained on other domains, such scheme also results in a disparity between the two stages, leading to suboptimal classification accuracy. To address this issue, we propose that a bag-level classifier can be a good instance-level teacher. Based on this idea, we design Iteratively Coupled Multiple Instance Learning (ICMIL) to couple the embedder and the bag classifier at a low cost. ICMIL initially fix the patch embedder to train the bag classifier, followed by fixing the bag classifier to fine-tune the patch embedder. The refined embedder can then generate better representations in return, leading to a more accurate classifier for the next iteration. To realize more flexible and more effective embedder fine-tuning, we also introduce a teacher-student framework to efficiently distill the category knowledge in the bag classifier to help the instance-level embedder fine-tuning. Thorough experiments were conducted on four distinct datasets to validate the effectiveness of ICMIL. The experimental results consistently demonstrate that our method significantly improves the performance of existing MIL backbones, achieving state-of-the-art results. The code is available at: https://github.com/Dootmaan/ICMIL/tree/confidence_based

摘要
多个实例学习（MIL）在整幅扫描图像（WSI）分类中表现出了承诺。然而，高计算成本使得处理这些吉帕ixel图像的主要挑战仍然存在。现有方法通常采用两个阶段approach，包括一个非学习的特征嵌入阶段和一个分类器训练阶段。虽然可以通过使用预先训练在其他领域的固定特征嵌入器来减少内存消耗，但这种方法也会导致两个阶段之间的差异，从而导致分类精度下降。为解决这个问题，我们提议在bag级分类器中使用嵌入器作为好的实例级教师。基于这个想法，我们设计了Iteratively Coupled Multiple Instance Learning（ICMIL），将嵌入器和包级分类器在低成本下 coupling。ICMIL首先固定patch嵌入器，然后使用包级分类器来精度调整嵌入器。受托的嵌入器可以生成更好的表示，从而导致更高精度的分类器。为了更好地实现嵌入器的练习和更高效地嵌入器的调整，我们还引入了教师-学生框架，以高效地传递包级分类器中的类别知识，以帮助实例级嵌入器的练习。我们在四个不同的数据集上进行了严格的实验，以验证ICMIL的效果。实验结果表明，我们的方法可以显著提高现有MIL的性能，达到状态前级结果。代码可以在https://github.com/Dootmaan/ICMIL/tree/confidence_based找到。

Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty

paper_url: http://arxiv.org/abs/2312.01097
repo_url: https://github.com/joeyy5588/planning-as-inpainting
paper_authors: Cheng-Fu Yang, Haoyang Xu, Te-Lin Wu, Xiaofeng Gao, Kai-Wei Chang, Feng Gao
for: 这 paper 的目的是解决embodied AI 任务规划问题，因为社区没有达成一致性在问题的形式化方面。
methods: 该 paper 提出了一个统一框架，包括一个端到端训练方法和一个规划算法。特别是，他们提出了一种任务无关的方法，名为“规划 как涂抹”。在这种方法中，他们使用了一个 Denoising Diffusion Model (DDM) для计划生成，基于语言指令和感知输入在部分可见环境下。
results: 该 paper 的实验结果表明，该方法在多种embodied AI任务中表现出色，包括视觉语言导航、物体操作和任务规划在一个 photorealistic 虚拟环境中。

Abstract
Task planning for embodied AI has been one of the most challenging problems where the community does not meet a consensus in terms of formulation. In this paper, we aim to tackle this problem with a unified framework consisting of an end-to-end trainable method and a planning algorithm. Particularly, we propose a task-agnostic method named 'planning as in-painting'. In this method, we use a Denoising Diffusion Model (DDM) for plan generation, conditioned on both language instructions and perceptual inputs under partially observable environments. Partial observation often leads to the model hallucinating the planning. Therefore, our diffusion-based method jointly models both state trajectory and goal estimation to improve the reliability of the generated plan, given the limited available information at each step. To better leverage newly discovered information along the plan execution for a higher success rate, we propose an on-the-fly planning algorithm to collaborate with the diffusion-based planner. The proposed framework achieves promising performances in various embodied AI tasks, including vision-language navigation, object manipulation, and task planning in a photorealistic virtual environment. The code is available at: https://github.com/joeyy5588/planning-as-inpainting.

摘要
任务规划 дляembodied AI是一个最大化问题，community没有共识在问题形ulation中。在这篇论文中，我们想使用一个统一框架，包括一个端到端训练方法和一个规划算法。特别是，我们提出了一种任务无关的方法名为“规划为填充”。在这种方法中，我们使用一个Denosing Diffusion Model（DDM）来生成规划，基于语言指令和感知输入，在部分可见环境下。partial observation常常导致模型假设规划。因此，我们的扩散基于方法同时模型状态轨迹和目标估计，以提高基于有限可用信息的每步生成的计划的可靠性。为了更好地利用在计划执行过程中新发现的信息，我们提议一种在线规划算法与扩散基于规划协作。我们的提案的框架在多种embodied AI任务中表现出色，包括视力语言导航、物体操作和任务规划在一个真实的虚拟环境中。代码可以在：https://github.com/joeyy5588/planning-as-inpainting中找到。

RobustCalib: Robust Lidar-Camera Extrinsic Calibration with Consistency Learning

paper_url: http://arxiv.org/abs/2312.01085
repo_url: None
paper_authors: Shuang Xu, Sifan Zhou, Zhi Tian, Jizhou Ma, Qiong Nie, Xiangxiang Chu
for: addresses the extrinsic calibration problem in a robust, automatic, and single-shot manner, without relying on offline targets or human efforts.
methods: leverages consistency learning between LiDAR and camera to implement implicit re-calibration, using an appearance-consistency loss and a geometric-consistency loss to minimize inconsistencies between projected and predicted attributes.
results: achieves accurate and robust performance in various scenarios, with comprehensive experiments conducted on different datasets. The model and code will be released to promote further research and development.Here’s the full text in Simplified Chinese:
for: 本研究目的是实现LiDAR-camera外部参数估计的稳定、自动化和单次测试方法，不需要人工干预或对应标的支持。
methods: 我们利用LiDAR和camera之间的一致性学习来实现隐式重新校准，透过对对应点 cloud 的属性（例如深度和对应深度）进行整合学习，以减少对应点 cloud 的不一致。
results: 我们在不同的数据集上进行了充分的实验，结果显示我们的方法可以实现精确且可靠的性能，并且可以适应多种情况。我们将会发布我们的模型和代码，以便进一步的研究和发展。

Abstract
Current traditional methods for LiDAR-camera extrinsics estimation depend on offline targets and human efforts, while learning-based approaches resort to iterative refinement for calibration results, posing constraints on their generalization and application in on-board systems. In this paper, we propose a novel approach to address the extrinsic calibration problem in a robust, automatic, and single-shot manner. Instead of directly optimizing extrinsics, we leverage the consistency learning between LiDAR and camera to implement implicit re-calibartion. Specially, we introduce an appearance-consistency loss and a geometric-consistency loss to minimizing the inconsitency between the attrbutes (e.g., intensity and depth) of projected LiDAR points and the predicted ones. This design not only enhances adaptability to various scenarios but also enables a simple and efficient formulation during inference. We conduct comprehensive experiments on different datasets, and the results demonstrate that our method achieves accurate and robust performance. To promote further research and development in this area, we will release our model and code.

摘要
Current traditional methods for LiDAR-camera extrinsics estimation rely on offline targets and human efforts, while learning-based approaches resort to iterative refinement for calibration results, which poses constraints on their generalization and application in on-board systems. In this paper, we propose a novel approach to address the extrinsic calibration problem in a robust, automatic, and single-shot manner. Instead of directly optimizing extrinsics, we leverage the consistency learning between LiDAR and camera to implement implicit re-calibration. Specifically, we introduce an appearance-consistency loss and a geometric-consistency loss to minimize the inconsistency between the attributes (e.g., intensity and depth) of projected LiDAR points and the predicted ones. This design not only enhances adaptability to various scenarios but also enables a simple and efficient formulation during inference. We conduct comprehensive experiments on different datasets, and the results demonstrate that our method achieves accurate and robust performance. To promote further research and development in this area, we will release our model and code.Here is the word-for-word translation of the text into Simplified Chinese:现有的传统方法对探测器和摄像头的外部参数估计依赖于线上目标和人工努力，而学习基于的方法则通过迭代精度调整来实现准确性，这会对其普适性和实现在 bord 系统中带来限制。在这篇文章中，我们提出了一种新的方法，以解决探测器和摄像头的外部参数估计问题。而不是直接优化外部参数，我们利用探测器和摄像头之间的一致性学习来实现隐式重新调整。具体来说，我们引入了一个外观一致损失和一个几何一致损失，以最小化探测器和预测的点 cloud 之间的不一致。这种设计不仅提高了对不同场景的适应性，还可以在推理过程中实现简单和高效的表示。我们在不同的 dataset 上进行了广泛的实验，结果表明，我们的方法可以具有高度的准确性和稳定性。为促进这个领域的进一步研究和开发，我们将发布我们的模型和代码。

Consistency Prototype Module and Motion Compensation for Few-Shot Action Recognition (CLIP-CP$\mathbf{M^2}$C)

paper_url: http://arxiv.org/abs/2312.01083
repo_url: None
paper_authors: Fei Guo, Li Zhu, YiKang Wang, Han Qi
for: 提高 few-shot 动作识别精度， Addressing the limitations of previous works that rely solely on visual modal and ignore motion features.
methods: 提出 Consistency Prototype and Motion Compensation Network(CLIP-CP$M^2$C), 使用 CLIP 进行多Modal 几拟shot 动作识别，并提出一种新的文本补充方法来减少 query 视频中文本信息的影响。
results: 经验表明，提出的方法可以与现有的状态 tirifled 结果竞争，并且在标准的 benchmark 数据集上实现了优秀的性能。

Abstract
Recently, few-shot action recognition has significantly progressed by learning the feature discriminability and designing suitable comparison methods. Still, there are the following restrictions. (a) Previous works are mainly based on visual mono-modal. Although some multi-modal works use labels as supplementary to construct prototypes of support videos, they can not use this information for query videos. The labels are not used efficiently. (b) Most of the works ignore the motion feature of video, although the motion features are essential for distinguishing. We proposed a Consistency Prototype and Motion Compensation Network(CLIP-CP$M^2$C) to address these issues. Firstly, we use the CLIP for multi-modal few-shot action recognition with the text-image comparison for domain adaption. Secondly, in order to make the amount of information between the prototype and the query more similar, we propose a novel method to compensate for the text(prompt) information of query videos when text(prompt) does not exist, which depends on a Consistency Loss. Thirdly, we use the differential features of the adjacent frames in two directions as the motion features, which explicitly embeds the network with motion dynamics. We also apply the Consistency Loss to the motion features. Extensive experiments on standard benchmark datasets demonstrate that the proposed method can compete with state-of-the-art results. Our code is available at the URL: https://github.com/xxx/xxx.git.

摘要
最近几个月，几个shot动作识别技术有了很大的进步，主要是通过学习特征权重和设计适当的比较方法来提高特征分解能力。然而，还有一些限制。（a）以前的工作主要基于视觉单模态，尽管一些多模态工作使用标签作为补充来构建支持视频的原型，但它们不能使用这些信息来处理查询视频。标签没有被有效地利用。（b）大多数工作忽略了视频中的运动特征，尽管运动特征是分辨动作的关键。我们提出了一种协调原型和运动补做网络（CLIP-CP$M^2$C）来解决这些问题。首先，我们使用CLIP进行多模态几个shot动作识别，并使用文本-图像比较来适应频率。其次，我们提出了一种新的方法，用于在查询视频中缺失文本（提示）信息时，补做文本信息，这取决于一种一致损失。最后，我们使用两个相邻帧之间的差异特征作为运动特征，并直接嵌入网络中运动动态。我们还应用一种一致损失来补做运动特征。广泛的实验表明，我们的方法可以与当前的最佳结果竞争。我们的代码可以在以下URL获取：https://github.com/xxx/xxx.git。

DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

paper_url: http://arxiv.org/abs/2312.01068
repo_url: None
paper_authors: Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner
for: 本研究旨在提出一种基于扩散参数的头部模型（DPHMs），用于robust化从单视角深度序列中重建和跟踪头部 geometry。
methods: 该方法使用扩散参数来强制实现头部重建和跟踪，并通过对标注数据进行学习来regularize整个过程。
results: 对比于现有的tracking方法，本研究的方法能够更好地重建头部的身份和表情，并在不同的表情和快速转换中保持稳定性。

Abstract
We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, can now excel in representing high-fidelity head geometries, tracking and reconstruction heads from real-world single-view depth sequences remains very challenging, as the fitting to partial and noisy observations is underconstrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods, and demonstrate improved head identity reconstruction as well as robust expression tracking.

摘要
我们介绍Diffusion Parametric Head Models（DPHMs），一种生成模型，可以从单视深度序列中生成和跟踪高精度头部结构。当前的头部模型，如NPHMs，可以在表示高精度头部几何结构方面达到很高水平，但是从实际世界单视深度序列中跟踪和重建头部仍然很具有挑战性，因为适应部分和噪声 Observations 是不够约束的。为了解决这些挑战，我们提议在重建和跟踪头部过程中使用抽象扩散-based prior来规范化 heads。这个假设基于的规范化效果地使得表达和身份代码在下面的几何表示中徘徊。为了评估抽象扩散-based prior的效果，我们收集了一个包含多种复杂表情动作和快速过渡的单视Kinect序列的数据集。我们与状态艺术方法进行比较，并证明我们的方法可以提高头部身份重建以及表达跟踪的稳定性。

DiverseDream: Diverse Text-to-3D Synthesis with Augmented Text Embedding

paper_url: http://arxiv.org/abs/2312.02192
repo_url: None
paper_authors: Uy Dieu Tran, Minh Luu, Phong Nguyen, Janne Heikkila, Khoi Nguyen, Binh-Son Hua
for: 本研究旨在解决现有文本到3D模型Synthesis方法中的模式塌聚问题，即使用预训练的文本到图像模型作为引导视觉先验，但是现有方法中的3D模型往往具有很差的多样性。
methods: 本研究采用分析和实验方法来探讨现有文本到3D Synthesis方法中的限制因素，并提出一种新的方法，即通过文本描述和参照图像的文本反转来增强多样性。
results: 研究表明，通过提出新的方法，文本到3D Synthesis中的多样性得到了改善，both qualitatively and quantitatively。

Abstract
Text-to-3D synthesis has recently emerged as a new approach to sampling 3D models by adopting pretrained text-to-image models as guiding visual priors. An intriguing but underexplored problem with existing text-to-3D methods is that 3D models obtained from the sampling-by-optimization procedure tend to have mode collapses, and hence poor diversity in their results. In this paper, we provide an analysis and identify potential causes of such a limited diversity, and then devise a new method that considers the joint generation of different 3D models from the same text prompt, where we propose to use augmented text prompts via textual inversion of reference images to diversify the joint generation. We show that our method leads to improved diversity in text-to-3D synthesis qualitatively and quantitatively.

摘要
文本到3D合成最近emerged as a new approach to sampling 3D models by adopting pre-trained text-to-image models as guiding visual priors. An intriguing but underexplored problem with existing text-to-3D methods is that 3D models obtained from the sampling-by-optimization procedure tend to have mode collapses, and hence poor diversity in their results. In this paper, we provide an analysis and identify potential causes of such a limited diversity, and then devise a new method that considers the joint generation of different 3D models from the same text prompt, where we propose to use augmented text prompts via textual inversion of reference images to diversify the joint generation. We show that our method leads to improved diversity in text-to-3D synthesis both qualitatively and quantitatively.Here's the breakdown of the translation:* "text-to-3D synthesis" is translated as "文本到3D合成" (wén tiě to 3D hù shēng)* "recently emerged" is translated as "最近emerged" (zuì jìn emerged)* "approach" is translated as "方法" (fāng fǎ)* "sampling 3D models" is translated as "采样3D模型" (cǎi yàng 3D moldèi)* "by adopting pretrained text-to-image models" is translated as "采用预训练文本到图像模型" (cǎi yòng yù xiǎng préng zhì text to image moldèi)* "as guiding visual priors" is translated as "作为视觉引导" (suǒ weì wèi jǐng yì guī)* "an intriguing but underexplored problem" is translated as "一个有趣但未得到过问题" (yī gè yǒu qióng dào wèi dé zhèng bù yào guò)* "with existing text-to-3D methods" is translated as "现有的文本到3D方法" (xiàn yǒu de wén tiě to 3D fāng fǎ)* "obtained from the sampling-by-optimization procedure" is translated as "从采样优化过程中获得" (cóng cǎi yàng yǎo gōng zhèng)* "tend to have mode collapses" is translated as "容易出现模式崩溃" (róng yì chū shì mó dì bāo kòng)* "and hence poor diversity in their results" is translated as "因此结果中的多样性很差" (yīn qī jiè yù zhèng zhèng)* "In this paper, we provide an analysis and identify potential causes" is translated as "在这篇论文中，我们提供分析并识别出可能的原因" (zài zhè bēi tí wén zhèng, wǒmen tīng bìng yì jīng yì)* "of such a limited diversity" is translated as "这种有限的多样性" (zhè zhòng yǒu xiàn de duō yàng xìng)* "and then devise a new method" is translated as "然后提出一种新方法" (rán hái tím chū yī zhòng xīn fāng fǎ)* "that considers the joint generation of different 3D models" is translated as "考虑多种不同的3D模型的共同生成" (kǎo jí duō shè bì dì 3D moldèi de gòng dōng shēng chéng)* "from the same text prompt" is translated as "从同一个文本提示中" (cóng tóng yī gè wén tiě tí shǐ zhèng)* "where we propose to use augmented text prompts" is translated as "我们提议使用增强文本提示" (wǒmen tīng yì shǐ yòng zhòng jì chéng wén tiě tí shǐ)* "via textual inversion of reference images" is translated as "通过文本反转参考图像" (tōng guò wén tiě fǎ zhòng tiě jǐng)* "to diversify the joint generation" is translated as "以增加多样性的共同生成" (yǐ jìn gǎi duō yàng xìng de gòng dōng shēng chéng)* "We show that our method leads to improved diversity" is translated as "我们示示出我们的方法可以提高多样性" (wǒmen shì shì zhèng zhèng zhèng)* "in text-to-3D synthesis both qualitatively and quantitatively" is translated as "在文本到3D合成中， both qualitatively and quantitatively" (zhè bēi zhèng zhèng zhèng)

Spectral-wise Implicit Neural Representation for Hyperspectral Image Reconstruction

paper_url: http://arxiv.org/abs/2312.01061
repo_url: None
paper_authors: Huan Chen, Wangcai Zhao, Tingfa Xu, Shiyun Zhou, Peifu Liu, Jianan Li
for: 提高卷积Snapshot спектраль成像（CASSI）重建的精度和灵活性，使其能够在各种应用场景中提供更高质量的图像重建。
methods: 提出了一种新的 Spectral-wise Implicit Neural Representation（SINR）方法，该方法通过引入连续spectral扩展过程，实现图像重建中的spectral超分辨率，并且可以自定义扩展因子。SINR方法还包括了一个spectral Wisdom attention机制，以及一个Fourier坐标编码器和一个spectral缩放因子模块，以提高SINR的表达能力和灵活性。
results: 对比基eline方法，SINR方法在不同的图像重建任务中表现出了明显的优势，并且可以在CASSI框架中实现连续的图像重建。

Abstract
Coded Aperture Snapshot Spectral Imaging (CASSI) reconstruction aims to recover the 3D spatial-spectral signal from 2D measurement. Existing methods for reconstructing Hyperspectral Image (HSI) typically involve learning mappings from a 2D compressed image to a predetermined set of discrete spectral bands. However, this approach overlooks the inherent continuity of the spectral information. In this study, we propose an innovative method called Spectral-wise Implicit Neural Representation (SINR) as a pioneering step toward addressing this limitation. SINR introduces a continuous spectral amplification process for HSI reconstruction, enabling spectral super-resolution with customizable magnification factors. To achieve this, we leverage the concept of implicit neural representation. Specifically, our approach introduces a spectral-wise attention mechanism that treats individual channels as distinct tokens, thereby capturing global spectral dependencies. Additionally, our approach incorporates two components, namely a Fourier coordinate encoder and a spectral scale factor module. The Fourier coordinate encoder enhances the SINR's ability to emphasize high-frequency components, while the spectral scale factor module guides the SINR to adapt to the variable number of spectral channels. Notably, the SINR framework enhances the flexibility of CASSI reconstruction by accommodating an unlimited number of spectral bands in the desired output. Extensive experiments demonstrate that our SINR outperforms baseline methods. By enabling continuous reconstruction within the CASSI framework, we take the initial stride toward integrating implicit neural representation into the field.

摘要
<>使用简化中文翻译文本。<> CASSI重建目标是从2D测量获取3D空间spectral信号。现有的HSI重建方法通常是从2D压缩图像到 predetermined设置的几个精确的 spectral带来学习映射。然而，这种方法忽略了spectral信息的内在连续性。在这种研究中，我们提出了一种创新的方法 called Spectral-wise Implicit Neural Representation（SINR），作为HSI重建领域的先锋之步。SINR引入了一种spectral超分辨程序，以实现HSI重建中的spectral超分辨。为此，我们利用了implicit neural representation的概念。具体来说，我们的方法引入了一种spectral-wise注意机制，将 individuial channels treated as distinct tokens，以捕捉全 spectraldependencies。此外，我们的方法包括两个组件：Fourier coordinate encoder和spectral scale factor module。Fourier coordinate encoder增强了SINR的能力强调高频分量，而spectral scale factor module帮助SINR适应变量数量的spectral channels。值得注意的是，SINR框架提高了CASSI重建的灵活性，可以在desired输出中包含无限多个spectral带。广泛的实验表明，我们的SINR超过基准方法。通过在CASSI框架中实现连续重建，我们正式开启了implicit neural representation在这一领域的应用。

Spectrum-driven Mixed-frequency Network for Hyperspectral Salient Object Detection

paper_url: http://arxiv.org/abs/2312.01060
repo_url: https://github.com/laprf/smn
paper_authors: Peifu Liu, Tingfa Xu, Huan Chen, Shiyun Zhou, Haolin Qin, Jianan Li
for: 这个论文的目标是提出一种基于spectral characteristic的高spectral分辨率对象检测方法（HSOD），以便更好地检测各种颜色图像中的突出对象。
methods: 该方法利用了两种不同频率组件来捕捉对象的特征信息：low-frequency Spectral Saliency和high-frequency Spectral Edge。Spectral Saliency可以 approximates 突出对象的区域，而 Spectral Edge可以捕捉突出对象的边缘信息。
results: 该方法在HS-SOD benchmark和自定义的HSOD-BIT数据集上进行了广泛的实验，并证明了它在HSOD性能方面超过了现有的方法。

Abstract
Hyperspectral salient object detection (HSOD) aims to detect spectrally salient objects in hyperspectral images (HSIs). However, existing methods inadequately utilize spectral information by either converting HSIs into false-color images or converging neural networks with clustering. We propose a novel approach that fully leverages the spectral characteristics by extracting two distinct frequency components from the spectrum: low-frequency Spectral Saliency and high-frequency Spectral Edge. The Spectral Saliency approximates the region of salient objects, while the Spectral Edge captures edge information of salient objects. These two complementary components, crucial for HSOD, are derived by computing from the inter-layer spectral angular distance of the Gaussian pyramid and the intra-neighborhood spectral angular gradients, respectively. To effectively utilize this dual-frequency information, we introduce a novel lightweight Spectrum-driven Mixed-frequency Network (SMN). SMN incorporates two parameter-free plug-and-play operators, namely Spectral Saliency Generator and Spectral Edge Operator, to extract the Spectral Saliency and Spectral Edge components from the input HSI independently. Subsequently, the Mixed-frequency Attention module, comprised of two frequency-dependent heads, intelligently combines the embedded features of edge and saliency information, resulting in a mixed-frequency feature representation. Furthermore, a saliency-edge-aware decoder progressively scales up the mixed-frequency feature while preserving rich detail and saliency information for accurate salient object prediction. Extensive experiments conducted on the HS-SOD benchmark and our custom dataset HSOD-BIT demonstrate that our SMN outperforms state-of-the-art methods regarding HSOD performance. Code and dataset will be available at https://github.com/laprf/SMN.

摘要
《卷积神经网络对干扰监测（HSOD）的研究》目标：卷积神经网络可以快速地检测干扰监测（HSIs）中的特征对象。但现有方法未能充分利用干扰spectral信息，通常会将HSIs转换为false-color图像或者将神经网络与归一化。我们提出了一种新的方法，即完全利用干扰spectral特征，通过提取low-frequency spectral saliency和high-frequency spectral edge两个不同频率组成部分来检测特征对象。干扰saliency region的估计和特征对象的边缘信息是两个重要的组成部分，它们是通过计算干扰spectral angular distance和干扰邻居spectral angular gradients来获得的。为了有效利用这两个频率信息，我们提出了一种新的轻量级 Spectrum-driven Mixed-frequency Network（SMN）。SMN包括两个无参数的插件操作器，即 Spectral Saliency Generator和Spectral Edge Operator，可以独立地从输入HSIs中提取low-frequency spectral saliency和high-frequency spectral edge两个频率组成部分。然后，混合频率注意模块，由两个频率依赖的头组成，智能地将这两个频率组成部分混合，从而获得混合频率特征表示。最后，一个saliency-edge-aware decoder逐渐缩放混合频率特征，保留细节和干扰信息，以准确预测特征对象。我们在HS-SODbenchmark和我们自定义的HSOD-BIT数据集上进行了广泛的实验，并证明了我们的SMN在HSOD性能方面比现有方法更高。代码和数据集将在https://github.com/laprf/SMN上提供。

Taming Latent Diffusion Models to See in the Dark

paper_url: http://arxiv.org/abs/2312.01027
repo_url: https://github.com/csqiangwen/Taming_Latent_Diffusion_Models_to_See_in_the_Dark
paper_authors: Qiang Wen, Yazhou Xing, Qifeng Chen
for: 提高低光照降采样照片的曝光和清晰度
methods: 利用一个冰结的预训练 diffusion 模型，并在其生成过程中插入一组提议的 taming 模块，以控制生成的结构和细节
results: 对比 экспериментах显示，提出的方法不仅在量化评价中达到了领先的性能，而且在视觉比较中也显示出了明显的优势。这些结果表明，可以通过利用预训练的 diffusion 模型来提高低光照降采样照片的曝光和清晰度。

Abstract
Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB image is a significant challenge in computational photography. Due to the limitation of large-scale paired data, prior approaches have difficulty in recovering fine details and true colors in extremely low-light regions. Meanwhile, recent advancements in generative diffusion models have shown promising generating capabilities, which inspires this work to explore generative priors from a diffusion model trained on a large-scale open-domain dataset to benefit the low-light image enhancement (LLIE) task. Based on this intention, we propose a novel diffusion-model-based LLIE method, dubbed LDM-SID. LDM-SID aims at inserting a set of proposed taming modules into a frozen pre-trained diffusion model to steer its generating process. Specifically, the taming module fed with low-light information serves to output a pair of affine transformation parameters to modulate the intermediate feature in the diffusion model. Additionally, based on the observation of dedicated generative priors across different portions of the diffusion model, we propose to apply 2D discrete wavelet transforms on the input RAW image, resulting in dividing the LLIE task into two essential parts: low-frequency content generation and high-frequency detail maintenance. This enables us to skillfully tame the diffusion model for optimized structural generation and detail enhancement. Extensive experiments demonstrate the proposed method not only achieves state-of-the-art performance in quantitative evaluations but also shows significant superiority in visual comparisons. These findings highlight the effectiveness of leveraging a pre-trained diffusion model as a generative prior to the LLIE task.

摘要
提高低光照噪raw图像为高效照明和干净的sRGB图像是计算 fotograf�摄影中的一大挑战。由于大规模对照数据的限制，先前的方法很难在EXTREMELY低光照区域中恢复细节和真实颜色。同时，最近的生成扩散模型的进步有吸引力，这种吸引力 inspirits 我们的工作，用一个固定的扩散模型来权重生成。基于这个意图，我们提出了一种新的扩散模型基于LDIE方法，称为LDM-SID。LDM-SID的目标是在扩散模型中插入一组提议的模块，以控制生成过程。具体来说，通过输入低光信息，这些模块将输出一对平行变换参数，以修改扩散模型的中间特征。此外，根据扩散模型中不同部分的专门生成规则，我们提出了在输入RAW图像上应用2D离散波形变换，从而将LDIE任务分成两个基本部分：低频内容生成和高频细节维持。这使得我们可以聪明地使用扩散模型来生成结构和细节。广泛的实验表明，提议的方法不仅在量化评价中达到了状态态arp的性能，而且在视觉比较中也表现出了显著的优势。这些发现 highlights 使用预训练的扩散模型作为LDIE任务的生成先验的效iveness。

Token Fusion: Bridging the Gap between Token Pruning and Token Merging

paper_url: http://arxiv.org/abs/2312.01026
repo_url: None
paper_authors: Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
for: 提高资源受限的边缘设备上的计算效率，使用Token Fusion方法，以提高计算效率和模型准确率。
methods: 提出了Token Fusion方法，结合了Token pruning和Token merging两种方法，并使用MLERP merging技术来维持特征分布。
results: 实验表明，Token Fusion方法可以在分类和图像生成任务中提高计算效率和模型准确率，并且可以适用于不同的ViT模型。

Abstract
Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs. However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging. Multiple solutions rely on token pruning or token merging. In this paper, we introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging. Token pruning proves advantageous when the model exhibits sensitivity to input interpolations, while token merging is effective when the model manifests close to linear responses to inputs. We combine this to propose a new scheme called Token Fusion. Moreover, we tackle the limitations of average merging, which doesn't preserve the intrinsic feature norm, resulting in distributional shifts. To mitigate this, we introduce MLERP merging, a variant of the SLERP technique, tailored to merge multiple tokens while maintaining the norm distribution. ToFu is versatile, applicable to ViTs with or without additional training. Our empirical evaluations indicate that ToFu establishes new benchmarks in both classification and image generation tasks concerning computational efficiency and model accuracy.

摘要
Computer Vision Transformers (ViTs) 已经成为电脑视觉领域的强大背部，超越了许多传统的单语网络 (CNN)。然而，它们的计算成本，主要从自我对齐机制而来，使得在有限的边缘设备上部署困难。多个解决方案依靠单语削减或单语合并。在这篇论文中，我们引入“单语融合”（ToFu），一种结合单语削减和单语合并的方法。单语削减在模型对输入 interpolations 时有利，而单语合并则在模型对输入Linear responses 时有利。我们结合这两种方法，提议一个新的方案。此外，我们解决了均值融合的限制，它不会保留单语特征norm，导致分布迁移。我们引入 MLERP 融合，一种适应 ViTs 的 SLERP 技术，可以融合多个单语而保持单语特征norm。ToFu 可以适用于 ViTs WITH 或 Without 额外训练。我们的实验评估表明，ToFu 在分类和图像生成任务中具有较高的计算效率和模型精度。

Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D

paper_url: http://arxiv.org/abs/2312.02190
repo_url: None
paper_authors: Karran Pandey, Paul Guerrero, Matheus Gadelha, Yannick Hold-Geoffroy, Karan Singh, Niloy Mitra
for: 该论文旨在提供一种新的方法，使得在扩散图像上进行3D对象编辑，无需任何精度调整或3D对象检索。
methods: 该论文使用现有的预训练扩散模型和2D图像深度估计，实现3D对象编辑。
results: 该论文的结果显示，Diffusion Handles可以生成高品质、photo-real的编辑图像，保持对象特征。论文的主要发现是通过将扩散活化升起到3D使用代理深度，并将其3D变换和相关的活化升起，然后将其 projet回到图像空间，以生成高品质的编辑图像。

Abstract
Diffusion Handles is a novel approach to enabling 3D object edits on diffusion images. We accomplish these edits using existing pre-trained diffusion models, and 2D image depth estimation, without any fine-tuning or 3D object retrieval. The edited results remain plausible, photo-real, and preserve object identity. Diffusion Handles address a critically missing facet of generative image based creative design, and significantly advance the state-of-the-art in generative image editing. Our key insight is to lift diffusion activations for an object to 3D using a proxy depth, 3D-transform the depth and associated activations, and project them back to image space. The diffusion process applied to the manipulated activations with identity control, produces plausible edited images showing complex 3D occlusion and lighting effects. We evaluate Diffusion Handles: quantitatively, on a large synthetic data benchmark; and qualitatively by a user study, showing our output to be more plausible, and better than prior art at both, 3D editing and identity control. Project Webpage: https://diffusionhandles.github.io/

摘要
Diffusion Handles 是一种新的方法，用于在扩散图像上进行3D对象编辑。我们使用现有的预训练扩散模型，以及2D图像深度估计，而无需任何精度调整或3D对象检索。编辑结果保持了原始对象的可信度、实际性和形象性。Diffusion Handles 填补了生成图像基于的创意设计中缺失的一个重要方面，并在生成图像编辑方面提出了重要的进步。我们的关键发现是将扩散活化用于对象的3D尺寸，通过代理深度和相关的活化进行3D变换，然后将其 proyect 回到图像空间。通过应用 diffusion 过程于修改后的活化，并保持标识控制，我们可以生成可信度高、实际性强、保持对象形象的编辑图像，显示复杂的3D遮盖和照明效果。我们对Diffusion Handles进行了量化评估，并进行了用户研究，显示我们的输出更加可信度高，并在3D编辑和标识控制两个方面都比过去的先进技术更好。项目网站：https://diffusionhandles.github.io/

Self-Evolving Neural Radiance Fields

paper_url: http://arxiv.org/abs/2312.01003
repo_url: None
paper_authors: Jaewoo Jung, Jisang Han, Jiwon Kang, Seongchan Kim, Min-Seop Kwak, Seungryong Kim
for: 提高 NeRF 模型在实际场景中的应用性，增强其对几个视图点的渲染能力。
methods: 我们提出了一种新的框架，即 Self-Evolving Neural Radiance Fields (SE-NeRF)，它将 NeRF 模型与自我教学框架结合，以解决这些问题。我们将几个视图点的教师模型与学生模型结合，使学生模型通过Distillation scheme来学习更加稳定和准确的场景表示。
results: 我们通过对现有模型应用我们的自我教学框架，提高了模型对图像的渲染质量，并在多个设定下实现了状态机器人的表现。

Abstract
Recently, neural radiance field (NeRF) has shown remarkable performance in novel view synthesis and 3D reconstruction. However, it still requires abundant high-quality images, limiting its applicability in real-world scenarios. To overcome this limitation, recent works have focused on training NeRF only with sparse viewpoints by giving additional regularizations, often called few-shot NeRF. We observe that due to the under-constrained nature of the task, solely using additional regularization is not enough to prevent the model from overfitting to sparse viewpoints. In this paper, we propose a novel framework, dubbed Self-Evolving Neural Radiance Fields (SE-NeRF), that applies a self-training framework to NeRF to address these problems. We formulate few-shot NeRF into a teacher-student framework to guide the network to learn a more robust representation of the scene by training the student with additional pseudo labels generated from the teacher. By distilling ray-level pseudo labels using distinct distillation schemes for reliable and unreliable rays obtained with our novel reliability estimation method, we enable NeRF to learn a more accurate and robust geometry of the 3D scene. We show and evaluate that applying our self-training framework to existing models improves the quality of the rendered images and achieves state-of-the-art performance in multiple settings.

摘要

Deep Generative Attacks and Countermeasures for Data-Driven Offline Signature Verification

paper_url: http://arxiv.org/abs/2312.00987
repo_url: None
paper_authors: An Ngo, MinhPhuong Cao, Rajesh Kumar
for: This paper focuses on the impact of generative attacks on data-driven signature verification (DASV) and proposes practical and interpretable countermeasures.methods: The paper uses two prominent deep generative models (DGMs) - Variational Auto-encoders (VAE) and Conditional Generative Adversarial Networks (CGAN) - to generate signatures that can deceive DASV. The quality of generated images is evaluated using the Structural Similarity Index measure (SSIM).results: The paper shows that VAE-generated signatures increase the False Accept Rates (FARs) of DASV baselines by an average of 10.4%, 10.1%, and 7.5%, while CGAN-generated signatures increase FARs by an average of 32.5%, 30%, and 26.1%. The paper also finds a strong negative correlation between FARs and SSIMs, and demonstrates the effectiveness of retraining DASV baselines with synthetic datasets to improve robustness to generative attacks.Here is the same information in Simplified Chinese:for: 这篇论文关注数据驱动签名验证（DASV）中的生成攻击的影响，并提出了实用和可解释的防御策略。methods: 论文使用了两种知名的深度生成模型（DGM）——变量自适应Encoder（VAE）和条件生成对抗网络（CGAN）——来生成可以骗取DASV的签名。生成图像质量使用了结构相似度指数（SSIM）进行评估。results: 论文显示，VAE生成的签名可以使DASV基线上的False Accept Rate（FAR）提高平均10.4%、10.1%和7.5%，而CGAN生成的签名可以提高FAR平均32.5%、30%和26.1%。论文还发现生成攻击和SSIM之间存在强负相关，并证明使用生成数据重新训练DASV基线可以提高生成攻击的抵抗力。

Abstract
While previous studies have explored attacks via random, simple, and skilled forgeries, generative attacks have received limited attention in the data-driven signature verification (DASV) process. Thus, this paper explores the impact of generative attacks on DASV and proposes practical and interpretable countermeasures. We investigate the power of two prominent Deep Generative Models (DGMs), Variational Auto-encoders (VAE) and Conditional Generative Adversarial Networks (CGAN), on their ability to generate signatures that would successfully deceive DASV. Additionally, we evaluate the quality of generated images using the Structural Similarity Index measure (SSIM) and use the same to explain the attack's success. Finally, we propose countermeasures that effectively reduce the impact of deep generative attacks on DASV. We first generated six synthetic datasets from three benchmark offline-signature datasets viz. CEDAR, BHSig260- Bengali, and BHSig260-Hindi using VAE and CGAN. Then, we built baseline DASVs using Xception, ResNet152V2, and DenseNet201. These DASVs achieved average (over the three datasets) False Accept Rates (FARs) of 2.55%, 3.17%, and 1.06%, respectively. Then, we attacked these baselines using the synthetic datasets. The VAE-generated signatures increased average FARs to 10.4%, 10.1%, and 7.5%, while CGAN-generated signatures to 32.5%, 30%, and 26.1%. The variation in the effectiveness of attack for VAE and CGAN was investigated further and explained by a strong (rho = -0.86) negative correlation between FARs and SSIMs. We created another set of synthetic datasets and used the same to retrain the DASVs. The retained baseline showed significant robustness to random, skilled, and generative attacks as the FARs shrank to less than 1% on average. The findings underscore the importance of studying generative attacks and potential countermeasures for DASV.

摘要
previous studies have explored attacks via random, simple, and skilled forgeries, but generative attacks have received limited attention in the data-driven signature verification (DASV) process. Therefore, this paper explores the impact of generative attacks on DASV and proposes practical and interpretable countermeasures. We investigate the power of two prominent Deep Generative Models (DGMs), Variational Auto-encoders (VAE) and Conditional Generative Adversarial Networks (CGAN), on their ability to generate signatures that would successfully deceive DASV. Additionally, we evaluate the quality of generated images using the Structural Similarity Index measure (SSIM) and use the same to explain the attack's success. Finally, we propose countermeasures that effectively reduce the impact of deep generative attacks on DASV.我们在前一些研究中已经探索过随机、简单和技巧性的伪造攻击，但是生成攻击则受到了有限的注意力。因此，本文将实际攻击和解释措施，并调查了两种常见的深度生成模型（DGM）——Variational Auto-encoders（VAE）和Conditional Generative Adversarial Networks（CGAN），以及它们对于DASV的攻击能力。我们还评估了生成的图像质量使用结构相似度指数（SSIM），并使用这个指数来解释攻击的成功。最后，我们提出了有效地减少深度生成攻击的对DASV的影响的几种实际措施。我们首先从三个benchmark offline签名数据集（CEDAR、BHSig260-Bengali和BHSig260-Hindi）中生成了六个合成数据集，使用VAE和CGAN进行生成。然后，我们建立了基eline DASV，使用Xception、ResNet152V2和DenseNet201。这些DASV的 False Accept Rate（FAR）的平均值为2.55%、3.17%和1.06%。然后，我们对这些基eline进行攻击，使用生成的签名。VAE生成的签名将平均FAR提高到10.4%、10.1%和7.5%，而CGAN生成的签名则提高到32.5%、30%和26.1%。我们进一步研究了VAE和CGAN对于攻击的影响，发现了一个强(-0.86)的负 correlatio（FAR和SSIM）。我们另外创建了一些新的合成数据集，并使用这些数据集重训DASV。重训后的基eline表现了对随机、技巧性和生成攻击的强大抗性，FAR的平均值降到了 less than 1%。这些结果突出了研究生成攻击和可行的对策的重要性。