cs.CV - 2023-11-28

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

  • paper_url: http://arxiv.org/abs/2311.17267
  • repo_url: None
  • paper_authors: Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu
  • for: Building a lightweight video-language model (E-ViLM) and a masked video modeling (MVM) schema that are practical for real-world applications.
  • methods: A semantic vector-quantized tokenizer discretizes continuous visual signals into labels; E-ViLM is trained to reconstruct the labels of masked video regions with a simple MVM task alongside regular VL pre-training objectives.
  • results: Despite its small parameter count and low GFLOPs, E-ViLM learns expressive representations from video-language corpora and reaches competitive performance on a range of video-language tasks such as video question answering and text-to-video retrieval.
    Abstract To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among the existing works, a plethora of them have focused on leveraging large but cumbersome cross-modal architectures. Regardless of their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed as E-ViLM) and a masked video modeling (MVM) schema, assisted with a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes the continuous visual signals into labels. We show that with our simple MVM task and regular VL pre-training modelings, our E-ViLM, despite its compactness, is able to learn expressive representations from Video-Language corpus and generalize well to extensive Video-Language tasks including video question answering, text-to-video retrieval, etc. In particular, our E-ViLM obtains obvious efficiency improvements by reaching competing performances with faster inference speed, i.e., our model reaches $39.3\%$ Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4\%$ of the accuracy of state-of-the-art larger VL architecture with only $15\%$ of the parameters and $94.8\%$ fewer GFLOPs. We also provide extensive ablative studies that validate the effectiveness of our proposed learning schema for E-ViLM.
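To make the masked-video-modeling objective concrete, here is a minimal, hedged sketch in PyTorch: a frozen tokenizer assigns discrete labels to video patches, a random subset of patches is masked, and the model is trained to predict the labels at masked positions with cross-entropy. The tiny Transformer encoder, the dot-product "tokenizer", the zero-masking, and all sizes below are illustrative stand-ins, not E-ViLM's actual architecture or tokenizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mvm_loss(video_encoder, vq_tokenizer, mvm_head, video_patches, mask_ratio=0.5):
    """video_patches: (B, N, D) patch embeddings of a video clip."""
    with torch.no_grad():
        # The frozen, pre-trained tokenizer discretizes patches into semantic labels.
        labels = vq_tokenizer(video_patches)                      # (B, N) codebook ids
    mask = torch.rand(video_patches.shape[:2], device=video_patches.device) < mask_ratio
    masked = video_patches.masked_fill(mask.unsqueeze(-1), 0.0)   # simple zero-masking
    logits = mvm_head(video_encoder(masked))                      # (B, N, vocab) label predictions
    return F.cross_entropy(logits[mask], labels[mask])

if __name__ == "__main__":
    B, N, D, V = 2, 16, 64, 512
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=1)
    head = nn.Linear(D, V)
    codebook = torch.randn(V, D)
    tokenizer = lambda x: (x @ codebook.t()).argmax(-1)  # toy stand-in for the semantic VQ tokenizer
    print(float(mvm_loss(encoder, tokenizer, head, torch.randn(B, N, D))))
```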

SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors

  • paper_url: http://arxiv.org/abs/2311.17261
  • repo_url: https://github.com/daveredrum/SceneTex
  • paper_authors: Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner
  • for: Generating high-quality, style-consistent textures for indoor scenes.
  • methods: Leverages depth-to-image diffusion priors and formulates texture synthesis as an optimization problem in RGB space; a multiresolution texture field implicitly encodes the mesh appearance, the target texture is optimized with a score-distillation-based objective, and a cross-attention decoder keeps the style consistent across views.
  • results: On 3D-FRONT scenes, the method shows significant improvements in visual quality and prompt fidelity over prior texture generation methods.
    Abstract We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distillate diffusion latent features without accurate geometric and style cues, SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core, SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views, we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables various and accurate texture synthesis for 3D-FRONT scenes, demonstrating significant improvements in visual quality and prompt fidelity over the prior texture generation methods.

Pattern retrieval of traffic congestion using graph-based associations of traffic domain-specific features

  • paper_url: http://arxiv.org/abs/2311.17256
  • repo_url: None
  • paper_authors: Tin T. Nguyen, Simeon C. Calvert, Guopeng Li, Hans van Lint
  • for: This paper proposes a content-based retrieval system for spatiotemporal patterns of highway traffic congestion, which can help traffic management by locating similar patterns in big datasets.
  • methods: The proposed framework consists of two main components: pattern representation and similarity measurement. The paper uses a graph-based approach (relation-graph) for pattern representation, in which fundamental traffic phenomena are encoded as nodes and their spatiotemporal relationships as edges.
  • results: The proposed method is effective in retrieving similar patterns in a dataset of hundreds of patterns with various complexities, both temporally and spatially. The obtained patterns present similar traffic phenomena as in the given examples, and the success of the proposed approach opens up a new opportunity for semantic retrieval.
    Abstract The fast-growing amount of traffic data brings many opportunities for revealing more insightful information about traffic dynamics. However, it also demands an effective database management system in which information retrieval is arguably an important feature. The ability to locate similar patterns in big datasets potentially paves the way for further valuable analyses in traffic management. This paper proposes a content-based retrieval system for spatiotemporal patterns of highway traffic congestion. There are two main components in our framework, namely pattern representation and similarity measurement. To effectively interpret retrieval outcomes, the paper proposes a graph-based approach (relation-graph) for the former component, in which fundamental traffic phenomena are encoded as nodes and their spatiotemporal relationships as edges. In the latter component, the similarities between congestion patterns are customizable with various aspects according to user expectations. We evaluated the proposed framework by applying it to a dataset of hundreds of patterns with various complexities (temporally and spatially). The example queries indicate the effectiveness of the proposed method, i.e. the obtained patterns present similar traffic phenomena as in the given examples. In addition, the success of the proposed approach directly derives a new opportunity for semantic retrieval, in which expected patterns are described by adopting the relation-graph notion to associate fundamental traffic phenomena.
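A toy sketch of the relation-graph idea described above: fundamental traffic phenomena become labelled nodes and their spatiotemporal relationships become labelled edges, so congestion patterns can be compared graph-to-graph. The node/edge vocabulary and the Jaccard-style similarity below are illustrative stand-ins for the paper's customizable similarity measurement, not its actual definitions.

```python
import networkx as nx

def build_relation_graph(phenomena, relations):
    """phenomena: list of (node_id, kind); relations: list of (u, v, relation)."""
    g = nx.DiGraph()
    for node_id, kind in phenomena:
        g.add_node(node_id, kind=kind)
    for u, v, rel in relations:
        g.add_edge(u, v, relation=rel)
    return g

def label_jaccard(g1, g2):
    """Crude similarity: overlap of node kinds plus overlap of edge relations."""
    n1 = {d["kind"] for _, d in g1.nodes(data=True)}
    n2 = {d["kind"] for _, d in g2.nodes(data=True)}
    e1 = {d["relation"] for *_, d in g1.edges(data=True)}
    e2 = {d["relation"] for *_, d in g2.edges(data=True)}
    inter = len(n1 & n2) + len(e1 & e2)
    union = len(n1 | n2) + len(e1 | e2)
    return inter / union if union else 1.0

# Toy query: a wide moving jam upstream of a bottleneck.
query = build_relation_graph(
    [(0, "wide_moving_jam"), (1, "bottleneck")],
    [(0, 1, "upstream_of")],
)
candidate = build_relation_graph(
    [(0, "wide_moving_jam"), (1, "bottleneck"), (2, "stop_and_go")],
    [(0, 1, "upstream_of"), (2, 0, "follows")],
)
print(label_jaccard(query, candidate))
```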

SubZero: Subspace Zero-Shot MRI Reconstruction

  • paper_url: http://arxiv.org/abs/2311.17251
  • repo_url: https://github.com/heng14/subzero
  • paper_authors: Heng Yu, Yamin Arefeen, Berkin Bilgic
  • for: The purpose of this study is to accelerate MRI scans using zero-shot self-supervised learning and subspace models.
  • methods: The proposed method uses a parallel network framework and an attention mechanism to improve the performance of subspace-based zero-shot self-supervised learning, achieving higher acceleration factors.
  • results: Experimental results show that the proposed method outperforms existing methods in T1 and T2 mapping acquisitions.
    Abstract Recently introduced zero-shot self-supervised learning (ZS-SSL) has shown potential in accelerated MRI in a scan-specific scenario, which enabled high-quality reconstructions without access to a large training dataset. ZS-SSL has been further combined with the subspace model to accelerate 2D T2-shuffling acquisitions. In this work, we propose a parallel network framework and introduce an attention mechanism to improve subspace-based zero-shot self-supervised learning and enable higher acceleration factors. We name our method SubZero and demonstrate that it can achieve improved performance compared with current methods in T1 and T2 mapping acquisitions.

LightGaussian: Unbounded 3D Gaussian Compression with 15x Reduction and 200+ FPS

  • paper_url: http://arxiv.org/abs/2311.17245
  • repo_url: https://github.com/VITA-Group/LightGaussian
  • paper_authors: Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang
  • for: Improving the scalability and storage efficiency of 3D Gaussian Splatting while preserving rendering quality.
  • methods: LightGaussian draws on the idea of network pruning to identify and remove Gaussians that contribute little to scene reconstruction, uses distillation with pseudo-view augmentation to reduce the degree of the spherical harmonics, and applies VecTree Quantization to compress all attributes into a more compact representation.
  • results: LightGaussian achieves an average compression rate of over 15x while boosting FPS from 139 to 215, enabling efficient rendering of complex scenes.
    Abstract Recent advancements in real-time neural rendering using point-based techniques have paved the way for the widespread adoption of 3D representations. However, foundational approaches like 3D Gaussian Splatting come with a substantial storage overhead caused by growing the SfM points to millions, often demanding gigabyte-level disk space for a single unbounded scene, posing significant scalability challenges and hindering the splatting efficiency. To address this challenge, we introduce LightGaussian, a novel method designed to transform 3D Gaussians into a more efficient and compact format. Drawing inspiration from the concept of Network Pruning, LightGaussian identifies Gaussians that are insignificant in contributing to the scene reconstruction and adopts a pruning and recovery process, effectively reducing redundancy in Gaussian counts while preserving visual effects. Additionally, LightGaussian employs distillation and pseudo-view augmentation to distill spherical harmonics to a lower degree, allowing knowledge transfer to more compact representations while maintaining reflectance. Furthermore, we propose a hybrid scheme, VecTree Quantization, to quantize all attributes, resulting in lower bitwidth representations with minimal accuracy losses. In summary, LightGaussian achieves an averaged compression rate over 15x while boosting the FPS from 139 to 215, enabling an efficient representation of complex scenes on Mip-NeRF 360, Tank and Temple datasets. Project website: https://lightgaussian.github.io/
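A hedged sketch of the pruning idea only: score each Gaussian by a significance proxy and keep the most significant ones. The opacity-times-size score, the keep ratio, and the omission of the recovery/fine-tuning, SH distillation, and VecTree quantization steps are all simplifications, not LightGaussian's exact procedure.

```python
import torch

def prune_gaussians(means, scales, opacities, keep_ratio=0.34):
    """means: (N, 3), scales: (N, 3), opacities: (N,) in [0, 1]."""
    volume_proxy = scales.prod(dim=-1).clamp(min=1e-12)
    significance = opacities * volume_proxy.pow(1.0 / 3.0)   # crude size-aware importance score
    k = max(1, int(keep_ratio * means.shape[0]))
    keep = significance.topk(k).indices                       # keep the top-k Gaussians
    return means[keep], scales[keep], opacities[keep]

N = 100_000
means, scales = torch.randn(N, 3), torch.rand(N, 3) * 0.05
opacities = torch.rand(N)
m, s, o = prune_gaussians(means, scales, opacities)
print(f"kept {m.shape[0]} of {N} Gaussians")
```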

PHG-Net: Persistent Homology Guided Medical Image Classification

  • paper_url: http://arxiv.org/abs/2311.17243
  • repo_url: https://github.com/yaoppeng/topoclassification
  • paper_authors: Yaopeng Peng, Hongxiao Wang, Milan Sonka, Danny Z. Chen
  • for: Improving the accuracy of medical image classification by exploiting the topological features of objects.
  • methods: The persistent homology guided approach (PHG-Net) first computes the cubical persistence diagram of an input image, extracts the topological features into a vector representation with a small PH module, and fuses them with the feature maps produced by a CNN or Transformer.
  • results: PHG-Net achieves considerable improvements over state-of-the-art methods on three public datasets and can be integrated into any CNN or Transformer architecture in an end-to-end fashion.
    Abstract Modern deep neural networks have achieved great successes in medical image analysis. However, the features captured by convolutional neural networks (CNNs) or Transformers tend to be optimized for pixel intensities and neglect key anatomical structures such as connected components and loops. In this paper, we propose a persistent homology guided approach (PHG-Net) that explores topological features of objects for medical image classification. For an input image, we first compute its cubical persistence diagram and extract topological features into a vector representation using a small neural network (called the PH module). The extracted topological features are then incorporated into the feature map generated by CNN or Transformer for feature fusion. The PH module is lightweight and capable of integrating topological features into any CNN or Transformer architectures in an end-to-end fashion. We evaluate our PHG-Net on three public datasets and demonstrate its considerable improvements on the target classification tasks over state-of-the-art methods.
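A minimal sketch of the fusion step only: a vectorized persistence diagram passes through a small "PH module" and re-weights a backbone feature map. Computing the cubical persistence diagram itself (e.g., with a topological data analysis library) and the paper's exact fusion rule are outside this snippet; the gating form and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PHFusion(nn.Module):
    def __init__(self, ph_dim: int, channels: int):
        super().__init__()
        # Small "PH module" mapping topological features to per-channel weights.
        self.ph_module = nn.Sequential(nn.Linear(ph_dim, channels), nn.ReLU(),
                                       nn.Linear(channels, channels))

    def forward(self, feat_map: torch.Tensor, ph_vec: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) backbone features; ph_vec: (B, ph_dim) topological features.
        gate = torch.sigmoid(self.ph_module(ph_vec))[:, :, None, None]   # (B, C, 1, 1)
        return feat_map + feat_map * gate                                # topology-guided re-weighting

fusion = PHFusion(ph_dim=64, channels=256)
out = fusion(torch.randn(2, 256, 14, 14), torch.randn(2, 64))
print(out.shape)
```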

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

  • paper_url: http://arxiv.org/abs/2311.17241
  • repo_url: None
  • paper_authors: Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem
  • for: Improving temporal action detection (TAD): because of the memory bottleneck, only models of limited scale trained on limited data volumes can afford end-to-end training, which restricts TAD performance.
  • methods: The paper reduces the memory consumption of end-to-end training and scales the TAD backbone to 1 billion parameters and the input video to 1,536 frames. The key is the proposed temporal-informative adapter (TIA), a lightweight module that cuts training memory: the huge backbone is freed from adapting to the TAD task because only the parameters in TIA are updated, and TIA also yields better TAD representations by temporally aggregating context from adjacent frames throughout the backbone.
  • results: The model is evaluated on four representative datasets. Thanks to the efficient design, end-to-end training on VideoMAEv2-giant reaches 75.4% mAP on THUMOS14, surpassing the best feature-based methods.
    Abstract Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the first end-to-end model to outperform the best feature-based methods.
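A minimal sketch of the adapter idea: a lightweight bottleneck module aggregates temporal context from adjacent frames via a depthwise temporal convolution and is the only trainable component, while the large backbone stays frozen. Placement, kernel size, and bottleneck width are assumptions rather than the paper's exact TIA design.

```python
import torch
import torch.nn as nn

class TemporalInformativeAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, kernel: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depthwise temporal convolution mixes each channel across neighbouring frames.
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)           # start as an identity residual branch
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) per-frame features coming out of a frozen backbone block.
        h = self.down(x)                                          # (B, T, b)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)      # aggregate adjacent frames
        return x + self.up(torch.relu(h))                         # residual update

backbone = nn.Identity()                  # placeholder; a real giant backbone would be frozen like this:
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = TemporalInformativeAdapter(dim=256)   # only these parameters would be trained
feats = adapter(backbone(torch.randn(2, 96, 256)))
print(feats.shape)
```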

BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling

  • paper_url: http://arxiv.org/abs/2311.17218
  • repo_url: None
  • paper_authors: Yixuan Luo, Mengye Ren, Sai Qian Zhang
  • for: Improving the feature extraction capability of deep neural networks (DNNs) via masked image modeling pre-training.
  • methods: Block-Wise Masked Image Modeling (BIM) decomposes the MIM task into several sub-tasks with independent computation patterns, replacing end-to-end back-propagation with block-wise back-propagation.
  • results: BIM maintains MIM performance while greatly reducing peak memory consumption, and it enables the concurrent training of multiple DNN backbones of varying depths.
    Abstract Like masked language modeling (MLM) in natural language processing, masked image modeling (MIM) aims to extract valuable insights from image patches to enhance the feature extraction capabilities of the underlying deep neural network (DNN). Contrasted with other training paradigms like supervised learning and unsupervised contrastive learning, masked image modeling (MIM) pretraining typically demands significant computational resources in order to manage large training data batches (e.g., 4096). The significant memory and computation requirements pose a considerable challenge to its broad adoption. To mitigate this, we introduce a novel learning framework, termed~\textit{Block-Wise Masked Image Modeling} (BIM). This framework involves decomposing the MIM tasks into several sub-tasks with independent computation patterns, resulting in block-wise back-propagation operations instead of the traditional end-to-end approach. Our proposed BIM maintains superior performance compared to conventional MIM while greatly reducing peak memory consumption. Moreover, BIM naturally enables the concurrent training of numerous DNN backbones of varying depths. This leads to the creation of multiple trained DNN backbones, each tailored to different hardware platforms with distinct computing capabilities. This approach significantly reduces computational costs in comparison with training each DNN backbone individually. Our framework offers a promising solution for resource constrained training of MIM.
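An illustrative sketch of block-wise back-propagation: the backbone is split into blocks, each block gets its own local reconstruction head and loss, and features are detached between blocks so gradients never flow end-to-end. The masking scheme and the local decoders are simplified assumptions, not BIM's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_blocks = 128, 3
blocks = nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks)])
heads = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_blocks)])   # local reconstruction heads
opt = torch.optim.AdamW(list(blocks.parameters()) + list(heads.parameters()), lr=1e-4)

patches = torch.randn(8, 196, dim)                 # (B, N, C) image patch embeddings
mask = torch.rand(8, 196) < 0.6                    # positions to reconstruct
x = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # masked input

opt.zero_grad()
for block, head in zip(blocks, heads):
    x = block(x)
    local_loss = F.mse_loss(head(x)[mask], patches[mask])
    local_loss.backward()                          # back-propagation stays inside this block
    x = x.detach()                                 # stop gradients before the next block
opt.step()
print("block-wise step done")
```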

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.17216
  • repo_url: None
  • paper_authors: Hang Li, Chengzhi Shen, Philip Torr, Volker Tresp, Jindong Gu
  • for: Preventing diffusion-based text-to-image models from generating inappropriate content such as biased or harmful images.
  • methods: A self-supervised approach discovers interpretable latent directions corresponding to a given concept in the diffusion model's internal representation; the discovered vectors are then used in a simple scheme to mitigate inappropriate generation.
  • results: The approach helps avoid inappropriate generation and enables fair, safe, and responsible text-enhancing generation, as verified by extensive experiments.
    Abstract Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely, for fair generation, safe generation, and responsible text-enhancing generation.
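A toy sketch of the mitigation idea: once a latent direction for an undesired concept has been discovered, its component can be removed from the model's internal representation during generation. The discovery procedure itself (the paper's self-supervised objective) is not reproduced here; the projection-removal rule and the activation shape are assumptions.

```python
import torch

def suppress_concept(h: torch.Tensor, concept_dir: torch.Tensor, strength: float = 1.0):
    """h: (..., D) internal activations; concept_dir: (D,) discovered concept direction."""
    d = concept_dir / concept_dir.norm()
    coeff = (h * d).sum(-1, keepdim=True)          # projection onto the concept direction
    return h - strength * coeff * d                # remove (or scale down) that component

h = torch.randn(4, 77, 768)                        # mock conditioning states
v = torch.randn(768)                               # mock discovered direction
h_safe = suppress_concept(h, v)
# The remaining component along v is numerically ~0 after suppression.
print(((h_safe * (v / v.norm())).sum(-1)).abs().max())
```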

THInImg: Cross-modal Steganography for Presenting Talking Heads in Images

  • paper_url: http://arxiv.org/abs/2311.17177
  • repo_url: None
  • paper_authors: Lin Zhao, Hongxuan Li, Xuefei Ning, Xinru Jiang
  • for: Cross-modal steganography: hiding lengthy audio data (and subsequently decoding a talking-head video) inside a publicly available identity image without drawing attention, for covert communication, transmission, and copyright protection.
  • methods: The method exploits properties of the human face; an encoder-decoder pipeline with a novel architecture substantially increases the capacity for hiding audio in images, and multiple audio clips can be hidden iteratively, offering multiple levels of permission control.
  • results: Experiments show that THInImg can present up to 80 seconds of high-quality talking-head video (including audio) inside an identity image at 160x160 resolution.
    Abstract Cross-modal Steganography is the practice of concealing secret signals in publicly available cover signals (distinct from the modality of the secret signals) unobtrusively. While previous approaches primarily concentrated on concealing a relatively small amount of information, we propose THInImg, which manages to hide lengthy audio data (and subsequently decode talking head video) inside an identity image by leveraging the properties of human face, which can be effectively utilized for covert communication, transmission and copyright protection. THInImg consists of two parts: the encoder and decoder. Inside the encoder-decoder pipeline, we introduce a novel architecture that substantially increase the capacity of hiding audio in images. Moreover, our framework can be extended to iteratively hide multiple audio clips into an identity image, offering multiple levels of control over permissions. We conduct extensive experiments to prove the effectiveness of our method, demonstrating that THInImg can present up to 80 seconds of high quality talking-head video (including audio) in an identity image with 160x160 resolution.

Material Palette: Extraction of Materials from a Single Image

  • paper_url: http://arxiv.org/abs/2311.17060
  • repo_url: None
  • paper_authors: Ivan Lopes, Fabio Pizzati, Raoul de Charette
  • for: Extracting physically-based rendering (PBR) materials from a single real-world image.
  • methods: A two-step approach: first, a diffusion model maps regions of the image to material concepts, allowing the sampling of texture images resembling each material in the scene; second, a separate network decomposes the generated textures into Spatially Varying BRDFs (SVBRDFs), providing materials ready for use in rendering applications.
  • results: The approach builds on existing synthetic material libraries with SVBRDF ground truth and extends to new samples through unsupervised domain adaptation (UDA). It is evaluated systematically on synthetic and real-world datasets, demonstrating reliability and generalization, and is further shown to support editing the materials of 3D scenes from real photographs. Code and models will be open-sourced. Project page: https://astra-vision.github.io/MaterialPalette/
    Abstract In this paper, we propose a method to extract physically-based rendering (PBR) materials from a single real-world image. We do so in two steps: first, we map regions of the image to material concepts using a diffusion model, which allows the sampling of texture images resembling each material in the scene. Second, we benefit from a separate network to decompose the generated textures into Spatially Varying BRDFs (SVBRDFs), providing us with materials ready to be used in rendering applications. Our approach builds on existing synthetic material libraries with SVBRDF ground truth, but also exploits a diffusion-generated RGB texture dataset to allow generalization to new samples using unsupervised domain adaptation (UDA). Our contributions are thoroughly evaluated on synthetic and real-world datasets. We further demonstrate the applicability of our method for editing 3D scenes with materials estimated from real photographs. The code and models will be made open-source. Project page: https://astra-vision.github.io/MaterialPalette/

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.17061
  • repo_url: https://github.com/alvinliu0/HumanGaussian
  • paper_authors: Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu
  • for: High-quality 3D human generation from text prompts is an appealing but challenging task. Existing methods optimize 3D representations via Score Distillation Sampling (SDS) but suffer from a lack of fine detail or excessive training time. This paper presents an efficient and effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance.
  • methods: The key insight is that 3D Gaussian Splatting is an efficient renderer whose periodic Gaussian shrinkage or growing provides adaptive density control that can be naturally guided by intrinsic human structure. The framework first introduces a Structure-Aware SDS that jointly optimizes human appearance and geometry using a multi-modal score function over both RGB and depth space, then an Annealed Negative Prompt Guidance that decomposes SDS into a noisier generative score and a cleaner classifier score to address over-saturation; floating artifacts are further removed in a prune-only phase based on Gaussian size.
  • results: Experiments show that HumanGaussian renders vivid 3D humans under diverse scenarios with superior efficiency and competitive quality compared with existing methods, offering accurate control over appearance and geometry. Project page: https://alvinliu0.github.io/projects/HumanGaussian
    Abstract Realistic 3D human generation from text prompts is a desirable yet challenging task. Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time. In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance. Our key insight is that 3D Gaussian Splatting is an efficient renderer with periodic Gaussian shrinkage or growing, where such adaptive density control can be naturally guided by intrinsic human structures. Specifically, 1) we first propose a Structure-Aware SDS that simultaneously optimizes human appearance and geometry. The multi-modal score function from both RGB and depth space is leveraged to distill the Gaussian densification and pruning process. 2) Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS into a noisier generative score and a cleaner classifier score, which well addresses the over-saturation issue. The floating artifacts are further eliminated based on Gaussian size in a prune-only phase to enhance generation smoothness. Extensive experiments demonstrate the superior efficiency and competitive quality of our framework, rendering vivid 3D humans under diverse scenarios. Project Page: https://alvinliu0.github.io/projects/HumanGaussian

ReMoS: Reactive 3D Motion Synthesis for Two-Person Interactions

  • paper_url: http://arxiv.org/abs/2311.17057
  • repo_url: None
  • paper_authors: Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, Philipp Slusallek
  • for: This paper focuses on developing a method for synthesizing realistic two-person interactions in 3D human motion, addressing the complex dynamics of multi-human interactions.
  • methods: The proposed method, called ReMoS, is a denoising diffusion-based probabilistic model that explores two-person interactions. It synthesizes the reactive motion of the second person given the motion of the first person, including full-body motions and hand interactions.
  • results: The paper demonstrates the performance of ReMoS under a variety of challenging two-person scenarios, including pair-dancing, Ninjutsu, kickboxing, and acrobatics. The results show that the approach can generate realistic and diverse motions for both individuals in the interaction while providing an adequate amount of control for animators. Additionally, the paper introduces the ReMoCap dataset for two-person interactions, which consists of full-body and hand motions.
    Abstract Current approaches for 3D human motion synthesis can generate high-quality 3D animations of digital humans performing a wide variety of actions and gestures. However, there is still a notable technological gap in addressing the complex dynamics of multi-human interactions within this paradigm. In this work, we introduce ReMoS, a denoising diffusion-based probabilistic model for reactive motion synthesis that explores two-person interactions. Given the motion of one person, we synthesize the reactive motion of the second person to complete the interactions between the two. In addition to synthesizing the full-body motions, we also synthesize plausible hand interactions. We show the performance of ReMoS under a wide range of challenging two-person scenarios including pair-dancing, Ninjutsu, kickboxing, and acrobatics, where one person's movements have complex and diverse influences on the motions of the other. We further propose the ReMoCap dataset for two-person interactions consisting of full-body and hand motions. We evaluate our approach through multiple quantitative metrics, qualitative visualizations, and a user study. Our results are usable in interactive applications while also providing an adequate amount of control for animators.

Self-Supervised Motion Magnification by Backpropagating Through Optical Flow

  • paper_url: http://arxiv.org/abs/2311.17056
  • repo_url: https://github.com/dangeng/flowmag
  • paper_authors: Zhaoying Pan, Daniel Geng, Andrew Owens
  • for: A simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, the video is manipulated so that its new optical flow is scaled by the desired amount.
  • methods: A loss function estimates the optical flow of the generated video and penalizes its deviation from the given magnification factor; training differentiates through a pre-trained optical flow network, and because the model is self-supervised it can be further improved by test-time adaptation on the input video.
  • results: The method is evaluated on visual quality and quantitative metrics across a range of real-world and synthetic videos; it works with both supervised and unsupervised optical flow methods and can be extended to magnify only user-selected objects.
    Abstract This paper presents a simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, we manipulate the video such that its new optical flow is scaled by the desired amount. To train our model, we propose a loss function that estimates the optical flow of the generated video and penalizes how far if deviates from the given magnification factor. Thus, training involves differentiating through a pretrained optical flow network. Since our model is self-supervised, we can further improve its performance through test-time adaptation, by finetuning it on the input video. It can also be easily extended to magnify the motions of only user-selected objects. Our approach avoids the need for synthetic magnification datasets that have been used to train prior learning-based approaches. Instead, it leverages the existing capabilities of off-the-shelf motion estimators. We demonstrate the effectiveness of our method through evaluations of both visual quality and quantitative metrics on a range of real-world and synthetic videos, and we show our method works for both supervised and unsupervised optical flow methods.
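A hedged sketch of the self-supervised objective: the flow of the generated frame pair should match the magnification factor times the flow of the input pair, with gradients flowing through a frozen flow estimator into the generated frame. The tiny ConvNet standing in for a pre-trained optical flow network and the L1 penalty are assumptions so the snippet runs end-to-end; they are not the paper's exact estimator or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

flow_net = nn.Sequential(              # placeholder flow estimator: (B, 6, H, W) -> (B, 2, H, W)
    nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 2, 3, padding=1)
)
for p in flow_net.parameters():
    p.requires_grad_(False)            # frozen weights, but still differentiable w.r.t. its input

def magnification_loss(frame_a, frame_b, gen_b, alpha):
    """frame_a/frame_b: input pair; gen_b: magnified second frame; alpha: magnification factor."""
    with torch.no_grad():
        target_flow = alpha * flow_net(torch.cat([frame_a, frame_b], dim=1))
    pred_flow = flow_net(torch.cat([frame_a, gen_b], dim=1))   # gradients flow into gen_b
    return F.l1_loss(pred_flow, target_flow)

B, H, W = 1, 64, 64
frame_a, frame_b = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
gen_b = frame_b.clone().requires_grad_(True)    # stand-in for the magnifier network's output
loss = magnification_loss(frame_a, frame_b, gen_b, alpha=4.0)
loss.backward()
print(float(loss), gen_b.grad.abs().mean().item())
```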

Rethinking Directional Integration in Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.16504
  • repo_url: None
  • paper_authors: Congyue Deng, Jiawei Yang, Leonidas Guibas, Yue Wang
  • for: Improving the rendering quality of view-dependent effects in NeRF, a widely used method for multi-view 3D reconstruction.
  • methods: A simple modification to the NeRF rendering equation (a few lines of code for any NeRF variant) swaps the integration operator and the direction decoder network: only the positional features are integrated along the ray, and the directional terms are moved out of the integration, disentangling the view-dependent and view-independent components.
  • results: Experiments on different NeRF variations show consistent improvements in the quality of view-dependent effects; the modified equation exhibits better convergence with lower error accumulation under network approximation and numerical integration, and can be interpreted as light field rendering with learned ray embeddings.
    Abstract Recent works use the Neural radiance field (NeRF) to perform multi-view 3D reconstruction, providing a significant leap in rendering photorealistic scenes. However, despite its efficacy, NeRF exhibits limited capability of learning view-dependent effects compared to light field rendering or image-based view synthesis. To that end, we introduce a modification to the NeRF rendering equation which is as simple as a few lines of code change for any NeRF variations, while greatly improving the rendering quality of view-dependent effects. By swapping the integration operator and the direction decoder network, we only integrate the positional features along the ray and move the directional terms out of the integration, resulting in a disentanglement of the view-dependent and independent components. The modified equation is equivalent to the classical volumetric rendering in ideal cases on object surfaces with Dirac densities. Furthermore, we prove that with the errors caused by network approximation and numerical integration, our rendering equation exhibits better convergence properties with lower error accumulations compared to the classical NeRF. We also show that the modified equation can be interpreted as light field rendering with learned ray embeddings. Experiments on different NeRF variations show consistent improvements in the quality of view-dependent effects with our simple modification.
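A minimal sketch contrasting the classic compositing order with the modification described above: integrate only the position-dependent features along the ray, then apply the direction decoder once to the aggregated ray feature. Network sizes, activations, and the feature dimension are arbitrary placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

D = 32
feat_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, D + 1))    # features + density
dir_decoder = nn.Sequential(nn.Linear(D + 3, 64), nn.ReLU(), nn.Linear(64, 3))

def render_ray(points, view_dir, deltas):
    """points: (S, 3) samples on one ray, view_dir: (3,), deltas: (S,) segment lengths."""
    out = feat_mlp(points)
    feats, sigma = out[:, :D], torch.relu(out[:, D])
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-10], dim=0), dim=0)[:-1]
    weights = (alpha * trans).unsqueeze(-1)                    # (S, 1) compositing weights

    # Classic NeRF order: decode a colour per sample, then integrate along the ray.
    per_sample_rgb = torch.sigmoid(
        dir_decoder(torch.cat([feats, view_dir.expand(len(points), 3)], dim=-1)))
    classic_rgb = (weights * per_sample_rgb).sum(0)

    # Modified order: integrate positional features first, decode direction once per ray.
    ray_feat = (weights * feats).sum(0)
    modified_rgb = torch.sigmoid(dir_decoder(torch.cat([ray_feat, view_dir], dim=-1)))
    return classic_rgb, modified_rgb

pts = torch.linspace(0, 1, 64).unsqueeze(-1) * torch.tensor([[0.0, 0.0, 1.0]])
classic, modified = render_ray(pts, torch.tensor([0.0, 0.0, 1.0]), torch.full((64,), 1.0 / 64))
print(classic, modified)
```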

Surf-D: High-Quality Surface Generation for Arbitrary Topologies using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17050
  • repo_url: https://github.com/Yzmblog/SurfD
  • paper_authors: Zhengming Yu, Zhiyang Dou, Xiaoxiao Long, Cheng Lin, Zekun Li, Yuan Liu, Norman Müller, Taku Komura, Marc Habermann, Christian Theobalt, Xin Li, Wenping Wang
  • for: Surf-D, a new method for generating high-quality 3D shapes as surfaces with arbitrary topologies using diffusion models; it adopts the Unsigned Distance Field (UDF) as the surface representation because it handles arbitrary topology and enables complex shapes, whereas prior representations suffer from limited topologies and geometry details.
  • methods: A point-based autoencoder first learns a compact latent space that supports gradient querying for any input point through differentiation, capturing intricate geometry at high resolution; a curriculum learning strategy efficiently embeds surfaces of varying difficulty, and a latent diffusion model is then trained on the pre-trained shape latent space to capture the distribution of shapes.
  • results: The method shows superior shape generation across multiple modalities, with extensive experiments on unconditional generation, category-conditional generation, 3D reconstruction from images, and text-to-shape tasks.
    Abstract In this paper, we present Surf-D, a novel method for generating high-quality 3D shapes as Surfaces with arbitrary topologies using Diffusion models. Specifically, we adopt Unsigned Distance Field (UDF) as the surface representation, as it excels in handling arbitrary topologies, enabling the generation of complex shapes. While the prior methods explored shape generation with different representations, they suffer from limited topologies and geometry details. Moreover, it's non-trivial to directly extend prior diffusion models to UDF because they lack spatial continuity due to the discrete volume structure. However, UDF requires accurate gradients for mesh extraction and learning. To tackle the issues, we first leverage a point-based auto-encoder to learn a compact latent space, which supports gradient querying for any input point through differentiation to effectively capture intricate geometry at a high resolution. Since the learning difficulty for various shapes can differ, a curriculum learning strategy is employed to efficiently embed various surfaces, enhancing the whole embedding process. With pretrained shape latent space, we employ a latent diffusion model to acquire the distribution of various shapes. Our approach demonstrates superior performance in shape generation across multiple modalities and conducts extensive experiments in unconditional generation, category conditional generation, 3D reconstruction from images, and text-to-shape tasks.

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

  • paper_url: http://arxiv.org/abs/2311.17048
  • repo_url: None
  • paper_authors: Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang
  • for: Improving zero-shot referring expression comprehension, i.e., localizing bounding boxes in an image that correspond to a given textual prompt.
  • methods: Large foundation models disentangle both images and texts into (subject, predicate, object) triplets; grounding is performed by computing a structural similarity matrix between visual and textual triplets with a vision-language alignment (VLA) model, and a triplet-matching objective fine-tunes the VLA model on a curated dataset rich in entity relationships to equip it with relationship understanding.
  • results: The method improves zero-shot visual grounding by up to 19.5% over the state-of-the-art zero-shot model on RefCOCO/+/g, and on the more challenging Who's Waldo dataset the zero-shot approach achieves accuracy comparable to a fully supervised model.
    Abstract Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to the provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a capacity to understand relationships among disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). After that, grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model, and subsequently propagate it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability of relationship understanding, we design a triplet-matching objective to fine-tune the VLA models on a collection of curated dataset containing abundant entity relationships. Experiments demonstrate that our visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves comparable accuracy to the fully supervised model.

TLControl: Trajectory and Language Control for Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2311.17135
  • repo_url: None
  • paper_authors: Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, Lingjie Liu
  • for: Controllable human motion synthesis is essential for applications in AR/VR, gaming, movies, and embodied AI. Existing methods typically focus on either language or full-trajectory control alone and lack precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control.
  • methods: TLControl combines low-level trajectory control with high-level language semantics. A VQ-VAE is first trained to learn a compact latent motion space organized by body parts; a Masked Trajectories Transformer then makes coarse initial predictions of full joint trajectories from this latent space, conditioned on user-specified partial trajectories and text descriptions; finally, an efficient test-time optimization refines these coarse predictions for accurate trajectory control.
  • results: Experiments show that TLControl outperforms the state of the art in trajectory accuracy and time efficiency, making it practical for interactive, high-quality animation generation.
    Abstract Controllable human motion synthesis is essential for applications in AR/VR, gaming, movies, and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a new method for realistic human motion synthesis, incorporating both low-level trajectory and high-level language semantics controls. Specifically, we first train a VQ-VAE to learn a compact latent motion space organized by body parts. We then propose a Masked Trajectories Transformer to make coarse initial predictions of full trajectories of joints based on the learned latent motion space, with user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce an efficient test-time optimization to refine these coarse predictions for accurate trajectory control. Experiments demonstrate that TLControl outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.

Adversarial Diffusion Distillation

  • paper_url: http://arxiv.org/abs/2311.17042
  • repo_url: https://github.com/stability-ai/generative-models
  • paper_authors: Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach
  • for: A new training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality.
  • methods: Score distillation uses large-scale off-the-shelf image diffusion models as a teacher signal, combined with an adversarial loss that ensures high image fidelity even in the low-step regime of one or two sampling steps.
  • results: Analyses show that the model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps; ADD is the first method to unlock single-step, real-time image synthesis with foundation models.
    Abstract We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps while maintaining high image quality. We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal in combination with an adversarial loss to ensure high image fidelity even in the low-step regime of one or two sampling steps. Our analyses show that our model clearly outperforms existing few-step methods (GANs, Latent Consistency Models) in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first method to unlock single-step, real-time image synthesis with foundation models. Code and weights available under https://github.com/Stability-AI/generative-models and https://huggingface.co/stabilityai/ .

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

  • paper_url: http://arxiv.org/abs/2311.17034
  • repo_url: None
  • paper_authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
  • for: Improving semantic correspondence performance and examining a limitation of the features of current foundation models.
  • methods: Simple but effective solutions that incorporate geometry awareness into semantic correspondence, in both zero-shot and supervised settings; a new, challenging benchmark for semantic correspondence is also constructed from an existing animal pose estimation dataset.
  • results: The method achieves PCK@0.10 scores of 64.2 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, surpassing the state of the art by 4.3 and 11.0 points absolute, respectively.
    Abstract While pre-trained large-scale vision models have shown significant promise for semantic correspondence, their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset, for both pre-training validating models. Our method achieves a PCK@0.10 score of 64.2 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset, outperforming the state-of-the-art by 4.3p and 11.0p absolute gains, respectively. Our code and datasets will be publicly available.

Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features

  • paper_url: http://arxiv.org/abs/2311.17024
  • repo_url: None
  • paper_authors: Niladri Shekhar Dutt, Sanjeev Muralikrishnan, Niloy J. Mitra
  • for: A simple, robust, class-agnostic feature descriptor for untextured input shapes (meshes or point clouds).
  • methods: The input shape is rendered into depth and normal maps that guide conditional image synthesis, producing 2D diffusion features in the process; even though the multi-view conditional image generations may be inconsistent, the associated image features are robust and can be lifted and aggregated directly on the original surface.
  • results: Extensive experiments on multiple benchmarks (SHREC'19, SHREC'20, and TOSCA) show that the features, being semantic rather than geometric, produce reliable correspondence across both isometric and non-isometric shape families without additional data or training.
    Abstract We present Diff3F as a simple, robust, and class-agnostic feature descriptor that can be computed for untextured input shapes (meshes or point clouds). Our method distills diffusion features from image foundational models onto input shapes. Specifically, we use the input shapes to produce depth and normal maps as guidance for conditional image synthesis, and in the process produce (diffusion) features in 2D that we subsequently lift and aggregate on the original surface. Our key observation is that even if the conditional image generations obtained from multi-view rendering of the input shapes are inconsistent, the associated image features are robust and can be directly aggregated across views. This produces semantic features on the input shapes, without requiring additional data or training. We perform extensive experiments on multiple benchmarks (SHREC'19, SHREC'20, and TOSCA) and demonstrate that our features, being semantic instead of geometric, produce reliable correspondence across both isometeric and non-isometrically related shape families.
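A toy sketch of the lift-and-aggregate step: per-view 2D features are looked up at the pixels that surface points project to and averaged over the views in which each point is visible. Projection, visibility, and the diffusion features themselves are mocked with random data here; only the aggregation logic is illustrated.

```python
import torch

n_points, n_views, feat_dim, H = 1000, 8, 16, 32
surface_feats = torch.zeros(n_points, feat_dim)
counts = torch.zeros(n_points, 1)

for _ in range(n_views):
    feat_map = torch.randn(H, H, feat_dim)            # per-view 2D feature map (mock)
    pix = torch.randint(0, H, (n_points, 2))          # pixel each point projects to (mock)
    visible = torch.rand(n_points) > 0.3              # visibility mask (mock)
    lifted = feat_map[pix[:, 0], pix[:, 1]]           # (n_points, feat_dim) lifted features
    surface_feats[visible] += lifted[visible]
    counts[visible] += 1

surface_feats = surface_feats / counts.clamp(min=1)   # average over the views that saw each point
print(surface_feats.shape)
```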

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

  • paper_url: http://arxiv.org/abs/2311.17009
  • repo_url: None
  • paper_authors: Danah Yatim, Rafail Fridman, Omer Bar Tal, Yoni Kasten, Tali Dekel
  • for: Text-driven motion transfer: synthesizing a video that complies with an input text prompt describing the target objects and scene while preserving the motion and scene layout of the input video, even when the target and source objects differ drastically in shape and fine-grained motion characteristics.
  • methods: A pre-trained, fixed text-to-video diffusion model provides generative and motion priors; the core of the method is a new space-time feature loss derived directly from the model, which guides generation to preserve the overall motion of the input video while conforming to the target object's shape and fine-grained motion traits.
  • results: Experiments and comparative analyses show that the method achieves high-quality motion transfer to target objects and scenes with very different shapes and motion characteristics, and is more flexible and reliable than prior approaches.
    Abstract We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

  • paper_url: http://arxiv.org/abs/2311.17005
  • repo_url: https://github.com/opengvlab/ask-anything
  • paper_authors: Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao
  • for: Comprehensive evaluation of multi-modal large language models (MLLMs) across a wide variety of video understanding tasks.
  • methods: A novel static-to-dynamic method defines temporal-related tasks by transforming various static tasks into dynamic ones, and public video annotations are automatically converted into multiple-choice QA to evaluate each task; a robust video MLLM baseline, VideoChat2, is also built via progressive multi-modal training with diverse instruction-tuning data.
  • results: Existing MLLMs are far from satisfactory in temporal understanding, while VideoChat2 surpasses the leading models by over 15% on MVBench.
    Abstract With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.
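An illustrative sketch of converting a public video annotation into a multiple-choice QA item, as the benchmark construction describes. The field names and the random-distractor strategy are assumptions for illustration, not MVBench's actual pipeline.

```python
import random

def to_multiple_choice(annotation, label_pool, n_options=4, seed=0):
    """annotation: {'question': str, 'answer': str}; label_pool: candidate answer strings."""
    rng = random.Random(seed)
    distractors = rng.sample([c for c in label_pool if c != annotation["answer"]], n_options - 1)
    options = distractors + [annotation["answer"]]
    rng.shuffle(options)
    return {
        "question": annotation["question"],
        "options": options,
        "answer_index": options.index(annotation["answer"]),
    }

item = to_multiple_choice(
    {"question": "What does the person do after opening the door?", "answer": "walks outside"},
    ["walks outside", "sits down", "picks up a cup", "closes a window", "turns off the light"],
)
print(item)
```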

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

  • paper_url: http://arxiv.org/abs/2311.17002
  • repo_url: https://github.com/Ranni-T2I/Ranni
  • paper_authors: Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou
  • for: Improving the textual controllability of text-to-image (T2I) diffusion models, especially for complex prompts involving quantities, object-attribute binding, and multi-subject descriptions.
  • methods: A semantic panel serves as middleware when decoding text into images: visual concepts parsed from the input text with the help of large language models are arranged into the panel, which is injected into the denoising network as a detailed control signal complementing the text condition; a carefully designed semantic formatting protocol and a fully automatic data preparation pipeline support text-to-panel learning.
  • results: Ranni enhances the textual controllability of a pre-trained T2I generator, and the generative middleware enables a more convenient form of interaction (directly adjusting panel elements or using language instructions) as well as fine-grained customization; a practical system is built to showcase continuous generation and chatting-based editing.
    Abstract Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing.

COLE: A Hierarchical Generation Framework for Graphic Design

  • paper_url: http://arxiv.org/abs/2311.16974
  • repo_url: https://github.com/JarekPaulDonald/COLE
  • paper_authors: Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo
  • for: A framework that turns a user's intention prompt into a high-quality graphic design, addressing the creativity and lateral thinking that modern advertising design demands.
  • methods: A hierarchical generation framework decomposes the complex text-to-design task into a series of simpler sub-tasks, each handled by a specialized model (fine-tuned LLMs, large multimodal models, and diffusion models); the outputs of these models are then consolidated into a cohesive final design, and flexible editing based on user input is supported.
  • results: On the DESIGNERINTENTION benchmark, the COLE system shows clear advantages over existing methods in generating high-quality graphic designs from user intent.
    Abstract Graphic design, which has been evolving since the 15th century, plays a crucial role in advertising. The creation of high-quality designs demands creativity, innovation, and lateral thinking. This intricate task involves understanding the objective, crafting visual elements such as the background, decoration, font, color, and shape, formulating diverse professional layouts, and adhering to fundamental visual design principles. In this paper, we introduce COLE, a hierarchical generation framework designed to comprehensively address these challenges. This COLE system can transform a straightforward intention prompt into a high-quality graphic design, while also supporting flexible editing based on user input. Examples of such input might include directives like ``design a poster for Hisaishi's concert.'' The key insight is to dissect the complex task of text-to-design generation into a hierarchy of simpler sub-tasks, each addressed by specialized models working collaboratively. The results from these models are then consolidated to produce a cohesive final output. Our hierarchical task decomposition can streamline the complex process and significantly enhance generation reliability. Our COLE system consists of multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for a design-aware text or image generation task. Furthermore, we construct the DESIGNERINTENTION benchmark to highlight the superiority of our COLE over existing methods in generating high-quality graphic designs from user intent. We perceive our COLE as an important step towards addressing more complex visual design generation tasks in the future.

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

  • paper_url: http://arxiv.org/abs/2311.16961
  • repo_url: None
  • paper_authors: Jingbo Zhang, Xiaoyu Li, Qi Zhang, Yanpei Cao, Ying Shan, Jing Liao
  • for: Generating a 3D human model from a single reference image.
  • methods: Proposes a reference-guided diffusion framework (reference-guided score distillation sampling with region-aware attention) so that the generated 3D human preserves fine detail and stays consistent with the reference image.
  • results: Experiments show that the method outperforms prior approaches in generating 3D humans while preserving detail and view-consistent appearance across viewpoints.
    Abstract Generating a 3D human model from a single reference image is challenging because it requires inferring textures and geometries in invisible views while maintaining consistency with the reference image. Previous methods utilizing 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views. In this paper, we propose HumanRef, a 3D human generation framework from a single-view input. To ensure the generated 3D model is photorealistic and consistent with the input image, HumanRef introduces a novel method called reference-guided score distillation sampling (Ref-SDS), which effectively incorporates image guidance into the generation process. Furthermore, we introduce region-aware attention to Ref-SDS, ensuring accurate correspondence between different body regions. Experimental results demonstrate that HumanRef outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry, photorealistic textures, and view-consistent appearances.
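The core of score-distillation-style guidance used by approaches like Ref-SDS can be summarized by a short gradient computation. The sketch below is a generic SDS step, not the paper's exact Ref-SDS, which additionally injects reference-image guidance and region-aware attention; `denoiser` is a hypothetical frozen noise-prediction network.

```python
# Generic score-distillation gradient; `denoiser` is a hypothetical frozen
# diffusion noise predictor. Ref-SDS would additionally condition it on the
# reference image; that part is omitted here.
import torch

def sds_grad(render, denoiser, text_emb, alphas_cumprod, t):
    """Gradient to back-propagate into a rendered image `render` (B,C,H,W)."""
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(render)                        # injected noise
    noisy = a_t.sqrt() * render + (1 - a_t).sqrt() * eps  # forward diffusion
    with torch.no_grad():
        eps_pred = denoiser(noisy, t, text_emb)           # predicted noise
    return (1 - a_t) * (eps_pred - eps)                   # weighted residual

# Usage: render.backward(gradient=sds_grad(render, denoiser, emb, ac, t)),
# so the 3D representation producing `render` is pulled toward the prior.
```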

UC-NeRF: Neural Radiance Field for Under-Calibrated multi-view cameras in autonomous driving

  • paper_url: http://arxiv.org/abs/2311.16945
  • repo_url: None
  • paper_authors: Kai Cheng, Xiaoxiao Long, Wei Yin, Jin Wang, Zhiqiang Wu, Yuexin Ma, Kaixuan Wang, Xiaozhi Chen, Xuejin Chen
  • for: Novel view synthesis for multi-camera systems, aiming at high-quality synthesis despite inaccurate calibration across cameras.
  • methods: Three components are proposed: layer-based color correction, virtual warping, and spatiotemporally constrained pose refinement.
  • results: Experiments show that UC-NeRF accurately synthesizes novel views in under-calibrated multi-view camera systems, and the synthesized views further improve depth estimation in large-scale outdoor scenes.
    Abstract Multi-camera setups find widespread use across various applications, such as autonomous driving, as they greatly expand sensing capabilities. Despite the fast development of Neural radiance field (NeRF) techniques and their wide applications in both indoor and outdoor scenes, applying NeRF to multi-camera systems remains very challenging. This is primarily due to the inherent under-calibration issues in multi-camera setup, including inconsistent imaging effects stemming from separately calibrated image signal processing units in diverse cameras, and system errors arising from mechanical vibrations during driving that affect relative camera poses. In this paper, we present UC-NeRF, a novel method tailored for novel view synthesis in under-calibrated multi-view camera systems. Firstly, we propose a layer-based color correction to rectify the color inconsistency in different image regions. Second, we propose virtual warping to generate more viewpoint-diverse but color-consistent virtual views for color correction and 3D recovery. Finally, a spatiotemporally constrained pose refinement is designed for more robust and accurate pose calibration in multi-camera systems. Our method not only achieves state-of-the-art performance of novel view synthesis in multi-camera setups, but also effectively facilitates depth estimation in large-scale outdoor scenes with the synthesized novel views.

Image segmentation with traveling waves in an exactly solvable recurrent neural network

  • paper_url: http://arxiv.org/abs/2311.16943
  • repo_url: None
  • paper_authors: Luisa H. B. Liboni, Roberto C. Budzinski, Alexandra N. Busch, Sindy Löwe, Thomas A. Keller, Max Welling, Lyle E. Muller
  • for: Image segmentation using the spatiotemporal dynamics of a recurrent neural network.
  • methods: A recurrent network whose unit states are complex numbers; the resulting spatiotemporal dynamics effectively group an image according to the scene's structural characteristics.
  • results: An exact solution of the recurrent dynamics gives a precise mathematical account of how object segmentation arises, and a simple algorithm with a single fixed set of weights generalizes from simple geometric grayscale images to natural images.
    Abstract We study image segmentation using spatiotemporal dynamics in a recurrent neural network where the state of each unit is given by a complex number. We show that this network generates sophisticated spatiotemporal dynamics that can effectively divide an image into groups according to a scene's structural characteristics. Using an exact solution of the recurrent network's dynamics, we present a precise description of the mechanism underlying object segmentation in this network, providing a clear mathematical interpretation of how the network performs this task. We then demonstrate a simple algorithm for object segmentation that generalizes across inputs ranging from simple geometric objects in grayscale images to natural images. Object segmentation across all images is accomplished with one recurrent neural network that has a single, fixed set of weights. This demonstrates the expressive potential of recurrent neural networks when constructed using a mathematical approach that brings together their structure, dynamics, and computation.
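A toy illustration of the underlying idea is sketched below: pixels are coupled more strongly when their intensities are similar (an assumption made for this sketch, not the paper's exact construction), a complex-valued linear recurrence is iterated, and pixels belonging to the same structure end up sharing, in general, nearly identical phases that can then be clustered into segments.

```python
# Toy sketch only: phase-based grouping with a complex-valued linear recurrence.
import numpy as np

def phase_map(image, steps=50, sigma=0.1):
    h, w = image.shape
    x = image.reshape(-1)
    # Coupling matrix from intensity similarity (toy-sized images only).
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))
    K /= K.sum(axis=1, keepdims=True)
    z = np.exp(1j * 2 * np.pi * np.random.rand(h * w))  # random initial phases
    for _ in range(steps):
        z = K @ z                          # linear complex-valued recurrence
        z /= np.abs(z) + 1e-12             # keep unit magnitude
    return np.angle(z).reshape(h, w)       # cluster these phases to segment

img = np.zeros((16, 16)); img[4:12, 4:12] = 1.0
phases = phase_map(img)   # foreground and background phases generally differ
```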

The Sky’s the Limit: Re-lightable Outdoor Scenes via a Sky-pixel Constrained Illumination Prior and Outside-In Visibility

  • paper_url: http://arxiv.org/abs/2311.16937
  • repo_url: https://github.com/jadgardner/neusky
  • paper_authors: James A. D. Gardner, Evgenii Kashin, Bernhard Egger, William A. P. Smith
  • for: Inverse rendering of outdoor scenes from unconstrained image collections, tackling illumination/albedo ambiguity and occlusion (shadowing) of the illumination environment caused by geometry.
  • methods: A neural approach that uses sky pixels as direct measurements of distant lighting, a neural illumination prior to resolve the remaining environment, and a novel "outside-in" method for computing differentiable sky visibility based on a neural directional distance function.
  • results: Estimates high-quality albedo, geometry, illumination, and sky visibility, achieving state-of-the-art results on the NeRF-OSR relighting benchmark.
    Abstract Inverse rendering of outdoor scenes from unconstrained image collections is a challenging task, particularly illumination/albedo ambiguities and occlusion of the illumination environment (shadowing) caused by geometry. However, there are many cues in an image that can aid in the disentanglement of geometry, albedo and shadows. We exploit the fact that any sky pixel provides a direct measurement of distant lighting in the corresponding direction and, via a neural illumination prior, a statistical cue as to the remaining illumination environment. We also introduce a novel `outside-in' method for computing differentiable sky visibility based on a neural directional distance function. This is efficient and can be trained in parallel with the neural scene representation, allowing gradients from appearance loss to flow from shadows to influence estimation of illumination and geometry. Our method estimates high-quality albedo, geometry, illumination and sky visibility, achieving state-of-the-art results on the NeRF-OSR relighting benchmark. Our code and models can be found https://github.com/JADGardner/neusky

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16933
  • repo_url: None
  • paper_authors: Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, Bo Dai
  • for: Improves the flexibility and controllability of text-to-video (T2V) generation, enabling better control over video structure and content.
  • methods: SparseCtrl adds controllability through temporally sparse signals (e.g., depth/edge sequences for only one or a few frames) via an additional condition encoder, leaving the pre-trained T2V model untouched; it can be combined with modalities such as sketches, depth maps, and RGB images.
  • results: Experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly released at https://guoyww.github.io/projects/SparseCtrl.
    Abstract The development of text-to-video (T2V), i.e., generating videos with a given text prompt, has been significantly advanced in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community thus leverages the dense structure signals, e.g., per-frame depth/edge sequences, to enhance controllability, whose collection accordingly increases the burden of inference. In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and promoting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate the generalization of SparseCtrl on both original and personalized T2V generators. Codes and models will be publicly available at https://guoyww.github.io/projects/SparseCtrl .

LLaFS: When Large-Language Models Meet Few-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2311.16926
  • repo_url: https://github.com/lanyunzhu99/llafs
  • paper_authors: Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, Jun Liu
  • for: Proposes a few-shot segmentation method built on large language models (LLMs), leveraging their vast prior knowledge to improve image segmentation.
  • methods: A carefully designed input instruction lets the LLM produce segmentation results directly (as polygons), a region-attribute table simulates the human visual mechanism to provide multimodal guidance, and pseudo-sample synthesis with curriculum learning augments the data and improves optimization.
  • results: Achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks.
    Abstract This paper proposes LLaFS, the first attempt to leverage large language models (LLMs) in few-shot segmentation. In contrast to the conventional few-shot segmentation methods that only rely on the limited and biased information from the annotated support images, LLaFS leverages the vast prior knowledge gained by LLM as an effective supplement and directly uses the LLM to segment images in a few-shot manner. To enable the text-based LLM to handle image-related tasks, we carefully design an input instruction that allows the LLM to produce segmentation results represented as polygons, and propose a region-attribute table to simulate the human visual mechanism and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization. LLaFS achieves state-of-the-art results on multiple datasets, showing the potential of using LLMs for few-shot computer vision tasks. Code will be available at https://github.com/lanyunzhu99/LLaFS.

Super-Resolution through StyleGAN Regularized Latent Search: A Realism-Fidelity Trade-off

  • paper_url: http://arxiv.org/abs/2311.16923
  • repo_url: None
  • paper_authors: Marzieh Gheisari, Auguste Genovesio
  • for: Super-resolution: constructing a high-resolution (HR) image from a low-resolution (LR) input.
  • methods: Searches the latent space of a StyleGAN pre-trained on HR images for the image that best downscales to the LR input. Since such searches tend to produce out-of-domain images and reconstruct HR images inaccurately, two contributions are made: a new regularizer that keeps the recovered code on the original image manifold, and an expanded image prior around the optimal latent code to further improve reconstruction.
  • results: The method recovers realistic high-quality images at large magnification factors, and at low magnification factors it can still reconstruct details the generator alone could not produce, achieving a good trade-off between fidelity and realism.
    Abstract This paper addresses the problem of super-resolution: constructing a highly resolved (HR) image from a low resolved (LR) one. Recent unsupervised approaches search the latent space of a StyleGAN pre-trained on HR images, for the image that best downscales to the input LR image. However, they tend to produce out-of-domain images and fail to accurately reconstruct HR images that are far from the original domain. Our contribution is twofold. Firstly, we introduce a new regularizer to constrain the search in the latent space, ensuring that the inverted code lies in the original image manifold. Secondly, we further enhanced the reconstruction through expanding the image prior around the optimal latent code. Our results show that the proposed approach recovers realistic high-quality images for large magnification factors. Furthermore, for low magnification factors, it can still reconstruct details that the generator could not have produced otherwise. Altogether, our approach achieves a good trade-off between fidelity and realism for the super-resolution task.
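The optimization described above can be written compactly as a latent search with a fidelity term and a manifold regularizer. In the sketch below, `G` stands for a hypothetical pre-trained StyleGAN generator, and the mean-latent penalty is only a stand-in for the paper's specific regularizer.

```python
# Hedged sketch of regularized latent search for super-resolution.
import torch
import torch.nn.functional as F

def latent_search(G, lr_image, w_mean, scale, steps=500, lam=0.1, lr=0.05):
    w = w_mean.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        hr = G(w)                                          # candidate HR image
        down = F.interpolate(hr, scale_factor=1.0 / scale,
                             mode="bicubic", align_corners=False)
        fidelity = F.mse_loss(down, lr_image)              # match the LR input
        reg = lam * (w - w_mean).pow(2).mean()             # stay near manifold
        loss = fidelity + reg
        opt.zero_grad(); loss.backward(); opt.step()
    return G(w).detach()
```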

UGG: Unified Generative Grasping

  • paper_url: http://arxiv.org/abs/2311.16917
  • repo_url: https://github.com/autonomousvision/shape_as_points
  • paper_authors: Jiaxin Lu, Hao Kang, Haoxiang Li, Bo Liu, Yiding Yang, Qixing Huang, Gang Hua
  • for: Improving both the success rate and the diversity of dexterous grasping.
  • methods: UGG is a unified diffusion model that operates in the object point-cloud and hand-parameter spaces, using an all-transformer architecture to fuse object, hand, and contact-point information.
  • results: Achieves state-of-the-art dexterous grasping on the large-scale DexGraspNet dataset, and can also generate objects conditioned on hand information, offering valuable insights for object design and for studying how the generative model perceives objects.
    Abstract Dexterous grasping aims to produce diverse grasping postures with a high grasping success rate. Regression-based methods that directly predict grasping parameters given the object may achieve a high success rate but often lack diversity. Generation-based methods that generate grasping postures conditioned on the object can often produce diverse grasping, but they are insufficient for high grasping success due to lack of discriminative information. To mitigate, we introduce a unified diffusion-based dexterous grasp generation model, dubbed the name UGG, which operates within the object point cloud and hand parameter spaces. Our all-transformer architecture unifies the information from the object, the hand, and the contacts, introducing a novel representation of contact points for improved contact modeling. The flexibility and quality of our model enable the integration of a lightweight discriminator, benefiting from simulated discriminative data, which pushes for a high success rate while preserving high diversity. Beyond grasp generation, our model can also generate objects based on hand information, offering valuable insights into object design and studying how the generative model perceives objects. Our model achieves state-of-the-art dexterous grasping on the large-scale DexGraspNet dataset while facilitating human-centric object design, marking a significant advancement in dexterous grasping research. Our project page is https://jiaxin-lu.github.io/ugg/ .

Brain-ID: Learning Robust Feature Representations for Brain Imaging

  • paper_url: http://arxiv.org/abs/2311.16914
  • repo_url: https://github.com/peirong26/Brain-ID
  • paper_authors: Peirong Liu, Oula Puonti, Xiaoling Hu, Daniel C. Alexander, Juan Eugenio Iglesias
  • for: Provides a robust feature-representation learning strategy for brain imaging that can be applied across different acquisition protocols.
  • methods: A feature-representation strategy called Brain-ID that is contrast-agnostic and robust to each subject's brain anatomy; it is trained entirely on synthetic data and adapts to downstream tasks and new data with a simple one-layer solution.
  • results: Experiments show that Brain-ID performs strongly across diverse brain-imaging tasks and preserves its performance even when only limited training data is available.
    Abstract Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography, yet they struggle to generalize in uncalibrated modalities -- notoriously magnetic resonance imaging (MRI), where performance is highly sensitive to the differences in MR contrast, resolution, and orientation between the training and testing data. This prevents broad applicability to the diverse clinical acquisition protocols in the real world. We introduce Brain-ID, a robust feature representation learning strategy for brain imaging, which is contrast-agnostic, and robust to the brain anatomy of each subject regardless of the appearance of acquired images (i.e., deformation, contrast, resolution, orientation, artifacts, etc). Brain-ID is trained entirely on synthetic data, and easily adapts to downstream tasks with our proposed simple one-layer solution. We validate the robustness of Brain-ID features, and evaluate their performance in a variety of downstream applications, including both contrast-independent (anatomy reconstruction/contrast synthesis, brain segmentation), and contrast-dependent (super-resolution, bias field estimation) tasks. Extensive experiments on 6 public datasets demonstrate that Brain-ID achieves state-of-the-art performance in all tasks, and more importantly, preserves its performance when only limited training data is available.

Feedback RoI Features Improve Aerial Object Detection

  • paper_url: http://arxiv.org/abs/2311.17129
  • repo_url: None
  • paper_authors: Botao Ren, Botian Xu, Tengyu Liu, Jingyi Wang, Zhidong Deng
  • for: Proposes an object detection method that uses high-level feedback information to adapt to signals of different characteristics.
  • methods: Image-wise and instance-level feedback information refines RoI feature selection in response to image-quality variation and classification uncertainty.
  • results: Experiments show consistent improvements on the challenging aerial detection datasets DOTA-v1.0, DOTA-v1.5, and HRSC2016, and further experiments on MS COCO show that the module is also effective in general detection models.
    Abstract Neuroscience studies have shown that the human visual system utilizes high-level feedback information to guide lower-level perception, enabling adaptation to signals of different characteristics. In light of this, we propose Feedback multi-Level feature Extractor (Flex) to incorporate a similar mechanism for object detection. Flex refines feature selection based on image-wise and instance-level feedback information in response to image quality variation and classification uncertainty. Experimental results show that Flex offers consistent improvement to a range of existing SOTA methods on the challenging aerial object detection datasets including DOTA-v1.0, DOTA-v1.5, and HRSC2016. Although the design originates in aerial image detection, further experiments on MS COCO also reveal our module's efficacy in general detection models. Quantitative and qualitative analyses indicate that the improvements are closely related to image qualities, which match our motivation.

Lane-Keeping Control of Autonomous Vehicles Through a Soft-Constrained Iterative LQR

  • paper_url: http://arxiv.org/abs/2311.16900
  • repo_url: None
  • paper_authors: Der-Hau Lee
  • for: Accurate prediction of smooth steering inputs is critical in autonomous driving, since jittery control actions can destabilize the vehicle system.
  • methods: Combines the CILQR algorithm with a model predictive control (MPC) constraint-relaxation technique into a soft-CILQR algorithm; slack variables are integrated into the state and control barrier functions of the solver to soften the constraints so that stabilizing control inputs can be computed in a relatively simple manner.
  • results: Numerical simulations with a linear system dynamics model and vision-based driving experiments are used to compare soft-CILQR with CILQR. Both solvers drive the system toward the reference state asymptotically, but soft-CILQR obtains smooth steering trajectories more easily under additive disturbances, and in the experiments it outperforms CILQR in tracking accuracy and steering smoothness.
    Abstract The accurate prediction of smooth steering inputs is crucial for autonomous vehicle applications because control actions with jitter might cause the vehicle system to become unstable. To address this problem in automobile lane-keeping control without the use of additional smoothing algorithms, we developed a soft-constrained iterative linear-quadratic regulator (soft-CILQR) algorithm by integrating CILQR algorithm and a model predictive control (MPC) constraint relaxation method. We incorporated slack variables into the state and control barrier functions of the soft-CILQR solver to soften the constraints in the optimization process so that stabilizing control inputs can be calculated in a relatively simple manner. Two types of automotive lane-keeping experiments were conducted with a linear system dynamics model to test the performance of the proposed soft-CILQR algorithm and to compare its performance with that of the CILQR algorithm: numerical simulations and experiments involving challenging vision-based maneuvers. In the numerical simulations, the soft-CILQR and CILQR solvers managed to drive the system toward the reference state asymptotically; however, the soft-CILQR solver obtained smooth steering input trajectories more easily than did the CILQR solver under conditions involving additive disturbances. In the experiments with visual inputs, the soft-CILQR controller outperformed the CILQR controller in terms of tracking accuracy and steering smoothness during the driving of an ego vehicle on TORCS.
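The constraint-softening idea can be illustrated with a single cost term: a slack variable relaxes the hard barrier and is itself penalized. The exact barrier shaping and weights used in soft-CILQR are not reproduced here; the numbers below are placeholders.

```python
# Minimal sketch of a "softened" constraint: instead of a hard barrier on
# g(x) <= 0, a slack variable s >= 0 relaxes the constraint and is penalized,
# so the optimizer can always find a finite-cost solution.
import numpy as np

def soft_barrier_cost(g_val, slack, q1=1.0, q2=5.0, rho=100.0):
    # Exponential barrier on the relaxed constraint g(x) - s <= 0 ...
    barrier = q1 * np.exp(q2 * (g_val - slack))
    # ... plus a quadratic penalty that discourages large slack.
    penalty = rho * slack ** 2
    return barrier + penalty

# Example: a lane-boundary constraint g(x) = |lateral_offset| - half_lane_width.
cost = soft_barrier_cost(g_val=0.2, slack=0.25)
```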

Dendrogram distance: an evaluation metric for generative networks using hierarchical clustering

  • paper_url: http://arxiv.org/abs/2311.16894
  • repo_url: None
  • paper_authors: Gustavo Sutter Carvalho, Moacir Antonelli Ponti
  • for: Proposes a new evaluation metric for generative models, focused primarily on generative networks.
  • methods: Represents real and fake data with dendrograms so that the divergence between training and generated samples can be computed; the metric targets mode collapse, i.e., generators that fail to capture all modes of the training set.
  • results: A validation scheme based on sampling from real datasets evaluates the metric in a controlled environment, where it proves competitive with other state-of-the-art approaches.
    Abstract We present a novel metric for generative modeling evaluation, focusing primarily on generative networks. The method uses dendrograms to represent real and fake data, allowing for the divergence between training and generated samples to be computed. This metric focus on mode collapse, targeting generators that are not able to capture all modes in the training set. To evaluate the proposed method it is introduced a validation scheme based on sampling from real datasets, therefore the metric is evaluated in a controlled environment and proves to be competitive with other state-of-the-art approaches.
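A hedged sketch of the overall recipe follows: build a dendrogram on real samples and on generated samples, then compare the two trees. The specific comparison below (L2 distance between sorted merge heights) is an illustrative proxy, not necessarily the paper's exact definition.

```python
# Illustrative dendrogram-based comparison of real vs. generated samples.
import numpy as np
from scipy.cluster.hierarchy import linkage

def dendrogram_summary(samples):
    Z = linkage(samples, method="ward")   # agglomerative clustering tree
    return np.sort(Z[:, 2])               # merge heights, sorted

def dendrogram_distance(real, fake):
    h_real = dendrogram_summary(real)
    h_fake = dendrogram_summary(fake)
    n = min(len(h_real), len(h_fake))     # trees may differ in size
    return float(np.linalg.norm(h_real[:n] - h_fake[:n]) / n)

real = np.random.randn(256, 16)           # e.g., feature embeddings
fake = np.random.randn(256, 16) * 0.5     # a "mode-collapsed" generator
print(dendrogram_distance(real, fake))
```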

A Unified Approach for Text- and Image-guided 4D Scene Generation

  • paper_url: http://arxiv.org/abs/2311.16854
  • repo_url: None
  • paper_authors: Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello
  • for: Explores how large-scale diffusion generative models can be used to create images, videos, and 3D assets from user-provided text prompts and images, focusing on text-to-4D dynamic scene generation.
  • methods: A novel two-stage approach: the first stage uses 3D and 2D diffusion guidance to learn a high-quality static 3D asset, and the second stage learns motion with a deformable neural radiance field, a multi-resolution feature grid for the deformation field, and a displacement total-variation loss under video diffusion guidance.
  • results: A user preference study shows clear improvements in image and motion quality, 3D consistency, and text fidelity. Thanks to its motion-disentangled representation, the method also adapts to controllable generation where appearance is defined by one or more images, without modifying the motion-learning stage, yielding a unified approach for text-to-4D, image-to-4D, and personalized 4D generation.
    Abstract Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.

Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration

  • paper_url: http://arxiv.org/abs/2311.16845
  • repo_url: None
  • paper_authors: Chen Zhao, Weiling Cai, Chenyu Dong, Chengwei Hu
  • for: Improving underwater image quality.
  • methods: Exploits frequency-domain information together with diffusion models: a wavelet-based Fourier information interaction network (WFI2-net) for preliminary enhancement in the wavelet space, and a frequency residual diffusion adjustment module (FRDAM) that refines the high- and low-frequency information.
  • results: Achieves SOTA performance on real-world underwater image datasets and competitive visual quality compared with other methods.
    Abstract Underwater images are subject to intricate and diverse degradation, inevitably affecting the effectiveness of underwater visual tasks. However, most approaches primarily operate in the raw pixel space of images, which limits the exploration of the frequency characteristics of underwater images, leading to an inadequate utilization of deep models' representational capabilities in producing high-quality images. In this paper, we introduce a novel Underwater Image Enhancement (UIE) framework, named WF-Diff, designed to fully leverage the characteristics of frequency domain information and diffusion models. WF-Diff consists of two detachable networks: Wavelet-based Fourier information interaction network (WFI2-net) and Frequency Residual Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency domain information, WFI2-net aims to achieve preliminary enhancement of frequency information in the wavelet space. Our proposed FRDAM can further refine the high- and low-frequency information of the initial enhanced images, which can be viewed as a plug-and-play universal module to adjust the detail of the underwater images. With the above techniques, our algorithm can show SOTA performance on real-world underwater image datasets, and achieves competitive performance in visual quality.
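The frequency decomposition this kind of approach builds on can be obtained with a standard 2D discrete wavelet transform. The sketch below only shows the split into low- and high-frequency subbands and the inverse transform; the enhancement networks themselves are omitted, and the image is a stand-in.

```python
# Sketch of the wavelet split underlying WF-Diff-style methods.
import numpy as np
import pywt

image = np.random.rand(256, 256)                    # stand-in grayscale image
LL, (LH, HL, HH) = pywt.dwt2(image, "haar")         # low- and high-freq bands

# ... enhance LL (coarse color/contrast) and LH/HL/HH (edges, texture) ...
restored = pywt.idwt2((LL, (LH, HL, HH)), "haar")   # invert back to image space
assert restored.shape == image.shape
```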

Self-training solutions for the ICCV 2023 GeoNet Challenge

  • paper_url: http://arxiv.org/abs/2311.16843
  • repo_url: https://github.com/tim-learn/geonet23_casia_tim
  • paper_authors: Lijun Sheng, Zhengbo Wang, Jian Liang
  • for: Domain adaptation on the GeoNet dataset, improving model performance across large geographic gaps.
  • methods: A two-stage source-free domain adaptation framework with a Swin Transformer backbone to transfer knowledge from the USA (source) domain to the Asia (target) domain. The first stage trains a source model on labeled source data with a re-sampling strategy and two types of cross-entropy loss; the second stage generates pseudo labels for the unlabeled target data to fine-tune the model.
  • results: Achieves an H-score of 74.56% and ranks 1st in the GeoUniDA challenge, with top-3 accuracies of 64.46% and 51.23% in the GeoImNet and GeoPlaces challenges, respectively.
    Abstract GeoNet is a recently proposed domain adaptation benchmark consisting of three challenges (i.e., GeoUniDA, GeoImNet, and GeoPlaces). Each challenge contains images collected from the USA and Asia where there are huge geographical gaps. Our solution adopts a two-stage source-free domain adaptation framework with a Swin Transformer backbone to achieve knowledge transfer from the USA (source) domain to Asia (target) domain. In the first stage, we train a source model using labeled source data with a re-sampling strategy and two types of cross-entropy loss. In the second stage, we generate pseudo labels for unlabeled target data to fine-tune the model. Our method achieves an H-score of 74.56% and ultimately ranks 1st in the GeoUniDA challenge. In GeoImNet and GeoPlaces challenges, our solution also reaches a top-3 accuracy of 64.46% and 51.23%, respectively.
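The second-stage self-training step can be sketched as confidence-filtered pseudo-labelling. This is a schematic of the general procedure rather than the exact challenge recipe; the threshold value and the single cross-entropy loss are assumptions.

```python
# Schematic pseudo-label fine-tuning step on unlabeled target images.
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, target_images, threshold=0.9):
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(target_images), dim=1)
        conf, pseudo = probs.max(dim=1)            # confidence and hard label
        keep = conf > threshold                    # drop uncertain samples
    if keep.sum() == 0:
        return None
    model.train()
    loss = F.cross_entropy(model(target_images[keep]), pseudo[keep])
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```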

Unified-modal Salient Object Detection via Adaptive Prompt Learning

  • paper_url: http://arxiv.org/abs/2311.16835
  • repo_url: None
  • paper_authors: Kunpeng Wang, Chenglong Li, Zhengzheng Tu, Bin Luo
  • for: Targets both single-modal and multi-modal salient object detection (SOD), aiming to solve these tasks within one simple framework.
  • methods: UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning; the prompts are plugged into a pre-trained baseline SOD model to handle the corresponding task while requiring only a few learnable parameters.
  • results: Consistent performance improvements on 14 benchmark datasets show that UniSOD effectively and efficiently unifies single-modal and multi-modal SOD tasks.
    Abstract Existing single-modal and multi-modal salient object detection (SOD) methods focus on designing specific architectures tailored for their respective tasks. However, developing completely different models for different tasks leads to labor and time consumption, as well as high computational and practical deployment costs. In this paper, we make the first attempt to address both single-modal and multi-modal SOD in a unified framework called UniSOD. Nevertheless, assigning appropriate strategies to modality variable inputs is challenging. To this end, UniSOD learns modality-aware prompts with task-specific hints through adaptive prompt learning, which are plugged into the proposed pre-trained baseline SOD model to handle corresponding tasks, while only requiring few learnable parameters compared to training the entire model. Each modality-aware prompt is generated from a switchable prompt generation block, which performs structural switching solely relied on single-modal and multi-modal inputs. UniSOD achieves consistent performance improvement on 14 benchmark datasets for RGB, RGB-D, and RGB-T SOD, which demonstrates that our method effectively and efficiently unifies single-modal and multi-modal SOD tasks.

1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness

  • paper_url: http://arxiv.org/abs/2311.16833
  • repo_url: https://github.com/berndprach/1lipschitzlayerscompared
  • paper_authors: Bernd Prach, Fabio Brau, Giorgio Buttazzo, Christoph H. Lampert
  • for: Compares different methods for building 1-Lipschitz neural networks from Lipschitz-bounded dense and convolutional layers, and how these methods perform under different resource constraints.
  • methods: A thorough theoretical and empirical comparison of existing 1-Lipschitz constructions and training strategies, evaluating them in terms of memory usage, speed, and certifiable robust accuracy.
  • results: The comparison reveals trade-offs among the methods (e.g., lower memory usage versus longer training time, or better speed at higher computational cost), and the paper provides guidelines and recommendations to help users select the method that works best with the available resources.
    Abstract The robustness of neural networks against input perturbations with bounded magnitude represents a serious concern in the deployment of deep learning models in safety-critical systems. Recently, the scientific community has focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz neural networks that leverage Lipschitz bounded dense and convolutional layers. Although different methods have been proposed in the literature to achieve this goal, understanding the performance of such methods is not straightforward, since different metrics can be relevant (e.g., training time, memory usage, accuracy, certifiable robustness) for different applications. For this reason, this work provides a thorough theoretical and empirical comparison between methods by evaluating them in terms of memory usage, speed, and certifiable robust accuracy. The paper also provides some guidelines and recommendations to support the user in selecting the methods that work best depending on the available resources. We provide code at https://github.com/berndprach/1LipschitzLayersCompared.
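One simple member of the family of 1-Lipschitz layers compared in such studies is a dense layer normalized by its spectral norm. The sketch below estimates the norm with a few power-iteration steps per forward pass; practical implementations typically keep a persistent estimate instead, and many of the methods compared in the paper use quite different constructions (e.g., orthogonal layers).

```python
# Sketch: a dense layer made 1-Lipschitz by spectral normalization,
# so ||Wx - Wy|| <= ||x - y|| for all inputs x, y.
import torch
import torch.nn as nn

class SpectralNormLinear(nn.Module):
    def __init__(self, in_dim, out_dim, power_iters=5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.power_iters = power_iters

    def forward(self, x):
        w = self.weight
        u = torch.randn(w.shape[0], device=w.device)
        for _ in range(self.power_iters):        # power iteration for sigma_max
            v = torch.nn.functional.normalize(w.t() @ u, dim=0)
            u = torch.nn.functional.normalize(w @ v, dim=0)
        sigma = torch.dot(u, w @ v)              # largest singular value
        return x @ (w / sigma).t() + self.bias   # 1-Lipschitz linear map
```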

Decomposer: Semi-supervised Learning of Image Restoration and Image Decomposition

  • paper_url: http://arxiv.org/abs/2311.16829
  • repo_url: None
  • paper_authors: Boris Meinardus, Mariusz Trzeciakiewicz, Tim Herzig, Monika Kwiatkowski, Simon Matern, Olaf Hellwich
  • for: A semi-supervised reconstruction model that decomposes distorted image sequences into their fundamental building blocks: the original image and the applied augmentations (shadow, light, and occlusions).
  • methods: Uses the SIDAR dataset, which provides many distorted sequences where shadows, lighting, and occlusions (additive or multiplicative noise) are applied to an undistorted image. A transformer-based model explicitly learns the decomposition, with 3D Swin-Transformers for spatio-temporal encoding and 3D U-Nets as prediction heads for the individual components.
  • results: Separately pre-training the model on weakly supervised pseudo labels steers it to optimize the ambiguous problem definition and to differentiate between the different image distortions, accurately recovering the basic building blocks of distorted sequences.
    Abstract We present Decomposer, a semi-supervised reconstruction model that decomposes distorted image sequences into their fundamental building blocks - the original image and the applied augmentations, i.e., shadow, light, and occlusions. To solve this problem, we use the SIDAR dataset that provides a large number of distorted image sequences: each sequence contains images with shadows, lighting, and occlusions applied to an undistorted version. Each distortion changes the original signal in different ways, e.g., additive or multiplicative noise. We propose a transformer-based model to explicitly learn this decomposition. The sequential model uses 3D Swin-Transformers for spatio-temporal encoding and 3D U-Nets as prediction heads for individual parts of the decomposition. We demonstrate that by separately pre-training our model on weakly supervised pseudo labels, we can steer our model to optimize for our ambiguous problem definition and learn to differentiate between the different image distortions.

SARA: Controllable Makeup Transfer with Spatial Alignment and Region-Adaptive Normalization

  • paper_url: http://arxiv.org/abs/2311.16828
  • repo_url: None
  • paper_authors: Xiaojing Zhong, Xinyi Huang, Zhonghua Wu, Guosheng Lin, Qingyao Wu
  • for: High-quality makeup transfer that can handle large spatial misalignments and offers part-specific, shade-controllable transfer.
  • methods: SARA consists of three modules: a spatial alignment module that preserves the spatial context of makeup and provides a target semantic map to guide shape-independent style codes; a region-adaptive normalization module that decouples shape and makeup style via per-region encoding and normalization, eliminating spatial misalignments; and a makeup fusion module that blends identity features and makeup style through learned scale and bias parameters.
  • results: Experiments show that SARA outperforms previous methods and achieves state-of-the-art performance on two public datasets.
    Abstract Makeup transfer is a process of transferring the makeup style from a reference image to the source images, while preserving the source images' identities. This technique is highly desirable and finds many applications. However, existing methods lack fine-level control of the makeup style, making it challenging to achieve high-quality results when dealing with large spatial misalignments. To address this problem, we propose a novel Spatial Alignment and Region-Adaptive normalization method (SARA) in this paper. Our method generates detailed makeup transfer results that can handle large spatial misalignments and achieve part-specific and shade-controllable makeup transfer. Specifically, SARA comprises three modules: Firstly, a spatial alignment module that preserves the spatial context of makeup and provides a target semantic map for guiding the shape-independent style codes. Secondly, a region-adaptive normalization module that decouples shape and makeup style using per-region encoding and normalization, which facilitates the elimination of spatial misalignments. Lastly, a makeup fusion module blends identity features and makeup style by injecting learned scale and bias parameters. Experimental results show that our SARA method outperforms existing methods and achieves state-of-the-art performance on two public datasets.

Denoising Diffusion Probabilistic Models for Image Inpainting of Cell Distributions in the Human Brain

  • paper_url: http://arxiv.org/abs/2311.16821
  • repo_url: None
  • paper_authors: Jan-Oliver Kropp, Christian Schiffer, Katrin Amunts, Timo Dickscheid
  • for: Studying the multi-scale architecture of the human brain, including its subdivision into brain areas and nuclei, cortical layers, columns, and cell clusters down to single-cell morphology.
  • methods: High-performance computing and imaging are used to map the entire human brain at the cellular level, with brain-mapping and cell-segmentation methods enabling rapid, automated analysis of cytoarchitecture and cell distribution in complete series of histological sections.
  • results: Proposes an inpainting model based on a denoising diffusion probabilistic model that reliably fills in missing information in the image data, following the true cell distribution at different scales and generating highly realistic image content with plausible cell statistics and cytoarchitectonic patterns.
    Abstract Recent advances in imaging and high-performance computing have made it possible to image the entire human brain at the cellular level. This is the basis to study the multi-scale architecture of the brain regarding its subdivision into brain areas and nuclei, cortical layers, columns, and cell clusters down to single cell morphology Methods for brain mapping and cell segmentation exploit such images to enable rapid and automated analysis of cytoarchitecture and cell distribution in complete series of histological sections. However, the presence of inevitable processing artifacts in the image data caused by missing sections, tears in the tissue, or staining variations remains the primary reason for gaps in the resulting image data. To this end we aim to provide a model that can fill in missing information in a reliable way, following the true cell distribution at different scales. Inspired by the recent success in image generation, we propose a denoising diffusion probabilistic model (DDPM), trained on light-microscopic scans of cell-body stained sections. We extend this model with the RePaint method to impute missing or replace corrupted image data. We show that our trained DDPM is able to generate highly realistic image information for this purpose, generating plausible cell statistics and cytoarchitectonic patterns. We validate its outputs using two established downstream task models trained on the same data.
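The RePaint-style imputation mentioned above amounts to blending, at every reverse diffusion step, a re-noised copy of the observed pixels with the model's prediction for the missing pixels. In the sketch below, `model_reverse_step` is a hypothetical single DDPM sampling step, and the resampling loops used by RePaint are omitted.

```python
# Hedged sketch of one RePaint-style reverse step (t >= 1 assumed).
import torch

def repaint_step(x_t, t, known_image, mask, alphas_cumprod, model_reverse_step):
    # mask == 1 where tissue was imaged, 0 where the section is missing.
    a_prev = alphas_cumprod[t - 1]
    noise = torch.randn_like(known_image)
    x_known = a_prev.sqrt() * known_image + (1 - a_prev).sqrt() * noise
    x_unknown = model_reverse_step(x_t, t)          # one learned denoising step
    return mask * x_known + (1 - mask) * x_unknown  # x_{t-1}
```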

DI-Net : Decomposed Implicit Garment Transfer Network for Digital Clothed 3D Human

  • paper_url: http://arxiv.org/abs/2311.16818
  • repo_url: None
  • paper_authors: Xiaojing Zhong, Yukun Su, Zhonghua Wu, Guosheng Lin, Qingyao Wu
  • for: 3D virtual try-on has broad applications but remains difficult: existing 2D try-on methods cannot be extended directly to 3D because they lack per-pixel depth perception, and most 3D approaches rely on fixed topological structures and heavy computation. DI-Net (Decomposed Implicit garment transfer network) is proposed to effortlessly reconstruct a 3D human mesh with the try-on result while preserving texture from arbitrary viewpoints.
  • methods: DI-Net has two modules: a complementary warping module that warps the reference image to the source pose via dense correspondence learning and sparse flow learning, and a geometry-aware decomposed transfer module that splits garment transfer into image-layout-based and texture-based transfer, reconstructing surface and texture through pixel-aligned implicit functions.
  • results: Experiments demonstrate the effectiveness and superiority of the method on the 3D virtual try-on task, yielding higher-quality results than other existing methods.
    Abstract 3D virtual try-on enjoys many potential applications and hence has attracted wide attention. However, it remains a challenging task that has not been adequately solved. Existing 2D virtual try-on methods cannot be directly extended to 3D since they lack the ability to perceive the depth of each pixel. Besides, 3D virtual try-on approaches are mostly built on the fixed topological structure and with heavy computation. To deal with these problems, we propose a Decomposed Implicit garment transfer network (DI-Net), which can effortlessly reconstruct a 3D human mesh with the newly try-on result and preserve the texture from an arbitrary perspective. Specifically, DI-Net consists of two modules: 1) A complementary warping module that warps the reference image to have the same pose as the source image through dense correspondence learning and sparse flow learning; 2) A geometry-aware decomposed transfer module that decomposes the garment transfer into image layout based transfer and texture based transfer, achieving surface and texture reconstruction by constructing pixel-aligned implicit functions. Experimental results show the effectiveness and superiority of our method in the 3D virtual try-on task, which can yield more high-quality results over other existing methods.

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.16813
  • repo_url: None
  • paper_authors: Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang
  • for: Improving the quality of training data for autonomous driving.
  • methods: Proposes Panacea, a method that generates panoramic and controllable driving-scene videos and can yield an unlimited number of diverse, annotated samples.
  • results: Extensive quantitative and qualitative evaluations on the nuScenes dataset demonstrate that Panacea generates high-quality multi-view driving-scene videos.
    Abstract The field of autonomous driving increasingly demands high-quality annotated training data. In this paper, we propose Panacea, an innovative approach to generate panoramic and controllable videos in driving scenarios, capable of yielding an unlimited numbers of diverse, annotated samples pivotal for autonomous driving advancements. Panacea addresses two critical challenges: 'Consistency' and 'Controllability.' Consistency ensures temporal and cross-view coherence, while Controllability ensures the alignment of generated content with corresponding annotations. Our approach integrates a novel 4D attention and a two-stage generation pipeline to maintain coherence, supplemented by the ControlNet framework for meticulous control by the Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative evaluations of Panacea on the nuScenes dataset prove its effectiveness in generating high-quality multi-view driving-scene videos. This work notably propels the field of autonomous driving by effectively augmenting the training dataset used for advanced BEV perception techniques.

Large Model Based Referring Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2311.17122
  • repo_url: None
  • paper_authors: Shupeng Cheng, Ge-Peng Ji, Pengda Qin, Deng-Ping Fan, Bowen Zhou, Peng Xu
  • for: Referring camouflaged object detection (Ref-COD): detecting and segmenting specified camouflaged objects matched with a textual or visual reference.
  • methods: Leverages the knowledge of multimodal large language models (MLLMs) to decompose the task into two main perspectives, target and scene, and organizes multi-level knowledge descriptions to guide a large vision segmentation model while deeply aligning textual references with camouflaged photos.
  • results: Achieves state-of-the-art performance on the Ref-COD benchmark, outperforming numerous strong competitors, and shows zero-shot generalization on uni-modal COD datasets.
    Abstract Referring camouflaged object detection (Ref-COD) is a recently-proposed problem aiming to segment out specified camouflaged objects matched with a textual or visual reference. This task involves two major challenges: the COD domain-specific perception and multimodal reference-image alignment. Our motivation is to make full use of the semantic intelligence and intrinsic knowledge of recent Multimodal Large Language Models (MLLMs) to decompose this complex task in a human-like way. As language is highly condensed and inductive, linguistic expression is the main media of human knowledge learning, and the transmission of knowledge information follows a multi-level progression from simplicity to complexity. In this paper, we propose a large-model-based Multi-Level Knowledge-Guided multimodal method for Ref-COD termed MLKG, where multi-level knowledge descriptions from MLLM are organized to guide the large vision model of segmentation to perceive the camouflage-targets and camouflage-scene progressively and meanwhile deeply align the textual references with camouflaged photos. To our knowledge, our contributions mainly include: (1) This is the first time that the MLLM knowledge is studied for Ref-COD and COD. (2) We, for the first time, propose decomposing Ref-COD into two main perspectives of perceiving the target and scene by integrating MLLM knowledge, and contribute a multi-level knowledge-guided method. (3) Our method achieves the state-of-the-art on the Ref-COD benchmark outperforming numerous strong competitors. Moreover, thanks to the injected rich knowledge, it demonstrates zero-shot generalization ability on uni-modal COD datasets. We will release our code soon.

Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.17121
  • repo_url: None
  • paper_authors: Jacob Schnell, Jieke Wang, Lu Qi, Vincent Tao Hu, Meng Tang
  • for: Explores generative data augmentation for scribble-supervised semantic segmentation to improve its performance.
  • methods: A generative data augmentation method based on a ControlNet diffusion model conditioned on semantic scribbles; classifier-free diffusion guidance enforces class consistency, and encode ratios trade off data diversity against data realism.
  • results: Several augmentation schemes are proposed, some of which significantly improve model performance, especially in the low-data regime. The framework narrows the gap between scribble-supervised and fully-supervised segmentation, and on small datasets it can even surpass fully-supervised segmentation.
    Abstract Recent advances in generative models, such as diffusion models, have made generating high-quality synthetic images widely accessible. Prior works have shown that training on synthetic images improves many perception tasks, such as image classification, object detection, and semantic segmentation. We are the first to explore generative data augmentations for scribble-supervised semantic segmentation. We propose a generative data augmentation method that leverages a ControlNet diffusion model conditioned on semantic scribbles to produce high-quality training data. However, naive implementations of generative data augmentations may inadvertently harm the performance of the downstream segmentor rather than improve it. We leverage classifier-free diffusion guidance to enforce class consistency and introduce encode ratios to trade off data diversity for data realism. Using the guidance scale and encode ratio, we are able to generate a spectrum of high-quality training images. We propose multiple augmentation schemes and find that these schemes significantly impact model performance, especially in the low-data regime. Our framework further reduces the gap between the performance of scribble-supervised segmentation and that of fully-supervised segmentation. We also show that our framework significantly improves segmentation performance on small datasets, even surpassing fully-supervised segmentation.
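The two control knobs named in the abstract can be sketched as follows. Classifier-free guidance is the standard noise-combination rule; the encode ratio is shown here as partial forward-noising of a real image before denoising, which is one plausible reading of the term (an assumption), trading diversity (high ratio) against realism (low ratio).

```python
# Sketch of classifier-free guidance and an "encode ratio" style partial noising.
import torch

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    # Classifier-free guidance: push the prediction toward the condition.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def encode(real_image, encode_ratio, alphas_cumprod):
    # Start generation from a partially noised real image instead of pure noise.
    T = len(alphas_cumprod)
    t = int(encode_ratio * (T - 1))             # how far to push toward noise
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(real_image)
    return a_t.sqrt() * real_image + (1 - a_t).sqrt() * eps, t
```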

Multi-Channel Cross Modal Detection of Synthetic Face Images

  • paper_url: http://arxiv.org/abs/2311.16773
  • repo_url: https://github.com/dasec/multi-channel-cross-modal-detection-of-synthetic-face-images
  • paper_authors: M. Ibsen, C. Rathgeb, S. Marcel, C. Busch
  • for: Detecting entirely synthetic face images to improve trust in digital content.
  • methods: Proposes a multi-channel architecture that analyses information in both the frequency and visible spectra, trained with a Cross Modal Focal Loss.
  • results: In cross-model experiments, the proposed architecture generally achieves the most competitive performance compared with related architectures trained using binary cross-entropy.
    Abstract Synthetically generated face images have shown to be indistinguishable from real images by humans and as such can lead to a lack of trust in digital content as they can, for instance, be used to spread misinformation. Therefore, the need to develop algorithms for detecting entirely synthetic face images is apparent. Of interest are images generated by state-of-the-art deep learning-based models, as these exhibit a high level of visual realism. Recent works have demonstrated that detecting such synthetic face images under realistic circumstances remains difficult as new and improved generative models are proposed with rapid speed and arbitrary image post-processing can be applied. In this work, we propose a multi-channel architecture for detecting entirely synthetic face images which analyses information both in the frequency and visible spectra using Cross Modal Focal Loss. We compare the proposed architecture with several related architectures trained using Binary Cross Entropy and show in cross-model experiments that the proposed architecture supervised using Cross Modal Focal Loss, in general, achieves most competitive performance.
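As background, a plain binary focal loss is shown below; the Cross Modal Focal Loss used in the paper builds on this idea by letting the channels modulate each other's contribution, a coupling that is not reproduced in this sketch.

```python
# Standard binary focal loss as a building block (not the cross-modal variant).
import torch

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)    # probability of the true class
    weight = alpha * (1 - pt) ** gamma          # down-weight easy samples
    bce = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    return (weight * bce).mean()
```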

Continuous Pose for Monocular Cameras in Neural Implicit Representation

  • paper_url: http://arxiv.org/abs/2311.17119
  • repo_url: https://github.com/qimaqi/continuous-pose-in-nerf
  • paper_authors: Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
  • for: Optimizing monocular camera poses as a continuous function of time.
  • methods: Camera poses are represented by an implicit neural function that maps a given time to the corresponding pose; the mapped poses are used in downstream tasks that require joint camera pose optimization, with the network parameters (which implicitly represent the poses) optimized along the way.
  • results: Across four diverse settings, (1) NeRF from noisy poses, (2) NeRF from asynchronous events, (3) visual SLAM, and (4) vSLAM with IMUs, the method clearly outperforms the compared baselines and state-of-the-art methods. Assuming continuous motion, pose changes can be modeled on a manifold with fewer than 6 degrees of freedom, termed the "intrinsic motion"; using this in the vSLAM setting yields impressive camera tracking performance.
    Abstract In this paper, we showcase the effectiveness of optimizing monocular camera poses as a continuous function of time. The camera poses are represented using an implicit neural function which maps the given time to the corresponding camera pose. The mapped camera poses are then used for the downstream tasks where joint camera pose optimization is also required. While doing so, the network parameters -- that implicitly represent camera poses -- are optimized. We exploit the proposed method in four diverse experimental settings, namely, (1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all four settings, the proposed method performs significantly better than the compared baselines and the state-of-the-art methods. Additionally, using the assumption of continuous motion, changes in pose may actually live in a manifold that has lower than 6 degrees of freedom (DOF) is also realized. We call this low DOF motion representation as the \emph{intrinsic motion} and use the approach in vSLAM settings, showing impressive camera tracking performance.
    摘要 在这篇论文中,我们展示了将单目相机位姿优化为时间的连续函数的有效性。相机位姿由一个隐式神经函数表示,该函数把给定时间映射为对应的相机位姿。映射得到的相机位姿随后用于需要联合优化相机位姿的下游任务;在此过程中,隐式表示相机位姿的网络参数也被优化。我们在四种不同的实验设置中应用该方法,即(1)从噪声位姿训练 NeRF;(2)从异步事件训练 NeRF;(3)视觉同时定位与建图(vSLAM);(4)结合 IMU 的 vSLAM。在所有四种设置中,我们的方法都显著优于对比基线和最先进方法。此外,基于连续运动假设,位姿变化实际上可能位于自由度低于 6(DOF)的流形上,我们称之为"内在运动",并在 vSLAM 设置中使用该表示,展现出出色的相机跟踪性能。
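
A minimal sketch of the core idea, representing the camera pose as an implicit function of time so that the pose parameters can be optimized jointly with any downstream loss; the architecture and 6-DoF parameterization below are illustrative assumptions, not the paper's exact network.

```python
import torch
import torch.nn as nn

class ContinuousPose(nn.Module):
    """Map a scalar time t to a 6-DoF camera pose (axis-angle rotation + translation)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),          # 3 rotation (axis-angle) + 3 translation
        )

    def forward(self, t):
        out = self.net(t.view(-1, 1))
        return out[:, :3], out[:, 3:]      # (rotation, translation)

# The poses are encoded in the network weights, so they can be optimized jointly
# with a downstream objective (e.g., a NeRF photometric loss) via standard autograd:
pose_net = ContinuousPose()
optimizer = torch.optim.Adam(pose_net.parameters(), lr=1e-3)
rot, trans = pose_net(torch.tensor([0.25]))
print(rot.shape, trans.shape)
```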

Rescuing referral failures during automated diagnosis of domain-shifted medical images

  • paper_url: http://arxiv.org/abs/2311.16766
  • repo_url: None
  • paper_authors: Anuj Srivastava, Karm Patel, Pradeep Shenoy, Devarajan Sridharan
  • for: Addressing a fundamental challenge with selective classification during automated diagnosis with domain-shifted medical images.
  • methods: Examining two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts, and evaluating novel combinations of robust generalization and post hoc referral approaches.
  • results: Significant performance improvements, typically >10%, over baseline methods, and rescue of failures under covariate shifts leading to non-monotonic referral curves and severe drops in performance (up to 50%) at high referral rates (>70%).
    Abstract The success of deep learning models deployed in the real world depends critically on their ability to generalize well across diverse data domains. Here, we address a fundamental challenge with selective classification during automated diagnosis with domain-shifted medical images. In this scenario, models must learn to avoid making predictions when label confidence is low, especially when tested with samples far removed from the training set (covariate shift). Such uncertain cases are typically referred to the clinician for further analysis and evaluation. Yet, we show that even state-of-the-art domain generalization approaches fail severely during referral when tested on medical images acquired from a different demographic or using a different technology. We examine two benchmark diagnostic medical imaging datasets exhibiting strong covariate shifts: i) diabetic retinopathy prediction with retinal fundus images and ii) multilabel disease prediction with chest X-ray images. We show that predictive uncertainty estimates do not generalize well under covariate shifts leading to non-monotonic referral curves, and severe drops in performance (up to 50%) at high referral rates (>70%). We evaluate novel combinations of robust generalization and post hoc referral approaches, that rescue these failures and achieve significant performance improvements, typically >10%, over baseline methods. Our study identifies a critical challenge with referral in domain-shifted medical images and finds key applications in reliable, automated disease diagnosis.
    摘要 深度学习模型在真实世界中部署的成功,关键取决于其在多样数据域上的泛化能力。我们针对自动诊断域偏移医学图像时的选择性分类这一基本挑战展开研究:当标签置信度较低,尤其是测试样本与训练集相差较远(协变量偏移)时,模型必须学会放弃预测;这类不确定病例通常会转交临床医生进一步分析与评估。然而我们发现,即使是最先进的域泛化方法,在对来自不同人群或不同成像技术的医学图像进行转诊时也会严重失效。我们分析了两个存在强烈协变量偏移的基准医学影像数据集:i)基于眼底图像的糖尿病视网膜病变预测;ii)基于胸部 X 光图像的多标签疾病预测。我们发现预测不确定性估计在协变量偏移下泛化不佳,导致非单调的转诊曲线,并在高转诊率(>70%)时出现高达 50% 的性能骤降。我们评估了若干鲁棒泛化与事后转诊方法的新组合,它们能够挽救这些失效,并相对基线方法取得通常超过 10% 的显著性能提升。我们的研究揭示了域偏移医学图像转诊中的一个关键挑战,并在可靠的自动疾病诊断中具有重要应用。
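
The referral behaviour discussed above can be probed with a simple helper that refers the least-confident fraction of cases and measures accuracy on the rest; this generic sketch (not the paper's evaluation code) makes the notion of a non-monotonic referral curve concrete.

```python
import numpy as np

def referral_curve(confidences, correct, rates=np.linspace(0.0, 0.9, 10)):
    """Accuracy on retained samples after referring the least-confident fraction.

    A well-calibrated model should give a monotonically increasing curve; under
    covariate shift the curve can become non-monotonic and collapse at high
    referral rates, which is the failure mode the paper targets.
    """
    order = np.argsort(confidences)           # least confident first
    n = len(confidences)
    accs = []
    for r in rates:
        kept = order[int(r * n):]             # refer the bottom r fraction to a clinician
        accs.append(correct[kept].mean())
    return rates, np.array(accs)

# toy usage with roughly calibrated random predictions (illustrative only)
rng = np.random.default_rng(0)
conf = rng.random(1000)
correct = (rng.random(1000) < conf).astype(float)
print(referral_curve(conf, correct))
```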

Gradient-based Local Next-best-view Planning for Improved Perception of Targeted Plant Nodes

  • paper_url: http://arxiv.org/abs/2311.16759
  • repo_url: None
  • paper_authors: Akshay K. Burusa, Eldert J. van Henten, Gert Kootstra
  • for: Tomato greenhouse automation, specifically selective harvesting and de-leafing tasks.
  • methods: Local next-best-view (NBV) planning using differential ray sampling to overcome occlusion and improve perception.
  • results: The proposed planner can handle occlusions and improve 3D reconstruction and position estimation of nodes, while taking less computation and generating more efficient trajectories compared to previous methods.
    Abstract Robots are increasingly used in tomato greenhouses to automate labour-intensive tasks such as selective harvesting and de-leafing. To perform these tasks, robots must be able to accurately and efficiently perceive the plant nodes that need to be cut, despite the high levels of occlusion from other plant parts. We formulate this problem as a local next-best-view (NBV) planning task where the robot has to plan an efficient set of camera viewpoints to overcome occlusion and improve the quality of perception. Our formulation focuses on quickly improving the perception accuracy of a single target node to maximise its chances of being cut. Previous methods of NBV planning mostly focused on global view planning and used random sampling of candidate viewpoints for exploration, which could suffer from high computational costs, ineffective view selection due to poor candidates, or non-smooth trajectories due to inefficient sampling. We propose a gradient-based NBV planner using differential ray sampling, which directly estimates the local gradient direction for viewpoint planning to overcome occlusion and improve perception. Through simulation experiments, we showed that our planner can handle occlusions and improve the 3D reconstruction and position estimation of nodes equally well as a sampling-based NBV planner, while taking ten times less computation and generating 28% more efficient trajectories.
    摘要 机器人正被越来越多地用于番茄温室,以自动化选择性采摘和去叶等劳动密集型作业。为完成这些任务,机器人必须在植株其他部分造成严重遮挡的情况下,仍能准确、高效地感知需要剪切的植株节点。我们将该问题表述为局部"下一最佳视角"(NBV)规划任务:机器人需要规划一组高效的相机视点,以克服遮挡并提升感知质量。我们的表述侧重于快速提高单个目标节点的感知精度,以最大化其被成功剪切的机会。以往的 NBV 规划方法大多关注全局视角规划,并通过随机采样候选视点进行探索,这可能带来较高的计算开销、因候选质量差而导致的低效视点选择,或因采样低效而产生不平滑的轨迹。我们提出了一种基于梯度的 NBV 规划器,利用可微分的光线采样直接估计视点规划的局部梯度方向,以克服遮挡、改善感知。仿真实验表明,我们的规划器在处理遮挡以及提升节点三维重建与位置估计方面与基于采样的 NBV 规划器相当,同时计算量仅为其十分之一,并且生成的轨迹效率提高 28%。

As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors

  • paper_url: http://arxiv.org/abs/2311.16739
  • repo_url: None
  • paper_authors: Seungwoo Yoo, Kunho Kim, Vladimir G. Kim, Minhyuk Sung
  • for: 这个论文提出了一种基于 2D 扩散先验的 As-Plausible-As-Possible(APAP)网格变形技术,以在用户控制的变形过程中保持网格的合理性。
  • methods: 该技术使用每面雅可比矩阵(per-face Jacobian)表示网格变形,其中网格顶点坐标通过可微分泊松求解计算得到。变形后的网格被渲染,生成的 2D 图像用于 Score Distillation Sampling(SDS)过程,从而从预训练的 2D 扩散模型中提取有用的合理性先验。
  • results: 该方法可以在 2D 和 3D 网格上带来质量提升,并且能更好地保持被编辑网格的原有特征。
    Abstract We present As-Plausible-as-Possible (APAP) mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations, where mesh vertex coordinates are computed via a differentiable Poisson Solve. The deformed mesh is rendered, and the resulting 2D image is used in the Score Distillation Sampling (SDS) process, which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh, we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians, and we use iterative gradient descent to compute the final deformation that balances between the user edit and the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over geometry-preservation or distortion-minimization priors used by previous techniques.
    摘要 我们提出了 As-Plausible-As-Possible(APAP)网格变形技术,利用 2D 扩散先验在用户控制的变形过程中保持网格的合理性。我们的框架使用每面雅可比矩阵表示网格变形,网格顶点坐标通过可微分泊松求解得到。变形后的网格被渲染,得到的 2D 图像用于 Score Distillation Sampling(SDS)过程,从预训练的 2D 扩散模型中提取有意义的合理性先验。为了更好地保留被编辑网格的原有特征,我们使用 LoRA 对 2D 扩散模型进行微调。随后,SDS 提取的梯度与用户指定的控制点位移被反向传播到每面雅可比矩阵,并通过迭代梯度下降计算最终变形,在用户编辑与输出合理性之间取得平衡。我们在 2D 和 3D 网格上评估了该方法,结果表明,相比以往采用几何保持或失真最小化先验的技术,使用合理性先验在定性与定量上均有改进。
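
For readers unfamiliar with Score Distillation Sampling, the following is a generic SDS-style gradient step, with `render_fn`, `denoiser`, and `alphas_cumprod` as placeholders for a differentiable renderer, a pretrained 2D diffusion noise predictor, and its noise schedule; the weighting and timestep range are common choices, not the paper's exact settings.

```python
import torch

def sds_grad_step(render_fn, params, denoiser, alphas_cumprod, prompt_emb, guidance=7.5):
    """One Score Distillation Sampling step on deformation parameters `params`."""
    image = render_fn(params)                                  # differentiable render
    t = torch.randint(20, 980, (1,))                           # random diffusion timestep
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(image)
    noisy = a_bar.sqrt() * image + (1 - a_bar).sqrt() * noise  # forward-noise the render
    with torch.no_grad():
        eps_cond = denoiser(noisy, t, prompt_emb)
        eps_uncond = denoiser(noisy, t, None)
        eps_pred = eps_uncond + guidance * (eps_cond - eps_uncond)  # classifier-free guidance
    w = 1.0 - a_bar                                            # a common weighting choice
    grad = w * (eps_pred - noise)                              # SDS gradient w.r.t. the image
    loss = (grad.detach() * image).sum()                       # surrogate loss: d(loss)/d(image) = grad
    loss.backward()                                            # flows into `params` via the renderer
```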

Riemannian Self-Attention Mechanism for SPD Networks

  • paper_url: http://arxiv.org/abs/2311.16738
  • repo_url: None
  • paper_authors: Rui Wang, Xiao-Jun Wu, Hui Li, Josef Kittler
  • for: 这篇论文旨在提出一种基于 SPD 矩阵自注意力机制的几何学习模块,以提升深度结构化表示的判别能力。
  • methods: 本文提出了一种基于黎曼度量、黎曼均值与黎曼优化的 SPD 流形自注意力机制(SMSA),并将其应用于一个基于 SMSA 的几何学习模块(SMSA-GLM)中。
  • results: 在三个基准数据集上的大量实验表明,我们对基线网络的改进进一步缓解了信息退化问题,并提升了准确率。
    Abstract Symmetric positive definite (SPD) matrix has been demonstrated to be an effective feature descriptor in many scientific areas, as it can encode spatiotemporal statistics of the data adequately on a curved Riemannian manifold, i.e., SPD manifold. Although there are many different ways to design network architectures for SPD matrix nonlinear learning, very few solutions explicitly mine the geometrical dependencies of features at different layers. Motivated by the great success of self-attention mechanism in capturing long-range relationships, an SPD manifold self-attention mechanism (SMSA) is proposed in this paper using some manifold-valued geometric operations, mainly the Riemannian metric, Riemannian mean, and Riemannian optimization. Then, an SMSA-based geometric learning module (SMSA-GLM) is designed for the sake of improving the discrimination of the generated deep structured representations. Extensive experimental results achieved on three benchmarking datasets show that our modification against the baseline network further alleviates the information degradation problem and leads to improved accuracy.
    摘要 对称正定(SPD)矩阵已被证明是许多科学领域中有效的特征描述子,因为它能够在弯曲的黎曼流形(即 SPD 流形)上充分编码数据的时空统计信息。尽管针对 SPD 矩阵非线性学习已有多种网络架构设计,但很少有方法显式挖掘不同层级特征之间的几何依赖关系。受自注意力机制在捕捉长程关系方面巨大成功的启发,本文利用黎曼度量、黎曼均值与黎曼优化等流形值几何运算,提出了一种 SPD 流形自注意力机制(SMSA),并据此设计了基于 SMSA 的几何学习模块(SMSA-GLM),以提升所生成的深度结构化表示的判别能力。在三个基准数据集上的大量实验表明,我们对基线网络的改进进一步缓解了信息退化问题,并带来了准确率提升。
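
As a small, concrete example of the manifold-valued operations the abstract refers to, the sketch below computes a log-Euclidean Riemannian mean of SPD matrices via eigendecomposition; the paper may rely on a different Riemannian metric, so treat this purely as an illustration.

```python
import numpy as np

def spd_log(m):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.log(w)) @ v.T

def spd_exp(m):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.exp(w)) @ v.T

def log_euclidean_mean(spd_mats):
    """Riemannian (log-Euclidean) mean: exp of the average of matrix logs."""
    return spd_exp(np.mean([spd_log(m) for m in spd_mats], axis=0))

# toy usage with random SPD matrices A @ A.T + eps * I
rng = np.random.default_rng(0)
mats = [(a @ a.T + 1e-3 * np.eye(5)) for a in rng.standard_normal((4, 5, 5))]
print(log_euclidean_mean(mats))
```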

Point’n Move: Interactive Scene Object Manipulation on Gaussian Splatting Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.16737
  • repo_url: None
  • paper_authors: Jiajun Huang, Hongchuan Yu
  • for: 本文旨在实现交互式的场景对象操作,包括对暴露区域的修补。
  • methods: 本文采用 Gaussian Splatting Radiance Field 作为场景表示,并充分利用其显式表示特性与速度优势。
  • results: 本文提出了一种由 2D 提示点到 3D 掩码的两阶段自提示分割算法,可实现高质量与实时的编辑。我们在前向和 360 度场景上进行编辑测试,并与现有的场景对象移除方法比较,表现优于现有方法。
    Abstract We propose Point'n Move, a method that achieves interactive scene object manipulation with exposed region inpainting. Interactivity here further comes from intuitive object selection and real-time editing. To achieve this, we adopt Gaussian Splatting Radiance Field as the scene representation and fully leverage its explicit nature and speed advantage. Its explicit representation formulation allows us to devise a 2D prompt points to 3D mask dual-stage self-prompting segmentation algorithm, perform mask refinement and merging, minimize change as well as provide good initialization for scene inpainting and perform editing in real-time without per-editing training, all leads to superior quality and performance. We test our method by performing editing on both forward-facing and 360 scenes. We also compare our method against existing scene object removal methods, showing superior quality despite being more capable and having a speed advantage.
    摘要 我们提出了 Point'n Move 方法,实现交互式的场景对象操作,并对暴露区域进行修补。这里的交互性还体现在直观的对象选择与实时编辑上。为此,我们采用 Gaussian Splatting Radiance Field 作为场景表示,并充分利用其显式表示与速度优势。这种显式表示使我们能够设计一种由 2D 提示点到 3D 掩码的两阶段自提示分割算法,进行掩码精化与合并,在尽量减小改动的同时为场景修补提供良好的初始化,并实现无需针对每次编辑训练的实时编辑,从而获得更好的质量与性能。我们在前向与 360 度场景上进行了编辑测试,并与现有的场景对象移除方法进行比较,结果表明我们的方法在功能更强、速度更快的同时仍具有更高的质量。

AdaFocus: Towards End-to-end Weakly Supervised Learning for Long-Video Action Understanding

  • paper_url: http://arxiv.org/abs/2311.17118
  • repo_url: None
  • paper_authors: Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang
  • for: 本文针对长视频动作理解任务的端到端模型开发,以应对长视频带来的计算与内存挑战。
  • methods: 本文使用了一种叫做 AdaFocus 的框架,可以自适应地聚焦于动作片段,使得在没有精确的动作起止时间标注的情况下也能进行更好的训练。
  • results: 实验表明,使用 AdaFocus 框架可以在三个长视频数据集上取得更好的性能;在其中两个数据集上,弱监督下经 AdaFocus 训练的模型甚至优于以往的全监督训练。此外,本文还基于 AdaFocus 构建了弱监督特征提取管道,在长视频动作理解任务上实现了显著改进。
    Abstract Developing end-to-end models for long-video action understanding tasks presents significant computational and memory challenges. Existing works generally build models on long-video features extracted by off-the-shelf action recognition models, which are trained on short-video datasets in different domains, making the extracted features suffer domain discrepancy. To avoid this, action recognition models can be end-to-end trained on clips, which are trimmed from long videos and labeled using action interval annotations. Such fully supervised annotations are expensive to collect. Thus, a weakly supervised method is needed for long-video action understanding at scale. Under the weak supervision setting, action labels are provided for the whole video without precise start and end times of the action clip. To this end, we propose an AdaFocus framework. AdaFocus estimates the spike-actionness and temporal positions of actions, enabling it to adaptively focus on action clips that facilitate better training without the need for precise annotations. Experiments on three long-video datasets show its effectiveness. Remarkably, on two of datasets, models trained with AdaFocus under weak supervision outperform those trained under full supervision. Furthermore, we form a weakly supervised feature extraction pipeline with our AdaFocus, which enables significant improvements on three long-video action understanding tasks.
    摘要 为长视频动作理解任务开发端到端模型面临巨大的计算与内存挑战。现有工作通常基于现成动作识别模型提取的长视频特征来构建模型,而这些识别模型是在其他领域的短视频数据集上训练的,导致提取的特征存在域差异。为避免这一问题,可以在从长视频中裁剪出的片段上端到端训练动作识别模型,并使用动作区间标注进行监督;然而这种全监督标注的采集成本很高。因此,需要一种可大规模应用于长视频动作理解的弱监督方法。在弱监督设定下,动作标签只针对整段视频给出,而没有动作片段的精确起止时间。为此,我们提出了 AdaFocus 框架。AdaFocus 估计动作的峰值动作性(spike-actionness)与时间位置,从而能够自适应地聚焦于有利于训练的动作片段,而无需精确标注。在三个长视频数据集上的实验证明了其有效性;值得注意的是,在其中两个数据集上,弱监督下使用 AdaFocus 训练的模型甚至优于全监督训练的模型。此外,我们基于 AdaFocus 构建了弱监督特征提取管道,使三个长视频动作理解任务获得了显著提升。

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

  • paper_url: http://arxiv.org/abs/2311.17117
  • repo_url: https://github.com/HumanAIGC/AnimateAnyone
  • paper_authors: Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo
  • for: 这篇论文旨在通过驱动信号,从静止图像生成人物视频。
  • methods: 这篇论文使用扩散模型,并提出了一种专门的框架以保持人物细节特征的一致性;还引入了高效的姿态引导器和有效的时间建模方法,确保人物动作可控且平滑。
  • results: 该方法在时尚视频与人类舞蹈合成基准上取得了优于其他图像到视频方法的效果。
    Abstract Character Animation aims to generating character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.
    摘要 Character Animation 的目标是根据驱动信号从静止图像生成人物视频。目前,扩散模型凭借其强大的生成能力已成为视觉生成研究的主流。然而,图像到视频的转换仍然存在挑战,尤其是人物动画:在时间维度上保持与参考图像中细节信息的一致性仍是一个难题。在这篇论文中,我们利用扩散模型的能力,提出了一个专门面向人物动画的新框架。为了保持参考图像中复杂外观特征的一致性,我们设计了 ReferenceNet,通过空间注意力融合细节特征。为确保可控性和连续性,我们引入了高效的姿态引导器来控制人物动作,并采用有效的时间建模方法保证视频帧之间的平滑过渡。通过扩展训练数据,我们的方法可以为任意人物生成动画,在人物动画方面取得优于其他图像到视频方法的结果。此外,我们在时尚视频与人类舞蹈合成基准上评估了该方法,达到了最先进的结果。

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras

  • paper_url: http://arxiv.org/abs/2311.16728
  • repo_url: None
  • paper_authors: Huajian Huang, Longwei Li, Hui Cheng, Sai-Kit Yeung
  • for: 本研究旨在提出一种基于超基元(hyper primitives)地图的 SLAM 框架,以实现联合定位和高质量的真实感重建。
  • methods: 我们同时利用显式几何特征进行定位,并通过学习隐式光度特征来表示观测环境的纹理信息。我们还提出了基于高斯金字塔的训练方法,逐级学习多层特征,以提升真实感建图的表现。
  • results: 我们在单目、双目和 RGB-D 数据集上进行了广泛的实验,证明所提出的 Photo-SLAM 系统在在线真实感建图方面显著优于当前最先进的 SLAM 系统,例如 PSNR 提高 30%、渲染速度快数百倍,并且能在 Jetson AGX Orin 等嵌入式平台上以实时速度运行,表明其可用于机器人应用。
    Abstract The integration of neural rendering and the SLAM system recently showed promising results in joint localization and photorealistic view reconstruction. However, existing methods, fully relying on implicit representations, are so resource-hungry that they cannot run on portable devices, which deviates from the original intention of SLAM. In this paper, we present Photo-SLAM, a novel SLAM framework with a hyper primitives map. Specifically, we simultaneously exploit explicit geometric features for localization and learn implicit photometric features to represent the texture information of the observed environment. In addition to actively densifying hyper primitives based on geometric features, we further introduce a Gaussian-Pyramid-based training method to progressively learn multi-level features, enhancing photorealistic mapping performance. The extensive experiments with monocular, stereo, and RGB-D datasets prove that our proposed system Photo-SLAM significantly outperforms current state-of-the-art SLAM systems for online photorealistic mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time speed using an embedded platform such as Jetson AGX Orin, showing the potential of robotics applications.
    摘要 神经渲染与 SLAM 系统的结合最近在联合定位与真实感视图重建方面展现出可喜的成果。然而,现有方法完全依赖隐式表示,资源消耗巨大,无法在便携设备上运行,这与 SLAM 的初衷相悖。本文提出了 Photo-SLAM,一种基于超基元地图的新型 SLAM 框架。具体而言,我们同时利用显式几何特征进行定位,并学习隐式光度特征来表示观测环境的纹理信息。除了基于几何特征主动加密超基元外,我们还引入基于高斯金字塔的训练方法,逐级学习多层特征,提升真实感建图的表现。在单目、双目和 RGB-D 数据集上的大量实验表明,我们提出的 Photo-SLAM 系统在在线真实感建图方面显著优于当前最先进的 SLAM 系统,例如在 Replica 数据集上 PSNR 提高 30%、渲染速度快数百倍。此外,Photo-SLAM 可以在 Jetson AGX Orin 等嵌入式平台上实时运行,显示出其在机器人应用中的潜力。
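
The Gaussian-pyramid-based training mentioned above can be illustrated with a simple multi-level photometric loss; the sketch uses average pooling as a stand-in for Gaussian filtering and is not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def image_pyramid(img, levels=3):
    """Build a simple pyramid by repeated 2x average-pool downsampling
    (a stand-in for proper Gaussian blur + subsample)."""
    pyr = [img]
    for _ in range(levels - 1):
        img = F.avg_pool2d(img, kernel_size=2)
        pyr.append(img)
    return pyr

def pyramid_l1_loss(rendered, target, levels=3):
    """Sum of L1 losses across pyramid levels: coarse-to-fine photometric supervision."""
    loss = 0.0
    for r, t in zip(image_pyramid(rendered, levels), image_pyramid(target, levels)):
        loss = loss + (r - t).abs().mean()
    return loss

# toy usage
rendered = torch.rand(1, 3, 64, 64, requires_grad=True)
target = torch.rand(1, 3, 64, 64)
pyramid_l1_loss(rendered, target).backward()
```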

REF$^2$-NeRF: Reflection and Refraction aware Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2311.17116
  • repo_url: None
  • paper_authors: Wooseok Kim, Taiki Fukiage, Takeshi Oishi
  • for: 该文章旨在提出一种基于 NeRF 的多视图 3D 重建方法,用于处理包含玻璃展示柜的场景。
  • methods: 该方法基于体渲染,并使用依赖视点和与视点无关的元素来建模折射和反射效果。
  • results: 与现有方法相比,该方法能够更准确地建模玻璃折射以及整个场景。
    Abstract Recently, significant progress has been made in the study of methods for 3D reconstruction from multiple images using implicit neural representations, exemplified by the neural radiance field (NeRF) method. Such methods, which are based on volume rendering, can model various light phenomena, and various extended methods have been proposed to accommodate different scenes and situations. However, when handling scenes with multiple glass objects, e.g., objects in a glass showcase, modeling the target scene accurately has been challenging due to the presence of multiple reflection and refraction effects. Thus, this paper proposes a NeRF-based modeling method for scenes containing a glass case. In the proposed method, refraction and reflection are modeled using elements that are dependent and independent of the viewer's perspective. This approach allows us to estimate the surfaces where refraction occurs, i.e., glass surfaces, and enables the separation and modeling of both direct and reflected light components. Compared to existing methods, the proposed method enables more accurate modeling of both glass refraction and the overall scene.
    摘要 近年来,基于隐式神经表示的多视图 3D 重建方法(以神经辐射场 NeRF 为代表)取得了显著进展。这类基于体渲染的方法能够建模多种光学现象,并已有多种扩展方法被提出以适应不同的场景和情况。然而,在处理包含多个玻璃物体的场景(例如放置在玻璃展示柜中的物体)时,由于存在多重反射和折射效应,准确建模目标场景一直是一项挑战。为此,本文提出了一种基于 NeRF 的建模方法,用于包含玻璃柜的场景。在所提方法中,折射和反射通过依赖视点与不依赖视点的元素进行建模。这一做法使我们能够估计发生折射的表面(即玻璃表面),并实现对直接光与反射光成分的分离与建模。与现有方法相比,所提方法能够更准确地建模玻璃折射以及整个场景。

Human Gaussian Splatting: Real-time Rendering of Animatable Avatars

  • paper_url: http://arxiv.org/abs/2311.17113
  • repo_url: None
  • paper_authors: Arthur Moreau, Jifei Song, Helisa Dhamo, Richard Shaw, Yiren Zhou, Eduardo Pérez-Pellitero
  • for: 本研究实现了从多视角视频中学习的可动画真实感人体虚拟形象的实时渲染。
  • methods: 我们使用 3D Gaussian Splatting 来表示人体,这是一种非常高效的替代表示。人体由一组在规范空间中的高斯基元表示,并通过结合前向蒙皮与局部非刚性细化的由粗到细方式进行变形。
  • results: 我们的方法在 THuman4 数据集上的 PSNR 比最先进方法高 1.5 dB,并且能以 20fps 或更高的帧率进行渲染。
    Abstract This work addresses the problem of real-time rendering of photorealistic human body avatars learned from multi-view videos. While the classical approaches to model and render virtual humans generally use a textured mesh, recent research has developed neural body representations that achieve impressive visual quality. However, these models are difficult to render in real-time and their quality degrades when the character is animated with body poses different than the training observations. We propose the first animatable human model based on 3D Gaussian Splatting, that has recently emerged as a very efficient alternative to neural radiance fields. Our body is represented by a set of gaussian primitives in a canonical space which are deformed in a coarse to fine approach that combines forward skinning and local non-rigid refinement. We describe how to learn our Human Gaussian Splatting (\OURS) model in an end-to-end fashion from multi-view observations, and evaluate it against the state-of-the-art approaches for novel pose synthesis of clothed body. Our method presents a PSNR 1.5dbB better than the state-of-the-art on THuman4 dataset while being able to render at 20fps or more.
    摘要 本文研究从多视角视频中学习的真实感人体虚拟形象的实时渲染问题。经典的虚拟人建模与渲染方法通常使用带纹理的网格,而近期研究提出的神经人体表示能取得出色的视觉质量;然而这些模型难以实时渲染,并且当人物以不同于训练观测的姿态驱动时质量会下降。我们提出了第一个基于 3D Gaussian Splatting(一种最近出现的、非常高效的神经辐射场替代方案)的可动画人体模型。人体由规范空间中的一组高斯基元表示,并通过结合前向蒙皮与局部非刚性细化的由粗到细方式进行变形。我们介绍了如何从多视角观测中端到端地学习我们的 Human Gaussian Splatting 模型,并在带衣人体的新姿态合成任务上与最先进方法进行比较。我们的方法在 THuman4 数据集上的 PSNR 比最先进方法高 1.5 dB,同时能以 20fps 或更高的速度渲染。
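
The coarse forward-skinning step described in the abstract amounts to linear blend skinning of the canonical Gaussian centers; a minimal sketch is given below (the local non-rigid refinement is omitted, and all shapes are toy values).

```python
import numpy as np

def forward_skinning(points, weights, rotations, translations):
    """Linear blend skinning: move canonical points with a weighted sum of bone transforms.

    points:       (N, 3) canonical Gaussian centers
    weights:      (N, K) skinning weights (rows sum to 1)
    rotations:    (K, 3, 3) per-bone rotation matrices
    translations: (K, 3) per-bone translations
    """
    # transform every point by every bone: (K, N, 3)
    per_bone = np.einsum('kij,nj->kni', rotations, points) + translations[:, None, :]
    # blend the per-bone results with the skinning weights: (N, 3)
    return np.einsum('nk,kni->ni', weights, per_bone)

# toy usage: 2 bones, 5 points
rng = np.random.default_rng(0)
pts = rng.standard_normal((5, 3))
w = np.abs(rng.standard_normal((5, 2))); w /= w.sum(axis=1, keepdims=True)
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
print(forward_skinning(pts, w, R, t))
```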

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

  • paper_url: http://arxiv.org/abs/2311.16714
  • repo_url: https://github.com/stevenyangyj/emma-alfworld
  • paper_authors: Yijun Yang, Tianyi Zhou, Kanxue Li, Dapeng Tao, Lusong Li, Li Shen, Xiaodong He, Jing Jiang, Yuhui Shi
  • for: This paper aims to train a vision-language model (VLM) agent to adapt to a visual world by leveraging a large language model (LLM) agent’s reflection outcomes in a text world.
  • methods: The proposed method, called Embodied Multi-Modal Agent (EMMA), finetunes the VLM on the same tasks of the visual world using the LLM’s reflection outcomes in a text world.
  • results: EMMA achieves superior performance compared to state-of-the-art VLM-based agents on diverse tasks, with an improvement rate of 20%-70% in the success rate.
    Abstract While large language models (LLMs) excel in a simulated world of texts, they struggle to interact with the more realistic world without perceptions of other modalities such as visual or audio signals. Although vision-language models (VLMs) integrate LLM modules (1) aligned with static image features, and (2) may possess prior knowledge of world dynamics (as demonstrated in the text world), they have not been trained in an embodied visual world and thus cannot align with its dynamics. On the other hand, training an embodied agent in a noisy visual world without expert guidance is often challenging and inefficient. In this paper, we train a VLM agent living in a visual world using an LLM agent excelling in a parallel text world (but inapplicable to the visual world). Specifically, we distill LLM's reflection outcomes (improved actions by analyzing mistakes) in a text world's tasks to finetune the VLM on the same tasks of the visual world, resulting in an Embodied Multi-Modal Agent (EMMA) quickly adapting to the visual world dynamics. Such cross-modality imitation learning between the two parallel worlds enables EMMA to generalize to a broad scope of new tasks without any further guidance from the LLM expert. Extensive evaluations on the ALFWorld benchmark highlight EMMA's superior performance to SOTA VLM-based agents across diverse tasks, e.g., 20%-70% improvement in the success rate.
    摘要 大型语言模型(LLM)在模拟的文本世界中表现优异,但在缺乏视觉或音频等其他模态感知的情况下,难以与更真实的世界交互。视觉语言模型(VLM)虽然将 LLM 模块与静态图像特征对齐,并可能具备世界动态的先验知识(如在文本世界中所展示的),但它们没有在具身的视觉世界中训练,因此无法与其动态相协调。另一方面,在没有专家指导的情况下,在嘈杂的视觉世界中训练具身智能体往往困难且低效。在这篇论文中,我们利用一个在平行文本世界中表现出色(但无法直接用于视觉世界)的 LLM 智能体,来训练一个生活在视觉世界中的 VLM 智能体。具体来说,我们将 LLM 在文本世界任务中的反思结果(通过分析错误得到的改进动作)蒸馏出来,用于在视觉世界的相同任务上微调 VLM,从而得到一个能快速适应视觉世界动态的具身多模态智能体(EMMA)。这种在两个平行世界之间的跨模态模仿学习,使 EMMA 无需 LLM 专家的进一步指导即可泛化到广泛的新任务。在 ALFWorld 基准上的大量评估表明,EMMA 在多种任务上的成功率比最先进的基于 VLM 的智能体高出 20%-70%。

Full-resolution MLPs Empower Medical Dense Prediction

  • paper_url: http://arxiv.org/abs/2311.16707
  • repo_url: https://github.com/mungomeng/densepred-fullmlp
  • paper_authors: Mingyuan Meng, Yuxin Xue, Dagan Feng, Lei Bi, Jinman Kim
  • for: 这篇论文主要针对医疗影像处理中的密集预测任务,例如医疗影像修复、配准和分割。
  • methods: 这篇论文使用的方法是多层感知机(MLP),并从全图分辨率开始使用 MLP。
  • results: 实验结果显示,在全图分辨率上使用 MLP 可以超越 CNN 和 Transformer 的性能,并在多种医疗密集预测任务上取得最先进的表现。
    Abstract Dense prediction is a fundamental requirement for many medical vision tasks such as medical image restoration, registration, and segmentation. The most popular vision model, Convolutional Neural Networks (CNNs), has reached bottlenecks due to the intrinsic locality of convolution operations. Recently, transformers have been widely adopted for dense prediction for their capability to capture long-range visual dependence. However, due to the high computational complexity and large memory consumption of self-attention operations, transformers are usually used at downsampled feature resolutions. Such usage cannot effectively leverage the tissue-level textural information available only at the full image resolution. This textural information is crucial for medical dense prediction as it can differentiate the subtle human anatomy in medical images. In this study, we hypothesize that Multi-layer Perceptrons (MLPs) are superior alternatives to transformers in medical dense prediction where tissue-level details dominate the performance, as MLPs enable long-range dependence at the full image resolution. To validate our hypothesis, we develop a full-resolution hierarchical MLP framework that uses MLPs beginning from the full image resolution. We evaluate this framework with various MLP blocks on a wide range of medical dense prediction tasks including restoration, registration, and segmentation. Extensive experiments on six public well-benchmarked datasets show that, by simply using MLPs at full resolution, our framework outperforms its CNN and transformer counterparts and achieves state-of-the-art performance on various medical dense prediction tasks.
    摘要 密集预测是许多医学视觉任务(如医学图像修复、配准和分割)的基本需求。目前最流行的视觉模型是卷积神经网络(CNN),但由于卷积操作的固有局部性,CNN 在密集预测方面已遇到瓶颈。近年来,Transformer 因其捕捉长程视觉依赖的能力而被广泛用于密集预测;但由于自注意力操作计算复杂度高、内存消耗大,Transformer 通常只在下采样后的特征分辨率上使用。这种用法无法有效利用只有在全图分辨率下才可获得的组织级纹理信息,而这种信息对医学密集预测至关重要,因为它能区分医学图像中细微的人体解剖结构。在本研究中,我们提出如下假设:在以组织级细节为主导的医学密集预测中,多层感知机(MLP)是优于 Transformer 的选择,因为 MLP 能够在全图分辨率上建立长程依赖。为验证该假设,我们开发了一种从全图分辨率开始使用 MLP 的全分辨率分层 MLP 框架,并在包括修复、配准和分割在内的多种医学密集预测任务上,用不同的 MLP 模块对该框架进行评估。在六个公开基准数据集上的大量实验表明,仅仅在全分辨率上使用 MLP,我们的框架就超越了对应的 CNN 与 Transformer 模型,并在多种医学密集预测任务上取得了最先进的性能。
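
To illustrate how plain MLPs can mix information across an entire full-resolution feature map, here is a generic axial-MLP block; it is only meant to convey the idea of convolution-free long-range mixing and is not the block proposed in the paper.

```python
import torch
import torch.nn as nn

class AxialMLPBlock(nn.Module):
    """Mix information along full-resolution rows, columns, and channels with plain MLPs."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.mix_w = nn.Linear(width, width)      # mixes along the width axis
        self.mix_h = nn.Linear(height, height)    # mixes along the height axis
        self.mix_c = nn.Linear(channels, channels)

    def forward(self, x):                         # x: (B, C, H, W)
        x = x + self.mix_w(x)                                          # acts on W
        x = x + self.mix_h(x.transpose(-1, -2)).transpose(-1, -2)      # acts on H
        x = x + self.mix_c(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # acts on C
        return x

block = AxialMLPBlock(channels=8, height=64, width=64)
print(block(torch.rand(1, 8, 64, 64)).shape)   # (1, 8, 64, 64)
```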

CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs

  • paper_url: http://arxiv.org/abs/2311.16703
  • repo_url: None
  • paper_authors: Haocheng Yuan, Jing Xu, Hao Pan, Adrien Bousseau, Niloy Mitra, Changjian Li
  • for: 这个论文的目标是如何对CAD程序进行Semantic Commenting,即将CAD程序分解成 semantically meaningful shape parts,并为每个部分分配Semantic Label。
  • methods: 该论文结合程序解析与视觉语义分析,利用最新的基础语言与视觉模型:先执行输入程序得到形状,再据此生成有条件的真实感图像,并借助针对此类图像的语义标注器进行标注;随后将图像中提取的信息汇总并关联回原始程序,从而实现语义注释。
  • results: 该论文在新的 CADTalk 数据集上进行了广泛评估,与基于 GPT 的基线方法和开放集形状分割基线(PartSLIP)进行比较,取得了 83.24% 的准确率。
    Abstract CAD programs are a popular way to compactly encode shapes as a sequence of operations that are easy to parametrically modify. However, without sufficient semantic comments and structure, such programs can be challenging to understand, let alone modify. We introduce the problem of semantic commenting CAD programs, wherein the goal is to segment the input program into code blocks corresponding to semantically meaningful shape parts and assign a semantic label to each block. We solve the problem by combining program parsing with visual-semantic analysis afforded by recent advances in foundational language and vision models. Specifically, by executing the input programs, we create shapes, which we use to generate conditional photorealistic images to make use of semantic annotators for such images. We then distill the information across the images and link back to the original programs to semantically comment on them. Additionally, we collected and annotated a benchmark dataset, CADTalk, consisting of 5,280 machine-made programs and 45 human-made programs with ground truth semantic comments to foster future research. We extensively evaluated our approach, compared to a GPT-based baseline approach, and an open-set shape segmentation baseline, i.e., PartSLIP, and reported an 83.24% accuracy on the new CADTalk dataset. Project page: https://enigma-li.github.io/CADTalk/.
    摘要 CAD 程序是一种以操作序列紧凑编码形状的流行方式,便于参数化修改。然而,如果缺乏足够的语义注释与结构,这类程序难以理解,更难以修改。我们提出了 CAD 程序语义注释问题:目标是将输入程序划分为与语义上有意义的形状部件相对应的代码块,并为每个代码块分配语义标签。我们通过将程序解析与最新的基础语言和视觉模型所提供的视觉语义分析相结合来解决该问题。具体而言,我们执行输入程序得到形状,并据此生成有条件的真实感图像,以便利用针对此类图像的语义标注器;随后我们在多张图像间汇总信息并关联回原始程序,完成语义注释。此外,我们收集并标注了一个基准数据集 CADTalk,包含 5,280 个机器生成程序和 45 个人工编写程序及其真实语义注释,以促进后续研究。我们对所提方法进行了广泛评估,与基于 GPT 的基线方法以及开放集形状分割基线 PartSLIP 进行比较,在新的 CADTalk 数据集上取得了 83.24% 的准确率。项目主页:https://enigma-li.github.io/CADTalk/。

Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model

  • paper_url: http://arxiv.org/abs/2311.17112
  • repo_url: None
  • paper_authors: Zelin Peng, Zhengqin Xu, Zhilin Zeng, Lingxi Xie, Qi Tian, Wei Shen
  • for: 这个研究旨在提高参数高效微调(PEFT)方法在新场景中的表现,并实现 Segment Anything Model(SAM)的适配。
  • methods: 本研究为 PEFT 引入跨块协同机制:通过一个可学习的关系矩阵促进各 PEFT 块参数空间中不同系数组之间的交流,并引入块内增强模块,以提升对整个参数空间投影方向的调整能力。
  • results: 实验结果显示,所提方法在仅增加约 1K 参数的情况下,即可在多种基准上显著提升新场景下的分割性能。
    Abstract Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash the potential of large foundation models in novel scenarios with limited training data. In the computer vision community, PEFT has shown effectiveness in image classification, but little research has studied its ability for image segmentation. Fine-tuning segmentation models usually require a heavier adjustment of parameters to align the proper projection directions in the parameter space for new scenarios. This raises a challenge to existing PEFT algorithms, as they often inject a limited number of individual parameters into each block, which prevents substantial adjustment of the projection direction of the parameter space due to the limitation of Hidden Markov Chain along blocks. In this paper, we equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios. We introduce a novel inter-block communication module, which integrates a learnable relation matrix to facilitate communication among different coefficient sets of each PEFT block's parameter space. Moreover, we propose an intra-block enhancement module, which introduces a linear projection head whose weights are generated from a hyper-complex layer, further enhancing the impact of the adjustment of projection directions on the entire parameter space. Extensive experiments on diverse benchmarks demonstrate that our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters.
    摘要 在这篇论文中,我们为 PEFT 配备了一种跨块协同(cross-block orchestration)机制,以便让 Segment Anything Model(SAM)适应不同的下游场景。我们引入了一种新的块间通信模块,通过可学习的关系矩阵促进各 PEFT 块参数空间中不同系数组之间的交流。此外,我们还提出了一种块内增强模块,该模块引入一个线性投影头,其权重由超复数层生成,从而进一步增强投影方向调整对整个参数空间的影响。我们在多个基准上进行了广泛的实验,结果表明,所提方法仅需约 1K 额外参数,即可在新场景中显著提升分割性能。

ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention

  • paper_url: http://arxiv.org/abs/2311.16682
  • repo_url: None
  • paper_authors: Jiawei Wang, Changjian Li
  • for: 本研究旨在提出一种简单而高效的方法,用于快速、准确地进行草图语义分割。
  • methods: 该方法包括两个阶段:第一阶段,使用自编码器网络额外预测稠密距离场,以增强结构信息的学习;第二阶段,将整条笔画视为单一实体,使用带默认注意力机制的自回归 Transformer,将属于同一语义部件的一组笔画一并标注,再据此为其余笔画分组标注。
  • results: 该方法在两个代表性数据集上达到了最佳分割精度,大量实验证明了其优越性能。此外,本研究还探讨了训练数据中部件不平衡问题的解决方法,并进行了初步的跨类别训练实验,这些探索有望推动该领域的后续研究。
    Abstract Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of pre-defined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective approach to tackling this problem with two stages. In the first stage, to better encode the shape and positional information of strokes, we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage, we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an auto-regressive Transformer with the default attention mechanism. By group-based labeling, our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated demonstrating its superior performance. Additionally, we offer insights into solving part imbalance in training data and the preliminary experiment on cross-category training, which can inspire future research in this field.
    摘要 草图语义分割是计算机视觉中被广泛研究的重要问题,即为每条笔画分配预定义的部件标签。这篇论文介绍了 ContextSeg,一种简单而高效的两阶段方法。在第一阶段,我们额外预测一个稠密距离场,以增强笔画结构信息的学习。在第二阶段,我们将整条笔画视为单一实体,并使用带默认注意力机制的自回归 Transformer,将同一语义部件内的一组笔画一并标注。通过分组标注,我们的方法在为其余笔画组做决策时能够充分利用上下文信息。我们的方法在两个代表性数据集上取得了优于最先进方法的分割精度,并经过广泛评估展现出卓越性能。此外,我们还提供了关于训练数据部件不平衡问题的思考以及跨类别训练的初步实验,希望能启发该领域的后续研究。

Neural Texture Puppeteer: A Framework for Neural Geometry and Texture Rendering of Articulated Shapes, Enabling Re-Identification at Interactive Speed

  • paper_url: http://arxiv.org/abs/2311.17109
  • repo_url: None
  • paper_authors: Urs Waldmann, Ole Johannsen, Bastian Goldluecke
  • for: 这篇论文旨在提出一种基于神经网络的几何与纹理渲染管线,用于铰接形状(articulated shapes)的渲染与重识别。
  • methods: 该方法将几何编码与纹理编码分离:几何管线从提供几何信息的真值数据中学习铰接形状表面上的空间关系,纹理自编码器则利用这些信息将纹理图像编码为全局纹理编码。
  • results: 该方法可实现交互速度的神经纹理渲染与个体重识别,并且可应用于数据有限的真实场景;通过合成到真实的纹理域迁移,它还能从真实的 2D RGB 图像中重建纹理。新的合成纹理数据集 NePuMoo 已公开,以促进后续研究。
    Abstract In this paper, we present a neural rendering pipeline for textured articulated shapes that we call Neural Texture Puppeteer. Our method separates geometry and texture encoding. The geometry pipeline learns to capture spatial relationships on the surface of the articulated shape from ground truth data that provides this geometric information. A texture auto-encoder makes use of this information to encode textured images into a global latent code. This global texture embedding can be efficiently trained separately from the geometry, and used in a downstream task to identify individuals. The neural texture rendering and the identification of individuals run at interactive speeds. To the best of our knowledge, we are the first to offer a promising alternative to CNN- or transformer-based approaches for re-identification of articulated individuals based on neural rendering. Realistic looking novel view and pose synthesis for different synthetic cow textures further demonstrate the quality of our method. Restricted by the availability of ground truth data for the articulated shape's geometry, the quality for real-world data synthesis is reduced. We further demonstrate the flexibility of our model for real-world data by applying a synthetic to real-world texture domain shift where we reconstruct the texture from a real-world 2D RGB image. Thus, our method can be applied to endangered species where data is limited. Our novel synthetic texture dataset NePuMoo is publicly available to inspire further development in the field of neural rendering-based re-identification.
    摘要 在这篇论文中,我们提出了一种用于带纹理铰接形状的神经渲染管线,称之为 Neural Texture Puppeteer。我们的方法将几何编码与纹理编码分离:几何管线从提供几何信息的真值数据中学习铰接形状表面上的空间关系;纹理自编码器利用这些信息将纹理图像编码为全局潜在编码。该全局纹理嵌入可以独立于几何部分高效训练,并用于下游的个体重识别任务。神经纹理渲染与个体重识别均能以交互速度运行。据我们所知,我们首次为基于神经渲染的铰接个体重识别提供了一种有前景的、可替代 CNN 或 Transformer 方法的方案。针对不同合成奶牛纹理的逼真新视角与新姿态合成进一步证明了方法的质量。受限于铰接形状几何真值数据的可获得性,真实数据合成的质量有所下降。我们还通过合成到真实的纹理域迁移,从真实的 2D RGB 图像中重建纹理,展示了模型对真实数据的灵活性;因此,该方法可应用于数据有限的濒危物种。我们新的合成纹理数据集 NePuMoo 已公开发布,以激励神经渲染重识别领域的进一步发展。

LiveNVS: Neural View Synthesis on Live RGB-D Streams

  • paper_url: http://arxiv.org/abs/2311.16668
  • repo_url: None
  • paper_authors: Laura Fink, Darius Rückert, Linus Franke, Joachim Keinert, Marc Stamminger
  • for: 现有实时 RGB-D 重建方法(如 Kinect Fusion)缺乏实时的真实感可视化:不完美的深度图与相机位姿会使融合得到的几何噪声大、过度平滑或不完整,纹理也变得模糊。
  • methods: LiveNVS 基于实时 RGB-D 输入流实现神经新视角合成,具有极低的延迟并支持实时渲染。它利用稠密融合的深度图将神经特征投影到目标视角,并在图像空间聚合为目标特征图,再由一个可泛化的神经网络将目标特征图转换为高质量的 RGB 图像。
  • results: LiveNVS 能在采集过程中对未知场景达到最先进的神经渲染质量,让用户实时虚拟浏览场景并即时评估重建质量。
    Abstract Existing real-time RGB-D reconstruction approaches, like Kinect Fusion, lack real-time photo-realistic visualization. This is due to noisy, oversmoothed or incomplete geometry and blurry textures which are fused from imperfect depth maps and camera poses. Recent neural rendering methods can overcome many of such artifacts but are mostly optimized for offline usage, hindering the integration into a live reconstruction pipeline. In this paper, we present LiveNVS, a system that allows for neural novel view synthesis on a live RGB-D input stream with very low latency and real-time rendering. Based on the RGB-D input stream, novel views are rendered by projecting neural features into the target view via a densely fused depth map and aggregating the features in image-space to a target feature map. A generalizable neural network then translates the target feature map into a high-quality RGB image. LiveNVS achieves state-of-the-art neural rendering quality of unknown scenes during capturing, allowing users to virtually explore the scene and assess reconstruction quality in real-time.
    摘要 现有的实时 RGB-D 重建方法(如 Kinect Fusion)缺乏实时的真实感可视化。这是因为不完美的深度图与相机位姿所融合出的几何存在噪声、过度平滑或不完整,纹理也模糊不清。最新的神经渲染方法能够克服其中许多伪影,但大多针对离线使用进行优化,难以集成到实时重建管线中。在这篇论文中,我们介绍了 LiveNVS 系统,它能够在实时 RGB-D 输入流上以极低延迟进行神经新视角合成并实时渲染。基于 RGB-D 输入流,新视角通过稠密融合的深度图将神经特征投影到目标视角,并在图像空间聚合为目标特征图;随后,一个可泛化的神经网络将目标特征图转换为高质量的 RGB 图像。LiveNVS 在采集过程中即可对未知场景达到最先进的神经渲染质量,使用户能够实时虚拟浏览场景并评估重建质量。

DGNR: Density-Guided Neural Point Rendering of Large Driving Scenes

  • paper_url: http://arxiv.org/abs/2311.16664
  • repo_url: None
  • paper_authors: Zhuopeng Li, Chenming Wu, Liangjun Zhang, Jianke Zhu
  • for: 这篇论文主要针对大规模驾驶场景的渲染问题,尤其是长轨迹场景,并提出了一种基于密度空间的渲染框架(DGNR)来解决这些问题。
  • methods: 这篇论文使用了一种基于神经网络的渲染框架,通过学习场景的密度空间来指导渲染。具体来说,该框架使用可微渲染器从神经密度特征中合成图像,并提出了基于密度的融合模块和几何正则化来优化密度空间。
  • results: 在广泛使用的自动驾驶数据集上的实验表明,该框架可以合成高质量的驾驶场景图像,并实现实时渲染。
    Abstract Despite the recent success of Neural Radiance Field (NeRF), it is still challenging to render large-scale driving scenes with long trajectories, particularly when the rendering quality and efficiency are in high demand. Existing methods for such scenes usually involve with spatial warping, geometric supervision from zero-shot normal or depth estimation, or scene division strategies, where the synthesized views are often blurry or fail to meet the requirement of efficient rendering. To address the above challenges, this paper presents a novel framework that learns a density space from the scenes to guide the construction of a point-based renderer, dubbed as DGNR (Density-Guided Neural Rendering). In DGNR, geometric priors are no longer needed, which can be intrinsically learned from the density space through volumetric rendering. Specifically, we make use of a differentiable renderer to synthesize images from the neural density features obtained from the learned density space. A density-based fusion module and geometric regularization are proposed to optimize the density space. By conducting experiments on a widely used autonomous driving dataset, we have validated the effectiveness of DGNR in synthesizing photorealistic driving scenes and achieving real-time capable rendering.
    摘要 尽管神经辐射场(NeRF)近来取得了成功,渲染具有长轨迹的大规模驾驶场景仍然具有挑战性,尤其是在对渲染质量和效率要求较高的情况下。现有针对此类场景的方法通常依赖空间变形、来自零样本法线或深度估计的几何监督,或场景划分策略,其合成视图往往模糊或无法满足高效渲染的需求。为了解决上述挑战,本文提出了一种新框架 DGNR(Density-Guided Neural Rendering):从场景中学习密度空间,以指导基于点的渲染器的构建。在 DGNR 中不再需要几何先验,它可以通过体渲染从密度空间中内在地学习得到。具体而言,我们利用可微渲染器,从学习到的密度空间所得到的神经密度特征中合成图像,并提出基于密度的融合模块与几何正则化来优化密度空间。在一个广泛使用的自动驾驶数据集上的实验验证了 DGNR 在合成真实感驾驶场景方面的有效性,并可实现实时渲染。

SCALAR-NeRF: SCAlable LARge-scale Neural Radiance Fields for Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2311.16657
  • repo_url: None
  • paper_authors: Yu Chen, Gim Hee Lee
  • for: 这个研究旨在提出一种可扩展的大规模神经场景重建方法。
  • methods: 该方法采用了编码器-解码器架构,其中编码器处理3D点坐标,生成编码特征,而解码器生成含有积度距离和颜色的几何值。这个方法首先训练一个粗略全局模型,然后将图像分割成小块,并使用KMeans将每个块分配给专门的本地模型。通过扩大每个本地模型的盒子大小,提高不同块之间的重叠区域。全局解码器被共享到不同块中,从而促进了特征空间的对齐。这种粗略到细化策略使得我们的方法超越了现状最佳的NeRF方法,并且可扩展到大规模场景重建。
  • results: 该方法在大规模场景重建方面实现了优秀的性能,超越了现状最佳的NeRF方法。
    Abstract In this work, we introduce SCALAR-NeRF, a novel framework tailored for scalable large-scale neural scene reconstruction. We structure the neural representation as an encoder-decoder architecture, where the encoder processes 3D point coordinates to produce encoded features, and the decoder generates geometric values that include volume densities of signed distances and colors. Our approach first trains a coarse global model on the entire image dataset. Subsequently, we partition the images into smaller blocks using KMeans with each block being modeled by a dedicated local model. We enhance the overlapping regions across different blocks by scaling up the bounding boxes of each local block. Notably, the decoder from the global model is shared across distinct blocks and therefore promoting alignment in the feature space of local encoders. We propose an effective and efficient methodology to fuse the outputs from these local models to attain the final reconstruction. Employing this refined coarse-to-fine strategy, our method outperforms state-of-the-art NeRF methods and demonstrates scalability for large-scale scene reconstruction. The code will be available on our project page at https://aibluefisher.github.io/SCALAR-NeRF/
    摘要 在这个工作中,我们介绍了 SCALAR-NeRF,一种面向可扩展大规模神经场景重建的新框架。我们将神经表示构建为编码器-解码器架构:编码器处理 3D 点坐标并生成编码特征,解码器则输出几何值,包括有符号距离的体密度与颜色。我们的方法首先在整个图像集合上训练一个粗略的全局模型,然后使用 KMeans 将图像划分为较小的块,每个块由一个专门的局部模型建模。我们通过扩大每个局部块的包围盒来增强不同块之间的重叠区域。值得注意的是,全局模型的解码器在不同块之间共享,从而促进各局部编码器特征空间的对齐。我们还提出了一种有效且高效的方法来融合这些局部模型的输出,以获得最终重建。采用这种由粗到细的策略,我们的方法超越了最先进的 NeRF 方法,并展示了在大规模场景重建中的可扩展性。代码将在我们的项目页面提供:https://aibluefisher.github.io/SCALAR-NeRF/。
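
The partition-with-overlap step can be sketched with KMeans over camera centers and an enlarged bounding box per block; clustering on camera positions and the expansion factor are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_images(camera_centers, n_blocks=4, expand=1.2):
    """Assign images to local blocks by clustering camera centers, then enlarge
    each block's bounding box so neighbouring blocks overlap.

    Returns, per block, the indices of images whose cameras fall inside the
    expanded box. A simplified take on the partitioning described above.
    """
    labels = KMeans(n_clusters=n_blocks, n_init=10).fit_predict(camera_centers)
    blocks = []
    for b in range(n_blocks):
        pts = camera_centers[labels == b]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        center, half = (lo + hi) / 2, (hi - lo) / 2 * expand   # scale up the box
        inside = np.all(np.abs(camera_centers - center) <= half + 1e-8, axis=1)
        blocks.append(np.where(inside)[0])
    return blocks

# toy usage with random camera positions
cams = np.random.default_rng(0).uniform(0, 10, size=(100, 3))
for i, idx in enumerate(partition_images(cams)):
    print(f"block {i}: {len(idx)} images")
```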

Augmenting x-ray single particle imaging reconstruction with self-supervised machine learning

  • paper_url: http://arxiv.org/abs/2311.16652
  • repo_url: None
  • paper_authors: Zhantao Chen, Cong Wang, Mingye Gao, Chun Hong Yoon, Jana B. Thayer, Joshua J. Turner
  • for: 这项研究旨在拓展 XFEL 的应用场景,具体而言是通过单粒子成像(SPI)技术在接近生理状态下研究生物颗粒的结构与动力学性质,而无需结晶或低温条件。
  • methods: 该研究提出了一种端到端、自监督的机器学习方法,仅凭衍射图像即可恢复颗粒取向并估计倒易空间强度。
  • results: 该方法在苛刻的实验条件下表现出很强的鲁棒性,重建能力相比传统算法显著提升,有望为当前 XFEL 上的 SPI 实践带来范式转变。
    Abstract The development of X-ray Free Electron Lasers (XFELs) has opened numerous opportunities to probe atomic structure and ultrafast dynamics of various materials. Single Particle Imaging (SPI) with XFELs enables the investigation of biological particles in their natural physiological states with unparalleled temporal resolution, while circumventing the need for cryogenic conditions or crystallization. However, reconstructing real-space structures from reciprocal-space x-ray diffraction data is highly challenging due to the absence of phase and orientation information, which is further complicated by weak scattering signals and considerable fluctuations in the number of photons per pulse. In this work, we present an end-to-end, self-supervised machine learning approach to recover particle orientations and estimate reciprocal space intensities from diffraction images only. Our method demonstrates great robustness under demanding experimental conditions with significantly enhanced reconstruction capabilities compared with conventional algorithms, and signifies a paradigm shift in SPI as currently practiced at XFELs.
    摘要 X 射线自由电子激光(XFEL)的发展为探测多种材料的原子结构与超快动力学开辟了众多机会。基于 XFEL 的单粒子成像(SPI)能够以空前的时间分辨率研究处于自然生理状态的生物颗粒,同时无需低温条件或结晶。然而,由于缺乏相位和取向信息,再加上散射信号微弱以及每个脉冲光子数的显著波动,从倒易空间的 X 射线衍射数据重建实空间结构极具挑战。在这项工作中,我们提出了一种端到端、自监督的机器学习方法,仅凭衍射图像即可恢复颗粒取向并估计倒易空间强度。我们的方法在苛刻的实验条件下表现出很强的鲁棒性,重建能力相比传统算法显著增强,标志着当前 XFEL 上 SPI 实践的一次范式转变。

Parallax-Tolerant Image Stitching with Epipolar Displacement Field

  • paper_url: http://arxiv.org/abs/2311.16637
  • repo_url: None
  • paper_authors: Jian Yu, Yi Yu, Feipeng Da
  • for: 大视差图像拼接是一项具有挑战性的任务,现有方法通常难以在减少对齐伪影和翘曲失真的同时,兼顾图像的局部与全局结构。
  • methods: 该文提出了一种新方法,利用对极几何设计基于对极位移场的翘曲技术:首先通过无穷单应性确定对极几何下像素的翘曲规则,再基于局部弹性形变原理,用薄板样条刻画每个翘曲像素沿对极线的滑动距离。
  • results: 该方法能在保持全景图射影性的同时实现高质量对齐,减少对齐伪影与失真。定性与定量对比实验表明,该方法在大视差图像拼接中具有竞争力。
    Abstract Large parallax image stitching is a challenging task. Existing methods often struggle to maintain both the local and global structures of the image while reducing alignment artifacts and warping distortions. In this paper, we propose a novel approach that utilizes epipolar geometry to establish a warping technique based on the epipolar displacement field. Initially, the warping rule for pixels in the epipolar geometry is established through the infinite homography. Subsequently, Subsequently, the epipolar displacement field, which represents the sliding distance of the warped pixel along the epipolar line, is formulated by thin plate splines based on the principle of local elastic deformation. The stitching result can be generated by inversely warping the pixels according to the epipolar displacement field. This method incorporates the epipolar constraints in the warping rule, which ensures high-quality alignment and maintains the projectivity of the panorama. Qualitative and quantitative comparative experiments demonstrate the competitiveness of the proposed method in stitching images large parallax.
    摘要 大视差图像拼接是一项具有挑战性的任务。现有方法往往难以在减少对齐伪影和翘曲失真的同时,兼顾图像的局部与全局结构。在这篇论文中,我们提出了一种新方法,利用对极几何建立基于对极位移场的翘曲技术。首先,通过无穷单应性确定对极几何下像素的翘曲规则;随后,基于局部弹性形变原理,使用薄板样条构造对极位移场,即翘曲像素沿对极线的滑动距离;最后,根据对极位移场对像素进行逆向翘曲即可生成拼接结果。该方法将对极约束融入翘曲规则,保证了高质量的对齐,并保持了全景图的射影性。定性与定量对比实验表明,所提方法在大视差图像拼接中具有竞争力。
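
The core warping rule — sliding a pixel along its epipolar line by a scalar displacement — can be sketched as follows; the initial target positions would come from the infinite homography and the displacements from the thin-plate-spline field described above, both of which are simply given as inputs here.

```python
import numpy as np

def slide_along_epipolar(src_points, init_target_points, F, displacements):
    """Move initial target positions along the epipolar lines induced by the source pixels.

    src_points:         (N, 2) pixels in the source image
    init_target_points: (N, 2) initial warped positions in the target image
    F:                  (3, 3) fundamental matrix between the two views
    displacements:      (N,) signed sliding distances along each epipolar line
    """
    homog = np.hstack([src_points, np.ones((len(src_points), 1))])   # (N, 3)
    lines = homog @ F.T                                              # l' = F x: a*x + b*y + c = 0
    ab = lines[:, :2]
    direction = np.stack([ab[:, 1], -ab[:, 0]], axis=1)              # unit vector along the line
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    return init_target_points + displacements[:, None] * direction

# toy usage with an arbitrary matrix standing in for F
F = np.array([[0.0, -1e-4, 0.01], [1e-4, 0.0, -0.02], [-0.01, 0.02, 1.0]])
src = np.array([[100.0, 120.0], [200.0, 80.0]])
init = src.copy()                       # pretend the homography maps points to themselves
print(slide_along_epipolar(src, init, F, np.array([2.0, -3.0])))
```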

MotionZero:Exploiting Motion Priors for Zero-shot Text-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.16635
  • repo_url: None
  • paper_authors: Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen, Jingkuan Song
  • for: 文章旨在解决无需任何视频样本的零样本文本到视频合成问题,通过从提示中提取运动先验来控制不同对象的运动。
  • methods: 文章提出了一种基于大语言模型的运动控制策略 MotionZero:从提示中提取不同对象的运动先验,并对不同对象的运动进行解耦的独立控制;此外,还提出了一种运动感知注意力机制,以适应视频中运动幅度的变化。
  • results: 实验表明,MotionZero 可以正确地控制不同对象的动作,并且支持多种应用,如零例视频编辑。
    Abstract Zero-shot Text-to-Video synthesis generates videos based on prompts without any videos. Without motion information from videos, motion priors implied in prompts are vital guidance. For example, the prompt "airplane landing on the runway" indicates motion priors that the "airplane" moves downwards while the "runway" stays static. Whereas the motion priors are not fully exploited in previous approaches, thus leading to two nontrivial issues: 1) the motion variation pattern remains unaltered and prompt-agnostic for disregarding motion priors; 2) the motion control of different objects is inaccurate and entangled without considering the independent motion priors of different objects. To tackle the two issues, we propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero, which derives motion priors from prompts of different objects by Large-Language-Models and accordingly applies motion control of different objects to corresponding regions in disentanglement. Furthermore, to facilitate videos with varying degrees of motion amplitude, we propose a Motion-Aware Attention scheme which adjusts attention among frames by motion amplitude. Extensive experiments demonstrate that our strategy could correctly control motion of different objects and support versatile applications including zero-shot video edit.
    摘要 零样本文本到视频合成仅基于提示生成视频,而不依赖任何视频样本。在缺乏视频运动信息的情况下,提示中隐含的运动先验是至关重要的指引。例如,提示"飞机降落在跑道上"蕴含这样的运动先验:"飞机"向下运动,而"跑道"保持静止。然而,以往方法未能充分利用这些运动先验,带来两个不容忽视的问题:1)由于忽略运动先验,运动变化模式保持不变且与提示无关;2)未考虑不同对象各自独立的运动先验,导致对不同对象的运动控制不准确且相互纠缠。为解决这两个问题,我们提出了一种提示自适应、解耦的运动控制策略 MotionZero:利用大语言模型从提示中推导不同对象的运动先验,并据此以解耦的方式对相应区域施加不同对象的运动控制。此外,为适应不同运动幅度的视频,我们提出了运动感知注意力机制,根据运动幅度调整帧间注意力。大量实验表明,我们的策略能够正确控制不同对象的运动,并支持包括零样本视频编辑在内的多种应用。

On the Calibration of Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.17105
  • repo_url: https://github.com/leob03/HRC_extrinsic_calib
  • paper_authors: Kerui Gu, Rongyu Chen, Angela Yao
  • for: 这篇论文主要针对2D人姿估计中的误差问题,具体来说是对 pose estimation 中的 keypoint confidence 进行调整和改进。
  • methods: 该论文通过理论分析和实验探讨了当前 pose estimation 方法中的误校准问题,并提出了一种基于置信度与姿态精度一致性的 Calibrated ConfidenceNet(CCNet)来解决该问题。
  • results: 实验结果表明,CCNet 可以提升 pose estimation 的精度,在现成框架上将 AP 提高至多 1.4%,并在下游的网格恢复任务中使 3D 关键点误差再降低 1.0mm。
    Abstract Most 2D human pose estimation frameworks estimate keypoint confidence in an ad-hoc manner, using heuristics such as the maximum value of heatmaps. The confidence is part of the evaluation scheme, e.g., AP for the MSCOCO dataset, yet has been largely overlooked in the development of state-of-the-art methods. This paper takes the first steps in addressing miscalibration in pose estimation. From a calibration point of view, the confidence should be aligned with the pose accuracy. In practice, existing methods are poorly calibrated. We show, through theoretical analysis, why a miscalibration gap exists and how to narrow the gap. Simply predicting the instance size and adjusting the confidence function gives considerable AP improvements. Given the black-box nature of deep neural networks, however, it is not possible to fully close this gap with only closed-form adjustments. As such, we go one step further and learn network-specific adjustments by enforcing consistency between confidence and pose accuracy. Our proposed Calibrated ConfidenceNet (CCNet) is a light-weight post-hoc addition that improves AP by up to 1.4% on off-the-shelf pose estimation frameworks. Applied to the downstream task of mesh recovery, CCNet facilitates an additional 1.0mm decrease in 3D keypoint error.
    摘要 大多数 2D 人体姿态估计框架以一种临时的方式估计关键点置信度,例如取热图的最大值。置信度是评测方案(如 MSCOCO 的 AP)的一部分,却在最先进方法的发展中长期被忽视。本文迈出了解决姿态估计误校准问题的第一步。从校准的角度看,置信度应当与姿态精度保持一致,而实践中现有方法的校准都很差。我们通过理论分析说明了误校准差距为何存在以及如何缩小。仅仅预测实例尺寸并据此调整置信度函数,就能带来可观的 AP 提升。然而,由于深度神经网络的黑盒性质,仅靠闭式调整无法完全消除该差距;因此我们更进一步,通过强制置信度与姿态精度的一致性来学习针对具体网络的调整。我们提出的 Calibrated ConfidenceNet(CCNet)是一个轻量级的事后模块,可使现成姿态估计框架的 AP 提升至多 1.4%。应用于下游的网格恢复任务时,CCNet 还能使 3D 关键点误差再降低 1.0mm。
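
A quick way to quantify the miscalibration discussed above is an expected-calibration-error style gap between keypoint confidence and empirical correctness (e.g., a PCK/OKS hit flag); the sketch below is a generic calibration check, not the paper's evaluation protocol.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted gap between mean confidence and empirical accuracy, per confidence bin.

    `correct` would typically be a 0/1 flag such as "keypoint within an OKS/PCK
    threshold"; a well-calibrated pose estimator keeps this gap small.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# toy usage with deliberately overconfident predictions
rng = np.random.default_rng(0)
conf = rng.random(5000)
correct = (rng.random(5000) < conf ** 2).astype(float)
print(expected_calibration_error(conf, correct))
```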

Visual Semantic Navigation with Real Robots

  • paper_url: http://arxiv.org/abs/2311.16623
  • repo_url: https://github.com/gramuah/ros4vsn
  • paper_authors: Carlos Gutiérrez-Álvarez, Pablo Ríos-Navarro, Rafael Flor-Rodríguez, Francisco Javier Acevedo-Rodríguez, Roberto J. López-Sastre
  • for: 这个研究旨在将视觉语义导航(VSN)模型集成到真实世界的机器人中,以构建真正的具身智能体。
  • methods: 我们提出了一个新的解决方案,将 VSN 模型集成到 ROS 兼容的机器人中,并发布了一个基于 ROS 的新框架 ROS4VSN,使任何 VSN 模型都能轻松部署到任何 ROS 兼容的机器人上并在真实环境中测试。
  • results: 我们在两个不同的机器人上嵌入了两个最先进的 VSN 智能体,实验显示这些 VSN 方案在真实世界与模拟环境中的表现存在明显差异。
    Abstract Visual Semantic Navigation (VSN) is the ability of a robot to learn visual semantic information for navigating in unseen environments. These VSN models are typically tested in those virtual environments where they are trained, mainly using reinforcement learning based approaches. Therefore, we do not yet have an in-depth analysis of how these models would behave in the real world. In this work, we propose a new solution to integrate VSN models into real robots, so that we have true embodied agents. We also release a novel ROS-based framework for VSN, ROS4VSN, so that any VSN-model can be easily deployed in any ROS-compatible robot and tested in a real setting. Our experiments with two different robots, where we have embedded two state-of-the-art VSN agents, confirm that there is a noticeable performance difference of these VSN solutions when tested in real-world and simulation environments. We hope that this research will endeavor to provide a foundation for addressing this consequential issue, with the ultimate aim of advancing the performance and efficiency of embodied agents within authentic real-world scenarios. Code to reproduce all our experiments can be found at https://github.com/gramuah/ros4vsn.
    摘要 Visual Semantic Navigation (VSN) 是一种机器人学习视觉Semantic信息以在未经看过的环境中导航。这些 VSN 模型通常在训练环境中进行测试,主要采用强化学习基本方法。因此,我们还没有深入分析这些模型在实际世界中的行为。在这项工作中,我们提出了一种新的解决方案,将 VSN 模型集成到真实的机器人中,以创建真正的肉体代理人。我们还开发了一个基于 ROS 的 VSN 框架,即 ROS4VSN,使得任何 VSN 模型都可以轻松地在任何 ROS 兼容的机器人上部署和测试。我们对两种不同的机器人进行了实验,并将两个当前顶尖 VSN 解决方案集成到了这两种机器人中。我们发现,在真实世界和模拟环境中测试 VSN 解决方案时,有显著的性能差异。我们希望通过这项研究,为实际世界中肉体代理人的性能和效率提供基础,以便在真实世界中进一步提高肉体代理人的表现。所有我们实验的代码可以在 GitHub 上找到,请参考 https://github.com/gramuah/ros4vsn。

Cross-level Attention with Overlapped Windows for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2311.16618
  • repo_url: None
  • paper_authors: Jiepan Li, Fangxiao Lu, Nan Xue, Zhuohong Li, Hongyan Zhang, Wei He
  • for: This work aims to improve the accuracy of camouflaged object detection (COD).
  • methods: The method fuses high-level semantic features with low-level detail features and proposes an overlapped window cross-level attention (OWinCA) module to enhance the low-level features.
  • results: Experiments on three large-scale COD datasets show that the proposed OWinCANet significantly outperforms state-of-the-art COD methods.
    Abstract Camouflaged objects adaptively fit their color and texture with the environment, which makes them indistinguishable from the surroundings. Current methods revealed that high-level semantic features can highlight the differences between camouflaged objects and the backgrounds. Consequently, they integrate high-level semantic features with low-level detailed features for accurate camouflaged object detection (COD). Unlike previous designs for multi-level feature fusion, we state that enhancing low-level features is more impending for COD. In this paper, we propose an overlapped window cross-level attention (OWinCA) to achieve the low-level feature enhancement guided by the highest-level features. By sliding an aligned window pair on both the highest- and low-level feature maps, the high-level semantics are explicitly integrated into the low-level details via cross-level attention. Additionally, it employs an overlapped window partition strategy to alleviate the incoherence among windows, which prevents the loss of global information. These adoptions enable the proposed OWinCA to enhance low-level features by promoting the separability of camouflaged objects. The associated proposed OWinCANet fuses these enhanced multi-level features by simple convolution operation to achieve the final COD. Experiments conducted on three large-scale COD datasets demonstrate that our OWinCANet significantly surpasses the current state-of-the-art COD methods.
    摘要 伪装物体可以适应环境的颜色和文化,使其与背景完全一致。现有方法表明,高水平semantic特征可以强调掩饰物体和背景之间的差异。因此,它们将高水平semantic特征与低水平细节特征相结合以实现准确的掩饰物体检测(COD)。不同于之前的多级特征融合设计,我们认为提高低水平特征更加重要 для COD。在这篇论文中,我们提出了覆盖窗口交叉水平注意力(OWinCA)来实现低水平特征的提升,这些特征被最高水平semantic特征引导。通过将均匀窗口对在最高水平和低水平特征图中进行对齐,高水平semantic特征与低水平细节特征进行交叉注意力的同时,进一步提高了低水平特征的分割性。此外,我们采用了覆盖窗口分区策略,以避免窗口之间的不一致,从而保持全局信息的完整性。这些采用使得我们提出的 OWinCA 能够提高低水平特征,从而提高掩饰物体的分割性。与此同时,我们还提出了 OWinCANet,它将这些提高后的多级特征进行简单的卷积操作,以实现最终的 COD。实验表明,我们的 OWinCANet 在三个大规模 COD 数据集上显著超越当前状态的最佳 COD 方法。

Filter-Pruning of Lightweight Face Detectors Using a Geometric Median Criterion

  • paper_url: http://arxiv.org/abs/2311.16613
  • repo_url: https://github.com/idt-iti/lightweight-face-detector-pruning
  • paper_authors: Konstantinos Gkrispanis, Nikolaos Gkalelis, Vasileios Mezaris
  • for: This paper prunes already compact face detectors with a geometric-median criterion so that they suit edge devices with limited processing power and memory.
  • methods: Filter Pruning via Geometric Median (FPGM) is combined with the iterative Soft Filter Pruning (SFP) procedure, with L1-norm pruning used as a baseline for comparison (a short illustrative sketch follows at the end of this entry).
  • results: Experiments show the proposed approach can further reduce the size of already lightweight face detectors with limited accuracy loss, and even small accuracy gains at low pruning rates.
    Abstract Face detectors are becoming a crucial component of many applications, including surveillance, that often have to run on edge devices with limited processing power and memory. Therefore, there's a pressing demand for compact face detection models that can function efficiently across resource-constrained devices. Over recent years, network pruning techniques have attracted a lot of attention from researchers. These methods haven't been well examined in the context of face detectors, despite their expanding popularity. In this paper, we implement filter pruning on two already small and compact face detectors, named EXTD (Extremely Tiny Face Detector) and EResFD (Efficient ResNet Face Detector). The main pruning algorithm that we utilize is Filter Pruning via Geometric Median (FPGM), combined with the Soft Filter Pruning (SFP) iterative procedure. We also apply L1 Norm pruning, as a baseline to compare with the proposed approach. The experimental evaluation on the WIDER FACE dataset indicates that the proposed approach has the potential to further reduce the model size of already lightweight face detectors, with limited accuracy loss, or even with small accuracy gain for low pruning rates.
    摘要 face 检测器在许多应用程序中变得越来越重要,包括监控,这些应用程序经常需要在边缘设备上运行,这些设备通常具有有限的处理能力和内存。因此,有一个急需更加压缩的面部检测模型,以便在有限的资源下运行。在过去几年中,网络剪辑技术引起了研究人员的广泛关注。尽管这些技术在面部检测器中没有得到广泛的检查,但它们在扩展的应用场景中表现出了潜在的优势。在这篇论文中,我们对两个已经非常小型和紧凑的面部检测器,即EXTD(极其简单的面部检测器)和EResFD(高效的ResNet面部检测器)进行了筛选器剪辑。我们使用的主要筛选器剪辑算法是 Filter Pruning via Geometric Median(FPGM),并与Soft Filter Pruning(SFP)迭代过程相结合。此外,我们还应用了L1 Norm剪辑,以作为基准对比的方法。实验评估于WIDER FACE数据集表明,我们的方法有可能进一步减少已经轻量级的面部检测器的模型大小,即使剪辑率较高,也只带来有限的减少精度损失,或者甚至带来小范围内的精度提升。
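
The FPGM criterion above is simple to sketch: for each convolutional layer, prune the filters whose summed distance to all other filters is smallest (i.e., those closest to the geometric median), and in the soft-pruning variant zero them while keeping them trainable. The helper names, layer sizes, and pruning ratio below are illustrative, not taken from the paper's repository.

```python
import torch
import torch.nn as nn

def fpgm_prune_mask(conv: nn.Conv2d, prune_ratio: float = 0.3) -> torch.Tensor:
    """Return a boolean mask over output filters: True = keep, False = prune.
    Filters closest to the geometric median of all filters are pruned
    (a sketch of the FPGM criterion, not the reference implementation)."""
    w = conv.weight.detach().flatten(1)              # (out_channels, in*k*k)
    dist = torch.cdist(w, w, p=2)                    # pairwise distances between filters
    # A filter's total distance to all others proxies its distance to the geometric
    # median: a small total means it sits near the median and is considered redundant.
    total = dist.sum(dim=1)
    n_prune = int(prune_ratio * w.shape[0])
    prune_idx = torch.argsort(total)[:n_prune]
    mask = torch.ones(w.shape[0], dtype=torch.bool)
    mask[prune_idx] = False
    return mask

def soft_prune_(conv: nn.Conv2d, mask: torch.Tensor) -> None:
    """Soft Filter Pruning step: zero the pruned filters but keep them trainable,
    so they may recover in later epochs before the final hard prune."""
    with torch.no_grad():
        conv.weight[~mask] = 0.0
        if conv.bias is not None:
            conv.bias[~mask] = 0.0

# Usage on a toy layer
layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
mask = fpgm_prune_mask(layer, prune_ratio=0.3)
soft_prune_(layer, mask)
print(f"kept {int(mask.sum())} of {mask.numel()} filters")
```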

Empowering COVID-19 Detection: Optimizing Performance Through Fine-Tuned EfficientNet Deep Learning Architecture

  • paper_url: http://arxiv.org/abs/2311.16593
  • repo_url: None
  • paper_authors: Md. Alamin Talukder, Md. Abu Layek, Mohsin Kazi, Md Ashraf Uddin, Sunil Aryal
  • for: This study develops a chest X-ray based COVID-19 detection method to help clinicians diagnose COVID-19 quickly and accurately.
  • methods: Deep learning models are applied to chest X-ray images, with established transfer-learning architectures fine-tuned on appropriate layers (a short illustrative sketch follows at the end of this entry).
  • results: The fine-tuned EfficientNetB4 reaches 100% accuracy on a COVID-19 X-ray dataset, and achieves 99.17% accuracy, 99.13% precision, and 99.16% recall on a chest X-ray lung-disease dataset, indicating high accuracy and efficiency.
    Abstract The worldwide COVID-19 pandemic has profoundly influenced the health and everyday experiences of individuals across the planet. It is a highly contagious respiratory disease requiring early and accurate detection to curb its rapid transmission. Initial testing methods primarily revolved around identifying the genetic composition of the coronavirus, exhibiting a relatively low detection rate and requiring a time-intensive procedure. To address this challenge, experts have suggested using radiological imagery, particularly chest X-rays, as a valuable approach within the diagnostic protocol. This study investigates the potential of leveraging radiographic imaging (X-rays) with deep learning algorithms to swiftly and precisely identify COVID-19 patients. The proposed approach elevates the detection accuracy by fine-tuning with appropriate layers on various established transfer learning models. The experimentation was conducted on a COVID-19 X-ray dataset containing 2000 images. The accuracy rates achieved were impressive of 100% for EfficientNetB4 model. The fine-tuned EfficientNetB4 achieved an excellent accuracy score, showcasing its potential as a robust COVID-19 detection model. Furthermore, EfficientNetB4 excelled in identifying Lung disease using Chest X-ray dataset containing 4,350 Images, achieving remarkable performance with an accuracy of 99.17%, precision of 99.13%, recall of 99.16%, and f1-score of 99.14%. These results highlight the promise of fine-tuned transfer learning for efficient lung detection through medical imaging, especially with X-ray images. This research offers radiologists an effective means of aiding rapid and precise COVID-19 diagnosis and contributes valuable assistance for healthcare professionals in accurately identifying affected patients.
    摘要 全球COVID-19大流行对人类健康和日常生活产生了深远的影响。这是一种高度传染性的呼吸道疾病,早期检测是阻断其迅速传播的关键。初期检测方法主要是通过识别 koronavirus 的遗传组成来进行,但这种方法的检测率较低,需要时间consuming 的过程。为了解决这个挑战,专家建议使用 radiological imaging(X射线图像)作为诊断协议的一部分。本研究探讨了利用 radiographic imaging(X射线图像)和深度学习算法来快速和准确地诊断COVID-19患者。我们在 COVID-19 X射线图像集中进行了2000张图像的实验,实现了100%的准确率。我们使用 EfficientNetB4 模型进行了精细调整,并 achieved 惊人的准确率(100%)。此外,我们发现 EfficientNetB4 模型在4350张 X射线图像中的肺病检测中表现出色,具有99.17%的准确率、99.13%的精度、99.16%的回归率和99.14%的 F1 分数。这些结果表明 fine-tuned transfer learning 在医疗影像检测中具有优势,特别是在 X射线图像上。这项研究为医生提供了一种有效的帮助方式,以便快速和准确地诊断 COVID-19 病例,并为医疗专业人员提供了准确地诊断患者的有用工具。
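
As a hedged illustration of the fine-tuning recipe discussed above (not the authors' exact configuration: the frozen-block cutoff, learning rate, and data handling are assumptions), a transfer-learning setup with torchvision's EfficientNet-B4 might look like this:

```python
import torch
import torch.nn as nn
from torchvision import models

def build_covid_classifier(num_classes: int = 2) -> nn.Module:
    """Illustrative fine-tuning setup: start from ImageNet weights,
    freeze the earliest feature blocks, and replace the classifier head."""
    model = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.IMAGENET1K_V1)
    for block in model.features[:5]:          # freeze early blocks, tune the rest
        for p in block.parameters():
            p.requires_grad = False
    in_features = model.classifier[1].in_features
    model.classifier[1] = nn.Linear(in_features, num_classes)
    return model

model = build_covid_classifier()
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One dummy step; in practice replace with a DataLoader over chest X-ray images.
images = torch.randn(4, 3, 380, 380)          # EfficientNet-B4's nominal input resolution
labels = torch.randint(0, 2, (4,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward(); optimizer.step()
```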

Improving Lane Detection Generalization: A Novel Framework using HD Maps for Boosting Diversity

  • paper_url: http://arxiv.org/abs/2311.16589
  • repo_url: None
  • paper_authors: Daeun Lee, Minhyeok Heo, Jiwon Kim
  • for: Improving the generalization of lane detection algorithms across diverse road environments.
  • methods: Data diversity is enhanced using High-Definition (HD) maps and generative models, and a core subset of the data is strategically selected to maximize diversity and optimize performance.
  • results: Experiments show the framework improves the generalization performance of lane detection, comparable to domain adaptation-based methods.
    Abstract Lane detection is a vital task for vehicles to navigate and localize their position on the road. To ensure reliable results, lane detection algorithms must have robust generalization performance in various road environments. However, despite the significant performance improvement of deep learning-based lane detection algorithms, their generalization performance in response to changes in road environments still falls short of expectations. In this paper, we present a novel framework for single-source domain generalization (SSDG) in lane detection. By decomposing data into lane structures and surroundings, we enhance diversity using High-Definition (HD) maps and generative models. Rather than expanding data volume, we strategically select a core subset of data, maximizing diversity and optimizing performance. Our extensive experiments demonstrate that our framework enhances the generalization performance of lane detection, comparable to the domain adaptation-based method.
    摘要 Lane detection 是车辆导航和确定位置的关键任务。为确保可靠性,车道检测算法必须具有多样化环境的可靠性。虽然深度学习基于的车道检测算法表现出色,但它们在环境变化后的泛化性仍然不够。在这篇论文中,我们提出了一种单源领域泛化(SSDG)框架。我们将数据分解为车道结构和周围环境,并使用高清地图和生成模型增强多样性。而不是扩大数据量,我们策略性选择核心数据集,最大化多样性并优化性能。我们的广泛实验表明,我们的框架可以提高车道检测的泛化性,与领域适应基于方法相当。

Robust Diffusion GAN using Semi-Unbalanced Optimal Transport

  • paper_url: http://arxiv.org/abs/2311.17101
  • repo_url: None
  • paper_authors: Quan Dao, Binh Ta, Tung Pham, Anh Tran
  • for: Improving the robustness and performance of diffusion models so that they hold up in practical applications with corrupted data.
  • methods: A robust training technique based on semi-unbalanced optimal transport mitigates the impact of outlier samples.
  • results: RDGAN outperforms vanilla DDGAN in image quality, mode coverage, and inference speed, and exhibits improved robustness on datasets containing outlier samples.
    Abstract Diffusion models, a type of generative model, have demonstrated great potential for synthesizing highly detailed images. By integrating with GAN, advanced diffusion models like DDGAN \citep{xiao2022DDGAN} could approach real-time performance for expansive practical applications. While DDGAN has effectively addressed the challenges of generative modeling, namely producing high-quality samples, covering different data modes, and achieving faster sampling, it remains susceptible to performance drops caused by datasets that are corrupted with outlier samples. This work introduces a robust training technique based on semi-unbalanced optimal transport to mitigate the impact of outliers effectively. Through comprehensive evaluations, we demonstrate that our robust diffusion GAN (RDGAN) outperforms vanilla DDGAN in terms of the aforementioned generative modeling criteria, i.e., image quality, mode coverage of distribution, and inference speed, and exhibits improved robustness when dealing with both clean and corrupted datasets.
    摘要 传播模型,一种生成模型,在生成高级精照图像方面表现出色。通过与GAN结合,进阶传播模型如DDGAN(《Xiao et al。(2022)》)可以实现实时性,实现广泛的实用应用。 although DDGAN已经很好地解决生成模型的挑战,包括生成高质量样本、覆盖不同数据模式以及更快的样本生成,但是它仍然受到受扰应用数据中的噪音样本所影响。这个研究提出了一种基于半不对称优先运输的强健训练技术,以对噪音样本进行有效防护。通过全面评估,我们展示了我们的强健扩散GAN(RDGAN)在生成模型的评估标准,例如图像质量、分布覆盖率和推断速度等方面,与普通的DDGAN有所不同,并且在清洁和受扰数据集中具有更好的韧性。

GeoScaler: Geometry and Rendering-Aware Downsampling of 3D Mesh Textures

  • paper_url: http://arxiv.org/abs/2311.16581
  • repo_url: None
  • paper_authors: Sai Karthikey Pentapati, Anshul Rai, Arkady Ten, Chaitanya Atluru, Alan Bovik
  • for: Downsampling 3D mesh textures while preserving the visual quality and detail of the rendered scene.
  • methods: GeoScaler performs texture downsampling guided by the mesh geometry and its UV parametrization, maximizing the visual fidelity of the rendered views of the textured meshes.
  • results: Compared with traditional downsampling methods, the textures generated by GeoScaler deliver significantly better-quality rendered images.
    Abstract High-resolution texture maps are necessary for representing real-world objects accurately with 3D meshes. The large sizes of textures can bottleneck the real-time rendering of high-quality virtual 3D scenes on devices having low computational budgets and limited memory. Downsampling the texture maps directly addresses the issue, albeit at the cost of visual fidelity. Traditionally, downsampling of texture maps is performed using methods like bicubic interpolation and the Lanczos algorithm. These methods ignore the geometric layout of the mesh and its UV parametrization and also do not account for the rendering process used to obtain the final visualization that the users will experience. Towards filling these gaps, we introduce GeoScaler, which is a method of downsampling texture maps of 3D meshes while incorporating geometric cues, and by maximizing the visual fidelity of the rendered views of the textured meshes. We show that the textures generated by GeoScaler deliver significantly better quality rendered images compared to those generated by traditional downsampling methods
    摘要 高分辨率文字地图是必需的 для准确地表示真实世界对象使用3D网格。大文件大小的文字地图可能会卡顿实时渲染高质量虚拟3D场景,特别是设备有限的计算预算和内存。直接下采样文字地图可以解决这个问题,但是会导致视觉精度下降。传统下采样方法包括二次插值和兰佐斯算法,这些方法忽略网格的几何布局和UV参数化,也不考虑渲染过程来获得最终用户可见的视觉效果。为了填补这些空白,我们介绍了GeoScaler,它是一种基于网格几何布局和渲染过程的文字地图下采样方法,并最大化渲染视图中文字地图的视觉精度。我们表明,由GeoScaler生成的文字地图可以提供较高的视觉质量渲染图像,比传统下采样方法更好。

Clean Label Disentangling for Medical Image Segmentation with Noisy Labels

  • paper_url: http://arxiv.org/abs/2311.16580
  • repo_url: https://github.com/xiaoyao3302/2bdenoise
  • paper_authors: Zicheng Wang, Zhen Zhao, Erjian Guo, Luping Zhou
  • for: Addressing the noisy-label problem in medical image segmentation to improve accuracy and reliability.
  • methods: A simple yet efficient class-balanced sampling strategy is proposed and extended into a noisy feature-aided clean label disentangling framework (a short illustrative sketch follows at the end of this entry).
  • results: Experiments validate the effectiveness of the approach, which achieves new state-of-the-art performance; code is available at https://github.com/xiaoyao3302/2BDenoise.
    Abstract Current methods focusing on medical image segmentation suffer from incorrect annotations, which is known as the noisy label issue. Most medical image segmentation with noisy labels methods utilize either noise transition matrix, noise-robust loss functions or pseudo-labeling methods, while none of the current research focuses on clean label disentanglement. We argue that the main reason is that the severe class-imbalanced issue will lead to the inaccuracy of the selected ``clean'' labels, thus influencing the robustness of the model against the noises. In this work, we come up with a simple but efficient class-balanced sampling strategy to tackle the class-imbalanced problem, which enables our newly proposed clean label disentangling framework to successfully select clean labels from the given label sets and encourages the model to learn from the correct annotations. However, such a method will filter out too many annotations which may also contain useful information. Therefore, we further extend our clean label disentangling framework to a new noisy feature-aided clean label disentangling framework, which takes the full annotations into utilization to learn more semantics. Extensive experiments have validated the effectiveness of our methods, where our methods achieve new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/2BDenoise.
    摘要 当前的医学图像分割方法普遍受到错误标注(即噪声标签问题)的影响。大多数带噪声标签的医学图像分割方法使用噪声转移矩阵、抗噪声损失函数或伪标签技术,而没有研究专注于干净标签的解耦。我们认为主要原因在于严重的类别不均衡问题会导致所选“干净”标签不准确,从而影响模型对噪声的鲁棒性。在本工作中,我们提出了一种简单而高效的类别均衡采样策略来解决类别不均衡问题,使新提出的干净标签解耦框架能够从给定标签集中选出干净标签,并促使模型从正确的标注中学习。然而,这种方法可能会过滤掉过多的标注,其中也可能包含有用信息。因此,我们进一步将干净标签解耦框架扩展为噪声特征辅助的干净标签解耦框架,利用全部标注来学习更多语义。大量实验验证了我们方法的有效性,取得了新的最先进性能。代码可在 https://github.com/xiaoyao3302/2BDenoise 获取。
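
A minimal sketch of the class-balanced sampling idea described above, applied at the pixel level for segmentation: labeled pixels are drawn with probability inversely proportional to their class frequency, so that clean-label selection is not dominated by the majority class. This is a simplified illustration under our own assumptions, not the authors' implementation.

```python
import torch

def class_balanced_pixel_sample(labels: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Sample pixel indices from an (H, W) label map with probability inversely
    proportional to class frequency (sketch of class-balanced sampling)."""
    flat = labels.flatten()
    classes, counts = flat.unique(return_counts=True)
    freq = torch.zeros(int(classes.max()) + 1)
    freq[classes] = counts.float()
    weights = 1.0 / freq[flat]                     # rare classes get large weights
    idx = torch.multinomial(weights, n_samples, replacement=True)
    return idx                                     # indices into the flattened label map

# Toy usage: a 64x64 map heavily dominated by background class 0.
labels = torch.zeros(64, 64, dtype=torch.long)
labels[:4, :4] = 1                                 # small foreground region
picked = class_balanced_pixel_sample(labels, n_samples=256)
print("foreground fraction among sampled pixels:",
      (labels.flatten()[picked] == 1).float().mean().item())
```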

Efficient Key-Based Adversarial Defense for ImageNet by Using Pre-trained Model

  • paper_url: http://arxiv.org/abs/2311.16577
  • repo_url: None
  • paper_authors: AprilPyone MaungMaung, Isao Echizen, Hitoshi Kiya
  • for: This paper proposes key-based defense model proliferation by leveraging pre-trained models and recent efficient fine-tuning techniques for ImageNet-1k classification.
  • methods: Recent model deployment tooling such as Apple CoreML makes key-based models feasible on edge devices, and efficient fine-tuning of pre-trained models is used to proliferate key-based models even with limited computing resources (a short illustrative sketch follows at the end of this entry).
  • results: Experiments on ImageNet-1k with adaptive and non-adaptive attacks show the fine-tuned key-based models achieve more than a 10% accuracy improvement over previous key-based models on both clean and adversarial examples.
    Abstract In this paper, we propose key-based defense model proliferation by leveraging pre-trained models and utilizing recent efficient fine-tuning techniques on ImageNet-1k classification. First, we stress that deploying key-based models on edge devices is feasible with the latest model deployment advancements, such as Apple CoreML, although the mainstream enterprise edge artificial intelligence (Edge AI) has been focused on the Cloud. Then, we point out that the previous key-based defense on on-device image classification is impractical for two reasons: (1) training many classifiers from scratch is not feasible, and (2) key-based defenses still need to be thoroughly tested on large datasets like ImageNet. To this end, we propose to leverage pre-trained models and utilize efficient fine-tuning techniques to proliferate key-based models even on limited computing resources. Experiments were carried out on the ImageNet-1k dataset using adaptive and non-adaptive attacks. The results show that our proposed fine-tuned key-based models achieve a superior classification accuracy (more than 10% increase) compared to the previous key-based models on classifying clean and adversarial examples.
    摘要 在这篇论文中,我们提出了基于键的防御模型扩散方法,利用预训练模型和最新的有效精细调整技术在ImageNet-1k分类 задании。首先,我们强调在Edge设备上部署基于键的模型是可能的,即使主流企业端人工智能(Edge AI)在云端集中化。然后,我们指出了以前的键基于防御在设备上的图像分类是不现实的,因为训练多个分类器从零是不可能,而且键基于防御仍需要在大量数据集如ImageNet进行详细测试。为此,我们提议利用预训练模型和有效精细调整技术来扩散基于键的模型,即使在有限的计算资源下。实验在ImageNet-1k数据集上使用适应和非适应攻击。结果显示,我们提出的精细调整后的基于键模型在分类清洁和攻击样本上达到了超过10%的提升。
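
Key-based defenses in this line of work typically transform inputs with a secret key and then fine-tune a pre-trained classifier on the transformed images. The sketch below shows one such transformation, a key-dependent block-wise pixel shuffle; it is an assumption-laden illustration of the general idea, not necessarily the exact transformation used in this paper.

```python
import torch

def keyed_block_shuffle(images: torch.Tensor, key: int, block: int = 16) -> torch.Tensor:
    """Shuffle the pixels inside each (block x block) patch with a permutation
    derived from a secret key. Illustrative of key-based input transformations."""
    b, c, h, w = images.shape
    assert h % block == 0 and w % block == 0
    g = torch.Generator().manual_seed(key)
    perm = torch.randperm(block * block, generator=g)     # secret, key-dependent permutation
    # (B, C, H/bl, bl, W/bl, bl) -> (B, C, H/bl, W/bl, bl*bl)
    x = images.view(b, c, h // block, block, w // block, block)
    x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h // block, w // block, block * block)
    x = x[..., perm]                                       # apply the keyed shuffle
    x = x.view(b, c, h // block, w // block, block, block)
    x = x.permute(0, 1, 2, 4, 3, 5).reshape(b, c, h, w)
    return x

# A pre-trained ImageNet model would then be fine-tuned on keyed_block_shuffle(x, key);
# only holders of the key can produce inputs the fine-tuned model classifies correctly.
x = torch.randn(2, 3, 224, 224)
x_keyed = keyed_block_shuffle(x, key=1234)
print(x_keyed.shape)
```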

MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices

  • paper_url: http://arxiv.org/abs/2311.16567
  • repo_url: None
  • paper_authors: Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou
  • for: Making large-scale text-to-image diffusion models deployable on mobile devices by reducing model size and inference time.
  • methods: Computational efficiency is improved through architecture and sampling optimizations, and distillation plus diffusion-GAN finetuning bring inference down to 8 steps and 1 step respectively.
  • results: MobileDiffusion generates a $512\times512$ image on mobile devices in sub-second time, establishing a new state of the art.
    Abstract The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose \textbf{MobileDiffusion}, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable \textbf{sub-second} inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement

  • paper_url: http://arxiv.org/abs/2311.16495
  • repo_url: https://github.com/jianwang-mpi/egowholemocap
  • paper_authors: Jian Wang, Zhe Cao, Diogo Luvizon, Lingjie Liu, Kripasindhu Sarkar, Danhang Tang, Thabo Beeler, Christian Theobalt
  • for: This work studies egocentric whole-body motion capture from a single fisheye camera, estimating human body and hand motion simultaneously.
  • methods: The pipeline uses FisheyeViT to extract fisheye image features, converts them into pixel-aligned 3D heatmap representations for body pose prediction, adds dedicated hand detection and hand pose estimation networks to regress 3D hand poses, and refines the estimated whole-body motion with a diffusion-based motion prior that accounts for joint uncertainties.
  • results: The networks are trained on EgoWholeBody, a large synthetic dataset of 840,000 high-quality egocentric images; quantitative and qualitative evaluations show the method produces high-quality whole-body motion estimates from a single egocentric camera.
    Abstract In this work, we explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion. This task presents significant challenges due to three factors: the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion. To address these challenges, we propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are subsequently converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction. For hand tracking, we incorporate dedicated hand detection and hand pose estimation networks for regressing 3D hand poses. Finally, we develop a diffusion-based whole-body motion prior model to refine the estimated whole-body motion while accounting for joint uncertainties. To train these networks, we collect a large synthetic dataset, EgoWholeBody, comprising 840,000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in producing high-quality whole-body motion estimates from a single egocentric camera.
    摘要 在这项工作中,我们探索了一种使用单个鱼眼镜头进行自我中心全身运动捕捉,这种方法同时估算人体Body和手部运动。这个任务存在三个因素的挑战:lack of high-quality datasets,鱼眼镜头扭曲和人体自 occlusion。为了解决这些挑战,我们提出了一种新的方法,利用FisheyeViT提取鱼眼图像特征,然后将其转换为像素对齐的3D热图表示,用于3D人体姿势预测。为了跟踪手部,我们添加了专门的手部检测和手部姿势预测网络,以回归3D手部姿势。最后,我们开发了一种基于扩散的整体运动先验模型,以修正估算的整体运动,同时考虑关节不确定性。为了训练这些网络,我们收集了一个大量的synthetic dataset,EgoWholeBody,包含840,000高质量的自我中心 egocentric 图像, captured across a diverse range of whole-body motion sequences。量化和质量评估表明,我们的方法可以高效地从单个 egocentric camera 中生成高质量的整体运动估计。

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

  • paper_url: http://arxiv.org/abs/2311.16565
  • repo_url: None
  • paper_authors: Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen
  • for: This work proposes a diffusion-based speech-driven 3D facial animation system that personalizes facial animation via contrastive learning and accelerates generation via knowledge distillation.
  • methods: A learnable talking identity is introduced to aggregate knowledge from audio sequences, and the identity embeddings extract customized facial cues across different speakers in a contrastive manner; the trained multi-hundred-step diffusion model is distilled into a lightweight 8-step model.
  • results: Experiments show the method outperforms prior approaches in both personalization and generation speed; the code will be released.
    Abstract Speech-driven 3D facial animation has been an attractive task in both academia and industry. Traditional methods mostly focus on learning a deterministic mapping from speech to animation. Recent approaches start to consider the non-deterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. However, personalizing facial animation and accelerating animation generation are still two major limitations of existing diffusion-based methods. To address the above limitations, we propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation. Specifically, to enable personalization, we introduce a learnable talking identity to aggregate knowledge in audio sequences. The proposed identity embeddings extract customized facial cues across different people in a contrastive learning manner. During inference, users can obtain personalized facial animation based on input audio, reflecting a specific talking style. With a trained diffusion model with hundreds of steps, we distill it into a lightweight model with 8 steps for acceleration. Extensive experiments are conducted to demonstrate that our method outperforms state-of-the-art methods. The code will be released.
    摘要 语音驱动的三维人脸动画是学术界和业界都颇具吸引力的任务。传统方法主要集中在学习从语音到动画的确定性映射。近期方法开始考虑语音驱动三维人脸动画的非确定性,并采用扩散模型来完成该任务。然而,现有基于扩散的方法仍存在两大限制:难以个性化面部动画,以及动画生成速度不够快。为解决上述限制,我们提出了DiffusionTalker方法,它利用对比学习来个性化三维人脸动画,并通过知识蒸馏来加速三维动画生成。具体来说,我们引入了一个可学习的说话身份(talking identity)来聚合音频序列中的知识。所提出的身份嵌入以对比学习的方式提取不同说话者的个性化面部线索。在推理过程中,用户可以根据输入音频获得体现特定说话风格的个性化面部动画。我们将数百步训练得到的扩散模型蒸馏为仅需8步的轻量级模型以实现加速。大量实验表明,我们的方法优于现有的最先进方法。代码将被发布。

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.16555
  • repo_url: None
  • paper_authors: Ling Fu, Zijie Wu, Yingying Zhu, Yuliang Liu, Xiang Bai
  • for: Improving the performance of scene text detectors through better synthetic training data.
  • methods: A diffusion model based generator (DiffText) blends foreground text with the intrinsic features of the background to produce more realistic text images, together with strategies that reduce spelling errors.
  • results: With fewer text instances, the text images produced by DiffText are more effective than other synthetic data for detecting horizontal, rotated, curved, and line-level text.
    Abstract Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.

Robust Transductive Few-shot Learning via Joint Message Passing and Prototype-based Soft-label Propagation

  • paper_url: http://arxiv.org/abs/2311.17096
  • repo_url: None
  • paper_authors: Jiahui Wang, Qin Xu, Bo Jiang, Bin Luo
  • for: This work develops a transductive few-shot learning model that can generalize to new classes from only a few support samples.
  • methods: The approach integrates prototype learning with label propagation: prototypes learned from the support set yield soft labels for queries based on query-prototype distances, and these soft labels are propagated over a learned query-support graph; a joint message passing scheme learns sample representations and the relational graph together (a short illustrative sketch follows at the end of this entry).
  • results: The parameter-free method achieves competitive results against state-of-the-art approaches on four popular datasets in both balanced and imbalanced settings.
    Abstract Few-shot learning (FSL) aims to develop a learning model with the ability to generalize to new classes using a few support samples. For transductive FSL tasks, prototype learning and label propagation methods are commonly employed. Prototype methods generally first learn the representative prototypes from the support set and then determine the labels of queries based on the metric between query samples and prototypes. Label propagation methods try to propagate the labels of support samples on the constructed graph encoding the relationships between both support and query samples. This paper aims to integrate these two principles together and develop an efficient and robust transductive FSL approach, termed Prototype-based Soft-label Propagation (PSLP). Specifically, we first estimate the soft-label presentation for each query sample by leveraging prototypes. Then, we conduct soft-label propagation on our learned query-support graph. Both steps are conducted progressively to boost their respective performance. Moreover, to learn effective prototypes for soft-label estimation as well as the desirable query-support graph for soft-label propagation, we design a new joint message passing scheme to learn sample presentation and relational graph jointly. Our PSLP method is parameter-free and can be implemented very efficiently. On four popular datasets, our method achieves competitive results on both balanced and imbalanced settings compared to the state-of-the-art methods. The code will be released upon acceptance.
    摘要 预处理学习(FSL)的目标是开发一种能够通过几个示例学习新类的学习模型。在推uctive FSL任务中, prototype 学习和标签传播方法通常被使用。 prototype 方法通常先从支持集中学习表示性的原型,然后根据查询样本和原型之间的距离来确定查询样本的标签。标签传播方法尝试将支持样本上的标签通过建立查询样本和支持样本之间的图表示的关系进行传播。本文旨在将这两种原理结合在一起,并开发一种高效和可靠的推uctive FSL方法,即示例基于软标签传播(PSLP)。 Specifically,我们首先估算每个查询样本的软标签表示,通过使用原型。然后,我们在我们学习的查询样本和支持样本之间的图上进行软标签传播。这两个步骤都是在进行进程中进行,以提高它们的相应性能。此外,为了学习有效的原型以及欲望的查询样本和支持样本之间的关系,我们设计了一种新的共同消息传递方案,用于同时学习样本表示和关系图。我们的 PSLP 方法无需参数,可以非常高效地实现。在四个流行的数据集上,我们的方法在平衡和不平衡的设置下与当前的状态艺技相当。代码将在接受后释出。
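
A compact sketch of the two ingredients PSLP combines, under simplifying assumptions (Gaussian-kernel kNN graph, fixed hyperparameters, a synthetic 5-way episode): soft labels for queries from prototype distances, followed by soft-label propagation over a query-support graph. The paper's joint message passing is omitted.

```python
import torch
import torch.nn.functional as F

def prototype_soft_labels(support, support_y, query, n_way, tau=10.0):
    """Soft labels for queries from distances to class prototypes (class means)."""
    prototypes = torch.stack([support[support_y == c].mean(0) for c in range(n_way)])
    d = torch.cdist(query, prototypes)                 # (n_query, n_way)
    return F.softmax(-tau * d, dim=1)

def propagate(features, labels, alpha=0.9, k=10, n_iter=20):
    """Soft-label propagation on a kNN similarity graph: F <- alpha*S F + (1-alpha)*Y."""
    d = torch.cdist(features, features)
    sigma = d[d > 0].median()                          # heuristic kernel bandwidth
    W = torch.exp(-d ** 2 / (2 * sigma ** 2))
    W.fill_diagonal_(0)
    topk = W.topk(k, dim=1).indices                    # sparsify to k strongest edges
    mask = torch.zeros_like(W).scatter_(1, topk, 1.0)
    W = W * ((mask + mask.T) > 0)
    Dinv = W.sum(1).clamp_min(1e-8).rsqrt()
    S = Dinv[:, None] * W * Dinv[None, :]              # symmetric normalization
    F_t = labels.clone()
    for _ in range(n_iter):
        F_t = alpha * S @ F_t + (1 - alpha) * labels
    return F_t

# Toy 5-way, 1-shot episode with 15 queries per class in a 64-d embedding space.
torch.manual_seed(0)
n_way, dim = 5, 64
centers = torch.randn(n_way, dim) * 3
support = centers + 0.5 * torch.randn(n_way, dim)
support_y = torch.arange(n_way)
query = centers.repeat_interleave(15, dim=0) + 0.5 * torch.randn(n_way * 15, dim)

soft_q = prototype_soft_labels(support, support_y, query, n_way)
Y = torch.cat([F.one_hot(support_y, n_way).float(), soft_q], dim=0)
refined = propagate(torch.cat([support, query], dim=0), Y)
pred = refined[n_way:].argmax(1)
true = torch.arange(n_way).repeat_interleave(15)
print("episode accuracy:", (pred == true).float().mean().item())
```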

HandyPriors: Physically Consistent Perception of Hand-Object Interactions with Differentiable Priors

  • paper_url: http://arxiv.org/abs/2311.16552
  • repo_url: None
  • paper_authors: Shutong Zhang, Yi-Ling Qiao, Guanglei Zhu, Eric Heiden, Dylan Turpin, Jingzhou Liu, Ming Lin, Miles Macklin, Animesh Garg
  • for: This paper proposes a unified and general pipeline for pose estimation in hand-object interaction scenes.
  • methods: Building on recent advances in differentiable physics and rendering, rendering priors align estimates with input images and segmentation masks while physics priors mitigate penetration and relative sliding; two variants are provided, an optimization-based estimator with higher accuracy and a filtering-based tracker that uses the differentiable priors as dynamics and observation models for faster execution.
  • results: HandyPriors attains comparable or superior pose estimation accuracy, its differentiable physics module can predict contact information for pose refinement, and the approach generalizes to robotic hand manipulation and in-the-wild human-object pose estimation.
    Abstract Various heuristic objectives for modeling hand-object interaction have been proposed in past work. However, due to the lack of a cohesive framework, these objectives often possess a narrow scope of applicability and are limited by their efficiency or accuracy. In this paper, we propose HandyPriors, a unified and general pipeline for pose estimation in human-object interaction scenes by leveraging recent advances in differentiable physics and rendering. Our approach employs rendering priors to align with input images and segmentation masks along with physics priors to mitigate penetration and relative-sliding across frames. Furthermore, we present two alternatives for hand and object pose estimation. The optimization-based pose estimation achieves higher accuracy, while the filtering-based tracking, which utilizes the differentiable priors as dynamics and observation models, executes faster. We demonstrate that HandyPriors attains comparable or superior results in the pose estimation task, and that the differentiable physics module can predict contact information for pose refinement. We also show that our approach generalizes to perception tasks, including robotic hand manipulation and human-object pose estimation in the wild.
    摘要 历史研究中提出了多种各种目标函数用于模型人手交互。然而,由于缺乏一个紧密的框架,这些目标函数经常具有局部应用范围和精度和效率的限制。在这篇论文中,我们提出了HandyPriors,一个通用和总体的排序管道,通过使用最近的可微 физи学和渲染来进行pose estimation在人手交互场景中。我们的方法使用渲染假设来与输入图像和分割掩码进行对齐,同时使用物理假设来缓解射入和相对滑动问题。此外,我们提出了两种手和物体 pose estimation 的方法。一种是优化基于的pose estimation,它可以达到更高的准确率;另一种是使用可微假设作为动力学和观测模型的筛选基于的跟踪,它可以更快地执行。我们示出了HandyPriors在pose estimation任务中可以达到相同或更高的结果,并且可以预测物体的接触信息用于pose refinement。此外,我们还展示了我们的方法在机器人手 manipulate和人手pose estimation中的普适性。

Multi-Irreducible Spectral Synchronization for Robust Rotation Averaging

  • paper_url: http://arxiv.org/abs/2311.16544
  • repo_url: None
  • paper_authors: Owen Howell, Haoen Huang, David Rosen
  • for: The goal is to solve the rotation averaging (RA) problem in robotics and computer vision: estimating a set of unknown rotations $R_{1}, ..., R_{N} \in SO(3)$ from noisy measurements $R_{ij} \sim R^{-1}_{i} R_{j}$ of a subset of their pairwise relative rotations.
  • methods: The authors apply harmonic analysis on compact groups to construct a (convex) spectral relaxation from truncated Fourier decompositions of the summands in the RA objective, then recover an RA estimate from a few extremal eigenpairs followed by a consensus step (a short illustrative sketch follows at the end of this entry).
  • results: The method works with any smooth loss function (including, but not limited to, robust M-estimators), requires no initialization, and uses only simple, highly scalable linear-algebraic computations and parallelizable optimizations; under multiplicative Langevin measurement noise, it also comes with explicit performance guarantees parameterized by graph-theoretic quantities of the measurement network, which indicate how to design networks that are guaranteed to achieve accurate estimation.
    Abstract Rotation averaging (RA) is a fundamental problem in robotics and computer vision. In RA, the goal is to estimate a set of $N$ unknown orientations $R_{1}, ..., R_{N} \in SO(3)$, given noisy measurements $R_{ij} \sim R^{-1}_{i} R_{j}$ of a subset of their pairwise relative rotations. This problem is both nonconvex and NP-hard, and thus difficult to solve in the general case. We apply harmonic analysis on compact groups to derive a (convex) spectral relaxation constructed from truncated Fourier decompositions of the individual summands appearing in the RA objective; we then recover an estimate of the RA solution by computing a few extremal eigenpairs of this relaxation, and (approximately) solving a consensus problem. Our approach affords several notable advantages versus prior RA methods: it can be used in conjunction with \emph{any} smooth loss function (including, but not limited to, robust M-estimators), does not require any initialization, and is implemented using only simple (and highly scalable) linear-algebraic computations and parallelizable optimizations over band-limited functions of individual rotational states. Moreover, under the (physically well-motivated) assumption of multiplicative Langevin measurement noise, we derive explicit performance guarantees for our spectral estimator (in the form of probabilistic tail bounds on the estimation error) that are parameterized in terms of graph-theoretic quantities of the underlying measurement network. By concretely linking estimator performance with properties of the underlying measurement graph, our results also indicate how to devise measurement networks that are \emph{guaranteed} to achieve accurate estimation, enabling such downstream tasks as sensor placement, network compression, and active sensing.
    摘要 rotate averaging (RA) 是 robotics 和 computer vision 中的基本问题。在 RA 中,目标是估计一组 $N$ 个未知旋转 $R_{1}, ..., R_{N} \in SO(3)$,给出一些噪声损失 $R_{ij} \sim R^{-1}_{i} R_{j}$ 的subset 的对称对旋转。这个问题是非凸和NP难,因此在一般情况下很难解决。我们通过幂分析在固定群上来 derive 一个(凸) spectral relaxation, constructed from truncated Fourier decompositions of the individual summands appearing in the RA objective; 然后我们可以通过计算一些极值 eigenpairs 来回收一个 RA 的解决方案,并(约)解决一个 consensus 问题。我们的方法有以下优点:可以与任何平滑损失函数(包括但不限于 robust M-estimators)结合使用,不需要任何初始化,并且通过简单(高度可扩展)的线性代数计算和并行优化来实现。此外,对于 multiplicative Langevin 测量噪声的(物理上有良好的)假设,我们得到了明确的性能保证(在形式上为 probabilistic tail bounds on the estimation error),这些保证与测量网络的特性相关。我们的结果还表明,可以通过设计测量网络来确保高精度估计,以便实现下游任务,如探测器布局、网络压缩和活动探测。
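
For intuition, the sketch below implements the classical spectral synchronization baseline over SO(3) that this kind of spectral relaxation generalizes: stack the noisy relative rotations into a symmetric block matrix, take its top three eigenvectors, and project each 3x3 block back to SO(3). It is a simplified stand-in, not the paper's multi-irreducible estimator or its consensus step.

```python
import numpy as np

def project_to_SO3(M):
    """Nearest rotation matrix via SVD, with a determinant correction."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:
        U[:, -1] *= -1
        R = U @ Vt
    return R

def spectral_rotation_averaging(R_rel, n):
    """R_rel: dict {(i, j): measured R_i^{-1} R_j}. Returns estimates of R_1..R_n
    (up to one global rotation). Classical spectral baseline only."""
    A = np.zeros((3 * n, 3 * n))
    for (i, j), Rij in R_rel.items():
        A[3*i:3*i+3, 3*j:3*j+3] = Rij
        A[3*j:3*j+3, 3*i:3*i+3] = Rij.T
    vals, vecs = np.linalg.eigh(A)
    V = vecs[:, -3:] * np.sqrt(n)        # top-3 eigenvectors span the block column of R_i^T
    if np.linalg.det(V[:3, :3]) < 0:     # fix the global reflection ambiguity
        V[:, -1] *= -1
    return [project_to_SO3(V[3*i:3*i+3]).T for i in range(n)]

# Toy test: random ground-truth rotations observed on a complete graph without noise.
rng = np.random.default_rng(0)
n = 5
R_true = [project_to_SO3(rng.normal(size=(3, 3))) for _ in range(n)]
R_rel = {(i, j): R_true[i].T @ R_true[j] for i in range(n) for j in range(i + 1, n)}
R_est = spectral_rotation_averaging(R_rel, n)
# Estimates match up to a global rotation, so check pairwise consistency instead.
err = max(np.linalg.norm(R_est[i].T @ R_est[j] - R_true[i].T @ R_true[j])
          for i in range(n) for j in range(n))
print("max pairwise inconsistency:", err)
```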

Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance

  • paper_url: http://arxiv.org/abs/2311.16507
  • repo_url: None
  • paper_authors: Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He
  • for: Improving the generation quality and efficiency of flow matching models.
  • methods: Straighter trajectories of Flow Matching (StraightFM) straightens trajectories with a coupling strategy guided by a diffusion model at the distribution level, and additionally parameterizes a coupling from images to noise with a neural network; the two complementary directions are optimized jointly, enabling one-step and few-step generation (a short illustrative sketch follows at the end of this entry).
  • results: StraightFM yields high-quality samples in fewer steps, e.g., a lower FID among diffusion and flow matching methods within 5 sampling steps in pixel space, and a lower KID than existing methods on CelebA-HQ 256 in latent space with fewer than 10 sampling steps.
    Abstract Flow matching as a paradigm of generative model achieves notable success across various domains. However, existing methods use either multi-round training or knowledge within minibatches, posing challenges in finding a favorable coupling strategy for straight trajectories. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with the coupling strategy guided by diffusion model from entire distribution level. First, we propose a coupling strategy to straighten trajectories, creating couplings between image and noise samples under diffusion model guidance. Second, StraightFM also integrates real data to enhance training, employing a neural network to parameterize another coupling process from images to noise samples. StraightFM is jointly optimized with couplings from above two mutually complementary directions, resulting in straighter trajectories and enabling both one-step and few-step generation. Extensive experiments demonstrate that StraightFM yields high quality samples with fewer step. StraightFM generates visually appealing images with a lower FID among diffusion and traditional flow matching methods within 5 sampling steps when trained on pixel space. In the latent space (i.e., Latent Diffusion), StraightFM achieves a lower KID value compared to existing methods on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
    摘要 流行匹配作为生成模型的 paradigm 在多个领域取得了显著的成功。然而,现有的方法使用 either 多轮训练或者在 mini-batch 中的知识,从而增加了找到适合的 Coupling 策略的问题。为解决这个问题,我们提出了一种新的方法:流程匹配 straight trajectories(StraightFM)。它使用整个分布水平的扩散模型来引导 Coupling 策略,从而 straighten trajectories。首先,我们提出了一种 Coupling 策略,将图像和随机噪音之间创建 Couplings。其次,StraightFM 还 integrate 了实际数据来提高训练,通过一个神经网络来另外 parameterize 一种从图像到随机噪音的 Coupling 过程。StraightFM 被同时优化了上述两个相互补偿的方向,从而实现更直的 trajectories 和一步或几步生成。广泛的实验表明,StraightFM 可以生成高质量的样本,需要 fewer step。在 pixel space 中,StraightFM 在5个步骤内可以生成可见的图像,并且与传统的流行匹配方法相比,在 CelebA-HQ 256 数据集上的 KID 值较低。在 latent space 中,StraightFM 在 fewer than 10 步骤内可以 achiev 较低的 KID 值。
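
The sketch below shows the standard straight-path flow matching objective that StraightFM builds on, using the trivial independent Gaussian coupling; the paper's contribution is precisely to replace that coupling with a diffusion-guided one, which is not reproduced here. The toy data and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny velocity field v_theta(x_t, t) for 2-D toy data."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_loss(model, x1, x0=None):
    """Straight-path flow matching: x_t = (1 - t) x0 + t x1, target velocity x1 - x0.
    Independent Gaussian coupling here; StraightFM instead chooses the (x0, x1)
    pairing with diffusion-model guidance to straighten trajectories."""
    x0 = torch.randn_like(x1) if x0 is None else x0
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    return ((model(xt, t) - target) ** 2).mean()

# Toy training on a 2-D two-blob mixture.
model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
centers = torch.tensor([[2.0, 0.0], [-2.0, 0.0]])
for _ in range(200):
    x1 = centers[torch.randint(0, 2, (256,))] + 0.2 * torch.randn(256, 2)
    loss = flow_matching_loss(model, x1)
    opt.zero_grad(); loss.backward(); opt.step()

# Few-step Euler sampling along the learned (ideally straight) trajectories.
x = torch.randn(1000, 2)
steps = 5
for i in range(steps):
    t = torch.full((x.shape[0], 1), i / steps)
    x = x + model(x, t) / steps
```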

In Search of a Data Transformation That Accelerates Neural Field Training

  • paper_url: http://arxiv.org/abs/2311.17094
  • repo_url: None
  • paper_authors: Junwon Seo, Sangyoon Lee, Kwang In Kim, Jaeho Lee
  • for: This paper studies how data transformations affect neural field training speed, focusing on how permuting pixel locations influences SGD convergence.
  • methods: Neural networks are trained to approximate given signals, and the effect of random pixel permutation is analyzed through PSNR curves, loss landscapes, and error patterns (a short illustrative sketch follows at the end of this entry).
  • results: Counterintuitively, randomly permuting pixel locations considerably accelerates neural field training; the analysis suggests the permutation removes easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
    Abstract Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed-generating neural fields requires an overfitting of a neural network, which can take a significant number of SGD steps to reach the desired fidelity level. In this paper, we delve into the impacts of data transformations on the speed of neural field training, specifically focusing on how permuting pixel locations affect the convergence speed of SGD. Counterintuitively, we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon, we examine the neural field training through the lens of PSNR curves, loss landscapes, and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns, which facilitate easy optimization in the early stage but hinder capturing fine details of the signal.
    摘要 neural field 是一种emerging paradigm在数据表示方面,用神经网络来近似给定的信号。然而,一个关键的障碍是将神经网络训练到 Desired fidelity level 需要很多SGD步骤,这会增加训练时间。在这篇论文中,我们研究了数据变换对神经场训练速度的影响,具体来说是如何 randomly permute pixel locations 会加速训练。我们发现,Randomly permuting pixel locations 可以大幅提高训练速度。为了解释这种现象,我们通过PSNR曲线、损失 landscape 和错误模式来分析神经场训练。我们发现,随机排序像素位置可以消除容易适应的模式,这些模式可以在早期阶段帮助优化,但是它们会阻碍捕捉信号的细节。
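
A hedged sketch of the experiment described above: fit a Fourier-feature coordinate MLP to an image, once on the original pixel-to-coordinate assignment and once on a randomly permuted assignment. The permutation is invertible, so the original image can be recovered from the fitted field by applying the inverse permutation to the rendered pixels. The random image below is only a stand-in; a real image is needed to observe the reported speed-up.

```python
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    """Coordinate MLP with random Fourier features, a common neural-field baseline."""
    def __init__(self, n_feats=128, hidden=256, scale=10.0):
        super().__init__()
        self.register_buffer("B", scale * torch.randn(2, n_feats))
        self.net = nn.Sequential(
            nn.Linear(2 * n_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords):                     # coords in [0, 1]^2
        proj = 2 * torch.pi * coords @ self.B
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.net(feats)

def fit(field, coords, pixels, steps=300, lr=1e-3):
    opt = torch.optim.Adam(field.parameters(), lr=lr)
    for _ in range(steps):
        loss = ((field(coords) - pixels) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

H = W = 64
image = torch.rand(H * W, 3)                       # stand-in for a real image
ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
coords = torch.stack([xs.flatten(), ys.flatten()], dim=-1)

mse_plain = fit(FourierMLP(), coords, image)       # baseline: fit the image as-is

# Random pixel permutation: shuffle which pixel value sits at which coordinate.
perm = torch.randperm(H * W)
mse_permuted = fit(FourierMLP(), coords, image[perm])
print(f"final MSE  plain: {mse_plain:.4f}   permuted: {mse_permuted:.4f}")
```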

Agents meet OKR: An Object and Key Results Driven Agent System with Hierarchical Self-Collaboration and Self-Evaluation

  • paper_url: http://arxiv.org/abs/2311.16542
  • repo_url: None
  • paper_authors: Yi Zheng, Chongyang Ma, Kanle Shi, Haibin Huang
  • for: Enhancing the task-solving capabilities of Large Language Models (LLMs).
  • methods: Hierarchical agents with self-collaboration and self-correction mechanisms tackle the inherent complexity of task solving, via hierarchical Objects and Key Results generation and a multi-level evaluation module.
  • results: Experiments show the OKR-Agent approach outperforms previous methods on several tasks.
    Abstract In this study, we introduce the concept of OKR-Agent designed to enhance the capabilities of Large Language Models (LLMs) in task-solving. Our approach utilizes both self-collaboration and self-correction mechanism, facilitated by hierarchical agents, to address the inherent complexities in task-solving. Our key observations are two-fold: first, effective task-solving demands in-depth domain knowledge and intricate reasoning, for which deploying specialized agents for individual sub-tasks can markedly enhance LLM performance. Second, task-solving intrinsically adheres to a hierarchical execution structure, comprising both high-level strategic planning and detailed task execution. Towards this end, our OKR-Agent paradigm aligns closely with this hierarchical structure, promising enhanced efficacy and adaptability across a range of scenarios. Specifically, our framework includes two novel modules: hierarchical Objects and Key Results generation and multi-level evaluation, each contributing to more efficient and robust task-solving. In practical, hierarchical OKR generation decomposes Objects into multiple sub-Objects and assigns new agents based on key results and agent responsibilities. These agents subsequently elaborate on their designated tasks and may further decompose them as necessary. Such generation operates recursively and hierarchically, culminating in a comprehensive set of detailed solutions. The multi-level evaluation module of OKR-Agent refines solution by leveraging feedback from all associated agents, optimizing each step of the process. This ensures solution is accurate, practical, and effectively address intricate task requirements, enhancing the overall reliability and quality of the outcome. Experimental results also show our method outperforms the previous methods on several tasks. Code and demo are available at https://okr-agent.github.io/
    摘要 在这项研究中,我们介绍了一种名为OKR-Agent的概念,用于提高大型自然语言模型(LLM)在任务解决方面的能力。我们的方法利用了自适应和自我修正机制,通过层次代理人来解决任务中的内在复杂性。我们的关键观察结果有两个方面:首先,有效地解决任务需要深入的领域知识和细腻的思维,在部署专门的子任务代理人后,LLM表现得更出色。第二,任务解决本身具有层次执行结构,包括高级策略规划和细节任务执行。为此,OKR-Agent模型与这种层次结构高度吻合,提供了更高效和可靠的多种场景下的解决方案。具体来说,OKR-Agent模型包括两个新模块:层次对象和关键结果生成,以及多级评估。层次对象生成模块将对象 decomposes into multiple sub-objects,并将新代理人分配给每个子对象基于关键结果和代理人责任。这些代理人随后在它们的指定任务上进行详细的描述和可能的 decomposing,这种生成操作采用了 recursive和层次的方式进行,从而生成了一个完整的解决方案。多级评估模块使用所有相关代理人的反馈来优化解决方案,确保解决方案是准确、实用和有效地解决任务要求,从而提高整体的可靠性和质量。我们的实验结果也表明,我们的方法在多个任务上比前一种方法表现更出色。OKR-Agent代码和demo可以在中找到。

Improved Prototypical Semi-Supervised Learning with Foundation Models: Prototype Selection, Parametric vMF-SNE Pretraining and Multi-view Pseudolabelling

  • paper_url: http://arxiv.org/abs/2311.17093
  • repo_url: None
  • paper_authors: Evelyn Mannix, Howard Bondell
  • for: Improving semi-supervised learning performance in computer vision, particularly when a frozen foundation model serves as the network backbone.
  • methods: The paper proposes parametric von Mises-Fisher Stochastic Neighbour Embedding (vMF-SNE) pretraining of the projection head, soft multi-view pseudolabels, and a simple $k$-means prototype selection technique (a short illustrative sketch follows at the end of this entry).
  • results: The approach improves over PAWS and RoPAWS by an average of +2.9% on CIFAR-10 and +5.7% on CIFAR-100 with four labels per class, and by +15.2% on DeepWeeds, setting new state-of-the-art results for this small-label regime.
    Abstract In this paper we present an improved approach to prototypical semi-supervised learning for computer vision, in the context of leveraging a frozen foundation model as the backbone of our neural network. As a general tool, we propose parametric von-Mises Fisher Stochastic Neighbour Embedding (vMF-SNE) to create mappings with neural networks between high-dimensional latent spaces that preserve local structure. This enables us to pretrain the projection head of our network using the high-quality embeddings of the foundation model with vMF-SNE. We also propose soft multi-view pseudolabels, where predictions across multiple views are combined to provide a more reliable supervision signal compared to a consistency or swapped assignment approach. We demonstrate that these ideas improve upon P}redicting View-Assignments with Support Samples (PAWS), a current state-of-the-art semi-supervised learning method, as well as Robust PAWS (RoPAWS), over a range of benchmarking datasets. We also introduce simple $k$-means prototype selection, a technique that provides superior performance to other unsupervised label selection approaches in this context. These changes improve upon PAWS by an average of +2.9% for CIFAR-10 and +5.7% for CIFAR-100 with four labels per class, and by +15.2% for DeepWeeds, a particularly challenging dataset for semi-supervised learning. We also achieve new state-of-the-art results in semi-supervised learning in this small label regime for CIFAR-10 - 95.8% (+0.7%) and CIFAR-100 - 76.6% (+12.0%).
    摘要 在这篇论文中,我们提出了一种改进的半supervised learning方法,用于计算机视觉领域,基于冻结的基础模型。我们提议使用parametric von-Mises Fisher Stochastic Neighbor Embedding(vMF-SNE)来创建高维特征空间中的映射,这些映射保留了本地结构。我们可以使用基础模型的高质量嵌入,通过vMF-SNE来预训练投影头。我们还提议使用软分布式视图预测,将多个视图的预测结果组合起来,以提供更可靠的超级视图指标。我们示出,这些想法可以超过PAWS方法(Predicting View-Assignments with Support Samples)和RoPAWS(Robust PAWS),在多个 benchmark 数据集上进行改进。我们还介绍了简单的 $k$-means prototype选择技术,这种技术可以在这种情况下提供superior的性能。这些变化可以在 CIFAR-10 和 CIFAR-100 上提高 +2.9% 和 +5.7% 的四个标签,以及在 DeepWeeds 数据集上提高 +15.2%。我们还实现了小标签 regime 中的新state-of-the-art 结果,CIFAR-10 的结果为 95.8% (+0.7%),CIFAR-100 的结果为 76.6% (+12.0%)。
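
One plausible reading of the $k$-means prototype selection step, sketched under our own assumptions: cluster the frozen-backbone embeddings of the unlabelled pool with $k$-means and take, for each cluster, the sample nearest its centroid as a prototype. The embeddings below are random stand-ins for foundation-model features.

```python
import torch

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (centroids, assignments)."""
    g = torch.Generator().manual_seed(seed)
    centroids = x[torch.randperm(x.shape[0], generator=g)[:k]].clone()
    for _ in range(n_iter):
        assign = torch.cdist(x, centroids).argmin(dim=1)
        for c in range(k):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)
    return centroids, assign

def select_prototypes(embeddings, k):
    """k-means prototype selection (sketch): cluster frozen-backbone embeddings and
    pick, for each cluster, the sample closest to its centroid as a prototype."""
    centroids, _ = kmeans(embeddings, k)
    proto_idx = torch.cdist(centroids, embeddings).argmin(dim=1)
    return proto_idx

# Toy usage with random embeddings standing in for foundation-model features.
feats = torch.nn.functional.normalize(torch.randn(1000, 256), dim=1)
idx = select_prototypes(feats, k=10)
print("selected prototype indices:", idx.tolist())
```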

SEED-Bench-2: Benchmarking Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17092
  • repo_url: https://github.com/ailab-cvc/seed-bench
  • paper_authors: Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan
  • for: This paper provides a comprehensive benchmark for multimodal large language models (MLLMs) to assess their progress and limitations.
  • methods: MLLM capabilities are categorized into hierarchical levels from $L_0$ to $L_4$ based on the modalities they can accept and generate, and SEED-Bench-2 is proposed with 24K human-annotated multiple-choice questions spanning 27 dimensions, covering the evaluation of both text and image generation.
  • results: Evaluating 23 prominent open-source MLLMs reveals their limitations and yields valuable observations, aiming to motivate future research toward general artificial intelligence.
    Abstract Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs, failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from $L_0$ to $L_4$ based on the modalities they can accept and generate, and propose SEED-Bench-2, a comprehensive benchmark that evaluates the \textbf{hierarchical} capabilities of MLLMs. Specifically, SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions, including the evaluation of both text and image generation. Multiple-choice questions with groundtruth options derived from human annotation enables an objective and efficient assessment of model performance, eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations. By revealing the limitations of existing MLLMs through extensive evaluations, we aim for SEED-Bench-2 to provide insights that will motivate future research towards the goal of General Artificial Intelligence. Dataset and evaluation code are available at \href{https://github.com/AILab-CVC/SEED-Bench}
    摘要 多Modal大语言模型(MLLM),基于强大的大语言模型(LLM),最近已经展示了对于生成文本和图像的exceptional能力,只要提供混合多Modal输入(类似于GPT-4V和DALL-E 3)。然而,现有的 MLLM 评价标准仍然只能评估模型对单个图像文本输入的理解能力,而不能随着 MLLM 的进步而更新。一个完整的评价标准是必要的,以调查当前 MLLM 的进步和探索其局限性。在这项工作中,我们将 MLLM 的能力分为 $L_0$ 到 $L_4$ 等级,根据它们可以接受和生成的modalities,并提出了 SEED-Bench-2,一个完整的评价标准。SEED-Bench-2 包含 24K 多选题,具有人工注释的准确性,涵盖 27 个维度,包括文本和图像生成的评估。多选题的真实选项来自人工注释,可以快速和有效地评估模型性能,不需要人类或 GPT 的干预。我们还对 23 个开源 MLLM 进行了评估,并提出了有价值的观察。通过对现有 MLLM 的广泛评估,我们希望 SEED-Bench-2 能够提供有价值的信息,以激励未来的人工智能研究。数据集和评估代码可以在 GitHub 上获取:

Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.17091
  • repo_url: https://github.com/zhihelu/ensemble_vlm
  • paper_authors: Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang
  • for: This paper explores how ensembling weaker vision-language models (VLMs) can improve the open-world generalization of a robust single model.
  • methods: Three customized ensemble strategies are proposed, each tailored to a specific scenario: a zero-shot ensemble that automatically adjusts the logits of different models based on their confidence, plus training-free and tuning ensembles for settings with extra few-shot samples (a short illustrative sketch follows at the end of this entry).
  • results: By automatically weighting the logits of different models, the ensembles achieve new state-of-the-art performance on zero-shot, base-to-new, and cross-dataset generalization.
    Abstract Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for the open-world generalization has gained increasing popularity due to its practical value. However, performance advancements are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. Firstly, we introduce the zero-shot ensemble, automatically adjusting the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free and tuning ensemble, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensemble. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
    摘要 优化预训练视语模型(VLM),例如CLIP,以实现开放世界泛化的实用价值已经在增加。然而,凭借精妙的算法设计 alone 的单个模型性能有限。这篇论文是第一次探索强度较弱的 VLM ensemble 的可能性,以提高单个模型的泛化性能。我们引入三种自定义ensemble策略,每种适用于不同的场景。首先,我们引入零shot ensemble,自动调整不同模型的 logits 根据其自信度,只有预训练 VLM available。其次,在具有些个shot样本的情况下,我们提出了免训练和调整 ensemble,可以根据计算资源的可用性进行灵活调整。我们对这些ensemble策略进行了评估,在零shot、基础到新和跨数据集泛化中实现了新的州OF-THE-ART性能。值得注意的是,这项工作表示了对 VLM 泛化性能的提升的初步尝试,未来可能会有更多的进展。代码可以在https://github.com/zhiheLu/Ensemble_VLM.git中找到。
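
A hedged illustration of the zero-shot ensemble idea: combine the zero-shot logits of several VLMs, weighting each model per sample by its own confidence (here, the max softmax probability). The weighting rule and the synthetic logits are assumptions, not the paper's exact formulation, and loading actual CLIP-style backbones is omitted.

```python
import torch

def zero_shot_ensemble(logits_per_model):
    """Combine zero-shot logits from several VLMs, weighting each model per sample
    by its own confidence (max softmax probability). A sketch of the idea only."""
    probs = [l.softmax(dim=-1) for l in logits_per_model]        # each: (N, C)
    conf = torch.stack([p.max(dim=-1).values for p in probs])    # (M, N)
    weights = conf / conf.sum(dim=0, keepdim=True)               # normalize over models
    stacked = torch.stack(logits_per_model)                      # (M, N, C)
    return (weights.unsqueeze(-1) * stacked).sum(dim=0)          # (N, C)

# Toy usage: three "models" with different noise levels on a 10-class problem.
torch.manual_seed(0)
true = torch.randint(0, 10, (500,))
def fake_logits(noise):
    return torch.nn.functional.one_hot(true, 10).float() * 5 + noise * torch.randn(500, 10)
logits = [fake_logits(s) for s in (1.0, 3.0, 6.0)]
ens = zero_shot_ensemble(logits)
for name, l in [("weakest model", logits[2]), ("ensemble", ens)]:
    acc = (l.argmax(-1) == true).float().mean().item()
    print(f"{name}: acc = {acc:.3f}")
```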

3D Teeth Reconstruction from Panoramic Radiographs using Neural Implicit Functions

  • paper_url: http://arxiv.org/abs/2311.16524
  • repo_url: None
  • paper_authors: Sihwa Park, Seongjun Kim, In-Seok Song, Seung Jun Baek
  • for: This work proposes a neural implicit function based method for 3D teeth reconstruction from panoramic radiographs, overcoming the limitations of flattened 2D imaging.
  • methods: The pipeline applies multi-label segmentation, generates tooth shape and tooth class embeddings, and feeds them into a reconstruction network; a Conditional eXcitation (CX) module effectively incorporates the combined shape and class embeddings into the implicit function.
  • results: Occudent outperforms state-of-the-art methods in both quantitative and qualitative evaluations.
    Abstract Panoramic radiography is a widely used imaging modality in dental practice and research. However, it only provides flattened 2D images, which limits the detailed assessment of dental structures. In this paper, we propose Occudent, a framework for 3D teeth reconstruction from panoramic radiographs using neural implicit functions, which, to the best of our knowledge, is the first work to do so. For a given point in 3D space, the implicit function estimates whether the point is occupied by a tooth, and thus implicitly determines the boundaries of 3D tooth shapes. Firstly, Occudent applies multi-label segmentation to the input panoramic radiograph. Next, tooth shape embeddings as well as tooth class embeddings are generated from the segmentation outputs, which are fed to the reconstruction network. A novel module called Conditional eXcitation (CX) is proposed in order to effectively incorporate the combined shape and class embeddings into the implicit function. The performance of Occudent is evaluated using both quantitative and qualitative measures. Importantly, Occudent is trained and validated with actual panoramic radiographs as input, distinct from recent works which used synthesized images. Experiments demonstrate the superiority of Occudent over state-of-the-art methods.
    摘要 扫描方式是 dental 实践和研究中广泛使用的影像模式,但它只提供了平铺的2D图像,这限制了牙科结构的详细评估。在这篇论文中,我们提出了 Occudent,一个基于神经隐函数的牙齿三维重建框架。根据我们所知,这是首次在牙齿重建领域使用神经隐函数进行3D牙齿重建。给定3D空间中的点,神经隐函数会判断该点是否被牙齿包含,从而隐式地确定牙齿的3D形态boundaries。首先,Occudent使用多标签分割来处理输入的扫描图像。接着,由分割输出生成的牙齿形态嵌入和牙齿类嵌入被传递到重建网络。为了有效地将合并的形态和类嵌入integrated into the implicit function,我们提出了一个新的模块 called Conditional eXcitation (CX)。Occudent的性能被评估了使用量化和质量度量标准。重要的是,Occudent被训练和验证使用实际的扫描图像作为输入,与之前的works不同,使用合成图像。实验结果表明Occudent在比较方法上显著超越了state-of-the-art方法。

A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis

  • paper_url: http://arxiv.org/abs/2311.16471
  • repo_url: None
  • paper_authors: Zixiang Zhou, Yu Wan, Baoyuan Wang
  • for: This work targets multimodal, multi-part human motion generation to meet the varied demands of practical scenarios.
  • methods: A scalable pipeline quantizes the motions of different body parts into separate, domain-specific codebooks, transcodes multimodal signals (text, music, speech) into a shared latent space with pre-trained models, and builds complete motion sequences by iteratively predicting subsequent motion tokens before reconstructing continuous motion (a short illustrative sketch follows at the end of this entry).
  • results: Extensive experiments demonstrate the effectiveness of the design and its broad applicability, and the framework easily extends to new modalities.
    Abstract The field has made significant progress in synthesizing realistic human motion driven by various modalities. Yet, the need for different methods to animate various body parts according to different control signals limits the scalability of these techniques in practical scenarios. In this paper, we introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation. Our methodology unfolds in several steps: We begin by quantizing the motions of diverse body parts into separate codebooks tailored to their respective domains. Next, we harness the robust capabilities of pre-trained models to transcode multimodal signals into a shared latent space. We then translate these signals into discrete motion tokens by iteratively predicting subsequent tokens to form a complete sequence. Finally, we reconstruct the continuous actual motion from this tokenized sequence. Our method frames the multimodal motion generation challenge as a token prediction task, drawing from specialized codebooks based on the modality of the control signal. This approach is inherently scalable, allowing for the easy integration of new modalities. Extensive experiments demonstrated the effectiveness of our design, emphasizing its potential for broad application.
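
A minimal sketch of the per-part quantization step described above, assuming a plain nearest-codeword lookup; the codebook sizes and feature dimensions are illustrative, and the codebooks themselves would be learned (e.g., with a VQ-VAE), which is not shown here.

```python
import torch

def quantize(features, codebook):
    """Nearest-codeword lookup: returns discrete token ids and quantized vectors."""
    dists = torch.cdist(features, codebook)      # (T, K) distances to all codewords
    ids = dists.argmin(dim=1)                    # one token per frame
    return ids, codebook[ids]

# separate codebooks per body part (sizes are assumptions)
codebooks = {"torso": torch.randn(512, 64), "hand": torch.randn(1024, 64)}
motion = {"torso": torch.randn(120, 64), "hand": torch.randn(120, 64)}   # 120 frames

tokens = {part: quantize(feat, codebooks[part])[0] for part, feat in motion.items()}
# a full sequence is then produced autoregressively over these token streams
```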

AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond

  • paper_url: http://arxiv.org/abs/2311.16468
  • repo_url: None
  • paper_authors: Zixiang Zhou, Yu Wan, Baoyuan Wang
  • for: To propose an all-in-one framework that unifies a range of human-motion-related tasks, including understanding, planning, generation, and others.
  • methods: The framework builds on large language models (LLMs) and, inspired by InstructGPT and the generalist concept behind Gato, treats each task as a type of instruction fine-tuned on a shared LLM; all tasks are connected through language as a universal interface, forming a closed loop.
  • results: Extensive experiments show that AvatarGPT achieves state-of-the-art performance on low-level tasks and promising results on high-level tasks; it also enables unlimited-length motion synthesis via iterative traversal of tasks within the closed loop.
    Abstract Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all (if not every) NLP tasks. In the human motion-related realm, however, researchers still develop siloed models for each task. Inspired by InstructGPT, and the generalist concept behind Gato, we introduce AvatarGPT, an All-in-One framework for motion understanding, planning, generation as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface, constituting a closed-loop within the framework. To achieve this, human motion sequences are first encoded as discrete tokens, which serve as the extended vocabulary of the LLM. Then, an unsupervised pipeline to generate natural language descriptions of human action sequences from in-the-wild videos is developed. Finally, all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks, and promising results on high-level tasks, demonstrating the effectiveness of our proposed All-in-One framework. Moreover, for the first time, AvatarGPT enables a principled approach by iterative traversal of the tasks within the closed-loop for unlimited long-motion synthesis.
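
To make the "motion tokens as extended vocabulary" idea concrete, here is a small hypothetical sketch of appending discrete motion codes after a text vocabulary so that instructions and motion answers share one token stream; the vocabulary sizes and ids are made up for illustration.

```python
# a minimal sketch, assuming 512 motion codes appended after the text vocabulary
TEXT_VOCAB_SIZE = 32000
NUM_MOTION_CODES = 512

def motion_token_id(code: int) -> int:
    """Map a discrete motion code to an id in the extended LLM vocabulary."""
    assert 0 <= code < NUM_MOTION_CODES
    return TEXT_VOCAB_SIZE + code

def encode_example(instruction_ids, motion_codes):
    """Concatenate a tokenized text instruction with its motion answer."""
    return list(instruction_ids) + [motion_token_id(c) for c in motion_codes]

sequence = encode_example([101, 2045, 102], [17, 3, 250])
# -> [101, 2045, 102, 32017, 32003, 32250]
```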

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

  • paper_url: http://arxiv.org/abs/2311.16465
  • repo_url: None
  • paper_authors: Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
  • for: To improve the automation and diversity of visual text rendering.
  • methods: A large language model is fine-tuned for layout planning, and a language model within the diffusion model encodes text position and content at the line level.
  • results: Achieves more rational text layout and generation with enhanced diversity and automation.
    Abstract The diffusion model has been proven a powerful generative model in recent years, yet remains a challenge in generating visual text. Several methods alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering. Firstly, we fine-tune a large language model for layout planning. The large language model is capable of automatically generating keywords for text rendering and also supports layout modification through chatting. Secondly, we utilize the language model within the diffusion model to encode the position and texts at the line level. Unlike previous methods that employed tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V, validating TextDiffuser-2's capacity to achieve a more rational text layout and generation with enhanced diversity. The code and model will be available at \url{https://aka.ms/textdiffuser-2}.
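
A hedged sketch of what line-level position-and-content conditioning could look like once serialized for a language model; the <line>/<box> tag format and the coordinate convention are assumptions, not the paper's exact encoding.

```python
def serialize_layout(lines):
    """Turn (text, top-left, bottom-right) line boxes into one conditioning string.
    Coordinates are kept at the line level rather than per character."""
    parts = []
    for text, (x0, y0), (x1, y1) in lines:
        parts.append(f"<line> {text} <box> {x0} {y0} {x1} {y1}")
    return " ".join(parts)

prompt = serialize_layout([
    ("HAPPY", (12, 20), (120, 48)),
    ("BIRTHDAY", (12, 52), (180, 80)),
])
# the resulting string would be encoded by the language model inside the diffusion pipeline
```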

Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information

  • paper_url: http://arxiv.org/abs/2311.16462
  • repo_url: None
  • paper_authors: Jie Li, Zhixin Li, Zhi Liu, Pengyuan Zhou, Richang Hong, Qiyue Li, Han Hu
  • for: The paper is written to improve the precision of viewport prediction in volumetric video streaming.
  • methods: The proposed approach, named Saliency and Trajectory Viewport Prediction (STVP), utilizes video saliency information and viewport trajectory to improve viewport prediction. The method includes a novel sampling method, Uniform Random Sampling (URS), and a saliency detection technique that incorporates both spatial and temporal information.
  • results: The proposed method is superior to existing schemes, as shown by extensive simulations using state-of-the-art volumetric video sequences. The dataset and source code will be publicly accessible after acceptance.
    Abstract Volumetric video, also known as hologram video, is a novel medium that portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). It is expected to be the next-gen video technology and a prevalent use case for 5G and beyond wireless communication. Considering that each user typically only watches a section of the volumetric video, known as the viewport, it is essential to have precise viewport prediction for optimal performance. However, research on this topic is still in its infancy. To this end, this paper proposes a novel approach, named Saliency and Trajectory Viewport Prediction (STVP), which aims to improve the precision of viewport prediction in volumetric video streaming. The STVP extensively utilizes video saliency information and viewport trajectory. To our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity while still preserving video features in an efficient manner. Then we present a saliency detection technique that incorporates both spatial and temporal information for detecting static, dynamic geometric, and color salient regions. Finally, we intelligently fuse saliency and trajectory information to achieve more accurate viewport prediction. We conduct extensive simulations to evaluate the effectiveness of our proposed viewport prediction methods using state-of-the-art volumetric video sequences. The experimental results show the superiority of the proposed method over existing schemes. The dataset and source code will be publicly accessible after acceptance.
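
A simplified, assumed view of the final fusion step: blending a tile-level saliency map with a trajectory-based viewport distribution through a fixed mixing weight. The paper's actual fusion is learned; this only illustrates the idea.

```python
import numpy as np

def fuse_viewport_scores(saliency, trajectory_prob, alpha=0.6):
    """Blend normalized tile-level saliency with the trajectory-predicted
    viewport distribution; alpha is an assumed mixing weight."""
    s = saliency / (saliency.sum() + 1e-8)
    t = trajectory_prob / (trajectory_prob.sum() + 1e-8)
    return alpha * t + (1 - alpha) * s

tiles = (8, 8)                               # the scene split into 8x8 tiles
saliency = np.random.rand(*tiles)            # from the saliency detector
trajectory = np.random.rand(*tiles)          # from the trajectory predictor
viewport_prob = fuse_viewport_scores(saliency, trajectory)
predicted_tile = np.unravel_index(viewport_prob.argmax(), tiles)
```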

Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

  • paper_url: http://arxiv.org/abs/2311.17089
  • repo_url: None
  • paper_authors: Zhiwen Yan, Weng Fei Low, Yu Chen, Gim Hee Lee
  • for: To improve the efficiency and quality of 3D reconstruction and rendering, especially at low resolutions or distant camera positions.
  • methods: A multi-scale 3D Gaussian splatting algorithm that maintains Gaussians at different scales to represent the same scene.
  • results: Compared with single-scale 3D Gaussian splatting, the algorithm achieves 13%-66% PSNR and 160%-2400% rendering-speed improvements for 4x-128x scale rendering on the Mip-NeRF360 dataset.
    Abstract 3D Gaussians have recently emerged as a highly efficient representation for 3D reconstruction and rendering. Despite their high rendering quality and speed at high resolutions, both deteriorate drastically when rendered at lower resolutions or from distant camera positions. During low-resolution or far-away rendering, the image sampling rate can fall below the Nyquist rate relative to the screen-space size of each splatted 3D Gaussian, which leads to aliasing artifacts. The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues, we propose a multi-scale 3D Gaussian splatting algorithm, which maintains Gaussians at different scales to represent the same scene. Higher-resolution images are rendered with more small Gaussians, and lower-resolution images are rendered with fewer larger Gaussians. With similar training time, our algorithm can achieve 13\%-66\% PSNR and 160\%-2400\% rendering speed improvement at 4$\times$-128$\times$ scale rendering on the Mip-NeRF360 dataset compared to single-scale 3D Gaussian splatting.
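
One way to picture the multi-scale selection is to pick a Gaussian level from its projected screen-space footprint, so that fine Gaussians are skipped once they shrink below roughly a pixel; the pinhole projection, thresholds, and level spacing below are assumptions, not the paper's exact rule.

```python
import math

def select_level(world_radius, depth, focal_px, num_levels=4, base_px=1.0):
    """Pick a scale level from the projected screen-space radius of a Gaussian.
    Level 0 holds the finest (smallest) Gaussians; thresholds are illustrative."""
    screen_radius = focal_px * world_radius / max(depth, 1e-6)   # pinhole projection
    if screen_radius >= base_px:
        return 0                      # still covers about a pixel, keep the fine level
    # every halving of the on-screen size moves one level coarser
    level = int(math.ceil(math.log2(base_px / screen_radius)))
    return min(level, num_levels - 1)

print(select_level(world_radius=0.02, depth=2.0, focal_px=1000))   # close-up  -> 0
print(select_level(world_radius=0.02, depth=60.0, focal_px=1000))  # far away  -> 2
```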

Spiking Neural Networks with Dynamic Time Steps for Vision Transformers

  • paper_url: http://arxiv.org/abs/2311.16456
  • repo_url: None
  • paper_authors: Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel
  • for: On using spiking neural networks (SNNs) for complex vision tasks.
  • methods: A novel training framework that dynamically allocates the number of time steps to each ViT module to improve energy efficiency.
  • results: Achieves 95.97% test accuracy on image recognition (CIFAR10) with only 4.97 time steps.
    Abstract Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision transformers (ViT), either require a large number of time steps or fail to converge. Based on analysis of the histograms of the ANN and SNN activation maps, we hypothesize that each ViT block has a different sensitivity to the number of time steps. We propose a novel training framework that dynamically allocates the number of time steps to each ViT module depending on a trainable score assigned to each timestep. In particular, we generate a scalar binary time step mask that filters spikes emitted by each neuron in a leaky-integrate-and-fire (LIF) layer. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), except for the input embedding layer, in contrast to expensive multiply-and-accumulates (MAC) needed in traditional ViTs. This yields significant improvements in energy efficiency. We evaluate our training framework and resulting SNNs on image recognition tasks including CIFAR10, CIFAR100, and ImageNet with different ViT architectures. We obtain a test accuracy of 95.97% with 4.97 time steps with direct encoding on CIFAR10.
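
A toy leaky-integrate-and-fire layer with a learnable per-timestep score that is thresholded into a binary mask, loosely mirroring the dynamic time-step idea; the hard threshold (which would need a straight-through estimator during training), the leak constant, and the soft reset are assumptions.

```python
import torch
import torch.nn as nn

class MaskedLIF(nn.Module):
    """Leaky-integrate-and-fire over T timesteps; timesteps whose learnable
    score falls below a threshold are masked out and emit no spikes."""
    def __init__(self, timesteps=4, v_th=1.0, leak=0.5):
        super().__init__()
        self.timestep_score = nn.Parameter(torch.ones(timesteps))
        self.v_th, self.leak, self.timesteps = v_th, leak, timesteps

    def forward(self, currents):                       # currents: (T, B, D)
        mask = (torch.sigmoid(self.timestep_score) > 0.5).float()   # binary gate per step
        mem, spikes = torch.zeros_like(currents[0]), []
        for t in range(self.timesteps):
            mem = self.leak * mem + currents[t]
            spike = (mem >= self.v_th).float() * mask[t]
            mem = mem - spike * self.v_th              # soft reset
            spikes.append(spike)
        return torch.stack(spikes)                     # sparse, accumulate-only activations

out = MaskedLIF()(torch.randn(4, 2, 8))                # (T=4, batch=2, features=8)
```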

Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

  • paper_url: http://arxiv.org/abs/2311.17088
  • repo_url: None
  • paper_authors: Mulin Tian, Mahyar Khayatkhoei, Joe Mathai, Wael AbdAlmageed
  • for: To detect deepfake videos without relying on labeled training data from existing deepfake generation methods.
  • methods: A novel unsupervised method that detects deepfake videos by measuring intra- and cross-modal consistency among multimodal (visual, audio, identity) features.
  • results: Extensive experiments show that deepfake videos exhibit significant intra- and cross-modal inconsistencies that can be exploited to detect them with high accuracy; the approach is scalable, generalizable, and explainable.
    Abstract Deepfake videos present an increasing threat to society with potentially negative impact on criminal justice, democracy, and personal safety and privacy. Meanwhile, detecting deepfakes, at scale, remains a very challenging task that often requires labeled training data from existing deepfake generation methods. Further, even the most accurate supervised deepfake detection methods do not generalize to deepfakes generated using new generation methods. In this paper, we introduce a novel unsupervised approach for detecting deepfake videos by measuring intra- and cross-modal consistency among multimodal features; specifically visual, audio, and identity features. The fundamental hypothesis behind the proposed detection method is that since deepfake generation attempts to transfer the facial motion of one identity to another, these methods will eventually encounter a trade-off between motion and identity that inevitably leads to detectable inconsistencies. We validate our method through extensive experimentation, demonstrating the existence of significant intra- and cross-modal inconsistencies in deepfake videos, which can be effectively utilized to detect them with high accuracy. Our proposed method is scalable because it does not require pristine samples at inference, generalizable because it is trained only on real data, and is explainable since it can pinpoint the exact location of modality inconsistencies which are then verifiable by a human expert.
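
A minimal sketch of a cross-modal consistency check: cosine similarity between per-segment visual and audio embeddings, where a dip flags the segment in which the modalities disagree. The embeddings, the threshold, and the omission of identity features are simplifications of the paper's method.

```python
import numpy as np

def cross_modal_consistency(visual, audio):
    """Per-segment cosine similarity between visual and audio embeddings.
    A dip in the curve points at the segment where the modalities disagree."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    return (v * a).sum(axis=1)

visual_feats = np.random.randn(20, 256)    # one embedding per video segment
audio_feats = np.random.randn(20, 256)
scores = cross_modal_consistency(visual_feats, audio_feats)
suspect_segment = int(scores.argmin())     # lowest-consistency segment
is_fake = scores.min() < 0.2               # 0.2 is an illustrative threshold
```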

Rethinking Mixup for Improving the Adversarial Transferability

  • paper_url: http://arxiv.org/abs/2311.17087
  • repo_url: None
  • paper_authors: Xiaosen Wang, Zeyuan Yin
  • for: The paper aims to explore the underlying mechanism of mixup augmentation in generating adversarial examples with superior adversarial transferability, and to propose a new input transformation-based attack called Mixing the Image but Separating the gradienT (MIST) to improve the transferability of adversarial examples.
  • methods: The paper uses a combination of theoretical analysis and experimental evaluation to investigate the effect of mixup augmentation on adversarial transferability, and to compare the performance of MIST with existing state-of-the-art input transformation-based attacks on both CNNs and ViTs.
  • results: The paper shows that MIST outperforms existing attacks with a clear margin on both CNNs and ViTs, and demonstrates its high effectiveness and generality on the ImageNet dataset.
    Abstract Mixup augmentation has been widely integrated to generate adversarial examples with superior adversarial transferability when transferring from a surrogate model to other models. However, the underlying mechanism influencing the mixup's effect on transferability remains unexplored. In this work, we posit that the adversarial examples located at the convergence of decision boundaries across various categories exhibit better transferability and identify that Admix tends to steer the adversarial examples towards such regions. However, we find the constraint on the added image in Admix weakens its capability, resulting in limited transferability. To address such an issue, we propose a new input transformation-based attack called Mixing the Image but Separating the gradienT (MIST). Specifically, MIST randomly mixes the input image with a randomly shifted image and separates the gradient of each loss item for each mixed image. To counteract the imprecise gradient, MIST calculates the gradient on several mixed images for each input sample. Extensive experimental results on the ImageNet dataset demonstrate that MIST outperforms existing SOTA input transformation-based attacks with a clear margin on both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) with and without defense mechanisms, supporting MIST's high effectiveness and generality.
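
A hedged sketch of the admixing-with-a-shifted-copy idea: the loss gradient is computed on several mixtures of the input with randomly rolled copies of itself and then averaged; the mixing weight, shift range, dummy model, and the single FGSM-style step at the end are assumptions rather than the authors' exact procedure.

```python
import torch

def mist_gradient(model, loss_fn, x, y, num_mix=5, eta=0.5, max_shift=8):
    """Average the input gradient over several admixed copies of x.
    Each copy blends x with a randomly rolled (shifted) version of itself."""
    grad = torch.zeros_like(x)
    for _ in range(num_mix):
        dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
        shifted = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
        mixed = ((1 - eta) * x + eta * shifted).detach().requires_grad_(True)
        loss = loss_fn(model(mixed), y)
        grad += torch.autograd.grad(loss, mixed)[0]    # gradient kept per mixed image
    return grad / num_mix

# toy usage with a dummy linear "model" on 3x32x32 inputs
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(2, 3, 32, 32), torch.tensor([1, 7])
g = mist_gradient(model, torch.nn.functional.cross_entropy, x, y)
x_adv = torch.clamp(x + (8 / 255) * g.sign(), 0, 1)    # one FGSM-style step
```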

TopoSemiSeg: Enforcing Topological Consistency for Semi-Supervised Segmentation of Histopathology Images

  • paper_url: http://arxiv.org/abs/2311.16447
  • repo_url: https://github.com/Melon-Xu/TopoSemiSeg
  • paper_authors: Meilong Xu, Xiaoling Hu, Saumya Gupta, Shahira Abousamra, Chao Chen
  • for: To improve the segmentation accuracy of densely distributed objects, such as glands and nuclei, in computational pathology.
  • methods: A semi-supervised learning method, TopoSemiSeg, that learns topological representations from unlabeled data via a topology-aware teacher-student approach with a topological consistency loss.
  • results: Extensive experiments on public pathology image datasets show clear advantages of TopoSemiSeg, especially on topology-wise evaluation metrics.
    Abstract In computational pathology, segmenting densely distributed objects like glands and nuclei is crucial for downstream analysis. To alleviate the burden of obtaining pixel-wise annotations, semi-supervised learning methods learn from large amounts of unlabeled data. Nevertheless, existing semi-supervised methods overlook the topological information hidden in the unlabeled images and are thus prone to topological errors, e.g., missing or incorrectly merged/separated glands or nuclei. To address this issue, we propose TopoSemiSeg, the first semi-supervised method that learns the topological representation from unlabeled data. In particular, we propose a topology-aware teacher-student approach in which the teacher and student networks learn shared topological representations. To achieve this, we introduce topological consistency loss, which contains signal consistency and noise removal losses to ensure the learned representation is robust and focuses on true topological signals. Extensive experiments on public pathology image datasets show the superiority of our method, especially on topology-wise evaluation metrics. Code is available at https://github.com/Melon-Xu/TopoSemiSeg.
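
The topological consistency terms rely on persistent homology and are not reproduced here; the sketch below only shows the surrounding teacher-student scaffolding (an EMA teacher plus a pixel-wise consistency loss on unlabeled predictions) under assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Exponential-moving-average update of the teacher from the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1 - momentum)

def consistency_loss(student_logits, teacher_logits):
    """Pixel-wise consistency between student and (detached) teacher predictions;
    topological signal-consistency / noise-removal terms would be added on top."""
    return F.mse_loss(torch.sigmoid(student_logits), torch.sigmoid(teacher_logits).detach())

student = torch.nn.Conv2d(3, 1, 3, padding=1)     # stand-ins for the two networks
teacher = torch.nn.Conv2d(3, 1, 3, padding=1)
unlabeled = torch.rand(2, 3, 64, 64)
loss = consistency_loss(student(unlabeled), teacher(unlabeled))
ema_update(teacher, student)
```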

Centre Stage: Centricity-based Audio-Visual Temporal Action Detection

  • paper_url: http://arxiv.org/abs/2311.16446
  • repo_url: https://github.com/hanielwang/audio-visual-tad
  • paper_authors: Hanyuan Wang, Majid Mirmehdi, Dima Damen, Toby Perrett
  • for: A one-stage temporal action detection method based on multi-scale cross-modal fusion, aimed at improving detection accuracy.
  • methods: Multi-scale cross-attention fuses the audio and visual modalities to model temporal dependencies, and a novel centricity score estimates how close each timestep is to the action centre.
  • results: Achieves state-of-the-art performance on the EPIC-Kitchens-100 action detection benchmark, with ablation studies showing the benefits of audio fusion and the centricity score.
    Abstract Previous one-stage action detection approaches have modelled temporal dependencies using only the visual modality. In this paper, we explore different strategies to incorporate the audio modality, using multi-scale cross-attention to fuse the two modalities. We also demonstrate the correlation between the distance from the timestep to the action centre and the accuracy of the predicted boundaries. Thus, we propose a novel network head to estimate the closeness of timesteps to the action centre, which we call the centricity score. This leads to increased confidence for proposals that exhibit more precise boundaries. Our method can be integrated with other one-stage anchor-free architectures and we demonstrate this on three recent baselines on the EPIC-Kitchens-100 action detection benchmark where we achieve state-of-the-art performance. Detailed ablation studies showcase the benefits of fusing audio and our proposed centricity scores. Code and models for our proposed method are publicly available at https://github.com/hanielwang/Audio-Visual-TAD.git
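
A centerness-style formula is one plausible way to realize the centricity score (how close a timestep lies to the action centre); the exact definition in the paper may differ.

```python
import math

def centricity(t, start, end):
    """1.0 at the action centre, falling toward 0 at the boundaries
    (a centerness-style formula assumed for illustration)."""
    left, right = t - start, end - t
    if left <= 0 or right <= 0:
        return 0.0
    return math.sqrt(min(left, right) / max(left, right))

scores = [round(centricity(t, start=10.0, end=20.0), 2) for t in (10.5, 15.0, 19.5)]
# [0.23, 1.0, 0.23] -- proposals near the centre receive higher confidence
```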

CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.16445
  • repo_url: None
  • paper_authors: Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi
  • for: To improve the robustness of vision-language models without retraining the image encoder.
  • methods: Text-only augmentation is used to disentangle latent content variables from style variables, so the text encoder learns to emphasize content.
  • results: Across multiple datasets, modifying only the style part of the text data yields substantial improvements in the robustness of the pre-trained CLIP model.
    Abstract Contrastive vision-language models, e.g., CLIP, have garnered substantial attention for their exceptional generalization capabilities. However, their robustness to perturbations has ignited concerns. Existing strategies typically reinforce their resilience against adversarial examples by enabling the image encoder to "see" these perturbed examples, often necessitating a complete retraining of the image encoder on both natural and adversarial samples. In this study, we propose a new method to enhance robustness solely through text augmentation, eliminating the need for retraining the image encoder on adversarial examples. Our motivation arises from the realization that text and image data inherently occupy a shared latent space, comprising latent content variables and style variables. This insight suggests the feasibility of learning to disentangle these latent content variables using text data exclusively. To accomplish this, we introduce an effective text augmentation method that focuses on modifying the style while preserving the content in the text data. By changing the style part of the text data, we empower the text encoder to emphasize latent content variables, ultimately enhancing the robustness of vision-language models. Our experiments across various datasets demonstrate substantial improvements in the robustness of the pre-trained CLIP model.
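
A small sketch of style-only text augmentation paired with a loss that pulls two style variants of the same content prompt together; the templates, the cosine-based loss, and the stand-in encoder are assumptions rather than the paper's exact objective.

```python
import random
import torch
import torch.nn.functional as F

STYLE_TEMPLATES = [
    "a photo of a {}", "a sketch of a {}", "a blurry picture of a {}",
    "a painting of a {}", "a low-resolution photo of a {}",
]

def augment_prompt(class_name):
    """Change only the style wording; the content word (class name) is preserved."""
    return random.choice(STYLE_TEMPLATES).format(class_name)

def content_alignment_loss(text_encoder, class_name):
    """Pull together embeddings of two style variants of the same content."""
    e1 = text_encoder(augment_prompt(class_name))
    e2 = text_encoder(augment_prompt(class_name))
    return 1 - F.cosine_similarity(e1, e2, dim=-1).mean()

dummy_encoder = lambda prompt: torch.randn(1, 512)   # stand-in for the CLIP text encoder
loss = content_alignment_loss(dummy_encoder, "dog")
```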

Beyond Visual Cues: Synchronously Exploring Target-Centric Semantics for Vision-Language Tracking

  • paper_url: http://arxiv.org/abs/2311.17085
  • repo_url: None
  • paper_authors: Jiawei Ge, Xiangmei Chen, Jiuxin Cao, Xuelin Zhu, Weijia Liu, Bo Liu
  • for: To improve single object tracking by using language descriptions to provide high-level semantics.
  • methods: A novel tracker with two new modules, the Target Enhance Module (TEM) and the Semantic Aware Module (SAM), to improve vision-language feature extraction and fusion.
  • results: Extensive experiments on vision-language tracking datasets demonstrate the superiority and effectiveness of the method.
    Abstract Single object tracking aims to locate one specific target in video sequences, given its initial state. Classical trackers rely solely on visual cues, restricting their ability to handle challenges such as appearance variations, ambiguity, and distractions. Hence, Vision-Language (VL) tracking has emerged as a promising approach, incorporating language descriptions to directly provide high-level semantics and enhance tracking performance. However, current VL trackers have not fully exploited the power of VL learning, as they suffer from limitations such as heavily relying on off-the-shelf backbones for feature extraction, ineffective VL fusion designs, and the absence of VL-related loss functions. Consequently, we present a novel tracker that progressively explores target-centric semantics for VL tracking. Specifically, we propose the first Synchronous Learning Backbone (SLB) for VL tracking, which consists of two novel modules: the Target Enhance Module (TEM) and the Semantic Aware Module (SAM). These modules enable the tracker to perceive target-related semantics and comprehend the context of both visual and textual modalities at the same pace, facilitating VL feature extraction and fusion at different semantic levels. Moreover, we devise the dense matching loss to further strengthen multi-modal representation learning. Extensive experiments on VL tracking datasets demonstrate the superiority and effectiveness of our methods.

Model-free Test Time Adaptation for Out-Of-Distribution Detection

  • paper_url: http://arxiv.org/abs/2311.16420
  • repo_url: None
  • paper_authors: YiFan Zhang, Xue Wang, Tian Zhou, Kun Yuan, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan
  • for: To improve the reliability of ML models by avoiding erroneous predictions on out-of-distribution data.
  • methods: A non-parametric framework that adapts the model using online test samples, enhancing adaptability to changing data distributions at test time.
  • results: Compared with conventional methods, it better avoids false positives, particularly when the ID and OOD distributions overlap significantly; for example, it reduces FPR95 by 23.23% on CIFAR-10 and 38% on ImageNet-1k.
    Abstract Out-of-distribution (OOD) detection is essential for the reliability of ML models. Most existing methods for OOD detection learn a fixed decision criterion from a given in-distribution dataset and apply it universally to decide if a data point is OOD. Recent work (Fang et al., 2022) shows that given only in-distribution data, it is impossible to reliably detect OOD data without extra assumptions. Motivated by the theoretical result and recent exploration of test-time adaptation methods, we propose a Non-Parametric Test Time Adaptation framework for Out-Of-Distribution Detection. Unlike conventional methods, the proposed framework utilizes online test samples for model adaptation during testing, enhancing adaptability to changing data distributions. The framework incorporates detected OOD instances into decision-making, reducing false positive rates, particularly when ID and OOD distributions overlap significantly. We demonstrate its effectiveness through comprehensive experiments on multiple OOD detection benchmarks; extensive empirical studies show that it significantly improves the performance of OOD detection over state-of-the-art methods. Specifically, it reduces the false positive rate (FPR95) by 23.23% on the CIFAR-10 benchmark and 38% on the ImageNet-1k benchmark compared to advanced methods. Lastly, we theoretically verify the effectiveness of the proposed framework.
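
The paper's exact adaptation rule is not reproduced here; the following is only a simplified non-parametric test-time scheme in the same spirit: score a test feature by its k-NN distance to an in-distribution memory, and keep an online memory of detected OOD features that raises the score of nearby later samples. The margin, calibration, and distance choices are assumptions.

```python
import numpy as np

class NonParametricTTAOOD:
    """Simplified non-parametric test-time OOD scoring with an online memory
    of detected OOD features (scoring rule and margin are illustrative)."""
    def __init__(self, id_bank, k=10, margin=1.3):
        self.id_bank = np.asarray(id_bank, dtype=float)
        self.ood_bank = np.empty((0, self.id_bank.shape[1]))
        self.k = k
        # calibrate a threshold from typical ID-to-ID k-NN distances
        self.threshold = margin * np.median(
            [self._knn_dist(self.id_bank, z) for z in self.id_bank[:100]]
        )

    def _knn_dist(self, bank, z):
        if len(bank) < self.k:
            return np.inf
        d = np.linalg.norm(bank - z, axis=1)
        return np.sort(d)[: self.k].mean()

    def score(self, z):
        s = self._knn_dist(self.id_bank, z)
        d_ood = self._knn_dist(self.ood_bank, z)
        if np.isfinite(d_ood):
            s += max(0.0, self.threshold - d_ood)   # proximity to known OOD raises the score
        is_ood = s > self.threshold
        if is_ood:
            self.ood_bank = np.vstack([self.ood_bank, z])   # adapt online
        return s, is_ood

rng = np.random.default_rng(0)
detector = NonParametricTTAOOD(rng.normal(size=(500, 32)))
print(detector.score(rng.normal(size=32)))            # in-distribution-like feature
print(detector.score(rng.normal(size=32) + 5.0))      # shifted feature, likely flagged OOD
```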

DepthSSC: Depth-Spatial Alignment and Dynamic Voxel Resolution for Monocular 3D Semantic Scene Completion

  • paper_url: http://arxiv.org/abs/2311.17084
  • repo_url: None
  • paper_authors: Jiawei Yao, Jusheng Zhang
  • for: 3D semantic scene completion with monocular cameras
  • methods: ST-GF (Spatial Transformation Graph Fusion) module with geometric-aware voxelization
  • results: achieves state-of-the-art performance in capturing intricate 3D structural details and mitigates spatial misalignment and distortion issues
    Abstract The task of 3D semantic scene completion with monocular cameras is gaining increasing attention in the field of autonomous driving. Its objective is to predict the occupancy status of each voxel in the 3D scene from partial image inputs. Despite the existence of numerous methods, many of them overlook the issue of accurate alignment between spatial and depth information. To address this, we propose DepthSSC, an advanced method for semantic scene completion solely based on monocular cameras. DepthSSC combines the ST-GF (Spatial Transformation Graph Fusion) module with geometric-aware voxelization, enabling dynamic adjustment of voxel resolution and considering the geometric complexity of 3D space to ensure precise alignment between spatial and depth information. This approach successfully mitigates spatial misalignment and distortion issues observed in prior methods. Through evaluation on the SemanticKITTI dataset, DepthSSC not only demonstrates its effectiveness in capturing intricate 3D structural details but also achieves state-of-the-art performance. We believe DepthSSC provides a fresh perspective on monocular camera-based 3D semantic scene completion research and anticipate it will inspire further related studies.

CLiC: Concept Learning in Context

  • paper_url: http://arxiv.org/abs/2311.17083
  • repo_url: https://github.com/Mehdi0xC/clic
  • paper_authors: Mehdi Safaee, Aryan Mikaeili, Or Patashnik, Daniel Cohen-Or, Ali Mahdavi-Amiri
  • for: Learning a localized visual pattern of an object from a single image and applying that pattern to an object in a target image.
  • methods: Building on recent advances in visual concept learning, the method acquires a visual concept (e.g., an ornament) from a source image and applies it to an object (e.g., a chair) in a target image. The key idea is in-context concept learning on the object itself, localizing the learned concept with soft masks that cover the concept together with its surrounding image region.
  • results: Object generation within an image shows plausible embedding of in-context learned concepts; acquired concepts can also be directed to specific locations in target images via cross-attention and source-target correspondences, with effectiveness confirmed by quantitative and qualitative experiments.
    Abstract This paper addresses the challenge of learning a local visual pattern of an object from one image, and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task, as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g., an ornament) from a source image and subsequently applying it to an object (e.g., a chair) in a target image. Our key idea is to perform in-context concept learning, acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning, we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image, showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images, employing cross-attention mechanisms, and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments, along with comparisons against baseline techniques.
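
One simple way to realize a soft mask that keeps the concept plus some surrounding context is to dilate the concept mask and give the halo a lower weight in a reconstruction loss; the context weight and blur size below are illustrative, not the paper's values.

```python
import torch
import torch.nn.functional as F

def soft_mask(concept_mask, context_weight=0.3, blur=7):
    """Concept pixels get full weight; a blurred halo keeps some surrounding context."""
    halo = F.avg_pool2d(concept_mask, blur, stride=1, padding=blur // 2)
    return torch.clamp(concept_mask + context_weight * (halo > 0).float(), max=1.0)

def masked_reconstruction_loss(pred, target, concept_mask):
    w = soft_mask(concept_mask)
    return ((pred - target) ** 2 * w).sum() / w.sum()

pred = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64); mask[..., 20:40, 20:40] = 1.0   # concept region
loss = masked_reconstruction_loss(pred, target, mask)
```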

DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling

  • paper_url: http://arxiv.org/abs/2311.17082
  • repo_url: https://github.com/alexzhou907/dreampropeller
  • paper_authors: Linqi Zhou, Andy Shih, Chenlin Meng, Stefano Ermon
  • for: To speed up text-to-3D generation by accelerating existing score-distillation-based pipelines with DreamPropeller.
  • methods: A drop-in acceleration algorithm that generalizes Picard iterations for parallel sampling and can be wrapped around any score-distillation-based text-to-3D framework.
  • results: Across multiple text-to-3D frameworks, experiments show up to a 4.7x speedup with a negligible drop in generation quality.
    Abstract Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However, the long generation time of such algorithms significantly degrades the user experience. To tackle this problem, we propose DreamPropeller, a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations, a classical algorithm for parallel sampling an ODE path, and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks.
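
The classical Picard iteration that the framework generalizes can be sketched as jointly refining an entire optimization trajectory, sweep after sweep, so that all per-step updates within a sweep can run in parallel; the toy gradient step below stands in for a score-distillation update and is purely illustrative.

```python
import numpy as np

def picard_parallel_rollout(step_fn, x0, num_steps=8, num_sweeps=20, tol=1e-5):
    """Jointly refine an entire trajectory x[1..T] instead of unrolling it serially.
    Each sweep recomputes every state from the previous sweep's trajectory,
    so all T step_fn evaluations within a sweep can run in parallel."""
    traj = np.repeat(x0[None, :], num_steps + 1, axis=0)   # initial guess: constant path
    for _ in range(num_sweeps):
        new = traj.copy()
        for t in range(1, num_steps + 1):                   # parallelizable over t
            new[t] = step_fn(traj[t - 1])
        if np.max(np.abs(new - traj)) < tol:                # trajectory has converged
            break
        traj = new
    return traj

# toy step: gradient descent on f(x) = ||x||^2 (stands in for a score-distillation update)
step = lambda x: x - 0.1 * 2 * x
final_trajectory = picard_parallel_rollout(step, np.ones(4))
```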

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

  • paper_url: http://arxiv.org/abs/2311.17081
  • repo_url: None
  • paper_authors: Xiaobao Wei, Jiajun Cao, Yizhu Jin, Ming Lu, Guangyu Wang, Shanghang Zhang
  • for: To propose a new medical image segmentation method that improves robustness and accuracy.
  • methods: The method builds on continuous representations and the Segment Anything Model (SAM), and improves segmentation accuracy through Parameter Efficient Fine Tuning (PEFT) and an Implicit Neural Representation (INR) decoder.
  • results: Compared with existing segmentation methods, it achieves higher accuracy and robustness on 2D medical image segmentation tasks with only 1.6M trainable parameters.
    Abstract With the development of Deep Neural Networks (DNNs), many efforts have been made to handle medical image segmentation. Traditional methods such as nnUNet train specific segmentation models on the individual datasets. Plenty of recent methods have been proposed to adapt the foundational Segment Anything Model (SAM) to medical image segmentation. However, they still focus on discrete representations to generate pixel-wise predictions, which are spatially inflexible and scale poorly to higher resolution. In contrast, implicit methods learn continuous representations for segmentation, which is crucial for medical image segmentation. In this paper, we propose I-MedSAM, which leverages the benefits of both continuous representations and SAM, to obtain better cross-domain ability and accurate boundary delineation. Since medical image segmentation needs to predict detailed segmentation boundaries, we designed a novel adapter to enhance the SAM features with high-frequency information during Parameter Efficient Fine Tuning (PEFT). To convert the SAM features and coordinates into continuous segmentation output, we utilize Implicit Neural Representation (INR) to learn an implicit segmentation decoder. We also propose an uncertainty-guided sampling strategy for efficient learning of INR. Extensive evaluations on 2D medical image segmentation tasks have shown that our proposed method with only 1.6M trainable parameters outperforms existing methods including discrete and continuous methods. The code will be released.
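
A simplified implicit segmentation decoder in the spirit described above: encoder features are bilinearly sampled at continuous query coordinates and passed, together with the coordinates, through a small MLP to produce a probability. The feature dimensions and the omission of the PEFT / high-frequency adapter are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitSegDecoder(nn.Module):
    """Predicts a segmentation probability at arbitrary (x, y) in [-1, 1]^2
    from locally interpolated image features (a simplified INR decoder)."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_map, coords):
        # feat_map: (B, C, H, W); coords: (B, N, 2) normalized to [-1, 1]
        sampled = F.grid_sample(feat_map, coords.unsqueeze(2), align_corners=True)
        sampled = sampled.squeeze(-1).permute(0, 2, 1)        # (B, N, C)
        return torch.sigmoid(self.mlp(torch.cat([sampled, coords], dim=-1)))

feats = torch.randn(1, 256, 16, 16)              # stand-in for SAM image features
query = torch.rand(1, 1024, 2) * 2 - 1           # continuous query coordinates
probs = ImplicitSegDecoder()(feats, query)       # (1, 1024, 1) boundary-aware predictions
```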