cs.CV - 2023-12-06

A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement

  • paper_url: http://arxiv.org/abs/2312.03946
  • repo_url: https://github.com/RisabBiswas/T2T-BinFormer
  • paper_authors: Risab Biswas, Swalpa Kumar Roy, Umapada Pal
  • for: This work proposes a document binarization encoder-decoder model based on the Tokens-to-Token vision transformer (T2T) to improve document image enhancement.
  • methods: The model uses a progressive tokenization technique to capture local structural information between adjacent pixels of the input image, and applies the T2T module repeatedly to model global relationships between tokens (a minimal tokenization sketch follows this entry).
  • results: Experiments on the DIBCO and H-DIBCO benchmarks show that the proposed model outperforms existing CNN- and ViT-based state-of-the-art methods.
    Abstract Document image enhancement is a fundamental and important stage for attaining the best performance in any document analysis assignment because there are many degradation situations that could harm document images, making it more difficult to recognize and analyze them. In this paper, we propose \textbf{T2T-BinFormer} which is a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer. Each image is divided into a set of tokens with a defined length using the ViT model, which is then applied several times to model the global relationship between the tokens. However, the conventional tokenization of input data does not adequately reflect the crucial local structure between adjacent pixels of the input image, which results in low efficiency. Instead of using a simple ViT and hard splitting of images for the document image enhancement task, we employed a progressive tokenization technique to capture this local information from an image to achieve more effective results. Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods. In this research, the primary area of examination is the application of the proposed architecture to the task of document binarization. The source code will be made available at https://github.com/RisabBiswas/T2T-BinFormer.
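The core of the progressive (soft-split) tokenization described above can be pictured as overlapping patch unfolding, so neighbouring pixels share tokens. The kernel, stride, and padding values below are illustrative assumptions, not the paper's settings:

```python
# Hypothetical sketch of a "soft split" step in Tokens-to-Token (T2T) tokenization.
import torch
import torch.nn as nn

class SoftSplit(nn.Module):
    """Overlapping patch unfolding; kernel/stride/padding values are illustrative."""
    def __init__(self, kernel_size=7, stride=4, padding=2):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size, stride=stride, padding=padding)

    def forward(self, x):                  # x: (B, C, H, W)
        tokens = self.unfold(x)            # (B, C*k*k, L)
        return tokens.transpose(1, 2)      # (B, L, C*k*k) token sequence

if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)      # a dummy document image
    tok = SoftSplit()(img)
    print(tok.shape)                       # torch.Size([1, 3136, 147])
```

Applying such a soft split repeatedly, with a transformer block in between, progressively shrinks the token grid while preserving local neighbourhood structure, which is the general T2T idea the paper builds on.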

Adapting HouseDiffusion for conditional Floor Plan generation on Modified Swiss Dwellings dataset

  • paper_url: http://arxiv.org/abs/2312.03938
  • repo_url: None
  • paper_authors: Emanuel Kuhn
  • for: This technical report extends an existing diffusion model, HouseDiffusion, to MSD, the new dataset introduced by the CVAAD Floor Plan Auto-Completion workshop challenge.
  • methods: The adaptation modifies the diffusion model's transformer layers to condition on a set of wall lines, and a pre-processing pipeline extracts wall lines from the binary mask of the building structure (a rough sketch of this step follows the entry).
  • results: Simplifying all room polygons to rectangles leads to better performance, indicating that future work should explore better representations of variable-length polygons in diffusion models.
    Abstract Automated floor plan generation has recently gained momentum with several methods that have been proposed. The CVAAD Floor Plan Auto-Completion workshop challenge introduced MSD, a new dataset that includes existing structural walls of the building as an additional input constraint. This technical report presents an approach for extending a recent work, HouseDiffusion (arXiv:2211.13287 [cs.CV]), to the MSD dataset. The adaption involves modifying the model's transformer layers to condition on a set of wall lines. The report introduces a pre-processing pipeline to extract wall lines from the binary mask of the building structure provided as input. Additionally, it was found that a data processing procedure that simplifies all room polygons to rectangles leads to better performance. This indicates that future work should explore better representations of variable-length polygons in diffusion models. The code will be made available at a later date.
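As a rough illustration of the wall-line extraction step mentioned above, one could run a Hough transform over the binary structure mask. The use of OpenCV and the threshold values here are assumptions, not the report's actual pipeline:

```python
# Hedged sketch: extract wall line segments from a binary building-structure mask.
import cv2
import numpy as np

def extract_wall_lines(mask: np.ndarray) -> np.ndarray:
    """mask: uint8 binary image (walls = 255). Returns an (N, 4) array of x1, y1, x2, y2."""
    edges = cv2.Canny(mask, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=40,
                            minLineLength=20, maxLineGap=5)
    return np.zeros((0, 4), dtype=np.int32) if lines is None else lines[:, 0, :]

if __name__ == "__main__":
    demo = np.zeros((128, 128), np.uint8)
    cv2.rectangle(demo, (20, 20), (100, 100), 255, 2)   # toy "structural walls"
    print(extract_wall_lines(demo).shape)
```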

The Potential of Vision-Language Models for Content Moderation of Children’s Videos

  • paper_url: http://arxiv.org/abs/2312.03936
  • repo_url: None
  • paper_authors: Syed Hammad Ahmed, Shengnan Hu, Gita Sukthankar
  • for: This paper studies vision-language models for content moderation of children's videos, a subtler task than the object detection and activity recognition settings where natural language supervision has already proven effective for zero-shot learning.
  • methods: Several CLIP variants, including the proposed Vanilla CLIP with a projection layer, are evaluated on children's cartoon content in both supervised and zero-shot settings (a minimal sketch of the projection head follows this entry).
  • results: The projection-layer CLIP model outperforms previous work on the Malicious or Benign (MOB) benchmark, and including more context in moderation prompts improves performance, particularly for cartoon videos.
    Abstract Natural language supervision has been shown to be effective for zero-shot learning in many computer vision tasks, such as object detection and activity recognition. However, generating informative prompts can be challenging for more subtle tasks, such as video content moderation. This can be difficult, as there are many reasons why a video might be inappropriate, beyond violence and obscenity. For example, scammers may attempt to create junk content that is similar to popular educational videos but with no meaningful information. This paper evaluates the performance of several CLIP variations for content moderation of children's cartoons in both the supervised and zero-shot setting. We show that our proposed model (Vanilla CLIP with Projection Layer) outperforms previous work conducted on the Malicious or Benign (MOB) benchmark for video content moderation. This paper presents an in depth analysis of how context-specific language prompts affect content moderation performance. Our results indicate that it is important to include more context in content moderation prompts, particularly for cartoon videos as they are not well represented in the CLIP training data.
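A minimal sketch of what a "Vanilla CLIP with Projection Layer" classifier could look like: a small trainable head on top of frozen CLIP image embeddings. The backbone name, embedding size, and two-class head are assumptions for illustration, using the OpenAI clip package:

```python
# Assumed sketch of a frozen-CLIP classifier with a trainable projection head.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package

class CLIPWithProjection(nn.Module):
    def __init__(self, num_classes=2, device="cpu"):
        super().__init__()
        self.backbone, self.preprocess = clip.load("ViT-B/32", device=device)
        for p in self.backbone.parameters():   # keep CLIP frozen
            p.requires_grad = False
        self.head = nn.Linear(512, num_classes)  # 512 = ViT-B/32 image embed dim

    def forward(self, images):                   # images: preprocessed (B, 3, 224, 224)
        with torch.no_grad():
            feats = self.backbone.encode_image(images).float()
        return self.head(feats)                  # (B, num_classes) logits
```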

Controllable Human-Object Interaction Synthesis

  • paper_url: http://arxiv.org/abs/2312.03913
  • repo_url: None
  • paper_authors: Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu
  • for: Simulating realistic human behavior by generating synchronized human and object motion in 3D scenes that follows the style and intent of a language description.
  • methods: A conditional diffusion model generates object and human motion given a language description, initial object and human states, and sparse object waypoints; the waypoints ground the motion in the scene and can be extracted effectively with high-level planning methods.
  • results: An object geometry loss improves the match between generated object motion and the input waypoints, and guidance terms that enforce contact constraints during sampling improve the realism and controllability of the interactions (a generic guidance sketch follows this entry).
    Abstract Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints and cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact grounded by the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints. In addition, we design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model.
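The guidance terms mentioned in the abstract can be illustrated generically as a gradient-based nudge applied to the model's denoised prediction during sampling. Everything below (the contact cost, the assumed sample layout, the guidance scale) is a hypothetical stand-in, not the CHOIS formulation:

```python
# Generic sketch of cost-guided sampling for a diffusion model.
import torch

def contact_cost(x):
    hand, obj = x[..., :3], x[..., 3:6]          # assumed layout of the sample
    return ((hand - obj) ** 2).sum()             # placeholder hand-object distance

def guided_step(x_t, denoise_fn, guidance_scale=0.1):
    x0_pred = denoise_fn(x_t)                    # model's clean-sample prediction
    x0_pred = x0_pred.detach().requires_grad_(True)
    cost = contact_cost(x0_pred)
    grad = torch.autograd.grad(cost, x0_pred)[0]
    return x0_pred - guidance_scale * grad       # push the prediction toward contact

if __name__ == "__main__":
    x = torch.randn(1, 10, 6)
    print(guided_step(x, lambda z: z * 0.9).shape)
```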

WonderJourney: Going from Anywhere to Everywhere

  • paper_url: http://arxiv.org/abs/2312.03884
  • repo_url: None
  • paper_authors: Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann
  • for: A modularized framework for perpetual 3D scene generation that starts from any user-provided location (a text description or an image) and generates a journey through a long sequence of diverse yet coherently connected 3D scenes.
  • methods: An LLM generates textual descriptions of the scenes in the journey, a text-driven point cloud generation pipeline produces a compelling and coherent sequence of 3D scenes, and a large VLM verifies the generated scenes.
  • results: The paper shows compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys".
    Abstract We introduce WonderJourney, a modularized framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes, we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys". Project website: https://kovenyu.com/WonderJourney/

Inpaint3D: 3D Scene Content Generation using 2D Inpainting Diffusion

  • paper_url: http://arxiv.org/abs/2312.03869
  • repo_url: None
  • paper_authors: Kira Prabhu, Jane Wu, Lynn Tsai, Peter Hedman, Dan B Goldman, Ben Poole, Michael Broxton
  • for: Inpainting 3D regions of a scene given masked multi-view images.
  • methods: A diffusion model conditioned on a single masked 2D image is distilled into a learned 3D scene representation (a NeRF) using a combination of score distillation sampling and NeRF reconstruction losses, with predicted depth as additional supervision.
  • results: Generates high-quality content for any masked 3D region and supports 3D object removal, 3D object completion, 3D object replacement, and 3D scene completion.
    Abstract This paper presents a novel approach to inpainting 3D regions of a scene, given masked multi-view images, by distilling a 2D diffusion model into a learned 3D scene representation (e.g. a NeRF). Unlike 3D generative methods that explicitly condition the diffusion model on camera pose or multi-view information, our diffusion model is conditioned only on a single masked 2D image. Nevertheless, we show that this 2D diffusion model can still serve as a generative prior in a 3D multi-view reconstruction problem where we optimize a NeRF using a combination of score distillation sampling and NeRF reconstruction losses. Predicted depth is used as additional supervision to encourage accurate geometry. We compare our approach to 3D inpainting methods that focus on object removal. Because our method can generate content to fill any 3D masked region, we additionally demonstrate 3D object completion, 3D object replacement, and 3D scene completion.

LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

  • paper_url: http://arxiv.org/abs/2312.03849
  • repo_url: None
  • paper_authors: Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu
  • for: Generating instructional images of human daily actions from an egocentric viewpoint, a key step towards efficient skill transfer.
  • methods: A new problem, egocentric action frame generation, is introduced: synthesize the action frame conditioned on a user prompt and an input egocentric image of the user's environment. Because existing egocentric datasets lack detailed annotations of action execution, and diffusion-based image manipulation models cannot control the state change of an action in the egocentric pixel space, a visual large language model (VLLM) is finetuned via visual instruction tuning to curate enriched action descriptions, and LEGO action frame generation uses image and text embeddings from the VLLM as additional conditioning.
  • results: The model is validated on two egocentric datasets, Ego4D and Epic-Kitchens, showing prominent quantitative and qualitative improvements over prior image manipulation models; detailed ablation studies and analysis provide further insight into the method.
    Abstract Generating instructional images of human daily actions from an egocentric viewpoint serves a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize the action frame conditioning on the user prompt question and an input egocentric image that captures user's environment. Notably, existing egocentric datasets lack the detailed annotations that describe the execution of actions. Additionally, the diffusion-based image manipulation models fail to control the state change of an action within the corresponding egocentric image pixel space. To this end, we finetune a visual large language model (VLLM) via visual instruction tuning for curating the enriched action descriptions to address our proposed problem. Moreover, we propose to Learn EGOcentric (LEGO) action frame generation using image and text embeddings from VLLM as additional conditioning. We validate our proposed model on two egocentric datasets -- Ego4D and Epic-Kitchens. Our experiments show prominent improvement over prior image manipulation models in both quantitative and qualitative evaluation. We also conduct detailed ablation studies and analysis to provide insights on our method.

Relightable Gaussian Codec Avatars

  • paper_url: http://arxiv.org/abs/2312.03704
  • repo_url: None
  • paper_authors: Shunsuke Saito, Gabriel Schwartz, Tomas Simon, Junxuan Li, Giljoo Nam
  • for: High-fidelity, relightable head avatars that capture fine details such as hair strands and skin pores.
  • methods: A geometry model based on 3D Gaussians captures 3D-consistent sub-millimeter detail, and a learnable radiance-transfer appearance model (global-illumination-aware spherical harmonics for the diffuse component plus spherical Gaussians for all-frequency reflections) supports the diverse materials of human heads, such as eyes, skin, and hair, in a unified manner.
  • results: Real-time, high-fidelity relighting under both point-light and continuous illumination, with relightable explicit eye models enabling accurate eye reflections and explicit gaze control; the method outperforms existing approaches and runs on a tethered consumer VR headset.
    Abstract The fidelity of relighting is bounded by both geometry and appearance representations. For geometry, both mesh and volumetric approaches have difficulty modeling intricate structures like 3D hair geometry. For appearance, existing relighting models are limited in fidelity and often too slow to render in real-time with high-resolution continuous environments. In this work, we present Relightable Gaussian Codec Avatars, a method to build high-fidelity relightable head avatars that can be animated to generate novel expressions. Our geometry model based on 3D Gaussians can capture 3D-consistent sub-millimeter details such as hair strands and pores on dynamic face sequences. To support diverse materials of human heads such as the eyes, skin, and hair in a unified manner, we present a novel relightable appearance model based on learnable radiance transfer. Together with global illumination-aware spherical harmonics for the diffuse components, we achieve real-time relighting with spatially all-frequency reflections using spherical Gaussians. This appearance model can be efficiently relit under both point light and continuous illumination. We further improve the fidelity of eye reflections and enable explicit gaze control by introducing relightable explicit eye models. Our method outperforms existing approaches without compromising real-time performance. We also demonstrate real-time relighting of avatars on a tethered consumer VR headset, showcasing the efficiency and fidelity of our avatars.

Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning

  • paper_url: http://arxiv.org/abs/2312.03703
  • repo_url: https://github.com/fanglaosi/skeleton-in-context
  • paper_authors: Xinshun Wang, Zhongbin Fang, Xia Li, Xiangtai Li, Chen Chen, Mengyuan Liu
  • for: An in-context learning model that handles multiple skeleton-sequence tasks simultaneously, improving efficiency and multi-task synergy.
  • methods: The Skeleton-in-Context (SiC) framework handles multiple skeleton-based tasks after a single training process, perceiving each task from the given prompt; a task-unified prompt adaptively learns tasks of different natures, such as partial joint-level generation, sequence-level prediction, and 2D-to-3D motion prediction.
  • results: Extensive experiments show state-of-the-art multi-task performance, even outperforming single-task methods on certain tasks, and the model generalizes to unseen tasks such as motion in-between.
    Abstract In-context learning provides a new perspective for multi-task modeling for vision and NLP. Under this setting, the model can perceive tasks from prompts and accomplish them without any extra task-specific head predictions or model fine-tuning. However, Skeleton sequence modeling via in-context learning remains unexplored. Directly applying existing in-context models from other areas onto skeleton sequences fails due to the inter-frame and cross-task pose similarity that makes it outstandingly hard to perceive the task correctly from a subtle context. To address this challenge, we propose Skeleton-in-Context (SiC), an effective framework for in-context skeleton sequence modeling. Our SiC is able to handle multiple skeleton-based tasks simultaneously after a single training process and accomplish each task from context according to the given prompt. It can further generalize to new, unseen tasks according to customized prompts. To facilitate context perception, we additionally propose a task-unified prompt, which adaptively learns tasks of different natures, such as partial joint-level generation, sequence-level prediction, or 2D-to-3D motion prediction. We conduct extensive experiments to evaluate the effectiveness of our SiC on multiple tasks, including motion prediction, pose estimation, joint completion, and future pose estimation. We also evaluate its generalization capability on unseen tasks such as motion-in-between. These experiments show that our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.

Self-conditioned Image Generation via Generating Representations

  • paper_url: http://arxiv.org/abs/2312.03701
  • repo_url: https://github.com/LTH14/rcg
  • paper_authors: Tianhong Li, Dina Katabi, Kaiming He
  • for: A new image generation framework that sets a new benchmark in class-unconditional image generation.
  • methods: Representation-Conditioned image Generation (RCG) conditions not on human annotations but on a self-supervised representation distribution mapped from the image distribution by a pre-trained encoder; during generation, a representation diffusion model (RDM) samples from this distribution, and a pixel generator crafts image pixels conditioned on the sampled representation (a schematic sketch follows this entry).
  • results: On ImageNet 256$\times$256, RCG achieves an FID of 3.31 and an Inception Score of 253.4, significantly improving the state of the art in class-unconditional generation and rivaling leading class-conditional methods, bridging the long-standing performance gap between the two tasks.
    Abstract This paper presents $\textbf{R}$epresentation-$\textbf{C}$onditioned image $\textbf{G}$eneration (RCG), a simple yet effective image generation framework which sets a new benchmark in class-unconditional image generation. RCG does not condition on any human annotations. Instead, it conditions on a self-supervised representation distribution which is mapped from the image distribution using a pre-trained encoder. During generation, RCG samples from such representation distribution using a representation diffusion model (RDM), and employs a pixel generator to craft image pixels conditioned on the sampled representation. Such a design provides substantial guidance during the generative process, resulting in high-quality image generation. Tested on ImageNet 256$\times$256, RCG achieves a Frechet Inception Distance (FID) of 3.31 and an Inception Score (IS) of 253.4. These results not only significantly improve the state-of-the-art of class-unconditional image generation but also rival the current leading methods in class-conditional image generation, bridging the long-standing performance gap between these two tasks. Code is available at https://github.com/LTH14/rcg.
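A schematic sketch of the RCG inference flow described above: sample a representation from a representation diffusion model, then condition a pixel generator on it. Both modules below are toy placeholders with assumed dimensions, not the released implementation:

```python
# Toy stand-ins for the RDM and the representation-conditioned pixel generator.
import torch
import torch.nn as nn

class TinyRDM(nn.Module):
    """Placeholder representation diffusion model: iteratively refines a 1D embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    @torch.no_grad()
    def sample(self, batch, steps=10):
        z = torch.randn(batch, 256)
        for _ in range(steps):
            z = z - 0.1 * self.net(z)      # toy denoising update
        return z

class TinyPixelGenerator(nn.Module):
    """Placeholder pixel generator conditioned on the sampled representation."""
    def __init__(self, dim=256, out_hw=32):
        super().__init__()
        self.fc = nn.Linear(dim, 3 * out_hw * out_hw)
        self.out_hw = out_hw

    def forward(self, rep):
        return self.fc(rep).view(-1, 3, self.out_hw, self.out_hw)

rep = TinyRDM().sample(batch=4)
imgs = TinyPixelGenerator()(rep)
print(imgs.shape)   # torch.Size([4, 3, 32, 32])
```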

Diffusion Illusions: Hiding Images in Plain Sight

  • paper_url: http://arxiv.org/abs/2312.03817
  • repo_url: https://github.com/RyannDaGreat/Diffusion-Illusions
  • paper_authors: Ryan Burgert, Xiang Li, Abe Leite, Kanchana Ranasinghe, Michael S. Ryoo
  • for: Computationally generating special 'prime' images that produce optical illusions when physically arranged and viewed in a certain way.
  • methods: A comprehensive pipeline adapts the existing score distillation loss and introduces a new dream target loss to optimize a group of differentially parametrized prime images with a frozen text-to-image diffusion model, covering three types of illusions (a toy optimization sketch follows this entry).
  • results: Experiments verify the method's effectiveness qualitatively and quantitatively, and the illusions are successfully fabricated physically, showing that they work in the real world.
    Abstract We explore the problem of computationally generating special `prime' images that produce optical illusions when physically arranged and viewed in a certain way. First, we propose a formal definition for this problem. Next, we introduce Diffusion Illusions, the first comprehensive pipeline designed to automatically generate a wide range of these illusions. Specifically, we both adapt the existing `score distillation loss' and propose a new `dream target loss' to optimize a group of differentially parametrized prime images, using a frozen text-to-image diffusion model. We study three types of illusions, each where the prime images are arranged in different ways and optimized using the aforementioned losses such that images derived from them align with user-chosen text prompts or images. We conduct comprehensive experiments on these illusions and verify the effectiveness of our proposed method qualitatively and quantitatively. Additionally, we showcase the successful physical fabrication of our illusions -- as they are all designed to work in the real world. Our code and examples are publicly available at our interactive project website: https://diffusionillusions.com
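A toy version of the prime-image idea: learnable prime images are combined by a fixed arrangement into a derived image, and the primes are optimized so the derived image matches a target. The rotation-and-blend arrangement and the plain MSE loss below are stand-ins for the paper's score distillation and dream target losses:

```python
# Toy optimization of two "prime" images whose fixed arrangement yields a derived image.
import torch

primes = torch.rand(2, 3, 64, 64, requires_grad=True)   # two learnable prime images
target = torch.rand(3, 64, 64)                           # stand-in target
opt = torch.optim.Adam([primes], lr=1e-2)

for _ in range(100):
    derived = 0.5 * (primes[0] + torch.rot90(primes[1], k=2, dims=(1, 2)))
    loss = ((derived - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```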

AVID: Any-Length Video Inpainting with Diffusion Model

  • paper_url: http://arxiv.org/abs/2312.03816
  • repo_url: https://github.com/zhang-zx/AVID
  • paper_authors: Zhixing Zhang, Bichen Wu, Xiaoyan Wang, Yaqiao Luo, Luxin Zhang, Yinan Zhao, Peter Vajda, Dimitris Metaxas, Licheng Yu
  • for: Text-guided video inpainting that supports different inpainting types and arbitrary video durations.
  • methods: A diffusion model equipped with effective motion modules and adjustable structure guidance handles fixed-length video inpainting, and a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism extends generation to videos of any desired duration (a sketch of windowed temporal sampling follows this entry).
  • results: Comprehensive experiments show the method robustly handles various inpainting types across a range of video durations with high quality; more visual results are available at https://zhang-zx.github.io/AVID/ .
    Abstract Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into video domain, there has been fewer works regarding text-guided video inpainting. Given a video, a masked region at its initial frame, and an editing prompt, it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: ($i$) temporal consistency of the edited video, ($ii$) supporting different inpainting types at different structural fidelity level, and ($iii$) dealing with variable video length. To address these challenges, we introduce Any-Length Video Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is equipped with effective motion modules and adjustable structure guidance, for fixed-length video inpainting. Building on top of that, we propose a novel Temporal MultiDiffusion sampling pipeline with an middle-frame attention guidance mechanism, facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration range, with high quality. More visualization results is made publicly available at https://zhang-zx.github.io/AVID/ .
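One way to picture a Temporal MultiDiffusion-style sampler is as overlapping fixed-length windows whose per-frame predictions are averaged. The window size, stride, and averaging scheme below are assumptions, not AVID's exact pipeline:

```python
# Sketch: denoise overlapping temporal windows and average predictions in the overlaps.
import torch

def windowed_denoise(latents, denoise_fn, window=16, stride=8):
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    count = torch.zeros(T, *[1] * (latents.dim() - 1))
    for s in range(0, max(T - window, 0) + 1, stride):
        sl = slice(s, s + window)
        out[sl] += denoise_fn(latents[sl])
        count[sl] += 1
    return out / count.clamp(min=1)

if __name__ == "__main__":
    vid = torch.randn(40, 4, 8, 8)                    # (frames, C, H, W) latents
    print(windowed_denoise(vid, lambda x: x).shape)   # torch.Size([40, 4, 8, 8])
```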

Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication

  • paper_url: http://arxiv.org/abs/2312.03692
  • repo_url: None
  • paper_authors: Ali Naseh, Jaechul Roh, Amir Houmansadr
  • for: Examining the privacy and responsible-use concerns raised by diffusion-based text-to-image models, which tend to memorize and potentially replicate exact training samples.
  • methods: Two case studies investigate two distinct, underexplored types of duplication that lead to replication during inference in diffusion-based models, particularly the Stable Diffusion model.
  • results: The study characterizes these duplication phenomena and their implications, contributing to the safer and more responsible use of generative models.
    Abstract Diffusion-based models, such as the Stable Diffusion model, have revolutionized text-to-image synthesis with their ability to produce high-quality, high-resolution images. These advancements have prompted significant progress in image generation and editing tasks. However, these models also raise concerns due to their tendency to memorize and potentially replicate exact training samples, posing privacy risks and enabling adversarial attacks. Duplication in training datasets is recognized as a major factor contributing to memorization, and various forms of memorization have been studied so far. This paper focuses on two distinct and underexplored types of duplication that lead to replication during inference in diffusion-based models, particularly in the Stable Diffusion model. We delve into these lesser-studied duplication phenomena and their implications through two case studies, aiming to contribute to the safer and more responsible use of generative models in various applications.

Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching

  • paper_url: http://arxiv.org/abs/2312.03678
  • repo_url: None
  • paper_authors: Lennart Bastian, Yizheng Xie, Nassir Navab, Zorah Lähner
  • for: solves the problem of non-isometric shape correspondence in computer vision, which is a fundamental challenge.
  • methods: combines the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell Hessian with the intrinsic eigenfunctions of the LBO, creating a hybrid spectral space in which functional maps are constructed (a schematic functional-map sketch follows this entry).
  • results: achieves significant improvements in non-isometric correspondence settings, with up to 15% better mean geodesic error, and up to 45% improvement in scenarios with topological noise.
    Abstract Non-isometric shape correspondence remains a fundamental challenge in computer vision. Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic shape changes like bending and creases. We propose a novel approach of combining the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in which we construct functional maps. To this end, we present a theoretical framework to effectively integrate non-orthogonal basis functions into descriptor- and learning-based functional map methods. Our approach can be incorporated easily into existing functional map pipelines across varying applications and is able to handle complex deformations beyond isometries. We show extensive evaluations across various supervised and unsupervised settings and demonstrate significant improvements. Notably, our approach achieves up to 15% better mean geodesic error for non-isometric correspondence settings and up to 45% improvement in scenarios with topological noise.
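The functional-map machinery the paper builds on can be sketched as a least-squares fit between descriptor coefficients expressed in each shape's basis. The formulation below is the generic version and omits the paper's hybrid-basis energy terms:

```python
# Generic functional-map estimation from corresponding descriptors.
import numpy as np

def functional_map(Phi_X, Phi_Y, F_X, F_Y):
    """Phi_*: (n_verts, k) basis functions; F_*: (n_verts, d) corresponding descriptors."""
    A = np.linalg.pinv(Phi_X) @ F_X     # descriptor coefficients on shape X
    B = np.linalg.pinv(Phi_Y) @ F_Y     # descriptor coefficients on shape Y
    # Solve C A ~ B in least squares, i.e. A^T C^T ~ B^T.
    C_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return C_T.T                        # (k, k) functional map

rng = np.random.default_rng(0)
Phi_X, Phi_Y = rng.standard_normal((100, 20)), rng.standard_normal((100, 20))
F = rng.standard_normal((100, 30))
print(functional_map(Phi_X, Phi_Y, F, F).shape)   # (20, 20)
```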

WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on

  • paper_url: http://arxiv.org/abs/2312.03667
  • repo_url: None
  • paper_authors: xujie zhang, Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang
  • for: Improving the realism and detail preservation of image-based virtual try-on (VITON).
  • methods: WarpDiffusion bridges the warping-based and diffusion-based paradigms via a novel informative, local garment feature attention mechanism: local texture attention reduces resource consumption, and an auto-mask module retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions.
  • results: Extensive experiments on high-resolution virtual try-on benchmarks and an in-the-wild test set demonstrate WarpDiffusion's superiority over state-of-the-art methods, both qualitatively and quantitatively.
    Abstract Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person. While existing methods focus on warping the garment to fit the body pose, they often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments. These limitations greatly reduce the realism of the generated results and hinder the practical application of VITON techniques. Leveraging the notable success of diffusion-based models in cross-modal image synthesis, some recent diffusion-based methods have ventured to tackle this issue. However, they tend to either consume a significant amount of training resources or struggle to achieve realistic try-on effects and retain garment details. For efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism. Specifically, WarpDiffusion incorporates local texture attention to reduce resource consumption and uses a novel auto-mask module that effectively retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions. Notably, WarpDiffusion can be integrated as a plug-and-play component into existing VITON methodologies, elevating their synthesis quality. Extensive experiments on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, surpassing state-of-the-art methods both qualitatively and quantitatively.

Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2312.03661
  • repo_url: https://github.com/fudan-zvg/reason2drive
  • paper_authors: Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang
  • for: The paper aims to provide a benchmark dataset (Reason2Drive) for studying interpretable reasoning in complex driving environments, and to evaluate the reasoning capabilities of large vision-language models (VLMs) in autonomous driving.
  • methods: The proposed benchmark dataset consists of over 600K video-text pairs, collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo, and ONCE. A novel aggregated evaluation metric is introduced to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr.
  • results: The authors conduct experiments to assess various existing VLMs on the proposed benchmark, revealing insights into their reasoning capabilities. Additionally, they develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy.
    Abstract Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.

Seeing the random forest through the decision trees. Supporting learning health systems from histopathology with machine learning models: Challenges and opportunities

  • paper_url: http://arxiv.org/abs/2312.03812
  • repo_url: None
  • paper_authors: Ricardo Gonzalez, Ashirbani Saha, Clinton J. V. Campbell, Peyman Nejat, Cynthia Lokker, Andrew P. Norgan
  • for: Discusses overlooked challenges of working with machine learning models for histopathology and presents a novel opportunity to support "learning health systems".
  • methods: The challenges are elaborated and separated according to their mitigation strategies: those requiring innovative approaches, time, or future technological capabilities, and those requiring a conceptual reappraisal from a critical perspective.
  • results: A novel opportunity is presented to support "learning health systems" by integrating hidden information extracted by ML models from digitized histopathology slides with other healthcare big data.
    Abstract This paper discusses some overlooked challenges faced when working with machine learning models for histopathology and presents a novel opportunity to support "Learning Health Systems" with them. Initially, the authors elaborate on these challenges after separating them according to their mitigation strategies: those that need innovative approaches, time, or future technological capabilities and those that require a conceptual reappraisal from a critical perspective. Then, a novel opportunity to support "Learning Health Systems" by integrating hidden information extracted by ML models from digitalized histopathology slides with other healthcare big data is presented.

Editable Stain Transformation Of Histological Images Using Unpaired GANs

  • paper_url: http://arxiv.org/abs/2312.03647
  • repo_url: https://github.com/slobodaapl/xai-cyclegan-2
  • paper_authors: Tibor Sloboda, Lukáš Hudec, Wanda Benešová
  • for: This paper is written for researchers and clinicians working in the field of histopathology, particularly those interested in metaplastic breast cancer.
  • methods: The paper introduces a new method called xAI-CycleGAN, which combines Mask CycleGAN with explainability features and structure-preserving capabilities to transform H&E stained breast tissue images into P63-like images.
  • results: The paper shows that xAI-CycleGAN is effective in maintaining structural integrity and generating high-quality images, and a survey of histopathologists indicates that the generated images are often comparable in realism to actual images.
    Abstract Double staining in histopathology, particularly for metaplastic breast cancer, typically employs H&E and P63 dyes. However, P63's tissue damage and high cost necessitate alternative methods. This study introduces xAI-CycleGAN, an advanced architecture combining Mask CycleGAN with explainability features and structure-preserving capabilities for transforming H&E stained breast tissue images into P63-like images. The architecture allows for output editing, enhancing resemblance to actual images and enabling further model refinement. We showcase xAI-CycleGAN's efficacy in maintaining structural integrity and generating high-quality images. Additionally, a histopathologist survey indicates the generated images' realism is often comparable to actual images, validating our model's high-quality output.

Training Neural Networks on RAW and HDR Images for Restoration Tasks

  • paper_url: http://arxiv.org/abs/2312.03640
  • repo_url: https://github.com/gfxdisp/colorvideovdp
  • paper_authors: Lei Luo, Alexandre Chapiro, Xiaoyu Xiang, Yuchen Fan, Rakesh Ranjan, Rafal Mantiuk
  • for: How neural networks should be trained for image restoration tasks on RAW and HDR images represented in linear color spaces.
  • methods: Several approaches are tested on denoising, deblurring, and single-image super-resolution: display-encoding HDR/RAW images with common transfer functions (PQ, PU21, mu-law) versus training in linear color spaces with loss functions that correct for perceptual non-uniformity (a mu-law sketch follows this entry).
  • results: Networks train significantly better on HDR and RAW images represented in display-encoded color spaces, which offer better perceptual uniformity, with gains of up to 10-15 dB.
    Abstract The vast majority of standard image and video content available online is represented in display-encoded color spaces, in which pixel values are conveniently scaled to a limited range (0-1) and the color distribution is approximately perceptually uniform. In contrast, both camera RAW and high dynamic range (HDR) images are often represented in linear color spaces, in which color values are linearly related to colorimetric quantities of light. While training on commonly available display-encoded images is a well-established practice, there is no consensus on how neural networks should be trained for tasks on RAW and HDR images in linear color spaces. In this work, we test several approaches on three popular image restoration applications: denoising, deblurring, and single-image super-resolution. We examine whether HDR/RAW images need to be display-encoded using popular transfer functions (PQ, PU21, mu-law), or whether it is better to train in linear color spaces, but use loss functions that correct for perceptual non-uniformity. Our results indicate that neural networks train significantly better on HDR and RAW images represented in display-encoded color spaces, which offer better perceptual uniformity than linear spaces. This small change to the training strategy can bring a very substantial gain in performance, up to 10-15 dB.
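Of the transfer functions compared (PQ, PU21, mu-law), the mu-law encoding is the simplest to sketch. The mu value and the assumption that linear values are pre-normalized to [0, 1] are illustrative choices:

```python
# Minimal mu-law encode/decode pair for linear HDR/RAW values.
import numpy as np

def mu_law_encode(linear: np.ndarray, mu: float = 5000.0) -> np.ndarray:
    """Map linear values in [0, 1] to a perceptually more uniform [0, 1] range."""
    linear = np.clip(linear, 0.0, 1.0)
    return np.log1p(mu * linear) / np.log1p(mu)

def mu_law_decode(encoded: np.ndarray, mu: float = 5000.0) -> np.ndarray:
    """Invert the encoding back to linear values."""
    return np.expm1(encoded * np.log1p(mu)) / mu

hdr = np.random.rand(4, 4).astype(np.float32) ** 4   # toy linear radiance map
assert np.allclose(mu_law_decode(mu_law_encode(hdr)), hdr, atol=1e-5)
```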

Boosting Segment Anything Model Towards Open-Vocabulary Learning

  • paper_url: http://arxiv.org/abs/2312.03628
  • repo_url: https://github.com/ucas-vg/sambor
  • paper_authors: Xumeng Han, Longhui Wei, Xuehui Yu, Zhiyang Dou, Xin He, Kuiran Wang, Zhenjun Han, Qi Tian
  • for: Extending the applicability of the Segment Anything Model (SAM) by giving it the ability to grasp object semantics across domains.
  • methods: An end-to-end framework, Sambor, integrates SAM with an open-vocabulary object detector so that arbitrary objects can be detected from human inputs such as category names or referring expressions; a SideFormer module extracts SAM features for zero-shot object localization and injects comprehensive semantic information, and an open-set region proposal network (Open-set RPN) lets the detector use the open-set proposals generated by SAM.
  • results: Sambor shows superior zero-shot performance on benchmarks including COCO and LVIS, proving highly competitive against previous state-of-the-art methods.
    Abstract The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we enhance it with the capacity to detect arbitrary objects based on human inputs like category names or reference expressions. To accomplish this, we introduce a novel SideFormer module that extracts SAM features to facilitate zero-shot object localization and inject comprehensive semantic information for open-vocabulary recognition. In addition, we devise an open-set region proposal network (Open-set RPN), enabling the detector to acquire the open-set proposals generated by SAM. Sambor demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous SoTA methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.

TokenCompose: Grounding Diffusion with Token-level Supervision

  • paper_url: http://arxiv.org/abs/2312.03626
  • repo_url: https://github.com/mlpc-ucsd/TokenCompose
  • paper_authors: Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu
  • for: Improving the consistency between user-specified text prompts and model-generated images.
  • methods: Token-wise consistency terms between the image content and object segmentation maps are introduced in the finetuning stage of an existing text-conditioned diffusion model, without extra human labeling information (a sketch of a token-level grounding loss follows this entry).
  • results: Finetuning Stable Diffusion in this way yields significant improvements in multi-category instance composition and enhanced photorealism of the generated images.
    Abstract We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images.
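A hedged sketch of what a token-level grounding term can look like: encourage the normalized cross-attention map of an object token to place its mass inside that object's segmentation mask. The exact losses used by TokenCompose may differ:

```python
# Toy token-level grounding loss between an attention map and a segmentation mask.
import torch

def token_grounding_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """attn_map, mask: (H, W); attention is non-negative, mask is binary."""
    attn = attn_map / attn_map.sum().clamp(min=1e-8)
    inside = (attn * mask).sum()          # attention mass falling on the object
    return 1.0 - inside                   # minimize the mass outside the mask

attn = torch.rand(16, 16)
mask = torch.zeros(16, 16)
mask[4:12, 4:12] = 1.0
print(token_grounding_loss(attn, mask).item())
```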

Automated Multimodal Data Annotation via Calibration With Indoor Positioning System

  • paper_url: http://arxiv.org/abs/2312.03608
  • repo_url: None
  • paper_authors: Ryan Rubel, Andrew Dudash, Mohammad Goli, James O’Hara, Karl Wunderlich
  • for: Improving object detection for niche applications such as warehouse robotics and automated infrastructure, where the required semantic classes are not available in large existing datasets.
  • methods: An automated annotation pipeline for fused LiDAR and camera data uses an indoor positioning system (IPS) to produce accurate detection labels for both point clouds and images, eliminating manual annotation entirely (a projection sketch follows this entry).
  • results: The system annotates objects of interest 261.8 times faster than a human baseline and speeds up end-to-end dataset creation by 61.5%.
    Abstract Learned object detection methods based on fusion of LiDAR and camera data require labeled training samples, but niche applications, such as warehouse robotics or automated infrastructure, require semantic classes not available in large existing datasets. Therefore, to facilitate the rapid creation of multimodal object detection datasets and alleviate the burden of human labeling, we propose a novel automated annotation pipeline. Our method uses an indoor positioning system (IPS) to produce accurate detection labels for both point clouds and images and eliminates manual annotation entirely. In an experiment, the system annotates objects of interest 261.8 times faster than a human baseline and speeds up end-to-end dataset creation by 61.5%.
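The calibration idea, projecting an IPS-tracked 3D position into the camera to seed an image-space label, reduces to a pinhole projection. The intrinsic matrix below is an arbitrary example, and a real pipeline would also need the IPS-to-camera extrinsics:

```python
# Sketch: project a 3D point (already in the camera frame) to pixel coordinates.
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])          # assumed camera intrinsics

def project_to_pixel(p_cam: np.ndarray) -> np.ndarray:
    """p_cam: 3D point in the camera frame (metres). Returns (u, v) pixel coordinates."""
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

print(project_to_pixel(np.array([0.5, -0.2, 4.0])))
```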

SurfaceAug: Closing the Gap in Multimodal Ground Truth Sampling

  • paper_url: http://arxiv.org/abs/2312.03808
  • repo_url: None
  • paper_authors: Ryan Rubel, Nathan Clark, Andrew Dudash
  • for: Improving the performance of multimodal object detectors.
  • methods: SurfaceAug, a novel ground truth sampling algorithm, pastes objects by resampling both images and point clouds, enabling object-level transformations in both modalities.
  • results: A multimodal detector trained on KITTI with SurfaceAug outperforms existing methods on car detection and establishes a new state of the art for multimodal ground truth sampling.
    Abstract Despite recent advances in both model architectures and data augmentation, multimodal object detectors still barely outperform their LiDAR-only counterparts. This shortcoming has been attributed to a lack of sufficiently powerful multimodal data augmentation. To address this, we present SurfaceAug, a novel ground truth sampling algorithm. SurfaceAug pastes objects by resampling both images and point clouds, enabling object-level transformations in both modalities. We evaluate our algorithm by training a multimodal detector on KITTI and compare its performance to previous works. We show experimentally that SurfaceAug outperforms existing methods on car detection tasks and establishes a new state of the art for multimodal ground truth sampling.

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

  • paper_url: http://arxiv.org/abs/2312.03594
  • repo_url: https://github.com/open-mmlab/mmagic/tree/main/projects/powerpaint
  • paper_authors: Junhao Zhuang, Yanhong Zeng, Wenran Liu, Chun Yuan, Kai Chen
  • for: High-quality, versatile image inpainting that fills user-specified regions with plausible content according to user intent; existing methods struggle to address context-aware image inpainting and text-guided object inpainting simultaneously because the two require different optimal training strategies.
  • methods: PowerPaint, the first model to excel at both tasks, introduces learnable task prompts along with tailored fine-tuning strategies that explicitly guide the model's focus toward different inpainting targets (a prompt-interpolation sketch follows this entry).
  • results: PowerPaint achieves state-of-the-art performance across inpainting benchmarks; the task prompt also works as a negative prompt for object removal, and prompt interpolation enables controllable shape-guided object inpainting.
    Abstract Achieving high-quality versatile image inpainting, where user-specified regions are filled with plausible content according to user intent, presents a significant challenge. Existing methods face difficulties in simultaneously addressing context-aware image inpainting and text-guided object inpainting due to the distinct optimal training strategies required. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in both tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model's focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Additionally, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting. Finally, we extensively evaluate PowerPaint on various inpainting benchmarks to demonstrate its superior performance for versatile image inpainting. We release our codes and models on our project page: https://powerpaint.github.io/.
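Prompt interpolation between two learned task-prompt embeddings can be sketched as a simple linear blend. The tensor shapes and the existence of these particular prompts are assumptions for illustration:

```python
# Toy linear interpolation between two learned task-prompt embeddings.
import torch

def interpolate_prompts(p_object: torch.Tensor, p_context: torch.Tensor, alpha: float):
    """p_*: (num_tokens, dim) learned prompt embeddings; alpha in [0, 1]."""
    return alpha * p_object + (1.0 - alpha) * p_context

p_obj, p_ctx = torch.randn(4, 768), torch.randn(4, 768)
blended = interpolate_prompts(p_obj, p_ctx, alpha=0.7)
print(blended.shape)   # torch.Size([4, 768])
```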

Language-Informed Visual Concept Learning

  • paper_url: http://arxiv.org/abs/2312.03587
  • repo_url: None
  • paper_authors: Sharon Lee, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu
  • for: Learning a language-informed visual concept representation so that images can be generated with novel compositions of visual concepts along different concept axes.
  • methods: Concept encoders are trained to encode information pertinent to a set of language-informed concept axes, with the objective of reproducing the input image through a pre-trained text-to-image (T2I) model; the concept embeddings are anchored to text embeddings from a pre-trained visual question answering (VQA) model to encourage disentanglement.
  • results: At inference time, concept embeddings extracted from new test images can be remixed to generate images with novel concept compositions, and a lightweight test-time finetuning procedure generalizes the model to concepts unseen at training.
    Abstract Our understanding of the visual world is centered around various concept axes, characterizing different aspects of visual entities. While different concept axes can be easily specified by language, e.g. color, the exact visual nuances along each axis often exceed the limitations of linguistic articulations, e.g. a particular style of painting. In this work, our goal is to learn a language-informed visual concept representation, by simply distilling large pre-trained vision-language models. Specifically, we train a set of concept encoders to encode the information pertinent to a set of language-informed concept axes, with an objective of reproducing the input image through a pre-trained Text-to-Image (T2I) model. To encourage better disentanglement of different concept encoders, we anchor the concept embeddings to a set of text embeddings obtained from a pre-trained Visual Question Answering (VQA) model. At inference time, the model extracts concept embeddings along various axes from new test images, which can be remixed to generate images with novel compositions of visual concepts. With a lightweight test-time finetuning procedure, it can also generalize to novel concepts unseen at training.

XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

  • paper_url: http://arxiv.org/abs/2312.03806
  • repo_url: None
  • paper_authors: Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, Francis Williams
  • for: A generative model for high-resolution sparse 3D voxel grids with arbitrary attributes.
  • methods: A hierarchical voxel latent diffusion model generates progressively higher-resolution grids in a coarse-to-fine manner, built on the highly efficient VDB data structure.
  • results: Generates millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion, with clear qualitative and quantitative improvements on objects and on large outdoor scenes of 100m$\times$100m with voxels as small as 10cm.
    Abstract We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. More results and details can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
    摘要 我们提出了$\mathcal{X}^3$(读作 XCube),一种新型的生成模型,用于生成带有任意属性的高分辨率稀疏三维体素网格。我们的模型能够以前馈方式生成数百万个体素,最细有效分辨率可达 $1024^3$,而无需耗时的测试时优化。为了实现这一点,我们采用了层次化的体素潜在扩散模型,该模型基于高效的 VDB 数据结构构建的自定义框架,以从粗到细的方式逐步生成更高分辨率的网格。除了生成高分辨率的物体,我们还展示了 XCube 在 100米×100米、体素尺寸小至 10cm 的大型户外场景上的有效性,并观察到相对以往方法明显的定性与定量提升。此外,除无条件生成外,我们还证明了 XCube 可以用于多种任务,如用户指导编辑、从单次扫描补全场景、以及文本到 3D。更多结果和细节请见 https://research.nvidia.com/labs/toronto-ai/xcube/。

Context Diffusion: In-Context Aware Image Generation

  • paper_url: http://arxiv.org/abs/2312.03584
  • repo_url: None
  • paper_authors: Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, Filip Radenovic
  • for: 本文提出了Context Diffusion,一种基于扩散的框架,帮助图像生成模型从视觉示例中学习。
  • methods: 本文使用了 diffusion-based 框架,将图像生成模型与文本提示和视觉示例结合在一起,以便学习图像生成。
  • results: 实验和用户研究表明,Context Diffusion 在域内和域外任务中都表现出色,比对应模型更高质量和准确性。
    Abstract We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. Recent work tackles such in-context learning for image generation, where a query image is provided alongside context examples and text prompts. However, the quality and fidelity of the generated images deteriorate when the prompt is not present, demonstrating that these models are unable to truly learn from the visual context. To address this, we propose a novel framework that separates the encoding of the visual context and preserving the structure of the query images. This results in the ability to learn from the visual context and text prompts, but also from either one of them. Furthermore, we enable our model to handle few-shot settings, to effectively address diverse in-context learning scenarios. Our experiments and user study demonstrate that Context Diffusion excels in both in-domain and out-of-domain tasks, resulting in an overall enhancement in image quality and fidelity compared to counterpart models.
    摘要 我们提出 Context Diffusion,一种基于扩散的框架,使图像生成模型能够从上下文中给出的视觉示例中学习。近期工作研究了图像生成中的这类上下文学习,即在给出查询图像的同时提供上下文示例和文本提示。然而,当缺少文本提示时,生成图像的质量和保真度会明显下降,说明这些模型并不能真正从视觉上下文中学习。为此,我们提出一种新的框架,将视觉上下文的编码与查询图像结构的保持分离开来,从而既能同时利用视觉上下文和文本提示,也能仅依赖其中任意一种进行学习。此外,我们的模型还支持少样本设置,能够有效应对多样的上下文学习场景。实验和用户研究表明,Context Diffusion 在域内和域外任务中均表现出色,相比对比模型在图像质量和保真度上均有整体提升。

DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization

  • paper_url: http://arxiv.org/abs/2312.03568
  • repo_url: None
  • paper_authors: Risab Biswas, Swalpa Kumar Roy, Ning Wang, Umapada Pal, Guang-Bin Huang
  • for: 提高文档图像二值化的效果,从而提升文档分析与识别任务的性能。
  • methods: 提出了一种基于视觉 Transformer 的新型二值化模型,包括一个两级视觉 Transformer 编码器和一个解码器,通过结合全局与局部特征表示来提高二值化精度。
  • results: 在多个 DIBCO 和 H-DIBCO 基准数据集上进行了广泛的实验,在四项指标上均优于现有的先进方法。
    Abstract In real life, various degradation scenarios exist that might damage document images, making it harder to recognize and analyze them, thus binarization is a fundamental and crucial step for achieving the most optimal performance in any document analysis task. We propose DocBinFormer (Document Binarization Transformer), a novel two-level vision transformer (TL-ViT) architecture based on vision transformers for effective document image binarization. The presented architecture employs a two-level transformer encoder to effectively capture both global and local feature representation from the input images. These complimentary bi-level features are exploited for efficient document image binarization, resulting in improved results for system-generated as well as handwritten document images in a comprehensive approach. With the absence of convolutional layers, the transformer encoder uses the pixel patches and sub-patches along with their positional information to operate directly on them, while the decoder generates a clean (binarized) output image from the latent representation of the patches. Instead of using a simple vision transformer block to extract information from the image patches, the proposed architecture uses two transformer blocks for greater coverage of the extracted feature space on a global and local scale. The encoded feature representation is used by the decoder block to generate the corresponding binarized output. Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that the proposed model outperforms state-of-the-art techniques on four metrics. The source code will be made available at https://github.com/RisabBiswas/DocBinFormer.
    摘要 实际生活中,文档图像可能会受到各种退化情况的影响,使其更难以识别和分析,因此文档图像二值化是文档分析任务中获得最佳性能的基础且关键的步骤。我们提出了 DocBinFormer(文档二值化 Transformer),这是一种基于视觉 Transformer 的新型两级视觉 Transformer(TL-ViT)架构,用于有效地进行文档图像二值化。所提出的架构使用两级 Transformer 编码器,以捕捉输入图像的全局和局部特征表示。这些互补的双层特征被用于高效地进行文档图像二值化,从而在系统生成的文档图像和手写文档图像上均取得更好的结果。由于不使用卷积层,Transformer 编码器直接基于像素块和子块及其位置信息进行操作,而解码器则从这些块的潜在表示生成干净(二值化)的输出图像。相比使用单个视觉 Transformer 块从图像块中提取信息,所提出的架构使用两个 Transformer 块,以在全局和局部尺度上更好地覆盖提取的特征空间。编码得到的特征表示由解码器模块用于生成相应的二值化输出。在多个 DIBCO 和 H-DIBCO 基准上的大量实验表明,所提出的模型在四项指标上超越了当前最先进的技术。源代码将公开,可通过 https://github.com/RisabBiswas/DocBinFormer 访问。
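The two-level encoding idea above can be illustrated with a rough PyTorch sketch; all dimensions, depths, and the fusion/decoding strategy are assumptions for illustration rather than the authors' exact architecture (positional encodings are also omitted).

```python
# One transformer branch sees coarse patches (global context), the other sees sub-patches
# (local detail); their features are fused to predict a binarized patch per location.
import torch
import torch.nn as nn

def patchify(x, p):
    """Split (B, C, H, W) into flattened p x p patches -> (B, N, C*p*p)."""
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)                       # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

class TwoLevelEncoder(nn.Module):
    def __init__(self, dim=256, coarse=16, fine=8, depth=4, heads=8):
        super().__init__()
        assert coarse == 2 * fine                               # 2x2 sub-patches per patch
        self.coarse, self.fine = coarse, fine
        self.embed_c, self.embed_f = nn.LazyLinear(dim), nn.LazyLinear(dim)
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        self.enc_c, self.enc_f = make(), make()
        self.head = nn.Linear(2 * dim, coarse * coarse)         # one binary patch per token

    def forward(self, img):                                     # (B, 1, H, W), H, W % coarse == 0
        B, _, H, W = img.shape
        gc = self.enc_c(self.embed_c(patchify(img, self.coarse)))   # global tokens
        gf = self.enc_f(self.embed_f(patchify(img, self.fine)))     # local tokens
        Hf, Wf = H // self.fine, W // self.fine
        gf = gf.reshape(B, Hf // 2, 2, Wf // 2, 2, -1).mean(dim=(2, 4))  # pool 2x2 sub-patches
        gf = gf.reshape(B, -1, gf.shape[-1])                    # back to the coarse grid order
        return self.head(torch.cat([gc, gf], dim=-1))           # (B, N, coarse*coarse) logits

model = TwoLevelEncoder()
print(model(torch.rand(1, 1, 128, 128)).shape)                  # torch.Size([1, 64, 256])
```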

SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios

  • paper_url: http://arxiv.org/abs/2312.03805
  • repo_url: None
  • paper_authors: Mushui Liu, Weijie He, Ziqian Lu, Yunlong Yu
  • for: 提高CLIP模型在开放词汇场景下的泛化能力
  • methods: 使用合成数据增强 CLIP 模型,将真实样本和合成样本视为两个不同的领域,分别优化各领域的提示以捕捉领域特有的信息,同时通过共享的视觉提示保持两个领域的语义一致性
  • results: 在三个模型泛化任务上表现很有竞争力,特别是在开放词汇场景下的 11 个数据集上,新类上的平均提升达 3.0%。
    Abstract Prompt learning is a powerful technique for transferring Vision-Language Models (VLMs) such as CLIP to downstream tasks. However, the prompt-based methods that are fine-tuned solely with base classes may struggle to generalize to novel classes in open-vocabulary scenarios, especially when data are limited. To address this issue, we propose an innovative approach called SYNC-CLIP that leverages SYNthetiC data for enhancing the generalization capability of CLIP. Based on the observation of the distribution shift between the real and synthetic samples, we treat real and synthetic samples as distinct domains and propose to optimize separate domain prompts to capture domain-specific information, along with the shared visual prompts to preserve the semantic consistency between two domains. By aligning the cross-domain features, the synthetic data from novel classes can provide implicit guidance to rebalance the decision boundaries. Experimental results on three model generalization tasks demonstrate that our method performs very competitively across various benchmarks. Notably, SYNC-CLIP outperforms the state-of-the-art competitor PromptSRC by an average improvement of 3.0% on novel classes across 11 datasets in open-vocabulary scenarios.
    摘要 提示学习(prompt learning)是一种将视觉语言模型(VLM,如 CLIP)迁移到下游任务的强大技术。然而,仅在基础类上微调的基于提示的方法,在开放词汇场景中可能难以泛化到新类,尤其是在数据有限时。为解决这个问题,我们提出了一种名为 SYNC-CLIP 的创新方法,利用合成数据来增强 CLIP 的泛化能力。基于对真实样本与合成样本之间分布偏移的观察,我们将真实样本和合成样本视为两个不同的领域,并提出分别优化各领域的提示以捕捉领域特有的信息,同时借助共享的视觉提示保持两个领域之间的语义一致性。通过对齐跨领域特征,来自新类的合成数据可以提供隐式的指导,以重新平衡决策边界。我们在三个模型泛化任务上进行了实验,结果显示我们的方法在多种基准上表现很有竞争力;特别是在开放词汇场景下的 11 个数据集上,SYNC-CLIP 在新类上平均超过了最先进的竞争方法 PromptSRC 3.0%。

Enhancing Kinship Verification through Multiscale Retinex and Combined Deep-Shallow features

  • paper_url: http://arxiv.org/abs/2312.03562
  • repo_url: None
  • paper_authors: El Ouanas Belabbaci, Mohammed Khammari, Ammar Chouchane, Mohcene Bessaoudi, Abdelmalik Ouamane, Yassine Himeur, Shadi Atalla, Wathiq Mansoor
  • for: Kinship verification from facial images, with applications in image annotation, forensic analysis, and social media research.
  • methods: Multiscale Retinex (MSR) preprocessing, deep and shallow texture descriptors (VGG16 and Local Phase Quantization (LPQ)), and Logistic Regression (LR) method.
  • results: Robust and effective method tested on three kinship datasets (Cornell Kin Face, UB Kin Face, and TS Kin Face) with improved image quality and accuracy.
    Abstract The challenge of kinship verification from facial images represents a cutting-edge and formidable frontier in the realms of pattern recognition and computer vision. This area of study holds a myriad of potential applications, spanning from image annotation and forensic analysis to social media research. Our research stands out by integrating a preprocessing method named Multiscale Retinex (MSR), which elevates image quality and amplifies contrast, ultimately bolstering the end results. Strategically, our methodology capitalizes on the harmonious blend of deep and shallow texture descriptors, merging them proficiently at the score level through the Logistic Regression (LR) method. To elucidate, we employ the Local Phase Quantization (LPQ) descriptor to extract shallow texture characteristics. For deep feature extraction, we turn to the prowess of the VGG16 model, which is pre-trained on a convolutional neural network (CNN). The robustness and efficacy of our method have been put to the test through meticulous experiments on three rigorous kinship datasets, namely: Cornell Kin Face, UB Kin Face, and TS Kin Face.
    摘要 从人脸图像进行亲属关系验证是模式识别和计算机视觉领域中一个前沿且具有挑战性的课题。该领域拥有众多潜在应用场景,从图像标注、司法取证分析到社交媒体研究。我们的研究的独特之处在于引入了一种名为多尺度 Retinex(MSR)的预处理方法,它能提升图像质量并增强对比度,从而改善最终结果。在策略上,我们充分利用深层与浅层纹理描述符的互补作用,通过逻辑回归(LR)方法在得分层面将二者有效融合。具体而言,我们使用局部相位量化(LPQ)描述符提取浅层纹理特征,而深层特征提取则借助在卷积神经网络(CNN)上预训练的 VGG16 模型。我们的方法在三个严格的亲属关系数据集(Cornell Kin Face、UB Kin Face 和 TS Kin Face)上进行了细致的实验,以验证其鲁棒性和有效性。
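The score-level fusion step described above can be sketched as follows; the MSR preprocessing and the LPQ/VGG16 feature extraction are assumed to happen elsewhere, and the toy features below are random placeholders.

```python
# Combine a "shallow" (LPQ-based) and a "deep" (VGG16-based) similarity score per face
# pair with logistic regression at the score level.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pair_scores(lpq_feats, cnn_feats, pairs):
    """Two similarity scores per (i, j) candidate kin pair."""
    return np.array([[cosine(lpq_feats[i], lpq_feats[j]),
                      cosine(cnn_feats[i], cnn_feats[j])] for i, j in pairs])

# toy stand-ins for features of 6 face images (real features would come from LPQ / VGG16)
rng = np.random.default_rng(0)
lpq_feats, cnn_feats = rng.normal(size=(6, 256)), rng.normal(size=(6, 4096))
train_pairs, train_labels = [(0, 1), (2, 3), (0, 4), (1, 5)], np.array([1, 1, 0, 0])

fusion = LogisticRegression()
fusion.fit(pair_scores(lpq_feats, cnn_feats, train_pairs), train_labels)
print(fusion.predict_proba(pair_scores(lpq_feats, cnn_feats, [(4, 5)]))[:, 1])  # kin probability
```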

When an Image is Worth 1,024 x 1,024 Words: A Case Study in Computational Pathology

  • paper_url: http://arxiv.org/abs/2312.03558
  • repo_url: None
  • paper_authors: Wenhui Wang, Shuming Ma, Hanwen Xu, Naoto Usuyama, Jiayu Ding, Hoifung Poon, Furu Wei
  • for: 用于计算病理学中的癌症诊断和预后预测,特别是针对十亿像素级的全切片图像(whole-slide images)。
  • methods: 将图像切分为数百万个图块并线性投影为嵌入,再使用 LongNet 处理极长序列,以捕捉短距离和长距离依赖关系。
  • results: 实验结果表明,LongViT 能有效编码十亿像素级图像,并在癌症亚型分类和生存预测方面超越了之前的最先进方法。
    Abstract This technical report presents LongViT, a vision Transformer that can process gigapixel images in an end-to-end manner. Specifically, we split the gigapixel image into a sequence of millions of patches and project them linearly into embeddings. LongNet is then employed to model the extremely long sequence, generating representations that capture both short-range and long-range dependencies. The linear computation complexity of LongNet, along with its distributed algorithm, enables us to overcome the constraints of both computation and memory. We apply LongViT in the field of computational pathology, aiming for cancer diagnosis and prognosis within gigapixel whole-slide images. Experimental results demonstrate that LongViT effectively encodes gigapixel images and outperforms previous state-of-the-art methods on cancer subtyping and survival prediction. Code and models will be available at https://aka.ms/LongViT.
    摘要 这份技术报告介绍了 LongViT,一种能够以端到端方式处理十亿像素级(gigapixel)图像的视觉 Transformer。具体来说,我们将十亿像素级图像切分为数百万个图块组成的序列,并将它们线性投影为嵌入。然后,我们使用 LongNet 对这一极长序列进行建模,生成能够同时捕捉短距离和长距离依赖关系的表示。LongNet 的线性计算复杂度及其分布式算法,使我们能够克服计算和内存两方面的限制。我们将 LongViT 应用于计算病理学领域,目标是在十亿像素级全切片图像上进行癌症诊断和预后预测。实验结果表明,LongViT 能有效编码十亿像素级图像,并在癌症亚型分类和生存预测方面超越了之前的最先进方法。代码和模型将可以在 https://aka.ms/LongViT 上获得。
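The patch-sequence construction described above can be sketched in a few lines; the LongNet/dilated-attention encoder that models the resulting million-token sequence is not reproduced here, and the patch size and embedding dimension are arbitrary choices.

```python
# Turn a (very large) slide image into a long token sequence by linear patch projection;
# the sequence would then be handed to a long-sequence encoder such as LongNet.
import torch
import torch.nn as nn

patch, dim = 32, 512
to_tokens = nn.Linear(3 * patch * patch, dim)          # linear patch embedding

def image_to_sequence(img):                            # img: (C, H, W), H and W multiples of `patch`
    C, H, W = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)
    return to_tokens(patches)                          # (N, dim), N = (H/p) * (W/p)

tokens = image_to_sequence(torch.rand(3, 1024, 1024))  # small stand-in for a gigapixel slide
print(tokens.shape)                                    # torch.Size([1024, 512])
```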

Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention

  • paper_url: http://arxiv.org/abs/2312.03556
  • repo_url: None
  • paper_authors: Jianjin Xu, Saman Motamed, Praneetha Vaddamanu, Chen Henry Wu, Christian Haene, Jean-Charles Bazin, Fernando de la Torre
  • for: 改进人脸修复(face inpainting)结果,并降低推理过程中的计算复杂度
  • methods: 将并行视觉注意力(PVA)与扩散模型相结合
  • results: 在人脸修复以及带语言引导的人脸修复任务中,实现了无与伦比的身份保持,并提供了有效的语言可控性;每个新身份仅需 40 步微调,比 Custom Diffusion 快 20 倍以上。
    Abstract Face inpainting is important in various applications, such as photo restoration, image editing, and virtual reality. Despite the significant advances in face generative models, ensuring that a person's unique facial identity is maintained during the inpainting process is still an elusive goal. Current state-of-the-art techniques, exemplified by MyStyle, necessitate resource-intensive fine-tuning and a substantial number of images for each new identity. Furthermore, existing methods often fall short in accommodating user-specified semantic attributes, such as beard or expression. To improve inpainting results, and reduce the computational complexity during inference, this paper proposes the use of Parallel Visual Attention (PVA) in conjunction with diffusion models. Specifically, we insert parallel attention matrices to each cross-attention module in the denoising network, which attends to features extracted from reference images by an identity encoder. We train the added attention modules and identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting. Experiments demonstrate that PVA attains unparalleled identity resemblance in both face inpainting and face inpainting with language guidance tasks, in comparison to various benchmarks, including MyStyle, Paint by Example, and Custom Diffusion. Our findings reveal that PVA ensures good identity preservation while offering effective language-controllability. Additionally, in contrast to Custom Diffusion, PVA requires just 40 fine-tuning steps for each new identity, which translates to a significant speed increase of over 20 times.
    摘要 人脸修复(face inpainting)在照片修复、图像编辑和虚拟现实等多种应用中具有重要意义。尽管人脸生成模型已经取得了显著进步,但在修复过程中保持个人独特的面部身份仍然是一个难以实现的目标。现有的最先进技术(如 MyStyle)需要资源消耗很大的微调,并且每个新身份都需要大量图像;此外,现有方法往往难以满足用户指定的语义属性,如胡须或表情。为了改进修复结果并降低推理过程中的计算复杂度,本文提议将并行视觉注意力(PVA)与扩散模型结合使用。具体来说,我们在去噪网络的每个交叉注意力模块中插入并行注意力矩阵,使其关注由身份编码器从参考图像中提取的特征。我们在为身份保持人脸修复提出的 CelebAHQ-IDI 数据集上训练所添加的注意力模块和身份编码器。实验表明,与 MyStyle、Paint by Example 和 Custom Diffusion 等多种基准相比,PVA 在人脸修复以及带语言引导的人脸修复任务中都达到了无与伦比的身份相似度。我们的结果显示,PVA 在保持良好身份特征的同时具备有效的语言可控性。此外,与 Custom Diffusion 相比,PVA 对每个新身份只需 40 步微调,速度提升超过 20 倍。
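A hedged sketch of the parallel-attention idea: alongside the usual text cross-attention in a denoising block, a second attention over identity-reference features is computed and added. Module names, dimensions, and the gating are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionWithPVA(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # frozen in practice
        self.id_attn = nn.MultiheadAttention(dim, heads, batch_first=True)    # added, trainable branch
        self.gate = nn.Parameter(torch.zeros(1))                              # start as a no-op

    def forward(self, x, text_tokens, id_tokens):
        # x: (B, N, dim) latent image tokens; text/id tokens: (B, T, dim) / (B, R, dim)
        out_text, _ = self.text_attn(x, text_tokens, text_tokens)
        out_id, _ = self.id_attn(x, id_tokens, id_tokens)
        return x + out_text + self.gate * out_id       # parallel branches summed into the block

block = CrossAttentionWithPVA()
y = block(torch.rand(2, 64, 320), torch.rand(2, 77, 320), torch.rand(2, 16, 320))
print(y.shape)   # torch.Size([2, 64, 320])
```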

How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection

  • paper_url: http://arxiv.org/abs/2312.03804
  • repo_url: None
  • paper_authors: Felix Meissen, Johannes Getzner, Alexander Ziller, Georgios Kaissis, Daniel Rueckert
  • for: 本文旨在研究无监督异常检测,以避免大量的标注工作。
  • methods: 本文提出了三种从大规模分布内样本中挑选原型样本的方法,并证明了仅用这些样本即可训练异常检测模型。
  • results: 实验结果表明,只使用分布内数据中极少量的原型样本(通常仅 10 个)即可达到很高的异常检测性能,并且在一些情况下甚至超越使用全量训练数据的性能。
    Abstract Unsupervised anomaly detection (UAD) alleviates large labeling efforts by training exclusively on unlabeled in-distribution data and detecting outliers as anomalies. Generally, the assumption prevails that large training datasets allow the training of higher-performing UAD models. However, in this work, we show that using only very few training samples can already match - and in some cases even improve - anomaly detection compared to training with the whole training dataset. We propose three methods to identify prototypical samples from a large dataset of in-distribution samples. We demonstrate that by training with a subset of just ten such samples, we achieve an area under the receiver operating characteristics curve (AUROC) of $96.37 \%$ on CIFAR10, $92.59 \%$ on CIFAR100, $95.37 \%$ on MNIST, $95.38 \%$ on Fashion-MNIST, $96.37 \%$ on MVTec-AD, $98.81 \%$ on BraTS, and $81.95 \%$ on RSNA pneumonia detection, even exceeding the performance of full training in $25/67$ classes we tested. Additionally, we show that the prototypical in-distribution samples identified by our proposed methods translate well to different models and other datasets and that using their characteristics as guidance allows for successful manual selection of small subsets of high-performing samples. Our code is available at https://anonymous.4open.science/r/uad_prototypical_samples/
    摘要 无监督异常检测(UAD)只在无标注的分布内数据上训练,并将离群样本检测为异常,从而大大减少了标注工作。通常认为,更大的训练数据集可以训练出性能更高的 UAD 模型。然而,在这项工作中,我们表明仅使用极少的训练样本即可与使用全部训练数据训练的模型持平,在某些情况下甚至更好。我们提出了三种从大规模分布内样本中识别原型样本的方法。我们展示了仅用 10 个这样的样本进行训练,即可在 CIFAR10 上达到 96.37% 的 AUROC,在 CIFAR100 上达到 92.59%,在 MNIST 上达到 95.37%,在 Fashion-MNIST 上达到 95.38%,在 MVTec-AD 上达到 96.37%,在 BraTS 上达到 98.81%,在 RSNA 肺炎检测上达到 81.95%,并且在我们测试的 67 个类别中有 25 个超过了全量训练的性能。此外,我们还证明了由所提方法识别出的原型分布内样本可以很好地迁移到不同的模型和其他数据集,并且以其特征为指导可以成功地手动挑选出小规模的高性能样本子集。我们的代码可以在 https://anonymous.4open.science/r/uad_prototypical_samples/ 获得。
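The paper's three selection methods are not spelled out in the summary above, so the following is only one plausible heuristic, offered as an assumption: call a sample "prototypical" if its embedding is closest to the in-distribution feature mean, and fit the anomaly detector on just those k samples.

```python
import numpy as np

def select_prototypes(features, k=10):
    """features: (N, D) embeddings of unlabeled in-distribution samples."""
    center = features.mean(axis=0)
    dists = np.linalg.norm(features - center, axis=1)
    return np.argsort(dists)[:k]                 # indices of the k most central samples

feats = np.random.default_rng(0).normal(size=(5000, 512))   # placeholder embeddings
proto_idx = select_prototypes(feats, k=10)
print(proto_idx)   # the ten samples a UAD model would then be fit on
```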

Texture-Semantic Collaboration Network for ORSI Salient Object Detection

  • paper_url: http://arxiv.org/abs/2312.03548
  • repo_url: https://github.com/mathlee/tscnet
  • paper_authors: Gongyang Li, Zhen Bai, Zhi Liu
  • for: 本研究旨在提高光学遥感图像(ORSI)中显著性目标检测的精度。
  • methods: 该方法基于通用的编码器-解码器结构,并包含一个关键的纹理-语义协作模块(TSCM),对编码器提取的基础特征进行特征调制与交互。
  • results: 在三个数据集上的广泛实验表明,该方法与 14 种最先进方法相比具有竞争力,并能够处理多种场景。
    Abstract Salient object detection (SOD) in optical remote sensing images (ORSIs) has become increasingly popular recently. Due to the characteristics of ORSIs, ORSI-SOD is full of challenges, such as multiple objects, small objects, low illuminations, and irregular shapes. To address these challenges, we propose a concise yet effective Texture-Semantic Collaboration Network (TSCNet) to explore the collaboration of texture cues and semantic cues for ORSI-SOD. Specifically, TSCNet is based on the generic encoder-decoder structure. In addition to the encoder and decoder, TSCNet includes a vital Texture-Semantic Collaboration Module (TSCM), which performs valuable feature modulation and interaction on basic features extracted from the encoder. The main idea of our TSCM is to make full use of the texture features at the lowest level and the semantic features at the highest level to achieve the expression enhancement of salient regions on features. In the TSCM, we first enhance the position of potential salient regions using semantic features. Then, we render and restore the object details using the texture features. Meanwhile, we also perceive regions of various scales, and construct interactions between different regions. Thanks to the perfect combination of TSCM and generic structure, our TSCNet can take care of both the position and details of salient objects, effectively handling various scenes. Extensive experiments on three datasets demonstrate that our TSCNet achieves competitive performance compared to 14 state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/TSCNet.
    摘要 光学遥感图像(ORSI)中的显著性目标检测(SOD)近来越来越受关注。由于 ORSI 的特点,ORSI-SOD 充满挑战,如多目标、小目标、低照度和不规则形状等。为解决这些挑战,我们提出了一种简洁而有效的纹理-语义协作网络(TSCNet),以探索纹理线索与语义线索的协作。具体来说,TSCNet 基于通用的编码器-解码器结构。除编码器和解码器外,TSCNet 还包含一个关键的纹理-语义协作模块(TSCM),对编码器提取的基础特征进行有价值的特征调制与交互。TSCM 的主要思想是充分利用最低层的纹理特征和最高层的语义特征,实现特征上显著区域的表达增强。在 TSCM 中,我们首先利用语义特征强化潜在显著区域的位置,然后利用纹理特征渲染并恢复目标细节;同时,我们还感知不同尺度的区域,并在不同区域之间建立交互。得益于 TSCM 与通用结构的良好结合,TSCNet 能够兼顾显著目标的位置与细节,有效应对各种场景。在三个数据集上的大量实验表明,与 14 种最先进方法相比,TSCNet 取得了具有竞争力的性能。我们的代码和结果可以在 https://github.com/MathLee/TSCNet 找到。

FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation

  • paper_url: http://arxiv.org/abs/2312.03540
  • repo_url: None
  • paper_authors: Olivia Markham, Yuhao Chen, Chi-en Amy Tai, Alexander Wong
  • for: 用于生成真实的食物图像,以便在图像识别中进行训练。
  • methods: 使用Latent Diffusion Models(LDMs),并利用大量的开源食物数据集,以生成基于文本描述的真实食物图像。
  • results: 对比公共可用的图像生成模型, FoodFusion模型能够生成出更真实和多样的食物图像。
    Abstract Current state-of-the-art image generation models such as Latent Diffusion Models (LDMs) have demonstrated the capacity to produce visually striking food-related images. However, these generated images often exhibit an artistic or surreal quality that diverges from the authenticity of real-world food representations. This inadequacy renders them impractical for applications requiring realistic food imagery, such as training models for image-based dietary assessment. To address these limitations, we introduce FoodFusion, a Latent Diffusion model engineered specifically for the faithful synthesis of realistic food images from textual descriptions. The development of the FoodFusion model involves harnessing an extensive array of open-source food datasets, resulting in over 300,000 curated image-caption pairs. Additionally, we propose and employ two distinct data cleaning methodologies to ensure that the resulting image-text pairs maintain both realism and accuracy. The FoodFusion model, thus trained, demonstrates a remarkable ability to generate food images that exhibit a significant improvement in terms of both realism and diversity over the publicly available image generation models. We openly share the dataset and fine-tuned models to support advancements in this critical field of food image synthesis at https://bit.ly/genai4good.
    摘要 当前最先进的图像生成模型,如潜在扩散模型(LDM),已经表现出可以生成视觉上极具吸引力的食物相关图像。然而,这些生成的图像常常具有艺术化或超现实的特点,与真实食物图像的真实性相差很大。这种不足使得它们在需要真实食物图像的应用中不够实用,例如训练基于图像的饮食评估模型。为解决这些限制,我们介绍了 FoodFusion,一种专为根据文本描述忠实合成真实食物图像而设计的潜在扩散模型。FoodFusion 模型的开发利用了大量的开源食物数据集,共计超过 300,000 个精选的图像-描述对。此外,我们还提出并采用了两种不同的数据清洗方法,以确保生成的图像-文本对同时保持真实性和准确性。经过训练的 FoodFusion 模型在真实性和多样性方面都显著优于公开可用的图像生成模型。我们公开分享数据集和微调后的模型,以支持食物图像合成这一关键领域的进步,详见 https://bit.ly/genai4good。

Low-shot Object Learning with Mutual Exclusivity Bias

  • paper_url: http://arxiv.org/abs/2312.03533
  • repo_url: https://github.com/rehg-lab/lsme
  • paper_authors: Anh Thai, Ahmad Humayun, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg
  • for: 本研究旨在解决低样本对象学习问题,提出了带互斥偏置的低样本对象学习(LSME)这一计算框架。
  • methods: 本研究提供了一个新的数据集与数据生成管道、完整的基线以及最先进的方法,以便机器学习社区研究这一具有挑战性的学习任务。
  • results: 本研究对导致该任务困难的因素进行了深入分析,并评估了多种基线(包括最先进的基础模型);此外,还提出了一种在低样本准确率上优于当前最先进模型的基线方法。
    Abstract This paper introduces Low-shot Object Learning with Mutual Exclusivity Bias (LSME), the first computational framing of mutual exclusivity bias, a phenomenon commonly observed in infants during word learning. We provide a novel dataset, comprehensive baselines, and a state-of-the-art method to enable the ML community to tackle this challenging learning task. The goal of LSME is to analyze an RGB image of a scene containing multiple objects and correctly associate a previously-unknown object instance with a provided category label. This association is then used to perform low-shot learning to test category generalization. We provide a data generation pipeline for the LSME problem and conduct a thorough analysis of the factors that contribute to its difficulty. Additionally, we evaluate the performance of multiple baselines, including state-of-the-art foundation models. Finally, we present a baseline approach that outperforms state-of-the-art models in terms of low-shot accuracy.
    摘要 本文提出了带互斥偏置的低样本对象学习(LSME),这是对互斥偏置(幼儿在词汇学习中常见的一种现象)的首个计算建模。我们提供了一个新的数据集、完整的基线以及一种最先进的方法,以便机器学习社区研究这一具有挑战性的学习任务。LSME 的目标是分析一张包含多个物体的 RGB 场景图像,并将一个之前未知的物体实例与给定的类别标签正确关联;随后利用这一关联进行低样本学习,以测试类别泛化能力。我们为 LSME 问题提供了数据生成管道,并对造成其困难的因素进行了深入分析。此外,我们评估了多种基线(包括最先进的基础模型)的性能。最后,我们提出了一种在低样本准确率上优于最先进模型的基线方法。

Single Image Reflection Removal with Reflection Intensity Prior Knowledge

  • paper_url: http://arxiv.org/abs/2312.03798
  • repo_url: None
  • paper_authors: Dongshen Han, Seungkyu Lee, Chaoning Zhang, Heechan Yoon, Hyukmin Kwon, HyunCheol Kim, HyonGon Choo
  • for: 本研究旨在解决真实世界图像中的单图像反射去除(SIRR)问题;由于光线在玻璃表面透射与反射时会产生多种图像退化,该问题颇具挑战,而许多现有方法依赖特定的先验假设。
  • methods: 我们提出了一种通用的反射强度先验,用于刻画反射现象的强度,并引入反射先验提取网络(RPEN),通过将图像划分为区域块来学习非均匀的反射先验。我们还提出了基于先验的反射去除网络(PRRN),采用简单的 Transformer U-Net 架构,以适应 RPEN 提供的反射先验。
  • results: 我们在真实世界基准上进行了实验,结果表明我们的方法在 SIRR 任务上达到了最先进的精度。
    Abstract Single Image Reflection Removal (SIRR) in real-world images is a challenging task due to diverse image degradations occurring on the glass surface during light transmission and reflection. Many existing methods rely on specific prior assumptions to resolve the problem. In this paper, we propose a general reflection intensity prior that captures the intensity of the reflection phenomenon and demonstrate its effectiveness. To learn the reflection intensity prior, we introduce the Reflection Prior Extraction Network (RPEN). By segmenting images into regional patches, RPEN learns non-uniform reflection prior in an image. We propose Prior-based Reflection Removal Network (PRRN) using a simple transformer U-Net architecture that adapts reflection prior fed from RPEN. Experimental results on real-world benchmarks demonstrate the effectiveness of our approach achieving state-of-the-art accuracy in SIRR.
    摘要 Single Image Reflection Removal (SIRR) in real-world images is a challenging task due to diverse image degradations occurring on the glass surface during light transmission and reflection. Many existing methods rely on specific prior assumptions to resolve the problem. In this paper, we propose a general reflection intensity prior that captures the intensity of the reflection phenomenon and demonstrate its effectiveness. To learn the reflection intensity prior, we introduce the Reflection Prior Extraction Network (RPEN). By segmenting images into regional patches, RPEN learns non-uniform reflection prior in an image. We propose Prior-based Reflection Removal Network (PRRN) using a simple transformer U-Net architecture that adapts reflection prior fed from RPEN. Experimental results on real-world benchmarks demonstrate the effectiveness of our approach achieving state-of-the-art accuracy in SIRR.Here's the text in Traditional Chinese:Single Image Reflection Removal (SIRR) in real-world images is a challenging task due to diverse image degradations occurring on the glass surface during light transmission and reflection. Many existing methods rely on specific prior assumptions to resolve the problem. In this paper, we propose a general reflection intensity prior that captures the intensity of the reflection phenomenon and demonstrate its effectiveness. To learn the reflection intensity prior, we introduce the Reflection Prior Extraction Network (RPEN). By segmenting images into regional patches, RPEN learns non-uniform reflection prior in an image. We propose Prior-based Reflection Removal Network (PRRN) using a simple transformer U-Net architecture that adapts reflection prior fed from RPEN. Experimental results on real-world benchmarks demonstrate the effectiveness of our approach achieving state-of-the-art accuracy in SIRR.

Personalized Pose Forecasting

  • paper_url: http://arxiv.org/abs/2312.03528
  • repo_url: https://github.com/chahuja/trontr
  • paper_authors: Maria Priisalu, Ted Kronvall, Cristian Sminchisescu
  • for: 预测人体动作,即基于过去人体动作的未来动作预测。
  • methods: 使用个性化时间序列分析模型,将神经网络pose预测个性化。
  • results: 可以高效地在线进行个性化动作预测,使用低参数时间序列分析模型进行个性化神经网络pose预测。
    Abstract Human pose forecasting is the task of predicting articulated human motion given past human motion. There exists a number of popular benchmarks that evaluate an array of different models performing human pose forecasting. These benchmarks do not reflect that a human interacting system, such as a delivery robot, observes and plans for the motion of the same individual over an extended period of time. Every individual has unique and distinct movement patterns. This is however not reflected in existing benchmarks that evaluate a model's ability to predict an average human's motion rather than a particular individual's. We reformulate the human motion forecasting problem and present a model-agnostic personalization method. Motion forecasting personalization can be performed efficiently online by utilizing a low-parametric time-series analysis model that personalizes neural network pose predictions.
    摘要 人体姿态预测是指根据过去的人体动作预测未来人体动作的任务。现有许多流行的基准用于评估各种人体姿态预测模型。然而,这些基准没有反映这样一个事实:与人交互的系统(例如配送机器人)会在较长的时间内持续观察并规划同一个人的动作。每个个体都有独特而鲜明的运动模式,而现有基准评估的是模型预测“平均人”动作的能力,而非特定个体的动作,因而没有体现这一点。我们重新表述了人体动作预测问题,并提出了一种与模型无关的个性化方法。借助低参数的时间序列分析模型对神经网络的姿态预测进行个性化,动作预测的个性化可以高效地在线完成。
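One way to picture the personalization idea (an assumption for illustration, not the paper's exact model) is to keep a generic forecaster fixed and fit a tiny per-person ridge-regression correction online that predicts the forecaster's residual from the recent pose history.

```python
import numpy as np

class PersonalizedCorrector:
    def __init__(self, pose_dim, history=5, lam=1e-2):
        self.history, self.lam = history, lam
        self.W = np.zeros((history * pose_dim, pose_dim))

    def fit(self, past_poses, generic_preds, true_poses):
        # past_poses: (T, history, pose_dim) windows; generic_preds/true_poses: (T, pose_dim)
        X = past_poses.reshape(len(past_poses), -1)
        R = true_poses - generic_preds                       # residuals of the generic model
        A = X.T @ X + self.lam * np.eye(X.shape[1])          # ridge fit, cheap enough for online use
        self.W = np.linalg.solve(A, X.T @ R)

    def correct(self, past_window, generic_pred):
        return generic_pred + past_window.reshape(-1) @ self.W

rng = np.random.default_rng(0)
corr = PersonalizedCorrector(pose_dim=34)                    # e.g. 17 joints x 2D
corr.fit(rng.normal(size=(200, 5, 34)), rng.normal(size=(200, 34)), rng.normal(size=(200, 34)))
print(corr.correct(rng.normal(size=(5, 34)), rng.normal(size=34)).shape)   # (34,)
```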

AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

  • paper_url: http://arxiv.org/abs/2312.03795
  • repo_url: None
  • paper_authors: Xinzhou Wang, Yikai Wang, Junliang Ye, Zhengyi Wang, Fuchun Sun, Pengkun Liu, Ling Wang, Kai Sun, Xintong Wang, Bin He
  • for: 本研究旨在提高文本导向3D模型生成的灵活性和可重构性,特别是对于动态物体的3D模型生成和重建。
  • methods: 该研究提出了一种基于文本的4D生成框架,称为AnimatableDreamer,可以生成不同类别的非刚性物体,并遵循从单目视频中提取的物体运动。AnimatableDreamer的核心技术是规范分数蒸馏(Canonical Score Distillation,CSD),它通过在随时间变化的相机空间中对不同帧去噪,同时在每个视频共享的规范空间中进行蒸馏,将生成维度从4D简化到3D,并保持时间一致性和形态合理性。
  • results: 实验表明,AnimatableDreamer可以从单目视频生成高灵活性的文本引导3D模型,并且在非刚性重建任务上优于典型的非刚性重建方法。同时,借助多视图一致扩散模型提供的归纳知识,CSD对新视角下的重建进行正则化,从而循环增强生成过程。
    Abstract Text-to-3D model adaptations have advanced static 3D model quality, but sequential 3D model generation, particularly for animatable objects with large motions, is still scarce. Our work proposes AnimatableDreamer, a text-to-4D generation framework capable of generating diverse categories of non-rigid objects while adhering to the object motions extracted from a monocular video. At its core, AnimatableDreamer is equipped with our novel optimization design dubbed Canonical Score Distillation (CSD), which simplifies the generation dimension from 4D to 3D by denoising over different frames in the time-varying camera spaces while conducting the distillation process in a unique canonical space shared per video. Concretely, CSD ensures that score gradients back-propagate to the canonical space through differentiable warping, hence guaranteeing the time-consistent generation and maintaining morphological plausibility across different poses. By lifting the 3D generator to 4D with warping functions, AnimatableDreamer offers a novel perspective on non-rigid 3D model generation and reconstruction. Besides, with inductive knowledge from a multi-view consistent diffusion model, CSD regularizes reconstruction from novel views, thus cyclically enhancing the generation process. Extensive experiments demonstrate the capability of our method in generating high-flexibility text-guided 3D models from the monocular video, while also showing improved reconstruction performance over typical non-rigid reconstruction methods. Project page https://AnimatableDreamer.github.io.
    摘要 文本到3D的模型适配技术已经提升了静态3D模型的质量,但顺序的3D模型生成,特别是针对具有大幅运动的可动画物体,仍然很少见。我们的工作提出了AnimatableDreamer,一个文本到4D生成框架,能够生成多种类别的非刚性物体,同时遵循从单目视频中提取的物体运动。AnimatableDreamer的核心是我们新颖的优化设计——规范分数蒸馏(CSD):它通过在随时间变化的相机空间中对不同帧去噪,并在每个视频共享的唯一规范空间中进行蒸馏,将生成维度从4D简化到3D。具体而言,CSD通过可微的变形(warping)使分数梯度反向传播到规范空间,从而保证时间一致的生成,并在不同姿态下保持形态合理性。通过借助变形函数将3D生成器提升到4D,AnimatableDreamer为非刚性3D模型的生成与重建提供了一个新的视角。此外,借助多视图一致扩散模型提供的归纳知识,CSD对新视角下的重建进行正则化,从而循环增强生成过程。大量实验表明,我们的方法可以从单目视频生成高灵活度的文本引导3D模型,同时相比典型的非刚性重建方法也提升了重建性能。项目主页:https://AnimatableDreamer.github.io。

Kandinsky 3.0 Technical Report

  • paper_url: http://arxiv.org/abs/2312.03511
  • repo_url: https://github.com/ai-forever/movqgan
  • paper_authors: Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov
  • for: 这篇论文旨在推动大规模文本到图像生成技术的进步,延续先前的 Kandinsky 模型系列,在图像生成的质量和真实感方面取得进展。
  • methods: 该模型基于潜在扩散(latent diffusion),其 U-Net 骨干网络比前代大两倍,文本编码器大十倍,并移除了 diffusion mapping。
  • results: 作者通过大量实验和训练技巧的调整,确定了对提升模型质量影响最大的关键组件;在并排比较中,Kandinsky 3.0 在文本理解和特定领域上表现出明显的提升。报告还介绍了模型架构、数据收集流程、训练技术以及用户交互的生产系统。
    Abstract We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. Compared to previous versions of Kandinsky 2.x, Kandinsky 3.0 leverages a two times larger U-Net backbone, a ten times larger text encoder and removes diffusion mapping. We describe the architecture of the model, the data collection procedure, the training technique, and the production system of user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. By our side-by-side comparisons, Kandinsky becomes better in text understanding and works better on specific domains. Project page: https://ai-forever.github.io/Kandinsky-3
    摘要 我们提出 Kandinsky 3.0,一种基于潜在扩散的大规模文本到图像生成模型,延续了文本到图像 Kandinsky 模型系列,体现了我们在提升图像生成质量与真实感方面的进展。与前代 Kandinsky 2.x 相比,Kandinsky 3.0 使用了大两倍的 U-Net 骨干网络和大十倍的文本编码器,并移除了 diffusion mapping。我们描述了模型的架构、数据收集过程、训练技术以及用户交互的生产系统。我们重点介绍了经过大量实验确认的、对提升模型质量影响最大的关键组件。通过并排比较,Kandinsky 在文本理解方面变得更好,并在特定领域上表现更佳。项目页面:https://ai-forever.github.io/Kandinsky-3

Gravitational cell detection and tracking in fluorescence microscopy data

  • paper_url: http://arxiv.org/abs/2312.03509
  • repo_url: None
  • paper_authors: Nikomidisz Eftimiu, Michal Kozubek
  • for: 这篇论文探讨计算机视觉技术在生物医学研究和临床实践中的应用,特别是在显微图像中自动检测和跟踪细胞的问题。
  • methods: 该方法基于引力力场,包括检测、分割和跟踪三个组成部分,可与现代机器学习模型竞争,并有可能在荧光显微图像上表现更好。
  • results: 该方法在 Cell Tracking Challenge 数据集上取得了良好的结果。
    Abstract Automatic detection and tracking of cells in microscopy images are major applications of computer vision technologies in both biomedical research and clinical practice. Though machine learning methods are increasingly common in these fields, classical algorithms still offer significant advantages for both tasks, including better explainability, faster computation, lower hardware requirements and more consistent performance. In this paper, we present a novel approach based on gravitational force fields that can compete with, and potentially outperform modern machine learning models when applied to fluorescence microscopy images. This method includes detection, segmentation, and tracking elements, with the results demonstrated on a Cell Tracking Challenge dataset.
    摘要 在显微图像中自动检测和跟踪细胞是计算机视觉技术在生物医学研究和临床实践中的主要应用之一。尽管机器学习方法在这些领域日益普及,但经典算法在这两类任务中仍具有显著优势,包括更好的可解释性、更快的计算速度、更低的硬件需求和更一致的性能。在这篇论文中,我们提出了一种基于引力力场的新方法,在应用于荧光显微图像时可以与现代机器学习模型相竞争,甚至可能超越它们。这种方法包括检测、分割和跟踪的组成部分,其结果在 Cell Tracking Challenge 数据集上进行了展示。
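The report above does not give the exact formulation, so the following is only a loose illustration of the gravitational intuition: let bright (fluorescent) pixels act as masses, smooth them into an attraction field, and take local maxima of that field as candidate cell centres.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def detect_cells(image, sigma=4.0, min_rel_mass=0.3):
    mass = gaussian_filter(image.astype(float), sigma)          # aggregated attraction field
    peaks = (mass == maximum_filter(mass, size=int(4 * sigma))) # local maxima of the field
    peaks &= mass > min_rel_mass * mass.max()                   # drop weak background maxima
    return np.argwhere(peaks)                                   # (K, 2) candidate cell centres

# toy frame with two bright blobs standing in for fluorescent cells
frame = np.zeros((128, 128))
frame[30:36, 40:46] = 1.0
frame[90:96, 100:106] = 0.8
print(detect_cells(frame))
```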

Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

  • paper_url: http://arxiv.org/abs/2312.03502
  • repo_url: https://github.com/zhang-haojie/wesam
  • paper_authors: Haojie Zhang, Yongyi Su, Xun Xu, Kui Jia
  • for: 本研究旨在提高 Segment-Anything(SAM)模型对目标分布的适应能力,使其在强烈的分布偏移下也能表现出色。
  • methods: 我们提出了一种基于锚点正则化和低秩微调的弱监督自训练方法,以提高适应的鲁棒性和计算效率。
  • results: 我们在 5 类下游分割任务上验证了方法的有效性,在大多数任务上都超越了预训练的 SAM 和领域自适应方法。
    Abstract The success of large language models has inspired the computer vision community to explore image segmentation foundation model that is able to zero/few-shot generalize through prompt engineering. Segment-Anything(SAM), among others, is the state-of-the-art image segmentation foundation model demonstrating strong zero/few-shot generalization. Despite the success, recent studies reveal the weakness of SAM under strong distribution shift. In particular, SAM performs awkwardly on corrupted natural images, camouflaged images, medical images, etc. Motivated by the observations, we aim to develop a self-training based strategy to adapt SAM to target distribution. Given the unique challenges of large source dataset, high computation cost and incorrect pseudo label, we propose a weakly supervised self-training architecture with anchor regularization and low-rank finetuning to improve the robustness and computation efficiency of adaptation. We validate the effectiveness on 5 types of downstream segmentation tasks including natural clean/corrupted images, medical images, camouflaged images and robotic images. Our proposed method is task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art domain adaptation methods on almost all downstream tasks with the same testing prompt inputs.
    摘要 大型语言模型的成功激励计算机视觉社区探索能够通过提示工程实现零/少样本泛化的图像分割基础模型。Segment-Anything(SAM)等是当前最先进的图像分割基础模型,展现出强大的零/少样本泛化能力。然而,最新的研究表明 SAM 在强烈的分布偏移下表现不佳,尤其是在受损的自然图像、伪装图像、医学图像等上表现欠佳。受这些观察的启发,我们旨在开发一种基于自训练的策略,使 SAM 适应目标分布。针对源数据集庞大、计算成本高以及伪标签不准确等独特挑战,我们提出了一种带锚点正则化和低秩微调的弱监督自训练架构,以提高适应的鲁棒性和计算效率。我们在 5 类下游分割任务上验证了方法的有效性,包括自然的清晰/受损图像、医学图像、伪装图像和机器人图像。我们的方法本质上与任务无关,在使用相同测试提示输入的情况下,几乎在所有下游任务上都超越了预训练的 SAM 和最先进的领域自适应方法。
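The two ingredients named above can be sketched under assumed shapes: a LoRA-style low-rank update on a frozen linear layer, and an anchor regularizer that keeps the adapted model's mask logits close to a frozen copy while self-training on pseudo-labels. This is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen pretrained weight
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)   # W x + B A x

def adaptation_loss(student_logits, anchor_logits, pseudo_mask, lam=1.0):
    """Self-training on pseudo-labels plus an anchor term toward the frozen model."""
    self_train = F.binary_cross_entropy_with_logits(student_logits, pseudo_mask)
    anchor_reg = F.mse_loss(student_logits, anchor_logits.detach())
    return self_train + lam * anchor_reg

layer = LowRankLinear(nn.Linear(256, 256))
print(layer(torch.rand(8, 256)).shape)                    # torch.Size([8, 256])
```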

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

  • paper_url: http://arxiv.org/abs/2312.03793
  • repo_url: https://github.com/vvictoryuki/animatezero
  • paper_authors: Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang
  • for: 本文旨在从文本描述生成高质量视频,并对生成视频的外观和运动提供精确控制。
  • methods: 本文针对文本到视频扩散模型 AnimateDiff 提出了两种改进:将视频解耦为外观与运动,并用位置校正的窗口注意力(positional-corrected window attention)替代全局时间注意力,以保证其他帧与首帧良好对齐。
  • results: 所提出的 AnimateZero 方法无需进一步训练即可控制生成过程,并支持交互式视频生成、真实图像动画等多种新应用,详见实验部分。
    Abstract Large-scale text-to-video (T2V) diffusion models have great progress in recent years in terms of visual quality, motion and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance, motion) are learned and generated jointly without precise control ability other than rough text descriptions. Inspired by image animation which decouples the video as one specific appearance with the corresponding motion, we propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff, and provide more precise appearance and motion control abilities for it. For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation for ensuring the generated first frame is equal to the given generated image. For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention to ensure other frames align with the first frame well. Empowered by the proposed methods, AnimateZero can successfully control the generating progress without further training. As a zero-shot image animator for given images, AnimateZero also enables multiple new applications, including interactive video generation and real image animation. The detailed experiments demonstrate the effectiveness of the proposed method in both T2V and related applications.
    摘要 大规模文本到视频(T2V)扩散模型近年来在视觉质量、运动和时间一致性方面取得了很大进步。然而,生成过程仍然是一个黑盒,其中所有属性(如外观、运动)都是联合学习与生成的,除了粗略的文本描述之外缺乏精确的控制能力。受图像动画将视频解耦为某一特定外观及其对应运动的启发,我们提出了 AnimateZero,用于剖析预训练的文本到视频扩散模型 AnimateDiff,并为其提供更精确的外观和运动控制能力。在外观控制方面,我们借用文本到图像(T2I)生成中的中间隐变量及其特征,确保生成的首帧与给定的生成图像一致。在时间控制方面,我们用所提出的位置校正窗口注意力替换原始 T2V 模型的全局时间注意力,确保其他帧与首帧良好对齐。借助上述方法,AnimateZero 无需进一步训练即可成功控制生成过程。作为针对给定图像的零样本图像动画器,AnimateZero 还支持多种新应用,包括交互式视频生成和真实图像动画。详细的实验证明了所提方法在 T2V 及相关应用中的有效性。

PneumoLLM: Harnessing the Power of Large Language Model for Pneumoconiosis Diagnosis

  • paper_url: http://arxiv.org/abs/2312.03490
  • repo_url: https://github.com/codemonsterphd/pneumollm
  • paper_authors: Meiyue Song, Zhihua Yu, Jiaxin Wang, Jiarui Wang, Yuting Lu, Baicun Li, Xiaoxu Wang, Qinghua Huang, Zhijun Li, Nikolaos I. Kanellakis, Jiangfeng Liu, Jing Wang, Binglu Wang, Juntao Yang
  • for: 这篇论文旨在利用大语言模型(LLM)诊断尘肺病等数据稀缺的职业病,并提出了一种新的方法来实现这一目标。
  • methods: 本研究去除文本分支并将对话头替换为分类头,从而以更少的可学习参数利用 LLM;此外,还引入了上下文多 token 引擎(contextual multi-token engine)来自适应地生成诊断 token,以及信息发射模块(information emitter module)将信息从图像 token 单向传递到诊断 token。
  • results: 实验结果表明,所提方法及各模块均优于对比方法,能更有效地利用 LLM。代码可在 https://github.com/CodeMonsterPHD/PneumoLLM/tree/main 获取。
    Abstract The conventional pretraining-and-finetuning paradigm, while effective for common diseases with ample data, faces challenges in diagnosing data-scarce occupational diseases like pneumoconiosis. Recently, large language models (LLMs) have exhibits unprecedented ability when conducting multiple tasks in dialogue, bringing opportunities to diagnosis. A common strategy might involve using adapter layers for vision-language alignment and diagnosis in a dialogic manner. Yet, this approach often requires optimization of extensive learnable parameters in the text branch and the dialogue head, potentially diminishing the LLMs' efficacy, especially with limited training data. In our work, we innovate by eliminating the text branch and substituting the dialogue head with a classification head. This approach presents a more effective method for harnessing LLMs in diagnosis with fewer learnable parameters. Furthermore, to balance the retention of detailed image information with progression towards accurate diagnosis, we introduce the contextual multi-token engine. This engine is specialized in adaptively generating diagnostic tokens. Additionally, we propose the information emitter module, which unidirectionally emits information from image tokens to diagnosis tokens. Comprehensive experiments validate the superiority of our methods and the effectiveness of proposed modules. Our codes can be found at https://github.com/CodeMonsterPHD/PneumoLLM/tree/main.
    摘要 传统的预训练-微调范式虽然对数据充足的常见疾病效果良好,但在诊断尘肺病等数据稀缺的职业病时面临挑战。最近,大语言模型(LLM)在对话中执行多种任务时展现出前所未有的能力,为诊断带来了机遇。一种常见的策略是使用适配层进行视觉-语言对齐,并以对话方式进行诊断。然而,这种做法通常需要优化文本分支和对话头中的大量可学习参数,在训练数据有限时可能削弱 LLM 的效能。在我们的工作中,我们创新性地去除了文本分支,并将对话头替换为分类头。这种做法以更少的可学习参数提供了一种更有效的利用 LLM 进行诊断的方式。此外,为了在保留图像细节信息的同时逐步得到准确诊断,我们引入了上下文多 token 引擎,用于自适应地生成诊断 token;我们还提出了信息发射模块,将信息从图像 token 单向传递到诊断 token。全面的实验验证了我们方法的优越性以及所提模块的有效性。我们的代码可以在 https://github.com/CodeMonsterPHD/PneumoLLM/tree/main 找到。
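A rough, assumption-laden sketch of swapping a dialogue head for a classification head: visual tokens plus learned diagnosis tokens pass through a (frozen) backbone, the diagnosis-token states are pooled, and a linear classifier predicts the label. The real backbone and the contextual multi-token engine differ from this toy version.

```python
import torch
import torch.nn as nn

class LLMForDiagnosis(nn.Module):
    def __init__(self, backbone: nn.Module, hidden=4096, n_diag_tokens=4, n_classes=2):
        super().__init__()
        self.backbone = backbone                              # any module mapping (B, T, H) -> (B, T, H)
        self.diag_tokens = nn.Parameter(torch.randn(n_diag_tokens, hidden) * 0.02)
        self.classifier = nn.Linear(hidden, n_classes)        # replaces the dialogue head

    def forward(self, image_tokens):                          # (B, N, hidden) from a vision encoder
        B = image_tokens.shape[0]
        diag = self.diag_tokens.unsqueeze(0).expand(B, -1, -1)
        h = self.backbone(torch.cat([image_tokens, diag], dim=1))
        pooled = h[:, -diag.shape[1]:, :].mean(dim=1)         # pool only the diagnosis tokens
        return self.classifier(pooled)                        # (B, n_classes) logits

toy_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2)
model = LLMForDiagnosis(toy_backbone, hidden=64)
print(model(torch.rand(2, 16, 64)).shape)                     # torch.Size([2, 2])
```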

From Detection to Action Recognition: An Edge-Based Pipeline for Robot Human Perception

  • paper_url: http://arxiv.org/abs/2312.03477
  • repo_url: None
  • paper_authors: Petros Toupas, Georgios Tsamis, Dimitrios Giakoumis, Konstantinos Votis, Dimitrios Tzovaras
  • for: 这种研究旨在帮助移动服务机器人更好地理解和应对人类行为,以便在日常生活中提供更好的协助和支持。
  • methods: 该研究提出了一个涵盖整个过程的端到端管道,从人员检测和跟踪开始,然后进行动作识别。该管道采用边缘计算方式,以实现实时处理,并且选择了最适合移动机器人的模型。
  • results: 通过对state-of-the-art解决方案进行比较,以及使用自己的数据集进行测试,研究人员发现了他们的方法在实际应用中表现出色,能够准确地识别人类动作并响应相应的行为。
    Abstract Mobile service robots are proving to be increasingly effective in a range of applications, such as healthcare, monitoring Activities of Daily Living (ADL), and facilitating Ambient Assisted Living (AAL). These robots heavily rely on Human Action Recognition (HAR) to interpret human actions and intentions. However, for HAR to function effectively on service robots, it requires prior knowledge of human presence (human detection) and identification of individuals to monitor (human tracking). In this work, we propose an end-to-end pipeline that encompasses the entire process, starting from human detection and tracking, leading to action recognition. The pipeline is designed to operate in near real-time while ensuring all stages of processing are performed on the edge, reducing the need for centralised computation. To identify the most suitable models for our mobile robot, we conducted a series of experiments comparing state-of-the-art solutions based on both their detection performance and efficiency. To evaluate the effectiveness of our proposed pipeline, we proposed a dataset comprising daily household activities. By presenting our findings and analysing the results, we demonstrate the efficacy of our approach in enabling mobile robots to understand and respond to human behaviour in real-world scenarios relying mainly on the data from their RGB cameras.
    摘要 移动服务机器人在医疗、日常生活活动(ADL)监测和环境辅助生活(AAL)等各种应用中日益有效。这些机器人严重依赖人体动作识别(HAR)来理解人类的行为和意图。然而,要使 HAR 在服务机器人上有效工作,需要先获知人的存在(人体检测),并识别需要监测的个体(人体跟踪)。在这项工作中,我们提出了一个端到端的完整管道,从人体检测与跟踪开始,最终进行动作识别。该管道设计为接近实时运行,并确保所有处理阶段都在边缘设备上完成,从而减少对集中式计算的依赖。为了选择最适合我们移动机器人的模型,我们进行了一系列实验,从检测性能和效率两方面比较了当前最先进的解决方案。为了评估所提管道的有效性,我们构建了一个日常家庭活动数据集。通过展示和分析结果,我们证明了该方法主要依靠 RGB 摄像头数据,即可使移动机器人在真实场景中理解并响应人类行为。
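The detection -> tracking -> action-recognition flow described above can be captured in a small skeleton with the three stages as pluggable callables; the concrete on-edge models are left abstract and the clip length is an arbitrary choice.

```python
from collections import defaultdict, deque

class HumanPerceptionPipeline:
    def __init__(self, detector, tracker, action_recognizer, clip_len=16):
        self.detect, self.track, self.recognize = detector, tracker, action_recognizer
        self.buffers = defaultdict(lambda: deque(maxlen=clip_len))   # per-person crop history

    def step(self, frame):
        """Process one RGB frame; returns {track_id: action_label} for tracks with a full clip."""
        detections = self.detect(frame)            # e.g. list of person bounding boxes
        tracks = self.track(frame, detections)     # e.g. {track_id: (x1, y1, x2, y2)}
        actions = {}
        for tid, (x1, y1, x2, y2) in tracks.items():
            self.buffers[tid].append(frame[y1:y2, x1:x2])
            if len(self.buffers[tid]) == self.buffers[tid].maxlen:
                actions[tid] = self.recognize(list(self.buffers[tid]))
        return actions
```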

Memory-Efficient Optical Flow via Radius-Distribution Orthogonal Cost Volume

  • paper_url: http://arxiv.org/abs/2312.03790
  • repo_url: None
  • paper_authors: Gangwei Xu, Shujun Chen, Hao Jia, Miaojie Feng, Xin Yang
  • for: 高分辨率图像的光流估计
  • methods: 一种内存高效的方法(MeFlow),采用递归的局部正交代价体表示,将 2D 搜索空间动态分解为两个 1D 正交空间
  • results: 在 Sintel 和 KITTI 基准上取得有竞争力的表现,并在高分辨率输入上保持最高的内存效率。
    Abstract The full 4D cost volume in Recurrent All-Pairs Field Transforms (RAFT) or global matching by Transformer achieves impressive performance for optical flow estimation. However, their memory consumption increases quadratically with input resolution, rendering them impractical for high-resolution images. In this paper, we present MeFlow, a novel memory-efficient method for high-resolution optical flow estimation. The key of MeFlow is a recurrent local orthogonal cost volume representation, which decomposes the 2D search space dynamically into two 1D orthogonal spaces, enabling our method to scale effectively to very high-resolution inputs. To preserve essential information in the orthogonal space, we utilize self attention to propagate feature information from the 2D space to the orthogonal space. We further propose a radius-distribution multi-scale lookup strategy to model the correspondences of large displacements at a negligible cost. We verify the efficiency and effectiveness of our method on the challenging Sintel and KITTI benchmarks, and real-world 4K ($2160\!\times\!3840$) images. Our method achieves competitive performance on both Sintel and KITTI benchmarks, while maintaining the highest memory efficiency on high-resolution inputs.
    摘要 RAFT 中的完整 4D 代价体或基于 Transformer 的全局匹配在光流估计中取得了令人印象深刻的性能。然而,它们的内存消耗随输入分辨率呈平方增长,导致难以应用于高分辨率图像。在这篇论文中,我们提出了 MeFlow,一种用于高分辨率光流估计的新型内存高效方法。MeFlow 的关键是一种递归的局部正交代价体表示,它将 2D 搜索空间动态分解为两个 1D 正交空间,使我们的方法能够有效扩展到非常高分辨率的输入。为了在正交空间中保留关键信息,我们使用自注意力将特征信息从 2D 空间传递到正交空间。此外,我们还提出了一种半径分布的多尺度查找策略,以可忽略的代价建模大位移的对应关系。我们在具有挑战性的 Sintel 和 KITTI 基准以及真实世界的 4K(2160×3840)图像上验证了方法的效率和有效性。我们的方法在 Sintel 和 KITTI 基准上均取得了有竞争力的性能,同时在高分辨率输入上保持了最高的内存效率。

HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2312.03461
  • repo_url: None
  • paper_authors: Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, Lan Xu
  • for: 照片级真实感人体建模与渲染已取得长足进步,但将真实人体表演高质量地渲染并集成到光栅化管线中仍具有挑战性。本文提出了 HiFi4G,一种显式、紧凑的基于高斯(Gaussian splatting)的高保真人体表演渲染方法。
  • methods: 我们提出了一种双图机制来获取运动先验:一个粗略的变形图用于初始化,另一个细粒度的高斯图用于施加后续约束。然后,我们采用带有自适应时空正则项的 4D 高斯优化方案,以平衡非刚性先验和高斯更新;并配套提出了带残差补偿的压缩方案。
  • results: 与现有方法相比,我们的方法在优化速度、渲染质量和存储开销方面均具有显著优势,每帧存储不足 2MB,压缩率约 25 倍。
    Abstract We have recently seen tremendous progress in photo-real human modeling and rendering. Yet, efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper, we present HiFi4G, an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking, achieving a compact and compression-friendly representation. We first propose a dual-graph mechanism to obtain motion priors, with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. It achieves a substantial compression rate of approximately 25 times, with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach, which significantly outperforms existing approaches in terms of optimization speed, rendering quality, and storage overhead.
    摘要 近来,照片级真实感的人体建模与渲染取得了巨大进步。然而,高效地渲染逼真的人体表演并将其集成到光栅化管线中仍然具有挑战性。在这篇论文中,我们提出了 HiFi4G,一种显式、紧凑的基于高斯的方法,用于从稠密视频素材中进行高保真人体表演渲染。我们的核心思想是将 3D 高斯表示与非刚性跟踪相结合,得到一种紧凑且便于压缩的表示。我们首先提出双图机制来获取运动先验:一个粗略的变形图用于有效初始化,另一个细粒度的高斯图用于施加后续约束。然后,我们采用带自适应时空正则项的 4D 高斯优化方案,以有效平衡非刚性先验和高斯更新。我们还提出了配套的带残差补偿的压缩方案,以便在各种平台上提供沉浸式体验:压缩率约达 25 倍,每帧存储不足 2MB。大量实验表明了我们方法的有效性,其在优化速度、渲染质量和存储开销方面都明显优于现有方法。

F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis

  • paper_url: http://arxiv.org/abs/2312.03459
  • repo_url: None
  • paper_authors: Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song
  • for: 提高 Text-to-Video Synthesis 的推理速度,不需要重新训练模型。
  • methods: 通过探索两类主流文本到视频模型(基于 transformer 与基于扩散模型)的推理过程,发现两者共同使用的时间注意力模块存在冗余。基于这一发现,我们提出了一种无需重新训练的通用剪枝策略 F3-Pruning。
  • results: 在三个数据集上,使用经典的基于 transformer 的模型 CogVideo 和典型的基于扩散的模型 Tune-A-Video 进行了广泛实验,证明 F3-Pruning 能在保证质量的同时加速推理,并具有广泛的适用性。
    Abstract Recently Text-to-Video (T2V) synthesis has undergone a breakthrough by training transformers or diffusion models on large-scale datasets. Nevertheless, inferring such large models incurs huge costs.Previous inference acceleration works either require costly retraining or are model-specific.To address this issue, instead of retraining we explore the inference process of two mainstream T2V models using transformers and diffusion models.The exploration reveals the redundancy in temporal attention modules of both models, which are commonly utilized to establish temporal relations among frames.Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.Specifically, when aggregate temporal attention values are ranked below a certain ratio, corresponding weights will be pruned.Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning in inference acceleration, quality assurance and broad applicability.
    摘要 最近,通过在大规模数据集上训练 transformer 或扩散模型,文本到视频(T2V)合成取得了重大突破。然而,这些大型模型的推理成本巨大。以往的推理加速工作要么需要代价高昂的重新训练,要么只适用于特定模型。为了解决这个问题,我们没有重新训练,而是探索了两类主流 T2V 模型(基于 transformer 和基于扩散模型)的推理过程。探索发现,两类模型中用于建立帧间时间关系的时间注意力模块都存在冗余。因此,我们提出了一种无需训练的通用剪枝策略 F3-Pruning,用于剪除冗余的时间注意力权重。具体来说,当聚合的时间注意力值排名低于某一比例时,相应的权重将被剪除。我们在三个数据集上使用经典的基于 transformer 的模型 CogVideo 和典型的基于扩散的模型 Tune-A-Video 进行了广泛实验,验证了 F3-Pruning 在推理加速、质量保证和广泛适用性方面的有效性。
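A toy illustration of the pruning rule summarized above, under the assumption that each temporal-attention head exposes an aggregate attention score: heads whose score ranks in the bottom fraction are zeroed out, with no retraining.

```python
import torch

def f3_prune(temporal_weights, aggregate_scores, keep_ratio=0.5):
    """
    temporal_weights:  (H, D) parameters of H temporal-attention heads
    aggregate_scores:  (H,) aggregated temporal attention value per head
    """
    k = max(1, int(keep_ratio * len(aggregate_scores)))
    keep = torch.zeros_like(aggregate_scores, dtype=torch.bool)
    keep[torch.topk(aggregate_scores, k).indices] = True
    return temporal_weights * keep.unsqueeze(1), keep

weights = torch.randn(8, 64)
scores = torch.rand(8)
pruned, kept = f3_prune(weights, scores, keep_ratio=0.25)
print(kept)                      # which temporal-attention heads survive
```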

Data-driven Crop Growth Simulation on Time-varying Generated Images using Multi-conditional Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2312.03443
  • repo_url: https://github.com/luked12/crop-growth-cgan
  • paper_authors: Lukas Drees, Dereje T. Demie, Madhuri R. Paul, Johannes Leonhardt, Sabine J. Seidel, Thomas F. Döring, Ribana Roscher
  • for: 这篇论文旨在提供一种基于图像的作物生长模型,用于精准农业中揭示作物随时间的空间发育情况,从而实现对叶面积、生物量等未来植物性状的早期、位置特定的估计。
  • methods: 论文使用了一个两阶段框架,包括一个图像预测模型和一个生长估计模型。图像预测模型是一个条件 Wasserstein 生成对抗网络(CWGAN),在其生成器中使用条件批归一化(CBN)来融合多种影响因素;生长估计模型则是独立训练的模型,用于从预测图像中推导植物特定的性状。
  • results: 结果表明,该框架能够对多种作物数据集生成真实、清晰的时间序列图像预测,从短期到长期预测仅有轻微的质量下降,并能据此得到有用的性状估计。此外,论文还发现,将基于过程的生物量模拟结果作为条件加入,可以提高从预测图像中推导性状的准确性,这表明该框架有潜力充当基于图像与基于过程的作物生长模型之间的接口。
    Abstract Image-based crop growth modeling can substantially contribute to precision agriculture by revealing spatial crop development over time, which allows an early and location-specific estimation of relevant future plant traits, such as leaf area or biomass. A prerequisite for realistic and sharp crop image generation is the integration of multiple growth-influencing conditions in a model, such as an image of an initial growth stage, the associated growth time, and further information about the field treatment. We present a two-stage framework consisting first of an image prediction model and second of a growth estimation model, which both are independently trained. The image prediction model is a conditional Wasserstein generative adversarial network (CWGAN). In the generator of this model, conditional batch normalization (CBN) is used to integrate different conditions along with the input image. This allows the model to generate time-varying artificial images dependent on multiple influencing factors of different kinds. These images are used by the second part of the framework for plant phenotyping by deriving plant-specific traits and comparing them with those of non-artificial (real) reference images. For various crop datasets, the framework allows realistic, sharp image predictions with a slight loss of quality from short-term to long-term predictions. Simulations of varying growth-influencing conditions performed with the trained framework provide valuable insights into how such factors relate to crop appearances, which is particularly useful in complex, less explored crop mixture systems. Further results show that adding process-based simulated biomass as a condition increases the accuracy of the derived phenotypic traits from the predicted images. This demonstrates the potential of our framework to serve as an interface between an image- and process-based crop growth model.
    摘要 基于图像的作物生长建模可以显著促进精准农业，因为它能够揭示作物随时间的空间发展，从而对叶面积或生物量等相关的未来植物性状进行早期、定位化的估计。要生成真实而清晰的作物图像，需要在模型中整合多种影响生长的条件，例如初始生长阶段的图像、对应的生长时间以及田间管理的其他信息。我们提出了一个两阶段框架，由图像预测模型和生长估计模型组成，两者独立训练。图像预测模型是一个条件Wasserstein生成对抗网络（CWGAN），其生成器中使用条件批归一化（CBN）来整合不同条件与输入图像，使模型能够依据多种不同类型的影响因素生成随时间变化的人工图像。这些图像随后被框架的第二部分用于植物表型分析：从中推导植物特异性状，并与真实参考图像的性状进行比较。在多个作物数据集上，该框架都能给出真实、清晰的图像预测，从短期到长期预测仅有轻微的质量损失。利用训练好的框架对不同生长条件进行模拟，可以深入了解这些因素与作物外观之间的关系，这在复杂且研究较少的作物混作系统中尤其有用。进一步的结果表明，将基于过程模拟的生物量作为条件加入，可以提高从预测图像中推导的表型性状的准确性。这表明我们的框架有潜力成为基于图像与基于过程的作物生长模型之间的接口。
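Conditional batch normalization (CBN), which the abstract uses to inject growth-influencing conditions into the generator, is a standard mechanism: per-channel scale and shift are predicted from a condition vector. Below is a minimal, generic PyTorch sketch, assuming the conditions have already been encoded into a single vector per sample; layer and parameter names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Minimal CBN: normalize features, then scale/shift them with parameters
    predicted from a condition vector (e.g. growth time, treatment, biomass)."""
    def __init__(self, num_features: int, cond_dim: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)
        self.beta = nn.Linear(cond_dim, num_features)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) generator features, cond: (B, cond_dim) condition embedding
        h = self.bn(x)
        g = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        b = self.beta(cond).unsqueeze(-1).unsqueeze(-1)
        return (1.0 + g) * h + b                           # residual scale + shift
```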

High-Quality Facial Geometry and Appearance Capture at Home

  • paper_url: http://arxiv.org/abs/2312.03442
  • repo_url: https://github.com/yxuhan/cora
  • paper_authors: Yuxuan Han, Junfeng Lyu, Feng Xu
  • for: 高品质人脸捕捉,使得普通用户可以轻松使用。
  • methods: 提出了一种新的方法,可以很好地重建人脸几何和外观,包括皮肤、嘴巴内部、头发和眼睛。
  • results: 实验结果表明,该方法可以获得高品质的3D捕捉结果。
    Abstract Facial geometry and appearance capture have demonstrated tremendous success in 3D scanning real humans in studios. Recent works propose to democratize this technique while keeping the results high quality. However, they are still inconvenient for daily usage. In addition, they focus on an easier problem of only capturing facial skin. This paper proposes a novel method for high-quality face capture, featuring an easy-to-use system and the capability to model the complete face with skin, mouth interior, hair, and eyes. We reconstruct facial geometry and appearance from a single co-located smartphone flashlight sequence captured in a dim room where the flashlight is the dominant light source (e.g. rooms with curtains or at night). To model the complete face, we propose a novel hybrid representation to effectively model both eyes and other facial regions, along with novel techniques to learn it from images. We apply a combined lighting model to compactly represent real illuminations and exploit a morphable face albedo model as a reflectance prior to disentangle diffuse and specular. Experiments show that our method can capture high-quality 3D relightable scans.
    摘要 面部几何与外观捕捉在影棚环境中对真实人脸进行3D扫描已取得巨大成功。近期的工作尝试在保持结果高质量的同时让这项技术大众化，但它们在日常使用中仍不够方便，并且只关注较容易的面部皮肤捕捉问题。本文提出了一种新的高质量人脸捕捉方法，其特点是系统易于使用，并且能够建模包括皮肤、口腔内部、头发和眼睛在内的完整面部。我们从在暗室（闪光灯为主要光源，例如拉上窗帘的房间或夜晚）中拍摄的单段共置手机闪光灯序列中重建面部几何和外观。为了建模完整面部，我们提出了一种新的混合表示，能够有效地同时建模眼睛和其他面部区域，并给出了从图像中学习该表示的新技术。我们采用组合光照模型来紧凑地表示真实光照，并利用可变形人脸反照率模型作为反射率先验来分离漫反射与镜面反射。实验表明，我们的方法能够获得高质量、可重光照的3D扫描结果。

UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity

  • paper_url: http://arxiv.org/abs/2312.03441
  • repo_url: https://github.com/zplusdragon/ufinebench
  • paper_authors: Jialong Zuo, Hanyu Zhou, Ying Nie, Feng Zhang, Tianyu Guo, Nong Sang, Yunhe Wang, Changxin Gao
  • for: 提升基于文本的行人检索对超细粒度语义的表示能力
  • methods: 构建了一个新的超细粒度数据集UFine6926，并提出了一种更高效的算法CFAM，用于基于文本的超细粒度行人检索
  • results: 在标准域内评测中，CFAM在多个数据集上取得了有竞争力的性能，尤其是在超细粒度的UFine6926上。此外，在UFine3C上的评测表明，在UFine6926上训练的模型对真实场景的泛化能力显著优于在其他粗粒度数据集上训练的模型。
    Abstract Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders the model to comprehend the fine-grained semantics of query texts in real scenarios. To address this problem, we contribute a new benchmark named \textbf{UFineBench} for text-based person retrieval with ultra-fine granularity. Firstly, we construct a new \textbf{dataset} named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions, averaging 80.8 words each. The average word count is three to four times that of the previous datasets. In addition of standard in-domain evaluation, we also propose a special \textbf{evaluation paradigm} more representative of real scenarios. It contains a new evaluation set with cross domains, cross textual granularity and cross textual styles, named UFine3C, and a new evaluation metric for accurately measuring retrieval ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a more efficient \textbf{algorithm} especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and hard negative match mechanism. With standard in-domain evaluation, CFAM establishes competitive performance across various datasets, especially on our ultra fine-grained UFine6926. Furthermore, by evaluating on UFine3C, we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at \url{https://github.com/Zplusdragon/UFineBench}.
    摘要 现有的基于文本的行人检索数据集往往只有较粗粒度的文本标注，这妨碍了模型在真实场景中理解查询文本的细粒度语义。为了解决这一问题，我们提出了一个面向超细粒度文本行人检索的新基准UFineBench。首先，我们构建了新数据集UFine6926：我们收集了大量行人图像，并为每张图像人工标注两段详细的文本描述，平均每段80.8个词，平均词数是以往数据集的三到四倍。除了标准的域内评测之外，我们还提出了一种更贴近真实场景的特殊评测范式，其中包含一个跨域、跨文本粒度、跨文本风格的新评测集UFine3C，以及一个能更准确衡量检索能力的新评测指标——平均相似度分布（mSD）。此外，我们提出了CFAM，一种专为超细粒度文本行人检索设计的更高效算法，它通过共享的跨模态粒度解码器和难负样本匹配机制实现细粒度挖掘。在标准域内评测中，CFAM在多个数据集上取得了有竞争力的性能，尤其是在我们的超细粒度UFine6926上。进一步地，在UFine3C上的评测表明，与其他粗粒度数据集相比，在UFine6926上训练能显著提升模型对真实场景的泛化能力。数据集和代码将在 https://github.com/Zplusdragon/UFineBench 公开。

Data-Centric Digital Agriculture: A Perspective

  • paper_url: http://arxiv.org/abs/2312.03437
  • repo_url: None
  • paper_authors: Ribana Roscher, Lukas Roth, Cyrill Stachniss, Achim Walter
  • for: This paper aims to address the challenges of digital agriculture by adopting a data-centric approach to machine learning, with the goal of improving the accuracy and sustainability of agricultural tasks such as yield prediction, weed detection, and early disease identification.
  • methods: The paper proposes the use of data-centric machine learning strategies that utilize the intrinsic value of data to develop accurate, generalizable, and adaptable methods for digital agriculture. These strategies include acquiring and curating valuable data, as well as implementing effective learning and evaluation methods.
  • results: The paper has the potential to create accurate, generalizable, and adaptable machine learning methods that effectively and sustainably address agricultural tasks, leading to improved yields, reduced waste, and more efficient use of resources. By adopting a data-centric approach, the paper aims to overcome the limitations of traditional model-centric methods and fully realize the potential of digital agriculture.
    Abstract In response to the increasing global demand for food, feed, fiber, and fuel, digital agriculture is rapidly evolving to meet these demands while reducing environmental impact. This evolution involves incorporating data science, machine learning, sensor technologies, robotics, and new management strategies to establish a more sustainable agricultural framework. So far, machine learning research in digital agriculture has predominantly focused on model-centric approaches, focusing on model design and evaluation. These efforts aim to optimize model accuracy and efficiency, often treating data as a static benchmark. Despite the availability of agricultural data and methodological advancements, a saturation point has been reached, with many established machine learning methods achieving comparable levels of accuracy and facing similar limitations. To fully realize the potential of digital agriculture, it is crucial to have a comprehensive understanding of the role of data in the field and to adopt data-centric machine learning. This involves developing strategies to acquire and curate valuable data and implementing effective learning and evaluation strategies that utilize the intrinsic value of data. This approach has the potential to create accurate, generalizable, and adaptable machine learning methods that effectively and sustainably address agricultural tasks such as yield prediction, weed detection, and early disease identification
    摘要 随着全球对食品、饲料、纤维和燃料需求的增长，数字农业正在快速演化，以在满足这些需求的同时降低环境影响。这种演化包括整合数据科学、机器学习、传感器技术、机器人和新的管理策略，以建立更可持续的农业框架。迄今为止，数字农业中的机器学习研究主要集中在以模型为中心的方法上，强调模型设计和评估。这些努力的目标是优化模型的准确率和效率，通常将数据视为静态基准。尽管农业数据和方法不断进步，但已经出现饱和现象：许多成熟的机器学习方法达到了相近的准确率，并面临相似的局限。要充分发挥数字农业的潜力，必须全面理解数据在该领域中的作用，并采用以数据为中心的机器学习，包括制定获取与整理高价值数据的策略，以及实施能充分利用数据内在价值的学习和评估方法。这种方式有望创造准确、可泛化且可适应的机器学习方法，从而有效且可持续地解决产量预测、杂草检测和早期病害识别等农业任务。

Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle

  • paper_url: http://arxiv.org/abs/2312.03431
  • repo_url: None
  • paper_authors: Youtian Lin, Zuozhuo Dai, Siyu Zhu, Yao Yao
  • for: fast dynamic scene reconstruction and real-time rendering from multi-view and monocular videos
  • methods: 使用基于点的3D高斯泼溅（3DGS），并提出双域变形模型（DDDM）来显式建模每个高斯点的属性变形
  • results: 相比逐帧建模3DGS，训练速度提升5倍，且新视角渲染质量优于此前的领先方法
    Abstract We introduce Gaussian-Flow, a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. In contrast to the prevalent NeRF-based approaches hampered by slow training and rendering speeds, our approach harnesses recent advancements in point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain Deformation Model (DDDM) is proposed to explicitly model attribute deformations of each Gaussian point, where the time-dependent residual of each attribute is captured by a polynomial fitting in the time domain, and a Fourier series fitting in the frequency domain. The proposed DDDM is capable of modeling complex scene deformations across long video footage, eliminating the need for training separate 3DGS for each frame or introducing an additional implicit neural field to model 3D dynamics. Moreover, the explicit deformation modeling for discretized Gaussian points ensures ultra-fast training and rendering of a 4D scene, which is comparable to the original 3DGS designed for static 3D reconstruction. Our proposed approach showcases a substantial efficiency improvement, achieving a $5\times$ faster training speed compared to the per-frame 3DGS modeling. In addition, quantitative results demonstrate that the proposed Gaussian-Flow significantly outperforms previous leading methods in novel view rendering quality. Project page: https://nju-3dv.github.io/projects/Gaussian-Flow
    摘要 我们提出Gaussian-Flow，一种新的基于点的方法，可从多视角和单目视频中实现快速的动态场景重建与实时渲染。与训练和渲染速度较慢的主流NeRF类方法不同，我们的方法利用了基于点的3D高斯泼溅（3DGS）的最新进展。具体而言，我们提出了双域变形模型（DDDM），显式建模每个高斯点各属性的变形：每个属性随时间变化的残差由时域中的多项式拟合与频域中的傅里叶级数拟合共同刻画。DDDM能够建模长视频中的复杂场景变形，无需为每一帧单独训练3DGS，也无需引入额外的隐式神经场来建模3D动态。此外，对离散高斯点进行显式变形建模，使4D场景的训练和渲染速度极快，可与面向静态3D重建的原始3DGS相当。与逐帧3DGS建模相比，我们的方法训练速度提升5倍，且定量结果表明Gaussian-Flow在新视角渲染质量上显著优于此前的领先方法。项目主页：https://nju-3dv.github.io/projects/Gaussian-Flow
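The dual-domain idea in the abstract (a per-attribute, time-dependent residual built from a polynomial in the time domain plus a truncated Fourier series in the frequency domain) can be sketched as below. Coefficient shapes, the normalization of t to [0, 1], and the module name are assumptions for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class DualDomainDeformation(nn.Module):
    """Sketch of a DDDM-style deformation for one per-Gaussian attribute
    (e.g. position, rotation, opacity)."""
    def __init__(self, num_points: int, attr_dim: int, poly_order: int = 3, num_freqs: int = 4):
        super().__init__()
        self.poly = nn.Parameter(torch.zeros(num_points, attr_dim, poly_order))
        self.fourier_a = nn.Parameter(torch.zeros(num_points, attr_dim, num_freqs))
        self.fourier_b = nn.Parameter(torch.zeros(num_points, attr_dim, num_freqs))
        self.register_buffer("freqs", 2 * torch.pi * torch.arange(1, num_freqs + 1).float())

    def forward(self, base_attr: torch.Tensor, t: float) -> torch.Tensor:
        # base_attr: (num_points, attr_dim); t: normalized time in [0, 1]
        orders = torch.arange(1, self.poly.shape[-1] + 1,
                              device=base_attr.device, dtype=base_attr.dtype)
        poly_res = (self.poly * (t ** orders)).sum(dim=-1)                 # time-domain residual
        phase = self.freqs * t
        four_res = (self.fourier_a * torch.sin(phase)
                    + self.fourier_b * torch.cos(phase)).sum(dim=-1)       # frequency-domain residual
        return base_attr + poly_res + four_res
```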

ShareCMP: Polarization-Aware RGB-P Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.03430
  • repo_url: https://github.com/lefteyex/sharecmp
  • paper_authors: Zhuoyan Liu, Bo Wang, Lizhi Wang, Chenyu Mao, Ye Li
  • for: 这篇论文旨在提升RGB-偏振（RGB-P）语义分割的性能，为自主水下航行器（AUVs）提供特殊的感知能力。
  • methods: 作者设计了一个共享双分支架构的RGB-P语义分割框架ShareCMP，其中包含偏振生成注意力（PGA）模块，用于为编码器生成偏振属性更丰富的偏振模态图像；此外还提出了类别偏振感知损失（CPALoss），以增强编码器对偏振模态信息的学习与理解，并优化PGA模块。
  • results: 作者在UPLight、ZJU和MCubeS三个RGB-P基准上进行了大量实验。结果表明，ShareCMP在mIoU上达到了当前最优性能，且与此前的双分支模型相比参数量减少约26-33%。
    Abstract Multimodal semantic segmentation is developing rapidly, but the modality of RGB-Polarization remains underexplored. To delve into this problem, we construct a UPLight RGB-P segmentation benchmark with 12 typical underwater semantic classes which provides data support for Autonomous Underwater Vehicles (AUVs) to perform special perception tasks. In this work, we design the ShareCMP, an RGB-P semantic segmentation framework with a shared dual-branch architecture, which reduces the number of parameters by about 26-33% compared to previous dual-branch models. It encompasses a Polarization Generate Attention (PGA) module designed to generate polarization modal images with richer polarization properties for the encoder. In addition, we introduce the Class Polarization-Aware Loss (CPALoss) to improve the learning and understanding of the encoder for polarization modal information and to optimize the PGA module. With extensive experiments on a total of three RGB-P benchmarks, our ShareCMP achieves state-of-the-art performance in mIoU with fewer parameters on the UPLight (92.45%), ZJU (92.7%), and MCubeS (50.99%) datasets. The code is available at https://github.com/LEFTeyex/ShareCMP.
    摘要 多模态语义分割发展迅速，但RGB-偏振（RGB-P）模态仍缺乏深入研究。为了探索这一问题，我们构建了包含12个典型水下语义类别的UPLight RGB-P分割基准，为自主水下航行器（AUVs）执行特殊感知任务提供数据支持。在这项工作中，我们设计了ShareCMP，一个采用共享双分支架构的RGB-P语义分割框架，其参数量比此前的双分支模型减少约26-33%。它包含偏振生成注意力（PGA）模块，用于为编码器生成偏振属性更丰富的偏振模态图像。此外，我们引入类别偏振感知损失（CPALoss），以提升编码器对偏振模态信息的学习与理解，并优化PGA模块。在共三个RGB-P基准上的大量实验表明，ShareCMP以更少的参数在UPLight（92.45%）、ZJU（92.7%）和MCubeS（50.99%）数据集上取得了最优的mIoU性能。代码见 https://github.com/LEFTeyex/ShareCMP。

Artist-Friendly Relightable and Animatable Neural Heads

  • paper_url: http://arxiv.org/abs/2312.03420
  • repo_url: None
  • paper_authors: Yingyan Xu, Prashanth Chandran, Sebastian Weiss, Markus Gross, Gaspard Zoss, Derek Bradley
  • for: 这篇论文旨在创建可在任意环境下重新打光并表现不同表情的照片级真实感数字人头模型，采用体积神经场技术。
  • methods: 该方法建立在基于体积基元混合的动态人头方法之上，并结合最近提出的用于可重光照神经场的轻量级硬件设置，同时引入一种新的网络架构。
  • results: 该方法可以在任意环境（包括近场光照与任意视角）下，对做出未见过表情的动态神经人头进行重新打光。
    Abstract An increasingly common approach for creating photo-realistic digital avatars is through the use of volumetric neural fields. The original neural radiance field (NeRF) allowed for impressive novel view synthesis of static heads when trained on a set of multi-view images, and follow up methods showed that these neural representations can be extended to dynamic avatars. Recently, new variants also surpassed the usual drawback of baked-in illumination in neural representations, showing that static neural avatars can be relit in any environment. In this work we simultaneously tackle both the motion and illumination problem, proposing a new method for relightable and animatable neural heads. Our method builds on a proven dynamic avatar approach based on a mixture of volumetric primitives, combined with a recently-proposed lightweight hardware setup for relightable neural fields, and includes a novel architecture that allows relighting dynamic neural avatars performing unseen expressions in any environment, even with nearfield illumination and viewpoints.
    摘要 通过体积神经场来创建照片级真实感数字人像的做法正日益普及。最初的神经辐射场（NeRF）在多视角图像上训练后，能够对静态人头进行出色的新视角合成；后续方法表明这类神经表示可以扩展到动态数字人像。最近的新变体还克服了神经表示中光照被“烘焙”进模型的常见缺陷，使静态神经人像可以在任意环境中重新打光。在这项工作中，我们同时处理运动和光照两个问题，提出了一种可重光照、可驱动的神经人头新方法。该方法建立在一种经过验证的、基于体积基元混合的动态人像方案之上，并结合最近提出的用于可重光照神经场的轻量级硬件设置，同时引入一种新的架构，使动态神经人像能够在任意环境（包括近场光照和任意视角）下做出未见过的表情并被重新打光。

DeepPyramid+: Medical Image Segmentation using Pyramid View Fusion and Deformable Pyramid Reception

  • paper_url: http://arxiv.org/abs/2312.03409
  • repo_url: None
  • paper_authors: Negin Ghamsarian, Sebastian Wolf, Martin Zinkernagel, Klaus Schoeffmann, Raphael Sznitman
  • for: 这篇论文旨在应对医学图像和手术视频分割中的多种挑战，如类别异质性、可变形、透明、边界模糊和多种退化。
  • methods: 该论文提出了一种网络架构DeepPyramid+，用于应对医学图像和手术视频分割中的多种挑战。DeepPyramid+包括两个主要模块：金字塔视图融合（PVF）和可变形金字塔感受（DPR）。PVF在网络内部复刻了一种与人类视觉系统相似的推理过程，以增强每个像素位置上相对信息的表示；DPR则借助空洞可变形卷积引入形状与尺度自适应的特征提取技术，以提升对异质类别和可变形目标的精度与鲁棒性。
  • results: 在包括子宫内膜异位症视频、MRI图像、OCT扫描以及白内障和腹腔镜手术视频在内的多个数据集上，DeepPyramid+能够应对形状与尺度变化、反光和模糊退化等多种挑战。DeepPyramid+显著提升了分割性能：域内分割的Dice系数最高提升3.65%，跨域分割的Dice系数最高提升17%，并且在不同模态和不同骨干网络下均优于最先进的网络，显示了其通用性。
    Abstract Semantic Segmentation plays a pivotal role in many applications related to medical image and video analysis. However, designing a neural network architecture for medical image and surgical video segmentation is challenging due to the diverse features of relevant classes, including heterogeneity, deformability, transparency, blunt boundaries, and various distortions. We propose a network architecture, DeepPyramid+, which addresses diverse challenges encountered in medical image and surgical video segmentation. The proposed DeepPyramid+ incorporates two major modules, namely "Pyramid View Fusion" (PVF) and "Deformable Pyramid Reception," (DPR), to address the outlined challenges. PVF replicates a deduction process within the neural network, aligning with the human visual system, thereby enhancing the representation of relative information at each pixel position. Complementarily, DPR introduces shape- and scale-adaptive feature extraction techniques using dilated deformable convolutions, enhancing accuracy and robustness in handling heterogeneous classes and deformable shapes. Extensive experiments conducted on diverse datasets, including endometriosis videos, MRI images, OCT scans, and cataract and laparoscopy videos, demonstrate the effectiveness of DeepPyramid+ in handling various challenges such as shape and scale variation, reflection, and blur degradation. DeepPyramid+ demonstrates significant improvements in segmentation performance, achieving up to a 3.65% increase in Dice coefficient for intra-domain segmentation and up to a 17% increase in Dice coefficient for cross-domain segmentation. DeepPyramid+ consistently outperforms state-of-the-art networks across diverse modalities considering different backbone networks, showcasing its versatility.
    摘要 语义分割在许多医学图像和视频分析应用中都起着关键作用。然而，由于相关类别具有异质性、可变形、透明、边界模糊以及多种退化等多样特征，为医学图像和手术视频分割设计神经网络架构颇具挑战。我们提出了一种网络架构DeepPyramid+，用于应对医学图像和手术视频分割中遇到的多种挑战。DeepPyramid+包含两个主要模块：金字塔视图融合（PVF）和可变形金字塔感受（DPR）。PVF在神经网络内部复刻了一种与人类视觉系统相一致的推理过程，从而增强每个像素位置上相对信息的表示；作为补充，DPR借助空洞可变形卷积引入形状与尺度自适应的特征提取技术，提升了处理异质类别和可变形目标时的精度与鲁棒性。在子宫内膜异位症视频、MRI图像、OCT扫描以及白内障和腹腔镜手术视频等多个数据集上的大量实验表明，DeepPyramid+能够有效应对形状与尺度变化、反光和模糊退化等各种挑战。DeepPyramid+显著提升了分割性能：域内分割的Dice系数最高提升3.65%，跨域分割的Dice系数最高提升17%。在不同模态和不同骨干网络下，DeepPyramid+均稳定地优于最先进的网络，展示了其通用性。
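To make the multi-scale reception idea behind DPR concrete, the sketch below uses plain dilated convolutions with several dilation rates fused by a 1x1 convolution. The deformable-offset branch of the actual DPR module is omitted, so this only illustrates the scale-adaptive part; module and parameter names are assumptions.

```python
import torch
import torch.nn as nn

class DilatedPyramidReception(nn.Module):
    """Simplified stand-in for a DPR-style block: parallel convolutions with
    increasing dilation rates see the input at several scales, and their
    outputs are fused back to the original channel count."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [torch.relu(branch(x)) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```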

Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

  • paper_url: http://arxiv.org/abs/2312.03408
  • repo_url: https://github.com/opendrivelab/driveagi
  • paper_authors: Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Pinlong Cai, Huilin Xu, Dahua Lin, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Kai Yan, Chunjing Xu, Tiancai Wang, Beipeng Mu, Shaoqing Ren, Zhihui Peng, Yu Qiao
  • For: This paper aims to provide a comprehensive review of open-source autonomous driving datasets, including their principles, data scales, and future challenges.
  • Methods: The paper uses a systematic approach to assess over 70 open-source autonomous driving datasets from domestic and international sources, and offers insights into the creation of high-quality datasets, data engine systems, and generative foundation models.
  • Results: The paper provides an exhaustive analysis and discourse of the characteristics and data scales of future third-generation autonomous driving datasets, and highlights the scientific and technical challenges that need to be addressed to advance autonomous innovation and foster technological enhancement in critical domains.
    Abstract With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities, smaller data scale, and is limited to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.
    摘要 随着自动驾驶技术的不断成熟和应用，对开源自动驾驶数据集进行系统性的评估变得非常重要，以推动行业生态系统的健康发展。目前的自动驾驶数据集可以大致分为两代。第一代自动驾驶数据集具有较简单的传感器模态、较小的数据规模，且仅限于感知级别的任务，2012年发布的KITTI是这一代的突出代表。相比之下，第二代数据集具有更复杂的传感器模态、更大的数据规模和多样性，任务也从感知扩展到预测和控制，2019年前后发布的nuScenes和Waymo是第二代的领先代表。本文与学术界和业界的知名专家合作完成，对国内外七十余个开源自动驾驶数据集进行了系统性的评估，就高质量数据集的创建原则、数据引擎系统的关键作用以及利用生成式基础模型实现可扩展数据生成等方面提供了见解。此外，本文还对未来第三代自动驾驶数据集应具备的特点和数据规模进行了深入分析和讨论，并探讨了亟待解决的科学与技术挑战。这些努力将有助于推动自动驾驶创新并促进关键领域的技术进步。详细信息请参考 https://github.com/OpenDriveLab/DriveAGI。

SVQ: Sparse Vector Quantization for Spatiotemporal Forecasting

  • paper_url: http://arxiv.org/abs/2312.03406
  • repo_url: None
  • paper_authors: Chao Chen, Tian Zhou, Yanjun Zhao, Hui Liu, Liang Sun, Rong Jin
  • for: 该论文旨在提升时空预测任务（如天气预报和交通流量预测）中计算机视觉模型的预测能力。
  • methods: 该论文提出了一种新的向量量化方法——稀疏向量量化（SVQ），它利用稀疏回归在保留时空数据细节与去除噪声之间取得更好的平衡，从而提升模型的泛化与迁移学习能力。
  • results: 在天气预报、交通流量预测和视频预测等多个领域的数据集上进行的实验表明，SVQ能持续提升基础模型的性能，并在所有基准上取得最先进的结果。
    Abstract Spatiotemporal forecasting tasks, such as weather forecasting and traffic prediction, offer significant societal benefits. These tasks can be effectively approached as image forecasting problems using computer vision models. Vector quantization (VQ) is a well-known method for discrete representation that improves the latent space, leading to enhanced generalization and transfer learning capabilities. One of the main challenges in using VQ for spatiotemporal forecasting is how to balance between keeping enough details and removing noises from the original patterns for better generalization. We address this challenge by developing sparse vector quantization, or {\bf SVQ} for short, that leverages sparse regression to make better trade-off between the two objectives. The main innovation of this work is to approximate sparse regression by a two-layer MLP and a randomly fixed or learnable matrix, dramatically improving its computational efficiency. Through experiments conducted on diverse datasets in multiple fields including weather forecasting, traffic flow prediction, and video forecasting, we unequivocally demonstrate that our proposed method consistently enhances the performance of base models and achieves state-of-the-art results across all benchmarks.
    摘要 天气预报、交通预测等时空预测任务具有重要的社会价值，可以借助计算机视觉模型将其视为图像预测问题来处理。向量量化（VQ）是一种广为人知的离散表示方法，可以改善潜在空间，从而提升泛化和迁移学习能力。将VQ用于时空预测的主要挑战之一，在于如何在保留足够细节与去除原始模式中的噪声之间取得平衡，以获得更好的泛化。为此，我们提出了稀疏向量量化（SVQ），利用稀疏回归在这两个目标之间取得更好的权衡。本工作的主要创新在于用一个两层多层感知机（MLP）和一个随机固定（或可学习）的矩阵来近似稀疏回归，从而大幅提升计算效率。我们在天气预报、交通流量预测和视频预测等多个领域的数据集上进行了实验，结果明确表明，所提方法能持续提升基础模型的性能，并在所有基准上取得最先进的结果。
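A minimal sketch of the "two-layer MLP plus a randomly fixed matrix" approximation of sparse regression described in the abstract is given below. The frozen random codebook, the ReLU-induced non-negativity as the source of sparsity, and all sizes are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

class SparseVectorQuantizer(nn.Module):
    """SVQ-style quantization sketch: a two-layer MLP predicts (approximately
    sparse) combination weights over a large, optionally frozen random codebook,
    and the quantized feature is the weighted sum of codebook vectors."""
    def __init__(self, dim: int, codebook_size: int = 1024, hidden: int = 512,
                 freeze_codebook: bool = True):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim),
                                     requires_grad=not freeze_codebook)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, codebook_size),
            nn.ReLU(),                      # non-negative codes tend to be sparse
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim) latent features from the spatiotemporal backbone
        codes = self.mlp(x)                 # (..., codebook_size)
        return codes @ self.codebook        # (..., dim) quantized features
```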

Predicting Postoperative Intraocular Lens Dislocation in Cataract Surgery via Deep Learning

  • paper_url: http://arxiv.org/abs/2312.03401
  • repo_url: None
  • paper_authors: Negin Ghamsarian, Doris Putzgruber-Adamitsch, Stephanie Sarny, Raphael Sznitman, Klaus Schoeffmann, Yosuf El-Shabrawi
  • for: 这项研究旨在分析白内障手术后可能出现的人工晶状体脱位问题，并开发可靠的方法在术中预测这种并发症。
  • methods: 该研究结合了三类卷积神经网络（CNN），即循环CNN、基于区域的CNN和基于像素的CNN，用于计算术中人工晶状体的展开延迟、旋转和不稳定性。
  • results: 结果表明，所提出的框架能够准确评估术中晶状体的统计特征，并与资深外科医生的假设和观察相符。结果还显示晶状体展开延迟与旋转之间存在显著相关性，且四组晶状体在术中旋转稳定性上存在显著差异。
    Abstract A critical yet unpredictable complication following cataract surgery is intraocular lens dislocation. Postoperative stability is imperative, as even a tiny decentration of multifocal lenses or inadequate alignment of the torus in toric lenses due to postoperative rotation can lead to a significant drop in visual acuity. Investigating possible intraoperative indicators that can predict post-surgical instabilities of intraocular lenses can help prevent this complication. In this paper, we develop and evaluate the first fully-automatic framework for the computation of lens unfolding delay, rotation, and instability during surgery. Adopting a combination of three types of CNNs, namely recurrent, region-based, and pixel-based, the proposed framework is employed to assess the possibility of predicting post-operative lens dislocation during cataract surgery. This is achieved via performing a large-scale study on the statistical differences between the behavior of different brands of intraocular lenses and aligning the results with expert surgeons' hypotheses and observations about the lenses. We exploit a large-scale dataset of cataract surgery videos featuring four intraocular lens brands. Experimental results confirm the reliability of the proposed framework in evaluating the lens' statistics during the surgery. The Pearson correlation and t-test results reveal significant correlations between lens unfolding delay and lens rotation and significant differences between the intra-operative rotations stability of four groups of lenses. These results suggest that the proposed framework can help surgeons select the lenses based on the patient's eye conditions and predict post-surgical lens dislocation.
    摘要 白内障手术后一种关键却难以预料的并发症是人工晶状体脱位。术后稳定性至关重要：多焦点晶状体哪怕轻微偏心，或散光矫正型（toric）晶状体因术后旋转而对位不佳，都可能导致视力显著下降。研究能够预测术后晶状体不稳定的术中指标，有助于预防这种并发症。本文开发并评估了首个全自动框架，用于计算术中晶状体的展开延迟、旋转和不稳定性。该框架结合循环、基于区域和基于像素三类卷积神经网络，用于评估在白内障手术中预测术后晶状体脱位的可行性。为此，我们对不同品牌人工晶状体的行为差异进行了大规模统计研究，并将结果与资深外科医生关于这些晶状体的假设和观察进行对照。我们使用了包含四种人工晶状体品牌的大规模白内障手术视频数据集。实验结果证实了该框架在术中评估晶状体统计特征方面的可靠性：Pearson相关性和t检验结果显示，晶状体展开延迟与旋转之间存在显著相关，四组晶状体的术中旋转稳定性之间存在显著差异。这些结果表明，该框架可以帮助外科医生根据患者眼部情况选择晶状体，并预测术后晶状体脱位。

Action Scene Graphs for Long-Form Understanding of Egocentric Videos

  • paper_url: http://arxiv.org/abs/2312.03391
  • repo_url: https://github.com/fpv-iplab/easg
  • paper_authors: Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, Giovanni Maria Farinella
  • for: Egocentric Action Scene Graphs（EASGs）是一种面向长时第一视角视频理解的新表示。
  • methods: EASGs通过一种新的标注流程，以随时间演化的图结构描述摄像机佩戴者执行的动作、交互对象及其关系，以及动作随时间的展开。
  • results: 在第一视角动作预测和第一视角活动摘要两个下游任务上的实验表明，EASGs能有效支持长时第一视角视频理解。
    Abstract We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs offering a rich set of annotations designed for long-from egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and the code to replicate experiments and annotations.
    摘要 我们提出 Egocentric Action Scene Graphs（EASGs），一种面向长时第一视角视频理解的新表示方法。EASGs 将标准的人工标注表示（如动词-名词动作标签）扩展为随时间演化的基于图的描述，涵盖摄像机佩戴者执行的动作、交互对象及其关系，以及动作随时间的展开。通过一种新的标注流程，我们扩展了 Ego4D 数据集，加入人工标注的 Egocentric Action Scene Graphs，为长时第一视角视频理解提供了丰富的标注。我们据此定义了 EASG 生成任务，并给出了基线方法，建立了初步基准。在第一视角动作预测和第一视角活动摘要两个下游任务上的实验表明，EASGs 对长时第一视角视频理解十分有效。我们将发布数据集和代码，以便复现实验与标注。

Novel class discovery meets foundation models for 3D semantic segmentation

  • paper_url: http://arxiv.org/abs/2312.03782
  • repo_url: None
  • paper_authors: Luigi Riz, Cristiano Saltori, Yiming Wang, Elisa Ricci, Fabio Poiesi
  • for: 本研究探讨语义分割中的新类发现（NCD）任务，即利用已标注基础类提供的监督来准确分割未标注的新类。
  • methods: 本文提出了一种基于在线聚类、不确定性估计和语义蒸馏的新NCD方法，以提升点云语义分割中新类的分割精度。
  • results: 在SemanticKITTI、SemanticPOSS和S3DIS数据集上的大量评测表明，所提出的NCD方法在点云语义分割中显著优于所比较的基线方法。
    Abstract The task of Novel Class Discovery (NCD) in semantic segmentation entails training a model able to accurately segment unlabelled (novel) classes, relying on the available supervision from annotated (base) classes. Although extensively investigated in 2D image data, the extension of the NCD task to the domain of 3D point clouds represents a pioneering effort, characterized by assumptions and challenges that are not present in the 2D case. This paper represents an advancement in the analysis of point cloud data in four directions. Firstly, it introduces the novel task of NCD for point cloud semantic segmentation. Secondly, it demonstrates that directly transposing the only existing NCD method for 2D image semantic segmentation to 3D data yields suboptimal results. Thirdly, a new NCD approach based on online clustering, uncertainty estimation, and semantic distillation is presented. Lastly, a novel evaluation protocol is proposed to rigorously assess the performance of NCD in point cloud semantic segmentation. Through comprehensive evaluations on the SemanticKITTI, SemanticPOSS, and S3DIS datasets, the paper demonstrates substantial superiority of the proposed method over the considered baselines.
    摘要 点云语义分割中的新类发现（NCD）任务，要求依靠已标注基础类提供的监督，训练出能够准确分割未标注新类的模型。尽管NCD任务在2D图像数据中已被广泛研究，但将其推广到3D点云领域是一项开创性的尝试，存在2D情形中不存在的假设与挑战。本文从四个方向推进了点云数据的分析：其一，提出了点云语义分割中的新类发现任务；其二，表明将2D图像语义分割中现有的唯一NCD方法直接移植到3D数据会得到次优结果；其三，提出了一种基于在线聚类、不确定性估计和语义蒸馏的新NCD方法；其四，提出了用于严格评估点云语义分割中NCD性能的新评测协议。通过在SemanticKITTI、SemanticPOSS和S3DIS数据集上的全面评测，本文表明所提方法显著优于所比较的基线方法。

Riemannian Complex Matrix Convolution Network for PolSAR Image Classification

  • paper_url: http://arxiv.org/abs/2312.03378
  • repo_url: None
  • paper_authors: Junfei Shi, Wei Wang, Haiyan Jin, Mengmeng Nie, Shanshan Ji
  • for: 本研究旨在改进用于PolSAR图像分类的深度学习方法，以便更充分地利用PolSAR数据的特点。
  • methods: 该方法提出了黎曼复矩阵卷积网络，直接以复协方差矩阵作为网络输入，并在黎曼空间中定义相应的运算来学习复矩阵的几何结构特征；此外还开发了一种快速核学习方法，用于学习类别特定特征并降低计算时间。
  • results: 在三个不同波段与传感器的实测PolSAR数据集上的实验表明，所提出的方法性能优于当前最优方法。
    Abstract Recently, deep learning methods have achieved superior performance for Polarimetric Synthetic Aperture Radar(PolSAR) image classification. Existing deep learning methods learn PolSAR data by converting the covariance matrix into a feature vector or complex-valued vector as the input. However, all these methods cannot learn the structure of complex matrix directly and destroy the channel correlation. To learn geometric structure of complex matrix, we propose a Riemannian complex matrix convolution network for PolSAR image classification in Riemannian space for the first time, which directly utilizes the complex matrix as the network input and defines the Riemannian operations to learn complex matrix's features. The proposed Riemannian complex matrix convolution network considers PolSAR complex matrix endowed in Riemannian manifold, and defines a series of new Riemannian convolution, ReLu and LogEig operations in Riemannian space, which breaks through the Euclidean constraint of conventional networks. Then, a CNN module is appended to enhance contextual Riemannian features. Besides, a fast kernel learning method is developed for the proposed method to learn class-specific features and reduce the computation time effectively. Experiments are conducted on three sets of real PolSAR data with different bands and sensors. Experiments results demonstrates the proposed method can obtain superior performance than the state-of-the-art methods.
    摘要 近年来，深度学习方法在极化合成孔径雷达（PolSAR）图像分类中取得了出色的表现。现有深度学习方法通常将协方差矩阵转换为特征向量或复数向量作为网络输入，但这些方法都无法直接学习复矩阵的结构，并且破坏了通道间的相关性。为了学习复矩阵的几何结构，我们首次提出了一种在黎曼空间中用于PolSAR图像分类的黎曼复矩阵卷积网络，直接以复矩阵作为网络输入，并定义黎曼运算来学习复矩阵的特征。所提网络将PolSAR复矩阵视为黎曼流形上的点，并在黎曼空间中定义了一系列新的黎曼卷积、ReLU和LogEig运算，突破了传统网络的欧几里得约束。随后，附加一个CNN模块以增强上下文黎曼特征。此外，我们还为该方法开发了一种快速核学习方法，用于学习类别特定特征并有效降低计算时间。我们在三个不同波段与传感器的实测PolSAR数据集上进行了实验，结果表明所提方法优于当前最优方法。
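The LogEig operation named in the abstract is a standard manifold operation for (Hermitian) positive-definite matrices: take the matrix logarithm through an eigendecomposition, which maps points from the manifold into a flat space where ordinary layers can act. The generic sketch below only illustrates that operation, not the paper's full Riemannian convolution; shapes and the epsilon clamp are assumptions.

```python
import torch

def logeig(cov: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """LogEig for a batch of Hermitian positive-definite PolSAR covariance
    matrices: cov has shape (..., C, C) and complex dtype."""
    eigvals, eigvecs = torch.linalg.eigh(cov)                 # real eigvals, complex eigvecs
    log_vals = torch.log(eigvals.clamp_min(eps)).to(cov.dtype)
    return eigvecs @ torch.diag_embed(log_vals) @ eigvecs.mH  # V log(L) V^H

# Usage sketch: build a Hermitian positive-definite batch and map it.
x = torch.randn(8, 3, 3, dtype=torch.cfloat)
cov = x @ x.mH + torch.eye(3, dtype=torch.cfloat)
flat_features = logeig(cov)
```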

Evaluating the point cloud of individual trees generated from images based on Neural Radiance fields (NeRF) method

  • paper_url: http://arxiv.org/abs/2312.03372
  • repo_url: None
  • paper_authors: Hongyu Huang, Guoji Tian, Chongcheng Chen
  • for: 三维(3D)树重建是精度林业管理和研究中的关键任务。
  • methods: 基于不同相机以不同方式采集的树木图像，使用神经辐射场（NeRF）方法进行单株树木三维重建，并将导出的点云模型与摄影测量重建和激光扫描方法获得的点云进行比较。
  • results: NeRF方法在单株树木三维重建中表现良好：重建成功率更高，在冠层区域重建效果更好，且所需输入图像更少。与摄影测量重建方法相比，NeRF在重建效率上具有显著优势，并能适应复杂场景，但生成的点云较为嘈杂且分辨率较低；从摄影测量点云中提取的树木结构参数（树高、胸径）的精度仍高于NeRF点云。
    Abstract Three-dimensional (3D) reconstruction of trees has always been a key task in precision forestry management and research. Due to the complex branch morphological structure of trees themselves and the occlusions from tree stems, branches and foliage, it is difficult to recreate a complete three-dimensional tree model from a two-dimensional image by conventional photogrammetric methods. In this study, based on tree images collected by various cameras in different ways, the Neural Radiance Fields (NeRF) method was used for individual tree reconstruction and the exported point cloud models are compared with point cloud derived from photogrammetric reconstruction and laser scanning methods. The results show that the NeRF method performs well in individual tree 3D reconstruction, as it has higher successful reconstruction rate, better reconstruction in the canopy area, it requires less amount of images as input. Compared with photogrammetric reconstruction method, NeRF has significant advantages in reconstruction efficiency and is adaptable to complex scenes, but the generated point cloud tends to be noisy and low resolution. The accuracy of tree structural parameters (tree height and diameter at breast height) extracted from the photogrammetric point cloud is still higher than those of derived from the NeRF point cloud. The results of this study illustrate the great potential of NeRF method for individual tree reconstruction, and it provides new ideas and research directions for 3D reconstruction and visualization of complex forest scenes.
    摘要 树木的三维（3D）重建一直是精准林业管理与研究中的关键任务。由于树木自身枝条形态结构复杂，且存在树干、枝条和叶片的相互遮挡，用常规摄影测量方法很难从二维图像重建出完整的三维树木模型。本研究基于不同相机以不同方式采集的树木图像，使用神经辐射场（NeRF）方法进行单株树木重建，并将导出的点云模型与摄影测量重建和激光扫描方法得到的点云进行比较。结果表明，NeRF方法在单株树木三维重建中表现良好：重建成功率更高，在冠层区域重建效果更好，且所需输入图像更少。与摄影测量重建方法相比，NeRF在重建效率上具有显著优势，并能适应复杂场景，但生成的点云往往嘈杂且分辨率较低；从摄影测量点云中提取的树木结构参数（树高和胸径）的精度仍高于从NeRF点云中提取的结果。本研究的结果展示了NeRF方法在单株树木重建方面的巨大潜力，并为复杂森林场景的三维重建与可视化提供了新的思路和研究方向。

Bottom-Up Instance Segmentation of Catheters for Chest X-Rays

  • paper_url: http://arxiv.org/abs/2312.03368
  • repo_url: None
  • paper_authors: Francesca Boccardi, Axel Saalbach, Heinrich Schulz, Samuele Salti, Ilyas Sirazitdinov
  • for: 用于急诊科和重症监护室中验证中心静脉导管和各类置管位置是否正确，并排查相关并发症。
  • methods: 使用基于深度学习的关联嵌入实现导管实例分割，能够有效处理导管交叉等复杂情况。
  • results: 提出了一种能有效处理导管与置管的深度学习方法，有助于减少报告延迟并辅助非专业技术人员读片。
    Abstract Chest X-ray (CXR) is frequently employed in emergency departments and intensive care units to verify the proper placement of central lines and tubes and to rule out related complications. The automation of the X-ray reading process can be a valuable support tool for non-specialist technicians and minimize reporting delays due to non-availability of experts. While existing solutions for automated catheter segmentation and malposition detection show promising results, the disentanglement of individual catheters remains an open challenge, especially in complex cases where multiple devices appear superimposed in the X-ray projection. Moreover, conventional top-down instance segmentation methods are ineffective on such thin and long devices, that often extend through the entire image. In this paper, we propose a deep learning approach based on associative embeddings for catheter instance segmentation, able to overcome those limitations and effectively handle device intersections.
    摘要 胸部X光（CXR）常用于急诊科和重症监护室，以确认中心静脉导管和各类置管放置是否正确，并排除相关并发症。X光读片过程的自动化可以为非专业技术人员提供有价值的辅助，并减少因专家不在场而造成的报告延迟。尽管现有的导管自动分割与错位检测方法已展现出可观的效果，但将各条导管逐一区分开来仍是一个未解决的难题，尤其是在多个器械在X光投影中相互重叠的复杂情况下。此外，这类器械细长且常贯穿整幅图像，传统的自顶向下实例分割方法对其并不有效。本文提出了一种基于关联嵌入的深度学习导管实例分割方法，能够克服上述局限，并有效处理器械交叉的情形。
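Associative embeddings, which the abstract builds on, are typically trained with a generic pull/push loss over per-pixel embeddings: pixels of the same instance are pulled toward their instance mean ("tag"), and the means of different instances are pushed apart. The sketch below shows that generic mechanism; tensor shapes, the margin value, and the assumption of non-empty instance masks are illustrative, not the paper's exact formulation.

```python
import torch

def associative_embedding_loss(embeddings: torch.Tensor,
                               instance_masks: torch.Tensor,
                               margin: float = 1.0) -> torch.Tensor:
    """embeddings: (D, H, W) per-pixel embedding map.
    instance_masks: (K, H, W) binary masks, one non-empty mask per catheter."""
    means, pull = [], 0.0
    for mask in instance_masks.bool():
        inst = embeddings[:, mask]                     # (D, num_pixels)
        mean = inst.mean(dim=1)
        means.append(mean)
        # Pull: pixels of this instance move toward the instance tag.
        pull = pull + ((inst - mean[:, None]) ** 2).sum(dim=0).mean()
    means = torch.stack(means)                         # (K, D)
    k = means.shape[0]
    # Push: penalize pairs of instance tags closer than the margin.
    dists = torch.cdist(means, means)
    off_diag = ~torch.eye(k, dtype=torch.bool, device=means.device)
    push = torch.relu(margin - dists[off_diag]).pow(2).mean() if k > 1 else dists.sum() * 0
    return pull / max(k, 1) + push
```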

RING-NeRF: A Versatile Architecture based on Residual Implicit Neural Grids

  • paper_url: http://arxiv.org/abs/2312.03357
  • repo_url: None
  • paper_authors: Doriand Petit, Steve Bourgeois, Dumitru Pavel, Vincent Gay-Bellile, Florian Chabot, Loic Barthe
  • for: 该论文旨在提升NeRF的训练速度与鲁棒性，以便用于3D重建和新视角合成。
  • methods: 该论文提出基于残差隐式神经网格（RING）的架构，可以控制场景与潜在空间之间映射函数的细节层级，并结合距离感知的前向映射机制与连续的由粗到细重建过程。
  • results: 该架构实现了抗锯齿渲染、在少量监督视点下的高质量重建，以及在缺乏针对场景初始化时基于SDF的NeRF的鲁棒性；此外，该架构还可以动态添加网格以提升重建细节。
    Abstract Since their introduction, Neural Fields have become very popular for 3D reconstruction and new view synthesis. Recent researches focused on accelerating the process, as well as improving the robustness to variation of the observation distance and limited number of supervised viewpoints. However, those approaches often led to dedicated solutions that cannot be easily combined. To tackle this issue, we introduce a new simple but efficient architecture named RING-NeRF, based on Residual Implicit Neural Grids, that provides a control on the level of detail of the mapping function between the scene and the latent spaces. Associated with a distance-aware forward mapping mechanism and a continuous coarse-to-fine reconstruction process, our versatile architecture demonstrates both fast training and state-of-the-art performances in terms of: (1) anti-aliased rendering, (2) reconstruction quality from few supervised viewpoints, and (3) robustness in the absence of appropriate scene-specific initialization for SDF-based NeRFs. We also demonstrate that our architecture can dynamically add grids to increase the details of the reconstruction, opening the way to adaptive reconstruction.
    摘要 自问世以来，神经场在3D重建和新视角合成中广受欢迎。近期的研究致力于加速这一过程，并提升其对观测距离变化和监督视点数量有限情况的鲁棒性，但这些方法往往是难以相互结合的专用方案。为了解决这一问题，我们提出了一种简单而高效的新架构RING-NeRF，其基于残差隐式神经网格，能够控制场景与潜在空间之间映射函数的细节层级。结合距离感知的前向映射机制和连续的由粗到细重建过程，这一通用架构既能快速训练，又在以下方面达到了最先进的水平：（1）抗锯齿渲染；（2）在少量监督视点下的重建质量；（3）在缺乏针对场景的初始化时基于SDF的NeRF的鲁棒性。我们还展示了该架构能够动态添加网格以提升重建细节，为自适应重建开辟了道路。

PointMoment:Mixed-Moment-based Self-Supervised Representation Learning for 3D Point Clouds

  • paper_url: http://arxiv.org/abs/2312.03350
  • repo_url: None
  • paper_authors: Xin Cao, Xinxin Han, Yifan Wang, Mengna Yang, Kang Li
  • for: 本研究旨在提升点云自监督表征学习的效果，从而改善点云数据上下游任务的性能。
  • methods: 本文提出了一种基于高阶混合矩损失函数的点云自监督表征学习方法，无需非对称网络架构、梯度停止等特殊技术。
  • results: 实验结果表明，与此前的无监督学习方法相比，该方法在点云分类和分割等下游任务中取得了更好的性能。
    Abstract Large and rich data is a prerequisite for effective training of deep neural networks. However, the irregularity of point cloud data makes manual annotation time-consuming and laborious. Self-supervised representation learning, which leverages the intrinsic structure of large-scale unlabelled data to learn meaningful feature representations, has attracted increasing attention in the field of point cloud research. However, self-supervised representation learning often suffers from model collapse, resulting in reduced information and diversity of the learned representation, and consequently degrading the performance of downstream tasks. To address this problem, we propose PointMoment, a novel framework for point cloud self-supervised representation learning that utilizes a high-order mixed moment loss function rather than the conventional contrastive loss function. Moreover, our framework does not require any special techniques such as asymmetric network architectures, gradient stopping, etc. Specifically, we calculate the high-order mixed moment of the feature variables and force them to decompose into products of their individual moment, thereby making multiple variables more independent and minimizing the feature redundancy. We also incorporate a contrastive learning approach to maximize the feature invariance under different data augmentations of the same point cloud. Experimental results show that our approach outperforms previous unsupervised learning methods on the downstream task of 3D point cloud classification and segmentation.
    摘要 大量而丰富的数据是深度神经网络训练的前提，但点云数据的不规则性使得人工标注费时费力。自监督表征学习可以利用大规模无标注数据的内在结构来学习有意义的特征表示，从而缓解这一问题，因此在点云研究领域受到越来越多的关注。然而，自监督表征学习常常出现模型坍塌，导致所学表示的信息量和多样性下降，进而影响下游任务的性能。为了解决这个问题，我们提出了PointMoment框架，它使用高阶混合矩损失函数而非传统的对比损失函数进行自监督学习。此外，我们的框架不需要非对称网络架构、梯度停止等特殊技术。具体来说，我们计算特征变量的高阶混合矩，并迫使其分解为各自矩的乘积，从而使多个变量更加独立，减少特征冗余。我们还结合对比学习方法，以最大化同一点云在不同数据增强下的特征不变性。实验结果表明，我们的方法在3D点云分类和分割等下游任务上优于此前的无监督学习方法。
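The mixed-moment factorization idea in the abstract (force E[x^p y^q] toward E[x^p]·E[y^q], which holds when features are independent) can be written as a simple penalty over a feature batch. Using the same order p for both variables is a simplification made here for illustration; the function name and normalization assumptions are not the paper's.

```python
import torch

def mixed_moment_loss(z: torch.Tensor, order: int = 2) -> torch.Tensor:
    """z: (batch, dim) features from the point-cloud encoder, assumed centered.
    Penalizes, for every pair of dimensions, the gap between the mixed moment
    E[x_i^p * x_j^p] and the product of individual moments E[x_i^p] * E[x_j^p]."""
    zp = z ** order                                   # element-wise p-th powers
    mixed = zp.t() @ zp / z.shape[0]                  # E[x_i^p x_j^p], (D, D)
    marg = zp.mean(dim=0)                             # E[x_i^p], (D,)
    target = marg[:, None] * marg[None, :]            # products of individual moments
    off_diag = ~torch.eye(z.shape[1], dtype=torch.bool, device=z.device)
    return ((mixed - target)[off_diag] ** 2).mean()
```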

GraNet: A Multi-Level Graph Network for 6-DoF Grasp Pose Generation in Cluttered Scenes

  • paper_url: http://arxiv.org/abs/2312.03345
  • repo_url: https://github.com/wang-h-w/GraNet
  • paper_authors: Haowen Wang, Wanhao Niu, Chungang Zhuang
  • for: 提升杂乱非结构化场景中六自由度（6-DoF）物体抓取位姿生成的精度和效率
  • methods: 使用图神经网络，在场景、物体和抓取点三个层级构建多级图并传播特征，逐步收敛到理想的抓取位置
  • results: 在大规模GraspNet-1Billion基准上达到最优性能，尤其是在抓取未见过的物体时（+11.62 AP）；真实机器人实验中对散乱物体的抓取成功率很高
    Abstract 6-DoF object-agnostic grasping in unstructured environments is a critical yet challenging task in robotics. Most current works use non-optimized approaches to sample grasp locations and learn spatial features without concerning the grasping task. This paper proposes GraNet, a graph-based grasp pose generation framework that translates a point cloud scene into multi-level graphs and propagates features through graph neural networks. By building graphs at the scene level, object level, and grasp point level, GraNet enhances feature embedding at multiple scales while progressively converging to the ideal grasping locations by learning. Our pipeline can thus characterize the spatial distribution of grasps in cluttered scenes, leading to a higher rate of effective grasping. Furthermore, we enhance the representation ability of scalable graph networks by a structure-aware attention mechanism to exploit local relations in graphs. Our method achieves state-of-the-art performance on the large-scale GraspNet-1Billion benchmark, especially in grasping unseen objects (+11.62 AP). The real robot experiment shows a high success rate in grasping scattered objects, verifying the effectiveness of the proposed approach in unstructured environments.
    摘要 在非结构化环境中进行与物体类别无关的六自由度抓取，是机器人学中一项关键而富有挑战性的任务。现有方法大多采用未经优化的方式采样抓取位置，并在不考虑抓取任务的情况下学习空间特征。本文提出GraNet，一种基于图的抓取位姿生成框架，它将点云场景转化为多级图，并通过图神经网络传播特征。GraNet在场景、物体和抓取点三个层级构建图，在多个尺度上增强特征嵌入，并通过学习逐步收敛到理想的抓取位置，从而刻画杂乱场景中抓取位姿的空间分布，提高有效抓取率。此外，我们通过结构感知的注意力机制利用图中的局部关系，增强可扩展图网络的表示能力。我们的方法在大规模GraspNet-1Billion基准上达到了最优性能，尤其是在抓取未见过的物体时（+11.62 AP）。真实机器人实验中对散乱物体的高抓取成功率验证了所提方法在非结构化环境中的有效性。

PointJEM: Self-supervised Point Cloud Understanding for Reducing Feature Redundancy via Joint Entropy Maximization

  • paper_url: http://arxiv.org/abs/2312.03339
  • repo_url: None
  • paper_authors: Xin Cao, Huan Xia, Xinxin Han, Yifan Wang, Kang Li, Linzhi Su
  • for: 这篇论文主要是为了解决深度学习基于点云数据的处理方法中的监督学习问题,因为手动标注点云数据是时间consuming和劳动密集的。
  • methods: 这篇论文提出了一种自监学习表示学习方法,即PointJEM,它包括一种嵌入方案和基于共同熵的损失函数。嵌入方案将嵌入向量分成不同的部分,每个部分可以学习不同的特征。
  • results: 经过实验 validate,PointJEM可以减少特征之间的相关性,并且在下游任务中比如分类和分割任务中表现竞争性。
    Abstract Most deep learning-based point cloud processing methods are supervised and require large scale of labeled data. However, manual labeling of point cloud data is laborious and time-consuming. Self-supervised representation learning can address the aforementioned issue by learning robust and generalized representations from unlabeled datasets. Nevertheless, the embedded features obtained by representation learning usually contain redundant information, and most current methods reduce feature redundancy by linear correlation constraints. In this paper, we propose PointJEM, a self-supervised representation learning method applied to the point cloud field. PointJEM comprises an embedding scheme and a loss function based on joint entropy. The embedding scheme divides the embedding vector into different parts, each part can learn a distinctive feature. To reduce redundant information in the features, PointJEM maximizes the joint entropy between the different parts, thereby rendering the learned feature variables pairwise independent. To validate the effectiveness of our method, we conducted experiments on multiple datasets. The results demonstrate that our method can significantly reduce feature redundancy beyond linear correlation. Furthermore, PointJEM achieves competitive performance in downstream tasks such as classification and segmentation.
    摘要 大多数基于深度学习的点云处理方法都是有监督的，需要大规模标注数据，而人工标注点云既耗时又费力。自监督表征学习可以从无标注数据中学习鲁棒且具有泛化性的表征，从而缓解这一问题。然而，表征学习得到的嵌入特征通常包含冗余信息，现有方法大多通过线性相关约束来减少特征冗余。本文提出PointJEM，一种应用于点云领域的自监督表征学习方法，包含一种嵌入方案和一个基于联合熵的损失函数。嵌入方案将嵌入向量划分为不同部分，每个部分学习一种有区分性的特征；为了减少特征中的冗余信息，PointJEM最大化不同部分之间的联合熵，使学习到的特征变量两两独立。我们在多个数据集上进行了实验以验证方法的有效性，结果表明，我们的方法能够在线性相关之外显著减少特征冗余，并且在分类和分割等下游任务中取得有竞争力的性能。
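One simple way to make the joint-entropy idea concrete is a Gaussian approximation: estimate each part's entropy and the joint entropy via log-determinants of covariance matrices, and minimize their gap (the total correlation), which pushes the parts toward independence. The Gaussian estimator below is an assumption made purely for illustration, not the paper's entropy estimator.

```python
import torch

def total_correlation_proxy(z: torch.Tensor, num_parts: int = 4, eps: float = 1e-4) -> torch.Tensor:
    """z: (batch, dim) embeddings, dim divisible by num_parts.
    Returns a quantity proportional to sum_i H(part_i) - H(all parts) under a
    Gaussian assumption; minimizing it maximizes joint entropy relative to the
    marginals, i.e. reduces redundancy between parts."""
    z = z - z.mean(dim=0, keepdim=True)

    def logdet_cov(x: torch.Tensor) -> torch.Tensor:
        cov = x.t() @ x / (x.shape[0] - 1) + eps * torch.eye(x.shape[1], device=x.device)
        return torch.logdet(cov)

    h_joint = logdet_cov(z)
    h_parts = sum(logdet_cov(part) for part in z.chunk(num_parts, dim=1))
    return h_parts - h_joint
```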

Building Category Graphs Representation with Spatial and Temporal Attention for Visual Navigation

  • paper_url: http://arxiv.org/abs/2312.03327
  • repo_url: None
  • paper_authors: Xiaobo Hu, Youfang Lin, HeHe Fan, Shuo Wang, Zhihao Wu, Kai Lv
  • for: This paper is written for visual navigation in partially observable environments, with the goal of reaching a target object based on a sequence of partial observations.
  • methods: The proposed method uses a Category Relation Graph (CRG) to learn the knowledge of object category layout relations, and a Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term spatial-temporal dependencies of objects.
  • results: The proposed CRG-TSR method significantly outperforms existing methods regarding both effectiveness and efficiency, as demonstrated by experiments on AI2-THOR.
    Abstract Given an object of interest, visual navigation aims to reach the object's location based on a sequence of partial observations. To this end, an agent needs to 1) learn a piece of certain knowledge about the relations of object categories in the world during training and 2) look for the target object based on the pre-learned object category relations and its moving trajectory in the current unseen environment. In this paper, we propose a Category Relation Graph (CRG) to learn the knowledge of object category layout relations and a Temporal-Spatial-Region (TSR) attention architecture to perceive the long-term spatial-temporal dependencies of objects helping the navigation. We learn prior knowledge of object layout, establishing a category relationship graph to deduce the positions of specific objects. Subsequently, we introduced TSR to capture the relationships of objects in temporal, spatial, and regions within the observation trajectories. Specifically, we propose a Temporal attention module (T) to model the temporal structure of the observation sequence, which implicitly encodes the historical moving or trajectory information. Then, a Spatial attention module (S) is used to uncover the spatial context of the current observation objects based on the category relation graph and past observations. Last, a Region attention module (R) shifts the attention to the target-relevant region. Based on the visual representation extracted by our method, the agent can better perceive the environment and easily learn superior navigation policy. Experiments on AI2-THOR demonstrate our CRG-TSR method significantly outperforms existing methods regarding both effectiveness and efficiency. The code has been included in the supplementary material and will be publicly available.
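
To make the T/S/R decomposition more concrete, here is a minimal PyTorch sketch of how three stacked attention modules of this kind could be wired together. The module boundaries, feature dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TSRAttention(nn.Module):
    """Illustrative temporal -> spatial -> region attention stack (not the paper's code)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)  # over time steps
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)   # over detected objects
        self.region = nn.MultiheadAttention(dim, heads, batch_first=True)    # over image regions

    def forward(self, obs_seq, object_feats, region_feats):
        # obs_seq: (B, T, D) per-step observation embeddings
        # object_feats: (B, N, D) category-graph-conditioned object features
        # region_feats: (B, R, D) features of image regions
        t_out, _ = self.temporal(obs_seq, obs_seq, obs_seq)          # encode trajectory history
        query = t_out[:, -1:, :]                                     # current step as query
        s_out, _ = self.spatial(query, object_feats, object_feats)   # spatial context of objects
        r_out, _ = self.region(s_out, region_feats, region_feats)    # focus on target-relevant region
        return r_out.squeeze(1)                                      # (B, D) visual representation

# Example: 2 episodes, 8 steps, 10 objects, 49 regions, 256-d features
x = TSRAttention()(torch.randn(2, 8, 256), torch.randn(2, 10, 256), torch.randn(2, 49, 256))
```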

GCFA: Geodesic Curve Feature Augmentation via Shape Space Theory

  • paper_url: http://arxiv.org/abs/2312.03325
  • repo_url: None
  • paper_authors: Yuexing Han, Guanxin Wan, Bing Wang
  • for: Improving the performance of data preprocessing models in small-sample settings
  • methods: A Geodesic Curve Feature Augmentation (GCFA) method based on shape space theory
  • results: Reduces information loss and improves the performance of data preprocessing models in small-sample settings
    Abstract Deep learning has yielded remarkable outcomes in various domains. However, the challenge of requiring large-scale labeled samples still persists in deep learning. Thus, data augmentation has been introduced as a critical strategy to train deep learning models. However, data augmentation suffers from information loss and poor performance in small sample environments. To overcome these drawbacks, we propose a feature augmentation method based on shape space theory, i.e., Geodesic curve feature augmentation, called GCFA for brevity. First, we extract features from the image with the neural network model. Then, the multiple image features are projected into a pre-shape space as features. In the pre-shape space, a Geodesic curve is built to fit the features. Finally, the many features generated along the Geodesic curve are used to train the various machine learning models. The GCFA module can be seamlessly integrated with most machine learning methods. The proposed method is simple, effective, and insensitive to small sample sizes. Several examples demonstrate that the GCFA method can greatly improve the performance of the data preprocessing model in a small sample environment.
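
As a rough illustration of the augmentation idea, the sketch below normalizes two feature vectors onto the unit hypersphere (a simplified stand-in for the pre-shape space), connects them with a great-circle geodesic, and samples new features along that curve. The two-point geodesic, the projection, and all function names are simplifying assumptions rather than the paper's algorithm, which fits a Geodesic curve to multiple features.

```python
import numpy as np

def to_preshape(f, eps=1e-8):
    """Center and scale a feature vector to unit norm (simplified pre-shape projection)."""
    f = f - f.mean()
    return f / (np.linalg.norm(f) + eps)

def geodesic_samples(f1, f2, n=8):
    """Sample n features along the great-circle geodesic between two pre-shape features."""
    a, b = to_preshape(f1), to_preshape(f2)
    theta = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))  # angle between the two points
    ts = np.linspace(0.0, 1.0, n)
    # spherical linear interpolation; assumes the two features are not (anti-)parallel
    return np.stack([
        (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
        for t in ts
    ])

feats = geodesic_samples(np.random.randn(512), np.random.randn(512), n=8)  # (8, 512) augmented features
```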

Background Clustering Pre-training for Few-shot Segmentation

  • paper_url: http://arxiv.org/abs/2312.03322
  • repo_url: https://github.com/Carboxy/BCPT
  • paper_authors: Zhimiao Yu, Tiancheng Lin, Yi Xu
  • for: Improving few-shot segmentation (FSS) methods by addressing the merged background problem in existing pre-training schemes.
  • methods: A new pre-training scheme that decouples novel classes from the background via online clustering during pre-training, with base classes guiding the clustering process to improve the quality and stability of the clustering results.
  • results: Experiments on PASCAL-5i and COCO-20i confirm the effectiveness of BCPT.
    Abstract Recent few-shot segmentation (FSS) methods introduce an extra pre-training stage before meta-training to obtain a stronger backbone, which has become a standard step in few-shot learning. Despite the effectiveness, current pre-training scheme suffers from the merged background problem: only base classes are labelled as foregrounds, making it hard to distinguish between novel classes and actual background. In this paper, we propose a new pre-training scheme for FSS via decoupling the novel classes from background, called Background Clustering Pre-Training (BCPT). Specifically, we adopt online clustering to the pixel embeddings of merged background to explore the underlying semantic structures, bridging the gap between pre-training and adaptation to novel classes. Given the clustering results, we further propose the background mining loss and leverage base classes to guide the clustering process, improving the quality and stability of clustering results. Experiments on PASCAL-5i and COCO-20i show that BCPT yields advanced performance. Code will be available.
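
A toy version of the background-clustering step might look like the following: pixel embeddings labelled as merged background are assigned to a set of prototypes that are updated online with an exponential moving average. The number of prototypes, the cosine assignment, and the EMA update rule are assumptions made for illustration; the paper additionally guides clustering with base classes and a background mining loss.

```python
import torch
import torch.nn.functional as F

class BackgroundClusterer:
    """Online clustering of background pixel embeddings into K prototypes (illustrative only)."""

    def __init__(self, dim=256, k=16, momentum=0.99):
        self.protos = F.normalize(torch.randn(k, dim), dim=1)
        self.m = momentum

    @torch.no_grad()
    def assign_and_update(self, bg_embed):
        # bg_embed: (N, D) embeddings of pixels labelled as merged background
        bg_embed = F.normalize(bg_embed, dim=1)
        sim = bg_embed @ self.protos.t()          # cosine similarity to each prototype
        assign = sim.argmax(dim=1)                # pseudo-labels for the background pixels
        for idx in range(self.protos.size(0)):
            members = bg_embed[assign == idx]
            if len(members) > 0:                  # EMA update of the prototype
                updated = self.m * self.protos[idx] + (1 - self.m) * members.mean(0)
                self.protos[idx] = F.normalize(updated, dim=0)
        return assign                             # could supervise a background mining loss

pseudo = BackgroundClusterer().assign_and_update(torch.randn(1024, 256))
```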

Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift

  • paper_url: http://arxiv.org/abs/2312.03318
  • repo_url: None
  • paper_authors: Saurabh Garg, Amrith Setlur, Zachary Chase Lipton, Sivaraman Balakrishnan, Virginia Smith, Aditi Raghunathan
  • for: Studying the combined effect of self-training and contrastive learning, both under distribution shift and in semi-supervised settings.
  • methods: A systematic empirical investigation of self-training and contrastive learning, used individually and in combination, together with a theoretical analysis in a simplified model of distribution shift.
  • results: Under distribution shift, combining self-training and contrastive learning yields 3-8% higher accuracy than either method alone; in the semi-supervised learning setting, the combination offers no synergistic benefit over using each method independently.
    Abstract Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail.
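
The combination studied can be pictured with the two-stage loop sketched below: a contrastively pre-trained encoder initializes a classifier, which is then self-trained on confident pseudo-labels for unlabeled target data. The confidence threshold and the single update shown here are illustrative simplifications, not the paper's training protocol.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_train_step(encoder, head, x_target, threshold=0.9):
    """One self-training update on unlabeled target data, starting from contrastive features."""
    with torch.no_grad():
        probs = F.softmax(head(encoder(x_target)), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf > threshold                      # keep only confident pseudo-labels
    if keep.sum() == 0:
        return None
    logits = head(encoder(x_target[keep]))
    return F.cross_entropy(logits, pseudo[keep])     # minimize w.r.t. encoder/head parameters

# Toy usage with random stand-ins for a contrastively pre-trained encoder and a linear head.
encoder, head = nn.Linear(32, 64), nn.Linear(64, 10)
loss = self_train_step(encoder, head, torch.randn(128, 32))
```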

DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction

  • paper_url: http://arxiv.org/abs/2312.03298
  • repo_url: None
  • paper_authors: Yanlong Li, Chamara Madarasingha, Kanchana Thilakarathna
  • For: Point cloud reconstruction and its applications in interactive service delivery and the future Metaverse.
  • Methods: An effective point cloud reconstruction architecture that combines Masked Auto-Encoding with a Diffusion Model mechanism to remotely reconstruct point cloud data.
  • Results: DiffPMAE exceeds many state-of-the-art methods in terms of auto-encoding and the downstream tasks considered, evaluated on over 60,000 objects from the ShapeNet-55 and ModelNet datasets.
    Abstract Point cloud streaming is increasingly getting popular, evolving into the norm for interactive service delivery and the future Metaverse. However, the substantial volume of data associated with point clouds presents numerous challenges, particularly in terms of high bandwidth consumption and large storage capacity. Despite various solutions proposed thus far, with a focus on point cloud compression, upsampling, and completion, these reconstruction-related methods continue to fall short in delivering high fidelity point cloud output. As a solution, in DiffPMAE, we propose an effective point cloud reconstruction architecture. Inspired by self-supervised learning concepts, we combine Masked Auto-Encoding and a Diffusion Model mechanism to remotely reconstruct point cloud data. By the nature of this reconstruction process, DiffPMAE can be extended to many related downstream tasks including point cloud compression, upsampling and completion. Leveraging the ShapeNet-55 and ModelNet datasets with over 60000 objects, we validate that the performance of DiffPMAE exceeds many state-of-the-art methods in terms of auto-encoding and the downstream tasks considered.
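
The masked auto-encoding half of the pipeline can be sketched as below: point patches are split into a visible set fed to the encoder and a masked set that the (diffusion) decoder would reconstruct. The patch grouping, mask ratio, and tensor layout are assumptions for illustration only.

```python
import torch

def mask_point_patches(patches, mask_ratio=0.6):
    """Split point-cloud patches into visible and masked sets, as in masked auto-encoding.

    patches: (B, G, S, 3) -- B clouds, G patches of S points each (grouping, e.g. via
    farthest point sampling + kNN, is assumed to have been done beforehand).
    """
    B, G, _, _ = patches.shape
    n_keep = int(G * (1 - mask_ratio))
    noise = torch.rand(B, G)                              # random score per patch
    ids_shuffle = noise.argsort(dim=1)                    # low score -> visible
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]
    visible = torch.gather(patches, 1, ids_keep[:, :, None, None].expand(-1, -1, *patches.shape[2:]))
    masked = torch.gather(patches, 1, ids_mask[:, :, None, None].expand(-1, -1, *patches.shape[2:]))
    return visible, masked   # encoder sees `visible`; a diffusion decoder would reconstruct `masked`

vis, msk = mask_point_patches(torch.randn(2, 64, 32, 3))
```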

Cooperative Probabilistic Trajectory Forecasting under Occlusion

  • paper_url: http://arxiv.org/abs/2312.03296
  • repo_url: None
  • paper_authors: Anshul Nayak, Azim Eskandarian
  • for: This paper aims to improve the navigation of safety-critical tasks in situations with occlusion.
  • methods: The paper proposes an end-to-end network that estimates the current state of occluded pedestrians in the reference frame of the ego agent and predicts their trajectory with safety guarantees.
  • results: The predicted trajectories closely match the ground truth trajectories obtained assuming no occlusion, demonstrating the effectiveness of the method for uncertainty-aware navigation among multiple connected agents under occlusion.
    Abstract Perception and planning under occlusion is essential for safety-critical tasks. Occlusion-aware planning often requires communicating the information of the occluded object to the ego agent for safe navigation. However, communicating rich sensor information under adverse conditions during communication loss and limited bandwidth may not be always feasible. Further, in GPS denied environments and indoor navigation, localizing and sharing of occluded objects can be challenging. To overcome this, relative pose estimation between connected agents sharing a common field of view can be a computationally effective way of communicating information about surrounding objects. In this paper, we design an end-to-end network that cooperatively estimates the current states of occluded pedestrian in the reference frame of ego agent and then predicts the trajectory with safety guarantees. Experimentally, we show that the uncertainty-aware trajectory prediction of occluded pedestrian by the ego agent is almost similar to the ground truth trajectory assuming no occlusion. The current research holds promise for uncertainty-aware navigation among multiple connected agents under occlusion.
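
The relative-pose idea amounts to expressing a state observed by a connected agent in the ego agent's coordinate frame. A minimal SE(2) sketch of that transform is given below; the state layout and variable names are hypothetical, and the paper's network additionally models uncertainty and predicts full trajectories.

```python
import numpy as np

def to_ego_frame(state_other, rel_pose):
    """Transform a pedestrian state estimated by a connected agent into the ego frame.

    state_other: (x, y, vx, vy) in the other agent's frame; rel_pose: (tx, ty, yaw) of the
    other agent relative to the ego agent. A simple SE(2) transform used for illustration.
    """
    x, y, vx, vy = state_other
    tx, ty, yaw = rel_pose
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    pos = R @ np.array([x, y]) + np.array([tx, ty])   # rotate then translate the position
    vel = R @ np.array([vx, vy])                      # velocities only rotate
    return np.concatenate([pos, vel])                 # occluded pedestrian state in the ego frame

ego_state = to_ego_frame((2.0, 1.0, 0.5, 0.0), rel_pose=(5.0, -1.0, np.pi / 6))
```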

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2312.03777
  • repo_url: None
  • paper_authors: Xuanimng Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim
  • for: Investigating the robustness of Large Multimodal Models (LMMs) against visual adversarial attacks.
  • methods: A comprehensive evaluation of various LMMs under different adversarial attacks, across image classification, image captioning, and Visual Question Answering (VQA) tasks.
  • results: LMMs are generally not robust to visual adversarial inputs, yet on the ScienceQA task they show only an 8.10% performance drop compared to a 99.73% drop for their visual counterparts; a new real-world image classification approach, query decomposition, adds existence queries to the input prompt to reduce attack effectiveness and improve classification accuracy.
    Abstract Recent advances in instruction tuning have led to the development of State-of-the-Art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answer (VQA). We find that in general LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task with only an 8.10% drop in performance compared to their visual counterparts which dropped 99.73%. We also propose a new approach to real-world image classification which we term query decomposition. By incorporating existence queries into our input prompt we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under-explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
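
The query-decomposition strategy can be illustrated with a simple prompt builder that prepends existence queries before the final classification question. The exact wording and label set below are hypothetical; the paper's prompts may differ.

```python
def build_query_decomposition_prompt(candidate_labels):
    """Prepend existence queries before the classification query (illustrative prompt only)."""
    existence = [f"Is there a {label} in the image? Answer yes or no." for label in candidate_labels]
    final = ("Based on the answers above, which of the following best describes the image: "
             + ", ".join(candidate_labels) + "?")
    return existence + [final]

for q in build_query_decomposition_prompt(["dog", "cat", "bicycle"]):
    print(q)
```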

Class Incremental Learning for Adversarial Robustness

  • paper_url: http://arxiv.org/abs/2312.03289
  • repo_url: None
  • paper_authors: Seungju Cho, Hongsin Lee, Changick Kim
  • for: Exploring Adversarially Robust Class Incremental Learning (ARCIL), which combines adversarial robustness with incremental learning to keep models robust as data accumulates.
  • methods: The study observes that naively combining adversarial training with incremental learning loses robustness because the flatness of the loss function, a characteristic of adversarial training, disappears; a Flatness Preserving Distillation (FPD) loss and a Logit Adjustment Distillation (LAD) loss are proposed to preserve flatness and adapt the model's knowledge to new tasks.
  • results: The method achieves AutoAttack accuracy that is 5.99%p, 5.27%p, and 3.90%p higher on average than the baseline on split CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively, outperforming approaches that apply adversarial training to existing incremental learning methods.
    Abstract Adversarial training integrates adversarial examples during model training to enhance robustness. However, its application in fixed dataset settings differs from real-world dynamics, where data accumulates incrementally. In this study, we investigate Adversarially Robust Class Incremental Learning (ARCIL), a method that combines adversarial robustness with incremental learning. We observe that combining incremental learning with naive adversarial training easily leads to a loss of robustness. We discover that this is attributed to the disappearance of the flatness of the loss function, a characteristic of adversarial training. To address this issue, we propose the Flatness Preserving Distillation (FPD) loss that leverages the output difference between adversarial and clean examples. Additionally, we introduce the Logit Adjustment Distillation (LAD) loss, which adapts the model's knowledge to perform well on new tasks. Experimental results demonstrate the superiority of our method over approaches that apply adversarial training to existing incremental learning methods, which provides a strong baseline for incremental learning on adversarial robustness in the future. Our method achieves AutoAttack accuracy that is 5.99\%p, 5.27\%p, and 3.90\%p higher on average than the baseline on split CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively. The code will be made available.
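
One plausible reading of the FPD loss, based only on the abstract's description of "the output difference between adversarial and clean examples", is sketched below: the adversarial-vs-clean output gap of the current model is kept close to that of the frozen model from previous tasks. The exact formulation, distance, and weighting are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def flatness_preserving_distillation(student, old_model, x_clean, x_adv):
    """Keep the adversarial-vs-clean output gap of the current model close to that of the
    frozen model from previous tasks (illustrative reading of an FPD-style loss)."""
    with torch.no_grad():
        gap_old = old_model(x_adv) - old_model(x_clean)   # previous model's local behavior
    gap_new = student(x_adv) - student(x_clean)           # current model's local behavior
    return F.mse_loss(gap_new, gap_old)

# Toy usage with linear stand-ins for the incremental learner and its frozen predecessor.
student, old_model = torch.nn.Linear(64, 10), torch.nn.Linear(64, 10)
x = torch.randn(8, 64)
loss = flatness_preserving_distillation(student, old_model, x, x + 0.03 * torch.randn_like(x))
```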

Indirect Gradient Matching for Adversarial Robust Distillation

  • paper_url: http://arxiv.org/abs/2312.03286
  • repo_url: None
  • paper_authors: Hongsin Lee, Seungju Cho, Changick Kim
  • for: Improving the adversarial robustness of smaller models via adversarial distillation.
  • methods: An Indirect Gradient Distillation Module (IGDM) that matches the student's input gradient to the teacher's through a Taylor approximation, without directly computing the gradients.
  • results: IGDM effectively enhances the performance of all adversarial distillation methods; integrated into the SOTA method without additional data augmentation on CIFAR-100, AutoAttack accuracy improves from 28.06% to 30.32% for ResNet-18 and from 26.18% to 29.52% for MobileNetV2.
    Abstract Adversarial training significantly improves adversarial robustness, but superior performance is primarily attained with large models. This substantial performance gap for smaller models has spurred active research into adversarial distillation (AD) to mitigate the difference. Existing AD methods leverage the teacher's logits as a guide. In contrast to these approaches, we aim to transfer another piece of knowledge from the teacher, the input gradient. In this paper, we propose a distillation module termed Indirect Gradient Distillation Module (IGDM) that indirectly matches the student's input gradient with that of the teacher. We hypothesize that students can better acquire the teacher's knowledge by matching the input gradient. Leveraging the observation that adversarial training renders the model locally linear on the input space, we employ Taylor approximation to effectively align gradients without directly calculating them. Experimental results show that IGDM seamlessly integrates with existing AD methods, significantly enhancing the performance of all AD methods. Particularly, utilizing IGDM on the CIFAR-100 dataset improves the AutoAttack accuracy from 28.06% to 30.32% with the ResNet-18 model and from 26.18% to 29.52% with the MobileNetV2 model when integrated into the SOTA method without additional data augmentation. The code will be made available.
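
The indirect matching trick rests on a first-order Taylor expansion: since adversarial training makes the model locally linear, f(x + δ) − f(x) ≈ ∇_x f(x) · δ, so aligning output differences aligns input gradients without computing them. A minimal sketch, with an assumed MSE distance and toy models, is shown below.

```python
import torch
import torch.nn.functional as F

def indirect_gradient_matching(student, teacher, x, delta):
    """Match input gradients indirectly via output differences at perturbed inputs
    (sketch of the idea behind IGDM, not the paper's implementation)."""
    with torch.no_grad():
        teacher_diff = teacher(x + delta) - teacher(x)   # ~ grad_x teacher(x) . delta
    student_diff = student(x + delta) - student(x)       # ~ grad_x student(x) . delta
    return F.mse_loss(student_diff, teacher_diff)

student, teacher = torch.nn.Linear(64, 10), torch.nn.Linear(64, 10)
x = torch.randn(8, 64)
loss = indirect_gradient_matching(student, teacher, x, 8 / 255 * torch.randn_like(x).sign())
```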

SO-NeRF: Active View Planning for NeRF using Surrogate Objectives

  • paper_url: http://arxiv.org/abs/2312.03266
  • repo_url: None
  • paper_authors: Keifer Lee, Shubham Gupta, Sunglyoung Kim, Bhargav Makwana, Chao Chen, Chen Feng
  • for: Making NeRF data collection more principled and efficient by actively planning a sequence of views that maximizes reconstruction quality.
  • methods: Surrogate Objectives for Active Radiance Fields (SOAR), a set of interpretable functions that score views using geometric and photometric visual cues (surface coverage, geometric complexity, textural complexity, and ray diversity); a deep network, SOARNet, infers the SOAR scores so that good views can be selected in seconds, without visiting all candidate views or training a radiance field during planning.
  • results: SOARNet achieves an approximately 80x speed-up over the baselines while delivering better or comparable reconstruction quality; SOAR is model-agnostic, generalizing from fully neural-implicit to fully explicit approaches.
    Abstract Despite the great success of Neural Radiance Fields (NeRF), its data-gathering process remains vague with only a general rule of thumb of sampling as densely as possible. The lack of understanding of what actually constitutes good views for NeRF makes it difficult to actively plan a sequence of views that yield the maximal reconstruction quality. We propose Surrogate Objectives for Active Radiance Fields (SOAR), which is a set of interpretable functions that evaluates the goodness of views using geometric and photometric visual cues - surface coverage, geometric complexity, textural complexity, and ray diversity. Moreover, by learning to infer the SOAR scores from a deep network, SOARNet, we are able to effectively select views in mere seconds instead of hours, without the need for prior visits to all the candidate views or training any radiance field during such planning. Our experiments show SOARNet outperforms the baselines with $\sim$80x speed-up while achieving better or comparable reconstruction qualities. We finally show that SOAR is model-agnostic, thus it generalizes across fully neural-implicit to fully explicit approaches.
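
The surrogate objectives can be pictured as a per-view score assembled from the four cues and used for greedy next-best-view selection, as in the toy sketch below. The equal weighting, the greedy loop, and the pre-computed cue values are assumptions; in the paper, SOARNet predicts the scores directly.

```python
import numpy as np

def soar_score(surface_coverage, geometric_complexity, textural_complexity, ray_diversity,
               weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four surrogate cues into a single view score (weights are illustrative)."""
    cues = np.array([surface_coverage, geometric_complexity, textural_complexity, ray_diversity])
    return float(np.dot(weights, cues))

def select_next_view(candidate_cues, visited):
    """Greedy next-best-view selection over pre-computed cue values for candidate views."""
    scores = {v: soar_score(*c) for v, c in candidate_cues.items() if v not in visited}
    return max(scores, key=scores.get)

views = {"v0": (0.4, 0.7, 0.3, 0.5), "v1": (0.8, 0.2, 0.6, 0.4), "v2": (0.5, 0.5, 0.5, 0.9)}
print(select_next_view(views, visited={"v0"}))
```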

FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability

  • paper_url: http://arxiv.org/abs/2312.03775
  • repo_url: None
  • paper_authors: Linze Li, Sunqi Fan, Hengjun Pu, Zhaodong Bing, Yao Tang, Tianzhu Ye, Tong Yang, Liangyu Chen, Jiajun Liang
  • for: Improving the fidelity and editability of face-related video generation and expanding the creative possibilities of facial animation.
  • methods: An anchor-frame concept that preserves the generative ability of the original text-to-image model when a motion module is incorporated, realized through both a training-free and a training-based anchor frame method.
  • results: Validated on multiple DreamBooth and LoRA models, the approach improves facial fidelity, text-to-image editability, and video motion; conditional control with a 3D parametric face model further captures accurate facial movements and expressions.
    Abstract Over recent years, diffusion models have facilitated significant advancements in video generation. Yet, the creation of face-related videos still confronts issues such as low facial fidelity, lack of frame consistency, limited editability and uncontrollable human poses. To address these challenges, we introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities while ensuring frame consistency. This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models when incorporating a motion module. We propose two strategies towards this objective: training-free and training-based anchor frame methods. Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models, delivering substantial improvements over the original outcomes in terms of facial fidelity, text-to-image editability, and video motion. Moreover, we introduce conditional control using a 3D parametric face model to capture accurate facial movements and expressions. This solution augments the creative possibilities for facial animation generation through the integration of multiple control signals. For additional samples, please visit https://anonymous.4open.science/r/FAAC.

OctreeOcc: Efficient and Multi-Granularity Occupancy Prediction Using Octree Queries

  • paper_url: http://arxiv.org/abs/2312.03774
  • repo_url: None
  • paper_authors: Yuhang Lu, Xinge Zhu, Tai Wang, Yuexin Ma
  • for: Predicting 3D occupancy for fine-grained scene understanding.
  • methods: An octree representation that adaptively captures information in 3D scenes, with image semantic information used to improve the accuracy of the initial octree structure and a rectification mechanism to refine it iteratively.
  • results: In extensive evaluations, OctreeOcc not only surpasses state-of-the-art methods but also reduces computational overhead by 15%-24% compared to dense-grid-based methods.
    Abstract Occupancy prediction has increasingly garnered attention in recent years for its fine-grained understanding of 3D scenes. Traditional approaches typically rely on dense, regular grid representations, which often leads to excessive computational demands and a loss of spatial details for small objects. This paper introduces OctreeOcc, an innovative 3D occupancy prediction framework that leverages the octree representation to adaptively capture valuable information in 3D, offering variable granularity to accommodate object shapes and semantic regions of varying sizes and complexities. In particular, we incorporate image semantic information to improve the accuracy of initial octree structures and design an effective rectification mechanism to refine the octree structure iteratively. Our extensive evaluations show that OctreeOcc not only surpasses state-of-the-art methods in occupancy prediction, but also achieves a 15%-24% reduction in computational overhead compared to dense-grid-based methods.
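
The adaptive-granularity idea behind the octree queries can be illustrated with a toy octree that subdivides only where finer occupancy detail is needed (for example, where image semantics suggest objects). The node layout, the `needs_detail` predicate, and the depth limit are assumptions; the paper operates on transformer queries arranged in such a structure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OctreeNode:
    """A cubical region of the scene; subdivided only where finer occupancy detail is needed."""
    center: tuple
    size: float
    depth: int
    children: List["OctreeNode"] = field(default_factory=list)

def subdivide(node, needs_detail, max_depth=3):
    """Recursively split nodes flagged by `needs_detail` (e.g. derived from image semantics)."""
    if node.depth >= max_depth or not needs_detail(node):
        return node
    half, (cx, cy, cz) = node.size / 2, node.center
    for dx in (-0.25, 0.25):            # child centers sit a quarter of the parent size away
        for dy in (-0.25, 0.25):
            for dz in (-0.25, 0.25):
                child = OctreeNode((cx + dx * node.size, cy + dy * node.size, cz + dz * node.size),
                                   half, node.depth + 1)
                node.children.append(subdivide(child, needs_detail, max_depth))
    return node

root = subdivide(OctreeNode((0.0, 0.0, 0.0), 1.0, 0), needs_detail=lambda n: n.depth < 2)
```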

Human Body Model based ID using Shape and Pose Parameters

  • paper_url: http://arxiv.org/abs/2312.03227
  • repo_url: None
  • paper_authors: Aravind Sundaresan, Brian Burns, Indranil Sur, Yi Yao, Xiao Lin, Sujeong Kim
  • for: A Human Body model based IDentification (HMID) system jointly trained for shape, pose, and biometric identification.
  • methods: The system builds on the Human Mesh Recovery (HMR) network and adds losses to improve and stabilize shape estimation and biometric identification while preserving the pose and shape output.
  • results: When trained with the additional shape and pose losses, the HMID network shows a significant improvement in biometric identification performance over an identical model trained without them.
    Abstract We present a Human Body model based IDentification system (HMID) system that is jointly trained for shape, pose and biometric identification. HMID is based on the Human Mesh Recovery (HMR) network and we propose additional losses to improve and stabilize shape estimation and biometric identification while maintaining the pose and shape output. We show that when our HMID network is trained using additional shape and pose losses, it shows a significant improvement in biometric identification performance when compared to an identical model that does not use such losses. The HMID model uses raw images instead of silhouettes and is able to perform robust recognition on images collected at range and altitude as many anthropometric properties are reasonably invariant to clothing, view and range. We show results on the USF dataset as well as the BRIAR dataset which includes probes with both clothing and view changes. Our approach (using body model losses) shows a significant improvement in Rank20 accuracy and True Accuracy Rate on the BRIAR evaluation dataset.
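
A minimal sketch of a jointly weighted objective in the spirit of the paper is shown below, combining an identification term with shape and pose terms. The SMPL-style parameter names, loss forms, and weights are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def hmid_joint_loss(pred, target, w_id=1.0, w_shape=0.1, w_pose=0.1):
    """Weighted combination of identification, shape, and pose terms (illustrative only)."""
    id_loss = F.cross_entropy(pred["id_logits"], target["identity"])   # biometric identification
    shape_loss = F.l1_loss(pred["betas"], target["betas"])             # SMPL-style shape parameters
    pose_loss = F.l1_loss(pred["thetas"], target["thetas"])            # pose parameters
    return w_id * id_loss + w_shape * shape_loss + w_pose * pose_loss

pred = {"id_logits": torch.randn(4, 100), "betas": torch.randn(4, 10), "thetas": torch.randn(4, 72)}
target = {"identity": torch.randint(0, 100, (4,)), "betas": torch.randn(4, 10), "thetas": torch.randn(4, 72)}
loss = hmid_joint_loss(pred, target)
```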

Rethinking Object Saliency Ranking: A Novel Whole-flow Processing Paradigm

  • paper_url: http://arxiv.org/abs/2312.03226
  • repo_url: https://github.com/mengkesong/saliency-ranking-paradigm
  • paper_authors: Mengke Song, Linfeng Li, Dunquan Wu, Wenfeng Song, Chenglizhao Chen
  • for: Better modeling the relative importance of multiple salient objects and the relationships among them.
  • methods: A whole-flow processing paradigm for saliency ranking covering GT data generation, network structure design, and the training protocol.
  • results: Outperforms existing state-of-the-art methods on the widely used SALICON set.
    Abstract Existing salient object detection methods are capable of predicting binary maps that highlight visually salient regions. However, these methods are limited in their ability to differentiate the relative importance of multiple objects and the relationships among them, which can lead to errors and reduced accuracy in downstream tasks that depend on the relative importance of multiple objects. To overcome this, this paper proposes a new paradigm for saliency ranking, which aims to completely focus on ranking salient objects by their "importance order". While previous works have shown promising performance, they still face ill-posed problems. First, the saliency ranking ground truth (GT) order generation methods are unreasonable since determining the correct ranking order is not well-defined, resulting in false alarms. Second, training a ranking model remains challenging because most saliency ranking methods follow the multi-task paradigm, leading to conflicts and trade-offs among different tasks. Third, existing regression-based saliency ranking methods are complex for saliency ranking models due to their reliance on instance mask-based saliency ranking orders. These methods require a significant amount of data to perform accurately and can be challenging to implement effectively. To solve these problems, this paper conducts an in-depth analysis of the causes and proposes a whole-flow processing paradigm for the saliency ranking task from the perspective of "GT data generation", "network structure design" and "training protocol". The proposed approach outperforms existing state-of-the-art methods on the widely-used SALICON set, as demonstrated by extensive experiments with fair and reasonable comparisons. The saliency ranking task is still in its infancy, and our proposed unified framework can serve as a fundamental strategy to guide future work.

Predicting Scores of Various Aesthetic Attribute Sets by Learning from Overall Score Labels

  • paper_url: http://arxiv.org/abs/2312.03222
  • repo_url: None
  • paper_authors: Heng Huang, Xin Jin, Yaqi Liu, Hao Lou, Chaoen Xiao, Shuai Cui, Xinning Li, Dongqing Zou
  • for: This paper aims to develop a novel aesthetic attribute evaluation framework to predict attribute scores and overall scores for images.
  • methods: The proposed framework, called F2S (attribute features to attribute scores), uses networks from different tasks to provide attribute features and leverages an aesthetic attribute contribution to describe the role of aesthetic attributes in an image.
  • results: The proposed F2S model achieves comparable performance with those trained on fully-annotated aesthetic attribute score labels, making it feasible to learn meaningful attribute scores for various aesthetic attribute sets in different types of images with only overall aesthetic scores.
    Abstract Many mobile phones now embed deep-learning models for evaluation or guidance on photography. These models cannot provide detailed results such as human pose scores or scene color scores because of the scarcity of corresponding aesthetic attribute data. However, the annotation of image aesthetic attribute scores requires experienced artists and professional photographers, which hinders the collection of large-scale fully-annotated datasets. In this paper, we propose to replace image attribute labels with feature extractors. First, a novel aesthetic attribute evaluation framework based on attribute features is proposed to predict attribute scores and overall scores. We call it the F2S (attribute features to attribute scores) model. We use networks from different tasks to provide attribute features to our F2S models. Then, we define an aesthetic attribute contribution to describe the role of aesthetic attributes throughout an image and use it with the attribute scores and the overall scores to train our F2S model. Sufficient experiments on publicly available datasets demonstrate that our F2S model achieves comparable performance with those trained on datasets with fully-annotated aesthetic attribute score labels. Our method makes it feasible to learn meaningful attribute scores for various aesthetic attribute sets in different types of images with only overall aesthetic scores.
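
A minimal sketch of an attribute-features-to-scores head is shown below: frozen features from networks trained on different tasks are mapped to per-attribute scores, and a learned contribution vector aggregates them into an overall score that can be supervised with overall labels only. The dimensions, the MLP heads, and the softmax aggregation are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class F2SHead(nn.Module):
    """Map attribute features from different networks to attribute scores and an overall score."""

    def __init__(self, feat_dims, hidden=128):
        super().__init__()
        self.attr_heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1)) for d in feat_dims]
        )
        # learned "aesthetic attribute contribution" weights (illustrative)
        self.contribution = nn.Parameter(torch.ones(len(feat_dims)) / len(feat_dims))

    def forward(self, attr_feats):
        # attr_feats: list of (B, D_i) features from networks trained on different tasks
        scores = torch.cat([head(f) for head, f in zip(self.attr_heads, attr_feats)], dim=1)  # (B, A)
        overall = (scores * torch.softmax(self.contribution, dim=0)).sum(dim=1, keepdim=True)
        return scores, overall  # train `overall` against the overall aesthetic label

scores, overall = F2SHead([512, 256])([torch.randn(4, 512), torch.randn(4, 256)])
```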

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

  • paper_url: http://arxiv.org/abs/2312.03209
  • repo_url: None
  • paper_authors: Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, Christian Rupprecht, Daniel Cremers, Peter Vajda, Jialiang Wang
  • for: Improving the efficiency of diffusion-based image generation so that higher-quality images can be produced at the same computational cost.
  • methods: An analysis of the layers inside the denoising network showing that their outputs change smoothly over time, follow distinct patterns, and change very little from step to step; based on these findings, Block Caching reuses outputs of layer blocks from previous steps to speed up inference, and a further technique automatically determines caching schedules from each block's changes over timesteps.
  • results: FID, human evaluation, and qualitative analysis show that Block Caching generates images with higher visual quality at the same computational cost, demonstrated on different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
    Abstract Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation and qualitative analysis that Block Caching allows to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
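
The caching mechanism can be sketched as a small wrapper that recomputes a block's output only on scheduled steps and otherwise reuses the cached tensor. The fixed `refresh_every` schedule below is a simplification; the paper derives per-block schedules automatically from how much each block's output changes across timesteps.

```python
import torch

class BlockCache:
    """Reuse a block's output from a previous denoising step when it is scheduled as cached."""

    def __init__(self, refresh_every=2):
        self.refresh_every = refresh_every   # simplistic fixed schedule (illustrative)
        self.store = {}

    def run(self, name, block, x, step):
        if step % self.refresh_every == 0 or name not in self.store:
            self.store[name] = block(x)      # recompute and cache on scheduled steps
        return self.store[name]              # otherwise reuse the cached output

cache = BlockCache(refresh_every=2)
block = torch.nn.Linear(16, 16)              # stand-in for a U-Net layer block
outputs = [cache.run("mid_block_0", block, torch.randn(1, 16), step=t) for t in range(6)]
```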

Satellite Imagery and AI: A New Era in Ocean Conservation, from Research to Deployment and Impact

  • paper_url: http://arxiv.org/abs/2312.03207
  • repo_url: None
  • paper_authors: Patrick Beukema, Favyen Bastani, Piper Wolters, Henry Herzog, Joe Ferdinando
  • For: Those interested in using satellite data to monitor and prevent illegal, unreported, and unregulated (IUU) fishing.
  • Methods: Three specialized computer vision models for synthetic aperture radar (Sentinel-1), optical imagery (Sentinel-2), and nighttime lights (Suomi-NPP/NOAA-20) to monitor maritime activities.
  • Results: Real-time computer vision services for conservation, deployed in Skylight, a free online platform for maritime monitoring.
    Abstract Illegal, unreported, and unregulated (IUU) fishing poses a global threat to ocean habitats. Publicly available satellite data offered by NASA and the European Space Agency (ESA) provide an opportunity to actively monitor this activity. Effectively leveraging satellite data for maritime conservation requires highly reliable machine learning models operating globally with minimal latency. This paper introduces three specialized computer vision models designed for synthetic aperture radar (Sentinel-1), optical imagery (Sentinel-2), and nighttime lights (Suomi-NPP/NOAA-20). It also presents best practices for developing and delivering real-time computer vision services for conservation. These models have been deployed in Skylight, a real time maritime monitoring platform, which is provided at no cost to users worldwide.

Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields

  • paper_url: http://arxiv.org/abs/2312.03203
  • repo_url: None
  • paper_authors: Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, Achuta Kadambi
  • for: Extending radiance fields beyond view synthesis to semantically aware tasks such as editing and segmentation.
  • methods: 3D Gaussian Splatting for radiance field rendering, combined with distillation from 2D foundation models to obtain 3D feature fields, together with architectural and training changes that avoid warp-level divergence.
  • results: The method provides comparable or better results while being significantly faster to train and render, and is the first to enable point and bounding-box prompting for radiance field manipulation by leveraging the SAM model.
    Abstract 3D scene representations have gained immense popularity in recent years. Methods that use Neural Radiance fields are versatile for traditional tasks such as novel view synthesis. In recent times, some work has emerged that aims to extend the functionality of NeRF beyond view synthesis, for semantically aware tasks such as editing and segmentation using 3D feature field distillation from 2D foundation models. However, these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines, and (b) implicitly represented feature fields suffer from continuity artifacts reducing feature quality. Recently, 3D Gaussian Splatting has shown state-of-the-art performance on real-time radiance field rendering. In this work, we go one step further: in addition to radiance field rendering, we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework leads to warp-level divergence. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general, and our experiments showcase novel view semantic segmentation, language-guided editing and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments, our distillation method is able to provide comparable or better results, while being significantly faster to both train and render. Additionally, to the best of our knowledge, we are the first method to enable point and bounding-box prompting for radiance field manipulation, by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/
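
The distillation objective can be sketched as a per-pixel discrepancy between the rendered feature map and the 2D foundation model's feature map, as below. The cosine-plus-L1 combination and the channel-matching assumption are illustrative choices, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(rendered_feat, teacher_feat):
    """Distill a 2D foundation model's feature map into the rendered 3D feature field.

    rendered_feat / teacher_feat: (B, C, H, W); channel dimensions are assumed to already
    match (e.g. via a learned projection of the teacher features).
    """
    if rendered_feat.shape[1] != teacher_feat.shape[1]:
        raise ValueError("channel dimensions must match (e.g. via a learned projection)")
    cos = 1 - F.cosine_similarity(rendered_feat, teacher_feat, dim=1).mean()  # directional agreement
    return F.l1_loss(rendered_feat, teacher_feat) + cos                       # plus magnitude agreement

loss = feature_distillation_loss(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```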