cs.CV - 2023-11-29

Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17919
  • repo_url: https://github.com/dangeng/visual_anagrams
  • paper_authors: Daniel Geng, Inbum Park, Andrew Owens
  • for: Addresses the problem of synthesizing multi-view optical illusions, such as images that change appearance upon flipping or rotating.
  • methods: Uses off-the-shelf text-to-image diffusion models to obtain these illusions zero-shot. During the reverse diffusion process, the method estimates the noise from different views of a noisy image and combines these estimates to denoise the image.
  • results: The method is effective and flexible, as demonstrated by both qualitative and quantitative results. The approach also extends to illusions with more than two views; additional results are available on the project webpage.
    Abstract We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image. We then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram--an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/
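A minimal sketch of the noise-combination step described in the abstract: at each reverse-diffusion step the noisy image is transformed into each view, noise is predicted per view, and the estimates are mapped back and combined (simple averaging here). `model`, `views`, and `inverse_views` are hypothetical placeholders for a text-conditioned epsilon-prediction model and the pixel permutations with their inverses; the paper's exact combination rule and sampler details are not reproduced.

```python
import torch

def combined_noise_estimate(model, x_t, t, views, inverse_views, prompts):
    """Average noise predictions across views, each mapped back to the base view."""
    estimates = []
    for view, inv_view, prompt in zip(views, inverse_views, prompts):
        eps = model(view(x_t), t, prompt)      # predict noise in the transformed view
        estimates.append(inv_view(eps))        # undo the transformation on the estimate
    return torch.stack(estimates).mean(dim=0)  # feed this into a standard DDPM/DDIM update
```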

Do text-free diffusion models learn discriminative visual representations?

  • paper_url: http://arxiv.org/abs/2311.17921
  • repo_url: None
  • paper_authors: Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava
  • for: Explores whether a single unsupervised model can address both generative and discriminative tasks, extending the use of diffusion models beyond generation.
  • methods: Builds on diffusion models, the state of the art for generative tasks, and proposes a novel attention mechanism for pooling U-Net feature maps, combined with a transformer-based feature fusion (DifFormer) and a diffusion-tailored feedback mechanism (DifFeed).
  • results: With the proposed fusion and feedback mechanisms, diffusion features are competitive with state-of-the-art unsupervised representations on tasks such as image classification, fine-grained classification, object detection, and segmentation.
    Abstract While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks - image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (https://mgwillia.github.io/diffssl/) and code (https://github.com/soumik-kanad/diffssl) are available publicly.
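As a rough illustration of attention-based pooling over diffusion U-Net features, the sketch below cross-attends a learned query to a set of feature tokens gathered from several blocks and noise steps. It is a generic pooling module under assumed shapes (`dim`, `num_heads`, token layout), not the paper's DifFormer or DifFeed architectures.

```python
import torch
import torch.nn as nn

class AttentionFeaturePool(nn.Module):
    """Pools a set of feature tokens into one descriptor with a learned query."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, dim), tokens gathered from several U-Net blocks / noise steps
        q = self.query.expand(feats.shape[0], -1, -1)
        pooled, _ = self.attn(q, feats, feats)   # cross-attend the query to all tokens
        return pooled.squeeze(1)                 # (batch, dim), e.g. for a linear probe

# usage: AttentionFeaturePool()(torch.randn(2, 196, 256))  # -> shape (2, 256)
```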

A Simple Recipe for Language-guided Domain Generalized Segmentation

  • paper_url: http://arxiv.org/abs/2311.17922
  • repo_url: None
  • paper_authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
  • for: Aims to improve the generalization of semantic segmentation networks to domains unseen during training, a key requirement for real-world deployment.
  • methods: Uses language as the source of randomization, with three key ingredients: preserving the intrinsic robustness of CLIP through minimal fine-tuning, language-driven local style augmentation, and randomly mixing source and augmented styles during training.
  • results: Extensive experiments report state-of-the-art results on various domain generalization benchmarks. The code will be made available.
    Abstract Generalization to new domains not seen during training is one of the long-standing goals and challenges in deploying neural networks in real-world applications. Existing generalization techniques necessitate substantial data augmentation, potentially sourced from external datasets, and aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities, along with the potential of bridging different modalities. For instance, the recent advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper, we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning, ii) language-driven local style augmentation, and iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. The code will be made available.

Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2311.17918
  • repo_url: https://github.com/bravegroup/drive-wm
  • paper_authors: Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang
  • for: Aims to improve the safety and efficiency of autonomous vehicles by predicting future events and evaluating foreseeable risks.
  • methods: Proposes Drive-WM, the first driving world model compatible with existing end-to-end planning models. Through joint spatial-temporal modeling facilitated by view factorization, the model generates high-fidelity multiview videos of driving scenes.
  • results: Drive-WM generates high-quality, consistent, and controllable multiview videos, can roll out multiple futures from distinct driving maneuvers, and determines the optimal trajectory according to image-based rewards. Evaluation on real-world driving datasets confirms high-quality, reliable driving simulation.
    Abstract In autonomous driving, predicting future events in advance and evaluating the foreseeable risks empowers autonomous vehicles to better plan their actions, enhancing safety and efficiency on the road. To this end, we propose Drive-WM, the first driving world model compatible with existing end-to-end planning models. Through a joint spatial-temporal modeling facilitated by view factorization, our model generates high-fidelity multiview videos in driving scenes. Building on its powerful generation ability, we showcase the potential of applying the world model for safe driving planning for the first time. Particularly, our Drive-WM enables driving into multiple futures based on distinct driving maneuvers, and determines the optimal trajectory according to the image-based rewards. Evaluation on real-world driving datasets verifies that our method could generate high-quality, consistent, and controllable multiview videos, opening up possibilities for real-world simulations and safe planning.
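The planning step described in the abstract, rolling the world model forward under several candidate maneuvers and keeping the one with the highest image-based reward, can be sketched as a simple selection loop. `world_model` and `reward_model` are hypothetical stand-ins for Drive-WM's multiview video generator and its image-based reward; only the selection logic is shown.

```python
def select_maneuver(world_model, reward_model, state, candidate_maneuvers):
    """Return the maneuver whose imagined future scores highest under the reward."""
    best_maneuver, best_reward = None, float("-inf")
    for maneuver in candidate_maneuvers:
        future_frames = world_model(state, maneuver)    # imagined multiview rollout
        reward = float(reward_model(future_frames))     # image-based score of that future
        if reward > best_reward:
            best_maneuver, best_reward = maneuver, reward
    return best_maneuver
```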

AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text

  • paper_url: http://arxiv.org/abs/2311.17917
  • repo_url: https://github.com/magic-research/avatarstudio
  • paper_authors: Jianfeng Zhang, Xuanmeng Zhang, Huichao Zhang, Jun Hao Liew, Chenxu Zhang, Yi Yang, Jiashi Feng
  • for: Creating high-fidelity and animatable 3D human avatars from only textual descriptions.
  • methods: Proposes AvatarStudio, a coarse-to-fine generative model that starts from a low-resolution NeRF-based representation and then incorporates SMPL-guided articulation into an explicit mesh representation to support avatar animation and high-resolution rendering; a DensePose-conditioned 2D diffusion model supervises training via Score Distillation Sampling.
  • results: AvatarStudio creates high-quality, animation-ready avatars from text and performs well across applications such as multimodal avatar animation and style-guided avatar creation. More results: http://jeff95.me/projects/avatarstudio.html.
    Abstract We study the problem of creating high-fidelity and animatable 3D avatars from only textual descriptions. Existing text-to-avatar methods are either limited to static avatars which cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, followed by incorporating SMPL-guided articulation into the explicit mesh representation to support avatar animation and high resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text that are ready for animation, significantly outperforming previous methods. Moreover, it is competent for many applications, e.g., multimodal avatar animations and style-guided avatar creation. For more results, please refer to our project page: http://jeff95.me/projects/avatarstudio.html

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

  • paper_url: http://arxiv.org/abs/2311.17911
  • repo_url: https://github.com/shikiw/opera
  • paper_authors: Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu
  • for: Aims to alleviate the hallucination problem of multi-modal large language models (MLLMs), improving their reliability in real-world applications.
  • methods: Proposes a new MLLM decoding method based on an over-trust penalty and a retrospection-allocation strategy, requiring no additional data, knowledge, or training.
  • results: Extensive experiments show significant hallucination reduction across different MLLMs and metrics, demonstrating the method's effectiveness and generality.
    Abstract Hallucination, posed as a pervasive challenge of multi-modal large language models (MLLMs), has significantly impeded their real-world usage that demands precise judgment. Existing methods mitigate this issue with either training with specific designed data or inferencing with external knowledge from other sources, incurring inevitable additional costs. In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate the hallucination issue without additional data, knowledge, or training. Our approach begins with an interesting observation that, most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a few summary tokens, but not all the previous tokens. Such partial over-trust inclination results in the neglecting of image tokens and describes the image content with hallucination. Statistically, we observe an 80%$\sim$95% co-currency rate between hallucination contents and such knowledge aggregation patterns. Based on the observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens, and re-allocate the token selection if necessary. With extensive experiments, OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality. Our code is available at: https://github.com/shikiw/OPERA.
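A schematic of the over-trust signal the abstract describes: when recent self-attention mass piles onto a single "summary" token, a penalty can be subtracted from the beam score. The scoring below is only illustrative; OPERA's actual penalty term and its retrospection-allocation rollback differ in detail.

```python
import torch

def over_trust_penalty(attn_weights: torch.Tensor, window: int = 8) -> torch.Tensor:
    """Measure how strongly recent generation concentrates on one context token.

    attn_weights: (num_generated, num_context) self-attention from each generated
    token to its context. A single dominant column is the 'summary token' pattern
    linked to hallucination; subtracting a scaled version of this score from the
    beam score discourages it.
    """
    recent = attn_weights[-window:]          # rows for the last few generated tokens
    column_mass = recent.sum(dim=0)          # total attention received per context token
    return column_mass.max() / (column_mass.sum() + 1e-8)

# usage: score = log_prob - 2.0 * over_trust_penalty(torch.rand(32, 128))
```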

HUGS: Human Gaussian Splats

  • paper_url: http://arxiv.org/abs/2311.17910
  • repo_url: None
  • paper_authors: Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, Anurag Ranjan
  • for: Synthesizing animatable humans from monocular video by disentangling the static scene and a fully animatable human avatar.
  • methods: Represents the human together with the scene using 3D Gaussian Splatting (3DGS). The human Gaussians are initialized with the SMPL body model but allowed to deviate from it to capture details such as clothing and hair; linear blend skinning (LBS) weights are jointly optimized to coordinate the motion of individual Gaussians during animation.
  • results: Achieves state-of-the-art rendering quality at 60 FPS while training ~100x faster than previous work. Code will be released at https://github.com/apple/ml-hugs.
    Abstract Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS) that represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of (50-100) frames, and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g. cloth, hairs), we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges, including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of human and novel view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being ~100x faster to train over previous work. Our code will be announced here: https://github.com/apple/ml-hugs
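The articulation step, moving each Gaussian with linear blend skinning so the avatar can be posed, can be sketched generically as below. `lbs_weights` and `joint_transforms` follow the usual SMPL conventions; this is a plain LBS sketch applied to Gaussian centers, not the HUGS implementation.

```python
import torch

def skin_gaussian_centers(centers, lbs_weights, joint_transforms):
    """Deform Gaussian centers with linear blend skinning (LBS).

    centers: (N, 3) canonical positions; lbs_weights: (N, J) per-Gaussian skinning
    weights (rows sum to 1); joint_transforms: (J, 4, 4) posed joint transforms.
    """
    homog = torch.cat([centers, torch.ones_like(centers[:, :1])], dim=-1)  # (N, 4)
    blended = torch.einsum("nj,jab->nab", lbs_weights, joint_transforms)   # per-Gaussian 4x4
    posed = torch.einsum("nab,nb->na", blended, homog)                     # apply to each center
    return posed[:, :3]

# usage: skin_gaussian_centers(torch.rand(1000, 3),
#                              torch.softmax(torch.rand(1000, 24), dim=-1),
#                              torch.eye(4).repeat(24, 1, 1))
```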

Language-conditioned Detection Transformer

  • paper_url: http://arxiv.org/abs/2311.17902
  • repo_url: https://github.com/janghyuncho/decola
  • paper_authors: Jang Hyun Cho, Philipp Krähenbühl
  • for: Develops a new open-vocabulary detection framework for detecting objects beyond a fixed label set.
  • methods: Uses both image-level labels and detailed detection annotations in three steps: a language-conditioned object detector is first trained on fully-supervised detection data; this detector then pseudo-labels images that carry only image-level labels; finally, an unconditioned open-vocabulary detector is trained on the pseudo-annotated images.
  • results: The resulting detector, DECOLA, shows strong zero-shot performance on the open-vocabulary LVIS benchmark as well as direct zero-shot transfer to LVIS, COCO, Object365, and OpenImages, outperforming prior art by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS. DECOLA achieves state-of-the-art results across model sizes, architectures, and datasets while training only on open-sourced data with academic-scale compute.
    Abstract We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training, and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Our detector provides much more accurate pseudo-labels than prior approaches with its conditioning mechanism. Finally, we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector, named DECOLA, shows strong zero-shot performance in open-vocabulary LVIS benchmark as well as direct zero-shot transfer benchmarks on LVIS, COCO, Object365, and OpenImages. DECOLA outperforms the prior arts by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results in various model sizes, architectures, and datasets by only training on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA.

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2311.17893
  • repo_url: None
  • paper_authors: Shuangrui Ding, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong
  • for: Proposes a simple yet effective approach for self-supervised video object segmentation (VOS).
  • methods: The key insight is that the inherent structural dependencies in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos; hierarchical clustering on these attention-derived correspondences then yields object segmentation masks.
  • results: Achieves state-of-the-art performance on multiple unsupervised VOS benchmarks and particularly excels on complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19.
    Abstract In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques majorly resort to auxiliary modalities or utilize iterative slot attention to assist in object discovery, which restricts their general applicability and imposes higher computational requirements. To deal with these challenges, we develop a simplified architecture that capitalizes on the emerging objectness from DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single spatio-temporal Transformer block to process the frame-wise DINO features and establish spatio-temporal dependencies in the form of self-attention. Subsequently, utilizing these attention maps, we implement hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will be released at https://github.com/shvdiwnkozbw/SSL-UVOS.
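A minimal sketch of the "cluster the attention-derived affinities" step: patch features (e.g., DINO features after a spatio-temporal attention block) are turned into a cosine affinity matrix and clustered into a fixed number of objects. It assumes scikit-learn's AgglomerativeClustering and a known object count; the paper's hierarchical clustering procedure is more involved.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import AgglomerativeClustering

def masks_from_features(features: torch.Tensor, num_objects: int, h: int, w: int):
    """Cluster patch-feature affinities into coarse per-patch object labels.

    features: (h*w, dim) patch tokens. Uses scikit-learn >= 1.2 (`metric=`;
    older versions call the same argument `affinity=`).
    """
    feats = F.normalize(features, dim=-1)
    affinity = feats @ feats.t()                            # cosine affinities between patches
    distance = (1.0 - affinity).clamp(min=0).cpu().numpy()
    labels = AgglomerativeClustering(
        n_clusters=num_objects, metric="precomputed", linkage="average"
    ).fit_predict(distance)
    return torch.from_numpy(labels).reshape(h, w)           # per-patch object ids

# usage: masks_from_features(torch.randn(14 * 14, 384), num_objects=3, h=14, w=14)
```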

Pose Anything: A Graph-Based Approach for Category-Agnostic Pose Estimation

  • paper_url: http://arxiv.org/abs/2311.17891
  • repo_url: https://github.com/orhir/PoseAnything
  • paper_authors: Or Hirschorn, Shai Avidan
  • for: A category-agnostic 2D pose estimation model that can handle novel object categories.
  • methods: Uses a Graph Transformer Decoder to capture and exploit the geometric relations between keypoints.
  • results: Outperforms the prior state of the art on the MP-100 benchmark by substantial margins, with improvements of 2.16% and 1.82% under the 1-shot and 5-shot settings, respectively.
    Abstract Traditional 2D pose estimation models are limited by their category-specific design, making them suitable only for predefined object categories. This restriction becomes particularly challenging when dealing with novel objects due to the lack of relevant training data. To address this limitation, category-agnostic pose estimation (CAPE) was introduced. CAPE aims to enable keypoint localization for arbitrary object categories using a single model, requiring minimal support images with annotated keypoints. This approach not only enables object pose generation based on arbitrary keypoint definitions but also significantly reduces the associated costs, paving the way for versatile and adaptable pose estimation applications. We present a novel approach to CAPE that leverages the inherent geometrical relations between keypoints through a newly designed Graph Transformer Decoder. By capturing and incorporating this crucial structural information, our method enhances the accuracy of keypoint localization, marking a significant departure from conventional CAPE techniques that treat keypoints as isolated entities. We validate our approach on the MP-100 benchmark, a comprehensive dataset comprising over 20,000 images spanning more than 100 categories. Our method outperforms the prior state-of-the-art by substantial margins, achieving remarkable improvements of 2.16% and 1.82% under 1-shot and 5-shot settings, respectively. Furthermore, our method's end-to-end training demonstrates both scalability and efficiency compared to previous CAPE approaches.

TSDF-Sampling: Efficient Sampling for Neural Surface Field using Truncated Signed Distance Field

  • paper_url: http://arxiv.org/abs/2311.17878
  • repo_url: None
  • paper_authors: Chaerin Min, Sehyun Cha, Changhee Won, Jongwoo Lim
  • for: Aims to speed up inference in multi-view neural surface reconstruction so that it becomes usable in real-time applications.
  • methods: Reduces the number of samples per ray by exploiting the Truncated Signed Distance Field (TSDF) of the scene. Unlike prior importance-sampling schemes, which depend on initial uniform samples over the entire space and degrade when fewer samples are used, the method leverages a TSDF volume generated only from the trained views and shows that it provides a reasonable bound for sampling from upcoming novel views.
  • results: Achieves high rendering quality without dense sampling, with an 11-fold speedup in inference and no loss in performance. Result videos are available on the project page: https://tsdf-sampling.github.io/
    Abstract Multi-view neural surface reconstruction has exhibited impressive results. However, a notable limitation is the prohibitively slow inference time when compared to traditional techniques, primarily attributed to the dense sampling, required to maintain the rendering quality. This paper introduces a novel approach that substantially reduces the number of samplings by incorporating the Truncated Signed Distance Field (TSDF) of the scene. While prior works have proposed importance sampling, their dependence on initial uniform samples over the entire space makes them unable to avoid performance degradation when trying to use less number of samples. In contrast, our method leverages the TSDF volume generated only by the trained views, and it proves to provide a reasonable bound on the sampling from upcoming novel views. As a result, we achieve high rendering quality by fully exploiting the continuous neural SDF estimation within the bounds given by the TSDF volume. Notably, our method is the first approach that can be robustly plug-and-play into a diverse array of neural surface field models, as long as they use the volume rendering technique. Our empirical results show an 11-fold increase in inference speed without compromising performance. The result videos are available at our project page: https://tsdf-sampling.github.io/
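The core idea, restricting ray samples to the interval where the precomputed TSDF is close to zero, can be sketched as below. `tsdf_query` is a hypothetical lookup into the TSDF volume, and the coarse-probe-then-refine scheme is only one simple way to instantiate the bound; the paper's sampling strategy differs in detail.

```python
import torch

def tsdf_bounded_samples(ray_o, ray_d, tsdf_query, t_near, t_far,
                         n_coarse=32, n_fine=16, trunc=0.05):
    """Concentrate ray samples inside the interval where |TSDF| is within the band."""
    t_coarse = torch.linspace(t_near, t_far, n_coarse)
    pts = ray_o + t_coarse[:, None] * ray_d               # (n_coarse, 3) probe points
    sdf = tsdf_query(pts)                                  # (n_coarse,) truncated SDF values
    hits = (sdf.abs() < trunc).nonzero().flatten()
    if hits.numel() == 0:                                  # no surface along this ray
        return t_coarse
    lo, hi = t_coarse[hits.min()].item(), t_coarse[hits.max()].item()
    return torch.linspace(lo, hi, n_fine)                  # dense samples inside the bound

# usage: tsdf_bounded_samples(torch.zeros(3), torch.tensor([0., 0., 1.]),
#                             lambda p: p[:, 2] - 2.0, 0.5, 5.0)
```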

Enhancing Post-Hoc Explanation Benchmark Reliability for Image Classification

  • paper_url: http://arxiv.org/abs/2311.17876
  • repo_url: None
  • paper_authors: Tristan Gomez, Harold Mouchère
  • for: Aims to make benchmarks of post-hoc explanation methods for image classification more reliable, improving our understanding of the decision-making of deep neural networks.
  • methods: Uses an approach inspired by psychometrics, employing Krippendorf's alpha to quantify benchmark reliability, and proposes training modifications, namely feeding perturbed samples and employing focal loss, to improve model robustness and calibration.
  • results: Empirical evaluations show significant improvements in benchmark reliability across metrics, datasets, and post-hoc methods, laying a foundation for more reliable evaluation of post-hoc explanation methods.
    Abstract Deep neural networks, while powerful for image classification, often operate as "black boxes," complicating the understanding of their decision-making processes. Various explanation methods, particularly those generating saliency maps, aim to address this challenge. However, the inconsistency issues of faithfulness metrics hinder reliable benchmarking of explanation methods. This paper employs an approach inspired by psychometrics, utilizing Krippendorf's alpha to quantify the benchmark reliability of post-hoc methods in image classification. The study proposes model training modifications, including feeding perturbed samples and employing focal loss, to enhance robustness and calibration. Empirical evaluations demonstrate significant improvements in benchmark reliability across metrics, datasets, and post-hoc methods. This pioneering work establishes a foundation for more reliable evaluation practices in the realm of post-hoc explanation methods, emphasizing the importance of model robustness in the assessment process.

FisherRF: Active View Selection and Uncertainty Quantification for Radiance Fields using Fisher Information

  • paper_url: http://arxiv.org/abs/2311.17874
  • repo_url: https://github.com/JiangWenPL/FisherRF
  • paper_authors: Wen Jiang, Boshu Lei, Kostas Daniilidis
  • for: Addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields.
  • methods: Leverages Fisher Information to efficiently quantify observed information within Radiance Fields without ground-truth data, overcoming existing limitations on model architecture and effectiveness.
  • results: Achieves state-of-the-art results in both view selection and uncertainty quantification; with a 3D Gaussian Splatting backend, view selection runs at 70 fps.
    Abstract This study addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields. Neural Radiance Fields (NeRF) have greatly advanced image rendering and reconstruction, but the limited availability of 2D images poses uncertainties stemming from occlusions, depth ambiguities, and imaging errors. Efficiently selecting informative views becomes crucial, and quantifying NeRF model uncertainty presents intricate challenges. Existing approaches either depend on model architecture or are based on assumptions regarding density distributions that are not generally applicable. By leveraging Fisher Information, we efficiently quantify observed information within Radiance Fields without ground truth data. This can be used for the next best view selection and pixel-wise uncertainty quantification. Our method overcomes existing limitations on model architecture and effectiveness, achieving state-of-the-art results in both view selection and uncertainty quantification, demonstrating its potential to advance the field of Radiance Fields. Our method with the 3D Gaussian Splatting backend could perform view selections at 70 fps.
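To make the view-selection idea concrete, the sketch below greedily scores candidate cameras by the squared gradient of the rendered image with respect to the model parameters, a crude diagonal stand-in for an information-style criterion; it is not the Fisher information estimator used in FisherRF. `render_fn` is a hypothetical differentiable renderer and `params` a leaf tensor with `requires_grad=True`.

```python
import torch

def select_next_view(render_fn, params, candidate_cameras):
    """Greedily pick the candidate view with the largest squared-gradient score."""
    best_cam, best_score = None, float("-inf")
    for cam in candidate_cameras:
        params.grad = None
        image = render_fn(params, cam)
        image.sum().backward()                   # gradient of the rendering w.r.t. params
        score = float((params.grad ** 2).sum())  # crude information-style score
        if score > best_score:
            best_cam, best_score = cam, score
    return best_cam
```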

Gaussian Shell Maps for Efficient 3D Human Generation

  • paper_url: http://arxiv.org/abs/2311.17857
  • repo_url: https://github.com/computational-imaging/GSM
  • paper_authors: Rameen Abdal, Wang Yifan, Zifan Shi, Yinghao Xu, Ryan Po, Zhengfei Kuang, Qifeng Chen, Dit-Yan Yeung, Gordon Wetzstein
  • for: Aims to make 3D digital human generation more efficient for applications such as virtual reality, social media, and cinematic production.
  • methods: Proposes Gaussian Shell Maps (GSMs), a framework that connects state-of-the-art 3D generative network architectures with 3D Gaussian rendering primitives through an articulable multi-shell scaffold. A CNN generates a 3D texture stack whose features are mapped to the shells, which are inflated and deflated versions of a template human surface in a canonical pose; instead of rasterizing the shells directly, 3D Gaussians whose attributes are encoded in the texture features are sampled on the shells and rendered efficiently and differentiably.
  • results: GSMs generate high-quality 3D humans when trained on single-view datasets such as SHHQ and DeepFashion, producing multi-view consistent renderings at a native resolution of 512×512 pixels without view-inconsistent upsamplers.
    Abstract Efficient generation of 3D digital humans is important in several industries, including virtual reality, social media, and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures, however, typically rely on volume representations, which are slow to render, thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi shell--based scaffold. In this setting, a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and, at inference time, to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of $512 \times 512$ pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets, including SHHQ and DeepFashion.

Evaluating VLMs for Score-Based, Multi-Probe Annotation of 3D Objects

  • paper_url: http://arxiv.org/abs/2311.17851
  • repo_url: None
  • paper_authors: Rishabh Kabra, Loic Matthey, Alexander Lerchner, Niloy J. Mitra
  • for: Aims to leverage pretrained vision-language models (VLMs) for a range of annotation tasks on unlabeled 3D objects, from describing object semantics to physical properties.
  • methods: Marginalizes over the factors varied across VLM queries (such as views and prompt phrasings) by probabilistically aggregating the VLM's scores for sampled responses. This aggregation can outperform a language model (e.g., GPT4) at summarization, for instance avoiding hallucinations when responses contain contrasting details.
  • results: Aggregated annotations improve downstream VLM predictions (e.g., of object material when the object's type is given as an auxiliary input in the prompt) and allow ablating the contribution of visual reasoning over language-only reasoning. Without additional training or in-context learning, VLMs approach the quality of human-verified type and material annotations on the large-scale Objaverse dataset.
    Abstract Unlabeled 3D objects present an opportunity to leverage pretrained vision language models (VLMs) on a range of annotation tasks -- from describing object semantics to physical properties. An accurate response must take into account the full appearance of the object in 3D, various ways of phrasing the question/prompt, and changes in other factors that affect the response. We present a method to marginalize over any factors varied across VLM queries, utilizing the VLM's scores for sampled responses. We first show that this probabilistic aggregation can outperform a language model (e.g., GPT4) for summarization, for instance avoiding hallucinations when there are contrasting details between responses. Secondly, we show that aggregated annotations are useful for prompt-chaining; they help improve downstream VLM predictions (e.g., of object material when the object's type is specified as an auxiliary input in the prompt). Such auxiliary inputs allow ablating and measuring the contribution of visual reasoning over language-only reasoning. Using these evaluations, we show how VLMs can approach, without additional training or in-context learning, the quality of human-verified type and material annotations on the large-scale Objaverse dataset.
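The score-based aggregation can be illustrated with a small helper that pools the per-answer probabilities collected across views and prompt phrasings into one normalized distribution. The interface (a list of `(answer, log_prob)` pairs) is an assumption for illustration, not the paper's exact estimator.

```python
import math
from collections import defaultdict

def aggregate_vlm_answers(scored_responses):
    """Pool per-answer probabilities over queries that vary views and prompts.

    scored_responses: list of (answer, log_prob) pairs from repeated VLM queries
    of the same object; returns a normalized distribution over distinct answers.
    """
    mass = defaultdict(float)
    for answer, log_prob in scored_responses:
        mass[answer] += math.exp(log_prob)
    total = sum(mass.values()) or 1.0
    return {answer: p / total for answer, p in mass.items()}

# usage: aggregate_vlm_answers([("metal", -0.2), ("metal", -0.5), ("plastic", -2.3)])
```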

Towards Real-World Focus Stacking with Deep Learning

  • paper_url: http://arxiv.org/abs/2311.17846
  • repo_url: https://github.com/araujoalexandre/focusstackingdataset
  • paper_authors: Alexandre Araujo, Jean Ponce, Julien Mairal
  • for: Proposes a new deep learning approach to focus stacking (multi-focus image fusion) for long, real-world bursts.
  • methods: Introduces a new dataset of 94 high-resolution bursts of raw images with focus bracketing, with pseudo ground truth computed using state-of-the-art commercial software, and uses it to train the first deep focus-stacking algorithm capable of handling bursts long enough for real-world applications.
  • results: The method is on par with existing commercial solutions in the long-burst, realistic regime while being significantly more tolerant to noise.
    Abstract Focus stacking is widely used in micro, macro, and landscape photography to reconstruct all-in-focus images from multiple frames obtained with focus bracketing, that is, with shallow depth of field and different focus planes. Existing deep learning approaches to the underlying multi-focus image fusion problem have limited applicability to real-world imagery since they are designed for very short image sequences (two to four images), and are typically trained on small, low-resolution datasets either acquired by light-field cameras or generated synthetically. We introduce a new dataset consisting of 94 high-resolution bursts of raw images with focus bracketing, with pseudo ground truth computed from the data using state-of-the-art commercial software. This dataset is used to train the first deep learning algorithm for focus stacking capable of handling bursts of sufficient length for real-world applications. Qualitative experiments demonstrate that it is on par with existing commercial solutions in the long-burst, realistic regime while being significantly more tolerant to noise. The code and dataset are available at https://github.com/araujoalexandre/FocusStackingDataset.

SPiC-E : Structural Priors in 3D Diffusion Models using Cross Entity Attention

  • paper_url: http://arxiv.org/abs/2311.17834
  • repo_url: None
  • paper_authors: Etai Sella, Gal Fiebelman, Noam Atia, Hadar Averbuch-Elor
  • for: Aims to democratize 3D content creation by improving the efficiency and versatility of 3D diffusion models.
  • methods: Introduces SPiC-E, a neural network that adds structural guidance to 3D diffusion models via a cross-entity attention mechanism, allowing task-specific structural priors to be learned from auxiliary guidance shapes.
  • results: Achieves state-of-the-art performance on applications such as 3D stylization, semantic shape editing, and text-conditional abstraction-to-3D, while often being considerably faster than alternative methods.
    Abstract We are witnessing rapid progress in automatically generating and manipulating 3D assets due to the availability of pretrained text-image diffusion models. However, time-consuming optimization procedures are required for synthesizing each sample, hindering their potential for democratizing 3D content creation. Conversely, 3D diffusion models now train on million-scale 3D datasets, yielding high-quality text-conditional 3D samples within seconds. In this work, we present SPiC-E - a neural network that adds structural guidance to 3D diffusion models, extending their usage beyond text-conditional generation. At its core, our framework introduces a cross-entity attention mechanism that allows for multiple entities (in particular, paired input and guidance 3D shapes) to interact via their internal representations within the denoising network. We utilize this mechanism for learning task-specific structural priors in 3D diffusion models from auxiliary guidance shapes. We show that our approach supports a variety of applications, including 3D stylization, semantic shape editing and text-conditional abstraction-to-3D, which transforms primitive-based abstractions into highly-expressive shapes. Extensive experiments demonstrate that SPiC-E achieves SOTA performance over these tasks while often being considerably faster than alternative methods. Importantly, this is accomplished without tailoring our approach for any specific task.

DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2311.17812
  • repo_url: None
  • paper_authors: Ting Liu, Yue Hu, Wansen Wu, Youkai Wang, Kai Xu, Quanjun Yin
  • for: Aims to improve the ability of embodied agents to navigate unseen environments following language instructions, in particular when using pretrained vision-and-language models whose web-crawled training data leaves a domain gap for VLN tasks.
  • methods: Proposes a novel, model-agnostic domain-aware prompt learning (DAP) framework that uses a low-cost prompt tuning paradigm to learn soft visual prompts, equipping the pretrained visual encoder with object-level and scene-level cross-modal alignment for VLN.
  • results: On R2R and REVERIE, DAP outperforms existing state-of-the-art methods.
    Abstract Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. With strong representation capabilities, pretrained vision-and-language models are widely used in VLN. However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when used for VLN tasks. To address the problem, we propose a novel and model-agnostic domain-aware prompt learning (DAP) framework. For equipping the pretrained models with specific object-level and scene-level cross-modal alignment in VLN tasks, DAP applies a low-cost prompt tuning paradigm to learn soft visual prompts for extracting in-domain image semantics. Specifically, we first generate a set of in-domain image-text pairs with the help of the CLIP model. Then we introduce soft visual prompts in the input space of the visual encoder in a pretrained model. DAP injects in-domain visual knowledge into the visual encoder of the pretrained model in an efficient way. Experimental results on both R2R and REVERIE show the superiority of DAP compared to existing state-of-the-art methods.
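Prompt tuning in the input space of the visual encoder, the mechanism DAP builds on, can be sketched as a thin wrapper that prepends learnable tokens to the patch sequence of a frozen encoder. The `encoder` interface, token dimension, and prompt count are illustrative assumptions; DAP's in-domain image-text pair generation with CLIP is not shown.

```python
import torch
import torch.nn as nn

class SoftVisualPrompt(nn.Module):
    """Prepend learnable prompt tokens to a frozen encoder's patch-token sequence."""

    def __init__(self, encoder: nn.Module, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)                 # only the prompts are trained
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, tokens, dim)
        prompts = self.prompts.expand(patch_tokens.shape[0], -1, -1)
        return self.encoder(torch.cat([prompts, patch_tokens], dim=1))

# usage: layer = nn.TransformerEncoderLayer(768, 8, batch_first=True)
#        SoftVisualPrompt(nn.TransformerEncoder(layer, 2))(torch.randn(2, 196, 768))
```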

Coloring the Past: Neural Historical Buildings Reconstruction from Archival Photography

  • paper_url: http://arxiv.org/abs/2311.17810
  • repo_url: None
  • paper_authors: David Komorowicz, Lu Sang, Ferdinand Maiwald, Daniel Cremers
  • for: Historical buildings are a treasure and milestone of human cultural heritage, and reconstructing their 3D models from archival photographs holds significant value.
  • methods: Proposes a volumetric-rendering-based reconstruction approach that uses dense point clouds as a geometric prior and introduces a color appearance embedding loss to recover the building's color from a limited number of color images.
  • results: The method effectively recovers the geometry of historical buildings and their color despite limited color imagery. The authors also introduce a new historical dataset of the Hungarian National Theater, providing a new benchmark for reconstruction methods.
    Abstract Historical buildings are a treasure and milestone of human cultural heritage. Reconstructing the 3D models of these building hold significant value. The rapid development of neural rendering methods makes it possible to recover the 3D shape only based on archival photographs. However, this task presents considerable challenges due to the limitations of such datasets. Historical photographs are often limited in number and the scenes in these photos might have altered over time. The radiometric quality of these images is also often sub-optimal. To address these challenges, we introduce an approach to reconstruct the geometry of historical buildings, employing volumetric rendering techniques. We leverage dense point clouds as a geometric prior and introduce a color appearance embedding loss to recover the color of the building given limited available color images. We aim for our work to spark increased interest and focus on preserving historical buildings. Thus, we also introduce a new historical dataset of the Hungarian National Theater, providing a new benchmark for the reconstruction method.

Aggregation Model Hyperparameters Matter in Digital Pathology

  • paper_url: http://arxiv.org/abs/2311.17804
  • repo_url: None
  • paper_authors: Gustav Bredell, Marcel Fischer, Przemyslaw Szostak, Samaneh Abbasi-Sureshjani, Alvaro Gomariz
  • for: Investigates how the evaluation of feature extractor models for gigapixel whole-slide image (WSI) analysis in digital pathology is affected by aggregation model hyperparameters.
  • methods: WSIs are divided into patches, a feature extractor model produces a feature vector per patch, and an aggregation model predicts the WSI label; seven feature extractor models are evaluated across three datasets with 162 different aggregation model configurations.
  • results: The study uncovers a co-dependence between feature extractor models and aggregation model hyperparameters, showing that comparisons under fixed hyperparameters can be biased; once this co-dependency is accounted for, the performance of many current feature extractor models is notably similar. This enables a fairer and more accurate assessment of feature extractors in digital pathology.
    Abstract Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSI). In this process, WSIs are first divided into patches, for which a feature extractor model is applied to obtain feature vectors, which are subsequently processed by an aggregation model to predict the respective WSI label. With the rapid evolution of representation learning, numerous new feature extractor models, often termed foundational models, have emerged. Traditional evaluation methods, however, rely on fixed aggregation model hyperparameters, a framework we identify as potentially biasing the results. Our study uncovers a co-dependence between feature extractor models and aggregation model hyperparameters, indicating that performance comparability can be skewed based on the chosen hyperparameters. By accounting for this co-dependency, we find that the performance of many current feature extractor models is notably similar. We support this insight by evaluating seven feature extractor models across three different datasets with 162 different aggregation model configurations. This comprehensive approach provides a more nuanced understanding of the relationship between feature extractors and aggregation models, leading to a fairer and more accurate assessment of feature extractor models in digital pathology.

U-Net v2: Rethinking the Skip Connections of U-Net for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.17791
  • repo_url: https://github.com/yaoppeng/u-net_v2
  • paper_authors: Yaopeng Peng, Milan Sonka, Danny Z. Chen
  • for: Aims to improve medical image segmentation with a new U-Net variant that infuses semantic information into low-level features while refining high-level features with finer details.
  • methods: Multi-level features are extracted with a deep encoder; each level's feature map is then enriched by infusing semantic information from higher-level features and integrating finer details from lower-level features through Hadamard products, and these novel skip connections pass the enriched features to the decoder. The design can be integrated into any encoder-decoder network.
  • results: On several public medical image segmentation datasets (skin lesion and polyp segmentation), the method surpasses state-of-the-art segmentation accuracy while preserving memory and computational efficiency.
    Abstract In this paper, we introduce U-Net v2, a new robust and efficient U-Net variant for medical image segmentation. It aims to augment the infusion of semantic information into low-level features while simultaneously refining high-level features with finer details. For an input image, we begin by extracting multi-level features with a deep neural network encoder. Next, we enhance the feature map of each level by infusing semantic information from higher-level features and integrating finer details from lower-level features through Hadamard product. Our novel skip connections empower features of all the levels with enriched semantic characteristics and intricate details. The improved features are subsequently transmitted to the decoder for further processing and segmentation. Our method can be seamlessly integrated into any Encoder-Decoder network. We evaluate our method on several public medical image segmentation datasets for skin lesion segmentation and polyp segmentation, and the experimental results demonstrate the segmentation accuracy of our new method over state-of-the-art methods, while preserving memory and computational efficiency. Code is available at: https://github.com/yaoppeng/U-Net\_v2
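The skip-connection fusion by Hadamard products can be sketched as below: each encoder level is multiplied element-wise by the other levels resized to its resolution. Channel counts are assumed equal for simplicity (the paper aligns them with extra convolutions), so this illustrates the idea rather than the exact U-Net v2 blocks.

```python
import torch
import torch.nn.functional as F

def fuse_skip_features(features):
    """Enrich each level by Hadamard products with the other levels (resized to match).

    features: list of (B, C, H_i, W_i) maps from shallow to deep, assumed to share C.
    """
    fused = []
    for i, f_i in enumerate(features):
        out = f_i
        for j, f_j in enumerate(features):
            if i == j:
                continue
            aligned = F.interpolate(f_j, size=f_i.shape[-2:],
                                    mode="bilinear", align_corners=False)
            out = out * aligned            # element-wise product injects level-j information
        fused.append(out)
    return fused

# usage: fuse_skip_features([torch.randn(1, 64, 64, 64),
#                            torch.randn(1, 64, 32, 32),
#                            torch.randn(1, 64, 16, 16)])
```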

One-Shot Open Affordance Learning with Foundation Models

  • paper_url: http://arxiv.org/abs/2311.17776
  • repo_url: None
  • paper_authors: Gen Li, Deqing Sun, Laura Sevilla-Lara, Varun Jampani
  • for: Proposes One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category but is expected to identify novel objects and affordances.
  • methods: Conducts a comprehensive analysis of existing foundation models to assess their inherent understanding of affordances and the potential for data-limited affordance learning, then proposes a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
  • results: On two affordance segmentation benchmarks, the proposed method outperforms state-of-the-art models with less than 1% of the full training data and shows reasonable generalization to unseen objects and affordances.
    Abstract We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category, but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes, they often struggle to understand finer levels of granularity such as affordances. To handle this issue, we conduct a comprehensive analysis of existing foundation models, to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data, and exhibits reasonable generalization capability on unseen objects and affordances.

PillarNeSt: Embracing Backbone Scaling and Pretraining for Pillar-based 3D Object Detection

  • paper_url: http://arxiv.org/abs/2311.17770
  • repo_url: None
  • paper_authors: Weixin Mao, Tiancai Wang, Diankun Zhang, Junjie Yan, Osamu Yoshie
  • for: Improving the performance of pillar-based 3D object detectors.
  • methods: Uses dense ConvNets pretrained on large-scale image datasets (e.g., ImageNet) as the 2D backbone of pillar-based detectors, with the ConvNets adaptively designed according to model size and the specific properties of point clouds such as sparsity and irregularity.
  • results: The resulting detector, PillarNeSt, outperforms existing 3D object detectors by a large margin on the nuScenes and Argoversev2 datasets.
    Abstract This paper shows the effectiveness of 2D backbone scaling and pretraining for pillar-based 3D object detectors. Pillar-based methods mainly employ randomly initialized 2D convolution neural network (ConvNet) for feature extraction and fail to enjoy the benefits from the backbone scaling and pretraining in the image domain. To show the scaling-up capacity in point clouds, we introduce the dense ConvNet pretrained on large-scale image datasets (e.g., ImageNet) as the 2D backbone of pillar-based detectors. The ConvNets are adaptively designed based on the model size according to the specific features of point clouds, such as sparsity and irregularity. Equipped with the pretrained ConvNets, our proposed pillar-based detector, termed PillarNeSt, outperforms the existing 3D object detectors by a large margin on the nuScenes and Argoversev2 datasets. Our code shall be released upon acceptance.

Cinematic Behavior Transfer via NeRF-based Differentiable Filming

  • paper_url: http://arxiv.org/abs/2311.17754
  • repo_url: None
  • paper_authors: Xuekun Jiang, Anyi Rao, Jingbo Wang, Dahua Lin, Bo Dai
  • for: Aims to enable precise manipulation and reproduction of visual elements such as camera movements and character actions in digital media and video production.
  • methods: Introduces a reverse filming behavior estimation technique that optimizes camera trajectories by using NeRF as a differentiable renderer while refining SMPL tracks, together with a cinematic transfer pipeline that can transfer various shot types to a new 2D video or a 3D virtual environment.
  • results: The pipeline accurately reproduces camera trajectories and character motion; the incorporation of a 3D engine workflow enables superior rendering and control and achieves a higher rating in the user study.
    Abstract In the evolving landscape of digital media and video production, the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections, neglecting 3D statuses. To address these issues, we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities, which also achieves a higher rating in the user study.

BAND-2k: Banding Artifact Noticeable Database for Banding Detection and Quality Assessment

  • paper_url: http://arxiv.org/abs/2311.17752
  • repo_url: None
  • paper_authors: Zijian Chen, Wei Sun, Jun Jia, Fangfang Lu, Zicheng Zhang, Jing Liu, Ru Huang, Xiongkuo Min, Guangtao Zhai
  • for: Detecting banding artifacts in images and assessing their perceptual visual quality.
  • methods: Builds BAND-2k, the largest banding image quality assessment database so far, consisting of 2,000 banding images generated by 15 compression and quantization schemes; 23 participants took part in the subjective experiment, yielding over 214,000 patch-level banding class labels and 44,371 reliable image-level quality ratings. A no-reference (NR) banding evaluator is then developed, using a dual convolutional network that learns feature representations from high-frequency and low-frequency maps and produces a quality score by pooling banding detection maps masked by spatial frequency filters.
  • results: The evaluator achieves remarkably high accuracy in banding detection and high SRCC and PLCC against the perceptual quality labels, confirming the strong correlation between banding intensity and perceptual visual quality.
    Abstract Banding, also known as staircase-like contours, frequently occurs in flat areas of images/videos processed by the compression or quantization algorithms. As undesirable artifacts, banding destroys the original image structure, thus degrading users' quality of experience (QoE). In this paper, we systematically investigate the banding image quality assessment (IQA) problem, aiming to detect the image banding artifacts and evaluate their perceptual visual quality. Considering that the existing image banding databases only contain limited content sources and banding generation methods, and lack perceptual quality labels (i.e. mean opinion scores), we first build the largest banding IQA database so far, named Banding Artifact Noticeable Database (BAND-2k), which consists of 2,000 banding images generated by 15 compression and quantization schemes. A total of 23 workers participated in the subjective IQA experiment, yielding over 214,000 patch-level banding class labels and 44,371 reliable image-level quality ratings. Subsequently, we develop an effective no-reference (NR) banding evaluator for banding detection and quality assessment by leveraging frequency characteristics of banding artifacts. A dual convolutional neural network is employed to concurrently learn the feature representation from the high-frequency and low-frequency maps, thereby enhancing the ability to discern banding artifacts. The quality score of a banding image is generated by pooling the banding detection maps masked by the spatial frequency filters. Experiments demonstrate that our banding evaluator achieves a remarkably high accuracy in banding detection and also exhibits high SRCC and PLCC results with the perceptual quality labels. These findings unveil the strong correlations between the intensity of banding artifacts and the perceptual visual quality, thus validating the necessity of banding quality assessment.
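The high/low-frequency split that feeds the dual CNN can be sketched with a simple Gaussian low-pass filter. Kernel size and sigma here are arbitrary illustrative choices; the paper's exact frequency decomposition may differ.

```python
import torch
import torch.nn.functional as F

def split_frequencies(image, kernel_size=15, sigma=3.0):
    """Split a grayscale image (1, 1, H, W) into low- and high-frequency maps;
    the high-frequency residual highlights staircase-like banding contours."""
    coords = torch.arange(kernel_size, dtype=image.dtype) - kernel_size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = torch.outer(g, g).view(1, 1, kernel_size, kernel_size)
    low = F.conv2d(image, kernel, padding=kernel_size // 2)    # low-frequency map
    high = image - low                                         # high-frequency residual
    return low, high

# Each map would then feed its own small CNN branch; the two branch features are
# fused before predicting patch-level banding scores and pooling an image score.
```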

Variational Bayes image restoration with compressive autoencoders

  • paper_url: http://arxiv.org/abs/2311.17744
  • repo_url: None
  • paper_authors: Maud Biquard, Marie Chabert, Thomas Oberlin
  • for: The paper studies regularization of inverse problems in computational imaging.
  • methods: A compressive autoencoder provides the latent space, and the proposed Variational Bayes Latent Estimation (VBLE) algorithm performs latent estimation within a variational-inference framework, enabling fast approximate posterior sampling (a minimal latent-estimation sketch follows this entry).
  • results: Experiments show that VBLE matches state-of-the-art plug-and-play methods while quantifying uncertainty faster than other existing posterior sampling techniques.
    Abstract Regularization of inverse problems is of paramount importance in computational imaging. The ability of neural networks to learn efficient image representations has been recently exploited to design powerful data-driven regularizers. While state-of-the-art plug-and-play methods rely on an implicit regularization provided by neural denoisers, alternative Bayesian approaches consider Maximum A Posteriori (MAP) estimation in the latent space of a generative model, thus with an explicit regularization. However, state-of-the-art deep generative models require a huge amount of training data compared to denoisers. Besides, their complexity hampers the optimization of the latent MAP. In this work, we propose to use compressive autoencoders for latent estimation. These networks, which can be seen as variational autoencoders with a flexible latent prior, are smaller and easier to train than state-of-the-art generative models. We then introduce the Variational Bayes Latent Estimation (VBLE) algorithm, which performs this estimation within the framework of variational inference. This allows for fast and easy (approximate) posterior sampling. Experimental results on image datasets BSD and FFHQ demonstrate that VBLE reaches similar performance than state-of-the-art plug-and-play methods, while being able to quantify uncertainties faster than other existing posterior sampling techniques.
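The latent variational estimation can be illustrated with a toy objective: fit a diagonal Gaussian over the latent code so that decoded samples explain the degraded observation. This is a sketch under stated assumptions: `decoder` and `degradation` are placeholder differentiable callables, and the standard-normal prior stands in for the compressive autoencoder's learned latent prior used by VBLE.

```python
import torch

def vble_estimate(decoder, degradation, y, latent_dim, steps=500, lr=1e-2, sigma=0.05):
    """Toy variational latent estimation: fit q(z) = N(mu, diag(exp(logvar))) so that
    decoding samples of z explains the observation y under the degradation model."""
    mu = torch.zeros(latent_dim, requires_grad=True)
    logvar = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([mu, logvar], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        eps = torch.randn(latent_dim)
        z = mu + torch.exp(0.5 * logvar) * eps                          # reparameterization
        x_hat = decoder(z)
        data_term = ((degradation(x_hat) - y) ** 2).sum() / (2 * sigma ** 2)
        kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum()   # KL(q || N(0, I))
        (data_term + kl).backward()
        opt.step()
    # Approximate posterior samples: decode mu + exp(0.5 * logvar) * noise.
    return mu.detach(), logvar.detach()
```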

GenZI: Zero-Shot 3D Human-Scene Interaction Generation

  • paper_url: http://arxiv.org/abs/2311.17737
  • repo_url: None
  • paper_authors: Lei Li, Angela Dai
  • for: The paper aims to generate 3D human-scene interactions without learning from any 3D human-scene interaction data.
  • methods: Interaction priors are distilled from large vision-language models (VLMs): plausible 2D human interactions are inpainted into multiple rendered views of the scene, and the pose and shape of a 3D human model are then optimized for consistency with these 2D hypotheses.
  • results: Compared with existing learning-based approaches, the zero-shot method is highly flexible and general, applying to diverse scene types, including indoor and outdoor environments.
    Abstract Can we synthesize 3D humans interacting with scenes without learning from any 3D human-scene interaction data? We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions. Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions. Given a natural language description and a coarse point location of the desired interaction in a 3D scene, we first leverage VLMs to imagine plausible 2D human interactions inpainted into multiple rendered views of the scene. We then formulate a robust iterative optimization to synthesize the pose and shape of a 3D human model in the scene, guided by consistency with the 2D interaction hypotheses. In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data, and allows for flexible control of the 3D interaction synthesis with easy-to-use text prompts. Extensive experiments show that our zero-shot approach has high flexibility and generality, making it applicable to diverse scene types, including both indoor and outdoor environments.

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

  • paper_url: http://arxiv.org/abs/2311.17717
  • repo_url: None
  • paper_authors: Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Yu-Chiang Frank Wang
  • for: Preventing a pre-trained text-to-image diffusion model from generating images related to a target concept.
  • methods: The paper proposes Reliable Concept Erasing via Lightweight Erasers (Receler), which learns a lightweight eraser and strengthens locality and robustness through concept-localized regularization and adversarial prompt learning (a hedged erasing-loss sketch follows this entry).
  • results: Across diverse concept prompts, Receler outperforms previous erasing methods on both locality and robustness.
    Abstract Concept erasure in text-to-image diffusion models aims to disable pre-trained diffusion models from generating images related to a target concept. To perform reliable concept erasure, the properties of robustness and locality are desirable. The former refrains the model from producing images associated with the target concept for any paraphrased or learned prompts, while the latter preserves the model ability in generating images for non-target concepts. In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler), which learns a lightweight Eraser to perform concept erasing and enhances locality and robustness with the proposed concept-localized regularization and adversarial prompt learning, respectively. Comprehensive quantitative and qualitative experiments with various concept prompts verify the superiority of Receler over the previous erasing methods on the above two desirable properties.
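For context, one common concept-erasing objective (the ESD-style negative-guidance target) can be written as a short loss; Receler's exact eraser objective, its concept-localized regularizer, and its adversarial prompt learning are not reproduced here, and `eps_student` / `eps_frozen` are assumed noise-prediction callables.

```python
import torch

def erasing_loss(eps_student, eps_frozen, x_t, t, concept_emb, null_emb, eta=1.0):
    """ESD-style erasing target: push the student's concept-conditioned noise
    prediction toward a negatively guided target from a frozen copy of the model."""
    with torch.no_grad():
        e_null = eps_frozen(x_t, t, null_emb)        # unconditional prediction
        e_conc = eps_frozen(x_t, t, concept_emb)     # concept-conditioned prediction
        target = e_null - eta * (e_conc - e_null)    # steer away from the concept
    pred = eps_student(x_t, t, concept_emb)          # student (with lightweight eraser)
    return torch.mean((pred - target) ** 2)
```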

SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation

  • paper_url: http://arxiv.org/abs/2311.17707
  • repo_url: https://github.com/GAP-LAB-CUHK-SZ/SAMPro3D
  • paper_authors: Mutian Xu, Xingyilang Yin, Lingteng Qiu, Yang Liu, Xin Tong, Xiaoguang Han
  • for: The paper proposes zero-shot 3D indoor scene segmentation that produces high-quality segmentations across diverse 3D scenes without any additional training.
  • methods: 3D points in the scene act as natural 3D prompts; they are projected into the posed 2D frames as pixel prompts for the pretrained Segment Anything Model (SAM), low-quality 3D prompts are filtered out using feedback from all frames, and prompts segmenting the same object are consolidated (a projection sketch follows this entry).
  • results: The method yields higher-quality and more diverse segmentations than previous zero-shot or fully supervised approaches, in many cases surpassing human annotations; see https://mutianxu.github.io/sampro3d/.
    Abstract We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments 3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames. Our key idea involves locating 3D points in scenes as natural 3D prompts to align their projected pixel prompts across frames, ensuring frame-consistency in both pixel prompts and their SAM-predicted masks. Moreover, we suggest filtering out low-quality 3D prompts based on feedback from all 2D frames, for enhancing segmentation quality. We also propose to consolidate different 3D prompts if they are segmenting the same object, bringing a more comprehensive segmentation. Notably, our method does not require any additional training on domain-specific data, enabling us to preserve the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page can be accessed at https://mutianxu.github.io/sampro3d/.
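The core geometric step, turning a 3D point into per-frame pixel prompts, is a standard pinhole projection. The matrix conventions below (3x3 intrinsics, 4x4 world-to-camera extrinsics) are assumptions for illustration rather than the repository's exact API.

```python
import numpy as np

def project_point(p_world, K, T_world_to_cam, image_hw):
    """Project one 3D point (world coordinates) into a posed RGB frame and return
    its pixel prompt (u, v), or None if it lies behind the camera or off-image."""
    p_h = np.append(p_world, 1.0)                    # homogeneous coordinates
    p_cam = T_world_to_cam @ p_h                     # world -> camera
    if p_cam[2] <= 0:                                # behind the camera
        return None
    uv = K @ p_cam[:3]
    u, v = uv[0] / uv[2], uv[1] / uv[2]              # perspective division
    h, w = image_hw
    if 0 <= u < w and 0 <= v < h:
        return (u, v)                                # pixel prompt passed to SAM
    return None
```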

Toward a Surgeon-in-the-Loop Ophthalmic Robotic Apprentice using Reinforcement and Imitation Learning

  • paper_url: http://arxiv.org/abs/2311.17693
  • repo_url: None
  • paper_authors: Amr Gomaa, Bilal Mahdy, Niko Kleer, Antonio Krüger
  • for: The paper proposes a simulation-based, image-guided approach so that autonomous surgical agents can adapt to an individual surgeon's skill level and preferred techniques.
  • methods: Reinforcement- and imitation-learning agents are trained in a simulated environment to perform the incision phase of cataract surgery, and with the surgeon in the loop the robot implicitly learns and adapts to the individual surgeon's approach from demonstrations.
  • results: The approach provides a more intuitive and personalized surgical experience while keeping the autonomous robotic apprentice's performance consistent, and it has the potential to extend to other ophthalmic procedures.
    Abstract Robotic-assisted surgical systems have demonstrated significant potential in enhancing surgical precision and minimizing human errors. However, existing systems lack the ability to accommodate the unique preferences and requirements of individual surgeons. Additionally, they primarily focus on general surgeries (e.g., laparoscopy) and are not suitable for highly precise microsurgeries, such as ophthalmic procedures. Thus, we propose a simulation-based image-guided approach for surgeon-centered autonomous agents that can adapt to the individual surgeon's skill level and preferred surgical techniques during ophthalmic cataract surgery. Our approach utilizes a simulated environment to train reinforcement and imitation learning agents guided by image data to perform all tasks of the incision phase of cataract surgery. By integrating the surgeon's actions and preferences into the training process with the surgeon-in-the-loop, our approach enables the robot to implicitly learn and adapt to the individual surgeon's unique approach through demonstrations. This results in a more intuitive and personalized surgical experience for the surgeon. Simultaneously, it ensures consistent performance for the autonomous robotic apprentice. We define and evaluate the effectiveness of our approach using our proposed metrics; and highlight the trade-off between a generic agent and a surgeon-centered adapted agent. Moreover, our approach has the potential to extend to other ophthalmic surgical procedures, opening the door to a new generation of surgeon-in-the-loop autonomous surgical robots. We provide an open-source simulation framework for future development and reproducibility.

COVIDx CXR-4: An Expanded Multi-Institutional Open-Source Benchmark Dataset for Chest X-ray Image-Based Computer-Aided COVID-19 Diagnostics

  • paper_url: http://arxiv.org/abs/2311.17677
  • repo_url: None
  • paper_authors: Yifan Wu, Hayden Gunraj, Chi-en Amy Tai, Alexander Wong
  • for: The work aims to advance computer-aided COVID-19 diagnostics by expanding and diversifying large-scale chest X-ray data.
  • methods: The authors assemble COVIDx CXR-4, a multi-institutional open-source benchmark, and provide extensive analysis of its patient demographics, imaging metadata, and disease distributions to surface potential dataset biases.
  • results: COVIDx CXR-4 grows the patient cohort more than 2.66 times over COVIDx CXR-3, to 84,818 images from 45,342 patients across multiple institutions, and is released publicly.
    Abstract The global ramifications of the COVID-19 pandemic remain significant, exerting persistent pressure on nations even three years after its initial outbreak. Deep learning models have shown promise in improving COVID-19 diagnostics but require diverse and larger-scale datasets to improve performance. In this paper, we introduce COVIDx CXR-4, an expanded multi-institutional open-source benchmark dataset for chest X-ray image-based computer-aided COVID-19 diagnostics. COVIDx CXR-4 expands significantly on the previous COVIDx CXR-3 dataset by increasing the total patient cohort size by greater than 2.66 times, resulting in 84,818 images from 45,342 patients across multiple institutions. We provide extensive analysis on the diversity of the patient demographic, imaging metadata, and disease distributions to highlight potential dataset biases. To the best of the authors' knowledge, COVIDx CXR-4 is the largest and most diverse open-source COVID-19 CXR dataset and is made publicly available as part of an open initiative to advance research to aid clinicians against the COVID-19 disease.

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications

  • paper_url: http://arxiv.org/abs/2311.17663
  • repo_url: https://github.com/haomo-ai/cam4docc
  • paper_authors: Junyi Ma, Xieyuanli Chen, Jiawei Huang, Jingyi Xu, Zhen Luo, Jintao Xu, Weihao Gu, Rui Ai, Hesheng Wang
  • for: The paper provides a benchmark for camera-only 4D occupancy forecasting, evaluating how the surrounding scene changes in the near future for autonomous driving.
  • methods: The benchmark is built on several public datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, providing sequential occupancy states of general movable and static objects together with their 3D backward centripetal flow, and introduces four baseline types for comparison.
  • results: A novel end-to-end camera-only 4D occupancy forecasting network is proposed and compared against the baselines under a standardized evaluation protocol for present and future occupancy estimation.
    Abstract Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However, they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark for camera-only 4D occupancy forecasting, evaluating the surrounding scene changes in a near future. We build our benchmark based on multiple publicly available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, which provides sequential occupancy states of general movable and static objects, as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons, we introduce four baseline types from diverse camera-based perception and prediction implementations, including a static-world occupancy model, voxelization of point cloud prediction, 2D-3D instance-based prediction, and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore, the standardized evaluation protocol for preset multiple tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.

Volumetric Cloud Field Reconstruction

  • paper_url: http://arxiv.org/abs/2311.17657
  • repo_url: None
  • paper_authors: Jacob Lin, Miguel Farinha, Edward Gryspeerdt, Ronald Clark
  • for: The paper studies reconstructing large-scale volumetric media, such as clouds, from only a few stereo image pairs, making 3D reconstruction of scattering volumes more practical.
  • methods: A deep learning framework integrates a deep stereo model, a 3D convolutional neural network (3D CNN), and an advection module to capture the shape and dynamics of the volume; stereo depths carve empty space around the volume as a prior, and the advection module exploits the temporal evolution of the medium to infer motion and improve temporal consistency.
  • results: The system estimates density and velocity fields of large-scale volumes, in this case clouds, from a sparse set of stereo image pairs.
    Abstract Volumetric phenomena, such as clouds and fog, present a significant challenge for 3D reconstruction systems due to their translucent nature and their complex interactions with light. Conventional techniques for reconstructing scattering volumes rely on controlled setups, limiting practical applications. This paper introduces an approach to reconstructing volumes from a few input stereo pairs. We propose a novel deep learning framework that integrates a deep stereo model with a 3D Convolutional Neural Network (3D CNN) and an advection module, capable of capturing the shape and dynamics of volumes. The stereo depths are used to carve empty space around volumes, providing the 3D CNN with a prior for coping with the lack of input views. Refining our output, the advection module leverages the temporal evolution of the medium, providing a mechanism to infer motion and improve temporal consistency. The efficacy of our system is demonstrated through its ability to estimate density and velocity fields of large-scale volumes, in this case, clouds, from a sparse set of stereo image pairs.

Multiple Toddler Tracking in Indoor Videos

  • paper_url: http://arxiv.org/abs/2311.17656
  • repo_url: https://github.com/ostadabbas/multiple-toddler-tracking
  • paper_authors: Somaieh Amraee, Bishoy Galoaa, Matthew Goodwin, Elaheh Hatamimajoumerd, Sarah Ostadabbas
  • for: The paper addresses tracking multiple toddlers in video, which is difficult for conventional multi-object tracking (MOT) algorithms because of toddlers' unpredictable movements, varied poses, and similar appearance.
  • methods: MTTSort, a customized method built on DeepSort, tracks multiple toddlers in indoor videos; a genetic algorithm optimizes its hyperparameters, and the MTTrack dataset is curated with unbiased AI co-labeling (a genetic-search sketch follows this entry).
  • results: MTTSort outperforms state-of-the-art MOT methods on the MTTrack, DanceTrack, and MOT15 datasets, reaching 0.98 MOTA, 0.68 HOTA, and 0.98 IDF1.
    Abstract Multiple toddler tracking (MTT) involves identifying and differentiating toddlers in video footage. While conventional multi-object tracking (MOT) algorithms are adept at tracking diverse objects, toddlers pose unique challenges due to their unpredictable movements, various poses, and similar appearance. Tracking toddlers in indoor environments introduces additional complexities such as occlusions and limited fields of view. In this paper, we address the challenges of MTT and propose MTTSort, a customized method built upon the DeepSort algorithm. MTTSort is designed to track multiple toddlers in indoor videos accurately. Our contributions include discussing the primary challenges in MTT, introducing a genetic algorithm to optimize hyperparameters, proposing an accurate tracking algorithm, and curating the MTTrack dataset using unbiased AI co-labeling techniques. We quantitatively compare MTTSort to state-of-the-art MOT methods on MTTrack, DanceTrack, and MOT15 datasets. In our evaluation, the proposed method outperformed other MOT methods, achieving 0.98, 0.68, and 0.98 in multiple object tracking accuracy (MOTA), higher order tracking accuracy (HOTA), and iterative and discriminative framework 1 (IDF1) metrics, respectively.
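The genetic hyperparameter search can be sketched with a tiny, self-contained loop. The parameter names, fitness function, and GA settings below are illustrative assumptions, not the paper's exact search procedure.

```python
import random

def genetic_search(evaluate, bounds, pop_size=20, generations=15, mutation=0.1):
    """Tiny genetic algorithm for tracker hyperparameters (e.g., IoU threshold,
    max track age, appearance weight). evaluate(params) returns a fitness such as
    MOTA on a validation split; bounds maps each parameter name to (low, high)."""
    keys = list(bounds)
    rand = lambda k: random.uniform(*bounds[k])
    pop = [{k: rand(k) for k in keys} for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=evaluate, reverse=True)
        parents = scored[: pop_size // 2]                              # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = {k: random.choice((a[k], b[k])) for k in keys}     # crossover
            if random.random() < mutation:                             # mutation
                k = random.choice(keys)
                child[k] = rand(k)
            children.append(child)
        pop = parents + children
    return max(pop, key=evaluate)
```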

Neural Fields with Thermal Activations for Arbitrary-Scale Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.17643
  • repo_url: None
  • paper_authors: Alexander Becker, Rodrigo Caye Daudt, Nando Metzger, Jan Dirk Wegner, Konrad Schindler
  • for: The paper tackles arbitrary-scale single-image super-resolution (ASSR), keeping images sharp and alias-free when sampled at different resolutions.
  • methods: A local neural field represents the continuous signal, and a novel activation function derived from Fourier theory and the heat equation lets every point be queried with a Gaussian PSF, providing anti-aliasing without any filtering in the image domain (the heat-equation connection is sketched after this entry).
  • results: Combined with a hypernetwork, the method provides theoretically guaranteed anti-aliasing, sets a new bar for ASSR, and is more parameter-efficient than previous methods.
    Abstract Recent approaches for arbitrary-scale single image super-resolution (ASSR) have used local neural fields to represent continuous signals that can be sampled at different rates. However, in such formulation, the point-wise query of field values does not naturally match the point spread function (PSF) of a given pixel. In this work we present a novel way to design neural fields such that points can be queried with a Gaussian PSF, which serves as anti-aliasing when moving across resolutions for ASSR. We achieve this using a novel activation function derived from Fourier theory and the heat equation. This comes at no additional cost: querying a point with a Gaussian PSF in our framework does not affect computational cost, unlike filtering in the image domain. Coupled with a hypernetwork, our method not only provides theoretically guaranteed anti-aliasing, but also sets a new bar for ASSR while also being more parameter-efficient than previous methods.
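The link between Gaussian PSFs and the heat equation that the abstract alludes to can be stated compactly. The notation below is generic (standard heat-kernel and Fourier identities), not taken from the paper.

```latex
% Blurring u_0 with a Gaussian PSF of width \sigma equals evolving the heat
% equation for time t = \sigma^2 / 2:
\partial_t u(x,t) = \Delta u(x,t), \qquad u(x,0) = u_0(x)
\;\;\Rightarrow\;\;
u(x,t) = \bigl(G_{\sqrt{2t}} * u_0\bigr)(x),
\qquad G_\sigma(x) = \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\|x\|^2/(2\sigma^2)}.

% In the Fourier domain the blur is a simple multiplier, which is why a neural
% field can expose the PSF width as an analytic parameter of its activations:
\widehat{u}(\xi,t) = e^{-4\pi^2 t \|\xi\|^2}\, \widehat{u}_0(\xi).
```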

Erasing the Ephemeral: Joint Camera Refinement and Transient Object Removal for Street View Synthesis

  • paper_url: http://arxiv.org/abs/2311.17634
  • repo_url: None
  • paper_authors: Mreenav Shyam Deka, Lu Sang, Daniel Cremers
  • for: Synthesizing novel views of urban environments, for tasks such as autonomous driving and virtual tours.
  • methods: A neural point light field represents the scene; dynamic objects are strategically detected and masked out so that novel scenes can be reconstructed without artifacts, and camera poses are optimized jointly with the view synthesis process so both elements are refined together.
  • results: Validation on real-world urban datasets demonstrates state-of-the-art results for novel view synthesis of urban scenes.
    Abstract Synthesizing novel views for urban environments is crucial for tasks like autonomous driving and virtual tours. Compared to object-level or indoor situations, outdoor settings present unique challenges, such as inconsistency across frames due to moving vehicles and camera pose drift over lengthy sequences. In this paper, we introduce a method that tackles these challenges on view synthesis for outdoor scenarios. We employ a neural point light field scene representation and strategically detect and mask out dynamic objects to reconstruct novel scenes without artifacts. Moreover, we simultaneously optimize camera pose along with the view synthesis process, and thus, we simultaneously refine both elements. Through validation on real-world urban datasets, we demonstrate state-of-the-art results in synthesizing novel views of urban scenes.

Efficient Decoder for End-to-End Oriented Object Detection in Remote Sensing Images

  • paper_url: http://arxiv.org/abs/2311.17629
  • repo_url: None
  • paper_authors: Jiaqi Zhao, Zeyu Ding, Yong Zhou, Hancheng Zhu, Wenliang Du, Rui Yao, Abdulmotaleb El Saddik
  • for: The paper proposes an end-to-end oriented object detector for remote sensing images, where object instances appear with multiple orientations, varying scales, and dense distributions.
  • methods: An efficient decoder combines Rotated RoI attention (RRoI attention), which focuses on oriented regions of interest through cross-attention and aligns multi-scale features, with Selective Distinct Queries (SDQ), which collects queries from intermediate decoder layers and filters out similar ones (a query-filtering sketch follows this entry).
  • results: Extensive experiments on five datasets confirm the method's effectiveness; with a ResNet50 backbone it reaches state-of-the-art performance on DIOR-R (67.31% mAP), DOTA-v1.5 (67.43% mAP), and DOTA-v2.0 (53.28% mAP).
    Abstract Object instances in remote sensing images often distribute with multi-orientations, varying scales, and dense distribution. These issues bring challenges to end-to-end oriented object detectors including multi-scale features alignment and a large number of queries. To address these limitations, we propose an end-to-end oriented detector equipped with an efficient decoder, which incorporates two technologies, Rotated RoI attention (RRoI attention) and Selective Distinct Queries (SDQ). Specifically, RRoI attention effectively focuses on oriented regions of interest through a cross-attention mechanism and aligns multi-scale features. SDQ collects queries from intermediate decoder layers and then filters similar queries to obtain distinct queries. The proposed SDQ can facilitate the optimization of one-to-one label assignment, without introducing redundant initial queries or extra auxiliary branches. Extensive experiments on five datasets demonstrate the effectiveness of our method. Notably, our method achieves state-of-the-art performance on DIOR-R (67.31% mAP), DOTA-v1.5 (67.43% mAP), and DOTA-v2.0 (53.28% mAP) with the ResNet50 backbone.
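A simplified stand-in for the SDQ idea is greedy filtering of near-duplicate query embeddings; the threshold and greedy rule are assumptions, and the paper's actual selection mechanism may differ.

```python
import torch
import torch.nn.functional as F

def select_distinct_queries(queries, sim_threshold=0.9):
    """Keep a decoder query only if its cosine similarity to every already-kept
    query stays below the threshold, yielding a set of distinct queries."""
    q = F.normalize(queries, dim=-1)          # (N, D) unit-norm query embeddings
    keep = []
    for i in range(q.shape[0]):
        if all(torch.dot(q[i], q[j]) < sim_threshold for j in keep):
            keep.append(i)
    return queries[keep]                      # distinct queries passed to the head
```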

Focus on Query: Adversarial Mining Transformer for Few-Shot Segmentation

  • paper_url: http://arxiv.org/abs/2311.17626
  • repo_url: https://github.com/wyxdm/amnet
  • paper_authors: Yuan Wang, Naisong Luo, Tianzhu Zhang
  • for: The paper proposes a new few-shot segmentation (FSS) model that segments objects of novel categories from only a handful of annotated samples.
  • methods: The query-centric Adversarial Mining Transformer (AMFormer) pairs an object mining transformer, which expands the incomplete regions activated by support clues, with a detail mining transformer that discriminates fine local differences, and trains the two adversarially so accurate query masks emerge from only rough support guidance or even weak support labels.
  • results: The model achieves state-of-the-art results across all settings on the Pascal-5i and COCO-20i benchmarks, and its strong performance with weak support labels may inspire more general FSS models.
    Abstract Few-shot segmentation (FSS) aims to segment objects of new categories given only a handful of annotated samples. Previous works focus their efforts on exploring the support information while paying less attention to the mining of the critical query branch. In this paper, we rethink the importance of support information and propose a new query-centric FSS model Adversarial Mining Transformer (AMFormer), which achieves accurate query image segmentation with only rough support guidance or even weak support labels. The proposed AMFormer enjoys several merits. First, we design an object mining transformer (G) that can achieve the expansion of incomplete region activated by support clue, and a detail mining transformer (D) to discriminate the detailed local difference between the expanded mask and the ground truth. Second, we propose to train G and D via an adversarial process, where G is optimized to generate more accurate masks approaching ground truth to fool D. We conduct extensive experiments on commonly used Pascal-5i and COCO-20i benchmarks and achieve state-of-the-art results across all settings. In addition, the decent performance with weak support labels in our query-centric paradigm may inspire the development of more general FSS models. Code will be available at https://github.com/Wyxdm/AMNet.

ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model

  • paper_url: http://arxiv.org/abs/2311.17618
  • repo_url: None
  • paper_authors: Fukun Yin, Xin Chen, Chi Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen
  • for: The paper develops a shape-included multi-modal generative model so that diverse shape-generation tasks can serve fields such as 3D virtual construction and network-aided design.
  • methods: A word-sentence-paragraph framework discretizes continuous 3D shapes into shape words, assembles them into shape sentences, and integrates shape with instructional text into multi-modal paragraphs.
  • results: After a three-stage training scheme (shape representation, multimodal alignment, and instruction-based generation), ShapeGPT achieves comparable performance on text-to-shape, shape-to-text, shape completion, and shape editing.
    Abstract The advent of large language models, enabling flexibility through instruction-driven approaches, has revolutionized many traditional generative tasks, but large models for 3D data, particularly in comprehensively handling 3D shapes with other modalities, are still under-explored. By achieving instruction-based shape generations, versatile multimodal generative shape models can significantly benefit various fields like 3D virtual construction and network-aided design. In this work, we present ShapeGPT, a shape-included multi-modal framework to leverage strong pre-trained language models to address multiple shape-relevant tasks. Specifically, ShapeGPT employs a word-sentence-paragraph framework to discretize continuous shapes into shape words, further assembles these words for shape sentences, as well as integrates shape with instructional text for multi-modal paragraphs. To learn this shape-language model, we use a three-stage training scheme, including shape representation, multimodal alignment, and instruction-based generation, to align shape-language codebooks and learn the intricate correlations among these modalities. Extensive experiments demonstrate that ShapeGPT achieves comparable performance across shape-relevant tasks, including text-to-shape, shape-to-text, shape completion, and shape editing.

AnyLens: A Generative Diffusion Model with Any Rendering Lens

  • paper_url: http://arxiv.org/abs/2311.17609
  • repo_url: None
  • paper_authors: Andrey Voynov, Amir Hertz, Moab Arar, Shlomi Fruchter, Daniel Cohen-Or
  • for: The paper proposes an image-generation method based on a text-to-image diffusion model that gives control over the rendering-lens geometry.
  • methods: A per-pixel coordinate conditioning scheme controls the rendering geometry, so different optical systems can be simulated with a single diffusion model.
  • results: The method manipulates curvature properties to achieve diverse visual effects such as fish-eye, panoramic views, and spherical texturing.
    Abstract State-of-the-art diffusion models can generate highly realistic images based on various conditioning like text, segmentation, and depth. However, an essential aspect often overlooked is the specific camera geometry used during image capture. The influence of different optical systems on the final scene appearance is frequently overlooked. This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on a per-pixel coordinate conditioning method, enabling the control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects, such as fish-eye, panoramic views, and spherical texturing using a single diffusion model.

Adversarial Robust Memory-Based Continual Learner

  • paper_url: http://arxiv.org/abs/2311.17608
  • repo_url: None
  • paper_authors: Xiaoyue Mi, Fan Tang, Zonghan Yang, Danding Wang, Juan Cao, Peng Li, Yang Liu
  • for: The paper targets the adversarial robustness of continual learning: as learning proceeds, robustness to adversarial samples degrades, and directly applying adversarial training to memory-based continual learning brings only limited improvement.
  • methods: To counter accelerated forgetting and gradient obfuscation, the proposed adversarial robust memory-based continual learner adjusts data logits to mitigate the forgetting caused by adversarial samples and uses a gradient-based data selection mechanism for the limited stored data.
  • results: The approach integrates with existing memory-based continual learning and adversarial training algorithms in a plug-and-play way and achieves up to 8.13% higher accuracy on adversarial data.
    Abstract Despite the remarkable advances that have been made in continual learning, the adversarial vulnerability of such methods has not been fully discussed. We delve into the adversarial robustness of memory-based continual learning algorithms and observe limited robustness improvement by directly applying adversarial training techniques. Preliminary studies reveal the twin challenges for building adversarial robust continual learners: accelerated forgetting in continual learning and gradient obfuscation in adversarial robustness. In this study, we put forward a novel adversarial robust memory-based continual learner that adjusts data logits to mitigate the forgetting of pasts caused by adversarial samples. Furthermore, we devise a gradient-based data selection mechanism to overcome the gradient obfuscation caused by limited stored data. The proposed approach can widely integrate with existing memory-based continual learning as well as adversarial training algorithms in a plug-and-play way. Extensive experiments on Split-CIFAR10/100 and Split-Tiny-ImageNet demonstrate the effectiveness of our approach, achieving up to 8.13% higher accuracy for adversarial data.

Topology-Preserving Adversarial Training

  • paper_url: http://arxiv.org/abs/2311.17607
  • repo_url: None
  • paper_authors: Xiaoyue Mi, Fan Tang, Yepeng Weng, Danding Wang, Juan Cao, Sheng Tang, Peng Li, Yang Liu
  • for: Improving the robustness of neural networks while addressing the natural accuracy degradation caused by adversarial training.
  • methods: Topology-pReserving Adversarial traINing (TRAIN) preserves the topology of natural samples in representation space, as learned by a standard model trained only on natural data, while adversarial training proceeds; it can be combined with popular adversarial training algorithms in a plug-and-play manner (a toy regularizer sketch follows this entry).
  • results: Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show consistent and significant gains over strong baselines; without additional data the method improves natural accuracy by up to 8.78% and robust accuracy by up to 4.50%.
    Abstract Despite the effectiveness in improving the robustness of neural networks, adversarial training has suffered from the natural accuracy degradation problem, i.e., accuracy on natural samples has reduced significantly. In this study, we reveal that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space by quantitative and qualitative experiments. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem by preserving the topology structure of natural samples from a standard model trained only on natural samples during adversarial training. As an additional regularization, our method can easily be combined with various popular adversarial training algorithms in a plug-and-play manner, taking advantage of both sides. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show that our proposed method achieves consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, our proposed method achieves up to 8.78% improvement in natural accuracy and 4.50% improvement in robust accuracy.
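One simple way to encourage topology preservation is to match the in-batch pairwise similarity structure of the robust model's features to that of a frozen standard model trained on natural samples. This is only a toy regularizer written to illustrate the idea; TRAIN's actual formulation may be defined differently.

```python
import torch
import torch.nn.functional as F

def topology_preserving_loss(feat_robust, feat_reference, temperature=0.1):
    """KL divergence between the neighborhood distributions induced by the robust
    model's features and those of a frozen reference model (natural training)."""
    def similarity_rows(f):
        f = F.normalize(f, dim=-1)                               # (B, D)
        sim = f @ f.t() / temperature                            # (B, B) cosine sims
        eye = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
        sim = sim.masked_fill(eye, float("-inf"))                # drop self-similarity
        return F.log_softmax(sim, dim=-1)
    with torch.no_grad():
        target = similarity_rows(feat_reference).exp()           # reference neighborhoods
    log_pred = similarity_rows(feat_robust)
    return F.kl_div(log_pred, target, reduction="batchmean")
```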

Query-Relevant Images Jailbreak Large Multi-Modal Models

  • paper_url: http://arxiv.org/abs/2311.17600
  • repo_url: None
  • paper_authors: Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao
  • for: The paper studies the safety of large multi-modal models (LMMs), focusing on attacks that exploit query-relevant images.
  • methods: A visual prompt attack composes one image generated by diffusion models with another that renders keywords extracted from the malicious query as typography.
  • results: Existing open-source LMMs can be attacked this way even when the underlying large language models are safely aligned; the authors compile a dataset of 13 scenarios with 5,040 text-image pairs to evaluate the safety of 12 LMMs.
    Abstract Warning: This paper contains examples of harmful language and images, and reader discretion is recommended. The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Large Multi-Modal Models (LMMs) remains understudied. In our study, we present a novel visual prompt attack that exploits query-relevant images to jailbreak the open-source LMMs. Our method creates a composite image from one image generated by diffusion models and another that displays the text as typography, based on keywords extracted from a malicious query. We show LLMs can be easily attacked by our approach, even if the employed Large Language Models are safely aligned. To evaluate the extent of this vulnerability in open-source LMMs, we have compiled a substantial dataset encompassing 13 scenarios with a total of 5,040 text-image pairs, using our presented attack technique. Our evaluation of 12 cutting-edge LMMs using this dataset shows the vulnerability of existing multi-modal models on adversarial attacks. This finding underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source LMMs against potential malicious exploits. The resource is available at \href{this https URL}{https://github.com/isXinLiu/MM-SafetyBench}.

Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning

  • paper_url: http://arxiv.org/abs/2311.17597
  • repo_url: None
  • paper_authors: Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Qi Wu, Yong Xia
  • for: The paper proposes a continual self-supervised learning framework for efficient pre-training on multi-modal medical imaging data.
  • methods: MedCoSS assigns different modalities to different training stages and uses rehearsal-based continual learning: a k-means sampling strategy retains data from previous modalities, while feature distillation and intra-modal mixup preserve knowledge when new modalities are learned (a buffer-selection sketch follows this entry).
  • results: Continuous self-supervised pre-training on large-scale multi-modal data (clinical reports, X-rays, CT, MRI, and pathological images) yields strong generalization across nine downstream datasets and scales well to new modalities; code and pre-trained weights are at https://github.com/yeerwen/MedCoSS.
    Abstract Self-supervised learning is an efficient pre-training method for medical image analysis. However, current research is mostly confined to specific-modality data pre-training, consuming considerable time and resources without achieving universality across different modalities. A straightforward solution is combining all modality data for joint self-supervised pre-training, which poses practical challenges. Firstly, our experiments reveal conflicts in representation learning as the number of modalities increases. Secondly, multi-modal data collected in advance cannot cover all real-world scenarios. In this paper, we reconsider versatile self-supervised learning from the perspective of continual learning and propose MedCoSS, a continuous self-supervised learning approach for multi-modal medical data. Unlike joint self-supervised learning, MedCoSS assigns different modality data to different training stages, forming a multi-stage pre-training process. To balance modal conflicts and prevent catastrophic forgetting, we propose a rehearsal-based continual learning method. We introduce the k-means sampling strategy to retain data from previous modalities and rehearse it when learning new modalities. Instead of executing the pretext task on buffer data, a feature distillation strategy and an intra-modal mixup strategy are applied to these data for knowledge retention. We conduct continuous self-supervised pre-training on a large-scale multi-modal unlabeled dataset, including clinical reports, X-rays, CT scans, MRI scans, and pathological images. Experimental results demonstrate MedCoSS's exceptional generalization ability across nine downstream datasets and its significant scalability in integrating new modality data. Code and pre-trained weight are available at https://github.com/yeerwen/MedCoSS.
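The k-means sampling idea, picking a diverse rehearsal buffer from a finished modality, can be sketched as follows. It assumes scikit-learn is available and that samples are represented by feature vectors; the feature space and buffer bookkeeping used by MedCoSS may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_rehearsal_buffer(features, buffer_size, seed=0):
    """Cluster a modality's sample features with k-means and keep the sample
    closest to each centroid as the rehearsal buffer for later stages."""
    km = KMeans(n_clusters=buffer_size, n_init=10, random_state=seed).fit(features)
    kept = []
    for c in range(buffer_size):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        kept.append(members[int(np.argmin(d))])     # representative of this cluster
    return np.array(kept)                           # indices of samples to replay
```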

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

  • paper_url: http://arxiv.org/abs/2311.17590
  • repo_url: https://github.com/ZiqiaoPeng/SyncTalk
  • paper_authors: Ziqiao Peng, Wentao Hu, Yue Shi, Xiangyu Zhu, Xiaomei Zhang, Hao Zhao, Jun He, Hongyan Liu, Zhaoxin Fan
  • for: Realistic, speech-driven talking-head synthesis requires synchronized coordination of identity, lip motion, facial expression, and head pose; GANs struggle to keep facial identity consistent, and NeRF-based methods often produce mismatched lip movements, inadequate expressions, and unstable head poses.
  • methods: SyncTalk, a NeRF-based method, uses a Face-Sync Controller to align lip movements with speech, a 3D facial blendshape model to capture accurate expressions, a Head-Sync Stabilizer to optimize head poses, and a Portrait-Sync Generator to restore hair detail and blend the head with the torso.
  • results: Experiments and user studies show that SyncTalk outperforms state-of-the-art methods in synchronization and realism; see the supplementary video at https://ziqiaopeng.github.io/synctalk.
    Abstract Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk

CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning

  • paper_url: http://arxiv.org/abs/2311.17583
  • repo_url: None
  • paper_authors: Xu Liu, Shu Zhou, Yurong Song, Wenzhe Luo, Xin Zhang
  • for: The paper addresses the poor cross-dataset generalization of existing face liveness detection algorithms in the financial field by detecting liveness attacks from image-text pairs.
  • methods: Liveness attacks are divided into eight categories described by text; a text encoder and an image encoder extract feature-vector representations, and contrastive learning maximizes the similarity of matched image-text pairs while minimizing that of mismatched ones (a CLIP-style scoring sketch follows this entry).
  • results: The method detects scenario-specific attacks, such as those in dark environments or involving tampered ID-card photos, as well as traditional print and screen-replay attacks; its zero-shot performance on NUAA, CASIA-FASD, Replay-Attack, OULU-NPU, and MSU-MFSD reaches the level of commercial algorithms, and detection rates hit 100% on multiple test sets.
    Abstract Face recognition technology is widely used in the financial field, and various types of liveness attack behaviors need to be addressed. Existing liveness detection algorithms are trained on specific training datasets and tested on testing datasets, but their performance and robustness in transferring to unseen datasets are relatively poor. To tackle this issue, we propose a face liveness detection method based on image-text pairs and contrastive learning, dividing liveness attack problems in the financial field into eight categories and using text information to describe the images of these eight types of attacks. The text encoder and image encoder are used to extract feature vector representations for the classification description text and face images, respectively. By maximizing the similarity of positive samples and minimizing the similarity of negative samples, the model learns shared representations between images and texts. The proposed method is capable of effectively detecting specific liveness attack behaviors in certain scenarios, such as those occurring in dark environments or involving the tampering of ID card photos. Additionally, it is also effective in detecting traditional liveness attack methods, such as printing photo attacks and screen remake attacks. The zero-shot capabilities of face liveness detection on five public datasets, including NUAA, CASIA-FASD, Replay-Attack, OULU-NPU and MSU-MFSD also reaches the level of commercial algorithms. The detection capability of proposed algorithm was verified on 5 types of testing datasets, and the results show that the method outperformed commercial algorithms, and the detection rates reached 100% on multiple datasets. Demonstrating the effectiveness and robustness of introducing image-text pairs and contrastive learning into liveness detection tasks as proposed in this paper.
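At inference time, the image-text matching reduces to CLIP-style zero-shot scoring over category descriptions. The encoders, tokenizer, and prompt wording below are placeholders, not the paper's exact components or its eight-category taxonomy.

```python
import torch
import torch.nn.functional as F

ATTACK_PROMPTS = [                      # illustrative category descriptions
    "a live human face", "a printed photo of a face", "a face replayed on a screen",
    "a face in a dark environment", "a tampered ID card photo",
]

def zero_shot_liveness_scores(image_encoder, text_encoder, tokenizer, image):
    """Embed the face image and each category description, then softmax over the
    scaled cosine similarities to obtain per-category probabilities."""
    with torch.no_grad():
        img = F.normalize(image_encoder(image), dim=-1)                      # (1, D)
        txt = F.normalize(text_encoder(tokenizer(ATTACK_PROMPTS)), dim=-1)   # (C, D)
        logits = 100.0 * img @ txt.t()                                       # scaled cosine sims
        return logits.softmax(dim=-1)                                        # (1, C) probabilities
```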

LGFCTR: Local and Global Feature Convolutional Transformer for Image Matching

  • paper_url: http://arxiv.org/abs/2311.17571
  • repo_url: None
  • paper_authors: Wenhao Zhong, Jie Jiang
  • for: The paper targets accurate and robust correspondence matching across images, especially under extreme conditions.
  • methods: A convolutional transformer captures local and global features simultaneously: an FPN-like framework pairs transformers (global structure) with convolutions (local context and implicit positional encoding), a convolutional transformer module models multi-scale long-range dependencies with multi-scale attention and local aggregation, and a regression-based sub-pixel refinement module uses fine-grained window features for fine-level positional regression.
  • results: The method achieves superior matching performance on a wide range of benchmarks; code will be released at https://github.com/zwh0527/LGFCTR.
    Abstract Image matching that finding robust and accurate correspondences across images is a challenging task under extreme conditions. Capturing local and global features simultaneously is an important way to mitigate such an issue but recent transformer-based decoders were still stuck in the issues that CNN-based encoders only extract local features and the transformers lack locality. Inspired by the locality and implicit positional encoding of convolutions, a novel convolutional transformer is proposed to capture both local contexts and global structures more sufficiently for detector-free matching. Firstly, a universal FPN-like framework captures global structures in self-encoder as well as cross-decoder by transformers and compensates local contexts as well as implicit positional encoding by convolutions. Secondly, a novel convolutional transformer module explores multi-scale long range dependencies by a novel multi-scale attention and further aggregates local information inside dependencies for enhancing locality. Finally, a novel regression-based sub-pixel refinement module exploits the whole fine-grained window features for fine-level positional deviation regression. The proposed method achieves superior performances on a wide range of benchmarks. The code will be available on https://github.com/zwh0527/LGFCTR.

An Efficient Illumination Invariant Tiger Detection Framework for Wildlife Surveillance

  • paper_url: http://arxiv.org/abs/2311.17552
  • repo_url: None
  • paper_authors: Gaurav Pendharkar, A. Ancy Micheal, Jason Misquitta, Ranjeesh Kaippada
  • for: Tiger conservation combines habitat preservation, anti-poaching measures, and community involvement; automated, illumination-robust surveillance supports these efforts.
  • methods: An illumination-invariant tiger-detection framework couples EnlightenGAN-based illumination enhancement with a fine-tuned YOLOv8 detector.
  • results: On the ATRW dataset, the fine-tuned YOLOv8 reaches a 61% mAP score without illumination enhancement, enhancement adds a further 0.7% mAP, and the approach lifts state-of-the-art performance by roughly 6% to 7%.
    Abstract Tiger conservation necessitates the strategic deployment of multifaceted initiatives encompassing the preservation of ecological habitats, anti-poaching measures, and community involvement for sustainable growth in the tiger population. With the advent of artificial intelligence, tiger surveillance can be automated using object detection. In this paper, an accurate illumination invariant framework is proposed based on EnlightenGAN and YOLOv8 for tiger detection. The fine-tuned YOLOv8 model achieves a mAP score of 61% without illumination enhancement. The illumination enhancement improves the mAP by 0.7%. The approaches elevate the state-of-the-art performance on the ATRW dataset by approximately 6% to 7%.

VINNA for Neonates – Orientation Independence through Latent Augmentations

  • paper_url: http://arxiv.org/abs/2311.17546
  • repo_url: None
  • paper_authors: Leonie Henschel, David Kügler, Lilla Zöllei, Martin Reuter
  • for: To enable fast and accurate segmentation of neonatal brain images, in order to better understand and detect changes during development and disease.
  • methods: Introduces VINNA, a network with Resolution-Aware Internal Augmentations that performs fast and accurate segmentation across different resolutions without requiring resampling.
  • results: VINNA handles the head-position variations in neonatal brain images better than state-of-the-art external augmentation approaches, and maintains high segmentation accuracy across resolutions of 0.5-1.0 mm.
    Abstract Fast and accurate segmentation of neonatal brain images is highly desired to better understand and detect changes during development and disease. Yet, the limited availability of ground truth datasets, lack of standardized acquisition protocols, and wide variations of head positioning pose challenges for method development. A few automated image analysis pipelines exist for newborn brain MRI segmentation, but they often rely on time-consuming procedures and require resampling to a common resolution, subject to loss of information due to interpolation and down-sampling. Without registration and image resampling, variations with respect to head positions and voxel resolutions have to be addressed differently. In deep-learning, external augmentations are traditionally used to artificially expand the representation of spatial variability, increasing the training dataset size and robustness. However, these transformations in the image space still require resampling, reducing accuracy specifically in the context of label interpolation. We recently introduced the concept of resolution-independence with the Voxel-size Independent Neural Network framework, VINN. Here, we extend this concept by additionally shifting all rigid-transforms into the network architecture with a four degree of freedom (4-DOF) transform module, enabling resolution-aware internal augmentations (VINNA). In this work we show that VINNA (i) significantly outperforms state-of-the-art external augmentation approaches, (ii) effectively addresses the head variations present specifically in newborn datasets, and (iii) retains high segmentation accuracy across a range of resolutions (0.5-1.0 mm). The 4-DOF transform module is a powerful, general approach to implement spatial augmentation without requiring image or label interpolation. The specific network application to newborns will be made publicly available as VINNA4neonates.
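To make the idea of internal (latent) augmentation concrete, here is a hedged 2D sketch of a 4-DOF transform applied directly to feature maps with PyTorch's affine_grid/grid_sample. The choice of the four degrees of freedom (rotation, two translations, isotropic scale) and all values are illustrative; VINNA itself operates on 3D volumes and its exact module is not reproduced here.

```python
# Hedged sketch of a 4-DOF spatial transform on internal feature maps:
# interpolation happens on features, not on labels, which is the point of
# doing the augmentation inside the network.
import math
import torch
import torch.nn.functional as F

def four_dof_transform(feat, angle_deg, tx, ty, scale):
    """feat: (B, C, H, W) feature maps; returns transformed feature maps."""
    B = feat.size(0)
    theta = math.radians(angle_deg)
    cos, sin = math.cos(theta), math.sin(theta)
    # 2x3 affine matrix per sample (same transform for the whole batch here).
    mat = torch.tensor([[scale * cos, -scale * sin, tx],
                        [scale * sin,  scale * cos, ty]], dtype=feat.dtype)
    mat = mat.unsqueeze(0).repeat(B, 1, 1)
    grid = F.affine_grid(mat, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=False)

if __name__ == "__main__":
    x = torch.randn(2, 16, 64, 64)
    y = four_dof_transform(x, angle_deg=10.0, tx=0.05, ty=-0.02, scale=1.1)
    print(y.shape)  # torch.Size([2, 16, 64, 64])
```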

Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning

  • paper_url: http://arxiv.org/abs/2311.17536
  • repo_url: https://github.com/spengliang/smoothvideo
  • paper_authors: Liang Peng, Haoran Cheng, Zheng Yang, Ruisi Zhao, Linxuan Xia, Chaotian Song, Qinglin Lu, Wei Liu, Boxi Wu
  • for: To improve the consistency and smoothness of one-shot video tuning methods.
  • methods: Adds a noise constraint that regularizes noise predictions across neighboring video frames, resulting in smooth latents.
  • results: Applying the loss to existing one-shot video tuning methods improves the consistency and smoothness of the generated videos; a new evaluation metric is also introduced that better captures detailed features and temporal dynamics, and experiments confirm the effectiveness of the approach.
    Abstract Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of the flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint aims to regulate noise predictions across their temporal neighbors, resulting in smooth latents. It can be simply included as a loss term during the training phase. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics inadequately capture smoothness. To address this, we introduce a novel metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our approach in producing smoother videos on various one-shot video tuning baselines. The source codes and video demos are available at \href{https://github.com/SPengLiang/SmoothVideo}{https://github.com/SPengLiang/SmoothVideo}.
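The noise constraint described above can be pictured as a simple penalty on differences between neighbouring frames' noise predictions, added to the usual diffusion loss. The sketch below is illustrative; the weighting and exact formulation used by the authors may differ.

```python
# Hedged sketch of a temporal noise-constraint term added to the standard
# epsilon-prediction diffusion loss.
import torch
import torch.nn.functional as F

def noise_smoothness_loss(eps_pred):
    """eps_pred: (B, T, C, H, W) predicted noise for T consecutive frames."""
    diff = eps_pred[:, 1:] - eps_pred[:, :-1]      # neighbouring-frame differences
    return diff.pow(2).mean()

def total_loss(eps_pred, eps_true, lambda_smooth=0.1):
    diffusion_loss = F.mse_loss(eps_pred, eps_true)   # standard epsilon-prediction loss
    return diffusion_loss + lambda_smooth * noise_smoothness_loss(eps_pred)

if __name__ == "__main__":
    pred = torch.randn(1, 8, 4, 32, 32)   # e.g. 8 latent frames
    true = torch.randn_like(pred)
    print(total_loss(pred, true).item())
```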

Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

  • paper_url: http://arxiv.org/abs/2311.17532
  • repo_url: None
  • paper_authors: Xingqun Qi, Jiahao Pan, Peng Li, Ruibin Yuan, Xiaowei Chi, Mengfei Li, Wenhan Luo, Wei Xue, Shanghang Zhang, Qifeng Liu, Yike Guo
  • for: Generating realistic 3D co-speech gestures is crucial for human-machine interaction applications, yet existing methods can only generate gestures conditioned on a single emotion label.
  • methods: First constructs high-fidelity emotion-transition human speech using ChatGPT-4 and an audio inpainting approach, then proposes a weakly supervised training strategy to encourage plausible transition gestures.
  • results: The method outperforms existing models on the newly defined emotion-transition task and datasets, and generates diverse gestures.
    Abstract Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label, they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition, the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task. To fulfill this goal, we first incorporate the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches. Considering obtaining the realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically, to enhance the coordination of transition gestures w.r.t different emotional ones, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last, we present a keyframe sampler to supply effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets.

HiDiffusion: Unlocking High-Resolution Creativity and Efficiency in Low-Resolution Trained Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17528
  • repo_url: None
  • paper_authors: Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Zhenyuan Chen, Yao Tang, Yuhao Chen, Wengang Cao, Jiajun Liang
  • for: To enable pretrained diffusion models to efficiently generate high-resolution images (e.g., 1024×1024) that exceed the resolution of the training images.
  • methods: Proposes a simple yet scalable combination of a Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head Self-Attention (MSW-MSA) that allows pretrained diffusion models to generate high-resolution images efficiently.
  • results: HiDiffusion avoids the unreasonable object duplication that pretrained diffusion models exhibit at high resolutions while also reducing inference time, achieving state-of-the-art performance in high-resolution image synthesis.
    Abstract We introduce HiDiffusion, a tuning-free framework comprised of Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head Self-Attention (MSW-MSA) to enable pretrained large text-to-image diffusion models to efficiently generate high-resolution images (e.g. 1024$\times$1024) that surpass the training image resolution. Pretrained diffusion models encounter unreasonable object duplication in generating images beyond the training image resolution. We attribute it to the mismatch between the feature map size of high-resolution images and the receptive field of U-Net's convolution. To address this issue, we propose a simple yet scalable method named RAU-Net. RAU-Net dynamically adjusts the feature map size to match the convolution's receptive field in the deep block of U-Net. Another obstacle in high-resolution synthesis is the slow inference speed of U-Net. Our observations reveal that the global self-attention in the top block, which exhibits locality, however, consumes the majority of computational resources. To tackle this issue, we propose MSW-MSA. Unlike previous window attention mechanisms, our method uses a much larger window size and dynamically shifts windows to better accommodate diffusion models. Extensive experiments demonstrate that our HiDiffusion can scale diffusion models to generate 1024$\times$1024, 2048$\times$2048, or even 4096$\times$4096 resolution images, while simultaneously reducing inference time by 40\%-60\%, achieving state-of-the-art performance on high-resolution image synthesis. The most significant revelation of our work is that a pretrained diffusion model on low-resolution images is scalable for high-resolution generation without further tuning. We hope this revelation can provide insights for future research on the scalability of diffusion models.
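As a rough illustration of the shifted-window mechanism that MSW-MSA builds on, the sketch below partitions a feature map into large, cyclically shifted windows ready for self-attention. Window size and shift are arbitrary example values, not the paper's settings.

```python
# Hedged sketch of window partitioning with a cyclic shift, the basic
# operation behind shifted-window self-attention.
import torch

def shifted_window_partition(x, window, shift):
    """x: (B, H, W, C). Returns (num_windows*B, window*window, C)."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))  # cyclic shift
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    return windows

if __name__ == "__main__":
    feats = torch.randn(1, 64, 64, 96)
    w = shifted_window_partition(feats, window=32, shift=16)
    print(w.shape)  # (4, 1024, 96): four large windows ready for self-attention
```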

A publicly available vessel segmentation algorithm for SLO images

  • paper_url: http://arxiv.org/abs/2311.17525
  • repo_url: None
  • paper_authors: Adam Threlfall, Samuel Gibbon, James Cameron, Tom MacGillivray
  • for: To develop a vessel segmentation algorithm tailored specifically to infra-red scanning laser ophthalmoscope (IRSLO) images.
  • methods: Uses 23 expertly annotated IRSLO images from the RAVIR dataset plus 7 images annotated in-house to train a U-Net (convolutional neural network) that labels pixels as 'vessel' or 'background'.
  • results: On an unseen test set of 4 images, the model achieves an AUC of 0.981 and an AUPRC of 0.815; after thresholding, it reaches a sensitivity of 0.844, a specificity of 0.983, and an F1 score of 0.857.
    Abstract Background and Objective: Infra-red scanning laser ophthalmoscope (IRSLO) images are akin to colour fundus photographs in displaying the posterior pole and retinal vasculature fine detail. While there are many trained networks readily available for retinal vessel segmentation in colour fundus photographs, none cater to IRSLO images. Accordingly, we aimed to develop (and release as open source) a vessel segmentation algorithm tailored specifically to IRSLO images. Materials and Methods: We used 23 expertly annotated IRSLO images from the RAVIR dataset, combined with 7 additional images annotated in-house. We trained a U-Net (convolutional neural network) to label pixels as 'vessel' or 'background'. Results: On an unseen test set (4 images), our model achieved an AUC of 0.981, and an AUPRC of 0.815. Upon thresholding, it achieved a sensitivity of 0.844, a specificity of 0.983, and an F1 score of 0.857. Conclusion: We have made our automatic segmentation algorithm publicly available and easy to use. Researchers can use the generated vessel maps to compute metrics such as fractal dimension and vessel density.
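The reported pixel-level metrics can be reproduced from a predicted probability map and a ground-truth vessel mask with standard tooling; a small sketch using scikit-learn is given below (the 0.5 threshold is illustrative, not necessarily the one used in the paper).

```python
# Hedged sketch: pixel-level AUC, AUPRC, sensitivity, specificity and F1
# computed from a probability map and a binary vessel mask.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def vessel_metrics(prob_map, gt_mask, threshold=0.5):
    y_true = gt_mask.reshape(-1).astype(int)
    y_prob = prob_map.reshape(-1)
    y_pred = (y_prob >= threshold).astype(int)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    return {
        "auc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
        "sensitivity": tp / (tp + fn + 1e-8),
        "specificity": tn / (tn + fp + 1e-8),
        "f1": f1_score(y_true, y_pred),
    }

if __name__ == "__main__":
    prob = np.random.rand(512, 512)
    gt = (np.random.rand(512, 512) > 0.9).astype(int)
    print(vessel_metrics(prob, gt))
```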

Improving Stability during Upsampling – on the Importance of Spatial Context

  • paper_url: http://arxiv.org/abs/2311.17524
  • repo_url: None
  • paper_authors: Shashank Agnihotri, Julia Grabinski, Margret Keuper
  • for: Targets pixel-wise prediction tasks such as image restoration, image segmentation, and disparity estimation, which involve several stages of data resampling.
  • methods: Examines the artifacts that arise during upsampling and improves prediction stability using convolutional upsampling operations with increasing kernel size while keeping the encoder unchanged.
  • results: Larger upsampling kernels generally improve prediction stability in tasks such as image restoration and image segmentation, and a block that combines small kernels for fine detail with large kernels for artifact removal and increased context yields the best results.
    Abstract State-of-the-art models for pixel-wise prediction tasks such as image restoration, image segmentation, or disparity estimation, involve several stages of data resampling, in which the resolution of feature maps is first reduced to aggregate information and then sequentially increased to generate a high-resolution output. Several previous works have investigated the effect of artifacts that are invoked during downsampling and diverse cures have been proposed that facilitate to improve prediction stability and even robustness for image classification. However, equally relevant, artifacts that arise during upsampling have been less discussed. This is significantly relevant as upsampling and downsampling approaches face fundamentally different challenges. While during downsampling, aliases and artifacts can be reduced by blurring feature maps, the emergence of fine details is crucial during upsampling. Blurring is therefore not an option and dedicated operations need to be considered. In this work, we are the first to explore the relevance of context during upsampling by employing convolutional upsampling operations with increasing kernel size while keeping the encoder unchanged. We find that increased kernel sizes can in general improve the prediction stability in tasks such as image restoration or image segmentation, while a block that allows for a combination of small-size kernels for fine details and large-size kernels for artifact removal and increased context yields the best results.
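A minimal sketch of the design the abstract argues for: an upsampling block that combines a small kernel for fine detail with a large kernel for artifact removal and added context. Channel counts and kernel sizes below are illustrative.

```python
# Hedged sketch of an upsampling block mixing small and large receptive fields.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualKernelUpsample(nn.Module):
    def __init__(self, in_ch, out_ch, small_k=3, large_k=11):
        super().__init__()
        self.small = nn.Conv2d(in_ch, out_ch, small_k, padding=small_k // 2)
        self.large = nn.Conv2d(in_ch, out_ch, large_k, padding=large_k // 2)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # resolution increase
        return self.small(x) + self.large(x)                  # fuse both receptive fields

if __name__ == "__main__":
    block = DualKernelUpsample(64, 32)
    y = block(torch.randn(1, 64, 32, 32))
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```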

MMA-Diffusion: MultiModal Attack on Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.17516
  • repo_url: None
  • paper_authors: Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Nan Xu, Qiang Xu
  • for: To examine the security of Text-to-Image (T2I) models and expose the shortcomings of existing defense mechanisms.
  • methods: Proposes MMA-Diffusion, a multimodal threat model that leverages both textual and visual modalities to bypass safeguards such as prompt filters and post-hoc safety checkers in current open-source models and commercial online services.
  • results: MMA-Diffusion successfully circumvents existing safety checks, exposing vulnerabilities in current defense mechanisms.
    Abstract In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.

Fusion of Single and Integral Multispectral Aerial Images

  • paper_url: http://arxiv.org/abs/2311.17515
  • repo_url: None
  • paper_authors: Mohamed Youssef, Oliver Bimber
  • for: To fuse aerial images so that occlusion caused by dense vegetation is removed and target visibility is improved.
  • methods: Combines a model-based and a learning-based approach to fuse the most significant features of conventional aerial images and integral (synthetic-aperture) aerial images, using the environment's spatial references together with features of unoccluded targets.
  • results: Outperforms previous methods, requires no manually tuned parameters, can be extended to an arbitrary number and combination of spectral channels, and is reconfigurable for different use cases.
    Abstract We present a novel hybrid (model- and learning-based) architecture for fusing the most significant features from conventional aerial images and integral aerial images that result from synthetic aperture sensing for removing occlusion caused by dense vegetation. It combines the environment's spatial references with features of unoccluded targets. Our method out-beats the state-of-the-art, does not require manually tuned parameters, can be extended to an arbitrary number and combinations of spectral channels, and is reconfigurable to address different use-cases.

StructRe: Rewriting for Structured Shape Modeling

  • paper_url: http://arxiv.org/abs/2311.17510
  • repo_url: None
  • paper_authors: Wang, Jiepeng, Pan, Hao, Liu, Yang, Tong, Xin, Komura, Taku, Wang, Wenping
  • for: To propose a new approach to structured shape modeling that captures the natural part organization and hierarchies of man-made 3D shapes.
  • methods: Uses StructRe, a structure rewriting system that rewrites a 3D object represented by points and components either upward into more concise structures or downward into more detailed ones; iterating the rewriting yields hierarchies, and a probabilistic formulation resolves the ambiguity of multiple possible hierarchies.
  • results: Trained on PartNet, StructRe generalizes robustly across categories and to multiple object hierarchies, extends to ShapeNet, and benefits shape reconstruction, generation, and editing tasks.
    Abstract Man-made 3D shapes are naturally organized in parts and hierarchies; such structures provide important constraints for shape reconstruction and generation. Modeling shape structures is difficult, because there can be multiple hierarchies for a given shape, causing ambiguity, and across different categories the shape structures are correlated with semantics, limiting generalization. We present StructRe, a structure rewriting system, as a novel approach to structured shape modeling. Given a 3D object represented by points and components, StructRe can rewrite it upward into more concise structures, or downward into more detailed structures; by iterating the rewriting process, hierarchies are obtained. Such a localized rewriting process enables probabilistic modeling of ambiguous structures and robust generalization across object categories. We train StructRe on PartNet data and show its generalization to cross-category and multiple object hierarchies, and test its extension to ShapeNet. We also demonstrate the benefits of probabilistic and generalizable structure modeling for shape reconstruction, generation and editing tasks.

PViT-6D: Overclocking Vision Transformers for 6D Pose Estimation with Confidence-Level Prediction and Pose Tokens

  • paper_url: http://arxiv.org/abs/2311.17504
  • repo_url: None
  • paper_authors: Sebastian Stapf, Tobias Bauernfeind, Marco Riboldi
  • for: To improve the accuracy and reliability of 6D pose estimation while keeping the approach simple to implement and end-to-end learnable.
  • methods: Uses a Vision Transformer for direct 6D pose estimation through a tailored use of classification (pose) tokens, and introduces a simple pose-confidence estimation method that can be integrated into most 6D pose estimation frameworks.
  • results: The proposed Pose Vision Transformer (PViT-6D) outperforms previous state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on YCB-V, while also improving the model's interpretability and the reliability of its predictions at inference time.
    Abstract In the current state of 6D pose estimation, top-performing techniques depend on complex intermediate correspondences, specialized architectures, and non-end-to-end algorithms. In contrast, our research reframes the problem as a straightforward regression task by exploring the capabilities of Vision Transformers for direct 6D pose estimation through a tailored use of classification tokens. We also introduce a simple method for determining pose confidence, which can be readily integrated into most 6D pose estimation frameworks. This involves modifying the transformer architecture by decreasing the number of query elements based on the network's assessment of the scene complexity. Our method that we call Pose Vision Transformer or PViT-6D provides the benefits of simple implementation and being end-to-end learnable while outperforming current state-of-the-art methods by +0.3% ADD(-S) on Linemod-Occlusion and +2.7% ADD(-S) on the YCB-V dataset. Moreover, our method enhances both the model's interpretability and the reliability of its performance during inference.
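To illustrate the pose-token idea, the sketch below appends two learnable tokens to a ViT-style encoder and regresses a 6D rotation representation, a translation, and a confidence score from their outputs. Depths, dimensions, and head counts are placeholders, not the PViT-6D configuration.

```python
# Hedged sketch of pose tokens on a small ViT-style encoder.
import torch
import torch.nn as nn

class PoseTokenViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pose_tokens = nn.Parameter(torch.zeros(1, 2, dim))   # rotation + translation tokens
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.rot_head = nn.Linear(dim, 6)    # 6D rotation representation
        self.trans_head = nn.Linear(dim, 3)  # translation
        self.conf_head = nn.Linear(dim, 1)   # pose confidence

    def forward(self, img):
        x = self.patch_embed(img).flatten(2).transpose(1, 2)       # (B, N, dim)
        tokens = self.pose_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([tokens, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        rot_tok, trans_tok = x[:, 0], x[:, 1]
        return self.rot_head(rot_tok), self.trans_head(trans_tok), self.conf_head(rot_tok).sigmoid()

if __name__ == "__main__":
    model = PoseTokenViT()
    r, t, c = model(torch.randn(2, 3, 224, 224))
    print(r.shape, t.shape, c.shape)  # (2, 6) (2, 3) (2, 1)
```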

Towards Higher Ranks via Adversarial Weight Pruning

  • paper_url: http://arxiv.org/abs/2311.17493
  • repo_url: https://github.com/huawei-noah/Efficient-Computing
  • paper_authors: Yuchuan Tian, Hanting Chen, Tianyu Guo, Chao Xu, Yunhe Wang
  • for: To make Convolutional Neural Networks (CNNs) more efficient to deploy on edge devices by pruning, reducing computation and storage.
  • methods: Proposes Rank-based PruninG (RPG), which in each iteration computes a low-rank decomposition of the sparse weights via singular value decomposition and pushes the weight matrices away from their low-rank approximations, preserving a high-rank structure in the sparse weights.
  • results: Experiments across datasets and tasks show strong performance at high sparsity; on ImageNet, RPG outperforms the previous state of the art by 1.13% top-1 accuracy with ResNet-50 at 98% sparsity.
    Abstract Convolutional Neural Networks (CNNs) are hard to deploy on edge devices due to its high computation and storage complexities. As a common practice for model compression, network pruning consists of two major categories: unstructured and structured pruning, where unstructured pruning constantly performs better. However, unstructured pruning presents a structured pattern at high pruning rates, which limits its performance. To this end, we propose a Rank-based PruninG (RPG) method to maintain the ranks of sparse weights in an adversarial manner. In each step, we minimize the low-rank approximation error for the weight matrices using singular value decomposition, and maximize their distance by pushing the weight matrices away from its low rank approximation. This rank-based optimization objective guides sparse weights towards a high-rank topology. The proposed method is conducted in a gradual pruning fashion to stabilize the change of rank during training. Experimental results on various datasets and different tasks demonstrate the effectiveness of our algorithm in high sparsity. The proposed RPG outperforms the state-of-the-art performance by 1.13% top-1 accuracy on ImageNet in ResNet-50 with 98% sparsity. The codes are available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Pruning/RPG and https://gitee.com/mindspore/models/tree/master/research/cv/RPG.
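One way to picture the rank-based objective is as a regularizer that keeps a weight matrix away from its truncated-SVD approximation; a hedged sketch of such a term is below. The rank, weighting, gradual-pruning schedule, and adversarial formulation of the actual RPG method are not reproduced.

```python
# Hedged sketch of a rank-promoting regularizer: push weights away from a
# fixed truncated-SVD (low-rank) approximation.
import torch

def low_rank_approx(weight, rank):
    """Best rank-`rank` approximation of a 2D weight matrix via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]

def rank_penalty(weight, rank=8):
    """Negative squared distance to the low-rank approximation: minimizing this
    term pushes `weight` toward a higher-rank topology."""
    with torch.no_grad():
        w_lr = low_rank_approx(weight, rank)       # fixed target, no gradient
    return -torch.sum((weight - w_lr) ** 2)

if __name__ == "__main__":
    w = torch.randn(64, 128, requires_grad=True)
    task_loss = torch.tensor(0.0)                  # placeholder for the usual training loss
    loss = task_loss + 1e-4 * rank_penalty(w)
    loss.backward()
    print(w.grad.abs().mean().item())
```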

Spherical Frustum Sparse Convolution Network for LiDAR Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2311.17491
  • repo_url: None
  • paper_authors: Yu Zheng, Guangming Wang, Jiuming Liu, Marc Pollefeys, Hesheng Wang
  • for: Proposes a novel spherical frustum structure that avoids the information loss of 2D-projection-based LiDAR point cloud semantic segmentation.
  • methods: Introduces a memory-efficient hash-based representation of spherical frustums, together with Spherical Frustum sparse Convolution (SFC) and Frustum Fast Point Sampling (F2PS).
  • results: Experiments on the SemanticKITTI and nuScenes datasets show that SFCNet achieves better point cloud semantic segmentation than 2D image-based methods built on conventional spherical projection.
    Abstract LiDAR point cloud semantic segmentation enables the robots to obtain fine-grained semantic information of the surrounding environment. Recently, many works project the point cloud onto the 2D image and adopt the 2D Convolutional Neural Networks (CNNs) or vision transformer for LiDAR point cloud semantic segmentation. However, since more than one point can be projected onto the same 2D position but only one point can be preserved, the previous 2D image-based segmentation methods suffer from inevitable quantized information loss. To avoid quantized information loss, in this paper, we propose a novel spherical frustum structure. The points projected onto the same 2D position are preserved in the spherical frustums. Moreover, we propose a memory-efficient hash-based representation of spherical frustums. Through the hash-based representation, we propose the Spherical Frustum sparse Convolution (SFC) and Frustum Fast Point Sampling (F2PS) to convolve and sample the points stored in spherical frustums respectively. Finally, we present the Spherical Frustum sparse Convolution Network (SFCNet) to adopt 2D CNNs for LiDAR point cloud semantic segmentation without quantized information loss. Extensive experiments on the SemanticKITTI and nuScenes datasets demonstrate that our SFCNet outperforms the 2D image-based semantic segmentation methods based on conventional spherical projection. The source code will be released later.

Non-Visible Light Data Synthesis and Application: A Case Study for Synthetic Aperture Radar Imagery

  • paper_url: http://arxiv.org/abs/2311.17486
  • repo_url: None
  • paper_authors: Zichen Tian, Zhaozheng Chen, Qianru Sun
  • for: To address the scarcity of SAR training samples caused by the difficulty of collecting satellite data, by adapting large-scale pretrained image generation models (e.g., Stable Diffusion and Imagen) to non-visible-light domains.
  • methods: Proposes a 2-stage low-rank adaptation method (2LoRA): the model is first adapted with aerial-view regular images (whose structure matches SAR) and then further adapted with SAR-modality data; in the second stage, a novel prototype LoRA (pLoRA) addresses the class imbalance of SAR datasets.
  • results: Using the generated SAR data to augment the training of classification and segmentation models yields notably improved performance, especially for minor classes.
    Abstract We explore the "hidden" ability of large-scale pre-trained image generation models, such as Stable Diffusion and Imagen, in non-visible light domains, taking Synthetic Aperture Radar (SAR) data for a case study. Due to the inherent challenges in capturing satellite data, acquiring ample SAR training samples is infeasible. For instance, for a particular category of ship in the open sea, we can collect only few-shot SAR images which are too limited to derive effective ship recognition models. If large-scale models pre-trained with regular images can be adapted to generating novel SAR images, the problem is solved. In preliminary study, we found that fine-tuning these models with few-shot SAR images is not working, as the models can not capture the two primary differences between SAR and regular images: structure and modality. To address this, we propose a 2-stage low-rank adaptation method, and we call it 2LoRA. In the first stage, the model is adapted using aerial-view regular image data (whose structure matches SAR), followed by the second stage where the base model from the first stage is further adapted using SAR modality data. Particularly in the second stage, we introduce a novel prototype LoRA (pLoRA), as an improved version of 2LoRA, to resolve the class imbalance problem in SAR datasets. For evaluation, we employ the resulting generation model to synthesize additional SAR data. This augmentation, when integrated into the training process of SAR classification as well as segmentation models, yields notably improved performance for minor classes
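The adaptation mechanism underlying 2LoRA is standard low-rank adaptation of frozen pretrained layers; a minimal LoRA adapter for a linear layer is sketched below. The two-stage schedule and the prototype variant (pLoRA) are not shown, and the rank and scaling values are illustrative.

```python
# Hedged sketch of a LoRA adapter on a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # frozen pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)              # start as an identity adaptation
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), rank=8)
    out = layer(torch.randn(2, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(out.shape, trainable)
```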

CLiSA: A Hierarchical Hybrid Transformer Model using Orthogonal Cross Attention for Satellite Image Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2311.17475
  • repo_url: None
  • paper_authors: Subhajit Paul, Ashutosh Gupta
  • for: To propose a deep-learning-based cloud mask generation method that improves the accuracy of cloud extraction from optical remote-sensing images.
  • methods: Builds on a hybrid transformer architecture that combines orthogonal self-attention with a hierarchical cross-attention model, whose Lipschitz stability is validated theoretically and empirically, trained in an adversarial setting with a Lovász-Softmax loss.
  • results: Qualitative and quantitative evaluations on multiple satellite datasets (Landsat-8, Sentinel-2, and Cartosat-2s) show that the method outperforms other state-of-the-art approaches in precise cloud extraction.
    Abstract Clouds in optical satellite images are a major concern since their presence hinders the ability to carry accurate analysis as well as processing. Presence of clouds also affects the image tasking schedule and results in wastage of valuable storage space on ground as well as space-based systems. Due to these reasons, deriving accurate cloud masks from optical remote-sensing images is an important task. Traditional methods such as threshold-based, spatial filtering for cloud detection in satellite images suffer from lack of accuracy. In recent years, deep learning algorithms have emerged as a promising approach to solve image segmentation problems as it allows pixel-level classification and semantic-level segmentation. In this paper, we introduce a deep-learning model based on hybrid transformer architecture for effective cloud mask generation named CLiSA - Cloud segmentation via Lipschitz Stable Attention network. In this context, we propose an concept of orthogonal self-attention combined with hierarchical cross attention model, and we validate its Lipschitz stability theoretically and empirically. We design the whole setup under adversarial setting in presence of Lov\'asz-Softmax loss. We demonstrate both qualitative and quantitative outcomes for multiple satellite image datasets including Landsat-8, Sentinel-2, and Cartosat-2s. Performing comparative study we show that our model performs preferably against other state-of-the-art methods and also provides better generalization in precise cloud extraction from satellite multi-spectral (MX) images. We also showcase different ablation studies to endorse our choices corresponding to different architectural elements and objective functions.

AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents

  • paper_url: http://arxiv.org/abs/2311.17465
  • repo_url: None
  • paper_authors: Duomin Wang, Bin Dai, Yu Deng, Baoyuan Wang
  • for: To create avatar agents that can autonomously plan and render nuanced facial expressions, considered from both visual and behavioral perspectives.
  • methods: Given high-level inputs about the environment and the agent profile, the framework uses LLMs to produce detailed text descriptions of the agents' facial motions; these are processed by a task-agnostic driving engine into motion token sequences, converted into continuous motion embeddings, and finally rendered photorealistically by a standalone neural renderer.
  • results: Experiments on both newly compiled and existing datasets, covering agents for monadic interaction with the environment and agents designed for dyadic conversation, validate the effectiveness and versatility of the approach in generating high-quality non-verbal avatar motion.
    Abstract In this study, our goal is to create interactive avatar agents that can autonomously plan and animate nuanced facial movements realistically, from both visual and behavioral perspectives. Given high-level inputs about the environment and agent profile, our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions. These descriptions are then processed by our task-agnostic driving engine into motion token sequences, which are subsequently converted into continuous motion embeddings that are further consumed by our standalone neural-based renderer to generate the final photorealistic avatar animations. These streamlined processes allow our framework to adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, which includes experiments on both newly compiled and existing datasets featuring two types of agents -- one capable of monadic interaction with the environment, and the other designed for dyadic conversation -- validates the effectiveness and versatility of our approach. To our knowledge, we advanced a leap step by combining LLMs and neural rendering for generalized non-verbal prediction and photo-realistic rendering of avatar agents.

When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation

  • paper_url: http://arxiv.org/abs/2311.17461
  • repo_url: https://github.com/csxmli2016/w-plus-adapter
  • paper_authors: Xiaoming Li, Xinyu Hou, Chen Change Loy
  • for: To improve identity preservation and disentanglement in personalized text-to-image diffusion models.
  • methods: Proposes using the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve better identity preservation and disentanglement, together with new training objectives that balance the influence of the prompt and the identity condition so that the identity-irrelevant background remains unchanged during facial attribute edits.
  • results: The method generates personalized text-to-image outputs that follow the prompt descriptions while remaining compatible with common StyleGAN editing directions in diverse settings, improving identity preservation and disentanglement.
    Abstract Text-to-image diffusion models have remarkably excelled in producing diverse, high-quality, and photo-realistic images. This advancement has spurred a growing interest in incorporating specific identities into generated content. Most current methods employ an inversion approach to embed a target visual concept into the text embedding space using a single reference image. However, the newly synthesized faces either closely resemble the reference image in terms of facial attributes, such as expression, or exhibit a reduced capacity for identity preservation. Text descriptions intended to guide the facial attributes of the synthesized face may fall short, owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues, we present the novel use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve enhanced identity preservation and disentanglement for diffusion models. By aligning this semantically meaningful human face latent space with text-to-image diffusion models, we succeed in maintaining high fidelity in identity preservation, coupled with the capacity for semantic editing. Additionally, we propose new training objectives to balance the influences of both prompt and identity conditions, ensuring that the identity-irrelevant background remains unaffected during facial attribute modifications. Extensive experiments reveal that our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions in diverse settings. Our source code will be available at \url{https://github.com/csxmli2016/w-plus-adapter}.

W-HMR: Human Mesh Recovery in World Space with Weak-supervised Camera Calibration and Orientation Correction

  • paper_url: http://arxiv.org/abs/2311.17460
  • repo_url: https://github.com/yw0208/W-HMR
  • paper_authors: Wei Yao, Hongwen Zhang, Yunlian Sun, Jinhui Tang
  • for: To address the tendency of existing monocular 3D human body reconstruction methods to simplify the task by minimizing the influence of the camera, which leads to inaccurate reconstruction in world space.
  • methods: Proposes W-HMR, which decouples global body recovery into camera calibration, local body recovery, and global body orientation correction; introduces the first weakly supervised camera calibration method for body distortion, removing the dependence on focal-length labels and achieving finer mesh-image alignment, together with a novel orientation correction module that keeps the reconstructed body upright in world space.
  • results: W-HMR achieves high-quality reconstruction in both camera and world coordinate systems, particularly in challenging scenes.
    Abstract For a long time, in the field of reconstructing 3D human bodies from monocular images, most methods opted to simplify the task by minimizing the influence of the camera. Using a coarse focal length setting results in the reconstructed bodies not aligning well with distorted images. Ignoring camera rotation leads to an unrealistic reconstructed body pose in world space. Consequently, existing methods' application scenarios are confined to controlled environments. And they struggle to achieve accurate and reasonable reconstruction in world space when confronted with complex and diverse in-the-wild images. To address the above issues, we propose W-HMR, which decouples global body recovery into camera calibration, local body recovery and global body orientation correction. We design the first weak-supervised camera calibration method for body distortion, eliminating dependence on focal length labels and achieving finer mesh-image alignment. We propose a novel orientation correction module to allow the reconstructed human body to remain normal in world space. Decoupling body orientation and body pose enables our model to consider the accuracy in camera coordinate and the reasonableness in world coordinate simultaneously, expanding the range of applications. As a result, W-HMR achieves high-quality reconstruction in dual coordinate systems, particularly in challenging scenes. Codes will be released on https://yw0208.github.io/ after publication.

DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.17456
  • repo_url: None
  • paper_authors: Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, Hesheng Wang
  • for: To improve the accuracy and robustness of scene flow estimation.
  • methods: Proposes DifFlow3D, a scene flow estimation network built on a diffusion probabilistic model: iterative diffusion-based refinement improves correlation robustness and resilience to challenging cases, and three flow-related features are used as conditions to restrain generation diversity; an uncertainty estimation module within the diffusion process evaluates the reliability of the estimated flow.
  • results: Achieves state-of-the-art performance, reducing EPE3D by 6.7% on FlyingThings3D and 19.1% on KITTI 2015, and reaches unprecedented millimeter-level accuracy (0.0089 m EPE3D) on the KITTI dataset.
    Abstract Scene flow estimation, which aims to predict per-point 3D displacements of dynamic scenes, is a fundamental task in the computer vision field. However, previous works commonly suffer from unreliable correlation caused by locally constrained searching ranges, and struggle with accumulated inaccuracy arising from the coarse-to-fine structure. To alleviate these problems, we propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. Iterative diffusion-based refinement is designed to enhance the correlation robustness and resilience to challenging cases, e.g., dynamics, noisy inputs, repetitive patterns, etc. To restrain the generation diversity, three key flow-related features are leveraged as conditions in our diffusion model. Furthermore, we also develop an uncertainty estimation module within diffusion to evaluate the reliability of estimated scene flow. Our DifFlow3D achieves state-of-the-art performance, with 6.7\% and 19.1\% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably, our method achieves an unprecedented millimeter-level accuracy (0.0089m in EPE3D) on the KITTI dataset. Additionally, our diffusion-based refinement paradigm can be readily integrated as a plug-and-play module into existing scene flow networks, significantly increasing their estimation accuracy. Codes will be released later.

Continual Learning for Image Segmentation with Dynamic Query

  • paper_url: http://arxiv.org/abs/2311.17450
  • repo_url: https://github.com/weijiawu/cisdq
  • paper_authors: Weijia Wu, Yuzhong Zhao, Zhuang Li, Lianlei Shan, Hong Zhou, Mike Zheng Shou
  • for: To address catastrophic forgetting and background shift in continual learning for image segmentation, where new classes must be incorporated continually.
  • methods: Proposes CISDQ, a simple yet effective continual image segmentation method that decouples the representation learning of old and new knowledge with lightweight query embeddings. Its main contributions are: 1) dynamic queries with an adaptive background class that exploit past knowledge and learn future classes naturally; 2) a class/instance-aware Query Guided Knowledge Distillation strategy that counters catastrophic forgetting by capturing inter-class diversity and intra-class identity; 3) continual learning for instance segmentation, taking instance-wise labeling and supervision into account.
  • results: Experiments on three datasets for two tasks (continual semantic and instance segmentation) show state-of-the-art performance, with 4.4% and 2.9% mIoU improvements in the ADE 100-10 (6 steps) and ADE 100-5 (11 steps) settings, respectively.
    Abstract Image segmentation based on continual learning exhibits a critical drop of performance, mainly due to catastrophic forgetting and background shift, as they are required to incorporate new classes continually. In this paper, we propose a simple, yet effective Continual Image Segmentation method with incremental Dynamic Query (CISDQ), which decouples the representation learning of both old and new knowledge with lightweight query embedding. CISDQ mainly includes three contributions: 1) We define dynamic queries with adaptive background class to exploit past knowledge and learn future classes naturally. 2) CISDQ proposes a class/instance-aware Query Guided Knowledge Distillation strategy to overcome catastrophic forgetting by capturing the inter-class diversity and intra-class identity. 3) Apart from semantic segmentation, CISDQ introduces continual learning for instance segmentation in which instance-wise labeling and supervision are considered. Extensive experiments on three datasets for two tasks (i.e., continual semantic and instance segmentation) are conducted to demonstrate that CISDQ achieves the state-of-the-art performance, specifically, obtaining 4.4% and 2.9% mIoU improvements for the ADE 100-10 (6 steps) setting and ADE 100-5 (11 steps) setting.

Weakly-semi-supervised object detection in remotely sensed imagery

  • paper_url: http://arxiv.org/abs/2311.17449
  • repo_url: None
  • paper_authors: Ji Hun Wang, Jeremy Irvin, Beri Kohen Behar, Ha Tran, Raghav Samavedam, Quentin Hsu, Andrew Y. Ng
  • for: To develop weakly-semi-supervised object detection (WSSOD) models for remotely sensed imagery, so that detectors can be developed for new tasks and geographies.
  • methods: Trains on a large number of point-labeled images, which are easy to acquire at scale in geospatial data, combined with a small number of bounding-box-labeled images.
  • results: On FAIR1M and a wind turbine detection dataset, the WSSOD models substantially outperform fully supervised models trained with the same number of bounding-box-labeled images, and models trained with 2-10x fewer bounding-box labels perform similarly to or better than fully supervised models trained on the full bounding-box-labeled set.
    Abstract Deep learning for detecting objects in remotely sensed imagery can enable new technologies for important applications including mitigating climate change. However, these models often require large datasets labeled with bounding box annotations which are expensive to curate, prohibiting the development of models for new tasks and geographies. To address this challenge, we develop weakly-semi-supervised object detection (WSSOD) models on remotely sensed imagery which can leverage a small amount of bounding boxes together with a large amount of point labels that are easy to acquire at scale in geospatial data. We train WSSOD models which use large amounts of point-labeled images with varying fractions of bounding box labeled images in FAIR1M and a wind turbine detection dataset, and demonstrate that they substantially outperform fully supervised models trained with the same amount of bounding box labeled images on both datasets. Furthermore, we find that the WSSOD models trained with 2-10x fewer bounding box labeled images can perform similarly to or outperform fully supervised models trained on the full set of bounding-box labeled images. We believe that the approach can be extended to other remote sensing tasks to reduce reliance on bounding box labels and increase development of models for impactful applications.

Group-wise Sparse and Explainable Adversarial Attacks

  • paper_url: http://arxiv.org/abs/2311.17434
  • repo_url: https://github.com/wagnermoritz/gse
  • paper_authors: Shpresim Sadiku, Moritz Wagner, Sebastian Pokutta
  • for: To develop effective and explainable group-wise sparse adversarial attacks against deep neural networks (DNNs).
  • methods: Regularizes the adversarial loss with the nuclear group norm and optimizes the resulting quasinorm objective with the 1/2-quasinorm proximal operator, followed by projected Nesterov's accelerated gradient descent with 2-norm regularization on the perturbation magnitudes; attacks are generated within semantically meaningful areas of the image.
  • results: Compared with state-of-the-art methods, the attack consistently achieves much higher group-wise sparsity (e.g., +48.12% on CIFAR-10 and +40.78% on ImageNet in the average targeted case) with lower perturbation magnitudes, faster computation, and a 100% attack success rate.
    Abstract Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, typically regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs than previously anticipated. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. In this paper, we tackle this challenge by presenting an algorithm that simultaneously generates group-wise sparse attacks within semantically meaningful areas of an image. In each iteration, the core operation of our algorithm involves the optimization of a quasinorm adversarial loss. This optimization is achieved by employing the $1/2$-quasinorm proximal operator for some iterations, a method tailored for nonconvex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2$-norm regularization applied to perturbation magnitudes. We rigorously evaluate the efficacy of our novel attack in both targeted and non-targeted attack scenarios, on CIFAR-10 and ImageNet datasets. When compared to state-of-the-art methods, our attack consistently results in a remarkable increase in group-wise sparsity, e.g., an increase of $48.12\%$ on CIFAR-10 and $40.78\%$ on ImageNet (average case, targeted attack), all while maintaining lower perturbation magnitudes. Notably, this performance is complemented by a significantly faster computation time and a $100\%$ attack success rate.

SpeechAct: Towards Generating Whole-body Motion from Speech

  • paper_url: http://arxiv.org/abs/2311.17425
  • repo_url: None
  • paper_authors: Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li
  • for: To generate whole-body motion from speech; prior methods still struggle to produce reasonable and diverse whole-body motions because they rely on suboptimal representations and lack strategies for generating diverse results.
  • methods: Proposes a novel hybrid point representation for accurate and continuous motion generation (e.g., avoiding foot skating) that can be converted to an SMPL-X body mesh; an encoder-decoder architecture produces deterministic facial motion tightly coupled to the audio, while for the body and hands a robust VQ-VAE learns a quantized motion codebook and a contrastive motion learning method encourages diverse yet reasonable motions.
  • results: Experimental results validate the superior performance and correctness of the model, which generates diverse, high-quality whole-body motion from speech.
    Abstract This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech. This is due to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation to achieve accurate and continuous motion generation, e.g., avoiding foot skating, and this representation can be transformed into an easy-to-use representation, i.e., SMPL-X body mesh, for many applications. To generate whole-body motion from speech, for facial motion, closely tied to the audio signal, we introduce an encoder-decoder architecture to achieve deterministic outcomes. However, for the body and hands, which have weaker connections to the audio signal, we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method to encourage the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal by a translation model employing our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.
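The contrastive motion learning component can be pictured as an InfoNCE-style loss over paired motion embeddings; a hedged sketch follows. The temperature and pairing scheme are illustrative, not the paper's exact formulation.

```python
# Hedged sketch of an InfoNCE-style contrastive term over motion embeddings.
import torch
import torch.nn.functional as F

def contrastive_motion_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) two views/augmentations of the same motion clips;
    matched rows are positives, all other rows are negatives."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    a, b = torch.randn(16, 128), torch.randn(16, 128)
    print(contrastive_motion_loss(a, b).item())
```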

Talking Head(?) Anime from a Single Image 4: Improved Model and Its Distillation

  • paper_url: http://arxiv.org/abs/2311.17409
  • repo_url: None
  • paper_authors: Pramook Khungurn
  • for: This paper aims to create a real-time controllable character model from a single image of an anime character.
  • methods: The paper uses U-Nets with attention to improve the image quality of the character model, and distills the system into a small network for real-time applications.
  • results: The proposed method achieves better image quality than the THA3 baseline, but with a slower generation time; the distilled network generates 512x512 animation frames in real time on consumer gaming GPUs while keeping image quality close to that of the full system.
    Abstract We study the problem of creating a character model that can be controlled in real time from a single image of an anime character. A solution to this problem would greatly reduce the cost of creating avatars, computer games, and other interactive applications. Talking Head Anime 3 (THA3) is an open source project that attempts to directly addresses the problem. It takes as input (1) an image of an anime character's upper body and (2) a 45-dimensional pose vector and outputs a new image of the same character taking the specified pose. The range of possible movements is expressive enough for personal avatars and certain types of game characters. However, the system is too slow to generate animations in real time on common PCs, and its image quality can be improved. In this paper, we improve THA3 in two ways. First, we propose new architectures for constituent networks that rotate the character's head and body based on U-Nets with attention that are widely used in modern generative models. The new architectures consistently yield better image quality than the THA3 baseline. Nevertheless, they also make the whole system much slower: it takes up to 150 milliseconds to generate a frame. Second, we propose a technique to distill the system into a small network (less than 2 MB) that can generate 512x512 animation frames in real time (under 30 FPS) using consumer gaming GPUs while keeping the image quality close to that of the full system. This improvement makes the whole system practical for real-time applications.
    摘要 我们研究从单张动漫角色图像构建可实时控制的角色模型的问题。该问题的解决方案将大幅降低制作虚拟形象、电子游戏等交互应用的成本。《Talking Head Anime 3》(THA3)是一个直接尝试解决该问题的开源项目:它以动漫角色的上半身图像和一个45维姿态向量为输入,输出该角色摆出指定姿态的新图像,其可表达的动作范围足以支持个人虚拟形象和某些类型的游戏角色。然而,该系统在普通PC上无法实时生成动画,图像质量也有提升空间。在本文中,我们从两方面改进THA3。第一,我们为旋转角色头部和身体的组成网络提出了新的架构,该架构基于现代生成模型中广泛使用的带注意力机制的U-Net;新架构稳定地取得了优于THA3基线的图像质量,但也使整个系统变得更慢,生成一帧图像最多需要约150毫秒。第二,我们提出了一种蒸馏技术,可以将系统压缩为一个小于2 MB的小型网络,在消费级游戏GPU上实时生成512x512的动画帧,同时保持与完整系统相近的图像质量。这一改进使整个系统在实时应用中变得可行。
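
The second contribution above is a distillation step: a tiny student network is trained to reproduce the output of the slow, high-quality system. The PyTorch sketch below shows that generic teacher-student loop under stated assumptions; the student architecture, the L1 objective, and the placeholder teacher are all illustrative and not the paper's actual design.

```python
# Hedged sketch of distilling a heavy image-to-image generator into a small student.
import torch
import torch.nn as nn

class TinyStudent(nn.Module):
    def __init__(self, pose_dim=45):
        super().__init__()
        self.pose_fc = nn.Linear(pose_dim, 16 * 8 * 8)
        self.net = nn.Sequential(
            nn.Conv2d(3 + 16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, img, pose):               # img: (B,3,H,W), pose: (B,45)
        B, _, H, W = img.shape
        p = self.pose_fc(pose).view(B, 16, 8, 8)
        p = nn.functional.interpolate(p, size=(H, W), mode="nearest")
        return self.net(torch.cat([img, p], dim=1))

def distill_step(student, teacher, img, pose, opt):
    with torch.no_grad():
        target = teacher(img, pose)              # slow, high-quality output
    pred = student(img, pose)
    loss = nn.functional.l1_loss(pred, target)   # could add perceptual terms
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

student = TinyStudent()
teacher = lambda img, pose: img                 # placeholder for the full system
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
img, pose = torch.rand(2, 3, 64, 64) * 2 - 1, torch.rand(2, 45)
print(distill_step(student, teacher, img, pose, opt))
```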

Dynamic Dense Graph Convolutional Network for Skeleton-based Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2311.17408
  • repo_url: None
  • paper_authors: Xinshun Wang, Wanying Zhang, Can Wang, Yuan Gao, Mengyuan Liu
  • for: 本文为了解决GCN在人体动作预测任务中的瓶颈问题,提出了动态稠密图卷积网络(DD-GCN)。
  • methods: 本文使用4D邻接建模构建了一个稠密图,作为不同抽象层次上运动序列的综合表示,并提出了一种动态消息传递机制,从数据中学习生成反映样本特定相关性的独特消息,以提高模型性能。
  • results: 对于人体动作预测任务,DD-GCN显著超过了state-of-the-art GCN-based方法,特别是在使用长期和我们提议的极长期协议时。
    Abstract Graph Convolutional Networks (GCN) which typically follows a neural message passing framework to model dependencies among skeletal joints has achieved high success in skeleton-based human motion prediction task. Nevertheless, how to construct a graph from a skeleton sequence and how to perform message passing on the graph are still open problems, which severely affect the performance of GCN. To solve both problems, this paper presents a Dynamic Dense Graph Convolutional Network (DD-GCN), which constructs a dense graph and implements an integrated dynamic message passing. More specifically, we construct a dense graph with 4D adjacency modeling as a comprehensive representation of motion sequence at different levels of abstraction. Based on the dense graph, we propose a dynamic message passing framework that learns dynamically from data to generate distinctive messages reflecting sample-specific relevance among nodes in the graph. Extensive experiments on benchmark Human 3.6M and CMU Mocap datasets verify the effectiveness of our DD-GCN which obviously outperforms state-of-the-art GCN-based methods, especially when using long-term and our proposed extremely long-term protocol.
    摘要 图卷积网络(GCN)通常采用神经消息传递框架来建模骨架关节之间的依赖关系,在基于骨架的人体运动预测任务中取得了很大成功。然而,如何从骨架序列构建图、以及如何在图上进行消息传递,仍然是开放问题,并严重影响GCN的性能。为解决这两个问题,本文提出了动态稠密图卷积网络(DD-GCN),它构建了一个稠密图,并实现了一种集成的动态消息传递。具体来说,我们使用4D邻接建模来构建稠密图,作为不同抽象层次上运动序列的综合表示;基于该稠密图,我们提出一种动态消息传递框架,从数据中动态学习,生成反映图中节点之间样本特定相关性的独特消息。在Human 3.6M和CMU Mocap基准数据集上的大量实验验证了DD-GCN的有效性,其明显优于目前最先进的基于GCN的方法,尤其是在长期预测以及我们提出的超长期预测协议下。
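
To make the "dynamic message passing" idea above concrete, here is a minimal PyTorch layer that mixes a learned static adjacency with a sample-specific adjacency computed from feature affinities before propagating messages. It is a generic dynamic-GCN layer written for illustration; the joint count, dimensions, and the exact way DD-GCN forms its 4D adjacency are not taken from the paper.

```python
# Hedged sketch of one graph-convolution layer whose adjacency is partly learned from data.
import torch
import torch.nn as nn

class DynamicGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, num_joints):
        super().__init__()
        self.static_adj = nn.Parameter(torch.eye(num_joints))   # learned base graph
        self.theta = nn.Linear(in_dim, out_dim)
        self.phi = nn.Linear(in_dim, out_dim)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                        # x: (batch, joints, in_dim)
        # sample-specific adjacency from pairwise feature affinity
        a = torch.softmax(self.theta(x) @ self.phi(x).transpose(1, 2), dim=-1)
        adj = a + self.static_adj                # combine dynamic and static parts
        return torch.relu(adj @ self.proj(x))    # message passing + update

layer = DynamicGraphConv(in_dim=3, out_dim=16, num_joints=22)
pose = torch.randn(4, 22, 3)                     # 4 skeletons with 22 joints (xyz)
print(layer(pose).shape)                         # -> torch.Size([4, 22, 16])
```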

Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset

  • paper_url: http://arxiv.org/abs/2311.17396
  • repo_url: None
  • paper_authors: Yujin Jeon, Eunsue Choi, Youngchan Kim, Yunseong Moon, Khalid Omer, Felix Heide, Seung-Hwan Baek
  • for: 本研究构建了两个新的光谱-偏振(spectro-polarimetric)数据集——三色Stokes图像与高光谱Stokes图像,以弥补现有数据集在物体多样性、光照条件、偏振类型和图像数量方面的不足。
  • methods: 本研究采集并构建了上述光谱-偏振数据集,涵盖线偏振与圆偏振以及多个光谱通道,并分析了其图像统计特性。
  • results: 本研究获得了两个新的光谱-偏振数据集,对高维数据给出了高效的表示方式,并评估了shape-from-polarization方法的光谱依赖性,为数据驱动的光谱-偏振成像与视觉研究奠定了基础。
    Abstract Image datasets are essential not only in validating existing methods in computer vision but also in developing new methods. Most existing image datasets focus on trichromatic intensity images to mimic human vision. However, polarization and spectrum, the wave properties of light that animals in harsh environments and with limited brain capacity often rely on, remain underrepresented in existing datasets. Although spectro-polarimetric datasets exist, these datasets have insufficient object diversity, limited illumination conditions, linear-only polarization data, and inadequate image count. Here, we introduce two spectro-polarimetric datasets: trichromatic Stokes images and hyperspectral Stokes images. These novel datasets encompass both linear and circular polarization; they introduce multiple spectral channels; and they feature a broad selection of real-world scenes. With our dataset in hand, we analyze the spectro-polarimetric image statistics, develop efficient representations of such high-dimensional data, and evaluate spectral dependency of shape-from-polarization methods. As such, the proposed dataset promises a foundation for data-driven spectro-polarimetric imaging and vision research. Dataset and code will be publicly available.
    摘要 Most existing image datasets focus on trichromatic intensity images, leaving the polarization and spectral properties of light underrepresented. To address this gap, we introduce two novel spectro-polarimetric datasets: trichromatic Stokes images and hyperspectral Stokes images. These datasets encompass both linear and circular polarization, introduce multiple spectral channels, and feature a broad selection of real-world scenes. With our dataset in hand, we analyze the spectro-polarimetric image statistics, develop efficient representations of high-dimensional data, and evaluate the spectral dependency of shape-from-polarization methods. Our dataset provides a foundation for data-driven spectro-polarimetric imaging and vision research. The dataset and code will be publicly available, offering a valuable resource for researchers and developers in the field.
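
Stokes images like those described above directly yield the usual polarization maps. The short NumPy sketch below computes degree of linear polarization, angle of linear polarization, and degree of circular polarization from an (S0, S1, S2, S3) stack using textbook formulas; it is a generic illustration, and the dataset's own channel ordering or conventions may differ.

```python
# Hedged sketch of standard polarization quantities derived from Stokes images.
import numpy as np

def polarization_maps(stokes, eps=1e-8):
    """stokes: (4, H, W) array holding S0, S1, S2, S3 per pixel."""
    s0, s1, s2, s3 = stokes
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)        # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                   # angle of linear polarization
    docp = np.abs(s3) / (s0 + eps)                    # degree of circular polarization
    return dolp, aolp, docp

stokes = np.random.rand(4, 8, 8)
dolp, aolp, docp = polarization_maps(stokes)
print(dolp.shape, aolp.min(), docp.max())
```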

360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries

  • paper_url: http://arxiv.org/abs/2311.17389
  • repo_url: None
  • paper_authors: Huajian Huang, Changkun Liu, Yipeng Zhu, Hui Cheng, Tristan Braud, Sai-Kit Yeung
  • for: This paper introduces a new benchmark dataset, 360Loc, composed of 360$^\circ$ images with ground truth poses for visual localization.
  • methods: The paper presents a practical implementation of 360$^\circ$ mapping that combines 360$^\circ$ images with lidar data to generate ground truth 6DoF poses, and proposes a virtual camera approach to generate lower-FoV query frames from 360$^\circ$ images for fair cross-device comparison.
  • results: The paper demonstrates that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures.
    Abstract Portable 360$^\circ$ cameras are becoming a cheap and efficient tool to establish large visual databases. By capturing omnidirectional views of a scene, these cameras could expedite building environment models that are essential for visual localization. However, such an advantage is often overlooked due to the lack of valuable datasets. This paper introduces a new benchmark dataset, 360Loc, composed of 360$^\circ$ images with ground truth poses for visual localization. We present a practical implementation of 360$^\circ$ mapping combining 360$^\circ$ images with lidar data to generate the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that explores the challenge of cross-device visual positioning, involving 360$^\circ$ reference frames, and query frames from pinhole, ultra-wide FoV fisheye, and 360$^\circ$ cameras. We propose a virtual camera approach to generate lower-FoV query frames from 360$^\circ$ images, which ensures a fair comparison of performance among different query types in visual localization tasks. We also extend this virtual camera approach to feature matching-based and pose regression-based methods to alleviate the performance loss caused by the cross-device domain gap, and evaluate its effectiveness against state-of-the-art baselines. We demonstrate that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures. These results provide new insights into 360-camera mapping and omnidirectional visual localization with cross-device queries.
    摘要 便携式360度相机正成为建立大规模视觉数据库的低成本且高效的工具。通过采集场景的全向视图,这类相机能够加速构建视觉定位所必需的环境模型。然而,由于缺乏有价值的数据集,这一优势往往被忽视。本文介绍了一个新的基准数据集360Loc,它由带有真实位姿(ground truth poses)的360度图像组成。我们提出了一种实用的360度建图方案,将360度图像与激光雷达数据相结合,以生成真实的6DoF位姿。360Loc是首个探讨跨设备视觉定位挑战的数据集与基准,涉及360度参考帧,以及来自针孔相机、超广视场鱼眼相机和360度相机的查询帧。我们提出了一种虚拟相机方法,从360度图像生成较低视场角的查询帧,以确保在视觉定位任务中对不同查询类型进行公平比较;我们还将该虚拟相机方法扩展到基于特征匹配和基于位姿回归的方法中,以缓解跨设备域差带来的性能损失,并与最先进的基线进行了对比评估。我们证明了全向视觉定位在具有对称性和重复结构的大规模挑战场景中更加鲁棒。这些结果为360相机建图和跨设备查询下的全向视觉定位提供了新的见解。
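
The "virtual camera" idea above amounts to reprojecting an equirectangular panorama into a pinhole view with a chosen field of view. Below is a self-contained NumPy sketch of that reprojection using nearest-neighbour sampling; the FoV, output resolution, rotation convention, and interpolation are illustrative choices, not the dataset's exact pipeline.

```python
# Hedged sketch of sampling a lower-FoV pinhole view out of an equirectangular image.
import numpy as np

def virtual_pinhole(equirect, fov_deg=90.0, out_hw=(256, 256), yaw=0.0, pitch=0.0):
    Hp, Wp = out_hw
    H, W = equirect.shape[:2]
    f = 0.5 * Wp / np.tan(np.radians(fov_deg) / 2)     # pinhole focal length
    u, v = np.meshgrid(np.arange(Wp) - Wp / 2, np.arange(Hp) - Hp / 2)
    rays = np.stack([u, v, np.full_like(u, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # rotate rays by yaw (around y) then pitch (around x)
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    rays = rays @ (Ry @ Rx).T
    lon = np.arctan2(rays[..., 0], rays[..., 2])       # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1, 1))      # [-pi/2, pi/2]
    x = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)  # nearest-neighbour lookup
    y = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).astype(int)
    return equirect[y, x]

pano = np.random.rand(512, 1024, 3)                    # stand-in 360 image
view = virtual_pinhole(pano, fov_deg=90, yaw=np.radians(30))
print(view.shape)                                       # (256, 256, 3)
```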

Generative Hierarchical Temporal Transformer for Hand Action Recognition and Motion Prediction

  • paper_url: http://arxiv.org/abs/2311.17366
  • repo_url: None
  • paper_authors: Yilin Wen, Hao Pan, Takehiko Ohkawa, Lei Yang, Jia Pan, Yoichi Sato, Taku Komura, Wenping Wang
  • for: 本文旨在同时解决手部动作识别与未来三维手部运动预测。以往工作通常只关注识别或预测中的一个方面,而我们的框架可以同时建模这两个方面,利用跨时间戳观察到的短期手部运动与长期动作一致性,实现更加真实的运动预测。
  • methods: 我们提出了一种基于Transformer VAE的生成式框架,由两个级联的VAE模块组成:下层姿态模块建模短时程的手部姿态,上层动作模块建模长时程的动作,两者通过表示亚秒级手部姿态序列的中间特征相连接,以忠实刻画手部姿态与动作之间的语义依赖及不同的时间粒度。
  • results: 我们在多个数据集上分别训练姿态与动作模块,以充分利用不同质量的姿态-动作标注。实验证明,联合建模识别与预测优于分别求解这两个问题,而语义与时间层级结构支持长期的姿态和动作建模。
    Abstract We present a novel framework that concurrently tackles hand action recognition and 3D future hand motion prediction. While previous works focus on either recognition or prediction, we propose a generative Transformer VAE architecture to jointly capture both aspects, facilitating realistic motion prediction by leveraging the short-term hand motion and long-term action consistency observed across timestamps.To ensure faithful representation of the semantic dependency and different temporal granularity of hand pose and action, our framework is decomposed into two cascaded VAE blocks. The lower pose block models short-span poses, while the upper action block models long-span action. These are connected by a mid-level feature that represents sub-second series of hand poses.Our framework is trained across multiple datasets, where pose and action blocks are trained separately to fully utilize pose-action annotations of different qualities. Evaluations show that on multiple datasets, the joint modeling of recognition and prediction improves over separate solutions, and the semantic and temporal hierarchy enables long-term pose and action modeling.
    摘要 我们提出了一种新的框架,可同时解决手部动作识别与三维未来手部运动预测问题。以往工作通常只关注识别或预测中的一个方面,我们则提出一种生成式Transformer VAE架构来联合建模这两个方面,利用跨时间戳观察到的短期手部运动与长期动作一致性,实现更加真实的运动预测。为了忠实表示手部姿态与动作之间的语义依赖以及二者不同的时间粒度,我们的框架被分解为两个级联的VAE模块:下层姿态模块建模短时程的姿态,上层动作模块建模长时程的动作,二者由表示亚秒级手部姿态序列的中间特征相连接。我们的框架在多个数据集上训练,其中姿态模块与动作模块分别训练,以充分利用不同质量的姿态-动作标注。评估表明,在多个数据集上,联合建模识别与预测优于分别求解的方案,而语义与时间层级结构支持长期的姿态和动作建模。
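
As a rough illustration of the cascaded latent design described above, the PyTorch sketch below encodes a pose sequence into mid-level features, summarizes them into an action-level latent with the usual VAE reparameterization, and decodes poses back. It is a deliberately simplified stand-in (a GRU instead of Transformers, a single stochastic latent instead of two full VAE blocks); dimensions and losses are assumptions.

```python
# Hedged sketch of a two-level latent hierarchy: mid-level pose features plus an
# action-level VAE latent (a generic cascaded design, not the paper's exact blocks).
import torch
import torch.nn as nn

def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class TwoLevelVAE(nn.Module):
    def __init__(self, pose_dim=48, mid_dim=32, z_dim=16):
        super().__init__()
        self.pose_enc = nn.GRU(pose_dim, mid_dim, batch_first=True)
        self.act_mu = nn.Linear(mid_dim, z_dim)
        self.act_logvar = nn.Linear(mid_dim, z_dim)
        self.dec = nn.Linear(z_dim + mid_dim, pose_dim)

    def forward(self, poses):                    # poses: (B, T, pose_dim)
        mid, _ = self.pose_enc(poses)            # mid-level feature per sub-span
        summary = mid.mean(dim=1)                # long-span (action) summary
        mu, logvar = self.act_mu(summary), self.act_logvar(summary)
        z = reparameterize(mu, logvar)           # action-level latent
        z_rep = z.unsqueeze(1).expand(-1, mid.shape[1], -1)
        recon = self.dec(torch.cat([z_rep, mid], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

model = TwoLevelVAE()
poses = torch.randn(2, 60, 48)                   # 2 clips, 60 frames, 16 joints * xyz
recon, kl = model(poses)
print(recon.shape, kl.item())
```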

Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning

  • paper_url: http://arxiv.org/abs/2311.17365
  • repo_url: https://github.com/enlighten0707/Symbol-LLM
  • paper_authors: Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, Cewu Lu
  • for: 提高人工智能理解活动的能力,增强Activity Understanding的可解释性、泛化性和数据效率。
  • methods: 基于符号系统的活动理解方法,利用大语言模型(Symbol-LLM)来近似符号系统的两个理想性质(广覆盖的符号与合理的规则),并通过模糊逻辑计算按规则推理出图像中的活动语义。
  • results: 在多种Activity Understanding任务中表现出色,超过传统方法的性能。
    Abstract Human reasoning can be understood as a cooperation between the intuitive, associative "System-1" and the deliberative, logical "System-2". For existing System-1-like methods in visual activity understanding, it is crucial to integrate System-2 processing to improve explainability, generalization, and data efficiency. One possible path of activity reasoning is building a symbolic system composed of symbols and rules, where one rule connects multiple symbols, implying human knowledge and reasoning abilities. Previous methods have made progress, but are defective with limited symbols from handcraft and limited rules from visual-based annotations, failing to cover the complex patterns of activities and lacking compositional generalization. To overcome the defects, we propose a new symbolic system with two ideal important properties: broad-coverage symbols and rational rules. Collecting massive human knowledge via manual annotations is expensive to instantiate this symbolic system. Instead, we leverage the recent advancement of LLMs (Large Language Models) as an approximation of the two ideal properties, i.e., Symbols from Large Language Models (Symbol-LLM). Then, given an image, visual contents from the images are extracted and checked as symbols and activity semantics are reasoned out based on rules via fuzzy logic calculation. Our method shows superiority in extensive activity understanding tasks. Code and data are available at https://mvig-rhos.com/symbol_llm.
    摘要 人类推理可以理解为直觉、联想式的"系统1"与深思、逻辑式的"系统2"之间的合作。现有的类系统1方法在视觉活动理解方面,亟需结合系统2式的处理,以提高可解释性、泛化能力和数据效率。一种可能的活动推理路径是构建由符号和规则组成的符号系统,其中一条规则连接多个符号,蕴含人类的知识与推理能力。已有方法取得了一定进展,但其符号来自人工设计、规则来自基于视觉的标注,数量有限,无法覆盖活动的复杂模式,也缺乏组合泛化能力。为克服这些缺陷,我们提出了一个具备两个重要理想性质——广覆盖的符号与合理的规则——的新符号系统。通过人工标注收集大量人类知识来实例化这一符号系统代价过高;相反,我们利用大语言模型(LLM)的最新进展来近似这两个理想性质,即Symbols from Large Language Models(Symbol-LLM)。随后,给定一张图像,从图像中提取视觉内容并检验其作为符号,再基于规则通过模糊逻辑计算推理出活动语义。我们的方法在大量活动理解任务中表现出优越性。代码和数据可在 https://mvig-rhos.com/symbol_llm 获取。
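
The final reasoning step above scores activities by combining symbol confidences with fuzzy logic. The tiny Python sketch below evaluates two made-up rules with a product t-norm for AND and a probabilistic sum for OR; the symbols, rules, and probability values are invented for illustration and are not from the paper.

```python
# Hedged sketch of reasoning an activity score from symbol probabilities with fuzzy logic.
def fuzzy_and(*vals):
    out = 1.0
    for v in vals:
        out *= v          # product t-norm
    return out

def fuzzy_or(*vals):
    out = 0.0
    for v in vals:
        out = out + v - out * v   # probabilistic sum
    return out

# symbol probabilities as produced by a visual detector / LLM-derived checker (made up)
symbols = {"person": 0.95, "cup": 0.80, "hand_near_mouth": 0.60, "table": 0.90}

# rule: drinking <- person AND cup AND hand_near_mouth
drinking = fuzzy_and(symbols["person"], symbols["cup"], symbols["hand_near_mouth"])
# rule: dining_scene <- table OR cup
dining_scene = fuzzy_or(symbols["table"], symbols["cup"])

print(f"drinking={drinking:.3f}, dining_scene={dining_scene:.3f}")
```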

How does spatial structure affect psychological restoration? A method based on Graph Neural Networks and Street View Imagery

  • paper_url: http://arxiv.org/abs/2311.17361
  • repo_url: None
  • paper_authors: Haoran Ma, Yan Zhang, Pengyuan Liu, Fan Zhang, Pengyu Zhua
  • for: 本研究旨在理解空间结构(即场景实体之间的位置关系)如何影响城市与自然环境的心理恢复质量。
  • methods: 本研究提出了一种基于图神经网络(GNN)的空间依赖方法:在街道层面利用序列街景图像(SVI)构建图以刻画实体间的位置关系,在城市层面将道路拓扑建模为非欧数据结构并嵌入城市特征,从而在城市尺度上评估心理恢复质量。
  • results: 结果显示,空间依赖的GNN模型优于传统方法(Acc = 0.735,F1 = 0.732);由序列街景图像刻画的空间结构对恢复质量有显著影响,且具有相同恢复质量的空间呈现出不同的空间结构模式。
    Abstract The Attention Restoration Theory (ART) presents a theoretical framework with four essential indicators (being away, extent, fascinating, and compatibility) for comprehending urban and natural restoration quality. However, previous studies relied on non-sequential data and non-spatial dependent methods, which overlooks the impact of spatial structure defined here as the positional relationships between scene entities on restoration quality. The past methods also make it challenging to measure restoration quality on an urban scale. In this work, a spatial-dependent graph neural networks (GNNs) approach is proposed to reveal the relation between spatial structure and restoration quality on an urban scale. Specifically, we constructed two different types of graphs at the street and city levels. The street-level graphs, using sequential street view images (SVIs) of road segments to capture position relationships between entities, were used to represent spatial structure. The city-level graph, modeling the topological relationships of roads as non-Euclidean data structures and embedding urban features (including Perception-features, Spatial-features, and Socioeconomic-features), was used to measure restoration quality. The results demonstrate that: 1) spatial-dependent GNNs model outperforms traditional methods (Acc = 0.735, F1 = 0.732); 2) spatial structure portrayed through sequential SVIs data significantly influences restoration quality; 3) spaces with the same restoration quality exhibited distinct spatial structures patterns. This study clarifies the association between spatial structure and restoration quality, providing a new perspective to improve urban well-being in the future.
    摘要 注意力恢复理论(ART)提出了四个关键指标(远离感、延展性、魅力性和兼容性)来理解城市和自然环境的恢复质量。然而,以往研究依赖非序列数据和不考虑空间依赖的方法,忽略了空间结构(此处定义为场景实体之间的位置关系)对恢复质量的影响,也使得在城市尺度上度量恢复质量较为困难。在本工作中,我们提出一种空间依赖的图神经网络(GNN)方法,以揭示城市尺度上空间结构与恢复质量之间的关系。具体来说,我们在街道和城市两个层面构建了两类图:街道层面的图利用路段的序列街景图像(SVI)捕捉实体间的位置关系,用于表示空间结构;城市层面的图将道路的拓扑关系建模为非欧数据结构,并嵌入城市特征(包括感知特征、空间特征和社会经济特征),用于度量恢复质量。结果表明:1)空间依赖的GNN模型优于传统方法(Acc = 0.735,F1 = 0.732);2)由序列SVI数据刻画的空间结构对恢复质量有显著影响;3)具有相同恢复质量的空间呈现出不同的空间结构模式。这一研究阐明了空间结构与恢复质量之间的关联,为未来提升城市福祉提供了新的视角。

A natural language processing-based approach: mapping human perception by understanding deep semantic features in street view images

  • paper_url: http://arxiv.org/abs/2311.17354
  • repo_url: None
  • paper_authors: Haoran Ma, Dongdong Wu
  • for: This paper aims to comprehensively understand the deep semantic features of human perception of a scene using a pre-trained natural language model and an image captioning network.
  • methods: The authors use Place Pulse 2.0 as their base dataset, which contains human-perceived labels for various scenes. They use an image captioning network to extract description information and finetune a pre-trained BERT model with a regression head for six human perceptual dimensions.
  • results: The approach, which uses deep semantic features, performs better than previous studies that rely on machine learning methods with shallow features. A migration experiment in Hong Kong shows better explanatory power in the face of spatial heterogeneity and provides new ideas for subsequent human perception research.
    Abstract In the past decade, using Street View images and machine learning to measure human perception has become a mainstream research approach in urban science. However, this approach using only image-shallow information makes it difficult to comprehensively understand the deep semantic features of human perception of a scene. In this study, we proposed a new framework based on a pre-train natural language model to understand the relationship between human perception and the sense of a scene. Firstly, Place Pulse 2.0 was used as our base dataset, which contains a variety of human-perceived labels, namely, beautiful, safe, wealthy, depressing, boring, and lively. An image captioning network was used to extract the description information of each street view image. Secondly, a pre-trained BERT model was finetuning and added a regression function for six human perceptual dimensions. Furthermore, we compared the performance of five traditional regression methods with our approach and conducted a migration experiment in Hong Kong. Our results show that human perception scoring by deep semantic features performed better than previous studies by machine learning methods with shallow features. The use of deep scene semantic features provides new ideas for subsequent human perception research, as well as better explanatory power in the face of spatial heterogeneity.
    摘要 在过去十年中,利用街景图像和机器学习来度量人类感知已成为城市科学中的主流研究方法。然而,这种方法只使用图像的浅层信息,难以全面理解人类对场景感知的深层语义特征。在本研究中,我们提出了一个基于预训练自然语言模型的新框架,以理解人类感知与场景语义之间的关系。首先,我们以Place Pulse 2.0作为基础数据集,该数据集包含多种人类感知标签,即美丽、安全、富裕、压抑、无聊和活泼;并使用图像描述网络提取每张街景图像的描述信息。其次,我们对预训练的BERT模型进行微调,并为六个人类感知维度添加回归输出。此外,我们将五种传统回归方法与我们的方法进行了性能比较,并在香港进行了迁移实验。结果表明,基于深层语义特征的人类感知评分优于以往使用浅层特征的机器学习方法。深层场景语义特征为后续人类感知研究提供了新思路,并在面对空间异质性时具有更好的解释力。
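
The core modeling step above is a BERT encoder finetuned with a regression head over six perception scores. The PyTorch/Hugging Face sketch below shows that generic setup on two toy captions; the checkpoint name, pooling choice, target values, and loss are placeholders chosen for illustration, not the paper's configuration.

```python
# Hedged sketch of a 6-way regression head on a pre-trained BERT fitted with MSE on captions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class PerceptionRegressor(nn.Module):
    def __init__(self, backbone="bert-base-uncased", num_dims=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_dims)

    def forward(self, **tokens):
        out = self.encoder(**tokens).last_hidden_state[:, 0]   # [CLS] embedding
        return self.head(out)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PerceptionRegressor()
captions = ["a tree-lined street with wide sidewalks",
            "a narrow alley with blank walls"]
targets = torch.tensor([[7.1, 6.5, 5.0, 2.1, 3.0, 6.0],
                        [3.2, 3.0, 2.5, 6.8, 6.1, 2.4]])       # toy perception scores
batch = tok(captions, padding=True, return_tensors="pt")
loss = nn.functional.mse_loss(model(**batch), targets)
loss.backward()
print(loss.item())
```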

Implicit-explicit Integrated Representations for Multi-view Video Compression

  • paper_url: http://arxiv.org/abs/2311.17350
  • repo_url: https://github.com/zc-lynen/MV-IERV
  • paper_authors: Chen Zhu, Guo Lu, Bing He, Rong Xie, Li Song
  • for: 本文旨在提高多视点视频的压缩与传输效率,同时保持高质量的重建结果。
  • methods: 本文采用显式与隐式相结合的表示方法:先用基于显式表示的2D视频编码器对其中一个源视图进行编码,再用基于隐式神经表示(INR)的编码器对其余视图进行编码;隐式编码器以时间和视图索引作为坐标输入,并引入多级特征网格嵌入与全卷积架构,以提升可压缩性。
  • results: 实验结果显示,所提框架在视图压缩和场景建模方面可以达到与最新的多视点视频压缩标准MIV及其他基于INR的方案相当甚至更优的性能。
    Abstract With the increasing consumption of 3D displays and virtual reality, multi-view video has become a promising format. However, its high resolution and multi-camera shooting result in a substantial increase in data volume, making storage and transmission a challenging task. To tackle these difficulties, we propose an implicit-explicit integrated representation for multi-view video compression. Specifically, we first use the explicit representation-based 2D video codec to encode one of the source views. Subsequently, we propose employing the implicit neural representation (INR)-based codec to encode the remaining views. The implicit codec takes the time and view index of multi-view video as coordinate inputs and generates the corresponding implicit reconstruction frames.To enhance the compressibility, we introduce a multi-level feature grid embedding and a fully convolutional architecture into the implicit codec. These components facilitate coordinate-feature and feature-RGB mapping, respectively. To further enhance the reconstruction quality from the INR codec, we leverage the high-quality reconstructed frames from the explicit codec to achieve inter-view compensation. Finally, the compensated results are fused with the implicit reconstructions from the INR to obtain the final reconstructed frames. Our proposed framework combines the strengths of both implicit neural representation and explicit 2D codec. Extensive experiments conducted on public datasets demonstrate that the proposed framework can achieve comparable or even superior performance to the latest multi-view video compression standard MIV and other INR-based schemes in terms of view compression and scene modeling.
    摘要 随着3D显示器和虚拟现实技术的普及,多视点视频已成为一种有前景的格式。然而,其高分辨率和多相机拍摄导致数据量大幅增加,使存储和传输变得困难。为解决这些问题,我们提出了一种隐式-显式融合的多视点视频压缩表示方法:首先使用基于显式表示的2D视频编码器对其中一个源视图进行编码,然后使用基于隐式神经表示(INR)的编码器对其余视图进行编码。隐式编码器以多视点视频的时间和视图索引作为坐标输入,生成相应的隐式重建帧。为了提升可压缩性,我们在隐式编码器中引入多级特征网格嵌入和全卷积架构,分别实现坐标到特征以及特征到RGB的映射。为进一步提升INR编码器的重建质量,我们利用显式编码器得到的高质量重建帧进行视图间补偿,最终将补偿结果与INR的隐式重建融合,得到最终的重建帧。我们提出的框架结合了隐式神经表示和显式2D编码器的优势。在公开数据集上的大量实验表明,该框架在视图压缩和场景建模方面可以达到与最新的多视点视频压缩标准MIV及其他基于INR的方案相当甚至更优的性能。
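
To give a feel for the implicit side of the codec described above, the PyTorch sketch below maps a normalized (time, view) coordinate through a positional encoding and a small MLP decoder to a low-resolution frame. It is a bare-bones coordinate-network illustration; the paper's multi-level feature grids, convolutional decoder, and resolutions are not reproduced here.

```python
# Hedged sketch of an implicit codec mapping (time index, view index) to a frame.
import torch
import torch.nn as nn

def positional_encoding(coords, num_freqs=6):
    """coords: (B, 2) in [0,1]; returns sin/cos features of several frequencies."""
    feats = [coords]
    for k in range(num_freqs):
        feats += [torch.sin((2 ** k) * torch.pi * coords),
                  torch.cos((2 ** k) * torch.pi * coords)]
    return torch.cat(feats, dim=-1)

class ImplicitVideo(nn.Module):
    def __init__(self, h=36, w=64, num_freqs=6):
        super().__init__()
        in_dim = 2 * (2 * num_freqs + 1)
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 3 * h * w), nn.Sigmoid())
        self.h, self.w = h, w

    def forward(self, t, v):                      # normalized time and view index
        coords = torch.stack([t, v], dim=-1)
        x = self.mlp(positional_encoding(coords))
        return x.view(-1, 3, self.h, self.w)

model = ImplicitVideo()
t = torch.tensor([0.0, 0.5])                      # two timestamps
v = torch.tensor([0.25, 0.75])                    # two camera views
frames = model(t, v)
print(frames.shape)                               # torch.Size([2, 3, 36, 64])
```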

Cross-Scope Spatial-Spectral Information Aggregation for Hyperspectral Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.17340
  • repo_url: https://github.com/tomchenshi/cst
  • paper_authors: Shi Chen, Lefei Zhang, Liangpei Zhang
  • for: 提高高光谱图像的空间分辨率。
  • methods: 提出跨尺度空间-光谱Transformer(CST)模型,通过跨尺度的空间自注意力与光谱自注意力捕捉长距离的空间-光谱相似性,并以简洁的前馈网络增强特征表示能力。
  • results: 在三个高光谱图像数据集上进行了广泛实验,结果表明CST在定量指标和视觉效果上均优于其他最新方法。
    Abstract Hyperspectral image super-resolution has attained widespread prominence to enhance the spatial resolution of hyperspectral images. However, convolution-based methods have encountered challenges in harnessing the global spatial-spectral information. The prevailing transformer-based methods have not adequately captured the long-range dependencies in both spectral and spatial dimensions. To alleviate this issue, we propose a novel cross-scope spatial-spectral Transformer (CST) to efficiently investigate long-range spatial and spectral similarities for single hyperspectral image super-resolution. Specifically, we devise cross-attention mechanisms in spatial and spectral dimensions to comprehensively model the long-range spatial-spectral characteristics. By integrating global information into the rectangle-window self-attention, we first design a cross-scope spatial self-attention to facilitate long-range spatial interactions. Then, by leveraging appropriately characteristic spatial-spectral features, we construct a cross-scope spectral self-attention to effectively capture the intrinsic correlations among global spectral bands. Finally, we elaborate a concise feed-forward neural network to enhance the feature representation capacity in the Transformer structure. Extensive experiments over three hyperspectral datasets demonstrate that the proposed CST is superior to other state-of-the-art methods both quantitatively and visually. The code is available at \url{https://github.com/Tomchenshi/CST.git}.
    摘要 高光谱图像超分辨率技术在提升高光谱图像的空间分辨率方面已得到广泛关注。然而,基于卷积的方法难以充分利用全局的空间-光谱信息,而主流的基于Transformer的方法也未能充分捕捉光谱与空间两个维度上的长距离依赖。为缓解这一问题,我们提出了一种新的跨尺度空间-光谱Transformer(CST),以高效地挖掘单幅高光谱图像超分辨率中的长距离空间与光谱相似性。具体来说,我们在空间与光谱维度上设计了交叉注意力机制,以全面建模长距离的空间-光谱特性:首先,通过将全局信息整合进矩形窗口自注意力,设计了跨尺度空间自注意力,以促进长距离的空间交互;然后,利用恰当的空间-光谱特征,构建了跨尺度光谱自注意力,以有效捕捉全局光谱带之间的内在相关性;最后,我们设计了一个简洁的前馈神经网络,以增强Transformer结构中的特征表示能力。在三个高光谱数据集上的大量实验表明,所提出的CST在定量和视觉效果上均优于其他最新方法。代码可在 \url{https://github.com/Tomchenshi/CST.git} 获取。

RADAP: A Robust and Adaptive Defense Against Diverse Adversarial Patches on Face Recognition

  • paper_url: http://arxiv.org/abs/2311.17339
  • repo_url: None
  • paper_authors: Xiaoliang Liu, Furao Shen, Jian Zhao, Changhai Nie
  • for: 防御基于深度学习的人脸识别系统免受多种局部对抗贴片攻击。
  • methods: 使用FCutout和F-patch技术(基于傅里叶空间采样掩码),配合改进的边缘感知二值交叉熵(EBCE)损失函数和SAF(split and fill)策略。
  • results: 在各种攻击情况下显著提高了防御性能,而且保持了清洁精度高于未防御的Vanilla模型
    Abstract Face recognition (FR) systems powered by deep learning have become widely used in various applications. However, they are vulnerable to adversarial attacks, especially those based on local adversarial patches that can be physically applied to real-world objects. In this paper, we propose RADAP, a robust and adaptive defense mechanism against diverse adversarial patches in both closed-set and open-set FR systems. RADAP employs innovative techniques, such as FCutout and F-patch, which use Fourier space sampling masks to improve the occlusion robustness of the FR model and the performance of the patch segmenter. Moreover, we introduce an edge-aware binary cross-entropy (EBCE) loss function to enhance the accuracy of patch detection. We also present the split and fill (SAF) strategy, which is designed to counter the vulnerability of the patch segmenter to complete white-box adaptive attacks. We conduct comprehensive experiments to validate the effectiveness of RADAP, which shows significant improvements in defense performance against various adversarial patches, while maintaining clean accuracy higher than that of the undefended Vanilla model.
    摘要 由深度学习驱动的人脸识别(FR)系统已广泛应用于多种场景。然而,它们易受对抗攻击的威胁,特别是可以物理贴附到真实物体上的局部对抗贴片。在本文中,我们提出了RADAP,一种针对封闭集与开放集FR系统中多种对抗贴片的鲁棒且自适应的防御机制。RADAP采用了创新的技术,如FCutout和F-patch,利用傅里叶空间采样掩码来提高FR模型的遮挡鲁棒性以及贴片分割器的性能。此外,我们引入了边缘感知二值交叉熵(EBCE)损失函数以提高贴片检测的准确性,并提出了分割-填充(SAF)策略,以应对贴片分割器在完全白盒自适应攻击下的脆弱性。我们进行了全面的实验来验证RADAP的有效性,结果显示其在面对多种对抗贴片时的防御性能显著提升,同时保持高于未防御Vanilla模型的干净样本精度。
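
One plausible reading of the "Fourier space sampling masks" mentioned above is to sample a sparse random spectrum, invert it to a smooth field, and threshold that field into an occlusion mask; the NumPy sketch below does exactly that. This is my own interpretation for illustration only, and the actual FCutout/F-patch construction in the paper may differ.

```python
# Hedged sketch of generating an occlusion mask by sampling in Fourier space
# and thresholding the inverse transform (illustrative, not the paper's method).
import numpy as np

def fourier_mask(h, w, keep_ratio=0.02, area=0.25, seed=0):
    rng = np.random.default_rng(seed)
    spectrum = np.zeros((h, w), dtype=complex)
    keep = rng.random((h, w)) < keep_ratio           # sparse random frequencies
    spectrum[keep] = rng.normal(size=keep.sum()) + 1j * rng.normal(size=keep.sum())
    field = np.real(np.fft.ifft2(spectrum))          # smooth random field
    thresh = np.quantile(field, 1.0 - area)          # occlude top `area` fraction
    return (field < thresh).astype(np.float32)       # 1 = keep pixel, 0 = occlude

mask = fourier_mask(112, 112)
print(mask.shape, mask.mean())                       # roughly 0.75 of pixels kept
img = np.random.rand(112, 112, 3)
augmented = img * mask[..., None]                    # apply occlusion augmentation
```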

eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos

  • paper_url: http://arxiv.org/abs/2311.17335
  • repo_url: https://github.com/xuecwu/emotions
  • paper_authors: Xuecheng Wu, Heli Sun, Junxiao Xue, Ruofan Zhai, Xiangyan Kong, Jiayu Nie, Liang He
  • for: 本研究旨在提高短视频(SV)中情感识别的精度,以便更好地理解SV中的情感表达。
  • methods: 本研究提出了端到端的基线方法AV-CPNet,利用视频Transformer更好地学习语义相关的表示,并设计两阶段跨模态融合模块以互补地建模音视频特征之间的相关性,同时采用包含三种情感极性的EP-CE损失来指导模型优化。
  • results: 在九个数据集上的大量实验结果验证了AV-CPNet的有效性,表明其能够有效提升短视频中的情感识别精度。
    Abstract Nowadays, short videos (SVs) are essential to information acquisition and sharing in our life. The prevailing use of SVs to spread emotions leads to the necessity of emotion recognition in SVs. Considering the lack of SVs emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we alleviate the impact of subjectivities on labeling quality by emphasizing better personnel allocations and multi-stage annotations. In addition, we provide the category-balanced and test-oriented variants through targeted data sampling. Some commonly used videos (e.g., facial expressions and postures) have been well studied. However, it is still challenging to understand the emotions in SVs. Since the enhanced content diversity brings more distinct semantic gaps and difficulties in learning emotion-related features, and there exists information gaps caused by the emotion incompleteness under the prevalently audio-visual co-expressions. To tackle these problems, we present an end-to-end baseline method AV-CPNet that employs the video transformer to better learn semantically relevant representations. We further design the two-stage cross-modal fusion module to complementarily model the correlations of audio-visual features. The EP-CE Loss, incorporating three emotion polarities, is then applied to guide model optimization. Extensive experimental results on nine datasets verify the effectiveness of AV-CPNet. Datasets and code will be open on https://github.com/XuecWu/eMotions.
    摘要 如今,短视频(SV)已成为我们生活中信息获取与分享的重要载体。由于SV被普遍用来传递情感,对SV进行情感识别变得十分必要。鉴于SV情感数据的缺乏,我们构建了一个名为eMotions的大规模数据集,包含27,996个视频;同时,通过更合理的人员分配和多阶段标注来减少主观性对标注质量的影响。此外,我们还通过有针对性的数据采样,提供了类别均衡和面向测试的变体。一些常见的视频内容(例如表情和姿态)已得到较充分的研究,但理解SV中的情感仍然具有挑战性:内容多样性的增强带来了更明显的语义鸿沟和学习情感相关特征的困难,而在普遍存在的音视频共同表达下,情感的不完整性也造成了信息鸿沟。为解决这些问题,我们提出了端到端的基线方法AV-CPNet,利用视频Transformer更好地学习语义相关的表示;我们进一步设计了两阶段跨模态融合模块,以互补地建模音视频特征之间的相关性,并采用包含三种情感极性的EP-CE损失来指导模型优化。在九个数据集上的大量实验结果验证了AV-CPNet的有效性。数据集和代码将在 https://github.com/XuecWu/eMotions 公开。

Long-tailed multi-label classification with noisy label of thoracic diseases from chest X-ray

  • paper_url: http://arxiv.org/abs/2311.17334
  • repo_url: None
  • paper_authors: Haoran Lai, Qingsong Yao, Zhiyang He, Xiaodong Tao, S Kevin Zhou
  • for: The paper aims to improve the detection of rare thoracic diseases in chest X-rays (CXRs) using a novel benchmark for long-tailed multi-label classification.
  • methods: The paper proposes a baseline method for this classification challenge, which includes adaptive negative regularization to address the over-suppression of negative logits in tail classes, and a large loss reconsideration strategy for correcting noisy labels from automated annotations.
  • results: The paper demonstrates significant advancements in rare disease detection using the proposed method on the "LTML-MIMIC-CXR" dataset, an augmentation of the MIMIC-CXR dataset with 26 additional rare diseases.
    Abstract Chest X-rays (CXR) often reveal rare diseases, demanding precise diagnosis. However, current computer-aided diagnosis (CAD) methods focus on common diseases, leading to inadequate detection of rare conditions due to the absence of comprehensive datasets. To overcome this, we present a novel benchmark for long-tailed multi-label classification in CXRs, encapsulating both common and rare thoracic diseases. Our approach includes developing the "LTML-MIMIC-CXR" dataset, an augmentation of MIMIC-CXR with 26 additional rare diseases. We propose a baseline method for this classification challenge, integrating adaptive negative regularization to address negative logits' over-suppression in tail classes, and a large loss reconsideration strategy for correcting noisy labels from automated annotations. Our evaluation on LTML-MIMIC-CXR demonstrates significant advancements in rare disease detection. This work establishes a foundation for robust CAD methods, achieving a balance in identifying a spectrum of thoracic diseases in CXRs. Access to our code and dataset is provided at:https://github.com/laihaoran/LTML-MIMIC-CXR.
    摘要 胸部X射线片(CXR)常会显示罕见疾病,需要精准诊断。然而,目前的计算机辅助诊断(CAD)方法主要关注常见疾病,由于缺乏全面的数据集,对罕见疾病的检测能力不足。为此,我们提出了一个用于CXR长尾多标签分类的新基准,同时涵盖常见与罕见的胸部疾病。我们的工作包括构建"LTML-MIMIC-CXR"数据集,即在MIMIC-CXR的基础上增加26种罕见疾病。我们为该分类挑战提出了一种基线方法,其中引入自适应负正则化,以缓解尾部类别负对数几率(negative logits)被过度抑制的问题,并采用大损失重审策略来纠正自动标注产生的噪声标签。在LTML-MIMIC-CXR上的评估表明,该方法在罕见疾病检测方面取得了显著进展。这项工作为鲁棒的CAD方法奠定了基础,实现了在CXR中对各类胸部疾病的均衡识别。代码和数据集见:https://github.com/laihaoran/LTML-MIMIC-CXR。
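
The "adaptive negative regularization" above targets the over-suppression of negative logits on tail classes; a simple way to picture this is to down-weight the negative term of the multi-label BCE for rare classes. The PyTorch sketch below implements that illustrative re-weighting; the frequency threshold, weights, and toy data are my assumptions, not the paper's formulation.

```python
# Hedged sketch: soften the negative BCE term for tail classes so rare diseases'
# negative logits are not over-suppressed (illustrative re-weighting only).
import torch

def reweighted_bce(logits, targets, class_freq, tail_quantile=0.2, neg_weight=0.1):
    """logits/targets: (B, C); class_freq: (C,) label frequencies in the train set."""
    p = torch.sigmoid(logits)
    pos_term = targets * torch.log(p.clamp_min(1e-8))
    neg_term = (1 - targets) * torch.log((1 - p).clamp_min(1e-8))
    tail = (class_freq <= torch.quantile(class_freq, tail_quantile)).float()
    neg_scale = 1.0 - tail * (1.0 - neg_weight)       # down-weight negatives on tail
    return -(pos_term + neg_scale * neg_term).mean()

logits = torch.randn(4, 10, requires_grad=True)
targets = torch.randint(0, 2, (4, 10)).float()
class_freq = torch.tensor([0.4, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.01, 0.005, 0.001])
loss = reweighted_bce(logits, targets, class_freq)
loss.backward()
print(loss.item())
```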

NeRFTAP: Enhancing Transferability of Adversarial Patches on Face Recognition using Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2311.17332
  • repo_url: None
  • paper_authors: Xiaoliang Liu, Furao Shen, Feng Han, Jian Zhao, Changhai Nie
  • for: 人脸识别(FR)技术易受对抗攻击的威胁。现有研究主要关注攻击在不同FR模型间的可迁移性,而忽略了向受害者人脸图像的直接可迁移性,后者在实际场景中是切实的威胁。
  • methods: 我们提出了一种新的对抗攻击方法NeRFTAP,同时考虑对FR模型和对受害者人脸图像的可迁移性:利用基于NeRF的3D-GAN为源与目标对象生成新视角人脸图像,以增强对抗贴片的可迁移性,并引入风格一致性损失以保证对抗UV图与目标UV图在0-1掩码下的视觉相似性。
  • results: 在多种FR模型上的实验与评估表明,NeRFTAP的攻击效果优于现有攻击技术,为提升FR系统在实际对抗环境中的鲁棒性提供了有价值的见解。
    Abstract Face recognition (FR) technology plays a crucial role in various applications, but its vulnerability to adversarial attacks poses significant security concerns. Existing research primarily focuses on transferability to different FR models, overlooking the direct transferability to victim's face images, which is a practical threat in real-world scenarios. In this study, we propose a novel adversarial attack method that considers both the transferability to the FR model and the victim's face image, called NeRFTAP. Leveraging NeRF-based 3D-GAN, we generate new view face images for the source and target subjects to enhance transferability of adversarial patches. We introduce a style consistency loss to ensure the visual similarity between the adversarial UV map and the target UV map under a 0-1 mask, enhancing the effectiveness and naturalness of the generated adversarial face images. Extensive experiments and evaluations on various FR models demonstrate the superiority of our approach over existing attack techniques. Our work provides valuable insights for enhancing the robustness of FR systems in practical adversarial settings.
    摘要 人脸识别(FR)技术在各种应用中扮演着关键角色,但其易受对抗攻击的脆弱性带来了重大的安全隐患。现有研究主要关注攻击在不同FR模型之间的可迁移性,而忽略了向受害者人脸图像的直接可迁移性,这在真实场景中是切实的威胁。在本研究中,我们提出了一种同时考虑对FR模型和对受害者人脸图像可迁移性的新型对抗攻击方法,称为NeRFTAP。借助基于NeRF的3D-GAN,我们为源与目标对象生成新视角人脸图像,以增强对抗贴片的可迁移性。我们引入风格一致性损失,以确保对抗UV图与目标UV图在0-1掩码下的视觉相似性,从而提升生成的对抗人脸图像的有效性与自然性。在多种FR模型上的大量实验与评估表明,我们的方法优于现有攻击技术。我们的工作为增强FR系统在实际对抗环境中的鲁棒性提供了有价值的思路。

Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

  • paper_url: http://arxiv.org/abs/2311.17331
  • repo_url: None
  • paper_authors: Zeqing Wang, Wentao Wan, Runmeng Chen, Qiqing Lao, Minjie Lang, Keze Wang
  • for: 本研究旨在提高Visual Question Answering(VQA) Task的表现,并且解决现有的问题,如知识库(KB)的偏见和有限数据的问题。
  • methods: 该研究提出了一种可解释的多代理协作框架,通过启用Large Language Models(LLMs)中嵌入的知识,实现Top-down的推理过程。该框架包括三个代理:搜寻者(Seeker)、回答者(Responder)和整合器(Integrator),共同解决VQA问题。
  • results: 研究人员对多种VQA数据集和VLM进行了广泛的测试和评估,结果显示,该方法可以提高VQA表现,同时具有广泛的应用和可解释性。
    Abstract Recently, Vision Language Models (VLMs) have gained significant attention, exhibiting notable advancements across various tasks by leveraging extensive image-text paired data. However, prevailing VLMs often treat Visual Question Answering (VQA) as perception tasks, employing black-box models that overlook explicit modeling of relationships between different questions within the same visual scene. Moreover, the existing VQA methods that rely on Knowledge Bases (KBs) might frequently encounter biases from limited data and face challenges in relevant information indexing. Attempt to overcome these limitations, this paper introduces an explainable multi-agent collaboration framework by tapping into knowledge embedded in Large Language Models (LLMs) trained on extensive corpora. Inspired by human cognition, our framework uncovers latent information within the given question by employing three agents, i.e., Seeker, Responder, and Integrator, to perform a top-down reasoning process. The Seeker agent generates relevant issues related to the original question. The Responder agent, based on VLM, handles simple VQA tasks and provides candidate answers. The Integrator agent combines information from the Seeker agent and the Responder agent to produce the final VQA answer. Through the above collaboration mechanism, our framework explicitly constructs a multi-view knowledge base for a specific image scene, reasoning answers in a top-down processing manner. We extensively evaluate our method on diverse VQA datasets and VLMs, demonstrating its broad applicability and interpretability with comprehensive experimental results.
    摘要 近来,视觉语言模型(VLM)受到广泛关注,借助大规模图文配对数据,在多种任务上取得了显著进展。然而,现有的VLM往往将视觉问答(VQA)视为感知任务,采用黑盒模型,忽视了对同一视觉场景中不同问题之间关系的显式建模。此外,依赖知识库(KB)的现有VQA方法常常受到有限数据带来的偏差,并面临相关信息索引方面的挑战。为克服这些局限,本文提出了一种可解释的多代理协作框架,利用在大规模语料上训练的大语言模型(LLM)中蕴含的知识。受人类认知的启发,我们的框架借助三个代理——寻找者(Seeker)、回答者(Responder)和整合者(Integrator)——以自顶向下的推理过程挖掘问题中隐含的信息:寻找者代理生成与原问题相关的子问题;回答者代理基于VLM处理简单的VQA任务并提供候选答案;整合者代理综合寻找者与回答者提供的信息,产生最终的VQA答案。通过上述协作机制,我们的框架为特定图像场景显式地构建了多视角知识库,并以自顶向下的方式推理答案。我们在多个VQA数据集和VLM上进行了广泛评估,全面的实验结果表明该方法具有广泛的适用性与可解释性。

Alternate Diverse Teaching for Semi-supervised Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2311.17325
  • repo_url: https://github.com/ZhenZHAO/AD-MT
  • paper_authors: Zhen Zhao, Zicheng Wang, Longyue Wang, Yixuan Yuan, Luping Zhou
  • for: 这项研究旨在提高半监督医学图像分割在有限标注数据下的精度与稳定性,并解决现有教师-学生方法易受确认偏差(confirmation bias)影响的问题。
  • methods: 该方法采用一个学生模型和两个不可训练、以动量方式周期性交替更新的教师模型,并通过随机周期交替(RPA)更新模块和冲突对抗模块(CCM)来缓解确认偏差。
  • results: 实验结果显示,AD-MT在2D和3D医学图像分割基准上、在多种半监督设定下均优于现有方法。
    Abstract Semi-supervised medical image segmentation studies have shown promise in training models with limited labeled data. However, current dominant teacher-student based approaches can suffer from the confirmation bias. To address this challenge, we propose AD-MT, an alternate diverse teaching approach in a teacher-student framework. It involves a single student model and two non-trainable teacher models that are momentum-updated periodically and randomly in an alternate fashion. To mitigate the confirmation bias from the diverse supervision, the core of AD-MT lies in two proposed modules: the Random Periodic Alternate (RPA) Updating Module and the Conflict-Combating Module (CCM). The RPA schedules the alternating diverse updating process with complementary data batches, distinct data augmentation, and random switching periods to encourage diverse reasoning from different teaching perspectives. The CCM employs an entropy-based ensembling strategy to encourage the model to learn from both the consistent and conflicting predictions between the teachers. Experimental results demonstrate the effectiveness and superiority of our AD-MT on the 2D and 3D medical segmentation benchmarks across various semi-supervised settings.
    摘要 半监督医学图像分割研究表明,可以在有限的标注数据下训练模型并取得可观的效果。然而,当前主流的基于教师-学生框架的方法容易受到确认偏差(confirmation bias)的影响。为应对这一挑战,我们提出了AD-MT,一种教师-学生框架下的交替多样化教学方法。它包含一个学生模型和两个不可训练的教师模型,后者以动量方式周期性地、随机地交替更新。为缓解来自多样监督的确认偏差,AD-MT的核心在于两个模块:随机周期交替(RPA)更新模块和冲突对抗模块(CCM)。RPA以互补的数据批次、不同的数据增强和随机的切换周期来调度交替的多样化更新过程,以鼓励来自不同教学视角的多样推理;CCM采用基于熵的集成策略,促使模型同时从教师间一致与冲突的预测中学习。实验结果表明,AD-MT在2D和3D医学图像分割基准上、在多种半监督设定下均有效且优于现有方法。
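
A concrete way to picture the alternating momentum teachers above is sketched below in PyTorch: a single student, two EMA teachers, and a randomly drawn switching period deciding which teacher is updated at each step. The tiny model, momentum value, and switching range are illustrative placeholders, and the actual supervised/unsupervised training step is elided.

```python
# Hedged sketch of one student with two EMA teachers updated alternately at random periods.
import random
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

student = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 2, 1))
teachers = [copy.deepcopy(student), copy.deepcopy(student)]

active, switch_every = 0, random.randint(2, 5)       # random switching period
for step in range(1, 21):
    # ... one supervised + unsupervised training step on `student` goes here ...
    ema_update(teachers[active], student)            # only one teacher moves
    if step % switch_every == 0:                     # alternate the teacher
        active = 1 - active
        switch_every = random.randint(2, 5)
print("finished with teacher", active, "active")
```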

Revisiting Single Image Reflection Removal In the Wild

  • paper_url: http://arxiv.org/abs/2311.17320
  • repo_url: None
  • paper_authors: Yurui Zhu, Xueyang Fu, Peng-Tao Jiang, Hao Zhang, Qibin Sun, Jinwei Chen, Zheng-Jun Zha, Bo Li
  • for: 本研究探讨了现实世界中单张图像反射除去的问题,从两个角度研究:反射收集管道的设计和反射位置的识别。
  • methods: 我们提出了一种高度适应现实世界反射场景的反射采集管道,并以更低的成本收集大规模对齐的反射图像对,构建了名为 Reflection Removal in the Wild (RRW) 的大规模高质量反射数据集。RRW包含超过14,950对高分辨率的真实世界反射图像,规模约为以往数据集的四十五倍。
  • results: 我们发现许多在反射图像中可见的虚拟反射物体并不存在于相应的真实图像中。基于这一观察,我们提出了最大反射滤波器(MaxRF),可以准确且显式地刻画反射位置;在此基础上,我们设计了针对单张图像反射去除的反射位置感知级联框架。借助这些创新技术,我们的方案在多个真实世界基准上超越了当前领先方法。
    Abstract This research focuses on the issue of single-image reflection removal (SIRR) in real-world conditions, examining it from two angles: the collection pipeline of real reflection pairs and the perception of real reflection locations. We devise an advanced reflection collection pipeline that is highly adaptable to a wide range of real-world reflection scenarios and incurs reduced costs in collecting large-scale aligned reflection pairs. In the process, we develop a large-scale, high-quality reflection dataset named Reflection Removal in the Wild (RRW). RRW contains over 14,950 high-resolution real-world reflection pairs, a dataset forty-five times larger than its predecessors. Regarding perception of reflection locations, we identify that numerous virtual reflection objects visible in reflection images are not present in the corresponding ground-truth images. This observation, drawn from the aligned pairs, leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF could accurately and explicitly characterize reflection locations from pairs of images. Building upon this, we design a reflection location-aware cascaded framework, specifically tailored for SIRR. Powered by these innovative techniques, our solution achieves superior performance than current leading methods across multiple real-world benchmarks. Codes and datasets will be publicly available.
    摘要 本研究关注真实场景下的单张图像反射去除(SIRR)问题,并从两个角度展开:真实反射图像对的采集管道,以及真实反射位置的感知。我们设计了一种先进的反射采集管道,能高度适应各种真实世界的反射场景,并显著降低收集大规模对齐反射图像对的成本。在此过程中,我们构建了名为 Reflection Removal in the Wild(RRW)的大规模高质量反射数据集,包含超过14,950对高分辨率的真实世界反射图像,规模约为以往数据集的四十五倍。在反射位置感知方面,我们发现许多在反射图像中可见的虚拟反射物体并不存在于相应的真实图像中。基于对齐图像对得到的这一观察,我们提出了最大反射滤波器(MaxRF),能够准确且显式地从图像对中刻画反射位置。在此基础上,我们设计了专为SIRR定制的反射位置感知级联框架。借助这些创新技术,我们的方案在多个真实世界基准上取得了优于当前领先方法的性能。代码和数据集将会公开。

Explaining CLIP’s performance disparities on data from blind/low vision users

  • paper_url: http://arxiv.org/abs/2311.17315
  • repo_url: None
  • paper_authors: Daniela Massiceti, Camilla Longden, Agnieszka Slowik, Samuel Wills, Martin Grayson, Cecily Morrison
  • for: The paper evaluates the performance of a widely-used large multi-modal model (CLIP) on data captured by blind or low vision (BLV) users.
  • methods: The paper tests 25 CLIP variants in a zero-shot classification task, analyzes their accuracy on images captured by BLV users versus web-crawled images, and conducts a textual analysis of three common pre-training datasets to investigate the inclusion of disability content.
  • results: CLIP's accuracy is 15 percentage points lower on average for images captured by BLV users than for web-crawled images, due to sensitivities to image content, image quality, and text content; few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios.
    Abstract Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet, these models have not been systematically evaluated on data captured by BLV users. We address this by empirically assessing CLIP, a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task, we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content is rarely mentioned. We then provide three examples that illustrate how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios, which we discuss alongside a set of other possible mitigations.
    摘要 大型多模态模型(LMM)有望开启面向盲人或低视力(BLV)人群的自动化视觉辅助新时代。然而,这些模型尚未在BLV用户拍摄的数据上得到系统性评估。我们针对这一问题,对广泛使用、很可能支撑众多辅助技术的LMM——CLIP进行了实证评估。我们在零样本分类任务中测试了25个CLIP变体,发现它们在BLV用户拍摄的图像上的准确率平均比网络爬取图像低15个百分点。这一差距源于CLIP在以下方面的敏感性:1)图像内容(例如对残障相关物体的识别不如其他物体);2)图像质量(例如对光照变化不够鲁棒);3)文本内容(例如对以触觉形容词描述的物体的识别不如以视觉形容词描述的物体)。我们进一步对三个常用的预训练数据集(LAION-400M、LAION-2B和DataComp-1B)进行文本分析,发现其中很少提及残障相关内容。随后,我们给出三个示例,说明这种性能差距同样延伸到三个以CLIP为基础的下游模型:OWL-ViT、CLIPSeg和DALL-E2。我们发现,在某些场景下,仅用5张图像进行少样本学习即可缓解CLIP对BLV用户的服务质量差距;我们对此以及其他可能的缓解措施进行了讨论。
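
For readers unfamiliar with the zero-shot protocol used to probe CLIP above, the Hugging Face sketch below embeds an image and a few text prompts and ranks the prompts by similarity. The checkpoint, prompt templates, and the gray placeholder image are illustrative choices; the paper's evaluation set-up and CLIP variants are not reproduced here.

```python
# Hedged sketch of zero-shot classification with a CLIP variant via Hugging Face.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a white cane", "a photo of a guide dog", "a photo of a mug"]
image = Image.new("RGB", (224, 224), color="gray")   # placeholder BLV-style photo

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)[0]      # image-text similarity scores
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```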

Federated Fine-Tuning of Foundation Models via Probabilistic Masking

  • paper_url: http://arxiv.org/abs/2311.17299
  • repo_url: None
  • paper_authors: Vasileios Tsouvalas, Yuki Asano, Aaqib Saeed
  • for: 本研究旨在实现基础模型(FMs)在联邦学习(FL)中的高效整合,并大幅提升通信效率。
  • methods: 提出DeltaMask方法:利用随机掩码(stochastic masking)发现FMs中高效的子网络,并借助客户端掩码中的随机性与稀疏性,使用概率滤波器将更新压缩为紧凑的灰度图像,从而以远低于1 bpp的超低比特率进行微调。
  • results: 在8个数据集和5种不同架构的预训练模型上,DeltaMask可将比特率降至0.09 bpp,在提升通信效率的同时保持FMs的性能。
    Abstract Foundation Models (FMs) have revolutionized machine learning with their adaptability and high performance across tasks; yet, their integration into Federated Learning (FL) is challenging due to substantial communication overhead from their extensive parameterization. Current communication-efficient FL strategies, such as gradient compression, reduce bitrates to around $1$ bit-per-parameter (bpp). However, these approaches fail to harness the characteristics of FMs, with their large number of parameters still posing a challenge to communication efficiency, even at these bitrate regimes. In this work, we present DeltaMask, a novel method that efficiently fine-tunes FMs in FL at an ultra-low bitrate, well below 1 bpp. DeltaMask employs stochastic masking to detect highly effective subnetworks within FMs and leverage stochasticity and sparsity in client masks to compress updates into a compact grayscale image using probabilistic filters, deviating from traditional weight training approaches. Our comprehensive evaluations across various datasets and architectures demonstrate DeltaMask efficiently achieves bitrates as low as 0.09 bpp, enhancing communication efficiency while maintaining FMs performance, as measured on 8 datasets and 5 pre-trained models of various network architectures.
    摘要 基础模型(FM)凭借其跨任务的适应性与高性能,已经为机器学习带来了变革;然而,由于其参数规模庞大,将其整合进联邦学习(FL)会产生巨大的通信开销。目前的通信高效FL策略(如梯度压缩)可以将比特率降低到约1比特/参数(bpp),但这些方法并未利用FM的特性,其庞大的参数量即使在这一比特率水平下仍对通信效率构成挑战。在本工作中,我们提出了DeltaMask,一种能够以远低于1 bpp的超低比特率在FL中高效微调FM的新方法。DeltaMask利用随机掩码来发现FM中高效的子网络,并借助客户端掩码中的随机性与稀疏性,使用概率滤波器将更新压缩为紧凑的灰度图像,这与传统的权重训练方式不同。我们在多个数据集和架构上进行了全面评估,结果表明DeltaMask可以将比特率降至低至0.09 bpp,在提升通信效率的同时保持FM的性能;该结论在8个数据集和5种不同网络架构的预训练模型上得到了验证。
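
To make the masking idea above tangible, the NumPy sketch below samples a Bernoulli subnetwork mask from per-parameter keep-probabilities and packs it into bytes for transmission. This is only a toy illustration of why binary masks are cheap to communicate; DeltaMask's probabilistic filters, grayscale-image packing, and bitrates below 0.1 bpp are not reproduced here.

```python
# Hedged sketch: sample a binary subnetwork mask from probabilities and pack it for the uplink.
import numpy as np

def sample_mask(probs, rng):
    """Bernoulli sample of a binary mask from per-parameter keep-probabilities."""
    return (rng.random(probs.shape) < probs).astype(np.uint8)

def pack_mask(mask):
    """Pack the 0/1 mask into bytes (8 parameters per byte) for transmission."""
    return np.packbits(mask.ravel())

rng = np.random.default_rng(0)
probs = rng.uniform(0.0, 0.2, size=10_000)        # mostly-sparse keep probabilities
mask = sample_mask(probs, rng)
payload = pack_mask(mask)
bpp = 8 * payload.nbytes / mask.size              # bits per parameter before entropy coding
print(f"kept {mask.mean():.3f} of params, {bpp:.2f} bpp before entropy coding")
```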

LEOD: Label-Efficient Object Detection for Event Cameras

  • paper_url: http://arxiv.org/abs/2311.17286
  • repo_url: None
  • paper_authors: Ziyi Wu, Mathias Gehrig, Qing Lyu, Xudong Liu, Igor Gilitschenski
  • for: This paper aims to address the issue of labeling event streams with high temporal resolutions for object detection with event cameras, which is costly and time-consuming.
  • methods: The proposed method, called LEOD, unifies weakly- and semi-supervised object detection with a self-training mechanism. It utilizes a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events, and then re-trains the detector with both real and generated labels.
  • results: LEOD consistently outperforms supervised baselines across various labeling ratios; for example, on Gen1 it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. Even when all labeled data are available, LEOD reaches new state-of-the-art results and is effective in improving larger detectors as well.
    Abstract Object detection with event cameras enjoys the property of low latency and high dynamic range, making it suitable for safety-critical scenarios such as self-driving. However, labeling event streams with high temporal resolutions for supervised training is costly. We address this issue with LEOD, the first framework for label-efficient event-based detection. Our method unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events, and then re-train the detector with both real and generated labels. Leveraging the temporal consistency of events, we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training, we further design a soft anchor assignment strategy to mitigate the noise in labels. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available, reaching new state-of-the-art results. Finally, we show that our method readily scales to improve larger detectors as well.
    摘要 Object detection with event cameras enjoys low latency and high dynamic range, making it suitable for safety-critical scenarios such as self-driving, but labeling event streams with high temporal resolutions for supervised training is costly. LEOD, the first framework for label-efficient event-based detection, unifies weakly- and semi-supervised object detection with a self-training mechanism. First, we use a pre-trained detector on limited labels to generate pseudo ground truth on unlabeled events. We then re-train the detector with both real and generated labels, leveraging the temporal consistency of events to enhance the quality of pseudo labels. To stabilize training, we design a soft anchor assignment strategy to mitigate label noise. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on the Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example, on Gen1, it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels, respectively. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available, reaching new state-of-the-art results. Moreover, our method readily scales to improve larger detectors as well.
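
The self-training mechanism above boils down to filtering the pre-trained detector's outputs into pseudo ground truth and mixing them with real labels. The small Python sketch below shows that generic recipe with a confidence threshold and per-source loss weights; the Box structure, threshold, and weights are invented for illustration, and LEOD's bi-directional inference and tracking-based post-processing are not included.

```python
# Hedged sketch of pseudo-label self-training for detection on unlabeled data.
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float
    cls: int
    score: float

def make_pseudo_labels(detections: List[Box], score_thresh: float = 0.6) -> List[Box]:
    """Keep only confident detections as pseudo ground truth."""
    return [d for d in detections if d.score >= score_thresh]

def build_training_set(real_labels, pseudo_labels):
    """Mix real and generated labels; a downstream loss can weight them differently."""
    return [(b, 1.0) for b in real_labels] + [(b, 0.5) for b in pseudo_labels]

dets = [Box(10, 20, 30, 40, cls=0, score=0.9), Box(5, 5, 10, 10, cls=1, score=0.3)]
pseudo = make_pseudo_labels(dets)
mixed = build_training_set(real_labels=[], pseudo_labels=pseudo)
print(len(pseudo), len(mixed))
```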