cs.CV - 2023-09-20

Understanding Pose and Appearance Disentanglement in 3D Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.11667
  • repo_url: None
  • paper_authors: Krishna Kanth Nakka, Mathieu Salzmann
  • for: This study analyzes to what degree state-of-the-art self-supervised disentangled representation learning methods truly separate pose information from appearance information in 3D human pose estimation.
  • methods: Three state-of-the-art disentangled representation learning frameworks are analyzed, first through diverse image synthesis experiments and then via an adversarial attack on the 3D pose regressor that generates natural appearance changes of the subject.
  • results: The pose codes of all three frameworks contain significant appearance information, and their disentanglement is far from complete.
    Abstract As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose one. The methods then only need a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose one. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor following an adversarial attack perspective. Specifically, we design an adversarial strategy focusing on generating natural appearance changes of the subject, and against which we could expect a disentangled network to be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed to evaluate the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.

Neural Image Compression Using Masked Sparse Visual Representation

  • paper_url: http://arxiv.org/abs/2309.11661
  • repo_url: None
  • paper_authors: Wei Jiang, Wei Wang, Yue Chen
  • for: This paper studies neural image compression based on the Sparse Visual Representation (SVR), aiming to enable effective tradeoffs between bitrate and reconstruction quality.
  • methods: Images are embedded into a discrete latent space spanned by learned visual codebooks shared between the encoder and decoder; the encoder transmits integer codeword indices that the decoder uses for reconstruction. The proposed Masked Adaptive Codebook learning (M-AdaCode) method applies masks to the latent feature subspace so that reconstruction quality can be traded off against transmitted bits.
  • results: Experiments on the standard JPEG-AI dataset demonstrate the effectiveness of M-AdaCode, with the masking rate controlling the balance between bitrate and distortion.
    Abstract We study neural image compression based on the Sparse Visual Representation (SVR), where images are embedded into a discrete latent space spanned by learned visual codebooks. By sharing codebooks with the decoder, the encoder transfers integer codeword indices that are efficient and cross-platform robust, and the decoder retrieves the embedded latent feature using the indices for reconstruction. Previous SVR-based compression lacks effective mechanism for rate-distortion tradeoffs, where one can only pursue either high reconstruction quality or low transmission bitrate. We propose a Masked Adaptive Codebook learning (M-AdaCode) method that applies masks to the latent feature subspace to balance bitrate and reconstruction quality. A set of semantic-class-dependent basis codebooks are learned, which are weighted combined to generate a rich latent feature for high-quality reconstruction. The combining weights are adaptively derived from each input image, providing fidelity information with additional transmission costs. By masking out unimportant weights in the encoder and recovering them in the decoder, we can trade off reconstruction quality for transmission bits, and the masking rate controls the balance between bitrate and distortion. Experiments over the standard JPEG-AI dataset demonstrate the effectiveness of our M-AdaCode approach.
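
A minimal PyTorch sketch of the weighted codebook combination with weight masking described in the abstract. The tensor shapes, the top-k masking rule, and the softmax renormalization are illustrative assumptions (the paper additionally recovers the masked weights on the decoder side with a learned module, which is omitted here); `madacode_combine` is a hypothetical helper, not the authors' API.

```python
import torch
import torch.nn.functional as F

def madacode_combine(weights, codebook_feats, keep_ratio=1.0):
    """Hedged sketch of M-AdaCode-style weight masking and combination.

    weights:        (B, K)           per-image combining weights for K basis codebooks
    codebook_feats: (B, K, C, H, W)  latent features quantized by each basis codebook
    keep_ratio:     fraction of weights kept (1.0 = full fidelity, lower = fewer bits)
    """
    B, K = weights.shape
    k_keep = max(1, int(K * keep_ratio))
    # Keep only the most important weights; only these (plus their indices) would
    # need to be transmitted alongside the codeword indices.
    topk = weights.abs().topk(k_keep, dim=1).indices
    mask = torch.zeros_like(weights).scatter_(1, topk, 1.0)
    masked_w = F.softmax(weights.masked_fill(mask == 0, float("-inf")), dim=1)
    # Rich latent feature = weighted combination of basis-codebook features.
    # (In the paper, masked weights are recovered by the decoder; here they stay zero.)
    latent = (masked_w[:, :, None, None, None] * codebook_feats).sum(dim=1)
    return latent, masked_w
```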

GenLayNeRF: Generalizable Layered Representations with 3D Model Alignment for Multi-Human View Synthesis

  • paper_url: http://arxiv.org/abs/2309.11627
  • repo_url: None
  • paper_authors: Youssef Abdelkareem, Shady Shehata, Fakhri Karray
  • for: This work targets novel view synthesis of multi-human scenes with complex inter-human occlusions, using a generalizable layered representation that requires no per-scene optimization and only very sparse input views.
  • methods: The proposed GenLayNeRF divides the scene into multi-human layers anchored by 3D body meshes, and a novel end-to-end trainable module performs iterative parametric correction coupled with multi-view feature fusion to ensure pixel-level alignment of the body models with the input views.
  • results: The approach outperforms generalizable and non-human per-scene NeRF methods, and performs on par with layered per-scene methods without any test-time optimization.
    Abstract Novel view synthesis (NVS) of multi-human scenes imposes challenges due to the complex inter-human occlusions. Layered representations handle the complexities by dividing the scene into multi-layered radiance fields, however, they are mainly constrained to per-scene optimization making them inefficient. Generalizable human view synthesis methods combine the pre-fitted 3D human meshes with image features to reach generalization, yet they are mainly designed to operate on single-human scenes. Another drawback is the reliance on multi-step optimization techniques for parametric pre-fitting of the 3D body models that suffer from misalignment with the images in sparse view settings causing hallucinations in synthesized views. In this work, we propose, GenLayNeRF, a generalizable layered scene representation for free-viewpoint rendering of multiple human subjects which requires no per-scene optimization and very sparse views as input. We divide the scene into multi-human layers anchored by the 3D body meshes. We then ensure pixel-level alignment of the body models with the input views through a novel end-to-end trainable module that carries out iterative parametric correction coupled with multi-view feature fusion to produce aligned 3D models. For NVS, we extract point-wise image-aligned and human-anchored features which are correlated and fused using self-attention and cross-attention modules. We augment low-level RGB values into the features with an attention-based RGB fusion module. To evaluate our approach, we construct two multi-human view synthesis datasets; DeepMultiSyn and ZJU-MultiHuman. The results indicate that our proposed approach outperforms generalizable and non-human per-scene NeRF methods while performing at par with layered per-scene methods without test time optimization.

Sentence Attention Blocks for Answer Grounding

  • paper_url: http://arxiv.org/abs/2309.11593
  • repo_url: None
  • paper_authors: Seyedalireza Khoshsirat, Chandra Kambhamettu
  • for: This paper proposes a new architectural block, the Sentence Attention Block, for the answer grounding task.
  • methods: Starting from a well-known attention method and making minor modifications, the block re-calibrates channel-wise image feature maps by explicitly modeling inter-dependencies between the image feature maps and the sentence embedding.
  • results: The method achieves state-of-the-art accuracy on the TextVQA-X, VQS, VQA-X, and VizWiz-VQA-Grounding datasets.
    Abstract Answer grounding is the task of locating relevant visual evidence for the Visual Question Answering task. While a wide variety of attention methods have been introduced for this task, they suffer from the following three problems: designs that do not allow the usage of pre-trained networks and do not benefit from large data pre-training, custom designs that are not based on well-grounded previous designs, therefore limiting the learning power of the network, or complicated designs that make it challenging to re-implement or improve them. In this paper, we propose a novel architectural block, which we term Sentence Attention Block, to solve these problems. The proposed block re-calibrates channel-wise image feature-maps by explicitly modeling inter-dependencies between the image feature-maps and sentence embedding. We visually demonstrate how this block filters out irrelevant feature-maps channels based on sentence embedding. We start our design with a well-known attention method, and by making minor modifications, we improve the results to achieve state-of-the-art accuracy. The flexibility of our method makes it easy to use different pre-trained backbone networks, and its simplicity makes it easy to understand and be re-implemented. We demonstrate the effectiveness of our method on the TextVQA-X, VQS, VQA-X, and VizWiz-VQA-Grounding datasets. We perform multiple ablation studies to show the effectiveness of our design choices.
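
A minimal sketch of the kind of channel re-calibration the abstract describes: a squeeze-and-excitation-style gate whose excitation is conditioned on the sentence embedding, so irrelevant feature-map channels are suppressed. Layer sizes, the concatenation of pooled image features with the sentence embedding, and the sigmoid gating are assumptions, not the paper's reference design.

```python
import torch
import torch.nn as nn

class SentenceAttentionBlock(nn.Module):
    """Hedged sketch of sentence-conditioned channel re-calibration."""

    def __init__(self, channels: int, sent_dim: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze spatial dims
        self.gate = nn.Sequential(                     # excitation from image + sentence
            nn.Linear(channels + sent_dim, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image feature maps, sent_emb: (B, D) sentence embedding
        squeezed = self.pool(feat).flatten(1)          # (B, C)
        weights = self.gate(torch.cat([squeezed, sent_emb], dim=1))  # (B, C)
        # Down-weight channels that are irrelevant to the sentence/question.
        return feat * weights[:, :, None, None]
```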

Continuous Levels of Detail for Light Field Networks

  • paper_url: http://arxiv.org/abs/2309.11591
  • repo_url: https://github.com/AugmentariumLab/continuous-lfn
  • paper_authors: David Li, Brandon Y. Feng, Amitabh Varshney
  • for: Improve rendering quality and resource utilization by encoding neural representations with continuous levels of detail (LODs).
  • methods: Summed-area table filtering enables efficient, continuous filtering at various LODs, and saliency-based importance sampling focuses the limited capacity at lower LODs on the details viewers are most likely to look at.
  • results: Light field networks with continuous LODs support progressive streaming of neural representations, decreasing rendering latency and resource utilization.
    Abstract Recently, several approaches have emerged for generating neural representations with multiple levels of detail (LODs). LODs can improve the rendering by using lower resolutions and smaller model sizes when appropriate. However, existing methods generally focus on a few discrete LODs which suffer from aliasing and flicker artifacts as details are changed and limit their granularity for adapting to resource limitations. In this paper, we propose a method to encode light field networks with continuous LODs, allowing for finely tuned adaptations to rendering conditions. Our training procedure uses summed-area table filtering allowing efficient and continuous filtering at various LODs. Furthermore, we use saliency-based importance sampling which enables our light field networks to distribute their capacity, particularly limited at lower LODs, towards representing the details viewers are most likely to focus on. Incorporating continuous LODs into neural representations enables progressive streaming of neural representations, decreasing the latency and resource utilization for rendering.
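
A minimal sketch of summed-area-table (SAT) box filtering, the ingredient that makes filtering at arbitrary detail levels cheap: the cost of one filtered lookup is independent of the filter width. It operates on a plain (H, W, C) float grid for illustration; how the paper applies this to light-field-network features, and how fractional filter widths are handled, are assumptions noted in the comments.

```python
import numpy as np

def box_filter_sat(grid: np.ndarray, radius: int) -> np.ndarray:
    """Hedged sketch: SAT-based box filtering of a (H, W, C) float grid.
    Fractional radii (continuous LODs) could be handled by blending the
    results of two integer radii; that blending scheme is an assumption."""
    H, W, _ = grid.shape
    sat = np.zeros((H + 1, W + 1) + grid.shape[2:], dtype=grid.dtype)
    sat[1:, 1:] = grid.cumsum(axis=0).cumsum(axis=1)   # inclusive 2D prefix sums
    out = np.empty_like(grid)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            # Box sum in O(1) from four SAT lookups, then average over the box area.
            box_sum = sat[y1, x1] - sat[y0, x1] - sat[y1, x0] + sat[y0, x0]
            out[y, x] = box_sum / ((y1 - y0) * (x1 - x0))
    return out
```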

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

  • paper_url: http://arxiv.org/abs/2309.11569
  • repo_url: None
  • paper_authors: Mohamed Afham, Satya Narayan Shukla, Omid Poursaeed, Pengchuan Zhang, Ashish Shah, Sernam Lim
  • for: Improve long-form video understanding by adapting sampling to the semantically consistent segments found in real-world videos.
  • methods: An adaptive sampling and tokenization approach based on Kernel Temporal Segmentation (KTS) that is task-agnostic, unsupervised, and scalable, requiring neither task-specific supervision nor fixed-length clips.
  • results: Consistent gains on video classification and temporal action localization, achieving state-of-the-art performance on long-form video modeling.
    Abstract While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.

A Large-scale Dataset for Audio-Language Representation Learning

  • paper_url: http://arxiv.org/abs/2309.11500
  • repo_url: https://github.com/jettbrains/-L-
  • paper_authors: Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie
  • for: This paper presents an automatic audio caption generation pipeline and uses it to construct a large-scale, high-quality audio-language dataset named Auto-ACD, comprising over 1.9M audio-text pairs.
  • methods: The pipeline chains a series of public tools or APIs to automatically generate captions for audio clips.
  • results: Training popular models on the dataset improves performance on downstream tasks such as audio-language retrieval, audio captioning, and environment classification; a novel test set and benchmark for audio-text tasks are also established.
    Abstract The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, the present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs, and construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on our dataset and show performance improvement on various downstream tasks, namely, audio-language retrieval, audio captioning, environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at https://auto-acd.github.io/.

FreeU: Free Lunch in Diffusion U-Net

  • paper_url: http://arxiv.org/abs/2309.11497
  • repo_url: https://github.com/ChenyangSi/FreeU
  • paper_authors: Chenyang Si, Ziqi Huang, Yuming Jiang, Ziwei Liu
  • for: Improve the generation quality of diffusion U-Nets without any additional training or fine-tuning.
  • methods: Strategically re-weight the contributions of the U-Net's skip connections and backbone feature maps, leveraging the strengths of both components of the architecture.
  • results: On image and video generation tasks, the simple yet effective FreeU method can be plugged into existing diffusion models (e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender, and ReVersion) to improve generation quality by adjusting only two scaling factors at inference.
    Abstract In this paper, we uncover the untapped potential of diffusion U-Net, which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method-termed "FreeU" - that enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps, to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated to existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference. Project page: https://chenyangsi.top/FreeU/.
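
Since the abstract reduces FreeU to two inference-time scaling factors, a minimal sketch of the re-weighting at a single U-Net decoder stage is shown below. Uniform per-channel scaling, plain concatenation as the fusion, and the default factor values are simplifying assumptions; the linked repository contains the exact scheme.

```python
import torch

def freeu_fuse(backbone_feat: torch.Tensor,
               skip_feat: torch.Tensor,
               b: float = 1.2,      # illustrative default: amplify backbone features
               s: float = 0.9       # illustrative default: damp skip-connection features
               ) -> torch.Tensor:
    """Hedged sketch of FreeU-style re-weighting at one decoder stage:
    boost the backbone (denoising) contribution with b > 1 and attenuate the
    high-frequency skip contribution with s < 1 before the usual fusion."""
    return torch.cat([backbone_feat * b, skip_feat * s], dim=1)
```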

Budget-Aware Pruning: Handling Multiple Domains with Less Parameters

  • paper_url: http://arxiv.org/abs/2309.11464
  • repo_url: None
  • paper_authors: Samuel Felipe dos Santos, Rodrigo Berriel, Thiago Oliveira-Santos, Nicu Sebe, Jurandy Almeida
  • for: The goal is Multi-Domain Learning: a single model that performs well in multiple domains while reducing computational cost and model size.
  • methods: A budget-aware pruning strategy encourages all domains to use a similar subset of filters from the baseline model, up to a user-defined budget, and prunes the filters not used by any domain.
  • results: The pruned models achieve classification performance similar to the baseline with lower computational cost and fewer parameters, making them better suited to resource-limited devices.
    Abstract Deep learning has achieved state-of-the-art performance on several computer vision tasks and domains. Nevertheless, it still has a high computational cost and demands a significant amount of parameters. Such requirements hinder the use in resource-limited environments and demand both software and hardware optimization. Another limitation is that deep models are usually specialized into a single domain or task, requiring them to learn and store new parameters for each new one. Multi-Domain Learning (MDL) attempts to solve this problem by learning a single model that is capable of performing well in multiple domains. Nevertheless, the models are usually larger than the baseline for a single domain. This work tackles both of these problems: our objective is to prune models capable of handling multiple domains according to a user-defined budget, making them more computationally affordable while keeping a similar classification performance. We achieve this by encouraging all domains to use a similar subset of filters from the baseline model, up to the amount defined by the user's budget. Then, filters that are not used by any domain are pruned from the network. The proposed approach innovates by better adapting to resource-limited devices while, to our knowledge, being the only work that handles multiple domains at test time with fewer parameters and lower computational complexity than the baseline model for a single domain.

Weight Averaging Improves Knowledge Distillation under Domain Shift

  • paper_url: http://arxiv.org/abs/2309.11446
  • repo_url: https://github.com/vorobeevich/distillation-in-dg
  • paper_authors: Valeriy Berezovskiy, Nikita Morozov
  • for: This work studies how knowledge distillation (KD) performs under domain shift, i.e., on data from domains unseen during training.
  • methods: A student network is trained to mimic a teacher network, and weight averaging techniques from the domain generalization literature are applied to the student.
  • results: Weight averaging improves knowledge distillation under domain shift; in addition, a simple weight averaging strategy that requires no evaluation on validation data during training performs on par with SWAD and SMA.
    Abstract Knowledge distillation (KD) is a powerful model compression technique broadly used in practical deep learning applications. It is focused on training a small student network to mimic a larger teacher network. While it is widely known that KD can offer an improvement to student generalization in i.i.d setting, its performance under domain shift, i.e. the performance of student networks on data from domains unseen during training, has received little attention in the literature. In this paper we make a step towards bridging the research fields of knowledge distillation and domain generalization. We show that weight averaging techniques proposed in domain generalization literature, such as SWAD and SMA, also improve the performance of knowledge distillation under domain shift. In addition, we propose a simplistic weight averaging strategy that does not require evaluation on validation data during training and show that it performs on par with SWAD and SMA when applied to KD. We name our final distillation approach Weight-Averaged Knowledge Distillation (WAKD).
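
A minimal sketch of the weight-averaging step that WAKD adds on top of standard distillation training: keep a running mean of the student's parameters across checkpoints, with no validation data involved. The averaging schedule (here, once per epoch) and the handling of BatchNorm buffers are assumptions.

```python
import copy
import torch

@torch.no_grad()
def update_weight_average(avg_model, model, num_averaged):
    """Hedged sketch: fold the current student weights into a running mean."""
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        p_avg.mul_(num_averaged / (num_averaged + 1)).add_(p / (num_averaged + 1))
    return num_averaged + 1

def make_average_copy(model):
    """Start the running average from the current student weights."""
    return copy.deepcopy(model)

# Usage sketch, once per epoch of distillation training:
#   avg_student, n = make_average_copy(student), 1
#   ... train `student` for one epoch with the usual KD loss ...
#   n = update_weight_average(avg_student, student, n)
# Evaluate `avg_student` at test time. (BatchNorm buffers, if any, would be
# recomputed or copied from the final student, as in SWA; omitted here.)
```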

SkeleTR: Towards Skeleton-based Action Recognition in the Wild

  • paper_url: http://arxiv.org/abs/2309.11445
  • repo_url: None
  • paper_authors: Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo
  • for: This paper targets skeleton-based action recognition in general scenarios that involve a variable number of people and various forms of interaction.
  • methods: A two-stage paradigm first models the intra-person skeleton dynamics of each sequence with graph convolutions, then uses stacked Transformer encoders to capture person interactions.
  • results: As a unified solution, SkeleTR applies directly to video-level action classification, instance-level action detection, and group-level activity recognition, enables transfer learning and joint training across action tasks and datasets, and achieves state-of-the-art performance on multiple benchmarks.
    Abstract We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target more general scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relative short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which result in performance improvement. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves the state-of-the-art performance.

Signature Activation: A Sparse Signal View for Holistic Saliency

  • paper_url: http://arxiv.org/abs/2309.11443
  • repo_url: https://github.com/dtak/signature-activation
  • paper_authors: Jose Roberto Tello Ayala, Akl C. Fahed, Weiwei Pan, Eugene V. Pomerantsev, Patrick T. Ellinor, Anthony Philippakis, Finale Doshi-Velez
  • for: This paper aims to improve the transparency and explainability of machine learning for medical imaging.
  • methods: It introduces Signature Activation, a saliency method that generates holistic, class-agnostic explanations for convolutional neural network (CNN) outputs, exploiting the fact that certain medical images, such as angiograms, have clear foreground and background objects.
  • results: The method's efficacy is demonstrated by evaluating its usefulness for aiding lesion detection in coronary angiograms.
    Abstract The adoption of machine learning in healthcare calls for model transparency and explainability. In this work, we introduce Signature Activation, a saliency method that generates holistic and class-agnostic explanations for Convolutional Neural Network (CNN) outputs. Our method exploits the fact that certain kinds of medical images, such as angiograms, have clear foreground and background objects. We give theoretical explanation to justify our methods. We show the potential use of our method in clinical settings through evaluating its efficacy for aiding the detection of lesions in coronary angiograms.

CalibFPA: A Focal Plane Array Imaging System based on Online Deep-Learning Calibration

  • paper_url: http://arxiv.org/abs/2309.11421
  • repo_url: None
  • paper_authors: Alper Güngör, M. Umut Bahceci, Yasin Ergen, Ahmet Sözak, O. Oner Ekiz, Tolga Yelboga, Tolga Çukur
  • for: This paper proposes CalibFPA, a compressive focal plane array (FPA) system that achieves high-resolution (HR) imaging via online deep-learning calibration, avoiding offline calibration scans.
  • methods: A piezo-stage locomotes a pre-printed fixed coded aperture for multiplexed encoding, a deep neural network corrects for system non-idealities in the multiplexed low-resolution (LR) measurements, and a deep plug-and-play algorithm reconstructs the image from the corrected measurements.
  • results: On simulated and experimental datasets, CalibFPA outperforms state-of-the-art compressive FPA methods; analyses validate the design elements and assess computational complexity.
    Abstract Compressive focal plane arrays (FPA) enable cost-effective high-resolution (HR) imaging by acquisition of several multiplexed measurements on a low-resolution (LR) sensor. Multiplexed encoding of the visual scene is typically performed via electronically controllable spatial light modulators (SLM). An HR image is then reconstructed from the encoded measurements by solving an inverse problem that involves the forward model of the imaging system. To capture system non-idealities such as optical aberrations, a mainstream approach is to conduct an offline calibration scan to measure the system response for a point source at each spatial location on the imaging grid. However, it is challenging to run calibration scans when using structured SLMs as they cannot encode individual grid locations. In this study, we propose a novel compressive FPA system based on online deep-learning calibration of multiplexed LR measurements (CalibFPA). We introduce a piezo-stage that locomotes a pre-printed fixed coded aperture. A deep neural network is then leveraged to correct for the influences of system non-idealities in multiplexed measurements without the need for offline calibration scans. Finally, a deep plug-and-play algorithm is used to reconstruct images from corrected measurements. On simulated and experimental datasets, we demonstrate that CalibFPA outperforms state-of-the-art compressive FPA methods. We also report analyses to validate the design elements in CalibFPA and assess computational complexity.

CNNs for JPEGs: A Study in Computational Cost

  • paper_url: http://arxiv.org/abs/2309.11417
  • repo_url: None
  • paper_authors: Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida
  • for: This paper studies the computational cost of deep models that learn directly from the compressed (frequency) domain, aiming to reduce both cost and parameter count.
  • methods: DCT frequency-domain representations are obtained by partial decoding, and conventional CNN architectures are adapted to operate on them.
  • results: Handcrafted and data-driven techniques are proposed to reduce computational complexity and the number of parameters, yielding efficient frequency-domain models with a better trade-off between computational cost and accuracy relative to their RGB baselines.
    Abstract Convolutional neural networks (CNNs) have achieved astonishing advances over the past decade, defining state-of-the-art in several computer vision tasks. CNNs are capable of learning robust representations of the data directly from the RGB pixels. However, most image data are usually available in compressed format, from which the JPEG is the most widely used due to transmission and storage purposes demanding a preliminary decoding process that have a high computational load and memory usage. For this reason, deep learning methods capable of learning directly from the compressed domain have been gaining attention in recent years. Those methods usually extract a frequency domain representation of the image, like DCT, by a partial decoding, and then make adaptation to typical CNNs architectures to work with them. One limitation of these current works is that, in order to accommodate the frequency domain data, the modifications made to the original model increase significantly their amount of parameters and computational complexity. On one hand, the methods have faster preprocessing, since the cost of fully decoding the images is avoided, but on the other hand, the cost of passing the images though the model is increased, mitigating the possible upside of accelerating the method. In this paper, we propose a further study of the computational cost of deep models designed for the frequency domain, evaluating the cost of decoding and passing the images through the network. We also propose handcrafted and data-driven techniques for reducing the computational complexity and the number of parameters for these models in order to keep them similar to their RGB baselines, leading to efficient models with a better trade off between computational cost and accuracy.

Enhancing motion trajectory segmentation of rigid bodies using a novel screw-based trajectory-shape representation

  • paper_url: http://arxiv.org/abs/2309.11413
  • repo_url: None
  • paper_authors: Arno Verduyn, Maxim Vochten, Joris De Schutter
  • for: This paper focuses on trajectory segmentation for 3D rigid-body motions.
  • methods: A novel trajectory representation is proposed that incorporates both translation and rotation, consisting of a geometric progress rate and a third-order trajectory-shape descriptor; concepts from screw theory make it time-invariant and invariant to the choice of body reference point.
  • results: Validated with a self-supervised segmentation approach, both in simulation and on real recordings of human-demonstrated pouring motions, the representation yields more robust detection of consecutive submotions with distinct features and more consistent segmentation than conventional representations.
    Abstract Trajectory segmentation refers to dividing a trajectory into meaningful consecutive sub-trajectories. This paper focuses on trajectory segmentation for 3D rigid-body motions. Most segmentation approaches in the literature represent the body's trajectory as a point trajectory, considering only its translation and neglecting its rotation. We propose a novel trajectory representation for rigid-body motions that incorporates both translation and rotation, and additionally exhibits several invariant properties. This representation consists of a geometric progress rate and a third-order trajectory-shape descriptor. Concepts from screw theory were used to make this representation time-invariant and also invariant to the choice of body reference point. This new representation is validated for a self-supervised segmentation approach, both in simulation and using real recordings of human-demonstrated pouring motions. The results show a more robust detection of consecutive submotions with distinct features and a more consistent segmentation compared to conventional representations. We believe that other existing segmentation methods may benefit from using this trajectory representation to improve their invariance.

Self-supervised learning unveils change in urban housing from street-level images

  • paper_url: http://arxiv.org/abs/2309.11354
  • repo_url: None
  • paper_authors: Steven Stalder, Michele Volpi, Nicolas Büttner, Stephen Law, Kenneth Harttgen, Esra Suel
  • for: tracks progress in urban housing, specifically in London’s housing supply
  • methods: uses deep learning-based computer vision methods and self-supervised techniques to measure change in street-level images
  • results: successfully identified point-level change in London’s housing supply and distinguished between major and minor change, providing timely information for urban planning and policy decisions.
    Abstract Cities around the world face a critical shortage of affordable and decent housing. Despite its critical importance for policy, our ability to effectively monitor and track progress in urban housing is limited. Deep learning-based computer vision methods applied to street-level images have been successful in the measurement of socioeconomic and environmental inequalities but did not fully utilize temporal images to track urban change as time-varying labels are often unavailable. We used self-supervised methods to measure change in London using 15 million street images taken between 2008 and 2021. Our novel adaptation of Barlow Twins, Street2Vec, embeds urban structure while being invariant to seasonal and daily changes without manual annotations. It outperformed generic embeddings, successfully identified point-level change in London's housing supply from street-level images, and distinguished between major and minor change. This capability can provide timely information for urban planning and policy decisions toward more liveable, equitable, and sustainable cities.
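
For reference, a minimal sketch of the Barlow Twins objective that Street2Vec adapts: embeddings of two views of the same locations are pushed toward an identity cross-correlation matrix. Treating street images of the same place taken at different times as the two "views" is the natural reading of the abstract, but the exact pairing strategy and projector design are not specified there.

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 5e-3) -> torch.Tensor:
    """Standard Barlow Twins loss (a sketch of the underlying self-supervised
    objective, not Street2Vec's exact adaptation).
    z1, z2: (N, D) embeddings of two views of the same N locations."""
    N = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)    # normalize each embedding dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / N                            # (D, D) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum() # pull diagonal toward 1 (invariance)
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # push rest toward 0 (redundancy reduction)
    return on_diag + lam * off_diag
```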

You can have your ensemble and run it too – Deep Ensembles Spread Over Time

  • paper_url: http://arxiv.org/abs/2309.11333
  • repo_url: None
  • paper_authors: Isak Meding, Alexander Bodin, Adam Tonderski, Joakim Johnander, Christoffer Petersson, Lennart Svensson
  • for: This work explores whether a deep ensemble can be spread over time, retaining its predictive and uncertainty-estimation benefits in sequential settings such as autonomous driving.
  • methods: Deep Ensembles Spread Over Time (DESOT) applies only a single ensemble member to each data point in a sequence and fuses the predictions over the sequence of data points.
  • results: On traffic sign classification over sequences of tracked image patches, DESOT obtains the predictive and uncertainty-estimation benefits of deep ensembles while avoiding the added computational cost; it is simple to implement, does not require sequences during training, and, like deep ensembles, outperforms single models for out-of-distribution detection.
    Abstract Ensembles of independently trained deep neural networks yield uncertainty estimates that rival Bayesian networks in performance. They also offer sizable improvements in terms of predictive performance over single models. However, deep ensembles are not commonly used in environments with limited computational budget -- such as autonomous driving -- since the complexity grows linearly with the number of ensemble members. An important observation that can be made for robotics applications, such as autonomous driving, is that data is typically sequential. For instance, when an object is to be recognized, an autonomous vehicle typically observes a sequence of images, rather than a single image. This raises the question, could the deep ensemble be spread over time? In this work, we propose and analyze Deep Ensembles Spread Over Time (DESOT). The idea is to apply only a single ensemble member to each data point in the sequence, and fuse the predictions over a sequence of data points. We implement and experiment with DESOT for traffic sign classification, where sequences of tracked image patches are to be classified. We find that DESOT obtains the benefits of deep ensembles, in terms of predictive and uncertainty estimation performance, while avoiding the added computational cost. Moreover, DESOT is simple to implement and does not require sequences during training. Finally, we find that DESOT, like deep ensembles, outperform single models for out-of-distribution detection.
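
A minimal sketch of the DESOT idea for one tracked object: each frame is handled by a single ensemble member, and the per-frame probabilities are fused the way a deep ensemble would fuse its members. The round-robin member assignment and mean fusion are assumptions for illustration.

```python
import torch

def desot_predict(models, frames):
    """Hedged sketch of Deep Ensembles Spread Over Time.

    models: list of M trained classifiers with identical output spaces
    frames: (T, C, H, W) tensor of tracked image patches for one object
    """
    probs = []
    for t, frame in enumerate(frames):
        member = models[t % len(models)]           # only one member runs per time step
        with torch.no_grad():
            logits = member(frame.unsqueeze(0))    # (1, num_classes)
        probs.append(logits.softmax(dim=-1))
    # Fuse predictions over the sequence instead of over ensemble members.
    return torch.cat(probs, dim=0).mean(dim=0)
```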

How to turn your camera into a perfect pinhole model

  • paper_url: http://arxiv.org/abs/2309.11326
  • repo_url: None
  • paper_authors: Ivan De Boi, Stuti Pathak, Marina Oliveira, Rudi Penne
  • for: Improve camera calibration for computer vision applications with a method that can handle multiple sources of distortion.
  • methods: Gaussian processes remove distortions and camera imperfections from images, creating a virtual ideal pinhole camera with square pixels; only a single image of a square grid calibration pattern is needed, and no distortion model has to be assumed.
  • results: Many algorithms and applications designed in a purely projective-geometry setting benefit from the approach, which simplifies Zhang's calibration method by removing the distortion parameters and iterative optimization; it is validated on synthetic data and real-world images.
    Abstract Camera calibration is a first and fundamental step in various computer vision applications. Despite being an active field of research, Zhang's method remains widely used for camera calibration due to its implementation in popular toolboxes. However, this method initially assumes a pinhole model with oversimplified distortion models. In this work, we propose a novel approach that involves a pre-processing step to remove distortions from images by means of Gaussian processes. Our method does not need to assume any distortion model and can be applied to severely warped images, even in the case of multiple distortion sources, e.g., a fisheye image of a curved mirror reflection. The Gaussian processes capture all distortions and camera imperfections, resulting in virtual images as though taken by an ideal pinhole camera with square pixels. Furthermore, this ideal GP-camera only needs one image of a square grid calibration pattern. This model allows for a serious upgrade of many algorithms and applications that are designed in a pure projective geometry setting but with a performance that is very sensitive to nonlinear lens distortions. We demonstrate the effectiveness of our method by simplifying Zhang's calibration method, reducing the number of parameters and getting rid of the distortion parameters and iterative optimization. We validate by means of synthetic data and real world images. The contributions of this work include the construction of a virtual ideal pinhole camera using Gaussian processes, a simplified calibration method and lens distortion removal.
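
A minimal sketch of the underlying idea, assuming scikit-learn: fit Gaussian processes on the detected corners of a single square-grid pattern so that every pixel of the virtual ideal pinhole image can be looked up in the captured (distorted) image. Using one independent GP per output coordinate and an RBF-plus-noise kernel are assumptions, not the paper's exact model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_undistortion_gp(ideal_pts: np.ndarray, observed_pts: np.ndarray):
    """Hedged sketch: learn a smooth map from ideal pinhole-image coordinates
    to the corresponding pixel coordinates in the captured (distorted) image.

    ideal_pts:    (N, 2) corner locations in the ideal (undistorted) image
    observed_pts: (N, 2) detected corner locations in the captured image
    """
    kernel = RBF(length_scale=100.0) + WhiteKernel(noise_level=0.25)
    gps = [GaussianProcessRegressor(kernel=kernel, normalize_y=True)
           .fit(ideal_pts, observed_pts[:, i]) for i in range(2)]

    def lookup(ideal_query: np.ndarray) -> np.ndarray:
        # Where in the captured image should each ideal pixel be sampled from?
        return np.stack([gp.predict(ideal_query) for gp in gps], axis=1)

    return lookup
```

The returned `lookup` function would then drive a standard remapping step (e.g., bilinear sampling of the captured image at the predicted coordinates) to synthesize the virtual pinhole image.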

Face Aging via Diffusion-based Editing

  • paper_url: http://arxiv.org/abs/2309.11321
  • repo_url: https://github.com/MunchkinChen/FADING
  • paper_authors: Xiangyi Chen, Stéphane Lathuilière
  • for: This work addresses face aging: generating past or future facial images by incorporating age-related changes into a given face.
  • methods: The proposed FADING leverages the rich prior of large-scale language-image diffusion models: a pre-trained diffusion model is first specialized for face age editing with an age-aware fine-tuning scheme, the input image is then inverted to latent noise with optimized null-text embeddings, and finally text-guided local age editing is performed via attention control.
  • results: Quantitative and qualitative analyses show that the method outperforms existing approaches in aging accuracy, attribute preservation, and aging quality.
    Abstract In this paper, we address the problem of face aging: generating past or future facial images by incorporating age-related changes to the given face. Previous aging methods rely solely on human facial image datasets and are thus constrained by their inherent scale and bias. This restricts their application to a limited generatable age range and the inability to handle large age gaps. We propose FADING, a novel approach to address Face Aging via DIffusion-based editiNG. We go beyond existing methods by leveraging the rich prior of large-scale language-image diffusion models. First, we specialize a pre-trained diffusion model for the task of face age editing by using an age-aware fine-tuning scheme. Next, we invert the input image to latent noise and obtain optimized null text embeddings. Finally, we perform text-guided local age editing via attention control. The quantitative and qualitative analyses demonstrate that our method outperforms existing approaches with respect to aging accuracy, attribute preservation, and aging quality.

Uncovering the effects of model initialization on deep model generalization: A study with adult and pediatric Chest X-ray images

  • paper_url: http://arxiv.org/abs/2309.11318
  • repo_url: None
  • paper_authors: Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, Sameer Antani
  • for: This study aims to improve the performance and reliability of deep learning models in medical computer vision, focusing on the less-understood impact of model initialization on chest X-ray (CXR) analysis for adult and pediatric populations.
  • methods: Three deep model initialization techniques are compared: cold-start, warm-start, and shrink-and-perturb start, evaluated in scenarios with periodically arriving training data that reflect real-world data influx and the need for model updates; generalizability is assessed on external adult and pediatric CXR datasets.
  • results: Models initialized with ImageNet-pretrained weights generalize better than randomly initialized counterparts and perform consistently during internal and external testing across training scenarios; weight-level ensembles show significantly higher recall (p<0.05) at test time than individual models, highlighting the benefits of ImageNet-pretrained initialization combined with weight-level ensembling for robust and generalizable deep learning solutions.
    Abstract Model initialization techniques are vital for improving the performance and reliability of deep learning models in medical computer vision applications. While much literature exists on non-medical images, the impacts on medical images, particularly chest X-rays (CXRs) are less understood. Addressing this gap, our study explores three deep model initialization techniques: Cold-start, Warm-start, and Shrink and Perturb start, focusing on adult and pediatric populations. We specifically focus on scenarios with periodically arriving data for training, thereby embracing the real-world scenarios of ongoing data influx and the need for model updates. We evaluate these models for generalizability against external adult and pediatric CXR datasets. We also propose novel ensemble methods: F-score-weighted Sequential Least-Squares Quadratic Programming (F-SLSQP) and Attention-Guided Ensembles with Learnable Fuzzy Softmax to aggregate weight parameters from multiple models to capitalize on their collective knowledge and complementary representations. We perform statistical significance tests with 95% confidence intervals and p-values to analyze model performance. Our evaluations indicate models initialized with ImageNet-pre-trained weights demonstrate superior generalizability over randomly initialized counterparts, contradicting some findings for non-medical images. Notably, ImageNet-pretrained models exhibit consistent performance during internal and external testing across different training scenarios. Weight-level ensembles of these models show significantly higher recall (p<0.05) during testing compared to individual models. Thus, our study accentuates the benefits of ImageNet-pretrained weight initialization, especially when used with weight-level ensembles, for creating robust and generalizable deep learning solutions.
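
Of the three initialization strategies compared, shrink-and-perturb is the least familiar; a minimal sketch of that warm-start step (following Ash & Adams, 2020) is shown below. The shrink factor and noise scale are placeholders, not the values used in this study.

```python
import torch

@torch.no_grad()
def shrink_and_perturb(model: torch.nn.Module, shrink: float = 0.4, sigma: float = 0.01):
    """Hedged sketch of a shrink-and-perturb warm start: before training on a
    newly arrived batch of data, shrink the previously trained weights toward
    zero and add small Gaussian noise, in place."""
    for p in model.parameters():
        p.mul_(shrink).add_(sigma * torch.randn_like(p))
```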

Generalizing Across Domains in Diabetic Retinopathy via Variational Autoencoders

  • paper_url: http://arxiv.org/abs/2309.11301
  • repo_url: https://github.com/sharonchokuwa/VAE-DG
  • paper_authors: Sharon Chokuwa, Muhammad H. Khan
  • for: This paper explores whether variational autoencoders can achieve domain generalization for diabetic retinopathy (DR) classification, tackling the domain shift encountered across DR datasets.
  • methods: The latent space of fundus images is disentangled with a variational autoencoder to obtain a more robust and adaptable domain-invariant representation.
  • results: Despite its simplicity, the approach outperforms contemporary state-of-the-art methods on publicly available datasets, challenging the assumption that highly sophisticated methods are inherently superior for generalizing medical images.
    Abstract Domain generalization for Diabetic Retinopathy (DR) classification allows a model to adeptly classify retinal images from previously unseen domains with various imaging conditions and patient demographics, thereby enhancing its applicability in a wide range of clinical environments. In this study, we explore the inherent capacity of variational autoencoders to disentangle the latent space of fundus images, with an aim to obtain a more robust and adaptable domain-invariant representation that effectively tackles the domain shift encountered in DR datasets. Despite the simplicity of our approach, we explore the efficacy of this classical method and demonstrate its ability to outperform contemporary state-of-the-art approaches for this task using publicly available datasets. Our findings challenge the prevailing assumption that highly sophisticated methods for DR classification are inherently superior for domain generalization. This highlights the importance of considering simple methods and adapting them to the challenging task of generalizing medical images, rather than solely relying on advanced techniques.

Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates

  • paper_url: http://arxiv.org/abs/2309.11281
  • repo_url: https://github.com/kcshum/pose-conditioned-NeRF-object-fusion
  • paper_authors: Ka Chun Shum, Jaeyeon Kim, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
  • for: This paper addresses language-driven object manipulation in neural radiance fields, which render high-quality, multi-view-consistent images from a neural scene representation.
  • methods: To insert a new foreground object into a background radiance field, a text-to-image diffusion model learns and generates combined images that fuse the object into the background across views; these images refine the background radiance field, and a pose-conditioned dataset updates strategy prioritizes training with camera views close to already-trained views to ensure view consistency. The same strategy also supports object insertion from text-to-3D models and object removal.
  • results: Experiments show photorealistic renderings of the edited scenes, outperforming state-of-the-art methods in 3D reconstruction and neural radiance field blending.
    Abstract Neural radiance field is an emerging rendering method that generates high-quality multi-view consistent images from a neural scene representation and volume rendering. Although neural radiance field-based techniques are robust for scene reconstruction, their ability to add or remove objects remains limited. This paper proposes a new language-driven approach for object manipulation with neural radiance fields through dataset updates. Specifically, to insert a new foreground object represented by a set of multi-view images into a background radiance field, we use a text-to-image diffusion model to learn and generate combined images that fuse the object of interest into the given background across views. These combined images are then used for refining the background radiance field so that we can render view-consistent images containing both the object and the background. To ensure view consistency, we propose a dataset updates strategy that prioritizes radiance field training with camera views close to the already-trained views prior to propagating the training to remaining views. We show that under the same dataset updates strategy, we can easily adapt our method for object insertion using data from text-to-3D models as well as object removal. Experimental results show that our method generates photorealistic images of the edited scenes, and outperforms state-of-the-art methods in 3D reconstruction and neural radiance field blending.

Towards Real-Time Neural Video Codec for Cross-Platform Application Using Calibration Information

  • paper_url: http://arxiv.org/abs/2309.11276
  • repo_url: None
  • paper_authors: Kuan Tian, Yonghang Guan, Jinxi Xiang, Jun Zhang, Xiao Han, Wei Yang
  • for: This paper proposes a real-time cross-platform neural video codec, addressing the two major challenges that keep neural codecs from practical deployment: cross-platform floating-point inconsistency and high computational complexity.
  • methods: A calibration transmitting system guarantees consistent quantization of entropy parameters between the encoding and decoding stages across platforms, with a piecewise Gaussian constraint rectifying the distribution of entropy parameters to reduce the auxiliary bitrate; a lightweight model with a series of efficiency techniques meets the computational limits of the decoding side.
  • results: The model decodes 720P video at 25 FPS on an NVIDIA RTX 2080 GPU while the bitstream is encoded on another platform, and brings up to 24.2% BD-rate improvement in terms of PSNR over the H.265 anchor.
    Abstract The state-of-the-art neural video codecs have outperformed the most sophisticated traditional codecs in terms of RD performance in certain cases. However, utilizing them for practical applications is still challenging for two major reasons. 1) Cross-platform computational errors resulting from floating point operations can lead to inaccurate decoding of the bitstream. 2) The high computational complexity of the encoding and decoding process poses a challenge in achieving real-time performance. In this paper, we propose a real-time cross-platform neural video codec, which is capable of efficiently decoding of 720P video bitstream from other encoding platforms on a consumer-grade GPU. First, to solve the problem of inconsistency of codec caused by the uncertainty of floating point calculations across platforms, we design a calibration transmitting system to guarantee the consistent quantization of entropy parameters between the encoding and decoding stages. The parameters that may have transboundary quantization between encoding and decoding are identified in the encoding stage, and their coordinates will be delivered by auxiliary transmitted bitstream. By doing so, these inconsistent parameters can be processed properly in the decoding stage. Furthermore, to reduce the bitrate of the auxiliary bitstream, we rectify the distribution of entropy parameters using a piecewise Gaussian constraint. Second, to match the computational limitations on the decoding side for real-time video codec, we design a lightweight model. A series of efficiency techniques enable our model to achieve 25 FPS decoding speed on NVIDIA RTX 2080 GPU. Experimental results demonstrate that our model can achieve real-time decoding of 720P videos while encoding on another platform. Furthermore, the real-time model brings up to a maximum of 24.2\% BD-rate improvement from the perspective of PSNR with the anchor H.265.

StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

  • paper_url: http://arxiv.org/abs/2309.11268
  • repo_url: None
  • paper_authors: Renqiu Xia, Bo Zhang, Haoyang Peng, Ning Liao, Peng Ye, Botian Shi, Junchi Yan, Yu Qiao
  • for: This paper aims to establish a unified, label-efficient learning paradigm for joint chart perception and reasoning that is applicable to diverse downstream tasks beyond question answering.
  • methods: Chart information is reformulated from the popular linearized-CSV form into the proposed Structured Triplet Representations (STR), a Structuring Chart-oriented Representation Metric (SCRM) quantitatively evaluates chart perception, and a Large Language Model (LLM) is leveraged to enrich the training data in both chart visual style and statistical content.
  • results: Extensive experiments on various chart-related tasks demonstrate the effectiveness and promising potential of the unified chart perception-reasoning paradigm for advancing chart understanding.
    Abstract Charts are common in literature across different scientific fields, conveying rich information easily accessible to readers. Current chart-related tasks focus on either chart perception which refers to extracting information from the visual charts, or performing reasoning given the extracted data, e.g. in a tabular form. In this paper, we aim to establish a unified and label-efficient learning paradigm for joint perception and reasoning tasks, which can be generally applicable to different downstream tasks, beyond the question-answering task as specifically studied in peer works. Specifically, StructChart first reformulates the chart information from the popular tubular form (specifically linearized CSV) to the proposed Structured Triplet Representations (STR), which is more friendly for reducing the task gap between chart perception and reasoning due to the employed structured information extraction for charts. We then propose a Structuring Chart-oriented Representation Metric (SCRM) to quantitatively evaluate the performance for the chart perception task. To enrich the dataset for training, we further explore the possibility of leveraging the Large Language Model (LLM), enhancing the chart diversity in terms of both chart visual style and its statistical information. Extensive experiments are conducted on various chart-related tasks, demonstrating the effectiveness and promising potential for a unified chart perception-reasoning paradigm to push the frontier of chart understanding.

From Classification to Segmentation with Explainable AI: A Study on Crack Detection and Growth Monitoring

  • paper_url: http://arxiv.org/abs/2309.11267
  • repo_url: None
  • paper_authors: Florent Forest, Hugo Porta, Devis Tuia, Olga Fink
  • for: This work aims to automate surface crack monitoring in infrastructure for structural health monitoring.
  • methods: Machine learning approaches typically need large annotated datasets for supervised training, and once a crack is detected, monitoring its severity usually requires precise pixel-level segmentation, whose annotation is labor-intensive. To avoid this cost, the study derives segmentations from the explanations of a classifier via explainable AI (XAI), requiring only weak image-level supervision.
  • results: The XAI-derived masks remain meaningful even without dense annotations; although their quality is lower than that of fully supervised methods, they still enable severity monitoring and substantially reduce labeling costs.
    Abstract Monitoring surface cracks in infrastructure is crucial for structural health monitoring. Automatic visual inspection offers an effective solution, especially in hard-to-reach areas. Machine learning approaches have proven their effectiveness but typically require large annotated datasets for supervised training. Once a crack is detected, monitoring its severity often demands precise segmentation of the damage. However, pixel-level annotation of images for segmentation is labor-intensive. To mitigate this cost, one can leverage explainable artificial intelligence (XAI) to derive segmentations from the explanations of a classifier, requiring only weak image-level supervision. This paper proposes applying this methodology to segment and monitor surface cracks. We evaluate the performance of various XAI methods and examine how this approach facilitates severity quantification and growth monitoring. Results reveal that while the resulting segmentation masks may exhibit lower quality than those produced by supervised methods, they remain meaningful and enable severity monitoring, thus reducing substantial labeling costs.
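
A minimal sketch of the weakly supervised idea, assuming a plain class activation map (CAM) as the explanation: a crack/no-crack classifier needs only image-level labels, and a segmentation mask is obtained by thresholding the upsampled explanation. The backbone, threshold, and choice of CAM are illustrative; the paper evaluates several XAI methods.

```python
# Explanation-derived crack mask from an image-level classifier (CAM sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCrackClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32, 2)                           # classes: no-crack / crack

    def forward(self, x):
        fmap = self.features(x)                              # (B, 32, H, W)
        logits = self.fc(fmap.mean(dim=(2, 3)))              # global-average-pooling head
        return logits, fmap

def cam_to_mask(model, image, cls=1, thresh=0.5):
    _, fmap = model(image)
    weights = model.fc.weight[cls]                           # head weights for the crack class
    cam = torch.einsum("c,bchw->bhw", weights, fmap)         # class activation map
    cam = (cam - cam.amin()) / (cam.amax() - cam.amin() + 1e-8)
    cam = F.interpolate(cam[:, None], size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam > thresh).float()                            # weak, explanation-derived crack mask

mask = cam_to_mask(TinyCrackClassifier(), torch.rand(1, 3, 64, 64))
print(mask.shape)                                            # torch.Size([1, 1, 64, 64])
```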

TwinTex: Geometry-aware Texture Generation for Abstracted 3D Architectural Models

  • paper_url: http://arxiv.org/abs/2309.11258
  • repo_url: https://github.com/Ligo04/TwinTex
  • paper_authors: Weidan Xiong, Hongqian Zhang, Botao Peng, Ziyu Hu, Yongli Wu, Jianwei Guo, Hui Huang
  • for: This paper targets photo-realistic texture mapping for abstracted, piece-wise planar models of buildings and scenes, e.g. in Digital Twin cities.
  • methods: An automatic texture mapping pipeline selects a small set of high-quality photos per primitive plane, extracts multiple levels of line features (LoLs) to guide texture-geometry alignment from local to global, and fine-tunes a diffusion model on a new dataset to inpaint missing regions.
  • results: Experiments on buildings, indoor scenes, and man-made objects of varying complexity show high-fidelity texture mapping that reaches a human-expert production level with much less effort.
    Abstract Coarse architectural models are often generated at scales ranging from individual buildings to scenes for downstream applications such as Digital Twin City, Metaverse, LODs, etc. Such piece-wise planar models can be abstracted as twins from 3D dense reconstructions. However, these models typically lack realistic texture relative to the real building or scene, making them unsuitable for vivid display or direct reference. In this paper, we present TwinTex, the first automatic texture mapping framework to generate a photo-realistic texture for a piece-wise planar proxy. Our method addresses most challenges occurring in such twin texture generation. Specifically, for each primitive plane, we first select a small set of photos with greedy heuristics considering photometric quality, perspective quality and facade texture completeness. Then, different levels of line features (LoLs) are extracted from the set of selected photos to generate guidance for later steps. With LoLs, we employ optimization algorithms to align texture with geometry from local to global. Finally, we fine-tune a diffusion model with a multi-mask initialization component and a new dataset to inpaint the missing region. Experimental results on many buildings, indoor scenes and man-made objects of varying complexity demonstrate the generalization ability of our algorithm. Our approach surpasses state-of-the-art texture mapping methods in terms of high-fidelity quality and reaches a human-expert production level with much less effort. Project page: https://vcc.tech/research/2023/TwinTex.

Box2Poly: Memory-Efficient Polygon Prediction of Arbitrarily Shaped and Rotated Text

  • paper_url: http://arxiv.org/abs/2309.11248
  • repo_url: None
  • paper_authors: Xuyang Chen, Dong Wang, Konrad Schindler, Mingwei Sun, Yongliang Wang, Nicolo Savioli, Liqiu Meng
  • for: Improving the accuracy and efficiency of text detection, especially for irregularly shaped and rotated text layouts.
  • methods: A cascade decoding pipeline built on Sparse R-CNN that iteratively refines polygon predictions, using a single feature vector to guide polygon instance regression.
  • results: Compared with DPText-DETR, the method is substantially more memory-efficient (over 50% less) and faster at inference (over 40% less), with only a minor performance drop on benchmarks.
    Abstract Recently, Transformer-based text detection techniques have sought to predict polygons by encoding the coordinates of individual boundary vertices using distinct query features. However, this approach incurs a significant memory overhead and struggles to effectively capture the intricate relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlined vertices, diminishing the quality of results. To address these challenges, we present an innovative approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method ensures precision by iteratively refining polygon predictions, considering both the scale and location of preceding results. Leveraging this stabilized regression pipeline, even employing just a single feature vector to guide polygon instance regression yields promising detection results. Simultaneously, the leverage of instance-level feature proposal substantially enhances memory efficiency (>50% less vs. the state-of-the-art method DPText-DETR) and reduces inference speed (>40% less vs. DPText-DETR) with minor performance drop on benchmarks.
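
The control flow of cascade polygon refinement from a box prior can be sketched as follows: the initial polygon samples the box outline, and each stage regresses per-vertex offsets, guided by a single instance feature vector, that are added to the previous stage's polygon. This is only the iterative-refinement skeleton under assumed shapes and names; the paper's decoder and feature pooling are far richer.

```python
# Cascade polygon refinement skeleton (illustrative shapes and names).
import torch
import torch.nn as nn

N_VERTS, FEAT_DIM = 16, 256

def box_perimeter(box, n=N_VERTS):
    """Sample n points clockwise along the box outline; box: (B, 4) = x1, y1, x2, y2."""
    x1, y1, x2, y2 = box.unbind(dim=1)
    corners = torch.stack([torch.stack([x1, y1], -1), torch.stack([x2, y1], -1),
                           torch.stack([x2, y2], -1), torch.stack([x1, y2], -1)], dim=1)
    t = torch.linspace(0, 1, n // 4 + 1)[:-1]                 # points per edge, excluding the end corner
    pts = [corners[:, k, None, :] + (corners[:, (k + 1) % 4] - corners[:, k])[:, None, :] * t[None, :, None]
           for k in range(4)]
    return torch.cat(pts, dim=1)                              # (B, n, 2) initial polygon

class RefineStage(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, N_VERTS * 2))

    def forward(self, inst_feat, polygon, box_wh):
        offsets = self.mlp(inst_feat).view(-1, N_VERTS, 2)
        return polygon + offsets * box_wh[:, None, :]         # offsets scaled by the current box size

box = torch.tensor([[10.0, 20.0, 110.0, 60.0]])
inst_feat = torch.randn(1, FEAT_DIM)                          # a single feature vector guides the instance
poly, wh = box_perimeter(box), box[:, 2:] - box[:, :2]
for stage in [RefineStage() for _ in range(3)]:               # cascade: each stage refines the last result
    poly = stage(inst_feat, poly, wh)
print(poly.shape)                                             # torch.Size([1, 16, 2])
```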

Towards Robust Few-shot Point Cloud Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.11228
  • repo_url: https://github.com/Pixie8888/R3DFSSeg
  • paper_authors: Yating Xu, Na Zhao, Gim Hee Lee
  • for: Improving the robustness of few-shot point cloud semantic segmentation, so that models can quickly adapt to new unseen classes from only a few, possibly noisy, support samples.
  • methods: A Component-level Clean Noise Separation (CCNS) representation learning scheme separates clean target-class samples from noisy ones, and a Multi-scale Degree-based Noise Suppression (MDNS) scheme removes the noisy shots from the support set.
  • results: Extensive experiments under various noise settings show that combining CCNS and MDNS significantly improves performance.
    Abstract Few-shot point cloud semantic segmentation aims to train a model to quickly adapt to new unseen classes with only a handful of support set samples. However, the noise-free assumption in the support set can be easily violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy support sets during testing time. To this end, we first propose a Component-level Clean Noise Separation (CCNS) representation learning to learn discriminative feature representations that separates the clean samples of the target classes from the noisy samples. Leveraging the well separated clean and noisy support samples from our CCNS, we further propose a Multi-scale Degree-based Noise Suppression (MDNS) scheme to remove the noisy shots from the support set. We conduct extensive experiments on various noise settings on two benchmark datasets. Our results show that the combination of CCNS and MDNS significantly improves the performance. Our code is available at https://github.com/Pixie8888/R3DFSSeg.
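
A rough, single-scale sketch of degree-based noise suppression: support shots whose embeddings are weakly connected to the rest of the support set (low degree in a similarity graph) are treated as noisy and dropped. The thresholds and the single-scale graph are simplifying assumptions relative to the multi-scale MDNS scheme described above.

```python
# Drop weakly connected (low-degree) support shots as likely noise.
import torch
import torch.nn.functional as F

def suppress_noisy_shots(support_feats, sim_thresh=0.7, min_degree=2):
    """support_feats: (K, D), one embedding per support shot."""
    feats = F.normalize(support_feats, dim=1)
    sim = feats @ feats.t()                        # (K, K) cosine similarities
    adj = (sim > sim_thresh).float()
    adj.fill_diagonal_(0)                          # ignore self-edges
    degree = adj.sum(dim=1)                        # how strongly each shot agrees with the others
    keep = degree >= min_degree
    return support_feats[keep], keep

base = torch.randn(1, 128)
shots = torch.cat([base + 0.1 * torch.randn(8, 128), torch.randn(2, 128)], dim=0)   # last two are "noisy"
clean, keep = suppress_noisy_shots(shots)
print("kept shots:", keep.nonzero(as_tuple=True)[0].tolist())   # expected: roughly [0..7]
```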

Generalized Few-Shot Point Cloud Segmentation Via Geometric Words

  • paper_url: http://arxiv.org/abs/2309.11222
  • repo_url: https://github.com/Pixie8888/GFS-3DSeg_GWs
  • paper_authors: Yating Xu, Conghui Hu, Na Zhao, Gim Hee Lee
  • for: The paper proposes a more practical generalized few-shot point cloud segmentation setting, where the model generalizes to new classes from only a few support point clouds while retaining segmentation accuracy on the base classes.
  • methods: Geometric words represent geometric components shared between base and novel classes and are incorporated into a geometric-aware semantic representation, so the model generalizes to new classes without forgetting old ones; geometric prototypes further guide segmentation with geometric prior knowledge.
  • results: The method consistently outperforms baseline methods on S3DIS and ScanNet.
    Abstract Existing fully-supervised point cloud segmentation methods suffer in the dynamic testing environment with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the sacrifice of segmentation accuracy for the base classes, which severely impedes its practicality. This largely motivates us to present the first attempt at a more practical paradigm of generalized few-shot point cloud segmentation, which requires the model to generalize to new categories with only a few support point clouds and simultaneously retain the capability to segment base classes. We propose the geometric words to represent geometric components shared between the base and novel classes, and incorporate them into a novel geometric-aware semantic representation to facilitate better generalization to the new classes without forgetting the old ones. Moreover, we introduce geometric prototypes to guide the segmentation with geometric prior knowledge. Extensive experiments on S3DIS and ScanNet consistently illustrate the superior performance of our method over baseline methods. Our code is available at: https://github.com/Pixie8888/GFS-3DSeg_GWs.

Automatic Bat Call Classification using Transformer Networks

  • paper_url: http://arxiv.org/abs/2309.11218
  • repo_url: None
  • paper_authors: Frank Fundel, Daniel A. Braun, Sebastian Gottwald
  • for: automatic bat call identification
  • methods: Transformer architecture for multi-label classification
  • results: single species accuracy of 88.92% (F1-score of 84.23%), multi species macro F1-score of 74.40%
    Abstract Automatically identifying bat species from their echolocation calls is a difficult but important task for monitoring bats and the ecosystem they live in. Major challenges in automatic bat call identification are high call variability, similarities between species, interfering calls and lack of annotated data. Many currently available models suffer from relatively poor performance on real-life data due to being trained on single call datasets and, moreover, are often too slow for real-time classification. Here, we propose a Transformer architecture for multi-label classification with potential applications in real-time classification scenarios. We train our model on synthetically generated multi-species recordings by merging multiple bats calls into a single recording with multiple simultaneous calls. Our approach achieves a single species accuracy of 88.92% (F1-score of 84.23%) and a multi species macro F1-score of 74.40% on our test set. In comparison to three other tools on the independent and publicly available dataset ChiroVox, our model achieves at least 25.82% better accuracy for single species classification and at least 6.9% better macro F1-score for multi species classification.
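
A compact sketch of a Transformer-based multi-label classifier for bat call spectrograms: sigmoid outputs with binary cross-entropy let several species be active in one recording, matching the multi-species setting above. The token scheme (one token per spectrogram time frame) and model sizes are assumptions, not the paper's architecture.

```python
# Multi-label Transformer classifier over spectrogram frames.
import torch
import torch.nn as nn

class BatCallTransformer(nn.Module):
    def __init__(self, n_species=18, n_mels=64, d_model=128):
        super().__init__()
        self.embed = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_species)

    def forward(self, spec):                                  # spec: (B, T, n_mels)
        x = self.encoder(self.embed(spec))
        return self.head(x.mean(dim=1))                       # one logit per species

model = BatCallTransformer()
spec = torch.randn(2, 200, 64)                                # two recordings, 200 time frames each
target = torch.zeros(2, 18)
target[0, 3] = 1                                              # one species in the first recording
target[1, [3, 7]] = 1                                         # two simultaneous species in the second
loss = nn.BCEWithLogitsLoss()(model(spec), target)
loss.backward()
print("multi-label BCE loss:", loss.item())
```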

EPTQ: Enhanced Post-Training Quantization via Label-Free Hessian

  • paper_url: http://arxiv.org/abs/2309.11531
  • repo_url: https://github.com/ssi-research/eptq
  • paper_authors: Ofir Gordon, Hai Victor Habi, Arnon Netzer
  • for: The paper proposes a new Enhanced Post-Training Quantization method (EPTQ) to improve the quantization of deep neural networks (DNNs) for deployment.
  • methods: EPTQ combines knowledge distillation with an adaptive weighting of layers, and introduces a label-free technique for approximating the Hessian trace of the task loss (Label-Free Hessian), removing the need for a labeled dataset.
  • results: EPTQ achieves state-of-the-art results across a wide variety of models, tasks, and datasets, including ImageNet classification, COCO object detection, and Pascal-VOC semantic segmentation, and works with CNNs, Transformers, hybrid, and MLP-only architectures.
    Abstract Quantization of deep neural networks (DNN) has become a key element in the efforts of embedding such networks on end-user devices. However, current quantization methods usually suffer from costly accuracy degradation. In this paper, we propose a new method for Enhanced Post Training Quantization named EPTQ. The method is based on knowledge distillation with an adaptive weighting of layers. In addition, we introduce a new label-free technique for approximating the Hessian trace of the task loss, named Label-Free Hessian. This technique removes the requirement of a labeled dataset for computing the Hessian. The adaptive knowledge distillation uses the Label-Free Hessian technique to give greater attention to the sensitive parts of the model while performing the optimization. Empirically, by employing EPTQ we achieve state-of-the-art results on a wide variety of models, tasks, and datasets, including ImageNet classification, COCO object detection, and Pascal-VOC for semantic segmentation. We demonstrate the performance and compatibility of EPTQ on an extended set of architectures, including CNNs, Transformers, hybrid, and MLP-only models.
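
To illustrate the label-free Hessian idea in isolation, the sketch below estimates a Hessian trace with Hutchinson's method, using a surrogate loss built only from the network's own outputs so no labels are required. This is a generic stand-in under stated assumptions, not EPTQ's exact Label-Free Hessian formulation.

```python
# Hutchinson trace estimation for a label-free surrogate loss.
import torch
import torch.nn as nn

def hutchinson_hessian_trace(loss, params, n_probes=8):
    """Estimate tr(H) of `loss` with respect to `params`."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(g, 2) * 2.0 - 1.0 for g in grads]   # +/-1 Rademacher probes
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hv = torch.autograd.grad(gv, params, retain_graph=True)      # Hessian-vector product H v
        trace += sum((h * v).sum() for h, v in zip(hv, vs)).item()   # v^T H v
    return trace / n_probes

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
x = torch.randn(64, 16)
loss = 0.5 * model(x).pow(2).mean()            # label-free surrogate: uses only the model's outputs
params = [p for p in model.parameters() if p.requires_grad]
print("estimated Hessian trace:", hutchinson_hessian_trace(loss, params))
```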

Partition-A-Medical-Image: Extracting Multiple Representative Sub-regions for Few-shot Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2309.11172
  • repo_url: https://github.com/YazhouZhu19/PAMI
  • paper_authors: Yazhou Zhu, Shidong Wang, Tong Xin, Zheng Zhang, Haofeng Zhang
  • for: This work targets few-shot medical image segmentation (FSMIS), a promising direction because high-quality annotations are naturally scarce for medical images.
  • methods: A Regional Prototypical Learning (RPL) module decomposes the foreground of a support image into distinct regions and derives region-level representations; a Prototypical Representation Debiasing (PRD) module then suppresses the disturbance of these regional representations.
  • results: Extensive experiments on three publicly accessible medical imaging datasets show consistent improvements over leading FSMIS methods; the source code is available at https://github.com/YazhouZhu19/PAMI.
    Abstract Few-shot Medical Image Segmentation (FSMIS) is a more promising solution for medical image segmentation tasks where high-quality annotations are naturally scarce. However, current mainstream methods primarily focus on extracting holistic representations from support images with large intra-class variations in appearance and background, and encounter difficulties in adapting to query images. In this work, we present an approach to extract multiple representative sub-regions from a given support medical image, enabling fine-grained selection over the generated image regions. Specifically, the foreground of the support image is decomposed into distinct regions, which are subsequently used to derive region-level representations via a designed Regional Prototypical Learning (RPL) module. We then introduce a novel Prototypical Representation Debiasing (PRD) module based on a two-way elimination mechanism which suppresses the disturbance of regional representations by a self-support, Multi-direction Self-debiasing (MS) block, and a support-query, Interactive Debiasing (ID) block. Finally, an Assembled Prediction (AP) module is devised to balance and integrate predictions of multiple prototypical representations learned using stacked PRD modules. Results obtained through extensive experiments on three publicly accessible medical imaging datasets demonstrate consistent improvements over the leading FSMIS methods. The source code is available at https://github.com/YazhouZhu19/PAMI.
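
A short sketch of region-level prototypes in the spirit of the RPL module: the support foreground is partitioned into sub-regions (here with plain k-means over foreground feature vectors, an assumption) and one prototype per region is the mean feature of that region.

```python
# Region-level prototypes from a support feature map and foreground mask.
import torch

def region_prototypes(feat, fg_mask, n_regions=4, iters=10):
    """feat: (C, H, W) support features; fg_mask: (H, W) binary foreground mask."""
    fg = feat.permute(1, 2, 0)[fg_mask.bool()]                # (N_fg, C) foreground feature vectors
    centers = fg[torch.randperm(fg.shape[0])[:n_regions]].clone()
    for _ in range(iters):                                    # vanilla k-means over foreground pixels
        assign = torch.cdist(fg, centers).argmin(dim=1)
        for k in range(n_regions):
            if (assign == k).any():
                centers[k] = fg[assign == k].mean(dim=0)
    return centers                                            # (n_regions, C) region-level prototypes

protos = region_prototypes(torch.randn(64, 32, 32), (torch.rand(32, 32) > 0.5).float())
print(protos.shape)                                           # torch.Size([4, 64])
```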

AutoSynth: Learning to Generate 3D Training Data for Object Point Cloud Registration

  • paper_url: http://arxiv.org/abs/2309.11170
  • repo_url: None
  • paper_authors: Zheng Dang, Mathieu Salzmann
  • for: This work aims to automatically generate 3D training data for object point cloud registration, improving both the quality and quantity of training data.
  • methods: Synthetic 3D datasets are assembled from shape primitives, and a meta-learning strategy searches a space of millions of candidate datasets at low cost; a much smaller surrogate network replaces the registration network during the search to keep it tractable.
  • results: Networks trained on the searched data outperform the same networks trained on ModelNet40 on TUD-L, LINEMOD, and Occluded-LINEMOD, for both the BPNet and IDAM registration networks.
    Abstract In the current deep learning paradigm, the amount and quality of training data are as critical as the network architecture and its training details. However, collecting, processing, and annotating real data at scale is difficult, expensive, and time-consuming, particularly for tasks such as 3D object registration. While synthetic datasets can be created, they require expertise to design and include a limited number of categories. In this paper, we introduce a new approach called AutoSynth, which automatically generates 3D training data for point cloud registration. Specifically, AutoSynth automatically curates an optimal dataset by exploring a search space encompassing millions of potential datasets with diverse 3D shapes at a low cost.To achieve this, we generate synthetic 3D datasets by assembling shape primitives, and develop a meta-learning strategy to search for the best training data for 3D registration on real point clouds. For this search to remain tractable, we replace the point cloud registration network with a much smaller surrogate network, leading to a $4056.43$ times speedup. We demonstrate the generality of our approach by implementing it with two different point cloud registration networks, BPNet and IDAM. Our results on TUD-L, LINEMOD and Occluded-LINEMOD evidence that a neural network trained on our searched dataset yields consistently better performance than the same one trained on the widely used ModelNet40 dataset.
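
The surrogate-driven search loop can be outlined as below: candidate dataset recipes are sampled, each is scored by training a small, cheap surrogate and measuring validation error on real point clouds, and the best recipe is kept. The two placeholder functions and the random-search strategy are illustrative stand-ins for the paper's primitive assembly and meta-learning search; they would need real implementations to be useful.

```python
# Outline of a surrogate-scored search over synthetic-dataset recipes.
import random

def generate_synthetic_dataset(recipe):
    """Placeholder: assemble a synthetic dataset of shape primitives from `recipe`."""
    return {"recipe": recipe, "num_samples": 1000}

def surrogate_validation_error(dataset, real_val_set):
    """Placeholder: train a small surrogate registration network on `dataset`
    and return its registration error on real validation point clouds."""
    return random.random()

def search_best_recipe(real_val_set, n_trials=20, seed=0):
    random.seed(seed)
    best_err, best_recipe = float("inf"), None
    for _ in range(n_trials):
        recipe = {                                        # one point in the dataset search space
            "n_primitives": random.randint(2, 12),
            "primitive_types": random.sample(["box", "cylinder", "sphere", "cone"], k=2),
            "noise_std": random.uniform(0.0, 0.02),
        }
        err = surrogate_validation_error(generate_synthetic_dataset(recipe), real_val_set)
        if err < best_err:
            best_err, best_recipe = err, recipe
    return best_recipe, best_err

print(search_best_recipe(real_val_set=None))
```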

Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

  • paper_url: http://arxiv.org/abs/2309.11160
  • repo_url: https://github.com/nankepan/VIPMT
  • paper_authors: Nian Liu, Kepan Nan, Wangbo Zhao, Yuanwei Liu, Xiwen Yao, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Junwei Han, Fahad Shahbaz Khan
  • for: The paper addresses few-shot video object segmentation: segmenting objects of a target category in query videos given only a few annotated support images.
  • methods: Building on IPMT, a state-of-the-art few-shot image segmentation method, it introduces multi-grained temporal guidance: query video information is decomposed into a clip prototype and a memory prototype for local and long-term temporal guidance, frame prototypes handle fine-grained adaptive guidance with bidirectional clip-frame communication, reliable memory frames are selected via structural similarity with the support, and a new segmentation loss improves the category discriminability of the learned prototypes.
  • results: Experiments show that the proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
    Abstract Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. However, this task was seldom explored. In this work, based on IPMT, a state-of-the-art few-shot image segmentation method that combines external support guidance information with adaptive query guidance cues, we propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data. We decompose the query video information into a clip prototype and a memory prototype for capturing local and long-term internal temporal guidance, respectively. Frame prototypes are further used for each frame independently to handle fine-grained adaptive guidance and enable bidirectional clip-frame prototype communication. To reduce the influence of noisy memory, we propose to leverage the structural similarity relation among different predicted regions and the support for selecting reliable memory frames. Furthermore, a new segmentation loss is also proposed to enhance the category discriminability of the learned prototypes. Experimental results demonstrate that our proposed video IPMT model significantly outperforms previous models on two benchmark datasets. Code is available at https://github.com/nankepan/VIPMT.

Learning Deformable 3D Graph Similarity to Track Plant Cells in Unregistered Time Lapse Images

  • paper_url: http://arxiv.org/abs/2309.11157
  • repo_url: None
  • paper_authors: Md Shazid Islam, Arindam Dutta, Calvin-Khang Ta, Kevin Rodriguez, Christian Michael, Mark Alber, G. Venugopala Reddy, Amit K. Roy-Chowdhury
  • for: The paper proposes a learning-based method for accurately tracking plant cells in microscope image sequences.
  • methods: It exploits the tightly packed three-dimensional structure of plant cells to build a 3D graph for cell tracking, and introduces new algorithms for cell division detection and efficient 3D registration.
  • results: The method demonstrates strong tracking accuracy and inference time on a benchmark dataset.
    Abstract Tracking of plant cells in images obtained by microscope is a challenging problem due to biological phenomena such as large number of cells, non-uniform growth of different layers of the tightly packed plant cells and cell division. Moreover, images in deeper layers of the tissue being noisy and unavoidable systemic errors inherent in the imaging process further complicates the problem. In this paper, we propose a novel learning-based method that exploits the tightly packed three-dimensional cell structure of plant cells to create a three-dimensional graph in order to perform accurate cell tracking. We further propose novel algorithms for cell division detection and effective three-dimensional registration, which improve upon the state-of-the-art algorithms. We demonstrate the efficacy of our algorithm in terms of tracking accuracy and inference-time on a benchmark dataset.

CNN-based local features for navigation near an asteroid

  • paper_url: http://arxiv.org/abs/2309.11156
  • repo_url: None
  • paper_authors: Olli Knuuttila, Antti Kestilä, Esa Kallio
  • for: asteroid exploration missions and on-orbit servicing
  • methods: lightweight feature extractor specifically tailored for asteroid proximity navigation, designed to be robust to illumination changes and affine transformations
  • results: effective navigation and localization, with incremental improvements over existing methods and a trained feature extractor
    Abstract This article addresses the challenge of vision-based proximity navigation in asteroid exploration missions and on-orbit servicing. Traditional feature extraction methods struggle with the significant appearance variations of asteroids due to limited scattered light. To overcome this, we propose a lightweight feature extractor specifically tailored for asteroid proximity navigation, designed to be robust to illumination changes and affine transformations. We compare and evaluate state-of-the-art feature extraction networks and three lightweight network architectures in the asteroid context. Our proposed feature extractors and their evaluation leverages both synthetic images and real-world data from missions such as NEAR Shoemaker, Hayabusa, Rosetta, and OSIRIS-REx. Our contributions include a trained feature extractor, incremental improvements over existing methods, and a pipeline for training domain-specific feature extractors. Experimental results demonstrate the effectiveness of our approach in achieving accurate navigation and localization. This work aims to advance the field of asteroid navigation and provides insights for future research in this domain.

Online Calibration of a Single-Track Ground Vehicle Dynamics Model by Tight Fusion with Visual-Inertial Odometry

  • paper_url: http://arxiv.org/abs/2309.11148
  • repo_url: None
  • paper_authors: Haolong Li, Joerg Stueckler
  • for: The paper targets motion estimation for wheeled ground vehicles and prediction of the effect of control inputs, as needed for navigation planning.
  • methods: ST-VIO tightly fuses a single-track dynamics model for wheeled ground vehicles with visual-inertial odometry (VIO), calibrating and adapting the model parameters online.
  • results: Experiments on real-world data in indoor and outdoor environments with different terrains and wheels show that the method adapts to environment changes, predicts accurately under new control inputs, and even improves tracking accuracy.
    Abstract Wheeled mobile robots need the ability to estimate their motion and the effect of their control actions for navigation planning. In this paper, we present ST-VIO, a novel approach which tightly fuses a single-track dynamics model for wheeled ground vehicles with visual inertial odometry. Our method calibrates and adapts the dynamics model online and facilitates accurate forward prediction conditioned on future control inputs. The single-track dynamics model approximates wheeled vehicle motion under specific control inputs on flat ground using ordinary differential equations. We use a singularity-free and differentiable variant of the single-track model to enable seamless integration as dynamics factor into VIO and to optimize the model parameters online together with the VIO state variables. We validate our method with real-world data in both indoor and outdoor environments with different terrain types and wheels. In our experiments, we demonstrate that our ST-VIO can not only adapt to the change of the environments and achieve accurate prediction under new control inputs, but even improves the tracking accuracy. Supplementary video: https://youtu.be/BuGY1L1FRa4.
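
For reference, a minimal single-track ("bicycle") kinematic model integrated with explicit Euler is sketched below; it illustrates the kind of ODE-based ground-vehicle model that is fused with VIO here. The state, inputs, and parameters follow a generic textbook formulation, not the paper's singularity-free, differentiable variant.

```python
# Forward prediction with a simple single-track kinematic model.
import numpy as np

def single_track_step(state, v_cmd, steer, wheelbase=0.3, dt=0.02):
    """state = [x, y, yaw]; v_cmd = forward speed command; steer = front-wheel angle (rad)."""
    x, y, yaw = state
    x += v_cmd * np.cos(yaw) * dt
    y += v_cmd * np.sin(yaw) * dt
    yaw += v_cmd / wheelbase * np.tan(steer) * dt
    return np.array([x, y, yaw])

state = np.zeros(3)
for _ in range(100):                                          # 2 s of constant speed and steering
    state = single_track_step(state, v_cmd=0.5, steer=0.1)
print("predicted pose (x, y, yaw):", state)
```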

GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.11145
  • repo_url: https://github.com/xmed-lab/GraphEcho
  • paper_authors: Jiewen Yang, Xinpeng Ding, Ziyang Zheng, Xiaowei Xu, Xiaomeng Li
  • for: The paper studies unsupervised domain adaptation (UDA) for echocardiogram video segmentation, aiming to generalize a model trained on a source domain to other unlabelled target domains.
  • methods: It introduces the newly collected CardiacUDA dataset and the GraphEcho method, whose Spatial-wise Cross-domain Graph Matching (SCGM) and Temporal Cycle Consistency (TCC) modules align global and local features between the source and target domains.
  • results: GraphEcho outperforms existing state-of-the-art UDA segmentation methods; the dataset and code will be publicly released and are available at https://github.com/xmed-lab/GraphEcho.
    Abstract Echocardiogram video segmentation plays an important role in cardiac disease diagnosis. This paper studies the unsupervised domain adaption (UDA) for echocardiogram video segmentation, where the goal is to generalize the model trained on the source domain to other unlabelled target domains. Existing UDA segmentation methods are not suitable for this task because they do not model local information and the cyclical consistency of heartbeat. In this paper, we introduce a newly collected CardiacUDA dataset and a novel GraphEcho method for cardiac structure segmentation. Our GraphEcho comprises two innovative modules, the Spatial-wise Cross-domain Graph Matching (SCGM) and the Temporal Cycle Consistency (TCC) module, which utilize prior knowledge of echocardiogram videos, i.e., consistent cardiac structure across patients and centers and the heartbeat cyclical consistency, respectively. These two modules can better align global and local features from source and target domains, improving UDA segmentation results. Experimental results showed that our GraphEcho outperforms existing state-of-the-art UDA segmentation methods. Our collected dataset and code will be publicly released upon acceptance. This work will lay a new and solid cornerstone for cardiac structure segmentation from echocardiogram videos. Code and dataset are available at: https://github.com/xmed-lab/GraphEcho

GL-Fusion: Global-Local Fusion Network for Multi-view Echocardiogram Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.11144
  • repo_url: https://github.com/xmed-lab/GL-Fusion
  • paper_authors: Ziyang Zheng, Jiewen Yang, Xinpeng Ding, Xiaowei Xu, Xiaomeng Li
  • for: This work aims to improve the accuracy and robustness of automatic cardiac structure segmentation from echocardiogram videos.
  • methods: A Global-Local fusion (GL-Fusion) network jointly exploits multi-view information both globally and locally, via a Multi-view Global-based Fusion Module (MGFM) and a Multi-view Local-based Fusion Module (MLFM).
  • results: On the collected MvEVD multi-view echocardiogram video dataset, GL-Fusion achieves an 82.29% average Dice score, a 7.83% improvement over the baseline, and outperforms existing state-of-the-art methods.
    Abstract Cardiac structure segmentation from echocardiogram videos plays a crucial role in diagnosing heart disease. The combination of multi-view echocardiogram data is essential to enhance the accuracy and robustness of automated methods. However, due to the visual disparity of the data, deriving cross-view context information remains a challenging task, and unsophisticated fusion strategies can even lower performance. In this study, we propose a novel Gobal-Local fusion (GL-Fusion) network to jointly utilize multi-view information globally and locally that improve the accuracy of echocardiogram analysis. Specifically, a Multi-view Global-based Fusion Module (MGFM) is proposed to extract global context information and to explore the cyclic relationship of different heartbeat cycles in an echocardiogram video. Additionally, a Multi-view Local-based Fusion Module (MLFM) is designed to extract correlations of cardiac structures from different views. Furthermore, we collect a multi-view echocardiogram video dataset (MvEVD) to evaluate our method. Our method achieves an 82.29% average dice score, which demonstrates a 7.83% improvement over the baseline method, and outperforms other existing state-of-the-art methods. To our knowledge, this is the first exploration of a multi-view method for echocardiogram video segmentation. Code available at: https://github.com/xmed-lab/GL-Fusion

More complex encoder is not all you need

  • paper_url: http://arxiv.org/abs/2309.11139
  • repo_url: https://github.com/aitechlabcn/neUNet
  • paper_authors: Weibin Yang, Longwei Xu, Pengwei Wang, Dehua Geng, Yusong Li, Mingyuan Xu, Zhiqi Dong
  • for: The paper aims to improve the accuracy and efficiency of medical image segmentation.
  • methods: Instead of building an ever more complex encoder as in many U-Net variants, neU-Net strengthens the decoder: a novel sub-pixel convolution is used for upsampling, and a multi-scale wavelet input module on the encoder side provides additional information.
  • results: The design achieves excellent results, surpassing other state-of-the-art methods on both the Synapse and ACDC datasets.
    Abstract U-Net and its variants have been widely used in medical image segmentation. However, most current U-Net variants confine their improvement strategies to building more complex encoder, while leaving the decoder unchanged or adopting a simple symmetric structure. These approaches overlook the true functionality of the decoder: receiving low-resolution feature maps from the encoder and restoring feature map resolution and lost information through upsampling. As a result, the decoder, especially its upsampling component, plays a crucial role in enhancing segmentation outcomes. However, in 3D medical image segmentation, the commonly used transposed convolution can result in visual artifacts. This issue stems from the absence of direct relationship between adjacent pixels in the output feature map. Furthermore, plain encoder has already possessed sufficient feature extraction capability because downsampling operation leads to the gradual expansion of the receptive field, but the loss of information during downsampling process is unignorable. To address the gap in relevant research, we extend our focus beyond the encoder and introduce neU-Net (i.e., not complex encoder U-Net), which incorporates a novel Sub-pixel Convolution for upsampling to construct a powerful decoder. Additionally, we introduce multi-scale wavelet inputs module on the encoder side to provide additional information. Our model design achieves excellent results, surpassing other state-of-the-art methods on both the Synapse and ACDC datasets.
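
A brief sketch of a sub-pixel convolution upsampling block (a convolution followed by PixelShuffle), the kind of decoder upsampling favored here over transposed convolution to avoid artifacts from unrelated neighboring outputs. Channel sizes are placeholders, and the paper's 3D formulation and exact block design may differ.

```python
# Sub-pixel convolution upsampling: conv then PixelShuffle.
import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        # Predict scale**2 times the output channels, then rearrange them spatially.
        self.conv = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 16, 16)
print(SubPixelUp(64, 32)(x).shape)                            # torch.Size([1, 32, 32, 32])
```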

Shape Anchor Guided Holistic Indoor Scene Understanding

  • paper_url: http://arxiv.org/abs/2309.11133
  • repo_url: https://github.com/Geo-Tell/AncRec
  • paper_authors: Mingyue Dong, Linxi Huan, Hanjiang Xiong, Shuhan Shen, Xianwei Zheng
  • for: The paper proposes a shape anchor guided learning strategy (AncLearn) for robust and accurate holistic indoor scene understanding.
  • methods: AncLearn generates anchors that dynamically fit instance surfaces, which (i) unmix noise and target-related features to provide reliable proposals at the detection stage, and (ii) reduce outliers in object point sampling to supply well-structured geometry priors during reconstruction without segmentation; it is embedded into a reconstruction-from-detection system (AncRec).
  • results: Experiments on ScanNetv2 show consistent state-of-the-art performance in 3D object detection, layout estimation, and shape reconstruction.
    Abstract This paper proposes a shape anchor guided learning strategy (AncLearn) for robust holistic indoor scene understanding. We observe that the search space constructed by current methods for proposal feature grouping and instance point sampling often introduces massive noise to instance detection and mesh reconstruction. Accordingly, we develop AncLearn to generate anchors that dynamically fit instance surfaces to (i) unmix noise and target-related features for offering reliable proposals at the detection stage, and (ii) reduce outliers in object point sampling for directly providing well-structured geometry priors without segmentation during reconstruction. We embed AncLearn into a reconstruction-from-detection learning system (AncRec) to generate high-quality semantic scene models in a purely instance-oriented manner. Experiments conducted on the challenging ScanNetv2 dataset demonstrate that our shape anchor-based method consistently achieves state-of-the-art performance in terms of 3D object detection, layout estimation, and shape reconstruction. The code will be available at https://github.com/Geo-Tell/AncRec.

Locate and Verify: A Two-Stream Network for Improved Deepfake Detection

  • paper_url: http://arxiv.org/abs/2309.11131
  • repo_url: https://github.com/sccsok/Locate-and-Verify
  • paper_authors: Chao Shuai, Jieming Zhong, Shuang Wu, Feng Lin, Zhibo Wang, Zhongjie Ba, Zhenguang Liu, Lorenzo Cavallaro, Kui Ren
  • for: This work aims to improve the generalizability of deepfake detection and its ability to localize specific forgery regions.
  • methods: Three contributions address shortcomings of existing methods: an innovative two-stream network that enlarges the potential regions from which forgery evidence is extracted, three functional modules that handle multi-stream and multi-scale features in a collaborative learning scheme, and a Semi-supervised Patch Similarity Learning strategy that estimates patch-level forged-location annotations.
  • results: The method shows significantly improved robustness and generalizability on six benchmarks, raising the frame-level AUC on the Deepfake Detection Challenge preview dataset from 0.797 to 0.835 and the video-level AUC on CelebDF_v1 from 0.811 to 0.847.
    Abstract Deepfake has taken the world by storm, triggering a trust crisis. Current deepfake detection methods are typically inadequate in generalizability, with a tendency to overfit to image contents such as the background, which are frequently occurring but relatively unimportant in the training dataset. Furthermore, current methods heavily rely on a few dominant forgery regions and may ignore other equally important regions, leading to inadequate uncovering of forgery cues. In this paper, we strive to address these shortcomings from three aspects: (1) We propose an innovative two-stream network that effectively enlarges the potential regions from which the model extracts forgery evidence. (2) We devise three functional modules to handle the multi-stream and multi-scale features in a collaborative learning scheme. (3) Confronted with the challenge of obtaining forgery annotations, we propose a Semi-supervised Patch Similarity Learning strategy to estimate patch-level forged location annotations. Empirically, our method demonstrates significantly improved robustness and generalizability, outperforming previous methods on six benchmarks, and improving the frame-level AUC on Deepfake Detection Challenge preview dataset from 0.797 to 0.835 and video-level AUC on CelebDF_v1 dataset from 0.811 to 0.847. Our implementation is available at https://github.com/sccsok/Locate-and-Verify.

PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

  • paper_url: http://arxiv.org/abs/2309.11125
  • repo_url: None
  • paper_authors: Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Jingdong Wang, Qinghua Zheng
  • for: The paper proposes a new person search framework that addresses two main issues of existing methods: detection-prior modules that are suboptimal for the ReID task, and the neglected collaboration between the two sub-tasks.
  • methods: PSDiff formulates person search with a diffusion model as a dual denoising process from noisy boxes and ReID embeddings to the ground truths, eliminating detection-prior modules; a Collaborative Denoising Layer (CDL) optimizes the detection and ReID sub-tasks iteratively and collaboratively so that they benefit each other.
  • results: Experiments show that PSDiff achieves state-of-the-art performance on standard benchmarks with fewer parameters and elastic computing overhead.
    Abstract Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, \ie, detection and Re-IDentification (ReID). Despite significant progress, two major challenges remain: 1) Detection-prior modules in previous methods are suboptimal for the ReID task. 2) The collaboration between two sub-tasks is ignored. To alleviate these issues, we present a novel Person Search framework based on the Diffusion model, PSDiff. PSDiff formulates the person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Unlike existing methods that follow the Detection-to-ReID paradigm, our denoising paradigm eliminates detection-prior modules to avoid the local-optimum of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize detection and ReID sub-tasks in an iterative and collaborative way, which makes two sub-tasks mutually beneficial. Extensive experiments on the standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.

Hyperspectral Benchmark: Bridging the Gap between HSI Applications through Comprehensive Dataset and Pretraining

  • paper_url: http://arxiv.org/abs/2309.11122
  • repo_url: https://github.com/cogsys-tuebingen/hsi_benchmark
  • paper_authors: Hannah Frank, Leon Amadeus Varga, Andreas Zell
  • for: This work provides a comprehensive benchmark dataset dedicated to hyperspectral imaging (HSI), enabling a finer assessment of hyperspectral models.
  • methods: The benchmark spans three markedly different HSI applications (food inspection, remote sensing, and recycling) and is accompanied by a pretraining pipeline that stabilizes the training of larger models.
  • results: The benchmark allows a better evaluation of specialized HSI models and of prevailing state-of-the-art techniques, and the pretraining pipeline improves training stability.
    Abstract Hyperspectral Imaging (HSI) serves as a non-destructive spatial spectroscopy technique with a multitude of potential applications. However, a recurring challenge lies in the limited size of the target datasets, impeding exhaustive architecture search. Consequently, when venturing into novel applications, reliance on established methodologies becomes commonplace, in the hope that they exhibit favorable generalization characteristics. Regrettably, this optimism is often unfounded due to the fine-tuned nature of models tailored to specific HSI contexts. To address this predicament, this study introduces an innovative benchmark dataset encompassing three markedly distinct HSI applications: food inspection, remote sensing, and recycling. This comprehensive dataset affords a finer assessment of hyperspectral model capabilities. Moreover, this benchmark facilitates an incisive examination of prevailing state-of-the-art techniques, consequently fostering the evolution of superior methodologies. Furthermore, the enhanced diversity inherent in the benchmark dataset underpins the establishment of a pretraining pipeline for HSI. This pretraining regimen serves to enhance the stability of training processes for larger models. Additionally, a procedural framework is delineated, offering insights into the handling of applications afflicted by limited target dataset sizes.

BroadBEV: Collaborative LiDAR-camera Fusion for Broad-sighted Bird’s Eye View Map Construction

  • paper_url: http://arxiv.org/abs/2309.11119
  • repo_url: None
  • paper_authors: Minsu Kim, Giseop Kim, Kyong Hwan Jin, Sunwook Choi
  • for: This work aims to improve LiDAR-camera fusion in bird's eye view (BEV) space for broad-sighted and accurate perception.
  • methods: BroadBEV uses two components: Point-scattering, which scatters the LiDAR BEV distribution to the camera depth distribution to boost the camera branch's depth estimation and place dense camera features accurately in BEV space, and ColFusion, which applies self-attention weights between LiDAR and camera BEV features for effective fusion of the spatially synchronized features.
  • results: Extensive experiments show that BroadBEV provides broad-sighted BEV perception with remarkable performance gains.
    Abstract A recent sensor fusion in a Bird's Eye View (BEV) space has shown its utility in various tasks such as 3D detection, map segmentation, etc. However, the approach struggles with inaccurate camera BEV estimation, and a perception of distant areas due to the sparsity of LiDAR points. In this paper, we propose a broad BEV fusion (BroadBEV) that addresses the problems with a spatial synchronization approach of cross-modality. Our strategy aims to enhance camera BEV estimation for a broad-sighted perception while simultaneously improving the completion of LiDAR's sparsity in the entire BEV space. Toward that end, we devise Point-scattering that scatters LiDAR BEV distribution to camera depth distribution. The method boosts the learning of depth estimation of the camera branch and induces accurate location of dense camera features in BEV space. For an effective BEV fusion between the spatially synchronized features, we suggest ColFusion that applies self-attention weights of LiDAR and camera BEV features to each other. Our extensive experiments demonstrate that BroadBEV provides a broad-sighted BEV perception with remarkable performance gains.

PRAT: PRofiling Adversarial aTtacks

  • paper_url: http://arxiv.org/abs/2309.11111
  • repo_url: https://github.com/rahulambati/PRAT
  • paper_authors: Rahul Ambati, Naveed Akhtar, Ajmal Mian, Yogesh Singh Rawat
  • for: The goal is to profile adversarial attacks: given an adversarial example, identify which attack was used to generate it.
  • methods: A framework with a Transformer-based Global-LOcal Feature (GLOF) module extracts an approximate signature of the adversarial attack, which is then used to identify the attack and its family.
  • results: A large Adversarial Identification Dataset (AID) with over 180k adversarial samples generated by 13 popular attacks is introduced, and multiple benchmark results for the PRAT problem are reported.
    Abstract Intrinsic susceptibility of deep learning to adversarial examples has led to a plethora of attack techniques with a broad common objective of fooling deep models. However, we find slight compositional differences between the algorithms achieving this objective. These differences leave traces that provide important clues for attacker profiling in real-life scenarios. Inspired by this, we introduce a novel problem of PRofiling Adversarial aTtacks (PRAT). Given an adversarial example, the objective of PRAT is to identify the attack used to generate it. Under this perspective, we can systematically group existing attacks into different families, leading to the sub-problem of attack family identification, which we also study. To enable PRAT analysis, we introduce a large Adversarial Identification Dataset (AID), comprising over 180k adversarial samples generated with 13 popular attacks for image specific/agnostic white/black box setups. We use AID to devise a novel framework for the PRAT objective. Our framework utilizes a Transformer based Global-LOcal Feature (GLOF) module to extract an approximate signature of the adversarial attack, which in turn is used for the identification of the attack. Using AID and our framework, we provide multiple interesting benchmark results for the PRAT problem.

Self-supervised Domain-agnostic Domain Adaptation for Satellite Images

  • paper_url: http://arxiv.org/abs/2309.11109
  • repo_url: None
  • paper_authors: Fahong Zhang, Yilei Shi, Xiao Xiang Zhu
  • for: Addressing the domain shift issue in machine learning for global scale satellite image processing.
  • methods: Proposed an self-supervised domain-agnostic domain adaptation (SS(DA)2) method, which uses a contrastive generative adversarial loss to train a generative network for image-to-image translation, and improves the generalizability of downstream models by augmenting the training data with different testing spectral characteristics.
  • results: Experimental results on public benchmarks verified the effectiveness of SS(DA)2.
    Abstract Domain shift caused by, e.g., different geographical regions or acquisition conditions is a common issue in machine learning for global scale satellite image processing. A promising method to address this problem is domain adaptation, where the training and the testing datasets are split into two or multiple domains according to their distributions, and an adaptation method is applied to improve the generalizability of the model on the testing dataset. However, defining the domain to which each satellite image belongs is not trivial, especially under large-scale multi-temporal and multi-sensory scenarios, where a single image mosaic could be generated from multiple data sources. In this paper, we propose an self-supervised domain-agnostic domain adaptation (SS(DA)2) method to perform domain adaptation without such a domain definition. To achieve this, we first design a contrastive generative adversarial loss to train a generative network to perform image-to-image translation between any two satellite image patches. Then, we improve the generalizability of the downstream models by augmenting the training data with different testing spectral characteristics. The experimental results on public benchmarks verify the effectiveness of SS(DA)2.

Forgery-aware Adaptive Vision Transformer for Face Forgery Detection

  • paper_url: http://arxiv.org/abs/2309.11092
  • repo_url: None
  • paper_authors: Anwei Luo, Rizhao Cai, Chenqi Kong, Xiangui Kang, Jiwu Huang, Alex C. Kot
  • for: Protecting authentication integrity by detecting face forgery attacks.
  • methods: A Forgery-aware Adaptive Vision Transformer (FA-ViT) freezes the parameters of a vanilla ViT to preserve its pre-trained knowledge, and adds two dedicated components, a Local-aware Forgery Injector (LFI) and a Global-aware Forgery Adaptor (GFA), to adapt forgery-related knowledge.
  • results: Experiments show that FA-ViT achieves state-of-the-art performance in cross-dataset and cross-manipulation evaluations and improves robustness against unseen perturbations.
    Abstract With the advancement in face manipulation technologies, the importance of face forgery detection in protecting authentication integrity becomes increasingly evident. Previous Vision Transformer (ViT)-based detectors have demonstrated subpar performance in cross-database evaluations, primarily because fully fine-tuning with limited Deepfake data often leads to forgetting pre-trained knowledge and over-fitting to data-specific ones. To circumvent these issues, we propose a novel Forgery-aware Adaptive Vision Transformer (FA-ViT). In FA-ViT, the vanilla ViT's parameters are frozen to preserve its pre-trained knowledge, while two specially designed components, the Local-aware Forgery Injector (LFI) and the Global-aware Forgery Adaptor (GFA), are employed to adapt forgery-related knowledge. our proposed FA-ViT effectively combines these two different types of knowledge to form the general forgery features for detecting Deepfakes. Specifically, LFI captures local discriminative information and incorporates these information into ViT via Neighborhood-Preserving Cross Attention (NPCA). Simultaneously, GFA learns adaptive knowledge in the self-attention layer, bridging the gap between the two different domain. Furthermore, we design a novel Single Domain Pairwise Learning (SDPL) to facilitate fine-grained information learning in FA-ViT. The extensive experiments demonstrate that our FA-ViT achieves state-of-the-art performance in cross-dataset evaluation and cross-manipulation scenarios, and improves the robustness against unseen perturbations.
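
The general "frozen backbone plus lightweight trainable modules" pattern that FA-ViT builds on can be sketched as follows: the pre-trained ViT weights stay frozen to preserve prior knowledge, while small injected modules and the classification head are the only trainable parts. The bottleneck adapter below is a generic stand-in, not the paper's LFI/GFA modules.

```python
# Frozen transformer block + trainable adapter and head.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))

    def forward(self, x):
        return x + self.net(x)                                # residual keeps the frozen features intact

dim, n_tokens = 384, 197
frozen_block = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)   # stand-in for a pre-trained ViT block
for p in frozen_block.parameters():
    p.requires_grad = False

adapter, head = Adapter(dim), nn.Linear(dim, 2)               # real-vs-fake head
tokens = torch.randn(4, n_tokens, dim)
logits = head(adapter(frozen_block(tokens)).mean(dim=1))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
loss.backward()                                               # gradients reach only the adapter and head
trainable = sum(p.numel() for p in list(adapter.parameters()) + list(head.parameters()))
print("trainable parameters:", trainable)
```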
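
The frozen-backbone-plus-adapter design can be sketched as below. This is a generic parameter-efficient adaptation pattern, assuming a timm ViT backbone and a plain bottleneck adapter; the paper's LFI, GFA, NPCA, and SDPL components are not reproduced here.

```python
# Minimal sketch of "frozen ViT + trainable adapters": only the adapter and the
# classification head are trained, preserving the pre-trained knowledge.
import timm
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Residual bottleneck adapter.
        return x + self.up(self.act(self.down(x)))

class FrozenViTWithAdapters(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        for p in self.vit.parameters():      # freeze pre-trained backbone
            p.requires_grad = False
        dim = self.vit.num_features
        self.adapter = Adapter(dim)          # only adapter + head receive gradients
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        feats = self.vit(x)                  # (B, dim) pooled ViT features
        return self.head(self.adapter(feats))
```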

Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

  • paper_url: http://arxiv.org/abs/2309.11091
  • repo_url: None
  • paper_authors: Chen Jiang, Kaiming Huang, Sifeng He, Xudong Yang, Wei Zhang, Xiaobo Zhang, Yuan Cheng, Lei Yang, Qing Wang, Furong Xu, Tan Pan, Wei Chu
  • for: Improving the accuracy and efficiency of Content-Based Video Retrieval (CBVR), especially in long-video scenarios.
  • methods: Proposes a Segment Similarity and Alignment Network (SSAN) built on two newly introduced modules: (1) an efficient Self-supervised Keyframe Extraction (SKE) module and (2) a robust Similarity Pattern Detection (SPD) module.
  • results: Experiments on public datasets show that SSAN achieves higher alignment accuracy while reducing storage and online query computation compared with existing methods.
    Abstract With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) becomes increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end times of similar segments at a finer granularity, which benefits user browsing efficiency and infringement detection, especially in long video scenarios. The challenge of the S-CBVR task is to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) to address this challenge, which is the first to be trained end-to-end for S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) an efficient Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features, and (2) a robust Similarity Pattern Detection (SPD) module for temporal alignment. Compared with uniform frame extraction, SKE not only saves feature storage and search time, but also achieves comparable accuracy with limited extra computation time. In terms of temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key modules SKE and SPD can also be effectively inserted into other video retrieval pipelines and gain considerable performance improvements. Experimental results on public datasets show that SSAN obtains higher alignment accuracy while saving storage and online query computational cost compared to existing methods.
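
For intuition about what the SPD module has to solve, the following non-learned baseline localizes a matched segment by finding the strongest diagonal run in a frame-level cosine-similarity matrix. SSAN replaces such hand-crafted logic with a learned detector, so this is only an illustrative reference point; the threshold is an assumption.

```python
# Non-learned temporal alignment baseline over a frame-level similarity matrix.
import numpy as np

def best_aligned_segment(q_feats, r_feats, min_sim=0.7):
    """q_feats: (Tq, D), r_feats: (Tr, D), L2-normalized frame features.
    Returns ((query_start, query_end), (ref_start, ref_end))."""
    sim = q_feats @ r_feats.T                      # (Tq, Tr) cosine similarities
    Tq, Tr = sim.shape
    dp = np.zeros_like(sim)                        # accumulated score of the diagonal run ending at (i, j)
    best, end = 0.0, (0, 0)
    for i in range(Tq):
        for j in range(Tr):
            if sim[i, j] >= min_sim:
                dp[i, j] = (dp[i - 1, j - 1] if i and j else 0.0) + sim[i, j]
                if dp[i, j] > best:
                    best, end = dp[i, j], (i, j)
    # Walk back along the diagonal to recover the start of the matched segment.
    i, j = end
    while i > 0 and j > 0 and dp[i - 1, j - 1] > 0:
        i, j = i - 1, j - 1
    return (i, end[0]), (j, end[1])
```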

Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

  • paper_url: http://arxiv.org/abs/2309.11081
  • repo_url: None
  • paper_authors: Heeseung Yun, Joonil Na, Gunhee Kim
  • for: Endowing deep networks with spatial reasoning ability from sound, so that audio can be exploited for dense indoor prediction.
  • methods: Proposes a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence in vision-to-audio knowledge transfer, integrating audio features with visually coherent learnable spatial embeddings to resolve inconsistencies across multiple layers of the student model.
  • results: On a newly curated benchmark, Dense Auditory Prediction of Surroundings (DAPS), the framework tackles dense indoor prediction in both 2D and 3D (audio-based depth estimation, semantic segmentation, and 3D scene reconstruction) and consistently achieves state-of-the-art performance across metrics and backbone architectures.
    Abstract Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures.
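
A minimal sketch of the distillation direction used here: a learnable spatial embedding is added on the audio student and a per-layer feature regression loss is computed against a frozen visual teacher. The matching-based alignment that gives SAM its name is simplified away, and the feature shapes are assumptions.

```python
# Vision-to-audio feature distillation with a learnable spatial embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioStudentLayer(nn.Module):
    def __init__(self, dim, h, w):
        super().__init__()
        # Learnable spatial prior added to audio-derived feature maps.
        self.spatial_emb = nn.Parameter(torch.zeros(1, dim, h, w))

    def forward(self, audio_feat):
        # audio_feat: (B, dim, h, w) features decoded from the audio stream.
        return audio_feat + self.spatial_emb

def distill_loss(student_feats, teacher_feats):
    """Average per-layer regression loss against a frozen visual teacher."""
    losses = [F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats)]
    return sum(losses) / len(losses)
```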

Visual Question Answering in the Medical Domain

  • paper_url: http://arxiv.org/abs/2309.11080
  • repo_url: https://github.com/abachaa/VQA-Med-2019
  • paper_authors: Louisa Canepa, Sonit Singh, Arcot Sowmya
  • for: Building a machine learning model that answers natural language questions about given medical images (Med-VQA).
  • methods: Uses domain-specific pre-training strategies, including a novel contrastive learning pre-training method, to mitigate the small-dataset problem in Med-VQA.
  • results: The proposed model obtains 60% accuracy on the VQA-Med 2019 test set, comparable to other state-of-the-art Med-VQA models.
    Abstract Medical visual question answering (Med-VQA) is a machine learning task that aims to create a system that can answer natural language questions based on given medical images. Although there has been rapid progress on the general VQA task, less progress has been made on Med-VQA due to the lack of large-scale annotated datasets. In this paper, we present domain-specific pre-training strategies, including a novel contrastive learning pretraining method, to mitigate the problem of small datasets for the Med-VQA task. We find that the model benefits from components that use fewer parameters. We also evaluate and discuss the model's visual reasoning using evidence verification techniques. Our proposed model obtained an accuracy of 60% on the VQA-Med 2019 test set, giving comparable results to other state-of-the-art Med-VQA models.
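
The abstract does not spell out the contrastive pre-training objective, so the sketch below uses a standard symmetric InfoNCE loss over matched image-question embeddings as one plausible instantiation; the pairing strategy, encoders, and temperature are assumptions.

```python
# Generic image-text contrastive (InfoNCE) pre-training objective.
import torch
import torch.nn.functional as F

def info_nce(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of matched image/question pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```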

Score Mismatching for Generative Modeling

  • paper_url: http://arxiv.org/abs/2309.11043
  • repo_url: https://github.com/senmaoy/Score-Mismatching
  • paper_authors: Senmao Ye, Fei Liu
  • for: Proposing a new score-based generative model for image synthesis.
  • methods: Replaces iterative sampling with one-step sampling: a standalone generator compresses all time steps using gradients backpropagated from the score network, which is trained to match the real data distribution while mismatching the fake one.
  • results: The model outperforms the Consistency Model and Denoising Score Matching on CIFAR-10, demonstrating the potential of the framework; additional examples are provided on MNIST and LSUN. The code is available on GitHub.
    Abstract We propose a new score-based model with one-step sampling. Previously, score-based models were burdened with heavy computation due to iterative sampling. To replace the iterative process, we train a standalone generator to compress all the time steps with the gradient backpropagated from the score network. In order to produce meaningful gradients for the generator, the score network is trained to simultaneously match the real data distribution and mismatch the fake data distribution. This model has the following advantages: 1) for sampling, it generates a fake image with only one forward step; 2) for training, it only needs 10 diffusion steps; 3) compared with the consistency model, it is free of the ill-posed problem caused by consistency loss. On the popular CIFAR-10 dataset, our model outperforms Consistency Model and Denoising Score Matching, which demonstrates the potential of the framework. We further provide more examples on the MNIST and LSUN datasets. The code is available on GitHub.
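
A structural sketch of the alternating updates implied by the abstract: the score network is fit to real data (matching) and pushed away from generator samples (mismatching), and the one-step generator is updated with gradients flowing through the score network. The concrete loss expressions below are illustrative placeholders (standard denoising score matching plus a negated term), not the paper's objectives, and `latent_dim` is an assumed attribute.

```python
# Alternating score-network / one-step-generator updates (placeholder losses).
import torch

def training_step(generator, score_net, real, opt_g, opt_s, sigma=0.1):
    z = torch.randn(real.size(0), generator.latent_dim, device=real.device)  # latent_dim assumed

    # --- score network update: match real data, mismatch generated data ----
    fake = generator(z).detach()
    noise_r, noise_f = torch.randn_like(real) * sigma, torch.randn_like(fake) * sigma
    # Denoising score matching on real data.
    loss_match = ((score_net(real + noise_r) + noise_r / sigma**2) ** 2).mean()
    # Negated term pushes the score field away from explaining generated data
    # (in practice this term would need clipping or regularization).
    loss_mismatch = -((score_net(fake + noise_f) + noise_f / sigma**2) ** 2).mean()
    opt_s.zero_grad()
    (loss_match + loss_mismatch).backward()
    opt_s.step()

    # --- generator update: one forward step, gradients from the score net --
    fake = generator(z)
    loss_g = (score_net(fake) ** 2).mean()  # placeholder objective defined via the score field
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```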

CaveSeg: Deep Semantic Segmentation and Scene Parsing for Autonomous Underwater Cave Exploration

  • paper_url: http://arxiv.org/abs/2309.11038
  • repo_url: None
  • paper_authors: A. Abdullah, T. Barua, R. Tibbetts, Z. Chen, M. J. Islam, I. Rekleitis
  • for: Developing a visual learning pipeline for semantic segmentation and scene parsing to support AUV navigation, exploration, and mapping inside underwater caves.
  • methods: Prepares a comprehensive pixel-annotated dataset for semantic segmentation of underwater cave scenes and develops a transformer-based vision model that is computationally light and offers near real-time execution.
  • results: Benchmark analyses on cave systems in the USA, Mexico, and Spain demonstrate that robust deep visual models for fast semantic scene parsing can be built on CaveSeg, achieving state-of-the-art performance.
    Abstract In this paper, we present CaveSeg - the first visual learning pipeline for semantic segmentation and scene parsing for AUV navigation inside underwater caves. We address the problem of scarce annotated training data by preparing a comprehensive dataset for semantic segmentation of underwater cave scenes. It contains pixel annotations for important navigation markers (e.g. caveline, arrows), obstacles (e.g. ground plane and overhead layers), scuba divers, and open areas for servoing. Through comprehensive benchmark analyses on cave systems in USA, Mexico, and Spain locations, we demonstrate that robust deep visual models can be developed based on CaveSeg for fast semantic scene parsing of underwater cave environments. In particular, we formulate a novel transformer-based model that is computationally light and offers near real-time execution in addition to achieving state-of-the-art performance. Finally, we explore the design choices and implications of semantic segmentation for visual servoing by AUVs inside underwater caves. The proposed model and benchmark dataset open up promising opportunities for future research in autonomous underwater cave exploration and mapping.

Light Field Diffusion for Single-View Novel View Synthesis

  • paper_url: http://arxiv.org/abs/2309.11525
  • repo_url: None
  • paper_authors: Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Xiaohui Xie
  • for: Single-view novel view synthesis, i.e., generating images from new viewpoints given a single reference image, an important but challenging computer vision task.
  • methods: Proposes Light Field Diffusion (LFD), a conditional diffusion model that transforms camera view information into a light field encoding and combines it with the reference image, introducing local pixel-wise constraints that encourage multi-view consistency.
  • results: LFD efficiently generates high-fidelity images and maintains better 3D consistency in intricate regions, producing higher-quality images than NeRF-based models and reaching sample quality similar to other diffusion-based models with only one-third of the model size.
    Abstract Single-view novel view synthesis, the task of generating images from new viewpoints based on a single reference image, is an important but challenging task in computer vision. Recently, Denoising Diffusion Probabilistic Model (DDPM) has become popular in this area due to its strong ability to generate high-fidelity images. However, current diffusion-based methods directly rely on camera pose matrices as viewing conditions, globally and implicitly introducing 3D constraints. These methods may suffer from inconsistency among generated images from different perspectives, especially in regions with intricate textures and structures. In this work, we present Light Field Diffusion (LFD), a conditional diffusion-based model for single-view novel view synthesis. Unlike previous methods that employ camera pose matrices, LFD transforms the camera view information into light field encoding and combines it with the reference image. This design introduces local pixel-wise constraints within the diffusion models, thereby encouraging better multi-view consistency. Experiments on several datasets show that our LFD can efficiently generate high-fidelity images and maintain better 3D consistency even in intricate regions. Our method can generate images with higher quality than NeRF-based models, and we obtain sample quality similar to other diffusion-based models but with only one-third of the model size.
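
One plausible form of a pixel-aligned light field encoding is a per-pixel camera-ray map (ray origin plus direction in world coordinates) derived from the target camera's intrinsics and extrinsics, which can then be concatenated with the reference image as conditioning. Whether LFD uses exactly this parameterization is an assumption; the sketch only makes the geometry concrete.

```python
# Per-pixel ray encoding for a target view, usable as a (6, H, W) conditioning map.
import torch

def ray_encoding(K, c2w, H, W):
    """K: (3, 3) camera intrinsics; c2w: (4, 4) camera-to-world pose."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs_cam = torch.stack([(i - K[0, 2]) / K[0, 0],
                            (j - K[1, 2]) / K[1, 1],
                            torch.ones_like(i)], dim=-1)              # (H, W, 3) pinhole rays
    dirs_world = dirs_cam @ c2w[:3, :3].T                              # rotate into world frame
    dirs_world = torch.nn.functional.normalize(dirs_world, dim=-1)
    origins = c2w[:3, 3].expand(H, W, 3)                               # shared camera center
    return torch.cat([origins, dirs_world], dim=-1).permute(2, 0, 1)   # (6, H, W)
```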

Conformalized Multimodal Uncertainty Regression and Reasoning

  • paper_url: http://arxiv.org/abs/2309.11018
  • repo_url: None
  • paper_authors: Domenico Parente, Nastaran Darabi, Alex C. Stutts, Theja Tulabandhula, Amit Ranjan Trivedi
  • for: A lightweight uncertainty estimator that predicts multimodal (disjoint) uncertainty bounds by integrating conformal prediction with a deep learning regressor.
  • methods: Combines conformal prediction with a deep learning regressor to produce multimodal (disjoint) uncertainty bounds, and couples these estimates with optical-flow-based reasoning.
  • results: Simulations show that the uncertainty estimates adapt sample-wise to challenging operating conditions such as pronounced noise, limited training data, and limited model size; by accounting for predictive uncertainty and closing the estimation loop with rule-based reasoning, the method consistently surpasses conventional deep learning approaches, reducing prediction error by 2-3x.
    Abstract This paper introduces a lightweight uncertainty estimator capable of predicting multimodal (disjoint) uncertainty bounds by integrating conformal prediction with a deep-learning regressor. We specifically discuss its application to visual odometry (VO), where environmental features such as flying domain symmetries and sensor measurements under ambiguities and occlusion can result in multimodal uncertainties. Our simulation results show that uncertainty estimates in our framework adapt sample-wise against challenging operating conditions such as pronounced noise, limited training data, and limited parametric size of the prediction model. We also develop a reasoning framework that leverages these robust uncertainty estimates and incorporates optical flow-based reasoning to improve prediction accuracy. Thus, by appropriately accounting for the predictive uncertainties of data-driven learning and closing their estimation loop via rule-based reasoning, our methodology consistently surpasses conventional deep learning approaches on all these challenging scenarios (pronounced noise, limited training data, and limited model size), reducing the prediction error by 2-3x.
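
The basic recipe the paper builds on is split conformal regression: calibrate a residual quantile on held-out data, then wrap any point regressor with an interval that has finite-sample coverage. The sketch below covers only this standard unimodal case; the paper's multimodal (disjoint) bounds and the reasoning loop are not reproduced.

```python
# Split conformal regression: symmetric prediction intervals from a calibrated
# residual quantile.
import numpy as np

def conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
    """predict: fitted regressor, returns point predictions for an array of inputs.
    Returns (lower, upper) bounds with ~(1 - alpha) coverage."""
    residuals = np.abs(y_cal - predict(X_cal))                 # nonconformity scores
    n = len(residuals)
    # Finite-sample corrected quantile level: ceil((n + 1)(1 - alpha)) / n.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level, method="higher")
    preds = predict(X_test)
    return preds - q, preds + q
```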

Controllable Dynamic Appearance for Neural 3D Portraits

  • paper_url: http://arxiv.org/abs/2309.11009
  • repo_url: None
  • paper_authors: ShahRukh Athar, Zhixin Shu, Zexiang Xu, Fujun Luan, Sai Bi, Kalyan Sunkavalli, Dimitris Samaras
  • for: Creating fully controllable 3D portraits from real-world capture conditions.
  • methods: Builds on NeRF with a dynamic appearance model that approximates illumination-dependent effects, using surface-normal prediction guided by 3DMM normals as a coarse prior.
  • results: Trained on only a short smartphone-captured video of a subject, the method achieves high-quality free-view synthesis with explicit head-pose and expression controls and realistic lighting effects.
    Abstract Recent advances in Neural Radiance Fields (NeRFs) have made it possible to reconstruct and reanimate dynamic portrait scenes with control over head-pose, facial expressions and viewing direction. However, training such models assumes photometric consistency over the deformed region e.g. the face must be evenly lit as it deforms with changing head-pose and facial expression. Such photometric consistency across frames of a video is hard to maintain, even in studio environments, thus making the created reanimatable neural portraits prone to artifacts during reanimation. In this work, we propose CoDyNeRF, a system that enables the creation of fully controllable 3D portraits in real-world capture conditions. CoDyNeRF learns to approximate illumination dependent effects via a dynamic appearance model in the canonical space that is conditioned on predicted surface normals and the facial expressions and head-pose deformations. The surface normals prediction is guided using 3DMM normals that act as a coarse prior for the normals of the human head, where direct prediction of normals is hard due to rigid and non-rigid deformations induced by head-pose and facial expression changes. Using only a smartphone-captured short video of a subject for training, we demonstrate the effectiveness of our method on free view synthesis of a portrait scene with explicit head pose and expression controls, and realistic lighting effects. The project page can be found here: http://shahrukhathar.github.io/2023/08/22/CoDyNeRF.html

STARNet: Sensor Trustworthiness and Anomaly Recognition via Approximated Likelihood Regret for Robust Edge Autonomy

  • paper_url: http://arxiv.org/abs/2309.11006
  • repo_url: https://github.com/sinatayebati/STARNet
  • paper_authors: Nastaran Darabi, Sina Tayebati, Sureshkumar S., Sathya Ravi, Theja Tulabandhula, Amit R. Trivedi
  • for: Addressing the reliability concerns of complex sensors such as LiDAR and cameras in autonomous robotics, and improving the prediction accuracy of deep learning models by detecting untrustworthy sensor streams.
  • methods: STARNet, a Sensor Trustworthiness and Anomaly Recognition Network, is used to detect untrustworthy sensor streams. STARNet employs the concept of approximated likelihood regret, a gradient-free framework tailored for low-complexity hardware.
  • results: STARNet enhances prediction accuracy by approximately 10% by filtering out untrustworthy sensor streams in unimodal and multimodal settings, especially in addressing internal sensor failures such as cross-sensor interference and crosstalk.
    Abstract Complex sensors such as LiDAR, RADAR, and event cameras have proliferated in autonomous robotics to enhance perception and understanding of the environment. Meanwhile, these sensors are also vulnerable to diverse failure mechanisms that can intricately interact with their operation environment. In parallel, the limited availability of training data on complex sensors also affects the reliability of their deep learning-based prediction flow, where their prediction models can fail to generalize to environments not adequately captured in the training set. To address these reliability concerns, this paper introduces STARNet, a Sensor Trustworthiness and Anomaly Recognition Network designed to detect untrustworthy sensor streams that may arise from sensor malfunctions and/or challenging environments. We specifically benchmark STARNet on LiDAR and camera data. STARNet employs the concept of approximated likelihood regret, a gradient-free framework tailored for low-complexity hardware, especially those with only fixed-point precision capabilities. Through extensive simulations, we demonstrate the efficacy of STARNet in detecting untrustworthy sensor streams in unimodal and multimodal settings. In particular, the network shows superior performance in addressing internal sensor failures, such as cross-sensor interference and crosstalk. In diverse test scenarios involving adverse weather and sensor malfunctions, we show that STARNet enhances prediction accuracy by approximately 10% by filtering out untrustworthy sensor streams. STARNet is publicly available at \url{https://github.com/sinatayebati/STARNet}.

PPD: A New Valet Parking Pedestrian Fisheye Dataset for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.11002
  • repo_url: None
  • paper_authors: Zizhang Wu, Xinyuan Chen, Fan Song, Yuanzhu Gan, Tianhao Xu, Jian Pu, Rui Tang
  • for: Providing a large-scale fisheye dataset to support research on real-world pedestrians, particularly under occlusion and diverse postures in valet parking scenarios.
  • methods: Captures several distinctive types of pedestrians with fisheye cameras, presents a pedestrian detection baseline on the dataset, and introduces two data augmentation techniques to improve the baseline.
  • results: Extensive experiments validate the effectiveness of the novel data augmentation approaches over the baseline and the dataset's exceptional generalizability.
    Abstract Pedestrian detection under valet parking scenarios is fundamental for autonomous driving. However, the presence of pedestrians can be manifested in a variety of ways and postures under imperfect ambient conditions, which can adversely affect detection performance. Furthermore, models trained on public datasets that include pedestrians generally provide suboptimal outcomes for these valet parking scenarios. In this paper, we present the Parking Pedestrian Dataset (PPD), a large-scale fisheye dataset to support research dealing with real-world pedestrians, especially with occlusions and diverse postures. PPD consists of several distinctive types of pedestrians captured with fisheye cameras. Additionally, we present a pedestrian detection baseline on the PPD dataset, and introduce two data augmentation techniques to improve the baseline by enhancing the diversity of the original dataset. Extensive experiments validate the effectiveness of our novel data augmentation approaches over baselines and the dataset's exceptional generalizability.

COSE: A Consistency-Sensitivity Metric for Saliency on Image Classification

  • paper_url: http://arxiv.org/abs/2309.10989
  • repo_url: https://github.com/cvl-umass/COSE
  • paper_authors: Rangel Daroya, Aaron Sun, Subhransu Maji
  • for: Providing a set of metrics that use vision priors to assess the performance of saliency methods on image classification tasks.
  • methods: Evaluates multiple visual saliency mapping methods, including GradCAM, Guided Backpropagation (GBP), and DeepLIFT, using the proposed consistency-sensitivity metrics.
  • results: Although saliency methods are often considered architecture-independent, most explain transformer-based models better than convolutional ones. GradCAM performs best in terms of COSE but lacks variability on fine-grained datasets; balancing consistency and sensitivity is needed for a saliency map to faithfully reflect model behavior.
    Abstract We present a set of metrics that utilize vision priors to effectively assess the performance of saliency methods on image classification tasks. To understand behavior in deep learning models, many methods provide visual saliency maps emphasizing image regions that most contribute to a model prediction. However, there is limited work on analyzing the reliability of saliency methods in explaining model decisions. We propose the metric COnsistency-SEnsitivity (COSE), which quantifies the equivariant and invariant properties of visual model explanations using simple data augmentations. Through our metrics, we show that although saliency methods are thought to be architecture-independent, most methods explain transformer-based models better than convolutional-based models. In addition, GradCAM was found to outperform other methods in terms of COSE but was shown to have limitations such as a lack of variability for fine-grained datasets. The duality between consistency and sensitivity allows the analysis of saliency methods from different angles. Ultimately, we find that it is important to balance these two metrics for a saliency map to faithfully show model behavior.
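
As a concrete example of the kind of test such a metric aggregates, the snippet below measures the equivariance of a saliency method under a horizontal flip. The full COSE metric combines many consistency and sensitivity probes over diverse augmentations; the `saliency_fn` interface is an assumption.

```python
# Single consistency probe: flipped-input saliency vs. flipped saliency.
import torch
import torch.nn.functional as F

def flip_consistency(saliency_fn, image):
    """saliency_fn: callable (1, C, H, W) -> (1, 1, H, W) saliency map.
    Returns a cosine similarity; 1.0 means perfectly equivariant under flipping."""
    sal = saliency_fn(image)
    sal_of_flipped = saliency_fn(torch.flip(image, dims=[-1]))
    flipped_sal = torch.flip(sal, dims=[-1])
    return F.cosine_similarity(sal_of_flipped.flatten(1), flipped_sal.flatten(1)).item()
```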

RMT: Retentive Networks Meet Vision Transformers

  • paper_url: http://arxiv.org/abs/2309.11523
  • repo_url: None
  • paper_authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, Ran He
  • for: Exploring whether transferring the ideas of the Retentive Network (RetNet) to computer vision can improve performance on vision tasks.
  • methods: Proposes RMT, which combines RetNet and the Transformer: explicit decay is introduced into the vision backbone to give the model a spatial-distance prior and explicit control over the range of tokens each token can attend to, and the global modeling is decomposed along the two coordinate axes of the image to reduce computational cost.
  • results: RMT performs strongly across vision tasks, e.g., 84.1% Top-1 accuracy on ImageNet-1k with only 4.5G FLOPs, and clearly outperforms existing vision backbones on downstream tasks such as object detection, instance segmentation, and semantic segmentation.
    Abstract The Transformer first appeared in the field of natural language processing and was later migrated to the computer vision domain, where it demonstrates excellent performance on vision tasks. Recently, however, the Retentive Network (RetNet) has emerged as an architecture with the potential to replace the Transformer, attracting widespread attention in the NLP community. We therefore ask whether transferring RetNet's idea to vision can also bring outstanding performance to vision tasks. To address this, we combine RetNet and the Transformer to propose RMT. Inspired by RetNet, RMT introduces explicit decay into the vision backbone, bringing prior knowledge related to spatial distances to the vision model. This distance-related spatial prior allows explicit control of the range of tokens that each token can attend to. Additionally, to reduce the computational cost of global modeling, we decompose this modeling process along the two coordinate axes of the image. Extensive experiments demonstrate that our RMT exhibits exceptional performance across various computer vision tasks. For example, RMT achieves 84.1% Top-1 accuracy on ImageNet-1k using merely 4.5G FLOPs. To the best of our knowledge, among all models of similar size trained with the same strategy, RMT achieves the highest Top-1 accuracy. Moreover, RMT significantly outperforms existing vision backbones in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Our work is still in progress.
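
The flavor of an explicit spatial decay can be illustrated with a single attention call in which token pairs are down-weighted by gamma raised to their Manhattan distance on the image grid. RMT's actual formulation, and its decomposition along the two axes, differ in detail, so treat this only as a sketch of the distance prior.

```python
# Self-attention with an explicit distance-based decay over 2D token positions.
import torch

def decayed_attention(q, k, v, hw, gamma=0.9):
    """q, k, v: (B, N, D) with N = H*W tokens in row-major order; hw = (H, W)."""
    H, W = hw
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2) grid coordinates
    dist = torch.cdist(pos, pos, p=1)                                 # Manhattan distances
    decay = (gamma ** dist).to(q.device)                              # (N, N) spatial decay mask
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    attn = attn * decay                                               # down-weight distant token pairs
    attn = attn / attn.sum(dim=-1, keepdim=True)                      # renormalize rows
    return attn @ v
```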

SEMPART: Self-supervised Multi-resolution Partitioning of Image Semantics

  • paper_url: http://arxiv.org/abs/2309.10972
  • repo_url: None
  • paper_authors: Sriram Ravindran, Debraj Basu
  • for: Accurately determining salient image regions when labeled data is scarce.
  • methods: Builds on DINO-based self-supervised features, using the image's semantic graph to locate foreground objects; SEMPART jointly infers coarse and fine bi-partitions over this graph and preserves fine boundary details with graph-driven regularization.
  • results: SEMPART produces high-quality masks rapidly without additional post-processing and benefits from co-optimizing the coarse and fine branches, as shown by salient object detection and single object localization results.
    Abstract Accurately determining salient regions of an image is challenging when labeled data is scarce. DINO-based self-supervised approaches have recently leveraged meaningful image semantics captured by patch-wise features for locating foreground objects. Recent methods have also incorporated intuitive priors and demonstrated value in unsupervised methods for object partitioning. In this paper, we propose SEMPART, which jointly infers coarse and fine bi-partitions over an image's DINO-based semantic graph. Furthermore, SEMPART preserves fine boundary details using graph-driven regularization and successfully distills the coarse mask semantics into the fine mask. Our salient object detection and single object localization findings suggest that SEMPART produces high-quality masks rapidly without additional post-processing and benefits from co-optimizing the coarse and fine branches.
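
For reference, a non-learned way to bi-partition a DINO semantic graph is a spectral cut: build a cosine-affinity graph over patch features and split on the sign of the Fiedler vector. SEMPART instead learns coarse and fine partitions jointly with graph-driven regularization; the sketch below only makes the graph construction concrete, and the sparsification threshold is an assumption.

```python
# Spectral bi-partition of a patch-affinity graph built from DINO features.
import numpy as np

def spectral_bipartition(patch_feats, tau=0.2):
    """patch_feats: (N, D) L2-normalized DINO patch features. Returns (N,) binary mask."""
    A = patch_feats @ patch_feats.T                    # cosine affinities between patches
    A = np.where(A > tau, A, 1e-5)                     # sparsify weak edges
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(d)) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]                            # second-smallest eigenvector
    return (fiedler > 0).astype(np.uint8)              # coarse foreground/background split
```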