cs.CV - 2023-08-17

Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression

  • paper_url: http://arxiv.org/abs/2308.09065
  • repo_url: None
  • paper_authors: Xuanlong Yu, Gianni Franchi, Jindong Gu, Emanuel Aldea
  • for: This work aims to improve the reliability of deep neural networks (DNNs) in real-world applications by proposing an Auxiliary Uncertainty Estimator (AuxUE) that estimates the uncertainty of the main-task predictions.
  • methods: A generalized AuxUE scheme for more robust uncertainty quantification. Different distribution assumptions are considered for heteroscedastic noise, and the Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, a novel solution, the Discretization-Induced Dirichlet pOsterior (DIDO), models a Dirichlet posterior on the discretized prediction error.
  • results: Extensive experiments on age estimation, monocular depth estimation, and super-resolution show that the proposal provides robust uncertainty estimates under noisy inputs and scales to both image-level and pixel-wise tasks.
    Abstract Uncertainty quantification is critical for deploying deep neural networks (DNNs) in real-world applications. An Auxiliary Uncertainty Estimator (AuxUE) is one of the most effective means to estimate the uncertainty of the main task prediction without modifying the main task model. To be considered robust, an AuxUE must be capable of maintaining its performance and triggering higher uncertainties while encountering Out-of-Distribution (OOD) inputs, i.e., to provide robust aleatoric and epistemic uncertainty. However, for vision regression tasks, current AuxUE designs are mainly adopted for aleatoric uncertainty estimates, and AuxUE robustness has not been explored. In this work, we propose a generalized AuxUE scheme for more robust uncertainty quantification on regression tasks. Concretely, to achieve a more robust aleatoric uncertainty estimation, different distribution assumptions are considered for heteroscedastic noise, and Laplace distribution is finally chosen to approximate the prediction error. For epistemic uncertainty, we propose a novel solution named Discretization-Induced Dirichlet pOsterior (DIDO), which models the Dirichlet posterior on the discretized prediction error. Extensive experiments on age estimation, monocular depth estimation, and super-resolution tasks show that our proposed method can provide robust uncertainty estimates in the face of noisy inputs and that it can be scalable to both image-level and pixel-wise tasks.
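
As a minimal illustration of the aleatoric part described above, the sketch below fits an auxiliary head with a Laplace negative log-likelihood on main-task residuals. The function names and tensor shapes are hypothetical, not the paper's implementation, and the DIDO branch for epistemic uncertainty is not shown.

```python
import torch

def laplace_nll(residual: torch.Tensor, log_b: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of main-task residuals under a zero-mean Laplace
    noise model; the auxiliary estimator predicts log_b (log scale) per sample/pixel.
    The constant log(2) is dropped since it does not affect optimization."""
    b = log_b.exp()
    return (residual.abs() / b + log_b).mean()

# toy usage
residual = torch.randn(8, 1, 32, 32)                      # stand-in for |y - y_hat|
log_b = torch.zeros_like(residual, requires_grad=True)    # AuxUE output
loss = laplace_nll(residual, log_b)
loss.backward()
```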

SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning

  • paper_url: http://arxiv.org/abs/2308.09040
  • repo_url: https://github.com/fh2019ustc/SimFIR
  • paper_authors: Hao Feng, Wendi Wang, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li
  • for: fisheye image rectification
  • methods: self-supervised representation learning with a Vision Transformer (ViT) and an innovative unified distortion-aware pretext task
  • results: remarkable boost in transfer performance on the downstream rectification task, with superiority over state-of-the-art algorithms and strong generalization ability on real-world fisheye images.
    Abstract In fisheye images, rich distinct distortion patterns are regularly distributed in the image plane. These distortion patterns are independent of the visual content and provide informative cues for rectification. To make the best of such rectification cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, we first split a fisheye image into multiple patches and extract their representations with a Vision Transformer (ViT). To learn fine-grained distortion representations, we then associate different image patches with their specific distortion patterns based on the fisheye model, and further subtly design an innovative unified distortion-aware pretext task for their learning. The transfer performance on the downstream rectification task is remarkably boosted, which verifies the effectiveness of the learned representations. Extensive experiments are conducted, and the quantitative and qualitative results demonstrate the superiority of our method over the state-of-the-art algorithms as well as its strong generalization ability on real-world fisheye images.
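
The patch-to-distortion association can be pictured with the toy sketch below, which splits an image into patches and labels each one by a radial-distance bin as a simplified stand-in for the fisheye-model-based assignment in SimFIR; the function name, patch size, and binning rule are illustrative assumptions.

```python
import torch

def split_patches_with_radial_labels(img: torch.Tensor, patch: int = 32, n_bins: int = 8):
    """Split a (C, H, W) image into non-overlapping patches and assign each patch a
    coarse distortion label from the radial distance of its center to the image center."""
    c, h, w = img.shape
    patches, labels = [], []
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = (cy ** 2 + cx ** 2) ** 0.5
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(img[:, y:y + patch, x:x + patch])
            py, px = y + patch / 2.0, x + patch / 2.0
            r = ((py - cy) ** 2 + (px - cx) ** 2) ** 0.5 / max_r
            labels.append(min(int(r * n_bins), n_bins - 1))
    return torch.stack(patches), torch.tensor(labels)

patches, labels = split_patches_with_radial_labels(torch.rand(3, 256, 256))
print(patches.shape, labels.shape)  # (64, 3, 32, 32), (64,)
```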

MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins

  • paper_url: http://arxiv.org/abs/2308.09037
  • repo_url: https://github.com/tsosea2/marginmatch
  • paper_authors: Tiberiu Sosea, Cornelia Caragea
  • for: Improving semi-supervised learning performance, especially in low-data regimes.
  • methods: Combines consistency regularization and pseudo-labeling, using unlabeled-data training dynamics to measure pseudo-label quality.
  • results: Substantial improvements on four vision benchmarks and two large-scale datasets, underscoring the importance of high-quality pseudo-labels. Notably, the error rate improves over the state of the art by 3.25% on CIFAR-100 with only 25 labels per class and by 3.78% on STL-10 with as few as 4 labels per class.
    Abstract We introduce MarginMatch, a new SSL approach combining consistency regularization and pseudo-labeling, with its main novelty arising from the use of unlabeled data training dynamics to measure pseudo-label quality. Instead of using only the model's confidence on an unlabeled example at an arbitrary iteration to decide if the example should be masked or not, MarginMatch also analyzes the behavior of the model on the pseudo-labeled examples as the training progresses, to ensure low quality predictions are masked out. MarginMatch brings substantial improvements on four vision benchmarks in low data regimes and on two large-scale datasets, emphasizing the importance of enforcing high-quality pseudo-labels. Notably, we obtain an improvement in error rate over the state-of-the-art of 3.25% on CIFAR-100 with only 25 labels per class and of 3.78% on STL-10 using as few as 4 labels per class. We make our code available at https://github.com/tsosea2/MarginMatch.
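
A rough sketch of the pseudo-margin idea follows: the margin of the assigned pseudo-label is tracked across iterations and low-margin examples are masked out. The EMA update and threshold are illustrative placeholders, not the paper's exact procedure.

```python
import torch

def pseudo_margin(probs: torch.Tensor, pseudo_label: torch.Tensor) -> torch.Tensor:
    """Margin of the assigned pseudo-label: its probability minus the largest
    probability among the remaining classes (one value per unlabeled example)."""
    assigned = probs.gather(1, pseudo_label[:, None]).squeeze(1)
    others = probs.scatter(1, pseudo_label[:, None], float("-inf"))
    return assigned - others.max(dim=1).values

# toy usage: average the margin over iterations and mask low-quality pseudo-labels
probs = torch.softmax(torch.randn(16, 10), dim=1)
plabel = probs.argmax(dim=1)
running_margin = torch.zeros(16)                                  # accumulated across iterations
running_margin = 0.9 * running_margin + 0.1 * pseudo_margin(probs, plabel)
keep = running_margin > 0.1                                       # threshold is illustrative
```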

Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks

  • paper_url: http://arxiv.org/abs/2308.09033
  • repo_url: https://github.com/fawazsammani/uni-nlx
  • paper_authors: Fawaz Sammani, Nikos Deligiannis
  • For: The paper proposes a unified framework for Natural Language Explanations (NLE) that consolidates all NLE tasks into a single, compact multi-task model with a unified text-generation training objective.
  • Methods: The proposed Uni-NLX framework trains a single model on the unified text-generation objective so that it can perform seven NLE tasks, including VQA, visual recognition, and visual reasoning, with 7x fewer parameters than previous approaches.
  • Results: The single Uni-NLX model matches the independent task-specific models of previous approaches and even outperforms them on certain tasks, while performing all seven NLE tasks with 7x fewer parameters.
    Abstract Natural Language Explanations (NLE) aim at supplementing the prediction of a model with human-friendly natural text. Existing NLE approaches involve training separate models for each downstream task. In this work, we propose Uni-NLX, a unified framework that consolidates all NLE tasks into a single and compact multi-task model using a unified training objective of text generation. Additionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of 144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of 123K samples for explaining the task of Visual Question Answering (VQA). Both datasets are derived leveraging large language models (LLMs). By training on the 1M combined NLE samples, our single unified framework is capable of simultaneously performing seven NLE tasks including VQA, visual recognition and visual reasoning tasks with 7X fewer parameters, demonstrating comparable performance to the independent task-specific models in previous approaches, and in certain tasks even outperforming them. Code is at https://github.com/fawazsammani/uni-nlx

LesionMix: A Lesion-Level Data Augmentation Method for Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2308.09026
  • repo_url: https://github.com/dogabasaran/lesionmix
  • paper_authors: Berke Doga Basaran, Weitong Zhang, Mengyun Qiao, Bernhard Kainz, Paul M. Matthews, Wenjia Bai
  • for: Improving data augmentation for deep learning-based medical image segmentation.
  • methods: A novel lesion-aware data augmentation method, LesionMix, which augments at the lesion level to increase the diversity of lesion shape, location, intensity, and load distribution, allowing both lesion populating and inpainting.
  • results: Experiments across different modalities and lesion datasets show that LesionMix achieves promising lesion segmentation performance, outperforming several recent Mix-based augmentation methods.
    Abstract Data augmentation has become a de facto component of deep learning-based medical image segmentation methods. Most data augmentation techniques used in medical imaging focus on spatial and intensity transformations to improve the diversity of training images. They are often designed at the image level, augmenting the full image, and do not pay attention to specific abnormalities within the image. Here, we present LesionMix, a novel and simple lesion-aware data augmentation method. It performs augmentation at the lesion level, increasing the diversity of lesion shape, location, intensity and load distribution, and allowing both lesion populating and inpainting. Experiments on different modalities and different lesion datasets, including four brain MR lesion datasets and one liver CT lesion dataset, demonstrate that LesionMix achieves promising performance in lesion image segmentation, outperforming several recent Mix-based data augmentation methods. The code will be released at https://github.com/dogabasaran/lesionmix.
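
To make the lesion-level idea concrete, the sketch below copies the lesion region of one 2D image/mask pair into another at an offset, which corresponds only to the "populating" step of LesionMix; shape, intensity, and load variation as well as inpainting are omitted, and all names are hypothetical.

```python
import numpy as np

def lesion_paste(src_img, src_mask, dst_img, dst_mask, shift=(0, 0)):
    """Copy the lesion pixels of src_img (where src_mask > 0) into dst_img at an
    offset, and update dst_mask accordingly. Arrays are 2D (H, W)."""
    out_img, out_mask = dst_img.copy(), dst_mask.copy()
    ys, xs = np.nonzero(src_mask)
    ys2 = np.clip(ys + shift[0], 0, dst_img.shape[0] - 1)
    xs2 = np.clip(xs + shift[1], 0, dst_img.shape[1] - 1)
    out_img[ys2, xs2] = src_img[ys, xs]
    out_mask[ys2, xs2] = 1
    return out_img, out_mask
```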

SR-GAN for SR-gamma: photon super resolution at collider experiments

  • paper_url: http://arxiv.org/abs/2308.09025
  • repo_url: None
  • paper_authors: Johannes Erdmann, Aaron van der Graaf, Florian Mausolf, Olaf Nackenhorst
  • for: Studying single-image super-resolution algorithms, based on generative adversarial networks, for photons at collider experiments.
  • methods: Energy depositions of simulated electromagnetic showers from photons and neutral-pion decays in a toy electromagnetic calorimeter are treated as 2D images, and super-resolution networks are trained to increase the resolution by a factor of four in each dimension.
  • results: The generated images reproduce shower features that are not obvious at nominal resolution and improve the reconstruction of shower-shape variables and the shower-center position; using them as a pre-processing step for deep-learning photon identification also yields improvements when training statistics are low.
    Abstract We study single-image super-resolution algorithms for photons at collider experiments based on generative adversarial networks. We treat the energy depositions of simulated electromagnetic showers of photons and neutral-pion decays in a toy electromagnetic calorimeter as 2D images and we train super-resolution networks to generate images with an artificially increased resolution by a factor of four in each dimension. The generated images are able to reproduce features of the electromagnetic showers that are not obvious from the images at nominal resolution. Using the artificially-enhanced images for the reconstruction of shower-shape variables and of the position of the shower center results in significant improvements. We additionally investigate the utilization of the generated images as a pre-processing step for deep-learning photon-identification algorithms and observe improvements in the case of low training statistics.

ARAI-MVSNet: A multi-view stereo depth estimation network with adaptive depth range and depth interval

  • paper_url: http://arxiv.org/abs/2308.09022
  • repo_url: None
  • paper_authors: Song Zhang, Wenjia Xu, Zhiwei Wei, Lili Zhang, Yang Wang, Junyi Liu
  • for: Addressing Multi-View Stereo (MVS), a fundamental computer vision problem: reconstructing a scene from multi-view images with known camera parameters.
  • methods: A novel multi-stage coarse-to-fine framework: a coarse depth map is predicted in the first stage; an Adaptive Depth Range Prediction module in the second stage leverages the reference image and the obtained depth map to predict a more accurate all-pixel depth range; and an Adaptive Depth Interval Adjustment module in the third and fourth stages achieves adaptive variable interval partition for more accurate depth estimation.
  • results: Extensive experiments show state-of-the-art performance and competitive generalization: the highest Acc and Overall on DTU, the highest Recall and F1-score on the Tanks and Temples intermediate and advanced sets, the lowest e1 and e3 on BlendedMVS, and the highest Acc and F1-score on ETH 3D, surpassing all listed methods.
    Abstract Multi-View Stereo (MVS) is a fundamental problem in geometric computer vision which aims to reconstruct a scene using multi-view images with known camera parameters. However, the mainstream approaches represent the scene with a fixed all-pixel depth range and equal depth interval partition, which will result in inadequate utilization of depth planes and imprecise depth estimation. In this paper, we present a novel multi-stage coarse-to-fine framework to achieve adaptive all-pixel depth range and depth interval. We predict a coarse depth map in the first stage, then an Adaptive Depth Range Prediction module is proposed in the second stage to zoom in the scene by leveraging the reference image and the obtained depth map in the first stage and predict a more accurate all-pixel depth range for the following stages. In the third and fourth stages, we propose an Adaptive Depth Interval Adjustment module to achieve adaptive variable interval partition for pixel-wise depth range. The depth interval distribution in this module is normalized by Z-score, which can allocate dense depth hypothesis planes around the potential ground truth depth value and vice versa to achieve more accurate depth estimation. Extensive experiments on four widely used benchmark datasets (DTU, TnT, BlendedMVS, ETH 3D) demonstrate that our model achieves state-of-the-art performance and yields competitive generalization ability. Particularly, our method achieves the highest Acc and Overall on the DTU dataset, while attaining the highest Recall and $F_{1}$-score on the Tanks and Temples intermediate and advanced dataset. Moreover, our method also achieves the lowest $e_{1}$ and $e_{3}$ on the BlendedMVS dataset and the highest Acc and $F_{1}$-score on the ETH 3D dataset, surpassing all listed methods. Project website: https://github.com/zs670980918/ARAI-MVSNet
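
One plausible reading of the Z-score-normalized interval allocation is sketched below: per-pixel depth hypotheses are spaced by standard-normal quantiles so that planes cluster around the coarse depth. This is an interpretation for illustration only, not the paper's exact module, and the tensor shapes are assumptions.

```python
import torch

def adaptive_depth_hypotheses(coarse_depth: torch.Tensor, half_range: torch.Tensor, k: int = 32):
    """Place k per-pixel depth hypotheses inside [d - r, d + r] with spacing taken
    from the inverse CDF of a standard normal, so planes are denser near the coarse
    depth. coarse_depth, half_range: (B, H, W) tensors; returns (B, K, H, W)."""
    u = torch.linspace(0.02, 0.98, k, device=coarse_depth.device)
    z = torch.special.erfinv(2 * u - 1)      # standard-normal quantiles
    z = z / z.abs().max()                    # rescale offsets to [-1, 1]
    offsets = z.view(1, k, 1, 1) * half_range.unsqueeze(1)
    return coarse_depth.unsqueeze(1) + offsets

hyps = adaptive_depth_hypotheses(torch.full((2, 64, 80), 5.0), torch.full((2, 64, 80), 1.0))
print(hyps.shape)  # (2, 32, 64, 80)
```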

FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo Embeddings

  • paper_url: http://arxiv.org/abs/2308.09012
  • repo_url: https://github.com/valley-vl/fashionlogo
  • paper_authors: Yulin Su, Min Yang, Minghui Qiu, Jing Wang, Tao Wang
  • For: The paper aims to improve the robustness of logo embedding by leveraging textual knowledge as an auxiliary signal, enhancing logo recognition in real-world scenarios.
  • Methods: The proposed method, FashionLOGO, uses Multimodal Large Language Models (MLLMs) to generate explicit textual knowledge through three types of prompts (image OCR, brief captions, and detailed descriptions) in a zero-shot setting. A cross-attention transformer lets image embedding queries automatically learn supplementary knowledge from the textual embeddings.
  • Results: Extensive experiments on three real-world datasets show that FashionLOGO learns generalized and robust logo embeddings, achieving state-of-the-art performance on all benchmark datasets; comprehensive ablation studies attribute the improvements to the introduction of MLLMs.
    Abstract Logo embedding plays a crucial role in various e-commerce applications by facilitating image retrieval or recognition, such as intellectual property protection and product search. However, current methods treat logo embedding as a purely visual problem, which may limit their performance in real-world scenarios. A notable issue is that the textual knowledge embedded in logo images has not been adequately explored. Therefore, we propose a novel approach that leverages textual knowledge as an auxiliary to improve the robustness of logo embedding. The emerging Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in both visual and textual understanding and could become valuable visual assistants in understanding logo images. Inspired by this observation, our proposed method, FashionLOGO, aims to utilize MLLMs to enhance fashion logo embedding. We explore how MLLMs can improve logo embedding by prompting them to generate explicit textual knowledge through three types of prompts, including image OCR, brief captions, and detailed descriptions prompts, in a zero-shot setting. We adopt a cross-attention transformer to enable image embedding queries to learn supplementary knowledge from textual embeddings automatically. To reduce computational costs, we only use the image embedding model in the inference stage, similar to traditional inference pipelines. Our extensive experiments on three real-world datasets demonstrate that FashionLOGO learns generalized and robust logo embeddings, achieving state-of-the-art performance in all benchmark datasets. Furthermore, we conduct comprehensive ablation studies to demonstrate the performance improvements resulting from the introduction of MLLMs.
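
The cross-attention fusion of image queries with MLLM-generated text embeddings could look roughly like the sketch below; the dimensions, pooling, and residual/normalization choices are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class TextAugmentedLogoEmbed(nn.Module):
    """Image patch embeddings act as queries and text token embeddings as keys/values;
    the attended output is added back and pooled into a single logo embedding."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        fused, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return self.norm(img_tokens + fused).mean(dim=1)   # pooled logo embedding

img = torch.randn(4, 49, 512)    # e.g. 7x7 visual tokens
txt = torch.randn(4, 77, 512)    # OCR / caption / description tokens
emb = TextAugmentedLogoEmbed()(img, txt)   # (4, 512)
```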

DealMVC: Dual Contrastive Calibration for Multi-view Clustering

  • paper_url: http://arxiv.org/abs/2308.09000
  • repo_url: https://github.com/xihongyang1999/dealmvc
  • paper_authors: Xihong Yang, Jiaqi Jin, Siwei Wang, Ke Liang, Yue Liu, Yi Wen, Suyuan Liu, Sihang Zhou, Xinwang Liu, En Zhu
  • For: Improving multi-view clustering performance by addressing similar-but-different samples in cross-view scenarios.
  • Methods: A Dual contrastive calibration network for Multi-View Clustering (DealMVC) with a fusion mechanism for a global cross-view feature, a global contrastive calibration loss aligning the view-feature similarity graph with the high-confidence pseudo-label graph, and a local contrastive calibration loss constraining the consistency of pair-wise view features.
  • Results: Comprehensive experiments on eight benchmark datasets validate the effectiveness and superiority of the algorithm over other state-of-the-art approaches.
    Abstract Benefiting from the strong view-consistent information mining capacity, multi-view contrastive clustering has attracted plenty of attention in recent years. However, we observe the following drawback, which limits the clustering performance from further improvement. The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). Specifically, we first design a fusion mechanism to obtain a global cross-view feature. Then, a global contrastive calibration loss is proposed by aligning the view feature similarity graph and the high-confidence pseudo-label graph. Moreover, to utilize the diversity of multi-view information, we propose a local contrastive calibration loss to constrain the consistency of pair-wise view features. The feature structure is regularized by reliable class information, thus guaranteeing similar samples have similar features in different views. During the training procedure, the interacted cross-view feature is jointly optimized at both local and global levels. In comparison with other state-of-the-art approaches, the comprehensive experimental results obtained from eight benchmark datasets provide substantial validation of the effectiveness and superiority of our algorithm. We release the code of DealMVC at https://github.com/xihongyang1999/DealMVC on GitHub.
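
A hedged sketch of the global calibration idea: pull the cosine-similarity graph of fused cross-view features toward the agreement graph of high-confidence pseudo-labels. The exact loss form and temperature used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def global_calibration_loss(fused_feat: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """fused_feat: (N, D) cross-view features; pseudo_labels: (N,) high-confidence labels.
    The target graph is 1 where two samples share a pseudo-label, else 0."""
    f = F.normalize(fused_feat, dim=1)
    sim = f @ f.t()                                                     # (N, N) similarity graph
    target = (pseudo_labels[:, None] == pseudo_labels[None, :]).float()
    return F.binary_cross_entropy_with_logits(sim / 0.1, target)        # 0.1 = illustrative temperature

loss = global_calibration_loss(torch.randn(32, 128), torch.randint(0, 5, (32,)))
```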

Semantic Information for Object Detection

  • paper_url: http://arxiv.org/abs/2308.08990
  • repo_url: https://github.com/BMW-InnovationLab/BMW-Anonymization-API
  • paper_authors: Jean-Francois Nies
  • for: This paper examines the applicability of the Semantic Consistency concept and the Knowledge-Aware Re-Optimization method to object detection in intricate traffic scenes.
  • methods: It introduces a new method for extracting a knowledge graph from a dataset of images with instance-level annotations, and integrates this new knowledge graph with the existing semantic consistency model.
  • results: Combining the novel hybrid knowledge graph with the existing semantic consistency model yields limited but consistent improvements in precision and/or recall for the Faster-RCNN and DETR object detection models.
    Abstract In this paper, we demonstrate that the concept of Semantic Consistency and the ensuing method of Knowledge-Aware Re-Optimization can be adapted for the problem of object detection in intricate traffic scenes. Furthermore, we introduce a novel method for extracting a knowledge graph from a dataset of images provided with instance-level annotations, and integrate this new knowledge graph with the existing semantic consistency model. Combining both this novel hybrid knowledge graph and the preexisting methods of frequency analysis and external knowledge graph as sources for semantic information, we investigate the effectiveness of knowledge-aware re-optimization on the Faster-RCNN and DETR object detection models. We find that limited but consistent improvements in precision and or recall can be achieved using this method for all combinations of model and method studied.

Eosinophils Instance Object Segmentation on Whole Slide Imaging Using Multi-label Circle Representation

  • paper_url: http://arxiv.org/abs/2308.08974
  • repo_url: https://github.com/yilinliu610730/eoe
  • paper_authors: Yilin Liu, Ruining Deng, Juming Xiong, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Yuankai Huo
  • For: This paper presents an automated method for instance segmentation of eosinophils to support the diagnosis and assessment of eosinophilic esophagitis (EoE).
  • Methods: The method builds on circle representation and extends the single-label CircleSnake model to a multi-label model, enabling segmentation of multiple object types.
  • Results: Experimental results show that the multi-label CircleSnake model outperforms the traditional Mask R-CNN and DeepSnake models in average precision (AP) for identifying and segmenting eosinophils, highlighting the promise of this automated approach for EoE analysis.
    Abstract Eosinophilic esophagitis (EoE) is a chronic and relapsing disease characterized by esophageal inflammation. Symptoms of EoE include difficulty swallowing, food impaction, and chest pain which significantly impact the quality of life, resulting in nutritional impairments, social limitations, and psychological distress. The diagnosis of EoE is typically performed with a threshold (15 to 20) of eosinophils (Eos) per high-power field (HPF). Since the current counting process of Eos is a resource-intensive process for human pathologists, automatic methods are desired. Circle representation has been shown as a more precise, yet less complicated, representation for automatic instance cell segmentation such as CircleSnake approach. However, the CircleSnake was designed as a single-label model, which is not able to deal with multi-label scenarios. In this paper, we propose the multi-label CircleSnake model for instance segmentation on Eos. It extends the original CircleSnake model from a single-label design to a multi-label model, allowing segmentation of multiple object types. Experimental results illustrate the CircleSnake model's superiority over the traditional Mask R-CNN model and DeepSnake model in terms of average precision (AP) in identifying and segmenting eosinophils, thereby enabling enhanced characterization of EoE. This automated approach holds promise for streamlining the assessment process and improving diagnostic accuracy in EoE analysis. The source code has been made publicly available at https://github.com/yilinliu610730/EoE.

Watch Your Steps: Local Image and Scene Editing by Text Instructions

  • paper_url: http://arxiv.org/abs/2308.08947
  • repo_url: None
  • paper_authors: Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A. Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G. Derpanis, Igor Gilitschenski
  • For: The paper addresses text-guided image and NeRF editing, with the goal of localizing the edit region implicit in a text instruction.
  • Methods: It uses InstructPix2Pix (IP2P) and the discrepancy between IP2P predictions with and without the instruction to create a relevance map that guides the modifications. A relevance field trained on relevance maps of training views extends the approach to text-guided editing of 3D scenes represented as neural radiance fields (NeRFs).
  • Results: The proposed method achieves state-of-the-art performance on both image and NeRF editing tasks.
    Abstract Denoising diffusion models have enabled high-quality image generation and editing. We present a method to localize the desired edit region implicit in a text instruction. We leverage InstructPix2Pix (IP2P) and identify the discrepancy between IP2P predictions with and without the instruction. This discrepancy is referred to as the relevance map. The relevance map conveys the importance of changing each pixel to achieve the edits, and is used to to guide the modifications. This guidance ensures that the irrelevant pixels remain unchanged. Relevance maps are further used to enhance the quality of text-guided editing of 3D scenes in the form of neural radiance fields. A field is trained on relevance maps of training views, denoted as the relevance field, defining the 3D region within which modifications should be made. We perform iterative updates on the training views guided by rendered relevance maps from the relevance field. Our method achieves state-of-the-art performance on both image and NeRF editing tasks. Project page: https://ashmrz.github.io/WatchYourSteps/
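
The relevance map can be illustrated as below: the per-pixel discrepancy between the denoiser's noise predictions with and without the instruction, normalized to [0, 1]. The channel reduction and quantile clipping are illustrative choices, not necessarily those of the paper.

```python
import torch

def relevance_map(noise_pred_edit: torch.Tensor, noise_pred_plain: torch.Tensor, q: float = 0.99):
    """noise_pred_*: (B, C, H, W) noise predictions with and without the instruction.
    Returns a (B, 1, H, W) map in [0, 1] marking where the edit should apply."""
    diff = (noise_pred_edit - noise_pred_plain).abs().mean(dim=1, keepdim=True)
    hi = torch.quantile(diff.flatten(), q)          # robust normalization constant
    return (diff / (hi + 1e-8)).clamp(0, 1)

rel = relevance_map(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```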

Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction

  • paper_url: http://arxiv.org/abs/2308.08942
  • repo_url: https://github.com/mediabrain-sjtu/auxformer
  • paper_authors: Chenxin Xu, Robby T. Tan, Yuhong Tan, Siheng Chen, Xinchao Wang, Yanfeng Wang
  • for: Improving 3D skeleton-based human motion prediction by exploring spatial-temporal dependencies in observed human motion.
  • methods: A new model learning framework with auxiliary tasks: coordinates of partial body joints are corrupted by masking or adding noise, and the goal is to recover the corrupted coordinates from the remaining ones. A novel auxiliary-adapted transformer handles the incomplete, corrupted motion data and achieves coordinate recovery by capturing spatial-temporal dependencies.
  • results: The method outperforms state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in 3D MPJPE on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively, and is more robust under missing-data and noisy-data conditions. Code is available at https://github.com/MediaBrain-SJTU/AuxFormer.
    Abstract Exploring spatial-temporal dependencies from observed motions is one of the core challenges of human motion prediction. Previous methods mainly focus on dedicated network structures to model the spatial and temporal dependencies. This paper considers a new direction by introducing a model learning framework with auxiliary tasks. In our auxiliary tasks, partial body joints' coordinates are corrupted by either masking or adding noise and the goal is to recover corrupted coordinates depending on the rest coordinates. To work with auxiliary tasks, we propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data and achieve coordinate recovery via capturing spatial-temporal dependencies. Through auxiliary tasks, the auxiliary-adapted transformer is promoted to capture more comprehensive spatial-temporal dependencies among body joints' coordinates, leading to better feature learning. Extensive experimental results have shown that our method outperforms state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in terms of 3D mean per joint position error (MPJPE) on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively. We also demonstrate that our method is more robust under data missing cases and noisy data cases. Code is available at https://github.com/MediaBrain-SJTU/AuxFormer.
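
The auxiliary-task corruption can be sketched as follows: randomly mask some joints and perturb others with Gaussian noise, then train the model to recover the clean coordinates. The ratios and noise scale are placeholders, not the paper's settings.

```python
import torch

def corrupt_joints(motion: torch.Tensor, mask_ratio: float = 0.2, noise_std: float = 0.05):
    """motion: (B, T, J, 3) joint coordinates. Returns the corrupted motion and a
    (B, J) boolean mask of which joints were zeroed out."""
    b, t, j, _ = motion.shape
    masked = torch.rand(b, 1, j, 1, device=motion.device) < mask_ratio   # joints to zero out
    noisy = torch.rand(b, 1, j, 1, device=motion.device) < mask_ratio    # joints to perturb
    corrupted = torch.where(masked, torch.zeros_like(motion), motion)
    corrupted = corrupted + noisy * noise_std * torch.randn_like(corrupted)
    return corrupted, masked.squeeze(-1).squeeze(1)

corrupted, mask = corrupt_joints(torch.randn(2, 50, 22, 3))
```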

Automatic Signboard Recognition in Low Quality Night Images

  • paper_url: http://arxiv.org/abs/2308.08941
  • repo_url: None
  • paper_authors: Manas Kagde, Priyanka Choudhary, Rishi Joshi, Somnath Dey
  • for: Addressing traffic sign recognition for driver assistance systems and autonomous driving, enabling vehicles to analyze the environment and make appropriate decisions even under adverse conditions.
  • methods: A modified MIRNet model enhances low-quality traffic sign images, and a Yolov4 model then recognizes the traffic signs in an unconstrained environment.
  • results: The method improves mAP@0.5 by 5.40% for low-quality images on Yolov4, achieves an overall mAP@0.5 of 96.75% on the GTSRB dataset, and attains mAP@0.5 of 100% on the broad categories of the GTSDB dataset, comparable with state-of-the-art work.
    Abstract An essential requirement for driver assistance systems and autonomous driving technology is implementing a robust system for detecting and recognizing traffic signs. This system enables the vehicle to autonomously analyze the environment and make appropriate decisions regarding its movement, even when operating at higher frame rates. However, traffic sign images captured in inadequate lighting and adverse weather conditions are poorly visible, blurred, faded, and damaged. Consequently, the recognition of traffic signs in such circumstances becomes inherently difficult. This paper addressed the challenges of recognizing traffic signs from images captured in low light, noise, and blurriness. To achieve this goal, a two-step methodology has been employed. The first step involves enhancing traffic sign images by applying a modified MIRNet model and producing enhanced images. In the second step, the Yolov4 model recognizes the traffic signs in an unconstrained environment. The proposed method has achieved 5.40% increment in mAP@0.5 for low quality images on Yolov4. The overall mAP@0.5 of 96.75% has been achieved on the GTSRB dataset. It has also attained mAP@0.5 of 100% on the GTSDB dataset for the broad categories, comparable with the state-of-the-art work.

SDDNet: Style-guided Dual-layer Disentanglement Network for Shadow Detection

  • paper_url: http://arxiv.org/abs/2308.08935
  • repo_url: None
  • paper_authors: Runmin Cong, Yuchen Guan, Jinpeng Chen, Wei Zhang, Yao Zhao, Sam Kwong
  • for: Improving shadow detection accuracy, particularly on complex backgrounds where existing methods are misled by background color.
  • methods: A Style-guided Dual-layer Disentanglement Network (SDDNet), comprising a Feature Separation and Recombination (FSR) module and a Shadow Style Filter (SSF) module, models the background layer and shadow layer independently, with specialized supervision and a reconstruction constraint that preserves information integrity and avoids redundancy.
  • results: The model achieves superior performance on three public datasets with a real-time inference speed of 32 FPS.
    Abstract Despite significant progress in shadow detection, current methods still struggle with the adverse impact of background color, which may lead to errors when shadows are present on complex backgrounds. Drawing inspiration from the human visual system, we treat the input shadow image as a composition of a background layer and a shadow layer, and design a Style-guided Dual-layer Disentanglement Network (SDDNet) to model these layers independently. To achieve this, we devise a Feature Separation and Recombination (FSR) module that decomposes multi-level features into shadow-related and background-related components by offering specialized supervision for each component, while preserving information integrity and avoiding redundancy through the reconstruction constraint. Moreover, we propose a Shadow Style Filter (SSF) module to guide the feature disentanglement by focusing on style differentiation and uniformization. With these two modules and our overall pipeline, our model effectively minimizes the detrimental effects of background color, yielding superior performance on three public datasets with a real-time inference speed of 32 FPS.

Point-aware Interaction and CNN-induced Refinement Network for RGB-D Salient Object Detection

  • paper_url: http://arxiv.org/abs/2308.08930
  • repo_url: https://github.com/rmcong/picr-net_acmmm23
  • paper_authors: Runmin Cong, Hongyu Liu, Chen Zhang, Wei Zhang, Feng Zheng, Ran Song, Sam Kwong
  • for: Improving salient object detection (SOD) in complex and challenging scenes.
  • methods: Complementary information from the RGB image and depth map is integrated, combining convolutional neural networks (CNNs) with a Transformer architecture to model global long-range dependencies within each modality and across modalities.
  • results: Extensive experiments on five RGB-D SOD datasets show competitive results against reference models in both quantitative and qualitative comparisons.
    Abstract By integrating complementary information from RGB image and depth map, the ability of salient object detection (SOD) for complex and challenging scenes can be improved. In recent years, the important role of Convolutional Neural Networks (CNNs) in feature extraction and cross-modality interaction has been fully explored, but it is still insufficient in modeling global long-range dependencies of self-modality and cross-modality. To this end, we introduce CNNs-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement (PICR-Net). On the one hand, considering the prior correlation between RGB modality and depth modality, an attention-triggered cross-modality point-aware interaction (CmPI) module is designed to explore the feature interaction of different modalities with positional constraints. On the other hand, in order to alleviate the block effect and detail destruction problems brought by the Transformer naturally, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation. Extensive experiments on five RGB-D SOD datasets show that the proposed network achieves competitive results in both quantitative and qualitative comparisons.

Frequency Perception Network for Camouflaged Object Detection

  • paper_url: http://arxiv.org/abs/2308.08924
  • repo_url: https://github.com/rmcong/fpnet_acmmm23
  • paper_authors: Runmin Cong, Mengyao Sun, Sanyi Zhang, Xiaofei Zhou, Wei Zhang, Yao Zhao
  • for: Camouflaged object detection (COD) aims to accurately detect objects hidden in the surrounding environment. Existing COD methods mainly locate camouflaged objects in the RGB domain, and their performance has not been fully exploited in many challenging scenarios.
  • methods: A novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. The network adopts a two-stage model: a frequency-guided coarse localization stage and a detail-preserving fine localization stage. With multi-level backbone features, a flexible frequency perception module based on octave convolution performs coarse positioning; a correction fusion module then progressively integrates high-level features through prior-guided correction and cross-layer feature channel association, and finally combines them with shallow features to refine the camouflaged objects.
  • results: Compared with existing models, the proposed method achieves competitive performance on three popular benchmark datasets, both qualitatively and quantitatively.
    Abstract Camouflaged object detection (COD) aims to accurately detect objects hidden in the surrounding environment. However, the existing COD methods mainly locate camouflaged objects in the RGB domain, their performance has not been fully exploited in many challenging scenarios. Considering that the features of the camouflaged object and the background are more discriminative in the frequency domain, we propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain. Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage. With the multi-level features extracted by the backbone, we design a flexible frequency perception module based on octave convolution for coarse positioning. Then, we design the correction fusion module to step-by-step integrate the high-level features through the prior-guided correction and cross-layer feature channel association, and finally combine them with the shallow features to achieve the detailed correction of the camouflaged objects. Compared with the currently existing models, our proposed method achieves competitive performance in three popular benchmark datasets both qualitatively and quantitatively.

Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-identification

  • paper_url: http://arxiv.org/abs/2308.08887
  • repo_url: https://github.com/dcp15/isr_iccv2023_oral
  • paper_authors: Zhaopeng Dou, Zhongdao Wang, Yali Li, Shengjin Wang
  • for: This work aims to learn a domain-generalizable person re-identification (ReID) representation from large-scale videos without any annotation.
  • methods: An Identity-seeking Self-supervised Representation learning (ISR) method that mines identity information by modeling inter-frame instance association as a maximum-weight bipartite matching problem, with a reliability-guided contrastive loss to suppress noisy positive pairs.
  • results: Without human annotation or fine-tuning, ISR achieves 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, surpassing the best supervised domain-generalizable method by 5.0% and 19.5%, respectively. In the pre-training → fine-tuning scenario, ISR achieves state-of-the-art performance with 88.4% Rank-1 on MSMT17. Code: https://github.com/dcp15/ISR_ICCV2023_Oral.
    Abstract This paper aims to learn a domain-generalizable (DG) person re-identification (ReID) representation from large-scale videos without any annotation. Prior DG ReID methods employ limited labeled data for training due to the high cost of annotation, which restricts further advances. To overcome the barriers of data and annotation, we propose to utilize large-scale unsupervised data for training. The key issue lies in how to mine identity information. To this end, we propose an Identity-seeking Self-supervised Representation learning (ISR) method. ISR constructs positive pairs from inter-frame images by modeling the instance association as a maximum-weight bipartite matching problem. A reliability-guided contrastive loss is further presented to suppress the adverse impact of noisy positive pairs, ensuring that reliable positive pairs dominate the learning process. The training cost of ISR scales approximately linearly with the data size, making it feasible to utilize large-scale data for training. The learned representation exhibits superior generalization ability. Without human annotation and fine-tuning, ISR achieves 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, outperforming the best supervised domain-generalizable method by 5.0% and 19.5%, respectively. In the pre-training → fine-tuning scenario, ISR achieves state-of-the-art performance, with 88.4% Rank-1 on MSMT17. The code is at https://github.com/dcp15/ISR_ICCV2023_Oral
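
The instance-association step phrased as maximum-weight bipartite matching can be written compactly with the Hungarian algorithm, as in the sketch below; the similarity measure and threshold are illustrative, and the reliability-guided contrastive loss is not shown.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(feat_a: np.ndarray, feat_b: np.ndarray, sim_threshold: float = 0.6):
    """Associate instances across two frames by maximizing total cosine similarity;
    pairs below the threshold are dropped. feat_a: (Na, D), feat_b: (Nb, D)."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T
    rows, cols = linear_sum_assignment(-sim)        # negate to maximize similarity
    keep = sim[rows, cols] >= sim_threshold
    return list(zip(rows[keep], cols[keep]))

pairs = match_instances(np.random.randn(5, 128), np.random.randn(6, 128))
```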

Event-Guided Procedure Planning from Instructional Videos with Text Supervision

  • paper_url: http://arxiv.org/abs/2308.08885
  • repo_url: https://github.com/AlanWang0o0/ISEE-E3P
  • paper_authors: An-Lan Wang, Kun-Yu Lin, Jia-Run Du, Jingke Meng, Wei-Shi Zheng
  • for: This paper targets procedure planning from instructional videos with text supervision, where a model predicts an action sequence that transforms the initial visual state into the goal visual state.
  • methods: An event-guided paradigm: the method first infers events from the observed states and then plans actions based on both the states and the predicted events, incorporating event information into the sequential modeling of procedure planning.
  • results: Extensive experiments on three datasets demonstrate the effectiveness of the method.
    Abstract In this work, we focus on the task of procedure planning from instructional videos with text supervision, where a model aims to predict an action sequence to transform the initial visual state into the goal visual state. A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions, which is ignored by previous works. Specifically, this semantic gap refers to that the contents in the observed visual states are semantically different from the elements of some action text labels in a procedure. To bridge this semantic gap, we propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events. Our inspiration comes from that planning a procedure from an instructional video is to complete a specific event and a specific event usually involves specific actions. Based on the proposed paradigm, we contribute an Event-guided Prompting-based Procedure Planning (E3P) model, which encodes event information into the sequential modeling process to support procedure planning. To further consider the strong action associations within each event, our E3P adopts a mask-and-predict approach for relation mining, incorporating a probabilistic masking scheme for regularization. Extensive experiments on three datasets demonstrate the effectiveness of our proposed model.

SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations

  • paper_url: http://arxiv.org/abs/2308.08884
  • repo_url: None
  • paper_authors: Zhiming Wang, Lin Gu, Feng Lu
  • for: Improving the self-supervised signal used by Masked Image Modeling (MIM).
  • methods: Image scale is used as a self-supervised signal: random patches are selected from the input image and downsampled to a low-resolution format, and a super-resolution (SR) based prediction head reconstructs the input from the low-resolution clues and the other patches.
  • results: After 400 epochs of pre-training, SRMAE reaches 82.1% accuracy on ImageNet-1K, surpasses DeriveNet by 1.3% on the very low resolution (VLR) recognition task, and achieves 74.84% accuracy on low-resolution facial expression recognition, exceeding the state-of-the-art FMD by 9.48%.
    Abstract Due to the prevalence of scale variance in nature images, we propose to use image scale as a self-supervised signal for Masked Image Modeling (MIM). Our method involves selecting random patches from the input image and downsampling them to a low-resolution format. Our framework utilizes the latest advances in super-resolution (SR) to design the prediction head, which reconstructs the input from low-resolution clues and other patches. After 400 epochs of pre-training, our Super Resolution Masked Autoencoders (SRMAE) get an accuracy of 82.1% on the ImageNet-1K task. Image scale signal also allows our SRMAE to capture scale invariance representation. For the very low resolution (VLR) recognition task, our model achieves the best performance, surpassing DeriveNet by 1.3%. Our method also achieves an accuracy of 74.84% on the task of recognizing low-resolution facial expressions, surpassing the current state-of-the-art FMD by 9.48%.
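
The scale-based pretext signal amounts to cropping random patches and downsampling them before reconstruction; a minimal sketch follows, with the patch size, count, and interpolation mode as assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def downsample_random_patches(img: torch.Tensor, n: int = 4, patch: int = 64, scale: int = 4):
    """Pick n random patches from a (C, H, W) image and downsample each by `scale`,
    producing the low-resolution clues the decoder must super-resolve."""
    c, h, w = img.shape
    lows, boxes = [], []
    for _ in range(n):
        y = torch.randint(0, h - patch + 1, (1,)).item()
        x = torch.randint(0, w - patch + 1, (1,)).item()
        crop = img[:, y:y + patch, x:x + patch].unsqueeze(0)
        lows.append(F.interpolate(crop, scale_factor=1 / scale, mode="bicubic", align_corners=False))
        boxes.append((y, x, patch))
    return torch.cat(lows), boxes

low_res, boxes = downsample_random_patches(torch.rand(3, 224, 224))
print(low_res.shape)  # (4, 3, 16, 16)
```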

Text-Only Training for Visual Storytelling

  • paper_url: http://arxiv.org/abs/2308.08881
  • repo_url: None
  • paper_authors: Yuechen Wang, Wengang Zhou, Zhenbo Lu, Houqiang Li
  • for: This paper proposes a text-only-trained visual storytelling method to improve generation quality and generalization.
  • methods: A text-only training method that separates cross-modality alignment from story generation: the cross-modality pre-trained CLIP model integrates visual control into a story generator trained exclusively on text data, and a training-free visual condition planner accounts for the temporal structure of the input image sequence while balancing global and local visual content.
  • results: Extensive experiments on the VIST benchmark show the effectiveness of the approach in both in-domain and cross-domain settings; evaluations of expression diversity and human assessment further underscore its informativeness and robustness.
    Abstract Visual storytelling aims to generate a narrative based on a sequence of images, necessitating both vision-language alignment and coherent story generation. Most existing solutions predominantly depend on paired image-text training data, which can be costly to collect and challenging to scale. To address this, we formulate visual storytelling as a visual-conditioned story generation problem and propose a text-only training method that separates the learning of cross-modality alignment and story generation. Our approach specifically leverages the cross-modality pre-trained CLIP model to integrate visual control into a story generator, trained exclusively on text data. Moreover, we devise a training-free visual condition planner that accounts for the temporal structure of the input image sequence while balancing global and local visual content. The distinctive advantage of requiring only text data for training enables our method to learn from external text story data, enhancing the generalization capability of visual storytelling. We conduct extensive experiments on the VIST benchmark, showcasing the effectiveness of our approach in both in-domain and cross-domain settings. Further evaluations on expression diversity and human assessment underscore the superiority of our method in terms of informativeness and robustness.

Towards Semi-supervised Learning with Non-random Missing Labels

  • paper_url: http://arxiv.org/abs/2308.08872
  • repo_url: https://github.com/njuyued/prg4ssl-mnar
  • paper_authors: Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, Yinghuan Shi
  • for: Addresses the challenging scenario of label Missing Not At Random (MNAR) in semi-supervised learning (SSL), which is ignored by existing SSL methods.
  • methods: Proposes a class transition tracking based Pseudo-Rectifying Guidance (PRG) to maintain the model’s unbiased enthusiasm towards assigning pseudo-labels to all classes, improving the quality of pseudo-labels on both popular classes and rare classes in MNAR.
  • results: Shows superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL approaches combining bias removal solutions by a large margin.
    Abstract Semi-supervised learning (SSL) tackles the label missing problem by enabling the effective usage of unlabeled data. While existing SSL methods focus on the traditional setting, a practical and challenging scenario called label Missing Not At Random (MNAR) is usually ignored. In MNAR, the labeled and unlabeled data fall into different class distributions resulting in biased label imputation, which deteriorates the performance of SSL models. In this work, class transition tracking based Pseudo-Rectifying Guidance (PRG) is devised for MNAR. We explore the class-level guidance information obtained by the Markov random walk, which is modeled on a dynamically created graph built over the class tracking matrix. PRG unifies the historical information of class distribution and class transitions caused by the pseudo-rectifying procedure to maintain the model's unbiased enthusiasm towards assigning pseudo-labels to all classes, so as the quality of pseudo-labels on both popular classes and rare classes in MNAR could be improved. Finally, we show the superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL approaches combining bias removal solutions by a large margin. Code and model weights are available at https://github.com/NJUyued/PRG4SSL-MNAR.
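
The class-transition tracking that feeds the Markov random walk could be maintained roughly as below: accumulate pseudo-label transitions between training rounds into a C x C matrix and row-normalize it. This is a simplified sketch, not the paper's full PRG procedure.

```python
import torch

def update_transition_matrix(trans: torch.Tensor, prev_plabel: torch.Tensor,
                             new_plabel: torch.Tensor, num_classes: int):
    """trans: (C, C) accumulated transition counts; prev_plabel, new_plabel: (N,) int
    pseudo-labels from consecutive rounds. Returns the updated counts and a
    row-normalized matrix usable as random-walk transition probabilities."""
    idx = prev_plabel * num_classes + new_plabel
    counts = torch.bincount(idx, minlength=num_classes * num_classes)
    trans = trans + counts.view(num_classes, num_classes).float()
    walk = trans / trans.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return trans, walk

C = 10
trans = torch.zeros(C, C)
trans, walk = update_transition_matrix(trans, torch.randint(0, C, (128,)), torch.randint(0, C, (128,)), C)
```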

Spatially and Spectrally Consistent Deep Functional Maps

  • paper_url: http://arxiv.org/abs/2308.08871
  • repo_url: https://github.com/rqhuang88/Spatially-and-Spectrally-Consistent-Deep-Functional-Maps
  • paper_authors: Mingze Sun, Shiwei Mao, Puhua Jiang, Maks Ovsjanikov, Ruqi Huang
  • for: investigate the utility of cycle consistency in Deep Functional Maps for non-rigid shape matching
  • methods: use spectral and point-wise representation to enforce harmony of learned maps, and independently estimate maps in both domains to alleviate over-fitting
  • results: produce state-of-the-art results in mapping shapes under significant distortions, with superior generalization performance and accuracy in challenging tests for both near-isometric and non-isometric datasets
    Abstract Cycle consistency has long been exploited as a powerful prior for jointly optimizing maps within a collection of shapes. In this paper, we investigate its utility in the approaches of Deep Functional Maps, which are considered state-of-the-art in non-rigid shape matching. We first justify that under certain conditions, the learned maps, when represented in the spectral domain, are already cycle consistent. Furthermore, we identify the discrepancy that spectrally consistent maps are not necessarily spatially, or point-wise, consistent. In light of this, we present a novel design of unsupervised Deep Functional Maps, which effectively enforces the harmony of learned maps under the spectral and the point-wise representation. By taking advantage of cycle consistency, our framework produces state-of-the-art results in mapping shapes even under significant distortions. Beyond that, by independently estimating maps in both spectral and spatial domains, our method naturally alleviates over-fitting in network training, yielding superior generalization performance and accuracy within an array of challenging tests for both near-isometric and non-isometric datasets. Codes are available at https://github.com/rqhuang88/Spatially-and-Spectrally-Consistent-Deep-Functional-Maps.
    摘要 循环一致性(cycle consistency)长期以来被用作一种强大的先验,用于联合优化形状集合中的映射。在本文中,我们研究了它在深度函数映射(Deep Functional Maps)方法中的作用,该类方法被认为是非刚性形状匹配领域的最先进技术。我们首先论证,在某些条件下,学习得到的映射在谱域表示时已经是循环一致的。此外,我们发现谱域一致的映射并不一定在空间域(即逐点)上一致。鉴于此,我们提出了一种新的无监督深度函数映射设计,有效地保证学习得到的映射在谱域和逐点两种表示下协调一致。借助循环一致性,我们的框架即使在显著形变下也能取得最先进的形状映射结果。此外,通过在谱域和空间域独立估计映射,我们的方法自然地缓解了网络训练中的过拟合,在近等距和非等距数据集的一系列具有挑战性的测试中取得了更优的泛化性能和准确率。代码可在 https://github.com/rqhuang88/Spatially-and-Spectrally-Consistent-Deep-Functional-Maps 获取。
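The spectral cycle-consistency property discussed above is easy to check numerically: functional maps are small k x k matrices, so composing them around a 3-cycle of shapes should return (approximately) the identity. A minimal sketch, with toy maps that are consistent by construction:

```python
import numpy as np

def cycle_consistency_error(C12, C23, C31):
    """Deviation of composed functional maps from identity over a 3-cycle.
    Each C_ij is a k x k matrix mapping spectral coefficients of shape i to
    shape j, so the composition shape 1 -> 2 -> 3 -> 1 should be ~identity."""
    k = C12.shape[0]
    comp = C31 @ C23 @ C12
    return np.linalg.norm(comp - np.eye(k), ord="fro")

# Toy check: an orthogonal map and its transpose cancel around the cycle.
rng = np.random.default_rng(1)
k = 30
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))     # a near-isometric spectral map
print(cycle_consistency_error(Q, Q.T, np.eye(k)))  # ~0
```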

MV-ROPE: Multi-view Constraints for Robust Category-level Object Pose and Size Estimation

  • paper_url: http://arxiv.org/abs/2308.08856
  • repo_url: https://github.com/greatoyster/mv-rope
  • paper_authors: Jiaqi Yang, Yucong Chen, Xiangting Meng, Chenxin Yan, Min Li, Ran Chen, Lige Liu, Tao Sun, Laurent Kneip
  • for: 该文章提出了一种基于RGB的分类水平6D对象pose和大小估计的新框架。
  • methods: 该方法利用预测正规对象坐标空间(NOCS),从RGB图像中提取了一种高效和有效的对象标准表示,而不需要额外的深度读取。
  • results: 该文章的实验结果表明,该方法在公共数据集序列上具有强大的表现,甚至与RGB-D方法相当。此外,文章还证明了该方法的通用性,在自己收集的数据集上进行了评估。Here is the translation in English:
    Abstract We propose a novel framework for RGB-based category-level 6D object pose and size estimation. Our approach relies on the prediction of normalized object coordinate space (NOCS), which serves as an efficient and effective object canonical representation that can be extracted from RGB images. Unlike previous approaches that heavily relied on additional depth readings as input, our novelty lies in leveraging multi-view information, which is commonly available in practical scenarios where a moving camera continuously observes the environment. By introducing multi-view constraints, we can obtain accurate camera pose and depth estimation from a monocular dense SLAM framework. Additionally, by incorporating constraints on the camera relative pose, we can apply trimming strategies and robust pose averaging on the multi-view object poses, resulting in more accurate and robust estimations of category-level object poses even in the absence of direct depth readings. Furthermore, we introduce a novel NOCS prediction network that significantly improves performance. Our experimental results demonstrate the strong performance of our proposed method, even comparable to state-of-the-art RGB-D methods across public dataset sequences. Additionally, we showcase the generalization ability of our method by evaluating it on self-collected datasets.
    摘要 我们提出了一种新的RGB基于分类水平6D物体姿态和大小估计框架。我们的方法基于预测正规化物体坐标空间(NOCS),这是一种高效和有效的物体标准表示,可以从RGB图像中提取出来。与之前的方法不同,我们不需要添加深度读取作为输入,而是利用多视图信息,这是在实际场景中常见的移动摄像头不断观察环境中的情况。通过引入多视图约束,我们可以从单摄 dense SLAM框架中获得高精度的相机pose和深度估计。此外,通过应用相机相对pose约束,我们可以在多视图物体姿态中进行trimming策略和稳定 pose 平均,从而在不具备直接深度读取的情况下获得更高精度和更加稳定的分类水平物体姿态估计。此外,我们还提出了一种新的NOCS预测网络,该网络有效提高了性能。我们的实验结果表明,我们提出的方法在公共数据集序列上具有强大的表现,甚至与RGB-D方法相当。此外,我们还证明了我们的方法在自主收集的数据集上进行了普适化。
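The "trimming and robust pose averaging" step can be approximated with a generic recipe: a chordal (SVD-projected) mean of the per-view rotations, with views trimmed by their geodesic distance to the current estimate, and a median over translations. This is a standard robust-averaging sketch under assumed thresholds, not the paper's exact strategy:

```python
import numpy as np

def geodesic_deg(R1, R2):
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def project_to_SO3(M):
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:                   # keep a proper rotation (det = +1)
        R = U @ np.diag([1, 1, -1]) @ Vt
    return R

def trimmed_pose_average(rotations, translations, trim_deg=15.0):
    """Chordal mean of per-view rotations with outlier trimming, plus a
    median over translations."""
    R_avg = project_to_SO3(np.mean(rotations, axis=0))
    keep = [i for i, R in enumerate(rotations) if geodesic_deg(R, R_avg) < trim_deg]
    if keep:                                   # re-average over the inlier views
        R_avg = project_to_SO3(np.mean([rotations[i] for i in keep], axis=0))
        t_avg = np.median([translations[i] for i in keep], axis=0)
    else:
        t_avg = np.median(translations, axis=0)
    return R_avg, t_avg

R, t = trimmed_pose_average([np.eye(3)] * 5, [np.zeros(3)] * 5)
```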

Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling

  • paper_url: http://arxiv.org/abs/2308.08855
  • repo_url: https://github.com/zxz267/AvatarJLM
  • paper_authors: Xiaozheng Zheng, Zhuo Su, Chao Wen, Zhou Xue, Xiaojie Jin
  • for: 快速发展的VR/AR应用程序中实现真实的全身动作控制。
  • methods: 提出了一种两stage框架,通过使用头盔显示器和手控制器的三种跟踪信号来获取高精度和平滑的全身动作。该框架在第一个阶段显式地模型了关节级特征,并在第二个阶段通过交替的空间和时间变换块来捕捉关节级相关性。
  • results: 通过对AMASS运动数据集和真实捕捉数据进行广泛的实验,证明了我们的设计的效果,并显示了我们的提议方法可以实现比现有方法更高精度和平滑的动作。
    Abstract To bridge the physical and virtual worlds for rapidly developed VR/AR applications, the ability to realistically drive 3D full-body avatars is of great significance. Although real-time body tracking with only the head-mounted displays (HMDs) and hand controllers is heavily under-constrained, a carefully designed end-to-end neural network is of great potential to solve the problem by learning from large-scale motion data. To this end, we propose a two-stage framework that can obtain accurate and smooth full-body motions with the three tracking signals of head and hands only. Our framework explicitly models the joint-level features in the first stage and utilizes them as spatiotemporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage. Furthermore, we design a set of loss terms to constrain the task of a high degree of freedom, such that we can exploit the potential of our joint-level modeling. With extensive experiments on the AMASS motion dataset and real-captured data, we validate the effectiveness of our designs and show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
    摘要 为了在快速发展的VR/AR应用中连接物理世界和虚拟世界,能够真实驱动3D全身虚拟形象的能力非常重要。尽管仅凭头戴式显示器(HMD)和手部控制器进行实时全身跟踪是严重欠约束的,但精心设计的端到端神经网络有望通过学习大规模运动数据来解决这一问题。为此,我们提出了一个两阶段框架,仅凭头部和双手三路跟踪信号即可获得准确且平滑的全身运动。我们的框架在第一阶段显式建模关节级特征,并在第二阶段将其作为时空令牌,通过交替的空间和时间 transformer 块来捕捉关节级相关性。此外,我们设计了一组损失项来约束这一自由度很高的任务,从而充分发挥关节级建模的潜力。通过在 AMASS 运动数据集和真实采集数据上的大量实验,我们验证了设计的有效性,并表明所提方法相比现有方法可以获得更准确、更平滑的运动。

Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language

  • paper_url: http://arxiv.org/abs/2308.08854
  • repo_url: https://github.com/intelligolabs/Le-RNR-Map
  • paper_authors: Francesco Taioli, Federico Cunico, Federico Girella, Riccardo Bologna, Alessandro Farinelli, Marco Cristani
  • for: 这个论文是为了提供一种语言增强的可 Renderable Neural Radiance Map(Le-RNR-Map),用于视觉导航,并且可以通过自然语言查询提示来搜索。
  • methods: 该论文使用了RNR-Map,一种基于格子结构的秘密码,每个像素都有一个来自图像观察的 latent codes,可以将图像渲染到相应的Camera pose。此外,该论文还使用了CLIP-based embedding latent codes,允许通过自然语言查询来搜索。
  • results: 该论文通过单个和多个对象搜索的实验,证明了Le-RNR-Map的效果。此外,该论文还 investigate了这个地图与大型自然语言模型的相容性,并且发现了这个地图可以帮助解决”可用性查询”问题。
    Abstract We present Le-RNR-Map, a Language-enhanced Renderable Neural Radiance map for Visual Navigation with natural language query prompts. The recently proposed RNR-Map employs a grid structure comprising latent codes positioned at each pixel. These latent codes, which are derived from image observation, enable: i) image rendering given a camera pose, since they are converted to Neural Radiance Field; ii) image navigation and localization with astonishing accuracy. On top of this, we enhance RNR-Map with CLIP-based embedding latent codes, allowing natural language search without additional label data. We evaluate the effectiveness of this map in single and multi-object searches. We also investigate its compatibility with a Large Language Model as an "affordance query resolver". Code and videos are available at https://intelligolabs.github.io/Le-RNR-Map/
    摘要 我们介绍Le-RNR-Map,一种语言增强的可 Renderable Neural Radiance Map 用于视觉导航,通过自然语言查询提示。最近提出的RNR-Map使用网格结构,每个像素位置具有秘密码。这些秘密码,从图像观察中 derivation,允许:i) 根据摄像头pose进行图像渲染; ii) 图像导航和定位准确。此外,我们在RNR-Map中添加了CLIP基于的嵌入代码,使得不需要额外标注数据进行自然语言搜索。我们评估了这个地图在单个和多个对象搜索中的效果,以及它与大型语言模型作为"能力查询解决方案"的 compatibilty。代码和视频可以在https://intelligolabs.github.io/Le-RNR-Map/ 中下载。
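Because the latent codes are aligned with CLIP, a natural-language query reduces to scoring every map cell against the text embedding. The grid shape and the use of plain cosine similarity below are assumptions for illustration:

```python
import numpy as np

def language_query_heatmap(latent_grid, text_embedding):
    """Score every map cell against a natural-language query.
    latent_grid: (H, W, D) CLIP-aligned latent codes; text_embedding: (D,).
    Returns an (H, W) cosine-similarity heatmap; the hottest cell is a
    candidate navigation goal."""
    g = latent_grid / (np.linalg.norm(latent_grid, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return g @ t

rng = np.random.default_rng(0)
heat = language_query_heatmap(rng.normal(size=(64, 64, 512)), rng.normal(size=512))
goal_cell = np.unravel_index(np.argmax(heat), heat.shape)
```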

Bag of Tricks for Long-Tailed Multi-Label Classification on Chest X-Rays

  • paper_url: http://arxiv.org/abs/2308.08853
  • repo_url: None
  • paper_authors: Feng Hong, Tianjie Dai, Jiangchao Yao, Ya Zhang, Yanfeng Wang
  • for: 本文主要针对的是用机器学习算法进行胸部X射线图像的临床分类,特别是面临长尾和多标签等挑战。
  • methods: 本文提出了一些新的设计方案,包括数据扩充、特征提取器、分类器设计、损失函数重新权衡、外部数据补充等,以提高CXR诊断的性能。
  • results: 本文通过对多种设计方案的实践和简单的测试数据扩充以及 ensemble 技术,最终达到了ICCV CVAMD 2023 CXR-LT Competition 的测试集上的0.349 mAP值,排名前五。
    Abstract Clinical classification of chest radiography is particularly challenging for standard machine learning algorithms due to its inherent long-tailed and multi-label nature. However, few attempts take into account the coupled challenges posed by both the class imbalance and label co-occurrence, which hinders their value to boost the diagnosis on chest X-rays (CXRs) in the real-world scenarios. Besides, with the prevalence of pretraining techniques, how to incorporate these new paradigms into the current framework lacks of the systematical study. This technical report presents a brief description of our solution in the ICCV CVAMD 2023 CXR-LT Competition. We empirically explored the effectiveness for CXR diagnosis with the integration of several advanced designs about data augmentation, feature extractor, classifier design, loss function reweighting, exogenous data replenishment, etc. In addition, we improve the performance through simple test-time data augmentation and ensemble. Our framework finally achieves 0.349 mAP on the competition test set, ranking in the top five.
    摘要 由于胸部X光片天然具有长尾和多标签的特性,其临床分类对标准机器学习算法而言尤其具有挑战性。然而,很少有工作同时考虑类别不平衡与标签共现这两个相互耦合的挑战,这限制了它们在真实场景中提升胸部X光(CXR)诊断的价值。此外,随着预训练技术的普及,如何将这些新范式融入当前框架仍缺乏系统性研究。本技术报告简要介绍了我们在 ICCV CVAMD 2023 CXR-LT 竞赛中的解决方案。我们围绕数据增强、特征提取器、分类器设计、损失函数重加权、外部数据补充等多项先进设计,实证探索了它们对 CXR 诊断的有效性,并通过简单的测试时数据增强和模型集成进一步提升性能。我们的框架最终在竞赛测试集上取得 0.349 mAP,排名前五。
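Of the tricks listed, test-time augmentation and ensembling are the easiest to reproduce. A minimal sketch that averages sigmoid outputs over a horizontal flip and over several models; the `models` callables and the 26-class output size are placeholders:

```python
import numpy as np

def tta_ensemble_probs(models, image):
    """Average per-class probabilities over horizontal-flip TTA and an ensemble.
    `models` is any list of callables mapping an (H, W, C) array to a vector of
    multi-label probabilities; both names are placeholders for the real backbones."""
    views = [image, image[:, ::-1]]            # identity + horizontal flip
    preds = [m(v) for m in models for v in views]
    return np.mean(preds, axis=0)

# Usage with dummy "models" that return random class probabilities.
rng = np.random.default_rng(0)
dummy = [lambda x, r=rng: r.uniform(size=26) for _ in range(3)]
probs = tta_ensemble_probs(dummy, rng.uniform(size=(224, 224, 3)))
```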

A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation

  • paper_url: http://arxiv.org/abs/2308.08849
  • repo_url: https://github.com/wentaol86/awesome-body-language
  • paper_authors: Li Liu, Lufei Gao, Wentao Lei, Fengji Ma, Xiaotian Lin, Jinting Wang
  • for: 这 paper 主要是为了探讨深度多Modal学习在不同的身体语言(BL)生成和识别方面的应用。
  • methods: 这 paper 使用了深度多Modal学习技术来分析和理解不同的BL,包括手语(SL)、句子语音(CS)、同时说话(CoS)和头部语音(TH)等。
  • results: 这 paper 对这些多Modal approaches的评估和比较,并提出了未来研究的方向,如自然语言处理、多Modal学习和大规模预训练模型的应用。
    Abstract Body language (BL) refers to the non-verbal communication expressed through physical movements, gestures, facial expressions, and postures. It is a form of communication that conveys information, emotions, attitudes, and intentions without the use of spoken or written words. It plays a crucial role in interpersonal interactions and can complement or even override verbal communication. Deep multi-modal learning techniques have shown promise in understanding and analyzing these diverse aspects of BL. The survey emphasizes their applications to BL generation and recognition. Several common BLs are considered, i.e., Sign Language (SL), Cued Speech (CS), Co-speech (CoS), and Talking Head (TH), and we have conducted an analysis and established the connections among these four BLs for the first time. Their generation and recognition often involve multi-modal approaches. Benchmark datasets for BL research are well collected and organized, along with the evaluation of SOTA methods on these datasets. The survey highlights challenges such as limited labeled data, multi-modal learning, and the need for domain adaptation to generalize models to unseen speakers or languages. Future research directions are presented, including exploring self-supervised learning techniques, integrating contextual information from other modalities, and exploiting large-scale pre-trained multi-modal models. In summary, this survey paper provides a comprehensive understanding of deep multi-modal learning for various BL generations and recognitions for the first time. By analyzing advancements, challenges, and future directions, it serves as a valuable resource for researchers and practitioners in advancing this field. In addition, we maintain a continuously updated paper list for deep multi-modal learning for BL recognition and generation: https://github.com/wentaoL86/awesome-body-language.
    摘要 Body language (BL) 指的是通过物理运动、姿势、表情和姿态来表达的非语言通信。它是人际交流中的一种重要的沟通方式,可以补充或甚至覆盖语言交流。深入的多Modal学习技术已经在理解和分析这些多种非语言通信方面表现出了承诺。本文件尽可能地概括了这些多Modal学习技术的应用和挑战,并提出了未来研究的方向。在本文中,我们分析了四种常见的BL:手语(SL)、笔记法(CS)、协调说话(CoS)和对话头(TH),并对这些四种BL之间的连接进行了分析。其生成和识别通常需要多Modal的方法。我们也收集了一些标准的BL数据集,并对这些数据集进行了评估。Despite the progress made, there are still several challenges that need to be addressed, such as limited labeled data, multi-modal learning, and the need for domain adaptation to generalize models to unseen speakers or languages. In the future, we can explore self-supervised learning techniques, integrate contextual information from other modalities, and exploit large-scale pre-trained multi-modal models.总之,本文提供了深入的多Modal学习技术在不同的BL生成和识别方面的首次概括。通过分析进步、挑战和未来方向,它将成为研究和实践者在这个领域的有价值资源。此外,我们还维护一份continuously更新的BL生成和识别相关文献列表,可以在 GitHub上找到:https://github.com/wentaoL86/awesome-body-language。

ICoNIK: Generating Respiratory-Resolved Abdominal MR Reconstructions Using Neural Implicit Representations in k-Space

  • paper_url: http://arxiv.org/abs/2308.08830
  • repo_url: None
  • paper_authors: Veronika Spieker, Wenqi Huang, Hannah Eichhorn, Jonathan Stelter, Kilian Weiss, Veronika A. Zimmer, Rickmer F. Braren, Dimitrios C. Karampinos, Kerstin Hammernik, Julia A. Schnabel
  • for: 实现静止照片与运动照片的混合,解决运动引起的影像污染问题。
  • methods: 使用神经网络学习几何空间中的几何函数,并将这个函数与测量点和呼吸访问信号组合,实现无推印影像重建。
  • results: 比标准运动解析技术高效,并提供了一个可能性解决运动引起的影像污染问题的解析方法。
    Abstract Motion-resolved reconstruction for abdominal magnetic resonance imaging (MRI) remains a challenge due to the trade-off between residual motion blurring caused by discretized motion states and undersampling artefacts. In this work, we propose to generate blurring-free motion-resolved abdominal reconstructions by learning a neural implicit representation directly in k-space (NIK). Using measured sampling points and a data-derived respiratory navigator signal, we train a network to generate continuous signal values. To aid the regularization of sparsely sampled regions, we introduce an additional informed correction layer (ICo), which leverages information from neighboring regions to correct NIK's prediction. Our proposed generative reconstruction methods, NIK and ICoNIK, outperform standard motion-resolved reconstruction techniques and provide a promising solution to address motion artefacts in abdominal MRI.
    摘要 对于腹部磁共振成像(MRI)中的运动解像仍然是一个挑战,因为存在归一化运动态论和抽样缺陷之间的质量冲突。在这种工作中,我们提议通过直接在k空间学习神经网络(NIK)来生成无抖的运动解像腹部重建。使用测量的抽样点和数据驱动的呼吸导航信号,我们训练了一个网络来生成连续的信号值。为了帮助稀疏抽样区域的正则化,我们引入了一个加 informations层(ICo),该层利用邻近区域的信息来修正 NIK 的预测。我们的提出的生成重建方法,NIK 和 ICoNIK,超过标准的运动解像重建技术,并提供了解决腹部 MRI 中运动artefacts的可能性。
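The core of NIK is a network that maps a k-space coordinate plus the navigator signal to a complex signal value. A minimal sketch with a plain MLP; the layer width, the normalized 2D-plus-navigator input, and the two-channel real/imaginary output are assumptions:

```python
import torch
import torch.nn as nn

class KSpaceMLP(nn.Module):
    """Minimal neural-implicit-k-space sketch: (kx, ky, navigator) -> (Re, Im)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),               # real and imaginary parts
        )

    def forward(self, coords_nav):              # (N, 3) normalised inputs
        return self.net(coords_nav)

# Fit to measured k-space samples tagged with their respiratory-navigator value;
# a motion-resolved image is then obtained by querying a dense k-space grid at
# the desired navigator state and applying the inverse FFT.
model = KSpaceMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
coords = torch.rand(1024, 3) * 2 - 1            # (kx, ky, navigator) in [-1, 1]
target = torch.randn(1024, 2)                   # measured complex samples
loss = nn.functional.mse_loss(model(coords), target)
loss.backward()
opt.step()
```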

Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2308.08824
  • repo_url: https://github.com/meaten/flowchain-iccv2023
  • paper_authors: Takahiro Maeda, Norimichi Ukita
  • for: 本文提出了一种新的基于正规化流的轨迹预测方法(FlowChain)。该方法能够快速计算并准确估计概率密度,这正是自动驾驶和社交机器人等安全关键应用所需要的。
  • methods: FlowChain 是一个堆叠的条件连续索引流(CIF),表达能力强,并支持概率密度的解析计算。这种解析计算比需要额外近似(如核密度估计)的生成模型更快,同时由于对估计密度的假设更少,也比高斯混合模型更准确。此外,FlowChain 只需采用最新观测位置并复用表示运动趋势的流变换及其对数雅可比行列式,即可在一毫秒内快速更新估计的概率密度。
  • results: 实验结果显示,FlowChain 相比以往方法实现了最优的轨迹预测精度,并在概率密度估计的准确性和速度上均表现出优势。代码可在 https://github.com/meaten/FlowChain-ICCV2023 下载。
    Abstract Safety-critical applications such as autonomous vehicles and social robots require fast computation and accurate probability density estimation on trajectory prediction. To address both requirements, this paper presents a new normalizing flow-based trajectory prediction model named FlowChain. FlowChain is a stack of conditional continuously-indexed flows (CIFs) that are expressive and allow analytical probability density computation. This analytical computation is faster than the generative models that need additional approximations such as kernel density estimation. Moreover, FlowChain is more accurate than the Gaussian mixture-based models due to fewer assumptions on the estimated density. FlowChain also allows a rapid update of estimated probability densities. This update is achieved by adopting the \textit{newest observed position} and reusing the flow transformations and its log-det-jacobians that represent the \textit{motion trend}. This update is completed in less than one millisecond because this reuse greatly omits the computational cost. Experimental results showed our FlowChain achieved state-of-the-art trajectory prediction accuracy compared to previous methods. Furthermore, our FlowChain demonstrated superiority in the accuracy and speed of density estimation. Our code is available at \url{https://github.com/meaten/FlowChain-ICCV2023}
    摘要 自动驾驶车辆和社交机器人等安全关键应用需要在轨迹预测中实现快速计算和准确的概率密度估计。为同时满足这两个需求,本文提出了一种新的基于正规化流的轨迹预测模型 FlowChain。FlowChain 是一个堆叠的条件连续索引流(CIF),表达能力强,并允许概率密度的解析计算。这种解析计算比需要额外近似(如核密度估计)的生成模型更快;同时由于对估计密度的假设更少,FlowChain 也比基于高斯混合的模型更准确。FlowChain 还支持快速更新估计的概率密度:只需采用最新观测位置,并复用表示运动趋势的流变换及其对数雅可比行列式,即可在一毫秒内完成更新,因为这种复用大幅省去了计算开销。实验结果表明,FlowChain 相比以往方法取得了最先进的轨迹预测精度,并在密度估计的准确性和速度上均表现出优势。代码可在 https://github.com/meaten/FlowChain-ICCV2023 获取。
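The fast density update hinges on the change-of-variables formula: log p(x) is the base log-density of the transformed point plus the accumulated log-det-Jacobians, so re-centring the base distribution on the newest observed position leaves the (expensive) flow terms untouched. A toy version with elementwise affine flows and made-up parameters:

```python
import numpy as np

# Toy stack of invertible elementwise affine flows: z = (x - shift) / scale.
flows = [{"shift": np.array([0.5, -0.2]), "scale": np.array([1.5, 0.8])},
         {"shift": np.array([-0.1, 0.3]), "scale": np.array([0.7, 1.2])}]

def log_prob(x, flows, base_mean):
    """log p(x) = log N(z; base_mean, I) + sum of log|det J| of the transforms."""
    z, log_det = x.astype(float), 0.0
    for f in flows:
        z = (z - f["shift"]) / f["scale"]
        log_det += -np.sum(np.log(np.abs(f["scale"])))   # diagonal Jacobian
    # 2-D standard normal base density re-centred on base_mean.
    return -0.5 * np.sum((z - base_mean) ** 2) - np.log(2 * np.pi) + log_det

# "Fast update": the flow transforms (the motion trend) and their log-det terms
# are reused unchanged; only the base distribution is re-centred on the newest
# observed position.
x_query = np.array([1.0, 0.5])
print(log_prob(x_query, flows, base_mean=np.zeros(2)))
print(log_prob(x_query, flows, base_mean=np.array([0.2, 0.1])))   # updated estimate
```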

MixBag: Bag-Level Data Augmentation for Learning from Label Proportions

  • paper_url: http://arxiv.org/abs/2308.08822
  • repo_url: None
  • paper_authors: Takanori Asanomi, Shinnosuke Matsuo, Daiki Suehiro, Ryoma Bise
  • for: 本研究旨在提出一种基于批处理的数据增强方法,以提高无监督学习中的实例级别分类器。
  • methods: 我们提出了一种基于实验观察的关键观察,即在固定总数据量下,增加标注批处理可以提高实例级别分类精度。此外,我们还提出了基于统计理论的信息量损失函数,以便有效地利用扩充后的批处理。
  • results: 实验结果表明,我们的方法可以与现有的实例级别数据增强方法相比,在减小损失函数下达到更高的精度。此外,我们的方法还可以与其他无监督学习方法结合使用,以提高分类器的泛化能力。
    Abstract Learning from label proportions (LLP) is a promising weakly supervised learning problem. In LLP, a set of instances (bag) has label proportions, but no instance-level labels are given. LLP aims to train an instance-level classifier by using the label proportions of the bag. In this paper, we propose a bag-level data augmentation method for LLP called MixBag, based on the key observation from our preliminary experiments; that the instance-level classification accuracy improves as the number of labeled bags increases even though the total number of instances is fixed. We also propose a confidence interval loss designed based on statistical theory to use the augmented bags effectively. To the best of our knowledge, this is the first attempt to propose bag-level data augmentation for LLP. The advantage of MixBag is that it can be applied to instance-level data augmentation techniques and any LLP method that uses the proportion loss. Experimental results demonstrate this advantage and the effectiveness of our method.
    摘要 学习标签比例(LLP)是一个有前途的弱监督学习问题。在LLP中,一个集合(袋)有标签比例,但没有每个实例的标签。LLP的目标是使用袋的标签比例来训练每个实例的分类器。在这篇论文中,我们提出了一种基于先前实验的观察的袋级数据增强方法called MixBag,以及基于统计理论的自信度范围损失。这是我们知道的第一个提出袋级数据增强的尝试。MixBag的优点是可以与实例级数据增强技术结合使用,并且可以与任何使用比例损失的LLP方法结合使用。实验结果表明了MixBag的优点和效果。
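Bag-level augmentation can be pictured as combining the instances of two labeled bags and mixing their label proportions accordingly. The sampling ratio and the count-weighted proportion below are an illustrative reading of the idea, and the confidence-interval loss is omitted:

```python
import numpy as np

def mix_bags(bag_a, prop_a, bag_b, prop_b, rng):
    """Create an augmented bag by combining instances of two labeled bags.
    bag_*: (n, d) instance features; prop_*: (C,) label proportions.
    The mixed proportion is the instance-count-weighted average."""
    gamma = rng.uniform(0.2, 0.8)                      # fraction taken from bag A
    n_a = max(1, int(round(gamma * len(bag_a))))
    n_b = max(1, int(round((1 - gamma) * len(bag_b))))
    take_a = rng.choice(len(bag_a), n_a, replace=False)
    take_b = rng.choice(len(bag_b), n_b, replace=False)
    mixed = np.concatenate([bag_a[take_a], bag_b[take_b]])
    mixed_prop = (n_a * prop_a + n_b * prop_b) / (n_a + n_b)
    return mixed, mixed_prop

rng = np.random.default_rng(0)
bag, prop = mix_bags(rng.normal(size=(32, 16)), np.array([0.7, 0.3]),
                     rng.normal(size=(32, 16)), np.array([0.1, 0.9]), rng)
```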

End-to-end Alternating Optimization for Real-World Blind Super Resolution

  • paper_url: http://arxiv.org/abs/2308.08816
  • repo_url: https://github.com/greatlog/realdan
  • paper_authors: Zhengxiong Luo, Yan Huang, Shang Li, Liang Wang, Tieniu Tan
  • for: 这篇论文主要针对的是做出高清度图像的简化超解像(SR),即从低解度图像(LR)中恢复高解度图像(HR)。
  • methods: 这篇论文提出了一种新的SR方法,即通过alternating optimization算法,将LR图像的简化和SR问题协同解决。具体来说,这种方法包括两个卷积神经网络:一个用于restore SR图像(Restorer),另一个用于估计LR图像的简化(Estimator)。这两个模块之间进行了循环的优化,以实现一个端到端可训练的网络。
  • results: 根据实验结果,这种方法可以与当前最佳方法相比,在SR问题上具有更高的精度和更好的视觉效果。
    Abstract Blind Super-Resolution (SR) usually involves two sub-problems: 1) estimating the degradation of the given low-resolution (LR) image; 2) super-resolving the LR image to its high-resolution (HR) counterpart. Both problems are ill-posed due to the information loss in the degrading process. Most previous methods try to solve the two problems independently, but often fall into a dilemma: a good super-resolved HR result requires an accurate degradation estimation, which however, is difficult to be obtained without the help of original HR information. To address this issue, instead of considering these two problems independently, we adopt an alternating optimization algorithm, which can estimate the degradation and restore the SR image in a single model. Specifically, we design two convolutional neural modules, namely \textit{Restorer} and \textit{Estimator}. \textit{Restorer} restores the SR image based on the estimated degradation, and \textit{Estimator} estimates the degradation with the help of the restored SR image. We alternate these two modules repeatedly and unfold this process to form an end-to-end trainable network. In this way, both \textit{Restorer} and \textit{Estimator} could get benefited from the intermediate results of each other, and make each sub-problem easier. Moreover, \textit{Restorer} and \textit{Estimator} are optimized in an end-to-end manner, thus they could get more tolerant of the estimation deviations of each other and cooperate better to achieve more robust and accurate final results. Extensive experiments on both synthetic datasets and real-world images show that the proposed method can largely outperform state-of-the-art methods and produce more visually favorable results. The codes are rleased at \url{https://github.com/greatlog/RealDAN.git}.
    摘要 通常,盲目超解像(SR)问题包括两个互相关联的优化问题:1)估计给出的低分辨率(LR)图像的劣化程度; 2)将LR图像提升到其高分辨率(HR)对应的图像。两个问题都是不定的,因为升级过程中的信息损失。大多数前一代方法通常会解决这两个问题独立,但经常陷入一个困境:一个好的HR图像需要一个准确的劣化估计,但是不可以不带原始HR信息来获得这个估计。为解决这个问题,我们采用了一种alternating optimization算法,可以同时估计劣化和 restaure SR图像。我们设计了两个卷积神经网络模块:namely \textit{Restorer}和\textit{Estimator}。\textit{Restorer}使用估计的劣化来还原SR图像,而\textit{Estimator}使用还原后的SR图像来估计劣化。我们在这两个模块之间进行了循环的交互,并将这个过程拓展成一个可训练的结束到终结点的网络。这样,\textit{Restorer}和\textit{Estimator}可以互相帮助,使得每个优化问题变得更加容易。此外,\textit{Restorer}和\textit{Estimator}在结束到终结点的training中被优化,因此它们可以更快地适应彼此的估计偏差,并更好地合作以实现更加稳定和准确的最终结果。广泛的实验表明,我们的方法可以大幅超越当前的状态控制方法,并生成更加视觉愉悦的结果。代码可以在\url{https://github.com/greatlog/RealDAN.git}中找到。
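The alternating scheme can be unrolled as a short loop in which each module consumes the other's latest output. The stand-in single-convolution modules below (and the omission of any upsampling) are placeholders, not the paper's architectures:

```python
import torch
import torch.nn as nn

class Estimator(nn.Module):                      # stand-in degradation estimator
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 3, 3, padding=1)
    def forward(self, lr, sr):
        return self.net(torch.cat([lr, sr], dim=1))

class Restorer(nn.Module):                       # stand-in SR restorer
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 3, 3, padding=1)
    def forward(self, lr, deg):
        return self.net(torch.cat([lr, deg], dim=1))

def unrolled_alternation(lr, estimator, restorer, steps=4):
    """End-to-end unrolled loop: each module consumes the other's latest output,
    so both are trained jointly and can tolerate each other's estimation errors."""
    sr = lr.clone()                              # initialise the SR estimate with the input
    for _ in range(steps):
        deg = estimator(lr, sr)                  # update degradation estimate
        sr = restorer(lr, deg)                   # update restored image
    return sr, deg

sr, deg = unrolled_alternation(torch.rand(1, 3, 32, 32), Estimator(), Restorer())
```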

A Fusion of Variational Distribution Priors and Saliency Map Replay for Continual 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08812
  • repo_url: None
  • paper_authors: Sanchar Palit, Sandika Biswas
  • for: 单张图像三维重建任务是一项研究挑战,旨在从单视图图像中预测物体的三维形状。这项任务需要大量数据采集,以预测可见和遮盖的部分。
  • methods: 我们提议使用 continual learning 的方法,并使用 Variational Priors 来设计模型,以便在新类之后仍然可以reasonably重建先前所见的类。Variational Priors 表示抽象形状,并避免忘记,而 saliency maps 保留物体特征,占用较少的内存。
  • results: 经过仔细的实验表明,我们的方法可以与已知方法相比, both quantitatively and qualitatively 显示出竞争力。
    Abstract Single-image 3D reconstruction is a research challenge focused on predicting 3D object shapes from single-view images. This task requires significant data acquisition to predict both visible and occluded portions of the shape. Furthermore, learning-based methods face the difficulty of creating a comprehensive training dataset for all possible classes. To this end, we propose a continual learning-based 3D reconstruction method where our goal is to design a model using Variational Priors that can still reconstruct the previously seen classes reasonably even after training on new classes. Variational Priors represent abstract shapes and combat forgetting, whereas saliency maps preserve object attributes with less memory usage. This is vital due to resource constraints in storing extensive training data. Additionally, we introduce saliency map-based experience replay to capture global and distinct object features. Thorough experiments show competitive results compared to established methods, both quantitatively and qualitatively.
    摘要 单图三维重建是一项研究挑战,旨在根据单个图像预测三维物体形状。这项任务需要大量数据收集,以预测可见和遮挡部分的形状。学习基于方法则面临创建全面训练数据集的挑战,以涵盖所有可能的类型。为此,我们提议一种逐步学习基于Variational Priors的三维重建方法,其目标是在训练新类后,仍能reasonably重建先前所见的类。Variational Priors表示抽象形态,防止忘记,而saliency maps保留物体特征,占用内存更少。这对于资源受限的存储大量训练数据非常重要。此外,我们引入了saliency map基于经验回放,以捕捉全球和特定物体特征。经过广泛的实验,我们的方法与已知方法相比,具有竞争性的Result。

Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts

  • paper_url: http://arxiv.org/abs/2308.08810
  • repo_url: None
  • paper_authors: Sunghyun Park, Seunghan Yang, Jaegul Choo, Sungrack Yun
  • for: 这个研究旨在实现批量执行时间适应(Test-time adaptation),将预训模型适应目标领域中的数据分布。
  • methods: 我们提出了一个新的标签迁移适应器,可以与现有的TTA方法结合使用,实现标签迁移的处理。我们估计目标领域的标签分布,然后将其 feed 到标签迁移适应器中,以生成适合目标领域的标签参数。
  • results: 我们通过广泛的实验表明,将我们的策略与TTA方法结合,可以在标签和 covariate 迁移时获得显著的性能提升。
    Abstract Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain in a batch-by-batch manner during inference. While label distributions often exhibit imbalances in real-world scenarios, most previous TTA approaches typically assume that both source and target domain datasets have balanced label distribution. Due to the fact that certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), it is natural that the label distribution shifts as the domain changes. However, we discover that the majority of existing TTA methods fail to address the coexistence of covariate and label shifts. To tackle this challenge, we propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively. Specifically, we estimate the label distribution of the target domain to feed it into the label shift adapter. Subsequently, the label shift adapter produces optimal parameters for the target label distribution. By predicting only the parameters for a part of the pre-trained source model, our approach is computationally efficient and can be easily applied, regardless of the model architectures. Through extensive experiments, we demonstrate that integrating our strategy with TTA approaches leads to substantial performance improvements under the joint presence of label and covariate shifts.
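A generic way to use an estimated target label distribution, shown below, is classic prior correction: re-weight the classifier's probabilities by the ratio of target to source priors and renormalise. This is a textbook substitute for intuition only, not the paper's learned label shift adapter, and the running-mean estimate of the prior is an assumption:

```python
import numpy as np

def estimate_target_prior(prob_batches, momentum=0.9, prior=None):
    """Running estimate of the target label distribution from batch predictions."""
    for probs in prob_batches:                      # probs: (B, C) softmax outputs
        batch_prior = probs.mean(axis=0)
        prior = batch_prior if prior is None else momentum * prior + (1 - momentum) * batch_prior
    return prior / prior.sum()

def prior_corrected_probs(probs, source_prior, target_prior):
    """Standard label-shift correction: p(y|x) * pi_tgt(y) / pi_src(y), renormalised."""
    w = target_prior / np.maximum(source_prior, 1e-8)
    corrected = probs * w
    return corrected / corrected.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
tgt_prior = estimate_target_prior([rng.dirichlet(np.ones(10), size=64) for _ in range(5)])
```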

Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach

  • paper_url: http://arxiv.org/abs/2308.08806
  • repo_url: None
  • paper_authors: Ziyin Zhang, Ning Lu, Minghui Liao, Yongshuai Huang, Cheng Li, Min Wang, Wei Peng
  • for: 提高文本识别模型的准确率,不增加Extra参数或训练阶段。
  • methods: 提议使用自适应定制的CTC损失函数(DCTC损失),通过带有帧级别的正则化项来强调个体监督,并通过最大化 posteriori 的潜在对齐问题来解决在涨化中的矛盾问题。
  • results: 对于公共 benchmark 进行了广泛的实验,结果显示,使用 DCTC 损失可以提高文本识别模型的准确率,最高提升达 2.6%,而不增加任何不良影响。
    Abstract Text recognition methods are gaining rapid development. Some advanced techniques, e.g., powerful modules, language models, and un- and semi-supervised learning schemes, consecutively push the performance on public benchmarks forward. However, the problem of how to better optimize a text recognition model from the perspective of loss functions is largely overlooked. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation. This is because CTC loss emphasizes the optimization of the entire sequence target while neglecting to learn individual characters. We propose a self-distillation scheme for CTC-based model to address this issue. It incorporates a framewise regularization term in CTC loss to emphasize individual supervision, and leverages the maximizing-a-posteriori of latent alignment to solve the inconsistency problem that arises in distillation between CTC-based models. We refer to the regularized CTC loss as Distillation Connectionist Temporal Classification (DCTC) loss. DCTC loss is module-free, requiring no extra parameters, longer inference lag, or additional training data or phases. Extensive experiments on public benchmarks demonstrate that DCTC can boost text recognition model accuracy by up to 2.6%, without any of these drawbacks.
    摘要 文本识别方法在快速发展中。一些先进技术,如强大的模块、语言模型和不监督学习方法, consecutively 提高了公共测试底板上的性能。然而,如何更好地优化一个文本识别模型,从损失函数的角度来看,几乎被忽略。基于CTC的方法,由于其在实践中的良好平衡性和推理速度,仍然受到精度下降的困扰。这是因为CTC损失函数强调整合整个序列目标,而忽略学习个体字符。我们提出了一种自适应方案,即Distillation Connectionist Temporal Classification(DCTC)损失函数。DCTC损失函数包含了帧级别正则化项,以强调个体监督,并利用最大 posteriori的潜在对齐问题来解决在液化中的不一致问题。DCTC损失函数是模块化的,无需额外参数、更长的推理时间或额外的训练数据或阶段。广泛的实验表明,DCTC可以提高文本识别模型的精度,最高提高2.6%,而无需这些缺点。

Deep Ear Biometrics for Gender Classification

  • paper_url: http://arxiv.org/abs/2308.08797
  • repo_url: None
  • paper_authors: Ritwiz Singh, Keshav Kashyap, Rajesh Mukherjee, Asish Bera, Mamata Dalui Chakraborty
  • for: 人类性别分类 based on 生物特征, 特别是 Computer Vision 领域的一个重要问题, 因为它有很多应用场景。
  • methods: 我们使用了深度卷积神经网络 (CNN) 模型来自动地分类人类性别, 使用了 EarVN1.0 耳朵数据集进行评估。
  • results: 我们的模型达到了 93% 的准确率。
    Abstract Human gender classification based on biometric features is a major concern for computer vision due to its vast variety of applications. The human ear is popular among researchers as a soft biometric trait, because it is less affected by age or changing circumstances, and is non-intrusive. In this study, we have developed a deep convolutional neural network (CNN) model for automatic gender classification using the samples of ear images. The performance is evaluated using four cutting-edge pre-trained CNN models. In terms of trainable parameters, the proposed technique requires significantly less computational complexity. The proposed model has achieved 93% accuracy on the EarVN1.0 ear dataset.
    摘要 人类性别分类基于生物特征是计算机视觉领域的主要问题,由于它的广泛应用领域。人耳是研究人员的首选软生物特征之一,因为它对年龄或变化情况的影响相对较少,非侵入式。在本研究中,我们开发了一种深度卷积神经网络(CNN)模型,用于自动性别分类,使用耳架图像样本。我们使用四种最新的预训练CNN模型进行评估性能。与传统模型相比,我们的方法具有更少的计算复杂性。在 EarVN1.0 耳架数据集上,我们的模型达到了 93% 的准确率。

Environment Diversification with Multi-head Neural Network for Invariant Learning

  • paper_url: http://arxiv.org/abs/2308.08778
  • repo_url: https://github.com/joe0123/EDNIL
  • paper_authors: Bo-Wei Huang, Keng-Te Liao, Chang-Sheng Kao, Shou-De Lin
  • for: 这篇论文旨在提出一个不需要先知道环境或强制假设的普遍学习框架,以提高神经网络模型对于分布类型的适应能力。
  • methods: 这个框架包含了一个多头神经网络,用于吸收数据偏见。
  • results: 该框架不需要先知道环境或强制假设,并且可以实现模型对于分布类型的适应。
    Abstract Neural networks are often trained with empirical risk minimization; however, it has been shown that a shift between training and testing distributions can cause unpredictable performance degradation. On this issue, a research direction, invariant learning, has been proposed to extract invariant features insensitive to the distributional changes. This work proposes EDNIL, an invariant learning framework containing a multi-head neural network to absorb data biases. We show that this framework does not require prior knowledge about environments or strong assumptions about the pre-trained model. We also reveal that the proposed algorithm has theoretical connections to recent studies discussing properties of variant and invariant features. Finally, we demonstrate that models trained with EDNIL are empirically more robust against distributional shifts.
    摘要 神经网络经常通过Empirical Risk Minimization(ERM)进行训练;然而,存在训练和测试分布之间的偏移会导致性能下降。为解决这个问题,一种研究方向——不变学习(Invariant Learning)——已经被提出,以抽取不受分布变化影响的特征。本研究提出了EDNIL框架,包括多头神经网络来吸收数据偏好。我们证明了这种框架不需要先知环境或强ASSUME预训练模型。此外,我们还发现了该算法与最近的变异和不变特征研究有理论上的连接。最后,我们通过实验表明,使用EDNIL进行训练的模型在分布变化时的性能更加稳定。

Learning to In-paint: Domain Adaptive Shape Completion for 3D Organ Segmentation

  • paper_url: http://arxiv.org/abs/2308.08775
  • repo_url: None
  • paper_authors: Mingjin Chen, Yongkang He, Yongyi Lu, Zhijing Yang
  • for: 本研究旨在把Shape信息Explicitly incorporated into current 3D organ segmentation models.
  • methods: 我们采用Masked Label Mask Modeling (MLM)方法,通过学习mask token来完成Label mask的组织器。此外,我们还提出了一种新的Shape-aware self-distillation方法,用于在Target上传递MLM shape知识。
  • results: 我们在五个公共organ segmentation dataset上进行了广泛的实验,并得到了至少1.2点的Dice分数提升,证明了我们的方法在难以控制的预测领域中的效果。
    Abstract We aim at incorporating explicit shape information into current 3D organ segmentation models. Different from previous works, we formulate shape learning as an in-painting task, which is named Masked Label Mask Modeling (MLM). Through MLM, learnable mask tokens are fed into transformer blocks to complete the label mask of organ. To transfer MLM shape knowledge to target, we further propose a novel shape-aware self-distillation with both in-painting reconstruction loss and pseudo loss. Extensive experiments on five public organ segmentation datasets show consistent improvements over prior arts with at least 1.2 points gain in the Dice score, demonstrating the effectiveness of our method in challenging unsupervised domain adaptation scenarios including: (1) In-domain organ segmentation; (2) Unseen domain segmentation and (3) Unseen organ segmentation. We hope this work will advance shape analysis and geometric learning in medical imaging.
    摘要 我们目标是将显式形态信息integrated到当前3D器官分割模型中。与之前的工作不同,我们将形态学习转换为一个填充任务,称为掩码标签掩码模型(MLM)。通过MLM,学习的掩码标签被传递到转换块中,以完成器官的标签掩码。为将MLM形态知识传递到目标上,我们进一步提议一种新的形态自适应自我热化,包括填充重建损失和假损失。我们在五个公共器官分割数据集上进行了广泛的实验,并示出了与之前的艺术品相比至少1.2点的Dice分数提升,这说明了我们的方法在不可预测的领域适应场景中的效果,包括:(1)本地器官分割;(2)未看到的频谱分割和(3)未看到的器官分割。我们希望这项工作能够推动医学影像中的形态分析和几何学学习。

URL: Combating Label Noise for Lung Nodule Malignancy Grading

  • paper_url: http://arxiv.org/abs/2308.08772
  • repo_url: https://github.com/axz520/URL
  • paper_authors: Xianze Ai, Zehui Liao, Yong Xia
  • for: This paper focuses on the problem of label noise in lung nodule malignancy grading datasets and proposes a new framework called URL to tackle this issue.
  • methods: The proposed URL framework consists of two stages: SCL and MU. SCL uses supervised contrastive learning to learn better representations, while MU generates pseudo-labels and uses temporal ensembling to obtain memory pseudo-labels that supervise the model training.
  • results: Experiments on the LIDC-IDRI dataset show that the proposed URL framework outperforms other competing methods, demonstrating its effectiveness in handling label noise and modeling the ordinal relation among classes.
    Abstract Due to the complexity of annotation and inter-annotator variability, most lung nodule malignancy grading datasets contain label noise, which inevitably degrades the performance and generalizability of models. Although researchers adopt the label-noise-robust methods to handle label noise for lung nodule malignancy grading, they do not consider the inherent ordinal relation among classes of this task. To model the ordinal relation among classes to facilitate tackling label noise in this task, we propose a Unimodal-Regularized Label-noise-tolerant (URL) framework. Our URL contains two stages, the Supervised Contrastive Learning (SCL) stage and the Memory pseudo-labels generation and Unimodal regularization (MU) stage. In the SCL stage, we select reliable samples and adopt supervised contrastive learning to learn better representations. In the MU stage, we split samples with multiple annotations into multiple samples with a single annotation and shuffle them into different batches. To handle label noise, pseudo-labels are generated using the similarity between each sample and the central feature of each class, and temporal ensembling is used to obtain memory pseudo-labels that supervise the model training. To model the ordinal relation, we introduce unimodal regularization to keep the ordinal relation among classes in the predictions. Moreover, each lung nodule is characterized by three orthographic views. Experiments conducted on the LIDC-IDRI dataset indicate the superiority of our URL over other competing methods. Code is available at https://github.com/axz520/UR.
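The MU stage's pseudo-labeling can be sketched directly from the description: soften the cosine similarity between each sample and the class centers, then smooth over epochs with an exponential moving average as the memory pseudo-labels. The temperature and EMA coefficient are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def update_memory_pseudo_labels(features, class_centers, memory, alpha=0.6, tau=0.1):
    """Pseudo-labels from cosine similarity to each class's central feature,
    smoothed over epochs with temporal ensembling (EMA held in `memory`)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = class_centers / np.linalg.norm(class_centers, axis=1, keepdims=True)
    current = softmax(f @ c.T / tau)                  # (N, C) soft pseudo-labels
    return alpha * memory + (1 - alpha) * current     # memory pseudo-labels

rng = np.random.default_rng(0)
memory = np.full((100, 5), 1 / 5)                     # uniform initial memory
memory = update_memory_pseudo_labels(rng.normal(size=(100, 64)),
                                     rng.normal(size=(5, 64)), memory)
```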

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

  • paper_url: http://arxiv.org/abs/2308.08769
  • repo_url: https://github.com/Chat-3D/Chat-3D
  • paper_authors: Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, Zhou Zhao
  • for: 提高3D场景理解的实用性,建立可对多种下游任务进行对话的全局对话系统。
  • methods: 利用预训练的3D表示和高级LLM的推理和对话能力,将3D表示映射到LLM的特征空间中,使LLM能够理解3D世界。
  • results: 实验显示,Chat-3D能够理解多种3D场景指令,进行复杂的空间推理,并将外部知识融入其回答中。在构建的指令数据集上,Chat-3D相对于GPT-4取得了75.6%的相对得分。
    Abstract 3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations and the impressive reasoning and conversation capabilities of advanced LLMs to achieve the first universal dialogue systems for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D achieves an impressive ability to comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6% relative score compared with GPT-4 on the constructed instruction dataset.
    摘要 三维场景理解已经吸引了广泛的关注,因为它们在各种应用领域中具有广泛的应用前景。然而,现有的三维场景理解方法受到特定下游任务的限制,这限制了它们在实际应用中的实用性。本文介绍了Chat-3D,它通过将预训练的三维表示与高级LLM的强大理解和对话能力相结合,实现了第一个universal对话系统 для三维场景。具体来说,我们将三维表示空间对齐到LLM的特征空间中,因此让LLM能够感受到三维世界。由于三维场景文本数据的罕见性,我们提出了三个阶段的训练策略,以更有效地利用可用的数据进行更好的对齐。为了提高理解能力和设计用户友好的交互方案,我们还制作了高质量的三维对象中心指令集和相关的对象中心提示。我们的实验表明,Chat-3D可以具有卓越的理解多种三维场景指令、进行复杂的空间逻辑和 incorporate external knowledge into its responses。在我们制作的指令集上,Chat-3D achieved a 75.6% relative score compared with GPT-4。

XVTP3D: Cross-view Trajectory Prediction Using Shared 3D Queries for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.08764
  • repo_url: None
  • paper_authors: Zijian Song, Huikun Bi, Ruisi Zhang, Tianlu Mao, Zhaoqi Wang
  • for: 这篇论文的目的是提出一种能够预测自动驾驶车辆的路径,并且确保多 vista 的预测结果保持一致性。
  • methods: 本文使用的方法包括使用共享的3D查询(XVTP3D)来生成多个目标,并使用随机遮盾法和粗糙至细的跨视观温探查来捕捉稳定的跨视特征。
  • results: 实验结果显示,XVTP3D 在两个公开available的数据集上 achieved state-of-the-art 性能,并且保持了多 vista 的预测结果一致性。
    Abstract Trajectory prediction with uncertainty is a critical and challenging task for autonomous driving. Nowadays, we can easily access sensor data represented in multiple views. However, cross-view consistency has not been evaluated by the existing models, which might lead to divergences between the multimodal predictions from different views. It is not practical and effective when the network does not comprehend the 3D scene, which could cause the downstream module in a dilemma. Instead, we predicts multimodal trajectories while maintaining cross-view consistency. We presented a cross-view trajectory prediction method using shared 3D Queries (XVTP3D). We employ a set of 3D queries shared across views to generate multi-goals that are cross-view consistent. We also proposed a random mask method and coarse-to-fine cross-attention to capture robust cross-view features. As far as we know, this is the first work that introduces the outstanding top-down paradigm in BEV detection field to a trajectory prediction problem. The results of experiments on two publicly available datasets show that XVTP3D achieved state-of-the-art performance with consistent cross-view predictions.
    摘要 几种感知资料的融合是自动驾驶中的决定性和挑战性任务。现在,我们可以轻松地存取多种检测数据。然而,不同检测视图之间的一致性尚未被现有的模型评估,这可能导致不同检测视图的多模式预测偏离。这不实际又无效当网络不理解3D场景,这可能导致下游模组受到困难。因此,我们预测多种检测路径,并维护不同检测视图之间的一致性。我们使用共享3D查询(XVTP3D)来生成跨观察方向的多个目标,并提出了随机填充方法和粗糙至细的标注捕捉强健的跨观察特征。我们相信这是首次在BEV检测领域中将顶部下降方式引入到路径预测问题中。实验结果显示,XVTP3D在两个公开可用的数据集上实现了状态顶对状态的表现。

Fine-grained Text and Image Guided Point Cloud Completion with CLIP Model

  • paper_url: http://arxiv.org/abs/2308.08754
  • repo_url: None
  • paper_authors: Wei Song, Jun Zhou, Mingjie Wang, Hongchen Tan, Nannan Li, Xiuping Liu
  • for: This paper focuses on the task of point cloud completion guided by multimodal information, with the goal of improving the generalization ability and fine-grained semantic information of the model.
  • methods: The proposed method uses a multimodal fusion network that fuses visual and textual information to predict the semantic and geometric characteristics of incomplete shapes. The network employs a pre-trained vision-language model and a multi-stage feature fusion strategy to fuse the textual and visual features.
  • results: The proposed method achieves superior performance compared to state-of-the-art point cloud completion networks, as demonstrated through extensive quantitative and qualitative experiments. The use of fine-grained text descriptions provides richer geometric details for 3D shapes, further improving the accuracy of the completion.
    Abstract This paper focuses on the recently popular task of point cloud completion guided by multimodal information. Although existing methods have achieved excellent performance by fusing auxiliary images, there are still some deficiencies, including the poor generalization ability of the model and insufficient fine-grained semantic information for extracted features. In this work, we propose a novel multimodal fusion network for point cloud completion, which can simultaneously fuse visual and textual information to predict the semantic and geometric characteristics of incomplete shapes effectively. Specifically, to overcome the lack of prior information caused by the small-scale dataset, we employ a pre-trained vision-language model that is trained with a large amount of image-text pairs. Therefore, the textual and visual encoders of this large-scale model have stronger generalization ability. Then, we propose a multi-stage feature fusion strategy to fuse the textual and visual features into the backbone network progressively. Meanwhile, to further explore the effectiveness of fine-grained text descriptions for point cloud completion, we also build a text corpus with fine-grained descriptions, which can provide richer geometric details for 3D shapes. The rich text descriptions can be used for training and evaluating our network. Extensive quantitative and qualitative experiments demonstrate the superior performance of our method compared to state-of-the-art point cloud completion networks.
    摘要 本文关注近来备受关注的多模态信息引导的点云补全任务。尽管现有方法通过融合辅助图像取得了出色的性能,但仍存在一些不足,包括模型泛化能力较差以及所提取特征缺乏细粒度语义信息。在本工作中,我们提出了一种新颖的多模态融合网络用于点云补全,能够同时融合视觉和文本信息,有效预测残缺形状的语义和几何特征。具体来说,为克服小规模数据集带来的先验信息不足问题,我们采用了在大量图文对上预训练的视觉语言模型,其文本和视觉编码器因而具有更强的泛化能力。随后,我们提出了一种多阶段特征融合策略,将文本和视觉特征逐步融合到主干网络中。同时,为进一步探索细粒度文本描述对点云补全的作用,我们还构建了一个包含细粒度描述的文本语料库,为3D形状提供更丰富的几何细节,这些描述可用于训练和评估我们的网络。大量的定量和定性实验表明,我们的方法优于当前最先进的点云补全网络。

BOTT: Box Only Transformer Tracker for 3D Object Tracking

  • paper_url: http://arxiv.org/abs/2308.08753
  • repo_url: None
  • paper_authors: Lubing Zhou, Xiaoli Meng, Yiluan Guo, Jiong Yang
  • for: 三元素 объек Tracking是自主驾驶中的重要任务,现有的 kalman滤波器基于方法仍然是最受欢迎的解决方案,但这些方法需要手工设计的运动模型,无法利用增长的数据量。
  • methods: 本文提出了盒子只 transformer跟踪器(BOTT),该方法通过将所有的3D盒在一个时间窗口中作为输入,使用 transformer自我注意力来交换所有盒子之间的信息,从而学习全局有用的盒子嵌入。
  • results: 实验显示,BOTT在 nuScenes 验证和测试分区上得到了69.9和66.7 AMOTA的竞争性性能,在 Waymo Open Dataset 验证和测试分区上得到了56.45和59.57 MOTA L2 的竞争性性能。这些结果表明,通过直接从3D盒子中学习特征使用 transformers 是一种简单 yet 有效的方法。
    Abstract Tracking 3D objects is an important task in autonomous driving. Classical Kalman Filtering based methods are still the most popular solutions. However, these methods require handcrafted designs in motion modeling and can not benefit from the growing data amounts. In this paper, Box Only Transformer Tracker (BOTT) is proposed to learn to link 3D boxes of the same object from the different frames, by taking all the 3D boxes in a time window as input. Specifically, transformer self-attention is applied to exchange information between all the boxes to learn global-informative box embeddings. The similarity between these learned embeddings can be used to link the boxes of the same object. BOTT can be used for both online and offline tracking modes seamlessly. Its simplicity enables us to significantly reduce engineering efforts required by traditional Kalman Filtering based methods. Experiments show BOTT achieves competitive performance on two largest 3D MOT benchmarks: 69.9 and 66.7 AMOTA on nuScenes validation and test splits, respectively, 56.45 and 59.57 MOTA L2 on Waymo Open Dataset validation and test splits, respectively. This work suggests that tracking 3D objects by learning features directly from 3D boxes using transformers is a simple yet effective way.
    摘要 跟踪3D对象是自动驾驶中非常重要的任务。经典的卡尔曼滤波方法仍然是最受欢迎的解决方案,然而这些方法需要手工设计运动模型,并且无法从不断增长的数据量中获益。本文提出了一种名为 Box Only Transformer Tracker(BOTT)的方法,它以一个时间窗口内的所有3D框为输入,学习将不同帧中属于同一物体的3D框链接起来。具体来说,我们使用 transformer 自注意力在所有框之间交换信息,以学习具有全局信息的框嵌入,这些嵌入之间的相似度可用于链接同一物体的框。BOTT 可以无缝地用于在线和离线跟踪模式,其简洁性大幅减少了传统卡尔曼滤波方法所需的工程投入。实验显示,BOTT 在 nuScenes 验证集和测试集上分别取得 69.9 和 66.7 AMOTA,在 Waymo Open Dataset 验证集和测试集上分别取得 56.45 和 59.57 MOTA L2,具有竞争力。这表明通过 transformer 直接从3D框中学习特征来跟踪3D对象是一种简单而有效的方法。

MIPS-Fusion: Multi-Implicit-Submaps for Scalable and Robust Online Neural RGB-D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08741
  • repo_url: None
  • paper_authors: Yijie Tang, Jiazhao Zhang, Zhinan Yu, He Wang, Kai Xu
  • for: 这个论文主要目的是提出一种基于神经网络的在线RGB-D重建方法,以实现大规模Scene的高质量重建。
  • methods: 该方法使用了一种新的神经网络表示方法——多重神经映射(Multi-Implicit-Submap,MIS),并采用分治设计来解决存储特征网格的问题。在该方法中,神经子地图在扫描轨迹中逐渐分配并高效地学习本地神经簇调整。
  • results: 对比现有的神经RGB-D重建方法,该方法可以实现更高的重建质量,特别是在大规模Scene和快速摄像机运动情况下。
    Abstract We introduce MIPS-Fusion, a robust and scalable online RGB-D reconstruction method based on a novel neural implicit representation -- multi-implicit-submap. Different from existing neural RGB-D reconstruction methods lacking either flexibility with a single neural map or scalability due to extra storage of feature grids, we propose a pure neural representation tackling both difficulties with a divide-and-conquer design. In our method, neural submaps are incrementally allocated alongside the scanning trajectory and efficiently learned with local neural bundle adjustments. The submaps can be refined individually in a back-end optimization and optimized jointly to realize submap-level loop closure. Meanwhile, we propose a hybrid tracking approach combining randomized and gradient-based pose optimizations. For the first time, randomized optimization is made possible in neural tracking with several key designs to the learning process, enabling efficient and robust tracking even under fast camera motions. The extensive evaluation demonstrates that our method attains higher reconstruction quality than the state of the arts for large-scale scenes and under fast camera motions.
    摘要 我们介绍MIPS-Fusion,一种基于新型神经隐式表示——多隐式子地图(multi-implicit-submap)——的鲁棒且可扩展的在线RGB-D重建方法。现有的神经RGB-D重建方法要么因使用单一神经地图而缺乏灵活性,要么因需要额外存储特征网格而缺乏可扩展性;与之不同,我们提出一种纯神经表示,通过分而治之的设计同时解决这两个难题。在我们的方法中,神经子地图沿扫描轨迹逐步分配,并通过局部神经捆绑调整高效学习。子地图可以在后端优化中单独细化,也可以联合优化以实现子地图级别的回环闭合。同时,我们提出了一种结合随机优化和基于梯度的位姿优化的混合跟踪方法。通过对学习过程的若干关键设计,随机优化首次在神经跟踪中成为可能,即使在相机快速运动下也能实现高效且鲁棒的跟踪。大量评估表明,我们的方法在大规模场景和相机快速运动下均能获得比现有方法更高的重建质量。

Recursive Detection and Analysis of Nanoparticles in Scanning Electron Microscopy Images

  • paper_url: http://arxiv.org/abs/2308.08732
  • repo_url: None
  • paper_authors: Aidan S. Wright, Nathaniel P. Youmans, Enrique F. Valderrama Araya
  • for: 这个研究旨在开发一个基于Python的计算框架,用于精确地检测和全面分析SEM图像中的粒子。
  • methods: 这个框架使用了多种技术,包括阈值设定、扩展和膨润,以提高图像处理结果的准确性。
  • results: 研究人员通过使用这个框架,在五个不同的测试图像中达到97%的粒子检测精度,并能够识别出强度较弱的粒子。
    Abstract In this study, we present a computational framework tailored for the precise detection and comprehensive analysis of nanoparticles within scanning electron microscopy (SEM) images. The primary objective of this framework revolves around the accurate localization of nanoparticle coordinates, accompanied by secondary objectives encompassing the extraction of pertinent morphological attributes including area, orientation, brightness, and length. Constructed leveraging the robust image processing capabilities of Python, particularly harnessing libraries such as OpenCV, SciPy, and Scikit-Image, the framework employs an amalgamation of techniques, including thresholding, dilating, and eroding, to enhance the fidelity of image processing outcomes. The ensuing nanoparticle data is seamlessly integrated into the RStudio environment to facilitate meticulous post-processing analysis. This encompasses a comprehensive evaluation of model accuracy, discernment of feature distribution patterns, and the identification of intricate particle arrangements. The finalized framework exhibits high nanoparticle identification within the primary sample image and boasts 97\% accuracy in detecting particles across five distinct test images drawn from a SEM nanoparticle dataset. Furthermore, the framework demonstrates the capability to discern nanoparticles of faint intensity, eluding manual labeling within the control group.
    摘要 在本研究中,我们提出了一种基于计算机的方法,用于准确检测和全面分析顺序电镜图像中的粒子。主要目标是准确地确定粒子坐标,并且包括次要目标,如粒子形态属性的抽取,包括面积、方向、亮度和长度。这个框架利用Python语言的强大图像处理能力,特别是使用OpenCV、SciPy和Scikit-Image库,并采用了多种技术,如阈值处理、膨润和磨灭,以提高图像处理结果的准确性。得到的粒子数据可以轻松地 интеグрироваться到RStudio环境中,以便仔细进行后处理分析。这包括完整评估模型准确度,分析特征分布模式,以及描述复杂的粒子排列。最终的框架在主要样本图像中具有高精度的粒子识别能力,并在五个不同的测试图像中达到97%的检测粒子精度。此外,该框架还能够识别强度较弱的粒子,这些粒子在控制组中逃避人工标注。
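The threshold/dilate/erode pipeline and the morphological measurements map onto standard OpenCV and scikit-image calls. A compact sketch; Otsu thresholding and the 3x3 kernel are illustrative choices rather than the framework's exact settings:

```python
import cv2
import numpy as np
from skimage import measure

def analyse_nanoparticles(path):
    """Threshold -> dilate -> erode, then measure per-particle centroid, area,
    orientation, brightness and length from the labelled regions."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.erode(cv2.dilate(binary, kernel, iterations=1), kernel, iterations=1)
    labels = measure.label(binary > 0)
    rows = []
    for r in measure.regionprops(labels, intensity_image=gray):
        rows.append({"centroid": r.centroid,            # particle coordinates
                     "area": r.area,
                     "orientation": r.orientation,
                     "brightness": r.mean_intensity,
                     "length": r.major_axis_length})
    return rows
```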

Learning Through Guidance: Knowledge Distillation for Endoscopic Image Classification

  • paper_url: http://arxiv.org/abs/2308.08731
  • repo_url: None
  • paper_authors: Harshala Gammulle, Yubo Chen, Sridha Sridharan, Travis Klein, Clinton Fookes
  • for: The paper is written to improve the accuracy and efficiency of GI tract disease diagnosis using deep learning methods, specifically Convolutional Neural Networks (CNNs).
  • methods: The paper proposes a novel multi-head attention-based feature fusion mechanism to support relation-based learning, and investigates three KD-based learning frameworks: response-based, feature-based, and relation-based.
  • results: The proposed relation-based framework achieves improved lightweight model performance (only 51.8k trainable parameters) on two widely used public datasets, KVASIR-V2 and Hyper-KVASIR, signifying the merits of the proposed method in achieving accurate and efficient disease diagnosis in resource-limited medical clinics.
    Abstract Endoscopy plays a major role in identifying any underlying abnormalities within the gastrointestinal (GI) tract. There are multiple GI tract diseases that are life-threatening, such as precancerous lesions and other intestinal cancers. In the usual process, a diagnosis is made by a medical expert which can be prone to human errors and the accuracy of the test is also entirely dependent on the expert's level of experience. Deep learning, specifically Convolution Neural Networks (CNNs) which are designed to perform automatic feature learning without any prior feature engineering, has recently reported great benefits for GI endoscopy image analysis. Previous research has developed models that focus only on improving performance, as such, the majority of introduced models contain complex deep network architectures with a large number of parameters that require longer training times. However, there is a lack of focus on developing lightweight models which can run in low-resource environments, which are typically encountered in medical clinics. We investigate three KD-based learning frameworks, response-based, feature-based, and relation-based mechanisms, and introduce a novel multi-head attention-based feature fusion mechanism to support relation-based learning. Compared to the existing relation-based methods that follow simplistic aggregation techniques of multi-teacher response/feature-based knowledge, we adopt the multi-head attention technique to provide flexibility towards localising and transferring important details from each teacher to better guide the student. We perform extensive evaluations on two widely used public datasets, KVASIR-V2 and Hyper-KVASIR, and our experimental results signify the merits of our proposed relation-based framework in achieving an improved lightweight model (only 51.8k trainable parameters) that can run in a resource-limited environment.
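The multi-head attention fusion of several teachers can be sketched with a stock attention layer: the student feature queries the set of teacher features, so each teacher contributes where it is most informative instead of being naively averaged. Dimensions and the single-token student query are assumptions:

```python
import torch
import torch.nn as nn

class TeacherFusion(nn.Module):
    """Multi-head attention over teacher features: the student feature acts as
    the query over the set of teacher features, producing a fused target for
    the relation-based distillation loss."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, student_feat, teacher_feats):
        # student_feat: (B, 1, D) query; teacher_feats: (B, T, D) keys/values.
        fused, weights = self.attn(student_feat, teacher_feats, teacher_feats)
        return fused, weights

fusion = TeacherFusion()
fused, w = fusion(torch.randn(8, 1, 256), torch.randn(8, 3, 256))
```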

Learning A Coarse-to-Fine Diffusion Transformer for Image Restoration

  • paper_url: http://arxiv.org/abs/2308.08730
  • repo_url: https://github.com/wlydlut/c2f-dft
  • paper_authors: Liyan Wang, Qinyu Yang, Cong Wang, Wei Wang, Jinshan Pan, Zhixun Su
  • for: 这个论文提出一种基于扩散 Transformer 的图像修复方法,以解决 diffusion-based 方法在图像修复任务中可能因噪声估计不准确而无法获得理想结果的问题。
  • methods: 该方法使用扩散 Transformer,包括扩散自注意力(DFSA)和扩散前馈网络(DFN),并在一种新的由粗到细的训练方案中进行训练。
  • results: 在去雨、去模糊和真实去噪 3 个任务上,该方法显著超越了基于扩散的修复方法 IR-SDE,并与基于 Transformer 的最新方法在性能上具有竞争力。
    Abstract Recent years have witnessed the remarkable performance of diffusion models in various vision tasks. However, for image restoration that aims to recover clear images with sharper details from given degraded observations, diffusion-based methods may fail to recover promising results due to inaccurate noise estimation. Moreover, simple constraining noises cannot effectively learn complex degradation information, which subsequently hinders the model capacity. To solve the above problems, we propose a coarse-to-fine diffusion Transformer (C2F-DFT) for image restoration. Specifically, our C2F-DFT contains diffusion self-attention (DFSA) and diffusion feed-forward network (DFN) within a new coarse-to-fine training scheme. The DFSA and DFN respectively capture the long-range diffusion dependencies and learn hierarchy diffusion representation to facilitate better restoration. In the coarse training stage, our C2F-DFT estimates noises and then generates the final clean image by a sampling algorithm. To further improve the restoration quality, we propose a simple yet effective fine training scheme. It first exploits the coarse-trained diffusion model with fixed steps to generate restoration results, which then would be constrained with corresponding ground-truth ones to optimize the models to remedy the unsatisfactory results affected by inaccurate noise estimation. Extensive experiments show that C2F-DFT significantly outperforms diffusion-based restoration method IR-SDE and achieves competitive performance compared with Transformer-based state-of-the-art methods on $3$ tasks, including deraining, deblurring, and real denoising. The code is available at https://github.com/wlydlut/C2F-DFT.
    摘要 近年来,扩散模型在各类视觉任务中表现出色。然而,对于旨在从退化观测中恢复清晰图像的图像修复任务,基于扩散的方法可能因噪声估计不准确而无法获得理想结果。此外,简单地约束噪声无法有效学习复杂的退化信息,从而限制了模型的能力。为解决上述问题,我们提出了一种由粗到细的扩散 Transformer(C2F-DFT)用于图像修复。具体来说,C2F-DFT 包含扩散自注意力(DFSA)和扩散前馈网络(DFN),并采用一种新的由粗到细的训练方案。DFSA 和 DFN 分别捕捉长距离扩散依赖关系和学习层次化的扩散表示,以实现更好的修复。在粗训练阶段,C2F-DFT 估计噪声,并通过采样算法生成最终的清晰图像。为进一步提高修复质量,我们提出了一种简单而有效的细训练方案:先利用粗训练得到的扩散模型以固定步数生成修复结果,再用对应的真实图像对其加以约束,以修正因噪声估计不准确而产生的不理想结果。大量实验表明,C2F-DFT 在去雨、去模糊和真实去噪 3 个任务上显著优于基于扩散的修复方法 IR-SDE,并与基于 Transformer 的最新方法性能相当。代码可在 https://github.com/wlydlut/C2F-DFT 获取。
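
A minimal sketch of the fine-training stage described above, not the authors' code: `model` and `sampler` are assumed interfaces (the coarse-trained diffusion Transformer and a fixed-step sampling routine), and the ground-truth constraint is written as a plain L1 loss.

```python
import torch
import torch.nn.functional as F

def fine_tune_step(model, sampler, degraded, ground_truth, optimizer, num_steps=4):
    """One fine-training iteration: sample a restoration with a fixed number of
    reverse steps, then constrain it with the ground truth to correct the
    unsatisfactory results caused by inaccurate noise estimation."""
    restored = sampler(model, degraded, num_steps=num_steps)  # (B, C, H, W)
    loss = F.l1_loss(restored, ground_truth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```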

Long-Range Grouping Transformer for Multi-View 3D Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08724
  • repo_url: https://github.com/liyingcv/long-range-grouping-transformer
  • paper_authors: Liying Yang, Zhenwei Zhu, Xuxin Lin, Jian Nong, Yanyan Liang
  • for: 本研究旨在提高多视图3D重建 task 中 transformer 网络的性能,特别是对于自注意处理大量视图输入的困难性。
  • methods: 我们提出了一种基于 divide-and-conquer 原理的长程分组注意力(LGA)机制,对来自不同视图的 token 进行分组注意力操作。此外,我们还设计了一种高效的编码器,利用 LGA 连接视图间特征、用标准自注意力提取视图内特征,以及一种渐进式上采样解码器用于生成较高分辨率的 voxel。
  • results: 我们的方法在ShapeNet 数据集上实现了state-of-the-art 精度水平,证明了我们的方法在多视图3D重建 task 中的效果。
    Abstract Nowadays, transformer networks have demonstrated superior performance in many computer vision tasks. In a multi-view 3D reconstruction algorithm following this paradigm, self-attention processing has to deal with intricate image tokens including massive information when facing heavy amounts of view input. The curse of information content leads to the extreme difficulty of model learning. To alleviate this problem, recent methods compress the token number representing each view or discard the attention operations between the tokens from different views. Obviously, they give a negative impact on performance. Therefore, we propose long-range grouping attention (LGA) based on the divide-and-conquer principle. Tokens from all views are grouped for separate attention operations. The tokens in each group are sampled from all views and can provide macro representation for the resided view. The richness of feature learning is guaranteed by the diversity among different groups. An effective and efficient encoder can be established which connects inter-view features using LGA and extract intra-view features using the standard self-attention layer. Moreover, a novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution. Hinging on the above, we construct a powerful transformer-based network, called LRGT. Experimental results on ShapeNet verify our method achieves SOTA accuracy in multi-view reconstruction. Code will be available at https://github.com/LiyingCV/Long-Range-Grouping-Transformer.
    摘要 如今,Transformer 网络在许多计算机视觉任务中表现出了卓越的性能。在遵循这一范式的多视图3D重建算法中,自注意力需要处理包含海量信息的复杂图像 token,尤其是在视图输入数量很大时。信息量的诅咒导致模型学习极为困难。为缓解这一问题,现有方法或压缩每个视图的 token 数量,或舍弃不同视图 token 之间的注意力操作,这显然会对性能产生负面影响。因此,我们基于分治原则提出了长程分组注意力(LGA):将所有视图的 token 分组后分别进行注意力操作,每个组中的 token 采样自所有视图,可以为其所在视图提供宏观表示,而不同组之间的多样性保证了特征学习的丰富性。借助 LGA 连接视图间特征、并用标准自注意力层提取视图内特征,可以构建一个高效的编码器。此外,我们还设计了一种新的渐进式上采样解码器,用于生成分辨率较高的 voxel。基于以上设计,我们构建了一个强大的基于 Transformer 的网络,称为 LRGT。在 ShapeNet 上的实验结果表明,我们的方法在多视图重建中达到了最优精度。代码将在 https://github.com/LiyingCV/Long-Range-Grouping-Transformer 上提供。
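
A minimal sketch of the long-range grouping attention idea, under our own assumptions about shapes (not the released implementation): tokens from all V views are split into groups by token index, and self-attention is run inside each group so that every group mixes information across views.

```python
import torch
import torch.nn as nn

def long_range_grouping_attention(tokens: torch.Tensor, attn: nn.MultiheadAttention,
                                  num_groups: int) -> torch.Tensor:
    """tokens: (V, N, C) -- N tokens per view from V views; N must divide by num_groups.
    Each group gathers the same token slots from every view and attends within the group."""
    V, N, C = tokens.shape
    assert N % num_groups == 0
    out = torch.empty_like(tokens)
    for g in range(num_groups):
        group = tokens[:, g::num_groups, :].reshape(1, -1, C)   # group g tokens from all views
        fused, _ = attn(group, group, group)                    # inter-view attention
        out[:, g::num_groups, :] = fused.reshape(V, -1, C)
    return out

# toy usage
views, n_tokens, dim = 8, 64, 256
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
x = torch.randn(views, n_tokens, dim)
y = long_range_grouping_attention(x, attn, num_groups=4)   # same shape as x
```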

Dynamic Kernel-Based Adaptive Spatial Aggregation for Learned Image Compression

  • paper_url: http://arxiv.org/abs/2308.08723
  • repo_url: https://github.com/Huairui/DKIC
  • paper_authors: Huairui Wang, Nianxiang Fu, Zhenzhong Chen, Shan Liu
  • for: 提高学习型图像压缩的率失真性能
  • methods: 使用基于动态卷积核的变换编码、自适应空间聚合与共享权重机制,并改进熵模型(粗到细的全局、通道与空间上下文,以及非对称空间-通道熵模型)
  • results: 在三个标准测试集上取得优于当前最佳学习型方法的率失真性能
    Abstract Learned image compression methods have shown superior rate-distortion performance and remarkable potential compared to traditional compression methods. Most existing learned approaches use stacked convolution or window-based self-attention for transform coding, which aggregate spatial information in a fixed range. In this paper, we focus on extending spatial aggregation capability and propose a dynamic kernel-based transform coding. The proposed adaptive aggregation generates kernel offsets to capture valid information in the content-conditioned range to help transform. With the adaptive aggregation strategy and the sharing weights mechanism, our method can achieve promising transform capability with acceptable model complexity. Besides, according to the recent progress of entropy model, we define a generalized coarse-to-fine entropy model, considering the coarse global context, the channel-wise, and the spatial context. Based on it, we introduce dynamic kernel in hyper-prior to generate more expressive global context. Furthermore, we propose an asymmetric spatial-channel entropy model according to the investigation of the spatial characteristics of the grouped latents. The asymmetric entropy model aims to reduce statistical redundancy while maintaining coding efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance on three benchmarks compared to the state-of-the-art learning-based methods.
    摘要 与传统压缩方法相比,学习型图像压缩方法已经展现出更优的率失真性能和巨大潜力。现有的学习型方法大多使用堆叠卷积或基于窗口的自注意力进行变换编码,其空间信息聚合范围是固定的。本文着眼于扩展空间聚合能力,提出一种基于动态卷积核的变换编码:自适应聚合生成卷积核偏移量,在内容条件化的范围内捕捉有效信息以辅助变换。借助自适应聚合策略和权重共享机制,我们的方法能在可接受的模型复杂度下获得良好的变换能力。此外,根据熵模型的最新进展,我们定义了一种广义的粗到细熵模型,同时考虑粗粒度全局上下文、通道上下文和空间上下文,并在超先验中引入动态卷积核以生成更具表达力的全局上下文。我们还根据分组潜变量的空间特性提出一种非对称的空间-通道熵模型,旨在降低统计冗余的同时保持编码效率。实验结果表明,我们的方法在三个基准数据集上取得了优于当前最佳学习型方法的率失真性能。
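
A rough sketch of content-conditioned kernel-offset aggregation, written under our own assumptions (it is closer to a generic deformable-sampling layer than to the paper's exact module): a small conv predicts K sampling offsets per position, features are gathered at those offset locations by bilinear interpolation, and a softmax-weighted sum aggregates them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelAggregation(nn.Module):
    """Hypothetical adaptive spatial aggregation: per-pixel offsets + weighted gather."""
    def __init__(self, channels: int, num_points: int = 9):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Conv2d(channels, 2 * num_points, 3, padding=1)  # (dy, dx) per point
        self.weight_head = nn.Conv2d(channels, num_points, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        offsets = self.offset_head(x).view(b, self.num_points, 2, h, w)
        weights = self.weight_head(x).softmax(dim=1)                    # (B, K, H, W)
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                                torch.arange(w, device=x.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).float()                    # (H, W, 2) in (x, y)
        scale = torch.tensor([(w - 1) / 2.0, (h - 1) / 2.0], device=x.device)
        out = torch.zeros_like(x)
        for k in range(self.num_points):
            off = offsets[:, k].permute(0, 2, 3, 1).flip(-1)            # (B, H, W, 2) as (dx, dy)
            grid = (base + off) / scale - 1.0                           # normalize to [-1, 1]
            sampled = F.grid_sample(x, grid, align_corners=True)        # bilinear gather
            out = out + weights[:, k:k + 1] * sampled
        return out
```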

RFD-ECNet: Extreme Underwater Image Compression with Reference to Feature Dictionary

  • paper_url: http://arxiv.org/abs/2308.08721
  • repo_url: https://github.com/lilala0/rfd-ecnet
  • paper_authors: Mengyao Li, Liquan Shen, Peng Ye, Guorui Feng, Zheyin Wang
  • for: 提高水下应用的高效性,实现水下图像(UWI)在非常窄的水下频谱中的传输。
  • methods: 首先构建了水下多尺度特征字典,为水下图像压缩提供由粗到细的参考特征。然后,提出了一种参考特征字典的极限 UWI 压缩网络(RFD-ECNet),利用特征匹配和参考特征变化来减少 UWI 之间的冗余。为了对齐多样的水下风格并提高特征匹配的准确性,提出了一种水下风格归一化块(USNB),利用从水下物理成像模型中提取的物理先验,将字典特征的水下风格向输入归一化。此外,还提出了一种参考特征变化模块(RFVM),用于自适应地变形参考特征,提高参考特征与输入特征之间的相似性。
  • results: 实验结果表明,我们的RFD-ECNet在四个UWI数据集上实现了31%的BD率减少,超过了最先进的VVC。
    Abstract Thriving underwater applications demand efficient extreme compression technology to realize the transmission of underwater images (UWIs) in very narrow underwater bandwidth. However, existing image compression methods achieve inferior performance on UWIs because they do not consider the characteristics of UWIs: (1) Multifarious underwater styles of color shift and distance-dependent clarity, caused by the unique underwater physical imaging; (2) Massive redundancy between different UWIs, caused by the fact that different UWIs contain several common ocean objects, which have plenty of similarities in structures and semantics. To remove redundancy among UWIs, we first construct an exhaustive underwater multi-scale feature dictionary to provide coarse-to-fine reference features for UWI compression. Subsequently, an extreme UWI compression network with reference to the feature dictionary (RFD-ECNet) is creatively proposed, which utilizes feature match and reference feature variant to significantly remove redundancy among UWIs. To align the multifarious underwater styles and improve the accuracy of feature match, an underwater style normalized block (USNB) is proposed, which utilizes underwater physical priors extracted from the underwater physical imaging model to normalize the underwater styles of dictionary features toward the input. Moreover, a reference feature variant module (RFVM) is designed to adaptively morph the reference features, improving the similarity between the reference and input features. Experimental results on four UWI datasets show that our RFD-ECNet is the first work that achieves a significant BD-rate saving of 31% over the most advanced VVC.
    摘要 蓬勃发展的水下应用需要高效的极限压缩技术,以便在极窄的水下带宽中传输水下图像(UWI)。然而,现有的图像压缩方法在 UWI 上性能不佳,因为它们没有考虑 UWI 的两个特点:(1)由独特的水下物理成像造成的多样水下风格,表现为色偏和随距离变化的清晰度;(2)不同 UWI 之间存在大量冗余:不同的 UWI 往往包含若干常见的海洋物体,它们在结构和语义上有许多相似之处。为了消除 UWI 之间的冗余,我们首先构建了一个详尽的水下多尺度特征字典,为 UWI 压缩提供由粗到细的参考特征。随后,我们创新地提出了参考特征字典的极限 UWI 压缩网络(RFD-ECNet),利用特征匹配和参考特征变化来显著减少 UWI 之间的冗余。为了对齐多样的水下风格并提高特征匹配的准确性,我们提出了水下风格归一化块(USNB),利用从水下物理成像模型中提取的物理先验,将字典特征的水下风格向输入归一化。此外,我们设计了参考特征变化模块(RFVM),自适应地变形参考特征,使参考特征与输入特征更加相似。在四个 UWI 数据集上的实验结果表明,RFD-ECNet 是首个相比最先进的 VVC 实现 31% BD-rate 节省的工作。
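
The style-alignment step can be pictured with a simple AdaIN-style re-statistics operation; this is only a sketch under our own assumptions, since the actual USNB additionally injects priors extracted from the underwater physical imaging model.

```python
import torch

def align_dictionary_style(dict_feat: torch.Tensor, input_feat: torch.Tensor,
                           eps: float = 1e-5) -> torch.Tensor:
    """Shift the channel-wise statistics of dictionary features toward the input.
    dict_feat, input_feat: (B, C, H, W)."""
    mu_d = dict_feat.mean(dim=(2, 3), keepdim=True)
    std_d = dict_feat.std(dim=(2, 3), keepdim=True) + eps
    mu_i = input_feat.mean(dim=(2, 3), keepdim=True)
    std_i = input_feat.std(dim=(2, 3), keepdim=True) + eps
    return (dict_feat - mu_d) / std_d * std_i + mu_i
```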

V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints

  • paper_url: http://arxiv.org/abs/2308.08715
  • repo_url: https://github.com/nburgdorfer/vfuse
  • paper_authors: Nathaniel Burgdorfer, Philippos Mordohai
  • for: 提高Multi-View Stereo(MVS)算法生成的深度和信任图的精度
  • methods: 将体积可见性约束整合进端到端可训练的架构,并联合训练一个深度搜索窗口估计子网络
  • results: 在 MVS 数据集上进行了大量实验,结果显示融合后的深度图和置信图精度有显著提升。
    Abstract We introduce a learning-based depth map fusion framework that accepts a set of depth and confidence maps generated by a Multi-View Stereo (MVS) algorithm as input and improves them. This is accomplished by integrating volumetric visibility constraints that encode long-range surface relationships across different views into an end-to-end trainable architecture. We also introduce a depth search window estimation sub-network trained jointly with the larger fusion sub-network to reduce the depth hypothesis search space along each ray. Our method learns to model depth consensus and violations of visibility constraints directly from the data; effectively removing the necessity of fine-tuning fusion parameters. Extensive experiments on MVS datasets show substantial improvements in the accuracy of the output fused depth and confidence maps.
    摘要 我们提出了一种基于学习的深度图融合框架,其输入是由多视图立体(MVS)算法生成的一组深度图和置信图,并对它们加以改进。我们将编码不同视图间长程表面关系的体积可见性约束整合进端到端可训练的架构中。我们还提出了一个与融合子网络联合训练的深度搜索窗口估计子网络,用于缩小每条光线上的深度假设搜索空间。我们的方法直接从数据中学习深度一致性以及可见性约束的违反情况,从而无需精细调节融合参数。在 MVS 数据集上的大量实验表明,输出的融合深度图和置信图的精度有显著提升。

SkinDistilViT: Lightweight Vision Transformer for Skin Lesion Classification

  • paper_url: http://arxiv.org/abs/2308.08669
  • repo_url: https://github.com/Longman-Stan/SkinDistilVit
  • paper_authors: Vlad-Constantin Lungu-Stan, Dumitru-Clementin Cercel, Florin Pop
  • for: 这个论文的目的是提供一种特定生产环境中的皮肤癌分类问题的解决方案,以匹配人类的检测精度。
  • methods: 这个论文使用知识蒸馏来训练一个基于视觉 Transformer 的模型,并在专家标注的黑色素瘤医学图像上进行训练。
  • results: 该模型保留了教师模型 98.33% 的平衡多类准确率,同时推理成本显著降低:GPU 上推理速度快 69.25%,CPU 上快 97.96%,模型体积比教师模型小 49.60%。
    Abstract Skin cancer is a treatable disease if discovered early. We provide a production-specific solution to the skin cancer classification problem that matches human performance in melanoma identification by training a vision transformer on melanoma medical images annotated by experts. Since inference cost, both time and memory wise is important in practice, we employ knowledge distillation to obtain a model that retains 98.33% of the teacher's balanced multi-class accuracy, at a fraction of the cost. Memory-wise, our model is 49.60% smaller than the teacher. Time-wise, our solution is 69.25% faster on GPU and 97.96% faster on CPU. By adding classification heads at each level of the transformer and employing a cascading distillation process, we improve the balanced multi-class accuracy of the base model by 2.1%, while creating a range of models of various sizes but comparable performance. We provide the code at https://github.com/Longman-Stan/SkinDistilVit.
    摘要 皮肤癌若能早期发现,是一种可治疗的疾病。我们针对皮肤癌分类问题提供了一个面向生产环境的解决方案:在专家标注的黑色素瘤医学图像上训练视觉 Transformer,使其在黑色素瘤识别上达到与人类相当的水平。由于推理的时间和内存开销在实践中十分重要,我们使用知识蒸馏得到一个模型,它以极低的代价保留了教师模型 98.33% 的平衡多类准确率:内存占用比教师模型小 49.60%,推理速度在 GPU 上快 69.25%,在 CPU 上快 97.96%。通过在 Transformer 的每一层添加分类头并采用级联蒸馏,我们将基础模型的平衡多类准确率提高了 2.1%,同时得到了一系列大小不同但性能相近的模型。代码见 https://github.com/Longman-Stan/SkinDistilVit。
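
For illustration, a standard temperature-scaled response-distillation loss of the kind such pipelines typically build on; the exact loss, temperature and weighting used by the authors are not stated in the abstract, so treat these values as assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.5):
    """Blend soft-target KL against the teacher with the usual hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                    F.softmax(teacher_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```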

A New Data-Driven Method to Identify Violent Facial Expression

  • paper_url: http://arxiv.org/abs/2308.08658
  • repo_url: https://github.com/arindampaulripon/A-Novel-Method-for-Machine-Learning-Based-Automatic-Crime-Activity-Identification-System-by-Analyzin
  • paper_authors: Arindam Kumar Paul, Md Maruf Hasan, Md. Delwar Hosen
  • for: 本研究旨在开发一个自动识别凶暴行为的系统,以帮助预防犯罪和保护社会。
  • methods: 本研究使用了卷积神经网络模型,并使用自动特征选择器来捕捉特定的面孔表情特征。
  • results: 研究发现,这个系统可以更加精确地识别凶暴行为的面孔表情特征,并且只需使用少量的面孔数据来训练。
    Abstract Human Facial Expressions plays an important role in identifying human actions or intention. Facial expressions can represent any specific action of any person and the pattern of violent behavior of any person strongly depends on the geographic region. Here we have designed an automated system by using a Convolutional Neural Network which can detect whether a person has any intention to commit any crime or not. Here we proposed a new method that can identify criminal intentions or violent behavior of any person before executing crimes more efficiently by using very little data on facial expressions before executing a crime or any violent tasks. Instead of using image features which is a time-consuming and faulty method we used an automated feature selector Convolutional Neural Network model which can capture exact facial expressions for training and then can predict that target facial expressions more accurately. Here we used only the facial data of a specific geographic region which can represent the violent and before-crime before-crime facial patterns of the people of the whole region.
    摘要 人类表情表达在识别人类行为或意图方面发挥重要作用。人类表情可以代表任何人的特定行为,并且任何人的暴力行为强度受地理区域的影响。我们设计了一个自动化系统,使用卷积神经网络来检测人类是否有任何犯罪意图。我们提出了一种新的方法,可以更高效地识别人类犯罪意图或暴力行为,只需使用小量的面部表达数据。而不是使用图像特征,我们使用自动化特征选择器卷积神经网络模型,可以更准确地捕捉面部表达特征,并且预测目标面部表达更加准确。我们使用了特定地理区域的面部数据,可以更好地表示该地区的暴力和前犯罪面部模式。

Flickr Africa: Examining Geo-Diversity in Large-Scale, Human-Centric Visual Data

  • paper_url: http://arxiv.org/abs/2308.08656
  • repo_url: None
  • paper_authors: Keziah Naggita, Julienne LaChance, Alice Xiang
  • for: 研究大规模图像数据集中的偏见对计算机视觉模型的性能的影响
  • methods: 使用 geotagged Flickr 图像与每个非洲国家相对比较的人类中心图像异质量进行分析,并进行两年频次的时间分析以暴露出来的数据趋势
  • results: 发现非洲的图像数据匮乏,主要由非洲以外的摄影师拍摄,需要进一步的工作以获得更加代表性的图像数据,以提高计算机视觉模型在全球范围内的应用性
    Abstract Biases in large-scale image datasets are known to influence the performance of computer vision models as a function of geographic context. To investigate the limitations of standard Internet data collection methods in low- and middle-income countries, we analyze human-centric image geo-diversity on a massive scale using geotagged Flickr images associated with each nation in Africa. We report the quantity and content of available data with comparisons to population-matched nations in Europe as well as the distribution of data according to fine-grained intra-national wealth estimates. Temporal analyses are performed at two-year intervals to expose emerging data trends. Furthermore, we present findings for an ``othering'' phenomenon as evidenced by a substantial number of images from Africa being taken by non-local photographers. The results of our study suggest that further work is required to capture image data representative of African people and their environments and, ultimately, to improve the applicability of computer vision models in a global context.
    摘要 大规模图像数据集中的偏见会随地理环境影响计算机视觉模型的表现。为了探究标准互联网数据收集方法在中低收入国家的局限性,我们利用与每个非洲国家关联的带地理标签的 Flickr 图像,在大规模上分析以人为中心的图像地理多样性。我们报告了可用数据的数量和内容,并与人口规模相当的欧洲国家进行对比,同时按照细粒度的国家内部财富估计分析数据分布。我们以两年为间隔进行时间分析,以揭示新出现的数据趋势。此外,我们还发现了一种“他者化”现象:来自非洲的图像有相当一部分是由非本地摄影师拍摄的。我们的研究结果表明,仍需进一步的工作来获取能代表非洲人群及其环境的图像数据,从而最终提升计算机视觉模型在全球范围内的适用性。

Fair GANs through model rebalancing with synthetic data

  • paper_url: http://arxiv.org/abs/2308.08638
  • repo_url: None
  • paper_authors: Anubhav Jain, Nasir Memon, Julian Togelius
  • for: 本文旨在减少生成模型中的偏差并提高其公平性。
  • methods: 本文提出一种使用潜在空间探索生成平衡数据,并使用这些数据来训练平衡的生成模型来减少生成模型中的偏差。此外,本文还提出了一种偏差纠正损失函数,可以在不平衡的数据集上提高公平性指标。
  • results: 在 FFHQ 数据集上针对种族公平性训练 Stylegan2 模型时,本文将公平性指标提高了近 5 倍,同时保持图像质量。此外,本文还在不平衡的 Cifar-10 数据集上验证了该方法的有效性。最后,本文指出传统的图像质量指标(如 Frechet inception distance, FID)并不适用于偏差缓解问题。
    Abstract Deep generative models require large amounts of training data. This often poses a problem as the collection of datasets can be expensive and difficult, in particular datasets that are representative of the appropriate underlying distribution (e.g. demographic). This introduces biases in datasets which are further propagated in the models. We present an approach to mitigate biases in an existing generative adversarial network by rebalancing the model distribution. We do so by generating balanced data from an existing unbalanced deep generative model using latent space exploration and using this data to train a balanced generative model. Further, we propose a bias mitigation loss function that shows improvements in the fairness metric even when trained with unbalanced datasets. We show results for the Stylegan2 models while training on the FFHQ dataset for racial fairness and see that the proposed approach improves on the fairness metric by almost 5 times, whilst maintaining image quality. We further validate our approach by applying it to an imbalanced Cifar-10 dataset. Lastly, we argue that the traditionally used image quality metrics such as Frechet inception distance (FID) are unsuitable for bias mitigation problems.
    摘要 深度生成模型需要大量的训练数据,而数据集的收集往往成本高、难度大,尤其是要能代表恰当的底层分布(例如人口学分布)的数据集。这会在数据集中引入偏见,并进一步传递到模型中。我们提出了一种通过重新平衡模型分布来缓解现有生成对抗网络偏见的方法:利用潜在空间探索从现有的不平衡深度生成模型中生成平衡数据,再用这些数据训练一个平衡的生成模型。此外,我们提出了一种偏见缓解损失函数,即使在不平衡数据集上训练,也能改善公平性指标。我们在 FFHQ 数据集上针对种族公平性训练 Stylegan2 模型,结果显示所提方法将公平性指标提高了近 5 倍,同时保持图像质量。我们还在不平衡的 Cifar-10 数据集上进一步验证了该方法。最后,我们指出传统的图像质量指标(如 Frechet inception distance, FID)并不适用于偏差缓解问题。
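
A toy sketch of the rebalancing idea, with every interface (`generator`, `attribute_classifier`) assumed rather than taken from the paper: sample latents, label the generated images with an attribute classifier, and keep equal numbers per group to build a balanced synthetic training set.

```python
import torch

def sample_balanced_latents(generator, attribute_classifier, num_per_group, num_groups,
                            latent_dim=512, batch=64, device="cpu"):
    """Collect latents whose generated images are evenly spread over attribute groups."""
    kept = {g: [] for g in range(num_groups)}
    while any(len(v) < num_per_group for v in kept.values()):
        z = torch.randn(batch, latent_dim, device=device)
        with torch.no_grad():
            imgs = generator(z)
            groups = attribute_classifier(imgs).argmax(dim=1)
        for zi, gi in zip(z, groups.tolist()):
            if len(kept[gi]) < num_per_group:
                kept[gi].append(zi)
    return {g: torch.stack(v) for g, v in kept.items()}   # balanced latents per group
```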

MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

  • paper_url: http://arxiv.org/abs/2308.08544
  • repo_url: https://github.com/henghuiding/MeViS
  • paper_authors: Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Chen Change Loy
  • for: 这 paper 的目的是为了研究基于动作表达的视频分割,即使用语句来描述目标对象的动作,从而在视频内部分割目标对象。
  • methods: 本 paper 使用了现有的referring video object segmentation (RVOS) 方法进行比较,并提出了一个基线方法来解决问题。
  • results: results show that current RVOS methods cannot effectively address motion expression-guided video segmentation, and the proposed baseline approach can provide a good starting point for future research.
    Abstract This paper strives for motion expressions guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. Existing referring video object datasets typically focus on salient objects and use language expressions that contain excessive static attributes that could potentially enable the target object to be identified in a single frame. These datasets downplay the importance of motion in video content for language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments. We benchmarked 5 existing referring video object segmentation (RVOS) methods and conducted a comprehensive comparison on the MeViS dataset. The results show that current RVOS methods cannot effectively address motion expression-guided video segmentation. We further analyze the challenges and propose a baseline approach for the proposed MeViS dataset. The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms that leverage motion expressions as a primary cue for object segmentation in complex video scenes. The proposed MeViS dataset has been released at https://henghuiding.github.io/MeViS.
    摘要 本文致力于由运动表达引导的视频分割,即根据描述物体运动的语句在视频内容中分割目标对象。现有的指代视频对象数据集通常关注显著物体,且其语言表达包含过多静态属性,往往使目标对象在单帧中即可被识别。这些数据集低估了视频内容中运动信息对语言引导视频对象分割的重要性。为了探究利用运动表达来定位并分割视频中对象的可行性,我们提出了一个大规模数据集 MeViS,其中包含大量在复杂环境中指示目标对象的运动表达。我们在 MeViS 数据集上对 5 种现有的指代视频对象分割(RVOS)方法进行了基准测试和全面比较。结果表明,现有的 RVOS 方法无法有效处理运动表达引导的视频分割。我们进一步分析了其中的挑战,并为 MeViS 提出了一种基线方法。该基准的目标是提供一个平台,推动以运动表达为主要线索、在复杂视频场景中进行对象分割的语言引导视频分割算法的发展。MeViS 数据集已在 https://henghuiding.github.io/MeViS 发布。

InsightMapper: A Closer Look at Inner-instance Information for Vectorized High-Definition Mapping

  • paper_url: http://arxiv.org/abs/2308.08543
  • repo_url: https://github.com/TonyXuQAQ/InsightMapper
  • paper_authors: Zhenhua Xu, Kenneth K. Y. Wong, Hengshuang Zhao
  • for: 本研究旨在提高自动驾驶车辆中 vectorized 高清晰地图的检测性能,通过利用内部实例信息进行增强。
  • methods: 本研究使用了 transformers 来利用内部实例信息,并 introduce 了三种新的设计方案,包括 Hybrid 查询生成、内部实例查询融合和内部实例特征汇集。
  • results: 在 NuScenes 数据集上的对比实验显示,我们的方法在检测性能和效率两方面均有明显优势,相比此前最佳方法提升 5.78 mAP 和 5.12 TOPO(其中 TOPO 用于评估拓扑正确性)。
    Abstract Vectorized high-definition (HD) maps contain detailed information about surrounding road elements, which are crucial for various downstream tasks in modern autonomous driving vehicles, such as vehicle planning and control. Recent works have attempted to directly detect the vectorized HD map as a point set prediction task, resulting in significant improvements in detection performance. However, these approaches fail to analyze and exploit the inner-instance correlations between predicted points, impeding further advancements. To address these challenges, we investigate the utilization of inner-$\textbf{INS}$tance information for vectorized h$\textbf{IGH}$-definition mapping through $\textbf{T}$ransformers and introduce InsightMapper. This paper presents three novel designs within InsightMapper that leverage inner-instance information in distinct ways, including hybrid query generation, inner-instance query fusion, and inner-instance feature aggregation. Comparative experiments are conducted on the NuScenes dataset, showcasing the superiority of our proposed method. InsightMapper surpasses previous state-of-the-art (SOTA) methods by 5.78 mAP and 5.12 TOPO, which assess topology correctness. Simultaneously, InsightMapper maintains high efficiency during both training and inference phases, resulting in remarkable comprehensive performance. The project page for this work is available at https://tonyxuqaq.github.io/projects/InsightMapper .
    摘要 矢量化高清地图(HD map)包含周围道路元素的详细信息,对现代自动驾驶车辆的各类下游任务(如车辆规划与控制)至关重要。近期工作尝试将矢量化高清地图检测直接当作点集预测任务处理,显著提升了检测性能。然而,这些方法没有分析和利用预测点之间的实例内相关性,限制了进一步的发展。为了解决这些挑战,我们研究了如何借助 Transformer 利用实例内信息进行矢量化高清地图构建,并提出了 InsightMapper。InsightMapper 包含三种以不同方式利用实例内信息的新设计:混合查询生成、实例内查询融合和实例内特征聚合。我们在 NuScenes 数据集上进行了对比实验,结果显示了所提方法的优越性:InsightMapper 比此前最佳(SOTA)方法提升 5.78 mAP 和 5.12 TOPO(其中 TOPO 用于评估拓扑正确性),同时在训练和推理阶段都保持高效率,综合性能出色。本工作的项目主页见 https://tonyxuqaq.github.io/projects/InsightMapper。

Ref-DVGO: Reflection-Aware Direct Voxel Grid Optimization for an Improved Quality-Efficiency Trade-Off in Reflective Scene Reconstruction

  • paper_url: http://arxiv.org/abs/2308.08530
  • repo_url: https://github.com/gkouros/ref-dvgo
  • paper_authors: Georgios Kouros, Minye Wu, Shubham Shrivastava, Sushruth Nagesh, Punarjay Chakravarty, Tinne Tuytelaars
  • for: 该研究旨在寻找一种能够平衡效率和质量的方法来处理反射物体。
  • methods: 该研究采用了一种基于储存渠道的隐式-显式方法,并采用了高效的密度基本的格子表示。
  • results: 该研究在重建质量和训练和渲染过程中提高了效率,并实现了与其他方法相当的质量效率负荷平衡。
    Abstract Neural Radiance Fields (NeRFs) have revolutionized the field of novel view synthesis, demonstrating remarkable performance. However, the modeling and rendering of reflective objects remain challenging problems. Recent methods have shown significant improvements over the baselines in handling reflective scenes, albeit at the expense of efficiency. In this work, we aim to strike a balance between efficiency and quality. To this end, we investigate an implicit-explicit approach based on conventional volume rendering to enhance the reconstruction quality and accelerate the training and rendering processes. We adopt an efficient density-based grid representation and reparameterize the reflected radiance in our pipeline. Our proposed reflection-aware approach achieves a competitive quality efficiency trade-off compared to competing methods. Based on our experimental results, we propose and discuss hypotheses regarding the factors influencing the results of density-based methods for reconstructing reflective objects. The source code is available at https://github.com/gkouros/ref-dvgo.
    摘要 神经辐射场(NeRF)已经革新了新视角合成领域,表现出色。然而,反射物体的建模与渲染仍然是具有挑战性的问题。近期方法在处理反射场景方面相比基线有了显著改进,但以效率为代价。在本工作中,我们力求在效率与质量之间取得平衡。为此,我们研究了一种基于传统体积渲染的隐式-显式方法,以提升重建质量并加速训练与渲染过程。我们采用高效的基于密度的网格表示,并在管线中对反射辐射进行重新参数化。与同类方法相比,我们提出的反射感知方法实现了具有竞争力的质量-效率权衡。基于实验结果,我们提出并讨论了影响基于密度的方法重建反射物体效果的因素的假设。源代码见 https://github.com/gkouros/ref-dvgo。
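
As background for the density-based grid representation, here is a minimal sketch of standard volume rendering along one ray from sampled densities and colors (our own simplification, not the repository code):

```python
import torch

def render_ray(density: torch.Tensor, color: torch.Tensor, step: float) -> torch.Tensor:
    """Alpha-composite N samples along one ray.
    density: (N,) raw density values sampled from the grid; color: (N, 3)."""
    alpha = 1.0 - torch.exp(-torch.relu(density) * step)          # opacity per sample
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                       # contribution of each sample
    return (weights.unsqueeze(-1) * color).sum(dim=0)             # rendered RGB
```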

Diagnosing Human-object Interaction Detectors

  • paper_url: http://arxiv.org/abs/2308.08529
  • repo_url: https://github.com/neu-vi/diag-hoi
  • paper_authors: Fangrui Zhu, Yiming Xie, Weidi Xie, Huaizu Jiang
  • for: 本文旨在提供一个用于分析现有人对物检测模型错误来源的诊断工具箱。
  • methods: 本文首先对人对物检测管道进行总体调查,然后定义了不同类型的错误和其修正方法。通过测量修正错误时的mAP提升,可以进行详细的错误分析。
  • results: 本文对人对物检测和互动类别分类任务进行了深入的分析,包括人对物检测的准确率和干扰率的计算,以及互动类别分类的mAP分布。
    Abstract Although we have witnessed significant progress in human-object interaction (HOI) detection with increasingly high mAP (mean Average Precision), a single mAP score is too concise to obtain an informative summary of a model's performance and to understand why one approach is better than another. In this paper, we introduce a diagnosis toolbox for analyzing the error sources of the existing HOI detection models. We first conduct holistic investigations in the pipeline of HOI detection, consisting of human-object pair detection and then interaction classification. We define a set of errors and the oracles to fix each of them. By measuring the mAP improvement obtained from fixing an error using its oracle, we can have a detailed analysis of the significance of different errors. We then delve into the human-object detection and interaction classification, respectively, and check the model's behavior. For the first detection task, we investigate both recall and precision, measuring the coverage of ground-truth human-object pairs as well as the noisiness level in the detections. For the second classification task, we compute mAP for interaction classification only, without considering the detection scores. We also measure the performance of the models in differentiating human-object pairs with and without actual interactions using the AP (Average Precision) score. Our toolbox is applicable for different methods across different datasets and available at https://github.com/neu-vi/Diag-HOI.
    摘要 尽管我们已经见证了人-物交互(HOI)检测在 mAP(平均精度均值)上的显著进步,但单一的 mAP 得分过于简略,既不能提供模型性能的完整概括,也难以解释为何一种方法优于另一种。在这篇论文中,我们提出了一个用于分析现有 HOI 检测模型错误来源的诊断工具箱。我们首先对 HOI 检测管线进行了整体调查,该管线包括人-物对检测和随后的交互分类两个阶段。我们定义了一组错误类型及修复它们的 oracle;通过测量使用 oracle 修复某类错误后的 mAP 提升,我们可以详细分析不同错误的重要程度。随后,我们分别深入研究人-物检测和交互分类,检查模型的行为:对于检测任务,我们同时考察召回率和精确率,衡量真实人-物对的覆盖程度以及检测结果的噪声水平;对于分类任务,我们在不考虑检测得分的情况下计算交互分类的 mAP,并用 AP(平均精度)衡量模型区分有无实际交互的人-物对的能力。我们的工具箱适用于不同数据集上的不同方法,可在 GitHub 上获取:https://github.com/neu-vi/Diag-HOI。
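
The core of such a diagnosis toolbox is just "fix one error type with its oracle, re-score, and report the mAP gain". A schematic sketch, with `evaluate_map` and the oracle functions assumed:

```python
def diagnose(predictions, ground_truth, oracles, evaluate_map):
    """oracles: dict mapping error-type name -> function that returns corrected predictions."""
    base = evaluate_map(predictions, ground_truth)
    report = {}
    for name, fix in oracles.items():
        fixed = fix(predictions, ground_truth)                    # apply only this oracle
        report[name] = evaluate_map(fixed, ground_truth) - base   # mAP improvement
    return base, report
```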

Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment

  • paper_url: http://arxiv.org/abs/2308.08525
  • repo_url: https://github.com/chenqi008/leica
  • paper_authors: Qi Chen, Chaorui Deng, Zixiong Huang, Bowen Zhang, Mingkui Tan, Qi Wu
  • for: 本研究旨在提出一种新的文本到图像生成评价指标,以更好地评估图像生成模型的表现。
  • methods: 本研究使用了一种基于概率的文本到图像生成模型,以直接估计生成图像的可能性,并提出了一些新的设计来实现准确的归因分配策略。
  • results: 在实验中,提出的指标能够 успеш地评估多种流行的文本到图像生成模型和数据集,并且可以使用只需要几十个样本来稳定评估结果,这使得它在实践中非常有效率。
    Abstract Text-to-image synthesis has made encouraging progress and attracted lots of public attention recently. However, popular evaluation metrics in this area, like the Inception Score and Fr'echet Inception Distance, incur several issues. First of all, they cannot explicitly assess the perceptual quality of generated images and poorly reflect the semantic alignment of each text-image pair. Also, they are inefficient and need to sample thousands of images to stabilise their evaluation results. In this paper, we propose to evaluate text-to-image generation performance by directly estimating the likelihood of the generated images using a pre-trained likelihood-based text-to-image generative model, i.e., a higher likelihood indicates better perceptual quality and better text-image alignment. To prevent the likelihood of being dominated by the non-crucial part of the generated image, we propose several new designs to develop a credit assignment strategy based on the semantic and perceptual significance of the image patches. In the experiments, we evaluate the proposed metric on multiple popular text-to-image generation models and datasets in accessing both the perceptual quality and the text-image alignment. Moreover, it can successfully assess the generation ability of these models with as few as a hundred samples, making it very efficient in practice.
    摘要 文本到图像生成近来取得了令人鼓舞的进展,并受到广泛关注。然而,该领域常用的评价指标(如 Inception Score 和 Fréchet Inception Distance)存在若干问题:它们无法显式评估生成图像的感知质量,也难以反映每个文本-图像对的语义对齐程度;此外,它们效率低下,需要采样数千张图像才能获得稳定的评估结果。本文提出直接利用一个预训练的基于似然的文本到图像生成模型来估计生成图像的似然,以此评估文本到图像生成的性能:似然越高,表示感知质量和文本-图像对齐越好。为了防止似然被生成图像中不重要的部分主导,我们基于图像块的语义和感知重要性设计了若干新的信用分配策略。实验中,我们在多个流行的文本到图像生成模型和数据集上评估了所提指标在感知质量和文本-图像对齐两方面的表现。此外,该指标只需大约一百个样本即可评估这些模型的生成能力,在实践中非常高效。

Painter: Teaching Auto-regressive Language Models to Draw Sketches

  • paper_url: http://arxiv.org/abs/2308.08520
  • repo_url: None
  • paper_authors: Reza Pourreza, Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Pulkit Madan, Roland Memisevic
  • for: 这篇论文旨在应用大型自然语言理解模型(LLM)来进行图像生成任务,直接从文本描述中生成虚拟的毫线绘制图像。
  • methods: 作者使用了一个基于废弃的LLM,通过精心调整和保留语言理解能力来构建了Painter模型,可以将文本描述转换为绘制图像的毫线绘制。
  • results: 作者创建了一个多bject绘制集,并使用Painter模型将文本描述转换为绘制图像,还可以从画布上除掉对象、检测和分类对象等功能。结果很有推动力。
    Abstract Large language models (LLMs) have made tremendous progress in natural language understanding and they have also been successfully adopted in other domains such as computer vision, robotics, reinforcement learning, etc. In this work, we apply LLMs to image generation tasks by directly generating the virtual brush strokes to paint an image. We present Painter, an LLM that can convert user prompts in text description format to sketches by generating the corresponding brush strokes in an auto-regressive way. We construct Painter based on off-the-shelf LLM that is pre-trained on a large text corpus, by fine-tuning it on the new task while preserving language understanding capabilities. We create a dataset of diverse multi-object sketches paired with textual prompts that covers several object types and tasks. Painter can generate sketches from text descriptions, remove objects from canvas, and detect and classify objects in sketches. Although this is an unprecedented pioneering work in using LLMs for auto-regressive image generation, the results are very encouraging.
    摘要 大型语言模型(LLM)在自然语言理解方面取得了巨大进展,并已成功应用于计算机视觉、机器人、强化学习等其他领域。在本工作中,我们将 LLM 应用于图像生成任务,通过直接生成虚拟笔画来绘制图像。我们提出了名为 Painter 的 LLM,它能够以自回归方式生成相应的笔画,将文本描述形式的用户提示转换为草图。我们基于一个在大规模文本语料上预训练的现成 LLM,通过在新任务上微调并保留其语言理解能力来构建 Painter。我们构建了一个由多物体草图与文本提示配对组成的数据集,涵盖多种物体类型和任务。Painter 可以根据文本描述生成草图,从画布上移除物体,并对草图中的物体进行检测和分类。尽管这是利用 LLM 进行自回归图像生成的开创性工作,其结果已经非常令人鼓舞。

Two-and-a-half Order Score-based Model for Solving 3D Ill-posed Inverse Problems

  • paper_url: http://arxiv.org/abs/2308.08511
  • repo_url: None
  • paper_authors: Zirong Li, Yanyang Wang, Jianjia Zhang, Weiwen Wu, Hengyong Yu
  • for: 这篇论文旨在解决 CT 和 MRI 领域中的三维不适定逆问题,包括 sparse-view CT 和 fast MRI 重建。
  • methods: 提出了一个新的 two-and-a-half order score-based model(TOSM),它在训练阶段于 2D 空间学习数据分布,降低了训练的复杂性;在重建阶段,TOSM 在 3D 空间更新数据分布,利用三个方向(sagittal、coronal 和 transaxial)的互补分数实现更精确的重建。
  • results: 透过广泛的实验,发现 TOSM 可以实现高质量的 3D 体积重建,并且有效地解决了切片间不一致的问题。
    Abstract Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) are crucial technologies in the field of medical imaging. Score-based models have proven to be effective in addressing different inverse problems encountered in CT and MRI, such as sparse-view CT and fast MRI reconstruction. However, these models face challenges in achieving accurate three dimensional (3D) volumetric reconstruction. The existing score-based models primarily focus on reconstructing two dimensional (2D) data distribution, leading to inconsistencies between adjacent slices in the reconstructed 3D volumetric images. To overcome this limitation, we propose a novel two-and-a-half order score-based model (TOSM). During the training phase, our TOSM learns data distributions in 2D space, which reduces the complexity of training compared to directly working on 3D volumes. However, in the reconstruction phase, the TOSM updates the data distribution in 3D space, utilizing complementary scores along three directions (sagittal, coronal, and transaxial) to achieve a more precise reconstruction. The development of TOSM is built on robust theoretical principles, ensuring its reliability and efficacy. Through extensive experimentation on large-scale sparse-view CT and fast MRI datasets, our method demonstrates remarkable advancements and attains state-of-the-art results in solving 3D ill-posed inverse problems. Notably, the proposed TOSM effectively addresses the inter-slice inconsistency issue, resulting in high-quality 3D volumetric reconstruction.
    摘要 计算机断层成像(CT)和磁共振成像(MRI)是医学影像领域的关键技术。基于分数的模型已被证明能够有效解决 CT 和 MRI 中的多种逆问题,例如稀疏视角 CT 和快速 MRI 重建。然而,这些模型在实现精确的三维(3D)体积重建方面面临挑战:现有的分数模型主要关注二维(2D)数据分布的重建,导致重建出的 3D 体积图像中相邻切片之间不一致。为了克服这一局限,我们提出了一种新颖的二点五阶分数模型(TOSM)。在训练阶段,TOSM 在 2D 空间学习数据分布,相比直接在 3D 体积上训练降低了复杂性;而在重建阶段,TOSM 在 3D 空间更新数据分布,利用沿三个方向(矢状、冠状和横断)的互补分数实现更精确的重建。TOSM 的设计建立在可靠的理论基础之上,保证了其可靠性与有效性。在大规模稀疏视角 CT 和快速 MRI 数据集上的大量实验表明,我们的方法在求解 3D 不适定逆问题方面取得了显著进展并达到了当前最佳水平。尤为重要的是,TOSM 有效解决了切片间不一致问题,实现了高质量的 3D 体积重建。
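
A schematic sketch of the reconstruction-time idea of combining slice-wise scores from three orientations; the score-network interface and the simple averaging are our assumptions, and the actual TOSM sampler and data-consistency steps are omitted.

```python
import torch

def three_direction_score(volume: torch.Tensor, score_net, t) -> torch.Tensor:
    """volume: (D, H, W). score_net takes a batch of 2D slices (B, 1, h, w) at time t."""
    def per_axis(v):                       # run the 2D score net slice-by-slice along dim 0
        return score_net(v.unsqueeze(1), t).squeeze(1)
    axial    = per_axis(volume)                                       # slices over D
    coronal  = per_axis(volume.permute(1, 0, 2)).permute(1, 0, 2)     # slices over H
    sagittal = per_axis(volume.permute(2, 0, 1)).permute(1, 2, 0)     # slices over W
    return (axial + coronal + sagittal) / 3.0                         # complementary 3D estimate
```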

ResBuilder: Automated Learning of Depth with Residual Structures

  • paper_url: http://arxiv.org/abs/2308.08504
  • repo_url: None
  • paper_authors: Julian Burghoff, Matthias Rottmann, Jill von Conta, Sebastian Schoenen, Andreas Witte, Hanno Gottschalk
  • for: 这篇论文是为了开发一个基于 neural architecture search 的 ResNet 架构,以达到高精度低计算成本的目标。
  • methods: 该算法使用了一种名为 Resbuilder 的神经网络搜索算法,可以从 scratch 开发 ResNet 架构,并且可以修改现有架构,还可以从 ResNet 架构中移除和插入块。
  • results: 在不同的图像分类任务上进行实验,Resbuilder 可以达到与状态艺术级的性能,同时比 off-the-shelf ResNets 减少计算成本。此外,通过对 CIFAR10 进行参数调整,我们获得了一个适合所有其他任务的默认参数集,并且这种特性可以普遍应用于实际应用场景。
    Abstract In this work, we develop a neural architecture search algorithm, termed Resbuilder, that develops ResNet architectures from scratch that achieve high accuracy at moderate computational cost. It can also be used to modify existing architectures and has the capability to remove and insert ResNet blocks, in this way searching for suitable architectures in the space of ResNet architectures. In our experiments on different image classification datasets, Resbuilder achieves close to state-of-the-art performance while saving computational cost compared to off-the-shelf ResNets. Noteworthy, we once tune the parameters on CIFAR10 which yields a suitable default choice for all other datasets. We demonstrate that this property generalizes even to industrial applications by applying our method with default parameters on a proprietary fraud detection dataset.
    摘要 在这个工作中,我们开发了一种神经网络搜索算法,即Resbuilder,可以从头来开发高精度低计算成本的ResNet架构。它可以修改现有架构,并可以移除和插入ResNet块,因此可以在ResNet架构空间中进行搜索。在我们对不同的图像分类 dataset 进行实验时,Resbuilder 能够达到 state-of-the-art 性能,同时保持下来计算成本相对较低。值得注意的是,我们在 CIFAR10 上调参得到了一个适用于所有其他 dataset 的适当默认选择。我们示示了这种性能普适性,通过在一个商业应用中使用我们的方法并在默认参数下对一个专有诈骗检测 dataset 进行应用。

Self-Supervised Online Camera Calibration for Automated Driving and Parking Applications

  • paper_url: http://arxiv.org/abs/2308.08495
  • repo_url: None
  • paper_authors: Ciarán Hogan, Ganesh Sistu, Ciarán Eising
  • for: 这个论文提出一种基于深度学习的摄像头在线标定系统,用于现代自动驾驶车辆。
  • methods: 该论文使用深度学习框架来学习摄像头的内参和外参标定,不需要任何标注或监督。
  • results: 该论文表明,该深度学习框架可以实时学习摄像头的标定参数,不需要专门的数据采集或精细调校。
    Abstract Camera-based perception systems play a central role in modern autonomous vehicles. These camera based perception algorithms require an accurate calibration to map the real world distances to image pixels. In practice, calibration is a laborious procedure requiring specialised data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence in autonomous vehicles. Hence there is a need to calibrate at regular intervals to ensure the camera is accurate. Proposed is a deep learning framework to learn intrinsic and extrinsic calibration of the camera in real time. The framework is self-supervised and doesn't require any labelling or supervision to learn the calibration parameters. The framework learns calibration without the need for any physical targets or to drive the car on special planar surfaces.
    摘要 基于摄像头的感知系统在现代自动驾驶汽车中发挥着核心作用。这些基于摄像头的感知算法需要精确的标定,才能将真实世界中的距离映射到图像像素。在实践中,标定是一项繁琐的工作,需要专门的数据采集和细致的调校,并且每当摄像头参数发生变化时都必须重新进行,而这在自动驾驶汽车中可能相当频繁。因此,需要定期进行标定以确保摄像头的精度。本文提出了一种深度学习框架,可实时学习摄像头的内参和外参标定。该框架是自监督的,不需要任何标注或监督即可学习标定参数,也不需要任何物理标定板或在特殊的平面路面上行驶。

DeDoDe: Detect, Don’t Describe – Describe, Don’t Detect for Local Feature Matching

  • paper_url: http://arxiv.org/abs/2308.08479
  • repo_url: https://github.com/parskatt/dedode
  • paper_authors: Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg
  • for: 本文主要针对的是3D重建中的关键点检测问题,即在不同视图中检测出相同的3D点集。
  • methods: 本文使用了一种直接从3D一致性中学习关键点的方法,即通过训练从大规模SfM数据中检测出轨迹来学习关键点。为了解决缺少数据的问题,我们提出了一种半监督的两视检测目标函数,以扩展检测结果的数量。
  • results: 本文的方法DeDoDe在多个几何benchmark上实现了显著的提升,代码可以在https://github.com/Parskatt/DeDoDe上下载。
    Abstract Keypoint detection is a pivotal step in 3D reconstruction, whereby sets of (up to) K points are detected in each view of a scene. Crucially, the detected points need to be consistent between views, i.e., correspond to the same 3D point in the scene. One of the main challenges with keypoint detection is the formulation of the learning objective. Previous learning-based methods typically jointly learn descriptors with keypoints, and treat the keypoint detection as a binary classification task on mutual nearest neighbours. However, basing keypoint detection on descriptor nearest neighbours is a proxy task, which is not guaranteed to produce 3D-consistent keypoints. Furthermore, this ties the keypoints to a specific descriptor, complicating downstream usage. In this work, we instead learn keypoints directly from 3D consistency. To this end, we train the detector to detect tracks from large-scale SfM. As these points are often overly sparse, we derive a semi-supervised two-view detection objective to expand this set to a desired number of detections. To train a descriptor, we maximize the mutual nearest neighbour objective over the keypoints with a separate network. Results show that our approach, DeDoDe, achieves significant gains on multiple geometry benchmarks. Code is provided at https://github.com/Parskatt/DeDoDe .
    摘要 关键点检测是 3D 重建中的关键一步:在场景的每个视图中检测(最多)K 个点,并且这些点需要在不同视图之间保持一致,即对应场景中同一个 3D 点。关键点检测的主要挑战之一是学习目标的形式化。以往基于学习的方法通常将描述子与关键点联合学习,并把关键点检测当作互为最近邻上的二分类任务。然而,以描述子最近邻为基础的关键点检测只是一个代理任务,并不能保证产生 3D 一致的关键点;而且这会把关键点绑定到特定的描述子上,使下游使用变得复杂。在本工作中,我们改为直接从 3D 一致性中学习关键点:训练检测器去检测大规模 SfM 中的轨迹点。由于这些点往往过于稀疏,我们推导了一种半监督的两视图检测目标,将检测结果扩展到所需的数量。为了训练描述子,我们用一个独立的网络在关键点上最大化互为最近邻目标。结果表明,我们的方法 DeDoDe 在多个几何基准上取得了显著提升。代码见 https://github.com/Parskatt/DeDoDe。
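
For reference, the mutual-nearest-neighbour criterion that the descriptor objective is built around can be written in a few lines; this is a generic sketch, not the DeDoDe training code.

```python
import torch

def mutual_nearest_neighbours(desc_a: torch.Tensor, desc_b: torch.Tensor):
    """desc_a: (N, D), desc_b: (M, D), assumed L2-normalized. Returns index pairs (i, j)
    that are each other's nearest neighbour under cosine similarity."""
    sim = desc_a @ desc_b.t()                 # (N, M) similarity matrix
    nn_ab = sim.argmax(dim=1)                 # best match in B for each A descriptor
    nn_ba = sim.argmax(dim=0)                 # best match in A for each B descriptor
    idx_a = torch.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a            # A -> B -> A comes back to itself
    return idx_a[mutual], nn_ab[mutual]
```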

Classification Committee for Active Deep Object Detection

  • paper_url: http://arxiv.org/abs/2308.08476
  • repo_url: https://github.com/ylzy123/CCADOD
  • paper_authors: Lei Zhao, Bo Li, Xingxing Wei
  • for: This paper proposes an active deep object detection method that uses a classification committee to select the most informative images for training the object detector.
  • methods: The proposed method uses a main detector and a classification committee to select the most informative images based on their uncertainty values. The committee is pre-trained via the Maximum Classifiers Discrepancy Group Loss (MCDGL) and the Focus on Positive Instances Loss (FPIL) to mitigate the impact of interference instances.
  • results: The proposed method outperforms state-of-the-art active learning methods in object detection tasks on Pascal VOC and COCO datasets.
    Abstract In object detection, the cost of labeling is much high because it needs not only to confirm the categories of multiple objects in an image but also to accurately determine the bounding boxes of each object. Thus, integrating active learning into object detection will raise pretty positive significance. In this paper, we propose a classification committee for active deep object detection method by introducing a discrepancy mechanism of multiple classifiers for samples' selection when training object detectors. The model contains a main detector and a classification committee. The main detector denotes the target object detector trained from a labeled pool composed of the selected informative images. The role of the classification committee is to select the most informative images according to their uncertainty values from the view of classification, which is expected to focus more on the discrepancy and representative of instances. Specifically, they compute the uncertainty for a specified instance within the image by measuring its discrepancy output by the committee pre-trained via the proposed Maximum Classifiers Discrepancy Group Loss (MCDGL). The most informative images are finally determined by selecting the ones with many high-uncertainty instances. Besides, to mitigate the impact of interference instances, we design a Focus on Positive Instances Loss (FPIL) to make the committee the ability to automatically focus on the representative instances as well as precisely encode their discrepancies for the same instance. Experiments are conducted on Pascal VOC and COCO datasets versus some popular object detectors. And results show that our method outperforms the state-of-the-art active learning methods, which verifies the effectiveness of the proposed method.
    摘要 在物体检测中,标注成本很高,因为不仅需要确认图像中多个物体的类别,还需要准确确定每个物体的边界框。因此,将主动学习引入物体检测具有十分积极的意义。在这篇论文中,我们提出了一种用于主动深度物体检测的分类委员会方法,在训练物体检测器时通过多个分类器的差异机制来选择样本。模型包括主检测器和分类委员会:主检测器是在由选出的信息量大的图像组成的标注池上训练的目标物体检测器;分类委员会的作用则是从分类的角度、依据不确定性值选出信息量最大的图像,从而更关注实例的差异性与代表性。具体地,委员会通过所提出的最大分类器差异组损失(MCDGL)进行预训练,并以其输出的差异来计算图像中指定实例的不确定性,最终选择包含较多高不确定性实例的图像作为最有信息量的图像。此外,为了减轻干扰实例的影响,我们设计了聚焦正实例损失(FPIL),使委员会能够自动关注具有代表性的实例,并精确编码同一实例上的差异。我们在 Pascal VOC 和 COCO 数据集上与若干流行的物体检测器进行了实验,结果表明我们的方法优于最新的主动学习方法,验证了所提方法的有效性。
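
A simplified sketch of committee-discrepancy-based sample scoring (our own reading of the idea, not the released code): each committee member produces class probabilities for an instance, the disagreement between members is used as the instance uncertainty, and images containing many high-uncertainty instances are selected.

```python
import torch

def committee_uncertainty(logits_per_member: torch.Tensor) -> torch.Tensor:
    """logits_per_member: (M, N, C) -- M committee members, N instances, C classes.
    Returns one uncertainty value per instance (mean pairwise L1 discrepancy)."""
    probs = logits_per_member.softmax(dim=-1)                    # (M, N, C)
    diff = probs.unsqueeze(0) - probs.unsqueeze(1)               # (M, M, N, C) pairwise gaps
    return diff.abs().sum(dim=-1).mean(dim=(0, 1))               # (N,)

def select_images(uncertainties_per_image, budget, threshold=0.5):
    """Pick the images that contain the most high-uncertainty instances."""
    scores = [(int((u > threshold).sum()), i) for i, u in enumerate(uncertainties_per_image)]
    return [i for _, i in sorted(scores, reverse=True)[:budget]]
```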

Hierarchical Uncertainty Estimation for Medical Image Segmentation Networks

  • paper_url: http://arxiv.org/abs/2308.08465
  • repo_url: None
  • paper_authors: Xinyu Bai, Wenjia Bai
  • for: 建立一个可信任的医疗影像分类模型,需要不仅评估模型的性能,而且也需要估计模型预测结果中的不确定性。
  • methods: 我们利用了这个嵌入式Encoder架构,将影像特征从细节到概要层次提取出来,然后使用 skip-connection 模组估计多个层次的不确定性。
  • results: 我们显示了,将这种多层次不确定性估计模组添加到深度学习分类网络中,可以实现高度的分类性能,同时提供了有意义的不确定性地图,可以用于过度分布检测。
    Abstract Learning a medical image segmentation model is an inherently ambiguous task, as uncertainties exist in both images (noise) and manual annotations (human errors and bias) used for model training. To build a trustworthy image segmentation model, it is important to not just evaluate its performance but also estimate the uncertainty of the model prediction. Most state-of-the-art image segmentation networks adopt a hierarchical encoder architecture, extracting image features at multiple resolution levels from fine to coarse. In this work, we leverage this hierarchical image representation and propose a simple yet effective method for estimating uncertainties at multiple levels. The multi-level uncertainties are modelled via the skip-connection module and then sampled to generate an uncertainty map for the predicted image segmentation. We demonstrate that a deep learning segmentation network such as U-net, when implemented with such hierarchical uncertainty estimation module, can achieve a high segmentation performance, while at the same time provide meaningful uncertainty maps that can be used for out-of-distribution detection.
    摘要 学习医学图像分割模型本质上是一个存在歧义的任务,因为用于模型训练的图像(噪声)和人工标注(人为错误与偏差)中都存在不确定性。要建立一个可信赖的图像分割模型,不仅要评估其性能,还需要估计模型预测的不确定性。大多数最先进的图像分割网络采用分层编码器架构,从精细到粗略的多个分辨率层次提取图像特征。在本工作中,我们利用这种分层图像表示,提出了一种简单而有效的多层次不确定性估计方法:通过 skip-connection 模块对多层次不确定性进行建模,再经采样生成预测分割结果的不确定性图。我们展示了,像 U-net 这样的深度学习分割网络在加入这种分层不确定性估计模块后,既能取得高分割性能,又能提供有意义的不确定性图,可用于分布外检测。
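
A rough sketch of turning multi-level predictions into a single uncertainty map; the aggregation by upsampled per-pixel entropy is our own assumption, whereas the paper models distributions at the skip connections and samples from them.

```python
import torch
import torch.nn.functional as F

def multilevel_uncertainty(logits_per_level, out_size):
    """logits_per_level: list of (B, C, h_l, w_l) predictions from different decoder levels.
    Returns a (B, 1, H, W) uncertainty map as the mean upsampled per-pixel entropy."""
    maps = []
    for logits in logits_per_level:
        p = logits.softmax(dim=1)
        entropy = -(p * (p + 1e-8).log()).sum(dim=1, keepdim=True)       # (B, 1, h_l, w_l)
        maps.append(F.interpolate(entropy, size=out_size, mode="bilinear",
                                  align_corners=False))
    return torch.stack(maps).mean(dim=0)
```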

Learning to Distill Global Representation for Sparse-View CT

  • paper_url: http://arxiv.org/abs/2308.08463
  • repo_url: None
  • paper_authors: Zilong Li, Chenglong Ma, Jie Chen, Junping Zhang, Hongming Shan
  • for: 这个论文主要针对稀疏视角 CT(使用少量投影进行断层重建)图像中的强伪影问题,通过图像后处理技术提高图像质量。
  • methods: 该论文坚持图像后处理路线,提出全局表示(GloRe)蒸馏框架 GloReDi:利用傅里叶卷积学习具有全图感受野的 GloRe,并从中间视角重建图像中蒸馏 GloRe,包括表示方向蒸馏和带通对比蒸馏两个关键组件。
  • results: 实验结果表明,所提出的 GloReDi 在稀疏视角 CT 重建上显著优于包括双域方法在内的现有最新方法。
    Abstract Sparse-view computed tomography (CT) -- using a small number of projections for tomographic reconstruction -- enables much lower radiation dose to patients and accelerated data acquisition. The reconstructed images, however, suffer from strong artifacts, greatly limiting their diagnostic value. Current trends for sparse-view CT turn to the raw data for better information recovery. The resultant dual-domain methods, nonetheless, suffer from secondary artifacts, especially in ultra-sparse view scenarios, and their generalization to other scanners/protocols is greatly limited. A crucial question arises: have the image post-processing methods reached the limit? Our answer is not yet. In this paper, we stick to image post-processing methods due to great flexibility and propose global representation (GloRe) distillation framework for sparse-view CT, termed GloReDi. First, we propose to learn GloRe with Fourier convolution, so each element in GloRe has an image-wide receptive field. Second, unlike methods that only use the full-view images for supervision, we propose to distill GloRe from intermediate-view reconstructed images that are readily available but not explored in previous literature. The success of GloRe distillation is attributed to two key components: representation directional distillation to align the GloRe directions, and band-pass-specific contrastive distillation to gain clinically important details. Extensive experiments demonstrate the superiority of the proposed GloReDi over the state-of-the-art methods, including dual-domain ones. The source code is available at https://github.com/longzilicart/GloReDi.
    摘要 稀疏视角计算机断层成像(CT)仅使用少量投影进行断层重建,可以大幅降低患者所受的辐射剂量并加速数据采集。然而,重建出的图像存在强烈的伪影,严重限制了其诊断价值。当前稀疏视角 CT 的趋势是转向原始数据以获取更多信息,但由此产生的双域方法存在二次伪影,在超稀疏视角场景下尤为明显,而且其对其他扫描仪/协议的泛化能力十分有限。一个关键问题随之而来:图像后处理方法是否已经达到极限?我们的答案是:还没有。在本文中,鉴于图像后处理方法具有很大的灵活性,我们坚持这一路线,提出了用于稀疏视角 CT 的全局表示(GloRe)蒸馏框架,称为 GloReDi。首先,我们提出用傅里叶卷积来学习 GloRe,使 GloRe 中的每个元素都具有覆盖整幅图像的感受野。其次,与仅使用全视角图像进行监督的方法不同,我们提出从中间视角重建图像(这些图像现成可得,但此前的文献并未加以利用)中蒸馏 GloRe。GloRe 蒸馏的成功归功于两个关键组件:对齐 GloRe 方向的表示方向蒸馏,以及获取临床重要细节的带通对比蒸馏。大量实验证明了所提 GloReDi 优于包括双域方法在内的现有最新方法。源代码可在 https://github.com/longzilicart/GloReDi 获取。
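
To illustrate why a Fourier convolution gives each element an image-wide receptive field, here is a minimal Fourier-unit sketch (a generic FFC-style block under our own assumptions, not the GloReDi module):

```python
import torch
import torch.nn as nn

class FourierUnit(nn.Module):
    """Pointwise convolution in the frequency domain: every output location depends on
    the whole spatial input, because each frequency bin mixes information globally."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                   # (B, C, H, W//2+1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)           # stack real/imag as channels
        spec = torch.relu(self.conv(spec))
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
```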