cs.CV - 2023-10-10

BeSt-LeS: Benchmarking Stroke Lesion Segmentation using Deep Supervision

  • paper_url: http://arxiv.org/abs/2310.07060
  • repo_url: https://github.com/prantik-pdeb/best-les
  • paper_authors: Prantik Deb, Lalith Bharadwaj Baru, Kamalaker Dadi, Bapi Raju S
  • for: This study aims to aid clinicians in the immediate identification of stroke and in risk stratification by providing automated segmentation of stroke lesions.
  • methods: Various end-to-end supervised U-Net-style models are benchmarked on the publicly available ATLAS $v2.0$ dataset, on both 2D and 3D brain images, and evaluated with standard metrics.
  • results: The 2D transformer-based model achieves the highest Dice score of 0.583 and the 3D residual U-Net achieves 0.504; a Wilcoxon test is used to relate predicted and actual stroke volumes for the 3D models.
    Abstract Brain stroke has become a significant burden on global health and thus we need remedies and prevention strategies to overcome this challenge. For this, the immediate identification of stroke and risk stratification is the primary task for clinicians. To aid expert clinicians, automated segmentation models are crucial. In this work, we consider the publicly available dataset ATLAS $v2.0$ to benchmark various end-to-end supervised U-Net style models. Specifically, we have benchmarked models on both 2D and 3D brain images and evaluated them using standard metrics. We have achieved the highest Dice score of 0.583 on the 2D transformer-based model and 0.504 on the 3D residual U-Net respectively. We have conducted the Wilcoxon test for 3D models to correlate the relationship between predicted and actual stroke volume. For reproducibility, the code and model weights are made publicly available: https://github.com/prantik-pdeb/BeSt-LeS.
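    To make the evaluation concrete, here is a minimal sketch of how a Dice score and a Wilcoxon signed-rank test between predicted and ground-truth stroke volumes could be computed; the per-subject volume arrays and the use of scipy.stats.wilcoxon are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from scipy.stats import wilcoxon

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary lesion masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Hypothetical per-subject lesion volumes (in voxels) from 3D predictions and labels.
pred_volumes = np.array([1200, 450, 3100, 90, 780])
true_volumes = np.array([1150, 500, 2900, 120, 800])

# Paired, non-parametric test of whether predicted and actual volumes differ.
stat, p_value = wilcoxon(pred_volumes, true_volumes)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.3f}")
```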

TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

  • paper_url: http://arxiv.org/abs/2310.07056
  • repo_url: None
  • paper_authors: Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan
  • for: This work studies Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG), removing the need for the large amounts of pixel-wise densely-annotated data required by existing fully supervised methods.
  • methods: A new framework, TextPSG, is proposed, consisting of four modules (a region grouper, an entity grounder, a segment merger, and a label generator) together with several novel techniques.
  • results: The method significantly outperforms the baselines and achieves strong out-of-distribution robustness; comprehensive ablation studies corroborate the design choices and an in-depth analysis highlights future directions.
    Abstract Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is always tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging for three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results can thus serve as pseudo labels enabling the segment merger to learn the segment similarity as well as guiding the label generator to learn object semantics and relation predicates, resulting in a fine-grained structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions. Our code, data, and results are available on our project page: https://vis-www.cs.umass.edu/TextPSG.

Utilizing Synthetic Data for Medical Vision-Language Pre-training: Bypassing the Need for Real Images

  • paper_url: http://arxiv.org/abs/2310.07027
  • repo_url: None
  • paper_authors: Che Liu, Anand Shah, Wenjia Bai, Rossella Arcucci
  • for: This work examines whether medical vision-language pre-training (VLP) can be carried out effectively using only synthetic medical images, without large-scale paired image-text datasets.
  • methods: Real medical images are replaced with synthetic equivalents generated from authentic radiology reports, and three state-of-the-art VLP algorithms are trained exclusively on these synthetic samples.
  • results: Across image classification, semantic segmentation, and object detection, performance with synthetic data is on par with or even exceeds that obtained with real images; a large-scale synthetic medical image dataset paired with anonymized real radiology reports is also introduced.
    Abstract Medical Vision-Language Pre-training (VLP) learns representations jointly from medical images and paired radiology reports. It typically requires large-scale paired image-text datasets to achieve effective pre-training for both the image encoder and text encoder. The advent of text-guided generative models raises a compelling question: Can VLP be implemented solely with synthetic images generated from genuine radiology reports, thereby mitigating the need for extensively pairing and curating image-text datasets? In this work, we scrutinize this very question by examining the feasibility and effectiveness of employing synthetic images for medical VLP. We replace real medical images with their synthetic equivalents, generated from authentic medical reports. Utilizing three state-of-the-art VLP algorithms, we exclusively train on these synthetic samples. Our empirical evaluation across three subsequent tasks, namely image classification, semantic segmentation and object detection, reveals that the performance achieved through synthetic data is on par with or even exceeds that obtained with real images. As a pioneering contribution to this domain, we introduce a large-scale synthetic medical image dataset, paired with anonymized real radiology reports. This alleviates the need of sharing medical images, which are not easy to curate and share in practice. The code and the dataset will be made publicly available upon paper acceptance.

Pre-Trained Masked Image Model for Mobile Robot Navigation

  • paper_url: http://arxiv.org/abs/2310.07021
  • repo_url: None
  • paper_authors: Vishnu Dutt Sharma, Anukriti Singh, Pratap Tokekar
  • for: This paper explores how foundational vision models can predict environmental structure to improve navigation and exploration for mobile robots.
  • methods: Masked Autoencoders pre-trained on street images are used, without any fine-tuning, for field-of-view expansion, single-agent topological exploration, and multi-agent exploration for indoor mapping, across different input modalities.
  • results: The work shows that foundational vision models generalize to these structure-prediction applications without fine-tuning, which is especially valuable when training data is scarce. More qualitative results are available at https://raaslab.org/projects/MIM4Robots.
    Abstract 2D top-down maps are commonly used for the navigation and exploration of mobile robots through unknown areas. Typically, the robot builds the navigation maps incrementally from local observations using onboard sensors. Recent works have shown that predicting the structural patterns in the environment through learning-based approaches can greatly enhance task efficiency. While many such works build task-specific networks using limited datasets, we show that the existing foundational vision networks can accomplish the same without any fine-tuning. Specifically, we use Masked Autoencoders, pre-trained on street images, to present novel applications for field-of-view expansion, single-agent topological exploration, and multi-agent exploration for indoor mapping, across different input modalities. Our work motivates the use of foundational vision models for generalized structure prediction-driven applications, especially in the dearth of training data. For more qualitative results see https://raaslab.org/projects/MIM4Robots.

Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

  • paper_url: http://arxiv.org/abs/2310.06992
  • repo_url: None
  • paper_authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki
  • for: This paper proposes an open-vocabulary video tracking method built on large pre-trained models to track and segment objects of any category in 2D videos.
  • methods: An open-vocabulary detector, a segmenter, and a dense optical flow estimator are re-purposed: detected boxes are propagated frame to frame with a flow-based motion model, refined with the detector's box regression module, and used to prompt an open-world segmenter.
  • results: The method achieves strong performance on multiple established video object segmentation and tracking benchmarks, re-identifies objects across occlusions, and outperforms the previous state-of-the-art on UVO and BURST.
    Abstract Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined box to segment the objects. We decide the termination of an object track based on the objectness score of the propagated boxes, as well as forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms previous state-of-the-art in UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
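    A schematic sketch of the per-frame propagate-refine-segment loop described in the abstract; every callable here (detect_open_vocab, estimate_flow, refine_box, segment_from_box) is a hypothetical placeholder for the open-vocabulary detector, optical flow estimator, and open-world segmenter that the paper re-purposes, and the forward-backward flow consistency check and re-identification step are omitted for brevity.

```python
from typing import Callable, Dict, List

def warp_box_with_flow(box, flow):
    """Shift a box (x1, y1, x2, y2) by the mean optical flow inside it."""
    x1, y1, x2, y2 = [int(v) for v in box]
    dx = float(flow[y1:y2, x1:x2, 0].mean())
    dy = float(flow[y1:y2, x1:x2, 1].mean())
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

def track_video(frames: List,
                detect_open_vocab: Callable,   # frame -> list of (box, label, score)
                estimate_flow: Callable,       # (frame_t, frame_t+1) -> HxWx2 flow field
                refine_box: Callable,          # (frame, box) -> (box, objectness)
                segment_from_box: Callable,    # (frame, box) -> binary mask
                objectness_thresh: float = 0.5) -> List[Dict]:
    """Propagate open-vocabulary detections frame to frame with a flow-based motion model."""
    tracks = [{"box": b, "label": l, "masks": [segment_from_box(frames[0], b)], "alive": True}
              for (b, l, _) in detect_open_vocab(frames[0])]
    for t in range(len(frames) - 1):
        flow = estimate_flow(frames[t], frames[t + 1])
        for tr in tracks:
            if not tr["alive"]:
                continue
            warped = warp_box_with_flow(tr["box"], flow)          # flow-based propagation
            box, objectness = refine_box(frames[t + 1], warped)   # detector box regression
            if objectness < objectness_thresh:
                tr["alive"] = False                               # terminate the track
                continue
            tr["box"] = box
            tr["masks"].append(segment_from_box(frames[t + 1], box))
    return tracks
```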

Leveraging Neural Radiance Fields for Uncertainty-Aware Visual Localization

  • paper_url: http://arxiv.org/abs/2310.06984
  • repo_url: None
  • paper_authors: Le Chen, Weirong Chen, Rui Wang, Marc Pollefeys
  • for: To improve the data efficiency and accuracy of Scene Coordinate Regression (SCR) so that it becomes a reliable visual localization technique.
  • methods: Neural Radiance Fields (NeRF) are used to generate SCR training samples, with uncertainties predicted separately for the rendered color and depth images; SCR is formulated as deep evidential learning with epistemic uncertainty, and a view selection policy is built from these three kinds of uncertainty.
  • results: Experiments on public datasets show that the method selects the samples with the most information gain, improving the data efficiency and accuracy of SCR.
    Abstract As a promising fashion for visual localization, scene coordinate regression (SCR) has seen tremendous progress in the past decade. Most recent methods usually adopt neural networks to learn the mapping from image pixels to 3D scene coordinates, which requires a vast amount of annotated training data. We propose to leverage Neural Radiance Fields (NeRF) to generate training samples for SCR. Despite NeRF's efficiency in rendering, many of the rendered data are polluted by artifacts or only contain minimal information gain, which can hinder the regression accuracy or bring unnecessary computational costs with redundant data. These challenges are addressed in three folds in this paper: (1) A NeRF is designed to separately predict uncertainties for the rendered color and depth images, which reveal data reliability at the pixel level. (2) SCR is formulated as deep evidential learning with epistemic uncertainty, which is used to evaluate information gain and scene coordinate quality. (3) Based on the three arts of uncertainties, a novel view selection policy is formed that significantly improves data efficiency. Experiments on public datasets demonstrate that our method could select the samples that bring the most information gain and promote the performance with the highest efficiency.

Data Distillation Can Be Like Vodka: Distilling More Times For Better Quality

  • paper_url: http://arxiv.org/abs/2310.06982
  • repo_url: None
  • paper_authors: Xuxi Chen, Yu Yang, Zhangyang Wang, Baharan Mirzasoleiman
  • for: To reduce the time and memory needed for training deep networks on large datasets by creating a small set of synthetic images whose generalization performance is similar to that of the full dataset.
  • methods: Progressive Dataset Distillation (PDD) is proposed: it synthesizes multiple synthetic subsets, each conditioned on the previous ones, and trains the model on the cumulative union of these subsets to capture the training dynamics at different phases of training, without requiring additional training time.
  • results: Experiments show that PDD improves the performance of existing dataset distillation methods by up to 4.3% and, for the first time, enables generating considerably larger synthetic datasets.
    Abstract Dataset distillation aims to minimize the time and memory needed for training deep networks on large datasets, by creating a small set of synthetic images that has a similar generalization performance to that of the full dataset. However, current dataset distillation techniques fall short, showing a notable performance gap when compared to training on the original data. In this work, we are the first to argue that using just one synthetic subset for distillation will not yield optimal generalization performance. This is because the training dynamics of deep networks drastically change during the training. Hence, multiple synthetic subsets are required to capture the training dynamics at different phases of training. To address this issue, we propose Progressive Dataset Distillation (PDD). PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets without requiring additional training time. Our extensive experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%. In addition, our method for the first time enable generating considerably larger synthetic datasets.
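    A schematic sketch of the progressive idea, assuming a generic distill_subset routine that synthesizes a small subset conditioned on everything distilled so far; the function names and the PyTorch-style loop are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def progressive_dataset_distillation(real_loader, model, distill_subset,
                                     num_stages: int = 3, epochs_per_stage: int = 10):
    """Sketch of PDD: synthesize subsets stage by stage and train on their cumulative union."""
    synthetic_union = []  # cumulative list of (images, labels) batches
    for stage in range(num_stages):
        # Each new subset is conditioned on what has already been distilled, so it can
        # capture the training dynamics of the current phase of training.
        new_subset = distill_subset(real_loader, model, condition_on=synthetic_union)
        synthetic_union.extend(new_subset)

        optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        for _ in range(epochs_per_stage):
            for images, labels in synthetic_union:
                optimizer.zero_grad()
                loss = F.cross_entropy(model(images), labels)
                loss.backward()
                optimizer.step()
    return model, synthetic_union
```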

ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning

  • paper_url: http://arxiv.org/abs/2310.06968
  • repo_url: None
  • paper_authors: Alec Helbling, Evan Montoya, Duen Horng Chau
  • for: Text-to-image generative models can produce high-fidelity images from prompts but struggle to generate the same objects with the same appearance across different contexts.
  • methods: ObjectComposer is proposed for generating compositions of multiple objects that resemble user-specified reference images; the approach is training-free and builds on the existing BLIP-Diffusion model.
  • results: ObjectComposer consistently generates compositions containing multiple specific objects while preserving the user-specified appearance, all without modifying the weights of the underlying models, which keeps it practical for real-time use at scale.
    Abstract Recent text-to-image generative models can generate high-fidelity images from text prompts. However, these models struggle to consistently generate the same objects in different contexts with the same appearance. Consistent object generation is important to many downstream tasks like generating comic book illustrations with consistent characters and setting. Numerous approaches attempt to solve this problem by extending the vocabulary of diffusion models through fine-tuning. However, even lightweight fine-tuning approaches can be prohibitively expensive to run at scale and in real-time. We introduce a method called ObjectComposer for generating compositions of multiple objects that resemble user-specified images. Our approach is training-free, leveraging the abilities of preexisting models. We build upon the recent BLIP-Diffusion model, which can generate images of single objects specified by reference images. ObjectComposer enables the consistent generation of compositions containing multiple specific objects simultaneously, all without modifying the weights of the underlying models.

Comparing the robustness of modern no-reference image- and video-quality metrics to adversarial attacks

  • paper_url: http://arxiv.org/abs/2310.06958
  • repo_url: https://github.com/msu-video-group/msu_metrics_robustness_benchmark
  • paper_authors: Anastasia Antsiferova, Khaled Abud, Aleksandr Gushchin, Sergey Lavrushkin, Ekaterina Shumitskaya, Maksim Velikanov, Dmitriy Vatolin
  • for: This study analyzes the adversarial robustness of modern no-reference image- and video-quality metrics to determine which metrics are safer to use.
  • methods: Adversarial attacks from computer vision tasks are adapted and their efficiency compared against 15 no-reference image/video-quality metrics.
  • results: Some metrics show high resistance to adversarial attacks, making them safer to use in benchmarks. The benchmark accepts new metric submissions from researchers who want to make their metrics more robust or to find robust metrics for their needs; it can be tried via pip install robustness-benchmark.
    Abstract Nowadays neural-network-based image- and video-quality metrics show better performance compared to traditional methods. However, they also became more vulnerable to adversarial attacks that increase metrics' scores without improving visual quality. The existing benchmarks of quality metrics compare their performance in terms of correlation with subjective quality and calculation time. However, the adversarial robustness of image-quality metrics is also an area worth researching. In this paper, we analyse modern metrics' robustness to different adversarial attacks. We adopted adversarial attacks from computer vision tasks and compared attacks' efficiency against 15 no-reference image/video-quality metrics. Some metrics showed high resistance to adversarial attacks which makes their usage in benchmarks safer than vulnerable metrics. The benchmark accepts new metrics submissions for researchers who want to make their metrics more robust to attacks or to find such metrics for their needs. Try our benchmark using pip install robustness-benchmark.
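    As a rough illustration of the kind of attack being benchmarked (not the benchmark's own code), the sketch below runs projected gradient ascent on a differentiable no-reference metric so that the predicted score rises while the perturbation stays within an L-infinity budget; metric_model is an assumed PyTorch module that maps an image batch to quality scores.

```python
import torch

def attack_nr_metric(image: torch.Tensor, metric_model: torch.nn.Module,
                     epsilon: float = 4 / 255, alpha: float = 1 / 255, steps: int = 10):
    """Inflate a no-reference quality score with an L-infinity-bounded perturbation."""
    metric_model.eval()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        score = metric_model(adv).mean()          # higher score = "better" predicted quality
        score.backward()
        with torch.no_grad():
            adv = adv + alpha * adv.grad.sign()                     # gradient ascent step
            adv = image + (adv - image).clamp(-epsilon, epsilon)    # project to the budget
            adv = adv.clamp(0.0, 1.0)                               # keep a valid image
    return adv.detach()
```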

End-to-end Evaluation of Practical Video Analytics Systems for Face Detection and Recognition

  • paper_url: http://arxiv.org/abs/2310.06945
  • repo_url: None
  • paper_authors: Praneet Singh, Edward J. Delp, Amy R. Reibman
  • for: This paper evaluates, end to end, a practical video analytics system for face detection and recognition of the kind deployed in bandwidth-constrained environments such as autonomous vehicles.
  • methods: Inputs are compressed with the popular HEVC video codec and then passed to modules that perform face detection, alignment, and recognition sequentially; the evaluation uses a driving-specific dataset, which enables meaningful interpretations.
  • results: The study shows how independent module evaluation, dataset imbalance, and inconsistent annotations can lead to incorrect system performance estimates. Strategies are proposed to create balanced evaluation subsets and make annotations consistent across analytics tasks and scenarios, and the end-to-end system is then evaluated sequentially to account for task interdependencies, yielding consistent, accurate, and interpretable performance estimates that matter for real-world applications.
    Abstract Practical video analytics systems that are deployed in bandwidth constrained environments like autonomous vehicles perform computer vision tasks such as face detection and recognition. In an end-to-end face analytics system, inputs are first compressed using popular video codecs like HEVC and then passed onto modules that perform face detection, alignment, and recognition sequentially. Typically, the modules of these systems are evaluated independently using task-specific imbalanced datasets that can misconstrue performance estimates. In this paper, we perform a thorough end-to-end evaluation of a face analytics system using a driving-specific dataset, which enables meaningful interpretations. We demonstrate how independent task evaluations, dataset imbalances, and inconsistent annotations can lead to incorrect system performance estimates. We propose strategies to create balanced evaluation subsets of our dataset and to make its annotations consistent across multiple analytics tasks and scenarios. We then evaluate the end-to-end system performance sequentially to account for task interdependencies. Our experiments show that our approach provides consistent, accurate, and interpretable estimates of the system's performance which is critical for real-world applications.

Self-supervised Object-Centric Learning for Videos

  • paper_url: http://arxiv.org/abs/2310.06907
  • repo_url: https://github.com/shvdiwnkozbw/SMTC
  • paper_authors: Görkay Aydemir, Weidi Xie, Fatma Güney
  • for: This paper proposes the first fully unsupervised method for segmenting multiple objects in real-world video sequences.
  • methods: An object-centric framework spatially binds objects to slots on each frame and relates these slots across frames; training reconstructs the middle frame in a high-level semantic feature space, a masking strategy drops a significant portion of feature-space tokens for efficiency and regularization, and over-clustering is addressed by merging slots based on similarity.
  • results: The method successfully segments multiple instances of complex and high-variety classes in YouTube videos.
    Abstract Unsupervised multi-object segmentation has shown impressive results on images by utilizing powerful semantics learned from self-supervised pretraining. An additional modality such as depth or motion is often used to facilitate the segmentation in video sequences. However, the performance improvements observed in synthetic sequences, which rely on the robustness of an additional cue, do not translate to more challenging real-world scenarios. In this paper, we propose the first fully unsupervised method for segmenting multiple objects in real-world sequences. Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames. From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization. Additionally, we address over-clustering by merging slots based on similarity. Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.

Distillation Improves Visual Place Recognition for Low-Quality Queries

  • paper_url: http://arxiv.org/abs/2310.06906
  • repo_url: None
  • paper_authors: Anbang Yang, Yao Wang, John-Ross Rizzo, Chen Feng
  • for: To improve visual place recognition (VPR) performance for low-quality query images.
  • methods: High-quality queries are used only during training to distill better feature representations, using an MSE loss between the global descriptors of queries of different qualities and an inter-channel correlation knowledge distillation (ICKD) loss over the corresponding intermediate features.
  • results: Fine-tuning NetVLAD with these distillation-augmented losses yields notable recall improvements for low-quality queries on the Pittsburgh 250k dataset and on the authors' own indoor dataset with varying quantization levels.
    Abstract The shift to online computing for real-time visual localization often requires streaming query images/videos to a server for visual place recognition (VPR), where fast video transmission may result in reduced resolution or increased quantization. This compromises the quality of global image descriptors, leading to decreased VPR performance. To improve the low recall rate for low-quality query images, we present a simple yet effective method that uses high-quality queries only during training to distill better feature representations for deep-learning-based VPR, such as NetVLAD. Specifically, we use mean squared error (MSE) loss between the global descriptors of queries with different qualities, and inter-channel correlation knowledge distillation (ICKD) loss over their corresponding intermediate features. We validate our approach using the both Pittsburgh 250k dataset and our own indoor dataset with varying quantization levels. By fine-tuning NetVLAD parameters with our distillation-augmented losses, we achieve notable VPR recall-rate improvements over low-quality queries, as demonstrated in our extensive experimental results. We believe this work not only pushes forward the VPR research but also provides valuable insights for applications needing dependable place recognition under resource-limited conditions.
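    A minimal sketch of the distillation objective, assuming a shared VPR model (e.g., a NetVLAD-style network) that returns a global descriptor and an intermediate feature map; the simplified inter-channel correlation term below stands in for the ICKD loss and is an assumption, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def channel_correlation(feat: torch.Tensor) -> torch.Tensor:
    """Inter-channel correlation matrix of a (B, C, H, W) feature map."""
    b, c, h, w = feat.shape
    flat = F.normalize(feat.reshape(b, c, h * w), dim=2)
    return flat @ flat.transpose(1, 2)            # (B, C, C)

def distillation_loss(model, hq_images, lq_images, lambda_ickd: float = 1.0):
    """MSE between global descriptors plus correlation matching on intermediate features."""
    with torch.no_grad():                          # the high-quality branch acts as the teacher
        hq_desc, hq_feat = model(hq_images)
    lq_desc, lq_feat = model(lq_images)
    desc_loss = F.mse_loss(lq_desc, hq_desc)
    ickd_loss = F.mse_loss(channel_correlation(lq_feat), channel_correlation(hq_feat))
    return desc_loss + lambda_ickd * ickd_loss
```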

Mitigating stereotypical biases in text to image generative systems

  • paper_url: http://arxiv.org/abs/2310.06904
  • repo_url: None
  • paper_authors: Piero Esposito, Parmida Atighehchian, Anastasis Germanidis, Deepti Ghadiyaram
  • for: This paper shows how fine-tuning on synthetic data can mitigate social biases in text-to-image generative models so that outcomes are fair across different groups of people.
  • methods: Synthetic images varying in perceived skin tones and genders are generated from diverse text prompts built from multiplicative combinations of ethnicities, genders, professions, age groups, and so on; the text-to-image model is then fine-tuned on this data, yielding the diversity finetuned (DFT) model.
  • results: Compared to baselines, the DFT model improves the group fairness metric by 150% for perceived skin tone and 97.7% for perceived gender, and generates more people with perceived darker skin tones and more women. All text prompts and code to generate training images will be released to foster open research.
    Abstract State-of-the-art generative text-to-image models are known to exhibit social biases and over-represent certain groups like people of perceived lighter skin tones and men in their outcomes. In this work, we propose a method to mitigate such biases and ensure that the outcomes are fair across different groups of people. We do this by finetuning text-to-image models on synthetic data that varies in perceived skin tones and genders constructed from diverse text prompts. These text prompts are constructed from multiplicative combinations of ethnicities, genders, professions, age groups, and so on, resulting in diverse synthetic data. Our diversity finetuned (DFT) model improves the group fairness metric by 150% for perceived skin tone and 97.7% for perceived gender. Compared to baselines, DFT models generate more people with perceived darker skin tone and more women. To foster open research, we will release all text prompts and code to generate training images.
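    The multiplicative prompt construction can be pictured with a few lines of Python; the attribute pools and the prompt template below are illustrative assumptions, not the prompt set the authors will release.

```python
from itertools import product

# Hypothetical attribute pools; the released prompt set may differ.
ethnicities = ["East Asian", "Black", "White", "South Asian", "Hispanic"]
genders = ["woman", "man", "non-binary person"]
professions = ["doctor", "teacher", "firefighter", "software engineer"]
age_groups = ["young", "middle-aged", "elderly"]

prompts = [
    f"a photo of a {age} {ethnicity} {gender} working as a {profession}"
    for ethnicity, gender, profession, age in product(ethnicities, genders, professions, age_groups)
]
print(len(prompts), "prompts, e.g.:", prompts[0])
# 5 * 3 * 4 * 3 = 180 diverse prompts to drive synthetic image generation for fine-tuning.
```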

AutoAD II: The Sequel – Who, When, and What in Movie Audio Description

  • paper_url: http://arxiv.org/abs/2310.06838
  • repo_url: None
  • paper_authors: Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman
  • for: This work builds a model that automatically generates movie Audio Description (AD) to help visually impaired audiences follow the storyline.
  • methods: AD is generated from CLIP visual features of the frames, the cast list, and the temporal locations of the speech, addressing the 'who', 'when', and 'what' questions: (i) who: a character bank of each principal character's name, the actor who played the part, and a CLIP feature of their face is introduced to improve naming in the generated AD; (ii) when: several models are investigated for deciding whether AD should be generated for a time interval, based on the visual content of the interval and its neighbours; (iii) what: a new vision-language model ingests the character-bank proposals while conditioning on the visual features via cross-attention.
  • results: The proposed model improves over previous architectures for AD text generation in an apples-to-apples comparison.
    Abstract Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.

What Does Stable Diffusion Know about the 3D Scene?

  • paper_url: http://arxiv.org/abs/2310.06836
  • repo_url: None
  • paper_authors: Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
  • for: The goal is to probe the Stable Diffusion network to determine to what extent it "understands" different properties of the 3D scene depicted in an image, and thereby better understand how the network generates high-quality images.
  • methods: A protocol is introduced to evaluate whether a network models a number of physical properties of the 3D scene by probing for explicit features that represent them; the probes are applied to datasets of real images annotated for each property.
  • results: Stable Diffusion performs well on a number of properties, including scene geometry, support relations, shadows, and depth, but is less performant for occlusion; the same probes applied to other large-scale models such as DINO and CLIP show performance inferior to Stable Diffusion.
    Abstract Recent advances in generative models like Stable Diffusion enable the generation of highly photo-realistic images. Our objective in this paper is to probe the diffusion network to determine to what extent it 'understands' different properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a protocol to evaluate whether a network models a number of physical 'properties' of the 3D scene by probing for explicit features that represent these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view dependent measures. (iii) We find that Stable Diffusion is good at a number of properties including scene geometry, support relations, shadows and depth, but less performant for occlusion. (iv) We also apply the probes to other models trained at large-scale, including DINO and CLIP, and find their performance inferior to that of Stable Diffusion.

Neural Bounding

  • paper_url: http://arxiv.org/abs/2310.06822
  • repo_url: https://github.com/AlexeyAB/Yolo_mark
  • paper_authors: Wenxin Liu, Michael Fischer, Paul D. Yoo, Tobias Ritschel
  • for: This paper studies the use of neural networks as bounding volumes.
  • methods: Bounding is redefined as learning to classify space as free or occupied, which is particularly advantageous in high-dimensional spaces such as animated scenes with complex queries; a dynamically-weighted asymmetric loss function allows, but also limits, false positives while keeping the number of false negatives strictly at zero.
  • results: The neural bounding produces up to an order of magnitude fewer false positives than traditional methods.
    Abstract Bounding volumes are an established concept in computer graphics and vision tasks but have seen little change since their early inception. In this work, we study the use of neural networks as bounding volumes. Our key observation is that bounding, which so far has primarily been considered a problem of computational geometry, can be redefined as a problem of learning to classify space into free or occupied. This learning-based approach is particularly advantageous in high-dimensional spaces, such as animated scenes with complex queries, where neural networks are known to excel. However, unlocking neural bounding requires a twist: allowing -- but also limiting -- false positives, while ensuring that the number of false negatives is strictly zero. We enable such tight and conservative results using a dynamically-weighted asymmetric loss function. Our results show that our neural bounding produces up to an order of magnitude fewer false positives than traditional methods.
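    A sketch of one way a dynamically-weighted asymmetric objective could look, penalizing false negatives (occupied space classified as free) far more heavily than false positives; the binary cross-entropy formulation and the fixed weights are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def asymmetric_bounding_loss(logits: torch.Tensor, occupied: torch.Tensor,
                             fn_weight: float = 100.0, fp_weight: float = 1.0) -> torch.Tensor:
    """Asymmetric BCE: missing occupied space costs far more than bounding extra free space.

    logits:   raw network scores for sampled query points, shape (N,)
    occupied: 1.0 where the point lies inside the bounded geometry, else 0.0
    """
    bce = F.binary_cross_entropy_with_logits(logits, occupied, reduction="none")
    # False negatives (occupied predicted as free) receive a much larger weight, pushing the
    # learned bound to be conservative; in practice the weight could be scheduled upward
    # during training to drive the false-negative count toward zero.
    weights = occupied * fn_weight + (1.0 - occupied) * fp_weight
    return (weights * bce).mean()
```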

TopoMLP: A Simple yet Strong Pipeline for Driving Topology Reasoning

  • paper_url: http://arxiv.org/abs/2310.06753
  • repo_url: https://github.com/wudongming97/topomlp
  • paper_authors: Dongming Wu, Jiahao Chang, Fan Jia, Yingfei Liu, Tiancai Wang, Jianbing Shen
  • for: This work targets driving topology reasoning for autonomous driving: detecting road centerlines (lanes) and traffic elements and then reasoning about their topology relationships, i.e., lane-lane topology and lane-traffic topology.
  • methods: The authors first show that the topology score relies heavily on detection performance, and therefore introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance; on top of these they propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning.
  • results: Building on the strong detection performance, two simple MLP-based heads generate the topology; TopoMLP achieves state-of-the-art performance on the OpenLane-V2 benchmark (41.2% OLS with a ResNet-50 backbone) and is the winning solution of the 1st OpenLane Topology in Autonomous Driving Challenge. Code is available at https://github.com/wudongming97/TopoMLP.
    Abstract Topology reasoning aims to comprehensively understand road scenes and present drivable routes in autonomous driving. It requires detecting road centerlines (lane) and traffic elements, further reasoning their topology relationship, i.e., lane-lane topology, and lane-traffic topology. In this work, we first present that the topology score relies heavily on detection performance on lane and traffic elements. Therefore, we introduce a powerful 3D lane detector and an improved 2D traffic element detector to extend the upper limit of topology performance. Further, we propose TopoMLP, a simple yet high-performance pipeline for driving topology reasoning. Based on the impressive detection performance, we develop two simple MLP-based heads for topology generation. TopoMLP achieves state-of-the-art performance on OpenLane-V2 benchmark, i.e., 41.2% OLS with ResNet-50 backbone. It is also the 1st solution for 1st OpenLane Topology in Autonomous Driving Challenge. We hope such simple and strong pipeline can provide some new insights to the community. Code is at https://github.com/wudongming97/TopoMLP.

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

  • paper_url: http://arxiv.org/abs/2310.06744
  • repo_url: https://github.com/HiFi-123/HiFi-123.github.io
  • paper_authors: Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Long Quan, Ying Shan, Yonghong Tian
  • for: High-fidelity, multi-view consistent image-to-3D content generation.
  • methods: A reference-guided novel view enhancement technique and a reference-guided state distillation loss are introduced.
  • results: When incorporated into the optimization-based image-to-3D pipeline, the method significantly improves 3D generation quality and achieves state-of-the-art performance compared with existing methods.
    Abstract Recent advances in text-to-image diffusion models have enabled 3D generation from a single image. However, current image-to-3D methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image, limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. Our contributions are twofold: First, we propose a reference-guided novel view enhancement technique that substantially reduces the quality gap between synthesized and reference views. Second, capitalizing on the novel view enhancement, we present a novel reference-guided state distillation loss. When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality, achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods, both qualitatively and quantitatively.

Multi-domain improves out-of-distribution and data-limited scenarios for medical image analysis

  • paper_url: http://arxiv.org/abs/2310.06737
  • repo_url: None
  • paper_authors: Ece Ozkan, Xavier Boix
  • for: This paper proposes multi-domain models to broaden the applicability of medical image analysis.
  • methods: Multi-domain models combine data from diverse medical imaging domains, including different modalities such as X-ray, MRI, CT, and ultrasound, as well as different viewpoints such as axial, coronal, and sagittal views.
  • results: Compared to specialized models, multi-domain models generalize better in data-limited and out-of-distribution scenarios that are frequently encountered in healthcare; by exploiting shared information across domains they significantly enhance overall outcomes, improving organ recognition accuracy by up to 10%.
    Abstract Current machine learning methods for medical image analysis primarily focus on developing models tailored for their specific tasks, utilizing data within their target domain. These specialized models tend to be data-hungry and often exhibit limitations in generalizing to out-of-distribution samples. Recently, foundation models have been proposed, which combine data from various domains and demonstrate excellent generalization capabilities. Building upon this, this work introduces the incorporation of diverse medical image domains, including different imaging modalities like X-ray, MRI, CT, and ultrasound images, as well as various viewpoints such as axial, coronal, and sagittal views. We refer to this approach as multi-domain model and compare its performance to that of specialized models. Our findings underscore the superior generalization capabilities of multi-domain models, particularly in scenarios characterized by limited data availability and out-of-distribution, frequently encountered in healthcare applications. The integration of diverse data allows multi-domain models to utilize shared information across domains, enhancing the overall outcomes significantly. To illustrate, for organ recognition, multi-domain model can enhance accuracy by up to 10% compared to conventional specialized models.

Domain Generalization by Rejecting Extreme Augmentations

  • paper_url: http://arxiv.org/abs/2310.06670
  • repo_url: None
  • paper_authors: Masih Aminbeidokhti, Fidel A. Guerrero Peña, Heitor Rapela Medeiros, Thomas Dubail, Eric Granger, Marco Pedersoli
  • for: This paper studies how data augmentation affects deep learning models in out-of-domain and domain generalization settings, where the test data follow a different and unknown distribution.
  • methods: A simple training procedure is proposed: (i) use uniform sampling over standard data augmentation transformations; (ii) increase the transformation strength to account for the higher data variance expected out-of-domain; and (iii) devise a new reward function to reject extreme transformations that can harm training.
  • results: The augmentation scheme achieves accuracy comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code is available at https://github.com/Masseeh/DCAug.
    Abstract Data augmentation is one of the most effective techniques for regularizing deep learning models and improving their recognition performance in a variety of tasks and domains. However, this holds for standard in-domain settings, in which the training and test data follow the same distribution. For the out-of-domain case, where the test data follow a different and unknown distribution, the best recipe for data augmentation is unclear. In this paper, we show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance. To do that, we propose a simple training procedure: (i) use uniform sampling on standard data augmentation transformations; (ii) increase the strength transformations to account for the higher data variance expected when working out-of-domain, and (iii) devise a new reward function to reject extreme transformations that can harm the training. With this procedure, our data augmentation scheme achieves a level of accuracy that is comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code: \url{https://github.com/Masseeh/DCAug}
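    One way to picture the reject-extreme-augmentations step is a rejection-sampling wrapper around a strong augmentation policy; here the reward is sketched as the loss of a frozen reference model on the augmented image staying below a threshold, which is an illustrative assumption rather than the paper's exact reward function.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

strong_augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8, hue=0.2),
    T.RandomGrayscale(p=0.3),
])

def sample_augmentation(image: torch.Tensor, label: int, reward_model: torch.nn.Module,
                        max_loss: float = 2.5, max_tries: int = 5) -> torch.Tensor:
    """Draw strong augmentations, rejecting those the reward deems too extreme."""
    for _ in range(max_tries):
        candidate = strong_augment(image)
        with torch.no_grad():
            logits = reward_model(candidate.unsqueeze(0))
            loss = F.cross_entropy(logits, torch.tensor([label]))
        if loss.item() < max_loss:        # transformation is still label-preserving enough
            return candidate
    return image                          # fall back to the original if all draws were rejected
```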

Latent Diffusion Counterfactual Explanations

  • paper_url: http://arxiv.org/abs/2310.06668
  • repo_url: https://github.com/lmb-freiburg/ldce
  • paper_authors: Karim Farid, Simon Schrodi, Max Argus, Thomas Brox
  • for: This paper presents a new method for generating counterfactual explanations, which help elucidate the behavior of opaque black-box models.
  • methods: The method, Latent Diffusion Counterfactual Explanations (LDCE), builds on recent class- or text-conditional foundation latent diffusion models and uses a novel consensus guidance mechanism to filter out noisy, adversarial gradients that are misaligned with the diffusion model's implicit classifier.
  • results: LDCE expedites counterfactual generation, focuses on the important, semantic parts of the data, works across a wide spectrum of models trained on diverse datasets with different learning paradigms, and can provide insights into model errors, enhancing the understanding of black-box behavior.
    Abstract Counterfactual explanations have emerged as a promising method for elucidating the behavior of opaque black-box models. Recently, several works leveraged pixel-space diffusion models for counterfactual generation. To handle noisy, adversarial gradients during counterfactual generation -- causing unrealistic artifacts or mere adversarial perturbations -- they required either auxiliary adversarially robust models or computationally intensive guidance schemes. However, such requirements limit their applicability, e.g., in scenarios with restricted access to the model's training data. To address these limitations, we introduce Latent Diffusion Counterfactual Explanations (LDCE). LDCE harnesses the capabilities of recent class- or text-conditional foundation latent diffusion models to expedite counterfactual generation and focus on the important, semantic parts of the data. Furthermore, we propose a novel consensus guidance mechanism to filter out noisy, adversarial gradients that are misaligned with the diffusion model's implicit classifier. We demonstrate the versatility of LDCE across a wide spectrum of models trained on diverse datasets with different learning paradigms. Finally, we showcase how LDCE can provide insights into model errors, enhancing our understanding of black-box model behavior.
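    A toy sketch in the spirit of the consensus guidance mechanism: elementwise, a classifier gradient is kept only where its sign agrees with a reference gradient derived from the diffusion model's implicit classifier, so misaligned, adversarial components are suppressed; the sign-agreement rule is an assumption for illustration, not the paper's exact mechanism.

```python
import torch

def consensus_filter(classifier_grad: torch.Tensor,
                     implicit_grad: torch.Tensor) -> torch.Tensor:
    """Zero out classifier-gradient components that disagree in sign with the
    diffusion model's implicit-classifier gradient, suppressing adversarial noise."""
    agree = torch.sign(classifier_grad) == torch.sign(implicit_grad)
    return classifier_grad * agree.float()

# Schematic use inside a guided denoising step (names are placeholders):
# grad = consensus_filter(grad_from_target_classifier, grad_from_implicit_classifier)
# latent = latent - guidance_scale * grad
```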

SC2GAN: Rethinking Entanglement by Self-correcting Correlated GAN Space

  • paper_url: http://arxiv.org/abs/2310.06667
  • repo_url: None
  • paper_authors: Zikun Chen, Han Zhao, Parham Aarabi, Ruowei Jiang
  • for: This work addresses entanglement caused by spurious correlations in Generative Adversarial Networks (GANs), where groups of visual attributes that are not causally related tend to appear together in the learned latent space (e.g., age and eyeglasses, or women and lipsticks).
  • methods: Studying the StyleGAN2-FFHQ model, the proposed SC$^2$GAN framework re-projects low-density latent code samples into the original latent space and corrects the editing directions based on both the high-density and low-density regions.
  • results: SC$^2$GAN disentangles strongly correlated attributes with only small amounts of low-density-region samples added, enabling the generation of images with attribute combinations that appear infrequently.
    Abstract Generative Adversarial Networks (GANs) can synthesize realistic images, with the learned latent space shown to encode rich semantic information with various interpretable directions. However, due to the unstructured nature of the learned latent space, it inherits the bias from the training data where specific groups of visual attributes that are not causally related tend to appear together, a phenomenon also known as spurious correlations, e.g., age and eyeglasses or women and lipsticks. Consequently, the learned distribution often lacks the proper modelling of the missing examples. The interpolation following editing directions for one attribute could result in entangled changes with other attributes. To address this problem, previous works typically adjust the learned directions to minimize the changes in other attributes, yet they still fail on strongly correlated features. In this work, we study the entanglement issue in both the training data and the learned latent space for the StyleGAN2-FFHQ model. We propose a novel framework SC$^2$GAN that achieves disentanglement by re-projecting low-density latent code samples in the original latent space and correcting the editing directions based on both the high-density and low-density regions. By leveraging the original meaningful directions and semantic region-specific layers, our framework interpolates the original latent codes to generate images with attribute combination that appears infrequently, then inverts these samples back to the original latent space. We apply our framework to pre-existing methods that learn meaningful latent directions and showcase its strong capability to disentangle the attributes with small amounts of low-density region samples added.

Evaluating Explanation Methods for Vision-and-Language Navigation

  • paper_url: http://arxiv.org/abs/2310.06654
  • repo_url: None
  • paper_authors: Guanqi Chen, Lei Yang, Guanhua Chen, Jia Pan
  • for: This work aims to explain the decision-making of deep neural models for vision-and-language navigation (VLN), to understand what information the models use when completing navigation tasks.
  • methods: Quantitative benchmarks are built to evaluate explanation methods for VLN models in terms of faithfulness, using a new erasure-based evaluation pipeline that measures step-wise textual explanations in the sequential decision-making setting.
  • results: Evaluating several explanation methods for two representative VLN models on two popular VLN datasets reveals valuable findings, including that different explanation methods perform differently across models and datasets and that some methods better capture the information the models use in specific decisions.
    Abstract The ability to navigate robots with natural language instructions in an unknown environment is a crucial step for achieving embodied artificial intelligence (AI). With the improving performance of deep neural models proposed in the field of vision-and-language navigation (VLN), it is equally interesting to know what information the models utilize for their decision-making in the navigation tasks. To understand the inner workings of deep neural models, various explanation methods have been developed for promoting explainable AI (XAI). But they are mostly applied to deep neural models for image or text classification tasks and little work has been done in explaining deep neural models for VLN tasks. In this paper, we address these problems by building quantitative benchmarks to evaluate explanation methods for VLN models in terms of faithfulness. We propose a new erasure-based evaluation pipeline to measure the step-wise textual explanation in the sequential decision-making setting. We evaluate several explanation methods for two representative VLN models on two popular VLN datasets and reveal valuable findings through our experiments.

How (not) to ensemble LVLMs for VQA

  • paper_url: http://arxiv.org/abs/2310.06641
  • repo_url: None
  • paper_authors: Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink
  • for: This paper studies ensembling in the era of Large Vision-Language Models (LVLMs); ensembling is a classical method for combining different models to obtain increased performance.
  • methods: Following recent work on Encyclopedic-VQA, a wide variety of models is considered: from vanilla LVLMs, to models that include the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling.
  • results: An oracle experiment shows potential gains from 48.8% accuracy (the best single model) up to 67% (the best possible ensemble), suggesting that creating an ensemble with substantial real gains should be a trivial exercise. Or is it?
    Abstract This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it?

Blind Dates: Examining the Expression of Temporality in Historical Photographs

  • paper_url: http://arxiv.org/abs/2310.06633
  • repo_url: None
  • paper_authors: Alexandra Barancová, Melvin Wevers, Nanne van Noord
  • for: This paper explores whether computer vision models can discern temporal information in visual content, focusing specifically on historical photographs.
  • methods: OpenCLIP, an open-source implementation of the multi-modal language-and-vision model CLIP, is studied in three steps: zero-shot classification, fine-tuning, and analysis of visual content, using the De Boer Scene Detection dataset of 39,866 gray-scale historical press photographs from 1950 to 1999.
  • results: Zero-shot classification is relatively ineffective for image dating and is biased towards predicting dates in the past; fine-tuning OpenCLIP with a logistic classifier improves performance and eliminates the bias. Images featuring buses, cars, cats, dogs, and people are dated more accurately, suggesting the presence of temporal markers. The study highlights the potential of models like OpenCLIP for dating images and the importance of fine-tuning; future research should explore color photographs and more diverse datasets.
    Abstract This paper explores the capacity of computer vision models to discern temporal information in visual content, focusing specifically on historical photographs. We investigate the dating of images using OpenCLIP, an open-source implementation of CLIP, a multi-modal language and vision model. Our experiment consists of three steps: zero-shot classification, fine-tuning, and analysis of visual content. We use the \textit{De Boer Scene Detection} dataset, containing 39,866 gray-scale historical press photographs from 1950 to 1999. The results show that zero-shot classification is relatively ineffective for image dating, with a bias towards predicting dates in the past. Fine-tuning OpenCLIP with a logistic classifier improves performance and eliminates the bias. Additionally, our analysis reveals that images featuring buses, cars, cats, dogs, and people are more accurately dated, suggesting the presence of temporal markers. The study highlights the potential of machine learning models like OpenCLIP in dating images and emphasizes the importance of fine-tuning for accurate temporal analysis. Future research should explore the application of these findings to color photographs and diverse datasets.
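    A minimal zero-shot sketch of the kind of decade classification probed in the paper, using the open_clip package; the checkpoint name, the decade prompts, and the input file are illustrative choices and not necessarily the paper's configuration.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

decades = ["1950s", "1960s", "1970s", "1980s", "1990s"]
text = tokenizer([f"a press photograph taken in the {d}" for d in decades])
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)   # hypothetical input image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(decades, probs.squeeze(0).tolist())))
```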

EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention

  • paper_url: http://arxiv.org/abs/2310.06629
  • repo_url: https://github.com/nkusyl/evit
  • paper_authors: Yulong Shi, Mingwei Sun, Yongshuai Wang, Rui Wang, Hui Sun, Zengqiang Chen
  • for: To address the high computational complexity of vision transformers and their lack of desirable inductive bias.
  • methods: A novel Bi-Fovea Self-Attention (BFSA) is proposed, inspired by the physiological structure and characteristics of bi-fovea vision in eagle eyes; it simulates the shallow- and deep-fovea functions of eagle vision, enabling the network to extract target feature representations from coarse to fine and facilitating the interaction of multi-scale feature representations. A Bionic Eagle Vision (BEV) block based on BFSA combines the advantages of CNNs and vision transformers, and stacking BEV blocks yields the Eagle Vision Transformer (EViT) backbone family.
  • results: Experiments on image classification, object detection, instance segmentation, and other transfer learning tasks show that EViTs perform effectively compared with baselines of the same model size and run faster on graphics processing units than other models. Code is available at https://github.com/nkusyl/EViT.
    Abstract Thanks to the advancement of deep learning technology, vision transformer has demonstrated competitive performance in various computer vision tasks. Unfortunately, vision transformer still faces some challenges such as high computational complexity and absence of desirable inductive bias. To alleviate these problems, a novel Bi-Fovea Self-Attention (BFSA) is proposed, inspired by the physiological structure and characteristics of bi-fovea vision in eagle eyes. This BFSA can simulate the shallow fovea and deep fovea functions of eagle vision, enable the network to extract feature representations of targets from coarse to fine, facilitate the interaction of multi-scale feature representations. Additionally, a Bionic Eagle Vision (BEV) block based on BFSA is designed in this study. It combines the advantages of CNNs and Vision Transformers to enhance the ability of global and local feature representations of networks. Furthermore, a unified and efficient general pyramid backbone network family is developed by stacking the BEV blocks in this study, called Eagle Vision Transformers (EViTs). Experimental results on various computer vision tasks including image classification, object detection, instance segmentation and other transfer learning tasks show that the proposed EViTs perform effectively by comparing with the baselines under same model size and exhibit higher speed on graphics processing unit than other models. Code is available at https://github.com/nkusyl/EViT.

Deep Cardiac MRI Reconstruction with ADMM

  • paper_url: http://arxiv.org/abs/2310.06628
  • repo_url: None
  • paper_authors: George Yiasemis, Nikita Moriakov, Jan-Jakob Sonke, Jonas Teuwen
  • for: To propose a deep learning-based method for accelerated cine and multi-contrast reconstruction in dynamic cardiac imaging.
  • methods: Uses the deep learning-based inverse problem solver vSHARP and optimizes in both the image and k-space domains to achieve high reconstruction fidelity.
  • results: The method improves reconstructed image quality and the accuracy of multi-contrast (T1 and T2) mapping in dynamic cardiac imaging, and generalizes across different undersampling schemes.
    Abstract Cardiac magnetic resonance imaging is a valuable non-invasive tool for identifying cardiovascular diseases. For instance, Cine MRI is the benchmark modality for assessing the cardiac function and anatomy. On the other hand, multi-contrast (T1 and T2) mapping has the potential to assess pathologies and abnormalities in the myocardium and interstitium. However, voluntary breath-holding and often arrhythmia, in combination with MRI's slow imaging speed, can lead to motion artifacts, hindering real-time acquisition image quality. Although performing accelerated acquisitions can facilitate dynamic imaging, it induces aliasing, causing low reconstructed image quality in Cine MRI and inaccurate T1 and T2 mapping estimation. In this work, inspired by related work in accelerated MRI reconstruction, we present a deep learning (DL)-based method for accelerated cine and multi-contrast reconstruction in the context of dynamic cardiac imaging. We formulate the reconstruction problem as a least squares regularized optimization task, and employ vSHARP, a state-of-the-art DL-based inverse problem solver, which incorporates half-quadratic variable splitting and the alternating direction method of multipliers with neural networks. We treat the problem in two setups; a 2D reconstruction and a 2D dynamic reconstruction task, and employ 2D and 3D deep learning networks, respectively. Our method optimizes in both the image and k-space domains, allowing for high reconstruction fidelity. Although the target data is undersampled with a Cartesian equispaced scheme, we train our model using both Cartesian and simulated non-Cartesian undersampling schemes to enhance generalization of the model to unseen data. Furthermore, our model adopts a deep neural network to learn and refine the sensitivity maps of multi-coil k-space data. Lastly, our method is jointly trained on both, undersampled cine and multi-contrast data.
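For context, the reconstruction problem above is a regularized least-squares task solved with half-quadratic splitting and ADMM. The snippet below is the textbook ADMM loop for a toy L1-regularized problem, included only to illustrate the variable-splitting structure; vSHARP replaces such hand-crafted proximal steps with neural networks, so this is background, not the paper's solver.

```python
import numpy as np

# Textbook ADMM for min_x 0.5*||A x - y||^2 + lam*||x||_1. A, y, and all
# hyperparameters are toy values, not MRI data.
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 50))
x_true = np.zeros(50); x_true[:5] = rng.normal(size=5)
y = A @ x_true
lam, rho = 0.1, 1.0

x = np.zeros(50); z = np.zeros(50); u = np.zeros(50)
AtA, Aty = A.T @ A, A.T @ y
lhs = AtA + rho * np.eye(50)
for _ in range(100):
    x = np.linalg.solve(lhs, Aty + rho * (z - u))                      # data-consistency step
    z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)    # prox of L1 (soft threshold)
    u = u + x - z                                                      # dual update
print(np.round(z[:8], 3))  # recovered sparse coefficients
```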

Pi-DUAL: Using Privileged Information to Distinguish Clean from Noisy Labels

  • paper_url: http://arxiv.org/abs/2310.06600
  • repo_url: None
  • paper_authors: Ke Wang, Guillermo Ortiz-Jimenez, Rodolphe Jenatton, Mark Collier, Efi Kokiopoulou, Pascal Frossard
  • for: mitigate the effects of label noise in deep learning models
  • methods: leveraging privileged information (PI) to distinguish clean from wrong labels
  • results: significant performance improvements on key PI benchmarks, establishing a new state-of-the-art test set accuracy, and effective at identifying noisy samples post-training
    Abstract Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) -- information available only during training but not at test time -- has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.
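A minimal PyTorch sketch of the decomposition described above: logits are a gated sum of a prediction term computed from input features and a noise-fitting term computed only from privileged information (PI). The layer widths and the sigmoid gate are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

# Sketch of a Pi-DUAL-style head. Widths and the exact gating form are
# illustrative assumptions, not the paper's exact design.
class PiDualHead(nn.Module):
    def __init__(self, feat_dim, pi_dim, num_classes, hidden=128):
        super().__init__()
        self.prediction = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, num_classes))
        self.noise_fit = nn.Sequential(nn.Linear(pi_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_classes))
        self.gate = nn.Sequential(nn.Linear(pi_dim, 1), nn.Sigmoid())

    def forward(self, features, pi):
        a = self.gate(pi)   # in (0, 1), steered solely by PI
        return (1 - a) * self.prediction(features) + a * self.noise_fit(pi)

head = PiDualHead(feat_dim=512, pi_dim=16, num_classes=10)
logits = head(torch.randn(4, 512), torch.randn(4, 16))
print(logits.shape)  # torch.Size([4, 10])
```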

REVO-LION: Evaluating and Refining Vision-Language Instruction Tuning Datasets

  • paper_url: http://arxiv.org/abs/2310.06594
  • repo_url: https://github.com/liaoning97/revo-lion
  • paper_authors: Ning Liao, Shaofeng Zhang, Renqiu Xia, Bo Zhang, Min Cao, Yu Qiao, Junchi Yan
  • for: To evaluate the quality of Vision-Language Instruction Tuning (VLIT) datasets themselves and to explore how to build a dataset for developing an all-powerful instruction-tuned model.
  • methods: Proposes a tune-cross-evaluation paradigm that tunes on one dataset and evaluates on the others in turn, and defines Meta Quality (MQ) and Dataset Quality (DQ), measured with standard caption metrics, to quantify the quality of datasets.
  • results: Experiments validate the evaluation paradigm and show that a comprehensive instruction-tuning dataset can be built from it: training on REVO-LION (high-quality samples collected from each dataset) reaches performance comparable to simply adding all VLIT datasets together, while using only half the data.
    Abstract There is an emerging line of research on multimodal instruction tuning, and a line of benchmarks have been proposed for evaluating these models recently. Instead of evaluating the models directly, in this paper we try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets themselves and further seek the way of building a dataset for developing an all-powerful VLIT model, which we believe could also be of utility for establishing a grounded protocol for benchmarking VLIT models. For effective analysis of VLIT datasets that remains an open question, we propose a tune-cross-evaluation paradigm: tuning on one dataset and evaluating on the others in turn. For each single tune-evaluation experiment set, we define the Meta Quality (MQ) as the mean score measured by a series of caption metrics including BLEU, METEOR, and ROUGE-L to quantify the quality of a certain dataset or a sample. On this basis, to evaluate the comprehensiveness of a dataset, we develop the Dataset Quality (DQ) covering all tune-evaluation sets. To lay the foundation for building a comprehensive dataset and developing an all-powerful model for practical applications, we further define the Sample Quality (SQ) to quantify the all-sided quality of each sample. Extensive experiments validate the rationality of the proposed evaluation paradigm. Based on the holistic evaluation, we build a new dataset, REVO-LION (REfining VisiOn-Language InstructiOn tuNing), by collecting samples with higher SQ from each dataset. With only half of the full data, the model trained on REVO-LION can achieve performance comparable to simply adding all VLIT datasets up. In addition to developing an all-powerful model, REVO-LION also includes an evaluation set, which is expected to serve as a convenient evaluation benchmark for future research.

Hierarchical Mask2Former: Panoptic Segmentation of Crops, Weeds and Leaves

  • paper_url: http://arxiv.org/abs/2310.06582
  • repo_url: https://github.com/madeleinedarbyshire/hierarchicalmask2former
  • paper_authors: Madeleine Darbyshire, Elizabeth Sklar, Simon Parsons
  • for: To advance machine-vision-based precision agriculture, improving yields while reducing resource consumption.
  • methods: A hierarchical panoptic segmentation method, based on the Mask2Former framework, that simultaneously identifies indicators of plant growth and locates weeds by predicting crop, weed, and leaf masks.
  • results: Achieves a PQ† of 75.99, and explores more compact variants that make inference up to 60% faster with a reduction in PQ† of less than 1%.
    Abstract Advancements in machine vision that enable detailed inferences to be made from images have the potential to transform many sectors including agriculture. Precision agriculture, where data analysis enables interventions to be precisely targeted, has many possible applications. Precision spraying, for example, can limit the application of herbicide only to weeds, or limit the application of fertiliser only to undernourished crops, instead of spraying the entire field. The approach promises to maximise yields, whilst minimising resource use and harms to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method to simultaneously identify indicators of plant growth and locate weeds within an image. We adapt Mask2Former, a state-of-the-art architecture for panoptic segmentation, to predict crop, weed and leaf masks. We achieve a PQ† of 75.99. Additionally, we explore approaches to make the architecture more compact and therefore more suitable for time and compute constrained applications. With our more compact architecture, inference is up to 60% faster and the reduction in PQ† is less than 1%.
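PQ† reported above is a variant of the standard panoptic quality (PQ) metric. For reference, plain PQ for one class can be computed from matched and unmatched segments as in this sketch:

```python
# Standard panoptic quality (PQ) for one class, given the IoUs of matched
# segment pairs (true positives) plus counts of unmatched predictions (FP)
# and unmatched ground-truth segments (FN). PQ-dagger, reported above, is a
# variant of this definition; the formula here is plain PQ.
def panoptic_quality(tp_ious, num_fp, num_fn):
    tp = len(tp_ious)
    if tp + num_fp + num_fn == 0:
        return float("nan")
    return sum(tp_ious) / (tp + 0.5 * num_fp + 0.5 * num_fn)

# Toy example: three matched segments, one false positive, one false negative.
print(panoptic_quality([0.9, 0.8, 0.75], num_fp=1, num_fn=1))  # ~0.61
```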

Energy-Efficient Visual Search by Eye Movement and Low-Latency Spiking Neural Network

  • paper_url: http://arxiv.org/abs/2310.06578
  • repo_url: None
  • paper_authors: Yunhui Zhou, Dongqi Han, Yuguo Yu
  • for: To model the human visual system, whose non-uniform-resolution retina, efficient eye-movement strategy, and spiking neural network (SNN) balance requirements on visual field size, resolution, energy cost, and inference latency.
  • methods: Develops a human-like vision system using these three properties: human visual search behaviour is studied experimentally, and the first SNN-based visual search model is built, combining an artificial retina with spiking feature-extraction, memory, and saccade-decision modules and using population coding for fast, efficient saccade decisions.
  • results: The model can learn a human-like or near-optimal fixation strategy, outperforms humans in search speed and accuracy, and achieves high energy efficiency through short saccade-decision latency and sparse activation. The results highlight the value of modelling the human visual system for neuroscience and for developing more energy-efficient computer vision algorithms.
    Abstract Human vision incorporates non-uniform resolution retina, efficient eye movement strategy, and spiking neural network (SNN) to balance the requirements in visual field size, visual resolution, energy cost, and inference latency. These properties have inspired interest in developing human-like computer vision. However, existing models haven't fully incorporated the three features of human vision, and their learned eye movement strategies haven't been compared with human's strategy, making the models' behavior difficult to interpret. Here, we carry out experiments to examine human visual search behaviors and establish the first SNN-based visual search model. The model combines an artificial retina with spiking feature extraction, memory, and saccade decision modules, and it employs population coding for fast and efficient saccade decisions. The model can learn either a human-like or a near-optimal fixation strategy, outperform humans in search speed and accuracy, and achieve high energy efficiency through short saccade decision latency and sparse activation. It also suggests that the human search strategy is suboptimal in terms of search speed. Our work connects modeling of vision in neuroscience and machine learning and sheds light on developing more energy-efficient computer vision algorithms.

SketchBodyNet: A Sketch-Driven Multi-faceted Decoder Network for 3D Human Reconstruction

  • paper_url: http://arxiv.org/abs/2310.06577
  • repo_url: None
  • paper_authors: Fei Wang, Kongzhang Tang, Hefeng Wu, Baoquan Zhao, Hao Cai, Teng Zhou
  • for: To reconstruct 3D human shape from 2D freehand sketches.
  • methods: Proposes SketchBodyNet, a network consisting of a backbone and three separate attention decoder branches; each decoder branch contains a multi-head self-attention module to obtain enhanced features, followed by a multi-layer perceptron.
  • results: The method achieves superior performance in reconstructing 3D human meshes from freehand sketches.
    Abstract Reconstructing 3D human shapes from 2D images has received increasing attention recently due to its fundamental support for many high-level 3D applications. Compared with natural images, freehand sketches are much more flexible to depict various shapes, providing a high potential and valuable way for 3D human reconstruction. However, such a task is highly challenging. The sparse abstract characteristics of sketches add severe difficulties, such as arbitrariness, inaccuracy, and lacking image details, to the already badly ill-posed problem of 2D-to-3D reconstruction. Although current methods have achieved great success in reconstructing 3D human bodies from a single-view image, they do not work well on freehand sketches. In this paper, we propose a novel sketch-driven multi-faceted decoder network termed SketchBodyNet to address this task. Specifically, the network consists of a backbone and three separate attention decoder branches, where a multi-head self-attention module is exploited in each decoder to obtain enhanced features, followed by a multi-layer perceptron. The multi-faceted decoders aim to predict the camera, shape, and pose parameters, respectively, which are then associated with the SMPL model to reconstruct the corresponding 3D human mesh. In learning, existing 3D meshes are projected via the camera parameters into 2D synthetic sketches with joints, which are combined with the freehand sketches to optimize the model. To verify our method, we collect a large-scale dataset of about 26k freehand sketches and their corresponding 3D meshes containing various poses of human bodies from 14 different angles. Extensive experimental results demonstrate our SketchBodyNet achieves superior performance in reconstructing 3D human meshes from freehand sketches.

Efficient Retrieval of Images with Irregular Patterns using Morphological Image Analysis: Applications to Industrial and Healthcare datasets

  • paper_url: http://arxiv.org/abs/2310.06566
  • repo_url: None
  • paper_authors: Jiajun Zhang, Georgina Cosma, Sarah Bugby, Jason Watkins
  • for: To propose an image retrieval framework for finding images containing similar irregular patterns within an image database.
  • methods: Evaluates several feature extraction methods, including deep features, colour-based features, shape-based features, and local features, for extracting features from the images.
  • results: Comparing different feature extraction methods and distance metrics, the combination of DefChars with the Manhattan distance achieves a mean average precision of 80% and a low standard deviation of 0.09, performing well across datasets.
    Abstract Image retrieval is the process of searching and retrieving images from a database based on their visual content and features. Recently, much attention has been directed towards the retrieval of irregular patterns within industrial or medical images by extracting features from the images, such as deep features, colour-based features, shape-based features and local features. This has applications across a spectrum of industries, including fault inspection, disease diagnosis, and maintenance prediction. This paper proposes an image retrieval framework to search for images containing similar irregular patterns by extracting a set of morphological features (DefChars) from images; the datasets employed in this paper contain wind turbine blade images with defects, chest computerised tomography scans with COVID-19 infection, heatsink images with defects, and lake ice images. The proposed framework was evaluated with different feature extraction methods (DefChars, resized raw image, local binary pattern, and scale-invariant feature transforms) and distance metrics to determine the most efficient parameters in terms of retrieval performance across datasets. The retrieval results show that the proposed framework using the DefChars and the Manhattan distance metric achieves a mean average precision of 80% and a low standard deviation of 0.09 across classes of irregular patterns, outperforming alternative feature-metric combinations across all datasets. Furthermore, the low standard deviation between each class highlights DefChars' capability for a reliable image retrieval task, even in the presence of class imbalances or small-sized datasets.
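The retrieval step of such a framework reduces to nearest-neighbour ranking under the chosen distance. Below is a sketch of that ranking with the Manhattan (L1) distance over precomputed feature vectors; the DefChars extraction itself is not reproduced here, and the toy features are random placeholders.

```python
import numpy as np

# Ranking step of a feature-based image retrieval pipeline: given a query
# feature vector and a database of precomputed feature vectors, return the
# indices of the closest images under the Manhattan (L1) distance.
def retrieve(query_features, database_features, top_k=5):
    distances = np.abs(database_features - query_features).sum(axis=1)  # L1 distance per image
    order = np.argsort(distances)
    return order[:top_k], distances[order[:top_k]]

rng = np.random.default_rng(0)
database = rng.random((100, 24))               # 100 images, 24-dimensional toy features
query = database[42] + 0.01 * rng.random(24)   # slightly perturbed copy of image 42
indices, dists = retrieve(query, database, top_k=3)
print(indices)  # image 42 should rank first
```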

Compositional Representation Learning for Brain Tumour Segmentation

  • paper_url: http://arxiv.org/abs/2310.06562
  • repo_url: None
  • paper_authors: Xiao Liu, Antanas Kascenas, Hannah Watson, Sotirios A. Tsaftaris, Alison Q. O’Neil
  • for: To address data scarcity in brain tumour segmentation by learning robust compositional representations through a mixed supervision framework that combines unsupervised learning and weak supervision.
  • methods: Adapts the mixed supervision framework vMFNet, learning compositional representations from unsupervised learning and weak supervision alongside non-exhaustive pixel-level pathology labels; the BraTS dataset is used to simulate 2-point expert pathology annotations per MRI volume (top and bottom slice of the tumour or tumour sub-regions), from which weak image-level labels are constructed.
  • results: Good tumour segmentation performance can be achieved with a large amount of weakly labelled data and only a small amount of fully annotated data; moreover, anatomical structures emerge in the compositional representation even when supervision relates only to pathology (the tumour).
    Abstract For brain tumour segmentation, deep learning models can achieve human expert-level performance given a large amount of data and pixel-level annotations. However, the expensive exercise of obtaining pixel-level annotations for large amounts of data is not always feasible, and performance is often heavily reduced in a low-annotated data regime. To tackle this challenge, we adapt a mixed supervision framework, vMFNet, to learn robust compositional representations using unsupervised learning and weak supervision alongside non-exhaustive pixel-level pathology labels. In particular, we use the BraTS dataset to simulate a collection of 2-point expert pathology annotations indicating the top and bottom slice of the tumour (or tumour sub-regions: peritumoural edema, GD-enhancing tumour, and the necrotic / non-enhancing tumour) in each MRI volume, from which weak image-level labels that indicate the presence or absence of the tumour (or the tumour sub-regions) in the image are constructed. Then, vMFNet models the encoded image features with von-Mises-Fisher (vMF) distributions, via learnable and compositional vMF kernels which capture information about structures in the images. We show that good tumour segmentation performance can be achieved with a large amount of weakly labelled data but only a small amount of fully-annotated data. Interestingly, emergent learning of anatomical structures occurs in the compositional representation even given only supervision relating to pathology (tumour).

Data efficient deep learning for medical image analysis: A survey

  • paper_url: http://arxiv.org/abs/2310.06557
  • repo_url: None
  • paper_authors: Suruchi Kumari, Pravendra Singh
  • for: To survey data-efficient deep learning methods for medical image analysis.
  • methods: A comprehensive review of data-efficient deep learning methods, categorized by the level of supervision they rely on, including no supervision, inexact supervision, incomplete supervision, inaccurate supervision, and only limited supervision, with finer subcategories for each.
  • results: Systematically summarizes commonly used data-efficient deep learning methods and datasets in medical image analysis and discusses future research directions.
    Abstract The rapid evolution of deep learning has significantly advanced the field of medical image analysis. However, despite these achievements, the further enhancement of deep learning models for medical image analysis faces a significant challenge due to the scarcity of large, well-annotated datasets. To address this issue, recent years have witnessed a growing emphasis on the development of data-efficient deep learning methods. This paper conducts a thorough review of data-efficient deep learning methods for medical image analysis. To this end, we categorize these methods based on the level of supervision they rely on, encompassing categories such as no supervision, inexact supervision, incomplete supervision, inaccurate supervision, and only limited supervision. We further divide these categories into finer subcategories. For example, we categorize inexact supervision into multiple instance learning and learning with weak annotations. Similarly, we categorize incomplete supervision into semi-supervised learning, active learning, and domain-adaptive learning and so on. Furthermore, we systematically summarize commonly used datasets for data efficient deep learning in medical image analysis and investigate future research directions to conclude this survey.

Be Careful What You Smooth For: Label Smoothing Can Be a Privacy Shield but Also a Catalyst for Model Inversion Attacks

  • paper_url: http://arxiv.org/abs/2310.06549
  • repo_url: https://github.com/LukasStruppek/Plug-and-Play-Attacks
  • paper_authors: Lukas Struppek, Dominik Hintersdorf, Kristian Kersting
  • for: To investigate the impact of label smoothing on a model's privacy leakage.
  • methods: Uses label smoothing to study model leakage under model inversion attacks and compares the effect of different smoothing variants.
  • results: Traditional label smoothing can increase a model's privacy leakage, whereas smoothing with negative factors impedes the extraction of class-related information and thus preserves privacy.
    Abstract Label smoothing -- using softened labels instead of hard ones -- is a widely adopted regularization method for deep learning, showing diverse benefits such as enhanced generalization and calibration. Its implications for preserving model privacy, however, have remained unexplored. To fill this gap, we investigate the impact of label smoothing on model inversion attacks (MIAs), which aim to generate class-representative samples by exploiting the knowledge encoded in a classifier, thereby inferring sensitive information about its training data. Through extensive analyses, we uncover that traditional label smoothing fosters MIAs, thereby increasing a model's privacy leakage. Even more, we reveal that smoothing with negative factors counters this trend, impeding the extraction of class-related information and leading to privacy preservation, beating state-of-the-art defenses. This establishes a practical and powerful novel way for enhancing model resilience against MIAs.
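For reference, one common formulation of label smoothing mixes the one-hot target with the uniform distribution; in this formulation, the "negative factors" studied above correspond to a smoothing coefficient below zero, which pushes the target beyond one-hot. A small sketch with toy numbers:

```python
import numpy as np

# Label smoothing target for a K-class problem: mix the one-hot label with the
# uniform distribution. A positive factor is the usual regularizer; a negative
# factor sharpens the target instead (one way to realize the paper's setting).
def smoothed_target(label, num_classes, factor):
    one_hot = np.eye(num_classes)[label]
    return (1.0 - factor) * one_hot + factor / num_classes

print(smoothed_target(2, num_classes=4, factor=0.1))    # roughly [0.025 0.025 0.925 0.025]
print(smoothed_target(2, num_classes=4, factor=-0.05))  # true class above 1, others slightly below 0
```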

Perceptual MAE for Image Manipulation Localization: A High-level Vision Learner Focusing on Low-level Features

  • paper_url: http://arxiv.org/abs/2310.06525
  • repo_url: https://github.com/SunnyHaze/PMAE
  • paper_authors: Xiaochen Ma, Jizhe Zhou, Xiong Xu, Zhuohang Jiang, Chi-Man Pun
  • for: This paper aims to improve the performance of Image Manipulation Localization (IML) tasks by incorporating both high-level and low-level features.
  • methods: The proposed method, Perceptual MAE (PMAE), enhances the Masked Autoencoder (MAE) with high-resolution inputs and a perceptual loss supervision module to better capture both object semantics and pixel-level features.
  • results: Extensive experiments show that PMAE outperforms state-of-the-art tampering localization methods on all five publicly available datasets, demonstrating the effectiveness of integrating high-level and low-level features in IML tasks.
    Abstract Nowadays, multimedia forensics faces unprecedented challenges due to the rapid advancement of multimedia generation technology thereby making Image Manipulation Localization (IML) crucial in the pursuit of truth. The key to IML lies in revealing the artifacts or inconsistencies between the tampered and authentic areas, which are evident under pixel-level features. Consequently, existing studies treat IML as a low-level vision task, focusing on allocating tampered masks by crafting pixel-level features such as image RGB noises, edge signals, or high-frequency features. However, in practice, tampering commonly occurs at the object level, and different classes of objects have varying likelihoods of becoming targets of tampering. Therefore, object semantics are also vital in identifying the tampered areas in addition to pixel-level features. This necessitates IML models to carry out a semantic understanding of the entire image. In this paper, we reformulate the IML task as a high-level vision task that greatly benefits from low-level features. Based on such an interpretation, we propose a method to enhance the Masked Autoencoder (MAE) by incorporating high-resolution inputs and a perceptual loss supervision module, which is termed Perceptual MAE (PMAE). While MAE has demonstrated an impressive understanding of object semantics, PMAE can also compensate for low-level semantics with our proposed enhancements. Evidenced by extensive experiments, this paradigm effectively unites the low-level and high-level features of the IML task and outperforms state-of-the-art tampering localization methods on all five publicly available datasets.

Watt For What: Rethinking Deep Learning’s Energy-Performance Relationship

  • paper_url: http://arxiv.org/abs/2310.06522
  • repo_url: None
  • paper_authors: Shreyank N Gowda, Xinyue Hao, Gen Li, Laura Sevilla-Lara, Shashank Narayana Gowda
  • for: To examine the trade-off between the accuracy and electricity consumption of deep learning models across different GPUs, and to propose a metric that penalizes large electricity consumption to reflect environmental impact.
  • methods: Measures the electricity consumption and accuracy of a wide range of deep learning models on different GPUs and evaluates their accuracy-efficiency trade-off as accuracy per unit of electricity consumed.
  • results: Smaller, more energy-efficient models can deliver strong accuracy while consuming far less electricity, and a clear accuracy-electricity trade-off exists across GPUs. Optimizing models for efficiency can reduce electricity consumption, mitigate environmental impact, and foster a fairer, more competitive research landscape.
    Abstract Deep learning models have revolutionized various fields, from image recognition to natural language processing, by achieving unprecedented levels of accuracy. However, their increasing energy consumption has raised concerns about their environmental impact, disadvantaging smaller entities in research and exacerbating global energy consumption. In this paper, we explore the trade-off between model accuracy and electricity consumption, proposing a metric that penalizes large consumption of electricity. We conduct a comprehensive study on the electricity consumption of various deep learning models across different GPUs, presenting a detailed analysis of their accuracy-efficiency trade-offs. By evaluating accuracy per unit of electricity consumed, we demonstrate how smaller, more energy-efficient models can significantly expedite research while mitigating environmental concerns. Our results highlight the potential for a more sustainable approach to deep learning, emphasizing the importance of optimizing models for efficiency. This research also contributes to a more equitable research landscape, where smaller entities can compete effectively with larger counterparts. This advocates for the adoption of efficient deep learning practices to reduce electricity consumption, safeguarding the environment for future generations whilst also helping ensure a fairer competitive landscape.
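The accuracy-per-electricity idea can be made concrete as a simple ratio. The efficiency score and the numbers below are illustrative assumptions, not the paper's exact metric or measurements:

```python
# Accuracy-per-electricity comparison for hypothetical models. The score
# (accuracy divided by kWh consumed) is one illustrative way to penalize large
# electricity consumption; it is not necessarily the paper's exact metric.
models = {
    # name: (top-1 accuracy in %, electricity consumed for the run in kWh) -- made-up numbers
    "small_cnn":         (76.0, 1.2),
    "large_transformer": (84.5, 9.5),
}
for name, (acc, kwh) in models.items():
    print(f"{name:18s} accuracy={acc:.1f}%  energy={kwh:.1f} kWh  accuracy/kWh={acc / kwh:.1f}")
```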

Deep Learning for Automatic Detection and Facial Recognition in Japanese Macaques: Illuminating Social Networks

  • paper_url: http://arxiv.org/abs/2310.06489
  • repo_url: None
  • paper_authors: Julien Paulet, Axel Molina, Benjamin Beltzung, Takafumi Suzumura, Shinya Yamamoto, Cédric Sueur
  • for: To develop a non-invasive tool for face detection and individual identification of Japanese macaques (Macaca fuscata), with the ultimate goal of automatically generating a social network representation of the studied population.
  • methods: Deep learning-based object detection and recognition of macaque faces, using a Faster-RCNN model for face detection and a YOLOv8n model for individual recognition, which reach 82.2% and 83% accuracy respectively in the experiments.
  • results: Preliminary results show that deep learning can successfully detect and identify Japanese macaque faces. A social network of the Kōjima island population built by traditional methods from co-occurrences in video footage provides a benchmark against which the automatically generated network will be assessed for reliability.
    Abstract Individual identification plays a pivotal role in ecology and ethology, notably as a tool for complex social structures understanding. However, traditional identification methods often involve invasive physical tags and can prove both disruptive for animals and time-intensive for researchers. In recent years, the integration of deep learning in research offered new methodological perspectives through automatization of complex tasks. Harnessing object detection and recognition technologies is increasingly used by researchers to achieve identification on video footage. This study represents a preliminary exploration into the development of a non-invasive tool for face detection and individual identification of Japanese macaques (Macaca fuscata) through deep learning. The ultimate goal of this research is, using identifications done on the dataset, to automatically generate a social network representation of the studied population. The current main results are promising: (i) the creation of a Japanese macaques' face detector (Faster-RCNN model), reaching a 82.2% accuracy and (ii) the creation of an individual recognizer for Kōjima island macaques population (YOLOv8n model), reaching a 83% accuracy. We also created a Kōjima population social network by traditional methods, based on co-occurrences on videos. Thus, we provide a benchmark against which the automatically generated network will be assessed for reliability. These preliminary results are a testament to the potential of this innovative approach to provide the scientific community with a tool for tracking individuals and social network studies in Japanese macaques.
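Once individuals are identified per video, the traditional co-occurrence network mentioned above amounts to counting how often pairs of individuals appear together. A minimal sketch with hypothetical individuals:

```python
from collections import Counter
from itertools import combinations

# Building a co-occurrence social network from per-video identifications: the
# edge weight between two individuals is the number of videos in which both
# were recognized together. Individual names here are hypothetical.
videos = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "D"},
    {"A", "C", "D"},
]
edges = Counter()
for present in videos:
    for pair in combinations(sorted(present), 2):
        edges[pair] += 1
print(edges)  # e.g. ('A', 'B'): 2, ('A', 'C'): 2, ('B', 'C'): 1, ...
```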

Focus on Local Regions for Query-based Object Detection

  • paper_url: http://arxiv.org/abs/2310.06470
  • repo_url: None
  • paper_authors: Hongbin Xu, Yamei Xia, Shuai Zhao, Bo Cheng
  • for: To improve the performance and convergence speed of query-based object detection.
  • methods: Proposes FoLR, a transformer-like, decoder-only architecture with an enhanced self-attention mechanism that focuses on local rather than global regions, an adaptive sampling method that extracts effective features from feature maps based on queries' local regions, and a look-back strategy for decoders.
  • results: Experiments show that FoLR achieves state-of-the-art performance and computational efficiency among query-based detectors, converging faster than conventional methods.
    Abstract Query-based methods have garnered significant attention in object detection since the advent of DETR, the pioneering end-to-end query-based detector. However, these methods face challenges like slow convergence and suboptimal performance. Notably, self-attention in object detection often hampers convergence due to its global focus. To address these issues, we propose FoLR, a transformer-like architecture with only decoders. We enhance the self-attention mechanism by isolating connections between irrelevant objects that makes it focus on local regions but not global regions. We also design the adaptive sampling method to extract effective features based on queries' local regions from feature maps. Additionally, we employ a look-back strategy for decoders to retain prior information, followed by the Feature Mixer module to fuse features and queries. Experimental results demonstrate FoLR's state-of-the-art performance in query-based detectors, excelling in convergence speed and computational efficiency.

A Geometrical Approach to Evaluate the Adversarial Robustness of Deep Neural Networks

  • paper_url: http://arxiv.org/abs/2310.06468
  • repo_url: None
  • paper_authors: Yang Wang, Bo Dong, Ke Xu, Haiyin Piao, Yufei Ding, Baocai Yin, Xin Yang
  • for: To propose a new metric for evaluating the adversarial robustness of deep neural networks (DNNs) against different types of attacks on large-scale datasets.
  • methods: The proposed metric, the Adversarial Converging Time Score (ACTS), measures the time it takes an attacker to find an adversarial example for a specific input. ACTS builds on the observation that local neighborhoods on a DNN's output surface have different shapes for different inputs, so the converging time to an adversarial sample varies with the input.
  • results: ACTS is validated on the large-scale ImageNet dataset using state-of-the-art deep networks and proves more efficient and effective than the earlier CLEVER metric; extensive experiments demonstrate its effectiveness and generalization against different adversarial attacks.
    Abstract Deep Neural Networks (DNNs) are widely used for computer vision tasks. However, it has been shown that deep models are vulnerable to adversarial attacks, i.e., their performances drop when imperceptible perturbations are made to the original inputs, which may further degrade the following visual tasks or introduce new problems such as data and privacy security. Hence, metrics for evaluating the robustness of deep models against adversarial attacks are desired. However, previous metrics are mainly proposed for evaluating the adversarial robustness of shallow networks on the small-scale datasets. Although the Cross Lipschitz Extreme Value for nEtwork Robustness (CLEVER) metric has been proposed for large-scale datasets (e.g., the ImageNet dataset), it is computationally expensive and its performance relies on a tractable number of samples. In this paper, we propose the Adversarial Converging Time Score (ACTS), an attack-dependent metric that quantifies the adversarial robustness of a DNN on a specific input. Our key observation is that local neighborhoods on a DNN's output surface would have different shapes given different inputs. Hence, given different inputs, it requires different time for converging to an adversarial sample. Based on this geometry meaning, ACTS measures the converging time as an adversarial robustness metric. We validate the effectiveness and generalization of the proposed ACTS metric against different adversarial attacks on the large-scale ImageNet dataset using state-of-the-art deep networks. Extensive experiments show that our ACTS metric is an efficient and effective adversarial metric over the previous CLEVER metric.
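A rough way to see the "converging time" idea is to count how many iterative attack steps are needed before a model's prediction flips on a given input. The sketch below uses a plain iterative gradient-sign attack as a proxy; it illustrates an attack-dependent, per-input robustness score and is not the exact ACTS formulation.

```python
import torch
import torch.nn as nn

# Count the number of iterative gradient-sign steps until the prediction flips;
# fewer steps suggest a less robust input. A simplified proxy, not the paper's metric.
def converging_time(model, x, y, eps=0.03, step=0.005, max_iters=100):
    x_adv = x.clone().detach()
    for t in range(1, max_iters + 1):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = torch.max(torch.min(x_adv + step * grad.sign(), x + eps), x - eps).detach()
        if model(x_adv).argmax(dim=1).item() != y.item():
            return t
    return max_iters  # no adversarial example found within the budget

# Toy usage on a random linear "model" and a random input.
model = nn.Linear(10, 3)
x, y = torch.randn(1, 10), torch.tensor([0])
print(converging_time(model, x, y))
```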

Solution for SMART-101 Challenge of ICCV Multi-modal Algorithmic Reasoning Task 2023

  • paper_url: http://arxiv.org/abs/2310.06440
  • repo_url: None
  • paper_authors: Xiangyu Wu, Yang Yang, Shengdong Xu, Yifeng Wu, Qingguo Chen, Jianfeng Lu
  • for: This paper presents a solution to the Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge, which evaluates the ability of neural networks to solve visuolinguistic puzzles for children aged 6-8.
  • methods: The authors employed a divide-and-conquer approach, categorizing questions into eight types and using the llama-2-chat model to generate question types in a zero-shot manner. They also trained a yolov7 model on the icon45 dataset for object detection and combined it with OCR to recognize objects and text within images. Additionally, they used the BLIP-2 model with eight adapters to adaptively extract visual features for different question types.
  • results: Under the puzzle splits configuration, the authors achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.
    Abstract In this paper, we present our solution to a Multi-modal Algorithmic Reasoning Task: SMART-101 Challenge. Different from the traditional visual question-answering datasets, this challenge evaluates the abstraction, deduction, and generalization abilities of neural networks in solving visuolinguistic puzzles designed specifically for children in the 6-8 age group. We employed a divide-and-conquer approach. At the data level, inspired by the challenge paper, we categorized the whole questions into eight types and utilized the llama-2-chat model to directly generate the type for each question in a zero-shot manner. Additionally, we trained a yolov7 model on the icon45 dataset for object detection and combined it with the OCR method to recognize and locate objects and text within the images. At the model level, we utilized the BLIP-2 model and added eight adapters to the image encoder VIT-G to adaptively extract visual features for different question types. We fed the pre-constructed question templates as input and generated answers using the flan-t5-xxl decoder. Under the puzzle splits configuration, we achieved an accuracy score of 26.5 on the validation set and 24.30 on the private test set.

The Solution for the CVPR2023 NICE Image Captioning Challenge

  • paper_url: http://arxiv.org/abs/2310.06879
  • repo_url: None
  • paper_authors: Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu
  • for: To address the New Frontiers for Zero-shot Image Captioning Challenge, which covers a large variety of new visual concepts from many domains as well as various image types (photographs, illustrations, graphics).
  • methods: At the data level, external training data is collected from Laion-5B, a large-scale CLIP-filtered image-text dataset. At the model level, OFA, a large-scale vision-language pre-training model based on handcrafted templates, performs the captioning task; contrastive learning is introduced in the pre-training stage to learn new visual concepts, and a similarity-bucket strategy is incorporated into the template to push the model towards higher-quality, better-matching captions.
  • results: The method ranks first on the leaderboard, achieving CIDEr scores of 105.17 and 325.72 in the validation and test phases respectively.
    Abstract In this paper, we present our solution to the New frontiers for Zero-shot Image Captioning Challenge. Different from the traditional image captioning datasets, this challenge includes a larger new variety of visual concepts from many domains (such as COVID-19) as well as various image types (photographs, illustrations, graphics). For the data level, we collect external training data from Laion-5B, a large-scale CLIP-filtered image-text dataset. For the model level, we use OFA, a large-scale visual-language pre-training model based on handcrafted templates, to perform the image captioning task. In addition, we introduce contrastive learning to align image-text pairs to learn new visual concepts in the pre-training stage. Then, we propose a similarity-bucket strategy and incorporate this strategy into the template to force the model to generate higher quality and more matching captions. Finally, by retrieval-augmented strategy, we construct a content-rich template, containing the most relevant top-k captions from other image-text pairs, to guide the model in generating semantic-rich captions. Our method ranks first on the leaderboard, achieving 105.17 and 325.72 Cider-Score in the validation and test phase, respectively.

Skeleton Ground Truth Extraction: Methodology, Annotation Tool and Benchmarks

  • paper_url: http://arxiv.org/abs/2310.06437
  • repo_url: https://github.com/cong-yang/skeview
  • paper_authors: Cong Yang, Bipin Indurkhya, John See, Bo Gao, Yan Ke, Zeyd Boukhers, Zhenyu Yang, Marcin Grzegorzek
  • for: To provide a method, based on an extended diagnosticity hypothesis, for generating skeleton ground truth (GT) for shapes and images.
  • methods: A heuristic strategy built on the extended diagnosticity hypothesis that uses human-in-the-loop supervision to extract skeleton GT; a tool called SkeView is developed to generate GT for 17 existing shape and image datasets.
  • results: Experiments show that the GTs generated by this method have good standard consistency and strike a balance between simplicity and completeness.
    Abstract Skeleton Ground Truth (GT) is critical to the success of supervised skeleton extraction methods, especially with the popularity of deep learning techniques. Furthermore, we see skeleton GTs used not only for training skeleton detectors with Convolutional Neural Networks (CNN) but also for evaluating skeleton-related pruning and matching algorithms. However, most existing shape and image datasets suffer from the lack of skeleton GT and inconsistency of GT standards. As a result, it is difficult to evaluate and reproduce CNN-based skeleton detectors and algorithms on a fair basis. In this paper, we present a heuristic strategy for object skeleton GT extraction in binary shapes and natural images. Our strategy is built on an extended theory of diagnosticity hypothesis, which enables encoding human-in-the-loop GT extraction based on clues from the target's context, simplicity, and completeness. Using this strategy, we developed a tool, SkeView, to generate skeleton GT of 17 existing shape and image datasets. The GTs are then structurally evaluated with representative methods to build viable baselines for fair comparisons. Experiments demonstrate that GTs generated by our strategy yield promising quality with respect to standard consistency, and also provide a balance between simplicity and completeness.

Conformal Prediction for Deep Classifier via Label Ranking

  • paper_url: http://arxiv.org/abs/2310.06430
  • repo_url: https://github.com/ml-stat-Sustech/conformal_prediction_via_label_ranking
  • paper_authors: Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, Hongxin Wei
  • for: To propose Sorted Adaptive Prediction Sets (SAPS), a new conformal prediction method for constructing prediction sets over machine learning classifier outputs.
  • methods: SAPS discards all probability values except the maximum softmax probability, reducing the dependence of the non-conformity score on miscalibrated probabilities and thereby shrinking prediction sets.
  • results: Experiments and theoretical analysis show that SAPS produces smaller prediction sets while retaining instance-wise uncertainty information, and it broadly improves the conditional coverage rate and adaptivity of prediction sets.
    Abstract Conformal prediction is a statistical framework that generates prediction sets containing ground-truth labels with a desired coverage guarantee. The predicted probabilities produced by machine learning models are generally miscalibrated, leading to large prediction sets in conformal prediction. In this paper, we empirically and theoretically show that disregarding the probabilities' value will mitigate the undesirable effect of miscalibrated probability values. Then, we propose a novel algorithm named $\textit{Sorted Adaptive prediction sets}$ (SAPS), which discards all the probability values except for the maximum softmax probability. The key idea behind SAPS is to minimize the dependence of the non-conformity score on the probability values while retaining the uncertainty information. In this manner, SAPS can produce sets of small size and communicate instance-wise uncertainty. Theoretically, we provide a finite-sample coverage guarantee of SAPS and show that the expected value of set size from SAPS is always smaller than APS. Extensive experiments validate that SAPS not only lessens the prediction sets but also broadly enhances the conditional coverage rate and adaptation of prediction sets.
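The overall recipe can be sketched as split conformal prediction with a non-conformity score that depends only on the maximum softmax probability and the label's rank. The score below is a simplified stand-in for SAPS (the paper's exact score differs), included only to show the mechanics of calibration and set construction.

```python
import numpy as np

# Split conformal prediction with a simplified SAPS-like score: the score for a
# label uses only the maximum softmax probability and the label's rank, so the
# (possibly miscalibrated) non-maximum probabilities are ignored.
def score(probs, label, lam=0.1):
    order = np.argsort(-probs)                  # labels sorted by predicted probability
    rank = int(np.where(order == label)[0][0])  # 0 for the top-1 label
    return probs.max() if rank == 0 else probs.max() + lam * rank

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1, lam=0.1):
    cal_scores = np.array([score(p, y, lam) for p, y in zip(cal_probs, cal_labels)])
    n = len(cal_scores)
    q = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    return [[k for k in range(p.shape[0]) if score(p, k, lam) <= q] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)   # stand-in softmax outputs for calibration
cal_labels = rng.integers(0, 5, size=200)
test_probs = rng.dirichlet(np.ones(5), size=3)
print(conformal_sets(cal_probs, cal_labels, test_probs))
```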

AnoDODE: Anomaly Detection with Diffusion ODE

  • paper_url: http://arxiv.org/abs/2310.06420
  • repo_url: None
  • paper_authors: Xianyao Hu, Congming Jin
  • for: Anomaly detection in medical images, to improve the accuracy of abnormality detection in clinical screening and diagnosis.
  • methods: Uses diffusion ODEs for unsupervised anomaly detection, with an anomaly score based on the density of features extracted from multi-scale medical images and a reconstruction-based anomaly localization method.
  • results: Experiments on the BraTS2021 medical dataset show higher effectiveness and robustness than existing methods.
    Abstract Anomaly detection is the process of identifying atypical data samples that significantly deviate from the majority of the dataset. In the realm of clinical screening and diagnosis, detecting abnormalities in medical images holds great importance. Typically, clinical practice provides access to a vast collection of normal images, while abnormal images are relatively scarce. We hypothesize that abnormal images and their associated features tend to manifest in low-density regions of the data distribution. Following this assumption, we turn to diffusion ODEs for unsupervised anomaly detection, given their tractability and superior performance in density estimation tasks. More precisely, we propose a new anomaly detection method based on diffusion ODEs by estimating the density of features extracted from multi-scale medical images. Our anomaly scoring mechanism depends on computing the negative log-likelihood of features extracted from medical images at different scales, quantified in bits per dimension. Furthermore, we propose a reconstruction-based anomaly localization suitable for our method. Our proposed method not only identifies anomalies but also provides interpretability at both the image and pixel levels. Through experiments on the BraTS2021 medical dataset, our proposed method outperforms existing methods. These results confirm the effectiveness and robustness of our method.
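The anomaly score described above is the negative log-likelihood expressed in bits per dimension. A small sketch of that conversion, with placeholder log-likelihood values rather than outputs of an actual diffusion-ODE density estimator:

```python
import numpy as np

# Convert a model's log-likelihood (in nats) into bits per dimension: higher
# bits/dim means lower density under the model, i.e. a more anomalous input.
# The log-likelihood values below are placeholders.
def bits_per_dim(log_likelihood_nats, num_dims):
    return -log_likelihood_nats / (num_dims * np.log(2.0))

num_dims = 64 * 64          # e.g., a 64x64 single-channel feature map
print(bits_per_dim(-12000.0, num_dims))  # more typical input  -> lower bits/dim
print(bits_per_dim(-20000.0, num_dims))  # less likely input   -> higher bits/dim
```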

Boundary Discretization and Reliable Classification Network for Temporal Action Detection

  • paper_url: http://arxiv.org/abs/2310.06403
  • repo_url: https://github.com/zhenyingfang/BDRC-Net
  • paper_authors: Zhenying Fang
  • for: To improve temporal action detection in untrimmed videos, proposing a Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses two key issues of mixed anchor-based/anchor-free methods.
  • methods: Combines a Boundary Discretization Module (BDM), which removes the need for handcrafted anchor design, with a Reliable Classification Module (RCM), which improves detection accuracy by reducing false positives in action category predictions.
  • results: BDRC-Net achieves state-of-the-art performance on several benchmarks, e.g. an average mAP of 68.6% on THUMOS'14, 1.5% above the previous best.
    Abstract Temporal action detection aims to recognize the action category and determine the starting and ending time of each action instance in untrimmed videos. The mixed methods have achieved remarkable performance by simply merging anchor-based and anchor-free approaches. However, there are still two crucial issues in the mixed framework: (1) Brute-force merging and handcrafted anchors design affect the performance and practical application of the mixed methods. (2) A large number of false positives in action category predictions further impact the detection performance. In this paper, we propose a novel Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses the above issues by introducing boundary discretization and reliable classification modules. Specifically, the boundary discretization module (BDM) elegantly merges anchor-based and anchor-free approaches in the form of boundary discretization, avoiding the handcrafted anchors design required by traditional mixed methods. Furthermore, the reliable classification module (RCM) predicts reliable action categories to reduce false positives in action category predictions. Extensive experiments conducted on different benchmarks demonstrate that our proposed method achieves favorable performance compared with the state-of-the-art. For example, BDRC-Net hits an average mAP of 68.6% on THUMOS'14, outperforming the previous best by 1.5%. The code will be released at https://github.com/zhenyingfang/BDRC-Net.

Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling

  • paper_url: http://arxiv.org/abs/2310.06389
  • repo_url: None
  • paper_authors: Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou
  • for: This paper aims to improve the efficiency and adaptability of diffusion models for image generation by introducing a new network architecture called LEGO bricks.
  • methods: The LEGO bricks architecture integrates Local-feature Enrichment and Global-content Orchestration, allowing selective skipping of bricks to reduce sampling costs and to generate higher-resolution images.
  • results: The proposed method enhances training efficiency, expedites convergence, and facilitates variable-resolution image generation while maintaining strong generative performance, and it significantly reduces sampling time compared to other methods.
    Abstract Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models.

3DS-SLAM: A 3D Object Detection based Semantic SLAM towards Dynamic Indoor Environments

  • paper_url: http://arxiv.org/abs/2310.06385
  • repo_url: None
  • paper_authors: Ghanta Sai Krishna, Kundrapu Supriya, Sabur Baidya
  • for: Addressing the drop in camera localization accuracy caused by changing elements in the environment; the paper proposes 3DS-SLAM, a 3D semantic SLAM algorithm that delivers accurate localization in dynamic scenes.
  • methods: A 3D part-aware hybrid transformer detects dynamic objects in point clouds, and a dynamic feature filter based on HDBSCAN clustering removes features with large absolute depth differences to improve localization accuracy.
  • results: Compared with ORB-SLAM2, 3DS-SLAM achieves an average improvement of 98.01% on the dynamic sequences of the TUM RGB-D dataset and outperforms four other leading SLAM systems designed for dynamic environments.
    Abstract The existence of variable factors within the environment can cause a decline in camera localization accuracy, as it violates the fundamental assumption of a static environment in Simultaneous Localization and Mapping (SLAM) algorithms. Recent semantic SLAM systems towards dynamic environments either rely solely on 2D semantic information, or solely on geometric information, or combine their results in a loosely integrated manner. In this research paper, we introduce 3DS-SLAM, 3D Semantic SLAM, tailored for dynamic scenes with visual 3D object detection. The 3DS-SLAM is a tightly-coupled algorithm resolving both semantic and geometric constraints sequentially. We designed a 3D part-aware hybrid transformer for point cloud-based object detection to identify dynamic objects. Subsequently, we propose a dynamic feature filter based on HDBSCAN clustering to extract objects with significant absolute depth differences. When compared against ORB-SLAM2, 3DS-SLAM exhibits an average improvement of 98.01% across the dynamic sequences of the TUM RGB-D dataset. Furthermore, it surpasses the performance of the other four leading SLAM systems designed for dynamic environments.
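
To make the dynamic-feature filtering step concrete, here is a rough sketch of how features inside detected dynamic objects could be pruned by clustering absolute depth differences with HDBSCAN. The function, thresholds, and box representation are our assumptions for illustration, not the 3DS-SLAM implementation.

```python
# Illustrative sketch of a dynamic-feature filter in the spirit of 3DS-SLAM:
# features that fall inside detected dynamic-object boxes and whose depths
# cluster near the object depth are discarded before tracking.
import numpy as np
from sklearn.cluster import HDBSCAN   # scikit-learn >= 1.3

def filter_dynamic_features(uv, depth, boxes_2d, box_depths, margin=0.3):
    """uv: (N, 2) pixel coordinates of features, depth: (N,) metric depths,
    boxes_2d: list of (x1, y1, x2, y2) for detected dynamic objects,
    box_depths: median depth of each detected object."""
    keep = np.ones(len(uv), dtype=bool)
    for (x1, y1, x2, y2), d_obj in zip(boxes_2d, box_depths):
        inside = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & \
                 (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
        if inside.sum() < 5:
            continue
        # Cluster the absolute depth differences of features inside the box.
        diffs = np.abs(depth[inside] - d_obj).reshape(-1, 1)
        labels = HDBSCAN(min_cluster_size=5).fit_predict(diffs)
        # Clusters close to the object depth likely lie on the dynamic object
        # itself and are removed; distant clusters are treated as background.
        for lbl in set(labels) - {-1}:
            if diffs[labels == lbl].mean() < margin:
                idx = np.flatnonzero(inside)[labels == lbl]
                keep[idx] = False
    return keep
```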

Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data

  • paper_url: http://arxiv.org/abs/2310.06372
  • repo_url: https://github.com/lukasstruppek/robust_training_on_poisoned_samples
  • paper_authors: Lukas Struppek, Martin B. Hentschel, Clifton Poth, Dominik Hintersdorf, Kristian Kersting
  • for: Protecting neural network training from backdoor attacks, i.e., attacks that implant hidden functionality into a model.
  • methods: Use diffusion models to create synthetic variations of every training sample, and distill knowledge into a student model so that it becomes robust to backdoor triggers.
  • results: The resulting student models keep their general task performance while remaining consistent across trigger patterns and resisting potential backdoor attacks.
    Abstract Backdoor attacks pose a serious security threat for training neural networks as they surreptitiously introduce hidden functionalities into a model. Such backdoors remain silent during inference on clean inputs, evading detection due to inconspicuous behavior. However, once a specific trigger pattern appears in the input data, the backdoor activates, causing the model to execute its concealed function. Detecting such poisoned samples within vast datasets is virtually impossible through manual inspection. To address this challenge, we propose a novel approach that enables model training on potentially poisoned datasets by utilizing the power of recent diffusion models. Specifically, we create synthetic variations of all training samples, leveraging the inherent resilience of diffusion models to potential trigger patterns in the data. By combining this generative approach with knowledge distillation, we produce student models that maintain their general performance on the task while exhibiting robust resistance to backdoor triggers.
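
The training recipe above reduces to a fairly standard distillation loop run on diffusion-regenerated samples. The sketch below assumes a hypothetical `make_variation` callable standing in for any image-variation diffusion pipeline; it is meant only to illustrate how the generative step and knowledge distillation fit together, not to reproduce the authors' code.

```python
# Sketch of "train on synthetic variations + distill": each training image is
# replaced by a diffusion-generated variation, and a student is distilled from
# a teacher evaluated on those variations. `make_variation` is a placeholder.
import torch
import torch.nn.functional as F

def distill_on_variations(teacher, student, loader, make_variation,
                          optimizer, temperature=2.0, device="cuda"):
    teacher.eval()
    student.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        # Regenerate every sample; the diffusion prior tends to wash out
        # small trigger patterns while preserving semantic content.
        variations = torch.stack([make_variation(img) for img in images]).to(device)
        with torch.no_grad():
            t_logits = teacher(variations)
        s_logits = student(variations)
        # Soft-label distillation plus the usual hard-label term.
        kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=1),
                      F.softmax(t_logits / temperature, dim=1),
                      reduction="batchmean") * temperature ** 2
        ce = F.cross_entropy(s_logits, labels)
        loss = kd + ce
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```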

CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation

  • paper_url: http://arxiv.org/abs/2310.06368
  • repo_url: https://github.com/zkzhang98/coinseg
  • paper_authors: Zekang Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei
  • for: Proposing a class-incremental semantic segmentation method that balances model stability with adaptation to new concepts while optimizing performance across all categories.
  • methods: The proposed CoinSeg prioritizes model plasticity: it maintains multiple contrastive representation centroids to strengthen intra-class diversity, uses mask proposals to identify regions with strong objectness as likely instances/centroids of a category, and applies category-level pseudo-labels to enhance category-level consistency and inter-class diversity.
  • results: Validated on multiple incremental scenarios of Pascal VOC 2012 and ADE20K, CoinSeg achieves superior results, especially in the more challenging and realistic long-term settings.
    Abstract Class incremental semantic segmentation aims to strike a balance between the model's stability and plasticity by maintaining old knowledge while adapting to new concepts. However, most state-of-the-art methods use the freeze strategy for stability, which compromises the model's plasticity. In contrast, releasing parameter training for plasticity could lead to the best performance for all categories, but this requires discriminative feature representation. Therefore, we prioritize the model's plasticity and propose the Contrast inter- and intra-class representations for Incremental Segmentation (CoinSeg), which pursues discriminative representations for flexible parameter tuning. Inspired by the Gaussian mixture model that samples from a mixture of Gaussian distributions, CoinSeg emphasizes intra-class diversity with multiple contrastive representation centroids. Specifically, we use mask proposals to identify regions with strong objectness that are likely to be diverse instances/centroids of a category. These mask proposals are then used for contrastive representations to reinforce intra-class diversity. Meanwhile, to avoid bias from intra-class diversity, we also apply category-level pseudo-labels to enhance category-level consistency and inter-category diversity. Additionally, CoinSeg ensures the model's stability and alleviates forgetting through a specific flexible tuning strategy. We validate CoinSeg on Pascal VOC 2012 and ADE20K datasets with multiple incremental scenarios and achieve superior results compared to previous state-of-the-art methods, especially in more challenging and realistic long-term scenarios. Code is available at https://github.com/zkzhang98/CoinSeg.
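
As a rough illustration of the multi-centroid contrastive idea, the sketch below pulls each proposal embedding toward its nearest same-class centroid and pushes it away from all other centroids. It is a simplification under our own assumptions (shapes, temperature, hard-positive selection), not the CoinSeg implementation.

```python
# Toy sketch of "multiple contrastive centroids per class": each class keeps K
# centroids (intra-class diversity); proposals are attracted to their nearest
# own-class centroid and repelled from the rest via an InfoNCE-style loss.
import torch
import torch.nn.functional as F

def multi_centroid_contrast(embed, labels, centroids, tau=0.1):
    """embed: (N, D) L2-normalized proposal embeddings
    labels: (N,) class id of each proposal
    centroids: (C, K, D) K L2-normalized centroids per class"""
    C, K, D = centroids.shape
    sims = embed @ centroids.reshape(C * K, D).t() / tau      # (N, C*K)
    sims = sims.reshape(-1, C, K)
    # Positive: the most similar centroid of the proposal's own class.
    pos = sims[torch.arange(len(embed)), labels].max(dim=1).values
    denom = torch.logsumexp(sims.reshape(len(embed), -1), dim=1)
    return (denom - pos).mean()    # -log softmax of the positive similarity

emb = F.normalize(torch.randn(32, 128), dim=1)
cents = F.normalize(torch.randn(21, 4, 128), dim=2)
loss = multi_centroid_contrast(emb, torch.randint(0, 21, (32,)), cents)
```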

JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

  • paper_url: http://arxiv.org/abs/2310.06347
  • repo_url: None
  • paper_authors: Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Yao Yao
  • for: Proposing JointNet, a neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps).
  • methods: JointNet extends a pre-trained text-to-image diffusion model: a copy of the original network is created for the new dense-modality branch and densely connected with the RGB branch. The RGB branch is locked during fine-tuning, so the new modality distribution is learned efficiently without degrading the strong generalization ability of the large-scale pre-trained diffusion model.
  • results: Using RGBD diffusion as an example, extensive experiments demonstrate the effectiveness of JointNet across a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.
    Abstract We introduce JointNet, a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps). JointNet is extended from a pre-trained text-to-image diffusion model, where a copy of the original network is created for the new dense modality branch and is densely connected with the RGB branch. The RGB branch is locked during network fine-tuning, which enables efficient learning of the new modality distribution while maintaining the strong generalization ability of the large-scale pre-trained diffusion model. We demonstrate the effectiveness of JointNet by using RGBD diffusion as an example and through extensive experiments, showcasing its applicability in a variety of applications, including joint RGBD generation, dense depth prediction, depth-conditioned image generation, and coherent tile-based 3D panorama generation.
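
A minimal sketch of the branch-copy-and-freeze recipe is given below. The `pretrained_unet` interface, the way the two branches exchange information, and the signatures are placeholders; the released JointNet uses dense connections between the branches rather than the single hand-off shown here.

```python
# Sketch of the JointNet training recipe: duplicate a pre-trained denoising
# network for the new dense modality, freeze the original RGB branch, and only
# train the copy. The forward signature is a stand-in, not the actual model.
import copy
import torch.nn as nn

class JointDiffusion(nn.Module):
    def __init__(self, pretrained_unet: nn.Module):
        super().__init__()
        self.rgb_branch = pretrained_unet
        self.dense_branch = copy.deepcopy(pretrained_unet)   # e.g., for depth maps
        # Lock the RGB branch so its large-scale prior is preserved.
        for p in self.rgb_branch.parameters():
            p.requires_grad_(False)

    def trainable_parameters(self):
        # Only the new modality branch is fine-tuned.
        return self.dense_branch.parameters()

    def forward(self, noisy_rgb, noisy_dense, t, text_emb):
        rgb_eps = self.rgb_branch(noisy_rgb, t, text_emb)
        # The paper densely connects the branches; here the frozen branch's
        # prediction is simply passed to the dense branch as extra context.
        dense_eps = self.dense_branch(noisy_dense + rgb_eps.detach(), t, text_emb)
        return rgb_eps, dense_eps
```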

Local Style Awareness of Font Images

  • paper_url: http://arxiv.org/abs/2310.06337
  • repo_url: https://github.com/rramatchandran/big-o-performance-java
  • paper_authors: Daichi Haraguchi, Seiichi Uchida
  • for: Proposing an attention mechanism that finds the style-relevant local parts of font images.
  • methods: The mechanism is trained in a quasi-self-supervised manner, requiring no manual annotation beyond knowing that a set of character images comes from the same font.
  • results: The trained attention accurately locates style-relevant local parts and enables local style-aware font generation.
    Abstract When we compare fonts, we often pay attention to styles of local parts, such as serifs and curvatures. This paper proposes an attention mechanism to find important local parts. The local parts with larger attention are then considered important. The proposed mechanism can be trained in a quasi-self-supervised manner that requires no manual annotation other than knowing that a set of character images is from the same font, such as Helvetica. After confirming that the trained attention mechanism can find style-relevant local parts, we utilize the resulting attention for local style-aware font generation. Specifically, we design a new reconstruction loss function to put more weight on the local parts with larger attention for generating character images with more accurate style realization. This loss function has the merit of applicability to various font generation models. Our experimental results show that the proposed loss function improves the quality of generated character images by several few-shot font generation models.
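
The reconstruction loss described above can be sketched as an attention-weighted L1 term. The weighting scheme below is our assumption of how such a loss might look, not the paper's exact formulation.

```python
# Sketch of an attention-weighted reconstruction loss: pixels with higher
# style-attention contribute more to the L1 term, while a plain L1 term keeps
# global fidelity. Normalization and epsilon are illustrative choices.
import torch

def attention_weighted_recon_loss(generated, target, attention, eps=1e-6):
    """generated, target: (B, 1, H, W) glyph images in [0, 1]
    attention: (B, 1, H, W) non-negative attention map over local parts"""
    weights = attention / (attention.sum(dim=(2, 3), keepdim=True) + eps)
    per_pixel = (generated - target).abs()
    return per_pixel.mean() + (weights * per_pixel).sum(dim=(2, 3)).mean()
```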

CrowdRec: 3D Crowd Reconstruction from Single Color Images

  • paper_url: http://arxiv.org/abs/2310.06332
  • repo_url: https://github.com/boycehbz/crowdrec
  • paper_authors: Buzhen Huang, Jingyi Ju, Yangang Wang
  • for: Improving 3D human body reconstruction from large-scale crowd images, since existing multi-person mesh recovery methods struggle to perform satisfactorily in crowded scenes.
  • methods: Exploiting crowd features, a crowd-constrained optimization is proposed to improve a single-person mesh recovery network on crowd images.
  • results: With crowd-constrained optimization of the single-person network parameters, accurate body poses and shapes with reasonable absolute positions can be obtained from large-scale crowd images without training on a large-scale 3D crowd dataset.
    Abstract This is a technical report for the GigaCrowd challenge. Reconstructing 3D crowds from monocular images is a challenging problem due to mutual occlusions, severe depth ambiguity, and complex spatial distribution. Since no large-scale 3D crowd dataset can be used to train a robust model, the current multi-person mesh recovery methods can hardly achieve satisfactory performance in crowded scenes. In this paper, we exploit the crowd features and propose a crowd-constrained optimization to improve the common single-person method on crowd images. To avoid scale variations, we first detect human bounding-boxes and 2D poses from the original images with off-the-shelf detectors. Then, we train a single-person mesh recovery network using existing in-the-wild image datasets. To promote a more reasonable spatial distribution, we further propose a crowd constraint to refine the single-person network parameters. With the optimization, we can obtain accurate body poses and shapes with reasonable absolute positions from a large-scale crowd image using a single-person backbone. The code will be publicly available at https://github.com/boycehbz/CrowdRec.

Precise Payload Delivery via Unmanned Aerial Vehicles: An Approach Using Object Detection Algorithms

  • paper_url: http://arxiv.org/abs/2310.06329
  • repo_url: None
  • paper_authors: Aditya Vadduri, Anagh Benjwal, Abhishek Pai, Elkan Quadros, Aniruddh Kammar, Prajwal Uday
  • for: Improving the precision of autonomous payload delivery with unmanned aerial vehicles.
  • methods: A deep-learning-based computer vision approach identifies a target marker and precisely aligns the UAV with it at the delivery position.
  • results: Achieves a 500% increase in average horizontal precision over conventional GPS-based approaches.
    Abstract Recent years have seen tremendous advancements in the area of autonomous payload delivery via unmanned aerial vehicles, or drones. However, most of these works involve delivering the payload at a predetermined location using its GPS coordinates. By relying on GPS coordinates for navigation, the precision of payload delivery is restricted to the accuracy of the GPS network and the availability and strength of the GPS connection, which may be severely restricted by the weather condition at the time and place of operation. In this work we describe the development of a micro-class UAV and propose a novel navigation method that improves the accuracy of conventional navigation methods by incorporating a deep-learning-based computer vision approach to identify and precisely align the UAV with a target marked at the payload delivery position. This proposed method achieves a 500% increase in average horizontal precision over conventional GPS-based approaches.

Adversarial Masked Image Inpainting for Robust Detection of Mpox and Non-Mpox

  • paper_url: http://arxiv.org/abs/2310.06318
  • repo_url: None
  • paper_authors: Yubiao Yue, Zhenzhang Li
  • for: Proposing a generative-model-based mpox detection method to improve detection accuracy and reliability.
  • methods: A generative adversarial network learns mpox image representations by inpainting masked mpox images; an input is then classified as mpox or non-mpox by measuring the similarity between the inpainted image and the original.
  • results: Verified on the MSLD dataset and images of eighteen non-mpox skin diseases, MIM achieves an average AUROC of 0.8237; the study also demonstrates the drawbacks of classification models, supports MIM with clinical validation, and releases an online smartphone app providing free testing in affected areas.
    Abstract Due to the lack of efficient mpox diagnostic technology, mpox cases continue to increase. Recently, the great potential of deep learning models in detecting mpox and non-mpox has been proven. However, existing models learn image representations via image classification, which may leave them easily susceptible to interference from real-world noise, dependent on diverse non-mpox images, and unable to detect abnormal input. These drawbacks make classification models inapplicable in real-world settings. To address this challenge, we propose "Mask, Inpainting, and Measure" (MIM). In MIM's pipeline, a generative adversarial network only learns mpox image representations by inpainting the masked mpox images. Then, MIM determines whether the input belongs to mpox by measuring the similarity between the inpainted image and the original image. The underlying intuition is that since MIM solely models mpox images, it struggles to accurately inpaint non-mpox images in real-world settings. Without utilizing any non-mpox images, MIM cleverly detects mpox and non-mpox and can handle abnormal inputs. We used the recognized mpox dataset (MSLD) and images of eighteen non-mpox skin diseases to verify the effectiveness and robustness of MIM. Experimental results show that the average AUROC of MIM achieves 0.8237. In addition, we demonstrated the drawbacks of classification models and buttressed the potential of MIM through clinical validation. Finally, we developed an online smartphone app to provide free testing to the public in affected areas. This work first employs generative models to improve mpox detection and provides new insights into binary decision-making tasks in medical images.
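
The decision rule amounts to an anomaly-detection-style score: inpaint a masked region with a generator trained only on mpox images and compare it with the original. The sketch below uses a placeholder generator signature, masking scheme, and threshold purely for illustration.

```python
# Sketch of the "Mask, Inpaint, and Measure" decision rule: an input is flagged
# as non-mpox when the mpox-only inpainter reconstructs it poorly.
import torch
import torch.nn.functional as F

def mim_score(image, generator, mask, device="cuda"):
    """image: (1, 3, H, W); mask: (1, 1, H, W) with 1 = region to inpaint."""
    image, mask = image.to(device), mask.to(device)
    with torch.no_grad():
        inpainted = generator(image * (1 - mask), mask)   # fills the masked region
    # Similarity measured only over the inpainted region; higher = more mpox-like.
    mse = F.mse_loss(inpainted * mask, image * mask, reduction="sum") / mask.sum()
    return (-mse).item()

def is_mpox(image, generator, mask, threshold=-0.01):
    # `threshold` would be calibrated on a validation split in practice.
    return mim_score(image, generator, mask) > threshold
```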

Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.06313
  • repo_url: https://github.com/muzishen/pcdms
  • paper_authors: Fei Shen, Hu Ye, Jun Zhang, Cong Wang, Xiao Han, Wei Yang
  • for: Proposing a diffusion-based method for high-quality, high-fidelity pose-guided person image synthesis across different poses.
  • methods: The approach proceeds through 1) global feature prediction, 2) dense correspondence establishment, 3) inpainting conditional diffusion, and 4) refining conditional diffusion.
  • results: The proposed Progressive Conditional Diffusion Models (PCDMs) achieve high-quality, high-fidelity person image synthesis under distinct poses while preserving naturalness and fine detail in challenging scenarios.
    Abstract Recent work has showcased the significant potential of diffusion models in pose-guided person image synthesis. However, owing to the inconsistency in pose between the source and target images, synthesizing an image with a distinct pose, relying exclusively on the source image and target pose information, remains a formidable challenge. This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages. Specifically, in the first stage, we design a simple prior conditional diffusion model that predicts the global features of the target image by mining the global alignment relationship between pose coordinates and image appearance. Then, the second stage establishes a dense correspondence between the source and target images using the global features from the previous stage, and an inpainting conditional diffusion model is proposed to further align and enhance the contextual features, generating a coarse-grained person image. In the third stage, we propose a refining conditional diffusion model to utilize the coarsely generated image from the previous stage as a condition, achieving texture restoration and enhancing fine-detail consistency. The three-stage PCDMs work progressively to generate the final high-quality and high-fidelity synthesized image. Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.The code and model will be available at https://github.com/muzishen/PCDMs.
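
At a high level the three stages chain together as in the sketch below, where each stage is treated as an opaque conditional diffusion sampler; the function names and signatures are illustrative stand-ins, not the released PCDMs interface.

```python
# High-level sketch of the progressive three-stage pipeline described above.
def pcdm_pipeline(src_image, src_pose, tgt_pose,
                  prior_model, inpaint_model, refine_model):
    # Stage 1: predict global features of the target image from pose + appearance.
    global_feat = prior_model.sample(src_image=src_image,
                                     src_pose=src_pose, tgt_pose=tgt_pose)
    # Stage 2: establish dense correspondence and generate a coarse person image.
    coarse = inpaint_model.sample(src_image=src_image, tgt_pose=tgt_pose,
                                  global_feat=global_feat)
    # Stage 3: condition on the coarse result to restore texture and fine detail.
    return refine_model.sample(coarse=coarse, src_image=src_image,
                               tgt_pose=tgt_pose)
```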

Improving Compositional Text-to-image Generation with Large Vision-Language Models

  • paper_url: http://arxiv.org/abs/2310.06311
  • repo_url: None
  • paper_authors: Song Wen, Guian Fang, Renrui Zhang, Peng Gao, Hao Dong, Dimitris Metaxas
  • for: Improving the alignment of text-to-image models, especially for prompts describing multiple objects, varied attributes, and intricate spatial relationships.
  • methods: A large vision-language model (LVLM) performs multi-dimensional assessment of text-image alignment and is used to fine-tune the diffusion model; at inference, the LVLM pinpoints misaligned regions in the initial image, which are corrected with an image editing algorithm until no further misalignment is detected.
  • results: Experiments show that the proposed method significantly improves text-image alignment, particularly with respect to object count, attribute binding, spatial relationships, and aesthetic quality.
    Abstract Recent advancements in text-to-image models, particularly diffusion models, have shown significant promise. However, compositional text-to-image models frequently encounter difficulties in generating high-quality images that accurately align with input texts describing multiple objects, variable attributes, and intricate spatial relationships. To address this limitation, we employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts. Utilizing this assessment, we fine-tune the diffusion model to enhance its alignment capabilities. During the inference phase, an initial image is produced using the fine-tuned diffusion model. The LVLM is then employed to pinpoint areas of misalignment in the initial image, which are subsequently corrected using the image editing algorithm until no further misalignments are detected by the LVLM. The resultant image is consequently more closely aligned with the input text. Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation, particularly with respect to object number, attribute binding, spatial relationships, and aesthetic quality.
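
The inference-time correction procedure is essentially a generate-assess-edit loop. The sketch below shows that loop with the generator, LVLM assessor, and image editor left as placeholder callables; it is a schematic of the described procedure, not the authors' code.

```python
# Sketch of the correction loop: the LVLM checks the image against the prompt,
# and an editing step fixes flagged regions until no misalignment remains.
def aligned_generation(prompt, generate, assess, edit, max_rounds=5):
    image = generate(prompt)
    for _ in range(max_rounds):
        issues = assess(image, prompt)   # e.g., list of (region, correction) pairs
        if not issues:
            break
        for region, correction in issues:
            image = edit(image, region, correction)
    return image
```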

Three-Dimensional Medical Image Fusion with Deformable Cross-Attention

  • paper_url: http://arxiv.org/abs/2310.06291
  • repo_url: None
  • paper_authors: Lin Liu, Xinxin Fan, Chulong Zhang, Jingjing Dai, Yaoqin Xie, Xiaokun Liang
  • for: Multimodal medical image fusion to improve disease recognition and tumor detection.
  • methods: An unsupervised feature mutual learning fusion network with a Deformable Cross Feature Blend (DCFB) module that helps the two modalities discern their respective similarities and differences.
  • results: Applied to 3D MRI and PET images from the ADNI dataset, the DCFB module produces high-quality MRI-PET fusion images; the method outperforms traditional 2D image fusion approaches on metrics such as PSNR and SSIM, and its ability to fuse 3D images increases the information available to physicians and researchers, marking a significant step forward for the field.
    Abstract Multimodal medical image fusion plays an instrumental role in several areas of medical image processing, particularly in disease recognition and tumor detection. Traditional fusion methods tend to process each modality independently before combining the features and reconstructing the fusion image. However, this approach often neglects the fundamental commonalities and disparities between multimodal information. Furthermore, the prevailing methodologies are largely confined to fusing two-dimensional (2D) medical image slices, leading to a lack of contextual supervision in the fusion images and subsequently, a decreased information yield for physicians relative to three-dimensional (3D) images. In this study, we introduce an innovative unsupervised feature mutual learning fusion network designed to rectify these limitations. Our approach incorporates a Deformable Cross Feature Blend (DCFB) module that facilitates the dual modalities in discerning their respective similarities and differences. We have applied our model to the fusion of 3D MRI and PET images obtained from 660 patients in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. Through the application of the DCFB module, our network generates high-quality MRI-PET fusion images. Experimental results demonstrate that our method surpasses traditional 2D image fusion methods in performance metrics such as Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Importantly, the capacity of our method to fuse 3D images enhances the information available to physicians and researchers, thus marking a significant step forward in the field. The code will soon be available online.
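
A structural sketch of cross-modal fusion in the spirit of DCFB is shown below: each modality attends to the other and the results are merged. The real module uses deformable sampling offsets, which are omitted here, and all dimensions are illustrative.

```python
# Simplified cross-attention fusion block: MRI features query PET features and
# vice versa, so each branch learns what is shared and what is complementary.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.mri_to_pet = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pet_to_mri = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, mri_tokens, pet_tokens):
        # mri_tokens, pet_tokens: (B, N, dim) flattened 3D patch features.
        mri_ctx, _ = self.mri_to_pet(mri_tokens, pet_tokens, pet_tokens)
        pet_ctx, _ = self.pet_to_mri(pet_tokens, mri_tokens, mri_tokens)
        fused = torch.cat([mri_tokens + mri_ctx, pet_tokens + pet_ctx], dim=-1)
        return self.merge(fused)   # (B, N, dim) fused representation
```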

Towards More Efficient Depression Risk Recognition via Gait

  • paper_url: http://arxiv.org/abs/2310.06283
  • repo_url: None
  • paper_authors: Min Ren, Muchan Tao, Xuecai Hu, Xiaotong Liu, Qiong Li, Yongzhen Huang
  • for: Developing a deep-learning-based depression risk recognition model for early identification in primary care, helping prevent severe episodes and relapse and reducing the emotional and financial burden of depression.
  • methods: A large-scale gait database is first constructed, covering over 1,200 participants, 40,000 gait sequences, six viewpoints, and three types of attire; a deep-learning-based depression risk recognition model is then proposed to move beyond hand-crafted approaches.
  • results: Experiments on the constructed database validate the effectiveness of the model and yield many instructive insights, highlighting the significant potential of gait-based depression risk recognition.
    Abstract Depression, a highly prevalent mental illness, affects over 280 million individuals worldwide. Early detection and timely intervention are crucial for promoting remission, preventing relapse, and alleviating the emotional and financial burdens associated with depression. However, patients with depression often go undiagnosed in the primary care setting. Unlike many physiological illnesses, depression lacks objective indicators for recognizing depression risk, and existing methods for depression risk recognition are time-consuming and often encounter a shortage of trained medical professionals. The correlation between gait and depression risk has been empirically established. Gait can serve as a promising objective biomarker, offering the advantage of efficient and convenient data collection. However, current methods for recognizing depression risk based on gait have only been validated on small, private datasets, lacking large-scale publicly available datasets for research purposes. Additionally, these methods are primarily limited to hand-crafted approaches. Gait is a complex form of motion, and hand-crafted gait features often only capture a fraction of the intricate associations between gait and depression risk. Therefore, this study first constructs a large-scale gait database, encompassing over 1,200 individuals, 40,000 gait sequences, and covering six perspectives and three types of attire. Two commonly used psychological scales are provided as depression risk annotations. Subsequently, a deep learning-based depression risk recognition model is proposed, overcoming the limitations of hand-crafted approaches. Through experiments conducted on the constructed large-scale database, the effectiveness of the proposed method is validated, and numerous instructive insights are presented in the paper, highlighting the significant potential of gait-based depression risk recognition.

MuseChat: A Conversational Music Recommendation System for Videos

  • paper_url: http://arxiv.org/abs/2310.06282
  • repo_url: None
  • paper_authors: Zhikang Dong, Bin Chen, Xiulong Liu, Pawel Polak, Peng Zhang
  • for: An innovative dialog-based music recommendation system that offers interactive user engagement and personalized music selections tailored to input videos.
  • methods: A conversation-synthesis method simulates a two-turn interaction between a user and a recommendation system, leveraging pre-trained music tags and artist information; a multi-modal recommendation engine matches music with visual cues from the video, user feedback, and textual input.
  • results: MuseChat, the proposed music recommendation system, surpasses existing state-of-the-art models in music retrieval tasks and pioneers the integration of the recommendation process within a natural language framework.
    Abstract We introduce MuseChat, an innovative dialog-based music recommendation system. This unique platform not only offers interactive user engagement but also suggests music tailored for input videos, so that users can refine and personalize their music selections. In contrast, previous systems predominantly emphasized content compatibility, often overlooking the nuances of users' individual preferences. For example, all the datasets only provide basic music-video pairings or such pairings with textual music descriptions. To address this gap, our research offers three contributions. First, we devise a conversation-synthesis method that simulates a two-turn interaction between a user and a recommendation system, which leverages pre-trained music tags and artist information. In this interaction, users submit a video to the system, which then suggests a suitable music piece with a rationale. Afterwards, users communicate their musical preferences, and the system presents a refined music recommendation with reasoning. Second, we introduce a multi-modal recommendation engine that matches music either by aligning it with visual cues from the video or by harmonizing visual information, feedback from previously recommended music, and the user's textual input. Third, we bridge music representations and textual data with a Large Language Model(Vicuna-7B). This alignment equips MuseChat to deliver music recommendations and their underlying reasoning in a manner resembling human communication. Our evaluations show that MuseChat surpasses existing state-of-the-art models in music retrieval tasks and pioneers the integration of the recommendation process within a natural language framework.

High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field

  • paper_url: http://arxiv.org/abs/2310.06275
  • repo_url: None
  • paper_authors: Minghan Qin, Yifan Liu, Yuelang Xu, Xiaochen Zhao, Yebin Liu, Haoqian Wang
  • for: A crucial aspect of 3D head avatar reconstruction is capturing facial expression detail.
  • methods: A novel Spatially-Varying Expression (SVE) conditioning is proposed; the SVE is generated by a simple MLP-based network and encompasses both spatial positional features and global expression information, complemented by a coarse-to-fine training strategy.
  • results: The method handles intricate facial expressions better and achieves high-fidelity 3D head avatar reconstruction, with higher geometry and rendering quality than other state-of-the-art (SOTA) methods on mobile phone-collected and public datasets.
    Abstract One crucial aspect of 3D head avatar reconstruction lies in the details of facial expressions. Although recent NeRF-based photo-realistic 3D head avatar methods achieve high-quality avatar rendering, they still encounter challenges retaining intricate facial expression details because they overlook the potential of specific expression variations at different spatial positions when conditioning the radiance field. Motivated by this observation, we introduce a novel Spatially-Varying Expression (SVE) conditioning. The SVE can be obtained by a simple MLP-based generation network, encompassing both spatial positional features and global expression information. Benefiting from rich and diverse information of the SVE at different positions, the proposed SVE-conditioned neural radiance field can deal with intricate facial expressions and achieve realistic rendering and geometry details of high-fidelity 3D head avatars. Additionally, to further elevate the geometric and rendering quality, we introduce a new coarse-to-fine training strategy, including a geometry initialization strategy at the coarse stage and an adaptive importance sampling strategy at the fine stage. Extensive experiments indicate that our method outperforms other state-of-the-art (SOTA) methods in rendering and geometry quality on mobile phone-collected and public datasets.
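
The SVE conditioning can be sketched as a small MLP that maps an encoded 3D position plus the global expression code to a per-position expression feature. Dimensions, the positional encoding, and how the feature enters the radiance field are our assumptions for illustration.

```python
# Sketch of a spatially-varying expression (SVE) condition: a small MLP
# combines the positionally-encoded query point with the global expression
# code, producing a per-position feature that conditions the radiance field.
import torch
import torch.nn as nn

class SVEGenerator(nn.Module):
    def __init__(self, expr_dim=64, pos_dim=63, out_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(expr_dim + pos_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, encoded_xyz, expr_code):
        # encoded_xyz: (N, pos_dim) positionally-encoded sample points
        # expr_code:  (expr_dim,) global expression vector, shared per frame
        expr = expr_code.expand(encoded_xyz.shape[0], -1)
        return self.mlp(torch.cat([encoded_xyz, expr], dim=-1))   # (N, out_dim)

# The radiance-field MLP would then take [encoded_xyz, sve_feature] as input
# instead of conditioning on the global expression code alone.
```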

Efficient Adaptation of Large Vision Transformer via Adapter Re-Composing

  • paper_url: http://arxiv.org/abs/2310.06234
  • repo_url: https://github.com/davidyanande/arc
  • paper_authors: Wei Dong, Dawei Yan, Zhijun Lin, Peng Wang
  • for: Efficiently adapting large pre-trained vision models to downstream tasks while minimizing the number of adaptation parameters.
  • methods: A novel Adapter Re-Composing (ARC) strategy shares adaptation parameters across layers: symmetric down-/up-projections form shared bottleneck operations, and layer-adaptive adapters are re-composed by learning low-dimensional re-scaling coefficients, enabling efficient adaptation of large pre-trained models with few new parameters.
  • results: Across 24 downstream image classification tasks, the method achieves compelling transfer learning performance with a substantially reduced parameter count.
    Abstract The advent of high-capacity pre-trained models has revolutionized problem-solving in computer vision, shifting the focus from training task-specific models to adapting pre-trained models. Consequently, effectively adapting large pre-trained models to downstream tasks in an efficient manner has become a prominent research area. Existing solutions primarily concentrate on designing lightweight adapters and their interaction with pre-trained models, with the goal of minimizing the number of parameters requiring updates. In this study, we propose a novel Adapter Re-Composing (ARC) strategy that addresses efficient pre-trained model adaptation from a fresh perspective. Our approach considers the reusability of adaptation parameters and introduces a parameter-sharing scheme. Specifically, we leverage symmetric down-/up-projections to construct bottleneck operations, which are shared across layers. By learning low-dimensional re-scaling coefficients, we can effectively re-compose layer-adaptive adapters. This parameter-sharing strategy in adapter design allows us to significantly reduce the number of new parameters while maintaining satisfactory performance, thereby offering a promising approach to compress the adaptation cost. We conduct experiments on 24 downstream image classification tasks using various Vision Transformer variants to evaluate our method. The results demonstrate that our approach achieves compelling transfer learning performance with a reduced parameter count. Our code is available at \href{https://github.com/DavidYanAnDe/ARC}{https://github.com/DavidYanAnDe/ARC}.
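
A simplified reading of the parameter-sharing scheme is sketched below: one shared down-projection, a symmetric up-projection that reuses its transposed weights, and per-layer low-dimensional re-scaling coefficients. This is our interpretation for illustration, not the released ARC code.

```python
# Sketch of adapter re-composing: all layers share one bottleneck projection,
# and each layer only learns a cheap re-scaling vector (plus a bias) to
# "re-compose" its own adapter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAdapterBank(nn.Module):
    def __init__(self, dim=768, bottleneck=64, num_layers=12):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck, bias=False)        # shared across layers
        self.scales = nn.Parameter(torch.ones(num_layers, bottleneck))
        self.biases = nn.Parameter(torch.zeros(num_layers, dim))

    def forward(self, x, layer_idx):
        # x: (B, N, dim) tokens entering layer `layer_idx`.
        z = self.down(x) * self.scales[layer_idx]                 # per-layer tuning
        up = F.linear(z, self.down.weight.t())                    # symmetric up-projection
        return x + up + self.biases[layer_idx]
```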

Spiking PointNet: Spiking Neural Networks for Point Clouds

  • paper_url: http://arxiv.org/abs/2310.06232
  • repo_url: https://github.com/dayongren/spiking-pointnet
  • paper_authors: Dayong Ren, Zhe Ma, Yuanpei Chen, Weihang Peng, Xiaode Liu, Yuhan Zhang, Yufei Guo
  • for: Investigating whether Spiking Neural Networks (SNNs) can be generalized to 3D recognition, and presenting a spiking neural model for efficient deep learning on point clouds.
  • methods: A trained-less but learning-more paradigm: Spiking PointNet is trained with only a single time step, supported by theoretical justification and in-depth experimental analysis.
  • results: Experiments on ModelNet10 and ModelNet40 show that Spiking PointNet trained with a single time step performs even better with multi-time-step inference and can outperform its ANN counterpart, which is rare in the SNN field; it also shows notable speedup and storage savings during training.
    Abstract Recently, Spiking Neural Networks (SNNs), enjoying extreme energy efficiency, have drawn much research attention on 2D visual recognition and shown gradually increasing application potential. However, it still remains underexplored whether SNNs can be generalized to 3D recognition. To this end, we present Spiking PointNet in the paper, the first spiking neural model for efficient deep learning on point clouds. We discover that the two huge obstacles limiting the application of SNNs in point clouds are: the intrinsic optimization obstacle of SNNs that impedes the training of a big spiking model with large time steps, and the expensive memory and computation cost of PointNet that makes training a big spiking point model unrealistic. To solve the problems simultaneously, we present a trained-less but learning-more paradigm for Spiking PointNet with theoretical justifications and in-depth experimental analysis. In specific, our Spiking PointNet is trained with only a single time step but can obtain better performance with multiple time steps inference, compared to the one trained directly with multiple time steps. We conduct various experiments on ModelNet10, ModelNet40 to demonstrate the effectiveness of Spiking PointNet. Notably, our Spiking PointNet even can outperform its ANN counterpart, which is rare in the SNN field thus providing a potential research direction for the following work. Moreover, Spiking PointNet shows impressive speedup and storage saving in the training phase.
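
The single-step-training, multi-step-inference idea can be illustrated with a toy leaky integrate-and-fire layer unrolled over T steps at inference, as sketched below; the surrogate-gradient training details of the actual Spiking PointNet are omitted.

```python
# Toy sketch of "train with one time step, infer with several": a leaky
# integrate-and-fire (LIF) layer is unrolled over T steps and its spike rate
# is used as the output code. Illustrative only.
import torch
import torch.nn as nn

class LIFLayer(nn.Module):
    def __init__(self, dim_in, dim_out, tau=2.0, v_th=1.0):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.tau, self.v_th = tau, v_th

    def forward(self, x, T=1):
        # x: (B, dim_in) static input, repeated across T time steps.
        v = torch.zeros(x.shape[0], self.fc.out_features, device=x.device)
        spikes = []
        for _ in range(T):
            v = v + (self.fc(x) - v) / self.tau    # leaky integration
            spike = (v >= self.v_th).float()       # fire when threshold is crossed
            v = v * (1.0 - spike)                  # hard reset after a spike
            spikes.append(spike)
        return torch.stack(spikes).mean(dim=0)     # rate-coded output

layer = LIFLayer(1024, 40)                         # trained with forward(x, T=1)
logits_multi = layer(torch.randn(8, 1024), T=4)    # richer rate code at inference
```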

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

  • paper_url: http://arxiv.org/abs/2310.06214
  • repo_url: None
  • paper_authors: Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny
  • for: Proposing an interpretable 3D visual grounding framework with the potential to mimic the human perception system.
  • methods: The 3D visual grounding problem is formulated as a sequence-to-sequence task that first predicts a chain of anchors and then the final target; this interpretability not only improves overall performance but also reveals why the network reaches its final decision.
  • results: Extensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks show consistent gains over existing methods with better data efficiency; the framework also integrates easily into existing architectures.
    Abstract 3D visual grounding is the ability to localize objects in 3D scenes conditioned by utterances. Most existing methods devote the referring head to localize the referred object directly, causing failure in complex scenarios. In addition, it does not illustrate how and why the network reaches the final decision. In this paper, we address the question: Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system? To this end, we formulate the 3D visual grounding problem as a sequence-to-sequence task by first predicting a chain of anchors and then the final target. Interpretability not only improves the overall performance but also helps us identify failure cases. Following the chain of thoughts approach enables us to decompose the referring task into interpretable intermediate steps, boosting the performance and making our framework extremely data-efficient. Moreover, our proposed framework can be easily integrated into any existing architecture. We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and Scanrefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. Furthermore, our proposed framework, dubbed CoT3DRef, is significantly data-efficient, whereas on the Sr3D dataset, when trained only on 10% of the data, we match the SOTA performance achieved by training on the entire data.