cs.CV - 2023-09-18

ProtoKD: Learning from Extremely Scarce Data for Parasite Ova Recognition

  • paper_url: http://arxiv.org/abs/2309.10210
  • repo_url: None
  • paper_authors: Shubham Trehan, Udhav Ramachandran, Ruth Scimeca, Sathyanarayanan N. Aakur
  • for: This work aims to develop reliable computational frameworks for early detection of parasitic infections, particularly at the ova (egg) stage.
  • methods: The approach combines prototypical networks with self-distillation to learn robust representations from only one example per class.
  • results: The proposed ProtoKD framework achieves highly reliable performance from a single example per class, and its generalizability is further validated on a large-scale taxonomic profiling task.
    Abstract Developing reliable computational frameworks for early parasite detection, particularly at the ova (or egg) stage is crucial for advancing healthcare and effectively managing potential public health crises. While deep learning has significantly assisted human workers in various tasks, its application and diagnostics has been constrained by the need for extensive datasets. The ability to learn from an extremely scarce training dataset, i.e., when fewer than 5 examples per class are present, is essential for scaling deep learning models in biomedical applications where large-scale data collection and annotation can be expensive or not possible (in case of novel or unknown infectious agents). In this study, we introduce ProtoKD, one of the first approaches to tackle the problem of multi-class parasitic ova recognition using extremely scarce data. Combining the principles of prototypical networks and self-distillation, we can learn robust representations from only one sample per class. Furthermore, we establish a new benchmark to drive research in this critical direction and validate that the proposed ProtoKD framework achieves state-of-the-art performance. Additionally, we evaluate the framework's generalizability to other downstream tasks by assessing its performance on a large-scale taxonomic profiling task based on metagenomes sequenced from real-world clinical data.
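A minimal sketch of the core idea, assuming an encoder that maps images to embeddings and a teacher copy of it (e.g. an EMA or earlier snapshot) for the self-distillation term; the names and loss weighting below are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def prototypes(embeddings, labels, num_classes):
    # Mean embedding per class; with one example per class this is the example itself.
    return torch.stack([embeddings[labels == c].mean(dim=0) for c in range(num_classes)])

def proto_logits(embeddings, protos):
    # Negative squared Euclidean distance to each prototype acts as the logit.
    return -torch.cdist(embeddings, protos) ** 2

def protokd_loss(student_emb, teacher_emb, labels, num_classes, T=4.0, alpha=0.5):
    logits_s = proto_logits(student_emb, prototypes(student_emb, labels, num_classes))
    logits_t = proto_logits(teacher_emb, prototypes(teacher_emb, labels, num_classes)).detach()
    ce = F.cross_entropy(logits_s, labels)                 # prototypical classification
    kd = F.kl_div(F.log_softmax(logits_s / T, dim=1),      # distill the teacher's softened
                  F.softmax(logits_t / T, dim=1),          # prototype scores into the student
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd
```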

Image-Text Pre-Training for Logo Recognition

  • paper_url: http://arxiv.org/abs/2309.10206
  • repo_url: None
  • paper_authors: Mark Hubenthal, Suren Kumar
  • for: This paper aims to advance open-set logo recognition.
  • methods: Two contributions improve the matching model: (a) pre-training on image-text paired samples, and (b) an improved metric learning loss function, ProxyNCAHN++, which incorporates class-specific hard negative images.
  • results: Image-text pre-training improves the visual embedder on the logo retrieval task, especially for text-dominant classes, and ProxyNCAHN++ makes better use of hard negative images; the method sets new state-of-the-art results on five public logo datasets.
    Abstract Open-set logo recognition is commonly solved by first detecting possible logo regions and then matching the detected parts against an ever-evolving dataset of cropped logo images. The matching model, a metric learning problem, is especially challenging for logo recognition due to the mixture of text and symbols in logos. We propose two novel contributions to improve the matching model's performance: (a) using image-text paired samples for pre-training, and (b) an improved metric learning loss function. A standard paradigm of fine-tuning ImageNet pre-trained models fails to discover the text sensitivity necessary to solve the matching problem effectively. This work demonstrates the importance of pre-training on image-text pairs, which significantly improves the performance of a visual embedder trained for the logo retrieval task, especially for more text-dominant classes. We construct a composite public logo dataset combining LogoDet3K, OpenLogo, and FlickrLogos-47 deemed OpenLogoDet3K47. We show that the same vision backbone pre-trained on image-text data, when fine-tuned on OpenLogoDet3K47, achieves $98.6\%$ recall@1, significantly improving performance over pre-training on Imagenet1K ($97.6\%$). We generalize the ProxyNCA++ loss function to propose ProxyNCAHN++ which incorporates class-specific hard negative images. The proposed method sets new state-of-the-art on five public logo datasets considered, with a $3.5\%$ zero-shot recall@1 improvement on LogoDet3K test, $4\%$ on OpenLogo, $6.5\%$ on FlickrLogos-47, $6.2\%$ on Logos In The Wild, and $0.6\%$ on BelgaLogo.
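As a rough illustration of the loss extension described above, here is a ProxyNCA++-style objective with class-specific hard negatives added to the denominator; the exact ProxyNCAHN++ formulation in the paper may differ, and all names below are illustrative.

```python
import torch
import torch.nn.functional as F

def proxy_nca_hn_loss(emb, labels, proxies, hard_neg_emb=None, temperature=0.1):
    """emb: (B, D) embeddings, proxies: (C, D) learnable class proxies,
    hard_neg_emb: optional (B, K, D) pre-mined hard negative embeddings per sample."""
    emb = F.normalize(emb, dim=-1)
    proxies = F.normalize(proxies, dim=-1)
    logits = -(torch.cdist(emb, proxies) ** 2) / temperature          # (B, C)
    if hard_neg_emb is not None:
        hard_neg_emb = F.normalize(hard_neg_emb, dim=-1)
        d_hn = ((emb.unsqueeze(1) - hard_neg_emb) ** 2).sum(-1)       # (B, K)
        logits = torch.cat([logits, -d_hn / temperature], dim=1)      # hard negatives enlarge the denominator only
    return F.cross_entropy(logits, labels)
```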

Machine Learning for enhancing Wind Field Resolution in Complex Terrain

  • paper_url: http://arxiv.org/abs/2309.10172
  • repo_url: https://github.com/jacobwulffwold/gan_sr_wind_field
  • paper_authors: Jacob Wulff Wold, Florian Stadtmann, Adil Rasheed, Mandar Tabib, Omer San, Jan-Tore Horn
  • for: This work develops a neural-network-based method for reconstructing high-resolution atmospheric wind fields over complex terrain.
  • methods: A model based on Enhanced Super-Resolution Generative Adversarial Networks upscales low-resolution wind fields to high resolution while respecting the local terrain.
  • results: The model successfully reconstructs fully resolved 3D velocity fields and clearly outperforms trilinear interpolation; with an appropriate domain-informed cost function, the need for adversarial training can be alleviated.
    Abstract Atmospheric flows are governed by a broad variety of spatio-temporal scales, thus making real-time numerical modeling of such turbulent flows in complex terrain at high resolution computationally intractable. In this study, we demonstrate a neural network approach motivated by Enhanced Super-Resolution Generative Adversarial Networks to upscale low-resolution wind fields to generate high-resolution wind fields in an actual wind farm in Bessaker, Norway. The neural network-based model is shown to successfully reconstruct fully resolved 3D velocity fields from a coarser scale while respecting the local terrain and that it easily outperforms trilinear interpolation. We also demonstrate that by using appropriate cost function based on domain knowledge, we can alleviate the use of adversarial training.
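The abstract does not spell out the domain-informed cost function, so the sketch below is only one plausible reading: a reconstruction loss augmented with gradient and (approximate) mass-conservation terms on the predicted 3D velocity field. All terms and weights are assumptions.

```python
import torch
import torch.nn.functional as F

def wind_sr_loss(pred, target, lambda_grad=0.1, lambda_div=0.1):
    """pred, target: (B, 3, D, H, W) velocity fields (u, v, w) on a regular grid."""
    pixel = F.l1_loss(pred, target)

    def diffs(f):  # forward differences along the three spatial axes
        return (f[..., 1:, :, :] - f[..., :-1, :, :],
                f[..., :, 1:, :] - f[..., :, :-1, :],
                f[..., :, :, 1:] - f[..., :, :, :-1])

    grad = sum(F.l1_loss(gp, gt) for gp, gt in zip(diffs(pred), diffs(target)))
    # Approximate incompressibility: du/dx + dv/dy + dw/dz should stay small.
    du = pred[:, 0, :, :, 1:] - pred[:, 0, :, :, :-1]
    dv = pred[:, 1, :, 1:, :] - pred[:, 1, :, :-1, :]
    dw = pred[:, 2, 1:, :, :] - pred[:, 2, :-1, :, :]
    div = (du[:, 1:, 1:, :] + dv[:, 1:, :, 1:] + dw[:, :, 1:, 1:]).abs().mean()
    return pixel + lambda_grad * grad + lambda_div * div
```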

Specification-Driven Video Search via Foundation Models and Formal Verification

  • paper_url: http://arxiv.org/abs/2309.10171
  • repo_url: None
  • paper_authors: Yunhao Yang, Jean-Raphaël Gaglione, Sandeep Chinchali, Ufuk Topcu
  • for: This paper targets searching for events of interest in videos.
  • methods: The method combines recent advances in vision and language models with formal methods: text-based event descriptions are mapped to linear temporal logic over finite traces (LTL$_f$), an automaton encoding the video information is constructed, and the automaton is formally verified against the specification.
  • results: The method searches videos efficiently while preserving privacy, achieving over 90% precision on privacy-sensitive videos and a state-of-the-art autonomous driving dataset.
    Abstract The increasing abundance of video data enables users to search for events of interest, e.g., emergency incidents. Meanwhile, it raises new concerns, such as the need for preserving privacy. Existing approaches to video search require either manual inspection or a deep learning model with massive training. We develop a method that uses recent advances in vision and language models, as well as formal methods, to search for events of interest in video clips automatically and efficiently. The method consists of an algorithm to map text-based event descriptions into linear temporal logic over finite traces (LTL$_f$) and an algorithm to construct an automaton encoding the video information. Then, the method formally verifies the automaton representing the video against the LTL$_f$ specifications and adds the pertinent video clips to the search result if the automaton satisfies the specifications. We provide qualitative and quantitative analysis to demonstrate the video-searching capability of the proposed method. It achieves over 90 percent precision in searching over privacy-sensitive videos and a state-of-the-art autonomous driving dataset.
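The paper checks an automaton built from the video against LTL$_f$ specifications; the sketch below only illustrates the finite-trace semantics such a specification is evaluated under, using a direct checker over per-frame detections rather than the authors' automaton-based pipeline.

```python
def holds(formula, trace, i=0):
    """formula: nested tuples, e.g. ('F', ('and', 'person', ('F', 'car')));
    trace: list of sets of atomic propositions detected in each frame."""
    if isinstance(formula, str):                      # atomic proposition
        return formula in trace[i]
    op = formula[0]
    if op == 'not':
        return not holds(formula[1], trace, i)
    if op == 'and':
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == 'or':
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == 'X':                                     # strong next (finite-trace semantics)
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == 'F':                                     # eventually
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == 'G':                                     # always
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == 'U':                                     # until
        return any(holds(formula[2], trace, j) and
                   all(holds(formula[1], trace, k) for k in range(i, j))
                   for j in range(i, len(trace)))
    raise ValueError(f"unknown operator {op}")

# Does a pedestrian appear, with a car visible in that or a later frame?
trace = [{'car'}, {'person'}, {'person', 'car'}, set()]
print(holds(('F', ('and', 'person', ('F', 'car'))), trace))   # True
```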

Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels

  • paper_url: http://arxiv.org/abs/2309.10158
  • repo_url: None
  • paper_authors: Andrey Totev, Tomas Ward
  • for: This work aims to improve handwriting recognition workflows, in particular the detection of misspelled (out-of-vocabulary) words that lexicon- or language-model-based post-processing cannot handle.
  • methods: An unrestricted binary classifier is proposed, consisting of a handwriting recognition (HWR) feature extractor and a multimodal classification head that convolves the extracted features with a vector representation of the input text; the classification head is trained entirely on synthetic data created with a state-of-the-art generative adversarial network.
  • results: While maintaining high recall, the classifier can be calibrated to achieve a 19.5% average precision increase over directly using state-of-the-art HWR models, which can translate into significant productivity gains in human-in-the-loop automation.
    Abstract Offline handwriting recognition (HWR) has improved significantly with the advent of deep learning architectures in recent years. Nevertheless, it remains a challenging problem and practical applications often rely on post-processing techniques for restricting the predicted words via lexicons or language models. Despite their enhanced performance, such systems are less usable in contexts where out-of-vocabulary words are anticipated, e.g. for detecting misspelled words in school assessments. To that end, we introduce the task of comparing a handwriting image to text. To solve the problem, we propose an unrestricted binary classifier, consisting of a HWR feature extractor and a multimodal classification head which convolves the feature extractor output with the vector representation of the input text. Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network. We demonstrate that, while maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task by directly using state-of-the-art HWR models. Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
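The fusion of recognizer features with the text label can be pictured as below: the character embeddings of the candidate text are used as a per-sample 1D convolution kernel slid over the HWR feature sequence, and the response is classified as match / no-match. This is one reading of "convolving recognition model features with text labels"; the module names and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageMatchHead(nn.Module):
    """Binary head: does this handwriting image spell the given text?"""
    def __init__(self, feat_dim=256, vocab_size=128):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, feat_dim)
        self.out = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, text_ids):
        # feats: (B, T, D) HWR feature sequence (T >= text length); text_ids: (B, L) char indices.
        B, T, D = feats.shape
        kernel = self.char_emb(text_ids)                          # (B, L, D)
        x = feats.transpose(1, 2).reshape(1, B * D, T)            # grouped-batch trick
        w = kernel.transpose(1, 2).reshape(B, D, -1)              # one dynamic kernel per sample
        resp = F.conv1d(x, w, groups=B).squeeze(0)                # (B, T - L + 1) alignment responses
        score = torch.cat([resp.max(dim=1, keepdim=True).values,
                           resp.mean(dim=1, keepdim=True)], dim=1)
        return self.out(score).squeeze(1)                         # match logit per sample
```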

Preserving Tumor Volumes for Unsupervised Medical Image Registration

  • paper_url: http://arxiv.org/abs/2309.10153
  • repo_url: None
  • paper_authors: Qihua Dong, Hao Du, Ying Song, Yan Xu, Jing Liao
  • for: This paper addresses disproportionate volume changes in medical image registration, ensuring that tumor volumes remain unchanged during registration.
  • methods: A two-stage approach is proposed: similarity-based registration first identifies potential tumor regions by their volume change and produces a soft tumor mask, and a volume-preserving registration with a novel adaptive volume-preserving loss then penalizes size changes according to that mask.
  • results: The method successfully preserves tumor volumes while achieving registration results comparable to state-of-the-art approaches.
    Abstract Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and underlying anatomy, which limits the practical use of image registration in clinical diagnosis. To address this issue, we have formulated image registration with tumors as a constraint problem that preserves tumor volumes while maximizing image similarity in other normal regions. Our proposed strategy involves a two-stage process. In the first stage, we use similarity-based registration to identify potential tumor regions by their volume change, generating a soft tumor mask accordingly. In the second stage, we propose a volume-preserving registration with a novel adaptive volume-preserving loss that penalizes the change in size adaptively based on the masks calculated from the previous stage. Our approach balances image similarity and volume preservation in different regions, i.e., normal and tumor regions, by using soft tumor masks to adjust the imposition of volume-preserving loss on each one. This ensures that the tumor volume is preserved during the registration process. We have evaluated our strategy on various datasets and network architectures, demonstrating that our method successfully preserves the tumor volume while achieving comparable registration results with state-of-the-art methods. Our codes is available at: \url{https://dddraxxx.github.io/Volume-Preserving-Registration/}.
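A minimal sketch of what an adaptive volume-preserving penalty can look like: local volume change is measured by the Jacobian determinant of the mapping phi(x) = x + u(x), and the soft tumor mask from the first stage decides where the penalty is applied. The exact weighting in the paper may differ.

```python
import torch

def jacobian_det(disp):
    """disp: (B, 3, D, H, W) displacement field u(x) on a unit-spaced grid."""
    dz, dy, dx = torch.gradient(disp, dim=(2, 3, 4))              # derivatives along D, H, W
    J = torch.stack([dx, dy, dz], dim=2)                          # (B, 3, 3, D, H, W): d(u_i)/d(x_j)
    J = J + torch.eye(3, device=disp.device).view(1, 3, 3, 1, 1, 1)
    return torch.linalg.det(J.permute(0, 3, 4, 5, 1, 2))          # (B, D, H, W), det of d(phi)/dx

def adaptive_volume_loss(disp, tumor_mask):
    """tumor_mask: (B, 1, D, H, W) soft mask in [0, 1] produced by the first stage."""
    det = jacobian_det(disp).unsqueeze(1)
    penalty = torch.log(det.clamp(min=1e-6)).abs()                # zero iff local volume is preserved
    return (tumor_mask * penalty).mean()
```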

Deep Prompt Tuning for Graph Transformers

  • paper_url: http://arxiv.org/abs/2309.10131
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Reza Shirkavand, Heng Huang
  • for: This paper proposes an alternative to fine-tuning for adapting graph transformers to downstream graph-based prediction tasks.
  • methods: The proposed "deep graph prompt tuning" adds trainable feature nodes to the graph and prepends task-specific tokens to the graph transformer, enhancing the model's expressive power while keeping the pre-trained parameters frozen.
  • results: Experiments show that deep graph prompt tuning achieves comparable or even superior performance to fine-tuning while using significantly fewer task-specific parameters.
    Abstract Graph transformers have gained popularity in various graph-based tasks by addressing challenges faced by traditional Graph Neural Networks. However, the quadratic complexity of self-attention operations and the extensive layering in graph transformer architectures present challenges when applying them to graph based prediction tasks. Fine-tuning, a common approach, is resource-intensive and requires storing multiple copies of large models. We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning for leveraging large graph transformer models in downstream graph based prediction tasks. Our method introduces trainable feature nodes to the graph and pre-pends task-specific tokens to the graph transformer, enhancing the model's expressive power. By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies, making it suitable for small datasets and scalable to large graphs. Through extensive experiments on various-sized datasets, we demonstrate that deep graph prompt tuning achieves comparable or even superior performance to fine-tuning, despite utilizing significantly fewer task-specific parameters. Our contributions include the introduction of prompt tuning for graph transformers, its application to both graph transformers and message passing graph neural networks, improved efficiency and resource utilization, and compelling experimental results. This work brings attention to a promising approach to leverage pre-trained models in graph based prediction tasks and offers new opportunities for exploring and advancing graph representation learning.
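The mechanism is essentially that of prompt tuning in NLP: keep the pre-trained graph transformer frozen and learn only a handful of tokens prepended to its input. A minimal sketch, where `backbone` is a placeholder for any pre-trained model consuming a (B, N, D) token sequence:

```python
import torch
import torch.nn as nn

class GraphPromptTuning(nn.Module):
    def __init__(self, backbone, hidden_dim, num_prompts=10, num_classes=2):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():                  # freeze all pre-trained weights
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)
        self.head = nn.Linear(hidden_dim, num_classes)        # the only other trainable part

    def forward(self, node_feats):
        # node_feats: (B, N, D) projected node features of the input graph.
        tokens = self.prompts.unsqueeze(0).expand(node_feats.size(0), -1, -1)
        x = torch.cat([tokens, node_feats], dim=1)            # task tokens attend jointly with the graph
        h = self.backbone(x)                                  # frozen transformer layers
        return self.head(h[:, 0])                             # read out from the first prompt token
```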

Pre-training on Synthetic Driving Data for Trajectory Prediction

  • paper_url: http://arxiv.org/abs/2309.10121
  • repo_url: None
  • paper_authors: Yiheng Li, Seth Z. Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan
  • for: This work aims to improve the generalization of trajectory prediction models for autonomous driving when training data is limited.
  • methods: Both HD maps and trajectories are augmented and used for pre-training: graph representations of HD maps are reshaped with vector transformations to enrich the limited number of scenes, a rule-based model generates trajectories on the augmented scenes beyond the collected real ones, and several pre-training strategies, including an extension of the Masked AutoEncoder (MAE) to trajectory forecasting, are explored.
  • results: The data expansion and pre-training strategies outperform the baseline prediction model by large margins: 5.04%, 3.84%, and 8.30% in terms of $MR_6$, $minADE_6$, and $minFDE_6$.
    Abstract Accumulating substantial volumes of real-world driving data proves pivotal in the realm of trajectory forecasting for autonomous driving. Given the heavy reliance of current trajectory forecasting models on data-driven methodologies, we aim to tackle the challenge of learning general trajectory forecasting representations under limited data availability. We propose to augment both HD maps and trajectories and apply pre-training strategies on top of them. Specifically, we take advantage of graph representations of HD-map and apply vector transformations to reshape the maps, to easily enrich the limited number of scenes. Additionally, we employ a rule-based model to generate trajectories based on augmented scenes; thus enlarging the trajectories beyond the collected real ones. To foster the learning of general representations within this augmented dataset, we comprehensively explore the different pre-training strategies, including extending the concept of a Masked AutoEncoder (MAE) for trajectory forecasting. Extensive experiments demonstrate the effectiveness of our data expansion and pre-training strategies, which outperform the baseline prediction model by large margins, e.g. 5.04%, 3.84% and 8.30% in terms of $MR_6$, $minADE_6$ and $minFDE_6$.

GEDepth: Ground Embedding for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2309.09975
  • repo_url: https://github.com/qcraftai/gedepth
  • paper_authors: Xiaodong Yang, Zhuang Ma, Zhiyu Ji, Zhe Ren
  • for: Improve the generalizability of monocular depth estimators by decoupling camera parameters from pictorial cues.
  • methods: A novel ground embedding module generates a ground depth map from the camera parameters, stacks it with the input image, and combines it with the residual depth through a ground attention mechanism.
  • results: Experiments show state-of-the-art results on popular benchmarks and significant generalization improvements on a wide range of cross-domain tests.
    Abstract Monocular depth estimation is an ill-posed problem as the same 2D image can be projected from infinite 3D scenes. Although the leading algorithms in this field have reported significant improvement, they are essentially geared to the particular compound of pictorial observations and camera parameters (i.e., intrinsics and extrinsics), strongly limiting their generalizability in real-world scenarios. To cope with this challenge, this paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues, thus promoting the generalization capability. Given camera parameters, the proposed module generates the ground depth, which is stacked with the input image and referenced in the final depth prediction. A ground attention is designed in the module to optimally combine ground depth with residual depth. Our ground embedding is highly flexible and lightweight, leading to a plug-in module that is amenable to be integrated into various depth estimation networks. Experiments reveal that our approach achieves the state-of-the-art results on popular benchmarks, and more importantly, renders significant generalization improvement on a wide range of cross-domain tests.
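The ground depth referenced above can be derived in closed form under a flat-ground, zero-pitch pinhole assumption: a pixel in row v below the horizon would have depth f_y * h / (v - c_y) if it lay on the ground. The sketch below computes only this prior; the paper's module additionally handles extrinsics and fuses it via ground attention.

```python
import numpy as np

def ground_depth(height_px, width_px, fy, cy, cam_height):
    """Returns an (H, W) map of the depth each pixel would have if it lay on a flat ground."""
    v = np.arange(height_px, dtype=np.float32).reshape(-1, 1)     # pixel rows
    denom = v - cy                                                # positive below the horizon
    depth = np.full((height_px, 1), np.inf, dtype=np.float32)
    below = denom > 1e-3
    depth[below] = fy * cam_height / denom[below]                 # z = f_y * h / (v - c_y)
    return np.repeat(depth, width_px, axis=1)

# Illustrative KITTI-like values (not taken from the paper).
gd = ground_depth(375, 1242, fy=721.5, cy=187.0, cam_height=1.65)
```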

Parameter-Efficient Long-Tailed Recognition

  • paper_url: http://arxiv.org/abs/2309.10019
  • repo_url: https://github.com/shijxcs/pel
  • paper_authors: Jiang-Xin Shi, Tong Wei, Zhi Zhou, Xin-Yan Han, Jie-Jing Shao, Yu-Feng Li
  • for: This paper proposes a fine-tuning method that adapts pre-trained models to long-tailed recognition tasks quickly and without extra data, so that tail classes are recognized accurately.
  • methods: PEL introduces a small number of task-specific parameters following the design of existing parameter-efficient fine-tuning methods, reaching good performance in fewer than 20 epochs; it also proposes a semantic-aware classifier initialization derived from the CLIP textual encoder that adds no computational overhead.
  • results: Experiments on four long-tailed datasets show that PEL consistently outperforms previous state-of-the-art approaches.
    Abstract The "pre-training and fine-tuning" paradigm in addressing long-tailed recognition tasks has sparked significant interest since the emergence of large vision-language models like the contrastive language-image pre-training (CLIP). While previous studies have shown promise in adapting pre-trained models for these tasks, they often undesirably require extensive training epochs or additional training data to maintain good performance. In this paper, we propose PEL, a fine-tuning method that can effectively adapt pre-trained models to long-tailed recognition tasks in fewer than 20 epochs without the need for extra data. We first empirically find that commonly used fine-tuning methods, such as full fine-tuning and classifier fine-tuning, suffer from overfitting, resulting in performance deterioration on tail classes. To mitigate this issue, PEL introduces a small number of task-specific parameters by adopting the design of any existing parameter-efficient fine-tuning method. Additionally, to expedite convergence, PEL presents a novel semantic-aware classifier initialization technique derived from the CLIP textual encoder without adding any computational overhead. Our experimental results on four long-tailed datasets demonstrate that PEL consistently outperforms previous state-of-the-art approaches. The source code is available at https://github.com/shijxcs/PEL.
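The semantic-aware classifier initialization can be pictured as follows: encode each class name with the CLIP text encoder and use the normalized embeddings as the initial weight rows of a cosine classifier. This sketch assumes the openai `clip` package; the prompt template and normalization details are assumptions rather than the paper's exact recipe.

```python
import torch
import clip

def semantic_classifier_init(class_names, device="cuda"):
    model, _ = clip.load("ViT-B/16", device=device)
    prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(prompts).float()
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    head = torch.nn.Linear(text_feat.size(1), len(class_names), bias=False)
    head.weight.data.copy_(text_feat)          # classifier starts at the class text embeddings
    return head
```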

vSHARP: variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse-Problems

  • paper_url: http://arxiv.org/abs/2309.09954
  • repo_url: None
  • paper_authors: George Yiasemis, Nikita Moriakov, Jan-Jakob Sonke, Jonas Teuwen
  • for: This work proposes a deep learning-based method for solving ill-posed inverse problems arising in medical imaging (MI).
  • methods: The method unrolls a half-quadratic variable splitting ADMM scheme, enforcing data consistency with a differentiable gradient descent process and enhancing image quality with a U-Net-based denoiser; a dilated-convolution model predicts the Lagrange multipliers used to initialize the ADMM iterations.
  • results: On two distinct datasets for accelerated parallel MRI reconstruction, vSHARP outperforms state-of-the-art approaches.
    Abstract Medical Imaging (MI) tasks, such as accelerated Parallel Magnetic Resonance Imaging (MRI), often involve reconstructing an image from noisy or incomplete measurements. This amounts to solving ill-posed inverse problems, where a satisfactory closed-form analytical solution is not available. Traditional methods such as Compressed Sensing (CS) in MRI reconstruction can be time-consuming or prone to obtaining low-fidelity images. Recently, a plethora of supervised and self-supervised Deep Learning (DL) approaches have demonstrated superior performance in inverse-problem solving, surpassing conventional methods. In this study, we propose vSHARP (variable Splitting Half-quadratic ADMM algorithm for Reconstruction of inverse Problems), a novel DL-based method for solving ill-posed inverse problems arising in MI. vSHARP utilizes the Half-Quadratic Variable Splitting method and employs the Alternating Direction Method of Multipliers (ADMM) to unroll the optimization process. For data consistency, vSHARP unrolls a differentiable gradient descent process in the image domain, while a DL-based denoiser, such as a U-Net architecture, is applied to enhance image quality. vSHARP also employs a dilated-convolution DL-based model to predict the Lagrange multipliers for the ADMM initialization. We evaluate the proposed model by applying it to the task of accelerated Parallel MRI Reconstruction on two distinct datasets. We present a comparative analysis of our experimental results with state-of-the-art approaches, highlighting the superior performance of vSHARP.
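A schematic of the unrolled half-quadratic ADMM loop described above, with a learned denoiser standing in for the z-update proximal step and a few gradient steps on the data term for the x-update. `forward_op`, `adjoint_op`, and the denoiser are placeholders; the multiplier initialization (predicted by a dilated-convolution model in the paper) is passed in as `u0`.

```python
import torch
import torch.nn as nn

class UnrolledADMM(nn.Module):
    def __init__(self, denoiser: nn.Module, num_iters=8, num_grad_steps=4):
        super().__init__()
        self.denoiser = denoiser                                   # e.g. a small U-Net
        self.num_iters, self.num_grad_steps = num_iters, num_grad_steps
        self.step = nn.Parameter(torch.full((num_iters, num_grad_steps), 0.1))
        self.rho = nn.Parameter(torch.ones(num_iters))

    def forward(self, x0, y, forward_op, adjoint_op, u0=None):
        x, z = x0, x0.clone()
        u = torch.zeros_like(x0) if u0 is None else u0             # Lagrange multipliers
        for t in range(self.num_iters):
            z = self.denoiser(x + u)                               # z-update via learned prior
            for s in range(self.num_grad_steps):                   # x-update: data consistency
                grad = adjoint_op(forward_op(x) - y) + self.rho[t] * (x - z + u)
                x = x - self.step[t, s] * grad
            u = u + x - z                                          # dual update
        return x
```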

End-to-End Learned Event- and Image-based Visual Odometry

  • paper_url: http://arxiv.org/abs/2309.09947
  • repo_url: None
  • paper_authors: Roberto Pellerito, Marco Cannici, Daniel Gehrig, Joris Belhadj, Olivier Dubois-Matra, Massimo Casasco, Davide Scaramuzza
  • for: This work targets autonomous robot navigation, particularly in GPS-denied environments such as planetary terrains.
  • methods: The system introduces novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders that are 8x faster and 20% more accurate than existing asynchronous encoders, together with a pose forecasting technique for initialization.
  • results: Although trained only in simulation, RAMP-VO outperforms image- and event-based methods by 52% and 20%, respectively, on real-world benchmarks, including the newly introduced Apollo and Malapert landing sequences.
    Abstract Visual Odometry (VO) is crucial for autonomous robotic navigation, especially in GPS-denied environments like planetary terrains. While standard RGB cameras struggle in low-light or high-speed motion, event-based cameras offer high dynamic range and low latency. However, seamlessly integrating asynchronous event data with synchronous frames remains challenging. We introduce RAMP-VO, the first end-to-end learned event- and image-based VO system. It leverages novel Recurrent, Asynchronous, and Massively Parallel (RAMP) encoders that are 8x faster and 20% more accurate than existing asynchronous encoders. RAMP-VO further employs a novel pose forecasting technique to predict future poses for initialization. Despite being trained only in simulation, RAMP-VO outperforms image- and event-based methods by 52% and 20%, respectively, on traditional, real-world benchmarks as well as newly introduced Apollo and Malapert landing sequences, paving the way for robust and asynchronous VO in space.

Hierarchical Attention and Graph Neural Networks: Toward Drift-Free Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.09934
  • repo_url: None
  • paper_authors: Kathia Melbouci, Fawzi Nashashibi
  • for: This work aims to improve the accuracy of 3D geometric registration without relying on traditional frame-to-frame registration followed by pose graph optimization.
  • methods: A learned model based on hierarchical attention mechanisms and graph neural networks replaces traditional geometric registration and pose graph optimization, together with a strategy to condense the data flow while preserving the information needed for precise rigid pose estimation.
  • results: Tests on the KITTI Odometry dataset show a significant improvement in pose estimation accuracy, especially for the rotational components.
    Abstract The most commonly used method for addressing 3D geometric registration is the iterative closet-point algorithm, this approach is incremental and prone to drift over multiple consecutive frames. The Common strategy to address the drift is the pose graph optimization subsequent to frame-to-frame registration, incorporating a loop closure process that identifies previously visited places. In this paper, we explore a framework that replaces traditional geometric registration and pose graph optimization with a learned model utilizing hierarchical attention mechanisms and graph neural networks. We propose a strategy to condense the data flow, preserving essential information required for the precise estimation of rigid poses. Our results, derived from tests on the KITTI Odometry dataset, demonstrate a significant improvement in pose estimation accuracy. This improvement is especially notable in determining rotational components when compared with results obtained through conventional multi-way registration via pose graph optimization. The code will be made available upon completion of the review process.

Quantum Vision Clustering

  • paper_url: http://arxiv.org/abs/2309.09907
  • repo_url: None
  • paper_authors: Xuan Bac Nguyen, Benjamin Thompson, Hugh Churchill, Khoa Luu, Samee U. Khan
  • for: This paper studies unsupervised visual clustering, which explains the distribution of unlabeled images by clustering them with a parameterized appearance model.
  • methods: It presents the first clustering formulation designed to be solved with adiabatic quantum computing (AQC), using an Ising model to represent the quantum mechanical system implemented on the AQC.
  • results: The approach is competitive with state-of-the-art optimization-based methods, even when solved with off-the-shelf integer programming solvers, and small instances are already solvable on the current generation of real quantum computers.
    Abstract Unsupervised visual clustering has recently received considerable attention. It aims to explain distributions of unlabeled visual images by clustering them via a parameterized appearance model. From a different perspective, the clustering algorithms can be treated as assignment problems, often NP-hard. They can be solved precisely for small instances on current hardware. Adiabatic quantum computing (AQC) offers a solution, as it can soon provide a considerable speedup on a range of NP-hard optimization problems. However, current clustering formulations are unsuitable for quantum computing due to their scaling properties. Consequently, in this work, we propose the first clustering formulation designed to be solved with AQC. We employ an Ising model representing the quantum mechanical system implemented on the AQC. Our approach is competitive compared to state-of-the-art optimization-based approaches, even using of-the-shelf integer programming solvers. Finally, we demonstrate that our clustering problem is already solvable on the current generation of real quantum computers for small examples and analyze the properties of the measured solutions.
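A compact way to see such a formulation is as a QUBO (equivalent to an Ising model up to a linear change of variables), which is the input format adiabatic quantum annealers accept. The objective and constraint weights below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def clustering_qubo(dist, k, penalty=None):
    """dist: (n, n) pairwise distances; binary x[i, c] = 1 iff point i joins cluster c.
    Minimise sum_{i<j, c} dist[i, j] x[i, c] x[j, c] + penalty * sum_i (sum_c x[i, c] - 1)^2."""
    n = dist.shape[0]
    penalty = dist.max() * n if penalty is None else penalty
    idx = lambda i, c: i * k + c                      # flatten (point, cluster) to one variable
    Q = np.zeros((n * k, n * k))
    for c in range(k):                                # same-cluster pairs pay their distance
        for i in range(n):
            for j in range(i + 1, n):
                Q[idx(i, c), idx(j, c)] += dist[i, j]
    for i in range(n):                                # one-hot assignment constraint per point
        for c in range(k):
            Q[idx(i, c), idx(i, c)] += -penalty
            for c2 in range(c + 1, k):
                Q[idx(i, c), idx(i, c2)] += 2 * penalty
    return Q   # minimise x^T Q x over x in {0, 1}^{nk}, e.g. on an annealer or exact solver
```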

On Model Explanations with Transferable Neural Pathways

  • paper_url: http://arxiv.org/abs/2309.09887
  • repo_url: None
  • paper_authors: Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong
  • for: This work aims to make neural-pathway explanations of model behavior more interpretable.
  • methods: Two interpretability criteria are proposed: (i) same-class neural pathways should primarily consist of class-relevant neurons, and (ii) each instance's pathway sparsity should be optimally determined; a Generative Class-relevant Neural Pathway (GEN-CNP) model learns to predict neural pathways from the target model's feature maps.
  • results: Experiments and qualitative analyses show that the class-relevant neural pathways faithfully and interpretably explain the model, and that they transfer to explain other samples of the same class.
    Abstract Neural pathways as model explanations consist of a sparse set of neurons that provide the same level of prediction performance as the whole model. Existing methods primarily focus on accuracy and sparsity but the generated pathways may offer limited interpretability thus fall short in explaining the model behavior. In this paper, we suggest two interpretability criteria of neural pathways: (i) same-class neural pathways should primarily consist of class-relevant neurons; (ii) each instance's neural pathway sparsity should be optimally determined. To this end, we propose a Generative Class-relevant Neural Pathway (GEN-CNP) model that learns to predict the neural pathways from the target model's feature maps. We propose to learn class-relevant information from features of deep and shallow layers such that same-class neural pathways exhibit high similarity. We further impose a faithfulness criterion for GEN-CNP to generate pathways with instance-specific sparsity. We propose to transfer the class-relevant neural pathways to explain samples of the same class and show experimentally and qualitatively their faithfulness and interpretability.

RaLF: Flow-based Global and Metric Radar Localization in LiDAR Maps

  • paper_url: http://arxiv.org/abs/2309.09875
  • repo_url: None
  • paper_authors: Abhijeet Nayak, Daniele Cattaneo, Abhinav Valada
  • for: The paper proposes a novel deep neural network-based approach for localizing radar scans in a LiDAR map of the environment, which can achieve robust and accurate positioning for autonomous vehicles.
  • methods: The approach first learns a shared embedding space between radar and LiDAR modalities via cross-modal metric learning, and then uses a place recognition head to generate global descriptors and a metric localization head to predict the 3-DoF transformation between the radar scan and the map.
  • results: Extensive experiments on multiple real-world driving datasets demonstrate that the proposed approach achieves state-of-the-art performance for both place recognition and metric localization, and can effectively generalize to different cities and sensor setups than the ones used during training.
    Abstract Localization is paramount for autonomous robots. While camera and LiDAR-based approaches have been extensively investigated, they are affected by adverse illumination and weather conditions. Therefore, radar sensors have recently gained attention due to their intrinsic robustness to such conditions. In this paper, we propose RaLF, a novel deep neural network-based approach for localizing radar scans in a LiDAR map of the environment, by jointly learning to address both place recognition and metric localization. RaLF is composed of radar and LiDAR feature encoders, a place recognition head that generates global descriptors, and a metric localization head that predicts the 3-DoF transformation between the radar scan and the map. We tackle the place recognition task by learning a shared embedding space between the two modalities via cross-modal metric learning. Additionally, we perform metric localization by predicting pixel-level flow vectors that align the query radar scan with the LiDAR map. We extensively evaluate our approach on multiple real-world driving datasets and show that RaLF achieves state-of-the-art performance for both place recognition and metric localization. Moreover, we demonstrate that our approach can effectively generalize to different cities and sensor setups than the ones used during training. We make the code and trained models publicly available at http://ralf.cs.uni-freiburg.de.

Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight

  • paper_url: http://arxiv.org/abs/2309.09865
  • repo_url: None
  • paper_authors: Jiaxu Xing, Leonard Bauersfeld, Yunlong Song, Chunwei Xing, Davide Scaramuzza
  • for: This work proposes a scene-transfer strategy for vision-based mobile robotics, addressing the poor sample efficiency and limited generalization of existing end-to-end policy learning approaches.
  • methods: An adaptive multi-pair contrastive learning strategy for visual representation learning enables zero-shot scene transfer and real-world deployment; control policies relying on the learned embedding can operate in unseen environments without fine-tuning in the deployment environment.
  • results: Extensive simulation and real-world experiments on agile, vision-based quadrotor flight show that the approach generalizes beyond the training domain and outperforms all baselines.
    Abstract Scene transfer for vision-based mobile robotics applications is a highly relevant and challenging problem. The utility of a robot greatly depends on its ability to perform a task in the real world, outside of a well-controlled lab environment. Existing scene transfer end-to-end policy learning approaches often suffer from poor sample efficiency or limited generalization capabilities, making them unsuitable for mobile robotics applications. This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment. Control policies relying on the embedding are able to operate in unseen environments without the need for finetuning in the deployment environment. We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight. Extensive simulation and real-world experiments demonstrate that our approach successfully generalizes beyond the training domain and outperforms all baselines.

Unsupervised Open-Vocabulary Object Localization in Videos

  • paper_url: http://arxiv.org/abs/2309.09858
  • repo_url: None
  • paper_authors: Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He
  • for: Automatic localization of objects in videos without manual annotation.
  • methods: The approach builds on recent advances in video representation learning and pre-trained vision-language models: a slot attention approach first localizes objects in videos, and text is then assigned to the obtained slots by reading localized semantic information from the pre-trained CLIP model in an unsupervised way.
  • results: The resulting video object localization is entirely unsupervised apart from the implicit annotation in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
    Abstract In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.

PseudoCal: Towards Initialisation-Free Deep Learning-Based Camera-LiDAR Self-Calibration

  • paper_url: http://arxiv.org/abs/2309.09855
  • repo_url: None
  • paper_authors: Mathieu Cocheteux, Julien Moreau, Franck Davoine
  • for: This paper addresses camera-LiDAR extrinsic self-calibration, a key requirement for multi-sensor fusion in autonomous vehicles and mobile robots.
  • methods: The proposed PseudoCal method leverages the pseudo-LiDAR concept and works directly in 3D space rather than being limited to the camera field of view, avoiding the manual intervention, specific environments, and reliance on initial estimates that hamper traditional and existing deep-learning approaches.
  • results: In typical autonomous vehicle and robotics contexts, PseudoCal performs one-shot calibration quasi-independently of initial parameter estimates and handles extreme cases that remain unsolved by existing approaches.
    Abstract Camera-LiDAR extrinsic calibration is a critical task for multi-sensor fusion in autonomous systems, such as self-driving vehicles and mobile robots. Traditional techniques often require manual intervention or specific environments, making them labour-intensive and error-prone. Existing deep learning-based self-calibration methods focus on small realignments and still rely on initial estimates, limiting their practicality. In this paper, we present PseudoCal, a novel self-calibration method that overcomes these limitations by leveraging the pseudo-LiDAR concept and working directly in the 3D space instead of limiting itself to the camera field of view. In typical autonomous vehicle and robotics contexts and conventions, PseudoCal is able to perform one-shot calibration quasi-independently of initial parameter estimates, addressing extreme cases that remain unsolved by existing approaches.

Hyperbolic vs Euclidean Embeddings in Few-Shot Learning: Two Sides of the Same Coin

  • paper_url: http://arxiv.org/abs/2309.10013
  • repo_url: None
  • paper_authors: Gabriel Moreira, Manuel Marques, João Paulo Costeira, Alexander Hauptmann
  • for: This paper studies representation learning in hyperbolic space, which promises low-dimensional yet highly informative embeddings for image recognition tasks.
  • methods: It focuses on prototypical hyperbolic neural networks, analyzing the tendency of hyperbolic embeddings to converge to the boundary of the Poincaré ball in high dimensions and the effect this has on few-shot classification.
  • results: The best few-shot results are attained for hyperbolic embeddings at a common hyperbolic radius; in contrast to prior benchmark results, a fixed-radius encoder equipped with the Euclidean metric achieves better performance, regardless of the embedding dimension.
    Abstract Recent research in representation learning has shown that hierarchical data lends itself to low-dimensional and highly informative representations in hyperbolic space. However, even if hyperbolic embeddings have gathered attention in image recognition, their optimization is prone to numerical hurdles. Further, it remains unclear which applications stand to benefit the most from the implicit bias imposed by hyperbolicity, when compared to traditional Euclidean features. In this paper, we focus on prototypical hyperbolic neural networks. In particular, the tendency of hyperbolic embeddings to converge to the boundary of the Poincar\'e ball in high dimensions and the effect this has on few-shot classification. We show that the best few-shot results are attained for hyperbolic embeddings at a common hyperbolic radius. In contrast to prior benchmark results, we demonstrate that better performance can be achieved by a fixed-radius encoder equipped with the Euclidean metric, regardless of the embedding dimension.
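For reference, the distance that hyperbolic prototypical networks compare embeddings with is the Poincaré-ball distance below; it grows without bound as points drift toward the ball boundary, which is the behavior the paper analyzes in high dimensions.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """u, v: (..., D) points strictly inside the unit ball (curvature c = 1)."""
    sq_u = u.pow(2).sum(-1).clamp(max=1 - eps)
    sq_v = v.pow(2).sum(-1).clamp(max=1 - eps)
    sq_diff = (u - v).pow(2).sum(-1)
    x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
    return torch.acosh(x.clamp(min=1 + eps))

# Moving the same pair of points toward the boundary blows up their distance.
a, b = torch.tensor([0.1, 0.0]), torch.tensor([0.0, 0.1])
print(poincare_distance(a, b), poincare_distance(9.9 * a, 9.9 * b))
```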

Grasp-Anything: Large-scale Grasp Dataset from Foundation Models

  • paper_url: http://arxiv.org/abs/2309.09818
  • repo_url: None
  • paper_authors: An Dinh Vuong, Minh Nhat Vu, Hieu Le, Baoru Huang, Binh Huynh, Thieu Vo, Andreas Kugi, Anh Nguyen
  • for: This work leverages foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications.
  • methods: The universal real-world knowledge embedded in foundation models is used to synthesize Grasp-Anything, a new large-scale grasp dataset with 1M samples, text descriptions, and more than 3M objects, surpassing prior datasets in diversity and magnitude.
  • results: Grasp-Anything successfully facilitates zero-shot grasp detection in vision-based tasks and real-world robotic experiments, outperforming prior datasets.
    Abstract Foundation models such as ChatGPT have made significant strides in robotic tasks due to their universal representation of real-world domains. In this paper, we leverage foundation models to tackle grasp detection, a persistent challenge in robotics with broad industrial applications. Despite numerous grasp datasets, their object diversity remains limited compared to real-world figures. Fortunately, foundation models possess an extensive repository of real-world knowledge, including objects we encounter in our daily lives. As a consequence, a promising solution to the limited representation in previous grasp datasets is to harness the universal knowledge embedded in these foundation models. We present Grasp-Anything, a new large-scale grasp dataset synthesized from foundation models to implement this solution. Grasp-Anything excels in diversity and magnitude, boasting 1M samples with text descriptions and more than 3M objects, surpassing prior datasets. Empirically, we show that Grasp-Anything successfully facilitates zero-shot grasp detection on vision-based tasks and real-world robotic experiments. Our dataset and code are available at https://grasp-anything-2023.github.io.

R2GenGPT: Radiology Report Generation with Frozen LLMs

  • paper_url: http://arxiv.org/abs/2309.09812
  • repo_url: https://github.com/wang-zhanyu/r2gengpt
  • paper_authors: Zhanyu Wang, Lingqiao Liu, Lei Wang, Luping Zhou
  • for: This work explores how to align visual features with the word embedding space of large language models (LLMs) to improve radiology report generation (R2Gen).
  • methods: The proposed R2GenGPT uses an efficient visual alignment module to map visual features into the LLM's word embedding space, enabling the otherwise frozen LLM to integrate and process image information.
  • results: R2GenGPT attains state-of-the-art performance while training only the lightweight visual alignment module (5M parameters, about 0.07% of the total), with high training efficiency and rapid convergence.
    Abstract Large Language Models (LLMs) have consistently showcased remarkable generalization capabilities when applied to various language tasks. Nonetheless, harnessing the full potential of LLMs for Radiology Report Generation (R2Gen) still presents a challenge, stemming from the inherent disparity in modality between LLMs and the R2Gen task. To bridge this gap effectively, we propose R2GenGPT, which is a novel solution that aligns visual features with the word embedding space of LLMs using an efficient visual alignment module. This innovative approach empowers the previously static LLM to seamlessly integrate and process image information, marking a step forward in optimizing R2Gen performance. R2GenGPT offers the following benefits. First, it attains state-of-the-art (SOTA) performance by training only the lightweight visual alignment module while freezing all the parameters of LLM. Second, it exhibits high training efficiency, as it requires the training of an exceptionally minimal number of parameters while achieving rapid convergence. By employing delta tuning, our model only trains 5M parameters (which constitute just 0.07\% of the total parameter count) to achieve performance close to the SOTA levels. Our code is available at https://github.com/wang-zhanyu/R2GenGPT.

DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.09777
  • repo_url: https://github.com/JeffWang987/DriveDreamer
  • paper_authors: Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiwen Lu
  • for: This work develops DriveDreamer, a world model derived entirely from real-world driving scenarios, to support driving video generation and safe driving policies for autonomous vehicles.
  • methods: A powerful diffusion model is harnessed to construct a comprehensive representation of the complex driving environment, trained with a two-stage pipeline: the first stage teaches DriveDreamer structured traffic constraints, and the second equips it with the ability to anticipate future states.
  • results: Experiments on the challenging nuScenes benchmark show that DriveDreamer enables precise, controllable video generation that faithfully captures the structural constraints of real-world traffic scenarios, and that it can generate realistic and reasonable driving policies.
    Abstract World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. The established world model holds immense potential for the generation of high-quality driving videos, and driving policies for safe maneuvering. However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. Regarding that modeling the world in intricate driving scenes entails an overwhelming search space, we propose harnessing the powerful diffusion model to construct a comprehensive representation of the complex environment. Furthermore, we introduce a two-stage training pipeline. In the initial phase, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states. The proposed DriveDreamer is the first world model established from real-world driving scenarios. We instantiate DriveDreamer on the challenging nuScenes benchmark, and extensive experiments verify that DriveDreamer empowers precise, controllable video generation that faithfully captures the structural constraints of real-world traffic scenarios. Additionally, DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.

Towards Self-Adaptive Pseudo-Label Filtering for Semi-Supervised Learning

  • paper_url: http://arxiv.org/abs/2309.09774
  • repo_url: None
  • paper_authors: Lei Zhu, Zhanghan Ke, Rynson Lau
  • for: Improve training in semi-supervised learning (SSL), especially when labeled data is extremely scarce.
  • methods: The Self-Adaptive Pseudo-Label Filter (SPF) automatically filters noisy pseudo labels as the model evolves: an online mixture model weights each pseudo-labeled sample by the posterior probability of it being correct, based on the confidence distribution at that point of training. Unlike previous hand-crafted filters, SPF evolves together with the deep neural network without manual tuning.
  • results: Incorporating SPF into existing SSL methods improves their performance, particularly when labeled data is extremely scarce.
    Abstract Recent semi-supervised learning (SSL) methods typically include a filtering strategy to improve the quality of pseudo labels. However, these filtering strategies are usually hand-crafted and do not change as the model is updated, resulting in a lot of correct pseudo labels being discarded and incorrect pseudo labels being selected during the training process. In this work, we observe that the distribution gap between the confidence values of correct and incorrect pseudo labels emerges at the very beginning of the training, which can be utilized to filter pseudo labels. Based on this observation, we propose a Self-Adaptive Pseudo-Label Filter (SPF), which automatically filters noise in pseudo labels in accordance with model evolvement by modeling the confidence distribution throughout the training process. Specifically, with an online mixture model, we weight each pseudo-labeled sample by the posterior of it being correct, which takes into consideration the confidence distribution at that time. Unlike previous handcrafted filters, our SPF evolves together with the deep neural network without manual tuning. Extensive experiments demonstrate that incorporating SPF into the existing SSL methods can help improve the performance of SSL, especially when the labeled data is extremely scarce.
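A minimal sketch of the weighting idea, with a 1-D two-component Gaussian mixture standing in for the paper's online mixture model (an assumption): each pseudo-labeled sample is weighted by the posterior of belonging to the high-confidence ("correct") component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pseudo_label_weights(confidences):
    """confidences: (N,) max softmax scores of the current pseudo labels."""
    x = np.asarray(confidences, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=50).fit(x)
    correct = int(np.argmax(gmm.means_.ravel()))       # component with the higher mean = "correct"
    return gmm.predict_proba(x)[:, correct]            # per-sample weight in [0, 1]

# Usage: scale each pseudo-labeled term of the unsupervised loss by its weight.
w = pseudo_label_weights([0.99, 0.95, 0.42, 0.61, 0.97, 0.35])
```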

Semantically Redundant Training Data Removal and Deep Model Classification Performance: A Study with Chest X-rays

  • paper_url: http://arxiv.org/abs/2309.09773
  • repo_url: None
  • paper_authors: Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, Sameer Antani
  • for: This paper proposes an information-content-based training sample selection method to improve the performance and generalizability of deep learning models.
  • methods: An entropy-based sample scoring approach identifies and removes semantically redundant training data, i.e., images with highly similar presentations of the disease of interest.
  • results: On the publicly available NIH chest X-ray dataset, the model trained on the resulting informative subset significantly outperforms the model trained on the full training set in both internal testing (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05).
    Abstract Deep learning (DL) has demonstrated its innate capacity to independently learn hierarchical features from complex and multi-dimensional data. A common understanding is that its performance scales up with the amount of training data. Another data attribute is the inherent variety. It follows, therefore, that semantic redundancy, which is the presence of similar or repetitive information, would tend to lower performance and limit generalizability to unseen data. In medical imaging data, semantic redundancy can occur due to the presence of multiple images that have highly similar presentations for the disease of interest. Further, the common use of augmentation methods to generate variety in DL training may be limiting performance when applied to semantically redundant data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. We demonstrate using the publicly available NIH chest X-ray dataset that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set, during both internal (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.
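One plausible instantiation of "entropy-based sample scoring" is sketched below: score each image by the Shannon entropy of its intensity distribution and retain the most informative fraction. The paper's actual scoring and redundancy criterion may differ; this is an assumption for illustration.

```python
import numpy as np

def image_entropy(img, bins=256):
    """img: 2-D grayscale array; Shannon entropy of its intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(float(img.min()), float(img.max()) + 1e-6))
    p = hist.astype(np.float64) / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_informative(images, keep_fraction=0.7):
    scores = np.array([image_entropy(im) for im in images])
    order = np.argsort(scores)[::-1]                    # most informative first
    return order[: int(len(images) * keep_fraction)]    # indices of the retained subset
```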

Localization-Guided Track: A Deep Association Multi-Object Tracking Framework Based on Localization Confidence of Detections

  • paper_url: http://arxiv.org/abs/2309.09765
  • repo_url: https://github.com/mengting2023/lg-track
  • paper_authors: Ting Meng, Chunyun Fu, Mingguang Huang, Xiyang Wang, Jiawei He, Tao Huang, Wankai Shi
  • for: 提高多目标追踪(MOT)中的精确性和可靠性,特别是在低光度和高背景噪声的情况下。
  • methods: 本研究提出了一种基于探测检测器的Localization-Guided Track(LG-Track)方法,首次在MOT中使用检测框的地理位置信任度,并考虑检测框的清晰度和地理位置准确性,同时设计了一种有效的深度匹配机制。
  • results: 在 MOT17 和 MOT20 数据集上进行了广泛的实验,结果显示,与现有最先进的跟踪方法相比,LG-Track 取得了更优的性能。
    Abstract In currently available literature, no tracking-by-detection (TBD) paradigm-based tracking method has considered the localization confidence of detection boxes. In most TBD-based methods, it is considered that objects of low detection confidence are highly occluded and thus it is a normal practice to directly disregard such objects or to reduce their priority in matching. In addition, appearance similarity is not a factor to consider for matching these objects. However, in terms of the detection confidence fusing classification and localization, objects of low detection confidence may have inaccurate localization but clear appearance; similarly, objects of high detection confidence may have inaccurate localization or unclear appearance; yet these objects are not further classified. In view of these issues, we propose Localization-Guided Track (LG-Track). Firstly, localization confidence is applied in MOT for the first time, with appearance clarity and localization accuracy of detection boxes taken into account, and an effective deep association mechanism is designed; secondly, based on the classification confidence and localization confidence, a more appropriate cost matrix can be selected and used; finally, extensive experiments have been conducted on MOT17 and MOT20 datasets. The results show that our proposed method outperforms the compared state-of-art tracking methods. For the benefit of the community, our code has been made publicly at https://github.com/mengting2023/LG-Track.
    摘要 在现有文献中,尚没有基于检测跟踪(TBD)范式的方法考虑过检测框的定位置信度。大多数 TBD 方法认为检测置信度低的目标往往被严重遮挡,因此通常直接忽略这些目标或降低其匹配优先级,并且在匹配这些目标时不考虑外观相似度。然而,从融合了分类与定位的检测置信度来看,低置信度目标可能定位不准确但外观清晰;同样,高置信度目标也可能定位不准确或外观模糊,而现有方法并未对这些情况做进一步区分。针对这些问题,我们提出了定位引导的跟踪方法(LG-Track)。首先,我们首次在多目标跟踪中引入定位置信度,同时考虑检测框的外观清晰度和定位准确性,并设计了一种有效的深度关联机制;其次,根据分类置信度和定位置信度选择更合适的代价矩阵;最后,我们在 MOT17 和 MOT20 数据集上进行了广泛的实验。结果表明,所提方法优于所对比的最先进跟踪方法。为方便社区使用,我们的代码已公开在 https://github.com/mengting2023/LG-Track。
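A simplified association sketch in the spirit of using both localization and classification confidence: each detection's localization confidence decides how much its cost leans on IoU versus appearance similarity, and classification confidence scales the final cost before Hungarian matching. The specific weighting below is a hypothetical stand-in, not the paper's exact cascade.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between track boxes and detection boxes ([x1, y1, x2, y2])."""
    t, d = np.asarray(tracks, float), np.asarray(dets, float)
    x1 = np.maximum(t[:, None, 0], d[None, :, 0]); y1 = np.maximum(t[:, None, 1], d[None, :, 1])
    x2 = np.minimum(t[:, None, 2], d[None, :, 2]); y2 = np.minimum(t[:, None, 3], d[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (t[:, 2] - t[:, 0]) * (t[:, 3] - t[:, 1])
    area_d = (d[:, 2] - d[:, 0]) * (d[:, 3] - d[:, 1])
    return inter / (area_t[:, None] + area_d[None, :] - inter + 1e-9)

def associate(tracks, dets, app_sim, cls_conf, loc_conf, tau=0.5):
    """Lean on IoU when localization confidence is high, on appearance when low."""
    iou = iou_matrix(tracks, dets)
    w = loc_conf[None, :]                                  # (1, num_dets), values in [0, 1]
    cost = -(w * iou + (1.0 - w) * app_sim) * cls_conf[None, :]
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] > tau]
```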

Application-driven Validation of Posteriors in Inverse Problems

  • paper_url: http://arxiv.org/abs/2309.09764
  • repo_url: None
  • paper_authors: Tim J. Adler, Jan-Hinrich Nölke, Annika Reinke, Minu Dietlinde Tizabi, Sebastian Gruber, Dasha Trofimova, Lynton Ardizzone, Paul F. Jaeger, Florian Buettner, Ullrich Köthe, Lena Maier-Hein
  • for: 这篇论文是为了解决 inverse problems 中的多可能解问题,尤其是 deep learning-based 解决方案无法处理多个可能的解。
  • methods: 本论文讨论了 conditional Diffusion Models 和 Invertible Neural Networks 等基于后验的方法;这些方法向实际应用的转化受限于缺乏充分的验证研究,即尚无合适的验证手段。
  • results: 本论文提出了首个实用应用验证的框架,将 object detection 领域中的验证原则应用到 inverse problems,以提供更好的验证方法。这个框架在 synthetic toy example 和两个医疗影像应用中(包括手术pose estimation 和功能组织parametrization)均展示了优越的表现。
    Abstract Current deep learning-based solutions for image analysis tasks are commonly incapable of handling problems to which multiple different plausible solutions exist. In response, posterior-based methods such as conditional Diffusion Models and Invertible Neural Networks have emerged; however, their translation is hampered by a lack of research on adequate validation. In other words, the way progress is measured often does not reflect the needs of the driving practical application. Closing this gap in the literature, we present the first systematic framework for the application-driven validation of posterior-based methods in inverse problems. As a methodological novelty, it adopts key principles from the field of object detection validation, which has a long history of addressing the question of how to locate and match multiple object instances in an image. Treating modes as instances enables us to perform mode-centric validation, using well-interpretable metrics from the application perspective. We demonstrate the value of our framework through instantiations for a synthetic toy example and two medical vision use cases: pose estimation in surgery and imaging-based quantification of functional tissue parameters for diagnostics. Our framework offers key advantages over common approaches to posterior validation in all three examples and could thus revolutionize performance assessment in inverse problems.
    摘要 当前基于深度学习的图像分析方法通常无法处理存在多个合理解的问题。为此,出现了 conditional Diffusion Models 和 Invertible Neural Networks 等基于后验的方法,但它们向实际应用的转化受到验证研究不足的制约。换言之,衡量进展的方式往往不能反映驱动应用的实际需求。为填补这一文献空白,我们提出了首个面向逆问题中基于后验方法的应用驱动验证框架。作为方法学上的创新,该框架借鉴了目标检测验证领域的关键原则,该领域长期研究如何在图像中定位并匹配多个目标实例。将后验的模式视为实例,我们便可以进行以模式为中心的验证,并使用从应用角度易于解释的指标。我们通过一个合成玩具示例和两个医学影像应用(手术中的姿态估计、用于诊断的功能组织参数影像量化)来展示该框架的价值。在这三个例子中,我们的框架都优于常见的后验验证方法,因而有望革新逆问题中的性能评估方式。
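The sketch below illustrates mode-centric validation in the object-detection style: posterior samples are clustered into predicted modes, matched to reference modes with the Hungarian algorithm under a distance threshold, and detection-style precision/recall are reported. MeanShift clustering, the Euclidean metric, and match_radius are illustrative assumptions rather than the paper's prescribed choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import MeanShift

def mode_centric_scores(posterior_samples, reference_modes, match_radius=1.0):
    """Cluster posterior samples into predicted modes, match them to reference
    modes, and report detection-style precision and recall over modes."""
    centers = MeanShift().fit(posterior_samples).cluster_centers_
    dists = np.linalg.norm(centers[:, None, :] - reference_modes[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dists)
    matched = [(r, c) for r, c in zip(rows, cols) if dists[r, c] <= match_radius]
    tp = len(matched)
    fp = len(centers) - tp           # predicted modes without a reference partner
    fn = len(reference_modes) - tp   # reference modes the posterior missed
    return {"precision": tp / max(tp + fp, 1), "recall": tp / max(tp + fn, 1)}
```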

Privileged to Predicted: Towards Sensorimotor Reinforcement Learning for Urban Driving

  • paper_url: http://arxiv.org/abs/2309.09756
  • repo_url: None
  • paper_authors: Ege Onat Özsüer, Barış Akgün, Fatma Güney
  • for: 本研究旨在探讨RL算法在自驾乘用车道上的潜力,以及如何 bridging RL算法和感知学习算法之间的差距。
  • methods: 本研究使用视觉深度学习模型来近似特权表示,并提出了解决方案来逐渐发展不具备特权表示的RL算法。
  • results: 通过在 CARLA 仿真环境中的严格评估,本研究揭示了状态表示对自动驾驶强化学习的重要性,并指出了尚未解决的挑战。
    Abstract Reinforcement Learning (RL) has the potential to surpass human performance in driving without needing any expert supervision. Despite its promise, the state-of-the-art in sensorimotor self-driving is dominated by imitation learning methods due to the inherent shortcomings of RL algorithms. Nonetheless, RL agents are able to discover highly successful policies when provided with privileged ground truth representations of the environment. In this work, we investigate what separates privileged RL agents from sensorimotor agents for urban driving in order to bridge the gap between the two. We propose vision-based deep learning models to approximate the privileged representations from sensor data. In particular, we identify aspects of state representation that are crucial for the success of the RL agent such as desired route generation and stop zone prediction, and propose solutions to gradually develop less privileged RL agents. We also observe that bird's-eye-view models trained on offline datasets do not generalize to online RL training due to distribution mismatch. Through rigorous evaluation on the CARLA simulation environment, we shed light on the significance of the state representations in RL for autonomous driving and point to unresolved challenges for future research.
    摘要 强化学习(RL)有潜力在无需任何专家监督的情况下超越人类的驾驶表现。尽管前景可观,目前感知-运动端到端自动驾驶的最先进方法仍以模仿学习为主,这是由 RL 算法的固有缺陷所致。然而,当获得环境的特权真值表示时,RL 智能体能够发现非常成功的策略。在这项工作中,我们研究了在城市驾驶中特权 RL 智能体与感知-运动智能体之间的差异,以期弥合两者之间的差距。我们提出了基于视觉的深度学习模型来从传感器数据中近似特权表示,特别是识别出对 RL 智能体成功至关重要的状态表示要素(如期望路线生成和停车区域预测),并提出了逐步构建特权程度更低的 RL 智能体的方案。我们还观察到,在离线数据集上训练的鸟瞰图模型由于分布不匹配而无法泛化到在线 RL 训练。通过在 CARLA 仿真环境中的严格评估,我们揭示了状态表示在自动驾驶强化学习中的重要性,并指出了有待未来研究解决的挑战。

Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels

  • paper_url: http://arxiv.org/abs/2309.09742
  • repo_url: https://github.com/madave94/gtiod
  • paper_authors: David Tschirschwitz, Christian Benz, Morris Florek, Henrik Norderhus, Benno Stein, Volker Rodehorst
  • for: 本研究旨在提高有监督机器学习系统的可靠性,其前提是真值标签的准确性与可获得性。
  • methods: 本研究提出了一种新的定位算法,可将多位标注者对同一样本的重复标注聚合为更可靠的标签估计。
  • results: 实验结果显示,该算法在聚合测试标注方面表现出色,并且在 TexBiG 数据集的训练中同时超过了基于噪声标签的训练和使用加权框融合(WBF)的标签聚合。研究还表明,重复标注的收益只在特定的数据集与标注配置下显现,关键因素包括数据集复杂度、标注者一致性以及给定的标注预算约束。
    Abstract The reliability of supervised machine learning systems depends on the accuracy and availability of ground truth labels. However, the process of human annotation, being prone to error, introduces the potential for noisy labels, which can impede the practicality of these systems. While training with noisy labels is a significant consideration, the reliability of test data is also crucial to ascertain the dependability of the results. A common approach to addressing this issue is repeated labeling, where multiple annotators label the same example, and their labels are combined to provide a better estimate of the true label. In this paper, we propose a novel localization algorithm that adapts well-established ground truth estimation methods for object detection and instance segmentation tasks. The key innovation of our method lies in its ability to transform combined localization and classification tasks into classification-only problems, thus enabling the application of techniques such as Expectation-Maximization (EM) or Majority Voting (MJV). Although our main focus is the aggregation of unique ground truth for test data, our algorithm also shows superior performance during training on the TexBiG dataset, surpassing both noisy label training and label aggregation using Weighted Boxes Fusion (WBF). Our experiments indicate that the benefits of repeated labels emerge under specific dataset and annotation configurations. The key factors appear to be (1) dataset complexity, the (2) annotator consistency, and (3) the given annotation budget constraints.
    摘要 有监督机器学习系统的可靠性取决于真值标签的准确性和可获得性。然而,人工标注过程容易出错,会引入噪声标签,从而影响这些系统的实用性。除了使用噪声标签进行训练这一重要问题之外,测试数据的可靠性同样关键,否则难以确认结果的可信度。解决该问题的一种常见做法是重复标注,即让多位标注者标注同一样本,再将其标注组合起来以更好地估计真实标签。在本文中,我们提出了一种新的定位算法,将目标检测与实例分割中成熟的真值估计方法加以改造。其核心创新在于能够把定位与分类耦合的任务转化为仅含分类的问题,从而可以应用期望最大化(EM)或多数投票(MJV)等技术。尽管我们的主要目标是为测试数据聚合出唯一的真值标注,该算法在 TexBiG 数据集的训练中同样表现出色,超过了基于噪声标签的训练以及使用加权框融合(WBF)的标签聚合。实验表明,重复标注的收益在特定的数据集与标注配置下显现,关键因素包括(1)数据集复杂度、(2)标注者一致性以及(3)给定的标注预算约束。
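A minimal sketch of turning repeated box annotations into a classification-only aggregation problem: overlapping boxes from different annotators are greedily grouped by IoU, and each group is reduced by majority voting on the class (with coordinates averaged). The greedy grouping and the averaging rule are simplifying assumptions; the paper's localization algorithm is more elaborate.

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def aggregate_repeated_labels(annotations, iou_thr=0.5):
    """annotations: list per annotator of (box, class) pairs. Group overlapping
    boxes across annotators, then reduce each group to one box (coordinate mean)
    and one class via majority voting (MJV); EM could replace the vote."""
    flat = [(box, cls) for ann in annotations for box, cls in ann]
    groups, used = [], set()
    for i, (box_i, _) in enumerate(flat):
        if i in used:
            continue
        group = [i]
        for j in range(i + 1, len(flat)):
            if j not in used and iou(box_i, flat[j][0]) >= iou_thr:
                group.append(j); used.add(j)
        used.add(i); groups.append(group)
    results = []
    for g in groups:
        boxes = np.array([flat[k][0] for k in g], float)
        labels = [flat[k][1] for k in g]
        majority = max(set(labels), key=labels.count)    # majority vote over the group
        results.append((boxes.mean(axis=0).tolist(), majority))
    return results
```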

Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints

  • paper_url: http://arxiv.org/abs/2309.09739
  • repo_url: None
  • paper_authors: Xinyi Yu, Liqin Lu, Jintao Rong, Guangkai Xu, Linlin Ou
  • for: scene reconstruction from 2D images
  • methods: neural implicit surface, data-driven pre-trained geometric cues, two-stage training process, decoupling view-dependent and view-independent colors, two novel consistency constraints
  • results: high-quality scene reconstruction with rich geometric details, reducing interference from prior estimation errors
    Abstract 3D scene reconstruction from 2D images has been a long-standing task. Instead of estimating per-frame depth maps and fusing them in 3D, recent research leverages the neural implicit surface as a unified representation for 3D reconstruction. Equipped with data-driven pre-trained geometric cues, these methods have demonstrated promising performance. However, inaccurate prior estimation, which is usually inevitable, can lead to suboptimal reconstruction quality, particularly in some geometrically complex regions. In this paper, we propose a two-stage training process, decouple view-dependent and view-independent colors, and leverage two novel consistency constraints to enhance detail reconstruction performance without requiring extra priors. Additionally, we introduce an essential mask scheme to adaptively influence the selection of supervision constraints, thereby improving performance in a self-supervised paradigm. Experiments on synthetic and real-world datasets show the capability of reducing the interference from prior estimation errors and achieving high-quality scene reconstruction with rich geometric details.
    摘要 从二维图像重建三维场景是一个由来已久的任务。近期研究不再逐帧估计深度图并在三维中融合,而是利用神经隐式表面作为统一的三维重建表示。借助数据驱动的预训练几何先验,这些方法已展现出可观的性能。然而,先验估计的误差通常难以避免,可能导致重建质量欠佳,尤其是在几何结构复杂的区域。在本文中,我们提出了两阶段训练流程,将视角相关与视角无关的颜色解耦,并利用两种新的一致性约束来提升细节重建性能,而无需额外先验。此外,我们引入了一种关键的掩码方案,自适应地影响监督约束的选择,从而在自监督范式下提升性能。在合成与真实数据集上的实验表明,该方法能够降低先验估计误差的干扰,实现具有丰富几何细节的高质量场景重建。

Scribble-based 3D Multiple Abdominal Organ Segmentation via Triple-branch Multi-dilated Network with Pixel- and Class-wise Consistency

  • paper_url: http://arxiv.org/abs/2309.09730
  • repo_url: None
  • paper_authors: Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
  • for: 这个研究旨在提高Multi-organ segmentation在腹部Computed Tomography(CT)图像中的精度,并且降低需要大量时间和劳动力的annotations的需求。
  • methods: 我们提出了一个 novel的3D框架,使用了两种一致性条件来进行scribble-supervised多个腹部器官分类。
  • results: 在公开的 WORD 数据集上的实验表明,与五种现有的 scribble 监督方法相比,我们的方法取得了更高的分割精度。
    Abstract Multi-organ segmentation in abdominal Computed Tomography (CT) images is of great importance for diagnosis of abdominal lesions and subsequent treatment planning. Though deep learning based methods have attained high performance, they rely heavily on large-scale pixel-level annotations that are time-consuming and labor-intensive to obtain. Due to its low dependency on annotation, weakly supervised segmentation has attracted great attention. However, there is still a large performance gap between current weakly-supervised methods and fully supervised learning, leaving room for exploration. In this work, we propose a novel 3D framework with two consistency constraints for scribble-supervised multiple abdominal organ segmentation from CT. Specifically, we employ a Triple-branch multi-Dilated network (TDNet) with one encoder and three decoders using different dilation rates to capture features from different receptive fields that are complementary to each other to generate high-quality soft pseudo labels. For more stable unsupervised learning, we use voxel-wise uncertainty to rectify the soft pseudo labels and then supervise the outputs of each decoder. To further regularize the network, class relationship information is exploited by encouraging the generated class affinity matrices to be consistent across different decoders under multi-view projection. Experiments on the public WORD dataset show that our method outperforms five existing scribble-supervised methods.
    摘要 腹部 CT 图像的多器官分割对腹部病变的诊断和后续治疗规划非常重要。尽管基于深度学习的方法已取得较高的性能,它们严重依赖大规模像素级标注,而获取这些标注耗时费力。由于对标注的依赖较低,弱监督分割受到了广泛关注,但目前弱监督方法与全监督学习之间仍存在较大的性能差距,留有进一步探索的空间。在这项工作中,我们提出了一种新的三维框架,利用两种一致性约束实现基于涂鸦标注的多个腹部器官分割。具体来说,我们采用三分支多空洞率网络(TDNet),由一个编码器和三个使用不同空洞率的解码器组成,以捕获来自不同感受野、相互互补的特征,从而生成高质量的软伪标签。为了使无监督学习更加稳定,我们利用体素级不确定性修正软伪标签,再对每个解码器的输出进行监督。为进一步正则化网络,我们利用类别关系信息,鼓励不同解码器在多视图投影下生成一致的类别亲和矩阵。在公开的 WORD 数据集上的实验表明,我们的方法优于五种现有的涂鸦监督方法。
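A sketch of the uncertainty-rectified supervision idea, assuming the soft pseudo label is the mean softmax of the three decoders and that voxel-wise uncertainty (the entropy of that mean) down-weights the per-voxel loss. The exact rectification and the class-affinity consistency term of the paper are not reproduced here.

```python
import torch
import torch.nn.functional as F

def uncertainty_rectified_loss(logits_list, eps=1e-6):
    """logits_list: outputs of the three decoders, each of shape (B, C, D, H, W).
    Build a soft pseudo label as their mean softmax, estimate voxel-wise
    uncertainty as its entropy, and supervise each decoder with an
    uncertainty-weighted soft cross entropy against the pseudo label."""
    probs = [F.softmax(l, dim=1) for l in logits_list]
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)
    entropy = -(mean_prob * torch.log(mean_prob + eps)).sum(dim=1)   # (B, D, H, W)
    weight = torch.exp(-entropy)                                     # low weight where uncertain
    pseudo = mean_prob.detach()
    loss = 0.0
    for p in probs:
        ce = -(pseudo * torch.log(p + eps)).sum(dim=1)               # per-voxel soft CE
        loss = loss + (weight * ce).mean()
    return loss / len(probs)
```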

Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering

  • paper_url: http://arxiv.org/abs/2309.09724
  • repo_url: None
  • paper_authors: Chi Zhang, Wei Yin, Gang Yu, Zhibin Wang, Tao Chen, Bin Fu, Joey Tianyi Zhou, Chunhua Shen
  • for: 本研究探讨了如何从单目深度估计结果中恢复三维场景结构。
  • methods: 我们提出了一个学习框架,使模型无需额外数据或标注即可预测保持几何结构的深度,从而获得真实的三维结构。
  • results: 我们的框架在多个基准数据集上与现有最先进方法相比展现出更强的泛化能力,并且能够仅凭未标注图像自动恢复领域特定的尺度与平移系数。
    Abstract In this study, we address the challenge of 3D scene structure recovery from monocular depth estimation. While traditional depth estimation methods leverage labeled datasets to directly predict absolute depth, recent advancements advocate for mix-dataset training, enhancing generalization across diverse scenes. However, such mixed dataset training yields depth predictions only up to an unknown scale and shift, hindering accurate 3D reconstructions. Existing solutions necessitate extra 3D datasets or geometry-complete depth annotations, constraints that limit their versatility. In this paper, we propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions to promote depth estimation consistency across different views. Comprehensive experiments underscore our framework's superior generalization capabilities, surpassing existing state-of-the-art methods on several benchmark datasets without leveraging extra training information. Moreover, our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients using solely unlabeled images.
    摘要 在这项研究中,我们探讨从单目深度估计中恢复三维场景结构的挑战。传统深度估计方法利用标注数据直接预测绝对深度,而近来的工作提倡混合数据集训练,以提升在多样场景下的泛化能力。然而,这种混合数据集训练得到的深度预测带有未知的尺度和平移,阻碍了准确的三维重建。现有解决方案需要额外的三维数据集或几何完整的深度标注,限制了其通用性。在本文中,我们提出了一个学习框架,使模型无需额外数据或标注即可预测保持几何结构的深度。为了生成真实的三维结构,我们渲染重建场景的新视角,并设计损失函数以促进不同视角间深度估计的一致性。大量实验表明,我们的框架具有更优的泛化能力,在多个基准数据集上超过现有最先进方法,且无需利用额外的训练信息。此外,我们设计的创新损失函数使模型能够仅凭未标注图像自主恢复领域特定的尺度与平移系数。
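Mix-dataset depth predictors are only defined up to an unknown scale and shift; the closed-form least-squares alignment below shows what recovering those two coefficients against a reference depth looks like. The paper recovers them from unlabeled images via rendering consistency instead, so this is background illustration rather than the proposed method.

```python
import numpy as np

def align_scale_shift(pred_depth, ref_depth, mask=None):
    """Closed-form least-squares fit of a single scale s and shift t so that
    s * pred + t best matches a reference depth over valid pixels."""
    if mask is None:
        mask = np.isfinite(ref_depth) & (ref_depth > 0)
    p, r = pred_depth[mask].ravel(), ref_depth[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)            # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s, t, s * pred_depth + t
```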

Traffic Scene Similarity: a Graph-based Contrastive Learning Approach

  • paper_url: http://arxiv.org/abs/2309.09720
  • repo_url: None
  • paper_authors: Maximilian Zipfl, Moritz Jarosch, J. Marius Zöllner
  • for: 降低高度自动驾驶车辆验证与认证(homologation)的工作量,以促进高度自动驾驶车辆的更广泛推广。
  • methods: 使用场景基本特征来构建场景嵌入空间,并通过图表示法实现场景之间的连续映射。
  • results: 可以基于场景特征构建的嵌入空间快速识别相似场景,从而减少后续测试中冗余的测试运行。
    Abstract Ensuring validation for highly automated driving poses significant obstacles to the widespread adoption of highly automated vehicles. Scenario-based testing offers a potential solution by reducing the homologation effort required for these systems. However, a crucial prerequisite, yet unresolved, is the definition and reduction of the test space to a finite number of scenarios. To tackle this challenge, we propose an extension to a contrastive learning approach utilizing graphs to construct a meaningful embedding space. Our approach demonstrates the continuous mapping of scenes using scene-specific features and the formation of thematically similar clusters based on the resulting embeddings. Based on the found clusters, similar scenes could be identified in the subsequent test process, which can lead to a reduction in redundant test runs.
    摘要 高度自动驾驶的验证为其大规模推广带来了重大障碍。基于场景的测试提供了一种潜在的解决方案,可以降低此类系统的认证工作量。然而,一个尚未解决的关键前提是如何将测试空间定义并缩减为有限数量的场景。为此,我们提出了一种利用图结构的对比学习方法的扩展,用于构建有意义的嵌入空间。我们的方法展示了利用场景特定特征对场景进行连续映射,并基于所得嵌入形成主题相似的聚类。借助所发现的聚类,可以在后续测试过程中识别相似场景,从而减少冗余的测试运行。
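A sketch of the generic contrastive objective such an embedding space could be trained with: an NT-Xent loss over two augmented views of each scene embedding, pulling views of the same traffic scene together and pushing other scenes apart. The graph encoder producing the embeddings is omitted, and the loss choice is an assumption about the "contrastive learning approach" rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, d) embeddings of two augmented views of the same N scenes.
    Positives are (z1[i], z2[i]); all other scenes in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, d)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```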

CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation

  • paper_url: http://arxiv.org/abs/2309.09709
  • repo_url: https://github.com/aspirinone/catr.github.io
  • paper_authors: Kexin Li, Zongxin Yang, Lei Chen, Yi Yang, Jun Xiao
  • for: 该论文面向音视频分割(audio-visual video segmentation)任务,旨在为图像帧中的发声物体生成像素级掩码,并确保掩码忠实遵循给定的音频。
  • methods: 该方法提出了一种解耦的音视频 Transformer,将音频与视频特征沿各自的时间和空间维度进行组合,以捕获二者的联合依赖关系;同时设计了一种可堆叠的模块,以内存高效的方式捕获音视频间细粒度的组合依赖,并在解码阶段引入携带丰富物体级信息的音频约束查询。
  • results: 实验结果表明,我们的方法使用两种骨干网络在全部三个数据集上均取得了新的最佳性能。代码见 \url{https://github.com/aspirinone/CATR.github.io}
    Abstract Audio-visual video segmentation~(AVVS) aims to generate pixel-level maps of sound-producing objects within image frames and ensure the maps faithfully adhere to the given audio, such as identifying and segmenting a singing person in a video. However, existing methods exhibit two limitations: 1) they address video temporal features and audio-visual interactive features separately, disregarding the inherent spatial-temporal dependence of combined audio and video, and 2) they inadequately introduce audio constraints and object-level information during the decoding stage, resulting in segmentation outcomes that fail to comply with audio directives. To tackle these issues, we propose a decoupled audio-video transformer that combines audio and video features from their respective temporal and spatial dimensions, capturing their combined dependence. To optimize memory consumption, we design a block, which, when stacked, enables capturing audio-visual fine-grained combinatorial-dependence in a memory-efficient manner. Additionally, we introduce audio-constrained queries during the decoding phase. These queries contain rich object-level information, ensuring the decoded mask adheres to the sounds. Experimental results confirm our approach's effectiveness, with our framework achieving a new SOTA performance on all three datasets using two backbones. The code is available at \url{https://github.com/aspirinone/CATR.github.io}
    摘要 音视频分割(AVVS)的目标是为图像帧中的发声物体生成像素级掩码,并确保掩码忠实遵循给定的音频,例如在视频中识别并分割正在唱歌的人。然而,现有方法存在两点局限:1)它们分别处理视频的时间特征与音视频交互特征,忽略了音频与视频组合后固有的时空依赖;2)它们在解码阶段没有充分引入音频约束和物体级信息,导致分割结果不符合音频指令。为解决这些问题,我们提出了一种解耦的音视频 Transformer,将音频与视频特征沿各自的时间和空间维度进行组合,捕获二者的联合依赖关系。为了优化内存消耗,我们设计了一种模块,堆叠后能够以内存高效的方式捕获音视频间细粒度的组合依赖。此外,我们在解码阶段引入了音频约束查询,这些查询包含丰富的物体级信息,确保解码出的掩码遵循声音。实验结果证实了我们方法的有效性:使用两种骨干网络,我们的框架在全部三个数据集上均取得了新的最佳性能。代码可以在 \url{https://github.com/aspirinone/CATR.github.io} 上获取。

DGM-DR: Domain Generalization with Mutual Information Regularized Diabetic Retinopathy Classification

  • paper_url: http://arxiv.org/abs/2309.09670
  • repo_url: None
  • paper_authors: Aleksandr Matsun, Dana O. Mohamed, Sharon Chokuwa, Muhammad Ridzuan, Mohammad Yaqub
  • for: 这篇论文主要关注医学影像领域的域泛化(Domain Generalization, DG)问题,使模型能够泛化到不同来源的数据。
  • methods: 本论文提出了一种基于大型预训练模型的 DG 方法,将模型的目标函数改写为互信息最大化,以获得具有泛化能力的医学影像分类模型。
  • results: 实验结果表明,所提方法在公开数据集上比此前的最先进方法平均准确率高出 5.25%,且标准差更低,展现出更好的一致性与稳定性。
    Abstract The domain shift between training and testing data presents a significant challenge for training generalizable deep learning models. As a consequence, the performance of models trained with the independent and identically distributed (i.i.d) assumption deteriorates when deployed in the real world. This problem is exacerbated in the medical imaging context due to variations in data acquisition across clinical centers, medical apparatus, and patients. Domain generalization (DG) aims to address this problem by learning a model that generalizes well to any unseen target domain. Many domain generalization techniques were unsuccessful in learning domain-invariant representations due to the large domain shift. Furthermore, multiple tasks in medical imaging are not yet extensively studied in existing literature when it comes to DG point of view. In this paper, we introduce a DG method that re-establishes the model objective function as a maximization of mutual information with a large pretrained model to the medical imaging field. We re-visit the problem of DG in Diabetic Retinopathy (DR) classification to establish a clear benchmark with a correct model selection strategy and to achieve robust domain-invariant representation for an improved generalization. Moreover, we conduct extensive experiments on public datasets to show that our proposed method consistently outperforms the previous state-of-the-art by a margin of 5.25% in average accuracy and a lower standard deviation. Source code available at https://github.com/BioMedIA-MBZUAI/DGM-DR
    摘要 训练数据与测试数据之间的域偏移是训练可泛化深度学习模型的重大挑战。因此,在独立同分布(i.i.d)假设下训练的模型在实际部署时性能会下降。由于不同临床中心、医疗设备和患者之间数据采集方式的差异,这一问题在医学影像领域尤为严重。域泛化(DG)旨在学习能够很好地泛化到任何未见目标域的模型。由于域偏移较大,许多域泛化技术难以学到域不变的表示;而且从域泛化的角度来看,医学影像中的许多任务尚未得到充分研究。在本文中,我们将模型的目标函数重新表述为与大型预训练模型的互信息最大化,并将这一 DG 方法引入医学影像领域。我们重新审视糖尿病视网膜病变(DR)分类中的域泛化问题,在正确的模型选择策略下建立了清晰的基准,并获得了稳健的域不变表示以提升泛化能力。此外,我们在公开数据集上进行了大量实验,结果表明所提方法的平均准确率比此前的最先进方法高出 5.25%,且标准差更低。源代码见 https://github.com/BioMedIA-MBZUAI/DGM-DR。

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.09668
  • repo_url: https://github.com/VCIP-RGBD/DFormer
  • paper_authors: Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou
  • for: 学习RGB-D segmentation任务上的可转移表示。
  • methods: 包括一系列RGB-D块,这些块专门用于编码RGB和深度信息,并且采用了一种新的建筑块设计。
  • results: 在 RGB-D 语义分割和 RGB-D 显著目标检测这两个流行任务上,我们的 DFormer 取得了新的最先进性能,且计算开销不到当前最佳方法的一半。
    Abstract We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that aim to encode RGB features,DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design; 2) We pre-train the backbone using image-depth pairs from ImageNet-1K, and thus the DFormer is endowed with the capacity to encode RGB-D representations. It avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pre-trained backbones, which widely lies in existing methods but has not been resolved. We fine-tune the pre-trained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D segmentation datasets and five RGB-D saliency datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer.
    摘要 我们提出了 DFormer,一种新颖的 RGB-D 预训练框架,用于为 RGB-D 分割任务学习可迁移的表示。DFormer 包含两项关键创新:1)不同于以往仅编码 RGB 特征的工作,DFormer 由一系列 RGB-D 模块组成,通过新的基础模块设计同时编码 RGB 与深度信息;2)我们使用 ImageNet-1K 的图像-深度对来预训练主干网络,使 DFormer 具备编码 RGB-D 表示的能力,从而避免了现有方法中普遍存在却未被解决的问题,即 RGB 预训练主干对深度图中三维几何关系的编码不匹配。我们在两个常见的 RGB-D 任务(RGB-D 语义分割和 RGB-D 显著目标检测)上,使用轻量级解码头对预训练的 DFormer 进行微调。实验结果表明,DFormer 在这两个任务上取得了新的最佳性能,而在两个 RGB-D 分割数据集和五个 RGB-D 显著性数据集上的计算开销不到当前最佳方法的一半。代码见 https://github.com/VCIP-RGBD/DFormer。

Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

  • paper_url: http://arxiv.org/abs/2309.09667
  • repo_url: None
  • paper_authors: Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang
  • for: This paper aims to address the problem of detecting and grounding multi-modal media manipulation (DGM^4), specifically focusing on face forgery and text misinformation.
  • methods: The proposed method, Unified Frequency-Assisted transFormer (UFAFormer), utilizes a combination of the discrete wavelet transform and self-attention mechanisms to capture forgery features in both the image and frequency domains. Additionally, the proposed forgery-aware mutual module is designed to facilitate effective interaction between image and frequency features.
  • results: Experimental results on the DGM^4 dataset demonstrate the superior performance of UFAFormer compared to previous methods, achieving a new benchmark in the field.
    Abstract Detecting and grounding multi-modal media manipulation (DGM^4) has become increasingly crucial due to the widespread dissemination of face forgery and text misinformation. In this paper, we present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM^4 problem. Unlike previous state-of-the-art methods that solely focus on the image (RGB) domain to describe visual forgery features, we additionally introduce the frequency domain as a complementary viewpoint. By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts. Then, our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands. Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations. Finally, based on visual and textual forgery features, we propose a unified decoder that comprises two symmetric cross-modal interaction modules responsible for gathering modality-specific forgery information, along with a fusing interaction module for aggregation of both modalities. The proposed unified decoder formulates our UFAFormer as a unified framework, ultimately simplifying the overall architecture and facilitating the optimization process. Experimental results on the DGM^4 dataset, containing several perturbations, demonstrate the superior performance of our framework compared to previous methods, setting a new benchmark in the field.
    摘要 随着人脸伪造与文本虚假信息的广泛传播,多模态媒体篡改的检测与定位(DGM^4)变得日益重要。在本文中,我们提出了统一的频率辅助 Transformer 框架 UFAFormer 来解决 DGM^4 问题。不同于以往仅在图像(RGB)域描述视觉伪造特征的最先进方法,我们额外引入频域作为互补视角。借助离散小波变换,我们将图像分解为若干频率子带,从中捕获丰富的人脸伪造痕迹。随后,我们提出的频率编码器结合子带内与子带间自注意力,显式地聚合各子带内部及跨子带的伪造特征。此外,为了解决图像域与频域之间的语义冲突,我们设计了伪造感知的交互模块,促进图像特征与频率特征的有效交互,从而得到对齐且全面的视觉伪造表示。最后,基于视觉与文本伪造特征,我们提出了统一解码器,由两个对称的跨模态交互模块(分别收集各模态的伪造信息)和一个融合交互模块(聚合两种模态)组成。该统一解码器使 UFAFormer 成为一个统一框架,简化了整体架构并便于优化。在包含多种扰动的 DGM^4 数据集上的实验结果表明,我们的框架优于以往方法,树立了该领域新的基准。
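A small sketch of the frequency-domain decomposition step, assuming the PyWavelets package and a single-level Haar transform: the face crop is split into one low-frequency and three high-frequency sub-bands, which a frequency encoder could then tokenize. The wavelet choice and decomposition level here are illustrative, not the paper's exact configuration.

```python
import numpy as np
import pywt

def frequency_subbands(gray_face: np.ndarray, wavelet: str = "haar"):
    """Single-level 2-D discrete wavelet transform of a grayscale face crop.
    Returns the approximation (LL) and the three detail sub-bands (LH, HL, HH),
    where high-frequency forgery artifacts tend to concentrate."""
    ll, (lh, hl, hh) = pywt.dwt2(gray_face.astype(np.float32), wavelet)
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}

# Each sub-band can then be patchified and fed to the frequency encoder
# alongside the RGB tokens.
subbands = frequency_subbands(np.random.rand(256, 256) * 255)
```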

HiT: Building Mapping with Hierarchical Transformers

  • paper_url: http://arxiv.org/abs/2309.09643
  • repo_url: None
  • paper_authors: Mingming Zhang, Qingjie Liu, Yunhong Wang
  • for: 提高高分辨率遥感图像中建筑物自动提取(building mapping)的质量
  • methods: 基于两阶段检测架构,使用层次 Transformer,并在分类头和边界框回归头之外并行添加多边形头,同时输出建筑物的边界框和矢量多边形
  • results: 在 CrowdAI 和 Inria 数据集上进行了全面的实验,在实例分割和多边形指标上均达到了新的最先进水平,并在复杂场景下取得了更好的效果
    Abstract Deep learning-based methods have been extensively explored for automatic building mapping from high-resolution remote sensing images over recent years. While most building mapping models produce vector polygons of buildings for geographic and mapping systems, dominant methods typically decompose polygonal building extraction in some sub-problems, including segmentation, polygonization, and regularization, leading to complex inference procedures, low accuracy, and poor generalization. In this paper, we propose a simple and novel building mapping method with Hierarchical Transformers, called HiT, improving polygonal building mapping quality from high-resolution remote sensing images. HiT builds on a two-stage detection architecture by adding a polygon head parallel to classification and bounding box regression heads. HiT simultaneously outputs building bounding boxes and vector polygons, which is fully end-to-end trainable. The polygon head formulates a building polygon as serialized vertices with the bidirectional characteristic, a simple and elegant polygon representation avoiding the start or end vertex hypothesis. Under this new perspective, the polygon head adopts a transformer encoder-decoder architecture to predict serialized vertices supervised by the designed bidirectional polygon loss. Furthermore, a hierarchical attention mechanism combined with convolution operation is introduced in the encoder of the polygon head, providing more geometric structures of building polygons at vertex and edge levels. Comprehensive experiments on two benchmarks (the CrowdAI and Inria datasets) demonstrate that our method achieves a new state-of-the-art in terms of instance segmentation and polygonal metrics compared with state-of-the-art methods. Moreover, qualitative results verify the superiority and effectiveness of our model under complex scenes.
    摘要 近年来,基于深度学习的方法已被广泛用于从高分辨率遥感图像中自动提取建筑物。大多数建筑物提取模型为地理与制图系统生成建筑物的矢量多边形,而主流方法通常将多边形建筑物提取分解为分割、多边形化和正则化等子问题,导致推理流程复杂、精度较低且泛化能力差。在本文中,我们提出了一种简单而新颖的基于层次 Transformer 的建筑物提取方法 HiT,以提升从高分辨率遥感图像中提取多边形建筑物的质量。HiT 在两阶段检测架构的基础上,在分类和边界框回归头之外并行增加了一个多边形头,可同时输出建筑物的边界框和矢量多边形,并且完全端到端可训练。多边形头将建筑物多边形表示为具有双向特性的序列化顶点,这一简洁而优雅的表示避免了对起始或终止顶点的假设。在这一新视角下,多边形头采用 Transformer 编码器-解码器架构,在所设计的双向多边形损失的监督下预测序列化顶点。此外,多边形头的编码器中引入了结合卷积操作的层次注意力机制,在顶点和边两个层级提供建筑物多边形更丰富的几何结构信息。在 CrowdAI 和 Inria 两个基准数据集上的全面实验表明,与最先进方法相比,我们的方法在实例分割和多边形指标上均达到了新的最高水平。此外,定性结果也验证了我们的模型在复杂场景下的优越性和有效性。
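A sketch of a bidirectional, start-vertex-free polygon loss: the prediction is compared against every cyclic shift of the ground-truth vertex sequence in both traversal directions and the minimum L1 error is kept. It assumes prediction and ground truth have the same number of vertices; the paper's serialized-vertex loss is formulated in more detail.

```python
import torch

def bidirectional_polygon_loss(pred, gt):
    """pred, gt: (N, 2) serialized polygon vertices of equal length. A building
    polygon has no privileged start vertex or direction, so the target is tried
    under all cyclic shifts of both the forward and the reversed ordering."""
    n = gt.size(0)
    candidates = []
    for g in (gt, torch.flip(gt, dims=[0])):          # both traversal directions
        for s in range(n):                            # all possible start vertices
            candidates.append(torch.roll(g, shifts=s, dims=0))
    losses = torch.stack([torch.abs(pred - c).mean() for c in candidates])
    return losses.min()
```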

Holistic Geometric Feature Learning for Structured Reconstruction

  • paper_url: http://arxiv.org/abs/2309.09622
  • repo_url: https://github.com/geo-tell/f-learn
  • paper_authors: Ziqiong Lu, Linxi Huan, Qiyuan Ma, Xianwei Zheng
  • for: 这个论文旨在提高结构重建中的结构意识,即使受到低级特征缺乏全局几何信息的影响。
  • methods: 作者们提出了一种频域特征学习策略(F-Learn),通过将散布的几何特征汇集到频域中,以便更好地理解结构。
  • results: 实验表明,F-Learn策略可以有效地引入结构意识到几何原子检测和结构推断中,提高最终结构重建的性能。
    Abstract The inference of topological principles is a key problem in structured reconstruction. We observe that wrongly predicted topological relationships are often incurred by the lack of holistic geometry clues in low-level features. Inspired by the fact that massive signals can be compactly described with frequency analysis, we experimentally explore the efficiency and tendency of learning structure geometry in the frequency domain. Accordingly, we propose a frequency-domain feature learning strategy (F-Learn) to fuse scattered geometric fragments holistically for topology-intact structure reasoning. Benefiting from the parsimonious design, the F-Learn strategy can be easily deployed into a deep reconstructor with a lightweight model modification. Experiments demonstrate that the F-Learn strategy can effectively introduce structure awareness into geometric primitive detection and topology inference, bringing significant performance improvement to final structured reconstruction. Code and pre-trained models are available at https://github.com/Geo-Tell/F-Learn.
    摘要 拓扑原则的推断是结构化重建中的关键问题。我们观察到,错误预测的拓扑关系往往源于低层特征中缺乏整体几何线索。受大量信号可以用频率分析紧凑描述这一事实的启发,我们通过实验探索了在频域学习结构几何的效率与倾向。据此,我们提出了一种频域特征学习策略(F-Learn),将分散的几何片段整体地融合,以进行拓扑完整的结构推理。得益于简洁的设计,F-Learn 策略只需轻量的模型修改即可方便地部署到深度重建器中。实验表明,F-Learn 策略能够有效地为几何基元检测和拓扑推断引入结构感知,为最终的结构化重建带来显著的性能提升。代码和预训练模型见 https://github.com/Geo-Tell/F-Learn。

Collaborative Three-Stream Transformers for Video Captioning

  • paper_url: http://arxiv.org/abs/2309.09611
  • repo_url: https://github.com/wanghao14/COST
  • paper_authors: Hao Wang, Libo Zhang, Heng Fan, Tiejian Luo
  • for: 用于视频captioning任务中的主要组件,即主语、谓语和词语之间的互动。
  • methods: 提出了一种新的框架,名为COST,来分别模型这三部分,并且使得它们互相补充以获得更好的表示。COST包括三个transformers分支,用于探索视频和文本之间的视觉语言互动,检测到的对象和文本之间的互动,以及动作和文本之间的互动。同时,我们提出了一种相互扩展注意力模块,用于协调这三个分支中模型的互动,以便使得这三个分支可以互相支持,捕捉到不同细致程度的最有力的semantic信息,以便准确地预测caption。
  • results: 经过大规模的实验,我们发现提出的方法在三个大规模的挑战性数据集,即YouCookII、ActivityNet Captions和MSVD上表现出色,与当前的方法相比,具有较高的准确率。
    Abstract As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.
    摘要 作为句子中最关键的组成部分,主语、谓语和宾语在视频描述任务中需要特别关注。为实现这一想法,我们设计了一种名为协同三流 Transformer(COST)的新框架,分别对这三部分建模并使其相互补充,从而获得更好的表示。具体来说,COST 由三个 Transformer 分支组成,用以挖掘视频与文本、检测到的物体与文本、动作与文本之间在时空域上不同粒度的视觉-语言交互。同时,我们提出了跨粒度注意力模块,用于对齐三个分支所建模的交互,使三个分支能够相互支持,挖掘不同粒度上最具判别力的语义信息,从而准确地预测描述。整个模型以端到端方式训练。在 YouCookII、ActivityNet Captions 和 MSVD 三个大规模挑战性数据集上的大量实验表明,所提方法的性能优于当前最先进方法。

MEDL-U: Uncertainty-aware 3D Automatic Annotator based on Evidential Deep Learning

  • paper_url: http://arxiv.org/abs/2309.09599
  • repo_url: None
  • paper_authors: Helbert Paat, Qing Lian, Weilong Yao, Tong Zhang
  • for: This paper is written for advancing the field of 3D object detection using weakly supervised deep learning methods, and addressing the challenges of pseudo label noise and uncertainty estimation.
  • methods: The paper proposes an Evidential Deep Learning (EDL) based uncertainty estimation framework called MEDL-U, which generates pseudo labels and quantifies uncertainties for 3D object detection. The framework consists of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and a post-processing stage for uncertainty refinement.
  • results: The paper demonstrates that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Additionally, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.
    Abstract Advancements in deep learning-based 3D object detection necessitate the availability of large-scale datasets. However, this requirement introduces the challenge of manual annotation, which is often both burdensome and time-consuming. To tackle this issue, the literature has seen the emergence of several weakly supervised frameworks for 3D object detection which can automatically generate pseudo labels for unlabeled data. Nevertheless, these generated pseudo labels contain noise and are not as accurate as those labeled by humans. In this paper, we present the first approach that addresses the inherent ambiguities present in pseudo labels by introducing an Evidential Deep Learning (EDL) based uncertainty estimation framework. Specifically, we propose MEDL-U, an EDL framework based on MTrans, which not only generates pseudo labels but also quantifies the associated uncertainties. However, applying EDL to 3D object detection presents three primary challenges: (1) relatively lower pseudolabel quality in comparison to other autolabelers; (2) excessively high evidential uncertainty estimates; and (3) lack of clear interpretability and effective utilization of uncertainties for downstream tasks. We tackle these issues through the introduction of an uncertainty-aware IoU-based loss, an evidence-aware multi-task loss function, and the implementation of a post-processing stage for uncertainty refinement. Our experimental results demonstrate that probabilistic detectors trained using the outputs of MEDL-U surpass deterministic detectors trained using outputs from previous 3D annotators on the KITTI val set for all difficulty levels. Moreover, MEDL-U achieves state-of-the-art results on the KITTI official test set compared to existing 3D automatic annotators.
    摘要 基于深度学习的三维目标检测的进步离不开大规模数据集,而这带来了人工标注的挑战:标注往往既繁重又耗时。为解决这一问题,文献中出现了一些弱监督的三维目标检测框架,能够为未标注数据自动生成伪标签。然而,这些伪标签包含噪声,准确性不及人工标注。在本文中,我们提出了首个通过引入证据深度学习(EDL)不确定性估计框架来处理伪标签内在歧义的方法。具体来说,我们提出了基于 MTrans 的 EDL 框架 MEDL-U,它不仅生成伪标签,还量化相应的不确定性。然而,将 EDL 应用于三维目标检测存在三大挑战:(1)与其他自动标注器相比,伪标签质量相对较低;(2)证据不确定性估计值过高;(3)不确定性缺乏清晰的可解释性,难以有效用于下游任务。我们通过引入不确定性感知的 IoU 损失、证据感知的多任务损失函数,以及用于不确定性修正的后处理阶段来解决这些问题。实验结果表明,使用 MEDL-U 输出训练的概率检测器在 KITTI 验证集的所有难度级别上均优于使用以往三维自动标注器输出训练的确定性检测器。此外,与现有三维自动标注器相比,MEDL-U 在 KITTI 官方测试集上取得了最先进的结果。
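For orientation, the snippet below computes the standard deep evidential regression uncertainties from Normal-Inverse-Gamma head outputs, the kind of quantities a MEDL-U-style annotator can attach to each pseudo-box parameter. The formulas follow the common NIG parameterization; MEDL-U's uncertainty-aware IoU loss and post-processing are built on top of, not shown by, this sketch.

```python
import torch

def nig_uncertainties(gamma, nu, alpha, beta):
    """Normal-Inverse-Gamma outputs of an evidential regression head
    (gamma = predicted value, nu > 0, alpha > 1, beta > 0). Returns the usual
    aleatoric and epistemic uncertainty estimates of deep evidential regression."""
    aleatoric = beta / (alpha - 1.0)           # E[sigma^2], data noise
    epistemic = beta / (nu * (alpha - 1.0))    # Var[mu], model uncertainty
    return aleatoric, epistemic

# Toy head output for one box parameter (e.g., center x):
gamma = torch.tensor([1.8]); nu = torch.tensor([2.0])
alpha = torch.tensor([3.0]); beta = torch.tensor([0.5])
print(nig_uncertainties(gamma, nu, alpha, beta))
```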

Mutual Information-calibrated Conformal Feature Fusion for Uncertainty-Aware Multimodal 3D Object Detection at the Edge

  • paper_url: http://arxiv.org/abs/2309.09593
  • repo_url: None
  • paper_authors: Alex C. Stutts, Danilo Erricolo, Sathya Ravi, Theja Tulabandhula, Amit Ranjan Trivedi
  • for: This paper aims to address the gap in uncertainty quantification in 3D object detection for robotics, by integrating conformal inference and information theoretic measures to provide lightweight and Monte Carlo-free uncertainty estimation.
  • methods: The proposed method uses a multimodal framework that fuses features from RGB camera and LiDAR sensor data, and leverages normalized mutual information as a modulator to calibrate uncertainty bounds derived from conformal inference.
  • results: The simulation results show an inverse correlation between inherent predictive uncertainty and normalized mutual information throughout the model’s training, and the proposed framework demonstrates comparable or better performance in KITTI 3D object detection benchmarks compared to similar methods that are not uncertainty-aware, making it suitable for real-time edge robotics.
    Abstract In the expanding landscape of AI-enabled robotics, robust quantification of predictive uncertainties is of great importance. Three-dimensional (3D) object detection, a critical robotics operation, has seen significant advancements; however, the majority of current works focus only on accuracy and ignore uncertainty quantification. Addressing this gap, our novel study integrates the principles of conformal inference (CI) with information theoretic measures to perform lightweight, Monte Carlo-free uncertainty estimation within a multimodal framework. Through a multivariate Gaussian product of the latent variables in a Variational Autoencoder (VAE), features from RGB camera and LiDAR sensor data are fused to improve the prediction accuracy. Normalized mutual information (NMI) is leveraged as a modulator for calibrating uncertainty bounds derived from CI based on a weighted loss function. Our simulation results show an inverse correlation between inherent predictive uncertainty and NMI throughout the model's training. The framework demonstrates comparable or better performance in KITTI 3D object detection benchmarks to similar methods that are not uncertainty-aware, making it suitable for real-time edge robotics.
    摘要 在人工智能机器人快速发展的背景下,对预测不确定性进行可靠量化十分重要。三维目标检测作为机器人的关键任务已取得显著进展,但当前大多数工作只关注精度而忽略了不确定性量化。为填补这一空白,我们的研究将保形推断(CI)的原理与信息论度量相结合,在多模态框架内实现轻量、无需蒙特卡洛采样的不确定性估计。通过对变分自编码器(VAE)潜变量的多元高斯乘积,融合 RGB 相机与激光雷达数据的特征以提升预测精度。我们以归一化互信息(NMI)作为调制因子,对基于加权损失函数的保形推断所得到的不确定性界进行校准。仿真结果表明,在模型的整个训练过程中,固有预测不确定性与 NMI 呈负相关。在 KITTI 三维目标检测基准上,该框架取得了与不具备不确定性感知的同类方法相当或更好的性能,适用于实时边缘机器人场景。
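A minimal split-conformal sketch for one regression target: the (1 - alpha) quantile of calibration residuals sets the interval half-width, which is then scaled by a normalized-mutual-information term. The (2 - nmi) modulation is purely illustrative of "NMI as a modulator"; the paper's weighted-loss calibration differs in detail.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1, nmi=1.0):
    """Split conformal prediction for a scalar regression target (e.g., one 3D
    box parameter). Low cross-modal agreement (small nmi) widens the bound."""
    scores = np.abs(cal_pred - cal_true)                     # nonconformity scores
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n             # finite-sample correction
    q = np.quantile(scores, min(q_level, 1.0))
    half_width = q * (2.0 - nmi)                             # hypothetical NMI modulation
    return test_pred - half_width, test_pred + half_width

lo, hi = conformal_interval(np.random.rand(200), np.random.rand(200), 0.5)
```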

Multi-Semantic Fusion Model for Generalized Zero-Shot Skeleton-Based Action Recognition

  • paper_url: http://arxiv.org/abs/2309.09592
  • repo_url: https://github.com/EHZ9NIWI7/MSF-GZSSAR
  • paper_authors: Ming-Zhe Li, Zhen Jia, Zhang Zhang, Zhanyu Ma, Liang Wang
  • for: 解决computer vision社区中新的挑战问题——无需任何训练样本的动作识别(GZSSAR)。
  • methods: 我们提出了一种多 semantics融合(MSF)模型,通过使用两种类别级文本描述(动作描述和运动描述)作为辅助semantic信息,以提高通用skeleton特征的识别性。
  • results: 与此前的模型相比,我们的 MSF 模型在 GZSSAR 上表现出色,验证了所提方法的有效性。
    Abstract Generalized zero-shot skeleton-based action recognition (GZSSAR) is a new challenging problem in computer vision community, which requires models to recognize actions without any training samples. Previous studies only utilize the action labels of verb phrases as the semantic prototypes for learning the mapping from skeleton-based actions to a shared semantic space. However, the limited semantic information of action labels restricts the generalization ability of skeleton features for recognizing unseen actions. In order to solve this dilemma, we propose a multi-semantic fusion (MSF) model for improving the performance of GZSSAR, where two kinds of class-level textual descriptions (i.e., action descriptions and motion descriptions), are collected as auxiliary semantic information to enhance the learning efficacy of generalizable skeleton features. Specially, a pre-trained language encoder takes the action descriptions, motion descriptions and original class labels as inputs to obtain rich semantic features for each action class, while a skeleton encoder is implemented to extract skeleton features. Then, a variational autoencoder (VAE) based generative module is performed to learn a cross-modal alignment between skeleton and semantic features. Finally, a classification module is built to recognize the action categories of input samples, where a seen-unseen classification gate is adopted to predict whether the sample comes from seen action classes or not in GZSSAR. The superior performance in comparisons with previous models validates the effectiveness of the proposed MSF model on GZSSAR.
    摘要 广义零样本骨架动作识别(GZSSAR)是计算机视觉领域一个新的挑战性问题,要求模型在没有任何训练样本的情况下识别动作。以往研究仅使用动词短语形式的动作标签作为语义原型,学习从骨架动作到共享语义空间的映射。然而,动作标签所含语义信息有限,限制了骨架特征对未见动作的泛化识别能力。为解决这一困境,我们提出了一种多语义融合(MSF)模型来提升 GZSSAR 的性能,收集了两类类别级文本描述(动作描述与运动描述)作为辅助语义信息,以增强可泛化骨架特征的学习效果。具体来说,一个预训练语言编码器以动作描述、运动描述和原始类别标签为输入,为每个动作类别提取丰富的语义特征;同时使用骨架编码器提取骨架特征。随后,基于变分自编码器(VAE)的生成模块学习骨架特征与语义特征之间的跨模态对齐。最后,构建分类模块识别输入样本的动作类别,并采用可见-不可见分类门控来预测样本是否来自已见动作类别。与以往模型相比的优越性能验证了所提 MSF 模型在 GZSSAR 上的有效性。

An Autonomous Vision-Based Algorithm for Interplanetary Navigation

  • paper_url: http://arxiv.org/abs/2309.09590
  • repo_url: None
  • paper_authors: Eleonora Andreis, Paolo Panicucci, Francesco Topputo
  • for: 提出了一种完全基于视觉的导航算法,以解决深空探测器的自主导航问题。
  • methods: 采用无量纲扩展卡尔曼滤波器作为状态估计器,以深空图像中提取的行星位置作为观测输入,并通过最优的行星配对选择策略提升估计精度。
  • results: 在高保真度的地球-火星行星际转移任务上测试了算法性能,验证了其在深空导航中的适用性。
    Abstract The surge of deep-space probes makes it unsustainable to navigate them with standard radiometric tracking. Self-driving interplanetary satellites represent a solution to this problem. In this work, a full vision-based navigation algorithm is built by combining an orbit determination method with an image processing pipeline suitable for interplanetary transfers of autonomous platforms. To increase the computational efficiency of the algorithm, a non-dimensional extended Kalman filter is selected as state estimator, fed by the positions of the planets extracted from deep-space images. An enhancement of the estimation accuracy is performed by applying an optimal strategy to select the best pair of planets to track. Moreover, a novel analytical measurement model for deep-space navigation is developed providing a first-order approximation of the light-aberration and light-time effects. Algorithm performance is tested on a high-fidelity, Earth--Mars interplanetary transfer, showing the algorithm applicability for deep-space navigation.
    摘要 深空探测器数量的激增使得继续依赖标准的无线电测量跟踪进行导航变得不可持续,具备自主导航能力的行星际探测器是解决这一问题的途径。在这项工作中,我们将轨道确定方法与适用于自主平台行星际转移的图像处理流程相结合,构建了一套完整的基于视觉的导航算法。为提高算法的计算效率,我们选用无量纲扩展卡尔曼滤波器作为状态估计器,其观测输入为从深空图像中提取的行星位置;并通过最优策略选择最佳的跟踪行星对来提升估计精度。此外,我们还建立了一种新的深空导航解析观测模型,对光行差和光行时效应给出一阶近似。算法性能在高保真度的地球-火星行星际转移上进行了测试,证明了其在深空导航中的适用性。

RIDE: Self-Supervised Learning of Rotation-Equivariant Keypoint Detection and Invariant Description for Endoscopy

  • paper_url: http://arxiv.org/abs/2309.09563
  • repo_url: None
  • paper_authors: Mert Asim Karaoglu, Viktoria Markova, Nassir Navab, Benjamin Busam, Alexander Ladikos
  • for: 这篇论文研究内窥镜视频中的关键点检测与描述,以应对内窥镜视频中常见的大幅旋转。
  • methods: 论文提出了一种基于学习的方法 RIDE,实现旋转等变的检测和旋转不变的描述。RIDE 借鉴群等变学习的最新进展,在大规模整理的内窥镜图像上以自监督方式训练,无需人工标注。
  • results: 实验结果表明,RIDE 在手术组织跟踪和相对位姿估计任务上优于以往的学习方法和经典方法,并且在大幅旋转下保持稳健。
    Abstract Unlike in natural images, in endoscopy there is no clear notion of an up-right camera orientation. Endoscopic videos therefore often contain large rotational motions, which require keypoint detection and description algorithms to be robust to these conditions. While most classical methods achieve rotation-equivariant detection and invariant description by design, many learning-based approaches learn to be robust only up to a certain degree. At the same time learning-based methods under moderate rotations often outperform classical approaches. In order to address this shortcoming, in this paper we propose RIDE, a learning-based method for rotation-equivariant detection and invariant description. Following recent advancements in group-equivariant learning, RIDE models rotation-equivariance implicitly within its architecture. Trained in a self-supervised manner on a large curation of endoscopic images, RIDE requires no manual labeling of training data. We test RIDE in the context of surgical tissue tracking on the SuPeR dataset as well as in the context of relative pose estimation on a repurposed version of the SCARED dataset. In addition we perform explicit studies showing its robustness to large rotations. Our comparison against recent learning-based and classical approaches shows that RIDE sets a new state-of-the-art performance on matching and relative pose estimation tasks and scores competitively on surgical tissue tracking.
    摘要 与自然图像不同,内窥镜图像中并不存在明确的"正向朝上"的相机方向,因此内窥镜视频往往包含较大的旋转运动,这要求关键点检测与描述算法对此类情况具有鲁棒性。大多数经典方法通过设计即可实现旋转等变的检测与旋转不变的描述,而许多基于学习的方法只能学到一定程度的鲁棒性;同时,在中等幅度旋转下,基于学习的方法往往优于经典方法。为弥补这一不足,本文提出了 RIDE,一种用于旋转等变检测与不变描述的学习方法。借鉴群等变学习的最新进展,RIDE 在其网络结构中隐式地建模了旋转等变性。RIDE 在大规模整理的内窥镜图像上以自监督方式训练,无需人工标注训练数据。我们在 SuPeR 数据集上的手术组织跟踪任务以及改造版 SCARED 数据集上的相对位姿估计任务中测试了 RIDE,并通过专门的实验验证了其对大幅旋转的鲁棒性。与近期基于学习的方法和经典方法的比较表明,RIDE 在匹配和相对位姿估计任务上取得了新的最先进性能,在手术组织跟踪任务上也具有竞争力。

Selective Volume Mixup for Video Action Recognition

  • paper_url: http://arxiv.org/abs/2309.09534
  • repo_url: None
  • paper_authors: Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Xiangnan He, Tao Mei
  • for: 提高深度模型对小规模视频数据的泛化能力
  • methods: 提出了一种名为 Selective Volume Mixup(SV-Mix)的新视频增强策略,通过从两段视频中选择最具信息量的体积并加以混合来构造新的训练视频
  • results: 在各种视频动作识别benchmark上经验性地证明了SV-Mix扩充策略的效果,可以提高深度模型对小规模视频数据的泛化能力,并且可以适应不同的模型结构。
    Abstract The recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often suffer from the overfitting effect on small-scale datasets with a limited number of training videos. A common solution is to exploit the existing image augmentation strategies for each frame individually including Mixup, Cutmix, and RandAugment, which are not particularly optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes the volumes up to achieve a new training video. Technically, we propose two new modules, i.e., a spatial selective module to select the local patches for each spatial position, and a temporal selective module to mix the entire frames for each timestamp and maintain the spatial pattern. At each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boot the performances of both CNN-based and transformer-based models.
    摘要 SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes them to create a new training video. Technically, we propose two new modules: a spatial selective module to select local patches for each spatial position, and a temporal selective module to mix entire frames for each timestamp while maintaining the spatial pattern. At each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy.We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boost the performances of both CNN-based and transformer-based models.

Decompose Semantic Shifts for Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2309.09531
  • repo_url: None
  • paper_authors: Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
  • for: 本研究旨在提高图像检索中的组合图像检索任务,使用参考图像作为起点,根据用户提供的文本进行启发式搜索。
  • methods: 本研究提出了一种语义偏移网络(SSN),将文本指令显式分解为两个步骤:先从参考图像得到视觉原型,再从视觉原型到目标图像。SSN 将文本指令分解为"退化"与"升级"两部分,用于逐步生成目标图像的表示。
  • results: 实验结果表明,所提出的 SSN 在 CIRR 和 FashionIQ 数据集上分别带来了 5.42% 和 1.37% 的显著提升,达到了图像检索领域新的最先进水平。代码将公开提供。
    Abstract Composed image retrieval is a type of image retrieval task where the user provides a reference image as a starting point and specifies a text on how to shift from the starting point to the desired target image. However, most existing methods focus on the composition learning of text and reference images and oversimplify the text as a description, neglecting the inherent structure and the user's shifting intention of the texts. As a result, these methods typically take shortcuts that disregard the visual cue of the reference images. To address this issue, we reconsider the text as instructions and propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image. Specifically, SSN explicitly decomposes the instructions into two components: degradation and upgradation, where the degradation is used to picture the visual prototype from the reference image, while the upgradation is used to enrich the visual prototype into the final representations to retrieve the desired target image. The experimental results show that the proposed SSN demonstrates a significant improvement of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, and establishes a new state-of-the-art performance. Codes will be publicly available.
    摘要 组合图像检索是一类图像检索任务:用户提供一张参考图像作为起点,并给出一段文本,说明如何从起点变换到期望的目标图像。然而,现有方法大多只关注文本与参考图像的组合学习,把文本简单地当作描述,忽略了文本的内在结构和用户的变换意图,因而往往走捷径而忽视参考图像的视觉线索。为解决这一问题,我们将文本重新视为指令,提出了语义偏移网络(SSN),将语义偏移显式分解为两个步骤:从参考图像到视觉原型,再从视觉原型到目标图像。具体来说,SSN 将指令显式分解为"退化"与"升级"两个部分:退化用于从参考图像中勾勒出视觉原型,升级则将视觉原型丰富为最终表示,以检索期望的目标图像。实验结果表明,所提出的 SSN 在 CIRR 和 FashionIQ 数据集上分别带来了 5.42% 和 1.37% 的显著提升,确立了新的最先进性能。代码将公开提供。

Instant Photorealistic Style Transfer: A Lightweight and Adaptive Approach

  • paper_url: http://arxiv.org/abs/2309.10011
  • repo_url: None
  • paper_authors: Rong Liu, Enyu Zhao, Zhiyuan Liu, Andrew Wei-Wen Feng, Scott John Easley
  • for: 这篇论文旨在实现即时写实风格迁移(IPST)技术,在超分辨率输入上实现快速、无需成对数据预训练的写实风格迁移。
  • methods: 该方法使用轻量级的 StyleNet 将风格图像的风格迁移到内容图像,同时保留非颜色信息;并引入实例自适应优化,优先保证输出的写实感并加速风格网络的收敛,使训练可在数秒内完成。
  • results: 实验结果表明,IPST 需要更少的 GPU 显存、具有更快的多帧迁移速度,能够在保持多视角与时间一致性的同时生成写实的输出,是多种写实风格迁移应用中有前景的解决方案。
    Abstract In this paper, we propose an Instant Photorealistic Style Transfer (IPST) approach, designed to achieve instant photorealistic style transfer on super-resolution inputs without the need for pre-training on pair-wise datasets or imposing extra constraints. Our method utilizes a lightweight StyleNet to enable style transfer from a style image to a content image while preserving non-color information. To further enhance the style transfer process, we introduce an instance-adaptive optimization to prioritize the photorealism of outputs and accelerate the convergence of the style network, leading to a rapid training completion within seconds. Moreover, IPST is well-suited for multi-frame style transfer tasks, as it retains temporal and multi-view consistency of the multi-frame inputs such as video and Neural Radiance Field (NeRF). Experimental results demonstrate that IPST requires less GPU memory usage, offers faster multi-frame transfer speed, and generates photorealistic outputs, making it a promising solution for various photorealistic transfer applications.
    摘要 在这篇论文中,我们提出了一种即时照片级真实感风格转移(IPST)方法,旨在无需成对数据集预训练、也不施加额外约束的情况下,在超高分辨率输入上实现即时的照片级风格转移。我们的方法利用轻量级的 StyleNet 将风格图像的风格迁移到内容图像上,同时保留非颜色信息。为进一步优化风格转移过程,我们引入实例自适应优化,优先保证输出的照片真实感并加速风格网络的收敛,使训练在数秒内完成。此外,IPST 很适合多帧风格转移任务,能够保持视频和神经辐射场(NeRF)等多帧输入的时间与多视角一致性。实验结果表明,IPST 占用更少的 GPU 内存、具有更快的多帧转移速度并生成照片级真实的输出,是各类照片级风格转移应用中有前景的解决方案。
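
下面是一个"按实例优化的轻量风格网络"的示意草图:用一个仅含 1x1 卷积的小网络,把内容图的颜色统计向风格图靠拢,同时约束输出贴近内容图以保留非颜色信息。网络结构、损失与超参数均为假设,仅用于说明"无需预训练、按实例快速收敛"的思路,并非论文的原始设定。

```python
import torch
import torch.nn as nn

def channel_stats(x):
    # Per-channel mean/std over spatial dims; used as a simple color-style target.
    return x.mean(dim=(2, 3)), x.std(dim=(2, 3)) + 1e-6

class TinyStyleNet(nn.Module):
    """Hypothetical lightweight per-pixel color transform (1x1 convs only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 3, 1),
        )

    def forward(self, x):
        return torch.sigmoid(self.net(x))  # keep output in [0, 1]

content = torch.rand(1, 3, 256, 256)   # content image (stand-in tensor)
style = torch.rand(1, 3, 256, 256)     # style image

net = TinyStyleNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
s_mean, s_std = channel_stats(style)

# Instance-adaptive optimization: fit this single content/style pair for a few hundred steps.
for step in range(200):
    out = net(content)
    o_mean, o_std = channel_stats(out)
    style_loss = ((o_mean - s_mean) ** 2).mean() + ((o_std - s_std) ** 2).mean()
    content_loss = (out - content).abs().mean()   # crude stand-in for "preserve non-color info"
    loss = style_loss + 0.5 * content_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```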

NOMAD: A Natural, Occluded, Multi-scale Aerial Dataset, for Emergency Response Scenarios

  • paper_url: http://arxiv.org/abs/2309.09518
  • repo_url: None
  • paper_authors: Arturo Miguel Russell Bernal, Walter Scheirer, Jane Cleland-Huang
  • for: 这篇论文旨在提升小型无人机系统(sUAS)在搜救等紧急响应场景中的计算机视觉能力。
  • methods: 论文构建了一个面向遮挡航拍视角的人体检测基准数据集:100 名演员执行行走、躺卧和躲藏等动作,在五种不同航拍距离下采集,并按边界框内人体可见比例标注 10 个可见度等级。
  • results: 论文提供了一个新的基准数据集,使计算机视觉模型能够在遮挡的航拍视角下评估人体检测性能,覆盖多种人体动作与躲藏情形,有助于提升空中搜救的有效性。
    Abstract With the increasing reliance on small Unmanned Aerial Systems (sUAS) for Emergency Response Scenarios, such as Search and Rescue, the integration of computer vision capabilities has become a key factor in mission success. Nevertheless, computer vision performance for detecting humans severely degrades when shifting from ground to aerial views. Several aerial datasets have been created to mitigate this problem, however, none of them has specifically addressed the issue of occlusion, a critical component in Emergency Response Scenarios. Natural Occluded Multi-scale Aerial Dataset (NOMAD) presents a benchmark for human detection under occluded aerial views, with five different aerial distances and rich imagery variance. NOMAD is composed of 100 different Actors, all performing sequences of walking, laying and hiding. It includes 42,825 frames, extracted from 5.4k resolution videos, and manually annotated with a bounding box and a label describing 10 different visibility levels, categorized according to the percentage of the human body visible inside the bounding box. This allows computer vision models to be evaluated on their detection performance across different ranges of occlusion. NOMAD is designed to improve the effectiveness of aerial search and rescue and to enhance collaboration between sUAS and humans, by providing a new benchmark dataset for human detection under occluded aerial views.
    摘要 随着小型无人机系统(sUAS)在搜救等紧急响应场景中的应用日益增多,计算机视觉能力的集成已成为任务成功的关键因素。然而,从地面视角切换到航拍视角时,人体检测的性能会严重下降。为缓解这一问题,已有若干航拍数据集被构建,但没有一个专门针对遮挡问题,而遮挡正是紧急响应场景中的关键因素。自然遮挡多尺度航拍数据集(NOMAD)为遮挡航拍视角下的人体检测提供了一个基准,涵盖五种航拍距离和丰富的图像变化。NOMAD 由 100 名演员组成,均执行行走、躺卧和躲藏等动作;共包含 42,825 帧,提取自 5.4K 分辨率视频,并人工标注了边界框以及描述 10 种可见度等级的标签(按边界框内人体可见比例划分)。这使得计算机视觉模型能够在不同遮挡程度下评估其检测性能。NOMAD 旨在提升空中搜救的有效性、增强 sUAS 与人类的协作,为遮挡航拍视角下的人体检测提供新的基准数据集。

Sparse and Privacy-enhanced Representation for Human Pose Estimation

  • paper_url: http://arxiv.org/abs/2309.09515
  • repo_url: None
  • paper_authors: Ting-Ying Lin, Lin-Yung Hsieh, Fu-En Wang, Wen-Shen Wuen, Min Sun
  • for: 提高人姿估计(HPE)的隐私和效率。
  • methods: 使用专有的运动向量传感器(MVS)在每一帧提取边缘图像和双向运动向量图像,并借鉴通常用于 3D 体素的稀疏卷积的最新进展,提出一个融合网络来高效处理这种稀疏表示。
  • results: 提出一种稀疏且隐私增强的 HPE 表示,实现约 13 倍的速度提升和 96% 的 FLOPs 减少;该方法优于仅使用边缘或运动向量单一模态的结果,并通过在 CelebA 上的人脸识别实验和自建数据集上的用户研究验证了稀疏表示的隐私增强效果。
    Abstract We propose a sparse and privacy-enhanced representation for Human Pose Estimation (HPE). Given a perspective camera, we use a proprietary motion vector sensor(MVS) to extract an edge image and a two-directional motion vector image at each time frame. Both edge and motion vector images are sparse and contain much less information (i.e., enhancing human privacy). We advocate that edge information is essential for HPE, and motion vectors complement edge information during fast movements. We propose a fusion network leveraging recent advances in sparse convolution used typically for 3D voxels to efficiently process our proposed sparse representation, which achieves about 13x speed-up and 96% reduction in FLOPs. We collect an in-house edge and motion vector dataset with 16 types of actions by 40 users using the proprietary MVS. Our method outperforms individual modalities using only edge or motion vector images. Finally, we validate the privacy-enhanced quality of our sparse representation through face recognition on CelebA (a large face dataset) and a user study on our in-house dataset.
    摘要 我们提出一种稀疏且隐私增强的人体姿态估计(HPE)表示。给定一个透视相机,我们使用专有的运动向量传感器(MVS)在每一时刻提取一张边缘图像和一张双向运动向量图像。边缘图像与运动向量图像都是稀疏的,所含信息远少于原始图像(从而增强了人的隐私)。我们认为边缘信息对 HPE 至关重要,而运动向量在快速运动时可对边缘信息形成补充。我们提出一个融合网络,借鉴通常用于 3D 体素的稀疏卷积的最新进展,高效处理所提出的稀疏表示,获得约 13 倍的加速和 96% 的 FLOPs 削减。我们使用该专有 MVS 采集了一个自建的边缘与运动向量数据集,包含 40 名用户的 16 类动作。我们的方法优于仅使用边缘或运动向量图像的单一模态。最后,我们通过在 CelebA(大规模人脸数据集)上的人脸识别实验以及在自建数据集上的用户研究,验证了该稀疏表示的隐私增强效果。
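
下面的小例子示意"稀疏表示"这一思路:只保留边缘/运动向量图中活跃像素的坐标与特征,后续再交给稀疏卷积库(如 MinkowskiEngine 或 spconv,此处不展开其具体 API)处理。阈值与特征拼接方式均为示意性假设。

```python
import torch

def to_sparse_coords(edge, motion, thresh=0.1):
    """Keep only active pixels: returns (N, 2) coords and (N, 3) features.

    edge:   (H, W)    edge-intensity image
    motion: (2, H, W) two-directional motion-vector image
    """
    active = (edge.abs() > thresh) | (motion.abs().amax(dim=0) > thresh)
    coords = active.nonzero(as_tuple=False)               # (N, 2) pixel coordinates
    feats = torch.cat([edge[active].unsqueeze(1),          # (N, 1) edge intensity
                       motion[:, active].t()], dim=1)      # (N, 2) motion -> (N, 3) total
    return coords, feats

edge = torch.zeros(240, 320); edge[100:140, 50:60] = 1.0        # synthetic sparse edges
motion = torch.zeros(2, 240, 320); motion[:, 100:140, 50:60] = 0.5
coords, feats = to_sparse_coords(edge, motion)
density = coords.shape[0] / (240 * 320)
print(f"active pixels: {coords.shape[0]} ({density:.2%} of the frame)")
```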

PanoMixSwap Panorama Mixing via Structural Swapping for Indoor Scene Understanding

  • paper_url: http://arxiv.org/abs/2309.09514
  • repo_url: None
  • paper_authors: Yu-Cheng Hsieh, Cheng Sun, Suraj Dengale, Min Sun
  • for: 增加现代深度学习方法的训练数据量和多样性,以提升室内场景理解能力。
  • methods: 提出 PanoMixSwap 数据增强技术,通过混合不同的背景风格、前景家具和房间布局,从现有的室内全景数据集中生成多样化的新全景图像。
  • results: 实验表明,使用 PanoMixSwap 训练的模型在室内场景理解任务(语义分割和布局估计)上性能显著提升,并一致地优于原始设置。
    Abstract The volume and diversity of training data are critical for modern deep learning-based methods. Compared to the massive amount of labeled perspective images, 360 panoramic images fall short in both volume and diversity. In this paper, we propose PanoMixSwap, a novel data augmentation technique specifically designed for indoor panoramic images. PanoMixSwap explicitly mixes various background styles, foreground furniture, and room layouts from the existing indoor panorama datasets and generates a diverse set of new panoramic images to enrich the datasets. We first decompose each panoramic image into its constituent parts: background style, foreground furniture, and room layout. Then, we generate an augmented image by mixing these three parts from three different images, such as the foreground furniture from one image, the background style from another image, and the room structure from the third image. Our method yields high diversity since there is a cubical increase in image combinations. We also evaluate the effectiveness of PanoMixSwap on two indoor scene understanding tasks: semantic segmentation and layout estimation. Our experiments demonstrate that state-of-the-art methods trained with PanoMixSwap outperform their original setting on both tasks consistently.
    摘要 训练数据的规模与多样性对现代基于深度学习的方法至关重要。与海量带标注的透视图像相比,360 度全景图像在数量和多样性上都明显不足。本文提出 PanoMixSwap,一种专为室内全景图像设计的新型数据增强技术。PanoMixSwap 显式地混合现有室内全景数据集中的不同背景风格、前景家具和房间布局,生成多样化的新全景图像以扩充数据集。我们首先将每张全景图像分解为三个组成部分:背景风格、前景家具和房间布局;然后从三张不同的图像中各取一部分进行混合(例如前景家具取自一张图、背景风格取自另一张图、房间结构取自第三张图)来生成增强图像。由于图像组合数呈立方级增长,我们的方法能够产生很高的多样性。我们还在两个室内场景理解任务(语义分割和布局估计)上评估了 PanoMixSwap 的有效性。实验表明,使用 PanoMixSwap 训练的最先进方法在两个任务上均一致地优于其原始设置。
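
下面给出一个把"前景家具"按语义掩码从一张全景图粘贴到另一张全景图上的极简合成示意。真实的 PanoMixSwap 还涉及背景风格与房间布局的结构化交换与生成,这里仅演示基于掩码的混合思想,其中的数组与掩码均为人为构造的假设示例。

```python
import numpy as np

def mix_foreground(background_pano, furniture_pano, furniture_mask):
    """Paste furniture pixels from one panorama onto another.

    background_pano, furniture_pano: (H, W, 3) uint8 equirectangular images
    furniture_mask:                  (H, W) bool mask of furniture pixels
    """
    out = background_pano.copy()
    out[furniture_mask] = furniture_pano[furniture_mask]
    return out

H, W = 512, 1024
pano_a = np.full((H, W, 3), 200, dtype=np.uint8)   # stand-in "background style" panorama
pano_b = np.full((H, W, 3), 80, dtype=np.uint8)    # stand-in panorama providing furniture
mask_b = np.zeros((H, W), dtype=bool)
mask_b[300:400, 100:300] = True                    # hypothetical furniture segmentation mask

augmented = mix_foreground(pano_a, pano_b, mask_b)
print(augmented.shape, augmented[350, 200], augmented[0, 0])  # furniture pixels vs. background pixels
```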

Learning Parallax for Stereo Event-based Motion Deblurring

  • paper_url: http://arxiv.org/abs/2309.09513
  • repo_url: None
  • paper_authors: Mingyuan Lin, Chi Zhang, Chu He, Lei Yu
  • for: 该研究旨在利用事件相机数据补充曝光期间丢失的信息,以提升运动去模糊的效果。
  • methods: 提出一种新的由粗到精框架 St-EDNet(NETwork of Event-based motion Deblurring with STereo event and intensity cameras),可直接从未对齐的输入(单张模糊图像与同时采集的事件流)中恢复高质量图像。
  • results: 实验结果表明,所提方法在真实数据集上优于现有最先进方法。
    Abstract Due to the extremely low latency, events have been recently exploited to supplement lost information for motion deblurring. Existing approaches largely rely on the perfect pixel-wise alignment between intensity images and events, which is not always fulfilled in the real world. To tackle this problem, we propose a novel coarse-to-fine framework, named NETwork of Event-based motion Deblurring with STereo event and intensity cameras (St-EDNet), to recover high-quality images directly from the misaligned inputs, consisting of a single blurry image and the concurrent event streams. Specifically, the coarse spatial alignment of the blurry image and the event streams is first implemented with a cross-modal stereo matching module without the need for ground-truth depths. Then, a dual-feature embedding architecture is proposed to gradually build the fine bidirectional association of the coarsely aligned data and reconstruct the sequence of the latent sharp images. Furthermore, we build a new dataset with STereo Event and Intensity Cameras (StEIC), containing real-world events, intensity images, and dense disparity maps. Experiments on real-world datasets demonstrate the superiority of the proposed network over state-of-the-art methods.
    摘要 由于延迟极低,事件相机最近被用来补充丢失的信息以实现运动去模糊。现有方法大多依赖亮度图像与事件之间精确的像素级对齐,而这在真实世界中并不总能满足。为解决这一问题,我们提出一种新的由粗到精框架 St-EDNet(NETwork of Event-based motion Deblurring with STereo event and intensity cameras),能够直接从未对齐的输入(一张模糊图像与同时采集的事件流)中恢复高质量图像。具体而言,首先通过一个跨模态立体匹配模块实现模糊图像与事件流的粗略空间对齐,且无需真值深度;随后提出一种双特征嵌入结构,逐步建立粗对齐数据之间精细的双向关联,并重建潜在清晰图像序列。此外,我们构建了一个新的 StEIC 数据集(STereo Event and Intensity Cameras),包含真实世界的事件、亮度图像和稠密视差图。在真实数据集上的实验表明,所提网络优于现有最先进方法。

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

  • paper_url: http://arxiv.org/abs/2309.09502
  • repo_url: None
  • paper_authors: Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Li Liu, Shanghang Zhang
  • for: 本研究旨在提出一种基于2D标签的多视图3D占用预测方法,以便降低高成本的3D占用标签生成过程中的限制。
  • methods: 我们从多视图图像中提取 NeRF 式的 3D 体积表示,并利用体渲染技术生成 2D 渲染结果,从而直接用 2D 语义和深度标签进行 3D 监督。此外,我们提出辅助射线(Auxiliary Ray)方法,通过组合连续帧为每个物体构建更完整的 2D 渲染,以解决自动驾驶场景中视角稀疏的问题。
  • results: 实验结果表明,RenderOcc 可以达到与完全使用 3D 标签监督的模型相当的性能,证明了该方法在实际应用中的价值。
    Abstract 3D occupancy prediction holds significant promise in the fields of robot perception and autonomous driving, which quantifies 3D scenes into grid cells with semantic labels. Recent works mainly utilize complete occupancy labels in 3D voxel space for supervision. However, the expensive annotation process and sometimes ambiguous labels have severely constrained the usability and scalability of 3D occupancy models. To address this, we present RenderOcc, a novel paradigm for training 3D occupancy models only using 2D labels. Specifically, we extract a NeRF-style 3D volume representation from multi-view images, and employ volume rendering techniques to establish 2D renderings, thus enabling direct 3D supervision from 2D semantics and depth labels. Additionally, we introduce an Auxiliary Ray method to tackle the issue of sparse viewpoints in autonomous driving scenarios, which leverages sequential frames to construct comprehensive 2D rendering for each object. To our best knowledge, RenderOcc is the first attempt to train multi-view 3D occupancy models only using 2D labels, reducing the dependence on costly 3D occupancy annotations. Extensive experiments demonstrate that RenderOcc achieves comparable performance to models fully supervised with 3D labels, underscoring the significance of this approach in real-world applications.
    摘要 三维占用预测在机器人感知和自动驾驶领域具有重要前景,它把三维场景量化为带有语义标签的网格单元。近期工作主要利用 3D 体素空间中的完整占用标签进行监督。然而,昂贵的标注流程以及有时含糊的标签严重限制了 3D 占用模型的可用性和可扩展性。为此,我们提出 RenderOcc,一种仅使用 2D 标签训练 3D 占用模型的新范式。具体而言,我们从多视图图像中提取 NeRF 式的 3D 体积表示,并利用体渲染技术生成 2D 渲染结果,从而能够直接用 2D 语义和深度标签进行 3D 监督。此外,我们提出辅助射线方法,利用连续帧为每个物体构建更完整的 2D 渲染,以应对自动驾驶场景中视角稀疏的问题。据我们所知,RenderOcc 是首个仅用 2D 标签训练多视图 3D 占用模型的尝试,从而降低了对昂贵 3D 占用标注的依赖。大量实验表明,RenderOcc 达到了与完全使用 3D 标签监督的模型相当的性能,凸显了该方法在实际应用中的意义。
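
下面用通用的 NeRF 式体渲染公式示意"把 3D 占用/语义沿射线渲染成 2D 深度与语义,从而只用 2D 标签监督 3D 表示"的核心计算:对射线上的每个采样点按透射率加权累计。采样方式与符号均为常见写法,并非论文的具体实现。

```python
import torch

def render_ray(densities, semantics, depths):
    """NeRF-style rendering weights along one camera ray.

    densities: (S,)   non-negative occupancy/density at S samples
    semantics: (S, C) per-sample semantic probabilities
    depths:    (S,)   sample depths along the ray (increasing)
    Returns a rendered depth (scalar) and rendered semantics (C,).
    """
    deltas = torch.cat([depths[1:] - depths[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-densities * deltas)              # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * trans                                    # contribution of each sample
    return (weights * depths).sum(), (weights.unsqueeze(1) * semantics).sum(dim=0)

S, C = 64, 18
densities = torch.rand(S) * 0.5
semantics = torch.softmax(torch.randn(S, C), dim=-1)
depths = torch.linspace(0.5, 50.0, S)
depth_2d, sem_2d = render_ray(densities, semantics, depths)
# depth_2d and sem_2d can now be supervised directly with 2D depth / semantic labels.
print(float(depth_2d), sem_2d.shape)
```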

Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

  • paper_url: http://arxiv.org/abs/2309.09501
  • repo_url: None
  • paper_authors: Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu
  • for: Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video.
  • methods: The proposed method uses an Audio-Queried Transformer architecture (AQFormer) to establish explicit object-level semantic correspondence between audio and visual modalities, and to exchange sounding object-relevant information among multiple frames.
  • results: The method achieves state-of-the-art performances on two AVS benchmarks, with 7.1% M_J and 7.6% M_F gains on the MS3 setting.
    Abstract Audio visual segmentation (AVS) aims to segment the sounding objects for each frame of a given video. To distinguish the sounding objects from silent ones, both audio-visual semantic correspondence and temporal interaction are required. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information and associate each of them to particular sounding objects. Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries. Besides, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding object-relevant information among multiple frames with the bridge of audio features. Extensive experiments are conducted on two AVS benchmarks to show that our method achieves state-of-the-art performances, especially 7.1% M_J and 7.6% M_F gains on the MS3 setting.
    摘要 视听分割(AVS)的目标是为给定视频的每一帧分割出正在发声的物体。要把发声物体与静默物体区分开,既需要音视频之间的语义对应,也需要时间上的交互。先前的方法使用多帧跨模态注意力,让多帧的音频特征与视觉特征同时进行像素级交互,这种方式既冗余又隐式。本文提出一种音频查询 Transformer 架构 AQFormer:我们定义一组以音频信息为条件的对象查询,并将每个查询与特定的发声物体关联;通过用预设的音频查询从视觉特征中汇聚物体信息,建立音频与视觉模态之间显式的对象级语义对应。此外,我们提出音频桥接的时间交互模块,以音频特征为桥梁,在多帧之间交换与发声物体相关的信息。在两个 AVS 基准上的大量实验表明,我们的方法取得了最先进的性能,尤其在 MS3 设置上带来 7.1% M_J 和 7.6% M_F 的提升。
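
下面是"以音频为条件的对象查询对视觉特征做交叉注意力"的最小示意,直接使用 PyTorch 自带的 MultiheadAttention;查询数量、维度与条件注入方式均为假设,仅用于说明"显式的对象级音视频对应"这一机制。

```python
import torch
import torch.nn as nn

class AudioQueriedCrossAttention(nn.Module):
    def __init__(self, dim=256, num_queries=8, num_heads=8):
        super().__init__()
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.audio_proj = nn.Linear(dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats, audio_feat):
        """visual_feats: (B, HW, D) flattened frame features; audio_feat: (B, D)."""
        B = visual_feats.shape[0]
        # Condition the learnable object queries on the audio of the current frame.
        q = self.object_queries.unsqueeze(0).expand(B, -1, -1) + self.audio_proj(audio_feat).unsqueeze(1)
        # Each audio-conditioned query gathers sounding-object evidence from visual features.
        out, attn = self.cross_attn(q, visual_feats, visual_feats)
        return out, attn                       # (B, Q, D), (B, Q, HW)

model = AudioQueriedCrossAttention()
vis = torch.randn(2, 32 * 32, 256)
aud = torch.randn(2, 256)
obj_feats, attn_maps = model(vis, aud)
print(obj_feats.shape, attn_maps.shape)        # attn_maps can be reshaped to (B, Q, 32, 32) masks
```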

Target-aware Bi-Transformer for Few-shot Segmentation

  • paper_url: http://arxiv.org/abs/2309.09492
  • repo_url: None
  • paper_authors: Xianglin Wang, Xiaoliu Luo, Taiping Zhang
  • For: The paper is written for proposing a new few-shot semantic segmentation (FSS) method that uses limited labeled support images to identify the segmentation of new classes of objects.* Methods: The proposed method, called Target-aware Bi-Transformer Network (TBTNet), uses a novel Target-aware Transformer Layer (TTL) to distill correlations and focus on foreground information. The model treats the hypercorrelation as a feature, resulting in a significant reduction in the number of feature channels.* Results: The proposed method achieves state-of-the-art performance on standard FSS benchmarks of PASCAL-5i and COCO-20i, with only 0.4M learnable parameters and converging in 10% to 25% of the training epochs compared to traditional methods.
    Abstract Traditional semantic segmentation tasks require a large number of labels and are difficult to identify unlearned categories. Few-shot semantic segmentation (FSS) aims to use limited labeled support images to identify the segmentation of new classes of objects, which is very practical in the real world. Previous researches were primarily based on prototypes or correlations. Due to colors, textures, and styles are similar in the same image, we argue that the query image can be regarded as its own support image. In this paper, we proposed the Target-aware Bi-Transformer Network (TBTNet) to equivalent treat of support images and query image. A vigorous Target-aware Transformer Layer (TTL) also be designed to distill correlations and force the model to focus on foreground information. It treats the hypercorrelation as a feature, resulting a significant reduction in the number of feature channels. Benefit from this characteristic, our model is the lightest up to now with only 0.4M learnable parameters. Futhermore, TBTNet converges in only 10% to 25% of the training epochs compared to traditional methods. The excellent performance on standard FSS benchmarks of PASCAL-5i and COCO-20i proves the efficiency of our method. Extensive ablation studies were also carried out to evaluate the effectiveness of Bi-Transformer architecture and TTL.
    摘要 传统的语义分割任务需要大量标签,且难以识别未学习过的类别。少样本语义分割(FSS)旨在利用有限的带标注支持图像来分割新类别的物体,这在真实世界中非常实用。先前的研究主要基于原型或相关性。由于同一张图像内的颜色、纹理和风格相近,我们认为查询图像本身也可以被视为自己的支持图像。本文提出目标感知双 Transformer 网络(TBTNet),对支持图像和查询图像进行同等处理;并设计了一个强有力的目标感知 Transformer 层(TTL),用于提炼相关性并迫使模型关注前景信息。该方法把超相关(hypercorrelation)直接当作特征,从而显著减少特征通道数。得益于这一特性,我们的模型是迄今最轻量的,仅有 0.4M 可学习参数;此外,TBTNet 只需传统方法 10% 到 25% 的训练轮数即可收敛。在 PASCAL-5i 和 COCO-20i 标准 FSS 基准上的优异表现证明了方法的高效性。我们还进行了充分的消融实验,以评估双 Transformer 架构和 TTL 的有效性。
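
下面几行代码示意"超相关(hypercorrelation)"的基本构造:对查询图与(掩码后的)支持图在各空间位置的特征做余弦相似度,得到一个 4D 相关体,再交由后续网络提炼。特征维度与掩码处理方式为通用假设。

```python
import torch
import torch.nn.functional as F

def hypercorrelation(query_feat, support_feat, support_mask):
    """Cosine-similarity correlation between query and (masked) support features.

    query_feat, support_feat: (B, C, H, W) features from the same backbone layer
    support_mask:             (B, 1, H, W) binary foreground mask of the support image
    Returns a 4D correlation volume of shape (B, H, W, H, W).
    """
    q = F.normalize(query_feat, dim=1)
    s = F.normalize(support_feat * support_mask, dim=1)    # suppress support background
    corr = torch.einsum('bchw,bcxy->bhwxy', q, s)
    return corr.clamp(min=0)                                # keep positive correlations only

B, C, H, W = 2, 64, 32, 32
corr = hypercorrelation(torch.randn(B, C, H, W), torch.randn(B, C, H, W),
                        torch.ones(B, 1, H, W))
print(corr.shape)   # (2, 32, 32, 32, 32); downstream layers treat this volume itself as the feature
```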

Self-supervised TransUNet for Ultrasound regional segmentation of the distal radius in children

  • paper_url: http://arxiv.org/abs/2309.09490
  • repo_url: None
  • paper_authors: Yuyue Zhou, Jessica Knight, Banafshe Felfeliyan, Christopher Keen, Abhilash Rakkunedeth Hareendranathan, Jacob L. Jaremko
  • for: 这篇论文旨在应用自监督学习(SSL)方法减少医学影像的标注需求,以促进医学影像分割与诊断的自动化分析。
  • methods: 论文将掩码自编码器自监督方法(SSL-MAE)用于 TransUNet,并改变其嵌入与损失函数,以改善下游结果。
  • results: 研究发现,仅用 SSL-MAE 预训练 TransUNet 的嵌入和编码器,在下游分割任务上并不优于未经 SSL-MAE 预训练的 TransUNet。
    Abstract Supervised deep learning offers great promise to automate analysis of medical images from segmentation to diagnosis. However, their performance highly relies on the quality and quantity of the data annotation. Meanwhile, curating large annotated datasets for medical images requires a high level of expertise, which is time-consuming and expensive. Recently, to quench the thirst for large data sets with high-quality annotation, self-supervised learning (SSL) methods using unlabeled domain-specific data, have attracted attention. Therefore, designing an SSL method that relies on minimal quantities of labeled data has far-reaching significance in medical images. This paper investigates the feasibility of deploying the Masked Autoencoder for SSL (SSL-MAE) of TransUNet, for segmenting bony regions from children's wrist ultrasound scans. We found that changing the embedding and loss function in SSL-MAE can produce better downstream results compared to the original SSL-MAE. In addition, we determined that only pretraining TransUNet embedding and encoder with SSL-MAE does not work as well as TransUNet without SSL-MAE pretraining on downstream segmentation tasks.
    摘要 有监督深度学习在从分割到诊断的医学影像自动分析中展现出巨大潜力,但其性能高度依赖数据标注的质量与数量;而为医学影像整理大规模标注数据集需要高水平的专业知识,既耗时又昂贵。近来,为满足对大规模高质量标注数据的需求,利用未标注领域数据的自监督学习(SSL)方法受到了关注。因此,设计一种仅依赖极少量标注数据的 SSL 方法,对医学影像具有深远意义。本文研究了将 TransUNet 的掩码自编码自监督方法(SSL-MAE)用于儿童腕部超声图像中骨性区域分割的可行性。我们发现,改变 SSL-MAE 中的嵌入和损失函数可以比原始 SSL-MAE 获得更好的下游结果;同时,仅用 SSL-MAE 预训练 TransUNet 的嵌入和编码器,其下游分割效果不如未经 SSL-MAE 预训练的 TransUNet。
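
下面是掩码自编码(MAE)式自监督预训练核心步骤的示意:随机遮住大部分图像块,只对被遮住的块计算重建损失。这里的编码器/解码器用一个占位的线性层代替(真实的 SSL-MAE 使用 TransUNet/ViT 结构),块大小与遮挡比例取常见数值,并非论文设定。

```python
import torch
import torch.nn as nn

def patchify(img, p=16):
    """(B, C, H, W) -> (B, N, p*p*C) non-overlapping patches."""
    B, C, H, W = img.shape
    x = img.reshape(B, C, H // p, p, W // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, (H // p) * (W // p), p * p * C)

def mae_loss(img, reconstruct, mask_ratio=0.75, p=16):
    patches = patchify(img, p)                          # (B, N, D)
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    ids = torch.rand(B, N).argsort(dim=1)               # random permutation per sample
    keep, masked = ids[:, :n_keep], ids[:, n_keep:]
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).repeat(1, 1, D))
    pred = reconstruct(visible, keep, N)                # predict ALL patches from visible ones
    target = torch.gather(patches, 1, masked.unsqueeze(-1).repeat(1, 1, D))
    pred_masked = torch.gather(pred, 1, masked.unsqueeze(-1).repeat(1, 1, D))
    return ((pred_masked - target) ** 2).mean()         # loss only on the masked patches

# Placeholder "encoder + decoder": a single linear layer (a real MAE uses a ViT encoder/decoder).
dummy = nn.Linear(16 * 16 * 1, 16 * 16 * 1)
def reconstruct(visible, keep_ids, n_total):
    B, _, D = visible.shape
    return dummy(visible.mean(dim=1, keepdim=True)).repeat(1, n_total, 1)

img = torch.rand(2, 1, 224, 224)                        # single-channel ultrasound-like input
print(float(mae_loss(img, reconstruct)))
```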

Distributional Estimation of Data Uncertainty for Surveillance Face Anti-spoofing

  • paper_url: http://arxiv.org/abs/2309.09485
  • repo_url: None
  • paper_authors: Mouxiao Huang
  • for: This paper aims to improve the security of face anti-spoofing (FAS) systems in long-distance surveillance scenarios, which are often characterized by low-quality face images and high levels of data uncertainty.
  • methods: The proposed method, called Distributional Estimation (DisE), models data uncertainty during training to improve the stability and accuracy of FAS systems. DisE adjusts the learning strength of clean and noisy samples to enhance performance.
  • results: The proposed method was evaluated on a large-scale and challenging FAS dataset (SuHiFiMask) and achieved comparable performance on both ACER and AUC metrics, indicating its effectiveness in improving the security of FAS systems in long-distance surveillance scenarios.
    Abstract Face recognition systems have become increasingly vulnerable to security threats in recent years, prompting the use of Face Anti-spoofing (FAS) to protect against various types of attacks, such as phone unlocking, face payment, and self-service security inspection. While FAS has demonstrated its effectiveness in traditional settings, securing it in long-distance surveillance scenarios presents a significant challenge. These scenarios often feature low-quality face images, necessitating the modeling of data uncertainty to improve stability under extreme conditions. To address this issue, this work proposes Distributional Estimation (DisE), a method that converts traditional FAS point estimation to distributional estimation by modeling data uncertainty during training, including feature (mean) and uncertainty (variance). By adjusting the learning strength of clean and noisy samples for stability and accuracy, the learned uncertainty enhances DisE's performance. The method is evaluated on SuHiFiMask [1], a large-scale and challenging FAS dataset in surveillance scenarios. Results demonstrate that DisE achieves comparable performance on both ACER and AUC metrics.
    摘要 近年来,人脸识别系统日益容易受到安全威胁,因此人们使用人脸反欺骗(FAS)来防御手机解锁、刷脸支付和自助安检等场景中的各类攻击。尽管 FAS 在传统场景下已展现出有效性,但在远距离监控场景中保障其安全仍是一项重大挑战:这类场景中的人脸图像往往质量很低,需要对数据不确定性建模,以提升极端条件下的稳定性。为此,本文提出分布估计(DisE)方法,通过在训练中对数据不确定性(特征均值与方差)建模,将传统 FAS 的点估计转化为分布估计;通过调整干净样本与噪声样本的学习强度来兼顾稳定性与准确性,学得的不确定性进一步提升了 DisE 的性能。该方法在监控场景下大规模且具挑战性的 FAS 数据集 SuHiFiMask 上进行了评估,结果表明 DisE 在 ACER 和 AUC 指标上都取得了相当的性能。
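
下面示意"把点估计改为分布估计"的一种常见做法:网络同时输出特征的均值与对数方差,并用高斯负对数似然让高不确定度(噪声)样本的误差被自动降权。网络结构、损失形式与其中的目标嵌入均为假设,并非论文的原始公式。

```python
import torch
import torch.nn as nn

class DistributionalHead(nn.Module):
    """Predicts a mean embedding and a log-variance (data uncertainty) per sample."""
    def __init__(self, in_dim=512, out_dim=128):
        super().__init__()
        self.mu = nn.Linear(in_dim, out_dim)
        self.log_var = nn.Linear(in_dim, out_dim)

    def forward(self, feat):
        return self.mu(feat), self.log_var(feat)

def gaussian_nll(mu, log_var, target):
    # Noisy samples get a larger predicted variance, which down-weights their
    # squared error and stabilizes training on low-quality surveillance faces.
    return (0.5 * torch.exp(-log_var) * (mu - target) ** 2 + 0.5 * log_var).mean()

head = DistributionalHead()
backbone_feat = torch.randn(8, 512)      # features of 8 face crops from a backbone
target_emb = torch.randn(8, 128)         # e.g., class-center embeddings (assumed supervision target)
mu, log_var = head(backbone_feat)
loss = gaussian_nll(mu, log_var, target_emb)
loss.backward()
print(float(loss))
```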

An Accurate and Efficient Neural Network for OCTA Vessel Segmentation and a New Dataset

  • paper_url: http://arxiv.org/abs/2309.09483
  • repo_url: None
  • paper_authors: Haojian Ning, Chengliang Wang, Xinrun Chen, Shiying Li
  • for: 本研究利用非侵入式的光学相干断层扫描血管成像(OCTA)技术,对其所呈现的高分辨率视网膜血管网络进行分割。
  • methods: 提出一种准确且高效的神经网络,用于 OCTA 图像中的视网膜血管分割。该方法将改进的 Recurrent ConvNeXt 块应用于全分辨率卷积网络,在达到与其他最先进方法相当精度的同时,参数更少、推理更快(例如比 U-Net 轻约 110 倍、快约 1.3 倍),非常适合工业应用。
  • results: 研究还构建了一个包含 918 张 OCTA 图像及其对应血管标注的新数据集。该数据集借助 Segment Anything Model(SAM)进行半自动标注,大幅提高了标注速度。
    Abstract Optical coherence tomography angiography (OCTA) is a noninvasive imaging technique that can reveal high-resolution retinal vessels. In this work, we propose an accurate and efficient neural network for retinal vessel segmentation in OCTA images. The proposed network achieves accuracy comparable to other SOTA methods, while having fewer parameters and faster inference speed (e.g. 110x lighter and 1.3x faster than U-Net), which is very friendly for industrial applications. This is achieved by applying the modified Recurrent ConvNeXt Block to a full resolution convolutional network. In addition, we create a new dataset containing 918 OCTA images and their corresponding vessel annotations. The data set is semi-automatically annotated with the help of Segment Anything Model (SAM), which greatly improves the annotation speed. For the benefit of the community, our code and dataset can be obtained from https://github.com/nhjydywd/OCTA-FRNet.
    摘要 光学相干断层扫描血管成像(OCTA)是一种能够呈现高分辨率视网膜血管的非侵入式成像技术。本文提出一种准确且高效的神经网络,用于 OCTA 图像中的视网膜血管分割。该网络在精度上可与其他最先进方法相当,同时参数更少、推理更快(例如比 U-Net 轻约 110 倍、快约 1.3 倍),非常适合工业应用。这得益于将改进的 Recurrent ConvNeXt 块应用于全分辨率卷积网络。此外,我们构建了一个包含 918 张 OCTA 图像及其血管标注的新数据集,并借助 Segment Anything Model(SAM)进行半自动标注,大幅提高了标注速度。为方便社区使用,代码与数据集可从 https://github.com/nhjydywd/OCTA-FRNet 获取。

Spatio-temporal Co-attention Fusion Network for Video Splicing Localization

  • paper_url: http://arxiv.org/abs/2309.09482
  • repo_url: None
  • paper_authors: Man Lin, Gang Cao, Zijie Lou
  • for: 本研究旨在提出一种针对视频拼接伪造的检测与定位方法,以保障视频的真实性与安全性。
  • methods: 使用三流网络作为编码器,通过新颖的并行与交叉协同注意力融合模块实现时空取证特征的深度交互与融合,以提取多帧视频中的篡改痕迹;再用轻量级多层感知机(MLP)解码器输出像素级篡改定位图。
  • results: 测试结果表明,SCFNet 能够有效定位视频拼接伪造,并在多个基准数据集上达到最先进的性能。
    Abstract Digital video splicing has become easy and ubiquitous. Malicious users copy some regions of a video and paste them to another video for creating realistic forgeries. It is significant to blindly detect such forgery regions in videos. In this paper, a spatio-temporal co-attention fusion network (SCFNet) is proposed for video splicing localization. Specifically, a three-stream network is used as an encoder to capture manipulation traces across multiple frames. The deep interaction and fusion of spatio-temporal forensic features are achieved by the novel parallel and cross co-attention fusion modules. A lightweight multilayer perceptron (MLP) decoder is adopted to yield a pixel-level tampering localization map. A new large-scale video splicing dataset is created for training the SCFNet. Extensive tests on benchmark datasets show that the localization and generalization performances of our SCFNet outperform the state-of-the-art. Code and datasets will be available at https://github.com/multimediaFor/SCFNet.
    摘要 数字视频拼接已变得简单而普遍:恶意用户把一段视频中的某些区域复制并粘贴到另一段视频中,制造出逼真的伪造内容。因此,对视频中此类伪造区域进行盲检测十分重要。本文提出一种时空协同注意力融合网络(SCFNet),用于视频拼接定位。具体而言,使用三流网络作为编码器,跨多帧捕捉篡改痕迹;通过新颖的并行与交叉协同注意力融合模块,实现时空取证特征的深度交互与融合;再采用轻量级多层感知机(MLP)解码器生成像素级篡改定位图。我们还构建了一个新的大规模视频拼接数据集用于训练 SCFNet。在基准数据集上的大量测试表明,SCFNet 的定位与泛化性能均优于现有最先进方法。代码和数据集将在 https://github.com/multimediaFor/SCFNet 上发布。

Stealthy Physical Masked Face Recognition Attack via Adversarial Style Optimization

  • paper_url: http://arxiv.org/abs/2309.09480
  • repo_url: None
  • paper_authors: Huihui Gong, Minjing Dong, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
  • for: 本研究旨在针对人脸识别模型提出一种隐蔽的物理口罩对抗攻击方法(目标攻击设定),以考察其安全风险。
  • methods: 提出通过对抗风格优化生成隐蔽的口罩攻击:训练一个对抗风格口罩生成器,把对抗扰动隐藏在风格口罩之中;并以连续松弛的方式针对给定目标进行风格优化,同时优化生成器与风格选择,以生成兼具攻击力与隐蔽性的对抗风格口罩。
  • results: 通过大量白盒与黑盒数字实验验证了所提方法的有效性和可迁移性,并针对本地人脸识别模型和在线平台开展了物理攻击实验。
    Abstract Deep neural networks (DNNs) have achieved state-of-the-art performance on face recognition (FR) tasks in the last decade. In real scenarios, the deployment of DNNs requires taking various face accessories into consideration, like glasses, hats, and masks. In the COVID-19 pandemic era, wearing face masks is one of the most effective ways to defend against the novel coronavirus. However, DNNs are known to be vulnerable to adversarial examples with a small but elaborated perturbation. Thus, a facial mask with adversarial perturbations may pose a great threat to the widely used deep learning-based FR models. In this paper, we consider a challenging adversarial setting: targeted attack against FR models. We propose a new stealthy physical masked FR attack via adversarial style optimization. Specifically, we train an adversarial style mask generator that hides adversarial perturbations inside style masks. Moreover, to ameliorate the phenomenon of sub-optimization with one fixed style, we propose to discover the optimal style given a target through style optimization in a continuous relaxation manner. We simultaneously optimize the generator and the style selection for generating strong and stealthy adversarial style masks. We evaluated the effectiveness and transferability of our proposed method via extensive white-box and black-box digital experiments. Furthermore, we also conducted physical attack experiments against local FR models and online platforms.
    摘要 在过去十年中,深度神经网络(DNNs)在人脸识别(FR)任务上取得了最先进的性能。在实际场景中,部署 DNN 需要考虑各种面部配饰,如眼镜、帽子和口罩;在 COVID-19 大流行时期,佩戴口罩是防御新型冠状病毒最有效的手段之一。然而,众所周知,DNN 容易受到带有微小而精心构造扰动的对抗样本的影响,因此带有对抗扰动的口罩可能对广泛使用的深度学习 FR 模型构成巨大威胁。本文考虑一种具有挑战性的对抗设定:针对 FR 模型的目标攻击。我们提出一种新的隐蔽物理口罩 FR 攻击,基于对抗风格优化:训练一个对抗风格口罩生成器,将对抗扰动隐藏在风格口罩之中;并且为缓解固定单一风格带来的次优问题,以连续松弛的方式针对给定目标寻找最优风格,同时优化生成器与风格选择,从而生成强力且隐蔽的对抗风格口罩。我们通过大量白盒与黑盒数字实验评估了所提方法的有效性和可迁移性,并针对本地 FR 模型和在线平台进行了物理攻击实验。
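
下面是一个"只在口罩区域内优化对抗扰动"的目标攻击最小示意(标准 PGD 的掩码版):扰动被同时限制在给定掩码与 L∞ 球内,目标是使被攻击图像的嵌入逼近目标身份的嵌入。论文实际是在风格空间而非像素空间进行优化,此处的模型、掩码与超参数均为假设,仅演示掩码约束与目标攻击的基本形式。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_targeted_pgd(model, x, target_emb, mask, eps=0.06, alpha=0.01, steps=40):
    """x: (1,3,H,W) in [0,1]; mask: (1,1,H,W) binary mask of the worn face-mask region."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        emb = F.normalize(model((x + delta * mask).clamp(0, 1)), dim=-1)
        loss = -(emb * target_emb).sum()        # maximize cosine similarity to the target identity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # descend on the negative similarity
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (x + delta.detach() * mask).clamp(0, 1)

# Toy stand-in for a real face-recognition network (assumption).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 128))
x = torch.rand(1, 3, 112, 112)
mask = torch.zeros(1, 1, 112, 112)
mask[..., 60:100, 20:92] = 1.0                   # hypothetical lower-face mask region
target_emb = F.normalize(torch.randn(1, 128), dim=-1)
adv = masked_targeted_pgd(model, x, target_emb, mask)
print((adv - x).abs().max().item())              # perturbation <= eps inside the mask, 0 outside
```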

Self-supervised Multi-view Clustering in Computer Vision: A Survey

  • paper_url: http://arxiv.org/abs/2309.09473
  • repo_url: None
  • paper_authors: Jiatai Wang, Zhiwei Xu, Xuewen Yang, Hailong Li, Bo Li, Xuying Meng
  • for: 本文旨在综述多视图聚类(MVC)在跨模态表示学习和数据驱动决策中的重要性,以及自监督学习在 MVC 方法中日益占据主导地位的趋势。
  • methods: 本文主要梳理自监督学习驱动的 MVC 方法,即通过设计代理任务从图像和视频数据本身挖掘监督信息;并对常用数据集、数据问题、表示学习方法与自监督学习方法的内在联系进行归纳与分类。
  • results: 本文不仅介绍了每种类别的机制,还给出了一些应用示例。最后,文章还提出了一些未解决的问题,以便进一步的研究和发展。
    Abstract Multi-view clustering (MVC) has had significant implications in cross-modal representation learning and data-driven decision-making in recent years. It accomplishes this by leveraging the consistency and complementary information among multiple views to cluster samples into distinct groups. However, as contrastive learning continues to evolve within the field of computer vision, self-supervised learning has also made substantial research progress and is progressively becoming dominant in MVC methods. It guides the clustering process by designing proxy tasks to mine the representation of image and video data itself as supervisory information. Despite the rapid development of self-supervised MVC, there has yet to be a comprehensive survey to analyze and summarize the current state of research progress. Therefore, this paper explores the reasons and advantages of the emergence of self-supervised MVC and discusses the internal connections and classifications of common datasets, data issues, representation learning methods, and self-supervised learning methods. This paper does not only introduce the mechanisms for each category of methods but also gives a few examples of how these techniques are used. In the end, some open problems are pointed out for further investigation and development.
    摘要 近年来,多视图聚类(MVC)在跨模态表示学习和数据驱动决策中产生了重要影响:它利用多个视图之间的一致性与互补信息,把样本聚为不同的簇。然而,随着对比学习在计算机视觉领域的不断演进,自监督学习也取得了长足进展,并逐渐在 MVC 方法中占据主导地位;它通过设计代理任务,从图像和视频数据本身挖掘表示作为监督信息,引导聚类过程。尽管自监督 MVC 发展迅速,目前尚缺乏一篇全面的综述来分析和总结当前的研究进展。因此,本文探讨了自监督 MVC 兴起的原因与优势,讨论了常用数据集、数据问题、表示学习方法和自监督学习方法的内在联系与分类。本文不仅介绍了每类方法的机制,还给出了若干应用示例。最后,指出了一些有待进一步研究和发展的开放问题。

Reconstructing Existing Levels through Level Inpainting

  • paper_url: http://arxiv.org/abs/2309.09472
  • repo_url: None
  • paper_authors: Johor Jara Gonzalez, Mathew Guzdial
  • for: 这篇论文提出了"内容增强"(Content Augmentation)的概念,并聚焦于关卡修补(level inpainting)这一子问题,即重建并扩展电子游戏关卡。
  • methods: 借鉴图像修补(image inpainting)领域,论文将其中两种技术(自编码器与 U-Net)改造用于关卡修补任务。
  • results: 在完整的案例研究中,两种方法均优于基线方法;论文还讨论了二者的相对优劣,给出了关卡修补的实际演示,并提出了未来研究方向。
    Abstract Procedural Content Generation (PCG) and Procedural Content Generation via Machine Learning (PCGML) have been used in prior work for generating levels in various games. This paper introduces Content Augmentation and focuses on the subproblem of level inpainting, which involves reconstructing and extending video game levels. Drawing inspiration from image inpainting, we adapt two techniques from this domain to address our specific use case. We present two approaches for level inpainting: an Autoencoder and a U-net. Through a comprehensive case study, we demonstrate their superior performance compared to a baseline method and discuss their relative merits. Furthermore, we provide a practical demonstration of both approaches for the level inpainting task and offer insights into potential directions for future research.
    摘要 程序化内容生成(PCG)与基于机器学习的程序化内容生成(PCGML)已在先前工作中被用于生成各类游戏的关卡。本文引入内容增强,并聚焦于其中的关卡修补子问题,即重建与扩展电子游戏关卡。受图像修补的启发,我们将该领域的两种技术改造用于我们的特定场景,提出两种关卡修补方法:自编码器和 U-Net。通过完整的案例研究,我们证明了它们优于基线方法,并讨论了二者的相对优劣。此外,我们给出了两种方法在关卡修补任务上的实际演示,并展望了未来研究的可能方向。

Progressive Text-to-Image Diffusion with Soft Latent Direction

  • paper_url: http://arxiv.org/abs/2309.09466
  • repo_url: https://github.com/babahui/progressive-text-to-image
  • paper_authors: YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
  • for: 本研究旨在解决文本到图像生成中,多个实体在满足特定关系约束下的合成与操控这一挑战。
  • methods: 论文提出一种渐进式的合成与编辑操作,逐步把实体加入目标图像,并在每一步保证其满足空间与关系约束;同时利用大语言模型把复杂冗长的文本描述分解为符合严格格式的指令,并通过"刺激-响应-融合"(SRF)框架执行插入、编辑与删除等语义操作。
  • results: 所提方法在处理复杂且冗长的文本描述时,尤其是在多实体的合成与操控方面,取得了显著进展。
    Abstract In spite of the rapidly evolving landscape of text-to-image generation, the synthesis and manipulation of multiple entities while adhering to specific relational constraints pose enduring challenges. This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image, ensuring their adherence to spatial and relational constraints at each sequential step. Our key insight stems from the observation that while a pre-trained text-to-image diffusion model adeptly handles one or two entities, it often falters when dealing with a greater number. To address this limitation, we propose harnessing the capabilities of a Large Language Model (LLM) to decompose intricate and protracted text descriptions into coherent directives adhering to stringent formats. To facilitate the execution of directives involving distinct semantic operations-namely insertion, editing, and erasing-we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, followed by the fusion of the responsive latent components to achieve cohesive entity manipulation. Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs. Consequently, it establishes a new benchmark for text-to-image generation tasks, further elevating the field's performance standards.
    摘要 尽管文本到图像生成领域发展迅速,但在满足特定关系约束的前提下合成与操控多个实体仍是长期存在的挑战。本文提出一种渐进式的合成与编辑操作,系统地把实体逐步加入目标图像,并保证其在每一步都满足空间与关系约束。我们的关键发现是:预训练的文本到图像扩散模型能较好地处理一到两个实体,但面对更多实体时往往失效。为突破这一限制,我们提议利用大语言模型(LLM)把复杂冗长的文本描述分解为符合严格格式的连贯指令。为了执行涉及不同语义操作(即插入、编辑与删除)的指令,我们构建了刺激-响应-融合(SRF)框架:在每种操作下温和地刺激相应的潜在区域,随后融合响应的潜在成分,实现连贯的实体操控。所提框架在物体合成方面带来显著进步,尤其是在面对复杂冗长的文本输入时,从而为文本到图像生成任务树立了新的基准,进一步提升了该领域的性能标准。

Reducing Adversarial Training Cost with Gradient Approximation

  • paper_url: http://arxiv.org/abs/2309.09464
  • repo_url: None
  • paper_authors: Huihui Gong, Shuo Yang, Siqi Ma, Seyit Camtepe, Surya Nepal, Chang Xu
  • for: 提升模型对对抗样本(Adversarial Examples,AE)的鲁棒性,从而提高模型的可靠性和安全性。
  • methods: 在基于投影梯度下降(PGD)的对抗训练基础上,提出一种新的高效对抗训练方法:基于梯度近似的对抗训练(GAAT),用部分泰勒级数近似对抗损失及其梯度,以降低构建鲁棒模型的成本。
  • results: 在 MNIST、CIFAR-10 和 CIFAR-100 等数据集上的大量实验表明,GAAT 在保持相当的模型测试准确率的同时,最多可节省 60% 的训练时间。
    Abstract Deep learning models have achieved state-of-the-art performances in various domains, while they are vulnerable to the inputs with well-crafted but small perturbations, which are named after adversarial examples (AEs). Among many strategies to improve the model robustness against AEs, Projected Gradient Descent (PGD) based adversarial training is one of the most effective methods. Unfortunately, the prohibitive computational overhead of generating strong enough AEs, due to the maximization of the loss function, sometimes makes the regular PGD adversarial training impractical when using larger and more complicated models. In this paper, we propose that the adversarial loss can be approximated by the partial sum of Taylor series. Furthermore, we approximate the gradient of adversarial loss and propose a new and efficient adversarial training method, adversarial training with gradient approximation (GAAT), to reduce the cost of building up robust models. Additionally, extensive experiments demonstrate that this efficiency improvement can be achieved without any or with very little loss in accuracy on natural and adversarial examples, which show that our proposed method saves up to 60\% of the training time with comparable model test accuracy on MNIST, CIFAR-10 and CIFAR-100 datasets.
    摘要 深度学习模型在诸多领域取得了最先进的性能,但它们容易受到带有微小而精心构造扰动的输入,即对抗样本(AEs)的影响。在众多提升模型对 AEs 鲁棒性的策略中,基于投影梯度下降(PGD)的对抗训练是最有效的方法之一。遗憾的是,由于需要最大化损失函数来生成足够强的对抗样本,其计算开销极大,使得常规的 PGD 对抗训练在模型更大更复杂时往往不可行。本文提出可以用泰勒级数的部分和来近似对抗损失,并进一步近似对抗损失的梯度,从而提出一种新的高效对抗训练方法:基于梯度近似的对抗训练(GAAT),以降低构建鲁棒模型的成本。大量实验表明,这一效率提升几乎不以(或仅以极小的)自然样本与对抗样本上的精度损失为代价:在 MNIST、CIFAR-10 和 CIFAR-100 数据集上,我们的方法在保持相当的模型测试准确率的同时,最多可节省 60% 的训练时间。
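
下面的小例子演示"用一阶泰勒展开近似对抗损失"的核心代数:L(x+δ) ≈ L(x) + δ·∇L(x),在 L∞ 约束下线性化的最坏情况增量为 ε·‖∇L(x)‖₁,只需一次前向/反向;并与常规 PGD 内层最大化(多次前向/反向)作对比。这是对摘要思路的通用示意,并非论文 GAAT 的具体算法与实现细节。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(16, 1, 28, 28)
y = torch.randint(0, 10, (16,))
eps = 0.3

# --- First-order Taylor approximation of the adversarial loss (one forward/backward pass) ---
x_req = x.clone().requires_grad_(True)
clean_loss = F.cross_entropy(model(x_req), y)
grad = torch.autograd.grad(clean_loss, x_req)[0]
# Under ||delta||_inf <= eps, the linearized worst case is delta = eps * sign(grad),
# so the mean loss increases by roughly eps * sum(|grad|).
approx_adv_loss = clean_loss + eps * grad.abs().sum()

# --- Reference: the usual (more expensive) PGD inner maximization ---
delta = torch.zeros_like(x)
for _ in range(10):                              # 10 forward/backward passes vs. 1 above
    delta.requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    g = torch.autograd.grad(loss, delta)[0]
    delta = (delta.detach() + 0.07 * g.sign()).clamp(-eps, eps)
pgd_adv_loss = F.cross_entropy(model(x + delta), y)

print(float(approx_adv_loss), float(pgd_adv_loss))
```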

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

  • paper_url: http://arxiv.org/abs/2309.09456
  • repo_url: None
  • paper_authors: Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen
  • for: 提高开放词汇3D物体检测的性能,使用大规模大词汇3D物体数据集来扩充现有3D场景数据集的词汇。
  • methods: 提出 Object2Scene 方法,将不同来源的 3D 物体插入 3D 场景并生成对应的文本描述;同时引入统一 3D 检测与视觉定位的框架 L3Det,并提出跨域类别级对比学习方法,以缓解来自不同数据集的 3D 物体之间的域差。
  • results: 在现有开放词汇 3D 物体检测基准上超越了已有方法,并在新基准 OV-ScanNet-200 上(将所有罕见类别留作训练中未见的新类别)进一步验证了方法的有效性。
    Abstract Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set. It is extremely challenging because of the limited data and annotations (bounding boxes with class labels or text descriptions) of 3D scenes. Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics but require an extra alignment process between 2D images and 3D points, limiting the open-vocabulary ability of 3D detectors. Instead of leveraging 2D images, we propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts objects from different sources into 3D scenes to enrich the vocabulary of 3D scene datasets and generates text descriptions for the newly inserted objects. We further introduce a framework that unifies 3D detection and visual grounding, named L3Det, and propose a cross-domain category-level contrastive learning approach to mitigate the domain gap between 3D objects from different datasets. Extensive experiments on existing open-vocabulary 3D object detection benchmarks show that Object2Scene obtains superior performance over existing methods. We further verify the effectiveness of Object2Scene on a new benchmark OV-ScanNet-200, by holding out all rare categories as novel categories not seen during training.
    摘要 基于点云的开放词汇 3D 物体检测旨在检测训练集中没有真值标注的 3D 类别。由于 3D 场景的数据和标注(带类别标签或文本描述的边界框)十分有限,这一任务极具挑战性。先前的方法利用大规模、标注丰富的图像数据集作为 3D 与类别语义之间的桥梁,但需要在 2D 图像与 3D 点之间进行额外的对齐,限制了 3D 检测器的开放词汇能力。与其依赖 2D 图像,我们提出 Object2Scene:首个利用大规模、大词汇量的 3D 物体数据集来扩充现有 3D 场景数据集以实现开放词汇 3D 物体检测的方法。Object2Scene 将来自不同来源的物体插入 3D 场景中,以丰富 3D 场景数据集的词汇,并为新插入的物体生成文本描述。我们进一步提出统一 3D 检测与视觉定位的框架 L3Det,并提出跨域类别级对比学习方法,以缓解来自不同数据集的 3D 物体之间的域差。在现有开放词汇 3D 物体检测基准上的大量实验表明,Object2Scene 取得了优于已有方法的性能。我们还在新基准 OV-ScanNet-200 上(将所有罕见类别留作训练中未见的新类别)进一步验证了 Object2Scene 的有效性。

Scalable Label-efficient Footpath Network Generation Using Remote Sensing Data and Self-supervised Learning

  • paper_url: http://arxiv.org/abs/2309.09446
  • repo_url: https://github.com/wennyxy/footpathseg
  • paper_authors: Xinye Wanyan, Sachith Seneviratne, Kerry Nice, Jason Thompson, Marcus White, Nano Langenheim, Mark Stevenson
  • for: 该论文面向需要管理和分析城市步行道基础设施、却缺乏实时信息与资源的城市规划者和研究者。
  • methods: 论文提出一个基于遥感影像、利用机器学习模型自动生成步行道网络的流程;考虑到有监督方法需要大量训练数据,采用自监督特征表示学习来降低标注需求,再以预训练模型作为 U-Net 的编码器进行步行道分割。
  • results: 所提方法在遥感影像上得到验证,与人工采集的 GIS 图层表现出相当高的一致性,是一种低成本、可扩展的步行道网络生成方案。
    Abstract Footpath mapping, modeling, and analysis can provide important geospatial insights to many fields of study, including transport, health, environment and urban planning. The availability of robust Geographic Information System (GIS) layers can benefit the management of infrastructure inventories, especially at local government level with urban planners responsible for the deployment and maintenance of such infrastructure. However, many cities still lack real-time information on the location, connectivity, and width of footpaths, and/or employ costly and manual survey means to gather this information. This work designs and implements an automatic pipeline for generating footpath networks based on remote sensing images using machine learning models. The annotation of segmentation tasks, especially labeling remote sensing images with specialized requirements, is very expensive, so we aim to introduce a pipeline requiring less labeled data. Considering supervised methods require large amounts of training data, we use a self-supervised method for feature representation learning to reduce annotation requirements. Then the pre-trained model is used as the encoder of the U-Net for footpath segmentation. Based on the generated masks, the footpath polygons are extracted and converted to footpath networks which can be loaded and visualized by geographic information systems conveniently. Validation results indicate considerable consistency when compared to manually collected GIS layers. The footpath network generation pipeline proposed in this work is low-cost and extensible, and it can be applied where remote sensing images are available. Github: https://github.com/WennyXY/FootpathSeg.
    摘要 步行道的制图、建模与分析可以为交通、健康、环境和城市规划等诸多研究领域提供重要的地理空间洞见。健全的地理信息系统(GIS)图层有助于基础设施清单的管理,尤其是在负责此类设施部署与维护的地方政府及城市规划者层面。然而,许多城市仍缺乏关于步行道位置、连通性与宽度的实时信息,或需要依赖成本高昂的人工普查来获取这些信息。本工作设计并实现了一个基于遥感影像、利用机器学习模型自动生成步行道网络的流程。分割任务的标注(尤其是具有特定要求的遥感影像标注)十分昂贵,因此我们希望引入一个对标注数据需求更低的流程:考虑到有监督方法需要大量训练数据,我们采用自监督方法进行特征表示学习以降低标注需求,随后把预训练模型作为 U-Net 的编码器进行步行道分割。基于生成的掩码提取步行道多边形,并将其转换为可由地理信息系统方便加载和可视化的步行道网络。验证结果表明,其与人工采集的 GIS 图层具有相当高的一致性。本文提出的步行道网络生成流程成本低、可扩展,适用于任何可获得遥感影像的地区。代码见 Github:https://github.com/wennyxy/footpathseg。

TransTouch: Learning Transparent Objects Depth Sensing Through Sparse Touches

  • paper_url: http://arxiv.org/abs/2309.09427
  • repo_url: None
  • paper_authors: Liuyu Bian, Pengyang Shi, Weihang Chen, Jing Xu, Li Yi, Rui Chen
  • for: 提高真实世界中透明物体深度感知的精度
  • methods: 利用带触觉反馈的探测系统自动采集稀疏的触觉深度标签,用于微调立体匹配网络;并提出一个新的效用函数来评估各探测位置的收益,在固定触碰预算下优化探测位置,同时结合基于置信度的正则化以防止微调过程中的过拟合。
  • results: 构建了一个同时包含漫反射与透明物体的真实世界数据集;实验结果表明,所提方法能够显著提高真实世界中的深度感知精度,对透明物体尤其明显。
    Abstract Transparent objects are common in daily life. However, depth sensing for transparent objects remains a challenging problem. While learning-based methods can leverage shape priors to improve the sensing quality, the labor-intensive data collection in the real world and the sim-to-real domain gap restrict these methods' scalability. In this paper, we propose a method to finetune a stereo network with sparse depth labels automatically collected using a probing system with tactile feedback. We present a novel utility function to evaluate the benefit of touches. By approximating and optimizing the utility function, we can optimize the probing locations given a fixed touching budget to better improve the network's performance on real objects. We further combine tactile depth supervision with a confidence-based regularization to prevent over-fitting during finetuning. To evaluate the effectiveness of our method, we construct a real-world dataset including both diffuse and transparent objects. Experimental results on this dataset show that our method can significantly improve real-world depth sensing accuracy, especially for transparent objects.
    摘要 透明物体在日常生活中很常见,但其深度感知仍是一个具有挑战性的问题。基于学习的方法可以利用形状先验提升感知质量,然而真实世界中费时费力的数据采集以及仿真到真实的域差限制了这些方法的可扩展性。本文提出一种方法:利用带触觉反馈的探测系统自动采集稀疏深度标签,用于微调立体匹配网络。我们提出一个新的效用函数来评估触碰的收益;通过对该效用函数进行近似与优化,可以在固定的触碰预算下优化探测位置,从而更好地提升网络在真实物体上的性能。我们进一步把触觉深度监督与基于置信度的正则化相结合,以防止微调过程中的过拟合。为评估方法的有效性,我们构建了一个同时包含漫反射与透明物体的真实世界数据集。实验结果表明,我们的方法能够显著提高真实世界中的深度感知精度,对透明物体尤为明显。
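
下面示意"在固定触碰预算下,按近似效用贪心选择探测位置"的基本框架:效用在此用预测不确定度(视差方差)近似,真实论文中的效用函数、优化方式与触觉标定要复杂得多,以下函数名与指标均为假设。

```python
import numpy as np

def select_probe_locations(utility_map, budget=10, min_dist=20):
    """Greedy selection: repeatedly pick the highest-utility pixel, then suppress its neighborhood."""
    u = utility_map.copy()
    H, W = u.shape
    picks = []
    for _ in range(budget):
        idx = int(u.argmax())
        r, c = divmod(idx, W)
        picks.append((r, c))
        r0, r1 = max(0, r - min_dist), min(H, r + min_dist)
        c0, c1 = max(0, c - min_dist), min(W, c + min_dist)
        u[r0:r1, c0:c1] = -np.inf        # avoid clustering touches in one spot
    return picks

# Stand-in "utility": per-pixel disparity uncertainty predicted by the stereo network (assumption).
rng = np.random.default_rng(1)
uncertainty = rng.random((240, 320))
uncertainty[100:160, 150:250] += 2.0     # e.g., a transparent object region with unreliable depth
touches = select_probe_locations(uncertainty, budget=5)
print(touches)                            # sparse tactile depth labels are then collected here
```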

Cross-attention-based saliency inference for predicting cancer metastasis on whole slide images

  • paper_url: http://arxiv.org/abs/2309.09412
  • repo_url: None
  • paper_authors: Ziyu Su, Mostafa Rezapour, Usama Sajjad, Shuo Niu, Metin Nafi Gurcan, Muhammad Khalid Khan Niazi
  • for: This paper proposes a new method for automatic tumor detection on whole slide images (WSI) using cross-attention-based salient instance inference MIL (CASiiMIL).
  • methods: The proposed method uses a novel saliency-informed attention mechanism and negative representation learning algorithm to identify breast cancer lymph node micro-metastasis on WSIs without the need for any annotations.
  • results: The proposed model outperforms the state-of-the-art MIL methods on two popular tumor metastasis detection datasets and demonstrates great cross-center generalizability. It also exhibits excellent accuracy in classifying WSIs with small tumor lesions and has excellent interpretability attributed to the saliency-informed attention weights.
    Abstract Although multiple instance learning (MIL) methods are widely used for automatic tumor detection on whole slide images (WSI), they suffer from the extreme class imbalance within the small tumor WSIs. This occurs when the tumor comprises only a few isolated cells. For early detection, it is of utmost importance that MIL algorithms can identify small tumors, even when they are less than 1% of the size of the WSI. Existing studies have attempted to address this issue using attention-based architectures and instance selection-based methodologies, but have not yielded significant improvements. This paper proposes cross-attention-based salient instance inference MIL (CASiiMIL), which involves a novel saliency-informed attention mechanism, to identify breast cancer lymph node micro-metastasis on WSIs without the need for any annotations. Apart from this new attention mechanism, we introduce a negative representation learning algorithm to facilitate the learning of saliency-informed attention weights for improved sensitivity on tumor WSIs. The proposed model outperforms the state-of-the-art MIL methods on two popular tumor metastasis detection datasets, and demonstrates great cross-center generalizability. In addition, it exhibits excellent accuracy in classifying WSIs with small tumor lesions. Moreover, we show that the proposed model has excellent interpretability attributed to the saliency-informed attention weights. We strongly believe that the proposed method will pave the way for training algorithms for early tumor detection on large datasets where acquiring fine-grained annotations is practically impossible.
    摘要 尽管多示例学习(MIL)方法已被广泛用于全切片图像(WSI)上的自动肿瘤检测,但当肿瘤仅由少量孤立细胞构成时,小肿瘤 WSI 内部极端的类别不平衡会严重影响其性能。对于早期检测而言,MIL 算法能否识别面积不足 WSI 1% 的小肿瘤至关重要。已有研究尝试用基于注意力的架构和基于实例选择的方法来解决这一问题,但改进有限。本文提出基于交叉注意力的显著实例推断 MIL(CASiiMIL),引入一种新颖的显著性感知注意力机制,在无需任何标注的情况下识别 WSI 上的乳腺癌淋巴结微转移;此外,我们提出一种负表示学习算法,帮助学习显著性感知的注意力权重,以提升对肿瘤 WSI 的敏感性。所提模型在两个常用的肿瘤转移检测数据集上优于现有最先进的 MIL 方法,并展现出良好的跨中心泛化能力;同时,它对含有小肿瘤病灶的 WSI 也具有出色的分类准确率。此外,得益于显著性感知的注意力权重,该模型具有良好的可解释性。我们相信,所提方法将为在难以获得细粒度标注的大规模数据集上训练早期肿瘤检测算法铺平道路。
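
下面是注意力式多示例学习(attention-based MIL)打分的最小示意:对一张 WSI 的全部 patch 特征计算注意力权重,加权汇聚成切片级表示后再分类;高注意力的 patch 可用作可解释性依据。其中"显著性感知"的注意力设计与负表示学习并未实现,网络结构与维度均为假设。

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Slide-level prediction from a bag of patch features via learned attention."""
    def __init__(self, in_dim=1024, hid=256, n_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, hid), nn.Tanh(), nn.Linear(hid, 1))
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):                       # bag: (N_patches, in_dim) for one WSI
        a = torch.softmax(self.attn(bag), dim=0)   # (N, 1) attention over instances
        slide_feat = (a * bag).sum(dim=0)          # attention-weighted pooling
        return self.classifier(slide_feat), a.squeeze(-1)

mil = AttentionMIL()
bag = torch.randn(5000, 1024)                     # e.g., 5000 patch embeddings from one slide
logits, attn = mil(bag)
# High-attention patches serve as an interpretability map over the slide.
print(logits.shape, attn.topk(5).indices)
```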

BRONCO: Automated modelling of the bronchovascular bundle using the Computed Tomography Images

  • paper_url: http://arxiv.org/abs/2309.09410
  • repo_url: None
  • paper_authors: Wojciech Prażuch, Marek Socha, Anna Mrukwa, Aleksandra Suwalska, Agata Durawa, Malgorzata Jelitto-Górska, Katarzyna Dziadziuszko, Edyta Szurowska, Pawel Bożek, Michal Marczyk, Witold Rzyman, Joanna Polanska
  • for: 这篇论文主要研究肺实质内支气管血管束的分割,即同时分割肺实质中的支气管与血管网络,这是许多肺部疾病分析与规划的关键步骤。
  • methods: 本研究基于 CT 图像构建了支气管血管束的分割流程,包含支气管树建模与血管建模两个模块;核心步骤包括用高斯混合模型(GMM)确定初始轮廓、骨架化,以及对所构建图的层级分析。
  • results: 研究表明,该方法在不同病理、不同层厚、不同设备采集的低剂量与标准剂量 CT 数据上均能得到正确的分割结果,对 CT 序列的来源与参数具有不变性,适用于健康人群、肺结节患者和肺气肿患者。
    Abstract Segmentation of the bronchovascular bundle within the lung parenchyma is a key step for the proper analysis and planning of many pulmonary diseases. It might also be considered the preprocessing step when the goal is to segment the nodules from the lung parenchyma. We propose a segmentation pipeline for the bronchovascular bundle based on the Computed Tomography images, returning either binary or labelled masks of vessels and bronchi situated in the lung parenchyma. The method consists of two modules, modeling of the bronchial tree and vessels. The core revolves around a similar pipeline, the determination of the initial perimeter by the GMM method, skeletonization, and hierarchical analysis of the created graph. We tested our method on both low-dose CT and standard-dose CT, with various pathologies, reconstructed with various slice thicknesses, and acquired from various machines. We conclude that the method is invariant with respect to the origin and parameters of the CT series. Our pipeline is best suited for studies with healthy patients, patients with lung nodules, and patients with emphysema.
    摘要 肺实质内支气管血管束的分割是许多肺部疾病分析与规划的关键步骤;当目标是从肺实质中分割结节时,它也可以被视为预处理步骤。我们提出一个基于 CT 图像的支气管血管束分割流程,可输出肺实质内血管与支气管的二值或带标签掩码。该方法由支气管树建模与血管建模两个模块组成,二者围绕相似的核心流程:用 GMM 方法确定初始轮廓、骨架化,以及对所构建图的层级分析。我们在低剂量 CT 和标准剂量 CT 上测试了该方法,数据涵盖多种病理、多种重建层厚,并来自多种设备。我们的结论是:该方法对 CT 序列的来源与参数具有不变性。该流程最适用于健康人群、肺结节患者以及肺气肿患者的研究。
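
下面用 scikit-learn 的高斯混合模型与 scikit-image 的骨架化,演示该流程中两个核心步骤的基本思想:先用 GMM 对 CT 灰度做双成分建模得到初始(支气管血管)掩码,再对掩码骨架化,为后续构图与层级分析做准备。合成数据、成分数与阈值方式均为示意性假设,并非论文流程的完整复现。

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from skimage.morphology import skeletonize

# Synthetic 2D "CT slice": a bright tubular structure on a darker parenchyma background.
rng = np.random.default_rng(0)
slice_hu = rng.normal(-800, 40, size=(256, 256))       # lung-parenchyma-like intensities (HU-like)
slice_hu[120:136, 20:240] += 700                        # a bright vessel-like band

# Step 1: two-component GMM on intensities -> initial foreground mask.
gmm = GaussianMixture(n_components=2, random_state=0).fit(slice_hu.reshape(-1, 1))
labels = gmm.predict(slice_hu.reshape(-1, 1)).reshape(slice_hu.shape)
fg_label = int(np.argmax(gmm.means_.ravel()))           # brighter component = bronchovascular bundle
mask = labels == fg_label

# Step 2: skeletonize the mask; skeleton pixels become graph nodes for hierarchical analysis.
skeleton = skeletonize(mask)
print(mask.sum(), skeleton.sum())                       # foreground pixels vs. centerline pixels
```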