cs.CV - 2023-07-29

Enhancing Object Detection in Ancient Documents with Synthetic Data Generation and Transformer-Based Models

  • paper_url: http://arxiv.org/abs/2307.16005
  • repo_url: None
  • paper_authors: Zahra Ziran, Francesco Leotta, Massimo Mecella
  • for: Improve object detection precision in ancient documents and reduce false positives.
  • methods: Create synthetic datasets through computational mediation and integrate visual feature extraction into the object detection process, associating objects with their component parts via a visual feature map.
  • results: Experiments show improved object detection precision, supporting in-depth analysis and understanding of historical documents in the field of Paleography.
    Abstract The study of ancient documents provides a glimpse into our past. However, the low image quality and intricate details commonly found in these documents present significant challenges for accurate object detection. The objective of this research is to enhance object detection in ancient documents by reducing false positives and improving precision. To achieve this, we propose a method that involves the creation of synthetic datasets through computational mediation, along with the integration of visual feature extraction into the object detection process. Our approach includes associating objects with their component parts and introducing a visual feature map to enable the model to discern between different symbols and document elements. Through our experiments, we demonstrate that improved object detection has a profound impact on the field of Paleography, enabling in-depth analysis and fostering a greater understanding of these valuable historical artifacts.

Automated Hit-frame Detection for Badminton Match Analysis

  • paper_url: http://arxiv.org/abs/2307.16000
  • repo_url: https://github.com/arthur900530/Automated-Hit-frame-Detection-for-Badminton-Match-Analysis
  • paper_authors: Yu-Hang Chien, Fang Yu
  • for: Provide higher-level performance analysis for badminton, helping coaches and players systematically evaluate their performance with automated tools.
  • methods: Use modern deep learning to automatically detect hit-frames in badminton match videos, with automated steps including rally-wise video trimming, player and court keypoint detection, shuttlecock flying direction prediction, and hit-frame detection.
  • results: Achieved 99% accuracy on shot angle recognition for video trimming and over 92% accuracy when applying player keypoint sequences to shuttlecock flying direction prediction, and reported evaluation results for rally-wise video trimming and hit-frame detection.
    Abstract Sports professionals constantly under pressure to perform at the highest level can benefit from sports analysis, which allows coaches and players to reduce manual efforts and systematically evaluate their performance using automated tools. This research aims to advance sports analysis in badminton, systematically detecting hit-frames automatically from match videos using modern deep learning techniques. The data included in hit-frames can subsequently be utilized to synthesize players' strokes and on-court movement, as well as for other downstream applications such as analyzing training tasks and competition strategy. The proposed approach in this study comprises several automated procedures like rally-wise video trimming, player and court keypoints detection, shuttlecock flying direction prediction, and hit-frame detection. In the study, we achieved 99% accuracy on shot angle recognition for video trimming, over 92% accuracy for applying player keypoints sequences on shuttlecock flying direction prediction, and reported the evaluation results of rally-wise video trimming and hit-frame detection.
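To make the pipeline concrete, below is a minimal sketch of the shuttlecock flying direction step, assuming the predictor is a small sequence classifier over per-frame player keypoints; the layer sizes, class count, and GRU choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DirectionClassifier(nn.Module):
    """Toy sequence model: per-frame player keypoints -> shuttlecock direction class."""
    def __init__(self, num_keypoints=17, num_players=2, hidden=128, num_directions=2):
        super().__init__()
        in_dim = num_players * num_keypoints * 2   # (x, y) per keypoint, both players
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_directions)

    def forward(self, kpt_seq):            # kpt_seq: (B, T, in_dim)
        _, h = self.gru(kpt_seq)           # h: (num_layers, B, hidden)
        return self.head(h[-1])            # logits: (B, num_directions)

# usage: a 30-frame window of keypoints for both players
model = DirectionClassifier()
logits = model(torch.randn(4, 30, 2 * 17 * 2))   # -> shape (4, 2)
```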

Separate Scene Text Detector for Unseen Scripts is Not All You Need

  • paper_url: http://arxiv.org/abs/2307.15991
  • repo_url: None
  • paper_authors: Prateek Keserwani, Taveena Lotey, Rohit Keshari, Partha Pratim Roy
  • for: Detect scene text across multiple scripts in the wild, including scripts unseen during training.
  • methods: Use vector embeddings to map the stroke information of text to its script category.
  • results: Under a zero-shot setting, the proposed method detects unseen scripts in natural images.
    Abstract Text detection in the wild is a well-known problem that becomes more challenging while handling multiple scripts. In the last decade, some scripts have gained the attention of the research community and achieved good detection performance. However, many scripts are low-resourced for training deep learning-based scene text detectors. It raises a critical question: Is there a need for separate training for new scripts? It is an unexplored query in the field of scene text detection. This paper acknowledges this problem and proposes a solution to detect scripts not present during training. In this work, the analysis has been performed to understand cross-script text detection, i.e., trained on one and tested on another. We found that the identical nature of text annotation (word-level/line-level) is crucial for better cross-script text detection. The different nature of text annotation between scripts degrades cross-script text detection performance. Additionally, for unseen script detection, the proposed solution utilizes vector embedding to map the stroke information of text corresponding to the script category. The proposed method is validated with a well-known multi-lingual scene text dataset under a zero-shot setting. The results show the potential of the proposed method for unseen script detection in natural images.
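A minimal sketch of the zero-shot idea described above: a detected text region's stroke feature is compared against per-script embedding vectors and assigned to the nearest one. The dimensions, the cosine-similarity choice, and the prototype matrix are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def assign_script(stroke_feat, script_embeddings):
    """Pick the script category whose embedding is most similar to the stroke feature."""
    stroke_feat = F.normalize(stroke_feat, dim=-1)              # (D,)
    script_embeddings = F.normalize(script_embeddings, dim=-1)  # (num_scripts, D)
    sims = script_embeddings @ stroke_feat                      # cosine similarities
    return int(sims.argmax()), sims

# usage with made-up dimensions: 5 known script prototypes, 64-d stroke feature
script_protos = torch.randn(5, 64)
idx, sims = assign_script(torch.randn(64), script_protos)
```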

RGB-D-Fusion: Image Conditioned Depth Diffusion of Humanoid Subjects

  • paper_url: http://arxiv.org/abs/2307.15988
  • repo_url: None
  • paper_authors: Sascha Kirch, Valeria Olyunina, Jan Ondřej, Rafael Pagés, Sergio Martin, Clara Pérez-Molina
  • for: Generating high-resolution depth maps from low-resolution monocular RGB images of humanoid subjects.
  • methods: Using a multi-modal conditional denoising diffusion probabilistic model: first generating a low-resolution depth map, then upsampling it with a second denoising diffusion probabilistic model conditioned on a low-resolution RGB-D image. Additionally, introducing a novel augmentation technique, depth noise augmentation, to increase the robustness of the super-resolution model.
  • results: Achieving high-quality super-resolution of depth maps and improving the robustness of the model.
    Abstract We present RGB-D-Fusion, a multi-modal conditional denoising diffusion probabilistic model to generate high resolution depth maps from low-resolution monocular RGB images of humanoid subjects. RGB-D-Fusion first generates a low-resolution depth map using an image conditioned denoising diffusion probabilistic model and then upsamples the depth map using a second denoising diffusion probabilistic model conditioned on a low-resolution RGB-D image. We further introduce a novel augmentation technique, depth noise augmentation, to increase the robustness of our super-resolution model.
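The depth noise augmentation is the most self-contained idea here; below is a minimal sketch of how such a perturbation of the low-resolution conditioning depth could look. The Gaussian-plus-dropout recipe and its parameters are assumptions for illustration, not the paper's exact augmentation.

```python
import torch

def depth_noise_augmentation(depth_lr, sigma=0.05, dropout_p=0.02):
    """Perturb the low-res depth map that conditions the super-resolution model,
    so it learns to tolerate noisy first-stage predictions (illustrative recipe)."""
    noisy = depth_lr + sigma * torch.randn_like(depth_lr)
    keep = (torch.rand_like(depth_lr) > dropout_p).float()   # random pixel dropout
    return noisy * keep

depth_lr = torch.rand(1, 1, 64, 64)          # (B, 1, H, W) low-res depth
depth_aug = depth_noise_augmentation(depth_lr)
```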

Class-Specific Distribution Alignment for Semi-Supervised Medical Image Classification

  • paper_url: http://arxiv.org/abs/2307.15987
  • repo_url: None
  • paper_authors: Zhongzheng Huang, Jiawei Wu, Tao Wang, Zuoyong Li, Anastasia Ioannou
  • for: Address medical image classification, where data annotation is time-consuming and the class distribution is imbalanced due to the relative scarcity of diseases.
  • methods: Propose Class-Specific Distribution Alignment (CSDA), a semi-supervised self-training framework suited to highly imbalanced datasets. Distribution alignment is viewed as a change of basis in the vector space spanned by marginal predictions, and CSDA captures class-dependent marginal predictions on both labeled and unlabeled data to avoid bias toward majority classes. A Variable Condition Queue (VCQ) module maintains a proportionately balanced number of unlabeled samples per class.
  • results: On three public datasets (HAM10000, CheXpert, and Kvasir), the method provides competitive performance on semi-supervised skin disease, thoracic disease, and endoscopic image classification tasks.
    Abstract Despite the success of deep neural networks in medical image classification, the problem remains challenging as data annotation is time-consuming, and the class distribution is imbalanced due to the relative scarcity of diseases. To address this problem, we propose Class-Specific Distribution Alignment (CSDA), a semi-supervised learning framework based on self-training that is suitable to learn from highly imbalanced datasets. Specifically, we first provide a new perspective to distribution alignment by considering the process as a change of basis in the vector space spanned by marginal predictions, and then derive CSDA to capture class-dependent marginal predictions on both labeled and unlabeled data, in order to avoid the bias towards majority classes. Furthermore, we propose a Variable Condition Queue (VCQ) module to maintain a proportionately balanced number of unlabeled samples for each class. Experiments on three public datasets HAM10000, CheXpert and Kvasir show that our method provides competitive performance on semi-supervised skin disease, thoracic disease, and endoscopic image classification tasks.
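A minimal sketch of the Variable Condition Queue idea: keep a bounded, per-class pool of unlabeled samples indexed by their pseudo-labels so that no class dominates. The capacity handling and interface are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

class VariableConditionQueue:
    """Per-class queue keeping a roughly balanced pool of unlabeled samples."""
    def __init__(self, num_classes, capacity_per_class=256):
        self.queues = [deque(maxlen=capacity_per_class) for _ in range(num_classes)]

    def push(self, feature, pseudo_label):
        # oldest samples are evicted automatically once a class queue is full
        self.queues[pseudo_label].append(feature)

    def sample_counts(self):
        return [len(q) for q in self.queues]

vcq = VariableConditionQueue(num_classes=7)
vcq.push(feature=[0.1, 0.2], pseudo_label=3)
print(vcq.sample_counts())
```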

GaitASMS: Gait Recognition by Adaptive Structured Spatial Representation and Multi-Scale Temporal Aggregation

  • paper_url: http://arxiv.org/abs/2307.15981
  • repo_url: None
  • paper_authors: Yan Sun, Hu Long, Xueling Feng, Mark Nixon
  • for: Propose a new gait recognition framework that improves recognition accuracy and robustness.
  • methods: Use an Adaptive Structured Representation Extraction module (ASRE) and a Multi-Scale Temporal Aggregation module (MSTA) to extract adaptive structured spatial representations and multi-scale temporal information, respectively, and introduce a new data augmentation, random mask, to enrich the sample space.
  • results: Achieves competitive performance on two datasets, especially in complex scenes (BG and CL). On CASIA-B, GaitASMS reaches an average accuracy of 93.5% and outperforms the baseline on rank-1 accuracy by 3.4% and 6.3% in BG and CL, respectively.
    Abstract Gait recognition is one of the most promising video-based biometric technologies. The edge of silhouettes and motion are the most informative feature and previous studies have explored them separately and achieved notable results. However, due to occlusions and variations in viewing angles, their gait recognition performance is often affected by the predefined spatial segmentation strategy. Moreover, traditional temporal pooling usually neglects distinctive temporal information in gait. To address the aforementioned issues, we propose a novel gait recognition framework, denoted as GaitASMS, which can effectively extract the adaptive structured spatial representations and naturally aggregate the multi-scale temporal information. The Adaptive Structured Representation Extraction Module (ASRE) separates the edge of silhouettes by using the adaptive edge mask and maximizes the representation in semantic latent space. Moreover, the Multi-Scale Temporal Aggregation Module (MSTA) achieves effective modeling of long-short-range temporal information by temporally aggregated structure. Furthermore, we propose a new data augmentation, denoted random mask, to enrich the sample space of long-term occlusion and enhance the generalization of the model. Extensive experiments conducted on two datasets demonstrate the competitive advantage of proposed method, especially in complex scenes, i.e. BG and CL. On the CASIA-B dataset, GaitASMS achieves the average accuracy of 93.5\% and outperforms the baseline on rank-1 accuracies by 3.4\% and 6.3\%, respectively, in BG and CL. The ablation experiments demonstrate the effectiveness of ASRE and MSTA.

Fingerprints of Generative Models in the Frequency Domain

  • paper_url: http://arxiv.org/abs/2307.15977
  • repo_url: None
  • paper_authors: Tianyun Yang, Juan Cao, Danding Wang, Chang Xu
  • for: Analyze how the unique fingerprints of CNN-based generative models are formed and how they appear in generated images.
  • methods: Interpret network components in the frequency domain to derive the sources of the frequency distribution and grid-like pattern discrepancies exhibited on the spectrum.
  • results: Low-cost synthetic models can generate images that emulate the frequency patterns of real generative models; a fingerprint extractor pre-trained on this synthetic data transfers well to verifying, identifying, and analyzing the relationships of real CNN-based generative models.
    Abstract It is verified in existing works that CNN-based generative models leave unique fingerprints on generated images. There is a lack of analysis about how they are formed in generative models. Interpreting network components in the frequency domain, we derive sources for frequency distribution and grid-like pattern discrepancies exhibited on the spectrum. These insights are leveraged to develop low-cost synthetic models, which generate images emulating the frequency patterns observed in real generative models. The resulting fingerprint extractor pre-trained on synthetic data shows superior transferability in verifying, identifying, and analyzing the relationship of real CNN-based generative models such as GAN, VAE, Flow, and diffusion.
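A minimal sketch of the kind of frequency-domain analysis described above: averaging the centered log-magnitude spectrum over generated images, where grid-like generator artifacts typically show up as off-center peaks. The preprocessing choices are illustrative assumptions.

```python
import numpy as np

def average_log_spectrum(images):
    """Average the centered log-magnitude spectrum over (N, H, W) grayscale images."""
    spectra = []
    for img in images:
        f = np.fft.fftshift(np.fft.fft2(img - img.mean()))   # center the DC component
        spectra.append(np.log1p(np.abs(f)))
    return np.mean(spectra, axis=0)

fake = np.random.rand(8, 128, 128)   # stand-in for a batch of generated images
fingerprint = average_log_spectrum(fake)
```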

XMem++: Production-level Video Segmentation From Few Annotated Frames

  • paper_url: http://arxiv.org/abs/2307.15958
  • repo_url: https://github.com/max810/XMem2
  • paper_authors: Maksym Bekuzarov, Ariana Bermudez, Joon-Young Lee, Hao Li
  • for: Improve existing memory-based models for user-guided video segmentation in terms of accuracy and efficiency.
  • methods: A semi-supervised video object segmentation model with a permanent memory module that effectively handles multiple user-selected frames with varying appearances of the same object or region.
  • results: Produces highly consistent results on multi-class and multi-frame segmentation tasks without retraining, while requiring significantly fewer frame annotations.
    Abstract Despite advancements in user-guided video segmentation, extracting complex objects consistently for highly complex scenes is still a labor-intensive task, especially for production. It is not uncommon that a majority of frames need to be annotated. We introduce a novel semi-supervised video object segmentation (SSVOS) model, XMem++, that improves existing memory-based models, with a permanent memory module. Most existing methods focus on single frame annotations, while our approach can effectively handle multiple user-selected frames with varying appearances of the same object or region. Our method can extract highly consistent results while keeping the required number of frame annotations low. We further introduce an iterative and attention-based frame suggestion mechanism, which computes the next best frame for annotation. Our method is real-time and does not require retraining after each user input. We also introduce a new dataset, PUMaVOS, which covers new challenging use cases not found in previous benchmarks. We demonstrate SOTA performance on challenging (partial and multi-class) segmentation scenarios as well as long videos, while ensuring significantly fewer frame annotations than any existing method. Project page: https://max810.github.io/xmem2-project-page/

CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2307.15942
  • repo_url: https://github.com/xiarho/cmda
  • paper_authors: Ruihao Xia, Chaoqiang Zhao, Meng Zheng, Ziyan Wu, Qiyu Sun, Yang Tang
  • for: Improve the accuracy and effectiveness of nighttime semantic segmentation by leveraging multi-modality information (images and events) for training.
  • methods: Propose an unsupervised Cross-Modality Domain Adaptation (CMDA) framework that bridges different modalities and domains through an Image Motion-Extractor and an Image Content-Extractor.
  • results: Extensive experiments on a public image dataset and the proposed image-event dataset demonstrate effectiveness; the code, models, and dataset are open-sourced.
    Abstract Most nighttime semantic segmentation studies are based on domain adaptation approaches and image input. However, limited by the low dynamic range of conventional cameras, images fail to capture structural details and boundary information in low-light conditions. Event cameras, as a new form of vision sensors, are complementary to conventional cameras with their high dynamic range. To this end, we propose a novel unsupervised Cross-Modality Domain Adaptation (CMDA) framework to leverage multi-modality (Images and Events) information for nighttime semantic segmentation, with only labels on daytime images. In CMDA, we design the Image Motion-Extractor to extract motion information and the Image Content-Extractor to extract content information from images, in order to bridge the gap between different modalities (Images to Events) and domains (Day to Night). Besides, we introduce the first image-event nighttime semantic segmentation dataset. Extensive experiments on both the public image dataset and the proposed image-event dataset demonstrate the effectiveness of our proposed approach. We open-source our code, models, and dataset at https://github.com/XiaRho/CMDA.

Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images

  • paper_url: http://arxiv.org/abs/2307.15904
  • repo_url: None
  • paper_authors: Aayush Dhakal, Adeel Ahmad, Subash Khanal, Srikumar Sastry, Nathan Jacobs
  • for: Develop a weakly supervised approach for creating maps from free-form textual descriptions (captions).
  • methods: Train a contrastive learning framework called Sat2Cap on a new large-scale dataset of paired overhead and ground-level images.
  • results: The model successfully captures fine-grained concepts and effectively adapts to temporal variations; the code, dataset, and models will be released publicly.
    Abstract We propose a novel weakly supervised approach for creating maps using free-form textual descriptions (or captions). We refer to this new line of work of creating textual maps as zero-shot mapping. Prior works have approached mapping tasks by developing models that predict over a fixed set of attributes using overhead imagery. However, these models are very restrictive as they can only solve highly specific tasks for which they were trained. Mapping text, on the other hand, allows us to solve a large variety of mapping problems with minimal restrictions. To achieve this, we train a contrastive learning framework called Sat2Cap on a new large-scale dataset of paired overhead and ground-level images. For a given location, our model predicts the expected CLIP embedding of the ground-level scenery. Sat2Cap is also conditioned on temporal information, enabling it to learn dynamic concepts that vary over time. Our experimental results demonstrate that our models successfully capture fine-grained concepts and effectively adapt to temporal variations. Our approach does not require any text-labeled data making the training easily scalable. The code, dataset, and models will be made publicly available.
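A minimal sketch of the contrastive objective implied above, assuming the overhead-image encoder is trained to match frozen CLIP embeddings of the paired ground-level images with a symmetric InfoNCE loss; the temperature and the symmetric formulation are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(pred_embed, target_embed, temperature=0.07):
    """Symmetric InfoNCE between overhead-image embeddings and paired CLIP targets;
    matching pairs sit on the diagonal of the similarity matrix."""
    pred = F.normalize(pred_embed, dim=-1)
    target = F.normalize(target_embed, dim=-1)
    logits = pred @ target.t() / temperature          # (B, B)
    labels = torch.arange(pred.size(0), device=pred.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = clip_style_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512))
```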

Effective Whole-body Pose Estimation with Two-stages Distillation

  • paper_url: http://arxiv.org/abs/2307.15880
  • repo_url: https://github.com/idea-research/dwpose
  • paper_authors: Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li
  • for: Improve the effectiveness and efficiency of whole-body pose estimators.
  • methods: Propose a two-stage pose distillation method, DWPose, that improves both the accuracy and efficiency of pose estimators.
  • results: Achieves new state-of-the-art performance on the COCO-WholeBody test set, boosting the whole-body AP of RTMPose-l from 64.8% to 66.5%, even surpassing the RTMPose-x teacher at 65.3% AP.
    Abstract Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image. This task is challenging due to multi-scale body parts, fine-grained localization for low-resolution regions, and data scarcity. Meanwhile, applying a highly efficient and accurate pose estimator to widely human-centric understanding and generation tasks is urgent. In this work, we present a two-stage pose \textbf{D}istillation for \textbf{W}hole-body \textbf{P}ose estimators, named \textbf{DWPose}, to improve their effectiveness and efficiency. The first-stage distillation designs a weight-decay strategy while utilizing a teacher's intermediate feature and final logits with both visible and invisible keypoints to supervise the student from scratch. The second stage distills the student model itself to further improve performance. Different from the previous self-knowledge distillation, this stage finetunes the student's head with only 20% training time as a plug-and-play training strategy. For data limitations, we explore the UBody dataset that contains diverse facial expressions and hand gestures for real-life applications. Comprehensive experiments show the superiority of our proposed simple yet effective methods. We achieve new state-of-the-art performance on COCO-WholeBody, significantly boosting the whole-body AP of RTMPose-l from 64.8% to 66.5%, even surpassing RTMPose-x teacher with 65.3% AP. We release a series of models with different sizes, from tiny to large, for satisfying various downstream tasks. Our codes and models are available at https://github.com/IDEA-Research/DWPose.
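A minimal sketch of the first-stage distillation signal described above: the student is supervised by the teacher's intermediate features and final logits under a decaying weight. The linear decay schedule and the MSE losses are illustrative assumptions, not the exact DWPose losses.

```python
import torch
import torch.nn.functional as F

def first_stage_distillation_loss(student_feat, teacher_feat,
                                  student_logits, teacher_logits,
                                  step, total_steps, base_weight=1.0):
    """Feature + logit distillation with a weight that decays over training."""
    w = base_weight * (1.0 - step / total_steps)       # weight-decay schedule
    feat_loss = F.mse_loss(student_feat, teacher_feat)
    logit_loss = F.mse_loss(student_logits, teacher_logits)
    return w * (feat_loss + logit_loss)

loss = first_stage_distillation_loss(torch.randn(2, 256), torch.randn(2, 256),
                                     torch.randn(2, 133, 2), torch.randn(2, 133, 2),
                                     step=1000, total_steps=10000)
```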

Cross-dimensional transfer learning in medical image segmentation with deep learning

  • paper_url: http://arxiv.org/abs/2307.15872
  • repo_url: https://github.com/hic-messaoudi/cross-dimensional-transfer-learning-in-medical-image-segmentation-with-deep-learning
  • paper_authors: Hicham Messaoudi, Ahror Belaid, Douraied Ben Salem, Pierre-Henri Conze
  • for: This paper is focused on improving the efficiency of medical image segmentation using convolutional neural networks (CNNs) and transfer learning.
  • methods: The authors propose two novel architectures based on weight transfer and dimensional transfer to adapt a pre-trained 2D CNN to 2D, 3D uni- and multi-modal medical image segmentation tasks.
  • results: The proposed methods were tested on several benchmarks and achieved promising results, ranking first on the CAMUS challenge and outperforming other 2D-based methods on the CHAOS challenge. The 3D network also achieved good results on the BraTS 2022 competition, with an average Dice score of 91.69% for the whole tumor.
    Abstract Over the last decade, convolutional neural networks have emerged and advanced the state-of-the-art in various image analysis and computer vision applications. The performance of 2D image classification networks is constantly improving and being trained on databases made of millions of natural images. However, progress in medical image analysis has been hindered by limited annotated data and acquisition constraints. These limitations are even more pronounced given the volumetry of medical imaging data. In this paper, we introduce an efficient way to transfer the efficiency of a 2D classification network trained on natural images to 2D, 3D uni- and multi-modal medical image segmentation applications. In this direction, we designed novel architectures based on two key principles: weight transfer by embedding a 2D pre-trained encoder into a higher dimensional U-Net, and dimensional transfer by expanding a 2D segmentation network into a higher dimension one. The proposed networks were tested on benchmarks comprising different modalities: MR, CT, and ultrasound images. Our 2D network ranked first on the CAMUS challenge dedicated to echo-cardiographic data segmentation and surpassed the state-of-the-art. Regarding 2D/3D MR and CT abdominal images from the CHAOS challenge, our approach largely outperformed the other 2D-based methods described in the challenge paper on Dice, RAVD, ASSD, and MSSD scores and ranked third on the online evaluation platform. Our 3D network applied to the BraTS 2022 competition also achieved promising results, reaching an average Dice score of 91.69% (91.22%) for the whole tumor, 83.23% (84.77%) for the tumor core, and 81.75% (83.88%) for enhanced tumor using the approach based on weight (dimensional) transfer. Experimental and qualitative results illustrate the effectiveness of our methods for multi-dimensional medical image segmentation.

Catching Elusive Depression via Facial Micro-Expression Recognition

  • paper_url: http://arxiv.org/abs/2307.15862
  • repo_url: None
  • paper_authors: Xiaohui Chen, Tie Luo
  • for: Diagnose concealed depression, where patients hide their genuine emotions behind exterior optimism.
  • methods: Use facial micro-expressions (FMEs) to detect and recognize underlying true emotions.
  • results: A facial landmark-based Region-of-Interest (ROI) approach addresses the extremely low intensity and subtle nature of FMEs, and a low-cost, privacy-preserving solution enables self-diagnosis with portable mobile devices in a personal setting (e.g., at home).
    Abstract Depression is a common mental health disorder that can cause consequential symptoms with continuously depressed mood that leads to emotional distress. One category of depression is Concealed Depression, where patients intentionally or unintentionally hide their genuine emotions through exterior optimism, thereby complicating and delaying diagnosis and treatment and leading to unexpected suicides. In this paper, we propose to diagnose concealed depression by using facial micro-expressions (FMEs) to detect and recognize underlying true emotions. However, the extremely low intensity and subtle nature of FMEs make their recognition a tough task. We propose a facial landmark-based Region-of-Interest (ROI) approach to address the challenge, and describe a low-cost and privacy-preserving solution that enables self-diagnosis using portable mobile devices in a personal setting (e.g., at home). We present results and findings that validate our method, and discuss other technical challenges and future directions in applying such techniques to real clinical settings.

What can Discriminator do? Towards Box-free Ownership Verification of Generative Adversarial Network

  • paper_url: http://arxiv.org/abs/2307.15860
  • repo_url: None
  • paper_authors: Ziheng Huang, Boheng Li, Yan Cai, Run Wang, Shangwei Guo, Liming Fang, Jing Chen, Lina Wang
  • for: Protect generative models against theft or leakage by verifying ownership from outputs only, without choosing the inputs (box-free setting).
  • methods: Propose a discriminator-based IP protection scheme in which the discriminator learns a hypersphere that captures the unique distribution learned by the paired generator.
  • results: Extensive evaluation on two popular GAN tasks and more than 10 GAN architectures shows effective ownership verification; the scheme is immune to input-based removal attacks and robust against other existing attacks.
    Abstract In recent decades, Generative Adversarial Network (GAN) and its variants have achieved unprecedented success in image synthesis. However, well-trained GANs are under the threat of illegal steal or leakage. The prior studies on remote ownership verification assume a black-box setting where the defender can query the suspicious model with specific inputs, which we identify is not enough for generation tasks. To this end, in this paper, we propose a novel IP protection scheme for GANs where ownership verification can be done by checking outputs only, without choosing the inputs (i.e., box-free setting). Specifically, we make use of the unexploited potential of the discriminator to learn a hypersphere that captures the unique distribution learned by the paired generator. Extensive evaluations on two popular GAN tasks and more than 10 GAN architectures demonstrate our proposed scheme to effectively verify the ownership. Our proposed scheme shown to be immune to popular input-based removal attacks and robust against other existing attacks. The source code and models are available at https://github.com/AbstractTeen/gan_ownership_verification
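A minimal sketch of the hypersphere idea: the defender's discriminator maps images into a feature space where the paired generator's outputs cluster around a learned center, and a suspect model is flagged when its outputs fall inside that region. The distance measure and the threshold are illustrative assumptions, not the paper's decision rule.

```python
import torch

def hypersphere_score(disc_features, center):
    """Per-sample squared distance of discriminator features to the learned center."""
    return ((disc_features - center) ** 2).sum(dim=-1)

# usage with placeholder features: small distances suggest the suspect shares the generator
center = torch.zeros(128)
suspect_feats = 0.1 * torch.randn(32, 128)
same_owner = hypersphere_score(suspect_feats, center).mean() < 1.0
```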

Seeing Behind Dynamic Occlusions with Event Cameras

  • paper_url: http://arxiv.org/abs/2307.15829
  • repo_url: None
  • paper_authors: Rong Zou, Manasi Muglikar, Nico Messikommer, Davide Scaramuzza
  • for: Improve computer-vision performance under dynamic occlusions such as debris, dust, rain-drops, and snow.
  • methods: Combine a traditional camera with an event camera, using the high temporal resolution of events to reconstruct the occluded background content.
  • results: Outperforms image inpainting methods by 3 dB in PSNR on the proposed dataset.
    Abstract Unwanted camera occlusions, such as debris, dust, rain-drops, and snow, can severely degrade the performance of computer-vision systems. Dynamic occlusions are particularly challenging because of the continuously changing pattern. Existing occlusion-removal methods currently use synthetic aperture imaging or image inpainting. However, they face issues with dynamic occlusions as these require multiple viewpoints or user-generated masks to hallucinate the background intensity. We propose a novel approach to reconstruct the background from a single viewpoint in the presence of dynamic occlusions. Our solution relies for the first time on the combination of a traditional camera with an event camera. When an occlusion moves across a background image, it causes intensity changes that trigger events. These events provide additional information on the relative intensity changes between foreground and background at a high temporal resolution, enabling a truer reconstruction of the background content. We present the first large-scale dataset consisting of synchronized images and event sequences to evaluate our approach. We show that our method outperforms image inpainting methods by 3dB in terms of PSNR on our dataset.

Multi-growth stage plant recognition: a case study of Palmer amaranth (Amaranthus palmeri) in cotton (Gossypium hirsutum)

  • paper_url: http://arxiv.org/abs/2307.15816
  • repo_url: None
  • paper_authors: Guy RY Coleman, Matthew Kutugata, Michael J Walsh, Muthukumar Bagavathiannan
  • for: Developing and testing a method for recognizing growth stages of Amaranthus palmeri (a weed plant in cotton production) using convolutional neural networks (CNNs) and the You Only Look Once (YOLO) architecture.
  • methods: The authors use 26 different architecture variants from YOLO v3, v5, v6, v6 3.0, v7, and v8 to recognize eight different growth stages of A. palmeri. They compare the performance of these architectures on an eight-class growth stage dataset and use class activation maps (CAM) to understand model attention on the complex dataset.
  • results: The highest mAP@[0.5:0.95] for recognition of all growth stage classes was 47.34%, achieved by v8-X, with inter-class confusion across visually similar growth stages. With all growth stages grouped as a single class, performance increased, with a maximum mAP@[0.5:0.95] of 67.05% achieved by v7-Original. Single-class recall of up to 81.42% was achieved by v5-X, and precision of up to 89.72% was achieved by v8-X.
    Abstract Many advanced, image-based precision agricultural technologies for plant breeding, field crop research, and site-specific crop management hinge on the reliable detection and phenotyping of plants across highly variable morphological growth stages. Convolutional neural networks (CNNs) have shown promise for image-based plant phenotyping and weed recognition, but their ability to recognize growth stages, often with stark differences in appearance, is uncertain. Amaranthus palmeri (Palmer amaranth) is a particularly challenging weed plant in cotton (Gossypium hirsutum) production, exhibiting highly variable plant morphology both across growth stages over a growing season, as well as between plants at a given growth stage due to high genetic diversity. In this paper, we investigate eight-class growth stage recognition of A. palmeri in cotton as a challenging model for You Only Look Once (YOLO) architectures. We compare 26 different architecture variants from YOLO v3, v5, v6, v6 3.0, v7, and v8 on an eight-class growth stage dataset of A. palmeri. The highest mAP@[0.5:0.95] for recognition of all growth stage classes was 47.34% achieved by v8-X, with inter-class confusion across visually similar growth stages. With all growth stages grouped as a single class, performance increased, with a maximum mean average precision (mAP@[0.5:0.95]) of 67.05% achieved by v7-Original. Single class recall of up to 81.42% was achieved by v5-X, and precision of up to 89.72% was achieved by v8-X. Class activation maps (CAM) were used to understand model attention on the complex dataset. Fewer classes, grouped by visual or size features improved performance over the ground-truth eight-class dataset. Successful growth stage detection highlights the substantial opportunity for improving plant phenotyping and weed recognition technologies with open-source object detection architectures.

VPP: Efficient Conditional 3D Generation via Voxel-Point Progressive Representation

  • paper_url: http://arxiv.org/abs/2307.16605
  • repo_url: https://github.com/qizekun/vpp
  • paper_authors: Zekun Qi, Muzhou Yu, Runpei Dong, Kaisheng Ma
  • for: Improve the efficiency and quality of conditional 3D generation.
  • methods: Propose a progressive generation method, Voxel-Point Progressive Representation (VPP), which combines a structured voxel representation (Voxel Semantic Generator) with a sparse point representation (Point Upsampler) to enable efficient generation of multi-category objects.
  • results: Experiments show that VPP generates high-quality 8K point clouds within 0.2 seconds and transfers to various 3D downstream tasks such as generation, editing, completion, and pre-training.
    Abstract Conditional 3D generation is undergoing a significant advancement, enabling the free creation of 3D content from inputs such as text or 2D images. However, previous approaches have suffered from low inference efficiency, limited generation categories, and restricted downstream applications. In this work, we revisit the impact of different 3D representations on generation quality and efficiency. We propose a progressive generation method through Voxel-Point Progressive Representation (VPP). VPP leverages structured voxel representation in the proposed Voxel Semantic Generator and the sparsity of unstructured point representation in the Point Upsampler, enabling efficient generation of multi-category objects. VPP can generate high-quality 8K point clouds within 0.2 seconds. Additionally, the masked generation Transformer allows for various 3D downstream tasks, such as generation, editing, completion, and pre-training. Extensive experiments demonstrate that VPP efficiently generates high-fidelity and diverse 3D shapes across different categories, while also exhibiting excellent representation transfer performance. Codes will be released on https://github.com/qizekun/VPP.

Semi-Supervised Object Detection in the Open World

  • paper_url: http://arxiv.org/abs/2307.15710
  • repo_url: None
  • paper_authors: Garvita Allabadi, Ana Lucic, Peter Pao-Huang, Yu-Xiong Wang, Vikram Adve
  • for: Extend semi-supervised object detection to the open world, where unlabeled and test data may contain classes not present during training.
  • methods: Propose the Open World Semi-supervised Detection framework (OWSSD), which detects out-of-distribution (OOD) samples and learns from both ID and OOD data, using an ensemble-based OOD detector of lightweight auto-encoder networks trained only on ID data.
  • results: Extensive evaluation shows the method performs competitively against state-of-the-art OOD detection algorithms and significantly boosts semi-supervised learning performance in open-world scenarios.
    Abstract Existing approaches for semi-supervised object detection assume a fixed set of classes present in training and unlabeled datasets, i.e., in-distribution (ID) data. The performance of these techniques significantly degrades when these techniques are deployed in the open-world, due to the fact that the unlabeled and test data may contain objects that were not seen during training, i.e., out-of-distribution (OOD) data. The two key questions that we explore in this paper are: can we detect these OOD samples and if so, can we learn from them? With these considerations in mind, we propose the Open World Semi-supervised Detection framework (OWSSD) that effectively detects OOD data along with a semi-supervised learning pipeline that learns from both ID and OOD data. We introduce an ensemble based OOD detector consisting of lightweight auto-encoder networks trained only on ID data. Through extensive evalulation, we demonstrate that our method performs competitively against state-of-the-art OOD detection algorithms and also significantly boosts the semi-supervised learning performance in open-world scenarios.

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2307.15700
  • repo_url: https://github.com/mcg-nju/memotr
  • paper_authors: Ruopeng Gao, Limin Wang
  • for: Improve how multi-object tracking (MOT) captures temporal information, beyond object features between adjacent frames.
  • methods: Propose MeMOTR, a long-term memory-augmented Transformer that injects long-term memory through a customized memory-attention layer, making the same object's track embedding more stable and distinguishable.
  • results: On DanceTrack, MeMOTR surpasses the previous state-of-the-art by 7.9% HOTA and 13.0% AssA, outperforms other Transformer-based methods on association performance on MOT17, and generalizes well on BDD100K.
    Abstract As a video task, Multiple Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.
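A minimal sketch of a long-term memory update in the spirit described above: a per-object running average keeps the track embedding stable across frames before it is consumed by a memory-attention layer. The exponential-moving-average form and momentum value are assumptions for illustration.

```python
import torch

def update_long_term_memory(memory, track_embed, momentum=0.9):
    """Exponentially averaged per-object embedding, updated once per frame."""
    return momentum * memory + (1.0 - momentum) * track_embed

memory = torch.zeros(256)
for _ in range(10):                       # ten frames of the same target
    memory = update_long_term_memory(memory, torch.randn(256))
```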

SimDETR: Simplifying self-supervised pretraining for DETR

  • paper_url: http://arxiv.org/abs/2307.15697
  • repo_url: None
  • paper_authors: Ioannis Maniadis Metaxas, Adrian Bulat, Ioannis Patras, Brais Martinez, Georgios Tzimiropoulos
  • for: Improve DETR-based detectors, which are sample-inefficient and converge slowly.
  • methods: Unsupervised pretraining built on richer, semantics-based initial proposals derived from high-level feature maps, discriminative training with object pseudo-labels produced via clustering, and self-training that exploits the improved object proposals learned by the detector.
  • results: The pretraining outperforms prior DETR pretraining works by significant margins in both the full- and low-data regimes, and DETR can be pretrained from scratch (including the backbone) directly on complex image datasets such as COCO.
    Abstract DETR-based object detectors have achieved remarkable performance but are sample-inefficient and exhibit slow convergence. Unsupervised pretraining has been found to be helpful to alleviate these impediments, allowing training with large amounts of unlabeled data to improve the detector's performance. However, existing methods have their own limitations, like keeping the detector's backbone frozen in order to avoid performance degradation and utilizing pretraining objectives misaligned with the downstream task. To overcome these limitations, we propose a simple pretraining framework for DETR-based detectors that consists of three simple yet key ingredients: (i) richer, semantics-based initial proposals derived from high-level feature maps, (ii) discriminative training using object pseudo-labels produced via clustering, (iii) self-training to take advantage of the improved object proposals learned by the detector. We report two main findings: (1) Our pretraining outperforms prior DETR pretraining works on both the full and low data regimes by significant margins. (2) We show we can pretrain DETR from scratch (including the backbone) directly on complex image datasets like COCO, paving the path for unsupervised representation learning directly using DETR.
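A minimal sketch of the pseudo-labelling ingredient: cluster pooled features of the initial proposals and use the cluster index as a discriminative pseudo-class for pretraining. k-means and the cluster count are illustrative stand-ins for whatever clustering the authors use.

```python
import numpy as np
from sklearn.cluster import KMeans

def proposal_pseudo_labels(proposal_features, num_clusters=64, seed=0):
    """Cluster pooled proposal features; cluster indices act as pseudo-classes."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    return km.fit_predict(proposal_features)     # (num_proposals,) integer labels

feats = np.random.rand(500, 256)                 # pooled features of 500 proposals
labels = proposal_pseudo_labels(feats)
```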

PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding

  • paper_url: http://arxiv.org/abs/2307.15692
  • repo_url: https://github.com/davideboscaini/patchmixer
  • paper_authors: Davide Boscaini, Fabio Poiesi
  • for: Assess how well deep learning methods for 3D point cloud understanding generalize to other domains, and propose a simple yet effective extension of the MLP-Mixer ideas to point clouds.
  • methods: Process local patches instead of the whole shape to promote robustness to partial point clouds, and aggregate patch-wise features with an MLP as a simpler alternative to graph convolutions or attention mechanisms.
  • results: On shape classification and part segmentation, the method achieves superior generalization performance compared to a selection of the most relevant deep architectures.
    Abstract The recent trend in deep learning methods for 3D point cloud understanding is to propose increasingly sophisticated architectures either to better capture 3D geometries or by introducing possibly undesired inductive biases. Moreover, prior works introducing novel architectures compared their performance on the same domain, devoting less attention to their generalization to other domains. We argue that the ability of a model to transfer the learnt knowledge to different domains is an important feature that should be evaluated to exhaustively assess the quality of a deep network architecture. In this work we propose PatchMixer, a simple yet effective architecture that extends the ideas behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our approach are the processing of local patches instead of the whole shape to promote robustness to partial point clouds, and the aggregation of patch-wise features using an MLP as a simpler alternative to the graph convolutions or the attention mechanisms that are used in prior works. We evaluated our method on the shape classification and part segmentation tasks, achieving superior generalization performance compared to a selection of the most relevant deep architectures.
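A minimal sketch of the two ideas named above: embed each local patch of points with a shared MLP, then mix information across patches with another plain MLP instead of graph convolutions or attention. All dimensions and the max-pool readout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchMLPAggregator(nn.Module):
    """Embed local point patches, mix across patches with an MLP, pool to a descriptor."""
    def __init__(self, points_per_patch=32, embed_dim=128, num_patches=64):
        super().__init__()
        self.patch_embed = nn.Sequential(
            nn.Linear(points_per_patch * 3, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim))
        self.patch_mixer = nn.Sequential(
            nn.Linear(num_patches, num_patches), nn.ReLU(),
            nn.Linear(num_patches, num_patches))

    def forward(self, patches):                       # (B, P, N, 3)
        B, P, N, _ = patches.shape
        tokens = self.patch_embed(patches.reshape(B, P, N * 3))          # (B, P, D)
        mixed = self.patch_mixer(tokens.transpose(1, 2)).transpose(1, 2)  # mix over patches
        return mixed.max(dim=1).values                # (B, D) global descriptor

model = PatchMLPAggregator()
out = model(torch.randn(2, 64, 32, 3))                # 2 clouds, 64 patches each
```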

TrackAgent: 6D Object Tracking via Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2307.15671
  • repo_url: None
  • paper_authors: Konstantin Röhrl, Dominik Bauer, Timothy Patten, Markus Vincze
  • for: Tracking an object’s 6D pose in robotics and augmented reality applications, while either the object or the observing camera is moving.
  • methods: Simplify object tracking to a reinforced point cloud (depth only) alignment task, using a streamlined approach with limited amounts of sparse 3D point clouds and a reinforcement learning (RL) agent that jointly solves for both objectives.
  • results: The RL agent’s uncertainty and a rendering-based mask propagation are effective reinitialization triggers, and the proposed method outperforms previous RGB(D)-based methods in terms of computational efficiency and robustness to tracking loss.
    Abstract Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptionally complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement using a reinforcement learning (RL) agent that jointly solves for both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers.
    摘要 Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptionally complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement using a reinforcement learning (RL) agent that jointly solves for both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers.Translation notes:* "6D pose" is translated as "6D位姿" (liù dì wèi xìng)* "RGB(D)" is translated as "RGB(D)" (RGB(D) séqùì)* "point cloud" is translated as "点云" (diǎn yú)* "reinforced" is translated as "加强" (jiā qiáng)* "streamlined" is translated as "流畅" (liú chóng)* "sparse" is translated as "稀疏" (xī shōu)* "temporal" is translated as "时间" (shí jian)* "frame-to-frame" is translated as "帧到帧" (kuàng dào kuàng)* "object-based" is translated as "基于物体" (jī yú wù tǐ)* "reinitialization" is translated as "重新初始化" (zhòng xīn chū shí huà)* "uncertainty" is translated as "不确定性" (bù jì dì xìng)* "rendering-based" is translated as "基于渲染" (jī yú yǎo chéng)

Multi-layer Aggregation as a key to feature-based OOD detection

  • paper_url: http://arxiv.org/abs/2307.15647
  • repo_url: https://github.com/benolmbrt/MedicOOD
  • paper_authors: Benjamin Lambert, Florence Forbes, Senan Doyle, Michel Dojat
  • for: Investigate feature-based out-of-distribution (OOD) detection for deep learning models, to improve reliability in medical image analysis.
  • methods: Compare single-layer methods, which use the feature map from a fixed, carefully chosen layer, with multi-layer methods, which use the ensemble of feature maps generated by the model.
  • results: On a large spectrum of OOD types (20 types, approximately 7800 3D MRIs), multi-layer methods consistently outperform single-layer approaches, whose behaviour is inconsistent depending on the anomaly type; OOD detection performance also depends strongly on the architecture of the underlying neural network.
    Abstract Deep Learning models are easily disturbed by variations in the input images that were not observed during the training stage, resulting in unpredictable predictions. Detecting such Out-of-Distribution (OOD) images is particularly crucial in the context of medical image analysis, where the range of possible abnormalities is extremely wide. Recently, a new category of methods has emerged, based on the analysis of the intermediate features of a trained model. These methods can be divided into 2 groups: single-layer methods that consider the feature map obtained at a fixed, carefully chosen layer, and multi-layer methods that consider the ensemble of the feature maps generated by the model. While promising, a proper comparison of these algorithms is still lacking. In this work, we compared various feature-based OOD detection methods on a large spectra of OOD (20 types), representing approximately 7800 3D MRIs. Our experiments shed the light on two phenomenons. First, multi-layer methods consistently outperform single-layer approaches, which tend to have inconsistent behaviour depending on the type of anomaly. Second, the OOD detection performance highly depends on the architecture of the underlying neural network.
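A minimal sketch of multi-layer aggregation: compute a simple per-layer anomaly score and sum it over the layers of the network. The pooled-distance score used here is a simplified stand-in for the actual single-layer detectors being compared in the paper.

```python
import torch

def multilayer_ood_score(feature_maps, layer_means):
    """Sum, over layers, the distance of the pooled feature map to that layer's training mean."""
    score = 0.0
    for feat, mu in zip(feature_maps, layer_means):
        pooled = feat.mean(dim=(-2, -1))               # (B, C): global average pool
        score = score + ((pooled - mu) ** 2).sum(dim=-1)
    return score                                       # (B,) higher = more OOD

feats = [torch.randn(4, 32, 16, 16), torch.randn(4, 64, 8, 8)]
means = [torch.zeros(32), torch.zeros(64)]
scores = multilayer_ood_score(feats, means)
```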

Scale-aware Test-time Click Adaptation for Pulmonary Nodule and Mass Segmentation

  • paper_url: http://arxiv.org/abs/2307.15645
  • repo_url: https://github.com/splinterli/sattca
  • paper_authors: Zhihao Li, Jiancheng Yang, Yongchao Xu, Li Zhang, Wenhui Dong, Bo Du
  • for: Address the challenge of segmenting pulmonary nodules and masses of widely varying sizes in lung cancer screening, using a multi-scale neural network with scale-aware test-time adaptation.
  • methods: Propose a Scale-aware Test-time Click Adaptation method that uses effortlessly obtainable lesion clicks as test-time cues to enhance segmentation performance, particularly for large lesions.
  • results: Extensive experiments on both open-source and in-house datasets consistently demonstrate the effectiveness of the proposed method over several CNN- and Transformer-based segmentation methods.
    Abstract Pulmonary nodules and masses are crucial imaging features in lung cancer screening that require careful management in clinical diagnosis. Despite the success of deep learning-based medical image segmentation, the robust performance on various sizes of lesions of nodule and mass is still challenging. In this paper, we propose a multi-scale neural network with scale-aware test-time adaptation to address this challenge. Specifically, we introduce an adaptive Scale-aware Test-time Click Adaptation method based on effortlessly obtainable lesion clicks as test-time cues to enhance segmentation performance, particularly for large lesions. The proposed method can be seamlessly integrated into existing networks. Extensive experiments on both open-source and in-house datasets consistently demonstrate the effectiveness of the proposed method over some CNN and Transformer-based segmentation methods. Our code is available at https://github.com/SplinterLi/SaTTCA

CLIP Brings Better Features to Visual Aesthetics Learners

  • paper_url: http://arxiv.org/abs/2307.15640
  • repo_url: None
  • paper_authors: Liwu Xu, Jinjin Xu, Yuzhe Yang, Yijie Huang, Yanchun Xie, Yaqian Li
  • for: Image aesthetics assessment (IAA), which has a subjective and expensive labeling procedure.
  • methods: A two-phase approach that integrates a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via a feature alignment loss, followed by semi-supervised IAA learning on the unlabeled data.
  • results: Achieves state-of-the-art performance on multiple widely used IAA benchmarks, alleviates the feature collapse issue, and shows the necessity of feature alignment instead of training directly on the CLIP image encoder.
    Abstract The success of pre-training approaches on a variety of downstream tasks has revitalized the field of computer vision. Image aesthetics assessment (IAA) is one of the ideal application scenarios for such methods due to subjective and expensive labeling procedure. In this work, an unified and flexible two-phase \textbf{C}LIP-based \textbf{S}emi-supervised \textbf{K}nowledge \textbf{D}istillation paradigm is proposed, namely \textbf{\textit{CSKD}. Specifically, we first integrate and leverage a multi-source unlabeled dataset to align rich features between a given visual encoder and an off-the-shelf CLIP image encoder via feature alignment loss. Notably, the given visual encoder is not limited by size or structure and, once well-trained, it can seamlessly serve as a better visual aesthetic learner for both student and teacher. In the second phase, the unlabeled data is also utilized in semi-supervised IAA learning to further boost student model performance when applied in latency-sensitive production scenarios. By analyzing the attention distance and entropy before and after feature alignment, we notice an alleviation of feature collapse issue, which in turn showcase the necessity of feature alignment instead of training directly based on CLIP image encoder. Extensive experiments indicate the superiority of CSKD, which achieves state-of-the-art performance on multiple widely used IAA benchmarks.
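A minimal sketch of the phase-one feature alignment: the trainable visual encoder's pooled feature is projected into the CLIP embedding space and pulled toward the frozen CLIP image encoder's output on the same unlabeled image. The cosine-based loss and the projection head are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feat, clip_feat, proj):
    """Pull projected student features toward frozen CLIP image embeddings (cosine loss)."""
    aligned = F.normalize(proj(student_feat), dim=-1)
    target = F.normalize(clip_feat, dim=-1).detach()   # CLIP encoder stays frozen
    return (1.0 - (aligned * target).sum(dim=-1)).mean()

proj = torch.nn.Linear(2048, 512)                      # student dim -> CLIP dim
loss = feature_alignment_loss(torch.randn(8, 2048), torch.randn(8, 512), proj)
```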