cs.CV - 2023-09-23

Portrait Stylization: Artistic Style Transfer with Auxiliary Networks for Human Face Stylization

  • paper_url: http://arxiv.org/abs/2309.13492
  • repo_url: https://github.com/thiagoambiel/PortraitStylization
  • paper_authors: Thiago Ambiel
  • for: Preserve the individual features of human faces during image style transfer
  • methods: Use embeddings from an auxiliary pre-trained face recognition model to encourage the algorithm to propagate facial features from the content image into the final stylized result
  • results: Improved preservation of individual facial features after style transfer
    Abstract Today's image style transfer methods have difficulty retaining the individual features of human faces after the whole stylizing process. This occurs because features such as face geometry and facial expressions are not captured by general-purpose image classifiers such as the pre-trained VGG-19 models. This paper proposes the use of embeddings from an auxiliary pre-trained face recognition model to encourage the algorithm to propagate human face features from the content image to the final stylized result.
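To make the loss design concrete, below is a minimal sketch of adding a face-identity term to a standard neural style transfer objective. It illustrates the idea rather than the released PortraitStylization code; `vgg_features` (a pre-trained VGG-19 feature extractor returning a list of feature maps) and `face_encoder` (the auxiliary face recognition model) are hypothetical stand-ins.

```python
import torch.nn.functional as F

def gram(feat):
    # Gram matrix of a (B, C, H, W) feature map, used for the style term.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def stylization_loss(stylized, content, style, vgg_features, face_encoder,
                     w_content=1.0, w_style=1e3, w_id=10.0):
    fs, fc, fst = vgg_features(stylized), vgg_features(content), vgg_features(style)
    content_loss = F.mse_loss(fs[-1], fc[-1])  # match deep-layer content features
    style_loss = sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(fs, fst))
    # Identity term: pull the stylized face's recognition embedding toward the
    # content face's embedding so individual facial features survive stylization.
    id_loss = 1.0 - F.cosine_similarity(face_encoder(stylized),
                                        face_encoder(content), dim=-1).mean()
    return w_content * content_loss + w_style * style_loss + w_id * id_loss
```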

Identifying Systematic Errors in Object Detectors with the SCROD Pipeline

  • paper_url: http://arxiv.org/abs/2309.13489
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Valentyn Boreiko, Matthias Hein, Jan Hendrik Metzen
  • for: This work aims to identify and remove systematic errors in object detectors so that they can be deployed in automated driving and robotics applications.
  • methods: We propose a novel framework that combines the strengths of physical simulators and generative models to generate street scenes with fine-grained control in a fully automated and scalable manner.
  • results: Our framework automatically generates street scenes with fine-grained control. We also introduce an evaluation setting that can serve as a standard benchmark for similar pipelines.
    Abstract The identification and removal of systematic errors in object detectors can be a prerequisite for their deployment in safety-critical applications like automated driving and robotics. Such systematic errors can for instance occur under very specific object poses (location, scale, orientation), object colors/textures, and backgrounds. Real images alone are unlikely to cover all relevant combinations. We overcome this limitation by generating synthetic images with fine-granular control. While generating synthetic images with physical simulators and hand-designed 3D assets allows fine-grained control over generated images, this approach is resource-intensive and has limited scalability. In contrast, using generative models is more scalable but less reliable in terms of fine-grained control. In this paper, we propose a novel framework that combines the strengths of both approaches. Our meticulously designed pipeline along with custom models enables us to generate street scenes with fine-grained control in a fully automated and scalable manner. Moreover, our framework introduces an evaluation setting that can serve as a benchmark for similar pipelines. This evaluation setting will contribute to advancing the field and promoting standardized testing procedures.

Detecting and Mitigating System-Level Anomalies of Vision-Based Controllers

  • paper_url: http://arxiv.org/abs/2309.13475
  • repo_url: None
  • paper_authors: Aryaman Gupta, Kaustav Chakraborty, Somil Bansal
  • for: This work aims to improve the safety and reliability of autonomous systems by detecting and mitigating anomalies of vision-based controllers at run time.
  • methods: A reachability-based framework is used to stress-test the vision-based controller offline, and the mined failure data is used to train an anomaly detector deployed online.
  • results: Results show that the proposed approach identifies and handles system-level anomalies of vision-based controllers, improving the overall safety and reliability of autonomous systems.
    Abstract Autonomous systems, such as self-driving cars and drones, have made significant strides in recent years by leveraging visual inputs and machine learning for decision-making and control. Despite their impressive performance, these vision-based controllers can make erroneous predictions when faced with novel or out-of-distribution inputs. Such errors can cascade to catastrophic system failures and compromise system safety. In this work, we introduce a run-time anomaly monitor to detect and mitigate such closed-loop, system-level failures. Specifically, we leverage a reachability-based framework to stress-test the vision-based controller offline and mine its system-level failures. This data is then used to train a classifier that is leveraged online to flag inputs that might cause system breakdowns. The anomaly detector highlights issues that transcend individual modules and pertain to the safety of the overall system. We also design a fallback controller that robustly handles these detected anomalies to preserve system safety. We validate the proposed approach on an autonomous aircraft taxiing system that uses a vision-based controller for taxiing. Our results show the efficacy of the proposed approach in identifying and handling system-level anomalies, outperforming methods such as prediction error-based detection, and ensembling, thereby enhancing the overall safety and robustness of autonomous systems.
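As a rough illustration of the run-time loop described in the abstract (a sketch under assumptions, not the authors' implementation), the snippet below shows how an offline-trained anomaly detector could gate between the vision-based controller and a fallback controller; `vision_controller`, `fallback_controller`, and `anomaly_detector` are hypothetical callables.

```python
def control_step(image, state, vision_controller, fallback_controller,
                 anomaly_detector, risk_threshold=0.5):
    """One closed-loop step: use the vision-based controller unless the current
    input is flagged as likely to cause a system-level failure."""
    risk = anomaly_detector(image)            # classifier trained offline on mined failure data
    if risk > risk_threshold:
        return fallback_controller(state)     # robust fallback that ignores the camera input
    return vision_controller(image, state)    # nominal vision-based control
```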

Edge Aware Learning for 3D Point Cloud

  • paper_url: http://arxiv.org/abs/2309.13472
  • repo_url: None
  • paper_authors: Lei Li
  • for: Handling noise in point cloud data and improving object recognition and segmentation.
  • methods: The method incorporates the concept of edge awareness, inspired by the human visual system, into the learning framework to improve object classification and segmentation.
  • results: The method performs strongly on the ModelNet40 and ShapeNet datasets, with clear advantages in object classification and segmentation tasks.
    Abstract This paper proposes an innovative approach to Hierarchical Edge Aware 3D Point Cloud Learning (HEA-Net) that seeks to address the challenges of noise in point cloud data, and improve object recognition and segmentation by focusing on edge features. In this study, we present an innovative edge-aware learning methodology, specifically designed to enhance point cloud classification and segmentation. Drawing inspiration from the human visual system, the concept of edge-awareness has been incorporated into this methodology, contributing to improved object recognition while simultaneously reducing computational time. Our research has led to the development of an advanced 3D point cloud learning framework that effectively manages object classification and segmentation tasks. A unique fusion of local and global network learning paradigms has been employed, enriched by edge-focused local and global embeddings, thereby significantly augmenting the model's interpretative prowess. Further, we have applied a hierarchical transformer architecture to boost point cloud processing efficiency, thus providing nuanced insights into structural understanding. Our approach demonstrates significant promise in managing noisy point cloud data and highlights the potential of edge-aware strategies in 3D point cloud learning. The proposed approach is shown to outperform existing techniques in object classification and segmentation tasks, as demonstrated by experiments on ModelNet40 and ShapeNet datasets.

HAVE-Net: Hallucinated Audio-Visual Embeddings for Few-Shot Classification with Unimodal Cues

  • paper_url: http://arxiv.org/abs/2309.13470
  • repo_url: None
  • paper_authors: Ankit Jha, Debabrata Pal, Mainak Singha, Naman Agarwal, Biplab Banerjee
  • for: This work addresses a new problem in remote sensing (RS) image recognition where both audio and visual modalities are available during meta-training of a few-shot classifier, but one modality may be missing at the meta-testing stage.
  • methods: We propose a novel few-shot generative framework, the Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), to meta-train cross-modal features from limited unimodal data. During inference, these hallucinated features are used for few-shot classification.
  • results: Experiments on the benchmark ADVANCE and AudioSetZSL datasets show that our hallucinated-modality augmentation strategy for few-shot classification outperforms the classifier trained with real multimodal information by at least 0.8-2%.
    Abstract Recognition of remote sensing (RS) or aerial images is currently of great interest, and advancements in deep learning algorithms added flavor to it in recent years. Occlusion, intra-class variance, lighting, etc., might arise while training neural networks using unimodal RS visual input. Even though joint training of audio-visual modalities improves classification performance in a low-data regime, it has yet to be thoroughly investigated in the RS domain. Here, we aim to solve a novel problem where both the audio and visual modalities are present during the meta-training of a few-shot learning (FSL) classifier; however, one of the modalities might be missing during the meta-testing stage. This problem formulation is pertinent in the RS domain, given the difficulties in data acquisition or sensor malfunctioning. To mitigate, we propose a novel few-shot generative framework, Hallucinated Audio-Visual Embeddings-Network (HAVE-Net), to meta-train cross-modal features from limited unimodal data. Precisely, these hallucinated features are meta-learned from base classes and used for few-shot classification on novel classes during the inference phase. The experimental results on the benchmark ADVANCE and AudioSetZSL datasets show that our hallucinated modality augmentation strategy for few-shot classification outperforms the classifier performance trained with the real multimodal information at least by 0.8-2%.

Turbulence in Focus: Benchmarking Scaling Behavior of 3D Volumetric Super-Resolution with BLASTNet 2.0 Data

  • paper_url: http://arxiv.org/abs/2309.13457
  • repo_url: None
  • paper_authors: Wai Tong Chung, Bassem Akoush, Pushan Sharma, Alex Tamkin, Ki Sung Jung, Jacqueline H. Chen, Jack Guo, Davy Brouzet, Mohsen Talei, Bruno Savard, Alexei Y. Poludnenko, Matthias Ihme
  • for: The paper aims to provide a large-scale dataset of 3D high-fidelity compressible turbulent flow simulations for training and benchmarking deep learning models.
  • methods: The paper uses 744 full-domain samples from 34 high-fidelity direct numerical simulations to create a network-of-datasets called BLASTNet 2.0, and benchmarks 49 variations of five deep learning approaches for 3D super-resolution on it.
  • results: The paper performs a neural scaling analysis on these models to examine the performance of different machine learning (ML) approaches, including two scientific ML techniques, and demonstrates that (i) predictive performance can scale with model size and cost, (ii) architecture matters significantly, especially for smaller models, and (iii) the benefits of physics-based losses can persist with increasing model size.
    Abstract Analysis of compressible turbulent flows is essential for applications related to propulsion, energy generation, and the environment. Here, we present BLASTNet 2.0, a 2.2 TB network-of-datasets containing 744 full-domain samples from 34 high-fidelity direct numerical simulations, which addresses the current limited availability of 3D high-fidelity reacting and non-reacting compressible turbulent flow simulation data. With this data, we benchmark a total of 49 variations of five deep learning approaches for 3D super-resolution - which can be applied for improving scientific imaging, simulations, turbulence models, as well as in computer vision applications. We perform neural scaling analysis on these models to examine the performance of different machine learning (ML) approaches, including two scientific ML techniques. We demonstrate that (i) predictive performance can scale with model size and cost, (ii) architecture matters significantly, especially for smaller models, and (iii) the benefits of physics-based losses can persist with increasing model size. The outcomes of this benchmark study are anticipated to offer insights that can aid the design of 3D super-resolution models, especially for turbulence models, while this data is expected to foster ML methods for a broad range of flow physics applications. This data is publicly available with download links and browsing tools consolidated at https://blastnet.github.io.
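To illustrate the kind of neural scaling analysis mentioned above (placeholder numbers, not results from the paper), one can fit a saturating power law to benchmark losses measured at several model sizes:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model sizes (parameter counts) and benchmark losses.
model_params = np.array([1e5, 1e6, 1e7, 1e8])
test_loss = np.array([0.42, 0.31, 0.24, 0.20])

def power_law(n, a, b, c):
    # L(N) = a * N^(-b) + c: loss decays with model size toward an offset c.
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, model_params, test_loss, p0=(1.0, 0.1, 0.1), maxfev=10000)
print(f"fitted scaling exponent b = {b:.3f}")
```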

Video Timeline Modeling For News Story Understanding

  • paper_url: http://arxiv.org/abs/2309.13446
  • repo_url: https://github.com/google-research/google-research
  • paper_authors: Meng Liu, Mingda Zhang, Jialu Liu, Hanjun Dai, Ming-Hsuan Yang, Shuiwang Ji, Zheyun Feng, Boqing Gong
  • For: The paper explores the problem of video timeline modeling, with the goal of creating a video-associated timeline to facilitate content and structure understanding of the story being told.
  • Methods: The paper proposes several deep learning approaches to tackling this problem, including the development of a realistic benchmark dataset (YouTube-News-Timeline) and the introduction of quantitative metrics to evaluate and compare methodologies.
  • Results: The paper anticipates that this exploratory work will pave the way for further research in video timeline modeling, and provides a testbed for researchers to build upon.
    Abstract In this paper, we present a novel problem, namely video timeline modeling. Our objective is to create a video-associated timeline from a set of videos related to a specific topic, thereby facilitating the content and structure understanding of the story being told. This problem has significant potential in various real-world applications, for instance, news story summarization. To bootstrap research in this area, we curate a realistic benchmark dataset, YouTube-News-Timeline, consisting of over $12$k timelines and $300$k YouTube news videos. Additionally, we propose a set of quantitative metrics to comprehensively evaluate and compare methodologies. With such a testbed, we further develop and benchmark several deep learning approaches to tackling this problem. We anticipate that this exploratory work will pave the way for further research in video timeline modeling. The assets are available via https://github.com/google-research/google-research/tree/master/video_timeline_modeling.

Dream the Impossible: Outlier Imagination with Diffusion Models

  • paper_url: http://arxiv.org/abs/2309.13415
  • repo_url: https://github.com/deeplearning-wisc/dream-ood
  • paper_authors: Xuefeng Du, Yiyou Sun, Xiaojin Zhu, Yixuan Li
  • for: This work proposes a new framework for generating high-quality outlier samples to improve out-of-distribution (OOD) detection and safe prediction of machine learning models.
  • methods: The framework is based on diffusion models: it learns a text-conditioned latent space from the in-distribution data and samples outliers in the low-likelihood region of that space, which are then decoded into images in the high-dimensional pixel space.
  • results: Experiments show that training with samples generated by DREAM-OOD improves OOD detection performance.
    Abstract Utilizing auxiliary outlier datasets to regularize the machine learning model has demonstrated promise for out-of-distribution (OOD) detection and safe prediction. Due to the labor intensity in data collection and cleaning, automating outlier data generation has been a long-desired alternative. Despite the appeal, generating photo-realistic outliers in the high dimensional pixel space has been an open challenge for the field. To tackle the problem, this paper proposes a new framework DREAM-OOD, which enables imagining photo-realistic outliers by way of diffusion models, provided with only the in-distribution (ID) data and classes. Specifically, DREAM-OOD learns a text-conditioned latent space based on ID data, and then samples outliers in the low-likelihood region via the latent, which can be decoded into images by the diffusion model. Different from prior works, DREAM-OOD enables visualizing and understanding the imagined outliers, directly in the pixel space. We conduct comprehensive quantitative and qualitative studies to understand the efficacy of DREAM-OOD, and show that training with the samples generated by DREAM-OOD can benefit OOD detection performance. Code is publicly available at https://github.com/deeplearning-wisc/dream-ood.
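The snippet below is a schematic sketch of the low-likelihood sampling idea described above, not the released DREAM-OOD code: it fits a simple diagonal Gaussian to ID latent embeddings, keeps the lowest-likelihood candidates, and leaves decoding to a hypothetical diffusion decoder (`encode` and `decode_with_diffusion` are assumed helpers).

```python
import torch

def sample_outlier_latents(id_latents, num_candidates=10000, keep_fraction=0.05):
    # Fit a single diagonal Gaussian to the ID latent space (a simplification of
    # the paper's text-conditioned latent model).
    mu, std = id_latents.mean(0), id_latents.std(0)
    candidates = mu + std * torch.randn(num_candidates, id_latents.shape[1])
    log_prob = torch.distributions.Normal(mu, std).log_prob(candidates).sum(-1)
    k = int(keep_fraction * num_candidates)
    low_idx = torch.topk(-log_prob, k).indices     # lowest-likelihood candidates
    return candidates[low_idx]

# Hypothetical usage:
#   id_latents = encode(id_images)                               # ID embeddings
#   outlier_images = decode_with_diffusion(sample_outlier_latents(id_latents))
```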

WS-YOLO: Weakly Supervised Yolo Network for Surgical Tool Localization in Endoscopic Videos

  • paper_url: http://arxiv.org/abs/2309.13404
  • repo_url: https://github.com/breezewrf/weakly-supervised-yolov8
  • paper_authors: Rongfeng Wei, Jinlin Wu, You Pang, Zhen Chen
  • for: This paper aims to improve the detection and tracking of surgical tools in endoscopic video recordings.
  • methods: It uses a Weakly Supervised Yolo Network (WS-YOLO) to generate fine-grained semantic information about surgical tools in endoscopic videos, including their location and category, from the coarse-grained information output by the da Vinci surgical robot.
  • results: Experiments show that WS-YOLO accurately detects and tracks surgical tools while substantially reducing manual annotation labor. The code is available online.
    Abstract Being able to automatically detect and track surgical instruments in endoscopic video recordings would allow for many useful applications that could transform different aspects of surgery. In robot-assisted surgery, the potentially informative data like categories of surgical tool can be captured, which is sparse, full of noise and without spatial information. We proposed a Weakly Supervised Yolo Network (WS-YOLO) for Surgical Tool Localization in Endoscopic Videos, to generate fine-grained semantic information with location and category from coarse-grained semantic information outputted by the da Vinci surgical robot, which significantly diminished the necessary human annotation labor while striking an optimal balance between the quantity of manually annotated data and detection performance. The source code is available at https://github.com/Breezewrf/Weakly-Supervised-Yolov8.

Dual-Reference Source-Free Active Domain Adaptation for Nasopharyngeal Carcinoma Tumor Segmentation across Multiple Hospitals

  • paper_url: http://arxiv.org/abs/2309.13401
  • repo_url: https://github.com/whq-xxh/Active-GTV-Seg
  • paper_authors: Hongqiu Wang, Jian Chen, Shichen Zhang, Yuan He, Jinfeng Xu, Mengwan Wu, Jinlan He, Wenjun Liao, Xiangde Luo
  • for: This paper aims to improve the accuracy of Gross Tumor Volume (GTV) segmentation for nasopharyngeal carcinoma (NPC) to ensure effective NPC radiotherapy.
  • methods: The paper proposes a Source-Free Active Domain Adaptation (SFADA) framework for domain adaptation in the GTV segmentation task. The framework uses a dual-reference strategy to select domain-invariant and domain-specific representative samples from the target domain for annotation and model fine-tuning.
  • results: Experiments show that SFADA outperforms unsupervised domain adaptation (UDA) methods and achieves results comparable to the fully supervised upper bound even with few annotations. The paper also collects clinical data from 1057 NPC patients across five hospitals to validate the method.
    Abstract Nasopharyngeal carcinoma (NPC) is a prevalent and clinically significant malignancy that predominantly impacts the head and neck area. Precise delineation of the Gross Tumor Volume (GTV) plays a pivotal role in ensuring effective radiotherapy for NPC. Despite recent methods that have achieved promising results on GTV segmentation, they are still limited by lacking carefully-annotated data and hard-to-access data from multiple hospitals in clinical practice. Although some unsupervised domain adaptation (UDA) has been proposed to alleviate this problem, unconditionally mapping the distribution distorts the underlying structural information, leading to inferior performance. To address this challenge, we devise a novel Source-Free Active Domain Adaptation (SFADA) framework to facilitate domain adaptation for the GTV segmentation task. Specifically, we design a dual reference strategy to select domain-invariant and domain-specific representative samples from a specific target domain for annotation and model fine-tuning without relying on source-domain data. Our approach not only ensures data privacy but also reduces the workload for oncologists as it just requires annotating a few representative samples from the target domain and does not need to access the source data. We collect a large-scale clinical dataset comprising 1057 NPC patients from five hospitals to validate our approach. Experimental results show that our method outperforms the UDA methods and achieves comparable results to the fully supervised upper bound, even with few annotations, highlighting the significant medical utility of our approach. In addition, since there is no public dataset for multi-center NPC segmentation, we will release our code and dataset for future research.

A mirror-Unet architecture for PET/CT lesion segmentation

  • paper_url: http://arxiv.org/abs/2309.13398
  • repo_url: https://github.com/yrotstein/autopet2023_mv1
  • paper_authors: Yamila Rotstein Habarnau, Mauro Namías
  • for: The goal of this work is to automatically detect and segment oncologic lesions from PET/CT scans.
  • methods: The study uses a deep learning method that combines two UNet-3D branches: one branch is trained to segment a group of tissues from CT images, and the other segments lesions from PET images, combining the embedded information of the already-trained CT branch at the bottleneck.
  • results: The study finds that this deep learning approach achieves accurate lesion detection and segmentation.
    Abstract Automatic lesion detection and segmentation from [${}^{18}$F]FDG PET/CT scans is a challenging task, due to the diversity of shapes, sizes, FDG uptake and location they may present, besides the fact that physiological uptake is also present on healthy tissues. In this work, we propose a deep learning method aimed at the segmentation of oncologic lesions, based on a combination of two UNet-3D branches. First, one of the network's branches is trained to segment a group of tissues from CT images. The other branch is trained to segment the lesions from PET images, combining on the bottleneck the embedded information of CT branch, already trained. We trained and validated our networks on the AutoPET MICCAI 2023 Challenge dataset. Our code is available at: https://github.com/yrotstein/AutoPET2023_Mv1.
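A compact sketch of the two-branch idea described above, given as an assumption rather than the repository's code: `ct_unet` (a tissue-segmentation branch with an `encode` method), `pet_encoder`, and `pet_decoder` are hypothetical building blocks, with the CT bottleneck features fused into the PET branch.

```python
import torch
import torch.nn as nn

class MirrorUNet(nn.Module):
    def __init__(self, ct_unet, pet_encoder, pet_decoder, bottleneck_channels):
        super().__init__()
        self.ct_unet = ct_unet            # branch trained to segment tissues from CT
        self.pet_encoder = pet_encoder    # branch encoding PET images
        self.pet_decoder = pet_decoder    # decodes fused features into lesion masks
        self.fuse = nn.Conv3d(2 * bottleneck_channels, bottleneck_channels, kernel_size=1)

    def forward(self, ct, pet):
        with torch.no_grad():
            ct_bottleneck = self.ct_unet.encode(ct)          # already-trained CT embedding
        pet_bottleneck, skips = self.pet_encoder(pet)
        fused = self.fuse(torch.cat([pet_bottleneck, ct_bottleneck], dim=1))
        return self.pet_decoder(fused, skips)                # lesion segmentation
```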

YOLORe-IDNet: An Efficient Multi-Camera System for Person-Tracking

  • paper_url: http://arxiv.org/abs/2309.13387
  • repo_url: None
  • paper_authors: Vipin Gautam, Shitala Prasad, Sharad Sinha
  • for: Real-time person identification and tracking across multiple cameras
  • methods: Correlation filters with Intersection Over Union (IOU) constraints, combined with a deep learning model for cross-camera person re-identification (Re-ID) on top of YOLOv5
  • results: Achieves an F1-Score of 79% and an IOU of 59%, comparable to existing state-of-the-art algorithms, as evaluated on the publicly available OTB-100 dataset
    Abstract The growing need for video surveillance in public spaces has created a demand for systems that can track individuals across multiple cameras feeds in real-time. While existing tracking systems have achieved impressive performance using deep learning models, they often rely on pre-existing images of suspects or historical data. However, this is not always feasible in cases where suspicious individuals are identified in real-time and without prior knowledge. We propose a person-tracking system that combines correlation filters and Intersection Over Union (IOU) constraints for robust tracking, along with a deep learning model for cross-camera person re-identification (Re-ID) on top of YOLOv5. The proposed system quickly identifies and tracks suspect in real-time across multiple cameras and recovers well after full or partial occlusion, making it suitable for security and surveillance applications. It is computationally efficient and achieves a high F1-Score of 79% and an IOU of 59% comparable to existing state-of-the-art algorithms, as demonstrated in our evaluation on a publicly available OTB-100 dataset. The proposed system offers a robust and efficient solution for the real-time tracking of individuals across multiple camera feeds. Its ability to track targets without prior knowledge or historical data is a significant improvement over existing systems, making it well-suited for public safety and surveillance applications.

Cine cardiac MRI reconstruction using a convolutional recurrent network with refinement

  • paper_url: http://arxiv.org/abs/2309.13385
  • repo_url: https://github.com/vios-s/CMRxRECON_Challenge_EDIPO
  • paper_authors: Yuyang Xue, Yuning Du, Gianluca Carloni, Eva Pachetti, Connor Jordan, Sotirios A. Tsaftaris
  • for: Non-invasive assessment of cardiac function and condition
  • methods: A convolutional recurrent neural network (CRNN) architecture combined with a single-image super-resolution refinement module
  • results: Improves structural similarity by 4.4% and normalised mean square error by 3.9% over the baseline
    Abstract Cine Magnetic Resonance Imaging (MRI) allows for understanding of the heart's function and condition in a non-invasive manner. Undersampling of the $k$-space is employed to reduce the scan duration, thus increasing patient comfort and reducing the risk of motion artefacts, at the cost of reduced image quality. In this challenge paper, we investigate the use of a convolutional recurrent neural network (CRNN) architecture to exploit temporal correlations in supervised cine cardiac MRI reconstruction. This is combined with a single-image super-resolution refinement module to improve single coil reconstruction by 4.4\% in structural similarity and 3.9\% in normalised mean square error compared to a plain CRNN implementation. We deploy a high-pass filter to our $\ell_1$ loss to allow greater emphasis on high-frequency details which are missing in the original data. The proposed model demonstrates considerable enhancements compared to the baseline case and holds promising potential for further improving cardiac MRI reconstruction.
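As a minimal sketch of the high-pass-weighted loss mentioned in the abstract (an assumption about its form, not the challenge entry's code), a Laplacian filter can be applied before an extra $\ell_1$ term so that missing high-frequency details are emphasized:

```python
import torch
import torch.nn.functional as F

def highpass(x):
    # Laplacian high-pass filter applied per channel of a (B, C, H, W) image.
    k = torch.tensor([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]],
                     dtype=x.dtype, device=x.device)
    k = k.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1)
    return F.conv2d(x, k, padding=1, groups=x.shape[1])

def filtered_l1_loss(pred, target, w_hp=0.5):
    # Standard L1 plus an L1 on high-pass-filtered images to emphasize detail.
    return F.l1_loss(pred, target) + w_hp * F.l1_loss(highpass(pred), highpass(target))
```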

Beyond Grids: Exploring Elastic Input Sampling for Vision Transformers

  • paper_url: http://arxiv.org/abs/2309.13353
  • repo_url: None
  • paper_authors: Adam Pardyl, Grzegorz Kurzejamski, Jan Olszewski, Tomasz Trzciński, Bartosz Zieliński
  • for: This paper aims to improve the performance and efficiency of vision transformers in real-world applications by increasing their input elasticity.
  • methods: The paper formalizes the concept of input elasticity for vision transformers, proposes an evaluation protocol with dedicated metrics for measuring it, and introduces modifications to the transformer architecture and training regime that increase elasticity.
  • results: Extensive experiments highlight the opportunities and challenges associated with input sampling strategies and characterize how vision transformers perform under them.
    Abstract Vision transformers have excelled in various computer vision tasks but mostly rely on rigid input sampling using a fixed-size grid of patches. This limits their applicability in real-world problems, such as in the field of robotics and UAVs, where one can utilize higher input elasticity to boost model performance and efficiency. Our paper addresses this limitation by formalizing the concept of input elasticity for vision transformers and introducing an evaluation protocol, including dedicated metrics for measuring input elasticity. Moreover, we propose modifications to the transformer architecture and training regime, which increase its elasticity. Through extensive experimentation, we spotlight opportunities and challenges associated with input sampling strategies.

FedDrive v2: an Analysis of the Impact of Label Skewness in Federated Semantic Segmentation for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2309.13336
  • repo_url: https://github.com/Erosinho13/FedDrive
  • paper_authors: Eros Fanì, Marco Ciccone, Barbara Caputo
  • for: Investigating the impact of distribution skewness on semantic segmentation in autonomous driving using federated learning benchmarks.
  • methods: Propose six new federated scenarios to study the effect of label skewness on segmentation models and compare it with the effect of domain shift.
  • results: Study the impact of using domain information during testing.
    Abstract We propose FedDrive v2, an extension of the Federated Learning benchmark for Semantic Segmentation in Autonomous Driving. While the first version aims at studying the effect of domain shift of the visual features across clients, in this work, we focus on the distribution skewness of the labels. We propose six new federated scenarios to investigate how label skewness affects the performance of segmentation models and compare it with the effect of domain shift. Finally, we study the impact of using the domain information during testing. Official website: https://feddrive.github.io

Tackling the Incomplete Annotation Issue in Universal Lesion Detection Task By Exploratory Training

  • paper_url: http://arxiv.org/abs/2309.13306
  • repo_url: None
  • paper_authors: Xiaoyu Bai, Benteng Ma, Changyang Li, Yong Xia
  • for: This work aims to improve universal lesion detection, i.e., detecting various types of lesions across multiple organs in medical images.
  • methods: The study uses a deep learning detector with pseudo-label techniques to mine unannotated lesions from incompletely annotated data for retraining.
  • results: Experiments show that the proposed method outperforms existing methods on two medical image datasets.
    Abstract Universal lesion detection has great value for clinical practice as it aims to detect various types of lesions in multiple organs on medical images. Deep learning methods have shown promising results, but demanding large volumes of annotated data for training. However, annotating medical images is costly and requires specialized knowledge. The diverse forms and contrasts of objects in medical images make fully annotation even more challenging, resulting in incomplete annotations. Directly training ULD detectors on such datasets can yield suboptimal results. Pseudo-label-based methods examine the training data and mine unlabelled objects for retraining, which have shown to be effective to tackle this issue. Presently, top-performing methods rely on a dynamic label-mining mechanism, operating at the mini-batch level. However, the model's performance varies at different iterations, leading to inconsistencies in the quality of the mined labels and limits their performance enhancement. Inspired by the observation that deep models learn concepts with increasing complexity, we introduce an innovative exploratory training to assess the reliability of mined lesions over time. Specifically, we introduce a teacher-student detection model as basis, where the teacher's predictions are combined with incomplete annotations to train the student. Additionally, we design a prediction bank to record high-confidence predictions. Each sample is trained several times, allowing us to get a sequence of records for each sample. If a prediction consistently appears in the record sequence, it is likely to be a true object, otherwise it may just a noise. This serves as a crucial criterion for selecting reliable mined lesions for retraining. Our experimental results substantiate that the proposed framework surpasses state-of-the-art methods on two medical image datasets, demonstrating its superior performance.
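The prediction-bank idea from the abstract can be sketched roughly as follows (an illustrative assumption, not the authors' implementation): record high-confidence teacher boxes for each sample over several training passes and keep only the boxes that are consistently re-detected, matched by IoU.

```python
from collections import defaultdict

def iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

class PredictionBank:
    def __init__(self, iou_thr=0.5, min_hits=3):
        self.records = defaultdict(list)   # sample_id -> list of per-pass box lists
        self.iou_thr, self.min_hits = iou_thr, min_hits

    def update(self, sample_id, high_conf_boxes):
        self.records[sample_id].append(high_conf_boxes)

    def reliable_boxes(self, sample_id):
        passes = self.records[sample_id]
        latest = passes[-1] if passes else []
        keep = []
        for box in latest:
            hits = sum(any(iou(box, b) > self.iou_thr for b in p) for p in passes)
            if hits >= self.min_hits:      # consistently re-detected -> likely a true lesion
                keep.append(box)
        return keep
```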

C$^2$VAE: Gaussian Copula-based VAE Differing Disentangled from Coupled Representations with Contrastive Posterior

  • paper_url: http://arxiv.org/abs/2309.13303
  • repo_url: None
  • paper_authors: Zhangkai Wu, Longbing Cao
  • for: Learning disentangled latent factors while separating them from coupled (dependent) latent representations
  • methods: A self-supervised variational autoencoder (VAE) trained with contrastive optimization
  • results: Improves disentangled representation learning and addresses the instability of TC-based VAEs as well as the trade-off between reconstruction and representation
    Abstract We present a self-supervised variational autoencoder (VAE) to jointly learn disentangled and dependent hidden factors and then enhance disentangled representation learning by a self-supervised classifier to eliminate coupled representations in a contrastive manner. To this end, a Contrastive Copula VAE (C$^2$VAE) is introduced without relying on prior knowledge about data in the probabilistic principle and involving strong modeling assumptions on the posterior in the neural architecture. C$^2$VAE simultaneously factorizes the posterior (evidence lower bound, ELBO) with total correlation (TC)-driven decomposition for learning factorized disentangled representations and extracts the dependencies between hidden features by a neural Gaussian copula for copula coupled representations. Then, a self-supervised contrastive classifier differentiates the disentangled representations from the coupled representations, where a contrastive loss regularizes this contrastive classification together with the TC loss for eliminating entangled factors and strengthening disentangled representations. C$^2$VAE demonstrates a strong effect in enhancing disentangled representation learning. C$^2$VAE further contributes to improved optimization addressing the TC-based VAE instability and the trade-off between reconstruction and representation.

Gaining the Sparse Rewards by Exploring Binary Lottery Tickets in Spiking Neural Network

  • paper_url: http://arxiv.org/abs/2309.13302
  • repo_url: None
  • paper_authors: Hao Cheng, Jiahang Cao, Erjia Xiao, Pu Zhao, Mengshu Sun, Jiaxu Wang, Jize Zhang, Xue Lin, Bhavya Kailkhura, Kaidi Xu, Renjing Xu
  • for: This paper aims to explore the efficiency of Spiking Neural Networks (SNNs) by investigating the existence of Lottery Tickets (LTs) in binary SNNs and comparing the spiking mechanism with simple model binarization.
  • methods: The paper proposes a sparse training method called Binary Weights Spiking Lottery Tickets (BinW-SLT) to find LTs in binary SNNs under different network structures.
  • results: The paper shows that BinW-SLT can achieve up to +5.86% and +3.17% improvement on CIFAR-10 and CIFAR-100 compared with binary LTs, as well as achieve 1.86x and 8.92x energy saving compared with full-precision SNNs and ANNs.
    Abstract Spiking Neural Network (SNN) as a brain-inspired strategy receives lots of attention because of the high-sparsity and low-power properties derived from its inherent spiking information state. To further improve the efficiency of SNN, some works declare that the Lottery Tickets (LTs) Hypothesis, which indicates that the Artificial Neural Network (ANN) contains a subnetwork without sacrificing the performance of the original network, also exists in SNN. However, the spiking information handled by SNN has a natural similarity and affinity with binarization in sparsification. Therefore, to further explore SNN efficiency, this paper focuses on (1) the presence or absence of LTs in the binary SNN, and (2) whether the spiking mechanism is a superior strategy in terms of handling binary information compared to simple model binarization. To certify these consumptions, a sparse training method is proposed to find Binary Weights Spiking Lottery Tickets (BinW-SLT) under different network structures. Through comprehensive evaluations, we show that BinW-SLT could attain up to +5.86% and +3.17% improvement on CIFAR-10 and CIFAR-100 compared with binary LTs, as well as achieve 1.86x and 8.92x energy saving compared with full-precision SNN and ANN.
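For context, the snippet below sketches a generic iterative magnitude-pruning loop in the lottery-ticket style referenced above. It is a simplified, generic assumption rather than the BinW-SLT procedure; `train` is a hypothetical routine that trains the model with the masks applied.

```python
import copy
import torch
import torch.nn as nn

def find_ticket(model: nn.Module, train, rounds=3, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())            # weights to rewind to
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train(model, masks)                                    # train with masks applied
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            alive = p.detach().abs() * masks[name]
            k = int(prune_frac * int(masks[name].sum().item()))
            if k > 0:
                threshold = torch.kthvalue(alive[masks[name] > 0], k).values
                masks[name] = (alive > threshold).float()      # drop smallest surviving weights
        model.load_state_dict(init_state)                      # rewind survivors to init
    return masks
```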

MP-MVS: Multi-Scale Windows PatchMatch and Planar Prior Multi-View Stereo

  • paper_url: http://arxiv.org/abs/2309.13294
  • repo_url: https://github.com/rongxuantan/mp-mvs
  • paper_authors: Rongxuan Tan, Qing Wang, Xueyan Wang, Chao Yan, Yang Sun, Youyang Feng
  • for: Improving the accuracy of Multi-View Stereo (MVS)-based 3D reconstruction.
  • methods: The paper proposes a reliable multi-scale windows PatchMatch (mPM) to obtain dependable depth values in untextured areas. It also improves the existing checkerboard sampling scheme by restricting sampling to distant regions, which increases the efficiency of spatial propagation while mitigating outlier generation. Finally, it introduces an improved planar prior assisted PatchMatch (building on ACMP) that selects reliable triangulated vertices using geometric consistency information between views instead of photometric consistency.
  • results: The method is compared with several state-of-the-art approaches on the ETH3D high-resolution multi-view benchmark, and the results show that it reaches state-of-the-art performance.
    Abstract Significant strides have been made in enhancing the accuracy of Multi-View Stereo (MVS)-based 3D reconstruction. However, untextured areas with unstable photometric consistency often remain incompletely reconstructed. In this paper, we propose a resilient and effective multi-view stereo approach (MP-MVS). We design a multi-scale windows PatchMatch (mPM) to obtain reliable depth of untextured areas. In contrast with other multi-scale approaches, which is faster and can be easily extended to PatchMatch-based MVS approaches. Subsequently, we improve the existing checkerboard sampling schemes by limiting our sampling to distant regions, which can effectively improve the efficiency of spatial propagation while mitigating outlier generation. Finally, we introduce and improve planar prior assisted PatchMatch of ACMP. Instead of relying on photometric consistency, we utilize geometric consistency information between multi-views to select reliable triangulated vertices. This strategy can obtain a more accurate planar prior model to rectify photometric consistency measurements. Our approach has been tested on the ETH3D High-res multi-view benchmark with several state-of-the-art approaches. The results demonstrate that our approach can reach the state-of-the-art. The associated codes will be accessible at https://github.com/RongxuanTan/MP-MVS.

Domain-Guided Conditional Diffusion Model for Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2309.14360
  • repo_url: None
  • paper_authors: Yulong Zhang, Shuhao Chen, Weisen Jiang, Yu Zhang, Jiangang Lu, James T. Kwok
  • for: Improving the performance of deep learning models in new application scenarios by strengthening Unsupervised Domain Adaptation (UDA).
  • methods: The paper proposes the DomAin-guided Conditional Diffusion Model (DACDM), which controls the labels of generated samples by introducing class information and adds a domain classifier to guide generation toward the target domain, producing high-fidelity and diverse target-domain samples that make it easier for existing UDA methods to transfer from the source domain to the target domain.
  • results: Experiments show that DACDM brings a large improvement to the performance of existing UDA methods on various benchmarks.
    Abstract Limited transferability hinders the performance of deep learning models when applied to new application scenarios. Recently, Unsupervised Domain Adaptation (UDA) has achieved significant progress in addressing this issue via learning domain-invariant features. However, the performance of existing UDA methods is constrained by the large domain shift and limited target domain data. To alleviate these issues, we propose DomAin-guided Conditional Diffusion Model (DACDM) to generate high-fidelity and diversity samples for the target domain. In the proposed DACDM, by introducing class information, the labels of generated samples can be controlled, and a domain classifier is further introduced in DACDM to guide the generated samples for the target domain. The generated samples help existing UDA methods transfer from the source domain to the target domain more easily, thus improving the transfer performance. Extensive experiments on various benchmarks demonstrate that DACDM brings a large improvement to the performance of existing UDA methods.
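A schematic sketch of how a domain classifier can steer a conditional diffusion sampler, in the classifier-guidance style (an assumption about the mechanism, not the DACDM code): `diffusion_model`, `domain_classifier`, and the noise scale `sigma_t` are hypothetical.

```python
import torch

def domain_guided_eps(x_t, t, y_class, diffusion_model, domain_classifier,
                      sigma_t, guidance_scale=1.0, target_domain=1):
    """Nudge the class-conditional noise prediction with the gradient of a domain
    classifier so generated samples drift toward the target domain."""
    eps = diffusion_model(x_t, t, y_class)                # class-conditional noise prediction
    x_in = x_t.detach().requires_grad_(True)
    log_p = torch.log_softmax(domain_classifier(x_in, t), dim=-1)[:, target_domain].sum()
    grad = torch.autograd.grad(log_p, x_in)[0]            # direction toward the target domain
    return eps - guidance_scale * sigma_t * grad
```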

Automatic Reverse Engineering: Creating computer-aided design (CAD) models from multi-view images

  • paper_url: http://arxiv.org/abs/2309.13281
  • repo_url: None
  • paper_authors: Henrik Jobczyk, Hanno Homann
  • for: automated reverse engineering task
  • methods: combines three distinct stages: a convolutional neural network encoder, a multi-view pooling stage, and a transformer-based CAD sequence generator
  • results: successfully reconstructed valid CAD models from simulated test image data, and demonstrated some capabilities in a real-world test with actual photographs of three-dimensional test objects, though still limited to basic shapes.
    Abstract Generation of computer-aided design (CAD) models from multi-view images may be useful in many practical applications. To date, this problem is usually solved with an intermediate point-cloud reconstruction and involves manual work to create the final CAD models. In this contribution, we present a novel network for an automated reverse engineering task. Our network architecture combines three distinct stages: A convolutional neural network as the encoder stage, a multi-view pooling stage and a transformer-based CAD sequence generator. The model is trained and evaluated on a large number of simulated input images and extensive optimization of model architectures and hyper-parameters is performed. A proof-of-concept is demonstrated by successfully reconstructing a number of valid CAD models from simulated test image data. Various accuracy metrics are calculated and compared to a state-of-the-art point-based network. Finally, a real world test is conducted supplying the network with actual photographs of two three-dimensional test objects. It is shown that some of the capabilities of our network can be transferred to this domain, even though the training exclusively incorporates purely synthetic training data. However to date, the feasible model complexity is still limited to basic shapes.

Discwise Active Learning for LiDAR Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2309.13276
  • repo_url: None
  • paper_authors: Ozan Unal, Dengxin Dai, Ali Tamer Unal, Luc Van Gool
  • For: This paper explores the use of active learning (AL) for LiDAR semantic segmentation, with a focus on improving annotation efficiency and reducing costs.
  • Methods: The proposed method, called DiAL, uses a discwise approach to query the region covered by a single frame on global coordinates, and labels all frames simultaneously. It also addresses two major challenges in discwise AL: a new acquisition function that takes 3D point density changes into consideration, and a mixed-integer linear program to select multiple frames while avoiding disc intersections.
  • Results: The proposed method is evaluated on a real-world LiDAR dataset, and shows improved performance and efficiency compared to traditional sequential labeling methods. Additionally, a semi-supervised learning approach is proposed to utilize all frames within the dataset and further improve performance.
    Abstract While LiDAR data acquisition is easy, labeling for semantic segmentation remains highly time consuming and must therefore be done selectively. Active learning (AL) provides a solution that can iteratively and intelligently label a dataset while retaining high performance and a low budget. In this work we explore AL for LiDAR semantic segmentation. As a human expert is a component of the pipeline, a practical framework must consider common labeling techniques such as sequential labeling that drastically improve annotation times. We therefore propose a discwise approach (DiAL), where in each iteration, we query the region a single frame covers on global coordinates, labeling all frames simultaneously. We then tackle the two major challenges that emerge with discwise AL. Firstly we devise a new acquisition function that takes 3D point density changes into consideration which arise due to location changes or ego-vehicle motion. Next we solve a mixed-integer linear program that provides a general solution to the selection of multiple frames while taking into consideration the possibilities of disc intersections. Finally we propose a semi-supervised learning approach to utilize all frames within our dataset and improve performance.

GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER

  • paper_url: http://arxiv.org/abs/2309.13274
  • repo_url: https://github.com/iva-mzsun/glober
  • paper_authors: Mingzhen Sun, Weining Wang, Zihan Qin, Jiahui Sun, Sihan Chen, Jing Liu
  • for: This paper proposes a novel non-autoregressive video generation method that improves both the global coherence and the local realism of generated videos.
  • methods: The method first generates global features to obtain comprehensive global guidance, and then synthesizes video frames non-autoregressively based on those features. Specifically, a video auto-encoder encodes videos into global features, and a diffusion-based video decoder decodes the global features and synthesizes frames in a non-autoregressive manner.
  • results: Experiments show that the proposed method is effective and efficient, achieving new state-of-the-art results on multiple benchmarks.
    Abstract Video generation necessitates both global coherence and local realism. This work presents a novel non-autoregressive method GLOBER, which first generates global features to obtain comprehensive global guidance and then synthesizes video frames based on the global features to generate coherent videos. Specifically, we propose a video auto-encoder, where a video encoder encodes videos into global features, and a video decoder, built on a diffusion model, decodes the global features and synthesizes video frames in a non-autoregressive manner. To achieve maximum flexibility, our video decoder perceives temporal information through normalized frame indexes, which enables it to synthesize arbitrary sub video clips with predetermined starting and ending frame indexes. Moreover, a novel adversarial loss is introduced to improve the global coherence and local realism between the synthesized video frames. Finally, we employ a diffusion-based video generator to fit the global features outputted by the video encoder for video generation. Extensive experimental results demonstrate the effectiveness and efficiency of our proposed method, and new state-of-the-art results have been achieved on multiple benchmarks.

Randomize to Generalize: Domain Randomization for Runway FOD Detection

  • paper_url: http://arxiv.org/abs/2309.13264
  • repo_url: https://github.com/hieu9955/ggggg
  • paper_authors: Javaria Farooq, Nayyer Aafaq, M Khizer Ali Khan, Ammar Saleem, M Ibraheem Siddiqui
  • for: The paper aims to improve object detection in low-resolution images with small objects and diverse backgrounds, which is challenging for existing methods.
  • methods: The proposed method, Synthetic Randomized Image Augmentation (SRIA), consists of two stages: weakly supervised pixel-level segmentation mask generation and batch-wise synthesis of artificial images with diverse augmentations.
  • results: The proposed method significantly improves object detection accuracy on out-of-distribution (OOD) test sets, with a reported improvement from 41% to 92%. The method also outperforms several state-of-the-art (SOTA) models, including CenterNet, SSD, YOLOv3, YOLOv4, YOLOv5, and Outer Vit, on a publicly available foreign object debris (FOD) dataset.
    Abstract Tiny Object Detection is challenging due to small size, low resolution, occlusion, background clutter, lighting conditions and small object-to-image ratio. Further, object detection methodologies often make underlying assumption that both training and testing data remain congruent. However, this presumption often leads to decline in performance when model is applied to out-of-domain(unseen) data. Techniques like synthetic image generation are employed to improve model performance by leveraging variations in input data. Such an approach typically presumes access to 3D-rendered datasets. In contrast, we propose a novel two-stage methodology Synthetic Randomized Image Augmentation (SRIA), carefully devised to enhance generalization capabilities of models encountering 2D datasets, particularly with lower resolution which is more practical in real-world scenarios. The first stage employs a weakly supervised technique to generate pixel-level segmentation masks. Subsequently, the second stage generates a batch-wise synthesis of artificial images, carefully designed with an array of diverse augmentations. The efficacy of proposed technique is illustrated on challenging foreign object debris (FOD) detection. We compare our results with several SOTA models including CenterNet, SSD, YOLOv3, YOLOv4, YOLOv5, and Outer Vit on a publicly available FOD-A dataset. We also construct an out-of-distribution test set encompassing 800 annotated images featuring a corpus of ten common categories. Notably, by harnessing merely 1.81% of objects from source training data and amalgamating with 29 runway background images, we generate 2227 synthetic images. Subsequent model retraining via transfer learning, utilizing enriched dataset generated by domain randomization, demonstrates significant improvement in detection accuracy. We report that detection accuracy improved from an initial 41% to 92% for OOD test set.
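A simplified sketch of the second, batch-wise synthesis stage (an assumption, not the paper's pipeline): segmented foreground objects are composited onto background images with random placement, scale, and flipping.

```python
import random
from PIL import Image

def composite(background: Image.Image, obj: Image.Image, mask: Image.Image) -> Image.Image:
    """Paste one segmented object crop onto a background at a random scale and location."""
    scale = random.uniform(0.5, 1.5)
    size = (max(1, int(obj.width * scale)), max(1, int(obj.height * scale)))
    obj, mask = obj.resize(size), mask.resize(size)
    if random.random() < 0.5:                               # random horizontal flip
        obj = obj.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
        mask = mask.transpose(Image.Transpose.FLIP_LEFT_RIGHT)
    x = random.randint(0, max(0, background.width - obj.width))
    y = random.randint(0, max(0, background.height - obj.height))
    out = background.copy()
    out.paste(obj, (x, y), mask)                            # the mask acts as the alpha channel
    return out
```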

Order-preserving Consistency Regularization for Domain Adaptation and Generalization

  • paper_url: http://arxiv.org/abs/2309.13258
  • repo_url: None
  • paper_authors: Mengmeng Jing, Xiantong Zhen, Jingjing Li, Cees Snoek
  • for: Improving the robustness of deep learning models on cross-domain tasks so that they are not affected by domain-specific attributes.
  • methods: Data augmentation coupled with an order-preserving consistency regularization that makes the model less sensitive to domain-specific attributes.
  • results: Comprehensive experiments on five different cross-domain tasks show clear advantages.
    Abstract Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lightning, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization are commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization enforces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks.

RTrack: Accelerating Convergence for Visual Object Tracking via Pseudo-Boxes Exploration

  • paper_url: http://arxiv.org/abs/2309.13257
  • repo_url: None
  • paper_authors: Guotian Zeng, Bi Zeng, Hong Zhang, Jianqi Liu, Qingmao Wei
  • for: To improve single object tracking (SOT) performance while reducing training time and bringing SOT closer to the object detection (OD) task.
  • methods: Uses a set of sample points to obtain a pseudo bounding box, automatically arranging the points to define spatial extents and highlight local areas.
  • results: Achieves performance competitive with state-of-the-art (SOTA) trackers on the GOT-10k dataset while reducing training time to 10% of previous SOTA trackers' costs, bringing SOT closer to OD and converging faster.
    Abstract Single object tracking (SOT) heavily relies on representing the target object as a bounding box. However, due to the potential deformation and rotation of tracked targets, a genuine bounding box fails to capture appearance information explicitly and introduces cluttered background. This paper proposes RTrack, a novel object-representation baseline tracker that utilizes a set of sample points to obtain a pseudo bounding box. RTrack automatically arranges these points to define the spatial extents and highlight local areas. Building upon this baseline, we conducted an in-depth exploration of the training potential and introduced a one-to-many leading assignment strategy. Notably, our approach achieves performance competitive with state-of-the-art trackers on the GOT-10k dataset while reducing training time to just 10% of that of previous state-of-the-art (SOTA) trackers. This substantial reduction in training cost brings single-object tracking (SOT) closer to the object detection (OD) task. Extensive experiments demonstrate that our proposed RTrack achieves SOTA results with faster convergence.
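
The pseudo-box idea can be illustrated with a toy decoder that turns a set of predicted sample points into an axis-aligned box via their spatial extents. This is a schematic reading of the abstract, not the RTrack head.

```python
import torch


def points_to_pseudo_box(points: torch.Tensor) -> torch.Tensor:
    """points: (batch, num_points, 2) predicted (x, y) sample points.

    Returns (batch, 4) boxes as (x_min, y_min, x_max, y_max)."""
    x_min, _ = points[..., 0].min(dim=-1)
    y_min, _ = points[..., 1].min(dim=-1)
    x_max, _ = points[..., 0].max(dim=-1)
    y_max, _ = points[..., 1].max(dim=-1)
    return torch.stack([x_min, y_min, x_max, y_max], dim=-1)


# Example: 8 targets, 9 sample points each.
boxes = points_to_pseudo_box(torch.rand(8, 9, 2))  # -> shape (8, 4)
```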

Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation

  • paper_url: http://arxiv.org/abs/2309.13248
  • repo_url: https://github.com/kfan21/eoras
  • paper_authors: Ke Fan, Jingshi Lei, Xuelin Qian, Miaopeng Yu, Tianjun Xiao, Tong He, Zheng Zhang, Yanwei Fu
  • for: To propose an efficient object-centric approach for video amodal segmentation.
  • methods: Leverages supervised signals with object-centric representations in real-world scenarios: a translation module projects image features into the Bird's-Eye View (BEV) to introduce 3D information, and a multi-view fusion layer with object slots lets features from different views interact through attention to complete the object representation.
  • results: Experiments on both real-world and synthetic benchmarks demonstrate state-of-the-art performance.
    Abstract Video amodal segmentation is a particularly challenging task in computer vision, which requires deducing the full shape of an object from its visible parts. Recently, some studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting. However, motion flow is clearly limited by two factors: moving cameras and object deformation. This paper presents a rethinking of previous works. We particularly leverage supervised signals with object-centric representation in real-world scenarios. The underlying idea is that the supervision signal of the specific object and the features from different views can mutually benefit the deduction of the full mask in any specific frame. We thus propose Efficient object-centric Representation amodal Segmentation (EoRaS). Specifically, beyond solely relying on supervision signals, we design a translation module to project image features into the Bird's-Eye View (BEV), which introduces 3D information to improve current feature quality. Furthermore, we propose a multi-view-fusion-based temporal module which is equipped with a set of object slots and interacts with features from different views via an attention mechanism to complete the object representation. As a result, the full mask of the object can be decoded from image features updated by object slots. Extensive experiments on both real-world and synthetic benchmarks demonstrate the superiority of our proposed method, achieving state-of-the-art performance. Our code will be released at https://github.com/kfan21/EoRaS.
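
A rough sketch of the multi-view fusion layer described above: a fixed set of learnable object slots queries features from several views (e.g., image and BEV features) through cross-attention. Module and argument names are assumptions for illustration, not the EoRaS code.

```python
import torch
import torch.nn as nn


class MultiViewSlotFusion(nn.Module):
    def __init__(self, dim: int = 256, num_slots: int = 8, num_heads: int = 4):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))  # learnable object slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, view_feats):
        """view_feats: list of (batch, tokens, dim) features, one entry per view
        (e.g., front-view image features and BEV-projected features)."""
        batch = view_feats[0].shape[0]
        slots = self.slots.unsqueeze(0).expand(batch, -1, -1)
        tokens = torch.cat(view_feats, dim=1)        # concatenate tokens over views
        fused, _ = self.attn(slots, tokens, tokens)  # slots attend to all views
        return fused                                 # (batch, num_slots, dim)


# Example with two views of 196 tokens each.
fusion = MultiViewSlotFusion(dim=256)
slot_repr = fusion([torch.rand(2, 196, 256), torch.rand(2, 196, 256)])
```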

Multi-modal Domain Adaptation for REG via Relation Transfer

  • paper_url: http://arxiv.org/abs/2309.13247
  • repo_url: None
  • paper_authors: Yifan Ding, Liqiang Wang, Boqing Gong
  • for: To propose a novel multi-modal knowledge transfer approach for the Referring Expression Grounding (REG) problem.
  • methods: A relation-tailored approach that simultaneously enriches inter-domain relations and transfers relations between domains to improve multi-modal transferability.
  • results: Experiments show significantly improved transferability across multi-modal domains and better region localization accuracy on the REG problem.
    Abstract Domain adaptation, which aims to transfer knowledge between domains, has been well studied in many areas such as image classification and object detection. However, for multi-modal tasks, conventional approaches rely on large-scale pre-training. But due to the difficulty of acquiring multi-modal data, large-scale pre-training is often impractical. Therefore, domain adaptation, which can efficiently utilize the knowledge from different datasets (domains), is crucial for multi-modal tasks. In this paper, we focus on the Referring Expression Grounding (REG) task, which is to localize an image region described by a natural language expression. Specifically, we propose a novel approach to effectively transfer multi-modal knowledge through a specially relation-tailored approach for the REG problem. Our approach tackles the multi-modal domain adaptation problem by simultaneously enriching inter-domain relations and transferring relations between domains. Experiments show that our proposed approach significantly improves the transferability of multi-modal domains and enhances adaptation performance in the REG problem.

RBFormer: Improve Adversarial Robustness of Transformer by Robust Bias

  • paper_url: http://arxiv.org/abs/2309.13245
  • repo_url: None
  • paper_authors: Hao Cheng, Jinhao Duan, Hui Li, Lyutianyang Zhang, Jiahang Cao, Ping Wang, Jize Zhang, Kaidi Xu, Renjing Xu
  • for: To investigate the intrinsic robustness of Transformer-based structures rather than introduce novel defense measures against adversarial attacks.
  • methods: A rational structure design approach that mitigates susceptibility to robustness issues, specifically by increasing the proportion of high-frequency structural robust biases.
  • results: Compared to several existing baseline structures, RBFormer shows robust superiority, with improvements of +16.12% and +5.04% on the evaluation criteria for CIFAR-10 and ImageNet-1k, respectively.
    Abstract Recently, there has been a surge of interest and attention in Transformer-based structures, such as Vision Transformer (ViT) and Vision Multilayer Perceptron (VMLP). Compared with the previous convolution-based structures, the Transformer-based structure under investigation showcases a comparable or superior performance under its distinctive attention-based input token mixer strategy. Introducing adversarial examples as a robustness consideration has had a profound and detrimental impact on the performance of well-established convolution-based structures. This inherent vulnerability to adversarial attacks has also been demonstrated in Transformer-based structures. In this paper, our emphasis lies on investigating the intrinsic robustness of the structure rather than introducing novel defense measures against adversarial attacks. To address the susceptibility to robustness issues, we employ a rational structure design approach to mitigate such vulnerabilities. Specifically, we enhance the adversarial robustness of the structure by increasing the proportion of high-frequency structural robust biases. As a result, we introduce a novel structure called Robust Bias Transformer-based Structure (RBFormer) that shows robust superiority compared to several existing baseline structures. Through a series of extensive experiments, RBFormer outperforms the original structures by a significant margin, achieving an impressive improvement of +16.12% and +5.04% across different evaluation criteria on CIFAR-10 and ImageNet-1k, respectively.
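
The abstract attributes the robustness gain to a larger share of high-frequency structural biases in the token mixer. Purely as a generic illustration of that idea (not the RBFormer design), the sketch below places a depthwise high-pass branch next to a standard attention mixer and blends the two with a learnable ratio.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HighFreqBiasedMixer(nn.Module):
    """A generic token mixer that blends self-attention with a depthwise
    high-pass branch, weighted by a learnable ratio."""

    def __init__(self, dim: int = 384, num_heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.highpass = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # share of the high-frequency branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, tokens, dim) token sequence."""
        attn_out, _ = self.attn(x, x, x)
        # High-frequency residual: token signal minus its local (low-pass) average.
        xt = x.transpose(1, 2)                                   # (batch, dim, tokens)
        local_avg = F.avg_pool1d(xt, kernel_size=3, stride=1, padding=1)
        high = self.highpass(xt - local_avg).transpose(1, 2)
        return (1 - self.alpha) * attn_out + self.alpha * high


mixer = HighFreqBiasedMixer()
out = mixer(torch.rand(2, 197, 384))  # -> (2, 197, 384)
```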

NeRF-Enhanced Outpainting for Faithful Field-of-View Extrapolation

  • paper_url: http://arxiv.org/abs/2309.13240
  • repo_url: None
  • paper_authors: Rui Yu, Jiachen Liu, Zihan Zhou, Sharon X. Huang
  • for: Expanding the camera's field of view improves environmental perception in applications such as robotic navigation and remote visual assistance.
  • methods: Generates extended-FOV images with NeRF and uses them to train a scene-specific image outpainting model.
  • results: Comprehensive evaluations on three photorealistic datasets and one real-world dataset demonstrate the robustness and potential of the method.
    Abstract In various applications, such as robotic navigation and remote visual assistance, expanding the field of view (FOV) of the camera proves beneficial for enhancing environmental perception. Unlike image outpainting techniques aimed solely at generating aesthetically pleasing visuals, these applications demand an extended view that faithfully represents the scene. To achieve this, we formulate a new problem of faithful FOV extrapolation that utilizes a set of pre-captured images as prior knowledge of the scene. To address this problem, we present a simple yet effective solution called NeRF-Enhanced Outpainting (NEO) that uses extended-FOV images generated through NeRF to train a scene-specific image outpainting model. To assess the performance of NEO, we conduct comprehensive evaluations on three photorealistic datasets and one real-world dataset. Extensive experiments on the benchmark datasets showcase the robustness and potential of our method in addressing this challenge. We believe our work lays a strong foundation for future exploration within the research community.
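
A hedged sketch of the second stage only: assuming extended-FOV frames rendered by a scene-specific NeRF are already available as tensors, an outpainting network is trained to reconstruct the periphery from the central (original-FOV) region. The tiny network and masked L1 loss below are generic placeholders, not the paper's model.

```python
import torch
import torch.nn as nn


def central_mask(h: int, w: int, keep: float = 0.5) -> torch.Tensor:
    """1 inside the original narrow FOV, 0 in the region to be outpainted."""
    mask = torch.zeros(1, 1, h, w)
    dh, dw = int(h * keep / 2), int(w * keep / 2)
    mask[..., h // 2 - dh:h // 2 + dh, w // 2 - dw:w // 2 + dw] = 1.0
    return mask


# Stand-in for a real outpainting network; input is the masked image plus the mask.
outpainter = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(outpainter.parameters(), lr=1e-4)

extended = torch.rand(8, 3, 128, 128)                 # NeRF-rendered wide-FOV frames
mask = central_mask(128, 128).expand(8, -1, -1, -1)   # known central region

masked_input = torch.cat([extended * mask, mask], dim=1)
pred = outpainter(masked_input)
loss = ((pred - extended).abs() * (1 - mask)).mean()  # supervise only the outpainted region
optimizer.zero_grad()
loss.backward()
optimizer.step()
```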

Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2309.13237
  • repo_url: https://github.com/hcplab-sysu/stket
  • paper_authors: Tao Pu, Tianshui Chen, Hefeng Wu, Yongyi Lu, Liang Lin
  • for: VidSGG aims to identify objects in visual scenes and infer their relationships for a given video.
  • methods: The proposed method, STKET, incorporates prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations.
  • results: STKET outperforms current competing algorithms by a large margin, with improvements of 8.1%, 4.7%, and 2.1% on different settings.
    Abstract Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video. It requires not only a comprehensive understanding of each object scattered on the whole scene but also a deep dive into their temporal motions and interactions. Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images, which can serve as prior knowledge to facilitate VidSGG model learning and inference. In this work, we propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Specifically, we first learn spatial co-occurrence and temporal transition correlations in a statistical manner. Then, we design spatial and temporal knowledge-embedded layers that introduce the multi-head cross-attention mechanism to fully explore the interaction between visual representation and the knowledge to generate spatial- and temporal-embedded representations, respectively. Finally, we aggregate these representations for each subject-object pair to predict the final semantic labels and their relationships. Extensive experiments show that STKET outperforms current competing algorithms by a large margin, e.g., improving the mR@50 by 8.1%, 4.7%, and 2.1% on different settings over current algorithms.
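
One schematic way to fold statistical priors into cross-attention, in the spirit of the description above: a co-occurrence or transition matrix estimated from training annotations is embedded and used as the key/value sequence that visual relationship queries attend to. Dimensions and module names are illustrative assumptions, not the STKET architecture.

```python
import torch
import torch.nn as nn


class KnowledgeEmbeddedAttention(nn.Module):
    def __init__(self, num_predicates: int, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.know_embed = nn.Linear(num_predicates, dim)  # embeds the prior statistics
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, rel_queries: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        """rel_queries: (batch, num_pairs, dim) visual subject-object pair features.
        prior: (num_pairs, num_predicates) spatial co-occurrence or temporal
        transition statistics for each candidate pair."""
        know = self.know_embed(prior).unsqueeze(0).expand(rel_queries.size(0), -1, -1)
        refined, _ = self.attn(rel_queries, know, know)  # queries attend to the prior
        return refined + rel_queries                     # residual connection


# Example: 20 candidate pairs, 50 predicate classes.
layer = KnowledgeEmbeddedAttention(num_predicates=50)
refined = layer(torch.rand(2, 20, 256), torch.rand(20, 50))
```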

M$^3$CS: Multi-Target Masked Point Modeling with Learnable Codebook and Siamese Decoders

  • paper_url: http://arxiv.org/abs/2309.13235
  • repo_url: None
  • paper_authors: Qibo Qiu, Honghui Yang, Wenxiao Wang, Shun Zhang, Haiming Gao, Haochao Ying, Wei Hua, Xiaofei He
  • for: To improve self-supervised pre-training for point clouds by equipping the model with both low- and high-level representation capabilities that capture geometric details and semantic context.
  • methods: Takes a masked point cloud as input and introduces two decoders that simultaneously predict masked representations and the original points; siamese decoders keep the number of learnable parameters unchanged, and an online codebook projects continuous tokens into discrete ones.
  • results: Experiments show that M$^3$CS outperforms existing methods on both classification and segmentation tasks.
    Abstract Masked point modeling has become a promising scheme of self-supervised pre-training for point clouds. Existing methods reconstruct either the original points or related features as the objective of pre-training. However, considering the diversity of downstream tasks, the model needs both low- and high-level representation modeling capabilities to capture geometric details and semantic contexts during pre-training. To this end, M$^3$CS is proposed to equip the model with these abilities. Specifically, with a masked point cloud as input, M$^3$CS introduces two decoders to predict masked representations and the original points simultaneously. While an extra decoder doubles the parameters of the decoding process and may lead to overfitting, we propose siamese decoders to keep the number of learnable parameters unchanged. Further, we propose an online codebook that projects continuous tokens into discrete ones before reconstructing masked points. In this way, the decoder must work through combinations of tokens rather than remembering each token. Comprehensive experiments show that M$^3$CS achieves superior performance on both classification and segmentation tasks, outperforming existing methods.
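
A minimal sketch of the online-codebook step, following standard vector-quantization practice rather than the paper's exact formulation: continuous tokens are mapped to their nearest learnable code, giving discrete targets while a straight-through estimator keeps the encoder trainable.

```python
import torch
import torch.nn as nn


class OnlineCodebook(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 384):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))  # learnable codebook

    def forward(self, tokens: torch.Tensor):
        """tokens: (batch, n, dim) continuous token features.

        Returns (discrete code indices, quantized tokens)."""
        b, n, d = tokens.shape
        flat = tokens.reshape(-1, d)                 # (b * n, dim)
        dist = torch.cdist(flat, self.codes)         # distances to every code
        idx = dist.argmin(dim=-1).view(b, n)         # discrete token ids
        quantized = self.codes[idx]                  # (batch, n, dim)
        # Straight-through estimator so gradients still reach the encoder.
        quantized = tokens + (quantized - tokens).detach()
        return idx, quantized


codebook = OnlineCodebook()
ids, targets = codebook(torch.rand(4, 1024, 384))
```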

Real3D-AD: A Dataset of Point Cloud Anomaly Detection

  • paper_url: http://arxiv.org/abs/2309.13226
  • repo_url: https://github.com/m-3lab/real3d-ad
  • paper_authors: Jiaqi Liu, Guoyang Xie, Ruitao Chen, Xinpeng Li, Jinbao Wang, Yong Liu, Chengjie Wang, Feng Zheng
  • for: High-precision point cloud anomaly detection is the gold standard for identifying defects in precision manufacturing and machining.
  • methods: Proposes Reg3D-AD, a registration-based 3D anomaly detection method with a novel feature memory bank that preserves local and global representations.
  • results: Extensive experiments on Real3D-AD demonstrate the effectiveness of Reg3D-AD; Real3D-AD is the largest high-precision 3D industrial anomaly detection dataset to date, with 1,254 high-resolution 3D items ranging from forty thousand to millions of points each.
    Abstract High-precision point cloud anomaly detection is the gold standard for identifying defects in advanced machining and precision manufacturing. Despite some methodological advances in this area, the scarcity of datasets and the lack of a systematic benchmark hinder its development. We introduce Real3D-AD, a challenging high-precision point cloud anomaly detection dataset, addressing the limitations in the field. With 1,254 high-resolution 3D items of forty thousand to millions of points each, Real3D-AD is the largest dataset for high-precision 3D industrial anomaly detection to date. Real3D-AD surpasses existing 3D anomaly detection datasets in point cloud resolution (0.0010mm-0.0015mm), 360-degree coverage, and the availability of perfect prototypes. Additionally, we present a comprehensive benchmark for Real3D-AD, revealing the absence of baseline methods for high-precision point cloud anomaly detection. To address this, we propose Reg3D-AD, a registration-based 3D anomaly detection method incorporating a novel feature memory bank that preserves local and global representations. Extensive experiments on the Real3D-AD dataset highlight the effectiveness of Reg3D-AD. For reproducibility and accessibility, we provide the Real3D-AD dataset, benchmark source code, and Reg3D-AD on our website: https://github.com/M-3LAB/Real3D-AD.
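
A hedged sketch of memory-bank anomaly scoring in the spirit of Reg3D-AD: features extracted from anomaly-free (registered) training point clouds are stored, and each test point feature is scored by its distance to the nearest stored feature. Feature extraction and registration are assumed to happen upstream and are not shown; the coreset sampling here is a simple random stand-in.

```python
import torch


def build_memory_bank(train_feats, max_size: int = 10000) -> torch.Tensor:
    """train_feats: list of (n_i, dim) feature tensors from anomaly-free samples."""
    bank = torch.cat(train_feats, dim=0)
    if bank.shape[0] > max_size:                    # simple random coreset stand-in
        keep = torch.randperm(bank.shape[0])[:max_size]
        bank = bank[keep]
    return bank


def anomaly_scores(test_feats: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
    """test_feats: (n, dim). Returns a per-point anomaly score of shape (n,)."""
    dist = torch.cdist(test_feats, bank)            # distance to every stored feature
    return dist.min(dim=-1).values                  # nearest-normal-feature distance


bank = build_memory_bank([torch.rand(4000, 128) for _ in range(5)])
scores = anomaly_scores(torch.rand(2000, 128), bank)  # object-level score: scores.max()
```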