cs.CV - 2023-10-02

STARS: Zero-shot Sim-to-Real Transfer for Segmentation of Shipwrecks in Sonar Imagery

  • paper_url: http://arxiv.org/abs/2310.01667
  • repo_url: None
  • paper_authors: Advaith Venkatramanan Sethuraman, Katherine A. Skinner
  • for: This paper addresses zero-shot sim-to-real transfer for object segmentation, i.e., transferring a segmentation model trained purely in simulation to real data when no real examples of the object of interest are available during training, applied to shipwreck segmentation in side scan sonar imagery.
  • methods: The authors propose a new segmentation network, STARS, which fuses a predicted deformation field and an anomaly volume so that the model generalizes better to real side scan sonar images and achieves more effective zero-shot sim-to-real transfer.
  • results: Evaluated on a real, expert-labeled side scan sonar dataset of shipwrecks, the method, trained entirely in simulation, provides a 20% increase in segmentation performance for the targeted shipwreck class over the best baseline.
    Abstract In this paper, we address the problem of sim-to-real transfer for object segmentation when there is no access to real examples of an object of interest during training, i.e. zero-shot sim-to-real transfer for segmentation. We focus on the application of shipwreck segmentation in side scan sonar imagery. Our novel segmentation network, STARS, addresses this challenge by fusing a predicted deformation field and anomaly volume, allowing it to generalize better to real sonar images and achieve more effective zero-shot sim-to-real transfer for image segmentation. We evaluate the sim-to-real transfer capabilities of our method on a real, expert-labeled side scan sonar dataset of shipwrecks collected from field work surveys with an autonomous underwater vehicle (AUV). STARS is trained entirely in simulation and performs zero-shot shipwreck segmentation with no additional fine-tuning on real data. Our method provides a significant 20% increase in segmentation performance for the targeted shipwreck class compared to the best baseline.
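    The fusion of a predicted deformation field with an anomaly volume is the core architectural idea here. The sketch below is a hypothetical PyTorch illustration of that idea, not the authors' implementation: a small module predicts a per-pixel deformation field, warps the feature map with it, concatenates a predicted anomaly map, and decodes a segmentation mask. Module names, channel counts, and the single-scale design are assumptions.

```python
# Hypothetical sketch of the fusion idea described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformAnomalyFusion(nn.Module):
    def __init__(self, feat_ch=64):
        super().__init__()
        self.deform_head = nn.Conv2d(feat_ch, 2, kernel_size=3, padding=1)   # per-pixel (dx, dy) offsets
        self.anomaly_head = nn.Conv2d(feat_ch, 1, kernel_size=3, padding=1)  # anomaly score map
        self.seg_head = nn.Conv2d(feat_ch + 1, 1, kernel_size=1)             # fused features -> mask logits

    def forward(self, feats):
        b, _, h, w = feats.shape
        # Build an identity sampling grid in [-1, 1] and offset it by the predicted field.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feats.device),
            torch.linspace(-1, 1, w, device=feats.device),
            indexing="ij",
        )
        identity = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        offsets = self.deform_head(feats).permute(0, 2, 3, 1)   # (B, H, W, 2)
        warped = F.grid_sample(feats, identity + offsets, align_corners=True)
        anomaly = torch.sigmoid(self.anomaly_head(feats))       # (B, 1, H, W)
        fused = torch.cat([warped, anomaly], dim=1)
        return self.seg_head(fused)                              # shipwreck mask logits

logits = DeformAnomalyFusion()(torch.randn(2, 64, 128, 128))
print(logits.shape)  # torch.Size([2, 1, 128, 128])
```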

Task-guided Domain Gap Reduction for Monocular Depth Prediction in Endoscopy

  • paper_url: http://arxiv.org/abs/2310.01663
  • repo_url: None
  • paper_authors: Anita Rau, Binod Bhattarai, Lourdes Agapito, Danail Stoyanov
  • for: Computer-aided methods to enhance colorectal cancer screening and improve the quality and availability of colonoscopy, here by predicting depth from monocular video frames to assist endoscopic navigation.
  • methods: Combines supervised learning on labeled synthetic data with self-supervised learning on unlabeled real data; rather than forcing the distributions of the two input domains to coincide, the approach focuses on the end task of depth prediction and translates only the essential information between domains.
  • results: The proposed method fully exploits labeled synthetic and unlabeled real data, producing more resilient and accurate depth maps for real colonoscopy sequences.
    Abstract Colorectal cancer remains one of the deadliest cancers in the world. In recent years computer-aided methods have aimed to enhance cancer screening and improve the quality and availability of colonoscopies by automatizing sub-tasks. One such task is predicting depth from monocular video frames, which can assist endoscopic navigation. As ground truth depth from standard in-vivo colonoscopy remains unobtainable due to hardware constraints, two approaches have aimed to circumvent the need for real training data: supervised methods trained on labeled synthetic data and self-supervised models trained on unlabeled real data. However, self-supervised methods depend on unreliable loss functions that struggle with edges, self-occlusion, and lighting inconsistency. Methods trained on synthetic data can provide accurate depth for synthetic geometries but do not use any geometric supervisory signal from real data and overfit to synthetic anatomies and properties. This work proposes a novel approach to leverage labeled synthetic and unlabeled real data. While previous domain adaptation methods indiscriminately enforce the distributions of both input data modalities to coincide, we focus on the end task, depth prediction, and translate only essential information between the input domains. Our approach results in more resilient and accurate depth maps of real colonoscopy sequences.

SYRAC: Synthesize, Rank, and Count

  • paper_url: http://arxiv.org/abs/2310.01662
  • repo_url: https://github.com/adrian-dalessandro/SYRAC
  • paper_authors: Adriano D’Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh
  • for: This work addresses the annotation burden in crowd counting, a critical computer vision task with important applications, where existing methods rely on labor-intensive density map annotations.
  • methods: The authors use latent diffusion models to generate synthetic training data and eliminate the annotation burden. Because these models struggle to reliably control object quantity, making count-conditioned generation noisy, two types of synthetic data are produced: ranked image pairs obtained by removing pedestrians from real images, which give a weak but reliable quantity signal, and synthetic images generated with a predetermined number of objects, which give a strong but noisy counting signal. The ranked pairs are used for pre-training, and a linear layer is then fit to the noisy synthetic images using the learned crowd-quantity features.
  • results: The method achieves state-of-the-art results for unsupervised crowd counting.
    Abstract Crowd counting is a critical task in computer vision, with several important applications. However, existing counting methods rely on labor-intensive density map annotations, necessitating the manual localization of each individual pedestrian. While recent efforts have attempted to alleviate the annotation burden through weakly or semi-supervised learning, these approaches fall short of significantly reducing the workload. We propose a novel approach to eliminate the annotation burden by leveraging latent diffusion models to generate synthetic data. However, these models struggle to reliably understand object quantities, leading to noisy annotations when prompted to produce images with a specific quantity of objects. To address this, we use latent diffusion models to create two types of synthetic data: one by removing pedestrians from real images, which generates ranked image pairs with a weak but reliable object quantity signal, and the other by generating synthetic images with a predetermined number of objects, offering a strong but noisy counting signal. Our method utilizes the ranking image pairs for pre-training and then fits a linear layer to the noisy synthetic images using these crowd quantity features. We report state-of-the-art results for unsupervised crowd counting.
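    As a rough illustration of the two-stage recipe in the abstract, the sketch below pre-trains a tiny encoder with a margin ranking loss on image pairs where one image is known to contain fewer people (from pedestrian removal), then freezes it and fits a linear calibration layer on images with noisy synthetic counts. The encoder, data, and hyperparameters are placeholders, not the released SYRAC code.

```python
# Minimal two-stage sketch: ranking pre-training, then linear fit on noisy counts.
import torch
import torch.nn as nn

backbone = nn.Sequential(          # stand-in encoder producing a scalar "crowd feature"
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)
rank_loss = nn.MarginRankingLoss(margin=0.1)
opt = torch.optim.Adam(backbone.parameters(), lr=1e-4)

# Stage 1: ranked pairs (img_more, img_less) from pedestrian-removal edits.
for img_more, img_less in [(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))]:
    s_more, s_less = backbone(img_more), backbone(img_less)
    loss = rank_loss(s_more, s_less, torch.ones_like(s_more))  # enforce s_more > s_less
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the backbone and fit a linear calibration layer on noisy counts.
for p in backbone.parameters():
    p.requires_grad_(False)
calib = nn.Linear(1, 1)
opt2 = torch.optim.Adam(calib.parameters(), lr=1e-2)
for img, noisy_count in [(torch.randn(4, 3, 64, 64), torch.rand(4, 1) * 100)]:
    pred = calib(backbone(img))
    loss = nn.functional.mse_loss(pred, noisy_count)
    opt2.zero_grad(); loss.backward(); opt2.step()
```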

You Only Look at Once for Real-time and Generic Multi-Task

  • paper_url: http://arxiv.org/abs/2310.01641
  • repo_url: https://github.com/jiayuanwang-jw/yolov8-multi-task
  • paper_authors: Jiayuan Wang, Q. M. Jonathan Wu, Ning Zhang
  • for: This paper aims to present an adaptive, real-time, and lightweight multi-task model for object detection, drivable area segmentation, and lane line segmentation tasks in autonomous driving.
  • methods: The proposed model is an end-to-end multi-task model with a unified and streamlined segmentation structure, featuring a learnable parameter that adaptively concatenates features in segmentation necks and a segmentation head composed of a series of convolutional layers.
  • results: The model achieves competitive results on the BDD100k dataset, with a mAP50 of 81.1% for object detection, a mIoU of 91.0% for drivable area segmentation, and an IoU of 28.8% for lane line segmentation. It also outperforms existing multi-task models in real-world scenarios while being more flexible and faster.
    Abstract High precision, lightweight, and real-time responsiveness are three essential requirements for implementing autonomous driving. In this study, we present an adaptive, real-time, and lightweight multi-task model designed to concurrently address object detection, drivable area segmentation, and lane line segmentation tasks. Specifically, we developed an end-to-end multi-task model with a unified and streamlined segmentation structure. We introduced a learnable parameter that adaptively concatenate features in segmentation necks, using the same loss function for all segmentation tasks. This eliminates the need for customizations and enhances the model's generalization capabilities. We also introduced a segmentation head composed only of a series of convolutional layers, which reduces the inference time. We achieved competitive results on the BDD100k dataset, particularly in visualization outcomes. The performance results show a mAP50 of 81.1% for object detection, a mIoU of 91.0% for drivable area segmentation, and an IoU of 28.8% for lane line segmentation. Additionally, we introduced real-world scenarios to evaluate our model's performance in a real scene, which significantly outperforms competitors. This demonstrates that our model not only exhibits competitive performance but is also more flexible and faster than existing multi-task models. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/YOLOv8-multi-task
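    The "learnable parameter that adaptively concatenates features in the segmentation necks" and the convolution-only segmentation head can be pictured roughly as below. This is a speculative sketch of those two components, not the released YOLOv8-multi-task code; the softmax weighting and channel sizes are assumptions.

```python
# Sketch of an adaptive-concatenation neck and a convolution-only segmentation head.
import torch
import torch.nn as nn

class AdaptiveConcatNeck(nn.Module):
    """Concatenate two feature maps, scaling each by a learned, softmax-normalized weight."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(2))  # one logit per input branch

    def forward(self, feat_a, feat_b):
        w = torch.softmax(self.alpha, dim=0)
        return torch.cat([w[0] * feat_a, w[1] * feat_b], dim=1)

class ConvSegHead(nn.Module):
    """Segmentation head built only from convolutions, as described in the abstract."""
    def __init__(self, in_ch, num_classes=1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // 2, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch // 2, num_classes, 1),
        )

    def forward(self, x):
        return self.head(x)

neck = AdaptiveConcatNeck()
head = ConvSegHead(in_ch=128)
fused = neck(torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80))
print(head(fused).shape)  # torch.Size([1, 1, 80, 80])
```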

Adaptive Visual Scene Understanding: Incremental Scene Graph Generation

  • paper_url: http://arxiv.org/abs/2310.01636
  • repo_url: https://github.com/zhanglab-deepneurocoglab/csegg
  • paper_authors: Naitik Khandelwal, Xiao Liu, Mengmi Zhang
  • for: This work studies continual learning for scene graph generation (SGG), motivated by the need for AI systems to detect new objects and establish new relationships with existing objects in a dynamic visual world.
  • methods: The authors introduce the comprehensive Continual ScenE Graph Generation (CSEGG) dataset, together with 3 learning scenarios and 8 evaluation metrics, to benchmark how well existing SGG methods retain previously learned object entities and relationships while learning new ones.
  • results: Classical two-stage SGG methods and recent transformer-based SGG methods struggle in the continual learning setting, while continual object detection improves generalization when classifying known relationships on unknown objects. The experiments provide valuable insights into this emerging field.
    Abstract Scene graph generation (SGG) involves analyzing images to extract meaningful information about objects and their relationships. Given the dynamic nature of the visual world, it becomes crucial for AI systems to detect new objects and establish their new relationships with existing objects. To address the lack of continual learning methodologies in SGG, we introduce the comprehensive Continual ScenE Graph Generation (CSEGG) dataset along with 3 learning scenarios and 8 evaluation metrics. Our research investigates the continual learning performances of existing SGG methods on the retention of previous object entities and relationships as they learn new ones. Moreover, we also explore how continual object detection enhances generalization in classifying known relationships on unknown objects. We conduct extensive experiments benchmarking and analyzing the classical two-stage SGG methods and the most recent transformer-based SGG methods in continual learning settings, and gain valuable insights into the CSEGG problem. We invite the research community to explore this emerging field of study.

Dynamic Spatio-Temporal Summarization using Information Based Fusion

  • paper_url: http://arxiv.org/abs/2310.01617
  • repo_url: None
  • paper_authors: Humayra Tasnim, Soumya Dutta, Melanie Moses
  • for: Addresses the challenge of managing and storing large-scale time-varying datasets, improving data management efficiency and enabling deeper insight into complex data behaviors.
  • methods: Proposes a dynamic spatio-temporal data summarization technique that identifies informative features in key timesteps and fuses less informative ones, guided by information-theoretic measures; both raw and summarized timesteps are retained, reducing storage requirements while preserving data dynamics.
  • results: The technique is demonstrated across diverse datasets, including particle-based flow simulations, security and surveillance applications, and biological cell interactions within the immune system, and applies to both in situ and post hoc analysis.
    Abstract In the era of burgeoning data generation, managing and storing large-scale time-varying datasets poses significant challenges. With the rise of supercomputing capabilities, the volume of data produced has soared, intensifying storage and I/O overheads. To address this issue, we propose a dynamic spatio-temporal data summarization technique that identifies informative features in key timesteps and fuses less informative ones. This approach minimizes storage requirements while preserving data dynamics. Unlike existing methods, our method retains both raw and summarized timesteps, ensuring a comprehensive view of information changes over time. We utilize information-theoretic measures to guide the fusion process, resulting in a visual representation that captures essential data patterns. We demonstrate the versatility of our technique across diverse datasets, encompassing particle-based flow simulations, security and surveillance applications, and biological cell interactions within the immune system. Our research significantly contributes to the realm of data management, introducing enhanced efficiency and deeper insights across diverse multidisciplinary domains. We provide a streamlined approach for handling massive datasets that can be applied to in situ analysis as well as post hoc analysis. This not only addresses the escalating challenges of data storage and I/O overheads but also unlocks the potential for informed decision-making. Our method empowers researchers and experts to explore essential temporal dynamics while minimizing storage requirements, thereby fostering a more effective and intuitive understanding of complex data behaviors.
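    As a toy illustration of information-guided timestep fusion (not the paper's algorithm), the snippet below scores each timestep by the Shannon entropy of its value histogram, keeps high-entropy timesteps as raw key frames, and averages consecutive low-entropy timesteps into a single summarized frame. The entropy measure, threshold, and mean-based fusion are assumptions.

```python
# Toy entropy-guided summarization of a time-varying scalar field.
import numpy as np

def shannon_entropy(frame, bins=64):
    hist, _ = np.histogram(frame, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def summarize(frames, threshold):
    """frames: (T, H, W) array. Returns a list of (kind, data) with kind 'raw' or 'fused'."""
    out, buffer = [], []
    for frame in frames:
        if shannon_entropy(frame) >= threshold:      # informative: flush buffer, keep raw
            if buffer:
                out.append(("fused", np.mean(buffer, axis=0)))
                buffer = []
            out.append(("raw", frame))
        else:                                        # less informative: accumulate for fusion
            buffer.append(frame)
    if buffer:
        out.append(("fused", np.mean(buffer, axis=0)))
    return out

data = np.random.rand(10, 32, 32)
print([kind for kind, _ in summarize(data, threshold=5.5)])
```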

ImagenHub: Standardizing the evaluation of conditional image generation models

  • paper_url: http://arxiv.org/abs/2310.01596
  • repo_url: https://github.com/TIGER-AI-Lab/ImagenHub
  • paper_authors: Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen
  • for: This paper aims to standardize the evaluation and comparison of conditional image generation models by providing a unified inference pipeline and human evaluation metrics.
  • methods: The paper proposes a one-stop library called ImagenHub, which includes seven prominent tasks and high-quality evaluation datasets. It also introduces two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, and comprehensive guidelines for evaluating generated images.
  • results: The paper evaluates around 30 models using the proposed metrics and observes that the existing models’ performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation. It also finds that 83% of the claims from published papers hold with a few exceptions, and none of the existing automatic metrics has a Spearman’s correlation higher than 0.2 except for subject-driven image generation.
    Abstract Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions: datasets, inference, and evaluation metrics - render fair comparisons difficult. This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we built a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves a high inter-worker agreement of Krippendorff's alpha on 76% models with a value higher than 0.4. We comprehensively evaluated a total of around 30 models and observed three key takeaways: (1) the existing models' performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% models achieving an overall score lower than 0.5. (2) we examined the claims from published papers and found 83% of them hold with a few exceptions. (3) None of the existing automatic metrics has a Spearman's correlation higher than 0.2 except subject-driven image generation. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation.
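    The correlation finding in the results can be reproduced in spirit with a few lines: given per-model human scores and an automatic metric, Spearman's rank correlation measures how similarly the two rank the models. The numbers below are made up for illustration and are not from the paper.

```python
# Illustrative check of metric-vs-human agreement using Spearman's rank correlation.
from scipy.stats import spearmanr

human_scores = [0.41, 0.55, 0.32, 0.60, 0.47, 0.38]   # e.g. averaged Semantic Consistency per model
metric_scores = [22.1, 20.4, 23.8, 19.9, 21.5, 24.0]  # e.g. an automatic score per model

rho, pvalue = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3f})")
# A |rho| below ~0.2, as the paper reports for most tasks, means the automatic
# metric ranks models very differently from human raters.
```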

RF-ULM: Deep Learning for Radio-Frequency Ultrasound Localization Microscopy

  • paper_url: http://arxiv.org/abs/2310.01545
  • repo_url: https://github.com/hahnec/rf-ulm
  • paper_authors: Christopher Hahne, Georges Chabouh, Arthur Chavignon, Olivier Couture, Raphael Sznitman
  • for: This work aims to improve image resolution in Ultrasound Localization Microscopy (ULM), which depends on precisely localizing contrast agent particles across consecutive beamformed frames.
  • methods: The authors propose localizing scatterers directly in Radio-Frequency (RF) signals using a custom super-resolution deep neural network (DNN) with learned feature channel shuffling and a novel semi-global convolutional sampling block tailored for reliable and accurate localization in RF input data, together with a geometric point transformation that maps between B-mode and RF spaces.
  • results: RF-ULM bridges the domain gap between synthetic and real datasets, offering a considerable advantage in precision and complexity; the first in vivo results from an RF-trained DNN highlight its real-world practicality.
    Abstract In Ultrasound Localization Microscopy (ULM), achieving high-resolution images relies on the precise localization of contrast agent particles across consecutive beamformed frames. However, our study uncovers an enormous potential: The process of delay-and-sum beamforming leads to an irreversible reduction of Radio-Frequency (RF) data, while its implications for localization remain largely unexplored. The rich contextual information embedded within RF wavefronts, including their hyperbolic shape and phase, offers great promise for guiding Deep Neural Networks (DNNs) in challenging localization scenarios. To fully exploit this data, we propose to directly localize scatterers in RF signals. Our approach involves a custom super-resolution DNN using learned feature channel shuffling and a novel semi-global convolutional sampling block tailored for reliable and accurate localization in RF input data. Additionally, we introduce a geometric point transformation that facilitates seamless mapping between B-mode and RF spaces. To validate the effectiveness of our method and understand the impact of beamforming, we conduct an extensive comparison with State-Of-The-Art (SOTA) techniques in ULM. We present the inaugural in vivo results from an RF-trained DNN, highlighting its real-world practicality. Our findings show that RF-ULM bridges the domain gap between synthetic and real datasets, offering a considerable advantage in terms of precision and complexity. To enable the broader research community to benefit from our findings, our code and the associated SOTA methods are made available at https://github.com/hahnec/rf-ulm.

Progressive DeepSSM: Training Methodology for Image-To-Shape Deep Models

  • paper_url: http://arxiv.org/abs/2310.01529
  • repo_url: None
  • paper_authors: Abu Zahid Bin Aziz, Jadie Adams, Shireen Elhabian
  • for: This work aims to improve the accuracy and stability of statistical shape models (SSM) inferred directly from medical images, enabling the study of anatomical shapes in various medical applications.
  • methods: The authors propose a new training strategy, Progressive DeepSSM, which trains image-to-shape deep models over multiple scales, with each scale building on the output of the previous one so that coarse shape features are learned first and fine details later. Shape priors are leveraged via segmentation-guided multi-task learning, and a deep supervision loss ensures learning at each scale.
  • results: Experiments show that models trained with the proposed strategy are superior from both quantitative and qualitative perspectives, and the methodology can improve the stability and accuracy of any deep learning method for inferring statistical shape representations from medical images.
    Abstract Statistical shape modeling (SSM) is an enabling quantitative tool to study anatomical shapes in various medical applications. However, directly using 3D images in these applications still has a long way to go. Recent deep learning methods have paved the way for reducing the substantial preprocessing steps to construct SSMs directly from unsegmented images. Nevertheless, the performance of these models is not up to the mark. Inspired by multiscale/multiresolution learning, we propose a new training strategy, progressive DeepSSM, to train image-to-shape deep learning models. The training is performed in multiple scales, and each scale utilizes the output from the previous scale. This strategy enables the model to learn coarse shape features in the first scales and gradually learn detailed fine shape features in the later scales. We leverage shape priors via segmentation-guided multi-task learning and employ deep supervision loss to ensure learning at each scale. Experiments show the superiority of models trained by the proposed strategy from both quantitative and qualitative perspectives. This training methodology can be employed to improve the stability and accuracy of any deep learning method for inferring statistical representations of anatomies from medical images and can be adopted by existing deep learning methods to improve model accuracy and training stability.
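    A minimal sketch of the coarse-to-fine training idea, assuming a shared encoder with one output head per scale and a deep-supervision loss that sums over all scales active so far; the network, the choice of scales, and the losses are illustrative and not the paper's implementation.

```python
# Progressive multiscale training with deep supervision (illustrative).
import torch
import torch.nn as nn

class MultiScaleRegressor(nn.Module):
    def __init__(self, scales=(16, 64, 256)):  # e.g. number of shape correspondence points
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, 8, 3, 2, 1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.heads = nn.ModuleList([nn.Linear(8 * 16, s * 3) for s in scales])  # 3D points

    def forward(self, x, upto):
        feat = self.encoder(x)
        return [head(feat) for head in self.heads[: upto + 1]]

model = MultiScaleRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
image = torch.randn(2, 1, 64, 64)
targets = [torch.randn(2, s * 3) for s in (16, 64, 256)]

for stage in range(3):                      # progressively enable finer scales
    for _ in range(5):                      # a few steps per stage for illustration
        preds = model(image, upto=stage)
        # Deep supervision: sum the losses of every active scale.
        loss = sum(nn.functional.mse_loss(p, t) for p, t in zip(preds, targets))
        opt.zero_grad(); loss.backward(); opt.step()
```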

Fetal-BET: Brain Extraction Tool for Fetal MRI

  • paper_url: http://arxiv.org/abs/2310.01523
  • repo_url: https://github.com/bchimagine/fetal-brain-extraction
  • paper_authors: Razieh Faghihpirayesh, Davood Karimi, Deniz Erdoğmuş, Ali Gholipour
  • for: The goal is a highly automated, accurate, and generalizable fetal brain extraction method that works across different MRI sequences and scanning conditions, as a first step in computational fetal brain MRI pipelines.
  • methods: The authors build a large annotated dataset of approximately 72,000 2D fetal brain MRI images covering T2-weighted, diffusion-weighted, and functional MRI from different scanners, including normal and pathological brains, and develop deep learning methods that combine U-Net style architectures, attention mechanisms, multi-contrast (multi-sequence) feature learning, and data augmentation to capture detailed fetal brain structure.
  • results: On independent test data, the method achieves accurate brain extraction on heterogeneous data acquired with different scanners, on pathological brains, and at various gestational stages.
    Abstract Fetal brain extraction is a necessary first step in most computational fetal brain MRI pipelines. However, it has been a very challenging task due to non-standard fetal head pose, fetal movements during examination, and vastly heterogeneous appearance of the developing fetal brain and the neighboring fetal and maternal anatomy across various sequences and scanning conditions. Development of a machine learning method to effectively address this task requires a large and rich labeled dataset that has not been previously available. As a result, there is currently no method for accurate fetal brain extraction on various fetal MRI sequences. In this work, we first built a large annotated dataset of approximately 72,000 2D fetal brain MRI images. Our dataset covers the three common MRI sequences including T2-weighted, diffusion-weighted, and functional MRI acquired with different scanners. Moreover, it includes normal and pathological brains. Using this dataset, we developed and validated deep learning methods, by exploiting the power of the U-Net style architectures, the attention mechanism, multi-contrast feature learning, and data augmentation for fast, accurate, and generalizable automatic fetal brain extraction. Our approach leverages the rich information from multi-contrast (multi-sequence) fetal MRI data, enabling precise delineation of the fetal brain structures. Evaluations on independent test data show that our method achieves accurate brain extraction on heterogeneous test data acquired with different scanners, on pathological brains, and at various gestational stages. This robustness underscores the potential utility of our deep learning model for fetal brain imaging and image analysis.

Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

  • paper_url: http://arxiv.org/abs/2310.01506
  • repo_url: https://github.com/cure-lab/directinversion
  • paper_authors: Xuan Ju, Ailing Zeng, Yuxuan Bian, Shaoteng Liu, Qiang Xu
  • for: This work aims to improve diffusion-based image editing by disentangling the source and target diffusion branches, improving both essential content preservation of the source image and edit fidelity to the target prompt.
  • methods: The authors introduce "Direct Inversion", a novel inversion technique that achieves optimal performance of both branches with just three lines of code, and present PIE-Bench, an editing benchmark of 700 images spanning diverse scenes and editing types with versatile annotations and comprehensive evaluation metrics.
  • results: Compared to state-of-the-art optimization-based inversion techniques, the method yields superior performance across 8 editing methods while achieving nearly an order-of-magnitude speed-up.
    Abstract Text-guided diffusion models have revolutionized image generation and editing, offering exceptional realism and diversity. Specifically, in the context of diffusion-based editing, where a source image is edited according to a target prompt, the process commences by acquiring a noisy latent vector corresponding to the source image via the diffusion model. This vector is subsequently fed into separate source and target diffusion branches for editing. The accuracy of this inversion process significantly impacts the final editing outcome, influencing both essential content preservation of the source image and edit fidelity according to the target prompt. Prior inversion techniques aimed at finding a unified solution in both the source and target diffusion branches. However, our theoretical and empirical analyses reveal that disentangling these branches leads to a distinct separation of responsibilities for preserving essential content and ensuring edit fidelity. Building on this insight, we introduce "Direct Inversion," a novel technique achieving optimal performance of both branches with just three lines of code. To assess image editing performance, we present PIE-Bench, an editing benchmark with 700 images showcasing diverse scenes and editing types, accompanied by versatile annotations and comprehensive evaluation metrics. Compared to state-of-the-art optimization-based inversion techniques, our solution not only yields superior performance across 8 editing methods but also achieves nearly an order of speed-up.

DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model

  • paper_url: http://arxiv.org/abs/2310.01412
  • repo_url: None
  • paper_authors: Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee. K. Wong, Zhenguo Li, Hengshuang Zhao
  • for: This paper addresses the limited interpretability of autonomous driving systems, which severely hinders commercialization and further development of autonomous vehicles.
  • methods: The authors use a multimodal large language model (LLM) to process and reason about image and video data, interpret vehicle actions and provide corresponding reasoning, answer diverse questions posed by human users, and predict low-level vehicle control signals end-to-end; the system is trained on a customized visual instruction tuning dataset designed for autonomous driving.
  • results: Evaluated on multiple tasks, DriveGPT4 demonstrates superior qualitative and quantitative performance over conventional methods and video-understanding LLMs, and generalizes in a zero-shot fashion to unseen scenarios.
    Abstract In the past decade, autonomous driving has experienced rapid development in both academia and industry. However, its limited interpretability remains a significant unsolved problem, severely hindering autonomous vehicle commercialization and further development. Previous approaches utilizing small language models have failed to address this issue due to their lack of flexibility, generalization ability, and robustness. Recently, multimodal large language models (LLMs) have gained considerable attention from the research community for their capability to process and reason non-text data (e.g., images and videos) by text. In this paper, we present DriveGPT4, an interpretable end-to-end autonomous driving system utilizing LLMs. DriveGPT4 is capable of interpreting vehicle actions and providing corresponding reasoning, as well as answering diverse questions posed by human users for enhanced interaction. Additionally, DriveGPT4 predicts vehicle low-level control signals in an end-to-end fashion. These capabilities stem from a customized visual instruction tuning dataset specifically designed for autonomous driving. To the best of our knowledge, DriveGPT4 is the first work focusing on interpretable end-to-end autonomous driving. When evaluated on multiple tasks alongside conventional methods and video understanding LLMs, DriveGPT4 demonstrates superior qualitative and quantitative performance. Additionally, DriveGPT4 can be generalized in a zero-shot fashion to accommodate more unseen scenarios. The project page is available at https://tonyxuqaq.github.io/projects/DriveGPT4/ .

LEAP: Liberate Sparse-view 3D Modeling from Camera Poses

  • paper_url: http://arxiv.org/abs/2310.01410
  • repo_url: https://github.com/hwjiang1510/LEAP
  • paper_authors: Hanwen Jiang, Zhenyu Jiang, Yue Zhao, Qixing Huang
  • for: Are camera poses necessary for multi-view 3D modeling? Existing approaches mostly assume access to accurate camera poses, but accurately estimating poses for sparse views is often elusive, and the analysis shows that noisy estimated poses degrade existing sparse-view 3D modeling methods.
  • methods: The authors propose LEAP, a pose-free approach that discards pose-based operations and learns geometric knowledge from data. LEAP maintains a neural volume that is shared across scenes and parameterized to encode geometry and texture priors; for each incoming scene, the volume is updated by aggregating 2D image features in a feature-similarity-driven manner, and the updated volume is decoded into a radiance field, enabling novel view synthesis from any viewpoint.
  • results: On object-centric and scene-level datasets, LEAP significantly outperforms prior methods that use poses predicted by state-of-the-art pose estimators, performs on par with approaches using ground-truth poses, and runs 400x faster than PixelNeRF. It also generalizes to novel object categories and scenes, and the learned knowledge closely resembles epipolar geometry. Project page: https://hwjiang1510.github.io/LEAP/
    Abstract Are camera poses necessary for multi-view 3D modeling? Existing approaches predominantly assume access to accurate camera poses. While this assumption might hold for dense views, accurately estimating camera poses for sparse views is often elusive. Our analysis reveals that noisy estimated poses lead to degraded performance for existing sparse-view 3D modeling methods. To address this issue, we present LEAP, a novel pose-free approach, therefore challenging the prevailing notion that camera poses are indispensable. LEAP discards pose-based operations and learns geometric knowledge from data. LEAP is equipped with a neural volume, which is shared across scenes and is parameterized to encode geometry and texture priors. For each incoming scene, we update the neural volume by aggregating 2D image features in a feature-similarity-driven manner. The updated neural volume is decoded into the radiance field, enabling novel view synthesis from any viewpoint. On both object-centric and scene-level datasets, we show that LEAP significantly outperforms prior methods when they employ predicted poses from state-of-the-art pose estimators. Notably, LEAP performs on par with prior approaches that use ground-truth poses while running $400\times$ faster than PixelNeRF. We show LEAP generalizes to novel object categories and scenes, and learns knowledge closely resembles epipolar geometry. Project page: https://hwjiang1510.github.io/LEAP/
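    A speculative sketch of the pose-free aggregation step: a set of learnable volume tokens, shared across scenes, attends to the flattened 2D image features of the incoming views, so each token absorbs the features most similar to it. The token count, dimensions, and single attention layer are assumptions, not LEAP's actual architecture.

```python
# Cross-attention aggregation of 2D image features into a shared neural volume (assumed design).
import torch
import torch.nn as nn

class NeuralVolumeAggregator(nn.Module):
    def __init__(self, n_voxels=8 ** 3, dim=128):
        super().__init__()
        self.volume_tokens = nn.Parameter(torch.randn(n_voxels, dim))   # shared across scenes
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, image_feats):
        # image_feats: (B, N_views * H * W, dim) flattened 2D features from all input views.
        b = image_feats.shape[0]
        queries = self.volume_tokens.unsqueeze(0).expand(b, -1, -1)
        # Similarity-driven update: each voxel token attends to the image features
        # most similar to it and absorbs their information.
        updated, _ = self.attn(queries, image_feats, image_feats)
        return queries + updated     # scene-specific neural volume, ready for decoding

volume = NeuralVolumeAggregator()(torch.randn(2, 5 * 16 * 16, 128))
print(volume.shape)  # torch.Size([2, 512, 128])
```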

HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation

  • paper_url: http://arxiv.org/abs/2310.01406
  • repo_url: None
  • paper_authors: Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, Qing Wang
  • for: High-quality and realistic 3D human generation from text.
  • methods: The authors fine-tune a text-to-image diffusion model with normal maps, yielding a normal-adapted diffusion model that generates high-fidelity normal maps for view-dependent prompts (improving 2D perception of 3D geometry while preserving priors learned from large-scale datasets) and a normal-aligned diffusion model that generates color images aligned with those normal maps; a progressive geometry generation strategy and a coarse-to-fine texture generation strategy further improve the efficiency and robustness of 3D human generation.
  • results: Comprehensive experiments show that HumanNorm generates 3D humans with intricate geometry and realistic appearance, significantly outperforming existing text-to-3D methods in both geometry and texture quality.
    Abstract Recent text-to-3D methods employing diffusion models have made significant advancements in 3D human generation. However, these approaches face challenges due to the limitations of the text-to-image diffusion model, which lacks an understanding of 3D structures. Consequently, these methods struggle to achieve high-quality human generation, resulting in smooth geometry and cartoon-like appearances. In this paper, we observed that fine-tuning text-to-image diffusion models with normal maps enables their adaptation into text-to-normal diffusion models, which enhances the 2D perception of 3D geometry while preserving the priors learned from large-scale datasets. Therefore, we propose HumanNorm, a novel approach for high-quality and realistic 3D human generation by learning the normal diffusion model including a normal-adapted diffusion model and a normal-aligned diffusion model. The normal-adapted diffusion model can generate high-fidelity normal maps corresponding to prompts with view-dependent text. The normal-aligned diffusion model learns to generate color images aligned with the normal maps, thereby transforming physical geometry details into realistic appearance. Leveraging the proposed normal diffusion model, we devise a progressive geometry generation strategy and coarse-to-fine texture generation strategy to enhance the efficiency and robustness of 3D human generation. Comprehensive experiments substantiate our method's ability to generate 3D humans with intricate geometry and realistic appearances, significantly outperforming existing text-to-3D methods in both geometry and texture quality. The project page of HumanNorm is https://humannorm.github.io/.

H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation

  • paper_url: http://arxiv.org/abs/2310.01404
  • repo_url: https://github.com/YanjieZe/H-InDex
  • paper_authors: Yanjie Ze, Yuyao Liu, Ruizhe Shi, Jiaxin Qin, Zhecheng Yuan, Jiashun Wang, Huazhe Xu
  • for: Solving challenging dexterous manipulation tasks by transferring priors from human hands to robotic hands, improving their dexterity and flexibility.
  • methods: A human hand-informed visual reinforcement learning framework with three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adaptation of the representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm; the last two stages modify only 0.36% of the pre-trained representation's parameters, preserving the knowledge from pre-training.
  • results: On 12 challenging dexterous manipulation tasks, H-InDex largely surpasses strong baseline methods and recent visual foundation models for motor control.
    Abstract Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and$\textbf{-In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adapting representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages only modify $0.36\%$ parameters of the pre-trained representation in total, ensuring the knowledge from pre-training is maintained to the full extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and the recent visual foundation models for motor control. Code is available at https://yanjieze.com/H-InDex .
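    One reading of stage (iii)'s "exponential moving average BatchNorm" is sketched below: the pre-trained encoder's weights stay frozen during reinforcement learning, while the BatchNorm layers remain in training mode so their running statistics keep tracking the robot's observation distribution via the usual EMA update. The momentum value and the freezing scheme are assumptions about the method.

```python
# Frozen encoder whose BatchNorm running statistics keep adapting via EMA.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32, momentum=0.01), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64, momentum=0.01), nn.ReLU(),
)

# Freeze all learnable parameters (conv weights and BN affine terms) ...
for p in encoder.parameters():
    p.requires_grad_(False)

# ... but keep BN layers in train mode so running_mean / running_var follow an
# exponential moving average of the RL observation statistics.
encoder.eval()
for m in encoder.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.train()

with torch.no_grad():
    for _ in range(10):                       # RL rollout observations
        obs = torch.randn(8, 3, 84, 84)
        _ = encoder(obs)                      # forward pass updates only the BN running stats
```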

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

  • paper_url: http://arxiv.org/abs/2310.01403
  • repo_url: https://github.com/wusize/clipself
  • paper_authors: Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
  • for: This work investigates how to adapt CLIP's vision-language alignment from global image representations to local image regions, which is essential for downstream open-vocabulary dense prediction tasks.
  • methods: The authors propose CLIPSelf, a self-distillation approach in which a CLIP vision transformer (ViT) aligns region representations extracted from its dense feature map with the image-level representations of the corresponding image crops, without requiring any region-text pairs.
  • results: With the enhanced CLIP ViTs, the method achieves new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
    Abstract Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code will be available at https://github.com/wusize/CLIPSelf.
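    The self-distillation objective can be sketched as below: pool each region's representation from the ViT's dense feature map and pull it toward the CLIP image embedding of the corresponding crop. `clip_vit.dense_features` and `clip_vit.image_embedding` are hypothetical wrappers standing in for a CLIP vision transformer, and the RoI pooling, crop resizing, and cosine loss are assumptions rather than the released CLIPSelf code.

```python
# Hedged sketch of a region-to-crop self-distillation loss.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def clipself_loss(clip_vit, images, boxes):
    """images: (B, 3, H, W); boxes: list of (K_i, 4) xyxy tensors in image coordinates."""
    # Assume dense features are already projected to the CLIP embedding dimension.
    dense = clip_vit.dense_features(images)          # (B, C, h, w) patch-token feature map
    stride = images.shape[-1] / dense.shape[-1]
    region = roi_align(dense, boxes, output_size=1, spatial_scale=1.0 / stride)  # (K, C, 1, 1)
    region = region.flatten(1)

    # Teacher: CLIP image-level embedding of each cropped region (no gradient).
    crops = []
    for b, bxs in enumerate(boxes):
        for x1, y1, x2, y2 in bxs.round().int().tolist():
            crop = images[b : b + 1, :, y1:y2, x1:x2]
            crops.append(F.interpolate(crop, size=images.shape[-2:], mode="bilinear"))
    with torch.no_grad():
        teacher = clip_vit.image_embedding(torch.cat(crops))   # (K, C)

    # Cosine-similarity distillation loss between student regions and teacher crops.
    return 1 - F.cosine_similarity(F.normalize(region, dim=-1),
                                   F.normalize(teacher, dim=-1)).mean()
```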

Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection

  • paper_url: http://arxiv.org/abs/2310.01401
  • repo_url: https://github.com/ymingxie/PARQ
  • paper_authors: Yiming Xie, Huaizu Jiang, Georgia Gkioxari, Julian Straub
  • for: This paper develops a multi-view 3D object detector based on transformers and pixel-aligned recurrent queries.
  • methods: Queries are initialized from reference points in 3D space, enhanced with appearance features, and their 3D locations are updated with recurrent cross-attention operations; incorporating pixel-aligned features and cross-attention lets the model encode the necessary 3D-to-2D correspondences and capture global contextual information of the input images.
  • results: PARQ outperforms prior best methods on the ScanNet and ARKitScenes datasets, learns and detects faster, is more robust to distribution shifts in the reference points, can leverage additional input views without retraining, and can adapt inference compute by changing the number of recurrent iterations.
    Abstract We present PARQ - a multi-view 3D object detector with transformer and pixel-aligned recurrent queries. Unlike previous works that use learnable features or only encode 3D point positions as queries in the decoder, PARQ leverages appearance-enhanced queries initialized from reference points in 3D space and updates their 3D location with recurrent cross-attention operations. Incorporating pixel-aligned features and cross attention enables the model to encode the necessary 3D-to-2D correspondences and capture global contextual information of the input images. PARQ outperforms prior best methods on the ScanNet and ARKitScenes datasets, learns and detects faster, is more robust to distribution shifts in reference points, can leverage additional input views without retraining, and can adapt inference compute by changing the number of recurrent iterations.

Sequential Data Generation with Groupwise Diffusion Process

  • paper_url: http://arxiv.org/abs/2310.01400
  • repo_url: None
  • paper_authors: Sangyun Lee, Gayoung Lee, Hyunsu Kim, Junho Kim, Youngjung Uh
  • for: This work extends diffusion models by dividing data into multiple groups and diffusing one group per time interval in the forward process, so that data is generated sequentially group by group.
  • methods: The Groupwise Diffusion Model (GDM) generalizes certain forms of autoregressive models and cascaded diffusion models, providing a unified framework for studying design choices such as the data-grouping strategy and the order of generation; it can also be extended to the frequency domain, where the forward process sequentially diffuses each group of frequency components.
  • results: Because each group of the initial noise affects only the corresponding group of the generated data, the latent space acquires group-wise interpretable meaning; dividing the frequency bands into groups yields a hierarchical latent representation in which individual groups encode data at different levels of abstraction, enabling applications such as disentanglement of semantic attributes, image editing, and generating variations.
    Abstract We present the Groupwise Diffusion Model (GDM), which divides data into multiple groups and diffuses one group at one time interval in the forward diffusion process. GDM generates data sequentially from one group at one time interval, leading to several interesting properties. First, as an extension of diffusion models, GDM generalizes certain forms of autoregressive models and cascaded diffusion models. As a unified framework, GDM allows us to investigate design choices that have been overlooked in previous works, such as data-grouping strategy and order of generation. Furthermore, since one group of the initial noise affects only a certain group of the generated data, latent space now possesses group-wise interpretable meaning. We can further extend GDM to the frequency domain where the forward process sequentially diffuses each group of frequency components. Dividing the frequency bands of the data as groups allows the latent variables to become a hierarchical representation where individual groups encode data at different levels of abstraction. We demonstrate several applications of such representation including disentanglement of semantic attributes, image editing, and generating variations.
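    A toy formulation of the groupwise forward process, under assumptions: dimensions are assigned to G groups, and group g is corrupted only during its own slice [g/G, (g+1)/G] of the global time axis, so at any time t the groups sit at different noise levels and are effectively generated one after another in reverse. The linear schedule and variance-preserving mixing are illustrative choices, not the paper's exact formulation.

```python
# Toy groupwise forward diffusion: groups are noised sequentially over the time axis.
import torch

def groupwise_noise_level(t, num_groups):
    """Per-group noise level in [0, 1] at global time t, with groups corrupted sequentially."""
    g = torch.arange(num_groups, dtype=torch.float32)
    start, end = g / num_groups, (g + 1) / num_groups
    return ((t - start) / (end - start)).clamp(0.0, 1.0)        # (num_groups,)

def forward_diffuse(x, group_ids, t):
    """x: (..., D); group_ids: (D,) long tensor assigning each dimension to a group."""
    levels = groupwise_noise_level(t, int(group_ids.max()) + 1)[group_ids]   # (D,)
    alpha = (1.0 - levels).sqrt()
    sigma = levels.sqrt()
    return alpha * x + sigma * torch.randn_like(x)

x = torch.randn(4, 6)
group_ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(forward_diffuse(x, group_ids, t=0.5))   # group 0 fully noised, group 1 halfway, group 2 clean
```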

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

  • paper_url: http://arxiv.org/abs/2310.01393
  • repo_url: None
  • paper_authors: Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yining Li, Guangliang Cheng, Yunhai Tong, Kai Chen, Chen Change Loy
  • for: This work aims to improve open-vocabulary object detection (OVOD), i.e., detecting objects beyond the set of categories observed during training.
  • methods: The authors propose a simple yet effective strategy that leverages the zero-shot classification ability of pre-trained vision-language models (VLMs) such as CLIP to classify proposals for all possible novel classes directly. Unlike previous works that ignore novel classes during training and rely solely on the region proposal network (RPN), the method selectively filters proposals based on specific design criteria and uses the retained proposals as pseudo-labels for novel classes during training, enabling self-training that improves the recall and accuracy of novel classes without additional annotations or datasets; a simple offline pseudo-label generation strategy further refines the detector.
  • results: Without extra parameters or inference-time computational cost, the method improves over the baseline by 1.7-2.0% on LVIS and 2.3-3.8% on the challenging V3Det dataset, and boosts a strong baseline by 6% mAP on COCO.
    Abstract Open-vocabulary object detection (OVOD) aims to detect the objects beyond the set of categories observed during training. This work presents a simple yet effective strategy that leverages the zero-shot classification ability of pre-trained vision-language models (VLM), such as CLIP, to classify proposals for all possible novel classes directly. Unlike previous works that ignore novel classes during training and rely solely on the region proposal network (RPN) for novel object detection, our method selectively filters proposals based on specific design criteria. The resulting sets of identified proposals serve as pseudo-labels for novel classes during the training phase. It enables our self-training strategy to improve the recall and accuracy of novel classes in a self-training manner without requiring additional annotations or datasets. We further propose a simple offline pseudo-label generation strategy to refine the object detector. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance without incurring additional parameters or computational costs during inference. In particular, compared with previous F-VLM, our method achieves a 1.7-2.0% improvement on LVIS dataset and 2.3-3.8% improvement on the recent challenging V3Det dataset. Our method also boosts the strong baseline by 6% mAP on COCO. The code and models will be publicly available at https://github.com/xushilin1/dst-det.
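    The pseudo-labeling step described in the methods might look roughly like the sketch below: crop each RPN proposal, score it against text embeddings of the novel class names with a frozen VLM, and keep only confident, reasonably sized detections as pseudo-ground-truth for self-training. `clip_image_embed` and `clip_text_embed` are hypothetical wrappers around a CLIP-like model, and the size and confidence filters are stand-ins for the paper's design criteria.

```python
# Hedged sketch of zero-shot pseudo-label generation for novel classes.
import torch
import torch.nn.functional as F

def generate_pseudo_labels(image, proposals, clip_image_embed, clip_text_embed,
                           novel_class_names, score_thresh=0.8, min_size=32):
    """image: (3, H, W); proposals: (N, 4) xyxy boxes. Returns kept boxes and class indices."""
    text_emb = F.normalize(clip_text_embed(novel_class_names), dim=-1)   # (C, D)
    kept_boxes, kept_labels = [], []
    for box in proposals:
        x1, y1, x2, y2 = box.round().int().tolist()
        if (x2 - x1) < min_size or (y2 - y1) < min_size:   # design criterion: drop tiny boxes
            continue
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        img_emb = F.normalize(clip_image_embed(crop), dim=-1)            # (1, D)
        probs = (100.0 * img_emb @ text_emb.T).softmax(dim=-1).squeeze(0)
        score, cls = probs.max(dim=0)
        if score >= score_thresh:                          # design criterion: confidence filter
            kept_boxes.append(box)
            kept_labels.append(cls)
    if not kept_boxes:
        return torch.empty(0, 4), torch.empty(0, dtype=torch.long)
    return torch.stack(kept_boxes), torch.stack(kept_labels)
```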

Towards Distribution-Agnostic Generalized Category Discovery

  • paper_url: http://arxiv.org/abs/2310.01376
  • repo_url: https://github.com/jianhongbai/bacon
  • paper_authors: Jianhong Bai, Zuozhu Liu, Hualiang Wang, Ruizhe Chen, Lianrui Mu, Xiaomeng Li, Joey Tianyi Zhou, Yang Feng, Jian Wu, Haoji Hu
  • for: This work tackles data imbalance and open-ended distribution jointly, two intrinsic characteristics of the real visual world that prior work has mostly addressed separately.
  • methods: The authors formally define distribution-agnostic generalized category discovery (DA-GCD): generating fine-grained predictions for both close- and open-set classes in a long-tailed open-world setting. To solve it, they propose the Self-Balanced Co-Advice contrastive framework (BaCon), consisting of a contrastive-learning branch and a pseudo-labeling branch that collaborate to provide interactive supervision: the contrastive branch supplies reliable distribution estimates to regularize the pseudo-labeling branch, which in turn guides contrastive learning through self-balanced knowledge transfer and a proposed novel contrastive loss.
  • results: BaCon outperforms state-of-the-art methods from imbalanced semi-supervised learning and generalized category discovery across various datasets, supported by comprehensive analysis.
    Abstract Data imbalance and open-ended distribution are two intrinsic characteristics of the real visual world. Though encouraging progress has been made in tackling each challenge separately, few works dedicated to combining them towards real-world scenarios. While several previous works have focused on classifying close-set samples and detecting open-set samples during testing, it's still essential to be able to classify unknown subjects as human beings. In this paper, we formally define a more realistic task as distribution-agnostic generalized category discovery (DA-GCD): generating fine-grained predictions for both close- and open-set classes in a long-tailed open-world setting. To tackle the challenging problem, we propose a Self-Balanced Co-Advice contrastive framework (BaCon), which consists of a contrastive-learning branch and a pseudo-labeling branch, working collaboratively to provide interactive supervision to resolve the DA-GCD task. In particular, the contrastive-learning branch provides reliable distribution estimation to regularize the predictions of the pseudo-labeling branch, which in turn guides contrastive learning through self-balanced knowledge transfer and a proposed novel contrastive loss. We compare BaCon with state-of-the-art methods from two closely related fields: imbalanced semi-supervised learning and generalized category discovery. The effectiveness of BaCon is demonstrated with superior performance over all baselines and comprehensive analysis across various datasets. Our code is publicly available.

NEUCORE: Neural Concept Reasoning for Composed Image Retrieval

  • paper_url: http://arxiv.org/abs/2310.01358
  • repo_url: None
  • paper_authors: Shu Zhao, Huijuan Xu
  • for: This work targets composed image retrieval, where a reference image and a text modifier jointly specify the desired target image; existing methods model holistic multi-modal interaction and ignore the composed and complementary relationship between the reference image and the text modifier.
  • methods: The authors move multi-modal understanding to the fine-grained concept level and learn multi-modal concept alignment to identify the visual locations in the reference or target image corresponding to the text modifier. The proposed NEUCORE model combines multi-modal concept alignment, learned under a multiple-instance learning framework with image- and sentence-level weak supervision (since the text modifier may refer to concepts absent from the reference image that must be added to the target), with a progressive multimodal fusion strategy, instantiated by attended language semantic concepts, that forms discriminative fusion features for accurate target image retrieval.
  • results: The approach is evaluated on three datasets and achieves state-of-the-art results.
    Abstract Composed image retrieval which combines a reference image and a text modifier to identify the desired target image is a challenging task, and requires the model to comprehend both vision and language modalities and their interactions. Existing approaches focus on holistic multi-modal interaction modeling, and ignore the composed and complimentary property between the reference image and text modifier. In order to better utilize the complementarity of multi-modal inputs for effective information fusion and retrieval, we move the multi-modal understanding to fine-granularity at concept-level, and learn the multi-modal concept alignment to identify the visual location in reference or target images corresponding to text modifier. Toward the end, we propose a NEUral COncept REasoning (NEUCORE) model which incorporates multi-modal concept alignment and progressive multimodal fusion over aligned concepts. Specifically, considering that text modifier may refer to semantic concepts not existing in the reference image and requiring to be added into the target image, we learn the multi-modal concept alignment between the text modifier and the concatenation of reference and target images, under multiple-instance learning framework with image and sentence level weak supervision. Furthermore, based on aligned concepts, to form discriminative fusion features of the input modalities for accurate target image retrieval, we propose a progressive fusion strategy with unified execution architecture instantiated by the attended language semantic concepts. Our proposed approach is evaluated on three datasets and achieves state-of-the-art results.

Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models

  • paper_url: http://arxiv.org/abs/2310.01356
  • repo_url: None
  • paper_authors: Shu Zhao, Huijuan Xu
  • for: This work aims to endow computer vision systems with a more human-like ability to abstract structured symbolic knowledge about objects and their relationships from selected regions of a scene.
  • methods: The authors introduce a new task, local scene graph generation, which abstracts pertinent structural information from partial objects and their relationships to support downstream tasks that demand advanced comprehension and reasoning. They propose zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework in which foundation models renowned for their perception and commonsense reasoning collaborate and exchange information to achieve zero-shot local scene graph generation without labeled supervision, and they introduce an open-ended evaluation metric, Entity-level CLIPScorE (ECLIPSE), that transcends the limited label space of closed-set metrics.
  • results: The approach markedly outperforms baselines in the open-ended evaluation setting and achieves a performance boost of up to 24.58% over prior methods in the closed-set setting, demonstrating strong comprehension and reasoning ability.
    Abstract Humans inherently recognize objects via selective visual perception, transform specific regions from the visual field into structured symbolic knowledge, and reason their relationships among regions based on the allocation of limited attention resources in line with humans' goals. While it is intuitive for humans, contemporary perception systems falter in extracting structural information due to the intricate cognitive abilities and commonsense knowledge required. To fill this gap, we present a new task called Local Scene Graph Generation. Distinct from the conventional scene graph generation task, which encompasses generating all objects and relationships in an image, our proposed task aims to abstract pertinent structural information with partial objects and their relationships for boosting downstream tasks that demand advanced comprehension and reasoning capabilities. Correspondingly, we introduce zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework harnessing foundation models renowned for their powerful perception and commonsense reasoning, where collaboration and information communication among foundation models yield superior outcomes and realize zero-shot local scene graph generation without requiring labeled supervision. Furthermore, we propose a novel open-ended evaluation metric, Entity-level CLIPScorE (ECLIPSE), surpassing previous closed-set evaluation metrics by transcending their limited label space, offering a broader assessment. Experiment results show that our approach markedly outperforms baselines in the open-ended evaluation setting, and it also achieves a significant performance boost of up to 24.58% over prior methods in the close-set setting, demonstrating the effectiveness and powerful reasoning ability of our proposed framework.

Streaming Motion Forecasting for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.01351
  • repo_url: https://github.com/ziqipang/streamingforecasting
  • paper_authors: Ziqi Pang, Deva Ramanan, Mengtian Li, Yu-Xiong Wang
  • for: This work addresses trajectory forecasting for autonomous navigation; existing benchmarks ignore that real-world applications operate on a continuous stream of data, so the problem is recast in a streaming setting.
  • methods: A streaming forecasting benchmark is introduced that queries future trajectories at every timestamp; it surfaces the safety-critical problem of forecasting for occluded agents, which snapshot-based benchmarks overlook, and additionally requires temporal coherence between predictions from adjacent timestamps.
  • results: (1) the proposed Predictive Streamer meta-algorithm converts any snapshot-based forecaster into a streaming forecaster and improves forecasting quality; (2) occlusion reasoning and temporal coherence strategies reduce the prediction error for occluded agents, lowering endpoint errors by 25%; (3) the work moves motion forecasting into its natural streaming setting.
    Abstract Trajectory forecasting is a widely-studied problem for autonomous navigation. However, existing benchmarks evaluate forecasting based on independent snapshots of trajectories, which are not representative of real-world applications that operate on a continuous stream of data. To bridge this gap, we introduce a benchmark that continuously queries future trajectories on streaming data and we refer to it as "streaming forecasting." Our benchmark inherently captures the disappearance and re-appearance of agents, presenting the emergent challenge of forecasting for occluded agents, which is a safety-critical problem yet overlooked by snapshot-based benchmarks. Moreover, forecasting in the context of continuous timestamps naturally asks for temporal coherence between predictions from adjacent timestamps. Based on this benchmark, we further provide solutions and analysis for streaming forecasting. We propose a plug-and-play meta-algorithm called "Predictive Streamer" that can adapt any snapshot-based forecaster into a streaming forecaster. Our algorithm estimates the states of occluded agents by propagating their positions with multi-modal trajectories, and leverages differentiable filters to ensure temporal consistency. Both occlusion reasoning and temporal coherence strategies significantly improve forecasting quality, resulting in 25% smaller endpoint errors for occluded agents and 10-20% smaller fluctuations of trajectories. Our work is intended to generate interest within the community by highlighting the importance of addressing motion forecasting in its intrinsic streaming setting. Code is available at https://github.com/ziqipang/StreamingForecasting.
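As a rough illustration of the two ingredients highlighted above, the sketch below propagates an occluded agent with a constant-velocity model and blends successive forecasts so that predictions from adjacent timestamps stay coherent. It is not the authors' Predictive Streamer (which uses multi-modal trajectories and differentiable filters); the function names and the blending weight `alpha` are illustrative assumptions.

```python
# Minimal sketch (not the authors' Predictive Streamer): propagate an occluded
# agent with a constant-velocity model and smooth successive forecasts so that
# predictions from adjacent timestamps stay temporally coherent.
import numpy as np

def propagate_occluded(last_pos, last_vel, dt, horizon):
    """Constant-velocity roll-out for an agent that is currently unobserved."""
    steps = np.arange(1, horizon + 1)[:, None]                    # (T, 1)
    return last_pos[None, :] + steps * dt * last_vel[None, :]     # (T, 2)

def smooth_forecast(prev_forecast, new_forecast, alpha=0.5):
    """Blend the forecast made at t-1 (shifted by one step) with the new one."""
    aligned_prev = prev_forecast[1:]                  # drop the step already consumed
    blended = alpha * new_forecast[:-1] + (1 - alpha) * aligned_prev
    return np.vstack([blended, new_forecast[-1:]])    # keep the newly exposed step

if __name__ == "__main__":
    pos, vel, dt = np.array([10.0, 2.0]), np.array([1.5, 0.0]), 0.1
    f_t = propagate_occluded(pos, vel, dt, horizon=30)
    f_t1 = propagate_occluded(pos + vel * dt, vel, dt, horizon=30)
    print(smooth_forecast(f_t, f_t1).shape)           # (30, 2)
```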

Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association

  • paper_url: http://arxiv.org/abs/2310.01330
  • repo_url: None
  • paper_authors: Qiyu Wu, Mengjie Zhao, Yutong He, Lang Huang, Junya Ono, Hiromi Wakaki, Yuki Mitsufuji
  • for: Mitigating reporting bias in visual-language datasets to improve object-attribute understanding and zero-shot retrieval performance.
  • methods: Bimodal augmentation (BiAug) approach through object-attribute decoupling, employing large language models (LLMs) and an inpainting model to synthesize visual-language examples with a rich array of object-attribute pairing and cross-modal hard negatives.
  • results: Superior object-attribute understanding and improved performance on zero-shot retrieval tasks on general benchmarks like MSCOCO and Flickr30K.
    Abstract Reporting bias arises when people assume that some knowledge is universally understood and hence, do not necessitate explicit elaboration. In this paper, we focus on the wide existence of reporting bias in visual-language datasets, embodied as the object-attribute association, which can subsequentially degrade models trained on them. To mitigate this bias, we propose a bimodal augmentation (BiAug) approach through object-attribute decoupling to flexibly synthesize visual-language examples with a rich array of object-attribute pairing and construct cross-modal hard negatives. We employ large language models (LLMs) in conjunction with a grounding object detector to extract target objects. Subsequently, the LLM generates a detailed attribute description for each object and produces a corresponding hard negative counterpart. An inpainting model is then used to create images based on these detailed object descriptions. By doing so, the synthesized examples explicitly complement omitted objects and attributes to learn, and the hard negative pairs steer the model to distinguish object attributes. Our experiments demonstrated that BiAug is superior in object-attribute understanding. In addition, BiAug also improves the performance on zero-shot retrieval tasks on general benchmarks like MSCOCO and Flickr30K. BiAug refines the way of collecting text-image datasets. Mitigating the reporting bias helps models achieve a deeper understanding of visual-language phenomena, expanding beyond mere frequent patterns to encompass the richness and diversity of real-world scenarios.

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

  • paper_url: http://arxiv.org/abs/2310.01324
  • repo_url: https://github.com/leexinhao/ZeroI2V
  • paper_authors: Xinhao Li, Limin Wang
  • for: The goal is to transfer image models to video recognition tasks without full fine-tuning.
  • methods: Two core designs enable zero-cost transfer: first, exploiting the flexibility of self-attention, spatial-temporal dual-headed attention (STDHA) endows the image transformer with temporal modeling capability at zero extra parameters and computation; second, a linear adaptation strategy with lightweight, densely placed linear adapters fully transfers the frozen image model to video recognition.
  • results: Extensive experiments on four widely used video recognition benchmarks show that ZeroI2V matches or even outperforms previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
    Abstract Adapting image models to video domain is becoming an efficient paradigm for solving video recognition tasks. Due to the huge number of parameters and effective transferability of image models, performing full fine-tuning is less efficient and even unnecessary. Thus, recent research is shifting its focus towards parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational cost to deal with the domain gap and temporal modeling in videos. In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) to transfer the image transformers to video recognition tasks (i.e., introduce zero extra cost to the adapted models during inference). To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of achieving image-to-video adaptation, we exploit the flexibility of self-attention and introduce the spatial-temporal dual-headed attention (STDHA) that efficiently endow the image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaption strategy which utilizes lightweight densely placed linear adapters to fully transfer the frozen image models to video recognition. Due to its customized linear design, all newly added adapters could be easily merged with the original modules through structural reparameterization after training, thus achieving zero extra cost during inference. Extensive experiments on four widely-used video recognition benchmarks show that our ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
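The abstract does not spell out how STDHA is realized; one plausible, parameter-free reading is to let a subset of attention heads look at temporally shifted keys and values, which is what the hedged sketch below implements. The head split, shift offset, and per-frame attention layout are our assumptions, not the official ZeroI2V design.

```python
# Hedged sketch of a "dual-headed" spatial-temporal attention in the spirit of
# STDHA: a subset of heads sees temporally shifted keys/values, so temporal
# mixing is obtained without any new parameters.
import torch

def temporal_head_shift(kv, num_heads, temporal_heads=2, shift=1):
    """kv: (B, T, N, C). Roll the channels of `temporal_heads` heads along T."""
    B, T, N, C = kv.shape
    head_dim = C // num_heads
    out = kv.clone()
    split = temporal_heads * head_dim
    out[..., :split] = torch.roll(kv[..., :split], shifts=shift, dims=1)
    return out

def stdha(q, k, v, num_heads):
    """Standard multi-head attention computed per frame, after shifting K/V."""
    B, T, N, C = q.shape
    k = temporal_head_shift(k, num_heads)
    v = temporal_head_shift(v, num_heads)
    def heads(x):  # (B, T, N, C) -> (B*T, heads, N, head_dim)
        return x.reshape(B * T, N, num_heads, C // num_heads).transpose(1, 2)
    attn = torch.softmax(heads(q) @ heads(k).transpose(-1, -2) / (C // num_heads) ** 0.5, dim=-1)
    return (attn @ heads(v)).transpose(1, 2).reshape(B, T, N, C)

x = torch.randn(2, 8, 49, 96)              # batch, frames, tokens, channels
print(stdha(x, x, x, num_heads=6).shape)   # torch.Size([2, 8, 49, 96])
```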

Color and Texture Dual Pipeline Lightweight Style Transfer

  • paper_url: http://arxiv.org/abs/2310.01321
  • repo_url: None
  • paper_authors: ShiQi Jiang
  • for: Improve the quality and efficiency of style transfer and add texture structures with controllable intensity to color transfer results.
  • methods: A dual-pipeline method simultaneously outputs color and texture transfer results, and a masked total variation loss suppresses artifacts and small texture representations in the color transfer output without affecting the semantic part of the content.
  • results: In comparative experiments, CTDP achieves state-of-the-art performance for both color and texture transfer, and the color transfer branch is only about 20k parameters, 100-1500 times smaller than other state-of-the-art models.
    Abstract Style transfer methods typically generate a single stylized output of color and texture coupling for reference styles, and color transfer schemes may introduce distortion or artifacts when processing reference images with duplicate textures. To solve the problem, we propose a Color and Texture Dual Pipeline Lightweight Style Transfer CTDP method, which employs a dual pipeline method to simultaneously output the results of color and texture transfer. Furthermore, we designed a masked total variation loss to suppress artifacts and small texture representations in color transfer results without affecting the semantic part of the content. More importantly, we are able to add texture structures with controllable intensity to color transfer results for the first time. Finally, we conducted feature visualization analysis on the texture generation mechanism of the framework and found that smoothing the input image can almost completely eliminate this texture structure. In comparative experiments, the color and texture transfer results generated by CTDP both achieve state-of-the-art performance. Additionally, the weight of the color transfer branch model size is as low as 20k, which is 100-1500 times smaller than that of other state-of-the-art models.
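A masked total variation loss of the kind described above can be written in a few lines; the sketch below assumes a binary mask marking the regions where artifacts and small textures should be suppressed, which may differ from the exact masking rule used in CTDP.

```python
# Minimal sketch of a masked total-variation loss: TV smoothing is applied only
# where the mask is 1, leaving the semantic content regions untouched.
import torch

def masked_tv_loss(img, mask):
    """img: (B, 3, H, W); mask: (B, 1, H, W) with 1 where TV is applied."""
    dh = (img[..., 1:, :] - img[..., :-1, :]).abs() * mask[..., 1:, :]
    dw = (img[..., :, 1:] - img[..., :, :-1]).abs() * mask[..., :, 1:]
    return dh.mean() + dw.mean()

out = torch.rand(1, 3, 64, 64, requires_grad=True)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
masked_tv_loss(out, mask).backward()   # gradients only flow inside the mask
```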

Efficient Remote Sensing Segmentation With Generative Adversarial Transformer

  • paper_url: http://arxiv.org/abs/2310.01292
  • repo_url: None
  • paper_authors: Luyi Qiu, Dayu Yu, Xiaofeng Zhang, Chenxiao Zhang
  • for: Improve semantic segmentation accuracy for remote sensing imagery while remaining suitable for embedded devices.
  • methods: A Global Transformer Network (GTNet) generator efficiently extracts multi-level features through residual connections; its global transformer blocks, with progressively linear computational complexity, reassign global features via a learnable similarity function.
  • results: Extensive experiments on the Vaihingen dataset achieve an average F1 score of 90.17% and an overall accuracy of 91.92%.
    Abstract Most deep learning methods that achieve high segmentation accuracy require deep network architectures that are too heavy and complex to run on embedded devices with limited storage and memory space. To address this issue, this paper proposes an efficient Generative Adversarial Transformer (GATrans) for achieving high-precision semantic segmentation while maintaining an extremely efficient size. The framework utilizes a Global Transformer Network (GTNet) as the generator, efficiently extracting multi-level features through residual connections. GTNet employs global transformer blocks with progressively linear computational complexity to reassign global features based on a learnable similarity function. To focus on object-level and pixel-level information, the GATrans optimizes the objective function by combining structural similarity losses. We validate the effectiveness of our approach through extensive experiments on the Vaihingen dataset, achieving an average F1 score of 90.17% and an overall accuracy of 91.92%.

3DHR-Co: A Collaborative Test-time Refinement Framework for In-the-Wild 3D Human-Body Reconstruction Task

  • paper_url: http://arxiv.org/abs/2310.01291
  • repo_url: None
  • paper_authors: Jonathan Samuel Lumentut, Kyoung Mu Lee
  • for: Improve the accuracy and robustness of 3D human-body reconstruction (3DHR), especially for in-the-wild scenes with diverse human poses and shapes.
  • methods: A collaborative strategy that combines a pre-adaptation stage, in which several 3DHR models cooperate within a single framework to directly improve their initial outputs, with test-time adaptation under settings that minimize overfitting.
  • results: Experiments show the proposed approach significantly improves common classic 3DHR backbones on diverse in-the-wild scenes, with up to -34 mm of pose error suppression.
    Abstract The field of 3D human-body reconstruction (abbreviated as 3DHR) that utilizes parametric pose and shape representations has witnessed significant advancements in recent years. However, the application of 3DHR techniques to handle real-world, diverse scenes, known as in-the-wild data, still faces limitations. The primary challenge arises as curating accurate 3D human pose ground truth (GT) for in-the-wild scenes is still difficult to obtain due to various factors. Recent test-time refinement approaches on 3DHR leverage initial 2D off-the-shelf human keypoints information to support the lack of 3D supervision on in-the-wild data. However, we observed that additional 2D supervision alone could cause the overfitting issue on common 3DHR backbones, making the 3DHR test-time refinement task seem intractable. We answer this challenge by proposing a strategy that complements 3DHR test-time refinement work under a collaborative approach. Specifically, we initially apply a pre-adaptation approach that works by collaborating various 3DHR models in a single framework to directly improve their initial outputs. This approach is then further combined with the test-time adaptation work under specific settings that minimize the overfitting issue to further boost the 3DHR performance. The whole framework is termed as 3DHR-Co, and on the experiment sides, we showed that the proposed work can significantly enhance the scores of common classic 3DHR backbones up to -34 mm pose error suppression, putting them among the top list on the in-the-wild benchmark data. Such achievement shows that our approach helps unveil the true potential of the common classic 3DHR backbones. Based on these findings, we further investigate various settings on the proposed framework to better elaborate the capability of our collaborative approach in the 3DHR task.

Offline Tracking with Object Permanence

  • paper_url: http://arxiv.org/abs/2310.01288
  • repo_url: None
  • paper_authors: Xianzhong Liu, Holger Caesar
  • for: Improve labeling efficiency for autonomous driving datasets and avoid the high cost of manual annotation.
  • methods: An offline tracking model that handles occlusion in order to recover the trajectories of occluded objects; it consists of a standard online tracker, a re-identification (Re-ID) module, and a track completion module, where the Re-ID and track completion modules take the vectorized map as one of their inputs to refine the tracking results under occlusion.
  • results: The model effectively recovers occluded object trajectories and achieves state-of-the-art 3D multi-object tracking performance, improving over the original online tracking result by 45% IDS and 2% AMOTA on vehicle tracks.
    Abstract To reduce the expensive labor cost for manual labeling autonomous driving datasets, an alternative is to automatically label the datasets using an offline perception system. However, objects might be temporally occluded. Such occlusion scenarios in the datasets are common yet underexplored in offline autolabeling. In this work, we propose an offline tracking model that focuses on occluded object tracks. It leverages the concept of object permanence which means objects continue to exist even if they are not observed anymore. The model contains three parts: a standard online tracker, a re-identification (Re-ID) module that associates tracklets before and after occlusion, and a track completion module that completes the fragmented tracks. The Re-ID module and the track completion module use the vectorized map as one of the inputs to refine the tracking results with occlusion. The model can effectively recover the occluded object trajectories. It achieves state-of-the-art performance in 3D multi-object tracking by improving over the original online tracking result by 45% IDS and 2% AMOTA on the vehicle tracks.
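To make the track completion step concrete, the toy sketch below fills the gap between a tracklet that ends before an occlusion and the re-identified tracklet that resumes afterwards. The paper uses a learned completion module conditioned on the vectorized map; linear interpolation here only illustrates the data flow and is an assumption.

```python
# Toy sketch of the track-completion idea: once Re-ID has associated a tracklet
# that ends at frame t0 with one that restarts at frame t1, fill the occluded gap.
import numpy as np

def complete_track(end_pose, start_pose, t0, t1):
    """Linearly interpolate 2D poses for the occluded frames t0+1 .. t1-1."""
    ts = np.arange(t0 + 1, t1)
    w = (ts - t0) / (t1 - t0)
    return end_pose[None, :] * (1 - w[:, None]) + start_pose[None, :] * w[:, None]

gap = complete_track(np.array([5.0, 1.0]), np.array([11.0, 4.0]), t0=10, t1=16)
print(gap.round(2))   # 5 interpolated positions bridging the occlusion
```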

MobileNVC: Real-time 1080p Neural Video Compression on a Mobile Device

  • paper_url: http://arxiv.org/abs/2310.01258
  • repo_url: None
  • paper_authors: Ties van Rozendaal, Tushar Singhal, Hoang Le, Guillaume Sautiere, Amir Said, Krishna Buska, Anjuman Raha, Dimitris Kalatzis, Hitarth Mehta, Frank Mayer, Liang Zhang, Markus Nagel, Auke Wiggers
  • for: This paper presents a neural video codec that runs on a mobile device and is competitive with standard codecs in the low-delay setting.
  • methods: Two main contributions enable real-time neural video coding: an efficient codec that uses the block-based motion compensation algorithm available on the warping core of the mobile accelerator and is quantized to integer precision, and a fast decoder pipeline that concurrently runs the neural network components on the neural signal processor, entropy coding on the mobile GPU, and warping on the warping core.
  • results: The codec outperforms the previous on-device codec by a large margin, with up to 48% BD-rate savings, while reducing the receiver-side MAC count by 10x; a careful ablation demonstrates the effect of the introduced motion compensation scheme.
    Abstract Neural video codecs have recently become competitive with standard codecs such as HEVC in the low-delay setting. However, most neural codecs are large floating-point networks that use pixel-dense warping operations for temporal modeling, making them too computationally expensive for deployment on mobile devices. Recent work has demonstrated that running a neural decoder in real time on mobile is feasible, but shows this only for 720p RGB video, while the YUV420 format is more commonly used in production. This work presents the first neural video codec that decodes 1080p YUV420 video in real time on a mobile device. Our codec relies on two major contributions. First, we design an efficient codec that uses a block-based motion compensation algorithm available on the warping core of the mobile accelerator, and we show how to quantize this model to integer precision. Second, we implement a fast decoder pipeline that concurrently runs neural network components on the neural signal processor, parallel entropy coding on the mobile GPU, and warping on the warping core. Our codec outperforms the previous on-device codec by a large margin with up to 48 % BD-rate savings, while reducing the MAC count on the receiver side by 10x. We perform a careful ablation to demonstrate the effect of the introduced motion compensation scheme, and ablate the effect of model quantization.
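Block-based motion compensation, as offloaded to the warping core, can be pictured with the minimal sketch below: the reference frame is tiled into fixed-size blocks and each block is copied from a location offset by its motion vector. Block size, integer-only vectors, and edge padding are simplifying assumptions; the actual codec also handles sub-pixel motion.

```python
# Toy sketch of block-based motion compensation: split the reference frame into
# 16x16 blocks and copy each block from a location offset by its motion vector.
import numpy as np

def block_motion_compensation(ref, mvs, block=16):
    """ref: (H, W); mvs: (H//block, W//block, 2) integer (dy, dx) per block."""
    H, W = ref.shape
    pad = np.pad(ref, block, mode="edge")
    pred = np.zeros_like(ref)
    for by in range(H // block):
        for bx in range(W // block):
            dy, dx = mvs[by, bx]
            y, x = by * block + block + dy, bx * block + block + dx
            pred[by*block:(by+1)*block, bx*block:(bx+1)*block] = pad[y:y+block, x:x+block]
    return pred

ref = np.random.rand(64, 64).astype(np.float32)
mvs = np.random.randint(-4, 5, size=(4, 4, 2))
print(block_motion_compensation(ref, mvs).shape)    # (64, 64)
```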

Generating 3D Brain Tumor Regions in MRI using Vector-Quantization Generative Adversarial Networks

  • paper_url: http://arxiv.org/abs/2310.01251
  • repo_url: None
  • paper_authors: Meng Zhou, Matthias W Wagner, Uri Tabori, Cynthia Hawkins, Birgit B Ertl-Wagner, Farzad Khalvati
  • for: Strengthen deep-learning-based medical image analysis, in particular the use of generative adversarial networks (GANs) to generate realistic and diverse images that augment training datasets.
  • methods: A new framework combining a vector-quantization GAN and a transformer with masked token modeling to generate high-resolution, diverse 3D brain tumor ROIs that can be used directly as augmented data for brain tumor classification.
  • results: Applied to two imbalanced datasets, the BraTS 2019 dataset and an in-house pediatric LGG (pLGG) dataset, the method improves AUC over the baseline models by 6.4% on BraTS 2019 and 4.3% on the internal pLGG dataset, indicating that the generated tumor ROIs effectively address the imbalanced-data problem and may help in the accurate diagnosis of rare brain tumors.
    Abstract Medical image analysis has significantly benefited from advancements in deep learning, particularly in the application of Generative Adversarial Networks (GANs) for generating realistic and diverse images that can augment training datasets. However, the effectiveness of such approaches is often limited by the amount of available data in clinical settings. Additionally, the common GAN-based approach is to generate entire image volumes, rather than solely the region of interest (ROI). Research on deep learning-based brain tumor classification using MRI has shown that it is easier to classify the tumor ROIs compared to the entire image volumes. In this work, we present a novel framework that uses vector-quantization GAN and a transformer incorporating masked token modeling to generate high-resolution and diverse 3D brain tumor ROIs that can be directly used as augmented data for the classification of brain tumor ROI. We apply our method to two imbalanced datasets where we augment the minority class: (1) the Multimodal Brain Tumor Segmentation Challenge (BraTS) 2019 dataset to generate new low-grade glioma (LGG) ROIs to balance with high-grade glioma (HGG) class; (2) the internal pediatric LGG (pLGG) dataset tumor ROIs with BRAF V600E Mutation genetic marker to balance with BRAF Fusion genetic marker class. We show that the proposed method outperforms various baseline models in both qualitative and quantitative measurements. The generated data was used to balance the data in the brain tumor types classification task. Using the augmented data, our approach surpasses baseline models by 6.4% in AUC on the BraTS 2019 dataset and 4.3% in AUC on our internal pLGG dataset. The results indicate the generated tumor ROIs can effectively address the imbalanced data problem. Our proposed method has the potential to facilitate an accurate diagnosis of rare brain tumors using MRI scans.

Mirror Diffusion Models for Constrained and Watermarked Generation

  • paper_url: http://arxiv.org/abs/2310.01236
  • repo_url: None
  • paper_authors: Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou, Molei Tao
  • for: This work aims to build diffusion models that generate data on constrained sets while preserving tractability.
  • methods: Mirror Diffusion Models (MDM), a new class of diffusion models that learn the diffusion process in a dual space constructed from a mirror map, enabling generation on convex constrained sets.
  • results: MDM generates data on constrained sets without losing tractability; in addition, constrained sets are explored as a mechanism to embed invisible but quantitative information (watermarks) in generated data for safety and privacy purposes.
    Abstract Modern successes of diffusion models in learning complex, high-dimensional data distributions are attributed, in part, to their capability to construct diffusion processes with analytic transition kernels and score functions. The tractability results in a simulation-free framework with stable regression losses, from which reversed, generative processes can be learned at scale. However, when data is confined to a constrained set as opposed to a standard Euclidean space, these desirable characteristics appear to be lost based on prior attempts. In this work, we propose Mirror Diffusion Models (MDM), a new class of diffusion models that generate data on convex constrained sets without losing any tractability. This is achieved by learning diffusion processes in a dual space constructed from a mirror map, which, crucially, is a standard Euclidean space. We derive efficient computation of mirror maps for popular constrained sets, such as simplices and $\ell_2$-balls, showing significantly improved performance of MDM over existing methods. For safety and privacy purposes, we also explore constrained sets as a new mechanism to embed invisible but quantitative information (i.e., watermarks) in generated data, for which MDM serves as a compelling approach. Our work brings new algorithmic opportunities for learning tractable diffusion on complex domains.
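For intuition about the mirror map, the sketch below shows a standard entropic-style map for the probability simplex: the forward map sends simplex points to an unconstrained Euclidean space where a diffusion can be run, and the inverse (a softmax) maps samples back onto the simplex. The specific mirror maps derived in the paper for simplices and l2-balls may differ; this is only a generic example.

```python
# Generic mirror map for the probability simplex: forward map to an
# unconstrained dual space, inverse map (softmax) back to the simplex.
import torch

def mirror_forward(x, eps=1e-8):
    """Simplex interior -> R^{d-1} via log-ratios against the last coordinate."""
    return torch.log(x[..., :-1] + eps) - torch.log(x[..., -1:] + eps)

def mirror_inverse(y):
    """R^{d-1} -> simplex interior (softmax over [y, 0])."""
    z = torch.cat([y, torch.zeros_like(y[..., :1])], dim=-1)
    return torch.softmax(z, dim=-1)

x = torch.tensor([[0.2, 0.3, 0.5]])
y = mirror_forward(x)        # unconstrained; safe to run a diffusion here
print(mirror_inverse(y))     # ~[[0.2, 0.3, 0.5]], back on the simplex
```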

Reconstructing 3D Human Pose from RGB-D Data with Occlusions

  • paper_url: http://arxiv.org/abs/2310.01228
  • repo_url: https://github.com/DangBowen-Bell/Occlusion_HPR
  • paper_authors: Bowen Dang, Xi Zhao, Bowen Zhang, He Wang
  • for: Reconstruct the 3D human body from RGB-D images in the presence of occlusions.
  • methods: A neural network estimates a "free zone" in which poses of the invisible body parts can be searched without penetrating the scene, while the "truncated shadow volume" of the scanned body point cloud constrains the visible body parts.
  • results: Experiments on the PROX dataset show more accurate and plausible results than other methods.
    Abstract We propose a new method to reconstruct the 3D human body from RGB-D images with occlusions. The foremost challenge is the incompleteness of the RGB-D data due to occlusions between the body and the environment, leading to implausible reconstructions that suffer from severe human-scene penetration. To reconstruct a semantically and physically plausible human body, we propose to reduce the solution space based on scene information and prior knowledge. Our key idea is to constrain the solution space of the human body by considering the occluded body parts and visible body parts separately: modeling all plausible poses where the occluded body parts do not penetrate the scene, and constraining the visible body parts using depth data. Specifically, the first component is realized by a neural network that estimates the candidate region named the "free zone", a region carved out of the open space within which it is safe to search for poses of the invisible body parts without concern for penetration. The second component constrains the visible body parts using the "truncated shadow volume" of the scanned body point cloud. Furthermore, we propose to use a volume matching strategy, which yields better performance than surface matching, to match the human body with the confined region. We conducted experiments on the PROX dataset, and the results demonstrate that our method produces more accurate and plausible results compared with other methods.

Making LLaMA SEE and Draw with SEED Tokenizer

  • paper_url: http://arxiv.org/abs/2310.01218
  • repo_url: https://github.com/ailab-cvc/seed
  • paper_authors: Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan
  • for: This work aims to push large language models (LLMs) toward handling multimodal information and exhibiting emergent abilities in an open-world context.
  • methods: A new image tokenizer, SEED, lets LLMs See and Draw at the same time and perform scalable multimodal autoregression under their original next-word-prediction training recipe; SEED tokens carry a 1D causal dependency so that images and text can be predicted interchangeably within a unified autoregressive Transformer.
  • results: SEED tokens enable scalable multimodal autoregressive prediction and strong performance across a broad range of multimodal comprehension and generation tasks; SEED-LLaMA further exhibits compositional emergent abilities such as multi-turn in-context multimodal generation.
    Abstract The great success of Large Language Models (LLMs) has expanded the potential of multimodality, contributing to the gradual evolution of General Artificial Intelligence (AGI). A true AGI agent should not only possess the capability to perform predefined multi-tasks but also exhibit emergent abilities in an open-world context. However, despite the considerable advancements made by recent multimodal LLMs, they still fall short in effectively unifying comprehension and generation tasks, let alone open-world emergent abilities. We contend that the key to overcoming the present impasse lies in enabling text and images to be represented and processed interchangeably within a unified autoregressive Transformer. To this end, we introduce SEED, an elaborate image tokenizer that empowers LLMs with the ability to SEE and Draw at the same time. We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. With SEED tokens, LLM is able to perform scalable multimodal autoregression under its original training recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by large-scale pretraining and instruction tuning on the interleaved textual and visual data, demonstrating impressive performance on a broad range of multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation, acting like your AI assistant.

Towards Robust Cardiac Segmentation using Graph Convolutional Networks

  • paper_url: http://arxiv.org/abs/2310.01210
  • repo_url: https://github.com/gillesvntnu/graphbasedsegmentation
  • paper_authors: Gilles Van De Vyver, Sarina Thomas, Guy Ben-Yosef, Sindre Hellum Olaisen, Håvard Dalen, Lasse Løvstakken, Erik Smistad
  • for: This work aims to make cardiac structure segmentation in echocardiography more robust than existing deep learning models.
  • methods: Graph convolutional networks (GCNs) predict the contour points of the cardiac structures of interest instead of labeling each pixel; the GCN uses two convolutional rings based on cardiac anatomy.
  • results: The graph architecture eliminates anatomically incorrect multi-structure segmentations on the public CAMUS dataset; the work also contributes an ablation study, an evaluation of clinical measurements on the clinical HUNT4 dataset, and a proposal to use the inter-model agreement between the U-Net and the graph network to predict input and segmentation quality in real time.
    Abstract Fully automatic cardiac segmentation can be a fast and reproducible method to extract clinical measurements from an echocardiography examination. The U-Net architecture is the current state-of-the-art deep learning architecture for medical segmentation and can segment cardiac structures in real-time with average errors comparable to inter-observer variability. However, this architecture still generates large outliers that are often anatomically incorrect. This work uses the concept of graph convolutional neural networks that predict the contour points of the structures of interest instead of labeling each pixel. We propose a graph architecture that uses two convolutional rings based on cardiac anatomy and show that this eliminates anatomical incorrect multi-structure segmentations on the publicly available CAMUS dataset. Additionally, this work contributes with an ablation study on the graph convolutional architecture and an evaluation of clinical measurements on the clinical HUNT4 dataset. Finally, we propose to use the inter-model agreement of the U-Net and the graph network as a predictor of both the input and segmentation quality. We show this predictor can detect out-of-distribution and unsuitable input images in real-time. Source code is available online: https://github.com/gillesvntnu/GCN_multistructure
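One simple way to read the "two convolutional rings" design is a graph convolution over contour keypoints arranged in a ring, where each point aggregates features from its neighbours along the contour. The sketch below shows such a ring convolution; layer sizes, neighbourhood width, and the final (x, y) regression head are illustrative assumptions rather than the published architecture.

```python
# Minimal sketch of one graph-convolution step over contour keypoints arranged
# in a ring: each point aggregates features from its k neighbours on each side.
import torch
import torch.nn as nn

class RingGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=2):
        super().__init__()
        self.k = k                                   # neighbours on each side
        self.lin = nn.Linear(in_dim * (2 * k + 1), out_dim)

    def forward(self, x):                            # x: (B, N, C), N contour points
        neigh = [torch.roll(x, s, dims=1) for s in range(-self.k, self.k + 1)]
        return torch.relu(self.lin(torch.cat(neigh, dim=-1)))

feats = torch.randn(4, 64, 32)                       # 64 points around one chamber
layer = RingGraphConv(32, 2)                         # predict (x, y) per point
print(layer(feats).shape)                            # torch.Size([4, 64, 2])
```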

Self-distilled Masked Attention guided masked image modeling with noise Regularized Teacher (SMART) for medical image analysis

  • paper_url: http://arxiv.org/abs/2310.01209
  • repo_url: None
  • paper_authors: Jue Jiang, Harini Veeraraghavan
  • for: This work proposes a self-distilled, masked-attention-guided masked image modeling (MIM) pretraining approach to improve the transferability and accuracy of Swin models for medical image analysis.
  • methods: Hierarchical shifted window transformers (Swin) are architecturally enhanced with semantic class attention for self-supervised, attention-guided co-distillation with MIM, together with a noise-injected momentum teacher implemented through patch dropout of the teacher's inputs.
  • results: SMART performs well across downstream tasks, namely predicting immunotherapy response in advanced-stage lung cancer (Task I), predicting lung cancer recurrence (Task II), lung cancer segmentation (Task III), and unsupervised clustering of organs in the chest and abdomen (Task IV), without fine-tuning, demonstrating strong transferability and accuracy for medical image analysis.
    Abstract Hierarchical shifted window transformers (Swin) are a computationally efficient and more accurate alternative to plain vision transformers. Masked image modeling (MIM)-based pretraining is highly effective in increasing models' transferability to a variety of downstream tasks. However, more accurate and efficient attention guided MIM approaches are difficult to implement with Swin due to its lack of an explicit global attention. We thus architecturally enhanced Swin with semantic class attention for self-supervised attention guided co-distillation with MIM. We also introduced a noise injected momentum teacher, implemented with patch dropout of teacher's inputs for improved training regularization and accuracy. Our approach, called self-distilled masked attention MIM with noise regularized teacher (SMART), was pretrained with 10,412 unlabeled 3D computed tomography (CT)s of multiple disease sites and sourced from institutional and public datasets. We evaluated SMART for multiple downstream tasks involving analysis of 3D CTs of lung cancer (LC) patients for: (i) [Task I] predicting immunotherapy response in advanced stage LC (n = 200 internal dataset), (ii) [Task II] predicting LC recurrence in early stage LC before surgery (n = 156 public dataset), (iii) [Task III] LC segmentation (n = 200 internal, 21 public dataset), and (iv) [Task IV] unsupervised clustering of organs in the chest and abdomen (n = 1,743 public dataset) without finetuning. SMART predicted immunotherapy response with an AUC of 0.916, LC recurrence with an AUC of 0.793, segmented LC with Dice accuracy of 0.81, and clustered organs with an inter-class cluster distance of 5.94, indicating capability of attention guided MIM for Swin in medical image analysis.
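The noise-injected momentum teacher is described as patch dropout applied to the teacher's inputs; a minimal version of that operation is sketched below. The keep ratio and token layout are assumptions for illustration.

```python
# Sketch of patch dropout on the teacher input: a random subset of patch tokens
# is simply removed before the teacher forward pass.
import torch

def patch_dropout(tokens, keep_ratio=0.75):
    """tokens: (B, N, C). Keep a random keep_ratio of patch tokens per sample."""
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.argsort(torch.rand(B, N), dim=1)[:, :n_keep]       # random subset
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, C))

teacher_in = patch_dropout(torch.randn(2, 196, 768))
print(teacher_in.shape)        # torch.Size([2, 147, 768])
```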

Cross-adversarial local distribution regularization for semi-supervised medical image segmentation

  • paper_url: http://arxiv.org/abs/2310.01176
  • repo_url: https://github.com/PotatoThanh/Cross-adversarial-local-distribution-regularization
  • paper_authors: Thanh Nguyen-Duc, Trung Le, Roland Bammer, He Zhao, Jianfei Cai, Dinh Phung
  • for: This paper studies semi-supervised medical image segmentation, in particular the setting where annotated data are limited.
  • methods: A novel cross-adversarial local distribution (Cross-ALD) regularization is proposed to further enhance the smoothness assumption in semi-supervised medical image segmentation.
  • results: In the experiments, Cross-ALD achieves state-of-the-art performance against many recent methods on the public LA and ACDC datasets.
    Abstract Medical semi-supervised segmentation is a technique where a model is trained to segment objects of interest in medical images with limited annotated data. Existing semi-supervised segmentation methods are usually based on the smoothness assumption. This assumption implies that the model output distributions of two similar data samples are encouraged to be invariant. In other words, the smoothness assumption states that similar samples (e.g., adding small perturbations to an image) should have similar outputs. In this paper, we introduce a novel cross-adversarial local distribution (Cross-ALD) regularization to further enhance the smoothness assumption for semi-supervised medical image segmentation task. We conducted comprehensive experiments that the Cross-ALD archives state-of-the-art performance against many recent methods on the public LA and ACDC datasets.
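The sketch below shows a simplified, single-model analogue of adversarial smoothness regularization in the spirit of virtual adversarial training: a small perturbation that maximally changes the prediction distribution is found and the resulting change is penalized. The paper's cross-adversarial, local-distribution formulation across models is more involved; this is only meant to convey the mechanism behind the smoothness assumption.

```python
# Simplified, VAT-like smoothness regularizer (not the Cross-ALD formulation):
# perturb the input adversarially and penalize the change in predictions.
import torch
import torch.nn.functional as F

def adversarial_smoothness(model, x, xi=1e-6, eps=2.0):
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)
    d = torch.randn_like(x, requires_grad=True)
    p_hat = F.log_softmax(model(x + xi * d), dim=1)
    adv_dist = F.kl_div(p_hat, p, reduction="batchmean")
    (grad,) = torch.autograd.grad(adv_dist, d)
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)
    p_adv = F.log_softmax(model(x + r_adv), dim=1)
    return F.kl_div(p_adv, p, reduction="batchmean")   # add to the supervised loss

# usage (toy per-pixel 3-class "segmentation" model):
net = torch.nn.Conv2d(1, 3, 3, padding=1)
loss = adversarial_smoothness(net, torch.randn(2, 1, 32, 32))
loss.backward()
```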

Segment Any Building

  • paper_url: http://arxiv.org/abs/2310.01164
  • repo_url: https://github.com/SOYJUN/Application-with-raw-IP-sockets
  • paper_authors: Lei Li
  • for: This work aims to improve the accuracy and efficiency of building segmentation in remote sensing imagery, with applications in urban planning, disaster mitigation, and ecological monitoring.
  • methods: Diverse datasets are strategically combined and cutting-edge representation learning methods are applied to building segmentation; the dataset fusion broadens the information available for model training.
  • results: The joint training regimen yields strong performance metrics across multiple datasets, strengthening the foundation for follow-up research and pointing toward innovative applications in building segmentation.
    Abstract The task of identifying and segmenting buildings within remote sensing imagery has perennially stood at the forefront of scholarly investigations. This manuscript accentuates the potency of harnessing diversified datasets in tandem with cutting-edge representation learning paradigms for building segmentation in such images. Through the strategic amalgamation of disparate datasets, we have not only expanded the informational horizon accessible for model training but also manifested unparalleled performance metrics across multiple datasets. Our avant-garde joint training regimen underscores the merit of our approach, bearing significant implications in pivotal domains such as urban infrastructural development, disaster mitigation strategies, and ecological surveillance. Our methodology, predicated upon the fusion of datasets and gleaning insights from pre-trained models, carves a new benchmark in the annals of building segmentation endeavors. The outcomes of this research both fortify the foundations for ensuing scholarly pursuits and presage a horizon replete with innovative applications in the discipline of building segmentation.

Iterative Semi-Supervised Learning for Abdominal Organs and Tumor Segmentation

  • paper_url: http://arxiv.org/abs/2310.01159
  • repo_url: https://github.com/ustguy/flare23
  • paper_authors: Jiaxin Zhuang, Luyang Luo, Zhixuan Chen, Linshan Wu
  • for: This study aims to improve deep-learning methods for abdominal organ and tumor segmentation in computed tomography (CT) scans.
  • methods: A semi-supervised learning (SSL) strategy with iterative pseudo labeling: a deep model (nn-UNet) trained on the fully annotated subset generates pseudo labels for the whole dataset, which are then used to train a more powerful segmentation model.
  • results: On the FLARE23 dataset, the approach reaches an average DSC of 89.63% for organs and 46.07% for tumors on the online validation leaderboard, with 0.9007% DSC and 0.9493% NSD for organ segmentation and 0.3785% DSC and 0.2842% NSD for tumor segmentation; code is available at https://github.com/USTguy/Flare23.
    Abstract Deep-learning (DL) based methods are playing an important role in the task of abdominal organs and tumors segmentation in CT scans. However, the large requirements of annotated datasets heavily limit its development. The FLARE23 challenge provides a large-scale dataset with both partially and fully annotated data, which also focuses on both segmentation accuracy and computational efficiency. In this study, we propose to use the strategy of Semi-Supervised Learning (SSL) and iterative pseudo labeling to address FLARE23. Initially, a deep model (nn-UNet) trained on datasets with complete organ annotations (about 220 scans) generates pseudo labels for the whole dataset. These pseudo labels are then employed to train a more powerful segmentation model. Employing the FLARE23 dataset, our approach achieves an average DSC score of 89.63% for organs and 46.07% for tumors on online validation leaderboard. For organ segmentation, We obtain 0.9007\% DSC and 0.9493\% NSD. For tumor segmentation, we obtain 0.3785% DSC and 0.2842% NSD. Our code is available at https://github.com/USTguy/Flare23.
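The iterative pseudo-labeling strategy reduces to a simple loop: train on the labeled subset, pseudo-label the unlabeled pool, and retrain on the union. The schematic below is runnable but replaces nnU-Net and CT volumes with a toy classifier and random tensors; only the control flow mirrors the described strategy.

```python
# Schematic of the iterative semi-supervised loop with pseudo labeling.
import torch
import torch.nn.functional as F

def train(model, xs, ys, steps=50, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(xs), ys).backward()
        opt.step()

labelled_x, labelled_y = torch.randn(64, 16), torch.randint(0, 4, (64,))
unlabelled_x = torch.randn(256, 16)

model = torch.nn.Linear(16, 4)
train_x, train_y = labelled_x, labelled_y
for round_ in range(2):                              # iterative pseudo-labelling rounds
    train(model, train_x, train_y)
    with torch.no_grad():
        pseudo_y = model(unlabelled_x).argmax(dim=1)
    train_x = torch.cat([labelled_x, unlabelled_x])  # labelled + pseudo-labelled
    train_y = torch.cat([labelled_y, pseudo_y])
print("final training set:", train_x.shape[0], "samples")
```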

[Re] CLRNet: Cross Layer Refinement Network for Lane Detection

  • paper_url: http://arxiv.org/abs/2310.01142
  • repo_url: None
  • paper_authors: Viswesh N, Kaushal Jadhav, Avi Amalanshu, Bratin Mondal, Sabaris Waran, Om Sadhwani, Apoorv Kumar, Debashish Chakravarty
  • for: This work reports on CLRNet, a Cross Layer Refinement Network for lane detection that fuses high-level and low-level features to improve detection accuracy.
  • methods: The CLRNet architecture contains a high-level and a low-level feature extraction part, each built on its own convolutional network to better capture the corresponding features; the two sets of features are fused across multiple levels to improve detection accuracy.
  • results: Tests on three lane detection benchmarks show that CLRNet sets a new state of the art.
    Abstract The following work is a reproducibility report for CLRNet: Cross Layer Refinement Network for Lane Detection. The basic code was made available by the author. The paper proposes a novel Cross Layer Refinement Network to utilize both high and low level features for lane detection. The authors assert that the proposed technique sets the new state-of-the-art on three lane-detection benchmarks

Neural Processing of Tri-Plane Hybrid Neural Fields

  • paper_url: http://arxiv.org/abs/2310.01140
  • repo_url: https://github.com/CVLAB-Unibo/triplane_processing
  • paper_authors: Adriano Cardace, Pierluigi Zama Ramirez, Francesco Ballerini, Allan Zhou, Samuele Salti, Luigi Di Stefano
  • for: This paper addresses tasks such as classification and part segmentation performed directly on neural fields for 3D data, which have appealing properties for storing and communicating 3D content.
  • methods: Individual neural fields parameterized as large multi-layer perceptrons (MLPs) are hard to process directly because of the high-dimensional weight space, intrinsic weight-space symmetries, and sensitivity to random initialization, leading to poor results, while hybrid representations, in particular tri-plane based ones, had not yet been processed directly; this paper shows that the tri-plane discrete data structure encodes rich information that standard deep-learning machinery can process effectively.
  • results: The paper defines an extensive benchmark covering occupancy, signed/unsigned distance and, for the first time, radiance fields; while processing fields of the same reconstruction quality, the approach achieves task performance far superior to frameworks that process large MLPs and nearly on par with architectures handling explicit representations.
    Abstract Driven by the appealing properties of neural fields for storing and communicating 3D data, the problem of directly processing them to address tasks such as classification and part segmentation has emerged and has been investigated in recent works. Early approaches employ neural fields parameterized by shared networks trained on the whole dataset, achieving good task performance but sacrificing reconstruction quality. To improve the latter, later methods focus on individual neural fields parameterized as large Multi-Layer Perceptrons (MLPs), which are, however, challenging to process due to the high dimensionality of the weight space, intrinsic weight space symmetries, and sensitivity to random initialization. Hence, results turn out significantly inferior to those achieved by processing explicit representations, e.g., point clouds or meshes. In the meantime, hybrid representations, in particular based on tri-planes, have emerged as a more effective and efficient alternative to realize neural fields, but their direct processing has not been investigated yet. In this paper, we show that the tri-plane discrete data structure encodes rich information, which can be effectively processed by standard deep-learning machinery. We define an extensive benchmark covering a diverse set of fields such as occupancy, signed/unsigned distance, and, for the first time, radiance fields. While processing a field with the same reconstruction quality, we achieve task performance far superior to frameworks that process large MLPs and, for the first time, almost on par with architectures handling explicit representations.
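For readers unfamiliar with tri-plane hybrid fields, the sketch below shows how such a field is typically queried: a 3D point is projected onto the XY, XZ, and YZ feature planes, bilinearly sampled features are aggregated, and a small MLP decodes the field value. Resolutions, channel counts, and sum aggregation are common choices and not necessarily those used in the paper.

```python
# Sketch of querying a tri-plane hybrid field: project, sample, sum, decode.
import torch
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """planes: (3, C, R, R); pts: (N, 3) in [-1, 1]. Returns (N, C)."""
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]      # xy, xz, yz
    feats = 0
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                                 # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid, align_corners=True)    # (1, C, N, 1)
        feats = feats + f[0, :, :, 0].t()                           # (N, C)
    return feats

planes = torch.randn(3, 32, 128, 128)
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
occupancy = mlp(sample_triplane(planes, torch.rand(1024, 3) * 2 - 1))
print(occupancy.shape)    # torch.Size([1024, 1])
```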

Strength in Diversity: Multi-Branch Representation Learning for Vehicle Re-Identification

  • paper_url: http://arxiv.org/abs/2310.01129
  • repo_url: https://github.com/videturfortuna/vehicle_reid_itsc2023
  • paper_authors: Eurico Almeida, Bruno Silva, Jorge Batista
  • for: This paper aims to improve vehicle re-identification (V-ReID).
  • methods: A multi-branch architecture designed with a combination of Grouped-convolution and Loss-Branch-Split (LBS) strategies extracts robust and diverse embeddings; a ResNet50 global branch is combined with a BotNet self-attention branch, both within the LBS strategy, and a lightweight solution uses grouped convolution to mimic loss-splitting into multiple embeddings while significantly reducing model size.
  • results: The method outperforms competing solutions on Veri-776 with 85.6% mAP and 97.7% CMC1 and obtains competitive results on Veri-Wild with 88.1% mAP and 96.3% CMC1; overall, the work offers useful insights for improving vehicle re-identification and a strong basis for other retrieval tasks.
    Abstract This paper presents an efficient and lightweight multi-branch deep architecture to improve vehicle re-identification (V-ReID). While most V-ReID work uses a combination of complex multi-branch architectures to extract robust and diversified embeddings towards re-identification, we advocate that simple and lightweight architectures can be designed to fulfill the Re-ID task without compromising performance. We propose a combination of Grouped-convolution and Loss-Branch-Split strategies to design a multi-branch architecture that improve feature diversity and feature discriminability. We combine a ResNet50 global branch architecture with a BotNet self-attention branch architecture, both designed within a Loss-Branch-Split (LBS) strategy. We argue that specialized loss-branch-splitting helps to improve re-identification tasks by generating specialized re-identification features. A lightweight solution using grouped convolution is also proposed to mimic the learning of loss-splitting into multiple embeddings while significantly reducing the model size. In addition, we designed an improved solution to leverage additional metadata, such as camera ID and pose information, that uses 97% less parameters, further improving re-identification performance. In comparison to state-of-the-art (SoTA) methods, our approach outperforms competing solutions in Veri-776 by achieving 85.6% mAP and 97.7% CMC1 and obtains competitive results in Veri-Wild with 88.1% mAP and 96.3% CMC1. Overall, our work provides important insights into improving vehicle re-identification and presents a strong basis for other retrieval tasks. Our code is available at the https://github.com/videturfortuna/vehicle_reid_itsc2023.
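A minimal sketch of the Loss-Branch-Split idea implemented with grouped convolution is given below: a single 1x1 grouped convolution produces several sub-embeddings from a shared backbone feature map, and each sub-embedding is supervised by its own identity classifier. All dimensions and the number of branches are illustrative assumptions, not the paper's configuration.

```python
# Sketch of loss-branch-split with grouped convolution: one grouped 1x1 conv
# yields G sub-embeddings, each with its own ID head, encouraging diversity.
import torch
import torch.nn as nn

class GroupedLBS(nn.Module):
    def __init__(self, in_ch=2048, groups=4, emb=256, num_ids=576):
        super().__init__()
        self.groups, self.emb = groups, emb
        self.proj = nn.Conv2d(in_ch, groups * emb, 1, groups=groups, bias=False)
        self.heads = nn.ModuleList(nn.Linear(emb, num_ids) for _ in range(groups))

    def forward(self, fmap):                          # fmap: (B, in_ch, H, W)
        z = self.proj(fmap).mean(dim=(2, 3))          # (B, groups*emb), global pooling
        subs = z.view(-1, self.groups, self.emb)      # one embedding per branch
        logits = [head(subs[:, g]) for g, head in enumerate(self.heads)]
        return subs, logits                           # sum CE losses over logits

model = GroupedLBS()
subs, logits = model(torch.randn(8, 2048, 16, 16))
print(subs.shape, logits[0].shape)                    # (8, 4, 256) (8, 576)
```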

Batch-less stochastic gradient descent for compressive learning of deep regularization for image denoising

  • paper_url: http://arxiv.org/abs/2310.03085
  • repo_url: None
  • paper_authors: Hui Shi, Yann Traonmilin, J-F Aujol
  • for: This paper addresses denoising of signals or images with the help of prior information drawn from a database of clean data.
  • methods: Variational denoising with a regularizer that is systematically linked to the data distribution through the maximum a posteriori Bayesian framework; deep neural networks (DNNs) recover complex distributions from a large training database, and the compressive learning framework is adapted to learn DNN-parametrized regularizers.
  • results: Two stochastic gradient descent (SGD) variants recover the deep regularization parameters from a heavily compressed database; they outperform the initially proposed method, which was limited to low-dimensional signals and used information from the whole database at every iteration, and they benefit from classical SGD convergence guarantees, which makes the approach applicable to patch-based image denoising.
    Abstract We consider the problem of denoising with the help of prior information taken from a database of clean signals or images. Denoising with variational methods is very efficient if a regularizer well adapted to the nature of the data is available. Thanks to the maximum a posteriori Bayesian framework, such regularizer can be systematically linked with the distribution of the data. With deep neural networks (DNN), complex distributions can be recovered from a large training database. To reduce the computational burden of this task, we adapt the compressive learning framework to the learning of regularizers parametrized by DNN. We propose two variants of stochastic gradient descent (SGD) for the recovery of deep regularization parameters from a heavily compressed database. These algorithms outperform the initially proposed method that was limited to low-dimensional signals, each iteration using information from the whole database. They also benefit from classical SGD convergence guarantees. Thanks to these improvements we show that this method can be applied for patch based image denoising.

HyMNet: a Multimodal Deep Learning System for Hypertension Classification using Fundus Photographs and Cardiometabolic Risk Factors

  • paper_url: http://arxiv.org/abs/2310.01099
  • repo_url: https://github.com/mohammedsb/hypertension
  • paper_authors: Mohammed Baharoon, Hessa Almatar, Reema Alduhayan, Tariq Aldebasi, Badr Alahmadi, Yahya Bokhari, Mohammed Alawad, Ahmed Almazroa, Abdulrhman Aljouie
  • for: Predicting hypertension (HTN) from fundus photographs.
  • methods: A multimodal deep learning (MMDL) system, HyMNet, combines fundus images with cardiometabolic risk factors (age and gender) to improve hypertension detection.
  • results: The multimodal model that integrates fundus images with age and gender achieves an AUC of 0.791 [CI: 0.735, 0.848], outperforming the unimodal model trained solely on fundus photographs, which reaches an AUC of 0.766 [CI: 0.705, 0.828].
    Abstract In recent years, deep learning has shown promise in predicting hypertension (HTN) from fundus images. However, most prior research has primarily focused on analyzing a single type of data, which may not capture the full complexity of HTN risk. To address this limitation, this study introduces a multimodal deep learning (MMDL) system, dubbed HyMNet, which combines fundus images and cardiometabolic risk factors, specifically age and gender, to improve hypertension detection capabilities. Our MMDL system uses the DenseNet-201 architecture, pre-trained on ImageNet, for the fundus imaging path and a fully connected neural network for the age and gender path. The two paths are jointly trained by concatenating 64 features output from each path that are then fed into a fusion network. The system was trained on 1,143 retinal images from 626 individuals collected from the Saudi Ministry of National Guard Health Affairs. The results show that the multimodal model that integrates fundus images along with age and gender achieved an AUC of 0.791 [CI: 0.735, 0.848], which outperforms the unimodal model trained solely on fundus photographs that yielded an AUC of 0.766 [CI: 0.705, 0.828] for hypertension detection.
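The two-path fusion described above can be sketched as follows: pooled DenseNet-201 image features and a small MLP over (age, gender) are each reduced to 64 features, concatenated, and fed to a fusion classifier. The backbone is replaced by a random feature vector to keep the sketch light, and the hidden widths beyond the two 64-dimensional outputs are assumptions.

```python
# Sketch of the HyMNet-style two-path fusion head for hypertension detection.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, img_feat_dim=1920):            # DenseNet-201 pooled features
        super().__init__()
        self.img_path = nn.Linear(img_feat_dim, 64)
        self.tab_path = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 64))
        self.fusion = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, img_feats, age_gender):
        z = torch.cat([self.img_path(img_feats), self.tab_path(age_gender)], dim=1)
        return self.fusion(z)                          # logit for hypertension

head = FusionHead()
logit = head(torch.randn(4, 1920), torch.tensor([[52.0, 1.0]] * 4))
print(torch.sigmoid(logit).shape)                      # torch.Size([4, 1])
```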

Leveraging Cutting Edge Deep Learning Based Image Matching for Reconstructing a Large Scene from Sparse Images

  • paper_url: http://arxiv.org/abs/2310.01092
  • repo_url: None
  • paper_authors: Georg Bökman, Johan Edstedt
  • for: This work targets the AISG-SLA Visual Localisation Challenge benchmark (IJCAI 2023), where the task is to estimate relative motion between images taken in sequence by a camera mounted on a car driving through an urban scene.
  • methods: Images are matched with the authors' recent deep-learning-based matcher RoMa, and structure from motion reconstruction is performed with COLMAP; the recent DeDoDe keypoints are chosen for their high repeatability, and time jumps in the image sequence are handled by matching specific non-consecutive image pairs retrieved with DINOv2.
  • results: Sequential RoMa matching alone already ranks third on the challenge benchmark; the improved pipeline beats all competitors, and a loose upper bound on the accuracy obtainable by the image retrieval approach is also reported by matching hand-picked non-consecutive pairs.
    Abstract We present the top ranked solution for the AISG-SLA Visual Localisation Challenge benchmark (IJCAI 2023), where the task is to estimate relative motion between images taken in sequence by a camera mounted on a car driving through an urban scene. For matching images we use our recent deep learning based matcher RoMa. Matching image pairs sequentially and estimating relative motion from point correspondences sampled by RoMa already gives very competitive results -- third rank on the challenge benchmark. To improve the estimations we extract keypoints in the images, match them using RoMa, and perform structure from motion reconstruction using COLMAP. We choose our recent DeDoDe keypoints for their high repeatability. Further, we address time jumps in the image sequence by matching specific non-consecutive image pairs based on image retrieval with DINOv2. These improvements yield a solution beating all competitors. We further present a loose upper bound on the accuracy obtainable by the image retrieval approach by also matching hand-picked non-consecutive pairs.

Unsupervised Roofline Extraction from True Orthophotos for LoD2 Building Model Reconstruction

  • paper_url: http://arxiv.org/abs/2310.01067
  • repo_url: https://github.com/tudelft3d/roofline-extraction-from-orthophotos
  • paper_authors: Weixiao Gao, Ravi Peters, Jantien Stoter
  • for: The paper aims to reconstruct LoD2 building models for large-scale urban environments from 2D and 3D data, using point clouds generated from oblique aerial images.
  • methods: Line detection is used to extract rooflines from true orthophotos for the reconstruction of building models at the LoD2 level.
  • results: The method extracts relatively complete rooflines without pre-labeled training data or pre-trained models; the lines can be used directly in the LoD2 reconstruction process, and the approach outperforms existing plane detection-based methods and state-of-the-art deep learning methods in accuracy and completeness.
    Abstract This paper discusses the reconstruction of LoD2 building models from 2D and 3D data for large-scale urban environments. Traditional methods involve the use of LiDAR point clouds, but due to high costs and long intervals associated with acquiring such data for rapidly developing areas, researchers have started exploring the use of point clouds generated from (oblique) aerial images. However, using such point clouds for traditional plane detection-based methods can result in significant errors and introduce noise into the reconstructed building models. To address this, this paper presents a method for extracting rooflines from true orthophotos using line detection for the reconstruction of building models at the LoD2 level. The approach is able to extract relatively complete rooflines without the need for pre-labeled training data or pre-trained models. These lines can directly be used in the LoD2 building model reconstruction process. The method is superior to existing plane detection-based methods and state-of-the-art deep learning methods in terms of the accuracy and completeness of the reconstructed building. Our source code is available at https://github.com/tudelft3d/Roofline-extraction-from-orthophotos.

Unsupervised motion segmentation in one go: Smooth long-term model over a video

  • paper_url: http://arxiv.org/abs/2310.01040
  • repo_url: None
  • paper_authors: Etienne Meunier, Patrick Bouthemy
  • for: Proposes a fully unsupervised method that segments coherent motion over an entire video sequence in one go, evaluated on video object segmentation (VOS) benchmarks.
  • methods: A transformer-based network whose loss function is derived from the Evidence Lower Bound (ELBO): a flow reconstruction term combining polynomial (quadratic) motion models over the spatial dimensions with B-splines over the temporal dimension, plus a regularization term enforcing temporal consistency on the masks.
  • results: Convincing quantitative results on four VOS benchmarks, with visual results highlighting the method's contribution to temporal consistency.
    Abstract Human beings have the ability to continuously analyze a video and immediately extract the main motion components. Motion segmentation methods often proceed frame by frame. We want to go beyond this classical paradigm, and perform the motion segmentation over a video sequence in one go. It will be a prominent added value for downstream computer vision tasks, and could provide a pretext criterion for unsupervised video representation learning. In this perspective, we propose a novel long-term spatio-temporal model operating in a totally unsupervised way. It takes as input the volume of consecutive optical flow (OF) fields, and delivers a volume of segments of coherent motion over the video. More specifically, we have designed a transformer-based network, where we leverage a mathematically well-founded framework, the Evidence Lower Bound (ELBO), to infer the loss function. The loss function combines a flow reconstruction term involving spatio-temporal parametric motion models combining, in a novel way, polynomial (quadratic) motion models for the $(x,y)$-spatial dimensions and B-splines for the time dimension of the video sequence, and a regularization term enforcing temporal consistency on the masks. We report experiments on four VOS benchmarks with convincing quantitative results. We also highlight through visual results the key contributions on temporal consistency brought by our method.
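To make the flow-reconstruction idea more concrete, here is a small sketch of a quadratic (degree-2 polynomial) parametric motion model over the spatial dimensions and a mask-weighted reconstruction error. The B-spline temporal part and the full ELBO-derived loss are omitted, and all tensor shapes are illustrative assumptions.

```python
import torch

def quadratic_flow(theta, H, W):
    """Evaluate a per-segment quadratic parametric flow field.
    theta: (K, 2, 6) coefficients for K segments and 2 flow components,
    over the basis [1, x, y, x^2, xy, y^2]."""
    y, x = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    basis = torch.stack([torch.ones_like(x), x, y, x * x, x * y, y * y])  # (6, H, W)
    return torch.einsum("kcb,bhw->kchw", theta, basis)                    # (K, 2, H, W)

def masked_flow_reconstruction(flow, masks, theta):
    """flow: (2, H, W) observed optical flow; masks: (K, H, W) soft segment masks."""
    K, H, W = masks.shape
    param_flow = quadratic_flow(theta, H, W)                   # (K, 2, H, W)
    err = (flow.unsqueeze(0) - param_flow).abs().sum(dim=1)    # (K, H, W)
    return (masks * err).mean()

# Toy usage: 3 motion segments on a 64x80 flow field
flow = torch.randn(2, 64, 80)
masks = torch.softmax(torch.randn(3, 64, 80), dim=0)
theta = torch.zeros(3, 2, 6, requires_grad=True)
loss = masked_flow_reconstruction(flow, masks, theta)
loss.backward()
```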

Learnable Cross-modal Knowledge Distillation for Multi-modal Learning with Missing Modality

  • paper_url: http://arxiv.org/abs/2310.01035
  • repo_url: None
  • paper_authors: Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, Gustavo Carneiro
  • for: Addresses the missing-modality problem in multi-modal models to improve their performance.
  • methods: Proposes a Learnable Cross-modal Knowledge Distillation (LCKD) model that adaptively identifies the important modalities and distils knowledge from them to the other modalities.
  • results: On the Brain Tumour Segmentation Dataset 2018 (BraTS2018), LCKD outperforms other methods by a considerable margin, improving state-of-the-art segmentation Dice scores by 3.61% for enhancing tumour, 5.99% for tumour core, and 3.76% for whole tumour.
    Abstract The problem of missing modalities is both critical and non-trivial to be handled in multi-modal models. It is common for multi-modal tasks that certain modalities contribute more compared to other modalities, and if those important modalities are missing, the model performance drops significantly. Such fact remains unexplored by current multi-modal approaches that recover the representation from missing modalities by feature reconstruction or blind feature aggregation from other modalities, instead of extracting useful information from the best performing modalities. In this paper, we propose a Learnable Cross-modal Knowledge Distillation (LCKD) model to adaptively identify important modalities and distil knowledge from them to help other modalities from the cross-modal perspective for solving the missing modality issue. Our approach introduces a teacher election procedure to select the most ``qualified'' teachers based on their single modality performance on certain tasks. Then, cross-modal knowledge distillation is performed between teacher and student modalities for each task to push the model parameters to a point that is beneficial for all tasks. Hence, even if the teacher modalities for certain tasks are missing during testing, the available student modalities can accomplish the task well enough based on the learned knowledge from their automatically elected teacher modalities. Experiments on the Brain Tumour Segmentation Dataset 2018 (BraTS2018) shows that LCKD outperforms other methods by a considerable margin, improving the state-of-the-art performance by 3.61% for enhancing tumour, 5.99% for tumour core, and 3.76% for whole tumour in terms of segmentation Dice score.
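A minimal sketch of the cross-modal distillation idea: a teacher modality is elected from single-modality validation scores, and its features are distilled into the remaining student modalities. The teacher-election details, segmentation heads, and the actual LCKD loss weighting are omitted, and the feature shapes and modality names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def elect_teacher(single_modality_scores):
    """Pick the modality with the best single-modality validation score."""
    return max(single_modality_scores, key=single_modality_scores.get)

def cross_modal_kd_loss(student_feat, teacher_feat, temperature=2.0):
    """Distil teacher-modality features into a student modality.
    Both inputs: (B, C, ...) features from modality-specific encoders."""
    s = F.log_softmax(student_feat.flatten(1) / temperature, dim=1)
    t = F.softmax(teacher_feat.flatten(1).detach() / temperature, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy usage with hypothetical MRI modalities
scores = {"t1": 0.71, "t1ce": 0.78, "t2": 0.69, "flair": 0.74}
teacher = elect_teacher(scores)                                # -> "t1ce"
feats = {m: torch.randn(2, 32, 8, 8) for m in scores}
loss = sum(cross_modal_kd_loss(feats[m], feats[teacher])
           for m in scores if m != teacher)
```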

Incorporating Supervised Domain Generalization into Data Augmentation

  • paper_url: http://arxiv.org/abs/2310.01029
  • repo_url: None
  • paper_authors: Shohei Enomoto, Monikka Roslianna Busto, Takeharu Eda
  • for: Enhance the robustness of deep learning in outdoor settings so that accuracy is preserved under distribution shifts such as compression artifacts.
  • methods: Treats data augmentation as supervised domain generalization (SDG) and uses the contrastive semantic alignment (CSA) loss to improve the robustness and training efficiency of data augmentation.
  • results: Experiments on the CIFAR-100 and CUB datasets show that the proposed method improves the robustness and training efficiency of typical data augmentations and can be used as a plug-in for existing data augmentation methods.
    Abstract With the increasing utilization of deep learning in outdoor settings, its robustness needs to be enhanced to preserve accuracy in the face of distribution shifts, such as compression artifacts. Data augmentation is a widely used technique to improve robustness, thanks to its ease of use and numerous benefits. However, it requires more training epochs, making it difficult to train large models with limited computational resources. To address this problem, we treat data augmentation as supervised domain generalization~(SDG) and benefit from the SDG method, contrastive semantic alignment~(CSA) loss, to improve the robustness and training efficiency of data augmentation. The proposed method only adds loss during model training and can be used as a plug-in for existing data augmentation methods. Experiments on the CIFAR-100 and CUB datasets show that the proposed method improves the robustness and training efficiency of typical data augmentations.
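Below is a sketch of a contrastive semantic alignment (CSA) style loss applied to data augmentation: features of clean samples and their augmented views with the same label are pulled together, while pairs with different labels are pushed apart up to a margin. The margin value and the way features are produced are illustrative assumptions, not the paper's exact settings; in practice this term would be added to the usual cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def csa_loss(feat_clean, feat_aug, labels, margin=1.0):
    """Contrastive semantic alignment between clean and augmented views.
    feat_clean, feat_aug: (B, D) features; labels: (B,) class ids."""
    d = torch.cdist(feat_clean, feat_aug)                        # (B, B) pairwise distances
    same = labels.unsqueeze(1).eq(labels.unsqueeze(0)).float()   # same-class mask
    align = (same * d.pow(2)).sum() / same.sum().clamp(min=1)
    separate = ((1 - same) * F.relu(margin - d).pow(2)).sum() / (1 - same).sum().clamp(min=1)
    return align + separate

# Toy usage
feat_clean, feat_aug = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.randint(0, 10, (8,))
loss = csa_loss(feat_clean, feat_aug, labels)
```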

A New Real-World Video Dataset for the Comparison of Defogging Algorithms

  • paper_url: http://arxiv.org/abs/2310.01020
  • repo_url: None
  • paper_authors: Alexandra Duminil, Jean-Philippe Tarel, Roland Brémond
  • for: Foggy video restoration is attracting increasing attention, yet datasets containing both clear and foggy video samples, required for deep learning and benchmarking, are lacking.
  • methods: The paper introduces VIREDA, a new REal-world VIdeo dataset for the comparison of Defogging Algorithms, with various fog densities and fog-free ground truths. A video defogging algorithm (still under development) is also outlined, whose key idea is to exploit temporal redundancy to minimize artefacts and exposure variations between frames.
  • results: A Transformer-based architecture is adopted to demonstrate the relevance of the proposed dataset.
    Abstract Video restoration for noise removal, deblurring or super-resolution is attracting more and more attention in the fields of image processing and computer vision. Works on video restoration with data-driven approaches for fog removal are rare however, due to the lack of datasets containing videos in both clear and foggy conditions which are required for deep learning and benchmarking. A new dataset, called REVIDE, was recently proposed for just that purpose. In this paper, we implement the same approach by proposing a new REal-world VIdeo dataset for the comparison of Defogging Algorithms (VIREDA), with various fog densities and ground truths without fog. This small database can serve as a test base for defogging algorithms. A video defogging algorithm is also mentioned (still under development), with the key idea of using temporal redundancy to minimize artefacts and exposure variations between frames. Inspired by the success of Transformers architecture in deep learning for various applications, we select this kind of architecture in a neural network to show the relevance of the proposed dataset.

Controlling Vision-Language Models for Universal Image Restoration

  • paper_url: http://arxiv.org/abs/2310.01018
  • repo_url: https://github.com/algolzw/daclip-uir
  • paper_authors: Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
  • for: This paper aims to transfer pre-trained vision-language models (CLIP) to low-level vision tasks and provides a universal framework for image restoration.
  • methods: An additional controller adapts the fixed CLIP image encoder to predict high-quality feature embeddings, which are integrated into an image restoration network via cross-attention; the controller also outputs a degradation feature matching the real corruption of the input, guiding the model toward high-fidelity reconstruction and acting as a natural classifier of degradation types.
  • results: The method achieves state-of-the-art performance on both degradation-specific and unified image restoration tasks; a mixed degradation dataset with synthetic captions is also constructed for DA-CLIP training.
    Abstract Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a universal framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both degradation-specific and unified image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.
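A minimal sketch of injecting a predicted content embedding into a restoration-network block via cross-attention, the integration mechanism described above. The block structure, dimensions, and names are illustrative assumptions rather than DA-CLIP's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionInjection(nn.Module):
    """Restoration features (queries) attend to a content embedding (key/value)."""
    def __init__(self, channels=64, embed_dim=512, heads=4):
        super().__init__()
        self.to_tokens = nn.Linear(embed_dim, channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat, embedding):
        # feat: (B, C, H, W) restoration features; embedding: (B, embed_dim)
        B, C, H, W = feat.shape
        q = feat.flatten(2).transpose(1, 2)              # (B, H*W, C)
        kv = self.to_tokens(embedding).unsqueeze(1)      # (B, 1, C)
        out, _ = self.attn(self.norm(q), kv, kv)
        return feat + out.transpose(1, 2).reshape(B, C, H, W)

# Toy usage with an embedding assumed to come from a frozen image encoder
feat = torch.randn(2, 64, 32, 32)
content_embedding = torch.randn(2, 512)
out = CrossAttentionInjection()(feat, content_embedding)
```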

Multi-task Learning with 3D-Aware Regularization

  • paper_url: http://arxiv.org/abs/2310.00986
  • repo_url: https://github.com/vico-uoe/mtpsl
  • paper_authors: Wei-Hong Li, Steven McDonagh, Ales Leonardis, Hakan Bilen
  • for: The paper aims to improve the performance of deep neural networks on multiple dense computer vision tasks by introducing a structured 3D-aware regularizer.
  • methods: The proposed method interfaces multiple tasks through a shared 3D feature space, reducing noisy cross-task correlations; it is architecture agnostic and can be plugged into various prior multi-task backbones.
  • results: The proposed method improves performance on the standard benchmarks NYUv2 and PASCAL-Context.
    Abstract Deep neural networks have become a standard building block for designing models that can perform multiple dense computer vision tasks such as depth estimation and semantic segmentation thanks to their ability to capture complex correlations in high dimensional feature space across tasks. However, the cross-task correlations that are learned in the unstructured feature space can be extremely noisy and susceptible to overfitting, consequently hurting performance. We propose to address this problem by introducing a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space and decodes them into their task output space through differentiable rendering. We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their performance; as we evidence using standard benchmarks NYUv2 and PASCAL-Context.

LS-VOS: Identifying Outliers in 3D Object Detections Using Latent Space Virtual Outlier Synthesis

  • paper_url: http://arxiv.org/abs/2310.00952
  • repo_url: None
  • paper_authors: Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp, Marc Walessa, Daniel Meissner, Klaus Dietmayer
  • for: Improve the reliability and accuracy of LiDAR-based 3D object detectors in autonomous driving applications by identifying outlier detections.
  • methods: Builds on Virtual Outlier Synthesis (VOS), integrating outlier knowledge during training so the model learns more compact decision boundaries; a new synthesis approach uses the latent space of an auto-encoder network to generate outlier features with a parametrizable degree of similarity to in-distribution features.
  • results: Extensive experiments show the approach improves the outlier detection capabilities of a state-of-the-art 3D object detector while maintaining high 3D object detection performance.
    Abstract LiDAR-based 3D object detectors have achieved unprecedented speed and accuracy in autonomous driving applications. However, similar to other neural networks, they are often biased toward high-confidence predictions or return detections where no real object is present. These types of detections can lead to a less reliable environment perception, severely affecting the functionality and safety of autonomous vehicles. We address this problem by proposing LS-VOS, a framework for identifying outliers in 3D object detections. Our approach builds on the idea of Virtual Outlier Synthesis (VOS), which incorporates outlier knowledge during training, enabling the model to learn more compact decision boundaries. In particular, we propose a new synthesis approach that relies on the latent space of an auto-encoder network to generate outlier features with a parametrizable degree of similarity to in-distribution features. In extensive experiments, we show that our approach improves the outlier detection capabilities of a state-of-the-art object detector while maintaining high 3D object detection performance.
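A toy sketch of the latent-space synthesis idea: in-distribution features are encoded by an auto-encoder, their latents are perturbed with a tunable amount of noise (controlling similarity to in-distribution features), and then decoded back into virtual outlier features. The auto-encoder layout and the noise model are illustrative assumptions, not LS-VOS's exact formulation.

```python
import torch
import torch.nn as nn

class FeatureAutoEncoder(nn.Module):
    def __init__(self, feat_dim=256, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def synthesize_outliers(ae, id_feats, similarity=0.7):
    """Generate virtual outlier features near the in-distribution latents.
    similarity in (0, 1]: higher -> outliers closer to in-distribution features."""
    with torch.no_grad():
        z = ae.enc(id_feats)
        noise_scale = (1.0 - similarity) * z.std(dim=0, keepdim=True)
        return ae.dec(z + noise_scale * torch.randn_like(z))

# Toy usage: virtual outliers can serve as negatives for an outlier head
ae = FeatureAutoEncoder()
id_feats = torch.randn(64, 256)
virtual_outliers = synthesize_outliers(ae, id_feats)
```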

Autonomous Navigation of Micro Air Vehicles in Warehouses Using Vision-based Line Following

  • paper_url: http://arxiv.org/abs/2310.00950
  • repo_url: None
  • paper_authors: Ling Shuang Soh, Hann Woei Ho
  • for: This work provides a vision-based navigation solution for indoor Micro Air Vehicles (MAVs), primarily targeting autonomous warehouses.
  • methods: The approach combines HSV color detection and the Hough Line Transform for line detection and following in warehouse environments, with a Kalman filter enabling the camera to track the yellow line reliably.
  • results: MAV flight tests in the Gazebo 11 platform with ROS Noetic show that the system successfully navigates narrow indoor spaces; the proposed system has the potential to reduce labor costs and improve warehouse productivity.
    Abstract In this paper, we propose a vision-based solution for indoor Micro Air Vehicle (MAV) navigation, with a primary focus on its application within autonomous warehouses. Our work centers on the utilization of a single camera as the primary sensor for tasks such as detection, localization, and path planning. To achieve these objectives, we implement the HSV color detection and the Hough Line Transform for effective line detection within warehouse environments. The integration of a Kalman filter into our system enables the camera to track yellow lines reliably. We evaluated the performance of our vision-based line following algorithm through various MAV flight tests conducted in the Gazebo 11 platform, utilizing ROS Noetic. The results of these simulations demonstrate the system capability to successfully navigate narrow indoor spaces. Our proposed system has the potential to significantly reduce labor costs and enhance overall productivity in warehouse operations. This work contributes to the growing field of MAV applications in autonomous warehouses, addressing the need for efficient logistics and supply chain solutions.
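A compact OpenCV sketch of the per-frame pipeline described above: HSV thresholding for the yellow line, the probabilistic Hough line transform, and a constant-velocity Kalman filter smoothing the line's horizontal offset. The HSV bounds and filter tuning are illustrative assumptions, and the real system runs inside ROS rather than as a standalone script.

```python
import cv2
import numpy as np

# Constant-velocity Kalman filter over the line's horizontal offset (state: [x, vx])
kf = cv2.KalmanFilter(2, 1)
kf.transitionMatrix = np.array([[1, 1], [0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0]], np.float32)
kf.processNoiseCov = 1e-3 * np.eye(2, dtype=np.float32)
kf.measurementNoiseCov = np.array([[1e-1]], np.float32)

def line_offset(frame_bgr):
    """Return a smoothed horizontal offset of the yellow guide line."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (20, 100, 100), (35, 255, 255))   # assumed yellow range
    lines = cv2.HoughLinesP(mask, 1, np.pi / 180, threshold=50,
                            minLineLength=40, maxLineGap=10)
    pred = kf.predict()
    if lines is None:
        return float(pred[0, 0])                              # coast on the prediction
    xs = [(x1 + x2) / 2.0 for x1, _, x2, _ in lines[:, 0]]
    measurement = np.array([[np.float32(np.mean(xs))]])
    return float(kf.correct(measurement)[0, 0])
```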

Towards Robust 3D Object Detection In Rainy Conditions

  • paper_url: http://arxiv.org/abs/2310.00944
  • repo_url: None
  • paper_authors: Aldi Piroli, Vinzenz Dallabetta, Johannes Kopp, Marc Walessa, Daniel Meissner, Klaus Dietmayer
  • for: Improve the robustness of 3D object detectors against road spray.
  • methods: An existing adverse weather detection network filters road spray out of the LiDAR point cloud, and radar targets are additionally used to filter false positive detections.
  • results: Tests on real-world data show the approach improves the robustness of several popular 3D object detectors to road spray.
    Abstract LiDAR sensors are used in autonomous driving applications to accurately perceive the environment. However, they are affected by adverse weather conditions such as snow, fog, and rain. These everyday phenomena introduce unwanted noise into the measurements, severely degrading the performance of LiDAR-based perception systems. In this work, we propose a framework for improving the robustness of LiDAR-based 3D object detectors against road spray. Our approach uses a state-of-the-art adverse weather detection network to filter out spray from the LiDAR point cloud, which is then used as input for the object detector. In this way, the detected objects are less affected by the adverse weather in the scene, resulting in a more accurate perception of the environment. In addition to adverse weather filtering, we explore the use of radar targets to further filter false positive detections. Tests on real-world data show that our approach improves the robustness to road spray of several popular 3D object detectors.
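A simple sketch of the two filtering steps described above: points flagged as spray by an adverse-weather classifier are removed before detection, and detections without any nearby radar target are treated as false positives. The score and distance thresholds are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def filter_spray_points(points, spray_scores, threshold=0.5):
    """points: (N, 3) LiDAR xyz; spray_scores: (N,) per-point spray probability."""
    return points[spray_scores < threshold]

def radar_gate_detections(detections, radar_targets, max_dist=2.0):
    """Keep 3D detections with at least one radar target within max_dist metres.
    detections: (M, 3) box centers; radar_targets: (R, 3) radar positions."""
    if len(radar_targets) == 0:
        return detections
    d = np.linalg.norm(detections[:, None, :] - radar_targets[None, :, :], axis=-1)
    return detections[d.min(axis=1) <= max_dist]

# Toy usage: the cleaned point cloud is what gets fed to the 3D detector
points = np.random.randn(1000, 3)
scores = np.random.rand(1000)
clean_points = filter_spray_points(points, scores)
```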

Semi-Blind Image Deblurring Based on Framelet Prior

  • paper_url: http://arxiv.org/abs/2310.00943
  • repo_url: None
  • paper_authors: M. Zarebnia, R. Parvaz
  • for: The paper aims to solve the problem of image blurring, which is caused by various factors such as hand or camera shake.
  • methods: A semi-blind image deblurring approach is used for this ill-conditioned problem, improving the total variation (TV) method with the framelet transform and fractional calculations.
  • results: The proposed method is tested on different types of images and compared with existing methods, showing higher accuracy and stability in image deblurring.
    Abstract The problem of image blurring is one of the most studied topics in the field of image processing. Image blurring is caused by various factors such as hand or camera shake. To restore the blurred image, it is necessary to know information about the point spread function (PSF). Because in most cases it is not possible to accurately calculate the PSF, we are dealing with an approximate kernel. In this paper, the semi-blind image deblurring problem is studied. Since the model of the deblurring problem is ill-conditioned, it cannot be solved directly. One of the most efficient ways to solve this problem is to use the total variation (TV) method. In the proposed algorithm, the TV method is improved by using the framelet transform and fractional calculations. The proposed method is applied to different types of images and compared with existing methods using different types of tests.
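To make the total variation part concrete, here is a small sketch of a TV-regularized deblurring objective with a known approximate kernel. The framelet and fractional-calculus refinements proposed in the paper are not shown, and the kernel, weights, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def tv(u):
    """Anisotropic total variation of an image tensor u of shape (1, 1, H, W)."""
    return (u[..., :, 1:] - u[..., :, :-1]).abs().sum() + \
           (u[..., 1:, :] - u[..., :-1, :]).abs().sum()

def deblur(blurred, kernel, lam=1e-3, steps=300, lr=0.05):
    """Minimise ||k * u - b||^2 + lam * TV(u) for a known approximate kernel k."""
    u = blurred.clone().requires_grad_(True)
    opt = torch.optim.Adam([u], lr=lr)
    pad = kernel.shape[-1] // 2
    for _ in range(steps):
        opt.zero_grad()
        reblurred = F.conv2d(u, kernel, padding=pad)
        loss = (reblurred - blurred).pow(2).sum() + lam * tv(u)
        loss.backward()
        opt.step()
    return u.detach()

# Toy usage: a 9x9 box kernel stands in for the approximate PSF
kernel = torch.ones(1, 1, 9, 9) / 81.0
blurred = torch.rand(1, 1, 64, 64)
restored = deblur(blurred, kernel)
```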

Data Efficient Training of a U-Net Based Architecture for Structured Documents Localization

  • paper_url: http://arxiv.org/abs/2310.00937
  • repo_url: None
  • paper_authors: Anastasiia Kabeshova, Guillaume Betmont, Julien Lerouge, Evgeny Stepankevich, Alexis Bergès
  • for: Improve document analysis and recognition in modern online on-boarding processes, where document localization is a crucial step toward reliable key information extraction.
  • methods: Proposes SDL-Net, a U-Net-like encoder-decoder architecture for localizing structured documents; its encoder can be pre-trained on a generic dataset containing samples of various document classes, enabling fast and data-efficient fine-tuning of decoders to support the localization of new document classes.
  • results: Extensive experiments on a proprietary dataset of structured document images demonstrate the effectiveness and generalization capabilities of the proposed approach.
    Abstract Structured documents analysis and recognition are essential for modern online on-boarding processes, and document localization is a crucial step to achieve reliable key information extraction. While deep-learning has become the standard technique used to solve document analysis problems, real-world applications in industry still face the limited availability of labelled data and of computational resources when training or fine-tuning deep-learning models. To tackle these challenges, we propose SDL-Net: a novel U-Net like encoder-decoder architecture for the localization of structured documents. Our approach allows pre-training the encoder of SDL-Net on a generic dataset containing samples of various document classes, and enables fast and data-efficient fine-tuning of decoders to support the localization of new document classes. We conduct extensive experiments on a proprietary dataset of structured document images to demonstrate the effectiveness and the generalization capabilities of the proposed approach.

Trained Latent Space Navigation to Prevent Lack of Photorealism in Generated Images on Style-based Models

  • paper_url: http://arxiv.org/abs/2310.00936
  • repo_url: None
  • paper_authors: Takumi Harada, Kazuyuki Aihara, Hiroyuki Sakai
  • for: Improve latent code search and manipulation in StyleGAN-based models while preserving the photorealism of generated images.
  • methods: A simple unsupervised method identifies densely mapped latent spaces and restricts latent manipulations to within a local latent subspace, so that generated images remain photorealistic.
  • results: Experiments show that images generated within the local latent subspace maintain photorealism even when latent codes are significantly and repeatedly manipulated, and that the method can be applied to latent code optimization for various types of style-based models.
    Abstract Recent studies on StyleGAN variants show promising performances for various generation tasks. In these models, latent codes have traditionally been manipulated and searched for the desired images. However, this approach sometimes suffers from a lack of photorealism in generated images due to a lack of knowledge about the geometry of the trained latent space. In this paper, we show a simple unsupervised method that provides well-trained local latent subspace, enabling latent code navigation while preserving the photorealism of the generated images. Specifically, the method identifies densely mapped latent spaces and restricts latent manipulations within the local latent subspace. Experimental results demonstrate that images generated within the local latent subspace maintain photorealism even when the latent codes are significantly and repeatedly manipulated. Moreover, experiments show that the method can be applied to latent code optimization for various types of style-based models. Our empirical evidence of the method will benefit applications in style-based models.

Enhanced Winter Road Surface Condition Monitoring with Computer Vision

  • paper_url: http://arxiv.org/abs/2310.00923
  • repo_url: https://github.com/ojalar/siwnet
  • paper_authors: Risto Ojala, Alvari Seppänen
  • for: The paper proposes SIWNet, a deep learning regression model that estimates road surface friction properties from camera images.
  • methods: The network architecture includes an uncertainty estimation mechanism: an additional head predicts a prediction interval and is trained with a maximum likelihood loss function.
  • results: SIWNet accurately estimates road surface friction properties from camera images, and its prediction interval estimation effectively quantifies the uncertainty of the predictions.
    Abstract Winter conditions pose several challenges for automated driving applications. A key challenge during winter is accurate assessment of road surface condition, as its impact on friction is a critical parameter for safely and reliably controlling a vehicle. This paper proposes a deep learning regression model, SIWNet, capable of estimating road surface friction properties from camera images. SIWNet extends state of the art by including an uncertainty estimation mechanism in the architecture. This is achieved by including an additional head in the network, which estimates a prediction interval. The prediction interval head is trained with a maximum likelihood loss function. The model was trained and tested with the SeeingThroughFog dataset, which features corresponding road friction sensor readings and images from an instrumented vehicle. Acquired results highlight the functionality of the prediction interval estimation of SIWNet, while the network also achieved similar point estimate accuracy as the previous state of the art. Furthermore, the SIWNet architecture is several times more lightweight than the previously applied state-of-the-art model, resulting in more practical and efficient deployment.
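A minimal sketch of the uncertainty head described above: alongside the friction point estimate, a second head predicts a variance, and the pair is trained with a Gaussian maximum-likelihood (negative log-likelihood) loss, from which a prediction interval can be derived. The backbone feature size, layer shapes, and the 95% interval rule are illustrative assumptions, not SIWNet's actual architecture.

```python
import torch
import torch.nn as nn

class FrictionHead(nn.Module):
    """Predicts a friction point estimate and a variance from backbone features."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mean_head = nn.Linear(feat_dim, 1)
        self.var_head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Softplus())

    def forward(self, feats):
        return self.mean_head(feats), self.var_head(feats) + 1e-6

nll = nn.GaussianNLLLoss()                  # maximum-likelihood training objective
head = FrictionHead()
feats = torch.randn(8, 512)                 # assumed backbone output
target = torch.rand(8, 1)                   # friction labels in [0, 1]
mean, var = head(feats)
loss = nll(mean, target, var)

# A 95% prediction interval derived from the predicted Gaussian
lower, upper = mean - 1.96 * var.sqrt(), mean + 1.96 * var.sqrt()
```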

How Close are Other Computer Vision Tasks to Deepfake Detection?

  • paper_url: http://arxiv.org/abs/2310.00922
  • repo_url: None
  • paper_authors: Huy H. Nguyen, Junichi Yamagishi, Isao Echizen
  • for: This study challenges the conventional belief that supervised ImageNet-trained models generalize well and are suitable as feature extractors for deepfake detection.
  • methods: A new measurement, "model separability", is proposed for visually and quantitatively assessing a model's raw capacity to separate data in an unsupervised manner, together with a systematic benchmark relating deepfake detection to other computer vision tasks using pre-trained models.
  • results: Pre-trained face recognition models are more closely related to deepfake detection than other models, and models trained with self-supervised methods separate data more effectively but risk overfitting after fine-tuning on a small deepfake dataset. The results offer valuable guidance for developing more effective deepfake detection models.
    Abstract In this paper, we challenge the conventional belief that supervised ImageNet-trained models have strong generalizability and are suitable for use as feature extractors in deepfake detection. We present a new measurement, "model separability," for visually and quantitatively assessing a model's raw capacity to separate data in an unsupervised manner. We also present a systematic benchmark for determining the correlation between deepfake detection and other computer vision tasks using pre-trained models. Our analysis shows that pre-trained face recognition models are more closely related to deepfake detection than other models. Additionally, models trained using self-supervised methods are more effective in separation than those trained using supervised methods. After fine-tuning all models on a small deepfake dataset, we found that self-supervised models deliver the best results, but there is a risk of overfitting. Our results provide valuable insights that should help researchers and practitioners develop more effective deepfake detection models.

Every Dataset Counts: Scaling up Monocular 3D Object Detection with Joint Datasets Training

  • paper_url: http://arxiv.org/abs/2310.00920
  • repo_url: https://github.com/Owen-Liuyuxuan/visionfactory
  • paper_authors: Fulong Ma, Xiaoyang Yan, Yuxuan Liu, Ming Liu
  • for: Monocular 3D object detection is key to autonomous driving, but existing algorithms rely on 3D labels derived from LiDAR measurements, which are costly to acquire and hard to deploy in new environments; this work studies a training pipeline for monocular 3D detection across diverse datasets.
  • methods: A three-part framework: (1) a robust monocular 3D model that works across different camera settings, (2) a selective-training strategy that accommodates datasets with differing class annotations, and (3) pseudo-3D training using 2D labels to enhance detection in scenes with only 2D labels.
  • results: Experiments show that models trained jointly on various open 3D/2D datasets generalize much better and achieve stronger performance on new datasets that have only 2D labels.
    Abstract Monocular 3D object detection plays a crucial role in autonomous driving. However, existing monocular 3D detection algorithms depend on 3D labels derived from LiDAR measurements, which are costly to acquire for new datasets and challenging to deploy in novel environments. Specifically, this study investigates the pipeline for training a monocular 3D object detection model on a diverse collection of 3D and 2D datasets. The proposed framework comprises three components: (1) a robust monocular 3D model capable of functioning across various camera settings, (2) a selective-training strategy to accommodate datasets with differing class annotations, and (3) a pseudo 3D training approach using 2D labels to enhance detection performance in scenes containing only 2D labels. With this framework, we could train models on a joint set of various open 3D/2D datasets to obtain models with significantly stronger generalization capability and enhanced performance on new dataset with only 2D labels. We conduct extensive experiments on KITTI/nuScenes/ONCE/Cityscapes/BDD100K datasets to demonstrate the scaling ability of the proposed method.

BAAF: A Benchmark Attention Adaptive Framework for Medical Ultrasound Image Segmentation Tasks

  • paper_url: http://arxiv.org/abs/2310.00919
  • repo_url: https://github.com/cgpxy/baaf
  • paper_authors: Gongping Chen, Lei Zhao, Xiaotao Yin, Liang Cui, Jianxun Zhang, Yu Dai
  • for: The study proposes a more general and robust Benchmark Attention Adaptive Framework (BAAF) to help doctors segment or diagnose lesions and tissues in ultrasound images more quickly and accurately.
  • methods: BAAF consists of a parallel hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM): input features are first coarsely calibrated along the channel and spatial dimensions, and more robust lesion or tissue characterizations are then adaptively selected from the coarse-calibrated feature maps.
  • results: On four medical ultrasound segmentation tasks, BAAF delivers remarkable performance improvements over existing state-of-the-art methods and over existing attention mechanisms, offering a path toward automated ultrasound-assisted diagnosis with less reliance on human accuracy and precision.
    Abstract AI-based assisted diagnosis programs have been widely investigated on medical ultrasound images. The complex scenario of ultrasound images, in which the coupled interference of internal and external factors is severe, brings a unique challenge for localizing the object region automatically and precisely in ultrasound images. In this study, we seek to propose a more general and robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors in segmenting or diagnosing lesions and tissues in ultrasound images more quickly and accurately. Different from existing attention schemes, the BAAF consists of a parallel hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM). Specifically, BAAF first coarsely calibrates the input features from the channel and spatial dimensions, and then adaptively selects more robust lesion or tissue characterizations from the coarse-calibrated feature maps. The design of BAAF further optimizes the "what" and "where" focus and selection problems in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in medical ultrasound images. The method is evaluated on four medical ultrasound segmentation tasks, and the experimental results demonstrate the remarkable performance improvement over existing state-of-the-art methods. In addition, the comparison with existing attention mechanisms also demonstrates the superiority of BAAF. This work provides the possibility for automated medical ultrasound assisted diagnosis and reduces reliance on human accuracy and precision.

Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance

  • paper_url: http://arxiv.org/abs/2310.00917
  • repo_url: None
  • paper_authors: Alloy Das, Sanket Biswas, Ayan Banerjee, Saumik Bhattacharya, Josep Lladós, Umapada Pal
  • for: This work investigates domain-adaptive scene text spotting: training a model on multi-domain source data so that it can adapt directly to target domains in real-world conditions, rather than being specialized for a specific domain or scenario.
  • methods: A Transformer baseline called Swin-TESTR is investigated, trained on multi-domain source data and focused on spotting both regular and arbitrarily shaped scene text, along with an exhaustive evaluation.
  • results: The results show that intermediate representations can achieve significant performance gains on text spotting benchmarks across multiple domains (e.g., language, synth-to-real, and documents), in terms of both accuracy and efficiency.
    Abstract The adaptation capability to a wide range of domains is crucial for scene text spotting models when deployed to real-world conditions. However, existing state-of-the-art (SOTA) approaches usually incorporate scene text detection and recognition simply by pretraining on natural scene text datasets, which do not directly exploit the intermediate feature representations between multiple domains. Here, we investigate the problem of domain-adaptive scene text spotting, i.e., training a model on multi-domain source data such that it can directly adapt to target domains rather than being specialized for a specific domain or scenario. Further, we investigate a transformer baseline called Swin-TESTR to focus on solving scene-text spotting for both regular and arbitrary-shaped scene text along with an exhaustive evaluation. The results clearly demonstrate the potential of intermediate representations to achieve significant performance on text spotting benchmarks across multiple domains (e.g. language, synth-to-real, and documents). both in terms of accuracy and efficiency.

A Decentralized Cooperative Navigation Approach for Visual Homing Networks

  • paper_url: http://arxiv.org/abs/2310.00906
  • repo_url: None
  • paper_authors: Mohamed Rahouti, Damian Lyons, Senthil Kumar Jagatheesaperumal, Kaiqi Xiong
  • for: A decentralized cooperative approach to visual navigation (visual homing) for a heterogeneous robot team operating over a wide area.
  • methods: A blockchain approach that requires no map data structures, suits robot platforms with a small computational footprint, leverages current visual information, and reaches consensus in the untrustworthy visual homing network through a lightweight Proof-of-Work (PoW) mechanism.
  • results: The approach supports resilient and adaptive navigation path selection and enables visual navigation without map data.
    Abstract Visual homing is a lightweight approach to visual navigation. Given the stored information of an initial 'home' location, the navigation task back to this location is achieved from any other location by comparing the stored home information to the current image and extracting a motion vector. A challenge that constrains the applicability of visual homing is that the home location must be within the robot's field of view to initiate the homing process. Thus, we propose a blockchain approach to visual navigation for a heterogeneous robot team over a wide area of visual navigation. Because it does not require map data structures, the approach is useful for robot platforms with a small computational footprint, and because it leverages current visual information, it supports a resilient and adaptive path selection. Further, we present a lightweight Proof-of-Work (PoW) mechanism for reaching consensus in the untrustworthy visual homing network.
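A toy illustration of a lightweight hash-based Proof-of-Work round of the kind used for consensus in untrusted networks. The paper's actual PoW design for the visual homing network may differ; the difficulty setting and block contents here are illustrative assumptions.

```python
import hashlib

def proof_of_work(block_data: bytes, difficulty: int = 3) -> tuple[int, str]:
    """Find a nonce so that sha256(block_data + nonce) starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + str(nonce).encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

def verify(block_data: bytes, nonce: int, difficulty: int = 3) -> bool:
    digest = hashlib.sha256(block_data + str(nonce).encode()).hexdigest()
    return digest.startswith("0" * difficulty)

# Toy usage: a "block" carrying a robot's home-vector update
nonce, digest = proof_of_work(b"robot_7:motion_vector=(0.4,-1.2)")
assert verify(b"robot_7:motion_vector=(0.4,-1.2)", nonce)
```

Keeping the difficulty low makes the work cheap enough for small onboard computers while still making it costly for an untrusted node to flood the network with bogus updates.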

JPEG Information Regularized Deep Image Prior for Denoising

  • paper_url: http://arxiv.org/abs/2310.00894
  • repo_url: None
  • paper_authors: Tsukasa Takagi, Shinya Ishizaki, Shin-ichi Maeda
  • for: Image denoising, i.e., restoring a clean image from a noisy one, is an important image restoration task in computer vision.
  • methods: Deep image prior (DIP) performs denoising from only a noisy image through the inductive bias of convolutional neural network architectures without any pre-training, but it will eventually recover the original noisy image unless early stopping is applied.
  • results: The paper proposes monitoring the JPEG file size of the recovered image during optimization as a proxy metric for the noise level, and experiments show that the compressed file size works as an effective criterion for early stopping.
    Abstract Image denoising is a representative image restoration task in computer vision. Recent progress of image denoising from only noisy images has attracted much attention. Deep image prior (DIP) demonstrated successful image denoising from only a noisy image by inductive bias of convolutional neural network architectures without any pre-training. The major challenge of DIP based image denoising is that DIP would completely recover the original noisy image unless applying early stopping. For early stopping without a ground-truth clean image, we propose to monitor JPEG file size of the recovered image during optimization as a proxy metric of noise levels in the recovered image. Our experiments show that the compressed image file size works as an effective metric for early stopping.
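A small sketch of the proxy metric: the current DIP reconstruction is encoded as a JPEG in memory and its file size is tracked over the optimization. Noise is hard to compress, so the file size climbing back toward that of the noisy input is one simple stopping heuristic; that rule and the quality setting are assumptions here, not necessarily the paper's exact criterion.

```python
import io
import numpy as np
from PIL import Image

def jpeg_size(img_float01: np.ndarray, quality: int = 95) -> int:
    """JPEG-encoded size in bytes of an HxWx3 image with values in [0, 1]."""
    img = Image.fromarray((np.clip(img_float01, 0, 1) * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getbuffer().nbytes

# Inside a DIP optimization loop (dip_net and z are hypothetical names):
# sizes = []
# for it in range(num_iters):
#     ...one optimization step...
#     if it % 200 == 0:
#         recon = dip_net(z).detach().cpu().numpy()[0].transpose(1, 2, 0)
#         sizes.append(jpeg_size(recon))   # rising sizes suggest noise is being fitted
```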

PC-NeRF: Parent-Child Neural Radiance Fields under Partial Sensor Data Loss in Autonomous Driving Environments

  • paper_url: http://arxiv.org/abs/2310.00874
  • repo_url: https://github.com/biter0088/pc-nerf
  • paper_authors: Xiuzhong Hu, Guangming Xiong, Zheng Zang, Peng Jia, Yuxuan Han, Junyi Ma
  • for: Reconstructing large-scale 3D scenes is essential for autonomous vehicles, especially when partial sensor data is lost; large-scale reconstruction from partially lost LiDAR point cloud data with neural radiance fields (NeRF) remains to be explored.
  • methods: A novel 3D scene reconstruction framework, the parent-child neural radiance field (PC-NeRF), comprises a parent NeRF and a child NeRF that simultaneously optimize scene-level, segment-level, and point-level scene representations; the segment-level representation of child NeRFs uses sensor data more efficiently and quickly yields an approximate volumetric representation of the scene even with limited observations.
  • results: PC-NeRF achieves high-precision 3D reconstruction in large-scale scenes, effectively handles situations where partial sensor data is lost, and deploys efficiently with limited training time; the implementation and pre-trained models will be available at https://github.com/biter0088/pc-nerf.
    Abstract Reconstructing large-scale 3D scenes is essential for autonomous vehicles, especially when partial sensor data is lost. Although the recently developed neural radiance fields (NeRF) have shown compelling results in implicit representations, the large-scale 3D scene reconstruction using partially lost LiDAR point cloud data still needs to be explored. To bridge this gap, we propose a novel 3D scene reconstruction framework called parent-child neural radiance field (PC-NeRF). The framework comprises two modules, the parent NeRF and the child NeRF, to simultaneously optimize scene-level, segment-level, and point-level scene representations. Sensor data can be utilized more efficiently by leveraging the segment-level representation capabilities of child NeRFs, and an approximate volumetric representation of the scene can be quickly obtained even with limited observations. With extensive experiments, our proposed PC-NeRF is proven to achieve high-precision 3D reconstruction in large-scale scenes. Moreover, PC-NeRF can effectively tackle situations where partial sensor data is lost and has high deployment efficiency with limited training time. Our approach implementation and the pre-trained models will be available at https://github.com/biter0088/pc-nerf.

RT-GAN: Recurrent Temporal GAN for Adding Lightweight Temporal Consistency to Frame-Based Domain Translation Approaches

  • paper_url: http://arxiv.org/abs/2310.00868
  • repo_url: https://github.com/nadeemlab/CEP
  • paper_authors: Shawn Mathew, Saad Nadeem, Alvin C. Goh, Arie Kaufman
  • for: This paper aims to add temporal consistency to unsupervised domain translation methods for endoscopy videos.
  • methods: Individual-frame approaches are extended with contiguous frames and a modified deep learning architecture, RT-GAN (Recurrent Temporal GAN), which provides temporal consistency through a tunable temporal parameter.
  • results: The lightweight solution reduces training requirements by a factor of 5, and its effectiveness is demonstrated on two challenging colonoscopy use cases: haustral fold segmentation and realistic colonoscopy simulator video generation.
    Abstract While developing new unsupervised domain translation methods for endoscopy videos, it is typical to start with approaches that initially work for individual frames without temporal consistency. Once an individual-frame model has been finalized, additional contiguous frames are added with a modified deep learning architecture to train a new model for temporal consistency. This transition to temporally-consistent deep learning models, however, requires significantly more computational and memory resources for training. In this paper, we present a lightweight solution with a tunable temporal parameter, RT-GAN (Recurrent Temporal GAN), for adding temporal consistency to individual frame-based approaches that reduces training requirements by a factor of 5. We demonstrate the effectiveness of our approach on two challenging use cases in colonoscopy: haustral fold segmentation (indicative of missed surface) and realistic colonoscopy simulator video generation. The datasets, accompanying code, and pretrained models will be made available at \url{https://github.com/nadeemlab/CEP}.

Can Pre-trained Networks Detect Familiar Out-of-Distribution Data?

  • paper_url: http://arxiv.org/abs/2310.00847
  • repo_url: https://github.com/atsumiyai/pt-ood
  • paper_authors: Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
  • for: This study investigates how PT-OOD (pre-trained OOD) data affects the OOD detection performance of pre-trained models, to better understand how existing detection methods behave in practical use.
  • methods: Pre-trained models are evaluated with linear-probing tuning, comparing the PT-OOD detection performance of supervised and self-supervised pre-training algorithms.
  • results: The low linear separability of PT-OOD in the feature space heavily degrades OOD detection, and self-supervised models are more vulnerable to PT-OOD than supervised pre-trained models, even with state-of-the-art detection methods. To address this vulnerability, the paper proposes leveraging the powerful instance-by-instance discriminative representations of pre-trained models and detecting OOD in the feature space, independently of the ID decision boundaries.
    Abstract Out-of-distribution (OOD) detection is critical for safety-sensitive machine learning applications and has been extensively studied, yielding a plethora of methods developed in the literature. However, most studies for OOD detection did not use pre-trained models and trained a backbone from scratch. In recent years, transferring knowledge from large pre-trained models to downstream tasks by lightweight tuning has become mainstream for training in-distribution (ID) classifiers. To bridge the gap between the practice of OOD detection and current classifiers, the unique and crucial problem is that the samples whose information networks know often come as OOD input. We consider that such data may significantly affect the performance of large pre-trained networks because the discriminability of these OOD data depends on the pre-training algorithm. Here, we define such OOD data as PT-OOD (Pre-Trained OOD) data. In this paper, we aim to reveal the effect of PT-OOD on the OOD detection performance of pre-trained networks from the perspective of pre-training algorithms. To achieve this, we explore the PT-OOD detection performance of supervised and self-supervised pre-training algorithms with linear-probing tuning, the most common efficient tuning method. Through our experiments and analysis, we find that the low linear separability of PT-OOD in the feature space heavily degrades the PT-OOD detection performance, and self-supervised models are more vulnerable to PT-OOD than supervised pre-trained models, even with state-of-the-art detection methods. To solve this vulnerability, we further propose a unique solution to large-scale pre-trained models: Leveraging powerful instance-by-instance discriminative representations of pre-trained models and detecting OOD in the feature space independent of the ID decision boundaries. The code will be available via https://github.com/AtsuMiyai/PT-OOD.
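Below is a sketch of detecting OOD in feature space independently of the ID decision boundaries, using cosine similarity to a bank of ID features from a frozen pre-trained encoder. It assumes the encoder returns one feature vector per image; the choice of k, the threshold, and the scoring rule are illustrative assumptions and represent one simple instantiation rather than the paper's exact detector.

```python
import torch
import torch.nn.functional as F

def build_id_bank(encoder, id_loader, device="cpu"):
    """Collect L2-normalised features of in-distribution training data."""
    feats = []
    encoder.eval()
    with torch.no_grad():
        for x, _ in id_loader:
            feats.append(F.normalize(encoder(x.to(device)), dim=1))
    return torch.cat(feats)

def ood_score(encoder, x, id_bank, k=10):
    """Higher score = more OOD (low similarity to the k nearest ID features)."""
    with torch.no_grad():
        f = F.normalize(encoder(x), dim=1)             # (B, D)
        sims = f @ id_bank.T                           # (B, N) cosine similarities
        topk = sims.topk(k, dim=1).values.mean(dim=1)  # mean of k nearest ID sims
    return 1.0 - topk

# Usage sketch: flag inputs whose score exceeds a validation-chosen threshold tau
# scores = ood_score(pretrained_encoder, batch, id_bank); is_ood = scores > tau
```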

Elastic Interaction Energy Loss for Traffic Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.01449
  • repo_url: None
  • paper_authors: Yaxin Feng, Yuan Lan, Luchan Zhang, Yang Xiang
  • for: This paper aims to improve the accuracy of image segmentation for real-time traffic scene understanding while keeping inference fast.
  • methods: A simple yet efficient geometry-sensitive, energy-based loss function is proposed for CNN multi-class segmentation: the elastic interaction energy (EIE) between the predicted and ground-truth boundaries drives the prediction toward the ground truth until they completely overlap.
  • results: Experiments show that the proposed method consistently improves performance, especially when using real-time, lightweight networks as the backbones, which is more suitable for autonomous driving.
    Abstract Segmentation is a pixel-level classification of images. The accuracy and fast inference speed of image segmentation are crucial for autonomous driving safety. Fine and complex geometric objects are the most difficult but important recognition targets in traffic scene, such as pedestrians, traffic signs and lanes. In this paper, a simple and efficient geometry-sensitive energy-based loss function is proposed to Convolutional Neural Network (CNN) for multi-class segmentation on real-time traffic scene understanding. To be specific, the elastic interaction energy (EIE) between two boundaries will drive the prediction moving toward the ground truth until completely overlap. The EIE loss function is incorporated into CNN to enhance accuracy on fine-scale structure segmentation. In particular, small or irregularly shaped objects can be identified more accurately, and discontinuity issues on slender objects can be improved. Our approach can be applied to different segmentation-based problems, such as urban scene segmentation and lane detection. We quantitatively and qualitatively analyze our method on three traffic datasets, including urban scene data Cityscapes, lane data TuSimple and CULane. The results show that our approach consistently improves performance, especially when using real-time, lightweight networks as the backbones, which is more suitable for autonomous driving.

Large Scale Masked Autoencoding for Reducing Label Requirements on SAR Data

  • paper_url: http://arxiv.org/abs/2310.00826
  • repo_url: None
  • paper_authors: Matt Allen, Francisco Dorr, Joseph A. Gallego-Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán
  • for: This paper supports the monitoring and mitigation of the effects of anthropogenic climate change.
  • methods: A self-supervised pretraining scheme, masked autoencoding, is applied to Synthetic Aperture Radar (SAR) amplitude data covering 8.7% of the Earth's land surface area, and the pretrained weights are tuned on two downstream tasks: vegetation cover prediction and land cover classification.
  • results: The pretraining reduces labelling requirements for the downstream tasks by more than an order of magnitude and generalizes geographically, with the performance gain increasing when tuning on regions outside the pretraining set, enabling faster and more accurate monitoring of climate change effects.
    Abstract Satellite-based remote sensing is instrumental in the monitoring and mitigation of the effects of anthropogenic climate change. Large scale, high resolution data derived from these sensors can be used to inform intervention and policy decision making, but the timeliness and accuracy of these interventions is limited by use of optical data, which cannot operate at night and is affected by adverse weather conditions. Synthetic Aperture Radar (SAR) offers a robust alternative to optical data, but its associated complexities limit the scope of labelled data generation for traditional deep learning. In this work, we apply a self-supervised pretraining scheme, masked autoencoding, to SAR amplitude data covering 8.7\% of the Earth's land surface area, and tune the pretrained weights on two downstream tasks crucial to monitoring climate change - vegetation cover prediction and land cover classification. We show that the use of this pretraining scheme reduces labelling requirements for the downstream tasks by more than an order of magnitude, and that this pretraining generalises geographically, with the performance gain increasing when tuned downstream on regions outside the pretraining set. Our findings significantly advance climate change mitigation by facilitating the development of task and region-specific SAR models, allowing local communities and organizations to deploy tailored solutions for rapid, accurate monitoring of climate change effects.
    摘要