cs.CV - 2023-08-11

DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

  • paper_url: http://arxiv.org/abs/2308.06198
  • repo_url: None
  • paper_authors: Melissa Hall, Candace Ross, Adina Williams, Nicolas Carion, Michal Drozdzal, Adriana Romero Soriano
  • for: This study evaluates the realism, diversity, and prompt consistency of text-to-image generative systems, and how well these systems represent objects from different regions of the world.
  • methods: Three indicators are introduced to evaluate the realism, diversity, and prompt-generation consistency of text-to-image systems when prompted to generate objects from across the world, enabling automatic and efficient benchmarking of geographic disparities.
  • results: Generations prompted for Africa and West Asia show lower realism and diversity than those for Europe; adding geographic information to the prompt reduces prompt consistency and diversity; and region-level disparities are larger for some objects than others. These findings suggest that progress in image-generation quality has come at the cost of real-world geographic representation.
    Abstract The unprecedented photorealistic results achieved by recent text-to-image generative systems and their increasing use as plug-and-play content creation solutions make it crucial to understand their potential biases. In this work, we introduce three indicators to evaluate the realism, diversity and prompt-generation consistency of text-to-image generative systems when prompted to generate objects from across the world. Our indicators complement qualitative analysis of the broader impact of such systems by enabling automatic and efficient benchmarking of geographic disparities, an important step towards building responsible visual content creation systems. We use our proposed indicators to analyze potential geographic biases in state-of-the-art visual content creation systems and find that: (1) models have less realism and diversity of generations when prompting for Africa and West Asia than Europe, (2) prompting with geographic information comes at a cost to prompt-consistency and diversity of generated images, and (3) models exhibit more region-level disparities for some objects than others. Perhaps most interestingly, our indicators suggest that progress in image generation quality has come at the cost of real-world geographic representation. Our comprehensive evaluation constitutes a crucial step towards ensuring a positive experience of visual content creation for everyone.

Towards Packaging Unit Detection for Automated Palletizing Tasks

  • paper_url: http://arxiv.org/abs/2308.06306
  • repo_url: None
  • paper_authors: Markus Völk, Kilian Kleeberger, Werner Kraus, Richard Bormann
  • for: This paper addresses packaging unit detection, a crucial step preceding the handling of packaging units in various automated palletizing tasks.
  • methods: The proposed approach is trained entirely on synthetically generated data and can be applied to arbitrary real-world packaging units without further training or setup effort. It handles sparse, low-quality sensor data, can exploit prior knowledge when available, and generalizes to a wide range of products and application scenarios.
  • results: An extensive evaluation on real-world data covering a wide range of retail products demonstrates the accuracy and robustness of the approach. It has also been integrated into a lab demonstrator, and a commercial solution will be marketed through an industrial partner.
    Abstract For various automated palletizing tasks, the detection of packaging units is a crucial step preceding the actual handling of the packaging units by an industrial robot. We propose an approach to this challenging problem that is fully trained on synthetically generated data and can be robustly applied to arbitrary real world packaging units without further training or setup effort. The proposed approach is able to handle sparse and low quality sensor data, can exploit prior knowledge if available and generalizes well to a wide range of products and application scenarios. To demonstrate the practical use of our approach, we conduct an extensive evaluation on real-world data with a wide range of different retail products. Further, we integrated our approach in a lab demonstrator and a commercial solution will be marketed through an industrial partner.

Discovering Local Binary Pattern Equation for Foreground Object Removal in Videos

  • paper_url: http://arxiv.org/abs/2308.06305
  • repo_url: None
  • paper_authors: Caroline Pacheco do Espirito Silva, Andrews Cordolino Sobral, Antoine Vacavant, Thierry Bouwmans, Felippe De Souza
  • for: Automatically discovering Local Binary Pattern (LBP) formulas that remove the moving parts of a scene by segmenting it into background and foreground.
  • methods: Symbolic regression is used to automatically discover the LBP formulas.
  • results: Experiments on real videos of outdoor urban scenes under various conditions show that the discovered LBPs significantly outperform previous state-of-the-art LBP descriptors, both qualitatively and quantitatively.
    Abstract Designing a novel Local Binary Pattern (LBP) process usually relies heavily on human experts' knowledge and experience in the area. Even experts are often left with tedious episodes of trial and error until they identify an optimal LBP for a particular dataset. To address this problem, we present a novel symbolic regression able to automatically discover LBP formulas to remove the moving parts of a scene by segmenting it into a background and a foreground. Experimental results conducted on real videos of outdoor urban scenes under various conditions show that the LBPs discovered by the proposed approach significantly outperform the previous state-of-the-art LBP descriptors both qualitatively and quantitatively. Our source code and data will be available online.
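For context on the descriptor family being searched over, below is a minimal NumPy sketch of the classic 8-neighbour LBP operator. It is illustrative only, not one of the formulas discovered by the paper's symbolic regression, and the function and variable names are ours.

```python
import numpy as np

def lbp_8neighbour(gray: np.ndarray) -> np.ndarray:
    """Classic 8-neighbour Local Binary Pattern on a 2D grayscale image."""
    # Offsets of the 8 neighbours, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = gray.shape
    center = gray[1:h-1, 1:w-1].astype(np.int32)
    codes = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1+dy:h-1+dy, 1+dx:w-1+dx].astype(np.int32)
        codes |= ((neighbour >= center).astype(np.uint8) << bit)
    return codes  # one 8-bit code per interior pixel

if __name__ == "__main__":
    frame = (np.random.rand(64, 64) * 255).astype(np.uint8)
    print(lbp_8neighbour(frame).shape)  # (62, 62)
```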

Rethinking the Localization in Weakly Supervised Object Localization

  • paper_url: http://arxiv.org/abs/2308.06161
  • repo_url: https://github.com/tzzcl/PSOL
  • paper_authors: Rui Xu, Yong Luo, Han Hu, Bo Du, Jialie Shen, Yonggang Wen
  • for: This paper studies weakly supervised object localization (WSOL), one of the most popular and challenging tasks in computer vision, whose goal is to localize objects in images given only image-level supervision.
  • methods: Two remedies are proposed: the single-class regression (SCR) localizer is replaced with a binary-class detector (BCD) trained to discriminate foreground from background, enabling multiple objects to be localized per image; and a weighted entropy (WE) loss on unlabeled data is designed to reduce the negative impact of noisy pseudo bounding boxes.
  • results: Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of the method.
    Abstract Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision. This task is to localize the objects in the images given only the image-level supervision. Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task. However, existing solutions under this pipeline usually suffer from the following drawbacks: 1) they are not flexible since they can only localize one object for each image due to the adopted single-class regression (SCR) for localization; 2) the generated pseudo bounding boxes may be noisy, but the negative impact of such noise is not well addressed. To remedy these drawbacks, we first propose to replace SCR with a binary-class detector (BCD) for localizing multiple objects, where the detector is trained by discriminating the foreground and background. Then we design a weighted entropy (WE) loss using the unlabeled data to reduce the negative impact of noisy bounding boxes. Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of our method.
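The abstract does not give the exact form of the weighted entropy (WE) loss, so the following is a hedged sketch of one plausible reading: the binary entropy of the detector's foreground scores on unlabeled data, weighted per sample to down-weight likely-noisy pseudo boxes. The function name and the source of the weights are assumptions.

```python
import torch

def weighted_entropy_loss(logits: torch.Tensor, weights: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """Entropy of binary foreground/background predictions, weighted per sample.

    logits  : (N,) raw foreground scores from the binary-class detector
    weights : (N,) per-sample weights, e.g. down-weighting likely-noisy regions
    """
    p = torch.sigmoid(logits)
    entropy = -(p * torch.log(p + eps) + (1 - p) * torch.log(1 - p + eps))
    return (weights * entropy).sum() / (weights.sum() + eps)

# toy usage on unlabeled samples
logits = torch.randn(1024)
weights = torch.rand(1024)          # hypothetical confidence weights
loss = weighted_entropy_loss(logits, weights)
```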

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.06160
  • repo_url: https://github.com/showlab/datasetdm
  • paper_authors: Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen
  • for: This paper presents a generic dataset generation model that produces diverse synthetic images with perception annotations for training perception models on various downstream tasks.
  • methods: The model builds on a pre-trained diffusion model and extends text-guided image synthesis to perception data generation, producing annotations such as segmentation masks and depth by decoding the diffusion model's latent code.
  • results: The method can generate an effectively unlimited number of synthetic images with high-quality perception annotations, achieves state-of-the-art results on semantic and instance segmentation, is significantly more robust under domain generalization than training on real data alone, and is flexible enough for efficient application and novel task compositions such as image editing.
    Abstract Current deep networks are very data-hungry and benefit from training on largescale datasets, which are often time-consuming to collect and annotate. By contrast, synthetic data can be generated infinitely using generative models such as DALL-E and diffusion models, with minimal effort and cost. In this paper, we present DatasetDM, a generic dataset generation model that can produce diverse synthetic images and the corresponding high-quality perception annotations (e.g., segmentation masks, and depth). Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation. We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module. Training the decoder only needs less than 1% (around 100 images) manually labeled images, enabling the generation of an infinitely large annotated dataset. Then these synthetic data can be used for training various perception models for downstream tasks. To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art results on semantic segmentation and instance segmentation; 2) significantly more robust on domain generalization than using the real data alone; and state-of-the-art results in zero-shot segmentation setting; and 3) flexibility for efficient application and novel task composition (e.g., image editing). The project website and code can be found at https://weijiawu.github.io/DatasetDM_page/ and https://github.com/showlab/DatasetDM, respectively
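As a rough illustration of the decoder idea, training only a small perception head on frozen diffusion features with roughly 100 labeled images, here is a toy PyTorch sketch; the real decoder architecture, the feature-extraction hooks, and the channel sizes are not specified by the abstract and are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPerceptionDecoder(nn.Module):
    """Maps frozen diffusion U-Net features to per-pixel class logits.

    A stand-in for the paper's decoder module: only this head is trained,
    using a small number (~100) of manually labeled images.
    """
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, feats: torch.Tensor, out_size) -> torch.Tensor:
        logits = self.head(feats)                      # (B, C, h, w)
        return F.interpolate(logits, size=out_size, mode="bilinear",
                             align_corners=False)      # upsample to image size

# toy training step on frozen features extracted from the diffusion model
decoder = ToyPerceptionDecoder(in_channels=1280, num_classes=21)
feats = torch.randn(2, 1280, 32, 32)     # placeholder for cached U-Net features
masks = torch.randint(0, 21, (2, 512, 512))
loss = F.cross_entropy(decoder(feats, (512, 512)), masks)
loss.backward()
```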

Efficient Large-scale AUV-based Visual Seafloor Mapping

  • paper_url: http://arxiv.org/abs/2308.06147
  • repo_url: None
  • paper_authors: Mengkun She, Yifan Song, David Nakath, Kevin Köser
  • for: This work aims to improve the accuracy and efficiency of 3D seafloor reconstruction performed by autonomous underwater vehicles (AUVs) in the deep sea.
  • methods: The system combines recent developments in underwater imaging and visual mapping, using a hybrid of SLAM and Structure-from-Motion that detects and reconsiders difficult, weakly registered areas to reconstruct 3D models of the seafloor.
  • results: The system demonstrated its robustness and practicality during several research cruises; it runs much faster than incremental reconstruction while achieving at least on-par performance.
    Abstract Driven by the increasing number of marine data science applications, there is a growing interest in surveying and exploring the vast, uncharted terrain of the deep sea with robotic platforms. Despite impressive results achieved by many on-land visual mapping algorithms in the past decades, transferring these methods from land to the deep sea remains a challenge due to harsh environmental conditions. Typically, deep-sea exploration involves the use of autonomous underwater vehicles (AUVs) equipped with high-resolution cameras and artificial illumination systems. However, images obtained in this manner often suffer from heterogeneous illumination and quality degradation due to attenuation and scattering, on top of refraction of light rays. All of this together often lets on-land SLAM approaches fail underwater or makes Structure-from-Motion approaches drift or omit difficult images, resulting in gaps, jumps or weakly registered areas. In this work, we present a system that incorporates recent developments in underwater imaging and visual mapping to facilitate automated robotic 3D reconstruction of hectares of seafloor. Our approach is efficient in that it detects and reconsiders difficult, weakly registered areas, to avoid omitting images and to make better use of limited dive time; on the other hand it is computationally efficient; leveraging a hybrid approach combining benefits from SLAM and Structure-from-Motion that runs much faster than incremental reconstructions while achieving at least on-par performance. The proposed system has been extensively tested and evaluated during several research cruises, demonstrating its robustness and practicality in real-world conditions.

CompTLL-UNet: Compressed Domain Text-Line Localization in Challenging Handwritten Documents using Deep Feature Learning from JPEG Coefficients

  • paper_url: http://arxiv.org/abs/2308.06142
  • repo_url: None
  • paper_authors: Bulla Rajesh, Sk Mahafuz Zaman, Mohammed Javed, P. Nagabhushan
  • for: This work proposes localizing text-lines directly from JPEG compressed coefficients, addressing automatic text-line localization in handwritten document images.
  • methods: A modified U-Net architecture, the Compressed Text-Line Localization Network (CompTLL-UNet), learns deep features directly from the JPEG compressed coefficients and applies them to text-line localization.
  • results: Trained and tested on JPEG compressed versions of the ICDAR2017 (cBAD) and ICDAR2019 (cBAD) benchmarks, the method achieves state-of-the-art performance in the JPEG compressed domain with reduced storage and computational costs.
    Abstract Automatic localization of text-lines in handwritten documents is still an open and challenging research problem. Various writing issues such as uneven spacing between the lines, oscillating and touching text, and the presence of skew become much more challenging when the case of complex handwritten document images are considered for segmentation directly in their respective compressed representation. This is because, the conventional way of processing compressed documents is through decompression, but here in this paper, we propose an idea that employs deep feature learning directly from the JPEG compressed coefficients without full decompression to accomplish text-line localization in the JPEG compressed domain. A modified U-Net architecture known as Compressed Text-Line Localization Network (CompTLL-UNet) is designed to accomplish it. The model is trained and tested with JPEG compressed version of benchmark datasets including ICDAR2017 (cBAD) and ICDAR2019 (cBAD), reporting the state-of-the-art performance with reduced storage and computational costs in the JPEG compressed domain.

Uncertainty Quantification for Image-based Traffic Prediction across Cities

  • paper_url: http://arxiv.org/abs/2308.06129
  • repo_url: https://github.com/alextimans/traffic4cast-uncertainty
  • paper_authors: Alexander Timans, Nina Wiedemann, Nishant Kumar, Ye Hong, Martin Raubal
  • for: This paper aims to investigate the application of uncertainty quantification (UQ) methods for traffic prediction, and to evaluate the effectiveness of UQ methods in providing meaningful uncertainty estimates for city-wide traffic dynamics.
  • methods: The paper compares two epistemic and two aleatoric UQ methods on both temporal and spatio-temporal transfer tasks, and demonstrates how uncertainty estimates can be employed for unsupervised outlier detection on changes in city traffic dynamics.
  • results: The paper finds that meaningful uncertainty estimates can be recovered using UQ methods, and that these estimates can be used to capture both temporal and spatial effects on traffic behavior in a representative case study for the city of Moscow.
    Abstract Despite the strong predictive performance of deep learning models for traffic prediction, their widespread deployment in real-world intelligent transportation systems has been restrained by a lack of interpretability. Uncertainty quantification (UQ) methods provide an approach to induce probabilistic reasoning, improve decision-making and enhance model deployment potential. To gain a comprehensive picture of the usefulness of existing UQ methods for traffic prediction and the relation between obtained uncertainties and city-wide traffic dynamics, we investigate their application to a large-scale image-based traffic dataset spanning multiple cities and time periods. We compare two epistemic and two aleatoric UQ methods on both temporal and spatio-temporal transfer tasks, and find that meaningful uncertainty estimates can be recovered. We further demonstrate how uncertainty estimates can be employed for unsupervised outlier detection on changes in city traffic dynamics. We find that our approach can capture both temporal and spatial effects on traffic behaviour in a representative case study for the city of Moscow. Our work presents a further step towards boosting uncertainty awareness in traffic prediction tasks, and aims to highlight the value contribution of UQ methods to a better understanding of city traffic dynamics.
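The abstract does not name the two epistemic and two aleatoric UQ methods compared, so the sketch below illustrates one common epistemic estimator, Monte Carlo dropout, whose predictive variance could serve as the kind of per-pixel uncertainty map used for outlier detection; the toy model and tensor shapes are placeholders.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_uncertainty(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Epistemic uncertainty via Monte Carlo dropout: keep dropout active at
    test time, run several stochastic forward passes, and use the variance of
    the predictions as the per-pixel uncertainty map."""
    model.eval()
    for m in model.modules():               # re-enable dropout layers only
        if isinstance(m, nn.Dropout):
            m.train()
    preds = torch.stack([model(x) for _ in range(n_samples)], dim=0)
    return preds.mean(dim=0), preds.var(dim=0)

# toy model standing in for an image-based traffic predictor
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Dropout(0.2), nn.Conv2d(16, 1, 1))
mean, var = mc_dropout_uncertainty(model, torch.randn(1, 3, 128, 128))
```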

Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow

  • paper_url: http://arxiv.org/abs/2308.06101
  • repo_url: https://github.com/bcmi/DCI-VTON-Virtual-Try-On
  • paper_authors: Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, Liqing Zhang
  • for: Virtual try-on, the image-synthesis task of transferring clothes from one image to another while preserving the details of both the person and the garment.
  • methods: A diffusion model is used to generate high-quality images, with a warping module guiding the diffusion model's generation.
  • results: Pre-processing the clothes with the warping module preserves their local details; combining the warped clothes with the clothes-agnostic person image as input to the diffusion model yields high-quality and realistic virtual try-on results.
    Abstract Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both humans and clothes. While many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. The warping module performs initial processing on the clothes, which helps to preserve the local details of the clothes. We then combine the warped clothes with clothes-agnostic person image and add noise as the input of diffusion model. Additionally, the warped clothes is used as local conditions for each denoising process to ensure that the resulting output retains as much detail as possible. Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps to produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method.

Diffusion-based Visual Counterfactual Explanations – Towards Systematic Quantitative Evaluation

  • paper_url: http://arxiv.org/abs/2308.06100
  • repo_url: https://github.com/cairo-thws/dbvce_eval
  • paper_authors: Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova
  • for: This work aims to systematically and quantitatively evaluate recent visual counterfactual explanation (VCE) methods and proposes a minimal set of metrics for doing so.
  • methods: The study uses this evaluation framework to explore the effects of crucial design choices in the latest diffusion-based generative models for VCEs of high-dimensional natural images.
  • results: The findings suggest multiple directions for future advancement and improvement of VCE methods, and the shared methodology provides valuable guidance for researchers, fostering consistency and transparency in the assessment of counterfactual explanations.
    Abstract Latest methods for visual counterfactual explanations (VCE) harness the power of deep generative models to synthesize new examples of high-dimensional images of impressive quality. However, it is currently difficult to compare the performance of these VCE methods as the evaluation procedures largely vary and often boil down to visual inspection of individual examples and small scale user studies. In this work, we propose a framework for systematic, quantitative evaluation of the VCE methods and a minimal set of metrics to be used. We use this framework to explore the effects of certain crucial design choices in the latest diffusion-based generative models for VCEs of natural image classification (ImageNet). We conduct a battery of ablation-like experiments, generating thousands of VCEs for a suite of classifiers of various complexity, accuracy and robustness. Our findings suggest multiple directions for future advancements and improvements of VCE methods. By sharing our methodology and our approach to tackle the computational challenges of such a study on a limited hardware setup (including the complete code base), we offer a valuable guidance for researchers in the field fostering consistency and transparency in the assessment of counterfactual explanations.

Automated Construction of Time-Space Diagrams for Traffic Analysis Using Street-View Video Sequence

  • paper_url: http://arxiv.org/abs/2308.06098
  • repo_url: None
  • paper_authors: Tanay Rastogi, Mårten Björkman
  • for: This paper constructs time-space diagrams from street-view video sequences captured by cameras mounted on moving vehicles, in order to analyze traffic flow and optimize transportation infrastructure and traffic management strategies.
  • methods: The study uses state-of-the-art YOLOv5, StrongSORT, and photogrammetry-based distance calculation to infer vehicle trajectories from the video data and generate time-space diagrams.
  • results: The evaluation shows that vehicle trajectories can be extracted from video data, although some errors remain that could be mitigated by improving the detector, tracker, and distance-calculation components.
    Abstract Time-space diagrams are essential tools for analyzing traffic patterns and optimizing transportation infrastructure and traffic management strategies. Traditional data collection methods for these diagrams have limitations in terms of temporal and spatial coverage. Recent advancements in camera technology have overcome these limitations and provided extensive urban data. In this study, we propose an innovative approach to constructing time-space diagrams by utilizing street-view video sequences captured by cameras mounted on moving vehicles. Using the state-of-the-art YOLOv5, StrongSORT, and photogrammetry techniques for distance calculation, we can infer vehicle trajectories from the video data and generate time-space diagrams. To evaluate the effectiveness of our proposed method, we utilized datasets from the KITTI computer vision benchmark suite. The evaluation results demonstrate that our approach can generate trajectories from video data, although there are some errors that can be mitigated by improving the performance of the detector, tracker, and distance calculation components. In conclusion, the utilization of street-view video sequences captured by cameras mounted on moving vehicles, combined with state-of-the-art computer vision techniques, has immense potential for constructing comprehensive time-space diagrams. These diagrams offer valuable insights into traffic patterns and contribute to the design of transportation infrastructure and traffic management strategies.
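Once vehicle trajectories (time, position-along-road) have been recovered by the detection, tracking, and distance pipeline, turning them into a time-space diagram is straightforward; the matplotlib sketch below uses synthetic trajectories as stand-ins for the pipeline's output.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical output of the detection/tracking/photogrammetry pipeline:
# one (timestamp, position-along-road) polyline per tracked vehicle.
rng = np.random.default_rng(0)
trajectories = []
for v in range(5):
    t = np.linspace(0, 60, 120)                       # seconds
    speed = rng.uniform(5, 12)                        # m/s
    x = speed * t + rng.uniform(0, 50)                # metres along the road
    trajectories.append((t, x))

fig, ax = plt.subplots(figsize=(6, 4))
for t, x in trajectories:
    ax.plot(t, x, linewidth=1)
ax.set_xlabel("time [s]")
ax.set_ylabel("position along road [m]")
ax.set_title("Time-space diagram from tracked vehicle trajectories")
fig.tight_layout()
fig.savefig("time_space_diagram.png")
```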

RIGID: Recurrent GAN Inversion and Editing of Real Face Videos

  • paper_url: http://arxiv.org/abs/2308.06097
  • repo_url: None
  • paper_authors: Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo
  • for: This work aims to apply the powerful editability of GANs to real videos; existing methods invert video frames individually, which often leads to temporally inconsistent results.
  • methods: A unified recurrent framework, RIGID, is proposed to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos, modeling the temporal relations between current and previous frames from three aspects.
  • results: The method outperforms existing approaches qualitatively and quantitatively, and can be applied to arbitrary editing tasks of the same video without re-training.
    Abstract GAN inversion is indispensable for applying the powerful editability of GAN to real images. However, existing methods invert video frames individually often leading to undesired inconsistent results over time. In this paper, we propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos. Our approach models the temporal relations between current and previous frames from three aspects. To enable a faithful real video reconstruction, we first maximize the inversion fidelity and consistency by learning a temporal compensated latent code. Second, we observe incoherent noises lie in the high-frequency domain that can be disentangled from the latent space. Third, to remove the inconsistency after attribute manipulation, we propose an in-between frame composition constraint such that the arbitrary frame must be a direct composite of its neighboring frames. Our unified framework learns the inherent coherence between input frames in an end-to-end manner, and therefore it is agnostic to a specific attribute and can be applied to arbitrary editing of the same video without re-training. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods qualitatively and quantitatively in both inversion and editing tasks. The deliverables can be found in https://cnnlstm.github.io/RIGID

Experts Weights Averaging: A New General Training Scheme for Vision Transformers

  • paper_url: http://arxiv.org/abs/2308.06093
  • repo_url: None
  • paper_authors: Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Wanli Ouyang
  • for: This paper proposes a new general training scheme for Vision Transformers (ViTs) that improves performance without increasing inference cost.
  • methods: The scheme exploits a Mixture-of-Experts (MoE) mechanism during training and converts the MoEs back into FFNs for inference, so the performance gain comes at no extra inference cost.
  • results: Experiments across various 2D and 3D visual tasks, ViT architectures, and datasets show consistent performance gains, and the scheme can also be applied when fine-tuning ViTs.
    Abstract Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs), which achieves performance improvement without increasing inference cost. As Vision Transformers (ViTs) are gradually surpassing CNNs in various visual tasks, one may question: if a training scheme specifically for ViTs exists that can also achieve performance improvement without increasing inference cost? Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we affirmatively answer these questions, with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into original ViT for inference. We further provide a theoretical analysis to show why and how it works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. Besides, our training scheme can also be applied to improve performance when fine-tuning ViTs. Lastly, but equally important, the proposed EWA technique can significantly improve the effectiveness of naive MoE in various 2D visual small datasets and 3D visual tasks.
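The post-training conversion step, collapsing each MoE back into a single FFN by averaging the experts, can be sketched directly in PyTorch; the FFN layout and dimensions below are generic assumptions rather than the paper's exact configuration.

```python
import copy
import torch
import torch.nn as nn

def make_ffn(dim: int, hidden: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

@torch.no_grad()
def average_experts_into_ffn(experts: nn.ModuleList) -> nn.Module:
    """Parameter-wise average of identically shaped expert FFNs -> one FFN.

    Mirrors the post-training conversion described in the abstract: the MoE
    is collapsed back into a plain feed-forward block for inference.
    """
    merged = copy.deepcopy(experts[0])
    merged_state = merged.state_dict()
    for key in merged_state:
        merged_state[key] = torch.stack(
            [e.state_dict()[key] for e in experts], dim=0).mean(dim=0)
    merged.load_state_dict(merged_state)
    return merged

experts = nn.ModuleList([make_ffn(384, 1536) for _ in range(4)])
ffn = average_experts_into_ffn(experts)        # drop-in FFN for inference
x = torch.randn(2, 197, 384)
print(ffn(x).shape)                            # torch.Size([2, 197, 384])
```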

Versatile Face Animator: Driving Arbitrary 3D Facial Avatar in RGBD Space

  • paper_url: http://arxiv.org/abs/2308.06076
  • repo_url: None
  • paper_authors: Haoyu Wang, Haozhe Wu, Junliang Xing, Jia Jia
  • for: This paper proposes a new automated facial animation framework for creating realistic 3D facial animation in the movie production and gaming industries while reducing cost and increasing efficiency.
  • methods: The method combines facial motion capture with motion retargeting in an end-to-end manner, generating facial animation without blendshapes or rigs. Specifically, an RGBD animation module learns facial motion from raw RGBD videos through hierarchical motion dictionaries and animates RGBD images rendered from a 3D facial mesh coarse-to-fine, so that animation can be driven on arbitrary 3D characters regardless of their topology, textures, blendshapes, or rigs.
  • results: Experiments show that the proposed framework generates impressive 3D facial animation results, highlighting its potential as a cost-effective and efficient solution for producing facial animation in the metaverse.
    Abstract Creating realistic 3D facial animation is crucial for various applications in the movie production and gaming industry, especially with the burgeoning demand in the metaverse. However, prevalent methods such as blendshape-based approaches and facial rigging techniques are time-consuming, labor-intensive, and lack standardized configurations, making facial animation production challenging and costly. In this paper, we propose a novel self-supervised framework, Versatile Face Animator, which combines facial motion capture with motion retargeting in an end-to-end manner, eliminating the need for blendshapes or rigs. Our method has the following two main characteristics: 1) we propose an RGBD animation module to learn facial motion from raw RGBD videos by hierarchical motion dictionaries and animate RGBD images rendered from 3D facial mesh coarse-to-fine, enabling facial animation on arbitrary 3D characters regardless of their topology, textures, blendshapes, and rigs; and 2) we introduce a mesh retarget module to utilize RGBD animation to create 3D facial animation by manipulating facial mesh with controller transformations, which are estimated from dense optical flow fields and blended together with geodesic-distance-based weights. Comprehensive experiments demonstrate the effectiveness of our proposed framework in generating impressive 3D facial animation results, highlighting its potential as a promising solution for the cost-effective and efficient production of facial animation in the metaverse.

Out-of-Distribution Detection for Monocular Depth Estimation

  • paper_url: http://arxiv.org/abs/2308.06072
  • repo_url: None
  • paper_authors: Julia Hornauer, Adrian Holzbock, Vasileios Belagiannis
  • for: This work improves uncertainty estimation for monocular depth estimation; prior approaches mainly target the data uncertainty introduced by image noise, whereas the focus here is on detecting out-of-distribution (OOD) inputs.
  • methods: Motivated by anomaly detection, OOD images are detected from an encoder-decoder depth estimation model via reconstruction error: features are extracted with the fixed depth encoder, and an image decoder is trained for reconstruction using only in-distribution data, so OOD images yield high reconstruction errors that separate them from in-distribution samples.
  • results: Experiments on the standard NYU Depth V2 and KITTI benchmarks show that this post hoc method performs remarkably well across different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth estimation model.
    Abstract In monocular depth estimation, uncertainty estimation approaches mainly target the data uncertainty introduced by image noise. In contrast to prior work, we address the uncertainty due to lack of knowledge, which is relevant for the detection of data not represented by the training distribution, the so-called out-of-distribution (OOD) data. Motivated by anomaly detection, we propose to detect OOD images from an encoder-decoder depth estimation model based on the reconstruction error. Given the features extracted with the fixed depth encoder, we train an image decoder for image reconstruction using only in-distribution data. Consequently, OOD images result in a high reconstruction error, which we use to distinguish between in- and out-of-distribution samples. We built our experiments on the standard NYU Depth V2 and KITTI benchmarks as in-distribution data. Our post hoc method performs astonishingly well on different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth estimation model.
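A minimal sketch of the scoring rule, reconstruction error from a decoder trained only on in-distribution data on top of the frozen depth encoder, is shown below with toy stand-in networks; the real encoder comes from a trained monocular depth model, and the threshold choice is an assumption.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the frozen depth encoder and the reconstruction decoder
# (the real ones come from a trained monocular depth network).
depth_encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
image_decoder = nn.Sequential(nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

@torch.no_grad()
def ood_scores(images: torch.Tensor) -> torch.Tensor:
    """Per-image OOD score = reconstruction error of a decoder trained on
    in-distribution data only, on top of the frozen depth encoder."""
    recon = image_decoder(depth_encoder(images))
    err = (recon - images) ** 2
    return err.flatten(start_dim=1).mean(dim=1)      # one score per image

scores = ood_scores(torch.rand(4, 3, 64, 64))
threshold = scores.quantile(0.95)     # in practice: a quantile of in-distribution
is_ood = scores > threshold           # validation errors
```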

Head Rotation in Denoising Diffusion Models

  • paper_url: http://arxiv.org/abs/2308.06057
  • repo_url: https://github.com/asperti/head-rotation
  • paper_authors: Andrea Asperti, Gabriele Colasuonno, Antonio Guerra
  • for: This study explores Denoising Diffusion Models (DDMs) for deep generative modeling, focusing on manipulating and editing attributes of generated images, with face rotation as the target edit.
  • methods: Leveraging a recent embedding technique for Denoising Diffusion Implicit Models (DDIM), the semantics of the high-dimensional latent space are explored and edits are performed along trajectories obtained by linear regression over latent representations of dataset samples with different yaw rotations.
  • results: The approach achieves noteworthy manipulations over a rotation range of ±30°, preserving the distinct characteristics of the individual. As a byproduct, CelebA images are labeled into three major groups based on illumination direction: left, center, and right.
    Abstract Denoising Diffusion Models (DDM) are emerging as the cutting-edge technology in the realm of deep generative modeling, challenging the dominance of Generative Adversarial Networks. However, effectively exploring the latent space's semantics and identifying compelling trajectories for manipulating and editing important attributes of the generated samples remains challenging, primarily due to the high-dimensional nature of the latent space. In this study, we specifically concentrate on face rotation, which is known to be one of the most intricate editing operations. By leveraging a recent embedding technique for Denoising Diffusion Implicit Models (DDIM), we achieve, in many cases, noteworthy manipulations encompassing a wide rotation angle of $\pm 30^o$, preserving the distinct characteristics of the individual. Our methodology exploits the computation of trajectories approximating clouds of latent representations of dataset samples with different yaw rotations through linear regression. Specific trajectories are obtained by restricting the analysis to subsets of data sharing significant attributes with the source image. One of these attributes is the light provenance: a byproduct of our research is a labeling of CelebA, categorizing images into three major groups based on the illumination direction: left, center, and right.
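As a simplified stand-in for the trajectory estimation (which in the paper is restricted to subsets sharing attributes such as light provenance), the sketch below fits a linear regression from yaw angle to latent codes and moves a source latent along the resulting direction; the data and dimensions are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: DDIM latent codes of face images with known yaw angles.
rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 512))            # stand-in for DDIM embeddings
yaw = rng.uniform(-30, 30, size=500)             # degrees

# Fit a linear map from yaw angle to latent space; its slope approximates a
# "rotation direction" along which a source latent can be moved.
reg = LinearRegression().fit(yaw.reshape(-1, 1), latents)
direction = reg.coef_[:, 0]                      # (512,) latent units per degree

def rotate_latent(z: np.ndarray, delta_yaw_deg: float) -> np.ndarray:
    return z + delta_yaw_deg * direction         # decode with DDIM afterwards

z_src = latents[0]
z_rotated = rotate_latent(z_src, +15.0)
```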

Computer-Aided Cytology Diagnosis in Animals: CNN-Based Image Quality Assessment for Accurate Disease Classification

  • paper_url: http://arxiv.org/abs/2308.06055
  • repo_url: None
  • paper_authors: Jan Krupiński, Maciej Wielgosz, Szymon Mazurek, Krystian Strzałka, Paweł Russek, Jakub Caputa, Daria Łukasik, Jakub Grzeszczyk, Michał Karwatowski, Rafał Fraczek, Ernest Jamro, Marcin Pietroń, Sebastian Koryciak, Agnieszka Dąbrowska-Boruch, Kazimierz Wiatr
  • for: This paper develops a computer-aided cytology diagnosis system for animals.
  • methods: Convolutional neural networks (CNNs), built on a ResNet18 architecture, are used for image quality assessment (IQA), and the effects of input sizes and cropping strategies on model performance are studied.
  • results: The study finds that CNN-based IQA improves the accuracy of disease classification.
    Abstract This paper presents a computer-aided cytology diagnosis system designed for animals, focusing on image quality assessment (IQA) using Convolutional Neural Networks (CNNs). The system's building blocks are tailored to seamlessly integrate IQA, ensuring reliable performance in disease classification. We extensively investigate the CNN's ability to handle various image variations and scenarios, analyzing the impact on detecting low-quality input data. Additionally, the network's capacity to differentiate valid cellular samples from those with artifacts is evaluated. Our study employs a ResNet18 network architecture and explores the effects of input sizes and cropping strategies on model performance. The research sheds light on the significance of CNN-based IQA in computer-aided cytology diagnosis for animals, enhancing the accuracy of disease classification.
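A minimal sketch of the IQA component as described, a ResNet18 classifier deciding whether a cytology image crop is of diagnostic quality, might look as follows (torchvision ≥ 0.13 API; the two-class head and training details are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet18 backbone with a binary "diagnostic quality: good / poor" head,
# as a minimal stand-in for the IQA component described above.
model = models.resnet18(weights=None)        # or ImageNet-pretrained weights
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.rand(8, 3, 224, 224)          # cytology image crops (toy batch)
labels = torch.randint(0, 2, (8,))           # 0 = poor quality, 1 = good quality

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```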

Hardware Accelerators in Autonomous Driving

  • paper_url: http://arxiv.org/abs/2308.06054
  • repo_url: None
  • paper_authors: Ken Power, Shailendra Deva, Ting Wang, Julius Li, Ciarán Eising
  • for: This paper provides an overview of machine learning accelerators used in autonomous vehicles to meet the performance and reliability requirements of machine vision tasks.
  • methods: The paper surveys hardware accelerators, special-purpose coprocessors that help autonomous vehicles meet the perception and machine vision demands of higher levels of autonomy, with examples of their use.
  • results: The paper offers recommendations for researchers and practitioners and highlights a trajectory for ongoing and future research in this emerging field.
    Abstract Computing platforms in autonomous vehicles record large amounts of data from many sensors, process the data through machine learning models, and make decisions to ensure the vehicle's safe operation. Fast, accurate, and reliable decision-making is critical. Traditional computer processors lack the power and flexibility needed for the perception and machine vision demands of advanced autonomous driving tasks. Hardware accelerators are special-purpose coprocessors that help autonomous vehicles meet performance requirements for higher levels of autonomy. This paper provides an overview of ML accelerators with examples of their use for machine vision in autonomous vehicles. We offer recommendations for researchers and practitioners and highlight a trajectory for ongoing and future research in this emerging field.

Towards Instance-adaptive Inference for Federated Learning

  • paper_url: http://arxiv.org/abs/2308.06051
  • repo_url: https://github.com/chunmeifeng/fedins
  • paper_authors: Chun-Mei Feng, Kai Yu, Nian Liu, Xinxing Xu, Salman Khan, Wangmeng Zuo
  • for: This paper tackles data heterogeneity in federated learning (FL), in particular the intra-client heterogeneity that arises in complex real-world data.
  • methods: A new FL algorithm, FedIns, handles intra-client heterogeneity by enabling instance-adaptive inference: a parameter-efficient scale-and-shift (SSF) pool is trained per client and aggregated on the server, and at inference the best-matched SSF subsets are dynamically selected and combined per instance, reducing both intra- and inter-client heterogeneity.
  • results: Experiments show that FedIns outperforms state-of-the-art methods, e.g., a 6.64% improvement over the top-performing method with less than 15% communication cost on Tiny-ImageNet.
    Abstract Federated learning (FL) is a distributed learning paradigm that enables multiple clients to learn a powerful global model by aggregating local training. However, the performance of the global model is often hampered by non-i.i.d. distribution among the clients, requiring extensive efforts to mitigate inter-client data heterogeneity. Going beyond inter-client data heterogeneity, we note that intra-client heterogeneity can also be observed on complex real-world data and seriously deteriorate FL performance. In this paper, we present a novel FL algorithm, i.e., FedIns, to handle intra-client data heterogeneity by enabling instance-adaptive inference in the FL framework. Instead of huge instance-adaptive models, we resort to a parameter-efficient fine-tuning method, i.e., scale and shift deep features (SSF), upon a pre-trained model. Specifically, we first train an SSF pool for each client, and aggregate these SSF pools on the server side, thus still maintaining a low communication cost. To enable instance-adaptive inference, for a given instance, we dynamically find the best-matched SSF subsets from the pool and aggregate them to generate an adaptive SSF specified for the instance, thereby reducing the intra-client as well as the inter-client heterogeneity. Extensive experiments show that our FedIns outperforms state-of-the-art FL algorithms, e.g., a 6.64\% improvement against the top-performing method with less than 15\% communication cost on Tiny-ImageNet. Our code and models will be publicly released.
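The scale-and-shift (SSF) building block is simple to write down; the sketch below also shows one hypothetical way to blend a pool of SSF modules into an instance-adaptive one, since the abstract does not specify how the best-matched subsets are weighted.

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Scale-and-Shift deep Features (SSF): a per-channel affine transform
    inserted after a frozen layer; only gamma/beta are trained and exchanged."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (..., dim)
        return x * self.gamma + self.beta

# Hypothetical aggregation of a pool of SSF modules into an instance-adaptive one.
def aggregate_ssf(pool: list, weights: torch.Tensor) -> ScaleShift:
    dim = pool[0].gamma.numel()
    merged = ScaleShift(dim)
    with torch.no_grad():
        merged.gamma.copy_(sum(w * m.gamma for w, m in zip(weights, pool)))
        merged.beta.copy_(sum(w * m.beta for w, m in zip(weights, pool)))
    return merged

pool = [ScaleShift(768) for _ in range(4)]
weights = torch.softmax(torch.randn(4), dim=0)   # e.g. from instance similarity
adaptive = aggregate_ssf(pool, weights)
y = adaptive(torch.randn(1, 197, 768))
```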

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

  • paper_url: http://arxiv.org/abs/2308.06038
  • repo_url: https://github.com/chunmeifeng/DiffTPT
  • paper_authors: Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, Wangmeng Zuo
  • for: This paper proposes DiffTPT, a new test-time prompt tuning (TPT) method for adapting pre-trained vision-language models to test samples from unseen new domains.
  • methods: Pre-trained diffusion models are leveraged to generate diverse and informative augmented data, combined with conventional augmentation, and the generated images are filtered with a cosine-similarity-based technique to preserve prediction fidelity.
  • results: On test datasets with distribution shifts and unseen categories, DiffTPT improves zero-shot accuracy by an average of 5.13% over the state-of-the-art TPT method.
    Abstract Benefiting from prompt tuning, recent years have witnessed the promising performance of pre-trained vision-language models, e.g., CLIP, on versatile downstream tasks. In this paper, we focus on a particular setting of learning adaptive prompts on the fly for each test sample from an unseen new domain, which is known as test-time prompt tuning (TPT). Existing TPT methods typically rely on data augmentation and confidence selection. However, conventional data augmentation techniques, e.g., random resized crops, suffers from the lack of data diversity, while entropy-based confidence selection alone is not sufficient to guarantee prediction fidelity. To address these issues, we propose a novel TPT method, named DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data. Specifically, we incorporate augmented data by both conventional method and pre-trained stable diffusion to exploit their respective merits, improving the models ability to adapt to unknown new test data. Moreover, to ensure the prediction fidelity of generated data, we introduce a cosine similarity-based filtration technique to select the generated data with higher similarity to the single test sample. Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves the zero-shot accuracy by an average of 5.13\% compared to the state-of-the-art TPT method. Our code and models will be publicly released.
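The cosine-similarity-based filtration step can be sketched in a few lines: keep only the augmented or generated views whose features are closest to the single test sample's feature. The keep ratio and the use of encoder features are assumptions.

```python
import torch
import torch.nn.functional as F

def filter_augmentations(test_feat: torch.Tensor, aug_feats: torch.Tensor,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the generated/augmented views whose image features are most
    cosine-similar to the single test sample's feature."""
    sims = F.cosine_similarity(aug_feats, test_feat.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * aug_feats.size(0)))
    return sims.topk(k).indices   # indices of retained augmented views

# toy usage with random stand-ins for encoder features
test_feat = torch.randn(512)
aug_feats = torch.randn(64, 512)   # features of diffusion-generated variants
kept = filter_augmentations(test_feat, aug_feats, keep_ratio=0.25)
```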

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2308.06027
  • repo_url: None
  • paper_authors: Yuki Endo
  • for: This paper proposes a method for spatially controlling text-to-image generation without further training of diffusion models.
  • methods: The method exploits the positional relationship between words and pixels reflected in cross-attention maps: a first approach directly swaps the cross-attention maps with constant maps computed from the semantic regions, and masked-attention guidance further manipulates the noise images fed to the diffusion model to indirectly control attention, yielding images more faithful to the semantic masks.
  • results: Experiments show that the method enables more accurate spatial control than baselines, both qualitatively and quantitatively.
    Abstract Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone has high spatial ambiguity and limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g, sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that the cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. Moreover, we propose masked-attention guidance, which can generate images more faithful to semantic masks than the first approach. Masked-attention guidance indirectly controls attention to each word and pixel according to the semantic regions by manipulating noise images fed to diffusion models. Experiments show that our method enables more accurate spatial control than baselines qualitatively and quantitatively.

Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.06024
  • repo_url: https://github.com/mvme-hbut/sgacnet
  • paper_authors: Yang Zhang, Chenyun Xiong, Junjie Liu, Xuhui Ye, Guodong Sun
  • for: This paper targets efficient RGB-D semantic segmentation for mobile robots, i.e., segmenting and recognizing objects and scenes in the environment.
  • methods: An efficient lightweight encoder-decoder network is proposed that captures multi-level RGB-D features with channel and spatial fusion attention modules, together with a globally guided local affinity context module to obtain sufficient high-level context information.
  • results: Experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets show a better trade-off among segmentation accuracy, inference time, and parameter count than state-of-the-art methods.
    Abstract Efficient RGB-D semantic segmentation has received considerable attention in mobile robots, which plays a vital role in analyzing and recognizing environmental information. According to previous studies, depth information can provide corresponding geometric relationships for objects and scenes, but actual depth data usually exist as noise. To avoid unfavorable effects on segmentation accuracy and computation, it is necessary to design an efficient framework to leverage cross-modal correlations and complementary cues. In this paper, we propose an efficient lightweight encoder-decoder network that reduces the computational parameters and guarantees the robustness of the algorithm. Working with channel and spatial fusion attention modules, our network effectively captures multi-level RGB-D features. A globally guided local affinity context module is proposed to obtain sufficient high-level context information. The decoder utilizes a lightweight residual unit that combines short- and long-distance information with a few redundant computations. Experimental results on NYUv2, SUN RGB-D, and Cityscapes datasets show that our method achieves a better trade-off among segmentation accuracy, inference time, and parameters than the state-of-the-art methods. The source code will be at https://github.com/MVME-HBUT/SGACNet

Scale-Preserving Automatic Concept Extraction (SPACE)

  • paper_url: http://arxiv.org/abs/2308.06022
  • repo_url: https://github.com/data-science-in-mechanical-engineering/space
  • paper_authors: Andrés Felipe Posada-Moreno, Lukas Kreisköther, Tassilo Glander, Sebastian Trimpe
  • for: Improving the reliability and transparency of convolutional neural networks (CNNs) used for industrial quality control in Industry 4.0, where unexpected behavior can cause economic losses or increased risk to human life.
  • methods: A concept extraction method based on square slices of input images, which are selected, tiled, and clustered into concepts, avoiding scale changes throughout the extraction process and thereby providing scale-preserving concept explanations.
  • results: On three image classification datasets from industrial quality control, SPACE outperforms other methods and provides actionable insights into the decision mechanisms of CNNs.
    Abstract Convolutional Neural Networks (CNN) have become a common choice for industrial quality control, as well as other critical applications in the Industry 4.0. When these CNNs behave in ways unexpected to human users or developers, severe consequences can arise, such as economic losses or an increased risk to human life. Concept extraction techniques can be applied to increase the reliability and transparency of CNNs through generating global explanations for trained neural network models. The decisive features of image datasets in quality control often depend on the feature's scale; for example, the size of a hole or an edge. However, existing concept extraction methods do not correctly represent scale, which leads to problems interpreting these models as we show herein. To address this issue, we introduce the Scale-Preserving Automatic Concept Extraction (SPACE) algorithm, as a state-of-the-art alternative concept extraction technique for CNNs, focused on industrial applications. SPACE is specifically designed to overcome the aforementioned problems by avoiding scale changes throughout the concept extraction process. SPACE proposes an approach based on square slices of input images, which are selected and then tiled before being clustered into concepts. Our method provides explanations of the models' decision-making process in the form of human-understandable concepts. We evaluate SPACE on three image classification datasets in the context of industrial quality control. Through experimental results, we illustrate how SPACE outperforms other methods and provides actionable insights on the decision mechanisms of CNNs. Finally, code for the implementation of SPACE is provided.

Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation

  • paper_url: http://arxiv.org/abs/2308.06015
  • repo_url: https://github.com/liuxuannan/stochastic-gradient-aggregation
  • paper_authors: Xuannan Liu, Yaoyao Zhong, Yuhang Zhang, Lixiong Qin, Weihong Deng
  • for: Improving the generalization ability of universal adversarial perturbations (UAPs) by addressing the gradient vanishing problem of small-batch stochastic optimization and the poor local optima of large-batch optimization in UAP generation.
  • methods: Stochastic Gradient Aggregation (SGA) is proposed: small-batch training performs multiple iterations of inner pre-search, and all inner gradients are aggregated into a one-step gradient estimate, which stabilizes the gradients, reduces quantization errors, alleviates gradient vanishing, and escapes poor local optima.
  • results: Extensive experiments on the standard ImageNet dataset demonstrate that the method significantly enhances the generalization ability of UAPs and outperforms other state-of-the-art methods.
    Abstract Deep neural networks are vulnerable to universal adversarial perturbation (UAP), an instance-agnostic perturbation capable of fooling the target model for most samples. Compared to instance-specific adversarial examples, UAP is more challenging as it needs to generalize across various samples and models. In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective -- the gradient vanishing problem using small-batch stochastic gradient optimization and the local optima problem using large-batch optimization. To address these problems, we propose a simple and effective method called Stochastic Gradient Aggregation (SGA), which alleviates the gradient vanishing and escapes from poor local optima at the same time. Specifically, SGA employs the small-batch training to perform multiple iterations of inner pre-search. Then, all the inner gradients are aggregated as a one-step gradient estimation to enhance the gradient stability and reduce quantization errors. Extensive experiments on the standard ImageNet dataset demonstrate that our method significantly enhances the generalization ability of UAP and outperforms other state-of-the-art methods. The code is available at https://github.com/liuxuannan/Stochastic-Gradient-Aggregation.
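A hedged sketch of the aggregation idea, accumulating gradients over several small inner batches and applying them as a single sign step on the universal perturbation, is given below; the inner pre-search details, step sizes, and toy model are assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sga_uap_epoch(model: nn.Module, loader, delta: torch.Tensor,
                  inner_iters: int = 4, step: float = 2/255, eps: float = 10/255):
    """One pass of a simplified Stochastic-Gradient-Aggregation-style UAP update:
    accumulate gradients over several small inner batches, then take a single
    aggregated sign step and project back into the epsilon-ball."""
    model.eval()
    it = iter(loader)
    grad_sum = torch.zeros_like(delta)
    for _ in range(inner_iters):
        images, labels = next(it)
        d = delta.clone().requires_grad_(True)
        loss = F.cross_entropy(model(images + d), labels)   # maximize this loss
        grad_sum += torch.autograd.grad(loss, d)[0]
    with torch.no_grad():
        delta += step * (grad_sum / inner_iters).sign()
        delta.clamp_(-eps, eps)
    return delta

# toy usage with a placeholder classifier and dataset
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
loader = [(torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))) for _ in range(4)]
delta = torch.zeros(1, 3, 32, 32)
delta = sga_uap_epoch(model, loader, delta)
```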

ViGT: Proposal-free Video Grounding with Learnable Token in Transformer

  • paper_url: http://arxiv.org/abs/2308.06009
  • repo_url: None
  • paper_authors: Kun Li, Dan Guo, Meng Wang
  • for: This work targets the video grounding (VG) task, which aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions.
  • methods: A novel boundary regression paradigm performs regression token learning in a transformer: the proposal-free Video Grounding Transformer (ViGT) predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features.
  • results: ViGT performs well on three public datasets (ANet Captions, TACoS, and YouCookII); extensive ablation studies and qualitative analysis further validate its interpretability.
    Abstract The video grounding (VG) task aims to locate the queried action or event in an untrimmed video based on rich linguistic descriptions. Existing proposal-free methods are trapped in complex interaction between video and query, overemphasizing cross-modal feature fusion and feature correlation for VG. In this paper, we propose a novel boundary regression paradigm that performs regression token learning in a transformer. Particularly, we present a simple but effective proposal-free framework, namely Video Grounding Transformer (ViGT), which predicts the temporal boundary using a learnable regression token rather than multi-modal or cross-modal features. In ViGT, the benefits of a learnable token are manifested as follows. (1) The token is unrelated to the video or the query and avoids data bias toward the original video and query. (2) The token simultaneously performs global context aggregation from video and query features. First, we employed a sharing feature encoder to project both video and query into a joint feature space before performing cross-modal co-attention (i.e., video-to-query attention and query-to-video attention) to highlight discriminative features in each modality. Furthermore, we concatenated a learnable regression token [REG] with the video and query features as the input of a vision-language transformer. Finally, we utilized the token [REG] to predict the target moment and visual features to constrain the foreground and background probabilities at each timestamp. The proposed ViGT performed well on three public datasets: ANet Captions, TACoS and YouCookII. Extensive ablation studies and qualitative analysis further validated the interpretability of ViGT.
    摘要 视频定位(VG)任务的目标是基于丰富的语言描述,在未剪辑的视频中定位所查询的动作或事件。现有的无提案方法受制于视频与查询之间复杂的交互,过度强调跨模态特征融合和特征相关性。在本文中,我们提出了一种新的边界回归范式,即在transformer中进行回归token学习。特别地,我们提出了一种简单而有效的无提案框架——视频定位变换器(ViGT),它通过一个可学习的回归token来预测时间边界,而不是依赖多模态或跨模态特征。在ViGT中,可学习token的好处如下:1. 该token与视频或查询均无关,因此避免了对原始视频和查询的数据偏置。2. 该token同时从视频和查询特征中进行全局上下文聚合。首先,我们使用一个共享特征编码器将视频和查询映射到联合特征空间,然后进行跨模态协同注意力(即视频到查询注意力和查询到视频注意力),以突出每个模态中的判别性特征。此外,我们将可学习回归token [REG] 与视频和查询特征拼接,作为视觉-语言变换器的输入。最后,我们利用token [REG] 预测目标时刻,并利用视觉特征在每个时间戳上约束前景和背景概率。ViGT在ANet Captions、TACoS和YouCookII三个公共数据集上表现出色,广泛的消融研究和定性分析进一步验证了ViGT的可解释性。
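
A rough sketch of the learnable regression-token idea: a [REG] embedding is prepended to the fused video/query tokens, passed through a transformer encoder, and its output regresses a normalized (center, width) temporal boundary. Dimensions, depth, and the two-layer head are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RegTokenHead(nn.Module):
    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.boundary = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, video_tokens, query_tokens):
        # video_tokens: (B, Tv, dim), query_tokens: (B, Tq, dim)
        B = video_tokens.size(0)
        reg = self.reg_token.expand(B, -1, -1)
        x = torch.cat([reg, video_tokens, query_tokens], dim=1)
        x = self.encoder(x)
        return self.boundary(x[:, 0]).sigmoid()   # (B, 2) normalized (center, width)

head = RegTokenHead()
pred = head(torch.randn(2, 64, 256), torch.randn(2, 12, 256))  # -> shape (2, 2)
```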

Image-based Geolocalization by Ground-to-2.5D Map Matching

  • paper_url: http://arxiv.org/abs/2308.05993
  • repo_url: None
  • paper_authors: Mengjie Zhou, Liu Liu, Yiran Zhong
  • for: 图像基于地理位置问题的解决方案,即将平面图像与地图照片匹配。
  • methods: 我们提出了一种新的方法,通过在2.5D空间中使用高度信息来提高跨视图匹配。我们首先将2D地图与平面图像对齐使用极限变换,然后利用全球混合来混合多模态特征从2D和2.5D地图来增强定位编码的 distintiveness。
  • results: 我们的方法在两种常见的定位方法,单图定位和路径定位中实现了显著高于前一代2D地图基于方法的定位精度和更快的收敛速度。
    Abstract We study the image-based geolocalization problem that aims to locate ground-view query images on cartographic maps. Previous methods often utilize cross-view localization techniques to match ground-view query images with 2D maps. However, the performance of these methods is frequently unsatisfactory due to the significant cross-view appearance differences. In this paper, we extend cross-view matching to 2.5D spaces, where the heights of the structures - such as trees, buildings, and other objects - can provide additional information to guide the cross-view matching. We present a new approach to learning representative embeddings from multi-model data. Specifically, we first align 2D maps to ground-view panoramic images with polar transform to reduce the gap between panoramic images and maps. Then we leverage global fusion to fuse the multi-modal features from 2D and 2.5D maps to increase the distinctiveness of location embeddings. We construct the first large-scale ground-to-2.5D map geolocalization dataset to validate our method and facilitate the research. We test our learned embeddings on two popular localization approaches, i.e., single-image based localization, and route based localization. Extensive experiments demonstrate that our proposed method achieves significantly higher localization accuracy and faster convergence than previous 2D map-based approaches.
    摘要 我们研究基于图像的地理定位问题,即将地面视角的查询图像定位到地图上。以往的方法常使用跨视图定位技术将地面查询图像与2D地图进行匹配。然而,由于跨视图外观差异很大,这些方法的性能往往不尽如人意。在本文中,我们将跨视图匹配扩展到2.5D空间,其中树木、建筑等结构的高度可以提供额外信息来引导跨视图匹配。我们提出了一种从多模态数据中学习代表性嵌入的新方法。具体来说,我们首先使用极坐标变换将2D地图与地面全景图像对齐,以缩小全景图像与地图之间的差距;然后利用全局融合来融合2D和2.5D地图的多模态特征,以提高位置嵌入的可区分性。我们构建了首个大规模的地面到2.5D地图地理定位数据集,用于验证我们的方法并促进相关研究。我们在两种常见的定位方式(单图像定位和路径定位)上测试了所学的嵌入。大量实验表明,与以往基于2D地图的方法相比,我们的方法在定位精度上显著更高,收敛速度也更快。
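
A toy numpy sketch of the polar-transform alignment mentioned above: each output column corresponds to an azimuth (matching a panorama column) and each row to a radial distance from the assumed query location at the map-patch center. Output size, nearest-neighbour sampling, and the north-up angle convention are simplifications for illustration.

```python
import numpy as np

def polar_transform(map_img, out_h=128, out_w=512, max_radius=None):
    H, W = map_img.shape[:2]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0            # assume query at patch center
    if max_radius is None:
        max_radius = min(cy, cx)
    rows = np.arange(out_h).reshape(-1, 1) / (out_h - 1) * max_radius   # radius
    cols = np.arange(out_w).reshape(1, -1) / out_w * 2 * np.pi          # azimuth
    ys = np.clip(np.round(cy - rows * np.cos(cols)), 0, H - 1).astype(int)
    xs = np.clip(np.round(cx + rows * np.sin(cols)), 0, W - 1).astype(int)
    return map_img[ys, xs]                            # (out_h, out_w, ...) polar view

polar = polar_transform(np.random.rand(256, 256, 3))
print(polar.shape)  # (128, 512, 3)
```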

Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

  • paper_url: http://arxiv.org/abs/2308.05991
  • repo_url: https://github.com/yinyf0804/wsod-cbl
  • paper_authors: Yufei Yin, Jiajun Deng, Wengang Zhou, Li Li, Houqiang Li
  • for: 提高weakly supervised object detection的精度,增强多个实例检测网络(MIDN)的 Pseudo-labeling 质量。
  • methods: 我们提出了一种新的weakly supervised object detection框架,即Cyclic-Bootstrap Labeling(CBL),它利用一个可靠的教师网络来优化MIDN的伪标签。特别是,我们使用加权指数移动平均策略来利用不同精炼模块的输出。此外,我们还提出了一种新的类别特定排名蒸馏算法,使MIDN能够从教师网络的输出中受益。
  • results: 我们在PASCAL VOC 2007 & 2012和COCO datasets上进行了广泛的实验,并证明了我们的CBL框架在weakly supervised object detection中表现出色。
    Abstract Recent progress in weakly supervised object detection is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, with only image-level annotation, MIDN inevitably assigns high scores to some unexpected region proposals when generating pseudo labels. These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance. In this work, we explore how to ameliorate the quality of pseudo-labeling in MIDN. Formally, we devise Cyclic-Bootstrap Labeling (CBL), a novel weakly supervised object detection pipeline, which optimizes MIDN with rank information from a reliable teacher network. Specifically, we obtain this teacher network by introducing a weighted exponential moving average strategy to take advantage of various refinement modules. A novel class-specific ranking distillation algorithm is proposed to leverage the output of weighted ensembled teacher network for distilling MIDN with rank information. As a result, MIDN is guided to assign higher scores to accurate proposals among their neighboring ones, thus benefiting the subsequent pseudo labeling. Extensive experiments on the prevalent PASCAL VOC 2007 \& 2012 and COCO datasets demonstrate the superior performance of our CBL framework. Code will be available at https://github.com/Yinyf0804/WSOD-CBL/.
    摘要 弱监督目标检测的最新进展主要体现为多实例检测网络(MIDN)与在线迭代精炼的结合。然而,在仅有图像级标注的情况下,MIDN在生成伪标签时不可避免地会给一些意料之外的区域提案分配高分。这些不准确的高分区域提案会误导后续精炼模块的训练,从而降低检测性能。在本工作中,我们探索如何改善MIDN中伪标签的质量。我们提出了一种新的弱监督目标检测流程,即循环自举标注(CBL),它利用来自可靠教师网络的排名信息来优化MIDN。具体来说,我们通过引入加权指数移动平均策略来利用各个精炼模块,从而获得该教师网络。此外,我们还提出了一种类别特定的排名蒸馏算法,利用加权集成教师网络的输出,以排名信息来蒸馏MIDN。这使得MIDN能够在相邻提案中给准确的提案分配更高的分数,从而有利于后续的伪标注。在常用的PASCAL VOC 2007、2012和COCO数据集上的大量实验证明了我们CBL框架的优越性能。代码将在https://github.com/Yinyf0804/WSOD-CBL/上公布。
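
A short sketch of the weighted exponential-moving-average teacher used to stabilize pseudo-labeling: the teacher's parameters track an EMA of the student's. The momentum value and the single-student simplification are illustrative; the paper ensembles several refinement branches rather than one student.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Update `teacher` in place toward an EMA of `student` (same architecture)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)   # e.g. BatchNorm running statistics

# typical usage after every optimizer step:
#   ema_update(teacher_model, student_model)
```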

Automatic Classification of Blood Cell Images Using Convolutional Neural Network

  • paper_url: http://arxiv.org/abs/2308.06300
  • repo_url: None
  • paper_authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Abeera Mahfooz
  • for: 这项研究的目的是自动分类血液细胞。
  • methods: 这项研究使用了转移学习的 convolutional neural network (CNN) 模型,包括 VGG16、VGG19、ResNet-50、ResNet-101、ResNet-152、InceptionV3、MobileNetV2 和 DenseNet-20,并应用于 PBC 数据集的正常 DIB。
  • results: 实验结果表明,提议的 CNN 模型在 PBC 数据集上达到了 99.91% 的准确率,与之前在文献中报道的结果相比,我们提议的 convolutional neural network 模型在血液细胞分类方面表现竞争力强。
    Abstract Human blood primarily comprises plasma, red blood cells, white blood cells, and platelets. It plays a vital role in transporting nutrients to different organs, where it stores essential health-related data about the human body. Blood cells are utilized to defend the body against diverse infections, including fungi, viruses, and bacteria. Hence, blood analysis can help physicians assess an individual's physiological condition. Blood cells have been sub-classified into eight groups: Neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes (promyelocytes, myelocytes, and metamyelocytes), erythroblasts, and platelets or thrombocytes on the basis of their nucleus, shape, and cytoplasm. Traditionally, pathologists and hematologists in laboratories have examined these blood cells using a microscope before manually classifying them. The manual approach is slower and more prone to human error. Therefore, it is essential to automate this process. In our paper, transfer learning with CNN pre-trained models. VGG16, VGG19, ResNet-50, ResNet-101, ResNet-152, InceptionV3, MobileNetV2, and DenseNet-20 applied to the PBC dataset's normal DIB. The overall accuracy achieved with these models lies between 91.375 and 94.72%. Hence, inspired by these pre-trained architectures, a model has been proposed to automatically classify the ten types of blood cells with increased accuracy. A novel CNN-based framework has been presented to improve accuracy. The proposed CNN model has been tested on the PBC dataset normal DIB. The outcomes of the experiments demonstrate that our CNN-based framework designed for blood cell classification attains an accuracy of 99.91% on the PBC dataset. Our proposed convolutional neural network model performs competitively when compared to earlier results reported in the literature.
    摘要 人体血液主要由血浆、红细胞、白细胞和血小板组成。血液负责向各个器官输送营养物质,并携带与人体健康密切相关的重要信息。血细胞可以用来抵御真菌、病毒和细菌等多种感染,因此血液分析可以帮助医生评估个体的生理状况。根据细胞核、形态和细胞质,血细胞可分为八类:中性粒细胞、嗜酸性粒细胞、嗜碱性粒细胞、淋巴细胞、单核细胞、未成熟粒细胞(早幼粒细胞、中幼粒细胞和晚幼粒细胞)、有核红细胞以及血小板。传统上,实验室中的病理学家和血液学家需要先在显微镜下观察这些血细胞,再进行人工分类。这种人工方法速度较慢,也更容易出现人为错误,因此有必要将这一过程自动化。在本文中,我们以迁移学习的方式,将VGG16、VGG19、ResNet-50、ResNet-101、ResNet-152、InceptionV3、MobileNetV2和DenseNet-20等预训练CNN模型应用于PBC数据集的normal DIB,这些模型的总体准确率介于91.375%和94.72%之间。受这些预训练架构的启发,我们提出了一个能够以更高准确率自动分类十类血细胞的模型,并给出了一个新的基于CNN的框架来提升准确率。所提出的CNN模型在PBC数据集normal DIB上进行了测试,实验结果表明,我们为血细胞分类设计的CNN框架在PBC数据集上达到了99.91%的准确率。与文献中已有的结果相比,我们提出的卷积神经网络模型具有相当的竞争力。
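
A minimal transfer-learning sketch in the spirit of the pipeline above: an ImageNet-pretrained backbone with its classifier head replaced for the blood-cell classes. The class count, the torchvision weights enum (assumes torchvision >= 0.13), and freezing the backbone are illustrative assumptions; the paper's own CNN is not reproduced here.

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 8  # neutrophil, eosinophil, basophil, lymphocyte, monocyte, IG, erythroblast, platelet

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in model.parameters():      # optionally freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new trainable head
```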

MS3D++: Ensemble of Experts for Multi-Source Unsupervised Domain Adaption in 3D Object Detection

  • paper_url: http://arxiv.org/abs/2308.05988
  • repo_url: https://github.com/darrenjkt/ms3d
  • paper_authors: Darren Tsai, Julie Stephany Berrio, Mao Shan, Eduardo Nebot, Stewart Worrall
  • for: addressed the issue of 3D detectors’ poor performance in unfamiliar domains due to domain gap
  • methods: introduced MS3D++, a self-training framework for multi-source unsupervised domain adaptation in 3D object detection, which generates high-quality pseudo-labels and fuses predictions of an ensemble of multi-frame pre-trained detectors
  • results: achieved state-of-the-art performance, comparable to training with human-annotated labels in Bird’s Eye View (BEV) evaluation for both low and high density lidar, on Waymo, nuScenes and Lyft datasets.
    Abstract Deploying 3D detectors in unfamiliar domains has been demonstrated to result in a drastic drop of up to 70-90% in detection rate due to variations in lidar, geographical region, or weather conditions from their original training dataset. This domain gap leads to missing detections for densely observed objects, misaligned confidence scores, and increased high-confidence false positives, rendering the detector highly unreliable. To address this, we introduce MS3D++, a self-training framework for multi-source unsupervised domain adaptation in 3D object detection. MS3D++ provides a straightforward approach to domain adaptation by generating high-quality pseudo-labels, enabling the adaptation of 3D detectors to a diverse range of lidar types, regardless of their density. Our approach effectively fuses predictions of an ensemble of multi-frame pre-trained detectors from different source domains to improve domain generalization. We subsequently refine the predictions temporally to ensure temporal consistency in box localization and object classification. Furthermore, we present an in-depth study into the performance and idiosyncrasies of various 3D detector components in a cross-domain context, providing valuable insights for improved cross-domain detector ensembling. Experimental results on Waymo, nuScenes and Lyft demonstrate that detectors trained with MS3D++ pseudo-labels achieve state-of-the-art performance, comparable to training with human-annotated labels in Bird's Eye View (BEV) evaluation for both low and high density lidar.
    摘要 将3D检测器部署到不熟悉的领域时,由于激光雷达类型、地理区域或天气条件与原始训练数据不同所造成的领域差距,检测率会大幅下降,降幅可达70%-90%。这种领域差距会导致密集观测到的目标漏检、置信度分数错位,以及高置信度误检增多,使检测器变得非常不可靠。为了解决这一问题,我们提出了MS3D++,一个面向3D目标检测的多源无监督领域自适应自训练框架。MS3D++通过生成高质量的伪标签,为领域自适应提供了一种简单直接的途径,使3D检测器能够适应各种密度的激光雷达。我们的方法融合来自不同源域的多帧预训练检测器集成的预测结果,以提升领域泛化能力;随后在时间维度上对预测进行精炼,以保证边界框定位和目标分类的时间一致性。此外,我们还深入研究了各种3D检测器组件在跨域场景下的表现与特性,为改进跨域检测器集成提供了有价值的见解。在Waymo、nuScenes和Lyft上的实验结果表明,使用MS3D++伪标签训练的检测器达到了最先进的性能,在低密度和高密度激光雷达的鸟瞰(BEV)评测中均可与使用人工标注训练相媲美。
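
A simplified sketch of turning an ensemble's detections into pseudo-labels: boxes pooled from several pre-trained detectors are greedily clustered by IoU and each cluster is reduced to a confidence-weighted average box, keeping only clusters with enough agreement. Axis-aligned (x1, y1, x2, y2) boxes and the thresholds are simplifications of the paper's multi-frame, multi-detector fusion.

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_detections(boxes, scores, iou_thr=0.5, min_votes=2, score_thr=0.6):
    """boxes: (N, 4) pooled from all detectors; scores: (N,). Returns pseudo-labels."""
    order = np.argsort(-scores)
    used = np.zeros(len(boxes), dtype=bool)
    pseudo = []
    for i in order:
        if used[i]:
            continue
        members = [j for j in order if not used[j] and iou(boxes[i], boxes[j]) >= iou_thr]
        for j in members:
            used[j] = True
        w = scores[members]
        fused_box = (boxes[members] * w[:, None]).sum(0) / w.sum()
        if len(members) >= min_votes and w.mean() >= score_thr:  # require detector agreement
            pseudo.append((fused_box, float(w.mean())))
    return pseudo
```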

Zero-shot Text-driven Physically Interpretable Face Editing

  • paper_url: http://arxiv.org/abs/2308.05976
  • repo_url: None
  • paper_authors: Yapeng Meng, Songru Yang, Xu Hu, Rui Zhao, Lincheng Li, Zhenwei Shi, Zhengxia Zou
  • for: 这篇论文提出了一种基于自由文本提示的新型面部编辑方法,以便在不同的文本提示下进行面部编辑。
  • methods: 这篇论文使用了一种新的vector流场模型来实现面部编辑,这种模型可以通过控制图像像素的坐标和颜色的偏移来实现面部编辑。
  • results: 与现有的文本驱动人脸编辑方法相比,该方法可以生成身份一致性高、图像质量好的人脸编辑结果,并可用于实时视频人脸编辑。
    Abstract This paper proposes a novel and physically interpretable method for face editing based on arbitrary text prompts. Different from previous GAN-inversion-based face editing methods that manipulate the latent space of GANs, or diffusion-based methods that model image manipulation as a reverse diffusion process, we regard the face editing process as imposing vector flow fields on face images, representing the offset of spatial coordinates and color for each image pixel. Under the above-proposed paradigm, we represent the vector flow field in two ways: 1) explicitly represent the flow vectors with rasterized tensors, and 2) implicitly parameterize the flow vectors as continuous, smooth, and resolution-agnostic neural fields, by leveraging the recent advances of implicit neural representations. The flow vectors are iteratively optimized under the guidance of the pre-trained Contrastive Language-Image Pretraining~(CLIP) model by maximizing the correlation between the edited image and the text prompt. We also propose a learning-based one-shot face editing framework, which is fast and adaptable to any text prompt input. Our method can also be flexibly extended to real-time video face editing. Compared with state-of-the-art text-driven face editing methods, our method can generate physically interpretable face editing results with high identity consistency and image quality. Our code will be made publicly available.
    摘要 本文提出了一种基于任意文本提示、具有物理可解释性的人脸编辑新方法。与以往基于GAN反演、在潜空间中操纵的人脸编辑方法,或将图像编辑建模为反向扩散过程的扩散类方法不同,我们将人脸编辑过程视为在人脸图像上施加向量流场,表示每个像素在空间坐标和颜色上的偏移。在该范式下,我们以两种方式表示向量流场:(1)用栅格化张量显式表示流向量;(2)借助隐式神经表示的最新进展,将流向量隐式参数化为连续、平滑且与分辨率无关的神经场。流向量在预训练的对比语言-图像预训练(CLIP)模型引导下迭代优化,以最大化编辑后图像与文本提示之间的相关性。我们还提出了一种基于学习的单次(one-shot)人脸编辑框架,速度快且可适应任意文本提示输入,并可灵活扩展到实时视频人脸编辑。与最先进的文本驱动人脸编辑方法相比,我们的方法能够生成身份一致性高、图像质量好且物理可解释的人脸编辑结果。我们的代码将公开发布。
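
An illustrative sketch of the core loop: a per-pixel displacement (flow) field is optimized so that the warped face maximizes CLIP similarity with a text prompt. It assumes OpenAI's open-source `clip` package, a 224x224 input, and a plain magnitude penalty in place of the paper's neural-field parameterization and color offsets.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git (assumed)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()

with torch.no_grad():
    txt_feat = F.normalize(model.encode_text(clip.tokenize(["a smiling face"]).to(device)), dim=-1)

img = torch.rand(1, 3, 224, 224, device=device)   # stand-in for a preprocessed face image
flow = torch.zeros(1, 224, 224, 2, device=device, requires_grad=True)
base = F.affine_grid(torch.eye(2, 3, device=device).unsqueeze(0), img.shape, align_corners=True)
opt = torch.optim.Adam([flow], lr=1e-3)

for _ in range(200):
    warped = F.grid_sample(img, base + flow, align_corners=True)     # apply the flow field
    img_feat = F.normalize(model.encode_image(warped), dim=-1)
    loss = -(img_feat * txt_feat).sum() + 1e-2 * flow.pow(2).mean()  # CLIP loss + flow penalty
    opt.zero_grad(); loss.backward(); opt.step()
```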

Focused Specific Objects NeRF

  • paper_url: http://arxiv.org/abs/2308.05970
  • repo_url: None
  • paper_authors: Yuesong Li, Feng Pan, Helong Yan, Xiuli Xin, Xiaoxue Feng
  • for: 提高NeRF模型的快速训练和高质量渲染,适用于复杂场景。
  • methods: 利用场景 semantic priors 提高快速训练,使网络只关注特定目标而不受背景影响。 可以提高训练速度7.78倍,并且更快地渲染小至中型目标。 此外,这种改进适用于所有NeRF模型。
  • results: 通过弱监督和粗粒抽象采样,进一步加速训练并保持渲染质量。 此外,提出了新的场景编辑技术,可以实现特定Semantic targets的独特显示或掩蔽渲染。 解决不supervised区域错误推理问题,我们还设计了一个自动化 loops,结合形态运算和聚类。
    Abstract Most NeRF-based models are designed for learning the entire scene, and complex scenes can lead to longer learning times and poorer rendering effects. This paper utilizes scene semantic priors to make improvements in fast training, allowing the network to focus on the specific targets and not be affected by complex backgrounds. The training speed can be increased by 7.78 times with better rendering effect, and small to medium sized targets can be rendered faster. In addition, this improvement applies to all NeRF-based models. Considering the inherent multi-view consistency and smoothness of NeRF, this paper also studies weak supervision by sparsely sampling negative ray samples. With this method, training can be further accelerated and rendering quality can be maintained. Finally, this paper extends pixel semantic and color rendering formulas and proposes a new scene editing technique that can achieve unique displays of the specific semantic targets or masking them in rendering. To address the problem of unsupervised regions incorrect inferences in the scene, we also designed a self-supervised loop that combines morphological operations and clustering.
    摘要 大多数NeRF基于模型是为整个场景学习,复杂的场景可能会导致更长的学习时间和更差的渲染效果。这篇论文利用场景 semantic 预测来进行改进快速训练,使网络只关注特定的目标而不受背景的影响。通过这种方法,训练速度可以提高7.78倍,并且小到中等大小的目标可以更快地渲染。此外,这种改进还适用于所有NeRF基于模型。由于NeRF的内置多视图一致性和平滑性,这篇论文还研究了稀疏采样负方向样本的弱监督学习。通过这种方法,训练可以进一步加速,并且渲染质量可以保持。最后,这篇论文扩展像素semantic和颜色渲染公式,并提出了一种新的场景编辑技术,可以实现特定semantictarget的唯一显示或隐藏其在渲染中。为了解决场景中无监督区域的错误推断问题,我们还设计了一种自动化环节,结合形态运算和分群。

YOLOrtho – A Unified Framework for Teeth Enumeration and Dental Disease Detection

  • paper_url: http://arxiv.org/abs/2308.05967
  • repo_url: None
  • paper_authors: Shenxiao Mei, Chenglong Ma, Feihong Shen, Huikai Wu
  • for: 本研究的目的是开发一种将牙齿编号与牙科疾病检测相结合的统一框架,以提高牙医的诊断效率和准确率。
  • methods: 我们采用了基于CoordConv的模型结构,并在模型中额外插入一个上采样层,以更好地利用数据并同时学习牙齿检测和疾病识别。
  • results: 实验结果表明,我们的模型优于大型扩散模型,并且能够准确地识别牙齿及其疾病。
    Abstract Detecting dental diseases through panoramic X-rays images is a standard procedure for dentists. Normally, a dentist need to identify diseases and find the infected teeth. While numerous machine learning models adopting this two-step procedure have been developed, there has not been an end-to-end model that can identify teeth and their associated diseases at the same time. To fill the gap, we develop YOLOrtho, a unified framework for teeth enumeration and dental disease detection. We develop our model on Dentex Challenge 2023 data, which consists of three distinct types of annotated data. The first part is labeled with quadrant, and the second part is labeled with quadrant and enumeration and the third part is labeled with quadrant, enumeration and disease. To further improve detection, we make use of Tufts Dental public dataset. To fully utilize the data and learn both teeth detection and disease identification simultaneously, we formulate diseases as attributes attached to their corresponding teeth. Due to the nature of position relation in teeth enumeration, We replace convolution layer with CoordConv in our model to provide more position information for the model. We also adjust the model architecture and insert one more upsampling layer in FPN in favor of large object detection. Finally, we propose a post-process strategy for teeth layout that corrects teeth enumeration based on linear sum assignment. Results from experiments show that our model exceeds large Diffusion-based model.
    摘要 通过全景X光片检测牙科疾病是牙医的标准流程。通常,牙医需要识别疾病并找到患病的牙齿。虽然已有许多机器学习模型采用这种两步流程,但尚没有能够同时识别牙齿及其相关疾病的端到端模型。为填补这一空白,我们开发了YOLOrtho,一个用于牙齿编号和牙科疾病检测的统一框架。我们在Dentex Challenge 2023数据集上开发模型,该数据集包括三种不同的标注数据:第一部分仅标注象限,第二部分标注象限和编号,第三部分标注象限、编号和疾病。为进一步提升检测效果,我们还使用了Tufts Dental公开数据集。为了充分利用数据并同时学习牙齿检测和疾病识别,我们将疾病表示为附加在对应牙齿上的属性。考虑到牙齿编号中固有的位置关系,我们在模型中用CoordConv替换卷积层,为模型提供更多的位置信息。我们还调整了模型结构,在FPN中增加了一个上采样层,以利于大目标检测。最后,我们提出了一种基于线性指派(linear sum assignment)的牙齿布局后处理策略来修正牙齿编号。实验结果表明,我们的模型优于大型扩散模型。
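
A short sketch of the assignment-based post-processing idea: detected tooth boxes are matched to enumeration slots by solving a linear sum assignment over a spatial cost. The cost used here (distance between detected box centers and a template layout) is an illustrative stand-in for the paper's cost design.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_enumeration(box_centers, template_centers):
    """box_centers: (N, 2) detected tooth centers; template_centers: (32, 2) expected layout."""
    cost = np.linalg.norm(box_centers[:, None, :] - template_centers[None, :, :], axis=-1)
    det_idx, slot_idx = linear_sum_assignment(cost)        # minimizes total matching distance
    return dict(zip(det_idx.tolist(), slot_idx.tolist()))  # detection index -> tooth slot

mapping = assign_enumeration(np.random.rand(28, 2), np.random.rand(32, 2))
```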

Compositional Learning in Transformer-Based Human-Object Interaction Detection

  • paper_url: http://arxiv.org/abs/2308.05961
  • repo_url: None
  • paper_authors: Zikun Zhuang, Ruihao Qian, Chi Xie, Shuang Liang
  • for: 本研究旨在解决人机对象交互(HOI)检测中长尾分布的问题,通过启发式学习和组合学习来提高HOI检测的性能。
  • methods: 我们提出了一种基于 transformer 框架的组合 HOI 学习方法,利用人物对象对的表示和交互表示在不同 HOI 实例中进行重新组合,以获得更加 ricther 的上下文信息和更好的知识泛化。
  • results: 我们的简单 yet effective 方法在 experiments 中达到了领先的性能水平,特别是在 rare HOI 类中表现出色。
    Abstract Human-object interaction (HOI) detection is an important part of understanding human activities and visual scenes. The long-tailed distribution of labeled instances is a primary challenge in HOI detection, promoting research in few-shot and zero-shot learning. Inspired by the combinatorial nature of HOI triplets, some existing approaches adopt the idea of compositional learning, in which object and action features are learned individually and re-composed as new training samples. However, these methods follow the CNN-based two-stage paradigm with limited feature extraction ability, and often rely on auxiliary information for better performance. Without introducing any additional information, we creatively propose a transformer-based framework for compositional HOI learning. Human-object pair representations and interaction representations are re-composed across different HOI instances, which involves richer contextual information and promotes the generalization of knowledge. Experiments show our simple but effective method achieves state-of-the-art performance, especially on rare HOI classes.
    摘要 人-物交互(HOI)检测是理解人类活动和视觉场景的重要组成部分。标注实例的长尾分布是HOI检测的主要挑战,也推动了少样本和零样本学习的研究。受HOI三元组组合性质的启发,一些现有方法采用了组合学习的思想,即分别学习物体和动作特征,再重新组合成新的训练样本。然而,这些方法沿用基于CNN的两阶段范式,特征提取能力有限,并且往往依赖辅助信息来获得更好的性能。在不引入任何额外信息的前提下,我们创新地提出了一种基于transformer的组合式HOI学习框架。人-物对表示和交互表示在不同的HOI实例之间被重新组合,从而引入更丰富的上下文信息,促进知识的泛化。实验表明,我们这一简单而有效的方法达到了领先的性能,尤其是在罕见HOI类别上。

Classification of White Blood Cells Using Machine and Deep Learning Models: A Systematic Review

  • paper_url: http://arxiv.org/abs/2308.06296
  • repo_url: None
  • paper_authors: Rabia Asghar, Sanjay Kumar, Paul Hynds, Arslan Shaukat
  • for: 这种研究的目的是帮助改善医疗图像分析领域中白细胞类型分类的精度。
  • methods: 这种研究使用了现代的机器学习(ML)和深度学习(DL)技术,包括血液图像、MRI、X射线等医疗图像领域的数据。
  • results: 研究发现,随着近年来ML和DL技术的不断发展和应用,白细胞类型分类的精度得到了显著改善,但还存在一些挑战,如数据集的可用性和医疗人员的培训等。
    Abstract Machine learning (ML) and deep learning (DL) models have been employed to significantly improve analyses of medical imagery, with these approaches used to enhance the accuracy of prediction and classification. Model predictions and classifications assist diagnoses of various cancers and tumors. This review presents an in-depth analysis of modern techniques applied within the domain of medical image analysis for white blood cell classification. The methodologies that use blood smear images, magnetic resonance imaging (MRI), X-rays, and similar medical imaging domains are identified and discussed, with a detailed analysis of ML/DL techniques applied to the classification of white blood cells (WBCs) representing the primary focus of the review. The data utilized in this research has been extracted from a collection of 136 primary papers that were published between the years 2006 and 2023. The most widely used techniques and best-performing white blood cell classification methods are identified. While the use of ML and DL for white blood cell classification has concurrently increased and improved in recent year, significant challenges remain - 1) Availability of appropriate datasets remain the primary challenge, and may be resolved using data augmentation techniques. 2) Medical training of researchers is recommended to improve current understanding of white blood cell structure and subsequent selection of appropriate classification models. 3) Advanced DL networks including Generative Adversarial Networks, R-CNN, Fast R-CNN, and faster R-CNN will likely be increasingly employed to supplement or replace current techniques.
    摘要 机器学习(ML)和深度学习(DL)模型已被广泛用于显著改进医学影像分析,以提高预测和分类的准确性,其预测和分类结果可辅助诊断各类癌症和肿瘤。本综述深入分析了医学影像分析领域中用于白细胞分类的现代技术,梳理并讨论了基于血涂片图像、磁共振成像(MRI)、X射线等医学影像的方法,并重点详细分析了应用于白细胞(WBC)分类的ML/DL技术。本研究的数据来自2006年至2023年间发表的136篇原始论文。文中归纳了最常用的技术和表现最佳的白细胞分类方法。尽管近年来ML和DL在白细胞分类中的应用不断增加并持续改进,但仍存在一些重大挑战:1)合适数据集的获取仍是首要挑战,可通过数据增强技术加以缓解;2)建议研究人员接受医学方面的培训,以加深对白细胞结构的理解,从而更好地选择合适的分类模型;3)生成对抗网络、R-CNN、Fast R-CNN和Faster R-CNN等先进DL网络预计将被越来越多地用于补充或取代现有技术。

Learned Point Cloud Compression for Classification

  • paper_url: http://arxiv.org/abs/2308.05959
  • repo_url: https://github.com/multimedialabsfu/learned-point-cloud-compression-for-classification
  • paper_authors: Mateen Ulhaq, Ivan V. Bajić
  • for: 这个论文是为了提出一种特殊的点云编码器,用于在服务器端进行深度学习机器视觉任务的点云数据传输和处理。
  • methods: 该编码器基于PointNet,并且实现了一个高度特殊的点云编码方法,以实现更好的环境-质量负担比。
  • results: 相比于非特殊编码器,该编码器在ModelNet40数据集上可以实现94%的BD-比特率减少,同时保持高度的准确率。此外,对于低资源的终端设备,我们还提出了两种轻量级的编码器配置,可以实现相似的BD-比特率减少,同时具有较低的顶部1准确率下降和较低的编码器端kMACs/点。
    Abstract Deep learning is increasingly being used to perform machine vision tasks such as classification, object detection, and segmentation on 3D point cloud data. However, deep learning inference is computationally expensive. The limited computational capabilities of end devices thus necessitate a codec for transmitting point cloud data over the network for server-side processing. Such a codec must be lightweight and capable of achieving high compression ratios without sacrificing accuracy. Motivated by this, we present a novel point cloud codec that is highly specialized for the machine task of classification. Our codec, based on PointNet, achieves a significantly better rate-accuracy trade-off in comparison to alternative methods. In particular, it achieves a 94% reduction in BD-bitrate over non-specialized codecs on the ModelNet40 dataset. For low-resource end devices, we also propose two lightweight configurations of our encoder that achieve similar BD-bitrate reductions of 93% and 92% with 3% and 5% drops in top-1 accuracy, while consuming only 0.470 and 0.048 encoder-side kMACs/point, respectively. Our codec demonstrates the potential of specialized codecs for machine analysis of point clouds, and provides a basis for extension to more complex tasks and datasets in the future.
    摘要 深度学习正越来越多地被用于对三维点云数据执行分类、目标检测和分割等机器视觉任务。然而,深度学习推理的计算开销很大,终端设备有限的计算能力使得点云数据需要经过编解码器压缩后再通过网络传输到服务器端处理。这样的编解码器必须足够轻量,并且能在不牺牲准确率的前提下实现高压缩比。受此启发,我们提出了一种高度专用于分类任务的新型点云编解码器。我们的编解码器基于PointNet,与其他方法相比实现了显著更优的码率-准确率权衡;特别地,在ModelNet40数据集上,它相对于非专用编解码器实现了94%的BD码率降低。针对资源受限的终端设备,我们还提出了两种轻量级编码器配置,分别实现93%和92%的BD码率降低,top-1准确率仅下降3%和5%,而编码端计算量仅为每点0.470和0.048 kMACs。我们的编解码器展示了专用编解码器在点云机器分析中的潜力,并为未来扩展到更复杂的任务和数据集奠定了基础。
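
A compact sketch of a classification-oriented point cloud codec: a PointNet-style encoder produces a latent that is uniformly quantized (simulated with additive noise during training) before a classifier head. The latent size, quantization step, and the absence of a learned entropy model are simplifications of the paper's design.

```python
import torch
import torch.nn as nn

class TinyPointCodec(nn.Module):
    def __init__(self, latent_dim=64, num_classes=40):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.classifier = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, num_classes))

    def forward(self, points):                               # points: (B, N, 3)
        feat = self.point_mlp(points).max(dim=1).values      # permutation-invariant pooling
        if self.training:                                    # noise proxy for quantization
            feat_q = feat + torch.empty_like(feat).uniform_(-0.5, 0.5)
        else:
            feat_q = torch.round(feat)                       # the "transmitted" latent
        return self.classifier(feat_q)

logits = TinyPointCodec()(torch.randn(4, 1024, 3))           # -> shape (4, 40)
```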

Uncertainty-Aware Cross-Modal Transfer Network for Sketch-Based 3D Shape Retrieval

  • paper_url: http://arxiv.org/abs/2308.05948
  • repo_url: None
  • paper_authors: Yiyang Cai, Jiaming Lu, Jiewen Wang, Shuang Liang
  • for: 本研究旨在解决粗糙和噪声存在的手绘图像数据中的低质量样本问题,提高三维形状检索的精度。
  • methods: 本研究提出了一种不确定性意识的跨Modal传输网络(UACTN),它将手绘图像和三维形状的表示学分解为两个独立任务:手绘图像的分类学习和三维形状的特征传输。首先,我们提出了一种端到端的分类方法,同时学习手绘图像的特征和不确定性,以便通过不同水平的不确定性来防止噪声绘图的抖抖。然后,三维形状的特征被映射到预学习的手绘图像空间中,进行特征对齐。
  • results: 对两个标准 benchmark进行了广泛的实验和剖析研究,证明了我们提出的方法在比前STATE-OF-THE-ART方法更高的精度和稳定性。
    Abstract In recent years, sketch-based 3D shape retrieval has attracted growing attention. While many previous studies have focused on cross-modal matching between hand-drawn sketches and 3D shapes, the critical issue of how to handle low-quality and noisy samples in sketch data has been largely neglected. This paper presents an uncertainty-aware cross-modal transfer network (UACTN) that addresses this issue. UACTN decouples the representation learning of sketches and 3D shapes into two separate tasks: classification-based sketch uncertainty learning and 3D shape feature transfer. We first introduce an end-to-end classification-based approach that simultaneously learns sketch features and uncertainty, allowing uncertainty to prevent overfitting noisy sketches by assigning different levels of importance to clean and noisy sketches. Then, 3D shape features are mapped into the pre-learned sketch embedding space for feature alignment. Extensive experiments and ablation studies on two benchmarks demonstrate the superiority of our proposed method compared to state-of-the-art methods.
    摘要 近年来,基于草图的三维形状检索受到越来越多的关注。以往的研究大多集中在手绘草图与三维形状之间的跨模态匹配上,而如何处理草图数据中低质量、带噪声的样本这一关键问题在很大程度上被忽视了。本文提出了一种不确定性感知的跨模态迁移网络(UACTN)来解决这一问题。UACTN将草图和三维形状的表示学习解耦为两个独立的任务:基于分类的草图不确定性学习和三维形状特征迁移。我们首先提出了一种端到端的基于分类的方法,同时学习草图特征和不确定性,通过为干净草图和噪声草图分配不同的重要性,利用不确定性防止模型对噪声草图过拟合。然后,将三维形状特征映射到预先学习好的草图嵌入空间中进行特征对齐。在两个基准数据集上的大量实验和消融研究证明了所提方法相对于最先进方法的优越性。
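
A sketch of classification-based uncertainty learning in this spirit: the sketch branch predicts both class logits and a per-sample log-variance, and the loss is down-weighted for high-uncertainty (noisy) sketches in the style of heteroscedastic weighting. The head structure and the exact weighting form are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertainSketchHead(nn.Module):
    def __init__(self, feat_dim=512, num_classes=90):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.log_var = nn.Linear(feat_dim, 1)

    def forward(self, feat, labels):
        logits = self.cls(feat)
        log_var = self.log_var(feat).squeeze(-1)              # per-sample uncertainty
        ce = F.cross_entropy(logits, labels, reduction="none")
        loss = (torch.exp(-log_var) * ce + log_var).mean()    # noisy sketches get less weight
        return loss, log_var

loss, u = UncertainSketchHead()(torch.randn(8, 512), torch.randint(0, 90, (8,)))
```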

Generalizing Event-Based Motion Deblurring in Real-World Scenarios

  • paper_url: http://arxiv.org/abs/2308.05932
  • repo_url: https://github.com/xiangz-0/gem
  • paper_authors: Xiang Zhang, Lei Yu, Wen Yang, Jianzhuang Liu, Gui-Song Xia
  • for: 这篇论文旨在普遍化事件基 Motion deblurring 性能,并适应实际场景中的不同空间和时间尺度的运动模糊。
  • methods: 该论文提出了一种缩放意识网络,可以自适应输入空间Scale和学习不同的时间尺度运动模糊。并提出了一种两个阶段自我超视训练方案,以适应实际数据分布。
  • results: 该论文的方法可以高效地还原隐藏的图像的亮度和结构,并通过自适应学习来普遍化运动模糊处理的性能,以适应实际场景中的不同空间和时间尺度的运动模糊。
    Abstract Event-based motion deblurring has shown promising results by exploiting low-latency events. However, current approaches are limited in their practical usage, as they assume the same spatial resolution of inputs and specific blurriness distributions. This work addresses these limitations and aims to generalize the performance of event-based deblurring in real-world scenarios. We propose a scale-aware network that allows flexible input spatial scales and enables learning from different temporal scales of motion blur. A two-stage self-supervised learning scheme is then developed to fit real-world data distribution. By utilizing the relativity of blurriness, our approach efficiently ensures the restored brightness and structure of latent images and further generalizes deblurring performance to handle varying spatial and temporal scales of motion blur in a self-distillation manner. Our method is extensively evaluated, demonstrating remarkable performance, and we also introduce a real-world dataset consisting of multi-scale blurry frames and events to facilitate research in event-based deblurring.
    摘要 基于事件的运动去模糊利用低延迟的事件信息,已经显示出有前景的结果。然而,现有方法在实际使用中受到限制,因为它们假设输入具有相同的空间分辨率和特定的模糊分布。本工作针对这些限制,旨在将基于事件的去模糊性能泛化到真实场景。我们提出了一种尺度感知网络,允许灵活的输入空间尺度,并能够从不同时间尺度的运动模糊中学习;随后设计了一种两阶段自监督学习方案,以拟合真实数据的分布。通过利用模糊程度的相对性,我们的方法能够有效保证恢复出的潜在图像的亮度和结构,并以自蒸馏的方式将去模糊性能进一步泛化到不同空间和时间尺度的运动模糊。我们的方法经过了广泛评估,表现出色;我们还发布了一个由多尺度模糊帧和事件组成的真实数据集,以促进基于事件的去模糊研究。

CaPhy: Capturing Physical Properties for Animatable Human Avatars

  • paper_url: http://arxiv.org/abs/2308.05925
  • repo_url: None
  • paper_authors: Zhaoqi Su, Liangxiao Hu, Siyou Lin, Hongwen Zhang, Shengping Zhang, Justus Thies, Yebin Liu
  • for: reconstruction of animatable human avatars with realistic dynamic properties for clothing
  • methods: combination of unsupervised training with physics-based losses and 3D-supervised training using scanned data, optimization of physical parameters using gradient constraints
  • results: ability to generalize to novel poses with realistic dynamic cloth deformations, superior quantitative and qualitative results compared with previous methods
    Abstract We present CaPhy, a novel method for reconstructing animatable human avatars with realistic dynamic properties for clothing. Specifically, we aim for capturing the geometric and physical properties of the clothing from real observations. This allows us to apply novel poses to the human avatar with physically correct deformations and wrinkles of the clothing. To this end, we combine unsupervised training with physics-based losses and 3D-supervised training using scanned data to reconstruct a dynamic model of clothing that is physically realistic and conforms to the human scans. We also optimize the physical parameters of the underlying physical model from the scans by introducing gradient constraints of the physics-based losses. In contrast to previous work on 3D avatar reconstruction, our method is able to generalize to novel poses with realistic dynamic cloth deformations. Experiments on several subjects demonstrate that our method can estimate the physical properties of the garments, resulting in superior quantitative and qualitative results compared with previous methods.
    摘要 我们提出了CaPhy方法,一种新的人工智能方法,用于重建具有真实动态性质的人工人体模型。我们的目标是从实际观察中捕捉人体服装的几何和物理性质。这使得我们可以将人工人体模型应用到新的姿势上,并且通过物理正确的扭曲和皱纹来模拟人体的动态行为。为此,我们结合无监督训练、物理学损失和3D监督训练使用扫描数据来重建一个物理realistic的服装动态模型。此外,我们还通过引入物理损失的梯度约束来优化物理参数。与前一代3D人体重建方法不同,我们的方法能够泛化到新的姿势上,并且能够模拟真实的动态皮肤扭曲。我们在多个主题上进行了实验,并证明了我们的方法可以估计人体服装的物理参数,从而实现了较好的量化和质量上的效果。

BATINet: Background-Aware Text to Image Synthesis and Manipulation Network

  • paper_url: http://arxiv.org/abs/2308.05921
  • repo_url: None
  • paper_authors: Ryugo Morita, Zhiqiang Zhang, Jinjia Zhou
  • for: 本研究旨在生成文本描述的背景图像中的前景内容,以匹配输入背景图像的样式。
  • methods: 该研究提出了一种 Background-Aware Text to Image synthesis and manipulation Network (BATINet),包括两个关键组件:位置探测网络 (PDN) 和融合网络 (HN)。PDN 检测文本相关对象在背景图像中最有可能的位置,而 HN 使得生成的内容与背景样式信息融合。
  • results: 该研究通过对 CUB 数据集进行质量和量化评估,表明了该模型比其他状态 искусственный方法更高效。此外,该模型还可以应用于文本指导图像修改任务,解决最大化对象形状的修改任务。
    Abstract Background-Induced Text2Image (BIT2I) aims to generate foreground content according to the text on the given background image. Most studies focus on generating high-quality foreground content, although they ignore the relationship between the two contents. In this study, we analyzed a novel Background-Aware Text2Image (BAT2I) task in which the generated content matches the input background. We proposed a Background-Aware Text to Image synthesis and manipulation Network (BATINet), which contains two key components: Position Detect Network (PDN) and Harmonize Network (HN). The PDN detects the most plausible position of the text-relevant object in the background image. The HN harmonizes the generated content referring to background style information. Finally, we reconstructed the generation network, which consists of the multi-GAN and attention module to match more user preferences. Moreover, we can apply BATINet to text-guided image manipulation. It solves the most challenging task of manipulating the shape of an object. We demonstrated through qualitative and quantitative evaluations on the CUB dataset that the proposed model outperforms other state-of-the-art methods.
    摘要

Semantics2Hands: Transferring Hand Motion Semantics between Avatars

  • paper_url: http://arxiv.org/abs/2308.05920
  • repo_url: https://github.com/abcyzj/Semantics2Hands
  • paper_authors: Zijie Ye, Jia Jia, Junliang Xing
  • for: 这篇论文主要目标是在不同的人工手模型之间传递手势semantics。
  • methods: 作者提出了一种基于解剖学semantic matrix(ASM)的方法,通过定量地编码手势semantics,实现精准地重定向手势。然后,他们使用一种基于解剖学semantics重建网络(ASRN)来从源ASM到目标手关节弯曲的映射。
  • results: 作者通过在同频和交叉频段的手势重定向任务中评估了他们的方法,并证明了其在质量和量化上与当前状态OF THE ARTS具有显著优势。
    Abstract Human hands, the primary means of non-verbal communication, convey intricate semantics in various scenarios. Due to the high sensitivity of individuals to hand motions, even minor errors in hand motions can significantly impact the user experience. Real applications often involve multiple avatars with varying hand shapes, highlighting the importance of maintaining the intricate semantics of hand motions across the avatars. Therefore, this paper aims to transfer the hand motion semantics between diverse avatars based on their respective hand models. To address this problem, we introduce a novel anatomy-based semantic matrix (ASM) that encodes the semantics of hand motions. The ASM quantifies the positions of the palm and other joints relative to the local frame of the corresponding joint, enabling precise retargeting of hand motions. Subsequently, we obtain a mapping function from the source ASM to the target hand joint rotations by employing an anatomy-based semantics reconstruction network (ASRN). We train the ASRN using a semi-supervised learning strategy on the Mixamo and InterHand2.6M datasets. We evaluate our method in intra-domain and cross-domain hand motion retargeting tasks. The qualitative and quantitative results demonstrate the significant superiority of our ASRN over the state-of-the-arts.
    摘要 人类的手是非语言交流的主要媒介,在多种场景中传递着精细的语义。由于人们对手部动作非常敏感,即使是手部动作中的细小误差也会显著影响用户体验。实际应用中常常包含多个手形各异的虚拟形象,因此在不同形象之间保持手部动作的精细语义尤为重要。为此,本文旨在基于各自的手部模型,在不同虚拟形象之间迁移手部动作语义。为解决这一问题,我们提出了一种新的基于解剖学的语义矩阵(ASM),用于编码手部动作的语义。ASM量化了手掌及其他关节相对于对应关节局部坐标系的位置,从而能够精确地重定向手部动作。随后,我们通过基于解剖学的语义重建网络(ASRN),获得从源ASM到目标手部关节旋转的映射函数。我们在Mixamo和InterHand2.6M数据集上采用半监督学习策略训练ASRN,并在同域和跨域手部动作重定向任务中评估了我们的方法。定性和定量结果均表明,我们的ASRN显著优于现有最先进方法。

Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking

  • paper_url: http://arxiv.org/abs/2308.05911
  • repo_url: None
  • paper_authors: Yiheng Liu, Junta Wu, Yi Fu
  • for: 提高edge设备的计算、存储和功耗成本,实现高效的多目标跟踪(MOT)
  • methods: 提出了一种协同跟踪学习(ColTrack)方法,通过同一目标的多个历史查询共同跟踪该目标,并在每两个时间分块解码器之间插入信息精炼模块,以更好地融合时间线索。此外,还提出了跟踪目标一致性损失,用于引导历史查询之间的交互。
  • results: 在高帧率视频中,ColTrack比 state-of-the-art 方法在大规模 datasets Dancetrack 和 BDD100K 上表现出更高的性能,并超过了现有的端到端方法在 MOT17 上的表现。此外,ColTrack 在低帧率视频中也具有明显的优势,可以降低帧率要求,同时保持更高的性能,从而实现更快的处理速度。
    Abstract Multi-object tracking (MOT) at low frame rates can reduce computational, storage and power overhead to better meet the constraints of edge devices. Many existing MOT methods suffer from significant performance degradation in low-frame-rate videos due to significant location and appearance changes between adjacent frames. To this end, we propose to explore collaborative tracking learning (ColTrack) for frame-rate-insensitive MOT in a query-based end-to-end manner. Multiple historical queries of the same target jointly track it with richer temporal descriptions. Meanwhile, we insert an information refinement module between every two temporal blocking decoders to better fuse temporal clues and refine features. Moreover, a tracking object consistency loss is proposed to guide the interaction between historical queries. Extensive experimental results demonstrate that in high-frame-rate videos, ColTrack obtains higher performance than state-of-the-art methods on large-scale datasets Dancetrack and BDD100K, and outperforms the existing end-to-end methods on MOT17. More importantly, ColTrack has a significant advantage over state-of-the-art methods in low-frame-rate videos, which allows it to obtain faster processing speeds by reducing frame-rate requirements while maintaining higher performance. Code will be released at https://github.com/yolomax/ColTrack
    摘要 在低帧率下进行多目标跟踪(MOT)可以降低计算、存储和功耗开销,从而更好地满足边缘设备的限制。许多现有的MOT方法在低帧率视频中性能明显下降,因为相邻帧之间的位置和外观变化很大。为此,我们提出以基于查询的端到端方式探索协同跟踪学习(ColTrack),实现对帧率不敏感的MOT。同一目标的多个历史查询共同跟踪该目标,提供更丰富的时间描述。同时,我们在每两个时间分块解码器之间插入信息精炼模块,以更好地融合时间线索并精炼特征。此外,我们还提出了跟踪目标一致性损失,用于引导历史查询之间的交互。大量实验结果表明,在高帧率视频中,ColTrack在大规模数据集Dancetrack和BDD100K上取得了优于最先进方法的性能,并在MOT17上超越了现有的端到端方法。更重要的是,ColTrack在低帧率视频中相对最先进方法具有显著优势,这使其能够通过降低帧率要求来获得更快的处理速度,同时保持更高的性能。代码将在https://github.com/yolomax/ColTrack发布。

Semantic-embedded Similarity Prototype for Scene Recognition

  • paper_url: http://arxiv.org/abs/2308.05896
  • repo_url: None
  • paper_authors: Chuanxin Song, Hanbo Wu, Xin Ma
  • for: 提高Scene recognition的准确性,不增加网络参数
  • methods: 使用统计策略描述Scene中的semantic知识,并利用这些知识来构建一个相似性原型,以支持网络训练
  • results: 对多个benchmark进行了全面评估,并确认了该相似性原型可以提高现有网络的性能,无需增加计算负担
    Abstract Due to the high inter-class similarity caused by the complex composition within scenes and the co-existing objects across scenes, various studies have explored object semantic knowledge within scenes to improve scene recognition. However, a resulting issue arises as semantic segmentation or object detection techniques demand heavy computational power, thereby burdening the network considerably. This limitation often renders object-assisted approaches incompatible with edge devices. In contrast, this paper proposes a semantic-based similarity prototype that assists the scene recognition network to achieve higher accuracy without increasing network parameters. It is simple and can be plug-and-played into existing pipelines. More specifically, a statistical strategy is introduced to depict semantic knowledge in scenes as class-level semantic representations. These representations are utilized to explore inter-class correlations, ultimately constructing a similarity prototype. Furthermore, we propose two ways to use the similarity prototype to support network training from the perspective of gradient label softening and batch-level contrastive loss, respectively. Comprehensive evaluations on multiple benchmarks show that our similarity prototype enhances the performance of existing networks without adding any computational burden. Code and the statistical similarity prototype will be available soon.
    摘要 由于场景内部构成复杂、不同场景中存在共同物体,导致类间相似性很高,许多研究尝试利用场景中的物体语义知识来改进场景识别。然而,由此带来的问题是,语义分割或目标检测技术需要大量计算资源,给网络造成相当大的负担,这一限制往往使得物体辅助方法难以部署在边缘设备上。与之相对,本文提出了一种基于语义的相似性原型,帮助场景识别网络在不增加网络参数的情况下获得更高的准确率。该方法简单,可即插即用地嵌入现有流程中。更具体地,我们引入一种统计策略,将场景中的语义知识刻画为类别级语义表示,并利用这些表示探索类间相关性,最终构建出相似性原型。此外,我们分别从梯度标签软化和批级对比损失两个角度,提出了利用相似性原型辅助网络训练的两种方式。在多个基准上的全面评估表明,我们的相似性原型能够在不增加任何计算负担的情况下提升现有网络的性能。代码和统计相似性原型即将公开。
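
A small sketch of the label-softening idea: one-hot scene targets are smoothed toward semantically similar classes using a row-normalized class-level similarity matrix. The random prototype, the class count, and the mixing weight are placeholders; in the paper the prototype is built from statistics of object semantics per scene class.

```python
import torch
import torch.nn.functional as F

def soften_labels(labels, similarity, alpha=0.1):
    """labels: (B,) class ids; similarity: (C, C) non-negative with zero diagonal."""
    sim = F.normalize(similarity.clamp(min=0), p=1, dim=1)   # each row sums to 1
    one_hot = F.one_hot(labels, similarity.size(0)).float()
    return (1 - alpha) * one_hot + alpha * sim[labels]

C = 67                                        # e.g. an MIT-67-style scene class count
proto = torch.rand(C, C); proto.fill_diagonal_(0)
targets = soften_labels(torch.tensor([3, 10]), proto)
loss = -(targets * F.log_softmax(torch.randn(2, C), dim=1)).sum(dim=1).mean()
```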

Aphid Cluster Recognition and Detection in the Wild Using Deep Learning Models

  • paper_url: http://arxiv.org/abs/2308.05881
  • repo_url: None
  • paper_authors: Tianxiao Zhang, Kaidong Li, Xiangyu Chen, Cuncong Zhong, Bo Luo, Ivan Grijalva, Brian McCornack, Daniel Flippo, Ajay Sharda, Guanghui Wang
  • for: 用于检测螟蛾群落,提高化学防治效率和环境可持续性。
  • methods: 使用深度学习模型检测螟蛾群落,并对图像进行裁剪处理,生成了151,380个标注图像 patch。
  • results: 对四种 state-of-the-art 对象检测模型(VFNet、GFLV2、PAA 和 ATSS)进行实验,并证明所有模型在螟蛾数据集上具有稳定的相似性表现,并且通过合并邻近螟蛾群落和移除小 cluster 提高了性能约17%。
    Abstract Aphid infestation poses a significant threat to crop production, rural communities, and global food security. While chemical pest control is crucial for maximizing yields, applying chemicals across entire fields is both environmentally unsustainable and costly. Hence, precise localization and management of aphids are essential for targeted pesticide application. The paper primarily focuses on using deep learning models for detecting aphid clusters. We propose a novel approach for estimating infection levels by detecting aphid clusters. To facilitate this research, we have captured a large-scale dataset from sorghum fields, manually selected 5,447 images containing aphids, and annotated each individual aphid cluster within these images. To facilitate the use of machine learning models, we further process the images by cropping them into patches, resulting in a labeled dataset comprising 151,380 image patches. Then, we implemented and compared the performance of four state-of-the-art object detection models (VFNet, GFLV2, PAA, and ATSS) on the aphid dataset. Extensive experimental results show that all models yield stable similar performance in terms of average precision and recall. We then propose to merge close neighboring clusters and remove tiny clusters caused by cropping, and the performance is further boosted by around 17%. The study demonstrates the feasibility of automatically detecting and managing insects using machine learning models. The labeled dataset will be made openly available to the research community.
    摘要 蚜虫侵害对农作物生产、农村社区和全球粮食安全构成重大威胁。虽然化学防治对提高产量至关重要,但在整片田地上喷洒化学药剂既不环保又成本高昂。因此,精确定位和管理蚜虫对于定点施药至关重要。本文主要研究利用深度学习模型检测蚜虫群落,并提出了一种通过检测蚜虫群落来估计侵害程度的新方法。为开展这项研究,我们在高粱田中采集了大规模数据,人工挑选了5,447张含有蚜虫的图像,并对其中每个蚜虫群落进行了标注。为便于机器学习模型使用,我们进一步将图像裁剪成小块,得到了包含151,380个图像块的标注数据集。随后,我们实现并比较了四种最先进的目标检测模型(VFNet、GFLV2、PAA和ATSS)在该蚜虫数据集上的性能。大量实验结果表明,所有模型在平均精度和召回率方面都表现稳定且相近。我们进而提出合并邻近群落并移除由裁剪产生的细小群落,性能由此进一步提升约17%。该研究证明了利用机器学习模型自动检测和管理昆虫的可行性。标注数据集将向研究界公开。
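
A sketch of the post-processing step described above: tiny clusters introduced by patch cropping are dropped, and nearby boxes are merged into one enclosing box. The area and distance thresholds are illustrative; boxes are (x1, y1, x2, y2) in pixels.

```python
import numpy as np

def clean_clusters(boxes, min_area=100.0, merge_dist=20.0):
    boxes = [list(b) for b in boxes if (b[2] - b[0]) * (b[3] - b[1]) >= min_area]
    merged = True
    while merged and len(boxes) > 1:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                gap_x = max(0, max(a[0], b[0]) - min(a[2], b[2]))   # horizontal gap (0 if overlapping)
                gap_y = max(0, max(a[1], b[1]) - min(a[3], b[3]))   # vertical gap
                if max(gap_x, gap_y) <= merge_dist:                 # touching or nearby boxes
                    boxes[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3])]
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes

print(clean_clusters([[0, 0, 30, 30], [35, 0, 60, 30], [200, 200, 203, 203]]))
# -> [[0, 0, 60, 30]]  (nearby boxes merged, tiny cluster removed)
```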

Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention

  • paper_url: http://arxiv.org/abs/2308.05872
  • repo_url: None
  • paper_authors: Liang Shang, Yanli Liu, Zhengyang Lou, Shuxue Quan, Nagesh Adluru, Bochen Guan, William A. Sethares
  • for: 提高视觉任务中 CNN 和 ViT 的性能,增加多stage和跨Scale的交互。
  • methods: 提出了一个简单的添加注意力模块,使得不同阶段和尺度的特征图可以进行多stage和跨Scale的交互。
  • results: 实验表明,在多个下游任务中,MSCSA 可以提供明显的性能提升,具有相对较少的额外计算量和运行时间。
    Abstract Convolutional neural networks (CNNs) and vision transformers (ViTs) have achieved remarkable success in various vision tasks. However, many architectures do not consider interactions between feature maps from different stages and scales, which may limit their performance. In this work, we propose a simple add-on attention module to overcome these limitations via multi-stage and cross-scale interactions. Specifically, the proposed Multi-Stage Cross-Scale Attention (MSCSA) module takes feature maps from different stages to enable multi-stage interactions and achieves cross-scale interactions by computing self-attention at different scales based on the multi-stage feature maps. Our experiments on several downstream tasks show that MSCSA provides a significant performance boost with modest additional FLOPs and runtime.
    摘要 卷积神经网络(CNN)和视觉Transformer(ViT)在各类视觉任务中取得了非常出色的成绩。然而,许多网络架构没有考虑不同阶段和不同尺度的特征图之间的交互,这可能会限制其性能。在这个工作中,我们提出了一种简单的附加注意力模块,通过多阶段和跨尺度的交互来克服这些限制。具体来说,我们的多阶段跨尺度注意力(MSCSA)模块使用不同阶段的特征图来实现多阶段交互,并基于这些多阶段特征图在不同尺度上计算自注意力,从而实现跨尺度交互。我们在多个下游任务上的实验表明,MSCSA可以带来显著的性能提升,而额外的计算量和运行时间开销都很小。
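
A rough sketch of the multi-stage, cross-scale interaction: feature maps from different backbone stages are projected to a common width, pooled to a shared token length, concatenated, and mixed with self-attention so every stage attends to every other scale. The projection dims and shared token length are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    def __init__(self, in_dims=(256, 512, 1024), dim=256, tokens=49, heads=8):
        super().__init__()
        self.tokens = tokens
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_dims])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                                  # list of (B, Ci, Hi, Wi)
        side = int(self.tokens ** 0.5)
        seqs = []
        for f, proj in zip(feats, self.proj):
            f = F.adaptive_avg_pool2d(proj(f), side)           # (B, dim, side, side)
            seqs.append(f.flatten(2).transpose(1, 2))          # (B, tokens, dim)
        x = torch.cat(seqs, dim=1)                             # tokens from all stages/scales
        out, _ = self.attn(x, x, x)                            # cross-stage, cross-scale mixing
        return out                                             # (B, stages*tokens, dim)

feats = [torch.randn(2, c, s, s) for c, s in [(256, 56), (512, 28), (1024, 14)]]
mixed = CrossScaleAttention()(feats)
```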

The Multi-modality Cell Segmentation Challenge: Towards Universal Solutions

  • paper_url: http://arxiv.org/abs/2308.05864
  • repo_url: None
  • paper_authors: Jun Ma, Ronald Xie, Shamini Ayyadhury, Cheng Ge, Anubha Gupta, Ritu Gupta, Song Gu, Yao Zhang, Gihun Lee, Joonkee Kim, Wei Lou, Haofeng Li, Eric Upschulte, Timo Dickscheid, José Guilherme de Almeida, Yixin Wang, Lin Han, Xin Yang, Marco Labagnara, Sahand Jamal Rahi, Carly Kempster, Alice Pollitt, Leon Espinosa, Tâm Mignot, Jan Moritz Middeke, Jan-Niklas Eckardt, Wangkai Li, Zhaoyang Li, Xiaochen Cai, Bizhe Bai, Noah F. Greenwald, David Van Valen, Erin Weisbart, Beth A. Cimini, Zhuoshi Li, Chao Zuo, Oscar Brück, Gary D. Bader, Bo Wang
  • for: 本研究旨在提供一个多Modalities单元 segmentation的benchmark,以便用于生物学实验中的单元分析。
  • methods: 本研究使用Transformer基于的深度学习算法,可以在不同的显微镜像平台和组织类型中自动调整参数。
  • results: 研究发现,这种新的算法可以在多个Modalities的单元分析中提供更高的准确率和多样性。
    Abstract Cell segmentation is a critical step for quantitative single-cell analysis in microscopy images. Existing cell segmentation methods are often tailored to specific modalities or require manual interventions to specify hyperparameters in different experimental settings. Here, we present a multi-modality cell segmentation benchmark, comprising over 1500 labeled images derived from more than 50 diverse biological experiments. The top participants developed a Transformer-based deep-learning algorithm that not only exceeds existing methods, but can also be applied to diverse microscopy images across imaging platforms and tissue types without manual parameter adjustments. This benchmark and the improved algorithm offer promising avenues for more accurate and versatile cell analysis in microscopy imaging.
    摘要 细胞分割是显微图像中定量单细胞分析的关键步骤。现有的细胞分割方法往往只针对特定的成像模态,或需要人工干预来为不同的实验设置指定超参数。在这里,我们提出了一个多模态细胞分割基准,包含来自50多种不同生物实验的超过1500张标注图像。排名靠前的参赛者开发了一种基于Transformer的深度学习算法,不仅超越了现有方法,还能够在无需人工调参的情况下应用于不同成像平台和组织类型的多种显微图像。该基准和改进后的算法为显微成像中更准确、更通用的细胞分析提供了有前景的途径。

SegDA: Maximum Separable Segment Mask with Pseudo Labels for Domain Adaptive Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2308.05851
  • repo_url: None
  • paper_authors: Anant Khandelwal
  • for: 提高Unsupervised Domain Adaptation(UDA)方法的转移性能,解决目标领域 Label 缺乏问题。
  • methods: 提出SegDA模组,增强UDA方法的转移性能,通过学习最大分类separable segment表现。
  • results: 在四个UDA benchmark上实现了+2.2 mIoU、+2.0 mIoU、+5.9 mIoU、+2.6 mIoU 的提升,即GTA -> Cityscapes、Synthia -> Cityscapes、Cityscapes -> DarkZurich、Cityscapes -> ACDC。
    Abstract Unsupervised Domain Adaptation (UDA) aims to solve the problem of label scarcity of the target domain by transferring the knowledge from the label rich source domain. Usually, the source domain consists of synthetic images for which the annotation is easily obtained using the well known computer graphics techniques. However, obtaining annotation for real world images (target domain) require lot of manual annotation effort and is very time consuming because it requires per pixel annotation. To address this problem we propose SegDA module to enhance transfer performance of UDA methods by learning the maximum separable segment representation. This resolves the problem of identifying visually similar classes like pedestrian/rider, sidewalk/road etc. We leveraged Equiangular Tight Frame (ETF) classifier inspired from Neural Collapse for maximal separation between segment classes. This causes the source domain pixel representation to collapse to a single vector forming a simplex vertices which are aligned to the maximal separable ETF classifier. We use this phenomenon to propose the novel architecture for domain adaptation of segment representation for target domain. Additionally, we proposed to estimate the noise in labelling the target domain images and update the decoder for noise correction which encourages the discovery of pixels for classes not identified in pseudo labels. We have used four UDA benchmarks simulating synthetic-to-real, daytime-to-nighttime, clear-to-adverse weather scenarios. Our proposed approach outperforms +2.2 mIoU on GTA -> Cityscapes, +2.0 mIoU on Synthia -> Cityscapes, +5.9 mIoU on Cityscapes -> DarkZurich, +2.6 mIoU on Cityscapes -> ACDC.
    摘要 无监督领域自适应(UDA)旨在通过迁移标签丰富的源域知识来解决目标域标签稀缺的问题。通常,源域由合成图像组成,其标注可以借助成熟的计算机图形技术轻松获得;而真实世界图像(目标域)需要逐像素标注,人工标注工作量大且非常耗时。为解决这一问题,我们提出了SegDA模块,通过学习最大可分的分割表示来提升UDA方法的迁移性能。这解决了行人/骑行者、人行道/道路等视觉上相似类别的辨识问题。我们借鉴神经塌缩(Neural Collapse)的思想,利用等角紧框架(ETF)分类器来实现分割类别之间的最大分离。这会使源域像素表示塌缩为单一向量,构成与最大可分ETF分类器对齐的单纯形顶点。我们利用这一现象,提出了面向目标域分割表示领域自适应的新架构。此外,我们还提出估计目标域图像标注中的噪声,并更新解码器进行噪声校正,从而鼓励发现伪标签中未被识别类别的像素。我们使用了四个UDA基准,模拟合成到真实、白天到夜晚、晴朗到恶劣天气等场景。我们提出的方法在GTA -> Cityscapes上提升+2.2 mIoU,在Synthia -> Cityscapes上提升+2.0 mIoU,在Cityscapes -> DarkZurich上提升+5.9 mIoU,在Cityscapes -> ACDC上提升+2.6 mIoU。
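
A small sketch of constructing a fixed Equiangular Tight Frame classifier: K class vectors in d dimensions whose pairwise cosine similarity is exactly -1/(K-1), the maximally separated configuration referenced above. The choice of d and K here is illustrative.

```python
import torch

def etf_classifier(d=256, K=19):
    assert d >= K
    U, _ = torch.linalg.qr(torch.randn(d, K))                 # (d, K) with orthonormal columns
    M = torch.sqrt(torch.tensor(K / (K - 1.0))) * U @ (torch.eye(K) - torch.ones(K, K) / K)
    return M                                                  # columns are unit-norm class vectors

M = etf_classifier()
cos = (M.T @ M) / (M.norm(dim=0, keepdim=True).T @ M.norm(dim=0, keepdim=True))
print(cos[0, 1].item())                                       # ~ -1/(K-1) = -0.0556 for K=19
```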

Recognizing Handwritten Mathematical Expressions of Vertical Addition and Subtraction

  • paper_url: http://arxiv.org/abs/2308.05820
  • repo_url: https://github.com/danielgol/hme-vas
  • paper_authors: Daniel Rosa, Filipe R. Cordeiro, Ruan Carvalho, Everton Souza, Sergio Chevtchenko, Luiz Rodrigues, Marcelo Marinho, Thales Vieira, Valmir Macario
  • for: 这个论文的目的是提出一种新的手写数学表达识别方法,能够识别括论的添加和减法表达。
  • methods: 这个论文使用了一些现有的物体检测算法,包括YOLO v7、YOLO v8、YOLO-NAS、NanoDet和FCOS,以及一种新的识别方法,可以将 bounding box 映射到 LaTeX markup 序列中。
  • results: 这个论文的结果表明,该方法能够高效地识别手写数学表达,并且可以在不同的环境中进行扩展。
    Abstract Handwritten Mathematical Expression Recognition (HMER) is a challenging task with many educational applications. Recent methods for HMER have been developed for complex mathematical expressions in standard horizontal format. However, solutions for elementary mathematical expression, such as vertical addition and subtraction, have not been explored in the literature. This work proposes a new handwritten elementary mathematical expression dataset composed of addition and subtraction expressions in a vertical format. We also extended the MNIST dataset to generate artificial images with this structure. Furthermore, we proposed a solution for offline HMER, able to recognize vertical addition and subtraction expressions. Our analysis evaluated the object detection algorithms YOLO v7, YOLO v8, YOLO-NAS, NanoDet and FCOS for identifying the mathematical symbols. We also proposed a transcription method to map the bounding boxes from the object detection stage to a mathematical expression in the LATEX markup sequence. Results show that our approach is efficient, achieving a high expression recognition rate. The code and dataset are available at https://github.com/Danielgol/HME-VAS
    摘要 手写数学表达识别(HMER)是一项具有许多教育应用的挑战性任务。现今的HMER方法主要关注于标准水平格式中的复杂数学表达。然而,对于基本数学表达,如垂直加减表达,在 литературе中没有得到了探讨。本工作提出了一个新的手写基本数学表达数据集,包括垂直加减表达的形式。此外,我们还扩展了MNIST数据集,生成了具有这种结构的人工图像。此外,我们还提出了一种OFFLINE HMER方法,能够识别垂直加减表达。我们的分析评估了YOLO v7、YOLO v8、YOLO-NAS、NanoDet和FCOS算法来识别数学符号。此外,我们还提出了一种映射矩阵方法,将物体检测阶段中的 bounding box 映射到LATEX markup语句中的数学表达。结果表明,我们的方法高效,达到了高表达识别率。代码和数据集可以在https://github.com/Danielgol/HME-VAS上获取。

Absorption-Based, Passive Range Imaging from Hyperspectral Thermal Measurements

  • paper_url: http://arxiv.org/abs/2308.05818
  • repo_url: None
  • paper_authors: Unay Dorken Gallastegi, Hoover Rueda-Chacon, Martin J. Stevens, Vivek K Goyal
  • for: This paper is written for the purpose of developing a novel passive range imaging method based on atmospheric absorption of ambient thermal radiance, which can be used to recover range features from remote objects in natural scenes without the need for active illumination.
  • methods: The paper uses a computational approach that separates the effects of remote object material composition, temperature, and range on the spectrum of thermal radiance, and introduces a novel method that exploits atmospheric absorption to mitigate noise in low-contrast scenarios. The method jointly estimates range and intrinsic object properties by exploiting a variety of absorption lines spread over the infrared spectrum.
  • results: The paper reports that the proposed method can recover range features from 15m to 150m in long-wave infrared (8–13 $\mu$m) hyperspectral image data acquired from natural scenes with no active illumination. The results show good qualitative match to unaligned lidar data.
    Abstract Passive hyperspectral long-wave infrared measurements are remarkably informative about the surroundings, such as remote object material composition, temperature, and range; and air temperature and gas concentrations. Remote object material and temperature determine the spectrum of thermal radiance, and range, air temperature, and gas concentrations determine how this spectrum is modified by propagation to the sensor. We computationally separate these phenomena, introducing a novel passive range imaging method based on atmospheric absorption of ambient thermal radiance. Previously demonstrated passive absorption-based ranging methods assume hot and highly emitting objects. However, the temperature variation in natural scenes is usually low, making range imaging challenging. Our method benefits from explicit consideration of air emission and parametric modeling of atmospheric absorption. To mitigate noise in low-contrast scenarios, we jointly estimate range and intrinsic object properties by exploiting a variety of absorption lines spread over the infrared spectrum. Along with Monte Carlo simulations that demonstrate the importance of regularization, temperature differentials, and availability of many spectral bands, we apply this method to long-wave infrared (8--13 $\mu$m) hyperspectral image data acquired from natural scenes with no active illumination. Range features from 15m to 150m are recovered, with good qualitative match to unaligned lidar data.
    摘要 被动式高光谱长波红外测量蕴含着关于周围环境的大量信息,例如远处物体的材质组成、温度和距离,以及空气温度和气体浓度。远处物体的材质和温度决定了热辐射的光谱,而距离、空气温度和气体浓度则决定了该光谱在传播到传感器的过程中如何被改变。我们通过计算将这些现象分离开来,提出了一种基于环境热辐射大气吸收的新型被动测距成像方法。以往演示过的基于吸收的被动测距方法都假设目标是高温且高发射率的物体;然而自然场景中的温度变化通常很小,使得距离成像颇具挑战。我们的方法得益于对空气自身辐射的显式考虑以及对大气吸收的参数化建模。为缓解低对比度情形下的噪声,我们利用散布在红外光谱上的多条吸收线,对距离和物体固有属性进行联合估计。蒙特卡洛仿真说明了正则化、温差以及大量光谱波段可用性的重要性;此外,我们将该方法应用于在无主动照明条件下从自然场景采集的长波红外(8-13 $\mu$m)高光谱图像数据,恢复出了15米到150米范围内的距离特征,与未配准的激光雷达数据在定性上吻合良好。

Spintronics for image recognition : performance benchmarking via ultrafast data-driven simulations

  • paper_url: http://arxiv.org/abs/2308.05810
  • repo_url: None
  • paper_authors: Anatole Moureaux, Chloé Chopin, Laurent Jacques, Flavio Abreu Araujo
  • for: 用于图像分类
  • methods: 使用基于硬件的回声状态网络(ESN),利用基于磁涡旋的自旋转矩振荡器(STVO)来实现图像分类
  • results: 使用DD-TEA模拟STVO动态,高效地完成了图像分类任务:在MNIST数据集上达到了最先进的准确率,但在EMNIST-letters和Fashion MNIST上性能相对较低,这归因于系统架构较为简单而任务更为复杂。
    Abstract We present a demonstration of image classification using a hardware-based echo-state network (ESN) that relies on spintronic nanostructures known as vortex-based spin-torque oscillators (STVOs). Our network is realized using a single STVO multiplexed in time. To circumvent the challenges associated with repeated experimental manipulation of such a nanostructured system, we employ an ultrafast data-driven simulation framework called the data-driven Thiele equation approach (DD-TEA) to simulate the STVO dynamics. We use this approach to efficiently develop, optimize and test an STVO-based ESN for image classification using the MNIST dataset. We showcase the versatility of our solution by successfully applying it to solve classification challenges with the EMNIST-letters and Fashion MNIST datasets. Through our simulations, we determine that within a large ESN the results obtained using the STVO dynamics as an activation function are comparable to the ones obtained with other conventional nonlinear activation functions like the reLU and the sigmoid. While achieving state-of-the-art accuracy levels on the MNIST dataset, our model's performance on EMNIST-letters and Fashion MNIST is lower due to the relative simplicity of the system architecture and the increased complexity of the tasks. We expect that the DD-TEA framework will enable the exploration of more specialized neural architectures, ultimately leading to improved classification accuracy. This approach also holds promise for investigating and developing dedicated learning rules to further enhance classification performance.
    摘要 我们展示了一种使用基于硬件的回声状态网络(ESN)进行图像分类的方法,该网络依赖于一种称为基于磁涡旋的自旋转矩振荡器(STVO)的自旋电子纳米结构。我们的网络通过对单个STVO进行时间复用来实现。为了避免对这种纳米结构系统进行反复实验操作所带来的困难,我们采用一种称为数据驱动Thiele方程方法(DD-TEA)的超快数据驱动仿真框架来模拟STVO动力学。借助这一方法,我们高效地开发、优化并测试了一个基于STVO的ESN,用于MNIST数据集上的图像分类,并成功将其应用于EMNIST-letters和Fashion MNIST数据集的分类任务,展示了该方案的通用性。仿真结果表明,在大规模ESN中,以STVO动力学作为激活函数得到的结果与ReLU和sigmoid等常规非线性激活函数相当。我们的模型在MNIST数据集上达到了最先进的准确率,但由于系统架构相对简单而任务更复杂,在EMNIST-letters和Fashion MNIST上的性能较低。我们预计DD-TEA框架将有助于探索更专门的神经网络架构,从而进一步提高分类精度;该方法也有望用于研究和开发专门的学习规则,以进一步提升分类性能。
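For readers unfamiliar with reservoir computing, the sketch below shows a minimal echo-state network of the kind the abstract builds on; a generic tanh nonlinearity stands in for the simulated STVO response (the DD-TEA dynamics are not reproduced here), and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 28, 300, 10             # one image row per time step (28x28 MNIST), illustrative sizes

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_res = rng.normal(0.0, 1.0, (n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # echo-state property: spectral radius < 1

def reservoir_state(image):
    """Feed an image row-by-row through the reservoir and return its final state."""
    s = np.zeros(n_res)
    for row in image:                         # time-multiplexing: one nonlinear update per input chunk
        s = np.tanh(W_in @ row + W_res @ s)   # tanh stands in for the simulated STVO response
    return s

def fit_readout(images, labels, reg=1e-3):
    """Train only the linear readout with ridge regression, as is standard for ESNs."""
    S = np.stack([reservoir_state(im) for im in images])    # (N, n_res)
    Y = np.eye(n_out)[labels]                                # one-hot targets
    return np.linalg.solve(S.T @ S + reg * np.eye(n_res), S.T @ Y)

def predict(images, W_out):
    return np.stack([reservoir_state(im) for im in images]) @ W_out   # class scores

# Usage sketch (with arrays of shape (N, 28, 28) and integer labels):
# W_out = fit_readout(train_images, train_labels)
# accuracy = (predict(test_images, W_out).argmax(1) == test_labels).mean()
```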

Iterative Reweighted Least Squares Networks With Convergence Guarantees for Solving Inverse Imaging Problems

  • paper_url: http://arxiv.org/abs/2308.05745
  • repo_url: None
  • paper_authors: Iaroslav Koshelev, Stamatios Lefkimmiatis
  • for: 这篇论文关注基于分析型图像正则化的图像重建任务,这类正则化在某个学习到的变换域中促进稀疏和/或低秩解。
  • methods: 作者提出了一种新的优化策略,使用势函数(potential function)参数化图像正则项。该策略扩展了迭代重加权最小二乘(IRLS)方法,后者通常用于合成型 $\ell_p$ 和 $\mathcal{S}_p$ 范数正则化以及分析型 $\ell_1$ 和核范数正则化。
  • results: 作者证明了其优化算法在温和条件下线性收敛到驻点,并给出了收敛速率的上界。此外,作者将监督学习过程表述为随机双层优化问题,从训练数据中学习正则项参数;得益于优化算法的收敛保证,这种学习可以通过内存高效的隐式反向传播完成。作者还将学习到的IRLS变体实现为循环网络,并与其他现有的学习重建方法进行了比较。
    Abstract In this work we present a novel optimization strategy for image reconstruction tasks under analysis-based image regularization, which promotes sparse and/or low-rank solutions in some learned transform domain. We parameterize such regularizers using potential functions that correspond to weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms with $0 < p \le 1$. Our proposed minimization strategy extends the Iteratively Reweighted Least Squares (IRLS) method, typically used for synthesis-based $\ell_p$ and $\mathcal{S}_p$ norm and analysis-based $\ell_1$ and nuclear norm regularization. We prove that under mild conditions our minimization algorithm converges linearly to a stationary point, and we provide an upper bound for its convergence rate. Further, to select the parameters of the regularizers that deliver the best results for the problem at hand, we propose to learn them from training data by formulating the supervised learning process as a stochastic bilevel optimization problem. We show that thanks to the convergence guarantees of our proposed minimization strategy, such optimization can be successfully performed with a memory-efficient implicit back-propagation scheme. We implement our learned IRLS variants as recurrent networks and assess their performance on the challenging image reconstruction tasks of non-blind deblurring, super-resolution and demosaicking. The comparisons against other existing learned reconstruction approaches demonstrate that our overall method is very competitive and in many cases outperforms existing unrolled networks, whose number of parameters is orders of magnitude higher than in our case.
    摘要 在这项工作中,我们提出了一种新的优化策略,用于基于分析型图像正则化的图像重建任务,这类正则化在某个学习到的变换域中促进稀疏和/或低秩解。我们使用势函数来参数化这些正则项,这些势函数对应于 $\ell_p^p$ 向量拟范数和 $\mathcal{S}_p^p$ Schatten 矩阵拟范数($0 < p \le 1$)的加权扩展。我们提出的最小化策略扩展了迭代重加权最小二乘(IRLS)方法,后者通常用于合成型 $\ell_p$ 和 $\mathcal{S}_p$ 范数正则化以及分析型 $\ell_1$ 和核范数正则化。我们证明,在温和条件下,我们的最小化算法线性收敛到驻点,并给出了收敛速率的上界。此外,为了选择能在具体问题上取得最佳效果的正则项参数,我们提议将监督学习过程表述为随机双层优化问题,从训练数据中学习这些参数。得益于所提最小化策略的收敛保证,这种优化可以通过内存高效的隐式反向传播方案成功完成。我们将学习到的IRLS变体实现为循环网络,并在非盲去模糊、超分辨率和去马赛克等具有挑战性的图像重建任务上评估其性能。与其他现有的学习重建方法相比,我们的整体方法极具竞争力,并且在许多情况下优于参数量高出若干数量级的展开(unrolled)网络。
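To make the reweighting idea concrete, here is a generic IRLS loop for a synthesis-based $\ell_p$-regularized least-squares problem, $\min_x \|Ax-b\|^2 + \lambda\sum_i |x_i|^p$; it only illustrates the classical scheme the paper extends, not the learned analysis-based variant:

```python
import numpy as np

def irls_lp(A, b, lam=0.1, p=0.7, eps=1e-6, iters=50):
    """IRLS for min_x ||Ax - b||^2 + lam * sum_i |x_i|^p with 0 < p <= 1."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]               # least-squares initialization
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(iters):
        # Quadratic surrogate of |x_i|^p around the current iterate (up to scaling): w_i * x_i^2
        w = (x**2 + eps) ** (p / 2.0 - 1.0)
        x = np.linalg.solve(AtA + lam * np.diag(w), Atb)   # weighted ridge update
    return x

# Usage sketch: recover a sparse vector from noisy compressed measurements
rng = np.random.default_rng(0)
A = rng.normal(size=(80, 200))
x_true = np.zeros(200); x_true[rng.choice(200, 10, replace=False)] = rng.normal(size=10)
b = A @ x_true + 0.01 * rng.normal(size=80)
x_hat = irls_lp(A, b)
```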

PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs

  • paper_url: http://arxiv.org/abs/2308.05744
  • repo_url: https://github.com/manycore-research/PlankAssembly
  • paper_authors: Wentao Hu, Jia Zheng, Zixin Zhang, Xiaojun Yuan, Jian Yin, Zihan Zhou
  • for: 自动将来自三个正交视图的二维线描图转换为三维CAD模型
  • methods: 使用Transformer序列生成模型和形状程序
  • results: 与现有方法相比,该方法在输入含噪或不完整时表现出色
    Abstract In this paper, we develop a new method to automatically convert 2D line drawings from three orthographic views into 3D CAD models. Existing methods for this problem reconstruct 3D models by back-projecting the 2D observations into 3D space while maintaining explicit correspondence between the input and output. Such methods are sensitive to errors and noises in the input, thus often fail in practice where the input drawings created by human designers are imperfect. To overcome this difficulty, we leverage the attention mechanism in a Transformer-based sequence generation model to learn flexible mappings between the input and output. Further, we design shape programs which are suitable for generating the objects of interest to boost the reconstruction accuracy and facilitate CAD modeling applications. Experiments on a new benchmark dataset show that our method significantly outperforms existing ones when the inputs are noisy or incomplete.
    摘要 在这篇论文中,我们提出了一种新方法,用于自动将三个正交视图的二维线描图转换为三维CAD模型。现有方法通过把二维观测反投影到三维空间来重建三维模型,同时保持输入与输出之间的显式对应关系;这类方法对输入中的误差和噪声十分敏感,因此在人类设计师绘制的图纸并不完美的实际场景中常常失效。为了克服这一困难,我们利用基于Transformer的序列生成模型中的注意力机制,学习输入与输出之间灵活的映射。此外,我们设计了适合生成目标物体的形状程序,以提高重建精度并方便CAD建模应用。在一个新的基准数据集上进行的实验表明,当输入含噪或不完整时,我们的方法显著优于现有方法。

Zero Grads Ever Given: Learning Local Surrogate Losses for Non-Differentiable Graphics

  • paper_url: http://arxiv.org/abs/2308.05739
  • repo_url: None
  • paper_authors: Michael Fischer, Tobias Ritschel
  • for: 针对梯度未定义或为零的图形学优化问题
  • methods: 自动学习目标函数的神经网络代理(surrogate)、在线自监督拟合、高效采样方案
  • results: 可以优化非凸、不可微的图形学问题,例如渲染中的可见性、程序化建模中的离散参数空间以及物理驱动动画中的最优控制,并且能够扩展到更高维度的问题。
    Abstract Gradient-based optimization is now ubiquitous across graphics, but unfortunately can not be applied to problems with undefined or zero gradients. To circumvent this issue, the loss function can be manually replaced by a "surrogate" that has similar minima but is differentiable. Our proposed framework, ZeroGrads, automates this process by learning a neural approximation of the objective function, the surrogate, which in turn can be used to differentiate through arbitrary black-box graphics pipelines. We train the surrogate on an actively smoothed version of the objective and encourage locality, focusing the surrogate's capacity on what matters at the current training episode. The fitting is performed online, alongside the parameter optimization, and self-supervised, without pre-computed data or pre-trained models. As sampling the objective is expensive (it requires a full rendering or simulator run), we devise an efficient sampling scheme that allows for tractable run-times and competitive performance at little overhead. We demonstrate optimizing diverse non-convex, non-differentiable black-box problems in graphics, such as visibility in rendering, discrete parameter spaces in procedural modelling or optimal control in physics-driven animation. In contrast to more traditional algorithms, our approach scales well to higher dimensions, which we demonstrate on problems with up to 35k interlinked variables.
    摘要 基于梯度的优化如今在图形学中无处不在,但它无法应用于梯度未定义或为零的问题。为了绕过这一问题,可以手动将损失函数替换为一个具有相似极小值但可微的"代理"函数。我们提出的框架ZeroGrads将这一过程自动化:它学习目标函数的神经网络近似,即代理函数,进而可以对任意黑盒图形管线进行求导。我们在经过主动平滑的目标函数上训练代理,并鼓励其局部性,使代理的容量集中在当前训练回合真正重要的区域。拟合过程与参数优化同步在线进行,且为自监督,无需预先计算的数据或预训练模型。由于对目标函数采样代价高昂(需要完整的渲染或仿真运行),我们设计了一种高效的采样方案,在几乎不增加开销的情况下获得可行的运行时间和具有竞争力的性能。我们展示了在图形学中优化多种非凸、不可微的黑盒问题,例如渲染中的可见性、程序化建模中的离散参数空间以及物理驱动动画中的最优控制。与更传统的算法相比,我们的方法能很好地扩展到更高的维度,并在多达3.5万个相互关联变量的问题上进行了演示。
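The sketch below illustrates the general idea of fitting a local neural surrogate to a non-differentiable black-box objective and stepping on the surrogate's gradient; it is a simplified toy (a quantized 2-D loss), not the paper's smoothing scheme, sampling strategy, or architecture:

```python
import torch, torch.nn as nn

def blackbox_loss(theta):
    """Non-differentiable toy objective: a quantized distance to a target point."""
    with torch.no_grad():
        return torch.floor(10.0 * (theta - torch.tensor([0.3, -0.7])).pow(2).sum())

surrogate = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
theta = torch.zeros(2, requires_grad=True)
opt_theta = torch.optim.Adam([theta], lr=1e-2)
opt_surr = torch.optim.Adam(surrogate.parameters(), lr=1e-3)

for step in range(2000):
    # 1) Fit the surrogate locally: sample parameters near the current iterate (online, self-supervised)
    samples = theta.detach() + 0.1 * torch.randn(32, 2)
    targets = torch.stack([blackbox_loss(s) for s in samples]).unsqueeze(1)
    opt_surr.zero_grad()
    nn.functional.mse_loss(surrogate(samples), targets).backward()
    opt_surr.step()

    # 2) Update the parameters using the surrogate's gradient instead of the (unavailable) true gradient
    opt_theta.zero_grad()
    surrogate(theta.unsqueeze(0)).mean().backward()
    opt_theta.step()
```

In the paper's setting the call to `blackbox_loss` would be a full render or simulation, which is why the sampling budget and the locality of the surrogate matter.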

Follow Anything: Open-set detection, tracking, and following in real-time

  • paper_url: http://arxiv.org/abs/2308.05737
  • repo_url: https://github.com/alaamaalouf/followanything
  • paper_authors: Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M. Vogt, Robert J. Wood, Antonio Torralba, Daniela Rus
  • for: 本研究旨在开发一个能够实时检测、跟踪并追随任意目标的机器人系统,以满足工业自动化、物流与仓储、医疗和安防等领域的需求。
  • methods: 本研究使用一种开放词汇、多模态的模型,可在推理时通过文本、图像或点击查询应用于训练时未见过的类别。它利用大规模预训练模型(基础模型)提供的丰富视觉描述符,对输入图像序列进行检测和分割,并在图像帧之间跟踪检测与分割结果。
  • results: 本研究在一个真实世界的机器人系统(微型飞行器)上进行了实验,证明了 FAn 能够在实时控制回路中,在发生遮挡和目标重新出现的情况下依然稳定地跟随目标。此外,FAn 可以在配备轻量级(6-8 GB)显卡的笔记本电脑上运行,达到每秒 6-20 帧的吞吐量。为了促进快速采用、部署和扩展,我们在 GitHub 项目页面(https://github.com/alaamaalouf/FollowAnything)开源了全部代码,并附上了一个5分钟的讲解视频(https://www.youtube.com/watch?v=6Mgt3EPytrw)。
    Abstract Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed ``follow anything'' (FAn), is an open-vocabulary and multimodal model -- it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything . We also encourage the reader the watch our 5-minutes explainer video in this https://www.youtube.com/watch?v=6Mgt3EPytrw .
    摘要 检测与跟随感兴趣的目标对多种机器人应用场景至关重要,涵盖工业自动化、物流与仓储、医疗和安防等领域。在这篇论文中,我们提出了一种能够实时检测、跟踪并追随任意目标的机器人系统。我们的方法命名为"follow anything"(FAn),是一种开放词汇、多模态的模型——它不局限于训练时见过的概念,可在推理时通过文本、图像或点击查询应用于新的类别。借助大规模预训练模型(基础模型)提供的丰富视觉描述符,FAn 通过将多模态查询(文本、图像、点击)与输入图像序列进行匹配来检测和分割目标,并在图像帧之间跟踪这些检测和分割结果,同时处理遮挡和目标重新出现的情况。我们在一个真实的机器人系统(微型飞行器)上演示了 FAn,它能够在实时控制回路中无缝地跟随感兴趣的目标。FAn 可以部署在配备轻量级(6-8 GB)显卡的笔记本电脑上,达到每秒 6-20 帧的吞吐量。为了促进快速采用、部署和扩展,我们在项目网站上开源了全部代码(https://github.com/alaamaalouf/FollowAnything),并邀请读者观看我们的5分钟讲解视频(https://www.youtube.com/watch?v=6Mgt3EPytrw)。

MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

  • paper_url: http://arxiv.org/abs/2308.05736
  • repo_url: https://github.com/hustvl/maptr
  • paper_authors: Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, Xinggang Wang
  • for: 本研究旨在提供一种在线矢量化高清地图构建框架,以服务于自动驾驶系统中的规划。
  • methods: 本方法提出了一种统一的等价排列(permutation-equivalent)建模方式,即将地图元素表示为带有一组等价排列的点集,从而准确描述地图元素的形状并稳定学习过程。我们还设计了层次查询嵌入方案和层次二分图匹配,以灵活地编码结构化地图信息并学习地图元素。
  • results: 我们的方法收敛迅速,能以实时推理速度运行,并在 nuScenes 和 Argoverse2 数据集上达到了最先进的性能。丰富的定性结果显示,在复杂多样的驾驶场景中,地图构建质量稳定且可靠。
    Abstract High-definition (HD) map provides abundant and precise static environmental information of the driving scene, serving as a fundamental and indispensable component for planning in autonomous driving system. In this paper, we present \textbf{Map} \textbf{TR}ansformer, an end-to-end framework for online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, \ie, modeling map element as a point set with a group of equivalent permutations, which accurately describes the shape of map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. To speed up convergence, we further introduce auxiliary one-to-many matching and dense supervision. The proposed method well copes with various map elements with arbitrary shapes. It runs at real-time inference speed and achieves state-of-the-art performance on both nuScenes and Argoverse2 datasets. Abundant qualitative results show stable and robust map construction quality in complex and various driving scenes. Code and more demos are available at \url{https://github.com/hustvl/MapTR} for facilitating further studies and applications.
    摘要 高清(HD)地图提供了驾驶场景中丰富而精确的静态环境信息,是自动驾驶系统中规划环节的基础且不可或缺的组成部分。在这篇论文中,我们提出了 MapTRansformer,一种用于在线矢量化高清地图构建的端到端框架。我们提出了一种统一的等价排列建模方式,即将地图元素建模为带有一组等价排列的点集,从而准确描述地图元素的形状并稳定学习过程。我们设计了层次查询嵌入方案,以灵活地编码结构化地图信息,并通过层次二分图匹配来学习地图元素。为了加速收敛,我们进一步引入了辅助的一对多匹配和密集监督。所提方法能够很好地处理各种任意形状的地图元素,能以实时推理速度运行,并在 nuScenes 和 Argoverse2 数据集上取得了最先进的性能。丰富的定性结果显示,在复杂多样的驾驶场景中,地图构建质量稳定且可靠。代码和更多演示可在 \url{https://github.com/hustvl/MapTR} 获取,以便进一步的研究和应用。
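To illustrate what permutation-equivalent point-set modelling means in practice, the snippet below computes a polyline regression cost as the minimum over the two orderings (forward and reversed) that describe the same geometry; this is a simplified reading of the idea, not MapTR's full hierarchical matching:

```python
import torch

def polyline_cost(pred, gt):
    """Point-wise L1 cost between a predicted and a ground-truth polyline, taken as the
    minimum over the equivalent orderings of the ground truth. pred, gt: (N, 2) tensors."""
    equivalent_gts = [gt, torch.flip(gt, dims=[0])]          # a polyline read forwards or backwards
    costs = [torch.abs(pred - g).sum() for g in equivalent_gts]
    return torch.stack(costs).min()

def polygon_orderings(gt):
    """For closed polygons, the equivalent set also includes all cyclic shifts (and their reversals)."""
    n = gt.shape[0]
    shifts = [torch.roll(gt, s, dims=0) for s in range(n)]
    return shifts + [torch.flip(s, dims=[0]) for s in shifts]

# cost = polyline_cost(torch.randn(20, 2), torch.randn(20, 2))
```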

FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models

  • paper_url: http://arxiv.org/abs/2308.05733
  • repo_url: https://github.com/aim-uofa/FrozenRecon
  • paper_authors: Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, Feng Zhao
  • for: The paper is written for the task of 3D scene reconstruction, specifically addressing the challenge of robustly obtaining camera poses and achieving dense scene reconstruction in diverse real-world scenarios.
  • methods: The paper proposes a novel test-time optimization approach that leverages pre-trained affine-invariant depth models, such as LeReS, to ensure inter-frame consistency and achieve robust scene reconstruction. The approach involves freezing the depth predictions, rectifying them with a geometric consistency alignment module, and employing the resulting scale-consistent depth maps to obtain camera poses and reconstruct the scene.
  • results: The paper achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets, demonstrating the effectiveness of the proposed approach in handling diverse real-world scenarios and improving the robustness of 3D scene reconstruction.
    Abstract 3D scene reconstruction is a long-standing vision task. Existing approaches can be categorized into geometry-based and learning-based methods. The former leverages multi-view geometry but can face catastrophic failures due to the reliance on accurate pixel correspondence across views. The latter was proffered to mitigate these issues by learning 2D or 3D representation directly. However, without a large-scale video or 3D training data, it can hardly generalize to diverse real-world scenarios due to the presence of tens of millions or even billions of optimization parameters in the deep network. Recently, robust monocular depth estimation models trained with large-scale datasets have been proven to possess weak 3D geometry prior, but they are insufficient for reconstruction due to the unknown camera parameters, the affine-invariant property, and inter-frame inconsistency. Here, we propose a novel test-time optimization approach that can transfer the robustness of affine-invariant depth models such as LeReS to challenging diverse scenes while ensuring inter-frame consistency, with only dozens of parameters to optimize per video frame. Specifically, our approach involves freezing the pre-trained affine-invariant depth model's depth predictions, rectifying them by optimizing the unknown scale-shift values with a geometric consistency alignment module, and employing the resulting scale-consistent depth maps to robustly obtain camera poses and achieve dense scene reconstruction, even in low-texture regions. Experiments show that our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
    摘要 三维场景重建是一个由来已久的视觉任务。现有方法可分为基于几何的方法和基于学习的方法:前者利用多视图几何,但由于依赖跨视图的精确像素对应关系,可能出现灾难性失败;后者通过直接学习二维或三维表示来缓解这些问题,但在缺乏大规模视频或三维训练数据的情况下,由于深度网络中存在数千万甚至数十亿个待优化参数,很难泛化到多样化的真实场景。近来,利用大规模数据集训练的鲁棒单目深度估计模型被证明具有较弱的三维几何先验,但由于相机参数未知、仿射不变性以及帧间不一致,它们尚不足以直接用于重建。为此,我们提出了一种新的测试时优化方法,能够将LeReS等仿射不变深度模型的鲁棒性迁移到具有挑战性的多样场景,同时保证帧间一致性,且每帧视频只需优化几十个参数。具体而言,我们的方法冻结预训练仿射不变深度模型的深度预测,通过几何一致性对齐模块优化未知的尺度-平移值来校正这些预测,并利用得到的尺度一致深度图稳健地获取相机位姿并实现密集场景重建,即使在低纹理区域也能工作。实验表明,我们的方法在五个零样本测试数据集上实现了最先进的跨数据集重建效果。
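The core of the scale-shift ambiguity the abstract mentions can be seen in a tiny least-squares alignment: an affine-invariant depth prediction d is only defined up to d ↦ s·d + t, and a per-frame (s, t) can be recovered from a handful of reference depths. This is only a schematic of the ambiguity, not the paper's geometric consistency alignment module:

```python
import numpy as np

def align_scale_shift(pred_depth, ref_depth, mask):
    """Closed-form least squares for s, t minimizing || s * pred + t - ref ||^2 over valid pixels."""
    d = pred_depth[mask].ravel()
    r = ref_depth[mask].ravel()
    A = np.stack([d, np.ones_like(d)], axis=1)      # [d, 1] design matrix
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s, t

# Usage sketch: pred_depth from a frozen affine-invariant model, ref_depth from a few sparse anchors.
# s, t = align_scale_shift(pred_depth, ref_depth, ref_depth > 0)
# metric_depth = s * pred_depth + t
```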

Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction

  • paper_url: http://arxiv.org/abs/2308.05721
  • repo_url: https://github.com/yangyangxu0/demtg
  • paper_authors: Yangyang Xu, Yibo Yang, Bernard Ghanem, Lefei Zhang, Du Bo, Dacheng Tao
  • for: 这个研究旨在开发一个新的多任务学习(Multi-task learning,MTL)模型,可以结合具有弹性和对话的对称卷积(Deformable CNN)和问题基于对称(Query-based Transformer)的优点,以提高MTL的效能。
  • methods: 这个模型使用了一个简单且有效的encoder-decoder架构,组合了对称和注意力机制,具有弹性和全面的特征,可以对多个任务进行精确的预测。
  • results: 实验结果显示,提案的DeMTG模型比现有的Transformer-based和CNN-based竞争模型在多个度量上表现更好,并且需要更少的GFLOPs。 codes和模型可以在https://github.com/yangyangxu0/DeMTG上下载。
    Abstract CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL). Most of the current studies on MTL solely rely on CNN or Transformer. In this work, we present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and advantages of lower cost, less complexity and smaller parameters than the traditional MTL methods. We introduce deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture up-to-date that incorporates the convolution and attention mechanism in a unified network for MTL. It is exquisitely designed to use advantages of each block, and provide deformable and comprehensive features for all tasks from local and global perspective. First, the deformable mixer encoder contains two types of operators: the channel-aware mixing operator leveraged to allow communication among different channels, and the spatial-aware deformable operator with deformable convolution applied to efficiently sample more informative spatial locations. Second, the task-aware gating transformer decoder is used to perform the task-specific predictions, in which task interaction block integrated with self-attention is applied to capture task interaction features, and the task query block integrated with gating attention is leveraged to select corresponding task-specific features. Further, the experiment results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
    摘要 CNN 和 Transformer 各有优势,并且都已广泛用于多任务学习(MTL)中的密集预测。目前大多数 MTL 研究仅依赖 CNN 或 Transformer 其中之一。在这项工作中,我们提出了一种新的 MTL 模型,将可变形 CNN 与基于查询的 Transformer 的优点结合起来,并配以共享门控,用于密集预测的多任务学习。这一组合凭借其强大而灵活的任务特定学习能力,以及比传统 MTL 方法更低的成本、更低的复杂度和更少的参数,可能提供一种简单而高效的解决方案。我们提出了带门控的可变形混合 Transformer(DeMTG),这是一种简单而有效的编码器-解码器架构,在统一网络中结合卷积与注意力机制来实现 MTL。它经过精心设计以发挥每个模块的优势,从局部和全局视角为所有任务提供可变形且全面的特征。首先,可变形混合编码器包含两类算子:通道感知混合算子,用于实现不同通道之间的信息交流;以及空间感知可变形算子,借助可变形卷积高效地采样更有信息量的空间位置。其次,任务感知门控 Transformer 解码器用于执行任务特定的预测,其中结合自注意力的任务交互模块用于捕捉任务间交互特征,结合门控注意力的任务查询模块用于选择相应的任务特定特征。此外,实验结果表明,所提出的 DeMTG 使用更少的 GFLOPs,并在三个密集预测数据集的多项指标上显著优于当前基于 Transformer 和基于 CNN 的竞争模型。我们的代码和模型可在 https://github.com/yangyangxu0/DeMTG 获取。

Temporally-Adaptive Models for Efficient Video Understanding

  • paper_url: http://arxiv.org/abs/2308.05787
  • repo_url: https://github.com/alibaba-mmai-research/TAdaConv
  • paper_authors: Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Yingya Zhang, Ziwei Liu, Marcelo H. Ang Jr
  • for: 这个研究旨在提高视频理解模型中的时间模型化能力,以提高视频理解的精度和效率。
  • methods: 该研究提出了时序自适应卷积(Temporally-Adaptive Convolutions,TAdaConv),它根据每帧的局部和全局时间上下文对卷积权重进行自适应校准,以便更好地建模视频中复杂的时间动态。TAdaConv 使空间卷积获得了时间建模能力,从而提升了模型的表现。
  • results: 根据实验结果,TAdaConvNeXtV2和TAdaFormer在不同的视频理解benchmark中与现有的卷积和Transformer-based模型竞争性地表现,并且在一些任务上具有更高的精度和效率。
    Abstract Spatial convolutions are extensively used in numerous deep video models. It fundamentally assumes spatio-temporal invariance, i.e., using shared weights for every location in different frames. This work presents Temporally-Adaptive Convolutions (TAdaConv) for video understanding, which shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modeling complex temporal dynamics in videos. Specifically, TAdaConv empowers spatial convolutions with temporal modeling abilities by calibrating the convolution weights for each frame according to its local and global temporal context. Compared to existing operations for temporal modeling, TAdaConv is more efficient as it operates over the convolution kernels instead of the features, whose dimension is an order of magnitude smaller than the spatial resolutions. Further, kernel calibration brings an increased model capacity. Based on this readily plug-in operation TAdaConv as well as its extension, i.e., TAdaConvV2, we construct TAdaBlocks to empower ConvNeXt and Vision Transformer to have strong temporal modeling capabilities. Empirical results show TAdaConvNeXtV2 and TAdaFormer perform competitively against state-of-the-art convolutional and Transformer-based models in various video understanding benchmarks. Our codes and models are released at: https://github.com/alibaba-mmai-research/TAdaConv.
    摘要 空间卷积被广泛应用于各类深度视频模型中。它本质上假设时空不变性,即对不同帧中的每个位置使用共享权重。本工作提出了用于视频理解的时序自适应卷积(TAdaConv),表明沿时间维度进行自适应权重校准是建模视频中复杂时间动态的高效方式。具体而言,TAdaConv 根据每帧的局部和全局时间上下文对卷积权重进行校准,从而赋予空间卷积时间建模能力。与现有的时间建模操作相比,TAdaConv 更高效,因为它作用于卷积核而不是特征,而卷积核的维度比空间分辨率小一个数量级;此外,核校准还带来了模型容量的提升。基于这一即插即用的操作 TAdaConv 及其扩展 TAdaConvV2,我们构建了 TAdaBlocks,赋予 ConvNeXt 和 Vision Transformer 强大的时间建模能力。实验结果表明,TAdaConvNeXtV2 和 TAdaFormer 在多个视频理解基准上与最先进的卷积模型和基于 Transformer 的模型表现相当甚至更优。我们的代码和模型发布于:https://github.com/alibaba-mmai-research/TAdaConv。
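A much-simplified sketch of the weight-calibration idea: per-frame channel-wise factors are predicted from the temporal context of frame descriptors and multiplied into a shared base kernel before each frame is convolved. This is an illustrative reduction, not the released TAdaConv implementation:

```python
import torch, torch.nn as nn, torch.nn.functional as F

class TinyTAdaConv2d(nn.Module):
    """Shared spatial kernel whose weights are rescaled per frame from temporal context (simplified)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.base = nn.Conv2d(c_in, c_out, k, padding=k // 2)
        # predicts a per-frame, per-output-channel calibration factor from neighbouring frame descriptors
        self.calib = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)

    def forward(self, x):                        # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        desc = x.mean(dim=(3, 4))                # (B, C, T): global spatial descriptor per frame
        alpha = 1.0 + self.calib(desc)           # (B, C_out, T): calibration around the shared weights
        outs = []
        for t in range(T):                       # per-frame convolution with calibrated weights
            # Averaging the calibration over the batch keeps this to a single plain conv2d call;
            # the real operator applies per-sample calibrated weights instead.
            w = self.base.weight * alpha[:, :, t].mean(0).view(-1, 1, 1, 1)
            outs.append(F.conv2d(x[:, :, t], w, self.base.bias, padding=self.base.padding))
        return torch.stack(outs, dim=2)          # (B, C_out, T, H, W)

# y = TinyTAdaConv2d(16, 32)(torch.randn(2, 16, 8, 56, 56))
```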

Spatial Pathomics Toolkit for Quantitative Analysis of Podocyte Nuclei with Histology and Spatial Transcriptomics Data in Renal Pathology

  • paper_url: http://arxiv.org/abs/2308.06288
  • repo_url: https://github.com/hrlblab/spatial_pathomics
  • paper_authors: Jiayuan Chen, Yu Wang, Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Yilin Liu, Jianyong Zhong, Agnes B. Fogo, Haichun Yang, Shilin Zhao, Yuankai Huo
  • for: 这篇论文旨在提出一种新的工具包,用于在肾脏病理学中全面定量评估足细胞(podocyte)细胞核的特征。
  • methods: 这个工具包包括三个主要组成部分:1)实例对象分割,用于准确识别足细胞细胞核;2)病理组学(pathomics)特征生成,从识别出的细胞核中提取一系列定量特征;3)稳健的统计分析,用于全面探索形态特征与空间转录组特征之间的空间关系。
  • results: 该工具包成功提取并分析了足细胞细胞核的形态和纹理特征,并通过统计分析揭示了大量足细胞形态组学特征。此外,该工具还能够揭示足细胞分布所蕴含的空间信息,为肾小球损伤相关的空间模式研究提供了新的视角。
    Abstract Podocytes, specialized epithelial cells that envelop the glomerular capillaries, play a pivotal role in maintaining renal health. The current description and quantification of features on pathology slides are limited, prompting the need for innovative solutions to comprehensively assess diverse phenotypic attributes within Whole Slide Images (WSIs). In particular, understanding the morphological characteristics of podocytes, terminally differentiated glomerular epithelial cells, is crucial for studying glomerular injury. This paper introduces the Spatial Pathomics Toolkit (SPT) and applies it to podocyte pathomics. The SPT consists of three main components: (1) instance object segmentation, enabling precise identification of podocyte nuclei; (2) pathomics feature generation, extracting a comprehensive array of quantitative features from the identified nuclei; and (3) robust statistical analyses, facilitating a comprehensive exploration of spatial relationships between morphological and spatial transcriptomics features.The SPT successfully extracted and analyzed morphological and textural features from podocyte nuclei, revealing a multitude of podocyte morphomic features through statistical analysis. Additionally, we demonstrated the SPT's ability to unravel spatial information inherent to podocyte distribution, shedding light on spatial patterns associated with glomerular injury. By disseminating the SPT, our goal is to provide the research community with a powerful and user-friendly resource that advances cellular spatial pathomics in renal pathology. The implementation and its complete source code of the toolkit are made openly accessible at https://github.com/hrlblab/spatial_pathomics.
    摘要 足细胞(podocyte)是包绕肾小球毛细血管的特化上皮细胞,在维持肾脏健康中起着关键作用。目前对病理切片上相关特征的描述和量化仍然有限,因此需要创新的方案来全面评估全切片图像(WSI)中多样的表型属性。尤其是,理解足细胞(终末分化的肾小球上皮细胞)的形态特征,对研究肾小球损伤至关重要。本文介绍了空间病理组学工具包(Spatial Pathomics Toolkit,SPT),并将其应用于足细胞病理组学。SPT 包括三个主要组成部分:(1)实例对象分割,可以准确识别足细胞细胞核;(2)病理组学特征生成,从识别出的细胞核中提取一系列定量特征;(3)稳健的统计分析,便于全面探索形态特征与空间转录组特征之间的空间关系。SPT 成功地从足细胞细胞核中提取并分析了形态和纹理特征,通过统计分析揭示了大量足细胞形态组学特征。此外,我们还展示了 SPT 能够揭示足细胞分布所蕴含的空间信息,阐明与肾小球损伤相关的空间模式。通过发布 SPT,我们的目标是为研究社区提供一个强大且易用的资源,推动肾脏病理学中细胞空间病理组学的发展。该工具包的实现及其完整源代码已公开于 https://github.com/hrlblab/spatial_pathomics。

Shadow Datasets, New challenging datasets for Causal Representation Learning

  • paper_url: http://arxiv.org/abs/2308.05707
  • repo_url: https://github.com/Jiagengzhu/Shadow-dataset-for-crl
  • paper_authors: Jiageng Zhu, Hanchen Xie, Jianhua Wu, Jiazhi Li, Mahyar Khayatkhoei, Mohamed E. Hussein, Wael AbdAlmageed
  • For: 本研究旨在探讨语义因素之间的因果关系,并为弱监督因果表示学习(CRL)方法提供更具挑战性的评估数据集。
  • Methods: 研究采用弱监督的CRL方法(包括基于生成对抗网络(GANs)的方法),并在四个现有数据集(Pendulum、Flow、CelebA(BEARD)和CelebA(SMILE))上进行评估。
  • Results: 研究人员提出了两个新数据集,它们具有更多样化的生成因素和更复杂的因果图,可更全面地评估CRL方法;此外,他们还修改了现有真实数据集(CelebA(BEARD)和CelebA(SMILE))原先提出的因果图,使其与数据集分布更加一致。
    Abstract Discovering causal relations among semantic factors is an emergent topic in representation learning. Most causal representation learning (CRL) methods are fully supervised, which is impractical due to costly labeling. To resolve this restriction, weakly supervised CRL methods were introduced. To evaluate CRL performance, four existing datasets, Pendulum, Flow, CelebA(BEARD) and CelebA(SMILE), are utilized. However, existing CRL datasets are limited to simple graphs with few generative factors. Thus we propose two new datasets with a larger number of diverse generative factors and more sophisticated causal graphs. In addition, current real datasets, CelebA(BEARD) and CelebA(SMILE), the originally proposed causal graphs are not aligned with the dataset distributions. Thus, we propose modifications to them.
    摘要 发现语义因素之间的因果关系是表示学习中的一个新兴课题。大多数因果表示学习(CRL)方法是完全监督的,而标注成本高昂,使其并不实用。为了解决这一限制,研究者提出了弱监督的CRL方法。为评估CRL性能,现有工作使用了四个数据集:Pendulum、Flow、CelebA(BEARD)和CelebA(SMILE)。然而,现有的CRL数据集仅限于生成因素较少的简单因果图。因此,我们提出了两个新数据集,具有更多样化的生成因素和更复杂的因果图。此外,对于现有的真实数据集CelebA(BEARD)和CelebA(SMILE),其原先提出的因果图与数据集分布并不一致,我们也对其进行了修改。

Masked Diffusion as Self-supervised Representation Learner

  • paper_url: http://arxiv.org/abs/2308.05695
  • repo_url: None
  • paper_authors: Zixuan Pan, Jianxu Chen, Yiyu Shi
  • for: 这个论文是用于探讨Diffusion模型在生成和表示学习中的关系,以及如何使用Masking机制来提高Diffusion模型的表示学习能力。
  • methods: 该论文使用扩散模型,并将传统的加性高斯噪声替换为掩码(masking)机制来进行自监督表示学习。
  • results: 该论文在医疗和自然图像语义分割任务中取得了优秀的表现,尤其是在少样本场景下。
    Abstract Denoising diffusion probabilistic models have recently demonstrated state-of-the-art generative performance and been used as strong pixel-level representation learners. This paper decomposes the interrelation between the generative capability and representation learning ability inherent in diffusion models. We present masked diffusion model (MDM), a scalable self-supervised representation learner that substitutes the conventional additive Gaussian noise of traditional diffusion with a masking mechanism. Our proposed approach convincingly surpasses prior benchmarks, demonstrating remarkable advancements in both medical and natural image semantic segmentation tasks, particularly within the context of few-shot scenario.
    摘要 去噪扩散概率模型近来展示了最先进的生成性能,并被用作强大的像素级表示学习器。本文剖析了扩散模型内在的生成能力与表示学习能力之间的相互关系。我们提出了掩码扩散模型(MDM),一种可扩展的自监督表示学习器,它用掩码机制取代了传统扩散中的加性高斯噪声。我们提出的方法令人信服地超越了此前的基准,在医学和自然图像的语义分割任务中取得了显著进步,尤其是在少样本场景下。
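The contrast the abstract draws, masking instead of additive Gaussian noise as the corruption used for self-supervised denoising, can be summarized in two small corruption functions; the schedule, patch size, and the U-Net denoiser are omitted and purely illustrative:

```python
import torch

def gaussian_corrupt(x, alpha_bar_t: float):
    """Standard DDPM-style corruption: x_t = sqrt(a)*x_0 + sqrt(1-a)*noise."""
    noise = torch.randn_like(x)
    return (alpha_bar_t ** 0.5) * x + ((1 - alpha_bar_t) ** 0.5) * noise

def mask_corrupt(x, mask_ratio=0.5, patch=8):
    """Masking-based corruption: zero out a random subset of non-overlapping patches.
    Assumes H and W are divisible by the patch size."""
    B, C, H, W = x.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, gh, gw, device=x.device) > mask_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, mask

# In either case a denoiser is trained to recover x from the corrupted input, and its intermediate
# features can then be reused as representations for downstream segmentation.
```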

Leverage Weakly Annotation to Pixel-wise Annotation via Zero-shot Segment Anything Model for Molecular-empowered Learning

  • paper_url: http://arxiv.org/abs/2308.05785
  • repo_url: None
  • paper_authors: Xueyuan Li, Ruining Deng, Yucheng Tang, Shunxing Bao, Haichun Yang, Yuankai Huo
  • for: 这个研究旨在降低训练病理图像分割模型所需的像素级标注成本,使非专家标注者也能在十亿像素全切片影像(Giga-pixel Whole Slide Imaging,WSI)中精确识别多种细胞类型。
  • methods: 这个研究采用分子影像辅助学习(molecular-empowered learning),并以零样本方式使用 Segment Anything Model(SAM)从弱框标注生成像素级标注,再用这些 SAM 生成的标签训练分割模型。
  • results: 研究结果显示,所提出的 SAM 辅助分子影像学习(SAM-L)只需非专家标注者提供弱框标注,即可显著降低标注负担,且不会损害标注精度或基于深度学习的分割性能。
    Abstract Precise identification of multiple cell classes in high-resolution Giga-pixel whole slide imaging (WSI) is critical for various clinical scenarios. Building an AI model for this purpose typically requires pixel-level annotations, which are often unscalable and must be done by skilled domain experts (e.g., pathologists). However, these annotations can be prone to errors, especially when distinguishing between intricate cell types (e.g., podocytes and mesangial cells) using only visual inspection. Interestingly, a recent study showed that lay annotators, when using extra immunofluorescence (IF) images for reference (referred to as molecular-empowered learning), can sometimes outperform domain experts in labeling. Despite this, the resource-intensive task of manual delineation remains a necessity during the annotation process. In this paper, we explore the potential of bypassing pixel-level delineation by employing the recent segment anything model (SAM) on weak box annotation in a zero-shot learning approach. Specifically, we harness SAM's ability to produce pixel-level annotations from box annotations and utilize these SAM-generated labels to train a segmentation model. Our findings show that the proposed SAM-assisted molecular-empowered learning (SAM-L) can diminish the labeling efforts for lay annotators by only requiring weak box annotations. This is achieved without compromising annotation accuracy or the performance of the deep learning-based segmentation. This research represents a significant advancement in democratizing the annotation process for training pathological image segmentation, relying solely on non-expert annotators.
    摘要 在高分辨率十亿像素全切片图像(WSI)中精确识别多种细胞类型,对多种临床场景至关重要。为此构建 AI 模型通常需要像素级标注,而这类标注往往难以规模化,且必须由有经验的领域专家(如病理学家)完成。然而,这些标注也容易出错,尤其是在仅凭视觉检查区分复杂细胞类型(如足细胞和系膜细胞)时。有趣的是,近期的一项研究表明,非专业标注者在借助额外的免疫荧光(IF)图像作为参考时(称为分子影像辅助学习),有时在标注上能胜过领域专家。尽管如此,标注过程中耗费大量资源的手动勾画仍然不可或缺。在本文中,我们探索利用最新的 Segment Anything Model(SAM),以零样本学习的方式从弱框标注出发,绕过像素级勾画的可能性。具体而言,我们利用 SAM 从框标注生成像素级标注的能力,并使用这些 SAM 生成的标签来训练分割模型。我们的研究结果表明,所提出的 SAM 辅助分子影像学习(SAM-L)只需非专业标注者提供弱框标注,即可降低其标注负担,且不会损害标注精度或基于深度学习的分割性能。这项研究在仅依赖非专家标注者训练病理图像分割模型方面迈出了重要一步,有助于标注过程的普及。
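For readers who want to try the box-to-mask step, a minimal zero-shot usage of the public segment-anything package looks roughly like the following; the checkpoint file name, model size, and box coordinates are placeholders, and how the resulting masks are exported as training labels is left to the surrounding pipeline:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Assumed local checkpoint; any of the released ViT-B/L/H weights can be used.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def boxes_to_masks(image_rgb: np.ndarray, boxes_xyxy: np.ndarray) -> np.ndarray:
    """Turn weak box annotations into pixel-level masks with zero-shot SAM prompts."""
    predictor.set_image(image_rgb)                        # HxWx3 uint8 RGB patch from the WSI
    masks = []
    for box in boxes_xyxy:                                # one lay-annotator box per object
        m, scores, _ = predictor.predict(box=box, multimask_output=False)
        masks.append(m[0])                                # (H, W) boolean mask
    return np.stack(masks)

# pseudo_labels = boxes_to_masks(patch, np.array([[x0, y0, x1, y1]]))
# These masks then supervise an ordinary segmentation network in place of expert pixel-level delineation.
```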

High-performance Data Management for Whole Slide Image Analysis in Digital Pathology

  • paper_url: http://arxiv.org/abs/2308.05784
  • repo_url: https://github.com/hrlblab/adios
  • paper_authors: Haoju Leng, Ruining Deng, Shunxing Bao, Dazheng Fang, Bryan A. Millis, Yucheng Tang, Haichun Yang, Xiao Wang, Yifan Peng, Lipeng Wan, Yuankai Huo
  • For: The paper is written to address the computational bottleneck in the input-output (I/O) system when deploying image analysis algorithms on whole-slide images (WSIs).
  • Methods: The paper proposes the use of the Adaptable IO System version 2 (ADIOS2) to streamline data management across WSIs and reduce data retrieval times.
  • Results: The paper shows that ADIOS2 achieves a two-fold speed-up compared to the brute-force approach in a CPU-based image analysis scenario, and its performance is on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS), in a GPU-based deep learning framework scenario.
    Abstract When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of data records holds relevance during each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system. This is particularly notable as patch-level processing introduces a considerable I/O load onto the computer system. However, this data management process could be further paralleled, given the typical independence of patch-level image processes across different patches. This paper details our endeavors in tackling this data access challenge by implementing the Adaptable IO System version 2 (ADIOS2). Our focus has been constructing and releasing a digital pathology-centric pipeline using ADIOS2, which facilitates streamlined data management across WSIs. Additionally, we've developed strategies aimed at curtailing data retrieval times. The performance evaluation encompasses two key scenarios: (1) a pure CPU-based image analysis scenario ("CPU scenario"), and (2) a GPU-based deep learning framework scenario ("GPU scenario"). Our findings reveal noteworthy outcomes. Under the CPU scenario, ADIOS2 showcases an impressive two-fold speed-up compared to the brute-force approach. In the GPU scenario, its performance stands on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS). From what we know, this appears to be among the initial instances, if any, of utilizing ADIOS2 within the field of digital pathology. The source code has been made publicly available at https://github.com/hrlblab/adios.
    摘要 在处理十亿像素的数字病理全切片成像(WSI)时,每次分析操作都会涉及相当比例的数据记录。例如,在对全切片图像运行图像分析算法时,计算瓶颈往往出现在输入输出(I/O)系统中;块级(patch-level)处理尤其会给计算机系统带来可观的 I/O 负载。不过,由于不同图像块的块级处理通常相互独立,这一数据管理过程可以进一步并行化。本文介绍了我们利用 Adaptable IO System version 2(ADIOS2)应对这一数据访问挑战的工作。我们的重点是构建并发布一个基于 ADIOS2、面向数字病理的数据管线,以便在多个 WSI 之间进行流畅的数据管理;此外,我们还制定了缩短数据检索时间的策略。性能评估涵盖两个关键场景:(1)纯 CPU 图像分析场景("CPU 场景"),和(2)基于 GPU 的深度学习框架场景("GPU 场景")。结果颇为可观:在 CPU 场景下,ADIOS2 相比暴力读取方式获得了约两倍的加速;在 GPU 场景下,其性能与最先进的 GPU I/O 加速框架 NVIDIA Magnum IO GPU Direct Storage(GDS)相当。据我们所知,这可能是 ADIOS2 在数字病理领域的最早应用之一。源代码已公开于 https://github.com/hrlblab/adios。

Multi-scale Multi-site Renal Microvascular Structures Segmentation for Whole Slide Imaging in Renal Pathology

  • paper_url: http://arxiv.org/abs/2308.05782
  • repo_url: None
  • paper_authors: Franklin Hu, Ruining Deng, Shunxing Bao, Haichun Yang, Yuankai Huo
  • for: This paper is written for renal pathologists who need a computational tool for the quantitative analysis of renal microvascular structures.
  • methods: The paper uses a novel single dynamic network method called Omni-Seg, which capitalizes on multi-site, multi-scale training data and utilizes partially labeled images to segment microvascular structures.
  • results: The experimental results indicate that Omni-Seg outperforms other methods in terms of both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU).
    Abstract Segmentation of microvascular structures, such as arterioles, venules, and capillaries, from human kidney whole slide images (WSI) has become a focal point in renal pathology. Current manual segmentation techniques are time-consuming and not feasible for large-scale digital pathology images. While deep learning-based methods offer a solution for automatic segmentation, most suffer from a limitation: they are designed for and restricted to training on single-site, single-scale data. In this paper, we present Omni-Seg, a novel single dynamic network method that capitalizes on multi-site, multi-scale training data. Unique to our approach, we utilize partially labeled images, where only one tissue type is labeled per training image, to segment microvascular structures. We train a singular deep network using images from two datasets, HuBMAP and NEPTUNE, across different magnifications (40x, 20x, 10x, and 5x). Experimental results indicate that Omni-Seg outperforms in terms of both the Dice Similarity Coefficient (DSC) and Intersection over Union (IoU). Our proposed method provides renal pathologists with a powerful computational tool for the quantitative analysis of renal microvascular structures.
    摘要 从人体肾脏全切片图像(WSI)中分割微血管结构(如小动脉、小静脉和毛细血管)已成为肾脏病理学的一个焦点。目前的手动分割技术耗时费力,难以应用于大规模数字病理图像。虽然基于深度学习的方法可以实现自动分割,但大多数方法存在一个局限:它们只针对并受限于单一站点、单一尺度的数据进行训练。在本文中,我们提出了 Omni-Seg,一种新颖的单一动态网络方法,充分利用多站点、多尺度的训练数据。我们方法的独特之处在于利用部分标注的图像(每张训练图像只标注一种组织类型)来分割微血管结构。我们使用来自 HuBMAP 和 NEPTUNE 两个数据集、不同放大倍率(40x、20x、10x 和 5x)的图像训练单一的深度网络。实验结果表明,Omni-Seg 在 Dice 相似系数(DSC)和交并比(IoU)两项指标上均优于其他方法。我们提出的方法为肾脏病理学家提供了一个强大的计算工具,用于肾脏微血管结构的定量分析。

2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds

  • paper_url: http://arxiv.org/abs/2308.05667
  • repo_url: None
  • paper_authors: Minhao Li, Zheng Qin, Zhirui Gao, Renjiao Yi, Chenyang Zhu, Yulan Guo, Kai Xu
  • for: 图像与点云之间准确且鲁棒的配准问题(跨模态配准,cross-modality registration)
  • methods: 提出了一种无检测器(detection-free)的方法 2D3D-MATR:它首先在下采样后的图像块与点云块之间计算粗略对应,再将其扩展为块内像素与点之间的密集对应。粗略层级的块匹配基于 Transformer,通过自注意力学习全局上下文约束,通过交叉注意力学习跨模态相关性。
  • results: 在两个公开基准上的大量实验表明,2D3D-MATR 在内点率(inlier ratio)上比此前最先进的 P2-Net 提高约20个百分点,在配准召回率上提高超过10个百分点。
    Abstract The commonly adopted detect-then-match approach to registration finds difficulties in the cross-modality cases due to the incompatible keypoint detection and inconsistent feature description. We propose, 2D3D-MATR, a detection-free method for accurate and robust registration between images and point clouds. Our method adopts a coarse-to-fine pipeline where it first computes coarse correspondences between downsampled patches of the input image and the point cloud and then extends them to form dense correspondences between pixels and points within the patch region. The coarse-level patch matching is based on transformer which jointly learns global contextual constraints with self-attention and cross-modality correlations with cross-attention. To resolve the scale ambiguity in patch matching, we construct a multi-scale pyramid for each image patch and learn to find for each point patch the best matching image patch at a proper resolution level. Extensive experiments on two public benchmarks demonstrate that 2D3D-MATR outperforms the previous state-of-the-art P2-Net by around $20$ percentage points on inlier ratio and over $10$ points on registration recall. Our code and models are available at https://github.com/minhaolee/2D3DMATR.
    摘要 常用的"先检测再匹配"的配准方案在跨模态情形下会遇到困难,原因在于关键点检测不兼容、特征描述不一致。我们提出了 2D3D-MATR,一种无检测器的方法,用于图像与点云之间准确且鲁棒的配准。我们的方法采用由粗到细的流程:首先在下采样后的图像块与点云块之间计算粗略对应,然后将其扩展为块区域内像素与点之间的密集对应。粗略层级的块匹配基于 Transformer,它通过自注意力联合学习全局上下文约束,并通过交叉注意力学习跨模态相关性。为了解决块匹配中的尺度歧义,我们为每个图像块构建多尺度金字塔,并学习为每个点云块在合适的分辨率层级上找到最佳匹配的图像块。在两个公开基准上的大量实验表明,2D3D-MATR 在内点率上比此前最先进的 P2-Net 提高约20个百分点,在配准召回率上提高超过10个百分点。我们的代码和模型可在 https://github.com/minhaolee/2D3DMATR 获取。
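As a minimal illustration of the coarse matching machinery (self- and cross-attention between image-patch and point-patch features), the block below shows how such features could attend to each other with standard PyTorch layers; the dimensions are arbitrary, and it omits positional encodings, the multi-scale pyramid, and the actual matching head:

```python
import torch, torch.nn as nn

class CoarseCrossAttention(nn.Module):
    """Self-attention within each modality, then cross-attention between image and point patch tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_pts = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_pts = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tok, pts_tok):            # (B, N_img, dim), (B, N_pts, dim)
        img_tok = img_tok + self.self_img(img_tok, img_tok, img_tok)[0]    # global context, image side
        pts_tok = pts_tok + self.self_pts(pts_tok, pts_tok, pts_tok)[0]    # global context, point side
        img_tok = img_tok + self.cross_img(img_tok, pts_tok, pts_tok)[0]   # image queries attend to points
        pts_tok = pts_tok + self.cross_pts(pts_tok, img_tok, img_tok)[0]   # point queries attend to images
        # Coarse correspondences can then be read off a similarity matrix between the two token sets.
        sim = torch.einsum("bnd,bmd->bnm", img_tok, pts_tok) / img_tok.shape[-1] ** 0.5
        return img_tok, pts_tok, sim

# _, _, sim = CoarseCrossAttention()(torch.randn(1, 96, 256), torch.randn(1, 128, 256))
```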