results: Experimental results show that the network achieves highly accurate segmentation on multiple medical imaging datasets while being faster, more lightweight, and less computationally expensive than existing networks.
Abstract
The U-shaped architecture has emerged as a crucial paradigm in the design of medical image segmentation networks. However, due to the inherent local limitations of convolution, a fully convolutional segmentation network with U-shaped architecture struggles to effectively extract global context information, which is vital for the precise localization of lesions. While hybrid architectures combining CNNs and Transformers can address these issues, their application in real medical scenarios is limited due to the computational resource constraints imposed by the environment and edge devices. In addition, the convolutional inductive bias in lightweight networks adeptly fits the scarce medical data, which is lacking in the Transformer based network. In order to extract global context information while taking advantage of the inductive bias, we propose CMUNeXt, an efficient fully convolutional lightweight medical image segmentation network, which enables fast and accurate auxiliary diagnosis in real scene scenarios. CMUNeXt leverages large kernel and inverted bottleneck design to thoroughly mix distant spatial and location information, efficiently extracting global context information. We also introduce the Skip-Fusion block, designed to enable smooth skip-connections and ensure ample feature fusion. Experimental results on multiple medical image datasets demonstrate that CMUNeXt outperforms existing heavyweight and lightweight medical image segmentation networks in terms of segmentation performance, while offering a faster inference speed, lighter weights, and a reduced computational cost. The code is available at https://github.com/FengheTan9/CMUNeXt.
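To make the "large kernel and inverted bottleneck" idea concrete, below is a minimal PyTorch sketch of such a block. The kernel size, expansion ratio, and normalization choice are illustrative assumptions, not the released CMUNeXt configuration (see the linked repository for the authors' code).

```python
import torch
import torch.nn as nn

class LargeKernelInvertedBlock(nn.Module):
    """Depthwise large-kernel conv followed by an inverted bottleneck.

    Kernel size and expansion ratio are illustrative assumptions, not the
    official CMUNeXt hyper-parameters.
    """
    def __init__(self, channels: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # A large depthwise kernel mixes distant spatial positions cheaply.
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        # Inverted bottleneck: expand channels, non-linearity, project back.
        self.pw1 = nn.Conv2d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(channels * expansion, channels, kernel_size=1)

    def forward(self, x):
        residual = x
        x = self.norm(self.dw(x))
        x = self.pw2(self.act(self.pw1(x)))
        return x + residual  # residual connection keeps training stable

if __name__ == "__main__":
    block = LargeKernelInvertedBlock(32)
    print(block(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```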
TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval
results: Extensive experiments show that the proposed method achieves efficient text-to-video retrieval on multiple public datasets.
Abstract
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods are dominating. Compared to CLIP4Clip which is efficient and compact, the state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR into doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching to let a CLIP4Clip based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool . To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage/computation overhead at the retrieval stage. While attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of the weights to let them imitate frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets justify the viability of the proposed method.
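A hedged sketch of the AFA idea is given below: attention weights pool frame features into a video feature, and the same weights are trained to imitate teacher-estimated frame-text relevance. The layer sizes and the KL-divergence form of the imitation loss are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalFrameAggregation(nn.Module):
    """AFA-style sketch: aggregate frame features with learned attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # one scalar score per frame

    def forward(self, frame_feats):                      # (batch, n_frames, dim)
        scores = self.scorer(frame_feats).squeeze(-1)    # (batch, n_frames)
        weights = scores.softmax(dim=-1)
        video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)
        return video_feat, weights

def frame_relevance_imitation_loss(student_weights, teacher_relevance):
    """Make the AFA weights imitate teacher frame-text relevance (assumed KL form)."""
    teacher_dist = teacher_relevance.softmax(dim=-1)
    return F.kl_div(student_weights.log(), teacher_dist, reduction="batchmean")

if __name__ == "__main__":
    afa = AttentionalFrameAggregation(dim=512)
    video, w = afa(torch.randn(2, 12, 512))
    loss = frame_relevance_imitation_loss(w, torch.randn(2, 12))
    print(video.shape, w.shape, loss.item())
```

At retrieval time only the aggregated video feature is used, which is why the block adds no storage or computation overhead at that stage.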
Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation
results: Experiments show that CG2A significantly improves the generalization performance and sample efficiency of visual reinforcement learning algorithms.
Abstract
Learning a policy with great generalization to unseen environments remains challenging but critical in visual reinforcement learning. Despite the success of augmentation combination in supervised learning generalization, naively applying it to visual RL algorithms may damage training efficiency and cause severe performance degradation. In this paper, we first conduct a qualitative analysis and illuminate the main causes: (i) high-variance gradient magnitudes and (ii) gradient conflicts that exist across various augmentation methods. To alleviate these issues, we propose a general policy gradient optimization framework, named Conflict-aware Gradient Agreement Augmentation (CG2A), and better integrate augmentation combination into visual RL algorithms to address the generalization bias. In particular, CG2A develops a Gradient Agreement Solver to adaptively balance the varying gradient magnitudes, and introduces a Soft Gradient Surgery strategy to alleviate the gradient conflicts. Extensive experiments demonstrate that CG2A significantly improves the generalization performance and sample efficiency of visual RL algorithms.
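The general idea of conflict-aware gradient combination can be sketched as below: per-augmentation gradients are softly de-conflicted and rebalanced in magnitude before averaging. The softening coefficient, projection form, and mean-norm rebalancing are assumptions for illustration and are not the paper's exact Gradient Agreement Solver or Soft Gradient Surgery algorithm.

```python
import torch

def soft_gradient_surgery(grads, soften=0.5):
    """Combine per-augmentation gradient vectors while damping pairwise conflicts.

    grads: list of 1-D tensors (flattened gradients), one per augmentation.
    soften: fraction of a conflicting component to remove (1.0 = full projection).
    Illustrative sketch only, not the official CG2A solver.
    """
    adjusted = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g, other)
            if dot < 0:  # the two augmentations pull in conflicting directions
                g = g - soften * dot / (other.norm() ** 2 + 1e-12) * other
        adjusted.append(g)
    # Rebalance magnitudes so no single augmentation dominates the update.
    norms = torch.stack([g.norm() for g in adjusted])
    target = norms.mean()
    adjusted = [g * (target / (n + 1e-12)) for g, n in zip(adjusted, norms)]
    return torch.stack(adjusted).mean(dim=0)

if __name__ == "__main__":
    g1, g2 = torch.randn(10), torch.randn(10)
    print(soft_gradient_surgery([g1, g2]).shape)  # torch.Size([10])
```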
Generative Noisy-Label Learning by Implicit Dicriminative Approximation with Partial Label Prior
results: Extensive experiments show that the proposed generative model achieves state-of-the-art results on multiple noisy-label benchmarks while maintaining a computational complexity similar to that of discriminative models.
Abstract
The learning with noisy labels has been addressed with both discriminative and generative models. Although discriminative models have dominated the field due to their simpler modeling and more efficient computational training processes, generative models offer a more effective means of disentangling clean and noisy labels and improving the estimation of the label transition matrix. However, generative approaches maximize the joint likelihood of noisy labels and data using a complex formulation that only indirectly optimizes the model of interest associating data and clean labels. Additionally, these approaches rely on generative models that are challenging to train and tend to use uninformative clean label priors. In this paper, we propose a new generative noisy-label learning approach that addresses these three issues. First, we propose a new model optimisation that directly associates data and clean labels. Second, the generative model is implicitly estimated using a discriminative model, eliminating the inefficient training of a generative model. Third, we propose a new informative label prior inspired by partial label learning as supervision signal for noisy label learning. Extensive experiments on several noisy-label benchmarks demonstrate that our generative model provides state-of-the-art results while maintaining a similar computational complexity as discriminative models.
Interpretable End-to-End Driving Model for Implicit Scene Understanding
methods: Specific perception tasks such as object detection and scene graph generation are commonly used, but their outputs amount only to sampling from high-dimensional scene features and are insufficient to represent the scene. Moreover, the goal of these perception tasks is inconsistent with human driving, which focuses only on what may affect the ego-trajectory. We therefore propose an end-to-end Interpretable Implicit Driving Scene Understanding (II-DSU) model that, guided by a planning module, extracts implicit high-dimensional scene features as the scene understanding result and validates their plausibility with auxiliary perception tasks for visualization.
results: Experimental results show that our approach achieves a new state of the art on CARLA benchmarks and obtains scene features that embody richer driving-relevant information, improving downstream planning performance.
Abstract
Driving scene understanding is to obtain comprehensive scene information through the sensor data and provide a basis for downstream tasks, which is indispensable for the safety of self-driving vehicles. Specific perception tasks, such as object detection and scene graph generation, are commonly used. However, the results of these tasks are only equivalent to the characterization of sampling from high-dimensional scene features, which are not sufficient to represent the scenario. In addition, the goal of perception tasks is inconsistent with human driving that just focuses on what may affect the ego-trajectory. Therefore, we propose an end-to-end Interpretable Implicit Driving Scene Understanding (II-DSU) model to extract implicit high-dimensional scene features as scene understanding results guided by a planning module and to validate the plausibility of scene understanding using auxiliary perception tasks for visualization. Experimental results on CARLA benchmarks show that our approach achieves the new state-of-the-art and is able to obtain scene features that embody richer scene information relevant to driving, enabling superior performance of the downstream planning.
paper_authors: Huzheng Yang, James Gee, Jianbo Shi
for: This study explores a new class of brain encoding model built by adding memory-related information as input.
methods: The study uses a vision-memory cognitive task and predicts activity in the non-visual brain from previously seen images.
results: Adding memory information as input significantly improves model performance (single-model score 66.8, ensemble score 70.8). The study also finds a periodic delayed brain response, together with correlated hippocampal activity, tied to images seen six to seven frames earlier.
Abstract
We explore a new class of brain encoding model by adding memory-related information as input. Memory is an essential brain mechanism that works alongside visual stimuli. During a vision-memory cognitive task, we found that the non-visual brain is largely predictable using previously seen images. Our Memory Encoding Model (Mem) won the Algonauts 2023 visual brain competition even without model ensembling (single-model score 66.8, ensemble score 70.8). Our ensemble model without memory input (61.4) would still rank 3rd. Furthermore, we observe a periodic delayed brain response correlated with the 6th-7th prior image, and the hippocampus also showed correlated activity timed with this periodicity. We conjecture that this periodic replay could be related to a memory mechanism that enhances working memory.
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation
results: Extensive experiments on four benchmark datasets from different domains demonstrate the effectiveness of the proposed components in FSA-CDM, exceeding state-of-the-art performance by roughly 2%-12% in DTW.
Abstract
The recently rising markup-to-image generation poses greater challenges as compared to natural image generation, due to its low tolerance for errors as well as the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Technically, we design a fine-grained cross-modal alignment module to well explore the sequence similarity between the two modalities for learning robust feature representations. To improve the generalization ability, we propose a contrast-augmented diffusion model to explicitly explore positive and negative samples by maximizing a novel contrastive variational objective, which is mathematically inferred to provide a tighter bound for the model's optimization. Moreover, the context-aware cross attention module is developed to capture the contextual information within markup language during the denoising process, yielding better noise prediction results. Extensive experiments are conducted on four benchmark datasets from different domains, and the experimental results demonstrate the effectiveness of the proposed components in FSA-CDM, significantly exceeding state-of-the-art performance by about 2%-12% DTW improvements. The code will be released at https://github.com/zgj77/FSACDM.
UCDFormer: Unsupervised Change Detection Using a Transformer-driven Image Translation
for: This work proposes an unsupervised change detection method with a domain shift setting for remote sensing images.
methods: A transformer-driven image translation model, UCDFormer, consisting of a light-weight transformer and a domain-specific affinity weight, is used to mitigate domain shift. A difference map is then computed between the translated before-event image and the original after-event image, and a reliable pixel extraction module selects changed/unchanged pixels by fusing fuzzy c-means clustering with an adaptive threshold.
results: Experiments on several unsupervised change detection tasks show that UCDFormer outperforms related methods, improving the Kappa coefficient by more than 12%, and performs well on earthquake-induced landslide detection in large-scale applications. The code is available at https://github.com/zhu-xlab/UCDFormer.
Abstract
Change detection (CD) by comparing two bi-temporal images is a crucial task in remote sensing. With the advantages of requiring no cumbersome labeled change information, unsupervised CD has attracted extensive attention in the community. However, existing unsupervised CD approaches rarely consider the seasonal and style differences incurred by the illumination and atmospheric conditions in multi-temporal images. To this end, we propose a change detection with domain shift setting for remote sensing images. Furthermore, we present a novel unsupervised CD method using a light-weight transformer, called UCDFormer. Specifically, a transformer-driven image translation composed of a light-weight transformer and a domain-specific affinity weight is first proposed to mitigate domain shift between two images with real-time efficiency. After image translation, we can generate the difference map between the translated before-event image and the original after-event image. Then, a novel reliable pixel extraction module is proposed to select significantly changed/unchanged pixel positions by fusing the pseudo change maps of fuzzy c-means clustering and adaptive threshold. Finally, a binary change map is obtained based on these selected pixel pairs and a binary classifier. Experimental results on different unsupervised CD tasks with seasonal and style changes demonstrate the effectiveness of the proposed UCDFormer. For example, compared with several other related methods, UCDFormer improves performance on the Kappa coefficient by more than 12\%. In addition, UCDFormer achieves excellent performance for earthquake-induced landslide detection when considering large-scale applications. The code is available at \url{https://github.com/zhu-xlab/UCDFormer}
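The reliable pixel extraction step can be illustrated with a small NumPy sketch that fuses a two-cluster fuzzy c-means membership map with an adaptive threshold on the difference map. The mean-plus-std threshold and the 0.9/0.1 membership cut-offs are assumptions chosen for illustration, not the paper's exact settings.

```python
import numpy as np

def fuzzy_cmeans_2class(x, m=2.0, iters=50):
    """Tiny 1-D fuzzy c-means with two clusters (change / no-change)."""
    rng = np.random.default_rng(0)
    u = rng.random((2, x.size))
    u /= u.sum(axis=0)
    for _ in range(iters):
        um = u ** m
        centers = (um @ x) / um.sum(axis=1)
        dist = np.abs(x[None, :] - centers[:, None]) + 1e-12
        u = 1.0 / dist ** (2 / (m - 1))
        u /= u.sum(axis=0)
    change_cluster = np.argmax(centers)      # cluster with larger differences
    return u[change_cluster]                 # membership of the "changed" cluster

def reliable_pixels(diff_map, high=0.9, low=0.1):
    """Select confidently changed/unchanged pixels by fusing the fuzzy c-means
    membership with a simple adaptive threshold on the difference map."""
    membership = fuzzy_cmeans_2class(diff_map.ravel().astype(np.float64))
    membership = membership.reshape(diff_map.shape)
    thr = diff_map.mean() + diff_map.std()   # adaptive threshold (assumption)
    changed = (membership > high) & (diff_map > thr)
    unchanged = (membership < low) & (diff_map <= thr)
    return changed, unchanged

if __name__ == "__main__":
    d = np.abs(np.random.randn(64, 64))
    c, u = reliable_pixels(d)
    print(int(c.sum()), "changed seeds,", int(u.sum()), "unchanged seeds")
```

The selected pixel pairs then supervise a binary classifier that produces the final change map.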
DySTreSS: Dynamically Scaled Temperature in Self-Supervised Contrastive Learning
results: Experiments show that the proposed framework (DySTreSS) outperforms or matches contrastive-loss-based SSL frameworks. The work further analyzes the choice of temperature and how local and global structures in the feature space affect performance.
Abstract
In contemporary self-supervised contrastive algorithms like SimCLR, MoCo, etc., the task of balancing attraction between two semantically similar samples and repulsion between two samples from different classes is primarily affected by the presence of hard negative samples. While the InfoNCE loss has been shown to impose penalties based on hardness, the temperature hyper-parameter is the key to regulating the penalties and the trade-off between uniformity and tolerance. In this work, we focus our attention to improve the performance of InfoNCE loss in SSL by studying the effect of temperature hyper-parameter values. We propose a cosine similarity-dependent temperature scaling function to effectively optimize the distribution of the samples in the feature space. We further analyze the uniformity and tolerance metrics to investigate the optimal regions in the cosine similarity space for better optimization. Additionally, we offer a comprehensive examination of the behavior of local and global structures in the feature space throughout the pre-training phase, as the temperature varies. Experimental evidence shows that the proposed framework outperforms or is at par with the contrastive loss-based SSL algorithms. We believe our work (DySTreSS) on temperature scaling in SSL provides a foundation for future research in contrastive learning.
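The core idea is to let the InfoNCE temperature depend on the cosine similarity of each pair instead of being a fixed hyper-parameter. The sketch below shows one plausible form inside an InfoNCE-style loss; the linear interpolation between two temperatures and the specific bounds are assumptions, not the exact DySTreSS scaling function.

```python
import torch
import torch.nn.functional as F

def dynamic_temperature(sim, t_min=0.07, t_max=0.2):
    """Map cosine similarity in [-1, 1] to a temperature in [t_min, t_max].

    The linear form is an illustrative assumption; the paper derives its own
    cosine-similarity-dependent scaling function.
    """
    return t_min + (t_max - t_min) * (sim + 1.0) / 2.0

def info_nce_dynamic_tau(z1, z2):
    """InfoNCE between two views where each pairwise logit uses its own temperature."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t()                          # (N, N) cosine similarities
    logits = sim / dynamic_temperature(sim)    # element-wise temperature scaling
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    loss = info_nce_dynamic_tau(torch.randn(8, 128), torch.randn(8, 128))
    print(loss.item())
```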
Multi-task learning for classification, segmentation, reconstruction, and detection on chest CT scans
results: A novel multi-task learning framework is proposed that can help physicians diagnose lung cancer and COVID-19 patients more quickly and accurately. In addition, the use of different backbones and loss functions is explored to improve performance on the segmentation task.
Abstract
Lung cancer and covid-19 have one of the highest morbidity and mortality rates in the world. For physicians, the identification of lesions is difficult in the early stages of the disease and time-consuming. Therefore, multi-task learning is an approach to extracting important features, such as lesions, from small amounts of medical data because it learns to generalize better. We propose a novel multi-task framework for classification, segmentation, reconstruction, and detection. To the best of our knowledge, we are the first ones who added detection to the multi-task solution. Additionally, we checked the possibility of using two different backbones and different loss functions in the segmentation task.
Leveraging Expert Models for Training Deep Neural Networks in Scarce Data Domains: Application to Offline Handwritten Signature Verification
results: The proposed method achieves comparable or superior performance on three popular signature datasets without using any signature data during feature-extraction training, and may be applicable to other data-scarce domains.
Abstract
This paper introduces a novel approach to leverage the knowledge of existing expert models for training new Convolutional Neural Networks, on domains where task-specific data are limited or unavailable. The presented scheme is applied in offline handwritten signature verification (OffSV) which, akin to other biometric applications, suffers from inherent data limitations due to regulatory restrictions. The proposed Student-Teacher (S-T) configuration utilizes feature-based knowledge distillation (FKD), combining graph-based similarity for local activations with global similarity measures to supervise student's training, using only handwritten text data. Remarkably, the models trained using this technique exhibit comparable, if not superior, performance to the teacher model across three popular signature datasets. More importantly, these results are attained without employing any signatures during the feature extraction training process. This study demonstrates the efficacy of leveraging existing expert models to overcome data scarcity challenges in OffSV and potentially other related domains.
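One plausible reading of the feature-based knowledge distillation (FKD) objective, combining a graph-based similarity term over batch activations with a global similarity term, is sketched below. The exact terms, their weighting, and the use of batch-level pairwise graphs are assumptions for illustration rather than the paper's precise loss.

```python
import torch
import torch.nn.functional as F

def global_similarity_loss(student_feats, teacher_feats):
    """Global alignment: cosine distance between matching feature vectors."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def local_graph_similarity_loss(student_feats, teacher_feats):
    """Graph alignment: match the pairwise similarity structure within a batch."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    return F.mse_loss(s @ s.t(), t @ t.t())

def fkd_loss(student_feats, teacher_feats, alpha=0.5):
    """Feature-based KD combining local (graph) and global terms; the equal
    weighting is an illustrative assumption."""
    return (alpha * local_graph_similarity_loss(student_feats, teacher_feats)
            + (1 - alpha) * global_similarity_loss(student_feats, teacher_feats))

if __name__ == "__main__":
    print(fkd_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```

In the S-T setup described above, `teacher_feats` would come from the frozen expert model and `student_feats` from the new network, both computed on handwritten text images rather than signatures.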
DiffusePast: Diffusion-based Generative Replay for Class Incremental Semantic Segmentation
results: Experiments show that the method achieves competitive performance on mainstream benchmarks, striking a better balance between the performance of old and new classes.
Abstract
The Class Incremental Semantic Segmentation (CISS) extends the traditional segmentation task by incrementally learning newly added classes. Previous work has introduced generative replay, which involves replaying old class samples generated from a pre-trained GAN, to address the issues of catastrophic forgetting and privacy concerns. However, the generated images lack semantic precision and exhibit out-of-distribution characteristics, resulting in inaccurate masks that further degrade the segmentation performance. To tackle these challenges, we propose DiffusePast, a novel framework featuring a diffusion-based generative replay module that generates semantically accurate images with more reliable masks guided by different instructions (e.g., text prompts or edge maps). Specifically, DiffusePast introduces a dual-generator paradigm, which focuses on generating old class images that align with the distribution of downstream datasets while preserving the structure and layout of the original images, enabling more precise masks. To adapt to the novel visual concepts of newly added classes continuously, we incorporate class-wise token embedding when updating the dual-generator. Moreover, we assign adequate pseudo-labels of old classes to the background pixels in the new step images, further mitigating the forgetting of previously learned knowledge. Through comprehensive experiments, our method demonstrates competitive performance across mainstream benchmarks, striking a better balance between the performance of old and novel classes.
Stereo Visual Odometry with Deep Learning-Based Point and Line Feature Matching using an Attention Graph Neural Network
results: Experimental results on multiple real and synthetic datasets show that the method obtains more line feature matches under low-visibility weather and lighting conditions and, when complemented with point feature matches, performs consistently well in adverse weather and dynamic lighting.
Abstract
Robust feature matching forms the backbone for most Visual Simultaneous Localization and Mapping (vSLAM), visual odometry, 3D reconstruction, and Structure from Motion (SfM) algorithms. However, recovering feature matches from texture-poor scenes is a major challenge and still remains an open area of research. In this paper, we present a Stereo Visual Odometry (StereoVO) technique based on point and line features which uses a novel feature-matching mechanism based on an Attention Graph Neural Network that is designed to perform well even under adverse weather conditions such as fog, haze, rain, and snow, and dynamic lighting conditions such as nighttime illumination and glare scenarios. We perform experiments on multiple real and synthetic datasets to validate the ability of our method to perform StereoVO under low visibility weather and lighting conditions through robust point and line matches. The results demonstrate that our method achieves more line feature matches than state-of-the-art line matching algorithms, which when complemented with point feature matches perform consistently well in adverse weather and dynamic lighting conditions.
Unlearning Spurious Correlations in Chest X-ray Classification
methods: A deep learning model is trained on a COVID-19 chest X-ray dataset, and eXplanation Based Learning (XBL) with interactive user feedback (feature annotations) is used to unlearn spurious correlations.
results: The results show that XBL can effectively remove spurious correlations from the chest X-ray classification model, supporting robust and more transparent models even in the presence of confounding factors.
Abstract
Medical image classification models are frequently trained using training datasets derived from multiple data sources. While leveraging multiple data sources is crucial for achieving model generalization, it is important to acknowledge that the diverse nature of these sources inherently introduces unintended confounders and other challenges that can impact both model accuracy and transparency. A notable confounding factor in medical image classification, particularly in musculoskeletal image classification, is skeletal maturation-induced bone growth observed during adolescence. We train a deep learning model using a Covid-19 chest X-ray dataset and we showcase how this dataset can lead to spurious correlations due to unintended confounding regions. eXplanation Based Learning (XBL) is a deep learning approach that goes beyond interpretability by utilizing model explanations to interactively unlearn spurious correlations. This is achieved by integrating interactive user feedback, specifically feature annotations. In our study, we employed two non-demanding manual feedback mechanisms to implement an XBL-based approach for effectively eliminating these spurious correlations. Our results underscore the promising potential of XBL in constructing robust models even in the presence of confounding factors.
Spatio-Temporal Branching for Motion Prediction using Motion Increments
results: On standard HMP benchmarks, the proposed method outperforms state-of-the-art methods in prediction accuracy.
Abstract
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications, but it remains a challenging task due to the stochastic and aperiodic nature of future poses. Traditional methods rely on hand-crafted features and machine learning techniques, which often struggle to model the complex dynamics of human motion. Recent deep learning-based methods have achieved success by learning spatio-temporal representations of motion, but these models often overlook the reliability of motion data. Additionally, the temporal and spatial dependencies of skeleton nodes are distinct. The temporal relationship captures motion information over time, while the spatial relationship describes body structure and the relationships between different nodes. In this paper, we propose a novel spatio-temporal branching network using incremental information for HMP, which decouples the learning of temporal-domain and spatial-domain features, extracts more motion information, and achieves complementary cross-domain knowledge learning through knowledge distillation. Our approach effectively reduces noise interference and provides more expressive information for characterizing motion by separately extracting temporal and spatial features. We evaluate our approach on standard HMP benchmarks and outperform state-of-the-art methods in terms of prediction accuracy.
AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation
results: The system turns a single product image and title into advertising posters of varying sizes and produces high-quality results. User studies and experiments show that it is more effective and aesthetically superior to other poster generation methods.
Abstract
Advertising posters, a form of information presentation, combine visual and linguistic modalities. Creating a poster involves multiple steps and necessitates design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. With only product images and titles as inputs, AutoPoster can automatically produce posters of varying sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony of posters, two content-aware models are incorporated for layout and tagline generation. Moreover, we propose a novel multi-task Style Attribute Predictor (SAP) to jointly predict visual style attributes. Meanwhile, to our knowledge, we propose the first poster generation dataset that includes visual attribute annotations for over 76k posters. Qualitative and quantitative outcomes from user studies and experiments substantiate the efficacy of our system and the aesthetic superiority of the generated posters compared to other poster generation methods.
Attention-free Spikformer: Mixing Spike Sequences with Simple Linear Transforms
results: Replacing SSA with unparameterized linear transforms (LT), such as Fourier and wavelet transforms, reduces the quadratic time complexity to log-linear. These transforms mix spike sequences and alternate between the frequency and time domains to extract sparse visual features, showing strong performance and efficiency.
Abstract
By integrating the self-attention capability and the biological properties of Spiking Neural Networks (SNNs), Spikformer applies the flourishing Transformer architecture to SNNs design. It introduces a Spiking Self-Attention (SSA) module to mix sparse visual features using spike-form Query, Key, and Value, resulting in the State-Of-The-Art (SOTA) performance on numerous datasets compared to previous SNN-like frameworks. In this paper, we demonstrate that the Spikformer architecture can be accelerated by replacing the SSA with an unparameterized Linear Transform (LT) such as Fourier and Wavelet transforms. These transforms are utilized to mix spike sequences, reducing the quadratic time complexity to log-linear time complexity. They alternate between the frequency and time domains to extract sparse visual features, showcasing powerful performance and efficiency. We conduct extensive experiments on image classification using both neuromorphic and static datasets. The results indicate that compared to the SOTA Spikformer with SSA, Spikformer with LT achieves higher Top-1 accuracy on neuromorphic datasets (i.e., CIFAR10-DVS and DVS128 Gesture) and comparable Top-1 accuracy on static datasets (i.e., CIFAR-10 and CIFAR-100). Furthermore, Spikformer with LT achieves approximately 29-51% improvement in training speed, 61-70% improvement in inference speed, and reduces memory usage by 4-26% due to not requiring learnable parameters.
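Replacing the attention module with a parameter-free Fourier mixing layer can be illustrated with the FNet-style sketch below, shown on a dense tensor for readability; the actual Spikformer variant operates on spike-form features and may use wavelet transforms as well, so this is an illustrative stand-in rather than the paper's exact block.

```python
import torch
import torch.nn as nn

class FourierTokenMixer(nn.Module):
    """Parameter-free token mixing: 2-D FFT over tokens and channels, keep the
    real part. Stand-in for the unparameterized linear transforms (LT)."""
    def forward(self, x):                      # x: (batch, tokens, channels)
        return torch.fft.fft2(x, dim=(-2, -1)).real

class LTBlock(nn.Module):
    """Transformer block where self-attention is replaced by the linear transform."""
    def __init__(self, dim, hidden=4):
        super().__init__()
        self.mix = FourierTokenMixer()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * hidden), nn.GELU(),
                                 nn.Linear(dim * hidden, dim))

    def forward(self, x):
        x = x + self.mix(self.norm1(x))        # token mixing, no learnable weights
        x = x + self.mlp(self.norm2(x))        # channel mixing
        return x

if __name__ == "__main__":
    print(LTBlock(64)(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 64])
```

Because the mixing step has no learnable parameters, it also explains the reported savings in memory and training/inference time.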
Homography Estimation in Complex Topological Scenes
results: Experimental results show that the proposed method improves the IoU metric by up to 12% compared with a state-of-the-art model.
Abstract
Surveillance videos and images are used for a broad set of applications, ranging from traffic analysis to crime detection. Extrinsic camera calibration data is important for most analysis applications. However, security cameras are susceptible to environmental conditions and small camera movements, resulting in a need for an automated re-calibration method that can account for these varying conditions. In this paper, we present an automated camera-calibration process leveraging a dictionary-based approach that does not require prior knowledge on any camera settings. The method consists of a custom implementation of a Spatial Transformer Network (STN) and a novel topological loss function. Experiments reveal that the proposed method improves the IoU metric by up to 12% w.r.t. a state-of-the-art model across five synthetic datasets and the World Cup 2014 dataset.
Improving Generalization of Synthetically Trained Sonar Image Descriptors for Underwater Place Recognition
results: The proposed method, trained exclusively on synthetic data, generalizes to real scenarios and outperforms existing methods.
Abstract
Autonomous navigation in underwater environments presents challenges due to factors such as light absorption and water turbidity, limiting the effectiveness of optical sensors. Sonar systems are commonly used for perception in underwater operations as they are unaffected by these limitations. Traditional computer vision algorithms are less effective when applied to sonar-generated acoustic images, while convolutional neural networks (CNNs) typically require large amounts of labeled training data that are often unavailable or difficult to acquire. To this end, we propose a novel compact deep sonar descriptor pipeline that can generalize to real scenarios while being trained exclusively on synthetic data. Our architecture is based on a ResNet18 back-end and a properly parameterized random Gaussian projection layer, whereas input sonar data is enhanced with standard ad-hoc normalization/prefiltering techniques. A customized synthetic data generation procedure is also presented. The proposed method has been evaluated extensively using both synthetic and publicly available real data, demonstrating its effectiveness compared to state-of-the-art methods.
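The described pipeline is a ResNet18 back-end followed by a fixed, properly parameterized random Gaussian projection. A compact PyTorch/torchvision sketch is below; the output dimensionality and the 1/sqrt(k) scaling of the projection are illustrative assumptions, and the ad-hoc sonar normalization/prefiltering is omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SonarDescriptor(nn.Module):
    """ResNet18 features followed by a frozen random Gaussian projection."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()            # keep the 512-d pooled features
        self.backbone = backbone
        # Random Gaussian projection, stored as a buffer so it is never trained.
        proj = torch.randn(512, out_dim) / out_dim ** 0.5   # scaling is an assumption
        self.register_buffer("proj", proj)

    def forward(self, x):                      # x: (batch, 3, H, W) sonar image
        feats = self.backbone(x)               # (batch, 512)
        desc = feats @ self.proj               # (batch, out_dim)
        return nn.functional.normalize(desc, dim=-1)

if __name__ == "__main__":
    model = SonarDescriptor()
    print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 128])
```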
MammoDG: Generalisable Deep Learning Breaks the Limits of Cross-Domain Multi-Center Breast Cancer Screening
paper_authors: Yijun Yang, Shujun Wang, Lihao Liu, Sarah Hickman, Fiona J Gilbert, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
for: Aims to improve early detection of breast cancer for better treatment outcomes and quality of life.
methods: Uses machine learning models to support expert decision-making.
results: Proposes MammoDG, a novel deep learning framework for generalisable and reliable analysis of cross-domain, multi-center, multi-view mammography data. MammoDG leverages a novel contrastive mechanism to enhance its generalisation capability.
Abstract
Breast cancer is a major cause of cancer death among women, emphasising the importance of early detection for improved treatment outcomes and quality of life. Mammography, the primary diagnostic imaging test, poses challenges due to the high variability and patterns in mammograms. Double reading of mammograms is recommended in many screening programs to improve diagnostic accuracy but increases radiologists' workload. Researchers explore Machine Learning models to support expert decision-making. Stand-alone models have shown comparable or superior performance to radiologists, but some studies note decreased sensitivity with multiple datasets, indicating the need for high generalisation and robustness models. This work devises MammoDG, a novel deep-learning framework for generalisable and reliable analysis of cross-domain multi-center mammography data. MammoDG leverages multi-view mammograms and a novel contrastive mechanism to enhance generalisation capabilities. Extensive validation demonstrates MammoDG's superiority, highlighting the critical importance of domain generalisation for trustworthy mammography analysis in imaging protocol variations.
Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
methods: Based on the idea of early exiting, the network is split into several stages, each with an auxiliary block that grades the difficulty of every token. Easy tokens can be finalized after only the early stages, reducing computation.
results: Experiments show that the proposed DToP architecture reduces the computational cost of current semantic segmentation methods based on plain vision transformers by 20%-35% on average without degrading accuracy.
Abstract
Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs and outputs usually imply more tokens involved in computations. Directly removing the less attentive tokens has been discussed for the image classification task but can not be extended to semantic segmentation since a dense prediction is required for every patch. To this end, this work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation. Motivated by the coarse-to-fine segmentation process by humans, we naturally split the widely adopted auxiliary-loss-based network architecture into several stages, where each auxiliary block grades every token's difficulty level. We can finalize the prediction of easy tokens in advance without completing the entire forward pass. Moreover, we keep $k$ highest confidence tokens for each semantic category to uphold the representative context information. Thus, computational complexity will change with the difficulty of the input, akin to the way humans do segmentation. Experiments suggest that the proposed DToP architecture reduces on average $20\% - 35\%$ of computational cost for current semantic segmentation methods based on plain vision transformers without accuracy degradation.
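A single pruning step of this scheme is easy to sketch: tokens whose auxiliary prediction is already confident exit early, while the k most confident tokens of each class are retained as context for later stages. The confidence threshold and the per-class retention count below are assumed defaults, not the paper's exact values.

```python
import torch

def prune_tokens(tokens, aux_logits, conf_thresh=0.95, keep_per_class=5):
    """One DToP-style pruning step (illustrative).

    tokens:      (n_tokens, dim) features entering the next stage.
    aux_logits:  (n_tokens, n_classes) predictions from the auxiliary head.
    Returns indices of tokens that continue, plus finalized token predictions.
    """
    probs = aux_logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    easy = conf >= conf_thresh                 # candidates for early exit

    keep = ~easy
    # Keep the k most confident tokens of every class to preserve context.
    for c in pred[easy].unique():
        cls_idx = torch.nonzero((pred == c) & easy, as_tuple=False).squeeze(-1)
        top = cls_idx[conf[cls_idx].argsort(descending=True)[:keep_per_class]]
        keep[top] = True

    finalized = {int(i): int(pred[i])
                 for i in torch.nonzero(easy & ~keep).squeeze(-1)}
    return torch.nonzero(keep).squeeze(-1), finalized

if __name__ == "__main__":
    idx, done = prune_tokens(torch.randn(100, 64), torch.randn(100, 19) * 5)
    print(len(idx), "tokens continue,", len(done), "finalized early")
```

Later stages then run only on `tokens[idx]`, so the cost adapts to how difficult the input is.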
WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection
paper_authors: Xingjian Wang, Li Chai, Jiming Chen, Zhiguo Shi
for: This paper proposes a novel and efficient multispectral pedestrian detection framework that improves visibility in challenging conditions while offering higher computational efficiency and accuracy.
methods: The framework uses a dual-stream backbone: a discrete wavelet transform (DWT) stream extracts frequency components of the infrared modality for fast inference and training, while CNN layers extract spatial-domain features of the RGB modality. A crossmodal rearranging fusion module (CMRF) mitigates spatial misalignment and merges semantically complementary features of spatially related local regions.
results: Evaluations on the KAIST and FLIR benchmarks show that WCCNet achieves an excellent trade-off between computational efficiency and accuracy, outperforming state-of-the-art methods; an ablation study analyzes the contribution of each component.
Abstract
Multispectral pedestrian detection achieves better visibility in challenging conditions and thus has a broad application in various tasks, for which both the accuracy and computational cost are of paramount importance. Most existing approaches treat RGB and infrared modalities equally, typically adopting two symmetrical CNN backbones for multimodal feature extraction, which ignores the substantial differences between modalities and brings great difficulty for the reduction of the computational cost as well as effective crossmodal fusion. In this work, we propose a novel and efficient framework named WCCNet that is able to differentially extract rich features of different spectra with lower computational complexity and semantically rearranges these features for effective crossmodal fusion. Specifically, the discrete wavelet transform (DWT) allowing fast inference and training speed is embedded to construct a dual-stream backbone for efficient feature extraction. The DWT layers of WCCNet extract frequency components for infrared modality, while the CNN layers extract spatial-domain features for RGB modality. This methodology not only significantly reduces the computational complexity, but also improves the extraction of infrared features to facilitate the subsequent crossmodal fusion. Based on the well extracted features, we elaborately design the crossmodal rearranging fusion module (CMRF), which can mitigate spatial misalignment and merge semantically complementary features of spatially-related local regions to amplify the crossmodal complementary information. We conduct comprehensive evaluations on KAIST and FLIR benchmarks, in which WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy. We also perform the ablation study and analyze thoroughly the impact of different components on the performance of WCCNet.
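A minimal sketch of such a dual-stream front end is shown below, using PyWavelets for the infrared DWT branch and a plain convolution for the RGB branch. The Haar wavelet, single decomposition level, and channel sizes are illustrative assumptions, not the WCCNet configuration.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def dwt_features(ir_image: np.ndarray) -> torch.Tensor:
    """Single-level 2-D DWT of an infrared image -> four frequency sub-bands
    (LL, LH, HL, HH) stacked as channels, each at half resolution."""
    cA, (cH, cV, cD) = pywt.dwt2(ir_image, "haar")
    bands = np.stack([cA, cH, cV, cD], axis=0).astype(np.float32)
    return torch.from_numpy(bands).unsqueeze(0)          # (1, 4, H/2, W/2)

class DualStreamFrontEnd(nn.Module):
    """RGB branch: spatial conv; IR branch: 1x1 conv over DWT sub-bands."""
    def __init__(self, out_ch=16):
        super().__init__()
        self.rgb_conv = nn.Conv2d(3, out_ch, 3, stride=2, padding=1)
        self.ir_conv = nn.Conv2d(4, out_ch, 1)

    def forward(self, rgb, ir_bands):
        return self.rgb_conv(rgb), self.ir_conv(ir_bands)

if __name__ == "__main__":
    ir = np.random.rand(128, 128)
    model = DualStreamFrontEnd()
    rgb_feat, ir_feat = model(torch.randn(1, 3, 128, 128), dwt_features(ir))
    print(rgb_feat.shape, ir_feat.shape)   # both (1, 16, 64, 64)
```

Because the DWT halves the spatial resolution without learnable weights, the infrared branch stays cheap, and the two feature maps arrive at matching resolution for the crossmodal fusion stage.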
TS-RGBD Dataset: a Novel Dataset for Theatre Scenes Description for People with Visual Impairments
paper_authors: Leyla Benhamida, Khadidja Delloul, Slimane Larabi
for: This paper provides a new RGB-D dataset for image captioning and human action recognition.
methods: RGB, depth, and skeleton sequences are captured with Microsoft Kinect, and image captioning and human action recognition models are evaluated on the data.
results: Testing image captioning models and skeleton-based human action recognition models on the TS-RGBD dataset shows that they can detect human actions and textually describe regions of interest in theatre scenes.
Abstract
Computer vision was long a tool used for aiding visually impaired people to move around their environment and avoid obstacles and falls. Solutions are limited to either indoor or outdoor scenes, which limits the kind of places and scenes visually disabled people can be in, including entertainment places such as theatres. Furthermore, most of the proposed computer-vision-based methods rely on RGB benchmarks to train their models resulting in a limited performance due to the absence of the depth modality. In this paper, we propose a novel RGB-D dataset containing theatre scenes with ground truth human actions and dense captions annotations for image captioning and human action recognition: TS-RGBD dataset. It includes three types of data: RGB, depth, and skeleton sequences, captured by Microsoft Kinect. We test image captioning models on our dataset as well as some skeleton-based human action recognition models in order to extend the range of environment types where a visually disabled person can be, by detecting human actions and textually describing appearances of regions of interest in theatre scenes.
Point Anywhere: Directed Object Estimation from Omnidirectional Images
methods: The method uses an omnidirectional camera to remove the user/object position constraint and the left/right constraint of the pointing arm. Specifically, repeatedly extracting regions of interest from equirectangular images and projecting them onto perspective images enables highly accurate estimation.
results: The method improves estimation accuracy, and training the likelihood of the target object with machine learning improves it further.
Abstract
One of the intuitive instruction methods in robot navigation is a pointing gesture. In this study, we propose a method using an omnidirectional camera to eliminate the user/object position constraint and the left/right constraint of the pointing arm. Although the accuracy of skeleton and object detection is low due to the high distortion of equirectangular images, the proposed method enables highly accurate estimation by repeatedly extracting regions of interest from the equirectangular image and projecting them onto perspective images. Furthermore, we found that training the likelihood of the target object in machine learning further improves the estimation accuracy.
MDT3D: Multi-Dataset Training for LiDAR 3D Object Detection Generalization
paper_authors: Louis Soum-Fontez, Jean-Emmanuel Deschaud, François Goulette
for: The paper aims to improve the robustness of 3D object detection models when tested in new environments with different sensor configurations.
methods: The authors propose Multi-Dataset Training for 3D Object Detection (MDT3D), which leverages information from several annotated source datasets, uses a new label mapping based on coarse labels to bridge the labelling gap between datasets, and introduces a new cross-dataset augmentation method called cross-dataset object injection.
results: The authors demonstrate improvements in 3D object detection performance with MDT3D compared to training on a single dataset, across different types of 3D object detection models. A coarse-label mapping sketch is given after the abstract below.
Abstract
Supervised 3D Object Detection models have been displaying increasingly better performance in single-domain cases where the training data comes from the same environment and sensor as the testing data. However, in real-world scenarios data from the target domain may not be available for finetuning or for domain adaptation methods. Indeed, 3D object detection models trained on a source dataset with a specific point distribution have shown difficulties in generalizing to unseen datasets. Therefore, we decided to leverage the information available from several annotated source datasets with our Multi-Dataset Training for 3D Object Detection (MDT3D) method to increase the robustness of 3D object detection models when tested in a new environment with a different sensor configuration. To tackle the labelling gap between datasets, we used a new label mapping based on coarse labels. Furthermore, we show how we managed the mix of datasets during training and finally introduce a new cross-dataset augmentation method: cross-dataset object injection. We demonstrate that this training paradigm shows improvements for different types of 3D object detection models. The source code and additional results for this research project will be publicly available on GitHub for interested parties to access and utilize: https://github.com/LouisSF/MDT3D
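Training on several annotated datasets requires mapping each dataset's native classes onto a shared coarse label set. The sketch below illustrates the bookkeeping; the class names and the coarse taxonomy are hypothetical examples, not the mapping actually used in MDT3D.

```python
# Hypothetical coarse taxonomy shared by all source datasets.
COARSE_CLASSES = ["vehicle", "pedestrian", "cyclist"]

# Per-dataset mapping from native labels to the coarse taxonomy (illustrative).
LABEL_MAPS = {
    "nuscenes": {"car": "vehicle", "truck": "vehicle", "bus": "vehicle",
                 "pedestrian": "pedestrian", "bicycle": "cyclist"},
    "kitti":    {"Car": "vehicle", "Van": "vehicle",
                 "Pedestrian": "pedestrian", "Cyclist": "cyclist"},
}

def to_coarse(dataset: str, label: str) -> int:
    """Return the coarse class index for a native label, or -1 to ignore it."""
    coarse = LABEL_MAPS[dataset].get(label)
    return COARSE_CLASSES.index(coarse) if coarse is not None else -1

if __name__ == "__main__":
    print(to_coarse("kitti", "Van"), to_coarse("nuscenes", "bicycle"))  # 0 2
```

Cross-dataset object injection then pastes objects sampled from one dataset into scenes of another, with labels already expressed in this shared taxonomy.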
Exploiting Synthetic Data for Data Imbalance Problems: Baselines from a Data Perspective
paper_authors: Moon Ye-Bin, Nam Hyeon-Woo, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh
for: addressing data imbalance problems in deep neural networks
methods: utilizes synthetic data as a preliminary step before employing task-specific algorithms
results: Impressive performance on several datasets, surpassing existing task-specific methods.
Abstract
We live in a vast ocean of data, and deep neural networks are no exception to this. However, this data exhibits an inherent phenomenon of imbalance. This imbalance poses a risk of deep neural networks producing biased predictions, leading to potentially severe ethical and social consequences. To address these challenges, we believe that the use of generative models is a promising approach for comprehending tasks, given the remarkable advancements demonstrated by recent diffusion models in generating high-quality images. In this work, we propose a simple yet effective baseline, SYNAuG, that utilizes synthetic data as a preliminary step before employing task-specific algorithms to address data imbalance problems. This straightforward approach yields impressive performance on datasets such as CIFAR100-LT, ImageNet100-LT, UTKFace, and Waterbird, surpassing the performance of existing task-specific methods. While we do not claim that our approach serves as a complete solution to the problem of data imbalance, we argue that supplementing the existing data with synthetic data proves to be an effective and crucial preliminary step in addressing data imbalance concerns.
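The baseline simply tops up minority classes with synthetic samples before any task-specific imbalance method runs. The sketch below shows that bookkeeping; the `generate_fn` hook standing in for a pretrained diffusion model is hypothetical, as is the choice of balancing to the majority-class count.

```python
from collections import Counter

def synthetic_budget(labels, target_per_class=None):
    """Number of synthetic samples to generate per class for a balanced set.

    labels: list of class labels in the real training set.
    target_per_class: desired count per class (defaults to the majority count).
    """
    counts = Counter(labels)
    target = target_per_class or max(counts.values())
    return {cls: max(0, target - n) for cls, n in counts.items()}

def augment_with_synthetic(real_data, labels, generate_fn):
    """generate_fn(cls, n) is a hypothetical hook to a generative model that
    returns n synthetic samples of class `cls`."""
    data, lab = list(real_data), list(labels)
    for cls, n in synthetic_budget(labels).items():
        if n > 0:
            data.extend(generate_fn(cls, n))
            lab.extend([cls] * n)
    return data, lab

if __name__ == "__main__":
    labels = ["cat"] * 100 + ["dog"] * 10
    print(synthetic_budget(labels))   # {'cat': 0, 'dog': 90}
```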
Orientation-Guided Contrastive Learning for UAV-View Geo-Localisation
results: Experiments on the University-1652 and University-160k datasets achieve higher performance than previous methods. The orientation module is not needed at inference, so no additional computation is required in practical applications.
Abstract
Retrieving relevant multimedia content is one of the main problems in a world that is increasingly data-driven. With the proliferation of drones, high quality aerial footage is now available to a wide audience for the first time. Integrating this footage into applications can enable GPS-less geo-localisation or location correction. In this paper, we present an orientation-guided training framework for UAV-view geo-localisation. Through hierarchical localisation orientations of the UAV images are estimated in relation to the satellite imagery. We propose a lightweight prediction module for these pseudo labels which predicts the orientation between the different views based on the contrastive learned embeddings. We experimentally demonstrate that this prediction supports the training and outperforms previous approaches. The extracted pseudo-labels also enable aligned rotation of the satellite image as augmentation to further strengthen the generalisation. During inference, we no longer need this orientation module, which means that no additional computations are required. We achieve state-of-the-art results on both the University-1652 and University-160k datasets.
ForensicsForest Family: A Series of Multi-scale Hierarchical Cascade Forests for Detecting GAN-generated Faces
results: In experiments, all three variants of the ForensicsForest family achieve high detection rates for GAN-generated faces and remain stable across settings. Compared with Hybrid ForensicsForest, Divide-and-Conquer ForensicsForest reduces memory overhead during training while maintaining a comparably high detection rate.
Abstract
The prominent progress in generative models has significantly improved the realism of generated faces, bringing serious concerns to society. Since recent GAN-generated faces are in high realism, the forgery traces have become more imperceptible, increasing the forensics challenge. To combat GAN-generated faces, many countermeasures based on Convolutional Neural Networks (CNNs) have been spawned due to their strong learning ability. In this paper, we rethink this problem and explore a new approach based on forest models instead of CNNs. Specifically, we describe a simple and effective forest-based method set called ForensicsForest Family to detect GAN-generated faces. The proposed ForensicsForest family is composed of three variants, which are ForensicsForest, Hybrid ForensicsForest and Divide-and-Conquer ForensicsForest respectively. ForensicsForest is a newly proposed Multi-scale Hierarchical Cascade Forest, which takes semantic, frequency and biology features as input, hierarchically cascades different levels of features for authenticity prediction, and then employs a multi-scale ensemble scheme that can comprehensively consider different levels of information to improve the performance further. Based on ForensicsForest, we develop Hybrid ForensicsForest, an extended version that integrates the CNN layers into models, to further refine the effectiveness of augmented features. Moreover, to reduce the memory cost in training, we propose Divide-and-Conquer ForensicsForest, which can construct a forest model using only a portion of the training samples. In the training stage, we train several candidate forest models using the subsets of training samples. Then a ForensicsForest is assembled by picking the suitable components from these candidate forest models...
摘要
生成模型的显著进步大幅提升了生成人脸的真实感,给社会带来了严重的担忧。由于近期GAN生成的人脸具有很高的真实性,伪造痕迹变得更加难以察觉,增加了取证难度。为了对抗GAN生成人脸,借助卷积神经网络(CNN)强大的学习能力,许多基于CNN的对策相继出现。在这篇论文中,我们重新思考这一问题,探索一种基于森林模型而非CNN的新方法。具体来说,我们提出了一个简单而有效的森林方法家族,称为ForensicsForest Family,用于检测GAN生成的人脸。该家族包含三个变体:ForensicsForest、Hybrid ForensicsForest和Divide-and-Conquer ForensicsForest。ForensicsForest是新提出的多尺度层次级联森林,以语义、频率和生物特征作为输入,对不同层次的特征进行层次级联以预测真伪,并采用多尺度集成方案综合考虑不同层次的信息以进一步提升性能。在此基础上,我们提出了Hybrid ForensicsForest,将CNN层集成到模型中,以进一步提升增强特征的效果。此外,为了降低训练时的内存开销,我们提出了Divide-and-Conquer ForensicsForest,只使用一部分训练样本来构建森林模型:在训练阶段,先用训练样本的子集训练若干候选森林模型,然后从这些候选模型中挑选合适的组件组装成一个ForensicsForest。
methods: 这篇论文提出了一种名为Curriculum Adaptation for Black-Box(CABB)的课程引导适应方法,逐步训练目标模型:先在目标数据中高置信度(干净)标签上训练,再在目标数据的噪声标签上训练。CABB使用Jensen-Shannon散度作为更好的干净/噪声样本分离标准,并通过双分支网络的协同训练来抑制确认偏差导致的错误累积。
results: 实验结果显示,CABB优于现有的黑盒域适应模型,并与白盒域适应模型相当。Abstract
Addressing the rising concerns of privacy and security, domain adaptation in the dark aims to adapt a black-box source trained model to an unlabeled target domain without access to any source data or source model parameters. The need for domain adaptation of black-box predictors becomes even more pronounced to protect intellectual property as deep learning based solutions are becoming increasingly commercialized. Current methods distill noisy predictions on the target data obtained from the source model to the target model, and/or separate clean/noisy target samples before adapting using traditional noisy label learning algorithms. However, these methods do not utilize the easy-to-hard learning nature of the clean/noisy data splits. Also, none of the existing methods are end-to-end, and require a separate fine-tuning stage and an initial warmup stage. In this work, we present Curriculum Adaptation for Black-Box (CABB) which provides a curriculum guided adaptation approach to gradually train the target model, first on target data with high confidence (clean) labels, and later on target data with noisy labels. CABB utilizes Jensen-Shannon divergence as a better criterion for clean-noisy sample separation, compared to the traditional criterion of cross entropy loss. Our method utilizes co-training of a dual-branch network to suppress error accumulation resulting from confirmation bias. The proposed approach is end-to-end trainable and does not require any extra finetuning stage, unlike existing methods. Empirical results on standard domain adaptation datasets show that CABB outperforms existing state-of-the-art black-box DA models and is comparable to white-box domain adaptation models.
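Since the clean/noisy split in CABB hinges on Jensen-Shannon divergence between predictions rather than cross-entropy loss, a minimal per-sample JS computation is sketched below, assuming the dual-branch target model produces two sets of logits; the median-based split at the end is only an illustrative stand-in for the paper's actual curriculum criterion.

```python
import torch
import torch.nn.functional as F

def js_divergence(logits_a, logits_b, eps=1e-8):
    """Per-sample Jensen-Shannon divergence between two branches' class
    distributions; low divergence suggests agreement, i.e. a 'clean' pseudo-label."""
    p = F.softmax(logits_a, dim=1)
    q = F.softmax(logits_b, dim=1)
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps).log() - (m + eps).log())).sum(dim=1)
    kl_qm = (q * ((q + eps).log() - (m + eps).log())).sum(dim=1)
    return 0.5 * kl_pm + 0.5 * kl_qm

logits_a = torch.randn(16, 10)   # branch 1 of the dual-branch target model
logits_b = torch.randn(16, 10)   # branch 2
jsd = js_divergence(logits_a, logits_b)
clean_mask = jsd < jsd.median()  # illustrative split; CABB's threshold may differ
```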
Training-Free Instance Segmentation from Semantic Image Segmentation Masks
results: 在两个复杂的dataset上和多种基线模型(包括CNN和Transformers)进行了广泛的实验,得到了与state-of-the-art Fully-supervised实例分割方法相当的结果,而无需额外的人工资源或计算成本增加。Abstract
In recent years, the development of instance segmentation has garnered significant attention in a wide range of applications. However, the training of a fully-supervised instance segmentation model requires costly both instance-level and pixel-level annotations. In contrast, weakly-supervised instance segmentation methods (i.e., with image-level class labels or point labels) struggle to satisfy the accuracy and recall requirements of practical scenarios. In this paper, we propose a novel paradigm for instance segmentation called training-free instance segmentation (TFISeg), which achieves instance segmentation results from image masks predicted using off-the-shelf semantic segmentation models. TFISeg does not require training a semantic or/and instance segmentation model and avoids the need for instance-level image annotations. Therefore, it is highly efficient. Specifically, we first obtain a semantic segmentation mask of the input image via a trained semantic segmentation model. Then, we calculate a displacement field vector for each pixel based on the segmentation mask, which can indicate representations belonging to the same class but different instances, i.e., obtaining the instance-level object information. Finally, instance segmentation results are obtained after being refined by a learnable category-agnostic object boundary branch. Extensive experimental results on two challenging datasets and representative semantic segmentation baselines (including CNNs and Transformers) demonstrate that TFISeg can achieve competitive results compared to the state-of-the-art fully-supervised instance segmentation methods without the need for additional human resources or increased computational costs. The code is available at: TFISeg
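To make the training-free idea concrete, the snippet below shows the most naive way to turn an off-the-shelf semantic mask into instance masks: per-class connected components. This is only a baseline for intuition, not TFISeg's method, which instead derives a per-pixel displacement field and refines the result with a category-agnostic boundary branch.

```python
import numpy as np
from scipy import ndimage

def naive_instances_from_semantic(mask):
    """Split a semantic mask (H x W, integer class ids, 0 = background) into
    instance masks via per-class connected components; touching instances of
    the same class are NOT separated by this baseline."""
    instances = []
    for cls in np.unique(mask):
        if cls == 0:
            continue
        components, num = ndimage.label(mask == cls)
        for i in range(1, num + 1):
            instances.append({"class_id": int(cls), "mask": components == i})
    return instances

semantic = np.zeros((64, 64), dtype=np.int32)
semantic[5:20, 5:20] = 1      # one object of class 1
semantic[30:50, 30:50] = 1    # a second, spatially separate object of class 1
print(len(naive_instances_from_semantic(semantic)))  # 2
```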
Decomposing and Coupling Saliency Map for Lesion Segmentation in Ultrasound Images
results: 结果显示,DC-Net能够在超声图像中实现高精度的病变分割,并且相比现有最先进的方法有明显提升。Abstract
Complex scenario of ultrasound image, in which adjacent tissues (i.e., background) share similar intensity with and even contain richer texture patterns than lesion region (i.e., foreground), brings a unique challenge for accurate lesion segmentation. This work presents a decomposition-coupling network, called DC-Net, to deal with this challenge in a (foreground-background) saliency map disentanglement-fusion manner. The DC-Net consists of decomposition and coupling subnets, and the former preliminarily disentangles original image into foreground and background saliency maps, followed by the latter for accurate segmentation under the assistance of saliency prior fusion. The coupling subnet involves three aspects of fusion strategies, including: 1) regional feature aggregation (via differentiable context pooling operator in the encoder) to adaptively preserve local contextual details with the larger receptive field during dimension reduction; 2) relation-aware representation fusion (via cross-correlation fusion module in the decoder) to efficiently fuse low-level visual characteristics and high-level semantic features during resolution restoration; 3) dependency-aware prior incorporation (via coupler) to reinforce foreground-salient representation with the complementary information derived from background representation. Furthermore, a harmonic loss function is introduced to encourage the network to focus more attention on low-confidence and hard samples. The proposed method is evaluated on two ultrasound lesion segmentation tasks, which demonstrates the remarkable performance improvement over existing state-of-the-art methods.
摘要
超声图像场景复杂,邻近组织(即背景)与病变区域(即前景)灰度相近,甚至包含比病变更丰富的纹理模式,给精确的病变分割带来了独特挑战。本工作提出了一种分解-耦合网络(DC-Net),以前景-背景显著图解耦-融合的方式来应对这一挑战。DC-Net由分解子网络和耦合子网络组成:前者先将原始图像初步解耦为前景和背景显著图,后者在显著先验融合的辅助下完成精确分割。耦合子网络包含三方面的融合策略:1)区域特征聚合(通过编码器中可微的上下文池化算子),在降维过程中以更大的感受野自适应地保留局部上下文细节;2)关系感知表示融合(通过解码器中的互相关融合模块),在分辨率恢复过程中高效融合低层视觉特征与高层语义特征;3)依赖感知先验引入(通过耦合器),利用背景表示中蕴含的互补信息强化前景显著表示。此外,还引入了一种调和损失函数,促使网络更多关注低置信度和困难样本。所提方法在两个超声病变分割任务上进行了评估,相比现有最先进方法取得了显著的性能提升。
WaterFlow: Heuristic Normalizing Flow for Underwater Image Enhancement and Beyond
results: 与最先进方法相比,所提出的水下图像增强方法在视觉质量和实际应用效果上均有提升。Abstract
Underwater images suffer from light refraction and absorption, which impairs visibility and interferes the subsequent applications. Existing underwater image enhancement methods mainly focus on image quality improvement, ignoring the effect on practice. To balance the visual quality and application, we propose a heuristic normalizing flow for detection-driven underwater image enhancement, dubbed WaterFlow. Specifically, we first develop an invertible mapping to achieve the translation between the degraded image and its clear counterpart. Considering the differentiability and interpretability, we incorporate the heuristic prior into the data-driven mapping procedure, where the ambient light and medium transmission coefficient benefit credible generation. Furthermore, we introduce a detection perception module to transmit the implicit semantic guidance into the enhancement procedure, where the enhanced images hold more detection-favorable features and are able to promote the detection performance. Extensive experiments prove the superiority of our WaterFlow, against state-of-the-art methods quantitatively and qualitatively.
摘要
水下图像受到光的折射和吸收影响,能见度下降,并干扰后续应用。现有的水下图像增强方法主要关注图像质量的改善,而忽略了对实际应用的影响。为了兼顾视觉质量与应用效果,我们提出了一种面向检测的启发式归一化流水下图像增强方法,称为WaterFlow。具体来说,我们首先构建了一个可逆映射,实现退化图像与其清晰图像之间的转换。考虑到可微性与可解释性,我们将启发式先验引入数据驱动的映射过程,其中环境光和介质透射系数有助于可信的生成。此外,我们引入了检测感知模块,将隐式的语义引导传递到增强过程中,使增强后的图像具有更有利于检测的特征,从而提升检测性能。大量实验从定量和定性两方面证明了WaterFlow相对于最先进方法的优越性。
Towards Discriminative Representation with Meta-learning for Colonoscopic Polyp Re-Identification
results: 实验结果表明,该方法明显优于当前最先进的方法,并且在不同摄像头和视角下都能保持较高的性能。Abstract
Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras and plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, traditional methods for object ReID directly adopting CNN models trained on the ImageNet dataset usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Additionally, these methods neglect to explore the potential of self-discrepancy among intra-class relations in the colonoscopic polyp dataset, which remains an open research problem in the medical community. To solve this dilemma, we propose a simple but effective training method named Colo-ReID, which can help our model to learn more general and discriminative knowledge based on the meta-learning strategy in scenarios with fewer samples. Based on this, a dynamic Meta-Learning Regulation mechanism called MLR is introduced to further boost the performance of polyp re-identification. To the best of our knowledge, this is the first attempt to leverage the meta-learning paradigm instead of traditional machine learning to effectively train deep models in the task of colonoscopic polyp re-identification. Empirical results show that our method significantly outperforms current state-of-the-art methods by a clear margin.
摘要
结肠镜息肉重识别旨在将同一个息肉与图库中由不同摄像头、不同视角拍摄的图像进行匹配,在计算机辅助诊断中对结直肠癌的预防和治疗具有重要作用。然而,传统的目标重识别方法直接采用在ImageNet数据集上训练的CNN模型,由于存在较大的领域差距,在结肠镜数据集上的检索性能往往不能令人满意。此外,这些方法忽略了挖掘结肠镜息肉数据集中类内关系的自差异潜力,这在医学界仍是一个开放的研究问题。为了解决这一困境,我们提出了一种简单而有效的训练方法Colo-ReID,基于元学习策略,帮助模型在样本较少的场景下学习更通用、更具判别性的知识。在此基础上,我们引入了一种动态元学习调节机制MLR,以进一步提升息肉重识别的性能。据我们所知,这是首次在结肠镜息肉重识别任务中利用元学习范式(而非传统机器学习)来有效训练深度模型。实验结果表明,我们的方法以明显优势超越了当前最先进的方法。
Detection and Segmentation of Cosmic Objects Based on Adaptive Thresholding and Back Propagation Neural Network
results: 这篇论文通过使用自适应阈值方法和反向传播神经网络,实现了对天体目标的高精度分割与检测。Abstract
Astronomical images provide information about the great variety of cosmic objects in the Universe. Due to the large volumes of data, the presence of innumerable bright point sources as well as noise within the frame and the spatial gap between objects and satellite cameras, it is a challenging task to classify and detect the celestial objects. We propose an Adaptive Thresholding Method (ATM) based segmentation and Back Propagation Neural Network (BPNN) based cosmic object detection including a well-structured series of pre-processing steps designed to enhance segmentation and detection.
摘要
天文图像提供了宇宙中各类天体的信息。由于数据量庞大、画面中存在无数明亮点源和噪声,以及天体与卫星相机之间的空间距离,对天体进行分类和检测是一项具有挑战性的任务。我们提出了一种基于自适应阈值方法(ATM)的分割和基于反向传播神经网络(BPNN)的天体检测方法,并包含一系列精心设计的预处理步骤,用于提升分割和检测效果。
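As a rough illustration of the ATM-style segmentation step, the sketch below applies a local (adaptive) threshold to an astronomical frame and collects candidate blobs for a downstream classifier such as the paper's BPNN. The file name, block size, offset, and area filter are illustrative assumptions, not the parameters reported in the paper.

```python
import cv2

# Hypothetical input frame; any single-channel astronomical image would do.
frame = cv2.imread("sky_frame.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.medianBlur(frame, 3)  # pre-processing: suppress shot noise

binary = cv2.adaptiveThreshold(
    frame, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # threshold computed per local neighbourhood
    cv2.THRESH_BINARY,
    blockSize=31,                    # neighbourhood size (illustrative)
    C=-5,                            # keep only pixels well above the local level
)

# Connected components give candidate objects to feed the BPNN classifier.
num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
candidates = [stats[i] for i in range(1, num) if stats[i, cv2.CC_STAT_AREA] > 10]
print(f"{len(candidates)} candidate objects")
```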
Continual Domain Adaptation on Aerial Images under Gradually Degrading Weather
results: 研究发现现有基于缓冲的持续域适应方法在适应过程中存在稳定性问题,并提出了一种简单的梯度归一化方法来缓解训练不稳定。Abstract
Domain adaptation (DA) strives to mitigate the domain gap between the source domain where a model is trained, and the target domain where the model is deployed. When a deep learning model is deployed on an aerial platform, it may face gradually degrading weather conditions during operation, leading to widening domain gaps between the training data and the encountered evaluation data. We synthesize two such gradually worsening weather conditions on real images from two existing aerial imagery datasets, generating a total of four benchmark datasets. Under the continual, or test-time adaptation setting, we evaluate three DA models on our datasets: a baseline standard DA model and two continual DA models. In such setting, the models can access only one small portion, or one batch of the target data at a time, and adaptation takes place continually, and over only one epoch of the data. The combination of the constraints of continual adaptation, and gradually deteriorating weather conditions provide the practical DA scenario for aerial deployment. Among the evaluated models, we consider both convolutional and transformer architectures for comparison. We discover stability issues during adaptation for existing buffer-fed continual DA methods, and offer gradient normalization as a simple solution to curb training instability.
摘要
域适应(DA)旨在缩小模型训练所在的源域与模型部署所在的目标域之间的域间差距。当深度学习模型部署在航空平台上时,运行过程中可能遇到逐渐恶化的天气条件,导致训练数据与实际评估数据之间的域间差距不断扩大。我们基于两个现有航拍图像数据集的真实图像,合成了两种逐渐恶化的天气条件,共生成四个基准数据集。在持续(测试时)适应设定下,我们在这些数据集上评估了三种DA模型:一个标准DA基线模型和两个持续DA模型。在该设定下,模型每次只能访问目标数据的一小部分(一个批次),适应过程持续进行,且仅遍历数据一个epoch。持续适应的约束与逐渐恶化的天气条件相结合,构成了航空部署中贴近实际的DA场景。在所评估的模型中,我们同时考虑了卷积架构和Transformer架构以作比较。我们发现现有基于缓冲的持续DA方法在适应过程中存在稳定性问题,并提出梯度归一化作为一种简单的解决方案来抑制训练不稳定。
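The gradient-normalization fix mentioned in the results can be read, in its simplest form, as rescaling the global gradient to unit norm before each optimizer step during test-time adaptation. The sketch below shows that simple variant; the paper's exact formulation may differ.

```python
import torch

def normalize_gradients(model, eps=1e-12):
    """Rescale the model's global gradient to unit L2 norm in place."""
    total_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_sq += p.grad.pow(2).sum().item()
    scale = 1.0 / (total_sq ** 0.5 + eps)
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)

# Usage inside the continual (one-epoch, batch-at-a-time) adaptation loop:
# loss.backward()
# normalize_gradients(model)   # curbs gradient spikes that destabilize adaptation
# optimizer.step(); optimizer.zero_grad()
```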
Survey on Computer Vision Techniques for Internet-of-Things Devices
results: 这篇论文总结了各种低功率技术的优缺点和未解决的问题,以及它们在 convolutional 和 transformer DNN 中的应用。Abstract
Deep neural networks (DNNs) are state-of-the-art techniques for solving most computer vision problems. DNNs require billions of parameters and operations to achieve state-of-the-art results. This requirement makes DNNs extremely compute-, memory-, and energy-hungry, and consequently difficult to deploy on small battery-powered Internet-of-Things (IoT) devices with limited computing resources. Deployment of DNNs on Internet-of-Things devices, such as traffic cameras, can improve public safety by enabling applications such as automatic accident detection and emergency response. Through this paper, we survey the recent advances in low-power and energy-efficient DNN implementations that improve the deployability of DNNs without significantly sacrificing accuracy. In general, these techniques either reduce the memory requirements, the number of arithmetic operations, or both. The techniques can be divided into three major categories: neural network compression, network architecture search and design, and compiler and graph optimizations. In this paper, we survey low-power techniques for both convolutional and transformer DNNs, and summarize the advantages, disadvantages, and open research problems.
摘要
深度神经网络(DNN)是解决大多数计算机视觉问题的最先进技术。DNN需要数十亿的参数和运算才能达到最先进的效果,这使得其在计算、存储和能耗方面需求极高,因此难以部署在计算资源有限、依靠电池供电的小型物联网(IoT)设备上。然而,将DNN部署到交通摄像头等物联网设备上,可以通过自动事故检测和紧急响应等应用提升公共安全。本文综述了近年来低功耗、高能效的DNN实现方法,这些方法在不明显牺牲精度的前提下提升DNN的可部署性。总体而言,这些技术通过减少内存需求、减少算术运算量或两者兼顾来实现,可分为三大类:神经网络压缩、网络架构搜索与设计,以及编译器与计算图优化。本文同时综述了卷积网络和Transformer网络的低功耗技术,并总结了它们的优缺点和尚未解决的研究问题。
Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion
for: This paper aims to protect the generative model from generating inappropriate or dangerous content, such as specific intellectual property (IP), human faces, and various artistic styles, by shielding the unwanted concepts from the model's weights.
methods: The proposed method, called Degeneration-Tuning (DT), uses Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, and guides the text-to-image diffusion model to generate meaningless content when such textual concepts are provided as input.
results: The proposed DT method effectively shields the unwanted concepts from the model's weights without significantly impacting the generative quality of other contents. The FID and IS scores of the model on COCO-30K exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and 38.25, respectively, which outperforms previous methods.Abstract
Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on corresponding textual concepts information. This includes specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a widely used method for content removal, frequently fails to conceal this content due to inherent limitations in its inference logic. In this work, we propose a novel strategy named \textbf{Degeneration-Tuning (DT)} to shield contents of unwanted concepts from SD weights. By utilizing Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, we guide SD to generate meaningless content when such textual concepts are provided as input. As this adaptation occurs at the level of the model's weights, the SD, after DT, can be grafted onto other conditional diffusion frameworks like ControlNet to shield unwanted concepts. In addition to qualitatively showcasing the effectiveness of our DT method in protecting various types of concepts, a quantitative comparison of the SD before and after DT indicates that the DT method does not significantly impact the generative quality of other contents. The FID and IS scores of the model on COCO-30K exhibit only minor changes after DT, shifting from 12.61 and 39.20 to 13.04 and 38.25, respectively, which clearly outperforms the previous methods.
摘要
由于训练数据的内容不受限制,大型文本到图像扩散模型(如Stable Diffusion,SD)能够根据相应的文本概念信息生成可能涉及版权或危险内容的图像,包括特定的知识产权(IP)、人脸以及各种艺术风格。然而,负面提示(Negative Prompt)这种被广泛使用的内容移除方法,由于其推理逻辑的固有局限,常常无法隐藏这些内容。在这项工作中,我们提出了一种名为退化调优(Degeneration-Tuning,DT)的新策略,用于从SD权重中屏蔽不需要的概念。通过使用打乱网格(Scrambled Grid)重构不良概念与其对应图像域之间的关联,我们引导SD在输入这些文本概念时生成无意义的内容。由于这种调整发生在模型权重层面,经过DT的SD可以嫁接到ControlNet等其他条件扩散框架上以屏蔽不需要的概念。除了定性展示DT方法在保护各类概念上的有效性外,DT前后SD的定量比较表明,该方法不会显著影响其他内容的生成质量:模型在COCO-30K上的FID和IS分数在DT后仅有微小变化,分别从12.61和39.20变为13.04和38.25,明显优于以往的方法。
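A minimal sketch of constructing a scrambled-grid image, i.e. cutting an image into a patch lattice and shuffling the patches so that the unwanted concept is paired with meaningless content during tuning. The grid size is an assumption; how DT wires such targets into the diffusion training objective is not shown here.

```python
import torch

def scramble_grid(image, grid=4, generator=None):
    """Cut an image (C, H, W) into a grid x grid patch lattice and shuffle the
    patches, producing a 'meaningless' target of the same size."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = (image[:, : ph * grid, : pw * grid]
               .unfold(1, ph, ph).unfold(2, pw, pw)      # (C, grid, grid, ph, pw)
               .reshape(c, grid * grid, ph, pw))
    perm = torch.randperm(grid * grid, generator=generator)
    patches = patches[:, perm]
    rows = [torch.cat(list(patches[:, r * grid:(r + 1) * grid].unbind(1)), dim=2)
            for r in range(grid)]
    return torch.cat(rows, dim=1)

scrambled = scramble_grid(torch.rand(3, 256, 256), grid=4)
print(scrambled.shape)  # torch.Size([3, 256, 256])
```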
Virtual histological staining of unlabeled autopsy tissue
paper_authors: Yuzhu Li, Nir Pillar, Jingxi Li, Tairan Liu, Di Wu, Songyu Sun, Guangdong Ma, Kevin de Haan, Luzhe Huang, Sepehr Hamidi, Anatoly Urisman, Tal Keidar Haran, William Dean Wallace, Jonathan E. Zuckerman, Aydogan Ozcan
results: 研究人员发现,使用这种虚拟染色技术可以快速、低成本地生成高质量的H&E染色图像,即使是COVID-19样本中受到严重自溶和细胞死亡影响的情况下,传统组织化学染色方法也无法提供一致的染色质量。此外,这种虚拟染色技术还可以扩展到坏死组织,能够快速生成高质量的H&E染色图像,减少检查尸检组织所需的人力、成本和基础设施。Abstract
Histological examination is a crucial step in an autopsy; however, the traditional histochemical staining of post-mortem samples faces multiple challenges, including the inferior staining quality due to autolysis caused by delayed fixation of cadaver tissue, as well as the resource-intensive nature of chemical staining procedures covering large tissue areas, which demand substantial labor, cost, and time. These challenges can become more pronounced during global health crises when the availability of histopathology services is limited, resulting in further delays in tissue fixation and more severe staining artifacts. Here, we report the first demonstration of virtual staining of autopsy tissue and show that a trained neural network can rapidly transform autofluorescence images of label-free autopsy tissue sections into brightfield equivalent images that match hematoxylin and eosin (H&E) stained versions of the same samples, eliminating autolysis-induced severe staining artifacts inherent in traditional histochemical staining of autopsied tissue. Our virtual H&E model was trained using >0.7 TB of image data and a data-efficient collaboration scheme that integrates the virtual staining network with an image registration network. The trained model effectively accentuated nuclear, cytoplasmic and extracellular features in new autopsy tissue samples that experienced severe autolysis, such as COVID-19 samples never seen before, where the traditional histochemical staining failed to provide consistent staining quality. This virtual autopsy staining technique can also be extended to necrotic tissue, and can rapidly and cost-effectively generate artifact-free H&E stains despite severe autolysis and cell death, also reducing labor, cost and infrastructure requirements associated with the standard histochemical staining.
摘要
组织学检查是尸检中至关重要的一步;然而,对死后样本进行传统组织化学染色面临多重挑战,包括尸体组织固定延迟引起自溶而导致的染色质量下降,以及对大面积组织进行化学染色所需的大量人力、成本和时间。在全球公共卫生危机期间,组织病理服务供给有限,这些挑战会更加突出,导致组织固定进一步延迟、染色伪影更加严重。在此,我们首次展示了对尸检组织的虚拟染色,并表明训练好的神经网络能够将无标记尸检组织切片的自发荧光图像快速转换为与同一样本的苏木精-伊红(H&E)染色版本相匹配的明场等效图像,消除了传统组织化学染色中由自溶引起的严重染色伪影。我们的虚拟H&E模型使用了超过0.7 TB的图像数据,并采用将虚拟染色网络与图像配准网络相结合的数据高效协作方案进行训练。训练后的模型能够在经历严重自溶的新尸检组织样本(例如从未见过的COVID-19样本)中有效突出细胞核、细胞质和细胞外特征,而传统组织化学染色在这些样本上无法提供一致的染色质量。这种虚拟尸检染色技术还可以扩展到坏死组织,即便在严重自溶和细胞死亡的情况下,也能快速、低成本地生成无伪影的H&E染色,同时减少标准组织化学染色所需的人力、成本和基础设施。
A Novel Cross-Perturbation for Single Domain Generalization
for: The paper aims to enhance the ability of a model to generalize to unknown domains when trained on a single source domain, by using cross-perturbation and feature-level perturbation methods.
methods: The paper proposes a simple yet effective cross-perturbation method called CPerb, which utilizes both horizontal and vertical operations to increase data diversity and learn domain-invariant features. Additionally, the paper proposes a novel feature-level perturbation method called MixPatch, which exploits local image style information to further diversify the training data.
results: The paper achieves state-of-the-art performance on various benchmark datasets, demonstrating the effectiveness of the proposed method in enhancing the generalization capability of the model.Abstract
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain. However, the limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance. To address this, data perturbation (augmentation) has emerged as a crucial method to increase data diversity. Nevertheless, existing perturbation methods often focus on either image-level or feature-level perturbations independently, neglecting their synergistic effects. To overcome these limitations, we propose CPerb, a simple yet effective cross-perturbation method. Specifically, CPerb utilizes both horizontal and vertical operations. Horizontally, it applies image-level and feature-level perturbations to enhance the diversity of the training data, mitigating the issue of limited diversity in single-source domains. Vertically, it introduces multi-route perturbation to learn domain-invariant features from different perspectives of samples with the same semantic category, thereby enhancing the generalization capability of the model. Additionally, we propose MixPatch, a novel feature-level perturbation method that exploits local image style information to further diversify the training data. Extensive experiments on various benchmark datasets validate the effectiveness of our method.
摘要
单域泛化旨在提升模型仅在单一源域上训练时对未知域的泛化能力。然而,训练数据多样性的不足阻碍了域不变特征的学习,导致泛化性能受损。为此,数据扰动(增强)已成为提升数据多样性的关键手段。然而,现有的扰动方法通常只独立关注图像级或特征级扰动,忽视了二者的协同作用。为了克服这些局限,我们提出了CPerb,一种简单而有效的交叉扰动方法。具体来说,CPerb同时利用水平和垂直两类操作:在水平方向上,通过图像级和特征级扰动来增强训练数据的多样性,缓解单源域多样性不足的问题;在垂直方向上,引入多路径扰动,从同一语义类别样本的不同视角学习域不变特征,从而增强模型的泛化能力。此外,我们提出了MixPatch,一种利用局部图像风格信息进一步丰富训练数据的新型特征级扰动方法。在多个基准数据集上的大量实验验证了我们方法的有效性。
ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
for: This paper aims to propose a novel image manipulation method that can accurately reflect human intentions without relying on external cross-modal language information.
methods: The proposed method, called ImageBrush, learns visual instructions for image editing by employing a pair of transformation images as visual instructions. The method uses a diffusion-based inpainting approach to capture the underlying intentions from visual demonstrations and apply them to a new image.
results: The proposed method generates engaging manipulation results that conform to the transformations entailed in the demonstrations. The method also exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation, and video inpainting.Abstract
While language-guided image manipulation has made remarkable progress, the challenge of how to instruct the manipulation process faithfully reflecting human intentions persists. An accurate and comprehensive description of a manipulation task using natural language is laborious and sometimes even impossible, primarily due to the inherent uncertainty and ambiguity present in linguistic expressions. Is it feasible to accomplish image manipulation without resorting to external cross-modal language information? If this possibility exists, the inherent modality gap would be effortlessly eliminated. In this paper, we propose a novel manipulation methodology, dubbed ImageBrush, that learns visual instructions for more accurate image editing. Our key idea is to employ a pair of transformation images as visual instructions, which not only precisely captures human intention but also facilitates accessibility in real-world scenarios. Capturing visual instructions is particularly challenging because it involves extracting the underlying intentions solely from visual demonstrations and then applying this operation to a new image. To address this challenge, we formulate visual instruction learning as a diffusion-based inpainting problem, where the contextual information is fully exploited through an iterative process of generation. A visual prompting encoder is carefully devised to enhance the model's capacity in uncovering human intent behind the visual instructions. Extensive experiments show that our method generates engaging manipulation results conforming to the transformations entailed in demonstrations. Moreover, our model exhibits robust generalization capabilities on various downstream tasks such as pose transfer, image translation and video inpainting.
摘要
尽管语言引导的图像编辑已取得显著进展,但如何让编辑过程忠实反映人类意图仍是一个难题。使用自然语言对编辑任务进行准确而完整的描述既费力,有时甚至不可能,这主要源于语言表达中固有的不确定性和歧义。是否可以不借助外部跨模态语言信息来完成图像编辑?如果这一可能性存在,模态间的鸿沟便可被轻松消除。在这篇论文中,我们提出了一种新的编辑方法,名为ImageBrush,它学习视觉指令以实现更精确的图像编辑。我们的核心思想是使用一对变换图像作为视觉指令,这不仅能准确表达人类意图,也便于在真实场景中使用。捕获视觉指令尤其具有挑战性,因为这需要仅从视觉示例中提取潜在意图,并将该操作应用到新图像上。为此,我们将视觉指令学习表述为基于扩散的图像修复问题,通过迭代生成过程充分利用上下文信息,并精心设计了视觉提示编码器,以增强模型揭示视觉指令背后人类意图的能力。大量实验表明,我们的方法能够生成符合示例中所蕴含变换的编辑结果,并在姿态迁移、图像翻译和视频修复等多种下游任务上表现出稳健的泛化能力。
Deep Learning Approaches in Pavement Distress Identification: A Review
results: 论文发现,使用UAV和深度学习算法可以有效地检测和分类公路损害,并且在2D图像处理方面有所进步,但3D图像处理还存在一些挑战。Abstract
This paper presents a comprehensive review of recent advancements in image processing and deep learning techniques for pavement distress detection and classification, a critical aspect in modern pavement management systems. The conventional manual inspection process conducted by human experts is gradually being superseded by automated solutions, leveraging machine learning and deep learning algorithms to enhance efficiency and accuracy. The ability of these algorithms to discern patterns and make predictions based on extensive datasets has revolutionized the domain of pavement distress identification. The paper investigates the integration of unmanned aerial vehicles (UAVs) for data collection, offering unique advantages such as aerial perspectives and efficient coverage of large areas. By capturing high-resolution images, UAVs provide valuable data that can be processed using deep learning algorithms to detect and classify various pavement distresses effectively. While the primary focus is on 2D image processing, the paper also acknowledges the challenges associated with 3D images, such as sensor limitations and computational requirements. Understanding these challenges is crucial for further advancements in the field. The findings of this review significantly contribute to the evolution of pavement distress detection, fostering the development of efficient pavement management systems. As automated approaches continue to mature, the implementation of deep learning techniques holds great promise in ensuring safer and more durable road infrastructure for the benefit of society.
Addressing Uncertainty in Imbalanced Histopathology Image Classification of HER2 Breast Cancer: An interpretable Ensemble Approach with Threshold Filtered Single Instance Evaluation (SIE)
paper_authors: Md Sakib Hossain Shovon, M. F. Mridha, Khan Md Hasib, Sultan Alfarhood, Mejdl Safran, Dunren Che
for: This paper aims to develop an accurate and robust method for diagnosing breast cancer (BC) subtypes based on the expression of Human Epidermal Growth Factor Receptor (HER2).
methods: The proposed method utilizes an ensemble approach that combines DenseNet201 and Xception feature extractors, followed by a single instance evaluation (SIE) technique to determine different confidence levels and adjust the decision boundary among the imbalanced classes.
results: The proposed approach, called DenseNet201-Xception-SIE, achieved an accuracy of 97.12% on H&E data and 97.56% on IHC data, outperforming all other existing state-of-the-art models. The use of Grad-CAM and Guided Grad-CAM provided insights into how the model works and makes decisions on the histopathology dataset.Abstract
Breast Cancer (BC) is among women's most lethal health concerns. Early diagnosis can alleviate the mortality rate by helping patients make efficient treatment decisions. Human Epidermal Growth Factor Receptor 2 (HER2) has become one of the most lethal subtypes of BC. According to the College of American Pathologists/American Society of Clinical Oncology (CAP/ASCO), the severity level of HER2 expression can be classified within the 0 to 3+ range. HER2 can be detected effectively from immunohistochemical (IHC) and hematoxylin & eosin (HE) images of different classes such as 0, 1+, 2+, and 3+. An ensemble approach integrated with a threshold-filtered single instance evaluation (SIE) technique has been proposed in this study to diagnose BC from the multi-categorical expression of HER2 subtypes. Initially, DenseNet201 and Xception have been ensembled into a single classifier as feature extractors with an effective combination of global average pooling, dropout layer, dense layer with a swish activation function, an l2 regularizer, batch normalization, etc. After that, the extracted features have been processed through single instance evaluation (SIE) to determine different confidence levels and adjust the decision boundary among the imbalanced classes. This study has been conducted on the BC immunohistochemical (BCI) dataset, which is classified by pathologists into four stages of HER2 BC. The proposed approach, known as DenseNet201-Xception-SIE with a threshold value of 0.7, surpassed all other existing state-of-the-art models with an accuracy of 97.12%, precision of 97.15%, and recall of 97.68% on H&E data, and an accuracy of 97.56%, precision of 97.57%, and recall of 98.00% on IHC data, respectively, a momentous improvement. Finally, Grad-CAM and Guided Grad-CAM have been employed in this study to interpret how the TL-based model works on the histopathology dataset and makes decisions from the data.
摘要
乳腺癌(BC)是威胁女性健康的最致命疾病之一。早期诊断有助于患者做出有效的治疗决策,从而降低死亡率。人表皮生长因子受体2(HER2)已成为乳腺癌中最凶险的亚型之一。根据美国病理学家学会/美国临床肿瘤学会(CAP/ASCO)的标准,HER2表达的严重程度可分为0到3+的范围。HER2可以从免疫组化(IHC)和苏木精-伊红(HE)染色图像中有效检测,对应0、1+、2+、3+等不同类别。本研究提出了一种结合阈值过滤的单实例评估(SIE)技术的集成方法,用于从HER2亚型的多类别表达中诊断乳腺癌。首先,将DenseNet201和Xception集成为单一分类器作为特征提取器,并有效组合全局平均池化、dropout层、带swish激活函数的全连接层、L2正则化、批归一化等。随后,提取的特征经过单实例评估(SIE)处理,以确定不同的置信水平并调整不平衡类别之间的决策边界。本研究在由病理学家划分为四个HER2阶段的乳腺癌免疫组化(BCI)数据集上进行。所提出的DenseNet201-Xception-SIE方法(阈值取0.7)在HE数据上达到97.12%的准确率、97.15%的精确率和97.68%的召回率,在IHC数据上达到97.56%的准确率、97.57%的精确率和98.00%的召回率,明显超越了现有的所有最先进模型。最后,本研究使用Grad-CAM和Guided Grad-CAM来解释基于迁移学习的模型如何在组织病理数据集上工作并做出决策。
Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body Reconstruction
results: 实验显示,KNOWN 的人体重建比先前的弱监督方法高,特别是在困难的少数图像上。Abstract
While 3D body reconstruction methods have made remarkable progress recently, it remains difficult to acquire the sufficiently accurate and numerous 3D supervisions required for training. In this paper, we propose KNOWN, a framework that effectively utilizes body KNOWledge and uNcertainty modeling to compensate for insufficient 3D supervisions. KNOWN exploits a comprehensive set of generic body constraints derived from well-established body knowledge. These generic constraints precisely and explicitly characterize the reconstruction plausibility and enable 3D reconstruction models to be trained without any 3D data. Moreover, existing methods typically use images from multiple datasets during training, which can result in data noise (e.g., inconsistent joint annotation) and data imbalance (e.g., minority images representing unusual poses or captured from challenging camera views). KNOWN solves these problems through a novel probabilistic framework that models both aleatoric and epistemic uncertainty. Aleatoric uncertainty is encoded in a robust Negative Log-Likelihood (NLL) training loss, while epistemic uncertainty is used to guide model refinement. Experiments demonstrate that KNOWN's body reconstruction outperforms prior weakly-supervised approaches, particularly on the challenging minority images.
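The aleatoric part of such a formulation is typically realized by letting the network predict a mean and a (log-)variance and training with a Gaussian negative log-likelihood; the sketch below shows that standard loss. KNOWN's robust NLL variant and its epistemic-uncertainty handling may add further terms, so treat this only as the common baseline form, with illustrative tensor shapes.

```python
import torch

def gaussian_nll(pred_mean, pred_log_var, target):
    """Heteroscedastic Gaussian NLL: large predicted variance down-weights the
    squared residual but is penalized by the log-variance term."""
    inv_var = torch.exp(-pred_log_var)
    return 0.5 * (inv_var * (target - pred_mean) ** 2 + pred_log_var).mean()

# e.g. 2D keypoint targets with per-keypoint predicted uncertainty (shapes assumed)
pred_mean = torch.randn(8, 24, 2, requires_grad=True)
pred_log_var = torch.zeros(8, 24, 2, requires_grad=True)
target = torch.randn(8, 24, 2)
loss = gaussian_nll(pred_mean, pred_log_var, target)
loss.backward()
```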
Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking
methods: 除速度方向之外,本文还引入置信度状态(confidence state)和高度状态(height state)作为潜在的弱线索,用以弥补强线索在遮挡和聚集情形下的失效。
results: 与现有方法相比,本方法在多种跟踪器和场景中均表现更优,并以即插即用、无需训练的方式带来了显著且一致的改进。特别是在交互和遮挡频繁且严重的DanceTrack基准上,本方法取得了更优的性能。Abstract
Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, both spatial and appearance information will become ambiguous simultaneously due to the high overlap between objects. In this paper, we demonstrate that this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence state and height state as potential weak cues. With superior performance, our method still maintains Simple, Online and Real-Time (SORT) characteristics. Furthermore, our method shows strong generalization for diverse trackers and scenarios in a plug-and-play and training-free manner. Significant and consistent improvements are observed when applying our method to 5 different representative trackers. Further, by leveraging both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack where interaction and occlusion are frequent and severe. The code and models are available at https://github.com/ymzis69/HybirdSORT.
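One way to picture the weak-cue idea is as extra additive terms in the track-detection association cost next to the usual IoU cost: a confidence-gap term and a relative height-gap term that stay informative when boxes overlap heavily. The sketch below is only that intuition with made-up weights; Hybrid-SORT's actual cue definitions and weighting differ.

```python
import numpy as np

def weak_cue_cost(iou, det_conf, trk_conf, det_h, trk_h, w_conf=0.3, w_h=0.3):
    """Association cost (tracks x detections) combining the strong IoU cue with
    two weak cues: confidence-state and height-state consistency."""
    conf_cost = np.abs(trk_conf[:, None] - det_conf[None, :])
    h_cost = np.abs(trk_h[:, None] - det_h[None, :]) / np.maximum(trk_h[:, None], det_h[None, :])
    return (1.0 - iou) + w_conf * conf_cost + w_h * h_cost

iou = np.array([[0.6, 0.1], [0.2, 0.5]])  # 2 tracks x 2 detections
cost = weak_cue_cost(
    iou,
    det_conf=np.array([0.9, 0.4]), trk_conf=np.array([0.85, 0.5]),
    det_h=np.array([120.0, 80.0]), trk_h=np.array([118.0, 90.0]),
)
print(cost)  # feed into scipy.optimize.linear_sum_assignment for matching
```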
Accessibility and Inclusiveness of New Information and Communication Technologies for Disabled Users and Content Creators in the Metaverse
results: 该论文的研究结果表明,身体残障人士积极参与Metaverse平台的设计和开发,是促进包容性的关键。同时,论文还强调需要进一步的研究与合作,以建立有助于残障人士参与Metaverse项目的标准和规范。Abstract
Despite the proliferation of Blockchain Metaverse projects, the inclusion of physically disabled individuals in the Metaverse remains distant, with limited standards and regulations in place. However, the article proposes a concept of the Metaverse that leverages emerging technologies, such as Virtual and Augmented Reality, and the Internet of Things, to enable greater engagement of disabled creatives. This approach aims to enhance inclusiveness in the Metaverse landscape. Based on the findings, the paper concludes that the active involvement of physically disabled individuals in the design and development of Metaverse platforms is crucial for promoting inclusivity. The proposed framework for accessibility and inclusiveness in Virtual, Augmented, and Mixed realities of decentralised Metaverses provides a basis for the meaningful participation of disabled creatives. The article emphasises the importance of addressing the mechanisms for art production by individuals with disabilities in the emerging Metaverse landscape. Additionally, it highlights the need for further research and collaboration to establish standards and regulations that facilitate the inclusion of physically disabled individuals in Metaverse projects.
摘要
尽管区块链元宇宙项目不断涌现,身体残障人士在元宇宙中的融入仍然遥远,相关标准和规范也十分有限。然而,本文提出了一种借助虚拟现实、增强现实和物联网等新兴技术的元宇宙概念,以促进残障创作者更深入的参与,从而提升元宇宙生态的包容性。基于研究发现,本文认为,身体残障人士积极参与元宇宙平台的设计与开发,是促进包容性的关键。文中提出的面向去中心化元宇宙中虚拟、增强与混合现实的无障碍与包容性框架,为残障创作者的切实参与提供了基础。本文强调了在新兴元宇宙环境中解决残障人士艺术创作机制问题的重要性,并指出需要进一步的研究与协作,以建立促进身体残障人士参与元宇宙项目的标准和规范。
High-Fidelity Eye Animatable Neural Radiance Fields for Human Face
results: 通过在ETH-XGaze数据集上的实验,证明模型能够生成具有精确眼球转动和眼周非刚性形变的高保真图像。此外,使用渲染出的图像还能有效提升视线估计性能。Abstract
Face rendering using neural radiance fields (NeRF) is a rapidly developing research area in computer vision. While recent methods primarily focus on controlling facial attributes such as identity and expression, they often overlook the crucial aspect of modeling eyeball rotation, which holds importance for various downstream tasks. In this paper, we aim to learn a face NeRF model that is sensitive to eye movements from multi-view images. We address two key challenges in eye-aware face NeRF learning: how to effectively capture eyeball rotation for training and how to construct a manifold for representing eyeball rotation. To accomplish this, we first fit FLAME, a well-established parametric face model, to the multi-view images considering multi-view consistency. Subsequently, we introduce a new Dynamic Eye-aware NeRF (DeNeRF). DeNeRF transforms 3D points from different views into a canonical space to learn a unified face NeRF model. We design an eye deformation field for the transformation, including rigid transformation, e.g., eyeball rotation, and non-rigid transformation. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our model is capable of generating high-fidelity images with accurate eyeball rotation and non-rigid periocular deformation, even under novel viewing angles. Furthermore, we show that utilizing the rendered images can effectively enhance gaze estimation performance.
摘要
使用神经辐射场(NeRF)进行人脸渲染是计算机视觉领域一个快速发展的研究方向。近期的方法主要关注身份和表情等人脸属性的控制,却往往忽略了对眼球转动的建模,而这对多种下游任务十分重要。在这篇论文中,我们希望从多视角图像中学习一个对眼球运动敏感的人脸NeRF模型。我们解决了眼球感知人脸NeRF学习中的两个关键挑战:如何在训练中有效捕捉眼球转动,以及如何构建表示眼球转动的流形。为此,我们首先在考虑多视角一致性的前提下,将成熟的参数化人脸模型FLAME拟合到多视角图像上。随后,我们引入了一种新的动态眼球感知NeRF(DeNeRF)。DeNeRF将来自不同视角的3D点变换到一个规范空间,以学习统一的人脸NeRF模型。我们为该变换设计了眼部形变场,包括刚性变换(如眼球转动)和非刚性变换。在ETH-XGaze数据集上的实验表明,我们的模型即使在新的视角下,也能够生成具有精确眼球转动和眼周非刚性形变的高保真图像。此外,我们还证明了利用渲染图像可以有效提升视线估计性能。
Decomposition Ascribed Synergistic Learning for Unified Image Restoration
paper_authors: Jinghao Zhang, Jie Huang, Man Zhou, Chongyi Li, Feng Zhao
for: 本研究旨在为实际应用 scenarios 提供多种图像异常处理方法的共同学习机制。
methods: 我们通过对各种异常信息进行分解,使用singular value decomposition (SVD) 分析不同类型的异常信息,并将其分为两类:singular vector 主导和singular value 主导。
results: 我们提出了一种基于 DASL 的图像异常处理方法,包括 Singular VEctor Operator (SVEO) 和 Singular VAlue Operator (SVAO) 两种有效操作,可以轻松地搭配现有的卷积图像修复框架。我们还提出了一种协调分解损失函数。对于五种混合图像修复任务进行了广泛的实验,证明了我们的方法的效果。Abstract
Learning to restore multiple image degradations within a single model is quite beneficial for real-world applications. Nevertheless, existing works typically concentrate on regarding each degradation independently, while their relationship has been less exploited to ensure the synergistic learning. To this end, we revisit the diverse degradations through the lens of singular value decomposition, with the observation that the decomposed singular vectors and singular values naturally undertake the different types of degradation information, dividing various restoration tasks into two groups, i.e., singular vector dominated and singular value dominated. The above analysis renders a more unified perspective to ascribe the diverse degradations, compared to previous task-level independent learning. The dedicated optimization of degraded singular vectors and singular values inherently utilizes the potential relationship among diverse restoration tasks, attributing to the Decomposition Ascribed Synergistic Learning (DASL). Specifically, DASL comprises two effective operators, namely, Singular VEctor Operator (SVEO) and Singular VAlue Operator (SVAO), to favor the decomposed optimization, which can be lightly integrated into existing convolutional image restoration backbone. Moreover, the congruous decomposition loss has been devised for auxiliary. Extensive experiments on blended five image restoration tasks demonstrate the effectiveness of our method, including image deraining, image dehazing, image denoising, image deblurring, and low-light image enhancement.
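The grouping rests on an SVD view of a feature map: singular vectors and singular values carry different kinds of degradation information. The sketch below only shows the decomposition itself and a hand-written "singular value dominated" edit; the actual SVEO/SVAO operators are learned modules inside the restoration backbone, and the shapes used here are arbitrary.

```python
import torch

feat = torch.randn(64, 32, 32)                         # (C, H, W) feature map
mat = feat.reshape(64, -1)                             # flatten spatial dims -> (C, H*W)
U, S, Vh = torch.linalg.svd(mat, full_matrices=False)  # U: (C, C), S: (C,), Vh: (C, H*W)

# "Singular value dominated" manipulation: rescale the spectrum, keep the bases.
S_mod = S * torch.linspace(1.0, 0.5, S.numel())        # illustrative re-weighting
feat_value_edit = (U @ torch.diag(S_mod) @ Vh).reshape(64, 32, 32)

# A "singular vector dominated" manipulation would instead transform U / Vh
# while leaving S untouched.
recon = (U @ torch.diag(S) @ Vh).reshape(64, 32, 32)
print(torch.allclose(recon, feat, atol=1e-4))          # reconstruction sanity check
```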
LISA: Reasoning Segmentation via Large Language Model
results: 实验表明,LISA能够处理涉及复杂推理、世界知识、解释性回答和多轮对话的场景,并且即使仅在不含推理分割数据的数据集上训练,也表现出强大的零样本能力。经过微调后,模型性能进一步提升。Abstract
Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction to identify the target objects or categories before executing visual recognition tasks. Such systems lack the ability to actively reason and comprehend implicit user intentions. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction pairs, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of the multi-modal Large Language Model (LLM) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving: 1) complex reasoning; 2) world knowledge; 3) explanatory answers; 4) multi-turn conversation. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation image-instruction pairs results in further performance enhancement. Experiments show our method not only unlocks new reasoning segmentation capabilities but also proves effective in both complex reasoning segmentation and standard referring segmentation tasks. Code, models, and demo are at https://github.com/dvlab-research/LISA.
摘要
尽管感知系统近年来取得了显著进展,但在执行视觉识别任务之前,它们仍然依赖显式的人类指令来指明目标对象或类别,缺乏主动推理和理解隐含用户意图的能力。在这项工作中,我们提出了一个新的分割任务——推理分割。该任务要求在给定复杂且隐含的查询文本时输出分割掩码。此外,我们建立了一个包含一千多对图像-指令的基准,融入了复杂推理和世界知识以用于评估。最后,我们提出了LISA:大语言指令分割助手,它继承了多模态大语言模型(LLM)的语言生成能力,同时具备生成分割掩码的能力。我们在原有词表中加入了一个token,并提出embedding-as-mask范式来解锁分割能力。值得注意的是,LISA能够处理以下情形:1)复杂推理;2)世界知识;3)解释性回答;4)多轮对话。此外,即使仅在不含推理数据的数据集上训练,它也展现出稳健的零样本能力;再用仅239对推理分割图像-指令数据进行微调,性能可进一步提升。实验表明,我们的方法不仅解锁了新的推理分割能力,在复杂推理分割和标准指代分割任务上也都证明有效。代码、模型和演示见 https://github.com/dvlab-research/LISA。
Toward Zero-shot Character Recognition: A Gold Standard Dataset with Radical-level Annotations
results: 实验结果表明,ACCID 是一个有效的古汉字图像集,并且基于分解和重组的零shot OCR方法可以在不同的文本背景下达到高度的识别率。Abstract
Optical character recognition (OCR) methods have been applied to diverse tasks, e.g., street view text recognition and document analysis. Recently, zero-shot OCR has piqued the interest of the research community because it considers a practical OCR scenario with unbalanced data distribution. However, there is a lack of benchmarks for evaluating such zero-shot methods that apply a divide-and-conquer recognition strategy by decomposing characters into radicals. Meanwhile, radical recognition, as another important OCR task, also lacks radical-level annotation for model training. In this paper, we construct an ancient Chinese character image dataset that contains both radical-level and character-level annotations to satisfy the requirements of the above-mentioned methods, namely, ACCID, where radical-level annotations include radical categories, radical locations, and structural relations. To increase the adaptability of ACCID, we propose a splicing-based synthetic character algorithm to augment the training samples and apply an image denoising method to improve the image quality. By introducing character decomposition and recombination, we propose a baseline method for zero-shot OCR. The experimental results demonstrate the validity of ACCID and the baseline model quantitatively and qualitatively.
摘要
光学字符识别(OCR)方法已被应用于多种任务,例如街景文本识别和文档分析。最近,零样本OCR引起了研究界的关注,因为它考虑了数据分布不均衡的实际OCR场景。然而,对于这类通过将汉字分解为部首来采取分而治之识别策略的零样本方法,目前缺乏用于评估的基准。同时,部首识别作为另一项重要的OCR任务,也缺乏用于模型训练的部首级标注。在这篇论文中,我们构建了一个同时包含部首级和字符级标注的古汉字图像数据集ACCID,以满足上述方法的需求,其中部首级标注包括部首类别、部首位置和结构关系。为提高ACCID的适用性,我们提出了一种基于拼接的合成字符算法来扩充训练样本,并采用图像去噪方法来提升图像质量。通过引入字符分解与重组,我们提出了一个零样本OCR的基线方法。实验结果从定量和定性两方面证明了ACCID和基线模型的有效性。
Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment
results: 在三个主流无参考VQA基准上的实验表明,在不使用额外VQA训练数据的情况下,Ada-DQA方法显著优于当前最先进的方法。Abstract
Video quality assessment (VQA) has attracted growing attention in recent years. While the great expense of annotating large-scale VQA datasets has become the main obstacle for current deep-learning methods. To surmount the constraint of insufficient training data, in this paper, we first consider the complete range of video distribution diversity (\ie content, distortion, motion) and employ diverse pretrained models (\eg architecture, pretext task, pre-training dataset) to benefit quality representation. An Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework is proposed to capture desired quality-related features generated by these frozen pretrained models. By leveraging the Quality-aware Acquisition Module (QAM), the framework is able to extract more essential and relevant features to represent quality. Finally, the learned quality representation is utilized as supplementary supervisory information, along with the supervision of the labeled quality score, to guide the training of a relatively lightweight VQA model in a knowledge distillation manner, which largely reduces the computational cost during inference. Experimental results on three mainstream no-reference VQA benchmarks clearly show the superior performance of Ada-DQA in comparison with current state-of-the-art approaches without using extra training data of VQA.
摘要
视频质量评估(VQA)近年来受到越来越多的关注。然而,大规模VQA数据集的标注成本高昂,已成为当前深度学习方法的主要障碍。为了突破训练数据不足的限制,本文首先考虑视频分布多样性的完整范围(即内容、失真、运动),并利用多样的预训练模型(即不同架构、预训练任务、预训练数据集)来帮助质量表示。我们提出了一个自适应多样化质量感知特征获取(Ada-DQA)框架,用于捕获这些冻结的预训练模型所产生的质量相关特征。借助质量感知获取模块(QAM),该框架能够提取更本质、更相关的特征来表示质量。最后,学习到的质量表示作为补充监督信息,与标注质量分数的监督一起,以知识蒸馏的方式指导一个相对轻量的VQA模型的训练,从而大幅降低推理时的计算开销。在三个主流无参考VQA基准上的实验清楚地表明,在不使用额外VQA训练数据的情况下,Ada-DQA的性能优于当前最先进的方法。
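The training signal described above can be pictured as a labeled quality-score regression plus an auxiliary distillation term that pulls the lightweight student's feature toward quality-aware features aggregated from the frozen pretrained teachers. The sketch below is only that picture with assumed shapes and a plain mean aggregation; Ada-DQA's actual acquisition module and loss weighting differ.

```python
import torch
import torch.nn.functional as F

def quality_distill_loss(pred_score, mos, student_feat, teacher_feats, alpha=0.5):
    """MOS regression plus feature distillation from frozen teacher models."""
    score_loss = F.mse_loss(pred_score, mos)
    target = torch.stack(teacher_feats, dim=0).mean(dim=0).detach()  # aggregated teacher feature
    distill_loss = F.mse_loss(student_feat, target)
    return score_loss + alpha * distill_loss

pred_score = torch.randn(4, requires_grad=True)                 # student's predicted quality
mos = torch.rand(4)                                             # labeled mean opinion scores
student_feat = torch.randn(4, 256, requires_grad=True)
teacher_feats = [torch.randn(4, 256) for _ in range(3)]         # from diverse frozen models
loss = quality_distill_loss(pred_score, mos, student_feat, teacher_feats)
loss.backward()
```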