paper_authors: Aixuan Li, Jing Zhang, Yunqiu Lv, Tong Zhang, Yiran Zhong, Mingyi He, Yuchao Dai
for: This paper studies an uncertainty-aware learning pipeline for the two tasks of Salient Object Detection (SOD) and Camouflaged Object Detection (COD).
methods: The paper proposes data-level and task-wise contradiction modeling: it exploits the correlations between the two datasets and between the two tasks, together with a joint-task contrastive learning framework, to reduce the amount of training data required while improving model robustness and representation learning.
results: Experimental results show that the method achieves state-of-the-art performance on standard benchmark datasets and also provides informative uncertainty estimation.
Abstract
Salient objects attract human attention and usually stand out clearly from their surroundings. In contrast, camouflaged objects share similar colors or textures with the environment. In this case, salient objects are typically non-camouflaged, and camouflaged objects are usually not salient. Due to this inherent contradictory attribute, we introduce an uncertainty-aware learning pipeline to extensively explore the contradictory information of salient object detection (SOD) and camouflaged object detection (COD) via data-level and task-wise contradiction modeling. We first exploit the dataset correlation of these two tasks and claim that the easy samples in the COD dataset can serve as hard samples for SOD to improve the robustness of the SOD model. Based on the assumption that these two models should lead to activation maps highlighting different regions of the same input image, we further introduce a contrastive module with a joint-task contrastive learning framework to explicitly model the contradictory attributes of these two tasks. Different from conventional intra-task contrastive learning for unsupervised representation learning, our contrastive module is designed to model the task-wise correlation, leading to cross-task representation learning. To better understand the two tasks from the perspective of uncertainty, we extensively investigate the uncertainty estimation techniques for modeling the main uncertainties of the two tasks, namely task uncertainty (for SOD) and data uncertainty (for COD), aiming to effectively estimate the challenging regions for each task to achieve difficulty-aware learning. Experimental results on benchmark datasets demonstrate that our solution leads to both state-of-the-art performance and informative uncertainty estimation.
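To make the cross-task idea concrete, below is a minimal, hedged sketch of an InfoNCE-style loss that treats two augmented views of an image within one task as a positive pair and the same image's embedding from the other task as a negative; the actual positive/negative construction and loss in the paper may differ, and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_task_contrastive_loss(f_sod, f_sod_aug, f_cod, tau=0.1):
    """Toy cross-task contrastive loss (a sketch, not the paper's module).

    f_sod, f_sod_aug : (B, D) embeddings of an image and its augmentation
                       from the SOD branch (positive pair).
    f_cod            : (B, D) embedding of the same image from the COD branch,
                       treated as a negative since the two tasks should
                       highlight contradictory regions.
    """
    f_sod = F.normalize(f_sod, dim=1)
    f_sod_aug = F.normalize(f_sod_aug, dim=1)
    f_cod = F.normalize(f_cod, dim=1)

    pos = torch.exp((f_sod * f_sod_aug).sum(dim=1) / tau)  # same task, same image
    neg = torch.exp((f_sod * f_cod).sum(dim=1) / tau)      # other task, same image
    return -torch.log(pos / (pos + neg)).mean()

# toy usage with random embeddings
loss = cross_task_contrastive_loss(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
```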
Multimodal brain age estimation using interpretable adaptive population-graph learning
results: On the UK Biobank dataset, the method outperforms competing static graph approaches and other state-of-the-art adaptive methods on brain age regression and classification. In addition, visualizing the attention weights improves the interpretability of the graph.
Abstract
Brain age estimation is clinically important as it can provide valuable information in the context of neurodegenerative diseases such as Alzheimer's. Population graphs, which include multimodal imaging information of the subjects along with the relationships among the population, have been used in literature along with Graph Convolutional Networks (GCNs) and have proved beneficial for a variety of medical imaging tasks. A population graph is usually static and constructed manually using non-imaging information. However, graph construction is not a trivial task and might significantly affect the performance of the GCN, which is inherently very sensitive to the graph structure. In this work, we propose a framework that learns a population graph structure optimized for the downstream task. An attention mechanism assigns weights to a set of imaging and non-imaging features (phenotypes), which are then used for edge extraction. The resulting graph is used to train the GCN. The entire pipeline can be trained end-to-end. Additionally, by visualizing the attention weights that were the most important for the graph construction, we increase the interpretability of the graph. We use the UK Biobank, which provides a large variety of neuroimaging and non-imaging phenotypes, to evaluate our method on brain age regression and classification. The proposed method outperforms competing static graph approaches and other state-of-the-art adaptive methods. We further show that the assigned attention scores indicate that there are both imaging and non-imaging phenotypes that are informative for brain age estimation and are in agreement with the relevant literature.
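As an illustration of the edge-extraction step described above, here is a small sketch (not the paper's implementation) that turns attention weights over phenotypes into a k-NN population graph; in the paper the attention is learned end-to-end with the GCN, whereas here it is simply a given vector, and all parameter values are illustrative.

```python
import numpy as np

def build_population_graph(phenotypes, attention, k=5):
    """Build a k-NN population graph from attention-weighted phenotypes.

    phenotypes : (N, P) array of imaging + non-imaging features per subject.
    attention  : (P,) non-negative weights scoring how informative each
                 phenotype is for edge extraction.
    Returns an (N, N) binary adjacency matrix for a downstream GCN.
    """
    w = attention / attention.sum()
    x = phenotypes * np.sqrt(w)                              # weighted feature space
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
    adj = np.zeros_like(d2)
    for i in range(len(x)):
        nn = np.argsort(d2[i])[1:k + 1]                      # skip self (distance 0)
        adj[i, nn] = 1.0
    return np.maximum(adj, adj.T)                            # symmetrize

# toy usage
rng = np.random.default_rng(0)
A = build_population_graph(rng.normal(size=(20, 8)), rng.random(8), k=3)
```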
SPLAL: Similarity-based pseudo-labeling with alignment loss for semi-supervised medical image classification
results: Experimental results show that the proposed method performs strongly on two public medical image classification benchmark datasets (ISIC 2018 and BCCD), with clear improvements over previous state-of-the-art SSL methods across various evaluation metrics. Specifically, the method improves Accuracy and F1 score on the ISIC 2018 dataset by relative margins of 2.24% and 11.40%, respectively. Extensive ablation experiments further validate its effectiveness.
Abstract
Medical image classification is a challenging task due to the scarcity of labeled samples and class imbalance caused by the high variance in disease prevalence. Semi-supervised learning (SSL) methods can mitigate these challenges by leveraging both labeled and unlabeled data. However, SSL methods for medical image classification need to address two key challenges: (1) estimating reliable pseudo-labels for the images in the unlabeled dataset and (2) reducing biases caused by class imbalance. In this paper, we propose a novel SSL approach, SPLAL, that effectively addresses these challenges. SPLAL leverages class prototypes and a weighted combination of classifiers to predict reliable pseudo-labels over a subset of unlabeled images. Additionally, we introduce alignment loss to mitigate model biases toward majority classes. To evaluate the performance of our proposed approach, we conduct experiments on two publicly available medical image classification benchmark datasets: the skin lesion classification (ISIC 2018) and the blood cell classification dataset (BCCD). The experimental results empirically demonstrate that our approach outperforms several state-of-the-art SSL methods over various evaluation metrics. Specifically, our proposed approach achieves a significant improvement over the state-of-the-art approach on the ISIC 2018 dataset in both Accuracy and F1 score, with relative margins of 2.24\% and 11.40\%, respectively. Finally, we conduct extensive ablation experiments to examine the contribution of different components of our approach, validating its effectiveness.
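A hedged sketch of prototype-based pseudo-labeling in the spirit of the description above: class prototypes are the per-class means of labeled features, and prototype similarity is blended with the classifier's own prediction before thresholding. The blending weight, temperature, and threshold are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(feat_l, y_l, feat_u, logits_u, num_classes,
                            alpha=0.5, threshold=0.9):
    """Assign pseudo-labels to a confident subset of unlabeled samples.

    feat_l   : (Nl, D) features of labeled images, y_l their labels.
    feat_u   : (Nu, D) features of unlabeled images.
    logits_u : (Nu, C) classifier logits on the unlabeled images.
    """
    protos = torch.stack([feat_l[y_l == c].mean(0) for c in range(num_classes)])
    sim = F.normalize(feat_u, dim=1) @ F.normalize(protos, dim=1).T        # (Nu, C)
    probs = alpha * F.softmax(sim / 0.1, dim=1) + (1 - alpha) * F.softmax(logits_u, dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = conf > threshold
    return pseudo[keep], keep

# toy usage (all classes present in the labeled set)
feat_l, y_l = torch.randn(20, 16), torch.arange(20) % 4
feat_u, logits_u = torch.randn(50, 16), torch.randn(50, 4)
labels, mask = prototype_pseudo_labels(feat_l, y_l, feat_u, logits_u, num_classes=4)
```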
Source-Free Open-Set Domain Adaptation for Histopathological Images via Distilling Self-Supervised Vision Transformer
results: Experimental results show that the method significantly outperforms previous methods, including open-set detection, test-time adaptation, and SF-OSDA approaches, achieving state-of-the-art performance on three public histopathological datasets (Kather-16, Kather-19, and CRCTP).
Abstract
There is a strong incentive to develop computational pathology models to i) ease the burden of tissue typology annotation from whole slide histological images; ii) transfer knowledge, e.g., tissue class separability from the withheld source domain to the distributionally shifted unlabeled target domain, and simultaneously iii) detect Open Set samples, i.e., unseen novel categories not present in the training source domain. This paper proposes a highly practical setting by addressing the abovementioned challenges in one fell swoop, i.e., source-free Open Set domain adaptation (SF-OSDA), which addresses the situation where a model pre-trained on the inaccessible source dataset can be adapted on the unlabeled target dataset containing Open Set samples. The central tenet of our proposed method is distilling knowledge from a self-supervised vision transformer trained in the target domain. We propose a novel style-based data augmentation used as hard positives for self-training a vision transformer in the target domain, yielding strongly contextualized embedding. Subsequently, semantically similar target images are clustered while the source model provides their corresponding weak pseudo-labels with unreliable confidence. Furthermore, we propose cluster relative maximum logit score (CRMLS) to rectify the confidence of the weak pseudo-labels and compute weighted class prototypes in the contextualized embedding space that are utilized for adapting the source model on the target domain. Our method significantly outperforms the previous methods, including open set detection, test-time adaptation, and SF-OSDA methods, setting the new state-of-the-art on three public histopathological datasets of colorectal cancer (CRC) assessment- Kather-16, Kather-19, and CRCTP. Our code is available at https://github.com/LTS5/Proto-SF-OSDA.
DWA: Differential Wavelet Amplifier for Image Super-Resolution
paper_authors: Brian B. Moser, Stanislav Frolov, Federico Raue, Sebastian Palacio, Andreas Dengel
For: Improving the sustainability and efficiency of image super-resolution (SR) models.
Methods: Uses the Discrete Wavelet Transformation (DWT) and the difference between two convolutional filters to improve feature extraction and noise suppression in the wavelet domain.
Results: When integrated into existing SR models (e.g., DWSR and MWCNN), it demonstrates improved performance on classical SR tasks and also enables direct application of these models to the input image space, reducing the DWT representation channel-wise.
Abstract
This work introduces Differential Wavelet Amplifier (DWA), a drop-in module for wavelet-based image Super-Resolution (SR). DWA invigorates an approach recently receiving less attention, namely Discrete Wavelet Transformation (DWT). DWT enables an efficient image representation for SR and reduces the spatial area of its input by a factor of 4, the overall model size, and computation cost, framing it as an attractive approach for sustainable ML. Our proposed DWA model improves wavelet-based SR models by leveraging the difference between two convolutional filters to refine relevant feature extraction in the wavelet domain, emphasizing local contrasts and suppressing common noise in the input signals. We show its effectiveness by integrating it into existing SR models, e.g., DWSR and MWCNN, and demonstrate a clear improvement in classical SR tasks. Moreover, DWA enables a direct application of DWSR and MWCNN to input image space, reducing the DWT representation channel-wise since it omits traditional DWT.
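The abstract describes DWA as the difference of two convolutional filters applied in the wavelet domain. The sketch below is an assumption about one concrete layout (not the authors' code): a fixed Haar DWT implemented as a stride-2 grouped convolution, followed by the difference of two learned 3x3 convolutions over the subbands.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT(nn.Module):
    """2-D Haar DWT as a fixed stride-2 convolution (LL, LH, HL, HH per channel)."""
    def forward(self, x):
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        k = torch.stack([ll, lh, hl, hh]).unsqueeze(1).to(x)   # (4, 1, 2, 2), fixed kernels
        c = x.shape[1]
        k = k.repeat(c, 1, 1, 1)                               # one set of 4 subbands per channel
        return F.conv2d(x, k, stride=2, groups=c)              # (B, 4c, H/2, W/2)

class DifferentialWaveletAmplifier(nn.Module):
    """Hedged sketch of DWA: the difference of two parallel convolutions
    applied to the wavelet subbands, emphasizing local contrast."""
    def __init__(self, channels):
        super().__init__()
        self.dwt = HaarDWT()
        self.conv_a = nn.Conv2d(4 * channels, 4 * channels, 3, padding=1)
        self.conv_b = nn.Conv2d(4 * channels, 4 * channels, 3, padding=1)

    def forward(self, x):
        w = self.dwt(x)
        return self.conv_a(w) - self.conv_b(w)                 # differential amplification

# toy usage: (1, 3, 32, 32) -> (1, 12, 16, 16)
out = DifferentialWaveletAmplifier(3)(torch.randn(1, 3, 32, 32))
```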
A Graph Multi-separator Problem for Image Segmentation
results: The proposed method is shown to be effective in segmenting simulated volume images of foam cells and filaments.
Abstract
We propose a novel abstraction of the image segmentation task in the form of a combinatorial optimization problem that we call the multi-separator problem. Feasible solutions indicate for every pixel whether it belongs to a segment or a segment separator, and indicate for pairs of pixels whether or not the pixels belong to the same segment. This is in contrast to the closely related lifted multicut problem where every pixel is associated to a segment and no pixel explicitly represents a separating structure. While the multi-separator problem is NP-hard, we identify two special cases for which it can be solved efficiently. Moreover, we define two local search algorithms for the general case and demonstrate their effectiveness in segmenting simulated volume images of foam cells and filaments.
AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System
paper_authors: Yuzhe Qin, Wei Yang, Binghao Huang, Karl Van Wyk, Hao Su, Xiaolong Wang, Yu-Wei Chao, Dieter Fox
for: Providing a general teleoperation system that works across multiple robot models and deployment environments.
methods: A unified and general teleoperation system that supports multiple different arms, hands, realities, and camera configurations within a single system.
results: In real-world experiments, AnyTeleop achieves a higher success rate than a previous system designed for specific robot hardware, using the same robot; in simulation, it also yields better imitation learning performance than a previous system designed for a particular simulator.
Abstract
Vision-based teleoperation offers the possibility to endow robots with human-level intelligence to physically interact with the environment, while only requiring low-cost camera sensors. However, current vision-based teleoperation systems are designed and engineered towards a particular robot model and deploy environment, which scales poorly as the pool of the robot models expands and the variety of the operating environment increases. In this paper, we propose AnyTeleop, a unified and general teleoperation system to support multiple different arms, hands, realities, and camera configurations within a single system. Although being designed to provide great flexibility to the choice of simulators and real hardware, our system can still achieve great performance. For real-world experiments, AnyTeleop can outperform a previous system that was designed for a specific robot hardware with a higher success rate, using the same robot. For teleoperation in simulation, AnyTeleop leads to better imitation learning performance, compared with a previous system that is particularly designed for that simulator. Project page: http://anyteleop.com/.
TFR: Texture Defect Detection with Fourier Transform using Normal Reconstructed Template of Simple Autoencoder
results: Experimental results show that the method detects texture defects accurately and effectively, with higher performance and precision than existing methods.
Abstract
Texture is an essential information in image representation, capturing patterns and structures. As a result, texture plays a crucial role in the manufacturing industry and is extensively studied in the fields of computer vision and pattern recognition. However, real-world textures are susceptible to defects, which can degrade image quality and cause various issues. Therefore, there is a need for accurate and effective methods to detect texture defects. In this study, a simple autoencoder and Fourier transform are employed for texture defect detection. The proposed method combines Fourier transform analysis with the reconstructed template obtained from the simple autoencoder. Fourier transform is a powerful tool for analyzing the frequency domain of images and signals. Moreover, since texture defects often exhibit characteristic changes in specific frequency ranges, analyzing the frequency domain enables effective defect detection. The proposed method demonstrates effectiveness and accuracy in detecting texture defects. Experimental results are presented to evaluate its performance and compare it with existing approaches.
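A rough sketch of the overall recipe (an autoencoder-reconstructed normal template plus a frequency-domain comparison); the abstract does not specify exactly how the spectra are combined, so the low-frequency masking and back-projection below are illustrative assumptions.

```python
import numpy as np

def texture_defect_map(image, reconstructed_template, low_cut=4):
    """Hedged sketch of frequency-domain texture defect detection.

    image, reconstructed_template : (H, W) grayscale arrays; the template is
    the 'normal' reconstruction produced by a simple autoencoder.
    Compares magnitude spectra and back-projects the residual to localize defects.
    """
    f_img = np.fft.fftshift(np.fft.fft2(image))
    f_tpl = np.fft.fftshift(np.fft.fft2(reconstructed_template))
    diff = np.abs(f_img) - np.abs(f_tpl)                  # spectral residual

    # suppress the lowest frequencies, which mostly encode illumination
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) > low_cut ** 2
    diff = diff * mask

    # back-project the residual spectrum with the input's phase
    residual = np.fft.ifft2(np.fft.ifftshift(diff * np.exp(1j * np.angle(f_img))))
    return np.abs(residual)

# toy usage
img = np.random.rand(64, 64)
defect = texture_defect_map(img, np.full_like(img, img.mean()))
```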
Unraveling the Age Estimation Puzzle: Comparative Analysis of Deep Learning Approaches for Facial Age Estimation
results: The study finds that these factors frequently exert a more significant influence than the choice of age estimation method itself, and that consistent data preprocessing practices and standardized benchmarks are needed to ensure reliable and meaningful comparisons.
Abstract
Comparing different age estimation methods poses a challenge due to the unreliability of published results, stemming from inconsistencies in the benchmarking process. Previous studies have reported continuous performance improvements over the past decade using specialized methods; however, our findings challenge these claims. We argue that, for age estimation tasks outside of the low-data regime, designing specialized methods is unnecessary, and the standard approach of utilizing cross-entropy loss is sufficient. This paper aims to address the benchmark shortcomings by evaluating state-of-the-art age estimation methods in a unified and comparable setting. We systematically analyze the impact of various factors, including facial alignment, facial coverage, image resolution, image representation, model architecture, and the amount of data on age estimation results. Surprisingly, these factors often exert a more significant influence than the choice of the age estimation method itself. We assess the generalization capability of each method by evaluating the cross-dataset performance for publicly available age estimation datasets. The results emphasize the importance of using consistent data preprocessing practices and establishing standardized benchmarks to ensure reliable and meaningful comparisons. The source code is available at https://github.com/paplhjak/Facial-Age-Estimation-Benchmark.
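The baseline the paper argues is sufficient, plain cross-entropy over discrete age bins, fits in a few lines; the backbone below is a placeholder (any CNN would do), and the expected-value decoding at the end is one common way to turn bin probabilities into an age estimate.

```python
import torch
import torch.nn as nn

class AgeClassifier(nn.Module):
    """Age estimation as plain classification over age bins, trained with
    standard cross-entropy. The backbone here is a stand-in, not a real CNN."""
    def __init__(self, backbone_dim=512, num_ages=101):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(backbone_dim), nn.ReLU())
        self.head = nn.Linear(backbone_dim, num_ages)

    def forward(self, x):
        return self.head(self.backbone(x))      # logits over age bins

model = AgeClassifier()
criterion = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 64, 64)               # stand-in for aligned face crops
ages = torch.randint(0, 101, (8,))
loss = criterion(model(images), ages)

# at inference, the expected age over the softmax is a common point estimate
probs = model(images).softmax(dim=1)
pred_age = (probs * torch.arange(101)).sum(dim=1)
```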
Important Clues that Facilitate Visual Emergence: Three Psychological Experiments
results: The study finds that the density of speckles in local regions and the arrangement of certain key speckles play a key role in the recognition of emerging images.
Abstract
Visual emergence is the phenomenon in which the visual system obtains a holistic perception after grouping and reorganizing local signals. The picture Dalmatian dog is known for its use in explaining visual emergence. This type of image, which consists of a set of discrete black speckles (speckles), is called an emerging image. Not everyone can find the dog in Dalmatian dog, and among those who can, the time spent varies greatly. Although Gestalt theory summarizes perceptual organization into several principles, it remains ambiguous how these principles affect the perception of emerging images. This study, therefore, designed three psychological experiments to explore the factors that influence the perception of emerging images. In the first, we found that the density of speckles in the local area and the arrangements of some key speckles played a key role in the perception of an emerging case. We set parameters in the algorithm to characterize these two factors. We then automatically generated diversified emerging-test images (ETIs) through the algorithm and verified their effectiveness in two subsequent experiments.
SparseVSR: Lightweight and Noise Robust Visual Speech Recognition
results: The study finds that magnitude-based pruning yields lightweight models that outperform their dense equivalents on visual speech recognition, especially in the presence of visual noise.
Abstract
Recent advances in deep neural networks have achieved unprecedented success in visual speech recognition. However, there remains substantial disparity between current methods and their deployment in resource-constrained devices. In this work, we explore different magnitude-based pruning techniques to generate a lightweight model that achieves higher performance than its dense model equivalent, especially under the presence of visual noise. Our sparse models achieve state-of-the-art results at 10% sparsity on the LRS3 dataset and outperform the dense equivalent up to 70% sparsity. We evaluate our 50% sparse model on 7 different visual noise types and achieve an overall absolute improvement of more than 2% WER compared to the dense equivalent. Our results confirm that sparse networks are more resistant to noise than dense networks.
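Magnitude-based pruning of the kind discussed above is available directly in PyTorch; the sketch below applies global L1 (magnitude) unstructured pruning at 50% sparsity to a toy model, since the actual VSR architecture is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# toy stand-in for a visual speech recognition backbone
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 8 * 8, 10),
)

params_to_prune = [(m, "weight") for m in model.modules()
                   if isinstance(m, (nn.Conv2d, nn.Linear))]

# remove the 50% of weights with the smallest absolute value, globally
prune.global_unstructured(params_to_prune, pruning_method=prune.L1Unstructured, amount=0.5)

# make the pruning permanent and check the achieved sparsity
for m, name in params_to_prune:
    prune.remove(m, name)
zeros = sum((m.weight == 0).sum().item() for m, _ in params_to_prune)
total = sum(m.weight.numel() for m, _ in params_to_prune)
print(f"sparsity: {zeros / total:.2%}")
```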
Customizing Synthetic Data for Data-Free Student Learning
results: Experiments show that the method is effective across different datasets and teacher-student models and improves student model performance. Code is available at: $\href{https://github.com/luoshiya/CSD}{https://github.com/luoshiya/CSD}$
Abstract
Data-free knowledge distillation (DFKD) aims to obtain a lightweight student model without original training data. Existing works generally synthesize data from the pre-trained teacher model to replace the original training data for student learning. To more effectively train the student model, the synthetic data shall be customized to the current student learning ability. However, this is ignored in the existing DFKD methods and thus negatively affects the student training. To address this issue, we propose Customizing Synthetic Data for Data-Free Student Learning (CSD) in this paper, which achieves adaptive data synthesis using a self-supervised augmented auxiliary task to estimate the student learning ability. Specifically, data synthesis is dynamically adjusted to enlarge the cross entropy between the labels and the predictions from the self-supervised augmented task, thus generating hard samples for the student model. The experiments on various datasets and teacher-student models show the effectiveness of our proposed method. Code is available at: $\href{https://github.com/luoshiya/CSD}{https://github.com/luoshiya/CSD}$
Cluster-Induced Mask Transformers for Effective Opportunistic Gastric Cancer Screening on Non-contrast CT Scans
paper_authors: Mingze Yuan, Yingda Xia, Xin Chen, Jiawen Yao, Junli Wang, Mingyan Qiu, Hexin Dong, Jingren Zhou, Bin Dong, Le Lu, Li Zhang, Zaiyi Liu, Ling Zhang
for: This study targets the detection of gastric cancer, the third leading cause of cancer-related death worldwide, for which no guideline-recommended screening test exists; existing methods can be invasive, expensive, and insufficiently sensitive to early-stage disease.
methods: A deep learning approach on non-contrast CT scans. The paper proposes a novel cluster-induced Mask Transformer that jointly segments the tumor and classifies abnormality; the model incorporates learnable clusters encoding texture and shape prototypes of gastric cancer, using self- and cross-attention to interact with convolutional features.
results: The method achieves a sensitivity of 85.0% and specificity of 92.6% on a hold-out test set, compared with an average sensitivity of 73.5% and specificity of 84.3% for two radiologists, and a specificity of 97.7% on an external test set. It performs comparably to established screening tools such as blood testing and endoscopy while being more sensitive to early-stage cancer, suggesting a novel, non-invasive, low-cost, and accurate method for opportunistic gastric cancer screening.
Abstract
Gastric cancer is the third leading cause of cancer-related mortality worldwide, but no guideline-recommended screening test exists. Existing methods can be invasive, expensive, and lack sensitivity to identify early-stage gastric cancer. In this study, we explore the feasibility of using a deep learning approach on non-contrast CT scans for gastric cancer detection. We propose a novel cluster-induced Mask Transformer that jointly segments the tumor and classifies abnormality in a multi-task manner. Our model incorporates learnable clusters that encode the texture and shape prototypes of gastric cancer, utilizing self- and cross-attention to interact with convolutional features. In our experiments, the proposed method achieves a sensitivity of 85.0% and specificity of 92.6% for detecting gastric tumors on a hold-out test set consisting of 100 patients with cancer and 148 normal. In comparison, two radiologists have an average sensitivity of 73.5% and specificity of 84.3%. We also obtain a specificity of 97.7% on an external test set with 903 normal cases. Our approach performs comparably to established state-of-the-art gastric cancer screening tools like blood testing and endoscopy, while also being more sensitive in detecting early-stage cancer. This demonstrates the potential of our approach as a novel, non-invasive, low-cost, and accurate method for opportunistic gastric cancer screening.
Efficient Match Pair Retrieval for Large-scale UAV Images via Graph Indexed Global Descriptor
paper_authors: San Jiang, Yichen Ma, Qingquan Li, Wanshou Jiang, Bingxuan Guo, Lelin Li, Lizhe Wang
for: This paper studies improving Structure from Motion (SfM) for UAV image orientation.
methods: The paper proposes an efficient match pair selection method, including an individually trained online codebook, VLAD aggregation of local features into high-dimensional global descriptors, and an HNSW graph structure for nearest neighbor searching.
results: Test results show that the proposed solution accelerates match pair retrieval and achieves competitive accuracy in both relative and absolute orientation.
Abstract
SfM (Structure from Motion) has been extensively used for UAV (Unmanned Aerial Vehicle) image orientation. Its efficiency is directly influenced by feature matching. Although image retrieval has been extensively used for match pair selection, high computational costs are consumed due to a large number of local features and the large size of the used codebook. Thus, this paper proposes an efficient match pair retrieval method and implements an integrated workflow for parallel SfM reconstruction. First, an individual codebook is trained online by considering the redundancy of UAV images and local features, which avoids the ambiguity of training codebooks from other datasets. Second, local features of each image are aggregated into a single high-dimension global descriptor through the VLAD (Vector of Locally Aggregated Descriptors) aggregation by using the trained codebook, which remarkably reduces the number of features and the burden of nearest neighbor searching in image indexing. Third, the global descriptors are indexed via the HNSW (Hierarchical Navigable Small World) based graph structure for the nearest neighbor searching. Match pairs are then retrieved by using an adaptive threshold selection strategy and utilized to create a view graph for divide-and-conquer based parallel SfM reconstruction. Finally, the performance of the proposed solution has been verified using three large-scale UAV datasets. The test results demonstrate that the proposed solution accelerates match pair retrieval with a speedup ratio ranging from 36 to 108 and improves the efficiency of SfM reconstruction with competitive accuracy in both relative and absolute orientation.
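For reference, VLAD aggregation itself is compact enough to sketch in NumPy; the codebook below is random (the paper trains it online from the project's own UAV features), the signed-square-root normalization is a common VLAD refinement that may differ from the paper, and the final cosine comparison stands in for the HNSW index used for nearest neighbor search.

```python
import numpy as np

def vlad(descriptors, codebook):
    """VLAD aggregation: accumulate residuals of local descriptors to their
    nearest codebook center, then flatten and L2-normalize.

    descriptors : (N, D) local features of one image.
    codebook    : (K, D) visual words.
    Returns a (K*D,) global descriptor.
    """
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    v = np.zeros_like(codebook)
    for k in range(len(codebook)):
        if np.any(assign == k):
            v[k] = (descriptors[assign == k] - codebook[k]).sum(axis=0)
    v = np.sign(v) * np.sqrt(np.abs(v))       # signed square-root normalization
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)

# toy usage: aggregate two images and compare by cosine similarity
rng = np.random.default_rng(0)
cb = rng.normal(size=(16, 32))
g1 = vlad(rng.normal(size=(500, 32)), cb)
g2 = vlad(rng.normal(size=(450, 32)), cb)
print(float(g1 @ g2))    # the paper indexes these descriptors with an HNSW graph
```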
An Examination of Wearable Sensors and Video Data Capture for Human Exercise Classification
paper_authors: Ashish Singh, Antonio Bevilacqua, Timilehin B. Aderinola, Thach Le Nguyen, Darragh Whelan, Martin O’Reilly, Brian Caulfield, Georgiana Ifrim
for: This study assesses human exercise performance and classifies exercises using Inertial Measurement Units (IMUs) and video.
methods: The study uses a single camera and 5 IMUs, and processes the data with multivariate time series classifiers.
results: The study finds that a single camera outperforms a single IMU by 10 percentage points on average, and that at least 3 IMUs are required to match a single camera. Moreover, combining the data from a single camera with a single IMU achieves even higher performance.
Abstract
Wearable sensors such as Inertial Measurement Units (IMUs) are often used to assess the performance of human exercise. Common approaches use handcrafted features based on domain expertise or automatically extracted features using time series analysis. Multiple sensors are required to achieve high classification accuracy, which is not very practical. These sensors require calibration and synchronization and may lead to discomfort over longer time periods. Recent work utilizing computer vision techniques has shown similar performance using video, without the need for manual feature engineering, and avoiding some pitfalls such as sensor calibration and placement on the body. In this paper, we compare the performance of IMUs to a video-based approach for human exercise classification on two real-world datasets consisting of Military Press and Rowing exercises. We compare the performance using a single camera that captures video in the frontal view versus using 5 IMUs placed on different parts of the body. We observe that an approach based on a single camera can outperform a single IMU by 10 percentage points on average. Additionally, a minimum of 3 IMUs are required to outperform a single camera. We observe that working with the raw data using multivariate time series classifiers outperforms traditional approaches based on handcrafted or automatically extracted features. Finally, we show that an ensemble model combining the data from a single camera with a single IMU outperforms either data modality. Our work opens up new and more realistic avenues for this application, where a video captured using a readily available smartphone camera, combined with a single sensor, can be used for effective human exercise classification.
CoactSeg: Learning from Heterogeneous Data for New Multiple Sclerosis Lesion Segmentation
paper_authors: Yicheng Wu, Zhonghua Wu, Hengcan Shi, Bjoern Picker, Winston Chong, Jianfei Cai
for: New lesion segmentation is of significant practical importance in the diagnosis and treatment of multiple sclerosis (MS), as it helps assess disease progression and therapeutic effects.
methods: The paper proposes a coaction segmentation (CoactSeg) framework that jointly exploits new-lesion annotated two-time-point data and all-lesion annotated single-time-point data to improve new lesion segmentation, together with a simple and effective longitudinal relation constraint that preserves the relations between new and all lesions and improves model learning.
results: Using the heterogeneous data (new-lesion and all-lesion annotations) and the proposed relation constraint significantly improves performance on both new-lesion and all-lesion segmentation tasks, validated extensively on different data.
Abstract
New lesion segmentation is essential to estimate the disease progression and therapeutic effects during multiple sclerosis (MS) clinical treatments. However, the expensive data acquisition and expert annotation restrict the feasibility of applying large-scale deep learning models. Since single-time-point samples with all-lesion labels are relatively easy to collect, exploiting them to train deep models is highly desirable to improve new lesion segmentation. Therefore, we proposed a coaction segmentation (CoactSeg) framework to exploit the heterogeneous data (i.e., new-lesion annotated two-time-point data and all-lesion annotated single-time-point data) for new MS lesion segmentation. The CoactSeg model is designed as a unified model, with the same three inputs (the baseline, follow-up, and their longitudinal brain differences) and the same three outputs (the corresponding all-lesion and new-lesion predictions), no matter which type of heterogeneous data is being used. Moreover, a simple and effective relation regularization is proposed to ensure the longitudinal relations among the three outputs to improve the model learning. Extensive experiments demonstrate that utilizing the heterogeneous data and the proposed longitudinal relation constraint can significantly improve the performance for both new-lesion and all-lesion segmentation tasks. Meanwhile, we also introduce an in-house MS-23v1 dataset, including 38 Oceania single-time-point samples with all-lesion labels. Codes and the dataset are released at https://github.com/ycwu1997/CoactSeg.
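The abstract does not spell out the relation regularization, so the following is only one plausible form of a longitudinal constraint, stated as an assumption: a new lesion should be present at follow-up but absent at baseline.

```python
import torch
import torch.nn.functional as F

def longitudinal_relation_loss(p_base, p_follow, p_new):
    """One plausible longitudinal relation constraint (an assumption; the paper
    defines its own regularizer). Inputs are per-voxel lesion probabilities in
    [0, 1] for the baseline, follow-up, and new-lesion predictions of one subject."""
    target = F.relu(p_follow - p_base)     # present at follow-up, absent at baseline
    return F.mse_loss(p_new, target)

# toy usage on a small 3-D volume
p_base, p_follow, p_new = (torch.rand(1, 1, 16, 16, 16) for _ in range(3))
loss = longitudinal_relation_loss(p_base, p_follow, p_new)
```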
Exact Diffusion Inversion via Bi-directional Integration Approximation
results: Experiments show that BDIA reduces computational overhead compared with prior exact-inversion approaches and achieves better performance in image reconstruction and image editing.
Abstract
Recently, different methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT \cite{Wallace23EDICT} and Null-text inversion \cite{Mokady23NullTestInv}. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, named \emph{bi-directional integration approximation} (BDIA), to perform exact diffusion inversion with negligible computational overhead. Suppose we would like to estimate the next diffusion state $\boldsymbol{z}_{i-1}$ at timestep $t_i$ with the historical information $(i,\boldsymbol{z}_i)$ and $(i+1,\boldsymbol{z}_{i+1})$. We first obtain the estimated Gaussian noise $\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)$, and then apply the DDIM update procedure twice for approximating the ODE integration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous time-slot is used to refine the integration approximation made earlier when computing $\boldsymbol{z}_i$. One nice property of BDIA-DDIM is that the update expression for $\boldsymbol{z}_{i-1}$ is a linear combination of $(\boldsymbol{z}_{i+1}, \boldsymbol{z}_i, \hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i))$. This allows for exact backward computation of $\boldsymbol{z}_{i+1}$ given $(\boldsymbol{z}_i, \boldsymbol{z}_{i-1})$, thus leading to exact diffusion inversion. Experiments on both image reconstruction and image editing were conducted, confirming our statement. BDIA can also be applied to improve the performance of other ODE solvers in addition to DDIM. In our work, it is found that applying BDIA to the EDM sampling procedure produces a slightly better FID score on CIFAR10.
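For context, the deterministic DDIM step that BDIA applies over the two neighbouring time-slots has the standard form below, written with the usual cumulative noise schedule $\bar{\alpha}$ (notation not defined in the abstract; the exact linear-combination coefficients of the BDIA update itself are derived in the paper):

$\boldsymbol{z}_{i-1} = \sqrt{\bar{\alpha}_{i-1}}\,\dfrac{\boldsymbol{z}_i - \sqrt{1-\bar{\alpha}_i}\,\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)}{\sqrt{\bar{\alpha}_i}} + \sqrt{1-\bar{\alpha}_{i-1}}\,\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)$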
Partial Vessels Annotation-based Coronary Artery Segmentation with Self-training and Prototype Learning
results: Experiments on clinical data show that the proposed framework outperforms competing methods under PVA (24.29% of vessels annotated) and achieves trunk continuity comparable to the baseline model trained with full annotation (100% of vessels).
Abstract
Coronary artery segmentation on coronary-computed tomography angiography (CCTA) images is crucial for clinical use. Due to the expertise-required and labor-intensive annotation process, there is a growing demand for the relevant label-efficient learning algorithms. To this end, we propose partial vessels annotation (PVA) based on the challenges of coronary artery segmentation and clinical diagnostic characteristics. Further, we propose a progressive weakly supervised learning framework to achieve accurate segmentation under PVA. First, our proposed framework learns the local features of vessels to propagate the knowledge to unlabeled regions. Subsequently, it learns the global structure by utilizing the propagated knowledge, and corrects the errors introduced in the propagation process. Finally, it leverages the similarity between feature embeddings and the feature prototype to enhance testing outputs. Experiments on clinical data reveals that our proposed framework outperforms the competing methods under PVA (24.29% vessels), and achieves comparable performance in trunk continuity with the baseline model using full annotation (100% vessels).
Test-Time Adaptation for Nighttime Color-Thermal Semantic Segmentation
paper_authors: Yexin Liu, Weiming Zhang, Guoyang Zhao, Jinjing Zhu, Athanasios Vasilakos, Lin Wang
For: The paper focuses on improving the performance of RGB-Thermal (RGB-T) semantic segmentation in nighttime scenes, which is a challenging task due to the large day-night gap and the inconsistent performance of RGB images at night.
Methods: The proposed method, called Night-TTA, uses a test-time adaptation (TTA) framework to address the challenges of nighttime RGB-T semantic segmentation without requiring access to the source (daytime) data during adaptation. The method consists of three key technical parts: Imaging Heterogeneity Refinement (IHR), Class Aware Refinement (CAR), and a specific learning scheme.
Results: The proposed method achieves state-of-the-art (SoTA) performance with a 13.07% boost in mean Intersection over Union (mIoU) compared to the baseline method.
Abstract
The ability to understand scenes in adverse visual conditions, e.g., nighttime, has sparked active research on RGB-Thermal (RGB-T) semantic segmentation. However, it is essentially hampered by two critical problems: 1) the day-night gap of RGB images is larger than that of thermal images, and 2) the class-wise performance of RGB images at night is not consistently higher or lower than that of thermal images. We propose the first test-time adaptation (TTA) framework, dubbed Night-TTA, to address these problems for nighttime RGB-T semantic segmentation without access to the source (daytime) data during adaptation. Our method enjoys three key technical parts. Firstly, as one modality (e.g., RGB) suffers from a larger domain gap than the other (e.g., thermal), Imaging Heterogeneity Refinement (IHR) employs an interaction branch on the basis of the RGB and thermal branches to prevent cross-modal discrepancy and performance degradation. Then, Class Aware Refinement (CAR) is introduced to obtain reliable ensemble logits based on pixel-level distribution aggregation of the three branches. In addition, we also design a specific learning scheme for our TTA framework, which enables the ensemble logits and the three student logits to collaboratively learn to improve the quality of predictions during the testing phase of our Night-TTA. Extensive experiments show that our method achieves state-of-the-art (SoTA) performance with a 13.07% boost in mIoU.
SAM-IQA: Can Segment Anything Boost Image Quality Assessment?
results: Experiments show that the proposed method outperforms the state-of-the-art (SOTA) on four representative datasets, both qualitatively and quantitatively. These experiments confirm the powerful feature extraction capabilities of Segment Anything and the value of combining frequency-domain and spatial-domain features for IQA tasks.
Abstract
Image Quality Assessment (IQA) is a challenging task that requires training on massive datasets to achieve accurate predictions. However, due to the lack of IQA data, deep learning-based IQA methods typically rely on pre-trained networks trained on massive datasets as feature extractors to enhance their generalization ability, such as the ResNet network trained on ImageNet. In this paper, we utilize the encoder of Segment Anything, a recently proposed segmentation model trained on a massive dataset, for high-level semantic feature extraction. Most IQA methods are limited to extracting spatial-domain features, while frequency-domain features have been shown to better represent noise and blur. Therefore, we leverage both spatial-domain and frequency-domain features by applying Fourier and standard convolutions on the extracted features, respectively. Extensive experiments are conducted to demonstrate the effectiveness of all the proposed components, and results show that our approach outperforms the state-of-the-art (SOTA) in four representative datasets, both qualitatively and quantitatively. Our experiments confirm the powerful feature extraction capabilities of Segment Anything and highlight the value of combining spatial-domain and frequency-domain features in IQA tasks. Code: https://github.com/Hedlen/SAM-IQA
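A hedged sketch of the two-branch head described above: a spatial convolution and a convolution over the FFT magnitude of the same features, concatenated and pooled into a quality score. The feature tensor is a random stand-in for the Segment Anything encoder output, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SpatialFrequencyHead(nn.Module):
    """Combine a spatial-domain and a frequency-domain branch on top of
    encoder features and regress a single quality score (a sketch)."""
    def __init__(self, channels=256):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.frequency = nn.Conv2d(channels, channels, 3, padding=1)
        self.regressor = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(2 * channels, 1)
        )

    def forward(self, feats):
        s = self.spatial(feats)
        f = self.frequency(torch.fft.fft2(feats).abs())   # magnitude spectrum branch
        return self.regressor(torch.cat([s, f], dim=1))

# toy usage with a stand-in for SAM encoder features: (2, 256, 16, 16) -> (2, 1)
score = SpatialFrequencyHead()(torch.randn(2, 256, 16, 16))
```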
results: Experiments show that DCA-NAS outperforms manually designed architectures of similar size and reaches performance comparable to popular mobile architectures on multiple image classification datasets. Tests with the DARTS and NAS-Bench-201 search spaces demonstrate the generalization capability of DCA-NAS. Finally, evaluation on Hardware-NAS-Bench discovers device-specific architectures with low inference latency and state-of-the-art performance.
Abstract
Edge computing aims to enable edge devices, such as IoT devices, to process data locally instead of relying on the cloud. However, deep learning techniques like computer vision and natural language processing can be computationally expensive and memory-intensive. Creating manual architectures specialized for each device is infeasible due to their varying memory and computational constraints. To address these concerns, we automate the construction of task-specific deep learning architectures optimized for device constraints through Neural Architecture Search (NAS). We present DCA-NAS, a principled method of fast neural network architecture search that incorporates edge-device constraints such as model size and floating-point operations. It incorporates weight sharing and channel bottleneck techniques to speed up the search time. Based on our experiments, we see that DCA-NAS outperforms manual architectures for similar sized models and is comparable to popular mobile architectures on various image classification datasets like CIFAR-10, CIFAR-100, and Imagenet-1k. Experiments with search spaces -- DARTS and NAS-Bench-201 show the generalization capabilities of DCA-NAS. On further evaluating our approach on Hardware-NAS-Bench, device-specific architectures with low inference latency and state-of-the-art performance were discovered.
Automatic diagnosis of knee osteoarthritis severity using Swin transformer
results: Experimental results demonstrate that the method accurately predicts KOA severity.
Abstract
Knee osteoarthritis (KOA) is a widespread condition that can cause chronic pain and stiffness in the knee joint. Early detection and diagnosis are crucial for successful clinical intervention and management to prevent severe complications, such as loss of mobility. In this paper, we propose an automated approach that employs the Swin Transformer to predict the severity of KOA. Our model uses publicly available radiographic datasets with Kellgren and Lawrence scores to enable early detection and severity assessment. To improve the accuracy of our model, we employ a multi-prediction head architecture that utilizes multi-layer perceptron classifiers. Additionally, we introduce a novel training approach that reduces the data drift between multiple datasets to ensure the generalization ability of the model. The results of our experiments demonstrate the effectiveness and feasibility of our approach in predicting KOA severity accurately.
Global and Local Visual Processing: Influence of Perceptual Field Variables
results: The study finds that the two PFVs Congruency and Size have significant effects, while Sparsity has only small effects. In addition, the task paradigm and its interaction with the PFVs show significant effects, indicating that the task paradigm plays an important role in evaluating the influence of PFVs on the GPE.
Abstract
The Global Precedence Effect (GPE) suggests that the processing of global properties of a visual stimulus precedes the processing of local properties. The generality of this theory has been debated for four decades across different known Perceptual Field Variables (PFVs). In our recent meta-analysis, we pooled the effect sizes of various PFVs reported over these four decades. Building on that study, in the present paper we explore the effects of Congruency, Size, and Sparsity and their interaction on the global advantage in two experiments with different task paradigms: Matching judgment and Similarity judgment. The results show that Congruency and Size have significant effects, while Sparsity has small effects. The task paradigm and its interaction with the other PFVs also show significant effects, which highlights the prominent role of task paradigms in evaluating the effects of PFVs on the GPE. We further found that the effects of these parameters were not specific to the condition in which individuals were instructed to maintain retinal stabilization, so the experiments extend more readily to everyday human behavior.
Identification of Hemorrhage and Infarct Lesions on Brain CT Images using Deep Learning
for: This paper aims to evaluate the performance of a deep learning-based algorithm in identifying intracranial hemorrhage (ICH) and infarct from head non-contrast computed tomography (NCCT) scans.
methods: The paper uses a deep learning-based algorithm to automatically identify ICH and infarct from head-NCCT scans. The dataset used for validation consists of head-NCCT scans collected from multiple diagnostic imaging centers across India.
results: The study shows the potential and limitations of using a DL-based algorithm for identifying ICH and infarct from head-NCCT scans. The algorithm demonstrated high accuracy in identifying ICH and infarct, but the results also highlighted the limitations of using this approach in routine clinical practice due to factors such as image quality and dataset variability.
Abstract
Head non-contrast computed tomography (NCCT) scans remain the preferred primary imaging modality due to their widespread availability and speed. However, the current standard of manual annotation of abnormal brain tissue on head NCCT scans involves significant disadvantages, such as the lack of cutoff standardization and degeneration identification. The recent advancement of deep learning-based computer-aided diagnostic (CAD) models in the multidisciplinary domain has created vast opportunities in neurological medical imaging. Significant literature has been published on the automated identification of brain tissue on different imaging modalities. However, determining intracranial hemorrhage (ICH) and infarct can be challenging due to image texture, volume size, and scan quality variability. This retrospective validation study evaluated a DL-based algorithm identifying ICH and infarct from head-NCCT scans. The head-NCCT scans dataset was collected consecutively from multiple diagnostic imaging centers across India. The study exhibits the potential and limitations of such DL-based software for introduction in routine workflow in extensive healthcare facilities.
Towards Enabling Cardiac Digital Twins of Myocardial Infarction Using Deep Computational Models for Inverse Inference
results: The study shows that the proposed deep computational model can effectively capture the complex relationships between the QRS complex and the corresponding infarct regions, with promising potential for broader application.Abstract
Myocardial infarction (MI) demands precise and swift diagnosis. Cardiac digital twins (CDTs) have the potential to offer individualized evaluation of cardiac function in a non-invasive manner, making them a promising approach for personalized diagnosis and treatment planning of MI. The inference of accurate myocardial tissue properties is crucial in creating a reliable CDT platform, and particularly in the context of studying MI. In this work, we investigate the feasibility of inferring myocardial tissue properties from the electrocardiogram (ECG), focusing on the development of a comprehensive CDT platform specifically designed for MI. The platform integrates multi-modal data, such as cardiac MRI and ECG, to enhance the accuracy and reliability of the inferred tissue properties. We perform a sensitivity analysis based on computer simulations, systematically exploring the effects of infarct location, size, degree of transmurality, and electrical activity alteration on the simulated QRS complex of ECG, to establish the limits of the approach. We subsequently propose a deep computational model to infer infarct location and distribution from the simulated QRS. The in silico experimental results show that our model can effectively capture the complex relationships between the QRS signals and the corresponding infarct regions, with promising potential for clinical application in the future. The code will be released publicly once the manuscript is accepted for publication.
摘要
In this study, we investigate the feasibility of inferring myocardial tissue properties from the electrocardiogram (ECG) to develop a comprehensive CDT platform for MI. We integrate multi-modal data, including cardiac MRI and ECG, to enhance the accuracy and reliability of the inferred tissue properties. To evaluate the effectiveness of our approach, we perform a sensitivity analysis based on computer simulations. We systematically explore the effects of infarct location, size, degree of transmurality, and electrical activity alteration on the simulated QRS complex of ECG. Our results show that our model can effectively capture the complex relationships between the QRS signals and the corresponding infarct regions, with promising potential for clinical application in the future. Our code will be released publicly once the manuscript is accepted for publication, providing a valuable resource for researchers and clinicians working in the field of MI diagnosis and treatment.
for: solve Video Object Segmentation (VOS) in an unsupervised setting
methods: incorporating intra-frame appearance and flow similarities, and inter-frame temporal continuation of the objects under consideration
results: Results comparable (within a range of ~2 mIoU) to the existing top approaches in unsupervised VOS.Abstract
Segmentation of objects in a video is challenging due to the nuances such as motion blurring, parallax, occlusions, changes in illumination, etc. Instead of addressing these nuances separately, we focus on building a generalizable solution that avoids overfitting to the individual intricacies. Such a solution would also help us save enormous resources involved in human annotation of video corpora. To solve Video Object Segmentation (VOS) in an unsupervised setting, we propose a new pipeline (FODVid) based on the idea of guiding segmentation outputs using flow-guided graph-cut and temporal consistency. Basically, we design a segmentation model incorporating intra-frame appearance and flow similarities, and inter-frame temporal continuation of the objects under consideration. We perform an extensive experimental analysis of our straightforward methodology on the standard DAVIS16 video benchmark. Though simple, our approach produces results comparable (within a range of ~2 mIoU) to the existing top approaches in unsupervised VOS. The simplicity and effectiveness of our technique opens up new avenues for research in the video domain.
摘要
视频对象分割是一项具有挑战性的任务,因为存在运动模糊、视差、遮挡、光照变化等诸多细节问题。我们没有分别处理这些细节,而是着力构建一个通用的解决方案,以避免对个别细节的过拟合。这样的解决方案还能节省视频语料人工标注所需的大量资源。为在无监督设置下解决视频对象分割(VOS)问题,我们提出了一个新的流水线(FODVid),其思想是利用光流引导的图割和时间一致性来指导分割输出。具体来说,我们设计了一个分割模型,结合帧内的外观与光流相似性,以及所关注对象在帧间的时间连续性。我们在标准的 DAVIS16 视频基准上对这一简洁方法进行了广泛的实验分析。尽管简单,我们的方法取得了与现有最优无监督 VOS 方法相当的结果(差距约在 2 mIoU 以内)。该方法的简洁性和有效性为视频领域的研究开辟了新的方向。
CT-based Subchondral Bone Microstructural Analysis in Knee Osteoarthritis via MR-Guided Distillation Learning
For: The paper aims to develop a novel method for subchondral bone microstructural analysis using easily-acquired CT images, which can enhance the accuracy of knee osteoarthritis classification.* Methods: The proposed method, named SRRD, leverages paired MR images to enhance the CT-based analysis model during training. The method uses a GAN-based generative model to transform MR images into CT images, and a distillation-learning technique to transfer MR structural information to the CT-based model.* Results: The proposed method achieved high reliability and validity in MR-CT registration, regression, and knee osteoarthritis classification, with an AUC score of 0.767 (95% CI, 0.681-0.853). The use of distillation learning significantly improved the performance of the CT-based knee osteoarthritis classification method using the CNN approach.Abstract
Background: MR-based subchondral bone effectively predicts knee osteoarthritis. However, its clinical application is limited by the cost and time of MR. Purpose: We aim to develop a novel distillation-learning-based method named SRRD for subchondral bone microstructural analysis using easily-acquired CT images, which leverages paired MR images to enhance the CT-based analysis model during training. Materials and Methods: Knee joint images of both CT and MR modalities were collected from October 2020 to May 2021. Firstly, we developed a GAN-based generative model to transform MR images into CT images, which was used to establish the anatomical correspondence between the two modalities. Next, we obtained numerous patches of subchondral bone regions of MR images, together with their trabecular parameters (BV / TV, Tb. Th, Tb. Sp, Tb. N) from the corresponding CT image patches via regression. The distillation-learning technique was used to train the regression model and transfer MR structural information to the CT-based model. The regressed trabecular parameters were further used for knee osteoarthritis classification. Results: A total of 80 participants were evaluated. CT-based regression results of trabecular parameters achieved intra-class correlation coefficients (ICCs) of 0.804, 0.773, 0.711, and 0.622 for BV / TV, Tb. Th, Tb. Sp, and Tb. N, respectively. The use of distillation learning significantly improved the performance of the CT-based knee osteoarthritis classification method using the CNN approach, yielding an AUC score of 0.767 (95% CI, 0.681-0.853) instead of 0.658 (95% CI, 0.574-0.742) (p<.001). Conclusions: The proposed SRRD method showed high reliability and validity in MR-CT registration, regression, and knee osteoarthritis classification, indicating the feasibility of subchondral bone microstructural analysis based on CT images.
摘要
Materials and Methods: Knee joint CT and MR images were collected from October 2020 to May 2021. A GAN-based generative model was used to transform MR images into CT images for anatomical correspondence. Regression was used to obtain trabecular parameters (BV/TV, Tb. Th, Tb. Sp, Tb. N) from CT image patches. Distillation learning was used to train the regression model and transfer MR structural information to the CT-based model. The regressed trabecular parameters were used for knee osteoarthritis classification. Results: 80 participants were evaluated. CT-based regression achieved intra-class correlation coefficients (ICCs) of 0.804, 0.773, 0.711, and 0.622 for BV/TV, Tb. Th, Tb. Sp, and Tb. N, respectively. Distillation learning significantly improved the performance of the CT-based knee osteoarthritis classification method using the CNN approach, yielding an AUC score of 0.767 (95% CI, 0.681-0.853) instead of 0.658 (95% CI, 0.574-0.742) (p<.001). Conclusions: The proposed SRRD method showed high reliability and validity in MR-CT registration, regression, and knee osteoarthritis classification, indicating the feasibility of subchondral bone microstructural analysis based on CT images.
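To make the distillation idea above concrete, here is a minimal sketch of a combined loss in which a CT-based student regressor is supervised by the ground-truth trabecular parameters while also matching intermediate features of a frozen MR-based teacher. The tensor shapes, feature dimensionality, and loss weighting are illustrative assumptions rather than the exact SRRD formulation.

```python
import torch
import torch.nn.functional as F

def srrd_style_loss(student_params, target_params,
                    student_feat, teacher_feat, distill_weight=0.5):
    """Regression loss on trabecular parameters (BV/TV, Tb.Th, Tb.Sp, Tb.N)
    plus a feature-distillation term that transfers structural information
    from a frozen MR-based teacher to the CT-based student."""
    reg_loss = F.mse_loss(student_params, target_params)
    distill_loss = F.mse_loss(student_feat, teacher_feat.detach())
    return reg_loss + distill_weight * distill_loss

# toy usage: batch of 2 patches, 4 trabecular parameters, 128-d features
loss = srrd_style_loss(torch.randn(2, 4), torch.randn(2, 4),
                       torch.randn(2, 128), torch.randn(2, 128))
```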
Towards Generalizable Diabetic Retinopathy Grading in Unseen Domains
paper_authors: Haoxuan Che, Yuhan Cheng, Haibo Jin, Hao Chen for: This paper aims to address the domain generalization problem in deep learning-based automated diabetic retinopathy (DR) grading, which hinders the real-world deployment of DR grading systems.methods: The proposed method, called the Generalizable Diabetic Retinopathy Grading Network (GDRNet), consists of three components: fundus visual-artifact augmentation (FundusAug), dynamic hybrid-supervised loss (DahLoss), and domain-class-aware re-balancing (DCR).results: GDRNet achieves state-of-the-art performance on a publicly available benchmark and demonstrates better generalization ability than existing methods through extensive comparison experiments and ablation studies.Abstract
Diabetic Retinopathy (DR) is a common complication of diabetes and a leading cause of blindness worldwide. Early and accurate grading of its severity is crucial for disease management. Although deep learning has shown great potential for automated DR grading, its real-world deployment is still challenging due to distribution shifts among source and target domains, known as the domain generalization problem. Existing works have mainly attributed the performance degradation to limited domain shifts caused by simple visual discrepancies, which cannot handle complex real-world scenarios. Instead, we present preliminary evidence suggesting the existence of three-fold generalization issues: visual and degradation style shifts, diagnostic pattern diversity, and data imbalance. To tackle these issues, we propose a novel unified framework named Generalizable Diabetic Retinopathy Grading Network (GDRNet). GDRNet consists of three vital components: fundus visual-artifact augmentation (FundusAug), dynamic hybrid-supervised loss (DahLoss), and domain-class-aware re-balancing (DCR). FundusAug generates realistic augmented images via visual transformation and image degradation, while DahLoss jointly leverages pixel-level consistency and image-level semantics to capture the diverse diagnostic patterns and build generalizable feature representations. Moreover, DCR mitigates the data imbalance from a domain-class view and avoids undesired over-emphasis on rare domain-class pairs. Finally, we design a publicly available benchmark for fair evaluations. Extensive comparison experiments against advanced methods and exhaustive ablation studies demonstrate the effectiveness and generalization ability of GDRNet.
摘要
糖尿病视网膜病变(DR)是糖尿病的常见并发症,也是全球致盲的主要原因之一。早期和准确地评估其严重程度是疾病管理的关键。虽然深度学习已经展现出自动DR分级的巨大潜力,但由于源域与目标域之间的分布偏移,即域泛化问题,其实际部署仍然具有挑战性。现有的工作主要将性能下降归因于简单视觉差异造成的有限域偏移,难以应对复杂的真实场景。与之不同,我们给出初步证据,指出存在三方面的泛化问题:视觉与退化风格偏移、诊断模式多样性和数据不均衡。为解决这些问题,我们提出了一个统一的新框架,即可泛化DR分级网络(GDRNet)。GDRNet包括三个重要组成部分:眼底视觉伪影增强(FundusAug)、动态混合监督损失(DahLoss)和域类感知重平衡(DCR)。FundusAug通过视觉变换和图像退化生成真实的增强图像,而DahLoss同时利用像素级一致性和图像级语义来捕捉多样的诊断模式,并构建可泛化的特征表示。此外,DCR从域-类视角缓解数据不均衡,避免对稀有的域-类组合过度强调。最后,我们设计了一个公开可用的基准,以便进行公平的评估。针对先进方法的大量对比实验和详尽的消融研究表明,GDRNet具有良好的有效性和泛化能力。
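The domain-class-aware re-balancing (DCR) component can be pictured as assigning each (domain, class) pair a weight inversely proportional to its frequency, capped so that very rare pairs are not over-emphasized. The sketch below illustrates that idea only; the paper's actual DCR weighting may differ.

```python
from collections import Counter

def domain_class_weights(domains, labels, max_weight=5.0):
    """Inverse-frequency weight per (domain, class) pair, capped so that
    very rare pairs are not over-emphasized during re-balancing."""
    counts = Counter(zip(domains, labels))
    n = len(domains)
    n_pairs = len(counts)
    weights = {}
    for pair, c in counts.items():
        w = n / (n_pairs * c)              # 1.0 if the pair had its "fair share"
        weights[pair] = min(w, max_weight)
    return weights

# toy usage: three domains, two DR grades, imbalanced counts
domains = ["A", "A", "A", "B", "B", "C"]
labels  = [0, 0, 1, 0, 1, 1]
print(domain_class_weights(domains, labels))
```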
One-Shot Pruning for Fast-adapting Pre-trained Models on Devices
for: This paper proposes a scalable one-shot pruning method for deploying large-scale pre-trained models on low-capability devices.
methods: The method leverages the pruned models of similar tasks to extract a suitably-sized sub-network from the pre-trained model, enabling fast adaptation to a new task.
results: Experimental analysis shows that the proposed method achieves high accuracy and efficiency across various datasets, and outperforms comparable pruning baselines on diverse downstream tasks and devices.Abstract
Large-scale pre-trained models have been remarkably successful in resolving downstream tasks. Nonetheless, deploying these models on low-capability devices still requires an effective approach, such as model pruning. However, pruning the model from scratch can pose a practical challenge given the limited resources of each downstream task or device. To tackle this issue, we present a scalable one-shot pruning method that leverages pruned knowledge of similar tasks to extract a sub-network from the pre-trained model for a new task. Specifically, we create a score mask using the pruned models of similar tasks to identify task-specific filters/nodes in the pre-trained model for the new task. Based on this mask, we conduct a single round of pruning to extract a suitably-sized sub-network that can quickly adapt to the new task with only a few training iterations. Our experimental analysis demonstrates the effectiveness of the proposed method on the convolutional neural networks (CNNs) and vision transformers (ViT) with various datasets. The proposed method consistently outperforms popular pruning baseline methods in terms of accuracy and efficiency when dealing with diverse downstream tasks with different memory constraints.
摘要
大规模预训练模型已经在下游任务中表现出色。然而,在低能力设备上部署这些模型仍然需要有效的方法,如模型剪枝。然而,从零开始剪枝模型可能是实际挑战,因为每个下游任务或设备的资源都是有限的。为解决这个问题,我们提出了一种可扩展的单次剪枝方法,利用相似任务的剪枝知识,从预训练模型中为新任务提取子网络。特别是,我们利用相似任务的剪枝模型创建一个分数掩码,用以标识预训练模型中与新任务相关的滤波器/节点。根据这个掩码,我们在单轮剪枝中提取适当大小的子网络,只需几次训练迭代即可快速适应新任务。我们的实验分析表明,该方法在卷积神经网络(CNN)和视觉Transformer(ViT)上对各种数据集都有效。在处理具有不同内存约束的下游任务时,我们的方法在准确性和效率方面通常优于流行的剪枝基线方法。
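A minimal sketch of the score-mask idea described above: binary keep-masks from models pruned for similar tasks are averaged into a score per filter, and a single pruning round keeps the top-scoring filters for the new task. The mask format, keep ratio, and per-layer handling are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def build_score_mask(similar_task_masks):
    """Average binary keep-masks (1 = filter kept) from models pruned for
    similar tasks; higher scores mean a filter was useful more often."""
    return torch.stack(similar_task_masks, dim=0).float().mean(dim=0)

def one_shot_prune(weight, score_mask, keep_ratio=0.5):
    """Single pruning round: keep the filters with the highest scores.
    `weight`: (out_channels, ...) conv weight; returns a pruned copy."""
    n_keep = max(1, int(keep_ratio * weight.shape[0]))
    keep_idx = torch.topk(score_mask, n_keep).indices
    return weight[keep_idx], keep_idx

# toy usage with hypothetical masks from three similar tasks
masks = [torch.tensor([1, 0, 1, 1]),
         torch.tensor([1, 1, 0, 1]),
         torch.tensor([1, 0, 0, 1])]
score = build_score_mask(masks)            # e.g. tensor([1.00, 0.33, 0.33, 1.00])
w = torch.randn(4, 3, 3, 3)                # a 4-filter conv layer
w_pruned, kept = one_shot_prune(w, score, keep_ratio=0.5)
```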
InfLoR-SNN: Reducing Information Loss for Spiking Neural Networks
results: Experimental results show that SNNs combining the "Soft Reset" mechanism and the Membrane Potential Rectifier outperform their vanilla counterparts on both static and dynamic datasets.Abstract
The Spiking Neural Network (SNN) has attracted more and more attention recently. It adopts binary spike signals to transmit information. Benefitting from the information passing paradigm of SNNs, the multiplications of activations and weights can be replaced by additions, which are more energy-efficient. However, its "Hard Reset" mechanism for the firing activity would ignore the difference among membrane potentials when the membrane potential is above the firing threshold, causing information loss. Meanwhile, quantifying the membrane potential to 0/1 spikes at the firing instants will inevitably introduce the quantization error thus bringing about information loss too. To address these problems, we propose to use the "Soft Reset" mechanism for the supervised training-based SNNs, which will drive the membrane potential to a dynamic reset potential according to its magnitude, and Membrane Potential Rectifier (MPR) to reduce the quantization error via redistributing the membrane potential to a range close to the spikes. Results show that the SNNs with the "Soft Reset" mechanism and MPR outperform their vanilla counterparts on both static and dynamic datasets.
摘要
脉冲神经网络(SNN)近年来受到越来越多的关注。它使用二值脉冲信号来传递信息。得益于SNN的信息传递范式,激活值与权重的乘法可以被加法替代,从而更加节能。然而,SNN的"硬重置"机制在膜电位超过发放阈值时会忽略不同膜电位之间的差异,造成信息损失。同时,在发放时刻将膜电位量化为0/1脉冲不可避免地引入量化误差,同样造成信息损失。为解决这些问题,我们提出在基于监督训练的SNN中使用"软重置"机制,根据膜电位的大小将其驱动到一个动态重置电位,并使用膜电位整流器(MPR)将膜电位重新分布到更接近脉冲的范围,从而减少量化误差。结果表明,采用"软重置"机制和MPR的SNN在静态和动态数据集上均优于其原始版本。
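The "Hard Reset" versus "Soft Reset" distinction can be illustrated with a single discrete leaky integrate-and-fire update, sketched below: a hard reset forces every firing neuron back to zero, while a soft reset subtracts the threshold and keeps the surplus potential. The neuron model and decay constant are illustrative assumptions, and the Membrane Potential Rectifier is not shown.

```python
import torch

def lif_step(v, x, v_th=1.0, decay=0.5, soft_reset=True):
    """One discrete LIF update. `v`: membrane potential, `x`: input current.

    Hard reset: potentials above threshold are all forced back to 0,
    discarding how far each neuron exceeded v_th.
    Soft reset: the threshold is subtracted, so the surplus is kept."""
    v = decay * v + x                      # leaky integration
    spike = (v >= v_th).float()            # binary spike output
    if soft_reset:
        v = v - spike * v_th               # keep the residual potential
    else:
        v = v * (1.0 - spike)              # hard reset to zero
    return spike, v

# toy usage: two neurons, one just over threshold, one far over it
v = torch.zeros(2)
spike, v = lif_step(v, torch.tensor([1.1, 2.5]))
print(spike, v)   # both spike; soft reset keeps 0.1 and 1.5 respectively
```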
Hierarchical Semantic Tree Concept Whitening for Interpretable Image Classification
results: Improves model interpretability with better disentanglement of semantic concepts, without degrading model classification performance.Abstract
With the popularity of deep neural networks (DNNs), model interpretability is becoming a critical concern. Many approaches have been developed to tackle the problem through post-hoc analysis, such as explaining how predictions are made or understanding the meaning of neurons in middle layers. Nevertheless, these methods can only discover the patterns or rules that naturally exist in models. In this work, rather than relying on post-hoc schemes, we proactively instill knowledge to alter the representation of human-understandable concepts in hidden layers. Specifically, we use a hierarchical tree of semantic concepts to store the knowledge, which is leveraged to regularize the representations of image data instances while training deep models. The axes of the latent space are aligned with the semantic concepts, where the hierarchical relations between concepts are also preserved. Experiments on real-world image datasets show that our method improves model interpretability, showing better disentanglement of semantic concepts, without negatively affecting model classification performance.
摘要
随着深度神经网络(DNN)的普及,模型可解释性变得越来越重要。许多方法已经被开发出来,通过事后分析来解决这个问题,例如解释预测是如何做出的,或者理解中间层神经元的含义。然而,这些方法只能发现模型中自然存在的模式或规则。在这项工作中,我们不再依赖事后分析方法,而是主动注入知识,以改变隐层中人类可理解概念的表示。具体来说,我们使用一个层次化的语义概念树来存储知识,并在训练深度模型时利用它来正则化图像数据实例的表示。隐空间的坐标轴与语义概念对齐,同时保留概念之间的层次关系。在真实图像数据集上的实验表明,我们的方法可以提高模型的可解释性,使语义概念获得更好的解耦,同时不会影响模型的分类性能。
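One way to picture the concept-aligned latent space is a regularizer that ties each semantic concept to one latent axis and additionally rewards activating the parent concept's axis, reflecting the hierarchy. The sketch below is only an illustration under those assumptions; the paper's whitening-based formulation is more involved, and the parent term, loss weights, and axis assignment here are hypothetical.

```python
import torch
import torch.nn.functional as F

def concept_alignment_loss(z, concept_ids, parent_of, parent_weight=0.5):
    """`z`: (batch, d) latent codes where concept k is tied to latent axis k.
    `concept_ids`: annotated concept per sample; `parent_of[k]`: parent of
    concept k in the semantic tree, or -1 at a root."""
    align = F.cross_entropy(z, concept_ids)            # own axis should dominate
    idx = torch.nonzero(parent_of[concept_ids] >= 0, as_tuple=True)[0]
    if idx.numel() == 0:
        return align
    parents = parent_of[concept_ids][idx]
    # hierarchy term: samples of a child concept should also score high
    # on the parent concept's axis
    parent_term = -z[idx, parents].mean()
    return align + parent_weight * parent_term

# toy usage: 6 concepts, concepts 2..5 are children of 0 or 1
parent_of = torch.tensor([-1, -1, 0, 0, 1, 1])
z = torch.randn(8, 6)
labels = torch.randint(0, 6, (8,))
print(concept_alignment_loss(z, labels, parent_of))
```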
New Variants of Frank-Wolfe Algorithm for Video Co-localization Problem
results: Achieves competitive efficiency on the YouTube-Objects video dataset, with comparisons against the conditional gradient sliding (CGS) algorithm.Abstract
The co-localization problem is a model that simultaneously localizes objects of the same class within a series of images or videos. In \cite{joulin2014efficient}, the authors present new variants of the Frank-Wolfe algorithm (aka conditional gradient) that increase the efficiency of solving the image and video co-localization problems. The authors demonstrate the efficiency of their methods via the rate of decrease of a value called the Wolfe gap at each iteration of the algorithm. In this project, inspired by the conditional gradient sliding algorithm (CGS) \cite{CGS:Lan}, we propose algorithms for solving such problems and demonstrate their efficiency through numerical experiments. The efficiency of these methods, measured by the Wolfe gap, is compared by implementing them on the YouTube-Objects video dataset.
摘要
协同定位(co-localization)问题是一种在一系列图像或视频中同时定位同一类对象的模型。在 \cite{joulin2014efficient} 中,作者们提出了 Frank-Wolfe 算法(即条件梯度法)的新变种,以提高解决图像和视频协同定位问题的效率。作者们通过算法每次迭代中 Wolfe 差值的下降率来证明方法的效率。在这个项目中,我们受条件梯度滑动算法(CGS)\cite{CGS:Lan} 的启发,提出了解决这类问题的算法,并通过数值实验证明了所提算法的效率。我们在 YouTube-Objects 视频数据集上实现这些方法,并以 Wolfe 差值为标准比较其效率。
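For readers unfamiliar with the Wolfe gap, the sketch below shows a generic Frank-Wolfe iteration over the probability simplex, where the gap <x_t - s_t, grad f(x_t)> both upper-bounds the suboptimality and serves as the stopping criterion. The quadratic toy objective and simplex feasible set are illustrative assumptions; the co-localization problem uses its own constraint set and objective.

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, max_iter=200, tol=1e-6):
    """Frank-Wolfe (conditional gradient) over the probability simplex.

    The linear minimization oracle over the simplex is the vertex with the
    smallest gradient coordinate. The Wolfe gap g_t = <x_t - s_t, grad f(x_t)>
    upper-bounds f(x_t) - f(x*) and is the quantity whose decrease rate the
    co-localization papers track."""
    x = x0.copy()
    gap = np.inf
    for t in range(max_iter):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0              # LMO over the simplex
        gap = np.dot(x - s, g)             # Wolfe (duality) gap
        if gap < tol:
            break
        gamma = 2.0 / (t + 2.0)            # standard step size
        x = (1 - gamma) * x + gamma * s
    return x, gap

# toy usage: minimize ||x - c||^2 over the simplex (c already lies inside it)
c = np.array([0.2, 0.5, 0.3])
x, gap = frank_wolfe_simplex(lambda x: 2 * (x - c), np.array([1.0, 0.0, 0.0]))
```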
Leveraging Multiple Descriptive Features for Robust Few-shot Image Learning
results: The study shows that this approach improves image classification accuracy and robustness in the few-shot learning setting, and achieves state-of-the-art performance on both in-distribution and out-of-distribution evaluations.Abstract
Modern image classification is based upon directly predicting model classes via large discriminative networks, making it difficult to assess the intuitive visual ``features'' that may constitute a classification decision. At the same time, recent works in joint visual language models such as CLIP provide ways to specify natural language descriptions of image classes but typically focus on providing single descriptions for each class. In this work, we demonstrate that an alternative approach, arguably more akin to our understanding of multiple ``visual features'' per class, can also provide compelling performance in the robust few-shot learning setting. In particular, we automatically enumerate multiple visual descriptions of each class -- via a large language model (LLM) -- then use a vision-image model to translate these descriptions to a set of multiple visual features of each image; we finally use sparse logistic regression to select a relevant subset of these features to classify each image. This both provides an ``intuitive'' set of relevant features for each class, and in the few-shot learning setting, outperforms standard approaches such as linear probing. When combined with finetuning, we also show that the method is able to outperform existing state-of-the-art finetuning approaches on both in-distribution and out-of-distribution performance.
摘要
现代图像分类基于大型判别网络直接预测模型类别,使得构成分类决策的直观"视觉特征"难以评估。同时,近期的联合视觉语言模型(如 CLIP)提供了用自然语言描述图像类别的方法,但通常只为每个类别提供单一描述。在这项工作中,我们展示了一种不同的方法,更符合我们对每个类别具有多个"视觉特征"的认知。具体来说,我们使用大语言模型(LLM)自动生成每个类别的多个视觉描述,然后使用视觉语言模型将这些描述转换为每张图像的多个视觉特征;最后,我们使用稀疏逻辑回归选择其中相关的特征子集来对图像进行分类。这种方法不仅为每个类别提供了"直观"的相关特征集,而且在少样本学习设置中超越了线性探测等标准方法。当与微调结合使用时,该方法还能在分布内和分布外性能上超越现有最先进的微调方法。
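A minimal sketch of the selection step described above: assuming a CLIP-like model has already produced one similarity score per (image, description) pair, an L1-regularized logistic regression picks a sparse, class-specific subset of descriptions. The random feature matrix below stands in for those similarity scores, and the dimensions and regularization strength are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder similarities: in practice each column would be the cosine
# similarity between an image embedding and one LLM-generated description.
rng = np.random.default_rng(0)
n_images, n_classes, n_desc = 40, 4, 8                  # 8 descriptions per class
X = rng.normal(size=(n_images, n_classes * n_desc))     # image-description scores
y = rng.integers(0, n_classes, size=n_images)           # few-shot labels

# The L1 penalty keeps only a sparse, "intuitive" subset of descriptions.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# Non-zero coefficients indicate which descriptions each class relied on.
selected = {c: np.nonzero(clf.coef_[c])[0].tolist() for c in range(n_classes)}
print(selected)
```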
results: Experimental results show that the proposed method efficiently and effectively enhances model robustness against noisy labels.Abstract
Supervised learning of deep neural networks heavily relies on large-scale datasets annotated by high-quality labels. In contrast, mislabeled samples can significantly degrade the generalization of models and result in memorizing samples, further learning erroneous associations of data contents to incorrect annotations. To this end, this paper proposes an efficient approach to tackle noisy labels by learning robust feature representation based on unsupervised augmentation restoration and cluster regularization. In addition, progressive self-bootstrapping is introduced to minimize the negative impact of supervision from noisy labels. Our proposed design is generic and flexible in applying to existing classification architectures with minimal overheads. Experimental results show that our proposed method can efficiently and effectively enhance model robustness under severely noisy labels.
摘要
深度神经网络的监督学习严重依赖于带有高质量标注的大规模数据集。相反,误标注样本会显著降低模型的泛化能力,导致模型记忆样本,进而学习到数据内容与错误标注之间的错误关联。为此,本文提出了一种高效的方法,通过基于无监督增强恢复和聚类正则化学习鲁棒的特征表示来应对噪声标签。此外,我们引入渐进式自举(progressive self-bootstrapping),以最小化噪声标签监督带来的负面影响。我们提出的设计具有通用性和灵活性,能够以极小的开销应用于现有的分类架构。实验结果表明,所提方法能够在严重噪声标签下高效且有效地增强模型的鲁棒性。
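Progressive self-bootstrapping can be illustrated with a bootstrapped cross-entropy in which the target is a convex blend of the given (possibly noisy) labels and the model's own soft predictions, with the blend shifting toward the predictions as training progresses. The schedule and blend form below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def bootstrapped_ce(logits, noisy_labels, epoch, total_epochs, beta_min=0.6):
    """Cross-entropy against a convex blend of the (possibly noisy) labels
    and the model's own soft predictions. Early in training the given labels
    dominate; the weight on self-predictions grows progressively."""
    beta = max(beta_min, 1.0 - epoch / total_epochs)   # weight on given labels
    with torch.no_grad():
        soft_pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(noisy_labels, num_classes=logits.shape[1]).float()
    target = beta * one_hot + (1.0 - beta) * soft_pred
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# toy usage
logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = bootstrapped_ce(logits, labels, epoch=30, total_epochs=100)
```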
K-Space-Aware Cross-Modality Score for Synthesized Neuroimage Quality Assessment
results: Our method performs strongly on the NIRPS dataset, particularly in comparison with radiologists. The experimental results show that the K-CROSS metric better assesses the quality of cross-modality medical image synthesis.Abstract
The problem of how to assess cross-modality medical image synthesis has been largely unexplored. The most used measures like PSNR and SSIM focus on analyzing the structural features but neglect the crucial lesion location and fundamental k-space speciality of medical images. To overcome this problem, we propose a new metric K-CROSS to spur progress on this challenging problem. Specifically, K-CROSS uses a pre-trained multi-modality segmentation network to predict the lesion location, together with a tumor encoder for representing features, such as texture details and brightness intensities. To further reflect the frequency-specific information from the magnetic resonance imaging principles, both k-space features and vision features are obtained and employed in our comprehensive encoders with a frequency reconstruction penalty. The structure-shared encoders are designed and constrained with a similarity loss to capture the intrinsic common structural information for both modalities. As a consequence, the features learned from lesion regions, k-space, and anatomical structures are all captured, which serve as our quality evaluators. We evaluate the performance by constructing a large-scale cross-modality neuroimaging perceptual similarity (NIRPS) dataset with 6,000 radiologist judgments. Extensive experiments demonstrate that the proposed method outperforms other metrics, especially in comparison with the radiologists on NIRPS.
摘要
如何评估跨模态医学图像合成的质量在很大程度上仍未被探索。常用的度量方法,如 PSNR 和 SSIM,主要分析结构特征,却忽略了关键的病变位置以及医学图像固有的 k 空间特性。为解决这个问题,我们提出了一种新的度量 K-CROSS,以推动这一具有挑战性问题的研究。具体来说,K-CROSS 使用预训练的多模态分割网络来预测病变位置,并使用肿瘤编码器来表示纹理细节和亮度强度等特征。为了进一步反映磁共振成像原理中的频率特定信息,我们同时获取 k 空间特征和视觉特征,并将其用于带有频率重建惩罚项的综合编码器中。结构共享编码器通过相似性损失加以约束,以捕捉两种模态之间固有的共同结构信息。因此,从病变区域、k 空间和解剖结构学到的特征都被捕获,作为我们的质量评估依据。我们构建了一个包含 6,000 个放射科医生判断的大规模跨模态神经影像感知相似度(NIRPS)数据集来评估性能。大量实验表明,所提方法优于其他度量,尤其是在 NIRPS 上与放射科医生的对比中。
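The k-space information that K-CROSS exploits lives in the Fourier domain of the MR image. As a rough illustration only, the sketch below computes radial frequency-band energies from a 2D FFT; the actual method uses learned k-space encoders with a frequency reconstruction penalty, and the band count here is an arbitrary assumption.

```python
import numpy as np

def kspace_band_energies(image, n_bands=4):
    """Split the centered 2D Fourier spectrum of an image into radial bands
    and return the log-energy of each band: a crude stand-in for the learned
    k-space features used alongside vision features."""
    k = np.fft.fftshift(np.fft.fft2(image))
    mag2 = np.abs(k) ** 2
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r_max = r.max()
    energies = []
    for b in range(n_bands):
        mask = (r >= b * r_max / n_bands) & (r < (b + 1) * r_max / n_bands)
        energies.append(np.log1p(mag2[mask].sum()))
    return np.array(energies)

# toy usage on a synthetic "slice"
img = np.random.rand(64, 64)
print(kspace_band_energies(img))
```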
results: The results show that the error of the primitive-based representation is comparable to that of predicting depth from a single image.Abstract
We describe a method to parse a complex, cluttered indoor scene into primitives which offer a parsimonious abstraction of scene structure. Our primitives are simple convexes. Our method uses a learned regression procedure to parse a scene into a fixed number of convexes from RGBD input, and can optionally accept segmentations to improve the decomposition. The result is then polished with a descent method which adjusts the convexes to produce a very good fit, and greedily removes superfluous primitives. Because the entire scene is parsed, we can evaluate using traditional depth, normal, and segmentation error metrics. Our evaluation procedure demonstrates that the error from our primitive representation is comparable to that of predicting depth from a single image.
摘要
我们描述了一种方法,用于将复杂杂乱的室内场景解析成基本元素,这些基本元素提供了场景结构的简洁抽象。我们的基本元素是简单的凸体。我们的方法使用学习得到的回归过程,从 RGBD 输入中将场景解析为固定数量的凸体,并可选地接受分割结果来改进分解。随后使用下降法对结果进行微调,使凸体获得非常好的拟合,并贪婪地移除多余的基本元素。由于整个场景都被解析,我们可以使用传统的深度、法向量和分割误差度量来评估。我们的评估过程表明,基本元素表示的误差与从单张图像预测深度的误差相当。
Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation
results: Evaluation on three DA scenarios (Day/Night, USA/Singapore, and A2D2/SemanticKITTI) shows large improvements over previous methods on many metrics.Abstract
Existing methods of cross-modal domain adaptation for 3D semantic segmentation predict results only via 2D-3D complementarity that is obtained by cross-modal feature matching. However, as lacking supervision in the target domain, the complementarity is not always reliable. The results are not ideal when the domain gap is large. To solve the problem of lacking supervision, we introduce masked modeling into this task and propose a method Mx2M, which utilizes masked cross-modality modeling to reduce the large domain gap. Our Mx2M contains two components. One is the core solution, cross-modal removal and prediction (xMRP), which makes the Mx2M adapt to various scenarios and provides cross-modal self-supervision. The other is a new way of cross-modal feature matching, the dynamic cross-modal filter (DxMF) that ensures the whole method dynamically uses more suitable 2D-3D complementarity. Evaluation of the Mx2M on three DA scenarios, including Day/Night, USA/Singapore, and A2D2/SemanticKITTI, brings large improvements over previous methods on many metrics.
摘要
现有的用于 3D 语义分割的跨模态域适应方法仅通过由跨模态特征匹配获得的 2D-3D 互补性来预测结果。然而,由于缺乏目标域的监督,这种互补性并不总是可靠的。当域差较大时,结果并不理想。为解决缺乏监督的问题,我们将掩码建模引入该任务,提出了一种方法 Mx2M,利用掩码跨模态建模来缩小较大的域差。Mx2M 包含两个组件。一个是核心解决方案,即跨模态移除与预测(xMRP),它使 Mx2M 能够适应各种场景,并提供跨模态自监督。另一个是一种新的跨模态特征匹配方式,即动态跨模态滤波器(DxMF),它确保整个方法动态地使用更合适的 2D-3D 互补性。在 Day/Night、USA/Singapore 和 A2D2/SemanticKITTI 三个域适应场景中对 Mx2M 进行评估,在多个指标上相比以往方法取得了大幅提升。