paper_authors: Arman H Ter-Petrosyan, Jenna A Bilbrey, Christina M Doty, Bethany E Matthews, Le Wang, Yingge Du, Eric Lang, Khalid Hattar, Steven R Spurgeon
results: The method is demonstrated by tracking irradiation-induced amorphous fronts in thin films used for catalysis and electronics.
Abstract
We present a method for the unsupervised segmentation of electron microscopy images, which are powerful descriptors of materials and chemical systems. Images are oversegmented into overlapping chips, and similarity graphs are generated from embeddings extracted from a domain-pretrained convolutional neural network (CNN). The Louvain method for community detection is then applied to perform segmentation. The graph representation provides an intuitive way of presenting the relationship between chips and communities. We demonstrate our method to track irradiation-induced amorphous fronts in thin films used for catalysis and electronics. This method has potential for "on-the-fly" segmentation to guide emerging automated electron microscopes.
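As a rough illustration of the chips-to-graph-to-communities pipeline described above, the sketch below oversegments an image into overlapping chips, embeds them with an ImageNet-pretrained ResNet-18 (a stand-in for the paper's domain-pretrained CNN), links chips whose embeddings are similar, and runs Louvain community detection. The chip size, stride, and similarity threshold are illustrative choices, not values from the paper.

```python
import numpy as np
import torch
import networkx as nx                      # requires networkx >= 2.8 for louvain_communities
from torchvision.models import resnet18

def chip_image(img, chip=64, stride=32):
    """Oversegment a (H, W) micrograph into overlapping square chips."""
    chips, coords = [], []
    H, W = img.shape
    for y in range(0, H - chip + 1, stride):
        for x in range(0, W - chip + 1, stride):
            chips.append(img[y:y + chip, x:x + chip])
            coords.append((y, x))
    return np.stack(chips), coords

@torch.no_grad()
def embed_chips(chips):
    """Embed chips with an ImageNet-pretrained ResNet-18 (downloads weights on first use)."""
    model = resnet18(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Identity()                     # keep the 512-d penultimate features
    model.eval()
    x = torch.from_numpy(chips).float().unsqueeze(1).repeat(1, 3, 1, 1)   # N x 3 x h x w
    x = (x - x.mean()) / (x.std() + 1e-8)
    return model(x).numpy()

def louvain_segmentation(img, sim_threshold=0.8):
    chips, coords = chip_image(img)
    emb = embed_chips(chips)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                  # cosine similarity between chips
    G = nx.Graph()
    G.add_nodes_from(range(len(chips)))
    for i in range(len(chips)):
        for j in range(i + 1, len(chips)):
            if sim[i, j] > sim_threshold:
                G.add_edge(i, j, weight=float(sim[i, j]))
    communities = nx.community.louvain_communities(G, weight="weight", seed=0)
    labels = np.zeros(len(chips), dtype=int)
    for k, comm in enumerate(communities):
        for node in comm:
            labels[node] = k
    return labels, coords   # community id per chip; paint back onto pixels for a segmentation

labels, coords = louvain_segmentation(np.random.rand(256, 256))
```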
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
results: UFOGen excels at one-step text-to-image generation, efficiently producing high-quality images, and also demonstrates versatility across diverse downstream applications.
Abstract
Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation, UFOGen showcases versatility in applications. Notably, UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks, presenting a significant advancement in the landscape of efficient generative models. (*Work done as a student researcher of Google; $\dagger$ indicates equal contribution.)
Drivable 3D Gaussian Avatars
results: On nine subjects with varied body shapes, clothing, and motions, the method obtains higher-quality results than previous methods and is better suited to telepresence and communication applications.
Abstract
We present Drivable 3D Gaussian Avatars (D3GA), the first 3D controllable model for human bodies rendered with Gaussian splats. Current photorealistic drivable avatars require either accurate 3D registrations during training, dense input images during testing, or both. The ones based on neural radiance fields also tend to be prohibitively slow for telepresence applications. This work uses the recently presented 3D Gaussian Splatting (3DGS) technique to render realistic humans at real-time framerates, using dense calibrated multi-view videos as input. To deform those primitives, we depart from the commonly used point deformation method of linear blend skinning (LBS) and use a classic volumetric deformation method: cage deformations. Given their smaller size, we drive these deformations with joint angles and keypoints, which are more suitable for communication applications. Our experiments on nine subjects with varied body shapes, clothes, and motions obtain higher-quality results than state-of-the-art methods when using the same training and test data.
Reading Between the Mud: A Challenging Motorcycle Racer Number Dataset
results: Our experiments show that even after fine-tuning, leading OCR algorithms reach an end-to-end F1 score of only 0.527 on RnD; analysis identifies mud as the primary challenge, degrading accuracy substantially compared to normal conditions.
Abstract
This paper introduces the off-road motorcycle Racer number Dataset (RnD), a new challenging dataset for optical character recognition (OCR) research. RnD contains 2,411 images from professional motorsports photographers that depict motorcycle racers in off-road competitions. The images exhibit a wide variety of factors that make OCR difficult, including mud occlusions, motion blur, non-standard fonts, glare, complex backgrounds, etc. The dataset has 5,578 manually annotated bounding boxes around visible motorcycle numbers, along with transcribed digits and letters. Our experiments benchmark leading OCR algorithms and reveal an end-to-end F1 score of only 0.527 on RnD, even after fine-tuning. Analysis of performance on different occlusion types shows mud as the primary challenge, degrading accuracy substantially compared to normal conditions. But the models struggle with other factors including glare, blur, shadows, and dust. Analysis exposes substantial room for improvement and highlights failure cases of existing models. RnD represents a valuable new benchmark to drive innovation in real-world OCR capabilities. The authors hope the community will build upon this dataset and baseline experiments to make progress on the open problem of robustly recognizing text in unconstrained natural environments. The dataset is available at https://github.com/JacobTyo/SwinTextSpotter.
Topology of Surface Electromyogram Signals: Hand Gesture Decoding on Riemannian Manifolds
results: The analysis extracts rich geometric patterns from sEMG signals that distinguish different hand gestures, handles signal variability across individuals and sessions, and yields a more robust and transparent model.
Abstract
Decoding gestures from the upper limb using noninvasive surface electromyogram (sEMG) signals is of keen interest for the rehabilitation of amputees, artificial supernumerary limb augmentation, gestural control of computers, and virtual/augmented realities. We show that sEMG signals recorded across an array of sensor electrodes in multiple spatial locations around the forearm evince a rich geometric pattern of global motor unit (MU) activity that can be leveraged to distinguish different hand gestures. We demonstrate a simple technique to analyze spatial patterns of muscle MU activity within a temporal window and show that distinct gestures can be classified in both supervised and unsupervised manners. Specifically, we construct symmetric positive definite (SPD) covariance matrices to represent the spatial distribution of MU activity in a time window of interest, calculated as pairwise covariance of electrical signals measured across different electrodes. This allows us to understand and manipulate multivariate sEMG timeseries on a more natural subspace -the Riemannian manifold. Furthermore, it directly addresses signal variability across individuals and sessions, which remains a major challenge in the field. sEMG signals measured at a single electrode lack contextual information such as how various anatomical and physiological factors influence the signals and how their combined effect alters the evident interaction among neighboring muscles. As we show here, analyzing spatial patterns using covariance matrices on Riemannian manifolds allows us to robustly model complex interactions across spatially distributed MUs and provides a flexible and transparent framework to quantify differences in sEMG signals across individuals. The proposed method is novel in the study of sEMG signals and its performance exceeds the current benchmarks while maintaining exceptional computational efficiency.
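A minimal sketch of the covariance-based gesture decoding idea: each sEMG window is summarised by an SPD covariance matrix and classified by its nearest class mean. For simplicity the sketch uses the log-Euclidean metric (matrix logarithm plus Frobenius distance) rather than the affine-invariant Riemannian metric; the window shapes, regularisation, and nearest-mean classifier are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.linalg import logm

def window_covariance(window, eps=1e-6):
    """SPD covariance of one sEMG window, shape (n_channels, n_samples)."""
    c = np.cov(window)
    return c + eps * np.eye(c.shape[0])      # small regularisation keeps the matrix SPD

def log_map(spd):
    """Matrix logarithm: maps an SPD matrix into the (flat) log-Euclidean tangent space."""
    return np.real(logm(spd))

def fit_class_means(windows, labels):
    """Log-Euclidean mean covariance per gesture class."""
    means = {}
    for g in np.unique(labels):
        logs = [log_map(window_covariance(w)) for w, y in zip(windows, labels) if y == g]
        means[g] = np.mean(logs, axis=0)
    return means

def predict(window, class_means):
    """Nearest class mean under the log-Euclidean (Frobenius-in-log-space) distance."""
    L = log_map(window_covariance(window))
    return min(class_means, key=lambda g: np.linalg.norm(L - class_means[g], "fro"))

# toy usage: 20 windows, 8 electrodes, 200 samples, 2 gestures
rng = np.random.default_rng(0)
X = [rng.standard_normal((8, 200)) for _ in range(20)]
y = np.array([0] * 10 + [1] * 10)
means = fit_class_means(X, y)
print(predict(X[0], means))
```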
Physical Adversarial Examples for Multi-Camera Systems
results: Multi-camera setups provide some robustness against physical adversarial examples, but this advantage shrinks when the attack is optimized over multiple perspectives at once; the proposed Transcender-MC is 11% more effective at attacking multi-camera setups than state-of-the-art methods.
Abstract
Neural networks build the foundation of several intelligent systems, which, however, are known to be easily fooled by adversarial examples. Recent advances made these attacks possible even in air-gapped scenarios, where the autonomous system observes its surroundings by, e.g., a camera. We extend these ideas in our research and evaluate the robustness of multi-camera setups against such physical adversarial examples. This scenario becomes ever more important with the rise in popularity of autonomous vehicles, which fuse the information of several cameras for their driving decision. While we find that multi-camera setups provide some robustness towards past attack methods, we see that this advantage reduces when optimizing on multiple perspectives at once. We propose a novel attack method that we call Transcender-MC, where we incorporate online 3D renderings and perspective projections in the training process. Moreover, we motivate that certain data augmentation techniques can facilitate the generation of successful adversarial examples even further. Transcender-MC is 11% more effective in successfully attacking multi-camera setups than state-of-the-art methods. Our findings offer valuable insights regarding the resilience of object detection in a setup with multiple cameras and motivate the need of developing adequate defense mechanisms against them.
SceneScore: Learning a Cost Function for Object Arrangement
results: Experiments show that the learned cost function can predict poses for missing objects, generalize to novel objects using semantic features, and be composed with other cost functions to satisfy constraints at inference time.
Abstract
Arranging objects correctly is a key capability for robots which unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the desirability of a given arrangement. Our method "SceneScore" learns a cost function for arrangements, such that desirable, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images without requiring environment interaction or human supervision. Our model is represented by a graph neural network which learns object-object relations, using graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.
Cross-dataset domain adaptation for the classification COVID-19 using chest computed tomography images
methods: The paper builds on Deep Learning (DL) solutions, specifically Convolutional Neural Networks (CNN), which perform well when trained and tested on the same dataset but poorly across datasets; it therefore applies a Domain Adaptation (DA) technique to tackle the cross-dataset problem.
results: COVID19-DANet is tested under four cross-dataset scenarios using the SARS-CoV-2-CT and COVID19-CT datasets and achieves encouraging results compared to recent work.
Abstract
Detecting COVID-19 patients using Computed Tomography (CT) images of the lungs is an active area of research. Datasets of CT images from COVID-19 patients are becoming available. Deep learning (DL) solutions and in particular Convolutional Neural Networks (CNN) have achieved impressive results for the classification of COVID-19 CT images, but only when the training and testing take place within the same dataset. Work on the cross-dataset problem is still limited and the achieved results are low. Our work tackles the cross-dataset problem through a Domain Adaptation (DA) technique with deep learning. Our proposed solution, COVID19-DANet, is based on pre-trained CNN backbone for feature extraction. For this task, we select the pre-trained Efficientnet-B3 CNN because it has achieved impressive classification accuracy in previous work. The backbone CNN is followed by a prototypical layer which is a concept borrowed from prototypical networks in few-shot learning (FSL). It computes a cosine distance between given samples and the class prototypes and then converts them to class probabilities using the Softmax function. To train the COVID19-DANet model, we propose a combined loss function that is composed of the standard cross-entropy loss for class discrimination and another entropy loss computed over the unlabelled target set only. This so-called unlabelled target entropy loss is minimized and maximized in an alternative fashion, to reach the two objectives of class discrimination and domain invariance. COVID19-DANet is tested under four cross-dataset scenarios using the SARS-CoV-2-CT and COVID19-CT datasets and has achieved encouraging results compared to recent work in the literature.
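A hedged sketch of the prototypical head and combined loss described in the abstract: cosine similarity to class prototypes is converted to probabilities by a softmax, and an entropy term on unlabelled target predictions is alternately minimised and maximised. The learnable-prototype parameterisation, temperature, and weighting `lam` are assumptions for illustration, not the paper's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalHead(nn.Module):
    """Cosine distance to class prototypes, turned into class probabilities via softmax."""
    def __init__(self, feat_dim, n_classes, scale=10.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, feat_dim))
        self.scale = scale                       # temperature on the cosine similarity

    def forward(self, feats):
        f = F.normalize(feats, dim=1)
        p = F.normalize(self.prototypes, dim=1)
        return self.scale * f @ p.t()            # logits; softmax gives class probabilities

def combined_loss(src_logits, src_labels, tgt_logits, lam=0.1, maximize_target_entropy=False):
    """Cross-entropy on labelled source samples plus an entropy term over the unlabelled
    target predictions; the entropy term is alternately minimised and maximised to pursue
    class discrimination and domain invariance."""
    ce = F.cross_entropy(src_logits, src_labels)
    p = F.softmax(tgt_logits, dim=1)
    ent = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return ce - lam * ent if maximize_target_entropy else ce + lam * ent

# usage with a feature backbone (e.g. an EfficientNet-B3 without its classifier head):
# logits_s = head(backbone(x_source)); logits_t = head(backbone(x_target))
# loss = combined_loss(logits_s, y_source, logits_t, maximize_target_entropy=(step % 2 == 1))
```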
MADG: Margin-based Adversarial Learning for Domain Generalization
paper_authors: Aveen Dayal, Vimal K. B., Linga Reddy Cenkeramaddi, C. Krishna Mohan, Abhinav Kumar, Vineeth N Balasubramanian
for: The paper aims to address the challenges of domain shift in deep learning by proposing a novel adversarial learning-based domain generalization (DG) algorithm, MADG.
methods: The MADG model uses a margin loss-based discrepancy metric, which is more informative, tighter, practical, and efficiently optimizable compared to traditional 0-1 loss-based methods.
results: The proposed MADG model learns domain-invariant features across all source domains and generalizes well to unseen target domains, as demonstrated through extensive experiments on popular real-world DG datasets. The model's performance is consistent across all datasets, and the authors provide a theoretical analysis of the model's generalization bound using margin loss and Rademacher complexity.
Abstract
Domain Generalization (DG) techniques have emerged as a popular approach to address the challenges of domain shift in Deep Learning (DL), with the goal of generalizing well to the target domain unseen during the training. In recent years, numerous methods have been proposed to address the DG setting, among which one popular approach is the adversarial learning-based methodology. The main idea behind adversarial DG methods is to learn domain-invariant features by minimizing a discrepancy metric. However, most adversarial DG methods use 0-1 loss based $\mathcal{H}\Delta\mathcal{H}$ divergence metric. In contrast, the margin loss-based discrepancy metric has the following advantages: more informative, tighter, practical, and efficiently optimizable. To mitigate this gap, this work proposes a novel adversarial learning DG algorithm, MADG, motivated by a margin loss-based discrepancy metric. The proposed MADG model learns domain-invariant features across all source domains and uses adversarial training to generalize well to the unseen target domain. We also provide a theoretical analysis of the proposed MADG model based on the unseen target error bound. Specifically, we construct the link between the source and unseen domains in the real-valued hypothesis space and derive the generalization bound using margin loss and Rademacher complexity. We extensively experiment with the MADG model on popular real-world DG datasets, VLCS, PACS, OfficeHome, DomainNet, and TerraIncognita. We evaluate the proposed algorithm on DomainBed's benchmark and observe consistent performance across all the datasets.
Performance of Machine Learning Classification in Mammography Images using BI-RADS
results: In the full fine-tuning setting the models achieve an accuracy of 76.39% and an F1 score of 67.94%, demonstrating the potential and reliability of the Computer-Aided Diagnosis (CAD) system and providing a solid foundation for future improvements to breast imaging evaluation systems.
Abstract
This research aims to investigate the classification accuracy of various state-of-the-art image classification models across different categories of breast ultrasound images, as defined by the Breast Imaging Reporting and Data System (BI-RADS). To achieve this, we have utilized a comprehensively assembled dataset of 2,945 mammographic images sourced from 1,540 patients. In order to conduct a thorough analysis, we employed six advanced classification architectures, including VGG19 \cite{simonyan2014very}, ResNet50 \cite{he2016deep}, GoogleNet \cite{szegedy2015going}, ConvNext \cite{liu2022convnet}, EfficientNet \cite{tan2019efficientnet}, and Vision Transformers (ViT) \cite{dosovitskiy2020image}, instead of traditional machine learning models. We evaluate models in three different settings: full fine-tuning, linear evaluation and training from scratch. Our findings demonstrate the effectiveness and capability of our Computer-Aided Diagnosis (CAD) system, with a remarkable accuracy of 76.39\% and an F1 score of 67.94\% in the full fine-tuning setting. Our findings indicate the potential for enhanced diagnostic accuracy in the field of breast imaging, providing a solid foundation for future endeavors aiming to improve the precision and reliability of CAD systems in medical imaging.
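A small sketch of how the three evaluation settings (full fine-tuning, linear evaluation, training from scratch) might be configured for one of the backbones, here torchvision's ResNet-50. The class count and weight tag are placeholders, not details from the paper.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_model(setting, n_classes=7):
    """Configure ResNet-50 for one of the three evaluation settings."""
    if setting == "from_scratch":
        model = resnet50(weights=None)                      # random init, train everything
    else:
        model = resnet50(weights="IMAGENET1K_V2")           # ImageNet-pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, n_classes)   # new BI-RADS classification head
    if setting == "linear_eval":
        for name, p in model.named_parameters():
            p.requires_grad = name.startswith("fc")          # freeze backbone, train head only
    return model   # "full_fine_tuning" leaves all parameters trainable

model = build_model("full_fine_tuning")
```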
MUDD: A New Re-Identification Dataset with Efficient Annotation for Off-Road Racers in Extreme Conditions
results: Without fine-tuning, the best models achieve only 33% Rank-1 accuracy; fine-tuning on MUDD raises Rank-1 accuracy to 79%, yet significant room for improvement remains for robust re-identification.
Abstract
Re-identifying individuals in unconstrained environments remains an open challenge in computer vision. We introduce the Muddy Racer re-IDentification Dataset (MUDD), the first large-scale benchmark for matching identities of motorcycle racers during off-road competitions. MUDD exhibits heavy mud occlusion, motion blurring, complex poses, and extreme lighting conditions previously unseen in existing re-id datasets. We present an annotation methodology incorporating auxiliary information that reduced labeling time by over 65%. We establish benchmark performance using state-of-the-art re-id models including OSNet and ResNet-50. Without fine-tuning, the best models achieve only 33% Rank-1 accuracy. Fine-tuning on MUDD boosts results to 79% Rank-1, but significant room for improvement remains. We analyze the impact of real-world factors including mud, pose, lighting, and more. Our work exposes open problems in re-identifying individuals under extreme conditions. We hope MUDD serves as a diverse and challenging benchmark to spur progress in robust re-id, especially for computer vision applications in emerging sports analytics. All code and data can be found at https://github.com/JacobTyo/MUDD.
Leveraging Foundation Models to Improve Lightweight Clients in Federated Learning
paper_authors: Xidong Wu, Wan-Yi Lin, Devin Willmott, Filipe Condessa, Yufei Huang, Zhenzhen Li, Madan Ravi Ganesh
for: Assisting the federated training of lightweight client models under heterogeneous data distributions, improving model performance and robustness.
methods: Foundation model distillation, which assists federated training of lightweight clients while keeping computational overhead and inference costs low.
results: Global model performance improves under heterogeneous data distributions, particularly on rarely observed samples.
Abstract
Federated Learning (FL) is a distributed training paradigm that enables clients scattered across the world to cooperatively learn a global model without divulging confidential data. However, FL faces a significant challenge in the form of heterogeneous data distributions among clients, which leads to a reduction in performance and robustness. A recent approach to mitigating the impact of heterogeneous data distributions is through the use of foundation models, which offer better performance at the cost of larger computational overheads and slower inference speeds. We introduce foundation model distillation to assist in the federated training of lightweight client models and increase their performance under heterogeneous data settings while keeping inference costs low. Our results show improvement in the global model performance on a balanced testing set, which contains rarely observed samples, even under extreme non-IID client data distributions. We conduct a thorough evaluation of our framework with different foundation model backbones on CIFAR10, with varying degrees of heterogeneous data distributions ranging from class-specific data partitions across clients to dirichlet data sampling, parameterized by values between 0.01 and 1.0.
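A minimal sketch of foundation-model distillation inside a client update: the lightweight client model is trained with cross-entropy on its local labels plus a KL term toward the frozen foundation model's logits, and only the client weights are communicated. The temperature, weighting, and FedAvg-style aggregation are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard KD objective: CE on local labels plus KL toward the foundation model."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd

def local_update(client_model, foundation_model, loader, opt, device="cpu"):
    """One client round: the lightweight model is trained; the foundation model only supplies targets."""
    client_model.train()
    foundation_model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            t_logits = foundation_model(x)
        loss = distillation_loss(client_model(x), t_logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return client_model.state_dict()      # sent to the server for FedAvg-style aggregation
```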
Towards Open-Ended Visual Recognition with Large Language Model
for: Proposing a straightforward and effective solution, the OmniScient Model (OSM), to address the challenges of localizing and recognizing objects in the open-ended physical world.
methods: A Large Language Model (LLM) generates class labels, removing the need to supply class names during both training and testing; the model can also be trained across datasets without human intervention and exhibits robust generalization capabilities.
results: Combining OSM with an off-the-shelf mask proposal model yields promising results on various benchmarks and demonstrates effectiveness in handling novel concepts.
Abstract
Localizing and recognizing objects in the open-ended physical world poses a long-standing challenge within the domain of machine perception. Recent methods have endeavored to address the issue by employing a class-agnostic mask (or box) proposal model, complemented by an open-vocabulary classifier (e.g., CLIP) using pre-extracted text embeddings. However, it is worth noting that these open-vocabulary recognition models still exhibit limitations in practical applications. On one hand, they rely on the provision of class names during testing, where the recognition performance heavily depends on this predefined set of semantic classes by users. On the other hand, when training with multiple datasets, human intervention is required to alleviate the label definition conflict between them. In this paper, we introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a straightforward and effective solution to the aforementioned challenges. Specifically, OSM predicts class labels in a generative manner, thus removing the supply of class names during both training and testing. It also enables cross-dataset training without any human interference, exhibiting robust generalization capabilities due to the world knowledge acquired from the LLM. By combining OSM with an off-the-shelf mask proposal model, we present promising results on various benchmarks, and demonstrate its effectiveness in handling novel concepts. Code/model are available at https://github.com/bytedance/OmniScient-Model.
USLR: an open-source tool for unbiased and smooth longitudinal registration of brain MR
paper_authors: Adrià Casamitjana, Roser Sala-Llonch, Karim Lekadir, Juan Eugenio Iglesias
for: The paper proposes a computational framework for longitudinal registration of brain MRI scans that estimates nonlinear image trajectories over a time series which are smooth in time, unbiased to any timepoint, and robust to imaging artefacts.
methods: The framework operates on the Lie algebra parameterisation of spatial transforms (compatible with rigid transforms and stationary velocity fields), exploits log-domain properties, and uses Bayesian inference to estimate both rigid and nonlinear registrations.
results: The framework benefits an Alzheimer's disease study on multiple fronts, such as time-consistent image segmentation to reduce intra-subject variability and subject-specific prediction or population analysis using tensor-based morphometry; it detects subtler atrophy across scans and can reduce sample sizes in clinical trials.
Abstract
We present USLR, a computational framework for longitudinal registration of brain MRI scans to estimate nonlinear image trajectories that are smooth across time, unbiased to any timepoint, and robust to imaging artefacts. It operates on the Lie algebra parameterisation of spatial transforms (which is compatible with rigid transforms and stationary velocity fields for nonlinear deformation) and takes advantage of log-domain properties to solve the problem using Bayesian inference. USLR estimates rigid and nonlinear registrations that: (i) bring all timepoints to an unbiased subject-specific space; and (ii) compute a smooth trajectory across the imaging time-series. We capitalise on learning-based registration algorithms and closed-form expressions for fast inference. A use-case Alzheimer's disease study is used to showcase the benefits of the pipeline on multiple fronts, such as time-consistent image segmentation to reduce intra-subject variability, subject-specific prediction or population analysis using tensor-based morphometry. We demonstrate that such an approach improves upon cross-sectional methods in identifying group differences, which can be helpful in detecting more subtle atrophy levels or in reducing sample sizes in clinical trials. The code is publicly available in https://github.com/acasamitjana/uslr
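To make the log-domain, unbiased-space idea concrete, the sketch below computes a Log-Euclidean mean of rigid transforms and recentres every timepoint on that mid-space, so no single timepoint acts as the reference. This is a toy illustration of the Lie-algebra parameterisation only, not the paper's full Bayesian pipeline with stationary velocity fields.

```python
import numpy as np
from scipy.linalg import expm, logm

def unbiased_midspace(transforms):
    """Given rigid 4x4 transforms T_i mapping each timepoint to a common reference,
    compute their Log-Euclidean mean and return corrections that map every timepoint
    into an unbiased, subject-specific mid-space (no timepoint is privileged)."""
    logs = [np.real(logm(T)) for T in transforms]     # Lie-algebra parameterisation
    mean_log = np.mean(logs, axis=0)
    T_mid = expm(mean_log)                             # "average" pose of the time-series
    # Composing each T_i with the inverse mid-space transform centres the trajectory.
    return [np.linalg.inv(T_mid) @ T for T in transforms]

def rigid(theta, tx):
    """Toy in-plane rigid transform: rotation by theta plus translation tx."""
    T = np.eye(4)
    T[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
    T[0, 3] = tx
    return T

centred = unbiased_midspace([rigid(0.00, 0.0), rigid(0.02, 1.0), rigid(-0.01, 0.5)])
```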
The Perception-Robustness Tradeoff in Deterministic Image Restoration
paper_authors: Guy Ohayon, Tomer Michaeli, Michael Elad
for: Studying the behavior of deterministic methods for solving inverse problems in imaging.
methods: Uses the Lipschitz constant of the predictor to prove bounds on how well it can satisfy perceptual quality and measurement consistency simultaneously.
results: Deterministic methods are necessarily more susceptible to adversarial attacks, but this behavior can be leveraged to explore the posterior distribution, allowing a deterministic model to imitate stochastic methods.
Abstract
We study the behavior of deterministic methods for solving inverse problems in imaging. These methods are commonly designed to achieve two goals: (1) attaining high perceptual quality, and (2) generating reconstructions that are consistent with the measurements. We provide a rigorous proof that the better a predictor satisfies these two requirements, the larger its Lipschitz constant must be, regardless of the nature of the degradation involved. In particular, to approach perfect perceptual quality and perfect consistency, the Lipschitz constant of the model must grow to infinity. This implies that such methods are necessarily more susceptible to adversarial attacks. We demonstrate our theory on single image super-resolution algorithms, addressing both noisy and noiseless settings. We also show how this undesired behavior can be leveraged to explore the posterior distribution, thereby allowing the deterministic model to imitate stochastic methods.
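The quantity at the centre of this result, the local Lipschitz constant of a deterministic restoration model, can be probed empirically; the sketch below gives a Monte-Carlo lower bound around a single input. The probe count and perturbation scale are illustrative choices.

```python
import torch

@torch.no_grad()
def local_lipschitz_estimate(model, x, n_probes=32, eps=1e-3):
    """Monte-Carlo lower bound on the local Lipschitz constant of an image-to-image model
    around input x: max_k ||f(x + d_k) - f(x)|| / ||d_k|| over small random perturbations."""
    model.eval()
    y = model(x)
    best = 0.0
    for _ in range(n_probes):
        d = eps * torch.randn_like(x)
        ratio = (model(x + d) - y).norm() / d.norm()
        best = max(best, ratio.item())
    return best

# usage with any restoration predictor: L = local_lipschitz_estimate(sr_model, lr_image)
# The paper's result says that as perceptual quality and consistency both improve,
# this quantity must grow without bound, which is what makes such predictors attackable.
```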
Rotation-Agnostic Image Representation Learning for Digital Pathology
results: The compact model outperforms existing histopathology-specific vision transformers on 12 diverse datasets, including four internal datasets (breast, liver, skin, and colorectal) and seven public datasets, with an average 8.5% improvement in patch-level majority-vote performance.
Abstract
This paper addresses complex challenges in histopathological image analysis through three key contributions. Firstly, it introduces a fast patch selection method, FPS, for whole-slide image (WSI) analysis, significantly reducing computational cost while maintaining accuracy. Secondly, it presents PathDino, a lightweight histopathology feature extractor with a minimal configuration of five Transformer blocks and only 9 million parameters, markedly fewer than alternatives. Thirdly, it introduces a rotation-agnostic representation learning paradigm using self-supervised learning, effectively mitigating overfitting. We also show that our compact model outperforms existing state-of-the-art histopathology-specific vision transformers on 12 diverse datasets, including both internal datasets spanning four sites (breast, liver, skin, and colorectal) and seven public datasets (PANDA, CAMELYON16, BRACS, DigestPath, Kather, PanNuke, and WSSS4LUAD). Notably, even with a training dataset of 6 million histopathology patches from The Cancer Genome Atlas (TCGA), our approach demonstrates an average 8.5% improvement in patch-level majority vote performance. These contributions provide a robust framework for enhancing image analysis in digital pathology, rigorously validated through extensive evaluation. Project Page: https://rhazeslab.github.io/PathDino-Page/
Convolutional Neural Networks Exploiting Attributes of Biological Neurons
for: This paper aims to improve the performance of Convolutional Neural Networks (CNNs) by integrating principles from biological neurons into the network architecture.
methods: The proposed method uses neuroscience-inspired computational models of the Lateral Geniculate Nucleus (LGN) and simple cells of the primary visual cortex to extract image features as input to CNNs. It also uses a two-tower CNN architecture, with one shallow tower and one ResNet-18 tower, to enhance the learning process and performance.
results: The proposed method achieves a noticeable improvement in performance (on average 5-10%) on the CIFAR-10, CIFAR-100, and ImageNet-100 datasets compared to ResNet-18. The efficiency of the Push-Pull tower alone is also evaluated.
Abstract
In this era of artificial intelligence, deep neural networks like Convolutional Neural Networks (CNNs) have emerged as front-runners, often surpassing human capabilities. These deep networks are often perceived as the panacea for all challenges. Unfortunately, a common downside of these networks is their ''black-box'' character, which does not necessarily mirror the operation of biological neural systems. Some even have millions/billions of learnable (tunable) parameters, and their training demands extensive data and time. Here, we integrate the principles of biological neurons in certain layer(s) of CNNs. Specifically, we explore the use of neuro-science-inspired computational models of the Lateral Geniculate Nucleus (LGN) and simple cells of the primary visual cortex. By leveraging such models, we aim to extract image features to use as input to CNNs, hoping to enhance training efficiency and achieve better accuracy. We aspire to enable shallow networks with a Push-Pull Combination of Receptive Fields (PP-CORF) model of simple cells as the foundation layer of CNNs to enhance their learning process and performance. To achieve this, we propose a two-tower CNN, one shallow tower and the other as ResNet 18. Rather than extracting the features blindly, it seeks to mimic how the brain perceives and extracts features. The proposed system exhibits a noticeable improvement in the performance (on an average of $5\%-10\%$) on CIFAR-10, CIFAR-100, and ImageNet-100 datasets compared to ResNet-18. We also check the efficiency of only the Push-Pull tower of the network.
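A simplified sketch of a push-pull style convolution, where an inhibitory response computed with the negated kernel is subtracted from the excitatory response; the actual PP-CORF foundation layer is more elaborate, so treat this as an assumption-laden illustration of the idea rather than the paper's layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PushPullConv2d(nn.Module):
    """Push-pull response: an excitatory ('push') convolution minus a weighted
    inhibitory ('pull') response computed with the negated kernel, loosely mimicking
    push-pull receptive fields of simple cells."""
    def __init__(self, in_ch, out_ch, kernel_size=5, alpha=1.0):
        super().__init__()
        self.push = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False)
        self.alpha = alpha                     # inhibition strength (hyperparameter)

    def forward(self, x):
        pad = self.push.padding[0]
        push = F.relu(F.conv2d(x, self.push.weight, padding=pad))
        pull = F.relu(F.conv2d(x, -self.push.weight, padding=pad))
        return F.relu(push - self.alpha * pull)

# Used as the first ("foundation") layer of the shallow tower, running in parallel
# with the ResNet-18 tower; the two feature streams are then combined downstream.
layer = PushPullConv2d(3, 16)
out = layer(torch.randn(1, 3, 32, 32))
```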
results: The best-performing model combines convolutional and residual layers with channel-wise self-attention and requires fewer than 100K parameters.
Abstract
Facial landmark tracking for thermal images requires tracking certain important regions of subjects' faces, using thermal images, which omit lighting and shading but show the temperatures of their subjects. The fluctuations of heat in particular places reflect physiological changes like bloodflow and perspiration, which can be used to remotely gauge things like anxiety and excitement. Past work in this domain has been limited to a narrow set of architectures and techniques. This work goes further by trying a comprehensive suite of various models with different components, such as residual connections, channel and feature-wise attention, as well as the practice of ensembling components of the network to work in parallel. The best model integrated convolutional and residual layers followed by a channel-wise self-attention layer, requiring less than 100K parameters.
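A small illustration of a channel-wise self-attention block of the kind described above, where each channel's spatial map acts as a token and channels are reweighted by their pairwise affinities; the exact attention formulation used in the paper may differ.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Self-attention applied across channels: each channel's flattened spatial map is a
    token, so the layer reweights channels by their pairwise affinities."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))     # learnable residual weight

    def forward(self, x):                             # x: (B, C, H, W)
        B, C, H, W = x.shape
        flat = x.view(B, C, H * W)                                        # one token per channel
        attn = torch.softmax(flat @ flat.transpose(1, 2) / (H * W) ** 0.5, dim=-1)  # (B, C, C)
        out = (attn @ flat).view(B, C, H, W)
        return x + self.gamma * out

# e.g. inserted after the conv/residual blocks of a landmark-regression network,
# before flattening and the final coordinate-regression layer.
feats = ChannelSelfAttention()(torch.randn(2, 32, 24, 24))
```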
results: Compared with other methods, Level-set KSVD shows higher accuracy and better performance.
Abstract
We present a new algorithm for image segmentation - Level-set KSVD. Level-set KSVD merges the methods of sparse dictionary learning for feature extraction and variational level-set method for image segmentation. Specifically, we use a generalization of the Chan-Vese functional with features learned by KSVD. The motivation for this model is agriculture based. Aerial images are taken in order to detect the spread of fungi in various crops. Our model is tested on such images of cotton fields. The results are compared to other methods.
ARTEMIS: Using GANs with Multiple Discriminators to Generate Art
results: Experiments show the method generates surreal, geometric images with a high degree of novelty and diversity.
Abstract
We propose a novel method for generating abstract art. First an autoencoder is trained to encode and decode the style representations of images, which are extracted from source images with a pretrained VGG network. Then, the decoder component of the autoencoder is extracted and used as a generator in a GAN. The generator works with an ensemble of discriminators. Each discriminator takes different style representations of the same images, and the generator is trained to create images that create convincing style representations in order to deceive all of the discriminators. The generator is also trained to maximize a diversity term. The resulting images had a surreal, geometric quality. We call our approach ARTEMIS (ARTistic Encoder-Multi-Discriminators Including Self-Attention), as it uses the self-attention layers and an encoder-decoder architecture.
Defining the boundaries: challenges and advances in identifying cells in microscopy images
results: Deep learning-based methods improve the accuracy and efficiency of cell segmentation and generalize consistently across widely varying test data.
Abstract
Segmentation, or the outlining of objects within images, is a critical step in the measurement and analysis of cells within microscopy images. While improvements continue to be made in tools that rely on classical methods for segmentation, deep learning-based tools increasingly dominate advances in the technology. Specialist models such as Cellpose continue to improve in accuracy and user-friendliness, and segmentation challenges such as the Multi-Modality Cell Segmentation Challenge continue to push innovation in accuracy across widely-varying test data as well as efficiency and usability. Increased attention on documentation, sharing, and evaluation standards are leading to increased user-friendliness and acceleration towards the goal of a truly universal method.
TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity Recognition
results: TENT not only recognizes seen actions but also "guesses" unseen actions via the closest textual words in the joint feature space, achieving state-of-the-art performance on zero-shot HAR tasks across different modalities and improving on the best vision-language models by over 12%.
Abstract
Recent achievements in language models have showcased their extraordinary capabilities in bridging visual information with semantic language understanding. This leads us to a novel question: can language models connect textual semantics with IoT sensory signals to perform recognition tasks, e.g., Human Activity Recognition (HAR)? If so, an intelligent HAR system with human-like cognition can be built, capable of adapting to new environments and unseen categories. This paper explores its feasibility with an innovative approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which jointly aligns textual embeddings with IoT sensor signals, including camera video, LiDAR, and mmWave. Through the IoT-language contrastive learning, we derive a unified semantic feature space that aligns multi-modal features with language embeddings, so that the IoT data corresponds to specific words that describe the IoT data. To enhance the connection between textual categories and their IoT data, we propose supplementary descriptions and learnable prompts that bring more semantic information into the joint feature space. TENT can not only recognize actions that have been seen but also ``guess'' the unseen action by the closest textual words from the feature space. We demonstrate TENT achieves state-of-the-art performance on zero-shot HAR tasks using different modalities, improving the best vision-language models by over 12%.
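A minimal sketch of IoT-sensor/language contrastive alignment and the zero-shot "guessing" step: a CLIP-style symmetric InfoNCE loss pulls each sensor embedding toward its paired activity description, and unseen activities are predicted by the nearest class text in the joint space. The temperature and the symmetric formulation are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def sensor_text_contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE: the i-th sensor window (video/LiDAR/mmWave features)
    should match the i-th activity description and repel all other texts in the batch."""
    s = F.normalize(sensor_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = s @ t.t() / temperature
    targets = torch.arange(len(s), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def zero_shot_predict(sensor_emb, class_text_embs):
    """Unseen activities are 'guessed' as the nearest class description in the joint space."""
    s = F.normalize(sensor_emb, dim=1)
    t = F.normalize(class_text_embs, dim=1)
    return (s @ t.t()).argmax(dim=1)

# toy usage with pre-computed embeddings
loss = sensor_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
pred = zero_shot_predict(torch.randn(4, 256), torch.randn(10, 256))
```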
MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis
methods: The method uses low-rank adaptation instead of resource-intensive fine-tuning; by freezing the ViT weights and adding only small low-rank plug-ins, it achieves competitive performance across diverse diagnosis tasks.
results: On four medical imaging datasets, the proposed method matches fully fine-tuned ViT models while using about 0.17% of the trainable parameters; MeLo adds only about 0.5MB of storage and allows extremely fast model switching at deployment and inference time.
Abstract
The common practice in developing computer-aided diagnosis (CAD) models based on transformer architectures usually involves fine-tuning from ImageNet pre-trained weights. However, with recent advances in large-scale pre-training and the practice of scaling laws, Vision Transformers (ViT) have become much larger and less accessible to medical imaging communities. Additionally, in real-world scenarios, the deployments of multiple CAD models can be troublesome due to problems such as limited storage space and time-consuming model switching. To address these challenges, we propose a new method MeLo (Medical image Low-rank adaptation), which enables the development of a single CAD model for multiple clinical tasks in a lightweight manner. It adopts low-rank adaptation instead of resource-demanding fine-tuning. By fixing the weight of ViT models and only adding small low-rank plug-ins, we achieve competitive results on various diagnosis tasks across different imaging modalities using only a few trainable parameters. Specifically, our proposed method achieves comparable performance to fully fine-tuned ViT models on four distinct medical imaging datasets using about 0.17% trainable parameters. Moreover, MeLo adds only about 0.5MB of storage space and allows for extremely fast model switching in deployment and inference. Our source code and pre-trained weights are available on our website (https://absterzhu.github.io/melo.github.io/).
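A generic LoRA-style linear wrapper illustrates the low-rank adaptation mechanism behind MeLo: the pretrained weight is frozen and only a small low-rank update is trained. The rank, scaling, and initialisation here are conventional LoRA defaults, not values from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer (e.g. a ViT attention projection) with a trainable
    low-rank update: y = W x + (alpha / r) * B A x. Only A and B are learned."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # keep the pretrained weight frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Swapping small task-specific (A, B) pairs in and out is what keeps per-task storage
# around 0.5 MB and makes model switching fast at deployment time.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
```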
A Unified Approach for Comprehensive Analysis of Various Spectral and Tissue Doppler Echocardiography
results: Compared with other methods, the framework consistently outperforms on metrics such as dice similarity coefficient (DSC) and intersection over union (IoU), and its automatic measurements agree closely with clinicians'.
Abstract
Doppler echocardiography offers critical insights into cardiac function and phases by quantifying blood flow velocities and evaluating myocardial motion. However, previous methods for automating Doppler analysis, ranging from initial signal processing techniques to advanced deep learning approaches, have been constrained by their reliance on electrocardiogram (ECG) data and their inability to process Doppler views collectively. We introduce a novel unified framework using a convolutional neural network for comprehensive analysis of spectral and tissue Doppler echocardiography images that combines automatic measurements and end-diastole (ED) detection into a singular method. The network automatically recognizes key features across various Doppler views, with novel Doppler shape embedding and anti-aliasing modules enhancing interpretation and ensuring consistent analysis. Empirical results indicate a consistent outperformance in performance metrics, including dice similarity coefficients (DSC) and intersection over union (IoU). The proposed framework demonstrates strong agreement with clinicians in Doppler automatic measurements and competitive performance in ED detection.
Uni-COAL: A Unified Framework for Cross-Modality Synthesis and Super-Resolution of MR Images
results: Experiments show that Uni-COAL outperforms existing methods on CMS, SR, and CMSR tasks for MR images, highlighting its generalizability to a wide range of applications.
Abstract
Cross-modality synthesis (CMS), super-resolution (SR), and their combination (CMSR) have been extensively studied for magnetic resonance imaging (MRI). Their primary goals are to enhance the imaging quality by synthesizing the desired modality and reducing the slice thickness. Despite the promising synthetic results, these techniques are often tailored to specific tasks, thereby limiting their adaptability to complex clinical scenarios. Therefore, it is crucial to build a unified network that can handle various image synthesis tasks with arbitrary requirements of modality and resolution settings, so that the resources for training and deploying the models can be greatly reduced. However, none of the previous works is capable of performing CMS, SR, and CMSR using a unified network. Moreover, these MRI reconstruction methods often treat alias frequencies improperly, resulting in suboptimal detail restoration. In this paper, we propose a Unified Co-Modulated Alias-free framework (Uni-COAL) to accomplish the aforementioned tasks with a single network. The co-modulation design of the image-conditioned and stochastic attribute representations ensures the consistency between CMS and SR, while simultaneously accommodating arbitrary combinations of input/output modalities and thickness. The generator of Uni-COAL is also designed to be alias-free based on the Shannon-Nyquist signal processing framework, ensuring effective suppression of alias frequencies. Additionally, we leverage the semantic prior of Segment Anything Model (SAM) to guide Uni-COAL, ensuring a more authentic preservation of anatomical structures during synthesis. Experiments on three datasets demonstrate that Uni-COAL outperforms the alternatives in CMS, SR, and CMSR tasks for MR images, which highlights its generalizability to wide-range applications.
Improving Image Captioning via Predicting Structured Concepts
paper_authors: Ting Wang, Weidong Chen, Yuanhe Tian, Yan Song, Zhendong Mao
for: This paper aims to improve image captioning performance by bridging the semantic gap between images and texts using structured concept prediction and weighted graph convolutional networks (W-GCN).
methods: The proposed approach includes a structured concept predictor (SCP) to predict concepts and their structures, as well as W-GCN to depict concept relations driven by word dependencies.
results: The approach is shown to be effective in enhancing the contribution of visual signals in image captioning, and the learned differentiated contributions from concepts improve the description generation process. Extensive experiments demonstrate the effectiveness of the proposed approach and each module.
Having the difficulty of solving the semantic gap between images and texts for the image captioning task, conventional studies in this area paid some attention to treating semantic concepts as a bridge between the two modalities and improved captioning performance accordingly. Although promising results on concept prediction were obtained, the aforementioned studies normally ignore the relationship among concepts, which relies on not only objects in the image, but also word dependencies in the text, so that offers a considerable potential for improving the process of generating good descriptions. In this paper, we propose a structured concept predictor (SCP) to predict concepts and their structures, then we integrate them into captioning, so as to enhance the contribution of visual signals in this task via concepts and further use their relations to distinguish cross-modal semantics for better description generation. Particularly, we design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies, and then learns differentiated contributions from these concepts for following decoding process. Therefore, our approach captures potential relations among concepts and discriminatively learns different concepts, so that effectively facilitates image captioning with inherited information across modalities. Extensive experiments and their results demonstrate the effectiveness of our approach as well as each proposed module in this work.
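A minimal weighted graph convolution layer of the kind W-GCN builds on: concept embeddings are aggregated over a weighted adjacency (e.g. derived from word dependencies) with row normalisation. The paper's exact propagation rule and how the edge weights are learned may differ.

```python
import torch
import torch.nn as nn

class WeightedGCNLayer(nn.Module):
    """One weighted graph convolution over predicted concepts: the adjacency encodes
    concept relations and its entries weight how much each neighbour contributes."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):            # h: (N, in_dim), adj: (N, N) non-negative weights
        adj = adj + torch.eye(adj.size(0), device=adj.device)     # add self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.lin((adj / deg) @ h))               # row-normalised aggregation

# usage: refine concept embeddings with dependency-derived weights, then feed the result
# to the captioning decoder alongside the visual features.
concepts = torch.randn(6, 512)                  # 6 predicted concepts
weights = torch.rand(6, 6)                      # illustrative relation weights
refined = WeightedGCNLayer(512, 512)(concepts, weights)
```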
Peer is Your Pillar: A Data-unbalanced Conditional GANs for Few-shot Image Generation
paper_authors: Ziqiang Li, Chaoyue Wang, Xue Rui, Chao Xue, Jiaxu Leng, Bin Li
for: Few-shot image generation, targeting settings where very few training images are available.
methods: Combines the target few-shot dataset with a peer dataset to form a data-unbalanced conditional generation task; a class embedding method separates the class space from the latent space, and a direction loss based on pre-trained CLIP improves image diversity.
results: Experiments on various few-shot datasets show that the proposed pipeline, Peer is your Pillar (PIP), advances few-shot image generation and reduces its training requirements.
Abstract
Few-shot image generation aims to train generative models using a small number of training images. When there are few images available for training (e.g. 10 images), Learning From Scratch (LFS) methods often generate images that closely resemble the training data while Transfer Learning (TL) methods try to improve performance by leveraging prior knowledge from GANs pre-trained on large-scale datasets. However, current TL methods may not allow for sufficient control over the degree of knowledge preservation from the source model, making them unsuitable for setups where the source and target domains are not closely related. To address this, we propose a novel pipeline called Peer is your Pillar (PIP), which combines a target few-shot dataset with a peer dataset to create a data-unbalanced conditional generation. Our approach includes a class embedding method that separates the class space from the latent space, and we use a direction loss based on pre-trained CLIP to improve image diversity. Experiments on various few-shot datasets demonstrate the advancement of the proposed PIP, especially reduces the training requirements of few-shot image generation.
Diffusion-based generation of Histopathological Whole Slide Images at a Gigapixel scale
methods: The study uses a coarse-to-fine sampling scheme that progressively upscales an initial low-resolution image into a high-resolution WSI; specifically, a diffusion model sequentially adds fine details to images and increases their resolution.
results: The method is trained on WSIs from the TCGA-BRCA dataset; quantitative evaluations and a user study with pathologists indicate that the generated WSIs resemble the structure of real WSIs.
Abstract
We present a novel diffusion-based approach to generate synthetic histopathological Whole Slide Images (WSIs) at an unprecedented gigapixel scale. Synthetic WSIs have many potential applications: They can augment training datasets to enhance the performance of many computational pathology applications. They allow the creation of synthesized copies of datasets that can be shared without violating privacy regulations. Or they can facilitate learning representations of WSIs without requiring data annotations. Despite this variety of applications, no existing deep-learning-based method generates WSIs at their typically high resolutions. Mainly due to the high computational complexity. Therefore, we propose a novel coarse-to-fine sampling scheme to tackle image generation of high-resolution WSIs. In this scheme, we increase the resolution of an initial low-resolution image to a high-resolution WSI. Particularly, a diffusion model sequentially adds fine details to images and increases their resolution. In our experiments, we train our method with WSIs from the TCGA-BRCA dataset. Additionally to quantitative evaluations, we also performed a user study with pathologists. The study results suggest that our generated WSIs resemble the structure of real WSIs.
摘要
我们提出了一种新的基于扩散的方法,用于以前所未有的十亿像素(gigapixel)尺度生成合成的组织病理学全切片图像(WSIs)。这些合成WSIs具有许多潜在应用:它们可以扩充训练集,提高许多计算病理学应用的性能;可以创建数据集的合成副本,在不违反隐私法规的情况下共享;还可以在无需数据标注的情况下帮助学习WSIs的表示。尽管应用前景广泛,目前还没有任何基于深度学习的方法能够以WSIs通常的高分辨率生成图像,主要原因是计算复杂度过高。因此,我们提出了一种新的从粗到细的采样方案,来解决高分辨率WSIs的图像生成问题。在该方案中,我们将初始的低分辨率图像逐步提升为高分辨率WSI;具体而言,扩散模型依次为图像添加细节并提高其分辨率。在实验中,我们使用TCGA-BRCA数据集训练该方法。除定量评估外,我们还与病理学家进行了用户研究,结果表明我们生成的WSIs与真实WSIs的结构相似。
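To make the coarse-to-fine sampling scheme described above more concrete, the following is a minimal Python sketch of the control flow only. The `refine_step` callable, the number of stages, and the scale factor are placeholders standing in for the paper's conditional diffusion passes, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_sample(refine_step, coarse, num_stages=3, scale=2):
    """Sketch of a coarse-to-fine sampling loop: repeatedly upsample the image,
    then let a (diffusion-style) refinement model add fine detail at the new
    resolution. `refine_step` stands in for one conditional sampling pass."""
    img = coarse
    for _ in range(num_stages):
        img = F.interpolate(img, scale_factor=scale, mode="bilinear",
                            align_corners=False)   # raise resolution first
        img = refine_step(img)                      # then add high-frequency detail
    return img

# Toy usage with a do-nothing "refiner" standing in for the diffusion model.
dummy_refine = lambda x: x + 0.01 * torch.randn_like(x)
low_res = torch.rand(1, 3, 64, 64)
print(coarse_to_fine_sample(dummy_refine, low_res).shape)  # torch.Size([1, 3, 512, 512])
```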
LocaliseBot: Multi-view 3D object localisation with differentiable rendering for robot grasping
results: 在ShapeNet数据集上,本文的物体姿态估计方法相比现有最佳方法有所提升。此外,使用估计的物体姿态与真实的抓取候选,在OCID Grasp数据集上按标准做法计算的抓取精度为99.65%。Abstract
Robot grasp typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera's extrinsic parameters at those viewpoints, and 3D CAD models of objects. The first step involves a standard deep learning backbone (FCN ResNet) to estimate the object label, semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is using a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. This is a purely vision-based approach that avoids the need for other information such as point cloud or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose results in 99.65% grasp accuracy with the ground truth grasp candidates on the Object Clutter Indoor Dataset (OCID) Grasp dataset, as computed using standard practice.
摘要
机器人抓取通常包括五个阶段:物体检测、物体定位、物体姿态估计、抓取姿态估计和抓取规划。我们专注于物体姿态估计。我们的方法依赖三类信息:物体的多个视图、这些视点下相机的外参,以及物体的3D CAD模型。第一步使用标准的深度学习骨干网络(FCN ResNet)来估计物体标签、语义分割以及物体相对于相机的粗略姿态。我们的创新在于使用一个细化模块,从粗略姿态估计出发,通过可微渲染的优化对其进行细化。这是一种纯视觉的方法,不需要点云或深度图等其他信息。我们在ShapeNet数据集上评估了我们的物体姿态估计方法,结果优于现有最佳方法。此外,我们还证明,使用估计的物体姿态与真实的抓取候选,在OCID Grasp数据集上按标准做法计算的抓取精度为99.65%。
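As a rough illustration of refining a coarse pose by gradient-based optimisation, here is a hedged PyTorch sketch. It substitutes a simple point-reprojection error for the paper's full differentiable rendering, and the function names (`axis_angle_to_matrix`, `refine_pose`) and hyperparameters are assumptions for illustration only.

```python
import torch

def skew(k):
    """Skew-symmetric matrix of a 3-vector (keeps autograd intact)."""
    zero = torch.zeros((), dtype=k.dtype)
    return torch.stack([torch.stack([zero, -k[2], k[1]]),
                        torch.stack([k[2], zero, -k[0]]),
                        torch.stack([-k[1], k[0], zero])])

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = torch.linalg.norm(r) + 1e-8
    K = skew(r / theta)
    return torch.eye(3, dtype=r.dtype) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def refine_pose(model_pts, observed_px, cam_K, r, t, iters=200, lr=1e-2):
    """Refine a coarse pose (axis-angle r, translation t) by minimising the
    reprojection error of known 3D model points against observed pixels."""
    r = r.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    opt = torch.optim.Adam([r, t], lr=lr)
    for _ in range(iters):
        cam_pts = model_pts @ axis_angle_to_matrix(r).T + t   # model -> camera frame
        proj = cam_pts @ cam_K.T                              # apply intrinsics
        px = proj[:, :2] / proj[:, 2:3]                       # perspective divide
        loss = torch.mean((px - observed_px) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return r.detach(), t.detach()

# Toy usage: recover a slightly perturbed pose of 50 random model points.
pts3d = torch.rand(50, 3) * 0.2
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
r_true, t_true = torch.tensor([0.1, -0.2, 0.05]), torch.tensor([0.0, 0.0, 1.0])
obs = (pts3d @ axis_angle_to_matrix(r_true).T + t_true) @ K.T
obs = obs[:, :2] / obs[:, 2:3]
r_hat, t_hat = refine_pose(pts3d, obs, K, r_true + 0.05, t_true + 0.05)
```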
SAMIHS: Adaptation of Segment Anything Model for Intracranial Hemorrhage Segmentation
paper_authors: Yinuo Wang, Kai Chen, Weimin Yuan, Cai Meng, XiangZhi Bai
for: 这篇论文针对卒中诊断和手术规划中的颅内出血分割进行研究,以Segment Anything Model (SAM) 为基础模型,提出了一种基于SAM的参数高效微调方法(SAMIHS),以提高此类医学影像分割的效能。
methods: 论文在SAM的图像编码器中加入参数重构适配器(parameter-refactoring adapters),并对适配器参数进行高效、灵活的利用;此外,还使用了一种组合损失函数(combo loss),结合二元交叉熵损失和边界敏感损失,以提高SAMIHS对边界区域的识别能力。
results: 实验结果表明,SAMIHS在两个公共数据集上的效能均有改善,显示其能够提升颅内出血这类医学影像的分割效果。Abstract
Segment Anything Model (SAM), a vision foundation model trained on large-scale annotations, has recently continued raising awareness within medical image segmentation. Despite the impressive capabilities of SAM on natural scenes, it struggles with performance decline when confronted with medical images, especially those involving blurry boundaries and highly irregular regions of low contrast. In this paper, a SAM-based parameter-efficient fine-tuning method, called SAMIHS, is proposed for intracranial hemorrhage segmentation, which is a crucial and challenging step in stroke diagnosis and surgical planning. Distinguished from previous SAM and SAM-based methods, SAMIHS incorporates parameter-refactoring adapters into SAM's image encoder and considers the efficient and flexible utilization of adapters' parameters. Additionally, we employ a combo loss that combines binary cross-entropy loss and boundary-sensitive loss to enhance SAMIHS's ability to recognize the boundary regions. Our experimental results on two public datasets demonstrate the effectiveness of our proposed method. Code is available at https://github.com/mileswyn/SAMIHS .
摘要
Segment Anything Model (SAM) 是一种基于大规模标注训练的视觉基础模型,最近在医学图像分割领域受到越来越多的关注。尽管 SAM 在自然场景中表现出色,但在医学图像上性能会下降,特别是面对边界模糊、低对比度且高度不规则的区域时。在这篇论文中,我们提出了一种基于 SAM 的参数高效微调方法,称为 SAMIHS,用于颅内出血分割,这是卒中诊断和手术规划中关键且具有挑战性的一步。与之前的 SAM 及基于 SAM 的方法不同,SAMIHS 在 SAM 的图像编码器中加入了参数重构适配器,并考虑对适配器参数进行高效、灵活的利用。此外,我们采用了一种组合(combo)损失函数,将二元交叉熵损失和边界敏感损失结合,以提高 SAMIHS 对边界区域的识别能力。我们在两个公共数据集上进行的实验结果表明了所提方法的有效性。代码可以在 GitHub 上获取:https://github.com/mileswyn/SAMIHS。
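One plausible reading of the combo loss described above (binary cross-entropy plus a boundary-sensitive term) is sketched below in PyTorch. The morphological-gradient boundary extraction, kernel size, and `boundary_weight` are assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def boundary_map(mask, kernel=3):
    """Approximate mask boundaries with a morphological gradient
    (dilation minus erosion), implemented via max pooling."""
    pad = kernel // 2
    dilated = F.max_pool2d(mask, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, kernel, stride=1, padding=pad)
    return (dilated - eroded).clamp(0, 1)

def combo_loss(logits, target, boundary_weight=2.0):
    """Binary cross-entropy plus a boundary-weighted BCE term, so that errors
    near lesion edges are penalised more heavily."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    with torch.no_grad():
        edges = boundary_map(target)
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    boundary_term = (per_pixel * edges).sum() / (edges.sum() + 1e-6)
    return bce + boundary_weight * boundary_term

logits = torch.randn(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.7).float()
print(combo_loss(logits, target).item())
```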
A deformation-based morphometry framework for disentangling Alzheimer’s disease from normal aging using learned normal aging templates
results: 研究结果表明,AD患者的脑室主要遵循加速的正常老化模式,而海马体和杏仁核区域同时受到正常老化和阿尔茨海默症特异性因素的影响。有趣的是,在疾病的早期临床阶段,海马体和杏仁核更多地表现出加速的正常老化模式,而在后期临床阶段,阿尔茨海默症特异性分数增加。Abstract
Alzheimer's Disease and normal aging are both characterized by brain atrophy. The question of whether AD-related brain atrophy represents accelerated aging or a neurodegeneration process distinct from that in normal aging remains unresolved. Moreover, precisely disentangling AD-related brain atrophy from normal aging in a clinical context is complex. In this study, we propose a deformation-based morphometry framework to estimate normal aging and AD-specific atrophy patterns of subjects from morphological MRI scans. We first leverage deep-learning-based methods to create age-dependent templates of cognitively normal (CN) subjects. These templates model the normal aging atrophy patterns in a CN population. Then, we use the learned diffeomorphic registration to estimate the one-year normal aging pattern at the voxel level. We register the testing image to the 60-year-old CN template in the second step. Finally, normal aging and AD-specific scores are estimated by measuring the alignment of this registration with the one-year normal aging pattern. The methodology was developed and evaluated on the OASIS3 dataset with 1,014 T1-weighted MRI scans. Of these, 326 scans were from CN subjects, and 688 scans were from individuals clinically diagnosed with AD at different stages of clinical severity defined by clinical dementia rating (CDR) scores. The results show that ventricles predominantly follow an accelerated normal aging pattern in subjects with AD. In turn, hippocampi and amygdala regions were affected by both normal aging and AD-specific factors. Interestingly, hippocampi and amygdala regions showed more of an accelerated normal aging pattern for subjects during the early clinical stages of the disease, while the AD-specific score increases in later clinical stages. Our code is freely available at https://github.com/Fjr9516/DBM_with_DL.
摘要
阿尔茨海默病(AD)和正常老化都以脑萎缩为特征。AD相关的脑萎缩究竟是加速的老化,还是一种区别于正常老化的神经退行性过程,这一问题仍未解决。此外,在临床场景中精确地将AD相关的脑萎缩与正常老化区分开来十分复杂。在本研究中,我们提出一种基于形变的形态测量框架,用于从形态学MRI扫描中估计受试者的正常老化和AD特异性萎缩模式。我们首先利用深度学习方法为认知正常(CN)受试者创建随年龄变化的模板,以建模CN人群的正常老化萎缩模式。然后,我们使用学习到的微分同胚配准在体素级估计一年的正常老化模式。第二步中,我们将测试图像配准到60岁的CN模板。最后,通过度量该配准与一年正常老化模式的一致程度,估计正常老化分数和AD特异性分数。该方法在OASIS3数据集上开发与评估,该数据集包含1,014个T1加权MRI扫描,其中326个来自CN受试者,688个来自按临床痴呆评分(CDR)划分不同临床严重程度的AD患者。结果显示,AD患者的脑室主要遵循加速的正常老化模式,而海马体和杏仁核区域同时受到正常老化和AD特异性因素的影响。有趣的是,在疾病的早期临床阶段,海马体和杏仁核更多地表现出加速的正常老化模式,而AD特异性分数在后期临床阶段增加。我们的代码可在 https://github.com/Fjr9516/DBM_with_DL 免费获取。
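The decomposition of a subject's deformation into a normal-aging component and a disease-specific residual could, in spirit, be computed as a voxel-wise projection onto the learned one-year aging field. The NumPy sketch below illustrates this idea; the function name and the exact score definitions are assumptions, not the paper's code.

```python
import numpy as np

def aging_and_specific_scores(subject_disp, aging_disp, eps=1e-8):
    """Split a subject's voxel-wise displacement field into a component aligned
    with a one-year normal-aging field and an orthogonal residual.

    subject_disp, aging_disp: arrays of shape (X, Y, Z, 3).
    Returns per-voxel aging scores (signed projection coefficients, roughly
    "years of normal aging") and disease-specific residual magnitudes.
    """
    norm_sq = np.sum(aging_disp ** 2, axis=-1)
    coeff = np.sum(subject_disp * aging_disp, axis=-1) / (norm_sq + eps)
    aligned = coeff[..., None] * aging_disp
    residual = subject_disp - aligned
    return coeff, np.linalg.norm(residual, axis=-1)

rng = np.random.default_rng(0)
subj = rng.normal(size=(8, 8, 8, 3))
aging = rng.normal(size=(8, 8, 8, 3))
aging_score, specific_score = aging_and_specific_scores(subj, aging)
print(aging_score.shape, specific_score.shape)
```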
Vision-Language Instruction Tuning: A Review and Analysis
results: 该论文基于所构建的指令数据进行了视觉语言指令调整,并在相应的指标上进行了广泛的实验,以证明所提构建原则的合理性。Abstract
Instruction tuning is an essential supervised training phase for Large Language Models (LLMs), with the goal of enhancing LLMs' capacity to generalize instruction execution and adapt to user preferences. With the growing incorporation of multi-modal data into LLMs, there is an increasing interest in the performance of vision-language instruction tuning which presents more complex features in comparison to pure text instructions. In this paper, we systematically review the latest vision-language instruction tuning settings and datasets in multi-modal LLMs and summarize the characteristics that high-quality vision-language tuning data should have. We consider these characteristics as the foundational principles for constructing vision-language instruction data and propose a complete construction pipeline consisting of data collection, instruction generation, and quality control modules that incorporate meticulously designed instruction property evaluation indicators. We perform vision-language instruction tuning on three widely used multi-modal LLMs based on the instruction data we constructed and conduct extensive experiments on the corresponding metrics to demonstrate the rationality of the construction principles proposed in this paper. The code and dataset related to this paper have been open-sourced at \url{https://github.com/palchenli/VL-Instruction-Tuning}.
摘要
大型自然语言模型(LLM)的指令调整是一个重要的有监督训练阶段,目的是增强LLM的指令执行泛化和用户偏好适应能力。随着多模态数据的加入,视觉语言指令调整的性能已经引起了更多的关注,这种多模态指令调整比纯文本指令更加复杂。在这篇论文中,我们系统地回顾最新的视觉语言指令调整设置和数据集在多模态LLM中,并总结高质量视觉语言调整数据应该具备哪些特征。我们认为这些特征是建构视觉语言指令数据的基础原则,我们提议一个完整的建构管道,包括数据采集、指令生成和质量控制模块,这些模块都包括了仔细设计的指令性质评价指标。我们在三种广泛使用的多模态LLM上进行了视觉语言指令调整,并对相应的指标进行了广泛的实验,以示我们提出的建构原则的合理性。相关代码和数据集可以在 \url{https://github.com/palchenli/VL-Instruction-Tuning} 上下载。
DynamicSurf: Dynamic Neural RGB-D Surface Reconstruction with an Optimizable Feature Grid
for: 从单目RGB-D视频中对非刚性表面进行高保真3D建模。
methods: 使用深度、表面法向和RGB损失来提高重建精度并缩短优化时间。
results: 比先前的方法快$6\times$,并取得与最先进方法相当的结果。Abstract
We propose DynamicSurf, a model-free neural implicit surface reconstruction method for high-fidelity 3D modelling of non-rigid surfaces from monocular RGB-D video. To cope with the lack of multi-view cues in monocular sequences of deforming surfaces, one of the most challenging settings for 3D reconstruction, DynamicSurf exploits depth, surface normals, and RGB losses to improve reconstruction fidelity and optimisation time. DynamicSurf learns a neural deformation field that maps a canonical representation of the surface geometry to the current frame. We depart from current neural non-rigid surface reconstruction models by designing the canonical representation as a learned feature grid which leads to faster and more accurate surface reconstruction than competing approaches that use a single MLP. We demonstrate DynamicSurf on public datasets and show that it can optimize sequences of varying frames with $6\times$ speedup over pure MLP-based approaches while achieving comparable results to the state-of-the-art methods. Project is available at https://mirgahney.github.io//DynamicSurf.io/.
摘要
我们提出了DynamicSurf,一种无模型(model-free)的神经隐式表面重建方法,用于从单目RGB-D视频中对非刚性表面进行高保真3D建模。针对形变表面的单目序列缺乏多视角线索这一3D重建中最具挑战性的设定之一,DynamicSurf利用深度、表面法向和RGB损失来提高重建精度并缩短优化时间。DynamicSurf学习一个神经形变场,将表面几何的规范(canonical)表示映射到当前帧。与现有的神经非刚性表面重建模型不同,我们将规范表示设计为一个可学习的特征网格,相比使用单个MLP的竞争方法,这带来了更快、更准确的表面重建。我们在公共数据集上展示了DynamicSurf,证明它在优化不同长度的帧序列时比纯MLP方法快$6\times$,同时取得与最先进方法相当的结果。项目可在https://mirgahney.github.io//DynamicSurf.io/查看。
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
results: 实验表明,该方法与现有最先进方法相比表现出色,具有更高的准确率和更好的一致性。Abstract
Existing works on weakly-supervised audio-visual video parsing adopt hybrid attention network (HAN) as the multi-modal embedding to capture the cross-modal context. It embeds the audio and visual modalities with a shared network, where the cross-attention is performed at the input. However, such an early fusion method highly entangles the two non-fully correlated modalities and leads to sub-optimal performance in detecting single-modality events. To deal with this problem, we propose the messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion. The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information. Furthermore, due to the fact that microphones capture audio events from all directions, while cameras only record visual events within a restricted field of view, there is a more frequent occurrence of unaligned cross-modal context from audio for visual event predictions. We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction. Experiments consistently illustrate the superior performance of our framework compared to existing state-of-the-art methods.
摘要
现有的弱监督视听视频解析方法采用混合注意力网络(HAN)作为多模态嵌入,以捕捉跨模态上下文。HAN用一个共享网络嵌入音频和视觉两种模态,并在输入端进行交叉注意力。然而,这种早期融合方法会将两个并非完全相关的模态高度纠缠在一起,从而降低单模态事件检测的性能。为解决这个问题,我们提出了信使引导的中期融合Transformer,以减少融合中不相关的跨模态上下文。信使将完整的跨模态上下文压缩为一个紧凑的表示,只保留有用的跨模态信息。此外,由于麦克风会捕捉来自各个方向的音频事件,而相机只记录有限视野内的视觉事件,因此在视觉事件预测中更常出现与音频不对齐的跨模态上下文。我们因此提出了跨音频预测一致性,以抑制无关音频信息对视觉事件预测的影响。实验一致表明,我们的框架优于现有的最先进方法。
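A hedged sketch of what a cross-audio prediction consistency term might look like is given below: visual event predictions made with the paired audio and with mismatched (rolled) audio from the batch are encouraged to agree. The toy fusion model and the symmetric-KL choice are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

class ToyFusion(torch.nn.Module):
    """Stand-in for a fusion head that predicts visual event logits."""
    def __init__(self, dv=16, da=16, num_classes=5):
        super().__init__()
        self.fc = torch.nn.Linear(dv + da, num_classes)
    def forward(self, visual, audio):
        return self.fc(torch.cat([visual, audio], dim=-1))

def cross_audio_consistency(model, visual_feat, audio_feat):
    """Encourage visual event predictions to stay unchanged when the paired
    audio is swapped for audio from another clip in the batch."""
    logits_paired = model(visual_feat, audio_feat)
    logits_cross = model(visual_feat, torch.roll(audio_feat, shifts=1, dims=0))
    p = F.log_softmax(logits_paired, dim=-1)
    q = F.log_softmax(logits_cross, dim=-1)
    # Symmetrised KL between the two predictive distributions.
    return 0.5 * (F.kl_div(q, p, reduction="batchmean", log_target=True)
                  + F.kl_div(p, q, reduction="batchmean", log_target=True))

loss = cross_audio_consistency(ToyFusion(), torch.randn(4, 16), torch.randn(4, 16))
print(loss.item())
```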
results: 在标准GM基准上,GMTR相比SOTA框架表现出有竞争力的性能;具体来说,在Pascal VOC上,GMTR的准确率为83.6%,比SOTA框架高0.9%。在Spair-71k上,GMTR也表现出色,超越了大多数先前的工作。此外,QueryTrans在Pascal VOC上将NGMv2的准确率从80.1%提高到83.3%,将BBGM从79.0%提高到84.5%;在Spair-71k上,QueryTrans将NGMv2从80.6%提高到82.5%,将BBGM从82.1%提高到83.9%。Abstract
Vision transformers (ViTs) have recently been used for visual matching beyond object detection and segmentation. However, the original grid dividing strategy of ViTs neglects the spatial information of the keypoints, limiting the sensitivity to local information. Therefore, we propose \textbf{QueryTrans} (Query Transformer), which adopts a cross-attention module and keypoints-based center crop strategy for better spatial information extraction. We further integrate the graph attention module and devise a transformer-based graph matching approach \textbf{GMTR} (Graph Matching TRansformers) whereby the combinatorial nature of GM is addressed by a graph transformer neural GM solver. On standard GM benchmarks, GMTR shows competitive performance against the SOTA frameworks. Specifically, on Pascal VOC, GMTR achieves $\mathbf{83.6\%}$ accuracy, $\mathbf{0.9\%}$ higher than the SOTA framework. On Spair-71k, GMTR shows great potential and outperforms most of the previous works. Meanwhile, on Pascal VOC, QueryTrans improves the accuracy of NGMv2 from $80.1\%$ to $\mathbf{83.3\%}$, and BBGM from $79.0\%$ to $\mathbf{84.5\%}$. On Spair-71k, QueryTrans improves NGMv2 from $80.6\%$ to $\mathbf{82.5\%}$, and BBGM from $82.1\%$ to $\mathbf{83.9\%}$. Source code will be made publicly available.
摘要
视觉Transformer(ViT)最近被用于目标检测和分割之外的视觉匹配任务。然而,ViT原有的网格划分策略忽视了关键点的空间信息,限制了其对局部信息的敏感性。因此,我们提出了查询Transformer(QueryTrans),它采用交叉注意力模块和基于关键点的中心裁剪策略,以更好地提取空间信息。此外,我们整合了图注意力模块,并设计了基于Transformer的图匹配方法GMTR(Graph Matching TRansformers),用图Transformer神经求解器来处理图匹配问题的组合性质。在标准GM基准上,GMTR与最先进框架相比表现出有竞争力的性能。具体来说,在Pascal VOC上,GMTR的准确率为83.6%,比最先进框架高0.9%;在Spair-71k上,GMTR表现出很大潜力,超越了大多数先前的工作。同时,在Pascal VOC上,QueryTrans将NGMv2的准确率从80.1%提高到83.3%,将BBGM从79.0%提高到84.5%;在Spair-71k上,QueryTrans将NGMv2从80.6%提高到82.5%,将BBGM从82.1%提高到83.9%。源代码将公开发布。
Learning based Deep Disentangling Light Field Reconstruction and Disparity Estimation Application
results: 在实验中实现了最佳性能,并提出了一种降低内存占用的块遍历角度超分辨率策略,用于增强深度估计。Abstract
Light field cameras have a wide range of uses due to their ability to simultaneously record light intensity and direction. The angular resolution of light fields is important for downstream tasks such as depth estimation, yet is often difficult to improve due to hardware limitations. Conventional methods tend to perform poorly against the challenge of large disparity in sparse light fields, while general CNNs have difficulty extracting spatial and angular features coupled together in 4D light fields. The light field disentangling mechanism transforms the 4D light field into 2D image format, which is more favorable for CNN for feature extraction. In this paper, we propose a Deep Disentangling Mechanism, which inherits the principle of the light field disentangling mechanism and further develops the design of the feature extractor and adds advanced network structure. We design a light-field reconstruction network (i.e., DDASR) on the basis of the Deep Disentangling Mechanism, and achieve SOTA performance in the experiments. In addition, we design a Block Traversal Angular Super-Resolution Strategy for the practical application of depth estimation enhancement where the input views is often higher than 2x2 in the experiments resulting in a high memory usage, which can reduce the memory usage while having a better reconstruction performance.
摘要
光场相机能够同时记录光的强度和方向,因此具有广泛的应用。光场的角度分辨率对深度估计等下游任务非常重要,但受硬件限制,通常难以提高。传统方法在稀疏光场的大视差挑战下表现不佳,而通用CNN难以提取4D光场中相互耦合的空间与角度特征。光场解耦机制可以将4D光场转换为2D图像格式,更有利于CNN进行特征提取。在这篇论文中,我们提出了一种深度解耦机制,它继承了光场解耦机制的原理,并进一步改进了特征提取器的设计,加入了更先进的网络结构。基于深度解耦机制,我们设计了一个光场重建网络(即DDASR),在实验中达到了最佳性能。此外,针对深度估计增强的实际应用中输入视图往往多于2x2、导致内存占用较高的问题,我们设计了一种块遍历角度超分辨率策略,能够在降低内存占用的同时获得更好的重建性能。
DeepEMplanner: An EM Motion Planner with Iterative Interactions
results: 在nuScenes基准上进行的实验表明,我们的方法取得了最先进的结果。Abstract
Motion planning is a computational problem that finds a sequence of valid trajectories, often based on surrounding agents' forecasting, environmental understanding, and historical and future contexts. It can also be viewed as a game in which agents continuously plan their next move according to other agents' intentions and the encountering environment, further achieving their ultimate goals through incremental actions. To model the dynamic planning and interaction process, we propose a novel framework, DeepEMplanner, which takes the stepwise interaction into account for fine-grained behavior learning. The ego vehicle maximizes each step motion to reach its eventual driving outcome based on the stepwise expectation from agents and its upcoming road conditions. On the other hand, the agents also follow the same philosophy to maximize their stepwise behavior under the encountering environment and the expectations from ego and other agents. Our DeepEMplanner models the interactions among ego, agents, and the dynamic environment in an autoregressive manner by interleaving the Expectation and Maximization processes. Further, we design ego-to-agents, ego-to-map, and ego-to-BEV interaction mechanisms with hierarchical dynamic key objects attention to better model the interactions. Experiments on the nuScenes benchmark show that our approach achieves state-of-the-art results.
摘要
运动规划是一个计算问题,它要找到一系列有效的轨迹,通常基于对周围智能体的预测、对环境的理解以及历史和未来的上下文。它也可以被视为一个博弈:各个智能体根据其他智能体的意图和所处的环境不断规划下一步动作,并通过增量行动最终实现各自的目标。为了对这种动态规划与交互过程建模,我们提出了一个新的框架DeepEMplanner,它将逐步交互纳入考虑,以实现细粒度的行为学习。自车(ego)基于对各智能体的逐步预期和即将到来的道路条件,最大化每一步动作以达成其最终驾驶目标;同样地,各智能体也在所处环境以及自车和其他智能体的预期之下最大化其逐步行为。我们的DeepEMplanner以自回归的方式、通过交替执行期望(Expectation)与最大化(Maximization)过程,来建模自车、智能体与动态环境之间的交互。此外,我们还设计了ego-to-agents、ego-to-map和ego-to-BEV交互机制,并使用层次化的动态关键目标注意力,以更好地建模这些交互。在nuScenes基准上的实验表明,我们的方法取得了最先进的结果。
Identifying Light-curve Signals with a Deep Learning Based Object Detection Algorithm. II. A General Light Curve Classification Framework
results: 该模型在变星与暂现事件的组合分类上取得了87%的准确率,与先前基于特征的模型性能相当。此外,训练好的模型可以直接应用于其他巡天任务(如ASAS-SN),无需任何重新训练或微调。Abstract
Vast amounts of astronomical photometric data are generated from various projects, requiring significant efforts to identify variable stars and other object classes. In light of this, a general, widely applicable classification framework would simplify the task of designing custom classifiers. We present a novel deep learning framework for classifying light curves using a weakly supervised object detection model. Our framework identifies the optimal windows for both light curves and power spectra automatically, and zooms in on their corresponding data. This allows for automatic feature extraction from both time and frequency domains, enabling our model to handle data across different scales and sampling intervals. We train our model on datasets obtained from both space-based and ground-based multi-band observations of variable stars and transients. We achieve an accuracy of 87% for combined variables and transient events, which is comparable to the performance of previous feature-based models. Our trained model can be utilized directly to other missions, such as ASAS-SN, without requiring any retraining or fine-tuning. To address known issues with miscalibrated predictive probabilities, we apply conformal prediction to generate robust predictive sets that guarantee true label coverage with a given probability. Additionally, we incorporate various anomaly detection algorithms to empower our model with the ability to identify out-of-distribution objects. Our framework is implemented in the Deep-LC toolkit, which is an open-source Python package hosted on Github and PyPI.
摘要
各类巡天项目产生了海量的天文测光数据,识别变星和其他类型的天体需要大量的工作。鉴于此,一个通用且广泛适用的分类框架将简化设计定制分类器的任务。我们提出了一种基于弱监督目标检测模型的新型深度学习框架,用于光变曲线分类。我们的框架能够自动确定光变曲线和功率谱的最优窗口,并放大其对应的数据,从而自动从时域和频域中提取特征,使模型能够处理不同尺度和采样间隔的数据。我们使用来自空间和地面多波段观测的变星与暂现源数据集训练模型,在变星与暂现事件的组合分类上取得了87%的准确率,与先前基于特征的模型性能相当。训练好的模型可以直接用于其他巡天任务(如ASAS-SN),无需任何重新训练或微调。为了解决预测概率校准不良的已知问题,我们应用保形预测(conformal prediction)生成稳健的预测集,以给定的概率保证覆盖真实标签。此外,我们还引入了多种异常检测算法,使模型能够识别分布外的天体。我们的框架在Deep-LC工具包中实现,该工具包是一个开源Python包,托管在Github和PyPI上。
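Conformal prediction, mentioned above for calibrating the classifier's predictive sets, can be sketched with a standard split-conformal recipe; the scoring rule and variable names below are generic assumptions, not the Deep-LC implementation.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: choose a score threshold such that the
    resulting prediction sets cover the true label with prob. >= 1 - alpha."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1          # finite-sample quantile index
    return np.sort(scores)[min(k, n - 1)]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score falls below the threshold."""
    return np.where(1.0 - probs <= qhat)[0]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=200)          # 3-class toy calibration set
cal_labels = rng.integers(0, 3, size=200)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.7, 0.2, 0.1]), qhat))
```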
GlanceSeg: Real-time microaneurysm lesion segmentation with gaze-map-guided foundation model for early detection of diabetic retinopathy
results: 该研究提出了一个名为"GlanceSeg"的人机协同框架,能够实时分割微动脉瘤病灶,辅助糖尿病视网膜病变的早期诊断,并提高标注效率和分割性能。Abstract
Early-stage diabetic retinopathy (DR) presents challenges in clinical diagnosis due to inconspicuous and minute microangioma lesions, resulting in limited research in this area. Additionally, the potential of emerging foundation models, such as the segment anything model (SAM), in medical scenarios remains rarely explored. In this work, we propose a human-in-the-loop, label-free early DR diagnosis framework called GlanceSeg, based on SAM. GlanceSeg enables real-time segmentation of microangioma lesions as ophthalmologists review fundus images. Our human-in-the-loop framework integrates the ophthalmologist's gaze map, allowing for rough localization of minute lesions in fundus images. Subsequently, a saliency map is generated based on the located region of interest, which provides prompt points to assist the foundation model in efficiently segmenting microangioma lesions. Finally, a domain knowledge filter refines the segmentation of minute lesions. We conducted experiments on two newly-built public datasets, i.e., IDRiD and Retinal-Lesions, and validated the feasibility and superiority of GlanceSeg through visualized illustrations and quantitative measures. Additionally, we demonstrated that GlanceSeg improves annotation efficiency for clinicians and enhances segmentation performance through fine-tuning using annotations. This study highlights the potential of GlanceSeg-based annotations for self-model optimization, leading to enduring performance advancements through continual learning.
摘要
早期糖尿病视网膜病变(DR)因微动脉瘤病灶不明显且细小,给临床诊断带来挑战,导致该领域的研究较为有限。此外,新兴基础模型(如Segment Anything Model,SAM)在医疗场景中的潜力也很少被探索。在这项工作中,我们基于SAM提出了一个人在回路、无需标签的早期DR诊断框架,称为GlanceSeg。GlanceSeg能够在眼科医生阅读眼底图像的同时,实时分割微动脉瘤病灶。我们的人在回路框架整合了眼科医生的注视图(gaze map),从而对眼底图像中的细小病灶进行粗定位。随后,基于所定位的感兴趣区域生成显著性图,为基础模型提供提示点,帮助其高效地分割微动脉瘤病灶。最后,一个领域知识过滤器对细小病灶的分割结果进行细化。我们在两个新构建的公共数据集(IDRiD和Retinal-Lesions)上进行了实验,通过可视化示例和定量指标验证了GlanceSeg的可行性和优越性。此外,我们还表明GlanceSeg能够提高临床医生的标注效率,并且利用这些标注进行微调可以进一步提升分割性能。这项研究突显了基于GlanceSeg的标注在模型自我优化方面的潜力,有望通过持续学习带来持久的性能提升。
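The step of turning a gaze-derived saliency map into prompt points for a promptable segmentation model could look roughly like the following SciPy/NumPy sketch; the threshold and centroid-based point selection are assumptions for illustration, not GlanceSeg's actual pipeline.

```python
import numpy as np
from scipy.ndimage import label, center_of_mass

def prompt_points_from_saliency(saliency, threshold=0.6, max_points=10):
    """Turn a gaze-derived saliency map into point prompts for a promptable
    segmentation model. saliency: (H, W) array in [0, 1]; returns (N, 2) (x, y)."""
    mask = saliency >= threshold
    labels_img, num = label(mask)                       # connected salient regions
    if num == 0:
        return np.empty((0, 2))
    centroids = center_of_mass(saliency, labels_img, index=range(1, num + 1))
    pts = np.array([(c[1], c[0]) for c in centroids])   # (row, col) -> (x, y)
    peaks = [saliency[labels_img == i].max() for i in range(1, num + 1)]
    order = np.argsort(peaks)[::-1][:max_points]        # strongest regions first
    return pts[order]

heat = np.zeros((64, 64))
heat[10:14, 20:24] = 0.9
heat[40:43, 50:52] = 0.8
print(prompt_points_from_saliency(heat))
```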
FS-Net: Full Scale Network and Adaptive Threshold for Improving Extraction of Micro-Retinal Vessel Structures
paper_authors: Melaku N. Getahun, Oleg Y. Rogov, Dmitry V. Dylov, Andrey Somov, Ahmed Bouridane, Rifat Hamoudi
for: 这项研究旨在帮助眼科医生诊断和检测视网膜疾病,减轻其工作负担。
methods: 研究使用了基于编码器-解码器神经网络架构的全尺度微血管提取机制,以及sigmoid平滑和自适应阈值方法。
results: 该方法在DRIVE、CHASE-DB1和STARE数据集上进行了评估,取得了有竞争力的结果:DRIVE数据集上的AUC和准确率分别为0.9884和0.9702,CHASE-DB1数据集上分别为0.9903和0.9755,STARE数据集上分别为0.9916和0.9750。这些结果优于先前的研究,使该方案更有可能应用于需要眼科医生参与的实际诊断中心。Abstract
Retinal vascular segmentation, is a widely researched subject in biomedical image processing, aims to relieve ophthalmologists' workload when treating and detecting retinal disorders. However, segmenting retinal vessels has its own set of challenges, with prior techniques failing to generate adequate results when segmenting branches and microvascular structures. The neural network approaches used recently are characterized by the inability to keep local and global properties together and the failure to capture tiny end vessels make it challenging to attain the desired result. To reduce this retinal vessel segmentation problem, we propose a full-scale micro-vessel extraction mechanism based on an encoder-decoder neural network architecture, sigmoid smoothing, and an adaptive threshold method. The network consists of of residual, encoder booster, bottleneck enhancement, squeeze, and excitation building blocks. All of these blocks together help to improve the feature extraction and prediction of the segmentation map. The proposed solution has been evaluated using the DRIVE, CHASE-DB1, and STARE datasets, and competitive results are obtained when compared with previous studies. The AUC and accuracy on the DRIVE dataset are 0.9884 and 0.9702, respectively. On the CHASE-DB1 dataset, the scores are 0.9903 and 0.9755, respectively. On the STARE dataset, the scores are 0.9916 and 0.9750, respectively. The performance achieved is one step ahead of what has been done in previous studies, and this results in a higher chance of having this solution in real-life diagnostic centers that seek ophthalmologists attention.
摘要
视网膜血管分割是生物医学图像处理领域被广泛研究的主题,旨在减轻眼科医生在诊断和治疗视网膜疾病时的工作负担。然而,视网膜血管分割有其独特的挑战,先前的技术在分割分支和微血管结构时往往无法产生足够好的结果。近来的神经网络方法难以同时保持局部和全局特性,并且无法捕捉细小的末端血管,使得达到理想结果更加困难。为解决这一问题,我们提出了一种基于编码器-解码器神经网络架构、sigmoid平滑和自适应阈值方法的全尺度微血管提取机制。该网络由残差、编码器增强、瓶颈增强以及压缩与激励(squeeze-and-excitation)等模块组成,这些模块共同提升了特征提取和分割图的预测。所提方案在DRIVE、CHASE-DB1和STARE数据集上进行了评估,与先前研究相比取得了有竞争力的结果:DRIVE数据集上的AUC和准确率分别为0.9884和0.9702,CHASE-DB1数据集上分别为0.9903和0.9755,STARE数据集上分别为0.9916和0.9750。所取得的性能优于先前研究,这提高了该方案在需要眼科医生参与的实际诊断中心得到应用的可能性。
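The sigmoid smoothing and adaptive thresholding mentioned above might be post-processed along these lines; the steepness `k`, block size, and offset below are illustrative guesses, not the paper's settings.

```python
import numpy as np

def sigmoid_smooth_and_threshold(prob_map, k=12.0, block=32, offset=0.02):
    """Sharpen a vessel probability map with a sigmoid, then binarise it with a
    locally adaptive (block-wise mean) threshold."""
    smoothed = 1.0 / (1.0 + np.exp(-k * (prob_map - 0.5)))
    out = np.zeros_like(smoothed, dtype=bool)
    h, w = smoothed.shape
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = smoothed[i:i + block, j:j + block]
            out[i:i + block, j:j + block] = patch > (patch.mean() + offset)
    return out

prob = np.random.rand(96, 96)
print(sigmoid_smooth_and_threshold(prob).mean())
```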
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
results: 实验结果表明,Chat-UniVi模型在混合 dataset上进行训练后,能够在图像和视频任务中表现出色,并且在图像和视频任务中的性能都高于专门为图像或视频设计的方法。Abstract
Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a unified vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.
摘要
大型语言模型已在各类开放式任务中展示出令人印象深刻的通用能力,并将其用途扩展到多模态对话。然而,现有方法在同时处理图像和视频理解方面遇到挑战,尤其是在视觉标记(visual tokens)数量有限的情况下。在这项工作中,我们引入了Chat-UniVi,一种能够通过统一的视觉表示来理解并参与涉及图像和视频的对话的统一视觉语言模型。具体来说,我们使用一组动态视觉标记来统一表示图像和视频。这一表示框架使模型能够高效地利用有限数量的视觉标记,同时捕捉图像所需的空间细节和视频所需的完整时间关系。此外,我们还利用多尺度表示,使模型既能感知高层语义概念,又能感知低层视觉细节。值得一提的是,Chat-UniVi是在同时包含图像和视频的混合数据集上训练的,因此无需任何修改即可直接应用于涉及这两种媒介的任务。大量实验结果表明,Chat-UniVi作为统一模型,其表现始终优于专门为图像或视频设计的现有方法。
Contrastive Learning for Multi-Object Tracking with Transformers
results: 我们的训练方案能够学习目标的外观,同时保持检测能力,且只带来很小的额外开销。在BDD100K数据集上,我们的方法比以往最佳性能高出2.6 mMOTA;在MOT17数据集上,其表现与现有基于transformer的方法相当。Abstract
The DEtection TRansformer (DETR) opened new possibilities for object detection by modeling it as a translation task: converting image features into object-level representations. Previous works typically add expensive modules to DETR to perform Multi-Object Tracking (MOT), resulting in more complicated architectures. We instead show how DETR can be turned into a MOT model by employing an instance-level contrastive loss, a revised sampling strategy and a lightweight assignment method. Our training scheme learns object appearances while preserving detection capabilities and with little overhead. Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset and is comparable to existing transformer-based methods on the MOT17 dataset.
摘要
DEtection TRansformer(DETR)将目标检测建模为一种翻译任务——把图像特征转换为目标级表示——从而开创了新的可能性。先前的工作通常在DETR上添加开销较大的模块来实现多目标跟踪(MOT),导致架构更加复杂。我们则展示了如何通过实例级对比损失、改进的采样策略和轻量级的分配方法,将DETR转变为一个MOT模型。我们的训练方案在学习目标外观的同时保留了检测能力,且额外开销很小。其性能在具有挑战性的BDD100K数据集上比之前的最佳方法高出2.6 mMOTA,在MOT17数据集上与现有基于transformer的方法相当。
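An instance-level contrastive loss of the kind described above can be sketched as a supervised InfoNCE over detection embeddings grouped by track identity; the temperature and masking details below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(embeddings, track_ids, temperature=0.1):
    """Supervised InfoNCE over object embeddings: detections sharing a track id
    (across frames) are pulled together, all others pushed apart."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (track_ids[:, None] == track_ids[None, :]) & ~eye
    log_prob = F.log_softmax(sim.masked_fill(eye, float("-inf")), dim=1)
    per_anchor = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(dim=1)].mean()

emb = torch.randn(8, 32)                         # embeddings of 8 detections
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])     # their track identities
print(instance_contrastive_loss(emb, ids).item())
```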
ELF: An End-to-end Local and Global Multimodal Fusion Framework for Glaucoma Grading
results: 在多模态青光眼分级GAMMA数据集上进行的大量实验证明,ELF优于其他最先进方法。Abstract
Glaucoma is a chronic neurodegenerative condition that can lead to blindness. Early detection and curing are very important in stopping the disease from getting worse for glaucoma patients. The 2D fundus images and optical coherence tomography(OCT) are useful for ophthalmologists in diagnosing glaucoma. There are many methods based on the fundus images or 3D OCT volumes; however, the mining for multi-modality, including both fundus images and data, is less studied. In this work, we propose an end-to-end local and global multi-modal fusion framework for glaucoma grading, named ELF for short. ELF can fully utilize the complementary information between fundus and OCT. In addition, unlike previous methods that concatenate the multi-modal features together, which lack exploring the mutual information between different modalities, ELF can take advantage of local-wise and global-wise mutual information. The extensive experiment conducted on the multi-modal glaucoma grading GAMMA dataset can prove the effiectness of ELF when compared with other state-of-the-art methods.
摘要
青光眼是一种可能导致失明的慢性神经退行性疾病。对青光眼患者而言,早期发现和治疗对于阻止病情恶化非常重要。二维眼底图像和光学相干断层扫描(OCT)是眼科医生诊断青光眼的有用工具。已有许多基于眼底图像或3D OCT体数据的方法,但对同时包含眼底图像和OCT数据的多模态信息的挖掘研究较少。在这项工作中,我们提出了一种端到端的局部与全局多模态融合框架,简称ELF,用于青光眼分级。ELF能够充分利用眼底图像与OCT之间的互补信息。此外,与以往将多模态特征直接拼接、缺乏对不同模态间互信息挖掘的方法不同,ELF能够利用局部和全局层面的模态间互信息。在多模态青光眼分级GAMMA数据集上进行的大量实验证明了ELF相比其他最先进方法的有效性。
MD-IQA: Learning Multi-scale Distributed Image Quality Assessment with Semi Supervised Learning for Low Dose CT
paper_authors: Tao Song, Ruizhi Hou, Lisong Dai, Lei Xiang
for: 这项研究旨在提高基于深度学习的医学图像质量评估(IQA)模型的泛化能力和感知准确性。
methods: 研究提出了一种多尺度分布回归方法,通过约束输出分布来提高模型泛化能力;此外,还设计了一个双分支对齐网络来增强特征提取能力;最后,引入半监督学习,利用无标签数据的伪标签来引导模型训练。
results: 大量定性实验证明了所提方法的有效性,能够推进基于深度学习的医学IQA的水平。代码可在GitHub上获取:https://github.com/zunzhumu/MD-IQA。Abstract
Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: https://github.com/zunzhumu/MD-IQA.
摘要
图像质量评估(IQA)在优化辐射剂量和开发新的计算机断层扫描(CT)医学成像技术方面发挥着关键作用。依赖手工特征的传统IQA方法在概括图像质量的主观感知体验方面存在局限。近来基于深度学习的方法展现出强大的建模能力和在医学IQA中的潜力,但在模型泛化能力和感知准确性方面仍存在挑战。在这项工作中,我们提出了一种多尺度分布回归方法,通过约束输出分布来预测质量分数,从而提高模型的泛化能力。此外,我们设计了一个双分支对齐网络以增强特征提取能力。我们还引入了半监督学习,利用无标签数据的伪标签来引导模型训练。大量定性实验表明,我们所提出的方法有效地推进了基于深度学习的医学IQA的最新水平。代码可在以下GitHub地址获取:https://github.com/zunzhumu/MD-IQA。
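Distribution-based quality regression is often implemented by predicting a distribution over discretised score bins and constraining it against a smoothed target; the following PyTorch sketch shows one such formulation, with the Gaussian soft labels and bin layout being assumptions rather than MD-IQA's exact design.

```python
import torch
import torch.nn.functional as F

def soft_label(score, bins, sigma=0.5):
    """Gaussian-smoothed target distribution over discrete quality bins."""
    return F.softmax(-((bins - score) ** 2) / (2 * sigma ** 2), dim=-1)

def distribution_regression(logits, scores, bins, sigma=0.5):
    """Cross-entropy between the predicted bin distribution and the smoothed
    target; the scalar quality estimate is the expectation over bins."""
    target = torch.stack([soft_label(s, bins, sigma) for s in scores])
    loss = -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    pred_score = (F.softmax(logits, dim=-1) * bins).sum(dim=-1)
    return loss, pred_score

bins = torch.linspace(1, 5, steps=9)              # a 1-to-5 quality scale
logits = torch.randn(4, 9)                        # network outputs for 4 images
scores = torch.tensor([2.0, 3.5, 4.2, 1.3])       # their reference scores
loss, pred = distribution_regression(logits, scores, bins)
print(loss.item(), pred)
```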
CP-SLAM: Collaborative Neural Point-based SLAM System
results: 实验结果表明,提出的方法在不同的数据集上都有较高的精度和稳定性,在摄像头跟踪和地图建模方面均有显著提高。Abstract
This paper presents a collaborative implicit neural simultaneous localization and mapping (SLAM) system with RGB-D image sequences, which consists of complete front-end and back-end modules including odometry, loop detection, sub-map fusion, and global refinement. In order to enable all these modules in a unified framework, we propose a novel neural point based 3D scene representation in which each point maintains a learnable neural feature for scene encoding and is associated with a certain keyframe. Moreover, a distributed-to-centralized learning strategy is proposed for the collaborative implicit SLAM to improve consistency and cooperation. A novel global optimization framework is also proposed to improve the system accuracy like traditional bundle adjustment. Experiments on various datasets demonstrate the superiority of the proposed method in both camera tracking and mapping.
摘要
这篇论文介绍了一种基于RGB-D图像序列的协同隐式神经同时定位与建图(SLAM)系统,包括完整的前端和后端模块,如里程计、回环检测、子地图融合和全局优化。为了在一个统一框架中实现所有这些模块,我们提出了一种新的基于神经点的3D场景表示,其中每个点都维护一个可学习的神经特征用于场景编码,并与某个关键帧相关联。此外,我们提出了一种从分布式到集中式的学习策略,以提高协同隐式SLAM的一致性与协作性。我们还提出了一种新的全局优化框架,其作用类似于传统的光束法平差(bundle adjustment),用以提升系统精度。在多个数据集上的实验表明,所提方法在相机跟踪和建图方面均具有显著优势。
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
results: 与传统VFI方法相比,使用"距离索引"和迭代的基于参考的估计策略可以提高视频帧插值的精度和清晰度,并支持任意时刻的插值。此外,"距离索引"还可以逐像素指定,从而对每个物体独立地进行时间操控,为视频编辑任务提供了一种新工具。Abstract
Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which struggles to predict precise object movements. Given two images of a baseball, there are infinitely many possible trajectories: accelerating or decelerating, straight or curved. This often results in blurry frames as the method averages out these possibilities. Instead of forcing the network to learn this complicated time-to-location mapping implicitly together with predicting the frames, we provide the network with an explicit hint on how far the object has traveled between start and end frames, a novel approach termed "distance indexing". This method offers a clearer learning goal for models, reducing the uncertainty tied to object speeds. We further observed that, even with this extra guidance, objects can still be blurry especially when they are equally far from both input frames (i.e., halfway in-between), due to the directional ambiguity in long-range motion. To solve this, we propose an iterative reference-based estimation strategy that breaks down a long-range prediction into several short-range steps. When integrating our plug-and-play strategies into state-of-the-art learning-based models, they exhibit markedly sharper outputs and superior perceptual quality in arbitrary time interpolations, using a uniform distance indexing map in the same format as time indexing. Additionally, distance indexing can be specified pixel-wise, which enables temporal manipulation of each object independently, offering a novel tool for video editing tasks like re-timing.
摘要
现有的视频帧插值(VFI)方法盲目地预测每个物体在特定时间步t的位置("时间索引"),这难以预测精确的物体运动。给定两张棒球的图像,存在无穷多种可能的轨迹:加速或减速、直线或曲线。这常常导致帧画面模糊,因为方法会对这些可能性取平均。与其让网络在预测帧的同时隐式地学习这种复杂的时间到位置映射,我们为网络提供一个显式提示,即物体在起始帧与结束帧之间已经移动了多远,这种新方法称为"距离索引"。该方法为模型提供了更明确的学习目标,减少了与物体速度相关的不确定性。我们进一步观察到,即使有了这一额外引导,当物体与两个输入帧的距离相等(即位于正中间)时,由于长程运动的方向模糊性,物体仍然可能模糊。为解决这一问题,我们提出了一种迭代的基于参考的估计策略,将一次长程预测分解为若干短程步骤。将我们的即插即用策略集成到最先进的基于学习的模型中后,这些模型在任意时刻插值中表现出明显更清晰的输出和更优的感知质量,所使用的均匀距离索引图与时间索引具有相同的格式。此外,距离索引还可以逐像素指定,从而能够对每个物体独立地进行时间操控,为重定时等视频编辑任务提供了一种新工具。
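To illustrate the iterative reference-based idea combined with distance indexing, here is a small NumPy sketch that reaches a target distance in several short hops, re-expressing each hop relative to the latest synthesised frame; the stand-in `interp` model and the hop schedule are assumptions, not the paper's implementation.

```python
import numpy as np

def iterative_reference_interpolation(interp, frame0, frame1, target_d, hops=4):
    """Reach a target distance index in several short hops, each time taking the
    previously synthesised frame as the new reference frame.

    interp(a, b, d_map): stand-in VFI model; d_map is a per-pixel distance map."""
    current, travelled = frame0, 0.0
    h, w = frame0.shape[:2]
    for k in range(1, hops + 1):
        d_total = target_d * k / hops                       # distance after this hop
        d_rel = (d_total - travelled) / (1.0 - travelled + 1e-8)
        d_map = np.full((h, w), d_rel, dtype=np.float32)    # uniform distance index
        current = interp(current, frame1, d_map)
        travelled = d_total
    return current

# Toy usage: linear blending stands in for the interpolation network.
blend = lambda a, b, d: (1 - d[..., None]) * a + d[..., None] * b
f0 = np.zeros((8, 8, 3), np.float32)
f1 = np.ones((8, 8, 3), np.float32)
print(iterative_reference_interpolation(blend, f0, f1, target_d=0.5).mean())  # ~0.5
```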
Explicit Change Relation Learning for Change Detection in VHR Remote Sensing Images
results: 实验结果表明,NAME 网络在四个公共超高分辨率遥感数据集上的 F1、IoU 和 OA 指标上均高于现有的先进网络。Abstract
Change detection has always been a concerned task in the interpretation of remote sensing images. It is essentially a unique binary classification task with two inputs, and there is a change relationship between these two inputs. At present, the mining of change relationship features is usually implicit in the network architectures that contain single-branch or two-branch encoders. However, due to the lack of artificial prior design for change relationship features, these networks cannot learn enough change semantic information and lose more accurate change detection performance. So we propose a network architecture NAME for the explicit mining of change relation features. In our opinion, the change features of change detection should be divided into pre-changed image features, post-changed image features and change relation features. In order to fully mine these three kinds of change features, we propose the triple branch network combining the transformer and convolutional neural network (CNN) to extract and fuse these change features from two perspectives of global information and local information, respectively. In addition, we design the continuous change relation (CCR) branch to further obtain the continuous and detail change relation features to improve the change discrimination capability of the model. The experimental results show that our network performs better, in terms of F1, IoU, and OA, than those of the existing advanced networks for change detection on four public very high-resolution (VHR) remote sensing datasets. Our source code is available at https://github.com/DalongZ/NAME.
摘要
变化检测一直是遥感图像解译中备受关注的任务。它本质上是一个具有两个输入的特殊二分类任务,且两个输入之间存在变化关系。目前,变化关系特征的挖掘通常隐式地包含在单分支或双分支编码器的网络架构中。然而,由于缺乏针对变化关系特征的人工先验设计,这些网络无法学习足够的变化语义信息,从而损失了更精确的变化检测性能。因此,我们提出了一种名为NAME的网络架构,用于显式挖掘变化关系特征。我们认为,变化检测的变化特征应分为变化前图像特征、变化后图像特征和变化关系特征。为了充分挖掘这三类变化特征,我们提出了结合Transformer和卷积神经网络(CNN)的三分支网络,分别从全局信息和局部信息两个角度提取并融合这些变化特征。此外,我们设计了连续变化关系(CCR)分支,以进一步获取连续而细致的变化关系特征,提升模型的变化判别能力。实验结果显示,我们的网络在四个公共超高分辨率(VHR)遥感数据集上的F1、IoU和OA指标上均优于现有的先进网络。源代码可在https://github.com/DalongZ/NAME获取。
Benchmarking Individual Tree Mapping with Sub-meter Imagery
paper_authors: Dimitri Gominski, Ankit Kariryaa, Martin Brandt, Christian Igel, Sizhuo Li, Maurice Mugabowindekwe, Rasmus Fensholt
for: 本研究旨在为单株树木制图提供一个评估框架,以便在任意物理环境下进行单株树木制图。
methods: 本研究回顾并比较了用于单株树木制图的多种方法和深度架构,包括检测和分割方法,以及卷积神经网络和Transformer。
results: 本研究通过实验验证了一种新方法,它在单株树木制图中于分割与检测之间取得了良好的折中。Abstract
There is a rising interest in mapping trees using satellite or aerial imagery, but there is no standardized evaluation protocol for comparing and enhancing methods. In dense canopy areas, the high variability of tree sizes and their spatial proximity makes it arduous to define the quality of the predictions. Concurrently, object-centric approaches such as bounding box detection usually perform poorly on small and dense objects. It thus remains unclear what is the ideal framework for individual tree mapping, in regards to detection and segmentation approaches, convolutional neural networks and transformers. In this paper, we introduce an evaluation framework suited for individual tree mapping in any physical environment, with annotation costs and applicative goals in mind. We review and compare different approaches and deep architectures, and introduce a new method that we experimentally prove to be a good compromise between segmentation and detection.
摘要
利用卫星或航空影像绘制树木的研究兴趣日益增长,但目前还没有用于比较和改进方法的标准化评估协议。在树冠密集的区域,树木大小的高度差异及其空间上的紧邻使得难以界定预测的质量。同时,以目标为中心的方法(如边界框检测)通常在小而密集的目标上表现不佳。因此,就检测与分割方法、卷积神经网络与Transformer而言,单株树木制图的理想框架仍不明确。在这篇文章中,我们提出了一个适用于任意物理环境下单株树木制图的评估框架,并兼顾标注成本和应用目标。我们回顾并比较了不同的方法和深度架构,并提出了一种新方法,实验证明它在分割与检测之间取得了良好的折中。
Comparison of two data fusion approaches for land use classification
results: 研究发现,预分类融合方法可以达到最高的最终精度(97%)和宏平均F1分数(88%)。Abstract
Accurate land use maps, describing the territory from an anthropic utilisation point of view, are useful tools for land management and planning. To produce them, the use of optical images alone remains limited. It is therefore necessary to make use of several heterogeneous sources, each carrying complementary or contradictory information due to their imperfections or their different specifications. This study compares two different approaches i.e. a pre-classification and a post-classification fusion approach for combining several sources of spatial data in the context of land use classification. The approaches are applied on authoritative land use data located in the Gers department in the southwest of France. Pre-classification fusion, while not explicitly modeling imperfections, has the best final results, reaching an overall accuracy of 97% and a macro-mean F1 score of 88%.
摘要
准确的土地利用图从人类利用的角度描述国土,是土地管理和规划中十分有用的工具。仅靠光学图像来制作此类地图仍然有限,因此有必要利用多种异构数据源,而由于各数据源自身的不完善或规格差异,它们携带的信息可能互补,也可能相互矛盾。本研究比较了两种不同的方法,即预分类融合与后分类融合,用于在土地利用分类中组合多种空间数据。两种方法被应用于法国西南部热尔省(Gers)的官方土地利用数据。预分类融合虽然没有显式地对数据的不完善进行建模,但取得了最好的最终结果,总体准确率达到97%,宏平均F1分数达到88%。
Robust Learning Based Condition Diagnosis Method for Distribution Network Switchgear
For: 该方法用于诊断配电网开关设备的状态,以维护终端用户的电能质量。
Methods: 该方法使用扩展的特征向量,包括环境数据、温度读数、开关位置、电机动作、绝缘状况和局部放电信息;通过特征映射处理高维数据,并引入决策半径对无标签样本进行归类,结合有监督与无监督损失以及一致性正则化函数来更新模型参数。
Results: 对比分析表明,该方法在准确性和鲁棒性方面均显著优于现有模型。Abstract
This paper introduces a robust, learning-based method for diagnosing the state of distribution network switchgear, which is crucial for maintaining the power quality for end users. Traditional diagnostic models often rely heavily on expert knowledge and lack robustness. To address this, our method incorporates an expanded feature vector that includes environmental data, temperature readings, switch position, motor operation, insulation conditions, and local discharge information. We tackle the issue of high dimensionality through feature mapping. The method introduces a decision radius to categorize unlabeled samples and updates the model parameters using a combination of supervised and unsupervised loss, along with a consistency regularization function. This approach ensures robust learning even with a limited number of labeled samples. Comparative analysis demonstrates that this method significantly outperforms existing models in both accuracy and robustness.
摘要
Detection of Small Targets in Sea Clutter Based on RepVGG and Continuous Wavelet Transform
results: 测试结果表明,RepVGGA0-CWT检测器在可控虚警率低、训练速度快、检测速度快和内存占用低等方面优于其他网络与特征提取方法的组合。Abstract
Constructing a high-performance target detector under the background of sea clutter is always necessary and important. In this work, we propose a RepVGGA0-CWT detector, where RepVGG is a residual network that gains a high detection accuracy. Different from traditional residual networks, RepVGG keeps an acceptable calculation speed. Giving consideration to both accuracy and speed, the RepVGGA0 is selected among all the variants of RepVGG. Also, continuous wavelet transform (CWT) is employed to extract the radar echoes' time-frequency feature effectively. In the tests, other networks (ResNet50, ResNet18 and AlexNet) and feature extraction methods (short-time Fourier transform (STFT), CWT) are combined to build detectors for comparison. The result of different datasets shows that the RepVGGA0-CWT detector performs better than those detectors in terms of low controllable false alarm rate, high training speed, high inference speed and low memory usage. This RepVGGA0-CWT detector is hardware-friendly and can be applied in real-time scenes for its high inference speed in detection.
摘要
在海杂波背景下构建高性能的目标检测器始终是必要且重要的任务。在这项工作中,我们提出了RepVGGA0-CWT检测器,其中RepVGG是一种能获得高检测精度的残差网络;与传统残差网络不同,RepVGG保持了可接受的计算速度。兼顾精度与速度,我们在RepVGG的所有变体中选择了RepVGGA0。同时,我们采用连续小波变换(CWT)来有效提取雷达回波的时频特征。在测试中,我们将其他网络(ResNet50、ResNet18和AlexNet)与特征提取方法(短时傅里叶变换(STFT)、CWT)组合构建检测器进行对比。在不同数据集上的结果显示,RepVGGA0-CWT检测器在可控虚警率低、训练速度快、推理速度快和内存占用低等方面优于这些检测器。该RepVGGA0-CWT检测器对硬件友好,且由于检测推理速度快,可应用于实时场景。
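Extracting a time-frequency image from a radar echo with the continuous wavelet transform can be done with PyWavelets, as in the sketch below; the Morlet wavelet, scale range, and the toy echo are assumptions, not the paper's configuration.

```python
import numpy as np
import pywt

def cwt_scalogram(echo, fs, wavelet="morl", num_scales=64):
    """Time-frequency image (scalogram) of a radar echo via the continuous
    wavelet transform; the magnitude image can be fed to a CNN classifier."""
    scales = np.arange(1, num_scales + 1)
    coeffs, freqs = pywt.cwt(echo, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coeffs), freqs          # (num_scales, len(echo)), (num_scales,)

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
echo = np.sin(2 * np.pi * 50 * t) + 0.5 * np.random.randn(t.size)   # toy echo
scalogram, freqs = cwt_scalogram(echo, fs)
print(scalogram.shape, freqs.min(), freqs.max())
```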
Test-Time Training for Semantic Segmentation with Output Contrastive Loss
results: 作者在多种评估场景中验证了该方法的有效性;值得注意的是,即使应用于最初用域适应方法在测试域数据上预训练的模型,该方法依然表现出色,显示出其鲁棒性和适应性。Abstract
Although deep learning-based segmentation models have achieved impressive performance on public benchmarks, generalizing well to unseen environments remains a major challenge. To improve the model's generalization ability to the new domain during evaluation, the test-time training (TTT) is a challenging paradigm that adapts the source-pretrained model in an online fashion. Early efforts on TTT mainly focus on the image classification task. Directly extending these methods to semantic segmentation easily experiences unstable adaption due to segmentation's inherent characteristics, such as extreme class imbalance and complex decision spaces. To stabilize the adaptation process, we introduce contrastive loss (CL), known for its capability to learn robust and generalized representations. Nevertheless, the traditional CL operates in the representation space and cannot directly enhance predictions. In this paper, we resolve this limitation by adapting the CL to the output space, employing a high temperature, and simplifying the formulation, resulting in a straightforward yet effective loss function called Output Contrastive Loss (OCL). Our comprehensive experiments validate the efficacy of our approach across diverse evaluation scenarios. Notably, our method excels even when applied to models initially pre-trained using domain adaptation methods on test domain data, showcasing its resilience and adaptability.\footnote{Code and more information could be found at~ \url{https://github.com/dazhangyu123/OCL}
摘要
尽管基于深度学习的分割模型在公共基准上表现出色,但要很好地泛化到未见过的环境仍是一大挑战。为了在评估阶段提升模型对新领域的泛化能力,测试时训练(TTT)是一种具有挑战性的范式,它以在线方式对源域预训练模型进行适应。早期的TTT工作主要关注图像分类任务。将这些方法直接扩展到语义分割很容易出现不稳定的适应,这是由分割任务的固有特性(如极端的类别不平衡和复杂的决策空间)导致的。为了稳定适应过程,我们引入了以学习鲁棒且泛化良好的表示而著称的对比损失(CL)。然而,传统的CL在表示空间中运行,无法直接改进预测。在这篇论文中,我们通过将CL调整到输出空间、采用高温度并简化其形式来解决这一限制,得到了一种简单而有效的损失函数,称为输出对比损失(OCL)。我们在多种评估场景中进行的全面实验证明了该方法的有效性。尤其是,即使应用于最初使用域适应方法在测试域数据上预训练的模型,我们的方法仍然表现优异,展示了其鲁棒性和适应性。(代码及更多信息见 https://github.com/dazhangyu123/OCL)
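A rough sketch of a contrastive loss moved into the output space with a high temperature is shown below; the pixel sampling, pseudo-label positives, and temperature value are assumptions, since the paper's exact OCL formulation is only summarised here.

```python
import torch
import torch.nn.functional as F

def output_contrastive_loss(logits, num_samples=256, temperature=5.0):
    """Contrastive loss computed in the output space: softmax predictions of
    sampled pixels act as embeddings, pixels sharing the same pseudo-label are
    positives, and a high temperature keeps the adaptation signal soft.

    logits: (B, C, H, W) segmentation logits on unlabeled test images."""
    b, c, h, w = logits.shape
    probs = F.softmax(logits, dim=1).permute(0, 2, 3, 1).reshape(-1, c)
    idx = torch.randperm(probs.size(0))[:num_samples]
    z = F.normalize(probs[idx], dim=-1)
    pseudo = probs[idx].argmax(dim=-1)
    sim = z @ z.T / temperature
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = (pseudo[:, None] == pseudo[None, :]) & ~eye
    log_prob = F.log_softmax(sim.masked_fill(eye, float("-inf")), dim=1)
    per_anchor = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(dim=1)].mean()

print(output_contrastive_loss(torch.randn(1, 19, 32, 32)).item())
```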
Dual-channel Prototype Network for few-shot Classification of Pathological Images
results: 实验结果表明,DCPN在模拟不同程度临床场景域偏移的小样本分类任务中表现优越,特别是在同域任务中,其性能可达到有监督学习的水平。Abstract
In pathology, the rarity of certain diseases and the complexity in annotating pathological images significantly hinder the creation of extensive, high-quality datasets. This limitation impedes the progress of deep learning-assisted diagnostic systems in pathology. Consequently, it becomes imperative to devise a technology that can discern new disease categories from a minimal number of annotated examples. Such a technology would substantially advance deep learning models for rare diseases. Addressing this need, we introduce the Dual-channel Prototype Network (DCPN), rooted in the few-shot learning paradigm, to tackle the challenge of classifying pathological images with limited samples. DCPN augments the Pyramid Vision Transformer (PVT) framework for few-shot classification via self-supervised learning and integrates it with convolutional neural networks. This combination forms a dual-channel architecture that extracts multi-scale, highly precise pathological features. The approach enhances the versatility of prototype representations and elevates the efficacy of prototype networks in few-shot pathological image classification tasks. We evaluated DCPN using three publicly available pathological datasets, configuring small-sample classification tasks that mirror varying degrees of clinical scenario domain shifts. Our experimental findings robustly affirm DCPN's superiority in few-shot pathological image classification, particularly in tasks within the same domain, where it achieves the benchmarks of supervised learning.
摘要
在病理学领域,某些疾病的罕见性以及病理图像标注的复杂性,严重阻碍了大规模高质量数据集的构建。这一限制制约了深度学习辅助诊断系统在病理学中的发展。因此,亟需一种能够从极少量标注样本中识别新疾病类别的技术,这将显著推动针对罕见疾病的深度学习模型。为满足这一需求,我们提出了基于小样本学习范式的双通道原型网络(DCPN),用于应对样本有限的病理图像分类挑战。DCPN通过自监督学习增强金字塔视觉Transformer(PVT)框架用于小样本分类,并将其与卷积神经网络相结合。这一组合构成了双通道架构,能够提取多尺度、高精度的病理特征。该方法提升了原型表示的通用性,提高了原型网络在小样本病理图像分类任务中的效果。我们在三个公开的病理数据集上评估了DCPN,并构建了模拟不同程度临床场景域偏移的小样本分类任务。大量实验结果有力地证明了DCPN在小样本病理图像分类中的优越性,特别是在同域任务中,其性能可达到有监督学习的基准。
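The prototype-network component underlying few-shot classification can be sketched in a few lines of PyTorch: class prototypes are support-set means and queries are scored by distance to them. The feature dimensionality and episode sizes below are illustrative, not DCPN's configuration.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats, num_classes):
    """Prototypical-network step: class prototypes are support-set means, and
    queries are scored by (negative) distance to each prototype."""
    protos = torch.stack([support_feats[support_labels == c].mean(0)
                          for c in range(num_classes)])          # (C, D)
    dists = torch.cdist(query_feats, protos)                     # (Q, C)
    return F.log_softmax(-dists, dim=-1)

# A toy 3-way, 5-shot episode with 64-d features (e.g. from a dual-channel encoder).
support = torch.randn(15, 64)
support_y = torch.arange(3).repeat_interleave(5)
queries = torch.randn(6, 64)
print(prototype_classify(support, support_y, queries, num_classes=3).shape)
```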
Probing clustering in neural network representations
results: 发现数据集的类别结构和预训练模型的选择会影响聚类效果;归一化策略会影响哪些层能获得最佳聚类性能;并且发现Vision Transformer的子类聚类能力低于ResNet。Abstract
Neural network representations contain structure beyond what was present in the training labels. For instance, representations of images that are visually or semantically similar tend to lie closer to each other than to dissimilar images, regardless of their labels. Clustering these representations can thus provide insights into dataset properties as well as the network internals. In this work, we study how the many design choices involved in neural network training affect the clusters formed in the hidden representations. To do so, we establish an evaluation setup based on the BREEDS hierarchy, for the task of subclass clustering after training models with only superclass information. We isolate the training dataset and architecture as important factors affecting clusterability. Datasets with labeled classes consisting of unrelated subclasses yield much better clusterability than those following a natural hierarchy. When using pretrained models to cluster representations on downstream datasets, models pretrained on subclass labels provide better clusterability than models pretrained on superclass labels, but only when there is a high degree of domain overlap between the pretraining and downstream data. Architecturally, we find that normalization strategies affect which layers yield the best clustering performance, and, surprisingly, Vision Transformers attain lower subclass clusterability than ResNets.
摘要
神经网络表示所包含的结构超出了训练标签所提供的内容。例如,无论标签如何,视觉上或语义上相似的图像,其表示往往彼此更接近,而与不相似的图像距离更远。因此,对这些表示进行聚类可以揭示数据集的性质以及网络的内部机制。在这项工作中,我们研究神经网络训练中的众多设计选择如何影响隐藏表示中形成的聚类。为此,我们基于BREEDS层次结构建立了评估设置,考察仅用超类信息训练模型后的子类聚类任务。我们发现训练数据集和网络架构是影响可聚类性的重要因素。标注类别由互不相关的子类组成的数据集,其可聚类性远好于遵循自然层次结构的数据集。当使用预训练模型对下游数据集的表示进行聚类时,使用子类标签预训练的模型比使用超类标签预训练的模型具有更好的可聚类性,但前提是预训练数据与下游数据之间存在较高程度的领域重叠。在架构方面,我们发现归一化策略会影响哪些层能取得最佳聚类性能;并且出乎意料的是,Vision Transformer的子类可聚类性低于ResNet。
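A minimal version of the clusterability probe described above is to run K-means on hidden representations and score agreement with held-out subclass labels, e.g. with the adjusted Rand index; the scikit-learn sketch below uses synthetic features and is only an illustration of the evaluation idea, not the paper's exact protocol.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def subclass_clusterability(features, subclass_labels, seed=0):
    """Cluster hidden representations with K-means (K = number of subclasses) and
    score how well the clusters recover the held-out subclass labels."""
    k = len(np.unique(subclass_labels))
    assignments = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
    return adjusted_rand_score(subclass_labels, assignments)

# Synthetic penultimate-layer features: 4 subclasses, 50 samples each, 64 dims.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
feats = rng.normal(size=(200, 64)) + 3 * rng.normal(size=(4, 64))[labels]
print(round(subclass_clusterability(feats, labels), 3))
```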