methods: Proposes a Differentiable Arithmetic Distribution Module (DADM) based on distribution learning techniques, which improves model robustness by learning the spatial distribution of pixels in images.
results: Ablation studies and comparative experiments against LeNet show that the method improves model robustness without sacrificing feature-extraction capability.
Abstract
Although Convolutional Neural Networks (CNNs) have achieved promising results in image classification, they are still vulnerable to affine transformations including rotation, translation, flip and shuffle. This drawback motivates us to design a module which can alleviate the impact of different affine transformations. Thus, in this work, we introduce a more robust substitute by incorporating distribution learning techniques, focusing particularly on learning the spatial distribution information of pixels in images. To rectify the issue of non-differentiability of prior distribution learning methods that rely on traditional histograms, we adopt Kernel Density Estimation (KDE) to formulate differentiable histograms. On this foundation, we present a novel Differentiable Arithmetic Distribution Module (DADM), which is designed to extract the intrinsic probability distributions from images. The proposed approach is able to enhance the model's robustness to affine transformations without sacrificing its feature extraction capabilities, thus bridging the gap between traditional CNNs and distribution-based learning. We validate the effectiveness of the proposed approach through an ablation study and comparative experiments with LeNet.
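The key mechanism, a histogram made differentiable by replacing hard bin counts with kernel evaluations, can be sketched roughly as follows. This is a minimal illustration of KDE-based soft binning, not the authors' DADM; the bin count and bandwidth are illustrative assumptions.

```python
import torch

def soft_histogram(pixels, num_bins=16, bandwidth=0.05):
    """Differentiable histogram of pixel intensities in [0, 1] via Gaussian KDE.

    pixels: 1-D tensor of intensities; gradients flow back to `pixels`.
    """
    centers = torch.linspace(0.0, 1.0, num_bins, device=pixels.device)
    # Gaussian kernel response of every pixel to every bin centre.
    diff = pixels.unsqueeze(1) - centers.unsqueeze(0)        # (N, num_bins)
    weights = torch.exp(-0.5 * (diff / bandwidth) ** 2)
    hist = weights.sum(dim=0)
    return hist / hist.sum()                                  # normalise to a distribution

# The soft histogram ignores where pixels sit spatially, which is why
# distribution features tolerate rotations, flips and shuffles.
img = torch.rand(28, 28, requires_grad=True)
h = soft_histogram(img.flatten())
h[0].backward()   # gradients reach the input pixels, unlike a hard histogram
```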
PathLDM: Text conditioned Latent Diffusion Model for Histopathology
results: With strategic conditioning and necessary architectural enhancements, the paper achieves a SoTA FID score of 7.64 on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor, which scores an FID of 30.1.
Abstract
To achieve high-quality results, diffusion models must be trained on large datasets. This can be notably prohibitive for models in specialized domains, such as computational pathology. Conditioning on labeled data is known to help in data-efficient model training. Therefore, histopathology reports, which are rich in valuable clinical information, are an ideal choice as guidance for a histopathology generative model. In this paper, we introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Leveraging the rich contextual information provided by pathology text reports, our approach fuses image and textual data to enhance the generation process. By utilizing GPT's capabilities to distill and summarize complex text reports, we establish an effective conditioning mechanism. Through strategic conditioning and necessary architectural enhancements, we achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
paper_authors: Saeid Asgari Taghanaki, Aliasghar Khani, Amir Khasahmadi, Aditya Sanghi, Karl D. D. Willis, Ali Mahdavi-Amiri
for: Improving the interpretability and reliability of image classifiers
methods: Uses large language models (LLMs) to interpret the feature space learned by image classifiers
results: Experiments on multiple datasets demonstrate the effectiveness of the method in improving the interpretability and robustness of image classifiers.
Abstract
Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of large language models (LLMs) to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between the feature space of image classifiers and LLMs. Then, during inference, our approach generates a vast number of sentences to explain the features learned by the classifier for a given image. These sentences are then used to extract the most frequent words, providing a comprehensive understanding of the learned features and patterns within the classifier. Our method, for the first time, utilizes these frequent words corresponding to a visual representation to provide insights into the decision-making process of the independently trained classifier, enabling the detection of spurious correlations, biases, and a deeper comprehension of its behavior. To validate the effectiveness of our approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers.
Deep learning in medical image registration: introduction and survey
results: The survey covers a range of registration techniques, including atlas-based registration, multistage registration, and pyramid approaches, as well as datasets, evaluation measures, and application scenarios for medical image registration.
Abstract
Image registration (IR) is a process that deforms images to align them with respect to a reference space, making it easier for medical practitioners to examine various medical images in a standardized reference frame, such as having the same rotation and scale. This document introduces image registration using a simple numeric example. It provides a definition of image registration along with a space-oriented symbolic representation. This review covers various aspects of image transformations, including affine, deformable, invertible, and bidirectional transformations, as well as medical image registration algorithms such as Voxelmorph, Demons, SyN, Iterative Closest Point, and SynthMorph. It also explores atlas-based registration and multistage image registration techniques, including coarse-fine and pyramid approaches. Furthermore, this survey paper discusses medical image registration taxonomies, datasets, evaluation measures, such as correlation-based metrics, segmentation-based metrics, processing time, and model size. It also explores applications in image-guided surgery, motion tracking, and tumor diagnosis. Finally, the document addresses future research directions, including the further development of transformers.
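As a companion to the simple numeric example mentioned above, the following sketch applies a 2-D affine transformation (rotation plus scaling about the image centre) of the kind a rigid/affine registration step would estimate. The angle and scale values are arbitrary illustrations, not from the survey.

```python
import numpy as np
from scipy.ndimage import affine_transform

def apply_affine(image, angle_deg=10.0, scale=1.1):
    """Warp `image` by a rotation + isotropic scaling about its centre."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]]) * scale
    centre = (np.array(image.shape) - 1) / 2.0
    # affine_transform maps output coords to input coords: in = A @ out + offset
    A = np.linalg.inv(rot)
    offset = centre - A @ centre
    return affine_transform(image, A, offset=offset, order=1)

moving = np.random.rand(64, 64)
warped = apply_affine(moving)   # a registration algorithm would instead search for
                                # the A and offset that best align `moving` to a fixed image
```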
results: The paper obtains a high-accuracy iris identification system that can quickly and accurately match a presented iris image to its enrolled record in the database.
Abstract
28,000+ high-quality iris images of 1350 distinct eyes from 650+ different individuals from a relatively diverse university town population were collected. A small defined unobstructed portion of the normalized iris image is selected as a key portion for quickly identifying an unknown individual when submitting an iris image to be matched to a database of enrolled irises of the 1350 distinct eyes. The intrinsic dimension of a set of these key portions of the 1350 enrolled irises is measured to be about four (4). This set is mapped to a four-dimensional intrinsic space by principal components analysis (PCA). When an iris image is presented to the iris database for identification, the search begins in the neighborhood of the location of its key portion in the 4D intrinsic space, typically finding a correct identifying match after comparison to only a few percent of the database.
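The enrolment-and-search pipeline described above, projecting key iris portions into a low-dimensional intrinsic space with PCA and then searching only the neighbourhood of a probe's projection, can be sketched as below. This is an illustrative reconstruction, not the authors' code; the feature dimensionality and neighbourhood size are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Hypothetical enrolled key portions: one flattened feature vector per eye.
enrolled = np.random.rand(1350, 512)

pca = PCA(n_components=4)                 # intrinsic dimension measured to be ~4
enrolled_4d = pca.fit_transform(enrolled)

# Index the 4-D projections so a probe is compared only to its neighbourhood.
nn = NearestNeighbors(n_neighbors=40).fit(enrolled_4d)

def identify(probe_vector):
    probe_4d = pca.transform(probe_vector.reshape(1, -1))
    _, candidate_ids = nn.kneighbors(probe_4d)
    return candidate_ids[0]               # a few dozen candidates out of 1350 enrolled eyes
```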
AAN: Attributes-Aware Network for Temporal Action Detection
results: By leveraging CLIP features, AAN outperforms current state-of-the-art methods on the Charades and Toyota Smarthome Untrimmed datasets.
Abstract
The challenge of long-term video understanding remains constrained by the efficient extraction of object semantics and the modelling of their relationships for downstream tasks. Although the CLIP visual features exhibit discriminative properties for various vision tasks, particularly in object encoding, they are suboptimal for long-term video understanding. To address this issue, we present the Attributes-Aware Network (AAN), which consists of two key components: the Attributes Extractor and a Graph Reasoning block. These components facilitate the extraction of object-centric attributes and the modelling of their relationships within the video. By leveraging CLIP features, AAN outperforms state-of-the-art approaches on two popular action detection datasets: Charades and Toyota Smarthome Untrimmed datasets.
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
paper_authors: Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby
for: This paper is written for 3D open-vocabulary scene understanding at the instance level, without requiring any 2D image inputs.
methods: The OpenIns3D framework uses a “Mask-Snap-Lookup” scheme, which consists of a “Mask” module for class-agnostic mask proposals in 3D point clouds, a “Snap” module for generating synthetic scene-level images, and a “Lookup” module for assigning category names to the proposed masks using Mask2Pixel maps.
results: The proposed approach achieved state-of-the-art results on a wide range of indoor and outdoor datasets with a large margin, and it also allows for effortless switching of 2D detectors without re-training. Additionally, when integrated with state-of-the-art 2D open-world models and large language models (LLMs), it demonstrates excellent performance on open-vocabulary instance segmentation and processing complex text queries.
Abstract
Current 3D open-vocabulary scene understanding methods mostly utilize well-aligned 2D images as the bridge to learn 3D features with language. However, applying these approaches becomes challenging in scenarios where 2D images are absent. In this work, we introduce a completely new pipeline, namely, OpenIns3D, which requires no 2D image inputs, for 3D open-vocabulary scene understanding at the instance level. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds. The "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision language models to extract interesting objects. The "Lookup" module searches through the outcomes of "Snap" with the help of Mask2Pixel maps, which contain the precise correspondence between 3D masks and synthetic images, to assign category names to the proposed masks. This 2D input-free, easy-to-train, and flexible approach achieved state-of-the-art results on a wide range of indoor and outdoor datasets with a large margin. Furthermore, OpenIns3D allows for effortless switching of 2D detectors without re-training. When integrated with state-of-the-art 2D open-world models such as ODISE and GroundingDINO, superb results are observed on open-vocabulary instance segmentation. When integrated with LLM-powered 2D models like LISA, it demonstrates a remarkable capacity to process highly complex text queries, including those that require intricate reasoning and world knowledge. Project page: https://zheninghuang.github.io/OpenIns3D/
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
paper_authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
for: This paper focuses on the generation of 3D cities, which has received less attention in recent years despite the greater challenges it poses due to human sensitivity to structural distortions and the complexity of generating buildings with a wide range of appearances.
methods: The proposed method, CityDreamer, is a compositional generative model that separates the generation of building instances from other background objects into distinct modules, and uses two datasets (OSM and GoogleEarth) to enhance the realism of the generated 3D cities.
results: Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.
Abstract
In recent years, extensive research has focused on 3D natural scene generation, but the domain of 3D city generation has not received as much exploration. This is due to the greater challenges posed by 3D city generation, mainly because humans are more sensitive to structural distortions in urban environments. Additionally, generating 3D cities is more complex than 3D natural scenes since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities, which separates the generation of building instances from other background objects, such as roads, green lands, and water areas, into distinct modules. Furthermore, we construct two datasets, OSM and GoogleEarth, containing a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. Through extensive experiments, CityDreamer has proven its superiority over state-of-the-art methods in generating a wide range of lifelike 3D cities.
results: Results in the city used for training (Amsterdam) and in a city never seen during training (Eindhoven) show some trends that are difficult to interpret, especially given the differences in image acquisition at the different time steps. This demonstrates the complexity of monitoring liveability changes over time and the need for more sophisticated methods that compensate for changes unrelated to liveability dynamics.
Abstract
In this paper we explore deep learning models to monitor longitudinal liveability changes in Dutch cities at the neighbourhood level. Our liveability reference data is defined by a country-wise yearly survey based on a set of indicators combined into a liveability score, the Leefbaarometer. We pair this reference data with yearly-available high-resolution aerial images, which creates yearly timesteps at which liveability can be monitored. We deploy a convolutional neural network trained on an aerial image from 2016 and the Leefbaarometer score to predict liveability at new timesteps 2012 and 2020. The results in a city used for training (Amsterdam) and one never seen during training (Eindhoven) show some trends which are difficult to interpret, especially in light of the differences in image acquisitions at the different time steps. This demonstrates the complexity of liveability monitoring across time periods and the necessity for more sophisticated methods compensating for changes unrelated to liveability dynamics.
results: The paper applies the DMNN to recognizing the boundaries of noisy digits and discusses several topics for future research.
Abstract
A classical approach to designing binary image operators is Mathematical Morphology (MM). We propose the Discrete Morphological Neural Networks (DMNN) for binary image analysis to represent W-operators and estimate them via machine learning. A DMNN architecture, which is represented by a Morphological Computational Graph, is designed as in the classical heuristic design of morphological operators, in which the designer combines a set of MM operators and Boolean operations based on prior information and theoretical knowledge. Then, once the architecture is fixed, instead of adjusting its parameters (i.e., structural elements or maximal intervals) by hand, we propose a lattice gradient descent algorithm (LGDA) to train these parameters based on a sample of input and output images under the usual machine learning approach. We also propose a stochastic version of the LGDA that is more efficient, is scalable and can obtain small error in practical problems. The class represented by a DMNN can be quite general or specialized according to expected properties of the target operator, i.e., prior information, and the semantics expressed by the algebraic properties of operator classes is what differentiates it from other methods. The main contribution of this paper is the merger of the two main paradigms for designing morphological operators, classical heuristic design and automatic design via machine learning, thus conciliating classical heuristic morphological operator design with machine learning. We apply the DMNN to recognize the boundary of digits with noise, and we discuss many topics for future research.
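The classical heuristic design the authors start from, combining elementary MM operators with Boolean operations, looks roughly like the following boundary extractor for binary digit images. This is a generic textbook-style composition using SciPy, not the DMNN itself; the structuring element and the specific composition are illustrative choices.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion, binary_opening

def heuristic_boundary_operator(binary_image):
    """Hand-designed W-operator: denoise with an opening, then take the
    morphological gradient (dilation AND NOT erosion) as the boundary."""
    se = np.ones((3, 3), dtype=bool)            # structuring element chosen by the designer
    cleaned = binary_opening(binary_image, structure=se)
    outer = binary_dilation(cleaned, structure=se)
    inner = binary_erosion(cleaned, structure=se)
    return outer & ~inner                       # Boolean combination of MM operators

digit = np.zeros((28, 28), dtype=bool)
digit[8:20, 10:18] = True
boundary = heuristic_boundary_operator(digit)
# The DMNN keeps this kind of compositional architecture fixed but learns the
# structuring elements / maximal intervals from data instead of fixing them by hand.
```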
Mechanism of feature learning in convolutional neural networks
results: The results show that the patch-based AGOP enables deep feature learning in convolutional kernel machines. The resulting algorithm, (Deep) ConvRFM, recovers features similar to those of deep convolutional networks. Moreover, Deep ConvRFM overcomes previously identified limitations of convolutional kernels, such as their inability to adapt to local signals, leading to sizable performance improvements.
Abstract
Understanding the mechanism of how convolutional neural networks learn features from image data is a fundamental problem in machine learning and computer vision. In this work, we identify such a mechanism. We posit the Convolutional Neural Feature Ansatz, which states that covariances of filters in any convolutional layer are proportional to the average gradient outer product (AGOP) taken with respect to patches of the input to that layer. We present extensive empirical evidence for our ansatz, including identifying high correlation between covariances of filters and patch-based AGOPs for convolutional layers in standard neural architectures, such as AlexNet, VGG, and ResNets pre-trained on ImageNet. We also provide supporting theoretical evidence. We then demonstrate the generality of our result by using the patch-based AGOP to enable deep feature learning in convolutional kernel machines. We refer to the resulting algorithm as (Deep) ConvRFM and show that our algorithm recovers similar features to deep convolutional networks including the notable emergence of edge detectors. Moreover, we find that Deep ConvRFM overcomes previously identified limitations of convolutional kernels, such as their inability to adapt to local signals in images and, as a result, leads to sizable performance improvement over fixed convolutional kernels.
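The ansatz relates two matrices that can both be computed directly. The sketch below contrasts the filter covariance of a first convolutional layer with a patch-based average gradient outer product; to keep the per-patch gradient exact it uses non-overlapping patches (stride equal to the kernel size), a scalar network output, and a random untrained network, all of which are simplifications of the paper's setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 32, kernel_size=4, stride=4, bias=False)  # non-overlapping patches
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))
model = nn.Sequential(conv, nn.ReLU(), head)

def filter_covariance(conv_layer):
    W = conv_layer.weight.reshape(conv_layer.out_channels, -1)   # (out, C*k*k)
    return W.t() @ W                                             # (C*k*k, C*k*k)

def patch_agop(model, images, kernel_size=4):
    """AGOP over input patches. With stride == kernel size the patches do not
    overlap, so unfolding the input gradient gives the exact per-patch gradient."""
    images = images.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(model(images).sum(), images)
    g = F.unfold(grad, kernel_size, stride=kernel_size)          # (N, C*k*k, L)
    g = g.permute(0, 2, 1).reshape(-1, g.shape[1])               # one row per patch
    return g.t() @ g / g.shape[0]

x = torch.randn(8, 3, 32, 32)
A, B = filter_covariance(conv).detach(), patch_agop(model, x)
corr = torch.corrcoef(torch.stack([A.flatten(), B.flatten()]))[0, 1]
print(corr)   # the ansatz predicts a high correlation for trained networks
```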
Amyloid-Beta Axial Plane PET Synthesis from Structural MRI: An Image Translation Approach for Screening Alzheimer’s Disease
results: By pairing structural MRI with amyloid-beta PET images, the model generates synthetic amyloid-beta PET images that closely resemble the ground truth, with high SSIM and PSNR.
Abstract
In this work, an image translation model is implemented to produce synthetic amyloid-beta PET images from structural MRI that are quantitatively accurate. Image pairs of amyloid-beta PET and structural MRI were used to train the model. We found that the synthetic PET images could be produced with a high degree of similarity to truth in terms of shape, contrast and overall high SSIM and PSNR. This work demonstrates that performing structural to quantitative image translation is feasible to enable access to amyloid-beta information from MRI alone.
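The similarity metrics quoted above are standard and easy to reproduce; a minimal evaluation using scikit-image might look like the following. The array names and value range are placeholders, not the paper's evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(real_pet, synth_pet):
    """Compare a ground-truth PET slice with its synthetic counterpart."""
    data_range = real_pet.max() - real_pet.min()
    ssim = structural_similarity(real_pet, synth_pet, data_range=data_range)
    psnr = peak_signal_noise_ratio(real_pet, synth_pet, data_range=data_range)
    return ssim, psnr

real = np.random.rand(128, 128).astype(np.float32)
fake = real + 0.05 * np.random.randn(128, 128).astype(np.float32)
print(evaluate_pair(real, fake))
```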
Fused Classification For Differential Face Morphing Detection
results: Experimental results demonstrate that the method effectively detects morphing attacks.
Abstract
Face morphing, a sophisticated presentation attack technique, poses significant security risks to face recognition systems. Traditional methods struggle to detect morphing attacks, which involve blending multiple face images to create a synthetic image that can match different individuals. In this paper, we focus on the differential detection of face morphing and propose an extended approach based on a fused classification method for the no-reference scenario. We introduce a public face morphing detection benchmark for the differential scenario and utilize a specific data mining technique to enhance the performance of our approach. Experimental results demonstrate the effectiveness of our method in detecting morphing attacks.
Impact of Image Context for Single Deep Learning Face Morphing Attack Detection
for: This study investigates how the alignment settings of input images affect deep learning face morphing detection performance.
methods: Deep learning is used for face morphing detection, and the interconnections between the face contour and the image context are analyzed to suggest optimal input-image alignment conditions.
results: Properly tuning the input-image alignment settings improves deep learning face morphing detection performance.
Abstract
The increase in security concerns due to technological advancements has led to the popularity of biometric approaches that utilize physiological or behavioral characteristics for enhanced recognition. Face recognition systems (FRSs) have become prevalent, but they are still vulnerable to image manipulation techniques such as face morphing attacks. This study investigates the impact of the alignment settings of input images on deep learning face morphing detection performance. We analyze the interconnections between the face contour and image context and suggest optimal alignment conditions for face morphing detection.
Trust your Good Friends: Source-free Domain Adaptation by Reciprocal Neighborhood Clustering
results: The method achieves state-of-the-art performance on several 2D image and 3D point cloud recognition datasets.
Abstract
Domain adaptation (DA) aims to alleviate the domain shift between source domain and target domain. Most DA methods require access to the source data, but often that is not possible (e.g. due to data privacy or intellectual property). In this paper, we address the challenging source-free domain adaptation (SFDA) problem, where the source pretrained model is adapted to the target domain in the absence of source data. Our method is based on the observation that target data, which might not align with the source domain classifier, still forms clear clusters. We capture this intrinsic structure by defining local affinity of the target data, and encourage label consistency among data with high local affinity. We observe that higher affinity should be assigned to reciprocal neighbors. To aggregate information with more context, we consider expanded neighborhoods with small affinity values. Furthermore, we consider the density around each target sample, which can alleviate the negative impact of potential outliers. In the experimental results we verify that the inherent structure of the target features is an important source of information for domain adaptation. We demonstrate that this local structure can be efficiently captured by considering the local neighbors, the reciprocal neighbors, and the expanded neighborhood. Finally, we achieve state-of-the-art performance on several 2D image and 3D point cloud recognition datasets.
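The reciprocal-neighbour idea central to the method, assigning higher affinity to points that select each other as neighbours, can be illustrated on a plain feature matrix as follows. This is a generic sketch with cosine similarity and an arbitrary k, not the authors' implementation.

```python
import numpy as np

def reciprocal_neighbors(features, k=5):
    """Return an (N, N) boolean matrix: True where i and j are k-NN of each other."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)                    # exclude self-matches
    knn = np.argsort(-sim, axis=1)[:, :k]             # indices of k nearest neighbours
    is_nn = np.zeros_like(sim, dtype=bool)
    rows = np.repeat(np.arange(len(f)), k)
    is_nn[rows, knn.ravel()] = True
    return is_nn & is_nn.T                            # reciprocal if both directions hold

target_feats = np.random.randn(100, 64)
recip = reciprocal_neighbors(target_feats)
# Pairs flagged in `recip` would receive the higher affinity weight when
# encouraging label consistency among neighbouring target samples.
```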
results: Experiments on KITTI and Cityscapes show remarkable state-of-the-art performance (AbsRel = $0.082$ on KITTI, $0.052$ on KITTI with improved ground truth, and $0.106$ on Cityscapes), reducing error by 9.9%, 5.5% and 4.5% relative to the previous best. The method also shows reduced training complexity, computational efficiency, improved generalization, and the ability to recover fine-grained scene details. Furthermore, the self-matching-driven SQLdepth can be adaptively trained across different cameras and environments and transfers well to different tasks.
Abstract
Recently, self-supervised monocular depth estimation has gained popularity with numerous applications in autonomous driving and robotics. However, existing solutions primarily seek to estimate depth from immediate visual features, and struggle to recover fine-grained scene details with limited generalization. In this paper, we introduce SQLdepth, a novel approach that can effectively learn fine-grained scene structures from motion. In SQLdepth, we propose a novel Self Query Layer (SQL) to build a self-cost volume and infer depth from it, rather than inferring depth from feature maps. The self-cost volume implicitly captures the intrinsic geometry of the scene within a single frame. Each individual slice of the volume signifies the relative distances between points and objects within a latent space. Ultimately, this volume is compressed to the depth map via a novel decoding approach. Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance (AbsRel = $0.082$ on KITTI, $0.052$ on KITTI with improved ground-truth and $0.106$ on Cityscapes), achieves $9.9\%$, $5.5\%$ and $4.5\%$ error reduction from the previous best. In addition, our approach showcases reduced training complexity, computational efficiency, improved generalization, and the ability to recover fine-grained scene details. Moreover, the self-supervised pre-trained and metric fine-tuned SQLdepth can surpass existing supervised methods by significant margins (AbsRel = $0.043$, $14\%$ error reduction). self-matching-oriented relative distance querying in SQL improves the robustness and zero-shot generalization capability of SQLdepth. Code and the pre-trained weights will be publicly available. Code is available at \href{https://github.com/hisfog/SQLdepth-Impl}{https://github.com/hisfog/SQLdepth-Impl}.
A Machine Vision Method for Correction of Eccentric Error: Based on Adaptive Enhancement Algorithm
results: The method reduces the eccentricity error to within 10 um and offers a degree of real-time performance.
Abstract
In the procedure of surface defect detection for large-aperture aspherical optical elements, it is of vital significance to adjust the optical axis of the element to be accurately coaxial with the mechanical spin axis. Therefore, a machine vision method for eccentric error correction is proposed in this paper. Focusing on the severe defocus blur of the reference crosshair image caused by the imaging characteristics of the aspherical optical element, which may lead to the failure of correction, an Adaptive Enhancement Algorithm (AEA) is proposed to strengthen the crosshair image. AEA consists of the existing Guided Filter Dark Channel Dehazing Algorithm (GFA) and the proposed lightweight Multi-scale Densely Connected Network (MDC-Net). The enhancement effect of GFA is excellent but time-consuming, while the enhancement effect of MDC-Net is slightly inferior but strongly real-time. As AEA will be executed dozens of times during each correction procedure, its real-time performance is very important. Therefore, by setting an empirical threshold on the definition evaluation function SMD2, GFA and MDC-Net are respectively applied to highly and slightly blurred crosshair images so as to ensure the enhancement effect while saving as much time as possible. AEA is robust in terms of running time, taking an average of 0.2721 s to execute GFA and 0.0963 s to execute MDC-Net on ten 200x200-pixel Region of Interest (ROI) images with different degrees of blur. The eccentricity error can be reduced to within 10 um by our method.
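The routing decision between GFA and MDC-Net relies on a definition (sharpness) score; one common formulation of the SMD2 gradient-product measure used for such thresholds is sketched below. Treat it as an illustration of this family of metrics rather than the paper's exact implementation, and the threshold value is a placeholder.

```python
import numpy as np

def smd2(gray):
    """Definition score: sum of |horizontal diff| * |vertical diff| per pixel.
    Higher values indicate a sharper (less blurred) image."""
    g = gray.astype(np.float64)
    dx = np.abs(g[:-1, :-1] - g[1:, :-1])   # difference along one axis
    dy = np.abs(g[:-1, :-1] - g[:-1, 1:])   # difference along the other axis
    return float((dx * dy).sum())

def choose_enhancer(roi, threshold=1e5):
    # Slightly blurred ROIs (above the empirical threshold) go to the fast
    # MDC-Net; severely blurred ones go to the stronger but slower GFA.
    return "MDC-Net" if smd2(roi) >= threshold else "GFA"

roi = (np.random.rand(200, 200) * 255).astype(np.uint8)
print(smd2(roi), choose_enhancer(roi))
```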
Multi-stage Deep Learning Artifact Reduction for Computed Tomography
paper_authors: Jiayang Shi, Daniel M. Pelt, K. Joost Batenburg
for: Improving computed tomography (CT) image quality by reducing image artifacts.
methods: Deep learning artifact removal is applied in multiple domains (e.g., projection images and reconstructed images), analogous to a classical CT processing pipeline.
results: On both simulated and real-world experimental datasets, the method reduces artifacts and outperforms deep learning-based post-processing.
Abstract
In Computed Tomography (CT), an image of the interior structure of an object is computed from a set of acquired projection images. The quality of these reconstructed images is essential for accurate analysis, but this quality can be degraded by a variety of imaging artifacts. To improve reconstruction quality, the acquired projection images are often processed by a pipeline consisting of multiple artifact-removal steps applied in various image domains (e.g., outlier removal on projection images and denoising of reconstruction images). These artifact-removal methods exploit the fact that certain artifacts are easier to remove in a certain domain compared with other domains. Recently, deep learning methods have shown promising results for artifact removal for CT images. However, most existing deep learning methods for CT are applied as a post-processing method after reconstruction. Therefore, artifacts that are relatively difficult to remove in the reconstruction domain may not be effectively removed by these methods. As an alternative, we propose a multi-stage deep learning method for artifact removal, in which neural networks are applied to several domains, similar to a classical CT processing pipeline. We show that the neural networks can be effectively trained in succession, resulting in easy-to-use and computationally efficient training. Experiments on both simulated and real-world experimental datasets show that our method is effective in reducing artifacts and superior to deep learning-based post-processing.
Asymmetric double-winged multi-view clustering network for exploring Diverse and Consistent Information
methods: The network uses an asymmetric architecture to extract shallow and deep features separately. Diversity across views is ensured by aligning the shallow-feature similarity matrix toward the zero matrix, and a dual contrastive mechanism maintains consistency of the deep features at both the view-feature and pseudo-label levels.
results: Extensive experiments on six widely used benchmark datasets demonstrate the efficacy of the framework for multi-view clustering, outperforming most state-of-the-art multi-view clustering algorithms.
Abstract
In unsupervised scenarios, deep contrastive multi-view clustering (DCMVC) is becoming a hot research spot, which aims to mine the potential relationships between different views. Most existing DCMVC algorithms focus on exploring the consistency information for the deep semantic features, while ignoring the diverse information on shallow features. To fill this gap, we propose a novel multi-view clustering network termed CodingNet to explore the diverse and consistent information simultaneously in this paper. Specifically, instead of utilizing the conventional auto-encoder, we design an asymmetric structure network to extract shallow and deep features separately. Then, by aligning the similarity matrix on the shallow feature to the zero matrix, we ensure the diversity for the shallow features, thus offering a better description of multi-view data. Moreover, we propose a dual contrastive mechanism that maintains consistency for deep features at both view-feature and pseudo-label levels. Our framework's efficacy is validated through extensive experiments on six widely used benchmark datasets, outperforming most state-of-the-art multi-view clustering algorithms.
General and Practical Tuning Method for Off-the-Shelf Graph-Based Index: SISAP Indexing Challenge Report by Team UTokyo
methods: A black-box optimization algorithm performs integrated tuning to meet the required recall and Queries Per Second (QPS) targets.
results: The approach took second place in Task A of the SISAP 2023 Indexing Challenge on both the 10M and 30M tracks, improving performance substantially over brute-force methods; the tuning method extends to broader use cases beyond the competition.
Abstract
Despite the efficacy of graph-based algorithms for Approximate Nearest Neighbor (ANN) searches, the optimal tuning of such systems remains unclear. This study introduces a method to tune the performance of off-the-shelf graph-based indexes, focusing on the dimension of vectors, database size, and entry points of graph traversal. We utilize a black-box optimization algorithm to perform integrated tuning to meet the required levels of recall and Queries Per Second (QPS). We applied our approach to Task A of the SISAP 2023 Indexing Challenge and got second place in the 10M and 30M tracks. It improves performance substantially compared to brute force methods. This research offers a universally applicable tuning method for graph-based indexes, extending beyond the specific conditions of the competition to broader uses.
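A concrete instance of this kind of black-box tuning, here over an off-the-shelf HNSW index using Optuna, might look like the sketch below. The parameter ranges, the recall target, and the choice of hnswlib/Optuna are illustrative assumptions, not the team's actual challenge setup.

```python
import time
import numpy as np
import hnswlib
import optuna
from sklearn.neighbors import NearestNeighbors

dim, n, n_queries, k = 64, 20_000, 500, 10
data = np.random.rand(n, dim).astype(np.float32)
queries = np.random.rand(n_queries, dim).astype(np.float32)
# Exact neighbours, used only to measure recall of the approximate index.
true_ids = NearestNeighbors(n_neighbors=k).fit(data).kneighbors(queries, return_distance=False)

def objective(trial):
    M = trial.suggest_int("M", 8, 48)
    ef_c = trial.suggest_int("ef_construction", 64, 512)
    ef = trial.suggest_int("ef_search", 16, 256)

    index = hnswlib.Index(space="l2", dim=dim)
    index.init_index(max_elements=n, ef_construction=ef_c, M=M)
    index.add_items(data)
    index.set_ef(ef)

    start = time.perf_counter()
    labels, _ = index.knn_query(queries, k=k)
    qps = n_queries / (time.perf_counter() - start)
    recall = np.mean([len(set(l) & set(t)) / k for l, t in zip(labels, true_ids)])
    if recall < 0.90:                      # required recall level
        raise optuna.TrialPruned()
    return qps                             # maximise throughput among qualifying configs

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```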
An Improved Encoder-Decoder Framework for Food Energy Estimation
results: The method improves on previous caloric estimation approaches by over 10% in MAPE and over 30 kCal in MAE.
Abstract
Dietary assessment is essential to maintaining a healthy lifestyle. Automatic image-based dietary assessment is a growing field of research due to the increasing prevalence of image capturing devices (e.g. mobile phones). In this work, we estimate food energy from a single monocular image, a difficult task due to the limited hard-to-extract amount of energy information present in an image. To do so, we employ an improved encoder-decoder framework for energy estimation; the encoder transforms the image into a representation embedded with food energy information in an easier-to-extract format, which the decoder then extracts the energy information from. To implement our method, we compile a high-quality food image dataset verified by registered dietitians containing eating scene images, food-item segmentation masks, and ground truth calorie values. Our method improves upon previous caloric estimation methods by over 10\% and 30 kCal in terms of MAPE and MAE respectively.
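The two error metrics quoted above are straightforward to compute; a small helper like the one below (with placeholder calorie values) makes the comparison concrete.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))                    # in kCal

def mape(y_true, y_pred):
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)   # in percent

true_kcal = np.array([520.0, 310.0, 745.0, 120.0])
pred_kcal = np.array([480.0, 350.0, 700.0, 140.0])
print(f"MAE = {mae(true_kcal, pred_kcal):.1f} kCal, MAPE = {mape(true_kcal, pred_kcal):.1f}%")
```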
dacl10k: Benchmark for Semantic Bridge Damage Segmentation
results: Baseline models evaluated on dacl10k reach a mean intersection-over-union of 0.42 on the test set. The dataset and baselines are made openly accessible to support further research on bridge inspection and assessment.
Abstract
Reliably identifying reinforced concrete defects (RCDs) plays a crucial role in assessing the structural integrity, traffic safety, and long-term durability of concrete bridges, which represent the most common bridge type worldwide. Nevertheless, available datasets for the recognition of RCDs are small in terms of size and class variety, which questions their usability in real-world scenarios and their role as a benchmark. Our contribution to this problem is "dacl10k", an exceptionally diverse RCD dataset for multi-label semantic segmentation comprising 9,920 images deriving from real-world bridge inspections. dacl10k distinguishes 12 damage classes as well as 6 bridge components that play a key role in the building assessment and recommending actions, such as restoration works, traffic load limitations or bridge closures. In addition, we examine baseline models for dacl10k which are subsequently evaluated. The best model achieves a mean intersection-over-union of 0.42 on the test set. dacl10k, along with our baselines, will be openly accessible to researchers and practitioners, representing the currently biggest dataset regarding number of images and class diversity for semantic segmentation in the bridge inspection domain.
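The headline metric, mean intersection-over-union across classes, can be computed from predicted and ground-truth label maps as in this small sketch. The arrays are placeholders, and the single-label formulation shown here would be applied per binary class mask in the multi-label setting the dataset actually uses.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes, ignoring classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                      # class not present in this pair of maps
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 12, size=(512, 512))     # 12 damage classes
target = np.random.randint(0, 12, size=(512, 512))
print(mean_iou(pred, target, num_classes=12))
```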
Unsupervised bias discovery in medical image segmentation
paper_authors: Nicolás Gaggion, Rodrigo Echeveste, Lucas Mansilla, Diego H. Milone, Enzo Ferrante
for: Avoiding biases of deep learning segmentation models against sub-populations defined by protected attributes (e.g., sex or ethnicity) in medical imaging.
methods: A new unsupervised bias discovery method that leverages the reverse classification accuracy framework to estimate segmentation quality.
results: Numerical experiments in synthetic and realistic scenarios show that the method successfully anticipates fairness issues of deep segmentation models, constituting a new and valuable tool in this field.
Abstract
It has recently been shown that deep learning models for anatomical segmentation in medical images can exhibit biases against certain sub-populations defined in terms of protected attributes like sex or ethnicity. In this context, auditing fairness of deep segmentation models becomes crucial. However, such audit process generally requires access to ground-truth segmentation masks for the target population, which may not always be available, especially when going from development to deployment. Here we propose a new method to anticipate model biases in biomedical image segmentation in the absence of ground-truth annotations. Our unsupervised bias discovery method leverages the reverse classification accuracy framework to estimate segmentation quality. Through numerical experiments in synthetic and realistic scenarios we show how our method is able to successfully anticipate fairness issues in the absence of ground-truth labels, constituting a novel and valuable tool in this field.
Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models
results: A zero-shot method is proposed that transfers generalisable visual-textual correlations to VMR without access to VMR data. It comprises a conditional feature refinement module and a bottom-up proposal generation strategy that improve moment-boundary understanding and maximize the benefit of the VLM. Experiments on three VMR benchmarks show notable performance advantages, especially in novel-word and novel-location out-of-distribution setups.
Abstract
Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely either biased when derived from a limited amount of moment-text data which is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable when only the video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, the vision-language models (VLM) demonstrate a new transfer learning paradigm to benefit different vision tasks through the universal visual-textual correlations derived from large-scale vision-language pairwise web data, which has also shown benefits to VMR by fine-tuning in the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment, without the need for accessing the VMR data. To this end, we devise a conditional feature refinement module to generate boundary-aware visual features conditioned on text queries to enable better moment boundary understanding. Additionally, we design a bottom-up proposal generation strategy that mitigates the impact of domain discrepancies and breaks down complex-query retrieval tasks into individual action retrievals, thereby maximizing the benefits of VLM. Extensive experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm, especially in the novel-word and novel-location out-of-distribution setups.
Improving the matching of deformable objects by learning to detect keypoints
results: Improves the Mean Matching Accuracy of numerous descriptors, outperforming state-of-the-art keypoint detectors by 20 p.p. on real images of non-rigid objects, and performs on par with the best available keypoint detectors on a real-world object retrieval task.
Abstract
We propose a novel learned keypoint detection method to increase the number of correct matches for the task of non-rigid image correspondence. By leveraging true correspondences acquired by matching annotated image pairs with a specified descriptor extractor, we train an end-to-end convolutional neural network (CNN) to find keypoint locations that are more appropriate to the considered descriptor. For that, we apply geometric and photometric warpings to images to generate a supervisory signal, allowing the optimization of the detector. Experiments demonstrate that our method enhances the Mean Matching Accuracy of numerous descriptors when used in conjunction with our detection method, while outperforming the state-of-the-art keypoint detectors on real images of non-rigid objects by 20 p.p. We also apply our method on the complex real-world task of object retrieval where our detector performs on par with the finest keypoint detectors currently available for this task. The source code and trained models are publicly available at https://github.com/verlab/LearningToDetect_PRL_2023
results: Can remove target words as expected.
Abstract
Scene text removal (STR) is the image transformation task to remove text regions in scene images. The conventional STR methods remove all scene text. This means that the existing methods cannot select text to be removed. In this paper, we propose a novel task setting named selective scene text removal (SSTR) that removes only target words specified by the user. Although SSTR is a more complex task than STR, the proposed multi-module structure enables efficient training for SSTR. Experimental results show that the proposed method can remove target words as expected.
Fine-grained Recognition with Learnable Semantic Data Augmentation
results: Experiments show that the method improves classifier generalization on several competitive fine-grained recognition benchmarks and, combined with a recently proposed method, achieves state-of-the-art performance on the CUB-200-2011 dataset.
Abstract
Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their random editing-region behavior is prone to destroy the discriminative visual cues residing in the subtle regions. In this paper, we propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance on several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
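The core augmentation step, perturbing a deep feature along directions drawn from a class-dependent covariance rather than editing pixels, can be sketched as below with a fixed per-class covariance. The paper instead predicts a sample-wise covariance with a small network trained jointly with the classifier, so treat this as a simplified illustration with placeholder features.

```python
import torch

def semantic_augment(features, labels, class_cov, strength=0.5, n_aug=1):
    """Translate features along semantic directions sampled from a per-class
    covariance. features: (N, D); class_cov: dict label -> (D, D) covariance."""
    augmented, aug_labels = [], []
    for feat, y in zip(features, labels):
        cov = class_cov[int(y)]
        dist = torch.distributions.MultivariateNormal(
            torch.zeros(len(feat)), covariance_matrix=cov)
        for _ in range(n_aug):
            augmented.append(feat + strength * dist.sample())  # label is preserved
            aug_labels.append(y)
    return torch.stack(augmented), torch.stack(aug_labels)

# Toy features for two classes; in practice covariances are estimated from
# (or predicted for) real deep features of each class.
feats = torch.randn(8, 16)
labels = torch.randint(0, 2, (8,))
cov = {c: torch.eye(16) * 0.1 for c in range(2)}
new_feats, new_labels = semantic_augment(feats, labels, cov)
```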
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
results: The method sets a new state of the art in text-to-video generation; both qualitative and quantitative evaluations show that VideoGen generates high-quality, high-resolution videos. More samples are available at \url{https://videogen.github.io/VideoGen/}.
Abstract
In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality easily-available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation. See \url{https://videogen.github.io/VideoGen/} for more samples.
On the Localization of Ultrasound Image Slices within Point Distribution Models
results: Experiments show that the multi-modal registration framework can localize ultrasound slices on both patient-specific 3D shapes and a statistical shape model, predicting slice positions within an average of 1.2 mm of the ground-truth location on the patient-specific anatomy and 4.6 mm on the statistical shape model.
Abstract
Thyroid disorders are most commonly diagnosed using high-resolution Ultrasound (US). Longitudinal nodule tracking is a pivotal diagnostic protocol for monitoring changes in pathological thyroid morphology. This task, however, imposes a substantial cognitive load on clinicians due to the inherent challenge of maintaining a mental 3D reconstruction of the organ. We thus present a framework for automated US image slice localization within a 3D shape representation to ease how such sonographic diagnoses are carried out. Our proposed method learns a common latent embedding space between US image patches and the 3D surface of an individual's thyroid shape, or a statistical aggregation in the form of a statistical shape model (SSM), via contrastive metric learning. Using cross-modality registration and Procrustes analysis, we leverage features from our model to register US slices to a 3D mesh representation of the thyroid shape. We demonstrate that our multi-modal registration framework can localize images on the 3D surface topology of a patient-specific organ and the mean shape of an SSM. Experimental results indicate slice positions can be predicted within an average of 1.2 mm of the ground-truth slice location on the patient-specific 3D anatomy and 4.6 mm on the SSM, exemplifying its usefulness for slice localization during sonographic acquisitions. Code is publically available: \href{https://github.com/vuenc/slice-to-shape}{https://github.com/vuenc/slice-to-shape}
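The alignment step referred to above relies on Procrustes analysis; the snippet below aligns two corresponding 3-D point sets with SciPy's implementation as a minimal, generic illustration. The point sets are synthetic, not actual US/mesh correspondences.

```python
import numpy as np
from scipy.spatial import procrustes

# Two corresponding 3-D point sets: a reference shape and a rotated, scaled, noisy copy.
reference = np.random.rand(50, 3)
theta = np.deg2rad(20)
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
moving = 1.3 * reference @ rot.T + 0.05 * np.random.randn(50, 3)

# procrustes standardises both sets and finds the optimal rotation/scaling of
# `moving` onto `reference`; `disparity` is the residual sum of squared differences.
ref_std, moving_aligned, disparity = procrustes(reference, moving)
print(disparity)   # small when the correspondence-based alignment is good
```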
How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis
results: Improper data splitting strategies can distort model performance estimates, especially with longitudinal imaging data containing repeated scans from the same subject. GradCAM visualizations reveal shortcuts in CNN models caused by identity confounding, where the model learns to identify subjects along with diagnostic features.
Abstract
Deep learning models have revolutionized the field of medical image analysis, offering significant promise for improved diagnostics and patient care. However, their performance can be misleadingly optimistic due to a hidden pitfall called 'data leakage'. In this study, we investigate data leakage in 3D medical imaging, specifically using 3D Convolutional Neural Networks (CNNs) for brain MRI analysis. While 3D CNNs appear less prone to leakage than 2D counterparts, improper data splitting during cross-validation (CV) can still pose issues, especially with longitudinal imaging data containing repeated scans from the same subject. We explore the impact of different data splitting strategies on model performance for longitudinal brain MRI analysis and identify potential data leakage concerns. GradCAM visualization helps reveal shortcuts in CNN models caused by identity confounding, where the model learns to identify subjects along with diagnostic features. Our findings, consistent with prior research, underscore the importance of subject-wise splitting and evaluating our model further on hold-out data from different subjects to ensure the integrity and reliability of deep learning models in medical image analysis.
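The subject-wise splitting the study advocates is directly supported by scikit-learn's grouped cross-validation; the sketch below keeps all scans of a subject in the same fold, in contrast to a naive scan-level split that leaks subjects across folds. Array shapes and subject IDs are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

n_scans = 120
X = np.random.rand(n_scans, 256)               # one feature vector per MRI scan
y = np.random.randint(0, 2, n_scans)           # diagnosis label per scan
subjects = np.repeat(np.arange(30), 4)         # 30 subjects, 4 repeated scans each

# Subject-wise CV: all scans of a subject stay in the same fold, unlike a naive
# scan-level KFold where repeated scans of one subject leak across folds.
cv = GroupKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=subjects):
    overlap = set(subjects[train_idx]) & set(subjects[test_idx])
    assert not overlap                          # no subject appears on both sides
```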
RigNet++: Efficient Repetitive Image Guided Network for Depth Completion
results: Extensive experiments on the KITTI, VKITTI, NYUv2, 3D60, and Matterport3D datasets show that the method achieves superior or competitive results.
Abstract
Depth completion aims to recover dense depth maps from sparse ones, where color images are often used to facilitate this task. Recent depth completion methods primarily focus on image-guided learning frameworks. However, blurry guidance in the image and unclear structure in the depth still impede their performance. To tackle these challenges, we explore an efficient repetitive design in our image guided network to gradually and sufficiently recover depth values. Specifically, the efficient repetition is embodied in both the image guidance branch and depth generation branch. In the former branch, we design a dense repetitive hourglass network to extract discriminative image features of complex environments, which can provide powerful contextual instruction for depth prediction. In the latter branch, we introduce a repetitive guidance module based on dynamic convolution, in which an efficient convolution factorization is proposed to reduce the complexity while modeling high-frequency structures progressively. Extensive experiments indicate that our approach achieves superior or competitive results on KITTI, VKITTI, NYUv2, 3D60, and Matterport3D datasets.
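The repetitive design can be sketched as an iterative refinement loop in which an image branch provides guidance features and a shared refinement block repeatedly updates the dense depth estimate. The module below is a strongly simplified stand-in (plain convolutions instead of the repetitive hourglass network and the dynamic-convolution factorization); layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class RepetitiveGuidedCompletion(nn.Module):
    """Toy image-guided depth completion with repeated refinement steps."""

    def __init__(self, feat=32, steps=3):
        super().__init__()
        self.steps = steps
        self.img_branch = nn.Sequential(              # guidance features from RGB
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.refine = nn.Sequential(                  # shared refinement block, applied repeatedly
            nn.Conv2d(feat + 1, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 1, 3, padding=1),
        )

    def forward(self, rgb, sparse_depth):
        guide = self.img_branch(rgb)
        depth = sparse_depth                           # start from the sparse measurement
        for _ in range(self.steps):
            depth = depth + self.refine(torch.cat([guide, depth], dim=1))
        return depth

rgb = torch.randn(1, 3, 64, 64)
sparse = torch.zeros(1, 1, 64, 64)
sparse[..., ::8, ::8] = torch.rand(1, 1, 8, 8) * 10.0   # ~1.5% valid depth samples
print(RepetitiveGuidedCompletion()(rgb, sparse).shape)   # torch.Size([1, 1, 64, 64])
```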
MuraNet: Multi-task Floor Plan Recognition with Relation Attention
paper_authors: Lingxiao Huang, Jung-Hsuan Wu, Chiching Wei, Wilson Li
for: floor plan data recognition
methods: attention-based multi-task model (MuraNet) with unified encoder (MURA) and separated branches for segmentation and detection tasks
results: improved convergence speed and performance in detection and segmentation tasks compared to single-task models like U-Net and YOLOv3
Abstract
The recognition of information in floor plan data requires the use of detection and segmentation models. However, relying on several single-task models can result in ineffective utilization of relevant information when there are multiple tasks present simultaneously. To address this challenge, we introduce MuraNet, an attention-based multi-task model for segmentation and detection tasks in floor plan data. In MuraNet, we adopt a unified encoder called MURA as the backbone with two separated branches: an enhanced segmentation decoder branch and a decoupled detection head branch based on YOLOX, for segmentation and detection tasks respectively. The architecture of MuraNet is designed to leverage the fact that walls, doors, and windows usually constitute the primary structure of a floor plan's architecture. By jointly training the model on both detection and segmentation tasks, we believe MuraNet can effectively extract and utilize relevant features for both tasks. Our experiments on the CubiCasa5k public dataset show that MuraNet improves convergence speed during training compared to single-task models like U-Net and YOLOv3. Moreover, we observe improvements in the average AP and IoU in detection and segmentation tasks, respectively. Our ablation experiments demonstrate that the attention-based unified backbone of MuraNet achieves better feature extraction in floor plan recognition tasks, and the use of decoupled multi-head branches for different tasks further improves model performance. We believe that our proposed MuraNet model can address the disadvantages of single-task models and improve the accuracy and efficiency of floor plan data recognition.
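The shared-encoder, two-branch layout can be sketched as follows: one backbone feeds an upsampling segmentation decoder and a decoupled per-cell detection head. This is a schematic stand-in rather than the MuraNet/MURA architecture; the YOLOX-style head is reduced here to a single 1x1 convolution predicting box, objectness, and class scores.

```python
import torch
import torch.nn as nn

class TinyMultiTaskFloorPlanNet(nn.Module):
    """Shared encoder with decoupled segmentation and detection branches."""

    def __init__(self, seg_classes=3, det_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared backbone (stride 4)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.seg_decoder = nn.Sequential(                  # dense wall/door/window masks
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, seg_classes, 2, stride=2),
        )
        # Decoupled detection head: per-cell box (4) + objectness (1) + class scores.
        self.det_head = nn.Conv2d(64, 4 + 1 + det_classes, 1)

    def forward(self, x):
        feats = self.encoder(x)
        return self.seg_decoder(feats), self.det_head(feats)

net = TinyMultiTaskFloorPlanNet()
seg, det = net(torch.randn(2, 3, 128, 128))
print(seg.shape, det.shape)   # (2, 3, 128, 128) and (2, 8, 32, 32)

# Joint training would combine both objectives, e.g. loss = seg_ce + lambda_det * det_loss.
```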
Towards Contrastive Learning in Music Video Domain
paper_authors: Karel Veldkamp, Mariya Hendriksen, Zoltán Szlávik, Alexander Keijser
for: This paper investigates whether contrastive learning can be applied to the domain of music videos and evaluates the approach on downstream tasks such as music tagging and genre classification.
methods: The authors use a dual encoder to learn multimodal representations of the audio and video modalities, trained with a bidirectional contrastive loss.
results: Pre-trained networks without contrastive fine-tuning outperform the contrastive learning approach on both downstream tasks; a qualitative analysis of the learned representations explains why contrastive learning struggles to unite embeddings from the two modalities for music videos.
Abstract
Contrastive learning is a powerful way of learning multimodal representations across various domains such as image-caption retrieval and audio-visual representation learning. In this work, we investigate if these findings generalize to the domain of music videos. Specifically, we create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss. For the experiments, we use an industry dataset containing 550 000 music videos as well as the public Million Song Dataset, and evaluate the quality of learned representations on the downstream tasks of music tagging and genre classification. Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks. To gain a better understanding of the reasons contrastive learning was not successful for music videos, we perform a qualitative analysis of the learned representations, revealing why contrastive learning might have difficulties uniting embeddings from two modalities. Based on these findings, we outline possible directions for future work. To facilitate the reproducibility of our results, we share our code and the pre-trained model.
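The dual-encoder setup with a bidirectional (symmetric InfoNCE) contrastive loss can be written compactly. The audio and video encoders below are placeholder MLPs over precomputed clip-level features, an assumption made purely for illustration rather than a description of the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, embed_dim=64):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.video_enc = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, audio, video):
        a = F.normalize(self.audio_enc(audio), dim=-1)
        v = F.normalize(self.video_enc(video), dim=-1)
        return a, v

def bidirectional_contrastive_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE: match each audio clip to its own video and vice versa."""
    logits = a @ v.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

model = DualEncoder()
audio = torch.randn(32, 128)   # stand-in audio features for a batch of music videos
video = torch.randn(32, 512)   # stand-in video features for the same clips
a, v = model(audio, video)
print(bidirectional_contrastive_loss(a, v).item())
```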
Robust Point Cloud Processing through Positional Embedding
results: Experiments on multiple downstream tasks under several categories of out-of-distribution noise and outliers show that the method provides greater robustness and stability.
Abstract
End-to-end trained per-point embeddings are an essential ingredient of any state-of-the-art 3D point cloud processing such as detection or alignment. Methods like PointNet, or the more recent point cloud transformer -- and its variants -- all employ learned per-point embeddings. Despite impressive performance, such approaches are sensitive to out-of-distribution (OOD) noise and outliers. In this paper, we explore the role of an analytical per-point embedding based on the criterion of bandwidth. The concept of bandwidth enables us to draw connections with an alternate per-point embedding -- positional embedding, particularly random Fourier features. We present compelling robust results across downstream tasks such as point cloud classification and registration with several categories of OOD noise.
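Random Fourier features, the positional embedding the paper connects to bandwidth-based analytical embeddings, can be computed in a few lines for 3D points; the bandwidth and feature count below are arbitrary illustrative choices.

```python
import numpy as np

def random_fourier_embedding(points, num_features=64, bandwidth=1.0, seed=0):
    """Map (N, 3) points to (N, 2*num_features) random Fourier features.

    Approximates a Gaussian kernel with the given bandwidth; nothing is
    learned, so the embedding is analytical rather than trained end-to-end.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / bandwidth, size=(points.shape[1], num_features))
    proj = points @ W
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(num_features)

# The slightly perturbed point stays close in embedding space; the far outlier does not.
p = np.array([[0.1, 0.2, 0.3]])
near, far = p + 0.01, p + 50.0
e_p, e_near, e_far = (random_fourier_embedding(x) for x in (p, near, far))
print(np.linalg.norm(e_p - e_near), np.linalg.norm(e_p - e_far))
```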
Human trajectory prediction using LSTM with Attention mechanism
results: We evaluate our method on the public ETH and UCY datasets using the final displacement error (FDE) and average displacement error (ADE) metrics, and find that the modified algorithm improves pedestrian trajectory prediction in crowded spaces by 6.2% in ADE and 6.3% in FDE over the Social LSTM results reported in the literature.
Abstract
In this paper, we propose a human trajectory prediction model that combines a Long Short-Term Memory (LSTM) network with an attention mechanism. To do that, we use attention scores to determine which parts of the input data the model should focus on when making predictions. Attention scores are calculated for each input feature, with a higher score indicating the greater significance of that feature in predicting the output. Initially, these scores are determined for the target human position, velocity, and the neighboring individuals' positions and velocities. By using attention scores, our model can prioritize the most relevant information in the input data and make more accurate predictions. We extract attention scores from our attention mechanism and integrate them into the trajectory prediction module to predict human future trajectories. To achieve this, we introduce a new neural layer that processes attention scores after extracting them and concatenates them with positional information. We evaluate our approach on the publicly available ETH and UCY datasets and measure its performance using the final displacement error (FDE) and average displacement error (ADE) metrics. We show that our modified algorithm performs better than the Social LSTM in predicting the future trajectory of pedestrians in crowded spaces. Specifically, our model achieves an improvement of 6.2% in ADE and 6.3% in FDE compared to the Social LSTM results in the literature.
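A compact sketch of the described pipeline: attention scores over the neighbours' position/velocity features produce a weighted social context, which is concatenated with the target's own states and fed to an LSTM that predicts future displacements. Dimensions and the scoring network are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTMPredictor(nn.Module):
    """Predict future (x, y) offsets from past states of a target and its neighbours."""

    def __init__(self, state_dim=4, hidden=64, pred_len=12):
        super().__init__()
        self.score = nn.Linear(state_dim, 1)               # attention score per neighbour state
        self.lstm = nn.LSTM(state_dim + state_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pred_len * 2)
        self.pred_len = pred_len

    def forward(self, target, neighbours):
        # target: (B, T, 4) position+velocity; neighbours: (B, T, N, 4)
        scores = F.softmax(self.score(neighbours).squeeze(-1), dim=-1)    # (B, T, N)
        social = (scores.unsqueeze(-1) * neighbours).sum(dim=2)           # attention-weighted neighbour state
        h, _ = self.lstm(torch.cat([target, social], dim=-1))             # concatenate with the target's own state
        out = self.head(h[:, -1])                                         # predict from the last hidden state
        return out.view(-1, self.pred_len, 2)

model = AttentionLSTMPredictor()
past_target = torch.randn(8, 8, 4)         # 8 pedestrians, 8 observed steps
past_neighbours = torch.randn(8, 8, 5, 4)  # 5 neighbours per step
print(model(past_target, past_neighbours).shape)   # torch.Size([8, 12, 2])
```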
ARFA: An Asymmetric Receptive Field Autoencoder Model for Spatiotemporal Prediction
paper_authors: Wenxuan Zhang, Xuechao Zou, Li Wu, Jianqiang Huang, Xiaoying Wang
for: Predicting future sequences by learning from historical contexts, with applications in domains such as traffic flow prediction and weather forecasting.
methods: Proposes an Asymmetric Receptive Field Autoencoder (ARFA) that designs receptive-field modules of different sizes for the distinct roles of the encoder and decoder: a large-kernel module in the encoder for global spatiotemporal feature extraction and a small-kernel module in the decoder for local spatiotemporal information reconstruction.
results: ARFA achieves consistent state-of-the-art performance on two mainstream spatiotemporal prediction datasets and on the authors' self-constructed RainBench dataset, confirming the effectiveness of the approach. Besides exploring a new method from the receptive-field perspective, the work also provides data support for precipitation prediction, advancing future research on spatiotemporal prediction.
Abstract
Spatiotemporal prediction aims to generate future sequences by paradigms learned from historical contexts. It holds significant importance in numerous domains, including traffic flow prediction and weather forecasting. However, existing methods face challenges in handling spatiotemporal correlations, as they commonly adopt encoder and decoder architectures with identical receptive fields, which adversely affects prediction accuracy. This paper proposes an Asymmetric Receptive Field Autoencoder (ARFA) model to address this issue. Specifically, we design corresponding sizes of receptive field modules tailored to the distinct functionalities of the encoder and decoder. In the encoder, we introduce a large kernel module for global spatiotemporal feature extraction. In the decoder, we develop a small kernel module for local spatiotemporal information reconstruction. To address the scarcity of meteorological prediction data, we constructed the RainBench, a large-scale radar echo dataset specific to the unique precipitation characteristics of inland regions in China for precipitation prediction. Experimental results demonstrate that ARFA achieves consistent state-of-the-art performance on two mainstream spatiotemporal prediction datasets and our RainBench dataset, affirming the effectiveness of our approach. This work not only explores a novel method from the perspective of receptive fields but also provides data support for precipitation prediction, thereby advancing future research in spatiotemporal prediction.
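The asymmetric-receptive-field idea reduces to large kernels in the encoder and small kernels in the decoder. The single-frame sketch below uses arbitrary kernel sizes and omits the temporal modelling, so it only illustrates the asymmetry rather than reproducing ARFA.

```python
import torch
import torch.nn as nn

class AsymmetricAutoencoder(nn.Module):
    """Encoder with a large receptive field, decoder with a small one."""

    def __init__(self, ch=1, feat=32):
        super().__init__()
        self.encoder = nn.Sequential(                 # large 7x7 kernels: global context
            nn.Conv2d(ch, feat, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(feat, feat, 7, padding=3), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                 # small 3x3 kernels: local reconstruction
            nn.ConvTranspose2d(feat, feat, 2, stride=2), nn.ReLU(),
            nn.Conv2d(feat, ch, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

frames = torch.randn(4, 1, 64, 64)                    # e.g. radar echo frames treated independently
print(AsymmetricAutoencoder()(frames).shape)          # torch.Size([4, 1, 64, 64])
```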
Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture
for: This paper proposes a method for real-time human motion capture using a combination of monocular images and sparse IMUs.
methods: The proposed method uses a dual coordinate strategy to fully explore the IMU signals and combines the information from both modalities to achieve robust motion capture.
results: The proposed method significantly outperforms state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation; the code is available for research at https://shaohua-pan.github.io/robustcap-page/.
Abstract
Either RGB images or inertial signals have been used for the task of motion capture (mocap), but combining them together is a new and interesting topic. We believe that the combination is complementary and able to solve the inherent difficulties of using one modality input, including occlusions, extreme lighting/texture, and out-of-view for visual mocap and global drifts for inertial mocap. To this end, we propose a method that fuses monocular images and sparse IMUs for real-time human motion capture. Our method contains a dual coordinate strategy to fully explore the IMU signals with different goals in motion capture. To be specific, besides one branch transforming the IMU signals to the camera coordinate system to combine with the image information, there is another branch to learn from the IMU signals in the body root coordinate system to better estimate body poses. Furthermore, a hidden state feedback mechanism is proposed for both branches to compensate for their own drawbacks in extreme input cases. Thus our method can easily switch between the two kinds of signals or combine them in different cases to achieve a robust mocap. Quantitative and qualitative results demonstrate that by delicately designing the fusion method, our technique significantly outperforms the state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation. Our codes are available for research at https://shaohua-pan.github.io/robustcap-page/.
Efficient Surrogate Models for Materials Science Simulations: Machine Learning-based Prediction of Microstructure Properties
results: The paper analyzes two different datasets, data from a two-dimensional Ising model and data from the Cahn-Hilliard model, and converts them into tailored features to predict structure-property relations.
Abstract
Determining, understanding, and predicting the so-called structure-property relation is an important task in many scientific disciplines, such as chemistry, biology, meteorology, physics, engineering, and materials science. Structure refers to the spatial distribution of, e.g., substances, material, or matter in general, while property is a resulting characteristic that usually depends in a non-trivial way on spatial details of the structure. Traditionally, forward simulation models have been used for such tasks. Recently, several machine learning algorithms have been applied in these scientific fields to enhance and accelerate simulation models or as surrogate models. In this work, we develop and investigate the applications of six machine learning techniques based on two different datasets from the domain of materials science: data from a two-dimensional Ising model for predicting the formation of magnetic domains and data representing the evolution of dual-phase microstructures from the Cahn-Hilliard model. We analyze the accuracy and robustness of all models and elucidate the reasons for the differences in their performances. The impact of including domain knowledge through tailored features is studied, and general recommendations based on the availability and quality of training data are derived from this.
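As a minimal, hedged illustration of the surrogate-modelling setup (not the paper's datasets, features, or models), the snippet below maps hand-crafted microstructure descriptors to a scalar property with two commonly compared model families; the synthetic data merely mimics a structure-property mapping.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic "microstructures": tailored features such as phase fraction,
# mean domain size and interface density (all invented for illustration).
rng = np.random.default_rng(0)
n = 500
phase_fraction = rng.uniform(0.1, 0.9, n)
domain_size = rng.lognormal(mean=1.0, sigma=0.3, size=n)
interface_density = 1.0 / domain_size + rng.normal(0, 0.02, n)
X = np.c_[phase_fraction, domain_size, interface_density]

# Synthetic target property depending non-trivially on the structure descriptors.
y = 3.0 * phase_fraction * (1 - phase_fraction) + 0.5 * interface_density + rng.normal(0, 0.02, n)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("random forest", RandomForestRegressor(n_estimators=200, random_state=0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:13s} mean CV R^2: {r2:.3f}")
```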
Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
results: Experiments show that representations learned with the FIMA framework possess strong motion-awareness and achieve state-of-the-art or competitive results on the UCF101, HMDB51, and Diving48 datasets.
Abstract
As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a \textbf{Fi}ne-grained \textbf{M}otion \textbf{A}lignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and space. Moreover, a frame-level motion contrastive loss is presented to improve the temporal diversity of the motion features. Extensive experiments demonstrate that the representations learned by FIMA possess great motion-awareness capabilities and achieve state-of-the-art or competitive results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code is available at \url{https://github.com/ZMHH-H/FIMA}.
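Frame difference as a cheap source of motion supervision amounts to a per-pixel magnitude map between consecutive frames. The snippet below shows that step plus a simple top-k foreground sampling of high-motion pixels; it is a much-simplified stand-in for FIMA's dense contrastive pipeline, with all shapes chosen for illustration.

```python
import torch

def motion_map(clip):
    """clip: (T, C, H, W) video tensor -> (T-1, H, W) per-pixel motion magnitude."""
    diff = clip[1:] - clip[:-1]            # frame difference between consecutive frames
    return diff.abs().mean(dim=1)          # average magnitude over colour channels

def sample_foreground(motion, k=128):
    """Indices of the k highest-motion pixels per frame pair (foreground sampling)."""
    flat = motion.flatten(start_dim=1)     # (T-1, H*W)
    return flat.topk(k, dim=1).indices

clip = torch.rand(8, 3, 32, 32)            # stand-in video clip
m = motion_map(clip)
fg = sample_foreground(m)
print(m.shape, fg.shape)                   # torch.Size([7, 32, 32]) torch.Size([7, 128])
```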
Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution
paper_authors: Charles Laroche, Andrés Almansa, Eva Coupete
for: solving the inverse problem of blind image deblurring
methods: uses diffusion models and an Expectation-Minimization (EM) estimation method with blur-kernel regularization
results: provides effective and fast results compared to other state-of-the-art approaches to blind image deblurring
Abstract
Using diffusion models to solve inverse problems is a growing field of research. Current methods assume the degradation to be known and provide impressive results in terms of restoration quality and diversity. In this work, we leverage the efficiency of those models to jointly estimate the restored image and unknown parameters of the degradation model. In particular, we designed an algorithm based on the well-known Expectation-Minimization (EM) estimation method and diffusion models. Our method alternates between approximating the expected log-likelihood of the inverse problem using samples drawn from a diffusion model and a maximization step to estimate unknown model parameters. For the maximization step, we also introduce a novel blur kernel regularization based on a Plug \& Play denoiser. Diffusion models are slow to run, so we provide a fast version of our algorithm. Extensive experiments on blind image deblurring demonstrate the effectiveness of our method when compared to other state-of-the-art approaches.
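The EM alternation can be summarised schematically: the E-step draws restored images given the current kernel (in the paper from a diffusion model; below from a trivial placeholder sampler), and the M-step re-estimates the blur kernel from those samples (below a plain Fourier-domain least-squares update rather than the Plug & Play-regularised step). Everything here is a didactic sketch under those stated assumptions, not the authors' algorithm.

```python
import numpy as np

def e_step_sample_restorations(y, kernel, n_samples=4, rng=None):
    """Placeholder for drawing restored images from a diffusion posterior.

    A real implementation would run a diffusion model conditioned on the
    blurry image y and the current kernel; here the kernel is ignored and
    y is simply perturbed.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    return [y + 0.01 * rng.normal(size=y.shape) for _ in range(n_samples)]

def m_step_update_kernel(y, samples, kernel_shape, eps=1e-3):
    """Least-squares kernel update in the Fourier domain, averaged over samples."""
    Y = np.fft.fft2(y)
    num = np.zeros_like(Y)
    den = np.zeros(y.shape)
    for x in samples:
        X = np.fft.fft2(x)
        num += np.conj(X) * Y
        den += np.abs(X) ** 2
    K = num / (den + eps)
    k = np.fft.fftshift(np.real(np.fft.ifft2(K)))
    c0, c1 = k.shape[0] // 2, k.shape[1] // 2
    h, w = kernel_shape
    k = k[c0 - h // 2: c0 - h // 2 + h, c1 - w // 2: c1 - w // 2 + w]  # crop around the peak
    k = np.clip(k, 0, None)
    return k / (k.sum() + 1e-12)                                       # keep the kernel normalised

# Toy EM loop on a random "blurry" image.
rng = np.random.default_rng(1)
y = rng.random((32, 32))
kernel = np.full((5, 5), 1 / 25)                                       # initial flat kernel guess
for _ in range(3):
    samples = e_step_sample_restorations(y, kernel, rng=rng)           # E-step (diffusion model in the paper)
    kernel = m_step_update_kernel(y, samples, kernel.shape)            # M-step (Plug & Play prior in the paper)
print(kernel.shape, float(kernel.sum()))
```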
results: On Pleiades 1B/WorldView-3 stereo and tri-stereo imagery, SpS-NeRF recovers the scene geometry better than NeRF and Sat-NeRF.
Abstract
Digital surface model generation using traditional multi-view stereo matching (MVS) performs poorly over non-Lambertian surfaces, with asynchronous acquisitions, or at discontinuities. Neural radiance fields (NeRF) offer a new paradigm for reconstructing surface geometries using continuous volumetric representation. NeRF is self-supervised, does not require ground truth geometry for training, and provides an elegant way to include in its representation physical parameters about the scene, thus potentially remedying the challenging scenarios where MVS fails. However, NeRF and its variants require many views to produce convincing scene's geometries which in earth observation satellite imaging is rare. In this paper we present SparseSat-NeRF (SpS-NeRF) - an extension of Sat-NeRF adapted to sparse satellite views. SpS-NeRF employs dense depth supervision guided by crosscorrelation similarity metric provided by traditional semi-global MVS matching. We demonstrate the effectiveness of our approach on stereo and tri-stereo Pleiades 1B/WorldView-3 images, and compare against NeRF and Sat-NeRF. The code is available at https://github.com/LulinZhang/SpS-NeRF
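The dense depth supervision can be illustrated as one extra loss term: depths rendered along NeRF rays are compared with the semi-global-matching depths, weighted by the cross-correlation confidence of each match. The tensor shapes, the weighting scheme, and the loss-combination comment below are illustrative assumptions, not the released SpS-NeRF loss.

```python
import torch

def depth_supervision_loss(rendered_depth, mvs_depth, correlation, valid):
    """Confidence-weighted L1 between NeRF-rendered and MVS depths.

    rendered_depth, mvs_depth: (N,) per-ray depths.
    correlation: (N,) cross-correlation similarity of the MVS match in [0, 1].
    valid: (N,) bool mask of rays for which MVS produced a depth at all.
    """
    w = correlation * valid.float()
    return (w * (rendered_depth - mvs_depth).abs()).sum() / (w.sum() + 1e-8)

# Toy batch of rays.
n = 1024
rendered = torch.rand(n) * 10
mvs = rendered + 0.1 * torch.randn(n)
corr = torch.rand(n)
valid = torch.rand(n) > 0.3            # sparse satellite views leave some rays unmatched
loss_depth = depth_supervision_loss(rendered, mvs, corr, valid)
# total_loss = photometric_loss + lambda_depth * loss_depth   (lambda_depth is a tunable weight)
print(loss_depth.item())
```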
Application of Machine Learning in Melanoma Detection and the Identification of ‘Ugly Duckling’ and Suspicious Naevi: A Review
paper_authors: Fatima Al Zegair, Nathasha Naranpanawa, Brigid Betz-Stablein, Monika Janda, H. Peter Soyer, Shekhar S. Chandra
for: To improve the accuracy and accessibility of skin cancer diagnosis and to address the shortage of dermatology specialists.
methods: The review surveys machine learning and deep learning techniques, including artificial neural networks, for early melanoma detection and for identifying 'ugly duckling' and suspicious naevi.
results: Reported results indicate that machine learning and deep learning techniques can reach diagnostic accuracy comparable to that of specialists, with potential to reduce healthcare costs and improve efficiency.
Abstract
Skin lesions known as naevi exhibit diverse characteristics such as size, shape, and colouration. The concept of an "Ugly Duckling Naevus" comes into play when monitoring for melanoma, referring to a lesion with distinctive features that sets it apart from other lesions in the vicinity. As lesions within the same individual typically share similarities and follow a predictable pattern, an ugly duckling naevus stands out as unusual and may indicate the presence of a cancerous melanoma. Computer-aided diagnosis (CAD) has become a significant player in the research and development field, as it combines machine learning techniques with a variety of patient analysis methods. Its aim is to increase accuracy and simplify decision-making, all while responding to the shortage of specialized professionals. These automated systems are especially important in skin cancer diagnosis where specialist availability is limited. As a result, their use could lead to life-saving benefits and cost reductions within healthcare. Given the drastic change in survival when comparing early stage to late-stage melanoma, early detection is vital for effective treatment and patient outcomes. Machine learning (ML) and deep learning (DL) techniques have gained popularity in skin cancer classification, effectively addressing challenges, and providing results equivalent to that of specialists. This article extensively covers modern Machine Learning and Deep Learning algorithms for detecting melanoma and suspicious naevi. It begins with general information on skin cancer and different types of naevi, then introduces AI, ML, DL, and CAD. The article then discusses the successful applications of various ML techniques like convolutional neural networks (CNN) for melanoma detection compared to dermatologists' performance. Lastly, it examines ML methods for UD naevus detection and identifying suspicious naevi.
Interpretable Medical Imagery Diagnosis with Self-Attentive Transformers: A Review of Explainable AI for Health Care
results: The review brings together recent ViT and XAI techniques to support transparent medical diagnosis, helping clinicians better understand model decisions and thereby improving the accuracy and reliability of diagnoses.
Abstract
Recent advancements in artificial intelligence (AI) have facilitated its widespread adoption in primary medical services, addressing the demand-supply imbalance in healthcare. Vision Transformers (ViT) have emerged as state-of-the-art computer vision models, benefiting from self-attention modules. However, compared to traditional machine-learning approaches, deep-learning models are complex and are often treated as a "black box" that can cause uncertainty regarding how they operate. Explainable Artificial Intelligence (XAI) refers to methods that explain and interpret machine learning models' inner workings and how they come to decisions, which is especially important in the medical domain to guide the healthcare decision-making process. This review summarises recent ViT advancements and interpretative approaches to understanding the decision-making process of ViT, enabling transparency in medical diagnosis applications.
results: Experimental results show that the method reduces the annotation burden for MOT while approaching the fully supervised state of the art and outperforming several unsupervised trackers.
Abstract
Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefited from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between the existing object-centric model and the fully supervised state-of-the-art and outperform several unsupervised trackers.
What Makes Good Open-Vocabulary Detector: A Disassembling Perspective
results: On the OVD-COCO benchmark, DRR obtains the best performance with 35.8 Novel AP$_{50}$, an absolute gain of 2.8 over the previous state of the art (SOTA); on OVD-LVIS it surpasses the previous SOTA by 1.9 AP$_{50}$ on rare categories. The paper also releases an object detection dataset called PID together with a baseline on it.
Abstract
Open-vocabulary detection (OVD) is a new object detection paradigm, aiming to localize and recognize unseen objects defined by an unbounded vocabulary. This is challenging since traditional detectors can only learn from pre-defined categories and thus fail to detect and localize objects out of pre-defined vocabulary. To handle the challenge, OVD leverages pre-trained cross-modal VLM, such as CLIP, ALIGN, etc. Previous works mainly focus on the open vocabulary classification part, with less attention on the localization part. We argue that for a good OVD detector, both classification and localization should be parallelly studied for the novel object categories. We show in this work that improving localization as well as cross-modal classification complement each other, and compose a good OVD detector jointly. We analyze three families of OVD methods with different design emphases. We first propose a vanilla method, i.e., cropping a bounding box obtained by a localizer and resizing it into the CLIP. We next introduce another approach, which combines a standard two-stage object detector with CLIP. A two-stage object detector includes a visual backbone, a region proposal network (RPN), and a region of interest (RoI) head. We decouple RPN and ROI head (DRR) and use RoIAlign to extract meaningful features. In this case, it avoids resizing objects. To further accelerate the training time and reduce the model parameters, we couple RPN and ROI head (CRR) as the third approach. We conduct extensive experiments on these three types of approaches in different settings. On the OVD-COCO benchmark, DRR obtains the best performance and achieves 35.8 Novel AP$_{50}$, an absolute 2.8 gain over the previous state-of-the-art (SOTA). For OVD-LVIS, DRR surpasses the previous SOTA by 1.9 AP$_{50}$ in rare categories. We also provide an object detection dataset called PID and provide a baseline on PID.
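The 'vanilla' variant described first (crop each proposed box, resize it, and classify the crop against text embeddings of an open vocabulary) can be sketched as below. The CLIP image and text encoders are represented by placeholder callables, and the box coordinates are made up; only the crop-resize-classify flow mirrors the description above.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def classify_boxes_open_vocab(image, boxes, image_encoder, text_embeds, crop_size=224):
    """Zero-shot classification of localizer boxes against an open vocabulary.

    image: (1, 3, H, W); boxes: (N, 4) in (x1, y1, x2, y2) image coordinates.
    image_encoder: callable mapping (N, 3, crop_size, crop_size) -> (N, D) embeddings.
    text_embeds: (C, D) embeddings of category prompts (the open vocabulary).
    """
    crops = roi_align(image, [boxes], output_size=(crop_size, crop_size))  # crop + resize in one call
    img_embeds = F.normalize(image_encoder(crops), dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return (img_embeds @ text_embeds.t()).softmax(dim=-1)                  # (N, C) category scores

# Placeholders standing in for CLIP encoders and a localizer's proposals.
fake_image_encoder = lambda x: x.mean(dim=(2, 3)) @ torch.randn(3, 512)
fake_text_embeds = torch.randn(5, 512)            # 5 vocabulary prompts
image = torch.rand(1, 3, 480, 640)
boxes = torch.tensor([[10.0, 20.0, 200.0, 220.0], [300.0, 100.0, 600.0, 400.0]])
print(classify_boxes_open_vocab(image, boxes, fake_image_encoder, fake_text_embeds).shape)  # torch.Size([2, 5])
```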
Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation
results: Experiments show that HIDA generates high-quality sketches in multiple styles across a wide range of challenging faces while keeping the style of each sketch globally consistent.
Abstract
Facial sketch synthesis (FSS) aims to generate a vivid sketch portrait from a given facial photo. Existing FSS methods merely rely on 2D representations of facial semantics or appearance. However, professional human artists usually use outlines or shadings to convey 3D geometry. Thus facial 3D geometry (e.g. depth map) is extremely important for FSS. Besides, different artists may use diverse drawing techniques and create multiple styles of sketches; but the style is globally consistent in a sketch. Inspired by such observations, in this paper, we propose a novel Human-Inspired Dynamic Adaptation (HIDA) method. Specifically, we propose to dynamically modulate neuron activations based on a joint consideration of both facial 3D geometry and 2D appearance, as well as globally consistent style control. Besides, we use deformable convolutions at coarse scales to align deep features, for generating abstract and distinct outlines. Experiments show that HIDA can generate high-quality sketches in multiple styles, and significantly outperforms previous methods, over a large range of challenging faces. Besides, HIDA allows precise style control of the synthesized sketch, and generalizes well to natural scenes and other artistic styles. Our code and results have been released online at: https://github.com/AiArt-HDU/HIDA.
DARC: Distribution-Aware Re-Coloring Model for Generalizable Nucleus Segmentation
methods: The study proposes a Distribution-Aware Re-Coloring (DARC) model that addresses the domain gap from two angles: a re-coloring method that reduces image color differences between domains, and a new instance normalization method that is robust to variation in the foreground-background ratio.
results: Extensive experiments on two H$\&$E-stained and two IHC-stained image datasets demonstrate the effectiveness of the proposed DARC model. Code is available at \url{https://github.com/csccsccsccsc/DARC}.
Abstract
Nucleus segmentation is usually the first step in pathological image analysis tasks. Generalizable nucleus segmentation refers to the problem of training a segmentation model that is robust to domain gaps between the source and target domains. The domain gaps are usually believed to be caused by the varied image acquisition conditions, e.g., different scanners, tissues, or staining protocols. In this paper, we argue that domain gaps can also be caused by different foreground (nucleus)-background ratios, as this ratio significantly affects feature statistics that are critical to normalization layers. We propose a Distribution-Aware Re-Coloring (DARC) model that handles the above challenges from two perspectives. First, we introduce a re-coloring method that relieves dramatic image color variations between different domains. Second, we propose a new instance normalization method that is robust to the variation in foreground-background ratios. We evaluate the proposed methods on two H$\&$E stained image datasets, named CoNSeP and CPM17, and two IHC stained image datasets, called DeepLIIF and BC-DeepLIIF. Extensive experimental results justify the effectiveness of our proposed DARC model. Codes are available at \url{https://github.com/csccsccsccsc/DARC}.
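One way to make instance normalization less sensitive to the nucleus-to-background ratio is to compute the normalization statistics from foreground pixels only. The layer below sketches that idea with a coarse foreground mask; it illustrates the motivation stated above and is not the DARC implementation.

```python
import torch
import torch.nn as nn

class ForegroundAwareInstanceNorm(nn.Module):
    """Instance norm whose statistics come only from (approximate) foreground pixels."""

    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, x, fg_mask):
        # x: (B, C, H, W); fg_mask: (B, 1, H, W) in {0, 1}, e.g. a coarse nucleus mask.
        area = fg_mask.sum(dim=(2, 3), keepdim=True).clamp(min=1.0)
        mean = (x * fg_mask).sum(dim=(2, 3), keepdim=True) / area
        var = ((x - mean) ** 2 * fg_mask).sum(dim=(2, 3), keepdim=True) / area
        return (x - mean) / torch.sqrt(var + self.eps)

x = torch.rand(2, 8, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.8).float()    # ~20% foreground in one image ...
mask[1] = (torch.rand(1, 64, 64) > 0.4).float()    # ... ~60% in the other
print(ForegroundAwareInstanceNorm()(x, mask).shape)
```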
Vision-aided nonlinear control framework for shake table tests
paper_authors: Zhongwei Chen, T. Y. Yang, Yifei Xiao, Xiao Pan, Wanyan Yang
for: The study applies adaptive control theory to shake table testing, accounting for the inherent nonlinearity of the shake table system and the Control-Structural Interaction (CSI) effect that linear controllers such as the Proportional-Integral-Derivative (PID) controller cannot handle.
methods: An adaptive control law with online estimation of the unknown specimen mass, implemented through a loop-shaping controller.
results: Simulations and experiments show that the proposed control framework can be used effectively for shake table control under earthquake excitations.
Abstract
The structural response under earthquake excitations can be simulated by scaled-down or full-scale model shake table tests. In this paper, adaptive control theory is used as a nonlinear shake table control algorithm that accounts for the inherent nonlinearity of the shake table system and the Control-Structural Interaction (CSI) effect, which linear controllers such as the Proportional-Integral-Derivative (PID) controller cannot consider. The mass of the specimen is treated as an unknown variation, and this unknown parameter is replaced by an estimated value in the proposed control framework. The signal generated by the adaptive control law is implemented by a loop-shaping controller. To verify the stability and feasibility of the proposed control framework, a simulation of a bare shake table and experiments with a bare shake table and with a two-story frame were carried out, using earthquake recordings randomly selected from the Pacific Earthquake Engineering Research Center (PEER) database. The simulation and experimental results show that the proposed control framework can be effectively used in shake table control.
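A toy version of the adaptive idea, treating the specimen mass as the unknown parameter, estimating it online, and using the estimate in the control law, can be written for a one-degree-of-freedom model. Dynamics, gains, and the MIT-rule-style update below are made up for illustration and do not reproduce the paper's controller or loop-shaping design.

```python
import numpy as np

# One-DOF toy: the table must reproduce a reference acceleration on a payload of
# unknown mass. Control law u = m_hat * a_ref; m_hat is adapted from the
# acceleration tracking error (gradient / MIT-rule style). All values are made up.
dt, T = 0.001, 5.0
t = np.arange(0.0, T, dt)
a_ref = np.sin(2 * np.pi * 1.5 * t) + 0.5 * np.sin(2 * np.pi * 4.0 * t)  # stand-in ground motion

m_true = 250.0       # actual specimen + platen mass [kg], unknown to the controller
m_hat = 100.0        # initial estimate
gamma = 2000.0       # adaptation gain (illustrative)

m_hat_hist, err_hist = [], []
for a_r in a_ref:
    u = m_hat * a_r                  # feedforward force from the current mass estimate
    a = u / m_true                   # achieved table/payload acceleration
    e = a - a_r                      # acceleration tracking error
    m_hat -= gamma * e * a_r * dt    # gradient-style parameter update
    m_hat_hist.append(m_hat)
    err_hist.append(abs(e))

print(f"final mass estimate: {m_hat_hist[-1]:.1f} kg (true {m_true:.0f} kg)")
print(f"mean |tracking error| over the last second: {np.mean(err_hist[-int(1 / dt):]):.4f} m/s^2")
```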