results: Experiments on three benchmark datasets achieve state-of-the-art performance on both the ITM and joint SR-ITM tasks.
Abstract
Display devices such as HDR10 televisions, used for visualizing high dynamic range (HDR) images, are increasingly prevalent in our daily life. However, the majority of media images on the internet remain in 8-bit standard dynamic range (SDR) format. Therefore, converting SDR images to HDR ones by inverse tone mapping (ITM) is crucial to unlock the full potential of abundant media images. However, existing ITM methods are usually built on complex network architectures that require huge computational costs. In this paper, we propose a lightweight Improved Residual Network (IRNet) that enhances the popular residual block for efficient ITM. Specifically, we propose a new Improved Residual Block (IRB) to extract and fuse multi-layer features for fine-grained HDR image reconstruction. Experiments on three benchmark datasets demonstrate that our IRNet achieves state-of-the-art performance on both the ITM and joint SR-ITM tasks. The code, models and data will be publicly available at https://github.com/ThisisVikki/ITM-baseline.
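The abstract does not spell out the IRB design; as a rough, hedged PyTorch sketch of a residual block that extracts and fuses multi-layer features, something like the following could serve as a starting point (all layer choices and channel sizes are illustrative assumptions, not the authors' actual architecture):

```python
import torch
import torch.nn as nn

class ImprovedResidualBlock(nn.Module):
    """Hypothetical sketch: two conv stages whose intermediate features
    are fused before the residual connection (illustrative only)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # 1x1 conv fuses the concatenated multi-layer features back to `channels`
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f1 = self.act(self.conv1(x))
        f2 = self.act(self.conv2(f1))
        fused = self.fuse(torch.cat([f1, f2], dim=1))  # fuse shallow + deep features
        return x + fused                               # residual connection

if __name__ == "__main__":
    block = ImprovedResidualBlock(64)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```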
Stimulating the Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling
methods: A novel strategy called Diffusion Model for Image Denoising (DMID) is proposed, comprising an adaptive embedding method and an adaptive ensembling method, to solve the critical problems in connecting diffusion models with image denoising.
results: The strategy achieves state-of-the-art performance on all distortion-based and perceptual metrics, for both Gaussian and real-world image denoising.
Abstract
Image denoising is a fundamental problem in computational photography, where achieving high-quality perceptual performance with low distortion is highly demanding. Current methods either struggle with perceptual performance or suffer from significant distortion. Recently, the emerging diffusion model achieves state-of-the-art performance in various tasks, and its denoising mechanism demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. On the one hand, the input inconsistency hinders the connection of diffusion models and image denoising. On the other hand, the content inconsistency between the generated image and the desired denoised image introduces additional distortion. To tackle these problems, we present a novel strategy called Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained diffusion model, and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on all distortion-based and perceptual metrics, for both Gaussian and real-world image denoising.
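The abstract does not give the details of DMID's adaptive ensembling; as a hedged illustration of the general idea of reducing distortion by ensembling several stochastic denoised estimates, a minimal sketch might look like this (the `denoise_once` function is a placeholder, not the paper's method):

```python
import torch

def denoise_once(noisy: torch.Tensor, seed: int) -> torch.Tensor:
    """Placeholder for one stochastic reverse-diffusion denoising pass.
    In DMID this would run the pre-trained diffusion model from an adaptively
    chosen embedding step; here we only add a toy perturbation."""
    g = torch.Generator().manual_seed(seed)
    return noisy + 0.01 * torch.randn(noisy.shape, generator=g)

def ensemble_denoise(noisy: torch.Tensor, n_runs: int = 8) -> torch.Tensor:
    """Average several stochastic estimates; averaging independent posterior
    samples reduces distortion (MSE) relative to a single sample."""
    estimates = torch.stack([denoise_once(noisy, seed=s) for s in range(n_runs)])
    return estimates.mean(dim=0)

if __name__ == "__main__":
    noisy = torch.rand(1, 3, 64, 64)
    print(ensemble_denoise(noisy).shape)
```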
FTFDNet: Learning to Detect Talking Face Video Manipulation with Tri-Modality Interaction
results: FTFDNet achieves better detection performance than other state-of-the-art DeepFake video detection methods, not only on the established fake talking face detection dataset (FTFDD) but also on the DeepFake video detection datasets (DFDC and DF-TIMIT).
Abstract
DeepFake-based digital facial forgery is threatening public media security, especially when lip manipulation is used in talking face generation, which further increases the difficulty of fake video detection. Since only the lip shape is changed to match the given speech, identity-related facial features are hard to discriminate in such fake talking face videos. Together with the lack of attention to the audio stream as prior knowledge, detection failures on fake talking face videos become almost inevitable. It is found that the optical flow of a fake talking face video is disordered, especially in the lip region, while the optical flow of a real video changes regularly, which means the motion feature from optical flow is useful for capturing manipulation cues. In this study, a fake talking face detection network (FTFDNet) is proposed by incorporating visual, audio and motion features using an efficient cross-modal fusion (CMF) module. Furthermore, a novel audio-visual attention mechanism (AVAM) is proposed to discover more informative features, which can be seamlessly integrated into any audio-visual CNN architecture by modularization. With the additional AVAM, the proposed FTFDNet is able to achieve a better detection performance than other state-of-the-art DeepFake video detection methods not only on the established fake talking face detection dataset (FTFDD) but also on the DeepFake video detection datasets (DFDC and DF-TIMIT).
TractGeoNet: A geometric deep learning framework for pointwise analysis of tract microstructure to predict language assessment performance
paper_authors: Yuqian Chen, Leo R. Zekelman, Chaoyi Zhang, Tengfei Xue, Yang Song, Nikos Makris, Yogesh Rathi, Alexandra J. Golby, Weidong Cai, Fan Zhang, Lauren J. O’Donnell
results: Using TractGeoNet for regression, the method is evaluated on predicting language assessment performance from 20 white matter fiber tracts and outperforms several popular regression models. The left arcuate fasciculus is found to be the most predictive tract for the two studied language assessments. The localized critical regions are widespread across both hemispheres, including superior and anterior temporal regions and other brain areas considered important for language function. Overall, TractGeoNet demonstrates that geometric deep learning can enhance the study of white matter fiber tracts and their relationship to language performance.
Abstract
We propose a geometric deep-learning-based framework, TractGeoNet, for performing regression using diffusion magnetic resonance imaging (dMRI) tractography and associated pointwise tissue microstructure measurements. By employing a point cloud representation, TractGeoNet can directly utilize pointwise tissue microstructure and positional information from all points within a fiber tract. To improve regression performance, we propose a novel loss function, the Paired-Siamese Regression loss, which encourages the model to focus on accurately predicting the relative differences between regression label scores rather than just their absolute values. In addition, we propose a Critical Region Localization algorithm to identify highly predictive anatomical regions within the white matter fiber tracts for the regression task. We evaluate the effectiveness of the proposed method by predicting individual performance on two neuropsychological assessments of language using a dataset of 20 association white matter fiber tracts from 806 subjects from the Human Connectome Project. The results demonstrate superior prediction performance of TractGeoNet compared to several popular regression models. Of the twenty tracts studied, we find that the left arcuate fasciculus tract is the most highly predictive of the two studied language performance assessments. The localized critical regions are widespread and distributed across both hemispheres and all cerebral lobes, including areas of the brain considered important for language function such as superior and anterior temporal regions, pars opercularis, and precentral gyrus. Overall, TractGeoNet demonstrates the potential of geometric deep learning to enhance the study of the brain's white matter fiber tracts and to relate their structure to human traits such as language performance.
Building and Road Segmentation Using EffUNet and Transfer Learning Approach
results: Using this approach, the authors achieve benchmark segmentation accuracy (mIoU) of 0.8365 and 0.9153 on the Massachusetts Building and Road datasets, respectively.
Abstract
In cities, information about urban objects such as water supply, railway lines, power lines, buildings, roads, etc., is necessary for city planning. In particular, policymakers need information about the spread, locations and capacity of these objects to make impactful decisions. This thesis aims to segment buildings and roads from aerial images captured by satellites and UAVs. Many different architectures have been proposed for the semantic segmentation task, UNet being one of them. In this thesis, we propose a novel architecture based on Google's newly proposed EfficientNetV2 as an encoder for feature extraction, with a UNet decoder for constructing the segmentation map. Using this approach we achieved benchmark scores on the Massachusetts Building and Road datasets with an mIOU of 0.8365 and 0.9153 respectively.
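A hedged sketch of the encoder-decoder pattern the thesis describes, an EfficientNetV2 feature extractor feeding a UNet-style decoder, is shown below. It assumes torchvision's `efficientnet_v2_s`; the stage selection, decoder widths and segmentation head are illustrative choices, not the thesis' exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_v2_s

class EffUNet(nn.Module):
    """Hedged sketch: EfficientNetV2-S encoder with a UNet-style decoder."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # weights=None keeps the sketch offline; load pretrained weights in practice.
        self.encoder = efficientnet_v2_s(weights=None).features
        # Probe once to find the stages where resolution halves, so skip
        # connections do not rely on hard-coded stage indices or channel counts.
        chs, self.skip_ids = [], []
        x = torch.zeros(1, 3, 256, 256)
        with torch.no_grad():
            for i, stage in enumerate(self.encoder):
                h = x.shape[-1]
                x = stage(x)
                if x.shape[-1] < h:
                    self.skip_ids.append(i)
                    chs.append(x.shape[1])
        self.up_convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(deep + shallow, shallow, 3, padding=1),
                          nn.BatchNorm2d(shallow), nn.ReLU(inplace=True))
            for deep, shallow in zip(list(reversed(chs)), list(reversed(chs))[1:]))
        self.head = nn.Conv2d(chs[0], num_classes, 1)

    def forward(self, x):
        skips, out = [], x
        for i, stage in enumerate(self.encoder):
            out = stage(out)
            if i in self.skip_ids:
                skips.append(out)
        out = skips[-1]
        for conv, skip in zip(self.up_convs, reversed(skips[:-1])):
            out = F.interpolate(out, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            out = conv(torch.cat([out, skip], dim=1))   # UNet-style skip fusion
        out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
        return self.head(out)

if __name__ == "__main__":
    model = EffUNet(num_classes=2)
    print(model(torch.randn(1, 3, 256, 256)).shape)  # (1, 2, 256, 256)
```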
results: Extensive experiments show that KMCL achieves consistent improvements on image classification tasks while keeping the computational complexity low.
Abstract
Multilabel representation learning is recognized as a challenging problem that can be associated with either label dependencies between object categories or data-related issues such as the inherent imbalance of positive/negative samples. Recent advances address these challenges from model- and data-centric viewpoints. In the model-centric view, the label correlation is obtained by external model designs (e.g., graph CNNs) to incorporate an inductive bias for training. However, they fail to design an end-to-end training framework, leading to high computational complexity. On the contrary, in the data-centric view, the realistic nature of the dataset is considered for improving the classification while ignoring the label dependencies. In this paper, we propose a new end-to-end training framework -- dubbed KMCL (Kernel-based Multilabel Contrastive Learning) -- to address the shortcomings of both model- and data-centric designs. The KMCL first transforms the embedded features into a mixture of exponential kernels in Gaussian RKHS. It is then followed by encoding an objective loss that is comprised of (a) reconstruction loss to reconstruct kernel representation, (b) asymmetric classification loss to address the inherent imbalance problem, and (c) contrastive loss to capture label correlation. The KMCL models the uncertainty of the feature encoder while maintaining a low computational footprint. Extensive experiments are conducted on image classification tasks to showcase the consistent improvements of KMCL over the SOTA methods. PyTorch implementation is provided in \url{https://github.com/mahdihosseini/KMCL}.
Reading Between the Lanes: Text VideoQA on the Road
results: The study finds that existing VideoQA models still have significant room for improvement in this domain, and demonstrates the usefulness and importance of the dataset for advancing research on driver assistance systems and text-aware multimodal question answering.
Abstract
Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, while textual cues typically appear for a short time span, and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of $3,222$ driving videos collected from multiple countries, annotated with $10,500$ questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on our RoadTextVQA dataset, highlighting the significant potential for improvement in this domain and the usefulness of the dataset in advancing research on in-vehicle support systems and text-aware multimodal question answering. The dataset is available at http://cvit.iiit.ac.in/research/projects/cvit-projects/roadtextvqa
Camouflaged Object Detection with Feature Grafting and Distractor Aware
results: Extensive experiments on four widely used benchmark datasets and on the ACOD2K dataset show that the method significantly outperforms other state-of-the-art methods. The code and ACOD2K will be released at https://github.com/syxvision/FDNet.
Abstract
The task of Camouflaged Object Detection (COD) aims to accurately segment camouflaged objects that are integrated into the environment, which is more challenging than ordinary detection since the texture of the target and the background is visually indistinguishable. In this paper, we propose a novel Feature Grafting and Distractor Aware network (FDNet) to handle the COD task. Specifically, we use CNN and Transformer branches to encode multi-scale images in parallel. In order to better exploit the advantages of the two encoders, we design a cross-attention-based Feature Grafting Module to graft features extracted from the Transformer branch into the CNN branch, after which the features are aggregated in the Feature Fusion Module. A Distractor Aware Module is designed to explicitly model the two possible distractors in the COD task to refine the coarse camouflage map. We also propose the largest artificial camouflaged object dataset, which contains 2000 images with annotations, named ACOD2K. We conducted extensive experiments on four widely used benchmark datasets and the ACOD2K dataset. The results show that our method significantly outperforms other state-of-the-art methods. The code and the ACOD2K will be available at https://github.com/syxvision/FDNet.
Ariadne’s Thread: Using Text Prompts to Improve Segmentation of Infected Areas from Chest X-ray Images
paper_authors: Yi Zhong, Mengqiu Xu, Kongming Liang, Kaixin Chen, Ming Wu
for: The study aims to improve the accuracy of lung infection assessment, providing more accurate diagnosis and treatment planning.
methods: The study uses a language-driven image segmentation method that leverages text prompts to improve the segmentation results.
results: Experiments show that, compared with uni-modal methods, the approach improves the Dice score on the QaTa-COV19 dataset by at least 6.09%. The extended study further shows that multi-modal methods have significant advantages in terms of text information granularity and the amount of training data required.
Abstract
Segmentation of the infected areas of the lung is essential for quantifying the severity of lung diseases such as pulmonary infections. Existing medical image segmentation methods are almost exclusively uni-modal methods based on images. However, these image-only methods tend to produce inaccurate results unless trained with large amounts of annotated data. To overcome this challenge, we propose a language-driven segmentation method that uses text prompts to improve the segmentation result. Experiments on the QaTa-COV19 dataset indicate that our method improves the Dice score by at least 6.09% compared to uni-modal methods. Besides, our extended study reveals the flexibility of multi-modal methods in terms of the information granularity of text and demonstrates that multi-modal methods have a significant advantage over image-only methods in terms of the size of training data required.
Face Image Quality Enhancement Study for Face Recognition
for: This study investigates face recognition on low-quality face images and attempts to improve recognition accuracy on such images.
methods: The study enhances face images using state-of-the-art facial image enhancement approaches and develops a new recognition protocol to avoid experimental bias.
results: Experimental results show that recognition accuracy can be improved using the enhanced face images, while several challenging aspects of the problem remain.
Abstract
Unconstrained face recognition has been an active research area among computer vision and biometric researchers for many years. Still, the problem of face recognition in low-quality photos has not been well studied so far. In this paper, we explore face recognition performance on low-quality photos and try to improve the accuracy when dealing with low-quality face images. We assemble a large database of low-quality photos and examine the performance of face recognition algorithms for three different quality sets. Using state-of-the-art facial image enhancement approaches, we explore the face recognition performance on the enhanced face images. To perform this without experimental bias, we have developed a new protocol for recognition with low-quality face photos and validate the performance experimentally. Our protocol for face recognition with low-quality face images can be useful to other researchers. Moreover, the experimental results show some of the challenging aspects of this problem.
Edge-Aware Mirror Network for Camouflaged Object Detection
for: Improving the accuracy of camouflaged object detection.
methods: Proposes a novel Edge-aware Mirror Network (EAMNet) that models edge detection and camouflaged object segmentation as a cross refinement process, consisting of a segmentation-induced edge aggregation module, an edge-induced integrity aggregation module, and a guided-residual channel attention module.
results: Quantitative and qualitative experiments on three widely used COD datasets show that EAMNet outperforms existing cutting-edge baselines.
Abstract
Existing edge-aware camouflaged object detection (COD) methods normally output the edge prediction in the early stage. However, edges are important and fundamental factors in the following segmentation task. Due to the high visual similarity between camouflaged targets and their surroundings, the edge prior predicted at an early stage usually introduces erroneous foreground-background cues and contaminates the features used for segmentation. To tackle this problem, we propose a novel Edge-aware Mirror Network (EAMNet), which models edge detection and camouflaged object segmentation as a cross refinement process. More specifically, EAMNet has a two-branch architecture, where a segmentation-induced edge aggregation module and an edge-induced integrity aggregation module are designed to cross-guide the segmentation branch and edge detection branch. A guided-residual channel attention module which leverages the residual connection and gated convolution finally better extracts structural details from low-level features. Quantitative and qualitative experiment results show that EAMNet outperforms existing cutting-edge baselines on three widely used COD datasets. Codes are available at https://github.com/sdy1999/EAMNet.
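As a hedged sketch of what a guided-residual channel attention module of the kind described above could look like (gated convolution plus SE-style channel attention inside a residual connection), consider the following; all layer choices are illustrative assumptions, not EAMNet's exact design:

```python
import torch
import torch.nn as nn

class GuidedResidualChannelAttention(nn.Module):
    """Hedged sketch: gated convolution modulated by a guidance branch,
    followed by channel attention, wrapped in a residual connection."""
    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.feat_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.Sequential(                      # channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x, guide):
        # Gated convolution: the guidance features (e.g., edge features)
        # modulate the main feature branch.
        gated = self.feat_conv(x) * torch.sigmoid(self.gate_conv(guide))
        out = gated * self.attn(gated)                  # re-weight channels
        return x + out                                  # residual connection

if __name__ == "__main__":
    m = GuidedResidualChannelAttention(64)
    y = m(torch.randn(2, 64, 44, 44), torch.randn(2, 64, 44, 44))
    print(y.shape)
```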
VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation
for: This paper aims to improve the performance of egocentric action anticipation by proposing a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework.
methods: The proposed method introduces high-level semantic information to improve action anticipation performance, uses an effective visual-semantic fusion module, and employs a Transformer based encoder and GRU-based decoder to model long-term sequential and flexible iteration decoding.
results: The proposed method achieves new state-of-the-art performance on two large-scale first-person view datasets, outperforming previous approaches by a large margin.
Abstract
Egocentric action anticipation is a challenging task that aims to make advanced predictions of future actions from current and historical observations in the first-person view. Most existing methods focus on improving the model architecture and loss function based on the visual input and recurrent neural network to boost the anticipation performance. However, these methods, which merely consider visual information and rely on a single network architecture, gradually reach a performance plateau. In order to fully understand what has been observed and capture the dependencies between current observations and future actions well enough, we propose a novel visual-semantic fusion enhanced and Transformer GRU-based action anticipation framework in this paper. Firstly, high-level semantic information is introduced to improve the performance of action anticipation for the first time. We propose to use the semantic features generated based on the class labels or directly from the visual observations to augment the original visual features. Secondly, an effective visual-semantic fusion module is proposed to make up for the semantic gap and fully utilize the complementarity of different modalities. Thirdly, to take advantage of both the parallel and autoregressive models, we design a Transformer based encoder for long-term sequential modeling and a GRU-based decoder for flexible iteration decoding. Extensive experiments on two large-scale first-person view datasets, i.e., EPIC-Kitchens and EGTEA Gaze+, validate the effectiveness of our proposed method, which achieves new state-of-the-art performance, outperforming previous approaches by a large margin.
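A hedged PyTorch sketch of the Transformer-encoder / GRU-decoder pattern described above follows; the dimensions, the concatenation-based visual-semantic fusion, and the decoding loop are illustrative assumptions, not the paper's exact VS-TransGRU design:

```python
import torch
import torch.nn as nn

class TransGRUAnticipator(nn.Module):
    """Hedged sketch: Transformer encoder over observed (visual + semantic)
    features, GRU decoder iteratively anticipating future action logits."""
    def __init__(self, vis_dim=512, sem_dim=300, d_model=256, n_classes=100):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + sem_dim, d_model)   # toy visual-semantic fusion
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.GRUCell(d_model, d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, vis_seq, sem_seq, n_future=4):
        # vis_seq: (B, T, vis_dim), sem_seq: (B, T, sem_dim)
        x = self.fuse(torch.cat([vis_seq, sem_seq], dim=-1))
        memory = self.encoder(x)                 # (B, T, d_model) long-term modeling
        h = memory.mean(dim=1)                   # summary of the observed sequence
        inp = memory[:, -1]                      # last observed step as first input
        logits = []
        for _ in range(n_future):                # flexible iterative decoding
            h = self.decoder(inp, h)
            logits.append(self.head(h))
            inp = h
        return torch.stack(logits, dim=1)        # (B, n_future, n_classes)

if __name__ == "__main__":
    model = TransGRUAnticipator()
    out = model(torch.randn(2, 8, 512), torch.randn(2, 8, 300))
    print(out.shape)  # torch.Size([2, 4, 100])
```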
Adversarial Self-Attack Defense and Spatial-Temporal Relation Mining for Visible-Infrared Video Person Re-Identification
paper_authors: Huafeng Li, Le Xu, Yafei Zhang, Dapeng Tao, Zhengtao Yu
for: The paper addresses cross-modal pedestrian identity matching in visible-infrared video person re-identification by proposing a new method that integrates adversarial self-attack defense and spatial-temporal relation mining.
methods: The proposed method uses adversarial self-attack to defend against the perturbations caused by changes in views, posture, background, and modal discrepancy, and a spatial-temporal information-guided feature representation network to extract robust features from video sequences.
results: The proposed method exhibits compelling performance on large-scale cross-modality video datasets.
Abstract
In visible-infrared video person re-identification (re-ID), extracting features that are not affected by changes in complex scenes (such as modality, camera view, pedestrian pose, and background) and mining and utilizing motion information are the keys to solving cross-modal pedestrian identity matching. To this end, the paper proposes a new visible-infrared video person re-ID method from a novel perspective, i.e., adversarial self-attack defense and spatial-temporal relation mining. In this work, the changes of views, posture, background and modal discrepancy are considered as the main factors that cause the perturbations of person identity features. Such interference information contained in the training samples is used as an adversarial perturbation. It performs adversarial attacks on the re-ID model during the training to make the model more robust to these unfavorable factors. The attack from the adversarial perturbation is introduced by activating the interference information contained in the input samples without generating adversarial samples, and it can thus be called adversarial self-attack. This design allows adversarial attack and defense to be integrated into one framework. This paper further proposes a spatial-temporal information-guided feature representation network to use the information in video sequences. The network not only extracts the information contained in the video-frame sequences but also uses the relation of the local information in space to guide the network to extract more robust features. The proposed method exhibits compelling performance on large-scale cross-modality video datasets. The source code of the proposed method will be released at https://github.com/lhf12278/xxx.
StyleGAN3: Generative Networks for Improving the Equivariance of Translation and Rotation
for: The purpose of this study is to improve the equivariance of StyleGAN.
methods: The study used StyleGAN2 and two modified versions of StyleGAN3, and evaluated them using the FFHQ dataset.
results: The study found that the StyleGAN3 version is a better generative network, with improved equivariance. These findings are beneficial for the creation of animations and videos.
Abstract
StyleGAN can use style to affect facial posture and identity features, and noise to affect hair, wrinkles, skin color and other details. Among these, the outcomes of the picture processing vary slightly between different versions of StyleGAN. As a result, the comparison of performance differences between StyleGAN2 and the two modified versions of StyleGAN3 is the main focus of this study. We used the FFHQ dataset, and FID, EQ-T, and EQ-R were used to assess the models. In the end, we found that StyleGAN3 is the better generative network for improving equivariance. Our findings have a positive impact on the creation of animation and videos.
results: The study reports useful findings, including the detected fault indications and the forecasting results of the ARIMA method.
Abstract
We implemented a simple method for early detection in this research. The implemented methods plot the given .mat files and analyze scalogram images generated by performing the Continuous Wavelet Transform (CWT) on the samples. Computing the mean, standard deviation (STD), and peak-to-peak (P2P) values of each signal also helped detect faulty signs. We implemented the autoregressive integrated moving average (ARIMA) method to track the progression.
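A hedged sketch of this pipeline in Python, assuming the `pywt` and `statsmodels` packages, is shown below; loading the .mat files (e.g., with `scipy.io.loadmat`) is omitted, and the wavelet, scales and ARIMA order are illustrative choices rather than the ones used in the study:

```python
import numpy as np
import pywt
from statsmodels.tsa.arima.model import ARIMA

def signal_features(sig: np.ndarray) -> dict:
    """Mean, standard deviation and peak-to-peak value of one signal."""
    return {"mean": sig.mean(), "std": sig.std(), "p2p": sig.max() - sig.min()}

def cwt_scalogram(sig: np.ndarray, scales=np.arange(1, 64), wavelet="morl"):
    """Continuous Wavelet Transform scalogram (|coefficients| over scale x time)."""
    coeffs, _freqs = pywt.cwt(sig, scales, wavelet)
    return np.abs(coeffs)

def track_progression(feature_series, order=(1, 1, 1)):
    """Fit ARIMA to a per-window feature (e.g., P2P over time) and forecast its trend."""
    model = ARIMA(np.asarray(feature_series, dtype=float), order=order).fit()
    return model.forecast(steps=5)

if __name__ == "__main__":
    t = np.linspace(0, 1, 1000)
    sig = np.sin(2 * np.pi * 50 * t) + 0.1 * np.random.randn(t.size)
    print(signal_features(sig))
    print(cwt_scalogram(sig).shape)          # (n_scales, n_samples)
    print(track_progression(np.cumsum(np.random.randn(50))))
```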
for: This study explores how large pre-trained models can be used to generate 3D shapes from sketches, which has remained an open challenge due to limited sketch-shape paired datasets and the varying levels of abstraction in sketches.
methods: A simple approach is used: during training, a 3D generative model is conditioned on features from a frozen, large pre-trained vision model, so that 3D shapes can be generated from sketches at inference time.
results: Experiments show that features from large pre-trained vision models enable generating 3D shapes from sketches at inference time, regardless of the level of abstraction in the sketches. These features carry semantic signals that transfer across domains, enabling the generation of multiple 3D shapes per input sketch.
Abstract
Significant progress has recently been made in creative applications of large pre-trained models for downstream tasks in 3D vision, such as text-to-shape generation. This motivates our investigation of how these pre-trained models can be used effectively to generate 3D shapes from sketches, which has largely remained an open challenge due to the limited sketch-shape paired datasets and the varying level of abstraction in the sketches. We discover that conditioning a 3D generative model on the features (obtained from a frozen large pre-trained vision model) of synthetic renderings during training enables us to effectively generate 3D shapes from sketches at inference time. This suggests that the large pre-trained vision model features carry semantic signals that are resilient to domain shifts, i.e., allowing us to use only RGB renderings, but generalizing to sketches at inference time. We conduct a comprehensive set of experiments investigating different design factors and demonstrate the effectiveness of our straightforward approach for generation of multiple 3D shapes per each input sketch regardless of their level of abstraction without requiring any paired datasets during training.
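As a hedged illustration of the conditioning pattern described above (not the paper's actual model), the sketch below extracts features with a frozen pre-trained backbone and feeds them to a toy conditional 3D decoder; the backbone choice, feature dimension and occupancy-grid decoder are all illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen pre-trained vision backbone used only as a feature extractor.
# ResNet-18 is an arbitrary stand-in; weights=None keeps the sketch offline
# (in practice one would load pretrained weights, e.g. ResNet18_Weights.DEFAULT).
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()               # keep the 512-d pooled feature
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

class ConditionedShapeDecoder(nn.Module):
    """Toy conditional 3D generator: maps the frozen image feature to a coarse
    occupancy grid. Purely illustrative of the conditioning pattern."""
    def __init__(self, feat_dim=512, grid=16):
        super().__init__()
        self.grid = grid
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, grid ** 3))

    def forward(self, image_feat):
        occ = self.mlp(image_feat)
        return occ.view(-1, self.grid, self.grid, self.grid)

if __name__ == "__main__":
    # At training time this would be a synthetic RGB rendering of the shape;
    # at inference time, a sketch. Preprocessing/normalization is omitted here.
    img = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        feat = backbone(img)              # (1, 512) frozen conditioning feature
    shape = ConditionedShapeDecoder()(feat)
    print(shape.shape)                    # torch.Size([1, 16, 16, 16])
```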
Novel Categories Discovery from probability matrix perspective
methods: The paper investigates NCD from the perspective of the novel-data probability matrix, leveraging the connection to the provided Multinoulli (categorical) distribution over novel classes. Semantic-based clustering of novel data is achieved implicitly by learning the class distribution, using new constraints on first- and second-order statistics of the probability matrix together with instance-level information constraints.
results: The simple approach successfully achieves semantic-based clustering of novel data, given semantic similarity between labeled and unlabeled classes. Its discriminative capacity is demonstrated on image and video modalities, with extensive ablation studies on data, networks, and framework components. The method maintains ~94%, ~93%, and ~85% classification accuracy on labeled data while achieving ~90%, ~84%, and ~72% clustering accuracy for novel categories on the Cifar10, UCF101, and MPSC-ARL datasets, matching state-of-the-art approaches.
Abstract
Novel Categories Discovery (NCD) tackles the open-world problem of classifying known and clustering novel categories based on the class semantics using partial class space annotated data. Unlike traditional pseudo-label and retraining, we investigate NCD from the novel data probability matrix perspective. We leverage the connection between NCD novel data sampling with provided novel class Multinoulli (categorical) distribution and hypothesize to implicitly achieve semantic-based novel data clustering by learning their class distribution. We propose novel constraints on first-order (mean) and second-order (covariance) statistics of probability matrix features while applying instance-wise information constraints. In particular, we align the neuron distribution (activation patterns) under a large batch of Monte-Carlo novel data sampling by matching their empirical features mean and covariance with the provided Multinoulli-distribution. Simultaneously, we minimize entropy and enforce prediction consistency for each instance. Our simple approach successfully realizes semantic-based novel data clustering provided the semantic similarity between label-unlabeled classes. We demonstrate the discriminative capacity of our approaches in image and video modalities. Moreover, we perform extensive ablation studies regarding data, networks, and our framework components to provide better insights. Our approach maintains ~94%, ~93%, and ~85%, classification accuracy in labeled data while achieving ~90%, ~84%, and ~72% clustering accuracy for novel categories for Cifar10, UCF101, and MPSC-ARL datasets that matches state-of-the-art approaches without any external clustering.
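A hedged PyTorch sketch of the statistics-matching idea described above (aligning first- and second-order moments of the predicted probability matrix with the provided Multinoulli prior, plus instance-level entropy minimization) is given below; the loss weights and exact formulation are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def ncd_statistics_loss(logits: torch.Tensor, class_prior: torch.Tensor):
    """Hedged sketch: match empirical mean/covariance of the probability
    matrix to the Multinoulli prior and minimize per-instance entropy."""
    probs = F.softmax(logits, dim=1)                 # (B, C) probability matrix
    # First-order: batch mean of predictions should match the class prior.
    mean_loss = F.mse_loss(probs.mean(dim=0), class_prior)
    # Second-order: empirical covariance vs. Multinoulli covariance,
    # Cov = diag(p) - p p^T for a categorical distribution with parameter p.
    centered = probs - probs.mean(dim=0, keepdim=True)
    emp_cov = centered.t() @ centered / probs.shape[0]
    prior_cov = torch.diag(class_prior) - torch.outer(class_prior, class_prior)
    cov_loss = F.mse_loss(emp_cov, prior_cov)
    # Instance-level: encourage confident (low-entropy) predictions.
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    return mean_loss + cov_loss + 0.1 * entropy

if __name__ == "__main__":
    logits = torch.randn(128, 10, requires_grad=True)
    prior = torch.full((10,), 0.1)                   # e.g., uniform Multinoulli prior
    loss = ncd_statistics_loss(logits, prior)
    loss.backward()
    print(float(loss))
```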
TBSS++: A novel computational method for Tract-Based Spatial Statistics
results: Compared with TBSS, the proposed method shows significantly higher reproducibility and robustness to data perturbations.
Abstract
Diffusion-weighted magnetic resonance imaging (dMRI) is widely used to assess the brain white matter. One of the most common computations in dMRI involves cross-subject tract-specific analysis, whereby dMRI-derived biomarkers are compared between cohorts of subjects. The accuracy and reliability of these studies hinges on the ability to compare precisely the same white matter tracts across subjects. This is an intricate and error-prone computation. Existing computational methods such as Tract-Based Spatial Statistics (TBSS) suffer from a host of shortcomings and limitations that can seriously undermine the validity of the results. We present a new computational framework that overcomes the limitations of existing methods via (i) accurate segmentation of the tracts, and (ii) precise registration of data from different subjects/scans. The registration is based on fiber orientation distributions. To further improve the alignment of cross-subject data, we create detailed atlases of white matter tracts. These atlases serve as an unbiased reference space where the data from all subjects is registered for comparison. Extensive evaluations show that, compared with TBSS, our proposed framework offers significantly higher reproducibility and robustness to data perturbations. Our method promises a drastic improvement in accuracy and reproducibility of cross-subject dMRI studies that are routinely used in neuroscience and medical research.
Blocks2World: Controlling Realistic Scenes with Editable Primitives
for: This paper addresses 3D scene rendering and editing.
methods: Convex decomposition of images and conditioned synthesis.
results: A highly customizable scene rendering process with remarkable control over the synthesis of novel and edited scenes.
Abstract
We present Blocks2World, a novel method for 3D scene rendering and editing that leverages a two-step process: convex decomposition of images and conditioned synthesis. Our technique begins by extracting 3D parallelepipeds from various objects in a given scene using convex decomposition, thus obtaining a primitive representation of the scene. These primitives are then utilized to generate paired data through simple ray-traced depth maps. The next stage involves training a conditioned model that learns to generate images from the 2D-rendered convex primitives. This step establishes a direct mapping between the 3D model and its 2D representation, effectively learning the transition from a 3D model to an image. Once the model is fully trained, it offers remarkable control over the synthesis of novel and edited scenes. This is achieved by manipulating the primitives at test time, including translating or adding them, thereby enabling a highly customizable scene rendering process. Our method provides a fresh perspective on 3D scene rendering and editing, offering control and flexibility. It opens up new avenues for research and applications in the field, including authoring and data augmentation.
Invariant Scattering Transform for Medical Imaging
results: The study finds that using the scattering transform can improve the accuracy and efficiency of medical image classification.
Abstract
The invariant scattering transform introduces a new area of research that merges signal processing with deep learning for computer vision. Nowadays, deep learning algorithms are able to solve a variety of problems in the medical sector. Medical images are used to detect diseases such as brain cancer or tumors, Alzheimer's disease, breast cancer, Parkinson's disease and many others. During the pandemic in 2020, machine learning and deep learning played a critical role in detecting COVID-19, including mutation analysis, prediction, diagnosis and decision making. Medical images like X-ray, MRI (magnetic resonance imaging) and CT scans are used for detecting diseases. Another deep learning method for medical imaging is the scattering transform. It builds useful signal representations for image classification. It is a wavelet technique that is impactful for medical image classification problems. This research article discusses the scattering transform as an efficient system for medical image analysis, in which the scattered signal information is fed into a deep convolutional network. A step-by-step case study is presented in this work.
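A hedged sketch of using a fixed 2D scattering transform as a feature extractor in front of a small classifier, assuming the `kymatio` package, is shown below; the patch size, number of scales and the classifier head are illustrative choices, not those of the article's case study:

```python
import torch
import torch.nn as nn
from kymatio.torch import Scattering2D

# Fixed (non-learned) scattering transform as a feature extractor,
# followed by a small classifier head.
scattering = Scattering2D(J=2, shape=(32, 32))   # output resolution 32 / 2**J = 8

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(81 * 8 * 8, 10),   # 81 scattering channels for J=2, L=8 on 1 input channel
)

if __name__ == "__main__":
    x = torch.randn(4, 1, 32, 32)        # e.g., a batch of grayscale medical patches
    Sx = scattering(x)                   # scattering coefficients
    Sx = Sx.view(Sx.size(0), -1, 8, 8)   # flatten the (channel, path) dimensions
    logits = classifier(Sx)
    print(Sx.shape, logits.shape)
```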
Thoracic Cartilage Ultrasound-CT Registration using Dense Skeleton Graph
results: Evaluated on cartilage point clouds from five distinct patients, the proposed graph-based registration can effectively map trajectories from CT to the current setup, with a non-rigid registration Hausdorff distance (Mean$\pm$SD) of 9.48$\pm$0.27 mm and a path transferring error (Euclidean distance) of 2.21$\pm$1.11 mm.
Abstract
Autonomous ultrasound (US) imaging has gained increased interest recently, and it has been seen as a potential solution to overcome the limitations of free-hand US examinations, such as inter-operator variations. However, it is still challenging to accurately map planned paths from a generic atlas to individual patients, particularly for thoracic applications with high acoustic-impedance bone structures under the skin. To address this challenge, a graph-based non-rigid registration is proposed to enable transferring planned paths from the atlas to the current setup by explicitly considering subcutaneous bone surface features instead of the skin surface. To this end, the sternum and cartilage branches are segmented using a template matching to assist coarse alignment of US and CT point clouds. Afterward, a directed graph is generated based on the CT template. Then, the self-organizing map using geographical distance is successively performed twice to extract the optimal graph representations for CT and US point clouds, individually. To evaluate the proposed approach, five cartilage point clouds from distinct patients are employed. The results demonstrate that the proposed graph-based registration can effectively map trajectories from CT to the current setup for displaying US views through limited intercostal space. The non-rigid registration results in terms of Hausdorff distance (Mean$\pm$SD) is 9.48$\pm$0.27 mm and the path transferring error in terms of Euclidean distance is 2.21$\pm$1.11 mm.
Exploring the Lottery Ticket Hypothesis with Explainability Methods: Insights into Sparse Network Performance
results: The study finds that as more weights are pruned, network performance gradually degrades. The concepts and pixels discovered from the pruned networks are inconsistent with those of the original network, which may explain the performance drop.
Abstract
Discovering a high-performing sparse network within a massive neural network is advantageous for deploying them on devices with limited storage, such as mobile phones. Additionally, model explainability is essential to fostering trust in AI. The Lottery Ticket Hypothesis (LTH) finds a network within a deep network with comparable or superior performance to the original model. However, limited study has been conducted on the success or failure of LTH in terms of explainability. In this work, we examine why the performance of the pruned networks gradually increases or decreases. Using Grad-CAM and Post-hoc concept bottleneck models (PCBMs), respectively, we investigate the explainability of pruned networks in terms of pixels and high-level concepts. We perform extensive experiments across vision and medical imaging datasets. As more weights are pruned, the performance of the network degrades. The discovered concepts and pixels from the pruned networks are inconsistent with the original network -- a possible reason for the drop in performance.
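A hedged sketch of the iterative magnitude-pruning loop underlying such Lottery Ticket experiments, using `torch.nn.utils.prune` on a toy model, is shown below; retraining/weight rewinding and the subsequent Grad-CAM or concept-bottleneck analysis are only indicated in comments:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for the network being studied.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 32 * 32, 10))

# Collect the prunable weight parameters for global magnitude pruning.
to_prune = [(m, "weight") for m in model.modules()
            if isinstance(m, (nn.Conv2d, nn.Linear))]

for round_idx in range(3):
    # Prune the 20% smallest-magnitude remaining weights (globally) each round;
    # in LTH experiments the surviving weights would then be rewound and retrained.
    prune.global_unstructured(to_prune, pruning_method=prune.L1Unstructured, amount=0.2)
    total = sum(m.weight_mask.numel() for m, _ in to_prune)
    alive = sum(int(m.weight_mask.sum()) for m, _ in to_prune)
    print(f"round {round_idx}: sparsity = {1 - alive / total:.2%}")
    # At this point the pruned network would be evaluated and explained
    # (e.g., Grad-CAM pixel attributions or a post-hoc concept bottleneck model).
```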
Synthesizing Forestry Images Conditioned on Plant Phenotype Using a Generative Adversarial Network
results: The method can generate synthetic images that accurately satisfy the target phenotypic attribute (canopy greenness) and can also be used to predict another phenotypic attribute, the redness of plants. The generalizability and scalability of the approach are also demonstrated.
Abstract
Plant phenology and phenotype prediction using remote sensing data is increasingly gaining the attention of the plant science community to improve agricultural productivity. In this work, we generate synthetic forestry images that satisfy certain phenotypic attributes, viz. canopy greenness. The greenness index of plants describes a particular vegetation type in a mixed forest. Our objective is to develop a Generative Adversarial Network (GAN) to synthesize forestry images conditioned on this continuous attribute, i.e., greenness of vegetation, over a specific region of interest. The training data is based on the automated digital camera imagery provided by the National Ecological Observatory Network (NEON) and processed by the PhenoCam Network. The synthetic images generated by our method are also used to predict another phenotypic attribute, viz., redness of plants. The Structural SIMilarity (SSIM) index is utilized to assess the quality of the synthetic images. The greenness and redness indices of the generated synthetic images are compared against that of the original images using Root Mean Squared Error (RMSE) in order to evaluate their accuracy and integrity. Moreover, the generalizability and scalability of our proposed GAN model is determined by effectively transforming it to generate synthetic images for other forest sites and vegetation types.
Context-aware Pedestrian Trajectory Prediction with Multimodal Transformer
results: Compared with the state of the art, the method consistently achieves the lowest error for the 0.5, 1.0 and 1.5 second time horizons and is significantly faster on both the PIE and JAAD datasets. The impact of the multimodal configuration is demonstrated through ablation experiments.
Abstract
We propose a novel solution for predicting future trajectories of pedestrians. Our method uses a multimodal encoder-decoder transformer architecture, which takes as input both pedestrian locations and ego-vehicle speeds. Notably, our decoder predicts the entire future trajectory in a single-pass and does not perform one-step-ahead prediction, which makes the method effective for embedded edge deployment. We perform detailed experiments and evaluate our method on two popular datasets, PIE and JAAD. Quantitative results demonstrate the superiority of our proposed model over the current state-of-the-art, which consistently achieves the lowest error for 3 time horizons of 0.5, 1.0 and 1.5 seconds. Moreover, the proposed method is significantly faster than the state-of-the-art for the two datasets of PIE and JAAD. Lastly, ablation experiments demonstrate the impact of the key multimodal configuration of our method.
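A hedged, encoder-only simplification illustrating the single-pass idea follows: past bounding boxes and ego-vehicle speeds are embedded, encoded with a Transformer, and the entire future trajectory is regressed at once; dimensions and the fusion scheme are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Hedged sketch: multimodal encoder over past pedestrian boxes and
    ego-vehicle speed, with a head predicting the whole future trajectory
    in a single pass (no one-step-ahead autoregression)."""
    def __init__(self, obs_len=15, pred_len=45, d_model=128):
        super().__init__()
        self.pos_embed = nn.Linear(4, d_model)        # (x1, y1, x2, y2) per frame
        self.speed_embed = nn.Linear(1, d_model)      # ego-vehicle speed per frame
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.head = nn.Linear(d_model, pred_len * 4)  # all future boxes at once
        self.pred_len = pred_len

    def forward(self, boxes, speed):
        # boxes: (B, obs_len, 4), speed: (B, obs_len, 1)
        tokens = self.pos_embed(boxes) + self.speed_embed(speed)  # simple fusion
        enc = self.encoder(tokens)
        out = self.head(enc[:, -1])                    # single-pass prediction
        return out.view(-1, self.pred_len, 4)

if __name__ == "__main__":
    model = TrajectoryPredictor()
    future = model(torch.randn(2, 15, 4), torch.randn(2, 15, 1))
    print(future.shape)  # torch.Size([2, 45, 4])
```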
Unsupervised 3D out-of-distribution detection with latent diffusion models
paper_authors: Mark S. Graham, Walter Hugo Lopez Pinaya, Paul Wright, Petru-Daniel Tudosiu, Yee H. Mah, James T. Teo, H. Rolf Jäger, David Werring, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
for: Detecting out-of-distribution (OOD) data in 3D medical data using Latent Diffusion Models (LDMs).
methods: The paper proposes using LDMs to scale denoising diffusion probabilistic models (DDPMs) to high-resolution 3D medical data, and compares the proposed approach to a recently proposed, 3D-enabled approach using Latent Transformer Models (LTMs).
results: The proposed LDM-based approach achieves statistically significantly better performance than the LTM-based approach, with less sensitivity to the underlying latent representation, more favourable memory scaling, and better spatial anomaly maps.
Abstract
Methods for out-of-distribution (OOD) detection that scale to 3D data are crucial components of any real-world clinical deep learning system. Classic denoising diffusion probabilistic models (DDPMs) have been recently proposed as a robust way to perform reconstruction-based OOD detection on 2D datasets, but do not trivially scale to 3D data. In this work, we propose to use Latent Diffusion Models (LDMs), which enable the scaling of DDPMs to high-resolution 3D medical data. We validate the proposed approach on near- and far-OOD datasets and compare it to a recently proposed, 3D-enabled approach using Latent Transformer Models (LTMs). Not only does the proposed LDM-based approach achieve statistically significant better performance, it also shows less sensitivity to the underlying latent representation, more favourable memory scaling, and produces better spatial anomaly maps. Code is available at https://github.com/marksgraham/ddpm-ood
results: The approach outperforms state-of-the-art alternatives, producing strong generation results on multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale real video dataset of static objects.
Abstract
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. Our approach is flexible enough to use either existing camera supervision or no camera information at all -- instead efficiently learning it during training. Our evaluations demonstrate that our generation results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
Training Ensembles with Inliers and Outliers for Semi-supervised Active Learning
results: Experiments show that combining the three components improves the accuracy of pseudo-labeling and the quality of data acquisition. In particular, joint training handles outlier examples correctly without requiring explicit outlier detection. Despite its simplicity, the proposed approach outperforms all other methods.
Abstract
Deep active learning in the presence of outlier examples poses a realistic yet challenging scenario. Acquiring unlabeled data for annotation requires a delicate balance between avoiding outliers to conserve the annotation budget and prioritizing useful inlier examples for effective training. In this work, we present an approach that leverages three highly synergistic components, which are identified as key ingredients: joint classifier training with inliers and outliers, semi-supervised learning through pseudo-labeling, and model ensembling. Our work demonstrates that ensembling significantly enhances the accuracy of pseudo-labeling and improves the quality of data acquisition. By enabling semi-supervision through the joint training process, where outliers are properly handled, we observe a substantial boost in classifier accuracy through the use of all available unlabeled examples. Notably, we reveal that the integration of joint training renders explicit outlier detection unnecessary; a conventional component for acquisition in prior work. The three key components align seamlessly with numerous existing approaches. Through empirical evaluations, we showcase that their combined use leads to a performance increase. Remarkably, despite its simplicity, our proposed approach outperforms all other methods in terms of performance. Code: https://github.com/vladan-stojnic/active-outliers
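A hedged sketch of the ensembling step for pseudo-labeling (averaging member predictions over the unlabeled pool and keeping only confident pseudo-labels) is given below; the confidence threshold and member count are illustrative choices, not the paper's:

```python
import torch

def ensemble_pseudo_labels(prob_list, threshold=0.9):
    """Average the softmax predictions of several ensemble members over the
    unlabeled pool and keep only high-confidence pseudo-labels."""
    mean_probs = torch.stack(prob_list).mean(dim=0)      # (N, C) averaged predictions
    conf, labels = mean_probs.max(dim=1)
    keep = conf >= threshold                              # confident examples only
    return labels[keep], keep

if __name__ == "__main__":
    # Toy predictions from 3 ensemble members on 1000 unlabeled examples, 10 classes.
    members = [torch.softmax(torch.randn(1000, 10), dim=1) for _ in range(3)]
    pseudo, mask = ensemble_pseudo_labels(members, threshold=0.3)
    print(pseudo.shape, int(mask.sum()))
```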
Equivariant Single View Pose Prediction Via Induced and Restricted Representations
results: A new algorithm is proposed that learns a three-dimensional representation of the world from two-dimensional images and achieves state-of-the-art accuracy on the PASCAL3D+ and SYMSOL pose estimation tasks.
Abstract
Learning about the three-dimensional world from two-dimensional images is a fundamental problem in computer vision. An ideal neural network architecture for such tasks would leverage the fact that objects can be rotated and translated in three dimensions to make predictions about novel images. However, imposing SO(3)-equivariance on two-dimensional inputs is difficult because the group of three-dimensional rotations does not have a natural action on the two-dimensional plane. Specifically, it is possible that an element of SO(3) will rotate an image out of plane. We show that an algorithm that learns a three-dimensional representation of the world from two dimensional images must satisfy certain geometric consistency properties which we formulate as SO(2)-equivariance constraints. We use the induced and restricted representations of SO(2) on SO(3) to construct and classify architectures which satisfy these geometric consistency constraints. We prove that any architecture which respects said consistency constraints can be realized as an instance of our construction. We show that three previously proposed neural architectures for 3D pose prediction are special cases of our construction. We propose a new algorithm that is a learnable generalization of previously considered methods. We test our architecture on three pose predictions task and achieve SOTA results on both the PASCAL3D+ and SYMSOL pose estimation tasks.
Motion Magnification in Robotic Sonography: Enabling Pulsation-Aware Artery Segmentation
results: Experimental results show that PAS-NN achieves results comparable to the state of the art on the carotid artery and effectively improves segmentation performance for small vessels (radial artery).
Abstract
Ultrasound (US) imaging is widely used for diagnosing and monitoring arterial diseases, mainly due to the advantages of being non-invasive, radiation-free, and real-time. In order to provide additional information to assist clinicians in diagnosis, the tubular structures are often segmented from US images. To improve the artery segmentation accuracy and stability during scans, this work presents a novel pulsation-assisted segmentation neural network (PAS-NN) by explicitly taking advantage of the cardiac-induced motions. Motion magnification techniques are employed to amplify the subtle motion within the frequency band of interest to extract the pulsation signals from sequential US images. The extracted real-time pulsation information can help to locate the arteries on cross-section US images; therefore, we explicitly integrated the pulsation into the proposed PAS-NN as attention guidance. Notably, a robotic arm is necessary to provide stable movement during US imaging since magnifying the target motions from the US images captured along a scan path is not manually feasible due to the hand tremor. To validate the proposed robotic US system for imaging arteries, experiments are carried out on volunteers' carotid and radial arteries. The results demonstrated that the PAS-NN could achieve comparable results as state-of-the-art on carotid and can effectively improve the segmentation performance for small vessels (radial artery).
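A hedged sketch of the motion-magnification idea (Eulerian-style temporal band-pass amplification of per-pixel intensities to expose cardiac pulsation), assuming SciPy, is shown below; the frequency band, gain, and the use of a pulsation-energy map as attention guidance are illustrative assumptions, not the exact PAS-NN pipeline:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_pulsation(frames: np.ndarray, fps: float, low=0.8, high=3.0, alpha=10.0):
    """Temporally band-pass filter each pixel's intensity time series around
    typical heart-rate frequencies (0.8-3 Hz ~ 48-180 bpm) and amplify it."""
    b, a = butter(2, [low / (fps / 2), high / (fps / 2)], btype="band")
    pulsation = filtfilt(b, a, frames.astype(np.float64), axis=0)  # (T, H, W)
    magnified = frames + alpha * pulsation
    # A per-pixel pulsation-energy map could then serve as attention guidance
    # for locating arteries on the cross-sectional US image.
    energy_map = (pulsation ** 2).mean(axis=0)
    return magnified, energy_map

if __name__ == "__main__":
    fake_us = np.random.rand(120, 64, 64)        # 120 frames of a toy US sequence
    mag, energy = extract_pulsation(fake_us, fps=30)
    print(mag.shape, energy.shape)
```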