paper_authors: Chiao-Yi Wang, Faranguisse Kakhi Sadrieh, Yi-Ting Shen, Shih-En Chen, Sarah Kim, Victoria Chen, Achyut Raghavendra, Dongyi Wang, Osamah Saeedi, Yang Tao
results: VDD-Reg shows strong performance on both the CF-FA and MEMO datasets, and requires as few as three annotated vessel segmentation masks to maintain its accuracy.
Abstract
The measurement of retinal blood flow (RBF) in capillaries can provide a powerful biomarker for the early diagnosis and treatment of ocular diseases. However, no single modality can determine capillary flowrates with high precision. Combining erythrocyte-mediated angiography (EMA) with optical coherence tomography angiography (OCTA) has the potential to achieve this goal, as EMA can measure the absolute 2D RBF of retinal microvasculature and OCTA can provide the 3D structural images of capillaries. However, multimodal retinal image registration between these two modalities remains largely unexplored. To fill this gap, we establish MEMO, the first public multimodal EMA and OCTA retinal image dataset. A unique challenge in multimodal retinal image registration between these modalities is the relatively large difference in vessel density (VD). To address this challenge, we propose a segmentation-based deep-learning framework (VDD-Reg) and a new evaluation metric (MSD), which provide robust results despite differences in vessel density. VDD-Reg consists of a vessel segmentation module and a registration module. To train the vessel segmentation module, we further designed a two-stage semi-supervised learning framework (LVD-Seg) combining supervised and unsupervised losses. We demonstrate that VDD-Reg outperforms baseline methods quantitatively and qualitatively for cases of both small VD differences (using the CF-FA dataset) and large VD differences (using our MEMO dataset). Moreover, VDD-Reg requires as few as three annotated vessel segmentation masks to maintain its accuracy, demonstrating its feasibility.
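To make the two-stage training idea behind LVD-Seg concrete, here is a minimal PyTorch sketch: a short supervised warm-up on the few annotated vessel masks, followed by unsupervised training on unlabeled images. The consistency objective in stage two and all names here are illustrative assumptions, not the paper's actual losses.

```python
# Hypothetical sketch of a two-stage semi-supervised training loop in the spirit of
# LVD-Seg: supervised warm-up on a handful of annotated vessel masks, then an
# unsupervised stage on unlabeled images. The consistency loss is an assumed choice.
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss on sigmoid probabilities, pred/target shaped (B, 1, H, W).
    p = torch.sigmoid(pred)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def train_lvd_seg(model, labeled_loader, unlabeled_loader, optimizer, epochs=(20, 80)):
    # Stage 1: supervised warm-up on the few annotated masks.
    for _ in range(epochs[0]):
        for img, mask in labeled_loader:
            optimizer.zero_grad()
            loss = dice_loss(model(img), mask)
            loss.backward()
            optimizer.step()
    # Stage 2: unsupervised refinement, here a consistency loss between two
    # augmented views of the same unlabeled image (an illustrative choice).
    for _ in range(epochs[1]):
        for img_a, img_b in unlabeled_loader:  # two augmentations of one image
            optimizer.zero_grad()
            loss = F.mse_loss(torch.sigmoid(model(img_a)), torch.sigmoid(model(img_b)))
            loss.backward()
            optimizer.step()
```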
Dynamic Scene Graph Representation for Surgical Video
paper_authors: Felix Holm, Ghazal Ghazaei, Tobias Czempiel, Ege Özsoy, Stefan Saur, Nassir Navab
for: This paper aims to improve the automated understanding of surgical workflows in videos captured from microscopic or endoscopic imaging devices.
methods: The paper proposes using scene graphs as a more holistic and semantically meaningful way to represent surgical videos, and leverages graph convolutional networks (GCNs) to tackle surgical downstream tasks such as workflow recognition.
results: The paper demonstrates the benefits of surgical scene graphs in terms of explainability and robustness of model decisions, and shows competitive performance in surgical workflow recognition tasks.
Abstract
Surgical videos captured from microscopic or endoscopic imaging devices are rich but complex sources of information, depicting different tools and anatomical structures utilized during an extended amount of time. Despite containing crucial workflow information and being commonly recorded in many procedures, usage of surgical videos for automated surgical workflow understanding is still limited. In this work, we exploit scene graphs as a more holistic, semantically meaningful and human-readable way to represent surgical videos while encoding all anatomical structures, tools, and their interactions. To properly evaluate the impact of our solutions, we create a scene graph dataset from semantic segmentations from the CaDIS and CATARACTS datasets. We demonstrate that scene graphs can be leveraged through the use of graph convolutional networks (GCNs) to tackle surgical downstream tasks such as surgical workflow recognition with competitive performance. Moreover, we demonstrate the benefits of surgical scene graphs regarding the explainability and robustness of model decisions, which are crucial in the clinical setting.
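To make the scene-graph idea concrete, the sketch below classifies a small graph of tools and anatomical structures with a two-layer graph convolutional network. It is a generic GCN illustration, not the paper's architecture; node features, adjacency construction, and the phase-classification head are all assumptions.

```python
# A minimal sketch (not the paper's implementation) of classifying a surgical scene
# graph with a GCN: nodes are tools/anatomy with feature vectors, edges encode
# interactions, and mean pooling feeds a workflow-phase classifier.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # adj: (N, N) adjacency with self-loops; row-normalize, then propagate.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin((adj / deg) @ x))

class SceneGraphPhaseClassifier(nn.Module):
    def __init__(self, node_dim=32, hidden=64, num_phases=10):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(node_dim, hidden)
        self.gcn2 = SimpleGCNLayer(hidden, hidden)
        self.head = nn.Linear(hidden, num_phases)

    def forward(self, x, adj):
        h = self.gcn2(self.gcn1(x, adj), adj)
        return self.head(h.mean(dim=0))  # graph-level prediction

# Usage on a toy graph with 5 nodes:
x = torch.randn(5, 32)
adj = torch.eye(5)
adj[0, 1] = adj[1, 0] = 1.0  # e.g., an "instrument touches tissue" interaction
logits = SceneGraphPhaseClassifier()(x, adj)
```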
methods: The paper devises new receptive field-based architectural constraints and a principled pixel space mapping to achieve meaningful localization, and proposes a simplified classification head to improve interpretability.
results: The authors' method, PixPNet, quantifiably improves interpretability without sacrificing accuracy, and is the only ProtoPartNN that truly learns and localizes to prototypical object parts.
Abstract
Prototypical part neural networks (ProtoPartNNs), namely PROTOPNET and its derivatives, are an intrinsically interpretable approach to machine learning. Their prototype learning scheme enables intuitive explanations of the form, this (prototype) looks like that (testing image patch). But, does this actually look like that? In this work, we delve into why object part localization and associated heat maps in past work are misleading. Rather than localizing to object parts, existing ProtoPartNNs localize to the entire image, contrary to generated explanatory visualizations. We argue that detraction from these underlying issues is due to the alluring nature of visualizations and an over-reliance on intuition. To alleviate these issues, we devise new receptive field-based architectural constraints for meaningful localization and a principled pixel space mapping for ProtoPartNNs. To improve interpretability, we propose additional architectural improvements, including a simplified classification head. We also make additional corrections to PROTOPNET and its derivatives, such as the use of a validation set, rather than a test set, to evaluate generalization during training. Our approach, PIXPNET (Pixel-grounded Prototypical part Network), is the only ProtoPartNN that truly learns and localizes to prototypical object parts. We demonstrate that PIXPNET achieves quantifiably improved interpretability without sacrificing accuracy.
UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities
paper_authors: Shiming Wang, Holger Caesar, Liangliang Nan, Julian F. P. Kooij
for: This paper aims to improve the robustness of multi-sensor object detection models in automated driving, particularly when sensor inputs are missing (missing modalities).
methods: The paper proposes UniBEV, an end-to-end multi-modal 3D object detection framework that can operate on LiDAR plus camera input, as well as LiDAR-only or camera-only input, without retraining. To let the detector head handle different input combinations, UniBEV creates well-aligned Bird's Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, all sensor modalities follow a uniform approach to resample features from their native sensor coordinate systems to the BEV features.
results: On nuScenes, UniBEV achieves 52.5% mAP averaged over all sensor input combinations, a significant improvement over the baselines (43.5% mAP on average for BEVFusion, 48.7% mAP on average for MetaBEV). An ablation study shows that fusing by weighted averaging rather than regular concatenation, and sharing queries between the BEV encoders of each modality, improves robustness.
Abstract
Multi-sensor object detection is an active research topic in automated driving, but the robustness of such detection models against missing sensor input (modality missing), e.g., due to a sudden sensor failure, is a critical problem which remains under-studied. In this work, we propose UniBEV, an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities: UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining. To facilitate its detector head to handle different input combinations, UniBEV aims to create well-aligned Bird's Eye View (BEV) feature maps from each available modality. Unlike prior BEV-based multi-modal detection methods, all sensor modalities follow a uniform approach to resample features from the native sensor coordinate systems to the BEV features. We furthermore investigate the robustness of various fusion strategies w.r.t. missing modalities: the commonly used feature concatenation, but also channel-wise averaging, and a generalization to weighted averaging termed Channel Normalized Weights. To validate its effectiveness, we compare UniBEV to state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations. In this setting, UniBEV achieves $52.5 \%$ mAP on average over all input combinations, significantly improving over the baselines ($43.5 \%$ mAP on average for BEVFusion, $48.7 \%$ mAP on average for MetaBEV). An ablation study shows the robustness benefits of fusing by weighted averaging over regular concatenation, and of sharing queries between the BEV encoders of each modality. Our code will be released upon paper acceptance.
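The fusion strategy highlighted in the ablation can be pictured with a short sketch of channel-wise weighted averaging over whichever BEV feature maps are available. The parameterization of the Channel Normalized Weights below is an assumption that only mirrors the general idea described in the abstract.

```python
# A hedged sketch of fusing per-modality BEV feature maps by weighted channel
# averaging; missing modalities are simply dropped and the weights are
# renormalized over the modalities that are present.
import torch
import torch.nn as nn

class ChannelNormalizedWeightFusion(nn.Module):
    def __init__(self, num_modalities=2, channels=256):
        super().__init__()
        # One learnable weight per modality and channel.
        self.w = nn.Parameter(torch.zeros(num_modalities, channels))

    def forward(self, bev_feats):
        # bev_feats: dict modality_index -> (B, C, H, W); only available modalities present.
        idx = sorted(bev_feats.keys())
        weights = torch.softmax(self.w[idx], dim=0)         # normalize over available modalities
        stacked = torch.stack([bev_feats[i] for i in idx])  # (M, B, C, H, W)
        return (weights[:, None, :, None, None] * stacked).sum(dim=0)

fusion = ChannelNormalizedWeightFusion()
fused = fusion({0: torch.randn(1, 256, 128, 128)})  # camera-only input still works
```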
Accurate and Interactive Visual-Inertial Sensor Calibration with Next-Best-View and Next-Best-Trajectory Suggestion
results: Experiments show that the method is faster, more accurate, and more consistent than state-of-the-art alternatives, and that the resulting calibrations yield higher-accuracy estimates when used with existing VI odometry and VI-SLAM methods.
Abstract
Visual-Inertial (VI) sensors are popular in robotics, self-driving vehicles, and augmented and virtual reality applications. In order to use them for any computer vision or state-estimation task, a good calibration is essential. However, collecting informative calibration data in order to render the calibration parameters observable is not trivial for a non-expert. In this work, we introduce a novel VI calibration pipeline that guides a non-expert with the use of a graphical user interface and information theory in collecting informative calibration data with Next-Best-View and Next-Best-Trajectory suggestions to calibrate the intrinsics, extrinsics, and temporal misalignment of a VI sensor. We show through experiments that our method is faster, more accurate, and more consistent than state-of-the-art alternatives. Specifically, we show how calibrations with our proposed method achieve higher accuracy estimation results when used by state-of-the-art VI Odometry as well as VI-SLAM approaches. The source code of our software can be found on: https://github.com/chutsu/yac.
Assessment of a new GeoAI foundation model for flood inundation mapping
results: The results show that the Prithvi model transfers well to flood inundation mapping in previously unseen regions, with good predictability and generalizability on both the test dataset and a dataset completely unseen by the model.
Abstract
Vision foundation models are a new frontier in Geospatial Artificial Intelligence (GeoAI), an interdisciplinary research area that applies and extends AI for geospatial problem solving and geographic knowledge discovery, because of their potential to enable powerful image analysis by learning and extracting important image features from vast amounts of geospatial data. This paper evaluates the performance of the first-of-its-kind geospatial foundation model, IBM-NASA's Prithvi, to support a crucial geospatial analysis task: flood inundation mapping. This model is compared with convolutional neural network and vision transformer-based architectures in terms of mapping accuracy for flooded areas. A benchmark dataset, Sen1Floods11, is used in the experiments, and the models' predictability, generalizability, and transferability are evaluated based on both a test dataset and a dataset that is completely unseen by the model. Results show the good transferability of the Prithvi model, highlighting its performance advantages in segmenting flooded areas in previously unseen regions. The findings also indicate areas for improvement for the Prithvi model in terms of adopting multi-scale representation learning, developing more end-to-end pipelines for high-level image analysis tasks, and offering more flexibility in terms of input data bands.
Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator
results: Without any video data or training, Free-Bloom generates vivid, high-quality videos with semantically meaningful frame sequences that depict complex scenes. In addition, Free-Bloom is naturally compatible with LDM-based extensions.
Abstract
Text-to-video is a rapidly growing research area that aims to generate a semantically coherent, identity-consistent, and temporally coherent sequence of frames that accurately aligns with the input text prompt. This study focuses on zero-shot text-to-video generation with data and cost efficiency in mind. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of "moving images", we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions.
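The joint noise sampling mentioned above can be illustrated with a small sketch: each frame's initial latent mixes a clip-level shared noise tensor with frame-specific noise so that neighboring frames start from correlated states. The mixing rule and weights below are assumptions for illustration only.

```python
# A hedged sketch of "joint noise sampling" for temporal coherence in zero-shot
# text-to-video: every frame's starting latent combines noise shared across the
# clip with independent per-frame noise, keeping unit variance after mixing.
import torch

def joint_noise(num_frames, latent_shape, shared_weight=0.8, generator=None):
    shared = torch.randn(latent_shape, generator=generator)
    frames = []
    for _ in range(num_frames):
        indep = torch.randn(latent_shape, generator=generator)
        # shared_weight^2 + indep_weight^2 = 1 preserves the noise variance.
        frames.append(shared_weight * shared + (1 - shared_weight ** 2) ** 0.5 * indep)
    return torch.stack(frames)  # (T, *latent_shape)

latents = joint_noise(8, (4, 64, 64))  # 8 frames of a 4x64x64 latent
```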
AiAReSeg: Catheter Detection and Segmentation in Interventional Ultrasound using Transformers
paper_authors: Alex Ranne, Yordanka Velikova, Nassir Navab, Ferdinando Rodriguez y Baena
for: This paper proposes a deep-learning-based network to detect and segment catheters in interventional ultrasound image sequences.
methods: The method uses an attention-based transformer architecture and introduces a novel 3D segmentation head that operates across time.
results: The method is validated on a hold-out validation dataset and tested on synthetic ultrasound images generated with physics-based catheter insertion simulations, achieving good results.
Abstract
To date, endovascular surgeries are performed using the golden standard of Fluoroscopy, which uses ionising radiation to visualise catheters and vasculature. Prolonged Fluoroscopic exposure is harmful for the patient and the clinician, and may lead to severe post-operative sequelae such as the development of cancer. Meanwhile, the use of interventional Ultrasound has gained popularity, due to its well-known benefits of small spatial footprint, fast data acquisition, and higher tissue contrast images. However, ultrasound images are hard to interpret, and it is difficult to localise vessels, catheters, and guidewires within them. This work proposes a solution using an adaptation of a state-of-the-art machine learning transformer architecture to detect and segment catheters in axial interventional Ultrasound image sequences. The network architecture was inspired by the Attention in Attention mechanism and temporal tracking networks, and introduces a novel 3D segmentation head that performs 3D deconvolution across time. In order to facilitate training of such deep learning networks, we introduce a new data synthesis pipeline that uses physics-based catheter insertion simulations, along with a convolutional ray-casting ultrasound simulator, to produce synthetic ultrasound images of endovascular interventions. The proposed method is validated on a hold-out validation dataset, demonstrating robustness to ultrasound noise and a wide range of scanning angles. It was also tested on data collected from silicon-based aorta phantoms, demonstrating its potential for translation from sim-to-real. This work represents a significant step towards safer and more efficient endovascular surgery using interventional ultrasound.
Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
results: Experiments on the Waymo Open Dataset show that the approach outperforms prior work by significant margins on various unsupervised 3D perception tasks.
Abstract
Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks.
Gastro-Intestinal Tract Segmentation Using an Explainable 3D Unet
methods: The paper proposes a deep learning (DL) pipeline for gastrointestinal organ segmentation and integrates Explainable AI (XAI) into the pipeline to improve model transparency and trustworthiness.
results: The result is a reliable, high-accuracy segmentation pipeline for radiotherapy planning that can help radiation oncologists treat patients more quickly.
Abstract
In treating gastrointestinal cancer using radiotherapy, the role of the radiation oncologist is to administer high doses of radiation, through x-ray beams, toward the tumor while avoiding the stomach and intestines. With the advent of precise radiation treatment technology such as the MR-Linac, oncologists can visualize the daily positions of the tumors and intestines, which may vary day to day. Before delivering radiation, radiation oncologists must manually outline the position of the gastrointestinal organs in order to determine the position and direction of the x-ray beam. This is a time-consuming and labor-intensive process that may substantially prolong a patient's treatment. A deep learning (DL) method can automate and expedite the process. However, many deep neural network approaches currently in use are black boxes that lack interpretability, which renders them untrustworthy and impractical in a healthcare setting. To address this, an emergent field of AI known as Explainable AI (XAI) may be incorporated to improve the transparency and viability of a model. This paper proposes a deep learning pipeline that incorporates XAI to address the challenges of organ segmentation.
FARSEC: A Reproducible Framework for Automatic Real-Time Vehicle Speed Estimation Using Traffic Cameras
results: Compared with three well-known models, the proposed model achieves competitive results on realistic CCTV videos while offering better reproducibility and maintainability.
Abstract
Estimating the speed of vehicles using traffic cameras is a crucial task for traffic surveillance and management, enabling more optimal traffic flow, improved road safety, and lower environmental impact. Transportation-dependent systems, such as for navigation and logistics, have great potential to benefit from reliable speed estimation. While there is prior research in this area reporting competitive accuracy levels, their solutions lack reproducibility and robustness across different datasets. To address this, we provide a novel framework for automatic real-time vehicle speed calculation, which copes with more diverse data from publicly available traffic cameras to achieve greater robustness. Our model employs novel techniques to estimate the length of road segments via depth map prediction. Additionally, our framework is capable of handling realistic conditions such as camera movements and different video stream inputs automatically. We compare our model to three well-known models in the field using their benchmark datasets. While our model does not set a new state of the art regarding prediction performance, the results are competitive on realistic CCTV videos. At the same time, our end-to-end pipeline offers more consistent results, an easier implementation, and better compatibility. Its modular structure facilitates reproducibility and future improvements.
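Once the real-world length of a road segment has been estimated (for example via the depth map prediction described above), the speed computation itself is simple arithmetic. The toy function below shows that step; all names and parameter values are hypothetical.

```python
# An illustrative calculation (not FARSEC's exact pipeline): once the real-world
# length of a road segment is known, speed = distance covered / elapsed time
# between the frames where the vehicle enters and exits the segment.
def estimate_speed_kmh(segment_length_m, entry_frame, exit_frame, fps):
    """Speed of a vehicle that traverses a road segment of known length."""
    elapsed_s = (exit_frame - entry_frame) / fps
    if elapsed_s <= 0:
        raise ValueError("exit_frame must come after entry_frame")
    return segment_length_m / elapsed_s * 3.6  # m/s -> km/h

# A vehicle crossing a 25 m segment in 30 frames of a 25 fps stream:
print(estimate_speed_kmh(25.0, entry_frame=100, exit_frame=130, fps=25))  # 75.0 km/h
```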
Chop & Learn: Recognizing and Generating Object-State Compositions
results: The paper uses the videos for compositional action recognition and shows valuable uses of this dataset for multiple video tasks. Project website: https://chopnlearn.github.io.
Abstract
Recognizing and generating object-state compositions has been a challenging task, especially when generalizing to unseen compositions. In this paper, we study the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite Chop & Learn, to accommodate the needs of learning objects and different cut styles using multiple viewpoints. We also propose a new task of Compositional Image Generation, which can transfer learned cut styles to different objects, by generating novel object-state images. Moreover, we also use the videos for Compositional Action Recognition, and show valuable uses of this dataset for multiple video tasks. Project website: https://chopnlearn.github.io.
paper_authors: Mohamed El Amine Boudjoghra, Salwa K. Al Khatib, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan
for: 3D indoor instance segmentation in an open-world setting, where the model can distinguish known classes and identify unknown objects
methods: use an auto-labeling scheme to produce pseudo-labels during training, and adjust the unknown class probability based on objectness score distribution
results: Promising open-world 3D instance segmentation performance on carefully curated open-world splits.
Abstract
Existing 3D instance segmentation methods typically assume that all semantic classes to be segmented would be available during training and only seen categories are segmented at inference. We argue that such a closed-world assumption is restrictive and explore for the first time 3D indoor instance segmentation in an open-world setting, where the model is allowed to distinguish a set of known classes as well as identify an unknown object as unknown and then later incrementally learning the semantic category of the unknown when the corresponding category labels are available. To this end, we introduce an open-world 3D indoor instance segmentation method, where an auto-labeling scheme is employed to produce pseudo-labels during training and induce separation to separate known and unknown category labels. We further improve the pseudo-labels quality at inference by adjusting the unknown class probability based on the objectness score distribution. We also introduce carefully curated open-world splits leveraging realistic scenarios based on inherent object distribution, region-based indoor scene exploration and randomness aspect of open-world classes. Extensive experiments reveal the efficacy of the proposed contributions leading to promising open-world 3D instance segmentation performance.
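The inference-time adjustment of the unknown-class probability can be pictured with a small sketch: proposals with high objectness but low confidence in any known class have their unknown probability raised. The calibration rule below is an illustrative assumption, not the paper's exact formula.

```python
# A hedged sketch of boosting a proposal's unknown-class probability according to its
# objectness score: object-like proposals that are not confidently any known class are
# nudged toward "unknown". The rule and thresholds are assumptions.
import numpy as np

def adjust_unknown_probability(class_probs, objectness, unknown_idx=-1):
    """class_probs: (K,) probabilities over known classes plus one unknown slot;
    objectness in [0, 1]."""
    probs = class_probs.copy()
    known_conf = np.delete(probs, unknown_idx).max()
    probs[unknown_idx] = max(probs[unknown_idx], objectness * (1.0 - known_conf))
    return probs / probs.sum()  # renormalize

print(adjust_unknown_probability(np.array([0.3, 0.3, 0.2, 0.2]), objectness=0.9))
```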
Noise-in, Bias-out: Balanced and Real-time MoCap Solving
results: Experiments show that the method improves the accuracy and robustness of marker estimation in real-time MoCap and performs well even in rare and challenging poses. Project page: https://moverseai.github.io/noise-tail
Abstract
Real-time optical Motion Capture (MoCap) systems have not benefited from the advances in modern data-driven modeling. In this work we apply machine learning to solve noisy unstructured marker estimates in real-time and deliver robust marker-based MoCap even when using sparse affordable sensors. To achieve this we focus on a number of challenges related to model training, namely the sourcing of training data and their long-tailed distribution. Leveraging representation learning we design a technique for imbalanced regression that requires no additional data or labels and improves the performance of our model in rare and challenging poses. By relying on a unified representation, we show that training such a model is not bound to high-end MoCap training data acquisition, and exploit the advances in marker-less MoCap to acquire the necessary data. Finally, we take a step towards richer and affordable MoCap by adapting a body model-based inverse kinematics solution to account for measurement and inference uncertainty, further improving performance and robustness. Project page: https://moverseai.github.io/noise-tail
DeepMesh: Mesh-based Cardiac Motion Tracking using Deep Learning
results: Experimental results show that DeepMesh efficiently and quantitatively estimates the 3D motion of the left ventricle, and outperforms other image-based and mesh-based cardiac motion tracking methods.
Abstract
3D motion estimation from cine cardiac magnetic resonance (CMR) images is important for the assessment of cardiac function and the diagnosis of cardiovascular diseases. Current state-of-the art methods focus on estimating dense pixel-/voxel-wise motion fields in image space, which ignores the fact that motion estimation is only relevant and useful within the anatomical objects of interest, e.g., the heart. In this work, we model the heart as a 3D mesh consisting of epi- and endocardial surfaces. We propose a novel learning framework, DeepMesh, which propagates a template heart mesh to a subject space and estimates the 3D motion of the heart mesh from CMR images for individual subjects. In DeepMesh, the heart mesh of the end-diastolic frame of an individual subject is first reconstructed from the template mesh. Mesh-based 3D motion fields with respect to the end-diastolic frame are then estimated from 2D short- and long-axis CMR images. By developing a differentiable mesh-to-image rasterizer, DeepMesh is able to leverage 2D shape information from multiple anatomical views for 3D mesh reconstruction and mesh motion estimation. The proposed method estimates vertex-wise displacement and thus maintains vertex correspondences between time frames, which is important for the quantitative assessment of cardiac function across different subjects and populations. We evaluate DeepMesh on CMR images acquired from the UK Biobank. We focus on 3D motion estimation of the left ventricle in this work. Experimental results show that the proposed method quantitatively and qualitatively outperforms other image-based and mesh-based cardiac motion tracking methods.
Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning
results: The model performs well on multiple downstream tasks, including ScanObjectNN and ModelNet40. Specifically, the pre-trained models achieve 90.28% accuracy on the ScanObjectNN hardest split and 94.1% accuracy on ModelNet40, surpassing all other self-supervised learning methods.
Abstract
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for both 2D and 3D computer vision. Nevertheless, existing MAE-based methods still have certain drawbacks. Firstly, the functional decoupling between the encoder and decoder is incomplete, which limits the encoder's representation learning ability. Secondly, downstream tasks solely utilize the encoder, failing to fully leverage the knowledge acquired through the encoder-decoder architecture in the pre-text task. In this paper, we propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning. The proposed method decouples functions between the decoder and the encoder by introducing a mask regressor, which predicts the masked patch representation from the visible patch representation encoded by the encoder and the decoder reconstructs the target from the predicted masked patch representation. By doing so, we minimize the impact of decoder updates on the representation space of the encoder. Moreover, we introduce an alignment constraint to ensure that the representations for masked patches, predicted from the encoded representations of visible patches, are aligned with the masked patch presentations computed from the encoder. To make full use of the knowledge learned in the pre-training stage, we design a new finetune mode for the proposed Point-RAE. Extensive experiments demonstrate that our approach is efficient during pre-training and generalizes well on various downstream tasks. Specifically, our pre-trained models achieve a high accuracy of \textbf{90.28\%} on the ScanObjectNN hardest split and \textbf{94.1\%} accuracy on ModelNet40, surpassing all the other self-supervised learning methods. Our code and pretrained model are public available at: \url{https://github.com/liuyyy111/Point-RAE}.
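The encoder/decoder decoupling can be sketched as follows: a mask regressor predicts masked-patch representations from the visible-patch encodings, and an alignment loss keeps those predictions close to the encoder's own (detached) representations of the masked patches, so decoder-side updates do not disturb the encoder's representation space. Module sizes, layer choices, and names below are assumptions.

```python
# A simplified sketch of the Point-RAE idea: a mask regressor predicts masked-patch
# representations from visible-patch tokens, and an alignment loss ties the
# predictions to the encoder's representations of the masked patches.
# Positional embeddings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskRegressor(nn.Module):
    def __init__(self, dim=384, num_heads=6, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, visible_tokens, num_masked):
        b = visible_tokens.shape[0]
        queries = self.mask_token.expand(b, num_masked, -1)
        # Predict masked-patch representations from the visible-patch representations.
        return self.blocks(torch.cat([visible_tokens, queries], dim=1))[:, -num_masked:]

def alignment_loss(pred_masked, enc_masked):
    # enc_masked: the encoder's representation of the masked patches (no gradient).
    return F.mse_loss(pred_masked, enc_masked.detach())
```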
Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation
paper_authors: Quang Nguyen, Truong Vu, Anh Tran, Khoi Nguyen
for: This paper aims to address the labor-intensive task of preparing training data for deep vision models by proposing a novel method for generating pixel-level semantic segmentation labels using a text-to-image generative model.
methods: The proposed method utilizes the text prompts, cross-attention, and self-attention of the Stable Diffusion (SD) model to generate segmentation maps corresponding to synthetic images. The method introduces three new techniques: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation.
results: The proposed approach significantly outperforms concurrent work on two datasets, PASCAL VOC and MSCOCO, and provides a reliable way to generate pixel-level semantic segmentation labels without the need for labor-intensive pixel-wise annotation.
Abstract
Preparing training data for deep vision models is a labor-intensive task. To address this, generative models have emerged as an effective solution for generating synthetic data. While current generative models produce image-level category labels, we propose a novel method for generating pixel-level semantic segmentation labels using the text-to-image generative model Stable Diffusion (SD). By utilizing the text prompts, cross-attention, and self-attention of SD, we introduce three new techniques: class-prompt appending, class-prompt cross-attention, and self-attention exponentiation. These techniques enable us to generate segmentation maps corresponding to synthetic images. These maps serve as pseudo-labels for training semantic segmenters, eliminating the need for labor-intensive pixel-wise annotation. To account for the imperfections in our pseudo-labels, we incorporate uncertainty regions into the segmentation, allowing us to disregard loss from those regions. We conduct evaluations on two datasets, PASCAL VOC and MSCOCO, and our approach significantly outperforms concurrent work. Our benchmarks and code will be released at https://github.com/VinAIResearch/Dataset-Diffusion
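A rough sketch of how per-class attention maps might be turned into pseudo-labels with an uncertainty (ignore) region is given below. The thresholds and the way the attention maps are extracted are assumptions, not the paper's exact class-prompt cross-attention procedure.

```python
# A hedged illustration of converting per-class attention maps from a text-to-image
# diffusion model into a pseudo segmentation mask with an "uncertain" ignore region.
import numpy as np

def attention_to_pseudo_label(attn_maps, fg_thresh=0.4, margin=0.1, ignore_index=255):
    """attn_maps: (num_foreground_classes, H, W) attention scores in [0, 1]."""
    best = attn_maps.max(axis=0)
    label = attn_maps.argmax(axis=0).astype(np.int64) + 1  # 0 is reserved for background
    label[best < fg_thresh] = 0  # background where no class attends strongly
    # Pixels whose best score sits near the threshold are marked uncertain and
    # excluded from the segmentation loss.
    label[np.abs(best - fg_thresh) < margin] = ignore_index
    return label

pseudo = attention_to_pseudo_label(np.random.rand(3, 64, 64))
```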
Tiled Multiplane Images for Practical 3D Photography
results: The paper proposes generating novel views from a single image with a Tiled Multiplane Image (TMPI), in which each small tile has only a few depth planes, improving computational efficiency. The synthesized results are comparable to state-of-the-art single-view MPI methods while having lower computational overhead.
Abstract
The task of synthesizing novel views from a single image has useful applications in virtual reality and mobile computing, and a number of approaches to the problem have been proposed in recent years. A Multiplane Image (MPI) estimates the scene as a stack of RGBA layers, and can model complex appearance effects, anti-alias depth errors and synthesize soft edges better than methods that use textured meshes or layered depth images. And unlike neural radiance fields, an MPI can be efficiently rendered on graphics hardware. However, MPIs are highly redundant and require a large number of depth layers to achieve plausible results. Based on the observation that the depth complexity in local image regions is lower than that over the entire image, we split an MPI into many small, tiled regions, each with only a few depth planes. We call this representation a Tiled Multiplane Image (TMPI). We propose a method for generating a TMPI with adaptive depth planes for single-view 3D photography in the wild. Our synthesized results are comparable to state-of-the-art single-view MPI methods while having lower computational overhead.
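Rendering a tile reduces to standard back-to-front alpha compositing of its few RGBA planes. The sketch below shows the "over" operator on one tile; real TMPI rendering additionally warps each plane into the target view before compositing.

```python
# A small sketch of compositing the RGBA planes of one tile of a (tiled) multiplane
# image back to front with the standard "over" operator.
import numpy as np

def composite_planes(planes):
    """planes: (D, H, W, 4) RGBA layers ordered from farthest to nearest."""
    out = np.zeros(planes.shape[1:3] + (3,), dtype=np.float32)
    for rgba in planes:  # back to front
        rgb, alpha = rgba[..., :3], rgba[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out

tile = np.random.rand(4, 32, 32, 4).astype(np.float32)  # only a few depth planes per tile
image = composite_planes(tile)
```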
results: The method achieves state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and performs on par with the best methods on COCO.
Abstract
The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
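The multi-scale aggregation can be sketched as sliding windows of several patch sizes whose CLIP class scores are accumulated into one dense map. In the sketch below, `clip_patch_scores` is a placeholder for CLIP inference and not a real API call, and the 50% window stride is an assumption.

```python
# A hedged sketch of the multi-scale idea: score overlapping patches of several sizes
# against the class names and average the votes into a dense per-class score map.
import numpy as np

def dense_map_from_patches(image, class_names, clip_patch_scores, patch_sizes=(64, 128, 256)):
    H, W = image.shape[:2]
    score_map = np.zeros((len(class_names), H, W), dtype=np.float32)
    counts = np.zeros((H, W), dtype=np.float32)
    for ps in patch_sizes:
        for y in range(0, H - ps + 1, ps // 2):          # 50%-overlapping windows
            for x in range(0, W - ps + 1, ps // 2):
                scores = clip_patch_scores(image[y:y + ps, x:x + ps], class_names)  # (C,)
                score_map[:, y:y + ps, x:x + ps] += scores[:, None, None]
                counts[y:y + ps, x:x + ps] += 1.0
    return score_map / np.maximum(counts, 1.0)  # argmax over classes gives the segmentation
```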
paper_authors: Muxin Liao, Shishun Tian, Yuhang Zhang, Guoguang Hua, Wenbin Zou, Xia Li
for: This work addresses the problem of aligning class prototypes across domains in domain generalization, and proposes a prototypical contrastive learning approach to obtain domain-invariant features.
methods: The method builds on prototypical contrastive learning, using the central value of each class within a domain as its prototype, and calibrates the alignment between the learned class-wise features and the prototypes of different domains.
results: The approach achieves superior performance over current methods on domain generalization semantic segmentation tasks, and the calibration yields better alignment with the prototypes of different domains.
Abstract
Prototypical contrastive learning (PCL) has been widely used to learn class-wise domain-invariant features recently. These methods are based on the assumption that the prototypes, which are represented as the central value of the same class in a certain domain, are domain-invariant. Since the prototypes of different domains have discrepancies as well, the class-wise domain-invariant features learned from the source domain by PCL need to be aligned with the prototypes of other domains simultaneously. However, the prototypes of the same class in different domains may be different while the prototypes of different classes may be similar, which may affect the learning of class-wise domain-invariant features. Based on these observations, a calibration-based dual prototypical contrastive learning (CDPCL) approach is proposed to reduce the domain discrepancy between the learned class-wise features and the prototypes of different domains for domain generalization semantic segmentation. It contains an uncertainty-guided PCL (UPCL) and a hard-weighted PCL (HPCL). Since the domain discrepancies of the prototypes of different classes may be different, we propose an uncertainty probability matrix to represent the domain discrepancies of the prototypes of all the classes. The UPCL estimates the uncertainty probability matrix to calibrate the weights of the prototypes during the PCL. Moreover, considering that the prototypes of different classes may be similar in some circumstances, which means these prototypes are hard-aligned, the HPCL is proposed to generate a hard-weighted matrix to calibrate the weights of the hard-aligned prototypes during the PCL. Extensive experiments demonstrate that our approach achieves superior performance over current approaches on domain generalization semantic segmentation tasks.
SINCERE: Supervised Information Noise-Contrastive Estimation REvisited
results: Comparing the SINCERE and SupCon losses in terms of learning trajectories during pretraining and final linear classifier performance shows that the SINCERE loss better separates embeddings from different classes while delivering competitive accuracy.
Abstract
The information noise-contrastive estimation (InfoNCE) loss function provides the basis of many self-supervised deep learning methods due to its strong empirical results and theoretic motivation. Previous work suggests a supervised contrastive (SupCon) loss to extend InfoNCE to learn from available class labels. This SupCon loss has been widely-used due to reports of good empirical performance. However, in this work we suggest that the specific SupCon loss formulated by prior work has questionable theoretic justification, because it can encourage images from the same class to repel one another in the learned embedding space. This problematic behavior gets worse as the number of inputs sharing one class label increases. We propose the Supervised InfoNCE REvisited (SINCERE) loss as a remedy. SINCERE is a theoretically justified solution for a supervised extension of InfoNCE that never causes images from the same class to repel one another. We further show that minimizing our new loss is equivalent to maximizing a bound on the KL divergence between class conditional embedding distributions. We compare SINCERE and SupCon losses in terms of learning trajectories during pretraining and in ultimate linear classifier performance after finetuning. Our proposed SINCERE loss better separates embeddings from different classes during pretraining while delivering competitive accuracy.
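The core difference the abstract points to can be illustrated with a small loss sketch in which, for each anchor-positive pair, the denominator contains only that positive and samples from other classes, so same-class embeddings are never pushed apart. The temperature and averaging conventions below are assumptions and may not match the paper exactly.

```python
# An illustrative supervised InfoNCE-style loss in which other same-class samples are
# excluded from each anchor-positive denominator. Assumes every class in the batch has
# at least two samples.
import torch
import torch.nn.functional as F

def sincere_style_loss(embeddings, labels, temperature=0.1):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                     # (N, N) scaled cosine similarities
    n = z.shape[0]
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(n):
        negs = sim[i][~same[i]]                     # other-class samples only
        for j in torch.nonzero(same[i] & ~eye[i]).flatten():
            logit = sim[i, j]
            denom = torch.logsumexp(torch.cat([logit[None], negs]), dim=0)
            losses.append(denom - logit)            # -log softmax of the positive
    return torch.stack(losses).mean()
```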
Identity-preserving Editing of Multiple Facial Attributes by Learning Global Edit Directions and Local Adjustments
results: Experiments show that ID-Style preserves the identity of the input instance better than similar state-of-the-art work, improving the identity preserving metric (FRS) by 10% and the average accuracy of manipulation (mACC) by 7% over the baselines. Moreover, its network is roughly 95% smaller than comparable work while maintaining similar editing quality.
Abstract
Semantic facial attribute editing using pre-trained Generative Adversarial Networks (GANs) has attracted a great deal of attention and effort from researchers in recent years. Due to the high quality of face images generated by StyleGANs, much work has focused on the StyleGANs' latent space and the proposed methods for facial image editing. Although these methods have achieved satisfying results for manipulating user-intended attributes, they have not fulfilled the goal of preserving the identity, which is an important challenge. We present ID-Style, a new architecture capable of addressing the problem of identity loss during attribute manipulation. The key components of ID-Style include Learnable Global Direction (LGD), which finds a shared and semi-sparse direction for each attribute, and an Instance-Aware Intensity Predictor (IAIP) network, which finetunes the global direction according to the input instance. Furthermore, we introduce two losses during training to enforce the LGD to find semi-sparse semantic directions, which along with the IAIP, preserve the identity of the input instance. Despite reducing the size of the network by roughly 95% as compared to similar state-of-the-art works, it outperforms baselines by 10% and 7% in Identity preserving metric (FRS) and average accuracy of manipulation (mACC), respectively.
Industrial Application of 6D Pose Estimation for Robotic Manipulation in Automotive Internal Logistics
paper_authors: Philipp Quentin, Dino Knoll, Daniel Goehring
for: This paper aims to evaluate the current status quo of 6D pose estimation in the context of automotive parts handling tasks, and to identify the challenges and limitations of existing approaches.
methods: The authors built a representative 6D pose estimation pipeline using state-of-the-art components, including data generation methods and pose estimators, and evaluated its performance on automotive parts.
results: The authors found that the performance of the trained 6D pose estimators was promising, but did not meet industry requirements. They also revealed that the main challenge was the inability of the estimators to provide reliable uncertainties for their poses, rather than the accuracy of the poses themselves. Additionally, the authors compared RGB- and RGB-D-based approaches and showed that they are differently vulnerable to the domain gap induced by synthetic data.
Abstract
Despite the advances in robotics, a large proportion of the parts handling tasks in the automotive industry's internal logistics are not automated but still performed by humans. A key component to competitively automate these processes is a 6D pose estimation that can handle a large number of different parts, is adaptable to new parts with little manual effort, and is sufficiently accurate and robust with respect to industry requirements. In this context, the question arises as to the current status quo with respect to these measures. To address this, we built a representative 6D pose estimation pipeline with state-of-the-art components, from economically scalable real-to-synthetic data generation to pose estimators, and evaluated it on automotive parts with regard to a realistic sequencing process. We found that using the data generation approaches, the performance of the trained 6D pose estimators is promising, but does not meet industry requirements. We reveal that the reason for this is the inability of the estimators to provide reliable uncertainties for their poses, rather than their inability to provide sufficiently accurate poses. In this context, we further analyzed how RGB- and RGB-D-based approaches compare against this background and show that they are differently vulnerable to the domain gap induced by synthetic data.
Enhancing Healthcare with EOG: A Novel Approach to Sleep Stage Classification
results: Accurate classification of five distinct sleep stages, with noteworthy macro-F1 scores of 74.72, 70.63, and 69.26 on SleepEDF-20, SleepEDF-78, and SHHS, respectively, and strong performance in identifying REM sleep.
Abstract
We introduce an innovative approach to automated sleep stage classification using EOG signals, addressing the discomfort and impracticality associated with EEG data acquisition. In addition, it is important to note that this approach is untapped in the field, highlighting its potential for novel insights and contributions. Our proposed SE-Resnet-Transformer model provides an accurate classification of five distinct sleep stages from raw EOG signal. Extensive validation on publically available databases (SleepEDF-20, SleepEDF-78, and SHHS) reveals noteworthy performance, with macro-F1 scores of 74.72, 70.63, and 69.26, respectively. Our model excels in identifying REM sleep, a crucial aspect of sleep disorder investigations. We also provide insight into the internal mechanisms of our model using techniques such as 1D-GradCAM and t-SNE plots. Our method improves the accessibility of sleep stage classification while decreasing the need for EEG modalities. This development will have promising implications for healthcare and the incorporation of wearable technology into sleep studies, thereby advancing the field's potential for enhanced diagnostics and patient comfort.
Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation
paper_authors: Yuxi Wang, Jian Liang, Jun Xiao, Shuqi Mei, Yuran Yang, Zhaoxiang Zhang
for: This work aims to provide an effective one-shot adaptation method for transferring semantic segmentation from labeled source data to unlabeled target data.
methods: The method uses a novel selection criterion to pick the most informative samples from the labeled source data, enabling quick adaptation and reducing redundant training. The selected samples are then used to update the model through patch-wise mixing and prototype-based information maximization.
results: Experiments show that the approach outperforms existing methods in both efficiency and accuracy. Specifically, it achieves a new state-of-the-art one-shot performance of 56.7%/55.4% on the GTA5/SYNTHIA to Cityscapes adaptation tasks, respectively.
Abstract
Contemporary domain adaptation offers a practical solution for achieving cross-domain transfer of semantic segmentation between labeled source data and unlabeled target data. These solutions have gained significant popularity; however, they require the model to be retrained when the test environment changes. This can result in unbearable costs in certain applications due to the time-consuming training process and concerns regarding data privacy. One-shot domain adaptation methods attempt to overcome these challenges by transferring the pre-trained source model to the target domain using only one target data. Despite this, the referring style transfer module still faces issues with computation cost and over-fitting problems. To address this problem, we propose a novel framework called Informative Data Mining (IDM) that enables efficient one-shot domain adaptation for semantic segmentation. Specifically, IDM provides an uncertainty-based selection criterion to identify the most informative samples, which facilitates quick adaptation and reduces redundant training. We then perform a model adaptation method using these selected samples, which includes patch-wise mixing and prototype-based information maximization to update the model. This approach effectively enhances adaptation and mitigates the overfitting problem. In general, we provide empirical evidence of the effectiveness and efficiency of IDM. Our approach outperforms existing methods and achieves a new state-of-the-art one-shot performance of 56.7\%/55.4\% on the GTA5/SYNTHIA to Cityscapes adaptation tasks, respectively. The code will be released at \url{https://github.com/yxiwang/IDM}.
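A simplified version of uncertainty-based informative sample selection is sketched below: rank candidates by the mean entropy of the model's softmax predictions and keep the top-k. IDM's actual criterion and the subsequent patch-wise mixing are more elaborate; this only shows the selection idea, and all names are illustrative.

```python
# A simplified sketch of uncertainty-based informative sample selection for a
# segmentation model: higher mean prediction entropy = more informative.
import torch

def select_informative(model, images, k):
    """images: (N, C, H, W); returns the indices of the k most uncertain images."""
    with torch.no_grad():
        probs = torch.softmax(model(images), dim=1)                # (N, classes, H, W)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)    # (N, H, W)
        scores = entropy.mean(dim=(1, 2))                          # mean entropy per image
    return torch.topk(scores, k).indices
```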
QuadricsNet: Learning Concise Representation for Geometric Primitives in Point Clouds
results: Our study demonstrates the efficiency and robustness of the concise representation and the QuadricsNet framework. Our code is available at \url{https://github.com/MichaelWu99-lab/QuadricsNet}.
Abstract
This paper presents a novel framework to learn a concise geometric primitive representation for 3D point clouds. Different from representing each type of primitive individually, we focus on the challenging problem of how to achieve a concise and uniform representation robustly. We employ quadrics to represent diverse primitives with only 10 parameters and propose the first end-to-end learning-based framework, namely QuadricsNet, to parse quadrics in point clouds. The relationships between the mathematical formulation of quadrics and their geometric attributes, including the type, scale and pose, are insightfully integrated for effective supervision of QuadricsNet. Besides, a novel pattern-comprehensive dataset with quadrics segments and objects is collected for training and evaluation. Experiments demonstrate the effectiveness of our concise representation and the robustness of QuadricsNet. Our code is available at \url{https://github.com/MichaelWu99-lab/QuadricsNet}.
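For context on the 10-parameter representation mentioned above: a general quadric surface in 3D is the zero set of a quadratic polynomial with exactly ten coefficients, equivalently a symmetric $4\times 4$ matrix $Q$ acting on homogeneous coordinates $\tilde{\mathbf{x}} = (x, y, z, 1)^\top$:

\[
\tilde{\mathbf{x}}^\top Q \, \tilde{\mathbf{x}}
= a x^2 + b y^2 + c z^2 + 2fyz + 2gzx + 2hxy + 2px + 2qy + 2rz + d = 0 .
\]

How QuadricsNet normalizes or parameterizes these ten values internally is detailed in the paper and code, not in the abstract.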
Automatic Animation of Hair Blowing in Still Portrait Photos
results: Compared with other state-of-the-art methods, the proposed approach outperforms them by a large margin in the quantitative evaluation and delivers the most pleasing and compelling visual results in the qualitative tests.
Abstract
We propose a novel approach to animate human hair in a still portrait photo. Existing work has largely studied the animation of fluid elements such as water and fire. However, hair animation for a real image remains underexplored, which is a challenging problem, due to the high complexity of hair structure and dynamics. Considering the complexity of hair structure, we innovatively treat hair wisp extraction as an instance segmentation problem, where a hair wisp is referred to as an instance. With advanced instance segmentation networks, our method extracts meaningful and natural hair wisps. Furthermore, we propose a wisp-aware animation module that animates hair wisps with pleasing motions without noticeable artifacts. The extensive experiments show the superiority of our method. Our method provides the most pleasing and compelling viewing experience in the qualitative experiments and outperforms state-of-the-art still-image animation methods by a large margin in the quantitative evaluation. Project url: \url{https://nevergiveu.github.io/AutomaticHairBlowing/}
Detecting and Grounding Multi-Modal Media Manipulation and Beyond
paper_authors: Rui Shao, Tianxing Wu, Jianlong Wu, Liqiang Nie, Ziwei Liu
for: This work studies a new multi-modal media forgery problem, Detecting and Grounding Multi-Modal Media Manipulation (DGM^4), which aims not only to detect the authenticity of multi-modal media but also to ground the manipulated content by reasoning across modalities.
results: Experimental results demonstrate the superiority of the HAMMER and HAMMER++ models in accurately detecting and grounding manipulation traces in multi-modal media.
Abstract
Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content, which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. To exploit more fine-grained contrastive learning for cross-modal semantic alignment, we further integrate Manipulation-Aware Contrastive Loss with Local View and construct a more advanced model HAMMER++. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of HAMMER and HAMMER++.
(Predictable) Performance Bias in Unsupervised Anomaly Detection
results: Our experiments revealed empirical "fairness laws" (similar to the "scaling laws" for Transformers): linear relationships between a subgroup's representation in the training data and the anomaly detection performance within that subgroup. The study also shows that performance gaps persist even with balanced training data and that these effects compound, further lowering performance for subjects belonging to multiple adversely affected groups.
Abstract
Background: With the ever-increasing amount of medical imaging data, the demand for algorithms to assist clinicians has amplified. Unsupervised anomaly detection (UAD) models promise to aid in the crucial first step of disease detection. While previous studies have thoroughly explored fairness in supervised models in healthcare, for UAD, this has so far been unexplored. Methods: In this study, we evaluated how dataset composition regarding subgroups manifests in disparate performance of UAD models along multiple protected variables on three large-scale publicly available chest X-ray datasets. Our experiments were validated using two state-of-the-art UAD models for medical images. Finally, we introduced a novel subgroup-AUROC (sAUROC) metric, which aids in quantifying fairness in machine learning. Findings: Our experiments revealed empirical "fairness laws" (similar to "scaling laws" for Transformers) for training-dataset composition: Linear relationships between anomaly detection performance within a subpopulation and its representation in the training data. Our study further revealed performance disparities, even in the case of balanced training data, and compound effects that exacerbate the drop in performance for subjects associated with multiple adversely affected groups. Interpretation: Our study quantified the disparate performance of UAD models against certain demographic subgroups. Importantly, we showed that this unfairness cannot be mitigated by balanced representation alone. Instead, the representation of some subgroups seems harder to learn by UAD models than that of others. The empirical fairness laws discovered in our study make disparate performance in UAD models easier to estimate and aid in determining the most desirable dataset composition.
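The abstract describes sAUROC as a subgroup-wise AUROC; a reasonable reading is to compute the AUROC of the anomaly scores restricted to each protected subgroup. A hedged sketch of that reading (the paper's exact definition may differ in detail):

```python
from sklearn.metrics import roc_auc_score

def subgroup_auroc(labels, anomaly_scores, subgroups):
    """AUROC of anomaly scores computed separately within each subgroup.

    labels: 0/1 anomaly ground truth, anomaly_scores: model outputs,
    subgroups: a subgroup identifier (e.g., sex or age bucket) per sample.
    """
    results = {}
    for g in set(subgroups):
        idx = [i for i, s in enumerate(subgroups) if s == g]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:          # AUROC is undefined with a single class present
            results[g] = None
            continue
        results[g] = roc_auc_score(y, [anomaly_scores[i] for i in idx])
    return results
```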
LAPP: Layer Adaptive Progressive Pruning for Compressing CNNs from Scratch
paper_authors: Pucheng Zhai, Kailing Guo, Fang Liu, Xiaofen Xing, Xiangmin Xu
for: This paper proposes Layer Adaptive Progressive Pruning (LAPP), a framework for quickly and adaptively reducing the computation of convolutional neural networks (CNNs).
methods: LAPP uses a learnable threshold for each layer together with a FLOPs constraint to control the pruning rate, and dynamically updates these thresholds during training to follow changes in the importance scores of each layer. In addition, a lightweight bypass is added to each layer before pruning to preserve the expressive power of the pruned network.
results: Compared with previous compression methods, LAPP achieves larger performance gains across several datasets and backbones. For example, on CIFAR-10 it compresses ResNet-20 to 40.3% without an accuracy drop, and on ImageNet it removes 55.6% of the FLOPs of ResNet-18 with a 0.21% top-1 and 0.40% top-5 accuracy increase.
Abstract
Structured pruning is a commonly used convolutional neural network (CNN) compression approach. Pruning rate setting is a fundamental problem in structured pruning. Most existing works introduce too many additional learnable parameters to assign different pruning rates across different layers in CNN or cannot control the compression rate explicitly. Since too narrow network blocks information flow for training, automatic pruning rate setting cannot explore a high pruning rate for a specific layer. To overcome these limitations, we propose a novel framework named Layer Adaptive Progressive Pruning (LAPP), which gradually compresses the network during initial training of a few epochs from scratch. In particular, LAPP designs an effective and efficient pruning strategy that introduces a learnable threshold for each layer and FLOPs constraints for network. Guided by both task loss and FLOPs constraints, the learnable thresholds are dynamically and gradually updated to accommodate changes of importance scores during training. Therefore the pruning strategy can gradually prune the network and automatically determine the appropriate pruning rates for each layer. What's more, in order to maintain the expressive power of the pruned layer, before training starts, we introduce an additional lightweight bypass for each convolutional layer to be pruned, which only adds relatively few additional burdens. Our method demonstrates superior performance gains over previous compression methods on various datasets and backbone architectures. For example, on CIFAR-10, our method compresses ResNet-20 to 40.3% without accuracy drop. 55.6% of FLOPs of ResNet-18 are reduced with 0.21% top-1 accuracy increase and 0.40% top-5 accuracy increase on ImageNet.
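The abstract describes a learnable per-layer threshold that is compared against channel importance under a task loss plus a FLOPs constraint. A rough sketch of how such a soft threshold gate could look; the importance score, the sigmoid gating, and the FLOPs penalty below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class ThresholdGate(nn.Module):
    """Soft channel gate: keep channels whose importance exceeds a learnable threshold."""
    def __init__(self, num_channels, temperature=10.0):
        super().__init__()
        self.importance = nn.Parameter(torch.ones(num_channels))  # placeholder importance scores
        self.threshold = nn.Parameter(torch.zeros(1))              # learnable per-layer threshold
        self.temperature = temperature

    def forward(self, x):
        # Smooth approximation of (importance > threshold); channels can be hard-pruned later.
        mask = torch.sigmoid(self.temperature * (self.importance.abs() - self.threshold))
        return x * mask.view(1, -1, 1, 1)

    def soft_kept_ratio(self):
        mask = torch.sigmoid(self.temperature * (self.importance.abs() - self.threshold))
        return mask.mean()

def flops_penalty(gates, target_ratio=0.45):
    """Illustrative differentiable term pushing the expected kept-channel ratio toward a budget."""
    kept = torch.stack([g.soft_kept_ratio() for g in gates]).mean()
    return torch.relu(kept - target_ratio)
```

In such a setup the total training loss would combine the task loss with a weighted `flops_penalty`, so the thresholds rise or fall gradually as importance scores change during the first epochs of training.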
IEBins: Iterative Elastic Bins for Monocular Depth Estimation
paper_authors: Shuwei Shao, Zhongcai Pei, Xingming Wu, Zhong Liu, Weihai Chen, Zhengguo Li
for: This work proposes a classification-regression-based monocular depth estimation (MDE) method.
methods: A novel iterative elastic bins (IEBins) technique performs a progressively finer-grained depth search over multiple stages, and a novel elastic target bin adapts its width to the depth uncertainty.
results: Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is available at https://github.com/ShuweiShao/IEBins.
Abstract
Monocular depth estimation (MDE) is a fundamental topic of geometric computer vision and a core technique for many downstream applications. Recently, several methods reframe the MDE as a classification-regression problem where a linear combination of probabilistic distribution and bin centers is used to predict depth. In this paper, we propose a novel concept of iterative elastic bins (IEBins) for the classification-regression-based MDE. The proposed IEBins aims to search for high-quality depth by progressively optimizing the search range, which involves multiple stages and each stage performs a finer-grained depth search in the target bin on top of its previous stage. To alleviate the possible error accumulation during the iterative process, we utilize a novel elastic target bin to replace the original target bin, the width of which is adjusted elastically based on the depth uncertainty. Furthermore, we develop a dedicated framework composed of a feature extractor and an iterative optimizer that has powerful temporal context modeling capabilities benefiting from the GRU-based architecture. Extensive experiments on the KITTI, NYU-Depth-v2 and SUN RGB-D datasets demonstrate that the proposed method surpasses prior state-of-the-art competitors. The source code is publicly available at https://github.com/ShuweiShao/IEBins.
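As background for the classification-regression formulation mentioned above, the per-pixel depth is predicted as a probability-weighted combination of bin centers; IEBins then repeats this search inside the selected target bin at each stage, with the bin width adjusted elastically to the depth uncertainty. The per-stage prediction can be written as (notation is ours, not the paper's):

\[
\hat{d} = \sum_{i=1}^{N} p_i \, c_i, \qquad \sum_{i=1}^{N} p_i = 1,
\]

where $p_i$ is the predicted probability of the $i$-th bin and $c_i$ its center.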
Masked Image Residual Learning for Scaling Deeper Vision Transformers
paper_authors: Guoxi Huang, Hongtao Fu, Adrian G. Bors
for: This work aims to ease the training of deeper Vision Transformers (ViTs) and to address the degradation problem observed in their deeper layers.
methods: The authors propose a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making it possible to scale ViT along depth for performance gains.
results: The authors show that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, they instantiate 4.5 times and 2 times deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54 achieves performance on par with ViT-Large, while ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet.
Abstract
Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in deeper layers of ViT when using masked image modeling (MIM) for pre-training. To ease the training of deeper ViTs, we introduce a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem, making scaling ViT along depth a promising direction for performance upgrade. We reformulate the pre-training objective for deeper layers of ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence showing that deeper ViTs can be effectively optimized using MIRL and easily gain accuracy from increased depth. With the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5$\times$ and 2$\times$ deeper ViTs, dubbed ViT-S-54 and ViT-B-48. The deeper ViT-S-54, costing 3$\times$ less than ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet. On one hand, deeper ViTs pre-trained with MIRL exhibit excellent generalization capabilities on downstream tasks, such as object detection and semantic segmentation. On the other hand, MIRL demonstrates high pre-training efficiency. With less pre-training time, MIRL yields competitive performance compared to other approaches.
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
for: The paper is written to address the issue of advanced text-to-image models generating unsafe content, specifically photorealistic NSFW images of political figures.
methods: The paper uses a novel framework called SurrogatePrompt, which utilizes large language models, image-to-text, and image-to-image modules to automate the creation of attack prompts that can bypass the safety filters of closed-source models like Midjourney.
results: The paper demonstrates the success of SurrogatePrompt in generating abundant photorealistic NSFW images of political figures by exploiting vulnerabilities in Midjourney's proprietary safety filter, with an 88% success rate in bypassing the filter. The generated images are found to present significant safety hazards, both subjectively and objectively.
Abstract
Advanced text-to-image models such as DALL-E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.
results: Experimental results on a multimodal macroinvertebrate image classification dataset show that the proposed multimodal method outperforms the unimodal approach. The effect of different input image sizes is also studied, and recently proposed feature diversity regularizers are shown to improve performance.
Abstract
One-class classification refers to approaches of learning using data from a single class only. In this paper, we propose a deep learning one-class classification method suitable for multimodal data, which relies on two convolutional autoencoders jointly trained to reconstruct the positive input data while keeping the data representations in the latent space as compact as possible. During inference, the distance of the latent representation of an input to the origin can be used as an anomaly score. Experimental results using a multimodal macroinvertebrate image classification dataset show that the proposed multimodal method yields better results as compared to the unimodal approach. Furthermore, we study the effect of different input image sizes and investigate how recently proposed feature diversity regularizers affect the performance of our approach. We show that such regularizers improve performance.
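The anomaly score described above is simply the distance of an input's latent representation to the origin after the compact-latent training. A minimal sketch, assuming a placeholder convolutional encoder per modality (the architecture and shapes below are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Placeholder convolutional encoder mapping an image to a compact latent vector."""
    def __init__(self, in_channels=3, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

def anomaly_score(encoders, inputs):
    """Distance of the concatenated multimodal latent representation to the origin."""
    with torch.no_grad():
        z = torch.cat([enc(x) for enc, x in zip(encoders, inputs)], dim=1)
    return z.norm(dim=1)  # larger distance from the origin -> more anomalous

# Hypothetical two-modality batch of four samples
encoders = [TinyEncoder(), TinyEncoder()]
imgs = [torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)]
print(anomaly_score(encoders, imgs))
```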
BoIR: Box-Supervised Instance Representation for Multi-Person Pose Estimation
results: The method achieves state-of-the-art performance on crowded-scene benchmarks, outperforming prior work on COCO val (0.8 AP), COCO test-dev (0.5 AP), CrowdPose (4.9 AP), and OCHuman (3.5 AP).
Abstract
Single-stage multi-person human pose estimation (MPPE) methods have shown great performance improvements, but existing methods fail to disentangle features by individual instances under crowded scenes. In this paper, we propose a bounding box-level instance representation learning called BoIR, which simultaneously solves instance detection, instance disentanglement, and instance-keypoint association problems. Our new instance embedding loss provides a learning signal on the entire area of the image with bounding box annotations, achieving globally consistent and disentangled instance representation. Our method exploits multi-task learning of bottom-up keypoint estimation, bounding box regression, and contrastive instance embedding learning, without additional computational cost during inference. BoIR is effective for crowded scenes, outperforming state-of-the-art on COCO val (0.8 AP), COCO test-dev (0.5 AP), CrowdPose (4.9 AP), and OCHuman (3.5 AP). Code will be available at https://github.com/uyoung-jeong/BoIR
Soft Mixture Denoising: Beyond the Expressive Bottleneck of Diffusion Models
methods: This paper analyzes current diffusion models and proposes a soft mixture denoising (SMD) approach for backward denoising.
results: The paper finds that current diffusion models suffer from an expressive bottleneck in backward denoising, with unbounded local and global denoising errors, and that SMD effectively resolves this issue and outperforms standard diffusion models in practice.
Abstract
Because diffusion models have shown impressive performance in a number of tasks, such as image synthesis, there is a trend in recent works to prove (under certain assumptions) that these models have strong approximation capabilities. In this paper, we show that current diffusion models actually have an expressive bottleneck in backward denoising and that some assumptions made by existing theoretical guarantees are too strong. Based on this finding, we prove that diffusion models have unbounded errors in both local and global denoising. In light of our theoretical studies, we introduce soft mixture denoising (SMD), an expressive and efficient model for backward denoising. SMD not only permits diffusion models to well approximate any Gaussian mixture distributions in theory, but also is simple and efficient to implement. Our experiments on multiple image datasets show that SMD significantly improves different types of diffusion models (e.g., DDPM), especially in the situation of few backward iterations.
AsymFormer: Asymmetrical Cross-Modal Representation Learning for Mobile Platform Real-Time RGB-D Semantic Segmentation
results: Tested on the NYUv2 and SUNRGBD datasets, AsymFormer achieves competitive results of 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. It also reaches an inference speed of 65 FPS on an RTX 3090, rising to 79 FPS after mixed-precision quantization, showing that AsymFormer strikes a balance between high accuracy and high efficiency for RGB-D semantic segmentation.
Abstract
In the realm of robotic intelligence, achieving efficient and precise RGB-D semantic segmentation is a key cornerstone. State-of-the-art multimodal semantic segmentation methods, primarily rooted in symmetrical skeleton networks, find it challenging to harmonize computational efficiency and precision. In this work, we propose AsymFormer, a novel network for real-time RGB-D semantic segmentation, which targets the minimization of superfluous parameters by optimizing the distribution of computational resources and introduces an asymmetrical backbone to allow for the effective fusion of multimodal features. Furthermore, we explore techniques to bolster network accuracy by redefining feature selection and extracting multi-modal self-similarity features without a substantial increase in the parameter count, thereby ensuring real-time execution on robotic platforms. Additionally, a Local Attention-Guided Feature Selection (LAFS) module is used to selectively fuse features from different modalities by leveraging their dependencies. Subsequently, a Cross-Modal Attention-Guided Feature Correlation Embedding (CMA) module is introduced to further extract cross-modal representations. This method is evaluated on NYUv2 and SUNRGBD datasets, with AsymFormer demonstrating competitive results with 52.0% mIoU on NYUv2 and 49.1% mIoU on SUNRGBD. Notably, AsymFormer achieves an inference speed of 65 FPS and after implementing mixed precision quantization, it attains an impressive inference speed of 79 FPS on RTX3090. This significantly outperforms existing multi-modal methods, thereby demonstrating that AsymFormer can strike a balance between high accuracy and efficiency for RGB-D semantic segmentation.
FeCAM: Exploiting the Heterogeneity of Class Distributions in Exemplar-Free Continual Learning
results: The study finds that Euclidean-distance classification works well for jointly trained features but is suboptimal when learning from non-stationary data, where feature distributions are heterogeneous. The paper therefore adopts an anisotropic Mahalanobis distance and shows that modeling feature covariance relations outperforms previous approaches that sample features from normal distributions and train a linear classifier. Unlike existing methods, the approach generalizes to both many-shot and few-shot CIL settings as well as domain-incremental settings, and achieves state-of-the-art results.
Abstract
Exemplar-free class-incremental learning (CIL) poses several challenges since it prohibits the rehearsal of data from previous tasks and thus suffers from catastrophic forgetting. Recent approaches to incrementally learning the classifier by freezing the feature extractor after the first task have gained much attention. In this paper, we explore prototypical networks for CIL, which generate new class prototypes using the frozen feature extractor and classify the features based on the Euclidean distance to the prototypes. In an analysis of the feature distributions of classes, we show that classification based on Euclidean metrics is successful for jointly trained features. However, when learning from non-stationary data, we observe that the Euclidean metric is suboptimal and that feature distributions are heterogeneous. To address this challenge, we revisit the anisotropic Mahalanobis distance for CIL. In addition, we empirically show that modeling the feature covariance relations is better than previous attempts at sampling features from normal distributions and training a linear classifier. Unlike existing methods, our approach generalizes to both many- and few-shot CIL settings, as well as to domain-incremental settings. Interestingly, without updating the backbone network, our method obtains state-of-the-art results on several standard continual learning benchmarks. Code is available at https://github.com/dipamgoswami/FeCAM.
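The classification rule described above assigns a feature to the prototype with the smallest (anisotropic) Mahalanobis distance. A minimal sketch of that decision rule with per-class covariance estimates; the shrinkage value, shapes, and toy data are illustrative assumptions, while FeCAM's exact covariance handling is in the paper and code:

```python
import numpy as np

def fit_prototypes(features, labels, shrinkage=1e-2):
    """Per-class mean and inverse covariance estimated from frozen features."""
    stats = {}
    for c in np.unique(labels):
        fc = features[labels == c]
        mean = fc.mean(axis=0)
        cov = np.cov(fc, rowvar=False) + shrinkage * np.eye(fc.shape[1])  # shrinkage for stability
        stats[c] = (mean, np.linalg.inv(cov))
    return stats

def classify(x, stats):
    """Assign x to the class with the smallest squared Mahalanobis distance to its prototype."""
    def sq_mahalanobis(mean, inv_cov):
        d = x - mean
        return float(d @ inv_cov @ d)
    return min(stats, key=lambda c: sq_mahalanobis(*stats[c]))

# Hypothetical toy usage with two classes of 8-dimensional features
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(3, 2, (50, 8))])
labs = np.array([0] * 50 + [1] * 50)
stats = fit_prototypes(feats, labs)
print(classify(rng.normal(3, 2, 8), stats))
```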
Weakly Supervised Semantic Segmentation by Knowledge Graph Inference
results: Using only image-level supervision, the method achieves state-of-the-art WSSS performance on the PASCAL VOC 2012 and MS-COCO datasets. Experiments show that the proposed graph reasoning approach effectively improves WSSS.
Abstract
Currently, existing efforts in Weakly Supervised Semantic Segmentation (WSSS) based on Convolutional Neural Networks (CNNs) have predominantly focused on enhancing the multi-label classification network stage, with limited attention given to the equally important downstream segmentation network. Furthermore, CNN-based local convolutions lack the ability to model the extensive inter-category dependencies. Therefore, this paper introduces a graph reasoning-based approach to enhance WSSS. The aim is to improve WSSS holistically by simultaneously enhancing both the multi-label classification and segmentation network stages. In the multi-label classification network segment, external knowledge is integrated, coupled with GCNs, to globally reason about inter-class dependencies. This encourages the network to uncover features in non-salient regions of images, thereby refining the completeness of generated pseudo-labels. In the segmentation network segment, the proposed Graph Reasoning Mapping (GRM) module is employed to leverage knowledge obtained from textual databases, facilitating contextual reasoning for class representation within image regions. This GRM module enhances feature representation in high-level semantics of the segmentation network's local convolutions, while dynamically learning semantic coherence for individual samples. Using solely image-level supervision, we have achieved state-of-the-art performance in WSSS on the PASCAL VOC 2012 and MS-COCO datasets. Extensive experimentation on both the multi-label classification and segmentation network stages underscores the effectiveness of the proposed graph reasoning approach for advancing WSSS.
Single Image Test-Time Adaptation for Segmentation
results: The study finds that optimizing self-supervised losses at test time, with mask refinement, improves the robustness of segmentation models. Under diverse conditions, the proposed additions yield 3.51% and 3.28% gains over non-adapted baselines, compared with only 1.7% and 2.16% without them.
Abstract
Test-Time Adaptation (TTA) methods improve the robustness of deep neural networks to domain shift on a variety of tasks such as image classification or segmentation. This work explores adapting segmentation models to a single unlabelled image with no other data available at test-time. In particular, this work focuses on adaptation by optimizing self-supervised losses at test-time. Multiple baselines based on different principles are evaluated under diverse conditions, and a novel adversarial training is introduced for adaptation with mask refinement. Our additions to the baselines result in 3.51% and 3.28% increases over non-adapted baselines; without these improvements, the increases would be only 1.7% and 2.16%.
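The specific self-supervised losses and the adversarial mask refinement are detailed in the paper itself. As one commonly used stand-in for single-image TTA, the sketch below adapts a segmentation model by minimizing the entropy of its own predictions on that image; this is a generic baseline technique, not the authors' proposed loss:

```python
import torch
import torch.nn.functional as F

def entropy_tta(model, image, steps=10, lr=1e-4):
    """Adapt a segmentation model to a single unlabelled image by entropy minimization."""
    model.train()  # keep normalization layers adaptable during the update steps
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(image.unsqueeze(0))                 # (1, C, H, W) class logits
        probs = F.softmax(logits, dim=1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return model
```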
Unveiling Fairness Biases in Deep Learning-Based Brain MRI Reconstruction
results: The study finds that the DL reconstruction model exhibits statistically significant performance biases across gender and age subgroups, and that these biases are not primarily caused by data imbalance or training discrimination. These findings help characterize unfairness in deep-learning-based image reconstruction and inform equity in medical AI applications.
Abstract
Deep learning (DL) reconstruction particularly of MRI has led to improvements in image fidelity and reduction of acquisition time. In neuroimaging, DL methods can reconstruct high-quality images from undersampled data. However, it is essential to consider fairness in DL algorithms, particularly in terms of demographic characteristics. This study presents the first fairness analysis in a DL-based brain MRI reconstruction model. The model utilises the U-Net architecture for image reconstruction and explores the presence and sources of unfairness by implementing baseline Empirical Risk Minimisation (ERM) and rebalancing strategies. Model performance is evaluated using image reconstruction metrics. Our findings reveal statistically significant performance biases between the gender and age subgroups. Surprisingly, data imbalance and training discrimination are not the main sources of bias. This analysis provides insights of fairness in DL-based image reconstruction and aims to improve equity in medical AI applications.
Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time
results: The method produces high-quality edits on a variety of videos and supports real-time rendering of the edited results. The paper also proposes feature-tracking-based evaluation metrics for objectively assessing the consistency of video editing.
Abstract
We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on the texture maps can be propagated to the corresponding locations in the entire video frames while preserving other contents' consistencies. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing. Project page: https://lightbulb12294.github.io/hashing-nvd/
Variational Inference for Scalable 3D Object-centric Learning
results: Experiments on synthetic and real datasets show that the proposed method can infer and maintain object-centric representations of 3D scenes and outperforms previous models.
Abstract
We tackle the task of scalable unsupervised object-centric representation learning on 3D scenes. Existing approaches to object-centric representation learning show limitations in generalizing to larger scenes as their learning processes rely on a fixed global coordinate system. In contrast, we propose to learn view-invariant 3D object representations in localized object coordinate systems. To this end, we estimate the object pose and appearance representation separately and explicitly map object representations across views while maintaining object identities. We adopt an amortized variational inference pipeline that can process sequential input and scalably update object latent distributions online. To handle large-scale scenes with a varying number of objects, we further introduce a Cognitive Map that allows the registration and query of objects on a per-scene global map to achieve scalable representation learning. We explore the object-centric neural radiance field (NeRF) as our 3D scene representation, which is jointly modeled within our unsupervised object-centric learning framework. Experimental results on synthetic and real datasets show that our proposed method can infer and maintain object-centric representations of 3D scenes and outperforms previous models.
Better Generalization of White Matter Tract Segmentation to Arbitrary Datasets with Scaled Residual Bootstrap
results: Experiments on two dMRI datasets show that the proposed method consistently improves the generalization of WM tract segmentation under various settings.
Abstract
White matter (WM) tract segmentation is a crucial step for brain connectivity studies. It is performed on diffusion magnetic resonance imaging (dMRI), and deep neural networks (DNNs) have achieved promising segmentation accuracy. Existing DNN-based methods use an annotated dataset for model training. However, the performance of the trained model on a different test dataset may not be optimal due to distribution shift, and it is desirable to design WM tract segmentation approaches that allow better generalization of the segmentation model to arbitrary test datasets. In this work, we propose a WM tract segmentation approach that improves the generalization with scaled residual bootstrap. The difference between dMRI scans in training and test datasets is most noticeably caused by the different numbers of diffusion gradients and noise levels. Since both of them lead to different signal-to-noise ratios (SNRs) between the training and test data, we propose to augment the training scans by adjusting the noise magnitude and develop an adapted residual bootstrap strategy for the augmentation. To validate the proposed approach, two dMRI datasets were used, and the experimental results show that our method consistently improved the generalization of WM tract segmentation under various settings.
paper_authors: Hakan Sivuk, Aysegul Dundar
for: This paper proposes a semantic-map-guided image editing method that inpaints erased pixels while remaining consistent with both the surrounding context and the semantic map.
methods: The method uses a novel style encoding mechanism that encodes the styles of visible and partially visible objects differently, improving the consistency and diversity of the generated images.
results: Compared with previous conditional image generation and semantic image editing algorithms, the method achieves clear improvements on evaluation metrics and provides more diverse results.
Abstract
Semantic image editing requires inpainting pixels following a semantic map. It is a challenging task since this inpainting requires both harmony with the context and strict compliance with the semantic maps. The majority of the previous methods proposed for this task try to encode the whole information from erased images. However, when an object is added to a scene such as a car, its style cannot be encoded from the context alone. On the other hand, the models that can output diverse generations struggle to output images that have seamless boundaries between the generated and unerased parts. Additionally, previous methods do not have a mechanism to encode the styles of visible and partially visible objects differently for better performance. In this work, we propose a framework that can encode visible and partially visible objects with a novel mechanism to achieve consistency in the style encoding and final generations. We extensively compare with previous conditional image generation and semantic image editing algorithms. Our extensive experiments show that our method significantly improves over the state-of-the-art. Our method not only achieves better quantitative results but also provides diverse results. Please refer to the project web page for the released code and demo: https://github.com/hakansivuk/DivSem.
Egocentric RGB+Depth Action Recognition in Industry-Like Settings
methods: We adopt a 3D Video Swin Transformer to effectively encode the RGB and Depth modalities, propose a training strategy based on an exponentially decaying variant of the focal loss modulating factor, and use late fusion to combine the predictions of the two modalities.
results: We extensively evaluate the method on the MECCANO dataset, where it significantly outperforms prior work and secured first place in the multimodal action recognition challenge at ICIAP 2023.
Abstract
Action recognition from an egocentric viewpoint is a crucial perception task in robotics and enables a wide range of human-robot interactions. While most computer vision approaches prioritize the RGB camera, the Depth modality - which can further amplify the subtleties of actions from an egocentric perspective - remains underexplored. Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment. To study this problem, we consider the recent MECCANO dataset, which provides a wide range of assembling actions. Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively. To address the inherent skewness in real-world multimodal action occurrences, we propose a training strategy using an exponentially decaying variant of the focal loss modulating factor. Additionally, to leverage the information in both RGB and Depth modalities, we opt for late fusion to combine the predictions from each modality. We thoroughly evaluate our method on the action recognition task of the MECCANO dataset, and it significantly outperforms the prior work. Notably, our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
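For reference, the standard focal loss that the training strategy builds on is shown below; the term $(1 - p_t)^{\gamma}$ is the modulating factor whose exponentially decaying variant the paper introduces, and the exact decay schedule is not given in the abstract, so it is not reproduced here:

\[
\mathrm{FL}(p_t) = -\,(1 - p_t)^{\gamma} \, \log(p_t),
\]

where $p_t$ is the predicted probability of the true class and $\gamma \ge 0$ controls how strongly well-classified examples are down-weighted.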
In-Domain GAN Inversion for Faithful Reconstruction and Editability
results: The paper shows that in-domain inversion lets the knowledge learned by the GAN be reused for real image editing, enabling a wide range of editing applications without retraining the GAN model.
Abstract
Generative Adversarial Networks (GANs) have significantly advanced image synthesis through mapping randomly sampled latent codes to high-fidelity synthesized images. However, applying well-trained GANs to real image editing remains challenging. A common solution is to find an approximate latent code that can adequately recover the input image to edit, which is also known as GAN inversion. To invert a GAN model, prior works typically focus on reconstructing the target image at the pixel level, yet few studies are conducted on whether the inverted result can well support manipulation at the semantic level. This work fills in this gap by proposing in-domain GAN inversion, which consists of a domain-guided encoder and a domain-regularized optimizer, to regularize the inverted code in the native latent space of the pre-trained GAN model. In this way, we manage to sufficiently reuse the knowledge learned by GANs for image reconstruction, facilitating a wide range of editing applications without any retraining. We further make comprehensive analyses on the effects of the encoder structure, the starting inversion point, as well as the inversion parameter space, and observe the trade-off between the reconstruction quality and the editing property. Such a trade-off sheds light on how a GAN model represents an image with various semantics encoded in the learned latent distribution. Code, models, and demo are available at the project page: https://genforce.github.io/idinvert/.
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
paper_authors: Jiangliu Wang, Jianbo Jiao, Yibing Song, Stephen James, Zhan Tong, Chongjian Ge, Pieter Abbeel, Yun-hui Liu
for: This work aims to improve unsupervised audio-visual pre-training.
methods: We propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data.
results: Experimental results show that the proposed method significantly improves the learned representations.
Abstract
This work aims to improve unsupervised audio-visual pre-training. Inspired by the efficacy of data augmentation in visual contrastive learning, we propose a novel speed co-augmentation method that randomly changes the playback speeds of both audio and video data. Despite its simplicity, the speed co-augmentation method possesses two compelling attributes: (1) it increases the diversity of audio-visual pairs and doubles the size of negative pairs, resulting in a significant enhancement in the learned representations, and (2) it changes the strict correlation between audio-visual pairs but introduces a partial relationship between the augmented pairs, which is modeled by our proposed SoftInfoNCE loss to further boost the performance. Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
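A minimal sketch of the co-augmentation idea described above: draw a playback-speed factor and resample both streams of a clip with it. The sampling range, the interpolation choices, and whether audio and video share the same factor are assumptions made for illustration, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def speed_co_augment(video, audio, speed_range=(0.5, 2.0)):
    """Randomly change the playback speed of a paired video/audio clip.

    video: (T, C, H, W) frames, audio: (L,) waveform samples.
    """
    speed = torch.empty(1).uniform_(*speed_range).item()

    # Video: resample the temporal axis by nearest-frame indexing.
    t = video.shape[0]
    new_t = max(1, int(round(t / speed)))
    idx = torch.linspace(0, t - 1, new_t).round().long()
    video_aug = video[idx]

    # Audio: linear interpolation of the waveform to the new length.
    new_l = max(1, int(round(audio.shape[0] / speed)))
    audio_aug = F.interpolate(audio.view(1, 1, -1), size=new_l,
                              mode="linear", align_corners=False).view(-1)
    return video_aug, audio_aug, speed

# Hypothetical clip: 32 frames and 1 second of 16 kHz audio
v, a, s = speed_co_augment(torch.randn(32, 3, 112, 112), torch.randn(16000))
print(v.shape, a.shape, s)
```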
A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution
results: Experiments show that our model achieves state-of-the-art performance on multiple datasets.
Abstract
Effective aggregation of temporal information of consecutive frames is the core of achieving video super-resolution. Many scholars have utilized structures such as sliding windows and recurrent to gather spatio-temporal information of frames. However, although the performance of the constructed VSR models is improving, the size of the models is also increasing, exacerbating the demand on the equipment. Thus, to reduce the stress on the device, we propose a novel lightweight recurrent grouping attention network. The parameters of this model are only 0.878M, which is much lower than the current mainstream model for studying video super-resolution. We design forward feature extraction module and backward feature extraction module to collect temporal information between consecutive frames from two directions. Moreover, a new grouping mechanism is proposed to efficiently collect spatio-temporal information of the reference frame and its neighboring frames. The attention supplementation module is presented to further enhance the information gathering range of the model. The feature reconstruction module aims to aggregate information from different directions to reconstruct high-resolution features. Experiments demonstrate that our model achieves state-of-the-art performance on multiple datasets.
Recursive Counterfactual Deconfounding for Object Recognition
results: On both closed-set and open-set recognition tasks, the proposed RCD model outperforms 11 baseline models in most cases, further improving the accuracy and generalization of object recognition.
Abstract
Image recognition is a classic and common task in the computer vision field, which has been widely applied in the past decade. Most existing methods in literature aim to learn discriminative features from labeled images for classification, however, they generally neglect confounders that infiltrate into the learned features, resulting in low performances for discriminating test images. To address this problem, we propose a Recursive Counterfactual Deconfounding model for object recognition in both closed-set and open-set scenarios based on counterfactual analysis, called RCD. The proposed model consists of a factual graph and a counterfactual graph, where the relationships among image features, model predictions, and confounders are built and updated recursively for learning more discriminative features. It performs in a recursive manner so that subtler counterfactual features could be learned and eliminated progressively, and both the discriminability and generalization of the proposed model could be improved accordingly. In addition, a negative correlation constraint is designed for alleviating the negative effects of the counterfactual features further at the model training stage. Extensive experimental results on both closed-set recognition task and open-set recognition task demonstrate that the proposed RCD model performs better than 11 state-of-the-art baselines significantly in most cases.
Subspace-Aware Feature Reconstruction for Unsupervised Anomaly Localization
results: Experimental results show that the method achieves anomaly localization performance competitive with state-of-the-art approaches while adaptively approximating target features from only a small number of samples.
Abstract
Unsupervised anomaly localization, which plays a critical role in industrial manufacturing, is to identify anomalous regions that deviate from patterns established exclusively from nominal samples. Recent mainstream methods focus on approximating the target feature distribution by leveraging embeddings from ImageNet models. However, a common issue in many anomaly localization methods is the lack of adaptability of the feature approximations to specific targets. Consequently, their ability to effectively identify anomalous regions relies significantly on the data coverage provided by the finite resources in a memory bank. In this paper, we propose a novel subspace-aware feature reconstruction framework for anomaly localization. To achieve adaptive feature approximation, our proposed method involves the reconstruction of the feature representation through the self-expressive model designed to learn low-dimensional subspaces. Importantly, the sparsity of the subspace representation contributes to covering feature patterns from the same subspace with fewer resources, leading to a reduction in the memory bank. Extensive experiments across three industrial benchmark datasets demonstrate that our approach achieves competitive anomaly localization performance compared to state-of-the-art methods by adaptively reconstructing target features with a small number of samples.
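For context, the self-expressive model referenced above is commonly formulated as reconstructing each feature as a sparse combination of the other features lying in the same union of low-dimensional subspaces. One standard form is shown below; whether the paper uses exactly this objective or a learned variant of it is not specified in the abstract:

\[
\min_{C} \; \tfrac{1}{2}\,\lVert X - XC \rVert_F^2 + \lambda \lVert C \rVert_1
\quad \text{s.t.} \quad \operatorname{diag}(C) = 0,
\]

where the columns of $X$ are feature vectors and the sparsity of $C$ lets features from the same subspace be covered with few coefficients, which is what allows the memory bank to shrink.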
Bitstream-Corrupted Video Recovery: A Novel Benchmark Dataset and Method
results: By evaluating existing video inpainting methods, the paper reveals their limitations in recovering real-world bitstream-corrupted videos and demonstrates the advantages of the proposed framework for this problem.
Abstract
The past decade has witnessed great strides in video recovery by specialist technologies, like video inpainting, completion, and error concealment. However, they typically simulate the missing content by manual-designed error masks, thus failing to fill in the realistic video loss in video communication (e.g., telepresence, live streaming, and internet video) and multimedia forensics. To address this, we introduce the bitstream-corrupted video (BSCV) benchmark, the first benchmark dataset with more than 28,000 video clips, which can be used for bitstream-corrupted video recovery in the real world. The BSCV is a collection of 1) a proposed three-parameter corruption model for video bitstream, 2) a large-scale dataset containing rich error patterns, multiple corruption levels, and flexible dataset branches, and 3) a plug-and-play module in video recovery framework that serves as a benchmark. We evaluate state-of-the-art video inpainting methods on the BSCV dataset, demonstrating existing approaches' limitations and our framework's advantages in solving the bitstream-corrupted video recovery problem. The benchmark and dataset are released at https://github.com/LIUTIGHE/BSCV-Dataset.
Skip-Connected Neural Networks with Layout Graphs for Floor Plan Auto-Generation
paper_authors: Yuntae Jeon, Dai Quoc Tran, Seunghee Park
for: automated and efficient floor plan designs
methods: skip-connected neural networks integrated with layout graphs
results: 93.9 mIoU score in the 1st CVAAD workshop challenge
Abstract
With the advent of AI and computer vision techniques, the quest for automated and efficient floor plan designs has gained momentum. This paper presents a novel approach using skip-connected neural networks integrated with layout graphs. The skip-connected layers capture multi-scale floor plan information, and the encoder-decoder networks with GNN facilitate pixel-level probability-based generation. Validated on the MSD dataset, our approach achieved a 93.9 mIoU score in the 1st CVAAD workshop challenge. Code and pre-trained models are publicly available at https://github.com/yuntaeJ/SkipNet-FloorPlanGe.
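For reference, the reported mIoU is the mean over classes of the intersection-over-union between the predicted and ground-truth label maps. A minimal sketch of the metric; the toy label maps and class encoding are placeholders, not MSD data:

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean intersection-over-union across classes; classes absent from both maps are skipped."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Hypothetical 2-class floor plan label maps
pred = np.array([[0, 0, 1], [1, 1, 1]])
target = np.array([[0, 1, 1], [1, 1, 0]])
print(mean_iou(pred, target, n_classes=2))
```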
Attention and Pooling based Sigmoid Colon Segmentation in 3D CT images
results: The results show that the PyP and csSE techniques improve segmentation precision, and that ensemble methods improve it further. Overall, the study indicates that the modified 3D U-Net architecture is effective for segmenting the sigmoid colon in CT images.
Abstract
Segmentation of the sigmoid colon is a crucial aspect of treating diverticulitis. It enables accurate identification and localisation of inflammation, which in turn helps healthcare professionals make informed decisions about the most appropriate treatment options. This research presents a novel deep learning architecture for segmenting the sigmoid colon from Computed Tomography (CT) images using a modified 3D U-Net architecture. Several variations of the 3D U-Net model with modified hyper-parameters were examined in this study. Pyramid pooling (PyP) and channel-spatial Squeeze and Excitation (csSE) were also used to improve the model performance. The networks were trained using manually annotated sigmoid colon. A five-fold cross-validation procedure was used on a test dataset to evaluate the network's performance. As indicated by the maximum Dice similarity coefficient (DSC) of 56.92+/-1.42%, the application of PyP and csSE techniques improves segmentation precision. We explored ensemble methods including averaging, weighted averaging, majority voting, and max ensemble. The results show that average and majority voting approaches with a threshold value of 0.5 and consistent weight distribution among the top three models produced comparable and optimal results with DSC of 88.11+/-3.52%. The results indicate that the application of a modified 3D U-Net architecture is effective for segmenting the sigmoid colon in Computed Tomography (CT) images. In addition, the study highlights the potential benefits of integrating ensemble methods to improve segmentation precision.
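The ensembling step described above is straightforward to sketch. The snippet below shows average and majority-voting ensembles with a 0.5 threshold and the Dice similarity coefficient, using random probability volumes as placeholders for real model outputs.

```python
# Sketch of the ensembling strategies described above and the Dice metric.
# Model outputs are random probability volumes purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random(size=(3, 32, 64, 64))       # 3 models, per-voxel probabilities
truth = rng.random(size=(32, 64, 64)) > 0.5    # dummy ground-truth mask

def dice(pred, target, eps=1e-7):
    pred, target = pred.astype(bool), target.astype(bool)
    return (2.0 * np.logical_and(pred, target).sum() + eps) / (pred.sum() + target.sum() + eps)

# Average ensemble: mean probability, then threshold at 0.5.
avg_mask = probs.mean(axis=0) > 0.5

# Majority voting: threshold each model at 0.5, keep voxels most models agree on.
votes = (probs > 0.5).sum(axis=0)
maj_mask = votes >= 2                           # at least 2 of the 3 models

print("average DSC :", dice(avg_mask, truth))
print("majority DSC:", dice(maj_mask, truth))
```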
On Calibration of Modern Quantized Efficient Neural Networks
results: Calibration quality is found to track quantization quality, with degradation especially pronounced in the 4-bit activation regime. GhostNet-VGG proves the most robust architecture, maintaining comparatively good calibration at low precision. Temperature scaling can improve calibration error, with some caveats.
Abstract
We explore calibration properties at various precisions for three architectures: ShuffleNetv2, GhostNet-VGG, and MobileOne; and two datasets: CIFAR-100 and PathMNIST. The quality of calibration is observed to track the quantization quality; it is well-documented that performance worsens with lower precision, and we observe a similar correlation with poorer calibration. This becomes especially egregious at 4-bit activation regime. GhostNet-VGG is shown to be the most robust to overall performance drop at lower precision. We find that temperature scaling can improve calibration error for quantized networks, with some caveats. We hope that these preliminary insights can lead to more opportunities for explainable and reliable EdgeML.
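For readers unfamiliar with the calibration tools mentioned, the sketch below fits a single temperature on synthetic logits by minimizing negative log-likelihood and reports the expected calibration error before and after. In practice the logits would come from the quantized network on a held-out split, and the binning scheme shown is one common convention, not necessarily the paper's.

```python
# Sketch of temperature scaling and expected calibration error (ECE) on
# synthetic, deliberately overconfident logits.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, k = 2000, 10
logits = rng.normal(scale=3.0, size=(n, k))                              # overconfident
labels = (logits + rng.normal(scale=3.0, size=(n, k))).argmax(axis=1)    # imperfect labels

def nll(T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), labels].mean()

def ece(T, bins=15):
    z = logits / T
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    conf, pred = p.max(axis=1), p.argmax(axis=1)
    err = 0.0
    for lo in np.linspace(0.0, 1.0, bins, endpoint=False):
        m = (conf > lo) & (conf <= lo + 1.0 / bins)
        if m.any():
            err += m.mean() * abs(conf[m].mean() - (pred[m] == labels[m]).mean())
    return err

T_opt = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
print(f"T* = {T_opt:.2f}  ECE before = {ece(1.0):.3f}  after = {ece(T_opt):.3f}")
```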
SuPerPM: A Large Deformation-Robust Surgical Perception Framework Based on Deep Point Matching Learned from Physical Constrained Simulation Data
paper_authors: Shan Lin, Albert J. Miao, Ali Alabiad, Fei Liu, Kaiyuan Wang, Jingpei Lu, Florian Richter, Michael C. Yip
for: Better tracking and reconstruction of surgical scenes, reducing errors when tracking and reconstructing tissue undergoing large deformations.
methods: Learning-based non-rigid point cloud matching for data association, which accommodates large deformations.
results: Superior performance on challenging surgical datasets with large tissue deformations, outperforming state-of-the-art surgical scene tracking algorithms.
Abstract
Manipulation of tissue with surgical tools often results in large deformations that current methods in tracking and reconstructing algorithms have not effectively addressed. A major source of tracking errors during large deformations stems from wrong data association between observed sensor measurements with previously tracked scene. To mitigate this issue, we present a surgical perception framework, SuPerPM, that leverages learning-based non-rigid point cloud matching for data association, thus accommodating larger deformations. The learning models typically require training data with ground truth point cloud correspondences, which is challenging or even impractical to collect in surgical environments. Thus, for tuning the learning model, we gather endoscopic data of soft tissue being manipulated by a surgical robot and then establish correspondences between point clouds at different time points to serve as ground truth. This was achieved by employing a position-based dynamics (PBD) simulation to ensure that the correspondences adhered to physical constraints. The proposed framework is demonstrated on several challenging surgical datasets that are characterized by large deformations, achieving superior performance over state-of-the-art surgical scene tracking algorithms.
Adversarial Attacks on Video Object Segmentation with Hard Region Discovery
methods: An object-agnostic adversary based on a first-frame attack, which discovers easily confused (hard) regions to generate perturbations with stronger adversarial power.
results: Experiments show that the attacker significantly degrades the performance of several state-of-the-art video object segmentation models.
Abstract
Video object segmentation has been applied to various computer vision tasks, such as video editing, autonomous driving, and human-robot interaction. However, methods based on deep neural networks are vulnerable to adversarial examples: inputs perturbed by almost human-imperceptible changes with which the adversary (i.e., attacker) fools the segmentation model into making incorrect pixel-level predictions. This raises security issues in highly demanding tasks, because small perturbations to the input video carry potential attack risks. Though adversarial examples have been extensively studied for classification, they are rarely studied in video object segmentation. Existing related methods in computer vision either require prior knowledge of categories or cannot be directly applied due to their special design for certain tasks, failing to consider pixel-wise region attacks. Hence, this work develops an object-agnostic adversary that has adversarial impacts on VOS by attacking the first frame via hard region discovery. In particular, the gradients from the segmentation model are exploited to discover the easily confused region, in which it is difficult to distinguish pixel-wise objects from the background in a frame. This provides a hardness map that helps to generate perturbations with a stronger adversarial power for attacking the first frame. Empirical studies on three benchmarks indicate that our attacker significantly degrades the performance of several state-of-the-art video object segmentation models.
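The hard-region idea can be sketched generically: back-propagate a segmentation loss on the first frame, treat the per-pixel gradient magnitude as a hardness map, and restrict an FGSM-style perturbation to the hardest pixels. The model below is a placeholder conv net, not the paper's attacker, and the threshold and epsilon values are illustrative.

```python
# Sketch of gradient-based hard-region discovery on a first frame, with an
# FGSM-style perturbation restricted to that region.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 2, 1))             # 2 classes: object vs background

frame = torch.rand(1, 3, 64, 64, requires_grad=True)    # first frame
target = torch.randint(0, 2, (1, 64, 64))               # pseudo ground-truth mask

loss = F.cross_entropy(model(frame), target)
loss.backward()

# Hardness map: per-pixel gradient magnitude; large values mark regions where
# small input changes most affect the prediction (easily confused pixels).
hardness = frame.grad.abs().sum(dim=1, keepdim=True)     # (1, 1, H, W)
thresh = hardness.flatten().kthvalue(int(0.8 * hardness.numel())).values
mask = (hardness >= thresh).float()                      # top ~20% hardest pixels

epsilon = 8.0 / 255.0
adv_frame = (frame + epsilon * frame.grad.sign() * mask).clamp(0, 1).detach()
print("perturbed pixels:", int(mask.sum().item()))
```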
DISeR: Designing Imaging Systems with Reinforcement Learning
results: The study shows that automatically searching over the components of an imaging system yields higher task performance. Experiments on depth estimation and camera rig design show that the resulting configurations outperform industry-wide standards.
Abstract
Imaging systems consist of cameras to encode visual information about the world and perception models to interpret this encoding. Cameras contain (1) illumination sources, (2) optical elements, and (3) sensors, while perception models use (4) algorithms. Directly searching over all combinations of these four building blocks to design an imaging system is challenging due to the size of the search space. Moreover, cameras and perception models are often designed independently, leading to sub-optimal task performance. In this paper, we formulate these four building blocks of imaging systems as a context-free grammar (CFG), which can be automatically searched over with a learned camera designer to jointly optimize the imaging system with task-specific perception models. By transforming the CFG to a state-action space, we then show how the camera designer can be implemented with reinforcement learning to intelligently search over the combinatorial space of possible imaging system configurations. We demonstrate our approach on two tasks, depth estimation and camera rig design for autonomous vehicles, showing that our method yields rigs that outperform industry-wide standards. We believe that our proposed approach is an important step towards automating imaging system design.
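As a hedged sketch of the grammar formulation, the snippet below writes a toy imaging-system CFG and samples configurations from it; a learned camera designer would instead choose each production as an action under a task-specific reward. All component names are invented for illustration.

```python
# Sketch: a toy context-free grammar over imaging-system building blocks and a
# random sampler. A learned camera designer would pick productions (actions)
# instead of sampling them at random.
import random

GRAMMAR = {
    "SYSTEM": [["ILLUM", "OPTICS", "SENSOR", "ALGO"]],
    "ILLUM":  [["ambient"], ["active_ir"], ["structured_light"]],
    "OPTICS": [["wide_lens"], ["tele_lens"], ["OPTICS", "nd_filter"]],  # recursive rule
    "SENSOR": [["rgb_sensor"], ["mono_sensor"], ["event_sensor"]],
    "ALGO":   [["stereo_matcher"], ["monodepth_net"]],
}

def sample(symbol, rng, depth=0, max_depth=5):
    if symbol not in GRAMMAR:                  # terminal symbol
        return [symbol]
    rules = GRAMMAR[symbol]
    if depth >= max_depth:                     # cut off unbounded recursion
        rules = [r for r in rules if symbol not in r] or rules
    out = []
    for s in rng.choice(rules):
        out.extend(sample(s, rng, depth + 1, max_depth))
    return out

rng = random.Random(3)
for _ in range(3):
    print(sample("SYSTEM", rng))
```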
Tuning Multi-mode Token-level Prompt Alignment across Modalities
results: Superior generalization and few-shot abilities compared with common methods on various image recognition benchmarks; qualitative analysis shows that the learned prompt tokens capture diverse visual concepts.
Abstract
Advancements in prompt tuning of vision-language models have underscored their potential in enhancing open-world visual concept comprehension. However, prior works only primarily focus on single-mode (only one prompt for each modality) and holistic level (image or sentence) semantic alignment, which fails to capture the sample diversity, leading to sub-optimal prompt discovery. To address the limitation, we propose a multi-mode token-level tuning framework that leverages the optimal transportation to learn and align a set of prompt tokens across modalities. Specifically, we rely on two essential factors: 1) multi-mode prompts discovery, which guarantees diverse semantic representations, and 2) token-level alignment, which helps explore fine-grained similarity. Consequently, the similarity can be calculated as a hierarchical transportation problem between the modality-specific sets. Extensive experiments on popular image recognition benchmarks show the superior generalization and few-shot abilities of our approach. The qualitative analysis demonstrates that the learned prompt tokens have the ability to capture diverse visual concepts.
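Token-level alignment via optimal transport can be illustrated with a small Sinkhorn solver between a set of prompt-token embeddings and a set of visual-token embeddings, using cosine distance as the cost. The embedding sizes, the uniform marginals, and the regularization value are assumptions, not the paper's exact formulation.

```python
# Sketch: token-level alignment between prompt-token embeddings and visual-token
# embeddings via entropic optimal transport (Sinkhorn iterations).
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic OT with uniform marginals; returns the transport plan."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m        # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
prompt_tokens = rng.normal(size=(4, 64))          # 4 learned prompt tokens
visual_tokens = rng.normal(size=(16, 64))         # 16 image patch tokens

norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
cost = 1.0 - norm(prompt_tokens) @ norm(visual_tokens).T   # cosine distance

plan = sinkhorn(cost)
similarity = np.sum(plan * (1.0 - cost))          # OT-aligned similarity score
print(plan.shape, f"similarity={similarity:.3f}")
```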
Traj-LO: In Defense of LiDAR-Only Odometry Using an Effective Continuous-Time Trajectory
results: The method is robust and effective for different kinds of LiDAR as well as multi-LiDAR systems, achieving good results even when the kinematic state exceeds the IMU's measuring range. The implementation is open-sourced on GitHub.
Abstract
LiDAR odometry is an essential component in many robotic applications. Unlike mainstream approaches that focus on improving accuracy with additional inertial sensors, this letter explores the capability of LiDAR-only odometry from a continuous-time perspective. Firstly, the LiDAR measurements are regarded as streaming points continuously captured at high frequency. Secondly, the LiDAR movement is parameterized by a simple yet effective continuous-time trajectory. Therefore, our proposed Traj-LO approach tries to recover the spatially and temporally consistent movement of the LiDAR by tightly coupling the geometric information from LiDAR points and the kinematic constraints from trajectory smoothness. This framework is generalized for different kinds of LiDAR as well as multi-LiDAR systems. Extensive experiments on public datasets demonstrate the robustness and effectiveness of our proposed LiDAR-only approach, even in scenarios where the kinematic state exceeds the IMU's measuring range. Our implementation is open-sourced on GitHub.
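The continuous-time viewpoint can be sketched as per-point pose interpolation: every LiDAR point gets its own pose by evaluating the trajectory at the point's timestamp. The snippet below uses linear translation interpolation plus SLERP between two control poses, which is a simplification of the paper's trajectory parameterization; the poses and point cloud are synthetic.

```python
# Sketch: assign each LiDAR point its own pose by interpolating a continuous-time
# trajectory at the point's timestamp (linear translation + SLERP rotation).
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Two control poses bracketing one sweep (times in seconds).
t_ctrl = np.array([0.0, 0.1])
R_ctrl = Rotation.from_euler("z", [0.0, 5.0], degrees=True)
p_ctrl = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
slerp = Slerp(t_ctrl, R_ctrl)

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))                   # raw points in sensor frame
stamps = rng.uniform(0.0, 0.1, size=1000)             # per-point capture times

def deskew(points, stamps):
    """Transform each point into the world frame using its own interpolated pose."""
    alphas = (stamps - t_ctrl[0]) / (t_ctrl[1] - t_ctrl[0])
    trans = (1 - alphas)[:, None] * p_ctrl[0] + alphas[:, None] * p_ctrl[1]
    rots = slerp(stamps)                               # one rotation per point
    return rots.apply(points) + trans

world_points = deskew(points, stamps)
print(world_points.shape)
```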
Fill the K-Space and Refine the Image: Prompting for Dynamic and Multi-Contrast MRI Reconstruction
results: Compared with previous state-of-the-art accelerated MRI reconstruction methods, the proposed method achieves significant improvements and adapts better to different input types and conditioning parameters.
Abstract
The key to dynamic or multi-contrast magnetic resonance imaging (MRI) reconstruction lies in exploring inter-frame or inter-contrast information. Currently, the unrolled model, an approach combining iterative MRI reconstruction steps with learnable neural network layers, stands as the best-performing method for MRI reconstruction. However, there are two main limitations to overcome: firstly, the unrolled model structure and GPU memory constraints restrict the capacity of each denoising block in the network, impeding the effective extraction of detailed features for reconstruction; secondly, the existing model lacks the flexibility to adapt to variations in the input, such as different contrasts, resolutions or views, necessitating the training of separate models for each input type, which is inefficient and may lead to insufficient reconstruction. In this paper, we propose a two-stage MRI reconstruction pipeline to address these limitations. The first stage involves filling the missing k-space data, which we approach as a physics-based reconstruction problem. We first propose a simple yet efficient baseline model, which utilizes adjacent frames/contrasts and channel attention to capture the inherent inter-frame/-contrast correlation. Then, we extend the baseline model to a prompt-based learning approach, PromptMR, for all-in-one MRI reconstruction from different views, contrasts, adjacent types, and acceleration factors. The second stage is to refine the reconstruction from the first stage, which we treat as a general video restoration problem to further fuse features from neighboring frames/contrasts in the image domain. Extensive experiments show that our proposed method significantly outperforms previous state-of-the-art accelerated MRI reconstruction methods.
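The first-stage "fill the missing k-space" step rests on a standard data-consistency operation, sketched below: transform an image estimate to k-space, keep the estimate only where no data were acquired, and re-insert the acquired samples. The network's output is replaced by a simple smoothing filter purely for illustration, and the sampling mask is a made-up column pattern.

```python
# Sketch of a k-space data-consistency step on a single-coil, Cartesian example.
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)
H, W = 128, 128
image_gt = rng.normal(size=(H, W))                        # stand-in fully sampled image
kspace_full = np.fft.fftshift(np.fft.fft2(image_gt))

# Undersampling mask: keep a quarter of the phase-encode lines (columns).
mask = np.zeros((H, W), dtype=bool)
mask[:, rng.choice(W, size=W // 4, replace=False)] = True
kspace_acq = kspace_full * mask

def data_consistency(image_est, kspace_acq, mask):
    k_est = np.fft.fftshift(np.fft.fft2(image_est))
    k_dc = np.where(mask, kspace_acq, k_est)              # trust acquired samples
    return np.fft.ifft2(np.fft.ifftshift(k_dc)).real

zero_filled = np.fft.ifft2(np.fft.ifftshift(kspace_acq)).real
image_est = uniform_filter(zero_filled, size=3)           # placeholder for a network output
recon = data_consistency(image_est, kspace_acq, mask)
print("estimate error:", np.abs(image_est - image_gt).mean(),
      "after DC:", np.abs(recon - image_gt).mean())
```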
IBVC: Interpolation-driven B-frame Video Compression
for: The paper aims to improve B-frame video compression by addressing inaccurate quantized motions and inefficient motion compensation in previous learned approaches.
methods: The proposed method, Interpolation-driven B-frame Video Compression (IBVC), involves two major operations: video frame interpolation and artifact reduction compression. It uses a bit-rate free MEMC based on interpolation and a residual guided masking encoder to adaptively select meaningful contexts with interpolated multi-scale dependencies.
results: The experimental results on B-frame coding demonstrate that IBVC has significant improvements compared to relevant state-of-the-art methods, and can save bit rates compared with the random access (RA) configuration of H.266 (VTM).
Abstract
Learned B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction. However, previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation or video frame interpolation. They suffer from inaccurate quantized motions and inefficient motion compensation. To address these issues, we propose a simple yet effective structure called Interpolation-driven B-frame Video Compression (IBVC). Our approach only involves two major operations: video frame interpolation and artifact reduction compression. IBVC introduces a bit-rate free MEMC based on interpolation, which avoids optical-flow quantization and additional compression distortions. Later, to reduce duplicate bit-rate consumption and focus on unaligned artifacts, a residual guided masking encoder is deployed to adaptively select the meaningful contexts with interpolated multi-scale dependencies. In addition, a conditional spatio-temporal decoder is proposed to eliminate location errors and artifacts instead of using MEMC coding in other methods. The experimental results on B-frame coding demonstrate that IBVC has significant improvements compared to the relevant state-of-the-art methods. Meanwhile, our approach can save bit rates compared with the random access (RA) configuration of H.266 (VTM). The code will be available at https://github.com/ruhig6/IBVC.
PARTICLE: Part Discovery and Contrastive Learning for Fine-grained Recognition
results: The approach improves performance on image classification and part segmentation tasks. For example, under a linear-evaluation scheme, the classification accuracy of a ResNet50 trained on ImageNet with the DetCon self-supervised learning approach improves from 35.4% to 42.0% on Caltech-UCSD Birds, from 35.5% to 44.1% on FGVC Aircraft, and from 29.7% to 37.4% on Stanford Cars.
Abstract
We develop techniques for refining representations for fine-grained classification and segmentation tasks in a self-supervised manner. We find that fine-tuning methods based on instance-discriminative contrastive learning are not as effective, and posit that recognizing part-specific variations is crucial for fine-grained categorization. We present an iterative learning approach that incorporates part-centric equivariance and invariance objectives. First, pixel representations are clustered to discover parts. We analyze the representations from convolutional and vision transformer networks that are best suited for this task. Then, a part-centric learning step aggregates and contrasts representations of parts within an image. We show that this improves the performance on image classification and part segmentation tasks across datasets. For example, under a linear-evaluation scheme, the classification accuracy of a ResNet50 trained on ImageNet using DetCon, a self-supervised learning approach, improves from 35.4% to 42.0% on the Caltech-UCSD Birds, from 35.5% to 44.1% on the FGVC Aircraft, and from 29.7% to 37.4% on the Stanford Cars. We also observe significant gains in few-shot part segmentation tasks using the proposed technique, while instance-discriminative learning was not as effective. Smaller, yet consistent, improvements are also observed for stronger networks based on transformers.
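The part-discovery step (clustering pixel representations into parts, then aggregating within each part) can be sketched as follows; the dense features are random stand-ins for a backbone's output, and the number of parts is an assumed hyper-parameter.

```python
# Sketch of the part-discovery step: cluster dense pixel features into "parts"
# with k-means and average-pool features within each discovered part.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
H, W, C = 28, 28, 64
feats = rng.normal(size=(H, W, C))                    # per-pixel features from a backbone

n_parts = 4
flat = feats.reshape(-1, C)
part_ids = KMeans(n_clusters=n_parts, n_init=10, random_state=0).fit_predict(flat)
part_map = part_ids.reshape(H, W)                     # pseudo part-segmentation mask

# Part-centric representations: average feature within each discovered part.
part_feats = np.stack([flat[part_ids == k].mean(axis=0) for k in range(n_parts)])
print(part_map.shape, part_feats.shape)               # (28, 28) (4, 64)
```

A part-centric contrastive objective would then pull together and push apart these pooled part features across augmented views, rather than whole-image embeddings.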
MMA-Net: Multiple Morphology-Aware Network for Automated Cobb Angle Measurement
results: On the AASCE challenge dataset, the method achieves an SMAPE of 7.28% and an MAE of 3.18°, outperforming competing methods.
Abstract
Scoliosis diagnosis and assessment depend largely on the measurement of the Cobb angle in spine X-ray images. With the emergence of deep learning techniques that employ landmark detection, tilt prediction, and spine segmentation, automated Cobb angle measurement has become increasingly popular. However, these methods encounter difficulties such as high noise sensitivity, intricate computational procedures, and exclusive reliance on a single type of morphological information. In this paper, we introduce the Multiple Morphology-Aware Network (MMA-Net), a novel framework that improves Cobb angle measurement accuracy by integrating multiple spine morphology as attention information. In the MMA-Net, we first feed spine X-ray images into the segmentation network to produce multiple morphological information (spine region, centerline, and boundary) and then concatenate the original X-ray image with the resulting segmentation maps as input for the regression module to perform precise Cobb angle measurement. Furthermore, we devise joint loss functions for our segmentation and regression network training, respectively. We evaluate our method on the AASCE challenge dataset and achieve superior performance with the SMAPE of 7.28% and the MAE of 3.18{\deg}, indicating a strong competitiveness compared to other outstanding methods. Consequently, we can offer clinicians automated, efficient, and reliable Cobb angle measurement.
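Two concrete pieces of the pipeline are easy to sketch: stacking the X-ray with the three morphology maps as the regression input, and the SMAPE/MAE metrics used for evaluation. The SMAPE convention shown (per-image ratio of summed absolute errors over the three Cobb angles) is one common choice and may differ in detail from the challenge's official script; all shapes are illustrative.

```python
# Sketch: build the regression input by concatenating the X-ray with the three
# morphology maps, and compute SMAPE/MAE for predicted Cobb angles.
import numpy as np

rng = np.random.default_rng(0)
H, W = 512, 256
xray       = rng.random((1, H, W))            # grayscale spine X-ray
region     = rng.random((1, H, W))            # spine-region map
centerline = rng.random((1, H, W))            # centerline map
boundary   = rng.random((1, H, W))            # boundary map

reg_input = np.concatenate([xray, region, centerline, boundary], axis=0)
print("regression input:", reg_input.shape)   # (4, 512, 256)

def smape(pred, gt):
    # pred, gt: (N, 3) Cobb angles in degrees
    return 100.0 * np.mean(np.abs(pred - gt).sum(axis=1) / (pred + gt).sum(axis=1))

def mae(pred, gt):
    return np.mean(np.abs(pred - gt))

gt   = np.array([[12.0, 30.0, 18.0], [5.0, 22.0, 10.0]])
pred = np.array([[13.5, 28.0, 19.0], [6.0, 20.5, 11.0]])
print(f"SMAPE = {smape(pred, gt):.2f}%, MAE = {mae(pred, gt):.2f} deg")
```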
for: This paper aims to improve visual simultaneous localization and mapping (SLAM) methods by better integrating visual information and inertial measurement unit (IMU) data.
methods: The proposed method uses a novel deep SLAM network with dual visual factors, which integrates both photometric and re-projection factors into an end-to-end differentiable structure through a multi-factor data association module.
results: The proposed method significantly outperforms state-of-the-art methods on several public datasets, including TartanAir, EuRoC, and ETH3D-SLAM. Specifically, the absolute trajectory error was reduced by 45.3% and 36.2% for monocular and stereo configurations on the EuRoC dataset, respectively.
Abstract
Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information as well as better integrate with inertial measurement unit (IMU) in visual SLAM has potential research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both photometric factor and re-projection factor into the end-to-end differentiable structure through multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and it can be further extended to include the IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error for both monocular and stereo configurations on EuRoC dataset has reduced by 45.3% and 36.2% respectively.
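As a hedged illustration of the two visual factors, the snippet below computes a re-projection residual (pixel error between a projected 3D point and its matched keypoint) and a photometric residual (intensity difference at corresponding pixels) for a synthetic pinhole camera pair. The intrinsics, poses, and images are made up, and the paper's differentiable multi-factor network is not reproduced here.

```python
# Sketch of the two visual factors as per-point residuals for a pinhole camera pair.
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[320.0, 0.0, 320.0],
              [0.0, 320.0, 240.0],
              [0.0, 0.0, 1.0]])                     # pinhole intrinsics

def project(K, R, t, X):
    """Project 3D points X (N, 3) into pixels for camera pose (R, t)."""
    Xc = X @ R.T + t
    uv = Xc @ K.T
    return uv[:, :2] / uv[:, 2:3]

# Random scene points, identity pose for frame A, a small motion for frame B.
X = rng.uniform([-2, -2, 4], [2, 2, 8], size=(50, 3))
R_b, t_b = np.eye(3), np.array([0.1, 0.0, 0.0])

uv_a = project(K, np.eye(3), np.zeros(3), X)
uv_b_observed = project(K, R_b, t_b, X) + rng.normal(scale=0.5, size=(50, 2))  # noisy matches

# Re-projection factor: distance between predicted and observed pixels in frame B.
reproj_residual = project(K, R_b, t_b, X) - uv_b_observed

# Photometric factor: intensity difference between the frames at corresponding pixels.
img_a, img_b = rng.random((480, 640)), rng.random((480, 640))
def sample(img, uv):
    u = np.clip(uv[:, 0].round().astype(int), 0, img.shape[1] - 1)
    v = np.clip(uv[:, 1].round().astype(int), 0, img.shape[0] - 1)
    return img[v, u]
photo_residual = sample(img_a, uv_a) - sample(img_b, project(K, R_b, t_b, X))

print("mean |reproj|:", np.abs(reproj_residual).mean(),
      "mean |photo|:", np.abs(photo_residual).mean())
```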
Boundary-Aware Proposal Generation Method for Temporal Action Localization
paper_authors: Hao Zhang, Chunyan Feng, Jiahui Yang, Zheng Li, Caili Guo
for: This paper proposes a boundary-aware temporal action localization (TAL) method to find action categories and temporal boundaries in untrimmed videos.
methods: The proposed Boundary-Aware Proposal Generation (BAPG) method improves TAL accuracy by emphasizing boundary awareness. BAPG does not depend on existing TAL network architectures and can be applied plug-and-play to mainstream TAL models.
results: Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets show that BAPG significantly improves TAL performance.
Abstract
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
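The contrastive learning with hard negative mining mentioned above can be sketched as an InfoNCE-style loss in which mined hard negatives (background frames that resemble action frames) enter the denominator explicitly; the embeddings below are random placeholders for features from a proposal network, and the temperature is an assumed value.

```python
# Sketch of contrastive learning with hard negative mining: an InfoNCE-style
# loss where mined hard negatives are explicitly included alongside the positive.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim = 128
anchor    = F.normalize(torch.randn(8, dim), dim=1)        # action-proposal features
positive  = F.normalize(torch.randn(8, dim), dim=1)        # matched action features
hard_negs = F.normalize(torch.randn(8, 16, dim), dim=2)    # 16 mined hard negatives each

def info_nce_hard(anchor, positive, hard_negs, tau=0.07):
    pos_logit = (anchor * positive).sum(dim=1, keepdim=True) / tau         # (B, 1)
    neg_logits = torch.einsum("bd,bnd->bn", anchor, hard_negs) / tau        # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)                  # positive at index 0
    return F.cross_entropy(logits, labels)

print(info_nce_hard(anchor, positive, hard_negs).item())
```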