paper_authors: Diego Hidalgo-Carvajal, Hanzhi Chen, Gemma C. Bettelani, Jaesug Jung, Melissa Zavaglia, Laura Busse, Abdeldjallil Naceri, Stefan Leutenegger, Sami Haddadin
for: This paper aims to improve robots' ability to grasp and manipulate objects in human-suited environments.
methods: The paper leverages human-like object understanding by reconstructing and completing the full geometry of partially observed objects, then grasping and manipulating them with a 7-DoF anthropomorphic robot hand.
results: Across varied directions and positions, the approach improves the grasping success rate of partial-reconstruction baselines by nearly 30% and achieves over 150 successful grasps across three object categories, showing that it can accurately predict and execute grasp postures and improve real-world robotic grasping and manipulation.

Abstract
The progressive prevalence of robots in human-suited environments has given rise to a myriad of object manipulation techniques, in which dexterity plays a paramount role. It is well-established that humans exhibit extraordinary dexterity when handling objects. Such dexterity seems to derive from a robust understanding of object properties (such as weight, size, and shape), as well as a remarkable capacity to interact with them. Hand postures commonly demonstrate the influence of specific regions on objects that need to be grasped, especially when objects are partially visible. In this work, we leverage human-like object understanding by reconstructing and completing their full geometry from partial observations, and manipulating them using a 7-DoF anthropomorphic robot hand. Our approach has significantly improved the grasping success rates of baselines with only partial reconstruction by nearly 30% and achieved over 150 successful grasps with three different object categories. This demonstrates our approach's consistent ability to predict and execute grasping postures based on the completed object shapes from various directions and positions in real-world scenarios. Our work opens up new possibilities for enhancing robotic applications that require precise grasping and manipulation skills of real-world reconstructed objects.
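A minimal sketch of the pipeline this abstract describes, assuming hypothetical placeholder modules `completion_net` and `grasp_net` (the paper does not publish this interface): a completion network turns the partial observation into a full shape, and a grasp network predicts a hand posture from the completed geometry.

```python
# Hypothetical pipeline skeleton; completion_net and grasp_net are assumed
# placeholder callables, not the authors' released models.
def grasp_from_partial(partial_points, completion_net, grasp_net):
    """partial_points: (N, 3) points from a single depth view of the object."""
    full_shape = completion_net(partial_points)  # reconstruct + complete geometry
    grasp_pose = grasp_net(full_shape)           # completed shape -> 7-DoF hand posture
    return grasp_pose
```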
Neural Network Reconstruction of the Left Atrium using Sparse Catheter Paths
results: The paper shows that the method can reconstruct the left atrial shape from partial data acquired within a 3-minute interval and produce realistic visualizations; both synthetic and human clinical cases are demonstrated.

Abstract
Catheter based radiofrequency ablation for pulmonary vein isolation has become the first line of treatment for atrial fibrillation in recent years. This requires a rather accurate map of the left atrial sub-endocardial surface including the ostia of the pulmonary veins, which requires dense sampling of the surface and takes more than 10 minutes. The focus of this work is to provide left atrial visualization early in the procedure to ease procedure complexity and enable further workflows, such as using catheters that have difficulty sampling the surface. We propose a dense encoder-decoder network with a novel regularization term to reconstruct the shape of the left atrium from partial data which is derived from simple catheter maneuvers. To train the network, we acquire a large dataset of 3D atria shapes and generate corresponding catheter trajectories. Once trained, we show that the suggested network can sufficiently approximate the atrium shape based on a given trajectory. We compare several network solutions for the 3D atrium reconstruction. We demonstrate that the solution proposed produces realistic visualization using partial acquisition within a 3-minute time interval. Synthetic and human clinical cases are shown.
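As a rough illustration of the dense encoder-decoder idea (not the authors' architecture), the sketch below maps a voxelized partial catheter trajectory to a full occupancy grid; the extra penalty term is a generic stand-in for the paper's novel regularization term, whose exact form is not given here. Grid size, widths, and the weight `lam` are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DenseEncoderDecoder(nn.Module):
    def __init__(self, grid=32, latent=256):
        super().__init__()
        n = grid ** 3
        self.encoder = nn.Sequential(
            nn.Linear(n, 1024), nn.ReLU(),
            nn.Linear(1024, latent), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(latent, 1024), nn.ReLU(),
            nn.Linear(1024, n))               # logits for the full occupancy grid

    def forward(self, partial_voxels):        # (B, grid**3), 1 where the path sampled
        return self.decoder(self.encoder(partial_voxels))

def reconstruction_loss(logits, target, lam=1e-3):
    # BCE against the full atrium shape, plus a placeholder penalty standing
    # in for the paper's regularization term.
    return F.binary_cross_entropy_with_logits(logits, target) \
        + lam * logits.pow(2).mean()
```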
A Strictly Bounded Deep Network for Unpaired Cyclic Translation of Medical Images
results: Experiments show that the method delivers superior results on real CT and MRI image translation; qualitative, quantitative, and ablation analyses indicate reduced variance and improved stability.

Abstract
Medical image translation is an ill-posed problem. Unlike existing paired unbounded unidirectional translation networks, in this paper, we consider unpaired medical images and provide a strictly bounded network that yields a stable bidirectional translation. We propose a patch-level concatenated cyclic conditional generative adversarial network (pCCGAN) embedded with adaptive dictionary learning. It consists of two cyclically connected CGANs of 47 layers each; where both generators (each of 32 layers) are conditioned with concatenation of alternate unpaired patches from input and target modality images (not ground truth) of the same organ. The key idea is to exploit cross-neighborhood contextual feature information that bounds the translation space and boosts generalization. The generators are further equipped with adaptive dictionaries learned from the contextual patches to reduce possible degradation. Discriminators are 15-layer deep networks that employ minimax function to validate the translated imagery. A combined loss function is formulated with adversarial, non-adversarial, forward-backward cyclic, and identity losses that further minimize the variance of the proposed learning machine. Qualitative, quantitative, and ablation analysis show superior results on real CT and MRI.
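A condensed sketch of the combined loss described above (adversarial, forward-backward cyclic, and identity terms), assuming generic generator/discriminator modules `G_ab`, `G_ba`, `D_a`, `D_b`. The least-squares adversarial form and the weights are illustrative choices, not the paper's exact formulation, and the non-adversarial dictionary-learning term is omitted.

```python
import torch
import torch.nn.functional as F

def cyclic_gan_loss(G_ab, G_ba, D_a, D_b, real_a, real_b, w_cyc=10.0, w_id=5.0):
    fake_b, fake_a = G_ab(real_a), G_ba(real_b)
    pred_b, pred_a = D_b(fake_b), D_a(fake_a)
    # Adversarial term (least-squares form, one common choice).
    adv = F.mse_loss(pred_b, torch.ones_like(pred_b)) + \
          F.mse_loss(pred_a, torch.ones_like(pred_a))
    # Forward-backward cyclic consistency: a->b->a and b->a->b.
    cyc = F.l1_loss(G_ba(fake_b), real_a) + F.l1_loss(G_ab(fake_a), real_b)
    # Identity terms on same-domain inputs.
    idt = F.l1_loss(G_ab(real_b), real_b) + F.l1_loss(G_ba(real_a), real_a)
    return adv + w_cyc * cyc + w_id * idt
```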
SPHEAR: Spherical Head Registration for Complete Statistical 3D Modeling
results: The model can generate high-resolution color textures, surface normal maps, and strand-level hair, and supports automatic visual data generation, semantic annotation, and general reconstruction tasks. Compared to existing approaches, its components are fast and memory-efficient, and experiments validate the design choices and the accuracy of the registration, reconstruction, and generation techniques.

Abstract
We present \emph{SPHEAR}, an accurate, differentiable parametric statistical 3D human head model, enabled by a novel 3D registration method based on spherical embeddings. We shift the paradigm away from the classical Non-Rigid Registration methods, which operate under various surface priors, increasing reconstruction fidelity and minimizing required human intervention. Additionally, SPHEAR is a \emph{complete} model that allows not only to sample diverse synthetic head shapes and facial expressions, but also gaze directions, high-resolution color textures, surface normal maps, and hair cuts represented in detail, as strands. SPHEAR can be used for automatic realistic visual data generation, semantic annotation, and general reconstruction tasks. Compared to state-of-the-art approaches, our components are fast and memory efficient, and experiments support the validity of our design choices and the accuracy of registration, reconstruction and generation techniques.
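A toy illustration of why spherical embeddings can simplify registration: once two head meshes are embedded on the unit sphere, correspondences reduce to nearest-neighbor queries in the embedding rather than in ambient 3D space. The embedding network itself is the paper's contribution and is assumed given here.

```python
import numpy as np
from scipy.spatial import cKDTree

def correspondences_on_sphere(emb_src, emb_dst):
    """emb_src: (N, 3), emb_dst: (M, 3) unit-sphere embeddings of two meshes.
    Returns, for each source vertex, the index of its matched target vertex."""
    emb_src = emb_src / np.linalg.norm(emb_src, axis=1, keepdims=True)
    emb_dst = emb_dst / np.linalg.norm(emb_dst, axis=1, keepdims=True)
    return cKDTree(emb_dst).query(emb_src)[1]   # nearest neighbor on the sphere
```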
Extracting Network Structures from Corporate Organization Charts Using Heuristic Image Processing
results: Applying the method to the 10,008 organization-chart PDF files in the "Organization Chart/System Diagram Handbook" (2008-2011), the authors reconstructed 4,606 organization networks (a 46% data acquisition success rate). Several network diagnostics were measured for each reconstructed network, to be used in further statistical analysis of their potential correlations with corporate behavior and performance.

Abstract
Organizational structure of corporations has the potential to provide implications for the dynamics and performance of corporate operations. However, this subject has remained unexplored because of the lack of readily available organization network datasets. To overcome this gap, we developed a new heuristic image-processing method to extract and reconstruct organization network data from published organization charts. Our method analyzes a PDF file of a corporate organization chart and detects text labels, boxes, connecting lines, and other objects through multiple steps of heuristically implemented image processing. The detected components are reorganized into a Python NetworkX Graph object for visualization, validation, and further network analysis. We applied the developed method to the organization charts of all the listed firms in Japan shown in the "Organization Chart/System Diagram Handbook" published by Diamond, Inc., from 2008 to 2011. Out of the 10,008 organization chart PDF files, our method was able to reconstruct 4,606 organization networks (data acquisition success rate: 46%). For each reconstructed organization network, we measured several network diagnostics, which will be used for further statistical analysis to investigate their potential correlations with corporate behavior and performance.
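A simplified sketch of such a heuristic pipeline, assuming the PDF page has already been rasterized to a grayscale image: OpenCV contour detection finds box-sized blobs, and a vertical-adjacency heuristic links them into a NetworkX graph. The thresholds are illustrative, and the paper's multi-step heuristics (text labels, connecting lines) are not reproduced.

```python
import cv2
import networkx as nx

def chart_to_graph(image_path):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, bw = cv2.threshold(img, 200, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Keep blobs large enough to plausibly be organizational-unit boxes.
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]
    G = nx.Graph()
    for i, box in enumerate(boxes):
        G.add_node(i, bbox=box)
    # Heuristic: connect vertically adjacent, horizontally overlapping boxes.
    for i, (x1, y1, w1, h1) in enumerate(boxes):
        for j, (x2, y2, w2, h2) in enumerate(boxes):
            if i < j and abs(y2 - (y1 + h1)) < 40 \
                    and x1 < x2 + w2 and x2 < x1 + w1:
                G.add_edge(i, j)
    return G
```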
P-Age: Pexels Dataset for Robust Spatio-Temporal Apparent Age Classification
results: Evaluated on a variety of video datasets, the method achieves higher accuracy than existing face-based approaches; in particular, it remains accurate in real-world conditions where faces are occluded, blurred, or masked.

Abstract
Age estimation is a challenging task that has numerous applications. In this paper, we propose a new direction for age classification that utilizes a video-based model to address challenges such as occlusions, low-resolution, and lighting conditions. To address these challenges, we propose AgeFormer, which utilizes spatio-temporal information on the dynamics of the entire body, outperforming face-based methods for age classification. Our novel two-stream architecture uses TimeSformer and EfficientNet as backbones, to effectively capture both facial and body dynamics information for efficient and accurate age estimation in videos. Furthermore, to fill the gap in predicting age in real-world situations from videos, we construct a video dataset called Pexels Age (P-Age) for age classification. The proposed method achieves superior results compared to existing face-based age estimation methods and is evaluated in situations where the face is highly occluded, blurred, or masked. The method is also cross-tested on a variety of challenging video datasets such as Charades, Smarthome, and Thumos-14.
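A schematic of the two-stream fusion described above, with stand-in backbone modules (the real model uses TimeSformer for body dynamics and EfficientNet for appearance): the two embeddings are concatenated and classified into age groups. Dimensions and the number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoStreamAge(nn.Module):
    """temporal_net: clip -> (B, dim_t); spatial_net: frame -> (B, dim_s)."""
    def __init__(self, temporal_net, spatial_net, dim_t, dim_s, num_classes=7):
        super().__init__()
        self.temporal_net = temporal_net   # e.g. a video transformer backbone
        self.spatial_net = spatial_net     # e.g. a CNN appearance backbone
        self.head = nn.Linear(dim_t + dim_s, num_classes)

    def forward(self, clip, frame):
        z = torch.cat([self.temporal_net(clip), self.spatial_net(frame)], dim=-1)
        return self.head(z)                # logits over age classes
```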
results: The method completely bypasses catastrophic forgetting while reducing the computational cost of training the model; aided by a small memory of 10 samples per class, it achieves performance close to full-set finetuning.

Abstract
Continual learning refers to the problem where the training data is available in sequential chunks, termed "tasks". The majority of progress in continual learning has been stunted by the problem of catastrophic forgetting, which is caused by sequential training of the model on streams of data. Moreover, it becomes computationally expensive to sequentially train large models multiple times. To mitigate both of these problems at once, we propose a novel method to continually train transformer-based vision models using low-rank adaptation and task arithmetic. Our method completely bypasses the problem of catastrophic forgetting, as well as reducing the computational requirement for training models on each task. When aided with a small memory of 10 samples per class, our method achieves performance close to full-set finetuning. We present rigorous ablations to support the prowess of our method.
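A minimal sketch of the core arithmetic, under the usual LoRA and task-arithmetic formulations: each task t learns a low-rank update B_t A_t, and the merged model adds the (scaled) sum of these task vectors to the frozen base weight. The rank, scaling coefficient, and shapes are illustrative.

```python
import torch

def merge_task_loras(W_base, lora_pairs, alpha=1.0):
    """W_base: frozen (d_out, d_in) weight; lora_pairs: per-task (B, A)
    adapters with B: (d_out, r) and A: (r, d_in) trained independently."""
    task_vectors = [B @ A for B, A in lora_pairs]      # one low-rank update per task
    return W_base + alpha * torch.stack(task_vectors).sum(dim=0)

# Usage: merge three rank-4 adapters into a 512x512 projection weight.
W = torch.randn(512, 512)
adapters = [(torch.randn(512, 4) * 0.01, torch.randn(4, 512) * 0.01)
            for _ in range(3)]
W_merged = merge_task_loras(W, adapters)
```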
P2O-Calib: Camera-LiDAR Calibration Using Point-Pair Spatial Occlusion Relationship
results: Evaluated qualitatively and quantitatively on real images from the KITTI dataset and compared against existing target-less methods, the proposed approach achieves low error and high robustness across environments, contributing to practical applications that rely on high-quality camera-LiDAR calibration.

Abstract
The accurate and robust calibration of sensors is considered an important building block for follow-up research in the autonomous driving and robotics domains. Current works involving extrinsic calibration between 3D LiDARs and monocular cameras mainly focus on target-based and target-less methods. Target-based methods are often utilized offline because of restrictions such as additional target design and target placement limits. Current target-less methods suffer from feature indeterminacy and feature mismatching in various environments. To alleviate these limitations, we propose a novel target-less calibration approach based on 2D-3D edge point extraction using the occlusion relationship in 3D space. Based on the extracted 2D-3D point pairs, we further propose an occlusion-guided point-matching method that improves the calibration accuracy and reduces computation costs. To validate the effectiveness of our approach, we evaluate the method's performance qualitatively and quantitatively on real images from the KITTI dataset. The results demonstrate that our method outperforms existing target-less methods and achieves low error and high robustness, which can contribute to practical applications relying on high-quality Camera-LiDAR calibration.
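Once occlusion-guided matching has produced 2D-3D point pairs, the remaining extrinsic estimation is a standard PnP problem. A sketch with OpenCV follows, where the matched pairs `pts3d`/`pts2d` are assumed to come from the paper's front end; this is a generic solver step, not the authors' implementation.

```python
import cv2
import numpy as np

def estimate_extrinsics(pts3d, pts2d, K):
    """pts3d: (N, 3) LiDAR edge points; pts2d: (N, 2) matched image points;
    K: 3x3 camera intrinsics. N must be large enough for solvePnP."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(pts3d, dtype=np.float64),
        np.asarray(pts2d, dtype=np.float64),
        np.asarray(K, dtype=np.float64),
        distCoeffs=None, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec               # camera-from-LiDAR rotation and translation
```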
Hybrid quantum image classification and federated learning for hepatic steatosis diagnosis
paper_authors: Luca Lusnig, Asel Sagingalieva, Mikhail Surmach, Tatjana Protasevich, Ovidiu Michiu, Joseph McLoughlin, Christopher Mansell, Graziano de' Petris, Deborah Bonazza, Fabrizio Zanconati, Alexey Melnikov, Fabio Cavalli
for: This study aims to develop an intelligent system that can assist pathologists in daily diagnosis, combining deep learning with quantum computing and using federated learning for privacy-friendly collaborative learning among multiple participants.
methods: The study uses a hybrid quantum neural network with 5 qubits and more than 100 variational gates to quantify non-alcoholic liver steatosis, and proposes a classical deep-learning-based federated solution that addresses privacy concerns by reducing the amount of data each participant must hold.
results: The hybrid quantum neural network reaches 97% accuracy on liver steatosis image classification, 1.8% higher than its classical counterpart, ResNet (95.2%); even with reduced datasets, the hybrid approach consistently outperforms the classical one, indicating superior generalization and less potential for overfitting in medical applications.

Abstract
With the maturity achieved by deep learning techniques, intelligent systems that can assist physicians in the daily interpretation of clinical images can play a very important role. In addition, quantum techniques applied to deep learning can enhance this performance, and federated learning techniques can realize privacy-friendly collaborative learning among different participants, solving privacy issues due to the use of sensitive data and reducing the number of data to be collected for each individual participant. We present in this study a hybrid quantum neural network that can be used to quantify non-alcoholic liver steatosis and could be useful in the diagnostic process to determine a liver's suitability for transplantation; at the same time, we propose a federated learning approach based on a classical deep learning solution to solve the same problem, but using a reduced data set in each part. The liver steatosis image classification accuracy of the hybrid quantum neural network (a hybrid quantum ResNet model consisting of 5 qubits and more than 100 variational gates) reaches 97%, which is 1.8% higher than its classical counterpart, ResNet. Crucially, even with a reduced dataset, our hybrid approach consistently outperformed its classical counterpart, indicating superior generalization and less potential for overfitting in medical applications. In addition, a federated approach with up to 32 clients, despite a lower accuracy that still exceeds 90%, would allow each participant to use a very small dataset, i.e., as little as one-thirtieth of the full set. Our work, based on real-world clinical data, can be regarded as a scalable and collaborative starting point and could thus fulfill the need for an effective and reliable computer-assisted system that facilitates the daily diagnostic work of the clinical pathologist.
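A minimal FedAvg-style sketch of the federated setup, not the paper's exact protocol: each client fine-tunes a copy of the global model on its small local split, and the server averages parameters weighted by client dataset size. `local_train` is an assumed callable, and float-valued parameters are assumed.

```python
import copy
import torch

def federated_round(global_model, clients, local_train):
    """clients: list of (dataset, n_samples); local_train trains a copy of
    the model on one client's data and returns its state_dict."""
    total = sum(n for _, n in clients)
    states = [(local_train(copy.deepcopy(global_model), data), n)
              for data, n in clients]
    # Weighted average of every parameter across clients.
    avg = {k: sum(sd[k] * (n / total) for sd, n in states)
           for k in states[0][0]}
    global_model.load_state_dict(avg)
    return global_model
```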
Domain Transfer in Latent Space (DTLS) Wins on Image Super-Resolution – a Non-Denoising Model
results: The paper proposes a simple approach that dispenses with Gaussian noise while adopting basic structures of diffusion models for high-quality image super-resolution: a DNN performs domain transfer between neighboring domains, exploiting their statistical differences for gradual interpolation, and conditioning the transfer on the input LR image further improves quality. Experiments show the method outperforms both state-of-the-art large-scale super-resolution models and current diffusion models, and it readily extends to other image-to-image tasks such as image enlightening, inpainting, and denoising.

Abstract
Large scale image super-resolution is a challenging computer vision task, since vast information is missing in a highly degraded image, for example in x16 super-resolution. Diffusion models have been used successfully in recent years for extreme super-resolution applications, in which Gaussian noise is used as a means to form a latent photo-realistic space, acting as a link between the space of latent vectors and the latent photo-realistic space. Quite a few sophisticated mathematical derivations on mapping the statistics of Gaussian noise underlie the success of diffusion models. In this paper we propose a simple approach which gets away from using Gaussian noise but adopts some basic structures of diffusion models for efficient image super-resolution. Essentially, we propose a DNN to perform domain transfer between neighbor domains, which can learn the differences in statistical properties to facilitate gradual interpolation with results of reasonable quality. Further quality improvement is achieved by conditioning the domain transfer with reference to the input LR image. Experimental results show that our method outperforms not only state-of-the-art large scale super-resolution models, but also the current diffusion models for image super-resolution. The approach can readily be extended to other image-to-image tasks, such as image enlightening, inpainting, denoising, etc.
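A schematic of the gradual, noise-free transfer loop the abstract describes, assuming a hypothetical `transfer_net(x, condition, step)` module: each iteration moves the image to a neighboring domain while conditioning on the LR input to keep content consistent. The step count and interface are assumptions, not the paper's released API.

```python
import torch

@torch.no_grad()
def gradual_super_resolve(lr_image, transfer_net, num_steps=4):
    x = lr_image
    for step in range(num_steps):
        # Each call is assumed to map the current domain to its neighbor
        # (e.g. a slightly less degraded domain), conditioned on the LR
        # reference so content stays consistent with the input.
        x = transfer_net(x, condition=lr_image, step=step)
    return x
```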
Proposal-Level Unsupervised Domain Adaptation for Open World Unbiased Detector
results: In OWOD evaluation, the approach achieves state-of-the-art performance.

Abstract
Open World Object Detection (OWOD) combines open-set object detection with incremental learning capabilities to handle the challenge of the open and dynamic visual world. Existing works assume that a foreground predictor trained on the seen categories can be directly transferred to identify the unseen categories' locations by selecting the top-k most confident foreground predictions. However, the assumption is hardly valid in practice. This is because the predictor is inevitably biased to the known categories, and fails under the shift in the appearance of the unseen categories. In this work, we aim to build an unbiased foreground predictor by re-formulating the task under Unsupervised Domain Adaptation, where the current biased predictor helps form the domains: the seen object locations and confident background locations as the source domain, and the rest ambiguous ones as the target domain. Then, we adopt the simple and effective self-training method to learn a predictor based on the domain-invariant foreground features, hence achieving unbiased prediction robust to the shift in appearance between the seen and unseen categories. Our approach's pipeline can adapt to various detection frameworks and UDA methods, empirically validated by OWOD evaluation, where we achieve state-of-the-art performance.
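A minimal sketch of how the two domains could be formed from the biased predictor's confidences, with illustrative thresholds: confident foreground and confident background proposals become the source domain, and the ambiguous remainder becomes the target domain for self-training.

```python
def split_domains(proposals, fg_scores, fg_thr=0.8, bg_thr=0.2):
    """Partition region proposals using the current (biased) predictor's
    foreground scores; thresholds are illustrative assumptions."""
    source, target = [], []
    for proposal, score in zip(proposals, fg_scores):
        if score >= fg_thr:
            source.append((proposal, 1))   # confident seen-object location
        elif score <= bg_thr:
            source.append((proposal, 0))   # confident background location
        else:
            target.append(proposal)        # ambiguous: target domain for
                                           # self-training with pseudo-labels
    return source, target
```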
MC-Stereo: Multi-peak Lookup and Cascade Search Range for Stereo Matching
results: MC-Stereo ranks first among all publicly available methods on the KITTI-2012 and KITTI-2015 benchmarks and achieves state-of-the-art performance on ETH3D.

Abstract
Stereo matching is a fundamental task in scene comprehension. In recent years, methods based on iterative optimization have shown promise in stereo matching. However, the current iteration framework employs a single-peak lookup, which struggles to handle the multi-peak problem effectively. Additionally, the fixed search range used during the iteration process limits the final convergence effects. To address these issues, we present a novel iterative optimization architecture called MC-Stereo. This architecture mitigates the multi-peak distribution problem in matching through a multi-peak lookup strategy, and integrates the coarse-to-fine concept into the iterative framework via a cascade search range. Furthermore, given that feature representation learning is crucial for successful learning-based stereo matching, we introduce a pre-trained network to serve as the feature extractor, enhancing the front end of the stereo matching pipeline. Based on these improvements, MC-Stereo ranks first among all publicly available methods on the KITTI-2012 and KITTI-2015 benchmarks, and also achieves state-of-the-art performance on ETH3D. The code will be open sourced after the publication of this paper.
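A toy sketch of the lookup idea: rather than reading the cost volume only at the single most confident disparity, keep several high-scoring candidates per pixel. A per-pixel top-k is used here for brevity as a stand-in for the paper's multi-peak selection, which may differ.

```python
import torch

def multi_peak_lookup(cost_volume, k=3):
    """cost_volume: (B, D, H, W) matching scores over D disparity levels.
    Returns the k best candidate disparities and their scores per pixel."""
    scores, disparities = torch.topk(cost_volume, k, dim=1)  # (B, k, H, W)
    return disparities.float(), scores

# Usage on a toy cost volume.
volume = torch.randn(1, 64, 32, 32)
peaks, confidences = multi_peak_lookup(volume)
```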
Multimodal Machine Learning for Clinically-Assistive Imaging-Based Biomedical Applications
methods: The paper discusses five challenges to multimodal AI as it pertains to ML (representation, fusion, alignment, translation, and co-learning) and surveys recent approaches to addressing them in the context of medical image-based clinical decision support models.
results: The paper concludes with a discussion of the future of the field, suggesting directions that should be explored further for successful clinical models and their translation to the clinical setting.

Abstract
Machine learning (ML) applications in medical artificial intelligence (AI) systems have shifted from traditional and statistical methods to increasing application of deep learning models and even more recently generative models. Recent years have seen a rise in the discovery of widely-available deep learning architectures that support multimodal data integration, particularly with images. The incorporation of multiple modalities into these models is a thriving research topic, presenting its own unique challenges. In this work, we discuss five challenges to multimodal AI as it pertains to ML (representation, fusion, alignment, translation, and co-learning) and survey recent approaches to addressing these challenges in the context of medical image-based clinical decision support models. We conclude with a discussion of the future of the field, suggesting directions that should be elucidated further for successful clinical models and their translation to the clinical setting.
Counting Manatee Aggregations using Deep Neural Networks and Anisotropic Gaussian Kernel
results: Experiments show that applying the anisotropic Gaussian kernel (AGK) with different types of deep neural networks counts manatee aggregations accurately, and the method works particularly well in environments with complex backgrounds.

Abstract
Manatees are aquatic mammals with voracious appetites. They rely on sea grass as their main food source, and often spend up to eight hours a day grazing. They move slowly and frequently stay in groups (i.e., aggregations) in shallow water to search for food, making them vulnerable to environmental change and other risks. Accurately counting manatee aggregations within a region is not only biologically meaningful for observing their habits, but also crucial for designing safety rules for human boaters, divers, etc., as well as scheduling nursing, intervention, and other plans. In this paper, we propose a deep learning based crowd counting approach to automatically count the number of manatees within a region, using low quality images as input. Because manatees have a unique shape and often stay in groups in shallow water, water surface reflection, occlusion, camouflage, etc. make it difficult to count them accurately. To address these challenges, we propose to use an Anisotropic Gaussian Kernel (AGK), with tunable rotation and variances, to ensure that density functions can maximally capture the shapes of individual manatees in different aggregations. We then apply the AGK kernel to different types of deep neural networks primarily designed for crowd counting, including VGG, SANet, Congested Scene Recognition network (CSRNet), MARUNet, etc., to learn manatee densities and calculate the number of manatees in the scene. Using generic low quality images extracted from surveillance videos, our experimental results and comparisons show that AGK-based manatee counting achieves the lowest Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). The proposed method works particularly well for counting manatee aggregations in environments with complex backgrounds.
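A small sketch of an anisotropic Gaussian kernel with tunable rotation and per-axis variances, the ingredient the abstract highlights for shaping density maps to each manatee's elongated form; the parameter values in the usage line are illustrative.

```python
import numpy as np

def anisotropic_gaussian_kernel(size, sigma_x, sigma_y, theta):
    """size: odd kernel width; theta: major-axis rotation in radians."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    # Rotate coordinates so the kernel's major axis follows theta.
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    k = np.exp(-(xr**2 / (2 * sigma_x**2) + yr**2 / (2 * sigma_y**2)))
    return k / k.sum()   # normalize so each annotated animal contributes 1

# Usage: an elongated kernel rotated 30 degrees.
kernel = anisotropic_gaussian_kernel(31, sigma_x=6.0, sigma_y=2.0,
                                     theta=np.deg2rad(30))
```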
LISNeRF Mapping: LiDAR-based Implicit Mapping via Semantic Neural Fields for Large-Scale 3D Scenes
for: This paper proposes a method for large-scale 3D semantic reconstruction from LiDAR measurements alone, which is crucial for outdoor autonomous agents to fulfill high-level tasks such as planning and navigation.
methods: The proposed method uses an octree-based and hierarchical structure to store implicit features, which are decoded to semantic information and signed distance value through shallow Multilayer Perceptrons (MLPs). Off-the-shelf algorithms are used to predict semantic labels and instance IDs of point cloud, and the implicit features and MLPs parameters are jointly optimized with self-supervision paradigm for point cloud geometry and pseudo-supervision paradigm for semantic and panoptic labels.
results: The proposed method is evaluated on three real-world datasets, SemanticKITTI, SemanticPOSS, and nuScenes, and demonstrates effectiveness and efficiency compared to current state-of-the-art 3D mapping methods.

Abstract
Large-scale semantic mapping is crucial for outdoor autonomous agents to fulfill high-level tasks such as planning and navigation. This paper proposes a novel method for large-scale 3D semantic reconstruction through implicit representations from LiDAR measurements alone. We first leverage an octree-based hierarchical structure to store implicit features; these implicit features are then decoded into semantic information and signed distance values through shallow Multilayer Perceptrons (MLPs). We adopt off-the-shelf algorithms to predict the semantic labels and instance IDs of the point cloud. We then jointly optimize the implicit features and the MLPs' parameters with a self-supervision paradigm for point cloud geometry and a pseudo-supervision paradigm for semantic and panoptic labels. Subsequently, the Marching Cubes algorithm is exploited to subdivide and visualize the scenes at the inference stage. For scenarios with memory constraints, a map stitching strategy is also developed to merge sub-maps into a complete map. To the best of our knowledge, our method is the first work to reconstruct semantic implicit scenes from LiDAR-only input. Experiments on three real-world datasets, SemanticKITTI, SemanticPOSS, and nuScenes, demonstrate the effectiveness and efficiency of our framework compared to current state-of-the-art 3D mapping methods.
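A minimal sketch of the decoding step, assuming an interpolated per-point octree feature is already available: two shallow MLP heads map the feature to a signed distance value and semantic logits. Feature, hidden, and class dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=32, hidden=64, num_classes=20):
        super().__init__()
        self.sdf_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.sem_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, feats):   # feats: (N, feat_dim), one row per query point
        # Returns a signed distance per point and per-class semantic logits.
        return self.sdf_head(feats).squeeze(-1), self.sem_head(feats)
```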