cs.CV - 2023-10-17

Holistic Parking Slot Detection with Polygon-Shaped Representations

  • paper_url: http://arxiv.org/abs/2310.11629
  • repo_url: None
  • paper_authors: Lihao Wang, Antonyo Musabini, Christel Leonet, Rachid Benmokhtar, Amaury Breheret, Chaima Yedes, Fabian Burger, Thomas Boulay, Xavier Perrotton
  • for: This paper proposes a one-step Holistic Parking Slot Network (HPS-Net) to detect vacant parking slots using a camera-based approach.
  • methods: The proposed method uses a tailor-made adaptation of the You Only Look Once (YOLO)v4 algorithm, which directly outputs the four vertex coordinates of the parking slot in the topview domain. A novel regression loss function named polygon-corner Generalized Intersection over Union (GIoU) is proposed to optimize the polygon vertex positions and distinguish the entrance line.
  • results: Experiments show that HPS-Net detects various vacant parking slots with an F1-score of 0.92 on the internal Valeo Parking Slots Dataset (VPSD) and 0.99 on the public dataset PS2.0. The method generalizes robustly to various parking scenarios, such as indoor or paved ground, and runs at a real-time detection speed of 17 FPS on an Nvidia Drive AGX Xavier.
    Abstract Current parking slot detection in advanced driver-assistance systems (ADAS) primarily relies on ultrasonic sensors. This method has several limitations such as the need to scan the entire parking slot before detecting it, the incapacity of detecting multiple slots in a row, and the difficulty of classifying them. Due to the complex visual environment, vehicles are equipped with surround view camera systems to detect vacant parking slots. Previous research works in this field mostly use image-domain models to solve the problem. These two-stage approaches separate the 2D detection and 3D pose estimation steps using camera calibration. In this paper, we propose one-step Holistic Parking Slot Network (HPS-Net), a tailor-made adaptation of the You Only Look Once (YOLO)v4 algorithm. This camera-based approach directly outputs the four vertex coordinates of the parking slot in topview domain, instead of a bounding box in raw camera images. Several visible points and shapes can be proposed from different angles. A novel regression loss function named polygon-corner Generalized Intersection over Union (GIoU) for polygon vertex position optimization is also proposed to manage the slot orientation and to distinguish the entrance line. Experiments show that HPS-Net can detect various vacant parking slots with a F1-score of 0.92 on our internal Valeo Parking Slots Dataset (VPSD) and 0.99 on the public dataset PS2.0. It provides a satisfying generalization and robustness in various parking scenarios, such as indoor (F1: 0.86) or paved ground (F1: 0.91). Moreover, it achieves a real-time detection speed of 17 FPS on Nvidia Drive AGX Xavier. A demo video can be found at https://streamable.com/75j7sj.
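    The paper's polygon-corner GIoU loss is only described at a high level above, so the snippet below is a minimal sketch of a GIoU-style score between two quadrilateral slots using shapely. The function name, corner ordering, and the shapely-based (non-differentiable) evaluation are illustrative assumptions, not the authors' training loss.

      # Minimal GIoU-style score between two quadrilateral parking slots.
      # Illustrative only: the paper's polygon-corner GIoU loss is not
      # reproduced here, and a training loss would need a differentiable
      # (e.g. PyTorch) implementation rather than shapely.
      from shapely.geometry import Polygon

      def quad_giou(pred_vertices, gt_vertices):
          """pred_vertices, gt_vertices: lists of four (x, y) corners in top view."""
          p, g = Polygon(pred_vertices), Polygon(gt_vertices)
          inter = p.intersection(g).area
          union = p.union(g).area
          iou = inter / union if union > 0 else 0.0
          # Smallest convex region enclosing both polygons.
          hull = p.union(g).convex_hull.area
          return iou - (hull - union) / hull if hull > 0 else iou

      # Example: ground-truth slot vs. a slightly shifted prediction.
      gt = [(0, 0), (2.5, 0), (2.5, 5.0), (0, 5.0)]
      pred = [(0.2, 0.1), (2.7, 0.1), (2.6, 5.2), (0.1, 5.1)]
      print(quad_giou(pred, gt))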

High-Resolution Building and Road Detection from Sentinel-2

  • paper_url: http://arxiv.org/abs/2310.11622
  • repo_url: https://github.com/RudraxDave/UrbanizationDetection_RoadnBuilding
  • paper_authors: Wojciech Sirko, Emmanuel Asiedu Brempong, Juliana T. C. Marcos, Abigail Annkah, Abel Korme, Mohammed Alewi Hassen, Krishna Sapkota, Tomer Shekel, Abdoulaye Diack, Sella Nevo, Jason Hickey, John Quinn
  • for: This paper demonstrates how to use multiple 10m resolution Sentinel-2 images to generate 50cm resolution building and road segmentation masks.
  • methods: The authors use a “student” model to reproduce the predictions of a “teacher” model, which has access to high-resolution imagery.
  • results: The authors achieve 78.3% mIoU for building segmentation and R^2 = 0.91 for counting individual buildings, which is comparable to the performance of the high-resolution teacher model (85.3% mIoU).
    Abstract Mapping buildings and roads automatically with remote sensing typically requires high-resolution imagery, which is expensive to obtain and often sparsely available. In this work we demonstrate how multiple 10 m resolution Sentinel-2 images can be used to generate 50 cm resolution building and road segmentation masks. This is done by training a `student' model with access to Sentinel-2 images to reproduce the predictions of a `teacher' model which has access to corresponding high-resolution imagery. While the predictions do not have all the fine detail of the teacher model, we find that we are able to retain much of the performance: for building segmentation we achieve 78.3% mIoU, compared to the high-resolution teacher model accuracy of 85.3% mIoU. We also describe a related method for counting individual buildings in a Sentinel-2 patch which achieves R^2 = 0.91 against true counts. This work opens up new possibilities for using freely available Sentinel-2 imagery for a range of tasks that previously could only be done with high-resolution satellite imagery.
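    A minimal sketch of the teacher-student setup described above: a student that only sees a stack of 10 m Sentinel-2 acquisitions is trained to reproduce the 50 cm predictions of a teacher that had access to high-resolution imagery. The StudentNet architecture, tensor sizes, and the soft-target cross-entropy are assumptions for illustration, not the authors' model.

      # Sketch of the 'student reproduces teacher' training step described above.
      # StudentNet, tensor sizes and the soft-label loss are illustrative
      # assumptions; the paper's actual architectures and losses may differ.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class StudentNet(nn.Module):
          def __init__(self, in_frames=4, in_bands=3, out_classes=2):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Conv2d(in_frames * in_bands, 64, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(64, out_classes, 3, padding=1),
              )
          def forward(self, x):                      # x: (B, frames*bands, H, W) at 10 m
              logits = self.net(x)
              # Upsample 20x so 10 m inputs produce 50 cm resolution masks.
              return F.interpolate(logits, scale_factor=20, mode="bilinear",
                                   align_corners=False)

      student = StudentNet()
      opt = torch.optim.Adam(student.parameters(), lr=1e-4)

      s2_stack = torch.randn(2, 12, 64, 64)          # several Sentinel-2 acquisitions
      teacher_probs = torch.rand(2, 2, 1280, 1280)   # teacher predictions on hi-res imagery
      teacher_probs = teacher_probs / teacher_probs.sum(1, keepdim=True)

      opt.zero_grad()
      student_logits = student(s2_stack)
      loss = F.cross_entropy(student_logits, teacher_probs)  # soft targets from teacher
      loss.backward()
      opt.step()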

Classification of Safety Driver Attention During Autonomous Vehicle Operation

  • paper_url: http://arxiv.org/abs/2310.11608
  • repo_url: None
  • paper_authors: Santiago Gerling Konrad, Julie Stephany Berrio, Mao Shan, Favio Masson, Stewart Worrall
  • for: This paper aims to develop a system to monitor the alertness of vehicle operators in ADAS and AVs to ensure safe operation.
  • methods: The proposed system uses an infrared camera to detect the driver's head and calculate head orientation, and incorporates environmental data from the perception system to determine the driver's attention to objects in the surroundings.
  • results: The system was tested using data collected in Sydney, Australia, and was found to effectively determine the vehicle operator's attention levels, enabling interventions such as warnings or reducing autonomous functionality as appropriate.
    Abstract Despite the continual advances in Advanced Driver Assistance Systems (ADAS) and the development of high-level autonomous vehicles (AV), there is a general consensus that for the short to medium term, there is a requirement for a human supervisor to handle the edge cases that inevitably arise. Given this requirement, it is essential that the state of the vehicle operator is monitored to ensure they are contributing to the vehicle's safe operation. This paper introduces a dual-source approach integrating data from an infrared camera facing the vehicle operator and vehicle perception systems to produce a metric for driver alertness in order to promote and ensure safe operator behaviour. The infrared camera detects the driver's head, enabling the calculation of head orientation, which is relevant as the head typically moves according to the individual's focus of attention. By incorporating environmental data from the perception system, it becomes possible to determine whether the vehicle operator observes objects in the surroundings. Experiments were conducted using data collected in Sydney, Australia, simulating AV operations in an urban environment. Our results demonstrate that the proposed system effectively determines a metric for the attention levels of the vehicle operator, enabling interventions such as warnings or reducing autonomous functionality as appropriate. This comprehensive solution shows promise in contributing to ADAS and AVs' overall safety and efficiency in a real-world setting.
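    A small sketch of the fusion idea described above: given head orientation from the infrared camera and object positions from the perception system, an object counts as observed if it lies within an angular cone around the head direction. The cone threshold, coordinate convention, and scoring are illustrative assumptions.

      # Sketch of fusing head orientation with perceived objects: an object
      # counts as 'attended' if it lies within an angular cone around the
      # driver's head direction. Angles/threshold are illustrative assumptions.
      import numpy as np

      def attention_score(head_yaw_deg, head_pitch_deg, objects_xyz, cone_deg=30.0):
          """objects_xyz: (N, 3) object positions in the head-centred frame (x forward)."""
          yaw, pitch = np.radians(head_yaw_deg), np.radians(head_pitch_deg)
          gaze = np.array([np.cos(pitch) * np.cos(yaw),
                           np.cos(pitch) * np.sin(yaw),
                           np.sin(pitch)])
          dirs = objects_xyz / np.linalg.norm(objects_xyz, axis=1, keepdims=True)
          ang = np.degrees(np.arccos(np.clip(dirs @ gaze, -1.0, 1.0)))
          attended = ang < cone_deg
          return attended.mean(), attended            # fraction of objects observed

      objects = np.array([[10.0, 1.0, 0.0],           # pedestrian ahead
                          [5.0, -6.0, 0.0]])          # vehicle to the right
      score, mask = attention_score(head_yaw_deg=5.0, head_pitch_deg=0.0,
                                    objects_xyz=objects)
      print(score, mask)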

DIAR: Deep Image Alignment and Reconstruction using Swin Transformers

  • paper_url: http://arxiv.org/abs/2310.11605
  • repo_url: None
  • paper_authors: Monika Kwiatkowski, Simon Matern, Olaf Hellwich
  • for: This paper aims to build a deep learning pipeline that simultaneously aligns and reconstructs sequences of distorted images.
  • methods: The paper trains Swin transformer models on a custom dataset of distorted images (lighting, specularities, shadows, occlusion, and perspective distortions with ground-truth homographies), and uses attention and neural feature maps to detect relevant image content and separate it from outliers and artifacts.
  • results: Using dense neural feature maps together with the trained models, the paper obtains accurate alignment and reconstruction of distorted images.
    Abstract When taking images of some occluded content, one is often faced with the problem that every individual image frame contains unwanted artifacts, but a collection of images contains all relevant information if properly aligned and aggregated. In this paper, we attempt to build a deep learning pipeline that simultaneously aligns a sequence of distorted images and reconstructs them. We create a dataset that contains images with image distortions, such as lighting, specularities, shadows, and occlusion. We create perspective distortions with corresponding ground-truth homographies as labels. We use our dataset to train Swin transformer models to analyze sequential image data. The attention maps enable the model to detect relevant image content and differentiate it from outliers and artifacts. We further explore using neural feature maps as alternatives to classical key point detectors. The feature maps of trained convolutional layers provide dense image descriptors that can be used to find point correspondences between images. We utilize this to compute coarse image alignments and explore its limitations.
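    A sketch of the alternative to classical keypoint detectors mentioned above: dense feature maps are matched by mutual nearest neighbours and a coarse homography is fitted with RANSAC. The matching procedure and the feature stride are assumptions; the paper's exact pipeline may differ.

      # Sketch of coarse alignment from dense feature maps: mutual nearest-neighbour
      # matching of per-pixel descriptors followed by RANSAC homography fitting.
      # Illustrative assumptions; the paper's matching procedure may differ.
      import numpy as np
      import cv2

      def coarse_homography(feat_a, feat_b, stride=8):
          """feat_a, feat_b: (C, H, W) L2-normalised feature maps of the two images."""
          C, H, W = feat_a.shape
          fa = feat_a.reshape(C, -1).T                # (H*W, C)
          fb = feat_b.reshape(C, -1).T
          sim = fa @ fb.T                             # cosine similarities
          ab = sim.argmax(1)                          # best match a -> b
          ba = sim.argmax(0)                          # best match b -> a
          idx_a = np.where(ba[ab] == np.arange(len(ab)))[0]   # mutual matches
          ys, xs = np.divmod(idx_a, W)
          ys2, xs2 = np.divmod(ab[idx_a], W)
          pts_a = np.stack([xs, ys], 1).astype(np.float32) * stride
          pts_b = np.stack([xs2, ys2], 1).astype(np.float32) * stride
          H_mat, inliers = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
          return H_mat, inliers

      f1 = np.random.randn(64, 32, 32).astype(np.float32)
      f1 /= np.linalg.norm(f1, axis=0, keepdims=True)
      f2 = np.roll(f1, 2, axis=2)                     # toy 'shifted' second image
      print(coarse_homography(f1, f2)[0])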

Learning Neural Implicit through Volume Rendering with Attentive Depth Fusion Priors

  • paper_url: http://arxiv.org/abs/2310.11598
  • repo_url: https://github.com/MachinePerceptionLab/Attentive_DFPrior
  • paper_authors: Pengchong Hu, Zhizhong Han
  • for: This work aims to improve the accuracy of 3D reconstruction from multi-view images by learning neural implicit representations, using an attentive depth fusion prior to handle incomplete depth at holes and occluded structures.
  • methods: Neural implicit representations are learned from multi-view RGBD images through volume rendering with an attentive depth fusion prior; a novel attention mechanism lets the network directly use the depth fusion prior when learning the implicit function.
  • results: The method outperforms the latest neural implicit methods on widely used benchmarks, handling incomplete depth at holes and occluded structures better and improving 3D reconstruction accuracy.
    Abstract Learning neural implicit representations has achieved remarkable performance in 3D reconstruction from multi-view images. Current methods use volume rendering to render implicit representations into either RGB or depth images that are supervised by multi-view ground truth. However, rendering a view each time suffers from incomplete depth at holes and unawareness of occluded structures from the depth supervision, which severely affects the accuracy of geometry inference via volume rendering. To resolve this issue, we propose to learn neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior. Our prior allows neural networks to perceive coarse 3D structures from the Truncated Signed Distance Function (TSDF) fused from all depth images available for rendering. The TSDF enables accessing the missing depth at holes on one depth image and the occluded parts that are invisible from the current view. By introducing a novel attention mechanism, we allow neural networks to directly use the depth fusion prior with the inferred occupancy as the learned implicit function. Our attention mechanism works with either a one-time fused TSDF that represents a whole scene or an incrementally fused TSDF that represents a partial scene in the context of Simultaneous Localization and Mapping (SLAM). Our evaluations on widely used benchmarks including synthetic and real-world scans show our superiority over the latest neural implicit methods. Project page: https://machineperceptionlab.github.io/Attentive_DF_Prior/
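    A minimal sketch of the TSDF fusion that the attentive depth fusion prior reads from: posed depth maps are integrated into a truncated signed distance volume. The attention mechanism itself is not reproduced; the camera model, grid size, and truncation distance are illustrative assumptions.

      # Minimal TSDF fusion over a voxel grid from posed depth maps, the kind of
      # fused volume the attentive depth fusion prior accesses. The attention
      # mechanism itself is not shown; intrinsics/truncation are illustrative.
      import numpy as np

      def fuse_depth(tsdf, weights, depth, K, cam_T_world, origin, voxel_size, trunc=0.05):
          nx, ny, nz = tsdf.shape
          ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                                   indexing="ij")
          pts_w = origin + voxel_size * np.stack([ii, jj, kk], -1).reshape(-1, 3)
          pts_c = (cam_T_world[:3, :3] @ pts_w.T + cam_T_world[:3, 3:4]).T
          z = pts_c[:, 2]
          z_safe = np.where(z > 0, z, 1.0)            # avoid division by zero
          uv = (K @ pts_c.T).T
          u = np.round(uv[:, 0] / z_safe).astype(int)
          v = np.round(uv[:, 1] / z_safe).astype(int)
          valid = (z > 0) & (u >= 0) & (v >= 0) & (u < depth.shape[1]) & (v < depth.shape[0])
          d = np.zeros_like(z)
          d[valid] = depth[v[valid], u[valid]]
          sdf = d - z                                 # signed distance along the ray
          keep = valid & (d > 0) & (sdf > -trunc)
          tsdf_new = np.clip(sdf / trunc, -1.0, 1.0)
          flat_t, flat_w = tsdf.reshape(-1), weights.reshape(-1)
          flat_t[keep] = (flat_t[keep] * flat_w[keep] + tsdf_new[keep]) / (flat_w[keep] + 1)
          flat_w[keep] += 1
          return flat_t.reshape(tsdf.shape), flat_w.reshape(weights.shape)

      tsdf = np.ones((32, 32, 32))
      w = np.zeros((32, 32, 32))
      K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
      depth = np.full((64, 64), 2.0)                  # toy flat wall 2 m away
      tsdf, w = fuse_depth(tsdf, w, depth, K, np.eye(4),
                           origin=np.array([-1.0, -1.0, 0.5]), voxel_size=0.1)
      print((tsdf < 0).sum(), (w > 0).sum())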

  • paper_url: http://arxiv.org/abs/2310.11577
  • repo_url: None
  • paper_authors: Mahsa Dibaji, Neha Gianchandani, Akhil Nair, Mansi Singhal, Roberto Souza, Mariana Bento
  • for: This paper studies bias and fairness of machine learning models across sex subgroups.
  • methods: The authors train and evaluate brain age prediction models on brain MRI data under different experimental designs (female-only, male-only, and balanced training sets).
  • results: Models trained on different sex subgroups and datasets show performance disparities, and such biases can make model decisions unfair across sex subgroups.
    Abstract While utilizing machine learning models, one of the most crucial aspects is how bias and fairness affect model outcomes for diverse demographics. This becomes especially relevant in the context of machine learning for medical imaging applications as these models are increasingly being used for diagnosis and treatment planning. In this paper, we study biases related to sex when developing a machine learning model based on brain magnetic resonance images (MRI). We investigate the effects of sex by performing brain age prediction considering different experimental designs: model trained using only female subjects, only male subjects and a balanced dataset. We also perform evaluation on multiple MRI datasets (Calgary-Campinas(CC359) and CamCAN) to assess the generalization capability of the proposed models. We found disparities in the performance of brain age prediction models when trained on distinct sex subgroups and datasets, in both final predictions and decision making (assessed using interpretability models). Our results demonstrated variations in model generalizability across sex-specific subgroups, suggesting potential biases in models trained on unbalanced datasets. This underlines the critical role of careful experimental design in generating fair and reliable outcomes.

Learning Lens Blur Fields

  • paper_url: http://arxiv.org/abs/2310.11535
  • repo_url: None
  • paper_authors: Esther Y. H. Lin, Zhecheng Wang, Rebecca Lin, Daniel Miau, Florian Kainz, Jiawen Chen, Xuaner Cecilia Zhang, David B. Lindell, Kiriakos N. Kutulakos
  • for: The paper is written to address the challenge of modeling optical blur in modern cameras with complex optical elements, and to introduce a high-dimensional neural representation of the blur field.
  • methods: The paper proposes a practical method for acquiring the lens blur field, which is a multilayer perceptron (MLP) designed to capture variations of the lens 2D point spread function over image plane location, focus setting, and depth. The representation models the combined effects of defocus, diffraction, aberration, and accounts for sensor features such as pixel color filters and pixel-specific micro-lenses.
  • results: The paper shows that the acquired 5D blur fields are expressive and accurate enough to reveal differences in optical behavior of smartphone devices of the same make and model, and provides a first-of-its-kind dataset of 5D blur fields for smartphone cameras, camera bodies equipped with a variety of lenses, etc.
    Abstract Optical blur is an inherent property of any lens system and is challenging to model in modern cameras because of their complex optical elements. To tackle this challenge, we introduce a high-dimensional neural representation of blur, the lens blur field, and a practical method for acquiring it. The lens blur field is a multilayer perceptron (MLP) designed to (1) accurately capture variations of the lens 2D point spread function over image plane location, focus setting and, optionally, depth and (2) represent these variations parametrically as a single, sensor-specific function. The representation models the combined effects of defocus, diffraction, aberration, and accounts for sensor features such as pixel color filters and pixel-specific micro-lenses. To learn the real-world blur field of a given device, we formulate a generalized non-blind deconvolution problem that directly optimizes the MLP weights using a small set of focal stacks as the only input. We also provide a first-of-its-kind dataset of 5D blur fields, for smartphone cameras, camera bodies equipped with a variety of lenses, etc. Lastly, we show that acquired 5D blur fields are expressive and accurate enough to reveal, for the first time, differences in optical behavior of smartphone devices of the same make and model.
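    A minimal sketch of an MLP blur field as described above: it is queried with image-plane location, focus setting, depth, and PSF coordinates and returns a PSF sample. The input parameterisation, layer sizes, and normalisation are assumptions, not the paper's exact design.

      # Sketch of an MLP 'blur field': query (image-plane x, y, focus, depth, PSF u, v)
      # and get back a PSF sample. Input parameterisation and layer sizes are
      # illustrative assumptions, not the paper's exact architecture.
      import torch
      import torch.nn as nn

      class BlurFieldMLP(nn.Module):
          def __init__(self, in_dim=6, hidden=256, depth=4):
              super().__init__()
              layers, d = [], in_dim
              for _ in range(depth):
                  layers += [nn.Linear(d, hidden), nn.ReLU()]
                  d = hidden
              layers += [nn.Linear(d, 1), nn.Softplus()]    # non-negative PSF values
              self.mlp = nn.Sequential(*layers)

          def psf(self, x, y, focus, depth, size=21):
              # Sample a size x size PSF patch at one (x, y, focus, depth) query.
              u = torch.linspace(-1, 1, size)
              uu, vv = torch.meshgrid(u, u, indexing="ij")
              q = torch.stack([torch.full_like(uu, x), torch.full_like(uu, y),
                               torch.full_like(uu, focus), torch.full_like(uu, depth),
                               uu, vv], dim=-1).reshape(-1, 6)
              k = self.mlp(q).reshape(size, size)
              return k / k.sum()                            # normalise to unit energy

      field = BlurFieldMLP()
      print(field.psf(x=0.3, y=-0.1, focus=0.5, depth=0.8).shape)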

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

  • paper_url: http://arxiv.org/abs/2310.11513
  • repo_url: https://github.com/djghosh13/geneval
  • paper_authors: Dhruba Ghosh, Hanna Hajishirzi, Ludwig Schmidt
  • for: Evaluating fine-grained properties of text-to-image generative models, such as object co-occurrence, position, count, and color.
  • methods: Existing object detection models are leveraged to evaluate text-to-image models on a variety of generation tasks, and other discriminative vision models are linked into the pipeline to further verify properties such as object color.
  • results: Recent text-to-image models show significant improvement on these tasks but still lack complex capabilities such as spatial relations and attribute binding.
    Abstract Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.

DELIFFAS: Deformable Light Fields for Fast Avatar Synthesis

  • paper_url: http://arxiv.org/abs/2310.11449
  • repo_url: None
  • paper_authors: Youngjoong Kwon, Lingjie Liu, Henry Fuchs, Marc Habermann, Christian Theobalt
  • for: Generating high-quality, controllable digital human avatars, a long-standing problem in graphics and vision.
  • methods: A novel method, DELIFFAS, that parameterizes the appearance of the human as a surface light field attached to a controllable and deforming human mesh model.
  • results: A carefully designed human representation and supervision strategy lead to state-of-the-art synthesis results and fast inference times.
    Abstract Generating controllable and photorealistic digital human avatars is a long-standing and important problem in Vision and Graphics. Recent methods have shown great progress in terms of either photorealism or inference speed while the combination of the two desired properties still remains unsolved. To this end, we propose a novel method, called DELIFFAS, which parameterizes the appearance of the human as a surface light field that is attached to a controllable and deforming human mesh model. At the core, we represent the light field around the human with a deformable two-surface parameterization, which enables fast and accurate inference of the human appearance. This allows perceptual supervision on the full image compared to previous approaches that could only supervise individual pixels or small patches due to their slow runtime. Our carefully designed human representation and supervision strategy leads to state-of-the-art synthesis results and inference time. The video results and code are available at https://vcai.mpi-inf.mpg.de/projects/DELIFFAS.

4K4D: Real-Time 4D View Synthesis at 4K Resolution

  • paper_url: http://arxiv.org/abs/2310.11448
  • repo_url: https://github.com/zju3dv/4K4D
  • paper_authors: Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, Xiaowei Zhou
  • for: High-fidelity, real-time view synthesis of dynamic 3D scenes at 4K resolution.
  • methods: A 4D point cloud representation that supports hardware rasterization, together with a novel hybrid appearance model that boosts rendering quality while preserving efficiency.
  • results: Rendering is roughly 30x faster than previous methods, reaching over 400 FPS at 1080p on the DNA-Rendering dataset and 80 FPS at 4K on the ENeRF-Outdoor dataset, with state-of-the-art rendering quality.
    Abstract This paper targets high-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution. Recently, some methods on dynamic view synthesis have shown impressive rendering quality. However, their speed is still limited when rendering high-resolution images. To overcome this problem, we propose 4K4D, a 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed. Our representation is built on a 4D feature grid so that the points are naturally regularized and can be robustly optimized. In addition, we design a novel hybrid appearance model that significantly boosts the rendering quality while preserving efficiency. Moreover, we develop a differentiable depth peeling algorithm to effectively learn the proposed model from RGB videos. Experiments show that our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x faster than previous methods and achieves the state-of-the-art rendering quality. Our project page is available at https://zju3dv.github.io/4k4d/.

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

  • paper_url: http://arxiv.org/abs/2310.11440
  • repo_url: https://github.com/evalcrafter/EvalCrafter
  • paper_authors: Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan
  • for: This paper proposes a new framework and pipeline for evaluating video generation models, improving on the accuracy and comprehensiveness of existing evaluation practice.
  • methods: A new prompt list for text-to-video generation is built by analyzing real-world prompts with the help of a large language model, and state-of-the-art video generative models are evaluated on carefully designed benchmarks with around 18 objective metrics; a set of coefficients is fitted to align the objective metrics with users' opinions.
  • results: The opinion-aligned final score correlates with user opinions better than simply averaging the metrics, demonstrating the effectiveness of the proposed evaluation method.
    Abstract The vision and language generative models have been overgrown in recent years. For video generation, various open-sourced models and public-available services are released for generating high-visual quality videos. However, these methods often use a few academic metrics, for example, FVD or IS, to evaluate the performance. We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Thus, we propose a new framework and pipeline to exhaustively evaluate the performance of the generated videos. To achieve this, we first conduct a new prompt list for text-to-video generation by analyzing the real-world prompt list with the help of the large language model. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmarks, in terms of visual qualities, content qualities, motion qualities, and text-caption alignment with around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients to align the objective metrics to the users' opinions. Based on the proposed opinion alignment method, our final score shows a higher correlation than simply averaging the metrics, showing the effectiveness of the proposed evaluation method.
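    A small sketch of the opinion-alignment step described above, implemented here as a plain least-squares fit of per-metric coefficients to user opinion scores. The paper's actual alignment model may differ, and the data below are synthetic.

      # Sketch of opinion alignment: fit per-metric coefficients so that a weighted
      # combination of objective metrics best predicts user opinion scores, then
      # score models with the fitted weights. A plain least-squares stand-in for
      # whatever regression the paper actually uses; data are synthetic.
      import numpy as np

      rng = np.random.default_rng(0)
      metrics = rng.random((40, 18))                 # 40 evaluated models/videos x 18 metrics
      user_opinion = metrics @ rng.random(18) + 0.1 * rng.standard_normal(40)

      X = np.hstack([metrics, np.ones((len(metrics), 1))])   # add a bias term
      coef, *_ = np.linalg.lstsq(X, user_opinion, rcond=None)

      aligned_score = X @ coef
      naive_score = metrics.mean(axis=1)
      print("aligned corr:", np.corrcoef(aligned_score, user_opinion)[0, 1])
      print("naive-average corr:", np.corrcoef(naive_score, user_opinion)[0, 1])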

Revisiting Map Relations for Unsupervised Non-Rigid Shape Matching

  • paper_url: http://arxiv.org/abs/2310.11420
  • repo_url: None
  • paper_authors: Dongliang Cao, Paul Roetzer, Florian Bernard
  • for: Non-rigid 3D shape matching.
  • methods: A novel unsupervised learning approach comprising a self-adaptive functional map solver and a vertex-wise contrastive loss for more discriminative features.
  • results: Substantially outperforms previous state-of-the-art methods under various challenging scenarios, including non-isometry, topological noise, and partiality.
    Abstract We propose a novel unsupervised learning approach for non-rigid 3D shape matching. Our approach improves upon recent state-of-the art deep functional map methods and can be applied to a broad range of different challenging scenarios. Previous deep functional map methods mainly focus on feature extraction and aim exclusively at obtaining more expressive features for functional map computation. However, the importance of the functional map computation itself is often neglected and the relationship between the functional map and point-wise map is underexplored. In this paper, we systematically investigate the coupling relationship between the functional map from the functional map solver and the point-wise map based on feature similarity. To this end, we propose a self-adaptive functional map solver to adjust the functional map regularisation for different shape matching scenarios, together with a vertex-wise contrastive loss to obtain more discriminative features. Using different challenging datasets (including non-isometry, topological noise and partiality), we demonstrate that our method substantially outperforms previous state-of-the-art methods.
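    A minimal sketch of the classical functional-map computation the method builds on: the spectral map C is solved by least squares from descriptor coefficients and converted to a point-wise map by nearest neighbours. The self-adaptive regularisation and the vertex-wise contrastive loss proposed in the paper are not shown, and the inputs below are random placeholders.

      # Sketch of the classical functional-map step underlying the method: solve
      # least squares for the spectral map C from descriptor coefficients, then
      # recover a point-wise map by nearest neighbour in the aligned spectral
      # embedding. The paper's adaptive regulariser and contrastive loss are omitted.
      import numpy as np
      from scipy.spatial import cKDTree

      def functional_map(phi_x, phi_y, feat_x, feat_y, areas_x, areas_y):
          """phi_*: (n, k) LB eigenbases; feat_*: (n, d) descriptors; areas_*: (n,) vertex areas."""
          a_x = phi_x.T @ (areas_x[:, None] * feat_x)        # spectral descriptor coefficients
          a_y = phi_y.T @ (areas_y[:, None] * feat_y)
          C, *_ = np.linalg.lstsq(a_x.T, a_y.T, rcond=None)  # solve C a_x = a_y
          C = C.T
          # Point-wise map: nearest neighbour between aligned spectral embeddings.
          tree = cKDTree(phi_y)
          _, p2p = tree.query(phi_x @ C.T)                   # maps shape X vertices to Y
          return C, p2p

      n, k, d = 200, 20, 30
      rng = np.random.default_rng(1)
      phi = rng.standard_normal((n, k))
      feat = rng.standard_normal((n, d))
      C, p2p = functional_map(phi, phi, feat, feat, np.ones(n), np.ones(n))
      print(C.shape, p2p[:5])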

VcT: Visual change Transformer for Remote Sensing Image Change Detection

  • paper_url: http://arxiv.org/abs/2310.11417
  • repo_url: https://github.com/event-ahu/vct_remote_sensing_change_detection
  • paper_authors: Bo Jiang, Zitian Wang, Xixi Wang, Ziyan Zhang, Lan Chen, Xiao Wang, Bin Luo
  • for: This paper proposes a Visual change Transformer (VcT) model for remote sensing image change detection.
  • methods: A shared backbone network extracts feature maps for the image pair; a graph neural network models the structured information, top-K reliable tokens are mined and refined with a clustering algorithm, and then enhanced with self-/cross-attention mechanisms.
  • results: Extensive experiments validate the effectiveness of the proposed VcT model.
    Abstract Existing visual change detectors usually adopt CNNs or Transformers for feature representation learning and focus on learning effective representation for the changed regions between images. Although good performance can be obtained by enhancing the features of the change regions, however, these works are still limited mainly due to the ignorance of mining the unchanged background context information. It is known that one main challenge for change detection is how to obtain the consistent representations for two images involving different variations, such as spatial variation, sunlight intensity, etc. In this work, we demonstrate that carefully mining the common background information provides an important cue to learn the consistent representations for the two images which thus obviously facilitates the visual change detection problem. Based on this observation, we propose a novel Visual change Transformer (VcT) model for visual change detection problem. To be specific, a shared backbone network is first used to extract the feature maps for the given image pair. Then, each pixel of feature map is regarded as a graph node and the graph neural network is proposed to model the structured information for coarse change map prediction. Top-K reliable tokens can be mined from the map and refined by using the clustering algorithm. Then, these reliable tokens are enhanced by first utilizing self/cross-attention schemes and then interacting with original features via an anchor-primary attention learning module. Finally, the prediction head is proposed to get a more accurate change map. Extensive experiments on multiple benchmark datasets validated the effectiveness of our proposed VcT model.

A voxel-level approach to brain age prediction: A method to assess regional brain aging

  • paper_url: http://arxiv.org/abs/2310.11385
  • repo_url: https://github.com/nehagianchandani/voxel-level-brain-age-prediction
  • paper_authors: Neha Gianchandani, Mahsa Dibaji, Johanna Ospel, Fernando Vega, Mariana Bento, M. Ethan MacDonald, Roberto Souza
  • for: Predicting brain age from T1-weighted MRI at the voxel level to obtain localized, fine-grained brain age estimates, helping to understand differences in aging trajectories between healthy and diseased populations.
  • methods: A deep learning-based multitask model for voxel-level brain age prediction.
  • results: The model outperforms existing approaches in the literature and yields valuable clinical insights, revealing disparities in regional aging trajectories between healthy subjects and those with neurological disorders such as dementia, and more specifically Alzheimer's disease.
    Abstract Brain aging is a regional phenomenon, a facet that remains relatively under-explored within the realm of brain age prediction research using machine learning methods. Voxel-level predictions can provide localized brain age estimates that can provide granular insights into the regional aging processes. This is essential to understand the differences in aging trajectories in healthy versus diseased subjects. In this work, a deep learning-based multitask model is proposed for voxel-level brain age prediction from T1-weighted magnetic resonance images. The proposed model outperforms the models existing in the literature and yields valuable clinical insights when applied to both healthy and diseased populations. Regional analysis is performed on the voxel-level brain age predictions to understand aging trajectories of known anatomical regions in the brain and show that there exist disparities in regional aging trajectories of healthy subjects compared to ones with underlying neurological disorders such as Dementia and more specifically, Alzheimer's disease. Our code is available at https://github.com/nehagianchandani/Voxel-level-brain-age-prediction.

Towards Generalizable Multi-Camera 3D Object Detection via Perspective Debiasing

  • paper_url: http://arxiv.org/abs/2310.11346
  • repo_url: None
  • paper_authors: Hao Lu, Yunpeng Zhang, Qing Lian, Dalong Du, Yingcong Chen
  • for: This work aims to improve the accuracy and robustness of Multi-Camera 3D Object Detection (MC3D-Det) in unfamiliar testing environments.
  • methods: A novel method that aligns 3D detections with 2D camera-plane results to ensure consistent and accurate detections. The framework is anchored in perspective debiasing: multi-view maps are rendered from BEV features and their perspective bias is rectified, using implicit foreground volumes to bridge the camera and BEV planes, which helps learn features resilient to viewpoint changes.
  • results: The method improves detection accuracy and robustness across viewpoints and environments, achieving significant gains on Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) tasks, and obtains satisfactory results on real data even when trained only on virtual datasets.
    Abstract Detecting objects in 3D space using multiple cameras, known as Multi-Camera 3D Object Detection (MC3D-Det), has gained prominence with the advent of bird's-eye view (BEV) approaches. However, these methods often struggle when faced with unfamiliar testing environments due to the lack of diverse training data encompassing various viewpoints and environments. To address this, we propose a novel method that aligns 3D detection with 2D camera plane results, ensuring consistent and accurate detections. Our framework, anchored in perspective debiasing, helps the learning of features resilient to domain shifts. In our approach, we render diverse view maps from BEV features and rectify the perspective bias of these maps, leveraging implicit foreground volumes to bridge the camera and BEV planes. This two-step process promotes the learning of perspective- and context-independent features, crucial for accurate object detection across varying viewpoints, camera parameters and environment conditions. Notably, our model-agnostic approach preserves the original network structure without incurring additional inference costs, facilitating seamless integration across various models and simplifying deployment. Furthermore, we also show our approach achieves satisfactory results in real data when trained only with virtual datasets, eliminating the need for real scene annotations. Experimental results on both Domain Generalization (DG) and Unsupervised Domain Adaptation (UDA) clearly demonstrate its effectiveness. Our code will be released.

Towards Generic Semi-Supervised Framework for Volumetric Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.11320
  • repo_url: https://github.com/xmed-lab/GenericSSL
  • paper_authors: Haonan Wang, Xiaomeng Li
  • for: Developing a generic semi-supervised learning (SSL) framework that handles multiple settings, including SSL, Unsupervised Domain Adaptation (UDA), and Semi-supervised Domain Generalization (SemiDG).
  • methods: An Aggregating & Decoupling framework: a Diffusion encoder builds a common knowledge set of distribution-invariant features aggregated from multiple distributions/domains, and three decoders decouple training with labeled and unlabeled data to avoid over-fitting to the labeled data.
  • results: Evaluated on four benchmark datasets for SSL, class-imbalanced SSL, UDA, and SemiDG, the framework shows notable improvements over existing methods in all four settings, indicating its potential for more challenging SSL scenarios.
    Abstract Volume-wise labeling in 3D medical images is a time-consuming task that requires expertise. As a result, there is growing interest in using semi-supervised learning (SSL) techniques to train models with limited labeled data. However, the challenges and practical applications extend beyond SSL to settings such as unsupervised domain adaptation (UDA) and semi-supervised domain generalization (SemiDG). This work aims to develop a generic SSL framework that can handle all three settings. We identify two main obstacles to achieving this goal in the existing SSL framework: 1) the weakness of capturing distribution-invariant features; and 2) the tendency for unlabeled data to be overwhelmed by labeled data, leading to over-fitting to the labeled data during training. To address these issues, we propose an Aggregating & Decoupling framework. The aggregating part consists of a Diffusion encoder that constructs a common knowledge set by extracting distribution-invariant features from aggregated information from multiple distributions/domains. The decoupling part consists of three decoders that decouple the training process with labeled and unlabeled data, thus avoiding over-fitting to labeled data, specific domains and classes. We evaluate our proposed framework on four benchmark datasets for SSL, Class-imbalanced SSL, UDA and SemiDG. The results showcase notable improvements compared to state-of-the-art methods across all four settings, indicating the potential of our framework to tackle more challenging SSL scenarios. Code and models are available at: https://github.com/xmed-lab/GenericSSL.

Multi Self-supervised Pre-fine-tuned Transformer Fusion for Better Intelligent Transportation Detection

  • paper_url: http://arxiv.org/abs/2310.11307
  • repo_url: None
  • paper_authors: Juwu Zheng, Jiangtao Ren
  • for: This work aims to improve detection accuracy in intelligent transportation systems, addressing two limitations of existing detectors: the gap between knowledge pre-trained on large-scale datasets and the knowledge required by the target task, and the limited learning ability of single-source learning.
  • methods: A Multi Self-supervised Pre-fine-tuned Transformer Fusion (MSPTF) network with two steps: unsupervised pre-fine-tuned domain knowledge learning, which introduces self-supervised learning into transformer pre-fine-tuning to reduce data cost and narrow the knowledge gap, and multi-model fusion target-task learning, where a Multi-model Semantic Consistency Cross-attention Fusion (MSCCF) network combines features from different transformer models while accounting for channel and feature-vector semantic consistency, yielding more complete and suitable fusion features for detection.
  • results: On a vehicle recognition dataset and a road disease detection dataset, the method improves over the baseline by 1.1%, 5.5%, and 4.2%, and over state-of-the-art methods by 0.7%, 1.8%, and 1.7%, demonstrating its effectiveness.
    Abstract Intelligent transportation system combines advanced information technology to provide intelligent services such as monitoring, detection, and early warning for modern transportation. Intelligent transportation detection is the cornerstone of many intelligent traffic services by identifying task targets through object detection methods. However existing detection methods in intelligent transportation are limited by two aspects. First, there is a difference between the model knowledge pre-trained on large-scale datasets and the knowledge required for target task. Second, most detection models follow the pattern of single-source learning, which limits the learning ability. To address these problems, we propose a Multi Self-supervised Pre-fine-tuned Transformer Fusion (MSPTF) network, consisting of two steps: unsupervised pre-fine-tune domain knowledge learning and multi-model fusion target task learning. In the first step, we introduced self-supervised learning methods into transformer model pre-fine-tune which could reduce data costs and alleviate the knowledge gap between pre-trained model and target task. In the second step, we take feature information differences between different model architectures and different pre-fine-tune tasks into account and propose Multi-model Semantic Consistency Cross-attention Fusion (MSCCF) network to combine different transformer model features by considering channel semantic consistency and feature vector semantic consistency, which obtain more complete and proper fusion features for detection task. We experimented the proposed method on vehicle recognition dataset and road disease detection dataset and achieved 1.1%, 5.5%, 4.2% improvement compared with baseline and 0.7%, 1.8%, 1.7% compared with sota, which proved the effectiveness of our method.

CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation

  • paper_url: http://arxiv.org/abs/2310.11295
  • repo_url: None
  • paper_authors: Zhaojie Chu, Kailing Guo, Xiaofen Xing, Yilin Lan, Bolun Cai, Xiangmin Xu
  • for: This paper targets speech-driven 3D facial animation, a challenging cross-modal task.
  • methods: A new framework, CorrTalk, that establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions, with a dual-branch decoding architecture that synthesizes strong and weak facial activities simultaneously for a wider range of expressive animation.
  • results: Extensive experiments and a user study show that CorrTalk outperforms existing methods, producing accurate lip-sync and plausible expressions across different facial activity intensities.
    Abstract Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest. During speaking activities, the mouth displays strong motions, while the other facial regions typically demonstrate comparatively weak activity levels. Existing approaches often simplify the process by directly mapping single-level speech features to the entire facial animation, which overlook the differences in facial activity intensity leading to overly smoothed facial movements. In this study, we propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions. A novel facial activity intensity metric is defined to distinguish between strong and weak facial activity, obtained by computing the short-time Fourier transform of facial vertex displacements. Based on the variances in facial activity, we propose a dual-branch decoding framework to synchronously synthesize strong and weak facial activity, which guarantees wider intensity facial animation synthesis. Furthermore, a weighted hierarchical feature encoder is proposed to establish temporal correlation between hierarchical speech features and facial activity at different intensities, which ensures lip-sync and plausible facial expressions. Extensive qualitatively and quantitatively experiments as well as a user study indicate that our CorrTalk outperforms existing state-of-the-art methods. The source code and supplementary video are publicly available at: https://zjchu.github.io/projects/CorrTalk/
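    A sketch of the facial activity intensity metric described above: the short-time Fourier transform of per-vertex displacement signals is used to separate strong from weak activity regions. The window length, threshold, and the toy trajectory below are illustrative assumptions.

      # Sketch of the facial-activity-intensity idea: STFT energy of per-vertex
      # displacement signals separates strong (mouth-like) from weak regions.
      # Window length and threshold are illustrative assumptions.
      import numpy as np
      from scipy.signal import stft

      def activity_intensity(vertex_traj, fps=30, win=16):
          """vertex_traj: (T, V, 3) vertex positions over T frames."""
          disp = np.linalg.norm(np.diff(vertex_traj, axis=0), axis=-1)   # (T-1, V)
          intensities = []
          for v in range(disp.shape[1]):
              _, _, Z = stft(disp[:, v], fs=fps, nperseg=win)
              intensities.append(np.abs(Z).mean())                       # mean spectral energy
          return np.array(intensities)

      T, V = 120, 50
      traj = np.zeros((T, V, 3))
      traj[:, :10, 1] = 0.5 * np.sin(np.linspace(0, 20 * np.pi, T))[:, None]  # 'mouth' vertices
      intensity = activity_intensity(traj)
      strong = intensity > intensity.mean()          # simple split into strong/weak regions
      print(strong[:12])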

Self-Supervised 3D Scene Flow Estimation and Motion Prediction using Local Rigidity Prior

  • paper_url: http://arxiv.org/abs/2310.11284
  • repo_url: None
  • paper_authors: Ruibo Li, Chi Zhang, Zhe Wang, Chunhua Shen, Guosheng Lin
  • for: investigate self-supervised 3D scene flow estimation and class-agnostic motion prediction on point clouds
  • methods: build pseudo scene flow labels through piecewise rigid motion estimation and validate with a validity mask
  • results: achieve new state-of-the-art performance in self-supervised scene flow learning and outperform previous state-of-the-art self-supervised methods on nuScenes dataset.
    Abstract In this article, we investigate self-supervised 3D scene flow estimation and class-agnostic motion prediction on point clouds. A realistic scene can be well modeled as a collection of rigidly moving parts, therefore its scene flow can be represented as a combination of the rigid motion of these individual parts. Building upon this observation, we propose to generate pseudo scene flow labels for self-supervised learning through piecewise rigid motion estimation, in which the source point cloud is decomposed into local regions and each region is treated as rigid. By rigidly aligning each region with its potential counterpart in the target point cloud, we obtain a region-specific rigid transformation to generate its pseudo flow labels. To mitigate the impact of potential outliers on label generation, when solving the rigid registration for each region, we alternately perform three steps: establishing point correspondences, measuring the confidence for the correspondences, and updating the rigid transformation based on the correspondences and their confidence. As a result, confident correspondences will dominate label generation and a validity mask will be derived for the generated pseudo labels. By using the pseudo labels together with their validity mask for supervision, models can be trained in a self-supervised manner. Extensive experiments on FlyingThings3D and KITTI datasets demonstrate that our method achieves new state-of-the-art performance in self-supervised scene flow learning, without any ground truth scene flow for supervision, even performing better than some supervised counterparts. Additionally, our method is further extended to class-agnostic motion prediction and significantly outperforms previous state-of-the-art self-supervised methods on nuScenes dataset.
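    A sketch of the per-region rigid alignment used to generate pseudo scene-flow labels: given correspondences between a local region and its counterpart, the best rigid transform is estimated with the weighted SVD (Kabsch) solution and the induced displacements serve as pseudo flow. The alternating correspondence/confidence update from the paper is not reproduced.

      # Sketch of the per-region rigid alignment used for pseudo scene-flow labels:
      # a weighted Kabsch/SVD fit of R, t between corresponding points. The
      # correspondence/confidence update loop from the paper is not reproduced.
      import numpy as np

      def rigid_fit(src, dst, w=None):
          """Best-fit R, t (least squares) with dst ~ src @ R.T + t; src, dst: (N, 3)."""
          w = np.ones(len(src)) if w is None else w
          w = w / w.sum()
          mu_s, mu_d = w @ src, w @ dst
          H = (src - mu_s).T @ np.diag(w) @ (dst - mu_d)
          U, _, Vt = np.linalg.svd(H)
          S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])   # avoid reflections
          R = Vt.T @ S @ U.T
          t = mu_d - R @ mu_s
          return R, t

      rng = np.random.default_rng(0)
      region = rng.standard_normal((100, 3))
      angle = np.radians(10)
      R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                         [np.sin(angle),  np.cos(angle), 0],
                         [0, 0, 1]])
      target = region @ R_true.T + np.array([0.5, 0.0, 0.1])
      R, t = rigid_fit(region, target)
      pseudo_flow = region @ R.T + t - region        # per-point pseudo scene-flow labels
      print(np.abs(R - R_true).max(), pseudo_flow[:2])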

Video Super-Resolution Using a Grouped Residual in Residual Network

  • paper_url: http://arxiv.org/abs/2310.11276
  • repo_url: None
  • paper_authors: MohammadHossein Ashoori, Arash Amini
  • for: Increasing the nominal resolution and improving the quality of image/video content.
  • methods: A grouped residual in residual network (GRRN) for video super-resolution.
  • results: Compared with existing methods, GRRN delivers acceptable output image quality, although it does not surpass them on some quantitative criteria.
    Abstract Super-resolution (SR) is the technique of increasing the nominal resolution of image / video content accompanied with quality improvement. Video super-resolution (VSR) can be considered as the generalization of single image super-resolution (SISR). This generalization should be such that more detail is created in the output using adjacent input frames. In this paper, we propose a grouped residual in residual network (GRRN) for VSR. By adjusting the hyperparameters of the proposed structure, we train three networks with different numbers of parameters and compare their quantitative and qualitative results with the existing methods. Although GRRN does not provide better results than the existing methods on some quantitative criteria, it has acceptable performance in terms of the quality of the output image.

Image Compression using only Attention based Neural Networks

  • paper_url: http://arxiv.org/abs/2310.11265
  • repo_url: None
  • paper_authors: Natacha Luka, Romain Negrel, David Picard
  • for: This paper investigates image compression built exclusively from attention layers, aiming at strong compression performance without convolutions.
  • methods: A transformer architecture based on the attention mechanism, introducing learned image queries that aggregate patch information via cross-attention, followed by quantization and coding techniques.
  • results: Extensive evaluations show that the convolution-free architecture achieves performance competitive with existing pipelines on the popular Kodak, DIV2K, and CLIC datasets.
    Abstract In recent research, Learned Image Compression has gained prominence for its capacity to outperform traditional handcrafted pipelines, especially at low bit-rates. While existing methods incorporate convolutional priors with occasional attention blocks to address long-range dependencies, recent advances in computer vision advocate for a transformative shift towards fully transformer-based architectures grounded in the attention mechanism. This paper investigates the feasibility of image compression exclusively using attention layers within our novel model, QPressFormer. We introduce the concept of learned image queries to aggregate patch information via cross-attention, followed by quantization and coding techniques. Through extensive evaluations, our work demonstrates competitive performance achieved by convolution-free architectures across the popular Kodak, DIV2K, and CLIC datasets.
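    A minimal sketch of the core mechanism described above: a fixed set of learned image queries aggregates patch tokens through cross-attention into a compact latent. The dimensions, the use of nn.MultiheadAttention, and the patch embedding are illustrative assumptions, and the quantization and coding stages are omitted.

      # Sketch of the core mechanism: a fixed set of learned queries aggregates
      # patch tokens via cross-attention into a compact latent to be quantised and
      # entropy coded (quantisation/coding not shown). Sizes are illustrative.
      import torch
      import torch.nn as nn

      class QueryAggregator(nn.Module):
          def __init__(self, num_queries=64, dim=192, patch=16, heads=8):
              super().__init__()
              self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
              self.queries = nn.Parameter(torch.randn(num_queries, dim))
              self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
              self.norm = nn.LayerNorm(dim)

          def forward(self, img):                        # img: (B, 3, H, W)
              tokens = self.patch_embed(img).flatten(2).transpose(1, 2)   # (B, N, dim)
              q = self.queries.unsqueeze(0).expand(img.shape[0], -1, -1)  # (B, Q, dim)
              latent, _ = self.cross_attn(q, tokens, tokens)              # queries attend to patches
              return self.norm(latent)                   # (B, Q, dim) compressed representation

      enc = QueryAggregator()
      print(enc(torch.randn(2, 3, 256, 256)).shape)      # torch.Size([2, 64, 192])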

An empirical study of automatic wildlife detection using drone thermal imaging and object detection

  • paper_url: http://arxiv.org/abs/2310.11257
  • repo_url: None
  • paper_authors: Miao Chang, Tan Vuong, Manas Palaparthi, Lachlan Howell, Alessio Bonti, Mohamed Abdelrazek, Duc Thanh Nguyen
  • for: Wildlife management, specifically exploring the use of drones and thermal imaging technology for collecting and interpreting wildlife data.
  • methods: A comprehensive review and empirical study of drone-based wildlife detection: a realistic dataset of drone-derived wildlife thermal detections (including arboreal species such as koalas and ground-dwelling species) is collected and annotated with bounding boxes by experts, and state-of-the-art object detection algorithms are benchmarked on it.
  • results: The experimental results identify open issues and inform future directions in automatic animal monitoring using drones.
    Abstract Artificial intelligence has the potential to make valuable contributions to wildlife management through cost-effective methods for the collection and interpretation of wildlife data. Recent advances in remotely piloted aircraft systems (RPAS or ``drones'') and thermal imaging technology have created new approaches to collect wildlife data. These emerging technologies could provide promising alternatives to standard labourious field techniques as well as cover much larger areas. In this study, we conduct a comprehensive review and empirical study of drone-based wildlife detection. Specifically, we collect a realistic dataset of drone-derived wildlife thermal detections. Wildlife detections, including arboreal (for instance, koalas, phascolarctos cinereus) and ground dwelling species in our collected data are annotated via bounding boxes by experts. We then benchmark state-of-the-art object detection algorithms on our collected dataset. We use these experimental results to identify issues and discuss future directions in automatic animal monitoring using drones.

Gromov-Wasserstein-like Distances in the Gaussian Mixture Models Space

  • paper_url: http://arxiv.org/abs/2310.11256
  • repo_url: None
  • paper_authors: Antoine Salmona, Julie Delon, Agnès Desolneux
  • for: This paper introduces two Gromov-Wasserstein-type distances on the space of Gaussian mixture models. The first is a Gromov-Wasserstein distance between two discrete distributions on the space of Gaussian measures; it can serve as an alternative to Gromov-Wasserstein when one only needs to evaluate how far two distributions are from each other, without deriving a transportation plan between point clouds.
  • methods: The authors introduce another distance between measures living in incomparable spaces, closely related to Gromov-Wasserstein, that can be used to define a transportation plan; restricting the admissible transportation couplings to be Gaussian mixture models themselves yields a second distance between Gaussian mixture models, which also allows deriving an optimal assignment between points.
  • results: A transportation plan associated with the first distance is designed by analogy with the second, and the practical use of both is illustrated on medium-to-large scale problems such as shape matching and hyperspectral image color transfer.
    Abstract In this paper, we introduce two Gromov-Wasserstein-type distances on the set of Gaussian mixture models. The first one takes the form of a Gromov-Wasserstein distance between two discrete distributions on the space of Gaussian measures. This distance can be used as an alternative to Gromov-Wasserstein for applications which only require to evaluate how far the distributions are from each other but does not allow to derive directly an optimal transportation plan between clouds of points. To design a way to define such a transportation plan, we introduce another distance between measures living in incomparable spaces that turns out to be closely related to Gromov-Wasserstein. When restricting the set of admissible transportation couplings to be themselves Gaussian mixture models in this latter, this defines another distance between Gaussian mixture models that can be used as another alternative to Gromov-Wasserstein and which allows to derive an optimal assignment between points. Finally, we design a transportation plan associated with the first distance by analogy with the second, and we illustrate their practical uses on medium-to-large scale problems such as shape matching and hyperspectral image color transfer.
    摘要 在这篇论文中,我们介绍了两种Gromov-Wasserstein-类型的距离在 Gaussian mixture model 上。第一个距离是两个抽象分布在 Gaussian measure 空间上的 Gromov-Wasserstein 距离,可以用作 Gromov-Wasserstein 的替代方法,但不能直接 derivate 最佳运输计划 между云集点。为了设计一种定义这种运输计划的方法,我们引入了另一种在不可比较的空间上的距离,该距离与 Gromov-Wasserstein 密切相关。当限制了可用的运输结合为 Gaussian mixture model 时,这个距离定义了另一种 Gaussian mixture model 之间的距离,可以作为 Gromov-Wasserstein 的另一种替代方法,并且可以 derivate 最佳分配计划。最后,我们设计了一个运输计划相关的方法,并在媒体规模到大型问题上如形态匹配和彩色图像传输中 illustrate 其实用性。
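To make the construction above concrete, here is a minimal numpy sketch (not the authors' code) of the building blocks: the closed-form Wasserstein-2 distance between two Gaussian measures and the intra-mixture cost matrices that, together with the component weights, would be handed to a Gromov-Wasserstein solver such as the one in the POT library. Component counts, means and covariances are toy values.

```python
# Minimal sketch: building blocks of a Gromov-Wasserstein-type comparison of two GMMs.
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """Closed-form squared Wasserstein-2 distance between N(m1,S1) and N(m2,S2)."""
    S2_half = sqrtm(S2)
    cross = np.real(sqrtm(S2_half @ S1 @ S2_half))
    bures = np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sum((m1 - m2) ** 2) + bures)

def gmm_cost_matrix(means, covs):
    """Pairwise W2^2 distances between the components of one mixture."""
    K = len(means)
    C = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            C[i, j] = C[j, i] = w2_gaussians(means[i], covs[i], means[j], covs[j])
    return C

# Two toy 2D mixtures (weights, means, covariances).
rng = np.random.default_rng(0)
means1 = [rng.normal(size=2) for _ in range(3)]
covs1 = [np.eye(2) * s for s in (0.5, 1.0, 2.0)]
w1 = np.array([0.2, 0.3, 0.5])

means2 = [rng.normal(size=2) + 3.0 for _ in range(2)]
covs2 = [np.eye(2), np.diag([2.0, 0.5])]
w2 = np.array([0.6, 0.4])

C1, C2 = gmm_cost_matrix(means1, covs1), gmm_cost_matrix(means2, covs2)
print(C1.shape, C2.shape)  # (3, 3) (2, 2): inputs for a GW solver with marginals w1, w2
```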

LiDAR-based 4D Occupancy Completion and Forecasting

  • paper_url: http://arxiv.org/abs/2310.11239
  • repo_url: https://github.com/ai4ce/occ4cast
  • paper_authors: Xinhao Liu, Moonjun Gong, Qi Fang, Haoyu Xie, Yiming Li, Hang Zhao, Chen Feng
  • for: This work unifies scene completion and forecasting into a single LiDAR perception task, Occupancy Completion and Forecasting (OCF), for 4D perception in autonomous driving.
  • methods: The task requires new algorithms that jointly address three challenges: (1) sparse-to-dense reconstruction, (2) partial-to-complete hallucination, and (3) 3D-to-4D prediction; a large-scale dataset, OCFBench, is curated from public autonomous driving datasets for supervision and evaluation (a toy voxelization sketch follows this entry).
  • results: The authors benchmark closely related existing baselines and their own models on OCFBench; the results highlight the importance of the OCF task and its potential applications.
    Abstract Scene completion and forecasting are two popular perception problems in research for mobile agents like autonomous vehicles. Existing approaches treat the two problems in isolation, resulting in a separate perception of the two aspects. In this paper, we introduce a novel LiDAR perception task of Occupancy Completion and Forecasting (OCF) in the context of autonomous driving to unify these aspects into a cohesive framework. This task requires new algorithms to address three challenges altogether: (1) sparse-to-dense reconstruction, (2) partial-to-complete hallucination, and (3) 3D-to-4D prediction. To enable supervision and evaluation, we curate a large-scale dataset termed OCFBench from public autonomous driving datasets. We analyze the performance of closely related existing baseline models and our own ones on our dataset. We envision that this research will inspire and call for further investigation in this evolving and crucial area of 4D perception. Our code for data curation and baseline implementation is available at https://github.com/ai4ce/Occ4cast.
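As a toy illustration of the kind of data the OCF task operates on, the following numpy sketch voxelizes a LiDAR point cloud into a binary occupancy grid; the grid bounds and voxel size are arbitrary assumptions, and the real OCFBench pipeline is considerably more involved.

```python
# Minimal sketch: voxelizing a LiDAR point cloud into a binary occupancy grid,
# the kind of dense target that sparse-to-dense completion and 3D-to-4D
# forecasting operate on.
import numpy as np

def voxelize(points, grid_min, grid_max, voxel_size):
    """points: (N, 3) array of LiDAR returns -> boolean occupancy volume."""
    grid_min = np.asarray(grid_min, dtype=np.float32)
    grid_max = np.asarray(grid_max, dtype=np.float32)
    dims = np.ceil((grid_max - grid_min) / voxel_size).astype(int)
    occ = np.zeros(dims, dtype=bool)
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < dims), axis=1)  # drop points outside the grid
    idx = idx[inside]
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occ

points = np.random.uniform(-10, 10, size=(5000, 3)).astype(np.float32)
occ_t = voxelize(points, grid_min=(-10, -10, -2), grid_max=(10, 10, 2), voxel_size=0.5)
print(occ_t.shape, occ_t.sum())  # a single time step; stacking T such grids gives a 4D sequence
```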

Innovative Methods for Non-Destructive Inspection of Handwritten Documents

  • paper_url: http://arxiv.org/abs/2310.11217
  • repo_url: None
  • paper_authors: Eleonora Breci, Luca Guarnera, Sebastiano Battiato
  • for: This work aims to make handwritten document analysis more accurate and efficient, so that authorship can be established from the intrinsic characteristics of a document.
  • methods: Image processing and deep learning techniques extract and analyze intrinsic measures of manuscript documents, including text line heights, spacing between words, and character sizes; the final feature vector of each document contains the mean and standard deviation of each measure, and authorship is assessed objectively from the Euclidean distance between document feature vectors (a minimal sketch follows this entry).
  • results: Experiments show that the method objectively determines authorship across different writing media (handwritten paper and digital devices) and outperforms existing approaches.
    Abstract Handwritten document analysis is an area of forensic science, with the goal of establishing authorship of documents through examination of inherent characteristics. Law enforcement agencies use standard protocols based on manual processing of handwritten documents. This method is time-consuming, is often subjective in its evaluation, and is not replicable. To overcome these limitations, in this paper we present a framework capable of extracting and analyzing intrinsic measures of manuscript documents related to text line heights, space between words, and character sizes using image processing and deep learning techniques. The final feature vector for each document involved consists of the mean and standard deviation for every type of measure collected. By quantifying the Euclidean distance between the feature vectors of the documents to be compared, authorship can be discerned. We also proposed a new and challenging dataset consisting of 362 handwritten manuscripts written on paper and digital devices by 124 different people. Our study pioneered the comparison between traditionally handwritten documents and those produced with digital tools (e.g., tablets). Experimental results demonstrate the ability of our method to objectively determine authorship in different writing media, outperforming the state of the art.
    摘要 手写文档分析是法医科学领域中的一个领域,旨在透过评估手写文档的内在特征,以确定文档的作者。法律机关通常采用标准协议,基于手动处理手写文档。这种方法是时间consuming,容易受主观影响,并且不可重复。为了超越这些限制,在这篇论文中,我们提出了一个框架,可以提取和分析手写文档中相关的字体大小、词间距和字体大小等内在特征,使用图像处理和深度学习技术。每个文档的最终特征向量由每种测量类型的均值和标准差组成。通过计算这些特征向量之间的欧氏距离,可以bjectively Determine authorship。我们还提出了一个新的和挑战性的数据集,包含362份手写文档,由124名不同的人写作。我们的研究对手写文档和数字工具(例如平板电脑)生成的文档进行比较,并实验结果表明,我们的方法可以在不同的写作媒体中对作者进行 объекively 确定。
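A minimal sketch of the feature-vector comparison described above, with hypothetical measurements standing in for the paper's extraction pipeline: each document is reduced to the mean and standard deviation of its line heights, word spaces and character sizes, and authorship similarity is scored by the Euclidean distance between vectors.

```python
# Minimal sketch: per-document feature vectors and Euclidean distance comparison.
import numpy as np

def document_vector(line_heights, word_spaces, char_sizes):
    feats = []
    for values in (line_heights, word_spaces, char_sizes):
        values = np.asarray(values, dtype=float)
        feats.extend([values.mean(), values.std()])
    return np.array(feats)  # [mean, std] per measure -> 6-dimensional vector

doc_a = document_vector([31, 30, 33], [12, 14, 13, 15], [9, 10, 8, 9])
doc_b = document_vector([30, 32, 31], [13, 13, 14, 16], [9, 9, 8, 10])
doc_c = document_vector([45, 47, 44], [20, 22, 19, 21], [14, 15, 13, 14])

print(np.linalg.norm(doc_a - doc_b))  # small distance: plausibly the same writer
print(np.linalg.norm(doc_a - doc_c))  # larger distance: plausibly different writers
```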

Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification

  • paper_url: http://arxiv.org/abs/2310.11210
  • repo_url: None
  • paper_authors: Shuanglin Yan, Neng Dong, Jun Liu, Liyan Zhang, Jinhui Tang
  • for: This work aims to improve the accuracy of text-to-image person retrieval and to address the lack of many-to-many cross-view matching in existing methods.
  • methods: A simple yet effective framework, LCR$^2$S, models many-to-many correspondences of the same identity by learning comprehensive representations for both modalities; a support set is built for each image (text) from other samples of the same identity, and a multi-head attentional fusion module fuses the image (text) with its support set (a simplified fusion sketch follows this entry). The resulting "richer" model is distilled into a lightweight model for single-input inference.
  • results: The method is effective on three popular TIReID datasets and sets a new state of the art.
    Abstract Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text. However, existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view. The many-to-many matching between image-text pairs across views under the same identity is not taken into account, which is one of the main reasons for the poor performance of existing methods. To this end, we propose a simple yet effective framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning comprehensive representations for both modalities from a novel perspective. We construct a support set for each image (text) by using other images (texts) under the same identity and design a multi-head attentional fusion module to fuse the image (text) and its support set. The resulting enriched image and text features fuse information from multiple views, which are aligned to train a "richer" TIReID model with many-to-many correspondences. Since the support set is unavailable during inference, we propose to distill the knowledge learned by the "richer" model into a lightweight model for inference with a single image/text as input. The lightweight model focuses on semantic association and reasoning of multi-view information, which can generate a comprehensive representation containing multi-view information with only a single-view input to perform accurate text-to-image retrieval during inference. In particular, we use the intra-modal features and inter-modal semantic relations of the "richer" model to supervise the lightweight model to inherit its powerful capability. Extensive experiments demonstrate the effectiveness of LCR$^2$S, and it also achieves new state-of-the-art performance on three popular TIReID datasets.
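The following PyTorch sketch illustrates one plausible form of the multi-head attentional fusion of a sample with its support set; the dimensions, module names and residual design are assumptions, not the released LCR$^2$S implementation.

```python
# Minimal sketch: fusing an image (or text) feature with a support set built from
# other samples of the same identity via multi-head attention, so the enriched
# feature mixes information from multiple views.
import torch
import torch.nn as nn

class SupportSetFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat, support):
        # feat: (B, D) feature of the anchor sample
        # support: (B, S, D) features of other samples with the same identity
        query = feat.unsqueeze(1)                      # (B, 1, D)
        fused, _ = self.attn(query, support, support)  # attend over the support set
        return self.norm(feat + fused.squeeze(1))      # residual fusion

fusion = SupportSetFusion(dim=512)
anchor = torch.randn(4, 512)
support = torch.randn(4, 3, 512)   # 3 support samples per anchor
enriched = fusion(anchor, support)
print(enriched.shape)              # torch.Size([4, 512])
```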

Whole-brain radiomics for clustered federated personalization in brain tumor segmentation

  • paper_url: http://arxiv.org/abs/2310.11480
  • repo_url: None
  • paper_authors: Matthis Manthe, Stefan Duffner, Carole Lartizien
  • for: The paper focuses on mitigating the impact of statistical heterogeneity in federated learning for medical image segmentation.
  • methods: The proposed federated personalization method computes radiomic features capturing the global texture of each 3D image volume within every institution, pools the feature vectors on the central server for a clustering analysis, and fine-tunes the global model obtained through classical federated learning on each clustered decentralized dataset (a toy sketch follows this entry).
  • results: The method is validated on the Federated Brain Tumor Segmentation 2022 Challenge dataset (FeTS2022) and improves performance compared to classical federated learning.
    Abstract Federated learning and its application to medical image segmentation have recently become a popular research topic. This training paradigm suffers from statistical heterogeneity between participating institutions' local datasets, incurring convergence slowdown as well as potential accuracy loss compared to classical training. To mitigate this effect, federated personalization emerged as the federated optimization of one model per institution. We propose a novel personalization algorithm tailored to the feature shift induced by the usage of different scanners and acquisition parameters by different institutions. This method is the first to account for both inter and intra-institution feature shift (multiple scanners used in a single institution). It is based on the computation, within each centre, of a series of radiomic features capturing the global texture of each 3D image volume, followed by a clustering analysis pooling all feature vectors transferred from the local institutions to the central server. Each computed clustered decentralized dataset (potentially including data from different institutions) then serves to finetune a global model obtained through classical federated learning. We validate our approach on the Federated Brain Tumor Segmentation 2022 Challenge dataset (FeTS2022). Our code is available at (https://github.com/MatthisManthe/radiomics_CFFL).
    摘要 《联邦学习和它的医学图像分割应用已经在最近引起了广泛的研究兴趣。这种培训模式受到参与机构本地数据的统计差异的影响,会导致减速和溢出精度相比于传统培训。为了缓解这些效应,联邦个性化出现了,即在每个机构上进行联邦优化的一个模型。我们提出了一种新的个性化算法,专门针对不同扫描仪和获取参数导致的特征偏移。这种方法是首次考虑了多个机构的内部和外部特征偏移(多个扫描仪在同一个机构中使用)。它基于在每个中心计算的一系列各种激光特征,用于捕捉每个3D图像卷积的全局 текстура,然后对所有从本地机构传输到中央服务器的特征向量进行归一化分析。每个计算的归一化分析后的各个归一化分析结果(可能包括多个机构的数据)然后用于在经典联邦学习中进行精化。我们验证了我们的方法在2022年联邦大脑肿瘤分割挑战数据集(FeTS2022)上。我们的代码可以在(https://github.com/MatthisManthe/radiomics_CFFL)上获取。》
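A toy sketch of the clustered federated personalization idea, with simple first-order intensity statistics standing in for the paper's radiomic texture features: per-volume feature vectors are pooled and clustered, and each cluster would then be used to fine-tune its own copy of the global model. Feature choices and cluster counts are assumptions.

```python
# Minimal sketch: summarize each 3D volume by global descriptors, pool the
# vectors on the server, and cluster them by scanner/acquisition characteristics.
import numpy as np
from sklearn.cluster import KMeans

def volume_features(vol):
    """Global descriptors of one 3D image volume (hypothetical feature set)."""
    vol = vol.astype(np.float32)
    return np.array([vol.mean(), vol.std(), np.percentile(vol, 10),
                     np.percentile(vol, 90), float((vol > vol.mean()).mean())])

rng = np.random.default_rng(0)
# Two "institutions" whose scanners produce different intensity distributions.
site_a = [rng.normal(100, 10, size=(16, 16, 16)) for _ in range(20)]
site_b = [rng.normal(160, 25, size=(16, 16, 16)) for _ in range(20)]

features = np.stack([volume_features(v) for v in site_a + site_b])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(clusters)  # each cluster would fine-tune its own copy of the global model
```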

Improving Video Deepfake Detection: A DCT-Based Approach with Patch-Level Analysis

  • paper_url: http://arxiv.org/abs/2310.11204
  • repo_url: None
  • paper_authors: Luca Guarnera, Salvatore Manganello, Sebastiano Battiato
  • for: This work develops a reliable, fast, and explainable algorithm, from a forensic perspective, for detecting deepfake video content, in response to the widespread misuse of this technology.
  • methods: I-frames are extracted to speed up computation and analysis, and the whole frame, background, face, eyes, nose, mouth, and face frame are analyzed separately to find the most discriminative regions; the Beta components extracted from the AC coefficients of the Discrete Cosine Transform (DCT) are fed to standard classifiers (e.g., k-NN, SVM) to identify the most discriminative frequencies (see the sketch after this entry).
  • results: Experiments show that the eye and mouth regions are the most discriminative and determine the nature of a video more reliably than analyzing the whole frame; the proposed method is analytical, fast, and requires little computational power.
    Abstract The term deepfake refers to all those multimedia contents that were synthetically altered or created from scratch through the use of generative models. This phenomenon has become widespread due to the use of increasingly accurate and efficient architectures capable of rendering manipulated content indistinguishable from real content. In order to fight the illicit use of this powerful technology, it has become necessary to develop algorithms able to distinguish synthetic content from real ones. In this study, a new algorithm for the detection of deepfakes in digital videos is presented, focusing on the main goal of creating a fast and explainable method from a forensic perspective. To achieve this goal, the I-frames were extracted in order to provide faster computation and analysis than approaches described in literature. In addition, to identify the most discriminating regions within individual video frames, the entire frame, background, face, eyes, nose, mouth, and face frame were analyzed separately. From the Discrete Cosine Transform (DCT), the Beta components were extracted from the AC coefficients and used as input to standard classifiers (e.g., k-NN, SVM, and others) in order to identify those frequencies most discriminative for solving the task in question. Experimental results obtained on the Faceforensics++ and Celeb-DF (v2) datasets show that the eye and mouth regions are those most discriminative and able to determine the nature of the video with greater reliability than the analysis of the whole frame. The method proposed in this study is analytical, fast and does not require much computational power.
    摘要 deepfake 指的是通过生成模型制造或修改 multimedia 内容的所有内容。由于使用的生成模型不断改进,使得修改后的内容与原始内容难以区分,因此需要开发一种能够分辨真实内容和修改后的内容的算法。本研究提出了一种用于检测数字视频中的深伪内容的新算法,强调实现快速和可解释的方法。为了实现这一目标,我们提取了 I-frame,以便更快地计算和分析,而不是按照文献中所描述的方法。此外,我们还分析了每帧视频中最有可能区分的地方,包括背景、脸、眼睛、鼻子和口。从 discrete cosine transform (DCT) 中,我们提取了 AC 约束中的 Beta 成分,并将其作为输入给标准分类器(如 k-NN、SVM 等),以确定解决当前问题中最有可能的频率。实验结果表明,脸部和口部是最有可能的区分点,能够更加可靠地判断视频的性质,而不是分析整个帧。该方法具有分析性、快速和不需要大量计算资源的优点。
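The sketch below illustrates, under assumptions, the frequency-statistics part of such a pipeline: Beta values are estimated from the AC coefficients of 8x8 block DCTs (here as the maximum-likelihood scale of a zero-mean Laplacian, i.e. the mean absolute coefficient) and fed to a k-NN classifier. The crop size and the toy "real"/"fake" data are illustrative only, not the authors' full method.

```python
# Minimal sketch: block-DCT Beta statistics of a grayscale crop (e.g. an eye or
# mouth region) used as features for a standard classifier.
import numpy as np
from scipy.fft import dctn
from sklearn.neighbors import KNeighborsClassifier

def beta_features(img, block=8):
    h, w = (np.array(img.shape) // block) * block
    coeffs = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            coeffs.append(dctn(img[y:y + block, x:x + block], norm='ortho'))
    coeffs = np.stack(coeffs)                    # (num_blocks, 8, 8)
    betas = np.abs(coeffs).mean(axis=0)          # Laplacian scale per frequency
    return betas.ravel()[1:]                     # drop the DC coefficient

rng = np.random.default_rng(0)
real = [rng.normal(0.5, 0.22, size=(64, 64)) for _ in range(30)]
fake = [rng.normal(0.5, 0.10, size=(64, 64)) for _ in range(30)]  # toy "smoother" fakes

X = np.stack([beta_features(i) for i in real + fake])
y = np.array([0] * 30 + [1] * 30)
clf = KNeighborsClassifier(n_neighbors=3).fit(X[::2], y[::2])
print(clf.score(X[1::2], y[1::2]))  # held-out accuracy on the toy data
```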

Sparse Multi-Object Render-and-Compare

  • paper_url: http://arxiv.org/abs/2310.11184
  • repo_url: None
  • paper_authors: Florian Langer, Ignas Budvytis, Roberto Cipolla
  • for: This paper addresses reconstructing the 3D shape and pose of objects from a single image, an essential task in robotics, augmented reality, and digital content creation.
  • methods: A new network architecture, Multi-SPARC, learns to perform CAD model alignment for multiple detected objects jointly, following a render-and-compare approach.
  • results: Compared to other single-view methods, the approach achieves state-of-the-art performance on the challenging real-world ScanNet dataset, improving instance alignment accuracy from 31.8% to 40.3%, on par with state-of-the-art multi-view methods.
    Abstract Reconstructing 3D shape and pose of static objects from a single image is an essential task for various industries, including robotics, augmented reality, and digital content creation. This can be done by directly predicting 3D shape in various representations or by retrieving CAD models from a database and predicting their alignments. Directly predicting 3D shapes often produces unrealistic, overly smoothed or tessellated shapes. Retrieving CAD models ensures realistic shapes but requires robust and accurate alignment. Learning to directly predict CAD model poses from image features is challenging and inaccurate. Works, such as ROCA, compute poses from predicted normalised object coordinates which can be more accurate but are susceptible to systematic failure. SPARC demonstrates that following a ''render-and-compare'' approach where a network iteratively improves upon its own predictions achieves accurate alignments. Nevertheless, it performs individual CAD alignment for every object detected in an image. This approach is slow when applied to many objects as the time complexity increases linearly with the number of objects and can not learn inter-object relations. Introducing a new network architecture Multi-SPARC we learn to perform CAD model alignments for multiple detected objects jointly. Compared to other single-view methods we achieve state-of-the-art performance on the challenging real-world dataset ScanNet. By improving the instance alignment accuracy from 31.8% to 40.3% we perform similar to state-of-the-art multi-view methods.
    摘要 重建静止物体的3D形状和姿势从单个图像中是许多领域的关键任务,包括机器人、增强现实和数字内容创建。这可以通过直接预测3D形状或从数据库中检索CAD模型并预测其对齐来完成。直接预测3D形状常常生成不真实、过度缩短或分割的形状。从数据库中检索CAD模型可以保证真实的形状,但需要稳定和准确的对齐。学习直接从图像特征中预测CAD模型姿势是困难且不准确。ROCA等方法计算姿势从预测的 нормализованobject坐标,可以更准确但容易系统性失败。SPARC示例了一个“render-and-compare”方法,其中网络在自己的预测基础上进行多次改进,可以实现准确的对齐。然而,它每个图像中检测到的对象都进行个别CAD对齐,这会导致运行时间linearly增长与对象数量的线性关系,无法学习对象之间的关系。我们提出了一种新的网络架构 Multi-SPARC,可以同时对多个检测到的对象进行CAD模型对齐。与其他单视图方法相比,我们在真实的世界数据集ScanNet上 achieve state-of-the-art性能。我们从31.8%提高了实例对齐精度到40.3%,与多视图方法相当。

Unsupervised Pre-Training Using Masked Autoencoders for ECG Analysis

  • paper_url: http://arxiv.org/abs/2310.11153
  • repo_url: None
  • paper_authors: Guoxin Wang, Qingyuan Wang, Ganesh Neelakanta Iyer, Avishek Nag, Deepu John
  • for: This paper proposes an unsupervised pre-training technique based on masked autoencoders (MAE) for analyzing electrocardiogram (ECG) signals.
  • methods: The approach combines masked autoencoder pre-training with task-specific fine-tuning to form a complete, architecture-agnostic framework for ECG analysis (a simplified masking sketch follows this entry).
  • results: Experiments with various model architectures show an accuracy of 94.39% on the MITDB dataset for ECG arrhythmia classification, with better performance on previously unseen data than fully supervised methods.
    Abstract Unsupervised learning methods have become increasingly important in deep learning due to their demonstrated large utilization of datasets and higher accuracy in computer vision and natural language processing tasks. There is a growing trend to extend unsupervised learning methods to other domains, which helps to utilize a large amount of unlabelled data. This paper proposes an unsupervised pre-training technique based on masked autoencoder (MAE) for electrocardiogram (ECG) signals. In addition, we propose a task-specific fine-tuning to form a complete framework for ECG analysis. The framework is high-level, universal, and not individually adapted to specific model architectures or tasks. Experiments are conducted using various model architectures and large-scale datasets, resulting in an accuracy of 94.39% on the MITDB dataset for ECG arrhythmia classification task. The result shows a better performance for the classification of previously unseen data for the proposed approach compared to fully supervised methods.
    摘要 《深度学习中的无监督学习方法在最近几年变得越来越重要,因为它们在计算机视觉和自然语言处理任务中的精度高于监督学习方法。随着这些方法的扩展到其他领域,可以利用大量的无标签数据。这篇论文提出了基于屏蔽自动编码器(MAE)的无监督预训练技术,用于电cardiogram(ECG)信号分析。此外,我们还提出了任务特定的细化,以形成一个完整的ECG分析框架。这个框架是高级、通用、不具体适应特定的模型结构或任务。在各种模型结构和大规模数据集上进行了实验,实现了MITDB数据集上ECG动力痕迹分类任务的准确率为94.39%。结果显示,提posed方法对于处理前未见数据的分类表现更好于完全监督方法。》
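A simplified PyTorch sketch of masked pre-training on 1D ECG signals: the signal is split into patches, a random subset of patch embeddings is replaced by a learnable mask token, and only the masked patches contribute to the reconstruction loss. This is a stripped-down stand-in for the paper's MAE framework; the patch length, dimensions and tiny encoder/decoder are assumptions.

```python
# Minimal sketch: masked reconstruction pre-training on unlabeled 1D ECG segments.
import torch
import torch.nn as nn

class TinyECGMAE(nn.Module):
    def __init__(self, patch_len=25, dim=64):
        super().__init__()
        self.patch_len = patch_len
        self.encode = nn.Sequential(nn.Linear(patch_len, dim), nn.GELU(), nn.Linear(dim, dim))
        self.decode = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, patch_len))
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, signal, mask_ratio=0.6):
        B, L = signal.shape
        patches = signal.view(B, L // self.patch_len, self.patch_len)  # (B, N, P)
        latent = self.encode(patches)
        mask = torch.rand(B, latent.shape[1], device=signal.device) < mask_ratio
        latent = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(latent), latent)
        recon = self.decode(latent)
        return ((recon - patches) ** 2)[mask].mean()  # loss on masked patches only

model = TinyECGMAE()
ecg = torch.randn(8, 500)            # 8 unlabeled ECG segments of 500 samples
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = model(ecg)
loss.backward()
opt.step()
print(float(loss))                   # after pre-training, the encoder is fine-tuned per task
```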

BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference

  • paper_url: http://arxiv.org/abs/2310.11142
  • repo_url: None
  • paper_authors: Siqi Kou, Lei Gan, Dequan Wang, Chongxuan Li, Zhijie Deng
  • for: Improving the quality of images generated by diffusion models by identifying low-quality generations.
  • methods: Pixel-wise uncertainty of the diffusion process is estimated via Bayesian inference, using a novel uncertainty iteration principle and a last-layer Laplace approximation for efficient inference (a toy last-layer Laplace sketch follows this entry).
  • results: The estimated uncertainty can be aggregated into a sample-wise metric to filter out low-fidelity images, and it also helps augment successful generations and rectify artifacts in failed ones in text-to-image tasks.
    Abstract Diffusion models have impressive image generation capability, but low-quality generations still exist, and their identification remains challenging due to the lack of a proper sample-wise metric. To address this, we propose BayesDiff, a pixel-wise uncertainty estimator for generations from diffusion models based on Bayesian inference. In particular, we derive a novel uncertainty iteration principle to characterize the uncertainty dynamics in diffusion, and leverage the last-layer Laplace approximation for efficient Bayesian inference. The estimated pixel-wise uncertainty can not only be aggregated into a sample-wise metric to filter out low-fidelity images but also aids in augmenting successful generations and rectifying artifacts in failed generations in text-to-image tasks. Extensive experiments demonstrate the efficacy of BayesDiff and its promise for practical applications.
    摘要 Diffusion模型具有吸引人的图像生成能力,但低质量生成仍然存在,其标识仍然困难由于缺乏适当的样本级度指标。为解决这个问题,我们提出了 BayesDiff,一种基于泛函推理的像素级uncertainty估计器 дляDiffusion模型。具体来说,我们 derivate了一种新的uncertainty迭代原理来描述Diffusion中的uncertainty动态,并利用最后层拉пла斯批处理来实现高效的泛函推理。测试表明,BayesDiff可以不仅将像素级uncertainty聚合成样本级度指标来滤除低准确图像,还可以帮助改善成功生成的图像和修复失败生成的瑕疵。在文本到图像任务中,BayesDiff展示了其效果和实际应用潜力。
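To illustrate the last-layer Laplace idea the method relies on, here is a toy numpy sketch for a linear output head: the Gaussian posterior covariance over the last-layer weights is the inverse of the (generalized) Hessian, and the predictive variance for a new input is a quadratic form in its features. The regression setting and all constants are assumptions, not BayesDiff itself.

```python
# Minimal sketch: last-layer Laplace approximation and per-output predictive variance.
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 16))                   # penultimate-layer features
w_true = rng.normal(size=16)
y = Phi @ w_true + 0.1 * rng.normal(size=200)

sigma2, prior_prec = 0.1 ** 2, 1.0
H = Phi.T @ Phi / sigma2 + prior_prec * np.eye(16) # Hessian of the negative log posterior
Sigma = np.linalg.inv(H)                           # Laplace posterior covariance
w_map = Sigma @ (Phi.T @ y) / sigma2               # MAP weights

phi_test = rng.normal(size=16)
mean = phi_test @ w_map
var = phi_test @ Sigma @ phi_test + sigma2         # epistemic + observation noise
print(mean, var)  # per-output uncertainty; pixel-wise variances could be aggregated similarly
```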

Super resolution of histopathological frozen sections via deep learning preserving tissue structure

  • paper_url: http://arxiv.org/abs/2310.11112
  • repo_url: None
  • paper_authors: Elad Yoshai, Gil Goldinger, Miki Haifler, Natan T. Shaked
  • for: histopathological frozen sections imaging, with a focus on achieving better distortion measures and reducing the risk of diagnostic misinterpretation.
  • methods: deep-learning architecture that leverages loss functions in the frequency domain to generate high-resolution images while preserving critical image details.
  • results: significant improvements in terms of Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as the preservation of details lost in low-resolution frozen-section images, which can affect pathologists’ clinical decisions.
    Abstract Histopathology plays a pivotal role in medical diagnostics. In contrast to preparing permanent sections for histopathology, a time-consuming process, preparing frozen sections is significantly faster and can be performed during surgery, where the sample scanning time should be optimized. Super-resolution techniques allow imaging the sample in lower magnification and sparing scanning time. In this paper, we present a new approach to super resolution for histopathological frozen sections, with focus on achieving better distortion measures, rather than pursuing photorealistic images that may compromise critical diagnostic information. Our deep-learning architecture focuses on learning the error between interpolated images and real images, thereby it generates high-resolution images while preserving critical image details, reducing the risk of diagnostic misinterpretation. This is done by leveraging the loss functions in the frequency domain, assigning higher weights to the reconstruction of complex, high-frequency components. In comparison to existing methods, we obtained significant improvements in terms of Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as indicated details that lost in the low-resolution frozen-section images, affecting the pathologist's clinical decisions. Our approach has a great potential in providing more-rapid frozen-section imaging, with less scanning, while preserving the high resolution in the imaged sample.
    摘要 Our deep-learning architecture is designed to learn the error between interpolated images and real images, generating high-resolution images while preserving critical image details. We leverage loss functions in the frequency domain, assigning higher weights to the reconstruction of complex, high-frequency components. Compared to existing methods, our approach achieves significant improvements in terms of Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), as well as reveals details that were lost in low-resolution frozen-section images, which can affect the pathologist's clinical decisions.Our approach has great potential in providing rapid frozen-section imaging with less scanning, while preserving the high resolution of the imaged sample. This can improve the accuracy of medical diagnosis and treatment.
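A minimal PyTorch sketch of a frequency-domain loss that up-weights high-frequency components, the general mechanism described above; the exact weighting used by the authors may differ.

```python
# Minimal sketch: frequency-domain reconstruction loss with higher weights on
# high-frequency components, so fine tissue detail is not smoothed away.
import torch

def frequency_weighted_loss(pred, target, alpha=1.0):
    # pred, target: (B, 1, H, W) image batches
    F_pred = torch.fft.fft2(pred)
    F_target = torch.fft.fft2(target)
    B, _, H, W = pred.shape
    fy = torch.fft.fftfreq(H, device=pred.device).view(1, 1, H, 1)
    fx = torch.fft.fftfreq(W, device=pred.device).view(1, 1, 1, W)
    weight = 1.0 + alpha * torch.sqrt(fy ** 2 + fx ** 2)   # grows with spatial frequency
    return (weight * (F_pred - F_target).abs()).mean()

pred = torch.rand(2, 1, 64, 64, requires_grad=True)
target = torch.rand(2, 1, 64, 64)
loss = frequency_weighted_loss(pred, target)
loss.backward()
print(float(loss))
```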

3D Structure-guided Network for Tooth Alignment in 2D Photograph

  • paper_url: http://arxiv.org/abs/2310.11106
  • repo_url: https://github.com/douyl/2DToothAlignment
  • paper_authors: Yulong Dou, Lanzhuju Mei, Dinggang Shen, Zhiming Cui
  • for: This paper presents a tooth alignment network that operates in the 2D image space, supporting dentist-patient communication and encouraging patients to accept orthodontic treatment.
  • methods: 3D intra-oral scanning models collected in clinics are used to learn about orthodontic treatment: pre- and post-treatment 3D tooth structures are projected onto 2D tooth contours, a diffusion model learns the mapping between them, and the aligned contours then guide the generation of a photograph with aesthetically pleasing, aligned teeth and realistic textures.
  • results: The network shows excellent performance and strong applicability on various facial photographs, allowing dentists to quickly produce an aligned-teeth image for communicating with patients.
    Abstract Orthodontics focuses on rectifying misaligned teeth (i.e., malocclusions), affecting both masticatory function and aesthetics. However, orthodontic treatment often involves complex, lengthy procedures. As such, generating a 2D photograph depicting aligned teeth prior to orthodontic treatment is crucial for effective dentist-patient communication and, more importantly, for encouraging patients to accept orthodontic intervention. In this paper, we propose a 3D structure-guided tooth alignment network that takes 2D photographs as input (e.g., photos captured by smartphones) and aligns the teeth within the 2D image space to generate an orthodontic comparison photograph featuring aesthetically pleasing, aligned teeth. Notably, while the process operates within a 2D image space, our method employs 3D intra-oral scanning models collected in clinics to learn about orthodontic treatment, i.e., projecting the pre- and post-orthodontic 3D tooth structures onto 2D tooth contours, followed by a diffusion model to learn the mapping relationship. Ultimately, the aligned tooth contours are leveraged to guide the generation of a 2D photograph with aesthetically pleasing, aligned teeth and realistic textures. We evaluate our network on various facial photographs, demonstrating its exceptional performance and strong applicability within the orthodontic industry.
    摘要 Orthodontics 专注于 corrections 不对称牙齿(即 malocclusion),影响咀嚼功能和美观。然而,orthodontic 治疗经常包括复杂、长时间的过程。因此,生成一张显示牙齿调整后的2D照片是关键的,以便dentist和病人之间有效沟通,更重要的是,使病人accept orthodontic intervention。在这篇论文中,我们提议一个基于3D结构的牙齿调整网络,输入2D照片(例如,由智能手机拍摄的照片),并将牙齿在2D图像空间中调整,生成一张 featuring 美观、调整后的牙齿的orthodontic comparison照片。需要注意的是,我们的过程在2D图像空间中进行,但我们使用了3D intra-oral scanning模型,收集在临床中,以学习orthodontic treatment。具体来说,我们将预 orthodontic 和后 orthodontic 3D 牙齿结构投影到2D 牙齿轮廓上,然后使用一种扩散模型来学习 mapping 关系。最后,我们使用了调整后的牙齿轮廓来指导生成一张 featuring 美观、调整后的牙齿和实际 Texture的2D照片。我们对多张人脸照片进行了评估,并证明了我们的网络在orthodontic 行业中表现出色,有强大的应用前景。

Generalizability of CNN Architectures for Face Morph Presentation Attack

  • paper_url: http://arxiv.org/abs/2310.11105
  • repo_url: None
  • paper_authors: Sherko R. HmaSalah, Aras Asaad
  • for: Preventing criminals from crossing borders using fake identities.
  • methods: Convolutional Neural Network (CNN) models are used for face morph detection, and the generalization power of five CNN architectures is investigated across several datasets.
  • results: InceptionResNet-v2 generalizes best to unseen data and outperforms the other four CNN models.
    Abstract Automatic border control systems are wide spread in modern airports worldwide. Morphing attacks on face biometrics is a serious threat that undermines the security and reliability of face recognition systems deployed in airports and border controls. Therefore, developing a robust Machine Learning (ML) system is necessary to prevent criminals crossing borders with fake identifications especially since it has been shown that security officers cannot detect morphs better than machines. In this study, we investigate the generalization power of Convolutional Neural Network (CNN) architectures against morphing attacks. The investigation utilizes 5 distinct CNNs namely ShuffleNet, DenseNet201, VGG16, EffecientNet-B0 and InceptionResNet-v2. Each CNN architecture represents a well-known family of CNN models in terms of number of parameters, architectural design and performance across various computer vision applications. To ensure robust evaluation, we employ 4 different datasets (Utrecht, London, Defacto and KurdFace) that contain a diverse range of digital face images which cover variations in ethnicity, gender, age, lighting condition and camera setting. One of the fundamental concepts of ML system design is the ability to generalize effectively to previously unseen data, hence not only we evaluate the performance of CNN models within individual datasets but also explore their performance across combined datasets and investigating each dataset in testing phase only. Experimental results on more than 8 thousand images (genuine and morph) from the 4 datasets show that InceptionResNet-v2 generalizes better to unseen data and outperforms the other 4 CNN models.
    摘要 现代机场中的自动边境控制系统广泛应用。但 morphing 攻击对于面部biometrics 是一种严重的威胁,这会使面 recognition 系统在机场和边境控制中受到影响。为了防止罪犯使用假身份证件越境,需要开发一个可靠的机器学习(ML)系统。在这项研究中,我们研究了 CNN 架构对 morphing 攻击的普适性。我们使用 5 种不同的 CNN 模型,即 ShuffleNet、DenseNet201、VGG16、EfficientNet-B0 和 InceptionResNet-v2。每种 CNN 模型都代表了不同的参数量、架构设计和在不同计算机视觉应用中的性能。为了有效评估,我们使用 4 个不同的数据集(UTrecht、London、Defacto 和 KurdFace),这些数据集包含了不同的民族、性别、年龄、照明条件和摄像头设置。 ML 系统设计的一个基本原则是能够有效地普退到未见数据,因此我们不仅在单个数据集中评估 CNN 模型的性能,还在将数据集组合起来评估它们的总体性能。实验结果表明,InceptionResNet-v2 在未见数据中普退性能最好,并且在4个数据集中的测试阶段也表现出色,超过其他 4 种 CNN 模型。

SODA: Robust Training of Test-Time Data Adaptors

  • paper_url: http://arxiv.org/abs/2310.11093
  • repo_url: https://github.com/tmlr-group/soda
  • paper_authors: Zige Wang, Yonggang Zhang, Zhen Fang, Long Lan, Wenjing Yang, Bo Han
  • for: The paper aims to mitigate the performance degradation caused by distribution shifts by adapting the test data to fit models that are already deployed, without access to model parameters.
  • methods: The proposed pseudo-label-robust data adaptation (SODA) trains a data adaptor with zeroth-order optimization (ZOO); high-confidence predicted labels serve as reliable labels for optimizing the adaptor, while for low-confidence predictions the adaptor is encouraged to preserve data information and so mitigate feature corruption (a toy ZOO sketch follows this entry).
  • results: SODA significantly enhances the performance of deployed models in the presence of distribution shifts without requiring access to model parameters.
    Abstract Adapting models deployed to test distributions can mitigate the performance degradation caused by distribution shifts. However, privacy concerns may render model parameters inaccessible. One promising approach involves utilizing zeroth-order optimization (ZOO) to train a data adaptor to adapt the test data to fit the deployed models. Nevertheless, the data adaptor trained with ZOO typically brings restricted improvements due to the potential corruption of data features caused by the data adaptor. To address this issue, we revisit ZOO in the context of test-time data adaptation. We find that the issue directly stems from the unreliable estimation of the gradients used to optimize the data adaptor, which is inherently due to the unreliable nature of the pseudo-labels assigned to the test data. Based on this observation, we propose pseudo-label-robust data adaptation (SODA) to improve the performance of data adaptation. Specifically, SODA leverages high-confidence predicted labels as reliable labels to optimize the data adaptor with ZOO for label prediction. For data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption. Empirical results indicate that SODA can significantly enhance the performance of deployed models in the presence of distribution shifts without requiring access to model parameters.
    摘要 适应已部署的模型可以减轻由分布shift引起的性能下降。然而,隐私问题可能使模型参数无法访问。一种有 promise的方法是使用零次优化(ZOO)来训练一个数据适应器,以适应已部署的模型。然而,通常情况下,ZOO 训练的数据适应器 Typically brings restricted improvements due to the potential corruption of data features caused by the data adaptor。为Address this issue, we revisit ZOO in the context of test-time data adaptation. We find that the issue directly stems from the unreliable estimation of the gradients used to optimize the data adaptor, which is inherently due to the unreliable nature of the pseudo-labels assigned to the test data。 Based on this observation, we propose pseudo-label-robust data adaptation (SODA) to improve the performance of data adaptation。 Specifically, SODA leverages high-confidence predicted labels as reliable labels to optimize the data adaptor with ZOO for label prediction。 For data with low-confidence predictions, SODA encourages the adaptor to preserve data information to mitigate data corruption。 Empirical results indicate that SODA can significantly enhance the performance of deployed models in the presence of distribution shifts without requiring access to model parameters。
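The sketch below illustrates the zeroth-order optimization that SODA builds on: a two-point random-direction gradient estimator that needs only loss values, which is what allows a data adaptor to be trained against a deployed model whose parameters are inaccessible. The quadratic objective and hyperparameters are toy assumptions.

```python
# Minimal sketch: two-point zeroth-order gradient estimation and descent.
import numpy as np

def zoo_gradient(loss_fn, theta, num_dirs=20, mu=1e-3, rng=None):
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = rng.normal(size=theta.shape)
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_dirs

# Toy "black-box" objective standing in for the deployed model's loss.
target = np.array([1.0, -2.0, 0.5])
loss_fn = lambda th: float(np.sum((th - target) ** 2))

theta = np.zeros(3)
for step in range(200):
    theta -= 0.05 * zoo_gradient(loss_fn, theta, rng=np.random.default_rng(step))
print(theta)  # approaches [1.0, -2.0, 0.5] using only loss evaluations
```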

DORec: Decomposed Object Reconstruction Utilizing 2D Self-Supervised Features

  • paper_url: http://arxiv.org/abs/2310.11092
  • repo_url: None
  • paper_authors: Jun Wu, Sicheng Li, Sihui Ji, Yue Wang, Rong Xiong, Yiyi Liao
  • for: Improving the accuracy of decomposing and reconstructing a target object from a complex background.
  • methods: The Decomposed Object Reconstruction (DORec) network builds on neural implicit representations and transfers 2D self-supervised features into masks of two levels of granularity (a binary foreground mask and a K-cluster mask of semantically similar regions) to supervise the decomposition.
  • results: The method achieves accurate segmentation and reconstruction of the foreground object on various datasets.
    Abstract Decomposing a target object from a complex background while reconstructing is challenging. Most approaches acquire the perception for object instances through the use of manual labels, but the annotation procedure is costly. The recent advancements in 2D self-supervised learning have brought new prospects to object-aware representation, yet it remains unclear how to leverage such noisy 2D features for clean decomposition. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to transfer 2D self-supervised features into masks of two levels of granularity to supervise the decomposition, including a binary mask to indicate the foreground regions and a K-cluster mask to indicate the semantically similar regions. These two masks are complementary to each other and lead to robust decomposition. Experimental results show the superiority of DORec in segmenting and reconstructing the foreground object on various datasets.
    摘要 分解一个目标对象从复杂背景中分离,而重建时也是一项挑战。大多数方法通过使用手动标注来获得对象实例的感知,但标注过程很昂贵。现代2D自助学习技术的发展带来了新的可能性,但是如何利用这些噪音2D特征来获得清晰的分解仍然是一个未知。本文提出了基于神经无限表示的分解对象网络(DORec),我们的关键想法是将2D自助学习特征转换成两级划分的mask,包括一个二进制划分用于指示前景区域,以及一个K-集群划分用于指示相似区域。这两个划分是相互补偿的,导致了稳定的分解。实验结果表明DORec在不同的数据集上 segment和重建前景对象表现出色。

United We Stand: Using Epoch-wise Agreement of Ensembles to Combat Overfit

  • paper_url: http://arxiv.org/abs/2310.11077
  • repo_url: None
  • paper_authors: Uri Stern, Daniel Shwartz, Daphna Weinshall
  • for: This paper addresses the overfitting problem of deep neural networks in image classification and proposes a new deep-network ensemble prediction method to combat it.
  • methods: The method builds on a theoretical analysis of a regression model predicting that the variance among classifiers increases when overfit occurs, a behaviour confirmed empirically in commonly used deep networks; the final prediction is the most consensual prediction throughout training (a toy consensus sketch follows this entry).
  • results: On multiple image and text classification datasets, the method eliminates the loss of generalization caused by overfit and often even surpasses the performance obtained with early stopping; it is easy to implement and can be combined with any training scheme and architecture without prior knowledge beyond the training set.
    Abstract Deep neural networks have become the method of choice for solving many image classification tasks, largely because they can fit very complex functions defined over raw images. The downside of such powerful learners is the danger of overfitting the training set, leading to poor generalization, which is usually avoided by regularization and "early stopping" of the training. In this paper, we propose a new deep network ensemble classifier that is very effective against overfit. We begin with the theoretical analysis of a regression model, whose predictions - that the variance among classifiers increases when overfit occurs - is demonstrated empirically in deep networks in common use. Guided by these results, we construct a new ensemble-based prediction method designed to combat overfit, where the prediction is determined by the most consensual prediction throughout the training. On multiple image and text classification datasets, we show that when regular ensembles suffer from overfit, our method eliminates the harmful reduction in generalization due to overfit, and often even surpasses the performance obtained by early stopping. Our method is easy to implement, and can be integrated with any training scheme and architecture, without additional prior knowledge beyond the training set. Accordingly, it is a practical and useful tool to overcome overfit.
    摘要 深度神经网络已成为许多图像分类任务的方法选择,主要是因为它们可以适应非常复杂的图像函数。但是这些强大的学习者也存在过拟合风险,导致泛化性差,通常通过规范和"早停止"等方法来避免。在这篇论文中,我们提出了一种新的深度网络集成分类器,可以很好地避免过拟合。我们从理论分析中开始,对于过拟合情况下的回归模型,其预测结果表明,过拟合时,类ifier的差异量会增加。基于这些结果,我们构建了一种新的集成预测方法,通过在训练过程中确定最一致的预测来对抗过拟合。在多个图像和文本分类 dataset 上,我们证明了,当常见集成遭到过拟合时,我们的方法可以消除过拟合导致的泛化性下降,并经常超越通过早停止获得的性能。我们的方法易于实现,可以与任何训练方案和架构结合使用,无需额外的优先知识,只需要训练集。因此,它是一种实用和有用的工具,可以解决过拟合问题。
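A toy numpy sketch of epoch-wise ensemble agreement, one simple way to realize the idea rather than the paper's exact rule: the predictions of checkpoints saved across epochs are combined per sample, and low agreement flags samples on which an overfit ensemble disagrees.

```python
# Minimal sketch: consensus prediction and agreement score across epoch-wise checkpoints.
import numpy as np

def consensus_predict(prob_per_checkpoint):
    """prob_per_checkpoint: (E, N, C) softmax outputs of E checkpoints on N samples."""
    votes = prob_per_checkpoint.argmax(axis=2)                       # (E, N) hard labels
    preds, agreement = [], []
    for n in range(votes.shape[1]):
        labels, counts = np.unique(votes[:, n], return_counts=True)
        preds.append(labels[counts.argmax()])                        # majority label
        agreement.append(counts.max() / votes.shape[0])              # fraction agreeing
    return np.array(preds), np.array(agreement)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=(10, 100))   # 10 checkpoints, 100 samples, 5 classes
preds, agreement = consensus_predict(probs)
print(preds[:5], agreement[:5])  # low agreement flags samples where the ensemble disagrees
```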

$k$-$t$ CLAIR: Self-Consistency Guided Multi-Prior Learning for Dynamic Parallel MR Image Reconstruction

  • paper_url: http://arxiv.org/abs/2310.11050
  • repo_url: https://github.com/lpzhang/ktCLAIR
  • paper_authors: Liping Zhang, Weitian Chen
  • for: Accelerated dynamic MRI for the clinical diagnosis of cardiac disease.
  • methods: A self-consistency guided multi-prior learning framework, $k$-$t$ CLAIR, exploits spatiotemporal correlations in highly undersampled data for accelerated dynamic parallel MRI reconstruction, iteratively leveraging complementary priors learned in the $x$-$t$, $x$-$f$, and $k$-$t$ domains together with calibration information.
  • results: Experiments show that $k$-$t$ CLAIR achieves high-quality dynamic MR reconstruction of cardiac cine and T1W/T2W images in terms of both quantitative and qualitative performance.
    Abstract Cardiac magnetic resonance imaging (CMR) has been widely used in clinical practice for the medical diagnosis of cardiac diseases. However, the long acquisition time hinders its development in real-time applications. Here, we propose a novel self-consistency guided multi-prior learning framework named $k$-$t$ CLAIR to exploit spatiotemporal correlations from highly undersampled data for accelerated dynamic parallel MRI reconstruction. The $k$-$t$ CLAIR progressively reconstructs faithful images by leveraging multiple complementary priors learned in the $x$-$t$, $x$-$f$, and $k$-$t$ domains in an iterative fashion, as dynamic MRI exhibits high spatiotemporal redundancy. Additionally, $k$-$t$ CLAIR incorporates calibration information for prior learning, resulting in a more consistent reconstruction. Experimental results on cardiac cine and T1W/T2W images demonstrate that $k$-$t$ CLAIR achieves high-quality dynamic MR reconstruction in terms of both quantitative and qualitative performance.
    摘要 cardiac magnetic resonance imaging (CMR) 已经广泛应用在临床实践中用于医疗诊断心脏疾病。然而,长期获取时间限制了其在实时应用中的发展。我们提议一种新的自适应性导向多优先学习框架,名为 $k$-$t$ CLAIR,以利用高度减掉样本数据中的空间时间相关性进行加速的动态平行MRI重建。 $k$-$t$ CLAIR 逐渐重建准确的图像,利用在 $x$-$t$, $x$-$f$, 和 $k$-$t$ 领域中学习的多个补做先天知识,因为动态MRI在空间时间上具有高度相似性。此外, $k$-$t$ CLAIR 还包含了准确性信息 для先天学习,从而使得重建更加一致。实验结果表明, $k$-$t$ CLAIR 在心脏笔记和 T1W/T2W 图像上达到了高质量的动态MR重建, Both quantitative and qualitative performance。

Co-Learning Semantic-aware Unsupervised Segmentation for Pathological Image Registration

  • paper_url: http://arxiv.org/abs/2310.11040
  • repo_url: None
  • paper_authors: Yang Liu, Shi Gu
  • for: This work proposes an unsupervised method for pathological image registration that does not require annotated data, addressing the loss of spatial correspondence and the abnormal tissue distortion caused by focal lesions.
  • methods: Following the principles of Generation, Inpainting, and Registration (GIR), the registration, segmentation, and inpainting modules are trained simultaneously in a co-learning manner, so that segmentation of the focal area and registration of inpainted pairs improve collaboratively.
  • results: Experiments on multiple datasets, including T1-weighted MRI, show that the method accurately registers pathological images and identifies lesions even in challenging imaging modalities. Code is available at https://github.com/brain-intelligence-lab/GIRNet.
    Abstract The registration of pathological images plays an important role in medical applications. Despite its significance, most researchers in this field primarily focus on the registration of normal tissue into normal tissue. The negative impact of focal tissue, such as the loss of spatial correspondence information and the abnormal distortion of tissue, are rarely considered. In this paper, we propose GIRNet, a novel unsupervised approach for pathological image registration by incorporating segmentation and inpainting through the principles of Generation, Inpainting, and Registration (GIR). The registration, segmentation, and inpainting modules are trained simultaneously in a co-learning manner so that the segmentation of the focal area and the registration of inpainted pairs can improve collaboratively. Overall, the registration of pathological images is achieved in a completely unsupervised learning framework. Experimental results on multiple datasets, including Magnetic Resonance Imaging (MRI) of T1 sequences, demonstrate the efficacy of our proposed method. Our results show that our method can accurately achieve the registration of pathological images and identify lesions even in challenging imaging modalities. Our unsupervised approach offers a promising solution for the efficient and cost-effective registration of pathological images. Our code is available at https://github.com/brain-intelligence-lab/GIRNet.
    摘要 注册病理图像在医疗应用中扮演着重要的角色。尽管其重要性,大多数研究人员在这个领域主要关注normal tissue到normal tissue的注册。病理区域的负面影响,如损失的空间匹配信息和病理区域的异常扭曲,几乎不被考虑。在这篇论文中,我们提出了GIRNet,一种新的无监督方法,通过生成、填充和注册(GIR)原理,以帮助病理图像注册。注册、分割和填充模块在一起训练,以便在合作方式下提高病理区域的分割和注册匹配的精度。总之,我们的提出的方法可以在完全无监督学习框架下完成病理图像注册。我们的实验结果表明,我们的方法可以准确地注册病理图像,并在具有挑战性的成像模式下识别病理区域。我们的无监督方法可以提供高效、成本效果的病理图像注册解决方案。我们的代码可以在https://github.com/brain-intelligence-lab/GIRNet上获取。

Domain Generalization Using Large Pretrained Models with Mixture-of-Adapters

  • paper_url: http://arxiv.org/abs/2310.11031
  • repo_url: None
  • paper_authors: Gyuseong Lee, Wooseok Jang, Jin Hyeon Kim, Jaewoo Jung, Seungryong Kim
  • for: The goal is a model that remains robust and reliable under large distribution shifts at deployment time, i.e., in domain generalization (DG) tasks.
  • methods: Parameter-efficient fine-tuning (PEFT) with adapters is used to adapt large pretrained models, with the adapters also acting as an effective regularizer for DG; in addition, a mixture-of-adapters (MoA) method combines multiple adapters of varying capacities and allocates each token to a suitable adapter via learnable routers (a simplified sketch follows this entry).
  • results: PEFT with adapters reduces training cost while improving robustness, and combining PEFT with MoA alleviates the performance degradation caused by distribution shifts, achieving state-of-the-art results on diverse DG benchmarks.
    Abstract Learning a robust vision model despite large distribution shift is essential for model deployment in real-world settings. Especially, domain generalization (DG) algorithm aims to maintain the performance of a trained model on different distributions which were not seen during training. One of the most effective methods has been leveraging the already learned rich knowledge of large pretrained models. However, naively fine-tuning large models to DG tasks is often practically infeasible due to memory limitations, extensive time requirements for training, and the risk of learned knowledge deterioration. Recently, parameter-efficient fine-tuning (PEFT) methods have been proposed to reduce the high computational cost during training and efficiently adapt large models to downstream tasks. In this work, for the first time, we find that the use of adapters in PEFT methods not only reduce high computational cost during training but also serve as an effective regularizer for DG tasks. Surprisingly, a naive adapter implementation for large models achieve superior performance on common datasets. However, in situations of large distribution shifts, additional factors such as optimal amount of regularization due to the strength of distribution shifts should be considered for a sophisticated adapter implementation. To address this, we propose a mixture-of-expert based adapter fine-tuning method, dubbed as mixture-of-adapters (MoA). Specifically, we employ multiple adapters that have varying capacities, and by using learnable routers, we allocate each token to a proper adapter. By using both PEFT and MoA methods, we effectively alleviate the performance deterioration caused by distribution shifts and achieve state-of-the-art performance on diverse DG benchmarks.
    摘要 学习一个强健的视觉模型,尤其是在不同的分布下进行模型部署,是实际应用中非常重要的。域外泛化(DG)算法的目标是保持训练后的模型在不同的分布下保持性能。然而,直接将大型模型精细调整到DG任务是实际上不可行,因为内存限制、训练时间的投入和模型学习知识的削弱。最近,参数效率的调整方法(PEFT)被提出,以减少训练时间的计算成本并有效地适应大型模型下推理任务。在这项工作中,我们发现了使用适应器不仅可以减少训练时间的计算成本,还可以作为DG任务的有效常规化。 surprisingly,一个简单的适应器实现对常用 datasets 表现出色。然而,在大分布差情况下,需要考虑适当的补偿因子,以适应强大的分布差。为此,我们提出了一种mixture-of-expert(MoA)适应器细化方法,其中我们采用多个适应器,每个适应器有不同的容量,并通过learnable routers来分配每个 токен到合适的适应器。通过使用 PEFT 和 MoA 方法,我们有效地减少分布差引起的性能下降,并在多种 DG bencmarks 上达到了国际首席性表现。
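A simplified PyTorch reading of the mixture-of-adapters idea: several bottleneck adapters of different capacities sit beside a frozen block, and a learnable router softly allocates each token to them. Dimensions, bottleneck sizes and the soft (rather than hard) routing are assumptions, not the authors' implementation.

```python
# Minimal sketch: bottleneck adapters of varying capacity combined by a learnable router.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))

class MixtureOfAdapters(nn.Module):
    def __init__(self, dim=768, bottlenecks=(16, 64, 256)):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(dim, b) for b in bottlenecks])
        self.router = nn.Linear(dim, len(bottlenecks))

    def forward(self, x):
        # x: (B, T, D) token features from a frozen pretrained block
        weights = torch.softmax(self.router(x), dim=-1)              # (B, T, num_adapters)
        outs = torch.stack([a(x) for a in self.adapters], dim=-1)    # (B, T, D, num_adapters)
        return x + (outs * weights.unsqueeze(2)).sum(dim=-1)         # residual adapter update

moa = MixtureOfAdapters()
tokens = torch.randn(2, 10, 768)
print(moa(tokens).shape)  # torch.Size([2, 10, 768]); only adapters + router are trained
```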

NICE: Improving Panoptic Narrative Detection and Segmentation with Cascading Collaborative Learning

  • paper_url: http://arxiv.org/abs/2310.10975
  • repo_url: https://github.com/mr-neko/nice
  • paper_authors: Haowei Wang, Jiayi Ji, Tianyu Guo, Yilong Yang, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
  • for: This work proposes a unified and effective framework that jointly identifies and locates multiple targets in an image according to a long narrative description.
  • methods: The NICE framework jointly learns panoptic narrative detection and segmentation through two cascading modules based on the barycenter of the mask, Coordinate Guided Aggregation (CGA) for segmentation and Barycenter Driven Localization (BDL) for detection; linking PNS and PND in series with the segmentation barycenter as the anchor aligns the two tasks so that they complement each other.
  • results: NICE surpasses all existing methods by a large margin, with gains of 4.1% for PND and 2.9% for PNS over the state of the art, validating the proposed collaborative learning strategy.
    Abstract Panoptic Narrative Detection (PND) and Segmentation (PNS) are two challenging tasks that involve identifying and locating multiple targets in an image according to a long narrative description. In this paper, we propose a unified and effective framework called NICE that can jointly learn these two panoptic narrative recognition tasks. Existing visual grounding tasks use a two-branch paradigm, but applying this directly to PND and PNS can result in prediction conflict due to their intrinsic many-to-many alignment property. To address this, we introduce two cascading modules based on the barycenter of the mask, which are Coordinate Guided Aggregation (CGA) and Barycenter Driven Localization (BDL), responsible for segmentation and detection, respectively. By linking PNS and PND in series with the barycenter of segmentation as the anchor, our approach naturally aligns the two tasks and allows them to complement each other for improved performance. Specifically, CGA provides the barycenter as a reference for detection, reducing BDL's reliance on a large number of candidate boxes. BDL leverages its excellent properties to distinguish different instances, which improves the performance of CGA for segmentation. Extensive experiments demonstrate that NICE surpasses all existing methods by a large margin, achieving 4.1% for PND and 2.9% for PNS over the state-of-the-art. These results validate the effectiveness of our proposed collaborative learning strategy. The project of this work is made publicly available at https://github.com/Mr-Neko/NICE.
    摘要 通用和有效的框架NICE(Nice In Coordinate Embedding)可以同时学习多个目标的涵义和位置识别 task。在这篇论文中,我们提出了一种解决方案,即在Panoptic Narrative Detection(PND)和Panoptic Segmentation(PNS)任务之间进行协同学习。现有的视觉定位任务使用两支分支方法,但是直接应用这种方法于PND和PNS可能会导致预测冲突,因为它们具有内在的多对多对应性。为解决这个问题,我们引入了两个协同模块,即坐标导航集成(CGA)和坐标驱动本地化(BDL),负责分割和检测。我们将PNS和PND串联在一起,使得两个任务之间的对应关系自然地进行了协同学习。具体来说,CGA提供了分割的参考点,从而降低BDL的候选框数量的依赖性。BDL利用其优秀的性能来分辨不同的实例,从而提高CGA的分割性能。我们的实验结果表明,NICE比所有现有方法大幅超越,实现了PND4.1%和PNS2.9%的状态当前最佳性能。这些结果证明了我们提出的协同学习策略的有效性。NICE项目的代码可以在https://github.com/Mr-Neko/NICE上获取。

Tracking and Mapping in Medical Computer Vision: A Review

  • paper_url: http://arxiv.org/abs/2310.11475
  • repo_url: None
  • paper_authors: Adam Schmidt, Omid Mohareri, Simon DiMaio, Michael Yip, Septimiu E. Salcudean
  • for: Applications in medical image analysis, including diagnostics and surgical guidance.
  • methods: Camera-based tracking and scene mapping in surgery and diagnostics.
  • results: The review summarizes the state of the art and the most recent developments and trends in the field.
    Abstract As computer vision algorithms are becoming more capable, their applications in clinical systems will become more pervasive. These applications include diagnostics such as colonoscopy and bronchoscopy, guiding biopsies and minimally invasive interventions and surgery, automating instrument motion and providing image guidance using pre-operative scans. Many of these applications depend on the specific visual nature of medical scenes and require designing and applying algorithms to perform in this environment. In this review, we provide an update to the field of camera-based tracking and scene mapping in surgery and diagnostics in medical computer vision. We begin with describing our review process, which results in a final list of 515 papers that we cover. We then give a high-level summary of the state of the art and provide relevant background for those who need tracking and mapping for their clinical applications. We then review datasets provided in the field and the clinical needs therein. Then, we delve in depth into the algorithmic side, and summarize recent developments, which should be especially useful for algorithm designers and to those looking to understand the capability of off-the-shelf methods. We focus on algorithms for deformable environments while also reviewing the essential building blocks in rigid tracking and mapping since there is a large amount of crossover in methods. Finally, we discuss the current state of the tracking and mapping methods along with needs for future algorithms, needs for quantification, and the viability of clinical applications in the field. We conclude that new methods need to be designed or combined to support clinical applications in deformable environments, and more focus needs to be put into collecting datasets for training and evaluation.
    摘要 为了满足医疗系统中computer vision算法的应用 becoming more pervasive,这些应用包括诊断如colonoscopy和bronchoscopy、导引生物检查和微创入侵性手术、自动化工具动作并提供预操作扫描图像导航。许多这些应用需要特定的医疗场景的视觉特性,因此需要设计和应用算法以在这种环境中工作。在这篇评论中,我们提供了医疗计算机视觉中摄像机基于跟踪和场景映射的更新。我们开始介绍我们的评审过程,从而获得了515篇论文的最终列表。然后,我们提供了高级概述,并为需要跟踪和映射的价值读者提供了相关的背景信息。然后,我们评审了在领域中提供的数据集,并评估了临床应用中的临床需求。接着,我们深入探讨算法的方面,并总结了最近的进展,这将对算法设计者和需要了解摄像机基于跟踪和映射的方法来说特别有用。我们主要关注可变环境中的算法,并同时评估了基础建立的硬件跟踪和映射方法。最后,我们讨论了当前跟踪和映射方法的状况,以及未来需要的算法、评估量化和临床应用的可行性。我们结论认为,新的方法需要被设计或组合以支持临床应用,并更多的精力需要投入到训练和评估数据集的收集中。

Context-Aware Meta-Learning

  • paper_url: http://arxiv.org/abs/2310.10971
  • repo_url: https://github.com/hallogameboy/MARU
  • paper_authors: Christopher Fifty, Dennis Duan, Ronald G. Junkins, Ehsan Amid, Jure Leskovec, Christopher Ré, Sebastian Thrun
  • for: This paper addresses enabling visual models to learn new concepts during inference, without fine-tuning.
  • methods: A meta-learning algorithm built on a frozen pre-trained feature extractor emulates in-context learning by recasting meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label, so that new visual concepts are learned during inference without fine-tuning.
  • results: On 8 of 11 meta-learning benchmarks, the approach matches or exceeds the state-of-the-art algorithm P>M>F without meta-training or fine-tuning.
    Abstract Large Language Models like ChatGPT demonstrate a remarkable capacity to learn new concepts during inference without any fine-tuning. However, visual models trained to detect new objects during inference have been unable to replicate this ability, and instead either perform poorly or require meta-training and/or fine-tuning on similar objects. In this work, we propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning. Our approach leverages a frozen pre-trained feature extractor, and analogous to in-context learning, recasts meta-learning as sequence modeling over datapoints with known labels and a test datapoint with an unknown label. On 8 out of 11 meta-learning benchmarks, our approach -- without meta-training or fine-tuning -- exceeds or matches the state-of-the-art algorithm, P>M>F, which is meta-trained on these benchmarks.
    摘要 大语言模型如ChatGPT表现出了惊人的新概念学习能力,而无需任何精度调整。然而,用于检测新对象的视觉模型在推理时表现不佳,或者需要meta-training和/或精度调整。在这项工作中,我们提出了一种meta-学习算法,可以在推理时学习新的视觉概念,无需精度调整。我们的方法利用冻结的预训练特征提取器,类似于上下文学习,将meta-学习视为序列模型化,对已知标签的数据点和未知标签的测试数据点进行模型化。在11个meta-学习benchmark上,我们的方法,无需meta-training或精度调整,超过或等于状态的算法P>M>F,该算法在这些benchmark上进行meta-training。

MRI brain tumor segmentation using informative feature vectors and kernel dictionary learning

  • paper_url: http://arxiv.org/abs/2310.10963
  • repo_url: None
  • paper_authors: Seyedeh Mahya Mousavi, Mohammad Mostafavi
  • for: A method for segmenting tumor regions in brain magnetic resonance images by distinguishing healthy from tumorous tissue.
  • methods: First-order and second-order statistical feature vectors are extracted from 3 * 3 patches around each pixel and used to train two kernel dictionaries, one for healthy and one for tumorous tissue; a correlation-based sample selection technique keeps the most informative and discriminative feature vectors, and a linear classifier separates healthy from unhealthy pixels using the learned dictionaries.
  • results: Experiments show higher segmentation accuracy than existing methods together with significantly reduced training time and memory requirements.
    Abstract This paper presents a method based on a kernel dictionary learning algorithm for segmenting brain tumor regions in magnetic resonance images (MRI). A set of first-order and second-order statistical feature vectors are extracted from patches of size 3 * 3 around pixels in the brain MRI scans. These feature vectors are utilized to train two kernel dictionaries separately for healthy and tumorous tissues. To enhance the efficiency of the dictionaries and reduce training time, a correlation-based sample selection technique is developed to identify the most informative and discriminative subset of feature vectors. This technique aims to improve the performance of the dictionaries by selecting a subset of feature vectors that provide valuable information for the segmentation task. Subsequently, a linear classifier is utilized to distinguish between healthy and unhealthy pixels based on the learned dictionaries. The results demonstrate that the proposed method outperforms other existing methods in terms of segmentation accuracy and significantly reduces both the time and memory required, resulting in a remarkably fast training process.
    摘要 这个论文提出了基于kernel字典学习算法的 brain 肿瘤区域分割方法,使用patches 的first-order和second-order统计特征向量从brain MRI扫描中提取特征向量,然后使用这些特征向量训练两个kernel字典,一个用于健康组,一个用于肿瘤组。为了提高字典的效率和减少训练时间,我们提出了一种基于相关性的样本选择技术,以选择最有价值和分类的特征向量,从而提高分类性能。接着,我们使用学习的字典和线性分类器来分割健康和肿瘤组。结果表明,提出的方法在分割精度和训练时间上具有明显的优势,比其他方法更高效。

Enhancing Deep Neural Network Training Efficiency and Performance through Linear Prediction

  • paper_url: http://arxiv.org/abs/2310.10958
  • repo_url: None
  • paper_authors: Hejie Ying, Mengmeng Song, Yaohong Tang, Shungen Xiao, Zimin Xiao
  • for: Improving the training efficiency and performance of deep neural network (DNN) models.
  • methods: Based on the observation that DNN parameters change according to certain laws during training, a Parameter Linear Prediction (PLP) method predicts the parameters, taking into account the magnitude of the parameters, hardware limitations, and the noise tolerance of Stochastic Gradient Descent (SGD) (a toy extrapolation sketch follows this entry).
  • results: For Vgg16, Resnet18 and GoogLeNet on CIFAR-100, under the same training conditions and number of epochs, the proposed method yields about 1% higher accuracy and a 0.01 reduction in top-1/top-5 error compared with normal training, demonstrating its effectiveness across different DNN structures.
    Abstract Deep neural networks (DNN) have achieved remarkable success in various fields, including computer vision and natural language processing. However, training an effective DNN model still poses challenges. This paper aims to propose a method to optimize the training effectiveness of DNN, with the goal of improving model performance. Firstly, based on the observation that the DNN parameters change in certain laws during training process, the potential of parameter prediction for improving model training efficiency and performance is discovered. Secondly, considering the magnitude of DNN model parameters, hardware limitations and characteristics of Stochastic Gradient Descent (SGD) for noise tolerance, a Parameter Linear Prediction (PLP) method is exploit to perform DNN parameter prediction. Finally, validations are carried out on some representative backbones. Experiment results show that compare to the normal training ways, under the same training conditions and epochs, by employing proposed PLP method, the optimal model is able to obtain average about 1% accuracy improvement and 0.01 top-1/top-5 error reduction for Vgg16, Resnet18 and GoogLeNet based on CIFAR-100 dataset, which shown the effectiveness of the proposed method on different DNN structures, and validated its capacity in enhancing DNN training efficiency and performance.
    摘要 深度神经网络(DNN)在不同领域取得了显著成功,包括计算机视觉和自然语言处理。然而,训练有效的DNN模型仍然存在挑战。本文旨在提出一种方法,以提高DNN训练效果并提高模型性能。首先,基于训练过程中DNN参数变化的规律,探讨可以通过预测参数来提高DNN训练效率和性能的潜在可能性。其次,考虑到DNN模型参数的大小、硬件限制和SGD算法对雷yy的耐受性,提出了一种基于PLP方法的DNN参数预测方法。最后,对一些代表性的背bone进行验证。实验结果显示,相比于常规训练方式,通过提议的PLP方法,在同等训练条件和轮数下,可以获得average约1%的准确率提高和0.01的top-1/top-5错误减少,这demonstrates the effectiveness of the proposed method on different DNN structures and validates its ability to enhance DNN training efficiency and performance.
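One plausible reading of Parameter Linear Prediction, sketched in PyTorch: fit a straight line to each parameter's trajectory over the last few epochs, extrapolate one step ahead, load the predicted weights, and continue normal training. The history length, extrapolation horizon and helper names are assumptions, not the authors' code.

```python
# Minimal sketch: linear extrapolation of a network's parameter trajectory.
import torch
import torch.nn as nn

@torch.no_grad()
def linear_predict(param_history, steps_ahead=1.0):
    """param_history: list of past flattened parameter vectors (oldest first)."""
    hist = torch.stack(param_history)                   # (T, P)
    t = torch.arange(hist.shape[0], dtype=torch.float32).unsqueeze(1)
    t_mean, h_mean = t.mean(), hist.mean(dim=0)
    slope = ((t - t_mean) * (hist - h_mean)).sum(dim=0) / ((t - t_mean) ** 2).sum()
    return hist[-1] + steps_ahead * slope               # extrapolate one epoch ahead

model = nn.Linear(4, 2)
history = []
for epoch in range(3):                                  # stand-in for real training epochs
    for p in model.parameters():
        p.data -= 0.01 * torch.randn_like(p)            # placeholder for SGD updates
    history.append(nn.utils.parameters_to_vector(model.parameters()).clone())

predicted = linear_predict(history)
nn.utils.vector_to_parameters(predicted, model.parameters())  # continue training from here
print(predicted.shape)
```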

Medical Image Segmentation via Sparse Coding Decoder

  • paper_url: http://arxiv.org/abs/2310.10957
  • repo_url: None
  • paper_authors: Long Zeng, Kaigui Wu
  • for: This paper targets medical image segmentation with Transformer-based models, whose strength in capturing long-range dependencies is offset by the limited spatial recovery ability of their decoders.
  • methods: A convolutional sparse vector coding based decoder, the CAScaded multi-layer Convolutional Sparse vector Coding DEcoder (CASCSCDE), represents the features extracted by the encoder using sparse vectors.
  • results: Incorporating CASCSCDE into TransUNet (the TransCASCSCDE architecture) improves performance on the Synapse benchmark, with gains of up to 3.15% DICE and 1.16% mIoU over TransUNet alone.
    Abstract Transformers have achieved significant success in medical image segmentation, owing to its capability to capture long-range dependencies. Previous works incorporate convolutional layers into the encoder module of transformers, thereby enhancing their ability to learn local relationships among pixels. However, transformers may suffer from limited generalization capabilities and reduced robustness, attributed to the insufficient spatial recovery ability of their decoders. To address this issue, A convolution sparse vector coding based decoder is proposed , namely CAScaded multi-layer Convolutional Sparse vector Coding DEcoder (CASCSCDE), which represents features extracted by the encoder using sparse vectors. To prove the effectiveness of our CASCSCDE, The widely-used TransUNet model is chosen for the demonstration purpose, and the CASCSCDE is incorporated with TransUNet to establish the TransCASCSCDE architecture. Our experiments demonstrate that TransUNet with CASCSCDE significantly enhances performance on the Synapse benchmark, obtaining up to 3.15\% and 1.16\% improvements in DICE and mIoU scores, respectively. CASCSCDE opens new ways for constructing decoders based on convolutional sparse vector coding.

FusionU-Net: U-Net with Enhanced Skip Connection for Pathology Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.10951
  • repo_url: https://github.com/zongyi-lee/fusionu-net
  • paper_authors: Zongyi Li, Hongbing Lyu, Jun Wang
  • for: This work aims to improve the performance of U-Net and its variants on pathology image segmentation by proposing a new U-Net-based network, FusionU-Net.
  • methods: FusionU-Net builds on the U-Net structure and adds a fusion module that exchanges information between skip connections at different encoder and decoder levels to reduce the semantic gap. A two-round exchange scheme is adopted to fully account for the local relevance between adjacent encoder layer outputs and the need for bi-directional information exchange across multiple layers.
  • results: Extensive experiments on multiple pathology image datasets show that FusionU-Net outperforms competing methods. The authors argue that their fusion module is more effective than the designs in existing networks and can easily be embedded into other networks to further improve model performance.
    Abstract In recent years, U-Net and its variants have been widely used in pathology image segmentation tasks. One of the key designs of U-Net is the use of skip connections between the encoder and decoder, which helps to recover detailed information after upsampling. While most variations of U-Net adopt the original skip connection design, there is a semantic gap between the encoder and decoder that can negatively impact model performance. Therefore, it is important to reduce this semantic gap before applying the skip connection. To address this issue, we propose a new segmentation network called FusionU-Net, which is based on the U-Net structure and incorporates a fusion module to exchange information between different skip connections to reduce semantic gaps. Unlike the fusion modules in existing networks, ours is based on a two-round fusion design that fully considers the local relevance between adjacent encoder layer outputs and the need for bi-directional information exchange across multiple layers. We conducted extensive experiments on multiple pathology image datasets to evaluate our model and found that FusionU-Net achieves better performance compared to other competing methods. We argue that our fusion module is more effective than the designs of existing networks, and it could be easily embedded into other networks to further enhance model performance.
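
As a rough sketch of what a two-round, bi-directional fusion between adjacent skip connections could look like (the channel counts, 1x1 projections, and pooling choices below are assumptions, not the published FusionU-Net module; see the linked repository for the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoRoundSkipFusion(nn.Module):
    """Illustrative two-round fusion between a finer (high) and a coarser (low)
    skip-connection feature map; information flows in both directions."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.down_proj = nn.Conv2d(high_ch, low_ch, 1)   # fine -> coarse channels
        self.up_proj = nn.Conv2d(low_ch, high_ch, 1)     # coarse -> fine channels

    def forward(self, high, low):
        # Round 1: pass fine-grained detail down to the coarser map.
        low = low + F.adaptive_avg_pool2d(self.down_proj(high), low.shape[-2:])
        # Round 2: send refined semantics back up to the finer map.
        up = F.interpolate(self.up_proj(low), size=high.shape[-2:],
                           mode="bilinear", align_corners=False)
        return high + up, low

# high = torch.randn(1, 64, 64, 64); low = torch.randn(1, 128, 32, 32)
# fused_high, fused_low = TwoRoundSkipFusion(64, 128)(high, low)
```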

UNK-VQA: A Dataset and A Probe into Multi-modal Large Models’ Abstention Ability

  • paper_url: http://arxiv.org/abs/2310.10942
  • repo_url: https://github.com/guoyang9/unk-vqa
  • paper_authors: Yanyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan Kankanhalli
  • for: The goal of this work is to help Visual Question Answering (VQA) models refrain from answering unanswerable questions, a step toward building more trustworthy AI systems.
  • methods: The existing data is first augmented with deliberate perturbations so that the question-image semantics stay close to the original, unperturbed distribution. The zero- and few-shot performance of several multi-modal large models is then evaluated, revealing significant limitations on the new dataset. Finally, a straightforward method is proposed to tackle these unanswerable questions.
  • results: The resulting UNK-VQA dataset can be used to improve the abstention ability of VQA models and thereby increase the trustworthiness of AI systems, providing a benchmark for further research in this area.
    Abstract Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Existing studies, though they have explored various aspects of VQA, have somewhat ignored this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions that models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or the question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. By this means, the identification of unanswerable questions becomes challenging, setting our dataset apart from others that involve mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and discover their significant limitations when applied to our dataset. Additionally, we also propose a straightforward method to tackle these unanswerable questions. This dataset, we believe, will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby leading to increased trustworthiness of AI systems. We have made the \href{https://github.com/guoyang9/UNK-VQA}{dataset} available to facilitate further exploration in this area.
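
The paper's "straightforward method" is not detailed in the abstract; as a generic, hedged baseline for abstention, one could threshold the model's answer confidence, as sketched below. The threshold value and the `vqa_model` call are hypothetical.

```python
import torch

def answer_with_abstention(logits, threshold=0.5, abstain_token="unanswerable"):
    """Confidence-threshold abstention baseline: if the top answer's softmax
    probability is below the threshold, the model declines to answer."""
    probs = torch.softmax(logits, dim=-1)                # (batch, num_answers)
    conf, idx = probs.max(dim=-1)
    return [abstain_token if c < threshold else int(i)
            for c, i in zip(conf.tolist(), idx.tolist())]

# logits = vqa_model(image, question)   # hypothetical model forward pass
# answers = answer_with_abstention(logits, threshold=0.6)
```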

Towards Training-free Open-world Segmentation via Image Prompting Foundation Models

  • paper_url: http://arxiv.org/abs/2310.10912
  • repo_url: None
  • paper_authors: Lv Tang, Peng-Tao Jiang, Hao-Ke Xiao, Bo Li
    for: This paper explores open-world segmentation using a novel approach called Image Prompt Segmentation (IPSeg), which leverages vision foundation models and image prompting techniques to segment target objects in input images without requiring exhaustive training sessions.
    methods: IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. The approach extracts robust features for the prompt image and the input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image.
    results: Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. The proposed method offers a more efficient and scalable solution for open-world segmentation compared to traditional training-based methods.
    Abstract The realm of computer vision has witnessed a paradigm shift with the advent of foundational models, mirroring the transformative influence of large language models in the domain of natural language processing. This paper delves into the exploration of open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundational models. At the heart of IPSeg lies the principle of a training-free paradigm, which capitalizes on image prompting techniques. IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. Our approach extracts robust features for the prompt image and input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. The generated point prompts are further utilized to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
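
As a loose sketch of the matching step (the feature shapes, the averaging over a concept mask, and the top-k selection are assumptions; the paper's feature interaction module may differ), prompt-to-input matching could look roughly like this:

```python
import torch
import torch.nn.functional as F

def point_prompts_from_matching(prompt_feats, input_feats, concept_mask, top_k=3):
    """Match patch features of a prompt image against an input image and return
    the most similar input patches as candidate point prompts.

    prompt_feats, input_feats: (N, C) patch features from a frozen backbone
    (e.g. DINOv2); concept_mask: (N,) boolean mask of the concept region in
    the prompt image. Illustrative only.
    """
    concept = prompt_feats[concept_mask].mean(dim=0, keepdim=True)   # (1, C)
    sim = F.cosine_similarity(concept, input_feats, dim=-1)          # (N,)
    scores, flat_idx = sim.topk(top_k)
    return flat_idx, scores  # map flat indices back to (x, y) via the patch grid

# The selected locations would then be fed to the Segment Anything Model as
# positive point prompts to segment the target object in the input image.
```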