cs.CV - 2023-11-15

Predicting Spine Geometry and Scoliosis from DXA Scans

  • paper_url: http://arxiv.org/abs/2311.09424
  • repo_url: None
  • paper_authors: Amir Jamaludin, Timor Kadir, Emma Clark, Andrew Zisserman
  • for: The objective of this paper is to estimate spine curvature in DXA scans.
  • methods: A neural network is first trained to predict the middle spine curve in the scan; an integral-based method then determines the curvature along that curve.
  • results: The maximum curvature can be used as a scoring function for ordering the severity of spinal deformation, and performance improves over the prior work of Jamaludin et al. 2018.
    Abstract Our objective in this paper is to estimate spine curvature in DXA scans. To this end we first train a neural network to predict the middle spine curve in the scan, and then use an integral-based method to determine the curvature along the spine curve. We use the curvature to compare to the standard angle scoliosis measure obtained using the DXA Scoliosis Method (DSM). The performance improves over the prior work of Jamaludin et al. 2018. We show that the maximum curvature can be used as a scoring function for ordering the severity of spinal deformation.
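
To make the curvature-based scoring concrete, here is a minimal numerical sketch (not the authors' implementation): assuming the network outputs the mid-spine curve as a sequence of (x, y) points, the plane-curve curvature can be computed discretely and its maximum used as a severity score.

```python
import numpy as np

def max_spine_curvature(points):
    """Discrete curvature of a 2D curve given as an (N, 2) array of points.

    Curvature of a plane curve (x(t), y(t)) is
        kappa = |x' y'' - y' x''| / (x'^2 + y'^2)^(3/2).
    Returns the maximum curvature, used here as a severity score.
    """
    x, y = points[:, 0], points[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return kappa.max()

# Toy example: a gently bowed "spine" sampled at 100 points.
t = np.linspace(0, 1, 100)
curve = np.stack([0.05 * np.sin(np.pi * t), t], axis=1)
print(max_spine_curvature(curve))
```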

Synthetically Enhanced: Unveiling Synthetic Data’s Potential in Medical Imaging Research

  • paper_url: http://arxiv.org/abs/2311.09402
  • repo_url: https://github.com/bardiakh/syntheticallyenhanced
  • paper_authors: Bardia Khosravi, Frank Li, Theo Dapamede, Pouria Rouzrokh, Cooper U. Gamble, Hari M. Trivedi, Cody C. Wyles, Andrew B. Sellergren, Saptarshi Purkayastha, Bradley J. Erickson, Judy W. Gichoya
  • for: This study examines the performance of deep learning (DL) classifiers for chest X-ray (CXR) analysis and investigates whether supplementing training data with synthetic images improves model accuracy.
  • methods: Three datasets were used: CheXpert, MIMIC-CXR, and Emory Chest X-ray. Conditional denoising diffusion probabilistic models (DDPMs) were trained to generate synthetic frontal radiographs, ensuring that the synthetic images mirror the demographic and pathological traits of the original data.
  • results: Supplementing real data with synthetic data improves DL model accuracy, particularly for less prevalent pathologies. Models trained on synthetic data alone approach the performance of models trained on real data, suggesting synthetic data can compensate for real-data shortages when building robust DL models. Despite these promising results, real data retains an advantage.
    Abstract Chest X-rays (CXR) are the most common medical imaging study and are used to diagnose multiple medical conditions. This study examines the impact of synthetic data supplementation, using diffusion models, on the performance of deep learning (DL) classifiers for CXR analysis. We employed three datasets: CheXpert, MIMIC-CXR, and Emory Chest X-ray, training conditional denoising diffusion probabilistic models (DDPMs) to generate synthetic frontal radiographs. Our approach ensured that synthetic images mirrored the demographic and pathological traits of the original data. Evaluating the classifiers' performance on internal and external datasets revealed that synthetic data supplementation enhances model accuracy, particularly in detecting less prevalent pathologies. Furthermore, models trained on synthetic data alone approached the performance of those trained on real data. This suggests that synthetic data can potentially compensate for real data shortages in training robust DL models. However, despite promising outcomes, the superiority of real data persists.

MoCo-Transfer: Investigating out-of-distribution contrastive learning for limited-data domains

  • paper_url: http://arxiv.org/abs/2311.09401
  • repo_url: None
  • paper_authors: Yuwen Chen, Helen Zhou, Zachary C. Lipton
  • for: This paper studies how self-supervised contrastive representations pretrained on out-of-distribution medical imaging data transfer to domains with limited data, to aid the development of specialized models.
  • methods: MoCo self-supervised contrastive pretraining is used; two X-ray datasets imaging different body parts are compared against each other and against ImageNet pretraining, to determine whether useful information can be extracted from larger out-of-distribution datasets.
  • results: Depending on the amount of labeled and unlabeled data, contrastive pretraining on larger out-of-distribution datasets performs nearly as well as or better than in-domain MoCo pretraining, and pretraining on related domains outperforms ImageNet pretrained weights. The paper also provides a preliminary way of quantifying similarity between datasets.
    Abstract Medical imaging data is often siloed within hospitals, limiting the amount of data available for specialized model development. With limited in-domain data, one might hope to leverage larger datasets from related domains. In this paper, we analyze the benefit of transferring self-supervised contrastive representations from momentum contrast (MoCo) pretraining on out-of-distribution data to settings with limited data. We consider two X-ray datasets which image different parts of the body, and compare transferring from each other to transferring from ImageNet. We find that depending on quantity of labeled and unlabeled data, contrastive pretraining on larger out-of-distribution datasets can perform nearly as well or better than MoCo pretraining in-domain, and pretraining on related domains leads to higher performance than if one were to use the ImageNet pretrained weights. Finally, we provide a preliminary way of quantifying similarity between datasets.
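
For readers unfamiliar with MoCo-style pretraining, the following is a generic sketch of the InfoNCE objective at its core, with a momentum-encoder key and a queue of negatives; it illustrates the family of methods studied, not the paper's own code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, temperature=0.07):
    """MoCo-style InfoNCE loss.

    q:      (B, D) query features from the online encoder
    k_pos:  (B, D) key features of the matching augmented views
            (from the momentum encoder, gradients detached)
    queue:  (K, D) memory bank of negative keys
    """
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1).detach()
    queue = F.normalize(queue, dim=1).detach()

    l_pos = torch.einsum("nd,nd->n", q, k_pos).unsqueeze(1)   # (B, 1)
    l_neg = torch.einsum("nd,kd->nk", q, queue)               # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```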

RENI++ A Rotation-Equivariant, Scale-Invariant, Natural Illumination Prior

  • paper_url: http://arxiv.org/abs/2311.09361
  • repo_url: https://github.com/jadgardner/ns_reni
  • paper_authors: James A. D. Gardner, Bernhard Egger, William A. P. Smith
  • for: This work proposes a neural-network-based prior for natural illumination, aimed at the inverse rendering problem.
  • methods: The model uses a conditional neural field representation with a transformer decoder, builds rotation equivariance directly into the architecture by extending Vector Neurons, and uses a scale-invariant loss to enable accurate representation of HDR images.
  • results: The model accurately represents high dynamic range (HDR) environment maps and captures complex, high-frequency features of natural environments.
    Abstract Inverse rendering is an ill-posed problem. Previous work has sought to resolve this by focussing on priors for object or scene shape or appearance. In this work, we instead focus on a prior for natural illuminations. Current methods rely on spherical harmonic lighting or other generic representations and, at best, a simplistic prior on the parameters. This results in limitations for the inverse setting in terms of the expressivity of the illumination conditions, especially when taking specular reflections into account. We propose a conditional neural field representation based on a variational auto-decoder and a transformer decoder. We extend Vector Neurons to build equivariance directly into our architecture, and leveraging insights from depth estimation through a scale-invariant loss function, we enable the accurate representation of High Dynamic Range (HDR) images. The result is a compact, rotation-equivariant HDR neural illumination model capable of capturing complex, high-frequency features in natural environment maps. Training our model on a curated dataset of 1.6K HDR environment maps of natural scenes, we compare it against traditional representations, demonstrate its applicability for an inverse rendering task and show environment map completion from partial observations. We share our PyTorch implementation, dataset and trained models at https://github.com/JADGardner/ns_reni
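
The abstract mentions leveraging a scale-invariant loss from depth estimation to handle HDR values. As a hedged illustration of that idea, the sketch below implements the classic scale-invariant log loss of Eigen et al.; the exact loss used in RENI++ may differ.

```python
import torch

def scale_invariant_log_loss(pred, target, lam=0.5, eps=1e-8):
    """Scale-invariant loss in log space (Eigen et al., 2014):
        d_i = log(pred_i) - log(target_i)
        L   = mean(d_i^2) - lam * (mean(d_i))^2
    Errors that amount to a global scale factor contribute little,
    which is useful when the absolute radiance scale is ambiguous (HDR).
    """
    d = torch.log(pred + eps) - torch.log(target + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2
```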

Nothing Stands Still: A Spatiotemporal Benchmark on 3D Point Cloud Registration Under Large Geometric and Temporal Change

  • paper_url: http://arxiv.org/abs/2311.09346
  • repo_url: None
  • paper_authors: Tao Sun, Yan Hao, Shengyu Huang, Silvio Savarese, Konrad Schindler, Marc Pollefeys, Iro Armeni
  • for: This paper examines how well existing 3D geometric mapping methods handle temporal change in built environments.
  • methods: The paper introduces Nothing Stands Still (NSS), a new spatiotemporal benchmark for registering multiple partial point clouds of the same scene captured from different spatiotemporal views, covering both pairwise and multi-way registration.
  • results: Evaluations show that existing point cloud registration methods cannot handle large spatiotemporal changes, and that novel methods specifically designed for such changes are needed.
    Abstract Building 3D geometric maps of man-made spaces is a well-established and active field that is fundamental to computer vision and robotics. However, considering the evolving nature of built environments, it is essential to question the capabilities of current mapping efforts in handling temporal changes. In addition, spatiotemporal mapping holds significant potential for achieving sustainability and circularity goals. Existing mapping approaches focus on small changes, such as object relocation or self-driving car operation; in all cases where the main structure of the scene remains fixed. Consequently, these approaches fail to address more radical changes in the structure of the built environment, such as geometry and topology. To this end, we introduce the Nothing Stands Still (NSS) benchmark, which focuses on the spatiotemporal registration of 3D scenes undergoing large spatial and temporal change, ultimately creating one coherent spatiotemporal map. Specifically, the benchmark involves registering two or more partial 3D point clouds (fragments) from the same scene but captured from different spatiotemporal views. In addition to the standard pairwise registration, we assess the multi-way registration of multiple fragments that belong to any temporal stage. As part of NSS, we introduce a dataset of 3D point clouds recurrently captured in large-scale building indoor environments that are under construction or renovation. The NSS benchmark presents three scenarios of increasing difficulty, to quantify the generalization ability of point cloud registration methods over space (within one building and across buildings) and time. We conduct extensive evaluations of state-of-the-art methods on NSS. The results demonstrate the necessity for novel methods specifically designed to handle large spatiotemporal changes. The homepage of our benchmark is at http://nothing-stands-still.com.

Single-Image 3D Human Digitization with Shape-Guided Diffusion

  • paper_url: http://arxiv.org/abs/2311.09221
  • repo_url: None
  • paper_authors: Badour AlBahar, Shunsuke Saito, Hung-Yu Tseng, Changil Kim, Johannes Kopf, Jia-Bin Huang
  • for: Generating a 360-degree view of a person, including clothing and body, with a consistent, high-resolution appearance from a single input image.
  • methods: A high-capacity 2D diffusion model provides an appearance prior for clothed humans; multiple views are progressively synthesized by inpainting missing regions with shape-guided diffusion conditioned on silhouette and surface normals, and the synthesized multi-view images are fused via inverse rendering into a fully textured, high-resolution 3D mesh.
  • results: Experiments show the approach outperforms prior methods, achieving photorealistic 360-degree synthesis of a wide range of clothed humans with complex textures from a single image.
    Abstract We present an approach to generate a 360-degree view of a person with a consistent, high-resolution appearance from a single input image. NeRF and its variants typically require videos or images from different viewpoints. Most existing approaches taking monocular input either rely on ground-truth 3D scans for supervision or lack 3D consistency. While recent 3D generative models show promise of 3D consistent human digitization, these approaches do not generalize well to diverse clothing appearances, and the results lack photorealism. Unlike existing work, we utilize high-capacity 2D diffusion models pretrained for general image synthesis tasks as an appearance prior of clothed humans. To achieve better 3D consistency while retaining the input identity, we progressively synthesize multiple views of the human in the input image by inpainting missing regions with shape-guided diffusion conditioned on silhouette and surface normal. We then fuse these synthesized multi-view images via inverse rendering to obtain a fully textured high-resolution 3D mesh of the given person. Experiments show that our approach outperforms prior methods and achieves photorealistic 360-degree synthesis of a wide range of clothed humans with complex textures from a single image.

DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

  • paper_url: http://arxiv.org/abs/2311.09217
  • repo_url: None
  • paper_authors: Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, Kai Zhang
  • for: The paper proposes DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion.
  • methods: The reconstruction model incorporates a triplane NeRF representation and denoises noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in $\sim$30s on a single A100 GPU.
  • results: The paper demonstrates state-of-the-art results on single-image reconstruction, where probabilistic modeling of unseen object parts is required to generate diverse reconstructions with sharp textures, as well as high-quality text-to-3D generation results that outperform previous 3D diffusion models.
    Abstract We propose \textbf{DMV3D}, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion. Our reconstruction model incorporates a triplane NeRF representation and can denoise noisy multi-view images via NeRF reconstruction and rendering, achieving single-stage 3D generation in $\sim$30s on single A100 GPU. We train \textbf{DMV3D} on large-scale multi-view image datasets of highly diverse objects using only image reconstruction losses, without accessing 3D assets. We demonstrate state-of-the-art results for the single-image reconstruction problem where probabilistic modeling of unseen object parts is required for generating diverse reconstructions with sharp textures. We also show high-quality text-to-3D generation results outperforming previous 3D diffusion models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ .

ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy

  • paper_url: http://arxiv.org/abs/2311.09215
  • repo_url: https://github.com/kirill-vish/beyond-inet
  • paper_authors: Kirill Vishniakov, Zhiqiang Shen, Zhuang Liu
  • for: This study examines the behavioral diversity of modern computer vision models beyond ImageNet accuracy, and the factors that matter when choosing a model for a specific application.
  • methods: The study compares ConvNet and Vision Transformer architectures, each under supervised and CLIP training paradigms.
  • results: Although the selected models have similar ImageNet accuracy and compute requirements, they differ substantially in other respects: types of mistakes, output calibration, transferability, and feature invariance, among others. These differences are not fully captured by the traditional accuracy metric.
    Abstract Modern computer vision offers a great variety of models to practitioners, and selecting a model from multiple options for specific applications can be challenging. Conventionally, competing model architectures and training protocols are compared by their classification accuracy on ImageNet. However, this single metric does not fully capture performance nuances critical for specialized tasks. In this work, we conduct an in-depth comparative analysis of model behaviors beyond ImageNet accuracy, for both ConvNet and Vision Transformer architectures, each across supervised and CLIP training paradigms. Although our selected models have similar ImageNet accuracies and compute requirements, we find that they differ in many other aspects: types of mistakes, output calibration, transferability, and feature invariance, among others. This diversity in model characteristics, not captured by traditional metrics, highlights the need for more nuanced analysis when choosing among different models. Our code is available at https://github.com/kirill-vish/Beyond-INet.
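
One of the behaviours compared beyond top-1 accuracy is output calibration. A standard way to quantify it, shown here as a generic sketch rather than the paper's evaluation code, is the expected calibration error (ECE) over confidence bins.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: weighted average of |accuracy - confidence| over confidence bins.

    probs:  (N, C) predicted class probabilities
    labels: (N,)   ground-truth class indices
    """
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += mask.mean() * abs(acc - conf)
    return ece
```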

Leveraging Citizen Science for Flood Extent Detection using Machine Learning Benchmark Dataset

  • paper_url: http://arxiv.org/abs/2311.09276
  • repo_url: None
  • paper_authors: Muthukumaran Ramasubramanian, Iksha Gurung, Shubhankar Gahlot, Ronny Hänsch, Andrew L. Molthan, Manil Maskey
  • for: The goal is accurate detection of inundated (flooded) extents to support emergency response decisions and recovery efforts.
  • methods: Satellite remote sensing data, specifically Sentinel-1 C-band Synthetic Aperture Radar (SAR) imagery, is used to detect flooded extents. Water bodies are detectable thanks to their low backscatter, but infrastructure and trees in flooded regions increase backscatter, making simple pixel-intensity thresholding and time-series differencing inadequate; machine learning is therefore used to capture flood extents accurately.
  • results: The paper provides a dataset labeled with known water body extents and flooded extents, covering about 36,000 sq. km within the mainland U.S. and Bangladesh, together with a baseline model. It also leverages citizen science by open-sourcing the dataset and hosting an open competition to rapidly prototype community-generated flood extent detection models.
    Abstract Accurate detection of inundated water extents during flooding events is crucial in emergency response decisions and aids in recovery efforts. Satellite Remote Sensing data provides a global framework for detecting flooding extents. Specifically, Sentinel-1 C-Band Synthetic Aperture Radar (SAR) imagery has proven to be useful in detecting water bodies due to low backscatter of water features in both co-polarized and cross-polarized SAR imagery. However, increased backscatter can be observed in certain flooded regions such as presence of infrastructure and trees - rendering simple methods such as pixel intensity thresholding and time-series differencing inadequate. Machine Learning techniques has been leveraged to precisely capture flood extents in flooded areas with bumps in backscatter but needs high amounts of labelled data to work desirably. Hence, we created a labeled known water body extent and flooded area extents during known flooding events covering about 36,000 sq. kilometers of regions within mainland U.S and Bangladesh. Further, We also leveraged citizen science by open-sourcing the dataset and hosting an open competition based on the dataset to rapidly prototype flood extent detection using community generated models. In this paper we present the information about the dataset, the data processing pipeline, a baseline model and the details about the competition, along with discussion on winning approaches. We believe the dataset adds to already existing datasets based on Sentinel-1C SAR data and leads to more robust modeling of flood extents. We also hope the results from the competition pushes the research in flood extent detection further.
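
For context, the simple baselines that the abstract says break down in built-up or vegetated flood zones can be sketched in a few lines: intensity thresholding on SAR backscatter (open water is dark) and time-series differencing against a pre-flood image. Variable names and the threshold values below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def threshold_water_mask(sar_db, threshold_db=-17.0):
    """Naive water mask: SAR backscatter (in dB) below a threshold is water.
    Works poorly where flooded infrastructure or trees raise backscatter."""
    return sar_db < threshold_db

def change_detection_flood_mask(pre_db, post_db, drop_db=3.0, threshold_db=-17.0):
    """Naive flood mask from time-series differencing: flag pixels whose
    backscatter drops markedly relative to the pre-flood image and is now
    water-like."""
    newly_dark = (pre_db - post_db) > drop_db
    return newly_dark & threshold_water_mask(post_db, threshold_db)

# Toy usage with random "tiles" standing in for Sentinel-1 imagery.
pre = np.random.uniform(-25, 0, size=(4, 4))
post = pre.copy()
post[:2, :2] -= 10.0   # simulate inundation in one corner
print(change_detection_flood_mask(pre, post))
```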

Domain Aligned CLIP for Few-shot Classification

  • paper_url: http://arxiv.org/abs/2311.09191
  • repo_url: None
  • paper_authors: Muhammad Waleed Gondal, Jochen Gast, Inigo Alonso Ruiz, Richard Droste, Tommaso Macri, Suren Kumar, Luitpold Staudigl
  • for: Improving CLIP's predictive performance on target distributions, covering few-shot image classification and OOD robustness.
  • methods: A sample-efficient domain adaptation strategy, Domain Aligned CLIP (DAC), improves both intra-modal (image-image) and inter-modal (image-text) alignment on target distributions without fine-tuning the main CLIP model: a lightweight adapter is trained with an intra-modal contrastive objective, and precomputed class text embeddings are modulated to improve inter-modal alignment.
  • results: On 11 widely used image classification tasks, DAC improves 16-shot classification over strong baselines by about 2.3%, and achieves competitive performance on 4 OOD robustness benchmarks.
    Abstract Large vision-language representation learning models like CLIP have demonstrated impressive performance for zero-shot transfer to downstream tasks while largely benefiting from inter-modal (image-text) alignment via contrastive objectives. This downstream performance can further be enhanced by full-scale fine-tuning which is often compute intensive, requires large labelled data, and can reduce out-of-distribution (OOD) robustness. Furthermore, sole reliance on inter-modal alignment might overlook the rich information embedded within each individual modality. In this work, we introduce a sample-efficient domain adaptation strategy for CLIP, termed Domain Aligned CLIP (DAC), which improves both intra-modal (image-image) and inter-modal alignment on target distributions without fine-tuning the main model. For intra-modal alignment, we introduce a lightweight adapter that is specifically trained with an intra-modal contrastive objective. To improve inter-modal alignment, we introduce a simple framework to modulate the precomputed class text embeddings. The proposed few-shot fine-tuning framework is computationally efficient, robust to distribution shifts, and does not alter CLIP's parameters. We study the effectiveness of DAC by benchmarking on 11 widely used image classification tasks with consistent improvements in 16-shot classification upon strong baselines by about 2.3% and demonstrate competitive performance on 4 OOD robustness benchmarks.
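
Since DAC modulates precomputed class text embeddings without touching CLIP's weights, the surrounding machinery looks roughly like standard CLIP zero-shot classification plus a small learnable residual per class. The sketch below is an assumption-laden illustration of that setup, not DAC's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedTextHead(nn.Module):
    """Classify image features against precomputed class text embeddings,
    with a small learnable residual per class. The frozen CLIP encoders are
    untouched; only `delta` is trained on the few-shot target data."""

    def __init__(self, class_text_embeddings, logit_scale=100.0):
        super().__init__()
        self.register_buffer("text", F.normalize(class_text_embeddings, dim=-1))
        self.delta = nn.Parameter(torch.zeros_like(class_text_embeddings))
        self.logit_scale = logit_scale

    def forward(self, image_features):
        text = F.normalize(self.text + self.delta, dim=-1)
        image = F.normalize(image_features, dim=-1)
        return self.logit_scale * image @ text.t()

# Usage: the text embeddings would come from CLIP prompts such as
# "a photo of a {class}"; random tensors stand in for them here.
head = ModulatedTextHead(torch.randn(10, 512))
logits = head(torch.randn(4, 512))   # (4 images, 10 classes)
```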

On the Computation of the Gaussian Rate-Distortion-Perception Function

  • paper_url: http://arxiv.org/abs/2311.09190
  • repo_url: None
  • paper_authors: Giuseppe Serra, Photios A. Stavrou, Marios Kountouris
  • for: This paper studies the computation of the rate-distortion-perception function (RDPF) for a multivariate Gaussian source under mean squared error (MSE) distortion and several perception metrics (Kullback-Leibler divergence, geometric Jensen-Shannon divergence, squared Hellinger distance, and squared Wasserstein-2 distance).
  • methods: The authors first characterize analytical bounds of the scalar Gaussian RDPF and provide the RDPF-achieving forward "test-channel" realization. In the multivariate case they show that, for tensorizable distortion and perception metrics, the optimal solution lies in the vector space spanned by the eigenvectors of the source covariance matrix, so the multivariate problem can be expressed as a function of the scalar Gaussian RDPFs of the source marginals, constrained by global distortion and perception levels.
  • results: An alternating minimization scheme based on the block nonlinear Gauss-Seidel method optimally solves the multivariate problem while identifying the RDPF-achieving realization; the algorithmic embodiment, convergence, and rate of convergence are characterized. For the "perfect realism" regime, an analytical solution of the multivariate Gaussian RDPF is obtained. Numerical simulations corroborate the results and connect them to existing work.
    Abstract In this paper, we study the computation of the rate-distortion-perception function (RDPF) for a multivariate Gaussian source under mean squared error (MSE) distortion and, respectively, Kullback-Leibler divergence, geometric Jensen-Shannon divergence, squared Hellinger distance, and squared Wasserstein-2 distance perception metrics. To this end, we first characterize the analytical bounds of the scalar Gaussian RDPF for the aforementioned divergence functions, also providing the RDPF-achieving forward "test-channel" realization. Focusing on the multivariate case, we establish that, for tensorizable distortion and perception metrics, the optimal solution resides on the vector space spanned by the eigenvector of the source covariance matrix. Consequently, the multivariate optimization problem can be expressed as a function of the scalar Gaussian RDPFs of the source marginals, constrained by global distortion and perception levels. Leveraging this characterization, we design an alternating minimization scheme based on the block nonlinear Gauss-Seidel method, which optimally solves the problem while identifying the Gaussian RDPF-achieving realization. Furthermore, the associated algorithmic embodiment is provided, as well as the convergence and the rate of convergence characterization. Lastly, for the "perfect realism" regime, the analytical solution for the multivariate Gaussian RDPF is obtained. We corroborate our results with numerical simulations and draw connections to existing results.
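
As background for readers less familiar with the object being computed (standard definitions, not the paper's new results): the RDPF augments the classical rate-distortion problem with a divergence constraint between the source and reconstruction distributions, and without that constraint the scalar Gaussian case has the familiar closed form.

```latex
% Rate-distortion-perception function (definition) and the classical
% scalar Gaussian rate-distortion function that it generalizes.
R(D, P) \;=\; \min_{P_{\hat{X}\mid X}} \; I(X;\hat{X})
\quad \text{s.t.} \quad
\mathbb{E}\!\left[(X-\hat{X})^2\right] \le D, \qquad
d\!\left(P_X, P_{\hat{X}}\right) \le P .

% With no perception constraint (P = \infty) and X \sim \mathcal{N}(0,\sigma^2):
R(D) \;=\; \max\!\left(0, \tfrac{1}{2}\log\frac{\sigma^2}{D}\right).
```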

RBPGAN: Recurrent Back-Projection GAN for Video Super Resolution

  • paper_url: http://arxiv.org/abs/2311.09178
  • repo_url: None
  • paper_authors: Israa Fahmy, Marwah Sulaiman, Zahraa Shehabeldin, Mohammed Barakat, Dareen Hussein, Mohammed El-Naggar, Hesham Eraqi, Moustafa Youssef
  • for: This paper proposes a video super-resolution (VSR) model that generates temporally coherent results while preserving spatial details.
  • methods: The model integrates two state-of-the-art models: a generator inspired by the RBPN system and a discriminator inspired by TecoGAN, together with a Ping-Pong loss to increase temporal consistency over time.
  • results: The contributions together yield a model that outperforms earlier work in terms of temporally consistent details, demonstrated qualitatively and quantitatively on different datasets.
    Abstract Recently, video super resolution (VSR) has become a very impactful task in the area of Computer Vision due to its various applications. In this paper, we propose Recurrent Back-Projection Generative Adversarial Network (RBPGAN) for VSR in an attempt to generate temporally coherent solutions while preserving spatial details. RBPGAN integrates two state-of-the-art models to get the best in both worlds without compromising the accuracy of produced video. The generator of the model is inspired by RBPN system, while the discriminator is inspired by TecoGAN. We also utilize Ping-Pong loss to increase temporal consistency over time. Our contribution together results in a model that outperforms earlier work in terms of temporally consistent details, as we will demonstrate qualitatively and quantitatively using different datasets.
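
The Ping-Pong loss referenced above comes from TecoGAN: the generator is run over the frame sequence forward and then in reverse, and the two passes are penalized for disagreeing, which suppresses temporally drifting artifacts. A minimal sketch follows; tensor shapes and names are assumptions.

```python
import torch

def ping_pong_loss(forward_out, backward_out):
    """forward_out:  (B, T, C, H, W) frames generated on the forward pass.
    backward_out: (B, T, C, H, W) frames generated on the reversed pass,
                  flipped back so that index t corresponds to the same frame.
    Long-term temporal consistency is encouraged by matching the two passes."""
    return torch.mean((forward_out - backward_out) ** 2)

# Usage inside a training step (illustrative only):
# sr_fwd = generator(lr_frames)
# sr_bwd = generator(lr_frames.flip(dims=[1])).flip(dims=[1])
# loss_pp = ping_pong_loss(sr_fwd, sr_bwd)
```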

WildlifeDatasets: An open-source toolkit for animal re-identification

  • paper_url: http://arxiv.org/abs/2311.09118
  • repo_url: https://github.com/wildlifedatasets/wildlife-datasets
  • paper_authors: Vojtěch Čermák, Lukas Picek, Lukáš Adam, Kostas Papafitsoros
  • for: This work presents WildlifeDatasets, an open-source toolkit intended primarily for ecologists and computer-vision / machine-learning researchers, offering straightforward access to publicly available wildlife datasets together with methods for dataset pre-processing, performance analysis, and model fine-tuning.
  • methods: The toolkit is written in Python. The paper includes, to the best of the authors' knowledge, the most comprehensive experimental comparison of datasets and methods for wildlife re-identification, covering both local descriptors and deep learning approaches. It also introduces MegaDescriptor, a foundation model for individual re-identification across a wide range of species.
  • results: MegaDescriptor achieves state-of-the-art performance on animal re-identification datasets and outperforms other pre-trained models such as CLIP and DINOv2 by a significant margin. Multiple MegaDescriptor flavors (Small, Medium, and Large) are released through the HuggingFace hub (https://huggingface.co/BVRA) for easy integration with existing wildlife monitoring applications.
    Abstract In this paper, we present WildlifeDatasets (https://github.com/WildlifeDatasets/wildlife-datasets) - an open-source toolkit intended primarily for ecologists and computer-vision / machine-learning researchers. The WildlifeDatasets is written in Python, allows straightforward access to publicly available wildlife datasets, and provides a wide variety of methods for dataset pre-processing, performance analysis, and model fine-tuning. We showcase the toolkit in various scenarios and baseline experiments, including, to the best of our knowledge, the most comprehensive experimental comparison of datasets and methods for wildlife re-identification, including both local descriptors and deep learning approaches. Furthermore, we provide the first-ever foundation model for individual re-identification within a wide range of species - MegaDescriptor - that provides state-of-the-art performance on animal re-identification datasets and outperforms other pre-trained models such as CLIP and DINOv2 by a significant margin. To make the model available to the general public and to allow easy integration with any existing wildlife monitoring applications, we provide multiple MegaDescriptor flavors (i.e., Small, Medium, and Large) through the HuggingFace hub (https://huggingface.co/BVRA).
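
Because the models are published on the HuggingFace hub, loading a MegaDescriptor flavor for feature extraction is typically a few lines. The snippet below assumes the checkpoints are exposed as timm-compatible models; the exact model identifier is an assumption and should be checked against https://huggingface.co/BVRA.

```python
import timm
import torch

# Assumed identifier; check the BVRA organization page on the HuggingFace hub
# for the exact names of the Small / Medium / Large flavors.
# num_classes=0 turns the model into a pooled feature extractor.
model = timm.create_model("hf-hub:BVRA/MegaDescriptor-L-384",
                          pretrained=True, num_classes=0)
model.eval()

# Extract a re-identification descriptor for one (preprocessed) image tensor.
with torch.no_grad():
    image = torch.randn(1, 3, 384, 384)   # placeholder for a real image
    descriptor = model(image)             # compare descriptors via cosine similarity
print(descriptor.shape)
```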

Cross-view and Cross-pose Completion for 3D Human Understanding

  • paper_url: http://arxiv.org/abs/2311.09104
  • repo_url: None
  • paper_authors: Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez
  • for: The paper targets the computer vision domain, specifically human perception and understanding.
  • methods: The paper proposes a self-supervised pre-training approach on human-centric data, using stereoscopic (cross-view) and temporal (cross-pose) pairs of images in which the model reconstructs masked parts of one image given the visible parts and the second image.
  • results: The proposed method outperforms existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and achieves state-of-the-art performance for model-based and model-free human mesh recovery.
    Abstract Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs, and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance for instance when fine-tuning for model-based and model-free human mesh recovery.

Guided Scale Space Radon Transform for linear structures detection

  • paper_url: http://arxiv.org/abs/2311.09103
  • repo_url: None
  • paper_authors: Aicha Baya Goumeidane, Djemel Ziou, Nafaa Nacereddine
  • for: Automatic detection of thick linear structures in gray scale and binary images.
  • methods: The method uses the Scale Space Radon Transform (SSRT) guided by the Hessian orientations computed from the investigated image.
  • results: The method effectively detects lines of different thickness in images and is robust against noise and complex backgrounds.
    Abstract Using integral transforms to the end of lines detection in images with complex background, makes the detection a hard task needing additional processing to manage the detection. As an integral transform, the Scale Space Radon Transform (SSRT) suffers from such drawbacks, even with its great abilities for thick lines detection. In this work, we propose a method to address this issue for automatic detection of thick linear structures in gray scale and binary images using the SSRT, whatever the image background content. This method involves the calculated Hessian orientations of the investigated image while computing its SSRT, in such a way that linear structures are emphasized in the SSRT space. As a consequence, the subsequent maxima detection in the SSRT space is done on a modified transform space freed from unwanted parts and, consequently, from irrelevant peaks that usually drown the peaks representing lines. Besides, highlighting the linear structure in the SSRT space permitting, thus, to efficiently detect lines of different thickness in synthetic and real images, the experiments show also the method robustness against noise and complex background.
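
The quantity that guides the SSRT here is the field of Hessian orientations. A generic way to compute it, sketched below with Gaussian-derivative filters (not the authors' code), is to form the 2x2 Hessian at each pixel and take the orientation of its principal axis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_orientations(image, sigma=2.0):
    """Per-pixel orientation (radians) of the dominant Hessian eigenvector,
    computed from Gaussian second derivatives at scale `sigma`."""
    Ixx = gaussian_filter(image, sigma, order=(0, 2))  # d2/dx2 (columns)
    Iyy = gaussian_filter(image, sigma, order=(2, 0))  # d2/dy2 (rows)
    Ixy = gaussian_filter(image, sigma, order=(1, 1))
    # Orientation of the principal axis of [[Ixx, Ixy], [Ixy, Iyy]].
    theta = 0.5 * np.arctan2(2.0 * Ixy, Ixx - Iyy)
    return theta

img = np.zeros((64, 64))
img[30:34, :] = 1.0   # a thick horizontal line
print(hessian_orientations(img).shape)
```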

Applications of Computer Vision in Autonomous Vehicles: Methods, Challenges and Future Directions

  • paper_url: http://arxiv.org/abs/2311.09093
  • repo_url: None
  • paper_authors: Xingshuai Dong, Massimiliano L. Cappuccio
  • for: This survey aims to give readers a comprehensive understanding of autonomous driving technology, covering the development of autonomous driving systems, sensor technology, benchmark datasets, and public opinion.
  • methods: The survey reviews computer vision applications including depth estimation, object detection, lane detection, and traffic sign recognition, and summarizes autonomous driving systems developed by major automotive manufacturers from different countries.
  • results: The survey analyzes the current technological challenges faced by autonomous vehicles and points out promising directions for future research, informing readers from both academic and industry perspectives.
    Abstract Autonomous vehicle refers to a vehicle capable of perceiving its surrounding environment and driving with little or no human driver input. The perception system is a fundamental component which enables the autonomous vehicle to collect data and extract relevant information from the environment to drive safely. Benefit from the recent advances in computer vision, the perception task can be achieved by using sensors, such as camera, LiDAR, radar, and ultrasonic sensor. This paper reviews publications on computer vision and autonomous driving that are published during the last ten years. In particular, we first investigate the development of autonomous driving systems and summarize these systems that are developed by the major automotive manufacturers from different countries. Second, we investigate the sensors and benchmark data sets that are commonly utilized for autonomous driving. Then, a comprehensive overview of computer vision applications for autonomous driving such as depth estimation, object detection, lane detection, and traffic sign recognition are discussed. Additionally, we review public opinions and concerns on autonomous vehicles. Based on the discussion, we analyze the current technological challenges that autonomous vehicles meet with. Finally, we present our insights and point out some promising directions for future research. This paper will help the reader to understand autonomous vehicles from the perspectives of academia and industry.

  • paper_url: http://arxiv.org/abs/2311.09084
  • repo_url: https://github.com/hcplab-sysu/personsearch-ctlg
  • paper_authors: Hefeng Wu, Weifeng Chen, Zhibin Liu, Tianshui Chen, Zhiguang Chen, Liang Lin
  • for: This paper proposes a simple yet effective dual Transformer model for text-based person search (TBPS) in an image gallery.
  • methods: The model uses a hardness-aware contrastive learning strategy together with a proximity data generation (PDG) module that automatically produces more diverse data for cross-modal training.
  • results: Experiments on two popular TBPS datasets (CUHK-PEDES and ICFG-PEDES) show the model clearly outperforms state-of-the-art methods, e.g., improving Top1, Top5, and Top10 on CUHK-PEDES by 3.88%, 4.02%, and 2.92%.
    Abstract Given a descriptive text query, text-based person search (TBPS) aims to retrieve the best-matched target person from an image gallery. Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data. To better align the two modalities, most existing works focus on introducing sophisticated network structures and auxiliary tasks, which are complex and hard to implement. In this paper, we propose a simple yet effective dual Transformer model for text-based person search. By exploiting a hardness-aware contrastive learning strategy, our model achieves state-of-the-art performance without any special design for local feature alignment or side information. Moreover, we propose a proximity data generation (PDG) module to automatically produce more diverse data for cross-modal training. The PDG module first introduces an automatic generation algorithm based on a text-to-image diffusion model, which generates new text-image pair samples in the proximity space of original ones. Then it combines approximate text generation and feature-level mixup during training to further strengthen the data diversity. The PDG module can largely guarantee the reasonability of the generated samples that are directly used for training without any human inspection for noise rejection. It improves the performance of our model significantly, providing a feasible solution to the data insufficiency problem faced by such fine-grained visual-linguistic tasks. Extensive experiments on two popular datasets of the TBPS task (i.e., CUHK-PEDES and ICFG-PEDES) show that the proposed approach outperforms state-of-the-art approaches evidently, e.g., improving by 3.88%, 4.02%, 2.92% in terms of Top1, Top5, Top10 on CUHK-PEDES. The codes will be available at https://github.com/HCPLab-SYSU/PersonSearch-CTLG

Spiking NeRF: Representing the Real-World Geometry by a Discontinuous Representation

  • paper_url: http://arxiv.org/abs/2311.09077
  • repo_url: None
  • paper_authors: Zhanfeng Liao, Qian Zheng, Yan Liu, Gang Pan
  • for: Improving the accuracy of NeRF-based methods so that they better capture object geometry and appearance.
  • methods: Spiking neurons within a hybrid ANN-SNN framework build a discontinuous density field that faithfully represents scene geometry; a bounded spiking neuron is proposed based on the derived relationship between the spiking neuron parameter and theoretical geometric accuracy.
  • results: The method achieves SOTA performance, and the numerical relationship between the spiking neuron parameter and theoretical accuracy is provided to guide further improvement.
    Abstract A crucial reason for the success of existing NeRF-based methods is to build a neural density field for the geometry representation via multiple perceptron layers (MLPs). MLPs are continuous functions, however, real geometry or density field is frequently discontinuous at the interface between the air and the surface. Such a contrary brings the problem of unfaithful geometry representation. To this end, this paper proposes spiking NeRF, which leverages spiking neuron and a hybrid Artificial Neural Network (ANN)-Spiking Neural Network (SNN) framework to build a discontinuous density field for faithful geometry representation. Specifically, we first demonstrate the reason why continuous density fields will bring inaccuracy. Then, we propose to use the spiking neurons to build a discontinuous density field. We conduct comprehensive analysis for the problem of existing spiking neuron models and then provide the numerical relationship between the parameter of spiking neuron and the theoretical accuracy of geometry, Based on this, we propose a bounded spiking neuron to build the discontinuous density field. Our results achieve SOTA performance. Our code and data will be released to the public.

Imagine the Unseen World: A Benchmark for Systematic Generalization in Visual World Models

  • paper_url: http://arxiv.org/abs/2311.09064
  • repo_url: None
  • paper_authors: Yeongbin Kim, Gautam Singh, Junyeong Park, Caglar Gulcehre, Sungjin Ahn
  • for: This paper proposes a new benchmark for evaluating the systematic compositionality of machine learning models in the visual domain.
  • methods: The Systematic Visual Imagination Benchmark (SVIB) poses a minimal world modeling problem in which models must generate one-step image-to-image transformations under a latent world dynamics, with a range of difficulty levels and control over the fraction of factor combinations seen during training.
  • results: A comprehensive evaluation of various baseline models reveals limitations of current models in systematic visual imagination and points to possible directions for improvement.
    Abstract Systematic compositionality, or the ability to adapt to novel situations by creating a mental model of the world using reusable pieces of knowledge, remains a significant challenge in machine learning. While there has been considerable progress in the language domain, efforts towards systematic visual imagination, or envisioning the dynamical implications of a visual observation, are in their infancy. We introduce the Systematic Visual Imagination Benchmark (SVIB), the first benchmark designed to address this problem head-on. SVIB offers a novel framework for a minimal world modeling problem, where models are evaluated based on their ability to generate one-step image-to-image transformations under a latent world dynamics. The framework provides benefits such as the possibility to jointly optimize for systematic perception and imagination, a range of difficulty levels, and the ability to control the fraction of possible factor combinations used during training. We provide a comprehensive evaluation of various baseline models on SVIB, offering insight into the current state-of-the-art in systematic visual imagination. We hope that this benchmark will help advance visual systematic compositionality.

Self-Annotated 3D Geometric Learning for Smeared Points Removal

  • paper_url: http://arxiv.org/abs/2311.09029
  • repo_url: None
  • paper_authors: Miaowei Wang, Daniel Morris
  • for: This work aims to improve the accuracy and quality of consumer-level dense depth sensors by addressing the problem of smeared points, i.e., depth pixels that lie on no 3D surface, typically interpolations between foreground and background objects.
  • methods: A fully self-annotated method gathers 3D geometric evidence from multiple perspectives to automatically detect and annotate smeared points and valid points, which are then used to train a smeared point removal classifier.
  • results: Experiments and ablation studies on the newly introduced Real Azure-Kinect benchmark dataset show that the method outperforms traditional filters and other self-annotated methods.
    Abstract There has been significant progress in improving the accuracy and quality of consumer-level dense depth sensors. Nevertheless, there remains a common depth pixel artifact which we call smeared points. These are points not on any 3D surface and typically occur as interpolations between foreground and background objects. As they cause fictitious surfaces, these points have the potential to harm applications dependent on the depth maps. Statistical outlier removal methods fare poorly in removing these points as they tend also to remove actual surface points. Trained network-based point removal faces difficulty in obtaining sufficient annotated data. To address this, we propose a fully self-annotated method to train a smeared point removal classifier. Our approach relies on gathering 3D geometric evidence from multiple perspectives to automatically detect and annotate smeared points and valid points. To validate the effectiveness of our method, we present a new benchmark dataset: the Real Azure-Kinect dataset. Experimental results and ablation studies show that our method outperforms traditional filters and other self-annotated methods. Our work is publicly available at https://github.com/wangmiaowei/wacv2024_smearedremover.git.

Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing

  • paper_url: http://arxiv.org/abs/2311.09024
  • repo_url: None
  • paper_authors: A K Nirala, A Joshi, C Hegde, S Sarkar
  • for: This paper proposes a fast certification method for open-vocabulary recognition models, to ensure their robustness for reliable real-world deployment.
  • methods: The method, Open Vocabulary Certification (OVC), builds on randomized smoothing: given a base set of prompts with certified CLIP classifiers, a classifier for a novel prompt is viewed as a perturbed version of nearby classifiers in the base set and is certified rapidly via incremental randomized smoothing, with a caching trick and a multivariate-normal approximation of the embedding space for further speedups.
  • results: Experiments with multiple vision-language backbones on the CIFAR-10 and ImageNet test datasets demonstrate the effectiveness of OVC, with roughly two orders of magnitude acceleration when certifying novel prompts.
    Abstract A key benefit of deep vision-language models such as CLIP is that they enable zero-shot open vocabulary classification; the user has the ability to define novel class labels via natural language prompts at inference time. However, while CLIP-based zero-shot classifiers have demonstrated competitive performance across a range of domain shifts, they remain highly vulnerable to adversarial attacks. Therefore, ensuring the robustness of such models is crucial for their reliable deployment in the wild. In this work, we introduce Open Vocabulary Certification (OVC), a fast certification method designed for open-vocabulary models like CLIP via randomized smoothing techniques. Given a base "training" set of prompts and their corresponding certified CLIP classifiers, OVC relies on the observation that a classifier with a novel prompt can be viewed as a perturbed version of nearby classifiers in the base training set. Therefore, OVC can rapidly certify the novel classifier using a variation of incremental randomized smoothing. By using a caching trick, we achieve approximately two orders of magnitude acceleration in the certification process for novel prompts. To achieve further (heuristic) speedups, OVC approximates the embedding space at a given input using a multivariate normal distribution bypassing the need for sampling via forward passes through the vision backbone. We demonstrate the effectiveness of OVC on through experimental evaluation using multiple vision-language backbones on the CIFAR-10 and ImageNet test datasets.
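
OVC builds on randomized smoothing, which certifies a radius around an input within which the smoothed classifier's prediction cannot change. The sketch below shows the standard Cohen-et-al.-style Monte Carlo certificate that OVC accelerates for novel prompts; the confidence bound is simplified and the whole snippet is illustrative, not OVC itself.

```python
import torch
from scipy.stats import norm

def certify(classifier, x, sigma=0.25, n=1000, alpha=0.001):
    """Certified L2 radius for the smoothed version of `classifier` at input x.

    Samples Gaussian perturbations, takes the most frequent class, lower-bounds
    its probability pA, and returns radius = sigma * Phi^{-1}(pA) (Cohen et al.).
    """
    with torch.no_grad():
        noise = torch.randn(n, *x.shape) * sigma
        preds = classifier(x.unsqueeze(0) + noise).argmax(dim=1)
    counts = torch.bincount(preds)
    top_class = counts.argmax().item()
    # Crude lower bound on pA; a rigorous version uses a binomial confidence bound.
    p_a = max(counts[top_class].item() / n - alpha, 1e-6)
    if p_a <= 0.5:
        return top_class, 0.0   # abstain: no certificate
    return top_class, sigma * norm.ppf(p_a)
```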

Incremental Object-Based Novelty Detection with Feedback Loop

  • paper_url: http://arxiv.org/abs/2311.09004
  • repo_url: None
  • paper_authors: Simone Caldarella, Elisa Ricci, Rahaf Aljundi
  • for: The goal is to improve object detection models' ability to detect unknown (novel) objects, avoiding potentially harmful behaviours in real-world applications such as self-driving cars or autonomous robots.
  • methods: The paper proposes object-based novelty detection with a feedback loop: human feedback on predicted outputs can be requested and incorporated to refine a lightweight novelty detection (ND) module attached on top of a pre-trained object detector, without degrading the main detection performance; the refinement is repeated whenever new feedback is available.
  • results: On a newly proposed benchmark for this setting, extensive experiments show that the ND approach increases robustness, successfully incorporates the received feedback, and outperforms baselines.
    Abstract Object-based Novelty Detection (ND) aims to identify unknown objects that do not belong to classes seen during training by an object detection model. The task is particularly crucial in real-world applications, as it allows to avoid potentially harmful behaviours, e.g. as in the case of object detection models adopted in a self-driving car or in an autonomous robot. Traditional approaches to ND focus on one time offline post processing of the pretrained object detection output, leaving no possibility to improve the model robustness after training and discarding the abundant amount of out-of-distribution data encountered during deployment. In this work, we propose a novel framework for object-based ND, assuming that human feedback can be requested on the predicted output and later incorporated to refine the ND model without negatively affecting the main object detection performance. This refinement operation is repeated whenever new feedback is available. To tackle this new formulation of the problem for object detection, we propose a lightweight ND module attached on top of a pre-trained object detection model, which is incrementally updated through a feedback loop. We also propose a new benchmark to evaluate methods on this new setting and test extensively our ND approach against baselines, showing increased robustness and a successful incorporation of the received feedback.

Simple but Effective Unsupervised Classification for Specified Domain Images: A Case Study on Fungi Images

  • paper_url: http://arxiv.org/abs/2311.08995
  • repo_url: None
  • paper_authors: Zhaocong liu, Fa Zhang, Lin Cheng, Huanxi Deng, Xiaoyan Yang, Zhenyu Zhang, Chichun Zhou
  • for: The method targets image classification in specialized domains that require expert knowledge, addressing the scarcity of high-quality labeled data.
  • methods: Features are extracted automatically with a pre-trained model and reduced in two steps (including manifold learning); multiple clustering algorithms then vote to produce an unsupervised classification, with post-hoc rather than prior manual annotation.
  • results: On fungal image data, the approach reaches 94.1% and 96.7% classification accuracy on public and private datasets respectively, outperforming supervised methods. The unsupervised approach reduces dependency on pre-annotated datasets and enables a closed loop for data classification.
    Abstract High-quality labeled datasets are essential for deep learning. Traditional manual annotation methods are not only costly and inefficient but also pose challenges in specialized domains where expert knowledge is needed. Self-supervised methods, despite leveraging unlabeled data for feature extraction, still require hundreds or thousands of labeled instances to guide the model for effective specialized image classification. Current unsupervised learning methods offer automatic classification without prior annotation but often compromise on accuracy. As a result, efficiently procuring high-quality labeled datasets remains a pressing challenge for specialized domain images devoid of annotated data. Addressing this, an unsupervised classification method with three key ideas is introduced: 1) dual-step feature dimensionality reduction using a pre-trained model and manifold learning, 2) a voting mechanism from multiple clustering algorithms, and 3) post-hoc instead of prior manual annotation. This approach outperforms supervised methods in classification accuracy, as demonstrated with fungal image data, achieving 94.1% and 96.7% on public and private datasets respectively. The proposed unsupervised classification method reduces dependency on pre-annotated datasets, enabling a closed-loop for data classification. The simplicity and ease of use of this method will also bring convenience to researchers in various fields in building datasets, promoting AI applications for images in specialized domains.
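
The clustering-with-voting step can be sketched as follows: reduce features in two steps, run several clustering algorithms, align their label spaces to a reference clustering, and take a per-sample majority vote. The backbone features, reduction methods, and algorithms below are stand-ins, not the paper's exact choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering

def align_labels(ref, other, n_clusters):
    # Find the permutation of `other`'s cluster ids that best matches `ref`.
    cost = np.zeros((n_clusters, n_clusters))
    for i in range(n_clusters):
        for j in range(n_clusters):
            cost[i, j] = -np.sum((other == i) & (ref == j))
    rows, cols = linear_sum_assignment(cost)
    mapping = dict(zip(rows, cols))
    return np.array([mapping[l] for l in other])

def consensus_cluster(features, n_clusters=10):
    # Dual-step reduction: linear (PCA) followed by manifold learning (t-SNE).
    z = PCA(n_components=50).fit_transform(features)
    z = TSNE(n_components=2, init="pca").fit_transform(z)
    algos = [
        KMeans(n_clusters=n_clusters, n_init=10),
        AgglomerativeClustering(n_clusters=n_clusters),
        SpectralClustering(n_clusters=n_clusters, assign_labels="discretize"),
    ]
    votes = [a.fit_predict(z) for a in algos]
    ref = votes[0]
    votes = np.stack([ref] + [align_labels(ref, v, n_clusters) for v in votes[1:]])
    # Majority vote per sample across the aligned clusterings.
    return np.apply_along_axis(
        lambda c: np.bincount(c, minlength=n_clusters).argmax(), 0, votes)
```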

Unsupervised approaches based on optimal transport and convex analysis for inverse problems in imaging

  • paper_url: http://arxiv.org/abs/2311.08972
  • repo_url: None
  • paper_authors: Marcello Carioni, Subhadip Mukherjee, Hong Ye Tan, Junqi Tang
  • for: This paper reviews theoretically principled unsupervised learning schemes for solving imaging inverse problems, with a particular focus on methods rooted in optimal transport and convex analysis.
  • methods: The paper reviews optimal transport-based unsupervised approaches, learned adversarial regularization methods, provably convergent learned optimization algorithms, and plug-and-play algorithms for imaging problems.
  • results: The paper provides an overview of the key mathematical results that underlie the methods reviewed in the chapter to keep the discussion self-contained.
    Abstract Unsupervised deep learning approaches have recently become one of the crucial research areas in imaging owing to their ability to learn expressive and powerful reconstruction operators even when paired high-quality training data is scarcely available. In this chapter, we review theoretically principled unsupervised learning schemes for solving imaging inverse problems, with a particular focus on methods rooted in optimal transport and convex analysis. We begin by reviewing the optimal transport-based unsupervised approaches such as the cycle-consistency-based models and learned adversarial regularization methods, which have clear probabilistic interpretations. Subsequently, we give an overview of a recent line of works on provably convergent learned optimization algorithms applied to accelerate the solution of imaging inverse problems, alongside their dedicated unsupervised training schemes. We also survey a number of provably convergent plug-and-play algorithms (based on gradient-step deep denoisers), which are among the most important and widely applied unsupervised approaches for imaging problems. At the end of this survey, we provide an overview of a few related unsupervised learning frameworks that complement our focused schemes. Together with a detailed survey, we provide an overview of the key mathematical results that underlie the methods reviewed in the chapter to keep our discussion self-contained.
    摘要 无监督深度学习方法近年来已成为成像领域的关键研究方向之一,因为即使在成对的高质量训练数据十分稀缺的情况下,它们也能学习到表达力强的重建算子。在本章中,我们综述了用于求解成像逆问题的、具有理论依据的无监督学习方案,重点关注基于最优传输和凸分析的方法。我们首先回顾基于最优传输的无监督方法,例如基于循环一致性的模型和学习型对抗正则化方法,它们都具有清晰的概率解释。随后,我们概述了近期一系列可证明收敛的学习优化算法及其专门的无监督训练方案,这些算法被用于加速成像逆问题的求解。我们还综述了若干可证明收敛的即插即用(plug-and-play)算法(基于梯度步深度去噪器),它们是成像问题中最重要、应用最广泛的无监督方法之一。最后,我们简要介绍了若干与本章重点方案互为补充的无监督学习框架。除详尽的综述外,我们还概述了支撑本章所述方法的关键数学结论,以保持讨论的自洽完整。

A Spectral Diffusion Prior for Hyperspectral Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.08955
  • repo_url: None
  • paper_authors: Jianjun Liu, Zebin Wu, Liang Xiao
  • for: fusion-based hyperspectral image (HSI) super-resolution
  • methods: spectral diffusion prior, maximum a posteriori, Adam optimization
  • results: effective in producing high-spatial-resolution HSI, demonstrated on both synthetic and real datasets
  • for: 高分辨率多spectral影像(HSI)超解析
  • methods: spectral diffusion prior, maximum a posteriori, Adam优化
  • results: 高效地生成高分辨率HSI, 在synthetic和实际数据上进行了实验
    Abstract Fusion-based hyperspectral image (HSI) super-resolution aims to produce a high-spatial-resolution HSI by fusing a low-spatial-resolution HSI and a high-spatial-resolution multispectral image. Such a HSI super-resolution process can be modeled as an inverse problem, where the prior knowledge is essential for obtaining the desired solution. Motivated by the success of diffusion models, we propose a novel spectral diffusion prior for fusion-based HSI super-resolution. Specifically, we first investigate the spectrum generation problem and design a spectral diffusion model to model the spectral data distribution. Then, in the framework of maximum a posteriori, we keep the transition information between every two neighboring states during the reverse generative process, and thereby embed the knowledge of trained spectral diffusion model into the fusion problem in the form of a regularization term. At last, we treat each generation step of the final optimization problem as its subproblem, and employ the Adam to solve these subproblems in a reverse sequence. Experimental results conducted on both synthetic and real datasets demonstrate the effectiveness of the proposed approach. The code of the proposed approach will be available on https://github.com/liuofficial/SDP.
    摘要 融合式高光谱影像(HSI)超分辨率的目标是通过融合低空间分辨率HSI和高空间分辨率多光谱影像,生成高空间分辨率HSI。这种HSI超分辨率过程可以表示为一个逆问题,其中先验知识是获得理想解的关键。受 diffusion 模型成功的启发,我们为融合式HSI超分辨率提出了一种新的 spectral diffusion prior。具体而言,我们首先研究光谱生成问题,并设计了一种 spectral diffusion 模型来建模光谱数据分布。然后,在 maximum a posteriori 框架中,我们在逆向生成过程中保留每两个相邻状态之间的过渡信息,从而以正则化项的形式将训练好的 spectral diffusion 模型的知识嵌入到融合问题中。最后,我们将最终优化问题的每个生成步骤视为其子问题,并按逆序使用 Adam 算法求解这些子问题。在合成与真实数据集上的实验结果表明了所提方法的有效性。代码地址:https://github.com/liuofficial/SDP 。

Automated Volume Corrected Mitotic Index Calculation Through Annotation-Free Deep Learning using Immunohistochemistry as Reference Standard

  • paper_url: http://arxiv.org/abs/2311.08949
  • repo_url: None
  • paper_authors: Jonas Ammeling, Moritz Hecker, Jonathan Ganz, Taryn A. Donovan, Christof A. Bertram, Katharina Breininger, Marc Aubreville
  • for: This paper is written for assessing the prognostic value of invasive breast carcinomas using a deep learning-based approach.
  • methods: The paper uses a deep learning pipeline solely trained with an annotation-free, immunohistochemistry-based approach to estimate epithelial segmentation in canine breast carcinomas.
  • results: The deep learning-based pipeline shows expert-level performance, providing time efficiency and reproducibility, compared to the manually annotated M/V-Index.
    Abstract The volume-corrected mitotic index (M/V-Index) was shown to provide prognostic value in invasive breast carcinomas. However, despite its prognostic significance, it is not established as the standard method for assessing aggressive biological behaviour, due to the high additional workload associated with determining the epithelial proportion. In this work, we show that using a deep learning pipeline solely trained with an annotation-free, immunohistochemistry-based approach, provides accurate estimations of epithelial segmentation in canine breast carcinomas. We compare our automatic framework with the manually annotated M/V-Index in a study with three board-certified pathologists. Our results indicate that the deep learning-based pipeline shows expert-level performance, while providing time efficiency and reproducibility.
    摘要 “体积校正有丝分裂指数(M/V-Index)已被证明对浸润性乳腺癌具有预后价值。然而,尽管其具有预后意义,由于测定上皮组织比例需要大量额外工作,它并未成为评估侵袭性生物学行为的标准方法。在这项工作中,我们展示了一条仅以免疫组化为参照、无需人工标注即可训练的深度学习流程,能够准确估计犬乳腺癌中的上皮组织分割。我们在一项包含三位获得专科认证的病理学家的研究中,将该自动化框架与人工标注的 M/V-Index 进行了比较。结果表明,该深度学习流程达到了专家水平的表现,同时具备时间效率和可重复性。”

Confident Naturalness Explanation (CNE): A Framework to Explain and Assess Patterns Forming Naturalness

  • paper_url: http://arxiv.org/abs/2311.08936
  • repo_url: None
  • paper_authors: Ahmed Emam, Mohamed Farag, Ribana Roscher
  • for: This paper aims to improve the understanding and mapping of naturalness within protected natural areas using machine learning and explainability techniques.
  • methods: The proposed Confident Naturalness Explanation (CNE) framework combines explainable machine learning and uncertainty quantification to assess and explain naturalness, using a new quantitative metric and uncertainty-aware segmentation masks.
  • results: The proposed CNE framework is demonstrated to be effective in a study site in Fennoscandia using two open-source satellite datasets, providing confident and objective explanations of naturalness.
  • for: 这篇论文目的是使用机器学习和可解释技术来提高保护自然区域中自然性的理解和地图。
  • methods: 提议的Confident Naturalness Explanation(CNE)框架结合可解释机器学习和不确定量化来评估和解释自然性,使用新的量化度量和不确定度映射。
  • results: 在芬兰地区使用两个开源卫星数据集,通过应用CNE框架,实现了对自然性的可靠和客观解释。
    Abstract Protected natural areas are regions that have been minimally affected by human activities such as urbanization, agriculture, and other human interventions. To better understand and map the naturalness of these areas, machine learning models can be used to analyze satellite imagery. Specifically, explainable machine learning methods show promise in uncovering patterns that contribute to the concept of naturalness within these protected environments. Additionally, addressing the uncertainty inherent in machine learning models is crucial for a comprehensive understanding of this concept. However, existing approaches have limitations. They either fail to provide explanations that are both valid and objective or struggle to offer a quantitative metric that accurately measures the contribution of specific patterns to naturalness, along with the associated confidence. In this paper, we propose a novel framework called the Confident Naturalness Explanation (CNE) framework. This framework combines explainable machine learning and uncertainty quantification to assess and explain naturalness. We introduce a new quantitative metric that describes the confident contribution of patterns to the concept of naturalness. Furthermore, we generate an uncertainty-aware segmentation mask for each input sample, highlighting areas where the model lacks knowledge. To demonstrate the effectiveness of our framework, we apply it to a study site in Fennoscandia using two open-source satellite datasets.
    摘要 保护的自然区域是人类活动影响的最小化区域,如城市化、农业等。为了更好地理解和映射这些区域的自然性,机器学习模型可以使用卫星图像进行分析。特别是使用可解释机器学习方法可以揭示保护区域中自然性的特征。然而,现有的方法有限制。它们可能无法提供有效和客观的解释,或者困难提供准确度量自然性的贡献和相关信息。在这篇论文中,我们提出了一种新的框架,即可靠自然性解释(CNE)框架。这个框架结合可解释机器学习和不确定量化来评估和解释自然性。我们还提出了一个新的量化度量,用于描述模型对自然性的可靠贡献。此外,我们生成了每个输入样本的不确定性感知分割图,以标识模型对具体区域的不确定性。为了证明我们的框架的效果,我们对芬兰北部的一个研究区使用了两个开源卫星数据集进行应用。

Structural-Based Uncertainty in Deep Learning Across Anatomical Scales: Analysis in White Matter Lesion Segmentation

  • paper_url: http://arxiv.org/abs/2311.08931
  • repo_url: https://github.com/medical-image-analysis-laboratory/ms_wml_uncs
  • paper_authors: Nataliia Molchanova, Vatsal Raina, Andrey Malinin, Francesco La Rosa, Adrien Depeursinge, Mark Gales, Cristina Granziera, Henning Muller, Mara Graziani, Meritxell Bach Cuadra
  • for: 这个论文探讨了自动深度学习(DL)工具的可靠性量化(UQ)在多发性脑梗液病人(MS)的磁共振成像(MRI)扫描中的白 matter损伤(WML)分 segmentation任务中的作用。
  • methods: 我们的研究主要集中在两个主要的不确定性问题上:首先,我们认为一个好的不确定性度量应该指示预测有高度不确定性的值。其次,我们 investigate了不确定性在不同的 анаatomical scale( voxel、 lesion 或 patient)之间的关系。我们认为不确定性在每个缩放级别都与特定类型的错误有关。
  • results: 我们的研究结果表明,我们提出的方法可以更好地捕捉模型错误在 lesion 和 patient 缩放级别上,比 tradicional voxel-scale uncertainty 值的平均值。我们在一个多中心 MRI 数据集上进行了172名病人的研究,并提供了 UQ 协议代码在 GitHub 上(https://github.com/Medical-Image-Analysis-Laboratory/MS_WML_uncs)。
    Abstract This paper explores uncertainty quantification (UQ) as an indicator of the trustworthiness of automated deep-learning (DL) tools in the context of white matter lesion (WML) segmentation from magnetic resonance imaging (MRI) scans of multiple sclerosis (MS) patients. Our study focuses on two principal aspects of uncertainty in structured output segmentation tasks. Firstly, we postulate that a good uncertainty measure should indicate predictions likely to be incorrect with high uncertainty values. Second, we investigate the merit of quantifying uncertainty at different anatomical scales (voxel, lesion, or patient). We hypothesize that uncertainty at each scale is related to specific types of errors. Our study aims to confirm this relationship by conducting separate analyses for in-domain and out-of-domain settings. Our primary methodological contributions are (i) the development of novel measures for quantifying uncertainty at lesion and patient scales, derived from structural prediction discrepancies, and (ii) the extension of an error retention curve analysis framework to facilitate the evaluation of UQ performance at both lesion and patient scales. The results from a multi-centric MRI dataset of 172 patients demonstrate that our proposed measures more effectively capture model errors at the lesion and patient scales compared to measures that average voxel-scale uncertainty values. We provide the UQ protocols code at https://github.com/Medical-Image-Analysis-Laboratory/MS_WML_uncs.
    摘要 本文探讨了不确定性量化(UQ)作为自动化深度学习(DL)工具可信度指标的作用,应用场景为多发性硬化(MS)患者磁共振成像(MRI)中的白质病灶(WML)分割。我们的研究聚焦于结构化输出分割任务中不确定性的两个核心方面:其一,一个好的不确定性度量应当以较高的不确定性值指示那些很可能出错的预测;其二,我们考察了在不同解剖尺度(体素、病灶或患者)上量化不确定性的价值,并假设每个尺度上的不确定性与特定类型的错误相关。我们通过分别在域内和域外设置下开展分析来验证这一关系。我们的主要方法学贡献包括:(i)基于结构化预测差异,提出了在病灶和患者尺度上量化不确定性的新度量;(ii)扩展了误差保留曲线分析框架,以便在病灶和患者尺度上评估 UQ 的性能。在一个包含172名患者的多中心 MRI 数据集上的结果表明,与对体素尺度不确定性取平均的度量相比,所提出的度量能更有效地捕捉病灶和患者尺度上的模型错误。UQ 协议代码见 https://github.com/Medical-Image-Analysis-Laboratory/MS_WML_uncs 。
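A small, self-contained illustration of the error retention curve idea referenced above: samples (voxels, lesions, or patients) are ranked by uncertainty, and the mean error over the most-confident fraction is tracked as that fraction grows. The error values and scores below are synthetic placeholders, not the paper's lesion- or patient-scale measures.

```python
import numpy as np

def error_retention_curve(errors, uncertainties):
    """Per-sample errors and uncertainty scores -> (retained fraction, mean error)."""
    order = np.argsort(uncertainties)                    # most confident first
    sorted_err = np.asarray(errors, dtype=float)[order]
    n = len(sorted_err)
    fractions = np.arange(1, n + 1) / n
    mean_err = np.cumsum(sorted_err) / np.arange(1, n + 1)
    return fractions, mean_err

def retention_auc(fractions, mean_err):
    # Trapezoidal area; lower = errors concentrated in the high-uncertainty tail.
    return float(np.sum(np.diff(fractions) * (mean_err[1:] + mean_err[:-1]) / 2))

# Toy check: an uncertainty score correlated with the error yields a lower AUC
# than a random score, which is the behaviour a useful UQ measure should show.
rng = np.random.default_rng(0)
err = rng.random(500)
good = retention_auc(*error_retention_curve(err, err + 0.1 * rng.random(500)))
bad = retention_auc(*error_retention_curve(err, rng.random(500)))
print(good < bad)   # expected: True
```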

Progressive Feedback-Enhanced Transformer for Image Forgery Localization

  • paper_url: http://arxiv.org/abs/2311.08910
  • repo_url: None
  • paper_authors: Haochen Zhu, Gang Cao, Xianglin Huang
  • for: 本研究旨在提出一种 Progressive FeedbACk-enhanced Transformer (ProFact) 网络,用于提高图像forge localization的精度和可靠性。
  • methods: 该网络将初始分支网络生成的粗略定位图自适应地反馈到早期的 transformer 编码层,以增强正例特征表示,同时抑制干扰因素。此外,还提出了一种 Contextual Spatial Pyramid 模块,用于进一步精炼判别性取证特征,提升伪造定位的准确性与可靠性。
  • results: 在九个公开取证数据集上,所提出的定位器在泛化能力和鲁棒性方面均大幅超越了现有最优方法。
    Abstract Blind detection of the forged regions in digital images is an effective authentication means to counter the malicious use of local image editing techniques. Existing encoder-decoder forensic networks overlook the fact that detecting complex and subtle tampered regions typically requires more feedback information. In this paper, we propose a Progressive FeedbACk-enhanced Transformer (ProFact) network to achieve coarse-to-fine image forgery localization. Specifically, the coarse localization map generated by an initial branch network is adaptively fed back to the early transformer encoder layers for enhancing the representation of positive features while suppressing interference factors. The cascaded transformer network, combined with a contextual spatial pyramid module, is designed to refine discriminative forensic features for improving the forgery localization accuracy and reliability. Furthermore, we present an effective strategy to automatically generate large-scale forged image samples close to real-world forensic scenarios, especially in realistic and coherent processing. Leveraging on such samples, a progressive and cost-effective two-stage training protocol is applied to the ProFact network. The extensive experimental results on nine public forensic datasets show that our proposed localizer greatly outperforms the state-of-the-art on the generalization ability and robustness of image forgery localization. Code will be publicly available at https://github.com/multimediaFor/ProFact.
    摘要 “对数字图像中伪造区域进行盲检测,是应对局部图像编辑技术被恶意使用的一种有效鉴别手段。现有的编码器-解码器取证网络忽视了一个事实:检测复杂而细微的篡改区域通常需要更多的反馈信息。本文提出了一种渐进反馈增强的 Transformer(ProFact)网络,以实现由粗到细的图像伪造定位。具体而言,初始分支网络生成的粗略定位图被自适应地反馈到早期的 Transformer 编码层,以增强正例特征表示,同时抑制干扰因素。级联的 Transformer 网络与上下文空间金字塔模块相结合,用于细化判别性取证特征,从而提升伪造定位的准确性与可靠性。此外,我们还提出了一种有效策略,可自动生成贴近真实取证场景的大规模伪造图像样本,尤其注重处理的真实性与连贯性。基于这些样本,我们对 ProFact 网络采用了渐进式且低成本的两阶段训练协议。在九个公开取证数据集上的大量实验结果表明,所提出的定位器在图像伪造定位的泛化能力和鲁棒性方面大幅超越了当前最优方法。代码将在 https://github.com/multimediaFor/ProFact 公开。”

DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack

  • paper_url: http://arxiv.org/abs/2311.08909
  • repo_url: None
  • paper_authors: Perry Gibson, José Cano, Elliot J. Crowley, Amos Storkey, Michael O’Boyle
  • for: 这个论文的目的是提供一个参考框架,帮助机器学习和系统实践者在实现深度学习推进运算时,更好地考虑各个层次的依赖关系。
  • methods: 这个论文使用了机器学习和系统技术,建立了深度学习加速框架(DLAS),并对DLAS进行了逐层次的干扰研究,以探索各个层次之间的依赖关系。
  • results: 这个论文的评估结果显示,DLAS的各个层次之间存在许多依赖关系,而且这些关系可以通过干扰DLAS的各个层次来变化。此外,论文还发现了一些实际上的规律,例如压缩技术的加速效益是具体设备依赖的,以及自动调整代码生成可以对最佳化器的选择产生重大影响。
    Abstract Deep Neural Networks (DNNs) are extremely computationally demanding, which presents a large barrier to their deployment on resource-constrained devices. Since such devices are where many emerging deep learning applications lie (e.g., drones, vision-based medical technology), significant bodies of work from both the machine learning and systems communities have attempted to provide optimizations to accelerate DNNs. To help unify these two perspectives, in this paper we combine machine learning and systems techniques within the Deep Learning Acceleration Stack (DLAS), and demonstrate how these layers can be tightly dependent on each other with an across-stack perturbation study. We evaluate the impact on accuracy and inference time when varying different parameters of DLAS across two datasets, seven popular DNN architectures, four DNN compression techniques, three algorithmic primitives with sparse and dense variants, untuned and auto-scheduled code generation, and four hardware platforms. Our evaluation highlights how perturbations across DLAS parameters can cause significant variation and across-stack interactions. The highest level observation from our evaluation is that the model size, accuracy, and inference time are not guaranteed to be correlated. Overall we make 13 key observations, including that speedups provided by compression techniques are very hardware dependent, and that compiler auto-tuning can significantly alter what the best algorithm to use for a given configuration is. With DLAS, we aim to provide a reference framework to aid machine learning and systems practitioners in reasoning about the context in which their respective DNN acceleration solutions exist in. With our evaluation strongly motivating the need for co-design, we believe that DLAS can be a valuable concept for exploring the next generation of co-designed accelerated deep learning solutions.
    摘要 深度神经网络(DNN)的计算需求极高,这给其在资源受限设备上的部署带来了巨大障碍。由于许多新兴深度学习应用正是运行在这类设备上(如无人机、基于视觉的医疗技术),机器学习和系统两个社区都投入了大量工作来加速 DNN。为了统一这两种视角,本文将机器学习与系统技术整合到深度学习加速栈(DLAS)中,并通过跨栈扰动研究展示了各层之间的紧密依赖关系。我们在两个数据集、七种流行的 DNN 架构、四种 DNN 压缩技术、三种算法原语(含稀疏与稠密变体)、未调优与自动调度两种代码生成方式以及四种硬件平台上,评估了改变 DLAS 各层参数对准确率和推理时间的影响。评估表明,DLAS 参数的扰动会引起显著的变化以及跨栈交互。最高层面的观察结论是:模型大小、准确率和推理时间之间并不必然相关。我们总共给出 13 条关键观察,其中包括:压缩技术带来的加速效果高度依赖于硬件;编译器自动调优会显著改变给定配置下的最佳算法选择。我们希望 DLAS 能够成为一个参考框架,帮助机器学习与系统从业者理解各自的 DNN 加速方案所处的上下文。鉴于评估结果有力地表明了协同设计的必要性,我们认为 DLAS 可以成为探索下一代协同设计加速深度学习方案的有价值概念。

Robust Brain MRI Image Classification with SIBOW-SVM

  • paper_url: http://arxiv.org/abs/2311.08908
  • repo_url: None
  • paper_authors: Liyun Zeng, Hao Helen Zhang
  • for: The paper aims to develop a novel brain tumor image classification method to improve the accuracy and efficiency of detecting and diagnosing brain tumors.
  • methods: The proposed method, called SIBOW-SVM, integrates the Bag-of-Features (BoF) model with SIFT feature extraction and weighted Support Vector Machines (wSVMs) to capture hidden image features and differentiate various tumor types. The method also estimates the probabilities of images belonging to each class, providing high-confidence classification decisions.
  • results: The SIBOW-SVM method outperforms state-of-the-art methods, including Convolutional Neural Network (CNN), on a public data set of brain tumor MRI images containing four classes: glioma, meningioma, pituitary, and normal.
    Abstract The majority of primary Central Nervous System (CNS) tumors in the brain are among the most aggressive diseases affecting humans. Early detection of brain tumor types, whether benign or malignant, glial or non-glial, is critical for cancer prevention and treatment, ultimately improving human life expectancy. Magnetic Resonance Imaging (MRI) stands as the most effective technique to detect brain tumors by generating comprehensive brain images through scans. However, human examination can be error-prone and inefficient due to the complexity, size, and location variability of brain tumors. Recently, automated classification techniques using machine learning (ML) methods, such as Convolutional Neural Network (CNN), have demonstrated significantly higher accuracy than manual screening, while maintaining low computational costs. Nonetheless, deep learning-based image classification methods, including CNN, face challenges in estimating class probabilities without proper model calibration. In this paper, we propose a novel brain tumor image classification method, called SIBOW-SVM, which integrates the Bag-of-Features (BoF) model with SIFT feature extraction and weighted Support Vector Machines (wSVMs). This new approach effectively captures hidden image features, enabling the differentiation of various tumor types and accurate label predictions. Additionally, the SIBOW-SVM is able to estimate the probabilities of images belonging to each class, thereby providing high-confidence classification decisions. We have also developed scalable and parallelable algorithms to facilitate the practical implementation of SIBOW-SVM for massive images. As a benchmark, we apply the SIBOW-SVM to a public data set of brain tumor MRI images containing four classes: glioma, meningioma, pituitary, and normal. Our results show that the new method outperforms state-of-the-art methods, including CNN.
    摘要 主要脑中央神经系统肿瘤的多数是人类最致命的疾病之一。早期检测脑肿瘤类型,无论是肿瘤或非肿瘤, glial 或非 glial,都是防范癌症和治疗的关键,最终提高人类存活时间。磁共振成像(MRI)是识别脑肿瘤的最有效的技术,通过扫描生成全面脑图像。然而,人工检查可能会出现错误和不具有效率,因为脑肿瘤的复杂性、大小和位置变化。现在,自动分类技术使用机器学习(ML)方法,如卷积神经网络(CNN),已经表明了与人工检查相比,有较高的准确率,同时保持低的计算成本。然而,深度学习基于图像分类方法,包括CNN,面临着估计类别概率无法进行正确的模型定制。在这篇论文中,我们提出了一种新的脑肿瘤图像分类方法,called SIBOW-SVM,它将袋子模型(BoF)和SIFT特征提取结合weighted Support Vector Machines(wSVMs)。这种新方法可以有效捕捉隐藏的图像特征,以便区分不同的肿瘤类型并准确地预测标签。此外,SIBOW-SVM还可以估计图像属于哪一类的概率,从而提供高确度的分类决策。我们还开发了可扩展和并行的算法,以便实现SIBOW-SVM的实用应用。作为标准,我们对一个公共的脑肿瘤MRI图像集进行了应用,该集包含四个类别: glioma、meningioma、pituitary和正常。我们的结果显示,新方法在与现状的方法相比,具有更高的准确率。
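A hedged sketch of the BoF + SIFT + weighted-SVM recipe described in the abstract, using OpenCV and scikit-learn. The codebook size, RBF kernel, and `class_weight="balanced"` are illustrative stand-ins; the paper's wSVM probability estimation is more elaborate than sklearn's built-in Platt scaling.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

def sift_descriptors(gray_images):
    sift = cv2.SIFT_create()
    per_image = []
    for img in gray_images:                       # each img: uint8 H x W array
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.zeros((0, 128), np.float32))
    return per_image

def bof_histograms(per_image_desc, codebook_size=256):
    codebook = MiniBatchKMeans(n_clusters=codebook_size, n_init=3)
    codebook.fit(np.vstack([d for d in per_image_desc if len(d)]))
    hists = []
    for desc in per_image_desc:
        words = codebook.predict(desc) if len(desc) else np.array([], dtype=int)
        h, _ = np.histogram(words, bins=np.arange(codebook_size + 1))
        hists.append(h / max(h.sum(), 1))         # L1-normalized visual-word histogram
    return np.array(hists), codebook

def train_weighted_svm(hists, labels):
    # class_weight="balanced" plays the role of the per-class weights in wSVM;
    # probability=True yields per-class probabilities for confident decisions.
    clf = SVC(kernel="rbf", class_weight="balanced", probability=True)
    return clf.fit(hists, labels)
```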

AdapterShadow: Adapting Segment Anything Model for Shadow Detection

  • paper_url: http://arxiv.org/abs/2311.08891
  • repo_url: None
  • paper_authors: Leiping Jie, Hui Zhang
  • for: 提高阴影检测的精度和效率
  • methods: 使用可调式适应器与SAM模型结合,并提出了一种新的格子采样方法来自动生成精度点提示
  • results: 在四个广泛使用的基准数据集上进行了广泛的实验,并证明了我们提出的方法的精度和效率的提高
    Abstract Segment anything model (SAM) has shown its spectacular performance in segmenting universal objects, especially when elaborate prompts are provided. However, the drawback of SAM is twofold. On the first hand, it fails to segment specific targets, e.g., shadow images or lesions in medical images. On the other hand, manually specifying prompts is extremely time-consuming. To overcome the problems, we propose AdapterShadow, which adapts SAM model for shadow detection. To adapt SAM for shadow images, trainable adapters are inserted into the frozen image encoder of SAM, since the training of the full SAM model is both time and memory consuming. Moreover, we introduce a novel grid sampling method to generate dense point prompts, which helps to automatically segment shadows without any manual interventions. Extensive experiments are conducted on four widely used benchmark datasets to demonstrate the superior performance of our proposed method. Codes will are publicly available at https://github.com/LeipingJie/AdapterShadow.
    摘要 Segment anything model (SAM) 在通用目标分割上展现了出色的表现,尤其是在提供精细提示时。然而,SAM 存在两方面的不足。一方面,它无法分割特定目标,例如阴影图像或医学图像中的病变;另一方面,手动指定提示极其耗时。为了解决这些问题,我们提出了 AdapterShadow,将 SAM 模型适配到阴影检测任务。由于训练完整的 SAM 模型既耗时又耗内存,我们在其冻结的图像编码器中插入可训练的适配器来完成适配。此外,我们还提出了一种新的网格采样方法,用于自动生成密集的点提示,从而无需任何手动干预即可分割阴影。我们在四个常用的基准数据集上进行了大量实验,验证了所提方法的优越性能。代码将在 GitHub 上公开。
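A minimal sketch of the two ingredients named above: bottleneck adapters trained inside an otherwise frozen image encoder, and a regular grid of point prompts that replaces manual clicks. The adapter width, its placement after each block, and the grid stride are assumptions for illustration; this is not SAM's actual block layout.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                          # residual bottleneck adapter
        return x + self.up(self.act(self.down(x)))

def attach_adapters(encoder_blocks, dim):
    """Freeze the encoder blocks and pair each one with a trainable adapter."""
    for p in encoder_blocks.parameters():
        p.requires_grad = False
    adapted = nn.ModuleList()
    for block in encoder_blocks:
        adapted.append(nn.Sequential(block, Adapter(dim)))  # only the Adapter trains
    return adapted

def grid_point_prompts(h, w, stride=64):
    """Dense point prompts on a regular grid, replacing manual clicks."""
    ys, xs = torch.meshgrid(torch.arange(stride // 2, h, stride),
                            torch.arange(stride // 2, w, stride), indexing="ij")
    return torch.stack([xs.flatten(), ys.flatten()], dim=-1)   # (N, 2) x,y coords
```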

One-Shot Federated Learning with Classifier-Guided Diffusion Models

  • paper_url: http://arxiv.org/abs/2311.08870
  • repo_url: None
  • paper_authors: Mingzhao Yang, Shangchao Su, Bin Li, Xiangyang Xue
  • for: This paper focuses on exploring the potential of diffusion models in one-shot federated learning (OSFL) to generate high-quality synthetic datasets that can be used to train aggregated models without relying on auxiliary datasets or training generators.
  • methods: The proposed method, called FedCADO, utilizes guidance from client classifiers to generate data that complies with clients’ distributions and subsequently trains the aggregated model on the server. The method involves targeted optimizations in two aspects: conditionally editing the randomly sampled initial noises and employing the BN statistics from the classifiers to provide detailed guidance during generation.
  • results: The proposed method effectively handles the heterogeneous client models and the problems of non-IID features or labels, and can generate synthetic datasets that closely resemble the distribution and quality of the original client dataset. The method also avoids privacy leakage risks by not training any generators or transferring any auxiliary information on clients. Experimental results on three large-scale multi-domain image datasets demonstrate that the synthetic datasets generated by FedCADO can assist in surpassing the knowledge limitations of the client samples, resulting in aggregation models that even outperform the performance ceiling of centralized training in some cases.
    Abstract One-shot federated learning (OSFL) has gained attention in recent years due to its low communication cost. However, most of the existing methods require auxiliary datasets or training generators, which hinders their practicality in real-world scenarios. In this paper, we explore the novel opportunities that diffusion models bring to OSFL and propose FedCADO, utilizing guidance from client classifiers to generate data that complies with clients' distributions and subsequently training the aggregated model on the server. Specifically, our method involves targeted optimizations in two aspects. On one hand, we conditionally edit the randomly sampled initial noises, embedding them with specified semantics and distributions, resulting in a significant improvement in both the quality and stability of generation. On the other hand, we employ the BN statistics from the classifiers to provide detailed guidance during generation. These tailored optimizations enable us to limitlessly generate datasets, which closely resemble the distribution and quality of the original client dataset. Our method effectively handles the heterogeneous client models and the problems of non-IID features or labels. In terms of privacy protection, our method avoids training any generator or transferring any auxiliary information on clients, eliminating any additional privacy leakage risks. Leveraging the extensive knowledge stored in the pre-trained diffusion model, the synthetic datasets can assist us in surpassing the knowledge limitations of the client samples, resulting in aggregation models that even outperform the performance ceiling of centralized training in some cases, which is convincingly demonstrated in the sufficient quantification and visualization experiments conducted on three large-scale multi-domain image datasets.
    摘要 一种新型的 federated learning 方法,即 One-shot federated learning(OSFL),在最近几年内受到了广泛关注,因为它的通信成本很低。然而,大多数现有的方法都需要附加的 auxillary 数据或训练生成器,这限制了它们在实际场景中的实用性。在这篇论文中,我们探索了 diffusion 模型带来的新机遇,并提出了 FedCADO 方法,通过客户端分类器的指导,在服务器上训练汇集模型。具体来说,我们的方法包括两个方面的优化。一方面,我们通过条件编辑 randomly 采样的初始噪声,使其嵌入特定的 semantics 和分布,从而获得较好的质量和稳定性。另一方面,我们利用客户端的 BN 统计来提供详细的指导,以便在生成过程中进行精细的调整。这些特定的优化使得我们能够无限生成数据,这些数据与客户端的原始数据分布和质量具有很高的相似度。我们的方法可以有效地处理不同客户端模型的 hetroogeneous 特性,以及非 Identical 的特征或标签问题。另外,我们的方法不需要在客户端上训练任何生成器或传输任何附加信息,因此不会增加隐私泄露的风险。通过 diffusion 模型存储的广泛知识,我们可以使用生成的 sintethic 数据进行超越客户端样本的知识限制,实现汇集模型的性能超越中央化训练的性能均衡,这些结果在三个大规模多域图像 dataset 上得到了证明。

Toulouse Hyperspectral Data Set: a benchmark data set to assess semi-supervised spectral representation learning and pixel-wise classification techniques

  • paper_url: http://arxiv.org/abs/2311.08863
  • repo_url: https://github.com/romain3ch216/tlse-experiments
  • paper_authors: Romain Thoreau, Laurent Risser, Véronique Achard, Béatrice Berthelot, Xavier Briottet
  • for: The paper aims to provide a new hyperspectral data set for large-scale urban area mapping, addressing the scarcity of annotated data and the limitations of existing data sets.
  • methods: The paper uses semi-supervised and self-supervised techniques, such as Masked Autoencoders, to train machine learning models on the new data set, and evaluates their performance on pixel-wise classification.
  • results: The paper achieves an overall accuracy of 82% and an F1 score of 74% on pixel-wise classification, using a conventional autoencoder combined with a Random Forest classifier. The paper also releases the Toulouse Hyperspectral Data Set and the code for reproducing the experiments.
  • for: 论文旨在提供一个面向大规模城市区域制图的全新高光谱数据集,以解决标注数据稀缺和现有数据集的局限。
  • methods: 论文使用半监督和自监督技术(如 Masked Autoencoders)在新数据集上训练机器学习模型,并评估其像素级分类性能。
  • results: 论文基于常规自编码器与 Random Forest 分类器,在像素级分类上取得了 82% 的总体准确率和 74% 的 F1 分数,并发布了 Toulouse Hyperspectral Data Set 及复现实验的代码。
    Abstract Airborne hyperspectral images can be used to map the land cover in large urban areas, thanks to their very high spatial and spectral resolutions on a wide spectral domain. While the spectral dimension of hyperspectral images is highly informative of the chemical composition of the land surface, the use of state-of-the-art machine learning algorithms to map the land cover has been dramatically limited by the availability of training data. To cope with the scarcity of annotations, semi-supervised and self-supervised techniques have lately raised a lot of interest in the community. Yet, the publicly available hyperspectral data sets commonly used to benchmark machine learning models are not totally suited to evaluate their generalization performances due to one or several of the following properties: a limited geographical coverage (which does not reflect the spectral diversity in metropolitan areas), a small number of land cover classes and a lack of appropriate standard train / test splits for semi-supervised and self-supervised learning. Therefore, we release in this paper the Toulouse Hyperspectral Data Set that stands out from other data sets in the above-mentioned respects in order to meet key issues in spectral representation learning and classification over large-scale hyperspectral images with very few labeled pixels. Besides, we discuss and experiment the self-supervised task of Masked Autoencoders and establish a baseline for pixel-wise classification based on a conventional autoencoder combined with a Random Forest classifier achieving 82% overall accuracy and 74% F1 score. The Toulouse Hyperspectral Data Set and our code are publicly available at https://www.toulouse-hyperspectral-data-set.com and https://www.github.com/Romain3Ch216/tlse-experiments, respectively.
    摘要 机载高光谱影像凭借其在宽谱段上极高的空间与光谱分辨率,可用于绘制大范围城市区域的地表覆盖。虽然高光谱影像的光谱维度能够充分反映地表的物质组成,但由于训练数据的匮乏,使用最先进机器学习算法进行地表覆盖制图一直受到极大限制。为应对标注稀缺问题,半监督与自监督技术近来引起了社区的广泛兴趣。然而,目前常用于评测机器学习模型的公开高光谱数据集,往往因以下一项或多项特性而难以评估模型的泛化性能:地理覆盖范围有限(无法反映大都市区域的光谱多样性)、地物类别数量偏少,以及缺乏适用于半监督和自监督学习的标准训练/测试划分。因此,本文发布了 Toulouse Hyperspectral Data Set,该数据集在上述各方面均有别于现有数据集,旨在应对在极少标注像素条件下进行大规模高光谱影像光谱表示学习与分类的关键问题。此外,我们讨论并实验了 Masked Autoencoders 这一自监督任务,并基于常规自编码器与 Random Forest 分类器建立了像素级分类基线,取得 82% 的总体准确率和 74% 的 F1 分数。Toulouse Hyperspectral Data Set 与代码分别公开于 https://www.toulouse-hyperspectral-data-set.com 和 https://www.github.com/Romain3Ch216/tlse-experiments 。
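A compact sketch of the reported baseline: a plain spectral autoencoder trained on all pixels by reconstruction, followed by a Random Forest fitted on the latent codes of the few labeled pixels. Layer sizes, epochs, and the optimizer are assumptions rather than the benchmark's exact configuration.

```python
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

class SpectralAE(nn.Module):
    def __init__(self, n_bands, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bands, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, n_bands))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def fit_baseline(spectra, labels, labeled_idx, epochs=50):
    """spectra: (N, n_bands) float tensor; labels only exist for labeled_idx."""
    ae = SpectralAE(spectra.shape[1])
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(epochs):                        # self-supervised reconstruction
        recon, _ = ae(spectra)
        loss = nn.functional.mse_loss(recon, spectra)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        _, codes = ae(spectra)
    rf = RandomForestClassifier(n_estimators=200)  # supervised head on the few labels
    rf.fit(codes[labeled_idx].numpy(), labels)
    return ae, rf
```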

Data Augmentations in Deep Weight Spaces

  • paper_url: http://arxiv.org/abs/2311.08851
  • repo_url: None
  • paper_authors: Aviv Shamsian, David W. Zhang, Aviv Navon, Yan Zhang, Miltiadis Kofinas, Idan Achituve, Riccardo Valperga, Gertjan J. Burghouts, Efstratios Gavves, Cees G. M. Snoek, Ethan Fetaya, Gal Chechik, Haggai Maron
  • for: 本研究旨在解决深度神经网络学习在权重空间中的难题,即生成大量数据来避免过拟合。
  • methods: 本文提出了一种基于混合方法的数据增强技术,以生成新的数据示例,不需要额外训练输入权重空间元素。
  • results: 对于现有的benchmark和新生成的benchmark,我们评估了不同数据增强技术的性能,并发现了一种基于混合方法的新数据增强方案可以提高学习效果。
    Abstract Learning in weight spaces, where neural networks process the weights of other deep neural networks, has emerged as a promising research direction with applications in various fields, from analyzing and editing neural fields and implicit neural representations, to network pruning and quantization. Recent works designed architectures for effective learning in that space, which takes into account its unique, permutation-equivariant, structure. Unfortunately, so far these architectures suffer from severe overfitting and were shown to benefit from large datasets. This poses a significant challenge because generating data for this learning setup is laborious and time-consuming since each data sample is a full set of network weights that has to be trained. In this paper, we address this difficulty by investigating data augmentations for weight spaces, a set of techniques that enable generating new data examples on the fly without having to train additional input weight space elements. We first review several recently proposed data augmentation schemes %that were proposed recently and divide them into categories. We then introduce a novel augmentation scheme based on the Mixup method. We evaluate the performance of these techniques on existing benchmarks as well as new benchmarks we generate, which can be valuable for future studies.
    摘要 学习Weight空间中的神经网络,其中神经网络处理另一个深度神经网络的权重,已经出现为一个有前途的研究方向,具有应用于不同领域的可能性,从分析和编辑神经场和隐藏神经表示之间的应用,到网络剪辑和量化。最近的工作设计了适用于这个空间的建筑,考虑其独特的协变结构。然而,目前这些建筑受到严重的过拟合问题困扰,需要大量的数据来适应。在这篇论文中,我们解决这个挑战,通过调查Weight空间中的数据增强技术,以生成新的数据示例,而无需额外训练输入权重空间元素。我们首先回顾最近提出的数据增强方案,并将其分为类别。然后,我们介绍了一种基于 Mixup 方法的新的增强方案。我们对这些技术的性能进行评估,并在现有的标准准则上进行评估,以及新生成的标准准则,这些标准准则可能对未来的研究有所价值。
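A short sketch of what a Mixup-style augmentation looks like when the data points are themselves network weights: two input state dicts are flattened, convexly combined, and the mixed sample inherits a mixed target. The Beta parameter and the naive flattening are illustrative; unlike this sketch, the paper targets permutation-equivariant architectures, so more careful weight-space handling may be warranted.

```python
import numpy as np
import torch

def flatten_weights(state_dict):
    return torch.cat([p.float().reshape(-1) for p in state_dict.values()])

def weight_space_mixup(sd_a, sd_b, y_a, y_b, alpha=0.2):
    """Mix two weight-space samples (state dicts) and their targets."""
    lam = float(np.random.beta(alpha, alpha))
    w_mix = lam * flatten_weights(sd_a) + (1 - lam) * flatten_weights(sd_b)
    y_mix = lam * y_a + (1 - lam) * y_b            # works for scalar or soft targets
    # Note: naive interpolation ignores the permutation symmetry of neurons,
    # which dedicated weight-space methods account for.
    return w_mix, y_mix
```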

Controlling the Output of a Generative Model by Latent Feature Vector Shifting

  • paper_url: http://arxiv.org/abs/2311.08850
  • repo_url: None
  • paper_authors: Róbert Belanec, Peter Lacko, Kristína Malinovská
  • for: 这个论文的目的是控制StyleGAN3生成器的输出图像修改。
  • methods: 我们使用了一个预训练的StyleGAN3生成器和一个ResNet34对应神经网络,将生成的图像分类为 celebA 数据集中的 Binary facial 特征。我们还使用了一个叫做 latent feature shifter 的神经网络,将 StyleGAN3 的 latent vector shift 到指定的特征方向。
  • results: 我们的 latent feature shifter 方法比基eline方法多出了更多的生成图像具有想要的特征。我们通过评估结果发现,我们的 latent feature shifter 方法成功地控制了 StyleGAN3 生成器的输出图像修改。
    Abstract State-of-the-art generative models (e.g. StyleGAN3 \cite{karras2021alias}) often generate photorealistic images based on vectors sampled from their latent space. However, the ability to control the output is limited. Here we present our novel method for latent vector shifting for controlled output image modification utilizing semantic features of the generated images. In our approach we use a pre-trained model of StyleGAN3 that generates images of realistic human faces in relatively high resolution. We complement the generative model with a convolutional neural network classifier, namely ResNet34, trained to classify the generated images with binary facial features from the CelebA dataset. Our latent feature shifter is a neural network model with a task to shift the latent vectors of a generative model into a specified feature direction. We have trained latent feature shifter for multiple facial features, and outperformed our baseline method in the number of generated images with the desired feature. To train our latent feature shifter neural network, we have designed a dataset of pairs of latent vectors with and without a certain feature. Based on the evaluation, we conclude that our latent feature shifter approach was successful in the controlled generation of the StyleGAN3 generator.
    摘要 现代生成模型(例如StyleGAN3 \cite{karras2021alias)frequently生成高分辨率的图像,基于生成器的幂space中的向量采样。然而,控制输出的能力受限。在这里,我们介绍了我们的新方法,利用生成器图像的Semantic特征来实现控制输出图像修改。我们使用已经训练过的StyleGAN3生成器,可以生成高分辨率的真实人脸图像。我们补充了生成器的核心网络,使其能够通过CelebA数据集中的二分类网络(ResNet34)来分类生成的图像。我们的幂向量推移器是一个具有将幂向量推移到指定特征方向的任务的神经网络模型。我们在多个面部特征上训练了幂向量推移器,并超过了我们的基eline方法的数量。为了训练我们的幂向量推移器神经网络,我们设计了一个包含具有和无某些特征的latent向量对的数据集。根据评估结果,我们认为我们的幂向量推移器方法成功地控制了StyleGAN3生成器。
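A hedged sketch of the latent-feature-shifter idea: a small MLP predicts an additive shift of a latent code so that a frozen attribute classifier scores the generated image higher for the target facial feature, while a penalty keeps the shifted code close to the original. The shifter architecture, the loss weights, and the assumption that `classifier` returns a probability are illustrative, not the paper's trained setup.

```python
import torch
import torch.nn as nn

class LatentShifter(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, latent_dim))

    def forward(self, w):
        return w + self.net(w)                     # predict an additive shift

def shifter_loss(shifter, classifier, generator, w, target=1.0, keep=0.1):
    """classifier(generator(w)) is assumed to return the attribute probability."""
    w_shift = shifter(w)
    attr = classifier(generator(w_shift))
    feature_loss = nn.functional.binary_cross_entropy(
        attr, torch.full_like(attr, target))       # push toward the desired feature
    identity_loss = keep * (w_shift - w).pow(2).mean()  # stay near the original code
    return feature_loss + identity_loss
```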

Personalized Video Relighting Using Casual Light Stage

  • paper_url: http://arxiv.org/abs/2311.08843
  • repo_url: None
  • paper_authors: Jun Myeong Choi, Max Christman, Roni Sengupta
  • for: 这个论文的目的是提出一种个性化视频重光算法,以实现在任何姿势、表情和照明条件下,在实时下生成高质量的重光视频。
  • methods: 该算法使用了一种新的神经网络重光架构,可以有效地分离出照明源的光照特征、物体的 geometry 和反射特征,然后将其与目标照明相加以生成重光图像。
  • results: 根据对 Light Stage at Your Desk (LSYD) 数据和 Light Stage captured One Light At a Time (OLAT) 数据的质量评估,这种重光算法能够提高肖像图像重光质量和时间稳定性,比之前的方法更高效。
    Abstract In this paper, we develop a personalized video relighting algorithm that produces high-quality and temporally consistent relit video under any pose, expression, and lighting conditions in real-time. Existing relighting algorithms typically rely either on publicly available synthetic data, which yields poor relighting results, or instead on Light Stage data which is inaccessible and is not publicly available. We show that by casually capturing video of a user watching YouTube videos on a monitor we can train a personalized algorithm capable of producing high-quality relighting under any condition. Our key contribution is a novel neural relighting architecture that effectively separates the intrinsic appearance features, geometry and reflectance, from the source lighting and then combines it with the target lighting to generate a relit image. This neural architecture enables smoothing of intrinsic appearance features leading to temporally stable video relighting. Both qualitative and quantitative evaluations show that our relighting architecture improves portrait image relighting quality and temporal consistency over state-of-the-art approaches on both casually captured Light Stage at Your Desk (LSYD) data and Light Stage captured One Light At a Time (OLAT) datasets.
    摘要 在这篇论文中,我们开发了一种个性化视频重新照明算法,该算法在实时下生成高质量、时间上一致的重新照明视频,无论用户的姿势、表情或照明条件如何。现有的重新照明算法通常依赖于公共可用的生成器数据,这些数据的重新照明结果很差,或者使用Light Stage数据,但这些数据不公开可用。我们显示,通过通过捕捉用户在 monitor 上观看 YouTube 视频来训练个性化算法,我们可以生成高质量的重新照明视频。我们的关键贡献是一种新的神经网络重新照明架构,该架构能够有效地分离出照明源的自然特征、几何和反射特征,然后与目标照明相结合,生成一个重新照明的图像。这种神经网络架构使得图像的内在特征平滑,从而实现了时间上一致的视频重新照明。我们的重新照明架构在使用LSYD 和 OLAT 数据集上的质量和时间一致性方面与当前的方法进行比较,并取得了显著的改善。

Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding

  • paper_url: http://arxiv.org/abs/2311.08835
  • repo_url: https://github.com/wjun0830/cgdetr
  • paper_authors: WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo
  • for: 这 paper 的目的是提供一种基于注意力机制的视频时间固定方法,以便在视频和文本查询之间强化交互,并且能够根据文本查询提取相关的视频clip。
  • methods: 这 paper 使用了一种名为 Correlation-Guided Detection Transformer~(CG-DETR) 的方法,它包括一个适应式跨模态注意力层、一个 dummy tokens 的使用、以及一个高级概念共同embedding空间学习。
  • results: 实验结果表明,CG-DETR 在时刻检索和高光检测等多个基准上均取得了最先进(state-of-the-art)的结果。代码可在 https://github.com/wjun0830/CGDETR 获取。
    Abstract Recent endeavors in video temporal grounding enforce strong cross-modal interactions through attention mechanisms to overcome the modality gap between video and text query. However, previous works treat all video clips equally regardless of their semantic relevance with the text query in attention modules. In this paper, our goal is to provide clues for query-associated video clips within the crossmodal encoding process. With our Correlation-Guided Detection Transformer~(CG-DETR), we explore the appropriate clip-wise degree of cross-modal interactions and how to exploit such degrees for prediction. First, we design an adaptive cross-attention layer with dummy tokens. Dummy tokens conditioned by text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet, not all word tokens equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., moment and sentence level, and inferring the clip-word correlation. Lastly, we use a moment-adaptive saliency detector to exploit each video clip's degrees of text engagement. We validate the superiority of CG-DETR with the state-of-the-art results on various benchmarks for both moment retrieval and highlight detection. Codes are available at https://github.com/wjun0830/CGDETR.
    摘要 近期的视频时间挂钩工作强制实施了跨Modal的交互,通过注意机制来超越视频和文本查询之间的Modal gap。然而,前一些工作都是在注意模块中对所有视频clip进行等效的处理,不考虑视频clip与文本查询的Semantic relevance。在这篇论文中,我们的目标是提供与文本查询相关的视频clip在跨Modal编码过程中的线索。我们使用Correlation-Guided Detection Transformer~(CG-DETR)来探索适当的clipwise度跨Modal交互,以及如何利用这些度量进行预测。首先,我们设计了适应式交叉注意层,其中文本查询条件下的干扰符token会占据一部分注意量,以避免不相关的视频clip被文本查询所代表。然而,不是所有的单词token都会相同地继承文本查询的视频clip相关性。因此,我们进一步指导交叉注意地图,通过学习高级概念的共同embedding空间,以及clip-word关系的推理。最后,我们使用时刻适应性的焦点检测器来利用每个视频clip的文本参与度。我们 validate CG-DETR的优越性通过在多种benchmark上实现state-of-the-art的结果,包括时刻检索和突出部分检测。代码可以在https://github.com/wjun0830/CGDETR中获取。
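A small sketch of the dummy-token idea in the adaptive cross-attention layer: a few learnable tokens are appended to the text keys and values so that video clips unrelated to the query can place their attention mass on the dummies instead of being forced onto real words. Dimensions, head count, and the number of dummy tokens are illustrative, not CG-DETR's exact settings.

```python
import torch
import torch.nn as nn

class DummyTokenCrossAttention(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_dummy=3):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(1, n_dummy, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, clip_feats, word_feats):
        # clip_feats: (B, n_clips, dim), word_feats: (B, n_words, dim)
        b = clip_feats.shape[0]
        kv = torch.cat([word_feats, self.dummy.expand(b, -1, -1)], dim=1)
        out, weights = self.attn(query=clip_feats, key=kv, value=kv)
        return out, weights                        # weights span words + dummy tokens
```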

Target-oriented Domain Adaptation for Infrared Image Super-Resolution

  • paper_url: http://arxiv.org/abs/2311.08816
  • repo_url: https://github.com/yongsongh/dasrgan
  • paper_authors: Yongsong Huang, Tomo Miyazaki, Xiaofeng Liu, Yafei Dong, Shinichiro Omachi
  • for: 提高红外超分辨率图像质量
  • methods: 使用目标域适应SRGAN(DASRGAN),包括Texture-Oriented Adaptation(TOA)和Noise-Oriented Adaptation(NOA)两部分
  • results: 实验表明,DASRGAN在多个标准测试 benchmark 和不同的� upsampling 因子下表现出优于其他方法,并设置了新的州际表现标准。
    Abstract Recent efforts have explored leveraging visible light images to enrich texture details in infrared (IR) super-resolution. However, this direct adaptation approach often becomes a double-edged sword, as it improves texture at the cost of introducing noise and blurring artifacts. To address these challenges, we propose the Target-oriented Domain Adaptation SRGAN (DASRGAN), an innovative framework specifically engineered for robust IR super-resolution model adaptation. DASRGAN operates on the synergy of two key components: 1) Texture-Oriented Adaptation (TOA) to refine texture details meticulously, and 2) Noise-Oriented Adaptation (NOA), dedicated to minimizing noise transfer. Specifically, TOA uniquely integrates a specialized discriminator, incorporating a prior extraction branch, and employs a Sobel-guided adversarial loss to align texture distributions effectively. Concurrently, NOA utilizes a noise adversarial loss to distinctly separate the generative and Gaussian noise pattern distributions during adversarial training. Our extensive experiments confirm DASRGAN's superiority. Comparative analyses against leading methods across multiple benchmarks and upsampling factors reveal that DASRGAN sets new state-of-the-art performance standards. Code are available at \url{https://github.com/yongsongH/DASRGAN}.
    摘要 近期研究探索了利用可见光图像来丰富红外(IR)图像超分辨率中的纹理细节。然而,这种直接适配的做法往往是一把双刃剑:它在提升纹理的同时会引入噪声和模糊伪影。为了解决这些挑战,我们提出了 Target-oriented Domain Adaptation SRGAN(DASRGAN),一种专为稳健的红外超分辨率模型适配而设计的创新框架。DASRGAN 依托两个关键组件协同工作:1)Texture-Oriented Adaptation(TOA),用于精细打磨纹理细节;2)Noise-Oriented Adaptation(NOA),专注于最小化噪声迁移。具体而言,TOA 独特地集成了一个带有先验提取分支的专用判别器,并采用 Sobel 引导的对抗损失来有效对齐纹理分布;NOA 则利用噪声对抗损失,在对抗训练中明确区分生成噪声与高斯噪声的模式分布。大量实验验证了 DASRGAN 的优越性:在多个基准和不同上采样因子下与领先方法的对比分析表明,DASRGAN 确立了新的最先进性能标准。代码可在 https://github.com/yongsongH/DASRGAN 获取。
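A minimal illustration of the Sobel-guided texture term mentioned for TOA: edge-magnitude maps of the super-resolved IR image and a texture reference are compared so that texture statistics align. In DASRGAN this guidance enters an adversarial loss through a specialized discriminator; the plain L1 on edge maps below only isolates the Sobel component.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):                               # img: (B, 1, H, W) grayscale
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx.to(img), padding=1)
    gy = F.conv2d(img, ky.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)     # gradient magnitude

def sobel_texture_loss(sr_ir, reference):
    return F.l1_loss(sobel_edges(sr_ir), sobel_edges(reference))
```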

Correlation-aware active learning for surgery video segmentation

  • paper_url: http://arxiv.org/abs/2311.08811
  • repo_url: None
  • paper_authors: Fei Wu, Pablo Marquez-Neila, Mingyi Zheng, Hedyeh Rafii-Tari, Raphael Sznitman
  • for: 这个研究的目的是提出一个新的活动学习策略(COALSamp),用于降低医疗影像资料的标注成本。
  • methods: 方法包括将影像视网膜下降到一个特别设计的对应学习 latent space,然后从本地块群中选择一定数量的表征性影像。
  • results: 这个方法在两个手术影像资料集上进行了评估,结果显示COALSamp 可以对医疗影像资料进行有效的分类。此外,这个方法还在三个真实世界的影像资料集上进行了评估,结果也很显著。
    Abstract Semantic segmentation is a complex task that relies heavily on large amounts of annotated image data. However, annotating such data can be time-consuming and resource-intensive, especially in the medical domain. Active Learning (AL) is a popular approach that can help to reduce this burden by iteratively selecting images for annotation to improve the model performance. In the case of video data, it is important to consider the model uncertainty and the temporal nature of the sequences when selecting images for annotation. This work proposes a novel AL strategy for surgery video segmentation, \COALSamp{}, COrrelation-aWare Active Learning. Our approach involves projecting images into a latent space that has been fine-tuned using contrastive learning and then selecting a fixed number of representative images from local clusters of video frames. We demonstrate the effectiveness of this approach on two video datasets of surgical instruments and three real-world video datasets. The datasets and code will be made publicly available upon receiving necessary approvals.
    摘要 Semantic segmentation是一项复杂的任务,它依赖于大量已经标注的图像数据。然而,对于医疗领域来说,标注这些数据往往耗时且耗费资源。主动学习(AL)是一种受欢迎的方法,它可以逐步选择图像进行标注,以提高模型性能。在视频数据中,需要考虑模型的不确定性和时间序列的特点,以便更好地选择需要标注的图像。这项工作提出了一种新的AL策略,称为\COALSamp{},即COrrelation-aWare Active Learning。我们的方法是将图像投影到一个经过对比学习精调的隐空间中,然后从视频帧的局部聚类中选择固定数量的代表性图像进行标注。我们在两个手术器械视频数据集和三个真实世界视频数据集上验证了这种方法的有效性。数据和代码将在获得必要批准后公开发布。
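A compact sketch of the selection step described above: frames are embedded into a latent space (contrastively fine-tuned in the paper), grouped into local clusters, and the frame closest to each cluster centre is sent for annotation. The embedding source and the use of k-means with `budget` clusters are simplifying assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_frames(frame_embeddings, budget):
    """frame_embeddings: (n_frames, d) array; returns `budget` frame indices."""
    km = KMeans(n_clusters=budget, n_init=10).fit(frame_embeddings)
    picks = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(frame_embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])    # most central frame of the cluster
    return sorted(picks)
```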

EyeLS: Shadow-Guided Instrument Landing System for Intraocular Target Approaching in Robotic Eye Surgery

  • paper_url: http://arxiv.org/abs/2311.08799
  • repo_url: None
  • paper_authors: Junjie Yang, Zhihao Zhao, Siyuan Shen, Daniel Zapp, Mathias Maier, Kai Huang, Nassir Navab, M. Ali Nasseri
  • for: This paper aims to improve the accuracy of robotic ophthalmic surgery by using shadow positions to estimate the depth position of the instrument tip and optimize its insertion trajectory.
  • methods: The proposed method uses shadows to estimate the relative depth position of the instrument tip and the target, and then optimizes the insertion trajectory to approach the target within the scanning area of the iOCT.
  • results: The method was tested on a retina model and achieved an average depth error of 0.0127 mm for floating targets and 0.3473 mm for retinal targets in the surgical simulator, without damaging the retina.
    Abstract Robotic ophthalmic surgery is an emerging technology to facilitate high-precision interventions such as retina penetration in subretinal injection and removal of floating tissues in retinal detachment depending on the input imaging modalities such as microscopy and intraoperative OCT (iOCT). Although iOCT is explored to locate the needle tip within its range-limited ROI, it is still difficult to coordinate iOCT's motion with the needle, especially at the initial target-approaching stage. Meanwhile, due to 2D perspective projection and thus the loss of depth information, current image-based methods cannot effectively estimate the needle tip's trajectory towards both retinal and floating targets. To address this limitation, we propose to use the shadow positions of the target and the instrument tip to estimate their relative depth position and accordingly optimize the instrument tip's insertion trajectory until the tip approaches targets within iOCT's scanning area. Our method succeeds target approaching on a retina model, and achieves an average depth error of 0.0127 mm and 0.3473 mm for floating and retinal targets respectively in the surgical simulator without damaging the retina.
    摘要 关于:机器人眼科手术技术的发展机器人眼科手术是一种emerging technology,用于实现高精度干预,如retina penetration和floatings tissues removing,这些干预都取决于输入的干预modalities,如微scopy和intraoperative OCT(iOCT)。虽然iOCT被探索以定位针的位置,但是尚未能够协调针与iOCT的运动,特别是在目标方向的初始阶段。此外,由于2D的 perspective projection,当前的图像基本方法无法有效地估算针的轨迹,特别是在向retinal和浮动目标的方向上。为了解决这个限制,我们提议使用target和 instrumente tip的阴影位置来估算它们的相对深度位置,并根据此来优化针的插入轨迹,直到针接近target在iOCT的扫描范围内。我们的方法在retina模型上成功地进行目标接近,并在手术模拟器中达到了0.0127mm和0.3473mm的平均深度误差,对于浮动和retinal目标。

HFORD: High-Fidelity and Occlusion-Robust De-identification for Face Privacy Protection

  • paper_url: http://arxiv.org/abs/2311.08786
  • repo_url: None
  • paper_authors: Dongxin Chen, Mingrui Zhu, Nannan Wang, Xinbo Gao
  • for: 面部隐私保护issue受到了智能设备的普及和计算机视觉技术的发展的关注。本文提出了一种高效和防护 occlusion 的面部隐私化方法(HFORD),以解决这些问题。
  • methods: 本方法使用了一种叫做 Identity Disentanglement Module(IDM)的模块,用于分离 latent codes 中的个体特征和特征特征。此外,还提出了一种叫做 Attribute Retention Module(ARM)的模块,用于保留不相关的特征和面部遮挡。
  • results: 对比其他面部隐私化方法,本方法的结果质量更高,细节更加详细,并且更强地适应 occlusion 情况。
    Abstract With the popularity of smart devices and the development of computer vision technology, concerns about face privacy protection are growing. The face de-identification technique is a practical way to solve the identity protection problem. The existing facial de-identification methods have revealed several problems, including the impact on the realism of anonymized results when faced with occlusions and the inability to maintain identity-irrelevant details in anonymized results. We present a High-Fidelity and Occlusion-Robust De-identification (HFORD) method to deal with these issues. This approach can disentangle identities and attributes while preserving image-specific details such as background, facial features (e.g., wrinkles), and lighting, even in occluded scenes. To disentangle the latent codes in the GAN inversion space, we introduce an Identity Disentanglement Module (IDM). This module selects the latent codes that are closely related to the identity. It further separates the latent codes into identity-related codes and attribute-related codes, enabling the network to preserve attributes while only modifying the identity. To ensure the preservation of image details and enhance the network's robustness to occlusions, we propose an Attribute Retention Module (ARM). This module adaptively preserves identity-irrelevant details and facial occlusions and blends them into the generated results in a modulated manner. Extensive experiments show that our method has higher quality, better detail fidelity, and stronger occlusion robustness than other face de-identification methods.
    摘要 随着智能设备的普及和计算机视觉技术的发展,人脸隐私保护的问题日益突出。面部隐私化技术是一种实际的解决方案。现有的面部隐私化方法存在一些问题,如受到遮挡物的影响下的匿名结果的真实性下降,以及维护不同于人脸特征的匿名结果。我们提出了一种高度准确和遮挡物鲁棒的匿名化方法(HFORD),以解决这些问题。这种方法可以分离人脸特征和属性,并保留图像特有的背景、表情特征(如皱纹)和照明等信息,即使在遮挡场景下也能够保持高质量。为了分离GAN拟合空间中的秘密码,我们引入了一种身份分解模块(IDM)。这个模块选择与身份有关的秘密码,并将其分解成身份相关的秘密码和属性相关的秘密码,使网络能够保留属性,只对人脸进行修改。为确保图像细节的保留和网络的遮挡物鲁棒性,我们提议一种Attribute Retention Module(ARM)。这个模块可以动态保留不相关于身份的细节和脸部遮挡物,并将其混合到生成结果中,以实现更高质量和更强的鲁棒性。经过广泛的实验,我们发现我们的方法在质量、细节准确性和遮挡物鲁棒性等方面都有更高的表现。

Language Semantic Graph Guided Data-Efficient Learning

  • paper_url: http://arxiv.org/abs/2311.08782
  • repo_url: None
  • paper_authors: Wenxuan Ma, Shuang Li, Lincan Cai, Jingxuan Kang
  • for: 这个研究的目的是提高机器学习模型对有限数据的学习效能,并在无需人工标注的情况下实现更好的模型表现。
  • methods: 这个研究使用了 Semi-Supervised Learning (SSL)、Transfer Learning (TL) 和 Data Augmentation (DA) 等方法来实现数据优化。另外,这个研究还使用了一个名为 Language Semantic Graph (LSG) 的新方法,它是根据标签中的自然语言描述建立的一个图形。
  • results: 这个研究在图像、影片和音频等模式下运用 LSG 方法,在 SSL 和 TL 情况下获得了显著改善的表现,并且比其他数据优化方法更快速。
    Abstract Developing generalizable models that can effectively learn from limited data and with minimal reliance on human supervision is a significant objective within the machine learning community, particularly in the era of deep neural networks. Therefore, to achieve data-efficient learning, researchers typically explore approaches that can leverage more related or unlabeled data without necessitating additional manual labeling efforts, such as Semi-Supervised Learning (SSL), Transfer Learning (TL), and Data Augmentation (DA). SSL leverages unlabeled data in the training process, while TL enables the transfer of expertise from related data distributions. DA broadens the dataset by synthesizing new data from existing examples. However, the significance of additional knowledge contained within labels has been largely overlooked in research. In this paper, we propose a novel perspective on data efficiency that involves exploiting the semantic information contained in the labels of the available data. Specifically, we introduce a Language Semantic Graph (LSG) which is constructed from labels manifest as natural language descriptions. Upon this graph, an auxiliary graph neural network is trained to extract high-level semantic relations and then used to guide the training of the primary model, enabling more adequate utilization of label knowledge. Across image, video, and audio modalities, we utilize the LSG method in both TL and SSL scenarios and illustrate its versatility in significantly enhancing performance compared to other data-efficient learning approaches. Additionally, our in-depth analysis shows that the LSG method also expedites the training process.
    摘要 开发能够从有限数据中学习并且仅对人工标注有最少依赖的机器学习模型,是机器学习社区的一个重要目标,特别是在深度神经网络时代。因此,为实现数据高效学习,研究者通常探索能够在不增加人工标注成本的前提下利用更多相关数据或无标注数据的方法,例如半监督学习(SSL)、迁移学习(TL)和数据增强(DA)。SSL 利用训练过程中的无标注数据,TL 允许将相关数据分布中的专长转移到新的数据上,DA 则通过从现有样本合成新数据来扩充数据集。但是,标签中所蕴含的额外知识在以往研究中几乎被忽视。在这篇论文中,我们提出了一个关于数据效率的新视角,即利用可用数据标签中蕴含的语义信息。具体而言,我们基于以自然语言描述呈现的标签构建了一个语言语义图(LSG),并在该图上训练一个辅助图神经网络来提取高层语义关系,再用其指导主模型的训练,从而更充分地利用标签知识。在图像、视频和音频等模态上,我们在 TL 与 SSL 两种场景中应用了 LSG 方法,并展示其相较其他数据高效学习方法能显著提升性能。此外,深入分析表明 LSG 方法还能加速训练过程。

Two-stage Joint Transductive and Inductive learning for Nuclei Segmentation

  • paper_url: http://arxiv.org/abs/2311.08774
  • repo_url: None
  • paper_authors: Hesham Ali, Idriss Tondji, Mennatullah Siam
  • for: 针对医疗影像中蛋白质分割任务进行研究,以提高肿瘤诊断和治疗的效率和准确性。
  • methods: 提出了一种新的混合学习方法,结合了泛化学习和推导学习的优点,以便更好地利用可用的标注和未标注数据。
  • results: 在MoNuSeg测试集上证明了该方法的效果和潜在应用前景,并提出了一种新的两stage混合推理方案。
    Abstract AI-assisted nuclei segmentation in histopathological images is a crucial task in the diagnosis and treatment of cancer diseases. It decreases the time required to manually screen microscopic tissue images and can resolve the conflict between pathologists during diagnosis. Deep Learning has proven useful in such a task. However, lack of labeled data is a significant barrier for deep learning-based approaches. In this study, we propose a novel approach to nuclei segmentation that leverages the available labelled and unlabelled data. The proposed method combines the strengths of both transductive and inductive learning, which have been previously attempted separately, into a single framework. Inductive learning aims at approximating the general function and generalizing to unseen test data, while transductive learning has the potential of leveraging the unlabelled test data to improve the classification. To the best of our knowledge, this is the first study to propose such a hybrid approach for medical image segmentation. Moreover, we propose a novel two-stage transductive inference scheme. We evaluate our approach on MoNuSeg benchmark to demonstrate the efficacy and potential of our method.
    摘要 AI助成 Histopathological 图像中的核体分割是诊断和治疗癌症疾病的关键任务。它可以减少手动检查微scopic 组织图像所需的时间,并能够解决 pathologists 在诊断中存在的冲突。深度学习 已经在这种任务中证明了其有用性。然而,缺乏标注数据是深度学习基于方法的主要障碍。在这项研究中,我们提出了一种新的核体分割方法,利用可用的标注和无标注数据。我们的方法结合了泛化学习和抽象学习的优点,这两种学习方法在过去已经分别被应用。泛化学习目标是将通用函数approximated,并在未看到的测试数据上generalize;而抽象学习具有利用无标注测试数据来改进分类的潜在优势。根据我们所知,这是第一项提出了这种混合方法的医学图像分割研究。此外,我们还提出了一种新的两stage 混合推理方案。我们在 MoNuSeg benchmark 上评估了我们的方法,以demonstrate 我们的方法的效果和潜在。

FastBlend: a Powerful Model-Free Toolkit Making Video Stylization Easier

  • paper_url: http://arxiv.org/abs/2311.09265
  • repo_url: https://github.com/artiprocher/sd-webui-fastblend
  • paper_authors: Zhongjie Duan, Chengyu Wang, Cen Chen, Weining Qian, Jun Huang, Mingyi Jin
  • for: Addresses the consistency problem in video processing for tasks such as style transfer and image editing.
  • methods: Uses a patch matching algorithm with two inference modes: blending and interpolation.
  • results: Outperforms existing methods for video deflickering and video synthesis in the blending mode, and surpasses video interpolation and model-based video processing approaches in the interpolation mode.
    Abstract With the emergence of diffusion models and rapid development in image processing, it has become effortless to generate fancy images in tasks such as style transfer and image editing. However, these impressive image processing approaches face consistency issues in video processing. In this paper, we propose a powerful model-free toolkit called FastBlend to address the consistency problem for video processing. Based on a patch matching algorithm, we design two inference modes, including blending and interpolation. In the blending mode, FastBlend eliminates video flicker by blending the frames within a sliding window. Moreover, we optimize both computational efficiency and video quality according to different application scenarios. In the interpolation mode, given one or more keyframes rendered by diffusion models, FastBlend can render the whole video. Since FastBlend does not modify the generation process of diffusion models, it exhibits excellent compatibility. Extensive experiments have demonstrated the effectiveness of FastBlend. In the blending mode, FastBlend outperforms existing methods for video deflickering and video synthesis. In the interpolation mode, FastBlend surpasses video interpolation and model-based video processing approaches. The source codes have been released on GitHub.
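The windowed-blending idea behind the blending mode can be illustrated with a toy sketch that averages each frame with its temporal neighbours; FastBlend additionally aligns neighbouring frames with patch matching before blending, which this sketch omits. The window radius and distance-based weights are assumptions.

```python
# Minimal sketch of deflickering by blending frames within a sliding window.
# No patch-matching alignment is performed here, so this only illustrates the
# windowed-blending idea, not FastBlend itself.
import numpy as np

def blend_window(frames: np.ndarray, radius: int = 2) -> np.ndarray:
    """frames: (T, H, W, 3) float array in [0, 1]; returns blended frames."""
    T = frames.shape[0]
    out = np.empty_like(frames)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        window = frames[lo:hi]
        # Weight neighbours by temporal distance (closer frames count more).
        w = 1.0 / (1.0 + np.abs(np.arange(lo, hi) - t))
        out[t] = np.tensordot(w / w.sum(), window, axes=(0, 0))
    return out

video = np.random.rand(30, 64, 64, 3)   # stand-in for rendered frames
smoothed = blend_window(video, radius=2)
```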

4K-Resolution Photo Exposure Correction at 125 FPS with ~8K Parameters

  • paper_url: http://arxiv.org/abs/2311.08759
  • repo_url: https://github.com/zhou-yijie/msltnet
  • paper_authors: Yijie Zhou, Chao Li, Jin Liang, Tianyi Xu, Xin Liu, Jun Xu
  • for: Proposes an extremely light-weight multi-layer perception architecture for exposure correction of high-resolution photographs.
  • methods: The Multi-Scale Linear Transformation (MSLT) network decomposes the image into high- and low-frequency layers with a Laplacian pyramid, then corrects each layer with pixel-adaptive linear transformations (a pyramid sketch follows the abstract).
  • results: Experiments on two benchmark datasets show the proposed MSLT networks to be more efficient and effective for photo exposure correction than state-of-the-art networks.
    Abstract The illumination of improperly exposed photographs has been widely corrected using deep convolutional neural networks or Transformers. Despite with promising performance, these methods usually suffer from large parameter amounts and heavy computational FLOPs on high-resolution photographs. In this paper, we propose extremely light-weight (with only ~8K parameters) Multi-Scale Linear Transformation (MSLT) networks under the multi-layer perception architecture, which can process 4K-resolution sRGB images at 125 Frame-Per-Second (FPS) by a Titan RTX GPU. Specifically, the proposed MSLT networks first decompose an input image into high and low frequency layers by Laplacian pyramid techniques, and then sequentially correct different layers by pixel-adaptive linear transformation, which is implemented by efficient bilateral grid learning or 1x1 convolutions. Experiments on two benchmark datasets demonstrate the efficiency of our MSLTs against the state-of-the-arts on photo exposure correction. Extensive ablation studies validate the effectiveness of our contributions. The code is available at https://github.com/Zhou-Yijie/MSLTNet.
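The Laplacian-pyramid plumbing described above can be sketched as follows: decompose the image into band-pass layers, apply a per-layer linear transform, and reconstruct. In MSLT the per-pixel linear coefficients are learned by small networks; the constant gains here are placeholder assumptions that only show the decomposition/reconstruction path.

```python
# Laplacian pyramid decomposition, per-layer linear transform, reconstruction.
import cv2
import numpy as np

def build_laplacian_pyramid(img: np.ndarray, levels: int = 3):
    pyramid, current = [], img
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(current - up)      # high-frequency residual
        current = down
    pyramid.append(current)               # low-frequency base
    return pyramid

def reconstruct(pyramid):
    current = pyramid[-1]
    for lap in reversed(pyramid[:-1]):
        current = cv2.pyrUp(current, dstsize=(lap.shape[1], lap.shape[0])) + lap
    return current

img = np.random.rand(256, 256, 3).astype(np.float32)   # stand-in sRGB image
pyr = build_laplacian_pyramid(img)
gains = [1.0, 1.0, 1.0, 1.2]              # placeholder per-layer linear transforms
corrected = reconstruct([g * layer for g, layer in zip(gains, pyr)])
```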

Improved Dense Nested Attention Network Based on Transformer for Infrared Small Target Detection

  • paper_url: http://arxiv.org/abs/2311.08747
  • repo_url: None
  • paper_authors: Chun Bao, Jie Cao, Yaqian Ning, Tianhua Zhao, Zhijun Li, Zechen Wang, Li Zhang, Qun Hao
  • for: Detecting infrared small targets in complex and dynamic backgrounds using deep learning.
  • methods: The proposed improved dense nested attention network (IDNANet) builds on the transformer architecture and incorporates the Swin-transformer and the ACmix attention structure to enhance the continuity and features of the target, together with a weighted dice binary cross-entropy (WD-BCE) loss (a loss sketch follows the abstract).
  • results: Outperforms other state-of-the-art methods in probability of detection (P_d), false-alarm rate (F_a), and mean intersection over union (mIoU), with mIoU reaching 90.89 on the NUDT-SIRST dataset and 79.72 on the NUAA-SIRST dataset.
    Abstract Infrared small target detection based on deep learning offers unique advantages in separating small targets from complex and dynamic backgrounds. However, the features of infrared small targets gradually weaken as the depth of convolutional neural network (CNN) increases. To address this issue, we propose a novel method for detecting infrared small targets called improved dense nested attention network (IDNANet), which is based on the transformer architecture. We preserve the dense nested structure of dense nested attention network (DNANet) and introduce the Swin-transformer during feature extraction stage to enhance the continuity of features. Furthermore, we integrate the ACmix attention structure into the dense nested structure to enhance the features of intermediate layers. Additionally, we design a weighted dice binary cross-entropy (WD-BCE) loss function to mitigate the negative impact of foreground-background imbalance in the samples. Moreover, we develop a dataset specifically for infrared small targets, called BIT-SIRST. The dataset comprises a significant amount of real-world targets and manually annotated labels, as well as synthetic data and corresponding labels. We have evaluated the effectiveness of our method through experiments conducted on public datasets. In comparison to other state-of-the-art methods, our approach outperforms in terms of probability of detection (P_d), false-alarm rate (F_a), and mean intersection of union ($mIoU$). The $mIoU$ reaches 90.89 on the NUDT-SIRST dataset and 79.72 on the NUAA-SIRST dataset.
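A sketch of a weighted Dice plus binary cross-entropy loss in the spirit of WD-BCE, intended to soften the foreground-background imbalance typical of small-target masks. The positive-class weight and the equal mixing of the two terms are assumptions; the paper's exact weighting may differ.

```python
# Weighted Dice + BCE loss for imbalanced small-target segmentation.
import torch
import torch.nn.functional as F

def wd_bce_loss(logits, target, pos_weight=10.0, eps=1e-6):
    """logits, target: (B, 1, H, W); target in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(
        logits, target, pos_weight=torch.tensor(pos_weight))
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    return 0.5 * bce + 0.5 * dice.mean()

logits = torch.randn(2, 1, 128, 128, requires_grad=True)
target = (torch.rand(2, 1, 128, 128) > 0.98).float()   # sparse small targets
loss = wd_bce_loss(logits, target)
loss.backward()
```

Weighting the positive class and adding a Dice term are both standard remedies when targets occupy only a handful of pixels per image.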

A Diffusion Model Based Quality Enhancement Method for HEVC Compressed Video

  • paper_url: http://arxiv.org/abs/2311.08746
  • repo_url: None
  • paper_authors: Zheng Liu, Honggang Qi
  • for: Improving the quality of compressed videos at the decoder side.
  • methods: A diffusion-model-based post-processing method that estimates feature vectors of the compressed video and uses them as prior information for quality enhancement (a conditioning sketch follows the abstract).
  • results: Achieves better quality enhancement than existing methods on mixed datasets.
    Abstract Video post-processing methods can improve the quality of compressed videos at the decoder side. Most of the existing methods need to train corresponding models for compressed videos with different quantization parameters to improve the quality of compressed videos. However, in most cases, the quantization parameters of the decoded video are unknown. This makes existing methods have their limitations in improving video quality. To tackle this problem, this work proposes a diffusion model based post-processing method for compressed videos. The proposed method first estimates the feature vectors of the compressed video and then uses the estimated feature vectors as the prior information for the quality enhancement model to adaptively enhance the quality of compressed video with different quantization parameters. Experimental results show that the quality enhancement results of our proposed method on mixed datasets are superior to existing methods.
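A minimal sketch of the conditioning idea: estimate a feature vector from the decoded frame and use it as a prior that modulates the enhancement network, so one model can adapt to unknown quantization parameters. The estimator, FiLM-style scale/shift conditioning, and layer sizes are illustrative assumptions, not the paper's diffusion-based architecture.

```python
# Enhancement network modulated by a feature vector estimated from the input.
import torch
import torch.nn as nn

class ConditionalEnhancer(nn.Module):
    def __init__(self, feat_dim=16):
        super().__init__()
        self.estimator = nn.Sequential(      # predicts a prior from the frame
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(3 * 8 * 8, feat_dim), nn.ReLU())
        self.to_scale_shift = nn.Linear(feat_dim, 2 * 32)
        self.enc = nn.Conv2d(3, 32, 3, padding=1)
        self.dec = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, x):
        feat = self.estimator(x)                       # (B, feat_dim) prior
        scale, shift = self.to_scale_shift(feat).chunk(2, dim=1)
        h = torch.relu(self.enc(x))
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return x + self.dec(h)                         # residual correction

frames = torch.rand(4, 3, 64, 64)                      # stand-in decoded frames
enhanced = ConditionalEnhancer()(frames)
```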

Scalable Federated Learning for Clients with Different Input Image Sizes and Numbers of Output Categories

  • paper_url: http://arxiv.org/abs/2311.08716
  • repo_url: None
  • paper_authors: Shuhei Nitta, Taiji Suzuki, Albert Rodríguez Mulet, Atsushi Yaguchi, Ryusuke Hirai
  • for: Privacy-preserving federated training when clients' data cannot be shared and clients have different input image sizes and numbers of output categories.
  • methods: Adjusts the depth and width of each client's local model according to its input image size and number of output categories, and provides a new bound on the generalization gap of federated learning (an aggregation sketch follows the abstract).
  • results: Demonstrates effectiveness on image classification and object detection tasks under several heterogeneous client settings, supported by the new generalization bound.
    Abstract Federated learning is a privacy-preserving training method which consists of training from a plurality of clients but without sharing their confidential data. However, previous work on federated learning do not explore suitable neural network architectures for clients with different input images sizes and different numbers of output categories. In this paper, we propose an effective federated learning method named ScalableFL, where the depths and widths of the local models for each client are adjusted according to the clients' input image size and the numbers of output categories. In addition, we provide a new bound for the generalization gap of federated learning. In particular, this bound helps to explain the effectiveness of our scalable neural network approach. We demonstrate the effectiveness of ScalableFL in several heterogeneous client settings for both image classification and object detection tasks.
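A sketch of how heterogeneous client models might be aggregated: each client builds a model whose width and output head match its own data, and only parameters with matching names and shapes are averaged across clients. This matching rule is an assumption standing in for ScalableFL's actual aggregation; it only illustrates sharing what is common while keeping client-specific parts local.

```python
# Federated averaging over clients with different model widths and heads.
import torch
import torch.nn as nn

def make_client_model(width: int, num_classes: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),         # shared-size stem
        nn.Conv2d(16, width, 3, padding=1), nn.ReLU(),     # client-specific width
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(width, num_classes))                     # client-specific head

clients = [make_client_model(32, 10), make_client_model(64, 5)]

@torch.no_grad()
def federated_average(models):
    states = [m.state_dict() for m in models]
    for name, ref in states[0].items():
        matching = [s[name] for s in states
                    if name in s and s[name].shape == ref.shape]
        if len(matching) > 1:                              # parameter is shared
            avg = torch.stack(matching).mean(dim=0)
            for t in matching:
                t.copy_(avg)                               # write back in place

federated_average(clients)
print(clients[0][0].weight.equal(clients[1][0].weight))    # True: stem is shared
```

Only the stem is averaged here; the wider convolution and the classification head keep their client-specific parameters.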

CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

  • paper_url: http://arxiv.org/abs/2311.08673
  • repo_url: None
  • paper_authors: Jianzong Wang, Yimin Deng, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao
  • for: Proposes "CP-EB", a talking-face generation method that takes an audio signal and a portrait image as input and synthesizes a talking-head video with natural head poses and eye blinking.
  • methods: A GAN-based architecture extracts eye-blink features from the input audio and from a reference video, trains them contrastively, and embeds them into the talking-face images (a contrastive-loss sketch follows the abstract).
  • results: Experiments show the method generates realistic talking-face videos with synchronous lip motion, natural head poses and blinking eyes.
    Abstract This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference, to synthesize a photo-realistic people talking video with head poses controlled by a short video clip and proper eye blinking embedding. It's noted that not only the head pose but also eye blinking are both important aspects for deep fake detection. The implicit control of poses by video has already achieved by the state-of-art work. According to recent research, eye blinking has weak correlation with input audio which means eye blinks extraction from audio and generation are possible. Hence, we propose a GAN-based architecture to extract eye blink feature from input audio and reference video respectively and employ contrastive training between them, then embed it into the concatenated features of identity and poses to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking face with synchronous lips motions, natural head poses and blinking eyes.
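The contrastive training between audio-derived and video-derived eye-blink embeddings can be sketched with an InfoNCE-style loss that pulls matching (audio, video) pairs together and pushes mismatched pairs apart. The encoders, embedding size, and temperature are assumptions; CP-EB's exact objective may differ.

```python
# InfoNCE-style contrastive loss between audio and video blink embeddings.
import torch
import torch.nn.functional as F

def info_nce(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (B, D); row i of each comes from the same clip."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(a.size(0))           # the diagonal pairs match
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

audio_emb = torch.randn(8, 128, requires_grad=True)   # from the audio encoder
video_emb = torch.randn(8, 128, requires_grad=True)   # from the video encoder
loss = info_nce(audio_emb, video_emb)
loss.backward()
```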

Deep Neural Network Identification of Limnonectes Species and New Class Detection Using Image Data

  • paper_url: http://arxiv.org/abs/2311.08661
  • repo_url: None
  • paper_authors: Li Xu, Yili Hong, Eric P. Smith, David S. McLeod, Xinwei Deng, Laura J. Freeman
  • for: Addressing challenges in documenting biodiversity, specifically species complexes whose members are morphologically very similar.
  • methods: Applies machine learning to two problems: classifying images into known species groups, and out-of-distribution detection of images that belong to none of the existing classes (an OOD sketch follows the abstract).
  • results: Deep neural networks successfully automate classification of images into the known species groups and can detect when an image falls outside the existing classes.
    Abstract As is true of many complex tasks, the work of discovering, describing, and understanding the diversity of life on Earth (viz., biological systematics and taxonomy) requires many tools. Some of this work can be accomplished as it has been done in the past, but some aspects present us with challenges which traditional knowledge and tools cannot adequately resolve. One such challenge is presented by species complexes in which the morphological similarities among the group members make it difficult to reliably identify known species and detect new ones. We address this challenge by developing new tools using the principles of machine learning to resolve two specific questions related to species complexes. The first question is formulated as a classification problem in statistics and machine learning and the second question is an out-of-distribution (OOD) detection problem. We apply these tools to a species complex comprising Southeast Asian stream frogs (Limnonectes kuhlii complex) and employ a morphological character (hind limb skin texture) traditionally treated qualitatively in a quantitative and objective manner. We demonstrate that deep neural networks can successfully automate the classification of an image into a known species group for which it has been trained. We further demonstrate that the algorithm can successfully classify an image into a new class if the image does not belong to the existing classes. Additionally, we use the larger MNIST dataset to test the performance of our OOD detection algorithm. We finish our paper with some concluding remarks regarding the application of these methods to species complexes and our efforts to document true biodiversity. This paper has online supplementary materials.
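A sketch of one common way to flag images outside the known classes, using the maximum softmax probability as a confidence score; the classifier, threshold, and scoring rule are illustrative assumptions and not necessarily the criterion used in the paper.

```python
# Out-of-distribution detection via maximum softmax probability (MSP).
import torch
import torch.nn.functional as F

def detect_new_class(logits: torch.Tensor, threshold: float = 0.7):
    """logits: (B, num_known_species). Returns (predicted class, is_new_class)."""
    probs = F.softmax(logits, dim=1)
    confidence, predicted = probs.max(dim=1)
    return predicted, confidence < threshold

logits = torch.tensor([[4.0, 0.5, 0.2],     # confidently a known species
                       [1.1, 1.0, 0.9]])    # ambiguous -> possibly a new class
species, is_new = detect_new_class(logits)
print(species.tolist(), is_new.tolist())    # e.g. [0, 0] and [False, True]
```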

ConeQuest: A Benchmark for Cone Segmentation on Mars

  • paper_url: http://arxiv.org/abs/2311.08657
  • repo_url: https://github.com/kerner-lab/conequest
  • paper_authors: Mirali Purohit, Jacob Adler, Hannah Kerner
  • for: Developing more accurate and robust models for segmenting pitted cones on Mars.
  • methods: Introduces ConeQuest, an expert-annotated computer-vision dataset with >13k samples from three regions of Mars, together with two benchmark tasks for training and evaluating cone segmentation models.
  • results: Existing segmentation models do not solve cone segmentation, reaching an average IoU of only 52.52% and 42.55% on the two benchmark tasks (the IoU metric is sketched after the abstract).
    Abstract Over the years, space scientists have collected terabytes of Mars data from satellites and rovers. One important set of features identified in Mars orbital images is pitted cones, which are interpreted to be mud volcanoes believed to form in regions that were once saturated in water (i.e., a lake or ocean). Identifying pitted cones globally on Mars would be of great importance, but expert geologists are unable to sort through the massive orbital image archives to identify all examples. However, this task is well suited for computer vision. Although several computer vision datasets exist for various Mars-related tasks, there is currently no open-source dataset available for cone detection/segmentation. Furthermore, previous studies trained models using data from a single region, which limits their applicability for global detection and mapping. Motivated by this, we introduce ConeQuest, the first expert-annotated public dataset to identify cones on Mars. ConeQuest consists of >13k samples from 3 different regions of Mars. We propose two benchmark tasks using ConeQuest: (i) Spatial Generalization and (ii) Cone-size Generalization. We finetune and evaluate widely-used segmentation models on both benchmark tasks. Results indicate that cone segmentation is a challenging open problem not solved by existing segmentation models, which achieve an average IoU of 52.52% and 42.55% on in-distribution data for tasks (i) and (ii), respectively. We believe this new benchmark dataset will facilitate the development of more accurate and robust models for cone segmentation. Data and code are available at https://github.com/kerner-lab/ConeQuest.
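For reference, the mean intersection-over-union metric reported above can be computed as follows; the toy masks are placeholders for model predictions and expert annotations.

```python
# Per-image IoU between binary masks, averaged over a set of images.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> float:
    """pred, gt: boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float((inter + eps) / (union + eps))

preds = [np.random.rand(128, 128) > 0.5 for _ in range(4)]   # predicted masks
gts = [np.random.rand(128, 128) > 0.5 for _ in range(4)]     # annotated masks
mean_iou = np.mean([iou(p, g) for p, g in zip(preds, gts)])
print(f"mIoU: {mean_iou:.4f}")
```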

Review of AlexNet for Medical Image Classification

  • paper_url: http://arxiv.org/abs/2311.08655
  • repo_url: https://github.com/Arminsbss/tumor-classification
  • paper_authors: Wenhao Tang, Junding Sun, Shuihua Wang, Yudong Zhang
  • for: Reviews the application of the AlexNet model, and its technical details, in medical image classification.
  • methods: Discusses how AlexNet uses the dropout technique to mitigate overfitting and the ReLU activation function to avoid vanishing gradients (a minimal sketch follows the abstract).
  • results: Based on a review of over 40 journal and conference papers, summarizes AlexNet's technical details, advantages, and application areas.
    Abstract In recent years, the rapid development of deep learning has led to a wide range of applications in the field of medical image classification. The variants of neural network models with ever-increasing performance share some commonalities: to try to mitigate overfitting, improve generalization, avoid gradient vanishing and exploding, etc. AlexNet first utilizes the dropout technique to mitigate overfitting and the ReLU activation function to avoid gradient vanishing. Therefore, we focus our discussion on AlexNet, which has contributed greatly to the development of CNNs in 2012. After reviewing over 40 papers, including journal papers and conference papers, we give a narrative on the technical details, advantages, and application areas of AlexNet.
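A minimal sketch of the two ingredients highlighted above, ReLU activations and dropout, in a small AlexNet-style classifier; the layer sizes are simplified assumptions and not the original AlexNet configuration.

```python
# Tiny AlexNet-style classifier showing ReLU activations and dropout.
import torch
import torch.nn as nn

class TinyAlexNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                  # mitigates overfitting
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(256, num_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

scans = torch.rand(4, 3, 128, 128)              # stand-in medical images
logits = TinyAlexNet()(scans)
```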

Refining Perception Contracts: Case Studies in Vision-based Safe Auto-landing

  • paper_url: http://arxiv.org/abs/2311.08652
  • repo_url: None
  • paper_authors: Yangge Li, Benjamin C Yang, Yixuan Jia, Daniel Zhuang, Sayan Mitra
  • for: Evaluating the safety of control systems that use machine learning for perception.
  • methods: Uses perception contracts, constructed and refined with a data- and requirement-guided algorithm (DaRePC), to prove end-to-end system-level safety requirements (a refinement sketch follows the abstract).
  • results: Yields testable contracts establishing the state and environment conditions under which an aircraft can safely touch down on the runway and a drone can safely pass through a sequence of gates, and can also discover conditions (e.g., a low-horizon sun) that may violate the safety of the vision-based control system.
    Abstract Perception contracts provide a method for evaluating safety of control systems that use machine learning for perception. A perception contract is a specification for testing the ML components, and it gives a method for proving end-to-end system-level safety requirements. The feasibility of contract-based testing and assurance was established earlier in the context of straight lane keeping: a 3-dimensional system with relatively simple dynamics. This paper presents the analysis of two 6 and 12-dimensional flight control systems that use multi-stage, heterogeneous, ML-enabled perception. The paper advances methodology by introducing an algorithm for constructing data and requirement guided refinement of perception contracts (DaRePC). The resulting analysis provides testable contracts which establish the state and environment conditions under which an aircraft can safety touchdown on the runway and a drone can safely pass through a sequence of gates. It can also discover conditions (e.g., low-horizon sun) that can possibly violate the safety of the vision-based control system.
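One way to picture data- and requirement-guided refinement of a perception contract: partition the state space, bound the perception error in each region by a high quantile of errors observed in sampled data, and check each bound against the system-level requirement. The altitude-based partition, synthetic error model, and quantile level below are assumptions, not DaRePC itself.

```python
# Data-guided refinement of a per-region perception-error bound.
import numpy as np

rng = np.random.default_rng(0)
true_altitude = rng.uniform(0.0, 100.0, size=2000)          # sampled states
perceived = true_altitude + rng.normal(0.0, 0.5 + 0.02 * true_altitude)
errors = np.abs(perceived - true_altitude)

bins = np.linspace(0.0, 100.0, 6)                            # 5 altitude regions
contract = {}
for lo, hi in zip(bins[:-1], bins[1:]):
    in_region = (true_altitude >= lo) & (true_altitude < hi)
    # Contracted error bound: 99th percentile of observed error in this region.
    contract[(lo, hi)] = float(np.quantile(errors[in_region], 0.99))

requirement = 3.0                                             # max tolerable error
for region, bound in contract.items():
    status = "OK" if bound <= requirement else "REFINE/RESTRICT"
    print(f"altitude {region}: error <= {bound:.2f}  [{status}]")
```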

Painterly Image Harmonization via Adversarial Residual Learning

  • paper_url: http://arxiv.org/abs/2311.08646
  • repo_url: None
  • paper_authors: Xudong Wang, Li Niu, Junyan Cao, Yan Hong, Liqing Zhang
  • for: Compositing a photorealistic foreground object into a painterly background image so that the composite looks harmonious.
  • methods: Adversarial learning with a dual-encoder generator and a pixel-wise discriminator to bridge the domain gap between foreground and background feature maps (an adversarial-residual sketch follows the abstract).
  • results: Experiments show the method achieves more harmonious and visually appealing results than previous methods.
    Abstract Image compositing plays a vital role in photo editing. After inserting a foreground object into another background image, the composite image may look unnatural and inharmonious. When the foreground is photorealistic and the background is an artistic painting, painterly image harmonization aims to transfer the style of background painting to the foreground object, which is a challenging task due to the large domain gap between foreground and background. In this work, we employ adversarial learning to bridge the domain gap between foreground feature map and background feature map. Specifically, we design a dual-encoder generator, in which the residual encoder produces the residual features added to the foreground feature map from main encoder. Then, a pixel-wise discriminator plays against the generator, encouraging the refined foreground feature map to be indistinguishable from background feature map. Extensive experiments demonstrate that our method could achieve more harmonious and visually appealing results than previous methods.
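The adversarial residual idea can be sketched as follows: a residual encoder adds features to the foreground feature map, and a pixel-wise (1x1-convolution) discriminator tries to tell the refined foreground features from background features while the generator tries to fool it. All module sizes and loss weights are illustrative assumptions, not the paper's architecture.

```python
# Residual feature refinement trained against a pixel-wise discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F

main_enc = nn.Conv2d(3, 32, 3, padding=1)       # main encoder (stand-in)
res_enc = nn.Conv2d(3, 32, 3, padding=1)        # residual encoder (stand-in)
pixel_disc = nn.Conv2d(32, 1, 1)                # pixel-wise discriminator

comp = torch.rand(2, 3, 64, 64)                 # composite (photo fg on painting)
bg = torch.rand(2, 3, 64, 64)                   # painterly background reference

fg_feat = main_enc(comp) + res_enc(comp)        # residual features refine the fg map
bg_feat = main_enc(bg)                          # background (style) feature map

real = torch.ones(2, 1, 64, 64)
fake = torch.zeros(2, 1, 64, 64)

# Discriminator step: background features are "real", refined foreground "fake".
d_loss = (F.binary_cross_entropy_with_logits(pixel_disc(bg_feat), real) +
          F.binary_cross_entropy_with_logits(pixel_disc(fg_feat.detach()), fake))

# Generator step: make refined foreground features indistinguishable from background.
g_loss = F.binary_cross_entropy_with_logits(pixel_disc(fg_feat), real)
```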