cs.CV - 2023-10-03

Harvard Eye Fairness: A Large-Scale 3D Imaging Dataset for Equitable Eye Diseases Screening and Fair Identity Scaling

  • paper_url: http://arxiv.org/abs/2310.02492
  • repo_url: None
  • paper_authors: Yan Luo, Yu Tian, Min Shi, Tobias Elze, Mengyu Wang
  • for: To provide a large-scale public medical dataset for fairness learning in medicine.
  • methods: A new fair identity scaling (FIS) approach, compared against other state-of-the-art fairness learning methods.
  • results: The Harvard-EF dataset of 30,000 subjects, including 3D optical coherence tomography scans and 2D fundus photos together with six demographic identity attributes.
    Abstract Fairness or equity in machine learning is profoundly important for societal well-being, but limited public datasets hinder its progress, especially in the area of medicine. It is undeniable that fairness in medicine is one of the most important areas for fairness learning's applications. Currently, no large-scale public medical datasets with 3D imaging data for fairness learning are available, while 3D imaging data in modern clinics are standard tests for disease diagnosis. In addition, existing medical fairness datasets are actually repurposed datasets, and therefore they typically have limited demographic identity attributes with at most three identity attributes of age, gender, and race for fairness modeling. To address this gap, we introduce our Eye Fairness dataset with 30,000 subjects (Harvard-EF) covering three major eye diseases including age-related macular degeneration, diabetic retinopathy, and glaucoma affecting 380 million patients globally. Our Harvard-EF dataset includes both 2D fundus photos and 3D optical coherence tomography scans with six demographic identity attributes including age, gender, race, ethnicity, preferred language, and marital status. We also propose a fair identity scaling (FIS) approach combining group and individual scaling together to improve model fairness. Our FIS approach is compared with various state-of-the-art fairness learning methods with superior performance in the racial, gender, and ethnicity fairness tasks with 2D and 3D imaging data, which demonstrate the utilities of our Harvard-EF dataset for fairness learning. To facilitate fairness comparisons between different models, we propose performance-scaled disparity measures, which can be used to compare model fairness accounting for overall performance levels. The dataset and code are publicly accessible via https://ophai.hms.harvard.edu/datasets/harvard-ef30k.
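The abstract introduces performance-scaled disparity measures for comparing fairness across models with different overall performance levels, but does not spell out the formula. Below is a minimal sketch of one plausible reading, a raw inter-group gap divided by overall performance; the function name, the max-min gap, and the normalization are assumptions rather than the authors' definition.

```python
import numpy as np

def performance_scaled_disparity(group_scores: dict) -> float:
    """Plausible sketch: disparity between identity groups, scaled by overall performance.

    group_scores maps an identity-group name (e.g. a racial group) to that group's AUC or
    accuracy. The exact definition used for Harvard-EF is not given in the abstract; this
    is an illustrative assumption (max-min gap divided by the mean score).
    """
    scores = np.array(list(group_scores.values()), dtype=float)
    overall = scores.mean()                      # proxy for the overall performance level
    disparity = scores.max() - scores.min()      # raw inter-group gap
    return disparity / overall                   # smaller is fairer at a given performance

# Hypothetical per-group AUCs for a racial fairness task
print(performance_scaled_disparity({"Asian": 0.86, "Black": 0.82, "White": 0.88}))
```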

OCU-Net: A Novel U-Net Architecture for Enhanced Oral Cancer Segmentation

  • paper_url: http://arxiv.org/abs/2310.02486
  • repo_url: None
  • paper_authors: Ahmed Albishri, Syed Jawad Hussain Shah, Yugyung Lee, Rong Wang
  • for: OCU-Net, a U-Net image segmentation architecture designed to improve the accuracy of oral cancer segmentation.
  • methods: OCU-Net incorporates several advanced deep learning modules, including a Channel and Spatial Attention Fusion (CSAF) module, Squeeze-and-Excite (SE) attention, Atrous Spatial Pyramid Pooling (ASPP), residual blocks, and multi-scale fusion.
  • results: OCU-Net and its MobileNet-V2-backed variant OCU-Netm achieve highly accurate oral cancer segmentation on two H&E-stained image datasets, surpassing existing segmentation methods.
    Abstract Accurate detection of oral cancer is crucial for improving patient outcomes. However, the field faces two key challenges: the scarcity of deep learning-based image segmentation research specifically targeting oral cancer and the lack of annotated data. Our study proposes OCU-Net, a pioneering U-Net image segmentation architecture exclusively designed to detect oral cancer in hematoxylin and eosin (H&E) stained image datasets. OCU-Net incorporates advanced deep learning modules, such as the Channel and Spatial Attention Fusion (CSAF) module, a novel and innovative feature that emphasizes important channel and spatial areas in H&E images while exploring contextual information. In addition, OCU-Net integrates other innovative components such as Squeeze-and-Excite (SE) attention module, Atrous Spatial Pyramid Pooling (ASPP) module, residual blocks, and multi-scale fusion. The incorporation of these modules showed superior performance for oral cancer segmentation for two datasets used in this research. Furthermore, we effectively utilized the efficient ImageNet pre-trained MobileNet-V2 model as a backbone of our OCU-Net to create OCU-Netm, an enhanced version achieving state-of-the-art results. Comprehensive evaluation demonstrates that OCU-Net and OCU-Netm outperformed existing segmentation methods, highlighting their precision in identifying cancer cells in H&E images from OCDC and ORCA datasets.

EvDNeRF: Reconstructing Event Data with Dynamic Neural Radiance Fields

  • paper_url: http://arxiv.org/abs/2310.02437
  • repo_url: https://github.com/anish-bhattacharya/evdnerf
  • paper_authors: Anish Bhattacharya, Ratnesh Madaan, Fernando Cladera, Sai Vemprala, Rogerio Bonatti, Kostas Daniilidis, Ashish Kapoor, Vijay Kumar, Nikolai Matni, Jayesh K. Gupta
  • for: Faithfully reconstructing event streams of scenes with rigid and non-rigid deformations that may be too fast to capture with a standard camera.
  • methods: Event cameras register asynchronous per-pixel brightness changes, and a neural radiance field (NeRF) provides visual-quality, geometry-based learnable rendering; EvDNeRF trains an event-based dynamic NeRF on this data.
  • results: An event-based dynamic NeRF that can serve as an event simulator for a given scene; training on varied batch sizes of events improves test-time event predictions at fine time resolutions, outperforming baselines that pair standard dynamic NeRFs with event simulators.
    Abstract We present EvDNeRF, a pipeline for generating event data and training an event-based dynamic NeRF, for the purpose of faithfully reconstructing eventstreams on scenes with rigid and non-rigid deformations that may be too fast to capture with a standard camera. Event cameras register asynchronous per-pixel brightness changes at MHz rates with high dynamic range, making them ideal for observing fast motion with almost no motion blur. Neural radiance fields (NeRFs) offer visual-quality geometric-based learnable rendering, but prior work with events has only considered reconstruction of static scenes. Our EvDNeRF can predict eventstreams of dynamic scenes from a static or moving viewpoint between any desired timestamps, thereby allowing it to be used as an event-based simulator for a given scene. We show that by training on varied batch sizes of events, we can improve test-time predictions of events at fine time resolutions, outperforming baselines that pair standard dynamic NeRFs with event simulators. We release our simulated and real datasets, as well as code for both event-based data generation and the training of event-based dynamic NeRF models (https://github.com/anish-bhattacharya/EvDNeRF).
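For context on the event streams the pipeline reconstructs, an event camera (or simulator) emits an event at a pixel whenever the log-brightness change since the last event crosses a contrast threshold. The sketch below implements that textbook generation model on a pair of frames; it is background illustration, not the authors' simulator, and the threshold value is an arbitrary assumption.

```python
import numpy as np

def frames_to_events(prev_frame: np.ndarray, curr_frame: np.ndarray,
                     threshold: float = 0.2, eps: float = 1e-6):
    """Toy event generation between two grayscale frames (values in [0, 1]).

    An event (x, y, polarity) is emitted wherever the log-intensity change exceeds the
    contrast threshold; real simulators additionally interpolate per-event timestamps.
    """
    dlog = np.log(curr_frame + eps) - np.log(prev_frame + eps)
    ys, xs = np.nonzero(np.abs(dlog) >= threshold)
    polarity = np.sign(dlog[ys, xs]).astype(np.int8)
    return np.stack([xs, ys, polarity], axis=1)

prev = np.random.rand(64, 64)
curr = np.clip(prev + 0.1 * np.random.randn(64, 64), 0, 1)
events = frames_to_events(prev, curr)
print(events.shape)  # (num_events, 3)
```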

EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods

  • paper_url: http://arxiv.org/abs/2310.02426
  • repo_url: None
  • paper_authors: Samyadeep Basu, Mehrdad Saberi, Shweta Bhardwaj, Atoosa Malemir Chegini, Daniela Massiceti, Maziar Sanjabi, Shell Xu Hu, Soheil Feizi
    for: EditVal is a standardized benchmark for evaluating text-guided image editing methods, which aims to provide a fair comparison of different methods across different types of fine-grained edits.methods: EditVal uses a curated dataset of images, a set of editable attributes for each image, and pre-trained vision-language models to assess the fidelity of generated images for each edit type.results: The top-performing methods averaged across different edit types are Instruct-Pix2Pix and Null-Text, but only Instruct-Pix2Pix and Null-Text are able to preserve original image properties. Most editing methods fail at edits involving spatial operations, and there is no single `winner’ method that ranks the best individually across a range of different edit types.
    Abstract A plethora of text-guided image editing methods have recently been developed by leveraging the impressive capabilities of large-scale diffusion-based generative models such as Imagen and Stable Diffusion. A standardized evaluation protocol, however, does not exist to compare methods across different types of fine-grained edits. To address this gap, we introduce EditVal, a standardized benchmark for quantitatively evaluating text-guided image editing methods. EditVal consists of a curated dataset of images, a set of editable attributes for each image drawn from 13 possible edit types, and an automated evaluation pipeline that uses pre-trained vision-language models to assess the fidelity of generated images for each edit type. We use EditVal to benchmark 8 cutting-edge diffusion-based editing methods including SINE, Imagic and Instruct-Pix2Pix. We complement this with a large-scale human study where we show that EditVal's automated evaluation pipeline is strongly correlated with human-preferences for the edit types we considered. From both the human study and automated evaluation, we find that: (i) Instruct-Pix2Pix, Null-Text and SINE are the top-performing methods averaged across different edit types, however {\it only} Instruct-Pix2Pix and Null-Text are able to preserve original image properties; (ii) Most of the editing methods fail at edits involving spatial operations (e.g., changing the position of an object). (iii) There is no `winner' method which ranks the best individually across a range of different edit types. We hope that our benchmark can pave the way to developing more reliable text-guided image editing tools in the future. We will publicly release EditVal, and all associated code and human-study templates to support these research directions in https://deep-ml-research.github.io/editval/.
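The benchmark's automated pipeline scores edit fidelity with pre-trained vision-language models. A minimal sketch of that general idea using CLIP is shown below; the specific model checkpoint, prompt, and scoring rule are illustrative assumptions, not EditVal's actual evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def edit_fidelity(image_path: str, edit_text: str) -> float:
    """Score how well the edited image matches the target edit description."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[edit_text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.item()  # higher = image better matches the edit text

# score = edit_fidelity("edited.png", "a photo of a red car parked on the left")
```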

FedL2P: Federated Learning to Personalize

  • paper_url: http://arxiv.org/abs/2310.02420
  • repo_url: https://github.com/royson/fedl2p
  • paper_authors: Royson Lee, Minyoung Kim, Da Li, Xinchi Qiu, Timothy Hospedales, Ferenc Huszár, Nicholas D. Lane
  • for: The federated meta-learning problem of learning personalization strategies.
  • methods: Federated learning of meta-nets that induce batch-norm and learning-rate parameters for each client from its local data statistics, adapting the model to each client's data distribution.
  • results: Improves over a range of standard hand-crafted personalization baselines in both label-shift and feature-shift situations.
    Abstract Federated learning (FL) research has made progress in developing algorithms for distributed learning of global models, as well as algorithms for local personalization of those common models to the specifics of each client's local data distribution. However, different FL problems may require different personalization strategies, and it may not even be possible to define an effective one-size-fits-all personalization strategy for all clients: depending on how similar each client's optimal predictor is to that of the global model, different personalization strategies may be preferred. In this paper, we consider the federated meta-learning problem of learning personalization strategies. Specifically, we consider meta-nets that induce the batch-norm and learning rate parameters for each client given local data statistics. By learning these meta-nets through FL, we allow the whole FL network to collaborate in learning a customized personalization strategy for each client. Empirical results show that this framework improves on a range of standard hand-crafted personalization baselines in both label and feature shift situations.
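A minimal sketch of the meta-net idea described above: a small network maps per-client data statistics to personalization hyperparameters such as a batch-norm mixing weight and a learning rate. The statistics, layer sizes, and output parameterization here are assumptions; the real implementation is in the linked repository (https://github.com/royson/fedl2p).

```python
import torch
import torch.nn as nn

class PersonalizationMetaNet(nn.Module):
    """Maps client data statistics to (batch-norm mixing weight, learning rate)."""
    def __init__(self, stat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(stat_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, client_stats: torch.Tensor):
        out = self.net(client_stats)
        bn_mix = torch.sigmoid(out[..., 0])      # 0 = keep global BN stats, 1 = use local stats
        lr = 1e-2 * torch.sigmoid(out[..., 1])   # per-client learning rate in (0, 1e-2)
        return bn_mix, lr

meta_net = PersonalizationMetaNet()
stats = torch.randn(1, 64)                       # e.g. feature means/variances on local data
bn_mix, lr = meta_net(stats)
print(float(bn_mix), float(lr))
```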

Bag of Tricks for Fully Test-Time Adaptation

  • paper_url: http://arxiv.org/abs/2310.02416
  • repo_url: https://github.com/smounsav/tta_bot
  • paper_authors: Saypraseuth Mounsaveng, Florent Chiaroni, Malik Boudiaf, Marco Pedersoli, Ismail Ben Ayed
  • for: To provide a systematic categorization and analysis of fully Test-Time Adaptation (TTA) techniques, consolidating the community's knowledge and improving TTA performance.
  • methods: Analysis of several orthogonal TTA tricks, including small-batch normalization, stream rebalancing, reliable sample selection, and network confidence calibration.
  • results: Dissecting each trick across scenarios of interest reveals trade-offs between accuracy, required computational power, and model complexity; combining techniques yields synergies and establishes new state-of-the-art results.
    Abstract Fully Test-Time Adaptation (TTA), which aims at adapting models to data drifts, has recently attracted wide interest. Numerous tricks and techniques have been proposed to ensure robust learning on arbitrary streams of unlabeled data. However, assessing the true impact of each individual technique and obtaining a fair comparison still constitutes a significant challenge. To help consolidate the community's knowledge, we present a categorization of selected orthogonal TTA techniques, including small batch normalization, stream rebalancing, reliable sample selection, and network confidence calibration. We meticulously dissect the effect of each approach on different scenarios of interest. Through our analysis, we shed light on trade-offs induced by those techniques between accuracy, the computational power required, and model complexity. We also uncover the synergy that arises when combining techniques and are able to establish new state-of-the-art results.
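Two of the catalogued tricks, test-time batch-norm statistics and reliable (low-entropy) sample selection, can be sketched in a few lines. The snippet below is a generic illustration under assumed thresholds and update rules, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def adapt_on_batch(model: torch.nn.Module, x: torch.Tensor, optimizer,
                   entropy_frac: float = 0.4):
    """One adaptation step on an unlabeled test batch."""
    model.train()                      # train mode -> BatchNorm uses current-batch statistics
    logits = model(x)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    max_entropy = torch.log(torch.tensor(float(logits.shape[1])))
    reliable = entropy < entropy_frac * max_entropy   # keep only confident samples
    if reliable.any():                 # adapt (entropy minimization) on reliable samples only
        loss = entropy[reliable].mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return logits.detach()
```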

FT-Shield: A Watermark Against Unauthorized Fine-tuning in Text-to-Image Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.02401
  • repo_url: None
  • paper_authors: Yingqian Cui, Jie Ren, Yuping Lin, Han Xu, Pengfei He, Yue Xing, Wenqi Fan, Hui Liu, Jiliang Tang
  • for: Addressing the unauthorized use of data when personalizing (fine-tuning) text-to-image diffusion models.
  • methods: FT-Shield, a watermarking approach that places a dedicated watermark on the training images so that it transfers quickly and accurately to images generated by a fine-tuned text-to-image diffusion model, where a binary detector can recognize it.
  • results: Comprehensive experiments show that FT-Shield effectively detects unauthorized data usage in fine-tuned text-to-image diffusion models.
    Abstract Text-to-image generative models based on latent diffusion models (LDM) have demonstrated their outstanding ability in generating high-quality and high-resolution images according to language prompt. Based on these powerful latent diffusion models, various fine-tuning methods have been proposed to achieve the personalization of text-to-image diffusion models such as artistic style adaptation and human face transfer. However, the unauthorized usage of data for model personalization has emerged as a prevalent concern in relation to copyright violations. For example, a malicious user may use the fine-tuning technique to generate images which mimic the style of a painter without his/her permission. In light of this concern, we have proposed FT-Shield, a watermarking approach specifically designed for the fine-tuning of text-to-image diffusion models to aid in detecting instances of infringement. We develop a novel algorithm for the generation of the watermark to ensure that the watermark on the training images can be quickly and accurately transferred to the generated images of text-to-image diffusion models. A watermark will be detected on an image by a binary watermark detector if the image is generated by a model that has been fine-tuned using the protected watermarked images. Comprehensive experiments were conducted to validate the effectiveness of FT-Shield.

ScaleNet: An Unsupervised Representation Learning Method for Limited Information

  • paper_url: http://arxiv.org/abs/2310.02386
  • repo_url: None
  • paper_authors: Huili Huang, M. Mahdi Roozbahani
  • for: Enhancing the ability of deep convolutional neural networks (ConvNets) to learn high-level semantic visual representations when only limited information is available.
  • methods: ScaleNet, a simple and efficient unsupervised representation learning method based on multi-scale images: the ConvNet first learns rotation prediction on downsized images, then learns the same task on original-size images using the transferred parameters.
  • results: On CIFAR-10 and ImageNet with architectures such as AlexNet and ResNet50, ScaleNet surpasses RotNet by ~7% on the limited CIFAR-10 dataset, and parameters transferred from a ScaleNet model trained with limited data improve the ImageNet classification task by about 6% compared to the RotNet model.
    Abstract Although large-scale labeled data are essential for deep convolutional neural networks (ConvNets) to learn high-level semantic visual representations, it is time-consuming and impractical to collect and annotate large-scale datasets. A simple and efficient unsupervised representation learning method named ScaleNet based on multi-scale images is proposed in this study to enhance the performance of ConvNets when limited information is available. The input images are first resized to a smaller size and fed to the ConvNet to recognize the rotation degree. Next, the ConvNet learns the rotation-prediction task for the original size images based on the parameters transferred from the previous model. The CIFAR-10 and ImageNet datasets are examined on different architectures such as AlexNet and ResNet50 in this study. The current study demonstrates that specific image features, such as Harris corner information, play a critical role in the efficiency of the rotation-prediction task. The ScaleNet supersedes the RotNet by ~7% in the limited CIFAR-10 dataset. The transferred parameters from a ScaleNet model with limited data improve the ImageNet Classification task by about 6% compared to the RotNet model. This study shows the capability of the ScaleNet method to improve other cutting-edge models such as SimCLR by learning effective features for classification tasks.
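ScaleNet builds on the rotation-prediction pretext task, first on downsized images and then on the original size with transferred weights. The sketch below shows only the data side of that task (rotated copies plus labels at two scales); the training schedule and architecture details are assumptions.

```python
import torch
import torch.nn.functional as F

def rotation_batch(images: torch.Tensor, size: int):
    """Return resized, rotated copies of a batch and their rotation labels (0..3)."""
    images = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
    rotated, labels = [], []
    for k in range(4):                               # 4 rotations: k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.shape[0],), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.rand(8, 3, 32, 32)
small_x, small_y = rotation_batch(x, size=16)        # stage 1: smaller images
full_x, full_y = rotation_batch(x, size=32)          # stage 2: original size, reusing weights
print(small_x.shape, full_x.shape)
```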

Multi-Prompt Fine-Tuning of Foundation Models for Enhanced Medical Image Segmentation

  • paper_url: http://arxiv.org/abs/2310.02381
  • repo_url: None
  • paper_authors: Xiangru Li, Yifei Zhang, Liang Zhao
  • for: Improving medical image segmentation performance.
  • methods: A fine-tuning framework that exploits SAM's ability to bundle and process multiple prompts per image, fine-tuning SAM's mask decoder by batching bounding boxes for organs and lesions generated from ground-truth masks as reference prompts.
  • results: Substantially enhanced performance metrics across a wide range of segmentation tasks.
    Abstract The Segment Anything Model (SAM) is a powerful foundation model that introduced revolutionary advancements in natural image segmentation. However, its performance remains sub-optimal when delineating the intricate structure of biomedical images, where multiple organs and tissues intertwine in a single image. In this study, we introduce a novel fine-tuning framework that leverages SAM's ability to bundle and process multiple prompts per image and seeks to improve SAM's performance in medical images. We first curated a medical image dataset that consists of CT scans of lesions in various organs, each with two annotations for organs and lesions respectively. Then, we fine-tuned SAM's mask decoder within our framework by batching both bounding boxes generated from ground truth masks as reference. The batched prompt strategy we introduced not only addresses the inherent complexity and ambiguity often found in medical images but also substantially enhances performance metrics when applied onto a wide range of segmentation tasks.
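Below is a minimal sketch of the multi-box prompting that the framework batches during fine-tuning, written against the public segment-anything inference API; the checkpoint name and box coordinates are placeholders, and the paper's actual fine-tuning of the mask decoder is not shown.

```python
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)           # stand-in for a CT slice rendered as RGB
predictor.set_image(image)

boxes = torch.tensor([[50, 60, 200, 220],                  # e.g. organ box (x0, y0, x1, y1)
                      [120, 140, 180, 190]])               # e.g. lesion box
transformed = predictor.transform.apply_boxes_torch(boxes, image.shape[:2])
masks, scores, _ = predictor.predict_torch(
    point_coords=None, point_labels=None, boxes=transformed, multimask_output=False)
print(masks.shape)                                         # (num_boxes, 1, H, W)
```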

DREAM: Visual Decoding from Reversing Human Visual System

  • paper_url: http://arxiv.org/abs/2310.02265
  • repo_url: None
  • paper_authors: Weihao Xia, Raoul de Charette, Cengiz Öztireli, Jing-Hao Xue
  • for: DREAM, an fMRI-to-image method for reconstructing viewed images from brain activity, grounded in knowledge of the human visual system.
  • methods: Reverse pathways that mirror the hierarchical and parallel organization of human vision: a Reverse Visual Association Cortex (R-VAC) component extracts semantics from fMRI data, and a Reverse Parallel PKM (R-PKM) component simultaneously predicts color and depth from fMRI signals.
  • results: Experiments show the method outperforms current state-of-the-art models in consistency of appearance, structure, and semantics.
    Abstract In this work we present DREAM, an fMRI-to-image method for reconstructing viewed images from brain activities, grounded on fundamental knowledge of the human visual system. We craft reverse pathways that emulate the hierarchical and parallel nature of how humans perceive the visual world. These tailored pathways are specialized to decipher semantics, color, and depth cues from fMRI data, mirroring the forward pathways from visual stimuli to fMRI recordings. To do so, two components mimic the inverse processes within the human visual system: the Reverse Visual Association Cortex (R-VAC) which reverses pathways of this brain region, extracting semantics from fMRI data; the Reverse Parallel PKM (R-PKM) component simultaneously predicting color and depth from fMRI signals. The experiments indicate that our method outperforms the current state-of-the-art models in terms of the consistency of appearance, structure, and semantics. Code will be made publicly available to facilitate further research in this field.

RSRD: A Road Surface Reconstruction Dataset and Benchmark for Safe and Comfortable Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.02262
  • repo_url: None
  • paper_authors: Tong Zhao, Chenfeng Xu, Mingyu Ding, Masayoshi Tomizuka, Wei Zhan, Yintao Wei
  • for: Meeting the growing demands for safety and comfort in intelligent robot systems, particularly autonomous vehicles, where road conditions play a pivotal role in overall driving performance.
  • methods: The Road Surface Reconstruction Dataset (RSRD), a high-resolution, high-precision dataset collected with a specialized platform under diverse driving conditions, containing about 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps, with careful post-processing pipelines to ensure quality.
  • results: A comprehensive benchmark for recovering road profiles through depth estimation and stereo matching built on RSRD; preliminary evaluations with state-of-the-art methods confirm the dataset's value and the difficulty of the task, highlighting RSRD as a resource for advancing techniques such as multi-view stereo toward safe autonomous driving.
    Abstract This paper addresses the growing demands for safety and comfort in intelligent robot systems, particularly autonomous vehicles, where road conditions play a pivotal role in overall driving performance. For example, reconstructing road surfaces helps to enhance the analysis and prediction of vehicle responses for motion planning and control systems. We introduce the Road Surface Reconstruction Dataset (RSRD), a real-world, high-resolution, and high-precision dataset collected with a specialized platform in diverse driving conditions. It covers common road types containing approximately 16,000 pairs of stereo images, original point clouds, and ground-truth depth/disparity maps, with accurate post-processing pipelines to ensure its quality. Based on RSRD, we further build a comprehensive benchmark for recovering road profiles through depth estimation and stereo matching. Preliminary evaluations with various state-of-the-art methods reveal the effectiveness of our dataset and the challenge of the task, underscoring substantial opportunities of RSRD as a valuable resource for advancing techniques, e.g., multi-view stereo towards safe autonomous driving. The dataset and demo videos are available at https://thu-rsxd.com/rsrd/
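Since the benchmark recovers road profiles via stereo matching and depth estimation, a predicted disparity map is converted to metric depth with the standard pinhole relation depth = focal_length × baseline / disparity. A small sketch with placeholder calibration values (not RSRD's actual camera parameters):

```python
import numpy as np

def disparity_to_depth(disparity: np.ndarray, focal_px: float, baseline_m: float) -> np.ndarray:
    """Standard pinhole stereo relation: depth = f * B / d. Zero disparity is masked out."""
    depth = np.full_like(disparity, np.nan, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

disp = np.random.uniform(1.0, 64.0, size=(8, 8))                     # toy disparity map in pixels
depth = disparity_to_depth(disp, focal_px=1000.0, baseline_m=0.12)   # placeholder calibration
print(depth.min(), depth.max())
```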

Talk2BEV: Language-enhanced Bird’s-eye View Maps for Autonomous Driving

  • paper_url: http://arxiv.org/abs/2310.02251
  • repo_url: None
  • paper_authors: Vikrant Dewangan, Tushar Choudhary, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K. Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, K. Madhava Krishna
  • for: Providing a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts.
  • methods: Blends recent general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models.
  • results: Extensive evaluation on a large number of scene understanding tasks shows that Talk2BEV can perform visual and spatial reasoning, predict the intents of traffic actors, and make decisions based on visual cues in autonomous driving scenarios.
    Abstract Talk2BEV is a large vision-language model (LVLM) interface for bird's-eye view (BEV) maps in autonomous driving contexts. While existing perception systems for autonomous driving scenarios have largely focused on a pre-defined (closed) set of object categories and driving scenarios, Talk2BEV blends recent advances in general-purpose language and vision models with BEV-structured map representations, eliminating the need for task-specific models. This enables a single system to cater to a variety of autonomous driving tasks encompassing visual and spatial reasoning, predicting the intents of traffic actors, and decision-making based on visual cues. We extensively evaluate Talk2BEV on a large number of scene understanding tasks that rely on both the ability to interpret free-form natural language queries, and in grounding these queries to the visual context embedded into the language-enhanced BEV map. To enable further research in LVLMs for autonomous driving scenarios, we develop and release Talk2BEV-Bench, a benchmark encompassing 1000 human-annotated BEV scenarios, with more than 20,000 questions and ground-truth responses from the NuScenes dataset.

Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models

  • paper_url: http://arxiv.org/abs/2310.02242
  • repo_url: https://github.com/zju3dv/hghoi
  • paper_authors: Huaijin Pi, Sida Peng, Minghui Yang, Xiaowei Zhou, Hujun Bao
  • for: Addressing the challenge of synthesizing long-range and diverse human-object interaction motions, which existing auto-regressive models and path-planning-based methods cannot fulfill.
  • methods: A hierarchical generation framework that first generates a set of milestones and then synthesizes the motion along them, reducing long-range motion generation to synthesizing several short motion sequences guided by the milestones.
  • results: Experiments on the NSM, COUCH, and SAMP datasets show the approach outperforms previous methods by a large margin in both quality and diversity.
    Abstract This paper presents a novel approach to generating the 3D motion of a human interacting with a target object, with a focus on solving the challenge of synthesizing long-range and diverse motions, which could not be fulfilled by existing auto-regressive models or path planning-based methods. We propose a hierarchical generation framework to solve this challenge. Specifically, our framework first generates a set of milestones and then synthesizes the motion along them. Therefore, the long-range motion generation could be reduced to synthesizing several short motion sequences guided by milestones. The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity. The source code is available on our project page https://zju3dv.github.io/hghoi.

Learnable Data Augmentation for One-Shot Unsupervised Domain Adaptation

  • paper_url: http://arxiv.org/abs/2310.02201
  • repo_url: https://github.com/iit-pavis/learnaug-uda
  • paper_authors: Julio Ivan Davila Carrazco, Pietro Morerio, Alessio Del Bue, Vittorio Murino
  • for: Solving the most challenging setting in Domain Adaptation, where only one single unlabeled target sample is available for model adaptation.
  • methods: Propose a classification framework based on learnable data augmentation, which transforms source data into a form similar to the target data, enabling a classifier trained on the augmented data to generalize well to the target domain.
  • results: Achieve state-of-the-art performance on two well-known Domain Adaptation benchmarks, DomainNet and VisDA.
    Abstract This paper presents a classification framework based on learnable data augmentation to tackle the One-Shot Unsupervised Domain Adaptation (OS-UDA) problem. OS-UDA is the most challenging setting in Domain Adaptation, as only one single unlabeled target sample is assumed to be available for model adaptation. Driven by such single sample, our method LearnAug-UDA learns how to augment source data, making it perceptually similar to the target. As a result, a classifier trained on such augmented data will generalize well for the target domain. To achieve this, we designed an encoder-decoder architecture that exploits a perceptual loss and style transfer strategies to augment the source data. Our method achieves state-of-the-art performance on two well-known Domain Adaptation benchmarks, DomainNet and VisDA. The project code is available at https://github.com/IIT-PAVIS/LearnAug-UDA

PAD-Phys: Exploiting Physiology for Presentation Attack Detection in Face Biometrics

  • paper_url: http://arxiv.org/abs/2310.02140
  • repo_url: None
  • paper_authors: Luis F. Gomez, Julian Fierrez, Aythami Morales, Mahdi Ghafourian, Ruben Tolosana, Imanol Solano, Alejandro Garcia, Francisco Zamora-Martinez
  • for: Preventing leakage of personal information and identity spoofing in face recognition systems (presentation attack detection).
  • methods: Pulse detection based on remote photoplethysmography (rPPG), explored in three settings: a physiological domain using rPPG-based models, a Deepfakes domain where those models are retrained for Deepfakes detection, and a new presentation attack domain trained via transfer learning from the previous two domains.
  • results: The rPPG-based models are effective for presentation attack detection, with a 21.70% decrease in average classification error rate (ACER), from 41.03% to 19.32%, when the presentation attack domain is compared to the physiological and Deepfakes domains.
    Abstract Presentation Attack Detection (PAD) is a crucial stage in facial recognition systems to avoid leakage of personal information or spoofing of identity to entities. Recently, pulse detection based on remote photoplethysmography (rPPG) has been shown to be effective in face presentation attack detection. This work presents three different approaches to the presentation attack detection based on rPPG: (i) The physiological domain, a domain using rPPG-based models, (ii) the Deepfakes domain, a domain where models were retrained from the physiological domain to specific Deepfakes detection tasks; and (iii) a new Presentation Attack domain was trained by applying transfer learning from the two previous domains to improve the capability to differentiate between bona-fides and attacks. The results show the efficiency of the rPPG-based models for presentation attack detection, evidencing a 21.70% decrease in average classification error rate (ACER) (from 41.03% to 19.32%) when the presentation attack domain is compared to the physiological and Deepfakes domains. Our experiments highlight the efficiency of transfer learning in rPPG-based models and perform well in presentation attack detection in instruments that do not allow copying of this physiological feature.

SIEVE: Multimodal Dataset Pruning Using Image Captioning Models

  • paper_url: http://arxiv.org/abs/2310.02110
  • repo_url: None
  • paper_authors: Anas Mahmoud, Mostafa Elhoushi, Amro Abbas, Yu Yang, Newsha Ardalani, Hugh Leather, Ari Morcos
  • for: A new data-pruning method to improve the performance of vision-language models (VLMs).
  • methods: SIEVE uses synthetic captions generated by image-captioning models pretrained on small, diverse, well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs, estimating semantic textual similarity between generated captions and alt-text in the embedding space of a language model; it is evaluated on the DataComp multimodal dataset-filtering benchmark.
  • results: State-of-the-art performance on the large-scale pool and competitive results on the medium-scale pool, surpassing CLIPScore-based filtering by 1.7% and 2.6% on average across 38 downstream tasks.
    Abstract Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets. This underscores the critical need for dataset pruning, as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning.We argue that this approach suffers from multiple limitations including: 1) false positives due to spurious correlations captured by the pretrained CLIP model, 2) false negatives due to poor discrimination between hard and bad samples, and 3) biased ranking towards samples similar to the pretrained CLIP dataset. We propose a pruning method, SIEVE, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text), we estimate the semantic textual similarity in the embedding space of a language model pretrained on billions of sentences. Using DataComp, a multimodal dataset filtering benchmark, we achieve state-of-the-art performance on the large scale pool, and competitive results on the medium scale pool, surpassing CLIPScore-based filtering by 1.7% and 2.6% on average, on 38 downstream tasks.
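A minimal sketch of the caption/alt-text alignment idea: embed the generated caption and the web alt-text with a sentence-embedding model and use their cosine similarity as the pruning score. The model name and the keep/drop threshold below are illustrative assumptions, not SIEVE's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def alignment_score(generated_caption: str, alt_text: str) -> float:
    """Cosine similarity between the generated caption and the alt-text embeddings."""
    emb = model.encode([generated_caption, alt_text],
                       convert_to_tensor=True, normalize_embeddings=True)
    return float(util.cos_sim(emb[0], emb[1]))

score = alignment_score("a brown dog running on a beach", "my dog enjoying the sea last summer")
keep_sample = score > 0.3   # hypothetical pruning threshold
print(score, keep_sample)
```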

Leveraging Classic Deconvolution and Feature Extraction in Zero-Shot Image Restoration

  • paper_url: http://arxiv.org/abs/2310.02097
  • repo_url: https://github.com/ctom2/cider
  • paper_authors: Tomáš Chobola, Gesine Müller, Veit Dausmann, Anton Theileis, Jan Taucher, Jan Huisken, Tingying Peng
  • for: Non-blind image deconvolution without requiring sharp ground-truth images for supervision.
  • methods: Combines deep learning with classic iterative deconvolution: a pre-trained network extracts deep features from the input image, iterative Richardson-Lucy deconvolution steps are applied, and a zero-shot optimisation process integrates the deconvolved features into a high-quality reconstruction.
  • results: Significant improvements in various real-world non-blind deconvolution applications.
    Abstract Non-blind deconvolution aims to restore a sharp image from its blurred counterpart given an obtained kernel. Existing deep neural architectures are often built based on large datasets of sharp ground truth images and trained with supervision. Sharp, high quality ground truth images, however, are not always available, especially for biomedical applications. This severely hampers the applicability of current approaches in practice. In this paper, we propose a novel non-blind deconvolution method that leverages the power of deep learning and classic iterative deconvolution algorithms. Our approach combines a pre-trained network to extract deep features from the input image with iterative Richardson-Lucy deconvolution steps. Subsequently, a zero-shot optimisation process is employed to integrate the deconvolved features, resulting in a high-quality reconstructed image. By performing the preliminary reconstruction with the classic iterative deconvolution method, we can effectively utilise a smaller network to produce the final image, thus accelerating the reconstruction whilst reducing the demand for valuable computational resources. Our method demonstrates significant improvements in various real-world applications non-blind deconvolution tasks.
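The classic iterative step the method builds on is Richardson-Lucy deconvolution, available as a reference implementation in scikit-image. The sketch below shows only that preliminary reconstruction on a toy blur; the paper's deep-feature extraction and zero-shot fusion stages are not reproduced here.

```python
import numpy as np
from scipy.signal import convolve2d
from skimage import data, restoration

image = data.camera().astype(float) / 255.0
psf = np.ones((5, 5)) / 25.0                      # toy blur kernel (assumed known, i.e. non-blind)
blurred = convolve2d(image, psf, mode="same", boundary="symm")

# Classic iterative deconvolution: a small number of Richardson-Lucy iterations gives the
# preliminary reconstruction that a learned refinement stage would then operate on.
preliminary = restoration.richardson_lucy(blurred, psf, num_iter=30)
print(preliminary.shape, preliminary.min(), preliminary.max())
```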

Global Attractor for a Reaction-Diffusion Model Arising in Biological Dynamic in 3D Soil Structure

  • paper_url: http://arxiv.org/abs/2310.02060
  • repo_url: None
  • paper_authors: Mohamed Elghandouri, Khalil Ezzinbi, Mouad Klai, Olivier Monga
  • for: Modeling the complex dynamics of microbial activity within 3D soil structures with partial differential equations (PDEs), providing valuable understanding for biology.
  • methods: Analysis of the existence and uniqueness of solutions and of the asymptotic behavior of the corresponding PDE model, complemented by numerical simulations.
  • results: Discovery of a global attractor, a fundamental feature with significant implications for long-term system behavior; numerical simulations visually illustrate its attributes.
    Abstract Partial Differential Equations (PDEs) play a crucial role as tools for modeling and comprehending intricate natural processes, notably within the domain of biology. This research explores the domain of microbial activity within the complex matrix of 3D soil structures, providing valuable understanding into both the existence and uniqueness of solutions and the asymptotic behavior of the corresponding PDE model. Our investigation results in the discovery of a global attractor, a fundamental feature with significant implications for long-term system behavior. To enhance the clarity of our findings, numerical simulations are employed to visually illustrate the attributes of this global attractor.
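As a generic illustration of the kind of model being analyzed, the snippet below takes explicit finite-difference steps of a reaction-diffusion equation u_t = D Δu + f(u) on a 2D grid with periodic boundaries. The reaction term, parameters, and geometry are placeholders; the paper's actual 3D soil-structure model is not specified in the abstract.

```python
import numpy as np

def step(u: np.ndarray, D: float = 0.1, dt: float = 0.01, dx: float = 1.0) -> np.ndarray:
    """One explicit Euler step of u_t = D * Laplacian(u) + f(u) with periodic boundaries."""
    lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
           np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u) / dx**2
    reaction = u * (1.0 - u)                 # illustrative logistic growth term
    return u + dt * (D * lap + reaction)

u = np.random.rand(64, 64)
for _ in range(100):
    u = step(u)
print(u.mean())                              # the state drifts toward the attracting regime
```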

Exploring Generalisability of Self-Distillation with No Labels for SAR-Based Vegetation Prediction

  • paper_url: http://arxiv.org/abs/2310.02048
  • repo_url: None
  • paper_authors: Laura Martínez-Ferrer, Anna Jungbluth, Joseph A. Gallego-Mejia, Matt Allen, Francisco Dorr, Freddie Kalaitzis, Raúl Ramos-Pollán
  • for: Pre-training a DINO-ViT based model on two Synthetic Aperture Radar datasets (S1GRD or GSSIC) across three regions (China, Conus, Europe) and fine-tuning it to predict vegetation percentage.
  • methods: The pre-trained models are fine-tuned on smaller labeled datasets, and the connection between the models' embedding spaces and their ability to generalize across diverse geographic regions and to unseen data is studied empirically.
  • results: For S1GRD, embedding spaces of different regions are clearly separated, while GSSIC's overlap; positional patterns remain during fine-tuning, and greater embedding distances often lead to higher errors for unfamiliar regions, improving our understanding of the generalizability of self-supervised models in remote sensing.
    Abstract In this work we pre-train a DINO-ViT based model using two Synthetic Aperture Radar datasets (S1GRD or GSSIC) across three regions (China, Conus, Europe). We fine-tune the models on smaller labeled datasets to predict vegetation percentage, and empirically study the connection between the embedding space of the models and their ability to generalize across diverse geographic regions and to unseen data. For S1GRD, embedding spaces of different regions are clearly separated, while GSSIC's overlaps. Positional patterns remain during fine-tuning, and greater distances in embeddings often result in higher errors for unfamiliar regions. With this, our work increases our understanding of generalizability for self-supervised models applied to remote sensing.

Video Transformers under Occlusion: How Physics and Background Attributes Impact Large Models for Robotic Manipulation

  • paper_url: http://arxiv.org/abs/2310.02044
  • repo_url: https://github.com/shutongjin/occlumanip
  • paper_authors: Shutong Jin, Ruiyu Wang, Muhammad Zahid, Florian T. Pokorny
  • for: Investigating how object physics attributes (color, friction coefficient, shape) and background characteristics (static, dynamic, background complexity) influence the performance of Video Transformers in trajectory prediction tasks under occlusion.
  • methods: OccluManip, a real-world video-based robot pushing dataset of 460,000 consistent recordings of objects with different physics and varying backgrounds, together with Video Occlusion Transformer (VOT), a generic video-transformer-based network achieving an average 96% accuracy across all 18 OccluManip sub-datasets.
  • results: Both object physics attributes and background characteristics significantly influence Video Transformer performance, with possible causes of degraded performance discussed; the study also identifies a data saturation point beyond which large transformer models stop improving within a single task.
    Abstract As transformer architectures and dataset sizes continue to scale, the need to understand the specific dataset factors affecting model performance becomes increasingly urgent. This paper investigates how object physics attributes (color, friction coefficient, shape) and background characteristics (static, dynamic, background complexity) influence the performance of Video Transformers in trajectory prediction tasks under occlusion. Beyond mere occlusion challenges, this study aims to investigate three questions: How do object physics attributes and background characteristics influence the model performance? What kinds of attributes are most influential to the model generalization? Is there a data saturation point for large transformer model performance within a single task? To facilitate this research, we present OccluManip, a real-world video-based robot pushing dataset comprising 460,000 consistent recordings of objects with different physics and varying backgrounds. 1.4 TB and in total 1278 hours of high-quality videos of flexible temporal length along with target object trajectories are collected, accommodating tasks with different temporal requirements. Additionally, we propose Video Occlusion Transformer (VOT), a generic video-transformer-based network achieving an average 96% accuracy across all 18 sub-datasets provided in OccluManip. OccluManip and VOT will be released at: https://github.com/ShutongJIN/OccluManip.git

Decoding Human Activities: Analyzing Wearable Accelerometer and Gyroscope Data for Activity Recognition

  • paper_url: http://arxiv.org/abs/2310.02011
  • repo_url: None
  • paper_authors: Utsab Saha, Sawradip Saha, Tahmid Kabir, Shaikh Anowarul Fattah, Mohammad Saquib
  • for: A stratified multi-structural approach, ensembling a Residual network with a Residual MobileNet, for classifying human activities from wearable accelerometer and gyroscope data.
  • methods: Carefully designed residual blocks classify static and dynamic activities separately, since the two superclasses have clear and distinct characteristics; the two independently trained ResNets are then combined through a weighted-ensemble Residual MobileNet that discriminates between specific static and dynamic activities.
  • results: Evaluated on two public datasets, UCI HAR and Motion-Sense, the approach handles highly confusing cases of data overlap and achieves state-of-the-art accuracies of 96.71% and 95.35%, respectively.
    Abstract A person's movement or relative positioning effectively generates raw electrical signals that can be read by computing machines to apply various manipulative techniques for the classification of different human activities. In this paper, a stratified multi-structural approach based on a Residual network ensembled with Residual MobileNet is proposed, termed as FusionActNet. The proposed method involves using carefully designed Residual blocks for classifying the static and dynamic activities separately because they have clear and distinct characteristics that set them apart. These networks are trained independently, resulting in two specialized and highly accurate models. These models excel at recognizing activities within a specific superclass by taking advantage of the unique algorithmic benefits of architectural adjustments. Afterward, these two ResNets are passed through a weighted ensemble-based Residual MobileNet. Subsequently, this ensemble proficiently discriminates between a specific static and a specific dynamic activity, which were previously identified based on their distinct feature characteristics in the earlier stage. The proposed model is evaluated using two publicly accessible datasets; namely, UCI HAR and Motion-Sense. Therein, it successfully handled the highly confusing cases of data overlap. Therefore, the proposed approach achieves a state-of-the-art accuracy of 96.71% and 95.35% in the UCI HAR and Motion-Sense datasets respectively.
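A minimal sketch of the weighted-ensemble idea: two specialized classifiers (for example, one tuned for static and one for dynamic activities) are combined through a learnable weighting of their softmax outputs. The toy 1D-CNN experts, layer sizes, and mixing scheme are assumptions, not FusionActNet's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedEnsemble(nn.Module):
    def __init__(self, model_a: nn.Module, model_b: nn.Module):
        super().__init__()
        self.model_a, self.model_b = model_a, model_b
        self.w = nn.Parameter(torch.zeros(2))          # learnable mixing weights

    def forward(self, x):
        pa = F.softmax(self.model_a(x), dim=1)
        pb = F.softmax(self.model_b(x), dim=1)
        alpha = F.softmax(self.w, dim=0)               # convex combination of the two experts
        return alpha[0] * pa + alpha[1] * pb

def make_expert(num_classes: int = 6):
    # toy 1D-CNN over 6-channel windows (3-axis accelerometer + 3-axis gyroscope)
    return nn.Sequential(nn.Conv1d(6, 16, 5, padding=2), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, num_classes))

ensemble = WeightedEnsemble(make_expert(), make_expert())
print(ensemble(torch.randn(4, 6, 128)).shape)          # (batch, num_classes)
```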

MUSCLE: Multi-task Self-supervised Continual Learning to Pre-train Deep Models for X-ray Images of Multiple Body Parts

  • paper_url: http://arxiv.org/abs/2310.02000
  • repo_url: None
  • paper_authors: Weibin Liao, Haoyi Xiong, Qingzhong Wang, Yan Mo, Xuhong Li, Yi Liu, Zeyu Chen, Siyu Huang, Dejing Dou
  • for: Improving representation learning for X-ray image analysis via a Multi-task Self-supervised Continual Learning (MUSCLE) pre-training pipeline over X-ray images of multiple body parts.
  • methods: MoCo-based representation learning on X-rays aggregated from multiple body parts, followed by a well-designed continual learning (CL) procedure that further pre-trains the backbone jointly on various X-ray analysis tasks, with image pre-processing, learning schedules, and regularization strategies to address data heterogeneity, overfitting, and catastrophic forgetting.
  • results: Evaluated on 9 real-world X-ray datasets; comparisons against other pre-trained models confirm the proof-of-concept that self-supervised multi-task/dataset continual pre-training can boost the performance of X-ray image analysis.
    Abstract While self-supervised learning (SSL) algorithms have been widely used to pre-train deep models, few efforts [11] have been done to improve representation learning of X-ray image analysis with SSL pre-trained models. In this work, we study a novel self-supervised pre-training pipeline, namely Multi-task Self-super-vised Continual Learning (MUSCLE), for multiple medical imaging tasks, such as classification and segmentation, using X-ray images collected from multiple body parts, including heads, lungs, and bones. Specifically, MUSCLE aggregates X-rays collected from multiple body parts for MoCo-based representation learning, and adopts a well-designed continual learning (CL) procedure to further pre-train the backbone subject various X-ray analysis tasks jointly. Certain strategies for image pre-processing, learning schedules, and regularization have been used to solve data heterogeneity, overfitting, and catastrophic forgetting problems for multi-task/dataset learning in MUSCLE.We evaluate MUSCLE using 9 real-world X-ray datasets with various tasks, including pneumonia classification, skeletal abnormality classification, lung segmentation, and tuberculosis (TB) detection. Comparisons against other pre-trained models [7] confirm the proof-of-concept that self-supervised multi-task/dataset continual pre-training could boost the performance of X-ray image analysis.

Understanding Masked Autoencoders From a Local Contrastive Perspective

  • paper_url: http://arxiv.org/abs/2310.01994
  • repo_url: None
  • paper_authors: Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou, Wanli Ouyang
  • for: Exploring the internal mechanisms behind the Masked AutoEncoder (MAE) in self-supervised learning and what drives its rich hidden representations.
  • methods: An in-depth analysis of MAE's generative pretraining pathway, whose encoder-decoder architecture reconstructs images from aggressive masking; the decoder is found to mainly learn local features with a limited receptive field, adhering to the Locality Principle, and a theoretical framework reformulates reconstruction-based MAE as a local region-level form of contrastive learning.
  • results: The analysis substantiates MAE's local contrastive nature, and a Siamese architecture combining the essence of MAE and contrastive learning, without masking or an explicit decoder, points toward a unified and more flexible self-supervised learning framework.
    Abstract Masked AutoEncoder(MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies. However, despite achieving state-of-the-art performance across various downstream vision tasks, the underlying mechanisms that drive MAE's efficacy are less well-explored compared to the canonical contrastive learning paradigm. In this paper, we explore a new perspective to explain what truly contributes to the "rich hidden representations inside the MAE". Firstly, concerning MAE's generative pretraining pathway, with a unique encoder-decoder architecture to reconstruct images from aggressive masking, we conduct an in-depth analysis of the decoder's behaviors. We empirically find that MAE's decoder mainly learns local features with a limited receptive field, adhering to the well-known Locality Principle. Building upon this locality assumption, we propose a theoretical framework that reformulates the reconstruction-based MAE into a local region-level contrastive learning form for improved understanding. Furthermore, to substantiate the local contrastive nature of MAE, we introduce a Siamese architecture that combines the essence of MAE and contrastive learning without masking and explicit decoder, which sheds light on a unified and more flexible self-supervised learning framework.

Development of Machine Vision Approach for Mechanical Component Identification based on its Dimension and Pitch

  • paper_url: http://arxiv.org/abs/2310.01995
  • repo_url: None
  • paper_authors: Toshit Jain, Faisel Mushtaq, K Ramesh, Sandip Deshmukh, Tathagata Ray, Chandu Parimi, Praveen Tandon, Pramod Kumar Jha
  • for: automation of mechanical assembly lines
  • methods: a novel method of calculating bolt pitch and identifying bolts, implemented as a lightweight and fast system
  • results: correct identification of parts with 98% accuracy
    Abstract In this work, a highly customizable and scalable vision based system for automation of mechanical assembly lines is described. The proposed system calculates the features that are required to classify and identify the different kinds of bolts that are used in the assembly line. The system describes a novel method of calculating the pitch of the bolt in addition to bolt identification and calculating the dimensions of the bolts. This identification and classification system is extremely lightweight and can be run on bare minimum hardware. The system is very fast in the order of milliseconds, hence the system can be used successfully even if the components are steadily moving on a conveyor. The results show that our system can correctly identify the parts in our dataset with 98% accuracy using the calculated features.
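A hypothetical illustration of pitch estimation from a bolt's thread profile: crests appear as periodic peaks along the bolt axis, so the pitch is roughly the median peak spacing times a known pixel scale. This is a sketch of the general idea only, not the paper's algorithm.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_pitch(edge_profile: np.ndarray, mm_per_pixel: float) -> float:
    """edge_profile: 1-D signal of the thread silhouette sampled along the bolt axis."""
    peaks, _ = find_peaks(edge_profile, distance=3)
    if len(peaks) < 2:
        raise ValueError("not enough thread crests detected")
    return float(np.median(np.diff(peaks))) * mm_per_pixel

# Synthetic profile: an M8-like thread (1.25 mm pitch) sampled at 0.05 mm/pixel
axis = np.arange(0, 1000)
profile = np.sin(2 * np.pi * axis * 0.05 / 1.25)
print(round(estimate_pitch(profile, mm_per_pixel=0.05), 2))   # ~1.25
```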

CoralVOS: Dataset and Benchmark for Coral Video Segmentation

  • paper_url: http://arxiv.org/abs/2310.01946
  • repo_url: https://github.com/zhengziqiang/CoralVOS
  • paper_authors: Zheng Ziqiang, Xie Yaofeng, Liang Haixin, Yu Zhibin, Sai-Kit Yeung
  • for: Improving the efficiency and reliability of coral reef analysis by providing a large-scale coral video segmentation dataset (CoralVOS) that supports dense coral video segmentation.
  • methods: Six recent video object segmentation (VOS) algorithms are fine-tuned on the CoralVOS dataset and benchmarked for dense coral segmentation.
  • results: Fine-tuning the VOS algorithms on CoralVOS yields observable improvements in segmentation accuracy, while substantial room for further gains remains.
    Abstract Coral reefs formulate the most valuable and productive marine ecosystems, providing habitat for many marine species. Coral reef surveying and analysis are currently confined to coral experts who invest substantial effort in generating comprehensive and dependable reports (\emph{e.g.}, coral coverage, population, spatial distribution, \textit{etc}), from the collected survey data. However, performing dense coral analysis based on manual efforts is significantly time-consuming, the existing coral analysis algorithms compromise and opt for performing down-sampling and only conducting sparse point-based coral analysis within selected frames. However, such down-sampling will \textbf{inevitable} introduce the estimation bias or even lead to wrong results. To address this issue, we propose to perform \textbf{dense coral video segmentation}, with no down-sampling involved. Through video object segmentation, we could generate more \textit{reliable} and \textit{in-depth} coral analysis than the existing coral reef analysis algorithms. To boost such dense coral analysis, we propose a large-scale coral video segmentation dataset: \textbf{CoralVOS} as demonstrated in Fig. 1. To the best of our knowledge, our CoralVOS is the first dataset and benchmark supporting dense coral video segmentation. We perform experiments on our CoralVOS dataset, including 6 recent state-of-the-art video object segmentation (VOS) algorithms. We fine-tuned these VOS algorithms on our CoralVOS dataset and achieved observable performance improvement. The results show that there is still great potential for further promoting the segmentation accuracy. The dataset and trained models will be released with the acceptance of this work to foster the coral reef research community.

OOD Aware Supervised Contrastive Learning

  • paper_url: http://arxiv.org/abs/2310.01942
  • repo_url: None
  • paper_authors: Soroush Seifi, Daniel Olmeda Reino, Nikolay Chumerin, Rahaf Aljundi
  • for: This paper proposes an out-of-distribution (OOD) detection method based on supervised contrastive learning, ensuring that deployed machine learning models can correctly identify OOD data.
  • methods: The supervised contrastive (SupCon) loss is extended with two additional contrast terms. The first term pushes auxiliary OOD representations away from ID representations without imposing any constraints on similarities among the auxiliary data. The second term pushes OOD features away from the existing class prototypes while pulling ID representations closer to their corresponding class prototype. When auxiliary OOD data is unavailable, an efficient feature-mixing technique generates pseudo-OOD features.
  • results: The authors compare against various OOD detection methods and show state-of-the-art results on common benchmarks.
    Abstract Out-of-Distribution (OOD) detection is a crucial problem for the safe deployment of machine learning models identifying samples that fall outside of the training distribution, i.e. in-distribution data (ID). Most OOD works focus on the classification models trained with Cross Entropy (CE) and attempt to fix its inherent issues. In this work we leverage powerful representation learned with Supervised Contrastive (SupCon) training and propose a holistic approach to learn a classifier robust to OOD data. We extend SupCon loss with two additional contrast terms. The first term pushes auxiliary OOD representations away from ID representations without imposing any constraints on similarities among auxiliary data. The second term pushes OOD features far from the existing class prototypes, while pushing ID representations closer to their corresponding class prototype. When auxiliary OOD data is not available, we propose feature mixing techniques to efficiently generate pseudo-OOD features. Our solution is simple and efficient and acts as a natural extension of the closed-set supervised contrastive representation learning. We compare against different OOD detection methods on the common benchmarks and show state-of-the-art results.
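The abstract describes the two extra contrast terms only in words; below is a hedged PyTorch sketch of one way such terms (and a simple feature-mixing pseudo-OOD generator) could look. The exact loss form, temperature and function names are assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def ood_aware_terms(z_id, y_id, z_ood, prototypes, tau=0.1):
    """Two auxiliary terms sketched from the paper's description.

    z_id:       (N, D) L2-normalized ID embeddings
    y_id:       (N,)   ID labels
    z_ood:      (M, D) L2-normalized auxiliary / pseudo-OOD embeddings
    prototypes: (C, D) L2-normalized class prototypes
    """
    # Term 1: push OOD embeddings away from every ID embedding.
    sim_ood_id = z_ood @ z_id.t() / tau                      # (M, N)
    loss_push = torch.logsumexp(sim_ood_id, dim=1).mean()

    # Term 2: ID embeddings attracted to their own prototype,
    # OOD embeddings repelled from all prototypes.
    sim_id_proto = z_id @ prototypes.t() / tau               # (N, C)
    loss_id = F.cross_entropy(sim_id_proto, y_id)
    sim_ood_proto = z_ood @ prototypes.t() / tau             # (M, C)
    loss_ood = torch.logsumexp(sim_ood_proto, dim=1).mean()
    return loss_push + loss_id + loss_ood

def pseudo_ood(z_id, alpha=0.5):
    """Pseudo-OOD features by mixing pairs of ID embeddings
    (a simple stand-in for the paper's feature-mixing idea)."""
    perm = torch.randperm(z_id.size(0))
    return F.normalize(alpha * z_id + (1 - alpha) * z_id[perm], dim=1)

# Toy usage with random embeddings.
z_id = F.normalize(torch.randn(16, 128), dim=1)
y_id = torch.randint(0, 10, (16,))
protos = F.normalize(torch.randn(10, 128), dim=1)
print(ood_aware_terms(z_id, y_id, pseudo_ood(z_id), protos))
```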

Constructing Image-Text Pair Dataset from Books

  • paper_url: http://arxiv.org/abs/2310.01936
  • repo_url: None
  • paper_authors: Yamato Okamoto, Haruto Toyonaga, Yoshihisa Ijiri, Hirokatsu Kataoka
  • for: This paper aims to leverage digital archives for machine learning, with the potential to uncover unknown insights and acquire knowledge autonomously.
  • methods: The proposed approach uses an optical character reader (OCR), an object detector, and a layout analyzer to construct an image-text pair dataset.
  • results: The authors apply their pipeline to old photo books to demonstrate the effectiveness of the approach in image-text retrieval and insight extraction.
    Abstract Digital archiving is becoming widespread owing to its effectiveness in protecting valuable books and providing knowledge to many people electronically. In this paper, we propose a novel approach to leverage digital archives for machine learning. If we can fully utilize such digitized data, machine learning has the potential to uncover unknown insights and ultimately acquire knowledge autonomously, just like humans read books. As a first step, we design a dataset construction pipeline comprising an optical character reader (OCR), an object detector, and a layout analyzer for the autonomous extraction of image-text pairs. In our experiments, we apply our pipeline on old photo books to construct an image-text pair dataset, showing its effectiveness in image-text retrieval and insight extraction.
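As a rough illustration of the described pipeline (OCR + object detector + layout analyzer producing image-text pairs), the Python sketch below wires three hypothetical components together with a nearest-caption matching rule; all interfaces and the matching heuristic are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Box:
    x0: float; y0: float; x1: float; y1: float
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)
    def contains(self, p):
        return self.x0 <= p[0] <= self.x1 and self.y0 <= p[1] <= self.y1

def build_pairs(page,
                run_ocr: Callable,         # page -> [(Box, text)]
                detect_figures: Callable,  # page -> [Box]
                analyze_layout: Callable   # page -> [(Box, "caption" | "body")]
                ) -> List[Tuple[Box, str]]:
    """Pair each detected figure with the OCR text of its nearest caption block."""
    words = run_ocr(page)
    figures = detect_figures(page)
    captions = [b for b, kind in analyze_layout(page) if kind == "caption"]

    pairs = []
    for fig in figures:
        if not captions:
            break
        fx, fy = fig.center()
        # Nearest caption block by center distance (a simple matching rule).
        cap = min(captions, key=lambda c: (c.center()[0] - fx) ** 2
                                          + (c.center()[1] - fy) ** 2)
        text = " ".join(t for b, t in words if cap.contains(b.center()))
        if text:
            pairs.append((fig, text))
    return pairs

# Toy usage with hand-made outputs standing in for the three components.
pairs = build_pairs(
    None,
    run_ocr=lambda p: [(Box(10, 95, 40, 100), "A"), (Box(45, 95, 80, 100), "temple.")],
    detect_figures=lambda p: [Box(0, 0, 100, 90)],
    analyze_layout=lambda p: [(Box(0, 92, 100, 102), "caption")],
)
print(pairs[0][1])   # "A temple."
```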

Robust deformable image registration using cycle-consistent implicit representations

  • paper_url: http://arxiv.org/abs/2310.01934
  • repo_url: https://github.com/louisvh/cycle_consistent_inr
  • paper_authors: Louis D. van Harten, Jaap Stoker, Ivana Išgum
  • for: This paper aims to improve the robustness and reliability of deformable medical image registration.
  • methods: The method optimizes pairs of cycle-consistent implicit neural representations for each new image pair: each implicit representation is linked to a second one that estimates the opposite transformation, so each network regularizes its paired opposite. At inference, multiple deformation estimates are generated by numerically inverting the paired backward transformation and evaluating the consensus of the optimized pair.
  • results: The method improves registration accuracy and provides a robust uncertainty metric usable for automatic quality control. On a 4D lung CT dataset, it reduces the optimization failure rate from 2.4% to 0.0%, improves landmark accuracy by 4.5%, and detects all cases where registration fails to converge to a correct solution. On a centerline propagation task in abdominal 4D MRI, it improves propagation consistency by 46% over single-INR registration, with the uncertainty metric correlating strongly with registration accuracy.
    Abstract Recent works in medical image registration have proposed the use of Implicit Neural Representations, demonstrating performance that rivals state-of-the-art learning-based methods. However, these implicit representations need to be optimized for each new image pair, which is a stochastic process that may fail to converge to a global minimum. To improve robustness, we propose a deformable registration method using pairs of cycle-consistent Implicit Neural Representations: each implicit representation is linked to a second implicit representation that estimates the opposite transformation, causing each network to act as a regularizer for its paired opposite. During inference, we generate multiple deformation estimates by numerically inverting the paired backward transformation and evaluating the consensus of the optimized pair. This consensus improves registration accuracy over using a single representation and results in a robust uncertainty metric that can be used for automatic quality control. We evaluate our method with a 4D lung CT dataset. The proposed cycle-consistent optimization method reduces the optimization failure rate from 2.4% to 0.0% compared to the current state-of-the-art. The proposed inference method improves landmark accuracy by 4.5% and the proposed uncertainty metric detects all instances where the registration method fails to converge to a correct solution. We verify the generalizability of these results to other data using a centerline propagation task in abdominal 4D MRI, where our method achieves a 46% improvement in propagation consistency compared with single-INR registration and demonstrates a strong correlation between the proposed uncertainty metric and registration accuracy.
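A minimal PyTorch sketch of the cycle-consistency idea, assuming a coordinate-MLP displacement field and an L2 cycle penalty; the image-similarity terms and the actual INR architecture used in the paper are omitted, so this is an illustration rather than the authors' method.

```python
import torch
import torch.nn as nn

class DisplacementField(nn.Module):
    """Tiny coordinate MLP: x in R^3 -> displacement in R^3 (a stand-in INR)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))
    def forward(self, x):
        return self.net(x)

def cycle_consistency_loss(fwd, bwd, coords):
    """Penalize forward/backward fields that do not invert each other.

    coords: (N, 3) sample points in normalized image space.  Only the
    regularizer is shown; the per-pair similarity terms are omitted.
    """
    warped = coords + fwd(coords)                 # fixed -> moving
    back = warped + bwd(warped)                   # moving -> fixed again
    loss_fm = (back - coords).pow(2).mean()
    warped2 = coords + bwd(coords)                # opposite direction
    back2 = warped2 + fwd(warped2)
    loss_mf = (back2 - coords).pow(2).mean()
    return loss_fm + loss_mf

fwd, bwd = DisplacementField(), DisplacementField()
coords = torch.rand(1024, 3) * 2 - 1
print(cycle_consistency_loss(fwd, bwd, coords))
```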

MarineDet: Towards Open-Marine Object Detection

  • paper_url: http://arxiv.org/abs/2310.01931
  • repo_url: None
  • paper_authors: Liang Haixin, Zheng Ziqiang, Ma Zeyu, Sai-Kit Yeung
  • for: open-marine object detection (OMOD), i.e., detecting, categorizing, and localizing diverse and unseen marine objects in underwater imagery
  • methods: a joint visual-text semantic space learned through pre-training, followed by marine-specific training to achieve in-air-to-marine knowledge transfer
  • results: superior performance compared to existing generalist and specialist object detection algorithms, demonstrating the utility of OMOD for marine ecosystem monitoring and management
    Abstract Marine object detection has gained prominence in marine research, driven by the pressing need to unravel oceanic mysteries and enhance our understanding of invaluable marine ecosystems. There is a profound requirement to efficiently and accurately identify and localize diverse and unseen marine entities within underwater imagery. The open-marine object detection (OMOD for short) is required to detect diverse and unseen marine objects, performing categorization and localization simultaneously. To achieve OMOD, we present MarineDet. We formulate a joint visual-text semantic space through pre-training and then perform marine-specific training to achieve in-air-to-marine knowledge transfer. Considering there is no specific dataset designed for OMOD, we construct a MarineDet dataset consisting of 821 marine-relative object categories to promote and measure OMOD performance. The experimental results demonstrate the superior performance of MarineDet over existing generalist and specialist object detection algorithms. To the best of our knowledge, we are the first to present OMOD, which holds a more valuable and practical setting for marine ecosystem monitoring and management. Our research not only pushes the boundaries of marine understanding but also offers a standard pipeline for OMOD.

RoFormer for Position Aware Multiple Instance Learning in Whole Slide Image Classification

  • paper_url: http://arxiv.org/abs/2310.01924
  • repo_url: None
  • paper_authors: Etienne Pochet, Rami Maroun, Roger Trullo
  • for: This paper addresses whole slide image (WSI) classification in computational pathology, where current multiple-instance learning methods with frozen feature extractors ignore the correlation between patches and the tissue structure.
  • methods: The proposed approach adds a RoFormer layer that performs memory-efficient exact self-attention with relative positional encoding over the patches of large, arbitrarily shaped WSIs, modeling both inter-patch correlation and spatial tissue structure.
  • results: The method outperforms state-of-the-art MIL models on weakly supervised classification tasks across three public datasets (TCGA-NSCLC, BRACS, and Camelyon16).
    Abstract Whole slide image (WSI) classification is a critical task in computational pathology. However, the gigapixel-size of such images remains a major challenge for the current state of deep-learning. Current methods rely on multiple-instance learning (MIL) models with frozen feature extractors. Given the high number of instances in each image, MIL methods have long assumed independence and permutation-invariance of patches, disregarding the tissue structure and correlation between patches. Recent works started studying this correlation between instances but the computational workload of such a high number of tokens remained a limiting factor. In particular, relative position of patches remains unaddressed. We propose to apply a straightforward encoding module, namely a RoFormer layer, relying on memory-efficient exact self-attention and relative positional encoding. This module can perform full self-attention with relative position encoding on patches of large and arbitrary shaped WSIs, solving the need for correlation between instances and spatial modeling of tissues. We demonstrate that our method outperforms state-of-the-art MIL models on three commonly used public datasets (TCGA-NSCLC, BRACS and Camelyon16) on weakly supervised classification tasks. Code is available at https://github.com/Sanofi-Public/DDS-RoFormerMIL
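For intuition, here is a simplified 1D rotary (RoFormer-style) position encoding applied to patch tokens before exact self-attention; the paper's memory-efficient attention and its handling of 2D patch coordinates are not reproduced, and all shapes and names are illustrative.

```python
import torch

def rotary_embed(x, pos, base=10000.0):
    """Apply 1D rotary position embedding to per-patch features.

    x:   (N, D) patch queries or keys, D even
    pos: (N,)   integer patch positions (e.g. a flattened index over the
                tissue grid of an arbitrarily shaped WSI)
    """
    n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angles = pos[:, None].float() * freqs[None, :]                      # (N, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate pairs of channels; dot products then depend on relative positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def relative_position_attention(q, k, v, pos):
    """Self-attention whose scores depend only on relative patch positions."""
    qr, kr = rotary_embed(q, pos), rotary_embed(k, pos)
    attn = torch.softmax(qr @ kr.t() / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

# Toy usage on 6 patch embeddings of dim 8.
q = k = v = torch.randn(6, 8)
pos = torch.tensor([0, 1, 2, 10, 11, 12])
print(relative_position_attention(q, k, v, pos).shape)  # torch.Size([6, 8])
```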

Improved Automatic Diabetic Retinopathy Severity Classification Using Deep Multimodal Fusion of UWF-CFP and OCTA Images

  • paper_url: http://arxiv.org/abs/2310.01912
  • repo_url: None
  • paper_authors: Mostafa El Habib Daho, Yihao Li, Rachid Zeghlache, Yapo Cedric Atse, Hugo Le Boité, Sophie Bonnin, Deborah Cosette, Pierre Deman, Laurent Borderie, Capucine Lepicard, Ramin Tadayoni, Béatrice Cochener, Pierre-Henri Conze, Mathieu Lamard, Gwenolé Quellec
  • for: The goal of this study is to improve early detection of diabetic retinopathy (DR) and thereby improve patient outcomes.
  • methods: The study fuses multimodal imaging, combining 2D Ultra-WideField Color Fundus Photography (UWF-CFP) images and 3D Optical Coherence Tomography Angiography (OCTA) volumes with ResNet50 and 3D-ResNet50 backbones, Squeeze-and-Excitation blocks, and a multimodal extension of Manifold Mixup to improve DR classification.
  • results: Experiments show that the multimodal approach significantly improves DR classification performance compared with single-modality methods, making it a promising tool for early DR detection.
    Abstract Diabetic Retinopathy (DR), a prevalent and severe complication of diabetes, affects millions of individuals globally, underscoring the need for accurate and timely diagnosis. Recent advancements in imaging technologies, such as Ultra-WideField Color Fundus Photography (UWF-CFP) imaging and Optical Coherence Tomography Angiography (OCTA), provide opportunities for the early detection of DR but also pose significant challenges given the disparate nature of the data they produce. This study introduces a novel multimodal approach that leverages these imaging modalities to notably enhance DR classification. Our approach integrates 2D UWF-CFP images and 3D high-resolution 6x6 mm$^3$ OCTA (both structure and flow) images using a fusion of ResNet50 and 3D-ResNet50 models, with Squeeze-and-Excitation (SE) blocks to amplify relevant features. Additionally, to increase the model's generalization capabilities, a multimodal extension of Manifold Mixup, applied to concatenated multimodal features, is implemented. Experimental results demonstrate a remarkable enhancement in DR classification performance with the proposed multimodal approach compared to methods relying on a single modality only. The methodology laid out in this work holds substantial promise for facilitating more accurate, early detection of DR, potentially improving clinical outcomes for patients.
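A hedged sketch of the fusion head: concatenated 2D/3D features re-weighted by an SE-style gate, plus Manifold Mixup applied to the concatenated multimodal features. Layer sizes, the mixup placement and the loss are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SEFusionHead(nn.Module):
    """Concatenate 2D and 3D backbone features, re-weight them with an
    SE-style gate, then classify DR severity (dimensions are placeholders)."""
    def __init__(self, d2d=2048, d3d=2048, n_classes=5, r=16):
        super().__init__()
        d = d2d + d3d
        self.se = nn.Sequential(nn.Linear(d, d // r), nn.ReLU(),
                                nn.Linear(d // r, d), nn.Sigmoid())
        self.fc = nn.Linear(d, n_classes)

    def forward(self, f2d, f3d):
        f = torch.cat([f2d, f3d], dim=1)
        return self.fc(f * self.se(f))

def manifold_mixup(feats, targets, alpha=0.2):
    """Mix concatenated multimodal features and their one-hot targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(feats.size(0))
    return (lam * feats + (1 - lam) * feats[perm],
            lam * targets + (1 - lam) * targets[perm])

# Toy usage: backbone outputs are stand-ins; soft targets go to cross-entropy.
head = SEFusionHead()
f2d, f3d = torch.randn(4, 2048), torch.randn(4, 2048)
y = torch.eye(5)[torch.tensor([0, 1, 2, 3])]          # one-hot DR grades
feats = torch.cat([f2d, f3d], dim=1)
mixed, mixed_y = manifold_mixup(feats, y)
logits = head.fc(mixed * head.se(mixed))              # classify the mixed features
loss = nn.functional.cross_entropy(logits, mixed_y)   # soft targets are supported
print(loss)
```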

CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2310.02296
  • repo_url: None
  • paper_authors: Jialei Chen, Daisuke Deguchi, Chenkai Zhang, Xu Zheng, Hiroshi Murase
  • for: improving generalized zero-shot semantic segmentation (GZLSS) so that it can be applied to arbitrary per-pixel classification segmentation models, without adding an explicit mask proposer or changing the structure of CLIP, while exploiting both seen and ignored areas
  • methods: a new learning framework, CLIPTeacher, with two key modules: a Global Learning Module (GLM) that aligns the dense features of the image encoder with CLIP's CLS token (the only token trained in CLIP) to probe global information, and a Pixel Learning Module (PLM) that uses only CLIP's dense tokens to produce high-level pseudo annotations for ignored areas without any extra mask proposer
  • results: experiments on three benchmark datasets show large gains: 2.2% on PASCAL VOC 2012, 1.3% on COCO-Stuff 164k, and 8.8% on PASCAL Context
    Abstract Existing Generalized Zero-shot Semantic Segmentation (GZLSS) methods apply either finetuning the CLIP paradigm or formulating it as a mask classification task, benefiting from the Vision-Language Models (VLMs). However, the fine-tuning methods are restricted with fixed backbone models which are not flexible for segmentation, and mask classification methods heavily rely on additional explicit mask proposers. Meanwhile, prevalent methods utilize only seen categories which is a great waste, i.e., neglecting the area exists but not annotated. To this end, we propose CLIPTeacher, a new learning framework that can be applied to various per-pixel classification segmentation models without introducing any explicit mask proposer or changing the structure of CLIP, and utilize both seen and ignoring areas. Specifically, CLIPTeacher consists of two key modules: Global Learning Module (GLM) and Pixel Learning Module (PLM). Specifically, GLM aligns the dense features from an image encoder with the CLS token, i.e., the only token trained in CLIP, which is a simple but effective way to probe global information from the CLIP models. In contrast, PLM only leverages dense tokens from CLIP to produce high-level pseudo annotations for ignoring areas without introducing any extra mask proposer. Meanwhile, PLM can fully take advantage of the whole image based on the pseudo annotations. Experimental results on three benchmark datasets: PASCAL VOC 2012, COCO-Stuff 164k, and PASCAL Context show large performance gains, i.e., 2.2%, 1.3%, and 8.8%
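One way the Global Learning Module's "align dense features with the CLS token" idea could be written down is sketched below; the linear projection, global pooling and symmetric InfoNCE objective are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLearningModule(nn.Module):
    """Sketch of a GLM-style loss: project pooled dense features and align
    them with CLIP's CLS token via a symmetric InfoNCE over the batch."""
    def __init__(self, c_dense=256, c_clip=512, tau=0.07):
        super().__init__()
        self.proj = nn.Linear(c_dense, c_clip)
        self.tau = tau

    def forward(self, dense_feats, clip_cls):
        # dense_feats: (B, C, H, W) from the per-pixel segmentation model
        # clip_cls:    (B, D) CLS embeddings from a frozen CLIP image encoder
        pooled = dense_feats.flatten(2).mean(-1)      # (B, C) global pooling
        z = F.normalize(self.proj(pooled), dim=1)     # (B, D)
        t = F.normalize(clip_cls, dim=1)              # (B, D)
        logits = z @ t.t() / self.tau                 # (B, B)
        labels = torch.arange(z.size(0), device=z.device)
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))

glm = GlobalLearningModule()
print(glm(torch.randn(8, 256, 32, 32), torch.randn(8, 512)))
```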

Improving style transfer in dynamic contrast enhanced MRI using a spatio-temporal approach

  • paper_url: http://arxiv.org/abs/2310.01908
  • repo_url: None
  • paper_authors: Adam G. Tattersall, Keith A. Goatman, Lucy E. Kershaw, Scott I. K. Semple, Sonia Dahdouh
  • for: This paper addresses style transfer in DCE-MRI, where contrast enhancement varies widely across tissues and time points.
  • methods: The proposed method combines autoencoders that disentangle content and style with convolutional LSTMs that model the predicted latent spaces over time, and adaptive convolutions that handle the localized nature of contrast enhancement.
  • results: Qualitative and quantitative experiments show that the method outperforms the state of the art on two different datasets.
    Abstract Style transfer in DCE-MRI is a challenging task due to large variations in contrast enhancements across different tissues and time. Current unsupervised methods fail due to the wide variety of contrast enhancement and motion between the images in the series. We propose a new method that combines autoencoders to disentangle content and style with convolutional LSTMs to model predicted latent spaces along time and adaptive convolutions to tackle the localised nature of contrast enhancement. To evaluate our method, we propose a new metric that takes into account the contrast enhancement. Qualitative and quantitative analyses show that the proposed method outperforms the state of the art on two different datasets.

Beyond the Benchmark: Detecting Diverse Anomalies in Videos

  • paper_url: http://arxiv.org/abs/2310.01904
  • repo_url: https://github.com/yoavarad/mfad
  • paper_authors: Yoav Arad, Michael Werman
  • for: This work aims to push video anomaly detection (VAD) beyond simple single-frame anomalies toward complex action-based anomalies.
  • methods: Two new datasets, HMDB-AD and HMDB-Violence, derived from the HMDB51 action recognition dataset, challenge models with diverse action-based anomalies. The authors also propose Multi-Frame Anomaly Detection (MFAD), built on the AI-VAD framework, which combines single-frame features and two-frame features with deep video-encoding features that capture long-range temporal dependencies, using density estimation and logistic regression to compute the final anomaly scores.
  • results: Experiments confirm the limitations of existing models on the new anomaly types, while MFAD excels in both simple and complex anomaly detection scenarios.
    Abstract Video Anomaly Detection (VAD) plays a crucial role in modern surveillance systems, aiming to identify various anomalies in real-world situations. However, current benchmark datasets predominantly emphasize simple, single-frame anomalies such as novel object detection. This narrow focus restricts the advancement of VAD models. In this research, we advocate for an expansion of VAD investigations to encompass intricate anomalies that extend beyond conventional benchmark boundaries. To facilitate this, we introduce two datasets, HMDB-AD and HMDB-Violence, to challenge models with diverse action-based anomalies. These datasets are derived from the HMDB51 action recognition dataset. We further present Multi-Frame Anomaly Detection (MFAD), a novel method built upon the AI-VAD framework. AI-VAD utilizes single-frame features such as pose estimation and deep image encoding, and two-frame features such as object velocity. They then apply a density estimation algorithm to compute anomaly scores. To address complex multi-frame anomalies, we add a deep video encoding features capturing long-range temporal dependencies, and logistic regression to enhance final score calculation. Experimental results confirm our assumptions, highlighting existing models limitations with new anomaly types. MFAD excels in both simple and complex anomaly detection scenarios.
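A small scikit-learn sketch of the scoring recipe described above (per-feature-group density estimation followed by logistic-regression fusion); the feature groups are random stand-ins and the KDE and bandwidth choices are assumptions, not AI-VAD/MFAD's actual estimators.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the per-clip feature groups named in the abstract:
# single-frame (pose / image encoding), two-frame (velocity), and the added
# multi-frame deep video encoding.  Real features would come from the models.
rng = np.random.default_rng(0)
train = {name: rng.normal(size=(500, d))
         for name, d in [("pose", 8), ("velocity", 2), ("video", 16)]}
test = {name: rng.normal(size=(100, f.shape[1])) for name, f in train.items()}

def density_scores(train_feats, test_feats):
    """One density model per feature group; negative log-density is the raw score."""
    kde = KernelDensity(bandwidth=1.0).fit(train_feats)
    return -kde.score_samples(test_feats)          # higher = more anomalous

raw = np.stack([density_scores(train[k], test[k]) for k in train], axis=1)

# Logistic regression fuses the per-group scores into a final anomaly score.
# The labels here are synthetic; in practice they would come from a calibration set.
labels = rng.integers(0, 2, size=100)
clf = LogisticRegression().fit(raw, labels)
final_score = clf.predict_proba(raw)[:, 1]
print(final_score[:5])
```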

MFOS: Model-Free & One-Shot Object Pose Estimation

  • paper_url: http://arxiv.org/abs/2310.01897
  • repo_url: None
  • paper_authors: JongMin Lee, Yohann Cabon, Romain Brégier, Sungjoo Yoo, Jerome Revaud
  • for: estimating the pose of objects never seen during training in a single forward pass, given minimal input
  • methods: a transformer-based architecture that can take full advantage of recently proposed 3D-geometry general pretraining, instead of task-specific modules
  • results: extensive experiments on the LINEMOD benchmark report state-of-the-art one-shot performance
    Abstract Existing learning-based methods for object pose estimation in RGB images are mostly model-specific or category based. They lack the capability to generalize to new object categories at test time, hence severely hindering their practicability and scalability. Notably, recent attempts have been made to solve this issue, but they still require accurate 3D data of the object surface at both train and test time. In this paper, we introduce a novel approach that can estimate in a single forward pass the pose of objects never seen during training, given minimum input. In contrast to existing state-of-the-art approaches, which rely on task-specific modules, our proposed model is entirely based on a transformer architecture, which can benefit from recently proposed 3D-geometry general pretraining. We conduct extensive experiments and report state-of-the-art one-shot performance on the challenging LINEMOD benchmark. Finally, extensive ablations allow us to determine good practices with this relatively new type of architecture in the field.

Adaptive Multi-NeRF: Exploit Efficient Parallelism in Adaptive Multiple Scale Neural Radiance Field Rendering

  • paper_url: http://arxiv.org/abs/2310.01881
  • repo_url: None
  • paper_authors: Tong Wang, Shuichi Kurabayashi
  • for: The goal of this work is to accelerate NeRF rendering so that it becomes practical for real-time rendering applications.
  • methods: The method adaptively subdivides scenes into axis-aligned bounding boxes using a tree hierarchy and assigns NeRFs of different sizes to subspaces according to the complexity of each scene portion, with a guidance density grid balancing the representation capability of each Multilayer Perceptron (MLP).
  • results: The approach greatly improves GPU utilization and accelerates the rendering process for real-time applications.
    Abstract Recent advances in Neural Radiance Fields (NeRF) have demonstrated significant potential for representing 3D scene appearances as implicit neural networks, enabling the synthesis of high-fidelity novel views. However, the lengthy training and rendering process hinders the widespread adoption of this promising technique for real-time rendering applications. To address this issue, we present an effective adaptive multi-NeRF method designed to accelerate the neural rendering process for large scenes with unbalanced workloads due to varying scene complexities. Our method adaptively subdivides scenes into axis-aligned bounding boxes using a tree hierarchy approach, assigning smaller NeRFs to different-sized subspaces based on the complexity of each scene portion. This ensures the underlying neural representation is specific to a particular part of the scene. We optimize scene subdivision by employing a guidance density grid, which balances representation capability for each Multilayer Perceptron (MLP). Consequently, samples generated by each ray can be sorted and collected for parallel inference, achieving a balanced workload suitable for small MLPs with consistent dimensions for regular and GPU-friendly computations. We also demonstrated an efficient NeRF sampling strategy that intrinsically adapts to increase parallelism, utilization, and reduce kernel calls, thereby achieving much higher GPU utilization and accelerating the rendering process.

A Dual Attentive Generative Adversarial Network for Remote Sensing Image Change Detection

  • paper_url: http://arxiv.org/abs/2310.01876
  • repo_url: None
  • paper_authors: Luyi Qiu, Xiaofeng Zhang, ChaoChen Gu, and ShanYing Zhu
  • for: change detection in high-resolution remote sensing images, aiming to improve detection accuracy and efficiency
  • methods: a dual attentive generative adversarial network (DAGAN) that treats the detection model as a generator and obtains optimal detection weights through an adversarial strategy without adding parameters; a multi-level feature extractor with aggregate connections to fuse multi-level features; a multi-scale adaptive fusion module for objects at different scales; and a context refinement module to explore contextual dependencies
  • results: on the LEVIR dataset, the DAGAN framework reaches 85.01% mean IoU and 91.48% mean F1 score, outperforming advanced methods
    Abstract Remote sensing change detection between bi-temporal images receives growing concentration from researchers. However, comparing two bi-temporal images for detecting changes is challenging, as they demonstrate different appearances. In this paper, we propose a dual attentive generative adversarial network for achieving very high-resolution remote sensing image change detection tasks, which regards the detection model as a generator and attains the optimal weights of the detection model without increasing the parameters of the detection model through generative-adversarial strategy, boosting the spatial contiguity of predictions. Moreover, We design a multi-level feature extractor for effectively fusing multi-level features, which adopts the pre-trained model to extract multi-level features from bi-temporal images and introduces aggregate connections to fuse them. To strengthen the identification of multi-scale objects, we propose a multi-scale adaptive fusion module to adaptively fuse multi-scale features through various receptive fields and design a context refinement module to explore contextual dependencies. Moreover, the DAGAN framework utilizes the 4-layer convolution network as a discriminator to identify whether the synthetic image is fake or real. Extensive experiments represent that the DAGAN framework has better performance with 85.01% mean IoU and 91.48% mean F1 score than advanced methods on the LEVIR dataset.

Shifting More Attention to Breast Lesion Segmentation in Ultrasound Videos

  • paper_url: http://arxiv.org/abs/2310.01861
  • repo_url: https://github.com/jhl-det/fla-net
  • paper_authors: Junhao Lin, Qian Dai, Lei Zhu, Huazhu Fu, Qiong Wang, Weibin Li, Wenhao Rao, Xiaoyang Huang, Liansheng Wang
  • for: This study targets breast lesion segmentation in ultrasound videos, which is essential for diagnosing and treating axillary lymph node metastasis.
  • methods: The authors curate a dataset of 572 annotated ultrasound videos (34,300 annotated frames) and propose a frequency and localization feature aggregation network (FLA-Net) that learns temporal features in the frequency domain and predicts additional lesion locations, together with a localization-based contrastive loss that reduces the lesion-location distance between neighboring frames of the same video and enlarges it between frames from different videos.
  • results: Experiments on the annotated dataset and two public video polyp segmentation datasets show state-of-the-art performance while significantly reducing time and space complexity.
    Abstract Breast lesion segmentation in ultrasound (US) videos is essential for diagnosing and treating axillary lymph node metastasis. However, the lack of a well-established and large-scale ultrasound video dataset with high-quality annotations has posed a persistent challenge for the research community. To overcome this issue, we meticulously curated a US video breast lesion segmentation dataset comprising 572 videos and 34,300 annotated frames, covering a wide range of realistic clinical scenarios. Furthermore, we propose a novel frequency and localization feature aggregation network (FLA-Net) that learns temporal features from the frequency domain and predicts additional lesion location positions to assist with breast lesion segmentation. We also devise a localization-based contrastive loss to reduce the lesion location distance between neighboring video frames within the same video and enlarge the location distances between frames from different ultrasound videos. Our experiments on our annotated dataset and two public video polyp segmentation datasets demonstrate that our proposed FLA-Net achieves state-of-the-art performance in breast lesion segmentation in US videos and video polyp segmentation while significantly reducing time and space complexity. Our model and dataset are available at https://github.com/jhl-Det/FLA-Net.
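Below is a hedged PyTorch sketch of the two ingredients named in the abstract: frequency-domain temporal features via an FFT over the clip axis, and a localization-based contrastive loss over predicted lesion centers. The exact network, margin and loss weighting are not specified in the abstract and are assumed here.

```python
import torch
import torch.nn.functional as F

def frequency_temporal_features(feats):
    """Frequency-domain temporal features for a clip of feature maps.

    feats: (B, T, C, H, W).  Returns (B, T//2+1, C, H, W) magnitudes of the
    temporal FFT, a simple stand-in for the paper's frequency branch.
    """
    spec = torch.fft.rfft(feats, dim=1)            # complex spectrum over time
    return spec.abs()

def localization_contrastive(centers, video_ids, margin=1.0):
    """Pull predicted lesion centers of frames from the same video together
    and push centers from different videos at least `margin` apart.

    centers:   (N, 2) normalized (x, y) lesion-location predictions
    video_ids: (N,)   which ultrasound video each frame came from
    """
    d = torch.cdist(centers, centers)                            # (N, N)
    same = (video_ids[:, None] == video_ids[None, :]).float()
    eye = torch.eye(len(centers), device=centers.device)
    pos = (d * same * (1 - eye)).sum() / (same - eye).sum().clamp(min=1)
    neg = (F.relu(margin - d) * (1 - same)).sum() / (1 - same).sum().clamp(min=1)
    return pos + neg

# Toy usage.
print(frequency_temporal_features(torch.randn(2, 8, 16, 32, 32)).shape)
centers = torch.rand(6, 2)
vids = torch.tensor([0, 0, 0, 1, 1, 1])
print(localization_contrastive(centers, vids))
```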

Selective Feature Adapter for Dense Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.01843
  • repo_url: None
  • paper_authors: Xueqing Deng, Qi Fan, Xiaojie Jin, Linjie Yang, Peng Wang
  • for: This paper tackles the cost and storage burden of the huge number of parameters in pre-trained transformer models when fine-tuning them for dense prediction vision tasks.
  • methods: The proposed selective feature adapter (SFA) consists of external adapters and internal adapters that operate sequentially over a transformer model. For external adapters, the places and amount of additional multilayer perceptrons (MLPs) are properly selected; for internal adapters, a few task-important parameters inside the transformer are transformed, discovered automatically through a simple yet effective lottery-ticket algorithm.
  • results: Experiments show that SFA achieves state-of-the-art performance under any given budget of trainable parameters and matches or exceeds fully fine-tuned models across dense tasks such as segmentation, detection, and depth estimation.
    Abstract Fine-tuning pre-trained transformer models, e.g., Swin Transformer, are successful in numerous downstream for dense prediction vision tasks. However, one major issue is the cost/storage of their huge amount of parameters, which becomes increasingly challenging to handle with the growing amount of vision tasks. In this paper, we propose an effective approach to alleviate the issue, namely selective feature adapter (SFA). It achieves state-of-the-art (SoTA) performance under any given budget of trainable parameters, and demonstrates comparable or better performance than fully fine-tuned models across various dense tasks. Specifically, SFA consists of external adapters and internal adapters which are sequentially operated over a transformer model. For external adapters, we properly select the places and amount of additional multilayer perception (MLP). For internal adapters, we transform a few task-important parameters inside the transformer, which are automatically discovered through a simple yet effective lottery ticket algorithm. Our experiments show that the dual adapter module, a.k.a SFA, is essential to achieve the best trade-off on dense vision tasks, such as segmentation, detection and depth-estimation, outperforming other adapters with a single module.

SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering

  • paper_url: http://arxiv.org/abs/2310.01842
  • repo_url: None
  • paper_authors: Bruno Souza, Marius Aasan, Helio Pedrini, Adín Ramírez Rivera
  • for: improving the performance of scene-graph-based Visual Question Answering (VQA) when idealized annotated scene graphs are replaced by scene graphs predicted from images
  • methods: the SelfGraphVQA framework extracts a scene graph with a pre-trained scene graph generator and applies semantically-preserving augmentation with self-supervised techniques, learning joint embeddings with an un-normalized contrastive objective under three maximization strategies: node-wise, graph-wise, and permutation-equivariant regularization
  • results: experiments show that the extracted scene graphs improve VQA performance and that these approaches enhance overall performance by highlighting the significance of visual information
    Abstract The intersection of vision and language is of major interest due to the increased focus on seamless integration between recognition and reasoning. Scene graphs (SGs) have emerged as a useful tool for multimodal image analysis, showing impressive performance in tasks such as Visual Question Answering (VQA). In this work, we demonstrate that despite the effectiveness of scene graphs in VQA tasks, current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images. To address this issue, we introduce the SelfGraphVQA framework. Our approach extracts a scene graph from an input image using a pre-trained scene graph generator and employs semantically-preserving augmentation with self-supervised techniques. This method improves the utilization of graph representations in VQA tasks by circumventing the need for costly and potentially biased annotated data. By creating alternative views of the extracted graphs through image augmentations, we can learn joint embeddings by optimizing the informational content in their representations using an un-normalized contrastive approach. As we work with SGs, we experiment with three distinct maximization strategies: node-wise, graph-wise, and permutation-equivariant regularization. We empirically showcase the effectiveness of the extracted scene graph for VQA and demonstrate that these approaches enhance overall performance by highlighting the significance of visual information. This offers a more practical solution for VQA tasks that rely on SGs for complex reasoning questions.

Self-Supervised High Dynamic Range Imaging with Multi-Exposure Images in Dynamic Scenes

  • paper_url: http://arxiv.org/abs/2310.01840
  • repo_url: https://github.com/cszhilu1998/selfhdr
  • paper_authors: Zhilu Zhang, Haoyu Wang, Shuai Liu, Xiaotao Wang, Lei Lei, Wangmeng Zuo
  • for: improving high dynamic range (HDR) image reconstruction without requiring labeled HDR ground truth
  • methods: a self-supervised HDR reconstruction method that trains only on dynamic multi-exposure images, supervising the reconstruction network with two complementary components constructed from the multi-exposure inputs that focus on HDR color and structure, respectively
  • results: on real-world images, SelfHDR achieves performance comparable to supervised methods and surpasses self-supervised ones; code is available at https://github.com/cszhilu1998/SelfHDR
    Abstract Merging multi-exposure images is a common approach for obtaining high dynamic range (HDR) images, with the primary challenge being the avoidance of ghosting artifacts in dynamic scenes. Recent methods have proposed using deep neural networks for deghosting. However, the methods typically rely on sufficient data with HDR ground-truths, which are difficult and costly to collect. In this work, to eliminate the need for labeled data, we propose SelfHDR, a self-supervised HDR reconstruction method that only requires dynamic multi-exposure images during training. Specifically, SelfHDR learns a reconstruction network under the supervision of two complementary components, which can be constructed from multi-exposure images and focus on HDR color as well as structure, respectively. The color component is estimated from aligned multi-exposure images, while the structure one is generated through a structure-focused network that is supervised by the color component and an input reference (\eg, medium-exposure) image. During testing, the learned reconstruction network is directly deployed to predict an HDR image. Experiments on real-world images demonstrate our SelfHDR achieves superior results against the state-of-the-art self-supervised methods, and comparable performance to supervised ones. Codes are available at https://github.com/cszhilu1998/SelfHDR

Skin the sheep not only once: Reusing Various Depth Datasets to Drive the Learning of Optical Flow

  • paper_url: http://arxiv.org/abs/2310.01833
  • repo_url: None
  • paper_authors: Sheng-Chi Huang, Wei-Chen Chiu
  • for: improving optical flow estimation for vision and robotics applications, where ground-truth optical flow is difficult to obtain in real-world scenes
  • methods: exploiting the geometric connection between optical flow estimation and stereo matching to convert various real-world depth estimation datasets into supervised flow training data: monocular depth maps are turned into stereo pairs by synthesizing virtual disparity (horizontal flow), virtual camera motion adds vertical flow, geometric augmentations are applied to one image of each flow pair, and an auxiliary classifier identifies the augmentation type from the flow map to further enhance learning
  • results: extensive experiments with various datasets and optical flow models show improved accuracy and generalization, and the approach is not tied to any particular flow estimator
    Abstract Optical flow estimation is crucial for various applications in vision and robotics. As the difficulty of collecting ground truth optical flow in real-world scenarios, most of the existing methods of learning optical flow still adopt synthetic dataset for supervised training or utilize photometric consistency across temporally adjacent video frames to drive the unsupervised learning, where the former typically has issues of generalizability while the latter usually performs worse than the supervised ones. To tackle such challenges, we propose to leverage the geometric connection between optical flow estimation and stereo matching (based on the similarity upon finding pixel correspondences across images) to unify various real-world depth estimation datasets for generating supervised training data upon optical flow. Specifically, we turn the monocular depth datasets into stereo ones via synthesizing virtual disparity, thus leading to the flows along the horizontal direction; moreover, we introduce virtual camera motion into stereo data to produce additional flows along the vertical direction. Furthermore, we propose applying geometric augmentations on one image of an optical flow pair, encouraging the optical flow estimator to learn from more challenging cases. Lastly, as the optical flow maps under different geometric augmentations actually exhibit distinct characteristics, an auxiliary classifier which trains to identify the type of augmentation from the appearance of the flow map is utilized to further enhance the learning of the optical flow estimator. Our proposed method is general and is not tied to any particular flow estimator, where extensive experiments based on various datasets and optical flow estimation models verify its efficacy and superiority.
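The core geometric trick (monocular depth, to virtual stereo disparity, to horizontal flow) can be written in a few lines. The NumPy sketch below uses u = f·B/Z with an arbitrary virtual baseline; the forward-warping step that actually synthesizes the second view, and the vertical flow from virtual camera motion, are omitted.

```python
import numpy as np

def depth_to_horizontal_flow(depth, focal_px, baseline_m):
    """Turn a monocular depth map into a virtual-stereo disparity, which is
    exactly a horizontal optical-flow field between the real view and a
    synthesized right view: u = f * B / Z, v = 0.

    depth:      (H, W) metric depth in meters (zeros treated as invalid)
    focal_px:   focal length in pixels
    baseline_m: virtual stereo baseline in meters (a free parameter)
    """
    valid = depth > 0
    u = np.zeros_like(depth, dtype=np.float32)
    u[valid] = focal_px * baseline_m / depth[valid]
    flow = np.stack([u, np.zeros_like(u)], axis=-1)   # (H, W, 2)
    return flow, valid

# Example: a pixel 4 m away with f = 720 px and a 10 cm virtual baseline
# moves 720 * 0.1 / 4 = 18 px horizontally in the synthesized view.
depth = np.full((2, 2), 4.0)
depth[0, 0] = 0.0                                     # invalid depth
flow, valid = depth_to_horizontal_flow(depth, focal_px=720.0, baseline_m=0.1)
print(flow[1, 1], valid[0, 0])                        # [18.  0.] False
```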

AI-Generated Images as Data Source: The Dawn of Synthetic Era

  • paper_url: http://arxiv.org/abs/2310.01830
  • repo_url: https://github.com/mwxely/aigs
  • paper_authors: Zuhao Yang, Fangneng Zhan, Kunhao Liu, Muyu Xu, Shijian Lu
  • for: This work explores the concept of using images generated by generative AI models as a new data source for visual intelligence, reshaping traditional modeling paradigms.
  • methods: It surveys generative AI models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), that can produce large amounts of image data, and analyzes and evaluates the use of such data.
  • results: The survey finds that AI-generated image data can benefit visual intelligence models and can easily produce large numbers of edge cases and rare scenarios for computational simulation, testing, and validation.
    Abstract The advancement of visual intelligence is intrinsically tethered to the availability of large-scale data. In parallel, generative Artificial Intelligence (AI) has unlocked the potential to create synthetic images that closely resemble real-world photographs. This prompts a compelling inquiry: how much visual intelligence could benefit from the advance of generative AI? This paper explores the innovative concept of harnessing these AI-generated images as new data sources, reshaping traditional modeling paradigms in visual intelligence. In contrast to real data, AI-generated data exhibit remarkable advantages, including unmatched abundance and scalability, the rapid generation of vast datasets, and the effortless simulation of edge cases. Built on the success of generative AI models, we examine the potential of their generated data in a range of applications, from training machine learning models to simulating scenarios for computational modeling, testing, and validation. We probe the technological foundations that support this groundbreaking use of generative AI, engaging in an in-depth discussion on the ethical, legal, and practical considerations that accompany this transformative paradigm shift. Through an exhaustive survey of current technologies and applications, this paper presents a comprehensive view of the synthetic era in visual intelligence. A project associated with this paper can be found at https://github.com/mwxely/AIGS .

Amazing Combinatorial Creation: Acceptable Swap-Sampling for Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2310.01819
  • repo_url: None
  • paper_authors: Jun Li, Zedong Zhang, Jian Yang
  • for: This work aims to build a machine learning system that generates meaningful combinatorial object images from multiple textual descriptions, emulating human creativity.
  • methods: A simple yet effective acceptable swap-sampling technique exchanges column vectors of two text embeddings to form a new prompt embedding, generates a new combinatorial image with a cutting-edge diffusion model, keeps only samples whose CLIP distances to the original concept generations fall inside an acceptable region, and finally uses a segmentation-based comparison of CLIP distances to select the most promising image.
  • results: On text pairs of ImageNet objects, the approach generates more novel and surprising combinations than recent methods such as Stable-Diffusion2, DALLE2, ERNIE-ViLG2, and Bing, even for implausible concept pairs such as lionfish-abacus; without training or human preference data, its sampling process is also comparable to PickScore and HPSv2, which are trained on human-preference datasets.
    Abstract Exploring a machine learning system to generate meaningful combinatorial object images from multiple textual descriptions, emulating human creativity, is a significant challenge as humans are able to construct amazing combinatorial objects, but machines strive to emulate data distribution. In this paper, we develop a straight-forward yet highly effective technique called acceptable swap-sampling to generate a combinatorial object image that exhibits novelty and surprise, utilizing text concepts of different objects. Initially, we propose a swapping mechanism that constructs a novel embedding by exchanging column vectors of two text embeddings for generating a new combinatorial image through a cutting-edge diffusion model. Furthermore, we design an acceptable region by managing suitable CLIP distances between the new image and the original concept generations, increasing the likelihood of accepting the new image with a high-quality combination. This region allows us to efficiently sample a small subset from a new image pool generated by using randomly exchanging column vectors. Lastly, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object image from the sampled subset. Our experiments focus on text pairs of objects from ImageNet, and our results demonstrate that our approach outperforms recent methods such as Stable-Diffusion2, DALLE2, ERNIE-ViLG2 and Bing in generating novel and surprising object images, even when the associated concepts appear to be implausible, such as lionfish-abacus. Moreover, during the sampling process, our approach without training and human preference is also comparable to PickScore and HPSv2 trained using human preference datasets.
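A minimal NumPy sketch of the column-swapping step and an "acceptable region" filter in CLIP space, assuming (L, D) token-by-dimension text embeddings and made-up distance thresholds; the diffusion sampling and segmentation-based selection stages are not shown.

```python
import numpy as np

def swap_columns(emb_a, emb_b, swap_ratio=0.5, seed=0):
    """Build a new 'combinatorial' text embedding by exchanging a random
    subset of column vectors between two prompt embeddings.

    emb_a, emb_b: (L, D) text embeddings of the two object prompts
    (e.g. from a CLIP text encoder).  The swap ratio is an assumption.
    """
    l, d = emb_a.shape
    rng = np.random.default_rng(seed)
    cols = rng.choice(d, size=int(d * swap_ratio), replace=False)
    mixed = emb_a.copy()
    mixed[:, cols] = emb_b[:, cols]
    return mixed

def acceptable(image_feat, feat_a, feat_b, lo=0.2, hi=0.8):
    """Keep a sampled image only if its CLIP-space distance to both source
    concepts falls inside an acceptable region (thresholds are illustrative)."""
    cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    da, db = 1 - cos(image_feat, feat_a), 1 - cos(image_feat, feat_b)
    return lo <= da <= hi and lo <= db <= hi

# Toy usage with random embeddings standing in for CLIP features.
a, b = np.random.randn(77, 768), np.random.randn(77, 768)
mixed = swap_columns(a, b)
print(mixed.shape, acceptable(np.random.randn(512),
                              np.random.randn(512), np.random.randn(512)))
```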

PPT: Token Pruning and Pooling for Efficient Vision Transformers

  • paper_url: http://arxiv.org/abs/2310.01812
  • repo_url: https://github.com/xjwu1024/PPT
  • paper_authors: Xinjian Wu, Fanhu Zeng, Xiudong Wang, Yunhe Wang, Xinghao Chen
  • for: reducing the computational complexity of Vision Transformers for practical computer vision applications
  • methods: an acceleration framework that combines token pruning and token pooling in ViTs to adaptively remove inattentive and duplicative redundancy in different layers, without adding trainable parameters
  • results: on ImageNet, PPT reduces FLOPs by over 37% and improves throughput by over 45% for DeiT-S without any accuracy drop
    Abstract Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their practical applications in real-world scenarios. Motivated by the fact that not all tokens contribute equally to the final predictions and fewer tokens bring less computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. However, we argue that it is not optimal to either only reduce inattentive redundancy by token pruning, or only reduce duplicative redundancy by token merging. To this end, in this paper we propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT), to adaptively tackle these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling techniques in ViTs without additional trainable parameters, PPT effectively reduces the model complexity while maintaining its predictive accuracy. For example, PPT reduces over 37% FLOPs and improves the throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset.
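As a rough, layer-local illustration of combining token pruning with token pooling, the PyTorch sketch below keeps the patches most attended by the CLS token and merges the rest into a single attention-weighted token; the keep ratio, scoring rule and per-layer schedule are assumptions rather than PPT's exact design.

```python
import torch

def prune_and_pool(tokens, cls_attn, keep_ratio=0.7):
    """Reduce the patch tokens of one ViT layer: keep the most attentive
    tokens (pruning) and merge the rest into one pooled token (pooling).

    tokens:   (B, N, D) patch tokens (CLS excluded)
    cls_attn: (B, N)    attention from the CLS token to each patch token
    """
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                         # (B, k)
    keep = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(b, n, device=tokens.device, dtype=torch.bool)
    mask.scatter_(1, idx, False)                                  # pruned tokens
    w = (cls_attn * mask).unsqueeze(-1)                           # (B, N, 1)
    pooled = (tokens * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp(min=1e-6)
    return torch.cat([keep, pooled], dim=1)                       # (B, k+1, D)

# Toy usage on DeiT-S-like shapes.
x = torch.randn(2, 196, 384)
attn = torch.rand(2, 196)
print(prune_and_pool(x, attn).shape)   # torch.Size([2, 138, 384])
```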

SMRD: SURE-based Robust MRI Reconstruction with Diffusion Models

  • paper_url: http://arxiv.org/abs/2310.01799
  • repo_url: https://github.com/nvlabs/smrd
  • paper_authors: Batu Ozturkler, Chao Liu, Benjamin Eckart, Morteza Mardani, Jiaming Song, Jan Kautz
  • for: SMRD is designed to improve the robustness of diffusion models for accelerated MRI reconstruction.
  • methods: SMRD uses Stein's Unbiased Risk Estimator (SURE) to estimate the mean squared error of the reconstruction during testing and automatically tunes the inference hyperparameters without the need for validation tuning.
  • results: SMRD outperforms diffusion model baselines across measurement noise levels, acceleration factors, and anatomies, achieving a PSNR improvement of up to 6 dB under measurement noise.
    Abstract Diffusion models have recently gained popularity for accelerated MRI reconstruction due to their high sample quality. They can effectively serve as rich data priors while incorporating the forward model flexibly at inference time, and they have been shown to be more robust than unrolled methods under distribution shifts. However, diffusion models require careful tuning of inference hyperparameters on a validation set and are still sensitive to distribution shifts during testing. To address these challenges, we introduce SURE-based MRI Reconstruction with Diffusion models (SMRD), a method that performs test-time hyperparameter tuning to enhance robustness during testing. SMRD uses Stein's Unbiased Risk Estimator (SURE) to estimate the mean squared error of the reconstruction during testing. SURE is then used to automatically tune the inference hyperparameters and to set an early stopping criterion without the need for validation tuning. To the best of our knowledge, SMRD is the first to incorporate SURE into the sampling stage of diffusion models for automatic hyperparameter selection. SMRD outperforms diffusion model baselines on various measurement noise levels, acceleration factors, and anatomies, achieving a PSNR improvement of up to 6 dB under measurement noise. The code is publicly available at https://github.com/NVlabs/SMRD .
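The SURE quantity the method relies on can be estimated without ground truth; below is a generic Monte-Carlo SURE sketch for a denoiser-like reconstruction. How SMRD threads this estimate through the diffusion sampler and uses it to tune hyperparameters and stop early is not shown.

```python
import torch

def sure_mse_estimate(f, y, sigma, eps=1e-3):
    """Monte-Carlo SURE estimate of the per-pixel MSE of a reconstruction
    f(y) from noisy measurements y with noise level sigma:

        SURE = ||f(y) - y||^2 / n - sigma^2 + (2 sigma^2 / n) * div_y f(y)

    The divergence is estimated with a single random probe; multiple probes
    would reduce the variance of the estimate.
    """
    n = y.numel()
    fy = f(y)
    b = torch.randn_like(y)
    div = (b * (f(y + eps * b) - fy)).sum() / eps        # Monte-Carlo divergence
    return (fy - y).pow(2).sum() / n - sigma ** 2 + 2 * sigma ** 2 * div / n

# Toy check with a linear shrinkage "reconstructor" f(y) = 0.9 * y,
# whose exact divergence is 0.9 * n.
y = torch.randn(1, 1, 64, 64)
print(sure_mse_estimate(lambda t: 0.9 * t, y, sigma=0.1).item())
```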

HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption

  • paper_url: http://arxiv.org/abs/2310.01779
  • repo_url: None
  • paper_authors: Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, Xiangjun Fan
  • for: This paper investigates whether current large vision-language models (LVLMs) accurately apprehend visual details in detailed captioning, and how object-existence hallucinations in their detailed captions can be controlled.
  • methods: It introduces CCEval, a GPT-4 assisted evaluation method tailored for detailed captioning, and attributes hallucinations to factors such as image resolution, language decoder size, and the amount, quality, and granularity of instruction data.
  • results: Although LVLMs show minimal object-existence hallucination on existing VQA benchmarks, CCEval reveals continued susceptibility; the proposed HallE-Switch, which conditions captioning on contextual versus parametric knowledge, reduces hallucination by 44% compared to LLaVA-7B while maintaining the same object coverage.
    Abstract Current large vision-language models (LVLMs) achieve remarkable progress, yet there remains significant uncertainty regarding their ability to accurately apprehend visual details, that is, in performing detailed captioning. To address this, we introduce CCEval, a GPT-4 assisted evaluation method tailored for detailed captioning. Interestingly, while LVLMs demonstrate minimal object existence hallucination in existing VQA benchmarks, our proposed evaluation reveals continued susceptibility to such hallucinations. In this paper, we make the first attempt to investigate and attribute such hallucinations, including image resolution, the language decoder size, and instruction data amount, quality, granularity. Our findings underscore the unwarranted inference when the language description includes details at a finer object granularity than what the vision module can ground or verify, thus inducing hallucination. To control such hallucinations, we further attribute the reliability of captioning to contextual knowledge (involving only contextually grounded objects) and parametric knowledge (containing inferred objects by the model). Thus, we introduce HallE-Switch, a controllable LVLM in terms of Hallucination in object Existence. HallE-Switch can condition the captioning to shift between (i) exclusively depicting contextual knowledge for grounded objects and (ii) blending it with parametric knowledge to imagine inferred objects. Our method reduces hallucination by 44% compared to LLaVA-7B and maintains the same object coverage.

ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms

  • paper_url: http://arxiv.org/abs/2310.01755
  • repo_url: None
  • paper_authors: William Yang, Byron Zhang, Olga Russakovsky
  • for: The paper aims to investigate the behavior of out-of-distribution (OOD) detection algorithms and to provide insights for guiding the design of future OOD detectors.
  • methods: The paper uses a clean semantic shift dataset called ImageNet-OOD to decouple semantic shift and covariate shift, and conducts comprehensive experiments to evaluate the performance of OOD detection algorithms under these two types of shifts.
  • results: The paper shows that OOD detectors are more sensitive to covariate shift than to semantic shift, and that the benefits of recent OOD detection algorithms on semantic shift detection are minimal. The paper provides important insights for designing future OOD detectors.
    Abstract The task of out-of-distribution (OOD) detection is notoriously ill-defined. Earlier works focused on new-class detection, aiming to identify label-altering data distribution shifts, also known as "semantic shift." However, recent works argue for a focus on failure detection, expanding the OOD evaluation framework to account for label-preserving data distribution shifts, also known as "covariate shift." Intriguingly, under this new framework, complex OOD detectors that were previously considered state-of-the-art now perform similarly to, or even worse than the simple maximum softmax probability baseline. This raises the question: what are the latest OOD detectors actually detecting? Deciphering the behavior of OOD detection algorithms requires evaluation datasets that decouples semantic shift and covariate shift. To aid our investigations, we present ImageNet-OOD, a clean semantic shift dataset that minimizes the interference of covariate shift. Through comprehensive experiments, we show that OOD detectors are more sensitive to covariate shift than to semantic shift, and the benefits of recent OOD detection algorithms on semantic shift detection is minimal. Our dataset and analyses provide important insights for guiding the design of future OOD detectors.
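For reference, the maximum-softmax-probability baseline the paper compares against takes only a few lines; the classifier head below is a stand-in, and the threshold is arbitrary.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_score(model, x):
    """Maximum softmax probability: an input is flagged as OOD when the
    classifier's top softmax probability is low."""
    probs = F.softmax(model(x), dim=1)
    return probs.max(dim=1).values            # higher = more likely in-distribution

def detect_ood(model, x, threshold=0.5):
    return msp_score(model, x) < threshold    # True where flagged as OOD

# Toy usage with a linear stand-in for an ImageNet-style classifier head.
clf = torch.nn.Linear(512, 100)
feats = torch.randn(4, 512)
print(msp_score(clf, feats), detect_ood(clf, feats))
```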

Generative Autoencoding of Dropout Patterns

  • paper_url: http://arxiv.org/abs/2310.01712
  • repo_url: https://github.com/shuntama/deciphering-autoencoders
  • paper_authors: Shunta Maeda
  • for: image generation
  • methods: assigning a unique random dropout pattern to each training data point and training an autoencoder to reconstruct that data point, using the pattern as the information to be encoded
  • results: training is more stable than for other generative models, with sampling quality comparable to DCGAN on CIFAR-10
    Abstract We propose a generative model termed Deciphering Autoencoders. In this model, we assign a unique random dropout pattern to each data point in the training dataset and then train an autoencoder to reconstruct the corresponding data point using this pattern as information to be encoded. Since the training of Deciphering Autoencoders relies solely on reconstruction error, it offers more stable training than other generative models. Despite its simplicity, Deciphering Autoencoders show comparable sampling quality to DCGAN on the CIFAR-10 dataset.
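The abstract leaves the architecture open; the sketch below takes one simple reading in which the fixed per-sample dropout pattern is fed to a decoder as its only code. The authors may instead apply the pattern as dropout masks inside an autoencoder, so treat this purely as an illustration of the idea.

```python
import torch
import torch.nn as nn

class PatternDecoder(nn.Module):
    """Sketch: the per-sample random dropout pattern is the only 'code';
    a decoder learns to map each fixed pattern back to its image."""
    def __init__(self, code_dim=512, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim, 1024), nn.ReLU(),
            nn.Linear(1024, img_dim), nn.Sigmoid())
    def forward(self, pattern):
        return self.net(pattern)

def dropout_pattern(index, code_dim=512, p=0.5):
    """Deterministic per-sample binary mask, reproducible from the index."""
    g = torch.Generator().manual_seed(index)
    return (torch.rand(code_dim, generator=g) > p).float()

# Training relies on reconstruction error only, as in the abstract; sampling
# new images amounts to decoding freshly drawn random patterns.
model = PatternDecoder()
imgs = torch.rand(4, 3 * 32 * 32)                        # toy CIFAR-like batch
codes = torch.stack([dropout_pattern(i) for i in range(4)])
loss = nn.functional.mse_loss(model(codes), imgs)
new_images = model((torch.rand(2, 512) > 0.5).float())   # generation
print(loss, new_images.shape)
```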