cs.CV - 2023-11-24

Uncertainty Aware AI for 2D MRI Segmentation

  • paper_url: http://arxiv.org/abs/2311.14875
  • repo_url: None
  • paper_authors: Lohith Konathala
  • for: This work aims to provide a reliable and interpretable deep learning segmentation method for medical images, improving the accuracy and trustworthiness of automated disease screening.
  • methods: The study combines Bayesian neural networks with attention mechanisms to produce accurate and interpretable segmentation results.
  • results: Evaluated on the BraTS 2020 dataset, the model achieves high F1 Score and Intersection over Union (IoU).
    Abstract Robust uncertainty estimations are necessary in safety-critical applications of Deep Learning. One such example is the semantic segmentation of medical images: whilst deep-learning approaches achieve high performance on such tasks, they lack interpretability, as they give no indication of their confidence when making classification decisions. Robust and interpretable segmentation is a critical first stage in automatically screening for pathologies; hence the optimal solution is one which can provide high accuracy but also capture the underlying uncertainty. In this work, we present an uncertainty-aware segmentation model, BA U-Net, for use on MRI data that incorporates Bayesian Neural Networks and Attention Mechanisms to provide accurate and interpretable segmentations. We evaluated our model on the publicly available BraTS 2020 dataset using F1 Score and Intersection Over Union (IoU) as evaluation metrics.
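
No code is released for this entry (repo_url: None). As a rough, assumption-laden sketch of how Bayesian uncertainty can accompany a segmentation mask, the snippet below uses Monte Carlo dropout over a toy convolutional segmenter to produce per-pixel mean probabilities and a predictive-entropy map; the layer sizes, dropout rate, and sample count are illustrative and this is not the BA U-Net architecture.

```python
# Hedged sketch: Monte Carlo dropout as a stand-in for Bayesian uncertainty in
# segmentation. The tiny network below is NOT the BA U-Net from the paper; it
# only illustrates how a per-pixel confidence map can accompany the predicted mask.
import torch
import torch.nn as nn

class TinySegmenter(nn.Module):
    def __init__(self, in_ch=1, n_classes=2, p_drop=0.2):  # illustrative sizes
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Dropout2d(p_drop),                      # kept stochastic at test time
            nn.Conv2d(16, n_classes, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_dropout_predict(model, image, n_samples=20):
    """Return per-pixel mean class probabilities and predictive entropy."""
    model.train()  # keeps dropout active; BatchNorm layers would need more care
    probs = torch.stack([torch.softmax(model(image), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)                                        # (B, C, H, W)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-8).log()).sum(dim=1)  # (B, H, W)
    return mean_probs, entropy

model = TinySegmenter()
mri_slice = torch.randn(1, 1, 64, 64)        # placeholder for a 2D MRI slice
mean_probs, uncertainty = mc_dropout_predict(model, mri_slice)
mask = mean_probs.argmax(dim=1)              # segmentation; inspect `uncertainty` alongside it
```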

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

  • paper_url: http://arxiv.org/abs/2311.14851
  • repo_url: None
  • paper_authors: Xiaoxuan He, Yifan Yang, Xinyang Jiang, Xufang Luo, Haoji Hu, Siyun Zhao, Dongsheng Li, Yuqing Yang, Lili Qiu
  • for: This paper proposes a scalable medical image pre-training framework that handles medical images of different modalities and dimensionalities by mapping them into a common semantic space, enabling unified medical image analysis and interpretation.
  • methods: Diagnostic reports are used to build the common semantic space, yielding a unified representation of diverse medical images that is then used for analysis and interpretation.
  • results: UniMedI improves the consistency of representations across modalities and performs strongly on medical image data of different modalities and dimensionalities.
    Abstract Vision-Language Pre-training (VLP) has shown the merits of analysing medical images, by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn facilitates enhanced analysis and interpretation of intricate imaging data. However, such observation is predominantly justified on single-modality data (mostly 2D images like X-rays); adapting VLP to learning unified representations for medical images in real scenarios remains an open challenge. This arises because medical images often encompass a variety of modalities, especially modalities with different numbers of dimensions (e.g., 3D images like Computed Tomography). To overcome the aforementioned challenges, we propose an Unified Medical Image Pre-training framework, namely UniMedI, which utilizes diagnostic reports as common semantic space to create unified representations for diverse modalities of medical images (especially for 2D and 3D images). Under the text's guidance, we effectively uncover visual modality information, identifying the affected areas in 2D X-rays and slices containing lesion in sophisticated 3D CT scans, ultimately enhancing the consistency across various medical imaging modalities. To demonstrate the effectiveness and versatility of UniMedI, we evaluate its performance on both 2D and 3D images across 10 different datasets, covering a wide range of medical image tasks such as classification, segmentation, and retrieval. UniMedI has demonstrated superior performance in downstream tasks, showcasing its effectiveness in establishing a universal medical visual representation.
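
UniMedI's architecture is not reproduced here; as a minimal sketch of the core idea — aligning visual features from different imaging modalities through a shared report (text) space — the snippet below implements a generic symmetric InfoNCE-style contrastive loss between batched image and report embeddings. The function name, embedding size, and temperature are assumptions, not the paper's exact objective.

```python
# Minimal sketch of text-guided alignment: pull each image embedding toward the
# embedding of its own diagnostic report and away from other reports. This is a
# generic symmetric InfoNCE loss, not the exact UniMedI training objective.
import torch
import torch.nn.functional as F

def report_image_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)       # (B, D) embeddings of 2D/3D images
    txt = F.normalize(txt_emb, dim=-1)       # (B, D) embeddings of their reports
    logits = img @ txt.t() / temperature     # (B, B) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 2D X-ray features and 3D CT features can both be matched against the
# shared report space, which is what places them in a common semantic space.
xray_emb = torch.randn(8, 256, requires_grad=True)   # toy image-encoder output
report_emb = torch.randn(8, 256)                     # toy report-encoder output
loss = report_image_contrastive_loss(xray_emb, report_emb)
loss.backward()                                      # gradients flow into the image encoder
```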

UniHPE: Towards Unified Human Pose Estimation via Contrastive Learning

  • paper_url: http://arxiv.org/abs/2311.16477
  • repo_url: None
  • paper_authors: Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang
  • for: This paper aims to develop perception techniques that effectively combine information from multiple modalities, enabling training with larger datasets and constraints while exploiting the information contained in each modality.
  • methods: The paper proposes a unified Human Pose Estimation (HPE) pipeline that aligns features from all three modalities within the same pipeline, and introduces a novel singular-value-based contrastive learning loss to better align more than two modalities at once.
  • results: Experiments show that UniHPE achieves an MPJPE of 50.5 mm on the Human3.6M dataset and a PAMPJPE of 51.6 mm on the 3DPW dataset, which are remarkable performance metrics.
    Abstract In recent times, there has been a growing interest in developing effective perception techniques for combining information from multiple modalities. This involves aligning features obtained from diverse sources to enable more efficient training with larger datasets and constraints, as well as leveraging the wealth of information contained in each modality. 2D and 3D Human Pose Estimation (HPE) are two critical perceptual tasks in computer vision, which have numerous downstream applications, such as Action Recognition, Human-Computer Interaction, Object tracking, etc. Yet, there are limited instances where the correlation between Image and 2D/3D human pose has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPE, a unified Human Pose Estimation pipeline, which aligns features from all three modalities, i.e., 2D human pose estimation, lifting-based and image-based 3D human pose estimation, in the same pipeline. To align more than two modalities at the same time, we propose a novel singular value based contrastive learning loss, which better aligns different modalities and further boosts the performance. In our evaluation, UniHPE achieves remarkable performance metrics: MPJPE $50.5$mm on the Human3.6M dataset and PAMPJPE $51.6$mm on the 3DPW dataset. Our proposed method holds immense potential to advance the field of computer vision and contribute to various applications.
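
MPJPE, the headline metric above, is simple to state concretely; a small reference implementation with optional root alignment is sketched below. The 17-joint layout and millimetre units are assumptions about the data format, and PAMPJPE would additionally require a Procrustes alignment step that is not shown.

```python
# Sketch of the reported metric: Mean Per Joint Position Error (MPJPE) between
# predicted and ground-truth 3D joints, optionally root-aligned. The joint
# ordering and millimetre units are assumptions about the dataset format.
import numpy as np

def mpjpe(pred, gt, root_joint=0, align_root=True):
    """pred, gt: (N, J, 3) arrays of 3D joint positions in mm."""
    if align_root:
        pred = pred - pred[:, root_joint:root_joint + 1, :]
        gt = gt - gt[:, root_joint:root_joint + 1, :]
    return np.linalg.norm(pred - gt, axis=-1).mean()

poses_pred = np.random.randn(32, 17, 3) * 100.0   # toy Human3.6M-style poses
poses_gt = np.random.randn(32, 17, 3) * 100.0
print(f"MPJPE: {mpjpe(poses_pred, poses_gt):.1f} mm")
```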

Benchmarking Robustness of Text-Image Composed Retrieval

  • paper_url: http://arxiv.org/abs/2311.14837
  • repo_url: None
  • paper_authors: Shitong Sun, Jindong Gu, Shaogang Gong
  • for: This paper studies the robustness of text-image composed retrieval, covering both natural corruptions and textual understanding.
  • methods: Text-image composed retrieval methods are analysed systematically under natural corruptions of both the visual and textual inputs, and under targeted probes of textual understanding.
  • results: The paper introduces two new large-scale benchmark datasets (CIRR-C and FashionIQ-C) and a new diagnostic dataset (CIRR-D), and uses them for a systematic analysis of the robustness and textual understanding ability of text-image composed retrieval methods.
    Abstract Text-image composed retrieval aims to retrieve the target image through the composed query, which is specified in the form of an image plus some text that describes desired modifications to the input image. It has recently attracted attention due to its ability to leverage both information-rich images and concise language to precisely express the requirements for target images. However, the robustness of these approaches against real-world corruptions or further text understanding has never been studied. In this paper, we perform the first robustness study and establish three new diversified benchmarks for systematic analysis of text-image composed retrieval against natural corruptions in both vision and text and further probe textural understanding. For natural corruption analysis, we introduce two new large-scale benchmark datasets, CIRR-C and FashionIQ-C for testing in open domain and fashion domain respectively, both of which apply 15 visual corruptions and 7 textural corruptions. For textural understanding analysis, we introduce a new diagnostic dataset CIRR-D by expanding the original raw data with synthetic data, which contains modified text to better probe textual understanding ability including numerical variation, attribute variation, object removal, background variation, and fine-grained evaluation. The code and benchmark datasets are available at https://github.com/SunTongtongtong/Benchmark-Robustness-Text-Image-Compose-Retrieval.
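
The benchmark's evaluation pattern — corrupt the composed query, then re-run retrieval — can be sketched with two toy corruptions (Gaussian noise on the reference image, word dropout on the modification text). These stand in for, and are much simpler than, the 15 visual and 7 textual corruptions used in CIRR-C and FashionIQ-C.

```python
# Sketch of the robustness-evaluation pattern: corrupt the composed query
# (reference image + modification text) and re-run retrieval. Only two toy
# corruptions are shown; the benchmarks apply 15 visual and 7 textual ones.
import random
import numpy as np

def gaussian_noise(image, severity=1):
    """image: float array in [0, 1]; severity loosely mimics benchmark levels."""
    sigma = [0.04, 0.06, 0.08, 0.10, 0.12][severity - 1]
    return np.clip(image + np.random.normal(0.0, sigma, image.shape), 0.0, 1.0)

def word_drop(text, p=0.2):
    """Randomly drop words from the modification text."""
    kept = [w for w in text.split() if random.random() > p]
    return " ".join(kept) if kept else text

def evaluate_robustness(retrieval_fn, queries, severity=3):
    """retrieval_fn(image, text) -> True if the target image is retrieved."""
    hits = 0
    for image, text in queries:
        corrupted = (gaussian_noise(image, severity), word_drop(text))
        hits += bool(retrieval_fn(*corrupted))
    return hits / max(len(queries), 1)

# Usage: plug any text-image composed retrieval model behind `retrieval_fn` and
# compare clean vs. corrupted recall to quantify the robustness gap.
```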

Proximal Algorithms for Accelerated Langevin Dynamics

  • paper_url: http://arxiv.org/abs/2311.14829
  • repo_url: None
  • paper_authors: Duy H. Thai, Alexander L. Young, David B. Dunson
  • for: This paper develops a new class of MCMC algorithms that achieve better mixing of the Markov chain.
  • methods: A stochastized Nesterov scheme with appropriately added noise yields a time-inhomogeneous underdamped Langevin equation, which is proven to admit the specified target distribution as its invariant measure.
  • results: Experiments show that the proposed method mixes better than typical Langevin samplers across several models in statistics and image processing.
    Abstract We develop a novel class of MCMC algorithms based on a stochastized Nesterov scheme. With an appropriate addition of noise, the result is a time-inhomogeneous underdamped Langevin equation, which we prove emits a specified target distribution as its invariant measure. Convergence rates to stationarity under Wasserstein-2 distance are established as well. Metropolis-adjusted and stochastic gradient versions of the proposed Langevin dynamics are also provided. Experimental illustrations show superior performance of the proposed method over typical Langevin samplers for different models in statistics and image processing including better mixing of the resulting Markov chains.
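
The paper's accelerated, time-inhomogeneous scheme is not reproduced here; for orientation only, the sketch below runs a plain Euler-discretized underdamped Langevin sampler on a 1D standard-normal target, i.e. the family of dynamics the proposed method builds on. Step size, friction, and target are illustrative choices.

```python
# Sketch of plain (non-accelerated) underdamped Langevin dynamics targeting a 1D
# standard normal, U(x) = x^2 / 2. The paper constructs an accelerated,
# time-inhomogeneous variant of this family; the parameters below are illustrative.
import numpy as np

def grad_U(x):
    return x  # gradient of the negative log-density of N(0, 1)

def underdamped_langevin(n_steps=50_000, dt=0.01, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, v = 0.0, 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        v += (-gamma * v - grad_U(x)) * dt + np.sqrt(2.0 * gamma * dt) * rng.standard_normal()
        x += v * dt
        samples[t] = x
    return samples

samples = underdamped_langevin()
print("empirical mean/std:", samples.mean().round(2), samples.std().round(2))  # ~0, ~1
```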

Text and Click inputs for unambiguous open vocabulary instance segmentation

  • paper_url: http://arxiv.org/abs/2311.14822
  • repo_url: https://github.com/nikolaiwarner7/text-and-click-for-open-vocabulary-segmentation
  • paper_authors: Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar
  • for: This work aims to improve the accuracy and efficiency of image segmentation by letting humans in the loop provide additional input that specifies which object to segment.
  • methods: The paper proposes a new segmentation formulation, Text + Click segmentation, in which the model takes an image, a text phrase describing the class to segment, and a single foreground click as input, and leverages open-vocabulary image-text models to support a wide range of text prompts.
  • results: On common segmentation datasets such as refCOCO, COCO, VOC, and OpenImages, the model better disambiguates overlapping or co-occurring semantic categories, such as "tie", "suit", and "person".
    Abstract Segmentation localizes objects in an image on a fine-grained per-pixel scale. Segmentation benefits by humans-in-the-loop to provide additional input of objects to segment using a combination of foreground or background clicks. Tasks include photoediting or novel dataset annotation, where human annotators leverage an existing segmentation model instead of drawing raw pixel level annotations. We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment. Compared to previous approaches, we leverage open-vocabulary image-text models to support a wide-range of text prompts. Conditioning segmentations on text prompts improves the accuracy of segmentations on novel or unseen classes. We demonstrate that the combination of a single user-specified foreground click and a text prompt allows a model to better disambiguate overlapping or co-occurring semantic categories, such as "tie", "suit", and "person". We study these results across common segmentation datasets such as refCOCO, COCO, VOC, and OpenImages. Source code available here.

SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

  • paper_url: http://arxiv.org/abs/2311.14671
  • repo_url: https://github.com/menglcool/segic
  • paper_authors: Lingchen Meng, Shiyi Lan, Hengduo Li, Jose M. Alvarez, Zuxuan Wu, Yu-Gang Jiang
  • for: This work develops an in-context segmentation method built on a single vision foundation model (VFM) that can segment novel images accurately from only a few labeled examples.
  • methods: The proposed method, SEGIC, builds on a VFM and exploits the emergent correspondence between the target image and in-context examples to extract geometric, visual, and meta instructions that condition the final mask prediction.
  • results: Experiments show that SEGIC achieves state-of-the-art performance on one-shot segmentation benchmarks and extends easily to tasks such as video object segmentation and open-vocabulary segmentation.
    Abstract In-context segmentation aims at segmenting novel images using a few labeled example images, termed as "in-context examples", exploring content similarities between examples and the target. The resulting models can be generalized seamlessly to novel segmentation tasks, significantly reducing the labeling and training costs compared with conventional pipelines. However, in-context segmentation is more challenging than classic ones due to its meta-learning nature, requiring the model to learn segmentation rules conditioned on a few samples, not just the segmentation. Unlike previous work with ad-hoc or non-end-to-end designs, we propose SEGIC, an end-to-end segment-in-context framework built upon a single vision foundation model (VFM). In particular, SEGIC leverages the emergent correspondence within VFM to capture dense relationships between target images and in-context samples. As such, information from in-context samples is then extracted into three types of instructions, i.e. geometric, visual, and meta instructions, serving as explicit conditions for the final mask prediction. SEGIC is a straightforward yet effective approach that yields state-of-the-art performance on one-shot segmentation benchmarks. Notably, SEGIC can be easily generalized to diverse tasks, including video object segmentation and open-vocabulary segmentation. Code will be available at \url{https://github.com/MengLcool/SEGIC}.

Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation

  • paper_url: http://arxiv.org/abs/2311.14665
  • repo_url: None
  • paper_authors: Paul Engstler, Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina
  • for: This paper investigates self-supervised representations for instance segmentation without any manual annotations.
  • methods: Several self-supervised learning methods, including DINO and MAE, are compared.
  • results: The study finds that DINO features, although excellent semantic descriptors, lag behind MAE features in their sensitivity for separating instances.
    Abstract Self-supervised learning (SSL) can be used to solve complex visual tasks without human labels. Self-supervised representations encode useful semantic information about images, and as a result, they have already been used for tasks such as unsupervised semantic segmentation. In this paper, we investigate self-supervised representations for instance segmentation without any manual annotations. We find that the features of different SSL methods vary in their level of instance-awareness. In particular, DINO features, which are known to be excellent semantic descriptors, lack behind MAE features in their sensitivity for separating instances.

Continuous football player tracking from discrete broadcast data

  • paper_url: http://arxiv.org/abs/2311.14642
  • repo_url: None
  • paper_authors: Matthew J. Penn, Christl A. Donnelly, Samir Bhatt
  • for: The work provides a method for estimating continuous full-pitch tracking data, giving professional football teams access to high-quality tracking data without specialized equipment or high cost.
  • methods: The method estimates continuous full-pitch tracking data from discrete data derived from broadcast footage.
  • results: Tests on open-source tracking data show that the method estimates full-pitch tracking data accurately, and a version of it can be applied to a large set of over 200 games.
    Abstract Player tracking data remains out of reach for many professional football teams as their video feeds are not sufficiently high quality for computer vision technologies to be used. To help bridge this gap, we present a method that can estimate continuous full-pitch tracking data from discrete data made from broadcast footage. Such data could be collected by clubs or players at a similar cost to event data, which is widely available down to semi-professional level. We test our method using open-source tracking data, and include a version that can be applied to a large set of over 200 games with such discrete data.
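
Purely to illustrate the input/output relationship (sparse, broadcast-derived observations in; densely sampled full-pitch trajectories out), the sketch below fills one player's trajectory between discrete observations with linear interpolation. This is not the paper's estimator, which must also cope with players who are off-camera in the broadcast feed.

```python
# Illustration only: turning sparse, discrete position observations for one
# player into a densely sampled trajectory via linear interpolation. The paper's
# method is more sophisticated than this toy gap-filling.
import numpy as np

obs_t = np.array([0.0, 2.5, 4.0, 7.5, 10.0])                                   # seconds
obs_xy = np.array([[10, 30], [15, 34], [18, 40], [25, 42], [30, 45]], float)   # pitch metres

t_dense = np.arange(0.0, 10.0 + 1e-9, 0.04)            # 25 Hz, a typical tracking rate
x_dense = np.interp(t_dense, obs_t, obs_xy[:, 0])
y_dense = np.interp(t_dense, obs_t, obs_xy[:, 1])
track = np.stack([t_dense, x_dense, y_dense], axis=1)  # (T, 3): time, x, y
print(track.shape)
```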

Unsupervised high-throughput segmentation of cells and cell nuclei in quantitative phase images

  • paper_url: http://arxiv.org/abs/2311.14639
  • repo_url: None
  • paper_authors: Julia Sistermanns, Ellen Emken, Gregor Weirich, Oliver Hayden, Wolfgang Utschick
  • for: This paper aims to improve the efficiency and reliability of cytologic diagnostics by enabling automatic single-cell screening with high-throughput digital holographic microscopy.
  • methods: An unsupervised multistage method segments cells automatically without confusing noise or reflections with cells and without missing cells, while also detecting relevant inner structures, in particular the nucleus of the unstained cell.
  • results: The paper shows that the segmentation gives consistently good results across many experiments on patient samples within a reasonable per-cell analysis time.
    Abstract In the effort to aid cytologic diagnostics by establishing automatic single cell screening using high throughput digital holographic microscopy for clinical studies thousands of images and millions of cells are captured. The bottleneck lies in an automatic, fast, and unsupervised segmentation technique that does not limit the types of cells which might occur. We propose an unsupervised multistage method that segments correctly without confusing noise or reflections with cells and without missing cells that also includes the detection of relevant inner structures, especially the cell nucleus in the unstained cell. In an effort to make the information reasonable and interpretable for cytopathologists, we also introduce new cytoplasmic and nuclear features of potential help for cytologic diagnoses which exploit the quantitative phase information inherent to the measurement scheme. We show that the segmentation provides consistently good results over many experiments on patient samples in a reasonable per cell analysis time.

Automated Detection and Counting of Windows using UAV Imagery based Remote Sensing

  • paper_url: http://arxiv.org/abs/2311.14635
  • repo_url: None
  • paper_authors: Dhruv Patel, Shivani Chepuri, Sarvesh Thakur, K. Harikumar, Ravi Kiran S., K. Madhava Krishna
  • for: This study proposes a UAV-based remote sensing method for detecting and counting the windows of a building, supporting inspection and assessment tasks in the construction and surveying sector.
  • methods: The two-stage method uses data from the UAV's onboard camera and other sensors, followed by computer vision pipelines that automatically identify and count windows.
  • results: Quantitative and qualitative results show that the proposed approach detects and counts windows accurately and outperforms the existing method.
    Abstract Despite the technological advancements in the construction and surveying sector, the inspection of salient features like windows in an under-construction or existing building is predominantly a manual process. Moreover, the number of windows present in a building is directly related to the magnitude of deformation it suffers under earthquakes. In this research, a method to accurately detect and count the number of windows of a building by deploying an Unmanned Aerial Vehicle (UAV) based remote sensing system is proposed. The proposed two-stage method automates the identification and counting of windows by developing computer vision pipelines that utilize data from UAV's onboard camera and other sensors. Quantitative and Qualitative results show the effectiveness of our proposed approach in accurately detecting and counting the windows compared to the existing method.

CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

  • paper_url: http://arxiv.org/abs/2311.14631
  • repo_url: None
  • paper_authors: Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao
  • for: This paper proposes an inversion-based text-to-image personalization method that learns a personalized concept from a handful of examples, so that text prompts can generate images embodying that concept.
  • methods: The method first dissects how the text encoder is integrated into the image generation process, then concatenates embeddings in the feature-dense space of the diffusion model's text encoder to learn the gap between the personalized concept and its base class.
  • results: Experiments show that CatVersion restores personalized concepts more faithfully and enables more robust editing; the paper also improves the mask-based CLIP image alignment score for a more accurate and unbiased evaluation of personalized image generation.
    Abstract We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing approaches that emphasize word embedding learning or parameter fine-tuning for the diffusion model, which potentially causes concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space of the encoder. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To more accurately and unbiasedly quantify the results of personalized image generation, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalization concepts more faithfully and enables more robust editing.

Neural Style Transfer for Computer Games

  • paper_url: http://arxiv.org/abs/2311.14617
  • repo_url: None
  • paper_authors: Eleftherios Ioannou, Steve Maddock
  • for: Enhancing the visuals of 3D computer games.
  • methods: Injecting depth-aware neural style transfer into the 3D rendering pipeline.
  • results: The approach produces temporally consistent, artistically stylised game scenes, outperforming state-of-the-art image and video NST methods.
    Abstract Neural Style Transfer (NST) research has been applied to images, videos, 3D meshes and radiance fields, but its application to 3D computer games remains relatively unexplored. Whilst image and video NST systems can be used as a post-processing effect for a computer game, this results in undesired artefacts and diminished post-processing effects. Here, we present an approach for injecting depth-aware NST as part of the 3D rendering pipeline. Qualitative and quantitative experiments are used to validate our in-game stylisation framework. We demonstrate temporally consistent results of artistically stylised game scenes, outperforming state-of-the-art image and video NST methods.

Animate124: Animating One Image to 4D Dynamic Scene

  • paper_url: http://arxiv.org/abs/2311.14603
  • repo_url: https://github.com/HeliosZhao/Animate124
  • paper_authors: Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee
  • for: This paper aims to animate a single in-the-wild image into a 3D video, with the motion controlled by a textual description.
  • methods: The method optimizes an advanced 4D grid dynamic Neural Radiance Field (NeRF) model in three stages, using 2D and 3D diffusion priors, a video diffusion model, and a personalized diffusion prior.
  • results: Comprehensive quantitative and qualitative evaluations show significant advancements over existing baselines.
    Abstract We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation leverages an advanced 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three distinct stages using multiple diffusion priors. Initially, a static model is optimized using the reference image, guided by 2D and 3D diffusion priors, which serves as the initialization for the dynamic NeRF. Subsequently, a video diffusion model is employed to learn the motion specific to the subject. However, the object in the 3D videos tends to drift away from the reference image over time. This drift is mainly due to the misalignment between the text prompt and the reference image in the video diffusion model. In the final stage, a personalized diffusion prior is therefore utilized to address the semantic drift. As the pioneering image-text-to-4D generation framework, our method demonstrates significant advancements over existing baselines, evidenced by comprehensive quantitative and qualitative assessments.

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.14580
  • repo_url: None
  • paper_authors: Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo
  • for: This paper aims to evaluate how well Vision-Language Models (VLMs) perform on intricate cognition and reasoning tasks, and how well they align with human intelligence and values.
  • methods: Large language models (e.g., GPT-4) automatically generate a vast set of question-answer-reasoning triplets from visual symbolic representations, which are used to evaluate VLMs, with LLMs such as GPT-3.5 then serving as judges for quantitative and qualitative assessment.
  • results: The LLM-based automatic assessment reaches an average agreement rate of 85%, indicating that the approach can evaluate VLMs reliably.
    Abstract With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation benchmarks, primarily relying on rigid, hand-crafted datasets to measure task-specific performance, face significant limitations in assessing the alignment of these increasingly anthropomorphic models with human intelligence. In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment. Specifically, for data curation, Auto-Bench utilizes LLMs (e.g., GPT-4) to automatically generate a vast set of question-answer-reasoning triplets via prompting on visual symbolic representations (e.g., captions, object locations, instance relationships, and etc.). The curated data closely matches human intent, owing to the extensive world knowledge embedded in LLMs. Through this pipeline, a total of 28.5K human-verified and 3,504K unfiltered question-answer-reasoning triplets have been curated, covering 4 primary abilities and 16 sub-abilities. We subsequently engage LLMs like GPT-3.5 to serve as judges, implementing the quantitative and qualitative automated assessments to facilitate a comprehensive evaluation of VLMs. Our validation results reveal that LLMs are proficient in both evaluation data curation and model assessment, achieving an average agreement rate of 85%. We envision Auto-Bench as a flexible, scalable, and comprehensive benchmark for evaluating the evolving sophisticated VLMs.

From Text to Image: Exploring GPT-4Vision’s Potential in Advanced Radiological Analysis across Subspecialties

  • paper_url: http://arxiv.org/abs/2311.14777
  • repo_url: None
  • paper_authors: Felix Busch, Tianyu Han, Marcus Makowski, Daniel Truhn, Keno Bressem, Lisa Adams
  • for: To evaluate and compare GPT-4 and GPT-4Vision on radiological tasks, and to examine whether GPT-4Vision can recognize radiological features directly from images, enhancing its diagnostic potential.
  • methods: GPT-4 and GPT-4Vision are applied to radiological tasks across subspecialties and their performance is compared.
  • results: GPT-4Vision may recognize radiological features from images, giving it greater diagnostic potential than text-based descriptions alone.
    Abstract The study evaluates and compares GPT-4 and GPT-4Vision for radiological tasks, suggesting GPT-4Vision may recognize radiological features from images, thereby enhancing its diagnostic potential over text-based descriptions.

ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.14542
  • repo_url: None
  • paper_authors: Eslam Mohamed Bakr, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny
  • for: This paper proposes an interpretable 2D diffusion-based image synthesis framework that makes the generation process easier to inspect and control.
  • methods: The method decomposes generation into three simple, interpretable stages — generating contours, a palette, and the detailed colored image — each carefully formulated for efficiency and accuracy.
  • results: Extensive experiments on the LSUN-Churches and COCO datasets show that ToddlerDiffusion consistently outperforms existing methods, matching LDM (Stable Diffusion) on LSUN-Churches while running three times faster with a 3.76 times smaller architecture.
    Abstract Diffusion-based generative models excel in perceptually impressive synthesis but face challenges in interpretability. This paper introduces ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework inspired by the human generation system. Unlike traditional diffusion models with opaque denoising steps, our approach decomposes the generation process into simpler, interpretable stages; generating contours, a palette, and a detailed colored image. This not only enhances overall performance but also enables robust editing and interaction capabilities. Each stage is meticulously formulated for efficiency and accuracy, surpassing Stable-Diffusion (LDM). Extensive experiments on datasets like LSUN-Churches and COCO validate our approach, consistently outperforming existing methods. ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating three times faster with a 3.76 times smaller architecture. Our source code is provided in the supplementary material and will be publicly accessible.

READS-V: Real-time Automated Detection of Epileptic Seizures from Surveillance Videos via Skeleton-based Spatiotemporal ViG

  • paper_url: http://arxiv.org/abs/2311.14775
  • repo_url: None
  • paper_authors: Yankun Xu, Jie Yang, Wenjie Ming, Shuang Wang, Mohamad Sawan
  • for: To develop an efficient, accurate, and timely video-based epileptic seizure onset detection system that enables better patient monitoring and diagnosis.
  • methods: A skeleton-based spatiotemporal vision graph neural network (STViG) recognizes patient actions quickly and accurately and detects seizure onsets from real-time surveillance video.
  • results: STViG outperforms previous state-of-the-art action recognition models on the collected patient video data (5.9% error) with lower computational complexity (0.4G FLOPs); combined with a decision rule that fuses output probabilities with an accumulative function, the READS-V system achieves a 5.1 s EEG onset detection latency, a 13.1 s advance over clinical onset detection, and zero false detections.
    Abstract An accurate and efficient epileptic seizure onset detection system can significantly benefit patients. Traditional diagnostic methods, primarily relying on electroencephalograms (EEGs), often result in cumbersome and non-portable solutions, making continuous patient monitoring challenging. The video-based seizure detection system is expected to free patients from the constraints of scalp or implanted EEG devices and enable remote monitoring in residential settings. Previous video-based methods neither enable all-day monitoring nor provide short detection latency due to insufficient resources and ineffective patient action recognition techniques. Additionally, skeleton-based action recognition approaches remain limitations in identifying subtle seizure-related actions. To address these challenges, we propose a novel skeleton-based spatiotemporal vision graph neural network (STViG) for efficient, accurate, and timely REal-time Automated Detection of epileptic Seizures from surveillance Videos (READS-V). Our experimental results indicate STViG outperforms previous state-of-the-art action recognition models on our collected patients' video data with higher accuracy (5.9% error) and lower FLOPs (0.4G). Furthermore, by integrating a decision-making rule that combines output probabilities and an accumulative function, our READS-V system achieves a 5.1 s EEG onset detection latency, a 13.1 s advance in clinical onset detection, and zero false detection rate.
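
The abstract describes the decision rule only as combining output probabilities with an accumulative function; the sketch below is one plausible reading, offered as an assumption: per-window seizure probabilities are accumulated over time and an onset alarm fires once the accumulated evidence crosses a threshold. The accumulation rule, floor, and threshold are illustrative, not the paper's.

```python
# Hedged sketch of an onset decision rule of the kind described ("output
# probabilities" + "accumulative function"): accumulate evidence from per-window
# seizure probabilities and raise an alarm when it crosses a threshold.
import numpy as np

def detect_onset(window_probs, window_stride_s=1.0, prob_floor=0.5, threshold=3.0):
    """window_probs: sequence of per-window seizure probabilities from the model.
    Returns the alarm time in seconds of video, or None if no alarm fires."""
    evidence = 0.0
    for i, p in enumerate(window_probs):
        # only probability mass above the floor counts; evidence decays otherwise
        evidence = max(0.0, evidence + (p - prob_floor))
        if evidence >= threshold:
            return (i + 1) * window_stride_s
    return None

probs = np.concatenate([np.full(20, 0.1), np.full(15, 0.9)])  # toy pre-/post-onset scores
print("alarm after", detect_onset(probs), "s of video")
```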

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting

  • paper_url: http://arxiv.org/abs/2311.14521
  • repo_url: None
  • paper_authors: Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, Guosheng Lin
  • for: This paper aims to improve the efficiency and control of 3D editing methods, particularly in complex scenes, by introducing a novel 3D representation called Gaussian Splatting (GS) and a new editing algorithm called GaussianEditor.
  • methods: The proposed GaussianEditor algorithm uses Gaussian semantic tracing to trace the editing target throughout the training process, and Hierarchical Gaussian splatting (HGS) to achieve stabilized and fine results under stochastic generative guidance from 2D diffusion models. The algorithm also includes editing strategies for efficient object removal and integration.
  • results: Comprehensive experiments demonstrate the superior control, efficacy, and rapid performance of GaussianEditor compared to traditional 3D editing methods, showing that it can edit complex scenes with high precision and efficiency and is particularly useful for tasks such as object removal and integration.
    Abstract 3D editing plays a crucial role in many areas such as gaming and virtual reality. Traditional 3D editing methods, which rely on representations like meshes and point clouds, often fall short in realistically depicting complex scenes. On the other hand, methods based on implicit 3D representations, like Neural Radiance Field (NeRF), render complex scenes effectively but suffer from slow processing speeds and limited control over specific scene areas. In response to these challenges, our paper presents GaussianEditor, an innovative and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D representation. GaussianEditor enhances precision and control in editing through our proposed Gaussian semantic tracing, which traces the editing target throughout the training process. Additionally, we propose Hierarchical Gaussian splatting (HGS) to achieve stabilized and fine results under stochastic generative guidance from 2D diffusion models. We also develop editing strategies for efficient object removal and integration, a challenging task for existing methods. Our comprehensive experiments demonstrate GaussianEditor's superior control, efficacy, and rapid performance, marking a significant advancement in 3D editing. Project Page: https://buaacyw.github.io/gaussian-editor/

Multi-Class Anomaly Detection based on Regularized Discriminative Coupled hypersphere-based Feature Adaptation

  • paper_url: http://arxiv.org/abs/2311.14506
  • repo_url: None
  • paper_authors: Mehdi Rafiei, Alexandros Iosifidis
  • for: This paper addresses multi-class anomaly detection by proposing a new model, Regularized Discriminative Coupled-hypersphere-based Feature Adaptation (RD-CFA).
  • methods: A modified Regularized Discriminative Variational Auto-Encoder (RD-VAE) captures class-discriminative properties, which are combined with Coupled-hypersphere-based Feature Adaptation (CFA) to form a multi-class anomaly detection solution.
  • results: Extensive evaluations show that RD-CFA outperforms eight leading contemporary methods in anomaly detection and localization.
    Abstract In anomaly detection, identification of anomalies across diverse product categories is a complex task. This paper introduces a new model by including class discriminative properties obtained by a modified Regularized Discriminative Variational Auto-Encoder (RD-VAE) in the feature extraction process of Coupled-hypersphere-based Feature Adaptation (CFA). By doing so, the proposed Regularized Discriminative Coupled-hypersphere-based Feature Adaptation (RD-CFA), forms a solution for multi-class anomaly detection. By using the discriminative power of RD-VAE to capture intricate class distributions, combined with CFA's robust anomaly detection capability, the proposed method excels in discerning anomalies across various classes. Extensive evaluations on multi-class anomaly detection and localization using the MVTec AD and BeanTech AD datasets showcase the effectiveness of RD-CFA compared to eight leading contemporary methods.

MVControl: Adding Conditional Control to Multi-view Diffusion for Controllable Text-to-3D Generation

  • paper_url: http://arxiv.org/abs/2311.14494
  • repo_url: https://github.com/wu-cvgl/mvcontrol
  • paper_authors: Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu
  • for: To enhance existing pre-trained multi-view 2D diffusion models so that they generate controllable multi-view images and view-consistent 3D content.
  • methods: Building on MVDream, a new neural network module is trained as a plug-in for end-to-end task-specific condition learning, with a novel conditioning mechanism that predicts an embedding encapsulating the input spatial and view conditions and injects it into the network globally.
  • results: Once MVControl is trained, score-distillation-based optimization with a hybrid diffusion prior generates high-quality 3D content with controllable shapes and views; extensive experiments demonstrate strong generalization and controllability. Code is available at https://github.com/WU-CVGL/MVControl/.
    Abstract We introduce MVControl, a novel neural network architecture that enhances existing pre-trained multi-view 2D diffusion models by incorporating additional input conditions, e.g. edge maps. Our approach enables the generation of controllable multi-view images and view-consistent 3D content. To achieve controllable multi-view image generation, we leverage MVDream as our base model, and train a new neural network module as additional plugin for end-to-end task-specific condition learning. To precisely control the shapes and views of generated images, we innovatively propose a new conditioning mechanism that predicts an embedding encapsulating the input spatial and view conditions, which is then injected to the network globally. Once MVControl is trained, score-distillation (SDS) loss based optimization can be performed to generate 3D content, in which process we propose to use a hybrid diffusion prior. The hybrid prior relies on a pre-trained Stable-Diffusion network and our trained MVControl for additional guidance. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content. Code available at https://github.com/WU-CVGL/MVControl/.

Set Features for Anomaly Detection

  • paper_url: http://arxiv.org/abs/2311.14773
  • repo_url: https://github.com/abhishekpatel-lpu/CICIDS-2017-intrution-detection-
  • paper_authors: Niv Cohen, Issar Tzachor, Yedid Hoshen
  • for: This work addresses detecting anomalous samples that consist of unusual combinations of normal elements.
  • methods: Each sample is modeled by the distribution of its elements (set features), and its anomaly score is computed with a simple density estimation method over fixed features.
  • results: The approach outperforms the previous state of the art in image-level logical anomaly detection and sequence-level time series anomaly detection.
    Abstract This paper proposes set features for detecting anomalies in samples that consist of unusual combinations of normal elements. Many leading methods discover anomalies by detecting an unusual part of a sample. For example, state-of-the-art segmentation-based approaches, first classify each element of the sample (e.g., image patch) as normal or anomalous and then classify the entire sample as anomalous if it contains anomalous elements. However, such approaches do not extend well to scenarios where the anomalies are expressed by an unusual combination of normal elements. In this paper, we overcome this limitation by proposing set features that model each sample by the distribution of its elements. We compute the anomaly score of each sample using a simple density estimation method, using fixed features. Our approach outperforms the previous state-of-the-art in image-level logical anomaly detection and sequence-level time series anomaly detection.
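
As a rough sketch of the set-features idea (score a sample by the distribution of its elements rather than by any single anomalous element), the snippet below represents each sample by a histogram of its patch features over a codebook fitted on normal data and scores it with a k-nearest-neighbour distance in histogram space. The codebook size, k, and the source of the patch features are assumptions, not the paper's exact pipeline.

```python
# Sketch of the set-features idea: describe each sample by the *distribution* of
# its elements (here, a histogram of patch features over a codebook fit on normal
# data) and score it by density estimation (kNN distance) in that histogram space.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def set_histogram(patch_features, codebook):
    """patch_features: (P, D) features of one sample's elements."""
    words = codebook.predict(patch_features)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
train_patches = [rng.normal(size=(64, 16)) for _ in range(50)]   # normal-only samples
test_patches = [rng.normal(size=(64, 16)) for _ in range(5)]

codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(np.vstack(train_patches))
train_hists = np.stack([set_histogram(p, codebook) for p in train_patches])
knn = NearestNeighbors(n_neighbors=3).fit(train_hists)

test_hists = np.stack([set_histogram(p, codebook) for p in test_patches])
dists, _ = knn.kneighbors(test_hists)
anomaly_scores = dists.mean(axis=1)   # higher = more unusual combination of elements
print(anomaly_scores)
```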

Towards Interpretable Classification of Leukocytes based on Deep Learning

  • paper_url: http://arxiv.org/abs/2311.14485
  • repo_url: None
  • paper_authors: Stefan Röhrl, Johannes Groll, Manuel Lengl, Simon Schumann, Christian Klenk, Dominik Heim, Martin Knopp, Oliver Hayden, Klaus Diepold
  • for: This paper aims to make label-free cytological classification more trustworthy so that it can be better integrated into the clinical decision making process.
  • methods: Machine learning methods are used for automated leukocyte classification, with an investigation of the calibration of confidence estimates and a comparison of different visual explanation approaches.
  • results: The study identifies general detection patterns in the neural networks and demonstrates the utility of the presented approaches in different scenarios of blood cell analysis.
    Abstract Label-free approaches are attractive in cytological imaging due to their flexibility and cost efficiency. They are supported by machine learning methods, which, despite the lack of labeling and the associated lower contrast, can classify cells with high accuracy where the human observer has little chance to discriminate cells. In order to better integrate these workflows into the clinical decision making process, this work investigates the calibration of confidence estimation for the automated classification of leukocytes. In addition, different visual explanation approaches are compared, which should bring machine decision making closer to professional healthcare applications. Furthermore, we were able to identify general detection patterns in neural networks and demonstrate the utility of the presented approaches in different scenarios of blood cell analysis.
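
One standard post-hoc calibration technique for the confidence estimates discussed above is temperature scaling; it is sketched below as a common option, not as the paper's specific method. The temperature search bounds and toy data are assumptions.

```python
# Temperature scaling, a standard post-hoc confidence calibration technique.
# Named plainly as a common approach; not claimed to be the paper's exact method.
import numpy as np
from scipy.optimize import minimize_scalar

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def fit_temperature(val_logits, val_labels):
    """Find T > 0 minimizing the NLL of softmax(logits / T) on held-out data."""
    def nll(T):
        logp = log_softmax(val_logits / T)
        return -logp[np.arange(len(val_labels)), val_labels].mean()
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x

rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 5)) * 3.0          # toy over-confident leukocyte logits
labels = rng.integers(0, 5, size=200)
T = fit_temperature(logits, labels)
calibrated_probs = np.exp(log_softmax(logits / T))
print("fitted temperature:", round(float(T), 2))
```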

Trainwreck: A damaging adversarial attack on image classifiers

  • paper_url: http://arxiv.org/abs/2311.14772
  • repo_url: https://github.com/janzahalka/trainwreck
  • paper_authors: Jan Zahálka
  • for: This paper explores a new attack vector, damaging adversarial attacks (DAAs), which aim to damage computer vision (CV) models as a form of economic sabotage.
  • methods: The paper proposes Trainwreck, a train-time attack that poisons the training data of image classifiers to degrade their performance, conflating the data of similar classes with stealthy (ε ≤ 8/255) class-pair universal perturbations computed using a surrogate model.
  • results: Experiments show that Trainwreck is an effective, transferable black-box attack across architectures including EfficientNetV2, ResNeXt-101, and a finetuned ViT-L-16, with strength tunable via the poison rate parameter; data redundancy with file hashing and/or pixel difference is identified as a reliable defense against Trainwreck and similar DAAs.
    Abstract Adversarial attacks are an important security concern for computer vision (CV), as they enable malicious attackers to reliably manipulate CV models. Existing attacks aim to elicit an output desired by the attacker, but keep the model fully intact on clean data. With CV models becoming increasingly valuable assets in applied practice, a new attack vector is emerging: disrupting the models as a form of economic sabotage. This paper opens up the exploration of damaging adversarial attacks (DAAs) that seek to damage the target model and maximize the total cost incurred by the damage. As a pioneer DAA, this paper proposes Trainwreck, a train-time attack that poisons the training data of image classifiers to degrade their performance. Trainwreck conflates the data of similar classes using stealthy ($\epsilon \leq 8/255$) class-pair universal perturbations computed using a surrogate model. Trainwreck is a black-box, transferable attack: it requires no knowledge of the target model's architecture, and a single poisoned dataset degrades the performance of any model trained on it. The experimental evaluation on CIFAR-10 and CIFAR-100 demonstrates that Trainwreck is indeed an effective attack across various model architectures including EfficientNetV2, ResNeXt-101, and a finetuned ViT-L-16. The strength of the attack can be customized by the poison rate parameter. Finally, data redundancy with file hashing and/or pixel difference are identified as a reliable defense technique against Trainwreck or similar DAAs. The code is available at https://github.com/JanZahalka/trainwreck.
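
The defense suggested in the abstract — data redundancy with file hashing — is straightforward to sketch: hash every training file against a manifest captured from a trusted copy of the dataset and flag mismatches as possible poisoning. The JSON manifest format and directory layout below are assumptions, not part of the paper's released tooling.

```python
# Sketch of the file-hashing defense mentioned in the abstract: keep a manifest
# of SHA-256 hashes from a trusted copy of the training set and flag any file
# whose hash changed (possible poisoning). The manifest format is assumed.
import hashlib
import json
from pathlib import Path

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(data_dir, manifest_path):
    hashes = {str(p.relative_to(data_dir)): sha256_of(p)
              for p in sorted(Path(data_dir).rglob("*")) if p.is_file()}
    Path(manifest_path).write_text(json.dumps(hashes, indent=2))

def check_against_manifest(data_dir, manifest_path):
    expected = json.loads(Path(manifest_path).read_text())
    tampered = [rel for rel, digest in expected.items()
                if sha256_of(Path(data_dir) / rel) != digest]
    return tampered  # non-empty list => dataset may have been poisoned

# build_manifest("train_images/", "manifest.json")       # run on the trusted copy
# print(check_against_manifest("train_images/", "manifest.json"))
```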

Joint Diffusion: Mutual Consistency-Driven Diffusion Model for PET-MRI Co-Reconstruction

  • paper_url: http://arxiv.org/abs/2311.14473
  • repo_url: None
  • paper_authors: Taofeng Xie, Zhuo-Xu Cui, Chen Luo, Huayu Wang, Congcong Liu, Yuanzhi Zhang, Xuemei Wang, Yanjie Zhu, Qiyu Jin, Guoqing Chen, Yihang Zhou, Dong Liang, Haifeng Wang
  • for: This study aims to accelerate MRI acquisition and enhance PET image quality for the functional and anatomical imaging in PET-MRI systems.
  • methods: A novel MC-Diffusion model learns the joint probability distribution of PET and MRI, exploiting the complementary information between the modalities for joint reconstruction.
  • results: Contrast experiments against LPLS and Joint ISAT-net on the ADNI dataset show qualitative and quantitative improvements, with MC-Diffusion surpassing the state-of-the-art method.
    Abstract Positron Emission Tomography and Magnetic Resonance Imaging (PET-MRI) systems can obtain functional and anatomical scans. PET suffers from a low signal-to-noise ratio. Meanwhile, the k-space data acquisition process in MRI is time-consuming. The study aims to accelerate MRI and enhance PET image quality. Conventional approaches involve the separate reconstruction of each modality within PET-MRI systems. However, there exists complementary information among multi-modal images. The complementary information can contribute to image reconstruction. In this study, we propose a novel PET-MRI joint reconstruction model employing a mutual consistency-driven diffusion mode, namely MC-Diffusion. MC-Diffusion learns the joint probability distribution of PET and MRI for utilizing complementary information. We conducted a series of contrast experiments about LPLS, Joint ISAT-net and MC-Diffusion by the ADNI dataset. The results underscore the qualitative and quantitative improvements achieved by MC-Diffusion, surpassing the state-of-the-art method.

CT-xCOV: a CT-scan based Explainable Framework for COVid-19 diagnosis

  • paper_url: http://arxiv.org/abs/2311.14462
  • repo_url: https://github.com/ismailelbouknify/ct-xcov
  • paper_authors: Ismail Elbouknify, Afaf Bouhoute, Khalid Fardousse, Ismail Berrada, Abdelmajid Badri
  • for: The goal is to develop an explainable deep learning framework for COVID-19 diagnosis from CT scans that provides both visual and textual explanations.
  • methods: A U-Net model performs lung segmentation, and three CNN architectures (a standard CNN, ResNet50, and DenseNet121) are compared for COVID-19 detection; after detection, three XAI techniques (Grad-CAM, Integrated Gradients (IG), and LIME) provide visual explanations, complemented by textual explanations based on the percentage of infected lung.
  • results: The DL models perform well: the U-Net segmentation reaches a Dice coefficient of 98%, the proposed standard-CNN classifier achieves 98.40% accuracy and a 98.23% F1-score under 5-fold cross-validation, and among the XAI techniques Grad-CAM gives the best explanations, reaching a Dice coefficient of 55% on COVID-19 positive scans compared to 29% for IG and 24% for LIME.
    Abstract In this work, CT-xCOV, an explainable framework for COVID-19 diagnosis using Deep Learning (DL) on CT-scans is developed. CT-xCOV adopts an end-to-end approach from lung segmentation to COVID-19 detection and explanations of the detection model's prediction. For lung segmentation, we used the well-known U-Net model. For COVID-19 detection, we compared three different CNN architectures: a standard CNN, ResNet50, and DenseNet121. After the detection, visual and textual explanations are provided. For visual explanations, we applied three different XAI techniques, namely, Grad-Cam, Integrated Gradient (IG), and LIME. Textual explanations are added by computing the percentage of infection by lungs. To assess the performance of the used XAI techniques, we propose a ground-truth-based evaluation method, measuring the similarity between the visualization outputs and the ground-truth infections. The performed experiments show that the applied DL models achieved good results. The U-Net segmentation model achieved a high Dice coefficient (98%). The performance of our proposed classification model (standard CNN) was validated using 5-fold cross-validation (acc of 98.40% and f1-score 98.23%). Lastly, the results of the comparison of XAI techniques show that Grad-Cam gives the best explanations compared to LIME and IG, by achieving a Dice coefficient of 55%, on COVID-19 positive scans, compared to 29% and 24% obtained by IG and LIME respectively. The code and the dataset used in this paper are available in the GitHub repository [1].
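
The ground-truth-based evaluation of the visual explanations reduces to a Dice overlap between a binarized explanation heatmap and the annotated infection mask; a small sketch follows, with the 0.5 binarization threshold as an assumption.

```python
# Sketch of the ground-truth-based XAI evaluation: binarize an explanation
# heatmap (e.g. Grad-CAM) and compute its Dice overlap with the annotated
# infection mask. The binarization threshold is an assumption.
import numpy as np

def dice(a, b, eps=1e-7):
    a, b = a.astype(bool), b.astype(bool)
    return (2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

def explanation_dice(heatmap, infection_mask, thr=0.5):
    """heatmap: float saliency in [0, 1]; infection_mask: binary ground truth."""
    return dice(heatmap >= thr, infection_mask)

heatmap = np.random.rand(256, 256)                 # toy Grad-CAM output
gt_mask = np.zeros((256, 256), bool)
gt_mask[64:128, 64:128] = True                     # toy annotated infection region
print(f"Dice(explanation, ground truth) = {explanation_dice(heatmap, gt_mask):.2f}")
```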

IDD-AW: A Benchmark for Safe and Robust Segmentation of Drive Scenes in Unstructured Traffic and Adverse Weather

  • paper_url: http://arxiv.org/abs/2311.14459
  • repo_url: None
  • paper_authors: Furqan Ahmed Shaik, Abhishek Malreddy, Nikhil Reddy Billa, Kunal Chaudhary, Sunny Manchanda, Girish Varma
  • for: This work provides a large-scale autonomous driving dataset focused on safety and robustness in unstructured traffic and adverse weather, addressing the needs of modern autonomous driving systems.
  • methods: The dataset is captured and annotated under rain, fog, low light, and snow, with high-quality pixel-level labels, paired Near-Infrared images, and a 4-level label hierarchy; the paper also proposes the Safe mean Intersection over Union (Safe mIoU) metric, which penalizes dangerous mispredictions.
  • results: Benchmarks of state-of-the-art semantic segmentation models show that IDD-AW is one of the most challenging datasets to date, exposing the limits of existing models.
    Abstract Large-scale deployment of fully autonomous vehicles requires a very high degree of robustness to unstructured traffic, and weather conditions, and should prevent unsafe mispredictions. While there are several datasets and benchmarks focusing on segmentation for drive scenes, they are not specifically focused on safety and robustness issues. We introduce the IDD-AW dataset, which provides 5000 pairs of high-quality images with pixel-level annotations, captured under rain, fog, low light, and snow in unstructured driving conditions. As compared to other adverse weather datasets, we provide i.) more annotated images, ii.) paired Near-Infrared (NIR) image for each frame, iii.) larger label set with a 4-level label hierarchy to capture unstructured traffic conditions. We benchmark state-of-the-art models for semantic segmentation in IDD-AW. We also propose a new metric called ''Safe mean Intersection over Union (Safe mIoU)'' for hierarchical datasets which penalizes dangerous mispredictions that are not captured in the traditional definition of mean Intersection over Union (mIoU). The results show that IDD-AW is one of the most challenging datasets to date for these tasks. The dataset and code will be available here: http://iddaw.github.io.
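
The precise Safe mIoU definition is given in the paper; the snippet below is only an assumption-labeled sketch of its stated intent — penalizing dangerous mispredictions that plain mIoU ignores — by subtracting a penalty for mispredicted pixels whose prediction falls in a different branch of the label hierarchy. The hierarchy map and penalty weight are illustrative.

```python
# Assumption-labeled sketch of a "safety-aware" mIoU in the spirit of Safe mIoU:
# compute ordinary per-class IoU, then subtract a penalty for mispredicted pixels
# whose prediction lies in a different branch of the label hierarchy (treated here
# as "dangerous"). The hierarchy and penalty weight below are illustrative.
import numpy as np

HIERARCHY = {0: "drivable", 1: "drivable", 2: "non-drivable", 3: "living-thing"}  # class -> branch

def safety_aware_miou(pred, gt, n_classes=4, penalty=1.0):
    ious = []
    for c in range(n_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue
        iou = np.logical_and(p, g).sum() / union
        # dangerous: ground truth is class c but the prediction's branch differs
        wrong = g & ~p
        dangerous = sum(HIERARCHY[int(k)] != HIERARCHY[c] for k in pred[wrong])
        ious.append(max(iou - penalty * dangerous / union, 0.0))
    return float(np.mean(ious))

pred = np.random.randint(0, 4, (128, 128))
gt = np.random.randint(0, 4, (128, 128))
print("safety-aware mIoU (sketch):", round(safety_aware_miou(pred, gt), 3))
```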

Segment (Almost) Nothing: Prompt-Agnostic Adversarial Attacks on Segmentation Models

  • paper_url: http://arxiv.org/abs/2311.14450
  • repo_url: None
  • paper_authors: Francesco Croce, Matthias Hein
  • for: This paper proposes latent-space adversarial attacks on segmentation models that are prompt-agnostic, affecting visual (point, box) and textual prompts alike.
  • methods: The attack perturbs the input image so as to maximize the L2 distance, in the latent space, between the image encoder's embeddings of the original and perturbed images.
  • results: The study finds that imperceptible perturbations of radius ε = 1/255 are often sufficient to drastically alter the masks predicted for a variety of prompts, and that universal, non image-specific attacks applicable to any input are also feasible.
    Abstract General purpose segmentation models are able to generate (semantic) segmentation masks from a variety of prompts, including visual (points, boxed, etc.) and textual (object names) ones. In particular, input images are pre-processed by an image encoder to obtain embedding vectors which are later used for mask predictions. Existing adversarial attacks target the end-to-end tasks, i.e. aim at altering the segmentation mask predicted for a specific image-prompt pair. However, this requires running an individual attack for each new prompt for the same image. We propose instead to generate prompt-agnostic adversarial attacks by maximizing the $\ell_2$-distance, in the latent space, between the embedding of the original and perturbed images. Since the encoding process only depends on the image, distorted image representations will cause perturbations in the segmentation masks for a variety of prompts. We show that even imperceptible $\ell_\infty$-bounded perturbations of radius $\epsilon=1/255$ are often sufficient to drastically modify the masks predicted with point, box and text prompts by recently proposed foundation models for segmentation. Moreover, we explore the possibility of creating universal, i.e. non image-specific, attacks which can be readily applied to any input without further computational cost.

Deformable multi-modal image registration for the correlation between optical measurements and histology images

  • paper_url: http://arxiv.org/abs/2311.14414
  • repo_url: None
  • paper_authors: Lianne Feenstra, Maud Lambregts, Theo J. M Ruers, Behdad Dashtbozorg
  • for: 提高仪器技术的验证性,减少人工注册错误和不一致性。
  • methods: 使用深度学习原理自动多Modal图像对alignment,并利用手动注册图像作为准确参照。
  • results: 对比手动注册和自动注册,自动注册表现出色,Dice分数和互信息指标都高于手动注册。
    Abstract The correlation of optical measurements with a correct pathology label is often hampered by imprecise registration caused by deformations in histology images. This study explores an automated multi-modal image registration technique utilizing deep learning principles to align snapshot breast specimen images with corresponding histology images. The input images, acquired through different modalities, present challenges due to variations in intensities and structural visibility, making linear assumptions inappropriate. Unsupervised and supervised learning approaches, based on the VoxelMorph model, were explored, making use of a dataset in which manually registered images serve as ground truth. Evaluation metrics, including Dice scores and mutual information, reveal that the unsupervised model significantly outperforms the supervised model (and the manual approach), achieving superior image alignment. This automated registration approach holds promise for improving the validation of optical technologies by minimizing human errors and inconsistencies associated with manual registration.
    摘要 “对于光学测量和正确病理标签之间的相互相关,经常会受到压缩图像的干扰,这使得对压缩图像进行自动多modal镜像对接成为一个重要的研究课题。本研究探讨了使用深度学习原理来自动对接 snapshot乳腺标本图像和相应的病理图像。输入图像,通过不同modalities所取得,具有不同的强度和结构可视性,使得线性假设无法应用。本研究探讨了不supervised和supervised学习方法,基于VoxelMorph模型,并使用了手动注册图像作为参考标准。评估指标,包括 dice分数和共识度,显示了不supervised模型对supervised(以及人工方法)进行了明显的超越,实现了更好的图像对接。这自动对接方法具有改善光学技术的验证过程中的人类错误和不一致性的潜力。”
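
The study evaluates registration quality with Dice scores and mutual information. A small sketch of both metrics is shown below (binary-mask Dice and histogram-based mutual information with an assumed bin count); this is generic evaluation code, not the authors' pipeline.

```python
import numpy as np

def dice_score(mask_a, mask_b, eps=1e-8):
    """Dice overlap between two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def mutual_information(img_a, img_b, bins=32):
    """Histogram-based mutual information between two grayscale images."""
    hist_2d, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = hist_2d / hist_2d.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])))

# Toy usage: identical masks give Dice = 1; an image is maximally informative about itself.
m = np.zeros((64, 64)); m[16:48, 16:48] = 1
img = np.random.rand(64, 64)
print(dice_score(m, m), mutual_information(img, img))
```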

OneFormer3D: One Transformer for Unified Point Cloud Segmentation

  • paper_url: http://arxiv.org/abs/2311.14405
  • repo_url: https://github.com/filapro/oneformer3d
  • paper_authors: Maxim Kolodiazhnyi, Anna Vorontsova, Anton Konushin, Danila Rukhovich
  • for: 这个论文 targets 3D 点云Semantic, Instance, and Panoptic Segmentation tasks.
  • methods: 该论文提出了一种统一的、简单的和高效的模型,名为OneFormer3D,可以同时地处理这三种任务。该模型使用可学习的核函数,每个核函数负责生成一个实例或 semantic category 的面积掩模。这些核函数被一个基于 transformer 的解码器进行培育,并将所有实例和 semantic queries 作为输入传递给该解码器。这种设计允许在单个训练运行中培育一个模型,从而实现所有三个分 segmentation 任务的同时优秀性能。
  • results: 该论文在ScanNet 测试领先борDEX中 ranking 1st, 并设置了新的 state-of-the-art (+2.1 mAP50) 成绩。此外,该论文还在 ScanNet200 和 S3DIS 数据集上实现了 state-of-the-art 的结果。
    Abstract Semantic, instance, and panoptic segmentation of 3D point clouds have been addressed using task-specific models of distinct design. Thereby, the similarity of all segmentation tasks and the implicit relationship between them have not been utilized effectively. This paper presents a unified, simple, and effective model addressing all these tasks jointly. The model, named OneFormer3D, performs instance and semantic segmentation consistently, using a group of learnable kernels, where each kernel is responsible for generating a mask for either an instance or a semantic category. These kernels are trained with a transformer-based decoder with unified instance and semantic queries passed as an input. Such a design enables training a model end-to-end in a single run, so that it achieves top performance on all three segmentation tasks simultaneously. Specifically, our OneFormer3D ranks 1st and sets a new state-of-the-art (+2.1 mAP50) in the ScanNet test leaderboard. We also demonstrate the state-of-the-art results in semantic, instance, and panoptic segmentation of ScanNet (+21 PQ), ScanNet200 (+3.8 mAP50), and S3DIS (+0.8 mIoU) datasets.

Multi-scale Semantic Correlation Mining for Visible-Infrared Person Re-Identification

  • paper_url: http://arxiv.org/abs/2311.14395
  • repo_url: https://github.com/Hua-XC/MSCMNet
  • paper_authors: Ke Cheng, Xuecheng Hua, Hu Lu, Juanjuan Tu, Yuanquan Wang, Shitong Wang
  • for: 提高Visible-Infrared Person Re-Identification (VI-ReID) 任务中的匹配精度,主要是怎样提取不同模式的特征来匹配用。
  • methods: 提出了一种 Multi-scale Semantic Correlation Mining network (MSCMNet),该网络包括三个新的组成部分:首先,通过考虑多种模式信息的有效利用,设计了一个多 scales Information Correlation Mining Block (MIMB),以探索多个缩放级别的semantic correlations; 其次,为MIMB提供更多的semantic信息,设计了一个 quadruple-stream feature extractor (QFE) with non-shared parameters,从不同维度的数据集中提取信息; 最后,提出了一种 Quadruple Center Triplet Loss (QCT),以解决总特征中的信息不一致问题。
  • results: 在SYSU-MM01、RegDB和LLCM datasets上进行了广泛的实验,结果显示,提出的MSCMNet可以达到最高的匹配精度。
    Abstract The main challenge in the Visible-Infrared Person Re-Identification (VI-ReID) task lies in how to extract discriminative features from different modalities for matching purposes. While existing works primarily focus on minimizing the modality discrepancy, the modality information cannot be thoroughly leveraged. To solve this problem, a Multi-scale Semantic Correlation Mining network (MSCMNet) is proposed to comprehensively exploit semantic features at multiple scales while keeping modality information loss in feature extraction as small as possible. The proposed network contains three novel components. Firstly, after taking into account the effective utilization of modality information, the Multi-scale Information Correlation Mining Block (MIMB) is designed to explore semantic correlations across multiple scales. Secondly, in order to enrich the semantic information that MIMB can utilize, a quadruple-stream feature extractor (QFE) with non-shared parameters is specifically designed to extract information from different dimensions of the dataset. Finally, the Quadruple Center Triplet Loss (QCT) is further proposed to address the information discrepancy in the comprehensive features. Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that the proposed MSCMNet achieves the highest accuracy.
    摘要 主要挑战在可见光谱人重识别任务中是如何提取特征特异性,以便匹配用途。现有的既有工作主要强调降低模态差异,但模态信息不能充分利用。为解决这个问题,一种多尺度semantic correlation mining网络(MSCMNet)被提出,以全面利用semantic特征,并同时尽量减少模态信息损失。该网络包括三个新成分。首先,为了考虑有效利用模态信息,一个多scale信息相关挖掘块(MIMB)被设计,以探索多个缩放级别的semantic相关性。其次,为了丰富semantic信息,特制的四个流Feature抽取器(QFE)被设计,以EXTRACT数据集中不同维度的信息。最后,一种四个中心三重损失函数(QCT)被提出,以解决全面特征之间的信息差异。EXTENSIVE experiments on SYSU-MM01, RegDB,和LLCM数据集显示,提出的MSCMNet可以获得最高的准确率。

A Parameterized Generative Adversarial Network Using Cyclic Projection for Explainable Medical Image Classification

  • paper_url: http://arxiv.org/abs/2311.14388
  • repo_url: None
  • paper_authors: Xiangyu Xiong, Yue Sun, Xiaohong Liu, ChanTong Lam, Tong Tong, Hao Chen, Qinquan Gao, Wei Ke, Tao Tan
  • for: addresses the problem of data insufficiency in small-scale medical datasets by proposing a parameterized GAN (ParaGAN) for effective domain adaptation and explainable classification.
  • methods: ParaGAN incorporates projection distance parameters in cyclic projection and projects the source images to the decision boundary to obtain the class-difference maps, which effectively controls the changes of synthetic samples among domains and highlights the attention regions for downstream classification.
  • results: the proposed ParaGAN consistently outperforms the existing augmentation methods with explainable classification on two small-scale medical datasets.
    Abstract Although current data augmentation methods successfully alleviate data insufficiency, conventional augmentation is primarily intra-domain, while the images generated by advanced generative adversarial networks (GANs) remain uncertain, particularly on small-scale datasets. In this paper, we propose a parameterized GAN (ParaGAN) that effectively controls the changes of synthetic samples among domains and highlights the attention regions for downstream classification. Specifically, ParaGAN incorporates projection distance parameters in cyclic projection and projects the source images to the decision boundary to obtain the class-difference maps. Our experiments show that ParaGAN can consistently outperform the existing augmentation methods with explainable classification on two small-scale medical datasets.
    摘要 尽管当前的数据增强方法能够解决数据不足问题,但这些方法主要是同一个领域内的增强,而高级的生成对抗网络(GANs)生成的图像仍然存在uncertainty,特别是在小规模数据集中。在这篇论文中,我们提出了一种具有参数化的GAN(ParaGAN),可以有效控制生成的样本之间域的变化和突出下游分类器的注意区域。具体来说,ParaGAN在循环投影中包含投影距离参数,将源图像投影到决策边缘,从而获得类差地图。我们的实验表明,ParaGAN可以在两个小规模医疗数据集上一致性地超越现有的增强方法,并且可以提供可解释的分类。

Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

  • paper_url: http://arxiv.org/abs/2311.14343
  • repo_url: None
  • paper_authors: Minshan Xie, Hanyuan Liu, Chengze Li, Tien-Tsin Wong
  • for: 本研究旨在提出一种同步多帧擦除框架,以维护视觉特征和时间一致性。
  • methods: 本方法使用文本指导的影像擦除模型,并将其扩展为视频合成。同时,通过 Shared Information among Frames (SIF) 机制,确保每帧的信息与其他帧的信息相互关联,以保持视觉特征和时间一致性。
  • results: 对比于现有的视频编辑方法,本方法可以生成高质量和多样化的视频结果,并且在量化测试中显示出优异的性能。
    Abstract Text-guided video-to-video stylization transforms the visual appearance of a source video to a different appearance guided on textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both highly detailed appearance and temporal consistency. In this paper, we propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency. Frames are denoised in a synchronous fashion, and more importantly, information of different frames is shared since the beginning of the denoising process. Such information sharing ensures that a consensus, in terms of the overall structure and color distribution, among frames can be reached in the early stage of the denoising process before it is too late. The optical flow from the original video serves as the connection, and hence the venue for information sharing, among frames. We demonstrate the effectiveness of our method in generating high-quality and diverse results in extensive experiments. Our method shows superior qualitative and quantitative results compared to state-of-the-art video editing methods.

Towards Concept-based Interpretability of Skin Lesion Diagnosis using Vision-Language Models

  • paper_url: http://arxiv.org/abs/2311.14339
  • repo_url: https://github.com/cristianopatricio/concept-based-interpretability-vlm
  • paper_authors: Cristiano Patrício, Luís F. Teixeira, João C. Neves
  • for: 该文章主要针对皮肤病变诊断方面的概念基型模型进行研究,以便提高诊断的可解释性。
  • methods: 该文章提出了一种基于视觉语言模型的概念embedding学习策略,通过将概念描述为文本嵌入来适应下游皮肤病变分类任务。
  • results: 实验表明,视觉语言模型不仅在使用概念为文本嵌入时可以达到更高的准确率,还可以采用更少的概念标注样本来达到相当的性能。
    Abstract Concept-based models naturally lend themselves to the development of inherently interpretable skin lesion diagnosis, as medical experts make decisions based on a set of visual patterns of the lesion. Nevertheless, the development of these models depends on the existence of concept-annotated datasets, whose availability is scarce due to the specialized knowledge and expertise required in the annotation process. In this work, we show that vision-language models can be used to alleviate the dependence on a large number of concept-annotated samples. In particular, we propose an embedding learning strategy to adapt CLIP to the downstream task of skin lesion classification using concept-based descriptions as textual embeddings. Our experiments reveal that vision-language models not only attain better accuracy when using concepts as textual embeddings, but also require a smaller number of concept-annotated samples to attain comparable performance to approaches specifically devised for automatic concept generation.
    摘要 “概念基础模型自然地适用于生成内在可解释的皮肤患病诊断,因为医疗专家做出决策 based on 皮肤患病的视觉模式。然而,这些模型的开发受到概念注解数据的有限性的限制,因为注解过程需要特殊的专业知识和技能。在这种情况下,我们表明了使用视觉语言模型可以减轻对概念注解样本的依赖。具体来说,我们提出了一种嵌入学习策略,使得 CLIP 可以通过概念基础的文本嵌入来适应皮肤患病分类任务。我们的实验表明,视觉语言模型不仅在使用概念为文本嵌入时达到更高的准确率,而且只需要一小数量的概念注解样本来达到与自动概念生成方法相比的相似水平。”
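
A sketch of the general recipe described above: encode a handful of concept descriptions with CLIP's text encoder and use the image-to-concept cosine similarities as interpretable features for a downstream classifier. The concept list and prompt template below are illustrative rather than the paper's annotated concept set, and the code uses the Hugging Face CLIP interface instead of the authors' released implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical dermoscopy concepts; the paper uses clinically defined concept descriptions.
concepts = ["asymmetry", "irregular border", "multiple colours",
            "atypical pigment network", "blue-whitish veil"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def concept_scores(image: Image.Image) -> torch.Tensor:
    """Cosine similarity between the image embedding and each concept's text embedding."""
    inputs = processor(text=[f"a dermoscopy image showing {c}" for c in concepts],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(0)   # one interpretable score per concept

# The concept-score vector can then feed a small linear head, e.g. for melanoma vs. nevus.
scores = concept_scores(Image.new("RGB", (224, 224), "gray"))
print(dict(zip(concepts, scores.tolist())))
```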

TVT: Training-Free Vision Transformer Search on Tiny Datasets

  • paper_url: http://arxiv.org/abs/2311.14337
  • repo_url: None
  • paper_authors: Zimian Wei, Hengyue Pan, Lujun Li, Peijie Dong, Zhiliang Tian, Xin Niu, Dongsheng Li
  • for: 在这篇论文中,搜索一个更好的视觉转换器(ViT),并且不需要训练。
  • methods: 这篇论文使用教师模型(ConvNet)的专长来搜索一个更好的ViT,并且使用了一个新的教师对应的评估指标和学生能力指标来搜索。
  • results: 在实验中,这篇论文的方法比过去的训练自由搜索方法表现更好,并且在不同的小资料集和搜索空间中都获得了优秀的效果。
    Abstract Training-free Vision Transformer (ViT) architecture search is presented to search for a better ViT with zero-cost proxies. While ViTs achieve significant distillation gains from CNN teacher models on small datasets, the current zero-cost proxies in ViTs do not generalize well to the distillation training paradigm according to our experimental observations. In this paper, for the first time, we investigate how to search in a training-free manner with the help of teacher models and devise an effective Training-free ViT (TVT) search framework. Firstly, we observe that the similarity of attention maps between ViT and ConvNet teachers notably affects distillation accuracy. Thus, we present a teacher-aware metric conditioned on the feature attention relations between teacher and student. Additionally, TVT employs the L2-Norm of the student's weights as the student-capability metric to improve ranking consistency. Finally, TVT searches for the best ViT for distilling with ConvNet teachers via our teacher-aware metric and student-capability metric, resulting in impressive gains in efficiency and effectiveness. Extensive experiments on various tiny datasets and search spaces show that our TVT outperforms state-of-the-art training-free search methods. The code will be released.
    摘要 training-freevision transformer(ViT)架构搜索是提出了在零成本情况下搜索更好的ViT。而ViT在小数据集上达到了显著的精炼成果,但现有的零成本代理在ViT中并不好地适应精炼训练方法,根据我们的实验观察。在这篇论文中,我们第一次 investigate了如何在无需训练的情况下进行搜索,并提出了一种有效的training-free ViT(TVT)搜索框架。首先,我们发现了ViT和ConvNet教师模型之间的注意力地图相似性对精炼精度产生了明显的影响。因此,我们提出了一种基于教师模型的教师相关的度量,并使用学生模型的L2-Norm weight作为学生能力度量来提高排名的一致性。最后,TVT通过我们的教师相关度量和学生能力度量来搜索最佳的ViT模型,并在不同的小数据集和搜索空间进行了广泛的实验,得到了较高的效率和效果。我们的TVT方法超越了当前的无需训练搜索方法的状态。代码将会发布。
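
A rough sketch of the two ingredients of the training-free score: a teacher-aware term (similarity between teacher and student attention maps) and a student-capability term (L2-norm of the student's weights). The resizing step, the log scaling and the combination weight `alpha` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def teacher_aware_score(student_attn, teacher_attn, student, alpha=1.0):
    """Sketch of a training-free ranking score for ViT candidates: cosine similarity
    between (resized) student and teacher attention maps, plus the L2-norm of the
    student's weights as a capability proxy."""
    t = F.interpolate(teacher_attn, size=student_attn.shape[-2:], mode="bilinear",
                      align_corners=False)
    sim = F.cosine_similarity(student_attn.flatten(1), t.flatten(1), dim=1).mean()
    capability = torch.sqrt(sum(p.pow(2).sum() for p in student.parameters()))
    return sim + alpha * torch.log(capability)

# Toy usage with random attention maps (B, 1, H, W) and a tiny stand-in student model.
student = torch.nn.Linear(64, 64)
s_attn, t_attn = torch.rand(2, 1, 14, 14), torch.rand(2, 1, 7, 7)
print(teacher_aware_score(s_attn, t_attn, student))
```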

Maximizing Discrimination Capability of Knowledge Distillation with Energy-based Score

  • paper_url: http://arxiv.org/abs/2311.14334
  • repo_url: None
  • paper_authors: Seonghak Kim, Gyeongdo Ham, Suin Lee, Donggon Jang, Daeshik Kim
  • for: 用于应用最新的计算机视觉技术,知识储存方法(KD)是必不可少的。现有的常数温度扩展KDs采用了所有样本集中的常数温度扩展,从而限制了每个样本的知识利用。
  • methods: 我们分类dataset为两类(低能量样本和高能量样本),基于每个样本的能量分数。通过实验,我们发现低能量样本具有高信任分数,表示一定的预测,而高能量样本具有低信任分数,表示不确定的预测。为了通过调整非目标类预测来总结优质知识,我们应用高温到低能量样本,以创建平滑的分布,并应用低温到高能量样本,以实现锐化分布。
  • results: 与前期的logit-based和特征基于方法相比,我们的能量基于KD(Energy KD)在多个数据集上达到了更好的性能。尤其是在CIFAR-100-LT和ImageNet数据集上, Energy KD 表现出了显著的改善。此外,我们还提出了高能量基数据增强(HE-DA),通过对20-50%的数据集进行增强,可以实现明显的性能改善。这表明HE-DA可以在资源有限的设备上使用。根据我们所知,这是第一篇利用能量分数在KD和DA中使用的论文,我们认为它将对未来的研究产生很大的贡献。
    Abstract To apply the latest computer vision techniques that require a large computational cost in real industrial applications, knowledge distillation methods (KDs) are essential. Existing logit-based KDs apply constant temperature scaling to all samples in the dataset, limiting the utilization of knowledge inherent in each sample individually. In our approach, we classify the dataset into two categories (i.e., low energy and high energy samples) based on their energy score. Through experiments, we have confirmed that low energy samples exhibit high confidence scores, indicating certain predictions, while high energy samples yield low confidence scores, meaning uncertain predictions. To distill optimal knowledge by adjusting non-target class predictions, we apply a higher temperature to low energy samples to create smoother distributions and a lower temperature to high energy samples to achieve sharper distributions. When compared to previous logit-based and feature-based methods, our energy-based KD (Energy KD) achieves better performance on various datasets. In particular, Energy KD shows significant improvements on the CIFAR-100-LT and ImageNet datasets, which contain many challenging samples. Furthermore, we propose high energy-based data augmentation (HE-DA) for further improving the performance. We demonstrate that meaningful performance improvement could be achieved by augmenting only 20-50% of the dataset, suggesting that it can be employed on resource-limited devices. To the best of our knowledge, this paper represents the first attempt to make use of energy scores in KD and DA, and we believe it will greatly contribute to future research.
    摘要 <>为应用计算机视觉技术,需要大量计算资源在实际应用中。知识储存方法(KD)是必要的。现有的常数温度扩展KDs将所有样本的温度设置为常数,这限制了每个样本的知识利用。在我们的方法中,我们将数据集分类为两类(即低能量样本和高能量样本),基于它们的能量分数。我们通过实验确认,低能量样本具有高信息率,表示确定的预测,而高能量样本具有低信息率,表示不确定的预测。为了通过调整非目标类预测来把握优质知识,我们将低能量样本应用高温,以创建更平滑的分布,而高能量样本应用低温,以实现更锐化的分布。与前期的常数温度扩展KD和特征扩展方法相比,我们的能量基于KD(能量KD)实现了更好的性能在多个数据集上。特别是在CIFAR-100-LT和ImageNet数据集上,我们的方法表现出了显著的改善。此外,我们还提出了高能量基数数据增强(HE-DA),可以进一步提高性能。我们示出,只需增强20-50%的数据集,就可以获得意义性的性能改善,这表明可以在资源有限的设备上使用。根据我们所知,这是首次利用能量分数在KD和DA中使用,我们认为这将对未来的研究产生很大的贡献。
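
A sketch of the core idea, assuming the standard energy score E(x) = -T·logsumexp(logits/T) from the energy-based model literature: low-energy (confident) samples are distilled with a higher temperature and high-energy (uncertain) samples with a lower one. The median split and the temperature values are illustrative choices, not the paper's.

```python
import torch
import torch.nn.functional as F

def energy_score(logits, t=1.0):
    """Energy score from the energy-based model literature: E(x) = -t * logsumexp(logits / t)."""
    return -t * torch.logsumexp(logits / t, dim=1)

def energy_kd_loss(student_logits, teacher_logits, t_low=6.0, t_high=2.0):
    """Sketch of energy-conditioned distillation: samples whose teacher energy is below
    the batch median (confident) get a higher temperature (smoother targets), the rest
    a lower one (sharper targets)."""
    e = energy_score(teacher_logits)
    temp = torch.where(e < e.median(),
                       torch.full_like(e, t_low),
                       torch.full_like(e, t_high)).unsqueeze(1)
    log_p_s = F.log_softmax(student_logits / temp, dim=1)
    p_t = F.softmax(teacher_logits / temp, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)
    return (kl * temp.squeeze(1) ** 2).mean()   # usual T^2 scaling

s, t = torch.randn(8, 100), torch.randn(8, 100)
print(energy_kd_loss(s, t))
```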

Binarized 3D Whole-body Human Mesh Recovery

  • paper_url: http://arxiv.org/abs/2311.14323
  • repo_url: https://github.com/zhitengli/bidrn
  • paper_authors: Zhiteng Li, Yulun Zhang, Jing Lin, Haotong Qin, Jinjin Gu, Xin Yuan, Linghe Kong, Xiaokang Yang
  • for: reconstruction of 3D human body, face, and hands from a single image
  • methods: Binarized Dual Residual Network (BiDRN) and Binaried BoxNet
  • results: significant improvement over state-of-the-art binarization algorithms, comparable performance with full-precision method Hand4Whole using fewer parameters and operations.
  • for: 这个论文的目标是从单个图像中重建3D人体、面孔和手部的三维模型。
  • methods: 我们提出了一种彩色分割网络(BiDRN)和彩色盒网络(Binaried BoxNet)来实现高效的三维人体重建。
  • results: 我们的BiDRN方法在比特化方法比较中具有显著的改善,并且与全精度方法Hand4Whole的性能相似,但占用的参数和运算数量减少了22.1%和14.8%。
    Abstract 3D whole-body human mesh recovery aims to reconstruct the 3D human body, face, and hands from a single image. Although powerful deep learning models have achieved accurate estimation in this task, they require enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited edge devices. In this work, we propose a Binarized Dual Residual Network (BiDRN), a novel quantization method to estimate the 3D human body, face, and hands parameters efficiently. Specifically, we design a basic unit Binarized Dual Residual Block (BiDRB) composed of Local Convolution Residual (LCR) and Block Residual (BR), which can preserve full-precision information as much as possible. For LCR, we generalize it to four kinds of convolutional modules so that full-precision information can be propagated even between mismatched dimensions. We also binarize the face and hands box-prediction network as Binaried BoxNet, which can further reduce the model redundancy. Comprehensive quantitative and qualitative experiments demonstrate the effectiveness of BiDRN, which has a significant improvement over state-of-the-art binarization algorithms. Moreover, our proposed BiDRN achieves comparable performance with full-precision method Hand4Whole while using just 22.1% parameters and 14.8% operations. We will release all the code and pretrained models.
    摘要 三维全身人体重建目标是从单个图像中重建三维人体、面孔和手部。虽然强大的深度学习模型已经实现了高精度的估计,但它们需要巨大的内存和计算资源。因此,这些方法几乎无法在边缘设备上部署。在这种情况下,我们提出了一种彩色二分数差异网络(BiDRN),一种新的量化方法,用于高效地估计三维人体、面孔和手部参数。具体来说,我们设计了基本单元彩色二分数差异块(BiDRB),其包括本地径差异块(LCR)和块差异块(BR)。LCR通过四种不同的径差异模块来保留全精度信息,以便在不匹配的维度之间传递信息。此外,我们还对面孔和手部框预测网络进行了彩色化,以进一步减少模型的重复性。从量化和质量上的实验来看,BiDRN具有显著的改善,与当前的量化算法相比。此外,我们的提议的BiDRN可以与全精度方法Hand4Whole achieve相当的性能,但使用的参数和操作数量却只有22.1%和14.8%。我们将发布所有代码和预训练模型。
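
The paper's BiDRB block is not reproduced here; the sketch below only shows the generic building blocks that binarized networks of this kind rely on: sign binarization with a straight-through estimator, a scaled 1-bit convolution, and a full-precision residual shortcut around it. Layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator: forward uses sign(x),
    backward passes the gradient through where |x| <= 1 (hard-tanh clip)."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    """Conv layer with binarized weights and activations, scaled by the mean |w|."""
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        x_bin = BinarizeSTE.apply(x)
        scale = self.weight.abs().mean()
        return scale * F.conv2d(x_bin, w_bin, self.bias, self.stride,
                                self.padding, self.dilation, self.groups)

# A full-precision shortcut around the binary conv (the rough idea behind residual
# designs in binarized networks) helps preserve information lost on the 1-bit path.
layer = BinaryConv2d(16, 16, 3, padding=1)
x = torch.randn(2, 16, 32, 32)
print((layer(x) + x).shape)   # binary path plus identity residual
```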

Stable Cluster Discrimination for Deep Clustering

  • paper_url: http://arxiv.org/abs/2311.14310
  • repo_url: https://github.com/idstcv/secu
  • paper_authors: Qi Qian
  • for: 本研究旨在提高深度嵌入 clustering 的表现,同时实现 representation learning 和 clustering 的同时进行。
  • methods: 为了解决两项目的问题,本研究提出了两阶段训练策略,包括一个预训练阶段 для representation learning,然后精确地调整得到的模型 для clustering。此外,一些一阶段方法也是主要用于 representation learning,通过设计不同的限制来避免潜在的潜在泥沼问题。
  • results: 实验结果显示,SeCu 可以在所有测试数据集上实现 state-of-the-art 的表现,证明了一阶段 deep clustering 的效iveness。
    Abstract Deep clustering can optimize representations of instances (i.e., representation learning) and explore the inherent data distribution (i.e., clustering) simultaneously, which demonstrates a superior performance over conventional clustering methods with given features. However, the coupled objective implies a trivial solution in which all instances collapse to uniform features. To tackle the challenge, a two-stage training strategy is developed for decoupling, where it introduces an additional pre-training stage for representation learning and then fine-tunes the obtained model for clustering. Meanwhile, one-stage methods are developed mainly for representation learning rather than clustering, where various constraints for cluster assignments are designed to avoid collapsing explicitly. Despite the success of these methods, an appropriate learning objective tailored for deep clustering has not been investigated sufficiently. In this work, we first show that the prevalent discrimination task in supervised learning is unstable for one-stage clustering due to the lack of ground-truth labels and positive instances for certain clusters in each mini-batch. To mitigate the issue, a novel stable cluster discrimination (SeCu) task is proposed and a new hardness-aware clustering criterion can be obtained accordingly. Moreover, a global entropy constraint for cluster assignments is studied with efficient optimization. Extensive experiments are conducted on benchmark data sets and ImageNet. SeCu achieves state-of-the-art performance on all of them, which demonstrates the effectiveness of one-stage deep clustering. Code is available at \url{https://github.com/idstcv/SeCu}.
    摘要 深度归一可以优化实例表示(即表示学习)并同时探索数据内部分布(即归一),这表明与传统归一方法相比,深度归一具有更高的表现。然而,整体目标函数隐藏了一个潜在的简单解,即所有实例都归一到共同特征。为解决这个挑战,我们提出了一种两阶段训练策略,其中首先进行表示学习预训练,然后细化获得的模型进行归一。同时,一些一阶段方法被主要用于表示学习而不是归一,这些方法通过设计各种约束来避免归一。虽然这些方法得到了成功,但是对深度归一的适应学习目标函数的研究还不充分。在这种情况下,我们首先显示了一阶段归一的普遍稳定性问题,因为每个批处中缺乏准确标签和每个类别的正例实例。为解决这个问题,我们提出了一种新的稳定归一任务(SeCu),并可以根据这个任务获得一个新的硬度感知的归一标准。此外,我们还研究了一种全局Entropy约束,以便有效地优化归一。我们在标准数据集和ImageNet上进行了广泛的实验,SeCu实现了所有数据集的状态之最好表现,这表明了深度归一的一阶段表示学习的有效性。代码可以在 \url{https://github.com/idstcv/SeCu} 中找到。

Cosine Similarity Knowledge Distillation for Individual Class Information Transfer

  • paper_url: http://arxiv.org/abs/2311.14307
  • repo_url: None
  • paper_authors: Gyeongdo Ham, Seonghak Kim, Suin Lee, Jae-Hyeok Lee, Daeshik Kim
  • for: 提高模型压缩的效果,特别是使得学生模型能够与老师模型的性能相似或更高。
  • methods: 利用批处理级别的教师和学生预测,并使用cosine相似性来衡量学生模型对老师模型知识的学习。在cosine相似性高时,降低温度缩放,以便学生模型更好地学习老师模型的知识。
  • results: 对比 existed KD 方法,提出了一种有效的 KD 方法,可以使学生模型与老师模型的性能相似或更高。经验证明,这种方法可以提高模型压缩的效果。
    Abstract Previous logit-based Knowledge Distillation (KD) methods have utilized predictions about multiple categories within each sample (i.e., class predictions) and have employed Kullback-Leibler (KL) divergence to reduce the discrepancy between the student and teacher predictions. Despite the proliferation of KD techniques, the student model continues to fall short of the teacher's performance level. In response, we introduce a novel and effective KD method capable of achieving results on par with or superior to the teacher model's performance. We utilize teacher and student predictions about multiple samples for each category (i.e., batch predictions) and apply cosine similarity, a commonly used technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings. This metric's inherent scale-invariance property, which relies solely on vector direction and not magnitude, allows the student to dynamically learn from the teacher's knowledge, rather than being bound by a fixed distribution of the teacher's knowledge. Furthermore, we propose a method called cosine similarity weighted temperature (CSWT) to improve the performance. CSWT reduces the temperature scaling in KD when the cosine similarity between the student and teacher models is high, and conversely, it increases the temperature scaling when the cosine similarity is low. This adjustment optimizes the transfer of information from the teacher to the student model. Extensive experimental results show that our proposed method serves as a viable alternative to existing methods. We anticipate that this approach will offer valuable insights for future research on model compression.
    摘要 先前的预测值基于的知识填充(KD)方法都是使用每个样本中的多个类别预测(i.e., class predictions),并使用卷积-莱布NER(KL)差异来减少学生模型和教师模型的差异。尽管KD技术的普及,学生模型仍然无法达到教师模型的相同水平。为此,我们介绍了一种新的和有效的KD方法,能够达到或超过教师模型的性能。我们使用教师和学生对每个类别的多个样本预测(i.e., batch predictions),并使用cosine相似性,一种常用的自然语言处理(NLP)中的度量手段。这个度量的自然尺度不变性,即仅仅基于向量方向而不是大小,使得学生可以从教师的知识中动态学习,而不是受到固定的教师知识分布的限制。此外,我们提出了一种called cosine相似性权重temperature(CSWT)来提高性能。CSWT在cosine相似性高时降低温度尺度的涨幅,并在cosine相似性低时增加温度尺度的涨幅。这种调整可以优化教师知识的传递给学生模型。我们的实验结果表明,我们的提议方法可以作为现有方法的可行代替。我们期望这种方法会为未来的模型压缩提供有价值的 inspirations。
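
A sketch of cosine-similarity weighted temperature distillation under assumptions: the per-sample cosine similarity between student and teacher logits is rescaled to [0, 1] and mapped linearly to a temperature that is low where the two already agree and high where they disagree. The bounds and the linear mapping are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def cswt_kd_loss(student_logits, teacher_logits, t_min=2.0, t_max=6.0):
    """Sketch of cosine-similarity weighted temperature (CSWT) distillation:
    temperature is lowered where student and teacher predictions already agree
    (high cosine similarity) and raised where they disagree."""
    cos = F.cosine_similarity(student_logits, teacher_logits, dim=1)      # (B,)
    w = (cos - cos.min()) / (cos.max() - cos.min() + 1e-8)                # rescale to [0, 1]
    temp = (t_max - w * (t_max - t_min)).unsqueeze(1)                     # high sim -> low T
    log_p_s = F.log_softmax(student_logits / temp, dim=1)
    p_t = F.softmax(teacher_logits / temp, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)
    return (kl * temp.squeeze(1) ** 2).mean()

s, t = torch.randn(16, 10), torch.randn(16, 10)
print(cswt_kd_loss(s, t))
```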

GeoViT: A Versatile Vision Transformer Architecture for Geospatial Image Analysis

  • paper_url: http://arxiv.org/abs/2311.14301
  • repo_url: None
  • paper_authors: Madhav Khirwar, Ankur Narang
  • for: 本研究旨在提供更高精度的CO2和NO2排放量估算、燃料类型和气溶胶覆盖率等数据,以便促进气候变化监测和排放限制政策的发展。
  • methods: 本研究使用了一种名为GeoViT的小型视 transformer模型,可以处理卫星影像数据进行多模式分 segmentation、分类和回归任务。
  • results: 通过使用GeoViT模型,本研究实现了对CO2排放量的更高精度估算、燃料类型的识别和高分辨率NO2浓度地图的创造等任务,并超越了之前的状态对模型。
    Abstract Greenhouse gases are pivotal drivers of climate change, necessitating precise quantification and source identification to foster mitigation strategies. We introduce GeoViT, a compact vision transformer model adept in processing satellite imagery for multimodal segmentation, classification, and regression tasks targeting CO2 and NO2 emissions. Leveraging GeoViT, we attain superior accuracy in estimating power generation rates, fuel type, plume coverage for CO2, and high-resolution NO2 concentration mapping, surpassing previous state-of-the-art models while significantly reducing model size. GeoViT demonstrates the efficacy of vision transformer architectures in harnessing satellite-derived data for enhanced GHG emission insights, proving instrumental in advancing climate change monitoring and emission regulation efforts globally.
    摘要 绿色气候变化的主要驱动者是绿色气体,因此精准量化和来源确定是必要的,以便开发 Mitigation 策略。我们介绍 GeoViT,一种具有可靠的卫星影像处理能力的小型视transformer模型,用于多模态分割、分类和回归任务,targeting CO2 和 NO2 排放。通过 GeoViT,我们实现了对发电率、燃料类型、CO2 气泡覆盖率和高分辨率 NO2 浓度地图的高精度估计,超越了先前的状态太平方法,同时具有显著减小模型大小的优势。GeoViT 示例了视transformer 架构在使用卫星来源数据的情况下提供更高的GHG 排放情况洞察,这将对全球气候变化监测和排放规定做出重要贡献。

Decouple Content and Motion for Conditional Image-to-Video Generation

  • paper_url: http://arxiv.org/abs/2311.14294
  • repo_url: None
  • paper_authors: Cuifeng Shen, Yulu Gan, Chen Chen, Xiongwei Zhu, Lele Cheng, Jinzhi Wang
  • for: 创建一个真实的新视频,只准入一个图像和文本作为条件。
  • methods: 提出了一种新的方法,通过分解目标RGB像素为空间内容和时间运动两个分量来解决传统的cI2V生成方法中的限制,包括模式一致和视觉连续性。
  • results: 实验结果表明,该方法可以在不添加新结构复杂度的情况下提高效果和效率,并且在多个数据集上达到了现有方法的性能水平。
    Abstract The goal of conditional image-to-video (cI2V) generation is to create a believable new video by beginning with the condition, i.e., one image and text. The previous cI2V generation methods conventionally perform in RGB pixel space, with limitations in modeling motion consistency and visual continuity. Additionally, the efficiency of generating videos in pixel space is quite low. In this paper, we propose a novel approach to address these challenges by disentangling the target RGB pixels into two distinct components: spatial content and temporal motions. Specifically, we predict temporal motions, which include a motion vector and a residual, based on a 3D-UNet diffusion model. By explicitly modeling temporal motions and warping them to the starting image, we improve the temporal consistency of generated videos. This results in a reduction of spatial redundancy, emphasizing temporal details. Our proposed method achieves performance improvements by disentangling content and motion, all without introducing new structural complexities to the model. Extensive experiments on various datasets confirm our approach's superior performance over the majority of state-of-the-art methods in both effectiveness and efficiency.
    摘要 “目的是实现基于条件(cI2V)的图像转映,创建一个真实的新影片,从条件开始,即一幅图像和文本。现有的cI2V生成方法通常在RGB像素空间中进行,它们受到模型动作一致和视觉连续性的限制。另外,生成影片的效率在像素空间是很低。在这篇论文中,我们提出了一个新的方法,即分解目标RGB像素为两个不同的分量:空间内容和时间动作。具体来说,我们预测时间动作,包括动向 вектор和差异,透过3D-UNet扩散模型。通过Explicitly 模型时间动作,并将其扭转到起始图像,我们提高了生成影片的时间一致性。这导致缩减空间重复,强调时间细节。我们的提案方法在效能和效率方面都超过了大多数现有的方法,并且不增加新的结构层次到模型中。广泛的实验证明了我们的方法的超越性。”

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

  • paper_url: http://arxiv.org/abs/2311.14284
  • repo_url: https://github.com/weijiawu/paradiffusion
  • paper_authors: Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang
  • for: 这篇论文主要target是解决长段文本(Up to 512 words)到图像生成任务中的Alignment问题。
  • methods: 该模型使用大语言模型(例如Llama V2)进行文本编码,然后通过LORA进行调整,以实现文本-图像特征空间的Alignment。
  • results: 实验表明,ParaDiffusion模型在ViLG-300和ParaPrompts上比前一代模型(SD XL、DeepFloyd IF)高出15%和45%的人工投票率,对视觉吸引力和文本准确性进行了 significiant improvement。
    Abstract Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is using a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LORA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset with long text descriptions being generated using a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment.
    摘要 文本到图像(T2I)模型在最近几年内发展非常快,在 faithfulness 和文本对齐能力方面达到了惊人的表现。然而,对于长 paragraph(最多 512 个字),这些生成模型仍然具有差不多的对齐能力,无法生成复杂场景的图像。在这篇论文中,我们介绍了一种具有信息激活的扩展模型,称为 ParaDiffusion,该模型利用大型语言模型(例如 Llama V2)对长文本进行编码,然后通过 LORA 的 fine-tuning 将文本-图像特征空间对齐在生成任务中。为了促进长文本semantic alignment的训练,我们还精心准备了一个高质量的 paragraph-image 对应数据集,即 ParaImage。这个数据集包括一小部分高质量、精心标注的数据,以及一个大规模的Synthetic数据集,其中使用了视力语言模型来生成长文本描述。实验表明,ParaDiffusion 在 ViLG-300 和 ParaPrompts 上超过了现状的模型(SD XL、DeepFloyd IF),实现了15% 到 45% 的人类投票率提升,即 visual appeal 和 text faithfulness 的提升。代码和数据将被发布,以便社区进行长文本对齐的研究。

Image Super-Resolution with Text Prompt Diffusion

  • paper_url: http://arxiv.org/abs/2311.14282
  • repo_url: https://github.com/zhengchen1999/promptsr
  • paper_authors: Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, Xiaokang Yang
  • for: 提高图像超分辨率(SR)性能,通过引入文本提示来提供质量约束。
  • methods: 使用文本生成管道将文本与SR数据集集成,通过抽象的文本受损表示方法和受损模型来实现文本描述。提出PromptSR模型,利用扩散模型和预训练语言模型(如T5和CLIP)来实现文本提示SR。
  • results: 在synthetic和实际图像上,通过引入文本提示,SR性能得到了显著提高。
    Abstract Image super-resolution (SR) methods typically model degradation to improve reconstruction accuracy in complex and unknown degradation scenarios. However, extracting degradation information from low-resolution images is challenging, which limits the model performance. To boost image SR performance, one feasible approach is to introduce additional priors. Inspired by advancements in multi-modal methods and text prompt image processing, we introduce text prompts to image SR to provide degradation priors. Specifically, we first design a text-image generation pipeline to integrate text into SR dataset through the text degradation representation and degradation model. The text representation applies a discretization manner based on the binning method to describe the degradation abstractly. This representation method can also maintain the flexibility of language. Meanwhile, we propose the PromptSR to realize the text prompt SR. The PromptSR employs the diffusion model and the pre-trained language model (e.g., T5 and CLIP). We train the model on the generated text-image dataset. Extensive experiments indicate that introducing text prompts into image SR, yields excellent results on both synthetic and real-world images. Code: https://github.com/zhengchen1999/PromptSR.
    摘要 图像超分辨率(SR)方法通常模型劣化来提高重建准确率,但是从低分辨率图像中提取劣化信息是困难的,这限制了模型性能。为了提高图像SR性能,一个可行的方法是引入额外约束。引申于多模式方法和文本描述图像处理的进步,我们引入文本提示来图像SR中。具体来说,我们首先设计了文本-图像生成管线,通过文本劣化表示和劣化模型来整合文本到SR数据集中。文本表示采用了精度方法基于桶法来描述劣化抽象。这种表示方法可以保持语言的灵活性。同时,我们提出了PromptSR,它使用了扩散模型和预训练语言模型(例如T5和CLIP)来实现文本提示SR。我们在生成的文本-图像数据集上训练了模型。广泛的实验表明,在图像SR中引入文本提示,可以获得优秀的结果,并在真实的图像上进行了证明。代码:https://github.com/zhengchen1999/PromptSR。
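
A sketch of the binning-based text degradation representation described above: continuous degradation parameters are discretized into coarse bins and verbalized into a short prompt for the text encoder. The parameters, bin edges, and wording below are assumptions for illustration, not the paper's exact scheme.

```python
def degradation_prompt(blur_sigma: float, noise_level: float, scale: int) -> str:
    """Sketch of a binning-based text degradation representation: continuous
    degradation parameters are discretized into coarse bins and verbalized."""
    def bin_label(value, edges, labels):
        for edge, label in zip(edges, labels):
            if value <= edge:
                return label
        return labels[-1]

    blur = bin_label(blur_sigma, [0.5, 1.5, 3.0], ["no", "light", "medium", "heavy"])
    noise = bin_label(noise_level, [5, 15, 30], ["no", "light", "medium", "heavy"])
    return f"{blur} blur, {noise} noise, downsampled {scale}x"

# The resulting prompt is fed to a text encoder (e.g. T5 or CLIP) alongside the LR image.
print(degradation_prompt(blur_sigma=2.1, noise_level=10, scale=4))
# -> "medium blur, light noise, downsampled 4x"
```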

Multi-modal Instance Refinement for Cross-domain Action Recognition

  • paper_url: http://arxiv.org/abs/2311.14281
  • repo_url: None
  • paper_authors: Yuan Qing, Naixing Wu, Shaohua Wan, Lixin Duan
  • for: 提高跨频道动作识别的性能,减少负向传递
  • methods: 提出了多模态实例精炼(MMIR)方法,通过对每个模式进行强制学习,选择每个频道中的负样本,以减少负向传递
  • results: 在EPIC-Kitchens数据集上,与其他多个基线方法进行比较,实现了MMIR方法的超越表现,demonstrating the advantage of MMIR in reducing negative transfer.
    Abstract Unsupervised cross-domain action recognition aims at adapting the model trained on an existing labeled source domain to a new unlabeled target domain. Most existing methods solve the task by directly aligning the feature distributions of source and target domains. However, this would cause negative transfer during domain adaptation due to some negative training samples in both domains. In the source domain, some training samples are of low relevance to the target domain due to the difference in viewpoints, action styles, etc. In the target domain, there are some ambiguous training samples that can easily be classified as another type of action from the perspective of the source domain. The problem of negative transfer has been explored in cross-domain object detection, while it remains under-explored in cross-domain action recognition. Therefore, we propose a Multi-modal Instance Refinement (MMIR) method to alleviate the negative transfer based on reinforcement learning. Specifically, a reinforcement learning agent is trained in both domains for every modality to refine the training data by selecting out negative samples from each domain. Our method finally outperforms several other state-of-the-art baselines in cross-domain action recognition on the benchmark EPIC-Kitchens dataset, which demonstrates the advantage of MMIR in reducing negative transfer.
    摘要 cross-domain action recognition是一种无监督的领域适应任务,旨在将源频率域已经训练的模型适应到新的无标签目标频率域。大多数现有方法直接对源和目标频率域的特征分布进行对齐,这会导致域适应中的负面影响。在源频率域中,一些训练样本具有低相关性,因为视角、动作风格等因素的不同。在目标频率域中,有一些抽象的训练样本可以在源频率域中被误分类为另一种动作。域适应中的负面影响已经在跨频率域物体检测中被探讨,而在跨频率域动作识别中尚未得到充分的探讨。因此,我们提出了一种多模式实例纠正(MMIR)方法,以减少负面影响。具体来说,我们在两个频率域中训练了一个强化学习代理,用于在每个模式中纠正训练数据,选择源频率域和目标频率域中的负面样本。我们的方法最终在EPIC-Kitchens数据集上比较其他多个状态级基elines具有优势,这说明MMIR在减少负面影响方面的优势。

Latent Diffusion Prior Enhanced Deep Unfolding for Spectral Image Reconstruction

  • paper_url: http://arxiv.org/abs/2311.14280
  • repo_url: None
  • paper_authors: Zongliang Wu, Ruiying Lu, Ying Fu, Xin Yuan
  • for: 实现压缩特征图像重建,从单一具有压缩测量的二维数据中恢复三维空间特征图像。
  • methods: 使用热漂浮结构,并将实现为具有压缩特征的图像进行增强。
  • results: 提供了一个可靠且高效的方法,可以从单一具有压缩测量的二维数据中恢复高质量的三维空间特征图像,并且可以提高实现速度。
    Abstract Snapshot compressive spectral imaging reconstruction aims to reconstruct three-dimensional spatial-spectral images from a single-shot two-dimensional compressed measurement. Existing state-of-the-art methods are mostly based on deep unfolding structures but have intrinsic performance bottlenecks: $i$) the ill-posed problem of dealing with heavily degraded measurement, and $ii$) the regression loss-based reconstruction models being prone to recover images with few details. In this paper, we introduce a generative model, namely the latent diffusion model (LDM), to generate degradation-free prior to enhance the regression-based deep unfolding method. Furthermore, to overcome the large computational cost challenge in LDM, we propose a lightweight model to generate knowledge priors in deep unfolding denoiser, and integrate these priors to guide the reconstruction process for compensating high-quality spectral signal details. Numeric and visual comparisons on synthetic and real-world datasets illustrate the superiority of our proposed method in both reconstruction quality and computational efficiency. Code will be released.
    摘要 Snapshot 压缩spectral imaging重建目标是从单个两维压缩测量中重建三维空间spectral图像。现有状态之arteMethods主要基于深度 unfolding 结构,但它们具有内在的性能瓶颈:$i$) 测量受到严重损害的问题,和 $ii$) 基于回归损失的重建模型容易recover图像中的细节少。在这篇论文中,我们引入了一种生成模型,即latent diffusion model(LDM),以增强基于回归的深度 unfolding 方法。此外,为了解决LDM中的大量计算成本问题,我们提议一种轻量级的模型来生成知识先验,并将这些先验与深度 unfolding denoiser 集成,以导引重建过程以补做高质量spectral信号的细节。数字和视觉比较表明我们的提出方法在重建质量和计算效率两个方面具有优势。代码将发布。

Racing With ROS 2 A Navigation System for an Autonomous Formula Student Race Car

  • paper_url: http://arxiv.org/abs/2311.14276
  • repo_url: https://github.com/qut-motorsport/qutms_nav_integration
  • paper_authors: Alastair Bradford, Grant van Breda, Tobias Fischer
  • for: The paper is written for teams participating in autonomous racing disciplines, such as Formula Student and Society of Automotive Engineers, who are looking for an open-source solution to navigate their race cars.
  • methods: The paper uses the Robot Operating System 2 (ROS2) and its open-source navigation stack to address the challenges of high-speed navigation and control in autonomous racing. The authors compare off-the-shelf navigation libraries against traditional custom-made programs developed by QUT Motorsport to evaluate their applicability in autonomous racing scenarios.
  • results: The paper provides quantitative and qualitative comparisons of the navigation packages against traditional navigation solutions, with the goal of lowering the entry barrier for autonomous racing. The authors also provide a comprehensive tutorial for teams participating in similar racing disciplines and other autonomous mobile robot applications.
    Abstract The advent of autonomous vehicle technologies has significantly impacted various sectors, including motorsport, where Formula Student and Formula: Society of Automotive Engineers introduced autonomous racing classes. These offer new challenges to aspiring engineers, including the team at QUT Motorsport, but also raise the entry barrier due to the complexity of high-speed navigation and control. This paper presents an open-source solution using the Robot Operating System 2, specifically its open-source navigation stack, to address these challenges in autonomous Formula Student race cars. We compare off-the-shelf navigation libraries that this stack comprises of against traditional custom-made programs developed by QUT Motorsport to evaluate their applicability in autonomous racing scenarios and integrate them onto an autonomous race car. Our contributions include quantitative and qualitative comparisons of these packages against traditional navigation solutions, aiming to lower the entry barrier for autonomous racing. This paper also serves as a comprehensive tutorial for teams participating in similar racing disciplines and other autonomous mobile robot applications.
    摘要 自动驾驶技术的出现对各个领域产生了深远的影响,其中包括赛车, Formula Student 和 Society of Automotive Engineers 等组织也在引入自动赛车级别。这些新的挑战对年轻的工程师来说是一个重要的机遇,如 Queensland University of Technology 的车队。然而,自动赛车技术的复杂性也提高了参与门槛,特别是高速导航和控制方面。本文推出了一个开源解决方案,使用 Robot Operating System 2 的开源导航栈来解决自动赛车技术中的挑战。我们对商业 Navigation 库和 QUT Motorsport 自己开发的传统导航程序进行了比较,以评估它们在自动赛车场景中的适用性,并将它们集成到自动赛车上。我们的贡献包括对这些包装的量化和质量比较,以降低自动赛车技术的入门门槛。此外,本文还 serves as a comprehensive tutorial for参与相似赛车领域的团队和其他自动移动机器应用。

Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues

  • paper_url: http://arxiv.org/abs/2311.14275
  • repo_url: None
  • paper_authors: Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen
  • for: 本文主要目标是提高Audio-Visual Speech Enhancement(AVSE)的稳定性和效果,通过利用脸部信息以外的面部特征。
  • methods: 提议一种 dual attention cooperative 框架,包括一个基于空间注意力的视觉编码器,以及一个基于自我注意力的视觉特征融合策略,以忽略非关键信息,捕捉和提高视觉信息,并与音频信号进行紧凑的融合。
  • results: 对多个 dataset 进行了严格的分析和比较,结果显示,我们的模型在多种 metric 上均超过现有方法,特别是在面部信息不可靠或缺失时。
    Abstract In this work, we focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE). The facial region, encompassing the lip region, reflects additional speech-related attributes such as gender, skin color, nationality, etc., which contribute to the effectiveness of AVSE. However, static and dynamic speech-unrelated attributes also exist, causing appearance changes during speech. To address these challenges, we propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE. Specifically, we introduce a spatial attention-based visual encoder to capture and enhance visual speech information beyond the lip region, incorporating global facial context and automatically ignoring speech-unrelated information for robust visual feature extraction. Additionally, a dynamic visual feature fusion strategy is introduced by integrating a temporal-dimensional self-attention module, enabling the model to robustly handle facial variations. The acoustic noise in the speaking process is variable, impacting audio quality. Therefore, a dynamic fusion strategy for both audio and visual features is introduced to address this issue. By integrating cooperative dual attention in the visual encoder and audio-visual fusion strategy, our model effectively extracts beneficial speech information from both audio and visual cues for AVSE. Thorough analysis and comparison on different datasets, including normal and challenging cases with unreliable or absent visual information, consistently show our model outperforming existing methods across multiple metrics.

CRISP: Hybrid Structured Sparsity for Class-aware Model Pruning

  • paper_url: http://arxiv.org/abs/2311.14272
  • repo_url: https://github.com/shivmgg/crisp
  • paper_authors: Shivam Aggarwal, Kuluhan Binici, Tulika Mitra
  • for: 提高计算效率和减少模型大小,适用于在限定类型数据上进行图像分类任务。
  • methods: 提出了一种名为CRISP的新零化框架,利用一种混合的精细结构稀疏模式,包括细致的N:M结构稀疏和块级稀疏。采用梯度导引的类域准确性分数来引导零化策略,以保留用户特定类型的权重。
  • results: CRISP实现了高准确率和最小内存占用量,对Popular模型如ResNet-50、VGG-16和MobileNetV2进行了ImageNet和CIFAR-100数据集的测试,并实现了14倍的延迟和能耗减少,与现有零化方法相比。代码可以在https://github.com/shivmgg/CRISP/查看。
    Abstract Machine learning pipelines for classification tasks often train a universal model to achieve accuracy across a broad range of classes. However, a typical user encounters only a limited selection of classes regularly. This disparity provides an opportunity to enhance computational efficiency by tailoring models to focus on user-specific classes. Existing works rely on unstructured pruning, which introduces randomly distributed non-zero values in the model, making it unsuitable for hardware acceleration. Alternatively, some approaches employ structured pruning, such as channel pruning, but these tend to provide only minimal compression and may lead to reduced model accuracy. In this work, we propose CRISP, a novel pruning framework leveraging a hybrid structured sparsity pattern that combines both fine-grained N:M structured sparsity and coarse-grained block sparsity. Our pruning strategy is guided by a gradient-based class-aware saliency score, allowing us to retain weights crucial for user-specific classes. CRISP achieves high accuracy with minimal memory consumption for popular models like ResNet-50, VGG-16, and MobileNetV2 on ImageNet and CIFAR-100 datasets. Moreover, CRISP delivers up to 14$\times$ reduction in latency and energy consumption compared to existing pruning methods while maintaining comparable accuracy. Our code is available at https://github.com/shivmgg/CRISP/.
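
A sketch of the hybrid sparsity pattern under assumptions: coarse blocks of weight columns are kept or dropped by an aggregate saliency score, and fine-grained N:M (here 2:4) sparsity is applied inside what remains. The saliency |w·grad| stands in for the paper's gradient-guided class-aware score, and the block size and keep ratio are illustrative.

```python
import torch

def hybrid_sparsity_mask(weight, saliency, n=2, m=4, block=8, block_keep=0.5):
    """Sketch of a hybrid mask: coarse blocks of columns are kept or dropped by their
    aggregate saliency, and surviving blocks receive fine-grained N:M sparsity
    (keep the n most salient weights in every group of m)."""
    out_dim, in_dim = weight.shape
    assert in_dim % m == 0 and in_dim % block == 0
    # Coarse block sparsity over column blocks.
    block_scores = saliency.reshape(out_dim, in_dim // block, block).sum(dim=(0, 2))
    k = max(1, int(block_keep * block_scores.numel()))
    keep_blocks = torch.zeros_like(block_scores, dtype=torch.bool)
    keep_blocks[block_scores.topk(k).indices] = True
    block_mask = keep_blocks.repeat_interleave(block).expand(out_dim, in_dim)
    # Fine-grained N:M sparsity within each group of m weights.
    groups = saliency.reshape(out_dim, in_dim // m, m)
    topn = groups.topk(n, dim=2).indices
    nm_mask = torch.zeros_like(groups).scatter_(2, topn, 1.0).bool()
    nm_mask = nm_mask.reshape(out_dim, in_dim)
    return (block_mask & nm_mask).float()

w = torch.randn(16, 64)
g = torch.randn(16, 64)                         # stand-in gradients over user classes
mask = hybrid_sparsity_mask(w, (w * g).abs())
print(mask.mean())   # roughly block_keep * n / m of the weights survive
```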

Segmentation-Based Parametric Painting

  • paper_url: http://arxiv.org/abs/2311.14271
  • repo_url: https://github.com/manuelladron/semantic_based_painting
  • paper_authors: Manuel Ladron de Guevara, Matthew Fisher, Aaron Hertzmann
  • for: 这个论文的目的是提出一种基于 semantic segmentation 的图像转 painting 方法,以生成大规模、高精度的画作,具有人类like的艺术品质和样式变化。
  • methods: 这个方法使用了 segmentation-based painting 过程和基于人工绘画策略的动态注意力地图方法,以优化画梭进程,使得批处理大图像时能够 capture 大规模结构和细节,同时允许采用不同的绘画风格进行控制。
  • results: 该方法可以生成高质量的画作,并且可以处理大Canvas,比前一代方法更高效和灵活。经过严格的评估,我们的方法被证明可以生成更加美观和功能上优于前一代方法的画作。代码可以在 GitHub 上找到:https://github.com/manuelladron/semantic_based_painting.git
    Abstract We introduce a novel image-to-painting method that facilitates the creation of large-scale, high-fidelity paintings with human-like quality and stylistic variation. To process large images and gain control over the painting process, we introduce a segmentation-based painting process and a dynamic attention map approach inspired by human painting strategies, allowing optimization of brush strokes to proceed in batches over different image regions, thereby capturing both large-scale structure and fine details, while also allowing stylistic control over detail. Our optimized batch processing and patch-based loss framework enable efficient handling of large canvases, ensuring our painted outputs are both aesthetically compelling and functionally superior as compared to previous methods, as confirmed by rigorous evaluations. Code available at: https://github.com/manuelladron/semantic\_based\_painting.git
    摘要 我们介绍了一种新的图像到画作方法,该方法可以生成大规模、高精度的画作,具有人类艺术品质和样式变化。为处理大图像并获得笔划控制,我们引入了分割基于笔划过程和人工绘画策略启发的动态注意力地图方法,使得笔划过程可以在不同的图像区域进行批处理,同时捕捉大规模结构和细节,并允许样式控制细节。我们优化的批处理和 patch-based 损失框架,使得我们可以高效地处理大Canvas,确保我们的涂护输出具有艺术魅力和功能优势,与之前的方法相比,经rigorous评估得出。代码可以在:https://github.com/manuelladron/semantic\_based\_painting.git 中找到。

Bursting Spikes: Efficient and High-performance SNNs for Event-based Vision

  • paper_url: http://arxiv.org/abs/2311.14265
  • repo_url: https://github.com/bic-l/burst-ann2snn
  • paper_authors: Ziqing Wang, Yuetong Fang, Jiahang Cao, Renjing Xu
  • for: 提高事件驱动视觉的效率和高速响应,使用脉冲神经网络(SNN)。
  • methods: 引入启动机制,允许每步多发脉冲,降低转换错误并实现低延迟SNN。使用 pareto 前ier-driven 算法重新分配启动模式。同时,提出一种敏感度驱动的脉冲压缩技术,自动选择层Specific的最佳阈值比率。
  • results: 比较 experiments 表明,我们的方法在分类和物体检测方面具有优秀的性能和降低了能耗。代码将在 https://github.com/bic-L/burst-ann2snn 上提供。
    Abstract Advancing event-driven vision through spiking neural networks (SNNs) is crucial to empowering high-speed and efficient perception. While directly converting the pre-trained artificial neural networks (ANNs) - by replacing the non-linear activation with spiking neurons - can provide SNNs with good performance, the resultant SNNs typically demand long timesteps and high energy consumption to achieve their optimal performance. To address this challenge, we introduce the burst-spike mechanism inspired by the biological nervous system, allowing multiple spikes per timestep to reduce conversion errors and produce low-latency SNNs. To further bolster this enhancement, we leverage the Pareto Frontier-driven algorithm to reallocate burst-firing patterns. Moreover, to reduce energy consumption during the conversion process, we propose a sensitivity-driven spike compression technique, which automatically locates the optimal threshold ratio according to layer-specific sensitivity. Extensive experiments demonstrate our approach outperforms state-of-the-art SNN methods, showcasing superior performance and reduced energy usage across classification and object detection. Our code will be available at https://github.com/bic-L/burst-ann2snn.
    摘要 通过升级事件驱动视觉的射频神经网络(SNN)来提高高速和高效的感知是非常重要的。直接将预训练的人工神经网络(ANN)转换为SNN可以提供良好的性能,但resultant SNN通常需要长时步和高能耗来实现最佳性能。为解决这个挑战,我们引入了生物神经系统中的冲击频率机制,允许每个时步多发多个射频,从而减少转换错误和生成低延迟SNN。此外,我们利用Pareto Frontier驱动算法来重新分配冲击射频模式。此外,为了在转换过程中减少能耗,我们提议一种敏感度驱动的压缩射频技术,自动根据层pecific敏感度确定最佳阈值比率。广泛的实验证明我们的方法在分类和物体检测方面超过了状态的最佳SNN方法,展示了更高的性能和更低的能耗。我们的代码将在https://github.com/bic-L/burst-ann2snn中提供。
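
A sketch of the burst mechanism on a plain integrate-and-fire neuron: each timestep may emit up to `max_burst` spikes with a soft (subtractive) reset, so a single step can carry a larger activation and the ANN-to-SNN conversion error shrinks. The threshold and burst limit are illustrative; the paper additionally reallocates burst-firing patterns and compresses thresholds, which is not shown here.

```python
import torch

def burst_if_neuron(inputs, threshold=1.0, max_burst=4):
    """Sketch of an integrate-and-fire neuron with burst spikes: at each timestep the
    membrane potential may emit up to `max_burst` spikes (soft reset by subtraction).
    `inputs` has shape (T, ...) over timesteps."""
    v = torch.zeros_like(inputs[0])
    spikes = []
    for x_t in inputs:
        v = v + x_t
        n_spikes = torch.clamp(torch.floor(v / threshold), 0, max_burst)
        v = v - n_spikes * threshold          # soft reset keeps the residual charge
        spikes.append(n_spikes)
    return torch.stack(spikes)

x = torch.rand(8, 5) * 3.0                     # 8 timesteps, 5 neurons, strong drive
out = burst_if_neuron(x)
print(out.max(), out.mean())                   # bursts of up to 4 spikes per step
```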

ZeroPS: High-quality Cross-modal Knowledge Transfer for Zero-Shot 3D Part Segmentation

  • paper_url: http://arxiv.org/abs/2311.14262
  • repo_url: None
  • paper_authors: Yuheng Xue, Nenglun Chen, Jun Liu, Wenyun Sun
  • for: 这个研究的目的是设计一个 zero-shot 3D 部件分 segmentation 管道,即 ZeroPS,以高品质地将 2D 预备模型的知识转移到 3D 点云。
  • methods: 我们的方法包括两个 ком成分:1) 自我扩展 Component 将 2D 组 FROM 单一视点扩展到空间全球级 3D 组; 2) 多modal 标签 Component 引入了二维检查机制来投票每个 2D 预测 bounding box 到最佳对应 3D 部件,并使用 Class Non-highest Vote Penalty 函数来精致化投票矩阵。
  • results: 我们的方法在 PartnetE 数据集上进行了三个 zero-shot segmentation 任务的广泛评估, achieved state-of-the-art 结果,与现有方法相比有显著提升 (+19.6%, +5.2%, +4.9%, 分别)。我们的提案不需要任何训练、 fine-tuning 或学习可变参数,而且几乎不受 domain shift 的影响。代码将会发布。
    Abstract Recently, many 2D pretrained foundational models have demonstrated impressive zero-shot prediction capabilities. In this work, we design a novel pipeline for zero-shot 3D part segmentation, called ZeroPS. It high-quality transfers knowledge from 2D pretrained foundational models to 3D point clouds. The main idea of our approach is to explore the natural relationship between multi-view correspondences and the prompt mechanism of foundational models and build bridges on it. Our pipeline consists of two components: 1) a self-extension component that extends 2D groups from a single viewpoint to spatial global-level 3D groups; 2) a multi-modal labeling component that introduces a two-dimensional checking mechanism to vote each 2D predicted bounding box to the best matching 3D part, and a Class Non-highest Vote Penalty function to refine the Vote Matrix. Additionally, a merging algorithm is included to merge part-level 3D groups. Extensive evaluation of three zero-shot segmentation tasks on PartnetE datasets, achieving state-of-the-art results with significant improvements (+19.6%, +5.2% and +4.9%, respectively) over existing methods. Our proposed approach does not need any training, fine-tuning or learnable parameters. It is hardly affected by domain shift. The code will be released.
    摘要 近些时候,许多2D预训模型已经展现出了吸引人的零批预测能力。在这项工作中,我们设计了一个新的零批3D部分 segmentation管道,称为ZeroPS。它高质量地传输了2D预训基本模型中的知识到3D点云。我们的方法的核心思想是利用多视图匹配和基础模型的提示机制之间的自然关系,建立桥梁。我们的管道包括两个组成部分:1)一个自适应组件,将单个视图中的2D组从单个视图扩展到全球水平的3D组;2)一个多模态标签组件,引入二维检查机制,将每个2D预测 bounding box 投票到最佳匹配的3D部分,并使用类非最高投票函数进行细化。此外,我们还包括一个合并算法,将部分级3D组合成。我们对三个零批 segmentation 任务进行了广泛的评估,在 PartnetE 数据集上 achievement 了现有方法的状态对应成绩 (+19.6%, +5.2% 和 +4.9%, 分别),而且我们的提posed方法不需要任何训练、微调或学习参数。它受到域shift的影响很小。我们将代码发布。

RSB-Pose: Robust Short-Baseline Binocular 3D Human Pose Estimation with Occlusion Handling

  • paper_url: http://arxiv.org/abs/2311.14242
  • repo_url: None
  • paper_authors: Xiaoyue Wan, Zhuo Chen, Yiming Bao, Xu Zhao
  • for: 短基线双目3D人姿估算,寻求更加PORTABLE的设备,同时维护 Géometric Measurement Property 以避免深度抽象。
  • methods: 提出了 Stereo Co-Keypoints Estimation 模块,通过对二视图2D点的匹配使用不同的Disparity,提高了二视图2D点的视角一致性,并通过 Stereo Volume Feature 来包含不同Disparity的二视图特征。此外,还提出了 Pre-trained Pose Transformer 模块,通过捕捉人姿协调关系,对 occlusion 进行处理。
  • results: 通过在 H36M 和 MHAD 数据集上进行了广泛的实验,以及对图像进行了视觉化,证明了我们的方法在短基线双目3D人姿估算和 occlusion 处理方面的效果。
    Abstract In the domain of 3D Human Pose Estimation, which finds widespread daily applications, the requirement for convenient acquisition equipment continues to grow. To satisfy this demand, we set our sights on a short-baseline binocular setting that offers both portability and a geometric measurement property that radically mitigates depth ambiguity. However, as the binocular baseline shortens, two serious challenges emerge: first, the robustness of 3D reconstruction against 2D errors deteriorates; and second, occlusion reoccurs due to the limited visual differences between two views. To address the first challenge, we propose the Stereo Co-Keypoints Estimation module to improve the view consistency of 2D keypoints and enhance the 3D robustness. In this module, the disparity is utilized to represent the correspondence of binocular 2D points and the Stereo Volume Feature is introduced to contain binocular features across different disparities. Through the regression of SVF, two-view 2D keypoints are simultaneously estimated in a collaborative way which restricts their view consistency. Furthermore, to deal with occlusions, a Pre-trained Pose Transformer module is introduced. Through this module, 3D poses are refined by perceiving pose coherence, a representation of joint correlations. This perception is injected by the Pose Transformer network and learned through a pre-training task that recovers iterative masked joints. Comprehensive experiments carried out on H36M and MHAD datasets, complemented by visualizations, validate the effectiveness of our approach in the short-baseline binocular 3D Human Pose Estimation and occlusion handling.
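    The view-consistency constraint exploited by the Stereo Co-Keypoints Estimation module rests on standard rectified-stereo geometry: left and right keypoints of the same joint share their row and differ by the disparity, which also fixes the depth. The sketch below shows only that geometric relation, not the paper's SVF regression network; the intrinsics, baseline, and joint coordinates are illustrative values.

```python
import numpy as np

def stereo_keypoints_to_3d(left_kps, disparities, fx, fy, cx, cy, baseline):
    """Given left-view 2D joints (u, v) and per-joint disparities d from a
    rectified short-baseline pair, recover the right-view joints and the 3D
    joint positions via standard stereo triangulation:
        right = (u - d, v),  Z = fx * baseline / d,
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy.
    """
    u, v = left_kps[:, 0], left_kps[:, 1]
    d = np.maximum(disparities, 1e-6)          # guard against division by zero
    right_kps = np.stack([u - d, v], axis=1)   # epipolar-consistent right joints
    Z = fx * baseline / d
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return right_kps, np.stack([X, Y, Z], axis=1)

# Toy usage: 3 joints, ~1100 px focal length, 12 cm baseline.
left = np.array([[640.0, 360.0], [700.0, 400.0], [600.0, 500.0]])
disp = np.array([40.0, 35.0, 50.0])
right, joints_3d = stereo_keypoints_to_3d(left, disp, 1100.0, 1100.0, 640.0, 360.0, 0.12)
print(right)
print(joints_3d)  # note how small disparity errors shift Z, motivating robust 2D estimates
```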

SafeSea: Synthetic Data Generation for Adverse & Low Probability Maritime Conditions

  • paper_url: http://arxiv.org/abs/2311.14764
  • repo_url: https://github.com/martin-3240/safesea
  • paper_authors: Martin Tran, Jordan Shipard, Hermawan Mulyono, Arnold Wiliem, Clinton Fookes
  • for: Improving the robustness of object detection models, particularly for detection in adverse maritime conditions.
  • methods: Uses two automated filters: the first classifies the sea condition of a generated image into a Sea State level, and the second checks whether the objects from the input image are still preserved.
  • results: Produces the SafeSea dataset, which offers diverse weather backgrounds to supplement the training of maritime object detection models; the authors also observe that detection accuracy degrades against stormy sea backgrounds.
    Abstract High-quality training data is essential for enhancing the robustness of object detection models. Within the maritime domain, obtaining a diverse real-image dataset is particularly challenging due to the difficulty of capturing sea images that contain maritime objects, especially in stormy conditions. These challenges arise from resource limitations, in addition to the unpredictable appearance of maritime objects. Nevertheless, acquiring data from stormy conditions is essential for training effective maritime detection models, particularly for search and rescue, where real-world conditions can be unpredictable. In this work, we introduce SafeSea, a stepping stone towards transforming actual sea images to various Sea State backgrounds while retaining maritime objects. Compared to existing generative methods such as Stable Diffusion Inpainting, this approach reduces the time and effort required to create synthetic datasets for training maritime object detection models. The proposed method uses two automated filters to pass only generated images that meet the criteria: the filters first classify the sea condition according to its Sea State level and then check whether the objects from the input image are still preserved. This method enabled the creation of the SafeSea dataset, offering diverse weather-condition backgrounds to supplement the training of maritime models. Lastly, we observed that a maritime object detection model faced challenges in detecting objects against stormy sea backgrounds, emphasizing the impact of weather conditions on detection accuracy. The code and dataset are available at https://github.com/martin-3240/SafeSea.
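    The two-filter acceptance test described above can be sketched as a pair of gates. In the snippet below, `sea_state_classifier` and `detector` are hypothetical stand-ins for the automated filters, and the target Sea State band, recovery ratio, and IoU threshold are assumed values rather than those used to build the released dataset.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def passes_filters(generated_image, original_boxes, sea_state_classifier, detector,
                   target_sea_states=(4, 5, 6), min_recovered=0.8, iou_thresh=0.5):
    """Sketch of the acceptance test: (1) the generated image must fall in a
    target Sea State band, and (2) most original maritime objects must still
    be detectable in it."""
    # Filter 1: Sea State check on the transformed image.
    if sea_state_classifier(generated_image) not in target_sea_states:
        return False
    # Filter 2: object-preservation check against the original annotations.
    detected_boxes = detector(generated_image)
    recovered = sum(
        any(iou(gt, det) >= iou_thresh for det in detected_boxes)
        for gt in original_boxes
    )
    return recovered / max(len(original_boxes), 1) >= min_recovered

# Toy usage with stand-in models (lambdas in place of real classifiers).
keep = passes_filters(
    generated_image=None,
    original_boxes=[(10, 10, 50, 50)],
    sea_state_classifier=lambda img: 5,
    detector=lambda img: [(12, 11, 49, 52)],
)
print(keep)  # True: stormy Sea State and the object is still recoverable
```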

Pseudo-label Correction for Instance-dependent Noise Using Teacher-student Framework

  • paper_url: http://arxiv.org/abs/2311.14237
  • repo_url: https://github.com/eugenekim3107/pseudo-label-correction-for-instance-dependent-noise-using-teacher-student-framework
  • paper_authors: Eugene Kim
  • for: Addressing label noise in deep learning models, whose inability to distinguish clean from noisy labels degrades their classification performance and generalization.
  • methods: Proposes a new teacher-student framework, P-LC (pseudo-label correction), which reconfigures the teacher into a triple encoder to build a pseudo-label correction system; as the student generates pseudo labels for a set of images, the teacher learns to choose between the pseudo labels and the originally assigned labels.
  • results: Experiments on MNIST, Fashion-MNIST, and SVHN show that P-LC outperforms existing state-of-the-art methods across all noise levels, most notably at high noise; a noise-level estimation is also introduced to help assess model performance and decide whether additional data cleaning is needed.
    Abstract The high capacity of deep learning models to learn complex patterns poses a significant challenge when they are confronted with label noise. The inability to differentiate clean and noisy labels ultimately results in poor generalization. We approach this problem by reassigning the label for each image using a new teacher-student based framework termed P-LC (pseudo-label correction). Traditional teacher-student networks are composed of teacher and student classifiers for knowledge distillation. In our novel approach, we reconfigure the teacher network into a triple encoder, leveraging the triplet loss to establish a pseudo-label correction system. As the student generates pseudo labels for a set of given images, the teacher learns to choose between the initially assigned labels and the pseudo labels. Experiments on MNIST, Fashion-MNIST, and SVHN demonstrate P-LC's superior performance over existing state-of-the-art methods across all noise levels, most notably in high noise. In addition, we introduce a noise-level estimation to help assess model performance and inform the need for additional data cleaning procedures.
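    A minimal sketch of the triple-encoder teacher follows, assuming the choice between the original and pseudo label is made by embedding distance in a shared space trained with a triplet loss; the layer sizes, the positive/negative assignment, and the distance rule are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TripleEncoderTeacher(nn.Module):
    """Three small encoders embed the image, the originally assigned label,
    and the student's pseudo label into one space; a triplet loss shapes that
    space so the image sits closer to the label deemed correct."""
    def __init__(self, img_dim=784, num_classes=10, emb_dim=64):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(),
                                     nn.Linear(256, emb_dim))
        self.orig_label_enc = nn.Embedding(num_classes, emb_dim)
        self.pseudo_label_enc = nn.Embedding(num_classes, emb_dim)
        self.triplet = nn.TripletMarginLoss(margin=1.0)

    def forward(self, x, orig_label, pseudo_label):
        return (self.img_enc(x),
                self.orig_label_enc(orig_label),
                self.pseudo_label_enc(pseudo_label))

    def choose_labels(self, x, orig_label, pseudo_label):
        """Keep whichever label's embedding lies closer to the image embedding."""
        img, orig_emb, pseudo_emb = self.forward(x, orig_label, pseudo_label)
        d_orig = (img - orig_emb).norm(dim=1)
        d_pseudo = (img - pseudo_emb).norm(dim=1)
        return torch.where(d_pseudo < d_orig, pseudo_label, orig_label)

# Toy usage on random data; in practice the student supplies the pseudo labels.
teacher = TripleEncoderTeacher()
x = torch.randn(8, 784)
orig = torch.randint(0, 10, (8,))
pseudo = torch.randint(0, 10, (8,))
img, orig_emb, pseudo_emb = teacher(x, orig, pseudo)
loss = teacher.triplet(img, pseudo_emb, orig_emb)  # (anchor, positive, negative); assignment is illustrative
corrected = teacher.choose_labels(x, orig, pseudo)
print(loss.item(), corrected)
```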